Chapter 1
Problem Definition
Chapter 2
Introduction
2.1 Overview
In this environment, the decision to keep repositories with dirty data goes far beyond
technical questions such as the overall speed or performance of data management systems. To
better understand the impact of this problem, it is important to list and analyze the major
consequences of allowing dirty data to exist in the repositories.
These include, for example:
1) Performance degradation: as additional useless data demands more processing, more time
is required to answer even simple user queries;
2) Quality loss: the presence of replicas and other inconsistencies leads to distortions in
reports and to misleading conclusions based on the existing data;
3) Increasing operational cost: because of the additional volume of useless data, investments
are required in more storage media and extra computational processing power to keep
response times acceptable.
To avoid these problems, it is necessary to study the causes of dirty data in repositories. A
major cause is the presence of duplicates, quasi-replicas, or near-duplicates in these
repositories, mainly in those constructed by the aggregation or integration of distinct data
sources. The problem of detecting and removing duplicate entries in a repository is generally
known as Record Deduplication.
2.2 Objectives
Record Deduplication is the task of identifying, in a data repository, records that refer to the
same real-world entity or object in spite of misspellings, typos, different writing styles,
or even different schema representations or data types.
Chapter 3
Literature Survey
Removing duplicated data using Genetic Programming is a vast and interesting topic in
the field of Data Mining. Today, this problem arises mainly when data are collected from
many different sources using different information description styles and metadata standards.
Another common place where replicas are found is in data repositories created from OCR documents.
These situations can lead to inconsistencies that may affect many systems, such as those that
depend on searching and mining tasks.
To remove these problems, it is necessary to design a de-duplication function that combines
the information available in the data repositories in order to identify whether a pair of record
entries refers to the same real-world entity.
In the realm of bibliographic citations, for instance, this problem was extensively discussed
by Lawrence et al. They propose a number of algorithms for matching citations from
different sources based on edit distance, word matching, phrase matching, and subfield
extraction.
As more strategies for extracting disparate pieces of evidence become available, many works
have proposed new, distinct approaches to combine and use them. Elmagarmid et al. classify
these approaches into the following two categories:
1) Ad hoc or domain knowledge approaches: this category includes approaches that usually depend
on specific domain knowledge or on specific string distance metrics. Techniques that make use
of declarative languages can also be classified in this category.
2) Training-based approaches: this category includes all approaches that depend on some sort of
training, either supervised or semi-supervised, in order to identify the replicas. Probabilistic
and machine learning approaches fall into this category.
Next, we briefly comment on some works based on these two approaches (domain knowledge
and training-based), particularly those that exploit the domain knowledge and those that are
based on probabilistic and machine learning techniques, which are the ones more related to
our work.
Active Atlas is a system whose main goal is to learn rules for mapping records from two
distinct files in order to establish relationships among them. During the learning phase, the
mapping rules and the transformation weights are defined. The process of combining the
transformation weights is executed using decision trees. This system differs from the others
in that it tries to reduce the amount of training needed, relying on user-provided
information about the most relevant cases for training.
Before Marlin, this system was the state-of-the-art solution for the problem.
An approach distinct from the previous ones is presented in. The main idea is to generate
individual rankings for each field based on generated similarity scores.
The distance between these rankings is calculated by using the well-known Spearman's
Footrule metric, which is minimized by a modified version of the Hungarian Algorithm
specifically tailored to this problem by the authors. Then, a merge algorithm based on a scoring
scheme is applied to the resulting rankings.
scheme is applied to the resulting rankings. At the end of this process, the top records in this
global ranking are considered to be the most similar to the input record. Notice that this
approach requires no training. Unfortunately, the experiments conducted do not evaluate the
quality of the global ranking with respect to the record matching effectiveness.
In this project, we propose a GP-based approach to improve the results produced by the Fellegi
and Sunter method. In particular, we use GP to balance the weight vectors produced by that
statistical method, in order to generate a better evidence combination than the simple
summation it uses. In comparison with our previous results, this work presents a more
general and improved GP-based approach to de-duplication, which is able to automatically
generate effective de-duplication functions even when a suitable similarity function for each
record attribute is not provided in advance. In addition, it also adapts the suggested functions
to changes in the replica identification boundary values used to classify a pair of records as
replicas or not. These two characteristics are extremely important, since they free the user
from the burden of having to select the similarity function to use with each attribute required
for the de-duplication task and of tuning the replica identification boundary accordingly.
Detection of duplicated records is the process of identifying different or multiple records that
refer to one unique real-world entity or object. Typically, the process of duplicate detection is
preceded by a data preparation stage, during which data entries are stored in a uniform
manner in the database, resolving (at least partially) the structural heterogeneity problem. The
data preparation stage includes a parsing step, a data transformation step, and a standardization step.
The approaches that deal with data preparation are also described under the term
ETL (Extraction, Transformation, Loading). These steps improve the quality of the in-flow
data and make the data comparable and more usable. While data preparation is not the focus of this
survey, for completeness we briefly describe the tasks performed in that stage. A
comprehensive collection of papers related to various data transformation approaches can be
found in.
Parsing is the first critical component in the data preparation stage. Parsing locates,
identifies, and isolates individual data elements in the source files. It makes it easier to
correct, standardize, and match data, because it allows the comparison of individual
components rather than of long, complex strings of data. For example, the appropriate parsing
of name and address components into consistent packets of information is a crucial part of the
data cleaning process. Multiple parsing methods have been proposed recently in the literature,
and the area continues to be an active area of research.
Data transformation refers to simple conversions that can be applied to the data in order for them
to conform to the data types of their corresponding domains. In other words, this type of conversion
focuses on manipulating one field at a time, without taking into account the values in related
fields. The most common form of a simple transformation is the conversion of a data element from
one data type to another. Such a data type conversion is usually required when a legacy or parent
application stored data in a data type that made sense within the context of the original
application, but not in a newly developed or subsequent system. Renaming a field from
one name to another is considered data transformation as well. Encoded values in operational
systems and in external data are another problem that is addressed at this stage. These values
should be converted to their decoded equivalents, so that records from different sources can be
compared in a uniform manner. Range checking is yet another kind of data transformation,
which involves examining the data in a field to ensure that they fall within the expected range,
usually a numeric or date range. Lastly, dependency checking is slightly more involved, since
it requires comparing the value in a particular field to the values in another field, to ensure a
minimal level of consistency in the data.
Data standardization refers to the process of standardizing the information represented in
certain fields to a specific content format. This is used for information that can be stored in many
different ways in various data sources and must be converted to a uniform representation
before the duplicate detection process starts. Without standardization, many duplicate entries
could erroneously be designated as non-duplicates, based on the fact that the common identifying
information cannot be compared. Typically, when operational applications are designed and
constructed, there is very little uniform handling of date and time formats across applications.
Data standardization is a rather inexpensive step that can lead to fast identification of
duplicates. For example, if the only difference between two records is the differently
recorded address (44 West Fourth Street vs. 44 W4th St.), then the data standardization step
would make the two records identical, alleviating the need for the more expensive approximate
matching approaches that we describe in later sections.
Data Complexity:
Because online databases increase day by day, a huge amount of data exists on the WWW.
Given this data complexity, previous approaches (such as vision-based approaches, page-level
extraction, TSIMMIS, and WebOQL) do not work effectively, because they are inefficient and
time-consuming.
Scripting Dependency:
Most previous work has not considered scripts such as JavaScript, VBScript, and CSS, so
extraction may fail when a scripting-related page is extracted.
Of course, due to the above problems, the main disadvantages are operational and computational
costs, the display of useless data, and, hence, long processing times.
Genetic Operations
Usually, GP evolves a population of length-free data structures, also called individuals, each
one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions, as illustrated in Fig. 3.1.
Chapter 4
Software Requirements Specification
4.1 Introduction
A Software Requirements Specification (SRS) is a requirements specification for a system.
It is a complete description of the behavior of a system to be developed. It includes a set of
use cases that describe all the interactions the users will have with the software. In addition to
use cases, the SRS also contains non-functional requirements. Non-functional requirements
are requirements which impose constraints on the design or implementation (such as
performance engineering requirements, quality standards, or design constraints).
System requirements specification: a structured collection of information that embodies
the requirements of a system. A business analyst, sometimes titled system analyst, is
responsible for analyzing the business needs of clients and stakeholders to help identify
business problems and propose solutions. Within the systems development life cycle domain,
the business analyst typically performs a liaison function between the business side of an
enterprise and the information technology department or external service providers. Projects
are subject to three sorts of requirements:
Design Constraint
In systems design the design functions and operations are described in detail, including
screen layouts, business rules, process diagrams and other documentation. The output of this
stage will describe the new system as a collection of modules or subsystems.
The design stage takes as its initial input the requirements identified in the approved
requirements document. For each requirement, a set of one or more design elements will be
produced as a result of interviews, workshops, and/or prototype efforts.
Design elements describe the desired software features in detail, and generally include
functional hierarchy diagrams, screen layout diagrams, tables of business rules, business
process diagrams, pseudocode, and a complete entity-relationship diagram with a full data
dictionary. These design elements are intended to describe the software in sufficient detail
that skilled programmers may develop the software with minimal additional input.
Implementation Constraint
Modular and subsystem programming code will be produced during this stage. Unit
testing and module testing are done in this stage by the developers. This stage is intermingled
with the next, in that individual modules will need testing before integration into the main
project.
4.1.5 Assumptions and Dependencies
In our project there is nothing to assume, and the approach has no dependencies, because we
are developing a web-page-programming-language-dependent dynamic data extractor.
Software Requirements:
Operating System: Windows
Client-side Scripting: JavaScript
Programming Language: Java
IDE/Workbench: My Eclipse 6.0
Database: Oracle 10g

Hardware Requirements:
Processor: Pentium IV
Hard Disk: 40GB
RAM: 512MB or more
The non-functional requirements considered are: Reliability, Efficiency, Maintainability, and Size.
Reliability
The system is reliable because of the qualities inherited from the chosen platform, Java. Code
built using Java is more reliable.
Efficiency
The system is efficient, giving the best results when searching complex data.
Maintainability
It can be maintained by semi-skilled persons who have knowledge of Java and Oracle.
Size
The maximum size of this project ranges from 2 GB to 5 GB.
Chapter 5
System Design
During the evolutionary process, the individuals are handled and modified by genetic operations
such as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better
individuals (solutions to the proposed problem) in the subsequent generations.
Reproduction is the operation that copies individuals without modifying them. Usually, this
operator is used to implement an elitist strategy, adopted to keep the genetic code of the
fittest individuals across generations. If a good individual is found in an earlier generation,
it will not be lost during the evolutionary process.
Finally, the mutation operation has the role of keeping a minimum level of diversity among
individuals in the population, thus avoiding premature convergence. Every solution tree
resulting from the crossover operation has an equal chance of undergoing mutation. In
a GP tree representation, a random node is selected and the corresponding subtree is replaced
by a new, randomly created subtree, as illustrated in Fig. 5.5.
The transformation of data from input to output, through processes, may be described logically
and independently of the physical components associated with the system. The DFD is also
known as a data flow graph or a bubble chart.
DFDs model the proposed system. They should clearly show the requirements on
which the new system is to be built. Later, during the design activity, they are taken as the basis
for drawing the system's structure charts. The basic notation used to create a DFD is as
follows:
1. Dataflow: data move in a specific direction from an origin to a destination.
2. Process: people, procedures, or devices that use or produce (transform) data; the
physical component is not identified.
3. Data Store: data are stored or referenced by a process in the system.
4. Rhombus: represents a decision.
Chapter 6
UML Diagrams
Introduction to UML
The Unified Modeling Language (UML) is a standard language for writing software
blueprints. The UML may be used to visualize, specify, construct, and document the artifacts
of a software-intensive system.
The goal of UML is to provide a standard notation that can be used by all object-oriented
methods, and to select and integrate the best elements of those notations. UML itself does not
prescribe or advise on how to use that notation in a software development process or as part of
an object-oriented design methodology. The UML is more than just a bunch of graphical symbols;
rather, behind each symbol in the UML notation there is a well-defined semantics.
The system development focuses on three different models of the system:
Functional model
Object model
Dynamic model
The functional model in UML is represented with use case diagrams, describing the
functionality of the system from the user's point of view.
The object model in UML is represented with class diagrams, describing the structure of the
system in terms of objects, attributes, associations, and operations.
The dynamic model in UML is represented with sequence diagrams, statechart diagrams, and
activity diagrams, describing the internal behavior of the system.
Include Relationships
An include relationship is a relationship between two use cases:
It indicates that the use case to which the arrow points is included in the use case on the other
side of the arrow. This makes it possible to reuse a use case in another use case.
[Use case diagram: actors Admin and User; use cases: Enter the topic name, Comparison of citations]
[Class diagram: classes User Registration (Enter PersonalInfo, Enter LoginID, Enter Password, Registration Successful()), Login (Login Successfully(), Display UniqueCitation(), Display RemoveDuplicateCitation(), Display TheResults()), View SimilarCitation (Display SimilarCitationData()), and Graph (Display GraphData())]
[Activity diagram (User): Registration, Login (fail/success), User Home, Display the number of citations, Remove the duplicate citations, Rearrange the subtrees]
[Sequence and collaboration diagrams (User): Login, Search topic, Enter the topic name, Display the citations, Comparison of citations, Linkage citations are rated, Generate reproduction tree, Rearrange the subtrees, Complete tree]
[Use case and collaboration diagrams (actors User and Admin): Registration; Login; Admin uploads the topic content and the number of citations for each topic; User enters/searches the topic name; the system displays the number of citations from a number of sites, compares the citations, removes the duplicate citations, rates the linkage citations, rearranges the subtrees, and gives the final tree]
[Component diagram (System): Search content, Upload content, Display citations, Complete tree, Remove duplicates]
Chapter 7
Project Schedule & Estimate
The project planning step is the most critical step in the project management life cycle. The
reason is that it is only when we list all of the tasks in our project plan that we truly have an
idea of what it is going to take to deliver our project on time. So, to perform project planning
in a smart and efficient way, we need a well-defined and organized project plan to help us do
it. This chapter explains all the project planning features: it lists our tasks and schedules, and
presents the project management approach, the projected project budget, the project timeline, etc.
Project planning is a discipline for stating how to complete a project within a certain time
frame, usually with defined stages, and with designated resources. Creating a project plan is
the first thing that needs to be done when undertaking any kind of project. At a minimum, a
project plan answers basic questions about the project:
What? What is the work that will be performed on the project? What are the major
products/deliverables?
Who? Who will be involved and what will be their responsibilities within the
project? How will they be organized?
When? What is the project time line and when will particularly meaningful points,
referred to as milestones, be complete?
Often project planning is ignored in favour of getting on with the work. However, many
people fail to realize the value of a project plan in saving time and money and then face the
consequences later.
Module Name | Start Date | End Date | Status
A] (…) | | | Completed
B] Literature Survey | | | Completed
Requirement Analysis / Planning Phase:
A] Deciding Scope | 1st August 2013 | 2nd August 2013 | Completed
B] Selection of Platform | 7th August 2013 | 12th August 2013 | Completed
(…) | | 30th August 2013 | Completed
Designing Phase:
B] User Interface | | | (…)
Modeling Phase:
A] Draw UML Diagrams | | | Completed
(…) | | | Completed
(…) | | | Submitted
[Second schedule table (Sr No., Module Name, Start Date, End Date, Status): entries 1 to 10, almost all marked Completed; Paper Publication marked Completed; the final entry marked Submitted]
Organic projects: are relatively small, simple software projects in which small teams
with good application experience work to a set of less than rigid requirements.
Semi-detached projects: are intermediate (in size and complexity) software projects
in which teams with mixed experience levels must meet a mix of rigid and less than
rigid requirements.
Embedded projects: are software projects that must be developed within a set of
tight hardware, software, and operational constraints.
Software project | a_b  | b_b  | c_b | d_b
Organic          | 2.4  | 1.05 | 2.5 | 0.38
Semi-detached    | 3.0  | 1.12 | 2.5 | 0.35
Embedded         | 4.6  | 1.20 | 2.5 | 0.32
Basic COCOMO is good for quick, early, rough order of magnitude estimates of software
costs, but it does not account for differences in hardware constraints, personnel quality and
experience, use of modern tools and techniques, and other project attributes known to have a
significant influence on software costs, which limits its accuracy.
Our project comes under the organic category.
For organic:
Effort estimation:
Effort = a × (Size)^b × EAF, where
Size = 2.5 KLOC
a = 2.4
b = 1.05
n = 3 (number of persons)
EAF = 1.05 (effort adjustment factor)
Therefore, Effort = 2.4 × (2.5)^1.05 × 1.05 ≈ 6.6 person-months.
Chapter 8
Software Implementation
8.1 Introduction
Implementation is the stage where the theoretical design is turned into a working system. The
most crucial part of this stage is achieving a successful new system and giving the users
confidence that the new system will work efficiently and effectively.
The system can be implemented only after thorough testing is done and it is found to work
according to the specification. Implementation involves careful planning, investigation of the
current system and its constraints, design of methods to achieve the changeover, and an
evaluation of changeover methods, apart from planning. Two major tasks in preparing for
implementation are the education and training of the users and the testing of the system.
The more complex the system being implemented, the more involved the systems analysis
and design effort required just for implementation will be. The implementation phase
comprises several activities. The required hardware and software acquisition is carried out,
and the system may require some software to be developed. For this, programs are written and
tested. The user then changes over to the new, fully tested system and the old system is
discontinued.
Implementation is the process of having systems personnel check out and put new equipment
into use, train users, install the new application, and construct any files of data needed by it.
Depending on the size of the organization that will be involved in using the application and
the risk associated with its use, system developers may choose to test the operation in only
one area of the firm, say in one department or with only one or two persons. Sometimes they
will run the old and new systems together to compare the results. In still other situations,
developers will stop using the old system one day and begin using the new one the next. As
we will see, each implementation strategy has its merits, depending on the business situation
in which it is considered. Regardless of the implementation strategy used, developers strive
to ensure that the system's initial use is trouble-free.
Once installed, applications are often used for many years. However, both the organization
and the users will change, and the environment will be different over the weeks and months.
Therefore, the application will undoubtedly have to be maintained; modifications and
changes will be made to the software, files, or procedures to meet emerging requirements.
2. Genetic Operations:
This module provides the structure-based results. First, the root and terminal nodes are
selected; this gives the zero-level results. Next, we find the next level of children, and this
procedure is applied until the leaf nodes are reached. All internal nodes are identified in this
implementation step, and the tree structure is created using three operations: selection,
crossover, and mutation.
3. Generational Evolutionary Algorithm:
This algorithm initializes all the node results. A rating is calculated for each node, and a
fitness value is calculated from that rating. After the fitness of a node is found, the reproduced
tree, containing the best nodes, is created. The same process is applied repeatedly until the
optimal tree is identified.
4. Record De-duplication:
The de-duplication function changes automatically according to the requirements, and shows
efficient, evidence-based results in a tree data structure. All nodes are displayed with the help
of the similarity functions used in the implementation.
5. Precision and Recall operations:
These two measures are computed over the identified duplicated data:
P = (number of correctly identified duplicated pairs) / (number of identified duplicated pairs)
R = (number of correctly identified duplicated pairs) / (number of true duplicated pairs)
Algorithm
Algorithm: Generational Evolutionary Algorithm
1. First, the initial population is created from the gathered data.
2. Then, every individual is evaluated and a numeric fitness value is assigned to each one.
3. A selection process is performed, in which n individuals are selected into the
next-generation population without modification.
4. Then, the crossover operation is performed, creating the m individuals that will compose
the next generation from the best parents and replacing the existing generation; in this
process, two parent trees are selected according to a matching policy and a random subtree
is selected in each parent.
5. Finally, the mutation operation is performed, through which better individuals are
produced in the population. (A sketch of this loop is given after this list.)
</td>
<td > </td>
<td colspan="1" align="left" valign="top"><img
src="<%=request.getContextPath()+"/images/c5.jpg"%>" align="top" height="200" /
></td>
</tr>
</table>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>
// LoginPage.jsp
</style>
</head>
<body>
<jsp:include page="Header.jsp"></jsp:include>
<fieldset>
<legend>Login Form</legend>
<form action="<%=request.getContextPath()+"/LoginAction"%>" method=post
name="login">
<table border="0" align="center" bgcolor="white" width="80%">
<tr>
<td height="120" align="right">
<td><table border="0" align="center">
<tr align="center"><strong><h3><font color="#4682B4">Login
Form</font></h3></strong>
</tr>
<tr>
<td ><font color="#DA70D6" size=""><b>UserID</b></font></td>
<td ><input type="text" name="username"> </td>
</tr>
<tr>
<td><font color="#DA70D6" size=""><b>Password</b></font></td>
<td>
<input type="password" name="password">
</td>
</tr>
<tr>
<td colspan="2">
<div align="center" class="style11">
<input type="submit" name="Submit" value="Sign In">
</div>
</td>
// Search.jsp
<%@ page language="java" import="java.util.*" pageEncoding="ISO-8859-1"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<script type="text/javascript">
function changePage(){
var select=document.getElementById("select").value;
var sel="";
if(select==sel){
alert("plz select any one ");
}
// compare against string literals, not bare identifiers
else if(select=="java"){
alert("u have selected java");
location.href="./MainPage.jsp";
}else if(select=="c"){
alert("u have selected c");
location.href="./C.jsp";
}else if(select=="unix"){
alert("u have selected Unix");
location.href="./Unix.jsp";
}
}
</script>
</head>
<jsp:include page="Header.jsp"></jsp:include>
<body>
<center> <h3><font color="#008080">Select u r search Concept Here</font></h3>
<form name="cpaper" action="./GetSearchPageAction">
<table>
<tr>
<td> <font size="4" color="#4682B4">Select Here</font></td>
</td>
</tr>
<br/>
<br/><br/>
<tr align="center"><td> <input type="submit" value="search">
</table>
</form>
</center>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>
Chapter 9
Software Testing
9.1 Introduction
Testing Strategies
Testing:
1. The process of executing a system with the intent of finding an error.
2. Testing is defined as the process in which defects are identified, isolated, subjected
to rectification, and ensured that the product is defect-free, in order to produce a quality
product and hence customer satisfaction.
3. Quality is defined as justification of the requirements.
4. A defect is nothing but a deviation from the requirements.
5. A defect is nothing but a bug.
6. Testing demonstrates the presence of bugs.
7. Testing can demonstrate the presence of bugs, but not their absence.
8. Debugging and testing are not the same thing!
9. Testing is a systematic attempt to break a program.
10. Debugging is the art or method of uncovering why the script/program did not execute
properly.
Testing Methodologies:
Black box testing: the testing process in which the tester performs testing on an
application without having any knowledge of its internal structure.
Usually test engineers are involved in black box testing.
White box testing: the testing process in which the tester performs testing on an
application with knowledge of its internal structure.
Usually the developers are involved in white box testing.
Gray box testing: the process in which a combination of black box and white
box techniques is used.
Levels of Testing:
[Figure: units within Module1, Module2, and Module3 are unit-tested individually; integration testing then takes the assembled modules from input (i/p) to output (o/p)]
Test Planning:
1. A test plan is defined as a strategic document which describes the procedure for
performing the various tests on the total application in the most efficient way.
2. This document covers the scope of testing,
3. the objectives of testing,
4. the areas that need to be tested,
5. the areas that should not be tested,
6. scheduling and resource planning, and
7. the areas to be automated and the various testing tools to be used.
Test Development
Types of Testing:
Smoke Testing: the process of initial testing in which the tester looks for the availability of all
the functionality of the application in order to perform detailed testing on it (the main check
is for the available forms).
Sanity Testing: a type of testing that is conducted on an application initially to check for its
proper behavior, that is, to check that all the functionality is available before detailed
testing is conducted.
Regression Testing: one of the best and most important kinds of testing. Regression testing is
the process in which functionality that has already been tested is tested once again whenever
some new change is added, in order to check whether the existing functionality remains the same.
Static Testing: testing that is performed on an application when it is not being executed.
Ex: GUI and document testing.
Dynamic Testing: testing that is performed on an application while it is being executed.
Ex: functional testing.
Alpha Testing: a type of user acceptance testing that is conducted on an application just
before it is released to the customer.
Beta Testing: a type of UAT that is conducted on an application when it is released to the
customer, deployed into the real-time environment, and accessed by the real-time users. In
this type of testing, the developer can get user responses.
Compatibility Testing: the testing process in which the product is tested on environments
with different combinations of databases, in order to check how far the product is compatible
with all these environment/platform combinations.
Adhoc Testing: the process of testing in which, unlike formal testing where a test case
document is used, testing of an application is done without a test case document, to cover
those features which are not covered in the test case document. It is also intended to cover
GUI testing, which may involve cosmetic issues.
Test Data:
Here all the test cases that are used for system testing are specified. The goal is to test the
different functional requirements specified in the Software Requirements Specification (SRS)
document.
Unit Testing:
Each individual module has been tested against the requirement with some test data.
Test Report:
The modules work properly provided the user enters the required information. All data entry
forms have been tested with the specified test cases, and all data entry forms are working properly.
Error Report:
If the user does not enter data in the specified order, the user will be prompted with error
messages. Error handling was done to handle both expected and unexpected errors.
The mechanism for determining whether a software program or system has passed or failed
such a test is known as a test oracle. In some settings, an oracle could be a requirement or use
case, while in others it could be a heuristic. It may take many test cases to determine that a
software program or system is considered sufficiently scrutinized to be released. Test cases
are often referred to as test scripts, particularly when written. Written test cases are usually
collected into test suites. If a requirement has sub-requirements, each sub-requirement must
have at least two test cases. Written test cases should include a description of the
functionality to be tested, and the preparation required to ensure that the test can be
conducted.
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the Software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
Test case format
Test cases usually have the following components:
Initial condition
Expected behavior/outcome
The tester must have a positive perception when verifying whether the requirements are justified.
[Test case tables: check that the system date and time are displayed; saving invalid data should not be allowed, while valid data should be allowed]
Test Case 1:
Login Page Test Case
Table 9.3: Login Page Test Case

Test Case Name | Test Case Description | Test Steps | Expected Result | Actual Result
Login | Validate the login name | Enter a 1-character login name (say "a") and a password, then click the Submit button | An error message, "Login name must be more than 1 character", must be displayed | (…)
Pwd | Validate the password | Enter a login name, leave the password empty (say nothing), then click Submit | An error message, "Password must be (…) characters", must be displayed | (…)
Pwd02 | Validate that the password allows special characters | Enter a password containing special characters, then click Submit | Either an error message or "Password must be (…) characters" must be displayed | (…)
Link | Verify the hyperlinks available at the left side of the login page | Click each hyperlink to check whether it is working or not | The corresponding page must be displayed | (…)
Link | Verify the New Users link | Click the New Users link | The New Users Registration Form must be displayed | (…)
Registration Page Test Case

Test Case Description | Test Steps | Expected Result | Actual Result
Validate the User Name | Leave the User Name empty on the Registration page and click the Submit button | An error message, "User Name must be declared", must be displayed | (…)
Validate the Password | Leave the Password empty on the Registration page and click Submit | An error message, "Password must be declared", must be displayed | (…)
Validate the First Name | Leave the First Name empty on the Registration page and click Submit | An error message, "First Name must be declared", must be displayed | (…)
Validate the Last Name | Leave the Last Name empty on the Registration page and click Submit | An error message, "Last Name must be declared", must be displayed | (…)
Validate the Address | Leave the Address empty on the Registration page and click Submit | An error message, "Address must be declared", must be displayed | (…)
Validate the Phone number | Leave the Phone number empty on the Registration page and click Submit | An error message, "Phone number must be declared", must be displayed | (…)
Validate that the Phone number is numeric | Enter non-numeric characters (say "abc") in the Phone number field and click Submit | An error message, "Phone number must be numeric", must be displayed | (…)
Validate that the Phone number is a valid value | Enter an invalid value (say "1234") in the Phone number field and click Submit | An error message, "Phone number must be a valid value", must be displayed | (…)
Chapter 10
Result Analysis
We present and discuss the results of the experiments performed to evaluate our proposed GP-based
approach to record deduplication. There were three sets of experiments:
1. GP was used to find the best combination function for previously user-selected evidence,
i.e., <attribute, similarity function> pair combinations specified by the user for the
deduplication task. The use of user-selected evidence is a common strategy adopted by all
previous record deduplication approaches. Our objective in this set of experiments was to
compare the evidence combination suggested by our GP-based approach with that of a
state-of-the-art SVM-based solution, the Marlin system, used as our baseline.
2. GP was used to find the best combination function with automatically selected evidence.
Our objective in this set of experiments was to discover the impact on the results when GP has
the freedom to choose the similarity function that should be used with each specific attribute.
3. GP was tested with different replica identification boundaries. Our objective in this set of
experiments was to identify the impact on the resulting deduplication function when different
replica identification boundary values are used with our GP-based approach.
Chapter 11
Technical Specification
11.1 Hardware Requirements
Processor: Pentium IV
Hard Disk: 40GB
RAM: 512MB or more

11.2 Software Requirements
Operating System: Windows
Programming Language: Java (Web Applications)
IDE/Workbench: My Eclipse 6.0
Database: Oracle 10g
11.3 Advantages
1. It removes the maximum number of duplicates from the results.
2. It provides meaningful results without traversing the entire search space.
3. It provides effective results with the help of machine learning techniques.
4. It gives the best evidence-based results.
5. It provides an environment in which the results can be tuned at any time during implementation.
11.4 Disadvantages
1. As this project is mainly based on a search engine, a lot of manpower is required.
2. Poor network connectivity in rural areas may lead to discontinuation of this project.
11.5 Applications
In Dynamic Web
In Digital Libraries
In Large Files
Pattern Recognition