
A Genetic Programming Approach to Record Deduplication
Chapter 1
Problem Definition
1.1 Problem Statement

Extracting structured data from Web pages is a challenging problem because of the intricate
underlying structure of such pages. A large number of techniques have been proposed to
address this problem, but all of them have inherent limitations because they depend on the
Web page programming language, its version, the scripting used, and similar factors.
The World Wide Web hosts more and more online databases, and their number is increasing
day by day, so duplicate data accumulates quickly. When a query is submitted to a database,
information is retrieved and extracted from it. Because the volume of data is growing rapidly
and the previous approaches have not been updated, it becomes very hard to detect duplicate
data and to extract non-duplicated data effectively.
The evaluation is done by assigning to an individual a value that measures how suitable that
individual is to the proposed problem. In our GP experimental environment, individuals are
evaluated on how well they learn to predict good answers to a given problem, using the set of
functions and terminals available. The resulting value is also called raw fitness and the
evaluation functions are called fitness functions. Notice that after the evaluation step, each
solution has a fitness value that measures how good or bad it is to the given problem. Thus,
by using this value, it is possible to select which individuals should be in the next generation.
Strategies for this selection may involve very simple or complex techniques, varying from
just selecting the best n individuals to randomly selecting the individuals proportionally to
their fitness.
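The two selection strategies mentioned above can be sketched as follows. This is a minimal illustration, not the project's implementation: the individual names and raw fitness values are made up, and the function names `select_best_n` and `select_proportional` are hypothetical.

```python
import random

def select_best_n(population, fitness, n):
    """Keep the n individuals with the highest fitness."""
    return sorted(population, key=fitness, reverse=True)[:n]

def select_proportional(population, fitness, n, rng=random):
    """Draw n individuals at random, each with probability proportional
    to its fitness (sampling with replacement). Fitness values are
    assumed to be non-negative and not all zero."""
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=n)

# Hypothetical individuals with made-up raw fitness values.
raw_fitness = {"f1": 0.91, "f2": 0.40, "f3": 0.75, "f4": 0.10}
population = list(raw_fitness)

best_two = select_best_n(population, raw_fitness.get, 2)
sampled = select_proportional(population, raw_fitness.get, 4, random.Random(0))
```

Selecting the best n is deterministic and strongly favors the current leaders, while proportional selection still gives weaker individuals some chance of reaching the next generation.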
1.2 Need of Proposed System

The proposed system is needed:
To remove as many duplicate results as possible.
To provide meaningful results.
To provide effective results.
To reduce cost.

Chapter 2
Introduction
2.1 Overview
In this environment, the decision of keeping repositories with dirty data goes far beyond
technical questions such as the overall speed or performance of data management systems. To
better understand the impact of this problem, it is important to list and analyze the major
consequences of allowing the existence of dirty data in the repositories.
These include, for example:
1) Performance degradation: as additional useless data demand more processing, more time
is required to answer simple user queries;
2) Quality loss: the presence of replicas and other inconsistencies leads to distortions in
reports and misleading conclusions based on the existing data;
3) Increasing operational cost: because of the additional volume of useless data, investments
are required in more storage media and extra computational processing power to keep the
response time levels acceptable.
To avoid these problems, it is necessary to study the causes of dirty data in repositories. A
major cause is the presence of duplicates, quasi replicas, or near-duplicates in these
repositories, mainly those constructed by the aggregation or integration of distinct data
sources. The problem of detecting and removing duplicate entries in a repository is generally
known as Record Deduplication.
2.2 Objectives
Record Deduplication is the task of identifying, in a data repository, records that refer to the
same real-world entity or object in spite of misspelled words, typos, different writing styles,
or even different schema representations or data types.
Chapter 3
Literature Survey
Removing duplicated data, here by means of Genetic Programming, is a vast and interesting
topic in the field of Data Mining. Today, this problem arises mainly when data are collected
from many different sources using different information description styles and metadata
standards. Another common source of replicas is data repositories created from OCR documents.
These situations can lead to inconsistencies that may affect many systems such as those that
depend on searching and mining tasks.
To remove these problems, it is necessary to design a de-duplication function that combines
the information available in the data repositories in order to identify whether a pair of record
entries refers to the same real-world entity.
In the realm of bibliographic citations, for instance, this problem was extensively discussed
by Lawrence et al. They propose a number of algorithms for matching citations from
different sources based on edit distance, word matching, phrase matching, and subfield
extraction.
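Two of the evidence types named above, edit distance and word matching, can be illustrated with a short sketch. This is not the algorithm of Lawrence et al.; it uses the standard Levenshtein distance and a Jaccard word overlap as stand-ins, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def word_match_similarity(a, b):
    """Jaccard overlap between the sets of lowercase words of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Character-level measures tolerate typos inside a word, while word-level measures tolerate reordered or missing words, which is why such pieces of evidence are usually combined.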
As more strategies for extracting disparate pieces of evidence become available, many works
have proposed new distinct approaches to combine and use them. Elmagarmid et al. classify
these approaches into the following two categories:
1) Ad-Hoc or Domain Knowledge Approaches: This category includes approaches that usually
depend on specific domain knowledge or specific string distance metrics. Techniques that
make use of declarative languages can also be classified in this category.
2) Training-based Approaches: This category includes all approaches that depend on some
sort of training, either supervised or semi-supervised, in order to identify the replicas.
Probabilistic and machine learning approaches fall into this category.
Next, we briefly comment on some works based on these two approaches (domain knowledge
and training-based), particularly those that exploit the domain knowledge and those that are
based on probabilistic and machine learning techniques, which are the ones more related to
our work.
Active Atlas is a system whose main goal is to learn rules for mapping records from two
distinct files in order to establish relationships among them. During the learning phase, the
mapping rule and the transformation weights are defined. The process of combining the
transformation weights is executed using decision trees. This system differs from the others
in the sense that it tries to reduce the amount of necessary training, relying on user-provided
information about the most relevant cases for training.
Before Marlin, this system was the state-of-the-art solution for the problem.
An approach distinct from the previous ones is presented in. The main idea is to generate
individual rankings for each field based on generated similarity scores.
The distance between these rankings is calculated by using the well-known Spearman's
Footrule metric, which is minimized by a modified version of the Hungarian Algorithm
specifically tailored to this problem by the authors. Then, a merge algorithm based on a score
scheme is applied to the resulting rankings. At the end of this process, the top records in this
global ranking are considered to be the most similar to the input record. Notice that this
approach requires no training. Unfortunately, the experiments conducted do not evaluate the
quality of the global ranking with respect to the record matching effectiveness.
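For reference, the distance measure named above can be sketched in a few lines. This shows only Spearman's footrule between two rankings of the same items, not the modified Hungarian Algorithm or the merge step of the cited approach. The function name and the record identifiers are hypothetical.

```python
def footrule_distance(ranking_a, ranking_b):
    """Spearman's footrule: the sum, over all items, of the absolute
    difference between an item's positions in the two rankings.
    Both rankings are assumed to contain exactly the same items."""
    position_b = {item: pos for pos, item in enumerate(ranking_b)}
    return sum(abs(pos - position_b[item]) for pos, item in enumerate(ranking_a))

# Hypothetical per-field rankings of three candidate records.
by_title = ["r1", "r2", "r3"]
by_author = ["r3", "r1", "r2"]
distance = footrule_distance(by_title, by_author)
```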
In this project, we propose a GP-based approach to improve results produced by the Fellegi
and Sunter method. In particular, we use GP to balance the weight vectors produced by that
statistical method, in order to generate a better evidence combination than the simple
summation used by it. In comparison with our previous results, this work presents a more
general and improved GP-based approach for de-duplication, which is able to automatically
generate effective de-duplication functions even when a suitable similarity function for each
record attribute is not provided in advance. In addition, it also adapts the suggested functions
to changes on the replica identification boundary values used to classify a pair of records as
replicas or not. These two characteristics are extremely important since they free the user
from the burden of having to select the similarity function to use with each attribute required
for the de-duplication task and tune the replica identification boundary accordingly.
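The difference between simple summation and an evolved combination function can be sketched as follows. Everything here is assumed for illustration: the attribute names, the similarity scores, the boundary values, and `candidate_function`, which only stands for the kind of arithmetic expression GP might produce and is not a result of this project.

```python
def summation(evidence):
    """Baseline combination: add up all pieces of evidence."""
    return sum(evidence.values())

def candidate_function(evidence):
    """Hypothetical evolved function: title and author must agree
    jointly, while the year contributes only half its score."""
    return evidence["title"] * evidence["author"] + 0.5 * evidence["year"]

def is_replica(evidence, combine, boundary):
    """Classify a record pair as a replica when the combined evidence
    reaches the replica identification boundary."""
    return combine(evidence) >= boundary

# Made-up similarity scores for two record pairs.
likely_same = {"title": 0.9, "author": 0.8, "year": 1.0}
different_authors = {"title": 0.9, "author": 0.1, "year": 1.0}
```

With these made-up numbers, summation scores the second pair at 2.0 and would accept it at a boundary of 1.5, while the candidate function scores it at 0.59 and rejects it at a boundary of 1.0. A change of boundary therefore calls for a different function, which is the adaptation the text refers to.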
Detection of Duplicated records is the process of identifying different or multiple records that
refer to one unique real-world entity or object. Typically, the process of duplicate detection is
preceded by a data preparation stage, during which data entries are stored in a uniform
manner in the database, resolving (at least partially) the structural heterogeneity problem. The
data preparation stage includes a parsing, a data transformation, and a standardization step.
The approaches that deal with data preparation are also described under the term
ETL (Extraction, Transformation, Loading). These steps improve the quality of the in-flow
data and make the data comparable and more usable. While data preparation is not the focus
of this survey, for completeness we briefly describe the tasks performed in that stage. A
comprehensive collection of papers related to various data transformation approaches can be
found in.
Parsing is the first critical component in the data preparation stage. Parsing locates,
identifies and isolates individual data elements in the source. Parsing makes it easier to
correct, standardize, and match data because it allows the comparison of individual
components, rather than of long complex strings of data. For example, the appropriate parsing
of name and address components into consistent packets of information is a crucial part in the
data cleaning process. Multiple parsing methods have been proposed recently in the literature
and this continues to be an active area of research.
Data transformation refers to simple
conversions that can be applied to the data in order for them to conform to the data types of
their corresponding domains. In other words, this type of conversion focuses on manipulating
one field at a time, without taking into account the values in the related field. The most
common form of a simple transformation is the conversion of a data element from one data
type to another. Such a data type conversion is usually required when a legacy or parent
application stored data in a data type that makes sense within the context of the original
application, but not in a newly developed or subsequent system. Renaming of a field from
one name to another is considered data transformation as well. Encoded values in operational
systems and in external data are another problem that is addressed at this stage. These values
should be converted to their decoded equivalents, so records from different sources can be
compared in a uniform manner. Range checking is yet another kind of data transformation
which involves examining data in a field to ensure that it falls within the expected range,
usually a numeric or date range. Lastly, dependency checking is slightly more involved since
it requires comparing the value in a particular field to the values in another field, to ensure a
minimal level of consistency in the data.
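The transformation steps described above (type conversion, decoding, range checking, and dependency checking) can be sketched on a single record. The field names, the venue codes, and the accepted year range are assumptions made for this example only.

```python
# Hypothetical encoded values and their decoded equivalents.
VENUE_CODES = {"J": "journal", "C": "conference"}

def transform(record):
    """Return a cleaned copy of the record and a list of detected problems."""
    cleaned = dict(record)
    # Type conversion: legacy sources store numbers as strings.
    cleaned["year"] = int(record["year"])
    cleaned["first_page"] = int(record["first_page"])
    cleaned["last_page"] = int(record["last_page"])
    # Decoding: replace an encoded value by its decoded equivalent.
    cleaned["venue"] = VENUE_CODES.get(record["venue"], record["venue"])
    problems = []
    # Range check: the value must fall within the expected range.
    if not 1900 <= cleaned["year"] <= 2100:
        problems.append("year out of range")
    # Dependency check: one field is compared with another field.
    if cleaned["first_page"] > cleaned["last_page"]:
        problems.append("first_page after last_page")
    return cleaned, problems

cleaned, problems = transform(
    {"year": "2011", "first_page": "10", "last_page": "4", "venue": "J"}
)
```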
Data standardization refers to the process of standardizing the information represented in
certain fields to a specific content format. This is used for information that can be stored in many
different ways in various data sources and must be converted to a uniform representation
before the duplicate detection process starts. Without standardization, many duplicate entries
could erroneously be designated as non-duplicates, based on the fact that common identifying
information cannot be compared. Typically, when operational applications are designed and
constructed, there is very little uniform handling of date and time formats across applications.
Data standardization is a rather inexpensive step that can lead to fast identification of
duplicates. For example, if the only difference between two records is the differently
recorded address (44 West Fourth Street vs. 44 W4th St.), then the data standardization step
would make the two records identical, alleviating the need for more expensive approximate
matching approaches that we describe in the later sections.
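The address example above can be reproduced with a small standardization routine. This is a sketch under stated assumptions: the abbreviation table is a tiny hypothetical sample, and a real system would need a far larger dictionary and locale-specific rules.

```python
import re

# Hypothetical sample of abbreviations and their standard forms.
ABBREVIATIONS = {
    "n": "north", "s": "south", "e": "east", "w": "west",
    "st": "street", "ave": "avenue", "rd": "road",
    "1st": "first", "2nd": "second", "3rd": "third", "4th": "fourth",
}

def standardize_address(address):
    """Lowercase, strip punctuation, and expand known abbreviations."""
    text = address.lower()
    text = re.sub(r"[.,]", " ", text)
    # Split a direction letter fused to an ordinal, e.g. "w4th" -> "w 4th".
    text = re.sub(r"\b([nsew])(\d+(?:st|nd|rd|th))\b", r"\1 \2", text)
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

first = standardize_address("44 West Fourth Street")
second = standardize_address("44 W4th St.")
```

After standardization both spellings yield the same string, so an exact comparison is enough to detect the duplicate.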
3.1 Existing System

The main problems in the existing system are:
Data Complexity:
Because the number of online databases grows day by day, a huge amount of data
exists on the WWW. Faced with this data complexity, previous approaches (such as the
vision-based approach, page-level extraction, TSIMMIS, and WebOQL) do not work
effectively because they are inefficient and time consuming.
Web page programming language and version dependency:
All previous approaches to web page extraction were HTML dependent. Take the
example of a website such as that of Mumbai University. Suppose that two years ago an
administrator maintained the website using HTML version 3.0. A year later, a new
administrator was appointed and maintained it using HTML version 4.0, and now a
different administrator maintains it using another web page language such as XHTML
or XML. Will the data still be extracted effectively? Of course not. This is the problem
of dependency on the web page programming language and its version.
Scripting Dependency:
Most previous work has not considered scripts and style sheets such as JavaScript,
VBScript, and CSS, so extraction may fail on pages that rely on them.
Problem with Record Duplication:
When data is uploaded from different locations, there is a chance of data duplication.
On a digital library website, or on sites such as Google, Yahoo, or Microsoft, there is a
great deal of unwanted data. The same data is repeated many times, so a lot of space is
wasted, and the processing time is very high when a query is submitted to the database.
Because of the above problems, the main disadvantages are operational and computational
cost, the display of useless data, and high processing time.
3.2 Proposed System

Genetic Programming Basic Concepts:
Evolutionary programming is based on ideas inspired by natural selection, the naturally
observed process that influences virtually all living beings. Genetic Programming (GP) is one
of the best-known evolutionary programming techniques.
It can be seen as an adaptive heuristic whose basic ideas come from the properties of the
genetic operations and natural selection system. It is a direct evolution of programs or
algorithms used for the purpose of inductive learning (supervised learning), initially applied
to optimization problems. GP, as well as other evolutionary techniques, is also known for its
capability of working with multi objective problems that are normally modeled as
environment restrictions during the evolutionary process.
GP and other evolutionary approaches are also widely known for their good performance in
searching over very large, possibly infinite search spaces, where the optimal solution in many
cases is not known, usually providing near-optimal answers. As stated by Koza, the fact
that the GP algorithm operates on a population of individuals, rather than on a single point in
the search space of the problem, is an essential aspect of the algorithm. This can be
explained since the population serves as the pool of the probably valuable genetic material,
which is used to create new solutions with probably valuable new combinations of features.
The main aspect that distinguishes GP from other evolutionary techniques is that it represents
the concepts and the interpretation of a problem as a computer program and even the data are
viewed and manipulated in this way. This special characteristic enables GP to model any
other machine learning representation.
Another advantage of GP over other evolutionary techniques is its applicability to symbolic
regression problems, since the representation structures are variable. According to, while GP
does not mimic nature as closely as do genetic algorithms, it does offer the opportunity to
directly evolve programs of unusual complexity, without having to define the structure or size
of the program in advance. This means that GP is able to discover the independent variables
and their relationships with each other and with any dependent variable. Thus, GP can find
the correct functional form that fits the data and discover the appropriate coefficients.
Solving this kind of problem is noticeably more complex than solving linear regression or
polynomial regression problems.
Genetic Operations
Usually, GP evolves a population of length-free data structures, also called individuals, each
one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions, as illustrated in Fig. 3.1
Fig. 3.1: Genetic Trees

When using this tree representation in a GP-based method, a set of terminals and functions
should be defined.
Terminals are inputs, constants, or zero-argument nodes that terminate a branch of a tree.
They are also called tree leaves. The function set is the collection of operators, statements,
and basic or user-defined functions that can be used by the GP evolutionary process to
manipulate the terminal values. These functions are placed in the internal nodes of the tree, as
illustrated in Fig. 3.1. During the evolutionary process, the individuals are handled and
modified by genetic operations such as reproduction, crossover, and mutation, in an iterative
way that is expected to spawn better individuals (solutions to the proposed problem) in the
subsequent generations. Reproduction is the operation that copies individuals without
modifying them. Usually, this operator is used to implement an elitist strategy that is adopted
to keep the genetic code of the fittest individuals across the changes in the generations. If a
good individual is found in earlier generations, it will not be lost during the evolutionary
process.
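The tree representation and the reproduction operation described above can be sketched as follows. The encoding is an assumption made for illustration: an internal node is a tuple `(function, left, right)`, a terminal is either a variable name or a numeric constant, and the function set is limited to three arithmetic operators.

```python
import operator

# Hypothetical function set placed in the internal nodes.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, inputs):
    """Evaluate a tree of arithmetic functions on the given input values."""
    if isinstance(tree, tuple):            # internal node: apply the function
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, inputs), evaluate(right, inputs))
    if isinstance(tree, str):              # terminal: input variable
        return inputs[tree]
    return tree                            # terminal: numeric constant

def reproduce(population, fitness, k):
    """Reproduction used as elitism: carry the k fittest trees over
    unchanged (tuples are immutable, so no copy is needed)."""
    return sorted(population, key=fitness, reverse=True)[:k]

inputs = {"a": 3.0, "b": 4.0}
tree_one = ("+", ("*", "a", "b"), 2.0)     # represents a * b + 2
tree_two = ("-", "a", "b")                 # represents a - b
value = evaluate(tree_one, inputs)
# Toy fitness for the example: the value of the tree on the inputs.
elite = reproduce([tree_two, tree_one], lambda t: evaluate(t, inputs), 1)
```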
The crossover operation allows genetic content (e.g., sub trees) exchange between two
parents, in a process that can generate two or more children. In a GP evolutionary process,
two parent trees are selected according to a matching (or pairing) policy and, then, a random
sub tree is selected in each parent. Child trees are the result from the swap of the selected sub
trees between the parents, as illustrated in Fig. 3.2.
Fig. 3.2: Crossover Operation
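The sub tree swap performed by crossover can be sketched as follows. As an assumption for this example, trees are encoded as nested tuples `(function, left, right)` with variable names as terminals, a node is addressed by the path of child indexes leading to it, and the pairing policy that selects the parents is left out.

```python
import random

def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indexes) to every node of the tree."""
    yield path
    if isinstance(tree, tuple):
        for index, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (index,))

def get_subtree(tree, path):
    """Return the sub tree found by following the path from the root."""
    for index in path:
        tree = tree[index]
    return tree

def replace_subtree(tree, path, new_subtree):
    """Return a copy of the tree with the node at path replaced."""
    if not path:
        return new_subtree
    index = path[0]
    replaced = replace_subtree(tree[index], path[1:], new_subtree)
    return tree[:index] + (replaced,) + tree[index + 1:]

def crossover(parent_a, parent_b, rng=random):
    """Pick one random sub tree in each parent and swap them,
    producing two children."""
    path_a = rng.choice(list(subtree_paths(parent_a)))
    path_b = rng.choice(list(subtree_paths(parent_b)))
    sub_a = get_subtree(parent_a, path_a)
    sub_b = get_subtree(parent_b, path_b)
    return (replace_subtree(parent_a, path_a, sub_b),
            replace_subtree(parent_b, path_b, sub_a))

parent_a = ("+", "a", ("*", "b", "c"))
parent_b = ("-", "d", "e")
children = crossover(parent_a, parent_b, random.Random(1))
```

Because the root has the empty path, a crossover may occasionally exchange whole trees, which leaves both parents unchanged apart from their order.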

Finally, the mutation operation has the role of keeping a minimum diversity level of
individuals in the population, thus avoiding premature convergence. Every solution tree
resulting from the crossover operation has an equal chance of suffering a mutation process. In
a GP tree representation, a random node is selected and the corresponding sub tree is replaced
by a new randomly created sub tree, as illustrated in Fig. 3.3.
Fig.3.3: Mutation Operation

All operations for node replacements and insertions performed by mutation and crossover
are executed using equal (and constant) probabilities. This way all nodes have the same
probability of being chosen in order to guarantee the diversity of the individuals within the
genetic pool.
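The mutation step can be sketched in the same spirit. The function set, the terminal set, the depth limit, and the probability of stopping early when growing a random sub tree are all assumptions made for this example. Trees are nested tuples `(function, left, right)`.

```python
import random

FUNCTIONS = ["+", "-", "*"]          # hypothetical function set
TERMINALS = ["a", "b", "c", 1.0]     # hypothetical terminal set

def random_tree(depth, rng):
    """Grow a random tree no deeper than the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS),
            random_tree(depth - 1, rng),
            random_tree(depth - 1, rng))

def node_paths(tree, path=()):
    """Yield the path (tuple of child indexes) to every node of the tree."""
    yield path
    if isinstance(tree, tuple):
        for index, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, path + (index,))

def replace_subtree(tree, path, new_subtree):
    """Return a copy of the tree with the node at path replaced."""
    if not path:
        return new_subtree
    index = path[0]
    replaced = replace_subtree(tree[index], path[1:], new_subtree)
    return tree[:index] + (replaced,) + tree[index + 1:]

def mutate(tree, rng, max_depth=2):
    """Choose a node uniformly at random, so that every node has the
    same probability, and replace its sub tree by a new random one."""
    path = rng.choice(list(node_paths(tree)))
    return replace_subtree(tree, path, random_tree(max_depth, rng))

original = ("+", "a", ("*", "b", "c"))
mutated = mutate(original, random.Random(3))
```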
Chapter 4
Software Requirements Specification
4.1 Introduction
A Software Requirements Specification (SRS) is a requirements specification for a system.
It is a complete description of the behavior of a system to be developed. It includes a set of
use cases that describe all the interactions the users will have with the software. In addition to
use cases, the SRS also contains non-functional requirements. Non-functional requirements
are requirements which impose constraints on the design or implementation (such as
performance engineering requirements, quality standards, or design constraints).
System requirements specification: A structured collection of information that embodies
the requirements of a system. A business analyst, sometimes titled system analyst, is
responsible for analyzing the business needs of clients and stakeholders to help identify
business problems and propose solutions. Within the systems development life cycle domain,
the analyst typically performs a liaison function between the business side of an enterprise
and the information technology department or external service providers. Projects are subject
to three sorts of requirements:
Business requirements describe in business terms what must be delivered or accomplished
to provide value.
Product requirements describe properties of a system or product (which could be one of
several ways to accomplish a set of business requirements).
Process requirements describe activities performed by the developing organization. For
instance, process requirements could specify specific methodologies that must be followed,
and constraints that the organization must obey.
Product and process requirements are closely linked. Process requirements often specify the
activities that will be performed to satisfy a product requirement. For example, a maximum
development cost requirement (a process requirement) may be imposed to help achieve a
maximum sales price requirement (a product requirement); a requirement that the product be
maintainable (a Product requirement) often is addressed by imposing requirements to follow
particular development styles.
4.1.1 Project Scope

This project is mainly scoped to Web data extraction. The GP-based approach is able to
automatically find effective de-duplication functions, even when the most suitable similarity
function for each record attribute is not known in advance. This is extremely useful for the
non-specialized user, who does not have to worry about selecting these functions for the
de-duplication task. In addition, our approach is able to adapt the suggested de-duplication
function to changes in the replica identification boundaries used to classify a pair of records
as a match or not.
4.1.2 User Classes and Characteristics
There are mainly two types of users of this product: Administrator and User. The
Administrator has all privileges, whereas the User has limited privileges. The Administrator
can upload documents, whereas the User can search the data and obtain de-duplicated results.
4.1.3 Operating Environment
This product will be installed on a Windows-based machine, mainly a server machine.
The product is web-based and will be hosted by a web server. It can be viewed in any web
browser.
4.1.4 Design and Implementation Constraints
Design Constraint
In systems design the design functions and operations are described in detail, including
screen layouts, business rules, process diagrams and other documentation. The output of this
stage will describe the new system as a collection of modules or subsystems.
The design stage takes as its initial input the requirements identified in the approved
requirements document. For each requirement, a set of one or more design elements will be
produced as a result of interviews, workshops, and/or prototype efforts.
Design elements describe the desired software features in detail, and generally include
functional hierarchy diagrams, screen layout diagrams, tables of business rules, business
process diagrams, pseudo code, and a complete entity-relationship diagram with a full data
dictionary. These design elements are intended to describe the software in sufficient detail
that skilled programmers may develop the software with minimal additional input.
Implementation Constraint
Modular and subsystem programming code will be accomplished during this stage. Unit
testing and module testing are done in this stage by the developers. This stage is intermingled
with the next in that individual modules will need testing before integration to the main
project.
4.1.5 Assumptions and Dependencies
This project makes no assumptions and has no dependencies, because we are developing a
dynamic data extractor that is independent of the web page programming language.
4.2 System Features

4.2.1 System Feature 1
These are the functional requirements:
1. Upload data to different databases.
2. Search data in different databases.
3. View duplicate data.
4. Apply the GP-based approach and retrieve the data without duplicates.
5. Calculate the search time.
6. View User Information.
7. Change the password.
4.2.2 System Feature 2

Some other features:
1. Upload a number of citations for each topic
2. Enter the topic name
3. Display the number of citations from a number of sites
4. Compare the citations
5. Remove the duplicate citations
6. Rate the linked citations
7. Generate the reproduction tree using the rating procedure
8. Rearrange the sub trees
9. Produce the final tree as a complete tree
4.3 External Features

4.3.1 User Interfaces
In this project, User Interface will be HTML and CSS.
4.3.2 Software Interfaces
Operating System: Windows
Client-side Scripting: JavaScript
Programming Language: Java
IDE/Workbench: MyEclipse 6.0
Database: Oracle 10g
Processor: Pentium IV
Hard Disk: 40 GB
RAM: 512 MB or more
4.3.3 Hardware Interfaces
4.3.4 Communication Interfaces

The Genetic Programming Approach to Record Deduplication system will communicate over
the intranet using TCP/IP, with FTP used for file transfer services such as file uploads and
downloads.
4.4 Non Functional Requirements

4.4.1 Performance Requirements
This system is being developed in a high-level language using advanced front-end and
back-end technologies, so it is expected to respond to the end user on the client system in
very little time.
4.4.2 Safety Requirements

The system remains safe for data extraction if something goes wrong, such as a power failure.
4.4.3 Security Requirements
1. We are going to develop a secured database. There are different categories of users, namely
Administrators and Restricted users, who will be able to view either all or only some specific
information from the database.
2. The access rights are decided by the category of the user. If the user is an administrator,
he or she is able to modify data, append data, and so on. All other users only have the right
to retrieve information from the database.
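The category-based access rights described above could be modeled with a simple lookup. This is only an illustrative sketch: the project itself targets Java and Oracle, and the category and action names below are hypothetical.

```python
# Hypothetical mapping from user category to permitted actions.
ACCESS_RIGHTS = {
    "administrator": {"retrieve", "modify", "append"},
    "restricted_user": {"retrieve"},
}

def is_allowed(category, action):
    """Return True when users of the given category may perform the action.
    Unknown categories get no rights at all."""
    return action in ACCESS_RIGHTS.get(category, set())
```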
4.4.4 S/w Quality Attributes
Software quality measurement quantifies to what extent a software system rates along
each of these four dimensions:
Reliability
Efficiency
Maintainability
Size
Reliability
The system is reliable because of the qualities it inherits from the chosen platform, Java.
Efficiency
The system is designed to give good results efficiently when searching complex data.
Maintainability
It can be maintained by a semi-skilled person who has knowledge of Java and Oracle.
Size
The maximum size of this project ranges from 2 GB to 5 GB.
4.5 Risk Assessment

These are the five steps for Risk Assessment:
1. Identify the hazards
2. Decide who might be harmed and how
3. Evaluate the risks and decide on precautions
4. Record your findings and implement them
5. Review your assessment and update if necessary
Chapter 5
System Design
5.1 System Architecture
Fig. 5.1.A: System Architecture
Fig. 5.1.B: System Architecture

Usually, GP evolves a population of length-free data structures, also called individuals, each
one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions, as illustrated in Fig. 5.2. When using this tree representation in a
GP-based method, a set of terminals and functions should be defined. Terminals are inputs,
constants, or zero-argument nodes that terminate a branch of a tree. They are also called tree
leaves. The function set is the collection of operators, statements, and basic or user-defined
functions that can be used by the GP evolutionary process to manipulate the terminal values.
These functions are placed in the internal nodes of the tree, as illustrated in Fig. 5.2. During
the evolutionary process, the individuals are handled and modified by genetic operations such
as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better
individuals (solutions to the proposed problem) in the subsequent generations.
Fig. 5.2: Genetic Trees
Reproduction is the operation that copies individuals without modifying them. Usually, this
operator is used to implement an elitist strategy that is adopted to keep the genetic code of the
fittest individuals across the changes in the generations. If a good individual is found in
earlier generations, it will not be lost during the evolutionary process.
Fig. 5.3: Reproduction

The crossover operation allows genetic content (e.g. sub trees) exchange between two
parents, in a process that can generate two or more children. In a GP evolutionary process,
two parent trees are selected according to a matching (or pairing) policy and, then, a random
sub tree is selected in each parent. Child trees are the result from the swap of the selected sub
trees between the parents, as illustrated in Fig. 5.4.
Fig. 5.4: Crossover
Finally, the mutation operation has the role of keeping a minimum diversity level of
individuals in the population, thus avoiding premature convergence. Every solution tree
resulting from the crossover operation has an equal chance of suffering a mutation process. In
a GP tree representation, a random node is selected and the corresponding sub tree is replaced
by a new randomly created sub tree, as illustrated in Fig. 5.5.
5.2 Analysis Model

Here, we are going to describe two types of Analysis Model:
1. Data Flow Diagram
2. E-R Diagram
5.2.1 Data Flow Diagram

A data flow diagram (DFD) is a graphical tool used to describe and analyze the movement of
data through a system, manual or automated, including the processes, stores of data, and
delays in the system. Data Flow
Diagrams are the central tool and the basis from which other components are developed. The
transformation of data from input to output, through processes, may be described logically
and independently of the physical components associated with the system. The DFD is also
known as a data flow graph or a bubble chart.
DFDs are the model of the proposed system. They should clearly show the requirements on
which the new system should be built. Later, during design activity, this is taken as the basis
for drawing the system's structure charts. The basic notation used to create a DFD is as
follows:
1. Dataflow: Data move in a specific direction from an origin to a destination.
2. Process: People, procedures, or devices that use or produce (transform) data. The
physical component is not identified.
3. Source: External sources or destinations of data, which may be people, programs,
organizations, or other entities.
4. Data Store: Here data are stored or referenced by a process in the system.
5. Rhombus: Decision.
CONTEXT LEVEL DIAGRAM

Context Level0 DFD
Fig. 5.5: Context Level0 Diagram

Context level1 Diagram:
Login DFD

Context level2 Diagram:
5.2.2 E-R Diagram

In software engineering, an entity-relationship model (ERM) is an abstract and conceptual
representation of data. Entity-relationship modeling is a database modeling method, used to
produce a type of conceptual schema or semantic data model of a system, often a relational
database, and its requirements in a top-down fashion. Diagrams created by this process are
called entity-relationship diagrams, ER diagrams, or ERDs. The definitive reference for
entity-relationship modeling is Peter Chen's 1976 paper. However, variants of the idea
existed previously, and have been devised subsequently. An entity may be defined as a thing
which is recognized as being capable of an independent existence and which can be uniquely
identified. An entity is an abstraction from the complexities of some domain. An entity may
be a physical object such as a house or a car, an event such as a house sale or a car service, or
a concept such as a customer transaction or order. Although the term entity is the one most
commonly used, following Chen we should really distinguish between an entity and an
entity-type. An entity-type is a category. An entity, strictly speaking, is an instance of a given
entity-type. Entities can be thought of as nouns. Examples: a computer, an employee, a song,
a mathematical theorem. A relationship captures how two or more entities are related to one
another. Relationships can be thought of as verbs, linking two or more nouns. Examples: a
relationship between a company and a computer, a relationship between an employee and a
department, a relationship between an artist and a song, a proved relationship between a
mathematician and a theorem. The model's linguistic aspect described above is utilized in the
declarative database query language ERROL, which mimics natural language constructs.
Entities and relationships can both have attributes. Examples: an employee entity might have
a Social Security Number (SSN) attribute; the proved relationship may have a date attribute.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying
attributes, which is called the entity's primary key. Entity-relationship diagrams don't show
single entities or single instances of relations. Rather, they show entity sets and relationship
sets. Example: a particular song is an entity. The collection of all songs in a database is an
entity set. The eaten relationship between a child and her lunch is a single relationship. The
set of all such child-lunch relationships in a database is a relationship set. In other words, a
relationship set corresponds to a relation in mathematics, while a relationship corresponds to
a member of the relation. Certain cardinality constraints on relationship sets may be indicated
as well.
Fig. 5.8: E-R Diagram
Chapter 6
UML Diagrams
Introduction to UML
The Unified Modeling Language (UML) is a standard language for writing software
blueprints. The UML may be used to visualize, specify, construct and document the artifacts
of a software-intensive system.
The goal of UML is to provide a standard notation that can be used by all object-oriented
methods and to select and integrate the best elements. UML itself does not prescribe or
advise on how to use that notation in a software development process or as part of an
object-design methodology. The UML is more than just a bunch of graphical symbols. Rather,
behind each symbol in the UML notation are well-defined semantics.
System development focuses on three different models of the system:
Functional model
Object model
Dynamic model
The functional model in UML is represented with use case diagrams, describing the
functionality of the system from the user's point of view.
The object model in UML is represented with class diagrams, describing the structure of the
system in terms of objects, attributes, associations and operations.
The dynamic model in UML is represented with sequence diagrams, state chart diagrams and
activity diagrams, describing the internal behavior of the system.
6.1 Use Case Diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases.
Relationships:
Association
An association is a connection between an actor and a use case. An association indicates that
an actor can carry out a use case. Several actors at one use case mean that each actor can
carry out the use case on his or her own, not that the actors carry out the use case together.
Include Relationships
An include relationship is a relationship between two use cases. It indicates that the use case
to which the arrow points is included in the use case on the other side of the arrow. This
makes it possible to reuse a use case in another use case.
[Use case diagram. Actors: Admin, User. Use cases: upload the number of citations for each
topic; enter the topic name; display the number of citations from a number of sites;
comparison of citations; remove the duplicate citations; linkage citations are rated; using the
rating procedure, generate the reproduction tree; rearrange the subtrees; give the final tree
as a complete tree.]
Fig. 6.1: Use Case Diagram
6.2 Class Diagram

Class diagrams are widely used to describe the types of objects in a system and their
relationships. Class diagrams model class structure and contents using design elements such
as classes, packages and objects. Class diagrams can be drawn from three different
perspectives when designing a system: conceptual, specification, and implementation. These
perspectives become evident as the diagram is created and help solidify the design.
[Class diagram. Classes with their attributes and operations, as labelled in the figure:
User Registration: +Enter Personalinfo; +Registration Successfully()
Login: +Enter LoginID, +Enter Password; +Login Successfully()
Enter the Keyword: +Extract the Results based on Single Attribute; +Display the Maximization results()
View Unique Results: +Display Unique Citation Info; +Display UniqueCitation()
Remove Duplicate Citation: +Display Citation Info; +Display RemoveDuplicateCitation()
View Citation Results: +Extract the Results; +Display theresults()
View SimilarCitation: +Display SimilarCitation Info; +Display SimilarCitationData()
Graph: +Display Unique Citation Info with Graph; +Display GraphData()]
Fig. 6.2: Class Diagram
6.3 Activity Diagram

Activity diagrams describe the workflow behavior of a system. Activity diagrams are similar
to state diagrams because activities are the state of doing something. The diagrams describe
the state of activities by showing the sequence of activities performed. Activity diagrams can
show activities that are conditional or parallel.
[Activity diagram. Activities, in order: User Registration; Login, with Login fail and Login
success branches; User Home; Enter the topic name; Display the number of citations; Remove
the duplicate citations; Rearrange the subtrees; Gives the final tree.]
Fig. 6.3: Activity Diagram
6.4 Sequence Diagram

Sequence diagrams belong to a group of UML diagrams called Interaction Diagrams.
Sequence diagrams describe how objects interact over the course of time through an
exchange of messages. A single sequence diagram often represents the flow of events for a
single use case.
Instance:
An instance of a class shows a sample configuration of an object. On the sequence diagram,
each instance has a lifeline box underneath it showing its existence over a period of time.
Actor: An actor is anything outside the system that interacts with the system. It could be a
user or another system.
Message: The message indicates communication between objects. The order of messages
from top to bottom on your diagram should be the order in which the messages occur.
[Sequence diagram. Lifelines: User; Login; Search Topic; Display the citation; Comparison of
citations; Linkage Citation are rating; generate reproduction tree; Rearrange the subtrees;
complete tree. Messages, in order: Login; Enter the topic name; Display the no of citation
from no of sites; Comparison of citations; Linkage Citation are rating; Using rating procedure
generate reproduction tree; Rearrange the subtrees; Gives the final tree as a complete tree.]
Fig. 6.4: Sequence Diagram
6.5 Communication Diagram

Communication diagrams belong to a group of UML diagrams called Interaction Diagrams.
Communication diagrams, like Sequence Diagrams, show how objects interact over the
course of time. However, instead of showing the sequence of events by the layout on the
diagram, communication diagrams show the sequence by numbering the messages on the
diagram. This makes it easier to show how the objects are linked together, but harder to see
the sequence at a glance.
Instance:
An instance of a class shows a sample configuration of an object. On the communication
diagram, instances are drawn as boxes connected by links.
Lollipop Interface:
A lollipop interface is a shorthand syntax for an interface. It shows the interface name
without displaying the operations.
Message:
The message indicates communication between objects. The sequence number on each
message gives the order in which the messages occur.
[Communication diagram. Objects: User; Login; Search Topic; Display the citation; Comparison
of citations; Linkage Citation are rating; generate reproduction tree; Rearrange the subtrees;
complete tree. Numbered messages: 1: Login; 2: Enter the topic name; 3: Display the no of
citation from no of sites; 4: Comparison of citations; 5: Linkage Citation are rating; 6: Using
rating procedure generate reproduction tree; 7: Rearrange the subtrees; 8: Gives the final tree
as a complete tree.]
Fig. 6.5: Communication Diagram
6.6 State Machine Diagram

A UML state chart is a notation for describing the sequence of states an object goes through
in response to external events. Objects have behavior and state. The state of an object
depends on its current activity or condition. A state chart diagram shows the possible states
of the object and the transitions that cause a change in state.
A state chart describes the dynamic behavior of an individual object as a number of states. A
state is a condition satisfied by the attributes of an object. Given a state, a transition
represents a future state the object can move to and the conditions associated with the change
of state.
A state is depicted by a rounded rectangle; a transition is depicted by an open arrow
connecting two states. States are labeled with their names. A small solid black circle indicates
the initial state, and a circle surrounding a small solid circle indicates the final state.
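The notions of state and transition can also be expressed directly in code. The following is a minimal Java sketch, not the project's implementation: the state names loosely follow the user workflow described in this chapter, while the event names (REGISTERED, LOGIN_SUCCESS, LOGIN_FAIL, NEXT) and the class name UserFlowSketch are assumptions introduced only for illustration.

```java
// Illustrative sketch only: the events and the class name are hypothetical.
class UserFlowSketch {
    enum State { REGISTRATION, LOGIN, ENTER_TOPIC, DISPLAY_CITATIONS, REMOVE_DUPLICATES, FINAL_TREE }
    enum Event { REGISTERED, LOGIN_SUCCESS, LOGIN_FAIL, NEXT }

    /** Returns the state reached from current on event; unhandled events leave the state unchanged. */
    static State next(State current, Event event) {
        switch (current) {
            case REGISTRATION:
                return event == Event.REGISTERED ? State.LOGIN : current;
            case LOGIN:
                // A failed login stays in LOGIN; only a successful one moves on.
                return event == Event.LOGIN_SUCCESS ? State.ENTER_TOPIC : current;
            case ENTER_TOPIC:
                return event == Event.NEXT ? State.DISPLAY_CITATIONS : current;
            case DISPLAY_CITATIONS:
                return event == Event.NEXT ? State.REMOVE_DUPLICATES : current;
            case REMOVE_DUPLICATES:
                return event == Event.NEXT ? State.FINAL_TREE : current;
            default:
                return current; // FINAL_TREE is the final state: no outgoing transitions
        }
    }

    public static void main(String[] args) {
        State s = State.REGISTRATION;
        Event[] events = { Event.REGISTERED, Event.LOGIN_FAIL, Event.LOGIN_SUCCESS,
                           Event.NEXT, Event.NEXT, Event.NEXT };
        for (Event e : events) {
            State t = next(s, e);
            System.out.println(s + " --" + e + "--> " + t);
            s = t;
        }
    }
}
```

Keeping all transitions in one function mirrors the diagram: each case corresponds to one rounded rectangle, and each return value to one arrow leaving it.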
[State machine diagram. States, in order: User Registration; Login; Upload no of Citation for
each topic; Enter the topic name; Display the no of citation from no of sites; Comparison of
citations; Remove the duplicate Citation; Linkage Citation are rating; Using rating procedure
generate reproduction tree; Rearrange the subtrees; Gives the final tree as a complete tree.]
Fig. 6.6: State Machine Diagram
6.7 Component Diagram
[Component diagram. Components: Admin; Upload the Topic Content; Display the no of citation
from no of sites; User; Search the Topic; Remove the duplicate Citation; Using rating
procedure generate reproduction tree; Rearrange the subtrees; Gives the final tree.]
Fig. 6.7: Component Diagram
6.8 Package Diagram
Fig. 6.8: Package Diagram
6.9 Deployment Diagram
[Deployment diagram. Nodes: Search Content; Upload Content; System; Display citation;
complete tree; Remove ...]
Fig. 6.9: Deployment Diagram
6.10 Dependency Graph
Fig. 6.10: Dependency Graph

Chapter 7
Project Schedule & Estimate
7.1 Project Plan
Project planning is the most critical step in the project management life cycle, because it is
only when all of the tasks are listed in the project plan that we truly have an idea of what it
will take to deliver the project on time. To perform project planning in a smart and efficient
way, we need a well-defined and organized project plan. This chapter covers the project
planning features: the task list, schedules, project management approach, projected project
budget, project timeline, etc.
Project planning is a discipline for stating how to complete a project within a certain time
frame, usually with defined stages, and with designated resources. Creating a project plan is
the first thing that needs to be done when undertaking any kind of project. At a minimum, a
project plan answers basic questions about the project:
Why? What is the problem or value proposition addressed by the project?
What? What is the work that will be performed on the project? What are the major
products/deliverables?
Who? Who will be involved and what will be their responsibilities within the
project? How will they be organized?
When? What is the project time line and when will particularly meaningful points,
referred to as milestones, be complete?
Often project planning is ignored in favour of getting on with the work. However, many
people fail to realize the value of a project plan in saving time and money and then face the
consequences later.
Table 7.1: Project Plan for Sem I

Sr. No.  Module Name                                 Start Date        End Date          Status
1        Requirement Analysis
         A] Selection of Problem Definition          1st July 2013     17th July 2013    Completed
         B] Literature Survey                        24th July 2013    25th July 2013    Completed
2        Planning Phase
         A] Deciding Scope                           1st August 2013   2nd August 2013   Completed
         B] Selection of Platform                    7th August 2013   12th August 2013  Completed
3        Designing Phase
         A] Module division and allocation of task   25th August 2013  30th August 2013  Completed
         B] User Interface                           15th Sept. 2013   23rd Sept. 2013   Completed
4        Modeling Phase
         A] Draw UML Diagrams                        26th Sept. 2013   3rd Sept. 2013    Completed
         Preparation of partial report               4th Sept 2013     8th Sept 2013     Completed
         Submission of partial report                21st Sept 2013    12th Oct 2013     Submitted
Table 7.2: Project Plan for Sem II

Sr. No.  Module Name                                                       Start Date       End Date         Status
1        Studied about GPRD system & design our graphical user interface   13th Jan. 2014   18th Jan. 2014   Completed
2        Start coding of 1st module, i.e. graphical user interface         27th Jan. 2014   31st Jan. 2014   Completed
3        Module implementation for graphical user interface                03rd Feb. 2014   08th Feb. 2014   Completed
4        Test the module and remove some bugs                              10th Feb. 2014   15th Feb. 2014   Completed
5        Start coding of 2nd module, i.e. GUI and Testing                  03rd Mar 2014    08th Mar 2014    Completed
6        Final coding for 3rd module and test the modules                  10th Mar 2014    15th Mar 2014    Completed
7        All parts of project testing                                      17th Mar 2014    22nd Mar 2014    Completed
8        Preparation of final report                                       02nd Apr. 2014   12th Apr. 2014   Completed
9        Paper publication                                                 13th Apr. 2014   16th Apr. 2014   Completed
10       Submission of final project report                                21st Apr. 2014   07th Apr. 2014   Submitted
7.2 Project Estimation

The project estimation includes the following parameters:
1. Time: The total time for overall project completion undergoing various phases of
development is given as eight months (approx.).
2. Effort: The characteristics of each project dictate the distribution of effort. Here about
35% of the effort is spent on analysis and design, a similar amount on testing, and
about 30% on coding.
3. Cost: The cost of the project is calculated in terms of the effort applied and the
resources used. The other parameters that account for cost estimation are:
Man/Month
Technology used
Benefits
Machine cost
COCOMO Model: The Basic COCOMO is a static, single-valued model that computes
software development effort (and cost) as a function of program size expressed in estimated
lines of code.
COCOMO is a collection of three models: a Basic model that is applied early in the
project, an Intermediate model that is applied after requirements are specified, and an
Advanced model that is applied after design is complete. All three models take the form:
$E = a \cdot S^{b} \cdot EAF$
where
E is the effort in person-months,
S is the size measured in thousands of lines of code (KLOC),
and EAF is an effort adjustment factor (equal to 1 in the Basic model).
The factors a and b depend on the development mode.
COCOMO applies to three classes of software projects:
Organic projects: relatively small, simple software projects in which small teams
with good application experience work to a set of less than rigid requirements.
Semi-detached projects: intermediate (in size and complexity) software projects
in which teams with mixed experience levels must meet a mix of rigid and less than
rigid requirements.
Embedded projects: software projects that must be developed within a set of
tight hardware, software, and operational constraints.
The basic COCOMO equations take the form

$E = a_b \cdot (KLOC)^{b_b}$
$D = c_b \cdot (E)^{d_b}$
$P = E / D$
where E is the effort applied in person-months, D is the development time in chronological
months, KLOC is the estimated number of delivered lines of code for the project
(expressed in thousands), and P is the number of people required. The coefficients $a_b$,
$b_b$, $c_b$ and $d_b$ are given in the following table.
Table 7.3: Basic COCOMO model
Software Project   a_b    b_b    c_b    d_b
Organic            2.4    1.05   2.5    0.38
Semi-detached      3.0    1.12   2.5    0.35
Embedded           3.6    1.20   2.5    0.32
Basic COCOMO is good for quick, early, rough order of magnitude estimates of software
costs, but it does not account for differences in hardware constraints, personnel quality and
experience, use of modern tools and techniques, and other project attributes known to have a
significant influence on software costs, which limits its accuracy.
Our project comes under the organic category.
For organic:
Effort estimation
$\text{Effort} = a \cdot (\text{Size})^{b} \cdot EAF$
where
Size = 2.5 KLOC
a = 2.4
b = 1.05
n = 3 (number of persons)
EAF = 1.05 (adjustment factor)
Therefore,
$\text{Effort} = 2.4 \cdot (3^{1.05}) \cdot 1.05$
$= 2.4 \cdot 3.169 \cdot 1.05$
$\approx 8$ person-months.
$\text{Development time} = c \cdot (\text{Effort})^{d}$
where
c = 2.5
d = 0.38
Therefore,
$\text{Development time} = 2.5 \cdot (7.06^{0.38})$
$= 2.5 \cdot 1.869$
$= 4.91$
$\approx 5$ months.
Recommended number of people:
$N = E / D = 8 / 5 \approx 2$ people
Considering the salary of each person = Rs. 8000/- per month.
Number of months to complete the project = 5
Total expenses on the salary of the people = Rs. 40,000/-
Hence the total cost of the project is Rs. 40,000/-
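The estimation steps can be recomputed programmatically. The following is a minimal Java sketch under stated assumptions: it uses the EAF-adjusted effort formula and the organic-mode coefficients a, b, c and d from this section, rounds the team size up to whole persons, and takes the size as a parameter so the figures can be re-derived for any size estimate. The class name CocomoSketch and its method names are illustrative and not part of the project code.

```java
// Illustrative sketch; class and method names are not from the project code.
class CocomoSketch {
    /** Effort in person-months: E = a * S^b * EAF, with S in KLOC. */
    static double effort(double a, double b, double kloc, double eaf) {
        return a * Math.pow(kloc, b) * eaf;
    }

    /** Development time in chronological months: D = c * E^d. */
    static double developmentTime(double c, double d, double effort) {
        return c * Math.pow(effort, d);
    }

    /** Recommended number of people: N = E / D, rounded up to whole persons. */
    static int people(double effort, double months) {
        return (int) Math.ceil(effort / months);
    }

    public static void main(String[] args) {
        double a = 2.4, b = 1.05, c = 2.5, d = 0.38; // organic-mode coefficients
        double kloc = 2.5;                            // size estimate; change as needed
        double eaf = 1.05;                            // effort adjustment factor
        double e = effort(a, b, kloc, eaf);
        double months = developmentTime(c, d, e);
        System.out.printf("Effort = %.2f person-months%n", e);
        System.out.printf("Development time = %.2f months%n", months);
        System.out.println("People = " + people(e, months));
    }
}
```

Because every input is a parameter, the same three functions cover the semi-detached and embedded modes by passing the corresponding coefficients.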
Chapter 8
Software Implementation
8.1 Introduction
Implementation is the stage where the theoretical design is turned into a working system. The
most crucial part of this stage is achieving a successful new system and giving users
confidence that it will work efficiently and effectively.
The system can be implemented only after thorough testing is done and it is found to work
according to the specification. Implementation involves careful planning, investigation of the
current system and its constraints on implementation, design of methods to achieve the
changeover, and an evaluation of changeover methods. Apart from planning, the two major
tasks in preparing for implementation are education and training of the users and testing of
the system.
The more complex the system being implemented, the more involved will be the systems
analysis and design effort required just for implementation. The implementation phase
comprises several activities. The required hardware and software acquisition is carried out.
The system may require some software to be developed; for this, programs are written and
tested. The user then changes over to the new, fully tested system and the old system is
discontinued.
Implementation is the process of having systems personnel check out and put new
equipment into use, train users, install the new application, and construct any files of data
needed to use it.
Depending on the size of the organization that will be involved in using the application and
the risk associated with its use, system developers may choose to test the operation in only
one area of the firm, say in one department or with only one or two persons. Sometimes they
will run the old and new systems together to compare the results. In still other situations,
developers will stop using the old system one day and begin using the new one the next. As
we will see, each implementation strategy has its merits, depending on the business situation
in which it is considered. Regardless of the implementation strategy used, developers strive
to ensure that the system's initial use is trouble-free.
Once installed, applications are often used for many years. However, both the organization
and the users will change, and the environment will be different over the weeks and months.
Therefore, the application will undoubtedly have to be maintained. Modifications and
changes will be made to the software, files, or procedures to meet the emerging requirements.
8.2 Implementation Details

Installation Steps:
Step 1: Oracle installation
Step 2: Tomcat 6.0 installation (or)
Step 3: MyEclipse 8.0 installation
Deployment Steps:
Step 1: Start MyEclipse 8.0.
Step 2: Click on the File menu and select the Import option.
Step 3: In the import dialog, select the General option and click on Existing Projects into Workspace.
Step 4: Click the Browse button, select the project, and click the Finish button.
Step 5: Right-click on the project and select the Run As option.
8.2.1 Modules and Algorithms

Modules
1. Procedure of Genetic Programming
2. Genetic operations
3. Generational Evolutionary algorithm
4. Record De-duplication
5. Precision and Recall operations
1. Procedure of Genetic Programming:

Multiple users send queries and extract results from the search engine. The first operation
applied during result extraction is selection. Selection is performed over different databases
and extracts the results with interactive query processing. On its own it does not provide an
optimal solution: the results still contain some duplicates, so only nearly optimal results are
displayed.
2. Genetic Operations:
This module has been developed to provide structure-based results. The first step is the
selection of the root terminals; these are the zero-level results. Next, we find the next level
of children. This procedure is applied until the leaf nodes are reached, so that all internal
nodes are identified along the way. Identifying the internal nodes and building the structure
is done with three operations: selection, crossover and mutation.
3. Generational Evolutionary Algorithm:
This algorithm initializes the results of all nodes. A rating is calculated for each node, and
the fitness is calculated from the rating value. Once the fittest nodes are found, the
reproduced tree can be created so that it contains the best nodes. The same process is
repeated until the optimal tree is identified.
4. Record De-duplication:
The de-duplication adapts automatically to the requirement. It shows the results efficiently
in a tree data structure, giving evidence-based results. All the nodes are displayed with the
help of a similarity function.
5. Precision and Recall operations:
These two measures are computed over the pairs identified as duplicates:
P = Number of Correctly Identified Duplicated pairs / Number of Identified Duplicated pairs
R = Number of Correctly Identified Duplicated pairs / Number of True Duplicated pairs
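The two measures P and R can be computed from two sets of record pairs. The following is a minimal Java sketch, assuming that each duplicate pair is represented by a canonical string key built from two record ids; the class PairMetricsSketch, the helper pair() and the sample ids are hypothetical and serve only to illustrate the formulas.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch; the pair encoding and all names are assumptions.
class PairMetricsSketch {
    /** Canonical key for an unordered pair of record ids, e.g. pair(4, 3) gives "3-4". */
    static String pair(int a, int b) {
        return Math.min(a, b) + "-" + Math.max(a, b);
    }

    /** Number of identified pairs that are also true duplicate pairs. */
    static int correct(Set<String> identified, Set<String> truth) {
        Set<String> both = new HashSet<>(identified);
        both.retainAll(truth);
        return both.size();
    }

    /** P = correctly identified pairs / identified pairs (0 if nothing was identified). */
    static double precision(Set<String> identified, Set<String> truth) {
        return identified.isEmpty() ? 0.0 : (double) correct(identified, truth) / identified.size();
    }

    /** R = correctly identified pairs / true duplicate pairs (0 if there are no true pairs). */
    static double recall(Set<String> identified, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) correct(identified, truth) / truth.size();
    }

    public static void main(String[] args) {
        // Made-up example: three pairs flagged by the system, four true duplicate pairs.
        Set<String> identified = new HashSet<>(Arrays.asList(pair(1, 2), pair(3, 4), pair(5, 6)));
        Set<String> truth = new HashSet<>(Arrays.asList(pair(2, 1), pair(3, 4), pair(7, 8), pair(9, 10)));
        System.out.println("P = " + precision(identified, truth));
        System.out.println("R = " + recall(identified, truth));
    }
}
```

Using an order-independent key means that the pair (1, 2) and the pair (2, 1) count as the same duplicate pair, which is what the pair-based definitions require.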
Algorithm
Algorithm: Generational Evolutionary Algorithm
1. Initialization: all the gathered data, from which the solution is to be discovered, are
initialized.
2. Evaluation: all the individuals are evaluated and a numeric fitness value is assigned
to each one.
3. Selection: the best n individuals are copied into the next-generation population
without modification.
4. Crossover: m individuals that will compose the next generation together with the best
parents are selected and replace the existing generation. In this process two parent
trees are selected according to a matching policy, a random subtree is selected in each
parent, and the two subtrees are exchanged.
5. Mutation: finally, the mutation operation is performed, through which the best
individuals are produced in the population.
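The generational loop can be sketched in a few lines of Java. This is a simplified illustration rather than the project's algorithm: individuals are fixed-length weight vectors over three similarity evidence values instead of GP trees, one-point crossover stands in for subtree exchange, and the labelled pairs, the boundary of 0.5 and the mutation rate of 0.2 are made-up values. The class GenerationalEaSketch and all of its members are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Simplified stand-in for tree-based GP: weight vectors, toy data, hypothetical names.
class GenerationalEaSketch {
    // Each row: similarity evidence in [0,1] for one record pair (made-up values).
    static final double[][] EVIDENCE = {
        {0.9, 0.8, 1.0}, {0.8, 0.9, 1.0}, {0.2, 0.1, 0.0},
        {0.3, 0.4, 1.0}, {0.95, 0.7, 1.0}, {0.1, 0.3, 0.0}
    };
    static final boolean[] IS_DUPLICATE = {true, true, false, false, true, false};
    static final double BOUNDARY = 0.5; // fixed replica identification boundary

    /** Fitness = fraction of pairs classified correctly by the weighted average score. */
    static double fitness(double[] w) {
        int correct = 0;
        for (int i = 0; i < EVIDENCE.length; i++) {
            double score = 0, total = 0;
            for (int j = 0; j < w.length; j++) { score += w[j] * EVIDENCE[i][j]; total += w[j]; }
            boolean predicted = total > 0 && score / total >= BOUNDARY;
            if (predicted == IS_DUPLICATE[i]) correct++;
        }
        return (double) correct / EVIDENCE.length;
    }

    /** Runs the loop (popSize >= 2, elite >= 1) and returns the best fitness of each generation. */
    static double[] evolve(long seed, int popSize, int elite, int generations) {
        Random rnd = new Random(seed);
        int genes = EVIDENCE[0].length;
        double[][] pop = new double[popSize][genes];
        for (double[] ind : pop)                       // 1. initialization
            for (int j = 0; j < genes; j++) ind[j] = rnd.nextDouble();
        double[] bestPerGen = new double[generations];
        for (int g = 0; g < generations; g++) {
            // 2. evaluation: sort the population best-first by fitness
            Arrays.sort(pop, Comparator.comparingDouble((double[] ind) -> -fitness(ind)));
            bestPerGen[g] = fitness(pop[0]);
            double[][] next = new double[popSize][];
            for (int i = 0; i < elite; i++) next[i] = pop[i].clone(); // 3. selection, unmodified
            for (int i = elite; i < popSize; i++) {
                // 4. crossover: two parents from the better half, one cut point
                double[] p1 = pop[rnd.nextInt(popSize / 2)];
                double[] p2 = pop[rnd.nextInt(popSize / 2)];
                int cut = 1 + rnd.nextInt(genes - 1);
                double[] child = new double[genes];
                for (int j = 0; j < genes; j++) child[j] = j < cut ? p1[j] : p2[j];
                // 5. mutation: occasionally replace one gene by a random value
                if (rnd.nextDouble() < 0.2) child[rnd.nextInt(genes)] = rnd.nextDouble();
                next[i] = child;
            }
            pop = next;
        }
        return bestPerGen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(evolve(42L, 20, 2, 15)));
    }
}
```

Because the best individuals are copied unchanged into every new generation, the best fitness in this sketch can never decrease from one generation to the next.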
8.2.2 Sample Code

//Home.jsp
<html>
<jsp:include page="Header.jsp"></jsp:include>
<body>
<b><i><font color="orange" size="5">A Genetic Programming Approach</font> <font
color="Green" size="5">to Record Deduplication
</font></i></b>
<table align="center"><tr><td>
<p><font color="#4682B4">
<font size="7">W</font>e propose a genetic programming approach to record deduplication that combines several
different pieces of evidence
extracted from the data content to find a deduplication function that is able to identify
whether two entries in a repository are replicas or
not. As shown by our experiments, our approach outperforms an existing state-of-the-art
method found in the literature. Moreover, the
suggested functions are computationally less demanding since they use less evidence. In
addition, our genetic programming
approach is capable of automatically adapting these functions to a given fixed replica
identification boundary, freeing the user from the
burden of having to choose and tune this parameter.
</font></p>
</td>
<td> </td>
<td colspan="1" align="left" valign="top"><img
src="<%=request.getContextPath()+"/images/c5.jpg"%>" align="top" height="200" /></td>
</tr>
</table>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>
// LoginPage.jsp
<!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">

<html>
<head>
<script language="JavaScript"
src="<%=request.getContextPath()+"/scripts/gen_validatorv31.js"%>"
type="text/javascript"></script>
<style type="text/css">
.Title {
font-family:Verdana;
font-weight:bold;
font-size:8pt
}
.Title1 {font-family:Verdana;
font-weight:bold;
font-size:8pt
}
</style>
</head>
<body>
<fieldset>
<legend>Login Form</legend>
<form action="<%=request.getContextPath()+"/LoginAction"%>" method=post
name="login">
<table border="0" align="center" bgcolor="white" width="80%">
<tr>
<td height="120" align="right"></td>
<td><table border="0" align="center">
<tr align="center"><td colspan="2"><h3><font color="#4682B4">Login
Form</font></h3></td>
</tr>
<tr>
<td ><font color="#DA70D6" size=""><b>UserID</b></font></td>
<td ><input type="text" name="username"> </td>
</tr>
<tr>
<td><font color="#DA70D6" size=""><b>Password</b></font></td>
<td>
<input type="password" name="password"></td>
</tr>
<tr>
<td colspan="2">
<div align="center" class="style11">
<input type="submit" name="Submit" value="Sign In">
<input name="Input2" type="reset" value="Clear">
</div>
</td>
<td><div>
<a href="./Recoverpassword.jsp"><font color="#228B22" size="3"
style="verdana">ForgotPassword??</font></a>
</div></td>
</tr>
</table></td>
<td> <img src="<%=request.getContextPath()+"/images/l1.jpeg"%>"
height="200" />
</td>
</tr>
</table>
</form>
</fieldset>
<script language="JavaScript" type="text/javascript">
//You should create the validator only after the definition of the HTML form
var frmvalidator = new Validator("login");
frmvalidator.addValidation("username","req","Login Name is required");
frmvalidator.addValidation("password","req","Password is required");
</script>
<br/>
<br/>
<jsp:include page="Footer.jsp"></jsp:include>
</body>
</html>
// Search.jsp
<%@ page language="java" import="java.util.*" pageEncoding="ISO-8859-1"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<script type="text/javascript">
function changePage(){
// Read the selected option and compare it against string literals.
var select=document.getElementById("select").value;
var sel="";
if(select==sel){
alert("plz select any one ");
}
else if(select=="java"){
alert("u have selected java");
location.href="./MainPage.jsp";
}else if(select=="c"){
alert("u have selected c");
location.href="./C.jsp";
}else if(select=="unix"){
alert("u have selected Unix");
location.href="./Unix.jsp";
}
}
</script>
</head>
<body>
<center> <h3><font color="#008080">Select u r search Concept Here</font></h3>
<form name="cpaper" action="./GetSearchPageAction">
<table>
<tr>
<td> <font size="4" color="#4682B4">Select Here</font></td>
<td> <select id="select" name="search" >

<option value="unix">Unix</option>
</select>
</td>
</tr>
<br/>
<br/><br/>
<tr align="center"><td> <input type="submit" value="search"></td></tr>
</table>
</form>
</center>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>
8.2.3 Screen Shots:
Fig. 8.1: Home Page
Fig. 8.2: Admin Login Page
Fig. 8.3: Admin Welcome Page
Fig. 8.4: Upload Webpage
Fig. 8.5: Password Change Page
Fig. 8.6: User Login Page
Fig. 8.7: User Welcome Page
Fig. 8.8: Search Engine Page
Fig. 8.9: Search Result Page-01
Fig. 8.10: Search Result Page-02
Fig. 8.11: Duplicate Citation Data Info Page
Fig. 8.12: Duplicate Citation Remove Page
Fig. 8.13: View Tree Graph Page
Fig. 8.14: Tree Graph
Chapter 9
Software Testing
9.1 Introduction
Testing Strategies
Testing:
1. Testing is the process of executing a system with the intent of finding an error.
2. Testing is defined as the process in which defects are identified, isolated, submitted
for rectification, and verified as fixed, so that a quality product is delivered and the
customer is satisfied.
3. Quality is defined as conformance to the requirements.
4. A defect is a deviation from the requirements.
5. A defect is also called a bug.
6. Testing can demonstrate the presence of bugs, but not their absence.
7. Debugging and testing are not the same thing.
8. Testing is a systematic attempt to break a program.
9. Debugging is the art or method of uncovering why the script or program did not execute
properly.
Testing Methodologies:
Black Box Testing: is the testing process in which the tester tests an application
without any knowledge of its internal structure.
Usually test engineers are involved in black box testing.
White Box Testing: is the testing process in which the tester tests an application
with knowledge of its internal structure.
Usually the developers are involved in white box testing.
Gray Box Testing: is the process in which a combination of black box and white
box techniques is used.
Levels of Testing:
[Figure: Module 1, Module 2 and Module 3 are each tested as units; the units are then
combined in integration testing, with inputs (i/p) and outputs (o/p) passed between them.]
System Testing: Presentation + business + databases
UAT: user acceptance testing
Fig. 9.1 Software Testing Life Cycle
Test Planning:
1. The test plan is defined as a strategic document that describes how to perform the
various kinds of testing on the whole application in the most efficient way.
2. This document covers the scope of testing,
3. the objective of testing,
4. the areas that need to be tested,
5. the areas that should not be tested,
6. scheduling and resource planning,
7. the areas to be automated and the various testing tools used.
Test Development
1. Test case development (checklist).
2. Test procedure preparation (description of the test cases).
3. Implementation of test cases and observation of the results.
Types of Testing:
Smoke Testing: is the initial testing in which the tester checks that all the functionality of
the application is available, so that detailed testing can be performed on it. (The main check
is for available forms.)
Sanity Testing: is a type of testing conducted on an application initially to check its proper
behavior, that is, to check that all the functionality is available before detailed testing is
conducted on it.
Regression Testing: is one of the most important types of testing. It is the process in which
functionality that has already been tested is tested once again whenever a new change is
added, in order to check whether the existing functionality remains the same.
Static Testing: is testing performed on an application while it is not being executed.
Ex: GUI, document testing.
Dynamic Testing: is testing performed on an application while it is being executed.
Ex: functional testing.
Alpha Testing: is a type of user acceptance testing conducted on an application just before
it is released to the customer.
Beta Testing: is a type of UAT conducted on an application when it is released to the
customer, deployed into the real-time environment, and accessed by real users. In this type
of testing, the developer can get user responses.
Compatibility Testing: is the testing process in which the product is tested in environments
with different combinations of databases, in order to check how far the product is compatible
with all these environment and platform combinations.
Ad hoc Testing: is testing performed without the test case document that formal testing
relies on, in order to cover features that are not covered in that document. It is also intended
for GUI testing, which may involve cosmetic issues.
9.2 Test Plans

The testing process starts with a test plan. This plan identifies all the testing-related
activities that must be performed, specifies the schedules, allocates the resources, and
specifies guidelines for testing. During unit testing, the specified test cases are executed
and the actual result is compared with the expected output. The final output of the testing
phase is the test report and the error report.
Test Data:
Here all test cases that are used for the system testing are specified. The goal is to test the
different functional requirements specified in Software Requirements Specifications (SRS)
document.
Unit Testing:
Each individual module has been tested against the requirement with some test data.
Test Report:
The module works properly provided the user enters the required information. All data entry
forms have been tested with the specified test cases and are working properly.
Error Report:
If the user does not enter data in the specified order, the user is prompted with error
messages. Error handling was implemented for both expected and unexpected errors.
9.3 Test Cases

A test case is a set of input data and expected results that exercises a component with the
purpose of causing failure and detecting faults. It is an explicit set of instructions
designed to detect a particular class of defect in a software system by bringing about a
failure. A test case can give rise to many tests.
In general, a test case is a set of test data and test programs and their expected results. A
test case in software engineering normally consists of a unique identifier, requirement
references from a design specification, preconditions, events, a series of steps (also known
as actions) to follow, input, and output. It validates one or more system requirements and
generates a pass or fail.
The mechanism for determining whether a software program or system has passed or failed
such a test is known as a test oracle. In some settings, an oracle could be a requirement or use
case, while in others it could be a heuristic. It may take many test cases to determine that a
software program or system is considered sufficiently scrutinized to be released. Test cases
are often referred to as test scripts, particularly when written. Written test cases are usually
collected into test suites. If a requirement has sub-requirements, each sub-requirement must
have at least two test cases. Written test cases should include a description of the
functionality to be tested, and the preparation required to ensure that the test can be
conducted.
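The structure of a written test case and the role of the test oracle can be sketched in Java as follows. This is a minimal illustration only: the class names, field names and the sample identifiers (TC-01, REQ-01) are assumptions made for the example and do not come from the project's code.

```java
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Illustrative sketch only: class and field names are assumptions, not project code. */
final class TestCaseSketch {

    /** The pass or fail outcome generated by a test case. */
    enum Verdict { PASS, FAIL }

    /** A written test case carrying the fields listed in the text. */
    static final class TestCase<I, O> {
        final String id;              // unique identifier
        final String requirementRef;  // requirement reference from the design specification
        final String precondition;    // state required before the test can run
        final String[] steps;         // series of steps (actions) to follow
        final I input;                // test input
        final O expected;             // expected output

        TestCase(String id, String requirementRef, String precondition,
                 String[] steps, I input, O expected) {
            this.id = id;
            this.requirementRef = requirementRef;
            this.precondition = precondition;
            this.steps = steps;
            this.input = input;
            this.expected = expected;
        }

        /** Applies the component to the input; the oracle decides pass or fail. */
        Verdict run(Function<I, O> component, BiPredicate<O, O> oracle) {
            O actual = component.apply(input);
            return oracle.test(expected, actual) ? Verdict.PASS : Verdict.FAIL;
        }
    }

    public static void main(String[] args) {
        TestCase<String, Integer> tc = new TestCase<>(
                "TC-01", "REQ-01", "none",
                new String[] {"apply the component to the input", "compare actual with expected"},
                "abc", 3);
        // The oracle here is exact equality; a heuristic oracle could be passed instead.
        Verdict verdict = tc.run(String::length, (expected, actual) -> expected.equals(actual));
        System.out.println(tc.id + ": " + verdict);
    }
}
```

Passing the oracle as a parameter keeps the pass/fail decision separate from the test data, which matches the description of an oracle as either a requirement or a heuristic.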
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the Software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
Test case format
Test Cases usually have the following components:
Test Case Summary
Initial Condition
Steps to run the test case
Expected behavior/outcome
1. Positive Test Cases:
The positive flow of the functionality must be considered.
Valid inputs must be used for testing.
The tester must take a positive perspective to verify whether the requirements are met.
Example for Positive Test cases:

Table 9.1: Example for Positive Test Case
T.C.No | Description | Expected value | Actual value | Result
 | Check for the date and time auto display | The date and time of the system must be displayed | |
 | Enter the valid Roll no into the student roll no field | It should accept | |
2. Negative Test Cases:
The tester must take a negative perspective.
Invalid inputs must be used for testing.
Example for Negative Test cases:

Table 9.2: Example for Negative Test Case
T.C.No | Description | Expected value | Actual value | Result
 | Try to modify information in date and time | The modification should not be allowed | |
 | Enter invalid data into the student details form, click on save | It should not accept invalid data; save should not be allowed | |
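The difference between a positive and a negative test case can be illustrated with the roll number field mentioned in Tables 9.1 and 9.2. The sketch below is hypothetical: the report does not define the roll number format, so the rule used here (1 to 10 decimal digits) and the names `RollNoChecks`, `isValidRollNo` and `result` are assumptions for the example.

```java
/** Illustrative sketch only: the validation rule and all names are assumptions. */
final class RollNoChecks {

    /** Hypothetical rule: a roll number consists of 1 to 10 decimal digits. */
    static boolean isValidRollNo(String rollNo) {
        return rollNo != null && rollNo.matches("[0-9]{1,10}");
    }

    /** Fills the Result column: "Pass" when the actual value equals the expected value. */
    static String result(boolean expected, boolean actual) {
        return expected == actual ? "Pass" : "Fail";
    }

    public static void main(String[] args) {
        // Positive test case: a valid input must be accepted.
        System.out.println("valid roll no: " + result(true, isValidRollNo("12345")));
        // Negative test cases: invalid inputs must be rejected.
        System.out.println("letters in roll no: " + result(false, isValidRollNo("12a45")));
        System.out.println("empty roll no: " + result(false, isValidRollNo("")));
    }
}
```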
Test Case 1:
Login Page Test Case
Table 9.3: Login Page Test Case
Test Case | Test Case Name | Test Case Description | Test Step | Expected | Actual
Login | Validate Login | To verify that Login name on login page must be greater than 1 characters | Enter login name less than 1 chars (say a) and password and click Submit button | An error message "Login not less than 1 characters" must be displayed |
 | | | Enter login name 1 chars (say a) and password and click Submit button | Login successful or an error message "Invalid Login or Password" must be displayed |
Pwd | Validate Password | To verify that Password on login page must be greater than 1 characters | Enter Password less than 1 chars (say nothing) and Login Name and click Submit button | An error message "Password not less than 1 characters" must be displayed |
Pwd02 | Validate Password | To verify that Password on login page must allow special characters | Enter Password with special characters (say !@hi&*P) and Login Name and click Submit button | Login successful or an error message "Invalid Login or Password" must be displayed |
Link | Verify Hyperlinks | To verify whether the hyperlinks available at left side on login page are working or not | Click Sign Up Link | Home Page must be displayed |
 | | | Click Sign Up Link | Sign Up page must be displayed |
 | | | Click New Users Link | New Users Registration Form must be displayed |
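The field checks exercised by the login page test cases (Table 9.3) could be implemented along the following lines. This is a sketch under assumptions: the table lets a one-character login name through to authentication, so the rule is read here as "at least one character"; the class and method names are hypothetical, and the authentication step itself is left out.

```java
import java.util.Optional;

/** Illustrative sketch only: names are assumptions; authentication is not shown. */
final class LoginFormChecks {

    /** Returns the error message for the first failed check, or empty if the form may be submitted. */
    static Optional<String> validate(String loginName, String password) {
        if (loginName == null || loginName.isEmpty()) {
            return Optional.of("Login not less than 1 characters");
        }
        if (password == null || password.isEmpty()) {
            return Optional.of("Password not less than 1 characters");
        }
        // No character-class restriction: special characters such as !@hi&*P are allowed (case Pwd02).
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validate("", "secret"));     // empty login name
        System.out.println(validate("a", ""));          // empty password
        System.out.println(validate("a", "!@hi&*P"));   // passes the field checks
    }
}
```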
Registration Page Test Case

Table 9.4: Registration Page Test Case
Test Case | Test Case Name | Test Case Description | Test Step | Expected | Actual
Registration | Validate User Name | To verify that User Name on Registration page must be Declared | Enter User name, click Submit button | An error message "User Name Must be Declared" |
 | Validate Password | To verify that Password on Registration page must be Declared | Enter Password, click Submit button | An error message "Password Must be Declared" |
 | Validate First Name | To verify that First Name on Registration page must be Declared | Enter First Name, click Submit button | An error message "First Name Must be Declared" |
 | Validate Last Name | To verify that Last Name on Registration page must be Declared | Enter Last Name, click Submit button | An error message "Last Name Must be Declared" |
 | Validate Address | To verify that Address on Registration page must be Declared | Enter Address, click Submit button | An error message "Address Must be Declared" |
 | Validate Phone number | To verify that Phone number on Registration page must be Declared | Enter Phone number, click Submit button | An error message "Phone number Must be Declared" |
 | Validate Phone number | To verify that Phone number on Registration page accepts only numeric values | Enter Phone number with characters (say abc), click Submit button | An error message "Phone number Must be numeric" |
 | Validate Phone number | To verify that Phone number on Registration page is valid | Enter Phone number values (say 1234), click Submit button | An error message "Phone number Must be Valid value" |
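The registration checks in Table 9.4 amount to required-field validation plus a numeric check on the phone number. A minimal Java sketch is given below, assuming the form is available as a map from field label to entered value; `RegistrationFormChecks` and `validate` are hypothetical names, and the "Valid value" rule for phone numbers is omitted because the report does not define it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: names and the map-based form are assumptions. */
final class RegistrationFormChecks {

    private static final String[] REQUIRED = {
        "User Name", "Password", "First Name", "Last Name", "Address", "Phone number"
    };

    /** Collects one error message per failed check, in field order. */
    static List<String> validate(Map<String, String> form) {
        List<String> errors = new ArrayList<>();
        for (String field : REQUIRED) {
            String value = form.get(field);
            if (value == null || value.trim().isEmpty()) {
                errors.add(field + " Must be Declared");
            }
        }
        // The numeric check applies only when a phone number was actually entered.
        String phone = form.get("Phone number");
        if (phone != null && !phone.trim().isEmpty() && !phone.trim().matches("[0-9]+")) {
            errors.add("Phone number Must be numeric");
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("User Name", "asha");
        form.put("Password", "p@ss");
        form.put("First Name", "Asha");
        form.put("Last Name", "");           // missing value
        form.put("Address", "12 Main Road");
        form.put("Phone number", "abc");     // non-numeric value
        System.out.println(validate(form));
    }
}
```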
Chapter 10
Result Analysis
We present and discuss the results of the experiments performed to evaluate our proposed
GP-based approach to record deduplication. There were three sets of experiments:
1. GP was used to find the best combination function for previously user-selected evidence,
i.e., <attribute, similarity function> pairs specified by the user for the deduplication
task. The use of user-selected evidence is a common strategy adopted by all previous record
deduplication approaches. Our objective in this set of experiments was to compare the
evidence combination suggested by our GP-based approach with that of a state-of-the-art
SVM-based solution, the Marlin system, used as our baseline.
2. GP was used to find the best combination function with automatically selected evidence.
Our objective in this set of experiments was to discover the impact on the result when GP has
the freedom to choose the similarity function that should be used with each specific attribute.
3. GP was tested with different replica identification boundaries. Our objective in this set of
experiments was to identify the impact on the resulting deduplication function when different
replica identification boundary values are used with our GP-based approach.
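To make the terms used in these experiments concrete, the sketch below shows how evidence, a combination function and a replica identification boundary could fit together. It is an illustration under stated assumptions, not the evolved function from the experiments: the token-based Jaccard similarity, the hand-written `combine` expression standing in for a GP tree, the rule that a pair is a replica when the combined value exceeds the boundary, and the sample records are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only: the similarity function, combination and data are assumptions. */
final class DedupFunctionSketch {

    /** Token-set Jaccard similarity in [0, 1], used as a stand-in similarity function. */
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.trim().toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.trim().toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(ta);
        common.retainAll(tb);
        Set<String> all = new HashSet<>(ta);
        all.addAll(tb);
        return (double) common.size() / all.size();
    }

    /** Hand-written stand-in for an evolved tree: (2 * e0) + e1. */
    static double combine(double[] evidence) {
        return 2.0 * evidence[0] + evidence[1];
    }

    /** Assumed decision rule: a pair is a replica when the combined value exceeds the boundary. */
    static boolean isReplica(double[] evidence, double boundary) {
        return combine(evidence) > boundary;
    }

    public static void main(String[] args) {
        // One evidence value per <attribute, similarity function> pair.
        double[] evidence = {
            jaccard("John Smith", "john smith"),        // <name, jaccard>
            jaccard("12 Main Street", "12 Main St")     // <address, jaccard>
        };
        // The same pair of records can change class as the boundary moves.
        for (double boundary : new double[] {2.0, 3.0}) {
            System.out.println("boundary " + boundary + " -> replica? "
                    + isReplica(evidence, boundary));
        }
    }
}
```

In the first set of experiments the list of evidence would be fixed by the user, while in the second GP would also choose which similarity function feeds each attribute; the third varies only the `boundary` argument.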
Chapter 11
Technical Specification
11.1 H/w Requirements

Processor : Pentium IV
Hard Disk : 40GB
RAM : 512MB or more
11.2 S/w Requirements

Operating System : Windows
Programming Language : Java
Web Applications : JDBC, Servlets, JSP
IDE/Workbench : My Eclipse 6.0
Database : Oracle 10g
11.3 Advantages
1. It removes the maximum number of duplicate results.
2. It provides meaningful results without searching the entire search space.
3. It provides effective results with the help of machine learning techniques.
4. It gives the best evidence-based results.
5. It provides a tuned results environment in every run of the implementation.
11.4 Disadvantages
1. As this project is mainly based on a search engine, a lot of manpower is required.
2. Poor network connectivity in rural areas may lead to discontinuation of this project.
11.5 Application
In Dynamic Web
In Digital Libraries
In Large Files
Pattern Recognition