
A Genetic Programming Approach to Record Deduplication

Chapter 1
Problem Definition

1.1 Problem Statement


Extracting structured data from Web pages is a challenging problem because of the intricate
underlying structure of such pages. A large number of techniques have been proposed to
address this problem, but all of them have inherent limitations because they depend on the
Web page programming language, its version, and the scripting used.
The World Wide Web hosts more and more online databases, and their number is increasing
day by day, so data duplication is growing quickly. When a query is submitted, information
is retrieved from the database and then extracted. However, the volume of data in these
databases is growing rapidly and the previous approaches have not been updated, so it
becomes very hard to detect duplicate data and to extract non-duplicated data effectively.
The evaluation is done by assigning to an individual a value that measures how suitable that
individual is for the proposed problem. In our GP experimental environment, individuals are
evaluated on how well they learn to predict good answers to a given problem, using the set of
functions and terminals available. The resulting value is also called raw fitness and the
evaluation functions are called fitness functions. Notice that after the evaluation step, each
solution has a fitness value that measures how good or bad it is for the given problem. Thus,
by using this value, it is possible to select which individuals should be in the next generation.
Strategies for this selection may involve very simple or complex techniques, varying from
just selecting the best n individuals to randomly selecting the individuals proportionally to
their fitness.
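
The two ends of that range of selection strategies can be sketched as follows. This is a minimal illustration, not the selection scheme of this project: the function names, the individual labels and the fitness values are hypothetical, and the proportional variant assumes non-negative fitness values.

```python
import random

def select_best(population, fitness, n):
    """Truncation selection: keep the n individuals with the highest fitness."""
    return sorted(population, key=fitness, reverse=True)[:n]

def select_proportional(population, fitness, n, rng=random):
    """Fitness-proportional selection: draw n individuals (with replacement),
    each with probability proportional to its fitness."""
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=n)

# Hypothetical raw fitness values for four candidate solutions.
scores = {"f1": 0.91, "f2": 0.40, "f3": 0.75, "f4": 0.10}
population = list(scores)

best_two = select_best(population, scores.get, 2)
sampled = select_proportional(population, scores.get, 3, random.Random(0))
```

Truncation selection is deterministic and strongly favors the current best individuals, while the proportional draw still gives weaker individuals a small chance of being selected.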

1.2 Need of Proposed System


The proposed system addresses the following needs:

To remove as many duplicate results as possible.

To provide meaningful results.

To provide effective results.

To reduce cost.



Chapter 2
Introduction

2.1 Overview
In this environment, the decision of keeping repositories with dirty data goes far beyond
technical questions such as the overall speed or performance of data management systems. To
better understand the impact of this problem, it is important to list and analyze the major
consequences of allowing the existence of dirty data in the repositories.
These include, for example:
1) Performance degradation: as additional useless data demand more processing, more time
is required to answer simple user queries;
2) Quality loss: the presence of replicas and other inconsistencies leads to distortions in
reports and misleading conclusions based on the existing data;
3) Increasing operational cost: because of the additional volume of useless data, investments
are required in more storage media and extra computational processing power to keep the
response time levels acceptable.
To avoid these problems, it is necessary to study the causes of dirty data in repositories. A
major cause is the presence of duplicates, quasi replicas, or near-duplicates in these
repositories, mainly those constructed by the aggregation or integration of distinct data
sources. The problem of detecting and removing duplicate entries in a repository is generally
known as Record Deduplication.

2.2 Objectives
Record Deduplication is the task of identifying, in a data repository, records that refer to the
same real-world entity or object in spite of misspelled words, typos, different writing styles
or even different schema representations or data types.
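
As a small illustration of this task, the sketch below compares pairs of records field by field and averages the scores. It is only an example under stated assumptions: the records and field names are invented, and the standard-library `difflib.SequenceMatcher` ratio stands in for whatever similarity function a real system would use.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2):
    """Average field similarity over the attributes both records share."""
    shared = r1.keys() & r2.keys()
    return sum(field_similarity(r1[k], r2[k]) for k in shared) / len(shared)

# Hypothetical records: rec_a and rec_b describe the same person with a typo
# and a different writing style, rec_c describes someone else.
rec_a = {"name": "Jonathan Smith", "city": "New York"}
rec_b = {"name": "Jonathon Smith", "city": "new york"}
rec_c = {"name": "Maria Garcia", "city": "Lisbon"}

same_entity_score = record_similarity(rec_a, rec_b)
different_entity_score = record_similarity(rec_a, rec_c)
```

The pair that refers to the same real-world entity scores much higher than the unrelated pair, which is the signal a deduplication function has to turn into a match or non-match decision.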

Chapter 3
Literature Survey

Removing duplicated data using Genetic Programming is a vast and interesting topic in
the field of Data Mining. Today, this problem arises mainly when data are collected from
many different sources using different information description styles and metadata standards.
Another common source of replicas is data repositories created from OCR documents.
These situations can lead to inconsistencies that may affect many systems, such as those that
depend on searching and mining tasks.
To address these problems, it is necessary to design a de-duplication function that combines
the information available in the data repositories in order to identify whether a pair of record
entries refers to the same real-world entity.
In the realm of bibliographic citations, for instance, this problem was extensively discussed
by Lawrence et al. They propose a number of algorithms for matching citations from
different sources based on edit distance, word matching, phrase matching, and subfield
extraction.
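
Of the evidence types listed above, edit distance is the simplest to show in code. The sketch below is the standard Levenshtein dynamic-programming formulation, given only as an illustration; it is not claimed to be the exact variant used by Lawrence et al.

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

typo_distance = edit_distance("citation", "citaton")
```

A small distance relative to the string length suggests that two citation strings differ only by typos.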
As more strategies for extracting disparate pieces of evidence become available, many works
have proposed new distinct approaches to combine and use them. Elmagarmid et al. classify
these approaches into the following two categories:
1) Ad-Hoc or Domain Knowledge Approaches: This category includes approaches that usually
depend on specific domain knowledge or specific string distance metrics. Techniques that
make use of declarative languages can also be classified in this category.
2) Training-based Approaches: This category includes all approaches that depend on some
sort of training, either supervised or semi-supervised, in order to identify the replicas.
Probabilistic and machine learning approaches fall into this category.
Next, we briefly comment on some works based on these two approaches (domain knowledge
and training-based), particularly those that exploit the domain knowledge and those that are
based on probabilistic and machine learning techniques, which are the ones more related to
our work.
Active Atlas is a system whose main goal is to learn rules for mapping records from two
distinct files in order to establish relationships among them. During the learning phase, the
mapping rule and the transformation weights are defined. The process of combining the
transformation weights is executed using decision trees. This system differs from the others
in the sense that it tries to reduce the amount of necessary training, relying on user-provided
information about the most relevant cases for training.
Before Marlin, this system was the state-of-the-art solution for the problem.
Another approach, distinct from the previous ones, generates individual rankings for each
field based on generated similarity scores.
The distance between these rankings is calculated by using the well-known Spearman's
footrule metric, which is minimized by a modified version of the Hungarian Algorithm
specifically tailored to this problem by the authors. Then, a merge algorithm based on a score
scheme is applied to the resulting rankings. At the end of this process, the top records in this
global ranking are considered to be the most similar to the input record. Notice that this
approach requires no training. Unfortunately, the experiments conducted do not evaluate the
quality of the global ranking with respect to the record matching effectiveness.
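
The footrule metric mentioned above can be sketched in a few lines. This illustration assumes that both rankings contain exactly the same items, and it covers only the distance itself, not the Hungarian-algorithm minimization or the merge step. The record identifiers are hypothetical.

```python
def footrule_distance(ranking_a, ranking_b):
    """Spearman's footrule: sum over all items of the absolute difference
    between the item's position in the two rankings."""
    position_in_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - position_in_b[item]) for i, item in enumerate(ranking_a))

# Hypothetical per-field rankings of three candidate records.
by_title = ["r1", "r2", "r3"]
by_author = ["r2", "r1", "r3"]
distance = footrule_distance(by_title, by_author)
```

Identical rankings give a distance of zero, and the distance grows as items move further apart between the two rankings.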
In this project, we propose a GP-based approach to improve results produced by the Fellegi
and Sunter method. Particularly, we use GP to balance the weight vectors produced by that
statistical method, in order to generate a better evidence combination than the simple
summation used by it. In comparison with our previous results, this paper presents a more
general and improved GP-based approach for de-duplication, which is able to automatically
generate effective de-duplication functions even when a suitable similarity function for each
record attribute is not provided in advance. In addition, it also adapts the suggested functions
to changes on the replica identification boundary values used to classify a pair of records as
replicas or not. These two characteristics are extremely important since they free the user
from the burden of having to select the similarity function to use with each attribute required
for the de-duplication task and tune the replica identification boundary accordingly.
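
The difference between simple summation and a richer evidence combination can be illustrated as follows. Everything in this sketch is an assumption made for illustration: the attribute names, the weight values, the boundary, and in particular `example_combination`, which is a hand-written stand-in and not a function produced by an actual GP run.

```python
def summation(weights):
    """Baseline combination: add up the per-attribute evidence weights."""
    return sum(weights.values())

def example_combination(weights):
    """Hypothetical non-linear combination, standing in for an evolved function."""
    return weights["title"] * weights["author"] + 0.5 * weights["year"]

def is_replica(score, boundary):
    """Classify a record pair as a replica when its score reaches the boundary."""
    return score >= boundary

# Hypothetical evidence weights for one pair of records.
pair = {"title": 0.9, "author": 0.8, "year": 1.0}

baseline_score = summation(pair)
combined_score = example_combination(pair)
decision = is_replica(combined_score, boundary=1.0)
```

Because the two functions produce scores on different scales, the replica identification boundary has to be chosen together with the combination function, which is why adapting to boundary changes matters.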
Duplicate record detection is the process of identifying different or multiple records that
refer to one unique real-world entity or object. Typically, the process of duplicate detection is
preceded by a data preparation stage, during which data entries are stored in a uniform
manner in the database, resolving (at least partially) the structural heterogeneity problem. The
data preparation stage includes a parsing, a data transformation, and a standardization step.
The approaches that deal with data preparation are also described using the term
ETL (Extraction, Transformation, Loading). These steps improve the quality of the in-flow
data and make the data comparable and more usable. While data preparation is not the focus
of this survey, for completeness we describe briefly the tasks performed in that stage. A
comprehensive collection of papers related to various data transformation approaches is
available in the literature. Parsing is the first critical component in the data preparation
stage. Parsing locates, identifies and isolates individual data elements in the source.
Parsing makes it easier to
correct, standardize, and match data because it allows the comparison of individual
components, rather than of long complex strings of data. For example, the appropriate parsing
of name and address components into consistent packets of information is a crucial part in the
data cleaning process. Multiple parsing methods have been proposed recently in the literature
and the area continues to be an active field of research. Data transformation refers to simple
conversions that can be applied to the data in order for them to conform to the data types of
their corresponding domains. In other words, this type of conversion focuses on manipulating
one field at a time, without taking into account the values in the related field. The most
common form of a simple transformation is the conversion of a data element from one data
type to another. Such a data type conversion is usually required when a legacy or parent
application stored data in a data type that makes sense within the context of the original
application, but not in a newly developed or subsequent system. Renaming of a field from
one name to another is considered data transformation as well. Encoded values in operational
systems and in external data are another problem that is addressed at this stage. These values
should be converted to their decoded equivalents, so records from different sources can be
compared in a uniform manner. Range checking is yet another kind of data transformation
which involves examining data in a field to ensure that it falls within the expected range,
usually a numeric or date range. Lastly, dependency checking is slightly more involved since
it requires comparing the value in a particular field to the values in another field, to ensure a
minimal level of consistency in the data.
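
The simple transformations described above (type conversion, decoding, range checking and dependency checking) can be sketched as small functions. The field names, the codebook and the year range used here are hypothetical and serve only as an example.

```python
from datetime import date

def decode(value, codebook):
    """Replace an encoded value with its decoded equivalent, if one is known."""
    return codebook.get(value, value)

def in_range(value, low, high):
    """Range check: the value must fall within the expected range."""
    return low <= value <= high

def dependency_ok(record):
    """Dependency check (hypothetical rule): an end date may not precede
    the start date stored in a related field."""
    return record["start"] <= record["end"]

# Hypothetical raw record as it might arrive from a legacy source.
raw = {"year": "1999", "gender": "M",
       "start": date(2001, 5, 1), "end": date(2003, 1, 15)}

year = int(raw["year"])                                  # data type conversion
gender = decode(raw["gender"], {"M": "male", "F": "female"})
checks_pass = in_range(year, 1900, 2100) and dependency_ok(raw)
```

Each function looks at one field at a time, except the dependency check, which compares values across related fields.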
Data standardization refers to the process of standardizing the information represented in
certain fields to a specific content format. This is used for information that can be stored in many
different ways in various data sources and must be converted to a uniform representation
before the duplicate detection process starts. Without standardization, many duplicate entries
could erroneously be designated as non-duplicates, based on the fact that common identifying
information cannot be compared. Typically, when operational applications are designed and
constructed, there is very little uniform handling of date and time formats across applications.
Data standardization is a rather inexpensive step that can lead to fast identification of
duplicates. For example, if the only difference between two records is the differently
recorded address (44 West Fourth Street vs. 44 W4th St.), then the data standardization step
would make the two records identical, alleviating the need for more expensive approximate
matching approaches that we describe in the later sections.
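
The address example can be reproduced with a very small standardization routine. This is a sketch under strong assumptions: the abbreviation table is hypothetical and covers only the words in this example, whereas a real system would need a much larger, domain-specific table.

```python
import re

# Hypothetical abbreviation table, limited to the words in the example.
ABBREVIATIONS = {"west": "w", "fourth": "4th", "street": "st"}

def standardize_address(text):
    """Map an address string to a uniform lowercase, abbreviated form."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)             # drop punctuation such as "."
    text = re.sub(r"(?<=[a-z])(?=\d)", " ", text)   # split fused tokens: "w4th" -> "w 4th"
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)

first = standardize_address("44 West Fourth Street")
second = standardize_address("44 W4th St.")
```

After standardization both strings have the same form, so an exact comparison is enough to flag the pair without any approximate matching.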

3.1 Existing System


The main problems with the existing systems are:

Data Complexity:
Because the number of online databases grows day by day, a huge amount of data
exists on the WWW. Given this data complexity, previous approaches (such as the
vision-based approach, page-level extraction, TSIMMIS, and WebOQL) do not work
effectively because they are inefficient and time consuming.

Web page programming language and version Dependency:


All previous approaches to Web page extraction were HTML dependent. Take the
example of a website such as that of Mumbai University. Two years ago, an
administrator maintained the website using HTML version 3.0. After one year, a new
administrator was appointed who maintained it using HTML version 4.0, and at
present a different administrator maintains it using some other Web page
programming language such as XHTML or XML. Will data be extracted effectively?
Of course not. This is the problem of dependency on the Web page programming
language and its version.

Scripting Dependency:
Most previous work has not considered scripting and styling such as JavaScript,
VBScript, and CSS, so extraction may fail on pages that rely on them.

Problem with Record Duplication:


One of the most important issues is that when data are uploaded from different
locations, data duplication may occur. If we consider a digital library website, a
site such as Google, Yahoo, or Microsoft, or any other website, there is a great deal
of unwanted data. The same data repeats many times, so a lot of space is wasted. As
a result, processing time is very high when a query is submitted to the database.

Because of the above problems, the main disadvantages are higher operational and
computational costs, the display of useless data, and consequently long processing times.

3.2 Proposed System


Genetic Programming Basic Concepts:
Evolutionary programming is based on ideas inspired by the naturally observed process that
influences virtually all living beings: natural selection. Genetic Programming (GP) is one of
the best known evolutionary programming techniques.
It can be seen as an adaptive heuristic whose basic ideas come from the properties of the
genetic operations and natural selection system. It is a direct evolution of programs or
algorithms used for the purpose of inductive learning (supervised learning), initially applied
to optimization problems. GP, as well as other evolutionary techniques, is also known for its
capability of working with multi-objective problems that are normally modeled as
environment restrictions during the evolutionary process.
GP and other evolutionary approaches are also widely known for their good performance in
searching over very large, possibly infinite, search spaces, where the optimal solution in many
cases is not known, usually providing near-optimal answers. As stated by Koza, the fact
that the GP algorithm operates on a population of individuals, rather than on a single point in
the search space of the problem, is an essential aspect of the algorithm. This can be
explained since the population serves as the pool of the probably valuable genetic material,
which is used to create new solutions with probably valuable new combinations of features.
The main aspect that distinguishes GP from other evolutionary techniques is that it represents
the concepts and the interpretation of a problem as a computer program and even the data are
viewed and manipulated in this way. This special characteristic enables GP to model any
other machine learning representation.
Another advantage of GP over other evolutionary techniques is its applicability to symbolic
regression problems, since the representation structures are variable. While GP
does not mimic nature as closely as do genetic algorithms, it does offer the opportunity to
directly evolve programs of unusual complexity, without having to define the structure or size
of the program in advance. This means that GP is able to discover the independent variables
and their relationships with each other and with any dependent variable. Thus, GP can find
the correct functional form that fits the data and discover the appropriate coefficients.
Solving this kind of problem is noticeably more complex than solving linear regression or
polynomial regression problems.

Genetic Operations
Usually, GP evolves a population of length-free data structures, also called individuals, each
one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions, as illustrated in Fig. 3.1.

Fig. 3.1: Genetic Trees


When using this tree representation in a GP-based method, a set of terminals and functions
should be defined.
Terminals are inputs, constants or zero-argument nodes that terminate a branch of a tree.
They are also called tree leaves. The function set is the collection of operators, statements,
and basic or user-defined functions that can be used by the GP evolutionary process to
manipulate the terminal values. These functions are placed in the internal nodes of the tree, as
illustrated in Fig. 3.1. During the evolutionary process, the individuals are handled and
modified by genetic operations such as reproduction, crossover, and mutation, in an iterative
way that is expected to spawn better individuals (solutions to the proposed problem) in the
subsequent generations. Reproduction is the operation that copies individuals without
modifying them. Usually, this operator is used to implement an elitist strategy that is adopted
to keep the genetic code of the fittest individuals across the changes in the generations. If a
good individual is found in earlier generations, it will not be lost during the evolutionary
process.
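
The tree representation described above can be sketched as nested tuples. This is only an illustration under assumed conventions: internal nodes are written as `(function, left, right)`, terminals are either variable names or numeric constants, and the function set is limited to three arithmetic operators.

```python
import operator

# Assumed function set for the internal nodes of the tree.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, terminals):
    """Recursively evaluate an arithmetic tree.

    Internal nodes are tuples (function, left, right); leaves are either
    variable names looked up in `terminals` or numeric constants."""
    if isinstance(tree, tuple):
        function, left, right = tree
        return FUNCTIONS[function](evaluate(left, terminals),
                                   evaluate(right, terminals))
    if isinstance(tree, str):
        return terminals[tree]
    return tree

# The tree for (a * b) + 2, evaluated with a = 3 and b = 4.
tree = ("+", ("*", "a", "b"), 2)
value = evaluate(tree, {"a": 3, "b": 4})   # 14
```

In a deduplication setting the terminals would carry the evidence extracted from a pair of records, and the value at the root would be the score of that pair.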
The crossover operation allows genetic content (e.g., sub trees) exchange between two
parents, in a process that can generate two or more children. In a GP evolutionary process,
two parent trees are selected according to a matching (or pairing) policy and, then, a random
sub tree is selected in each parent. Child trees are the result from the swap of the selected sub
trees between the parents, as illustrated in Fig. 3.2.


Fig. 3.2: Crossover Operation
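
The subtree swap performed by crossover can be sketched as follows. The sketch assumes trees written as nested tuples `(function, left, right)` with strings or numbers as leaves, and picks one node uniformly at random in each parent. The helper names are hypothetical.

```python
import random

def node_paths(tree, prefix=()):
    """Yield the path (tuple of child indexes) to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, prefix + (i,))

def subtree_at(tree, path):
    """Return the subtree reached by following the path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new_subtree):
    """Return a copy of the tree with the subtree at the path replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b, rng=random):
    """Swap one randomly selected subtree between two parents."""
    path_a = rng.choice(list(node_paths(parent_a)))
    path_b = rng.choice(list(node_paths(parent_b)))
    child_a = replace_at(parent_a, path_a, subtree_at(parent_b, path_b))
    child_b = replace_at(parent_b, path_b, subtree_at(parent_a, path_a))
    return child_a, child_b

parent_a = ("+", "x", ("*", "y", 2))
parent_b = ("-", ("+", "z", 1), "w")
child_a, child_b = crossover(parent_a, parent_b, random.Random(1))
```

Because tuples are immutable, the parents are left untouched and the two children together contain exactly the nodes of the two parents.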


Finally, the mutation operation has the role of keeping a minimum diversity level of
individuals in the population, thus avoiding premature convergence. Every solution tree
resulting from the crossover operation has an equal chance of suffering a mutation process. In
a GP tree representation, a random node is selected and the corresponding sub tree is replaced
by a new randomly created sub tree, as illustrated in Fig. 3.3.

Fig. 3.3: Mutation Operation


All operations for node replacements and insertions performed by mutation and crossover
are executed using equal (and constant) probabilities. This way all nodes have the same
probability of being chosen in order to guarantee the diversity of the individuals within the
genetic pool.
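
Mutation with an equal selection probability for every node can be sketched as follows. The sketch assumes trees written as nested tuples `(function, left, right)`. The function set, the terminal set, the 0.3 probability of stopping at a leaf and the depth limit are all hypothetical choices for illustration.

```python
import random

FUNCTIONS = ["+", "-", "*"]   # assumed function set
TERMINALS = ["a", "b", 1]     # assumed terminal set

def random_tree(depth, rng):
    """Grow a random subtree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS),
            random_tree(depth - 1, rng),
            random_tree(depth - 1, rng))

def node_paths(tree, prefix=()):
    """Yield the path (tuple of child indexes) to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from node_paths(child, prefix + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of the tree with the subtree at the path replaced."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def mutate(tree, rng=random, max_depth=2):
    """Replace one node, chosen uniformly at random, with a new random subtree."""
    path = rng.choice(list(node_paths(tree)))   # every node is equally likely
    return replace_at(tree, path, random_tree(max_depth, rng))

original = ("+", ("*", "a", "b"), 1)
mutant = mutate(original, random.Random(7))
```

Choosing the path uniformly from the list of all nodes is what gives every node the same probability of being replaced.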


Chapter 4
Software Requirements Specification


4.1 Introduction
A Software Requirements Specification (SRS) is a requirements specification for a system.
It is a complete description of the behavior of a system to be developed. It includes a set of
use cases that describe all the interactions the users will have with the software. In addition to
use cases, the SRS also contains non-functional requirements. Non-functional requirements
are requirements which impose constraints on the design or implementation (such as
performance engineering requirements, quality standards, or design constraints).
System requirements specification: A structured collection of information that embodies
the requirements of a system. A business analyst, sometimes titled system analyst, is
responsible for analyzing the business needs of their clients and stakeholders to help identify
business problems and propose solutions. Within the systems development life cycle domain,
the business analyst typically performs a liaison function between the business side of an
enterprise and the information technology department or external service providers.
Projects are subject to three
sorts of requirements:

Business requirements describe in business terms what must be delivered or
accomplished to provide value.

Product requirements describe properties of a system or product (which could be
one of several ways to accomplish a set of business requirements).

Process requirements describe activities performed by the developing organization.
For instance, process requirements could specify specific methodologies that must be
followed, and constraints that the organization must obey.
Product and process requirements are closely linked. Process requirements often specify the
activities that will be performed to satisfy a product requirement. For example, a maximum
development cost requirement (a process requirement) may be imposed to help achieve a
maximum sales price requirement (a product requirement); a requirement that the product be
maintainable (a product requirement) often is addressed by imposing requirements to follow
particular development styles.


4.1.1 Project Scope


This project is scoped mainly to Web data extraction. The GP-based approach is able to
automatically find effective de-duplication functions, even when the most suitable similarity
function for each record attribute is not known in advance. This is extremely useful for the
non-specialized user, who does not have to worry about selecting these functions for the
de-duplication task. In addition, we show that our approach is also able to adapt the suggested
de-duplication function to changes in the replica identification boundaries used to classify a
pair of records as a match or not.
4.1.2 User Classes and Characteristics
There are two main types of users of this product: Administrator and User. The
Administrator has all privileges, whereas the User has limited privileges. The Administrator
can upload documents, whereas the User can search the data and obtain de-duplicated results.
4.1.3 Operating Environment
This product will be installed on a Windows-based machine, primarily a server machine.
The product is web-based and will be hosted by a web server. It can be viewed in any web
browser.
4.1.4 Design and Implementation Constraints

Design Constraint
In systems design the design functions and operations are described in detail, including
screen layouts, business rules, process diagrams and other documentation. The output of this
stage will describe the new system as a collection of modules or subsystems.


The design stage takes as its initial input the requirements identified in the approved
requirements document. For each requirement, a set of one or more design elements will be
produced as a result of interviews, workshops, and/or prototype efforts.
Design elements describe the desired software features in detail, and generally include
functional hierarchy diagrams, screen layout diagrams, tables of business rules, business
process diagrams, pseudo code, and a complete entity-relationship diagram with a full data
dictionary. These design elements are intended to describe the software in sufficient detail
that skilled programmers may develop the software with minimal additional design input.
Implementation Constraint
Modular and subsystem programming code will be accomplished during this stage. Unit
testing and module testing are done in this stage by the developers. This stage is intermingled
with the next in that individual modules will need testing before integration to the main
project.
4.1.5 Assumptions and Dependencies
This project makes no assumptions and has no dependencies, because we are developing a
dynamic data extractor that is independent of the Web page programming language.

4.2 System Features


4.2.1 System Feature 1
These are the functional requirements:
1. Upload data to different databases.
2. Search data in different databases.
3. View duplicate data.
4. Apply the GP-based approach and retrieve the data without duplicates.
5. Calculate the search time.
6. View user information.
7. Change the password.


4.2.2 System Feature 2


Some other features:
1. Upload a number of citations for each topic
2. Enter the topic name
3. Display the number of citations from a number of sites
4. Compare citations
5. Remove duplicate citations
6. Rate the linked citations
7. Generate the reproduction tree using the rating procedure
8. Rearrange the subtrees
9. Return the final tree as a complete tree

4.3 External Features


4.3.1 User Interfaces
In this project, the user interface will be built with HTML and CSS.
4.3.2 Software Interfaces
Operating System: Windows
Client-side Scripting: JavaScript
Programming Language: Java
IDE/Workbench: MyEclipse 6.0
Database: Oracle 10g

4.3.3 Hardware Interfaces
Processor: Pentium IV
Hard Disk: 40GB
RAM: 512MB or more


4.3.4 Communication Interfaces


This Genetic Programming Approach to Record Deduplication System will communicate
over the intranet using TCP/IP, with FTP used for file services such as uploads and
downloads.

4.4 Non Functional Requirements


4.4.1 Performance Requirements
This system is being developed in a high-level language with modern front-end and
back-end technologies, so it should respond to the end user on the client system in very
little time.

4.4.2 Safety Requirements


The system remains safe for extraction if something goes wrong, such as a power failure.
4.4.3 Security Requirements
1. We are going to develop a secured database. There are different categories of users, namely
Administrator and Restricted users, who can view either all or some specific information
from the database.
2. Access rights are decided according to the category of user. If the user is an
administrator, he or she can modify and append data, among other operations. All other
users only have the right to retrieve information from the database.
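
The access rule in point 2 can be sketched as a simple lookup from user category to permitted actions. The role names and action names below are hypothetical labels chosen for illustration, not identifiers taken from the actual system.

```python
# Hypothetical mapping from user category to the actions it may perform.
PERMISSIONS = {
    "administrator": {"retrieve", "modify", "append"},
    "restricted": {"retrieve"},
}

def is_allowed(role, action):
    """Return True when the given user category may perform the action.
    Unknown categories are granted no rights."""
    return action in PERMISSIONS.get(role, set())

admin_can_modify = is_allowed("administrator", "modify")
restricted_can_modify = is_allowed("restricted", "modify")
```

Keeping the rules in one table means that a new user category can be added without changing the checking code.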
4.4.4 Software Quality Attributes
Software quality measurement quantifies to what extent a software system rates along
each of these four dimensions:

Reliability

Efficiency

Maintainability

Size

Reliability
The system is reliable because of the qualities it inherits from the chosen platform, Java.
Efficiency
The system is efficient in returning the best results when searching complex data.
Maintainability
It can be maintained by semi-skilled personnel who have knowledge of Java and Oracle.

Size
The maximum size of this project ranges from 2 GB to 5 GB.

4.5 Risk Assessment


These are the five steps for Risk Assessment:
1. Identify the hazards
2. Decide who might be harmed and how
3. Evaluate the risks and decide on precautions
4. Record your findings and implement them
5. Review your assessment and update if necessary


Chapter 5
System Design


5.1 System Architecture

Fig. 5.1.A: System Architecture


Fig. 5.1.B: System Architecture


Usually, GP evolves a population of length-free data structures, also called individuals, each
one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions, as illustrated in Fig. 5.2. When using this tree representation in a
GP-based method, a set of terminals and functions should be defined. Terminals are inputs,
constants or zero-argument nodes that terminate a branch of a tree. They are also called tree
leaves. The function set is the collection of operators, statements, and basic or user-defined
functions that can be used by the GP evolutionary process to manipulate the terminal values.
These functions are placed in the internal nodes of the tree, as illustrated in Fig. 5.2. During
the evolutionary process, the individuals are handled and modified by genetic operations such
as reproduction, crossover, and mutation, in an iterative way that is expected to spawn better
individuals (solutions to the proposed problem) in the subsequent generations.

Fig. 5.2: Genetic Trees

Reproduction is the operation that copies individuals without modifying them. Usually, this
operator is used to implement an elitist strategy that is adopted to keep the genetic code of the
fittest individuals across the changes in the generations. If a good individual is found in
earlier generations, it will not be lost during the evolutionary process.

Fig. 5.3: Reproduction
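
The elitist use of reproduction can be sketched as copying the fittest individuals unchanged into the next generation. The individuals, the `score` field and the number of elites below are hypothetical, and the sketch assumes that a higher fitness value is better.

```python
import copy

def reproduce_elite(population, fitness, n_elite):
    """Copy the n_elite fittest individuals, unmodified, into the next generation."""
    ranked = sorted(population, key=fitness, reverse=True)
    return [copy.deepcopy(individual) for individual in ranked[:n_elite]]

# Hypothetical individuals carrying a precomputed fitness score.
population = [
    {"tree": ("+", "a", "b"), "score": 0.62},
    {"tree": ("*", "a", 2), "score": 0.88},
    {"tree": ("-", "b", 1), "score": 0.35},
]

next_generation = reproduce_elite(population, lambda ind: ind["score"], 1)
```

The copies are independent of the originals, so later crossover or mutation on the old population cannot damage the preserved individuals.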


The crossover operation allows genetic content (e.g. sub trees) exchange between two
parents, in a process that can generate two or more children. In a GP evolutionary process,
two parent trees are selected according to a matching (or pairing) policy and, then, a random
sub tree is selected in each parent. Child trees are the result from the swap of the selected sub
trees between the parents, as illustrated in Fig. 5.4.

Fig. 5.4: Crossover

Finally, the mutation operation has the role of keeping a minimum diversity level of
individuals in the population, thus avoiding premature convergence. Every solution tree
resulting from the crossover operation has an equal chance of suffering a mutation process. In
a GP tree representation, a random node is selected and the corresponding sub tree is replaced
by a new randomly created sub tree, as illustrated in Fig. 5.5.

5.2 Analysis Model


Here, we are going to describe two types of Analysis Model:
1. Data Flow Diagram
2. E-R Diagram

5.2.1 Data Flow Diagram


A Data Flow Diagram (DFD) is a graphical tool used to describe and analyze the movement of
data through a system, manual or automated, including the processes, stores of data, and
delays in the system. Data Flow Diagrams are the central tool and the basis from which other
components are developed. The transformation of data from input to output, through
processes, may be described logically and independently of the physical components
associated with the system. The DFD is also known as a data flow graph or a bubble chart.
DFDs are the model of the proposed system. They should clearly show the requirements on
which the new system should be built. Later, during design activity, this is taken as the basis
for drawing the system's structure charts. The basic notation used to create DFDs is as
follows:
1. Dataflow: Data move in a specific direction from an origin to a destination.

2. Process: People, procedures, or devices that use or produce (transform) data. The
physical component is not identified.

3. Source: External sources or destinations of data, which may be people, programs,
organizations or other entities.

4. Data Store: Here data are stored or referenced by a process in the system.

5. Rhombus: A decision.


CONTEXT LEVEL DIAGRAM


Context Level0 DFD

Fig. 5.5: Context Level0 Diagram


Context level1 Diagram:
Login DFD

Fig. 5.6: Context Level1 Diagram



Context level2 Diagram:

Fig. 5.7: Context Level2 Diagram


5.2.2 E-R Diagram


In software engineering, an entity-relationship model (ERM) is an abstract and conceptual
representation of data. Entity-relationship modeling is a database modeling method used to
produce a type of conceptual schema or semantic data model of a system, often a relational
database, and its requirements in a top-down fashion. Diagrams created by this process are
called entity-relationship diagrams, ER diagrams, or ERDs. The definitive reference for
entity-relationship modeling is Peter Chen's 1976 paper. However, variants of the idea
existed previously and have been devised subsequently.

An entity may be defined as a thing which is recognized as being capable of an independent
existence and which can be uniquely identified. An entity is an abstraction from the
complexities of some domain. An entity may be a physical object such as a house or a car, an
event such as a house sale or a car service, or a concept such as a customer transaction or
order. Although the term entity is the one most commonly used, following Chen we should
really distinguish between an entity and an entity-type. An entity-type is a category. An
entity, strictly speaking, is an instance of a given entity-type. Entities can be thought of as
nouns. Examples: a computer, an employee, a song, a mathematical theorem.

A relationship captures how two or more entities are related to one another. Relationships
can be thought of as verbs, linking two or more nouns. Examples: a relationship between a
company and a computer, a relationship between an employee and a department, a relationship
between an artist and a song, a proved relationship between a mathematician and a theorem.
The model's linguistic aspect described above is utilized in the declarative database query
language ERROL, which mimics natural language constructs.

Entities and relationships can both have attributes. Examples: an employee entity might have
a Social Security Number (SSN) attribute; the proved relationship may have a date attribute.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying
attributes, which is called the entity's primary key.

Entity-relationship diagrams don't show single entities or single instances of relations.
Rather, they show entity sets and relationship sets. Example: a particular song is an entity.
The collection of all songs in a database is an entity set. The eaten relationship between a
child and her lunch is a single relationship. The set of all such child-lunch relationships in
a database is a relationship set. In other words, a relationship set corresponds to a relation
in mathematics, while a relationship corresponds to a member of the relation. Certain
cardinality constraints on relationship sets may be indicated as well.


Fig. 5.8: E-R Diagram


Chapter 6
UML Diagrams


Introduction to UML
The Unified Modeling Language (UML) is a standard language for writing software
blueprints. The UML may be used to visualize, specify, construct, and document the artifacts
of a software-intensive system.
The goal of UML is to provide a standard notation that can be used by all object-oriented
methods and to select and integrate the best elements. UML itself does not prescribe or
advise on how to use that notation in a software development process or as part of an object
design methodology. The UML is more than just a bunch of graphical symbols. Rather, behind
each symbol in the UML notation is a well-defined semantics.
System development focuses on three different models of the system:

Functional model

Object model

Dynamic model

The functional model in UML is represented with use case diagrams, describing the
functionality of the system from the user's point of view.
The object model in UML is represented with class diagrams, describing the structure of the
system in terms of objects, attributes, associations, and operations.
The dynamic model in UML is represented with sequence diagrams, state chart diagrams, and
activity diagrams, describing the internal behavior of the system.


6.1 Use Case Diagram


A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases.
Relationships:
Association
An association is a connection between an actor and a use case. An association indicates that
an actor can carry out a use case. Several actors attached to one use case mean that each actor
can carry out the use case on his or her own, not that the actors carry out the use case together.

Include Relationships
An include relationship is a relationship between two use cases.

It indicates that the use case to which the arrow points is included in the use case on the other
side of the arrow. This makes it possible to reuse a use case in another use case.


[Use case diagram. Actors: admin, User. Use cases: Upload no of Citation for each topic; Enter the topic name; Display the no of citation from no of sites; Comparison of citations; Remove the duplicate Citation; Linkage Citation are rating; Using rating procedure generate reproduction tree; Rearrange the subtrees; Gives the final tree as a complete tree.]

Fig. 6.1: Use Case Diagram


6.2 Class Diagram


Class diagrams are widely used to describe the types of objects in a system and their
relationships. Class diagrams model class structure and contents using design elements such
as classes, packages, and objects. Class diagrams describe three different perspectives when
designing a system: conceptual, specification, and implementation. These perspectives
become evident as the diagram is created and help solidify the design.

[Class diagram. Classes: User Registration, Login, Enter the Keyword, View unique Results, Remove Duplicate Citation, View Citation Results, View SimilarCitation, Graph. Attributes shown: Enter Personalinfo, Enter LoginID, Enter Password, Extract the Results based on Single Attribute, Display Unique Citation Info, Display Citation Info, Extract the Results, Display SimilarCitation Info, Display Unique Citation Info with Graph. Operations shown: Registration Successfully(), Login Successfully(), Display the Maximization results(), Display UniqueCitation(), Display RemoveDuplicateCitation(), Display theresults(), Display SimilarCitationData(), Display GraphData().]

Fig. 6.2: Class Diagram


6.3 Activity Diagram


Activity diagrams describe the workflow behavior of a system. Activity diagrams are similar
to state diagrams because activities are the state of doing something. The diagrams describe
the state of activities by showing the sequence of activities performed. Activity diagrams can
show activities that are conditional or parallel.

[Activity diagram. Activities shown: User; User Registration; Login (Login fail, Login success); User Home; Enter the topic name; Display the no of citation; Remove the duplicate Citation; Rearrange the subtrees; Gives the final tree.]

Fig. 6.3: Activity Diagram

6.4 Sequence Diagram


Sequence diagrams belong to a group of UML diagrams called Interaction Diagrams.
Sequence diagrams describe how objects interact over the course of time through an
exchange of messages. A single sequence diagram often represents the flow of events for a
single use case.
Instance:
An instance of a class shows a sample configuration of an object. On the sequence diagram,
each instance has a lifeline box underneath it showing its existence over a period of time.
Actor: An actor is anything outside the system that interacts with the system. It could be a
user or another system.
Message: The message indicates communication between objects. The order of messages
from top to bottom on your diagram should be the order in which the messages occur.

[Sequence diagram. Lifelines: User, Login, Search Topic, Display the citation, Comparison of citations, Linkage Citation are rating, generate reproduction tree, Rearrange the subtrees, complete tree. Messages, top to bottom: Login; Enter the topic name; Display the no of citation from no of sites; Comparison of citations; Linkage Citation are rating; Using rating procedure generate reproduction tree; Rearrange the subtrees; Gives the final tree as a complete tree.]

Fig. 6.4: Sequence Diagram

6.5 Communication Diagram


Communication diagrams belong to a group of UML diagrams called Interaction Diagrams.
Communication diagrams, like Sequence Diagrams, show how objects interact over the
course of time. However, instead of showing the sequence of events by the layout on the
diagram, communication diagrams show the sequence by numbering the messages on the
diagram. This makes it easier to show how the objects are linked together, but harder to see
the sequence at a glance.
Instance:
An instance of a class shows a sample configuration of an object. On the sequence diagram,
each instance has a lifeline box underneath it showing its existence over a period of time.
Lollipop Interface:
A lollipop interface is a shorthand syntax for an interface. It shows the interface name
without displaying the operations.
Message:
The message indicates communication between objects. The order of messages from top to
bottom on your diagram should be the order in which the messages occur.

[Communication diagram. Objects: User, Login, Search Topic, Display the citation, Comparison of citations, Linkage Citation are rating, generate reproduction tree, Rearrange the subtrees, complete tree. Numbered messages: 1: Login; 2: Enter the topic name; 3: Display the no of citation from no of sites; 4: Comparison of citations; 5: Linkage Citation are rating; 6: Using rating procedure generate reproduction tree; 7: Rearrange the subtrees; 8: Gives the final tree as a complete tree.]

Fig. 6.5: Communication Diagram

6.6 State Machine Diagram


A UML state chart is a notation for describing the sequence of states an object goes through in
response to external events. Objects have behavior and state. The state of an object depends
on its current activity or condition. A state chart diagram shows the possible states of the
object and the transitions that cause a change in state.
A state chart describes the dynamic behavior of an individual object as a number of states. A
state is a condition satisfied by the attributes of an object. Given a state, a transition
represents a future state the object can move to and the conditions associated with the change
of state.
A state is depicted by a rounded rectangle; a transition is depicted by open arrows connecting
two states. States are labeled with their names. A small solid black circle indicates the initial
state, and a circle surrounding a small solid circle indicates the final state.

[State machine diagram. States: User Registration; Login; Upload no of Citation for each topic; Enter the topic name; Display the no of citation from no of sites; Comparison of citations; Remove the duplicate Citation; Linkage Citation are rating; Using rating procedure generate reproduction tree; Rearrange the subtrees; Gives the final tree as a complete tree.]

Fig. 6.6: State Machine Diagram

6.7 Component Diagram

[Component diagram. Actors: Admin, User. Components: Upload the Topic Content; Display the no of citation from no of sites; Search the Topic; Remove the duplicate Citation; Using rating procedure generate reproduction tree; Rearrange the subtrees; Gives the final tree.]

Fig. 6.7: Component Diagram

6.8 Package Diagram

Fig. 6.8: Package Diagram


6.9 Deployment Diagram

[Deployment diagram. Nodes shown: System, Search Content, Upload Content, Display citation, complete tree, Remove ...]

Fig. 6.9: Deployment Diagram

6.10 Dependency Graph

Fig. 6.10: Dependency Graph



Chapter 7
Project Schedule & Estimate


7.1 Project Plan

The project planning step is the most critical step in the project management life cycle. The
reason is that it is only when we list all of the tasks in our project plan that we truly have an
idea of what it is going to take to deliver our project on time. So, to perform project planning
in a smart and efficient way, we need a well-defined and organized project plan to help us
do it. This chapter explains all the project planning features. It lists our tasks and covers the
schedules, project management approach, projected project budget, project timeline, etc.

Project planning is a discipline for stating how to complete a project within a certain time
frame, usually with defined stages, and with designated resources. Creating a project plan is
the first thing that needs to be done when undertaking any kind of project. At a minimum, a
project plan answers basic questions about the project:

Why? What is the problem or value proposition addressed by the project?

What? What is the work that will be performed on the project? What are the major
products/deliverables?

Who? Who will be involved and what will be their responsibilities within the
project? How will they be organized?

When? What is the project time line and when will particularly meaningful points,
referred to as milestones, be complete?

Often project planning is ignored in favour of getting on with the work. However, many
people fail to realize the value of a project plan in saving time and money and then face the
consequences later.


Table 7.1: Project Plan for Sem I


Sr. No.  Module Name                                 Start Date        End Date          Status
1        Requirement Analysis
         A] Selection of Problem Definition          1st July 2013     17th July 2013    Completed
         B] Literature Survey                        24th July 2013    25th July 2013    Completed
2        Planning Phase
         A] Deciding Scope                           1st August 2013   2nd August 2013   Completed
         B] Selection of Platform                    7th August 2013   12th August 2013  Completed
3        Designing Phase
         A] Module division and allocation of task   25th August 2013  30th August 2013  Completed
         B] User Interface                           15th Sept. 2013   23rd Sept. 2013   Completed
4        Modeling Phase
         A] Draw UML Diagrams                        26th Sept. 2013   3rd Sept. 2013    Completed
5        Preparation of partial report               4th Sept. 2013    8th Sept. 2013    Completed
6        Submission of partial report                21st Sept. 2013   12th Oct 2013     Submitted

Table 7.2: Project Plan for Sem II

Sr. No.  Module Name                                                       Start Date      End Date        Status
1        Studied about GPRD system & design our graphical user interface   13th Jan. 2014  18th Jan. 2014  Completed
2        Start coding of 1st module, i.e. graphical user interface         27th Jan. 2014  31st Jan. 2014  Completed
3        Module implementation for graphical user interface                03rd Feb. 2014  08th Feb. 2014  Completed
4        Test the module and remove some bugs                              10th Feb. 2014  15th Feb. 2014  Completed
5        Start coding of 2nd module, i.e. GUI and Testing                  03rd Mar. 2014  08th Mar. 2014  Completed
6        Final coding for 3rd module and test the modules                  10th Mar. 2014  15th Mar. 2014  Completed
7        All parts of project testing                                      17th Mar. 2014  22nd Mar. 2014  Completed
8        Preparation of final report                                       02nd Apr. 2014  12th Apr. 2014  Completed
9        Paper publication                                                 13th Apr. 2014  16th Apr. 2014  Completed
10       Submission of final project report                                21st Apr. 2014  07th Apr. 2014  Submitted


7.2 Project Estimation


The project estimation includes the following parameters:
1. Time: The total time for overall project completion, across the various phases of
development, is approximately eight months.
2. Effort: Since the characteristics of each project dictate the distribution of effort,
35% of the effort is spent on analysis and design and a similar amount on testing.
Coding takes about 30% of the effort.
3. Cost: The cost of the project is calculated in terms of the effort applied and the
resources used. The other parameters that account for cost estimation are:
Man/Month
Technology used
Benefits
Machine cost
COCOMO Model
The Basic COCOMO is a static, single-valued model that computes
software development effort (and cost) as a function of program size expressed in estimated
lines of code.
The COCOMO is a collection of three models: a Basic model that is applied early in the
project, an Intermediate model that is applied after requirements are specified, and an
Advanced model that is applied after design is complete. All three models take the form:
$E = a \cdot S^{b} \cdot EAF$
where
E is effort in person-months,
S is size measured in thousands of lines of code (KLOC),
and EAF is an effort adjustment factor (equal to 1 in the basic model).
The factors a and b depend on the development mode.
COCOMO applies to three classes of software projects:

Organic projects: relatively small, simple software projects in which small teams
with good application experience work to a set of less than rigid requirements.

Semi-detached projects: intermediate (in size and complexity) software projects
in which teams with mixed experience levels must meet a mix of rigid and less than
rigid requirements.

Embedded projects: software projects that must be developed within a set of
tight hardware, software, and operational constraints.

The basic COCOMO equations take the form

$E = a_b \, (KLOC)^{b_b}$
$D = c_b \, (E)^{d_b}$
$P = E / D$
where E is the effort applied in person-months, D is the development time in chronological
months, KLOC is the estimated number of delivered lines of code for the project
(expressed in thousands), and P is the number of people required. The coefficients $a_b$,
$b_b$, $c_b$, and $d_b$ are given in the following table.
Table 7.3: Basic COCOMO model
Software Project   a_b    b_b    c_b    d_b
Organic            2.4    1.05   2.5    0.38
Semi-detached      3.0    1.12   2.5    0.35
Embedded           3.6    1.20   2.5    0.32

Basic COCOMO is good for quick, early, rough order of magnitude estimates of software
costs, but it does not account for differences in hardware constraints, personnel quality and
experience, use of modern tools and techniques, and other project attributes known to have a
significant influence on software costs, which limits its accuracy.
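The basic COCOMO equations can be checked with a short Java sketch. It is only an illustration under stated assumptions: the class and method names (BasicCocomo, effort, duration, people) are hypothetical, the coefficients are the commonly published Basic COCOMO values (2.4, 3.0, and 3.6 for the three modes), and the inputs used in main (3 KLOC, EAF 1.05) are illustrative.

```java
/** Basic COCOMO sketch (hypothetical class; coefficients are the published basic-model values). */
final class BasicCocomo {

    /** Development modes with their coefficients (a_b, b_b, c_b, d_b). */
    enum Mode {
        ORGANIC(2.4, 1.05, 2.5, 0.38),
        SEMI_DETACHED(3.0, 1.12, 2.5, 0.35),
        EMBEDDED(3.6, 1.20, 2.5, 0.32);

        final double a, b, c, d;

        Mode(double a, double b, double c, double d) {
            this.a = a;
            this.b = b;
            this.c = c;
            this.d = d;
        }
    }

    /** Effort E in person-months: E = a * KLOC^b * EAF. */
    static double effort(Mode m, double kloc, double eaf) {
        return m.a * Math.pow(kloc, m.b) * eaf;
    }

    /** Development time D in months: D = c * E^d. */
    static double duration(Mode m, double effort) {
        return m.c * Math.pow(effort, m.d);
    }

    /** Average number of people: P = E / D. */
    static double people(double effort, double duration) {
        return effort / duration;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 3 KLOC, organic mode, EAF = 1.05.
        double e = effort(Mode.ORGANIC, 3.0, 1.05);
        double d = duration(Mode.ORGANIC, e);
        System.out.printf("E = %.2f person-months, D = %.2f months, P = %.2f people%n",
                e, d, people(e, d));
    }
}
```

Keeping the coefficients in an enum makes it easy to compare the three project classes for the same size estimate.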
Our project comes under the organic category.
For organic:
Effort estimation
$\text{Effort} = a \cdot \text{Size}^{b} \cdot EAF$
where
Size = 3 KLOC
a = 2.4
b = 1.05
n = 3 (number of persons)
EAF = 1.05 (adjustment factor)
Therefore,
$\text{Effort} = 2.4 \times 3^{1.05} \times 1.05$
$= 2.4 \times 3.169 \times 1.05$
$\approx 8$ person-months (approx.).
$\text{Development time} = c \cdot \text{Effort}^{d}$
where
c = 2.5
d = 0.38
Therefore,
$\text{Development time} = 2.5 \times 8^{0.38}$
$= 2.5 \times 2.204$
$= 5.51$
$\approx 5.5$ months (approx.).
Recommended number of people:
$N = E / D$
$= 8 / 5.5$
$\approx 1.45$, i.e. 2 people
Considering the salary of each person = Rs. 8,000/- per month.
Number of months to complete the project = 5
Total expenses on the salary of the people = Rs. 40,000/-
Hence the total cost of the project is Rs. 40,000/-


Chapter 8
Software Implementation


8.1 Introduction
Implementation is the stage where the theoretical design is turned into a working system. The
most crucial part of this stage is achieving a successful new system and giving the users
confidence that the new system will work efficiently and effectively.
The system can be implemented only after thorough testing is done and it is found to work
according to the specification. Implementation involves careful planning, investigation of the
current system and its constraints on implementation, design of methods to achieve the
changeover, and an evaluation of changeover methods, apart from planning. Two major tasks of
preparing the implementation are education and training of the users and testing of the
system.
The more complex the system being implemented, the more involved will be the systems
analysis and design effort required just for implementation. The implementation phase
comprises several activities. The required hardware and software acquisition is carried out.
The system may require some software to be developed. For this, programs are written and
tested. The user then changes over to the new, fully tested system and the old system is
discontinued.
Implementation is the process of having systems personnel check out and put new
equipment into use, train users, install the new application, and construct any files of data
needed to use it.
Depending on the size of the organization that will be involved in using the application and
the risk associated with its use, system developers may choose to test the operation in only
one area of the firm, say in one department or with only one or two persons. Sometimes they
will run the old and new systems together to compare the results. In still other situations,
developers will stop using the old system one day and begin using the new one the next. As
we will see, each implementation strategy has its merits, depending on the business situation
in which it is considered. Regardless of the implementation strategy used, developers strive
to ensure that the system's initial use is trouble-free.
Once installed, applications are often used for many years. However, both the organization
and the users will change, and the environment will be different over the weeks and months.
Therefore, the application will undoubtedly have to be maintained. Modifications and
changes will be made to the software, files, or procedures to meet the emerging requirements.

8.2 Implementation Details


Installation Steps:
Step 1: Oracle installation
Step 2: Tomcat 6.0 installation (or)
Step 3: MyEclipse 8.0 installation
Deployment Steps:
Step 1: Start MyEclipse 8.0.
Step 2: Click the File menu and select the Import option.
Step 3: In the import dialog, select the General option and click Existing Projects into Workspace.
Step 4: Click the Browse button, select the project, and click the Finish button.
Step 5: Right-click the project and select the Run As option.

8.2.1 Modules and Algorithms


Modules
1. Procedure of Genetic Programming
2. Genetic operations
3. Generational Evolutionary algorithm
4. Record De-duplication
5. Precision and Recall operations

1. Procedure of Genetic Programming:


Multiple users send queries and extract results from the search engine. During extraction
of the results, the first operation applied is selection. Selection is performed over
different databases and extracts the results through interactive query processing. It does not
provide an optimal solution: the results still contain some duplicates, so only a nearly
optimal set of results is displayed.


2. Genetic Operations:
This module has been developed to provide structure-based results. The first step is the
selection of the root terminals; these are the zero-level results. Next, the children at the
next level are found. This procedure is applied until the leaf nodes are reached. Along the
way, all internal nodes are identified. Identifying the internal nodes and building the
structure is done with three operations: selection, crossover, and mutation.
3. Generational Evolutionary Algorithm:
This algorithm initializes the results of all nodes and calculates a rating for each node.
The fitness is calculated from the rating value. Once the fit nodes are found, the reproduced
tree can be created from the best nodes. The same process is repeated until the optimal tree
is identified.
4. Record De-duplication:
The deduplication function adapts automatically to the requirement. It shows the results
efficiently in a tree data structure, as an evidence-based display of the results. All nodes
are displayed with the help of a similarity function.
5. Precision and Recall operations:
These two measures are computed from the correctly identified duplicate pairs.
P = Number of Correctly Identified Duplicated pairs / Number of Identified Duplicated pairs
R = Number of Correctly Identified Duplicated pairs / Number of True Duplicated pairs
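The two ratios above can be computed from sets of record pairs, as in the following Java sketch. The names (PairMetrics, pair, correct) are hypothetical, record pairs are encoded as strings purely for illustration, and returning 0.0 when a denominator is empty is an assumed convention.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Precision and recall over sets of duplicate pairs (hypothetical helper class). */
final class PairMetrics {

    /** Canonical key for an unordered record pair, e.g. pair(7, 3) gives "3-7". */
    static String pair(int a, int b) {
        return Math.min(a, b) + "-" + Math.max(a, b);
    }

    /** Number of correctly identified duplicate pairs (present in both sets). */
    static int correct(Set<String> identified, Set<String> truth) {
        Set<String> both = new HashSet<>(identified);
        both.retainAll(truth);
        return both.size();
    }

    /** P = correctly identified pairs / identified pairs (0.0 if nothing was identified). */
    static double precision(Set<String> identified, Set<String> truth) {
        return identified.isEmpty() ? 0.0 : (double) correct(identified, truth) / identified.size();
    }

    /** R = correctly identified pairs / true duplicate pairs (0.0 if there are no true pairs). */
    static double recall(Set<String> identified, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) correct(identified, truth) / truth.size();
    }

    public static void main(String[] args) {
        // Illustrative data: record ids are made up.
        Set<String> identified = new HashSet<>(Arrays.asList(pair(1, 2), pair(3, 4), pair(5, 6)));
        Set<String> truth = new HashSet<>(Arrays.asList(pair(2, 1), pair(3, 4), pair(7, 8)));
        System.out.println("P = " + precision(identified, truth));
        System.out.println("R = " + recall(identified, truth));
    }
}
```

Storing each pair under a canonical key means that (1, 2) and (2, 1) count as the same duplicate pair.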

Algorithm
Algorithm: Generational Evolutionary Algorithm

First, the population is initialized from the gathered data in which the duplicates are to be
discovered.

Then all individuals are evaluated and a numeric fitness value is assigned
to each one.

The selection (reproduction) step is performed, in which the n fittest individuals are copied
into the next-generation population without being modified.

Then the crossover operation is performed on the m individuals selected to compose
the next generation with the best parents: two parent trees are selected according to a
matching policy, a random subtree is selected in each parent, and the resulting offspring
replace the existing generation.

Finally, the mutation operation is performed, and the best individuals in the
population are produced as the result.
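The steps above can be sketched as a generic generational loop in Java. This is a simplified illustration, not the project's code: the names (GenerationalLoop, evolve, tournament) are hypothetical, parents are chosen by a two-way tournament as an assumed matching policy, the crossover and mutation operators are passed in by the caller, and the population is assumed to be non-empty. The toy problem in main evolves plain integers toward a made-up target instead of deduplication trees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.BinaryOperator;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

/** Generational evolutionary loop with elitist reproduction (hypothetical sketch). */
final class GenerationalLoop {

    /** Runs the loop and returns the fittest individual of the final population. */
    static <T> T evolve(List<T> initial, ToDoubleFunction<T> fitness,
                        BinaryOperator<T> crossover, UnaryOperator<T> mutate,
                        int elite, int generations, Random rng) {
        List<T> population = new ArrayList<>(initial);           // initialization
        Comparator<T> bestFirst = Comparator.comparingDouble(fitness).reversed();
        for (int g = 0; g < generations; g++) {
            population.sort(bestFirst);                           // evaluation and ranking
            int keep = Math.min(elite, population.size());
            // Reproduction: the fittest individuals are copied unchanged.
            List<T> next = new ArrayList<>(population.subList(0, keep));
            while (next.size() < population.size()) {
                T p1 = tournament(population, fitness, rng);
                T p2 = tournament(population, fitness, rng);
                next.add(mutate.apply(crossover.apply(p1, p2))); // crossover, then mutation
            }
            population = next;                                    // replace the generation
        }
        population.sort(bestFirst);
        return population.get(0);
    }

    /** Picks two random individuals and returns the fitter one (assumed pairing policy). */
    private static <T> T tournament(List<T> pop, ToDoubleFunction<T> fitness, Random rng) {
        T a = pop.get(rng.nextInt(pop.size()));
        T b = pop.get(rng.nextInt(pop.size()));
        return fitness.applyAsDouble(a) >= fitness.applyAsDouble(b) ? a : b;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        List<Integer> start = Arrays.asList(0, 10, 90, 100, 55, 20);
        // Toy problem: integers evolve toward the hypothetical target 42.
        Integer best = evolve(start, x -> -Math.abs(x - 42), (a, b) -> (a + b) / 2,
                x -> x + rng.nextInt(3) - 1, 2, 30, rng);
        System.out.println("best = " + best);
    }
}
```

Because the elite individuals are copied before any offspring are created, the best fitness in the population can never decrease from one generation to the next.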

8.2.2 Sample Code


//Home.jsp
<html>
<jsp:include page="Header.jsp"></jsp:include>
<body>
<b><i><font color="orange" size="5">A Genetic Programming Approach</font> <font
color="Green" size="5">to Record Deduplication</font></i></b>
<table align="center"><tr><td>
<p><font color="#4682B4">
<font size="7">W</font>
e propose a genetic programming approach to record deduplication that combines several
different pieces of evidence
extracted from the data content to find a deduplication function that is able to identify
whether two entries in a repository are replicas or
not. As shown by our experiments, our approach outperforms an existing state-of-the-art
method found in the literature. Moreover, the
suggested functions are computationally less demanding since they use fewer evidence. In
addition, our genetic programming
approach is capable of automatically adapting these functions to a given fixed replica
identification boundary, freeing the user from the
burden of having to choose and tune this parameter.
</font> </p>

</td>
<td > </td>
<td colspan="1" align="left" valign="top"><img
src="<%=request.getContextPath()+"/images/c5.jpg"%>" align="top" height="200" /></td>
</tr>
</table>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>

// LoginPage.jsp

<!DOCTYPE HTML PUBLIC "-//w3c//dtd html 4.0 transitional//en">


<html>
<head>
<script language="JavaScript"
src="<%=request.getContextPath()+"/scripts/gen_validatorv31.js"%>"
type="text/javascript"></script>
<style type="text/css">
.Title {
font-family:Verdana;
font-weight:bold;
font-size:8pt
}
.Title1 {font-family:Verdana;
font-weight:bold;
font-size:8pt
}

</style>
</head>
<body>

<jsp:include page="Header.jsp"></jsp:include>
<fieldset>
<legend>Login Form</legend>
<form action="<%=request.getContextPath()+"/LoginAction"%>" method=post
name="login">
<table border="0" align="center" bgcolor="white" width="80%">
<tr>
<td height="120" align="right"></td>
<td><table border="0" align="center">
<tr align="center"><td colspan="2"><strong><h3><font color="#4682B4">Login
Form</font></h3></strong></td>
</tr>
<tr>
<td><font color="#DA70D6" size=""><b>UserID</b></font></td>
<td><input type="text" name="username"></td>
</tr>
<tr>
<td><font color="#DA70D6" size=""><b>Password</b></font></td>
<td>
<input type="password" name="password"></td>
</tr>
<tr>
<td colspan="2">
<div align="center" class="style11">
<input type="submit" name="Submit" value="Sign In">
&nbsp;
<input name="Input2" type="reset" value="Clear">
</div>
</td>
<td><div>
<a href="./Recoverpassword.jsp"><font color="#228B22" size="3"
style="verdana">ForgotPassword??</font></a>
</div></td>
</tr>
</table></td>
<td> <img src="<%=request.getContextPath()+"/images/l1.jpeg"%>"
height="200" />
</td>
</tr>
</table>
</form>
</fieldset>
<script language="JavaScript" type="text/javascript">
//You should create the validator only after the definition of the HTML form
var frmvalidator = new Validator("login")
frmvalidator.addValidation("username","req","Login Name is required");
frmvalidator.addValidation("password","req","Password is required");
</script>
<br/>
<br/>
<jsp:include page="Footer.jsp"></jsp:include>
</body>
</html>

// Search.jsp
<%@ page language="java" import="java.util.*" pageEncoding="ISO-8859-1"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

<head>

<script type="text/javascript">
function changePage(){
// Read the selected topic; option values are compared as string literals.
var select=document.getElementById("select").value;
var sel="";
if(select==sel){
alert("plz select any one ");
}
else if(select=="java"){
alert("u have selected java");
location.href="./MainPage.jsp";
}else if(select=="c"){
alert("u have selected c");
location.href="./C.jsp";
}else if(select=="unix"){
alert("u have selected Unix");
location.href="./Unix.jsp";
}
}
</script>
</head>
<jsp:include page="Header.jsp"></jsp:include>
<body>
<center> <h3><font color="#008080">Select u r search Concept Here</font></h3>
<form name="cpaper" action="./GetSearchPageAction">
<table>
<tr>
<td> <font size="4" color="#4682B4">Select Here</font></td>

<td> <select id="select" name="search" >


<option value="unix">Unix</option>
</select>

</td>

</tr>
<br/>
<br/><br/>
<tr align="center"><td> <input type="submit" value="search"></td></tr>
</table>
</form>
</center>
<br/>
<jsp:include page="./Footer.jsp"></jsp:include>
</body>
</html>

8.2.3 Screen Shots:

Fig. 8.1: Home Page

Fig. 8.2: Admin Login Page

Fig. 8.3: Admin Welcome Page


59

A Genetic Programming Approach to Record Deduplication

Fig. 8.4: Upload Webpage

Fig. 8.5: Password Change Page


60

A Genetic Programming Approach to Record Deduplication

Fig. 8.6: User Login Page

Fig. 8.7: User Welcome Page


61

A Genetic Programming Approach to Record Deduplication

Fig. 8.8: Search Engine Page

Fig. 8.9: Search Result Page-01


62

A Genetic Programming Approach to Record Deduplication

Fig. 8.10: Search Result Page-02

Fig. 8.11: Duplicate Citation Data Info Page



Fig. 8.12: Duplicate Citation Remove Page

Fig. 8.13: View Tree Graph Page


Fig. 8.14: Tree Graph


Chapter 9
Software Testing


9.1 Introduction
Testing Strategies
Testing:
1. Testing is the process of executing a system with the intent of finding an error.
2. Testing is the process in which defects are identified, isolated, and submitted for
rectification, and in which the product is verified to be defect free, in order to produce a
quality product and hence customer satisfaction.
3. Quality is defined as the satisfaction of the requirements.
4. A defect is a deviation from the requirements.
5. A defect is also called a bug.
6. Testing can demonstrate the presence of bugs, but not their absence.
7. Debugging and testing are not the same thing.
8. Testing is a systematic attempt to break a program.
9. Debugging is the art or method of uncovering why a script or program did not execute
properly.

Testing Methodologies:

Black box Testing: is the testing process in which the tester tests an application
without any knowledge of its internal structure.
Usually test engineers are involved in black box testing.

White box Testing: is the testing process in which the tester tests an application
with knowledge of its internal structure.
Usually the developers are involved in white box testing.

Gray Box Testing: is the process in which a combination of black box and white
box techniques is used.


Levels of Testing:

[Figure: the units of Module1, Module2 and Module3 (i/p) are combined through Integration (o/p),
followed by System Testing (Presentation + Business + Databases) and UAT (user acceptance testing).]

Fig. 9.1 Software Testing Life Cycle

Test Planning:

1. Test planning is defined as a strategic document which describes how to perform the various
kinds of testing on the total application in the most efficient way.
2. This document covers the scope of testing,
3. Objective of testing,
4. Areas that need to be tested,
5. Areas that should not be tested,
6. Scheduling and resource planning,
7. Areas to be automated and the various testing tools used.

Test Development

1. Test case development (check list).
2. Test procedure preparation (description of the test cases).
3. Implementation of test cases. Observing the result.

Types of Testing:
Smoke Testing: is the initial testing in which the tester checks that all the functionality of
the application is available, so that detailed testing can then be performed on it. (The main
check is for available forms.)
Sanity Testing: is a type of testing conducted on an application initially to check its proper
behavior, that is, to check that all the functionality is available before detailed testing is
conducted on it.
Regression Testing: is one of the most important types of testing. It is the process in which
functionality that has already been tested is tested again whenever a new change is added, in
order to check whether the existing functionality remains the same.
Static Testing: is testing performed on an application while it is not being executed.
Ex: GUI, Document Testing.
Dynamic Testing: is testing performed on an application while it is being executed.
Ex: Functional testing.
Alpha Testing: is a type of user acceptance testing conducted on an application just before it
is released to the customer.
Beta Testing: is a type of UAT conducted on an application after it is released to the
customer, deployed into the real-time environment, and accessed by real-time users. In this
type of testing, the developer can get user responses.
Compatibility Testing: is the testing process in which the product is tested in environments
with different combinations of databases, in order to check how far the product is compatible
with all these environment and platform combinations.
Adhoc Testing: is testing done without the test case document used in formal testing, in order
to cover features that are not covered in that document. It is also intended for GUI testing,
which may involve cosmetic issues.


9.2 Test Plans


The testing process starts with a test plan. This plan identifies all the testing-related
activities that must be performed, specifies the schedules, allocates the resources, and
specifies guidelines for testing. During unit testing, the specified test cases are executed
and the actual result is compared with the expected output. The final output of the testing
phase is the test report and the error report.

Test Data:
Here all test cases that are used for the system testing are specified. The goal is to test the
different functional requirements specified in Software Requirements Specifications (SRS)
document.

Unit Testing:
Each individual module has been tested against the requirement with some test data.

Test Report:
The module works properly provided the user enters the required information. All data entry
forms have been tested with the specified test cases and all of them work properly.
Error Report:
If the user does not enter data in the specified order, the user is prompted with error
messages. Error handling was implemented to handle both expected and unexpected errors.

9.3 Test Cases:


A Test case is a set of input data and expected results that exercises a component with the
purpose of causing failure and detecting faults. Test case is an explicit set of instructions
designed to detect a particular class of defect in a software system, by bringing about a
failure. A Test case can give rise to many tests.
In general, a test case is a set of test data and test programs together with their expected
results. A test case in software engineering normally consists of a unique identifier,
requirement references from a design specification, preconditions, events, a series of steps
(also known as actions) to follow, input, and output. It validates one or more system
requirements and produces a pass or fail result.

The mechanism for determining whether a software program or system has passed or failed
such a test is known as a test oracle. In some settings, an oracle could be a requirement or use
case, while in others it could be a heuristic. It may take many test cases to determine that a
software program or system is considered sufficiently scrutinized to be released. Test cases
are often referred to as test scripts, particularly when written. Written test cases are usually
collected into test suites. If a requirement has sub-requirements, each sub-requirement must
have at least two test cases. Written test cases should include a description of the
functionality to be tested, and the preparation required to ensure that the test can be
conducted.
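
The relationship between a test case and its oracle can be sketched in Java as follows. This is only an illustration under assumptions: the class `TestCaseSketch`, its nested `TestCase` type, and the `passes` helper are hypothetical names that do not come from the project code, and the oracle is taken to be plain equality between the expected and the actual output.

```java
import java.util.function.Function;

// Hypothetical illustration, not project code: a test case bundles an
// identifier, an input, and an expected result; the oracle decides pass or fail.
final class TestCaseSketch {

    /** A minimal test case: unique identifier, input, and expected output. */
    static final class TestCase<I, O> {
        final String id;
        final I input;
        final O expected;

        TestCase(String id, I input, O expected) {
            this.id = id;
            this.input = input;
            this.expected = expected;
        }
    }

    /** Oracle (assumed): the test passes when actual equals expected. */
    static <I, O> boolean passes(TestCase<I, O> tc, Function<I, O> componentUnderTest) {
        O actual = componentUnderTest.apply(tc.input);
        return tc.expected.equals(actual);
    }

    public static void main(String[] args) {
        // Component under test: trims surrounding whitespace from a field value.
        Function<String, String> trim = String::trim;

        TestCase<String, String> first = new TestCase<>("TC-01", "  abc ", "abc");
        TestCase<String, String> second = new TestCase<>("TC-02", "a b", "ab");

        // trim() removes only leading and trailing blanks, so TC-02 fails.
        System.out.println(first.id + ": " + (passes(first, trim) ? "pass" : "fail"));
        System.out.println(second.id + ": " + (passes(second, trim) ? "pass" : "fail"));
    }
}
```

Collecting several such `TestCase` objects in a list would correspond to the test suite mentioned above.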
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the Software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests. Each test type addresses a specific testing requirement.
Test case format
Test Cases usually have the following components:

Test Case Summary

Initial Condition

Steps to run the test case

Expected behavior/outcome

1. Positive Test Cases:

The positive flow of the functionality must be considered

Valid inputs must be used for testing

The tester must take a positive view and verify that the requirements are satisfied.


Example for Positive Test cases:


Table 9.1: Example for Positive Test Case
| T.C.No | Description | Expected value | Actual value | Result |
| | Check for the date Time Auto Display | The date and time of the system must be displayed | | |
| | Enter the valid Roll no into the student roll no field | It should accept | | |

2. Negative Test Cases:

The tester must take a negative view and try to break the functionality.

Invalid inputs must be used for test.

Example for Negative Test cases:


Table 9.2: Example for Negative Test Case
| T.C.No | Description | Expected value | Actual value | Result |
| | Try to modify information in date and time | The modification should not be allowed | | |
| | Enter invalid data in to the student details form, click on save | It should not accept invalid data, save should not be allowed | | |

Test Case 1:
Login Page Test Case
Table 9.3: Login Page Test Case
(Step, Expected and Actual are the Test Steps columns.)
| Test Case | Test Case Name | Description | Step | Expected | Actual |
| Login | Validate Login | To verify that the Login name on the login page must be greater than 1 character | Enter a login name of less than 1 char (say a) and a password, and click the Submit button | An error message "Login not less than 1 characters" must be displayed | |
| | | | Enter a login name of 1 char (say a) and a password, and click the Submit button | Login successful, or an error message "Invalid Login or Password" must be displayed | |
| Pwd | Validate Password | To verify that the Password on the login page must be greater than 1 character | Enter a Password of less than 1 char (say nothing) and a Login Name, and click the Submit button | An error message "Password not less than 1 characters" must be displayed | |
| Pwd02 | Validate Password | To verify that the Password on the login page must allow special characters | Enter a Password with special characters (say !@hi&*P) and a Login Name, and click the Submit button | Login successful, or an error message "Invalid Login or Password" must be displayed | |
| Link | Verify Hyperlinks | To verify whether the Hyper Links available at the left side on the login page are working or not | Click Sign Up Link | Home Page must be displayed | |
| | | | Click Sign Up Link | Sign Up page must be displayed | |
| | | | Click New Users Link | New Users Registration Form must be displayed | |

Registration Page Test Case

Table 9.4: Registration Page Test Case
(Step, Expected and Actual are the Test Steps columns.)
| Test Case | Test Case Name | Description | Step | Expected | Actual |
| Registration | Validate User Name | To verify that the User Name on the Registration page must be declared | Enter User Name, click the Submit button | An error message "User Name must be declared" | |
| | Validate Password | To verify that the Password on the Registration page must be declared | Enter Password, click the Submit button | An error message "Password must be declared" | |
| | Validate First Name | To verify that the First Name on the Registration page must be declared | Enter First Name, click the Submit button | An error message "First Name must be declared" | |
| | Validate Last Name | To verify that the Last Name on the Registration page must be declared | Enter Last Name, click the Submit button | An error message "Last Name must be declared" | |
| | Validate Address | To verify that the Address on the Registration page must be declared | Enter Address, click the Submit button | An error message "Address must be declared" | |
| | Validate Phone number | To verify that the Phone number on the Registration page must be declared | Enter Phone number, click the Submit button | An error message "Phone number must be declared" | |
| | Validate Phone number giving characters | To verify that the Phone number on the Registration page is only numeric values | Enter Phone number (say abc), click the Submit button | An error message "Phone number must be numeric" | |
| | Validate Phone number valid number | To verify that the Phone number on the Registration page is a valid value | Enter Phone number (say 1234), click the Submit button | An error message "Phone number must be valid value" | |
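
The checks listed in Table 9.4 could be expressed as a small validator, sketched below in Java. This is an assumption-laden illustration rather than the project's actual server-side code: the class name `RegistrationValidatorSketch`, the use of the field labels as map keys, and the exact error message wording are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the registration checks; not taken from the project code.
final class RegistrationValidatorSketch {

    // Field labels follow Table 9.4; using them as map keys is an assumption.
    static final String[] REQUIRED = {
        "User Name", "Password", "First Name", "Last Name", "Address", "Phone number"
    };

    /** Returns one error message per violated rule; an empty list means the form is valid. */
    static List<String> validate(Map<String, String> form) {
        List<String> errors = new ArrayList<>();
        for (String field : REQUIRED) {
            String value = form.get(field);
            if (value == null || value.trim().isEmpty()) {
                errors.add(field + " must be declared");
            }
        }
        // The numeric rule is applied only when a phone number was actually entered.
        String phone = form.get("Phone number");
        if (phone != null && !phone.trim().isEmpty() && !phone.trim().matches("\\d+")) {
            errors.add("Phone number must be numeric");
        }
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<>();
        form.put("User Name", "user1");
        form.put("Password", "secret");
        form.put("First Name", "First");
        form.put("Last Name", "Last");
        form.put("Address", "Street 1");

        form.put("Phone number", "abc");      // negative case: characters
        System.out.println(validate(form));   // prints [Phone number must be numeric]

        form.put("Phone number", "1234");     // positive case: digits only
        System.out.println(validate(form));   // prints []
    }
}
```

Each row of the table then maps to one call of `validate` with a single field left empty or filled with invalid data.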


Chapter 10
Result Analysis


We present and discuss the results of the experiments performed to evaluate our proposed
GP-based approach to record deduplication. There were three sets of experiments:
1. GP was used to find the best combination function for previously user-selected evidence,
i.e., <attribute, similarity function> pair combinations specified by the user for the
deduplication task. The use of user-selected evidence is a common strategy adopted by all
previous record deduplication approaches. Our objective in this set of experiments was to
compare the evidence combination suggested by our GP-based approach with that of a
state-of-the-art SVM-based solution, the Marlin system, used as our baseline.
2. GP was used to find the best combination function with automatically selected evidence.
Our objective in this set of experiments was to discover the impact on the result when GP has
the freedom to choose the similarity function that should be used with each specific attribute.
3. GP was tested with different replica identification boundaries. Our objective in this set of
experiments was to identify the impact on the resulting deduplication function when different
replica identification boundary values are used with our GP-based approach.
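
To make the terms used above concrete, the following Java sketch shows how evidence from <attribute, similarity function> pairs might be combined and compared against a replica identification boundary. Everything in it is an assumption for illustration: the class `DedupBoundarySketch`, the attribute names `title` and `author`, the choice of token-based Jaccard similarity, the hand-written function `combine` (a stand-in for a function that GP would evolve), and the boundary values are hypothetical and are not results from the experiments.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: evidence combination plus a replica identification boundary.
final class DedupBoundarySketch {

    /** Jaccard similarity over lower-cased whitespace tokens, in [0, 1]. */
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().trim().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().trim().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

    /** Hand-written stand-in for an evolved combination function (weights are arbitrary). */
    static double combine(double titleEvidence, double authorEvidence) {
        return titleEvidence + 2.0 * authorEvidence;
    }

    /** A record pair is flagged as a replica when the combined value exceeds the boundary. */
    static boolean isReplica(Map<String, String> r1, Map<String, String> r2, double boundary) {
        double e1 = jaccard(r1.get("title"), r2.get("title"));   // <title, jaccard>
        double e2 = jaccard(r1.get("author"), r2.get("author")); // <author, jaccard>
        return combine(e1, e2) > boundary;
    }

    public static void main(String[] args) {
        Map<String, String> r1 = new HashMap<>();
        r1.put("title", "genetic programming for record deduplication");
        r1.put("author", "a author");

        Map<String, String> r2 = new HashMap<>();
        r2.put("title", "genetic programming record deduplication");
        r2.put("author", "a author");

        // Title evidence is 4/5 = 0.8, author evidence is 1.0, combined value is 2.8.
        System.out.println(isReplica(r1, r2, 2.0)); // prints true
        System.out.println(isReplica(r1, r2, 3.0)); // prints false
    }
}
```

The two calls in `main` illustrate the point of the third set of experiments: the same pair of records is classified differently depending on the boundary value chosen.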


Chapter 11
Technical Specification


11.1 H/w Requirements


Processor : Pentium IV
Hard Disk : 40GB
RAM : 512MB or more

11.2 S/w Requirements


Operating System : Windows
Programming Language : Java
Web Applications : JDBC, Servlets, JSP
IDE/Workbench : My Eclipse 6.0
Database : Oracle 10g

11.3 Advantages
1. It removes the maximum number of duplicate results.
2. It provides meaningful results without exploring the entire search space.
3. It provides effective results with the help of machine learning techniques.
4. It gives the best evidence-based results.
5. It provides tuned results every time in the implementation environment.

11.4 Disadvantages
1. Because this project is mainly based on a search engine, a lot of manpower is required.
2. Poor network connectivity in rural areas may lead to discontinuation of this project.

11.5 Application

In Dynamic Web

In Digital Libraries

In Large Files

Pattern Recognition
