
COPYRIGHT December 15, 2011 Dezhi Liu

Dezhi Liu June 25th 2010

Diploma Thesis
Supervisor: Prof. Dr.-Ing. Holger Hinrichs
Department of Electrical Engineering and Computer Science
Luebeck University of Applied Sciences

Case Studies of Open Source Data Quality Management

Department: Electrical Engineering and Computer Science
University course: Information Technology
Subject: Case Studies of Open Source Data Quality Management

As organizations increasingly depend on their information systems, the field of data quality management (DQM) is receiving more and more attention. High-quality data not only reduces unnecessary cost but also provides credible information for decision making. This thesis focuses on Open Source data quality management. The goal is to give an overview of this area with case studies based on the major technologies.

Both the theoretical background of DQM and a practical evaluation of sample tools are included. The fundamentals of data quality, DQM and Open Source software are introduced. Three major technologies that are widely adopted by Open Source tools, i.e. data profiling, data cleansing and record linkage, are discussed in detail. The case studies are based on evaluation criteria that focus on two aspects: Open Source qualification and data quality functionality. According to these criteria, each tool is evaluated by the following approaches: analysis based on available documents, general-purpose tests and sample task evaluation. Finally, the thesis concludes with the summarized results of the case studies as well as suggestions and a brief outlook for further research in this area.

Author: Dezhi Liu

Attending Professor: Prof. Dr.-Ing. Holger Hinrichs
WS / SS: SS 2010


I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification.

Only the sources that are cited in the document have been used in this thesis. Parts that are direct quotes or paraphrases are identified as such. I agree that my work may be published, in particular that it may be presented to third parties for inspection and that copies of it may be made and passed on to third parties.


Signature


1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Structure of the Thesis

2 Foundation
  2.1 Data Quality
    2.1.1 Introduction to the Concept
    2.1.2 Data Quality Dimensions
  2.2 Data Quality Management
    2.2.1 Recognize the Problems
    2.2.2 General Approaches
  2.3 Major Technologies
    2.3.1 Data Profiling
    2.3.2 Data Cleansing
    2.3.3 Record Linkage
  2.4 Open Source Software
    2.4.1 What is Open Source?
    2.4.2 Open Source Models
    2.4.3 Why Open Source

3 Methodology
  3.1 Evaluation Criteria
    3.1.1 Open Source Qualification
    3.1.2 Data Quality Functionality
  3.2 Tool Selection
  3.3 Technical Environment and Data Sources
    3.3.1 Technical Environment and Database Systems
    3.3.2 Data Source for Sample Task Evaluation
  3.4 Sample Task Evaluation
    3.4.1 Data Profiling Tasks
    3.4.2 Data Cleansing Task
    3.4.3 Record Linkage Task

4 Evaluation and Results
  4.1 Talend Open Profiler
    4.1.1 Open Source Qualification
    4.1.2 Working Principles and General Functionality
    4.1.3 Sample Task Evaluation
  4.2 DataCleaner
    4.2.1 Open Source Qualification
    4.2.2 Working Principles and General Functionality
    4.2.3 Sample Task Evaluation
  4.3 Talend Open Studio
    4.3.1 Open Source Qualification
    4.3.2 Working Principles and General Functionality
    4.3.3 Sample Task Evaluation
  4.4 SQL Power DQguru
    4.4.1 Open Source Qualification
    4.4.2 Working Principles and General Functionality
    4.4.3 Sample Task Evaluation
  4.5 Fril
    4.5.1 Open Source Qualification
    4.5.2 Working Principles and General Functionality
    4.5.3 Sample Task Evaluation
  4.6 Febrl
    4.6.1 Open Source Qualification
    4.6.2 Working Principles and General Functionality
    4.6.3 Sample Task Evaluation

5 Summary
  5.1 Comparison and Result
    5.1.1 Case Study Result on Data Profiling Tools
    5.1.2 Case Study Result on Data Cleansing Tools
    5.1.3 Case Study Result on Record Linkage Tools
  5.2 Summary and Outlook

A Candidate List and Selection Results

Chapter 1 Introduction

Motivation

Nowadays electronic data can be acquired, processed and stored much faster and in a more flexible way. It is more and more widely used in the information and communication technology (ICT) society. Apart from all the benefits of this development, organizations are starting to suffer from the disadvantages brought by data of poor quality. As shown in a 2002 report from the Data Warehousing Institute, data quality problems cost businesses in the US more than 600 billion dollars a year [Ins02]. The cost could be even higher now because people are trying to squeeze more information from their data for decision making. If the data itself is of poor quality, so is the information extracted from it. For this reason, data quality (DQ) and data quality management (DQM) are receiving more and more attention.

As organizations start to recognize the importance of the quality of their data, technologies that help to assure and improve data quality have developed considerably. New technologies such as data warehousing and business intelligence (BI) have been developed as new strategic initiatives. These fields are dominated by large players, e.g. Oracle Corp. and SAP. Such vendors provide a comprehensive range of products and services that help organizations improve their data quality. In the meantime, several Open Source projects have also emerged in this area. Although they are still not quite comparable with the commercial ones, they have great potential to become more powerful and competitive because of their own characteristics. Open Source tools enable organizations, especially small and medium enterprises (SME), to tailor every aspect of the data quality experience from the ground up.

Objective

The goal of this thesis is to give a brief overview of the Open Source DQM area, covering not only the related theory but also case studies of current practice in this area. As a starting point, an introduction to data quality is needed. Data quality management should then be discussed in detail, concentrating on the available technologies. Due to the time limitation of this thesis, as well as the lack of common agreement on the major technologies, it is not possible to cover the whole range. Therefore, only those technologies that are widely adopted by current Open Source (OS) tools are picked out and introduced in depth. Evaluation criteria should be clearly defined. Case studies of Open Source tools for these technologies should be performed according to the predefined criteria. Sample tasks are to be designed in order to provide practical experience with each tool.

This thesis focuses on three major technologies: data profiling, data cleansing and record linkage. The selected tools and a short description of each are shown in Table 1.2.1. The sample tasks correspond to this categorization.

OS Software            Description
DataCleaner            Open Source data profiling tool
Talend Open Profiler   Open Source data profiling tool
Talend Open Studio     Open Source data integration tool
SQL Power DQguru       Open Source data cleansing & MDM tool
Febrl                  Open Source record linkage tool
Fril                   Open Source record linkage tool

Table 1.2.1: Overview of selected tools

The sample tasks are evaluated on the data source foodmart in two databases, MySQL and Apache Derby. Both the important implementation steps and the results for each evaluated tool should be presented in detail. Finally, the results of the case studies should be presented. The conclusion should contain a description of the current situation in this area based on these results. Problems found should be discussed, and suggestions and an outlook for further research in this area should be given.


Structure of the Thesis

Figure 1.3.1: Thesis structure

Figure 1.3.1 gives a graphical representation of the structure of this thesis. First, chapter 2 introduces the fundamentals of data quality, DQM and Open Source software, which serve as the basis of this thesis. Chapter 3 then contains the evaluation criteria, the tool selection procedure for the case studies and related information about the evaluation procedure, e.g. the data sources used and the technical environment. The case studies are performed in chapter 4, which includes a detailed description of the evaluation process for each tool and the results obtained by executing the sample tasks. Finally, chapter 5 concludes this thesis with a comparison of the tools in each category according to the specified criteria and discusses open problems and future work in this area.

Chapter 2 Foundation


Data Quality

Introduction to the Concept

Before data quality can be measured and improved, it should be defined in a meaningful and measurable way. Data quality can be addressed from the research perspective of several areas, ranging from statistics and management to computer science. The concept of data quality was first introduced by statisticians in the late 1960s. Since the beginning of the 1990s, computer scientists have considered the importance of the quality of electronic data stored in databases, data warehouses and legacy systems [BS06].

Data quality has been broadly defined as "fitness for use" by information consumers [Jur88]. It is often concerned with excellence, value, conformance to specifications or meeting consumer expectations [KSW02]. Information quality, as well as data quality, has been defined by the IAIDQ (1) as the fitness for use of information [EE10][PS05]. The term data quality is also used synonymously with information quality, whose meaning has a broader range. This thesis restricts itself to the term data quality, viewed from a more product-based perspective, and focuses on the data stored and processed in information systems. Before we look at the different dimensions of this multifaceted concept, the importance of data quality is discussed first.

(1) The International Association for Information and Data Quality, see

Why is data quality important?

Today, data quality is attracting more and more attention in corporations because it plays an increasingly important role in the business world. If it is not treated well, poor data quality may lead to serious consequences.

On the one hand, the lack of data quality itself causes overhead. Companies are recognizing that the data in their information systems is generally of poor quality. In fact, the problem is already leading to substantial costs in some companies. The overhead mainly includes the cost of unnecessary printing, postage and staffing. These costs emerge slowly, but they amount to a steady erosion of an organization's credibility among customers and suppliers. As already mentioned in chapter 1, the cost for U.S. businesses in 2002 was more than 600 billion dollars.

On the other hand, data quality plays an important role in the knowledge discovery process. In the past three decades, computer hardware technology has made great progress, leading to a large supply of more powerful and cheaper devices, e.g. computers, data collection equipment and storage media. The popularity of those devices in both the PC and the business market has given a great boost to the information industry, leading to the rapid development of heterogeneous database systems and Internet-based information systems such as the World Wide Web (WWW) [HK01]. Analyzing the fast-growing, tremendous amount of data in those information systems effectively and efficiently has become a challenging task. At the same time, companies are eager to extract valuable information from that data. This awkward situation has been described as "data rich but information poor" [HK01]. If such a tremendous amount of data is of poor quality, all the cost and effort spent on analyzing it will turn out to be wasted. Therefore, it is even more important to keep data at a high quality level in larger information systems.

English (2009) estimated that poor information quality can cost organizations 20 to 35 percent of operating revenue, wasted in recovery from process failure and in information scrap and rework. 122 organizations lost a total of $1.2 trillion in process failure and waste caused by poor-quality information [Eng09].

Data Quality Dimensions

Until recently, most people still equated the terms data quality and data accuracy. It is obvious that inaccurate data is of low quality, but data quality is more than just accuracy. In reality, there are many other significant dimensions such as completeness, consistency and currency. Although several studies have analyzed the problems and defined the dimensions, a precise and commonly accepted definition is still not available.

There are generally three approaches to defining data quality dimensions: theoretical, empirical and intuitive (see [BS06], chapter 2). As typical examples, the theoretical approach that Wand & Wang proposed in 1996 considers the mapping of the real world (RW) to an information system (IS) [WW96]; the empirical approach that Wang & Strong proposed in 1996 relies on information consumer feedback to derive quality criteria and then classifies them into categories. Altogether, 15 different dimensions were selected by the authors, fitting into four general categories [WS96]. There is also no general agreement on the names of the dimensions, and in different proposals the meanings and criteria of dimensions with the same name may differ. But no matter which approach is adopted, the results tend to agree at some level. The common dimensions that most proposals include are: Accuracy, Reliability, Timeliness, Completeness and Consistency.

The data quality dimensions defined in this section are theoretically grounded and can also serve as a basis for further research and for the evaluation of data quality tools. An integrated set of data quality dimensions is introduced based on three related works: the classification proposed by Larry P. English (1999) (see [Eng99], chapter 2), the data quality dimensions defined by Batini & Scannapieco (2002) [BS06] and the Semiotic Information Quality Framework proposed by Price & Shanks (2005) [PS05].

Inherent Data Quality

Dimensions in this category are based on the stored data's conformance to the real world it represents. The category is similar to the concept of intrinsic data quality proposed by Wand & Wang (1996). A closer and simpler word for this category that most people may use is accuracy. "Inaccuracy implies that the information system represents a real world state different from the one that should have been represented." (cited from [WW96]) But mapping the state of the real world is not that simple. Dimensions in this category are the most visible and the most important of all the dimensions. They are the foundation measure of the quality of data. If the data is erroneous, exploring the other dimensions of data quality does not make much sense. There are different kinds of errors, and they should be distinguished from each other. For example, consider a database that contains names, addresses, phone numbers, dates of birth, etc. (see Table 2.1.1). Some records can be wrong, some can be missing and some can be obsolete. The typical dimensions in this category are:

Syntactic Accuracy

Syntactic accuracy is defined as the state in which a value conforms to its metadata. In other words, the data obeys the constraints described by the specified integrity rules, also called the database schema. The level of data quality in this dimension largely depends on the degree to which the data is governed by schema and integrity constraints controlling permissible data values. There are generally two types of constraints: domain constraints and integrity constraints.

A domain constraint requires that a value be included in a certain domain definition. For example, the first name of the third customer in Table 2.1.1 is Jck. It is very likely an error whose correct value is Jack, because Jck is not an admissible value in the domain of common first names. The birth_date value of the fourth customer is invalid, as 33 is not a valid value for a month. "Age should be between 0 and 130" is another simple example. This kind of error can be recognized by comparison functions.

Integrity constraints can be considered a more complex form of domain constraint in which the domain is related to other attributes. For example, the marital status can only be married when the age of the entity is at least 18. The age should also conform to the date of birth. These situations are often referred to as consistency problems in other literature [BS06].

id: 1, 2, 3, 4
first_name: Jean, Michael, Jck, Babara
last_name: Derry, Spence, Baker, Marshall
birth_date: 1987-10-04, 1971-04-06, 1975-03-27, 1923-33-10
address: 7640 First Ave., 5423 Camby Rd., 7230 Berrellesa Street, 6202 Seeno St.
phone_number: 583-555-4474, (0)616 556966, 904-555-8788

Table 2.1.1: Sample customer table

Semantic Accuracy

Semantic accuracy represents the correctness of a value compared with the corresponding value of the entity in the real world. It can be recognized and measured only when the true value is already known or can be deduced from the available information. For example, assume that the birth date stored for a customer is 1988-12-22. The value is syntactically correct, but if the real birthday of that person is 1988-12-12, the stored value is not correct, as it does not represent the state of the real world. The measurement of semantic accuracy is more complex. One technique is to compare the data in different data sources. When data in another source potentially represents the same real-world entity, it may provide a reference for the correct information.

Non-redundancy


The problems with duplication are included in this dimension. Duplication means that more than one record or tuple in one data source represents the same real-world entity. In the customer example, each customer should be represented exactly once in the database. When there is another tuple with the same name, date of birth and gender, it is very likely a duplicate. Duplicates may also have minor differences because the real-world state has changed or the data has been typed in differently. Redundant data is apt to be generated when an integration process is performed. When duplication occurs, further costs such as additional mailing or data storage costs are added directly. Techniques such as record linkage are designed to detect such potential duplicates. The approach used to eliminate duplicates is usually referred to as de-duplication.

Completeness

Completeness has been defined by Wand & Wang (1996) as "the ability of an information system to represent every meaningful state of the represented real world system" [WW96]. Data can be incomplete in different ways. In the case of relational data, there are normally three kinds of completeness: attribute completeness, tuple completeness and value completeness [BS06]. Attribute completeness is usually determined by the schema of the data source. If an attribute is missing but is necessary for applications, the data is incomplete with regard to its set of attributes. This is usually a design problem. Tuple completeness is reflected by the number of missing records. Value completeness is commonly related to the presence of NULL. A value can be missing either because it exists but is unknown, or because it does not exist at all.

Pragmatic Data Quality

Data of a high level of inherent quality may not be valuable if it is not used. Therefore, data quality should represent not only the inherent quality of the data but also the quality related to its application and usage, i.e. the degree of usefulness. English (1999) [Eng99] defines those qualities as pragmatic data quality. Dimensions in this category are all related to data usage. Generally speaking, with regard to its usage, data should be accessible, secure, up-to-date, flexibly and suitably presented, relevant and valuable. Data accessibility and security often depend on the quality of the information system and the service it provides. Timeliness refers to the delay between the data stored in the information system and the real-world state that the data represents [WW96]; it largely depends on the data update process. Format flexibility and suitability mean that the data format can easily be aggregated or converted into other formats for different applications. Relevance and the value of data depend on the real-world entity. The precision and unit of measurement should also be taken into account. For example, in order to recognize records in two data sources that represent the same real-world entity (record linkage), the fields to be compared should be converted into the same format so that they are comparable by machines. This may not be a data quality problem unless the data is used for record linkage, which requires data of high format flexibility and suitability.
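The format conversion mentioned for record linkage can be illustrated with a minimal Python sketch. It assumes two hypothetical records that describe the same person in different formats; the function names normalize_phone and normalize_name, the second phone format and the normalization rules are illustrative assumptions, not taken from any of the evaluated tools.

```python
import re
import unicodedata


def normalize_phone(raw):
    """Reduce a phone number to its digits so that formats become comparable."""
    return re.sub(r"\D", "", raw)


def normalize_name(raw):
    """Lower-case a name, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


# Two hypothetical records from different sources describing one person.
source_a = {"name": "Jean  DERRY", "phone": "583-555-4474"}
source_b = {"name": "jean derry", "phone": "(583) 555 4474"}

# The raw fields differ, the normalized fields can be compared by a machine.
same_name = normalize_name(source_a["name"]) == normalize_name(source_b["name"])
same_phone = normalize_phone(source_a["phone"]) == normalize_phone(source_b["phone"])
print(same_name, same_phone)
```

Comparing the raw strings would treat the two records as different; only after both sides are brought into one format does an exact comparison become meaningful.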


Data Quality Management

In the previous section, the concept of data quality was introduced and different data quality issues were categorized into several dimensions. The question now is how data quality can be measured, assured and improved. The whole process is generally referred to as data quality management (DQM). In this section, a short introduction to DQM and its general approaches is given. In order to deal with data quality problems, the first step is to recognize which problems exist in the data sources.

Recognize the Problems

Rahm & Do (2000) analyzed data quality problems [RD00]. The problems can roughly be categorized into two groups: single-source problems and multi-source problems. In each group, there are problems at the schema level and at the instance level (see Table 2.2.1).

                 Single-Source Problems          Multi-Source Problems
Schema Level     Poor schema design, lack of     Heterogeneous data models and
                 integrity constraints           schema designs
Instance Level   Data entry errors               Overlapping, contradicting and
                                                 inconsistent data

Table 2.2.1: Four classifications of data quality problems [RD00]

Single-source problems are data quality problems that are present in a single data source. As stated by Rahm & Do (2000) [RD00], data quality in a single source largely depends on the quality of its data model, such as integrity constraints or schema design. Invalid data that cannot be avoided by schema or integrity constraints, e.g. due to misspellings during data entry or duplicates, is referred to as an instance-level problem. Multi-source problems arise especially when multiple data sources need to be integrated, such as during data warehousing processes. The reason is that redundant or inconsistent data caused by differences in data representation and by duplication is apt to increase significantly. Typical multi-source problems at the schema level are naming and structural conflicts.
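The difference between schema-level and instance-level problems in a single source can be sketched with Python's built-in sqlite3 module. The table layout, the age column and the age values are hypothetical and chosen only for illustration; SQLite stands in for the database systems used later in the thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema-level protection: a CHECK constraint restricts the admissible ages.
conn.execute(
    """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        age INTEGER CHECK (age BETWEEN 0 AND 130)
    )
    """
)

rejected = []
for row in [(1, "Jean", 23), (2, "Michael", 250), (3, "Jck", 35)]:
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        # The schema refuses the out-of-range age.
        rejected.append(row[0])

stored = [r[0] for r in conn.execute("SELECT first_name FROM customer ORDER BY id")]
print(rejected)
print(stored)
```

The out-of-range age is stopped by the schema, whereas the misspelled first name Jck passes every constraint and remains an instance-level problem that has to be found by other means.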


General Approaches

There are various approaches that can solve the data quality problems stated above, such as data profiling, data cleansing, record linkage, monitoring, etc. However, in order to obtain and maintain a high level of data quality, data quality management should be more than the occasional use of several separate applications. From the research perspective, the systematic DQM process consists of different phases, which together form a cycle and should be carried out continuously. One proposal is the TDQM (Total Data Quality Management) cycle (see Figure 2.2.1), which was defined by Wang (1998) [Wan98]. It is based on the four-step PDCA (plan-do-check-act) process known as the Deming cycle, which is typically used in business process improvement. In the TDQM cycle, data is considered an information product (IP). A short description of each component of the cycle is given as follows.

Figure 2.2.1: Cycle of TDQM methodology

Define    The important DQ dimensions and the corresponding DQ requirements should be defined.
Measure   This phase involves the measurement of data quality according to the dimensions and requirements defined in the first phase.
Analysis  Root causes for the data quality problems should be identified and the impacts of poor quality information should be calculated.
Improve   According to the result of the previous phases, data quality improvement techniques should be applied against the problems that influence data quality.

Additionally, two kinds of components should be embodied in a complete DQM program, namely proactive and reactive components.

Proactive components

Proactive components are those approaches that aim at preventing low-quality data from getting into the data source in the first place, i.e. eliminating or minimizing errors at data entry. There are several such approaches. In fact, they entail all administrative activities, such as the high-quality design of the information system including infrastructure and applications, the establishment and deployment of roles, responsibilities and policies, the processing and maintenance of data, staff training and so on. Furthermore, techniques such as a data entry system (DES) are very effective at reducing input errors. Typical tasks of a DES are data entry checks, which ensure that each field holds a valid value, and duplicate checks, which ensure non-redundancy.
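The two typical DES tasks named under proactive components, data entry checks and duplicate checks, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names entry_key and check_entry, the choice of name plus birth date as duplicate key and the fixed reference date are hypothetical; the age range of 0 to 130 and the sample values follow the customer example in this chapter.

```python
from datetime import date


def entry_key(record):
    """Key used for the duplicate check: name and date of birth."""
    return (
        record["first_name"].strip().lower(),
        record["last_name"].strip().lower(),
        record["birth_date"],
    )


def check_entry(record, existing_keys, today):
    """Return the problems found in a new record; an empty list means accept."""
    problems = []
    try:
        born = date.fromisoformat(record["birth_date"])
    except ValueError:
        problems.append("invalid birth_date")
    else:
        # Age in full years on the reference date.
        age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
        if not 0 <= age <= 130:
            problems.append("age out of range")
    if entry_key(record) in existing_keys:
        problems.append("possible duplicate")
    return problems


today = date(2010, 6, 25)  # fixed reference date so the result is reproducible
stored = {("jean", "derry", "1987-10-04")}

ok = {"first_name": "Michael", "last_name": "Spence", "birth_date": "1971-04-06"}
bad_date = {"first_name": "Babara", "last_name": "Marshall", "birth_date": "1923-33-10"}
repeat = {"first_name": "Jean ", "last_name": "Derry", "birth_date": "1987-10-04"}

for rec in (ok, bad_date, repeat):
    print(rec["first_name"].strip(), check_entry(rec, stored, today))
```

An exact key comparison only catches duplicates that are typed identically after trimming and lower-casing; duplicates with minor differences need the record linkage techniques discussed later.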

Reactive components

While proactive techniques are effective at preventing a lot of erroneous and inaccurate data from getting into information systems, they are still not enough to assure high data quality. To deal with the data quality problems that already exist in data sources, reactive components should be employed. These components generally detect and remove errors and inconsistencies from data, solving the data quality problems discussed previously. Accordingly, technologies have emerged to detect and solve data quality issues, such as data profiling, data auditing, data cleansing, data parsing, standardization and record linkage. The names may differ depending on the context or on individual preference. These techniques are implemented in software that is often referred to as data quality tools. Several Open Source solutions are already available on the market. The DQM technologies that are widely adopted by those tools are one of the main topics of this thesis and are described in detail in the next section.


Major Technologies

According to our analysis, the most widely used data quality technologies can be categorized into three groups: data profiling, data cleansing and record linkage. Other technologies may not yet be widely adopted by Open Source data quality tools, or there is no common agreement on them yet. Some are related to data quality but not relevant enough. For these reasons, the author considers the three major technologies the most popular ones among the Open Source solutions.

Before we look at the approaches and technologies, data types are redefined for the convenience of describing the situations. The data types here should be distinguished from those in database systems; the latter focus on the physical properties of the data. The four types are Nominal, Numerical, Time-related and Others.

Nominal data refers to textual data that has an integrity domain (personal names, city names) or follows a certain pattern (postal code, ISBN, email). Many data profiling technologies that try to improve data quality in the syntactic accuracy dimension focus on this type of data. Note that names evolve over time, so variations or new names are introduced, but we still consider such data Nominal.

Numerical data has a restricted meaning here: only numbers that have a mathematical meaning, in other words numbers whose average or sum is meaningful.

Any date- or time-related data is considered Time-related data. Databases such as MySQL and Derby support date formats, so only valid values are allowed in the database. This can prevent syntactically inaccurate date data from entering information systems.


Although there are more interesting categories concerning data type, we refer to all data that does not belong to the three types above as Others. An example would be unstructured text such as commentary data or address data.

The data type sometimes depends on how the data is treated or profiled, e.g. standardizing address data is also applicable when the address data is treated as Nominal. When this categorization of data types is referred to, the first letter is capitalized throughout the thesis.
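The two ways in which Nominal data can be checked, against an integrity domain or against a pattern, can be sketched in Python. The name domain, the five-digit postal code pattern and the function names are hypothetical assumptions made for this illustration; real domains and patterns depend on the data source.

```python
import re

# Hypothetical domain of admissible first names and a five-digit postal code pattern.
FIRST_NAMES = {"Jean", "Michael", "Jack", "Barbara"}
POSTAL_CODE = re.compile(r"\d{5}")


def outside_domain(values, domain):
    """Return the values that are not members of the integrity domain."""
    return [v for v in values if v not in domain]


def not_matching(values, pattern):
    """Return the values that do not match the pattern as a whole."""
    return [v for v in values if pattern.fullmatch(v) is None]


name_errors = outside_domain(["Jean", "Michael", "Jck", "Babara"], FIRST_NAMES)
code_errors = not_matching(["23562", "2356", "D-23562"], POSTAL_CODE)
print(name_errors)
print(code_errors)
```

Whether a flagged value is really an error still has to be decided by a person or a further rule, since a new or rare name is not in the domain either.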


Data Profiling

Data profiling has been described as the most important technology and recognized as the starting point for improving data quality, especially in terms of data accuracy [Ols03]. Data profiling is the process of examining the data available in existing data sources by collecting statistics and information in order to gain knowledge about the content, structure and quality of that data. With the information obtained by data profiling, data quality problems can be detected or recognized. Major data profiling approaches include metadata analysis, structure analysis, column analysis and data rule analysis.

Metadata Analysis

Metadata is generally defined as "data about data" [Cat97]. Metadata is also data; it can be stored in a repository or in other forms such as XML documents, report definitions, database descriptions, log files, connections and configuration files [WH06]. In the data quality context, metadata should define the qualification and standard for accurate data. According to Olson (2003) [Ols03], such metadata should contain schema definitions, business objects, domains, table definitions, rules, etc. This type of analysis is usually application-oriented, as it depends on different business cases. If the metadata is accurate and complete, the data profiling task becomes examining the data against the metadata. One of the data profiling tasks is to refine the metadata.

Structure Analysis

Compared with metadata analysis, this type of analysis focuses on the structure of a database, which often refers to information about tables and columns in the sense of how they are related to each other. This information is often useful when an application involves more than one table or column. As an example of a typical business model, the information on customers, products and sales is often stored in three separate tables. In order to provide the complete purchase record of a customer, the three tables have to work together (see Figure 2.3.1). This is achieved by primary keys and foreign keys in the database system. Structure analysis intends to find and analyze such dependencies in a database.
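As an illustration of the structure analysis described above, the following minimal sketch checks a foreign key dependency between the sales table and the customers and products tables. It is only a sketch under stated assumptions: the table and column names, the sample rows and the helper find_orphans are hypothetical, and an in-memory SQLite database is used purely for convenience; it is not one of the database systems evaluated in this thesis.

```python
import sqlite3


def find_orphans(conn, child, fk, parent, pk):
    """Return child rows whose foreign key value has no match in the parent table.

    Table and column names are interpolated directly, so they must come from
    a trusted source (hypothetical helper, for illustration only).
    """
    query = (
        f"SELECT c.* FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL "
        f"ORDER BY c.rowid"
    )
    return conn.execute(query).fetchall()


# Hypothetical three-table business model: customers, products and sales.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE sales     (sale_id     INTEGER PRIMARY KEY,
                            customer_id INTEGER,
                            product_id  INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO products  VALUES (10, 'Lamp'), (11, 'Desk');
    INSERT INTO sales     VALUES (100, 1, 10), (101, 2, 11),
                                 (102, 3, 10), (103, 1, 99);
""")

# Sales that reference a customer or a product that does not exist.
orphan_customers = find_orphans(conn, "sales", "customer_id", "customers", "customer_id")
orphan_products = find_orphans(conn, "sales", "product_id", "products", "product_id")
print(orphan_customers)
print(orphan_products)
```

Rows returned by such a check point to dependencies that are assumed by the application but violated by the data.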

Figure 2.3.1: Example of business database

Column Analysis

This type of analysis focuses on the values stored in a column and on the properties of that column, independent of all other columns. Interesting directions are discussed as follows.

Storage Property

The storage property of a column includes its physical data type (CHAR, FLOAT, DATE...) and its length.
Pattern Distribution

Sometimes a column, especially one of type String, has a specific character pattern. For example, the gender of a person may be represented by "M" indicating male and "F" indicating female, or by whole words, or simply by 1 and 2. The system should be consistent in representing the value. Another example is the Media Access Control (MAC) address². According to IEEE 802³, the standard format for printing MAC-48 addresses is 00-21-6b-67-ce-54 or 00:21:6b:67:ce:54, i.e. six groups of two hexadecimal digits, separated by hyphens (-) or colons (:). Any other format of presenting the MAC address can be regarded as an inaccurate or non-standard presentation. In this case, the profiling analysis should be able to detect all such inaccurate formats. This is done by validating each value against a specific rule, e.g. a regular expression (Regex), which provides a flexible way of matching Nominal or Time-related data. Another way of analysis is to find all patterns used for presenting the same data, in order to design a solution for making the format consistent.

A regular expression provides a concise and flexible means for matching strings of text by content or structure. It can be used to define particular characters, words, or patterns of characters and numbers, as well as their order in the text. Some popular syntax standards are PCRE (Perl Compatible Regular Expressions) and POSIX Basic Regular Expressions. As an example, a regex that matches valid email address formats may appear as follows⁴.

Regex: [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}

² A MAC address is a unique identifier assigned to most network adapters or network interface cards (NICs) by the manufacturer for identification.
³ A family of IEEE standards for local area networks (LAN) and metropolitan area networks (MAN).
⁴ See for more information about Regex
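The two ways of analysis described above, validating each value against a rule and collecting the patterns that actually occur, can be sketched in Python as follows. This is a minimal sketch, not the implementation of any evaluated tool. The MAC pattern is our own assumption and additionally requires the same separator throughout; the email pattern is the one shown above, applied case-insensitively because it only lists upper-case letters. The helper names and sample values are hypothetical.

```python
import re
from collections import Counter

# MAC-48: six groups of two hex digits, all separated by the same '-' or ':'
# (the consistent-separator requirement is an assumption of this sketch).
MAC_48 = re.compile(r"[0-9A-Fa-f]{2}([-:])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}")

# Email pattern from the text; IGNORECASE lets lower-case addresses match.
EMAIL = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)


def invalid_values(values, pattern):
    """Return the values that do not match the pattern as a whole."""
    return [v for v in values if pattern.fullmatch(v) is None]


def shape(value):
    """Abstract a value to its pattern: letters become 'A', digits become '9'."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )


def pattern_distribution(values):
    """Count how often each abstract pattern occurs in a column."""
    return Counter(shape(v) for v in values)


macs = ["00-21-6b-67-ce-54", "00:21:6b:67:ce:54",
        "0021.6b67.ce54", "00-21:6b-67-ce-54"]
emails = ["john.doe@example.com", "no-at-sign.example.com"]

bad_macs = invalid_values(macs, MAC_48)
bad_emails = invalid_values(emails, EMAIL)
postal = pattern_distribution(["23562", "D-23562", "23562", "2356A"])
print(bad_macs, bad_emails, postal.most_common())
```

The pattern distribution is useful when no rule is known in advance: rare patterns are candidates for non-standard representations.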
Range and Value Distribution

These two aspects are often considered together for Numerical or Time-related data. Some values already have a specific range, and the task is to validate all values against that range. The task could also be to determine how the values are distributed over certain ranges. More sophisticated analysis in this direction, which intends to derive business information, is usually regarded as data analysis, e.g. data mining, and is distinguished from data profiling.
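Both tasks can be illustrated with a short Python sketch. The column of ages, the valid range from 0 to 120 and the bin edges are hypothetical values chosen only for this example.

```python
from collections import Counter


def out_of_range(values, low, high):
    """Return the values that fall outside the closed range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


def distribution(values, bin_edges):
    """Count values per half-open bin [edge_i, edge_i+1).

    Values that fall into no bin are counted under the key 'outside'.
    """
    counts = Counter()
    for v in values:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= v < hi:
                counts[(lo, hi)] += 1
                break
        else:
            counts["outside"] += 1
    return counts


# Hypothetical Numerical column with two implausible values.
ages = [23, 35, 41, 17, 130, 58, -4, 35]

invalid_ages = out_of_range(ages, 0, 120)
age_bins = distribution(ages, [0, 18, 40, 65, 121])
print(invalid_ages)
print(age_bins)
```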


Domain Constraint

While range is related to Numerical or Time-related data, a domain constraint here means that a group of String values, which are considered Nominal data, has a certain domain. For example, there are 50 valid values for the 50 state names of the USA; any value that does not match one of those 50 is considered irregular or incorrect data.

Null Conditions

In databases, a NULL value may indicate that the value is not applicable, that it is applicable but was not provided, or that it is simply not clear whether it is applicable or not [Ols03]. Sometimes other string values like "none", "\", "?" may carry the same meaning as NULL. Such NULL indicators are to be detected and unified.

Duplication and Uniqueness

In the context of column analysis, duplications refer to identical values, i.e. the values are exactly the same. The analysis should also determine some simple statistics like the number of duplicates, distinct values and unique values.

Data Quality Rule Analysis

Data quality rule analysis intends to find combinations of values across multiple columns that violate a data rule. The columns can come from a single table, from multiple tables or even from different data sources. When a cross-table validation is applied, a join condition has to be defined beforehand. A data rule may be simple or complex; a complex rule may contain several sub-rules and involve several data sources. An example of a data rule is shown below.

Rule Clause

Data Cleansing

After data profiling, the target data have been explored and problems or noisy data should have been identified. The question becomes how to deal with the anomalies found. This action is often referred to as data cleansing. Other terms like data cleaning, data scrubbing, or reconciliation may have the same meaning or differ slightly in various contexts. There is no clear or settled definition for data cleansing [MF03][MM00]. Here we use this term throughout the thesis in a narrower sense, referring to the act or process of correcting or eliminating erroneous or inaccurate data from the data sources. Typical approaches for data cleansing are data transformation and integrity constraint enforcement [MF03].

Data Transformation

Generally speaking, data transformation is the process that transforms data from their given format into the format expected by the application. Sometimes the term data standardization is also used for transforming data into a standardized format. In the data warehousing field, data cleansing is applied before the integration of data sets in order to guarantee a consistent presentation of the data [HK01]. Therefore some ETL tools provide data quality components which deal specifically with data quality issues.

Integrity Constraint Enforcement

Integrity constraint enforcement is the process of eliminating integrity constraint violations, or of providing useful information that can be used for this purpose [MF03]. A typical approach is to purge outdated or unrecoverable records. Sometimes it is not easy for the machine to tell whether a value should be deleted or not. Such values are marked so that further action can be applied.
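The two cleansing approaches can be sketched as follows. The code is a minimal illustration under our own assumptions: the list of NULL indicators, the mapping of gender codes (including the reading of 1 as male and 2 as female) and the sample records are hypothetical, and records that violate the domain are only marked for further action rather than deleted.

```python
# Assumed NULL indicators; a real list depends on the data source.
NULL_INDICATORS = {"", "none", "null", "n/a", "?", "\\"}

# Assumed mapping of gender representations to one standard code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "2": "F"}


def unify_null(value):
    """Map every known NULL indicator to None, otherwise strip whitespace."""
    if value is None or value.strip().lower() in NULL_INDICATORS:
        return None
    return value.strip()


def standardize_gender(value):
    """Data transformation: bring gender values into the format 'M' / 'F'."""
    value = unify_null(value)
    if value is None:
        return None
    # Unknown representations are passed through unchanged.
    return GENDER_CODES.get(value.lower(), value)


def enforce_domain(records, field, domain):
    """Integrity constraint enforcement: split records into (kept, flagged).

    Missing values (None) are kept here; records whose value is outside
    the domain are flagged so that further action can be applied.
    """
    kept, flagged = [], []
    for record in records:
        ok = record[field] is None or record[field] in domain
        (kept if ok else flagged).append(record)
    return kept, flagged


raw = [
    {"name": "Ann", "gender": "female"},
    {"name": "Bob", "gender": "1"},
    {"name": "Cy", "gender": "?"},
    {"name": "Dee", "gender": "x"},
]
cleaned = [dict(r, gender=standardize_gender(r["gender"])) for r in raw]
kept, flagged = enforce_domain(cleaned, "gender", {"M", "F"})
print(cleaned)
print(flagged)
```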


Record Linkage

Record linkage refers to the task of finding syntactically distinct data entries that refer to the same entity in the real world. There are many other names that may also refer to record linkage, e.g. entity reconciliation, the merge/purge problem and so forth. The name may differ between research and user communities [Chr08a][GBVR03]. The concept was first introduced by Halbert L. Dunn in 1946. The idea was to assemble each person's records of the principal events in life into a volume [Dun46]. While individuals were the initial entities of record linkage, entities of interest now include companies, geographic regions, families and households. Record linkage is a useful tool when performing data mining tasks, especially when the data originate from different sources. The technology is used in a wide range of applications, e.g. in customer systems for marketing, customer relationship management (CRM), data warehousing, as well as in the health and clinical area and in government administration [GBVR03]. In many research projects it is also necessary to collect and examine information about an entity from more than one data source. Those sources often do not share a unique identifier or key.

Background

If two data sets of records (A and B) are to be linked, each record from A is to be compared with each one from B, generating |A| × |B| record pairs, with |X| denoting the number of records in the data set X. Based on a probabilistic approach proposed by Fellegi and Sunter (1969) [FS69], there are three possible decisions for each record pair, namely M (match), U (unmatch) and P (possible match), which are determined by a comparison vector. The vector contains the coded matching rule and represents the matching level between the two records of a pair.

Process Model

The process model of record linkage that is widely adopted by record linkage tools consists of the common components shown in a flow diagram (see Figure 2.3.2), which is based on the TAILOR system [MEE02].

Figure 2.3.2: A sample flow diagram of the record linkage process model

Blocking / Searching

It is often expensive or sometimes even infeasible to compare all pairs when the data sets contain a large number of records. For this reason, indexing or filtering techniques known as blocking / searching methods [Chr08a] [JLX+08] are employed to perform a preselection. These methods are often implemented as the first step of a record linkage process (regardless of standardization, which is considered separately in this thesis). Record linkage tools often provide more than one algorithm to accomplish this task. One typical approach is the sorted neighborhood method [HS95]. The idea is to sort the records over the relevant attributes (keys), bringing matching records close together, and then to compare only records within a small neighborhood of the sorted list, like scanning the sorted records with a window of fixed size. Another algorithm is called Blocking as Preselection, which is based on quickly computed rejection rules [NM01].

Matching and Comparison

After the filtering process, the remaining, much smaller number of record pairs is compared using comparison functions. These functions determine how two fields (or attributes) of each record pair are compared. This is the core and major research area of record linkage technology. A large number of comparison functions are now available for different types of data, e.g. strings, numbers, dates etc. Researchers are searching for more targeted methods, such as comparison functions aimed specifically at names or addresses, as these have their own characteristics concerning comparison [Chr06]. Functions for numerical values and dates are often based on mathematical operations. All comparison functions return a raw similarity value, for example a number between 0 and 1, indicating a total non-match and an exact match respectively, which is also referred to as matching weight [GBVR03]. A good command of the comparison functions helps to improve the quality and the performance of the record linkage process. A brief introduction to some commonly used string comparison methods⁵ is given as follows (for more information please refer to [CRF03] [GBVR03] [MEE02]).

Exact match: This straightforward function returns 1 if the two strings are exactly the same and 0 if they are not.

Levenshtein distance: The Levenshtein distance between two strings is defined as the minimum number of edit operations (insertion, deletion, or substitution of a single character) needed to convert one string into the other.

Jaro metric: This string comparator was introduced by Jaro (1989) [Jar89] and is based on the number and order of the common characters between two strings. According to the record linkage literature, good results have been obtained using this method or its variants [CRF03].

Winkler metric: This is a variant of the Jaro metric which introduces the length of the longest common prefix of the two strings into the function [PWC97]. The Jaro and Jaro-Winkler metrics are normally intended for comparing short strings, e.g. names of persons.

Q-grams distance: Input strings are first divided into q-grams, e.g. a string "Smith" is divided into { Smi, mit, ith, th, h } with q=3. The comparison result is obtained by comparing the divided substrings. This distance minimizes errors due to switching substrings in a long string. It can help to detect approximate string joins efficiently [GBVR03].

⁵ The terminology of methods may differ in different contexts.

Classification and Decision Models

The task of this step is to make a decision for the compared record pairs. As discussed in the probabilistic framework of Fellegi and Sunter (1969) [FS69], for each pair of records there are three possible results, namely a match, a non-match and a possible match. This is done by combining all calculated matching weights into a comparison vector. A comparison vector is usually defined by assigning a weight to each comparison function. These vectors are then used to classify record pairs into the three groups.

Merging Techniques

This technique is not yet widely adopted by Open Source record linkage tools, as it involves a much more sophisticated decision-making method and is often considered to be application-oriented. If a record pair is identified as a match or a possible match, choices have to be made when combining the two records if some of the attributes do not share the same value. The goal is to select the value which is more likely to be correct whenever possible. This remains a potential research field of record linkage technology.
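The steps of the process model, i.e. blocking, comparison and classification, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions and not the method of any particular tool: the sorted neighborhood window is applied to a single data set, the Levenshtein distance is normalized by the length of the longer string, the q-gram similarity is computed as the overlap of two unpadded q-gram sets, and the classification uses a weighted average with hypothetical thresholds instead of the probabilistic weights of the Fellegi and Sunter model. All function names, weights, thresholds and sample records are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Scale the distance to a similarity between 0 and 1 (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgrams(s, q=3):
    """Set of all substrings of length q, without padding characters."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a, b, q=3):
    """Share of common q-grams among all q-grams of both strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


def sorted_neighborhood(records, key, window=3):
    """Sort by key and pair each record with its next (window - 1) neighbors."""
    ordered = sorted(records, key=key)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + window]:
            yield left, right


def classify(vector, weights, lower=0.6, upper=0.9):
    """Classify a comparison vector as M, P or U by its weighted average."""
    score = sum(w * v for w, v in zip(weights, vector)) / sum(weights)
    if score >= upper:
        return "M"
    if score >= lower:
        return "P"
    return "U"


records = [
    {"id": 1, "name": "smith", "city": "luebeck"},
    {"id": 2, "name": "smyth", "city": "luebeck"},
    {"id": 3, "name": "meier", "city": "hamburg"},
    {"id": 4, "name": "smith", "city": "luebeck"},
]

decisions = {}
for a, b in sorted_neighborhood(records, key=lambda r: r["name"], window=2):
    vector = [levenshtein_similarity(a["name"], b["name"]),
              float(a["city"] == b["city"])]   # exact match on the city
    decisions[(a["id"], b["id"])] = classify(vector, weights=[2.0, 1.0])
print(decisions)
```

With a window of size 2, only three of the six possible pairs are compared, which shows how blocking reduces the number of comparisons at the risk of missing matches that are not adjacent after sorting.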



Open Source Software

What is Open Source?

Open Source, as the name may indicate, describes the state in which the source code is freely and publicly available. Open Source software (OSS) generally refers to computer software whose source code is available and which permits users to study, change and improve the software under certain conditions. Open Source software is easily confused with freeware, where "free" means that no payment is involved. However, they are two distinct concepts. Freeware, which is free of charge, is usually closed source [Hin07]. According to the Open Source Definition (OSD) proposed by the Open Source Initiative (OSI), there are 10 criteria that Open Source software must comply with (see [Ini10]). The basic idea is simple. On the one hand, from a programmer's perspective, as stated by OSI: "When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs." Therefore, better software may be produced compared with the traditional closed model. On the other hand, from a user's perspective, this model eliminates or largely reduces the cost of using software. Meanwhile, Open Source has also become a controversial topic. Apart from all the benefits, as the term Open Source is becoming more and more popular, it is also used by some vendors as a marketing strategy. There are also arguments suggesting that Open Source may have a bad influence on the software market. Open Source software projects are built and maintained by a network of volunteer programmers. Open Source software has already become a very important part of the software industry. The Apache HTTP Server, the GNU/Linux operating system and the Mozilla Firefox web browser are some of the most popular and successful products in the Open Source world.

Open Source Licenses

An Open Source license is a license that makes the source code and the Open Source product available, especially with regard to redistribution or to bringing Open Source products to the commercial world. Some popular Open Source licenses are the GNU General Public License (GPL), the Lesser General Public License (LGPL), the Apache License 2.0, the Mozilla Public License 1.1 (MPL) and the Eclipse Public License [Ros05]. There are several restrictions that an Open Source license may include [Lau04] [Rus10]. One concerns copyleft. A copyleft license requires all modified or extended versions of the software to be Open Source as well. The strictness of copyleft may differ; for instance, copyleft in the GNU GPL is stronger than in the LGPL, as the latter permits linking with non-free modules [FSF10]. Another aspect is the compatibility with other Open Source licenses, i.e. whether the Open Source software may be linked with code under a different license or released under a different license. For example, Apache License 2.0 is compatible with GPL 3.0, while Apache License 1.0 is not. OSI is a popular organization which reviews and approves Open Source licenses based on its OSD. There are also approvals by other organizations based on their own criteria, such as the Free Software Foundation (FSF) approval and the Fedora Project approval. The restrictions of the same license may differ between its versions.


Open Source Models

Commercial Open Source Software

Dual-licensing has been adopted by many Open Source software vendors. This model enables software providers to offer the software under an Open Source license and under separate proprietary license terms as well. Under this model, the software product is often released in two editions. The community edition typically comes with fewer features, meets the users' basic needs and is made available to the community, which means it is completely free. The enterprise edition usually includes enhanced features as well as support, training services, documentation, consulting and more. Open Source projects using this model usually also have a dedicated software development team. This hybrid form of licensing makes it possible for customers to try out the software under an Open Source license before they decide whether it is worth buying [Let08]. Compared to traditional closed source software, this model provides more transparency and to some degree ensures the quality of the software. Tools intended for business purposes are often offered under this kind of business model, because business use requires higher quality and stability of the software as well as technical support and services.

Free and Open Source Software (FOSS)

Free and Open Source Software is often regarded as completely free to use, i.e. no enterprise solution is available and no additional functionality or other kinds of services are reserved. The software may be licensed under one or a combination of Open Source licenses [Ros05]. Multiple Open Source licensing provides different options for developers regarding redistribution.


Why Open Source?

As mentioned before, being Open Source has the potential to produce better software. In fact, it offers much more for users. Some of the benefits are listed below.

Openness and Transparency

Take record linkage tools as an example. As stated by Christen (2008), most commercial data integration and linkage systems are black boxes from the user's perspective [Chr08a]. The details of the software or the technology are hidden, i.e. users only know how to use the software, but not why it works or which technologies it includes. Furthermore, many of the commercial systems are specialized in a certain domain, targeting certain tasks such as business data integration or customer data de-duplication. This is the situation for most closed source software. Some new technologies may be widely used in such software, but it is difficult for users to learn about them. Open Source makes it possible for users to look inside the software. This may be very useful for training purposes and for skilled software developers.

Flexibility

The flexibility in the Open Source context is reflected by the ability to redistribute, which means that users may modify Open Source software to fit their own needs or integrate it into their own systems. Of course, redistribution must obey the terms of the license under which the software is published. Some Open Source software offers well documented code and an API (application programmer's interface), which make it easier to understand the software and to make changes to it.

Cost

Using Open Source software greatly reduces the cost of the software product. FOSS is completely free. Although Open Source software vendors try to sell their services, the cost is much lower than in the traditional closed model. Commercial software suites for business applications are sometimes very expensive, which is a particular burden for small and medium-sized companies. If the functions can also be realized by Open Source software, e.g. in the field of Business Intelligence (BI), many more companies will benefit from low-cost BI solutions [Hin07]. At the same time, Open Source software also brings problems, e.g. less long-term support and quality assurance. Here the commercial Open Source software model may compensate for those shortcomings.

Market Overview: Open Source DQ Tools

Open Source DQ tools are often related to Open Source business intelligence (BI) and data integration, which have made significant progress in the last several years. The development of Open Source data integration seems to have an impact on new entrants in the data quality market. Although DQ tools have been described as the target of the next Open Source data management space, their capabilities are still far behind those of established commercial vendors according to T. Friedman's report in this area [BF09]. The current market of Open Source tools related to data quality is heterogeneous. There are vendors following the commercial Open Source model, like Talend and SQL Power, which offer not only data quality solutions but also integration or reporting solutions. There are also some smaller pure Open Source projects such as DataCleaner and Fril. Following the major technologies of DQM discussed in section 2.3, we decided to concentrate on four groups of data quality tools: Data Profiling Tools, ETL (extract, transform, load) Tools, Data Cleansing Tools and Record Linkage Tools. There are other Open Source tools which are also related to the data quality area, such as Data Mining Tools, BRMS (Business Rule Management Systems), Auditing Tools and so forth. For the reason stated in section 2.3, no details are provided for them.

Chapter 3 Methodology
The case studies of Open Source data quality software are based on the three major technologies introduced in Section 2.3 on page 10. The evaluation methodology is discussed in this chapter. Software evaluation is often considered an extensive and complex task, since there are miscellaneous perspectives and quality factors within this field. A comprehensive evaluation procedure should not only include systematic testing in a scientific way, but also reflect users' opinions of the software, e.g. by surveys. However, the evaluation criteria of this thesis concentrate on two aspects, namely the Open Source qualification and the data quality functionality. The evaluation of other aspects such as flexibility, reliability and stability requires a professional evaluation procedure and therefore receives less attention. Nevertheless, the qualification in such aspects may still be indicated indirectly in the evaluation result. Due to the inherent variety of the selected tools, this methodology also contains an introduction to their characteristics and working principles. Note that for tools using the commercial Open Source model, for which a commercial version of the software may also be available (see section 2.4.2 on page 18), the evaluation is restricted to the community version.



Evaluation Criteria
Open Source Qualification

For the evaluation of Open Source qualification, we follow the criteria proposed by van den Berg (2005) [vdB05], which were derived by collecting and synthesizing the available Open Source software evaluation models. Modifications have been made to suit the conditions of this thesis. The approaches include visiting the project's website and observing the activities as well as reading related documentation or articles. The evaluation of the Open Source qualification focuses on the following aspects: license, release activity, community, documentation and user support.


License

As mentioned in section 2.4.1 on page 17, the Open Source license plays an important role in the Open Source culture. It provides flexibility concerning modification and redistribution, as well as restrictions on them.

Release Activity

The update frequency of a software product reflects its liveliness to some degree. A new release may improve the software by fixing bugs and adding new features. All release activities together form the road of progress of the software. This is assessed by checking the release activities and the change log available for the software. The latest versions referred to in this thesis are those released up to June 15, 2010.

Community

Another thing special to Open Source software is its community, which consists of the people who participate in the project. It is the community that does most of the testing and maintains the Open Source project. The community's activity is often expressed through discussions in forums, so visiting the project's forum is one way to evaluate the community. The community forum is also the place where users get help, report bugs or receive other kinds of support.

Documentation and User Support

There are usually two types of documentation. The first is the user guide or manual, which tells users how to install and use the software. Additional materials like webinars (Web-based seminars) or on-line tutorials also belong to this type. The other type of documentation concerns the development of the software. Its most common form is the API (application programmer's interface) documentation, which, similar to the Java API, may include specifications of routines, data structures, object classes and protocols.


Data Quality Functionality

The evaluation of the functionality criteria is carried out by two approaches: working principles and general functionality, and sample task evaluation. The criteria are divided into several subcriteria. The result of the functionality evaluation contains two parts: each tool's own qualification as revealed by the given description, and the comparison of the tools according to the subcriteria.

Approaches

Working Principles and General Functionality

This part of the evaluation relies on the documentation as well as on a general-purpose test of each evaluated tool. The main purpose is to describe each tool's core features, major functionality and working principles. It is difficult to compare the characteristics of the tools directly. This part therefore intends to give readers an overview and a general idea of the way of working and the capabilities of the evaluated tool.

Sample Task Evaluation

This evaluation focuses on the real usage of each tool. Each case study contains one or several sample tasks. The important configuration steps are presented by screen shots and descriptions. The same tasks are executed by the tools of each technology, so the performance of the tools can be compared on the same basis. Detailed information about the sample task evaluation is given in section 3.4 on page 25.

Subcriteria

Ease-of-use: The ease-of-use criterion may be reflected in many ways, for example in how easy it is to build a connection to the target data set or to configure the tool for the tasks. This subcriterion can also reflect the user-friendliness or usability resulting from the software design. Note that the evaluation of this criterion is mainly based on the author's experience with the evaluated tool.

Flexibility: We evaluate the flexibility by examining three aspects. The first is operational flexibility in terms of the requirements on the running environment of the evaluated tools. The second is input and output flexibility, i.e. the restrictions on the input file format and the output capability. These may have a large influence on the scope of application and further usage of the tool. The last one is user tunability. A tool that provides more configuration options for different tasks is considered more flexible.

Functionality: The functionality qualification of each tool is based on the author's opinion and experience. The range of functionality provided by each tool and its power are taken into consideration.

The evaluation of this subcriterion is based on the sample task evaluation result obtained by each sample tool.





Tool Selection

Selection Criteria

As mentioned previously, a number of Open Source tools related to data quality are available. The selection process is mainly based on Internet searches, supplemented by information provided in related articles. On the one hand, there is not much related research in this field, and what exists is usually not available free of charge. On the other hand, one of the most important qualifications of Open Source software is its openness and transparency to users. The most common approach is therefore to search the Internet. We collect all potential candidates, examine them one by one and then eliminate the inappropriate ones. The criteria for selection are specified below.

Open Source Qualification

1. The software and its source code should be available. This criterion is set because we found that even though some tools claim to be Open Source, their code or their products turn out not to be publicly available. Those tools do not meet the basic requirement for Open Source software and are therefore eliminated.

2. The latest release of the software should not be too old. Specifically, a candidate whose latest release is more than 2 years old is eliminated, because it is considered to be no longer under development or no longer active.

The software must have functionality that is relevant to at least one of the three technologies discussed in section 2.3 on page 10. This applies regardless of whether the software's original purpose or main functionality targets data quality. For instance, Talend Open Studio is an ETL tool, but it is acceptable because it has data cleansing functionality.

Selection Result

The full list of candidates and detailed information about the elimination can be found in Appendix A. As a result of the selection procedure, 6 out of 10 tools are selected. DataCleaner and Talend Open Profiler (TOP) are Open Source data quality tools that focus on data profiling. SQL Power DQguru, formerly called SQL Power MatchMaker, is a data cleansing and MDM tool. Talend Open Studio (TOS) is an Open Source ETL (extraction, transformation and loading) product which also provides data quality components. In this thesis, it is treated as a data cleansing tool. TOP and TOS are both from Talend, an Open Source software vendor providing data solutions; they are two tools with different functionality. Fril and Febrl are two Open Source record linkage tools. The selection result and the version of each tool installed for the evaluation are shown in Table 3.2.1.
Software                Version
DataCleaner             1.5.4
Talend Open Profiler    4.0.1
Talend Open Studio      4.0.1
SQL DQguru              0.9.7
Fril                    2.1.4
Febrl                   0.4.1

Table 3.2.1: Selection result

Technical Environment and Data Sources

Technical Environment and Database Systems

Operating system: Windows XP Professional SP3

Python version installed: 2.5.4

Java(TM) SE Runtime Environment: build 1.6.0_20-b02

Version of Database systems: MySQL 5.1, Apache Derby release


Short description of the two database systems

Both MySQL and Apache Derby are Open Source relational database management systems (RDBMS). Although Apache Derby is not as popular as MySQL, it is commonly regarded as a popular embedded Java database and is noted for its small footprint and installation-free deployment. The community version of MySQL is licensed under GPL 2.0, while Apache Derby is licensed under Apache License 2.0. We use the client/server mode of the Derby database. JDBC (Java Database Connectivity) is used for all connections to both database systems, although ODBC might also be supported under some circumstances. The two JDBC drivers used are com.mysql.jdbc.Driver and org.apache.derby.jdbc.ClientDriver, respectively. The ability to connect to the two database systems and run tasks against them may reflect the compatibility and flexibility of the software to some degree, although the difference in the results might be minor. Unless otherwise specified, the sample task evaluation (see section 3.4) performed the same on both database systems.


Data Source for Sample Task Evaluation

Two data sources are used for the evaluation. All profiling and cleansing tasks are performed against both the MySQL and the Apache Derby version of the foodmart database. Record linkage tasks are performed against CSV files of sample customer tables. A short description is given below.

Foodmart database

This database is available on the Internet in both MySQL and Apache Derby versions. The data set is artificially generated and therefore has almost no real meaning, and it contains very little data with the data quality problems needed for the evaluation. It is used as a platform instead: for testing purposes, erroneous data and anomalies have to be added to the database.

Customer table CSV File

The customer data used for the evaluation are modified from the sample data files that are shipped with the record linkage tool Febrl. All data are synthetic and are generated from 500 original records, which stand for 500 different real world entities. Each record has 2 to 4 duplicates, which may differ from the original record in various ways: similar values in the same column, missing information, etc. Table 3.3.1 shows a portion of the data set as an example.

Column      Values (in row order)
cust_id     31810, 31811, 33391, 33390, 33391
fname       Jamse, James, Thomas, Tom, Thomas
lname       Sorella, Sorella, Miles, Miles, Mils
str         27, 27, 5, 5
address1    Alberrga Str., Alberga Street, Mainoru Place, Manoru, Mainoru Place
city        Alphington, Alphingten, Red Hill, Red Hlll
postcode    5114, 5114, 3250, 8750, 3250
birthdate   19760524, 19260524, 19360107, 19860707, 19860107
...         ...

Table 3.3.1: Sample duplicate data


Sample Task Evaluation

Several functions of data quality tools are selected to build the sample tasks. As mentioned previously, the purpose is to compare the configuration procedure and the results on the same basis. The approach is to provide screenshots of the important steps and of the results, together with descriptions. As mentioned in section 3.3.2, the testing data set has almost no real meaning. So rather than focusing on improving real data quality, we focus on how such data quality functions are realized. Therefore, the tasks are usually independent of each other rather than forming a complete, systematic DQM process. The tasks are designed to be simple to implement and functionality-oriented. For example, when we implement record linkage tasks, we do not focus on data transformation and standardization, which are discussed in detail elsewhere, although the tools have such functionality and data standardization is usually carried out before record linkage. Instead, the two data sets to be linked have already been transformed into a consistent format. The performance, results and implementation problems are presented after the sample task evaluation.

Data Profiling Tasks

Task 1: simple column statistics

Key words: value frequency analysis, simple statistics
Related DQ dimension: non-redundancy, completeness
Potential target data types: Nominal, Numerical, Time-related, Others
Database information about the testing related data:

Table       Column      Size    Nullable
customer    address1    30      true

Description: This task is a simple profiling operation. The goal is to provide general statistical information about the target data set based on its number of records and values. Data profiling tools should be able to compute statistics about the data set, e.g. the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, and the number of blank fields.

Task 2: domain integrity

Key words: domain integrity, attribute domain constraint
Related DQ dimension: syntactic accuracy
Potential target data types: Nominal
Database information about the testing related data:

Table       Column            Size    Nullable
customer    state_province    30      true

Description: The purpose of this task is to evaluate the capability of each data profiling tool to validate domain integrity. This aspect is also referred to as an attribute domain constraint. A set of data, e.g. provinces of a country, colors or even people's names, may have a certain value domain. Any value outside this domain is considered irregular or incorrect. In the foodmart database, customers are from 3 countries: the USA, Canada and Mexico. Values in the column state_province should belong to 3 domains with altogether 106 valid values:

1. The United States Postal Service state abbreviations: 62 distinct values [Ser10].
2. Canadian Provinces and Territories Names and Abbreviations: 13 distinct values [Pos10].
3. Political divisions of Mexico: 31 distinct values.

Task 3: pattern analysis

Key words: pattern conformance, consistency of presenting data
Potential target data types: Nominal, Time-related
Related DQ dimension: Syntactic Accuracy, Pragmatic Data Quality
Database information about the testing related data:

Table       Column    Size    Nullable
customer    phone1    30      false

Description: Some data appear in different patterns or formats. For the sake of consistent presentation, they need to be unified. A portion of the sample erroneous data is shown in Figure 3.4.1. The purpose of this task is to identify all patterns that exist in the data set, so that a solution for unifying the presentation of phone numbers can be worked out. The follow-up data standardization task is designed as Task 5.

Figure 3.4.1: Sample erroneous data
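The three profiling tasks above can be illustrated with a minimal Python sketch. It is not how any of the evaluated tools is implemented; the sample values, the function names and the masking convention (digits become 9, uppercase letters A, lowercase letters a) are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical sample values; None stands for an SQL NULL.
address1 = ["5 Main St", "5 Main St", "12 Oak Ave", None, ""]
state_province = ["CA", "BC", "XX", "DF"]
phone1 = ["123-456-7890", "(123) 456 7890", "123-456-7890", "12a-456-7890"]


def column_statistics(values):
    """Task 1: simple statistics over a single column."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "rows": len(values),
        "null": len(values) - len(non_null),
        "blank": sum(1 for v in non_null if v.strip() == ""),
        "distinct": len(freq),                                # different values
        "unique": sum(1 for c in freq.values() if c == 1),    # values occurring once
        "duplicate": sum(1 for c in freq.values() if c > 1),  # values occurring repeatedly
    }


def domain_violations(values, domain):
    """Task 2: values that lie outside the set of valid values."""
    return [v for v in values if v not in domain]


def pattern_frequencies(values):
    """Task 3: frequency of each character-class pattern."""
    def mask(value):
        value = re.sub(r"[0-9]", "9", value)
        value = re.sub(r"[A-Z]", "A", value)
        return re.sub(r"[a-z]", "a", value)
    return Counter(mask(v) for v in values)


print(column_statistics(address1))
print(domain_violations(state_province, {"CA", "BC", "DF"}))
print(pattern_frequencies(phone1))
```

In this sketch the counts for distinct, unique and duplicate values refer to values rather than records, and the domain is passed as a plain set, which in practice would hold all 106 valid abbreviations.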


Task 4: DQ rule analysis

Key words: data quality rule analysis
Related DQ dimension: Semantic Accuracy, Pragmatic Data Quality
Potential target data types: Nominal, Numerical, Time-related, Others
Database information about the testing related data:

Table       Column            Size    Nullable
customer    gender            10      false
customer    salutation        5       true
customer    state_province    30      true
store       store_state       30      true

Description: Data should obey the rules of the real world; DQ rules are used to validate such data. DQ rules can also be used to find interesting records. In this sample task, 2 DQ rules are designed. DQ tools should be able to produce statistics and the corresponding data according to those rules.

1. If the gender suggests that the customer is male, then the salutation should be 'Mr.'. The violating records are actually more interesting, so the corresponding rule clause is shown as follows.

SQL Rule Clause

    GENDER = 'M' AND SALUTATION != 'MR.'

2. The second rule is based on the assumption that, for each sales record, if the purchase location (province/state of the store) of a customer is not consistent with his/her registered location (the province/state of the customer), the record is potentially inaccurate. For example, if the customer has moved to another state/province, this might be reflected by his/her purchase location. Because the sales table doesn't contain any location information, this rule analysis requires the comparison of records from three different tables (the relation is illustrated in Figure 3.4.2). The corresponding rule clause is shown

Figure 3.4.2: Relation of tables
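As a rough illustration of the two rules, the following sketch runs them against an in-memory SQLite database using Python's standard sqlite3 module. The table layout is a simplified, hypothetical stand-in for the foodmart tables (the sales table is simply called sales here), and the sample rows are invented; only the column names gender, salutation, state_province and store_state are taken from the task description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, gender TEXT,
                           salutation TEXT, state_province TEXT);
    CREATE TABLE store (store_id INTEGER, store_state TEXT);
    CREATE TABLE sales (customer_id INTEGER, store_id INTEGER);
    INSERT INTO customer VALUES (1, 'M', 'Mr.', 'CA');
    INSERT INTO customer VALUES (2, 'M', 'Ms.', 'WA');
    INSERT INTO customer VALUES (3, 'F', 'Ms.', 'BC');
    INSERT INTO store VALUES (10, 'CA');
    INSERT INTO store VALUES (11, 'OR');
    INSERT INTO sales VALUES (1, 10);
    INSERT INTO sales VALUES (2, 11);
    INSERT INTO sales VALUES (3, 11);
""")

# Rule 1: male customers whose salutation is not 'Mr.' (the violating records).
# The sample data spells the salutation 'Mr.'; string comparison is case-sensitive.
rule1 = conn.execute(
    "SELECT customer_id FROM customer "
    "WHERE gender = 'M' AND salutation != 'Mr.' ORDER BY customer_id"
).fetchall()

# Rule 2: sales whose store state differs from the customer's registered state.
# This needs a join across all three tables.
rule2 = conn.execute("""
    SELECT s.customer_id, c.state_province, st.store_state
    FROM sales s
    JOIN customer c ON s.customer_id = c.customer_id
    JOIN store st ON s.store_id = st.store_id
    WHERE c.state_province != st.store_state
    ORDER BY s.customer_id
""").fetchall()

print(rule1)
print(rule2)
```

The second query makes the cross-table nature of the rule visible: the sales table contributes only the keys, while the values being compared live in the customer and store tables.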



Data Cleansing Task

Task 5: transformation and cleansing

Key words: pattern conformance, consistency of data presentation, data transformation
Related DQ dimension: Semantic Accuracy, Pragmatic Data Quality
Potential target data types: Nominal, Time-related
Database information about the testing related data:

Table       Column    Type       Size    Nullable
customer    phone1    VARCHAR    30      false

Description: This task aims at resolving the pattern inconsistency discussed in Task 3. The goal is to standardize all 10-digit phone numbers, which appear in different formats, into one unified format, namely 999-999-9999. Data that contain characters other than digits, '-', ' ', '(' and ')', or that cannot be transformed, are considered problematic. For such data, an asterisk (*) is added at the beginning of the original value to mark it for further cleansing activities.

Record Linkage Task

Task 6: record linkage

Key words: de-duplication, record linkage

Testing related data set: customer table CSV file (see section 3.3.2 on page 24)

Description: This task aims at linking records that refer to the same real world entity. The record linkage tools should be able to run the record linkage process with different parameters, such as the indexing method, the comparison function and so on, in order to detect links between two data sets. Records that represent the same real world entity should be linked, and records that represent different real world entities should not. Additionally, the tools should be able to provide the basis for their decisions, and the result should be reusable. Obtaining optimum record linkage performance usually requires a large number of trials and analyses.
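To make the roles of the indexing method and the comparison function more concrete, here is a minimal de-duplication sketch in Python. It is an assumption-laden illustration, not the algorithm of Fril or Febrl: the blocking key (first letter of lname), the use of difflib.SequenceMatcher as comparison function and the threshold 0.85 are all chosen for illustration. The records reuse a few values from Table 3.3.1.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"cust_id": "31810", "fname": "Jamse", "lname": "Sorella", "postcode": "5114"},
    {"cust_id": "31811", "fname": "James", "lname": "Sorella", "postcode": "5114"},
    {"cust_id": "33391", "fname": "Thomas", "lname": "Miles", "postcode": "3250"},
    {"cust_id": "33390", "fname": "Tom", "lname": "Miles", "postcode": "8750"},
]


def blocking_key(record):
    """Indexing step (assumed): only records sharing this key are compared."""
    return record["lname"][:1].upper()


def candidate_pairs(recs):
    """Group records into blocks and build pairs within each block."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


def similarity(left, right):
    """Comparison step: average string similarity of the name fields."""
    fields = ("fname", "lname")
    return sum(SequenceMatcher(None, left[f], right[f]).ratio()
               for f in fields) / len(fields)


def link(recs, threshold=0.85):
    """Return (id, id, score) for every candidate pair above the threshold."""
    return [(l["cust_id"], r["cust_id"], round(similarity(l, r), 2))
            for l, r in candidate_pairs(recs)
            if similarity(l, r) >= threshold]


print(len(candidate_pairs(records)), "candidate pairs instead of 6")
print(link(records))
```

Returning the score together with each link corresponds to the requirement that a tool should provide the basis for its decision. A blocking key this crude would miss duplicates whose last name is misspelled in the first letter, which is one reason why several trials with different parameters are usually needed.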


Chapter 4 Evaluation and Results

Talend Open Profiler

Open Source Qualification

Talend Open Profiler (TOP) is licensed under GPL 2.0. The project was launched in June 2008. Since the first version 1.0.0RC1, TOP has had over 30 releases up to the latest version 4.0.1, released on April 28, 2010. The source code is available online in a Subversion repository. Subversion is an open-source revision control system which enables developers to maintain the source code as well as other files, e.g. web pages and documentation [Wik10]. TOP shows a comparatively high activity level, which is reflected not only in its release frequency but also in its forum discussions and its bug finding and fixing process. The project is run by the software vendor Talend, which has a dedicated software development team that is responsible for and guarantees the development and maintenance of the project. Documentation including a software introduction and a tutorial is available. The tutorial provides step-by-step configuration instructions and leads the user through every feature of the product.

Working Principles and General Functionality

Talend Open Profiler is based on the Eclipse Rich Client Platform (RCP), which is a Java-based software framework. It runs on any operating system that supports the Java virtual machine. Some components are also provided based on the programming language Perl. TOP provides dedicated connections to various database systems, e.g. MySQL, Oracle, PostgreSQL, etc. It also supports generic ODBC and JDBC connections if no dedicated connection is available [Tal09]. As a dedicated connection to the Apache Derby database is not yet supported by the current version of TOP, the generic JDBC connection is used for the evaluation. Data profiling functionality is mainly provided through the following elements.


Patterns

Patterns are used for validating strings. They make it possible to define the content or structure of String values by specifying a regular expression (Regex, see section 2.3.1 on page 12). A number of patterns are available by default. Users can also visit the community website to find more patterns that have been created and shared by other users or developers.

Indicators

In the TOP context, indicators are the operation options targeting different profiling purposes. Simply put, choosing the indicators means choosing how the profiling process is carried out. System indicators are predefined and ready to use, but users can define their own indicators as well as change the system indicators. The main part of an indicator is its SQL template, i.e. an SQL statement with parameters, such as <%=__COLUMN_NAMES__%> in the following example.

SQL Statement

    SELECT COUNT(*) FROM
      (SELECT <%=__COLUMN_NAMES__%>, COUNT(*) mycount
       FROM <%=__TABLE_NAME__%> m <%=__WHERE_CLAUSE__%>
       GROUP BY <%=__COLUMN_NAMES__%>
       HAVING mycount > 1) AS myquery

SQL Rules

SQL rules enable Data Rule Analysis in TOP. Please refer to section 2.3.1 on page 12 for detailed information about Data Rule Analysis.

Sample Task Evaluation

Task 1: simple column statistics

1. Preparation
The indicator Simple Statistics is used directly in this task.

2. Result
The result is shown in a table (see Figure 4.1.1); a histogram (not shown) is also generated to give users a more intuitive view. Note: Duplicate here refers to identical values. Distinct Count, Unique Count and Duplicate Count are counts of values rather than counts of all qualifying records.

Figure 4.1.1: Result of simple column statistics

3. Remarks
(a) User-tunable options are provided for more complex analysis: users can modify an existing indicator, e.g. case-insensitivity options, threshold definitions, etc.
(b) The data values or rows behind any result can be viewed, and the corresponding SQL statement is generated. This feature provides a more flexible way to reuse the profiling result. The sample SQL statement generated for Duplicate Count is shown below.

SQL Statement

    SELECT * FROM foodmart.customer
    WHERE address1 IN
      (SELECT address1 FROM
        (SELECT address1, COUNT(*) mycount
         FROM foodmart.customer m
         GROUP BY address1
         HAVING mycount > 1) AS myquery)

(c) TOP is able to connect to the Derby database using the generic JDBC connection with the driver org.apache.derby.jdbc.ClientDriver. Profiling tasks run against the database without problems. However, the database explorer component seems to have compatibility problems: the SQL statement it generates is correct, but running the statement produces an error. The problem has not been solved; we assume it is caused by the generic JDBC connection.

Task 2: domain integrity

1. Preparation
We use the Pattern Matching feature for this task. Unlike DataCleaner with its Dictionary feature (see section 4.2.2 on page 35), Talend does not support importing a domain. One way to accomplish the task is to define a new Pattern using Regex. Since there are 106 valid values, the Regex for this task is rather long. A shortened version of the Regex used is shown below.

Regex (shortened)

    ^(AK|AL|AR|...|NL|NT|NS|NU|ON|PE|QC|SK|YT)$

Description (based on the Basic Regular Expressions (BRE) syntax)

Metacharacter    Description
^                Matches the starting position of the string.
$                Matches the ending position of the string.
|                Known as the choice operator; matches either the expression before or the expression after the operator, e.g. all the valid values (i.e. AK, AL, AR, ..., YT) are separated by a |.

2. Result
As shown in Figure 4.1.2, values that do not match those in the integrity domain can be detected. It is convenient to retrieve either the valid or the invalid rows or values from the database.

Figure 4.1.2: Result of domain integrity validation

3. Remarks
It would be more convenient if a feature like Dictionary were supported. An alternative way of implementing this task would be a user-defined Java indicator.

Task 3: pattern analysis

1. Preparation
The indicator Pattern Frequency Statistics is used.

2. Result
The result is shown in Figure 4.1.3. For reference, a portion of the sample erroneous data in the original data set is shown in Figure 3.4.1 on page 26.

Figure 4.1.3: Result of pattern frequency statistics

3. Remarks
As shown in the result, all digits are represented by 9, all uppercase letters by A, and all lowercase letters by a; other characters are represented by themselves.

Task 4: DQ rule analysis

Please refer to section 3.4.1 on page 27 for detailed information.

1. Preparation
(a) DQ rule configuration: The first DQ rule is prepared by completing the Where Clause. The second rule, named region consistency, needs a cross-table validation. The Join Condition must be set accordingly, as it indicates how the tables are connected with each other (see Figure 4.1.4).

Figure 4.1.4: DQ rule configuration

2. Result
The results of both DQ rule analyses are shown in Figure 4.1.5.

Figure 4.1.5: Result of DQ rule analysis

In order to show the violating records, the following SQL statement can be used (generated automatically by TOP).

SQL Statement

    SELECT sales_fact_dec_1998.*
    FROM foodmart.sales_fact_dec_1998 sales_fact_dec_1998
    JOIN foodmart.customer customer
      ON (sales_fact_dec_1998.customer_id = customer.customer_id)
    JOIN store
      ON (sales_fact_dec_1998.store_id = store.store_id)
    WHERE (customer.state_province != store.store_state)

3. Remarks
The analysis could also be carried out by simply using SQL statements, but the workload is much higher, especially for complicated DQ rules. It is more intuitive and time-saving to have such DQ rule features in data quality tools.

DataCleaner

Open Source Qualification

The development of DataCleaner is handled by the DataCleaner community, supported by the Open Source community. The first version 1.0 was released on April 12, 2008. As of June 15, 2010, DataCleaner has had 9 releases in total. The latest one is version 1.5.4, released on May 15, 2010. DataCleaner is distributed under the LGPL license. Documentation and an online wiki containing both a software introduction and a tutorial are available for DataCleaner; some parts of the documentation, e.g. the API, are still under development. Users can access its source code through the Subversion repository at eobjects.org.



Working Principles and General Functionality

DataCleaner requires a Java Runtime Environment (JRE) version 5.0 or higher. It runs both as a desktop application and in command-line mode on Microsoft Windows, Mac OS X and Linux. It supports only two database types by default, but it can support any database with a JDBC driver after a simple driver installation. Connections to both testing databases were established successfully. The software supports a number of optimization techniques that can be combined in order to make executions quicker and more efficient [Sr09]. For example, the Max connections option enables independent queries to run simultaneously, providing better CPU utilization. DataCleaner has three operation modes: Profile, Validate and Compare.

Profile

In the Profile mode, Profile is the name of the object that can be added to form a profiling task.

Validate

Validate mode provides several validation functions called validation rules, specifically JavaScript evaluation, Dictionary lookup, Value range evaluation, Regex validation and Not-null check.

Dictionary

A dictionary contains all the valid data values for profiling. Using a dictionary allows users to define a domain constraint for a data set.

Regexes

Regexes, short for Regular Expressions, provide a concise and flexible means of matching strings of text with respect to content or structure (see section 2.3.1 on page 12).

Compare

In the Compare mode, columns, tables and schemas of different data sets can be compared. It is one approach to consistency checking.

Sample Task Evaluation

Task 1: simple column statistics

1. Preparation
The profile Standard measures is used.

2. Result
The result is shown in Figure 4.2.1. The number of rows, the number of NULL values and the number of empty values are shown in the table. The highest and lowest values are meaningless in this case.

Figure 4.2.1: Result of simple column statistics

3. Remarks
DataCleaner does not support counting duplicates, distinct values or unique values. Nor is it possible to modify the profile to deselect unnecessary components, such as the highest and lowest values in this case. For comparison, see section 4.1.3 on page 30.

Task 2: domain integrity

1. Preparation
(a) Dictionary creation: a text file containing all the valid values.
(b) The validation rule Dictionary lookup is used.

2. Result
As shown in Figure 4.2.2, all values that do not match those in the integrity domain are detected. Compared to the result obtained with Talend, DataCleaner provides a clearer view, as all the invalid values are listed.

Figure 4.2.2: Result of domain integrity analysis

3. Remarks
Dictionary lookup is quite handy for this task. Instead of a text file, the integrity domain can also be defined using a column in a database. DataCleaner also provides other approaches for this task that are similar to those of Talend Open Profiler, e.g. JavaScript or Regex.

Task 3: pattern analysis

1. Preparation
The profile Pattern finder is used in this task.

2. Result
The result of this task is shown in Figure 4.2.3. For reference, a portion of the original values can be viewed in Figure 3.4.1 on page 26.

Figure 4.2.3: Result of pattern analysis

3. Remarks
As shown in the result, all digits are represented by 9, and all letters, regardless of case, are represented by a. We assume that if a field contains both digits and letters, each character is represented by ?. No explanation of this behavior has been found in the documentation.

Task 4: DQ rule analysis

No DQ rule analysis functionality is supported by the current version of DataCleaner.


Talend Open Studio

Open Source Qualication

As mentioned previously, Talend Open Studio (TOS) and Talend Open Profiler (TOP) are both provided by Talend. The TOS project was launched in October 2006, earlier than TOP. TOS has had over 60 releases, the latest one being 4.0.1, released on April 28, 2010. The Open Source qualification of TOS is similar to that of TOP (see section 4.1.1 on page 29).


Working Principles and General Functionality

Like Talend Open Profiler, TOS is based on Eclipse RCP. It runs on any operating system that supports the Java virtual machine. It is announced to be compatible with most existing database systems. As mentioned before, TOS is basically an open-source ETL tool. We skip all the data integration functionality, which may be considered the core of the product, and focus on its data quality components. With the components presented below, TOS is able to provide only very basic record linkage functionality compared with the record linkage tools presented later. The introduction is based on the Reference Guide [Tal10].

tMap

tMap is an advanced component integrated as a plugin of Talend Open Studio. tMap transforms and routes data from single or multiple sources to single or multiple destinations. The mapping provides many predefined, ready-to-use functions for the transformation of different data types. Users can also define their own transformation functions in Java code.

tFuzzyMatch

tFuzzyMatch compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data displaying the distance. It helps to ensure the data quality of source data against a reference data source.

tUniqRow

tUniqRow compares entries and removes the first encountered duplicate from the input flow.

Output ability

Users can choose from various output options, such as updating the original data source with the obtained result or saving the result into another data source that TOS supports. To export a project for reuse, users can either save the project as a Talend object or export the Java code of the project. Users can also generate an HTML document containing the details and description of the project.



Sample Task Evaluation

Task 5: transformation and cleansing

The task description can be found in section 3.4.2 on page 28.

1. Preparation
(a) Create Java code for a user-defined transformation function retainNum. The code of the class Retain_Char containing the function is shown in Java Code 4.3.1.
(b) Configure the workflow and tMap accordingly by adding the transformation function (see Figure 4.3.1). The transformation result is saved into another data source called phonename.

Figure 4.3.1: Workflow preparation

Java Code 4.3.1

    public class Retain_Char {
        /*
         * retainNum: this function is used to standardize phone
         * number formats
         * {talendTypes} String
         * {Category} User Defined
         */
        public static String retainNum(String origin) {
            if (origin.length() < 1) {
                return "";
            }
            String newS = new String();
            // getting rid of all unnecessary symbols or characters
            newS = origin.replace("-", "");
            newS = newS.replace(" ", "");
            newS = newS.replace("(", "");
            newS = newS.replace(")", "");
            // if the string starts with "0", delete the "0"
            if (newS.startsWith("0")) {
                newS = newS.substring(1);
            }
            // if the phone number has 10 digits, output the
            // standardized format "123-456-7890"
            if (newS.length() == 10 && newS.matches("[0-9]{10}")) {
                newS = newS.substring(0, 3).concat("-")
                        .concat(newS.substring(3, 6)).concat("-")
                        .concat(newS.substring(6, 10));
                return newS;
            }
            // if not, output the original phone number and mark
            // the phone number by adding a "*" at the beginning.
            else
                return "*".concat(origin);
        }
    }

2. Result
In order to provide an overview of the result, both the original and the resulting phone columns are profiled again (see Task 3 in section 3.4.1 on page 26). The two pattern analysis results are shown in Figure 4.3.2. The upper table contains the result for the original data set, the lower one the result for the phone column after transformation. As shown in the result, all valid values are transformed into the standard pattern, leaving the invalid ones with an asterisk (*) added at the beginning.

Figure 4.3.2: Result of cleansing: comparison of the profiling result

A small piece of the original data set and the corresponding output is illustrated in Figure 4.3.3.

3. Remarks
Although TOS already provides several predefined functions for String operations, data cleansing tasks are mostly application-oriented and differ from case to case. User-defined transformation functions enable users to apply any transformation as long as it can be implemented in Java.
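To try the cleansing rule on a few inputs without a Talend installation, the logic of Java Code 4.3.1 can be mirrored in a short Python sketch. The function name retain_num and the sample inputs are assumptions made for illustration; the steps follow the Java listing (strip '-', ' ', '(' and ')', drop one leading 0, format 10 digits as 999-999-9999, otherwise prefix the original value with an asterisk).

```python
import re


def retain_num(origin):
    """Standardize a phone number or mark it with a leading '*'."""
    if len(origin) < 1:
        return ""
    digits = origin
    # Remove the separator characters that are allowed in the input.
    for ch in "- ()":
        digits = digits.replace(ch, "")
    # Drop a single leading zero, as in the Java listing.
    if digits.startswith("0"):
        digits = digits[1:]
    # Exactly ten digits: emit the unified pattern 999-999-9999.
    if re.fullmatch(r"[0-9]{10}", digits):
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
    # Anything else is flagged for further cleansing.
    return "*" + origin


for sample in ["(123) 456 7890", "0123-456-7890", "12a-456-7890"]:
    print(sample, "->", retain_num(sample))
```

Keeping the original value behind the asterisk, rather than a partially cleaned one, means a later cleansing step still sees exactly what was stored.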



Figure 4.3.3: Comparison of the original data and the corresponding transformation result

SQL Power DQguru

Open Source Qualification

DQguru is licensed under GPL 3.0. It is the data cleansing product of the SQL POWER SOFTWARE Group. The initial Open Source release, 0.9.0, was published in November 2007. The current version is 0.9.7, released in March 2010. Several short demo videos are available free of charge on the website to help users learn some features of the software. However, the user guide is not free; it costs $49.00. Non-members may anonymously check out a read-only working copy of the source code from its Google Code project page.

Working Principles and General Functionality

DQguru is Java-based and is available in three versions: for Windows, Mac OS, and Unix/Generic. PostgreSQL, MySQL, SQL Server, Oracle and HSQLDB are currently supported. Connectivity to other databases, such as DB2/UDB, Apache Derby, Infobright and Eigenbase, has not been certified. Because of the lack of documentation for this tool, a more detailed description of its functionality cannot be provided. In our experience, the output capability of this tool is rather limited: DQguru only supports updating the original data set with the transformed result. An additional option is to add a CSV file as output.

Sample Task Evaluation

Task 5: transformation and cleansing

1. Preparation
Workflow configuration (see Figure 4.4.1).

2. Result
The result is similar to that obtained with Talend Open Studio (see section 4.3.3 on page 39). Using the workflow shown in Figure 4.4.1, only the transformation of the valid phone numbers into the unified pattern succeeded. Due to the lack of documentation and the limited number of transformation functions, attempts at further operations were not successful.

3. Remarks
The configuration process is less complex than that of TOS. The interface and configuration of DQguru are very intuitive and user-friendly. However, the lack of user-tunable functionality makes complex data cleansing tasks very difficult.



Figure 4.4.1: Workflow preparation

Open Source Qualification

The Fril project started in 2008, with version 1.0 released in October. So far, Fril has had 14 releases in total. The latest release is version 2.1.4, released on December 22, 2009. The product is licensed under MPL 1.1/GPL 2.0/LGPL 2.1, which is quite flexible concerning redistribution. Documentation, including a software introduction and a tutorial, is available. The tutorial provides step-by-step instructions for linkage configuration and describes the features available in Fril; a short introduction to some comparison functions and their usage is also included. The source distribution is available for download. It contains the Java source code of the Fril framework along with all the libraries required for byte-code generation.


Working Principles and General Functionality

Join condition

The join condition in Fril determines how records are compared. Assigning a join condition includes choosing the distance metric (also known as the comparison function) and setting all the necessary parameters (see section 2.3.3 on page 15).

Join method

The join method in Fril has the same meaning as the blocking/searching method (see section 2.3.3). Four options are available: nested loop join, sorted neighborhood method, blocking and search method, and SVM join [CL01]. For details on the join methods, please refer to [JLX+08].

Results saving configuration

This configuration enables an enhancement method for the record linkage result (see Figure 4.5.3).
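To illustrate why the choice of join method matters, the following minimal Python sketch contrasts the nested loop join with the sorted neighborhood method. It is not Fril's implementation (Fril is written in Java); the function names, the sort key, the window size and the sample records are all assumptions made for illustration only.

```python
from itertools import combinations


def nested_loop_pairs(records):
    """Nested loop join: every record is compared with every other record."""
    return set(combinations(range(len(records)), 2))


def sorted_neighborhood_pairs(records, key, window=3):
    """Sorted neighborhood: sort by a key, then only compare records
    that lie inside the same sliding window of the sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for start in range(len(order)):
        block = order[start:start + window]
        for a, b in combinations(block, 2):
            pairs.add((min(a, b), max(a, b)))
    return pairs


# Hypothetical sample data, not taken from the thesis data set.
records = [
    {"fname": "Aaron", "lname": "Breed"},
    {"fname": "Zoe", "lname": "Young"},
    {"fname": "Aaom", "lname": "Breed"},
    {"fname": "Mia", "lname": "Lee"},
    {"fname": "Noah", "lname": "Lee"},
]


def sort_key(record):
    # Assumed sorting key: last name first, then first name.
    return (record["lname"], record["fname"])


candidates = sorted_neighborhood_pairs(records, sort_key, window=2)
print(len(nested_loop_pairs(records)), "pairs with nested loop join")
print(len(candidates), "pairs with sorted neighborhood:", sorted(candidates))
```

The window size trades completeness for speed: a larger window produces more candidate pairs and is less likely to miss true matches whose sort keys differ because of typing errors.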


Fril is Java-based and can be run on any operating system that supports the Java virtual machine. By default it supports four kinds of data sources: CSV files, Excel files, text files, and databases with a JDBC driver. The databases supported by default are MS Access, MS SQL, MySQL, PostgreSQL and Oracle. By a simple modification of the file, it may be able to support any database with a JDBC driver. We managed to use Fril to access an Apache Derby database. The output is limited to CSV files.

Sample Task Evaluation

Task 6: record linkage

1. Preparation
(a) Set the join condition for each attribute. Assign the match weight and the empty value score for each attribute. The acceptance level is the threshold for the identification of duplicates. The settings for this task are shown in Figure 4.5.1.
(b) Join method configuration (see Figure 4.5.2).
(c) Results saving configuration (see Figure 4.5.3).

Figure 4.5.1: Join condition configuration
Figure 4.5.2: Join method configuration

2. Result and output
As a result, 493 linkages are identified by this implementation. Figure 4.5.4 shows the linkage result in the internal viewer of Fril, which offers features such as highlighting differences and filtering. A confidence value for each linked pair is also provided in the linkage result.
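The interplay of match weight, empty value score and acceptance level in the preparation step can be sketched as follows. This is only an illustration under stated assumptions: the weighted-sum formula, the weights, the acceptance level of 70 and the use of `difflib.SequenceMatcher` as a stand-in comparison function are ours and are not taken from Fril.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in comparison function returning a value between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def pair_score(rec_a, rec_b, conditions):
    """Weighted sum of attribute similarities (assumed scoring scheme).

    conditions maps attribute -> (match weight, empty value score).
    If either value is empty, the empty value score replaces the similarity.
    """
    total = 0.0
    for attr, (weight, empty_score) in conditions.items():
        a, b = rec_a.get(attr, ""), rec_b.get(attr, "")
        sim = empty_score if (not a or not b) else similarity(a, b)
        total += weight * sim
    return total


# Hypothetical settings: the weights sum to 100, so the score lies in 0..100.
CONDITIONS = {"fname": (40, 0.0), "lname": (40, 0.0), "phone": (20, 0.5)}
ACCEPTANCE_LEVEL = 70

a = {"fname": "Aaron", "lname": "Breed", "phone": "03 01452468"}
b = {"fname": "Aaom", "lname": "Breed", "phone": ""}
c = {"fname": "Zoe", "lname": "Young", "phone": "03 01452468"}

for left, right in ((a, b), (a, c)):
    score = pair_score(left, right, CONDITIONS)
    verdict = "linked" if score >= ACCEPTANCE_LEVEL else "not linked"
    print(left["fname"], right["fname"], round(score, 1), verdict)
```

Raising the acceptance level reduces false links at the cost of missing pairs with several small differences, which is why the threshold usually has to be tuned per data set.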

Figure 4.5.3: Results saving configuration
Figure 4.5.4: Sample linkage results

Additional: Merging technique for de-duplication

The reason for presenting this additional task is that the de-duplication mode of Fril is not exactly the same as the record linkage mode. The de-duplicated result is generated by a simple merging technique. As far as we could explore, and as confirmed by one of the developers of Fril, Fril only supports a simple algorithm for combining two linked records:

1. If the value of a field is missing in one record, the existing one is adopted.
2. If both fields are non-empty and have different values, the one that appears first is adopted. The sequence of records depends on the indexing and searching method employed.

Although this is not a truly intelligent algorithm, it could be a starting point for providing a merging feature in record linkage tools. An example is shown below. If the two records are:

fname   lname   date_of_birth   phone
Aaom    Breed   19945718
Aaron   Breed   19940718        03 01452468

By employing Fril's algorithm, the result would be:

Aaom    Breed   19945718        03 01452468

As Aaron is a valid value for names and 19945718 is not a valid birthdate, a better answer could be generated if a more advanced algorithm were employed, i.e.

Aaron   Breed   19940718        03 01452468

This could be done by adding integrity constraint functionality to the merging technique (see section 2.3.3).
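The two merging rules and the proposed improvement can be sketched in Python as follows. This is a minimal illustration, not Fril's code: the function names, the small set of known first names and the date check are assumptions, and the example assumes that the phone number is missing from the first record.

```python
import datetime

FIELDS = ["fname", "lname", "date_of_birth", "phone"]


def simple_merge(first, second):
    """Rule 1: a missing value is filled from the other record.
    Rule 2: on conflict, the value of the first record wins."""
    merged = {}
    for field in FIELDS:
        a, b = first.get(field, ""), second.get(field, "")
        merged[field] = a if a else b
    return merged


def is_valid_birthdate(value):
    """Integrity constraint: the value must be a real date in YYYYMMDD form."""
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.date(int(value[:4]), int(value[4:6]), int(value[6:]))
    except ValueError:
        return False
    return True


# Hypothetical integrity domain for first names (assumption for this sketch).
KNOWN_FIRST_NAMES = {"Aaron"}

CONSTRAINTS = {
    "fname": lambda v: v in KNOWN_FIRST_NAMES,
    "date_of_birth": is_valid_birthdate,
}


def constrained_merge(first, second, constraints=CONSTRAINTS):
    """Like simple_merge, but on conflict prefer the second value when
    only the second value satisfies the integrity constraint."""
    merged = simple_merge(first, second)
    for field, is_valid in constraints.items():
        a, b = first.get(field, ""), second.get(field, "")
        if a and b and a != b and not is_valid(a) and is_valid(b):
            merged[field] = b
    return merged


rec_1 = {"fname": "Aaom", "lname": "Breed",
         "date_of_birth": "19945718", "phone": ""}
rec_2 = {"fname": "Aaron", "lname": "Breed",
         "date_of_birth": "19940718", "phone": "03 01452468"}

print(simple_merge(rec_1, rec_2))
print(constrained_merge(rec_1, rec_2))
```

When both conflicting values pass or both fail the constraint, the sketch falls back to the simple rule, so the constrained variant never produces a worse result than the original one with respect to the given constraints.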



Open Source Qualification

The Febrl project is developed and maintained by the ANU (Australian National University) Data Mining Group. The first version of Febrl (version 0.1) was released in September 2002. So far, Febrl has had 6 releases in total; the latest one is 0.4.1, released in December 2008. The product is licensed under the ANU Open Source License, which is a FOSS license. A manual describing the software is available, as well as several research publications. Febrl is prototype software with no precompiled binary available. The source code is distributed along with the software itself.

Working Principles and General Functionality

Febrl is based on the programming language Python and associated extension libraries. It is cross-platform software, but due to differences among operating systems, small changes are required to run the software correctly. The installation requires Python 2.5, with additional modules and libraries needed for the graphical user interface (GUI) and the support vector machine (SVM) classification method [CL01]. Currently, only several text-based file formats are supported, including the CSV format, i.e. it provides no database connectivity. The output is also limited to CSV files. All information provided in this section is based on [Chr08b], [Chr08a] and [Chr06]; for further information, refer to these references.

Index

Febrl provides several indexing methods (see section 2.3.3 on page 15): FullIndex, SortingIndex, QGramIndex, CanopyIndex, StringMapIndex and SuffixArrayIndex (more information is provided in [Chr08b], chapter 8). Detailed configuration is supported, e.g. window size, encoding function for the selected field, etc.

Comparison methods

Febrl provides a large number of comparison functions (for more information see [Chr08b], chapter 9). Figure 4.6.2 presents a portion of them.

Classification methods

An introduction to classification methods is provided in section 2.3.3 on page 16. Febrl offers six different classification techniques, presented in Figure 4.6.3 on page 49 (for more information see [Chr08b], chapter 10).

Output

Four output files are available after the execution of each task:

1. The weight vector file is produced in the record pair comparison step. It contains the similarity value of every compared pair of attributes, obtained by the comparison function according to the configuration.
2. The histogram file is a simple text-based histogram like the one shown in Figure 4.6.4.
3. The match status file contains the summed matching weight for each linked pair. A match ID is the identifier for each linked record pair.
4. The match file copies the entire input data set and adds the generated match ID(s) at the end of each record that has been linked.
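The relationship between the weight vectors, the summed matching weights and the text-based histogram can be sketched in a few lines of Python. The sketch does not reproduce Febrl's file formats or its API; the function names, the bin width and the sample weight vectors are assumptions chosen for illustration.

```python
from collections import Counter


def summed_weights(weight_vectors):
    """Sum the per-attribute similarity values of every compared record pair."""
    return {pair: sum(vector) for pair, vector in weight_vectors.items()}


def text_histogram(values, bin_width=1.0, bar_char="*"):
    """Render a simple text histogram: one line per bin, one mark per pair."""
    bins = Counter(int(v // bin_width) for v in values)
    lines = []
    for b in range(min(bins), max(bins) + 1):
        lower_edge = b * bin_width
        lines.append(f"{lower_edge:5.1f} | {bar_char * bins[b]}")
    return "\n".join(lines)


# Hypothetical weight vectors: record pair -> similarity per compared attribute.
weight_vectors = {
    ("r1", "r2"): [1.0, 0.9, 0.8],
    ("r1", "r3"): [0.2, 0.0, 0.1],
    ("r4", "r5"): [1.0, 1.0, 0.6],
    ("r2", "r6"): [0.1, 0.3, 0.0],
}

sums = summed_weights(weight_vectors)
histogram = text_histogram(sums.values())
print(histogram)
```

In such a histogram, likely matches and likely non-matches tend to form two separate groups, and the gap between them is a natural place to look for a classification threshold.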



Sample Task Evaluation

Task 6: record linkage

1. Preparation
(a) Index method configuration (see Figure 4.6.1).
(b) Comparison method configuration and parameter assignment (see Figure 4.6.2).
(c) Classification configuration (see Figure 4.6.3).

Figure 4.6.1: Index method configuration
Figure 4.6.2: Comparison method configuration
Figure 4.6.3: Classification configuration

2. Result and output
The result is shown in the histogram that is automatically generated by Febrl (see Figure 4.6.4).

Figure 4.6.4: The histogram showing the record linkage result

Chapter 5 Summary

Comparison and Result

We made a brief comparison of the selected tools in each category according to the evaluation criteria. The results are shown briefly in three tables. Comments and conclusions are also presented, concentrating on the two aspects of the evaluation criteria in section 3.1 on page 20, namely the Open Source qualification and the data quality functionality.

Important Note: All the results obtained are the author's own opinion and are based on the experience of the evaluation process in this thesis. The release dates of the latest versions referred to in this thesis are as of June 15, 2010.

Description of Notation

The symbols used in the result tables distinguish four cases: the tool is superior to the other one according to the indicated criterion; the tool is inferior to the other one according to the indicated criterion; the two evaluated tools do not show much difference according to the indicated criterion; the tool does not have such functionality.

Case Study Result on Data Profiling Tools

Criterion              DataCleaner    TOP
License                LGPL
Release Activity
  Version Installed    1.5.4          4.0.1
  Release Date         15 May 2010    28 Apr 2010
User support
Ease-of-use
Flexibility
Functionality
  Task 1
  Task 2
  Task 3
  Task 4

Compared with DataCleaner, Talend Open Profiler (TOP) is more popular with regard to forum activity. Its documentation is of better quality, and the webinars are very helpful for new users getting to know the software. TOP has a more complex structure, providing more functions and extension capabilities, including a large number of options for the users to define and configure. Although this structure may make the tool harder to use and may require additional learning, it provides more flexibility and efficiency. All sample tasks are accomplished, although the ability to define integrity domains is limited. The ability not only to show the results in the database but also to provide the SQL statements for them is very useful. This feature is not supported by the current version of DataCleaner.

DataCleaner is a simple-to-use data profiling tool with a simpler working principle. The clear and user-friendly GUI makes it easy to understand the tool without putting too much effort into the user guide. DataCleaner provides limited configuration possibilities and, according to our sample task evaluation, it does not support data rule analysis. Regarding output capability, DataCleaner is able to export an HTML report after each profiling task, although it only includes the basic information of the result. TOP, on the other hand, does not support report output in its community version. Another advantage of DataCleaner is its small footprint: the full installation takes only 14.9 MB, compared with 233 MB for TOP.

Case Study Result on Data Cleansing Tools

Criterion              TOS            DQguru
License                GPL 2.0        GPL 3.0
Release Activity
  Version Installed    4.0.1          0.9.7
  Release Date         28 Apr 2010    Mar 2010
User support
Ease-of-use
Flexibility
Functionality
  Task 5

Talend Open Studio (TOS) is an ETL tool with comprehensive functions and extension capabilities, providing a large number of options with user-tunable configuration parameters. It provides much more than data cleansing. The comparatively complex structure requires a certain level of knowledge for operating the software, but it provides more flexibility and efficiency at the same time. User-defined functions written in Java are a big advantage for users who are familiar with Java programming. They enable the realization of functions that are not supported directly by the tool itself. Abundant options regarding output and export add further flexibility.

DQguru, on the other hand, provides a large number of predefined operations for data transformation. It only supports very limited configuration possibilities, which makes it inflexible regarding both its functionality and its output capability. The lack of documentation and support greatly restricted our further exploration of this tool. Data cleansing tasks are often application-oriented, i.e. they depend on different business cases. Predefined transformation functions are sometimes not sufficient for such tasks. This disadvantage is also reflected in the performance in the sample task evaluation.

Case Study Result on Record Linkage Tools

Criterion              Fril                          Febrl
License                MPL 1.1/GPL 2.0/LGPL 2.1      ANU Open Source License
Release Activity
  Version Installed    2.1.4                         0.4.1
  Release Date         22 Dec 2009                   Dec 2008
User support
Ease-of-use
Flexibility
Functionality
  Task 6

Febrl is a prototype tool, as described by its developer, and is therefore more research-oriented. It shows more complexity in both its user interface and its structure. More configuration parameters, more comparison functions and the complex structure increase the difficulty of using the software. The configuration requires users to have a certain level of knowledge in the record linkage area. The high level of documentation reduces this difficulty somewhat. Febrl also does not yet support any database connectivity, which is a great disadvantage regarding its range of use, as most real applications are based on databases. Fril, on the other hand, supports database connectivity using JDBC and provides a more user-friendly interface that is more intuitive and easier to understand. The two tools do not show much difference regarding record linkage task performance, although the larger number of comparison functions in Febrl may have the potential to generate better results.


Summary and Outlook

This research project has attempted to answer some questions regarding Open Source data quality software. Open Source data quality is a relatively new area, and this thesis can serve as a basis for further research in this area. For the summary and outlook of this thesis, we focus on answering the following questions.

What are the major technologies in the Open Source data quality management area and how do they work?

Three major technologies are discussed in this thesis: data profiling, data cleansing and record linkage. Other technologies may also be related to data quality management, but it is difficult to cover the whole scope due to the time limitation of this thesis and the lack of agreement on some of the technologies. Data profiling is usually considered the starting point of a data quality management process, i.e. the first step is to identify the existing data quality problems. The other two technologies, data cleansing and record linkage, focus on solving the data quality issues found. The fundamentals of each of the three technologies are introduced. Based on this theoretical foundation, case studies have been performed for each technology, showing the current practices in this area.

What Open Source solutions are provided in this area?

After giving a general overview of the Open Source data quality market, six tools are selected for evaluation. The evaluation revolves around the three major technologies. The criteria cover two aspects: Open Source qualification and data quality functionality. Additional descriptions of each tool's working principles and of the implementation steps of the sample tasks are presented. Readers can take a quick look at each tool and get an impression of how it works and of its qualification regarding each technology.

What do the results of the case studies suggest?

The case studies on data profiling, data cleansing and record linkage tools show that the selected Open Source tools are generally able to realize most of the functionalities of each technology. However, differences between these tools have also been revealed regarding both Open Source qualification and functionality. The ease-of-use and the performance in the sample tasks sometimes also differ greatly. One of the most important aspects leading to these differences is the Open Source model.

Pure Open Source vs. Commercial Open Source

Those tools that do not have a commercial version or are not supported by any software vendor are considered pure Open Source software. They often have small footprints and a simple structure, and are developed by a small group of people. They may not provide as much user support as commercial Open Source software. Their disadvantages are also reflected in the size of their community and their lack of popularity. However, the functionality they provide is often handy and easy to use. The small footprint also often comes with low disk space requirements and lower resource consumption. The simpler structure makes it easier for interested developers to get involved in the project if good documentation is provided, as it reduces the learning time. This is also an advantage with regard to redistribution and integration. One existing problem is that, due to the limited contribution capacity of pure Open Source projects, the documentation is sometimes of low quality or incomplete. Some Open Source projects stopped at the stage of simply bringing the product to the market free of charge, which allows users to use the software as freeware. The lack of instructions makes it hard for users to make full use of the software. Commercial Open Source software usually offers complete documentation as well as good user support. Therefore, it usually shows better qualification in the evaluation.

What can be expected for further research or development in this area?

As this thesis focused on only three major technologies, one task for further research is to identify other technologies and more applications in this area. Adding more functionality or optimizing the existing functionality could also be an interesting task. Taking Fril as an example, a merging technique based on a general-purpose algorithm could be added to its de-duplication process by employing integrity domains or other methods. The current Open Source solutions in this area still aim at a general scope; only a limited number of products are available, and they are still not as capable as commercial software. Their functionality and usability need to be improved, which still requires a lot of effort. Some of the evaluated tools are still at an early stage and are considered research-oriented. In order to be used for solving real business problems, more precise and specific details should be added.

[BF09] A. Bitterer and T. Friedman. Who's Who in Open-Source Data Quality. 2009.

[BS06] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.

[Cat97] W. Cathro. Metadata: An Overview. 1997.

[Chr06] P. Christen. A Comparison of Personal Name Matching: Techniques and Practical Issues. In ICDM Workshops, pages 290-294. IEEE Computer Society, 2006.

[Chr08a] P. Christen. Febrl - A Freely Available Record Linkage System with a Graphical User Interface. In Second Australasian Workshop on Health Data and Knowledge Management, volume 80 of CRPIT, pages 17-25, Wollongong, NSW, Australia, 2008. ACS.

[Chr08b] P. Christen. Febrl - Freely Extensible Biomedical. The Australian National University, 0.4.1 edition, December 2008.

[CL01] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. Online, 2001.

[CRF03] W.W. Cohen, P.D. Ravikumar, and S.E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Subbarao Kambhampati and Craig A. Knoblock, editors, IIWeb, pages 73-78, 2003.

[Dun46] H.L. Dunn. Record Linkage. American Journal of Public Health, 36(12):1412, 1946.

[EE10] M. Eppler and L. English. IQ/DQ glossary, International Association for Information and Data Quality (IAIDQ), 2010. URL shtml. [Online; accessed 06-April-2010].

[Eng99] L.P. English. Improving Data Warehouse and Business Information Quality. John Wiley & Sons, April 1999.

[Eng09] L.P. English. Information Quality Applied. Wiley, 2009.

[FS69] I. Fellegi and A. Sunter. A theory for record linkage. J. of the American Stat. Soc., 1969.

[FSF10] Free Software Foundation, Inc. Various licenses and comments about them, 2010. URL [Online; accessed 15-June-2010].

[GBVR03] L. Gu, R.A. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Math. and Inf. Sci., GPO Box 664, Canberra 2601, Australia, April 2003.

[Hin07] H. Hinrichs. A Survey on Open Source Software for Data Quality Management. In Proc. of 5. German Information Quality Management Conference (GIQMC), Bad Soden, 2007.

[HK01] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, second edition, 2001.

[HS95] M. Hernandez and S. Stolfo. The Merge/Purge problem for large databases. In ACM-SIGMOD Conference, 1995.

[Ini10] The Open Source Initiative. The open source definition, 2010. URL http:// [Online; accessed 21-May-2010].

[Ins02] Data Warehousing Institute. Data quality and the bottom line: Achieving business success through a commitment to high quality data. 2002.

[Jar89] M.A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989.

[JLX+08] P. Jurczyk, J.J. Lu, L. Xiong, J.D. Cragan, and A. Correa. FRIL: A tool for comparative record linkage. In AMIA Annual Symposium Proceedings, volume 2008, page 440. American Medical Informatics Association, 2008.

[Jur88] J. Juran. Juran's Quality Control Handbook. McGraw-Hill, 4th edition, 1988.

[KSW02] B.K. Kahn, D.M. Strong, and R.Y. Wang. Information quality benchmarks: product and service performance. Commun. ACM, 45(4):184-192, 2002.

[Lau04] A.M. St Laurent. Understanding Open Source and Free Software Licensing. O'Reilly Media, Inc., 2004.

[Let08] F. Letellier. Open Source Software: the Role of Nonprofits in Federating Business and Innovation Ecosystems. 2008.

[MEE02] M.G. Elfeky, V.S. Verykios, and A.K. Elmagarmid. TAILOR: A record linkage toolbox. March 2002.

[MF03] H. Müller and J.C. Freytag. Problems, methods, and challenges in comprehensive data cleansing. Humboldt Universität Berlin, Tech. Rep, 2003.

[MM00] J.I. Maletic and A. Marcus. Data cleansing: Beyond integrity analysis. 2000.

[NM01] M. Neiling and R.M. Müller. The good into the Pot, the bad into the Crop. Preselection of Record Pairs for Database Fusion. Institute for Information Systems, 2001.

[Ols03] J.E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003.

[Pos10] Canada Post. Canadian provinces and territories names and abbreviations, 2010. URL PGaddress-e.asp. [Online; accessed 09-May-2010].

[PS05] R.J. Price and G.G. Shanks. Empirical refinement of a semiotic information quality framework. In HICSS. IEEE Computer Society, 2005.

[PWC97] E.H. Porter, W.E. Winkler, and Bureau Of The Census. Approximate string comparison and its effect on an advanced record linkage system. August 1997.

[RD00] E. Rahm and H.H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull, 23(4):3-13, 2000.

[Ros05] L. Rosen. Open Source Licensing: Software Freedom and Intellectual Property Law. Prentice Hall, 2005.

[Rus10] Z. Rusin. Open source licenses, 2010. URL documentation/licensing/licenses_summary.html. [Online; accessed 21-May-2010].

[Ser10] United States Postal Service. Official USPS abbreviations, 2010. URL http:// [Online; accessed 09-May-2010].

[Sør09] K. Sørensen. DataCleaner Reference Documentation. DataCleaner, 1.5.3 edition, October 2009.

[Tal09] Talend. Talend Open Profiler User Guide, 3.2a edition, October 2009.

[Tal10] Talend. Talend Open Studio Components Reference Guide, 4.0a edition, April 2010.

[vdB05] K. van den Berg. Finding open options: An open source software evaluation model with a case study on course management systems. Master thesis, Tilburg University, August 2005.

[Wan98] R.Y. Wang. A product perspective on total data quality management. Commun. ACM, 41(2):58-65, 1998.

[WH06] R. Wolter and K. Haselden. The what, why, and how of master data management, November 2006. URL bb190163.aspx. [Online; accessed 01-June-2010].

[Wik10] Wikipedia. Apache Subversion - Wikipedia, the free encyclopedia, 2010. URL [Online; accessed 05-June-2010].

[WS96] R.Y. Wang and D.M. Strong. Beyond accuracy: What data quality means to data consumers. Journal on Management Information Systems, 12(4):5-34, 1996.

[WW96] Y. Wand and R.Y. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11):86-95, November 1996.

Appendix A Candidate List and Selection Results

Table A.0.1 lists all the candidates investigated in the selection procedure for the case study. The reasons for elimination are presented below.

Candidate               Eliminated
Talend Open Profiler
DataCleaner
OpenDQ                  X
Aggregate Profiler      X
Talend Open Studio
SQL Power DQguru
Potter's Wheel          X
Fril
Febrl
RapidMiner              X

Table A.0.1: Candidate list

OpenDQ

Although it is tagged as open source data quality software, the company delivers the OpenDQ product as a total solution on an open source platform, which means that, as far as we could investigate, neither the software nor the source code is available to non-paying customers.

Aggregate Profiler

Although the product is available on SourceForge, it turns out to be an evaluation version. Information about the community and the maintenance of the software is very limited.

Potter's Wheel

Potter's Wheel A-B-C is an open source data analysis and cleansing tool, but the latest available version is 1.3, which was released on Oct 10, 2000. We consider the software to be no longer under development.

RapidMiner

RapidMiner is an environment for machine learning and data mining experiments. Its focus is data mining, which is also a technology related to data quality management, but due to lack of time it is not covered in this thesis.