Data Cleaning: Problems and Current Approaches

Abstract
We classify data quality problems that are addressed by data cleaning and provide an overview of the
main solution approaches. Data cleaning is especially required when integrating heterogeneous data
sources and should be addressed together with schema-related data transformations. In data
warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool
support for data cleaning.

1 Introduction
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and
inconsistencies from data in order to improve the quality of data. Data quality problems are present
in single data collections, such as files and databases, e.g., due to misspellings during data entry,
missing information or other invalid data. When multiple data sources need to be integrated, e.g., in
data warehouses, federated database systems or global web-based information systems, the need for
data cleaning increases significantly. This is because the sources often contain redundant data in
different representations. In order to provide access to accurate and consistent data, consolidation
of different data representations and elimination of duplicate information become necessary.

[Figure 1. Steps of building a data warehouse: the ETL process.
Stages shown: operational sources; data staging area; data warehouse.
Steps (extraction, transformation, loading): extraction; integration; aggregation.
Schema level (metadata flow): schema extraction and translation; schema matching and integration;
schema implementation.
Instance level (data flow): instance extraction and transformation; instance matching and
integration; filtering, aggregation.
Supporting services: scheduling, logging, monitoring, recovery, backup.
Numbered metadata: 2 translation rules; 3 instance characteristics (real metadata); 4 mappings
between source and target schema; 5 filtering and aggregation rules.]


Data warehouses [6][16] require and provide extensive support for data cleaning. They load and
continuously refresh huge amounts of data from a variety of sources so the probability that some of
the sources contain dirty data is high. Furthermore, data warehouses are used for decision making, so
that the correctness of their data is vital to avoid wrong conclusions. For instance, duplicated or
missing information will produce incorrect or misleading statistics (garbage in, garbage out). Due to
the wide range of possible data inconsistencies and the sheer data volume, data cleaning is
considered to be one of the biggest problems in data warehousing. During the so-called ETL process
(extraction, transformation, loading), illustrated in Fig. 1, further data transformations deal with
schema/data translation and integration, and with filtering and aggregating data to be stored in the
warehouse. As indicated in Fig. 1, all data cleaning is typically performed in a separate data
staging area before loading the transformed data into the warehouse. A large number of tools of
varying functionality is available to support these tasks, but often a significant portion of the
cleaning and transformation work has to be done manually or by low-level programs that are difficult
to write and maintain.

(Author footnote: This work was performed while on leave at Microsoft Research, Redmond, WA.)

Federated database systems and web-based information systems face data transformation steps similar
to those of data warehouses. In particular, there is typically a wrapper per data source for
extraction and a mediator for integration [32][31]. So far, these systems provide only limited
support for data cleaning, focusing instead on data transformations for schema translation and schema
integration. Data is not preintegrated as for data warehouses but needs to be extracted from multiple
sources, transformed and combined during query runtime. The corresponding communication and
processing delays can be significant, making it difficult to achieve acceptable response times. The
effort needed for data cleaning during extraction and integration will further increase response
times but is mandatory to achieve useful query results.
A data cleaning approach should satisfy several requirements. First of all, it should detect and
remove all major errors and inconsistencies both in individual data sources and when integrating
multiple sources. The approach should be supported by tools to limit manual inspection and
programming effort and be extensible to easily cover additional sources. Furthermore, data cleaning
should not be performed in isolation but together with schema-related data transformations based on
comprehensive metadata. Mapping functions for data cleaning and other data transformations should be
specified in a declarative way and be reusable for other data sources as well as for query
processing. Especially for data warehouses, a workflow infrastructure should be supported to execute
all data transformation steps for multiple sources and large data sets in a reliable and efficient
way.
While a huge body of research deals with schema translation and schema integration, data cleaning has
received only little attention in the research community. A number of authors focused on the problem
of duplicate identification and elimination, e.g., [11][12][15][19][22][23]. Some research groups
concentrate on general problems not limited to but relevant to data cleaning, such as special data
mining approaches [30][29], and data transformations based on schema matching [1][21]. More recently,
several research efforts propose and investigate a more comprehensive and uniform treatment of data
cleaning covering several transformation phases, specific operators and their implementation
[11][19][25].
In this paper we provide an overview of the problems to be addressed by data cleaning and their
solution. In the next section we present a classification of the problems. Section 3 discusses the
main cleaning approaches used in available tools and the research literature. Section 4 gives an
overview of commercial tools for data cleaning, including ETL tools. Section 5 is the conclusion.

2 Data cleaning problems


This section classifies the major data quality problems to be solved by data cleaning and data
transformation. As we will see, these problems are closely related and should thus be treated in a
uniform way. Data transformations [26] are needed to support any changes in the structure,
representation or content of data. These transformations become necessary in many situations, e.g.,
to deal with schema evolution, migrating a legacy system to a new information system, or when
multiple data sources are to be integrated.
As shown in Fig. 2 we roughly distinguish between single-source and multi-source problems and between
schema- and instance-related problems. Schema-level problems of course are also reflected in the
instances; they can be addressed at the schema level by an improved schema design (schema evolution),
schema translation and schema integration. Instance-level problems, on the other hand, refer to
errors and inconsistencies in the actual data contents which are not visible at the schema level.
They are the primary focus of data cleaning. Fig. 2 also indicates some typical problems for the
various cases. While not shown in Fig. 2, the single-source problems occur (with increased
likelihood) in the multi-source case, too, besides specific multi-source problems.

Data Quality Problems
  Single-Source Problems
    Schema Level (lack of integrity constraints, poor schema design):
      uniqueness; referential integrity
    Instance Level (data entry errors):
      misspellings; redundancy/duplicates; contradictory values
  Multi-Source Problems
    Schema Level (heterogeneous data models and schema designs):
      naming conflicts; structural conflicts
    Instance Level (overlapping, contradicting and inconsistent data):
      inconsistent aggregating; inconsistent timing

Figure 2. Classification of data quality problems in data sources

2.1 Single-source problems


The data quality of a source largely depends on the degree to which it is governed by schema and
integrity constraints controlling permissible data values. For sources without schema, such as files,
there are few restrictions on what data can be entered and stored, giving rise to a high probability
of errors and inconsistencies. Database systems, on the other hand, enforce restrictions of a
specific data model (e.g., the relational approach requires simple attribute values, referential
integrity, etc.) as well as application-specific integrity constraints. Schema-related data quality
problems thus occur because of the lack of appropriate model-specific or application-specific
integrity constraints, e.g., due to data model limitations or poor schema design, or because only a
few integrity constraints were defined to limit the overhead for integrity control. Instance-specific
problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g.,
misspellings).

Scope | Problem | Dirty Data | Reasons/Remarks
Attribute | Illegal values | bdate=30.13.70 | values outside of domain range
Record | Violated attribute dependencies | age=22, bdate=12.02.70 | age = (current date - birth date) should hold
Record type | Uniqueness violation | emp1=(name="John Smith", SSN="123456"), emp2=(name="Peter Miller", SSN="123456") | uniqueness for SSN (social security number) violated
Source | Referential integrity violation | emp=(name="John Smith", deptno=127) | referenced department (127) not defined

Table 1. Examples for single-source problems at schema level (violated integrity constraints)
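
The violations in Table 1 are all mechanically checkable once the corresponding constraints are
written down. The following Python sketch illustrates one check per scope. It is only an
illustration: the helper names are hypothetical, records are assumed to be dictionaries, and
two-digit years are assumed to mean 19YY.

```python
from datetime import date


def parse_bdate(text):
    """Parse a 'DD.MM.YY' birth date; return None if it is not a legal date."""
    try:
        day, month, year = (int(part) for part in text.split("."))
        return date(1900 + year, month, day)  # assumption: two-digit years are 19YY
    except ValueError:
        return None


def age_matches(age, bdate, today):
    """Check the dependency age = (current date - birth date), in whole years."""
    had_birthday = (today.month, today.day) >= (bdate.month, bdate.day)
    return age == today.year - bdate.year - (0 if had_birthday else 1)


def duplicate_keys(records, key):
    """Return key values occurring in more than one record (uniqueness violation)."""
    seen, dupes = set(), set()
    for rec in records:
        value = rec[key]
        (dupes if value in seen else seen).add(value)
    return dupes


def dangling_refs(records, fk, defined):
    """Return records whose foreign key value is not among the defined values."""
    return [rec for rec in records if rec[fk] not in defined]
```

For example, parse_bdate("30.13.70") yields None because there is no month 13, which corresponds to
the attribute-level row of Table 1.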

For both schema- and instance-level problems we can differentiate different problem scopes: attribute
(field), record, record type and source; examples for the various cases are shown in Tables 1 and 2.
Note that uniqueness constraints specified at the schema level do not prevent duplicated instances,
e.g., if information on the same real world entity is entered twice with different attribute values
(see example in Table 2).

Scope | Problem | Dirty Data | Reasons/Remarks
Attribute | Missing values | phone=9999999999 | unavailable values during data entry (dummy values or null)
Attribute | Misspellings | city="Liipzig" | usually typos, phonetic errors
Attribute | Cryptic values, abbreviations | experience="B"; occupation="DB Prog." |
Attribute | Embedded values | name="J. Smith 12.02.70 New York" | multiple values entered in one attribute (e.g. in a free-form field)
Attribute | Misfielded values | city="Germany" |
Record | Violated attribute dependencies | city="Redmond", zip=77777 | city and zip code should correspond
Record type | Word transpositions | name1="J. Smith", name2="Miller P." | usually in a free-form field
Record type | Duplicated records | emp1=(name="John Smith",...); emp2=(name="J. Smith",...) | same employee represented twice due to some data entry errors
Record type | Contradicting records | emp1=(name="John Smith", bdate=12.02.70); emp2=(name="John Smith", bdate=12.12.70) | the same real world entity is described by different values
Source | Wrong references | emp=(name="John Smith", deptno=17) | referenced department (17) is defined but wrong

Table 2. Examples for single-source problems at instance level

Given that cleaning data sources is an expensive process, preventing dirty data from being entered is
obviously an important step to reduce the cleaning problem. This requires an appropriate design of
the database schema and integrity constraints as well as of data entry applications. Also, the
discovery of data cleaning rules during warehouse design can suggest improvements to the constraints
enforced by existing schemas.

2.2 Multi-source problems


The problems present in single sources are aggravated when multiple sources need to be integrated.
Each source may contain dirty data and the data in the sources may be represented differently,
overlap or contradict. This is because the sources are typically developed, deployed and maintained
independently to serve specific needs. This results in a large degree of heterogeneity w.r.t. data
management systems, data models, schema designs and the actual data.
At the schema level, data model and schema design differences are to be addressed by the steps of
schema translation and schema integration, respectively. The main problems w.r.t. schema design are
naming and structural conflicts [2][24][17]. Naming conflicts arise when the same name is used for
different objects (homonyms) or different names are used for the same object (synonyms). Structural
conflicts occur in many variations and refer to different representations of the same object in
different sources, e.g., attribute vs. table representation, different component structure, different
data types, different integrity constraints, etc.
In addition to schema-level conflicts, many conflicts appear only at the instance level (data
conflicts). All problems from the single-source case can occur with different representations in
different sources (e.g., duplicated records, contradicting records, ...). Furthermore, even when
there are the same attribute names and data types, there may be different value representations
(e.g., for marital status) or different interpretation of the values (e.g., measurement units Dollar
vs. Euro) across sources. Moreover, information in the sources may be provided at different
aggregation levels (e.g., sales per product vs. sales per product group) or refer to different points
in time (e.g. current sales as of yesterday for source 1 vs. as of last week for source 2).
A main problem for cleaning data from multiple sources is to identify overlapping data, in particular
matching records referring to the same real world entity (e.g., customer). This problem is also
referred to as the object identity problem [11], duplicate elimination or the merge/purge problem
[15]. Frequently, the information is only partially redundant and the sources may complement each
other by providing additional information about an entity. Thus duplicate information should be
purged out and complementing information should be consolidated and merged in order to achieve a
consistent view of real world entities.

Customer (source 1)
CID | Name | Street | City | Sex
11 | Kristen Smith | 2 Hurley Pl | South Fork, MN 48503 | 0
24 | Christian Smith | Hurley St 2 | S Fork MN | 1

Client (source 2)
Cno | LastName | FirstName | Gender | Address | Phone/Fax
24 | Smith | Christoph | M | 23 Harley St, Chicago IL, 60633-2394 | 333-222-6542 / 333-222-6599
493 | Smith | Kris L. | F | 2 Hurley Place, South Fork MN, 48503-5998 | 444-555-6666

Customers (integrated target with cleaned data)
No | LName | FName | Gender | Street | City | State | ZIP | Phone | Fax | CID | Cno
1 | Smith | Kristen L. | F | 2 Hurley Place | South Fork | MN | 48503-5998 | 444-555-6666 | | 11 | 493
2 | Smith | Christian | M | 2 Hurley Place | South Fork | MN | 48503-5998 | | | 24 |
3 | Smith | Christoph | M | 23 Harley Street | Chicago | IL | 60633-2394 | 333-222-6542 | 333-222-6599 | | 24

Figure 3. Examples of multi-source problems at schema and instance level


The two sources in the example of Fig. 3 are both in relational format but exhibit schema and data
conflicts. At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno,
Sex/Gender) and structural conflicts (different representations for names and addresses). At the
instance level, we note that there are different gender representations (0/1 vs. F/M) and presumably
a duplicate record (Kristen Smith). The latter observation also reveals that while Cid/Cno are both
source-specific identifiers, their contents are not comparable between the sources; different numbers
(11/493) may refer to the same person while different persons can have the same number (24). Solving
these problems requires both schema integration and data cleaning; the third table shows a possible
solution. Note that the schema conflicts should be resolved first to allow data cleaning, in
particular detection of duplicates based on a uniform representation of names and addresses, and
matching of the Gender/Sex values.
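
As an illustration of resolving the schema conflicts first, the following Python sketch maps rows of
both sources in Fig. 3 to common attribute names and a common gender representation, while keeping
the source identifiers separate because they are not comparable. It is a minimal sketch under
assumptions: rows are dictionaries, the function names are hypothetical, and the code mapping
0 = F, 1 = M is inferred from the example rather than stated by the sources.

```python
# Assumption inferred from Fig. 3: source 1 encodes gender as 0 (F) and 1 (M).
GENDER_CODES = {"0": "F", "1": "M", "F": "F", "M": "M"}


def from_customer(row):
    """Map a source-1 Customer row to the target names (Name split on the last space)."""
    first, _, last = row["Name"].rpartition(" ")
    return {
        "LName": last,
        "FName": first,
        "Gender": GENDER_CODES[str(row["Sex"])],
        "CID": row["CID"],  # lineage: identifier from source 1
        "Cno": None,
    }


def from_client(row):
    """Map a source-2 Client row to the target names."""
    return {
        "LName": row["LastName"],
        "FName": row["FirstName"],
        "Gender": GENDER_CODES[row["Gender"]],
        "CID": None,
        "Cno": row["Cno"],  # lineage: identifier from source 2
    }
```

After this step both sources expose the same attributes, so duplicate detection can compare names
and genders directly. CID 24 and Cno 24 stay in different columns and are never confused.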

3 Data cleaning approaches


In general, data cleaning involves several phases:

Data analysis: In order to detect which kinds of errors and inconsistencies are to be removed, a
detailed data analysis is required. In addition to a manual inspection of the data or data samples,
analysis programs should be used to gain metadata about the data properties and detect data quality
problems.

Definition of transformation workflow and mapping rules: Depending on the number of data sources,
their degree of heterogeneity and the dirtiness of the data, a large number of data transformation
and cleaning steps may have to be executed. Sometimes, a schema translation is used to map sources to
a common data model; for data warehouses, typically a relational representation is used. Early data
cleaning steps can correct single-source instance problems and prepare the data for integration.
Later steps deal with schema/data integration and cleaning multi-source instance problems, e.g.,
duplicates. For data warehousing, the control and data flow for these transformation and cleaning
steps should be specified within a workflow that defines the ETL process (Fig. 1).
The schema-related data transformations as well as the cleaning steps should be specified by a
declarative query and mapping language as far as possible, to enable automatic generation of the
transformation code. In addition, it should be possible to invoke user-written cleaning code and
special-purpose tools during a data transformation workflow. The transformation steps may request
user feedback on data instances for which they have no built-in cleaning logic.

Verification: The correctness and effectiveness of a transformation workflow and the transformation
definitions should be tested and evaluated, e.g., on a sample or copy of the source data, to improve
the definitions if necessary. Multiple iterations of the analysis, design and verification steps may
be needed, e.g., since some errors only become apparent after applying some transformations.

Transformation: Execution of the transformation steps either by running the ETL workflow for loading
and refreshing a data warehouse or during answering queries on multiple sources.

Backflow of cleaned data: After (single-source) errors are removed, the cleaned data should also
replace the dirty data in the original sources in order to give legacy applications the improved data
too and to avoid redoing the cleaning work for future data extractions. For data warehousing, the
cleaned data is available from the data staging area (Fig. 1).

The transformation process obviously requires a large amount of metadata, such as schemas,
instance-level data characteristics, transformation mappings, workflow definitions, etc. For
consistency, flexibility and ease of reuse, this metadata should be maintained in a DBMS-based
repository [4]. To support data quality, detailed information about the transformation process is to
be recorded, both in the repository and in the transformed instances, in particular information about
the completeness and freshness of source data and lineage information about the origin of transformed
objects and the changes applied to them. For instance, in Fig. 3, the derived table Customers
contains the attributes CID and Cno, allowing one to trace back the source records.
In the following we describe in more detail possible approaches for data analysis (conflict
detection), transformation definition and conflict resolution. For approaches to schema translation
and schema integration, we refer to the literature as these problems have extensively been studied
and described [2][24][26]. Name conflicts are typically resolved by renaming; structural conflicts
require a partial restructuring and merging of the input schemas.

3.1 Data analysis


Metadata reflected in schemas is typically insufficient to assess the data quality of a source,
especially if only a few integrity constraints are enforced. It is thus important to analyse the
actual instances to obtain real (reengineered) metadata on data characteristics or unusual value
patterns. This metadata helps find data quality problems. Moreover, it can effectively contribute to
identifying attribute correspondences between source schemas (schema matching), based on which
automatic data transformations can be derived [20][9].

There are two related approaches for data analysis, data profiling and data mining. Data profiling
focusses on the instance analysis of individual attributes. It derives information such as the data
type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of
null values, typical string pattern (e.g., for phone numbers), etc., providing an exact view of
various quality aspects of the attribute. Table 3 shows examples of how this metadata can help detect
data quality problems.

Problems | Metadata | Examples/Heuristics
Illegal values | cardinality | e.g., cardinality(gender) > 2 indicates problem
Illegal values | max, min | max, min should not be outside of permissible range
Illegal values | variance, deviation | variance, deviation of statistical values should not be higher than threshold
Misspellings | attribute values | sorting on values often brings misspelled values next to correct values
Missing values | null values | percentage/number of null values
Missing values | attribute values + default values | presence of default value may indicate real value is missing
Varying value representations | attribute values | comparing attribute value set of a column of one table against that of a column of another table
Duplicates | cardinality + uniqueness | attribute cardinality = # rows should hold
Duplicates | attribute values | sorting values by number of occurrences; more than 1 occurrence indicates duplicates

Table 3. Examples for the use of reengineered metadata to address data quality problems
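
Several of the metadata items in Table 3 can be derived with a few lines of code. The following
Python sketch profiles a single attribute given as a list of values. It is a minimal illustration,
not the method of any tool discussed here: the function names are hypothetical, None stands for a
null value, and the pattern notation (9 for a digit, A for a letter) is an assumed convention.

```python
import re
from collections import Counter


def value_pattern(value):
    """Abstract a string into a pattern: every digit becomes 9, every letter A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile(values):
    """Derive simple instance-level metadata for one attribute."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "cardinality": len(counts),              # number of distinct non-null values
        "unique": len(counts) == len(values),    # cardinality = # rows should hold
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "patterns": Counter(value_pattern(str(v)) for v in present),
    }
```

Applied to a gender column containing F, M and f, the profile reports a cardinality of 3, which by
the first heuristic of Table 3 indicates a problem (here a varying value representation).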

Data mining helps discover specific data patterns in large data sets, e.g., relationships holding
between several attributes. This is the focus of so-called descriptive data mining models including
clustering, summarization, association discovery and sequence discovery [10]. As shown in [28],
integrity constraints among attributes such as functional dependencies or application-specific
business rules can be derived, which can be used to complete missing values, correct illegal values
and identify duplicate records across data sources. For example, an association rule with high
confidence can hint at data quality problems in instances violating this rule. So a confidence of 99%
for the rule total = quantity * unit price indicates that 1% of the records do not comply and may
require closer examination.
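
Checking a given rule against the data is straightforward. The following Python sketch computes the
confidence of a rule as the fraction of records satisfying it and returns the violators for closer
examination. It does not discover rules, which is the actual data mining task. The field names
quantity, unit_price and total and the rounding tolerance are assumptions made for illustration.

```python
def rule_confidence(records, rule):
    """Return (confidence, violators) for a predicate evaluated over all records."""
    violators = [rec for rec in records if not rule(rec)]
    confidence = 1 - len(violators) / len(records) if records else 1.0
    return confidence, violators


def total_rule(rec, tolerance=0.005):
    """The rule total = quantity * unit price, up to an assumed rounding tolerance."""
    return abs(rec["total"] - rec["quantity"] * rec["unit_price"]) <= tolerance
```

A high but not perfect confidence is the interesting case: the few violating records are the
candidates for data entry errors.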

3.2 Defining data transformations


The data transformation process typically consists of multiple steps where each step may perform
schema- and instance-related transformations (mappings). To allow a data transformation and cleaning
system to generate transformation code and thus to reduce the amount of self-programming it is
necessary to specify the required transformations in an appropriate language, e.g., supported by a
graphical user interface. Various ETL tools (see Section 4) offer this functionality by supporting
proprietary rule languages. A more general and flexible approach is the use of the standard query
language SQL to perform the data transformations and utilize the possibility of application-specific
language extensions, in particular user-defined functions (UDFs) supported in SQL:99 [13][14]. UDFs
can be implemented in SQL or a general purpose programming language with embedded SQL statements.
They allow implementing a wide range of data transformations and support easy reuse for different
transformation and query processing tasks. Furthermore, their execution by the DBMS can reduce data
access cost and thus improve performance. Finally, UDFs are part of the SQL:99 standard and should
(eventually) be portable across many platforms and DBMSs.

CREATE VIEW Customer2 (LName, FName, Gender, Street, City, State, ZIP, CID) AS
SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street, CityExtract(City),
       StateExtract(City), ZIPExtract(City), CID
FROM Customer

Figure 4. Example of transformation step definition

Fig. 4 shows a transformation step specified in SQL:99. The example refers to Fig. 3 and covers part
of the necessary data transformations to be applied to the first source. The transformation defines a
view on which further mappings can be performed. The transformation performs a schema restructuring
with additional attributes in the view obtained by splitting the name and address attributes of the
source. The required data extractions are achieved by UDFs (LastNameExtract, FirstNameExtract,
CityExtract, StateExtract and ZIPExtract). The UDF implementations can contain cleaning logic, e.g.,
to remove misspellings in city names or provide missing zip codes.
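
To make the view of Fig. 4 executable for experimentation, the following Python sketch registers the
five extraction functions as UDFs in an in-memory SQLite database through the standard sqlite3
module and then creates the view. This is a sketch under assumptions, not a SQL:99 implementation:
SQLite stands in for the DBMS, and the function bodies (last token as last name, a regular
expression for "City, ST ZIP") are hypothetical and contain no cleaning logic.

```python
import re
import sqlite3

# Assumed layout of the City attribute: city name, two-letter state, optional 5-digit zip.
CITY_RE = re.compile(r"^(?P<city>.*?)[,\s]+(?P<state>[A-Z]{2})(?:\s+(?P<zip>\d{5}))?$")


def city_part(text, part):
    """Return the named part of a City value, or None if it cannot be parsed."""
    match = CITY_RE.match(text.strip())
    return match.group(part) if match else None


def build_customer2(rows):
    """Load source-1 rows, register the UDFs, create the view and return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customer (CID INTEGER, Name TEXT, Street TEXT, City TEXT, Sex INTEGER)"
    )
    conn.executemany("INSERT INTO Customer VALUES (?, ?, ?, ?, ?)", rows)
    conn.create_function("LastNameExtract", 1, lambda n: n.split()[-1])
    conn.create_function("FirstNameExtract", 1, lambda n: " ".join(n.split()[:-1]))
    conn.create_function("CityExtract", 1, lambda c: city_part(c, "city"))
    conn.create_function("StateExtract", 1, lambda c: city_part(c, "state"))
    conn.create_function("ZIPExtract", 1, lambda c: city_part(c, "zip"))
    conn.execute("""
        CREATE VIEW Customer2 (LName, FName, Gender, Street, City, State, ZIP, CID) AS
        SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street,
               CityExtract(City), StateExtract(City), ZIPExtract(City), CID
        FROM Customer
    """)
    return conn.execute("SELECT * FROM Customer2 ORDER BY CID").fetchall()
```

For the second row of source 1 (City "S Fork MN") the ZIP column comes out as NULL, which shows
where cleaning logic such as a zip code lookup would have to be added inside the UDF.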
UDFs may still imply a substantial implementation effort and do not support all necessary schema
transformations. In particular, simple and frequently needed functions such as attribute splitting or
merging are not generically supported but often need to be re-implemented in application-specific
variations (see specific extract functions in Fig. 4). More complex schema restructurings (e.g.,
folding and unfolding of attributes) are not supported at all. To generically support schema-related
transformations, language extensions such as the SchemaSQL proposal are required [18]. Data cleaning
at the instance level can also benefit from special language extensions such as a Match operator
supporting approximate joins (see below). System support for such powerful operators can greatly
simplify the programming effort for data transformations and improve performance. Some current
research efforts on data cleaning are investigating the usefulness and implementation of such query
language extensions [11][25].

3.3 Conflict resolution


A set of transformation steps has to be specified and executed to resolve the various schema- and
instance-level data quality problems that are reflected in the data sources at hand. Several types of
transformations are to be performed on the individual data sources in order to deal with
single-source problems and to prepare for integration with other sources. In addition to a possible
schema translation, these preparatory steps typically include:

Extracting values from free-form attributes (attribute split): Free-form attributes often capture
multiple individual values that should be extracted to achieve a more precise representation and
support further cleaning steps such as instance matching and duplicate elimination. Typical examples
are name and address fields (Table 2, Fig. 3, Fig. 4). Required transformations in this step are
reordering of values within a field to deal with word transpositions, and value extraction for
attribute splitting.

Validation and correction: This step examines each source instance for data entry errors and tries to
correct them automatically as far as possible. Spell checking based on dictionary lookup is useful
for identifying and correcting misspellings. Furthermore, dictionaries on geographic names and zip
codes help to correct address data. Attribute dependencies (birth date - age, total price - unit
price / quantity, city - phone area code, ...) can be utilized to detect problems and substitute
missing values or correct wrong values.

Standardization: To facilitate instance matching and integration, attribute values should be
converted to a consistent and uniform format. For example, date and time entries should be brought
into a specific format; names and other string data should be converted to either upper or lower
case, etc. Text data may be condensed and unified by performing stemming, removing prefixes,
suffixes, and stop words. Furthermore, abbreviations and encoding schemes should consistently be
resolved by consulting special synonym dictionaries or applying predefined conversion rules.
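
The standardization step can be sketched in a few lines of Python. The example below converts
street strings to lower case with abbreviations expanded, and brings dates into ISO format. It is
only an illustration: the two-entry synonym dictionary, the list of accepted date formats and the
function names are assumptions, and word transpositions are deliberately left unresolved.

```python
import re
from datetime import datetime

# Hypothetical synonym dictionary; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "pl": "place"}


def standardize_street(text):
    """Lower-case a street string, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def standardize_date(text, formats=("%d.%m.%y", "%Y-%m-%d", "%m/%d/%Y")):
    """Return the date in ISO format, or None if no assumed format applies."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

With these rules "2 Hurley Pl" and "2 Hurley Place" from Fig. 3 receive the same representation,
whereas "Hurley St 2" still differs in token order and would need the reordering transformation
mentioned under attribute split.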
Dealing with multi-source problems requires restructuring of schemas to achieve a schema integration,
including steps such as splitting, merging, folding and unfolding of attributes and tables. At the
instance level, conflicting representations need to be resolved and overlapping data must be dealt
with. The duplicate elimination task is typically performed after most other transformation and
cleaning steps, especially after having cleaned single-source errors and conflicting representations.
It is performed either on two cleaned sources at a time or on a single already integrated data set.
Duplicate elimination requires first identifying (i.e. matching) similar records concerning the same
real world entity. In a second step, similar records are merged into one record containing all
relevant attributes without redundancy. Furthermore, redundant records are purged. In the following
we discuss the key problem of instance matching. More details on the subject are provided elsewhere
in this issue [22].
In the simplest case, there is an identifying attribute or attribute combination per record that can
be used for matching records, e.g., if different sources share the same primary key or if there are
other common unique attributes. Instance matching between different sources is then achieved by a
standard equi-join on the identifying attribute(s). In the case of a single data set, matches can be
determined by sorting on the identifying attribute and checking if neighboring records match. In both
cases, efficient implementations can be achieved even for large data sets. Unfortunately, without
common key attributes or in the presence of dirty data such straightforward approaches are often too
restrictive. To determine most or all matches a fuzzy matching (approximate join) becomes necessary
that finds similar records based on a matching rule, e.g., specified declaratively or implemented by
a user-defined function [14][11]. For example, such a rule could state that person records are likely
to correspond if name and portions of the address match. The degree of similarity between two
records, often measured by a numerical value between 0 and 1, usually depends on application
characteristics. For instance, different attributes in a matching rule may contribute different
weight to the overall degree of similarity. For string components (e.g., customer name, company name,
...) exact matching and fuzzy approaches based on wildcards, character frequency, edit distance,
keyboard distance and phonetic similarity (soundex) are useful [11][15][19]. More complex string
matching approaches also considering abbreviations are presented in [23]. A general approach for
matching both string and text data is the use of common information retrieval metrics. WHIRL
represents a promising representative of this category using the cosine distance in the vector-space
model for determining the degree of similarity between text elements [7].
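
A weighted matching rule of this kind can be sketched as follows in Python. The sketch uses edit
distance only, normalised by the longer string so that the result lies between 0 and 1. The other
measures mentioned above are not covered. The function names, the normalisation and the example
weights are assumptions chosen for illustration, not a rule taken from the cited work.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def record_similarity(r1, r2, weights):
    """Weighted average of per-attribute similarities (weights are application-specific)."""
    total = sum(weights.values())
    return sum(w * string_similarity(r1[attr], r2[attr]) for attr, w in weights.items()) / total
```

Two records would then be treated as a match, or as a match candidate for manual confirmation, when
record_similarity exceeds a threshold chosen for the application.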
Determining matching instances with such an approach is typically a very expensive operation for
large data sets. Calculating the similarity value for any two records implies evaluation of the
matching rule on the cartesian product of the inputs. Furthermore sorting on the similarity value is
needed to determine matching records covering duplicate information. All records for which the
similarity value exceeds a threshold can be considered as matches, or as match candidates to be
confirmed or rejected by the user. In [15] a multi-pass approach is proposed for instance matching to
reduce the overhead. It is based on matching records independently on different attributes and
combining the different match results. Assuming a single input file, each match pass sorts the
records on a specific attribute and only tests nearby records within a certain window on whether they
satisfy a predetermined matching rule. This reduces significantly the number of match rule
evaluations compared to the cartesian product approach. The total set of matches is obtained by the
union of the matching pairs of each pass and their transitive closure.
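
The multi-pass idea can be sketched compactly in Python. Each pass sorts the record indices on one
key and compares every record only with the next window - 1 records; a union-find structure then
yields the transitive closure over all passes. This is a simplified reading of the description
above, not the implementation of [15]; the function signature and the use of union-find are
assumptions.

```python
def sorted_neighborhood(records, keys, window, is_match):
    """Multi-pass windowed matching; returns clusters of record indices.

    keys is a list of functions, one sort key per pass; is_match is the matching rule.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for key in keys:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:  # only nearby records are tested
                if is_match(records[i], records[j]):
                    parent[find(i)] = find(j)        # union of the two clusters

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With n records and window size w, each pass evaluates the rule at most n * (w - 1) times instead of
roughly n * n / 2 times for all pairs, at the price of missing matches that no pass sorts close
together.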

4 Tool support
A large variety of tools is available on the market to support data transformation and data cleaning
tasks, in particular for data warehousing (see footnote 1). Some tools concentrate on a specific
domain, such as cleaning name and address data, or a specific cleaning phase, such as data analysis
or duplicate elimination. Due to their restricted domain, specialized tools typically perform very
well but must be complemented by other tools to address the broad spectrum of transformation and
cleaning problems. Other tools, e.g., ETL tools, provide comprehensive transformation and workflow
capabilities to cover a large part of the data transformation and cleaning process. A general problem
of ETL tools is their limited interoperability due to proprietary application programming interfaces
(API) and proprietary metadata formats making it difficult to combine the functionality of several
tools [8].
We first discuss tools for data analysis and data reengineering which process instance data to
identify data errors and inconsistencies, and to derive corresponding cleaning transformations. We
then present specialized cleaning tools and ETL tools, respectively.

4.1 Data analysis and reengineering tools


According to our classification in 3.1, data analysis tools can be divided into data profiling and
data mining tools. MIGRATIONARCHITECT (Evoke Software) is one of the few commercial data profiling
tools. For each attribute, it determines the following real metadata: data type, length, cardinality,
discrete values and their percentage, minimum and maximum values, missing values, and uniqueness.
MIGRATIONARCHITECT also assists in developing the target schema for data migration. Data mining
tools, such as WIZRULE (WizSoft) and DATAMININGSUITE (Information Discovery), infer relationships
among attributes and their values and compute a confidence rate indicating the number of qualifying
rows. In particular, WIZRULE can reveal three kinds of rules: mathematical formula, if-then rules,
and spelling-based rules indicating misspelled names, e.g., "value Edinburgh appears 52 times in
field Customer; 2 case(s) contain similar value(s)". WIZRULE also automatically points to the
deviations from the set of the discovered rules as suspected errors.
Data reengineering tools, e.g., INTEGRITY (Vality), utilize discovered patterns and rules to specify
and perform cleaning transformations, i.e., they reengineer legacy data. In INTEGRITY, data instances
undergo several analysis steps, such as parsing, data typing, pattern and frequency analysis. The
result of these steps is a tabular representation of field contents, their patterns and frequencies,
based on which the pattern for standardizing data can be selected. For specifying cleaning
transformations, INTEGRITY provides a language including a set of operators for column
transformations (e.g., move, split, delete) and row transformation (e.g., merge, split). INTEGRITY
identifies and consolidates records using a statistical matching technique. Automated weighting
factors are used to compute scores for ranking matches based on which the user can select the real
duplicates.

Footnote 1: For comprehensive vendor and tool listings, see commercial websites, e.g., Data Warehouse
Information Center (www.dwinfocenter.org), Data Management Review (www.dmreview.com), Data
Warehousing Institute (www.dwinstitute.com).

4.2 Specialized cleaning tools


Specializedcleaningtoolstypicallydealwithaparticulardomain,mostlynameandaddressdata,or
concentrateonduplicateelimination.Thetransformationsaretobeprovidedeitherinadvanceintheformof
arulelibraryorinteractivelybytheuser.Alternatively,datatransformationscanautomaticallybederived
fromschemamatchingtoolssuchasdescribedin[21].
Special domain cleaning:Namesandaddressesarerecordedinmanysourcesandtypicallyhavehigh
cardinality.Forexample,findingcustomermatchesisveryimportantforcustomerrelationship
management.Anumberofcommercialtools,e.g.,IDCENTRIC(FirstLogic),PUREINTEGRATE(Oracle),
QUICKADDRESS(QASSystems),REUNION(PitneyBowes),andTRILLIUM(TrilliumSoftware),focuson
cleaningthiskindofdata.Theyprovidetechniquessuchasextractingandtransformingnameand
addressinformationintoindividualstandardelements,validatingstreetnames,cities,andzipcodes,in
combinationwithamatchingfacilitybasedonthecleaneddata.Theyincorporateahugelibraryofpre
specifiedrulesdealingwiththeproblemscommonlyfoundinprocessingthisdata.Forexample,
TRILLIUMsextraction(parser)andmatchermodulecontainsover200,000businessrules.Thetoolsalso
providefacilitiestocustomizeorextendtherulelibrarywithuserdefinedrulesforspecificneeds.
Duplicate elimination: Sample tools for duplicate identification and elimination include DATACLEANSER
(EDD), MERGE/PURGE LIBRARY (Sagent/QM Software), MATCHIT (HelpIT Systems), and MASTERMERGE (Pitney
Bowes). Usually, they require the data sources to be cleaned already before matching. Several
approaches for matching attribute values are supported; tools such as DATACLEANSER and MERGE/PURGE
LIBRARY also allow user-specified matching rules to be integrated.
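
A user-specified matching rule of the kind mentioned above can be sketched as a weighted combination of field similarities. The sketch below is an illustration under stated assumptions, not the method of any listed tool: the field weights, the threshold of 0.85 and the use of difflib's SequenceMatcher ratio as the similarity measure are all choices made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical user-specified matching rule: field weights and decision threshold.
WEIGHTS = {"name": 0.6, "city": 0.4}
THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def find_duplicates(records):
    """Return (i, j, score) for all record pairs at or above the threshold, best first."""
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        score = match_score(r1, r2)
        if score >= THRESHOLD:
            pairs.append((i, j, round(score, 3)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical, already cleaned customer records.
records = [
    {"name": "John Smith", "city": "Leipzig"},
    {"name": "Jon Smith", "city": "Leipzig"},
    {"name": "Mary Jones", "city": "Berlin"},
]
print(find_duplicates(records))
```

The pairwise comparison is quadratic in the number of records, so this form is only practical for small inputs; it is meant to show the scoring rule, not an efficient matching strategy.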

4.3 ETL tools


A large number of commercial tools support the ETL process for data warehouses in a comprehensive way,
e.g., COPYMANAGER (Information Builders), DATASTAGE (Informix/Ardent), EXTRACT (ETI), POWERMART
(Informatica), DECISIONBASE (CA/Platinum), DATATRANSFORMATIONSERVICE (Microsoft), METASUITE
(Minerva/Carleton), SAGENTSOLUTIONPLATFORM (Sagent), and WAREHOUSEADMINISTRATOR (SAS). They use a
repository built on a DBMS to manage all metadata about the data sources, target schemas, mappings,
script programs, etc., in a uniform way. Schemas and data are extracted from operational data sources
via both native file and DBMS gateways as well as standard interfaces such as ODBC and EDA. Data
transformations are defined with an easy-to-use graphical interface. To specify individual mapping
steps, a proprietary rule language and a comprehensive library of predefined conversion functions are
typically provided. The tools also support reusing existing transformation solutions, such as external
C/C++ routines, by providing an interface to integrate them into the internal transformation library.
Transformation processing is carried out either by an engine that interprets the specified
transformations at runtime, or by compiled code. All engine-based tools (e.g., COPYMANAGER,
DECISIONBASE, POWERMART, DATASTAGE, WAREHOUSEADMINISTRATOR) possess a scheduler and support workflows
with complex execution dependencies among mapping jobs. A workflow may also invoke external tools,
e.g., for specialized cleaning tasks such as name/address cleaning or duplicate elimination.
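
The notion of a workflow with execution dependencies among mapping jobs can be sketched with a topological ordering. The job names and the run_workflow helper below are hypothetical, and the sketch runs jobs sequentially, whereas the schedulers of the tools above are considerably more elaborate. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical mapping jobs: each job maps to the set of jobs it depends on.
workflow = {
    "extract_customers": set(),
    "extract_orders": set(),
    "clean_addresses": {"extract_customers"},
    "deduplicate_customers": {"clean_addresses"},
    "load_warehouse": {"deduplicate_customers", "extract_orders"},
}


def run_workflow(jobs, run_job):
    """Run every job after all of its dependencies and return the execution order."""
    order = list(TopologicalSorter(jobs).static_order())
    for name in order:
        # In a real workflow this call could invoke an external cleaning tool.
        run_job(name)
    return order


executed = []
order = run_workflow(workflow, executed.append)
print(order)
```

A cyclic dependency makes TopologicalSorter raise graphlib.CycleError, which is a convenient way to reject an inconsistent workflow definition before any job runs.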
ETL tools typically have little built-in data cleaning capabilities but allow the user to specify
cleaning functionality via a proprietary API. There is usually no data analysis support to
automatically detect data errors and inconsistencies. However, users can implement such logic with the
metadata maintained and by determining content characteristics with the help of aggregation functions
(sum, count, min, max, median, variance, deviation, ...). The provided transformation library covers
many data transformation and cleaning needs, such as data type conversions (e.g., date reformatting),
string functions (e.g., split, merge, replace, sub-string search), arithmetic, scientific and
statistical functions, etc. Extraction of values from free-form attributes is not completely automatic;
the user has to specify the delimiters separating sub-values.
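
As a rough illustration of how content characteristics can be determined with aggregation functions, and of delimiter-based extraction from a free-form attribute, consider the following Python sketch. The function names profile, out_of_range and split_attribute, the sample values and the plausible age range are assumptions for this example and do not correspond to any particular ETL tool.

```python
import statistics


def profile(values):
    """Summarize a numeric column with simple aggregation functions."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "stdev": statistics.pstdev(present),
    }


def out_of_range(values, low, high):
    """Return the non-missing values outside the user-supplied valid range."""
    return [v for v in values if v is not None and not (low <= v <= high)]


def split_attribute(value, delimiter, names):
    """Split a free-form attribute at a user-specified delimiter into named sub-values."""
    parts = [p.strip() for p in value.split(delimiter)]
    return dict(zip(names, parts))


# Hypothetical age column with a missing value and an implausible entry.
ages = [34, 29, None, 41, 38, 999, 27]
stats = profile(ages)
suspects = out_of_range(ages, 0, 120)
person = split_attribute("Smith; Kristen L.; Leipzig", ";", ["last", "first", "city"])
print(stats, suspects, person)
```

A maximum far above the median, as in this column, is the kind of content characteristic that hints at an illegal value even when no explicit range constraint is declared.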
The rule languages typically cover if-then and case constructs that help in handling exceptions in data
values, such as misspellings, abbreviations, missing or cryptic values, and values outside of range.
These problems can also be addressed by using a table lookup construct and join functionality. Support
for instance matching is typically restricted to the use of the join construct and some simple string
matching functions, e.g., exact or wildcard matching and soundex. However, user-defined field matching
functions as well as functions for correlating field similarities can be programmed and added to the
internal transformation library.
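
The table lookup construct and soundex-based matching mentioned above can be sketched as follows. This is an illustrative Python version, not the rule language of any ETL tool: the lookup table contents and the helper names resolve and soundex_join are assumptions, and soundex follows the common American Soundex rules (first letter kept, three digits, 'h' and 'w' not separating equal codes).

```python
# Hypothetical lookup table resolving abbreviations and cryptic values.
LOOKUP = {"nyc": "New York", "la": "Los Angeles", "n/a": None}

# Soundex digit for each consonant; vowels, 'h', 'w' and 'y' have no code.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def resolve(value):
    """Replace a value by its lookup entry; values not in the table pass through unchanged."""
    return LOOKUP.get(value.strip().lower(), value)


def soundex(name):
    """Return the four-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        if c not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = code
    return (result + "000")[:4]


def soundex_join(left, right):
    """Join two name lists on equal Soundex codes."""
    index = {}
    for name in right:
        index.setdefault(soundex(name), []).append(name)
    return [(l, r) for l in left for r in index.get(soundex(l), [])]


print(resolve("NYC"), soundex("Tymczak"))
print(soundex_join(["Robert", "Meier"], ["Rupert", "Meyer", "Schmidt"]))
```

The join pairs "Meier" with "Meyer" but also "Robert" with "Rupert", which shows why such simple phonetic matching usually needs to be complemented by user-defined field matching functions.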

5 Conclusions
We provided a classification of data quality problems in data sources, differentiating between single-
and multi-source and between schema- and instance-level problems. We further outlined the major steps
for data transformation and data cleaning and emphasized the need to cover schema- and instance-related
data transformations in an integrated way. Furthermore, we provided an overview of commercial data
cleaning tools. While the state of the art in these tools is quite advanced, they typically cover only
part of the problem and still require substantial manual effort or self-programming. Furthermore, their
interoperability is limited (proprietary APIs and metadata representations).
So far only a little research has appeared on data cleaning, although the large number of tools
indicates both the importance and difficulty of the cleaning problem. We see several topics deserving
further research. First of all, more work is needed on the design and implementation of the best
language approach for supporting both schema and data transformations. For instance, operators such as
Match, Merge or Mapping Composition have either been studied at the instance (data) or schema
(metadata) level but may be built on similar implementation techniques. Data cleaning is not only
needed for data warehousing but also for query processing on heterogeneous data sources, e.g., in
web-based information systems. This environment poses much more restrictive performance constraints for
data cleaning that need to be considered in the design of suitable approaches. Furthermore, data
cleaning for semi-structured data, e.g., based on XML, is likely to be of great importance given the
reduced structural constraints and the rapidly increasing amount of XML data.
Acknowledgments
We would like to thank Phil Bernstein, Helena Galhardas and Sunita Sarawagi for helpful comments.
References
[1] Abiteboul, S.; Cluet, S.; Milo, T.; Mogilevsky, P.; Simeon, J.: Tools for Data Translation and Integration. In [26]:3-8, 1999.
[2] Batini, C.; Lenzerini, M.; Navathe, S.B.: A Comparative Analysis of Methodologies for Database Schema Integration. In Computing Surveys 18(4):323-364, 1986.
[3] Bernstein, P.A.; Bergstraesser, T.: Metadata Support for Data Transformation Using Microsoft Repository. In [26]:9-14, 1999.
[4] Bernstein, P.A.; Dayal, U.: An Overview of Repository Technology. Proc. 20th VLDB, 1994.
[5] Bouzeghoub, M.; Fabret, F.; Galhardas, H.; Pereira, J.; Simon, E.; Matulovic, M.: Data Warehouse Refreshment. In [16]:47-67.
[6] Chaudhuri, S.; Dayal, U.: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1), 1997.
[7] Cohen, W.: Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity. Proc. ACM SIGMOD Conf. on Data Management, 1998.
[8] Do, H.H.; Rahm, E.: On Metadata Interoperability in Data Warehouses. Techn. Report, Dept. of Computer Science, Univ. of Leipzig. http://dol.uni-leipzig.de/pub/2000-13.
[9] Doan, A.H.; Domingos, P.; Levy, A.Y.: Learning Source Description for Data Integration. Proc. 3rd Intl. Workshop The Web and Databases (WebDB), 2000.
[10] Fayyad, U.: Mining Database: Towards Algorithms for Knowledge Discovery. IEEE Techn. Bulletin Data Engineering 21(1), 1998.
[11] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: Declaratively cleaning your data using AJAX. In Journees Bases de Donnees, Oct. 2000. http://caravel.inria.fr/~galharda/BDA.ps.
[12] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E.: AJAX: An Extensible Data Cleaning Tool. Proc. ACM SIGMOD Conf., p. 590, 2000.
[13] Haas, L.M.; Miller, R.J.; Niswonger, B.; Tork Roth, M.; Schwarz, P.M.; Wimmers, E.L.: Transforming Heterogeneous Data with Database Middleware: Beyond Integration. In [26]:31-36, 1999.
[14] Hellerstein, J.M.; Stonebraker, M.; Caccia, R.: Independent, Open Enterprise Data Integration. In [26]:43-49, 1999.
[15] Hernandez, M.A.; Stolfo, S.J.: Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery 2(1):9-37, 1998.
[16] Jarke, M.; Lenzerini, M.; Vassiliou, Y.; Vassiliadis, P.: Fundamentals of Data Warehouses. Springer, 2000.
[17] Kashyap, V.; Sheth, A.P.: Semantic and Schematic Similarities between Database Objects: A Context-Based Approach. VLDB Journal 5(4):276-304, 1996.


[18] Lakshmanan, L.; Sadri, F.; Subramanian, I.N.: SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems. Proc. 22nd VLDB, 1996.
[19] Lee, M.L.; Lu, H.; Ling, T.W.; Ko, Y.T.: Cleansing Data for Mining and Warehousing. Proc. 10th Intl. Conf. Database and Expert Systems Applications (DEXA), 1999.
[20] Li, W.S.; Clifton, S.: SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks. In Data and Knowledge Engineering 33(1):49-84, 2000.
[21] Milo, T.; Zohar, S.: Using Schema Matching to Simplify Heterogeneous Data Translation. Proc. 24th VLDB, 1998.
[22] Monge, A.E.: Matching Algorithm within a Duplicate Detection System. IEEE Techn. Bulletin Data Engineering 23(4), 2000 (this issue).
[23] Monge, A.E.; Elkan, P.C.: The Field Matching Problem: Algorithms and Applications. Proc. 2nd Intl. Conf. Knowledge Discovery and Data Mining (KDD), 1996.
[24] Parent, C.; Spaccapietra, S.: Issues and Approaches of Database Integration. Comm. ACM 41(5):166-178, 1998.
[25] Raman, V.; Hellerstein, J.M.: Potter's Wheel: An Interactive Framework for Data Cleaning. Working Paper, 1999. http://www.cs.berkeley.edu/~rshankar/papers/pwheel.pdf.
[26] Rundensteiner, E. (ed.): Special Issue on Data Transformation. IEEE Techn. Bull. Data Engineering 22(1), 1999.
[27] Quass, D.: A Framework for Research in Data Cleaning. Unpublished Manuscript. Brigham Young Univ., 1999.
[28] Sapia, C.; Höfling, G.; Müller, M.; Hausdorf, C.; Stoyan, H.; Grimmer, U.: On Supporting the Data Warehouse Design by Data Mining Techniques. Proc. GI-Workshop Data Mining and Data Warehousing, 1999.
[29] Savasere, A.; Omiecinski, E.; Navathe, S.: An Efficient Algorithm for Mining Association Rules in Large Databases. Proc. 21st VLDB, 1995.
[30] Srikant, R.; Agrawal, R.: Mining Generalized Association Rules. Proc. 21st VLDB Conf., 1995.
[31] Tork Roth, M.; Schwarz, P.M.: Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. Proc. 23rd VLDB, 1997.
[32] Wiederhold, G.: Mediators in the Architecture of Future Information Systems. Computer 25(3):38-49, 1992.
