(3rd ed.)
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Chapter 3
Jiawei Han
Han, Micheline Kamber,
Kamber and Jian Pei
University of Illinois at Urbana-Champaign &
Data cleaning
Data integration
Data reduction
understood?
Dimensionality reduction
Numerosityy reduction
Data compression
Normalization
Data Cleaning
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Age=42,
42 Birthday=03/07/2010
03/07/2010
equipment malfunction
Noisy Data
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
10
11
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
12
Data Integration
Data integration:
Identify real world entities from multiple data sources, e.g., Bill
For the same real world entity, attribute values from different
13
14
2 (chi-square) test
2 =
(Observed Expected
p
)
Expected
The larger the 2 value, the more likely the variables are
related
Sum (row)
250(90)
200(360)
450
50(210)
1000(840)
1050
Sum(col.)
300
1200
1500
Play chess
Like science fiction
i =1
(ai A)(bi B)
(n 1) A B
i =1
(ai bi ) n AB
(n 1) A B
Scatter plots
showing the
similarity from
1 to 1.
18
Correlation coefficient:
where n is the number of tuples, A and B are the respective mean or
expected values of A and B, A and B are the respective standard
deviation of A and B.
19
Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence20
Co-Variance: An Example
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Curse of dimensionality
23
Dimensionality reduction
All
Allow
easier
i visualization
i
li ti
Wavelet transforms
Fourier transform
Wavelet transform
Frequency
25
26
Wavelet Decomposition
Wavelet Transformation
Haar2
Daubechie4
Method:
28
Coefficient Supports
Hierarchical
2.75
decomposition
+
structure (a.k.a.
error tree) + -1.25
0.5
+
+
2
-1
0.5
-1
- +
2
-1
-1
- +
-1.25
2.75
29
30
The original data are projected onto a much smaller space, resulting
in dimensionality reduction
reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
x2
Normalize input data: Each attribute falls within the same range
x1
31
Redundant attributes
Irrelevant attributes
33
34
35
Y1
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability
distributions
Y1
y=x+1
off numerical
i l data
d t consisting
i ti off values
l
off a
X1
37
38
Using the least squares criterion to the known values of Y1, Y2, ,
X1, X2, .
15
90000
80000
70000
Log-linear models:
25
60000
30
50000
Multiple regression: Y = b0 + b1 X1 + b2 X2
Partitioning rules:
40000
30000
20000
10000
Histogram Analysis
100000
Regression Analysis
40
10
Sampling
Clustering
Can have hierarchical clustering and be stored in multidimensional index tree structures
41
Types of Sampling
42
Raw Data
43
44
11
Raw Data
Cluster/Stratified Sample
The aggregated
gg g
data for an individual entityy of interest
45
Data Compression
46
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is
possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
Time sequence is not audio
Typically short and vary slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
Compressed
Data
Original Data
lossless
Original Data
Approximated
47
48
12
Data Transformation
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
Attribute/feature construction
49
v' =
v A
73,600 54,000
= 1.225
16,000
v
v' = j
10
min-max normalization
z-score normalization
50
v minA
(new _ maxA new _ minA) + new _ minA
maxA minA
v'=
Discretization
Normalization
13
Binning
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B A)/N.
Histogram analysis
54
Data
14
57
58
Somehierarchiescanbeautomaticallygeneratedbasedon
theanalysisofthenumberofdistinctvaluesperattributein
thedataset
Theattributewiththemostdistinctvaluesisplacedat
thelowestlevelofthehierarchy
Exceptions,e.g.,weekday,month,quarter,year
country
province_or_ state
city
street
59
15 distinct values
15
Summary
Data Quality
Data Cleaning
Data Integration
Data Reduction
Summary
61
References
D.P.BallouandG.K.Tayi.Enhancingdataqualityindatawarehouseenvironments.Comm.of
ACM,42:7378,1999
A.Bruce,D.Donoho,andH.Y.Gao.Waveletanalysis.IEEESpectrum,Oct1996
T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003
T.DasuandT.Johnson.ExploratoryDataMiningandDataCleaning.JohnWiley,2003
J.DevoreandR.Peck.Statistics:TheExplorationandAnalysisofData.DuxburyPress,1997.
H.Galhardas,D.Florescu,D.Shasha,E.Simon,andC.A.Saita.Declarativedatacleaning:
Language,model,andalgorithms.VLDB'01
M.HuaandJ.Pei.Cleaningdisguisedmissingdata:Aheuristicapproach.KDD'07
H.V.Jagadish,etal.,SpecialIssueonDataReductionTechniques.BulletinoftheTechnical
CommitteeonDataEngineering,20(4),Dec.1997
H.LiuandH.Motoda(eds.).FeatureExtraction,Construction,andSelection:ADataMining
Perspective.KluwerAcademic,1998
p
,
J.E.Olson.DataQuality:TheAccuracyDimension.MorganKaufmann,2003
D.Pyle.DataPreparationforDataMining.MorganKaufmann,1999
V.RamanandJ.Hellerstein.PottersWheel:AnInteractiveFrameworkforDataCleaningand
Transformation,VLDB2001
T.Redman.DataQuality:TheFieldGuide.DigitalPress(Elsevier),2001
R.Wang,V.Storey,andC.Firth.Aframeworkforanalysisofdataqualityresearch.IEEETrans.
KnowledgeandDataEngineering,7:623640,1995
63
16