
Data Preprocessing

Why Is Data Preprocessing Important?

- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

July 28, 2009

Data Mining: Concepts and Techniques

Multi-Dimensional Measure of Data Quality

- An important issue for data warehousing and data mining
- Real-world data tend to be incomplete, noisy, and inconsistent

Data preprocessing includes:
- data cleaning
- data integration
- data transformation
- data reduction

Why Data Preprocessing?

Major Tasks in Data Preprocessing

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=""
- noisy: containing errors or outliers
  - e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names
  - e.g., Age="42", Birthday="03/07/1997"
  - e.g., was rating "1, 2, 3", now rating "A, B, C"
  - e.g., discrepancy between duplicate records


Why Is Data Dirty?

Forms of Data Preprocessing



Chapter 2: Data Preprocessing


- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Measuring the Dispersion of Data

- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, M (median), Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation):
    sample:     s^2 = 1/(n-1) * Σ_{i=1..n} (x_i - x̄)^2 = 1/(n-1) * [ Σ x_i^2 - (1/n)(Σ x_i)^2 ]
    population: σ^2 = 1/N * Σ_{i=1..N} (x_i - μ)^2 = 1/N * Σ x_i^2 - μ^2
  - The standard deviation s (or σ) is the square root of the variance s^2 (or σ^2)
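These dispersion measures translate directly into code; a minimal Python sketch (the helper names and the rank-interpolation scheme for quantiles are my own choices, not from the slides):

```python
def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) of a numeric sequence."""
    xs = sorted(data)
    n = len(xs)

    def quantile(q):
        # linear interpolation between the two closest ranks
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    return xs[0], quantile(0.25), quantile(0.5), quantile(0.75), xs[-1]

def iqr_outliers(data):
    """Values more than 1.5 x IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

def sample_variance(data):
    """Scalable one-pass form: s^2 = (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    n = len(data)
    s = sum(data)
    s2 = sum(x * x for x in data)
    return (s2 - s * s / n) / (n - 1)
```

For example, `iqr_outliers([1, 2, 3, 4, 5, 100])` flags `100`, and `sample_variance([1, 2, 3])` is `1.0`.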

Mining Data Descriptive Characteristics


- Motivation: to better understand the data (central tendency, variation and spread)
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Properties of Normal Distribution Curve

- The normal (distribution) curve (μ: mean, σ: standard deviation)
  - From μ-σ to μ+σ: contains about 68% of the measurements
  - From μ-2σ to μ+2σ: contains about 95% of the measurements
  - From μ-3σ to μ+3σ: contains about 99.7% of the measurements
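The 68/95/99.7 percentages can be verified empirically by sampling; a small sketch (the sample size and seed are my own choices):

```python
import random

def within_k_sigma(draws, mu, sigma, k):
    """Fraction of draws falling inside [mu - k*sigma, mu + k*sigma]."""
    return sum(mu - k * sigma <= x <= mu + k * sigma for x in draws) / len(draws)

random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    frac = within_k_sigma(draws, 0.0, 1.0, k)
    print(f"within {k} sigma: {frac:.3f} (theory: about {expected})")
```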

Measuring the Central Tendency

- Mean (algebraic measure) (sample vs. population):
    sample:     x̄ = 1/n * Σ_{i=1..n} x_i
    population: μ = (Σ x) / N
  - Weighted arithmetic mean: x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i
  - Trimmed mean: chopping extreme values
- Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data):
    median = L1 + ( (n/2 - (Σ f)_l) / f_median ) * c
    (L1: lower boundary of the median class, (Σ f)_l: sum of frequencies below it, f_median: frequency of the median class, c: class width)
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: mean - mode = 3 * (mean - median)

Boxplot Analysis

- Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum
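The central-tendency measures above can be sketched in plain Python (the function names are mine; `modes` returns every value tied for the top frequency, covering the uni-/bi-/trimodal cases):

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, p=0.1):
    """Mean after chopping a fraction p of extreme values at each end."""
    xs = sorted(xs)
    k = int(len(xs) * p)
    return mean(xs[k:len(xs) - k])

def median(xs):
    """Middle value for an odd count, average of the middle two otherwise."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def modes(xs):
    """All values occurring with the maximum frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)
```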

Symmetric vs. Skewed Data

- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data

Visualization of Data Dispersion: Boxplot Analysis


Histogram Analysis

- Graph displays of basic statistical class descriptions
- Frequency histograms:
  - a univariate graphical method
  - consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Loess Curve

- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- Fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Quantile Plot

- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information:
  - for data x_i sorted in increasing order, f_i indicates that approximately 100 * f_i % of the data are below or equal to the value x_i

Positively and Negatively Correlated Data

Quantile-Quantile (Q-Q) Plot

- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another

Not Correlated Data

Scatter Plot

- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Data PREPROCESSING


Data Preprocessing: Data Reduction

- Used to obtain a reduced representation of the data while minimizing the loss of information content
- Techniques: data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization
- (Figure: a table with tuples T1, T2, ..., T2000 and attributes A1, A2, A3, ..., A126 reduced to one with tuples T1, T4, ..., T1456 and attributes A1, A2, A3, ..., A115)

DATA CLEANING

Data Cleaning: Missing Values

Data Integration

- Combines data from multiple sources into a coherent data store, e.g. a data warehouse
- Sources may include multiple databases, data cubes, or flat files
- Issues in data integration:
  - schema integration and resolution of data value conflicts
  - redundancy detection

Data Cleaning: Noisy Data

- Noise: random error or variance in a measured variable
- Smooth out the data to remove the noise

Data Integration: Schema Integration

- Integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Data Integration: Redundancy

- Redundant data occur often when multiple databases are integrated:
  - the same attribute may have different names in different databases
  - one attribute may be a derived attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
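Correlation analysis for redundancy detection can be sketched with the Pearson correlation coefficient: an |r| close to 1 (as between monthly and annual revenue) marks one of the two attributes as a removal candidate. A minimal sketch (the function name is mine):

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient r between two numeric attributes."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# A derived attribute correlates perfectly with its source:
print(pearson_r([10, 20, 30], [120, 240, 360]))  # 1.0 up to rounding
```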


Simple Discretization Methods: Binning

Data Cleaning: Regression

- (Figure: data points in the (x, y) plane fitted by the regression line y = x + 1)

Binning Methods for Data Smoothing

Data Cleaning: Inconsistent Data

- Can be corrected manually using external references
- Source of inconsistency: errors made at data entry, which can be corrected using a paper trace
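The binning slide's worked example did not survive extraction, but equal-depth binning with smoothing by bin means can be sketched as follows (the price data below is a common textbook illustration, not recovered from this document):

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth (equal-frequency) binning, then replace each value
    by the mean of its bin; returns the smoothed values in sorted order."""
    xs = sorted(values)
    size = len(xs) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any remainder
        bin_ = xs[i * size:] if i == n_bins - 1 else xs[i * size:(i + 1) * size]
        bin_mean = sum(bin_) / len(bin_)
        smoothed.extend([bin_mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```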

Data Cleaning: Cluster Analysis

- Clustering: outliers may be detected by clustering, where similar values are organized into groups or clusters
- Combined computer and human inspection
- Regression

DATA REDUCTION


Data Cube Aggregation

- Sales data for company AllElectronics for 1997-1999 (p. 73): quarterly sales are aggregated (rolled up) to annual sales

  Year = 1997 (similar tables exist for 1998 and 1999):
  Quarter   Sales
  Q1        $224,000
  Q2        $408,000
  Q3        $350,000
  Q4        $586,000

  Aggregated:
  Year   Sales
  1997   $1,568,000
  1998   $2,356,000
  1999   $3,594,000

Dimensionality Reduction

- Data sets for analysis may contain hundreds of attributes that may be irrelevant to the mining task or redundant
- Dimensionality reduction reduces the dataset size by removing such attributes
- How can we find a good subset of the original attributes? Attribute subset selection finds a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
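The roll-up from quarterly to annual sales shown above is a group-by-and-sum; a minimal sketch using the 1997 figures from the table:

```python
from collections import defaultdict

# (year, quarter, sales) facts at the finest granularity
quarterly = [
    (1997, "Q1", 224_000),
    (1997, "Q2", 408_000),
    (1997, "Q3", 350_000),
    (1997, "Q4", 586_000),
]

# roll up: aggregate away the quarter dimension
annual = defaultdict(int)
for year, _, sales in quarterly:
    annual[year] += sales

print(dict(annual))  # {1997: 1568000}
```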

Data Reduction (outline)

- Data cube aggregation
- Dimensionality reduction
- Data compression
- Numerosity reduction
- Discretization and concept hierarchy generation

Dimensionality Reduction: Attribute Subset Selection Techniques

- (Figure: the role of dimension reduction in data mining: data preparation brings the data into standard form, dimension reduction produces a data subset, which feeds prediction methods and evaluation)

Example of Decision Tree Induction

- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- (Figure: the induced tree branches on A4, then A1 and A6, leading to leaves Class 1 and Class 2)
- Reduced attribute set: {A1, A4, A6}

Data Compression Methods

Dimensionality Reduction: Attribute Subset Selection Techniques

- Reducts: computed by rough set theory
- The selection of attributes is identified by the concept of discernibility relations of classes in the dataset
- Will be discussed in the next class

DATA COMPRESSION

Data Compression

- Apply data encoding or transformation to obtain a reduced or compressed representation of the original data
- Lossless: although typically lossless, such methods allow only limited manipulation of the data
- Lossy

NUMEROSITY REDUCTION


Regression Analysis and Log-Linear Models

Sampling

- (Figure: raw data reduced by simple random sampling without replacement (SRSWOR) and with replacement (SRSWR), and by cluster/stratified sampling)

Data Reduction Method (2): Histograms

- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
  - V-optimal: the histogram with the least variance (the weighted sum of the variances of the original values each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values with the largest differences
- (Figure: example frequency histogram over the value range 10,000-90,000)
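The equal-width and equal-frequency partitioning rules translate directly into code (the function names are mine; V-optimal and MaxDiff require a search over boundaries and are omitted):

```python
def equal_width_buckets(values, k):
    """Counts per bucket, each bucket covering an equal value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bucket
        counts[i] += 1
    return counts

def equal_depth_buckets(values, k):
    """Buckets holding roughly the same number of values each."""
    xs = sorted(values)
    size = len(xs) // k
    return [xs[i * size:] if i == k - 1 else xs[i * size:(i + 1) * size]
            for i in range(k)]
```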

Clustering

- Partition the data set into clusters, and store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is smeared
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms (further detailed in Chapter 8)

Hierarchical Reduction

- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation:
  - an index tree hierarchically divides a data set into partitions by the value range of some attributes
  - each partition can be considered as a bucket
  - thus an index tree with aggregates stored at each node is a hierarchical histogram

Sampling

- Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data:
  - simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods:
  - Stratified sampling:
    - approximate the percentage of each class (or subpopulation of interest) in the overall database
    - used in conjunction with skewed data
- Sampling may not reduce database I/Os (one page at a time)
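Stratified sampling as described above, drawing from each class at the same rate so skewed subpopulations keep their proportions, can be sketched as follows (the API is my own):

```python
import random

def stratified_sample(records, key, frac, seed=0):
    """Sample each stratum (defined by key) at rate frac."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * frac))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# 90% class "A", 10% class "B"; a 10% stratified sample keeps the 9:1 ratio
records = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], frac=0.1)
```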


Discretization

Discretization and Concept Hierarchy

Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,
  Ent(S) - E(T, S) > δ
- Experiments show that it may reduce data size and improve classification accuracy
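One level of the recursion, choosing the boundary T that minimizes E(S, T), can be sketched directly from the formula above (the names are mine):

```python
import math
from collections import Counter

def ent(labels):
    """Class entropy Ent(S) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    """samples: list of (value, class_label) pairs.
    Returns (T, E(S, T)) for the boundary T minimizing the weighted entropy
    E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    samples = sorted(samples)
    n = len(samples)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        if samples[i - 1][0] == samples[i][0]:
            continue  # boundaries lie between distinct attribute values
        t = (samples[i - 1][0] + samples[i][0]) / 2
        left = [c for _, c in samples[:i]]
        right = [c for _, c in samples[i:]]
        e = len(left) / n * ent(left) + len(right) / n * ent(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e
```

For example, with class "a" at values 1-3 and class "b" at 10-11, the chosen boundary is 6.5 with zero post-split entropy.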

Segmentation by Natural Partitioning

- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
  - If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Discretization and Concept Hierarchy Generation for Numeric Data

- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)
- Entropy-based discretization
- Segmentation by natural partitioning

Example of 3-4-5 Rule

- Step 1: for the profit data, Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
- Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High rounded up to $2,000
- Step 3: the range from -$1,000 to $2,000 covers 3 distinct values at the msd, so it is split into 3 equi-width intervals: (-$1,000 to $0], ($0 to $1,000], ($1,000 to $2,000]
- Step 4: the boundaries are adjusted to the actual data: the first interval becomes (-$400 to $0] and ($2,000 to $5,000] is added to cover Max; each interval is then recursively subdivided, e.g. (-$400 to $0] into four $100-wide intervals, ($0 to $1,000] and ($1,000 to $2,000] into five $200-wide intervals each, and ($2,000 to $5,000] into three $1,000-wide intervals
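One level of the 3-4-5 rule can be sketched as follows (the fallback branch for counts the rule does not mention is my own assumption):

```python
import math

def rule_3_4_5(low, high):
    """Split [low, high] into 3, 4, or 5 equi-width intervals based on the
    number of distinct values covered at the most significant digit."""
    span = high - low
    msd_unit = 10 ** int(math.floor(math.log10(span)))
    n_distinct = round(span / msd_unit)
    if n_distinct in (3, 6, 7, 9):
        k = 3
    elif n_distinct in (2, 4, 8):
        k = 4
    elif n_distinct in (1, 5, 10):
        k = 5
    else:
        k = n_distinct  # assumption: fall back to one interval per msd value
    width = span / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]

# Step 3 of the example: the range -$1,000 to $2,000 covers 3 msd values
print(rule_3_4_5(-1000, 2000))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```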


Concept Hierarchy Generation

- Many techniques can be applied recursively in order to provide a hierarchical partitioning of the attribute, i.e. a concept hierarchy
- Concept hierarchies are useful for mining at multiple levels of abstraction

Concept Hierarchy Generation for Categorical Data

Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Automatic Concept Hierarchy Generation

- Some concept hierarchies can be automatically generated based on an analysis of the number of distinct values per attribute in the given data set (but the semantics of the attributes and the relations among them must be considered)
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Note the exception: weekday, month, quarter, year
- Example: country (15 distinct values) > province_or_state (365) > city (3,567) > street (674,339)

Data Transformation: Normalization

- min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
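The three normalization formulas translate directly into code (the example values are illustrative, not taken from this document):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Center on the mean and scale by the standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """v' = v / 10^j for the smallest j such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# e.g. an income of 73,600 in [12,000, 98,000] maps to about 0.716 in [0, 1]
print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```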

Discretization and Concept Hierarchy Generation

- Manual discretization: the information needed to convert the continuous values into discrete values is obtained from an expert in the domain area (Assignment 1)

Assignment 1

Data Discretization

Table 6: Discretization of the mathematical symbols (two orientations per symbol; the symbol labels and the Results column did not survive extraction)

Orientation   h02  h03  h11  h12  h13  h21  h22  h30  h31
#1             1    2    1    2    2    2    2    1    2
#2             0    1    0    1    1    1    1    0    0
#1             0    1    1    0    2    1    2    1    2
#2             0    0    1    1    1    0    0    1    1
#1             2    2    0    2    1    2    2    1    0
#2             1    1    0    1    1    0    1    1    1
#1             0    1    1    1    2    2    1    0    2
#2             0    2    1    0    2    2    0    0    1
#1             2    0    0    2    0    1    1    0    1
#2             1    1    0    1    1    0    1    0    0
#1             2    2    1    2    0    1    2    1    2
#2             2    2    1    2    2    2    2    1    2
#1             0    0    0    0    0    0    0    1    0
#2             0    0    0    0    0    1    0    1    1

Title: Preprocessing of XXX dataset: experiment on manual and automated techniques

- Manual preprocessing of the given dataset
  - Identify incomplete, noisy and inconsistent data
  - Use statistical techniques such as frequency counts and boxplots to detect such data
  - Record the number of missing values and noisy data
  - Also record which tuples are involved (if not too many)
- Manual discretization
  - Use binning, histogram analysis, regression, clustering
- Compare with automated techniques (later in class)
  - Binning, entropy, ...
- Dataset: ..\example_data.xls
- For automated preprocessing use http://rosetta.lcb.uu.se/

Summary

- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but data preparation remains an active area of research

Data Discretization

Table 5: The invariance features for mathematical symbols (one row per symbol orientation, matching Table 6; the symbol labels did not survive extraction)

h02      h03      h11      h12      h13      h21      h22      h30      h31
0.86711  0.18849  0.08184  0.16839  0.12728  0.01923  0.24873  0.12638  0.04125
0.54536  0.02198  0.02583  0.0241   0.01231  0.01844  0.1193   0.00087  0.00535
0.58806  0.05518  0.08122  0.00895  0.07504  0.01626  0.18318  0.03664  0.05776
0.61814  0.00880  0.05408  0.01927  0.05894  0.00178  0.07934  0.01363  0.02165
0.88477  0.14812  0.01660  0.13137  0.06236  0.02861  0.21195  0.04551  0.00528
0.80491  0.05006  0.03593  0.01596  0.04019  0.00195  0.12116  0.01324  0.01841
0.73293  0.05052  0.16291  0.05135  0.11263  0.02107  0.1385   0.00799  0.07375
0.66253  0.08034  0.03918  0.01415  0.10883  0.01978  0.11662  0.0049   0.01161
0.91948  0.02059  0.01081  0.06653  0.00924  0.01543  0.15602  0.00388  0.00697
0.82281  0.06182  0.02135  0.03221  0.03237  0.01006  0.12365  0.00398  0.00606
2.213    0.71402  0.059    0.22918  0.00903  0.01181  0.63556  0.05279  0.08960
2.15402  0.18761  0.08548  0.33771  0.81689  0.11741  0.70659  0.03468  0.13071
0.15565  0.00002  0.00662  0.00547  0.00182  0.00775  0.03896  0.02263  0.00017
0.16081  0.01299  0.01091  0.00812  0.00205  0.01267  0.04902  0.04908  0.01069
