Why Data Preprocessing?: Incomplete

Why Data Preprocessing?
Data in the real world is dirty

incomplete: lacking attribute values,
lacking certain attributes of interest, or
containing only aggregate data
noisy: containing errors or outliers
e.g., occupation=
e.g., Salary=-10
inconsistent: containing discrepancies in

codes or names
August 10, 2015
e.g., Age=42 Birthday=03/07/1997

e.g., Was rating 1,2,3, now rating A, B, C
e.g., discrepancy between duplicate records
Data Mining: Concepts and
Techniques
Why Is Data Dirty?
Incomplete data may come from
Noisy data (incorrect values) may come from
Faulty data collection instruments

Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Not applicable data value when collected

Different considerations between the time when the data
was collected and when it is analyzed.
Human/hardware/software problems
Different data sources

Functional dependency violation (e.g., modify some linked
data)
Duplicate records also need data cleaning
August 10, 2015

Techniques
Why Is Data Preprocessing

Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or

even misleading statistics.
Data warehouse needs consistent integration of

quality data
Data extraction, cleaning, and transformation

comprises the majority of the work of building a
data warehouse
August 10, 2015

Techniques
Multi-Dimensional Measure of Data

Quality
A well-accepted multidimensional view:

Accuracy
Completeness
Consistency
Timeliness
Value added
Interpretability
Accessibility
August 10, 2015

Techniques
Major Tasks in Data

Preprocessing
Data cleaning
Data integration
Normalization and aggregation
Data reduction
Integration of multiple databases, data cubes, or files
Data transformation
Fill in missing values, smooth noisy data, identify or

remove outliers, and resolve inconsistencies
Obtains reduced representation in volume but produces

the same or similar analytical results
Data discretization
Part of data reduction but with particular importance,

especially for numerical data
August 10, 2015

Techniques
Forms of Data Preprocessing
August 10, 2015

Techniques
Data Cleaning
Importance
Data cleaning is one of the three biggest
problems in data warehousingRalph Kimball
Data cleaning is the number one problem in
data warehousingDCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
August 10, 2015

Techniques
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the

time of entry
not register history or changes of the data
Missing data may need to be inferred.
August 10, 2015

Techniques
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing

(assuming the tasks in classificationnot effective when the
percentage of missing values per attribute varies
considerably.
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant : e.g., unknown, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same

class: smarter
August 10, 2015

Techniques
Noisy Data
Noise: random error or variance in a measured

variable
Incorrect attribute values may due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which requires data cleaning
duplicate records
incomplete data
inconsistent data
August 10, 2015

Techniques
10
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency)
bins
then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
(e.g., deal with possible outliers)
August 10, 2015

Techniques
11
Cluster Analysis
August 10, 2015

Techniques
12
Data Integration
Data integration:
Combines data from multiple sources into a
coherent store
Schema integration: e.g., A.cust-id B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values
from different sources are different
Possible reasons: different representations,
different scales, e.g., metric vs. British units
August 10, 2015

Techniques
13
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified

range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
August 10, 2015

Techniques
14
Example of Decision Tree

Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
A4 ?
A6?
A1?
Class 1
>
August 10, 2015
Class 2
Class 1
Class 2
Reduced attribute set: {A1, A4, A6}

Techniques
15
Summary
Data preparation or preprocessing is a big issue

for both data warehousing and data mining
Descriptive data summarization is need for quality

data preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction
Discretization
A lot a methods have been developed but data

preprocessing still an active area of research
August 10, 2015

Techniques
16
References
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments.

Communications of ACM, 42:73-78, 1999
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons,
2003
T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk.

Mining Database Structure; Or, How to Build a Data Quality Browser . SIGMOD02.
H.V. Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), December 1997
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of
the Technical Committee on Data Engineering. Vol.23, No.4
V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB2001
T. Redman. Data Quality: Management and Technology. Bantam Books, 1992
Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations.

Communications of ACM, 39:86-95, 1996
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE
Trans. Knowledge and Data Engineering,
7:623-640,
1995
Data Mining:
Concepts and
August 10, 2015
Techniques
17

Why Data Preprocessing?: Incomplete

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Why Data Preprocessing?: Incomplete

Diunggah oleh

Hak Cipta:

Format Tersedia

Why Data Preprocessing?

Data in the real world is dirty

noisy: containing errors or outliers

inconsistent: containing discrepancies in

August 10, 2015

e.g., Age=42 Birthday=03/07/1997

Why Is Data Dirty?

Incomplete data may come from

Noisy data (incorrect values) may come from

Faulty data collection instruments

Inconsistent data may come from

Not applicable data value when collected

Different data sources

Duplicate records also need data cleaning

August 10, 2015

Data Mining: Concepts and

Why Is Data Preprocessing

No quality data, no quality mining results!

Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or

Data warehouse needs consistent integration of

Data extraction, cleaning, and transformation

August 10, 2015

Data Mining: Concepts and

Multi-Dimensional Measure of Data

A well-accepted multidimensional view:

August 10, 2015

Data Mining: Concepts and

Major Tasks in Data

Normalization and aggregation

Integration of multiple databases, data cubes, or files

Fill in missing values, smooth noisy data, identify or

Obtains reduced representation in volume but produces

Part of data reduction but with particular importance,

August 10, 2015

Data Mining: Concepts and

Forms of Data Preprocessing

August 10, 2015

Data Mining: Concepts and

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration

August 10, 2015

Data Mining: Concepts and

Data is not always available

E.g., many tuples have no recorded value for several

Missing data may be due to

inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data may not be considered important at the

not register history or changes of the data

Missing data may need to be inferred.

August 10, 2015

Data Mining: Concepts and

How to Handle Missing Data?

Ignore the tuple: usually done when class label is missing

Fill in the missing value manually: tedious + infeasible?

Fill in it automatically with

a global constant : e.g., unknown, a new class?!

the attribute mean