Exploratory Input

Fall 2003 Data Mining 1
Exploratory Data Mining and Data Preparation
The Data Mining Process
[Figure: process cycle with the stages Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment, arranged around the data]
Exploratory Data Mining
Preliminary process
Data summaries
Attribute means
Attribute variation
Attribute relationships
Visualization

Select an attribute
Summary Statistics
Possible Problems:
Many missing values (16%)
No examples of one value
Visualization
Appears to be a
good predictor of
the class
Exploratory DM Process
For each attribute:
Look at data summaries
Identify potential problems and decide if an
action needs to be taken (may require
collecting more data)
Visualize the distribution
Identify potential problems (e.g., one dominant
attribute value, even distribution, etc.)
Evaluate usefulness of attributes
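The per-attribute checks above can be sketched in a few lines of Python. This is a minimal illustration, not what Weka does internally; the function name `summarize`, the use of `None` for a missing value, and the toy `humidity` column are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev

def summarize(values):
    """Summarize one attribute; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric attribute: report location and spread.
        summary["mean"] = mean(present)
        summary["std"] = pstdev(present)
    else:
        # Nominal attribute: report how often each value occurs.
        summary["counts"] = Counter(present)
    return summary

# Hypothetical toy column with one missing value out of five.
humidity = [85, 90, None, 96, 80]
info = summarize(humidity)
```

A high `missing_pct` or a value that never occurs in `counts` corresponds to the kinds of problems listed on the earlier slide.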
Weka Filters
Weka has many filters that are helpful in
preprocessing the data
Attribute filters
Add, remove, or transform attributes
Instance filters
Add, remove, or transform instances
Process
Choose from drop-down menu
Edit parameters (if any)
Apply
Data Preprocessing
Data cleaning
Missing values, noisy or inconsistent data
Data integration/transformation
Data reduction
Dimensionality reduction, data
compression, numerosity reduction
Discretization
Data Cleaning
Missing values
Weka reports % of missing values
Can use filter called ReplaceMissingValues
Noisy data
Due to uncertainty or errors
Weka reports unique values
Useful filters include
RemoveMisclassified
MergeTwoValues
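For the missing-value case above, one common strategy is to fill in the mean of a numeric attribute or the most frequent value of a nominal attribute. The sketch below illustrates that idea in plain Python; it is not the Weka filter itself, and `replace_missing` and the use of `None` for a missing value are assumptions of the example.

```python
from collections import Counter
from statistics import mean

def replace_missing(values):
    """Fill None with the mean (numeric) or the mode (nominal) of the observed values."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

numeric = replace_missing([1.0, None, 3.0])
nominal = replace_missing(["sunny", "rainy", None, "sunny"])
```

Filling in values this way keeps every instance but hides the fact that data was missing, so it is worth checking the percentage of missing values first.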
Data Transformation
Why transform data?
Combine attributes. For example, the ratio of two
attributes might be more useful than keeping them
separate
Normalizing data. Having attributes on the same
approximate scale helps many data mining
algorithms (hence better models)
Simplifying data. For example, working with
discrete data is often more intuitive and helps the
algorithms (hence better models)
Weka Filters
The data transformation filters in Weka
include:
Add
AddExpression
MakeIndicator
NumericTransform
Normalize
Standardize
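The two scaling transformations named above can be sketched as follows. This is an illustration of the general idea, assuming normalization means rescaling to [0, 1] and standardization means zero mean and unit variance; using the population standard deviation is a choice made for the example, not a statement about the Weka filters.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = normalize([64, 70, 85])
z = standardize([2.0, 4.0, 6.0])
```

Both functions assume the attribute is not constant; a constant attribute would cause a division by zero and carries no information anyway.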
Discretization
Discretization reduces the number of
values for a continuous attribute
Why?
Some methods can only use nominal data
E.g., in Weka ID3 and Apriori algorithms
Helpful if data needs to be sorted
frequently (e.g., when constructing a
decision tree)
Unsupervised Discretization
Unsupervised - does not account for classes
Equal-interval binning
Equal-frequency binning
[Figure: both methods shown on the same sorted attribute values and class labels; the bin boundaries are not recoverable]
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
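The two unsupervised methods can be sketched in Python as below. The list `temperature` assumes that 72 and 75 each occur twice, which is consistent with the 14 class labels on the slide; the function names are made up for the example.

```python
def equal_interval_bins(values, k):
    """Assign each value to one of k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values, in sorted order, to k bins of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

temperature = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
by_width = equal_interval_bins(temperature, 2)
by_count = equal_frequency_bins(temperature, 2)
```

With two bins, equal-interval binning puts 8 values in the first bin and 6 in the second, while equal-frequency binning puts 7 in each. Note that this naive equal-frequency version can place identical values in different bins (here the two 72s); a practical implementation would likely move the boundary to avoid that.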
Supervised Discretization
Take classification into account
Use entropy to measure information gain
Goal: Discretize into 'pure' intervals
Usually no way to get completely pure intervals:
[Figure: the sorted values 64 to 85 with their class labels and candidate cut points A-F. One cut leaves 9 yes & 4 no on one side and 1 no on the other; another leaves 1 yes on one side and 8 yes & 5 no on the other.]
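A single step of entropy-based discretization, choosing the binary cut with the highest information gain, might look like the sketch below. The data lists assume the 14 values and labels from the slide (72 and 75 occurring twice), and `best_cut` is a name invented for the example. A full method would apply this step recursively and needs a stopping rule, which is not shown.

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_cut(values, labels):
    """Return (cut_point, information_gain) for the best binary split."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between two equal values
        left, right = ys[:i], ys[i:]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - remainder
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

temperature = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["yes", "no", "yes", "yes", "yes", "no", "no",
        "yes", "yes", "yes", "no", "yes", "yes", "no"]
cut, gain = best_cut(temperature, play)
```

On this data the sketch selects the cut between 83 and 85, which gives the "9 yes & 4 no" and "1 no" intervals; neither side of the other candidate cuts is as pure.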
Error-Based Discretization
Count number of misclassifications
Majority class determines prediction
Count instances that are different
Must restrict number of intervals (otherwise each instance could get its own interval).
Complexity
Brute-force: exponential time
Dynamic programming: linear time
Downside: cannot generate adjacent intervals
with same label
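The error count that this method minimizes can be sketched as follows: each interval predicts its majority class, and every instance with a different label counts as one error. The function name and the way intervals are passed in (one list of labels per interval) are assumptions of the example; the search over interval boundaries is not shown.

```python
from collections import Counter

def interval_errors(labels_per_interval):
    """Total misclassifications when each interval predicts its majority class."""
    errors = 0
    for labels in labels_per_interval:
        majority_count = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majority_count
    return errors

# One interval holding everything vs. the split "9 yes & 4 no | 1 no".
unsplit = interval_errors([["yes"] * 9 + ["no"] * 5])
split = interval_errors([["yes"] * 9 + ["no"] * 4, ["no"]])
```

Here the split reduces the error count from 5 to 4. Splitting an interval into two parts with the same majority class never lowers the count, which is why this criterion cannot produce adjacent intervals with the same label.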
Weka Filter
Attribute Selection
Before inducing a model we almost
always do input engineering
The most useful part of this is attribute
selection (also called feature selection)
Select relevant attributes
Remove redundant and/or irrelevant
attributes
Why?
Reasons for Attribute
Selection
Simpler model
More transparent
Easier to interpret
Faster model induction
What about overall time?
Structural knowledge
Knowing which attributes are important may be
inherently important to the application
What about the accuracy?
Attribute Selection Methods
Evaluation method     What is evaluated?
                      Attributes     Subsets of attributes
Independent           Filters        Filters
Learning algorithm    -              Wrappers
Filters
Results in either
Ranked list of attributes
Typical when each attribute is evaluated
individually
Must select how many to keep
A selected subset of attributes
Forward selection
Best first
Random search such as genetic algorithm
Filter Evaluation Examples
Information Gain
Gain ratio
Relief

Correlation
High correlation with class attribute
Low correlation with other attributes
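As an illustration of a filter that evaluates attributes one at a time, the sketch below ranks attributes by the absolute Pearson correlation with a numeric class. It covers only the "high correlation with class" criterion; accounting for correlation between attributes requires evaluating subsets and is not shown. All names and the toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_class_correlation(attributes, target):
    """Rank attribute names by |correlation| with the class, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in attributes.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three numeric attributes and a 0/1 class.
attributes = {"a": [1, 2, 3, 4], "b": [1, 3, 2, 4], "c": [1, 1, 2, 1]}
target = [0, 0, 1, 1]
ranking = rank_by_class_correlation(attributes, target)
```

The result is a ranked list, so a cutoff for how many attributes to keep still has to be chosen.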

Wrappers
Wrap around the learning algorithm
Must therefore always evaluate subsets
Return the best subset of attributes
Apply for each learning algorithm
Use same search methods as before
[Flowchart: Select a subset of attributes -> Induce learning algorithm on this subset -> Evaluate the resulting model (e.g., accuracy) -> Stop? Yes: return the subset. No: select another subset.]
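The wrapper loop can be sketched with forward selection as the search method. The `evaluate` argument stands for "induce a model on this subset and measure its quality"; `toy_accuracy` is a made-up stand-in with invented numbers so the example runs without a learning algorithm.

```python
def forward_selection(attributes, evaluate):
    """Greedy wrapper search; evaluate(subset) returns a score such as accuracy."""
    selected, best_score = [], evaluate([])
    while True:
        candidates = [a for a in attributes if a not in selected]
        scored = [(evaluate(selected + [a]), a) for a in candidates]
        if not scored:
            break
        score, attr = max(scored)
        if score <= best_score:   # Stop? Yes: no candidate improves the model
            break
        selected.append(attr)     # Stop? No: keep the attribute and continue
        best_score = score
    return selected, best_score

# Hypothetical stand-in for inducing a model and estimating its accuracy.
def toy_accuracy(subset):
    useful = {"outlook": 0.20, "humidity": 0.10}
    return 0.5 + sum(useful.get(a, -0.01) for a in subset)

subset, score = forward_selection(
    ["outlook", "temperature", "humidity", "windy"], toy_accuracy)
```

Every call to `evaluate` trains a model, which is why wrappers are much more expensive than filters and have to be rerun for each learning algorithm.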
How does it help?
Naive Bayes

Instance-based learning

Decision tree induction
Scalability
Data mining uses mostly well developed
techniques (AI, statistics, optimization)
Key difference: very large databases
How to deal with scalability problems?
Scalability: the capability of handling
increased load in a way that does not
affect the performance adversely
Massive Datasets
Very large data sets (millions+ of
instances, hundreds+ of attributes)
Scalability in space and time
Data set cannot be kept in memory
E.g., processing one instance at a time
Learning time very long
How does the time depend on the input?
Number of attributes, number of instances
Two Approaches
Increased computational power
Only works if algorithms can be sped up
Must have the computing availability
Adapt algorithms
Automatically scale-down the problem so
that it is always approximately the same
difficulty
Computational Complexity
We want to design algorithms with good
computational complexity
[Figure: time plotted against number of instances (or number of attributes) for exponential, polynomial, linear, and logarithmic complexity]
Example: Big-Oh Notation
Define
n =number of instances
m =number of attributes
Going once through all the instances has
complexity O(n)
Examples
Polynomial complexity: O(mn^2)
Linear complexity: O(m+n)
Exponential complexity: O(2^n)
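To make the polynomial example concrete, the sketch below counts the basic steps of a hypothetical procedure that compares every instance with every instance on every attribute. The function is invented for illustration; it shows only how the step count grows with n and m.

```python
def pairwise_comparisons(n, m):
    """Count attribute comparisons when every instance is compared with every instance."""
    steps = 0
    for _ in range(n):          # each instance ...
        for _ in range(n):      # ... against each instance ...
            for _ in range(m):  # ... on each attribute
                steps += 1
    return steps

small = pairwise_comparisons(10, 3)
doubled = pairwise_comparisons(20, 3)
```

Doubling the number of instances quadruples the work, as expected for O(mn^2), whereas doubling the number of attributes only doubles it.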
Classification
No polynomial time algorithm is known for
NP-complete problems
Finding the optimal decision tree is an
example of an NP-complete problem
However, ID3 and C4.5 are polynomial time
algorithms
Heuristic algorithms to construct solutions to a
difficult problem
Efficient from a computational complexity
standpoint but still have a scalability problem
Decision Tree Algorithms
Traditional decision tree algorithms assume
training set kept in memory
Swapping in and out of main and cache
memory expensive
Solution:
Partition data into subsets
Build a classifier on each subset
Combine classifiers
Not as accurate as a single classifier
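The partition, build, and combine steps can be sketched as below. The round-robin partitioning, majority voting, and the trivial base learner `train_majority` are all assumptions chosen to keep the example short; in practice the base learner would be a decision tree inducer and each subset would be sized to fit in memory.

```python
from collections import Counter

def partition(instances, k):
    """Split the data into k disjoint subsets (round-robin)."""
    return [instances[i::k] for i in range(k)]

def train_ensemble(instances, k, train):
    """Build one classifier per subset; train(subset) returns a predict function."""
    return [train(part) for part in partition(instances, k)]

def vote(classifiers, x):
    """Combine the classifiers by majority vote."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

# Hypothetical base learner: always predicts the majority class of its subset.
def train_majority(subset):
    majority = Counter(label for _, label in subset).most_common(1)[0][0]
    return lambda x: majority

data = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
        (71, "no"), (72, "no"), (72, "yes"), (75, "yes")]
ensemble = train_ensemble(data, 3, train_majority)
prediction = vote(ensemble, 80)
```

Each classifier sees only a fraction of the data, which is one way to see why the combined result may be less accurate than a single classifier trained on everything.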
Other Classification Examples
Instance-Based Learning
Goes through instances one at a time
Compares with new instance
Polynomial complexity O(mn)
Response time may be slow, however
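A minimal nearest-neighbor sketch shows where the O(mn) cost comes from: each query scans all n stored instances and compares m attributes per instance. The squared Euclidean distance and the toy data are assumptions of the example.

```python
def nearest_neighbor(train, query):
    """Scan all n instances, comparing m attributes each: O(mn) per query."""
    best_label, best_dist = None, float("inf")
    for attributes, label in train:
        dist = sum((a - q) ** 2 for a, q in zip(attributes, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical training instances: (attribute values, class label).
train = [((64, 65), "yes"), ((85, 85), "no"), ((70, 96), "yes")]
label = nearest_neighbor(train, (83, 86))
```

There is no training cost at all, but the full scan is repeated for every new instance, which explains the slow response time.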
Naive Bayes
Polynomial complexity
Stores a very large model
Data Reduction
Another way is to reduce the size of the
data before applying a learning
algorithm (preprocessing)
Some strategies
Dimensionality reduction
Data compression
Numerosity reduction
Dimensionality Reduction
Remove irrelevant, weakly relevant, and
redundant attributes
Attribute selection
Many methods available
E.g., forward selection, backwards elimination,
genetic algorithm search
Often much smaller problem
Often little degradation in predictive
performance or even better performance
Data Compression
Also aim for dimensionality reduction
Transform the data into a smaller space
Principal Component Analysis
Normalize data
Compute c orthonormal vectors, or principal
components, that provide a basis for normalized
data
Sort according to decreasing significance
Eliminate the weaker components
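The idea can be illustrated by computing only the first principal component with power iteration on the covariance matrix. This is a simplified sketch in plain Python: it mean-centers the data but does not rescale it, it finds one component rather than c of them, and it assumes the starting vector is not orthogonal to the answer. A real analysis would normally use a linear algebra library.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Power iteration on the covariance matrix of mean-centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(m)]
           for i in range(m)]
    v = [1.0] * m
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1 = first_principal_component(rows)
# Project each instance onto the component: two attributes become one.
projected = [sum(a * b for a, b in zip(r, pc1)) for r in rows]
```

Because the toy data has only one direction of variation, the single component preserves all of it, and the second component could be eliminated without loss.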
PCA: Example
Numerosity Reduction
Replace data with an alternative,
smaller data representation
Histogram
Data:
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30
[Figure: histogram of the data with equal-width buckets 1-10, 11-20, 21-30]
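The bucket counts behind the histogram can be computed directly from the listed values. The sketch below assumes equal-width buckets that start at 1, matching the three ranges on the slide; the function name is made up for the example.

```python
def bucket_counts(values, width):
    """Count values in equal-width buckets 1..width, width+1..2*width, and so on."""
    counts = {}
    for v in values:
        lo = (v - 1) // width * width + 1
        key = f"{lo}-{lo + width - 1}"
        counts[key] = counts.get(key, 0) + 1
    return counts

data = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
        15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
        20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]
histogram = bucket_counts(data, 10)
```

The 52 values are replaced by three counts (13, 25, and 14), which is the smaller representation that numerosity reduction aims for.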
Other Numerosity Reduction
Clustering
Data objects (instance) that are in the
same cluster can be treated as the same
instance
Must use a scalable clustering algorithm

Sampling
Randomly select a subset of the instances
to be used
Sampling Techniques
Different samples
Sample without replacement
Sample with replacement
Cluster sample
Stratified sample
Complexity of sampling is actually sublinear,
that is, the complexity is O(s) where s is the
sample size and s << n
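Three of the sampling schemes listed above can be sketched with Python's standard `random` module. The function names, the `label_of` callback, and the rule of taking at least one instance per class are assumptions of the example. A fixed seed makes each sample reproducible.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, s, seed=1):
    """Draw s distinct instances."""
    return random.Random(seed).sample(data, s)

def sample_with_replacement(data, s, seed=1):
    """Draw s instances; the same instance may be drawn more than once."""
    return random.Random(seed).choices(data, k=s)

def stratified_sample(data, label_of, fraction, seed=1):
    """Sample the same fraction from each class (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[label_of(item)].append(item)
    sample = []
    for items in strata.values():
        sample.extend(rng.sample(items, max(1, round(fraction * len(items)))))
    return sample

# Hypothetical imbalanced data set: 80 instances of class "a", 20 of class "b".
data = [(i, "a") for i in range(80)] + [(i, "b") for i in range(20)]
strat = stratified_sample(data, lambda item: item[1], 0.1)
```

The stratified sample keeps the 80/20 class proportions, which a plain random sample of the same size does not guarantee.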
Weka Filters
PrincipalComponents is under the
Attribute Selection tab
Already talked about filters to discretize
the data
The Resample filter randomly samples
a given percentage of the data
If you specify the same seed, you'll get the
same sample again
