Ch1 Introduction

1
Chapter 1
INTRODUCTION
2
What is Pattern Recognition?
Pattern Recognition by Humans
perceptual
specialized decision making
Pattern Recognition by Computers
benefit of automated pattern recognition
advantage in complex calculations
Pattern Recognition from Data (Data Mining)

3
Pattern Recognition from Data
Pattern recognition from data is the process of
learning from historical data by finding dependencies
in the data and extracting knowledge from them.
4
What is Data?

No.  Studies   Education  Works  Income (D)
1    Poor      SPM        Poor   None
2    Poor      SPM        Good   Low
3    Moderate  SPM        Poor   Low
4    Moderate  Diploma    Poor   Low
5    Poor      SPM        Poor   None
7    Good      MSc        Good   Medium
:
99   Poor      SPM        Good   Low

5
What is Knowledge?
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
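A minimal sketch of how such rules could be checked against the data. It uses only the rows visible in the table (the elided rows are not reproduced), and the helper names `matches`, `violations` and `predict` are illustrative assumptions, not part of the original slides.

```python
# Visible rows of the example table; elided rows are deliberately left out.
ROWS = [
    {"id": 1, "studies": "Poor", "education": "SPM", "works": "Poor", "income": "None"},
    {"id": 2, "studies": "Poor", "education": "SPM", "works": "Good", "income": "Low"},
    {"id": 3, "studies": "Moderate", "education": "SPM", "works": "Poor", "income": "Low"},
    {"id": 4, "studies": "Moderate", "education": "Diploma", "works": "Poor", "income": "Low"},
    {"id": 5, "studies": "Poor", "education": "SPM", "works": "Poor", "income": "None"},
    {"id": 7, "studies": "Good", "education": "MSc", "works": "Good", "income": "Medium"},
    {"id": 99, "studies": "Poor", "education": "SPM", "works": "Good", "income": "Low"},
]

# Each rule: (conditions that must all hold, set of allowed income values).
RULES = [
    ({"studies": "Poor", "works": "Poor"}, {"None"}),
    ({"studies": "Poor", "works": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "works": "Good"}, {"Low"}),
]


def matches(row, conditions):
    """True if the row satisfies every attribute=value condition."""
    return all(row[attr] == value for attr, value in conditions.items())


def violations(rows, rules):
    """(row id, rule index) pairs where a rule fires but the income disagrees."""
    return [
        (row["id"], index)
        for row in rows
        for index, (conditions, outcomes) in enumerate(rules)
        if matches(row, conditions) and row["income"] not in outcomes
    ]


def predict(row, rules):
    """Allowed incomes of the first rule that fires, or None if no rule fires."""
    for conditions, outcomes in rules:
        if matches(row, conditions):
            return outcomes
    return None


print(violations(ROWS, RULES))
```

An empty list means every rule is consistent with the visible rows; `predict` applies the same rules to a record whose income is unknown.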
6
Why is Data Mining prevalent?
1. Lots of data is collected and stored in data
warehouses
Business
Wal-Mart logs nearly 20 million transactions per
day
Astronomy
Telescopes collect large amounts of data
Space
NASA is collecting petabytes of data from satellites
Physics
High-energy physics experiments are expected to
generate 100 to 1000 terabytes in the next decade
7
2. Quality and richness of data collected is
improving
Retailers
Scanner data is much more accurate than other
means
E-commerce
Rich data on customer browsing
Science
Accuracy of sensors is improving

8
3. The gap between data and analysts is increasing
Existence of hidden information
High cost of human labor
Much of data is never analyzed at all
9
Origins of Data Mining
Draws ideas from Machine Learning,
Pattern Recognition, Statistics, and
Database Systems for applications that have
Enormous volumes of data
High dimensionality of data
Heterogeneous data
Unstructured data
10
Data Mining: confluence of multiple
disciplines
[Figure: DATA MINING at the centre, surrounded by: database technology,
statistics, machine learning, information science, neural networks,
pattern recognition, visualization, information retrieval,
high-performance computing, spatial data analysis]
11
Data Mining: What it isn't
Small Scale
Data mining methods are designed for large data sets
Foolproof
Data mining techniques will discover patterns in any
data
The patterns discovered may be meaningless
It is up to the user to determine how to interpret the
results
Make it foolproof and they'll just invent a better fool
Magic
Data mining techniques cannot generate information
that is not present in the data
They can only find the patterns that are already there
12
Example: Data Mining is not ...
Generating multidimensional cubes of a relational
table
Searching for a phone number in a phone book
Searching for keywords on Google (IR)
Generating a histogram of salaries for different
age groups
Issuing SQL query to a database, and reading the
reply
13
Data Mining: What it is
Extracting knowledge from large amounts of data
Uses techniques from:
Pattern Recognition
Machine Learning
Statistics
Plus techniques unique to data mining
(Association rules)
Data mining methods must be efficient and
scalable
14
Example: Data mining is
What goods should be promoted to this customer?
What is the probability that a certain customer will respond to a
planned promotion?
Can one predict the most profitable securities to buy/sell during the
next trading session?
Will this customer default on a loan or pay back on schedule?
What medical diagnosis should be assigned to this patient?
What kinds of cars should be sold this year?
Finding groups of people with similar hobbies
Are chances of getting cancer higher if you live near a power line?

15
Data Mining is simply...
Finding relationships

Making predictions
16
Data Mining: Definition
The nontrivial extraction of implicit,
previously unknown, and potentially useful
information from data
(William J. Frawley, Gregory Piatetsky-Shapiro
and Christopher J. Matheus)
17
Data Mining: One Step of KDD
KDD = Knowledge Discovery in Databases

[Figure: KDD process. Databases and flat files -> Cleaning and
Integration -> Data Warehouse -> Selection and Transformation ->
Data Mining -> Patterns -> Evaluation & Presentation -> Knowledge]
18
Cont'd
Data cleaning
To remove noise and inconsistent data
Data integration
Multiple data sources may be combined
Data selection
Data relevant to the analysis task are retrieved from the
database
Data transformation
Data are transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations
19
Cont'd
Data mining
An essential process where intelligent methods are
applied in order to extract data patterns
Pattern evaluation
To identify the truly interesting patterns representing
knowledge based on some interestingness measures
Knowledge presentation
Visualization and knowledge representation techniques
are used to present the mined knowledge to the users
20
Early Steps of Data Mining
Data preprocessing
handling incomplete data, noisy data, uncertain
data
Data discretization/representation
transforms data into suitable values for the
mining algorithm to find patterns
Data selection
selects the suitable data for mining purposes
21
Database Systems
Kinds of DB
Relational
Data warehouse
Transactional DB
Advanced DB system
Flat files
WWW
Kinds of Knowledge
Classification
Association
Clustering
Prediction

22
Data Mining: Types of Data
Mining can be performed on data in a variety of forms
Relational Database
Traditional DBMS everyone is familiar with
Data is stored in a series of tables (collection of tables)
Data is extracted via queries, typically with SQL
SQL: Show me a list of items that were sold in the last quarter
Show me the total sales of the last month, grouped by branch
How many transactions occurred in the month of December?
Which salesperson had the highest amount of sales?
Relational language: aggregate functions such as sum, avg, count,
max, min
23
Applying data mining goes further
Searching for trends or data patterns
Analyze customer data to predict the credit risk of new customers based on their
income
Detect deviations: items whose sales are far from those expected in
comparison with the previous year (to be investigated further: change in packaging,
increase in price?)
Transaction Database
Similar to relational database (transactions stored in a table)
Each row (record) is a transaction with an id and a list of items in the
transaction
Nested relation
Can be unfolded into a relational database or stored in flat files,
since nested relational structures are not supported by relational db
systems
Which items sold well together?
24
Data Warehouse
Stores historical data, potentially from multiple sources
Organized around major subjects
Contains summary statistics
Object / Object-Relational Databases
Database consisting of objects
Object = set of variables + associated methods
Eg: Intel uses regularity extraction in automatic circuit layout
Images
Can mine features extracted from images, OR
Can use mining techniques to extract features
Content based image retrieval
25
Vector Geometries (spatial db)
Include GIS and CAD data
Raster data: n-dimensional bitmaps / pixel maps
Vector format: point, line, polygon
Can find spatial patterns between features
Describing the characteristics of houses located near a specified
kind of location
Describe the climate of mountainous areas located at various
altitudes
Text
Can be unstructured, semi-structured, or structured
Documentation, newspaper articles, web sites etc.
Can facilitate search by linking related documents / concepts
26
Video / Audio
Speech recognition: recognizing spoken commands
Security applications
Integrated with standard data mining methods (storage and
searching)
Temporal Databases / Time Series
Global change databases (temperature records)
Space shuttle telemetry
Stock market data (stock exchange)
Usually stores relational data that include time-related attributes
Find the trend of changes for objects, to support decision making / strategy
planning
27
Stock exchange data can be mined to uncover trends that could
help in planning investment strategies (when is the best time to
purchase TNB stock?)
Legacy Databases
Group of heterogeneous databases (relational, OO db, network db,
multimedia db etc.)
Connected by intra- or inter-computer networks
Information exchange is very difficult, e.g. student academic
performance among different schools/universities
Data mining: transforming the given data into higher, more
generalized, conceptual levels

28
DBMS, OLAP & Data Mining
Area              DBMS                    OLAP                        Data Mining
Task              Extraction of detailed  Summaries, trends and       Knowledge discovery of
                  and summary data        forecasts                   hidden patterns and insight
Type of result    Information             Analysis                    Insight and prediction
Method            Deduction (ask the      Multidimensional data       Induction (build the
                  question, verify with   modeling, aggregation,      model, apply it to new
                  data)                   statistics                  data, get the result)
Example question  Who purchased mutual    What is the average income  Who will buy a mutual
                  funds in the last 3     of mutual fund buyers by    fund in the next 6
                  years?                  region by year?             months and why?


29
Example: Weather data
Record of the weather conditions during a two-
week period, along with the decisions of a tennis
player whether or not to play tennis on each
particular day
Generated tuples (or examples, instances)
consisting of values of 4 independent variables
Outlook
Temperature
Humidity
Windy
One dependent variable - play
30
Cont'd
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no

31
DBMS
We may answer questions by querying a
DBMS containing the above table
What was the temperature on the sunny days?
On which days was the humidity less than 75?
On which days was the temperature greater than
70?
On which days was the temperature greater than
70 and the humidity less than 75?
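The questions above can be sketched as SQL queries. This is only an illustration, assuming the weather table is loaded into an in-memory SQLite database through Python's standard `sqlite3` module; the table name `weather`, the column names and the helper `days` are assumptions made for this example.

```python
import sqlite3

# (day, outlook, temperature, humidity, windy, play); windy stored as 0/1.
WEATHER = [
    (1, "sunny", 85, 85, 0, "no"), (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"), (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"), (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"), (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"), (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"), (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"), (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", WEATHER)


def days(where_clause):
    """Day numbers matching a WHERE clause (fixed, trusted strings only)."""
    query = f"SELECT day FROM weather WHERE {where_clause} ORDER BY day"
    return [row[0] for row in conn.execute(query)]


sunny_temps = [
    row[0]
    for row in conn.execute(
        "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day"
    )
]
low_humidity = days("humidity < 75")
warm = days("temperature > 70")
warm_and_dry = days("temperature > 70 AND humidity < 75")

print(sunny_temps, low_humidity, warm, warm_and_dry)
```

Each query only retrieves stored facts; no new pattern is discovered, which is the contrast the following slides draw with OLAP and data mining.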
32
OLAP (On-line analytical
processing)
Using OLAP we can create a multidimensional
model (data cube)
E.g. the dimensions time, outlook and play
give the model below (each cell shows the yes/no counts for play)
9/5      sunny  rainy  overcast
Week 1   0/2    2/1    2/0
Week 2   2/1    1/1    2/0

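A minimal sketch of how the cube cells could be computed from the raw tuples. It assumes days 1 to 7 form week 1 and days 8 to 14 form week 2, which is consistent with the counts shown; the function names are illustrative.

```python
from collections import defaultdict

# (day, outlook, play) projected from the weather table.
RECORDS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def week_of(day):
    """Assumed time hierarchy: days 1-7 are week 1, days 8-14 are week 2."""
    return (day - 1) // 7 + 1


def build_cube(records):
    """Map (week, outlook) to [yes_count, no_count]."""
    cube = defaultdict(lambda: [0, 0])
    for day, outlook, play in records:
        cell = cube[(week_of(day), outlook)]
        cell[0 if play == "yes" else 1] += 1
    return dict(cube)


cube = build_cube(RECORDS)
total_yes = sum(cell[0] for cell in cube.values())
total_no = sum(cell[1] for cell in cube.values())

for week in (1, 2):
    cells = "  ".join(
        f"{outlook}={cube[(week, outlook)][0]}/{cube[(week, outlook)][1]}"
        for outlook in ("sunny", "rainy", "overcast")
    )
    print(f"Week {week}: {cells}")
print(f"Total: {total_yes}/{total_no}")
```

Changing `week_of` to return the day itself would give the finer, day-level view of the same cube.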

33
Cont'd
By observing the data cube we can easily
identify some important properties of the data
Find regularities or patterns
E.g. the 3rd column: if the outlook is
overcast, the play attribute is always yes
If outlook = overcast then play = yes
34
Drill-down: time dimension
Concept hierarchy
9/5  sunny  rainy  overcast
1    0/1    0/0    0/0
2    0/1    0/0    0/0
3    0/0    0/0    1/0
4    0/0    1/0    0/0
5    0/0    1/0    0/0
6    0/0    0/1    0/0
7    0/0    0/0    1/0
8    0/1    0/0    0/0
9    1/0    0/0    0/0
10   0/0    1/0    0/0
11   1/0    0/0    0/0
12   0/0    0/0    1/0
13   0/0    0/0    1/0
14   0/0    0/1    0/0


35
Roll-up (reverse of drill-down)
9/5      sunny  rainy  overcast
Week 1   0/2    2/1    2/0
Week 2   2/1    1/1    2/0


36
Data Mining Tasks
Prediction methods
Use some variables to predict unknown or future values
of the same or other variables.
Inference on the current data in order to make
prediction
Description methods
Find human interpretable patterns that describe data
Characterize the general properties of data in db
Descriptive mining is complementary to predictive mining
but it is closer to decision support than decision making

37
Cont'd
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discovery (descriptive)
Regression (predictive)
Deviation Detection (predictive)
38
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between
attributes
Data is typically stored in very large databases,
sometimes in flat files or images
Uses include decision support, classification and
clustering
Application areas include business, medicine and
engineering
39
Given a set of transactions, each
of which is a set of items, find
all rules (X => Y) that satisfy
user-specified minimum support
and confidence constraints
Support = (#T containing X and
Y) / (#T)
Confidence = (#T containing X
and Y) / (#T containing X)
Applications
Cross-selling and up-selling
Supermarket shelf
management

Some rules discovered
Bread => Jem
Sup=60%, conf=75%
Jem => Bread
Sup=60%, conf=100%
Jelly => Jem
Sup=20%, conf=100%
Jelly => Milk
Sup=0%

Transaction  Items
T1           Bread, Jelly, Jem
T2           Bread, Jem
T3           Bread, Milk, Jem
T4           Coffee, Bread
T5           Coffee, Milk
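The support and confidence definitions can be worked through on the five transactions above. The following is a small sketch; the function names `support` and `confidence` are illustrative choices, not a library API.

```python
TRANSACTIONS = {
    "T1": {"Bread", "Jelly", "Jem"},
    "T2": {"Bread", "Jem"},
    "T3": {"Bread", "Milk", "Jem"},
    "T4": {"Coffee", "Bread"},
    "T5": {"Coffee", "Milk"},
}


def support(x, y, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = set(x) | set(y)
    hits = sum(1 for items in transactions.values() if both.issubset(items))
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    x = set(x)
    with_x = [items for items in transactions.values() if x.issubset(items)]
    if not with_x:
        return None  # undefined when X never occurs
    both = x | set(y)
    return sum(1 for items in with_x if both.issubset(items)) / len(with_x)


for x, y in [({"Bread"}, {"Jem"}), ({"Jelly"}, {"Jem"}), ({"Jelly"}, {"Milk"})]:
    print(sorted(x), "=>", sorted(y),
          support(x, y, TRANSACTIONS), confidence(x, y, TRANSACTIONS))
```

For Bread => Jem, three of the five transactions contain both items (support 60%), and three of the four transactions containing Bread also contain Jem (confidence 75%).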
40
Association Rule Mining:
Definition
Given a set of records, each of which
contains some number of items from a given
collection:
Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items
Example:
{Bread} => {Jem}
{Jelly} => {Jem}
41
Marketing and sales promotion
Say the rule discovered is
{Bread, ...} => {Jem}
Jem as a consequent: can be used to determine what
products will boost its sales.
Bread as antecedent: can be used to see which products
will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be
used to see what products should be stocked along with
Bread to promote the sale of Jem.
42
Supermarket shelf management
Goal: To identify items that are bought
concomitantly by a reasonable fraction of
customers so that they can be shelved together.
Data Used: Point-of-sale data collected with
barcode scanners to find dependencies among
products.
Example
If a customer buys jelly, then he is very likely to buy Jem.
So don't be surprised if you find Jem next to Jelly on an
aisle in the supermarket. Also salsa next to tortilla
chips.
43
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
High Support
High Confidence
Rules involving certain attributes of interest
Rules with a specific structure
Rules with support / confidence higher than expected
Completeness: generating all interesting rules
Efficiency: generating only rules that are interesting

44
Clustering
Determine object groupings such that objects within the
same cluster are similar to each other, while objects in
different groups are not
Typically objects are represented by data points in a
multidimensional space with each dimension
corresponding to one or more attributes. Clustering
problem in this case reduces to the following:
Given a set of data points, each having a set of attributes, and a
similarity measure, find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another

45
Cont'd
Similarity measures:
Euclidean distance (continuous attr.)
Other problem specific measures
Types of Clustering
Group-Based Clustering
Hierarchical Clustering

46
Clustering Example
Euclidean distance
based clustering in 3D
space
Intra cluster distances
are minimised
Inter cluster distances
are maximised
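A small sketch of the two quantities named here, using Euclidean distance in 3D. The points and the two-cluster grouping are hypothetical data invented for the illustration, and the nearest-centroid assignment is just one simple way to place a new point.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 3D points, already grouped into two clusters.
CLUSTERS = {
    "A": [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0)],
    "B": [(8.0, 8.0, 8.0), (9.0, 8.0, 8.0), (8.0, 9.0, 8.0)],
}


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    dists = [
        euclidean(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(dists) / len(dists)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    dists = [
        euclidean(p, q)
        for a, b in combinations(list(clusters), 2)
        for p in clusters[a]
        for q in clusters[b]
    ]
    return sum(dists) / len(dists)


def nearest_cluster(point, clusters):
    """Assign a point to the cluster whose centroid is closest."""
    def centroid(points):
        return tuple(sum(coord) / len(points) for coord in zip(*points))

    return min(clusters, key=lambda name: euclidean(point, centroid(clusters[name])))


print(mean_intra_distance(CLUSTERS), mean_inter_distance(CLUSTERS))
print(nearest_cluster((2.0, 2.0, 2.0), CLUSTERS))
```

A good clustering keeps the first number small relative to the second.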
47
Clustering: Market Segmentation
Goal: To subdivide a market into distinct subsets of
customers where each subset can be targeted with
a distinct marketing mix
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information
Find clusters of similar customers
Measure the clustering quality by observing the buying
patterns of customers in the same cluster vs. those from
different clusters.
48
Clustering: Document Clustering
Goal: To find groups of documents that are similar
to each other based on important terms appearing
in them
Approach: To identify frequently occurring terms
in each document. Form a similarity measure
based on frequencies of different terms. Use it to
generate clusters.
Gain: Information Retrieval can utilize the clusters
to relate a new document or search to clustered
documents
49
Clustering:
Document Clustering Example
Clustering points: 3204 articles of LA Times
Similarity measure: Number of common words in
documents (after some word filtering)

Category       Total articles  Correctly placed articles
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
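The similarity measure used in this example (number of common words after filtering) can be sketched as follows. The stop-word list and the three short documents are hypothetical; the slides do not say which word filtering was applied to the LA Times articles.

```python
# Hypothetical stop-word list; the actual filtering is not specified.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "on"}

# Hypothetical documents standing in for news articles.
DOCS = {
    "fin1": "The bank raised interest rates on loans.",
    "fin2": "Interest rates and bank loans worry investors.",
    "sport1": "The team won the final match in overtime.",
}


def terms(text):
    """Lower-cased words of a text with punctuation and stop words removed."""
    words = (word.strip(".,;:!?").lower() for word in text.split())
    return {word for word in words if word and word not in STOP_WORDS}


def common_words(doc_a, doc_b):
    """Similarity: number of distinct words the two documents share."""
    return len(terms(doc_a) & terms(doc_b))


def most_similar(name, docs):
    """Name of the other document sharing the most words with docs[name]."""
    others = [other for other in docs if other != name]
    return max(others, key=lambda other: common_words(docs[name], docs[other]))


print(common_words(DOCS["fin1"], DOCS["fin2"]))
print(most_similar("fin1", DOCS))
```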
50
Classification: Definition
Given a set of records (called the training set)
Each record contains a set of attributes. One of the
attributes is the class
Find a model for the class attribute as a function of
the values of other attributes
Goal: Previously unseen records should be assigned
to a class as accurately as possible
Usually, the given data set is divided into training and
test set, with training set used to build the model and
test set used to validate it. The accuracy of the model is
determined on the test set.
51
Classification: cont'd
Classifiers are created using labeled training samples
Classifiers are evaluated using independent labeled
samples (test set)
Training samples created by ground truth / experts
Classifier later used to classify unknown samples
Measurements must be able to predict the phenomenon!
Examples
Direct marketing
Fraud detection
Customer churn
Sky survey cataloging
Classifying galaxies
52
Classification Example
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Cheat (class)
[Figure: Training Set -> Learn Classifier -> Model; the Model is then
applied to the Test Set]
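To illustrate the train/test procedure on the ten records above, here is a hedged sketch. The split (first seven records for training, last three for testing) and the majority-class "model" are assumptions chosen only to keep the example short; a real classifier such as a decision tree would replace `learn_majority`.

```python
from collections import Counter

# (refund, marital status, taxable income in K, cheat) for Tid 1..10.
RECORDS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]


def split(records, n_train):
    """Assumed split: the first n_train records train, the rest test."""
    return records[:n_train], records[n_train:]


def learn_majority(train):
    """Stand-in model: always predict the most common training class."""
    label = Counter(record[-1] for record in train).most_common(1)[0][0]
    return lambda attributes: label


def accuracy(model, labeled):
    """Fraction of labeled records whose class the model predicts correctly."""
    correct = sum(1 for record in labeled if model(record[:-1]) == record[-1])
    return correct / len(labeled)


train, test = split(RECORDS, 7)
model = learn_majority(train)
print("training accuracy:", accuracy(model, train))
print("test accuracy:", accuracy(model, test))
```

The gap between the two figures shows why accuracy is reported on the independent test set rather than on the training set.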
53
Classification: Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell phone product
Approach:
Use the data collected for a similar product introduced in the recent
past.
Use the profiles of consumers along with their {buy, didn't buy}
decision. The latter becomes the class attribute.
The profile information may consist of demographic,
lifestyle and company interaction data.
Demographic: Age, Gender, Geography, Salary
Psychographic: Hobbies
Company Interaction: Recency, Frequency, Monetary
Use this information as input attributes to learn a classifier model

54
Classification: Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions
Approach:
Use credit card transactions and the information on its
account holders as attributes (important: when and
where the card was used)
Label past transactions as {fraud, fair} transactions.
This forms the class attribute
Learn a model for the class of transactions
Use this model to detect fraud by observing credit card
transactions on an account.
55
Regression
Predict the value of a given continuous valued
variable based on the values of other variables,
assuming a linear or non-linear model of
dependency
Extensively studied in the fields of Statistics and
Neural Networks
Predicting sales of a new product based on
advertising expenditure
Predicting wind velocities based on temperature,
humidity, air pressure, etc
Time series prediction of stock market indices
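As an illustration of the linear case, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The advertising and sales figures are hypothetical numbers made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: advertising expenditure vs. units sold.
AD_SPEND = [1.0, 2.0, 3.0, 4.0]
SALES = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(AD_SPEND, SALES)
predicted = intercept + slope * 5.0  # predicted sales for a spend of 5.0
print(slope, intercept, predicted)
```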
56
Deviation/Anomaly Detection
Some data objects do not comply with the general
behavior or model of the data. Data objects that
are different from or inconsistent with the
remaining set are called outliers
Outliers can be caused by measurement or
execution error, or they may represent some kind of
fraudulent activity
Goal of deviation/anomaly detection is to detect
significant deviations from normal behavior
57
Deviation/Anomaly Detection:
Definition
Given a set of n points or objects, and k, the
expected number of outliers, find the top k
objects that are considerably dissimilar,
exceptional or inconsistent with respect to the
remaining data
This can be viewed as two sub-problems
Define what data can be considered as
inconsistent in a given data set
Find an efficient method to mine the outliers
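A minimal sketch of the two sub-problems for one-dimensional data. It assumes "inconsistent" means "far from the median" and finds the top k by sorting; both choices are illustrative assumptions, and the amounts are hypothetical.

```python
import statistics


def top_k_outliers(values, k):
    """The k values farthest from the median (assumed notion of inconsistency)."""
    center = statistics.median(values)
    return sorted(values, key=lambda v: abs(v - center), reverse=True)[:k]


# Hypothetical transaction amounts with two unusually large values.
AMOUNTS = [20, 22, 19, 21, 500, 23, 18, 250]

print(top_k_outliers(AMOUNTS, 2))
```

The median is used instead of the mean so that the outliers themselves do not shift the reference point much.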
58
Deviation:
Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
Based on past usage patterns, develop model for
authorized credit card transactions
Check for deviation from model, before authenticating
new credit card transactions
Hold payment and verify authenticity of doubtful
transaction by other means (phone call, etc.)

59
Anomaly detection:
Network Intrusion Detection
Goal: to detect intrusion of a computer
network
Approach:
Define and develop a model for normal user
behavior on the computer network
Continuously monitor behavior of users to
check if it deviates from the defined normal
behavior
Raise an alarm, if such deviation is found
60
Sequential pattern discovery:
definition
Given a set of objects, with each object
associated with its own timeline of events, find
rules that predict strong sequential
dependencies among different events
Sequence discovery aims at extracting sets
of events that commonly occur over a
period of time

(A B) (C) => (D E)
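The sketch below shows one way to test whether an object's event sequence contains a pattern such as (A B) (C) (D E), and to measure how often a sequential rule holds. The five event sequences are hypothetical, and the names `contains`, `sequence_support` and `rule_confidence` are illustrative.

```python
def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order (gaps allowed) in the sequence."""
    position = 0
    for itemset in pattern:
        # Greedily advance to the next event that includes the whole itemset.
        while position != len(sequence) and not set(itemset).issubset(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences) / len(sequences)


def rule_confidence(sequences, antecedent, consequent):
    """Among sequences containing the antecedent, the fraction containing the full pattern."""
    with_antecedent = [seq for seq in sequences if contains(seq, antecedent)]
    if not with_antecedent:
        return None
    full = list(antecedent) + list(consequent)
    return sum(contains(seq, full) for seq in with_antecedent) / len(with_antecedent)


# Hypothetical event sequences; each inner set is one time step.
SEQUENCES = [
    [{"A", "B"}, {"C"}, {"D", "E"}],
    [{"A", "B", "F"}, {"G"}, {"C"}, {"D", "E", "H"}],
    [{"C"}, {"A", "B"}, {"D", "E"}],
    [{"A"}, {"B"}, {"C"}, {"D", "E"}],
    [{"A", "B"}, {"C"}, {"F"}],
]
ANTECEDENT = [{"A", "B"}, {"C"}]
CONSEQUENT = [{"D", "E"}]

print(sequence_support(SEQUENCES, ANTECEDENT + CONSEQUENT))
print(rule_confidence(SEQUENCES, ANTECEDENT, CONSEQUENT))
```

Items inside one parenthesis must occur at the same time step, while the parenthesised groups only need to occur in the given order.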
61
Telecommunication Alarm Logs
Telecommunication alarm logs
(Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) => (Fire_Alarm)
62
Point of Sale: Up-Sell / Cross-Sell
Point-of-sale transaction sequences
Computer bookstore
(Intro_to_Visual_C) (C++ Primer) =>
(Perl_For_Dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++
Primer also buy Perl for Dummies and Tcl Tk within
a month
Athletic apparel store
(Shoes) (Racket, Racketball) => (Sport_Jacket)
63
Example: Data Mining (Weather
data)
By applying various data mining techniques, we
can
Find associations and regularities in our data
Extract knowledge in the form of rules, decision trees
etc.
Predict the value of the dependent variable in new
situations
Some examples
Mining association rules
Classification by decision trees and rules
Prediction methods
64
Mining association rules
First, discretize the numeric attributes (a part of
the data preprocessing stage)
Group the temperature values in three intervals
(hot, mild, cool) and humidity values in two (high,
normal)
Substitute the values in data with the
corresponding names
Apply the Apriori algorithm and get the following
rules
65
Discretized weather data
Day  outlook   temperature  humidity  windy  play
1    sunny     hot          high      false  no
2    sunny     hot          high      true   no
3    overcast  hot          high      false  yes
4    rainy     mild         high      false  yes
5    rainy     cool         normal    false  yes
6    rainy     cool         normal    true   no
7    overcast  cool         normal    true   yes
8    sunny     mild         high      false  no
9    sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no
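The discretization step can be sketched as two small mapping functions. The cut points used here (cool below 70, hot from 80 upward, humidity high above 80) are assumptions: the slides do not state them, but these thresholds reproduce the discretized table from the numeric one.

```python
def temperature_label(temperature):
    """Assumed cut points: cool below 70, mild from 70 to 79, hot from 80 upward."""
    if temperature >= 80:
        return "hot"
    if temperature >= 70:
        return "mild"
    return "cool"


def humidity_label(humidity):
    """Assumed cut point: high above 80, normal otherwise."""
    return "high" if humidity > 80 else "normal"


# (temperature, humidity) for days 1..14 of the numeric weather table.
NUMERIC = [
    (85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (65, 70), (64, 65),
    (72, 95), (69, 70), (75, 80), (75, 70), (72, 90), (81, 75), (71, 91),
]

DISCRETIZED = [(temperature_label(t), humidity_label(h)) for t, h in NUMERIC]
for day, labels in enumerate(DISCRETIZED, start=1):
    print(day, *labels)
```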

66
Cont'd

1. humidity=normal windy=false => play=yes (4, 1)
2. temperature=cool => humidity=normal (4, 1)
3. outlook=overcast => play=yes (4, 1)
4. temperature=cool play=yes => humidity=normal (3, 1)
5. outlook=rainy windy=false => play=yes (3, 1)
6. outlook=rainy play=yes => windy=false (3, 1)
7. outlook=sunny humidity=high => play=no (3, 1)
8. outlook=sunny play=no => humidity=high (3, 1)
9. temperature=cool windy=false => humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false => play=yes (2, 1)

67
Cont'd
These rules show some attribute-value sets
(itemsets) that appear frequently in the data
Support (the number of occurrences of the
itemset in the data)
Confidence (accuracy) of the rules
Rule 3 is the same as the one that is
produced by observing the data cube
68
Classification by Decision Trees
and Rules
Using the ID3 algorithm, the following decision tree
is produced
Outlook=sunny
    Humidity=high: no
    Humidity=normal: yes
Outlook=overcast: yes
Outlook=rainy
    Windy=true: no
    Windy=false: yes
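The tree above can be written directly as nested tests. This sketch only encodes the finished tree (it does not implement ID3 itself) and checks it against the fourteen discretized tuples; the function name `classify` is an illustrative choice.

```python
def classify(outlook, humidity, windy):
    """Walk the decision tree from the root (outlook) down to a leaf."""
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "overcast":
        return "yes"
    if outlook == "rainy":
        return "no" if windy else "yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


# (outlook, humidity, windy, play) for days 1..14 of the discretized table.
DATA = [
    ("sunny", "high", False, "no"), ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"), ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"), ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"), ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"), ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"), ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"), ("rainy", "high", True, "no"),
]

errors = [row for row in DATA if classify(*row[:3]) != row[3]]
print("misclassified training tuples:", len(errors))
print("new tuple (sunny, normal, not windy):", classify("sunny", "normal", False))
```

Temperature does not appear in the tree, so it is not needed to classify a tuple.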
69
Cont'd
A decision tree consists of:
Decision nodes that test the values of their
corresponding attribute
Each value of this attribute leads to a subtree and so on,
until the leaves of the tree are reached
The leaves determine the value of the dependent variable
Using a decision tree we can classify new tuples
70
Cont'd
A decision tree can be presented as a set of rules
Each rule represents a path through the tree from the
root to a leaf
Other data mining techniques can produce rules
directly: Prism algorithm
If outlook=overcast then yes
If humidity=normal and windy=false then yes
If temperature=mild and humidity=normal then yes
If outlook=rainy and windy=false then yes
If outlook=sunny and humidity=high then no
If outlook=rainy and windy=true then no
71
Prediction methods
DM offers techniques to predict the value of the
dependent variable directly without first
generating a model
The most popular approaches are based on statistical
methods
They use the Bayes rule to predict the probability of
each value of the dependent variable given the
values of the independent variables
72
Cont'd
E.g. applying Bayes to the new tuple:
(sunny, mild, normal, false, ?)

P(play=yes | outlook=sunny, temperature=mild,
humidity=normal, windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild,
humidity=normal, windy=false) = 0.2

The predicted value is therefore yes
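The probabilities above can be reproduced approximately with a naive Bayes calculation on the discretized weather data. This sketch assumes the attributes are conditionally independent given the class and uses plain relative frequencies without smoothing; the slides name only "the Bayes rule", so these details are assumptions.

```python
from collections import Counter

# (outlook, temperature, humidity, windy, play) for days 1..14, discretized.
DATA = [
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]


def naive_bayes_posterior(data, query):
    """Normalized P(class | query), assuming independent attributes per class."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)  # prior P(class)
        for index, value in enumerate(query):
            hits = sum(1 for row in data if row[-1] == label and row[index] == value)
            score *= hits / n_label  # likelihood P(attribute = value | class)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes_posterior(DATA, ("sunny", "mild", "normal", False))
print({label: round(p, 3) for label, p in posterior.items()})
```

For play=yes the product is 9/14 x 2/9 x 4/9 x 6/9 x 6/9, for play=no it is 5/14 x 3/5 x 2/5 x 1/5 x 2/5; after normalization these give roughly 0.8 and 0.2.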
73
Data Mining: Problems and Challenges
Noisy data
Dynamic databases
Large databases

74
Noisy data
Many attribute values will be inexact or
incorrect
Erroneous instruments measuring some property
Human errors occurring at data entry
Two forms of noise in the data
Corrupted values: some of the values in the training set
are altered from their original form
Missing values: one or more of the attribute values
may be missing, both for examples in the training set
and for objects which are to be classified
75
Difficult Training Set
Non-representative data
Learning is based on only a few examples
With a large db, the rules are more likely to be representative
Absence of boundary cases
Boundary cases are needed to find the real differences between two classes
Limited information
Two objects to be classified have the same conditional
attributes but belong to different classes
There is not enough information to distinguish the two
types of objects
76
Dynamic databases
Dbs change continually
Rules that reflect the content of the db at all
times are preferred
If some changes are made, the whole
learning process may have to be conducted
again
77
Large databases
The size of dbs is ever increasing
Machine learning algorithms typically handle a
small training set (a few hundred examples)
Much care is needed when using similar techniques on
larger dbs
Large dbs provide more knowledge (e.g. the number of
rules may be enormous)
78
Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing
Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data
