
Pattern Recognition Application

DATA MINING
Data mining is the discovery of hidden value in
data repositories.
Data mining is important because:
o It facilitates the process of extracting the required
information from huge amounts of data.
o It helps in making crucial decisions.
This can be achieved by a wide variety of data
mining techniques, which can be categorized
into supervised and unsupervised learning.
Supervised Learning
Supervised learning" algorithms are those used in classification
and prediction.
Data is available in which the value of the outcome of interest is known .

Training data" are the data from which the classification or


prediction algorithm learns," or is trained," about the
relationship between predictor variables and the outcome
variable.

This process results in a model


Classification Model
Predictive Model

Supervised Learning

The model is then run with another sample of data, the
"validation data," in which the outcome is known, to see how
well the model performs.
If many different models are being tried out, a third sample of known
outcomes, the "test data," is used with the final, selected model to
predict how well it will do.
The model can then be used to classify or predict the outcome
of interest in new cases where the outcome is unknown.

Unsupervised Learning
No outcome variable to predict or classify.
No learning from cases where the outcome is known.
Unsupervised learning methods:
Association Rules
Data Reduction Methods
Clustering Techniques

CLASSIFICATION MODEL BASIC BLOCKS

DATA PREPROCESSING → PREDICTIVE MODEL → EVALUATION
The Steps in Data Mining
1. Develop an understanding of the purpose of the data
mining project
It may be a one-shot effort to answer a question or questions,
or an ongoing application.
2. Obtain the dataset to be used in the analysis.
Random sampling from a large database to capture records to be used
in the analysis, or
pulling together data from different databases.
Usually the analysis to be done requires thousands or tens of
thousands of records.

The Steps in Data Mining

3. Explore, clean, and preprocess the data
Verify that the data are in reasonable condition:
How should missing data be handled?
Are the values in a reasonable range, given what you would expect for
each variable?
Are there obvious outliers?
Data are reviewed graphically:
For example, a matrix of scatter plots showing the relationship of each
variable with each other variable (see the sketch below).
Ensure consistency in the definitions of fields, units of measurement,
time periods, etc.
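
As a minimal sketch of such a graphical review in MATLAB (using the
Iris data introduced later in these slides), plotmatrix is one
base-MATLAB option:

load fisheriris                    % meas: 150x4 matrix of measurements
% Matrix of scatter plots: every variable against every other variable
[~, ~, bigAx] = plotmatrix(meas);
title(bigAx, 'Pairwise scatter plots of the four Iris measurements');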

The Steps in Data Mining
4. Reduce the data
If supervised learning is involved, separate the data into training,
validation, and test datasets.
Eliminate unneeded variables,
transform variables,
create new variables.
Make sure you know what each variable means, and whether it is
sensible to include it in the model.
5. Determine the data mining task
Classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used
Regression, neural nets, hierarchical clustering, etc.

The Steps in Data Mining
7. Use algorithms to perform the task.
This is an iterative process: trying multiple variants, and often multiple
variants of the same algorithm (choosing different variables or settings
within the algorithm).
When appropriate, feedback from the algorithm's performance on validation
data is used to refine the settings.
8. Interpret the results of the algorithms.
Choose the best algorithm to deploy.
Use the final choice on the test data to get an idea of how well it will perform.
9. Deploy the model.
Integrate the model into operational systems.
Run it on real records to produce decisions or actions.

Example
Open MATLAB.
Write the following command:
load fisheriris
Iris Data
Fisher's Iris database (Fisher, 1936) is perhaps
the best-known database in the
pattern recognition literature.
The data set contains 3 classes of 50 instances
each, where each class refers to a type of iris
plant. One class is linearly separable from the
other two; the latter are not linearly separable
from each other.
Iris Data
The database contains the following attributes:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Fisher's Iris database is available in MATLAB (load
fisheriris).
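
As a quick sketch of what this command provides (variable names as
created by the Statistics and Machine Learning Toolbox):

load fisheriris
% meas:    150x4 double; columns are sepal length, sepal width,
%          petal length, petal width (all in cm)
% species: 150x1 cell array of class labels
size(meas)        % ans = [150 4]
unique(species)   % {'setosa'; 'versicolor'; 'virginica'}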
Plot the Data
Now let's plot the Iris data and see how the sepal
measurements differ between species.
You can use the two columns containing sepal
measurements:

gscatter(meas(:,1), meas(:,2), species,'rgb','osd');


xlabel('Sepal length');
ylabel('Sepal width');
Now Classify
The MATLAB classify function can perform
classification using different types of
discriminant analysis.
First classify the data using the default linear
discriminant analysis (LDA).
MATLAB 2015
ldaClass = classify(meas(:,1:2), meas(:,1:2), species);  % sample, training, group

MATLAB 2016
lda = fitcdiscr(meas(:,1:2), species);   % fit LDA on the two sepal columns
ldaClass = resubPredict(lda);            % predictions on the training data
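
Once fitted, the same model can classify new cases, as discussed
earlier. A small sketch (the sepal values here are hypothetical):

newFlower = [5.9 3.0];     % hypothetical sepal length and width in cm
predict(lda, newFlower)    % predicted species label for the new case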
Assignment
Prepare a presentation to discuss the theory
of operation of the discriminant analysis classifier,
including:
1. Idea of operation
2. Mathematical formulation
3. Advantages and disadvantages
4. Different forms of writing its function in different
versions of MATLAB (2010, 2013, 2016, 2017), with a
detailed description of the function employed in the
version you are using
Computing the error
Now compute the resubstitution error, which
is the misclassification error (the proportion of
misclassified observations) on the training set.
MATLAB 2013
bad = ~strcmp(ldaClass, species);   % rows where prediction differs from truth
N = size(meas, 1);                  % total number of observations
ldaResubErr = sum(bad)/N;
MATLAB 2017
ldaResubErr = resubLoss(lda)
Evaluation
How predictive is the model we learned?
Error on the training data is not a good
indicator of performance on future data
Q: Why?
A: Because new data will probably not be exactly
the same as the training data!
Overfitting (fitting the training data too
precisely) usually leads to poor results on new
data.

Evaluation issues
Possible evaluation measures:
Classification Accuracy
Total cost/benefit when different errors involve
different costs
Lift and ROC curves
Error in numeric predictions
How reliable are the predicted results?

Classifier error rate
Natural performance measure for
classification problems: error rate
Success: instance's class is predicted correctly
Error: instance's class is predicted incorrectly
Error rate: proportion of errors made over the
whole set of instances
Training set error rate is way too optimistic!
You can find patterns even in random data.

Evaluation on LARGE data
If many (thousands of) examples are available,
including several hundred examples from each
class, then a simple evaluation is sufficient:
Randomly split the data into training and test sets
(usually 2/3 for training, 1/3 for testing).
Build a classifier using the training set and
evaluate it using the test set, as in the sketch below.
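
A minimal sketch of this holdout evaluation, applied to the Iris data
with an LDA classifier (the 1/3 holdout follows the rule of thumb above):

load fisheriris
cv = cvpartition(species, 'HoldOut', 1/3);   % stratified 2/3-1/3 split
trainIdx = training(cv);                     % logical index of training rows
testIdx  = test(cv);                         % logical index of test rows

mdl  = fitcdiscr(meas(trainIdx,:), species(trainIdx));   % build on train set
pred = predict(mdl, meas(testIdx,:));                    % classify test set
testErr = mean(~strcmp(pred, species(testIdx)))          % holdout error rate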

Classification Step 1:
Split data into train and test sets

[Diagram: data with known results (+/-) is split into a training set
and a testing set.]
Classification Step 2:
Build a model on a training set

[Diagram: the training set, with known results, feeds a model builder;
the testing set is held aside.]
Classification Step 3:
Evaluate on test set (re-train?)

[Diagram: the model built on the training set produces predictions for
the testing set, which are evaluated against the known results.]
Handling unbalanced data
Sometimes, classes have very unequal frequency:
Attrition prediction: 97% stay, 3% attrite (in a month)
Medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don't buy, 1% buy
Security: >99.99% of Americans are not terrorists
A similar situation arises with multiple classes.
A majority-class classifier can be 97% correct, but
useless.
Balancing unbalanced data
With two classes, a good approach is to build
BALANCED train and test sets, and train the model
on a balanced set (see the sketch below):
Randomly select the desired number of minority-class
instances.
Add an equal number of randomly selected majority-class
instances.
Generalize balancing to multiple classes:
Ensure that each class is represented with
approximately equal proportions in train and test.
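
A minimal sketch of the two-class case, assuming predictors X and a
cell array of labels y whose (hypothetical) values are 'minority' and
'majority':

minIdx = find(strcmp(y, 'minority'));          % all minority-class rows
majIdx = find(strcmp(y, 'majority'));          % all majority-class rows
n = numel(minIdx);                             % desired number per class
majIdx = majIdx(randperm(numel(majIdx), n));   % equal-sized random majority sample

balIdx = [minIdx; majIdx];                     % balanced set of row indices
Xbal = X(balIdx, :);                           % balanced predictors
ybal = y(balIdx);                              % balanced labels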
A note on parameter tuning
It is important that the test data is not used in any way to
create the classifier.
Some learning schemes operate in two stages:
Stage 1: build the basic structure
Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
The proper procedure uses three sets: training data, validation
data, and test data.
Validation data is used to optimize parameters, as sketched below.
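
A minimal sketch of validation-based tuning on the Iris data, treating
the discriminant type as the parameter to tune (fitcdiscr's
'DiscrimType' option accepts 'linear' and 'quadratic'):

load fisheriris
cv  = cvpartition(species, 'HoldOut', 0.3);    % train/validation split
Xtr = meas(training(cv),:);  ytr = species(training(cv));
Xva = meas(test(cv),:);      yva = species(test(cv));

types = {'linear', 'quadratic'};
err = zeros(1, numel(types));
for k = 1:numel(types)
    mdl = fitcdiscr(Xtr, ytr, 'DiscrimType', types{k});
    err(k) = mean(~strcmp(predict(mdl, Xva), yva));  % validation error
end
[~, best] = min(err);
types{best}    % parameter value chosen on the validation data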

Making the most of the data
Once evaluation is complete, all the data can
be used to build the final classifier.
Generally, the larger the training data, the
better the classifier (but returns diminish).
The larger the test data, the more accurate the
error estimate.

Classification:
Train, Validation, Test split

[Diagram: data with known results is split three ways. A model is built
on the training set; its predictions on the validation set are
evaluated and fed back to refine the model; the final model receives a
final evaluation on the final test set.]
Evaluating Classification &
Predictive Performance

Why Evaluate?
Multiple methods are available to classify or predict.
For each method, multiple choices are available for settings.
To choose the best model, we need to assess each model's performance.
Accuracy Measures
(Classification)

Misclassification error
Error = classifying a record as belonging to one class when it
belongs to another class.
Error rate = percent of misclassified records out of the total
records in the validation data.
Separation of Records
High separation of records means that using the
predictor variables attains a low error rate.
Low separation of records means that using the
predictor variables does not improve much on the
naive rule.
Confusion Matrix

Classification Confusion Matrix

                 Predicted Class
Actual Class     1        0
1                201      85
0                25       2689

201 1's correctly classified as 1
85 1's incorrectly classified as 0
25 0's incorrectly classified as 1
2689 0's correctly classified as 0
Confusion matrix glossary
In a 2-class problem where the class is either C
or not C, the confusion matrix looks like this:

                 Classifier Output
True Class       C        not C
C                TP       FN
not C            FP       TN

TP is the number of true positives: it's a C, and the classifier output is C.
FN is the number of false negatives: it's a C, and the classifier output is not C.
TN is the number of true negatives: it's not C, and the classifier output is not C.
FP is the number of false positives: it's not C, and the classifier output is C.
Performance measurements
Accuracy: how close a result comes to the true value.
Acc. = (number of correctly classified patterns)/(total number of patterns)
     = (TP+TN)/(TP+TN+FP+FN)
Error Rate: (sum of misclassified records)/(total records)
     = (FP+FN)/(TP+TN+FP+FN)
Sensitivity (True Positive Rate, TPR): the proportion of positives
that are correctly identified.
TPR = TP/P = TP/(TP+FN)
Specificity (True Negative Rate, TNR): the proportion of negatives
that are correctly identified.
TNR = TN/N = TN/(TN+FP)
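
A minimal sketch computing these measures from the four counts (the
numbers are the ones used in the worked example that follows):

TP = 201; FN = 85; FP = 25; TN = 2689;

acc = (TP+TN)/(TP+TN+FP+FN)   % accuracy    = 0.9633
err = (FP+FN)/(TP+TN+FP+FN)   % error rate  = 0.0367
tpr = TP/(TP+FN)              % sensitivity = 0.7028
tnr = TN/(TN+FP)              % specificity = 0.9908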

Example

Using the specified confusion matrix, calculate:
1- Accuracy
2- Error Rate
3- Sensitivity
4- Specificity

Classification Confusion Matrix

                 Predicted Class
Actual Class     1        0
1                201      85
0                25       2689

Solution:
TP=201, FP=25, TN=2689, FN=85
Overall error rate = (25+85)/3000 = 3.67%
Accuracy = 1 - err = (201+2689)/3000 = 96.33%
Sensitivity = 201/(201+85) = 70.28%
Specificity = 2689/(2689+25) = 99.08%
Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute the probability of belonging to class 1.
2. Compare it to the cutoff value, and classify accordingly.

The default cutoff value is 0.50:
If >= 0.50, classify as 1.
If < 0.50, classify as 0.
Different cutoff values can be used.
Typically, the error rate is lowest for cutoff = 0.50.

Cutoff Table
Actual Class Prob. of "1" Actual Class Prob. of "1"
1 0.996 1 0.506
1 0.988 0 0.471
1 0.984 0 0.337
1 0.980 1 0.218
1 0.948 0 0.199
1 0.889 0 0.149
1 0.848 0 0.048
0 0.762 0 0.038
1 0.707 0 0.025
1 0.681 0 0.022
1 0.656 0 0.016
0 0.622 0 0.004

If the cutoff is 0.50: 13 records are classified as 1.
If the cutoff is 0.80: seven records are classified as 1.
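
A small sketch verifying these counts, with the 24 probabilities taken
from the table above:

p = [0.996 0.988 0.984 0.980 0.948 0.889 0.848 0.762 0.707 0.681 ...
     0.656 0.622 0.506 0.471 0.337 0.218 0.199 0.149 0.048 0.038 ...
     0.025 0.022 0.016 0.004];

sum(p >= 0.50)   % 13 records classified as 1
sum(p >= 0.80)   %  7 records classified as 1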
Confusion Matrix for
Different Cutoffs

Cutoff probability for success: 0.25

Classification Confusion Matrix

                 Predicted Class
Actual Class     owner    non-owner
owner            11       1
non-owner        4        8

Cutoff probability for success: 0.75

Classification Confusion Matrix

                 Predicted Class
Actual Class     owner    non-owner
owner            7        5
non-owner        1        11
Confusion Matrix on Iris data example
[ldaResubCM, grpOrder] = confusionmat(species, ldaClass)

ldaResubCM =
    49     1     0
     0    36    14
     0    15    35
grpOrder =
  3x1 cell array
    {'setosa'    }
    {'versicolor'}
    {'virginica' }
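
As a follow-up sketch, the resubstitution error can be read directly off
this matrix, since correct classifications lie on the diagonal:

N = sum(ldaResubCM(:));        % 150 observations in total
(N - trace(ldaResubCM)) / N    % (1+14+15)/150 = 0.20 resubstitution error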
