Evaluation

[Diagram: the data is split; a model builder is tuned against a validation set to produce the final model, which then receives a final evaluation on held-out data.]
Cross-validation

Cross-validation avoids overlapping test sets.
First step: the data is split into k subsets of equal size.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
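As a concrete illustration, here is a minimal sketch of stratified k-fold cross-validation with scikit-learn; the dataset and classifier are placeholder choices, not part of the original slides:

```python
# Sketch: stratified k-fold cross-validation (placeholder data and model).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# First step: split the data into k subsets of equal size (stratified).
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Second step: each subset in turn is used for testing, the remainder for training.
errors = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

# The error estimates are averaged to yield an overall error estimate.
print(f"overall error estimate: {np.mean(errors):.3f}")
```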
Cross-validation example

Break up the data into groups of the same size. Hold aside one group for testing and use the rest to build the model. Test on the held-out group, then repeat with each group in turn.
More on cross-validation

Standard method for evaluation: stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation. E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
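Repeated stratified cross-validation is a short sketch under the same placeholder data/model assumptions as above:

```python
# Sketch: ten-fold stratified CV repeated ten times; averaging reduces variance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=rskf)
print(f"accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```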
Significance tests

Significance tests tell us how confident we can be that there really is a difference.
Null hypothesis: there is no real difference.
Alternative hypothesis: there is a difference.
A significance test measures how much evidence there is in favor of rejecting the null hypothesis.
Let's say we are using 10 times 10-fold CV. Then we want to know whether the two means of the 10 CV estimates are significantly different.
Student's paired t-test tells us whether the means of two samples are significantly different.
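A minimal sketch of the paired t-test with SciPy; the two accuracy vectors are made-up stand-ins for the 10 CV estimates of two models evaluated on the same folds:

```python
# Sketch: Student's paired t-test on two models' CV estimates (made-up numbers).
from scipy import stats

acc_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.79, 0.81]
acc_b = [0.77, 0.78, 0.80, 0.76, 0.79, 0.75, 0.81, 0.77, 0.76, 0.78]

t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired: same folds for both models
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis (no real difference) if p is below the significance level.
```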
Visualization Techniques

ROC curves
Lift curves
Precision/Recall curves
ROC analysis?

A way to compare classifiers.
A way to assess your model's ability to rank new test cases.
A way to look at performance for different scenarios.
The same is true for lift curves and precision/recall curves.
Why not pick the best model for the scenario? You would, if you knew the cost scenario up front.
Business problem: The Data

The firm determined 21 segments (IDs 1-21) by a combination of customer characteristics:
Loyalty (L): existing customer, prior spending, current plan, frequent switch
Geography (G): state, zip, urban, cable region
Demographics (D): age, gender, children, head of household
Other (O): type of mailer, internet type
[Figure: results for NN vs. non-NN.]
Assumptions of the Standard Cost Model

Correct classification costs 0.
The cost of misclassification depends only on the class, not on the individual example.
Over a set of examples, costs are additive.

In practice, costs are not known precisely at evaluation time, may vary with time, and may depend on where the classifier is deployed.
Classification Accuracy

Predicted Class:          0   1   1   1   0   1   1   0   0   0
Predicted CPE:           .2  .9  .6  .8  .3  .6  .9  .4  .3  .1
Actual Classification:    0   1   1   1   1   0   1   1   0   0
m = 10, Accuracy = ?
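One way to check the answer, with the table's rows transcribed into Python:

```python
# Accuracy = fraction of the m = 10 examples where prediction matches actual.
predicted = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]
actual    = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(accuracy)  # 7 of 10 predictions match: 0.7
```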
Predicting probabilities

Performance measure so far: the success rate. Also called the 0-1 loss function:

    loss = sum over test instances of (0 if the prediction is correct, 1 if it is incorrect)

Most classifiers produce class probabilities. Depending on the application, we might want to check the accuracy of the probability estimates. 0-1 loss is not the right thing to use in those cases, where we care about ranking or using the probability estimates.
ROC Curves
Specific Example

[Figure: two overlapping distributions of a test result, for people who will not buy and for people who will buy. A threshold on the test result calls consumers below it negative and above it positive; the panels show the regions corresponding to true positives, false positives, true negatives, and false negatives among the "do not purchase" and "purchase" groups.]
ROC curve

[Figure: ROC plots on 0-100% axes. A poor test yields a curve close to the diagonal; the worst test (complete overlap of the two groups) lies on the diagonal itself.]
ROC Curves
Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0

Sweep a threshold over the predicted CPE values: examples scoring above it are labeled positive, the rest negative. Each threshold gives one point on the ROC curve.
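A sketch of how the ROC points fall out of this example, using scikit-learn's roc_curve on the table's CPE scores and actual labels:

```python
# Sketch: one (false alarm rate, hit rate) point per CPE threshold, plus AUC.
from sklearn.metrics import roc_curve, roc_auc_score

cpe    = [.9, .9, .8, .6, .6, .4, .3, .3, .2, .1]  # predicted class-probability estimates
actual = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

fpr, tpr, thresholds = roc_curve(actual, cpe)
for t, f, h in zip(thresholds, fpr, tpr):
    print(f"threshold {t}: false alarm rate {f:.2f}, hit rate {h:.2f}")
print("AUC =", roc_auc_score(actual, cpe))
```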
ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

[Figure: ROC space with the positive-class success rate (hit rate, sensitivity) on the y-axis and 1 - negative-class success rate (false alarm rate, 1 - specificity) on the x-axis, both 0-100%. The ideal ROC curve has AUC = 1, and 0 <= AUC <= 1 in general; example curves are shown with AUC = 100%, 90%, 65%, and 50%.]
Lift Curves
How many good customers did we select?

Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0

x-axis: percent targeted.
Lift Curve

[Figure: customers ranked according to f(x); select the top-ranking customers. Hit rate (fraction of good customers selected) vs. fraction targeted, both 0-100%, with the model's curve above the diagonal and the ideal lift above that.]
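A sketch of the lift computation on the same example; ranking by predicted CPE stands in for ranking by f(x):

```python
# Sketch: cumulative fraction of good customers captured vs. percent targeted.
import numpy as np

cpe    = np.array([.9, .9, .8, .6, .6, .4, .3, .3, .2, .1])
actual = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])

order = np.argsort(-cpe)                          # highest-scoring customers first
hits = np.cumsum(actual[order]) / actual.sum()    # fraction of good customers selected
targeted = np.arange(1, len(cpe) + 1) / len(cpe)  # fraction of customers targeted
for t, h in zip(targeted, hits):
    print(f"target {t:.0%} -> capture {h:.0%}")
```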
Precision: the fraction of the documents retrieved that are relevant to the user's information need.
Recall: the fraction of the documents relevant to the query that are successfully retrieved. In binary classification, recall is called sensitivity, so it can be viewed as the probability that a relevant document is retrieved by the query.
Precision/Recall Curves

Computed on the same ranked example as above (sweep the threshold over the predicted CPE; x-axis: percent targeted):

Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0
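A sketch of the precision/recall computation for the same example via scikit-learn:

```python
# Sketch: precision and recall at each CPE threshold.
from sklearn.metrics import precision_recall_curve

cpe    = [.9, .9, .8, .6, .6, .4, .3, .3, .2, .1]
actual = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

precision, recall, thresholds = precision_recall_curve(actual, cpe)
for p, r in zip(precision, recall):
    print(f"recall {r:.2f} -> precision {p:.2f}")
```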
Performance Assessment

Cost matrix / counts for m = tn + fp + fn + tp examples:

                          Truth: y
Predictions F(x)      Class -1        Class +1        Total
  Class -1            tn              fn              rej = tn + fn
  Class +1            fp              tp              sel = fp + tp
  Total               neg = tn + fp   pos = fn + tp   m

False alarm rate = fp/neg = type I error rate = 1 - specificity
Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
Precision = tp/sel; fraction selected = sel/m

Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
ROC curve: hit rate vs. false alarm rate
Lift curve: hit rate vs. fraction selected
Precision/recall curve: hit rate vs. precision
Learning Curves

[Figure: test accuracy as a function of training-set size.]
Questions?
Cost matrix

ASSUME (FOR NOW) WE ALSO KNOW THE PROBABILITY OF SUN VS. RAIN (PSun and PRain). The cost matrix, used in the worked example below:

                 Truth: Sun   Truth: Rain
Predict Sun          0            10
Predict Rain         1             1
STEPS

Compute the expected cost of each prediction from PSun and PRain; predict the class with the lower expected cost.
WORKED EXAMPLE: WEATHER

1. Expected cost of predicting Sun = 0*PSun + 10*PRain = 0*(1 - PRain) + 10*PRain = 10*PRain
2. Expected cost of predicting Rain = 1*PSun + 1*PRain = 1*(1 - PRain) + 1*PRain = 1*(1 - PRain + PRain) = 1
3. When should we predict Rain? When CostRain < CostSun: 1 < 10*PRain, i.e. PRain > 1/10.

Source: adapted from Elkan (2001) and Zadrozny (2003).
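The worked example restated as a small function, using the slide's cost matrix:

```python
# Sketch: pick the prediction with the lower expected cost (weather example).
def best_prediction(p_rain: float) -> str:
    cost_sun = 0 * (1 - p_rain) + 10 * p_rain   # = 10 * PRain
    cost_rain = 1 * (1 - p_rain) + 1 * p_rain   # = 1
    return "Rain" if cost_rain < cost_sun else "Sun"

print(best_prediction(0.05))  # Sun  (PRain < 1/10)
print(best_prediction(0.20))  # Rain (PRain > 1/10)
```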
Auto dealers have data on each state's registered drivers. They need hot-prospect lists: classify people as buyer vs. non-buyer.
C = cost of mailing and cold call
N = profit from sale of a car
Pi = person i's purchase probability (you're given a different Pi for each person)
Consider a person a hot prospect if targeting them has higher expected profit than not targeting them:

E[profit if predict buyer] >= E[profit if predict non-buyer]
P*(N - C) + (1 - P)*(-C) >= 0
P*N - P*C - C + P*C >= 0
P >= C/N, roughly 1% (or $20/$2000)
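The targeting rule as code; the $20 cost and $2000 profit are the slide's figures:

```python
# Sketch: target person i iff the expected profit of mailing exceeds not mailing (0).
def is_hot_prospect(p: float, cost: float = 20.0, profit: float = 2000.0) -> bool:
    expected = p * (profit - cost) + (1 - p) * (-cost)  # = p*N - C
    return expected > 0  # equivalently: p > C/N = $20/$2000 = 1%

print(is_hot_prospect(0.005))  # False: 0.5% < 1%
print(is_hot_prospect(0.02))   # True:  2%   > 1%
```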
THREE MAIN APPROACHES TO COST-SENSITIVE LEARNING

Direct: determine the lowest expected-cost class.
External: modify the distribution of the training set.
Internal: make the classifier cost-sensitive.
[Decision tree: root 20 R / 20 S, split on Pressure. Pressure = Low: 5 R, 10 S, so PREDICT SUN, since S is more likely than R. Pressure = High: 15 R, 10 S, so PREDICT RAIN, since R is more likely than S.]
NO! WE NEED TO RELABEL THE TREE, BECAUSE ACCURACY IS NOT THE SAME AS PROFIT

Accuracy-based labeling uses a 50% threshold; to maximize profit, we said to treat a leaf as R if PRain > 10%.

[Same tree: the Pressure = Low leaf has 5 R, 10 S, so PRain = 1/3 > 10% and the leaf is relabeled R; the Pressure = High leaf (15 R, 10 S) remains R.]
Cost-sensitive learning, external approach: modify the distribution of the data set.
IN THE EXTERNAL METHOD, COSTLY EXAMPLES ARE MADE MORE PREVALENT IN THE TRAINING DATA

Multiply the number of negative examples by a factor determined by p* and p0, yielding a modified distribution D̂, where p* is the cutoff response rate (the cost-optimal threshold) and p0 is the threshold of the given classifier (typically 0.5).
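A sketch of the rebalancing factor, assuming the standard negative-example multiplier from Elkan (2001), which this part of the deck adapts, with p* the cutoff response rate and p0 the classifier's threshold:

```python
# Assumed reconstruction: Elkan's (2001) factor for rebalancing negatives so that
# a classifier thresholded at p0 behaves like one thresholded at p_star.
def negative_multiplier(p_star: float, p0: float = 0.5) -> float:
    return (p_star / (1 - p_star)) * ((1 - p0) / p0)

# Weather example: cost-optimal threshold p* = 0.1, default p0 = 0.5.
print(negative_multiplier(0.10))  # ~0.111: keep about 1/9 of the negative examples
```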
EXTERNAL METHOD: TREAT THE CLASSIFIER AS A BLACK BOX, BUT ALTER THE DATA SET'S DISTRIBUTION

Train a cost-insensitive classifier on the modified distribution D̂; its outputs are then cost-sensitive predictions. Maximizing accuracy on D̂ helps minimize cost on D.
Cost-sensitive learning, internal approach: build the model differently (make the classifier cost-sensitive).
Issues (issue / potential problem / potential resolution):

1. Pruning
2. Bias: probabilities systematically shifted toward 0/1
3. Variance
Bias: probabilities systematically shifted toward 0/1

The Laplace correction is a common probability-smoothing method: the probability is estimated as CPEi = (a + 1) / (b + 2), where a is the number of samples at the leaf belonging to class Ci and b is the total number of samples at the leaf.
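The correction as a one-liner, matching the slide's formula:

```python
# Sketch: Laplace-corrected class-probability estimate at a leaf (binary case).
def laplace_cpe(class_count: int, leaf_count: int) -> float:
    return (class_count + 1) / (leaf_count + 2)

print(laplace_cpe(5, 5))  # ~0.857 instead of a raw estimate of 5/5 = 1.0
print(laplace_cpe(0, 5))  # ~0.143 instead of a raw estimate of 0/5 = 0.0
```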
[Comparison: approaches so far vs. new work, which builds an accuracy-based decision tree.]
TAKEAWAYS
Where will you get the data? What is your dependent variable? What are your predictor variables?
What has been done before, and how will you compare what you are working on to what has been done before?
What would be a good answer to your problem, i.e. what is your expected punchline?
By mid-semester the project must have a data set, baseline results, and exploratory analysis.
References

Based on N. Abe, A. Abrahams, M. Saar-Tsechansky, T. Jiang, B. Holte, P. Langley, Eibe Frank, D. Goldstein, and Wikipedia.