Evaluation

[Diagram: the data is split; a model builder is tuned against a validation set to produce the final model, which then receives a final evaluation on held-out data.]
Cross-validation

Cross-validation avoids overlapping test sets.
First step: the data is split into k subsets of equal size.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
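As a concrete illustration, here is a minimal sketch of stratified k-fold cross-validation with scikit-learn; the dataset and classifier are placeholder choices, not part of the original slides:

```python
# Sketch: stratified k-fold cross-validation (placeholder data and model).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# First step: split the data into k subsets of equal size (stratified).
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Second step: each subset in turn is used for testing, the remainder for training.
errors = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))

# The error estimates are averaged to yield an overall error estimate.
print(f"overall error estimate: {np.mean(errors):.3f}")
```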
Cross-validation example

Break up the data into groups of the same size. Hold aside one group for testing and use the rest to build the model. Test on the held-out group, then repeat with each group in turn.
More on cross-validation

Standard method for evaluation: stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation. E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
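Repeated stratified cross-validation is a short sketch under the same placeholder data/model assumptions as above:

```python
# Sketch: ten-fold stratified CV repeated ten times; averaging reduces variance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=rskf)
print(f"accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```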
Significance tests

Significance tests tell us how confident we can be that there really is a difference.
Null hypothesis: there is no real difference.
Alternative hypothesis: there is a difference.
A significance test measures how much evidence there is in favor of rejecting the null hypothesis.
Let's say we are using 10 times 10-fold CV. Then we want to know whether the two means of the 10 CV estimates are significantly different.
Student's paired t-test tells us whether the means of two samples are significantly different.
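A minimal sketch of the paired t-test with SciPy; the two accuracy vectors are made-up stand-ins for the 10 CV estimates of two models evaluated on the same folds:

```python
# Sketch: Student's paired t-test on two models' CV estimates (made-up numbers).
from scipy import stats

acc_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.79, 0.81]
acc_b = [0.77, 0.78, 0.80, 0.76, 0.79, 0.75, 0.81, 0.77, 0.76, 0.78]

t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired: same folds for both models
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis (no real difference) if p is below the significance level.
```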
Visualization Techniques

ROC curves
Lift curves
Precision/Recall curves
ROC analysis?

A way to compare classifiers.
A way to assess your model's ability to rank new test cases.
A way to look at performance for different scenarios.
The same is true for lift curves and precision/recall curves.
Why not pick the best model for the scenario? You would, if you knew the cost scenario up front.
Business problem: The Data

The firm determined 21 segments (IDs 1-21) by a combination of customer characteristics:
Loyalty (L): existing customer, prior spending, current plan, frequent switch
Geography (G): state, zip, urban, cable region
Demographics (D): age, gender, children, head of household
Other (O): type of mailer, internet type
[Figure: results for NN vs. non-NN.]
Assumptions of the Standard Cost Model

Correct classification costs 0.
The cost of misclassification depends only on the class, not on the individual example.
Over a set of examples, costs are additive.

In practice, costs are not known precisely at evaluation time, may vary with time, and may depend on where the classifier is deployed.
Classification Accuracy

Predicted Class:          0   1   1   1   0   1   1   0   0   0
Predicted CPE:           .2  .9  .6  .8  .3  .6  .9  .4  .3  .1
Actual Classification:    0   1   1   1   1   0   1   1   0   0
m = 10, Accuracy = ?
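One way to check the answer, with the table's rows transcribed into Python:

```python
# Accuracy = fraction of the m = 10 examples where prediction matches actual.
predicted = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0]
actual    = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(accuracy)  # 7 of 10 predictions match: 0.7
```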
Predicting probabilities

Performance measure so far: the success rate. Also called the 0-1 loss function:

    loss = sum over test instances of (0 if the prediction is correct, 1 if it is incorrect)

Most classifiers produce class probabilities. Depending on the application, we might want to check the accuracy of the probability estimates. 0-1 loss is not the right thing to use in those cases, where we care about ranking or using the probability estimates.
ROC Curves
Specific Example

[Figure: two overlapping distributions of a test result, for people who will not buy and for people who will buy. A threshold on the test result calls consumers below it negative and above it positive; the panels show the regions corresponding to true positives, false positives, true negatives, and false negatives among the "do not purchase" and "purchase" groups.]
ROC curve

[Figure: ROC plots on 0-100% axes. A poor test yields a curve close to the diagonal; the worst test (complete overlap of the two groups) lies on the diagonal itself.]
ROC Curves
Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0

Sweep a threshold over the predicted CPE values: examples scoring above it are labeled positive, the rest negative. Each threshold gives one point on the ROC curve.
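A sketch of how the ROC points fall out of this example, using scikit-learn's roc_curve on the table's CPE scores and actual labels:

```python
# Sketch: one (false alarm rate, hit rate) point per CPE threshold, plus AUC.
from sklearn.metrics import roc_curve, roc_auc_score

cpe    = [.9, .9, .8, .6, .6, .4, .3, .3, .2, .1]  # predicted class-probability estimates
actual = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

fpr, tpr, thresholds = roc_curve(actual, cpe)
for t, f, h in zip(thresholds, fpr, tpr):
    print(f"threshold {t}: false alarm rate {f:.2f}, hit rate {h:.2f}")
print("AUC =", roc_auc_score(actual, cpe))
```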
ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

[Figure: ROC space with the positive-class success rate (hit rate, sensitivity) on the y-axis and 1 - negative-class success rate (false alarm rate, 1 - specificity) on the x-axis, both 0-100%. The ideal ROC curve has AUC = 1, and 0 <= AUC <= 1 in general; example curves are shown with AUC = 100%, 90%, 65%, and 50%.]
Lift Curves
How many good customers did we select?

Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0

x-axis: percent targeted.
Lift Curve

[Figure: customers ranked according to f(x); select the top-ranking customers. Hit rate (fraction of good customers selected) vs. fraction targeted, both 0-100%, with the model's curve above the diagonal and the ideal lift above that.]
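A sketch of the lift computation on the same example; ranking by predicted CPE stands in for ranking by f(x):

```python
# Sketch: cumulative fraction of good customers captured vs. percent targeted.
import numpy as np

cpe    = np.array([.9, .9, .8, .6, .6, .4, .3, .3, .2, .1])
actual = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])

order = np.argsort(-cpe)                          # highest-scoring customers first
hits = np.cumsum(actual[order]) / actual.sum()    # fraction of good customers selected
targeted = np.arange(1, len(cpe) + 1) / len(cpe)  # fraction of customers targeted
for t, h in zip(targeted, hits):
    print(f"target {t:.0%} -> capture {h:.0%}")
```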
Precision: the fraction of the documents retrieved that are relevant to the user's information need.
Recall: the fraction of the documents relevant to the query that are successfully retrieved. In binary classification, recall is called sensitivity, so it can be viewed as the probability that a relevant document is retrieved by the query.
Precision/Recall Curves

Computed on the same ranked example as above (sweep the threshold over the predicted CPE; x-axis: percent targeted):

Predicted Class:          1   1   1   1   1   0   0   0   0   0
Predicted CPE:           .9  .9  .8  .6  .6  .4  .3  .3  .2  .1
Actual Classification:    1   1   1   0   1   1   1   0   0   0
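A sketch of the precision/recall computation for the same example via scikit-learn:

```python
# Sketch: precision and recall at each CPE threshold.
from sklearn.metrics import precision_recall_curve

cpe    = [.9, .9, .8, .6, .6, .4, .3, .3, .2, .1]
actual = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

precision, recall, thresholds = precision_recall_curve(actual, cpe)
for p, r in zip(precision, recall):
    print(f"recall {r:.2f} -> precision {p:.2f}")
```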
Performance Assessment

Cost matrix / counts for m = tn + fp + fn + tp examples:

                          Truth: y
Predictions F(x)      Class -1        Class +1        Total
  Class -1            tn              fn              rej = tn + fn
  Class +1            fp              tp              sel = fp + tp
  Total               neg = tn + fp   pos = fn + tp   m

False alarm rate = fp/neg = type I error rate = 1 - specificity
Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
Precision = tp/sel; fraction selected = sel/m

Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
ROC curve: hit rate vs. false alarm rate
Lift curve: hit rate vs. fraction selected
Precision/recall curve: hit rate vs. precision
Learning Curves

[Figure: test accuracy as a function of training-set size.]
Questions?
Cost matrix

ASSUME (FOR NOW) WE ALSO KNOW THE PROBABILITY OF SUN VS. RAIN (PSun and PRain). The cost matrix, used in the worked example below:

                 Truth: Sun   Truth: Rain
Predict Sun          0            10
Predict Rain         1             1
STEPS

Compute the expected cost of each prediction from PSun and PRain; predict the class with the lower expected cost.
WORKED EXAMPLE: WEATHER

1. Expected cost of predicting Sun = 0*PSun + 10*PRain = 0*(1 - PRain) + 10*PRain = 10*PRain
2. Expected cost of predicting Rain = 1*PSun + 1*PRain = 1*(1 - PRain) + 1*PRain = 1*(1 - PRain + PRain) = 1
3. When should we predict Rain? When CostRain < CostSun: 1 < 10*PRain, i.e. PRain > 1/10.

Source: adapted from Elkan (2001) and Zadrozny (2003).
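The worked example restated as a small function, using the slide's cost matrix:

```python
# Sketch: pick the prediction with the lower expected cost (weather example).
def best_prediction(p_rain: float) -> str:
    cost_sun = 0 * (1 - p_rain) + 10 * p_rain   # = 10 * PRain
    cost_rain = 1 * (1 - p_rain) + 1 * p_rain   # = 1
    return "Rain" if cost_rain < cost_sun else "Sun"

print(best_prediction(0.05))  # Sun  (PRain < 1/10)
print(best_prediction(0.20))  # Rain (PRain > 1/10)
```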
Auto dealers have data on each state's registered drivers. They need hot-prospect lists: classify people as buyer vs. non-buyer.
C = cost of mailing and cold call
N = profit from sale of a car
Pi = person i's purchase probability (you're given a different Pi for each person)
Consider a person a hot prospect if targeting them has higher expected profit than not targeting them:

E[profit if predict buyer] >= E[profit if predict non-buyer]
P*(N - C) + (1 - P)*(-C) >= 0
P*N - P*C - C + P*C >= 0
P >= C/N, roughly 1% (or $20/$2000)
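The targeting rule as code; the $20 cost and $2000 profit are the slide's figures:

```python
# Sketch: target person i iff the expected profit of mailing exceeds not mailing (0).
def is_hot_prospect(p: float, cost: float = 20.0, profit: float = 2000.0) -> bool:
    expected = p * (profit - cost) + (1 - p) * (-cost)  # = p*N - C
    return expected > 0  # equivalently: p > C/N = $20/$2000 = 1%

print(is_hot_prospect(0.005))  # False: 0.5% < 1%
print(is_hot_prospect(0.02))   # True:  2%   > 1%
```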
THREE MAIN APPROACHES TO COST-SENSITIVE LEARNING

Direct: determine the lowest expected-cost class.
External: modify the distribution of the training set.
Internal: make the classifier cost-sensitive.
[Decision tree: root 20 R / 20 S, split on Pressure. Pressure = Low: 5 R, 10 S, so PREDICT SUN, since S is more likely than R. Pressure = High: 15 R, 10 S, so PREDICT RAIN, since R is more likely than S.]
NO! WE NEED TO RELABEL THE TREE, BECAUSE ACCURACY IS NOT THE SAME AS PROFIT

Accuracy-based labeling uses a 50% threshold; to maximize profit, we said to treat a leaf as R if PRain > 10%.

[Same tree: the Pressure = Low leaf has 5 R, 10 S, so PRain = 1/3 > 10% and the leaf is relabeled R; the Pressure = High leaf (15 R, 10 S) remains R.]
Cost-sensitive learning, external approach: modify the distribution of the data set.
IN THE EXTERNAL METHOD, COSTLY EXAMPLES ARE MADE MORE PREVALENT IN THE TRAINING DATA

Multiply the number of negative examples by a factor determined by p* and p0, yielding a modified distribution D̂, where p* is the cutoff response rate (the cost-optimal threshold) and p0 is the threshold of the given classifier (typically 0.5).
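A sketch of the rebalancing factor, assuming the standard negative-example multiplier from Elkan (2001), which this part of the deck adapts, with p* the cutoff response rate and p0 the classifier's threshold:

```python
# Assumed reconstruction: Elkan's (2001) factor for rebalancing negatives so that
# a classifier thresholded at p0 behaves like one thresholded at p_star.
def negative_multiplier(p_star: float, p0: float = 0.5) -> float:
    return (p_star / (1 - p_star)) * ((1 - p0) / p0)

# Weather example: cost-optimal threshold p* = 0.1, default p0 = 0.5.
print(negative_multiplier(0.10))  # ~0.111: keep about 1/9 of the negative examples
```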
EXTERNAL METHOD: TREAT THE CLASSIFIER AS A BLACK BOX, BUT ALTER THE DATA SET'S DISTRIBUTION

Train a cost-insensitive classifier on the modified distribution D̂; its outputs are then cost-sensitive predictions. Maximizing accuracy on D̂ helps minimize cost on D.
Cost-sensitive learning, internal approach: build the model differently (make the classifier cost-sensitive).
Issues (issue / potential problem / potential resolution):

1. Pruning
2. Bias: probabilities systematically shifted toward 0/1
3. Variance
Bias: probabilities systematically shifted toward 0/1

The Laplace correction is a common probability-smoothing method: the probability is estimated as CPEi = (a + 1) / (b + 2), where a is the number of samples at the leaf belonging to class Ci and b is the total number of samples at the leaf.
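The correction as a one-liner, matching the slide's formula:

```python
# Sketch: Laplace-corrected class-probability estimate at a leaf (binary case).
def laplace_cpe(class_count: int, leaf_count: int) -> float:
    return (class_count + 1) / (leaf_count + 2)

print(laplace_cpe(5, 5))  # ~0.857 instead of a raw estimate of 5/5 = 1.0
print(laplace_cpe(0, 5))  # ~0.143 instead of a raw estimate of 0/5 = 0.0
```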
[Comparison: approaches so far vs. new work, which builds an accuracy-based decision tree.]
TAKEAWAYS
Where will you get the data? What is your dependent variable? What are your predictor variables?
What has been done before, and how will you compare what you are working on to what has been done before?
What would be a good answer to your problem, i.e. what is your expected punchline?
By mid-semester the project must have a data set, baseline results, and exploratory analysis.
References

Based on N. Abe, A. Abrahams, M. Saar-Tsechansky, T. Jiang, B. Holte, P. Langley, Eibe Frank, D. Goldstein, and Wikipedia.