
Evaluating Classification &

Predictive Performance
Evaluating & Comparing
Classifiers
Evaluation of a classifier is necessary for two
different purposes:
To obtain the complete specification of a particular model,
i.e., to obtain numerical values of the parameters of a
particular method.
To compare two or more fully specified classifiers in order
to select the best one.
Useful criteria
Reasonableness
Accuracy
Cost measures
Evaluation Criterion:
Reasonableness
Reasonableness
As in regression, time series, and other models,
we expect the model to be reasonable:
Based on the analyst's domain knowledge, is there a
reasonable basis for a causal relationship between the
predictor variables and Y (group membership)?
Are the predictor variables actually available for
prediction in the future?
If the classifier implies a certain order of importance
among the predictor variables (indicated by p-values
for specific predictors), is this order reasonable?

Evaluation Criterion:
Accuracy Measures
Accuracy Measures
The idea is to compare the predictions with the
actual responses (like forecast errors in time series,
or residuals in regression models).
In regression/time series etc., we displayed these as
three columns (actual values, predicted/fitted values,
errors) or plotted them on a graph.
In classification the predictions and actual values
are displayed in a compact format called a
classification/confusion matrix.
This can be done for the training and/or validation
set.
Misclassification error
Error = classifying a record as
belonging to one class when it belongs
to another class.
Error rate = percent of misclassified
records out of the total records in the
validation data
Classification/confusion matrix
Example with two groups: Y = C1 or C2

                    Predicted
Actual           Y = C1    Y = C2
Y = C1              a         b
Y = C2              c         d

a = # of observations that were classified correctly as group C1
Confusion Matrix
201 1s correctly classified as 1
85 1s incorrectly classified as 0
25 0s incorrectly classified as 1
2689 0s correctly classified as 0

Classification Confusion Matrix
                    Predicted Class
Actual Class          1         0
     1               201        85
     0                25      2689
Classification Measures
1. The overall accuracy of a classifier is
   P(correct prediction) = (a + d) / (a + b + c + d)
2. The overall error rate of a classifier is
   (b + c) / (a + b + c + d) = 1 - overall accuracy
(a, b, c, d refer to the cells of the confusion matrix above.)
Error Rate
Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 - error rate = (201 + 2689)/3000 = 96.33%
If there are multiple classes, the error rate is:
(sum of misclassified records) / (total records)

Classification Confusion Matrix
                    Predicted Class
Actual Class          1         0
     1               201        85
     0                25      2689
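A minimal sketch (plain Python) that reproduces the accuracy and error rate above from the four confusion-matrix cells:

```python
# Confusion-matrix cells from the example above
# (rows = actual class, columns = predicted class)
a, b = 201, 85    # actual 1s: predicted 1, predicted 0
c, d = 25, 2689   # actual 0s: predicted 1, predicted 0

n = a + b + c + d                 # total records (3000)
accuracy = (a + d) / n            # correctly classified
error_rate = (b + c) / n          # misclassified = 1 - accuracy

print(f"accuracy   = {accuracy:.2%}")    # 96.33%
print(f"error rate = {error_rate:.2%}")  # 3.67%
```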
Naïve Rule
Naïve rule: classify all records as
belonging to the most prevalent class
Often used as a benchmark: we hope to
do better than that
Exception: when the goal is to identify
high-value but rare outcomes, we may
do well by doing worse than the naïve
rule (see lift later)
Accuracy Measures cont.
The base accuracy of a dataset is the accuracy of the
naïve rule:
   Base accuracy = proportion of the majority class
The base error rate is
   Base error = 1 - base accuracy
The lift of a classifier (aka its improvement) is
   Lift = (Base error - overall error) / Base error x 100%
Classification Confusion Matrix
                    Predicted Class
Actual Class          1         0
     1               201        85
     0                25      2689
(286 actual 1s, 2714 actual 0s, 3000 records in total)

Using the naïve rule (every case is classified to the
majority class, 0):
Base accuracy = 2714/3000 = 90.47%
Base error rate = 9.53%
Using the classifier:
Overall error rate = (85 + 25)/3000 = 3.67%
Lift = (9.53 - 3.67)/9.53 ≈ 61%
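A minimal sketch that reproduces the base accuracy, base error rate, and lift above:

```python
# Class counts from the example: 286 actual 1s, 2714 actual 0s, 3000 total
n_class1, n_class0, n = 286, 2714, 3000

base_accuracy = max(n_class1, n_class0) / n   # naive rule: predict the majority class
base_error = 1 - base_accuracy                # error rate of the naive rule

overall_error = (85 + 25) / n                 # the classifier's error rate
lift = (base_error - overall_error) / base_error

print(f"base accuracy = {base_accuracy:.2%}")  # 90.47%
print(f"base error    = {base_error:.2%}")     # 9.53%
print(f"lift          = {lift:.1%}")           # 61.5% (about 61%)
```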
Separation of Records

High separation of records means
that using predictor variables attains
low error

Low separation of records means
that using predictor variables does not
improve much on nave rule

When One Class is More
Important
In many cases it is more important to identify
members of one class:
Tax fraud
Credit default
Response to a promotional offer
Detecting electronic network intrusion
Predicting delayed flights
In such cases, we are willing to tolerate greater
overall error, in return for better identifying the
important class for further attention
Accuracy Measures cont.
Suppose the two groups are asymmetric in that
it is more important to correctly predict
membership in C1 than in C2. E.g., in the
bankruptcy example, it may be more important
to correctly predict a firm that is going bankrupt
than to correctly predict a firm that is going to
stay solvent. The classifier is essentially used as
a system for detecting or signaling C1.
In such a case, the overall accuracy is not a
good measure for evaluating the classifier.
Accuracy Measures for Unequal
Importance of Groups
(C1 = important group)

Sensitivity of a classifier = its ability to
correctly detect the important group members
= % of C1 members correctly classified = a / (a + b)
Specificity of a classifier = its ability to
correctly rule out non-C1 members
= % of C2 members correctly classified = d / (c + d)

                    Predicted
Actual              C1       C2
C1                   a        b
C2                   c        d
Accuracy Measures for Unequal
Importance of Groups
(C1 = important group)

False positive rate of the classifier = c / (a + c)
(the proportion of records predicted as C1 that actually belong to C2)
False negative rate of the classifier = b / (b + d)
(the proportion of records predicted as C2 that actually belong to C1)

                    Predicted
Actual              C1       C2
C1                   a        b
C2                   c        d
Accuracy measures
Classification Confusion Matrix
                    Predicted Class
Actual Class          1         0
     1               201        85
     0                25      2689
(286 actual 1s, 2714 actual 0s; 226 predicted 1s, 2774 predicted 0s; 3000 in total)

Class 1 is the important class (e.g., cancer):
Sensitivity = 201/286 = 70%
Specificity = 2689/2714 = 99%
False positive rate = 25/226 = 11%
False negative rate = 85/2774 = 3%
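A minimal sketch that computes these four measures exactly as defined in the slides (note the slides define the false positive/negative rates relative to the predicted classes):

```python
# Cells from the matrix above; class 1 is the important class (C1)
a, b = 201, 85    # actual 1s: predicted 1, predicted 0
c, d = 25, 2689   # actual 0s: predicted 1, predicted 0

sensitivity = a / (a + b)          # 201/286   ~ 70%
specificity = d / (c + d)          # 2689/2714 ~ 99%
false_positive_rate = c / (a + c)  # 25/226  ~ 11% of predicted 1s are actually 0
false_negative_rate = b / (b + d)  # 85/2774 ~ 3%  of predicted 0s are actually 1
```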
[Figure: overall accuracy, sensitivity, and specificity plotted against the cutoff probability (0 to 1).]
Measures for unequal-
importance of groups
Ideally, we would like the sensitivity and specificity to
be as high as possible, and the false positive and
false negative rates to be as low as possible.
In practice, there is a trade-off between sensitivity and
specificity, and between false positives and false
negatives. (At the extreme, classifying all firms as
bankrupt gives 100% sensitivity but 0% specificity.)
Tradeoff cont.
Such a trade-off is resolved via constrained
optimization. For example, we may seek to
maximize sensitivity subject to some minimum
required level of specificity. Or, we may seek to
minimize the false positive rate subject to some
maximum tolerable rate of false negatives.
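As a concrete (hypothetical) illustration of such constrained optimization, the sketch below scans candidate cutoffs and keeps the one with the highest sensitivity among those meeting a minimum specificity; `probs` and `actuals` are assumed lists of predicted probabilities and 0/1 labels, not data from the slides.

```python
def best_cutoff(probs, actuals, min_specificity=0.90):
    """Pick the cutoff that maximizes sensitivity subject to a
    minimum required specificity (a sketch of the constrained trade-off)."""
    best = None
    for cutoff in sorted(set(probs)):
        preds = [1 if p >= cutoff else 0 for p in probs]
        tp = sum(1 for p, y in zip(preds, actuals) if p == 1 and y == 1)
        fn = sum(1 for p, y in zip(preds, actuals) if p == 0 and y == 1)
        tn = sum(1 for p, y in zip(preds, actuals) if p == 0 and y == 0)
        fp = sum(1 for p, y in zip(preds, actuals) if p == 1 and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        if spec >= min_specificity and (best is None or sens > best[1]):
            best = (cutoff, sens, spec)
    return best  # (cutoff, sensitivity, specificity) or None if infeasible
```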

Cutoff for classification
Most DM algorithms classify via a 2-step process:
For each record,
1. Compute probability of belonging to class 1
2. Compare to cutoff value, and classify accordingly
Default cutoff value is 0.50
If >= 0.50, classify as 1
If < 0.50, classify as 0
Can use different cutoff values
Typically, error rate is lowest for cutoff = 0.50
Cutoff Table
Actual Class   Prob. of "1"      Actual Class   Prob. of "1"
     1            0.996               1            0.506
     1            0.988               0            0.471
     1            0.984               0            0.337
     1            0.980               1            0.218
     1            0.948               0            0.199
     1            0.889               0            0.149
     1            0.848               0            0.048
     0            0.762               0            0.038
     1            0.707               0            0.025
     1            0.681               0            0.022
     1            0.656               0            0.016
     0            0.622               0            0.004

If the cutoff is 0.50: 13 records are classified as 1
If the cutoff is 0.80: 7 records are classified as 1
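A minimal plain-Python check that reproduces the two counts above from the 24 probabilities in the table:

```python
# The 24 predicted probabilities of "1" from the cutoff table above
probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]

for cutoff in (0.50, 0.80):
    n_ones = sum(p >= cutoff for p in probs)  # records classified as 1
    print(f"cutoff {cutoff}: {n_ones} records classified as 1")
# cutoff 0.5: 13 records classified as 1
# cutoff 0.8: 7 records classified as 1
```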
Confusion Matrix for Different
Cutoffs

Cutoff prob. value for success = 0.25
Classification Confusion Matrix
                    Predicted Class
Actual Class      owner    non-owner
owner               11          1
non-owner            4          8

Cutoff prob. value for success = 0.75
Classification Confusion Matrix
                    Predicted Class
Actual Class      owner    non-owner
owner                7          5
non-owner            1         11
Lift and Decile Charts: Goal
Useful for assessing performance in terms
of identifying the most important class

Helps evaluate, e.g.,
How many tax records to examine
How many loans to grant
How many customers to mail offer to

Lift and Decile Charts Cont.
Compare performance of the DM model to
"no model" (i.e., picking records at random)
Measures ability of DM model to identify
the important class, relative to its average
prevalence
Charts give explicit assessment of results
over a large number of cutoffs
Lift and Decile Charts: How
to Use
Compare lift to no model baseline

In lift chart: compare step function to
straight line

In decile chart compare to ratio of 1

Lift Charts (Bankruptcy;
1= bankrupt)
Compare to random assignment
A good classifier picks up the
bankrupt firms (1s) at the beginning
(roughly the first 20 cases) and then flattens out
[Figure: Lift chart (training dataset). X-axis: # cases; Y-axis: cumulative number of actual 1s. Curves: cumulative status when sorted using predicted values vs. cumulative status using the average.]
Serial no. Predicted Actual Cumulative actual Cumulative average
1 0.991543081 1 1 0.486486486
2 0.988105889 1 2 0.972972973
3 0.986579738 1 3 1.459459459
4 0.984652584 1 4 1.945945946
5 0.962885364 1 5 2.432432432
6 0.937034077 0 5 2.918918919
7 0.909046376 1 6 3.405405405
8 0.901302885 1 7 3.891891892
9 0.888245685 1 8 4.378378378
10 0.882716903 1 9 4.864864865
11 0.84924749 1 10 5.351351351
12 0.830714449 1 11 5.837837838
13 0.827507501 1 12 6.324324324
14 0.82097585 1 13 6.810810811
15 0.779509154 1 14 7.297297297
16 0.747775362 1 15 7.783783784
17 0.547297007 0 15 8.27027027
18 0.506053382 1 16 8.756756757
19 0.454098506 0 16 9.243243243
20 0.431882627 0 16 9.72972973
21 0.33018223 1 17 10.21621622
22 0.245603253 0 17 10.7027027
23 0.22744156 0 17 11.18918919
24 0.174103639 0 17 11.67567568
25 0.165288148 0 17 12.16216216
26 0.161843048 0 17 12.64864865
27 0.128189076 0 17 13.13513514
28 0.120534482 0 17 13.62162162
29 0.105942669 0 17 14.10810811
30 0.085019177 1 18 14.59459459
31 0.01090307 0 18 15.08108108
32 0.009456913 0 18 15.56756757
33 0.004583116 0 18 16.05405405
34 0.002982095 0 18 16.54054054
35 0.000462468 0 18 17.02702703
36 0.000286908 0 18 17.51351351
37 3.26618E-06 0 18 18
Lift Chart cumulative
performance
After examining, e.g., 10 cases (x-axis), 9
owners (y-axis) have been correctly identified
[Figure: Lift chart (training dataset). X-axis: # cases; Y-axis: cumulative ownership. Curves: cumulative ownership when sorted using predicted values vs. cumulative ownership using the average.]
Lift Charts: How to Compute
Using the model's predicted probabilities, sort
records from most likely to least likely
members of the important class
Compute lift: accumulate the correctly
classified important-class records (Y
axis) and compare to the total number of
records examined (X axis)
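A minimal sketch of this computation, assuming `probs` holds the model's predicted probabilities and `actuals` the 0/1 membership in the important class (names are illustrative):

```python
def cumulative_lift(probs, actuals):
    """Sort records by predicted probability (descending) and accumulate the
    number of actual important-class (1) records -- the lift curve's Y values."""
    ranked = sorted(zip(probs, actuals), key=lambda pair: pair[0], reverse=True)
    cum_actual, curve = 0, []
    for _, y in ranked:
        cum_actual += y
        curve.append(cum_actual)
    return curve   # curve[i] = cumulative 1s after examining i+1 cases

# The "no model" benchmark accumulates the average prevalence instead:
# baseline[i] = (i + 1) * sum(actuals) / len(actuals)
```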


Decile Chart
[Figure: Decile-wise lift chart (training dataset). X-axis: deciles 1-10; Y-axis: decile mean / global mean.]
In the most probable (top) decile, the model is twice as likely to
identify the important class (compared to the average prevalence)
Lift vs. Decile Charts
Both embody concept of moving down
through the records, starting with the most
probable

Lift chart shows continuous cumulative results
Y axis shows the number of important class records
identified
Decile chart does this in decile chunks of data
Y axis shows ratio of decile mean to overall mean
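A sketch of the decile computation under the same assumptions (`probs`, `actuals` as above); each decile's mean response is divided by the global mean:

```python
def decile_lift(probs, actuals, n_deciles=10):
    """Ratio of each decile's mean response to the global mean,
    after sorting records from most to least probable."""
    ranked = [y for _, y in sorted(zip(probs, actuals), reverse=True)]
    global_mean = sum(actuals) / len(actuals)
    size = len(ranked) / n_deciles
    ratios = []
    for i in range(n_deciles):
        chunk = ranked[round(i * size):round((i + 1) * size)]
        decile_mean = sum(chunk) / len(chunk) if chunk else 0.0
        ratios.append(decile_mean / global_mean)
    return ratios  # e.g., a top-decile ratio of 2 = twice the average prevalence
```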

Evaluation Criterion:
Cost Measures
Asymmetric Costs:
Misclassification Costs May Differ
The cost of making a misclassification
error may be higher for one class than
the other(s)

Looked at another way, the benefit of
making a correct classification may be
higher for one class than the other(s)
The Confusion Matrix
Example: Response to Promotional Offer

                Predict as 1   Predict as 0
Actual 1              8              2
Actual 0             20            970

Error rate = (2 + 20)/1000 = 2.2% (higher than the
naïve rate = 10/1000 = 1%)
Example: Response to Promotional
Offer
Suppose we send an offer to 1000 people,
with a 1% average response rate
(1 = response, 0 = nonresponse)
The naïve rule (classify everyone as 0)
has an error rate of 1% (seems good)
Using DM we can correctly classify eight
1s as 1s
This comes at the cost of misclassifying
twenty 0s as 1s and two 1s as 0s.
Introducing Costs & Benefits
Suppose:
profit from a 1 is $10
Cost of sending offer is $1
Then:
Under nave rule, all are classified as 0 so no offers
are sent: no cost, no profit
Under DM predictions, 28 offers are sent.
8 respond with profit of $10 each = $80
20 fail to respond, cost $1 each = -$20
972 receive nothing (no cost, no profit)
Net profit = $80-$20=$60
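A minimal sketch that reproduces the $60 net profit, following the slide's accounting (the $10 per responder is treated as net profit, and the $1 mailing cost is charged for each non-responder who was sent an offer):

```python
# Confusion-matrix counts for the promotional offer example
n_true_pos, n_false_neg = 8, 2      # actual 1s: predicted 1 / predicted 0
n_false_pos, n_true_neg = 20, 970   # actual 0s: predicted 1 / predicted 0

profit_per_response = 10   # profit from each responder sent an offer
cost_per_offer = 1         # cost of mailing a non-responder

net_profit = n_true_pos * profit_per_response - n_false_pos * cost_per_offer
print(net_profit)   # 80 - 20 = 60 dollars
```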
Profit Matrix
Cases:
                Predict as 1   Predict as 0
Actual 1              8              2
Actual 0             20            970

Profit:
                Predict as 1   Predict as 0
Actual 1            $80             $0
Actual 0           ($20)            $0

Profit from an acceptance is $10; cost of sending an offer is $1.
Opportunity costs
As we see, best to convert everything
to costs, as opposed to a mix of costs
and benefits
E.g., instead of benefit from sale refer
to opportunity cost of lost sale
Leads to same decisions, but referring
only to costs allows greater applicability

Cost Matrix
(incl. opportunity costs)

Opportunity cost of missing an acceptance = $10
Cost of sending an offer = $1

Cases:
                Predict as 1   Predict as 0
Actual 1              8              2
Actual 0             20            970

Costs:
                Predict as 1   Predict as 0
Actual 1             $8            $20
Actual 0            $20             $0
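A small sketch that reproduces the cost-matrix cell totals above by multiplying each confusion-matrix count by its per-record cost (the per-record costs follow from the two figures stated above):

```python
# Counts and per-record costs, keyed by (actual, predicted)
counts = {("1", "1"): 8,  ("1", "0"): 2,
          ("0", "1"): 20, ("0", "0"): 970}
cost_per_record = {("1", "1"): 1,    # offer sent to a responder: $1 mailing cost
                   ("1", "0"): 10,   # missed acceptance: $10 opportunity cost
                   ("0", "1"): 1,    # offer sent to a non-responder: $1 mailing cost
                   ("0", "0"): 0}    # no offer, no response: no cost

total_cost = sum(counts[cell] * cost_per_record[cell] for cell in counts)
print(total_cost)   # 8*1 + 2*10 + 20*1 + 970*0 = 48 dollars
```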

Misclassification Cost
Bankruptcy Example: Here a misclassification
occurs when either (a) a firm that actually went
bankrupt is misclassified as being solvent or when (b)
a firm that actually remained solvent is misclassified
as going bankrupt.
Suppose the cost of misclassifying a firm that actually
went bankrupt as being solvent is 10 times the cost
of misclassifying a firm that actually remained solvent
as being bankrupt.
The expected cost of misclassification is a useful way
of evaluating the performance of a classifier when
the misclassification costs are asymmetric.

Misclassification Cost
Measures
In general, for a confusion matrix

                    Predicted
Actual              C1       C2
C1                   a        b
C2                   c        d

Let q1 be the cost of misclassifying an observation that
is actually from C1 as belonging to C2.
Let q2 be the cost of misclassifying an observation that
is actually from C2 as belonging to C1.
Cost Function
The idea is to find a classifier that minimizes this
function:

Expected misclassification cost
  = q1 x [b / (a + b)] x [(a + b) / (a + b + c + d)]
    + q2 x [c / (c + d)] x [(c + d) / (a + b + c + d)]
  = (q1·b + q2·c) / (a + b + c + d)

Note: The optimal parameters are affected by the
misclassification costs only through their ratio q1/q2,
i.e., if both q1 and q2 were doubled, the optimal
parameters would be unaffected.
The optimal parameters are also affected by the
prior probabilities through the ratio P(C1)/P(C2).
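A compact sketch of this cost function; the simplified form follows from cancelling the (a + b) and (c + d) factors:

```python
def expected_misclassification_cost(a, b, c, d, q1, q2):
    """Expected cost per record, as in the slide's cost function:
    q1 * b/(a+b) * (a+b)/n + q2 * c/(c+d) * (c+d)/n = (q1*b + q2*c) / n."""
    n = a + b + c + d
    return (q1 * b + q2 * c) / n

# With the earlier example matrix and q1 = q2 = 1, this reduces to the
# overall error rate: expected_misclassification_cost(201, 85, 25, 2689, 1, 1)
# returns 110/3000 = 3.67%.
```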
Cost Function adjusting for
distorted sample probabilities
If we use stratified sampling, then we can use
external/prior information on the proportions of
observations belonging to each class (e.g., census
data), denoted p(C1), p(C2), and incorporate them
into the cost structure.
The expected cost of misclassification:

  p(C1) x [b / (a + b)] x q1 + p(C2) x [c / (c + d)] x q2
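A matching sketch of the prior-adjusted version; `p1` and `p2` stand for the external proportions p(C1) and p(C2):

```python
def expected_cost_with_priors(a, b, c, d, p1, p2, q1, q2):
    """Expected cost when the sample class proportions are distorted:
    p(C1) * b/(a+b) * q1 + p(C2) * c/(c+d) * q2,
    where p1, p2 are the external (prior) class proportions."""
    return p1 * (b / (a + b)) * q1 + p2 * (c / (c + d)) * q2
```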
Example: Bankruptcy
p(C0) = 0.5, q0 = q1
Interactively change the costs (q0, q1) and prior
probabilities (p0) to get a new cost curve
(see Bankruptcy accuracy measures XLM.xls)

[Figure: expected cost plotted against the cutoff probability (0 to 1).]
Cost Function: Applicability of
results to new data
This expression is a reasonable estimate
of future misclassification cost if the
proportions of C1 and C2 in the sample
data are similar to the proportions of C1
and C2 that are expected in the future.
However, the sample data is often
intentionally biased towards one group.
Why?
Generalize to Cost Ratio
Sometimes actual costs and benefits are hard to
estimate

Need to express everything in terms of costs
(i.e. cost of misclassification per record)
Goal is to minimize the average cost per record

A good practical substitute for individual costs is
the ratio of misclassification costs (e.g.,
misclassifying fraudulent firms is 5 times worse
than misclassifying solvent firms)
Multiple Classes
Theoretically, there are m(m-1)
misclassification costs, since any case
could be misclassified in m-1 ways
Practically, these are too many to work with
In a decision-making context, though, such
complexity rarely arises; one class is
usually of primary interest
For m classes, confusion matrix has m rows
and m columns
Adding Cost/Benefit to Lift
Curve
Sort records in descending order of probability of success
For each case, record the cost/benefit of the actual
outcome
Also record the cumulative cost/benefit
Plot all records
X-axis is the index number (1 for the 1st case, n for the nth case)
Y-axis is the cumulative cost/benefit
Reference line from the origin to y_n (y_n = total net
benefit)
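A sketch of this construction; `probs` are predicted probabilities of success and `benefits` are the per-record cost/benefit of each actual outcome (both hypothetical inputs):

```python
def cumulative_net_benefit(probs, benefits):
    """Sort records by predicted probability of success (descending) and
    accumulate the cost/benefit of each actual outcome; the reference line
    runs from the origin to (n, total net benefit)."""
    ranked = [b for _, b in sorted(zip(probs, benefits), reverse=True)]
    curve, running = [], 0.0
    for b in ranked:
        running += b
        curve.append(running)
    total = curve[-1] if curve else 0.0
    n = len(ranked)
    reference = [total * (i + 1) / n for i in range(n)]
    return curve, reference
```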


Lift Curve May Go Negative
If the total net benefit from all cases is
negative, the reference line will have a
negative slope

Nonetheless, the goal is still to use the
cutoff to select the point where net
benefit is at a maximum
Oversampling and Asymmetric
Costs
Rare Cases
Responder to mailing
Someone who commits fraud
Debt defaulter
Often we oversample rare cases to give
model more information to work with
Typically use 50% 1 and 50% 0 for
training


Asymmetric costs/benefits typically go hand in
hand with presence of rare but important class
An Oversampling Procedure
1. Separate the responders (rare) from non-
responders
2. Randomly assign half the responders to the
training sample, plus equal number of non-
responders
3. Remaining responders go to validation sample
4. Add non-responders to validation data, to
maintain original ratio of responders to non-
responders
5. Randomly take test set (if needed) from
validation
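A sketch of steps 1-4 under simple assumptions (step 5, carving out a test set, is omitted); `responders` and `non_responders` are assumed to be lists of records:

```python
import random

def oversample_split(responders, non_responders, seed=1):
    """Sketch of the procedure above: a 50/50 training sample, and a
    validation sample that preserves the original class ratio."""
    rng = random.Random(seed)
    resp, non = list(responders), list(non_responders)
    rng.shuffle(resp)
    rng.shuffle(non)

    half = len(resp) // 2
    train = resp[:half] + non[:half]       # half the responders + equal # non-responders

    valid_resp = resp[half:]               # remaining responders go to validation
    ratio = len(non) / len(resp)           # original non-responder : responder ratio
    n_valid_non = int(round(ratio * len(valid_resp)))
    valid = valid_resp + non[half:half + n_valid_non]
    return train, valid
```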



Classification Using Triage
Instead of classifying as C1 or C2, we
classify as:
C1
C2
Can't say
The third category might receive special
human review
Take into account a "gray area" in making
classification decisions
Summary
Evaluation metrics are important for comparing
across DM models, for choosing the right
configuration of a specific DM model, and for
comparing to the baseline
Major metrics: confusion matrix, error rate,
predictive error
Other metrics when
one class is more important
asymmetric costs
When important class is rare, use oversampling
In all cases, metrics computed from validation data
