Recall the distinction between the test error and the training error:

The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one not used in training the method. In contrast, the training error is calculated by applying the method to the observations used in its training, and it often dramatically underestimates the test error.
[Figure: Prediction Error (Low to High) versus Model Complexity (Low to High), for a training sample and a test sample. The training-sample error decreases steadily with complexity, while the test-sample error is U-shaped: high bias and low variance at low complexity, with test error minimized at intermediate complexity.]
Best solution: a large designated test set. Often not available.

Some methods make a mathematical adjustment to the training error rate in order to estimate the test error rate. Here we instead consider methods that estimate the test error by holding out a subset of the observations from the fitting process.
Validation-set approach

Here we randomly divide the available set of samples into two parts: a training set and a validation (or hold-out) set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation-set error provides an estimate of the test error.

[Schematic: the n observations 1, ..., n split at random into a training part and a validation part.]
Example: we want to compare linear versus higher-order polynomial terms in a linear regression. We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations.

[Figure: validation-set MSE (roughly 16 to 28) versus degree of polynomial (1 to 10). Left panel shows a single split; right panel shows multiple splits, illustrating the variability of the validation-set estimate.]
Since only a subset of the observations (those in the training set) are used to fit the model, the validation-set error may tend to overestimate the test error for the model fit on the entire data set. Why? Because statistical methods tend to perform worse when trained on fewer observations.
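As a sketch of the validation-set approach on this kind of problem (synthetic data standing in for the Auto data set; all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Auto data: the response is a noisy
# quadratic function of the predictor (the real data set has 392 obs).
n = 392
x = rng.uniform(40, 230, size=n)
y = 50 - 0.3 * x + 0.0006 * x**2 + rng.normal(0, 4, size=n)

# Validation-set approach: one random 196 / 196 split.
idx = rng.permutation(n)
train, val = idx[:196], idx[196:]

def val_mse(degree):
    """Fit a degree-d polynomial on the training half, score on the validation half."""
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[val])
    return np.mean((y[val] - pred) ** 2)

errors = {d: val_mse(d) for d in range(1, 11)}
best = min(errors, key=errors.get)
print(best, errors[best])
```

Repeating this with a different random split gives a different error curve, which is exactly the variability shown in the right panel of the figure.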
K-fold Cross-validation

A widely used approach for estimating the test error. The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K − 1 parts (combined), and then obtain predictions for the left-out kth part. This is done in turn for each part k = 1, 2, ..., K, and then the results are combined.
[Schematic: the observations divided into K = 5 parts; one part labeled Validation and the remaining four labeled Train, with each part serving as the validation set in turn.]
The details

Let the K parts be C1, C2, ..., CK, where Ck denotes the indices of the observations in part k. There are nk observations in part k: if n is a multiple of K, then nk = n/K.

Compute

    CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \, \mathrm{MSE}_k

where MSEk is the mean squared prediction error over part k, using fits obtained from the data with part k removed.
Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
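As a concrete sketch, the CV(K) computation can be written in a few lines (function and data names are mine, and the example uses synthetic data rather than any data set from the slides):

```python
import numpy as np

def cv_k(x, y, fit, predict, K=5, seed=0):
    """Compute CV_(K) = sum_k (n_k / n) * MSE_k.

    `fit(x, y)` returns a fitted model; `predict(model, x)` returns predictions.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)          # the parts C_1, ..., C_K
    cv = 0.0
    for Ck in folds:
        train = np.setdiff1d(idx, Ck)       # all observations outside part k
        model = fit(x[train], y[train])
        mse_k = np.mean((y[Ck] - predict(model, x[Ck])) ** 2)
        cv += len(Ck) / n * mse_k           # weight each fold by n_k / n
    return cv

# Usage: 5-fold CV error of a straight-line fit on noisy linear data.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)
err = cv_k(x, y,
           fit=lambda x, y: np.polyfit(x, y, 1),
           predict=np.polyval)
print(err)
```

With a correctly specified model, the CV estimate should land near the irreducible noise variance (here 1).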
With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:

    CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2 ,

where ŷi is the ith fitted value from the original least squares fit, and hi is the leverage (the ith diagonal element of the "hat" matrix). This is like the ordinary MSE, except the ith residual is divided by 1 − hi.
LOOCV is sometimes useful, but typically it does not shake up the data enough. The estimates from each fold are highly correlated, and hence their average can have high variance. A better choice is K = 5 or 10.
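The leverage shortcut can be checked numerically: for a least-squares fit, the formula agrees exactly with n explicit leave-one-out refits. This is a sketch on simulated data; all names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept

# Full least-squares fit, fitted values, and leverages h_i = diag(H).
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Shortcut: CV_(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_i))^2
cv_shortcut = np.mean(((y - yhat) / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving out observation i.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(cv_shortcut, cv_brute)
```

The two numbers agree up to floating-point rounding, while the shortcut needs only one fit instead of n.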
[Figure: Auto data, estimated test MSE (roughly 16 to 28) versus degree of polynomial (1 to 10). Left panel: LOOCV error curve; right panel: 10-fold CV error curves.]
[Figure: for three simulated examples, estimated test error curves from cross-validation together with the true test error, plotted against flexibility (roughly 2 to 20).]
Since each training set is only (K − 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. This bias is minimized when K = n (LOOCV), but that estimate has high variance, as noted earlier. K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
This is a useful estimate, but strictly speaking, not quite valid.
NO! This would ignore the fact that the earlier screening step has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.

It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%; yet the cross-validation error of the final classifier alone will be close to zero. We have seen this error made in many high-profile genomics papers.
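The point is easy to verify by simulation. The sketch below (with assumed details of my own: 1,000 pure-noise predictors, 50 samples, top-20 screening, and a nearest-centroid classifier) cross-validates the classifier both the wrong way, screening once on all the data, and the right way, screening inside each fold:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 1000, 20
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)   # labels independent of X: true error rate is 50%

def screen(train_rows):
    """Indices of the k predictors most correlated with y on the given rows."""
    c = np.abs([np.corrcoef(X[train_rows, j], y[train_rows])[0, 1] for j in range(p)])
    return np.argsort(c)[-k:]

def fold_errors(select_inside_fold):
    """5-fold CV error of a nearest-centroid rule on the selected predictors."""
    folds = np.array_split(rng.permutation(n), 5)
    all_sel = screen(np.arange(n))            # screening done once, on ALL the data
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        sel = screen(train) if select_inside_fold else all_sel
        c0 = X[np.ix_(train[y[train] == 0], sel)].mean(axis=0)
        c1 = X[np.ix_(train[y[train] == 1], sel)].mean(axis=0)
        pred = (((X[np.ix_(test, sel)] - c1) ** 2).sum(axis=1)
                < ((X[np.ix_(test, sel)] - c0) ** 2).sum(axis=1)).astype(int)
        wrong += np.sum(pred != y[test])
    return wrong / n

err_wrong = fold_errors(select_inside_fold=False)  # far below 50%: wildly optimistic
err_right = fold_errors(select_inside_fold=True)   # near 50%, as it should be
print(err_wrong, err_right)
```

Only the second estimate reflects the true 50% error of any rule on labels that carry no signal.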
Wrong Way

[Schematic: the predictors are screened against the outcome using ALL the samples to obtain the selected set of predictors; CV folds are then formed only for the final model-fitting step.]
Right Way

[Schematic: CV folds are formed over the samples first; within each fold's training data, the predictors are screened against the outcome, so the held-out samples play no role in the selection.]
The Bootstrap

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator, for example by estimating its standard error.
A simple example

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction α of our money in X, and the remaining 1 − α in Y. We wish to choose α to minimize the total risk, or variance, of our investment. One can show that the value that minimizes the risk is given by

    \alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},

where σ_X² = Var(X), σ_Y² = Var(Y), and σ_XY = Cov(X, Y).
Example continued

But the values of σ_X², σ_Y², and σ_XY are unknown. We can compute estimates for these quantities, σ̂_X², σ̂_Y², and σ̂_XY, using a data set that contains measurements for X and Y. We can then estimate the value of α that minimizes the variance of our investment using

    \hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.
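As a sketch, here is the plug-in estimate α̂ computed from one simulated data set of 100 paired returns. The population parameters (Var(X) = 1, Var(Y) = 1.25, Cov(X, Y) = 0.5, so that α = 0.6) are my assumption, chosen to match the true value α = 0.6 used later in the slides:

```python
import numpy as np

# Assumed population: var(X) = 1, var(Y) = 1.25, cov(X, Y) = 0.5, so that
# alpha = (1.25 - 0.5) / (1 + 1.25 - 2 * 0.5) = 0.6.
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])

def alpha_hat(returns):
    """Plug-in estimate of alpha from an (n, 2) array of (X, Y) returns."""
    s = np.cov(returns, rowvar=False)            # sample covariance matrix
    sx2, sy2, sxy = s[0, 0], s[1, 1], s[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], cov, size=100)   # 100 paired returns
a = alpha_hat(z)
print(a)   # an estimate of the true value 0.6
```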
Example continued

[Figure: panels each displaying 100 simulated returns for investments X and Y; each simulated data set gives a different estimate of α.]
Example continued

To estimate the standard deviation of α̂, we repeated the process of simulating 100 paired observations of X and Y, and estimating α, 1,000 times. We thereby obtained 1,000 estimates for α, which we can call α̂₁, α̂₂, ..., α̂₁₀₀₀. The left-hand panel of the Figure on slide 29 displays a histogram of these estimates.
Example continued

The mean over all 1,000 estimates for α is

    \bar{\alpha} = \frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_r = 0.5996,

very close to the true value α = 0.6, and the standard deviation of the estimates is

    \sqrt{\frac{1}{1000 - 1} \sum_{r=1}^{1000} (\hat{\alpha}_r - \bar{\alpha})^2} = 0.083.

This gives us a very good idea of the accuracy of α̂: SE(α̂) ≈ 0.083. So roughly speaking, for a random sample from the population, we would expect α̂ to differ from α by approximately 0.08, on average.
Results

[Figure (slide 29): left panel, histogram of the 1,000 estimates of α obtained by repeated sampling from the true population; center panel, histogram of 1,000 bootstrap estimates of α from a single data set; right panel, boxplots of the two sets of estimates, labeled True and Bootstrap. The estimates range roughly from 0.3 to 0.9, with counts up to about 200.]
[Schematic (slide 31): a toy data set Z with n = 3 observations, (X, Y) = (4.3, 2.4), (2.1, 1.1), (5.3, 2.8). Sampling the rows with replacement produces bootstrap data sets Z*1, Z*2, ..., Z*B, each also of size 3 and each containing repeats; every bootstrap data set yields a bootstrap estimate α̂*1, α̂*2, ..., α̂*B.]
We estimate the standard error of these bootstrap estimates using the formula

    SE_B(\hat{\alpha}) = \sqrt{\frac{1}{B - 1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \bar{\hat{\alpha}}^{*} \right)^2 }.

This serves as an estimate of the standard error of α̂ estimated from the original data set. See center and right panels of Figure on slide 29. Bootstrap results are in blue. For this example SE_B(α̂) = 0.087.
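A minimal sketch of this bootstrap standard-error computation (the data-generating parameters are my stand-ins; the slides' value SE_B(α̂) = 0.087 came from their particular data set):

```python
import numpy as np

rng = np.random.default_rng(0)

# One observed data set of n = 100 paired returns (assumed population:
# var(X) = 1, var(Y) = 1.25, cov(X, Y) = 0.5, giving alpha = 0.6).
cov = np.array([[1.0, 0.5], [0.5, 1.25]])
z = rng.multivariate_normal([0, 0], cov, size=100)

def alpha_hat(d):
    """Plug-in estimate of alpha from an (n, 2) array of (X, Y) returns."""
    s = np.cov(d, rowvar=False)
    return (s[1, 1] - s[0, 1]) / (s[0, 0] + s[1, 1] - 2 * s[0, 1])

# B bootstrap data sets, each drawn with replacement from the ORIGINAL data.
B = 1000
boot = np.array([alpha_hat(z[rng.integers(0, len(z), len(z))]) for _ in range(B)])

# SE_B(alpha_hat): sample standard deviation of the B bootstrap estimates.
se_boot = boot.std(ddof=1)
print(se_boot)
```

The single observed data set stands in for the population, so the spread of the bootstrap estimates approximates the sampling variability of α̂.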
[Schematic (slide 33). Real World: random sampling from the population gives data Z = (z1, z2, ..., zn) and the estimate f(Z). Bootstrap World: random sampling from the estimated population gives a bootstrap data set Z* = (z1*, z2*, ..., zn*) and the bootstrap estimate f(Z*).]
The bootstrap is primarily used to obtain standard errors of an estimate, but appropriate quantiles of the bootstrap estimates also provide an approximate confidence interval for the true α.
Pre-validation
Motivating example
An example of this problem arose in the paper of van 't Veer et al., Nature (2002). Their microarray data has 4,918 genes measured over 78 cases, taken from a study of breast cancer.
There are 44 cases in the good prognosis group and 34 in the
poor prognosis group. A microarray predictor was
constructed as follows:
1. 70 genes were selected, having largest absolute correlation
with the 78 class labels.
2. Using these 70 genes, a nearest-centroid classifier C(x) was
constructed.
3. Applying the classifier to the 78 microarrays gave a
dichotomous predictor zi = C(xi ) for each case i.
Results

Comparison of the microarray predictor with some clinical predictors, using logistic regression with outcome prognosis:

Re-use:

  Model        Coef     Stand. Err.   Z score   p-value
  microarray    4.096      1.092        3.753     0.000
  angio         1.208      0.816        1.482     0.069
  er           -0.554      1.044       -0.530     0.298
  grade        -0.697      1.003       -0.695     0.243
  pr            1.214      1.057        1.149     0.125
  age          -1.593      0.911       -1.748     0.040
  size          1.483      0.732        2.026     0.021

Pre-validated:

  Model        Coef     Stand. Err.   Z score   p-value
  microarray    1.549      0.675        2.296     0.011
  angio         1.589      0.682        2.329     0.010
  er           -0.617      0.894       -0.690     0.245
  grade         0.719      0.720        0.999     0.159
  pr            0.537      0.863        0.622     0.267
  age          -1.471      0.701       -2.099     0.018
  size          0.998      0.594        1.681     0.046
Pre-validation process

[Schematic (slide 42): the observations are divided into folds. For each fold in turn, the predictor-building procedure is applied to the omitted data (the predictors and response for the other folds), and the resulting rule is applied to the held-out observations to give their pre-validated predictor values. The pre-validated predictor is then included, together with the fixed clinical predictors, in a logistic regression on the response.]
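The loop above can be sketched in code for a screening-plus-centroid predictor like the one described in the motivating example (dimensions shrunk, data synthetic, all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 78, 500                      # 78 cases; 500 genes stand in for 4,918
X = rng.normal(size=(n, p))
y = np.r_[np.zeros(44), np.ones(34)].astype(int)   # good / poor prognosis

def build_classifier(Xtr, ytr, k=70):
    """Screen the k genes most correlated with the labels, then form a
    nearest-centroid rule on them (steps 1-2 of the construction)."""
    c = np.abs([np.corrcoef(Xtr[:, j], ytr)[0, 1] for j in range(Xtr.shape[1])])
    sel = np.argsort(c)[-k:]
    c0 = Xtr[np.ix_(ytr == 0, sel)].mean(axis=0)
    c1 = Xtr[np.ix_(ytr == 1, sel)].mean(axis=0)
    def predict(Xnew):
        d0 = ((Xnew[:, sel] - c0) ** 2).sum(axis=1)
        d1 = ((Xnew[:, sel] - c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)
    return predict

# Pre-validation: each case's derived predictor z_i comes from a classifier
# that never saw case i (here, 13 folds of 6 cases each).
z = np.empty(n, dtype=int)
for test in np.array_split(rng.permutation(n), 13):
    train = np.setdiff1d(np.arange(n), test)
    z[test] = build_classifier(X[train], y[train])(X[test])

# z can now be compared fairly against the clinical predictors, e.g. by
# including it in a logistic regression of y on z and the clinical covariates.
print(np.mean(z == y))   # near 0.5 here, since this X carries no real signal
```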