Statistical Learning
Contents

1 Introduction

I Classification

2 Problem setting

3 Discriminant Analysis
  3.1 Introduction
  3.2 Linear Discriminant Analysis
  3.3 Quadratic Discriminant Analysis
  3.4 Regularized Discriminant Analysis
  3.5 Reduced-Rank Linear Discriminant Analysis
  3.6 Logistic Regression
    3.6.1 Logistic regression vs linear discriminant analysis
  3.7 Mixture Discriminant Analysis
  3.8 Kernel Density Discriminant Analysis
    3.8.1 Kernel density estimation
    3.8.2 Kernel density discriminant analysis
    3.8.3 Naive Bayes classifier

4 Basis Expansions
  4.1 Introduction
  4.2 Piecewise Polynomials
  4.3 Splines
    4.3.1 Cubic splines
    4.3.2 Natural cubic splines
  4.4 Smoothing Splines
    4.4.1 Splines as linear operators and effective degrees of freedom
  4.5 Multidimensional Splines
    4.5.1 Multivariate regression splines

7 Classification Trees
  7.1 Introduction
  7.2 Growing Trees
  7.3 Pruning Trees
  7.4 Bagging
  7.5 Bumping
  7.6 Boosting

8 Nearest-Neighbor Classification
  8.1 Introduction
  8.2 K-Nearest-Neighbors
    8.2.1 Nearest-Neighbors in High Dimensions
  8.3 Adaptive Nearest-Neighbors Methods

II Regression

10 Problem setting
In the early days of data collection, data were scarce and the goal of statistics
was to extract as much information as possible from these data. Since the
inception of electronic data recording, and due to ever increasing and cheaper
storage capacity, data collections have grown considerably in size and are
still growing. Nowadays, huge data collections are frequently available. As
a result, statisticians have been confronted with a new challenge: how can
valuable information be retrieved from these huge databases while filtering
out the (high percentage of) noise? Instead of a lack of data, there is now an
abundance of data, which causes computational as well as methodological
problems. The high percentage of noise causes standard statistical methods
to produce very imprecise and unstable results, so important trends are no
longer detected. The instability of the models makes it difficult to generalize
them, e.g. to reliably predict new outcomes. Statisticians and computer
scientists from machine learning/pattern recognition have developed a
collection of methods that are intended to obtain stable and accurate
information from large data collections in a reasonably fast way. A selection
of these techniques will be discussed in this course text.
Chapter 1
Introduction
In this course we will consider supervised learning problems, that is, there
is an outcome variable present in the data. Hence, we have a dataset with
an outcome (response/dependent variable) and a set of features (predictors
or input/explanatory/independent variables) from which we need to build a
good model that explains/fits the outcome well and can be used for predict-
ing the outcome of new objects for which the features are known. Such a
model is sometimes called a (statistical) learner. The dataset at hand is the
training dataset that contains for a set of objects both the outcome mea-
surement(s) and the feature measurements. This training dataset is used
to guide the learning process. If the outcome is qualitative (categorical) we
call the problem a classification problem. Classification methods will be
discussed in the first part of the course. If the outcome is quantitative we
call the problem a regression problem. Regression methods will be discussed
in the second part of the course.
We will not discuss unsupervised learning problems in this course. In
this case there is a dataset with a set of features for each object, but there
is no (clear) outcome measurement. The task here is to find interesting
groups (clusters) or patterns that give insight into how the data are organized.
Most of the unsupervised learning techniques focus on clustering the objects
in groups. Such techniques are discussed in the course Multivariate Data
Analysis.
Part I
Classification
Chapter 2
Problem setting
Object   Sepal Length   Sepal Width   Petal Length   Petal Width   Species
1        5.1            3.5           1.4            0.2           setosa
2        4.9            3.0           1.4            0.2           setosa
3        4.7            3.2           1.3            0.2           setosa
...      ...            ...           ...            ...           ...
51       7.0            3.2           4.7            1.4           versicolor
52       6.4            3.2           4.5            1.5           versicolor
53       6.9            3.1           4.9            1.5           versicolor
...      ...            ...           ...            ...           ...
101      6.3            3.3           6.0            2.5           virginica
102      5.8            2.7           5.1            1.9           virginica
103      7.1            3.0           5.9            2.1           virginica
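The rows above match Fisher's iris data, which ships with base R (the column names there are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species); a quick way to inspect it:

# Inspect the iris data in R.
data(iris)
head(iris, 3)          # first three setosa observations
table(iris$Species)    # 50 observations per class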
$$E_{ML}(\hat{G}(x)) = \sum_{r=1}^{k} L(r, \hat{G}(x))\, P(r \mid X = x). \qquad (2.1)$$
The classifier is thus defined pointwise. For each feature vector x, the class
estimate is given by the solution of (2.2). For the standard zero-one loss,
the classifier simplifies to
$$\hat{G}(x) = \operatorname{argmin}_{g \in \mathcal{G}} \sum_{r \neq g} P(r \mid X = x). \qquad (2.3)$$
Since $\sum_{r \in \mathcal{G}} P(r \mid X = x) = 1 = P(g \mid X = x) + \sum_{r \neq g} P(r \mid X = x)$, we can
rewrite (2.3) as
$$\hat{G}(x) = \operatorname{argmin}_{g \in \mathcal{G}} \left[1 - P(g \mid X = x)\right]$$
or equivalently
$$\hat{G}(x) = \operatorname{argmax}_{g \in \mathcal{G}} P(g \mid X = x). \qquad (2.4)$$
This solution is called the Bayes classifier. It says that we should classify
an object with features x to the most probable class as determined by the
conditional distribution $P(G \mid X = x)$. The quantity
$$E_X\left[E_{ML}(\hat{G}(X))\right], \qquad (2.6)$$
with $\hat{G}(x)$ the Bayes classifier, is called the optimal Bayes error rate. This is
the error rate we would obtain if the conditional distribution $P(G \mid X)$ were
known.
[Figure: Bayesian Classifier; axes X1 and X2.]
Chapter 3
Discriminant Analysis
3.1 Introduction
Suppose we have k classes and let us denote by $f_r(x)$ the conditional density
of X in class G = r. Moreover, let $P(G = r) = \pi_r$ be the prior probability
that an observation belongs to class r, with $\sum_{r=1}^{k} \pi_r = 1$. By applying Bayes'
theorem we obtain
$$P(r \mid X = x) = \frac{f_r(x)\,\pi_r}{\sum_{s=1}^{k} f_s(x)\,\pi_s}. \qquad (3.1)$$
Since the denominator is the same for all classes, it suffices to
estimate the conditional class densities $f_r(x)$ for each of the classes. In the
next sections we will consider different ways to model these densities.
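As a small illustration of (3.1), the R sketch below turns user-supplied class densities and priors into posterior class probabilities; the normal densities plugged in here are purely an example, not part of any particular model.

# Sketch of (3.1): posterior class probabilities from class densities and priors.
# `dens` is a list with one density function per class.
posterior <- function(x, dens, prior) {
  num <- prior * sapply(dens, function(f) f(x))
  num / sum(num)                     # divide by the common denominator of (3.1)
}

dens  <- list(function(x) dnorm(x, mean = 0), function(x) dnorm(x, mean = 2))
prior <- c(0.7, 0.3)
posterior(1.2, dens, prior)          # posterior probabilities for the two classes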
The function $d_{\Sigma_r}(x, \mu_r) = \sqrt{(x - \mu_r)^t \Sigma_r^{-1} (x - \mu_r)}$ is the Mahalanobis
distance between x and $\mu_r$. This distance takes the variances of the different
components into account as well as the correlations between them, as given by
the covariance matrix $\Sigma_r$. In general, the Mahalanobis distance between any
two points x and y in $\mathbb{R}^d$ is given by $d_{\Sigma_r}(x, y) = \sqrt{(x - y)^t \Sigma_r^{-1} (x - y)}$.
We also call this the distance between x and y determined by $\Sigma_r$. This
distance function satisfies the properties of a metric.
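A minimal R sketch of this distance; note that the built-in function mahalanobis() returns the squared distance.

# Mahalanobis distance d_Sigma(x, mu); base R's mahalanobis() gives the square.
mahal_dist <- function(x, mu, Sigma) {
  d <- x - mu
  sqrt(drop(t(d) %*% solve(Sigma) %*% d))
}

Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
mahal_dist(c(1, 2), mu = c(0, 0), Sigma = Sigma)
sqrt(mahalanobis(c(1, 2), center = c(0, 0), cov = Sigma))  # same value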
In Linear Discriminant Analysis (LDA) we assume that the classes all share
a common covariance matrix, $\Sigma_r = \Sigma$ for all classes r. From (2.4) and (3.1)
it follows that we classify x in class r if $\phi_r(x)\pi_r$ is maximal. Using (3.2) and
taking the logarithm to simplify things, we obtain for each class a linear
discriminant score.
[Figure 3.1: two panels with class centers marked '+'; the right panel shows decision boundaries estimated from a sample; axes X1 and X2.]
the problem of the unknown class densities yet. We have modeled the class
densities as Gaussian distributions with a common covariance matrix to ob-
tain these scores. However, these scores depend on the unknown parameters
of the Gaussian distributions. Hence, to use the linear discriminant rule in
practice, we still need to estimate the centers and the common covariance
matrix of the Gaussian distributions from the training data. Often, there
is no relevant information about the prior class probabilities available, and
in that case these probabilities can also be estimated from the data. The
standard estimates are
$$\hat{\pi}_r = \frac{n_r}{n}, \quad \text{where } n_r \text{ is the number of observations in class } r, \qquad (3.5)$$
$$\hat{\mu}_r = \frac{1}{n_r} \sum_{g_i = r} x_i, \qquad (3.6)$$
$$\hat{\Sigma} = \frac{1}{n - k} \sum_{r=1}^{k} \sum_{g_i = r} (x_i - \hat{\mu}_r)(x_i - \hat{\mu}_r)^t. \qquad (3.7)$$
Hence, the estimated centers µ̂r are the sample means of the observations
in each class and the covariance matrix estimate Σ̂ is the pooled covariance
matrix of the observations in all classes. The right panel of Figure 3.1 shows
the estimated decision boundaries based on a sample of size 75 from each of
the Gaussian distributions. In this example the estimated decision bound-
aries are close to the optimal Bayes boundaries, as could be expected. Of
course, in general the optimal Bayes boundaries are not necessarily linear.
Let us consider the example in Figure 2.1. These data were a sample from
a mixture of three bivariate normal distributions with different covariance
matrices. Figure 2.1 already showed that the optimal boundaries are not
linear in this case. Figure 3.2 shows the estimated decision boundaries by
linear discriminant analysis. Although the optimal boundaries are nonlin-
ear we see that the linear boundaries yield a good approximation of these
boundaries.
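A sketch of the estimates (3.5)-(3.7) on the iris data is given below; this is essentially what MASS::lda computes internally, but the code is only an illustration of the formulas, not the package's implementation.

# Estimates (3.5)-(3.7): class priors, class means and pooled covariance matrix.
X <- as.matrix(iris[, 1:4]); g <- iris$Species
n <- nrow(X); k <- nlevels(g); d <- ncol(X)

pi_hat <- table(g) / n                                             # (3.5)
mu_hat <- t(sapply(levels(g), function(r) colMeans(X[g == r, ])))  # (3.6)

Sigma_hat <- matrix(0, d, d)                                       # (3.7)
for (r in levels(g)) {
  Xr <- scale(X[g == r, ], center = mu_hat[r, ], scale = FALSE)
  Sigma_hat <- Sigma_hat + t(Xr) %*% Xr
}
Sigma_hat <- Sigma_hat / (n - k)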
[Figure 3.2: decision boundaries estimated by linear discriminant analysis for the example of Figure 2.1; axes X1 and X2.]
$$\hat{\mu}_r = \frac{1}{n_r} \sum_{g_i = r} x_i$$
$$\hat{\Sigma}_r = \frac{1}{n_r - 1} \sum_{g_i = r} (x_i - \hat{\mu}_r)(x_i - \hat{\mu}_r)^t.$$
[Figure: two panels with class centers marked '+'; axes X1 and X2.]
has revealed that LDA requires at least (k − 1)(d + 1) parameters and QDA
requires at least (k − 1)(d(d + 3)/2 + 1) parameters. This means that the
difference in number of parameters still increases linearly with the number
of groups and quadratically with the dimension d. Estimating more parameters
means a loss of precision for the estimates, hence QDA can be expected to
have a larger variability than LDA. On the other hand, LDA can be expected
to have a larger bias if the true decision boundaries are nonlinear. Many
studies have reported very good performance of LDA and QDA in a wide
range of settings (both simulated and real data). In several of these settings
the data are far from Gaussian. A possible explanation is that the data
can only support simple linear or quadratic boundaries obtained by stable
estimates. More sophisticated techniques that produce complex boundaries
have too low a precision to behave well. For QDA the stability will only be
guaranteed in low dimensions.
where $\hat{\Sigma}_r$ is the group covariance matrix as used in QDA and $\hat{\Sigma}$ is the pooled
covariance matrix of all groups as used in LDA. The value of α ∈ [0, 1] needs
to be selected based on the performance of the model on a validation sample
or by using cross-validation. In practice, the model is constructed for a range
of α values (e.g. by taking a step size of 0.1) and the performance of these
models is compared. The model with the best performance is then selected.
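A sketch of this selection procedure is given below; fit_rda() and the training/validation objects are hypothetical placeholders for whatever routine fits the regularized model for a given α and predicts the validation classes.

# Grid search over alpha in [0, 1] with step 0.1, scored on a validation set.
# fit_rda(), x_train, g_train, x_valid, g_valid are hypothetical placeholders.
alphas <- seq(0, 1, by = 0.1)
err <- sapply(alphas, function(a) {
  pred <- fit_rda(x_train, g_train, alpha = a, newdata = x_valid)
  mean(pred != g_valid)              # validation misclassification rate
})
best_alpha <- alphas[which.min(err)]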
The idea is to find the one-dimensional direction in which the groups are
best separated.
Note that it is easier to find the direction that best separates the group
means, because for this problem only the variance between the group means
(between-class variance) needs to be maximized, without considering the
within-class variance. However, this direction does not always separate the
groups well, because the correlations within the groups are ignored. This is
illustrated in the left panels of Figure 3.4. This dataset consists of two
groups with different centers (indicated by '+') but the same covariance
matrix. In a two-class problem the direction maximizing the between-class
variance is the direction that connects the two centers, which in this case is
just the first component. Projecting the data on the first component (lower
left panel) shows that there is considerable overlap between the two groups
in this direction. However, if we take the within-class covariance into account we
get the projection shown in the right panels. Although the group centers
are closer together in this direction, the two groups are almost completely
separated.
Figure 3.4: Mixture of two bivariate normals with different centers but a
common covariance matrix. The left panels show the direction maximizing
the variance between the group centers and the separation of the groups in
that direction. The right panels show the discriminant coordinate direction
and the group separation in that direction.
$$\max_{a} \; a^t B a \quad \text{subject to} \quad a^t W a = 1.$$
to the order of the discriminant coordinates, this loss is often small. In fact,
sometimes it is even beneficial to use fewer than all discriminant coordinates
to classify objects. Suppose we only use the first l < k − 1 discriminant
coordinates; then this is equivalent to assuming a mixture of Gaussian dis-
tributions with common covariance matrix and additionally restricting the
group centers to lie in an l-dimensional affine subspace of the d-dimensional
feature space. This method is called Reduced-Rank Linear Discriminant
Analysis.
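A sketch of how the discriminant coordinate directions defined by the criterion above can be computed, assuming the between-class matrix B and within-class matrix W have already been estimated from the training data: the maximizing directions are the leading eigenvectors of $W^{-1}B$.

# First l discriminant coordinate directions as leading eigenvectors of W^{-1} B.
# The directions can afterwards be rescaled so that a' W a = 1.
disc_coords <- function(B, W, l = 1) {
  e <- eigen(solve(W) %*% B)
  Re(e$vectors[, seq_len(l), drop = FALSE])
}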
We can demand more explicitly that the decision boundaries are linear. If we
determine score functions δr (x) for all classes r = 1, . . . , k, then the decision
boundary between two classes is given by the set of points which satisfies
δr (x) = δs (x). The decision boundaries are linear, that is two classes are
separated by a hyperplane if {x; δr (x) = δs (x)} = {x; α0 + α1t x = 0} for
some coefficients α0 and α1 . This can be achieved by requiring that some
monotone transformation of the score functions δr (x) is a linear function.
For example, if we model the posterior probabilities P (r|X = x) for a two
class problem, then a popular choice is
$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta_1^t x)}{1 + \exp(\beta_0 + \beta_1^t x)}$$
$$P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta_1^t x)}. \qquad (3.10)$$
Therefore, the decision boundary (the set of points for which the log-odds
are zero) is linear.
Note that requiring the decision boundary to be linear is not necessarily
a very restrictive condition since the predictor space can be expanded by
including e.g. squares and cross-products of the basic predictors. Linear
functions in this augmented space then become quadratic functions in the
original (basic) space.
In the general k-class problem the logistic regression model (3.10) expands to
$$P(G = r \mid X = x) = \frac{\exp(\beta_{r0} + \beta_{r1}^t x)}{1 + \sum_{s=1}^{k-1} \exp(\beta_{s0} + \beta_{s1}^t x)}, \quad r = 1, \ldots, k - 1,$$
$$P(G = k \mid X = x) = \frac{1}{1 + \sum_{s=1}^{k-1} \exp(\beta_{s0} + \beta_{s1}^t x)}. \qquad (3.12)$$
This model ensures that the probabilities are all between 0 and 1 and sum
to 1. Using the logit transformation we obtain
$$\log\left(\frac{P(G = r \mid X = x)}{P(G = k \mid X = x)}\right) = \beta_{r0} + \beta_{r1}^t x, \quad r = 1, \ldots, k - 1.$$
Note that the model uses the last class (k) as the reference class (in the
denominator) to which each other class is compared. However, the choice of
denominator is arbitrary with different choices leading to the same classifica-
tion. The logistic regression coefficients are usually estimated by maximum
likelihood. This leads to a set of estimating equations that can be solved
numerically by using iteratively reweighted least squares.
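A minimal sketch of this procedure for the two-class model (3.10) is given below; in practice glm(y ~ x, family = binomial) in R performs the same computation.

# Iteratively reweighted least squares for two-class logistic regression.
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  X <- cbind(1, X)                          # add intercept column
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    p <- 1 / (1 + exp(-drop(X %*% beta)))   # current probabilities
    W <- p * (1 - p)                        # weights
    z <- drop(X %*% beta) + (y - p) / W     # working response
    beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}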
with P (X) the marginal density of X. Both models assume the same form
for the conditional density P (G = r|X). LDA specifies the marginal density
through (3.2). That is, P (X, G = r) = ϕr (x)πr with ϕr the Gaussian
density. This means that the marginal density P (X) is a normal mixture
density
$$P(X) = \sum_{r=1}^{k} \phi_r(x)\,\pi_r \qquad (3.14)$$
and thus has a fully parametric form. This parametric model allows us to
estimate the parameters efficiently, but if the model is incorrect we can get
unreliable results.
In logistic regression the marginal distribution P (X) is left unspecified
and only the conditional distribution P (G = r|X) is modeled. Hence, this
is a semiparametric approach. There will be a loss in efficiency compared to
LDA if the marginal density is indeed a mixture of normals, but on the other
hand this approach is more robust because it mainly relies on observations
close to the boundary to estimate the boundaries.
3.7 Mixture Discriminant Analysis
In LDA and QDA we assign observations to the closest class, where the dis-
tance is measured between the observation and the center of each class, using
an appropriate metric. Hence, each class is represented by a single point,
called a prototype. If classes are inhomogeneous, then a single prototype for
each class might not be sufficient. In such cases we can use a mixture model
for the density of X in each class
$$P(X = x \mid r) = f_r(x) = \sum_{l=1}^{M_r} \pi_{lr}\,\phi_{lr}(x) \qquad (3.15)$$
where $\phi_{lr}$ is the Gaussian density (3.2) with mean $\mu_{lr}$ and covariance matrix
$\Sigma_{lr}$. The mixing proportions $\pi_{lr}$ sum to one, that is $\sum_{l=1}^{M_r} \pi_{lr} = 1$. Each
class r is now represented by $M_r$ prototypes. Using (3.1) we obtain the class
posterior probabilities
$$P(r \mid X = x) = \frac{\pi_r \sum_{l=1}^{M_r} \pi_{lr}\,\phi_{lr}(x)}{\sum_{s=1}^{k} \pi_s \sum_{l=1}^{M_s} \pi_{ls}\,\phi_{ls}(x)}. \qquad (3.16)$$
This is sometimes also called the Parzen estimate. A popular choice for
the kernel function is the Gaussian kernel $K_\lambda(x_0, x) = \phi(|x - x_0|/\lambda)$. The
standard deviation λ determines the size of the window around $x_0$ in which
observations get a large weight. The choice of λ again corresponds to a bias-
variance trade-off. A small window implies that the density estimate $\hat{f}_X^k(x_0)$
is based on a small number of observations $x_i$ (close to $x_0$). Consequently,
the variance of the estimate will be large, but the bias will be small because
the estimator can accommodate well to the information about the density
given by the observations close to $x_0$. On the other hand, if the window is
wide, the density estimate $\hat{f}_X^k(x_0)$ is a weighted average over a large number
of observations, leading to a smaller variance. However, the estimator now
also uses information from observations further from $x_0$. Such observations
will more likely be in regions where the density is different, introducing bias
in the estimate. Note that if λ is very large, we just fit a normal density to
the data, so we end up with a parametric density estimate.
If we denote by ϕ(x; λ) the Gaussian density with mean 0 and standard
deviation λ, then the nonparametric density estimate (3.17) with Gaussian
kernel can be written as
$$\hat{f}_X^k(x_0) = \frac{1}{n} \sum_{i=1}^{n} \phi(x_0 - x_i; \lambda). \qquad (3.18)$$
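A sketch of (3.18) in R; the built-in density() function provides a more careful implementation of the same idea.

# Gaussian kernel density estimate (3.18) evaluated at the points in x0.
kde_gauss <- function(x0, x, lambda) {
  sapply(x0, function(p) mean(dnorm(p - x, mean = 0, sd = lambda)))
}

x <- rnorm(200)
kde_gauss(c(-1, 0, 1), x, lambda = 0.3)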
$$\hat{P}(r \mid X = x) = \frac{\hat{f}_r^k(x)\,\hat{\pi}_r}{\sum_{s=1}^{k} \hat{f}_s^k(x)\,\hat{\pi}_s}. \qquad (3.20)$$
Note that kernel density discriminant analysis can be seen as the limit case
of mixture discriminant analysis with n prototypes in each class.
Analogous to the regularization parameter α in regularized discriminant
analysis, the parameter λ can be selected by comparing the performance
(error rate) for different λ values on an independent validation sample. Al-
ternatively, if a validation sample is not available, cross-validation can be
used.
$$f_r(X) = \prod_{j=1}^{d} f_{r,j}(X_j). \qquad (3.21)$$
These density estimates are then used in (3.1) to obtain discriminant rules.
The problem thus reduces to estimating the univariate marginal densities
which simplifies computation drastically. The original naive Bayes method
uses univariate Gaussian distributions to estimate the marginal densities.
A straightforward extension uses nonparametric kernel density estimates
of the univariate densities fr,j (Xj ). The technique can even incorporate
discrete explanatory variables Xj . The marginal probabilities can then easily
be estimated by the proportions in a bar chart or frequency table. Hence, the
naive Bayes method is a computationally efficient technique that is widely
applicable.
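A minimal sketch of the original (Gaussian) naive Bayes method, combining (3.21) with (3.1), is given below; packages such as e1071 provide full implementations.

# Gaussian naive Bayes: univariate normal marginals within each class (3.21),
# combined with the class priors via (3.1).
nb_fit <- function(X, g) {
  list(levels = levels(g),
       prior  = as.numeric(table(g)) / length(g),
       mean   = lapply(levels(g), function(r) colMeans(X[g == r, ])),
       sd     = lapply(levels(g), function(r) apply(X[g == r, ], 2, sd)))
}
nb_predict <- function(fit, xnew) {
  post <- sapply(seq_along(fit$levels), function(r) {
    fit$prior[r] * prod(dnorm(xnew, fit$mean[[r]], fit$sd[[r]]))
  })
  fit$levels[which.max(post)]
}

fit <- nb_fit(as.matrix(iris[, 1:4]), iris$Species)
nb_predict(fit, c(6.0, 3.0, 4.8, 1.8))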
Of course the assumption of independent feature components within each
class will generally not be true. Despite this strong assumption, the naive
Bayes technique often performs very well compared to more sophisticated
methods.
Chapter 4
Basis Expansions
4.1 Introduction
Many techniques will impose a (simple) structure for e.g. the decision bound-
ary in classification or the regression function (next part). For instance, LDA
and many of its extensions assume linear decision boundaries while many
regression techniques consider linear regression. Such a relatively simple
structure is often a convenient or even necessary restriction, but will in gen-
eral not be the correct structure but only an approximation. As already
mentioned, the restriction can be relaxed by increasing the predictor space
by adding e.g. squares and cross-products of the measured predictors and
using the technique in the enlarged/transformed predictor space.
In general, consider transformations $h_m(X) : \mathbb{R}^d \to \mathbb{R}$ of X, m = 1, ..., M.
The new transformed predictor space is then given by $h(X) = (h_1(X), \ldots, h_M(X))$.
A linear technique or linear expansion in X, $T(X) = \sum_{j=1}^{d} \beta_j X_j$, can
consequently be extended to
$$\tilde{T}(X) = T(h(X)) = \sum_{m=1}^{M} \beta_m h_m(X),$$
• $h_m(X) = \log(X_j)$, $h_m(X) = \sqrt{X_j}$, $h_m(X) = \|X\|$, ... are standard
nonlinear transformations of measured predictors.
where I(.) is the indicator function that takes the value 1 within the specified
interval and zero elsewhere. The piecewise polynomial function $\tilde{T}(X)$ is
now given by $\tilde{T}(X) = \sum_{m=1}^{3} \beta_m h_m(X)$, a linear function of the transformed
predictors. Since only one transformed predictor is different from zero in each of the
intervals, the coefficients $\beta_m$ can be estimated by $\hat{\beta}_m = \bar{Y}_m$, with $\bar{Y}_m$ the mean
of the Y values in the mth region. In discriminant analysis the Y variables
will be dummy variables corresponding to class membership for each of the
classes. In regression Y will be the quantitative response variable.
A piecewise linear fit is obtained by adding three additional basis functions
$$h_{m+3}(X) = h_m(X)\,X, \quad m = 1, 2, 3.$$
These functions thus equal X within the specified interval but are zero outside
that interval. This leads to the approximation $\tilde{T}(X) = \sum_{m=1}^{6} \beta_m h_m(X)$.
Both of the above approximations have the drawback of not being continuous
at the knots, that is, the endpoints of the intervals. To introduce continuity
at the knots we need two restrictions of the form $\tilde{T}(\xi^-) = \tilde{T}(\xi^+)$, one at each knot.
4.3 Splines
A more direct approach to impose continuity and other desirable properties
is to use basis functions that incorporate the necessary constraints. For the
above example of a piecewise linear continuous fit we can use the following
basis functions
where $(\cdot)_+$ denotes the positive part: the resulting function equals the function between
brackets when it is positive, and is zero otherwise. Note that the number of
basis functions now corresponds to the number of free parameters.
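The explicit basis functions were not reproduced above; a common choice for a continuous piecewise-linear fit with two knots ξ1, ξ2 is the truncated power basis {1, X, (X − ξ1)+, (X − ξ2)+}, which is assumed in the sketch below, with coefficients estimated by least squares.

# Continuous piecewise-linear fit via a truncated power basis (assumed knots).
pos <- function(u) pmax(u, 0)
knots <- c(0.3, 0.7)
x <- runif(100); y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
B <- cbind(x, pos(x - knots[1]), pos(x - knots[2]))
fit <- lm(y ~ B)        # the intercept supplies the constant basis function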
Hence, we fit local cubic functions with constraints at the knots that ensure
that at the knots the values of the left and right parts have the same function
value and the same values of the first and second derivatives.
In theory splines can be defined with constraints beyond second deriva-
tives but in practice there is seldom any need to go beyond cubic splines.
Splines with fixed knots introduced here are often called regression splines.
In practice, one needs to select the order of the spline, the number of knots,
and their location. Often the observations are used as the knots.
with $d_l(X) = \dfrac{(X - \xi_l)_+^3 - (X - \xi_q)_+^3}{\xi_q - \xi_l}$.
$$RSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt \qquad (4.2)$$
The first term of (4.2) measures the goodness-of-fit, that is the closeness of
the fit to the data. The second term of (4.2) is the penalty term that penal-
izes curvature in the function. Functions with curvature have second order
derivatives different from zero and thus get penalized. Hence, the criterion
shows a preference for simpler (linear) fits if they fit the data reasonably
well. The parameter λ is the smoothing parameter which controls the trade-
off between goodness-of-fit and curvature. A larger value of λ means a
bigger preference for simple linear functions, a smaller λ value means less
penalization for more complex functions with curvature. The two limit cases
are
λ = 0: No penalty. f can be any function that fits the data exactly, leading
to an RSS equal to zero.
$$\hat{f} = N\hat{\beta} = N(N^t N + \lambda\Omega)^{-1} N^t y = S_\lambda y. \qquad (4.4)$$
This means that the fitted values are a linear function of the responses
y. The linear operator $S_\lambda$ is known as the smoother matrix. Note that
this is similar to least squares regression, where we have $\hat{f}_{LS} = Hy$ with
$H = X(X^t X)^{-1} X^t$ the hat matrix. As for least squares, the linear operator
$S_\lambda$ does not depend on y but only on the observations $x_i$ and λ.
When using regression splines (e.g. cubic splines), we first transform
the predictor variable X to an M-dimensional (M < n) spline basis space,
leading to an $n \times M$ matrix $B_\xi$ for the transformed feature space, where ξ
is the knot sequence used by the regression spline. Applying least squares
to find the regression coefficients yields the fitted values
$$\hat{f} = B_\xi\hat{\beta} = B_\xi(B_\xi^t B_\xi)^{-1} B_\xi^t y = H_\xi y.$$
The hat matrix $H_\xi$ is now a projection matrix that projects the response
vector y onto the M-dimensional space spanned by the columns of the new
feature matrix $B_\xi$.
• $H_\xi$ has rank M < n, while $S_\lambda$ has rank n. That is, while $H_\xi$ projects
y on the M-dimensional space spanned by the columns of $B_\xi$, the
smoother matrix $S_\lambda$ does not project y onto a space of lower di-
mension, but instead finds an approximation in the full n-dimensional
space that satisfies the conditions imposed by the penalty term.
This definition of the effective degrees of freedom can be used to select the
value of the parameter λ. A smaller effective degrees of freedom means a
larger λ and vice versa. This is explained below in more detail.
The smoother matrix $S_\lambda$ can be rewritten as
$$S_\lambda = N(N^t N + \lambda\Omega)^{-1} N^t = N[N^t(N + \lambda(N^t)^{-1}\Omega)]^{-1} N^t = N[N^t(I + \lambda(N^t)^{-1}\Omega N^{-1})N]^{-1} N^t$$
$$= NN^{-1}[I + \lambda(N^t)^{-1}\Omega N^{-1}]^{-1}(N^t)^{-1} N^t = [I + \lambda(N^t)^{-1}\Omega N^{-1}]^{-1} = [I + \lambda K]^{-1} \qquad (4.6)$$
$$\min_{f} \; (y - f)^t(y - f) + \lambda f^t K f, \qquad (4.7)$$
For this reason K is called the penalty matrix. Since $S_\lambda$ is a symmetric
positive semidefinite matrix, it has a real eigen-decomposition
$$S_\lambda = \sum_{j=1}^{n} \rho_j(\lambda)\, v_j v_j^t \qquad (4.8)$$
$$\text{with } \rho_j(\lambda) = \frac{1}{1 + \lambda d_j}. \qquad (4.9)$$
$$S_\lambda y = \sum_{j=1}^{n} \rho_j(\lambda)\,(v_j^t y)\, v_j$$
The decomposition of y is $y = \sum_{j=1}^{n} (v_j^t y)\, v_j$. Hence the linear smoother
first decomposes the vector y w.r.t. the basis $v_1, \ldots, v_n$ and then shrinks
the weight $(v_j^t y)$ of each basis vector by the factor $\rho_j(\lambda)$. Since $S_\lambda$ has
rank n, the factors $\rho_j(\lambda)$ are all larger than zero and at most 1, as
can be seen from (4.9). The size of a factor $\rho_j(\lambda)$ determines the amount of
shrinkage applied to the contribution of each of the basis vectors.
In contrast, a least squares regression spline method is based on a pro-
jection matrix $H_\xi$ of rank M. The property that a projection matrix is
idempotent implies that the eigenvalues $\rho_j(H_\xi)$ of $H_\xi$ can only take the
values zero or one. M eigenvalues will be one and the remaining n − M
eigenvalues are zero. This means that the contribution of a basis vector to
y is either fully used, or ignored (shrunk to zero). This contrast is often
reflected by calling smoothing splines shrinking smoothers while calling re-
gression splines projection smoothers.
It follows from (4.9) that the eigenvalues ρj (λ) of S λ are an inverse func-
tion of the eigenvalues dj of K. Hence, small eigenvalues of S λ correspond
to large eigenvalues of K and vice versa. The smoothing parameter λ con-
trols the rate by which the eigenvalues ρj (λ) of S λ decrease to zero. The
penalty matrix K depends on Ω, the matrix of second derivatives of the ba-
sis functions. Since the first two basis functions $h_1^N(X) = 1$ and $h_2^N(X) = X$
have second derivatives equal to zero, the corresponding eigenvalues of K
will be zero. From (4.9) it then follows that the corresponding eigenvalues
of S λ are always equal to 1. Hence, the linear contributions to y are never
shrunk.
Since the trace of a matrix also equals the sum of its eigenvalues, we
have that
$$df_\lambda = \operatorname{trace}(S_\lambda) = \sum_{j=1}^{n} \rho_j(\lambda). \qquad (4.10)$$
Larger values of λ lead to more severe penalization and yield smaller eigen-
values (according to (4.9)), and thus also a smaller degrees of freedom. On
the other hand, smaller λ values correspond to larger degrees of freedom.
For the limiting case λ = 0 it follows from (4.6) that S λ = I which implies
that ρj (λ) = 1 for j = 1, . . . , n and thus dfλ = n. On the other hand, for
λ → ∞, all eigenvalues ρj (λ) = 0 except the first two for which dj = 0 and
thus ρ1 (λ) = ρ2 (λ) = 1 as always. Hence, dfλ = 2 and in fact, since the
eigenvalues of S λ are all zero or one, S λ now equals the projection matrix
H, the hat matrix for linear regression of y on X.
Since dfλ is a monotone function of the smoothing parameter λ, we
can specify the degrees of freedom and derive λ from it. For instance, in
R, the function smooth.spline takes the effective degrees of freedom as a
possible input parameter to specify the amount of smoothing desired. Using
the effective degrees of freedom allows us to treat model selection for spline
methods in parallel with model selection for traditional parametric methods. Spline
based fits corresponding to different effective degrees of freedom can be
compared and investigated using residual plots and other criteria, to select
an optimal fit.
As explained before, the selection of the smoothing parameter λ and thus
the selection of the effective degrees of freedom dfλ is a trade-off between
bias and variance. We illustrate this trade-off with the following example.
$$Y = f(X) + \epsilon \quad \text{with} \quad f(X) = \frac{\cos(10(X + 1/4))}{X + 1/4},$$
with X ∼ U [0, 1] and ϵ ∼ N (0, 1). The training sample consists of n = 100
samples (xi , yi ) generated independently from this model. We fit smoothing
splines for three different effective degrees of freedom, 3, 6, and 12, to these
data.
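A sketch of this simulation in R, using smooth.spline with the effective degrees of freedom passed through its df argument:

# Simulate n = 100 points from the model above and fit smoothing splines
# with effective degrees of freedom 3, 6 and 12.
f <- function(x) cos(10 * (x + 1/4)) / (x + 1/4)
set.seed(1)
x <- runif(100)
y <- f(x) + rnorm(100)
fits <- lapply(c(3, 6, 12), function(df) smooth.spline(x, y, df = df))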
Note that from (4.4) it immediately follows that
Figure 4.1: Nonlinear function f (black curve) with smoothing spline fits
for three different values of the effective degrees of freedom.
in the confidence band that reveals the increased variance which now is the
main cause for the difference between the fitted and true curves.
$$f(X) = \sum_{j=1}^{2} \sum_{m=1}^{M_j} \beta_{jm}\, g_{jm}(X)$$
where the coefficients βjm need to be estimated from the training data using
e.g. least squares.
The M1 × M2 dimensional tensor product spline basis is defined by
$$f(X) = \sum_{j=1}^{M_1} \sum_{m=1}^{M_2} \beta_{jm}\, g_{jm}(X).$$
Again, the coefficients βjm need to be estimated from the training data using
e.g. least squares. Note that the dimension of this basis grows exponentially
with the dimension d contrary to the additive basis. Hence, some regular-
ization or selection will be necessary when using the tensor basis in higher
dimensions.
$$RSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda J(f) \qquad (4.11)$$
$$f(x) = \beta_0 + \beta^t x + \sum_{j=1}^{n} \alpha_j h_j(x),$$
$$f(X) = \beta_0 + \sum_{j=1}^{d} f_j(X_j) + \sum_{j<m} f_{jm}(X_j, X_m)$$
The kernel K(x, y) allows us to define an inner product in the Hilbert space.
For any two functions $f(x) = \sum_{i=1}^{\infty} a_i\phi_i(x)$ and $g(x) = \sum_{i=1}^{\infty} b_i\phi_i(x)$ in $\mathcal{H}_K$,
the inner product is defined as
$$\langle f, g\rangle = \sum_{i=1}^{\infty} \frac{a_i b_i}{\gamma_i}. \qquad (4.16)$$
This inner product leads to the definition of a norm in the Hilbert space $\mathcal{H}_K$,
given by
$$\|f\|_{\mathcal{H}_K}^2 = \langle f, f\rangle = \sum_{i=1}^{\infty} \frac{a_i^2}{\gamma_i}. \qquad (4.17)$$
Functions f in HK satisfy the constraint
Using (4.21) and (4.22), the penalty term $J(f) = \|f\|_{\mathcal{H}_K}^2$ can be rewritten as
$$\|f\|_{\mathcal{H}_K}^2 = \langle f, f\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j \langle K(\cdot, x_i), K(\cdot, x_j)\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j K(x_i, x_j) = \alpha^t K \alpha. \qquad (4.25)$$
This yields
$$\hat{f}(x) = \sum_{i=1}^{n} \hat{\alpha}_i K(x, x_i)$$
or
$$\hat{f} = K\hat{\alpha} = K(K + \lambda I)^{-1} y = (I + \lambda K^{-1})^{-1} y.$$
This strongly resembles the smoothing spline solution (4.6) and shows that
the fitted values again are a linear function of the responses y.
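A sketch of computing the fitted values $\hat{f} = K(K + \lambda I)^{-1} y$ for a given kernel matrix; the Gaussian (radial basis) kernel used below is just one illustrative choice.

# Kernel matrix (radial basis kernel, illustrative) and penalized fit.
rbf_kernel <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2            # squared Euclidean distances
  exp(-D2 / (2 * sigma^2))
}
kernel_fit <- function(K, y, lambda) {
  alpha <- solve(K + lambda * diag(nrow(K)), y)   # (K + lambda I)^{-1} y
  list(alpha = alpha, fitted = drop(K %*% alpha))
}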
Two possible choices for the kernel function K(x, y) are introduced be-
low.
obtained through
$$\min_{\beta_1, \ldots, \beta_M} \; \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{M} \beta_j h_j(x_i)\right)^2 + \lambda \sum_{j=1}^{M} \beta_j^2.$$
Using the polynomial kernel, this problem can be rewritten as in (4.27) and
solved easily. Note that the number of eigen-functions soon becomes very
large (even larger than n); however, from (4.28) it follows that the solution
can be computed in the same order of computation time, O(n³), independent
of the degree m of the polynomial.
Chapter 5
Flexible Discriminant Analysis Methods
5.1 Introduction
In Chapter 3 we introduced the basic discriminant analysis techniques LDA
and QDA and already discussed several extensions with the purpose of in-
troducing more flexibility to obtain a better classification. Using the basis
expansions introduced in the previous chapter, we now continue the discus-
sion on discriminant analysis and show how basis expansions can be used as
an alternative approach to develop flexible discriminant analysis techniques.
Similarly, logistic regression, introduced in Chapter 3, can be made
more flexible.
$$f_r(x) = \sum_{j=1}^{l} w_j \left[\hat{\eta}_j(x) - \bar{\eta}_j^r\right]^2$$
where
$$w_j = \frac{1}{r_j^2(1 - r_j^2)},$$
with $r_j^2 = \frac{1}{n-d}\sum_{i=1}^{n} (\hat{\theta}_j(g_i) - \hat{\eta}_j(x_i))^2$, the mean squared residual of the
jth optimally scored fit. The discriminant score $f_r(x)$ is equivalent to the
Mahalanobis distance of x to $\hat{\mu}_r$, the center of class r.
Linear discriminant analysis can now be generalized by using more flex-
ible regression fits in (5.1) instead of the linear regression fits ηj (x) = xt βj .
In light of Chapter 4 we consider basis expansions of the predictor space to
obtain flexible regression fits. This leads to the minimization problem
$$\min_{\theta_j, \beta_j;\, j=1,\ldots,l} \; \sum_{j=1}^{l} \left[\frac{1}{n}\sum_{i=1}^{n} (\theta_j(g_i) - h(x_i)^t\beta_j)^2\right] \qquad (5.2)$$
which leads to
$$P(Y = 1 \mid X = x) = \frac{\exp(f(x))}{1 + \exp(f(x))} \qquad (5.5)$$
$$P(Y = 0 \mid X = x) = \frac{1}{1 + \exp(f(x))}. \qquad (5.6)$$
These probabilities can then be used to classify the object with feature value
x.
Also smoothing splines can be formulated in the logistic regression con-
text. They are obtained as the solution of the following problem. Among all
functions f (x) with two continuous derivatives, find the function that max-
imizes the penalized log-likelihood. Let us denote p(x) = P (Y = 1|X = x)
modeled as in (5.5), then the penalized log-likelihood equals
$$l(f; \lambda) = \sum_{i=1}^{n} \left[y_i \log(p(x_i)) + (1 - y_i)\log(1 - p(x_i))\right] - \frac{1}{2}\lambda \int (f''(t))^2\, dt$$
$$= \sum_{i=1}^{n} \left[y_i f(x_i) - \log(1 + \exp(f(x_i)))\right] - \frac{1}{2}\lambda \int (f''(t))^2\, dt \qquad (5.7)$$
Note that the penalty term now has a negative sign because the log-likelihood
needs to be maximized. As usual, the penalty term penalizes functions with
more curvature. As for problem (4.2) it can be shown that the optimal so-
lution is a natural cubic spline with knots at the distinct xi values. Hence,
the optimal function can be written as $f(x) = \sum_{j=1}^{n} h_j^N(x)\beta_j$. By inserting
this expression into (5.7), the penalized log-likelihood can be considered as
a function of β (for fixed value of λ), l(β; λ), and taking derivatives w.r.t. β
yields
$$\frac{\partial l(\beta; \lambda)}{\partial \beta} = N^t(y - p) - \lambda\Omega\beta \qquad (5.8)$$
$$\frac{\partial^2 l(\beta; \lambda)}{\partial \beta\, \partial \beta^t} = -N^t W N - \lambda\Omega \qquad (5.9)$$
where p is the n-dimensional vector with elements $p(x_i)$ and W is a diagonal
matrix whose diagonal elements are the weights $W_{ii} = p(x_i)(1 - p(x_i))$. As in
subsection 4.4.1, we have that $N_{ij} = h_j^N(x_i)$ and $\Omega_{ij} = \int N_i''(t) N_j''(t)\, dt$.
The optimal solution β̂ should have first derivative (5.8) equal to zero.
Unfortunately, this first derivative is a nonlinear function of β, such that
the solution can not be expressed analytically (as opposed to the linear
least squares solution). However, an iterative procedure can be executed
until convergence. For a current estimate $\beta^{current}$, let us denote
$$z = N\beta^{current} + W^{-1}(y - p) = f^{current} + W^{-1}(y - p)$$
$$\beta^{new} = (N^t W N + \lambda\Omega)^{-1} N^t W z$$
$$f^{new} = N(N^t W N + \lambda\Omega)^{-1} N^t W z = S_{\lambda,w}\, z \qquad (5.10)$$
Comparing (5.10) with (4.4), it can be seen that the update equation each
time fits a weighted smoothing spline to the working response z.
section 4.5.3 to control the dimension of the model and to keep the model
interpretation easy. The additive model replaces the linear function in (3.11)
by a more flexible additive function
$$\log\left(\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)}\right) = \log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + f_1(X_1) + \cdots + f_d(X_d). \qquad (5.11)$$
The smooth functions fj (Xj ) introduce flexibility in the logistic regression
model, while the additivity still allows us to interpret the model because
the contributions of the different predictors are kept separate in the model.
If interactions between predictors are of interest, the additive model can be
extended to ANOVA spline decompositions as introduced in subsection 4.5.3.
To guarantee a unique solution, the standard convention is to impose the
condition $\sum_{i=1}^{n} f_j(x_{ij}) = 0$ for j = 1, ..., d.
The structure of the algorithm for the additive logistic regression model
is the same as for the (flexible) logistic regression algorithm in the previous
section.
• Given a current fit $\hat{f} = \hat{\beta}_0 + \sum_{j=1}^{d}\hat{f}_j$, that is $\hat{f}(x_i) = \hat{\beta}_0 + \sum_{j=1}^{d}\hat{f}_j(x_{ij})$,
and corresponding probabilities $\hat{p}_i = \exp(\hat{f}(x_i))/[1 + \exp(\hat{f}(x_i))]$, con-
struct the working response vector
$$z = \hat{f} + W^{-1}(y - \hat{p})$$
$$RSS(\beta_0, f_1, \ldots, f_d) = \sum_{i=1}^{n} w_i\left[z_i - \beta_0 - \sum_{j=1}^{d} f_j(x_{ij})\right]^2 + \sum_{j=1}^{d} \lambda_j \int (f_j''(t_j))^2\, dt_j$$
From the condition $\sum_{i=1}^{n} f_j(x_{ij}) = 0$ for j = 1, ..., d it immediately follows that $\hat{\beta}_0 = \operatorname{ave}(z_i)$,
the average response. To estimate the components $\hat{f}_j$, the fact that each
of these components is a cubic smoothing spline is exploited as follows.
The algorithm is an iterative procedure, starting from $\hat{f}_j(x_{ij}) = 0$ for all
i = 1, ..., n and j = 1, ..., d. To obtain a new estimate $\hat{f}_l$, a weighted
smoothing spline is applied to the target vector
$$z - \hat{\beta}_0 - \sum_{j \neq l} \hat{f}_j$$
versus xl . Here, f̂j ; j ̸= l are the currently available fits for all other com-
ponents. Hence, the target vector is the residual vector that remains after
removing from the response the effects that are already explained by the
other components. The smoothing spline fit then estimates which part of
this residual vector can be explained by predictor Xl . This update procedure
is applied in turn to each of the component functions fj , associated with
the different predictors. The process is continued until convergence, that
is, until each of the estimates $\hat{f}_l$ has stabilized. This type of algorithm is
called a backfitting algorithm. The additive model and backfitting algorithm
can also be used with other componentwise regression fitting methods than
smoothing splines such as local polynomial fits, regression splines, kernel
methods, etc.
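A minimal sketch of backfitting for the plain (unweighted, Gaussian-response) additive model, with each component updated by a smoothing spline fit to the partial residuals; the weighted version used inside the logistic algorithm works analogously.

# Backfitting with smoothing-spline components; X is an n x d matrix.
backfit <- function(X, y, df = 5, n_sweeps = 10) {
  n <- nrow(X); d <- ncol(X)
  beta0 <- mean(y)
  fhat <- matrix(0, n, d)
  for (sweep in seq_len(n_sweeps)) {
    for (l in seq_len(d)) {
      partial <- y - beta0 - rowSums(fhat[, -l, drop = FALSE])   # partial residuals
      fl <- predict(smooth.spline(X[, l], partial, df = df), X[, l])$y
      fhat[, l] <- fl - mean(fl)       # enforce sum_i f_l(x_il) = 0
    }
  }
  list(beta0 = beta0, fhat = fhat)
}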
If the componentwise regression fitting method is a linear operator, such
as smoothing splines, then the effective degrees of freedom for each compo-
nent is computed as dfj = trace(S j ) − 1 where S j is the smoother matrix
for the jth component. The −1 in the degrees of freedom is caused by the
fact that the constant term does not need to be estimated, because only one
constant term is needed in the additive fit, and this constant β̂0 is estimated
beforehand. The effective degrees of freedom is again a convenient tool to
determine the value of the smoothing spline parameters λj .
The additive model and backfitting algorithm can be formulated more
generally by introducing a link function g which is the transformation needed
to relate the conditional mean of the response variable Y to the predictors
through an additive function
• g(E[Y |X]) = E[Y |X], that is g is the identity. This link function
is used for continuous, symmetric (Gaussian) response variables and
leads to the basic additive models.
5.5. GENERALIZED ADDITIVE MODELS 53
• g(E[Y |X]) = log(E[Y |X]) for log-additive models useful for e.g. Pois-
son distributed count data as response variable.
For link functions g different from the identity, these models are called gener-
alized additive models. Note that not all predictors need to be fit additively.
Linear and nonlinear contributions can be mixed easily by first estimating
the linear part and estimating the additive part from the resulting residuals.
A drawback of additive models is that every predictor entered additively in
the model automatically contributes a term to the fit. Hence, if
the number of predictors is large, model selection techniques are needed to
select the best subset of predictors.
Chapter 6
Support Vector Machines
6.1 Introduction
In this chapter we mainly discuss the use of kernel-based methods for clas-
sification. We start by considering methods that explicitly try to find linear
boundaries that separate the different classes as much as possible. We then
discuss the construction of optimal separating hyperplanes when the classes
are not completely separable. This technique is then extended by using ker-
nels to (implicitly) transform and enlarge the predictor space. This results
in more flexible nonlinear boundaries in the original predictor space.
$$\beta^\star = \frac{\beta}{\|\beta\|}$$
dsign (x, Hβ ) = (β ⋆ )t (x − x0 ).
The function is called signed distance because its absolute value is the or-
thogonal distance from the point to the hyperplane but the function is pos-
itive for points on the positive side (f (x) > 0) and negative for points on
the negative side (f (x) < 0) of the hyperplane. Note that for any point x0
in Hβ , it holds that f (x0 ) = 0, which yields β t x0 = −β0 . Therefore, the
signed distance can be rewritten as
$$d_{sign}(x, H_\beta) = (\beta^\star)^t(x - x_0) = \frac{1}{\|\beta\|}(\beta^t x - \beta^t x_0) = \frac{1}{\|\beta\|}(\beta^t x + \beta_0) = \frac{f(x)}{\|f'(x)\|},$$
since f ′ (x) = β in this case. Hence, the signed distance of a point x to the
hyperplane defined by f (x) = 0 is proportional to the value f (x). Separating
hyperplane classifiers try to find a linear function f (x) as in (6.1) such that
f (x) > 0 for objects in class Y = 1 and f (x) < 0 for objects in class
Y = −1. Procedures that classify observations by computing the sign of a
linear combination of the input features were originally called perceptrons
when introduced in the engineering literature.
Let us assume that the data are separable, that is, there is at least
one hyperplane that completely separates the two classes. In this case, the
solution of the separating hyperplane problem will often not be unique, but
several hyperplanes completely separate the classes. To make the solution
unique, an additional condition can be imposed to find the best possible
separating hyperplane among all choices. The optimal separating hyperplane
separates the two classes and maximizes the distance from the hyperplane to
the closest points from both classes. This extra condition defines a unique
solution to the separating hyperplane problem and maximizes the margin
between the two classes on the training data. It can be expected that it
therefore also leads to better performance on test data.
If Hβ is a separating hyperplane, then Y = −1 for points on the negative
side (f (x) < 0) of Hβ and Y = 1 for points on the positive side (f (x) > 0).
It follows that the (unsigned) distance of a point xi to the hyperplane Hβ is
obtained by d(xi , Hβ ) = yi (β t xi + β0 )/∥β∥. Hence, a separating hyperplane
satisfies d(xi , Hβ ) = yi (β t xi + β0 )/∥β∥ > 0 for all training points.
$$\max_{\beta, \beta_0} \; C \qquad (6.2)$$
$$\text{subject to} \quad y_i\,\frac{\beta^t x_i + \beta_0}{\|\beta\|} \geq C; \quad i = 1, \ldots, n. \qquad (6.3)$$
The condition (6.3) can equivalently be written as
yi (β t xi + β0 ) ≥ C∥β∥; i = 1, . . . , n. (6.4)
Now suppose that for some value of C, there are values β and β0 that satisfy
condition (6.4). Let us consider a positive value γ, and define β̃ = γβ and
β̃0 = γβ0 . It then follows that ∥β̃∥ = γ∥β∥ and condition (6.4) is equivalent
to
γyi (β t xi + β0 ) ≥ γC∥β∥; i = 1, . . . , n
or yi (β̃ t xi + β̃0 ) ≥ C∥β̃∥; i = 1, . . . , n.
This means (not surprisingly) that if values β and β0 satisfy the condition,
then also any multiple by a positive constant satisfies the constraint. To
make the solution unique, the value of ∥β∥ can be taken equal to an arbitrary
chosen constant. A convenient choice is to take
$$\|\beta\| = \frac{1}{C}. \qquad (6.5)$$
The maximization problem (6.2) then becomes the minimization problem
$$\min_{\beta, \beta_0} \; \|\beta\| \qquad (6.6)$$
$$\text{subject to} \quad y_i(\beta^t x_i + \beta_0) \geq 1; \quad i = 1, \ldots, n. \qquad (6.7)$$
Note that condition (6.3) together with (6.5) implies that around the optimal
separating boundary there is a margin of width 1/∥β∥ in both directions,
that does not contain any data points.
The optimal separating hyperplane problem as formulated in (6.6)-(6.7)
consists of a quadratic criterion that needs to be minimized under linear
inequality constraints, which is a standard convex optimization problem of
operations research. That is, minimizing (6.6)-(6.7) is equivalent to mini-
mizing w.r.t. β and β0 the Lagrange (primal) function
$$L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{n} \alpha_i\left[y_i(\beta^t x_i + \beta_0) - 1\right] \qquad (6.8)$$
$$\beta = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (6.9)$$
$$0 = \sum_{i=1}^{n} \alpha_i y_i \qquad (6.10)$$
$$L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y_i y_j x_i^t x_j \qquad (6.11)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \geq 0. \qquad (6.12)$$
This is a standard convex optimization problem. The final solution must sat-
isfy the Karush-Kuhn-Tucker conditions, well-known in operations research.
These conditions are given by (6.7),(6.9)-(6.10),(6.12) and the condition
αi [yi (β t xi + β0 ) − 1] = 0; i = 1, . . . , n. (6.13)
separable as we assumed here, then it can be shown that the logistic regres-
sion boundary is a separating hyperplane (not necessarily optimal). Hence,
both methods have the advantage of being more robust to model misspec-
ifications and deviations in the data if they occur away from the decision
boundary, but are less stable for well-behaved Gaussian data. On the other
hand, as mentioned before, LDA gives equal weight to all observations and
thus is more efficient if the data are Gaussian, but can be more sensitive to
model deviations.
Once the optimal separating hyperplane Hβ̂ has been found as solution
of the constrained optimization problem, new observations can be classified
as
$$\hat{G}(x) = \operatorname{sign} \hat{f}(x) = \operatorname{sign}(\hat{\beta}^t x + \hat{\beta}_0).$$
In practice, it will often occur that there does not exist a separating
hyperplane and thus there is no feasible solution to problem (6.6)-(6.7).
Therefore, we will extend the separable hyperplane problem in the next
section to the case of nonseparable classes.
yi (β t xi + β0 ) ≥ 1 − ξi ; i = 1, . . . , n.
subject to yi (β t xi + β0 ) ≥ 1 − ξi ; i = 1, . . . , n
ξi ≥ 0; i = 1, . . . , n (6.15)
where the tuning parameter γ > 0 now takes over the role of Q. A large
value of γ puts a strong constraint on $\sum_{i=1}^{n}\xi_i$ and thus allows few misclas-
sifications, while a small value of γ puts little constraint on $\sum_{i=1}^{n}\xi_i$. Note
that the limit case γ = ∞ corresponds to the separable case since all slack
variables ξi should equal zero in this case. The support vector classifier prob-
lem (6.14)-(6.15) still consists of a quadratic criterion with linear constraints,
hence a standard convex optimization problem. The minimization problem
is equivalent to minimizing w.r.t. β, β0 and ξi the Lagrange function
$$L_P = \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(\beta^t x_i + \beta_0) - (1 - \xi_i)\right] - \sum_{i=1}^{n}\mu_i\xi_i \qquad (6.16)$$
$$\beta = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (6.17)$$
$$0 = \sum_{i=1}^{n} \alpha_i y_i \qquad (6.18)$$
$$\alpha_i = \gamma - \mu_i; \quad i = 1, \ldots, n \qquad (6.19)$$
Note that equations (6.17) and (6.18) are the same as in the separable case
discussed in the previous section (equations (6.9) and (6.10)). The difference
is that the last equation relates the Lagrange multipliers αi to the constant
γ. Substituting these first order conditions in (6.16) yields the Wolfe dual
objective function
$$L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y_i y_j x_i^t x_j \qquad (6.20)$$
Note that this is exactly the same dual function as in the separable case
(expression (6.11)), but now it needs to be maximized over $\alpha_1, \ldots, \alpha_n$ under the
constraints
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq \gamma. \qquad (6.21)$$
The Karush-Kuhn-Tucker conditions satisfied by the solution include the
constraints
$$\alpha_i\left[y_i(\beta^t x_i + \beta_0) - (1 - \xi_i)\right] = 0; \quad i = 1, \ldots, n \qquad (6.22)$$
$$\mu_i\xi_i = 0; \quad i = 1, \ldots, n \qquad (6.23)$$
$$y_i(\beta^t x_i + \beta_0) - (1 - \xi_i) \geq 0; \quad i = 1, \ldots, n. \qquad (6.24)$$
From (6.17) it still follows that the solution for β is a linear combination of
the observations with αi > 0. From condition (6.22) it follows that these
points are observations for which constraint (6.24) is exactly met. These
observations are again called the support vectors. Some of these support
vectors will lie on the margin (ξi = 0) which according to (6.23) and (6.19)
implies that 0 < αi < γ while others lie on the wrong side of their margin
(ξi > 0) and are characterized by αi = γ.
$$f(x) = \beta^t h(x) + \beta_0 = \sum_{i=1}^{n} \alpha_i y_i\, h(x)^t h(x_i) + \beta_0. \qquad (6.27)$$
Note that for given αi and β, the constant β0 can be determined as the
solution of yi f (xi ) = 1 for support points xi on the margin (0 < αi < γ).
From (6.25) and (6.27) it follows that both the optimization problem (6.25)
and its solution (6.27) in the transformed feature space only depend on
products of basis functions $h(x)^t h(x')$. This makes it convenient to use
transformations based on kernel functions. Consider a kernel function
$K(x, x') = \sum_{m=1}^{\infty}\gamma_m\phi_m(x)\phi_m(x')$ as in (4.14) and define the associated basis
functions as $h_m(x) = \sqrt{\gamma_m}\,\phi_m(x)$; then for any points x and x' in $\mathbb{R}^d$ the
product $h(x)^t h(x') = \sum_{m=1}^{\infty}\gamma_m\phi_m(x)\phi_m(x') = K(x, x')$. Hence, if the transforma-
tion of the feature space is induced by a kernel function, then (6.25) can be
rewritten as
$$L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i\alpha_j y_i y_j K(x_i, x_j) \qquad (6.28)$$
which needs to be maximized under constraints (6.26) and the solution (6.27)
becomes
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + \beta_0 \qquad (6.29)$$
which shows that knowledge of the kernel function is sufficient to compute
the support vector classifier, without the need to explicitly specify the trans-
formation h(X). The above dual problem still can easily be solved using the
same techniques as for the support vector classifier (in the original space) in
the previous section. The resulting classification method is called a support
vector machine. Popular choices for the kernel function K are the polyno-
mial kernels introduced in subsection 4.6.1, the radial basis functions kernel
of subsection 4.6.2, and the neural network kernel defined by
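As a usage sketch (assuming the e1071 package, an interface to libsvm), a support vector machine with a radial basis kernel can be fitted as below; the cost argument corresponds to the slack penalty γ of the support vector classifier, while svm()'s own gamma argument is the width parameter of the radial kernel.

# Fit an SVM with radial basis kernel on the iris data.
library(e1071)
fit  <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
pred <- predict(fit, iris)
mean(pred != iris$Species)   # training misclassification rate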
1 − yi (β t xi + β0 ) = 1 − yi f (xi ) ≤ 0.
For support vectors (αi > 0), condition (6.22) implies that
$$y_i(\beta^t x_i + \beta_0) - (1 - \xi_i) = 0$$
or equivalently
$$0 \leq \xi_i = 1 - y_i(\beta^t x_i + \beta_0) = 1 - y_i f(x_i).$$
Using these results, $\sum_{i=1}^{n}\xi_i$ can be rewritten as
$$\sum_{i=1}^{n} \left[1 - y_i f(x_i)\right]_+. \qquad (6.30)$$
By inserting (6.30) into (6.14) and denoting λ = 1/(2γ) the support vector
classifier solves the optimization problem
$$\min_{\beta, \beta_0} \left[\sum_{i=1}^{n} \left[1 - y_i f(x_i)\right]_+ + \lambda\|\beta\|^2\right].$$
where $\theta_m = \sqrt{\gamma_m}\,\beta_m$. Since $\|\beta\|^2 = \sum_{m=1}^{\infty}\beta_m^2 = \sum_{m=1}^{\infty}(\theta_m^2/\gamma_m)$, it follows
that (6.31) can be rewritten as
$$\min_{\theta, \beta_0} \left[\sum_{i=1}^{n} \left[1 - y_i\sum_{m=1}^{\infty}\theta_m\phi_m(x_i)\right]_+ + \lambda\sum_{m=1}^{\infty}\frac{\theta_m^2}{\gamma_m}\right] \qquad (6.32)$$
which corresponds to (4.13) for the choice L(yi , f (xi )) = [1 − yi f (xi )]+ . A
natural generalization is to use different loss functions than the standard
SVM loss function in (6.33) such as the logistic regression loss function
Chapter 7
Classification Trees
7.1 Introduction
Tree-based methods partition the feature space into rectangular regions and
then fit a simple model (often a constant) in each of these regions. Trees are
conceptually simple, yet very powerful techniques. The rectangular regions
in feature space are obtained by consecutive binary partitions of the data,
based on the value of one of the predictors. Hence, the key component of
tree methods is the way that the binary partitions are determined.
point for each split. In the classification context, this means that the result-
ing tree should minimize the misclassification error. Suppose that the tree
procedure splits the feature space into M leaves R1 , . . . , RM . Each of these
regions Rm ; m = 1, . . . , M contains a number of observations denoted as
nm . All observations in a terminal node m (corresponding to region Rm )
will be assigned to the class which has the highest proportion among the
training data that belong to region Rm . For any class r = 1, . . . , k, the
proportion of training observations of class r in region Rm is given by
$$\hat{p}_{mr} = \frac{1}{n_m}\sum_{x_i \in R_m} I(y_i = r),$$
where I(yi = r) is the indicator function that takes value 1 if the condition
yi = r is satisfied and zero else. The observations in terminal node m are
thus assigned to class
$$r(m) = \arg\max_{r}\, \hat{p}_{mr},$$
the majority class in leaf m. The goal is to minimize the misclassification
rate, given by
$$\frac{1}{n}\sum_{m=1}^{M}\sum_{x_i \in R_m} I(y_i \neq r(m)) = \sum_{m=1}^{M}\frac{n_m}{n}\left(1 - \hat{p}_{mr(m)}\right).$$
$$\sum_{r \neq s} \hat{p}_{mr}\hat{p}_{ms} = \sum_{r=1}^{k} \hat{p}_{mr}(1 - \hat{p}_{mr}).$$
The Gini index can be seen as the training misclassification error rate that
is obtained if the observations of Rm are not uniquely assigned to one class,
7.2. GROWING TREES 67
but are given probabilities to belong to each of the classes, with p̂mr the
probability of belonging to class r; r = 1, . . . , k. Alternatively, the Gini
index can be interpreted as follows. For any class r, code the observations in
Rm belonging to class r as 1 and give code 0 to all other observations. Then,
the success probability of this Bernoulli variable is $\hat{p}_{mr}$ and its variance
is given by p̂mr (1 − p̂mr ). Summing the variance over all possible classes
r; r = 1, . . . , k again leads to the Gini index. The cross-entropy or deviance
is given by
$$-\sum_{r=1}^{k} \hat{p}_{mr}\log(\hat{p}_{mr}).$$
Figure 7.1 compares the shape of the three impurity measures for a two-class
problem as a function of p, the proportion of observations in class 1 (or class
2). Clearly, all three measures are similar.
Figure 7.1: Shape of the three impurity measures for a two-class problem as
a function of p, the proportion of observations in class 1.
The best split is then the variable Xj and split-point t that solves
where nR1 and nR2 are the numbers of training observations in R1 and R2 .
Once the optimal split has been found, the data are partitioned according
to the two resulting regions R1 and R2 and the splitting process is repeated
for each of the regions.
Note that both cross-entropy and the Gini index are more sensitive to
changes in class probabilities within a region, even if the majority class
remains the same, which is another reason why they are preferable when
growing a tree. To illustrate this, consider a two-class problem with a train-
ing dataset of size n = 800 and the same number of observations in each
class, n1 = n2 = 400. Suppose a split creates regions R1 with n1 = 300 and
n2 = 100, and R2 with n1 = 100 and n2 = 300, while another split creates
regions R1 with n1 = 200 and n2 = 400, and R2 with n1 = 200 and n2 = 0.
Both splits create a misclassification rate of 0.25, meaning that they are
equally good when the splitting criterion (7.1) is based on misclassification
error. However, the second split produces a 'pure' leaf where the misclassi-
fication error is zero. Therefore, this split is preferable. When the splitting
criterion (7.1) is based on the Gini index or cross-entropy, it yields a lower
value for the second split as is desirable.
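A short check of this example in R: both splits have misclassification rate 0.25, but the weighted Gini index is lower for the second split.

# Weighted Gini index for the two candidate splits above.
gini  <- function(counts) { p <- counts / sum(counts); sum(p * (1 - p)) }
wgini <- function(r1, r2) (sum(r1) * gini(r1) + sum(r2) * gini(r2)) / (sum(r1) + sum(r2))

wgini(c(300, 100), c(100, 300))   # first split:  0.375
wgini(c(200, 400), c(200, 0))     # second split: 0.3333...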
It has to be decided when to stop splitting regions into further subregions
or equivalently, it has to be decided how large a tree can grow. One strategy
would be to evaluate each split by comparing the misclassification rate of
the tree before the split with the misclassification rate of the tree obtained
after the split. The split can then be retained if the gain in misclassification
rate exceeds some threshold. However, this strategy does not work out well,
because seemingly worthless splits in the beginning of the process might lead
to very good splits below them. Therefore, the preferred strategy is to grow a
large tree T0 where splitting of regions only stops when some minimal region
size (e.g. 5) has been reached. This large tree has the danger of overfitting
the training data. Therefore, the initial tree is then pruned to find a subtree
that fits the data well but is more stable and is expected to perform better
on validation data.
defined as
$$C_\alpha(T) = \sum_{m=1}^{M} n_m\, Q_{R_m}(T) + \alpha M, \qquad (7.2)$$
where $M = M(T)$ is the number of leaves of the subtree $T$ and
$$\sum_{m=1}^{M(T)} n_m\, Q_{R_m}(T) \qquad (7.3)$$
is the total impurity of the tree.
7.4 Bagging
Bagging uses the bootstrap methodology to improve the estimates or predic-
tions of a classification (or regression) procedure such as trees. The boot-
strap creates new training datasets by randomly drawing observations with
replacement from the original training dataset. Usually, each of these boot-
strap samples has the same size as the original dataset. This yields a number
of bootstrap datasets denoted by B. The same model is then refit to each
of these bootstrap datasets. For any point x, the model fit to each of the
bootstrap samples, yields a prediction Ĝb (x); b = 1, . . . , B. For any class
r; r = 1, . . . , k a prediction Ĝ(x) = r can also be written as a vector fˆ(x) of
length k which is zero everywhere, except for the rth component that equals
1. Thus,
$$\hat{G}(x) = \arg\max_{1 \le s \le k} \hat{f}_s(x).$$
Then, the bagging estimate derived from the B bootstrap samples is given
by
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x), \qquad (7.4)$$
where fˆb (x) is the prediction of the bth bootstrap sample (b = 1, . . . , B).
Since each of these predictions is a vector with zeros and 1 at the predicted
class, the bagged estimate fˆbag (x) = (p̂1 , . . . , p̂k ) where p̂r is the proportion
of bootstrap training samples (e.g. the proportion of trees) that predict class
r (r = 1, . . . , k) at x. These proportions can be interpreted as estimated class
probabilities and thus the predicted class at x becomes
$$\hat{G}_{\mathrm{bag}}(x) = \arg\max_{1 \le r \le k} \hat{p}_r.$$
This procedure is often called majority voting because the predicted class
is the class that gets the most votes from the B bootstrap samples. Note
that for classifiers that already yield class probability estimates at x for a
single training dataset, an alternative bagging strategy is to average these
probabilities over bootstrap samples. This approach tends to give a lower
variance for the predictions.
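The voting scheme can be sketched in a few lines of Python. The sketch below uses scikit-learn's DecisionTreeClassifier as the base classifier and assumes the classes are coded 0, . . . , k − 1; it is meant only as an illustration of the procedure.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def bagging_fit(X, y, B=50):
        # fit B trees, each on a bootstrap sample drawn with replacement
        n = len(y)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X, k):
        # proportion of trees voting for each class, followed by majority vote
        votes = np.zeros((len(X), k))
        for tree in trees:
            votes[np.arange(len(X)), tree.predict(X)] += 1
        p_hat = votes / len(trees)
        return p_hat.argmax(axis=1), p_hat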
The idea behind the bootstrap is that we randomly draw samples of
size n from the distribution P̂ , where P̂ is the empirical distribution of the
original dataset. This empirical distribution is discrete and puts mass 1/n
at each observation in the dataset. Therefore, to generate bootstrap sam-
ples, we successively draw observations from the original dataset where at
each stage, all observations have the same probability 1/n of being selected.
The bagging estimate in (7.4) is then only a Monte-Carlo approximation
of the true bagging estimate EP̂ [fˆ⋆ (x)] where fˆ⋆ (x) is the prediction at x
derived from the classification rule obtained from a random training sample
Z⋆ = {(x⋆i , yi⋆ ); i = 1, . . . , n} generated from P̂ . The Monte-Carlo estimate
approaches the true bagging estimate as B → ∞. The bagging estimate
EP̂ [fˆ⋆ (x)] will only differ from the estimate on the original training data
fˆ(x) if the method that is being applied to each of the bootstrap sam-
ples is a nonlinear or adaptive function of the training data. Otherwise,
EP̂ [fˆ⋆ (x)] = fˆP̂ (x) = fˆ(x), the prediction derived from the original training
data. Therefore, bagging is interesting for highly data dependent methods
such as trees, where the number of leaves and the variables determining the
successive splits will be different for each bootstrap tree.
Note that bagging can make a good classifier better and more stable,
but bagging a bad classifier does not make it better. In fact, by making
the classifier more stable, it can become even worse. This is illustrated by
the following simple artificial example. Suppose that Y = 1 for all x (and
thus there actually is only 1 class), and the classifier Ĝ(x) randomly predicts
Y = 1 with probability 0.4 and Y = 0 with probability 0.6 (for all x). Then,
the misclassification error of this classifier is 0.6, clearly a bad classifier.
However, when bagging this classifier, at each x, approximately 60% of the
bootstrap samples will predict class Y = 0 and only 40% will predict Y = 1.
Hence, if the number of bootstrap samples, B, is large enough, the bagging
estimate will predict Y = 0 at each x, leading to a misclassification error
of 100%, which is much worse than the error of the initial classifier on the
original training data.
By averaging models such as trees over bootstrap samples, bagging in-
creases the space of models in which an optimal solution is sought. Indeed,
the bagging classifier does not belong to the class of the initial base classi-
fier anymore, e.g. the bagging classifier is not a classification tree anymore.
However, if the base classifier comes from a model with a simple, inter-
pretable structure such as a tree, then this simple structure is lost for the
bagging classifier, which is a drawback for the interpretation of the model.
7.5 Bumping
Bumping uses bootstrap samples to select a best fitting model from a random
set of model fits in model space. Hence, instead of averaging or combining
models as in bagging, bumping is a technique to find a better single model
than the one obtained from the original training data. For each of the B
bootstrap samples, we fit the model which leads to predictions fˆb (x); b =
1, . . . , B at x. We then choose the model that produces the smallest average
prediction error on the original training dataset. By convention, the original
training data set is included in the set of bootstrap samples, so that the
method is allowed to pick the solution on the original data if it has the
lowest training error. Bumping is mainly helpful in problems where the fitting
method can get stuck in local minima. The bootstrap samples act as perturbed
versions of the data that can yield an improved solution when the fit on the
original training data is stuck in a poor solution (local minimum).
Note that since bumping compares the classification error of the different
fitted models on the training data, one must ensure that these models have
(roughly) the same complexity. More complex models usually fit the training
data better, so allowing for different complexity among the fitted models
would just lead to the selection of the most complex model. For example, in
the case of trees, this means that the trees grown on each of the bootstrap
samples should have (approximately) the same number of leaves.
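A minimal sketch of bumping with trees is given below, again using scikit-learn's DecisionTreeClassifier; fixing max_leaf_nodes keeps the candidate trees at roughly the same complexity, as required above, and the fit on the original data is included among the candidates.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)

    def bumping(X, y, B=25, n_leaves=8):
        n = len(y)
        # candidate models: one fit on the original data plus B bootstrap fits
        candidates = [DecisionTreeClassifier(max_leaf_nodes=n_leaves).fit(X, y)]
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            candidates.append(DecisionTreeClassifier(max_leaf_nodes=n_leaves).fit(X[idx], y[idx]))
        # keep the single model with the smallest error on the original training data
        errors = [np.mean(tree.predict(X) != y) for tree in candidates]
        return candidates[int(np.argmin(errors))]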
7.6 Boosting
Boosting is a very powerful way of combining the output of many weak
classifiers to produce a high performance committee. Consider a two-class
problem with the class variable coded as Y ∈ {−1, 1}. A classifier G pro-
duces the training misclassification error
$$\overline{\mathrm{err}} = \frac{1}{n}\sum_{i=1}^{n} I\left(y_i \neq G(x_i)\right).$$
The expected error rate on future predictions is given by EXY I(Y ̸= G(X)).
A weak classifier is a classifier whose error rate is only slightly better
than the 50% error rate obtained by random guessing. Boosting sequentially
applies the weak classification method to repeatedly modified versions of the
data which leads to a sequence of weak classifiers, Gm (x); m = 1, . . . , M .
The predictions of these M classifiers are then combined through a weighted
majority vote to produce the final prediction
$$G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} a_m G_m(x)\right). \qquad (7.5)$$
The AdaBoost algorithm starts with equal observation weights $w_i = 1/n$ and then iterates the following procedure for m going from 1 to M. Apply the classi-
fication method to the training data using the weights wi , which yields the
classifier Gm (x). Compute the training error of this classifier, given by
$$\overline{\mathrm{err}}_m = \frac{\sum_{i=1}^{n} w_i\, I\left(y_i \neq G_m(x_i)\right)}{\sum_{i=1}^{n} w_i}. \qquad (7.6)$$
Compute the classifier weight $a_m = \log\left[(1 - \overline{\mathrm{err}}_m)/\overline{\mathrm{err}}_m\right]$. Define new observation weights as
$$w_i \leftarrow w_i \exp\left[a_m\, I\left(y_i \neq G_m(x_i)\right)\right]; \qquad i = 1, \ldots, n. \qquad (7.7)$$
When the iterations are finished and the M classifiers have been obtained,
construct the final classifier through weighted majority voting as defined
in (7.5).
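The algorithm can be summarized in the following Python sketch. It uses depth-one trees (stumps) from scikit-learn as weak classifiers and assumes the classes are coded as −1/+1; the clipping of the weighted error is an implementation detail added here to avoid division by zero.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, M=100):
        # y is coded as -1/+1
        n = len(y)
        w = np.full(n, 1.0 / n)                        # equal starting weights
        stumps, alphas = [], []
        for _ in range(M):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            miss = (stump.predict(X) != y)
            err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)   # weighted error (7.6)
            a = np.log((1.0 - err) / err)              # classifier weight a_m
            w = w * np.exp(a * miss)                   # up-weight misclassified points (7.7)
            stumps.append(stump)
            alphas.append(a)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        # weighted majority vote (7.5)
        f = sum(a * stump.predict(X) for a, stump in zip(alphas, stumps))
        return np.sign(f)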
To explain why boosting is successful in improving the performance of
the base classifier on the original data, we reconsider (7.5). This expression
can be interpreted as fitting an additive expansion using a set of elementary
basis functions G1 (x), . . . , GM (x). As seen in Chapter 4, in general a linear
basis function expansion has the form
$$f(x) = \sum_{m=1}^{M} \beta_m\, b(x; \gamma_m),$$
Consider for example squared error loss $L(y, f(x)) = (y - f(x))^2$; then
$$L\left(y_i,\, f_{m-1}(x_i) + \beta_m b(x_i; \gamma_m)\right) = \left(r_{m-1,i} - \beta_m b(x_i; \gamma_m)\right)^2,$$
where $r_{m-1,i} = y_i - f_{m-1}(x_i)$ is the residual on the ith observation that remains after the
already fitted expansion. Thus for squared error loss the term βm b(x; γm )
that best fits the current residuals is added to the expansion. The param-
eters β̂m , γ̂m in each step are obtained by solving the simpler optimization
problem
$$\min_{\beta_m, \gamma_m}\; \sum_{i=1}^{n} L\left(r_{m-1,i},\, \beta_m b(x_i; \gamma_m)\right),$$
which is of form (7.9).
For AdaBoost, the basis functions in the expansion are the sequence of
classifiers $G_m(x)$. Using the exponential loss function $L(y, f(x)) = \exp(-y f(x))$,
the stagewise minimization problem (7.14) can equivalently be written as
$$\min_{\beta_m, G_m}\;\left\{ \left[\exp(\beta_m) - \exp(-\beta_m)\right] \sum_{i=1}^{n} w_i^{(m)}\, I\left(y_i \neq G_m(x_i)\right) \;+\; \exp(-\beta_m)\sum_{i=1}^{n} w_i^{(m)} \right\}. \qquad (7.15)$$
The minimization problem (7.15) can be solved in two steps. First, for any
value βm > 0, the solution to (7.15) for Gm (x) is clearly given by
$$\hat{G}_m = \arg\min_{G_m}\; \sum_{i=1}^{n} w_i^{(m)}\, I\left(y_i \neq G_m(x_i)\right), \qquad (7.16)$$
which is the classifier that minimizes the weighted training error. The op-
timal solution Ĝm is thus independent of the actual value of βm . Plugging
the optimal solution Ĝm into (7.14) and solving for βm yields
$$\hat{\beta}_m = \frac{1}{2}\log\left(\frac{1 - \overline{\mathrm{err}}_m}{\overline{\mathrm{err}}_m}\right), \qquad (7.17)$$
where $\overline{\mathrm{err}}_m$ is the training error of classifier Ĝm as defined in (7.6). Using the
solutions β̂m and Ĝm given by (7.17) and (7.16), the expansion is updated
as
$$f_m(x) = f_{m-1}(x) + \hat{\beta}_m \hat{G}_m(x).$$
Rewriting the observation weights $w_i^{(m+1)} = \exp(-y_i f_m(x_i))$ for the next iteration leads to
$$w_i^{(m+1)} = w_i^{(m)} \exp\left[2\hat{\beta}_m\, I\left(y_i \neq \hat{G}_m(x_i)\right)\right] \exp(-\hat{\beta}_m). \qquad (7.18)$$
The factor exp(−β̂m ) is the same for all observations and can therefore be
ignored. Hence, by defining am = 2β̂m the weights in (7.18) are equivalent to
the weights (7.7) used in AdaBoost, so we can conclude that AdaBoost tries
to minimize the general minimization problem (7.8) using the exponential
loss function (7.12) by using forward stagewise additive modeling.
The exponential loss function (7.12) used by AdaBoost is clearly at-
tractive from the computational viewpoint as it allows one to approximate the
solution of the complex minimization problem (7.8) by solving a sequence of
simpler minimization problems (7.13) in a forward stagewise additive mod-
eling strategy. However, the exponential loss function is also a reasonable
choice from a statistical viewpoint. At the population level, the general
minimization problem (7.8) using exponential loss becomes
$$\min_{f(x)}\; E_{Y|x}\left[\exp(-Y f(x))\right],$$
and it can be shown that the optimal solution at the population level is
$$f(x) = \frac{1}{2}\log\left(\frac{P(Y = 1 \mid X = x)}{P(Y = -1 \mid X = x)}\right). \qquad (7.19)$$
Thus, the optimal solution is one-half the log-odds and AdaBoost estimates
this optimal solution. This justifies using the sign of the classifier as the
classification rule as in (7.5).
Another loss function that can be used is the binomial negative loglike-
lihood. Let
$$p(x) = P(Y = 1 \mid X = x) = \frac{1}{1 + \exp(-2f(x))}, \qquad (7.20)$$
which implies that f (x) is one-half the log-odds as in (7.19). Consider the
response variable Y ′ = (Y + 1)/2, that is Y ′ codes the classes as 0 and 1,
then the binomial log-likelihood is
$$l(Y, f(x)) = Y' \log p(x) + (1 - Y')\log(1 - p(x)). \qquad (7.21)$$
Since the maximum of the log-likelihood (7.21) at the population level is ob-
tained at the true probabilities p(x) = P (Y = 1|X = x), it follows that the
population minimizer of the general problem (7.8) using the binomial nega-
tive loglikelihood loss function, that is, the minimum of E[−l(Y, f (x))|X =
x] is also one-half the log-odds. Hence, both loss functions yield the same op-
timal solution at the population level, but this is generally no longer true for
finite training samples.
Contrary to the exponential loss function, the binomial loss function
extends easily to multiclass (k > 2) classification problems with classes
Y = 1, . . . , k. Let
$$p_r(x) = \frac{\exp(f_r(x))}{\sum_{l=1}^{k}\exp(f_l(x))},$$
with the constraint $\sum_{l=1}^{k} f_l(x) = 0$ to ensure uniqueness of the function
fl (x); l = 1, . . . , k. The binomial negative loglikelihood loss function then
extends naturally to the k-class multinomial negative log-likelihood loss func-
tion:
$$L(y, f(x)) = -\sum_{r=1}^{k} I(y = r)\log\left(p_r(x)\right) = -\sum_{r=1}^{k} I(y = r)\, f_r(x) + \log\left(\sum_{l=1}^{k}\exp(f_l(x))\right),$$
to grow and prune each tree as explained before. However, boosting assumes
that the basic classifier is a weak classifier. Therefore, a more efficient choice
that leads to a better performance is to fix the size of the trees that are grown
in each step. The size J of a tree is its number of leaves, or equivalently
the number of regions the feature space is partitioned in. Experience has
indicated that a choice of J in the range 4 ≤ J ≤ 8 usually works well
for boosting and generally the results are fairly insensitive to the choice of
J in this range. A second choice is the value of M , the number of trees
in the additive expansion. Each additional iteration usually decreases the
training error further, so overfitting will occur when M is taken too large.
A convenient way to select the optimal value M ⋆ is to monitor prediction
error on a validation sample as a function of M when running the boosting
algorithm. The (smallest) value of M that minimizes this test error is the
optimal choice M ⋆ .
Chapter 8
Nearest-Neighbor
Classification
8.1 Introduction
Nearest-neighbors is a simple and model-free method. The technique is es-
sentially data driven and can be very effective. However, its lack of structure
makes it of little use for understanding the relationship between the features
and the outcome (class membership).
8.2 K-Nearest-Neighbors
Consider a training dataset {(xi , gi ); i = 1, . . . , n} where gi is the class label
of the ith observation and takes values in the set {1, . . . , k}. For a target
point x0 that needs to be classified, K-nearest-neighbors first determines
the K training points closest to x0 and then assigns x0 to the class that
is best represented among these K training points, that is, majority voting
among the K nearest neighbors of x0 is used to predict the class to which x0
belongs. If ties occur, that is, two or more classes have the same maximal
number of votes among the K nearest neighbors, then a class is assigned at
random from these candidate classes. To determine the K nearest neighbors
of x0 , that is, the K training points closest to x0 , a distance measure needs
to be specified. If all features are quantitative, then the Euclidean distance
in feature space is used:
d(x0 , xi ) = ∥xi − x0 ∥; i = 1, . . . , n.
Since it is possible (even likely) that the features are measured in different
units with different scales, it is advisable to first standardize the features
such that they all have mean zero and variance 1. Otherwise, (if features
have very different scales) the determination of the nearest neighbors would
be completely determined by the differences in the scales. If the features
are qualitative or of mixed type, the Euclidean distance can be replaced by
a proximity measure which expresses for each pair of observations how alike
they are. A proximity measure can either produce similarities that measure
resemblance or dissimilarities that measure difference between objects.
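For quantitative features, the procedure is only a few lines of Python; the sketch below standardizes the features and breaks ties at random, as described above, and all names are chosen here for illustration only.

    import numpy as np

    def knn_classify(X_train, y_train, x0, K=5, rng=np.random.default_rng(2)):
        # standardize all features to mean zero and variance one
        mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
        Z, z0 = (X_train - mu) / sd, (x0 - mu) / sd
        # Euclidean distances to the target point and the K closest training points
        dist = np.sqrt(((Z - z0) ** 2).sum(axis=1))
        neighbors = y_train[np.argsort(dist)[:K]]
        # majority vote with random tie-breaking
        classes, votes = np.unique(neighbors, return_counts=True)
        return rng.choice(classes[votes == votes.max()])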
K-nearest-neighbors is a very simple technique but its local nature allows
it to adapt well to local characteristics of the data which makes it well-suited
for classification problems with a very irregular boundary. A limiting case is
K = 1, the 1-nearest-neighbor classifier, which assigns each target point x0
to the same class as the closest training point. If the training sample is large
and fills well the predictor space, then the bias of the 1-nearest-neighbor
classifier is low, but on the other hand, its variance is high. In fact, for
a two-class classification problem, the error rate of the 1-nearest-neighbor
classifier asymptotically never exceeds twice the optimal Bayes error rate.
This can be seen as follows. Consider an arbitrary point x and let r⋆ be the
dominant class at x. Denote pr (x) the true conditional probability for class
r (r = 1, 2) at x. The Bayes rule classifies x in class r⋆ with misclassification
probability given by
Bayes error = 1 − pr⋆ (x).
Now, suppose that the training dataset fills up the predictor space in a dense
fashion. This means that for any target point x and arbitrary small distance
ϵ > 0, there exists a training sample size nϵ such that there is a training
point at distance closer than ϵ of x for training samples of size n ≥ nϵ . The
1-nearest-neighbor classifier will classify x to the same class as the nearest
training point. Asymptotically, this training point will be arbitrarily close to
x. Therefore, with probability p1 (x) this training point belongs to class 1,
and with probability p2 (x) = 1 − p1 (x) the training point belongs to class 2.
Hence, asymptotically the misclassification error of the 1-nearest-neighbor
classifier becomes
$$p_1(x)\,(1 - p_1(x)) + p_2(x)\,(1 - p_2(x)) = 2\, p_{r^\star}(x)\,(1 - p_{r^\star}(x)) \le 2\,(1 - p_{r^\star}(x)),$$
that is, at most twice the Bayes error.
The key issue in this result is the assumption of no bias. In practice, the
training sample is not dense in predictor space, and thus the 1-nearest-
neighbor classifier can show substantial bias in some areas of predictor space
which deteriorates its performance.
The high variance of the 1-nearest-neighbor classifier often makes it a poor
choice even when its bias is low. The number of nearest neighbors K is a
tuning parameter that needs to be selected to obtain a good compromise in
the bias-variance trade-off. A larger value of K will decrease the variance of the
estimator. If this decrease in variance can be combined with no or little in-
crease in bias, then clearly the larger value of K is a better choice. As usual,
the optimal value of K can be determined by comparing performance on a
validation set for different values of K. If a validation set is not available,
cross-validation can be used.
Adaptive nearest-neighbor methods replace the Euclidean distance by a locally adapted metric of the form
$$D(x, x_0)^2 = (x - x_0)^t\, \Sigma_\epsilon\, (x - x_0), \qquad (8.1)$$
where
$$\Sigma_\epsilon = W^{-1/2}\left[B^\star + \epsilon I\right] W^{-1/2}, \qquad B^\star = W^{-1/2}\, B\, W^{-1/2},$$
with W the within-class and B the between-class covariance matrix of the observations in a neighborhood of x0 .
The covariance matrix Σϵ determining the adaptive metric seems quite com-
plicated, but can be explained as follows. First the data are sphered using
the within covariance matrix W . Note that B ⋆ is the between covari-
ance matrix in this sphered feature space. The neighborhood around the
(sphered) target point in the sphered feature space is then stretched in the
directions of B ⋆ that have small or zero eigenvalues. Indeed, the directions
with smallest eigenvalues contribute little to the distance (8.1) and thus the
neighborhood can stretch far in these directions. On the other hand, direc-
tions with large eigenvalues of B ⋆ contribute largely to the distance (8.1)
and thus the neighborhood stretches little in these directions. The factor ϵ
is added to avoid that the neighborhood can stretch infinitely in a direction
Chapter 9

Model Performance and Model Selection

9.1 Introduction
A good classification method for a practical problem is a classifier that
performs well on independent test data. Assessment of this performance is
therefore extremely important in practice. Test performance can be used
to guide the choice of learning method or model, and thus drives model
selection tasks such as the choice of the tuning parameter(s). Moreover,
it is a measure of the quality of the ultimately chosen model.
In this chapter we discuss methods for the assessment of the performance
of models on test data. The focus is on estimating this performance when
an independent test set is not available.
where the expectation takes (population) averages over all quantities that
are random. This includes G and X but also the randomness in the training
sample that produced Ĝ. The corresponding training error is the average
loss over the training sample, given by
$$\overline{\mathrm{err}} = \frac{1}{n}\sum_{i=1}^{n} L\left(g_i, \hat{G}(x_i)\right). \qquad (9.2)$$
The 0 − 1 loss function can always be used while the log-likelihood loss
function can only be used for classification methods that produce estimates
of the class probabilities. The log-likelihood loss function can be used for
general response variables Y (quantitative or qualitative) with a density
Pθ(X) (Y ) that depends on the predictors X through some parameter θ(X).
The loss function then is
$$L(Y, \theta(X)) = -2\,\log P_{\theta(X)}(Y).$$
part of the full dataset. A typical choice is to take 50% of the data for the
training part and 25% for both the validation and test parts.
However, in many cases the full dataset is not large enough to split it
into three independent parts. We discuss in the next sections how model
selection and estimation of the test error can be performed in such cases. The
methods that approximate the validation step fall into two classes: those that
use an analytical approximation and those that re-use the training data.
Note that the first expectation is with respect to the randomness of the
responses in the training data while the second expectation averages over
all possible vectors Y new of n new responses. The optimism is now defined as
the expected difference between the in-sample error and the training error:
$$\mathrm{op} = \mathrm{Error}_{\mathrm{in}} - E_{\mathbf{y}}(\overline{\mathrm{err}}), \qquad (9.4)$$
where yi and ŷi are the true and predicted classes at xi for the 0 − 1 loss
function while for log-likelihood loss yi and ŷi are the true and predicted
class 1 probabilities at xi . The optimism, that is, the amount by which the
training error underestimates the true prediction error, depends on the
covariance between yi and its prediction ŷi . Hence, the stronger yi affects
its own prediction, the larger the optimism.
Using (9.5) in (9.4) yields the relation
$$\mathrm{Error}_{\mathrm{in}} = E_{\mathbf{y}}(\overline{\mathrm{err}}) + \frac{2}{n}\sum_{i=1}^{n} \mathrm{Cov}\left(\hat{y}_i, y_i\right) \qquad (9.6)$$
for two-class classification. In the next section we discuss some methods that
(analytically) estimate the optimism, which is then added to the training
error to obtain an estimate of the in-sample error. In-sample error usually
is not of direct interest because future observations in feature space are not
likely to coincide with training observations. However, for comparison of
models, in-sample error is sufficient because only the relative size of the
prediction error is needed.
where d is the number of parameters that are fit and θ̂ is the maximum-
likelihood estimate of θ. The first term on the right side can be estimated
by the training error $\overline{\mathrm{err}}$ (that is, the expectation is dropped). The second
term on the right is the estimate of the optimism. Given a set of models
f (x; α), where the index α is e.g. a tuning parameter, denote by $\overline{\mathrm{err}}(\alpha)$ and
d(α) the training error and number of parameters for each model, then AIC
is given by
$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{n}.$$
We now choose the model giving smallest value of AIC among the set of
models. For nonlinear and other adaptive, complex models, the number of
parameters d needs to be replaced by some measure of model complexity
such as the effective degrees of freedom.
The Bayesian Information Criterion (BIC) or Schwarz criterion re-
places the constant 2 in the second term of AIC by log(n), that is,
$$\mathrm{BIC} = \overline{\mathrm{err}} + \log(n)\,\frac{d}{n}.$$
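Both criteria are straightforward to evaluate once the training error and the (effective) number of parameters of each candidate model are available; the sketch below compares a hypothetical set of models indexed by a tuning parameter α.

    import numpy as np

    def aic(train_err, d, n):
        return train_err + 2.0 * d / n

    def bic(train_err, d, n):
        return train_err + np.log(n) * d / n

    # hypothetical candidates: alpha -> (training error, number of parameters)
    models = {0.1: (0.12, 15), 1.0: (0.18, 7), 10.0: (0.25, 3)}
    n = 200
    best_aic = min(models, key=lambda a: aic(*models[a], n))
    best_bic = min(models, key=lambda a: bic(*models[a], n))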
9.5 Cross-validation
Cross-validation is the most widely used method to estimate prediction er-
ror. It directly estimates the generalization error (9.1), that is the prediction
error on independent test data. K-fold cross-validation splits the data at
random into K roughly equal parts. For each part j (j = 1, . . . , K), of the
data, the model is fit to the other K − 1 parts of the data, which leads to
a fit fˆ−j (x) where the −j indicates that part j of the data was removed
during the fitting process. We then calculate the prediction error of this
model fˆ−j (x) when predicting the outcome for the observations in the left-
out part (part j) of the data. The prediction errors obtained from the K
parts are then combined. To summarize, we can define an indexing function
κ : {1, . . . , n} → {1, . . . , K} that indicates for each training observation to
which of the K parts it belongs. Then the K-fold cross-validation estimate
of the test error is
$$\text{CV-error} = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i,\, \hat{f}^{-\kappa(i)}(x_i)\right).$$
For a set of models $\hat{f}(x; \alpha)$ indexed by a tuning parameter α, the corresponding criterion is
$$\text{CV-error}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i,\, \hat{f}^{-\kappa(i)}(x_i; \alpha)\right), \qquad (9.7)$$
which can be minimized over α to select the tuning parameter.
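The procedure is easy to implement for any learning method; the sketch below assumes generic fit, predict and loss callables and numeric arrays, and uses a random assignment of the observations to K folds as the indexing function κ.

    import numpy as np

    def kfold_cv_error(X, y, fit, predict, loss, K=10, rng=np.random.default_rng(3)):
        n = len(y)
        # indexing function kappa: random assignment of the observations to K folds
        kappa = rng.permutation(np.repeat(np.arange(K), int(np.ceil(n / K)))[:n])
        errors = np.empty(n)
        for j in range(K):
            held_out = (kappa == j)
            model = fit(X[~held_out], y[~held_out])       # fit on the other K-1 parts
            errors[held_out] = loss(y[held_out], predict(model, X[held_out]))
        return errors.mean()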
are more sparse in predictor space. As a result the predictions in the cross-
validation procedure can be biased depending on how the performance of
the classifier changes with sample size. This bias leads to less accurate pre-
dictions and hence an over-estimation of the prediction error. On the other
hand, a large value of K (e.g. K = n) makes the cross-validation procedure
approximately unbiased for the true prediction error but the procedure be-
comes highly variable due to its stronger dependence on the training data
set (the n training sets in the cross-validation are very similar to one an-
other). Also, the computational burden can be heavy if no update rules can
be used. Overall, 5- or 10-fold cross-validation can be recommended as a
good compromise.
9.6 Bootstrap Methods

A first bootstrap estimate of the prediction error fits the model to each of B bootstrap samples and evaluates each fit $\hat{f}_b$ on the original training sample:
$$\mathrm{Error}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{n}\sum_{b=1}^{B}\sum_{i=1}^{n} L\left(y_i, \hat{f}_b(x_i)\right).$$
However, in general Errorboot does not provide a good estimate of the true
prediction error. The reason is that each bootstrap sample has observations
in common with the original training sample that is being used as the test
set. If a method is overfitting, then it will also overfit in the bootstrap
samples, leading to unrealistically good predictions for all observations in the
training sample (acting as the test sample) that also belong to the bootstrap
sample. Cross-validation explicitly avoids this problem by splitting the
data into non-overlapping training and test samples.
As an illustration of the overly optimistic character of this bootstrap
estimate, consider a 1-nearest-neighbor classifier that is being applied to a
two-class classification problem. If both classes contain the same number
of observations and the class labels are independent of the features, then
the true error rate is 0.5 (50%). Now consider the predictions for the ob-
servations in the original training dataset based on the 1-nearest-neighbor
classifier trained on a bootstrap sample. For training observations that be-
long to the bootstrap sample, the error will be zero because the nearest
neighbor in the bootstrap sample is the observation itself and hence its class
is predicted correctly. If the training observation does not belong to the
bootstrap sample, then the predicted class will be the class of the near-
est neighbor in the bootstrap sample. Due to the randomness in assigning
class labels, averaging over the bootstrap samples, the probability that this
nearest neighbor belongs to either class is the same, that is, 50%. Since boot-
strap samples are independently drawn with replacement from the original
training sample and each observation has the same probability 1/n of being
selected, we can calculate the probability that an observation i belongs to a
bootstrap sample b:
$$P(\text{observation } i \in \text{bootstrap sample } b) = 1 - \left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; 1 - \frac{1}{e} \approx 0.632. \qquad (9.8)$$
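The limit in (9.8) is reached quickly, as a short numerical check shows:

    import numpy as np

    for n in (10, 100, 1000, 10000):
        print(n, 1 - (1 - 1 / n) ** n)   # approaches 1 - 1/e
    print(1 - np.exp(-1))                # 0.6321...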
The out-of-bag bootstrap estimate instead predicts each observation only from the bootstrap samples in which it does not appear:
$$\mathrm{Error}_{\text{out-of-bag}} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\left(y_i, \hat{f}_b(x_i)\right).$$
Here, for every training observation i, the set $C^{-i} \subset \{1, \ldots, B\}$ is the set
with indices of bootstrap samples that do not contain the ith training ob-
servation. |C −i | is the number of bootstrap samples not containing the ith
observation. If we take the number of bootstrap samples B large enough,
then all C −i will be nonempty. If empty sets C −i occur, we can leave these
terms out of the summation.
The out-of-bag bootstrap estimate of the prediction error solves the over-
fitting problem of Errorboot , but it suffers from a bias similar to that of K-fold
cross-validation with a low value of K. From (9.8) it follows that the av-
erage number of distinct observations in each bootstrap sample is 0.632 n,
hence the effective sample size of the bootstrap samples is about 60% of
the original sample size n. Therefore, the behavior of the bootstrap esti-
mate of the prediction error is similar to the behavior of 2-fold/3-fold cross-
validation. When such a bias occurs, the underestimation of prediction error
by Errorboot turns into an overestimation of this error by Errorout-of-bag .
The .632 bootstrap estimator of prediction error tries to avoid the prob-
lems of the above proposals. It corrects for the possible underestimation of
prediction error by Errorboot due to overfitting, but at the same time ad-
dresses the possible bias of the Errorout-of-bag estimator. The .632 bootstrap
estimator is defined by
$$\mathrm{Error}_{.632} = 0.368\, \overline{\mathrm{err}} + 0.632\, \mathrm{Error}_{\text{out-of-bag}}.$$
The no-information error rate γ̂ is the error rate that would be obtained if the inputs and the class labels were independent; it is estimated by evaluating the prediction rule on all combinations of responses $y_i$ and predictors $x_{i'}$:
$$\hat{\gamma} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} L\left(y_i, \hat{G}(x_{i'})\right).$$
Based on this estimate of the no-information error rate, the relative overfit-
ting rate is defined as
$$\hat{R} = \frac{\mathrm{Error}_{\text{out-of-bag}} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}}.$$
Clearly, $\hat{R} \in [0, 1]$, with $\hat{R} = 0$ when there is no overfitting ($\overline{\mathrm{err}} = \mathrm{Error}_{\text{out-of-bag}}$)
and $\hat{R} = 1$ if the overfitting equals $\hat{\gamma} - \overline{\mathrm{err}}$, that is, if the overfitting is so large
that the performance of the classifier is as bad as the random assignment
rule. The .632+ bootstrap estimator is now defined by
$$\mathrm{Error}_{.632+} = (1 - \hat{w})\, \overline{\mathrm{err}} + \hat{w}\, \mathrm{Error}_{\text{out-of-bag}},$$
where
$$\hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}}.$$
The weight ŵ ranges between 0.632 and 1. The weight equals 0.632 if R̂ =
0, that is, when there is no overfitting, and in this case Error.632+ equals
Error.632 since no improvement was necessary. The weight becomes 1 if
R̂ = 1, that is, when there is large overfitting, and in this case Error.632+
equals Errorout-of-bag , so no bias correction is made.
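Given the three ingredients (training error, out-of-bag error and the no-information rate γ̂), the estimator is simple to compute; a small sketch, with the clipping of R̂ to [0, 1] added here as a safeguard:

    import numpy as np

    def error_632_plus(train_err, oob_err, gamma):
        R = 0.0
        if gamma > train_err:
            R = np.clip((oob_err - train_err) / (gamma - train_err), 0.0, 1.0)
        w = 0.632 / (1.0 - 0.368 * R)
        return (1.0 - w) * train_err + w * oob_err

    # the heavily overfitting 1-nearest-neighbor example discussed below
    print(error_632_plus(train_err=0.0, oob_err=0.5, gamma=0.5))   # 0.5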
To illustrate that the .632+ bootstrap estimator solves the underesti-
mation of the prediction error rate in heavily overfitting situations, we look
again at the 1-nearest-neighbor classifier for the two-class problem with equal
class sizes and labels assigned at random. In this case we clearly have that
γ̂ = 0.5, and since also Errorout-of-bag = 0.5 as shown above, we obtain that
R̂ = 1, and hence, ŵ = 1 which implies that Error.632+ = Errorout-of-bag =
0.5.
Part II
Regression
Chapter 10
Problem setting
where the expectation is w.r.t. the joint distribution of (X, Y ). For squared
error loss this becomes, at each point x, the problem
$$\min_{c}\; E_{Y|X}\left[(Y - c)^2 \mid X = x\right]. \qquad (10.2)$$
By taking the derivative w.r.t. c, it can be seen that the solution to (10.2)
is
$$\hat{f}(x) = E_{Y|X}\left[Y \mid X = x\right], \qquad (10.3)$$
the conditional mean of Y given X = x. Alternatively, absolute error loss $L(Y, f(X)) = |Y - f(X)|$
can be used. In this case it can be shown that the optimal solution becomes the conditional
median of Y given X = x. A common framework for regression is the additive error model
$$Y = f(X) + \epsilon. \qquad (10.4)$$
Using squared error loss in the general additive error model, leads to
minimizing
$$\mathrm{RSS}(f) = \sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 \qquad (10.5)$$
over all possible functions f , which yields infinitely many solutions. Any
solution fˆ(x) that passes through the n training data points can be chosen.
However, most of these solutions will perform poorly on independent test
data, leading to poor predictions at the test points. Hence, in order to
obtain useful solutions we must restrict the minimization in (10.5) to a
smaller class of functions. The choice of restrictions is based on the nature
of the problem or on theoretical considerations, but not driven by the data.
Any set of appropriate restrictions imposed on the class from which fˆ can be
selected will lead to a (unique) solution. However, different restrictions may
(and often will) lead to different solutions. The constraints imposed by most
learning methods can be interpreted as complexity restrictions that require
some kind of locally regular (simple) behavior. By comparing the solutions
obtained from different learning methods, an optimal prediction procedure
can be selected.
Chapter 11

Linear Regression Methods

11.1 Introduction
Linear regression models assume that the regression function E(Y |X) de-
pends linearly on the predictors X1 , . . . , Xd . Such models are simple and in
many cases provide a reasonable and well-interpretable description of the ef-
fect of the features on the outcome. In some situations, simple linear models
can even outperform fancier nonlinear models, for example when there are few
training observations, when the data are very noisy so that the trend (the
signal) is hard to detect (low signal-to-noise ratio), or when the data are sparse
(high-dimensional data problems). We will start the discussion of lin-
ear models with standard least squares estimation and then continue with
shrinkage methods that allow for bias if this leads to a substantial decrease
in variance and thus yields a more stable model. As an alternative to shrink-
age methods, linear least squares regression can be applied to a (limited) set
of derived input directions. This approach is discussed in the last sections.
$$f(X) = \beta_0 + \sum_{j=1}^{d} \beta_j X_j. \qquad (11.1)$$
Hence, the linear model assumes that the regression function E(Y |X) can be
approximated reasonably well by the linear form (11.1). Note that the model
is linear in the unknown parameters β0 , . . . , βd . The predictor variables
X1 , . . . , Xd can be measured features, transformations of features, polyno-
mial representations or other basis expansions, dummy variables, interac-
tions, etc....
11.2 Least Squares Regression

The least squares estimator minimizes the residual sum of squares
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{d}\beta_j x_{ij}\right)^2 = (\mathbf{y} - \mathbf{X}\beta)^t(\mathbf{y} - \mathbf{X}\beta), \qquad (11.2)$$
where X is the feature data matrix which starts with a column of ones to
include the intercept in the model. Differentiating (11.2) with respect to β
and setting the result equal to zero yields the estimating equation
Xt (y − Xβ) = 0 (11.3)
with solution
β̂ = (Xt X)−1 Xt y (11.4)
ŷ = Xβ̂
= X(Xt X)−1 Xt y
= Hy,
with H the hat matrix. The predicted value at a new input vector x0 is
obtained by ŷ0 = fˆ(x0 ) = (1, xt0 )β̂.
Least squares minimizes RSS(β) = ∥y − Xβ∥2 which implies that β̂ is
determined such that ŷ is the orthogonal projection of n-dimensional vector
y onto the subspace spanned by the columns x0 , . . . , xd of the matrix X,
called the column space of X. Hence, the residual vector y − ŷ is orthogonal
to this subspace as can be seen in (11.3). The hat matrix H computes
the orthogonal projection of y onto the column space of X and thus is a
projection matrix as explained before.
It can easily be shown that the variance-covariance matrix of β̂ is given
by
Var(β̂) = (Xt X)−1 σ 2 ,
where the error variance σ 2 is usually estimated by
$$\hat{\sigma}^2 = \frac{1}{n - (d+1)}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$
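The quantities above translate directly into a few lines of numpy; this sketch assumes X already contains the leading column of ones, and in practice one would solve the normal equations with a linear solver rather than an explicit inverse.

    import numpy as np

    def least_squares(X, y):
        # X is the n x (d+1) feature matrix whose first column is all ones
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y              # (11.4)
        H = X @ XtX_inv @ X.T                     # hat matrix
        y_hat = H @ y
        n, p = X.shape                            # p = d + 1
        sigma2_hat = np.sum((y - y_hat) ** 2) / (n - p)
        var_beta = XtX_inv * sigma2_hat           # Var(beta_hat)
        return beta_hat, y_hat, sigma2_hat, var_beta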
Var(θ̂) ≤ Var(θ̃).
When using the squared error loss function (10.1) to measure prediction
accuracy, we focus on the mean squared error of estimators. Consider the
prediction of the response at a point x0 ,
Y0 = f (x0 ) + ϵ0 = xt0 β + ϵ0 .
where MSE(f˜(x0 )) = E[[xt0 (β − β̃)]2 ] is the mean squared error of the pre-
diction at x0 by the estimator f˜. Therefore, expected prediction error and
mean squared error of a prediction differ only by the constant σ 2 , represent-
ing the irreducible variance of the new observation y0 around its expected value f (x0 ).
The unbiasedness of the least squares estimator makes the first term of
the mean squared error zero. Since the prediction f˜(x0 ) = xt0 β̃ is a lin-
ear combination of β̃, the Gauss-Markov theorem ensures that the least
squares estimator is optimal w.r.t. variance within an important class of
unbiased estimators. That is, the least squares estimator has the smallest
mean squared error among all linear unbiased estimators. However, this
optimal unbiased estimator can still be highly unstable. This can happen in
situations with (many) candidate predictors among which only a few have
a significant relation with the outcome. The large fraction of noise vari-
ables then causes a large variability of the least squares estimates. Also
the presence of highly correlated predictors (multicollinearity) will make an
unbiased estimator such as least squares highly variable.
To obtain a more stable, better performing solution, we can look for
biased estimators that lead to a smaller mean squared error, and thus to
better predictions. These biased estimators thus trade a little bias for a
(much) larger reduction in variance. From a broader viewpoint, almost all
models are only an approximation of the true mechanism that relates the
features to the outcome, and hence these models are biased. Therefore,
allowing for a little extra bias when selecting the optimal model from the
class of models considered is a small price to achieve a more stable, better
performing model. As always, picking the right model means trying to find
the right balance between bias and variance.
A low prediction accuracy due to high variability is a first reason why
the least squares estimator may not be satisfactory. A second reason is the
possible loss of interpretability. With a large number of predictors it is
often of interest to determine a small subset of predictors that exhibit the
strongest influence on the outcome. The high variability of least squares
complicates the selection of these predictors. A biased estimator obtained
by shrinkage will be much more successful. With least squares, we can
take the subset selection approach to determine important variables or to
find stable submodels. Note that after subset selection we usually end up
with a biased regression model, a fact that is often ignored in regression
analysis. Several selection strategies exist such as all-subsets, backward
elimination, or forward selection. Different selection criteria can be used
11.3 Ridge Regression

Ridge regression penalizes the size of the regression coefficients. The penalized least squares criterion (11.5) can equivalently be written as a least squares problem
$$\text{subject to} \quad \|\beta\|_2^2 = \sum_{j=1}^{d}\beta_j^2 \le s, \qquad (11.7)$$
for some bound $s \ge 0$. Since the solution is not invariant to the scaling of the inputs, it is advisable to standardize the
predictors (to mean zero and variance 1) when fitting the ridge regression
model.
From (11.5) it can easily be seen that the ridge regression solution is
given by
β̂ridge = (Xt X + λI)−1 Xt y. (11.8)
$$\hat{y} = X\hat{\beta}_{\mathrm{ridge}} = \left[X(X^t X + \lambda I)^{-1} X^t\right] y = S^{\mathrm{ridge}}_{\lambda}\, y,$$
where $S^{\mathrm{ridge}}_{\lambda}$ is the ridge regression linear operator. The tuning parameter
λ can therefore be determined by fixing the effective degrees of freedom of
the ridge regression solution, where the effective degrees of freedom is given
by $\mathrm{trace}(S^{\mathrm{ridge}}_{\lambda})$.
To further explain the behavior of ridge regression we now make some
derivations similar to those in Section 4.4.1 for smoothing splines. We as-
sume that both the response and the predictors have been centered (and
scaled) such that the n × d matrix X is the centered feature matrix. Note
that the centering causes the intercept to be zero, so there is one parameter
less. The singular value decomposition (SVD) of the centered feature matrix
X is
X = U DV t (11.9)
ŷridge = Xβ̂ridge
= [X(Xt X + λI)−1 Xt ]y
= [U DV t (V DU t U DV t + λI)−1 V DU t ]y
= [U DV t (V D 2 V t + λV IV t )−1 V DU t ]y
= [U DV t (V (D 2 + λI)V t )−1 V DU t ]y
= [U DV t V (D 2 + λI)−1 V t V DU t ]y
= [U D(D 2 + λI)−1 DU t ]y (11.10)
Note that $D(D^2 + \lambda I)^{-1} D$ is still a diagonal matrix, with diagonal elements
$$\frac{\gamma_j^2}{\gamma_j^2 + \lambda}; \qquad j = 1, \ldots, d.$$
Hence,
$$\hat{y}_{\mathrm{ridge}} = \sum_{j=1}^{d} \frac{\gamma_j^2}{\gamma_j^2 + \lambda}\,(u_j^t y)\, u_j, \qquad (11.11)$$
while the least squares fit can be written as
ŷLS = Xβ̂LS
= [X(Xt X)−1 Xt ]y
= [U DV t (V DU t U DV t )−1 V DU t ]y
= [U DV t (V D 2 V t )−1 V DU t ]y
= [U DV t V (D 2 )−1 V t V DU t ]y
= [U D(D 2 )−1 DU t ]y
= [U U t ]y
∑
d
= (utj y)uj , (11.12)
j=1
Comparing (11.11) with (11.12), we see that ridge regression shrinks the least squares coordinate weights (utj y) by the factor γj2 /(γj2 + λ).
Hence, the smaller the squared singular value γj2 , the larger the amount
of shrinking in the direction uj . Note that ridge regression thus looks for
an optimal solution in the same d-dimensional subspace as least squares,
contrary to smoothing splines in Section 4.4 that look for a solution in the
full n dimensional space.
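The shrinkage view translates directly into code. The following sketch computes the ridge fit and its effective degrees of freedom from the singular value decomposition of the centered (and scaled) feature matrix; the names are illustrative.

    import numpy as np

    def ridge_fit_svd(X, y, lam):
        # X centered (and scaled), y centered
        U, gamma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(gamma) V^t
        shrink = gamma**2 / (gamma**2 + lam)                   # factors gamma_j^2 / (gamma_j^2 + lam)
        y_hat = U @ (shrink * (U.T @ y))                       # fitted values (11.11)
        df = shrink.sum()                                      # trace of the ridge operator
        return y_hat, df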
To understand which directions correspond to a larger amount of shrink-
ing, we consider the d × d matrix of squares and cross-products:
Xt X = V DU t U DV t
= V D2 V t (11.13)
zj = Xvj = U DV t vj = U Dej = γj U ej = γj uj .
directions the slope of the linear surface fitting the data. This makes the
least squares solution very unstable. Ridge regression protects against a po-
tentially high variance in these directions (with small variance) by shrinking
the regression coefficients in these directions. Hence, ridge regression does
not allow a large slope in these directions and thus assumes (implicitly) that
the response will vary most in directions with high variance of the predic-
tors. This is often a reasonable assumption and even if it does not hold, the
data do not contain sufficient information to reliably estimate the slope of
the surface in directions with small variance of the predictors.
$$\text{subject to} \quad \|\beta\|_1 = \sum_{j=1}^{d} |\beta_j| \le t, \qquad (11.14)$$
$$\hat{y}_{\mathrm{PCR}} = \sum_{m=1}^{M} \hat{\theta}^{\mathrm{PCR}}_m\, X v_m = X \sum_{m=1}^{M} \hat{\theta}^{\mathrm{PCR}}_m v_m = X\hat{\beta}_{\mathrm{PCR}}.$$
$$\hat{y}_{\mathrm{PCR}} = \sum_{j=1}^{M} (u_j^t y)\, u_j. \qquad (11.15)$$
Note that if M = d we would just obtain the least squares solution. For
M < d PCR thus selects the optimal solution in an M -dimensional subspace
of the d-dimensional subspace spanned by the columns of X. Both ridge
regression and principal components regression use the principal components
of the feature data to find an alternative for the least squares fit. While ridge
regression shrinks the fit the most in principal components directions with
small variance, PCR completely ignores these directions and thus makes
the slope of the linear fit zero in these directions. Hence, PCR makes the
stronger assumption that the response does not vary in these directions.
$$v_m = \operatorname*{argmax}_{\|v\|=1,\; v_l^t S v = 0,\; l = 1, \ldots, m-1} \mathrm{Corr}^2(y, Xv)\,\mathrm{Var}(Xv), \qquad m = 1, \ldots, M,$$
which yields directions that have sufficiently high variance in predictor space,
but also have a high correlation with the response. However, further investi-
gation of PLS has revealed that the variance component tends to dominate
and thus PLS behaves much like PCR and hence, also similar to ridge re-
gression.
Chapter 12
Nonparametric Regression
12.1 Introduction
In this chapter we consider regression techniques that allow fitting more flex-
ible regression functions f (X). One way to achieve this flexibility is by using
basis expansions such as splines. These techniques are already discussed
in Chapter 4. In this chapter we will focus on flexible regression methods
that do not expand the predictor space but that fit a different simple lo-
cal regression model at each query point x0 . The local fit is based on the
observations close to the target point x0 and the weights given to the ob-
servations are chosen such that the resulting estimating function fˆ(X) is a
smooth function. To determine the weights, a weighting or kernel function
Kλ (x0 , x) is used as in Section 3.8. Hence, the weight of an observation i
depends on the distance of xi to the target point x0 . As such, kernel based
nonparametric regression can be seen as a smooth version of nearest neigh-
bor regression, which gives weight one to all neighbors of the target point
x0 . Nearest neighbors for regression is also discussed in this chapter.
Figure 12.1 shows the three kernel functions. The Epanechnikov and tri-
cube kernels have a compact support (positive on a finite interval) while the
Gaussian kernel has a noncompact support. The tri-cube function is flatter
on the top than the Epanechnikov kernel which yields more efficient results
but can lead to more bias. The tri-cube function has the extra advantage of
being differentiable at the boundary of the support.
Figure 12.1: The Epanechnikov, tri-cube and Gaussian kernel functions D(t).
The tuning parameter λ controls the window size, that is the minimal
distance from the target point at which the weight becomes zero for the
compact support kernels. For the noncompact Gaussian kernel, λ is the
standard deviation as discussed in section 3.8.1. The choice of λ leads to
the usual trade-off between bias (large λ) and variance (small λ).
Locally kernel-weighted averages can have problems at the boundary.
At the boundary, the window tends to contain less points and the kernel
window becomes asymmetric which can lead to severe bias. Bias can also
occur in the interior if the training points are not approximately equally
spaced. Fitting higher order local regression models can alleviate the bias
effect.
which yields estimates β̂0 (x0 ) and β̂1 (x0 ) at each target point x0 . With these
estimates, the fit at x0 becomes fˆ(x0 ) = β̂0 (x0 ) + β̂1 (x0 )x0 . Note that at
each target point x0 , the fit fˆ(x0 ) is obtained from a different linear model.
Introduce the n × n diagonal weight matrix W (x0 ) with diagonal
elements Kλ (x0 , xi ); i = 1, . . . , n. Then, using the weighted least squares
solution of (12.6), the fitted response can be written as
$$\hat{f}(x_0) = \tilde{x}_0^t\left(X^t W(x_0)\, X\right)^{-1} X^t W(x_0)\, y = l(x_0)^t\, y,$$
with x̃0 = (1, x0 )t and X an n × 2 matrix where the first column is a column
of ones and the second column is x, the observed feature vector. Hence,
again the fit is obtained by a linear combination of the observed response y.
If we consider f̂ , the vector of fitted responses at the training points, then
we obtain
$$\hat{f} = S^{\mathrm{kernel}}_{\lambda}\, y,$$
where the n × n matrix $S^{\mathrm{kernel}}_{\lambda} = (l(x_1), \ldots, l(x_n))^t$ is the linear operator.
The effective degrees of freedom, $\mathrm{trace}(S^{\mathrm{kernel}}_{\lambda})$, can thus be used to determine
the tuning parameter λ.
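For a single predictor, the local linear fit at a target point amounts to one small weighted least squares problem; the sketch below uses the Epanechnikov kernel and assumes enough observations receive positive weight for the 2 × 2 system to be solvable.

    import numpy as np

    def epanechnikov(t):
        return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

    def local_linear_fit(x, y, x0, lam):
        # kernel weights K_lambda(x0, x_i) for window width lam
        w = epanechnikov((x - x0) / lam)
        X = np.column_stack([np.ones_like(x), x])               # n x 2 design matrix
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)        # weighted least squares
        return beta[0] + beta[1] * x0                           # fit at the target point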
which yields estimates $\hat{\beta}_0(x_0), \ldots, \hat{\beta}_M(x_0)$ and corresponding fit $\hat{f}(x_0) = \hat{\beta}_0(x_0) + \sum_{m=1}^{M}\hat{\beta}_m(x_0)\, x_0^m$ at each target point x0 . With local degree M (M >
1) polynomial fits, the bias will be reduced further. Especially in regions of
high curvature, local polynomial fits will have much lower bias than local
linear fits, that have a bias effect described by trimming the hills and filling
the valleys. Of course the bias reduction does not come for free. A higher
order model needs to be fit locally (using the same number of observations
as the linear model) which yields a decrease of precision and thus an increase
in variance.
where $S^{\mathrm{NNR}}_{K}$ is the linear operator. The effective degrees of freedom, $\mathrm{trace}(S^{\mathrm{NNR}}_{K})$,
can be used to determine the tuning parameter K.
Note that the kernel based local regression methods in the previous sec-
tion can be extended by replacing the constant window width λ by a window
width function λ(x0 ) that determines the width of the neighborhood at each
target point x0 . For example, we can use the function
$$\lambda(x_0) = \|x_0 - x_{(K)}\|, \qquad (12.8)$$
which is the distance between x0 and its Kth closest training point $x_{(K)}$. If we
construct an adaptive kernel by using (12.8) in (12.2) with D(t) = I(|t| ≤
1), then the local average (12.1) just equals K-nearest neighbor regres-
sion. Straightforward extensions are obtained by using this kernel in (12.6)
or (12.7) which yields nearest neighbor methods that locally fit linear or
polynomial models instead of simple averages. Adaptive kernel based local
regression methods are obtained by using the width function (12.8) in (12.2)
and applying the Epanechnikov (12.3), tri-cube (12.4), or Gaussian (12.5)
function.
Chapter 13

Structured Regression Methods
13.1 Introduction
To handle the curse of dimensionality which affects the local regression meth-
ods in the previous chapters, more structure will need to be imposed on the
regression function. However, we do not want to impose too rigid a struc-
ture, such as a global linear regression structure, that reduces flexibility too
much. In this chapter we discuss some techniques that impose a certain
structure of the regression function but still allow for flexibility.
$$f(X) = f(X_1, \ldots, X_d) = \beta_0 + \sum_{j=1}^{d} f_j(X_j) + \sum_{k<l} f_{kl}(X_k, X_l) + \ldots\,. \qquad (13.1)$$
13.3 Multivariate Adaptive Regression Splines

Multivariate adaptive regression splines (MARS) builds the regression function from piecewise linear basis functions of the form $(X - t)_+$ and $(t - X)_+$,
which are linear splines with their knot at the value t. The two functions
together are called a reflected pair. The idea of MARS is to consider for each
input variable Xj an expansion in reflected pairs with knots at the observed
values xij i = 1, . . . , n of that input variable. This yields the collection of
functions
$$\left\{ (X_j - t)_+,\; (t - X_j)_+ \right\}, \qquad t \in \{x_{1j}, \ldots, x_{nj}\},\; j = 1, \ldots, d.$$
The function hm (X) is selected from the current set of basis functions
{h0 (X), h1 (X), h2 (X)}. There is one restriction on the products considered.
The resulting basis function should be an interaction of different features,
but higher order powers of a feature variable are not allowed. The pair of
functions that yields the largest decrease in residual sum of squares is added
to the set of basis functions and the process is iterated until the basis ex-
pansion contains a maximum number of terms, M . Note that each basis
function in this expansion is a product of a number of linear splines.
The resulting model with all M basis functions usually overfits the train-
ing data. Therefore, a backward elimination procedure is applied (similar
to pruning trees). At each step the basis function whose removal causes the
smallest increase in residual squared error, is removed from the model. This
yields a sequence fˆm m = 1, . . . , M of best models of size going from 1 to
M . To select the optimal model from this sequence, MARS uses generalized
cross-validation (GCV) defined as
$$\mathrm{GCV}(m) = \frac{\sum_{i=1}^{n}\left(y_i - \hat{f}_m(x_i)\right)^2}{\left(1 - \mathrm{df}(m)/n\right)^2}. \qquad (13.2)$$
Here, the value df(m) is the effective number of parameters used to build the
model. This effective number of parameters accounts both for the number
of (independent) regression coefficients in the model and the number of ’pa-
rameters’ used in selecting the positions of the knots. It can be argued that
with linear splines it requires 3 parameters to select each knot. Therefore,
df(m) = r + 3K where K is the number of knots being used and r is the
rank of the n × m matrix with the basis functions evaluated at each of the
training observations. Usually, the matrix has full rank such that r = m.
The generalized cross-validation criterion is a computationally efficient
approximation of leave-one-out cross-validation that can be used for linear
fitting methods ŷ = Sy. Leave-one-out cross-validation prediction error is
given by
$$\text{CV-error} = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - \hat{f}^{-i}(x_i)\right]^2.$$
For many linear fitting methods, it can be shown that
$$y_i - \hat{f}^{-i}(x_i) = \frac{y_i - \hat{f}(x_i)}{1 - S_{ii}},$$
where S ii is the ith diagonal element of S. Hence, in this case leave-one-out
cross-validation can be obtained without refitting the model each time:
$$\text{CV-error} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i - \hat{f}(x_i)}{1 - S_{ii}}\right]^2.$$
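For any linear fitting method ŷ = Sy this shortcut only requires the diagonal of S; as an illustration, the sketch below evaluates it for ridge regression on centered data.

    import numpy as np

    def loo_cv_error(X, y, lam):
        # ridge regression as an example of a linear smoother y_hat = S y
        n, d = X.shape
        S = X @ np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T
        y_hat = S @ y
        return np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)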
The use of piecewise linear basis functions in MARS has the advantage
that each basis function is only nonzero in a small part of the predictor space,
and hence operates locally. This means that the regression surface is built
up parsimoniously, using nonzero components locally where they are needed
to improve the fit instead of globally. In this way, parameters are spent
carefully, using only one or a few nonzero parameters for the fit in each
region of the predictor space. Another important advantage of piecewise
linear functions is computational. It can be shown that the piecewise linear
functions can be fit efficiently and the resulting fit can easily be updated
when a knot is shifted.
The forward modeling strategy in MARS allows higher order interac-
tions to enter only if the lower order interactions are already in the model.
Although this is not necessarily always the case, it is a useful working as-
sumption that is often being made. It avoids that the optimal basis functions
need to be selected from a set of functions that grows exponentially with
dimension. Another useful option in MARS is to put an upper limit on the
order of interactions that are allowed. Without such a restriction, the maximal order of interaction is M .
However, it is often useful to set a limit of two or three. This yields simpler
models that can be interpreted more easily.
Chapter 14
Regression Trees
14.1 Introduction
Similarly to classification trees, regression trees partition the feature space
into rectangular regions and then fit a constant in each of these regions.
Regression trees usually try to minimize squared error loss when determining
the binary partitions, but other loss functions (e.g. L1 loss) are possible too.
In this chapter we discuss regression trees using squared error loss and their
extensions obtained by bagging or boosting.
$$\hat{f}(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m),$$
where the constants $c_m$ are chosen to minimize the residual sum of squares
$$\frac{1}{n}\sum_{m=1}^{M}\sum_{x_i \in R_m}\left(y_i - c_m\right)^2.$$
The best split on variable Xj with split-point t is found by solving
$$\min_{j,t}\left[\,\min_{c_1}\sum_{x_i \in R_1(j,t)}\left(y_i - c_1\right)^2 + \min_{c_2}\sum_{x_i \in R_2(j,t)}\left(y_i - c_2\right)^2\right],$$
where
$$R_1(j, t) = \{X \mid X_j \le t\} \quad \text{and} \quad R_2(j, t) = \{X \mid X_j > t\}$$
as before. Hence, squared error loss is used as the impurity measure within
each region.
As for classification, we grow a large tree T0 where splitting of regions
only stops when some minimal region size (e.g. 5) has been reached. This
large tree has the danger of overfitting the training data. Therefore, the
initial tree is then pruned to find a subtree that fits the data well but is
more stable and is expected to perform better on validation data.
Pruning again uses the cost-complexity criterion (7.2), now with the within-region impurity
$$Q_{R_m}(T) = \frac{1}{n_m}\sum_{x_i \in R_m}\left(y_i - \hat{c}_m\right)^2.$$
The optimal tree for each α ≥ 0 is again found by weakest link pruning
which in each step removes the branch point that produces the smallest
increase of the total impurity (7.3).
A first drawback of trees is the lack of smoothness of the resulting regression
function. Within each leaf a constant is fit, so the resulting regression
function shows many jumps. Another drawback is the difficulty of regression
trees to capture additive structures. Consider a simple additive structure
Y = c1 I(X1 < t1 ) + c2 I(X2 < t2 ) + ϵ where ϵ is zero mean random error.
A regression tree might make its first split on X1 near t1 . At the next level,
it would have to split both regions on X2 near t2 to reflect the additive
structure. This might happen with sufficiently rich data, but the additive
model is given no preference. If there were several additive effects, then it is
highly unlikely that a regression tree will find this additive structure. Both
these drawbacks are overcome by using MARS as alternative to regression
trees.
14.4 Bagging
The instability of regression trees due to their high variability can be ad-
dressed by using bagging as introduced in Section 7.4. Using a number B
of bootstrap samples, the bagging estimate fˆbag (x) is given by (7.4). For
regression trees using squared error loss, bagging is especially useful as it re-
duces variance and leaves bias (approximately) unchanged. Hence, bagging
reduces the mean squared error of predictions. To explain this, consider a
random sample {(xi , yi ); i = 1, . . . , n} from a distribution P and the ideal
aggregate estimator fag (x) = EP [fˆ⋆ (x)] where fˆ⋆ (x) is the fit at x obtained
from a set of observations {(x⋆i , yi⋆ ); i = 1, . . . , n} randomly sampled from
P . We can now write
$$E_P\left[\left(Y - \hat{f}^\star(x)\right)^2\right] = E_P\left[\left(Y - f_{\mathrm{ag}}(x)\right)^2\right] + E_P\left[\left(\hat{f}^\star(x) - f_{\mathrm{ag}}(x)\right)^2\right].$$
The mean squared error of the estimator fˆ(x) thus consists of the mean
squared error of the aggregate estimator fag (x) and the variance of fˆ(x)
around its mean fag (x). Therefore, the true population level aggregate esti-
mator never increases mean squared error. This suggests that the bagging
estimate fˆbag (x) obtained by drawing bootstrap samples from the training
data often decreases the mean squared error.
14.5 Boosting
Boosting regression trees using squared error loss can be obtained by suc-
cessively fitting regression trees to the residuals of the previous tree. Boost-
ing can be seen as fitting an additive expansion using forward stagewise
modeling as explained in Section 7.6. In detail, the algorithm starts from
the regression tree fitted to the data which yields the initial fit f0 (x) and
then iterates the following steps for m = 1, . . . , M . First calculate the
residuals of the training points, rm−1,i = yi − fm−1 (xi ), then fit a regres-
sion tree T to the targets rm−1,1 , . . . , rm−1,n and update the additive fit as
fm (x) = fm−1 (x) + T (x), where T (x) denotes the tree fitted to the residuals. Finally, the boosting algorithm
outputs the fit fˆ(x) = fM (x). This boosting procedure for regression trees
is called multiple additive regression trees (MART). As for classification, the
performance is best by growing fixed size trees of size 4 ≤ J ≤ 8. The
optimal number of iterations M ⋆ can be selected by monitoring prediction
error in the process.
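The MART procedure with squared error loss is only a few lines; the sketch below uses scikit-learn regression trees with a fixed number of leaves as base learners (starting from the zero fit, so the first tree is simply fit to the responses), and all names are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mart_fit(X, y, M=200, n_leaves=6):
        trees, f = [], np.zeros(len(y))
        for _ in range(M):
            r = y - f                                  # residuals of the current fit
            tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X, r)
            f = f + tree.predict(X)                    # update the additive expansion
            trees.append(tree)
        return trees

    def mart_predict(trees, X):
        return sum(tree.predict(X) for tree in trees)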
Chapter 15

Regression Support Vector Machines

15.1 Introduction
In this chapter we discuss the use of kernel-based methods for regression.
Regression support vector machines using squared error loss have already
been discussed in Chapter 4 (see (4.27)). Here, we discuss regression support
vector machines using a different loss function that has an effect similar to
the loss function used in support vector machines for classification.
Consider first a linear regression function
$$f(x) = \beta_0 + x^t\beta.$$
For the support vector classifier in Section 6.3 we determined the boundary
with largest margin that separated the two classes. The separation was
not complete in the sense that we allowed that a number of observations
fell on the wrong side of their boundary. Similarly, we now look for the
hyperplane such that all observations lie within a band of width ϵ around
the hyperplane. Again, we do not expect that such a hyperplane exists for
sufficiently small values of ϵ, thus we allow that a number of observations
falls outside the band. This leads to the ϵ-insensitive loss function
$$L(y, f(x)) = \begin{cases} 0 & \text{if } |y - f(x)| < \epsilon, \\ |y - f(x)| - \epsilon & \text{otherwise.} \end{cases}$$
Hence, the loss is zero for observations with response yi whose distance to
the hyperplane f (x) is smaller than ϵ. For observations further than ϵ from the
hyperplane, the loss equals |y − f (x)| − ϵ.
With this loss function, support vector regression solves the problem
$$\min_{\beta_0, \beta}\left[\sum_{i=1}^{n} L\left(y_i, f(x_i)\right) + \frac{\lambda}{2}\|\beta\|^2\right], \qquad (15.1)$$
Introducing nonnegative slack variables $\xi_i, \xi_i^\star$ that measure by how much an observation falls above or below the band, this problem can equivalently be written as
$$\min_{\beta_0, \beta, \xi, \xi^\star}\; \frac{1}{2}\|\beta\|^2 + \frac{1}{\lambda}\sum_{i=1}^{n}\left(\xi_i + \xi_i^\star\right)$$
$$\text{subject to} \quad y_i - f(x_i) \le \epsilon + \xi_i, \qquad f(x_i) - y_i \le \epsilon + \xi_i^\star, \qquad \xi_i \ge 0, \qquad \xi_i^\star \ge 0, \qquad i = 1, \ldots, n.$$
The corresponding Lagrange primal function is
$$L_P = \frac{1}{2}\|\beta\|^2 + \frac{1}{\lambda}\sum_{i=1}^{n}\left(\xi_i + \xi_i^\star\right) + \sum_{i=1}^{n}\alpha_i\left(y_i - \beta_0 - x_i^t\beta - \epsilon - \xi_i\right) + \sum_{i=1}^{n}\alpha_i^\star\left(\beta_0 + x_i^t\beta - y_i - \epsilon - \xi_i^\star\right) - \sum_{i=1}^{n}\mu_i\xi_i - \sum_{i=1}^{n}\mu_i^\star\xi_i^\star, \qquad (15.2)$$
where αi , αi⋆ ≥ 0 and µi , µ⋆i ≥ 0 are the Lagrange multipliers. Setting the
derivatives w.r.t. β, β0 , ξi and ξi⋆ equal to zero yields the equations
$$\beta = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^\star\right) x_i \qquad (15.3)$$
$$0 = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^\star\right) \qquad (15.4)$$
$$\alpha_i = \frac{1}{\lambda} - \mu_i; \qquad i = 1, \ldots, n \qquad (15.5)$$
$$\alpha_i^\star = \frac{1}{\lambda} - \mu_i^\star; \qquad i = 1, \ldots, n \qquad (15.6)$$
Substituting the first order conditions (15.3)-(15.6) in (15.2) yields the dual
objective function
$$L_D = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^\star\right) y_i - \epsilon\sum_{i=1}^{n}\left(\alpha_i + \alpha_i^\star\right) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\alpha_i - \alpha_i^\star\right)\left(\alpha_j - \alpha_j^\star\right) x_i^t x_j, \qquad (15.7)$$
which is maximized subject to the constraints
$$\sum_{i=1}^{n}\left(\alpha_i - \alpha_i^\star\right) = 0, \qquad 0 \le \alpha_i \le \frac{1}{\lambda}, \qquad 0 \le \alpha_i^\star \le \frac{1}{\lambda}.$$
In addition, the solution satisfies the complementarity conditions
$$\alpha_i\left(y_i - \beta_0 - x_i^t\beta - \epsilon - \xi_i\right) = 0; \qquad i = 1, \ldots, n$$
$$\alpha_i^\star\left(\beta_0 + x_i^t\beta - y_i - \epsilon - \xi_i^\star\right) = 0; \qquad i = 1, \ldots, n$$
$$\alpha_i\,\alpha_i^\star = 0, \qquad \xi_i\,\xi_i^\star = 0; \qquad i = 1, \ldots, n$$
$$\left(\alpha_i - \tfrac{1}{\lambda}\right)\xi_i = 0, \qquad \left(\alpha_i^\star - \tfrac{1}{\lambda}\right)\xi_i^\star = 0; \qquad i = 1, \ldots, n.$$
These constraints imply that only a subset of the values αi − αi⋆ are different
from zero and the associated observations are called the support vectors.
The dual function and the fitted values $f(x) = x^t\beta = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^\star\right) x^t x_i$
are both determined only through the inner products $x^t x_i$. Thus, we can generalize
support vector regression by using kernels.
$$f(x) = \beta_0 + h(x)^t\beta = \beta_0 + \sum_{m=1}^{M}\beta_m h_m(x),$$