
Lecture Notes on

The Kernel Trick


Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

28 January 2017

Contents
1 The kernel trick
1.1 Nonlinear input-output functions
1.2 Inner product in feature space and the kernel function
1.3 Input-output function in terms of the kernel function
1.4 Selecting the basis for the weight vector
1.5 Mercer's theorem

1 The kernel trick


1.1 Nonlinear input-output functions
If x denotes an input vector and φ_k(x) with k = 1, ..., K is a set of fixed nonlinear scalar functions
in x, then¹

    g(x) := Σ_k w_k φ_k(x) = w^T φ(x)                                                            (1)

with real coefficients w_k defines a nonlinear input-output function g(x). On the very right I have used
the more compact vector notation with w := (w_1, ..., w_K)^T and φ := (φ_1, ..., φ_K)^T. Note that not only x but
also the φ_k(x) can be regarded as vectors, which span a vector space of all functions that can be expressed
in the form of equation (1). This vector space is often referred to as function space (D: Funktionenraum)
and it has dimensionality K. Thus, φ is a vector of vectors. For any concrete input vector x, φ(x) can be
considered a simple vector in a K-dimensional Euclidean vector space, which is often referred to as
feature space (D: Merkmalsraum). The process of going from x to φ(x) is referred to as a mapping (D:
Abbildung) from input space to feature space or as a nonlinear expansion (D: nichtlineare Erweiterung)
(Wiskott, 2007).
© 2006, 2007 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures
from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own
copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights
(here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not have the rights
to publish are grayed out, but the word Figure, Image, or the like in the reference is often linked to a pdf.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.
¹ Important text (but not inline formulas) is set bold face; dedicated markers flag important formulas worth remembering and less
important formulas, which I also discuss in the lecture; + marks sections that I typically skip during my lectures.

The dimensionality K of the feature space is usually much greater than the dimensionality
of the input space. However, the manifold populated by expanded input vectors cannot be higher-
dimensional than the input space. For instance, if the input space is two-dimensional and the feature space
is three-dimensional, the two-dimensional manifold of input space gets embedded in the three-dimensional
feature space. It can be folded and distorted, but you cannot make it truly three-dimensional. Locally it
will always be two-dimensional (think of a towel that is rolled up).
For high-dimensional feature spaces, the mapping into and computations within feature space can
become computationally very expensive. However, some inner products in particular feature
spaces can be computed very efficiently without the need to actually do the explicit mapping
of the input vectors x into feature space.
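To get a feeling for how quickly explicit expansions become expensive, here is a minimal Python sketch (not part of the original notes): it counts the number K of monomial basis functions of a polynomial expansion up to degree d of an n-dimensional input, which is the binomial coefficient C(n + d, d), and contrasts this with the single n-dimensional inner product needed by the kernel of the next section.

```python
from math import comb

def feature_space_dim(n, d):
    """Number K of monomials of degree <= d in n input variables,
    i.e. the dimensionality of the polynomial feature space."""
    return comb(n + d, d)

# The explicit expansion grows combinatorially with the input dimension n
# and the polynomial degree d, while the kernel (1 + x^T x')^d of the next
# section only requires one n-dimensional inner product per evaluation.
for n, d in [(2, 2), (2, 3), (10, 3), (100, 5)]:
    print(f"n = {n:3d}, d = {d}:  K = {feature_space_dim(n, d)}")
```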

1.2 Inner product in feature space and the kernel function


As an example consider two input vectors x and x′ and the feature space of polynomials of degree
two with the basis functions

    φ_1 := 1 ,                                                                                   (2)
    φ_2 := √2 x_1 ,    φ_3 := √2 x_2 ,                                                           (3)
    φ_4 := x_1² ,    φ_5 := √2 x_1 x_2 ,    φ_6 := x_2² .                                        (4)

Now, if we take the Euclidean inner product of the input vectors mapped into feature space,
we get

    φ(x)^T φ(x′) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²) (1, √2 x_1′, √2 x_2′, x_1′², √2 x_1′ x_2′, x_2′²)^T   (5)
                 = 1² + 2 x_1 x_1′ + 2 x_2 x_2′ + x_1² x_1′² + 2 x_1 x_2 x_1′ x_2′ + x_2² x_2′²                      (6)
                 = (1 + x_1 x_1′ + x_2 x_2′)²                                                                        (7)
                 = (1 + x^T x′)² .                                                                                   (8)

Note that one would get the same result if one would drop the factors √2 in (3) and (4) but would use a
different inner product in feature space, namely ⟨z, z′⟩ := z_1 z_1′ + 2 z_2 z_2′ + 2 z_3 z_3′ + z_4 z_4′ + 2 z_5 z_5′ + z_6 z_6′. Thus,
there is some arbitrariness here in the way the nonlinear expansion and the inner product in feature space
are defined.
Note also that equation (8) implies that the inner product between any two vectors mapped into feature
space is non-negative. That means that all vectors are mapped into a cone of 90° opening angle.
Interestingly, this generalizes to arbitrarily high exponents, so that

    φ(x)^T φ(x′) = (1 + x^T x′)^d                                                                (9)

is the inner product for a particular nonlinear expansion to polynomials of degree d. For d = 3 and

    φ_1 := 1 ,                                                                                   (10)
    φ_2 := √3 x_1 ,    φ_3 := √3 x_2 ,                                                           (11)
    φ_4 := √3 x_1² ,    φ_5 := √6 x_1 x_2 ,    φ_6 := √3 x_2² ,                                  (12)
    φ_7 := x_1³ ,    φ_8 := √3 x_1² x_2 ,    φ_9 := √3 x_1 x_2² ,    φ_10 := x_2³ ,              (13)

we get

    φ(x)^T φ(x′) = (1 + x^T x′)³ .                                                               (14)
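As a quick numerical sanity check (not part of the original notes), the following Python sketch implements the explicit expansions (2)-(4) and (10)-(13) and verifies that their Euclidean inner products indeed equal (1 + x^T x′)² and (1 + x^T x′)³, respectively.

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 expansion of a 2-dim. input, eqs. (2)-(4)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

def phi3(x):
    """Explicit degree-3 expansion of a 2-dim. input, eqs. (10)-(13)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(3) * x1, np.sqrt(3) * x2,
                     np.sqrt(3) * x1**2, np.sqrt(6) * x1 * x2, np.sqrt(3) * x2**2,
                     x1**3, np.sqrt(3) * x1**2 * x2, np.sqrt(3) * x1 * x2**2, x2**3])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

# The inner product in feature space equals the kernel evaluated in input space.
assert np.isclose(phi2(x) @ phi2(xp), (1 + x @ xp)**2)   # eq. (8)
assert np.isclose(phi3(x) @ phi3(xp), (1 + x @ xp)**3)   # eq. (14)
print("explicit expansions match (1 + x^T x')^d for d = 2, 3")
```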

Equation (9) offers a very efficient way of computing the inner product in feature space, in
particular for high degrees d. Because of the importance of the inner product between two expanded input
vectors we give it a special name and define the kernel function (D: Kernfunktion)

    k(x, x′) := φ(x)^T φ(x′) .                                                                   (15)

Expressing everything in terms of inner products in feature space and using the kernel function
to efficiently compute these inner products is the kernel trick (D: Kern-Trick). It can save a lot of
computation and permits us to use arbitrarily high-dimensional feature spaces. However, it requires that
any computation in the feature space is formulated in terms of inner products, including the nonlinear
input-output function, as will be discussed in the next section.
The kernel function given above is just one example; several others exist (Müller et al., 2001).
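As a small illustration (not part of the original notes), here is the polynomial kernel of equations (9)/(15) together with the Gaussian (RBF) kernel, one of the other widely used kernel functions; the width parameter sigma is a free hyperparameter chosen here only for the example.

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    """k(x, x') = (1 + x^T x')^d, the inner product in the polynomial
    feature space of degree d, cf. eqs. (9) and (15)."""
    return (1.0 + np.dot(x, xp)) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian (RBF) kernel, another common choice of kernel function;
    it corresponds to an infinite-dimensional feature space."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x  = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
print(polynomial_kernel(x, xp, d=3), gaussian_kernel(x, xp, sigma=2.0))
```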

1.3 Input-output function in terms of the kernel function


Since everything has to be expressed in terms of inner products in feature space in order to take
advantage of the kernel trick, we have to rewrite (1). To do so we define w as a linear combination
of some expanded input vectors x_j with j = 1, ..., J. We then get

    w := Σ_j α_j φ(x_j)                                                                          (16)

    g(x) = w^T φ(x)                            (by 1)                                            (17)
         = Σ_j α_j φ(x_j)^T φ(x)               (by 16)                                           (18)
         = Σ_j α_j k(x_j, x) .                 (by 15)                                           (19)

The number J of input vectors xj used to define w is usually taken to be much smaller than the
dimensionality K of the feature space, because otherwise the computational complexity of computing
g(x) with the kernel trick (19) would be comparable to that of computing it directly (1) and there would be
no advantage in using the kernel trick. This, however, defines a subspace of the feature space within which
the optimization of g(x) takes place and one is back again in not such a high-dimensional space. Thus, the
kernel trick allows you to work in a potentially very high-dimensional space, but usually you
have to choose a moderate-dimensional subspace by selecting the basis for the weight vector
before actually starting to optimize it. Selecting this basis can be an art in itself.
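A minimal sketch (not from the original notes) of evaluating the kernel-expanded input-output function (19): given chosen basis vectors x_j, coefficients α_j (here arbitrary example values; normally they are obtained by some learning algorithm), and a kernel function, g(x) is computed without ever constructing φ explicitly.

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    """k(x, x') = (1 + x^T x')^d, cf. eq. (9)."""
    return (1.0 + np.dot(x, xp)) ** d

def g(x, basis_vectors, alphas, kernel):
    """Evaluate g(x) = sum_j alpha_j k(x_j, x), eq. (19)."""
    return sum(a * kernel(xj, x) for a, xj in zip(alphas, basis_vectors))

# Hypothetical example with J = 3 basis vectors and coefficients.
basis_vectors = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.5, -1.0, 0.25]

print(g(np.array([0.3, -0.7]), basis_vectors, alphas, polynomial_kernel))
```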

[Figure 1 (three panels over the x_1-x_2 input plane and the scalar s = x_j^T x) not reproduced here.]
Figure 1: Visualization of the kernel function (1 + x_j^T x)² (8) for (19). (a) The value of the inner product
between x and a specific x_j plotted with (green) contour lines. (b) The transformation (1 + s)² used to
turn the inner product of the original input vectors into the inner product between the nonlinearly expanded
vectors in feature space. (c) The value of the kernel function for vectors x and x_j plotted with (red) contour
lines. The kernel function takes the form of a parabola-shaped ditch.
CC BY-SA 4.0

1.4 Selecting the basis for the weight vector


There are three possibilities to choose the basis for the weight vector; a small code sketch of these options follows the list.

• One simply takes all available data vectors x. This is too expensive if there is a lot of data available.

• One selects a subset of the data vectors x with a good heuristic that predicts their usefulness.

• One has a priori ideas about what good basis vectors might be, regardless of the data vectors.
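A minimal sketch of these three options (not from the original notes; the random subsample and the regular grid are merely illustrative stand-ins for whatever selection rule one actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))        # hypothetical set of input vectors

# Option 1: take all available data vectors (expensive for large data sets).
basis_all = data

# Option 2: a subset chosen by some heuristic; here a plain random
# subsample stands in for a more informed usefulness criterion.
basis_subset = data[rng.choice(len(data), size=50, replace=False)]

# Option 3: a priori chosen basis vectors, independent of the data;
# here a regular grid over the input region of interest.
g1, g2 = np.meshgrid(np.linspace(-2, 2, 7), np.linspace(-2, 2, 7))
basis_grid = np.column_stack([g1.ravel(), g2.ravel()])

print(len(basis_all), len(basis_subset), len(basis_grid))
```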

1.5 Mercer's theorem


It is interesting to note that since everything is done with the kernel function, it is not even
necessary to know the feature space and the inner product within it. It is only required that the
kernel function is such that a corresponding feature space and inner product exist. There are criteria based
on Mercer's theorem that guarantee this existence, see (Müller et al., 2001). For instance, it is obvious
that the kernel function must obey

    k(x, x′) = k(x′, x)                                                                          (20)

    k(x, x) ≥ 0   ∀ x                                                                            (21)

otherwise it cannot induce a scalar product in feature space. A condition the kernel function does not need
to fulfill is k(x, x) = 0 ⇔ x = 0, since the zero vector in the original space does not need to be mapped
onto the zero vector in feature space.
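As a practical illustration (not part of the original notes): a necessary condition for k to induce an inner product in some feature space is that the Gram matrix K_ij = k(x_i, x_j) of any finite set of inputs is symmetric and positive semi-definite. The sketch below checks this numerically (up to numerical tolerance) for the polynomial kernel (9) on random data.

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    """k(x, x') = (1 + x^T x')^d, cf. eq. (9)."""
    return (1.0 + np.dot(x, xp)) ** d

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))             # 20 random 3-dimensional input vectors

# Gram matrix K_ij = k(x_i, x_j).
K = np.array([[polynomial_kernel(xi, xj, d=3) for xj in X] for xi in X])

# Necessary conditions for k to correspond to an inner product in some
# feature space: the Gram matrix is symmetric with no negative eigenvalues.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-9
print("Gram matrix is symmetric and positive semi-definite")
```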

References
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based
learning algorithms. IEEE Trans. on Neural Networks, 12(2):181–201.
Wiskott, L. (2007). Lecture notes on nonlinear expansion. Available at
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/index.html.
