
Support Vector Machines and Kernel Methods

Machine Learning, March 25, 2010

Last Time
Basics of support vector machines

Review: Max Margin


Are these really equally valid? How can we pick which is best? Maximize the size of the margin.

[Figure: two separating hyperplanes, one with a small margin and one with a large margin]

Review: Max Margin Optimization

The margin is the projection of $x_1 - x_2$ onto $w$, the normal of the hyperplane.

Hyperplane: $w^T x + b = 0$

$w^T (x_1 - x_2) = 2$

Projection: $\frac{w^T (x_1 - x_2)}{\|w\|}$

Size of the margin: $\frac{2}{\|w\|}$

Review: Maximizing the margin


Goal: maximize the margin

$\max_w \frac{2}{\|w\|} \iff \min_w \|w\|$

Linear separability of the data by the decision boundary:

$t_i (w^T x_i + b) \ge 1$, where
$w^T x_i + b \ge 1$ if $t_i = 1$
$w^T x_i + b \le -1$ if $t_i = -1$
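As a minimal numpy sketch of these quantities, here is a check of the functional margins and the margin width $2/\|w\|$; the weight vector, bias, and toy points are invented for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical separating hyperplane and toy points, for illustration only.
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[2.0, 2.0], [2.5, 1.5],   # t = +1
              [0.5, 0.5], [1.0, 0.0]])  # t = -1
t = np.array([1, 1, -1, -1])

# Functional margins t_i (w.x_i + b); all >= 1 means the constraints hold.
functional_margins = t * (X @ w + b)
print(functional_margins)                # [1. 1. 2. 2.]
print(np.all(functional_margins >= 1))   # True

# Geometric margin width 2 / ||w||.
print(2.0 / np.linalg.norm(w))
```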

Review: Max Margin Loss Function

Primal:

$L(w, b) = \frac{1}{2} w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) - 1 \right]$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Dual:

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$
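The dual is a quadratic program in $\alpha$. Below is a minimal sketch of solving it with the cvxopt QP solver on the toy points from the earlier sketch; the data, the small ridge added for numerical stability, and the support-vector tolerance are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

# Q_ij = t_i t_j (x_i . x_j); the dual maximizes sum(alpha) - 1/2 alpha^T Q alpha.
Q = np.outer(t, t) * (X @ X.T)
P = matrix(Q + 1e-8 * np.eye(N))     # tiny ridge keeps the QP numerically PSD
q = matrix(-np.ones(N))              # minimize -sum(alpha) + 1/2 alpha^T Q alpha
G = matrix(-np.eye(N))               # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(t.reshape(1, -1))         # equality constraint: sum_i alpha_i t_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

# Recover w and b from the support vector expansion.
w = (alpha * t) @ X
sv = alpha > 1e-6
b = np.mean(t[sv] - X[sv] @ w)
print(alpha.round(3), w.round(3), round(b, 3))
```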

Review: Support Vector Expansion


Independent of the dimension of $x$!

New decision function:

$D(x) = \mathrm{sign}(w^T x + b) = \mathrm{sign}\left( \left( \sum_{i=0}^{N-1} \alpha_i t_i x_i \right)^T x + b \right) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \right)$

When $\alpha_i$ is non-zero, $x_i$ is a support vector. When $\alpha_i$ is zero, $x_i$ is not a support vector.
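A direct transcription of this expansion into numpy; the arrays below are placeholders standing in for whatever the training procedure produced, chosen only so the example is self-contained.

```python
import numpy as np

def svm_decision(x, alpha, t, X, b):
    """sign( sum_i alpha_i t_i (x_i . x) + b ); only support vectors
    (alpha_i > 0) actually contribute to the sum."""
    return np.sign(np.sum(alpha * t * (X @ x)) + b)

# Placeholder values standing in for a trained model on two support vectors.
X = np.array([[2.0, 2.0], [0.5, 0.5]])
t = np.array([1.0, -1.0])
alpha = np.array([0.444, 0.444])
b = -1.667
print(svm_decision(np.array([3.0, 3.0]), alpha, t, X, b))   # +1 side
print(svm_decision(np.array([0.0, 0.0]), alpha, t, X, b))   # -1 side
```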

Review: Visualization of Support Vectors

[Figure: data points with $\alpha = 0$ (non-support vectors) and $\alpha > 0$ (support vectors lying on the margin)]

Today
How support vector machines deal with data that are not linearly separable
Soft margin
Kernels!

Why we like SVMs


They work
Good generalization

Easily interpreted
The decision boundary is based on the data, in the form of the support vectors
(Not so in multilayer perceptron networks)

Principled bounds on testing error from learning theory (VC dimension)

SVM vs. MLP


SVMs have many fewer parameters
SVM: maybe just a kernel parameter
MLP: the number and arrangement of nodes, and the learning rate $\eta$

SVM: convex optimization task
MLP: the likelihood is non-convex, so training can get stuck in local minima

$R(\theta) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{2} \left( y_n - g\left( \sum_k w_{kl}\, g\left( \sum_j w_{jk}\, g\left( \sum_i w_{ij}\, x_{n,i} \right) \right) \right) \right)^2$

Linear Separability
So far, support vector machines can only handle linearly separable data.

But most data isn't.

Soft margin example

Points are allowed within the margin, but a cost is introduced.

[Figure: points $x_1$ and $x_2$ inside the margin, with slack $\xi_i$ measuring the violation]

Hinge loss
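Hinge loss penalizes points that land inside the margin or on the wrong side of the boundary. A minimal numpy sketch (the scores and labels are invented for illustration):

```python
import numpy as np

def hinge_loss(t, score):
    """max(0, 1 - t * score): zero once a point clears the margin,
    growing linearly as it slips inside the margin or past the boundary."""
    return np.maximum(0.0, 1.0 - t * score)

t = np.array([1, 1, -1, -1])
score = np.array([2.3, 0.4, -1.5, 0.2])   # w.x + b for four hypothetical points
print(hinge_loss(t, score))               # [0.  0.6 0.  1.2]
```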

Soft margin classification

There can be outliers on the other side of the decision boundary, or points leading to a small margin.

Solution: introduce a penalty term into the constraint function.

$\min_w \|w\| + C \sum_{i=0}^{N-1} \xi_i$

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) + \xi_i - 1 \right]$

where $t_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

Soft Max Dual

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) + \xi_i - 1 \right]$

where $t_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

Still quadratic programming!

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $0 \le \alpha_i \le C$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$
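In practice the soft-margin QP is rarely solved by hand; a library call suffices. A minimal sketch using scikit-learn's SVC with a linear kernel; the toy data (with one point deliberately on the wrong side) and the choice C = 1.0 are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.4, 0.6],
              [0.5, 0.5], [1.0, 0.0], [2.2, 1.8]])
t = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # C trades margin width against slack
clf.fit(X, t)

print(clf.support_)                 # indices of the support vectors
print(clf.dual_coef_)               # alpha_i * t_i for the support vectors
print(clf.predict([[1.5, 1.5]]))
```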

Probabilities from SVMs

Support vector machines are discriminant functions.
Discriminant functions: $f(x) = c$
Discriminative models: $f(x) = \arg\max_c p(c|x)$
Generative models: $f(x) = \arg\max_c p(x|c)p(c)/p(x)$

No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

Not especially fast.

Training: $O(n^3)$
Quadratic programming efficiency

Evaluation: $O(n)$
Need to evaluate against each support vector (potentially n of them)

Kernel Methods
Points that are not linearly separable in 2 dimensions might be linearly separable in 3.

Kernel Methods
We will look at a way to add dimensionality to the data in order to make it linearly separable.
In the extreme, we can construct a dimension for each data point.
This may lead to overfitting.
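A small numpy illustration of the idea (the ring-shaped dataset and the quadratic feature map are my own example, not from the slides): two classes that no line can separate in 2D become separable by a plane once a third coordinate is added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner disk (class -1) surrounded by a ring (class +1): not linearly separable in 2D.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(1.5, 2.5, 100)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
t = np.concatenate([-np.ones(100), np.ones(100)])

# Map to 3D: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). The squared radius
# x1^2 + x2^2 is now a linear function of the new coordinates.
Phi = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])

# A plane in the mapped space, z1 + z3 = 1.25^2, separates the classes.
pred = np.sign(Phi[:, 0] + Phi[:, 2] - 1.25 ** 2)
print(np.mean(pred == t))   # 1.0 on this toy data
```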

Remember the Dual?


Primal:

$L(w, b) = \frac{1}{2} w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) - 1 \right]$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Dual:

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$

Basis of Kernel Methods

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

The decision process doesn't depend on the dimensionality of the data, so we can map the data to a higher-dimensional space.
Note: data points only appear within a dot product.
The objective function is based on the dot product of data points, not the data points themselves.

Basis of Kernel Methods

Since data points only appear within a dot product, we can map to another space through a replacement:

$x_i \cdot x_j \rightarrow \phi(x_i) \cdot \phi(x_j)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

The objective function is based on the dot product of data points, not the data points themselves.

Kernels

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$

The objective function is based on a dot product of data points, rather than the data points themselves. We can represent this dot product as a kernel (kernel function, kernel matrix).

The finite (if large) dimensionality of the kernel matrix $K(x_i, x_j)$ is unrelated to the dimensionality of $x$.
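To make the substitution concrete, here is a small numpy sketch of evaluating $W(\alpha)$: the only difference between the linear and kernelized versions is whether the Gram matrix comes from plain dot products or from a kernel function. The RBF choice, the random data, and the random $\alpha$ are illustrative, not from the slides.

```python
import numpy as np

def dual_objective(alpha, t, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j t_i t_j K_ij."""
    return alpha.sum() - 0.5 * (alpha * t) @ K @ (alpha * t)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
t = np.array([1.0, 1.0, -1.0, 1.0, -1.0])
alpha = rng.uniform(0, 1, size=5)

K_linear = X @ X.T                                         # plain dot products
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq_dists / 2.0)                            # kernelized version

print(dual_objective(alpha, t, K_linear))
print(dual_objective(alpha, t, K_rbf))
```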

Kernels

Kernels are a mapping:

$x_i^T x_j \rightarrow \phi(x_i)^T \phi(x_j)$

$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

[Figure: $x_i$ and $x_j$ in the input space mapped to $\phi(x_i)$ and $\phi(x_j)$ in the feature space]

Kernels

Gram matrix: $K_{ij} = \phi(x_i)^T \phi(x_j)$

$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

Consider the following kernel:

$K(x_i, x_j) = (x_i^T x_j)^2$

Kernels

Gram matrix: $K_{ij} = \phi(x_i)^T \phi(x_j)$, with $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

Consider the following kernel:

$K(x, z) = (x^T z)^2 = (x_0 z_0 + x_1 z_1)^2$
$\qquad = x_0^2 z_0^2 + 2 x_0 z_0 x_1 z_1 + x_1^2 z_1^2$
$\qquad = (x_0^2, \sqrt{2}\, x_0 x_1, x_1^2)^T (z_0^2, \sqrt{2}\, z_0 z_1, z_1^2)$
$\qquad = \phi(x)^T \phi(z)$

where $\phi(x) = (x_0^2, \sqrt{2}\, x_0 x_1, x_1^2)$
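A quick numerical check of this identity (the two test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the quadratic kernel in 2D.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2          # kernel computed in the 2D input space
rhs = phi(x) @ phi(z)       # same value from the explicit 3D feature map
print(lhs, rhs)             # both equal 1 (up to rounding): x.z = 3 - 2 = 1
```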

Kernels

$K_{ij} = \phi(x_i)^T \phi(x_j)$, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

In general we don't need to know the form of $\phi$. Just specifying the kernel function is sufficient.

A good kernel: computing $K(x_i, x_j)$ is cheaper than computing $\phi(x_i)$.

Kernels

Valid kernels:
Symmetric: $K(x, z) = K(z, x)$
Must be decomposable into functions: $K(x, z) = \phi(x)^T \phi(z)$
Harder to show: the Gram matrix is positive semi-definite (psd), i.e. $x^T K x \ge 0$ for all $x$.
Entry signs alone do not settle this; a Gram matrix with negative entries may still be psd.
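A practical way to sanity-check a candidate kernel on a finite sample is to build the Gram matrix and inspect symmetry and its eigenvalue spectrum. A minimal sketch; the RBF kernel, the random sample, and the "bad" similarity are just for demonstration.

```python
import numpy as np

def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def looks_valid(K, tol=1e-10):
    symmetric = np.allclose(K, K.T)
    # PSD check on the sample: all eigenvalues >= 0 (allow tiny numerical negatives).
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
X = np.random.default_rng(0).normal(size=(6, 3))
print(looks_valid(gram(rbf, X)))        # True for this sample

# A symmetric "similarity" that is NOT a valid kernel fails the psd test,
# e.g. negative squared distance (zero diagonal, negative off-diagonal entries):
bad = lambda a, b: -np.sum((a - b) ** 2)
print(looks_valid(gram(bad, X)))        # False for distinct points
```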

Kernels

Given valid kernels $K_1(x, z)$ and $K_2(x, z)$, more kernels can be made from them:
$c\,K_1(x, z)$
$K_1(x, z) + K_2(x, z)$
$K_1(x, z)\,K_2(x, z)$
$\exp(K_1(x, z))$
and more
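For instance, combining a linear and an RBF kernel by these closure rules yields another valid kernel. A short sketch (the kernel choices and test vectors are illustrative):

```python
import numpy as np

linear = lambda a, b: a @ b
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

# Closure rules: positive scalings, sums, and products of valid kernels stay valid.
combined = lambda a, b: 2.0 * linear(a, b) + rbf(a, b) * linear(a, b)

x = np.array([1.0, 0.5])
z = np.array([0.2, -1.0])
print(combined(x, z))
```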

Incorporating Kernels in SVMs

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$

Optimize the $\alpha_i$'s and bias with respect to the kernel.

Decision function:

$D(x) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \right)$

$D(x) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i K(x_i, x) + b \right)$
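One way to use an arbitrary kernel in practice is scikit-learn's precomputed-kernel mode: the optimizer is handed the Gram matrix directly, and prediction is handed the kernel values between test and training points. A minimal sketch; the quadratic kernel and toy data are my own choices, not from the lecture.

```python
import numpy as np
from sklearn.svm import SVC

def quad_kernel(A, B):
    # K(x, z) = (x^T z)^2, evaluated for all pairs of rows of A and B.
    return (A @ B.T) ** 2

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, 1.0], [-2.0, 2.0]])
t_train = np.array([1, 1, -1, -1])
X_test = np.array([[1.5, 1.2], [-1.5, 1.2]])

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(quad_kernel(X_train, X_train), t_train)    # Gram matrix of the training data
print(clf.predict(quad_kernel(X_test, X_train)))   # kernel values vs. training points
```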

Some popular kernels


Polynomial kernels
Radial basis functions
String kernels
Graph kernels

Polynomial Kernels

$K(x, z) = (x^T z + c)^d$, where $c \ge 0$

The dot product is related to a polynomial power of the original dot product.
If $c$ is large, the focus is on the linear terms; if $c$ is small, the focus is on the higher-order terms.
Very fast to calculate.
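A one-line implementation, with the d = 2 expansion spelled out in a comment; the vectors and parameter values are arbitrary.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    # For d = 2: (x.z + c)^2 = (x.z)^2 + 2c (x.z) + c^2,
    # so a larger c puts relatively more weight on the linear (and constant) terms.
    return (np.dot(x, z) + c) ** d

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z, c=0.1), poly_kernel(x, z, c=10.0))
```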

Radial Basis Functions

$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$

The inner product of two points is related to the distance in space between the two points.
This places a "bump" on each point.
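A direct implementation; $\sigma$ (the bump width) and the test points are arbitrary.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)): nearby points give values near 1,
    # distant points give values near 0.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.1, 0.1])))   # close to 1
print(rbf_kernel(x, np.array([3.0, 3.0])))   # close to 0
```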

String kernels

Not a Gaussian, but still a legitimate kernel:
K(s, s') = difference in length
K(s, s') = count of different letters
K(s, s') = minimum edit distance

Kernels allow for infinite-dimensional inputs.

The kernel is a FUNCTION defined over the input space. We don't need to specify the input space exactly.

We don't need to manually encode the input.
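A sketch of the three string similarities listed above; the edit-distance routine is the standard dynamic-programming recurrence. Whether these particular similarities yield psd Gram matrices in general is a separate question the slides do not address.

```python
def length_diff(s1, s2):
    return abs(len(s1) - len(s2))

def letter_diff(s1, s2):
    # Positions (over the shorter length) where the letters differ,
    # plus the leftover tail of the longer string.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches + abs(len(s1) - len(s2))

def edit_distance(s1, s2):
    # Classic Levenshtein DP: d[i][j] = distance between s1[:i] and s2[:j].
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(s1)][len(s2)]

print(length_diff("kernel", "kernels"),
      letter_diff("cat", "car"),
      edit_distance("kitten", "sitting"))   # 1 1 3
```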

Graph Kernels

Define the kernel function based on graph properties.
These properties must be computable in polynomial time:
Walks of length < k
Paths
Spanning trees
Cycles

Kernels allow us to incorporate knowledge about the input without direct feature extraction.
Just similarity in some space.
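As one concrete instance of the "walks of length < k" idea: the number of length-$\ell$ walks between nodes $i$ and $j$ is $(A^\ell)_{ij}$ for adjacency matrix $A$, so counting walks per length gives an explicit feature vector per graph, and a dot product of those vectors is a simple graph kernel. This construction is my own minimal example, not the specific kernel from the lecture.

```python
import numpy as np

def walk_count_features(A, k):
    # Feature l = total number of walks of length l in the graph, for l = 1 .. k-1;
    # (A^l)_{ij} counts walks of length l from node i to node j.
    feats, power = [], np.eye(len(A))
    for _ in range(1, k):
        power = power @ A
        feats.append(power.sum())
    return np.array(feats)

def walk_kernel(A1, A2, k=4):
    # Dot product of explicit feature vectors, hence a valid (psd) kernel by construction.
    return walk_count_features(A1, k) @ walk_count_features(A2, k)

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(walk_kernel(triangle, path3))
```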

Where else can we apply Kernels?

Anywhere that the dot product of x is used in an optimization.

Perceptron:

$D(x) = \mathrm{sign}(w^T x + b)$, with $w = \sum_j \alpha_j t_j x_j$

$D(x) = \mathrm{sign}\left( \sum_j \alpha_j t_j\, x_j^T x + b \right) = \mathrm{sign}\left( \sum_j \alpha_j t_j\, K(x_j, x) + b \right)$
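A minimal kernel perceptron sketch following that substitution: on each mistake the corresponding $\alpha_j$ is incremented, and predictions only ever touch the data through $K(x_j, x)$. The RBF kernel, the tiny XOR-style dataset, and the epoch count are illustrative choices.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def train_kernel_perceptron(X, t, kernel, epochs=20):
    n = len(t)
    alpha, bias = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            # Mistake-driven update: bump alpha_i (and the bias) when x_i is misclassified.
            if t[i] * (np.sum(alpha * t * K[:, i]) + bias) <= 0:
                alpha[i] += 1.0
                bias += t[i]
    return alpha, bias

def predict(x, X, t, alpha, bias, kernel):
    k_vec = np.array([kernel(xj, x) for xj in X])
    return np.sign(np.sum(alpha * t * k_vec) + bias)

# Tiny XOR-like problem that a plain linear perceptron cannot solve.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
alpha, bias = train_kernel_perceptron(X, t, rbf)
print([predict(x, X, t, alpha, bias, rbf) for x in X])   # should match t
```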

Kernels in Clustering
In clustering, it's very common to define cluster similarity by the distance between points, for example in k-nn or k-means.

This distance can be replaced by a kernel.

We'll return to this in the section on unsupervised techniques.
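The substitution works because squared distances can be written entirely in terms of kernel evaluations: $\|\phi(x) - \phi(z)\|^2 = K(x, x) - 2K(x, z) + K(z, z)$. A minimal sketch (the RBF kernel and test points are arbitrary):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_sq_distance(x, z, kernel):
    # ||phi(x) - phi(z)||^2 expanded using only kernel evaluations.
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(kernel_sq_distance(x, z, rbf))   # squared distance in the implicit feature space
```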

Bye
Next time
Supervised learning review
Clustering
