
Support Vector Machines and Kernel Methods

Machine Learning, March 25, 2010

Last Time
Basics of support vector machines

Review: Max Margin


Are these really equally valid? How can we pick which is best? Maximize the size of the margin.

[Figure: two separating hyperplanes, one with a small margin and one with a large margin]

Review: Max Margin Optimization

The margin is the projection of $x_1 - x_2$ onto $w$, the normal of the hyperplane.

Hyperplane: $w^T x + b = 0$

$w^T (x_1 - x_2) = 2$

Projection: $\frac{w^T (x_1 - x_2)}{\|w\|}$

Size of the margin: $\frac{2}{\|w\|}$

Review: Maximizing the margin


Goal: maximize the margin

$\max_w \frac{2}{\|w\|} \iff \min_w \|w\|$

Linear separability of the data by the decision boundary:

$t_i (w^T x_i + b) \ge 1$, where
$w^T x_i + b \ge 1$ if $t_i = 1$
$w^T x_i + b \le -1$ if $t_i = -1$
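As a minimal numpy sketch of these quantities, here is a check of the functional margins and the margin width $2/\|w\|$; the weight vector, bias, and toy points are invented for illustration, not taken from the slides.

```python
import numpy as np

# Hypothetical separating hyperplane and toy points, for illustration only.
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[2.0, 2.0], [2.5, 1.5],   # t = +1
              [0.5, 0.5], [1.0, 0.0]])  # t = -1
t = np.array([1, 1, -1, -1])

# Functional margins t_i (w.x_i + b); all >= 1 means the constraints hold.
functional_margins = t * (X @ w + b)
print(functional_margins)                # [1. 1. 2. 2.]
print(np.all(functional_margins >= 1))   # True

# Geometric margin width 2 / ||w||.
print(2.0 / np.linalg.norm(w))
```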

Review: Max Margin Loss Function

Primal:

$L(w, b) = \frac{1}{2} w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) - 1 \right]$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Dual:

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$
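The dual is a quadratic program in $\alpha$. Below is a minimal sketch of solving it with the cvxopt QP solver on the toy points from the earlier sketch; the data, the small ridge added for numerical stability, and the support-vector tolerance are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(t)

# Q_ij = t_i t_j (x_i . x_j); the dual maximizes sum(alpha) - 1/2 alpha^T Q alpha.
Q = np.outer(t, t) * (X @ X.T)
P = matrix(Q + 1e-8 * np.eye(N))     # tiny ridge keeps the QP numerically PSD
q = matrix(-np.ones(N))              # minimize -sum(alpha) + 1/2 alpha^T Q alpha
G = matrix(-np.eye(N))               # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(t.reshape(1, -1))         # equality constraint: sum_i alpha_i t_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

# Recover w and b from the support vector expansion.
w = (alpha * t) @ X
sv = alpha > 1e-6
b = np.mean(t[sv] - X[sv] @ w)
print(alpha.round(3), w.round(3), round(b, 3))
```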

Review: Support Vector Expansion


Independent of the dimension of $x$!

New decision function:

$D(x) = \mathrm{sign}(w^T x + b) = \mathrm{sign}\left( \left( \sum_{i=0}^{N-1} \alpha_i t_i x_i \right)^T x + b \right) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \right)$

When $\alpha_i$ is non-zero, $x_i$ is a support vector. When $\alpha_i$ is zero, $x_i$ is not a support vector.
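A direct transcription of this expansion into numpy; the arrays below are placeholders standing in for whatever the training procedure produced, chosen only so the example is self-contained.

```python
import numpy as np

def svm_decision(x, alpha, t, X, b):
    """sign( sum_i alpha_i t_i (x_i . x) + b ); only support vectors
    (alpha_i > 0) actually contribute to the sum."""
    return np.sign(np.sum(alpha * t * (X @ x)) + b)

# Placeholder values standing in for a trained model on two support vectors.
X = np.array([[2.0, 2.0], [0.5, 0.5]])
t = np.array([1.0, -1.0])
alpha = np.array([0.444, 0.444])
b = -1.667
print(svm_decision(np.array([3.0, 3.0]), alpha, t, X, b))   # +1 side
print(svm_decision(np.array([0.0, 0.0]), alpha, t, X, b))   # -1 side
```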

Review: Visualization of Support Vectors

[Figure: data points with $\alpha = 0$ (non-support vectors) and $\alpha > 0$ (support vectors lying on the margin)]

Today
How support vector machines deal with data that are not linearly separable
Soft margin
Kernels!

Why we like SVMs


They work
Good generalization

Easily interpreted
The decision boundary is based on the data, in the form of the support vectors
(Not so in multilayer perceptron networks)

Principled bounds on testing error from learning theory (VC dimension)

SVM vs. MLP


SVMs have many fewer parameters
SVM: maybe just a kernel parameter
MLP: the number and arrangement of nodes, and the learning rate $\eta$

SVM: convex optimization task
MLP: the likelihood is non-convex, so training can get stuck in local minima

$R(\theta) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{2} \left( y_n - g\left( \sum_k w_{kl}\, g\left( \sum_j w_{jk}\, g\left( \sum_i w_{ij}\, x_{n,i} \right) \right) \right) \right)^2$

Linear Separability
So far, support vector machines can only handle linearly separable data.

But most data isn't.

Soft margin example

Points are allowed within the margin, but a cost is introduced.

[Figure: points $x_1$ and $x_2$ inside the margin, with slack $\xi_i$ measuring the violation]

Hinge loss
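Hinge loss penalizes points that land inside the margin or on the wrong side of the boundary. A minimal numpy sketch (the scores and labels are invented for illustration):

```python
import numpy as np

def hinge_loss(t, score):
    """max(0, 1 - t * score): zero once a point clears the margin,
    growing linearly as it slips inside the margin or past the boundary."""
    return np.maximum(0.0, 1.0 - t * score)

t = np.array([1, 1, -1, -1])
score = np.array([2.3, 0.4, -1.5, 0.2])   # w.x + b for four hypothetical points
print(hinge_loss(t, score))               # [0.  0.6 0.  1.2]
```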

Soft margin classification

There can be outliers on the other side of the decision boundary, or points leading to a small margin.

Solution: introduce a penalty term into the constraint function.

$\min_w \|w\| + C \sum_{i=0}^{N-1} \xi_i$

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) + \xi_i - 1 \right]$

where $t_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

Soft Max Dual

$L(w, b) = \frac{1}{2} w \cdot w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) + \xi_i - 1 \right]$

where $t_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$

Still quadratic programming!

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $0 \le \alpha_i \le C$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$
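In practice the soft-margin QP is rarely solved by hand; a library call suffices. A minimal sketch using scikit-learn's SVC with a linear kernel; the toy data (with one point deliberately on the wrong side) and the choice C = 1.0 are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.4, 0.6],
              [0.5, 0.5], [1.0, 0.0], [2.2, 1.8]])
t = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # C trades margin width against slack
clf.fit(X, t)

print(clf.support_)                 # indices of the support vectors
print(clf.dual_coef_)               # alpha_i * t_i for the support vectors
print(clf.predict([[1.5, 1.5]]))
```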

Probabilities from SVMs

Support vector machines are discriminant functions.
Discriminant functions: $f(x) = c$
Discriminative models: $f(x) = \arg\max_c p(c|x)$
Generative models: $f(x) = \arg\max_c p(x|c)p(c)/p(x)$

No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs

Not especially fast.

Training: $O(n^3)$
Quadratic programming efficiency

Evaluation: $O(n)$
Need to evaluate against each support vector (potentially n of them)

Kernel Methods
Points that are not linearly separable in 2 dimensions might be linearly separable in 3.

Kernel Methods
We will look at a way to add dimensionality to the data in order to make it linearly separable.
In the extreme, we can construct a dimension for each data point.
This may lead to overfitting.
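A small numpy illustration of the idea (the ring-shaped dataset and the quadratic feature map are my own example, not from the slides): two classes that no line can separate in 2D become separable by a plane once a third coordinate is added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner disk (class -1) surrounded by a ring (class +1): not linearly separable in 2D.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(1.5, 2.5, 100)])
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
t = np.concatenate([-np.ones(100), np.ones(100)])

# Map to 3D: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). The squared radius
# x1^2 + x2^2 is now a linear function of the new coordinates.
Phi = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])

# A plane in the mapped space, z1 + z3 = 1.25^2, separates the classes.
pred = np.sign(Phi[:, 0] + Phi[:, 2] - 1.25 ** 2)
print(np.mean(pred == t))   # 1.0 on this toy data
```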

Remember the Dual?


Primal:

$L(w, b) = \frac{1}{2} w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i \left( (w \cdot x_i) + b \right) - 1 \right]$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

Dual:

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$

Basis of Kernel Methods

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$

The decision process doesn't depend on the dimensionality of the data, so we can map the data to a higher-dimensional space.
Note: data points only appear within a dot product.
The objective function is based on the dot product of data points, not the data points themselves.

Basis of Kernel Methods

Since data points only appear within a dot product, we can map to another space through a replacement:

$x_i \cdot x_j \rightarrow \phi(x_i) \cdot \phi(x_j)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

The objective function is based on the dot product of data points, not the data points themselves.

Kernels

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$

The objective function is based on a dot product of data points, rather than the data points themselves. We can represent this dot product as a kernel (kernel function, kernel matrix).

The finite (if large) dimensionality of the kernel matrix $K(x_i, x_j)$ is unrelated to the dimensionality of $x$.
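To make the substitution concrete, here is a small numpy sketch of evaluating $W(\alpha)$: the only difference between the linear and kernelized versions is whether the Gram matrix comes from plain dot products or from a kernel function. The RBF choice, the random data, and the random $\alpha$ are illustrative, not from the slides.

```python
import numpy as np

def dual_objective(alpha, t, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j t_i t_j K_ij."""
    return alpha.sum() - 0.5 * (alpha * t) @ K @ (alpha * t)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
t = np.array([1.0, 1.0, -1.0, 1.0, -1.0])
alpha = rng.uniform(0, 1, size=5)

K_linear = X @ X.T                                         # plain dot products
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq_dists / 2.0)                            # kernelized version

print(dual_objective(alpha, t, K_linear))
print(dual_objective(alpha, t, K_rbf))
```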

Kernels

Kernels are a mapping:

$x_i^T x_j \rightarrow \phi(x_i)^T \phi(x_j)$

$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

[Figure: $x_i$ and $x_j$ in the input space mapped to $\phi(x_i)$ and $\phi(x_j)$ in the feature space]

Kernels

Gram matrix: $K_{ij} = \phi(x_i)^T \phi(x_j)$

$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

Consider the following kernel:

$K(x_i, x_j) = (x_i^T x_j)^2$

Kernels

Gram matrix: $K_{ij} = \phi(x_i)^T \phi(x_j)$, with $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

Consider the following kernel:

$K(x, z) = (x^T z)^2 = (x_0 z_0 + x_1 z_1)^2$
$\qquad = x_0^2 z_0^2 + 2 x_0 z_0 x_1 z_1 + x_1^2 z_1^2$
$\qquad = (x_0^2, \sqrt{2}\, x_0 x_1, x_1^2)^T (z_0^2, \sqrt{2}\, z_0 z_1, z_1^2)$
$\qquad = \phi(x)^T \phi(z)$

where $\phi(x) = (x_0^2, \sqrt{2}\, x_0 x_1, x_1^2)$
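A quick numerical check of this identity (the two test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the quadratic kernel in 2D.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2          # kernel computed in the 2D input space
rhs = phi(x) @ phi(z)       # same value from the explicit 3D feature map
print(lhs, rhs)             # both equal 1 (up to rounding): x.z = 3 - 2 = 1
```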

Kernels

$K_{ij} = \phi(x_i)^T \phi(x_j)$, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

In general we don't need to know the form of $\phi$. Just specifying the kernel function is sufficient.

A good kernel: computing $K(x_i, x_j)$ is cheaper than computing $\phi(x_i)$.

Kernels

Valid kernels:
Symmetric: $K(x, z) = K(z, x)$
Must be decomposable into functions: $K(x, z) = \phi(x)^T \phi(z)$
Harder to show: the Gram matrix is positive semi-definite (psd), i.e. $x^T K x \ge 0$ for all $x$.
Entry signs alone do not settle this; a Gram matrix with negative entries may still be psd.
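A practical way to sanity-check a candidate kernel on a finite sample is to build the Gram matrix and inspect symmetry and its eigenvalue spectrum. A minimal sketch; the RBF kernel, the random sample, and the "bad" similarity are just for demonstration.

```python
import numpy as np

def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def looks_valid(K, tol=1e-10):
    symmetric = np.allclose(K, K.T)
    # PSD check on the sample: all eigenvalues >= 0 (allow tiny numerical negatives).
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
X = np.random.default_rng(0).normal(size=(6, 3))
print(looks_valid(gram(rbf, X)))        # True for this sample

# A symmetric "similarity" that is NOT a valid kernel fails the psd test,
# e.g. negative squared distance (zero diagonal, negative off-diagonal entries):
bad = lambda a, b: -np.sum((a - b) ** 2)
print(looks_valid(gram(bad, X)))        # False for distinct points
```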

Kernels

Given valid kernels $K_1(x, z)$ and $K_2(x, z)$, more kernels can be made from them:
$c\,K_1(x, z)$
$K_1(x, z) + K_2(x, z)$
$K_1(x, z)\,K_2(x, z)$
$\exp(K_1(x, z))$
and more
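For instance, combining a linear and an RBF kernel by these closure rules yields another valid kernel. A short sketch (the kernel choices and test vectors are illustrative):

```python
import numpy as np

linear = lambda a, b: a @ b
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

# Closure rules: positive scalings, sums, and products of valid kernels stay valid.
combined = lambda a, b: 2.0 * linear(a, b) + rbf(a, b) * linear(a, b)

x = np.array([1.0, 0.5])
z = np.array([0.2, -1.0])
print(combined(x, z))
```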

Incorporating Kernels in SVMs

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \left( \phi(x_i) \cdot \phi(x_j) \right)$

$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$

Optimize the $\alpha_i$'s and bias with respect to the kernel.

Decision function:

$D(x) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \right)$

$D(x) = \mathrm{sign}\left( \sum_{i=0}^{N-1} \alpha_i t_i K(x_i, x) + b \right)$
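One way to use an arbitrary kernel in practice is scikit-learn's precomputed-kernel mode: the optimizer is handed the Gram matrix directly, and prediction is handed the kernel values between test and training points. A minimal sketch; the quadratic kernel and toy data are my own choices, not from the lecture.

```python
import numpy as np
from sklearn.svm import SVC

def quad_kernel(A, B):
    # K(x, z) = (x^T z)^2, evaluated for all pairs of rows of A and B.
    return (A @ B.T) ** 2

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, 1.0], [-2.0, 2.0]])
t_train = np.array([1, 1, -1, -1])
X_test = np.array([[1.5, 1.2], [-1.5, 1.2]])

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(quad_kernel(X_train, X_train), t_train)    # Gram matrix of the training data
print(clf.predict(quad_kernel(X_test, X_train)))   # kernel values vs. training points
```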

Some popular kernels


Polynomial kernels
Radial basis functions
String kernels
Graph kernels

Polynomial Kernels

$K(x, z) = (x^T z + c)^d$, where $c \ge 0$

The dot product is related to a polynomial power of the original dot product.
If $c$ is large, the focus is on the linear terms; if $c$ is small, the focus is on the higher-order terms.
Very fast to calculate.
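A one-line implementation, with the d = 2 expansion spelled out in a comment; the vectors and parameter values are arbitrary.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    # For d = 2: (x.z + c)^2 = (x.z)^2 + 2c (x.z) + c^2,
    # so a larger c puts relatively more weight on the linear (and constant) terms.
    return (np.dot(x, z) + c) ** d

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z, c=0.1), poly_kernel(x, z, c=10.0))
```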

Radial Basis Functions

$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$

The inner product of two points is related to the distance in space between the two points.
This places a "bump" on each point.
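A direct implementation; $\sigma$ (the bump width) and the test points are arbitrary.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)): nearby points give values near 1,
    # distant points give values near 0.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.1, 0.1])))   # close to 1
print(rbf_kernel(x, np.array([3.0, 3.0])))   # close to 0
```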

String kernels

Not a Gaussian, but still a legitimate kernel:
K(s, s') = difference in length
K(s, s') = count of different letters
K(s, s') = minimum edit distance

Kernels allow for infinite-dimensional inputs.

The kernel is a FUNCTION defined over the input space. We don't need to specify the input space exactly.

We don't need to manually encode the input.
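A sketch of the three string similarities listed above; the edit-distance routine is the standard dynamic-programming recurrence. Whether these particular similarities yield psd Gram matrices in general is a separate question the slides do not address.

```python
def length_diff(s1, s2):
    return abs(len(s1) - len(s2))

def letter_diff(s1, s2):
    # Positions (over the shorter length) where the letters differ,
    # plus the leftover tail of the longer string.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches + abs(len(s1) - len(s2))

def edit_distance(s1, s2):
    # Classic Levenshtein DP: d[i][j] = distance between s1[:i] and s2[:j].
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(s1)][len(s2)]

print(length_diff("kernel", "kernels"),
      letter_diff("cat", "car"),
      edit_distance("kitten", "sitting"))   # 1 1 3
```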

Graph Kernels

Define the kernel function based on graph properties.
These properties must be computable in polynomial time:
Walks of length < k
Paths
Spanning trees
Cycles

Kernels allow us to incorporate knowledge about the input without direct feature extraction.
Just similarity in some space.
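As one concrete instance of the "walks of length < k" idea: the number of length-$\ell$ walks between nodes $i$ and $j$ is $(A^\ell)_{ij}$ for adjacency matrix $A$, so counting walks per length gives an explicit feature vector per graph, and a dot product of those vectors is a simple graph kernel. This construction is my own minimal example, not the specific kernel from the lecture.

```python
import numpy as np

def walk_count_features(A, k):
    # Feature l = total number of walks of length l in the graph, for l = 1 .. k-1;
    # (A^l)_{ij} counts walks of length l from node i to node j.
    feats, power = [], np.eye(len(A))
    for _ in range(1, k):
        power = power @ A
        feats.append(power.sum())
    return np.array(feats)

def walk_kernel(A1, A2, k=4):
    # Dot product of explicit feature vectors, hence a valid (psd) kernel by construction.
    return walk_count_features(A1, k) @ walk_count_features(A2, k)

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path3 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(walk_kernel(triangle, path3))
```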

Where else can we apply Kernels?

Anywhere that the dot product of x is used in an optimization.

Perceptron:

$D(x) = \mathrm{sign}(w^T x + b)$, with $w = \sum_j \alpha_j t_j x_j$

$D(x) = \mathrm{sign}\left( \sum_j \alpha_j t_j\, x_j^T x + b \right) = \mathrm{sign}\left( \sum_j \alpha_j t_j\, K(x_j, x) + b \right)$
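A minimal kernel perceptron sketch following that substitution: on each mistake the corresponding $\alpha_j$ is incremented, and predictions only ever touch the data through $K(x_j, x)$. The RBF kernel, the tiny XOR-style dataset, and the epoch count are illustrative choices.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def train_kernel_perceptron(X, t, kernel, epochs=20):
    n = len(t)
    alpha, bias = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            # Mistake-driven update: bump alpha_i (and the bias) when x_i is misclassified.
            if t[i] * (np.sum(alpha * t * K[:, i]) + bias) <= 0:
                alpha[i] += 1.0
                bias += t[i]
    return alpha, bias

def predict(x, X, t, alpha, bias, kernel):
    k_vec = np.array([kernel(xj, x) for xj in X])
    return np.sign(np.sum(alpha * t * k_vec) + bias)

# Tiny XOR-like problem that a plain linear perceptron cannot solve.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
alpha, bias = train_kernel_perceptron(X, t, rbf)
print([predict(x, X, t, alpha, bias, rbf) for x in X])   # should match t
```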

Kernels in Clustering
In clustering, it's very common to define cluster similarity by the distance between points, for example in k-nn or k-means.

This distance can be replaced by a kernel.

We'll return to this in the section on unsupervised techniques.
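The substitution works because squared distances can be written entirely in terms of kernel evaluations: $\|\phi(x) - \phi(z)\|^2 = K(x, x) - 2K(x, z) + K(z, z)$. A minimal sketch (the RBF kernel and test points are arbitrary):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_sq_distance(x, z, kernel):
    # ||phi(x) - phi(z)||^2 expanded using only kernel evaluations.
    return kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(kernel_sq_distance(x, z, rbf))   # squared distance in the implicit feature space
```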

Bye
Next time
Supervised learning review
Clustering
