Problem Definition
Consider a training set of $n$ i.i.d. samples

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

where $x_i$ is a vector of length $m$ and $y_i \in \{-1, +1\}$ is the class label for data point $x_i$.

Find a separating hyperplane

$w \cdot x + b = 0$

corresponding to the decision function

$f(x) = \operatorname{sign}(w \cdot x + b)$
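A minimal sketch of this decision function in NumPy (the weights `w` and offset `b` below are illustrative placeholders, not values from a trained model):

```python
import numpy as np

def f(x, w, b):
    """Decision function f(x) = sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Hypothetical hyperplane in m = 2 dimensions
w = np.array([1.0, -1.0])
b = 0.5
print(f(np.array([2.0, 0.0]), w, b))  # 1.0: positive side of the hyperplane
```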
Separating Hyperplanes
[Figure: two classes of training samples in the $(x^{(1)}, x^{(2)})$ plane, separated by a hyperplane]
Training data is only a subset of all possible data. Suppose the hyperplane passes close to a sample $x_i$: if a new sample arrives close to $x_i$, it is likely to fall on the wrong side of the hyperplane.

[Figure: a hyperplane passing close to sample $x_i$]
Therefore, place the hyperplane as far as possible from any sample.

[Figure: a hyperplane far from all samples]
SVM
Idea: maximize distance to the closest example
[Figure: two separating hyperplanes for the same data, one with a smaller distance and one with a larger distance to the closest sample $x_i$]
The discriminant function is

$g(x) = w^t x + b$

Samples on the margin boundaries satisfy $g(x) = \pm 1$, and the distance from a point $x$ to the hyperplane is $|g(x)| / \lVert w \rVert$, so the width of the margin is

$m = \dfrac{2}{\lVert w \rVert}$

Maximize the margin subject to the constraints

$w^t x_i + b \ge 1$ for $y_i = +1$

$w^t x_i + b \le -1$ for $y_i = -1$
Equivalently, minimize

$J(w) = \frac{1}{2} \lVert w \rVert^2$

subject to $y_i (w \cdot x_i + b) \ge 1, \ \forall i$
Introduce Lagrange multipliers $\alpha_i \ge 0$ associated with the constraints. The solution to the primal problem is equivalent to determining the saddle point of the function

$L_P = L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right]$
Solving Constrained QP
At the saddle point, $L_P$ has a minimum in $w$ and $b$, requiring

$\dfrac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$

$\dfrac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
Primal-Dual
Primal:

$L_P = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{n} \alpha_i$

Dual: substitute $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ to obtain

$L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$

Maximize $L_D$ with respect to $\alpha$, subject to $\alpha_i \ge 0 \ \forall i$ and $\sum_i \alpha_i y_i = 0$.
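This dual is a quadratic program in $\alpha$. As a sketch of solving it numerically, assuming the cvxopt QP solver and a toy separable dataset (all names and values here are illustrative):

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Maximizing L_D is minimizing (1/2) a^T H a - 1^T a,
# where H_ij = y_i y_j x_i . x_j; cvxopt solves
# min (1/2) a^T P a + q^T a  s.t.  G a <= h,  A a = b
Yx = y[:, None] * X
P = matrix(Yx @ Yx.T)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))          # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))    # sum_i alpha_i y_i = 0
b_eq = matrix(0.0)

alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])
```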
Threshold
$b$ can be determined from the optimal $\alpha$ and the Karush-Kuhn-Tucker (KKT) conditions

$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0, \ \forall i$

$\alpha_i \neq 0$ implies $y_i (w \cdot x_i + b) = 1$, and since $y_i = \pm 1$,

$w \cdot x_i + b = y_i \;\Rightarrow\; b = y_i - w \cdot x_i$
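Continuing the QP sketch above (same illustrative variables), $w$ and $b$ can then be recovered and used for classification:

```python
sv = alpha > 1e-6                            # support vectors: alpha_i > 0

w = ((alpha * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)               # b = y_i - w . x_i, averaged
                                             # over support vectors

x_new = np.array([1.0, 0.5])
print(np.sign(w @ x_new + b))                # y = sign(w . x + b)
```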
Support Vectors
For every sample $i$, one of the following must hold:

$\alpha_i = 0$, or
$\alpha_i > 0$ and $y_i (w \cdot x_i + b) - 1 = 0$

Typically many $\alpha_i = 0$, so the solution

$w = \sum_i \alpha_i y_i x_i$

is sparse. The samples with $\alpha_i > 0$ are the support vectors.
SVM: Classification
Given a new sample $x$, find its label $y$:

$y = \operatorname{sign}(w \cdot x + b), \qquad w = \sum_{i=1}^{n} \alpha_i y_i x_i$
SVM: Example

$\alpha = (0.036,\ 0,\ 0.039,\ 0,\ 0.076,\ 0)^t$

The nonzero entries of $\alpha$ correspond to the support vectors.

Solution:

find $w$ using $w = \sum_{i=1}^{n} \alpha_i y_i x_i = (0.33,\ 0.20)^t$

$b = y_1 - w^t x_1 = 0.13$
Outliers

[Figure: training samples that no hyperplane can separate without error]

Introduce slack variables $\xi_i \ge 0$: $\xi_i$ is a measure of deviation from the ideal position for sample $i$. If $\xi_i > 1$, sample $x_i$ is on the wrong side of the hyperplane; if $0 < \xi_i < 1$, it is on the correct side but violates the margin.

[Figure: the $\xi_i > 1$ and $0 < \xi_i < 1$ cases in the $(x^{(1)}, x^{(2)})$ plane]

The constraints become $y_i (w^t x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0 \ \forall i$.
The objective becomes

$J(w, \xi) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i$

$C > 0$ is a constant which measures the relative weight of the first and second terms:

if $C$ is small, we allow many samples not to be in the ideal position
if $C$ is large, we want very few samples not in the ideal position

[Figure: large $C$ yields few samples not in the ideal position; small $C$ allows many such samples]
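The effect of $C$ can be seen quickly with an off-the-shelf solver; this sketch assumes scikit-learn and synthetic overlapping classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

soft = SVC(kernel='linear', C=0.01).fit(X, y)   # small C: many violations allowed
hard = SVC(kernel='linear', C=100.0).fit(X, y)  # large C: few violations allowed

# Small C typically leaves more samples inside or beyond the margin,
# hence more support vectors
print(len(soft.support_), len(hard.support_))
```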
maximize

$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^t x_j$

constrained to $0 \le \alpha_i \le C \ \forall i$ and $\sum_i \alpha_i y_i = 0$

find $w$ using $w = \sum_i \alpha_i y_i x_i$

solve for $b$ using any $0 < \alpha_i < C$ and $\alpha_i \left[ y_i (w^t x_i + b) - 1 \right] = 0$
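In the earlier cvxopt sketch, only the inequality constraints change: the box $0 \le \alpha_i \le C$ replaces $\alpha_i \ge 0$ (again illustrative, with an arbitrary choice of `C`):

```python
C = 1.0  # chosen relative weight of the margin and slack terms

# Stack both one-sided constraints: -alpha_i <= 0 and alpha_i <= C
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))

alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])
```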
Non-Linear Mapping

One-dimensional data that is not linearly separable can become separable after a non-linear mapping to a higher-dimensional space, e.g.

$\varphi(x) = (x, x^2)$

[Figure: 1-D samples with decision regions $R_1$, $R_2$; after the mapping $\varphi(x) = (x, x^2)$ the classes are linearly separable]

The discriminant

$g(x) = w_1 x + w_2 x^2 + w_0$

is non-linear in $x$ but linear in the mapped coordinates $\varphi(x)$.
Recall the dual problem and the classification rule:

maximize

$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^t x_j$

and classification

$y = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i \, x_i \cdot x + b \right)$

Both depend on the data only through dot products.
After the mapping $\varphi$, every dot product is replaced by $\varphi(x_i)^t \varphi(x_j)$:

maximize

$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \varphi(x_i)^t \varphi(x_j)$
Kernel
A function that returns the value of the dot product between the images of its two arguments:

$K(x, y) = \varphi(x)^t \varphi(y)$

Given a function $K$, it is possible to verify that it is a kernel.
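For instance, $K(x, y) = (x^t y + 1)^2$ on 2-D inputs can be verified numerically by exhibiting an explicit map; the feature map below is the standard degree-2 expansion (a sketch, not part of the original text):

```python
import numpy as np

def phi(x):
    """Explicit feature map with (x^t y + 1)^2 = phi(x)^t phi(y) in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
print(np.isclose((x @ y + 1.0) ** 2, phi(x) @ phi(y)))  # True
```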
In the dual, the mapped dot product is simply the kernel value:

maximize

$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \underbrace{\varphi(x_i)^t \varphi(x_j)}_{K(x_i, x_j)}$

$\varphi$ never has to be computed explicitly.
Kernel Matrix

The kernel matrix (aka the Gram matrix) collects the kernel values $K_{ij} = K(x_i, x_j)$ for all pairs of training samples.

Mercer's Theorem

The kernel matrix is symmetric positive (semi-)definite. Conversely, any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.

Every positive (semi-)definite, symmetric function is a kernel: i.e., there exists a mapping $\varphi$ such that it is possible to write

$K(x, y) = \varphi(x)^t \varphi(y)$

Here positive definite means

$\forall f \in L_2: \ \iint K(x, y) f(x) f(y) \, dx \, dy \ge 0$

(From www.support-vector.net)
Examples of Kernels
Some common choices (both satisfying Mercer's condition):

Polynomial kernel (here of degree 2): $K(x_i, x_j) = (x_i^t x_j + 1)^2$

Gaussian kernel: $K(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \lVert x_i - x_j \rVert^2 \right)$
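A sketch of computing the two Gram matrices with NumPy/SciPy ($\sigma$ and the data are placeholders):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.default_rng(1).normal(size=(5, 2))
sigma = 1.0

# Polynomial kernel of degree 2: K_ij = (x_i^t x_j + 1)^2
K_poly = (X @ X.T + 1.0) ** 2

# Gaussian kernel: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
K_gauss = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))
```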
Making Kernels
The set of kernels is closed under some operations. If $K$, $K'$ are kernels, then:

$K + K'$ is a kernel
$cK$ is a kernel, if $c > 0$
$aK + bK'$ is a kernel, for $a, b > 0$
and so on.

We can make complex kernels from simple ones: modularity!
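A small numerical sanity check of closure under addition, reusing the Gram matrices from the sketch above: the sum should remain symmetric positive semi-definite.

```python
# Eigenvalues of the summed Gram matrix stay >= 0 (up to round-off)
K_sum = K_poly + K_gauss
print(np.linalg.eigvalsh(K_sum).min() >= -1e-10)  # True
```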
maximize

$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, K(x_i, x_j)$

constrained to $0 \le \alpha_i \le C \ \forall i$ and $\sum_i \alpha_i y_i = 0$
The discriminant function becomes

$g(x) = w^t \varphi(x) = \sum_{x_i \in S} \alpha_i y_i \, \varphi(x_i)^t \varphi(x) = \sum_{x_i \in S} \alpha_i y_i \, K(x_i, x)$

where $S$ is the set of support vectors.
Writing $z_i = \alpha_i y_i$ for the weight of support vector $x_i$,

$g(x) = \sum_{x_i \in S} z_i \, K(x_i, x), \qquad K(x_i, x) = \exp\left( -\frac{1}{2\sigma^2} \lVert x_i - x \rVert^2 \right)$

For the Gaussian kernel, $K(x_i, x)$ decreases with the distance from $x$ to support vector $x_i$, so the decision at $x$ is determined by the most important training samples, i.e. the support vectors, weighted by their proximity to $x$.
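A sketch of evaluating this discriminant with the Gaussian kernel (illustrative names; `X_sv`, `y_sv`, `alpha_sv` would be the support vectors and their multipliers from training):

```python
import numpy as np

def g(x_new, X_sv, y_sv, alpha_sv, sigma=1.0):
    """g(x) = sum_{x_i in S} alpha_i y_i K(x_i, x) with a Gaussian kernel."""
    sq_dist = np.sum((X_sv - x_new) ** 2, axis=1)   # ||x_i - x||^2 per SV
    k = np.exp(-sq_dist / (2.0 * sigma ** 2))       # kernel values
    return np.sum(alpha_sv * y_sv * k)
```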
Cost of building the kernel (Hessian) matrix for polynomial kernels, computing $\varphi(a)^t \varphi(b)$ traditionally vs. sneakily:

| $\varphi(x)$ | Cost to build H matrix traditionally | Cost if $d = 100$ | $\varphi(a)^t \varphi(b)$ | Cost to build H matrix sneakily | Cost if $d = 100$ |
|---|---|---|---|---|---|
| Quadratic: all $d^2/2$ terms up to degree 2 | $d^2 n^2 / 4$ | $2{,}500\, n^2$ | $(a^t b + 1)^2$ | $d\, n^2 / 2$ | $50\, n^2$ |
| Cubic: all $d^3/6$ terms up to degree 3 | $d^3 n^2 / 12$ | $83{,}000\, n^2$ | $(a^t b + 1)^3$ | $d\, n^2 / 2$ | $50\, n^2$ |
| Quartic: all $d^4/24$ terms up to degree 4 | $d^4 n^2 / 48$ | $1{,}960{,}000\, n^2$ | $(a^t b + 1)^4$ | $d\, n^2 / 2$ | $50\, n^2$ |
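The "sneaky" column can be checked directly: evaluating $(a^t b + 1)^2$ never materializes the $O(d^2)$ quadratic features. A sketch comparing both routes on random data:

```python
import time
import numpy as np

n, d = 300, 100
X = np.random.default_rng(2).normal(size=(n, d))

# Traditional: build all d^2 products x_i x_j per sample, then dot products
t0 = time.perf_counter()
Phi = np.einsum('ni,nj->nij', X, X).reshape(n, -1)
K_slow = Phi @ Phi.T + 2.0 * (X @ X.T) + 1.0   # (a^t b + 1)^2 expanded
t_slow = time.perf_counter() - t0

# Sneaky: evaluate the kernel directly, O(d n^2)
t0 = time.perf_counter()
K_fast = (X @ X.T + 1.0) ** 2
t_fast = time.perf_counter() - t0

print(np.allclose(K_slow, K_fast), round(t_slow / t_fast, 1))
```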
SVM Summary
Advantages:

maximizing the margin keeps the hyperplane far from the training samples, which helps generalization
the solution is sparse: only the support vectors matter
kernels give non-linear decision boundaries at essentially the cost of linear ones

Disadvantages:

the kernel and the constant $C$ must be chosen
training requires solving a quadratic program over $n$ variables