
# Lecture 2: The SVM classifier

Hilary 2015

## Review of linear classifiers

- Linear separability
- Perceptron

## Support Vector Machine (SVM) classifier

- Wide margin
- Cost function
- Slack variables
- Loss functions revisited
- Optimization

A. Zisserman

## Binary Classification

Given training data $(x_i, y_i)$ for $i = 1 \ldots N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \geq 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification.

## Linear separability

(Figure: example point sets that are linearly separable and not linearly separable.)

## Linear classifiers

A linear classifier has the form

$$f(x) = w^\top x + b$$

(Figure: in 2D, axes $X_1$, $X_2$; the line $f(x) = 0$ separates the region $f(x) > 0$ from the region $f(x) < 0$.)

$w$ is known as the weight vector.

## Linear classifiers

A linear classifier has the form

$$f(x) = w^\top x + b$$

- For a K-NN classifier it was necessary to "carry" the training data.
- For a linear classifier, the training data is used to learn $w$ and then discarded.
- Only $w$ is needed for classifying new data (see the sketch below).
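As a concrete illustration, here is a minimal sketch of classifying a point with a learned linear classifier; the values of $w$ and $b$ are made up for the example.

```python
import numpy as np

w, b = np.array([0.5, -1.2]), 0.3          # hypothetical learned weights

def predict(x):
    """Classify x by the sign of f(x) = w^T x + b."""
    return 1 if w @ x + b >= 0 else -1

print(predict(np.array([2.0, 0.1])))       # f = 1.18 > 0, so prints 1
```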

## The Perceptron Classifier

Given linearly separable data $x_i$ labelled into two categories $y_i \in \{-1, 1\}$, find a weight vector $w$ such that the discriminant function

$$f(x_i) = w^\top x_i + b$$

separates the categories for $i = 1, \ldots, N$. How can we find this separating hyperplane?

## The Perceptron Algorithm

Write the classifier as $f(x_i) = \tilde{w}^\top \tilde{x}_i + w_0 = w^\top x_i$, where $w = (\tilde{w}, w_0)$ and $x_i = (\tilde{x}_i, 1)$.

- Initialize $w = 0$
- Cycle through the data points $\{x_i, y_i\}$
  - if $x_i$ is misclassified then $w \leftarrow w + \alpha\, y_i\, x_i$ (equivalently $w \leftarrow w - \alpha\, \operatorname{sign}(f(x_i))\, x_i$, since $y_i = -\operatorname{sign}(f(x_i))$ for a misclassified point)
- Until all the data is correctly classified

## For example in 2D

- Initialize $w = 0$
- Cycle through the data points $\{x_i, y_i\}$
  - if $x_i$ is misclassified then $w \leftarrow w + \alpha\, y_i\, x_i$
- Until all the data is correctly classified

(Figure: axes $X_1$, $X_2$; the weight vector $w$ and a misclassified point $x_i$ before the update, and the rotated $w$ after the update.)

NB: after convergence $w = \sum_i^N \alpha_i\, y_i\, x_i$, a weighted sum of the training points.

## Perceptron example

(Figure: 2D training points and the separating line found by the perceptron.)

- if the data is linearly separable, then the algorithm will converge
- convergence can be slow
- the separating line is close to the training data
- we would prefer a larger margin for generalization
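The perceptron algorithm above translates almost line-for-line into code. A minimal numpy sketch using homogeneous coordinates; the arguments `alpha` and `max_epochs` (learning rate and epoch cap) are choices for this illustration, not part of the original algorithm statement.

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Perceptron on homogeneous coordinates: x_i -> (x_i, 1), w = (w~, w0)."""
    Xh = np.hstack([X, np.ones((len(X), 1))])   # append the constant feature
    w = np.zeros(Xh.shape[1])                   # initialize w = 0
    for _ in range(max_epochs):                 # cycle through the data points
        mistakes = 0
        for xi, yi in zip(Xh, y):
            if yi * (w @ xi) <= 0:              # x_i is misclassified
                w += alpha * yi * xi            # w <- w + alpha * y_i * x_i
                mistakes += 1
        if mistakes == 0:                       # all data correctly classified
            break
    return w[:-1], w[-1]                        # w~ and w0 (the bias b)
```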

## Support Vector Machine

linearly separable data
wTx + b = 0
b
||w||

Support Vector
Support Vector

f (x) =

X
i

i yi (xi > x) + b
support vectors

## SVM sketch derivation

Since $w^\top x + b = 0$ and $c(w^\top x + b) = 0$ define the same plane, we have the freedom to choose the normalization of $w$.

Choose the normalization such that $w^\top x_+ + b = +1$ and $w^\top x_- + b = -1$ for the positive and negative support vectors respectively.

Then the margin is given by

$$\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{w^\top (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}$$
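A quick numerical check of this formula, with a hypothetical $w$ and hand-picked points satisfying the $\pm 1$ normalization:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -1.0        # hypothetical weights, ||w|| = 5
x_pos = np.array([0.4, 0.2])             # w.x + b = +1 (positive support vector)
x_neg = np.array([0.0, 0.0])             # w.x + b = -1 (negative support vector)
print(w @ x_pos + b, w @ x_neg + b)      # 1.0  -1.0, as required
print((w / np.linalg.norm(w)) @ (x_pos - x_neg))   # 0.4, which is 2/||w||
```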

## Support Vector Machine

linearly separable data

(Figure: hyperplanes $w^\top x + b = 1$, $w^\top x + b = 0$, $w^\top x + b = -1$; the support vectors lie on the margin hyperplanes; Margin $= \frac{2}{\|w\|}$.)

## SVM Optimization

Learning the SVM can be formulated as an optimization:

$$\max_w \frac{2}{\|w\|} \quad \text{subject to} \quad w^\top x_i + b \begin{cases} \geq 1 & \text{if } y_i = +1 \\ \leq -1 & \text{if } y_i = -1 \end{cases} \quad \text{for } i = 1 \ldots N$$

Or equivalently

$$\min_w \|w\|^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 \quad \text{for } i = 1 \ldots N$$

This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum.
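As an illustration, this quadratic program can be handed to an off-the-shelf QP solver. A sketch using cvxopt (assumed installed); the variable is $z = (w, b)$, and the tiny diagonal term on $b$ is only there to keep the quadratic form positive definite for the solver:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1, over z = (w, b)."""
    N, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)          # quadratic term penalizes w only...
    P[d, d] = 1e-8                 # ...tiny term on b keeps P positive definite
    q = np.zeros(d + 1)
    # y_i (w^T x_i + b) >= 1  rewritten in cvxopt's form  G z <= h
    G = -y.astype(float)[:, None] * np.hstack([X, np.ones((N, 1))])
    h = -np.ones(N)
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]             # w, b
```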

- The points can be linearly separated, but there is a very narrow margin.
- But possibly the large margin solution is better, even though one constraint is violated.

In general there is a trade-off between the margin and the number of mistakes on the training data.

## Slack variables

Introduce slack variables $\xi_i \geq 0$, one per training point:

- for $0 < \xi_i \leq 1$ the point is between the margin and the correct side of the hyperplane: a margin violation
- for $\xi_i > 1$ the point is misclassified

(Figure: hyperplanes $w^\top x + b = 1$, $w^\top x + b = 0$, $w^\top x + b = -1$ with Margin $= \frac{2}{\|w\|}$; support vectors have $\xi = 0$; a misclassified point lies at distance $\frac{\xi_i}{\|w\|} > \frac{2}{\|w\|}$ from its margin hyperplane.)

## Soft margin solution

The optimization problem becomes

$$\min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

- Every constraint can be satisfied if $\xi_i$ is sufficiently large.
- $C$ is a regularization parameter:
  - small $C$ allows constraints to be easily ignored: large margin
  - large $C$ makes constraints hard to ignore: narrow margin
  - $C = \infty$ enforces all constraints: hard margin
- This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, $C$.

(Figure: decision boundaries and margins on 2D data, feature x vs feature y, for $C = \infty$: hard margin, and $C = 10$: soft margin.)
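To reproduce the effect shown in the figure, here is a sketch using scikit-learn's `SVC` on made-up 2D data; a very large `C` approximates the hard margin, and the generated data is purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)                     # toy 2D data (hypothetical)
X = np.vstack([rng.normal(-1.5, 0.7, (50, 2)),
               rng.normal(+1.5, 0.7, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in [1e10, 10.0]:                             # C = 1e10 ~ hard margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}: margin = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```

Smaller `C` tolerates margin violations, so the margin widens and more points become support vectors.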

## Application: Pedestrian detection in Computer Vision

Objective: detect (localize) standing humans in an image (cf. face detection with a sliding window classifier)

- reduces object detection to binary classification
- does an image window contain a person or not?

## Training data and features

Positive data: 1208 positive window examples

(Figure: an example image window and its HOG descriptor; each cell shows the dominant gradient direction, computed from a histogram of frequency over orientation. A second panel shows averaged positive examples.)

## Algorithm

Training (Learning): represent each example window by a HOG feature vector $x_i \in \mathbb{R}^d$, with $d = 1024$, and train an SVM classifier.

Testing (Detection): sliding window classifier

$$f(x) = w^\top x + b$$

(Figure: the learned model $f(x) = w^\top x + b$ visualized as HOG weights. Slide from Deva Ramanan.)

## Optimization

Learning an SVM has been formulated as a constrained optimization problem over $w$ and $\xi$:

$$\min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

The constraint $y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i$ can be written more concisely as

$$y_i f(x_i) \geq 1 - \xi_i$$

which, together with $\xi_i \geq 0$, is equivalent to

$$\xi_i = \max(0, 1 - y_i f(x_i))$$

Hence the learning problem is equivalent to the unconstrained optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \underbrace{\|w\|^2}_{\text{regularization}} + C \sum_i^N \underbrace{\max(0, 1 - y_i f(x_i))}_{\text{loss function}}$$
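The unconstrained objective is straightforward to evaluate directly. A minimal numpy sketch (the function name and arguments are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)), with f(x) = w^T x + b."""
    margins = y * (X @ w + b)               # y_i f(x_i) for all points at once
    xi = np.maximum(0.0, 1.0 - margins)     # the slack xi_i = hinge loss
    return w @ w + C * xi.sum()
```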

## Loss function

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max(0, 1 - y_i f(x_i))$$

(Figure: hyperplane $w^\top x + b = 0$ with support vectors on the margins.)

Points are in three categories:

1. $y_i f(x_i) > 1$: point is outside the margin; no contribution to the loss.
2. $y_i f(x_i) = 1$: point is on the margin; no contribution to the loss, as in the hard margin case.
3. $y_i f(x_i) < 1$: point violates the margin constraint; contributes to the loss.

## Loss functions

(Figure: losses plotted against $y_i f(x_i)$.)

SVM uses the hinge loss $\max(0, 1 - y_i f(x_i))$, an approximation to the 0-1 loss.
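To see the approximation, one can tabulate both losses as a function of the score $z = y_i f(x_i)$; the hinge loss upper-bounds the 0-1 loss and both vanish for confidently correct points ($z \geq 1$). A small sketch:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)              # z stands for y_i f(x_i)
hinge = np.maximum(0.0, 1.0 - z)           # SVM hinge loss
zero_one = (z < 0).astype(float)           # 0-1 loss: 1 iff misclassified
for zi, h, o in zip(z, hinge, zero_one):
    print(f"z = {zi:+.1f}   hinge = {h:.1f}   0-1 = {o:.0f}")
```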

## Optimization continued

$$\min_{w \in \mathbb{R}^d} C \sum_i^N \max(0, 1 - y_i f(x_i)) + \|w\|^2$$

(Figure: a function with both a local minimum and a global minimum.)

- Does this cost function have a unique solution?
- Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

## Convex functions

(Figure: examples of convex and non-convex functions.)

The SVM cost function

$$\min_{w \in \mathbb{R}^d} C \sum_i^N \max(0, 1 - y_i f(x_i)) + \|w\|^2$$

is convex.

## Gradient (or steepest) descent algorithm for SVM

To minimize a cost function $\mathcal{C}(w)$, use the iterative update

$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w \mathcal{C}(w_t)$$

where $\eta$ is the learning rate.

First, rewrite the optimization problem as an average:

$$\min_w \mathcal{C}(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{N} \sum_i^N \max(0, 1 - y_i f(x_i)) = \frac{1}{N} \sum_i^N \left( \frac{\lambda}{2}\|w\|^2 + \max(0, 1 - y_i f(x_i)) \right)$$

(with $\lambda = 2/(NC)$ up to an overall scale of the problem) and $f(x) = w^\top x + b$.
Because the hinge loss is not differentiable, a sub-gradient is computed. For the loss on a single point,

$$L(x_i, y_i; w) = \max(0, 1 - y_i f(x_i)), \qquad f(x_i) = w^\top x_i + b$$

the sub-gradient is

$$\frac{\partial L}{\partial w} = -y_i x_i \;\; \text{if } y_i f(x_i) < 1, \qquad \frac{\partial L}{\partial w} = 0 \;\; \text{otherwise}$$

(Figure: hinge loss plotted against $y_i f(x_i)$; the sub-gradient is the slope of each linear piece.)
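The case analysis above is a two-line function. A minimal sketch for a single training point (names are illustrative):

```python
import numpy as np

def hinge_subgradient(w, b, x_i, y_i):
    """Sub-gradient of L = max(0, 1 - y_i (w^T x_i + b)) with respect to w."""
    if y_i * (x_i @ w + b) < 1:    # margin violated: hinge is active
        return -y_i * x_i
    return np.zeros_like(w)        # hinge is flat here, sub-gradient is 0
```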

Using the sub-gradient, the cost is

$$\mathcal{C}(w) = \frac{1}{N} \sum_i^N \left( \frac{\lambda}{2}\|w\|^2 + L(x_i, y_i; w) \right)$$

The iterative update is

$$w_{t+1} \leftarrow w_t - \eta \nabla_{w_t} \mathcal{C}(w_t) = w_t - \eta \frac{1}{N} \sum_i^N \left( \lambda w_t + \nabla_w L(x_i, y_i; w_t) \right)$$

where $\eta$ is the learning rate. Then each iteration $t$ involves cycling through the training data with the updates:

$$w_{t+1} \leftarrow w_t - \eta \left( \lambda w_t - y_i x_i \right) \quad \text{if } y_i f(x_i) < 1$$

$$w_{t+1} \leftarrow w_t - \eta \lambda w_t \quad \text{otherwise}$$

In the Pegasos algorithm the learning rate is set at $\eta_t = \frac{1}{\lambda t}$.

## Pegasos: Stochastic Gradient Descent Algorithm

Randomly sample from the training data.

(Figure: left, energy on a log scale decreasing over iterations 50 to 300; right, the resulting decision boundary on 2D training data.)
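Putting the pieces together, a sketch of Pegasos in numpy following the updates above; the bias update here is an assumption (classic Pegasos folds the bias into $w$ or omits it), and the data and $\lambda$ are hypothetical.

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=10_000, seed=0):
    """Stochastic sub-gradient descent on (lam/2)||w||^2 + mean hinge loss."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(1, n_iters + 1):
        eta = 1.0 / (lam * t)                    # Pegasos learning rate
        i = rng.integers(N)                      # randomly sample a point
        if y[i] * (X[i] @ w + b) < 1:            # hinge active at this point
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            b += eta * y[i]                      # bias update: an assumption
        else:
            w = (1.0 - eta * lam) * w            # only the regularizer acts
    return w, b
```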

In the next lecture we will see that the SVM can be expressed as a sum over the support vectors:

$$f(x) = \sum_i \alpha_i\, y_i\, (x_i^\top x) + b$$

On the web page http://www.robots.ox.ac.uk/~az/lectures/ml:

- links to SVM tutorials and video lectures
- MATLAB SVM demo