Last Time
Basics of the Support Vector Machines
For two points $x_1$ and $x_2$ lying on the positive and negative margin boundaries:
$$w^T(x_1 - x_2) = 2$$
Projection onto the unit normal $\frac{w}{\|w\|}$:
$$\frac{w^T}{\|w\|}(x_1 - x_2) = \frac{2}{\|w\|}$$
Size of the margin: $\frac{2}{\|w\|}$
Constraints: $t_i(w^T x_i + b) \ge 1$, i.e.
$$w^T x_i + b \ge 1 \;\text{ if } t_i = 1, \qquad w^T x_i + b \le -1 \;\text{ if } t_i = -1$$
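To make the first step explicit (a short derivation assuming $x_1$ and $x_2$ sit exactly on the two margin boundaries; not verbatim from the slides):
$$w^T x_1 + b = 1, \quad w^T x_2 + b = -1 \;\Rightarrow\; w^T(x_1 - x_2) = 2,$$
$$\text{margin} = \frac{w^T}{\|w\|}(x_1 - x_2) = \frac{2}{\|w\|}.$$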
Maximum margin Lagrangian:
$$L(w, b) = \frac{1}{2} w^T w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i (w^T x_i + b) - 1 \right]$$
Setting the derivatives to zero gives:
$$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i, \qquad \sum_{i=0}^{N-1} \alpha_i t_i = 0$$
Dual:
$$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$$
where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$.
When $\alpha_i$ is non-zero, $x_i$ is a support vector.
When $\alpha_i$ is zero, $x_i$ is not a support vector.
Decision rule for a new point $x$:
$$w = \sum_{i=0}^{N-1} \alpha_i t_i x_i, \qquad \sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b > 0$$
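As an illustrative sketch (not from the slides), this decision rule can be checked numerically: scikit-learn's SVC exposes the products $\alpha_i t_i$ (dual_coef_), the support vectors, and the bias, so the decision value can be recomputed by hand. The toy data and the large C (to approximate a hard margin) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
t = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, t)  # large C approximates a hard margin

# Non-zero alphas correspond exactly to the support vectors
print("support vectors:\n", clf.support_vectors_)

alpha_t = clf.dual_coef_.ravel()   # alpha_i * t_i for each support vector
b = clf.intercept_[0]

# Decision value: sum_i alpha_i t_i (x_i . x) + b
x_new = np.array([5.0, 5.0])
f = np.sum(alpha_t * (clf.support_vectors_ @ x_new)) + b
print("manual:", f, "sklearn:", clf.decision_function([x_new])[0])
```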
Today
How support vector machines deal with data that are not linearly separable:
Soft margins
Kernels!
SVMs are easily interpreted: the decision boundary is based on the data, in the form of the support vectors. This is not so in multilayer perceptron networks.
Linear Separability
So far, support vector machines can only handle linearly separable data.
Hinge Loss
Introduce slack variables $\xi_i \ge 0$ and minimize:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=0}^{N-1} \xi_i$$
$$L(w, b, \xi) = \frac{1}{2} w^T w + C \sum_{i=0}^{N-1} \xi_i - \sum_{i=0}^{N-1} \alpha_i \left[ t_i (w^T x_i + b) + \xi_i - 1 \right] - \sum_{i=0}^{N-1} \mu_i \xi_i$$
where $t_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ (the multipliers $\mu_i \ge 0$ enforce the latter constraint).
Dual:
$$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$$
where $0 \le \alpha_i \le C$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$.
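A hedged sketch (not part of the slides) of the soft-margin trade-off: varying C in scikit-learn's SVC on overlapping data shows that a small C tolerates more margin violations (more support vectors), while a large C approaches the hard margin. The toy data and C values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (assumed toy data)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # Smaller C -> more support vectors (more margin violations allowed)
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, t):.2f}")
```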
No (principled) probabilities from SVMs
SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
Not especially fast.
Training: $O(n^3)$ quadratic programming.
Evaluation: $O(n)$; need to evaluate against each support vector (potentially all $n$).
Kernel Methods
Points that are not linearly separable in 2 dimensions might be linearly separable in 3.
Kernel Methods
We will look at a way to add dimensionality to the data in order to make it linearly separable. In the extreme, we can construct a dimension for each data point. This may lead to overfitting.
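A minimal sketch of this idea (the toy data and the feature map are assumptions, not from the slides): points separable only by a circle in 2D become linearly separable once a third coordinate $x_1^2 + x_2^2$ is added.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Inner cluster (class -1) surrounded by a ring (class +1): not linearly separable in 2D
theta = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, (100, 2))
ring = np.column_stack([2 * np.cos(theta), 2 * np.sin(theta)]) + rng.normal(0, 0.1, (100, 2))
X = np.vstack([inner, ring])
t = np.array([-1] * 100 + [1] * 100)

def phi(X):
    """Add a third dimension x1^2 + x2^2 (an assumed feature map)."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

flat = SVC(kernel="linear").fit(X, t)          # struggles in 2D
lifted = SVC(kernel="linear").fit(phi(X), t)   # separable in 3D
print("2D accuracy:", flat.score(X, t))
print("3D accuracy:", lifted.score(phi(X), t))
```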
Recall the maximum margin problem:
$$L(w, b) = \frac{1}{2} w^T w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i (w^T x_i + b) - 1 \right], \qquad w = \sum_{i=0}^{N-1} \alpha_i t_i x_i$$
Dual:
$$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)$$
where $\alpha_i \ge 0$ and $\sum_{i=0}^{N-1} \alpha_i t_i = 0$.
The decision process doesn't depend on the dimensionality of the data. We can map the data into a higher-dimensional space.
Note: the data points only appear within a dot product. The objective function is based on the dot product of data points, not the data points themselves.
Kernels
Replace the dot product with a dot product of mapped points:
$$x_i \cdot x_j \;\rightarrow\; \phi(x_i) \cdot \phi(x_j)$$
$$W(\alpha) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \, (\phi(x_i) \cdot \phi(x_j)) = \sum_{i=0}^{N-1} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j \, K(x_i, x_j)$$
The objective function is based on a dot product of data points, rather than the data points themselves. We can represent this dot product as a kernel: a kernel function, or a kernel matrix.
Kernels
Kernels are a mapping:
$$x_i^T x_j \;\rightarrow\; \phi(x_i)^T \phi(x_j)$$
Kernels
Gram matrix:
$$K_{ij} = \phi(x_i)^T \phi(x_j), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
In general we don't need to know the form of $\phi$; just specifying the kernel function is sufficient. A good kernel: computing $K(x_i, x_j)$ is cheaper than computing $\phi(x_i)$.
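A small sketch (an assumed example, not from the slides): for 2D inputs the kernel $K(x, z) = (x^T z)^2$ equals the dot product of the explicit map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the Gram matrix can be filled in without ever forming $\phi$.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel (2D input)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Kernel: same value as phi(x) . phi(z), but cheaper to compute."""
    return (x @ z) ** 2

X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

# Gram matrix built directly from the kernel
K = np.array([[k(xi, xj) for xj in X] for xi in X])
# Same matrix built from the explicit map
K_phi = np.array([[phi(xi) @ phi(xj) for xj in X] for xi in X])

print(np.allclose(K, K_phi))  # True
```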
Kernels
Valid kernels:
Symmetric: $K(x, z) = K(z, x)$
Must be decomposable into $\phi$ functions; this is harder to show directly.
Equivalent test: the Gram matrix is positive semi-definite (psd), i.e. $x^T K x \ge 0$ for all $x$.
Note that this is not the same as having non-negative entries ($K_{ij} \ge 0$): a matrix with all positive entries is not necessarily psd, and a matrix with negative entries may still be psd.
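As a hedged illustration (values assumed), one practical check that a candidate Gram matrix is psd is to look at its eigenvalues; the example also shows that entrywise signs do not decide the matter.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """A symmetric matrix is psd iff all eigenvalues are >= 0 (up to tolerance)."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# All entries positive, yet NOT psd (eigenvalues 3 and -1)
A = np.array([[1.0, 2.0],
              [2.0, 1.0]])
# Contains negative entries, yet psd (eigenvalues 3 and 1)
B = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

print(is_psd(A))  # False
print(is_psd(B))  # True
```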
Kernels
Given valid kernels $K_1(x, z)$ and $K_2(x, z)$, more kernels can be made from them:
$cK_1(x, z)$ for $c > 0$
$K_1(x, z) + K_2(x, z)$
$K_1(x, z)\,K_2(x, z)$
$\exp(K_1(x, z))$
and more.
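A quick sketch (assumed data, not from the slides) checking these closure rules empirically: combining the Gram matrices of two valid kernels by scaling, addition, elementwise product, or elementwise exponentiation still yields a psd matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))

K1 = X @ X.T                      # linear kernel Gram matrix (valid)
K2 = (X @ X.T + 1.0) ** 2         # polynomial kernel Gram matrix (valid)

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

# All of these remain psd (min eigenvalue >= 0, up to numerical error)
for name, K in [("3*K1", 3 * K1),
                ("K1 + K2", K1 + K2),
                ("K1 * K2 (elementwise)", K1 * K2),
                ("exp(K1)", np.exp(K1))]:
    print(f"{name:<24} min eigenvalue = {min_eig(K):.2e}")
```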
The decision function is kernelized the same way:
$$\sum_{i=0}^{N-1} \alpha_i t_i (x_i^T x) + b \;\rightarrow\; \sum_{i=0}^{N-1} \alpha_i t_i K(x_i, x) + b$$
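As an illustrative sketch (not from the slides), the same manual reconstruction as before works with a kernel: scikit-learn's SVC with an RBF kernel stores $\alpha_i t_i$ and the support vectors, and the decision value is $\sum_i \alpha_i t_i K(x_i, x) + b$. The data and the gamma value are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
t = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # circular boundary

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, t)

x_new = np.array([0.2, -0.3])
# K(x_i, x) for each support vector x_i (sklearn's RBF: exp(-gamma * ||x_i - x||^2))
k = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_.ravel() @ k + clf.intercept_[0]

print(f_manual, clf.decision_function([x_new])[0])  # should match
```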
Polynomial Kernels
$$K(x, z) = (x^T z + c)^d \quad \text{where } c \ge 0$$
The kernel is a polynomial power of the original dot product.
If $c$ is large, the emphasis falls on the linear terms; if $c$ is small, the emphasis falls on the higher-order terms.
Very fast to calculate.
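A short worked expansion (assuming $d = 2$ and 2D inputs; not taken from the slides) makes the implicit feature map visible and shows how $c$ reweights the lower-order terms:
$$(x^T z + c)^2 = (x_1 z_1 + x_2 z_2)^2 + 2c\,(x_1 z_1 + x_2 z_2) + c^2 = \phi(x)^T \phi(z),$$
$$\phi(x) = \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2c}\,x_1,\; \sqrt{2c}\,x_2,\; c\bigr).$$
Larger $c$ scales up the linear ($\sqrt{2c}\,x_i$) and constant components relative to the quadratic ones; smaller $c$ emphasizes the quadratic terms.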
Gaussian (RBF) Kernels
The inner product of two points is related to the distance in space between the two points: placing a bump on each point.
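The slide's formula was lost in extraction; the standard Gaussian (RBF) kernel, presumably the one being described, is
$$K(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right),$$
so the kernel value depends only on the distance between the two points, and each training point contributes a bump of width $\sigma$.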
String Kernels
Not a Gaussian, but still a legitimate kernel:
$K(s, s')$ = difference in length
$K(s, s')$ = count of different letters
$K(s, s')$ = minimum edit distance
Graph Kernels
Define the kernel function based on graph properties. These properties must be computable in poly-time:
Walks of length < k
Paths
Spanning trees
Cycles
Kernels allow us to incorporate knowledge about the input without direct feature extraction: just similarity in some space.
Kernels in Clustering
In clustering, it's very common to define cluster similarity by the distance between points, e.g. k-nn (k-means).
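A sketch of how this kernelizes (standard kernel k-means algebra, stated here as an assumption rather than reconstructed from the slide): the squared distance from a point to a cluster mean in feature space needs only kernel evaluations:
$$\Bigl\|\phi(x) - \frac{1}{|C|}\sum_{j \in C}\phi(x_j)\Bigr\|^2 = K(x, x) - \frac{2}{|C|}\sum_{j \in C} K(x, x_j) + \frac{1}{|C|^2}\sum_{j, l \in C} K(x_j, x_l).$$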
Bye
Next Time
Supervised Learning Review
Clustering