Jinlong Wu
Computational Mathematics,
SMS, PKU
May 4, 2007
Introduction
In this section I review linearly separable problems because they are the simplest case for SVC. The training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ is said to be linearly separable if there exists a linear discriminant function whose sign matches the class label $y_i$ at every training point.
For any point $x_0$ on the hyperplane $\{x : \beta^T x + \beta_0 = 0\}$, the signed distance from a point $x$ to the hyperplane is
$$\frac{1}{\|\beta\|}\bigl(\beta^T x - \beta^T x_0\bigr) = \frac{1}{\|\beta\|}\bigl(\beta^T x + \beta_0\bigr). \tag{1}$$
The optimal separating hyperplane maximizes the margin $C$ between the two classes:
$$\max_{\beta,\,\beta_0,\,\|\beta\|=1} \; C \quad \text{subject to} \quad y_i\bigl(\beta^T x_i + \beta_0\bigr) \geq C, \quad \forall i. \tag{2}$$
[Figure: a two-dimensional toy data set with classes labeled val = +1 and val = -1, together with the decision boundary; axes $x_1$ and $x_2$.]
3
3.1
Real-world data sets usually cannot be separated perfectly by simple hyperplanes. If hyperplanes are replaced by hypersurfaces, better classification results can be expected. Therefore, OP (2) should be generalized to
$$\min_{\beta,\,\beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i\bigl(\beta^T \phi(x_i) + \beta_0\bigr) \geq 1, \quad \forall i, \tag{3}$$
where $\phi(\cdot)$ denotes the mapping into the feature space.
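As a concrete illustration of why the transformation $\phi$ helps (a toy example of mine, not part of the original derivation), consider a one-dimensional data set in which the positive class surrounds the negative class; no threshold on $x$ separates it, but the quadratic feature map $\phi(x) = (x, x^2)$ makes it separable by a hyperplane in the feature space:

import numpy as np

# One-dimensional data: the positive class lies on both sides of the negative class.
x = np.array([-2.0, -1.5, -0.3, 0.0, 0.4, 1.7, 2.1])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])

# Quadratic feature map phi(x) = (x, x^2); in this space the hyperplane with
# beta = (0, 1) and beta_0 = -1 separates the classes perfectly, since
# sign(x^2 - 1) equals y for every training point.
phi = np.column_stack([x, x ** 2])
beta, beta_0 = np.array([0.0, 1.0]), -1.0
print(np.all(np.sign(phi @ beta + beta_0) == y))   # True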
3.2
Soft margins
A careful reader may raise a further question: is this generalization able to separate every real-world data set perfectly?
$$\min_{\beta,\,\beta_0,\,\xi} \; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \begin{cases} y_i\bigl(\phi(x_i)^T\beta + \beta_0\bigr) \geq 1 - \xi_i, \; \forall i,\\ \xi_i \geq 0. \end{cases} \tag{5}$$
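The slack variables $\xi_i$ in (5) measure by how much each point violates the margin. Below is a minimal numerical sketch (toy data and a hand-picked $\beta$, $\beta_0$ of my own, with $\phi$ taken as the identity) of the smallest feasible slacks and the resulting primal objective:

import numpy as np

# Toy data; phi is the identity map here for simplicity.
X = np.array([[2.0, 0.5], [1.0, 1.0], [0.2, 0.1], [-1.0, -0.5]])
y = np.array([1, 1, -1, -1])
beta, beta_0, C = np.array([1.0, 1.0]), -1.0, 10.0

margins = y * (X @ beta + beta_0)          # y_i * f(x_i)
xi = np.maximum(0.0, 1.0 - margins)        # smallest slacks satisfying (5)
objective = 0.5 * beta @ beta + C * xi.sum()
print(xi, objective)                       # only the third point needs slack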
Duality of (5)
$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\Bigl[y_i\bigl(\phi(x_i)^T\beta + \beta_0\bigr) - (1 - \xi_i)\Bigr] - \sum_{i=1}^{n}\mu_i\xi_i, \tag{6}$$
where $\alpha_i, \mu_i, \xi_i \geq 0$ for all $i$. We minimize $L_P$ w.r.t. $\beta$, $\beta_0$ and $\xi_i$. Setting the respective derivatives to zero, we get
$$\beta = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i), \tag{7}$$
$$0 = \sum_{i=1}^{n}\alpha_i y_i, \tag{8}$$
$$\alpha_i = C - \mu_i, \quad \forall i. \tag{9}$$
By substituting the last three equalities into (6), we obtain the Lagrangian
(Wolfe) dual objective function
$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}\,y_i y_{i'}\,\phi(x_i)^T\phi(x_{i'}). \tag{10}$$
We maximize $L_D$ subject to $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$. In addition to (7)-(9), the Karush-Kuhn-Tucker (KKT) conditions include the constraints
$$\alpha_i\Bigl[y_i\bigl(\phi(x_i)^T\beta + \beta_0\bigr) - (1 - \xi_i)\Bigr] = 0, \tag{11}$$
$$\mu_i\xi_i = 0, \tag{12}$$
$$y_i\bigl(\phi(x_i)^T\beta + \beta_0\bigr) - (1 - \xi_i) \geq 0, \tag{13}$$
for $i = 1, \ldots, n$.
We therefore solve the QP problem
$$\max_{\alpha}\; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}\,y_i y_{i'}\,\phi(x_i)^T\phi(x_{i'}) \quad \text{subject to} \quad \begin{cases} \sum_{i=1}^{n} y_i\alpha_i = 0,\\ 0 \leq \alpha_i \leq C, \; \forall i, \end{cases} \tag{14}$$
and get the optimal $\alpha_i$'s; from (7), the solution $\beta$ has the form
$$\beta = \sum_{i=1}^{n}\alpha_i y_i \phi(x_i), \tag{15}$$
so that the decision function is
$$f(x) = \sum_{i=1}^{n}\alpha_i y_i\,\phi(x)^T\phi(x_i) + \beta_0, \tag{16}$$
and the induced classifier is
$$G(x) = \operatorname{sign}[f(x)] = \operatorname{sign}\Bigl[\sum_{i=1}^{n}\alpha_i y_i\,\phi(x)^T\phi(x_i) + \beta_0\Bigr]. \tag{17}$$
Since SVC classifies a point purely by the sign of the decision function, only the decision function is needed in the end. Moreover, $x$ enters (17) only through the pairwise inner products $\phi(x)^T\phi(x')$. Defining the kernel function $K(x, x')$ as $K(x, x') = \phi(x)^T\phi(x')$, (17) can be written as
$$G(x) = \operatorname{sign}[f(x)] = \operatorname{sign}\Bigl[\sum_{i=1}^{n}\alpha_i y_i\,K(x, x_i) + \beta_0\Bigr]. \tag{18}$$
Commonly used kernel functions include

Polynomial: $K(x, x') = (\gamma\, x^T x' + r)^d$, $\gamma > 0$;
Radial Basis Function (RBF): $K(x, x') = e^{-\gamma\|x - x'\|^2}$, $\gamma > 0$;
Sigmoid: $K(x, x') = \tanh(\gamma\, x^T x' + r)$;

where $\gamma$, $r$, and $d$ are kernel parameters.
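For reference, a minimal sketch of the RBF and sigmoid kernels as plain NumPy functions (function and parameter names are mine; the parameter values are arbitrary):

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2), gamma > 0
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.5, r=0.0):
    # K(x, x') = tanh(gamma * x^T x' + r)
    return np.tanh(gamma * np.dot(x, z) + r)

x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(rbf_kernel(x1, x2), sigmoid_kernel(x1, x2))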
Defining a new $n \times n$ matrix $Q$ by $Q_{ij} = y_i y_j K(x_i, x_j)$, QP (14) can be expressed more simply as
$$\min_{\alpha}\; g(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to} \quad \begin{cases} y^T\alpha = 0,\\ 0 \leq \alpha_i \leq C, \; \forall i. \end{cases} \tag{19}$$
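Before turning to the specialized solvers discussed next, it may help to see (19) solved directly on a tiny data set with a general-purpose optimizer. This is only an illustrative sketch (the toy data, RBF kernel, and SciPy's SLSQP method are my choices, not how PKSVM or LIBSVM work):

import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [0.3, 0.2], [2.0, 2.0], [2.2, 1.8]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n, C, gamma = len(y), 10.0, 0.5

def K(a, b):                                   # RBF kernel
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Q_ij = y_i y_j K(x_i, x_j), exactly as defined for (19).
Q = np.array([[y[i] * y[j] * K(X[i], X[j]) for j in range(n)] for i in range(n)])

g    = lambda a: 0.5 * a @ Q @ a - a.sum()     # objective of (19)
grad = lambda a: Q @ a - np.ones(n)            # its gradient

res = minimize(g, np.zeros(n), jac=grad, method="SLSQP",
               bounds=[(0.0, C)] * n,
               constraints=[{"type": "eq", "fun": lambda a: y @ a}])
alpha = res.x

# Recover beta_0 from a free support vector (0 < alpha_i < C) via the KKT
# conditions; for this toy data at least one free support vector exists.
i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
beta_0 = y[i] - sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
print(alpha, beta_0)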
Fortunately, several much more efficient algorithms have been invented to solve (19), such as chunking, decomposition, and sequential minimal optimization (SMO). SMO in particular has attracted a lot of attention since it was proposed by Platt (1998). All of them employ divide-and-conquer strategies: they split the original large QP (19) into many much smaller subproblems, solve these iteratively, and finally obtain the solution to (19).
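To make the idea of a small subproblem concrete, the following sketch implements the classical analytic two-variable update at the heart of SMO, in the form given by Platt (1998). It is a simplified illustration of mine (it omits working-set selection and the update of $\beta_0$), not the actual code of PKSVM or LIBSVM:

import numpy as np

def smo_pair_update(alpha, b, i, j, K, y, C):
    # Optimize alpha_i and alpha_j with all other variables fixed, keeping
    # sum_k y_k alpha_k = 0 and 0 <= alpha_k <= C. Here b plays the role of beta_0.
    f = lambda k: (alpha * y) @ K[:, k] + b    # current decision value f(x_k)
    E_i, E_j = f(i) - y[i], f(j) - y[j]        # prediction errors

    # Feasible interval [L, H] for alpha_j implied by the equality constraint.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature along the feasible line
    if L >= H or eta <= 0.0:
        return alpha                           # degenerate pair: skip it
    new = alpha.copy()
    new[j] = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    new[i] = alpha[i] + y[i] * y[j] * (alpha[j] - new[j])
    return new

A full SMO implementation repeats such updates on well-chosen pairs (i, j), adjusting $\beta_0$ after every step, until the KKT conditions are satisfied within the tolerance discussed below.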
However, the iterative computation is still time-consuming when n is large. Joachims (1998) proposed two techniques to reduce the cost of the computation. The first one is shrinking.
6.1
Shrinking
For many problems the number of free support vectors ($0 < \alpha_i < C$) is small compared with the total number of support vectors ($0 < \alpha_i \leq C$). The shrinking technique reduces the size of the problem by leaving some bounded support vectors ($\alpha_i = C$) out of consideration. When the iterative process approaches the end, only the variables in a small set $A$, which we call the active set, are allowed to move, according to Theorem 5 in Fan et al. (2005). After shrinking, the decomposition method works on a smaller problem:
$$\min_{\alpha_A}\; \frac{1}{2}\alpha_A^T Q_{AA}\alpha_A - \bigl(e_A - Q_{AN}\alpha_N^k\bigr)^T\alpha_A \quad \text{subject to} \quad \begin{cases} y_A^T\alpha_A = -y_N^T\alpha_N^k,\\ 0 \leq (\alpha_A)_i \leq C, \; i = 1, \ldots, |A|, \end{cases} \tag{20}$$
where $N$ denotes the set of shrunken (inactive) indices and $\alpha_N^k$ their current values.
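As a small sketch of how the data of subproblem (20) could be assembled from the full problem (the index handling and function name are mine; $A$ is given as an integer index array, and the full matrix $Q$ and current $\alpha$ are assumed available):

import numpy as np

def reduced_subproblem(Q, y, alpha, A):
    # N holds the shrunken indices; A is the active set.
    N = np.setdiff1d(np.arange(len(y)), A)
    Q_AA = Q[np.ix_(A, A)]
    lin  = np.ones(len(A)) - Q[np.ix_(A, N)] @ alpha[N]   # e_A - Q_AN alpha_N^k
    rhs  = -y[N] @ alpha[N]                                # right-hand side of y_A^T alpha_A = ...
    return Q_AA, lin, rhs

# (20) then reads: minimize 0.5 * a_A^T Q_AA a_A - lin^T a_A
# subject to y_A^T a_A = rhs and 0 <= (a_A)_i <= C.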
Define
$$I_{\mathrm{up}}(\alpha) \equiv \{\, i \mid \alpha_i < C,\ y_i = 1 \ \text{or}\ \alpha_i > 0,\ y_i = -1 \,\}, \qquad I_{\mathrm{low}}(\alpha) \equiv \{\, i \mid \alpha_i < C,\ y_i = -1 \ \text{or}\ \alpha_i > 0,\ y_i = 1 \,\}; \tag{21}$$
$$m(\alpha) \equiv \max_{i \in I_{\mathrm{up}}(\alpha)} -y_i \nabla g(\alpha)_i, \qquad M(\alpha) \equiv \min_{i \in I_{\mathrm{low}}(\alpha)} -y_i \nabla g(\alpha)_i. \tag{22}$$
The KKT conditions of (19) are equivalent to
$$m(\alpha) \leq M(\alpha), \tag{23}$$
and in practice the iterations are stopped once
$$m(\alpha) - M(\alpha) \leq tol, \tag{24}$$
where tol is a small positive value which indicates that the KKT conditions are obeyed within tol. In PKSVM, $tol = 10^{-3}$ by default.
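A small sketch (my own, written directly against the dual (19), with the gradient computed from scratch rather than maintained incrementally as a real solver would) of the quantities in (21)-(22) and the stopping test (24):

import numpy as np

def m_and_M(Q, y, alpha, C):
    grad = Q @ alpha - np.ones(len(y))     # gradient of g(alpha) in (19)
    score = -y * grad                      # the values -y_i * grad_i compared in (22)
    I_up  = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    I_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    return score[I_up].max(), score[I_low].min()

def converged(Q, y, alpha, C, tol=1e-3):
    m, M = m_and_M(Q, y, alpha, C)
    return m - M <= tol                    # stopping test (24)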
The following shrinking procedure is taken from LIBSVM, which is one of the most popular SVM software packages at present. More details are available in Chang and Lin (2001).
1. Some bounded variables are shrunk after every min(n, 1000) iterations. Since the KKT conditions are not yet satisfied within tol during the iterative process, (24) is not obeyed, that is,
$$m(\alpha) > M(\alpha). \tag{25}$$
2. The previous shrinking strategy may fail, and many iterations are spent on obtaining the final digits of the required accuracy; we do not want these iterations to be wasted on solving a wrongly shrunken subproblem (20). Thus, once the iteration reaches the tolerance
$$m(\alpha) \leq M(\alpha) + 10\,tol, \tag{27}$$
we reconstruct the whole gradient $\nabla g(\alpha)$. After reconstruction, we shrink some bounded variables based on the same rule as in step 1, and the iterations continue.
The other useful technique for saving computational time, also due to Joachims (1998), is called caching. To illustrate why caching is necessary, some analysis of the computational complexity is presented first.
6.2
Computational complexity
Most of the time in each iteration is spent on the kernel evaluations needed to compute the q rows of Q, where q depends on the decomposition method; for SMO, q = 2. This step has a time complexity of O(npq), where p is the dimension of the input vectors. Using the stored rows of Q, updating the gradient is done in time O(nq). Setting up the QP subproblem requires O(nq) as well. The selection of the next working set, which makes use of the gradient, can also be done in O(nq).
6.3
Caching
As shown in the last subsection, the most expensive step in each iteration is the kernel evaluations needed to compute the q rows of Q. Near the end of the iterations, the eventual support vectors enter the working set multiple times. To avoid recomputing these rows of Q, caching is useful for reducing the computational cost.
Since Q is fully dense and may not fit completely into main memory, a special storage scheme based on the idea of a cache is usually used to store the recently used entries $Q_{ij}$.
Just as in SVMlight and LIBSVM, a simple least-recently-used (LRU) caching strategy is implemented in PKSVM. When the cache does not have enough room for a new row, the row that has not been used for the greatest number of iterations is evicted from the cache.
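A minimal sketch of such a least-recently-used cache for rows of $Q$, using Python's OrderedDict (the class and its interface are mine, for illustration only; the actual cache code in PKSVM and LIBSVM differs):

from collections import OrderedDict

class KernelRowCache:
    # Keep at most `capacity` recently used rows of Q; evict the
    # least-recently-used row when a new one does not fit.
    def __init__(self, compute_row, capacity):
        self.compute_row = compute_row     # function: index i -> full row Q[i, :]
        self.capacity = capacity
        self.rows = OrderedDict()

    def get(self, i):
        if i in self.rows:
            self.rows.move_to_end(i)       # row i becomes the most recently used
            return self.rows[i]
        row = self.compute_row(i)          # cache miss: do the kernel evaluations
        if len(self.rows) >= self.capacity:
            self.rows.popitem(last=False)  # evict the least-recently-used row
        self.rows[i] = row
        return row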
Conclusions
This article presents a brief summary of Support Vector Machines for classification problems. It also covers most of the state-of-the-art techniques that make SVMs practical for large-scale problems, such as shrinking and caching. Some other useful methods, e.g., working set selection, have had to be skipped because of space restrictions.
SVM has grown into a large family thanks to thousands of excellent research papers in the last ten years, and it has become one of the most powerful and popular tools in machine learning. Although this article is dedicated to C-SVC, the variants $\nu$-SVC, $\epsilon$-SVR, $\nu$-SVR and some other generalizations of SVM share most of the techniques mentioned here. The differences between them are small, except that the dual optimization problems differ in form. Thus one can easily find more details about them in many textbooks and papers if necessary.
References
[1] Naiyang Deng and Yingjie Tian, A New Method in Data Mining: Support Vector Machine, Science Press, 2004.
[2] Nils J. Nilsson, Learning machines: Foundations of Trainable Pattern
Classifying Systems, McGraw-Hill, 1965.
[3] Vladimir N. Vapnik and A. Lerner, Pattern recognition using generalized
portrait method, Automation and Remote Control, 24: 774-780, 1963.
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, A
training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational
Learning Theory, pages 144-152, Pittsburgh, PA, July 1992, ACM Press.
[5] Corinna Cortes and Vladimir N. Vapnik, Support-vector networks, Machine Learning, 20: 273-297, 1995.