On Spectral Clustering: Analysis and an algorithm

Abstract
Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.
1 Introduction

The task of finding good clusters has been the focus of considerable research in machine learning and pattern recognition. For clustering points in $\mathbb{R}^n$, a main ap-
2 Algorithm

Given a set of points $S = \{s_1, \ldots, s_n\}$ that we want to cluster into $k$ subsets:

1. Form the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by $A_{ij} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2)$ if $i \neq j$, and $A_{ii} = 0$.
2. Define $D$ to be the diagonal matrix whose $(i, i)$-element is the sum of $A$'s $i$-th row, and construct the matrix $L = D^{-1/2} A D^{-1/2}$.¹
3. Find $x_1, x_2, \ldots, x_k$, the $k$ largest eigenvectors of $L$ (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix $X = [x_1\, x_2\, \ldots\, x_k] \in \mathbb{R}^{n \times k}$ by stacking the eigenvectors in columns.
4. Form the matrix $Y$ from $X$ by renormalizing each of $X$'s rows to have unit length (i.e. $Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$).
5. Treating each row of $Y$ as a point in $\mathbb{R}^k$, cluster them into $k$ clusters via K-means or any other algorithm (that attempts to minimize distortion).
6. Finally, assign the original point $s_i$ to cluster $j$ if and only if row $i$ of the matrix $Y$ was assigned to cluster $j$.
Here, the scaling parameter $\sigma^2$ controls how rapidly the affinity $A_{ij}$ falls off with the distance between $s_i$ and $s_j$, and we will later describe a method for choosing it automatically. We also note that this is only one of a large family of possible algorithms, and later discuss some related methods (e.g., [6]).
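The algorithm really is a few lines in practice. The following is a minimal NumPy sketch of steps 1-6 (the function name, and the use of scikit-learn's KMeans for step 5, are our own illustrative choices, not part of the paper):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_cluster(S, k, sigma):
    """Cluster the rows of S (an n x d array) into k clusters."""
    # Step 1: affinity matrix A with A_ii = 0.
    A = np.exp(-squareform(pdist(S, 'sqeuclidean')) / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    # Step 3: the k largest eigenvectors of the symmetric matrix L.
    _, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    X = eigvecs[:, -k:]
    # Step 4: renormalize each row of X to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Steps 5-6: K-means on the rows of Y; row i's label is s_i's cluster.
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```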
At first sight, this algorithm seems to make little sense. Since we run K-means in step 5, why not just apply K-means directly to the data? Figure 1e shows an example. The natural clusters in $\mathbb{R}^2$ do not correspond to convex regions, and K-means run directly finds the unsatisfactory clustering in Figure 1i. But once we map the points to $\mathbb{R}^k$ ($Y$'s rows), they form tight clusters (Figure 1h) from which our method obtains the good clustering shown in Figure 1e. We note that the clusters in Figure 1h lie at $90^\circ$ to each other relative to the origin (cf. [8]).
¹Readers familiar with spectral graph theory [3] may be more familiar with the Laplacian $I - L$. But as replacing $L$ with $I - L$ would complicate our later discussion, and only changes the eigenvalues (from $\lambda_i$ to $1 - \lambda_i$) and not the eigenvectors, we instead use $L$.
3 Analysis of algorithm

3.1 Informal discussion: The "ideal" case

To understand the algorithm, it is instructive to consider its behavior in the "ideal" case in which all points in different clusters are infinitely far apart. For the sake of discussion, suppose that $k = 3$, and that the three clusters of sizes $n_1$, $n_2$ and $n_3$ are $S_1$, $S_2$, and $S_3$ ($S = S_1 \cup S_2 \cup S_3$, $n = n_1 + n_2 + n_3$). To simplify our exposition, also assume that the points in $S = \{s_1, \ldots, s_n\}$ are ordered according to which cluster they are in, so that the first $n_1$ points are in cluster $S_1$, the next $n_2$ in $S_2$, and the last $n_3$ in $S_3$. In this ideal case the inter-cluster affinities $A_{ij}$ are all zero; let $\hat{A}$ denote this ideal affinity matrix, and define $\hat{D}$ and $\hat{L}$ as in the previous algorithm, but starting with $\hat{A}$ instead of $A$. Note that $\hat{A}$ and $\hat{L}$ are therefore block-diagonal:
$$\hat{A} = \begin{bmatrix} A^{(11)} & 0 & 0 \\ 0 & A^{(22)} & 0 \\ 0 & 0 & A^{(33)} \end{bmatrix}, \qquad \hat{L} = \begin{bmatrix} \hat{L}^{(11)} & 0 & 0 \\ 0 & \hat{L}^{(22)} & 0 \\ 0 & 0 & \hat{L}^{(33)} \end{bmatrix} \qquad (1)$$
where we have adopted the convention of using parenthesized superscripts to index into subblocks of vectors/matrices, and $\hat{L}^{(ii)} = (\hat{D}^{(ii)})^{-1/2} A^{(ii)} (\hat{D}^{(ii)})^{-1/2}$. Here, $\hat{A}^{(ii)} = A^{(ii)} \in \mathbb{R}^{n_i \times n_i}$ is the matrix of "intra-cluster" affinities for cluster $i$. For future use, also define $\hat{d}^{(i)} \in \mathbb{R}^{n_i}$ to be the vector containing $\hat{D}^{(ii)}$'s diagonal elements. Each block $\hat{L}^{(ii)}$ has an eigenvalue of 1. Also, since $A^{(ii)}_{jk} > 0$ ($j \neq k$), the next eigenvalue is strictly less than 1. (See, e.g., [3].) It follows that, choosing the eigenvectors appropriately, there are $k$ orthonormal vectors $r_1, \ldots, r_k$ so that the rows of $\hat{Y}$ satisfy

$$\hat{y}^{(i)}_j = r_i \qquad (4)$$

for all $i = 1, \ldots, k$; $j = 1, \ldots, n_i$, where $\hat{y}^{(i)}_j$ denotes the $j$-th row of the $i$-th subblock of $\hat{Y}$.
In other words, there are $k$ mutually orthogonal points on the surface of the unit $k$-sphere around which $\hat{Y}$'s rows will cluster. Moreover, these clusters correspond exactly to the true clustering of the original data.
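This conclusion is easy to verify numerically. Below is a small sketch (the cluster sizes and affinity values are arbitrary illustrations) that builds a block-diagonal $\hat{A}$ directly and checks that, within each cluster, the rows of $\hat{Y}$ coincide, while rows from different clusters are orthogonal:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
sizes = [4, 3, 5]                         # n1, n2, n3 (arbitrary)
k = len(sizes)

# "Ideal" affinity matrix: symmetric positive blocks, zero elsewhere.
blocks = []
for ni in sizes:
    B = rng.uniform(0.5, 1.0, size=(ni, ni))
    B = (B + B.T) / 2
    np.fill_diagonal(B, 0.0)
    blocks.append(B)
A_hat = block_diag(*blocks)

d = A_hat.sum(axis=1)
L_hat = A_hat / np.sqrt(np.outer(d, d))   # D^{-1/2} A D^{-1/2}

X = np.linalg.eigh(L_hat)[1][:, -k:]      # k largest eigenvectors
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# Y Y^T is 1 within clusters (identical rows) and 0 across (orthogonal rows).
print(np.round(Y @ Y.T, 6))
```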
3.2 The general case

In the general case, $A$'s off-diagonal blocks are non-zero, but we still hope to recover guarantees similar to Proposition 1. Viewing $E = A - \hat{A}$ as a perturbation to the "ideal" $\hat{A}$ that results in $A = \hat{A} + E$, we ask: When can we expect the resulting rows of $Y$ to cluster similarly to the rows of $\hat{Y}$? Specifically, when will the eigenvectors of $L$, which we now view as a perturbed version of $\hat{L}$, be "close" to those of $\hat{L}$?
Matrix perturbation theory [10] indicates that the stability of the eigenvectors of a matrix is determined by the eigengap. More precisely, the subspace spanned by $\hat{L}$'s first 3 eigenvectors will be stable to small changes to $\hat{L}$ if and only if the eigengap $\delta = |\lambda_3 - \lambda_4|$, the difference between the 3rd and 4th eigenvalues of $\hat{L}$, is large. As discussed previously, the set of eigenvalues of $\hat{L}$ is the union of the eigenvalues of $\hat{L}^{(11)}$, $\hat{L}^{(22)}$, and $\hat{L}^{(33)}$, and $\lambda_3 = 1$. Letting $\lambda_j^{(i)}$ be the $j$-th largest eigenvalue of $\hat{L}^{(ii)}$, we therefore see that $\lambda_4 = \max_i \lambda_2^{(i)}$. Hence, the assumption that $|\lambda_3 - \lambda_4|$ be large is exactly the assumption that $\max_i \lambda_2^{(i)}$ be bounded away from 1.
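As a quick numerical check of this reduction (a sketch with arbitrary block sizes and affinities): for a block-diagonal $\hat{L}$, the eigengap $\delta$ computed directly agrees with $1 - \max_i \lambda_2^{(i)}$ computed block by block:

```python
import numpy as np
from scipy.linalg import block_diag

def normalize(A):
    # D^{-1/2} A D^{-1/2} for an affinity matrix with zero diagonal.
    d = A.sum(axis=1)
    return A / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(1)
blocks = []
for ni in (5, 6, 4):                                  # three clusters
    B = rng.uniform(0.2, 1.0, size=(ni, ni))
    B = (B + B.T) / 2
    np.fill_diagonal(B, 0.0)
    blocks.append(B)

lam = np.sort(np.linalg.eigvalsh(normalize(block_diag(*blocks))))[::-1]
delta = lam[2] - lam[3]                               # |lambda_3 - lambda_4|

second = max(np.sort(np.linalg.eigvalsh(normalize(B)))[-2] for B in blocks)
print(np.isclose(delta, 1.0 - second))                # True
```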
Note that $\lambda_2^{(i)}$ depends only on $\hat{L}^{(ii)}$, which in turn depends only on $A^{(ii)} = \hat{A}^{(ii)}$, the matrix of intra-cluster similarities for cluster $S_i$. The assumption on $\lambda_2^{(i)}$ thus has a natural interpretation in terms of how internally cohesive each cluster is. For instance, if the data really contained four well-separated clusters and we ran the algorithm with $k = 3$, it would be unreasonable to expect an algorithm to correctly guess what partition of the four clusters into three subsets we had in mind.
This connection between the eigengap and the cohesiveness of the individual clusters can be formalized in a number of ways.
Assumption A1.1. Define the Cheeger constant [3] of the cluster $S_i$ to be

$$h(S_i) = \min_I \frac{\sum_{j \in I,\, k \notin I} \hat{A}^{(ii)}_{jk}}{\min\left\{\sum_{j \in I} \hat{d}^{(i)}_j,\ \sum_{k \notin I} \hat{d}^{(i)}_k\right\}} \qquad (5)$$

where the outer minimum is over all index subsets $I \subseteq \{1, \ldots, n_i\}$. Assume that there exists $\delta > 0$ so that $(h(S_i))^2 / 2 \geq \delta$ for all $i$.²

²This condition is satisfied by $\hat{A}^{(ii)}_{jk} > 0$ ($j \neq k$), which is true in our case.
A standard result in spectral graph theory shows that Assumption A1.1 implies Assumption A1. Recall that $\hat{d}^{(i)}_j = \sum_k A^{(ii)}_{jk}$ characterizes how "well connected" or how "similar" point $j$ is to the other points in the same cluster. The term in the $\min_I\{\cdot\}$ characterizes how well $(I, \bar{I})$ partitions $S_i$ into two subsets, and the minimum over $I$ picks out the best such partition. Specifically, if there is a partition of $S_i$'s points so that the weight of the edges across the partition is small, and so that each of the partitions has moderately large "volume" (sum of $\hat{d}^{(i)}_j$'s), then the Cheeger constant will be small. Thus, the assumption that the Cheeger constants $h(S_i)$ be large is exactly that the clusters $S_i$ be hard to split into two subsets.
We can also relate the eigengap to the mixing time of a random walk (as in [6]) defined on the points of a cluster, in which the chance of transitioning from point $i$ to $j$ is proportional to $A_{ij}$, so that we tend to jump to nearby points. Assumption A1 is equivalent to assuming that, for such a walk defined on the points of any one of the clusters $S_i$, the corresponding transition matrix has second eigenvalue at most $1 - \delta$. The mixing time of a random walk is governed by the second eigenvalue; thus, this assumption is exactly that the walks mix rapidly. Intuitively, this will be true for tight (or at least fairly "well connected") clusters, and untrue if a cluster consists of two well-separated sets of points so that the random walk takes a long time to transition from one half of the cluster to the other. Assumption A1 can also be related to the existence of multiple paths between any two points in the same cluster.
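To make the random-walk reading concrete, here is a sketch (the two toy point sets are our own): the second eigenvalue of the within-cluster transition matrix $P = D^{-1}A$ is small for a cohesive cluster and close to 1 for a cluster made of two well-separated halves:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def second_eigenvalue(points, sigma):
    # Transition matrix of the walk: P_ij proportional to A_ij.
    A = np.exp(-squareform(pdist(points, 'sqeuclidean')) / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    P = A / A.sum(axis=1, keepdims=True)
    # P = D^{-1} A is similar to the symmetric D^{-1/2} A D^{-1/2},
    # so its eigenvalues are real.
    return np.sort(np.linalg.eigvals(P).real)[-2]

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(40, 2))                # one cohesive cluster
split = np.vstack([tight, rng.normal(5.0, 0.1, (40, 2))]) # two distant halves

print(second_eigenvalue(tight, 0.3))   # well below 1: the walk mixes rapidly
print(second_eigenvalue(split, 0.3))   # close to 1: mixing is very slow
```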
Assumption A2. There is some fixed $\epsilon_1 > 0$, so that for every $i_1, i_2 \in \{1, \ldots, k\}$, $i_1 \neq i_2$, we have that

$$\sum_{j \in S_{i_1}} \sum_{k \in S_{i_2}} \frac{A_{jk}^2}{\hat{d}_j \hat{d}_k} \leq \epsilon_1. \qquad (6)$$
To gain intuition about this, consider the case of two "dense" clusters $i_1$ and $i_2$ of size $\Omega(n)$ each. Since $\hat{d}_j$ measures how "connected" point $j$ is to other points in the same cluster, each $\hat{d}_j$ is $\Omega(n)$ here, so (6) is a demand that the inter-cluster affinities $A_{jk}$ be small relative to this intra-cluster connectivity. The quantity $\sum_{k \notin S_i} A_{jk}$, in contrast, measures how connected $s_j$ is to points in other clusters. The next assumption is that all points must be more connected to points in the same cluster than to points in other clusters; specifically, that the ratio between these two quantities be small.
Assumption A3. For some fixed $\epsilon_2 > 0$, for every $i = 1, \ldots, k$, $j \in S_i$, we have:

$$\frac{\sum_{k: k \notin S_i} A_{jk}}{\hat{d}_j} \leq \epsilon_2 \left( \sum_{k, l \in S_i} \frac{\hat{A}_{kl}^2}{\hat{d}_k \hat{d}_l} \right)^{1/2}. \qquad (7)$$
For intuition about this assumption, again consider the case of densely connected clusters (as we did previously). Here, the quantity in parentheses on the right hand side is $O(1)$, so this becomes equivalent to demanding that the following ratio be small: $(\sum_{k: k \notin S_i} A_{jk})/\hat{d}_j = (\sum_{k: k \notin S_i} A_{jk})/(\sum_{k: k \in S_i} A_{jk}) = O(\epsilon_2)$.
Assumption A4. There is some fixed $C > 0$ so that for every $i = 1, \ldots, k$, $j = 1, \ldots, n_i$, we have $\hat{d}^{(i)}_j \geq (\sum_{k=1}^{n_i} \hat{d}^{(i)}_k) / (C n_i)$.

This last assumption is a fairly benign one that no points in a cluster be "too much less" connected than other points in the same cluster.
Theorem 2. Let Assumptions A1, A2, A3 and A4 hold. Set $\epsilon = \sqrt{k(k-1)}\,\epsilon_1 + k\epsilon_2^2$. If $\delta > (2 + \sqrt{2})\epsilon$, then there exist $k$ orthogonal vectors $r_1, \ldots, r_k$ ($r_i^T r_j = 1$ if $i = j$, 0 otherwise) so that $Y$'s rows satisfy

$$\frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left\| y^{(i)}_j - r_i \right\|_2^2 \leq 4C(4 + 2\sqrt{k})^2 \frac{\epsilon^2}{(\delta - \sqrt{2}\epsilon)^2}. \qquad (8)$$

Thus, the rows of $Y$ will form tight clusters around $k$ well-separated points (at $90^\circ$ from each other) on the surface of the $k$-sphere according to their "true" cluster $S_i$.
4 Experiments

To test our algorithm, we applied it to seven clustering problems. Note that whereas $\sigma^2$ was previously described as a human-specified parameter, the analysis also suggests a particularly simple way of choosing it automatically: For the right $\sigma^2$, Theorem 2 predicts that the rows of $Y$ will form $k$ "tight" clusters on the surface of the $k$-sphere. Thus, we simply search over $\sigma^2$, and pick the value that, after clustering $Y$'s rows, gives the tightest (smallest distortion) clusters; a sketch of this search is given below. K-means in Step 5 of the algorithm was also inexpensively initialized using the prior knowledge that the clusters are about $90^\circ$ apart.³ The results of our algorithm are shown in Figure 1a-g. Giving the algorithm only the coordinates of the points and $k$, the different clusters found are shown in the Figure via the different symbols (and colors, where available). The results are surprisingly good: Even for clusters that do not form convex regions or that are not cleanly separated (such as in Figure 1g), the algorithm reliably finds clusterings consistent with what a human would have chosen.
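A minimal version of this search, under the same assumptions as the earlier sketch (the candidate grid and the use of K-means inertia as the distortion measure are our own choices):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def embed(S, k, sigma):
    # Steps 1-4: normalized affinity, top-k eigenvectors, unit-length rows.
    A = np.exp(-squareform(pdist(S, 'sqeuclidean')) / (2.0 * sigma**2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    X = np.linalg.eigh(A / np.sqrt(np.outer(d, d)))[1][:, -k:]
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def choose_sigma(S, k, candidates):
    # Pick the sigma whose embedding gives the tightest (smallest-distortion)
    # K-means clusters, measured by the K-means objective (inertia).
    return min(candidates,
               key=lambda s: KMeans(n_clusters=k, n_init=10)
                                 .fit(embed(S, k, s)).inertia_)

# e.g. sigma = choose_sigma(S, 3, np.linspace(0.1, 2.0, 20))
```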
We note that there are other, related algorithms that can give good results on a subset of these problems, but we are aware of no equally simple algorithm that can give results comparable to these. For example, we noted earlier how K-means easily fails when clusters do not correspond to convex regions (Figure 1i). Another alternative may be a simple "connected components" algorithm that, for a threshold $\theta$, draws an edge between points $s_i$ and $s_j$ whenever $\|s_i - s_j\|_2 \leq \theta$, and takes the resulting connected components to be the clusters. Here, $\theta$ is a parameter that can (say) be optimized to obtain the desired number of clusters $k$. The result of this algorithm on the threecircles-joined dataset with $k = 3$ is shown in Figure 1j. One of the "clusters" it found consists of a singleton point at $(1.5, 2)$. It is clear that this method is very non-robust.
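For reference, this baseline is a few lines with SciPy (a sketch; the threshold scan is illustrative):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def threshold_clusters(S, theta):
    # Edge between s_i and s_j whenever ||s_i - s_j||_2 <= theta; the
    # clusters are the connected components of the resulting graph.
    adjacency = squareform(pdist(S)) <= theta
    np.fill_diagonal(adjacency, False)
    return connected_components(adjacency, directed=False)

# Scan theta until exactly k components remain; the brittleness shows up as
# singleton components appearing or whole clusters fusing as theta moves.
# n_components, labels = threshold_clusters(S, theta)
```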
We also compare our method to the algorithm of Meila and Shi [6] (see Figure 1k). Their method is similar to ours, except for the seemingly cosmetic difference that they normalize $A$'s rows to sum to 1 and use its eigenvectors instead of $L$'s, and do not renormalize the rows of $X$ to unit length. A refinement of our analysis suggests that this method might be susceptible to bad clusterings when the degree to which different clusters are connected ($\sum_j \hat{d}^{(i)}_j$) varies substantially across clusters.
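In code, the difference is confined to the normalization steps; a sketch of their variant as we have just described it (our own phrasing, not their implementation):

```python
import numpy as np

def meila_shi_embedding(A, k):
    # Row-normalize A to a stochastic matrix P = D^{-1} A and use its top-k
    # eigenvectors directly, with no row renormalization afterwards.
    P = A / A.sum(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eig(P)     # eigenvalues are real: P is
    order = np.argsort(eigvals.real)[::-1]  # similar to a symmetric matrix
    return eigvecs[:, order[:k]].real
```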
³Briefly, we let the first cluster centroid be a randomly chosen row of $Y$, and then repeatedly choose as the next centroid the row of $Y$ that is closest to being $90^\circ$ from all the centroids (formally, from the worst-case centroid) already picked; a sketch of this initialization appears below. The resulting K-means was run only once (no restarts) to give the results presented. K-means with the more conventional random initialization and a small number of restarts also gave identical results. In contrast, our implementation of Meila and Shi's algorithm used 2000 restarts.
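Our reading of this initialization, as a sketch (the helper name is ours); the chosen rows can be passed to K-means via its `init` argument with a single run:

```python
import numpy as np

def orthogonal_init(Y, k, rng=None):
    # First centroid: a random row of Y. Then repeatedly add the row whose
    # worst-case |cosine| against the centroids picked so far is smallest,
    # i.e. the row closest to 90 degrees from all of them.
    rng = rng or np.random.default_rng()
    centroids = [Y[rng.integers(len(Y))]]
    for _ in range(k - 1):
        worst = np.max(np.abs(Y @ np.array(centroids).T), axis=1)
        centroids.append(Y[np.argmin(worst)])
    return np.array(centroids)

# e.g. KMeans(n_clusters=k, init=orthogonal_init(Y, k), n_init=1).fit(Y)
```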
[Figure 1 plots omitted. Recoverable panel titles: (a) nips, 8 clusters; (b) lineandballs, 3 clusters; (c) fourclouds, 2 clusters; (j) threecircles-joined, 3 clusters (connected components); (k) lineandballs, 3 clusters (Meila and Shi algorithm); (l) nips, 8 clusters (Kannan et al. algorithm).]
Figure 1: Clustering examples, with clusters indicated by different symbols (and colors, where available). (a-g) Results from our algorithm, where the only parameter varied across runs was $k$. (h) Rows of $Y$ (jittered, subsampled) for the twocircles dataset. (i) K-means. (j) A "connected components" algorithm. (k) Meila and Shi algorithm. (l) Kannan et al. Spectral Algorithm I. (See text.)
5 Discussion

There are some intriguing similarities between spectral clustering methods and Kernel PCA, which has been empirically observed to perform clustering [7, 2]. The main difference between the first steps of our algorithm and Kernel PCA with a Gaussian kernel is the normalization of $A$ (to form $L$) and of $X$. These normalizations do improve the performance of the algorithm, but it is also straightforward to extend our analysis to prove conditions under which Kernel PCA will indeed give a clustering.
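The correspondence can be seen with scikit-learn's KernelPCA, whose RBF kernel matches our affinity up to the diagonal; a sketch (the gamma setting is just the reparameterization $\gamma = 1/2\sigma^2$):

```python
from sklearn.decomposition import KernelPCA

# Kernel PCA with a Gaussian kernel: as in steps 1-3 of our algorithm, an
# eigendecomposition of a Gaussian similarity matrix, but with kernel
# centering in place of the D^{-1/2} A D^{-1/2} normalization and with no
# row renormalization of the eigenvector matrix.
sigma = 0.5
kpca = KernelPCA(n_components=3, kernel='rbf', gamma=1.0 / (2.0 * sigma**2))
# embedding = kpca.fit_transform(S)   # S: an (n, d) data array
```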
While different in detail, Kannan et al. [4] give an analysis of spectral clustering that also makes use of matrix perturbation theory, for the case of an affinity matrix with row sums equal to one. They also present a clustering algorithm based on $k$ singular vectors, one that differs from ours in that it identifies clusters with individual singular vectors. In our experiments, that algorithm very frequently gave poor results (e.g., Figure 1l).
Acknowledgments

We thank Marina Meila for helpful conversations about this work. We also thank Alice Zheng for helpful comments. A. Ng is supported by a Microsoft Research fellowship. This work was also supported by a grant from Intel Corporation, NSF grant IIS-9988642, and ONR MURI N00014-00-1-0637.
References

[1] C. Alpert, A. Kahng, and S. Yao. Spectral partitioning: The more eigenvectors, the better. Discrete Applied Math, 90:3-26, 1999.
[2] N. Christianini, J. Shawe-Taylor, and J. Kandola. Spectral kernel methods for clustering. In Neural Information Processing Systems 14, 2002.
[3] F. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[4] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.
[5] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. In Perceptual Organization for Artificial Vision Systems. Kluwer, 2000.
[6] M. Meila and J. Shi. Learning segmentation by random walks. In Neural Information Processing Systems 13, 2001.
[7] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[8] G. Scott and H. Longuet-Higgins. Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proc. British Machine Vision Conference, 1990.
[9] D. Spielman and S. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, 1996.
[10] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[11] Y. Weiss. Segmentation using eigenvectors: A unifying view. In International Conference on Computer Vision, 1999.