Adapted from Slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti
Prasad
L16FlatCluster
Clustering algorithms
Partitional (Flat)
Hierarchical (Tree)
What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects
Documents within a cluster should be similar. Documents from different clusters should be dissimilar.
A common and important task that finds many applications in IR and other places
How would you design an algorithm for finding the three clusters in this case?
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...
Applications of clustering in IR
Application               | What is clustered?       | Benefit
Search result clustering  | search results           | more effective information presentation to user
Scatter-Gather            | (subsets of) collection  | alternative user interface: search without typing
Collection clustering     | collection               | effective information presentation for exploratory browsing
Cluster-based retrieval   | collection               | higher efficiency: faster search
Yahoo! Hierarchy isn't clustering, but is the kind of output you want from clustering
[Yahoo! hierarchy at www.yahoo.com/Science (30): top-level categories such as agriculture, biology, physics, CS, space; subcategories such as dairy, crops, agronomy, botany, cell, evolution, forestry, magnetism, relativity, AI, courses, HCI, craft, missions]
Example: the query car will also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile.
clusty.com / Vivisimo
Clustering Algorithms
Partitional (Flat) algorithms
Usually start with a random (partial) partition, then refine it iteratively.
K-means clustering
(Model-based clustering)
Partitioning Algorithms
Partitioning method: construct a partition of n documents into a set of K clusters.
Given: a set of documents and the number K.
Find: a partition into K clusters that optimizes the chosen partitioning criterion.
Globally optimal: exhaustively enumerate all partitions.
Effective heuristic methods: K-means and K-medoids algorithms.
K-Means
Assumes documents are real-valued vectors.
Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster c.
Reassignment of instances to clusters is based on distance to the current cluster centroids. (One can equivalently phrase this in terms of similarities.)
μ(c) = (1/|c|) · Σ_{x ∈ c} x
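The centroid formula above is just a component-wise mean; a minimal sketch (vectors as plain Python lists, function name illustrative):

```python
def centroid(cluster):
    """mu(c) = (1/|c|) * sum of the vectors x in cluster c, element-wise."""
    n = len(cluster)
    dim = len(cluster[0])
    return [sum(vec[d] for vec in cluster) / n for d in range(dim)]

# Example: the centroid of three 2-d document vectors
print(centroid([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))  # [3.0, 4.0]
```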
K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj: sj = μ(cj)
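The pseudocode above can be turned into a short runnable sketch (plain lists instead of document vectors; function names are illustrative, and stopping uses the "partition unchanged" criterion):

```python
import random

def dist2(x, s):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - si) ** 2 for xi, si in zip(x, s))

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return [sum(v[d] for v in cluster) / n for d in range(len(cluster[0]))]

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(docs, k)              # K random docs as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in docs:                       # assign each doc to nearest seed
            j = min(range(k), key=lambda j: dist2(x, seeds[j]))
            clusters[j].append(x)
        new_seeds = [centroid(c) if c else seeds[j]   # keep old seed if empty
                     for j, c in enumerate(clusters)]
        if new_seeds == seeds:               # converged: centroids unchanged
            break
        seeds = new_seeds
    return seeds, clusters
```

On two well-separated groups the final seeds are the two group means regardless of which docs were sampled as initial seeds.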
K Means Example
(K=2)
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
Exercise: (i) guess the optimal clustering into two clusters in this case; (ii) compute the centroids of the clusters.
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don't change.
Convergence
Why should the K-means algorithm ever reach a fixed point?
A state in which clusters don't change.
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge. Number of iterations could be large.
Convergence of K-Means
Define goodness measure of cluster k as sum of squared distances from cluster centroid:
G_k = Σ_{d_i ∈ cluster k} (d_i − c_k)²
G = Σ_k G_k
Intuition: reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
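The goodness measure G (residual sum of squared distances) is straightforward to compute; a minimal sketch with a hypothetical helper name:

```python
def goodness(clusters, centroids):
    """G = sum_k G_k, where G_k sums |d_i - c_k|^2 over the d_i in cluster k."""
    return sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, c))  # squared distance to centroid
        for c, cluster in zip(centroids, clusters)
        for x in cluster
    )

# One cluster {(0,0), (0,2)} with centroid (0,1): G = 1 + 1 = 2
print(goodness([[[0.0, 0.0], [0.0, 2.0]]], [[0.0, 1.0]]))  # 2.0
```

Monitoring G across iterations is a practical way to observe the monotone decrease that guarantees convergence.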
Sec. 16.4
Convergence of K-Means
Recomputation monotonically decreases each G_k, since Σ (d_i − a)² (with m_k the number of members in cluster k) reaches its minimum where the derivative is zero:
Σ −2(d_i − a) = 0
Σ d_i = m_k · a
a = (1/m_k) Σ d_i = c_k
K-means typically converges quickly.
Time Complexity
Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
Reassigning clusters: O(Kn) distance computations, i.e. O(Knm).
Computing centroids: each doc is added once to some centroid: O(nm).
Assume these two steps are each done once for each of I iterations: O(IKnm).
Optimality of K-means
Convergence does not mean that we converge to the optimal clustering! This is the great weakness of K-means. If we start with a bad set of seeds, the resulting clustering can be horrible.
Seed Choice
Results can vary based on random seed selection. Some seeds can result in poor convergence rate, or convergence to suboptimal clusterings.
Select good seeds using a heuristic (e.g., doc least similar to any existing mean). Try out multiple starting points.
Example showing sensitivity to seeds
In the above, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}. If you start with D and F you converge to {A,B,D,E} {C,F}.
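The "doc least similar to any existing mean" heuristic can be sketched as a farthest-first selection (a deterministic simplification; k-means++ instead samples proportionally to squared distance):

```python
def dist2(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def farthest_first_seeds(docs, k):
    """Pick the first doc, then repeatedly pick the doc farthest
    (least similar) from every seed chosen so far."""
    seeds = [docs[0]]
    while len(seeds) < k:
        nxt = max(docs, key=lambda x: min(dist2(x, s) for s in seeds))
        seeds.append(nxt)
    return seeds

# Two tight groups near x=0 and x=10: the second seed lands in the far group
print(farthest_first_seeds([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]], 2))
```

This avoids the bad-seed scenario above (both seeds in one group), though it can be fooled by outliers, which by construction are far from every seed.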
Hierarchical Clustering
L17HierCluster
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
animal
  vertebrate: fish, reptile, amphib., mammal
  invertebrate: worm, insect, crustacean
Agglomerative (bottom-up):
Precondition: Start with each document as a separate cluster.
Postcondition: Eventually all documents belong to the same cluster.
Divisive (top-down):
Precondition: Start with all documents belonging to the same cluster. Postcondition: Eventually each document forms a cluster of its own.
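The agglomerative (bottom-up) scheme can be sketched as a naive merge loop; this illustration uses single-link distance and stops at a target number of clusters rather than building the full dendrogram (function names are illustrative):

```python
def dist2(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def single_link(ca, cb):
    """Single-link distance between clusters: the closest pair of points."""
    return min(dist2(a, b) for a in ca for b in cb)

def agglomerative(docs, target_k=1):
    """Start with singleton clusters; repeatedly merge the closest pair
    of clusters until only target_k clusters remain."""
    clusters = [[d] for d in docs]
    while len(clusters) > target_k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters
```

The naive pairwise search makes this O(n³) overall; real implementations cache inter-cluster distances to do better.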
[Dendrogram example over d1–d5: d1 and d2 merge; d4 and d5 merge, then join with d3; finally the clusters {d1,d2} and {d3,d4,d5} merge.]
[Figure: example points d3–d6 with their centroid; an outlier lies far from the centroid.]
Complete-link: similarity of the furthest points, the least cosine-similar.
Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
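The two inter-cluster similarity definitions above translate directly into code; a minimal sketch (helper names illustrative, clusters as lists of vectors):

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def complete_link_sim(ca, cb):
    """Complete-link: similarity of the least cosine-similar pair of points."""
    return min(cos_sim(a, b) for a in ca for b in cb)

def centroid_sim(ca, cb):
    """Centroid: cosine similarity of the two cluster centroids."""
    def centroid(c):
        n = len(c)
        return [sum(v[d] for v in c) / n for d in range(len(c[0]))]
    return cos_sim(centroid(ca), centroid(cb))
```

Complete-link judges clusters by their worst pair, so it resists the chaining effect that single-link exhibits.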