
Clustering: Partitioning Methods

Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Prof. Pier Luca Lanzi



How Do Partitioning Methods Work?

Given n objects and k clusters, find a partition into k clusters that minimizes a given score.

Each of the k clusters is usually identified by its centroid C_m, where m \in \{1, \ldots, k\} is the cluster identifier.

The sum of squares is a rather typical score for partitioning methods:

\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2

Finding the global optimum would require exhaustively enumerating all partitions, so heuristic methods are always used in practice (k-means and k-medoids).
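A minimal NumPy sketch of this score (the names partition_score, X, labels, and centroids are illustrative, not from the slides):

```python
import numpy as np

def partition_score(X, labels, centroids):
    """Sum of squared distances of each object to the centroid of its cluster.

    X         : (n, d) array of numeric objects
    labels    : (n,) array of cluster identifiers in {0, ..., k-1}
    centroids : (k, d) array with one centroid per cluster
    """
    diffs = X - centroids[labels]        # difference of every object to its own centroid
    return float(np.sum(diffs ** 2))     # sum of squares over all objects

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == m].mean(axis=0) for m in range(2)])
print(partition_score(X, labels, centroids))   # small value, since the clusters are tight
```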


K-means Example (step 1)

Pick 3 initial cluster centers C1, C2, C3 (randomly).
[figure: data points in the X-Y plane with the three initial centers]

K-means Example (step 2)

Assign each point to the closest cluster center.
[figure: points assigned to C1, C2, C3]

K-means Example (step 3)

Move each cluster center to the mean of its cluster.
[figure: old and new positions of C1, C2, C3]

K-means Example (step 4)

Reassign the points that are closest to a different new cluster center.
[figure: updated assignments for C1, C2, C3]

K-means Example (step 5)

Re-compute the cluster means.
[figure: C1, C2, C3 moved to the new means]

K-means Example (step 6)

Re-compute the cluster means.
[figure: final positions of C1, C2, C3]

K-means Clustering

Works with numeric data.

Pick a number K of cluster centers (at random)
Assign every item to its nearest cluster center (e.g., using Euclidean distance)
Move each cluster center to the mean of its assigned items
Repeat the assignment and moving steps until convergence (change in cluster assignments less than a threshold)
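A minimal NumPy sketch of the algorithm just described (Lloyd's algorithm); the function name, the random initialization from data points, and the tolerance-based stopping rule are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Basic K-means on numeric data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # pick K centers at random
    for _ in range(max_iter):
        # Assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned items (keep empty clusters in place)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers barely move any more
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```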


Evaluating K-means Clusters

The most common measure is the sum of squared error (SSE).
For each point, the error is the distance to the nearest cluster centroid; to obtain the SSE, we square these errors and sum them:

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

where x is a data point in cluster C_i and m_i is the representative point (centroid) of cluster C_i.

Given two clusterings, we can choose the one with the smallest error.
One easy way to reduce the SSE is to increase K, the number of clusters; still, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
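A small sketch of this effect, assuming scikit-learn is available (its KMeans exposes the SSE as inertia_): the SSE can only decrease as K grows, even though a larger K is not necessarily a better clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

# SSE (inertia_) for increasing K: it drops sharply up to K = 3, then keeps shrinking slowly
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```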




Two Different K-means Clusterings

[figure: the same original points clustered in two ways, an optimal clustering and a sub-optimal clustering]

Importance of Choosing the Initial Centroids

[figure: a K-means run from one choice of initial centroids, iterations 1 to 6 overlaid]


Importance of Choosing the Initial Centroids

[figure: the same run shown iteration by iteration, iterations 1 to 6]


Importance of Choosing the Initial Centroids

[figure: a K-means run from a different choice of initial centroids, iterations 1 to 5 overlaid]


Importance of Choosing the Initial Centroids

[figure: the second run shown iteration by iteration, iterations 1 to 5]


Why is Selecting the Best Initial Centroids Difficult?

If there are K "real" clusters, the chance of selecting one initial centroid from each cluster is small, and it becomes relatively smaller as K grows.
For example, if the clusters all have the same size n and K = 10, then the probability is 10!/10^10 = 0.00036.
Sometimes the initial centroids will readjust themselves in the right way, and sometimes they don't.
Consider the following example with five pairs of clusters.
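The counting behind the 0.00036 figure, under the slide's assumption of K equally sized clusters of n points each; this approximate derivation is the standard textbook one and is not spelled out on the slide:

P(\text{one centroid from each cluster}) = \frac{\text{ways to pick one point per cluster}}{\text{ways to pick } K \text{ of the } Kn \text{ points}} \approx \frac{K!\, n^{K}}{(Kn)^{K}} = \frac{K!}{K^{K}}, \qquad \frac{10!}{10^{10}} \approx 0.00036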


10 Clusters Example

[figure: iteration 1]
Starting with two initial centroids in one cluster of each pair of clusters.

10 Clusters Example

[figure: iterations 1 to 4]
Starting with two initial centroids in one cluster of each pair of clusters.

10 Clusters Example

[figure: iteration 1]
Starting with some pairs of clusters having three initial centroids, while others have only one.

10 Clusters Example

[figure: iterations 1 to 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.

Dealing with the Initial Centroids Issue

Multiple runs help, but probability is not on your side (see the sketch below)
Sample the data and use another clustering method (hierarchical?) to determine the initial centroids
Select more than k initial centroids and then select among these initial centroids
Postprocessing
Bisecting K-means, which is not as susceptible to initialization issues
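A minimal sketch of the multiple-runs idea, assuming scikit-learn is available: run K-means from several random initializations and keep the result with the lowest SSE (scikit-learn's n_init parameter implements the same restart strategy internally).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])

best = None
for seed in range(20):                                   # 20 restarts from random centroids
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:      # keep the lowest-SSE run
        best = km
print(best.inertia_)
```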


Updating Centers Incrementally

In the basic K-means algorithm, centroids are updated after all points have been assigned to a centroid.
An alternative is to update the centroids after each assignment (incremental approach, sketched below):
Each assignment updates zero or two centroids
It is more expensive
It introduces an order dependency
It never produces an empty cluster
Weights can be used to change the impact of each point
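A minimal sketch of one incremental pass, with illustrative names; only the centroid a point leaves and the centroid it joins are recomputed, and the result depends on the order in which the points are visited:

```python
import numpy as np

def incremental_pass(X, labels, k):
    """One pass of incremental K-means over X (shape (n, d)), starting from
    a labeling in which every cluster 0..k-1 is non-empty."""
    counts = np.array([(labels == j).sum() for j in range(k)], dtype=float)
    sums = np.array([X[labels == j].sum(axis=0) for j in range(k)])
    centers = sums / counts[:, None]
    for i, x in enumerate(X):                      # order dependency: points visited in sequence
        old = labels[i]
        new = int(np.linalg.norm(centers - x, axis=1).argmin())
        if new == old or counts[old] == 1:         # zero centroids change; singletons stay put
            continue
        sums[old] -= x; counts[old] -= 1           # update the centroid the point leaves ...
        sums[new] += x; counts[new] += 1           # ... and the centroid it joins
        centers[old] = sums[old] / counts[old]
        centers[new] = sums[new] / counts[new]
        labels[i] = new
    return labels, centers
```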



Pre-processing and Post-processing

Pre-processing
Normalize the data
Eliminate outliers

Post-processing (a small sketch follows this list)
Eliminate small clusters that may represent outliers
Split "loose" clusters, i.e., clusters with relatively high SSE
Merge clusters that are close and that have relatively low SSE

These steps can also be used during the clustering process.
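A small sketch of the first post-processing step (eliminating small clusters that may represent outliers); the min_size threshold and the reassignment of orphaned points to the nearest surviving centroid are illustrative choices:

```python
import numpy as np

def drop_small_clusters(X, labels, centers, min_size=5):
    """Remove clusters with fewer than min_size points and reassign their
    points to the nearest remaining centroid."""
    sizes = np.bincount(labels, minlength=len(centers))
    keep = np.flatnonzero(sizes >= min_size)
    if keep.size == 0:
        return labels, centers                       # nothing large enough to keep
    kept_centers = centers[keep]
    dists = np.linalg.norm(X[:, None, :] - kept_centers[None, :, :], axis=2)
    return dists.argmin(axis=1), kept_centers        # labels now index the kept centroids
```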




Bisecting K-means

A variant of K-means that can produce either a partitional or a hierarchical clustering (see the sketch below).
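A minimal sketch of the usual bisecting strategy, which the slide does not detail: start with all points in one cluster and repeatedly split one cluster with 2-means; picking the cluster with the largest SSE is one common selection rule, and scikit-learn is assumed for the 2-means step.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Repeatedly split the cluster with the largest SSE until k clusters remain."""
    clusters = [np.arange(len(X))]                       # start with every point in one cluster
    while len(clusters) < k:
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))          # pick the "loosest" cluster
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]
    labels = np.empty(len(X), dtype=int)
    for j, idx in enumerate(clusters):
        labels[idx] = j
    return labels
```

The intermediate splits also define a hierarchy over the clusters, which is why the method can be read either as a partitional or as a hierarchical clustering.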




Bisecting K-means Example

[figure: a sequence of bisecting K-means steps]


Limitations of K-means

K-means has problems when clusters have differing
sizes
densities
non-globular shapes

K-means also has problems when the data contains outliers.



Limitations of K-means: Differing Sizes

[figure: original points vs. K-means with 3 clusters]


Limitations of K-means: Differing Density

[figure: original points vs. K-means with 3 clusters]


Limitations of K-means: Non-globular Shapes

[figure: original points vs. K-means with 2 clusters]


Overcoming K-means Limitations

[figure: original points vs. K-means clusters]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together afterwards.

Overcoming K-means Limitations

[figure: original points vs. K-means clusters]


Overcoming K-means Limitations

[figure: original points vs. K-means clusters]


K-Means Clustering Summary

Strengths
Relatively efficient
Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weaknesses
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes


K-Means Clustering Summary

Advantages
Simple, understandable
Items are automatically assigned to clusters

Disadvantages
Must pick the number of clusters beforehand
All items are forced into a cluster
Too sensitive to outliers



Variations of the K-Means Method

A few variants of k-means differ in
the selection of the initial k means
the dissimilarity calculations
the strategies used to calculate cluster means

Handling categorical data: k-modes (a small sketch follows this list)
Replaces the means of clusters with modes
Uses new dissimilarity measures to deal with categorical objects
Uses a frequency-based method to update the modes of clusters
Handling a mixture of categorical and numerical data: the k-prototype method
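A very small sketch of the k-modes ideas listed above, with illustrative names: simple-matching dissimilarity (the count of differing attributes) replaces Euclidean distance, and per-attribute most frequent categories replace means.

```python
import numpy as np

def matching_dissimilarity(mode, X):
    """Number of attributes on which each categorical object of X differs from mode."""
    return (X != mode).sum(axis=1)

def kmodes(X, k, max_iter=20, seed=0):
    """Tiny k-modes sketch: modes play the role of means."""
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        dists = np.stack([matching_dissimilarity(m, X) for m in modes], axis=1)
        labels = dists.argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # frequency-based update: per attribute, keep the most frequent category
                new_modes[j] = [max(set(col), key=list(col).count) for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes

X = np.array([["red", "small"], ["red", "medium"], ["blue", "large"], ["blue", "large"]])
print(kmodes(X, 2))
```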


K-Medoids Clustering

Instead of the mean, use the median of each cluster
The mean of 1, 3, 5, 7, 9 is 5
The mean of 1, 3, 5, 7, 1009 is 205
The median of 1, 3, 5, 7, 1009 is 5
The median is not affected by extreme values
For large databases, use sampling



K-Medoids Clustering

Find representative objects, called medoids, in clusters (a small sketch follows this list).

PAM (Partitioning Around Medoids, 1987)
starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well to large data sets

CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): randomized sampling
Focusing + spatial data structure (Ester et al., 1995)
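A small, greedy sketch of the PAM swap step described above (the full PAM algorithm evaluates swaps more systematically); D is assumed to be a precomputed pairwise distance matrix, and the helper names are illustrative.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all points of the distance to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Greedy PAM sketch: swap a medoid with a non-medoid whenever it lowers the total cost."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                if total_cost(D, trial) < total_cost(D, medoids):
                    medoids, improved = trial, True
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

# Usage with Euclidean distances between random 2-D points
X = np.random.default_rng(1).normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(pam(D, 3))
```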

