SUSHIL KULKARNI
Cluster Analysis
Similarity Measures:
– Euclidean distance if attributes are continuous.
– Other problem-specific measures.
Outliers
[Figure: a scatter of points forming a cluster, with a few outliers lying far from it.]
In some applications we are interested in discovering outliers, not clusters (outlier analysis).
Cluster Analysis
Data Structures
Data matrix (two modes): n tuples/objects by p attributes/dimensions, the "classic" data input:

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity or distance matrix (one mode): n objects by n objects, the desired data input to some clustering algorithms:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
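
As a concrete illustration (not from the slides), here is a minimal Python sketch, with hypothetical names, that derives the one-mode dissimilarity matrix from a two-mode data matrix using Euclidean distance:

import math

def dissimilarity_matrix(data):
    """Build the symmetric n x n dissimilarity matrix d(i, j)
    from an n x p data matrix, using Euclidean distance."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # only the lower triangle needs computing
            d[i][j] = math.sqrt(sum((a - b) ** 2
                                    for a, b in zip(data[i], data[j])))
            d[j][i] = d[i][j]  # mirror: d(i, j) = d(j, i)
    return d

# Example: 3 objects with 2 attributes each.
print(dissimilarity_matrix([[0, 0], [3, 4], [6, 8]]))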
Measuring Similarity in Clustering
Dissimilarity/similarity metric:
– d(i, j) ≥ 0 (non-negativity)
– d(i, i) = 0 (isolation)
– d(i, j) = d(j, i) (symmetry)
– d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
Types of data in cluster analysis
Interval-scaled variables
e.g., salary, height
Binary variables
e.g., gender (M/F), has_cancer(T/F)
Nominal (categorical) variables
e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Similarity and Dissimilarity Between Objects
Similarity and Dissimilarity Between Objects (Cont.)
Euclidean distance:

$$d(i,j) = \sqrt{\,|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{in}-x_{jn}|^2\,}$$
Properties
d(i,j) ≥0
d(i,i) =0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Also one can use a weighted distance:

$$d(i,j) = \sqrt{\,w_1|x_{i1}-x_{j1}|^2 + w_2|x_{i2}-x_{j2}|^2 + \cdots + w_n|x_{in}-x_{jn}|^2\,}$$
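
A minimal Python sketch of these two formulas (the function name and example points are mine); the asserts spot-check the metric properties listed above:

import math

def euclidean(x, y, weights=None):
    """(Weighted) Euclidean distance between objects x and y."""
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x, y)))

x, y, z = (1, 2), (4, 6), (0, 0)
assert euclidean(x, x) == 0                                  # isolation
assert euclidean(x, y) == euclidean(y, x)                    # symmetry
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)  # triangle inequality
print(euclidean(x, y))                  # 5.0
print(euclidean(x, y, weights=[2, 1]))  # weighted variant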
Binary Variables
A binary variable has two states: 0 (absent), 1 (present).
A contingency table for binary data:

              object j
              1      0      sum
object i  1   a      b      a+b
          0   c      d      c+d
        sum  a+c    b+d      p
Dissimilarity between Binary Variables

Name   Fever  Cough  Test-1  Test-2  Test-3  Test-4
Tina   1      0      1       0       0       0
Dina   1      0      1       0       1       0
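
The slide does not state which coefficient it applies to this table; two common choices are the simple matching coefficient (b + c)/(a + b + c + d) and, for asymmetric variables such as test results, the Jaccard dissimilarity (b + c)/(a + b + c). A small Python sketch for the rows above (names mine):

def contingency(i, j):
    """Counts a, b, c, d of the 2x2 table for two binary objects."""
    a = sum(x == 1 and y == 1 for x, y in zip(i, j))
    b = sum(x == 1 and y == 0 for x, y in zip(i, j))
    c = sum(x == 0 and y == 1 for x, y in zip(i, j))
    d = sum(x == 0 and y == 0 for x, y in zip(i, j))
    return a, b, c, d

tina = [1, 0, 1, 0, 0, 0]
dina = [1, 0, 1, 0, 1, 0]
a, b, c, d = contingency(tina, dina)
print((b + c) / (a + b + c + d))  # simple matching: 1/6
print((b + c) / (a + b + c))      # Jaccard: 1/3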
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Major Clustering Approaches
Cluster Analysis
Partitioning Algorithms: Basic Concepts
Centroid or Medoid
[Figure: a cluster's centroid (the mean point) versus its medoid (the most centrally located actual object).]
The k-means Clustering Method
Given k, the k-means algorithm proceeds in four steps (a runnable sketch follows below):
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no objects are reassigned.
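
A minimal Python sketch of these four steps for one-dimensional points (the names are mine, not from the slides):

import random

def kmeans(points, k, means=None):
    # Step 1: choose k initial seed points (random objects if none given).
    if means is None:
        means = random.sample(points, k)
    while True:
        # Step 3: assign each object to the cluster with the nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: abs(p - means[c]))].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster.
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        # Step 4: stop once no assignment (hence no mean) changes.
        if new_means == means:
            return clusters, means
        means = new_means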
K-Means Example
Given: {2,4,10,12,3,20,30,11,25}
Assume that we want two clusters.
Write the elements in increasing order
{2,3,4,10,11,12,20,25,30}
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,11,12,20,25,30}
Means are m1=2.5,m2=16
K1={2,3,4},K2={10,11,12,20,25,30}
Means are m1=3,m2=18
K1={2,3,4,10},K2={11,12,20,25,30}
Means are m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,25,30}
Means are m1=7,m2=25
Stop: with these means, no more points move between K1 and K2.
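
Running the sketch above on the slide's data with the same starting means reproduces this trace:

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans(data, k=2, means=[3, 4])
print(sorted(clusters[0]), sorted(clusters[1]))  # [2, 3, 4, 10, 11, 12] [20, 25, 30]
print(means)                                     # [7.0, 25.0]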
The k-means Clustering Method: Example
[Figure: four panels on a 10 x 10 grid showing k-means iterations on a 2-D point set: initial seeds, first assignment, recomputed centroids with reassignment, and the converged clustering.]
Comments on the k-means Method
The K-Medoids Clustering Method
The K-Medoids Clustering Algorithm
The K-Medoids Clustering Method
Cluster the following data set of ten objects into two clusters, i.e. k = 2:

Object  x  y
X1      2  6
X2      3  4
X3      3  8
X4      4  7
X5      6  2
X6      6  4
X7      7  3
X8      7  4
X9      8  5
X10     7  6
The K-Medoids Clustering Method
Step 1
Initialise the k centres. Let us assume c1 = (3,4) and c2 = (7,4) are selected as the medoids.
Calculate the distance from each data object to its nearest medoid. The cost is calculated using the Minkowski distance metric with p = 1 (the Manhattan distance, recalled below).
Similarity and Dissimilarity Between Objects
If p = 1, then L1 is:

$$L_1(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{in}-x_{jn}|$$
The K-Medoids Clustering Method

Data object Xi   Cost to c1 = (3,4)   Cost to c2 = (7,4)
X1 (2,6)         3                    7
X3 (3,8)         4                    8
X4 (4,7)         4                    6
X5 (6,2)         5                    3
X6 (6,4)         3                    1
X7 (7,3)         5                    1
X9 (8,5)         6                    2
X10 (7,6)        6                    2

(X2 and X8 are the medoids themselves, at cost 0.)
The K-Medoids Clustering Method
So the clusters become:
Cluster1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, while the remaining points form the other.
The cost between any two points is found using the formula

$$\mathrm{cost}(x,c) = \sum_{i=1}^{d} |x_i - c_i|$$

where x is any data object, c is the medoid, and d is the number of dimensions. The total cost is the sum of each object's cost to its medoid: (3+4+4) + (3+1+1+2+2) = 20.
The K-Medoids Clustering Method
Step 2
Select one of the non-medoids, O′. Let us assume O′ = (7,3), i.e. we test swapping medoid c2 = (7,4) with O′, and recompute each object's cost to its nearest medoid.
The K-Medoids Clustering Method

Data object Xi   Cost to c1 = (3,4)   Cost to O′ = (7,3)
X1 (2,6)         3                    8
X3 (3,8)         4                    9
X4 (4,7)         4                    7
X5 (6,2)         5                    2
X6 (6,4)         3                    2
X8 (7,4)         4                    1
X9 (8,5)         6                    3
X10 (7,6)        6                    3
The K-Medoids Clustering Method
The K-Medoids Clustering Method
= 22-20
=2 > 0
So moving to O′ would be bad idea, so the
previous choice was good and algorithm
terminates here (i.e there is no change in the
medoids).
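
As an illustration, a short Python sketch (the helper names are mine, not from the slides) that reproduces the example's arithmetic, including the rejected swap:

def manhattan(x, c):
    # cost(x, c) = sum over the d dimensions of |x_i - c_i|
    return sum(abs(a - b) for a, b in zip(x, c))

def total_cost(points, medoids):
    """Assign each point to its nearest medoid and sum the costs."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
        (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
current = total_cost(data, [(3, 4), (7, 4)])  # 20
swapped = total_cost(data, [(3, 4), (7, 3)])  # 22
print(current, swapped, swapped - current)    # 20 22 2 -> keep the old medoids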
Cluster Analysis
What is a Cluster?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a,b into ab and d,e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Implemented in statistical analysis packages, e.g., Splus.
Uses the single-link method and the dissimilarity matrix:
– Merge the objects/clusters that have the least dissimilarity.
– Continue in a non-descending fashion.
– Eventually all objects belong to the same cluster.
A sketch of the merge loop follows the figure below.
[Figure: three panels on a 10 x 10 grid showing AGNES merging nearby points into progressively larger clusters.]
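
A minimal Python sketch of the single-link merge loop (names are mine); for concreteness it reuses the A-E dissimilarity matrix from the MST example below:

from itertools import combinations

labels = "ABCDE"
dist = [[0, 1, 2, 2, 3],
        [1, 0, 2, 4, 3],
        [2, 2, 0, 1, 5],
        [2, 4, 1, 0, 3],
        [3, 3, 5, 3, 0]]

def single_link_agnes(dist, target=1):
    """AGNES with the single-link (minimum-distance) merge rule:
    repeatedly merge the two clusters whose closest members are
    least dissimilar, until `target` clusters remain."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > target:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: min(dist[a][b]
                                     for a in clusters[p[0]]
                                     for b in clusters[p[1]]))
        clusters[i] |= clusters.pop(j)  # merge the least-dissimilar pair
        print([{labels[k] for k in c} for c in clusters])
    return clusters

single_link_agnes(dist)  # merges A+B, then C+D, then AB+CD, then ABCD+E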
AGNES (Agglomerative Nesting): Minimum Distance Method
[Figure: threshold graphs over objects A, B, C, D, E with edges drawn for dmin = 2, dmin = 3, and dmin = 4, together with the resulting single-link dendrogram over A, B, C, D, E.]
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired threshold; each connected component then forms a cluster.
[Figure: dendrogram over A, B, C, D, E with merge heights 1 to 5.]
MST Example

     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

[Figure: the five objects A, B, C, D, E drawn as a graph.]
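
Single-link clustering is closely tied to the minimum spanning tree: below is a sketch of Kruskal's algorithm on this matrix (names are mine), whose longest edges, once cut, reproduce the DIANA splits shown next. Note that E is at distance 3 from A, B and D alike, so the weight-3 MST edge may attach E to A rather than to D; the resulting splits are the same.

labels = "ABCDE"
dist = [[0, 1, 2, 2, 3],
        [1, 0, 2, 4, 3],
        [2, 2, 0, 1, 5],
        [2, 4, 1, 0, 3],
        [3, 3, 5, 3, 0]]

edges = sorted((dist[i][j], i, j)
               for i in range(5) for j in range(i + 1, 5))
parent = list(range(5))  # union-find forest, one tree per object

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

mst = []
for w, i, j in edges:         # Kruskal: scan edges by increasing weight
    ri, rj = find(i), find(j)
    if ri != rj:              # keep the edge only if it joins two trees
        parent[ri] = rj
        mst.append((w, labels[i], labels[j]))

print(mst)  # [(1, 'A', 'B'), (1, 'C', 'D'), (2, 'A', 'C'), (3, 'A', 'E')]
# Cutting the heaviest edge (weight 3) splits off {E}; cutting the
# weight-2 edge then yields {A,B} and {C,D}, as in the DIANA slides.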
DIANA (Divisive Analysis)
[Figure: three panels on a 10 x 10 grid showing DIANA splitting one cluster into progressively smaller clusters.]
DIANA (Divisive Analysis)
[Figure: the weighted graph over A, B, C, D, E (edge weights as in the distance matrix above), before and after the first edges are removed.]
DIANA (Divisive Analysis)
Cut the largest edge, ED. The cluster {A,B,C,D,E} is split into two clusters: {E} and {A,B,C,D}.
[Figure: the graph with edge ED removed, isolating E.]
DIANA (Divisive Analysis)
The two clusters are now {E} and {A,B,C,D}. Next, cut the edge between B and C: the cluster {A,B,C,D} is split into {A,B} and {C,D}.
[Figure: the graph with edge BC removed, separating {A,B} from {C,D}.]
DIANA (Divisive Analysis)
[Figure: the remaining clusters {A,B}, {C,D} and {E}.]
More on Hierarchical Clustering Methods
Integration of hierarchical with distance-based clustering:
– BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
– CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.
– CHAMELEON (1999): hierarchical clustering using dynamic modeling.