
Clustering

Definition
• Grouping data objects so that objects within one group are similar (or related) to one another, but different from (or unrelated to) objects in other groups

[Figure: intra-cluster distances are minimized, inter-cluster distances are maximized]
Outliers
• An outlier is an object that does not belong to any cluster

[Figure: a set of clusters with outlier points lying outside them]
Why cluster?
• Clustering: group the data objects so that they are
– similar to one another within the same cluster
– dissimilar to the objects in other clusters

• Clustering results are useful:
– As a tool to understand the data distribution
• Cluster visualization can reveal important characteristics of the data
– As a preprocessing step for other algorithms
Applications
• Image Processing
• Web
• Bioinformatics
• Big Data problems
• And others
The Clustering Task
• Group observations into sets so that observations within a group are similar, while observations in different groups are dissimilar

• Fundamental questions:
– What does “similar” mean?
– How do we measure the quality of a solution?
– How do we obtain a good partition?
Distance Functions
• A distance d(x, y) between two objects x and y is a metric if:
– d(i, j) ≥ 0 (non-negativity)
– d(i, i) = 0 (isolation)
– d(i, j) = d(j, i) (symmetry)
– d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

• Distance functions are defined differently for real-valued, boolean, categorical, and ordinal variables.

• Different weights can be associated with different variables, based on the application and the data semantics.
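
To make the four axioms concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that checks them numerically for the Euclidean distance on a few hand-picked points:

```python
import numpy as np

# Minimal sketch: numerically checking the four metric axioms
# for the Euclidean distance on sample points.
def d(x, y):
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

i, j, h = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]

assert d(i, j) >= 0                   # non-negativity
assert d(i, i) == 0                   # isolation
assert d(i, j) == d(j, i)             # symmetry
assert d(i, j) <= d(i, h) + d(h, j)   # triangle inequality
print("Euclidean distance satisfies all four axioms on this sample.")
```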
Data Structures
• Data matrix (rows are the n tuples/objects, columns are the d attributes/dimensions):

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{nd} \end{pmatrix}$$

• Distance matrix (objects × objects; only the lower triangle is stored, since d is symmetric):

$$D = \begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$
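
A minimal sketch of how the two structures relate, assuming Euclidean distance and a toy 3×2 data matrix (both are my own illustration):

```python
import numpy as np

# Build the n x n distance matrix from an n x d data matrix.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])        # n = 3 objects, d = 2 attributes

n = len(X)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i):            # only the lower triangle is needed
        D[i, j] = np.linalg.norm(X[i] - X[j])

print(D)   # diagonal is 0; D[i, j] corresponds to d(i+1, j+1) above
```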
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

• Jaccard distance between binary vectors X and Y:

JDist(X, Y) = 1 − JSim(X, Y)

• Example:

       Q1  Q2  Q3  Q4  Q5  Q6
   X    1   0   0   1   1   1
   Y    0   1   1   0   1   0

• JSim = 1/6
• JDist = 5/6
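
The example can be verified with a short Python sketch (my own illustration; the vectors are the X and Y above):

```python
import numpy as np

# Jaccard similarity and distance for the binary vectors above.
X = np.array([1, 0, 0, 1, 1, 1])
Y = np.array([0, 1, 1, 0, 1, 0])

intersection = np.sum((X == 1) & (Y == 1))   # positions where both are 1
union = np.sum((X == 1) | (Y == 1))          # positions where either is 1

jsim = intersection / union
jdist = 1 - jsim
print(jsim, jdist)   # 1/6 and 5/6, matching the slide
```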
Distance functions for real-valued vectors
• Lp norms, or Minkowski distance:

$$L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p} = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

where p is a positive integer

• If p = 1, L1 is the Manhattan (or city block) distance:

$$L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$$
Distance functions for real-valued vectors
• If p = 2, L2 is the Euclidean distance:

$$L_2(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2}$$

• One can also use a weighted distance:

$$d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2}$$

$$d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \cdots + w_d |x_d - y_d|$$

• Very often L_p^p is used instead of L_p (why?)
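
A minimal sketch of the weighted variant and of L_p^p (names and weights are my own illustration); dropping the p-th root is cheaper and preserves the ordering of distances, which is all a cluster assignment needs (one answer to the "why?" above):

```python
import numpy as np

# Weighted Euclidean distance, and L_p^p (no p-th root taken).
def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

def lpp(x, y, p):
    return np.sum(np.abs(x - y) ** p)   # monotone in L_p, cheaper

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
w = np.array([0.5, 2.0])                # per-attribute weights

print(weighted_euclidean(x, y, w))      # sqrt(0.5*9 + 2*16) = sqrt(36.5)
print(lpp(x, y, 2))                     # 9 + 16 = 25 (squared Euclidean)
```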


The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters {C1, C2, …, Ck} such that

$$Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_2(x, c_i)^2$$

is minimized
• Some special cases: k = 1, k = n
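
A minimal sketch (my own illustration) of evaluating this cost for a given assignment and set of centers:

```python
import numpy as np

# k-means cost: sum over clusters of squared L2 distances to the center.
def kmeans_cost(X, labels, centers):
    cost = 0.0
    for i, c in enumerate(centers):
        cluster = X[labels == i]
        cost += np.sum(np.linalg.norm(cluster - c, axis=1) ** 2)
    return cost

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5, 0.0], [10.5, 10.0]])
print(kmeans_cost(X, labels, centers))   # 4 points at distance 0.5 -> 1.0
```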
Algorithmic properties of the k-means problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)

• Finding the best solution in polynomial time is infeasible

• For d = 1 the problem is solvable in polynomial time (how?)

• A simple iterative algorithm works quite well in practice
The k-means algorithm
• One way of solving the k-means problem

• Randomly pick k cluster centers {c1, …, ck}

• For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i

• For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci)

• Repeat until convergence
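
A minimal Python sketch of this iteration (Lloyd's algorithm); the function name and initialization are my own, and a production version would handle empty clusters and use smarter seeding such as k-means++:

```python
import numpy as np

# Lloyd's k-means iteration as described in the slide above.
def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center becomes the mean of its cluster
        # (assumes no cluster goes empty, for simplicity).
        new_centers = np.array([X[labels == i].mean(axis=0)
                                for i in range(k)])
        if np.allclose(new_centers, centers):   # convergence
            break
        centers = new_centers
    return labels, centers

X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], float)
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")
```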
Properties of the k-means algorithm
• Finds a local optimum

• Often converges quickly (but not always)

• The choice of initial points can have a large influence on the result
Two different k-means clusterings
[Figure: the same set of original points clustered in two ways; one panel shows the optimal clustering, the other a sub-optimal clustering]
Discussion of the k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
– Clusters of different densities
– Clusters of different sizes

• Outliers can also cause a problem (example?)
The k-median problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

$$Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_1(x, c_i)$$

is minimized
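
A minimal sketch (my own illustration) of the k-median cost; note that the centers must be actual points of X:

```python
import numpy as np

# k-median cost: sum of L1 distances to centers chosen from X itself.
def kmedian_cost(X, labels, centers):
    cost = 0.0
    for i, c in enumerate(centers):
        cluster = X[labels == i]
        cost += np.sum(np.abs(cluster - c).sum(axis=1))   # L1 distances
    return cost

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = X[[0, 2]]        # centers are points of X (medoids)
print(kmedian_cost(X, labels, centers))   # 2 + 2 = 4
```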
The k-center problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points from X as cluster centers {c1, c2, …, ck} such that for the clusters {C1, C2, …, Ck}

$$R(C) = \max_{j} \max_{x \in C_j} d(x, c_j)$$

is minimized
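
A minimal sketch (my own illustration) of evaluating the objective R(C), the largest distance from any point to its assigned center:

```python
import numpy as np

# k-center objective: the radius of the worst-covered cluster.
def kcenter_radius(X, labels, centers):
    return max(np.linalg.norm(X[labels == j] - c, axis=1).max()
               for j, c in enumerate(centers))

X = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centers = X[[0, 2]]
print(kcenter_radius(X, labels, centers))   # max(5, 2) = 5
```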
Algorithmic properties of the k-center problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)

• Finding the best solution in polynomial time is infeasible

• For d = 1 the problem is solvable in polynomial time (how?)

• A simple combinatorial algorithm works well in practice
The farthest-first traversal algorithm
• Pick any data point and label it as point 1
• For i = 2, 3, …, n:
– Find the unlabelled point furthest from {1, 2, …, i−1} and label it as i
// Use d(x, S) = min_{y ∈ S} d(x, y) as the distance of a point from a set
– π(i) = argmin_{j < i} d(i, j)
– Ri = d(i, π(i))
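
A minimal Python sketch of the traversal (my own illustration); taking the first k visited points as centers gives the 2-approximation for k-center discussed next:

```python
import numpy as np

# Farthest-first traversal (Gonzalez's algorithm) as described above.
def farthest_first(X):
    order = [0]                                # pick any point as point 1
    dist = np.linalg.norm(X - X[0], axis=1)    # dist[x] = d(x, S)
    radii = []                                 # R_2, R_3, ..., R_n
    for _ in range(1, len(X)):
        i = int(np.argmax(dist))               # unlabelled point furthest from S
        order.append(i)
        radii.append(dist[i])                  # R_i = d(i, pi(i))
        # Shrink each point's distance to the enlarged set S.
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, radii

X = np.array([[0, 0], [1, 0], [10, 10], [11, 10], [5, 5]], float)
order, radii = farthest_first(X)
print(order)   # visiting order
print(radii)   # non-increasing, as Claim 1 below shows
```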
The farthest-first traversal is a 2-approximation algorithm
• Claim 1: R1 ≥ R2 ≥ … ≥ Rn

• Proof:
– Rj = d(j, π(j)) = d(j, {1, 2, …, j−1})
≤ d(j, {1, 2, …, i−1})   // for j > i, since {1, …, i−1} ⊆ {1, …, j−1}
≤ d(i, {1, 2, …, i−1}) = Ri   // point i was chosen as the furthest from {1, …, i−1}
The farthest-first traversal is a 2-approximation algorithm
• Claim 2: If C is the clustering reported by the farthest-first algorithm, then R(C) = Rk+1

• Proof:
– For all i > k we have
d(i, {1, 2, …, k}) ≤ d(k+1, {1, 2, …, k}) = Rk+1   // point k+1 was the furthest from the k chosen centers
