
Clustering

Definition
• Grouping data objects so that objects within one group are similar (or related) to one another, but different from (or unrelated to) objects in other groups

[Figure: intra-cluster distances are minimized, inter-cluster distances are maximized]
Outliers
• An outlier is an object that does not belong to any cluster

[Figure: a set of clusters with outlier points lying outside them]
Why cluster?
• Clustering: group the data objects so that they are
– similar to one another within the same cluster
– dissimilar to the objects in other clusters

• Clustering results are useful:
– As a tool to understand the data distribution
• Cluster visualization can reveal important characteristics of the data
– As a preprocessing step for other algorithms
Applications
• Image Processing
• Web
• Bioinformatics
• Big Data problems
• And others
The Clustering Task
• Group observations into sets so that observations within a group are similar, while observations in different groups are dissimilar

• Fundamental questions:
– What does “similar” mean?
– How do we measure the quality of a solution?
– How do we obtain a good partition?
Distance Functions
• A distance d(x, y) between two objects x and y is a metric if:
– d(i, j) ≥ 0 (non-negativity)
– d(i, i) = 0 (isolation)
– d(i, j) = d(j, i) (symmetry)
– d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

• Distance functions are defined differently for real-valued, boolean, categorical, and ordinal variables.

• Different weights can be associated with different variables, based on the application and the data semantics.
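
To make the four axioms concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that checks them numerically for the Euclidean distance on a few hand-picked points:

```python
import numpy as np

# Minimal sketch: numerically checking the four metric axioms
# for the Euclidean distance on sample points.
def d(x, y):
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

i, j, h = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]

assert d(i, j) >= 0                   # non-negativity
assert d(i, i) == 0                   # isolation
assert d(i, j) == d(j, i)             # symmetry
assert d(i, j) <= d(i, h) + d(h, j)   # triangle inequality
print("Euclidean distance satisfies all four axioms on this sample.")
```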
Data Structures
• Data matrix (rows are the n tuples/objects, columns are the d attributes/dimensions):

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1d} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{id} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{nd} \end{pmatrix}$$

• Distance matrix (objects × objects; only the lower triangle is stored, since d is symmetric):

$$D = \begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$
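
A minimal sketch of how the two structures relate, assuming Euclidean distance and a toy 3×2 data matrix (both are my own illustration):

```python
import numpy as np

# Build the n x n distance matrix from an n x d data matrix.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])        # n = 3 objects, d = 2 attributes

n = len(X)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i):            # only the lower triangle is needed
        D[i, j] = np.linalg.norm(X[i] - X[j])

print(D)   # diagonal is 0; D[i, j] corresponds to d(i+1, j+1) above
```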
Distance functions for binary vectors
• Jaccard similarity between binary vectors X and Y:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

• Jaccard distance between binary vectors X and Y:

JDist(X, Y) = 1 − JSim(X, Y)

• Example:

       Q1  Q2  Q3  Q4  Q5  Q6
   X    1   0   0   1   1   1
   Y    0   1   1   0   1   0

• JSim = 1/6
• JDist = 5/6
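
The example can be verified with a short Python sketch (my own illustration; the vectors are the X and Y above):

```python
import numpy as np

# Jaccard similarity and distance for the binary vectors above.
X = np.array([1, 0, 0, 1, 1, 1])
Y = np.array([0, 1, 1, 0, 1, 0])

intersection = np.sum((X == 1) & (Y == 1))   # positions where both are 1
union = np.sum((X == 1) | (Y == 1))          # positions where either is 1

jsim = intersection / union
jdist = 1 - jsim
print(jsim, jdist)   # 1/6 and 5/6, matching the slide
```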
Distance functions for real-valued vectors
• Lp norms, or Minkowski distance:

$$L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p} = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

where p is a positive integer

• If p = 1, L1 is the Manhattan (or city block) distance:

$$L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$$
Distance functions for real-valued vectors
• If p = 2, L2 is the Euclidean distance:

$$L_2(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2}$$

• One can also use a weighted distance:

$$d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2}$$

$$d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \cdots + w_d |x_d - y_d|$$

• Very often L_p^p is used instead of L_p (why?)
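
A minimal sketch of the weighted variant and of L_p^p (names and weights are my own illustration); dropping the p-th root is cheaper and preserves the ordering of distances, which is all a cluster assignment needs (one answer to the "why?" above):

```python
import numpy as np

# Weighted Euclidean distance, and L_p^p (no p-th root taken).
def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

def lpp(x, y, p):
    return np.sum(np.abs(x - y) ** p)   # monotone in L_p, cheaper

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
w = np.array([0.5, 2.0])                # per-attribute weights

print(weighted_euclidean(x, y, w))      # sqrt(0.5*9 + 2*16) = sqrt(36.5)
print(lpp(x, y, 2))                     # 9 + 16 = 25 (squared Euclidean)
```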


The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters {C1, C2, …, Ck} such that

$$Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_2(x, c_i)^2$$

is minimized
• Some special cases: k = 1, k = n
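
A minimal sketch (my own illustration) of evaluating this cost for a given assignment and set of centers:

```python
import numpy as np

# k-means cost: sum over clusters of squared L2 distances to the center.
def kmeans_cost(X, labels, centers):
    cost = 0.0
    for i, c in enumerate(centers):
        cluster = X[labels == i]
        cost += np.sum(np.linalg.norm(cluster - c, axis=1) ** 2)
    return cost

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5, 0.0], [10.5, 10.0]])
print(kmeans_cost(X, labels, centers))   # 4 points at distance 0.5 -> 1.0
```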
Algorithmic properties of the k-means problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)

• Finding the best solution in polynomial time is infeasible

• For d = 1 the problem is solvable in polynomial time (how?)

• A simple iterative algorithm works quite well in practice
The k-means algorithm
• One way of solving the k-means problem

• Randomly pick k cluster centers {c1, …, ck}

• For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i

• For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci)

• Repeat until convergence
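
A minimal Python sketch of this iteration (Lloyd's algorithm); the function name and initialization are my own, and a production version would handle empty clusters and use smarter seeding such as k-means++:

```python
import numpy as np

# Lloyd's k-means iteration as described in the slide above.
def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each center becomes the mean of its cluster
        # (assumes no cluster goes empty, for simplicity).
        new_centers = np.array([X[labels == i].mean(axis=0)
                                for i in range(k)])
        if np.allclose(new_centers, centers):   # convergence
            break
        centers = new_centers
    return labels, centers

X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], float)
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")
```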
Properties of the k-means algorithm
• Finds a local optimum

• Often converges quickly (but not always)

• The choice of initial points can have a large influence on the result
Two different k-means clusterings
[Figure: the same set of original points clustered in two ways; one panel shows the optimal clustering, the other a sub-optimal clustering]
Discussion of the k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
– Clusters of different densities
– Clusters of different sizes

• Outliers can also cause a problem (example?)
The k-median problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

$$Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_1(x, c_i)$$

is minimized
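
A minimal sketch (my own illustration) of the k-median cost; note that the centers must be actual points of X:

```python
import numpy as np

# k-median cost: sum of L1 distances to centers chosen from X itself.
def kmedian_cost(X, labels, centers):
    cost = 0.0
    for i, c in enumerate(centers):
        cluster = X[labels == i]
        cost += np.sum(np.abs(cluster - c).sum(axis=1))   # L1 distances
    return cost

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centers = X[[0, 2]]        # centers are points of X (medoids)
print(kmedian_cost(X, labels, centers))   # 2 + 2 = 4
```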
The k-center problem
• Given a set X of n points in a d-dimensional space and an integer k

• Task: choose a set of k points from X as cluster centers {c1, c2, …, ck} such that for the clusters {C1, C2, …, Ck}

$$R(C) = \max_{j} \max_{x \in C_j} d(x, c_j)$$

is minimized
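
A minimal sketch (my own illustration) of evaluating the objective R(C), the largest distance from any point to its assigned center:

```python
import numpy as np

# k-center objective: the radius of the worst-covered cluster.
def kcenter_radius(X, labels, centers):
    return max(np.linalg.norm(X[labels == j] - c, axis=1).max()
               for j, c in enumerate(centers))

X = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centers = X[[0, 2]]
print(kcenter_radius(X, labels, centers))   # max(5, 2) = 5
```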
Algorithmic properties of the k-center problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)

• Finding the best solution in polynomial time is infeasible

• For d = 1 the problem is solvable in polynomial time (how?)

• A simple combinatorial algorithm works well in practice
The farthest-first traversal algorithm
• Pick any data point and label it as point 1
• For i = 2, 3, …, n:
– Find the unlabelled point furthest from {1, 2, …, i−1} and label it as i
// Use d(x, S) = min_{y ∈ S} d(x, y) as the distance of a point from a set
– π(i) = argmin_{j < i} d(i, j)
– Ri = d(i, π(i))
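
A minimal Python sketch of the traversal (my own illustration); taking the first k visited points as centers gives the 2-approximation for k-center discussed next:

```python
import numpy as np

# Farthest-first traversal (Gonzalez's algorithm) as described above.
def farthest_first(X):
    order = [0]                                # pick any point as point 1
    dist = np.linalg.norm(X - X[0], axis=1)    # dist[x] = d(x, S)
    radii = []                                 # R_2, R_3, ..., R_n
    for _ in range(1, len(X)):
        i = int(np.argmax(dist))               # unlabelled point furthest from S
        order.append(i)
        radii.append(dist[i])                  # R_i = d(i, pi(i))
        # Shrink each point's distance to the enlarged set S.
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order, radii

X = np.array([[0, 0], [1, 0], [10, 10], [11, 10], [5, 5]], float)
order, radii = farthest_first(X)
print(order)   # visiting order
print(radii)   # non-increasing, as Claim 1 below shows
```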
The farthest-first traversal is a 2-approximation algorithm
• Claim 1: R1 ≥ R2 ≥ … ≥ Rn

• Proof:
– Rj = d(j, π(j)) = d(j, {1, 2, …, j−1})
≤ d(j, {1, 2, …, i−1})   // for j > i, since {1, …, i−1} ⊆ {1, …, j−1}
≤ d(i, {1, 2, …, i−1}) = Ri   // point i was chosen as the furthest from {1, …, i−1}
The farthest-first traversal is a 2-approximation algorithm
• Claim 2: If C is the clustering reported by the farthest-first algorithm, then R(C) = Rk+1

• Proof:
– For all i > k we have
d(i, {1, 2, …, k}) ≤ d(k+1, {1, 2, …, k}) = Rk+1   // point k+1 was the furthest from the k chosen centers
