Clustering
Example
Cluster Analysis?
• Pattern recognition
• Spatial data analysis
• Spatial clustering
• Image processing
• Economic science (especially market research)
• WWW
• News, search results
• Cluster Weblog data to discover groups of similar access patterns
• Land use: identify areas that are used for the same purpose.
Similarity Measures
Clustering Requirements
• Dissimilarity matrix (one mode):
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
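A minimal Python sketch of how a lower-triangular dissimilarity matrix like the one above could be built. The function name `dissimilarity_matrix`, the sample points, and the choice of Euclidean distance (`math.dist`) are assumptions for illustration; any distance function could be passed instead.

```python
import math

def dissimilarity_matrix(objects, dist=math.dist):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i-1) followed by 0 on the diagonal.
    """
    n = len(objects)
    return [
        [dist(objects[i], objects[j]) if j < i else 0.0 for j in range(i + 1)]
        for i in range(n)
    ]

# Hypothetical sample objects, chosen only for illustration.
points = [(0, 0), (3, 4), (6, 8)]
matrix = dissimilarity_matrix(points)
for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0.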
• Interval-scaled variables
Interval-Scaled Variable
http://www.sdgs.usd.edu/publications/maps/earthquakes/images/RichterScale.gif
Interval Variable
where
• Compute the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
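The z-score computation can be sketched in Python as follows. The slide's definitions of m_f and s_f did not survive, so this sketch assumes m_f is the mean of variable f and s_f is its mean absolute deviation; the standard deviation is another common choice for s_f. The function name `z_scores` and the sample values are hypothetical.

```python
def z_scores(values):
    """Standardize one variable f: z_if = (x_if - m_f) / s_f.

    Assumes the values are not all identical (otherwise s_f would be 0).
    """
    n = len(values)
    m_f = sum(values) / n                         # mean of variable f (assumed)
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation (assumed)
    return [(x - m_f) / s_f for x in values]

# Hypothetical measurements of a single variable.
standardized = z_scores([2.0, 4.0, 6.0, 8.0])
print(standardized)
```

After standardization the values are unit-free, so variables measured on different scales contribute comparably to a distance.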
Why z-score?
(Continued)
Distance between Interval-Scaled Variables
• Minkowski distance, where q is a positive integer:
$$d(i,j) = \sqrt[q]{|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q}$$
• For q = 1 (Manhattan distance):
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
• For q = 2 (Euclidean distance):
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
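The distance formulas above can be sketched as one Python function parameterized by q. The name `minkowski` and the sample vectors are hypothetical; q = 1 and q = 2 give the two special cases.

```python
def minkowski(x, y, q):
    """Distance of order q between two equal-length vectors x and y."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

# Hypothetical objects with p = 2 variables each.
i, j, k = (1, 2), (4, 6), (0, 0)

manhattan = minkowski(i, j, 1)   # q = 1
euclidean = minkowski(i, j, 2)   # q = 2
print(manhattan, euclidean)

# Spot check of the listed properties on these sample objects.
print(minkowski(i, i, 2) == 0)
print(minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2))
```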
• Distance for symmetric binary variables:
$$d(i,j) = \frac{b+c}{a+b+c+d}$$
• Distance for asymmetric binary variables:
$$d(i,j) = \frac{b+c}{a+b+c}$$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i,j) = \frac{a}{a+b+c}$$
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be set to 1, and the value N be set to 0
$$d(\text{jack}, \text{mary}) = \frac{0+1}{2+0+1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1+1}{1+1+1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1+2}{1+1+2} = 0.75$$
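The three distances in this example can be reproduced with a short Python sketch. It assumes the usual contingency-table meaning of the counts: a is the number of attributes where both objects are 1, b where the first is 1 and the second is 0, and c where the first is 0 and the second is 1. The function name `asymmetric_binary_distance` is hypothetical. Gender is left out because only the asymmetric attributes are used.

```python
def asymmetric_binary_distance(x, y):
    """d(i,j) = (b + c) / (a + b + c) for two 0/1 vectors of equal length.

    Assumes at least one attribute is 1 in x or y (otherwise a + b + c = 0).
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4 from the table; Y and P map to 1, N maps to 0.
encode = {"Y": 1, "P": 1, "N": 0}
patients = {"jack": "YNPNNN", "mary": "YNPNPN", "jim": "YPNNNN"}
vectors = {name: [encode[ch] for ch in row] for name, row in patients.items()}

for first, second in [("jack", "mary"), ("jack", "jim"), ("jim", "mary")]:
    d = asymmetric_binary_distance(vectors[first], vectors[second])
    print(first, second, round(d, 2))
```

The negative matches (both N) are ignored, which is what makes the measure asymmetric.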
Nominal Variables
• Can take more than 2 states: red, yellow, blue, green
• Method 1: Simple matching
• m: number of matches, p: total number of variables
$$d(i,j) = \frac{p-m}{p}$$
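A minimal Python sketch of simple matching for nominal variables. The function name `simple_matching_distance` and the two sample objects are hypothetical.

```python
def simple_matching_distance(x, y):
    """d(i,j) = (p - m) / p, with p variables and m matching values."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

# Two hypothetical objects described by three nominal variables.
obj_i = ["red", "yellow", "blue"]
obj_j = ["red", "green", "blue"]
d_ij = simple_matching_distance(obj_i, obj_j)
print(d_ij)  # two of the three variables match, so d = 1/3
```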
Ordinal
Ratio-Scaled Variables
$$y_{if} = \log(x_{if})$$
• Treat them as continuous ordinal data
Mixed Types
Clustering Approaches
• Partitioning:
• Construct partitions and evaluate them by some criterion, for example minimizing the sum of squared errors
• Methods: k-means, k-medoids, CLARANS
• Hierarchical:
• Build a hierarchical structure using some criterion
• Methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based:
• Based on connectivity and density functions
• Methods: DBSCAN, OPTICS, DENCLUE
• Single link: the smallest distance between an element of one cluster and an element of the other, dis(Ki, Kj) = min(tip, tjq)
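The single-link distance can be sketched in Python as the minimum over all cross-cluster pairs. The function name `single_link`, the sample clusters, and the use of Euclidean distance between elements are assumptions for illustration.

```python
import math

def single_link(cluster_i, cluster_j, dist=math.dist):
    """Smallest distance between any element of cluster_i and any of cluster_j."""
    return min(dist(p, q) for p in cluster_i for q in cluster_j)

# Two hypothetical clusters on a line.
k_i = [(0, 0), (1, 0)]
k_j = [(4, 0), (9, 0)]
link = single_link(k_i, k_j)
print(link)  # the closest pair is (1, 0) and (4, 0)
```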
K-Means Example
• A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
• Compute the distance between each point and each cluster centroid.
Cluster A, centroid: (2,5)
Cluster B, centroid: (7,5)
A1 → cluster A:
$$d(A1, A) = \sqrt{|2-2|^2 + |10-5|^2} = 5$$
A3 → cluster A, d(A3,A) =
B1 → cluster A, d(B1,A) =
B3 → cluster A, d(B3,A) =
C1 → cluster A, d(C1,A) =
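The distances in this exercise can be checked with a short Python sketch that computes the Euclidean distance from every point to both centroids and then picks the nearer one. The variable names are hypothetical; the points and centroids are the ones listed above.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
centroids = {"A": (2, 5), "B": (7, 5)}

# Distance from every point to every centroid.
distances = {
    name: {c: math.dist(p, center) for c, center in centroids.items()}
    for name, p in points.items()
}
# Assignment step: each point goes to the cluster with the nearest centroid.
assignment = {name: min(d, key=d.get) for name, d in distances.items()}

for name in points:
    rounded = {c: round(v, 2) for c, v in distances[name].items()}
    print(name, rounded, "->", assignment[name])
```

This covers only the assignment step; the centroids would then be recomputed from the assigned points.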
K-Means Example:
[Figure: K-means with K=2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
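The assign/update/reassign loop from the figure can be sketched in Python as below. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `max_iter` cap, the rule for empty clusters, and the choice of A1 and A3 as the K = 2 initial centers are all assumptions. The data are the eight points from the earlier example.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        # Update each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment

data = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers, labels = k_means(data, [(2, 10), (8, 4)])
print(centers)
print(labels)
```

The result depends on the initial centers, so different starting objects may give a different partition.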
K-Medoids
[Figure: K-medoids. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; loop until no change: compute the total cost of swapping, and swap O and O_random if quality is improved.]
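The swap loop from the figure can be sketched in Python as below. One deliberate difference, stated as an assumption: instead of picking a random non-medoid O_random, this sketch tries every non-medoid as a swap candidate so the run is deterministic. The function names `total_cost` and `k_medoids` and the sample data are hypothetical.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, initial_medoids):
    """Swap medoids with non-medoids while a swap lowers the total cost."""
    medoids = list(initial_medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:  # loop until no change
        improved = False
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # swap only if quality improves
                    medoids, best, improved = trial, cost, True
    return medoids, best

# Hypothetical data: two well-separated groups on a line.
data = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
medoids, cost = k_medoids(data, [(0, 0), (1, 0)])
print(sorted(medoids), cost)
```

Unlike a k-means center, a medoid is always one of the original objects, which makes the method less sensitive to outliers.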