Clustering
Density-based clustering
Abraham Otero Quintana, Ph.D.
Madrid, July 5th 2010
Unsupervised Pattern Recognition (Clustering) 2/20
Course outline:
3. Density-based clustering
3.1. DBSCAN (Density Based Spatial Clustering of
Applications with Noise)
3.2. Grid Clustering
3.3. DENCLUE (DENsity CLUstEring)
3.4. More algorithms
For an overview of these techniques, please read Tan2006 and Berkhin2002 from
/Docs. Some of the slides shown here are taken from the publicly available
repository of the same book. Source:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
3. Density based clustering
A cluster is a dense region of points, which is separated by low-density
regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise and
outliers are present.
6 density-based clusters
Density-based clustering tries to identify the dense (highly populated) regions
of the multidimensional space and separate them from other dense regions. For a
review, please read Tan2006 and Ester1996 from /Docs.
3.1 DBSCAN: Definitions
DBSCAN is based on the following definitions.
A point is a core point if it has more than a specified number of points (MinPts) within a radius Eps (these points are the interior of a cluster)
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
A noise point is any point that is not a core point or a border point.
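The three definitions can be tried out with a minimal sketch in plain Python. The function name `classify_points`, the Euclidean distance and the choice not to count a point as its own neighbor are assumptions made for illustration; Ester1996 counts neighbors slightly differently.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (hypothetical helper)."""
    n = len(points)
    # Eps-neighborhood of every point; the point itself is excluded (assumption).
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core point: more than MinPts points within Eps, as stated in the slide.
    is_core = [len(nb) > min_pts for nb in neighbors]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")  # not core, but inside a core point's neighborhood
        else:
            labels.append("noise")
    return labels

# Toy data: a tight group of six points, one point at its edge, one far away.
example = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25), (0.3, 0.1),
           (1.3, 0.0), (5, 5)]
example_labels = classify_points(example, eps=1.0, min_pts=4)
print(example_labels)
```

In this toy data the six grouped points come out as core points, the point at (1.3, 0.0) as a border point and the point at (5, 5) as noise.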
3.1 DBSCAN: Algorithm
Classify points as noise, border, core
Eliminate noise points
Perform clustering on the remaining points
Demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
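The three steps above can be sketched as follows. This is a simple quadratic-time illustration, not the implementation of Ester1996: the function name `dbscan`, the label -1 for noise and the rule that a border point joins the first cluster that reaches it are all assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; noise points keep the id -1."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: core points have more than MinPts neighbors within Eps.
    core = [len(nb) > min_pts for nb in neighbors]
    labels = [-1] * n  # Step 2: whatever is never reached below stays noise.
    cluster_id = 0
    # Step 3: grow one cluster per connected group of core points.
    for start in range(n):
        if not core[start] or labels[start] != -1:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if core[j]:  # only core points keep expanding the cluster
                        stack.append(j)
        cluster_id += 1
    return labels

# Two tight groups of six points each, plus one isolated point.
group = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25), (0.3, 0.1)]
data = group + [(x + 10, y + 10) for x, y in group] + [(5, 5)]
cluster_ids = dbscan(data, eps=1.0, min_pts=4)
print(cluster_ids)
```

With Eps = 1 and MinPts = 4 the two groups receive the ids 0 and 1, and the isolated point stays at -1.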
3.1 DBSCAN: Example
[Figure: original points; point types (core, border and noise); resulting clustering. Eps = 10, MinPts = 4]
3.1 DBSCAN: Example
[Figure: original points and the DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
Features:
Resistant to noise
Can handle clusters of different shapes and sizes
But it has trouble with:
Varying densities
High-dimensional data
As we have seen, DBSCAN is quite insensitive to outliers and can handle
non-globular shapes. However, DBSCAN is not a panacea: it is rather sensitive to
varying densities and usually does not work well with high-dimensional data,
since in such spaces samples are much more sparse.
3.1 DBSCAN: Example
Pixels are represented as 6-dimensional vectors (location + color) and segmented
using DBSCAN. The full study can be seen at Ye2003 in the course CD.
3.1 DBSCAN: Example
Parameter determination.
For MinPts a small number is usually employed.
For two-dimensional experimental data it has been shown that 4 is the most reasonable value.
Eps is more tricky, as we have seen. A possible solution:
3.1 DBSCAN: Parameter determination
The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
Noise points have the k-th nearest neighbor at farther distance
So, plot sorted distance of every point to its k-th nearest neighbor
[Figure: sorted k-th nearest neighbor distance plot, annotated with a reasonable Eps and a reasonable MinPts for 2D data]
This algorithm is rather simple, but it strongly depends on the parameters MinPts
and Eps. MinPts is usually a low number (for 2D data it has been experimentally
shown that 4 is a reasonable value). Eps can then be easily determined by sorting
the distance to the 4th closest point for every point: noise points tend to be far
from all the rest.
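The sorted k-th nearest neighbor distances described above can be computed with a short sketch like the one below. The function name `sorted_k_distances` and the toy data are assumptions; the plot itself is left out, and Eps would be read by eye at the point where the sorted curve rises sharply.

```python
import math

def sorted_k_distances(points, k=4):
    """Distance from every point to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, closest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists)

# Five evenly spaced points and one far outlier; k=2 because the set is tiny.
toy = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (100, 0)]
curve = sorted_k_distances(toy, k=2)
print(curve)  # [1.0, 1.0, 1.0, 2.0, 2.0, 97.0]
```

In this toy example the jump from 2 to 97 marks the outlier, so any Eps slightly above 2 would separate it from the rest.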
3.2 Grid clustering
Basic algorithm:
1. Define a set of grid cells
2. Compute the density of cells
3. Eliminate cells with a density smaller than a threshold
4. Form clusters from contiguous cells
Those wanting to know more about grid clustering, please read Hinneburg1999
from /Docs.
The basic algorithm for grid clustering is rather simple: form clusters with
contiguous dense cells. However, in this definition there are a number of
ambiguous things:
- How to define cells: regular/irregular grids, cell size (too large is not accurate,
too small may be empty)
- How to define the threshold: it depends on the cell size and the dimensionality of
the data
- What kind of adjacency is considered: for instance in 2D, 4 or 8 neighbours
Grid clustering is a basic idea of many other clustering algorithms: WaveCluster,
Bang, Clique, and Mafia.
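One possible way of resolving the ambiguities above is sketched below. The choices are assumptions made for illustration: a regular grid with square cells of side `cell_size`, density measured as the number of points in a cell, and adjacency through a shared face (4 neighbours in 2D). The function name `grid_clustering` is hypothetical.

```python
from collections import Counter

def grid_clustering(points, cell_size, threshold):
    """Map each dense cell (a tuple of integer indices) to a cluster id."""
    # Steps 1-2: regular grid; the density of a cell is its point count.
    counts = Counter(tuple(int(c // cell_size) for c in p) for p in points)
    # Step 3: eliminate cells whose density is smaller than the threshold.
    dense = {cell for cell, count in counts.items() if count >= threshold}
    # Step 4: contiguous dense cells (sharing a face) form one cluster.
    labels = {}
    cluster_id = 0
    for cell in sorted(dense):
        if cell in labels:
            continue
        labels[cell] = cluster_id
        stack = [cell]
        while stack:
            cur = stack.pop()
            for axis in range(len(cur)):
                for step in (-1, 1):
                    nb = cur[:axis] + (cur[axis] + step,) + cur[axis + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = cluster_id
                        stack.append(nb)
        cluster_id += 1
    return labels

pts = [(0.1, 0.1), (0.2, 0.8), (1.5, 0.5), (1.1, 0.2),
       (5.5, 5.5), (5.1, 5.9), (3.3, 3.3)]
cells = grid_clustering(pts, cell_size=1.0, threshold=2)
print(cells)
```

Here the cells (0, 0) and (1, 0) are contiguous and end up in the same cluster, the cell (5, 5) forms a second cluster, and the cell holding the single point (3.3, 3.3) is eliminated.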
3.3 DENCLUE: Definitions
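$$f^{D}_{kernel}(\mathbf{x}) = \sum_{i=1}^{n} f_{kernel}(\mathbf{x}, \mathbf{x}_i) = \sum_{i=1}^{n} e^{-\frac{dist(\mathbf{x}, \mathbf{x}_i)^2}{2\sigma^2}}$$
Influence function: $f_{kernel}(\mathbf{x}, \mathbf{x}_i)$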
This algorithm estimates the local density of the input data in a way very similar
to kernel probability density function estimators. The kernel, here called
influence function, is copied to each data position, yielding the density function.
Local maxima of the density function are called density attractors.
Those interested in the original paper of DENCLUE may read Hinneburg1998
from /Docs. Those interested in knowing more about probability density function
(PDF) estimators, please read Raykar2002 from /Docs.
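A minimal sketch of the density function follows, assuming a Gaussian influence function of width `sigma` and the Euclidean distance. The function names `influence` and `density` are chosen here for illustration and do not come from Hinneburg1998.

```python
import math

def influence(x, xi, sigma):
    """Gaussian influence of the data point xi at position x."""
    return math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Density function: the sum of the influences of all data points at x."""
    return sum(influence(x, xi, sigma) for xi in data)

samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0)]
near = density((0.1, 0.1), samples, sigma=1.0)  # inside the group of three
far = density((2.5, 2.5), samples, sigma=1.0)   # in the empty region between
print(near, far)
```

The density evaluated inside the group of three points is much larger than in the empty region, which is what makes the local maxima (density attractors) usable as cluster centers.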
3.3 DENCLUE: Clustering
Center-defined cluster: $\{\mathbf{x} \mid f^{D}_{kernel}(\mathbf{x}) \geq \xi\}$
Multicenter-defined cluster: $\{\mathbf{x} \mid f^{D}_{kernel}(\mathbf{x}) \geq \xi\}$
Multicenter-defined clusters are a set of center-defined clusters linked by a path of significance $\xi$
Generalizes hierarchical clustering!
Clusters are formed by a level of significance. To know more about the
connection between DENCLUE clustering and Level Set methods, please read
Yip from /Docs.
3.3 DENCLUE: Algorithm
1. Grid Data Set (use r = $\sigma$, the std. dev.)
2. Find (Highly) Populated Cells (use a threshold = $\xi_c$) (shown in blue)
3. Identify populated cells (+ nonempty cells)
4. Find Density Attractor pts, C*, using hill climbing:
   1. Randomly pick a point, $p_i$.
   2. Compute local density (use r = $4\sigma$)
   3. Pick another point, $p_{i+1}$, close to $p_i$, compute local density at $p_{i+1}$
   4. If LocDen($p_i$) < LocDen($p_{i+1}$), climb
   5. Put all points within distance $\sigma/2$ of path, $p_i$, $p_{i+1}$, C* into a density attractor cluster called C*
5. Connect the density attractor clusters, using a threshold, $\xi$, on the local densities of the attractors.
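The hill-climbing step can be illustrated with a strongly simplified sketch. It is not the procedure of Hinneburg1998: the climb moves only between data points instead of following the gradient, the grid and the final merging of attractors are left out, and the function names are hypothetical. It assumes a Gaussian kernel of width `sigma`, a local density computed over the points within 4 sigma, and candidate next points taken within one sigma.

```python
import math

def local_density(x, data, sigma):
    """Sum of Gaussian influences of the data points within 4*sigma of x."""
    return sum(
        math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2))
        for p in data
        if math.dist(x, p) <= 4 * sigma
    )

def climb(start, data, dens, sigma):
    """From data point index `start`, move to denser nearby points until stuck."""
    cur = start
    while True:
        # Candidate next points: data points within sigma of the current one.
        near = [j for j in range(len(data)) if math.dist(data[cur], data[j]) <= sigma]
        best = max(near, key=lambda j: dens[j])
        if dens[best] <= dens[cur]:
            return cur  # no denser point nearby: cur acts as the attractor
        cur = best

def attractor_of_each_point(data, sigma):
    """Index of the density attractor reached from every data point."""
    dens = [local_density(p, data, sigma) for p in data]
    return [climb(i, data, dens, sigma) for i in range(len(data))]

line = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
attractors = attractor_of_each_point(line, sigma=1.0)
print(attractors)  # [1, 1, 1, 4, 4, 4]
```

Points that reach the same attractor belong to the same density attractor cluster; in the toy data each group of three climbs to its middle point.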
3.3 DENCLUE: Examples
In the slide we show a couple of examples of how DENCLUE clusters data,
according to the algorithm presented in Hinneburg1998.
3.3 DENCLUE: Examples
In the slide we show a couple of examples of how DENCLUE clusters data,
according to the algorithm presented in Yip.
3.3 DENCLUE: Features
Dependence on the kernel width
It generalizes DBSCAN, K-means and Hierarchical Clustering
Very efficient implementation
DENCLUE has a few positive features; however, it is not free from drawbacks, such
as its dependence on a user-defined parameter (the kernel width).
3.4 DBC: More algorithms
Generalized DBSCAN: Any divergence function can be used and points within
a neighbourhood are weighted according to their similarity to the core point.
Fuzzy DBSCAN: Fuzzy distance between fuzzy input vectors
DBCLASD: Assumes uniform density, no parameters required
Recursive DBC: Adaptive change of DBSCAN parameters
WaveCluster: Use wavelets to determine multiresolution clusters.
OPTICS: equivalent to DBC with a wide range of parameters
Knn DBC: Assign cluster label taking into account the k nearest neighbours
KerdenSOM: Self-organizing structure on the density estimation
STING (STatistical INformation Grid): Quadtree space division, very efficient
Information Theoretic Clustering: Measure the distance between cluster
distributions using information theory.
To know more about:
- Generalized DBSCAN, please read Sander1998 from /Docs.
- Fuzzy DBSCAN, please read Kriegel2005 from /Docs.
- DBCLASD, please read Xu1998 from /Docs.
- Recursive DBC, please read Su2001 from /Docs.
- WaveCluster, please read Sheikholeslami1997 from /Docs.
- OPTICS, please read Ankerst1999 from /Docs.
- Knn DBC, please read Tran2003 from /Docs.
- KerdenSOM, please read Pascual2001 from /Docs.
- STING, please read Wang1997 from /Docs.