
Hierarchical clustering algorithm

Hierarchical clustering algorithms are of two types:


i) Agglomerative hierarchical clustering, or AGNES (agglomerative nesting), and
ii) Divisive hierarchical clustering, or DIANA (divisive analysis).
The two algorithms are exact reverses of each other, so we will cover agglomerative hierarchical
clustering in detail.

Agglomerative hierarchical clustering - This algorithm groups the data one pair at a time on the basis of
the smallest distance among all pairwise distances between the data points. After each merge the distances are
recalculated, but which distance should be used once groups have been formed? Several methods are available
for this. Some of them are:
1) single linkage (nearest distance).
2) complete linkage (farthest distance).
3) average linkage (average distance).
4) centroid distance.
5) Ward's method - the sum of squared Euclidean distances is minimized.
In this way we keep grouping the data until a single cluster is formed. The dendrogram can then be used to
decide how many clusters should actually be present.

Algorithmic steps for Agglomerative Hierarchical clustering


Let X = {x1, x2, x3, ..., xn} be the set of data points.
1) Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2) Find the least distance pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min
d[(i),(j)] where the minimum is over all pairs of clusters in the current clustering.

3) Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next
clustering m. Set the level of this clustering to L(m) = d[(r),(s)].

4) Update the distance matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and
adding a row and column corresponding to the newly formed cluster. The distance between the new cluster,
denoted (r,s), and an old cluster (k) is defined (for single linkage) as: d[(k), (r,s)] = min(d[(k),(r)], d[(k),(s)]).
5) If all the data points are in one cluster then stop, else repeat from step 2). A minimal R sketch of these steps
is given below.
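
To make these steps concrete, here is a minimal, illustrative R sketch of single-linkage agglomerative clustering
on a small one-dimensional toy data set. This example is ours (it is not part of the original description); in
practice you would simply call hclust(), as shown later in this document.

# Minimal single-linkage sketch of the steps above (illustrative only)
x <- c(1, 2, 6, 8, 11)                 # five toy data points
D <- as.matrix(dist(x))                # pairwise distance matrix
clusters <- as.list(seq_along(x))      # step 1: each point starts as its own cluster
levels <- numeric(0)                   # L(m), the level of each merge
while (length(clusters) > 1) {
  # step 2: find the pair of clusters (r), (s) with the least single-linkage distance
  best <- c(1, 2); best_d <- Inf
  for (r in 1:(length(clusters) - 1)) {
    for (s in (r + 1):length(clusters)) {
      d_rs <- min(D[clusters[[r]], clusters[[s]]])
      if (d_rs < best_d) { best_d <- d_rs; best <- c(r, s) }
    }
  }
  # steps 3-4: merge (r) and (s) and record the level L(m) = d[(r),(s)]
  levels <- c(levels, best_d)
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
}
levels  # merge heights; compare with hclust(dist(x), method = "single")$height
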
Divisive hierarchical clustering - It is simply the reverse of the agglomerative approach.

Advantages
1) No a priori information about the number of clusters is required.
2) Easy to implement and gives the best results in some cases.

Disadvantages
1) The algorithm can never undo what was done previously.
2) A time complexity of at least O(n² log n) is required, where n is the number of data points.
3) Depending on the distance measure chosen for merging, the algorithm can suffer from one or more of
the following:
i) Sensitivity to noise and outliers
ii) Breaking large clusters
iii) Difficulty handling different sized clusters and convex shapes
4) No objective function is directly minimized.
5) It is sometimes difficult to identify the correct number of clusters from the dendrogram.

Fig. I: Dendrogram formed from a data set of size N = 60


Hierarchical Clustering Essentials - Unsupervised Machine Learning

1 Required R packages

2 Algorithm

3 Data preparation and descriptive statistics

4 R functions for hierarchical clustering


4.1 hclust() function

4.2 agnes() and diana() functions

4.2.1 R code for computing agnes

4.2.2 R code for computing diana

5 Interpretation of the dendrogram

6 Cut the dendrogram into different groups

7 Hierarchical clustering and correlation based distance

8 What type of distance measures should we choose?

9 Comparing two dendrograms

9.1 Tanglegram

9.2 Correlation matrix between a list of dendrograms

10 Infos

There are two standard clustering strategies: partitioning methods (e.g., k-means and pam) and hierarchical
clustering.
Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in a dataset. It
does not require pre-specifying the number of clusters to be generated. The result is a tree-based representation
of the observations, called a dendrogram. It uses a pairwise distance matrix between observations as the
clustering criterion.
In this article we provide:

The description of the different types of hierarchical clustering algorithms

R lab sections with many examples for computing hierarchical clustering, and for visualizing and comparing
dendrograms

The interpretation of dendrograms

R code for cutting dendrograms into groups

1 Required R packages
The required packages for this chapter are:

cluster for computing agglomerative (agnes) and divisive (diana) hierarchical clustering

factoextra which will be used to visualize clusters

dendextend for comparing two dendrograms

1. Install the factoextra package as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
2. Install the cluster and dendextend packages as follows:

install.packages("cluster")
install.packages("dendextend")
3. Load the packages:

library(cluster)
library(dendextend)
library(factoextra)

2 Algorithm
Hierarchical clustering can be divided into two main types: agglomerative and divisive.
1. Agglomerative clustering: Also known as AGNES (Agglomerative Nesting), it works in a bottom-up
manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of
the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node).
This procedure is iterated until all points are members of a single big cluster (the root). The result is a
tree which can be plotted as a dendrogram.

2. Divisive hierarchical clustering: Also known as DIANA (Divisive Analysis), it works in a top-down
manner; the algorithm is the inverse order of AGNES. It begins with the root, in which all objects are
included in a single cluster. At each step of the iteration, the most heterogeneous cluster is divided into two.
The process is iterated until each object is in its own cluster.

Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at
identifying large clusters.

The merging or division of clusters is performed according to some (dis)similarity measure. In R software,
the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.
As we already know, it is easy to compute a dissimilarity measure between a pair of observations. It is mentioned
above that the two clusters that are most similar are fused into a new, bigger cluster.
A natural question is:
How do we measure the dissimilarity between two clusters of observations?
A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer
this question. The most common methods are:

Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements
in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these
dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.

Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in
cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage
criterion. It tends to produce long, loose clusters.

Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in
cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance
between the two clusters.

Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean
vector of length p variables) and the centroid for cluster 2.

Ward's minimum variance method: It minimizes the total within-cluster variance. At each step the pair
of clusters with the minimum between-cluster distance is merged.

Complete linkage and Ward's method are generally preferred (a short comparison is sketched below).
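
As a quick, hedged illustration of how the choice of linkage changes the merge heights, the toy example below
(our own, using only the base R functions dist() and hclust() introduced later in this article) computes the
height of the final merge for several linkage methods:

# Compare the height of the final merge for several linkage methods on toy data
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)   # 10 random points in 2 dimensions
d <- dist(toy)                       # Euclidean distances
sapply(c("single", "complete", "average", "centroid", "ward.D2"),
       function(m) max(hclust(d, method = m)$height))
# the heights differ by method, reflecting how the cluster-to-cluster distance is defined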

3 Data preparation and descriptive statistics


We'll use the built-in R data set USArrests, which contains statistics, in arrests per 100,000 residents, for assault,
murder, and rape in each of the 50 US states in 1973. It also includes the percent of the population living in urban
areas.
It contains 50 observations on 4 variables:

[,1] Murder numeric Murder arrests (per 100,000)

[,2] Assault numeric Assault arrests (per 100,000)

[,3] UrbanPop numeric Percent urban population

[,4] Rape numeric Rape arrests (per 100,000)

# Load the data set
data("USArrests")
# Remove any missing values (i.e., NA values for not available)
# that might be present in the data
df <- na.omit(USArrests)
# View the first 6 rows of the data
head(df, n = 6)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Before hierarchical clustering, we can compute some descriptive statistics:

desc_stats <- data.frame(
  Min = apply(df, 2, min),    # minimum
  Med = apply(df, 2, median), # median
  Mean = apply(df, 2, mean),  # mean
  SD = apply(df, 2, sd),      # standard deviation
  Max = apply(df, 2, max)     # maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
##           Min   Med  Mean   SD   Max
## Murder    0.8   7.2   7.8  4.4  17.4
## Assault  45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0  66.0  65.5 14.5  91.0
## Rape      7.3  20.1  21.2  9.4  46.0

Note that the variables have very different means and variances. This is explained by the fact that the variables
are measured in different units: Murder, Rape, and Assault are measured as the number of occurrences per 100,000
people, while UrbanPop is the percentage of the state's population that lives in an urban area.
They must be standardized (i.e., scaled) to make them comparable. Recall that standardization consists of
transforming the variables such that they have mean zero and standard deviation one. You can read more
about standardization in the following article: distance measures and scaling.
As we don't want the hierarchical clustering result to depend on an arbitrary variable unit, we start by scaling the
data using the R function scale() as follows:

df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

4 R functions for hierarchical clustering


There are different functions available in R for computing hierarchical clustering. The commonly used functions
are:

hclust() [in stats package] and agnes() [in cluster package] for agglomerative hierarchical clustering (HC)

diana() [in cluster package] for divisive HC

4.1 hclust() function


hclust() is the built-in R function [in stats package] for computing hierarchical clustering.
The simplified format is:

hclust(d, method = "complete")

d: a dissimilarity structure as produced by the dist() function.

method: the agglomeration method to be used. Allowed values are "ward.D", "ward.D2", "single",
"complete", "average", "mcquitty", "median" or "centroid".

The dist() function is used to compute the Euclidean distance between observations. Finally, observations are
clustered using Ward's method.

# Dissimilarity matrix
d <- dist(df, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)

4.2 agnes() and diana() functions


The R function agnes() [in the cluster package] can also be used to compute agglomerative hierarchical clustering.
The R function diana() [in the cluster package] is an example of divisive hierarchical clustering.

# Agglomerative Nesting (Hierarchical Clustering)
agnes(x, metric = "euclidean", stand = FALSE, method = "average")
# DIvisive ANAlysis Clustering
diana(x, metric = "euclidean", stand = FALSE)

x: a data matrix, data frame or dissimilarity matrix. In the case of a matrix or data frame, rows are
observations and columns are variables. In the case of a dissimilarity matrix, x is typically the output of
daisy() or dist().

metric: the metric to be used for calculating dissimilarities between observations. Possible values are
"euclidean" and "manhattan".

stand: if TRUE, the measurements in x are standardized before calculating the dissimilarities.
Measurements are standardized for each variable (column) by subtracting the variable's mean value and
dividing by the variable's mean absolute deviation.

method: the clustering method. Possible values include "average", "single", "complete" and "ward".

The function agnes() returns an object of class agnes (see ?agnes.object) which has methods for the
functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().

The function diana() returns an object of class diana (see ?diana.object) which also has methods for the
functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().

Compared to other agglomerative clustering methods such as hclust(), agnes() has the following features:

It yields the agglomerative coefficient (see agnes.object) which measures the amount of clustering
structure found

Apart from the usual tree it also provides the banner, a novel graphical display (see plot.agnes).

4.2.1 R code for computing agnes


library("cluster")
# Compute agnes()
res.agnes <- agnes(df, method = "ward")
# Agglomerative coefficient
res.agnes$ac
## [1] 0.934621
# Plot the tree using pltree()
pltree(res.agnes, cex = 0.6, hang = -1,
main = "Dendrogram of agnes")

It's also possible to draw the AGNES dendrogram using the function plot.hclust() or the function plot.dendrogram() as
follows:

# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6,
horiz = TRUE)

4.2.2 R code for computing diana

# Compute diana()
res.diana <- diana(df)
# Plot the tree
pltree(res.diana, cex = 0.6, hang = -1,
main = "Dendrogram of diana")

# Divisive coefficient; amount of clustering structure found
res.diana$dc
## [1] 0.8514345

As when plotting the AGNES dendrogram, the functions plot.hclust() and plot.dendrogram() can be used as follows:

# plot.hclust()
plot(as.hclust(res.diana), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.diana), cex = 0.6,
horiz = TRUE)

5 Interpretation of the dendrogram


In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree,
observations that are similar to each other are combined into branches, which are themselves fused at a
higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The
higher the height of the fusion, the less similar the observations are.

Note that conclusions about the proximity of two observations can be drawn only from the height at which the
branches containing those two observations are first fused. We cannot use the proximity of two observations along
the horizontal axis as a criterion of their similarity.
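
As a hedged illustration of this point (our own addition, reusing the res.hc object computed in section 4.1): the
merge heights stored in the hclust object are the fusion levels shown on the vertical axis, and the cophenetic
distance between two observations is the height at which the branches containing them are first fused.

# Fusion heights of the Ward dendrogram computed earlier
head(sort(res.hc$height))                # smallest heights = most similar merges
# Height at which two particular states are first fused
coph <- cophenetic(res.hc)               # cophenetic distances (stats package)
as.matrix(coph)["Alabama", "Louisiana"]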
In order to identify sub-groups (i.e. clusters), we can cut the dendrogram at a certain height as described in the
next section.

6 Cut the dendrogram into different groups


The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k
in k-means clustering.
The function cutree() is used and it returns a vector containing the cluster number of each observation:

# Cut tree into 4 groups


grp <- cutree(res.hc, k = 4)
# Number of members in each cluster
table(grp)
## grp
## 1 2 3 4
## 7 12 19 12
# Get the names for the members of cluster 1
rownames(df)[grp == 1]
## [1] "Alabama"
"Georgia"
"Louisiana"
## [5] "North Carolina" "South Carolina" "Tennessee"

"Mississippi"

It's also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to
specify the border colors for the rectangles:

plot(res.hc, cex = 0.6)
rect.hclust(res.hc, k = 4, border = 2:5)

Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Observations
are represented by points in the plot, using principal components. A frame is drawn around each cluster.

library(factoextra)
fviz_cluster(list(data = df, cluster = grp))

The function cutree() can also be used to cut the trees generated with agnes() and diana() as follows:

# Cut agnes() tree into 4 groups


cutree(res.agnes, k = 4)
# Cut diana() tree into 4 groups
cutree(as.hclust(res.diana), k = 4)

7 Hierarchical clustering and correlation based distance


The different functions for hierarchical clustering use the Euclidean distance as the default metric. It's also
possible to use correlation-based distance measures. First, the pairwise correlation matrix between items is
computed using the function cor(), which can calculate the "pearson", "spearman" or "kendall" correlation.
Next, the correlation matrix is converted to a distance matrix and, finally, clustering is computed on
the resulting distance matrix.

res.cor <- cor(t(df), method = "pearson")
d.cor <- as.dist(1 - res.cor)
plot(hclust(d.cor, method = "ward.D2"), cex = 0.6)

8 What type of distance measures should we choose?


The choice of dissimilarity measure is very important, as it has a strong influence on the resulting dendrogram.
In many of the examples described above, we used the Euclidean distance as the dissimilarity measure. Depending on
the type of data and the research questions, other dissimilarity measures might be preferred, such
as correlation-based distance.
Correlation-based distance considers two observations to be similar if their features are highly correlated, even
though the observed values may be far apart in terms of Euclidean distance.
If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then
we should use correlation-based distance as the dissimilarity measure. This is particularly the case in gene
expression data analysis, where we might want to consider genes similar when they are "up" and "down"
together. It is also the case in marketing if we want to identify groups of shoppers with the same preferences in
terms of items, regardless of the volume of items they bought.
If the Euclidean distance is chosen, then observations with high values of features will be clustered together. The
same holds true for observations with low values of features.

Note that, when the data are standardized, there is a functional relationship between the Pearson correlation
coefficient r(x, y) and the Euclidean distance.

With some maths, the relationship can be defined as follows:

d_euc(x, y) = sqrt(2m [1 - r(x, y)])

where x and y are two standardized m-vectors with zero mean and unit standard deviation.
For example, the standard k-means clustering uses the Euclidean distance measure. So, if you want to compute
k-means using a correlation-based distance, you just have to normalize the data points before clustering.
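
As a quick numerical check of this relationship (our own illustration; the vectors are standardized here with the
population standard deviation, i.e. dividing by sqrt(mean((v - mean(v))^2)), so that the identity holds exactly):

# Verify d_euc(x, y) = sqrt(2m [1 - r(x, y)]) for standardized vectors
set.seed(42)
m <- 100
x <- rnorm(m)
y <- 0.5 * x + rnorm(m)
std <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))  # population sd
xs <- std(x); ys <- std(y)
r <- cor(xs, ys)
sqrt(sum((xs - ys)^2))   # Euclidean distance between the standardized vectors
sqrt(2 * m * (1 - r))    # the same value, obtained from the formula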

9 Comparing two dendrograms


We'll use the package dendextend, which contains many functions for comparing two dendrograms,
including: dend_diff(), tanglegram(), entanglement(), all.equal.dendrogram() and cor.dendlist().
The functions tanglegram() and cor.dendlist() are described in this section.
A random subset of the data set will be used in the following example. The function sample() is used to randomly
select 10 observations among the 50 observations contained in the data set.

# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- df[ss, ]

In the R code below, we'll start by computing the pairwise distance matrix using the function dist(). Next, hierarchical
clustering (HC) is computed using two different linkage methods (average and ward.D2). Finally, the results of
HC are transformed into dendrograms:

library(dendextend)
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# Create a list of dendrograms
dend_list <- dendlist(dend1, dend2)

9.1 Tanglegram
The function tanglegram() plots two dendrograms, side by side, with their labels connected by lines. It can be
used for visually comparing two methods of hierarchical clustering as follows:

tanglegram(dend1, dend2)

Note that unique nodes, with a combination of labels/items not present in the other tree, are highlighted with
dashed lines.
The quality of the alignment of the two trees can be measured using the function entanglement(). The output
of tanglegram() can be customized using many other options as follows:

tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)

Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement
coefficient corresponds to a better alignment.
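
As a hedged follow-up (our own example): dendextend also provides untangle(), which searches for rotations of
the two trees that reduce the entanglement score before plotting.

# Try to improve the alignment before plotting
entanglement(dend_list)                                  # score for the current layout
dend_list2 <- untangle(dend_list, method = "step1side")  # rotate one tree to reduce entanglement
entanglement(dend_list2)                                 # usually lower (better) after untangling
tanglegram(dend_list2)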

9.2 Correlation matrix between a list of dendrograms


The function cor.dendlist() is used to compute the Baker or cophenetic correlation matrix between a list of
trees.

# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##           [,1]      [,2]
## [1,] 1.0000000 0.9646883
## [2,] 0.9646883 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##           [,1]      [,2]
## [1,] 1.0000000 0.9622885
## [2,] 0.9622885 1.0000000
The correlation between two trees can also be computed as follows:

# Cophenetic correlation coefficient

cor_cophenetic(dend1, dend2)
## [1] 0.9646883
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9622885

It's also possible to compare multiple dendrograms simultaneously. The chaining operator %>% (available
in dendextend) is used to run multiple functions in sequence. It's useful for simplifying the code:

# Reuse the 10-observation subset of the scaled data (df) created above
# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("com") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("ave") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.76    0.99     0.75
## Single       0.76   1.00    0.80     0.84
## Average      0.99   0.80    1.00     0.74
## Centroid     0.75   0.84    0.74     1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
corrplot(cors, "pie", "lower")

10 Infos
This analysis has been performed using R software (ver. 3.2.1).
