
Information Sciences 179 (2009) 3583–3602


Performance evaluation of density-based clustering methods


Ramiz M. Aliguliyev *
Institute of Information Technology of Azerbaijan National Academy of Sciences, Department of Artificial Intelligence and Computer Sciences,
9, F. Agayev Street, Baku AZ1141, Azerbaijan

Article info

Article history: Received 2 April 2008; Received in revised form 12 May 2009; Accepted 4 June 2009

Keywords: Text mining; Partitional clustering; Density-based clustering methods; Validity indices; Modified DE algorithm

Abstract

With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices, three internal validity indices and seven external validity indices, were used. The internal validity indices were used for evaluating the within-cluster scatter and between-cluster separation. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the "ground truth" results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the obtained solution will be feasible.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

With the rapid development of the World Wide Web, we are facing an increasing volume of electronic documents, such as
news articles and scientific papers. This explosion of electronic documents has made it difficult for users to extract useful
information from them. Document clustering is receiving more and more attention as an important and fundamental tech-
nique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering
[30,34,49]. It is an effective tool to manage information overload. By clustering similar documents together, we can quickly
browse document collections, easily grasp their distinct topics and subtopics, and efficiently query them, among many other
applications.

* Tel.: +994 12 439 01 67; fax: +994 12 439 61 21.


E-mail addresses: a.ramiz@science.az, aramiz@iit.ab.az

doi:10.1016/j.ins.2009.06.012

Generally speaking, clustering can be defined as partitioning of a given dataset into clusters, in such a way that data
points belonging to the same cluster are as similar to each other as possible, whereas data points from two different clusters
are separated by the maximum difference [18–20]. Document clustering has been investigated as a fundamental operation in
many areas, such as data mining [29], information retrieval (IR) [10,27,33], topic detection [8], data streams [16], and as a
preprocessing step for other tasks, such as text summarization [2,3,5,7,22].
Clustering can be performed in two different modes [18,23,24], hard and soft. In hard clustering the clusters are disjoint
and non-overlapping in nature. Any pattern may belong to one and only one class in this case. In the case of soft clustering, a
pattern may belong to any or all of the classes with various membership grades. In this paper, we deal with the hard clus-
tering problem.
We introduce a number of weighted criterion functions, which can be classified into three groups: internal, external and
hybrid. In particular, we evaluate a total of twelve criterion functions that measure various aspects of intra-cluster similarity,
inter-cluster dissimilarity, and their combinations. We developed a modified DE algorithm to optimize the criterion func-
tions. The proposed methods were experimentally evaluated on different datasets using various validity indices and metrics.
For the evaluation, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are
currently the most widely used benchmarks in document clustering research.
The rest of the paper is organized as follows. Section 2 describes a review of the existing clustering methods. Section 3
introduces the proposed clustering methods. Section 4 presents the modified DE to optimize the clustering methods. Section
5 describes the validity indices that were used for evaluation of the quality of the clustering results. We report experimental
results in Section 6, and conclude the paper in Section 7.

2. Brief review of the existing clustering methods

Generally, clustering methods can be categorized into hierarchical methods and partitioning methods. Within each of the
types, there exists a wealth of subtypes and different algorithms for constructing the clusters. An extensive survey of various
clustering techniques can be found in the literature [25,29,32]. Here, we focus on reviewing the partitional clustering tech-
niques, which are the most directly related to our work.
Partitional clustering algorithms decompose the dataset into a number of disjoint clusters that are usually optimal in
terms of some predefined criterion functions. For instance, k-means is a typical partitioning method that aims to minimize
the sum of the squared distance between the data points and their corresponding cluster centers [25,29,32]. Many criterion
functions have been proposed in the literature [1,4,6,48,51] for producing balanced partitions. Their objective is to maximize
the intra-cluster connectivity (compactness) while minimizing inter-cluster connectivity (separability). Zhao and Karypis
[51] evaluated the performance of different criterion functions in the context of partitional clustering algorithms of docu-
ment datasets. This study involved a total of seven different criterion functions. The methods proposed in [4,6] satisfy homo-
geneity within-clusters as well as separability between the clusters. A novel mixed-integer nonlinear programming-based
clustering algorithm, the global optimal search with enhanced positioning, is presented in [48]. This algorithm is significant
in that it is able to progressively identify and weed out outlier data points.
The k-means method [25,29,32] is a commonly used partitioning algorithm in document clustering and other related re-
search areas. This method is based on the idea of a center point that can represent a cluster. Given n objects, the method first
selects k objects as the initial k clusters. Then it iteratively assigns each object to the most similar cluster based on the mean
value of the objects in each cluster. There are many variations of the k-means method [11,31,36]. Bagirov [11] proposed a
new version of the global k-means algorithm, a modified global k-means algorithm that computes clusters incrementally,
and computes the k-partition of a data set by using k − 1 cluster centers from the previous iteration. An important step
in this algorithm is the computation of a starting point for the kth cluster center. This starting point is computed by mini-
mizing the so-called auxiliary cluster function. A new k-means type algorithm called W–k-means [31] automatically weights
the variables based on their importance in clustering. W–k-means adds a new step to the basic k-means algorithm to update
the variable weights on the current partition of data. Based on the current partition in the iterative k-means clustering pro-
cess, the algorithm calculates a new weight for each variable based on the variance of the within-cluster distances. The new
weights are used to decide the cluster memberships of the objects in the next iteration. The weights can be used to identify
important variables for clustering and the variables that are likely to contribute noise to the clustering process and can be
removed from the data in the future analysis.
Two new text document clustering algorithms, called clustering based on frequent word sequences (CFWS) and clustering
based on frequent word meaning sequences (CFWMS), are proposed in [38]. Unlike the traditional VSM, these models utilize
the sequential patterns of the words in the document. Frequent word sequences discovered from the document set can
represent the topics covered by the documents very well, and the documents containing the same frequent word sequences
are clustered together in these algorithms. Li et al. [40] proposed a new text-clustering algorithm, called text clustering with
feature selection (TCFS), that performs a supervised feature selection during the clustering process. The selected features
improve the quality of clustering iteratively, and as the clustering process converges, the clustering result has higher
accuracy.
The problem of partitional clustering has been approached from diverse fields of knowledge such as graph theory
[15,21,37,49], neural networks [31], genetic algorithms (GA) [4,6,12,36,42], particle swarm optimization (PSO) [19], ant algo-
rithm [9], and differential evolution [1,18,19]. In the evolutionary approach, clustering a dataset is viewed as an optimization

problem and solved using an evolutionary search heuristic such as GA [4,12,36,42], DE [1,18,19] or PSO [20]. Das et al. [20]
present a novel, modified PSO-based strategy for the hard clustering of complex data. A new DE-based strategy for hard clus-
tering of real-world data sets was presented in [19]. An important feature of the proposed techniques [18–20] is that they are
able to find the optimal number of clusters automatically (that is, the number of clusters does not have to be known in ad-
vance) for complex and linearly non-separable datasets. A clustering algorithm called clustering with local and global reg-
ularization (CLGR) [49] preserves the merit of local-learning algorithms and spectral clustering. Spectral clustering
formulates clustering as a graph partitioning problem. The optimal partition is approximated by eigenvectors of a properly
normalized affinity matrix of the graph. The relationships between spectral partitioning methods and kernel k-means are
discussed in [21]. This study showed that a weighted form of the kernel k-means objective is mathematically equivalent
to a general, weighted graph partitioning objective. The key contribution of the graph-based relaxed (GBR) algorithm [37]
is its very simple implementation using the existing optimization packages. In [9], a new model (called AntTree) was pre-
sented for data clustering, which was inspired by the self-assembly behavior of real ants. In [4], a fast GA was proposed
for document clustering, in which the penalty function was introduced to accelerate the convergence. In [12], a new sym-
metry based genetic clustering algorithm, VGAPS, was proposed, which automatically evolves the number of clusters as well
as the proper partitioning from a data set. The algorithm genetic algorithm k-means logarithmic regression expectation max-
imization (GAKREM) [42] combines the best characteristics of the k-means and EM algorithms but avoids their weaknesses.
The novelty of GAKREM is that in each evolving generation it efficiently approximates the log-likelihood for each chromo-
some using logarithmic regression, instead of running the conventional EM algorithm until it converges.

3. Clustering documents

The standard clustering technique consists of the following steps: (1) feature selection and data representation model, (2)
similarity measure selection, (3) clustering model, (4) clustering algorithm that generates the clusters using the data model
and the similarity measure, (5) validation [27].

3.1. Document representation and term weighting scheme

Let D = (D_1, D_2, ..., D_n) be a collection of documents and T = (T_1, T_2, ..., T_m) be the complete vocabulary set of the document collection D, where n is the number of documents and m is the number of unique terms. There are several ways to model a text document. We apply the vector space model, widely used in IR and text mining [10], to represent the text documents. In this model each document D_i is represented by a point in an m-dimensional vector space, D_i = (w_{i1}, w_{i2}, ..., w_{im}), i = 1, ..., n, where the dimension is the same as the number of terms in the document collection. Each
component of such a vector reflects a term connected with the given document. The value of each component depends on
the degree of relationship between its associated term and the respective document. Many schemes have been proposed for
measuring this relationship. Term weighting is the process of calculating the degree of relationship (or association) between
a term and a document. One of the more advanced term weighting schemes is the tf-idf (term frequency-inverse document
frequency) [10]. The tf-idf scheme aims at balancing the local and the global term occurrences in the documents. In this
scheme

$$ w_{ij} = n_{ij} \cdot \log\left( \frac{n}{n_j} \right), \qquad (1) $$

where n_{ij} is the term frequency, and n_j denotes the number of documents in which term T_j appears. The term log(n/n_j), which is often called the idf factor, defines the global weight of the term T_j. Indeed, when a term appears in all documents in the collection, then n_j = n, and thus the balanced term weight is 0, indicating that the term is useless as a document discriminator. The idf factor has been introduced to improve the discriminating power of terms in traditional information retrieval.
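For concreteness, the tf-idf weighting of Eq. (1) can be sketched as follows; this is a minimal illustration, and the variable names (`tfidf_weights`, `doc_freq`) are ours, not the paper's.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_ij = n_ij * log(n / n_j) for a list of tokenized documents (Eq. (1))."""
    n = len(docs)
    # n_j: number of documents in which term T_j appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)   # n_ij: raw term frequency in document D_i
        weights.append({t: tf * math.log(n / doc_freq[t]) for t, tf in term_counts.items()})
    return weights

# Example: three tiny tokenized "documents"
docs = [["web", "clustering", "web"], ["clustering", "documents"], ["web", "documents"]]
print(tfidf_weights(docs))
```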

3.2. The cosine measure

As mentioned above, clustering is the process of recognizing natural groups or clusters in multi-dimensional data based
on some similarity measures. Hence, defining an appropriate similarity measure plays a fundamental role in clustering
[30,35]. A variety of similarity or distance measures have been proposed and widely applied, such as cosine similarity,
Euclidean distance and the Jaccard correlation coefficient [30,35].
The cosine measure is one of the most popular similarity measures applied to text documents. This measure computes the
cosine of the angle between two feature vectors and is used frequently in text mining. The cosine similarity between two
documents Di and Dl is calculated as
$$ \operatorname{sim}(D_i, D_l) = \cos(D_i, D_l) = \frac{\sum_{j=1}^{m} w_{ij} w_{lj}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2 \cdot \sum_{j=1}^{m} w_{lj}^2}}, \quad i, l = 1, \ldots, n. \qquad (2) $$
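A minimal sketch of the cosine measure of Eq. (2), assuming documents are already represented as tf-idf vectors of equal length:

```python
import numpy as np

def cosine_sim(d_i, d_l):
    """Cosine similarity between two document vectors (Eq. (2))."""
    d_i, d_l = np.asarray(d_i, dtype=float), np.asarray(d_l, dtype=float)
    denom = np.linalg.norm(d_i) * np.linalg.norm(d_l)
    return float(d_i @ d_l / denom) if denom > 0 else 0.0

print(cosine_sim([0.2, 0.0, 0.5], [0.1, 0.3, 0.4]))   # -> about 0.80
```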

3.3. The proposed clustering methods

The hard clustering problem can be defined as follows [25,29,32]. A clustering C is a partition of a dataset D into mutually disjoint subsets C_1, C_2, ..., C_k, called clusters, such that C_p ∩ C_q = ∅ for p ≠ q (i.e., two different clusters should have no documents in common) and ∪_{p=1}^{k} C_p = D (i.e., each document should definitely belong to a cluster). We also assume that C_q ≠ ∅ and C_q ⊂ D for all q = 1, ..., k, i.e., each cluster should have at least one document assigned to it and must not contain all documents. In other words, k represents the number of non-empty clusters.
As is well known, the clustering quality depends on the clustering technique and the data structure. Different methods
may show different qualities of compactness and separability of clusters; some methods can provide a higher level of com-
pactness and a lower level of separability, and others vice versa, and some may balance compactness and separability of
clusters.
The main drawback of most clustering methods is that they do not consider the data structure. They assume that the data
are independent and are identically distributed samples generated from an unknown probability density function. Therefore,
they cannot find the clusters of an arbitrary shape. In this paper, there is an attempt to find the clusters of arbitrary shape. To
find the clusters of arbitrary shape, it is necessary to have some prior knowledge about the distribution of points. In this pa-
per, in order to have a priori knowledge about the distribution of points, we define their relative positions in a dataset. The
relative position of a point is defined by a measure of affinity between the point and the center of a dataset. It defines a con-
centration degree of points around the center of the entire collection. In this paper, to verify the robustness of the suggested
approach, experiments were conducted on the datasets with varying numbers of clusters, documents and terms. To evaluate
the quality of a clustering solution, a wide spectrum of indices – three internal validity indices and seven external validity
indices – were used. Internal indices evaluate the compactness and separation of the clusters, and external validity indices
compare a clustering solution to a true clustering. Experiments showed that our approach significantly improves the clus-
tering quality.
In general, our study involves the twelve criterion functions given below. These criterion functions have been proposed in the context of partitional clustering algorithms. Our goal is to find a clustering solution that satisfies not only the homogeneity and separability of clusters, but also the separability of the clusters from the entire collection. For this purpose, in our clustering methods a weight is assigned to each document, defining its position in the document collection.
Thus, let each document D_i be associated with a positive weight a_i, which is defined as follows:

$$ a_i = \frac{\operatorname{sim}(D_i, O)}{\sum_{l=1}^{n} \operatorname{sim}(D_l, O)}, \quad i = 1, \ldots, n, \qquad (3) $$

where O is the center of the document collection D, and the jth coordinate o_j of the center O is calculated as o_j = (1/n) \sum_{i=1}^{n} w_{ij}, j = 1, ..., m. This weight defines the degree of relative similarity of the document D_i to the center of D, i.e., it defines the position of the document D_i relative to the center of the entire collection D. It is not difficult to see that

$$ \sum_{i=1}^{n} a_i = 1. \qquad (4) $$

For each cluster C_p, we define the weighted cardinality |C_p|_a to be

$$ |C_p|_a = \sum_{D_i \in C_p} a_i. \qquad (5) $$

If the weights a_i are all equal to 1, then (5) is identical to the standard definition of set cardinality, i.e., the size of the cluster C_p. From formula (4), it follows that \sum_{p=1}^{k} |C_p|_a = 1.
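A small sketch of the document weights of Eq. (3) and the weighted cardinalities of Eq. (5), assuming a dense tf-idf matrix and cosine similarity; `labels` is an illustrative hard assignment, not the paper's notation.

```python
import numpy as np

def document_weights(W):
    """Weights a_i of Eq. (3): relative cosine similarity of each document to the collection center O."""
    O = W.mean(axis=0)                                   # center of the collection, o_j = (1/n) sum_i w_ij
    sims = (W @ O) / (np.linalg.norm(W, axis=1) * np.linalg.norm(O) + 1e-12)
    return sims / sims.sum()                             # normalize so that sum_i a_i = 1 (Eq. (4))

def weighted_cardinalities(a, labels, k):
    """|C_p|_a of Eq. (5): sum of the weights of the documents assigned to each cluster."""
    return np.array([a[labels == p].sum() for p in range(k)])

W = np.random.rand(6, 4)                                 # 6 toy documents over 4 terms
a = document_weights(W)
labels = np.array([0, 0, 1, 1, 2, 2])                    # illustrative hard assignment into k = 3 clusters
print(a, weighted_cardinalities(a, labels, k=3))         # the cardinalities sum to 1
```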
The weighted criterion functions introduced in this section can be classified into three groups: internal, external and
hybrid.
The internal criterion functions focus on producing a clustering solution that optimizes a function defined only over the
documents of each cluster, without taking into account the documents assigned to different clusters:
$$ F_1^a = \sum_{p=1}^{k} |C_p|_a \sum_{D_i, D_l \in C_p} \frac{\operatorname{sim}(D_i, D_l)}{a_i a_l} \rightarrow \max, \qquad (6) $$

$$ F_2^a = \sum_{p=1}^{k} |C_p|_a \sum_{D_i \in C_p} \frac{\operatorname{sim}(D_i, O_p)}{a_i} \rightarrow \max, \qquad (7) $$
where O_p denotes the centroid of the cluster C_p, O_p = (1/|C_p|) \sum_{D_i \in C_p} D_i, and |C_p| is the number of documents belonging to cluster C_p.
The F_1^a criterion function (6) maximizes the sum of the pairwise similarities between the documents assigned to each cluster, taking into account their positions in the dataset defined by (3). The F_2^a criterion function (7) is a weighted version of the k-means algorithm. Comparing F_2^a and the k-means algorithm, we see that the essential difference between them is that in this algorithm the contribution of each cluster is weighted proportionally to its weighted cardinality and the contribution of each document is weighted inversely to its weight. In this algorithm, each cluster is represented by its centroid vector, and the goal is to find the solution that maximizes the similarity between each document and the centroid of the cluster to which it is assigned, taking into account the position of the document in the dataset. The functions F_1^a and F_2^a not only provide compactness of the clusters but also separation of the clusters from the entire collection.
The external criterion functions derive the clustering solution by focusing on optimizing a function that is based on how
the various clusters are different from the entire collection and from each other:

$$ F_3^a = \sum_{p=1}^{k} |C_p|_a \operatorname{sim}(O_p, O) \rightarrow \min, \qquad (8) $$

$$ F_4^a = \sum_{p=1}^{k-1} \sum_{q=p+1}^{k} |C_p|_a |C_q|_a \operatorname{sim}(O_p, O_q) \rightarrow \min. \qquad (9) $$

The F_3^a criterion function (8) computes the clustering by finding a solution that separates the documents of each cluster from the entire collection. Specifically, the F_3^a criterion function tries to minimize the similarity between the centroid vector of each cluster and the centroid vector of the entire collection. The contribution of each cluster is weighted proportionally to its weighted cardinality, so that larger clusters will be weighted higher in the overall clustering solution. The F_4^a criterion function (9) computes the clustering by finding a solution that separates each cluster from the other clusters.
The following hybrid criterion functions simultaneously optimize multiple individual criterion functions:

$$ F_5^a = \frac{F_3^a + F_4^a}{F_1^a} \rightarrow \min, \qquad (10) $$

$$ F_6^a = \frac{F_3^a + F_4^a}{F_2^a} \rightarrow \min. \qquad (11) $$

The F_5^a and F_6^a criterion functions (10), (11) are obtained by combining the criterion functions (6)–(9). Since F_1^a and F_2^a are maximized, the F_5^a and F_6^a criterion functions need to be minimized as they are inversely related to F_1^a and F_2^a, respectively. The F_5^a and F_6^a criterion functions (10), (11) measure the quality of the overall clustering solution by taking into account the separation between clusters and the entire collection, the separation between clusters, and the tightness of each cluster.
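As an illustration, the sketch below evaluates the weighted criterion functions F_2^a, F_3^a, F_4^a and the hybrid F_6^a of Eqs. (7)–(9) and (11) for one hard assignment; the helper names and the toy data are our assumptions, not part of the paper.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def weighted_criteria(W, labels, k):
    """Evaluate F2^a, F3^a, F4^a and F6^a (Eqs. (7)-(9), (11)) for one clustering."""
    n = len(W)
    O = W.mean(axis=0)                                        # center of the whole collection
    sims_to_O = np.array([cos(W[i], O) for i in range(n)])
    a = sims_to_O / sims_to_O.sum()                           # document weights, Eq. (3)
    card_a = np.array([a[labels == p].sum() for p in range(k)])   # |C_p|_a, Eq. (5)
    centroids = np.array([W[labels == p].mean(axis=0) for p in range(k)])

    # F2^a (Eq. (7)): weighted similarity of documents to their own centroid (to be maximized)
    F2a = sum(card_a[p] * sum(cos(W[i], centroids[p]) / a[i]
                              for i in np.where(labels == p)[0]) for p in range(k))
    # F3^a (Eq. (8)): separation of clusters from the collection center (to be minimized)
    F3a = sum(card_a[p] * cos(centroids[p], O) for p in range(k))
    # F4^a (Eq. (9)): pairwise separation between cluster centroids (to be minimized)
    F4a = sum(card_a[p] * card_a[q] * cos(centroids[p], centroids[q])
              for p in range(k - 1) for q in range(p + 1, k))
    # F6^a (Eq. (11)): hybrid criterion (to be minimized)
    return F2a, F3a, F4a, (F3a + F4a) / F2a

W = np.random.rand(8, 5)                 # 8 toy documents over 5 terms
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
print(weighted_criteria(W, labels, k=3))
```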
To show the efficiency of the assignment of weights to documents, we shall compare the criterion functions (6)–(11) with
the following criterion functions:

$$ F_1 = \sum_{p=1}^{k} |C_p| \sum_{D_i, D_l \in C_p} \operatorname{sim}(D_i, D_l) \rightarrow \max, \qquad (12) $$

$$ F_2 = \sum_{p=1}^{k} |C_p| \sum_{D_i \in C_p} \operatorname{sim}(D_i, O_p) \rightarrow \max, \qquad (13) $$

$$ F_3 = \sum_{p=1}^{k} |C_p| \operatorname{sim}(O_p, O) \rightarrow \min, \qquad (14) $$

$$ F_4 = \sum_{p=1}^{k-1} \sum_{q=p+1}^{k} |C_p| |C_q| \operatorname{sim}(O_p, O_q) \rightarrow \min, \qquad (15) $$

$$ F_5 = \frac{F_3 + F_4}{F_1} \rightarrow \min, \qquad (16) $$

$$ F_6 = \frac{F_3 + F_4}{F_2} \rightarrow \min. \qquad (17) $$

We shall call the criterion functions (12)–(17) the unweighted versions of the criterion functions (6)–(11). The criterion functions (12)–(17) can be obtained from (6)–(11) under the assumption a_i = a > 0 for every i (i = 1, ..., n). Note that F_3 is the criterion function E_1 proposed in [51]. Comparing F_2 and the k-means algorithm, we see that the essential difference between them is that in method F_2 the contribution of each cluster is weighted proportionally to its cardinality. We shall also call this function the weighted version of the k-means method.

4. DE-based clustering algorithm

In our study, the criterion functions are optimized using the DE [45]. In clustering research, it is possible to view the clus-
tering problem as an optimization problem that locates the optimal centroids of the clusters rather than finding an optimal
partition. The evolutionary algorithms differ mainly in their representations of parameters (usually binary strings are used
for genetic algorithms while parameters are real-valued for evolution strategies and DE) and in their evolutionary operators.

4.1. Chromosome representation

DE, like other evolutionary algorithms, begins with a randomly initialized population of multi-dimensional real-coded
chromosomes. To represent the ath chromosome of the population at the current generation (at time t), we use the following
notation:
$$ X_a(t) = [x_{a,1}(t), x_{a,2}(t), \ldots, x_{a,m_k}(t)], \qquad (18) $$

where m_k = m · k, a = 1, ..., N, and N is the size of the population.
Each chromosome forms a candidate solution to the multi-dimensional optimization problem. It is a sequence of real numbers representing the k cluster centers. For an m-dimensional space, the length of a chromosome is m · k, where the first m positions (or genes) represent the m dimensions of the first cluster center, the next m genes represent the second cluster center, and so on. For example, let m = 2 and k = 4, i.e., the space is two-dimensional and the number of clusters is four. Then the chromosome X = [0.16, 0.37, 0.23, 0.75, 0.82, 0.26, 0.94, 0.68] represents the four cluster centers (0.16, 0.37), (0.23, 0.75), (0.82, 0.26) and (0.94, 0.68).
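A minimal sketch of this encoding, assuming a flat NumPy array as the chromosome; the function name `decode_centers` is illustrative.

```python
import numpy as np

def decode_centers(chromosome, k, m):
    """Interpret a flat chromosome of length m*k as k cluster centers in an m-dimensional space."""
    return np.asarray(chromosome, dtype=float).reshape(k, m)

# Example from the text: m = 2, k = 4
X = [0.16, 0.37, 0.23, 0.75, 0.82, 0.26, 0.94, 0.68]
print(decode_centers(X, k=4, m=2))   # rows are the four cluster centers
```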

4.2. Population initialization

At the initial stage each chromosome randomly chooses k different document vectors from the document collection
D = [D_1, D_2, ..., D_n] as the initial cluster centroid vectors. This process is repeated for each of the N chromosomes in the
population.

4.3. Fitness functions

We define the fitness functions according to the objective functions (6)–(11) as follows:

$$ f_1^a(X) = \frac{1}{F_1^a(X)}, \qquad (19) $$

$$ f_2^a(X) = \frac{1}{F_2^a(X)}, \qquad (20) $$

$$ f_3^a(X) = F_3^a(X), \qquad (21) $$

$$ f_4^a(X) = F_4^a(X), \qquad (22) $$

$$ f_5^a(X) = F_5^a(X), \qquad (23) $$

$$ f_6^a(X) = F_6^a(X), \qquad (24) $$

so that their minimization leads to the maximization (or minimization) of the criterion functions (6)–(11), respectively. We define the fitness functions corresponding to the criterion functions (12)–(17) in a similar way.

4.4. The basic DE algorithm

DE is based on a mutation operator, which adds an amount obtained by the difference of two randomly chosen individ-
uals of the current population, in contrast to most of the evolutionary algorithms, in which the mutation operator is defined
by a probability function. The basic algorithm of DE [45] is shown in Fig. 1.

Fig. 1. Pseudo-code of the basic DE algorithm.



The scaling factor λ ∈ [0, 1] and the crossover probability pr ∈ [0, 1] are the control parameters of DE, which are set by the user. The values rnd_s are uniformly distributed random numbers within the range [0, 1], chosen once for each s ∈ {1, ..., m_k}, and x_{r,s}(t) is the sth decision variable of the rth chromosome in the population. F(·) is the objective function to be minimized.

4.5. Crossover operator

The crossover operator for the chromosome of the current best solution X_b(t) randomly chooses two other chromosomes X_a(t) and X_c(t) (b ≠ a ≠ c) from the same generation. Then it calculates the weighted difference p(X_b(t) − X_a(t)) + (1 − p)(X_b(t) − X_c(t)) and creates a trial offspring chromosome by adding the result to the chromosome X_b(t) scaled by the factor λ_c. Thus, for the sth gene y_{b,s}(t + 1), s = 1, 2, ..., m_k, of the child chromosome Y_b(t + 1), we have

$$ y_{b,s}(t+1) = \begin{cases} \operatorname{MaxMin}\big( \lambda_c x_{b,s}(t) + p_s (x_{b,s}(t) - x_{a,s}(t)) + (1 - p_s)(x_{b,s}(t) - x_{c,s}(t)) \big), & \text{if } h_s < pr_c, \\ x_{b,s}(t), & \text{otherwise.} \end{cases} \qquad (25) $$
The scaling factor λ_c ∈ [0.5, 1.0] and the crossover constant pr_c ∈ [0, 1] are control parameters, which are set by the user. The values p_s and h_s are uniformly distributed random numbers within the range [0, 1], chosen once for each s ∈ {1, ..., m_k}.
We define the function MaxMin(x_s(t)) in (25) as

$$ \operatorname{MaxMin}(x_s(t)) = \begin{cases} x_s^{\min}(t) + d_s(t), & \text{if } x_s(t) \le x_s^{\min}(t), \\ x_s(t), & \text{if } x_s^{\min}(t) < x_s(t) < x_s^{\max}(t), \\ x_s^{\max}(t) - d_s(t), & \text{if } x_s(t) \ge x_s^{\max}(t), \end{cases} \qquad (26) $$

where x_s^{min}(t) = min_{a∈{1,2,...,N}} {x_{a,s}(t)}, x_s^{max}(t) = max_{a∈{1,2,...,N}} {x_{a,s}(t)}, and d_s(t) = (x_s^{max}(t) − x_s^{min}(t))/n.
DE uses the principle of "survival of the fittest" in its selection process, which may be expressed as:

$$ X_b(t+1) = \begin{cases} Y_b(t+1), & \text{if } f_z^a(Y_b(t+1)) < f_z^a(X_b(t)), \\ X_b(t), & \text{otherwise,} \end{cases} \qquad (27) $$

where the fitness functions f_z^a(·), z = 1, 2, ..., 6, are defined by (19)–(24).

4.6. Mutation operator

The mutation operation for the target chromosome X_b(t) is performed according to the following rule:

$$ y_{b,s}(t+1) = \begin{cases} \operatorname{MaxMin}\big( \lambda_m x_{b,s}(t) + q_s (x_{b,s}(t) - x_{b,q}(t)) + (1 - q_s)(x_{b,s}(t) - x_{b,r}(t)) \big), & \text{if } g_s < pr_m, \\ x_{b,s}(t), & \text{otherwise,} \end{cases} \qquad (28) $$

with distinct random integer indices s, q, r ∈ {1, ..., m_k}, s ≠ q ≠ r. The control parameters, the scaling factor λ_m ∈ [0.5, 1.0] and the mutation constant pr_m ∈ [0, 1], are set by the user. The values q_s and g_s are uniformly distributed random numbers within the range [0, 1], chosen once for each s ∈ {1, ..., m_k}. If the mutant chromosome yields a better value of the fitness function, it replaces its parent in the next generation; otherwise the parent remains in the population.
We explain the objective of using the MaxMin(x(t)) function in the crossover (25) and mutation (28) operations. Given the nature of the problem, after applying the algorithm we must have a solution whose coordinates belong to the interval [x_s^{min}, x_s^{max}], s ∈ {1, ..., m_k} (we call such a solution feasible). As can be seen from Fig. 1, a direct application of the basic DE algorithm does not guarantee that the obtained solution will be feasible. Such a solution can also be obtained from the basic DE algorithm, but in that case the feasibility of the solution must be verified at every step. Obviously, this requires additional computational effort, and as a result the convergence of the algorithm is delayed. Therefore, we introduce the function MaxMin(x(t)), which ensures the feasibility of the solution and accelerates the convergence of the algorithm.
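A sketch of the MaxMin repair of Eq. (26) and the crossover-style trial generation of Eq. (25), assuming the population is stored as an N × m_k NumPy array; the names `maxmin`, `make_trial` and the divisor used for d_s are our illustrative choices.

```python
import numpy as np

def maxmin(values, population):
    """Clamp each gene into the population's per-gene range [x_s^min, x_s^max] as in Eq. (26)."""
    lo, hi = population.min(axis=0), population.max(axis=0)
    d = (hi - lo) / len(population)      # small offset d_s(t); the divisor here is an illustrative choice
    return np.where(values <= lo, lo + d, np.where(values >= hi, hi - d, values))

def make_trial(population, b, lam_c=0.9, pr_c=0.8, rng=np.random.default_rng(0)):
    """Crossover of Eq. (25): perturb the best chromosome X_b with two random difference vectors."""
    N, mk = population.shape
    a, c = rng.choice([i for i in range(N) if i != b], size=2, replace=False)
    p, h = rng.random(mk), rng.random(mk)
    x_b, x_a, x_c = population[b], population[a], population[c]
    cand = lam_c * x_b + p * (x_b - x_a) + (1 - p) * (x_b - x_c)
    return np.where(h < pr_c, maxmin(cand, population), x_b)

pop = np.random.rand(5, 8)               # 5 chromosomes encoding k = 4 centers in 2-D (m_k = 8)
print(make_trial(pop, b=0))
```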

4.7. Halting criterion

The algorithm terminates when the maximum number of fitness calculations t_max is achieved.

4.8. The pseudo-code of the proposed DE algorithm

The pseudo-code of the clustering algorithm is given below:

Step 1 (Input. Create initial population). At the initial stage each chromosome randomly chooses k different document vectors from the document collection D = [D_1, D_2, ..., D_n] as the initial cluster centroid vectors.
Step 2 (Form initial clusters). Assign each document vector D_i (i = 1, 2, ..., n) in the document collection to the cluster C_p, p ∈ {1, 2, ..., k}, iff sim(D_i, O_p) > sim(D_i, O_q) for all q ∈ {1, 2, ..., k}, p ≠ q.
Step 3 (Evaluate initial population). Calculate the fitness value of each chromosome in the population based on (19)–(24).
Step 4 (Select best chromosome). Select the chromosome with current best solution.

Step 5 (Generate child chromosome). Use the crossover (25) and mutation (28) equations to generate difference-offspring of
the current best chromosome.
Step 6 (Evaluate the child chromosome). Calculate the fitness of the difference-offspring using (19)–(24).
Step 7 (Replace). If a difference-offspring is better than its parent, then replace the parent by the offspring in the next generation; otherwise the parent is retained in the population.
Step 8 (Stopping criterion). Repeat steps 2–7 until a user-specified maximum number t_max of fitness calculations is achieved.
Step 9 (Output). Report the partition obtained by the best chromosome as the final solution at the maximum number of fitness
calculations.
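A sketch of the assignment rule used in Step 2, assuming the chromosome has already been decoded into a k × m array of centroids and that cosine similarity is used as above; the function name `assign_to_clusters` is illustrative.

```python
import numpy as np

def assign_to_clusters(W, centroids):
    """Step 2: assign each document to the cluster whose centroid it is most similar to (cosine)."""
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    Cn = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    sims = Wn @ Cn.T                     # n x k matrix of sim(D_i, O_p)
    return sims.argmax(axis=1)           # index p of the most similar centroid for each document

W = np.random.rand(6, 4)                 # 6 toy documents over 4 terms
centroids = np.random.rand(3, 4)         # k = 3 decoded cluster centers
print(assign_to_clusters(W, centroids))
```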

5. Measures for evaluation of the clustering quality

Cluster validation refers to the quantitative evaluation of the quality of a clustering solution. In general, there are three
approaches to investigate cluster quality. The first approach is based on internal criteria. Internal criteria assess the clusters
against their own structural properties. Internal cluster validation aims at measuring the quality of a clustering in real-life
settings when there is no knowledge of the real clustering. The second approach is based on external criteria. External cluster
validation refers to comparing a clustering solution to a true clustering. This is important in evaluating the performance of a
clustering algorithm on datasets. The third approach is based on relative criteria for the investigation of cluster quality. Here
the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes produced by the same
method but with different input parameter values. Various cluster validity indices are available in the literature
[13,23,26,41,43,46,50]. In this paper, we use indices for the internal and external approaches.

5.1. Measures for internal cluster validation

In this subsection, we discuss suitable methods for the quantitative evaluation of a clustering result, known as internal
cluster validity indices. Ideally, a validity index should measure the two aspects of partitioning:

(1) Cohesion: The documents in one cluster should be as similar to each other as possible. The fitness variance of the doc-
uments in a cluster is an indication of the cluster’s cohesion or compactness.
(2) Separation: Clusters should be well separated. The similarity among the cluster centers gives an indication of cluster
separation.

Many internal validity measures have been proposed for evaluating clustering results. Most of these popular validity
measures do not work well for clusters with different densities and/or sizes. They usually have a tendency to ignore clusters
with low densities. A validity measure that can deal with this situation is studied in [1,17–19]. This measure is the ratio of
the sum of intra-cluster scatter to inter-cluster separation:

$$ CS_1(k) = \frac{\sum_{p=1}^{k} \frac{1}{|C_p|} \sum_{D_i \in C_p} \min_{D_l \in C_p} \{\operatorname{sim}(D_i, D_l)\}}{\sum_{p=1}^{k} \max_{q=1,\ldots,k,\ q \ne p} \{\operatorname{sim}(O_p, O_q)\}}. \qquad (29) $$

This cluster validity index is inspired by the work reported in [17], and has been suitably modified for clustering different
datasets [1,18,19]. It simultaneously takes the cohesion and separation factors into account while dealing with complex
structure datasets. The denominator in (29) computes the largest similarity between cluster centers. The numerator measures the average of the smallest similarity between two documents lying in the same cluster, i.e., it uses the smallest within-cluster similarity to measure the scatter volume.
Similarly, the second measure is defined as

$$ CS_2(k) = \frac{\sum_{p=1}^{k} \frac{1}{|C_p|} \min_{D_i \in C_p} \{\operatorname{sim}(D_i, O_p)\}}{\sum_{p=1}^{k} \max_{q=1,\ldots,k,\ q \ne p} \{\operatorname{sim}(O_p, O_q)\}}. \qquad (30) $$

We define the third measure as

$$ CS_3(k) = \frac{\sum_{p=1}^{k} \frac{1}{|C_p|} \sum_{D_i \in C_p} \operatorname{sim}(D_i, O_p)}{\sum_{p=1}^{k} \operatorname{sim}(O_p, O)}. \qquad (31) $$

The numerator in (31) measures the average similarity of documents to the cluster centers. The denominator in (31) com-
putes the sum of the similarity between cluster centers and the center of the entire collection.

The validity indices (29)–(31) simultaneously take the compactness and separation factors into account while dealing
with complex structure datasets. A large value of these measures (29)–(31) indicates a valid optimal partition.
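The sketch below evaluates the CS_3 index of Eq. (31) for a hard clustering, assuming cosine similarity and dense tf-idf vectors; variable names are illustrative.

```python
import numpy as np

def cs3_index(W, labels, k):
    """CS_3 of Eq. (31): average document-to-centroid similarity over cluster-center-to-collection-center similarity."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    O = W.mean(axis=0)
    num, den = 0.0, 0.0
    for p in range(k):
        members = W[labels == p]
        O_p = members.mean(axis=0)
        num += sum(cos(d, O_p) for d in members) / len(members)
        den += cos(O_p, O)
    return num / den

W = np.random.rand(8, 5)
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
print(cs3_index(W, labels, k=3))   # larger values indicate a better partition
```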

5.2. Measures for external cluster validation

The second set of experiments will be focused on comparing the clustering results produced by the proposed criterion
functions with the ‘‘ground truth” results. The quality of the clustering solution will be measured by using different metrics.
These metrics measure the matching of clusters computed by each method to the ‘‘ground truth” classes. In situations where
documents are already labeled, we can compare the clusters with the ‘‘true” class labels.
 
Assume that the dataset D is composed of the classes C⁺ = (C⁺_1, ..., C⁺_{k⁺}) (the true clustering), and we apply a clustering procedure to find clusters C = (C_1, ..., C_k) in this dataset. We present various indices to compare the two partitions C = (C_1, ..., C_k) and C⁺ = (C⁺_1, ..., C⁺_{k⁺}).
Important classes of criteria for comparing clustering solutions are based on counting the pairs of points on which two clusterings agree/disagree. The best-known clustering distances based on point pairs are the purity [14,47], the Mirkin metric [41] and the F-measure [46].
Purity. The purity of the cluster C_p is defined as follows [14,47]:

$$ \operatorname{purity}(C_p) = \frac{1}{|C_p|} \max_{p^+ = 1, \ldots, k^+} \left| C_p \cap C^+_{p^+} \right|, \quad p = 1, 2, \ldots, k. \qquad (32) $$

Note that each cluster may contain documents from different classes. The purity gives the ratio of the dominant class size in the cluster to the cluster size itself. The value of the purity is always in the interval [1/k⁺, 1]. A large purity value implies that the cluster is a "pure" subset of the dominant class. The purity of the entire collection of clusters is evaluated as a weighted sum of the individual cluster purities:

$$ \operatorname{purity}(\mathbf{C}) = \sum_{p=1}^{k} \frac{|C_p|}{n} \operatorname{purity}(C_p) = \frac{1}{n} \sum_{p=1}^{k} \max_{p^+ = 1, \ldots, k^+} \left| C_p \cap C^+_{p^+} \right|. \qquad (33) $$

According to this measure, a higher purity value indicates a better clustering solution.
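A minimal sketch of the purity of Eqs. (32) and (33), assuming clusters and classes are given as integer label arrays; the function name is illustrative.

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Overall purity (Eq. (33)): fraction of documents falling in their cluster's dominant class."""
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    n, total = len(cluster_labels), 0
    for p in np.unique(cluster_labels):
        classes_in_p = class_labels[cluster_labels == p]
        total += np.bincount(classes_in_p).max()     # size of the dominant class inside cluster C_p
    return total / n

print(purity([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))   # -> 5/6 ≈ 0.833
```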
Mirkin metric. The Mirkin metric is defined as follows [41]:

$$ M(\mathbf{C}, \mathbf{C}^+) = \frac{1}{n^2} \left( \sum_{p=1}^{k} |C_p|^2 + \sum_{p^+=1}^{k^+} \left| C^+_{p^+} \right|^2 - 2 \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right|^2 \right). \qquad (34) $$

The Mirkin metric (34) is scaled by the factor 1/n² in order to restrict its range to the interval [0, 1]. This metric is obviously 0 for identical clusterings, and positive otherwise.
F-measure. Another frequently used external validation measure is commonly called the "clustering accuracy". The calculation of this accuracy is inspired by the information retrieval metric known as the F-measure. If we want to compare a set of clusters C to a set of classes C⁺, a simple approach is to calculate the precision (P), recall (R) and F-measure, used widely in the information retrieval literature to measure the success of the retrieval task.
Using our clustering notation, the precision is computed as follows [46]:

$$ P\!\left(C_p, C^+_{p^+}\right) = \frac{\left| C_p \cap C^+_{p^+} \right|}{|C_p|}. \qquad (35) $$

The precision is calculated as the portion of cluster C_p that includes the documents of class C⁺_{p⁺}, thus measuring how homogeneous the cluster C_p is with respect to the class C⁺_{p⁺}.
Similarly, the recall is calculated as the proportion of documents from class C⁺_{p⁺} that are included in cluster C_p, thus measuring how complete the cluster C_p is with respect to the class C⁺_{p⁺}:

$$ R\!\left(C_p, C^+_{p^+}\right) = \frac{\left| C_p \cap C^+_{p^+} \right|}{\left| C^+_{p^+} \right|}. \qquad (36) $$

Then the F value of the cluster C_p and the class C⁺_{p⁺} is the harmonic mean of the precision and the recall:

$$ F\!\left(C_p, C^+_{p^+}\right) = \frac{2}{\frac{1}{P\left(C_p, C^+_{p^+}\right)} + \frac{1}{R\left(C_p, C^+_{p^+}\right)}} = \frac{2\, P\!\left(C_p, C^+_{p^+}\right) R\!\left(C_p, C^+_{p^+}\right)}{P\!\left(C_p, C^+_{p^+}\right) + R\!\left(C_p, C^+_{p^+}\right)}. \qquad (37) $$
The F-measure of the cluster C_p is the maximum F value attained at any class in the entire set of classes C⁺ = (C⁺_1, ..., C⁺_{k⁺}). That is,

$$ F(C_p) = \max_{C^+_{p^+} \in \mathbf{C}^+} F\!\left(C_p, C^+_{p^+}\right), \quad p = 1, 2, \ldots, k. \qquad (38) $$
The F-measure of the entire collection is considered to be the sum of the individual cluster specific F-measures, weighted
according to cluster size. That is,
$$ F(\mathbf{C}) = \sum_{p=1}^{k} \frac{|C_p|}{n} F(C_p). \qquad (39) $$

The higher the F-measure, the better the clustering solution. This measure has a significant advantage over the purity and the
entropy, because it measures both the homogeneity and the completeness of a clustering solution [46].
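A sketch of the clustering F-measure of Eqs. (35)–(39), again assuming integer label arrays; names are illustrative.

```python
import numpy as np

def clustering_f_measure(cluster_labels, class_labels):
    """Overall F-measure (Eq. (39)): size-weighted best F value of each cluster over all classes."""
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    n, total = len(cluster_labels), 0.0
    for p in np.unique(cluster_labels):
        in_p = cluster_labels == p
        best_f = 0.0
        for c in np.unique(class_labels):
            in_c = class_labels == c
            overlap = np.sum(in_p & in_c)
            if overlap == 0:
                continue
            precision = overlap / in_p.sum()          # Eq. (35)
            recall = overlap / in_c.sum()             # Eq. (36)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))  # Eqs. (37)-(38)
        total += in_p.sum() / n * best_f              # Eq. (39)
    return total

print(clustering_f_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```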
 
Now we describe information-based methods to compare the partitions C = (C_1, ..., C_k) and C⁺ = (C⁺_1, ..., C⁺_{k⁺}). The commonly used external validity indices based on information are the partition coefficient [13], the entropy [13,14], the variation of information [43] and the V-measure [46]. These measures represent plausible ways to evaluate the homogeneity of a clustering solution.
Partition coefficient. The partition coefficient (PC) was introduced by Bezdek [13]. It measures the amount of overlap between clusters. Considering a cluster C_p, the PC is defined as follows:

$$ PC(C_p) = \frac{1}{k^+} \sum_{p^+=1}^{k^+} \left( \frac{\left| C_p \cap C^+_{p^+} \right|}{|C_p|} \right)^2. \qquad (40) $$

PC(C_p) is a value between 1/k⁺ and 1. If almost all documents of C_p belong to the same class C⁺_{p⁺}, then PC(C_p) is close to 1. On the other hand, if the documents of C_p are randomly divided among all classes of C⁺, then PC(C_p) is close to 1/k⁺.
A global partition coefficient is computed using the following formula:

$$ PC(\mathbf{C}, \mathbf{C}^+) = \frac{1}{k} \sum_{p=1}^{k} PC(C_p) = \frac{1}{k k^+} \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} \left( \frac{\left| C_p \cap C^+_{p^+} \right|}{|C_p|} \right)^2. \qquad (41) $$

PC(C, C⁺) also takes values between 1/k⁺ and 1. If PC(C, C⁺) is close to 1/k⁺, then C and C⁺ are almost independent. Moreover, if PC(C, C⁺) is close to 1, then C is close to C⁺.
Entropy. An entropy measure based on information-theoretic considerations can also be used. In the same way as the partition coefficient, Bezdek defined the clustering entropy [13]. The entropy of the cluster C_p is defined to be

$$ E(C_p) = -\frac{1}{\log(k^+)} \sum_{p^+=1}^{k^+} \frac{\left| C_p \cap C^+_{p^+} \right|}{|C_p|} \log\!\left( \frac{\left| C_p \cap C^+_{p^+} \right|}{|C_p|} \right), \quad p = 1, 2, \ldots, k. \qquad (42) $$

Note that when x is close to 0, then x log x is close to 0, so we adopt the convention 0 log 0 = 0.
Since the entropy considers the distribution of semantic classes in a cluster, it is a more comprehensive measure than the purity. Note that we have normalized the entropy to take values between 0 and 1. If almost all the documents of cluster C_p belong to the same class C⁺_{p⁺}, then |C_p ∩ C⁺_{p⁺}|/|C_p| is close to 1 for that class and close to 0 for the other classes, so the entropy E(C_p) is close to 0 (since 0 log 0 = 0 and 1 log 1 = 0). On the other hand, if the documents of cluster C_p are randomly divided among all the classes of C⁺, then |C_p ∩ C⁺_{p⁺}|/|C_p| is close to 1/k⁺ and the entropy of cluster C_p is close to 1.
In contrast to the purity measure, an entropy value of 0 means that the cluster is comprised entirely of one class, while an
entropy value near 1 implies that the cluster contains a uniform mixture of all classes.
The global clustering entropy of the entire collection is defined to be the sum of the individual cluster entropies weighted
according to the cluster size. That is,

$$ E(\mathbf{C}) = \sum_{p=1}^{k} \frac{|C_p|}{n} E(C_p). \qquad (43) $$

The global entropy also takes values between 0 and 1. A perfect clustering solution will be one that produces clusters that
contain documents from only a single class, in which case entropy will be zero. In general, the smaller the entropy, the better
the quality of the cluster.
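A sketch of the normalized clustering entropy of Eqs. (42) and (43), assuming integer label arrays; names are illustrative.

```python
import numpy as np

def clustering_entropy(cluster_labels, class_labels):
    """Global entropy (Eq. (43)): size-weighted, log(k+)-normalized entropy of each cluster's class mix."""
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    n, k_plus = len(cluster_labels), len(np.unique(class_labels))
    total = 0.0
    for p in np.unique(cluster_labels):
        classes_in_p = class_labels[cluster_labels == p]
        probs = np.bincount(classes_in_p, minlength=k_plus) / len(classes_in_p)
        probs = probs[probs > 0]                               # convention 0 log 0 = 0
        e_p = -np.sum(probs * np.log(probs)) / np.log(k_plus)  # Eq. (42)
        total += len(classes_in_p) / n * e_p                   # Eq. (43)
    return total

print(clustering_entropy([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```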
Variation of information. Another information-based clustering measure is the variation of information (VI) [43]. The var-
iation of information is a recently proposed clustering criterion based on information-theoretic concepts. It measures the
amount of information that we gain and lose when going from the clustering C to another clustering C⁺. Patrikainen and
Meila [43] define it as

$$ VI(\mathbf{C}, \mathbf{C}^+) = H(\mathbf{C} \mid \mathbf{C}^+) + H(\mathbf{C}^+ \mid \mathbf{C}), \qquad (44) $$

where H(C | C⁺) is the conditional entropy of C given C⁺.
An equivalent way of writing the distance VI(C, C⁺) is as follows [43]:

$$ VI(\mathbf{C}, \mathbf{C}^+) = -\sum_{p=1}^{k} \sum_{p^+=1}^{k^+} P\!\left(C_p, C^+_{p^+}\right) \log\!\left( \frac{P\!\left(C_p, C^+_{p^+}\right)}{P\!\left(C^+_{p^+}\right)} \right) - \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} P\!\left(C_p, C^+_{p^+}\right) \log\!\left( \frac{P\!\left(C_p, C^+_{p^+}\right)}{P(C_p)} \right). \qquad (45) $$

The variation of information is the sum of the information needed to describe C given C⁺ and the information needed to describe C⁺ given C.

The joint distribution P(C_p, C⁺_{p⁺}) in (45) is equal to

$$ P\!\left(C_p, C^+_{p^+}\right) = \frac{\left| C_p \cap C^+_{p^+} \right|}{n}. \qquad (46) $$

This immediately also implies that

$$ P(C_p) = \frac{|C_p|}{n} \qquad (47) $$

and

$$ P\!\left(C^+_{p^+}\right) = \frac{\left| C^+_{p^+} \right|}{n}. \qquad (48) $$
Based upon these calculations (46)–(48), we can write the variation of information (45) as

$$ VI(\mathbf{C}, \mathbf{C}^+) = \frac{1}{n} \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right| \log\!\left( \frac{|C_p| \left| C^+_{p^+} \right|}{\left| C_p \cap C^+_{p^+} \right|^2} \right). \qquad (49) $$

The maximum value of the variation of information is log n, which is achieved when the partitions are as far apart as possible.
In this case it means that one of them places all the documents together in a single cluster while the other places each doc-
ument in a cluster on its own. The maximum value increases with n because larger datasets contain more information, but if
this property is undesirable then one can simply normalize by log n, as we do in the calculations presented here:
$$ VI(\mathbf{C}, \mathbf{C}^+) = \frac{1}{n \log n} \sum_{p=1}^{k} \sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right| \log\!\left( \frac{|C_p| \left| C^+_{p^+} \right|}{\left| C_p \cap C^+_{p^+} \right|^2} \right). \qquad (50) $$

In general, the smaller the variation of information, the better the clustering solution.
The variation of information is presented as a distance measure for comparing clusterings of the same dataset. Therefore it
does not distinguish between hypothesized and target clusterings.
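A sketch of the normalized variation of information of Eq. (50), assuming integer label arrays; names are illustrative.

```python
import numpy as np

def variation_of_information(cluster_labels, class_labels):
    """Normalized VI (Eq. (50)); smaller values indicate closer agreement between the two partitions."""
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    n, vi = len(cluster_labels), 0.0
    for p in np.unique(cluster_labels):
        in_p = cluster_labels == p
        for c in np.unique(class_labels):
            in_c = class_labels == c
            overlap = np.sum(in_p & in_c)
            if overlap == 0:
                continue                                    # empty intersections contribute nothing
            vi += overlap * np.log(in_p.sum() * in_c.sum() / overlap**2)
    return vi / (n * np.log(n))

print(variation_of_information([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```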
V-measure. The V-measure is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity and completeness have been satisfied [46]. We define the homogeneity as

$$ \operatorname{hom}(\mathbf{C}) = \begin{cases} 1, & \text{if } H(\mathbf{C}^+ \mid \mathbf{C}) = 0, \\ 1 - \dfrac{H(\mathbf{C}^+ \mid \mathbf{C})}{H(\mathbf{C}^+)}, & \text{otherwise,} \end{cases} \qquad (51) $$

where

$$ H(\mathbf{C}^+ \mid \mathbf{C}) = -\sum_{p=1}^{k} \sum_{p^+=1}^{k^+} \frac{\left| C_p \cap C^+_{p^+} \right|}{n} \log\!\left( \frac{\left| C_p \cap C^+_{p^+} \right|}{\sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right|} \right), \qquad (52) $$

$$ H(\mathbf{C}^+) = -\sum_{p^+=1}^{k^+} \frac{\sum_{p=1}^{k} \left| C_p \cap C^+_{p^+} \right|}{n} \log\!\left( \frac{\sum_{p=1}^{k} \left| C_p \cap C^+_{p^+} \right|}{n} \right). \qquad (53) $$

H(C⁺ | C) is equal to 0 when each cluster contains only members of a single class, a perfectly homogeneous clustering. In the degenerate case when H(C⁺) = 0, i.e., when there is only a single class, we define the homogeneity to be 1.
Completeness is symmetric to homogeneity. Therefore, by symmetry with the calculation above, we define the completeness as

$$ \operatorname{comp}(\mathbf{C}) = \begin{cases} 1, & \text{if } H(\mathbf{C} \mid \mathbf{C}^+) = 0, \\ 1 - \dfrac{H(\mathbf{C} \mid \mathbf{C}^+)}{H(\mathbf{C})}, & \text{otherwise,} \end{cases} \qquad (54) $$

where

$$ H(\mathbf{C} \mid \mathbf{C}^+) = -\sum_{p^+=1}^{k^+} \sum_{p=1}^{k} \frac{\left| C_p \cap C^+_{p^+} \right|}{n} \log\!\left( \frac{\left| C_p \cap C^+_{p^+} \right|}{\sum_{p=1}^{k} \left| C_p \cap C^+_{p^+} \right|} \right), \qquad (55) $$

$$ H(\mathbf{C}) = -\sum_{p=1}^{k} \frac{\sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right|}{n} \log\!\left( \frac{\sum_{p^+=1}^{k^+} \left| C_p \cap C^+_{p^+} \right|}{n} \right). \qquad (56) $$

Based upon these calculations of the homogeneity and completeness, we calculate the V-measure of a clustering solution by
computing the harmonic mean of the homogeneity and completeness, just as the precision and recall are commonly com-
bined into the F-measure:
$$ V(\mathbf{C}) = \frac{2}{\frac{1}{\operatorname{hom}(\mathbf{C})} + \frac{1}{\operatorname{comp}(\mathbf{C})}} = \frac{2 \operatorname{hom}(\mathbf{C}) \operatorname{comp}(\mathbf{C})}{\operatorname{hom}(\mathbf{C}) + \operatorname{comp}(\mathbf{C})}. \qquad (57) $$

Notice that the computation of the homogeneity, the completeness and the V-measure is completely independent of the number of classes, the number of clusters, the size of the dataset and the clustering algorithm used.
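A sketch of the homogeneity, completeness and V-measure of Eqs. (51)–(57), assuming integer label arrays and the conditional entropies of Eqs. (52) and (55) computed over the joint contingency counts; names are illustrative.

```python
import numpy as np

def v_measure(cluster_labels, class_labels):
    """V-measure (Eq. (57)): harmonic mean of homogeneity (Eq. (51)) and completeness (Eq. (54))."""
    cluster_labels, class_labels = np.asarray(cluster_labels), np.asarray(class_labels)
    n = len(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(class_labels)
    # Joint contingency table: counts[p, p_plus] = |C_p ∩ C+_{p+}|
    counts = np.array([[np.sum((cluster_labels == p) & (class_labels == c)) for c in classes]
                       for p in clusters], dtype=float)

    def entropy(probs):
        probs = probs[probs > 0]
        return -np.sum(probs * np.log(probs))

    def cond_entropy(joint, axis):
        """Conditional entropy; axis=1 conditions on clusters (rows), axis=0 on classes (columns)."""
        marg = joint.sum(axis=axis, keepdims=True)
        ratio = np.divide(joint, marg, out=np.zeros_like(joint), where=marg > 0)
        return -np.sum(np.where(joint > 0, joint / n * np.log(np.where(ratio > 0, ratio, 1)), 0.0))

    H_class, H_cluster = entropy(counts.sum(axis=0) / n), entropy(counts.sum(axis=1) / n)
    H_class_given_cluster = cond_entropy(counts, axis=1)   # Eq. (52)
    H_cluster_given_class = cond_entropy(counts, axis=0)   # Eq. (55)
    hom = 1.0 if H_class == 0 else 1 - H_class_given_cluster / H_class      # Eq. (51)
    comp = 1.0 if H_cluster == 0 else 1 - H_cluster_given_class / H_cluster  # Eq. (54)
    return 2 * hom * comp / (hom + comp) if hom + comp > 0 else 0.0

print(v_measure([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```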

6. Experiments

In this section, to test the effectiveness of the proposed approach, we compare the performance of the weighted criterion
functions with the unweighted criterion functions and three methods, studied in [51]. We first describe the datasets used in
the experiments.

6.1. Experimental data

We used datasets with a wide variety of numbers of clusters, numbers of documents and numbers of terms:

 webkb. The webkb dataset contains web pages collected from computer science departments of various universities. There
are 8282 documents and they are divided into 7 categories: student, faculty, staff, department, course, project and other.
This dataset is available from [53].
 re0. This dataset is a subset of the Reuters-21578 Text Categorization Test collection containing the 13 most frequent categories among the 135 topics [54].
 rec. The rec dataset contains documents on autos, motorcycles, baseball and hockey; it was selected from the 20News-18828 version and contains 3970 documents [55].
 wap. This dataset contains 1560 documents consisting of news articles from 20 different topics in October 1997 collected
in the WebACE project [28].
 fbis. The fbis dataset is derived from the TREC-5 collection. This dataset is available from [56].

These datasets are standard text datasets that are often used as benchmarks for document clustering [30,39,51]. General
characteristics of the datasets are summarized in Table 1.

6.2. Preprocessing

First, we removed stopwords. These are words that are non-descriptive for the topic of a document. Following the com-
mon practice, we used a stoplist provided in [52]. Second, words were stemmed using Porter’s suffix-stripping algorithm
[44], so that words with different endings would be mapped to a single word. The underlying assumption is that different
morphological variations of words with the same root/stem are thematically similar and should be treated as a single word.
In our experiments, we also considered the effect of including terms with small weights in the document representation on

Table 1
Summary of the datasets.

Data    Number of documents   Number of classes   Number of terms (before / after preprocessing)   Source                    Description
webkb   8282                  7                   20682 / 3000                                     Web knowledge base [53]   Web pages
re0     1504                  13                  2886 / 500                                       Reuters-21578 [54]        Newsgroup posts
rec     3970                  4                   16783 / 2000                                     20News-18828 [55]         Newsgroup posts
wap     1560                  20                  8460 / 1000                                      WebACE [28]               Web pages
fbis    2463                  17                  12764 / 1500                                     TREC-5 [56]               Newspaper articles

the overall clustering performance, and decided to discard words that appear with less than a given threshold weight. The
rationale behind discarding terms with small weights is that in many cases they are not very descriptive of the document’s
subject and make little contribution to the similarity between two documents. Besides, the terms with small weights can
also introduce noise into the clustering process and make the similarity computation more expensive. Consequently, we se-
lected the top 3000, 500, 2000, 1000, 1500 terms ranked by their weights defined by (1) for the webkb, re0, rec, wap, and fbis
datasets, respectively, and used them in our experiments.

6.3. Choice of parameters for DE algorithm and the simulation strategy

The modified DE algorithm has a number of control parameters that affect its performance on different datasets. In this
section we discuss the influence of parameters such as the population size N, the scaling factors for crossover λ_c and mutation λ_m, the crossover constant pr_c and the mutation constant pr_m.
Population size. To investigate the effect of the population size N, the DE was executed separately with 400–1000, 50–250,
200–600, 50–250 and 100–400 chromosomes (keeping all other parameter settings same as reported in Table 2) for the web-
kb, re0, rec, wap and fbis datasets, respectively. Experiments showed that numbers of chromosomes more than 600 (webkb),
150 (re0), 400 (rec), 200 (wap) and 300 (fbis) produced more or less identical clustering results for DE.
The scaling factors. Provided all other parameters were fixed at the values given in Table 2, we let DE run over different settings of the scaling factors λ_c and λ_m. We used λ_c = 0.5 and λ_m = 0.4; λ_c = 0.6 and λ_m = 0.4; λ_c = 0.3 and λ_m = 0.7; λ_c = 0.5 and λ_m = 0.7; λ_c = λ_m = 0.7; λ_c = 0.9 and λ_m = 0.7; and λ_c = λ_m = 0.8. We noted that the scaling factors λ_c = 0.9 and λ_m = 0.7 gave the best clustering results over all the datasets considered.
The crossover constant. Provided all other parameters were fixed at the values shown in Table 2, the DE was run with several possible choices of the crossover constant pr_c. Specifically, we used random pr_c, pr_c = 0.2, pr_c = 0.5, pr_c = 0.6, pr_c = 0.8, and finally pr_c = 0.9. It was observed that for all the datasets, the best convergence behavior of DE was obtained for pr_c = 0.8.
The mutation constant. Provided all other parameters were fixed at the values shown in Table 2, the DE was run with several possible choices of the mutation constant pr_m. Specifically, we used random pr_m, pr_m = 0.1, pr_m = 0.2, pr_m = 0.4, pr_m = 0.5, pr_m = 0.7, and finally pr_m = 0.8. It was observed that for all the datasets, the best convergence behavior of DE was obtained for pr_m = 0.5.
The optimization procedure used here is stochastic in nature. Hence, it was run several times for each criterion function.
The results reported in this section are averages over 50 runs for each criterion function. Each run was continued up to 1000
fitness evaluations. Table 2 lists all the parameter settings used for all the criterion functions. In the experiments, the number
of clusters is set to be the same as the number of pre-assigned classes in the datasets for all the clustering methods.
Finally, we would like to point out that the algorithm discussed here was developed in the Delphi 7 platform on a Pentium
Dual CPU, 1.6 GHz PC, with 512 KB cache, and 1 GB of main memory in Windows XP environment.

6.4. Experimental results and analysis

In this subsection, we analyze the results of the experiment from different points of view. First, we show the efficiency of
assigning weights to documents. For this purpose we compare clustering solutions obtained by the weighted (6)–(11) and
unweighted (12)–(17) criterion functions. Second, we conduct comprehensive performance evaluations by comparing our
methods with the methods I_2, E_1 and H_2, which are the best of the seven different global criterion functions studied in [51].

1. The I_2 criterion function is

$$ I_2 = \sum_{p=1}^{k} \sum_{D_i \in C_p} \operatorname{sim}(D_i, O_p) \rightarrow \max. \qquad (58) $$

Table 2
Parameter setup of the DE for different datasets.

Parameter                                           webkb   re0    rec    wap    fbis
Number of clusters, k                               7       13     4      20     17
Number of generations                               50      50     50     50     50
Population size, N                                  600     150    400    200    300
Number of iterations (fitness evaluations), t_max   1000    1000   1000   1000   1000
Scaling factor for crossover, λ_c                   0.9     0.9    0.9    0.9    0.9
Crossover constant, pr_c                            0.8     0.8    0.8    0.8    0.8
Scaling factor for mutation, λ_m                    0.7     0.7    0.7    0.7    0.7
Mutation constant, pr_m                             0.5     0.5    0.5    0.5    0.5

The I_2 criterion function is used by the popular vector space variant of the k-means algorithm. In this algorithm, each cluster
is represented by its centroid and the goal is to find the solution that maximizes the similarity between each document and
the centroid of the cluster to which it is assigned.
2. The E_1 criterion function is

$$ E_1 = \sum_{p=1}^{k} |C_p| \operatorname{sim}(O_p, O) \rightarrow \min. \qquad (59) $$

The E_1 criterion function computes the clustering by finding a solution that separates the documents of each cluster from the entire collection. Specifically, it tries to minimize the similarity between the centroid vector of each cluster and the centroid vector of the entire collection.

3. The H_2 criterion function is

$$ H_2 = \frac{I_2}{E_1} \rightarrow \max. \qquad (60) $$

The H_2 criterion function is obtained by combining I_2 with E_1. Since E_1 is minimized, H_2 needs to be maximized as it is inversely related to E_1.
The clustering results are shown in Tables 3–7, which give a comparative analysis of the results of the weighted and un-
weighted criterion functions judged by the ten validity indices on webkb, re0, rec, wap and fbis datasets, respectively. In these
tables, the comparative analysis of the methods I_2 and H_2 is also shown. Table 8 gives a comparative analysis of the methods judged by the average values of the validity indices on all datasets, which are obtained from Tables 3–7.
From Tables 3–8 we make the following main observations:

 The weighted criterion functions produced better solutions than the unweighted criterion functions for all datasets. The best entries of Tables 3–8 have been marked in boldface. The improvement is shown in brackets (missing values indicate a zero improvement). Here, we used the relative improvement ((weighted method − unweighted method)/unweighted method) × 100 for the indices CS_1, CS_2, CS_3, Purity, F-measure, and PC. For the indices Entropy, Mirkin metric, VI and V-measure, we used the relative improvement ((unweighted method − weighted method)/weighted method) × 100.
 It is easy to see that the internal validity indices were more sensitive to the document weighting than the external validity indices. Of the latter, the F-measure and PC were the most sensitive; this can be seen from the improvement percentages.
 Comparison of the clustering solutions produced by our criterion functions (except F_3^a) with the "ground truth" results showed that our methods yield high accuracy. This can be seen from the values of the F-measure and PC indices, which reveal an accuracy of approximately 80% and 77%, respectively.
 Among the weighted functions, F_3^a is the most sensitive to the weighting. It improved the results of the function F_3 as follows: 42.17% (CS_1), 71.65% (CS_2), 22.53% (CS_3), 0.40% (Purity), 0.18% (Entropy), 7.47% (Mirkin), 4.16% (F-measure), 15.36% (VI), 4.64% (PC) and 0.02% (V-measure) (see Table 8).

Table 3
Values of the validity indices for clustering methods obtained with the webkb (WebKb) dataset.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 4.8695 6.9276 5.2289 0.7536 0.4686 0.4241 0.8536 0.1178 0.8229 1.0001
(12.92%) (6.08%) (26.23%) (0.17%) (0.32%) (0.94%) (0.49%) (0.93%) (1.06%)
F1 4.3123 6.5308 4.1425 0.7523 0.4701 0.4281 0.8494 0.1189 0.8143 1.0001
Fa2 4.5126 14.7237 14.8164 0.7539 0.4687 0.4239 0.8518 0.1183 0.8249 1.0001
(9.91%) (53.42%) (22.76%) (0.23%) (0.11%) (0.80%) (0.37%) (1.10%) (1.86%)
F2 4.1057 9.5968 12.0693 0.7522 0.4692 0.4273 0.8487 0.1196 0.8098 1.0001
Fa3 0.4263 0.5907 0.7219 0.7532 0.4683 0.4834 0.7368 0.1683 0.7217 1.0005
(42.86%) (8.22%) (65.42%) (0.15%) (0.11%) (11.48%) (3.85%) (34.28%) (4.65%) (0.02%)
F3 0.2984 0.3224 0.4364 0.7521 0.4688 0.5389 0.7095 0.2260 0.6896 1.0007
Fa4 9.2147 9.8296 6.2608 0.7537 0.4685 0.4245 0.8542 0.1176 0.8362 1.0001
(21.18%) (21.01%) (24.98%) (0.19%) (0.28%) (0.05%) (0.39%) (0.17%) (1.62%)
F4 7.6039 8.1230 5.0094 0.7523 0.4698 0.4247 0.8509 0.1178 0.8229 1.0001
Fa5 6.0216 8.3407 5.4069 0.7534 0.4689 0.4252 0.8540 0.1180 0.8117 1.0001
(12.62%) (12.04%) (34.81%) (0.13%) (0.30%) (0.19%) (1.29%) (0.09%) (1.29%)
F5 5.3466 7.4445 4.0108 0.7524 0.4703 0.4260 0.8431 0.1181 0.8014 1.0001
Fa6 8.0080 8.8673 4.7231 0.7527 0.4690 0.4251 0.8534 0.1179 0.8203 1.0001
(33.61%) (22.39%) (15.23%) (0.03%) (0.19%) (0.17%) (0.18%) (0.09%) (0.33%)
F6 5.9937 7.2449 4.0987 0.7525 0.4699 0.4258 0.8519 0.1180 0.8176 1.0001
I2 0.4785 0.4364 0.6187 0.7533 0.4674 0.4465 0.7809 0.1445 0.7522 1.0004
(6.98%) (5.69%) (16.17%) (0.17%) (0.09%) (0.85%) (1.79%) (11.49%) (1.80%) (0.01%)
H2 0.4473 0.4129 0.5326 0.7520 0.4678 0.4503 0.7672 0.1611 0.7389 1.0005

Table 4
Values of the validity indices for clustering methods obtained with the re0 (Reuters-21578) dataset.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 1.3273 1.4898 1.4295 0.7827 0.2693 0.2427 0.8617 0.1051 0.8309 1.0001
(13.73%) (5.13%) (18.35%) (0.08%) (0.26%) (0.91%) (0.16%) (0.29%) (1.68%)
F1 1.1671 1.4171 1.2079 0.7821 0.2700 0.2449 0.8603 0.1054 0.8172 1.0001
Fa2 1.2617 2.2424 2.1234 0.7828 0.2690 0.2424 0.8625 0.1047 0.8328 1.0001
(11.56%) (22.60%) (11.03%) (0.08%) (0.19%) (0.33%) (0.06%) (0.48%) (3.02%)
F2 1.1310 1.8291 1.9125 0.7822 0.2695 0.2432 0.8620 0.1052 0.8084 1.0001
Fa3 0.3710 0.7134 0.4597 0.7692 0.2871 0.2701 0.8223 0.1272 0.7499 1.0005
(12.08%) (91.16%) (14.50%) (1.61%) (0.56%) (5.33%) (2.51%) (13.4%) (5.38%) (0.02%)
F3 0.3310 0.3732 0.4015 0.7570 0.2887 0.2845 0.8022 0.1443 0.7116 1.0007
Fa4 1.7534 2.3040 1.3962 0.7829 0.2688 0.2426 0.8626 0.1051 0.8412 1.0001
(15.53%) (14.32%) (17.79%) (0.08%) (0.15%) (0.37%) (0.02%) (0.19%) (1.57%)
F4 1.5177 2.0154 1.1853 0.7823 0.2692 0.2435 0.8624 0.1053 0.8282 1.0001
Fa5 1.3968 1.5551 1.4591 0.7825 0.2691 0.2431 0.8612 0.1052 0.8297 1.0001
(23.47%) (7.72%) (22.27%) (0.04%) (0.22%) (0.08%) (0.09%) (0.29%) (2.56%)
F5 1.1313 1.4437 1.1933 0.7822 0.2697 0.2433 0.8604 0.1055 0.8090 1.0001
Fa6 1.1592 1.6140 1.4441 0.7824 0.2694 0.2428 0.8593 0.1052 0.8189 1.0001
(4.50%) (8.04%) (25.78%) (0.04%) (0.01%) (0.19%) (2.75%)
F6 1.1093 1.4939 1.1481 0.7824 0.2695 0.2428 0.8592 0.1054 0.7970 1.0001
I2 0.7609 1.1111 0.8687 0.7634 0.2741 0.2642 0.8378 0.1131 0.7627 1.0006
(11.36%) (7.47%) (14.95%) (0.3%) (1.75%) (1.7%) (1.85%) (0.71%) (0.82%)
H2 0.6833 1.0339 0.7557 0.7611 0.2789 0.2687 0.8226 0.1139 0.7565 1.0006

Table 5
Values of the validity indices for clustering methods obtained with the rec (20News-18828) dataset.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 5.3693 3.4404 2.9988 0.4987 0.4974 0.4345 0.7164 0.1569 0.6972 1.0008
(4.38%) (5.77%) (2.45%) (0%) (0.02%) (0.09%) (1.47%) (0.13%) (3.97%)
F1 5.1440 3.2528 2.9271 0.4987 0.4975 0.4349 0.7060 0.1571 0.6706 1.0008
Fa2 4.7038 10.166 9.1161 0.4985 0.4972 0.4343 0.7182 0.1568 0.6907 1.0008
(3.36%) (50.02%) (10.62%) (0.02%) (0.02%) (0.12%) (0.27%) (0.06%) (0.19%)
F2 4.5509 6.7764 8.2411 0.4984 0.4973 0.4348 0.7163 0.1569 0.6894 1.0008
Fa3 0.9431 1.5080 2.2003 0.4989 0.4975 0.4507 0.6214 0.2124 0.5949 1.0011
(48.33%) (82.63%) (11.89%) (0.02%) (0.16%) (4.81%) (1.29%) (24.50%) (1.92%)
F3 0.6358 0.8257 1.9665 0.4988 0.4983 0.4724 0.6135 0.2645 0.5837 1.0011
Fa4 5.6844 9.0306 7.0634 0.4985 0.4972 0.4344 0.7460 0.1567 0.7271 1.0007
(11.26%) (22.62%) (73.62%) (0.04%) (0.05%) (2.77%) (0.13%) (2.28%)
F4 5.1091 7.3648 4.0684 0.4985 0.4974 0.4346 0.7259 0.1569 0.7109 1.0007
Fa5 5.2188 6.6757 3.5368 0.4987 0.4975 0.4347 0.6917 0.1572 0.7039 1.0009
(5.70%) (34.56%) (17.19%) (0.02%) (0.12%) (0.09%) (2.86%) (0.06%) (3.12%)
F5 4.9372 4.9612 3.0179 0.4986 0.4981 0.4351 0.6725 0.1573 0.6826 1.0009
Fa6 4.4785 6.3991 3.0591 0.4987 0.4975 0.4346 0.7069 0.1570 0.6886 1.0009
(4.08%) (16.63%) (17.17%) (0.05%) (0.38%) (0.25%) (3.78%)
F6 4.3031 5.4865 2.6108 0.4987 0.4975 0.4348 0.7042 0.1574 0.6635 1.0009
I2 1.4023 1.7808 2.2711 0.4988 0.4983 0.4462 0.6329 0.2372 0.6176 1.0011
(16.53%) (14.54%) (11.63%) (0.02%) (0.02%) (0.74%) (0.88%) (1.56%) (1.50%)
H2 1.2034 1.5548 2.0344 0.4987 0.4984 0.4495 0.6274 0.2409 0.6085 1.0011

• The external validity index V-measure does not possess discriminative ability, i.e., its value was almost identical for all methods on every dataset. From this we conclude that the V-measure index is of little use for evaluating these clustering results. Therefore, in the following comparisons, we did not consider the results of the V-measure index (see Tables 9 and 10).
• The method I2 gave better results than H2.
• The criterion function F2 (weighted k-means) outperformed the I2 (k-means) method.

From Table 8, we obtained the ranks of the methods by each of the indices, the results of which are shown in Table 9 (as
was mentioned before, in this table the results of the V-measure index were not taken into consideration).
Hence the function Fa4 had the best rank according to the five indices CS1, Purity, Entropy, F-measure and PC; according to the three indices CS2, CS3 and Mirkin the function Fa2 was the best; and according to the index VI the function Fa1 gave the best results. From this table we can also see that the function E1 showed the worst results under eight of the nine indices, while the function H2 had the worst rank under just one index (Purity).

Table 6
Values of the validity indices for clustering methods obtained with the wap (WebACE) dataset.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 0.8550 1.5252 1.3217 0.6283 0.5271 0.4623 0.7817 0.2026 0.7448 1.0004
(26.44%) (34.78%) (25.07%) (0.61%) (0.23%) (0.24%) (0.40%) (0.15%) (2.38%)
F1 0.6762 1.1316 1.0568 0.6245 0.5283 0.4634 0.7786 0.2029 0.7275 1.0004
Fa2 0.8331 1.6553 1.8071 0.6276 0.5280 0.4621 0.7792 0.2031 0.7376 1.0004
(14.44%) (25.57%) (30.13%) (0.38%) (0.21%) (0.26%) (3.64%) (0.34%) (3.51%)
F2 0.7280 1.3182 1.3887 0.6252 0.5291 0.4633 0.7518 0.2038 0.7126 1.0004
Fa3 0.3277 0.2895 0.2330 0.6277 0.5269 0.5147 0.6308 0.2745 0.6007 1.0008
(22.23%) (6.36%) (62.60%) (0.02%) (0.09%) (5.34%) (2.84%) (3.42%) (2.75%) (0.03%)
F3 0.2681 0.2722 0.1433 0.6276 0.5274 0.5422 0.6134 0.2839 0.5846 1.0011
Fa4 1.0282 1.4473 1.0921 0.6271 0.5274 0.4622 0.7804 0.2042 0.7515 1.0004
(24.81%) (38.68%) (49.75%) (0.05%) (0.08%) (0.11%) (0.42%) (0.20%) (1.93%)
F4 0.8238 1.0436 0.7293 0.6268 0.5278 0.4627 0.7771 0.2046 0.7373 1.0004
Fa5 0.7438 1.2003 0.9259 0.6274 0.5278 0.4624 0.7682 0.2053 0.7384 1.0007
(3.22%) (24.98%) (2.90%) (0.02%) (0.02%) (3.88%) (0.39%) (3.78%)
F5 0.7206 0.9604 0.8998 0.6273 0.5279 0.4624 0.7395 0.2061 0.7115 1.0007
Fa6 0.7819 1.1602 0.8911 0.6274 0.5276 0.4627 0.7579 0.2062 0.7227 1.0007
(42.68%) (22.33%) (0.52%) (0.02%) (0.09%) (5.50%) (0.15%) (3.79%) (0.01%)
F6 0.5480 0.9484 0.8865 0.6274 0.5277 0.4631 0.7184 0.2065 0.6963 1.0008
I2 0.6847 0.5426 0.5981 0.6234 0.5292 0.4967 0.6826 0.2264 0.6178 1.0009
(24.47%) (3.25%) (6.94%) (0.13%) (0.06%) (0.24%) (1.23%) (1.46%) (1.41%) (0.01%)
H2 0.5501 0.5255 0.5593 0.6226 0.5295 0.4979 0.6743 0.2297 0.6092 1.0010

Table 7
Values of the validity indices for clustering methods obtained with the fbis (TREC-5) dataset.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 0.9452 1.0657 0.7277 0.7137 0.3578 0.2942 0.8165 0.1334 0.7760 1.0003
(21.51%) (25.85%) (11.75%) (0.04%) (0.03%) (1.10%) (1.27%) (4.41%)
F1 0.7779 0.8468 0.6512 0.7134 0.3579 0.2942 0.8076 0.1351 0.7432 1.0003
Fa2 0.5745 1.3694 1.7039 0.7146 0.3586 0.2944 0.8069 0.1347 0.7626 1.0003
(14.90%) (15.89%) (20.86%) (0.17%) (0.31%) (0.03%) (0.95%) (1.11%) (3.32%)
F2 0.5000 1.1816 1.4098 0.7134 0.3597 0.2945 0.7993 0.1362 0.7381 1.0003
Fa3 0.2869 0.3254 0.3495 0.7122 0.3559 0.3287 0.7299 0.1653 0.6662 1.0006
(133.30%) (60.45%) (21.52%) (0.14%) (10.30%) (10.50%) (5.57%) (8.18%) (0.01%)
F3 0.1230 0.2028 0.2876 0.7122 0.3564 0.3624 0.6607 0.1745 0.6158 1.0007
Fa4 1.1136 1.2865 0.8423 0.7156 0.3577 0.2939 0.8278 0.1331 0.7809 1.0003
(18.87%) (21.48%) (29.60%) (0.11%) (0.25%) (0.14%) (2.03%) (0.08%) (1.26%)
F4 0.9368 1.0590 0.6499 0.7148 0.3586 0.2943 0.8113 0.1332 0.7712 1.0003
Fa5 1.0607 1.1952 0.6491 0.7147 0.3581 0.2951 0.8208 0.1337 0.7768 1.0003
(44.02%) (27.98%) (21.24%) (0.20%) (0.06%) (0.30%) (0.47%) (1.42%) (3.15%)
F5 0.7365 0.9339 0.5354 0.7133 0.3583 0.2960 0.8170 0.1356 0.7531 1.0003
Fa6 0.9812 1.0353 0.5560 0.7154 0.3586 0.2952 0.8179 0.1345 0.7698 1.0003
(46.80%) (27.02%) (6.78%) (0.18%) (0.03%) (0.09%) (1.41%) (3.37%)
F6 0.6684 0.8151 0.5207 0.7141 0.3586 0.2953 0.8172 0.1364 0.7447 1.0003
I2 0.2050 0.3375 0.4871 0.7139 0.3576 0.2954 0.7654 0.1545 0.7236 1.0005
(25.92%) (28.92%) (13.49%) (0.15%) (0.45%) (0.24%) (3.31%) (2.14%) (2.30%)
H2 0.1628 0.2618 0.4292 0.7128 0.3592 0.2961 0.7409 0.1578 0.7073 1.0005

To obtain the resulting ranks of the methods we transformed Table 9 into another one, shown in Table 10. The resultant rank in Table 10 was computed according to the formula

\mathrm{Resultant\ rank(method)} = \sum_{s=1}^{14} \frac{(14 - s + 1)\, r_s}{14},

where r_s denotes the number of times the method appears in the s-th rank. For instance, the rank of the method Fa2 is

\mathrm{Resultant\ rank(Fa_2)} = \frac{(14-1+1)\cdot 3}{14} + \frac{(14-2+1)\cdot 1}{14} + \frac{(14-3+1)\cdot 2}{14} + \frac{(14-4+1)\cdot 1}{14} + \frac{(14-5+1)\cdot 1}{14} + \frac{(14-9+1)\cdot 1}{14} = 3 + \frac{13}{14} + \frac{24}{14} + \frac{11}{14} + \frac{10}{14} + \frac{6}{14} = 7.5714,

where the remaining terms vanish because r_s = 0 for the other ranks.
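As a quick check of this arithmetic, the short sketch below recomputes the resultant rank from a method's rank counts. The function name and the list literal are our own, with the counts for Fa2 copied from the Fa2 row of Table 10.

```python
def resultant_rank(rank_counts, n_methods=14):
    """Resultant rank = sum over s of (n_methods - s + 1) * r_s / n_methods."""
    return sum((n_methods - s + 1) * r
               for s, r in enumerate(rank_counts, start=1)) / n_methods

# Rank counts r_1, ..., r_14 for the method Fa2 (row Fa2 of Table 10)
fa2_counts = [3, 1, 2, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(resultant_rank(fa2_counts), 4))  # prints 7.5714, matching the worked example
```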

Table 8
Average values of the validity indices for clustering methods obtained on all datasets.

Methods    CS1       CS2       CS3       Purity    Entropy   Mirkin    F-measure   VI        PC        V-measure
Fa1 2.6733 2.8897 2.3413 0.6754 0.4240 0.3716 0.8060 0.1432 0.7744 1.0003
(10.67%) (9.63%) (17.23%) (0.18%) (0.19%) (0.40%) (0.70%) (0.49%) (2.62%)
F1 2.4155 2.6358 1.9971 0.6742 0.4248 0.3731 0.8004 0.1439 0.7546 1.0003
Fa2 2.3771 6.0314 5.9134 0.6755 0.4243 0.3714 0.8037 0.1435 0.7697 1.0003
(7.80%) (45.67%) (18.17%) (0.18%) (0.16%) (0.32%) (1.02%) (0.56%) (2.39%)
F2 2.2031 4.1404 5.0043 0.6743 0.4250 0.3726 0.7956 0.1443 0.7517 1.0003
Fa3 0.4710 0.6854 0.7929 0.6722 0.4271 0.4095 0.7082 0.1895 0.6667 1.0007
(42.17%) (71.65%) (22.53%) (0.40%) (0.18%) (7.47%) (4.16%) (15.36%) (4.64%) (0.02%)
F3 0.3313 0.3993 0.6471 0.6695 0.4279 0.4401 0.6799 0.2186 0.6371 1.0009
Fa4 3.7589 4.7796 3.3310 0.6756 0.4239 0.3715 0.8142 0.1433 0.7874 1.0003
(17.53%) (21.89%) (43.05%) (0.10%) (0.17%) (0.13%) (1.08%) (0.21%) (1.72%)
F4 3.1983 3.9212 2.3285 0.6749 0.4246 0.3720 0.8055 0.1436 0.7741 1.0003
Fa5 2.8883 3.7934 2.3956 0.6753 0.4243 0.3721 0.7992 0.1439 0.7721 1.0004
(12.19%) (20.48%) (24.03%) (0.07%) (0.14%) (0.13%) (1.61%) (0.42%) (2.74%)
F5 2.5744 3.1487 1.9314 0.6748 0.4249 0.3726 0.7865 0.1445 0.7515 1.0004
Fa6 3.0818 3.8152 2.1347 0.6753 0.4244 0.3721 0.7991 0.1442 0.7641 1.0004
(22.08%) (19.31%) (15.20%) (0.04%) (0.04%) (0.08%) (1.13%) (0.35%) (2.73%)
F6 2.5245 3.1978 1.8530 0.6750 0.4246 0.3724 0.7902 0.1447 0.7438 1.0004
I2 0.7063 0.8417 0.9687 0.6706 0.4253 0.3898 0.7399 0.1751 0.6948 1.0007
(15.90%) (11.07%) (12.35%) (0.18%) (0.35%) (0.69%) (1.84%) (3.20%) (1.56%)
H2 0.6094 0.7578 0.8622 0.6694 0.4268 0.3925 0.7265 0.1807 0.6841 1.0007
F2 2.2031 4.1404 5.0043 0.6743 0.4250 0.3726 0.7956 0.1443 0.7517 1.0003
(211.92%) (391.91%) (416.6%) (0.55%) (0.07%) (4.41%) (7.53%) (17.59%) (8.19%) (0.04%)
I2 0.7063 0.8417 0.9687 0.6706 0.4253 0.3898 0.7399 0.1751 0.6948 1.0007

Table 9
Ranks of methods on different validity indices.

Methods    CS1    CS2    CS3    Purity    Entropy    Mirkin    F-measure    VI    PC
Fa1 5 9 5 3 2 3 2 1 2
Fa2 9 1 1 2 3 1 4 3 5
Fa3 13 13 13 11 13 13 13 13 13
Fa4 1 2 3 1 1 2 1 2 1
Fa5 4 6 4 4 4 5 6 6 4
Fa6 3 5 7 5 5 6 7 7 6
F1 8 10 8 10 8 10 5 5 7
F2 10 11 2 9 10 8 8 8 8
F3 (E1)    14    14    14    13    14    14    14    14    14
F4 2 4 6 7 6 4 3 4 3
F5 6 6 9 8 9 9 10 9 9
F6 7 7 10 6 7 7 9 10 10
I2 11 11 11 12 11 11 11 11 11
H2 12 12 12 14 12 12 12 12 12

From Table 10 we conclude the following:

• Among the methods, the Fa4 method showed the best results.
• The Fa3 method showed the worst result among the weighted criterion functions.
• The methods Fa1 and Fa2, F1 and F2, Fa5 and Fa6, F5 and F6 showed close results.
• Our methods outperformed the other three methods, I2, H2 and E1. Among our methods, only Fa3 was outranked by the methods I2 and H2.
• The E1 method showed the worst result among the methods I2, H2 and E1. Note that this result agrees with the result reported in [51].
• Comparing the k-means (I2) method with its weighted variants, Fa2 and F2, we see that it was outperformed by both of them.

To present the comparison of the methods more descriptively, we also show it as a histogram: Fig. 2, below, gives a graphical comparison of the methods.

Table 10
The resultant rank of the methods.

Method      Number of times the method is in the s-th rank                                        Resultant rank
            s = 1   2   3   4   5   6   7   8   9   10   11   12   13   14
Fa4 5 3 1 0 0 0 0 0 0 0 0 0 0 0 8.6429
Fa2 3 1 2 1 1 0 0 0 1 0 0 0 0 0 7.5714
Fa1 1 3 2 0 2 0 0 0 1 0 0 0 0 0 7.3571
F4 0 1 2 3 0 2 1 0 0 0 0 0 0 0 6.8571
Fa5 0 0 0 5 1 3 0 0 0 0 0 0 0 0 6.5714
Fa6 0 0 1 0 3 2 3 0 0 0 0 0 0 0 6.0000
F2 0 1 1 0 0 0 0 4 1 2 0 0 0 0 4.9286
F1 0 0 0 0 2 0 1 3 0 3 0 0 0 0 4.5714
F6 0 0 0 0 0 1 4 0 1 3 0 0 0 0 4.4286
F5 0 0 0 0 0 1 0 2 5 1 0 0 0 0 4.1429
I2 0 0 0 0 0 0 0 0 0 0 8 1 0 0 2.5000
H2 0 0 0 0 0 0 0 0 0 0 0 8 0 1 1.7857
Fa3 0 0 0 0 0 0 0 0 0 0 1 0 8 0 1.4286
F3 (E1)     0   0   0   0   0   0   0   0   0   0   0   0   1   8        0.7143
Fig. 2. The comparison of the methods based on the resultant rank (resultant rank on the vertical axis, methods on the horizontal axis).

7. Conclusion

In this paper, we studied twelve criterion functions for document clustering. Six of these functions were weighted, whereas the remaining six were unweighted. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. The purpose of the present paper was to show that weighting improves the clustering solution. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. The comparison was conducted on five datasets with widely varying numbers of clusters, documents and terms. The quality of a clustering result was evaluated using ten validity indices: three internal validity indices and seven external validity indices. The internal validity indices were used to evaluate the within-cluster scatter and between-cluster separation. The external validity indices were used to compare the clustering solutions produced by the proposed criterion functions with the ‘‘ground truth” results. The experiments showed that the weighted criterion functions lead to reasonably good results that outperform the results obtained from the unweighted criterion functions. Furthermore, to study the performance of our methods, we compared them against three clustering methods implemented in [51], E1, H2 and I2, where I2 is the vector space variant of the k-means method. The experimental results showed that our methods outperform the E1, H2 and I2 (k-means) methods. Among our methods, only Fa3 was outperformed by the methods I2 and H2. In this paper, we also developed a modified DE algorithm to optimize the criterion functions. An important feature of the proposed modification is that it accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the solution it produces is feasible.

Acknowledgement

The author would like to thank all the anonymous reviewers for their valuable comments to improve the quality of this
paper.

References

[1] A. Abraham, S. Das, A. Konar, Document clustering using differential evolution, in: Proceedings of the 2006 IEEE Congress on Evolutionary Computation
(CEC 2006), Springer, Berlin, 2006, pp. 1784–1791.
[2] R.M. Alguliev, R.M. Alyguliev, Automatic text documents summarization through sentences clustering, Journal of Automation and Information Sciences
40 (2008) 53–63.
[3] R.M. Alguliev, R.M. Alyguliev, A.M. Bagirov, Global optimization in the summarization of text documents, Automatic Control and Computer Sciences 39
(2005) 42–47.
[4] R.M. Alguliev, R.M. Aliguliyev, Fast genetic algorithm for clustering of text documents, Artificial Intelligence 3 (2005) 698–707 (in Russian).
[5] R.M. Aliguliyev, A novel partitioning-based clustering method and generic document summarization, in: Proceedings of the 2006 IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2006 Workshops) (WI-IATW’06), Hong Kong, China, 2006, pp.
626–629.
[6] R.M. Aliguliyev, A clustering method for document collections and algorithm for estimation the optimal number of clusters, Artificial Intelligence 4
(2006) 651–659 (in Russian).
[7] R.M. Aliguliyev, Automatic document summarization by sentence extraction, Journal of Computational Technologies 12 (2007) 5–15.
[8] J. Allan (Ed.), Topic Detection and Tracking: Event-Based Information Organization, Kluwer Academic Publishers, Norwell, USA, 2002.
[9] H. Azzag, G. Venturini, A. Oliver, C. Guinot, A hierarchical ant based clustering algorithm and its use in three real-world applications, European Journal
of Operational Research 179 (2007) 906–922.
[10] R. Baeza-Yates, R. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, ACM Press, New York, 1999.
[11] A.M. Bagirov, Modified global k-means algorithm for minimum sum-of-squares clustering problems, Pattern Recognition 41 (2008) 3192–3199.
[12] S. Bandyopadhyay, S. Saha, A point symmetry-based clustering technique for automatic evolution of clusters, IEEE Transactions on Knowledge and
Data Engineering 20 (2008) 1441–1457.
[13] J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics 28 (1998) 301–
315.
[14] F. Boutin, M. Hascoet, Cluster validity indices for graph partitioning, in: Proceedings of the Eighth International Conference on Information
Visualization (IV 2004), London, UK, 2004, pp. 376–381.
[15] Y. Chen, J. Bi, Clustering by maximizing sum-of-squared separation distance, in: Proceedings of the Workshop on Clustering High Dimensional Data
and its Applications, Newport Beach, USA, 2005, pp. 1–12.
[16] Y. Chen, L. Tu, Density-based clustering for real-time stream data, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD’07), San Jose, USA, 2007, pp. 133–142.
[17] C.H. Chou, M.C. Su, E. Lai, A new cluster validity measure and its application to image compression, Pattern Analysis and Applications 7 (2004) 205–
220.
[18] S. Das, A. Konar, Automatic image pixel clustering with an improved differential evolution, Applied Soft Computing 9 (2009) 226–236.
[19] S. Das, A. Abraham, A. Konar, Automatic clustering using an improved differential evolution algorithm, IEEE Transaction on Systems, Man, and
Cybernetics – Part A: Systems and Humans 38 (2008) 218–237.
[20] S. Das, A. Abraham, A. Konar, Automatic clustering with a multi-elitist particle swarm optimization algorithm, Pattern Recognition Letters 29 (2008)
688–699.
[21] I.S. Dhillon, Y. Guan, B. Kulis, A unified view of kernel k-means, spectral clustering and graph cuts, University of Texas UTCS Technical Report #TR-04-
25, 2005, 20 p.
[22] D.M. Dunlavy, D.P. O’Leary, J.M. Conroy, J.D. Schlesinger, QCS: a system for querying clustering and summarizing documents, Information Processing
and Management 43 (2007) 1588–1605.
[23] R. Dubes, A.K. Jain, Validity studies in clustering methodologies, Pattern Recognition 11 (1979) 235–254.
[24] M. Friedman, M. Last, Y. Makover, A. Kandel, Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology,
Information Sciences 177 (2007) 467–475.
[25] J. Grabmeier, A. Rudolph, Techniques of cluster algorithms in data mining, Data Mining and Knowledge Discovery 6 (2002) 303–360.
[26] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, Journal of Intelligent Systems 17 (2001) 107–145.
[27] K.M. Hammouda, M.S. Kamel, Efficient phrase-based document indexing for web document clustering, IEEE Transactions on Knowledge and Data
Engineering 16 (2004) 1279–1296.
[28] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, WebACE: a web agent for document categorization and
exploration, in: Proceedings of the Second International Conference on Autonomous Agents, Minneapolis, MN, USA, 1998, pp. 408–415.
[29] J. Han, M. Kamber, Data Mining: Concepts and Techniques, second ed., Morgan Kaufman, San Francisco, 2006.
[30] A. Huang, Similarity measures for text document clustering, in: Proceedings of the Sixth New Zealand Computer Science Research Student Conference
(NZCSRSC2008), Christchurch, New Zealand, 2008, pp. 49–56.
[31] J.Z. Huang, M.K. Ng, H. Rong, Z. Li, Automated variable weighting in k-means type clustering, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27 (2005) 657–668.
[32] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (1999) 264–323.
[33] D.V. Kalashnikov, Z.S. Chen, S. Mehrotra, R. Nuray-Turan, Web people search via connection analysis, IEEE Transactions on Knowledge and Data
Engineering 20 (2008) 1550–1565.
[34] M.S. Khan, S.W. Khor, Web document clustering using a hybrid neural network, Applied Soft Computing 4 (2004) 423–432.
[35] T. Korenius, J. Laurikkala, M. Juhola, On principal component analysis, cosine and Euclidean measures in information retrieval, Information Sciences
177 (2007) 4893–4905.
[36] M. Laszlo, S. Mukherjee, A genetic algorithm that exchanges neighboring centers for k-means clustering, Pattern Recognition Letters 28 (2007) 2359–
2366.
[37] C.H. Lee, O.R. Zaiane, H.H. Park, J. Huang, R. Greiner, Clustering high dimensional data: a graph-based relaxed optimization approach, Information
Sciences 178 (2008) 4501–4511.
[38] Y. Li, S.M. Chung, J.D. Holt, Text document clustering based on frequent word meaning sequences, Data and Knowledge Engineering 64 (2008) 381–404.
[39] T. Li, C. Ding, Weighted consensus clustering, in: Proceedings of the SIAM International Conference on Data Mining (SDM 2008), Atlanta, USA, 2008, pp.
798–809.
[40] Y. Li, C. Luo, S.M. Chung, Text clustering with feature selection by using statistical data, IEEE Transactions on Knowledge and Data Engineering 20
(2008) 641–652.
[41] B. Mirkin, Mathematical Classification and Clustering, Kluwer Academic Press, Boston, Dordrecht, 1996.
[42] C.D. Nguyen, K.J. Cios, GAKREM: a novel hybrid clustering algorithm, Information Sciences 178 (2008) 4205–4227.
[43] A. Patrikainen, M. Meila, Comparing subspace clusterings, IEEE Transactions on Knowledge and Data Engineering 18 (2006) 902–916.
[44] M. Porter, An algorithm for suffix stripping, Program 14 (1980) 130–137.
[45] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization (Natural Computing Series), Springer-Verlag, Berlin, 2005.
[46] A. Rosenberg, J. Hirschberg, V-measure: a conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference
on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic,
2007, pp. 410–420.

[47] A.M. Rubinov, N.V. Soukhorukova, J. Ugon, Classes and clusters in data analysis, European Journal of Operational Research 173 (2006) 849–865.
[48] M.P. Tan, J.R. Broach, C.A. Floudas, A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced
positioning, Journal of Global Optimization 39 (2007) 323–346.
[49] F. Wang, C. Zhang, T. Li, Regularized clustering for documents, in: Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR’07),
Amsterdam, The Netherlands, 2007, pp. 95–102.
[50] Y. Zhang, W. Wang, X. Zhang, Y. Li, A cluster validity index for fuzzy clustering, Information Sciences 178 (2008) 1205–1218.
[51] Y. Zhao, G. Karypis, Empirical and theoretical comparisons of selected criterion functions for document clustering, Machine Learning 55 (2004) 311–
331.
[52] ftp://ftp.cs.cornell.edu/pub/smart/english.stop.
[53] http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz.
[54] www.daviddlewis.com/resources/testcollections/reuters21578.
[55] http://people.csail.mit.edu/jrennie/20Newsgroups/.
[56] http://trec.nist.gov.
