
Automation in Construction 42 (2014) 36–49


Automatic clustering of construction project documents based on textual similarity

Mohammed Al Qady ⁎, Amr Kandil 1
School of Civil Engineering, Purdue University, West Lafayette, IN 47907-2051, United States

a r t i c l e   i n f o

Article history:
Received 7 July 2013
Received in revised form 4 February 2014
Accepted 8 February 2014
Available online 15 March 2014

Keywords:
Document management
Single pass clustering
Supervised/unsupervised learning methods

a b s t r a c t

Text classifiers, as supervised learning methods, require a comprehensive training set that covers all classes in order to classify new instances. This limits the use of text classifiers for organizing construction project documents, since it is not guaranteed that sufficient samples are available for all possible document categories. To overcome the restriction imposed by the all-inclusive requirement, an unsupervised learning method was used to automatically cluster documents together based on textual similarities. Repeated evaluations using different randomizations of the dataset revealed a region of threshold/dimensionality values of consistently high precision values and average recall values. Accordingly, a hybrid approach was proposed which initially uses an unsupervised method to develop core clusters and then trains a text classifier on the core clusters to classify outlier documents in a consequent refinement step. Evaluation of the hybrid approach demonstrated a significant improvement in recall values, resulting in an overall increase in F-measure scores.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Automatic classification of documents as a supervised learning method requires a set of class labels and samples of each class in order to conduct the learning process before being able to perform predictions for new document instances. Usually, the classification procedure assumes that the classes are all-inclusive (that they make a complete set of all the possible outcomes for any new instance) and that they are mutually exclusive (any new instance can belong to one and only one class). Where classes are static and predefined, the use of text classifiers for automatically organizing documents is appropriate. Documents are traditionally organized in construction projects according to fixed, abstract categories based on document metadata [1]. Examples of studies investigating the use of automatic text classification of construction documents include identifying the corresponding project division for minutes of meeting items [2] and classifying product documents to their relevant division in a construction information classification system [3].

While traditional methods of organizing construction project documents are simple and easy to use, they are not very useful for information retrieval unless the information seeker has thorough knowledge of the document body [1]. Information regarding a researched knowledge topic is almost always distributed over multiple categories, thus requiring understanding of document content, not just metadata, to determine the relevancy of a document to the researched topic; a time-consuming task that entails the application of human semantic capabilities. Also, the above-mentioned restrictions that constrain the use of classifiers do not apply to unsupervised methods: unsupervised methods do not require previous identification of all possible classes, nor are they trained from sample data. The objective of this study is to evaluate the performance of an unsupervised learning text analysis technique in organizing project documents into groups of semantically similar documents, each group defined by its relation to a specific searchable knowledge topic. It is hypothesized that textual similarity between project documents accurately reflects semantic relationships between the documents and, when applied in document management and information retrieval tasks, can achieve results comparable to what humans recognize using their semantic capabilities. In the next section, the text analysis technique used in the study is presented along with several of its applications in previous works. Then the methodology implemented for the evaluation is presented, followed by a detailed analysis of the results. The study is concluded with a summary of the main results and a discussion on practical uses and limitations of implementing the proposed technique.

⁎ Corresponding author at: 2775 Windwood Dr. #178 Ann Arbor, MI 48105. Tel.: +1 217 4196419.
E-mail addresses: malqady23@gmail.com (M. Al Qady), akandil@purdue.edu (A. Kandil).
1 Tel.: +1 765 494 2246.

http://dx.doi.org/10.1016/j.autcon.2014.02.006
0926-5805/© 2014 Elsevier B.V. All rights reserved.

2. Clustering

Research on clustering methods for information retrieval dates back to the second half of the twentieth century. The main objective of clustering is to provide structure to a large dataset by organizing similar data together, thus facilitating search and retrieval tasks. Clustering methods can be categorized according to the structure they generate into flat clustering and hierarchical clustering [4]. With flat, or non-hierarchical, clustering, the dataset is divided into a number of subsets of highly similar elements, dissimilar from elements in other clusters, with no relationship between the different clusters. The main advantage of this simple structure is low computational complexity in comparison with the more sophisticated hierarchical clustering methods. With hierarchical clustering, a complex structure of nested clusters is produced from the dataset. This is done either using a bottom-up approach, in which clusters start as individual items and pairs of similar items are joined together to form clusters, which are then joined together in successive steps until a single hierarchy is formed of the complete dataset (called agglomerative hierarchical clustering), or using the less popular top-down approach, where the whole dataset is considered one cluster and is successively broken down into pairwise clusters until the level of the individual items is reached (also referred to as divisive hierarchical clustering). Flat clustering techniques include K-means and single pass clustering, while agglomerative hierarchical clustering techniques include single-link, complete-link and group-average. In terms of the exclusivity of cluster membership, clustering algorithms can be divided into hard clustering and soft clustering algorithms. In the former, membership of the items is limited to only one cluster. In the latter, the degree of association of each item to each cluster formed is determined [5].

Clustering was used in applications in many fields, including organizing patient data in the medical field, classification of species in biological taxonomies and studying census and survey responses [4]. Clustering has a wide range of applications for data management in civil engineering. In the field of structural system identification, Saitta et al. [6] used K-means clustering to narrow down the number of candidate structural models in order to identify the best model that reflects actual sensor measurements of a structure. Principal component analysis was used to enable visualization of the various possible model clusters based on the most relevant model parameters. Cheng and Teizer [7] implemented clustering to identify objects from point cloud data of a laser scanner in order to enhance visibility of tower crane operators for safer hoisting operations. The DBSCAN algorithm was used for clustering. Similar to single pass clustering, DBSCAN starts with a randomly selected data point and successively forms clusters based on two user-defined parameters: maximum allowable distance from the chosen point and minimum cluster size.

In data mining of databases, Ng et al. [8] used K-means clustering to automatically group similar facility condition assessment reports of university facilities to investigate the relationship between reported deficiencies and facility types. A qualitative evaluation was used to verify the results of the investigation. Raz et al. [9] investigated the use of multiple techniques, including clustering, for developing models of good-quality truck weigh-in-motion traffic data in order to facilitate identification of data anomalies. Two clustering techniques were investigated, K-means and Rectmix, a soft clustering algorithm. Implementation of the proposed mechanism by a domain expert was used to evaluate the accuracy and usefulness of the mechanism.

Clustering techniques were applied for defect detection from images in several studies, including detection of potential defective regions in wastewater pipelines [10] and detection of rust in steel bridges to support decisions regarding bridge painting activities [11]. In the former study, region-growing segmentation, an application of single pass clustering to image data, was implemented for detecting defects in pipes using image analysis. Evaluation was performed based on the comparison of the results of the proposed technique with the inspection reports of a certified inspector using the following metrics: accuracy, recall and false alarm rate. In the latter study, the researchers highlight the limitations of K-means clustering for detecting rust in grayscale images of bridge members, namely: irregular illumination of images, low-contrast images that obscure rust areas, and debris on bridge members that creates noise in the image analysis.

Clustering was widely used in image and video identification/processing. Brilakis et al. [12] developed a framework for managing digital images of construction sites. The framework divides an image into clusters that represent different construction materials in the image and uses the cluster features to identify the material from a database of material signatures. Evaluation was performed by testing the correctness of identification of five different construction materials in terms of precision, recall and effectiveness. The researchers describe the high accuracy of the bottom-up clustering technique implemented in the method. In video image processing, several studies utilized clustering to develop a codebook, or dictionary, of actions and/or poses used for comparing, identifying and classifying motions of workers in a construction activity [13,14]. In both studies, K-means clustering was used to limit the multitude of possible actions to a fixed set of poses. For evaluation, a supervised learning algorithm was applied to classify the motions of workers in a test video based on the developed codebook, and performance was determined based on accuracy of classification.

Several observations are noted from the above review. The majority of the studies utilized a flat, hard clustering approach. Generally, the required application dictates the choice of an appropriate clustering method; e.g., when the number of resulting clusters is known, or can be reasonably inferred, K-means clustering is an appropriate method (as in the case of detecting dark-colored defect areas in grayscale images), and when multiple associations are feasible, a soft clustering approach is warranted. For evaluation of a clustering method and validation of the outcome, expert review was used in a number of the studies. In [4], the authors note the difficulty of evaluating clustering methods, and report that comparison between an outcome and the clusters developed by domain experts is a common method for measuring performance.

For the purpose of this study a flat, hard clustering approach is deemed appropriate, for the reasons explained in the Methodology section. Clustering functions on the same basic assumption as classification: that similar documents form clusters that do not overlap with other non-similar document clusters (also referred to as the contiguity hypothesis). However, clustering aims at identifying such document clusters without any external help from previously labeled instances (thus the unsupervised nature of the method). This is usually executed in an iterative process in which a specific procedure is repeated until a predefined condition is satisfied. Two main flat clustering techniques are reviewed below.

2.1. K-means

In K-means clustering, a number of K centroids is defined by the user and all instances in the dataset are assigned to the closest centroid (determined by Euclidean distance or cosine similarity). Then, centroids of all K clusters are calculated according to this assignment, resulting in new centroid positions. All instances in the dataset are re-assigned to the new centroids, and this iterative process is continued until cluster centroids remain constant, implying that the optimal centroid positions are identified (those that minimize the distance between each instance in a specific cluster and the cluster's centroid).

2.2. Single pass clustering

Single pass clustering, also called incremental clustering, generates one cluster at a time using a predefined threshold value. The threshold represents the user's perception of acceptable proximity, e.g. the minimum acceptable cosine similarity measure between instances and the cluster centroid, or the maximum acceptable Euclidean distance between the instances and the cluster centroid. Starting with a random instance, the closest instance in the dataset that satisfies the threshold is identified and added to the cluster, and the cluster centroid is calculated. The process is repeated with the new cluster centroid until no instances remain that satisfy the threshold, thereby finalizing the first cluster. An unclustered instance is selected at random and the process is repeated with the remaining instances in the dataset to sequentially create new clusters until no unclustered instances remain (or until only unclustered instances remain that do not meet the threshold standard with any of the formed clusters, thereby forming single-instance clusters).

3. Methodology

Since the objective is to organize construction project documents into semantically related groups, a hierarchical clustering structure is not warranted, especially given the associated computational complexity of agglomerative clustering. For the current task, flat clustering is more suitable and economical. The use of K-means requires predefining the number of clusters (cardinality) before implementing the algorithm. It is up to the users to judge cardinality based on their knowledge of the domain topic of the dataset. In reality, the number of clusters has a significant impact on the results. Leaving this decision subject to the user's judgment detracts from the automated nature of the task and adds a high degree of subjectivity to the process. In addition to cardinality, the choice of the initial centroids greatly impacts the clustering results. K-means essentially tries out various clustering outcomes looking for the optimal outcome. While it is highly unlikely that all possible outcomes will be tested, the fact remains that a misguided selection of the number and position of the initial centroids may unnecessarily prolong the process or, even more critically, result in a locally optimal clustering outcome instead of the global optimum. On the other hand, single pass clustering does not require definition of cardinality by the user, but requires determination of a threshold defining the boundary of similarity between documents and cluster centroids. Single pass clustering has been criticized for producing varying cluster outcomes depending on which instances are selected to initiate the clusters and for the tendency to produce large clusters in the first pass. In this study, single pass clustering was used to automatically cluster project documents. In order to overcome the limitations imposed by single pass clustering, several factors were evaluated to assess their effect on clustering performance. The first factor evaluated was the effect of the value of the threshold on clustering accuracy. The predefined value of the threshold has a significant impact on the clustering result: a stricter threshold decreases cluster size (thereby increasing the total number of clusters) while a less strict threshold results in fewer clusters of larger sizes.

Predefining a threshold value must be viewed within the context of a specific dataset. Success of single pass clustering is understandably dependent on the extent to which a specific dataset satisfies the contiguity hypothesis. Relatively similar instances that are disjoint from other groups of instances make defining a threshold that highlights these groups possible. Overlapping groups of instances defy any attempt at accurate clustering, regardless of the value of the threshold. Accordingly, it is the ability to magnify the similarity between same-class instances and the dissimilarity between different-class instances that ultimately contributes to clustering performance. One way to achieve this is by using a term weighting method that best depicts this attribute in the dataset; accordingly, two different weighting methods were evaluated as explained below. Another way is to experiment with dimensionality reduction of the dataset's term-document matrix (t-d matrix), relying on the ability of latent semantic analysis (LSA) to reveal the hidden similarities among the dataset's instances.

In [15], a successive evaluation approach that implements LSA was used to automatically classify documents of a small dataset of 17 documents made up of two classes. The results showed that the difference between the average similarities of same-class documents and the average similarities of different-class documents significantly increased when dimensionality reduction was applied using the optimum dimensionality factor. This suggests a polarizing effect for LSA which can be used to improve clustering results. The use of LSA implies specifying a certain dimensionality factor for the reduction step, and as such the optimum dimensionality factor (lopt) is defined in this study as the one that results in the highest clustering accuracy. A thorough discussion on LSA is found in [16] and a simple example demonstrating LSA's potential in text analysis is given in Appendix A.

The methodology follows four main steps: 1) collecting the dataset, 2) randomizing and pre-processing, 3) developing the t-d matrix, and 4) clustering and evaluation.

3.1. Collecting the dataset

Seventy-seven project documents related to eight construction claims make up the dataset for evaluating the developed technique. All eight claims originated from one project for the construction of an international airport with a total value of work exceeding $50 million. The majority of the documents are correspondences between the main contractor and the project engineer, detailing the factual events related to each claim. Collected and organized by the contract administrator of the project, the supporting documents for each claim are a representation of a group of semantically similar documents, related together by their association to a specific searchable claim topic. The evaluation aims at quantitatively identifying the performance of the proposed technique in organizing the complete dataset into the correct document groups, or clusters, without implementation of the learning step that characterizes supervised learning techniques. Individual cluster size in the dataset varies from a maximum of 22 documents to a minimum of five, and each document belongs to only one cluster.

3.2. Randomizing and pre-processing

This step includes the tasks of tokenizing, removal of stop-words and frequency calculation. The outcome of this step is to represent the documents in the dataset as vectors of varying sizes corresponding to the features, or terms, in each document and the feature frequency of occurrence. The order of the documents in the dataset is randomized from the start to measure the consistency of the clustering outcomes.

3.3. Developing the t-d matrix

The term-document matrix (t-d matrix) is the input required by the clustering algorithm in the final step of the methodology. It is a compilation of all the document vectors into one matrix where the columns represent the documents in the dataset and the rows represent the vocabulary of the dataset. First, the vocabulary is compiled from the document vectors and the frequency of each term across all documents is recorded. Then the t-d matrix is developed based on the randomized document order, in which matrix elements are calculated according to the specified term weighting method. In this study, two popular term weighting methods were studied for evaluation of clustering performance:

• Term frequency (tf): the elements of the matrix represent the frequency of occurrence of the term (identified by the matrix row) in the specific document (identified by the matrix column).

• Term frequency inverse document frequency (tf-idf): modifies term frequency based on the assumption that high-occurrence terms across the dataset are poor indicators of clusters. Term frequency inverse document frequency is calculated according to Eq. (1), where n is the number of documents in the dataset and d is the number of documents containing the specific term being evaluated:

tf-idf = tf × log(n / d). (1)

3.4. Clustering and evaluation

To evaluate the effects of dimensionality factor and threshold value, single pass clustering is performed on the randomized dataset using varying combinations of both factors. Since the dataset is 77 documents, the dimensionality factor (l) ranges from a minimum of three dimensions (to minimize computational cost) to a maximum of 77 (lmax is the special case constituting the original t-d matrix, i.e. dimensionality reduction is not applied). The threshold value (h) represents the minimum acceptable similarity limit between a document and a cluster centroid that makes the document a candidate for inclusion in the cluster. Similarity between a cluster centroid (Cc) and a document (d) is calculated by cosine similarity using Eq. (2). Similarity theoretically ranges from a minimum of zero (signifying complete dissimilarity) to a maximum of one (signifying complete similarity). The threshold value was varied over the range [0.05, 0.95] with a step of 0.01. Maximum and minimum values for the factors were set based on experimentation, to minimize unnecessary computational cost without overlooking significant results.

sim(d, Cc) = (d · Cc) / (|d| |Cc|). (2)

For a certain dimensionality/threshold combination, clustering commences by considering the first document in the reconstructed t-d matrix as the centroid of the first cluster, identifying the closest document to the cluster that satisfies the threshold, recalculating the centroid and repeating the process. When no documents satisfying the condition remain, a new cluster is initiated using the first unclustered document in the dataset as the centroid of the new cluster, and the process is repeated until all documents are either assigned to a cluster or cannot be assigned to any cluster and consequently form a separate single-document cluster. Clustering is illustrated in Fig. 1. A t-d matrix (X̂) developed using a specific weighting method from a randomized dataset undergoes clustering 6825 times, corresponding to all possible dimensionality/threshold combinations, and clustering accuracy is calculated after each to determine the best performance.
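Eqs. (1) and (2) translate directly into code. The sketch below is illustrative rather than the paper's implementation: it assumes a base-10 logarithm in Eq. (1) (the equation writes only "log"), represents documents and centroids as plain numeric vectors, and uses invented function names.

```python
import math

def tf_idf(tf, n, d):
    """Eq. (1): tf-idf = tf * log(n / d), where n is the number of
    documents in the dataset and d is the number of documents that
    contain the term (log base 10 assumed here)."""
    return tf * math.log10(n / d)

def cosine_sim(doc, centroid):
    """Eq. (2): cosine similarity, i.e. the dot product of the two
    vectors divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(doc, centroid))
    norms = (math.sqrt(sum(a * a for a in doc))
             * math.sqrt(sum(b * b for b in centroid)))
    return dot / norms if norms else 0.0
```

Note that a term occurring in every document receives a weight of zero, reflecting the assumption that high-occurrence terms are poor cluster indicators.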

[Fig. 1 flowchart: nested loops over the dimensionality factor l (lmin to lmax) and the threshold h (hmin to hmax); for each combination the t-d matrix X̂ is reconstructed, single pass clustering is performed and the outcome is evaluated. Legend: l, lmin, lmax = dimensionality factor, minimum and maximum; h, hmin, hmax = threshold value, minimum and maximum; j = integer denoting t-d matrix column; Cc = cluster centroid; dj = document j of the t-d matrix; Sim(Cc, dj) = similarity between centroid Cc and document dj; pos = position of the most similar document to the cluster; θmax = cluster maximum similarity.]

Fig. 1. Clustering and evaluation for a certain weighting method.
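The single pass loop depicted in Fig. 1 can be sketched as follows. This is a simplified illustration, not the paper's evaluation tool: documents are plain numeric vectors (already weighted and, if desired, dimensionality-reduced), similarity is cosine, and the centroid is recomputed as the mean of member vectors after each addition.

```python
import math

def cosine_sim(a, b):
    # Eq. (2): dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norms = (math.sqrt(sum(x * x for x in a))
             * math.sqrt(sum(y * y for y in b)))
    return dot / norms if norms else 0.0

def single_pass_cluster(docs, threshold):
    """Grow one cluster at a time: seed with the first unclustered
    document, repeatedly add the most similar unclustered document
    that satisfies the threshold, recomputing the centroid each time."""
    unclustered = list(range(len(docs)))
    clusters = []
    while unclustered:
        seed = unclustered.pop(0)          # first unclustered document
        members = [seed]
        centroid = list(docs[seed])
        while True:
            # Most similar remaining document to the current centroid.
            best = max(unclustered,
                       key=lambda j: cosine_sim(docs[j], centroid),
                       default=None)
            if best is None or cosine_sim(docs[best], centroid) < threshold:
                break                      # no candidate satisfies the threshold
            members.append(best)
            unclustered.remove(best)
            # Recalculate the centroid as the mean of the member vectors.
            centroid = [sum(docs[m][k] for m in members) / len(members)
                        for k in range(len(centroid))]
        clusters.append(members)
    return clusters
```

Documents whose similarity to every formed cluster falls below the threshold end up as single-document clusters, matching the behavior described in Section 2.2.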



3.5. Clustering measures

Several clustering measures (methods for evaluating the clustering outcomes) are presented in [5]. A simple measure is purity, calculated by Eq. (3), where ui represents a specific cluster from an outcome of i clusters, cj represents a specific class from a number of j classes, N is the total number of instances in the dataset and count(ui, cj) is the number of instances belonging to class cj in cluster ui. Purity is the summation across all clusters of the number of instances of the class with the highest representation in each individual cluster, divided by the total number of instances in the dataset.

Purity = (1/N) Σi maxj [count(ui, cj)]. (3)

Purity has a range of (0, 1], where poor clustering results in low purity values, and good clustering results in unity. One drawback of purity is that an outcome of fragmented clusters containing same-class instances will also result in a perfect purity score. For example, for an extreme result where each instance in the dataset is defined as a single-instance cluster, the result will be a perfect purity measure. The evaluation metric must fairly balance between the number of resulting clusters and the performance rating. This is particularly important for the current dataset, which exhibits large variations in the size of the different classes.

The measure used for evaluating clustering outcome is F-measure. Clusters must first be decomposed into binary associations of cluster members that are indicative of the cluster's composition. This is performed for the outcome generated from the clustering process and compared, using precision (P), recall (R) and F-measure, with the binary associations generated from the true clusters. The range for the above evaluation method is also [0, 1]. What distinguishes this method from purity is the balancing effect it provides as a result of combining precision and recall. The number of pairwise relationships resulting from a cluster made up of n instances is equal to n(n − 1) / 2. Accordingly, an outcome of a few large clusters generates a larger number of relationships than an outcome of many small clusters. In the extreme case where all instances in the dataset are grouped in one cluster, 100% recall is achieved (since such clustering contains all possible pairwise combinations); however, precision greatly deteriorates from an excess of false-positive combinations. At the other end of the spectrum, in case of an overly fragmented clustering outcome, precision is boosted if the clusters contain same-class instances; however, recall is negatively affected as a result of a large number of missed (false-negative) combinations. If the clusters are mainly composed of different-class instances, then both precision and recall values are low. In all these scenarios, the combined F-measure score represents a balanced evaluation of the clustering outcome.

3.6. Evaluation tool

Fig. 2 illustrates the evaluation tool developed for performing clustering and evaluating the clustering outcomes. The user defines the location of the dataset and the documents are retrieved and pre-processed as explained before. The evaluation tool allows the user to control the randomization of the dataset according to a user-defined seed in order to investigate the effect of document sequence on clustering outcomes. The user also has the ability to specify the following clustering options: dimensionality factor, threshold value and weighting method. At the end of a clustering run, details of the most accurate clustering outcome are displayed in a separate window and detailed results showing clustering performance at various combinations of threshold values and dimensionality factors are generated in a separate file.

Fig. 2. Automatic document clustering and evaluation tool.

4. Results and analysis

A better understanding of the clustering performance is achieved by adopting a baseline to compare the results with. A baseline gives perspective to the results by representing the lower boundary below which results are considered meaningless and unacceptable. The probability of a random correct result is a common criterion used in classification evaluations for specifying a baseline. However, using the random approach for evaluating clustering performance will grossly underestimate the baseline. The number of possible cluster outcomes for n



[Fig. 3: two panels (tf and tf-idf) plotting F-measure against threshold value (h) over the range 0 to 1, comparing the Baseline and Optimum conditions.]

Fig. 3. F-measure scores for lmax and lopt—average over ten trial runs.

instances grouped into K clusters is a Stirling number of the second kind, S(n, K) [5], calculated using Eq. (4), where C(K, j) denotes the binomial coefficient. The number of possible outcomes in case the number of clusters is unknown is therefore ΣK=1..n S(n, K), the summation of the Stirling number for all possible values of K, where K ranges from one (in case all instances are grouped into one group) to n (in case each instance is grouped alone in a single-instance cluster).

S(n, K) = (1/K!) Σj=0..K (−1)^j C(K, j) (K − j)^n. (4)

For the current dataset, the total number of possible outcomes is extremely large, making the possibility of a random correct cluster null. Even if the problem is simplified by assuming that the correct number of classes is known, the number of possible outcomes for organizing 77 objects into eight groups is 8.6 × 10^64, which is still very large. The random assumption accordingly defies the purpose of using a baseline. The baseline adopted for this task is the clustering results achieved using lmax, i.e. without any dimensionality reduction. A comparison between the clustering performance and the baseline highlights the improvement in performance resulting from applying LSA to single pass clustering. If results consistently fall below the adopted baseline, that does not necessarily indicate that they are meaningless, but that the proposed procedure does not offer a positive contribution to clustering.

Single pass clustering is prone to inconsistent outcomes depending on the order of documents used in the clustering step. To ensure a representative value for clustering performance, the document order was randomized using different seed values and the clustering performance was evaluated for the different document sequences. Over ten trial runs, the highest average F-measure score achieved using the tf weighting method was 0.68 at an optimum dimensionality factor of 13 and a threshold of 0.69, while the highest average F-measure score achieved using the tf-idf weighting method was 0.75 at lopt = 56 and h = 0.24. Fig. 3 presents the variation of average F-measure scores across all threshold values for two specific dimensionality factors: lmax (the baseline condition) and lopt (the highest average F-measure score achieved using the respective weighting method). For both weighting methods,
[Figure: two intensity grids, term frequency (left) and term frequency inverse document frequency (right), of average F-measure scores plotted over threshold (h) values of 0.1 to 0.9 and dimensionality factors (l) of 10 to 70, shaded in bands of > 0.75, 0.75–0.7, 0.7–0.5, 0.5–0.3 and < 0.3.]

Fig. 4. Intensity grid of F-measure scores—average over ten trial runs.
42 M. Al Qady, A. Kandil / Automation in Construction 42 (2014) 36–49
the baseline's performance is better at the small threshold values but gradually declines after the peak and is eventually surpassed by the optimum's performance. For the tf weighting method, this shift occurs midway through the range of threshold values, while for tf–idf it occurs at the low threshold value of 0.2.

For the tf method, the optimum dimensionality factor records an average improvement over the baseline of 7.6%, and an 11.5% improvement in peak performance. For the tf–idf method, the average improvement is 7% with a 5% improvement of the peak performance. These results highlight the importance of identifying the optimum dimensionality factor in order to utilize LSA for improving clustering performance.

Fig. 4 presents intensity grids of the average F-measure scores for ten trials using both weighting methods. As can be seen from the figure, a stretch of high F-measure values (indicated by the dashed lines) is observed spanning from high l/low h values to low l/high h values (lower left corner of the grids to upper right corner of the grids). In case of the tf weighting method, this high performance front extends to the mid-threshold region, while for tf–idf it spans across the limits of both factors. These results reveal the indirect relationship between dimensionality factor and threshold values. At high dimensionality levels (where little or no reduction is applied) high clustering performance requires relatively relaxed threshold values. Reducing the number of dimensions results in improved class separation, allowing the use of a stricter similarity definition (i.e. higher threshold values), which attests to the contribution of dimensionality reduction in polarizing same-class instances in the dataset.

The regions of high F-measure scores were more prominent with the tf–idf weighting method, suggesting the superiority of this method over the tf weighting method. This observation is attributed to the method's accurate identification of relative term weights that better reflect similarities and therefore result in improved clustering performance. No specific region had an average F-measure score higher than 0.7 using the tf weighting method. For the tf–idf weighting method, two prominent regions of highest F-measure scores are apparent, one in the fifties range of dimensionality factors around a threshold of 0.25, and the other within the threshold range of [0.60, 0.70] at a dimensionality factor of 15. Table 1 identifies the maximum F-measure results achieved, and the corresponding l and h factors, for multiple trial runs of the proposed technique. While the absolute maximum of each individual trial run varies, in general F-measure scores for the regions mentioned above were consistently high over the trial runs.

For the tf weighting method, while the combination of a dimensionality factor of 13 with a threshold of 0.69 was prevalent in most trials, the values of the F-measure score for such combinations varied significantly from a minimum of 0.69 to a maximum of 0.83. This suggests inconsistencies in the clustering results. Such inconsistencies are not apparent in the top prevalent factor combinations of the tf–idf weighting method. The most common combination is a dimensionality factor in the range [54, 57] with a threshold of 0.24, for which the F-measure scores were approximately 0.78. Another common (l, h) combination is the (15, 0.69) combination, which achieved a constant F-measure score close to 0.75.

In order to accurately check consistency of the clustering results, the actual clusters created by the different trial runs were examined. Fig. 5 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf weighting method. Noting that the true number of classes is eight and the smallest class contains five documents, the resulting clusters are considered fragmented. Discrepancies are observed between the two cases of the tf weighting method in terms of the number and composition of clusters. In addition, cluster impurity is evident, not only for the low F-measure case (clusters 1, 4, 5 and 10) but also for the high score case (cluster 1). Fig. 6 displays the clusters for the highest and lowest F-measure scores achieved in the trial runs using the tf–idf weighting method. Both results are highly fragmented with a different number of clusters in each case. However, although not completely identical, cluster composition is similar, impurity is limited and clusters make a good representation of the true classes.

Examination of the precision and recall values behind the F-measure results in Table 1 offers an explanation for this observation. Average precision over all trial runs was higher for the tf–idf method, while average recall was higher for the tf method. For the tf method, values for precision and recall for each separate trial run were comparatively close, while values of precision were significantly higher than recall for the tf–idf method. This discrepancy between the two methods explains
Table 1
Results of various clustering trials.

Trial run   Term frequency                                          Term frequency inverse document frequency
            Dim factor (l)  Thresh (h)  Precision  Recall  F-meas.  Dim factor (l)  Thresh (h)  Precision  Recall  F-meas.
0           10              0.77        0.694      0.667   0.681    15              0.69        0.905      0.627   0.741
1           8               0.86        0.621      0.600   0.610    15              0.69        0.905      0.627   0.741
2           13              0.69        0.699      0.611   0.652    54–57           0.24        0.977      0.658   0.786
3           13              0.69        0.850      0.805   0.827    15              0.68–0.69   0.905      0.627   0.741
4           9               0.80        0.656      0.713   0.683    15              0.69        0.905      0.627   0.741
5           13              0.69        0.721      0.708   0.715    69–76           0.20        0.933      0.751   0.832
6           6               0.88        0.607      0.801   0.691    54–57           0.24        0.983      0.654   0.785
7           13              0.69        0.730      0.760   0.745    66–77           0.18        0.859      0.747   0.799
8           13              0.69        0.691      0.688   0.689    15              0.68–0.69   0.922      0.640   0.756
9           13              0.69        0.769      0.747   0.758    61–64, 67       0.20        0.898      0.733   0.807
10          13              0.69        0.698      0.738   0.717    66–77           0.18        0.859      0.747   0.799
11          13              0.69        0.850      0.810   0.830    54–57           0.24        0.950      0.645   0.768
12          10              0.76        0.682      0.753   0.716    53–57           0.24        0.977      0.658   0.786
13          13              0.69        0.744      0.735   0.739    54–57           0.24        0.983      0.654   0.785
14          40              0.49        0.637      0.670   0.653    54–57           0.24        0.950      0.645   0.768
15          13              0.69        0.850      0.810   0.830    54–57           0.24        0.950      0.645   0.768
16          13              0.69        0.740      0.613   0.671    69–73           0.20        0.979      0.738   0.841
17          13              0.69        0.761      0.758   0.760    54–57           0.24        0.977      0.658   0.786
18          13              0.69        0.769      0.747   0.758    61–64, 67       0.20        0.843      0.756   0.797
19          9               0.80        0.656      0.713   0.683    15              0.68–0.69   0.922      0.640   0.756
20          9               0.81        0.643      0.591   0.616    69–76           0.20        0.933      0.751   0.832
Mean        N/A             N/A         0.718      0.716   0.715    N/A             N/A         0.929      0.677   0.782
St. dev.    N/A             N/A         0.073      0.070   0.064    N/A             N/A         0.043      0.051   0.031
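The enumeration argument behind Eq. (4) is easy to verify computationally. The following is a minimal sketch, not part of the original study, that evaluates the Stirling number of the second kind and reproduces the 8.6 × 10^64 count for organizing 77 documents into exactly eight clusters; the function names are illustrative.

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind, Eq. (4):
    S(n, k) = (1/k!) * sum_{j=0}^{k} (-1)^j * C(k, j) * (k - j)^n."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)  # the alternating sum is always divisible by k!

def bell(n: int) -> int:
    """Total number of possible clusterings when the number of clusters K is
    unknown: the summation of S(n, K) over K = 1..n."""
    return sum(stirling2(n, k) for k in range(1, n + 1))

# Ways to organize the dataset's 77 documents into exactly eight clusters
print(f"{stirling2(77, 8):.1e}")  # prints 8.6e+64, matching the figure in the text
```

Python's arbitrary-precision integers make the count exact; the scientific-notation formatting is only applied for display.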
F-measure = 0.61; (l, h) = (8, 0.86); Seed(8)
Cluster 1: H1 D2 H3 H4 H2 H11 E18 E12 C4 E19 E13 E17 A18 H13 E16 E14 C7 E6 A19
Cluster 2: B1 B11 B12 B3 B10 B18 B20 B13 B2 B7 B9 B8
Cluster 3: C1
Cluster 4: C3 F2 G6
Cluster 5: G1 G2 G5 G4 D4 G7 D5 D3
Cluster 6: E1 E3
Cluster 7: D1
Cluster 8: F1 F5 F3 F6 F4 F7
Cluster 9: A1 A5 A6 A7 A9 A8 A3 A4 A10 A16 A22 A20 A23 A21 A15 A17
Cluster 10: G3 C5
Cluster 11: B17
Cluster 12: C11
Cluster 13: A11 A12 A14
Cluster 14: A13

F-measure = 0.83; (l, h) = (13, 0.69); Seed(572639)
Cluster 1: G1 G2 G5 G7 D4 D5 D1 G3 D3 G6 G4 D2
Cluster 2: E1 E3 E16 E13 E12 E14 E19 E17 E18 E6
Cluster 3: C1
Cluster 4: F1 F5 F7 F3 F4 F6
Cluster 5: H1 H3 H4 H2 H11 H13 C4
Cluster 6: A1 A4 A8 A7 A9 A6 A5 A3 A14 A22 A16 A20 A10 A21 A15 A17 A23 A19 C9 A11
Cluster 7: B1 B11 B12 B3 B10 B13 B18 B2 B7 B9 B8 B20 B17
Cluster 8: F2 C3 C7 C5
Cluster 9: C11
Cluster 10: A12
Cluster 11: A13
Cluster 12: A18

Fig. 5. Clustering results using tf weighting method.
F-measure = 0.84; (l, h) = (69, 0.20); Seed(576168)
Cluster 1: D1 D2 D4 D5 G7 D3
Cluster 2: B1 B12 B11 B3 B10 B2 B9 B8 B7 B20 B13 B18 B17
Cluster 3: C1 C11
Cluster 4: E1 E3 E16 E13 E19 E14 E17 E6
Cluster 5: G1 G4 G3 G6 G5 G2
Cluster 6: H1 H4 H3 H2
Cluster 7: C3 C7 F2
Cluster 8: C4
Cluster 9: E12
Cluster 10: A1 A3 A7 A8 A9 A6 A5 A4 A16 A20 A10 A14 A12 A11 A22 A13 A17 A21 A15
Cluster 11: C5
Cluster 12: H11 H13
Cluster 13: F1 F3 F5 F7 F4 F6
Cluster 14: C9
Cluster 15: E18
Cluster 16: A18
Cluster 17: A19
Cluster 18: A23

F-measure = 0.74; (l, h) = (15, 0.69); Seed(8)
Cluster 1: H1 H4 H3 H2 H13 H11
Cluster 2: B1 G5 G2 G1 G3 G4
Cluster 3: C1
Cluster 4: C3 C7 C9 C4 G6 F2
Cluster 5: E1 E3 E16 E13 E12 E19 E14 E18 E17 E6 A18
Cluster 6: D1 D4 D5 G7 D2 D3
Cluster 7: B2 B7 B13 B18 B20 B9 B8 B10 B17
Cluster 8: B3 B11 B12
Cluster 9: F1 F5 F7 F6 F4 F3
Cluster 10: A1 A4 A5 A6 A7 A9 A8 A3 A15 A21 A17 A22 A23 A16 A20 A10 A19
Cluster 11: C5
Cluster 12: C11
Cluster 13: A11 A12
Cluster 14: A13
Cluster 15: A14

Fig. 6. Clustering results using tf–idf weighting method.
tf: P = 0.744, R = 0.735, F-measure = 0.740
Cluster 1: F1 F5 F7 F3 F4 F6
Cluster 2: B1 B11 B12 B3 G1 B10 B13 B18 B2 B7 B9 B8 B20 B17
Cluster 3: A1 A4 A8 A7 A9 A6 A5 A3 A14 A22 A16 A20 A10 A21 A15 A17 A23 A19 C9 A11
Cluster 4: C1
Cluster 5: D1 D4 G7 D5 G2 G3 D3 G6 G4 G5
Cluster 6: C3 F2 E6 H13 E19 E17 E14 C4 C7 E12 C5 E13 E18
Cluster 7: D2 H1 H3 H4 H2 H11
Cluster 8: E1 E3 E16
Cluster 9: C11
Cluster 10: A12
Cluster 11: A13
Cluster 12: A18

tf–idf: P = 0.905, R = 0.627, F-measure = 0.741
Cluster 1: H1 H4 H3 H2 H13 H11
Cluster 2: B1 G5 G2 G1 G3 G4
Cluster 3: C1
Cluster 4: C3 C7 C9 C4 G6 F2
Cluster 5: E1 E3 E16 E13 E12 E19 E14 E18 E17 E6 A18
Cluster 6: D1 D4 D5 G7 D2 D3
Cluster 7: B2 B7 B13 B18 B20 B9 B8 B10 B17
Cluster 8: B3 B11 B12
Cluster 9: F1 F5 F7 F6 F4 F3
Cluster 10: A1 A4 A5 A6 A7 A9 A8 A3 A15 A21 A17 A22 A23 A16 A20 A10 A19
Cluster 11: C5
Cluster 12: C11
Cluster 13: A11 A12
Cluster 14: A13
Cluster 15: A14

Fig. 7. Two clustering outcomes similar in F-measure scores and varying in precision and recall.
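The tf and tf–idf weighting schemes compared throughout this section can be sketched in a few lines. This is an illustrative implementation under common definitions (raw term counts for tf, counts scaled by log(N/df) for tf–idf, cosine similarity between sparse term vectors), not the authors' exact code; all names are assumptions.

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each tokenized document: term frequency scaled by the inverse
    document frequency log(N / df), so terms concentrated in few documents
    carry more weight than terms spread across the collection."""
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

Dropping the log(n / df[t]) factor gives the plain tf weighting; the rest of the pipeline is unchanged, which is what makes the two schemes directly comparable in the trials above.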
[Figure: flowchart — Start → determine the test set ts, the number of outlier documents n, and the training set tr; initialize i = 0 → add ts(i) to tr and build the t–d matrix Xtr,ts(i) → reconstruct X̂l → classify ts(i); remove ts(i) from tr; increment i → repeat while i < n → evaluate final clustering → End. Legend: ts: set of test documents (outlier documents); tr: set of training documents (clusters); n: number of outlier documents; i: integer denoting a specific document in the test set; Xtr,ts(i): t–d matrix based on training set tr and test document ts(i); l: dimensionality factor.]

Fig. 8. Cluster refinement process.
the clustering outcomes illustrated in the above figures. As discussed above, a small number of clusters in an outcome produces a high recall result, but also increases cluster impurity, especially if the number of resulting clusters is less than the number of true classes. Conversely, a high precision value is generated if the outcome contains a large number of clusters, provided that the clusters are made up of same-class instances (i.e. the case of class fragmentation). These results suggest that the optimum outcome of a trial run using the tf–idf method tends to have a relatively high precision result and a moderately high recall result.

5. Clustering using a hybrid approach

Fig. 7 displays two different clustering outcomes with an almost identical F-measure score, one for each weighting method. The general characteristics of fragmentation and impurity discussed in the previous section apply to both cases. If the small clusters – the group of outliers – in the tf–idf outcome are ignored, the remaining large clusters with minimal impurity can still make an acceptable representation of every true class in the dataset. For example, class A is represented by cluster 10, class B by cluster 7, class C by cluster 4, etc. The same cannot be said for the tf outcome due to the high impurity of cluster 5 (a combination of classes D and G) and cluster 6 (composed mainly of classes C and E, with a couple of instances from other classes). The tf–idf clustering outcome can therefore be reformulated as a classification problem, by splitting the outcome into a test set made up of the outlier cases and a training set consisting of the remaining clusters. Refinement of a high-precision average-recall clustering outcome is possible by a secondary classification step in which each outlier is classified to one of the large clusters. This hybrid approach therefore combines an unsupervised learning method (single pass clustering) with a supervised learning method (text classification) with the objective of improving clustering performance by reducing fragmentation.

Fig. 8 outlines the process used for refining cluster outcomes and evaluating the technique. The process is preceded by performing single pass clustering on the dataset and defining a specific outcome which the process aims at improving. The first step in the refinement process is to define the training and testing sets for the classifier. Outlier instances are defined based on a minimum cluster size (smin). Members of any cluster in the original outcome that fails to satisfy the minimum are considered outliers and included in the test set. Accordingly, the larger the minimum limit the smaller the number of clusters in the final outcome. Selecting the minimum cluster size is judgmental, based primarily on knowledge of the dataset and whether or not large clusters are expected. Outliers are extracted and the remaining clusters form the training set and are considered the classes used for classification. The more these clusters correlate with the true classes in the dataset (i.e. the lower their impurity and the better they represent each of the true classes) the better the chances of an improved clustering outcome after refinement.

Having identified both sets, each individual test document is added to the training set in order to be classified to one of the clusters. With each addition, the t–d matrix is developed and then reduced to a dimensionality level that exposes similarities between documents in the set to facilitate the classification step. Based on the results of the evaluation of different text classifiers at varying dimensionality factors in [17], a Rocchio classifier was implemented for the hybrid approach using a dimensionality level of approximately 67% of the available dimensions. Finally, each outlier is classified and grouped with the closest cluster and the refined outcome is evaluated using F-measure to enable comparison between the original clustering outcome and the refined outcome. The outcomes from the trial runs previously performed were used to evaluate the refinement process to obtain a representative estimate of the process's effect on clustering.

Fig. 9 displays the results of evaluating the hybrid clustering approach using a minimum cluster size (smin) of four. The same baseline as before was used after considering the optimum threshold for each case (i.e. the highest result achieved using lmax across the range of threshold values versus the highest result achieved using lopt). Table 2 is a matrix of the average F-measure scores of the trial runs for the different combinations of original/refined, baseline/optimum. Table 3 and Table 4 are the equivalent matrices for precision and recall.

In general, the optimum cases displayed better precision and F-measure scores than the baseline cases. This indicates LSA's contribution to improved clustering results, but also highlights the importance of identifying the appropriate dimensionality factor in order to achieve such improvements. The optimum cases demonstrated a slight deterioration in recall from the baseline, but not significant enough to prevent an improvement in F-measure scores due to a high increase in the optimum's precision.

A surge in recall and a drop in precision are observed between the original and refined states. The increase in recall is attributed to a reduction in the total number of clusters in the final outcome as a result of

[Figure: line chart of F-measure scores (vertical axis from 0.5 to 0.9) across trial runs for four cases: Baseline - Original, Optimum - Original, Baseline - Refined and Optimum - Refined.]

Fig. 9. Comparison of F-measure results across trial runs.

Table 2
Matrix of average F-measure scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.716      0.782     0.066
Refined      0.735      0.844     0.109
Difference   0.019      0.062     0.128

Table 3
Matrix of average precision scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.774      0.929     0.155
Refined      0.637      0.819     0.182
Difference   −0.137     −0.110    0.045

Table 4
Matrix of average recall scores for baseline and optimum cases, before and after refinement.

Category     Baseline   Optimum   Difference
Original     0.678      0.677     −0.001
Refined      0.880      0.875     −0.005
Difference   0.201      0.197     0.197
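The refinement step described above — extract the outliers from clusters that fall below the minimum size smin, then assign each outlier to the best-matching remaining cluster — can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a plain nearest-centroid (Rocchio-style) rule on precomputed document vectors and omits the LSA reconstruction at roughly 67% of the available dimensions that the paper applies before each classification; all names are hypothetical.

```python
import math

def refine(clusters: list[list[str]], vectors: dict[str, dict[str, float]],
           s_min: int = 4) -> list[list[str]]:
    """Hybrid refinement sketch: clusters smaller than s_min supply the
    outlier (test) documents; the remaining clusters act as the classes of a
    Rocchio-style nearest-centroid classifier that absorbs each outlier."""
    training = [list(c) for c in clusters if len(c) >= s_min]
    outliers = [d for c in clusters if len(c) < s_min for d in c]

    def centroid(cluster):
        terms = {t for d in cluster for t in vectors[d]}
        return {t: sum(vectors[d].get(t, 0.0) for d in cluster) / len(cluster)
                for t in terms}

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    for d in outliers:  # classify each outlier to the closest cluster centroid
        best = max(range(len(training)),
                   key=lambda i: cos(vectors[d], centroid(training[i])))
        training[best].append(d)
    return training
```

With s_min = 4 this mirrors the configuration evaluated in Fig. 9: fragmentation drops because every sub-minimum cluster is dissolved into one of the core clusters, trading some precision for a larger gain in recall.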
Original outcome — (l, h) = (57, 0.24); Seed(23); F-measure = 0.786; P = 0.977; R = 0.658
Cluster 1: F1 F2 F3 F5 F4 F6 F7
Cluster 2: E1 E3 E16 E13 E19 E14 E17 E6
Cluster 3: G1 G4 D2 G3 G2 G5 G6 G7
Cluster 4: D1 D4 D5 D3
Cluster 5: B1 B12 B11 B20 B3 B10 B9 B8 B2 B7 B13 B18 B17
Cluster 6: H1 H4 H3 H2
Cluster 7: A1 A3 A7 A8 A9 A4 A6 A5 A16 A20 A10 A14 A12 A11 A22 A13
Cluster 8: C1 C11
Cluster 9: C3 C7 C4 C9
Cluster 10: H11 H13
Cluster 11: E12
Cluster 12: C5
Cluster 13: A15 A17 A21
Cluster 14: E18
Cluster 15: A18
Cluster 16: A19
Cluster 17: A23

Training set (clusters of size four or more):
Cluster 1: F1 F2 F3 F5 F4 F6 F7
Cluster 2: E1 E3 E16 E13 E19 E14 E17 E6
Cluster 3: G1 G4 D2 G3 G2 G5 G6 G7
Cluster 4: D1 D4 D5 D3
Cluster 5: B1 B12 B11 B20 B3 B10 B9 B8 B2 B7 B13 B18 B17
Cluster 6: H1 H4 H3 H2
Cluster 7: A1 A3 A7 A8 A9 A4 A6 A5 A16 A20 A10 A14 A12 A11 A22 A13
Cluster 9: C3 C7 C4 C9

Test set (outliers): C1 C11 H11 H13 E12 C5 A15 A17 A21 E18 A18 A19 A23

Refined outcome — (l, h) = (57, 0.24); Seed(23); F-measure = 0.893; P = 0.879; R = 0.907
Cluster 1: F1 F2 F3 F5 F4 F6 F7
Cluster 2: E1 E3 E16 E13 E19 E14 E17 E6 E12 E18 A18
Cluster 3: G1 G4 D2 G3 G2 G5 G6 G7
Cluster 4: D1 D4 D5 D3 C1
Cluster 5: B1 B12 B11 B20 B3 B10 B9 B8 B2 B7 B13 B18 B17 C11
Cluster 6: H1 H4 H3 H2 H11
Cluster 7: A1 A3 A7 A8 A9 A4 A6 A5 A16 A20 A10 A14 A12 A11 A22 A13 H13 A15 A17 A21 A19 A23
Cluster 9: C3 C7 C4 C9 C5

Fig. 10. Cluster outcome refinement example.
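The precision and recall figures in this evaluation are computed over pair-wise relationships between documents, as the discussion of false-negative pair-wise relationships indicates: a pair of documents placed in the same cluster counts as a true positive when the two documents also share a class. A minimal sketch of that evaluation, assuming (for illustration only) that the class label can be read from the letter prefix of each document ID as in the figures above, might look like:

```python
from itertools import combinations

def pairwise_scores(clusters: list[list[str]]) -> tuple[float, float, float]:
    """Pair-wise precision, recall and balanced F(1) for a clustering,
    using the document ID's leading letter (e.g. 'A18' -> class 'A')
    as the ground-truth class label."""
    def label(doc: str) -> str:
        return doc[0]

    docs = [d for c in clusters for d in c]
    same_cluster = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    same_class = {frozenset(p) for p in combinations(docs, 2)
                  if label(p[0]) == label(p[1])}
    tp = len(same_cluster & same_class)       # correctly co-clustered pairs
    precision = tp / len(same_cluster)
    recall = tp / len(same_class)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Fragmentation shows up here directly: splitting a class across clusters removes same-class pairs from same_cluster, which lowers recall while leaving precision untouched.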
classification of the outliers, which is expected since the objective of the refinement process is to improve recall by reducing fragmentation. The decrease in precision is a result of impurity, not only from the misclassification of the outliers, but also from the original pre-refined clusters. This tradeoff between the change in precision and recall before and after refinement resulted in an increase in F-measure scores for both the baseline and optimum cases of 1.9% and 6.2%, respectively. Overall, all three metrics experienced an increase from the original-baseline averages to the refined-optimum averages of 4.5%, 19.7% and 12.8% for precision, recall and F-measure, respectively.

A closer look at an actual refined outcome will give substance to the above results. Fig. 10 illustrates the clustering refinement process for a sample outcome. Only one cluster in the original outcome (Cluster 3) contains a misplaced document. This explains the very high precision value. The outcome is also highly fragmented, which explains the medium recall value: fragmentation increases the number of false-negative pairwise relationships and consequently reduces recall. Separation of the outliers using a minimum cluster size of four results in eight remaining clusters used as the training set for the classification step that are very low in impurity and that make a good representation of the true classes of the dataset. Four of the 13 documents in the test set were classified incorrectly, reducing the precision value for the refined outcome. However, accurate classification of the majority of the outliers resulted in a large gain in recall (due to the decrease in the number of false-negative pair-wise relationships), ultimately causing a 10.7% increase in the F-measure score.

The success of the proposed hybrid clustering technique therefore relies on:

• creating 'good clusters' using the single pass clustering step—low impurity clusters that match the true classes in the dataset and consequently ensure easier classification of the outlier documents in the next step, and
• selecting the appropriate minimum cluster size that best distinguishes between clusters and outliers (training set and test set) for the classification step.

5.1. Selection of dimensionality and threshold values

[Figure: intensity grid of average F(0.5) scores over threshold (h) values 0.1 to 0.9 and dimensionality factor (l) values 10 to 70, shaded in bands of 0.8, 0.8–0.7, 0.7–0.5, 0.5–0.3 and < 0.3, with Region 1 (low dimensionality, high threshold) and Region 2 (high dimensionality, low threshold) marked.]

Fig. 11. Intensity grid of average F(0.5) scores of ten trial runs.

Precision of the clustering outcome after the initial step of single pass clustering is a good indicator of the degree of impurity of the generated clusters; the higher the precision the less impure. However, while precision is more important, completely neglecting recall will single out for the classification step an extreme result of a completely fragmented outcome that has a perfect precision value but is composed of a large number of single-instance or low-size clusters—a result that is unsuitable for the classification step. A moderate recall value is required to cause the necessary balance between low impurity and fragmentation.

As demonstrated above, the tf–idf method's optimum clustering outcome produced fairly consistent results over many trial runs that were generally characterized by high precision values and moderately high recall values (a mean of 0.93 for precision with a standard deviation of 0.04, and a mean of 0.68 for recall with a standard deviation of 0.05). 'Good clusters' result from a good choice of factors for the single pass clustering step – dimensionality and threshold – that consistently generate high clustering performance. Fig. 4 above identifies two main regions of high F-measure scores for the tf–idf weighting method based on the average scores of multiple trial runs: a low-threshold/high-dimensionality region, and a high-threshold/low-dimensionality region. So far, the F-measure used in all calculations is the F(1) score, a balanced F-measure that gives equal weights to precision and recall (calculated using F-measure = (β² + 1)PR / (β²P + R) and setting β = 1). Using an unbalanced F-measure, which gives a small advantage
[Figure: two line charts of F(0.5), precision and recall over 100 trial runs (vertical axis from 0.5 to 1.0): Region 1 (dimensionality factor = 16, threshold value = 0.67) and Region 2 (dimensionality factor = 56, threshold value = 0.24).]

Fig. 12. Consistency of maximum performance factor combinations over 100 trial runs.
to precision over recall, gives a better picture of the range of factor values that are more likely to produce good clusters. Fig. 11 represents the intensity grid of the average F(0.5) score for the same trial runs. A prominent area of high clustering performance appears within the [10, 20] dimensionality range and the [0.7, 0.85] threshold range (region 1). The highest average F(0.5) was 0.85 at a dimensionality factor of 16 and a threshold of 0.67. A smaller region of high performance appears within the [0.20, 0.25] threshold range and the fifties dimensionality range (region 2). The highest score achieved in this region was 0.84 at a threshold of 0.24 and the dimensionality factors 56 and 57.

To test consistency of results, the factor combination with the maximum result in each region was tested for 100 trials after randomizing the sequence of documents. Fig. 12 displays the variation of the evaluation metrics across the trial runs. The average F(0.5) score for both regions across the 100 trials was the same (0.85); however, the results from region 1 were more consistent. The standard deviation for all three evaluation metrics in region 1 was 0.01, while the standard deviations for precision, recall and F(0.5) in region 2 were 0.07, 0.01 and 0.03, respectively. Moreover, whereas 37 trial runs resulted in a precision value less than 0.9 for the region 2 factors, the lowest precision value for a trial run using the region 1 factors was 0.91.

A look at the actual clusters formed by the factors of each region gives a good indication of consistency of results. Over the 100 trial runs, region 1 factors generated three unique outcomes, illustrated in
Outcome R1A
Cluster 1: F1 F7 F5 F6 F3 F2
Cluster 2: C1
Cluster 3: B1 G5 G2 G1 G3 G4
Cluster 4: E1 E3 E16 E13 E12 E19 E14 E18 E17 E6
Cluster 5: H1 H4 H3 H2
Cluster 6: D1 D4 D5 G7 D2 D3
Cluster 7: B2 B7 B13 B20 B18 B9 B8 B10 B17
Cluster 8: A1 A4 A5 A6 A7 A9 A8 A3 A15 A21 A17 A22 A23 A16 A20 A10 A19
Cluster 9: B3 B11 B12
Cluster 10: F4
Cluster 11: C3 C7 C9 C4 G6
Cluster 12: H11 H13
Cluster 13: C5
Cluster 14: A11 A12
Cluster 15: C11
Cluster 16: A13
Cluster 17: A14
Cluster 18: A18

Outcome R1B
Cluster 1: A1 A4 A5 A6 A7 A9 A8 A3 A15 A21 A17 A22 A23 A16 A20 A10 A19
Cluster 2: H1 H4 H3 H2
Cluster 3: G1 G3 G2 G5 B1 G4
Cluster 4: C1
Cluster 5: E1 E3 E16 E13 E12 E19 E14 E18 E17 E6
Cluster 6: C3 C7 C9 C4 F2 G6
Cluster 7: D1 D4 D5 G7 D2 D3
Cluster 8: C5
Cluster 9: F1 F7 F5 F6 F3
Cluster 10: H11 H13
Cluster 11: C11
Cluster 12: B2 B7 B13 B20 B18 B9 B8 B10 B17
Cluster 13: B3 B11 B12
Cluster 14: F4
Cluster 15: A11 A12
Cluster 16: A13
Cluster 17: A14
Cluster 18: A18

Outcome R1C
Cluster 1: H1 H4 H3 H2
Cluster 2: C1
Cluster 3: C3 C7 C9 C4 F2 G6 E6
Cluster 4: B1 G5 G2 G1 G3 G4
Cluster 5: D1 D4 D5 G7 D2 D3
Cluster 6: E1 E3 E16 E13 E12 E19 E14 E18 E17
Cluster 7: C5
Cluster 8: F1 F7 F5 F6 F3
Cluster 9: B2 B7 B13 B20 B18 B9 B8 B10 B17
Cluster 10: A1 A4 A5 A6 A7 A9 A8 A3 A15 A21 A17 A22 A23 A16 A20 A10 A19
Cluster 11: B3 B11 B12
Cluster 12: C11
Cluster 13: F4
Cluster 14: H11 H13
Cluster 15: A11 A12
Cluster 16: A13
Cluster 17: A14
Cluster 18: A18

Fig. 13. Clustering outcomes for region 1 factors.
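The F(0.5) score used in this section follows the general F(β) formula quoted earlier, with β = 0.5 weighting precision more heavily than recall. A quick sketch (the numeric inputs below are illustrative, not taken from the trials):

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    # F(beta) = (beta^2 + 1) * P * R / (beta^2 * P + R)
    # beta = 1 balances precision and recall; beta < 1 favors precision
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# The same high-precision outcome scores better under F(0.5) than under F(1):
print(round(f_beta(0.9, 0.6, beta=0.5), 3))  # prints 0.818 (precision-weighted)
print(round(f_beta(0.9, 0.6, beta=1.0), 3))  # prints 0.72  (balanced)
```

This is why F(0.5) is the natural selection criterion here: the initial clusters feed a classifier, so low impurity (precision) matters more than completeness, while recall still prevents the degenerate fully fragmented outcome.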
[Figure: bar chart of the % change in precision, recall and F-measure between original and refined outcomes at minimum cluster sizes (smin) of 2, 3, 4 and 5, with the vertical axis ranging from −30% to 30%.]

Fig. 14. Change in average clustering performance between original and refined outcomes.
Fig. 13. Outcome R1A occurred 76 times, while outcomes R1B and R1C occurred 16 and 8 times, respectively. The three outcomes are identical in the number of clusters formed and the composition of each cluster, except for a disagreement over the clusters for documents F2 and E6. On the other hand, while the highest F(0.5) score in all 100 trials for both regions was based on an outcome using region 2 factors, such factors generated 10 unique outcomes ranging in F(0.5) scores from a minimum of 0.62 to a maximum of 0.89.

While factor combinations from region 2 have the potential of producing outcomes that have higher F-measure scores, results vary depending on the order of the documents used during single pass clustering. This can be attributed to the low threshold value and high dimensionality factor of region 2. With high dimensionality, optimal separation of similar instances is not achieved, and a lower threshold is required to achieve good clustering performance. Under these conditions, the number of candidates that satisfy the similarity limit for a forming cluster increases, thereby increasing the competition between clusters over the instances. Since clusters are formed one at a time based on the order of the instances, an early forming cluster develops with a larger pool of candidate instances, thus given priority over a late forming cluster. The outcome is therefore susceptible to such order. Conversely for region 1, the polarizing effect of a low dimensionality factor results in effective separation of same-class instances, thus allowing the use of a relatively high threshold value. With limited competition between clusters over the documents as a result of the stricter similarity threshold, the outcomes of clustering are fairly consistent regardless of the sequence of documents used in the process.

5.2. Choice of minimum cluster size

The choice of the minimum cluster size has an impact on the refined outcome's final F-measure score. In Fig. 10, if three is used as the minimum cluster size, then cluster 13 would be included in the training set, thereby splitting class A in the refined outcome. In this case the increase in recall will not be the same as for the case of using a minimum cluster size of four and accordingly a lower final F-measure would be expected. Evaluating with a minimum cluster size of three for the above example yielded P = 0.86, R = 0.778 and F-measure = 0.817; only a 3.1% improvement over the original clustering outcome. To measure the effect of the minimum cluster size on the refined outcome, the above evaluation was repeated for different values of smin ranging between five (the size of the smallest class in the dataset) and two (where only single-instance clusters in the original outcomes are defined as outliers). Fig. 14 illustrates the average difference in precision, recall and F-measure scores between the original and refined outcomes at different minimum cluster size values. For all values of smin, the gain in recall after the refinement process overcomes the loss in precision, resulting in a positive increase in F-measure scores, except at a minimum cluster size of 5, for which a loss in performance occurs after refinement. The highest gain in F-measure scores was at a minimum cluster size of four.

A comparison of the final refined outcome at both extremes of the smin range reveals the consequences of selection of a specific minimum cluster size. Fig. 15 displays two outcomes of the same trial run based on a minimum cluster size of five and two. At the high end, the number of clusters used for classification tends to be low compared to the number of true classes in the dataset and the number of outliers tends to be high. Since whole classes are missing from the training set, classification accuracy is expected to be very low. Accordingly, even if such clusters initially have low impurity, classification quickly erodes this advantage and the gain in recall is not sufficient to make any positive impact on the final F-measure score. This case is impractical for information retrieval purposes as the composition of the resulting clusters is too diverse to allow any reasonable assessment of the clusters' content.
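The order sensitivity discussed above follows directly from how single pass clustering works: each document is compared against the centroids of the clusters formed so far and either joins the most similar one (if the similarity meets the threshold h) or seeds a new cluster. A minimal sketch, with illustrative names and a simple incremental centroid update:

```python
import math

def single_pass(docs: list[tuple[str, dict[str, float]]], h: float) -> list[list[str]]:
    """Single pass clustering sketch: a document joins the most similar
    existing cluster when the cosine similarity to its centroid meets the
    threshold h, and otherwise starts a new cluster. Because clusters form
    one at a time, the outcome depends on the order of the documents."""
    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    clusters: list[list[str]] = []
    centroids: list[dict[str, float]] = []
    for name, vec in docs:
        sims = [cos(vec, c) for c in centroids]
        if sims and max(sims) >= h:
            i = sims.index(max(sims))
            clusters[i].append(name)
            n = len(clusters[i])  # incremental update of the cluster centroid
            terms = set(centroids[i]) | set(vec)
            centroids[i] = {t: (centroids[i].get(t, 0.0) * (n - 1)
                                + vec.get(t, 0.0)) / n for t in terms}
        else:
            clusters.append([name])
            centroids.append(dict(vec))
    return clusters
```

Shuffling the input list under different random seeds and re-running single_pass reproduces the order-randomization procedure used in the trial runs above; with a strict threshold and well-separated vectors the outcomes converge, which is the behavior observed for region 1.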
smin = 2 (F-measure = 0.855; P = 0.953; R = 0.776):
  Cluster 1:  F1 F2 F3 F5 F4 F6 F7
  Cluster 2:  E1 E3 E16 E13 E19 E14 E17 E6 E12 E18 A18
  Cluster 3:  G1 G4 D2 G3 G2 G6 G7 G5
  Cluster 4:  D1 D4 D5 D3
  Cluster 5:  B1 B12 B11 B20 B3 B10 B9 B8 B2 B7 B18 B13 B17
  Cluster 6:  H1 H4 H3 H2
  Cluster 7:  A1 A3 A7 A8 A9 A4 A6 A5 A16 A20 A10 A14 A12 A11 A22 A13 A19 A23
  Cluster 8:  C1 C11 C5
  Cluster 9:  C3 C7 C4 C9
  Cluster 10: H11 H13
  Cluster 11: A15 A17 A21

smin = 5 (F-measure = 0.700; P = 0.595; R = 0.851):
  Cluster 1: F1 F2 F3 F5 F4 F6 F7
  Cluster 2: E1 E3 E16 E13 E19 E14 E17 E6 H1 H3 H2 C9 H11 E12 C5 A15 E18 A18
  Cluster 3: G1 G4 D2 G3 G2 G6 G7 G5 D1 D4 D5 D3 H4 C1
  Cluster 4: B1 B12 B11 B20 B3 B10 B9 B8 B2 B7 B18 B13 B17 C11
  Cluster 5: A1 A3 A7 A8 A9 A4 A6 A5 A16 A20 A10 A14 A12 A11 A22 A13 C3 C7 C4 H13 A17 A21 A19 A23

Fig. 15. Effect of minimum cluster size on cluster refinement.
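The refinement step behind Fig. 15 can be sketched as follows. The nearest-centroid (Rocchio-style) rule used here is a hypothetical stand-in for the text classifier that the article trains on the core clusters.

```python
import numpy as np

def refine_clusters(core_clusters, outliers, vec):
    """Cluster refinement sketch: treat the core clusters as training
    classes and assign each outlier document to the closest cluster.

    `core_clusters` is a list of lists of document ids, `outliers` a
    list of document ids, and `vec` maps a document id to its vector.
    """
    centroids = [np.mean([vec[d] for d in cl], axis=0) for cl in core_clusters]
    refined = [list(cl) for cl in core_clusters]
    for d in outliers:
        v = vec[d]
        sims = [float(v @ c) / (np.linalg.norm(v) * np.linalg.norm(c) + 1e-12)
                for c in centroids]
        refined[int(np.argmax(sims))].append(d)   # outlier joins best cluster
    return refined
```

With smin = 5 in Fig. 15, many documents become outliers and are forced into the few surviving clusters, which is one way precision can drop even as recall rises.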
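The precision, recall and F-measure values quoted for Fig. 15 follow the scoring formulation defined earlier in the article. The sketch below is a simplified, assumed variant that reproduces the qualitative behavior discussed in this section: each cluster is credited with its majority class, precision is measured over clustered documents only, and recall over all documents, so leaving documents as outliers depresses recall while misassigned documents depress both scores.

```python
from collections import Counter

def cluster_scores(clusters, outliers, label):
    """Simplified precision/recall/F sketch (an illustrative assumption,
    not the article's exact formulation). `clusters` is a list of lists
    of document ids, `outliers` the unclustered ids, and `label` maps a
    document id to its true class."""
    correct = clustered = 0
    for cl in clusters:
        majority, count = Counter(label[d] for d in cl).most_common(1)[0]
        correct += count          # documents matching the cluster's majority class
        clustered += len(cl)
    total = clustered + len(outliers)
    p = correct / clustered       # outliers do not hurt precision
    r = correct / total           # outliers count against recall
    return p, r, 2 * p * r / (p + r)
```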

The other end of the spectrum (smin = 2) is the case where outliers are only single-instance clusters. For this case, the number of clusters in the training set tends to be high in comparison with the number of true classes in the dataset, and the number of test documents tends to be low. Due to fragmentation of the original outcome, classes may be represented by more than one cluster in the training set. As such, there is a better chance of grouping outliers with similar-class documents, thereby increasing recall and limiting any decrease in precision. However, with multiple classes split over more than one cluster, the final result is still highly fragmented. This case could be considered a very conservative clustering approach and, for practical purposes, can be used as an initial step for simplifying a large dataset into smaller groups of very similar documents.

6. Summary and conclusion

When the project document corpus is complete and appropriately organized (e.g. for previously completed projects), the use of text classifiers for document retrieval is suitable. In many cases, however, the document corpus is gradually and continuously developing (such as the case of an ongoing project) and the classes required for training a supervised learning method are not readily available. Particularly when classes are not predetermined and do not cover the whole spectrum of possible categories, the application of text classification is not straightforward. In this study, an unsupervised learning method was adapted and evaluated for the task of clustering documents based on textual similarity into sets of documents that are semantically related. The single pass clustering algorithm was adopted instead of the popular K-means clustering algorithm to avoid the requirement for a predetermined, user-defined cardinality (number of resulting clusters) associated with the latter. However, single pass clustering requires definition of a minimum threshold similarity measure that indicates, during the clustering process, whether a specific instance belongs to a specific cluster. In addition, single pass clustering is prone to variable clustering outcomes depending on the sequence of the instances used in the clustering process. Single pass clustering was performed on the same dataset under varying threshold values and dimensionality factors to evaluate the ability to identify the correct clusters within the dataset. Results indicate an inverse relationship between threshold and dimensionality: low dimensionality factors require high threshold values to achieve good clustering results, and vice versa. For the current dataset, a low dimensionality factor and a high threshold value demonstrated the best performance in terms of precision and consistency, resulting in an average F-measure score of 0.782, a 6.6% increase over the baseline. To boost recall, single pass clustering was followed by a cluster refinement step in which the resulting clusters were used to train a text classifier for classifying outliers. The average F-measure score after refinement was 0.844, a 6.2% improvement over the unrefined result (12.8% improvement over the original baseline). The results were based on repeated trials of different randomizations of the dataset in order to obtain representative values of the performance. In general, it can be concluded that results improved when some level of dimensionality reduction was applied. However, the evaluation showed that the relationship between dimensionality factor and threshold value is not constant, i.e. that a misguided choice of dimensionality reduction can result in performance deterioration.

Results of the evaluation show that textual similarities can be used to reveal semantic relations between documents in the dataset. For document management, this can be used to organize an unorganized document corpus – whether of an ongoing project or a previous project's unclassified corpus – into semantically related groups. The advantage of doing so is realized at the document retrieval stage: a search of the documents, whether by keywords and/or metadata, not only returns the relevant documents (those satisfying the user-defined keywords and/or metadata), but also returns other related documents in the cluster even if their similarity with the keywords is low or if they do not satisfy the metadata criteria [5]. This ensures high recall and guarantees access to the relevant information in the documents. While the proposed approach overcomes the all-inclusive class limitation of text classifiers, the assumption of mutually exclusive clusters remains a limitation of the approach. In practice, project documents may belong to discourses of multiple knowledge topics, and assigning a document to one and only one may cause knowledge gaps in others. Theoretically, the technique may be modified to adopt an any-of approach instead of the current one-of approach; however, evaluation will require a different dataset since all classes in the current dataset are mutually exclusive.

Another limitation is dictated by the size of the dataset used for evaluating the proposed methodology. The impact of the size of the dataset on the results is arguable. On one hand, a large dataset produces a large t–d matrix, which complicates matrix operations and increases computational cost. A large dataset also increases the chance of noisy data, which adversely affects the performance of the text analysis techniques. On the other hand, a small dataset, while easier to manipulate, offers a smaller feature set. Scarcity of features – the evidence used to perform the required text analysis task – can undermine the performance of the evaluated classification or clustering technique. Accordingly, caution should be exercised in extrapolating the results to other datasets. The dataset and the resulting vocabulary are relatively small, making any generalizations of the results unjustifiable absent further experimentation on other datasets.

Appendix A

The following example illustrates application of LSA. The sample is made up of 12 documents organized into two classes, classes D and G. Only the documents' subject headers are used in the analysis (as opposed to the full document body) in order to limit the size of the t–d matrix. The original t–d matrix – based on term frequency – and the reduced t–d matrix – after applying a dimensionality factor of 4 – are presented in Tables A.1 and A.2, respectively. On the documents' side, average pairwise similarity between document vectors of classes D and G increases after applying LSA from 0.28 and 0.44 to 0.39 and 0.72, respectively. On the terms' side, similarity between the vectors of the terms 'remobilization' and 'relocation' – which were used interchangeably – increased from 0 to 0.95 after dimensionality reduction. Similarly, similarity between the terms 'fence' and 'gate' increased from 0.38 to 0.87.

Table A.1
Original t–d matrix for example.

                 D1 D2 D3 D4 D5 G1 G2 G3 G4 G5 G6 G7
Adjacent          0  0  0  0  0  0  0  1  0  0  0  0
Airport           0  0  0  0  0  0  0  1  0  0  0  0
Approval          0  0  0  0  0  0  1  0  0  0  0  0
Area              0  1  0  1  0  0  0  0  0  0  0  0
Continuation      0  0  0  0  0  0  0  0  0  1  1  1
East              0  0  0  0  0  0  0  0  0  1  0  0
Extension         0  0  0  0  0  1  0  0  1  0  0  0
Fence             0  0  0  0  0  1  1  1  1  1  1  1
Gate              0  0  0  0  0  0  1  0  0  0  0  0
Mobilization      0  1  0  1  0  0  0  0  0  0  0  0
North             0  0  0  0  0  0  0  0  0  1  0  0
Office            1  0  1  0  1  0  0  0  0  0  0  0
Old               0  0  0  0  0  0  0  1  0  0  0  0
Re-mobilization   1  0  0  0  0  0  0  0  0  0  0  0
Relocation        0  0  1  0  1  0  0  0  0  0  0  0
Site              0  0  1  0  1  1  0  0  1  0  1  1
Stop              0  0  0  0  0  0  0  1  0  0  0  0
Temporary         0  0  0  0  0  0  1  0  0  0  0  0
Work              0  0  0  0  0  0  0  1  0  0  0  0
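The appendix example can be reproduced with a few lines of numpy. The matrix below is transcribed from Table A.1, and the reduction assumes that "a dimensionality factor of 4" corresponds to keeping the top four singular values of the SVD; the article reports similarity increases of 0.38 to 0.87 for 'fence'/'gate' and 0 to 0.95 for 'remobilization'/'relocation'.

```python
import numpy as np

# Term-document matrix from Table A.1 (19 terms x 12 documents,
# columns D1-D5 then G1-G7).
terms = ["adjacent", "airport", "approval", "area", "continuation",
         "east", "extension", "fence", "gate", "mobilization", "north",
         "office", "old", "re-mobilization", "relocation", "site",
         "stop", "temporary", "work"]
A = np.array([
    [0,0,0,0,0, 0,0,1,0,0,0,0],   # adjacent
    [0,0,0,0,0, 0,0,1,0,0,0,0],   # airport
    [0,0,0,0,0, 0,1,0,0,0,0,0],   # approval
    [0,1,0,1,0, 0,0,0,0,0,0,0],   # area
    [0,0,0,0,0, 0,0,0,0,1,1,1],   # continuation
    [0,0,0,0,0, 0,0,0,0,1,0,0],   # east
    [0,0,0,0,0, 1,0,0,1,0,0,0],   # extension
    [0,0,0,0,0, 1,1,1,1,1,1,1],   # fence
    [0,0,0,0,0, 0,1,0,0,0,0,0],   # gate
    [0,1,0,1,0, 0,0,0,0,0,0,0],   # mobilization
    [0,0,0,0,0, 0,0,0,0,1,0,0],   # north
    [1,0,1,0,1, 0,0,0,0,0,0,0],   # office
    [0,0,0,0,0, 0,0,1,0,0,0,0],   # old
    [1,0,0,0,0, 0,0,0,0,0,0,0],   # re-mobilization
    [0,0,1,0,1, 0,0,0,0,0,0,0],   # relocation
    [0,0,1,0,1, 1,0,0,1,0,1,1],   # site
    [0,0,0,0,0, 0,0,1,0,0,0,0],   # stop
    [0,0,0,0,0, 0,1,0,0,0,0,0],   # temporary
    [0,0,0,0,0, 0,0,1,0,0,0,0],   # work
], dtype=float)

def cos(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Rank-4 truncated SVD reconstruction of the t-d matrix.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A4 = (U[:, :4] * s[:4]) @ Vt[:4, :]

for a, b in [("fence", "gate"), ("re-mobilization", "relocation")]:
    i, j = terms.index(a), terms.index(b)
    print(a, b, round(cos(A[i], A[j]), 2), "->", round(cos(A4[i], A4[j]), 2))
```

Row vectors of the reconstructed matrix correspond to the reduced term vectors of Table A.2; column vectors give the reduced document vectors used for the class-D and class-G similarity averages.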

Table A.2
Reduced t–d matrix for example.

                    D1      D2      D3      D4      D5      G1      G2      G3      G4      G5      G6      G7
Adjacent          0.044   0.000  -0.006   0.000  -0.006   0.028   0.110   0.980   0.028  -0.042  -0.032  -0.032
Airport           0.044   0.000  -0.006   0.000  -0.006   0.028   0.110   0.980   0.028  -0.042  -0.032  -0.032
Approval         -0.067   0.000  -0.105   0.000  -0.105   0.086   0.127   0.110   0.086   0.171   0.119   0.119
Area              0.000   1.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
Continuation     -0.147   0.000  -0.056   0.000  -0.056   0.493   0.409  -0.107   0.493   0.661   0.625   0.625
East             -0.107   0.000  -0.145   0.000  -0.145   0.134   0.171  -0.042   0.134   0.270   0.196   0.196
Extension         0.028   0.000   0.196   0.000   0.196   0.326   0.173   0.055   0.326   0.267   0.359   0.359
Fence            -0.143   0.000   0.028   0.000   0.028   0.933   0.819   1.038   0.933   1.057   1.071   1.071
Gate             -0.067   0.000  -0.105   0.000  -0.105   0.086   0.127   0.110   0.086   0.171   0.119   0.119
Mobilization      0.000   1.000   0.000   1.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
North            -0.107   0.000  -0.145   0.000  -0.145   0.134   0.171  -0.042   0.134   0.270   0.196   0.196
Office            0.440   0.000   0.954   0.000   0.954   0.210  -0.278   0.031   0.210  -0.397   0.069   0.069
Old               0.044   0.000  -0.006   0.000  -0.006   0.028   0.110   0.980   0.028  -0.042  -0.032  -0.032
Re-mobilization   0.089   0.000   0.176   0.000   0.176   0.014  -0.067   0.044   0.014  -0.107  -0.020  -0.020
Relocation        0.351   0.000   0.779   0.000   0.779   0.196  -0.211  -0.013   0.196  -0.290   0.089   0.089
Site              0.339   0.000   1.063   0.000   1.063   0.881   0.200  -0.022   0.881   0.368   0.877   0.877
Stop              0.044   0.000  -0.006   0.000  -0.006   0.028   0.110   0.980   0.028  -0.042  -0.032  -0.032
Temporary        -0.067   0.000  -0.105   0.000  -0.105   0.086   0.127   0.110   0.086   0.171   0.119   0.119
Work              0.044   0.000  -0.006   0.000  -0.006   0.028   0.110   0.980   0.028  -0.042  -0.032  -0.032

References

[1] M. Al Qady, A. Kandil, Document management in construction—practices and opinions, Journal of Construction Engineering and Management 139 (10) (2013) 06013002-1–06013002-7.
[2] C.H. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, Journal of Computing in Civil Engineering 16 (4) (2002) 234–243.
[3] C.H. Caldas, L. Soibelman, Automating hierarchical document classification for construction management information systems, Automation in Construction 12 (4) (2003) 395–406.
[4] W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[5] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008.
[6] S. Saitta, P. Kripakaran, B. Raphael, I.F. Smith, Improving system identification using clustering, Journal of Computing in Civil Engineering 22 (5) (2008) 292–302.
[7] T. Cheng, J. Teizer, Modeling tower crane operator visibility to minimize the risk of limited situational awareness, Journal of Computing in Civil Engineering (Dec. 14 2012), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000282 (Epub).
[8] H.S. Ng, A. Toukourou, L. Soibelman, Knowledge discovery in a facility condition assessment database using text clustering, Journal of Infrastructure Systems 12 (1) (2006) 50–59.
[9] O. Raz, R. Buchheit, M. Shaw, P. Koopman, C. Faloutsos, Detecting semantic anomalies in truck weigh-in-motion traffic data using data mining, Journal of Computing in Civil Engineering 18 (4) (2004) 291–300.
[10] W. Guo, L. Soibelman, J.H. Garrett Jr., Visual pattern recognition supporting defect reporting and condition assessment of wastewater collection systems, Journal of Computing in Civil Engineering 23 (3) (2009) 160–169.
[11] S. Lee, L. Chang, Digital image processing methods for assessing bridge painting rust defects and their limitations, Proc. of the International Conference on Computing in Civil Engineering, American Society of Civil Engineers, Cancun, Mexico, 2005.
[12] I. Brilakis, L. Soibelman, Y. Shinagawa, Material-based construction site image retrieval, Journal of Computing in Civil Engineering 19 (4) (2005) 341–355.
[13] J. Gong, C.H. Caldas, Learning and classifying motions of construction workers and equipment using bag of video feature words and Bayesian learning methods, Proc. of the International Workshop on Computing in Civil Engineering, American Society of Civil Engineers, Miami, Florida, United States, 2011.
[14] V. Escorcia, M. Dávila, M. Golparvar-Fard, J. Niebles, Automated vision-based recognition of construction worker actions for building interior construction operations using RGBD cameras, Proc. of the Construction Research Congress 2012, American Society of Civil Engineers, West Lafayette, Indiana, United States, 2012.
[15] M. Al Qady, A. Kandil, Automatic document classification using a successively evolving dataset, Proc. of the 2011 3rd International/9th Construction Specialty Conference, Curran Associates, Inc., Ottawa, Ontario, Canada, 2011.
[16] T.K. Landauer, P.W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (2&3) (1998) 259–284.
[17] M. Al Qady, A. Kandil, Automatic classification of project documents based on text content, Journal of Computing in Civil Engineering (June 20 2013), http://dx.doi.org/10.1061/(ASCE)CP.1943-5487.0000338 (Epub).
