have been proposed, such as the Message Passing Interface (MPI) [8] and Open Multi-Processing (OpenMP) [6]. The idea of message passing is to parallelize independent tasks using a network of one master and several slave computers. While there is no possibility for communication between the slaves, this approach best fits scenarios where the same algorithm is started several times on different data sets or different analyses are calculated in parallel on the same
any other transaction while it is running. Therefore, STM allows for sharing changing states between threads in a synchronous and coordinated manner. In contrast to lock-based strategies, parallelization is possible without explicit thread management and without explicitly locking resources.

2. CLOJURE

Clojure as a new Lisp dialect uses a software transactional memory system as a means of concurrency control. In the following we highlight some features of the programming language Clojure and its access to STM. For an introduction to Clojure refer to Rich Hickey's website (http://clojure.org) and to the book by Halloway [9].

Functional programming.
Clojure is a new dialect of Lisp. Modern programming languages have already adopted many features of Lisp, but the interchangeability of code and data and the macro system are still unique to Lisp. Clojure is not designed to provide backward compatibility with ANSI Common Lisp. For instance, Clojure extends the s-expressions from lists to vectors and maps. All of its basic data structures are immutable by design. This allows for very efficient copying of data and is an important requirement for Clojure's software transactional memory system with multiversion concurrency control [4].

Java Virtual Machine.
Clojure is hosted on the Java Virtual Machine (JVM). The JVM is an industry-standard platform used by many developers and provides access to numerous Java libraries. Its use is not restricted to a specific operating system. Therefore, it is already used as an operational basis by several programming languages like Groovy or Scala [17, 23]. Compared to Haskell and its STM system, Clojure can additionally profit from this Java basis [24]. This ensures that the final parallel program can easily be complemented by a Graphical User Interface (GUI) and distributed as a platform-independent Java application.

Realizing software transactional memory in Clojure.
On the one hand, functional programming languages can help to eliminate concurrency issues while retaining thread-safe code, as their built-in immutable data structures will never change their state [30]. On the other hand, it is often necessary to provide mutable state. Concurrency control via software transactional memory enables an efficient management of mutable shared memory. This enables a fast and parallel design of algorithms without the need for manually administrating threads and concurrency issues. Different variants of realizing software transactional memory systems exist, such as locking / pessimistic, lock-free / optimistic, and hybrids. Clojure uses multiversion concurrency control (MVCC, Figure 2) [4]. Every thread has its own copy of the data and will process all of its modifications to the shared data without blocking other threads. Before committing the changes, it is checked whether other threads have altered the data in use. If so, the whole transaction has to be retried until a consistent commit can be done. As long as the number of failed transactions is small, this optimistic concurrency control technique scales well with the number of threads. Figure 2 shows the principles of Clojure's software transactional model. All changes to the state of data are encapsulated in transactions, i.e. every thread has a copy of its working data and can change its value. During submission of the changes to the shared memory, the consistency of the internal state is checked. If no interim changes occurred, the submission is performed, see Figure 2 Thread 0 and Thread n. If another thread working on another copy of the same data has meanwhile submitted its changes, the transaction is rejected and restarted with a new copy of the data, see Figure 2 Thread 1 and Thread 2.

MVCC supplies every actor with its own snapshot of the data. In Clojure, Refs and Agents handle changing mutable states to new values automatically. Here, all interactions with Refs must be encapsulated within a transaction via the dosync macro. Transactions support synchronous changes to multiple Refs. Every actor will see a consistent snapshot of the world at a certain point in time. Any changes to the data made by an actor will not be seen by other actors until the transaction has been committed. A transaction is automatically restarted if the states to be changed have been changed by another actor in the meantime. There are three methods to change the value of a Ref. Using ref-set the value is simply set to a new value; with alter the current in-transaction value of the Ref is modified. The commute operation enables simultaneous write access without checking for intermediate state changes, e.g. commutative changes to a list. For instance, a parallelized counter is realized with Refs as follows:

  (def counter (ref 0))
  (dosync (alter counter inc))

In contrast to Refs, Agents offer the possibility of asynchronous changes to a single reference. Functions are sent to an Agent, are asynchronously applied to the Agent's state, and their return value becomes the Agent's updated state. The actions of all Agents are collected in a thread pool and executed at some future point in time. Clojure's Agents are integrated with the software transactional memory system, i.e. any changes made during an action are delayed until the final commit. As with all of Clojure's concurrency support, no user-code locking is involved. The following code shows a parallelized counter realized with Agents in Clojure:

  (def counter (agent 0))
  (send counter inc)

3. A CLUSTERING CASE STUDY

With our case study we wanted to inspect the capability of Clojure for parallelizing existing algorithms on a multi-core desktop computer. We decided to implement an algorithm that requires only some basic transactional properties of Clojure, in order to benchmark the potential benefit of the parallelization. Also, the final program should give exactly the same results as the single-core implementation. The k-means cluster algorithm fits our requirements and is also included in several parallel benchmark suites [22]. Most of the runtime for large data sets is spent on vector calculations. Therefore, the parallel implementation was expected to scale well with the number of available processors.
[Figure 2 diagram: Thread 0, Thread 1, Thread 2, ..., each working on its own copy of the shared data (Data 0, Data 2, ...)]
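The Ref-based counter from Section 2 can be exercised concurrently to see the STM at work. The following is a minimal sketch of ours, not from the paper; the thread count is arbitrary:

```clojure
;; 1000 concurrent increments of a shared Ref. Conflicting transactions
;; are retried by the STM, so no increment is lost and no lock is taken.
(def counter (ref 0))

(let [futs (doall (for [_ (range 1000)]
                    (future (dosync (alter counter inc)))))]
  (doseq [f futs] @f))          ; wait for all increments to finish

(println @counter)              ; always 1000, despite the contention
(shutdown-agents)               ; futures share the agent thread pool
```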
The k-means clustering algorithm.
Clustering is a classical example of unsupervised learning, i.e. learning without a teacher. The term cluster analysis summarizes a collection of methods for generating hypotheses about the structure of the data by solely exploring pairwise distances or similarities in the data space. Clustering is often applied as a first step in data analysis for the creation of initial hypotheses. Let X = {x1, ..., xN} be a set of data points with feature vectors xi ∈ R^d. Cluster analysis is used to build a partition of a data set containing k clusters such that data points within a cluster are more similar to each other than points from different clusters. A partition P(k) is a set of clusters {C1, C2, ..., Ck} with 0 < k < N that meets the following conditions:

  C1 ∪ C2 ∪ ... ∪ Ck = X,  Ci ≠ ∅    (1a)
  Ci ∩ Cj = ∅,  i ≠ j                (1b)

Figure 3 illustrates the idea of clustering data. The artificial two-dimensional data set is split into 9 clusters, where each cluster centroid is placed in the center of gravity of the cluster. The dotted lines mark the borders of the clusters, which are equally distant to all neighboring clusters. The basic clustering task can be formulated as an optimisation problem:

Partitional cluster analysis: For a fixed number of groups k find that partition P(k) of a data set X out of the set of all possible partitions Φ(X, k) for which a chosen objective function f : Φ(X, k) → R+ is optimized. For all possible partitions with k clusters compute the value of the objective function f. The partition with the best value is the set of clusters sought.

This brute-force method is computationally infeasible as the cardinality of the set of all possible partitions is huge even for small k and N. The cardinality of Φ(X, k) can be computed by the Stirling numbers of the second kind [14]:

  |Φ(X, k)| = S_N^k = (1/k!) Σ_{i=0}^{k} (-1)^{k-i} (k choose i) i^N    (2)

Existing algorithms provide different heuristics for this search problem. K-means is probably one of the most popular of these partitional cluster algorithms [21]. The following listing shows the pseudocode for the k-means algorithm:

  Function KMeans
  Input:  X = {x1, ..., xn}  (Data to be clustered)
          k                  (Number of clusters)
  Output: C = {c1, ..., ck}  (Cluster centroids)
          m: X -> C          (Cluster assignments)

  Initialize C (e.g. random selection from X)
  While C has changed
    For each xi in X
      m(xi) = argmin_j distance(xi, cj)
    End
    For each cj in C
      cj = centroid({xi | m(xi) = j})
    End
  End

Given a number k, the k-means algorithm splits a data set X = {x1, ..., xn} into k disjoint clusters. Hereby, the cluster centroids µ1, ..., µk are placed in the center of gravity of the
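The combinatorial explosion expressed by Equation (2) is easy to verify numerically. The following sketch is ours, not from the paper; it uses the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1) instead of the closed form:

```clojure
;; Stirling numbers of the second kind via the usual recurrence,
;; memoized so repeated subproblems are computed only once.
(def stirling2
  (memoize
   (fn [n k]
     (cond (and (zero? n) (zero? k)) 1
           (or (zero? n) (zero? k))  0
           :else (+' (*' k (stirling2 (dec n) k))
                     (stirling2 (dec n) (dec k)))))))

;; Even N = 25 points and k = 3 clusters admit ~1.4 * 10^11 partitions:
(println (stirling2 25 3))   ; 141197991025
```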
This amounts to minimizing the sum of squared distances of data points to their respective cluster centroids. K-means is implemented by iterating between two major steps, which reassign data points to the nearest cluster centroids and update the centroids (often also called prototypes) for the newly assembled clusters. A centroid µj is updated by computing the mean of all points in cluster Cj:

  µj = (1/|Cj|) Σ_{xi ∈ Cj} xi    (4)

Parallel implementation of k-means in Clojure.
Using software transactional memory and Clojure's built-in thread management, the parallel execution of code is implicitly defined. In Clojure asynchronous state changes are modeled with Agents. Agents independently perform actions and change their state according to the result of an action. For the k-means algorithm the data set is split into smaller pieces that are handled by these Agents. To reduce the administrative overhead, several data points are handled by one Agent. Clusters are also implemented as Agents to allow asynchronous changes to the cluster centroids. Because every cluster has to know its members in order to update the centroid, a list of members is additionally included. This list is decoupled from the cluster Agents and accessed through the STM system (via the commute function) in order to allow simultaneous write access for the data subset Agents without need for synchronization. An overview of our parallel design is given in Figure 4.

The following code shows the assignment step of the parallel k-means algorithm in Clojure:

  (defn assignment []
    (map #(send % update-dataagent) DataAgents))

  (defn update-dataagent [datapoints]
    (map update-datapoint datapoints))

  (defn update-datapoint [datapoint]
    (let [newass (nearest-cluster datapoint)]
      (dosync (commute (nth MemberRefs newass)
                       conj (:data datapoint)))
      (assoc datapoint :assignment newass)))

During the update step (blue lines) each Cluster Agent calculates its new centroid from its list of members. The update function distributes the call for update to each Cluster Agent. The update-clusteragent function collects all cluster members from the n-th MemberRef and then calculates their new center. The n-th element of the MemberRefs is cleared for the next assignment step. Finally, the new center is assigned to the Cluster Agent. The loop of assignment updates and center updates is repeated until no changes to the centers occur.

Cluster validation.
Cluster validation provides methods to evaluate the results of cluster analysis in an objective way. The validation methods always measure a selected point of view, that is, one has to predefine an expected characteristic of the data. Several validation methods have been proposed. Following Jain et al. [14] they are grouped into three types of criteria:

Internal criteria  Measure the overlap between cluster structure and information inherent in the data, for example silhouette, inter-cluster similarity.

External criteria  Compare different partitions, for example Rand index, Jaccard index, Fowlkes and Mallows.

Relative criteria  Decide which of two structures is better in some sense, for example quantifying the difference between single-linkage and complete-linkage.
[Figure 4 diagram: Data Agents 0 to n write to Member Refs 0 to k, which are read by Cluster Agents 0 to k]

Figure 4: Basic design of the multi-core k-means algorithm in Clojure. Each Data Agent calculates its new cluster assignment and adds its value to the list of cluster members. The Cluster Agents calculate their new centers from the list of cluster members.
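The center calculation performed by the Cluster Agents in Figure 4 can be sketched in isolation. The toy data, the centroid helper, and the capitalized names below are ours and mirror the paper's design, not its actual code:

```clojure
;; One Cluster Agent plus its decoupled member list (a Ref), as in Figure 4.
(def MemberRefs    [(ref '([1.0 3.0] [3.0 5.0]))])
(def ClusterAgents [(agent {:number 0 :data [0.0 0.0]})])

;; Arithmetic mean of the member vectors.
(defn centroid [members]
  (let [n (double (count members))]
    (mapv #(/ % n) (apply map + members))))

(defn update-clusteragent [clus]
  (let [idx     (:number clus)
        members (deref (nth MemberRefs idx))]
    ;; clear the member list for the next assignment step
    (dosync (ref-set (nth MemberRefs idx) '()))
    ;; the freshly computed center becomes the Agent's new state
    (assoc clus :data (centroid members))))

(dorun (map #(send % update-clusteragent) ClusterAgents))
(apply await ClusterAgents)

(println (:data @(first ClusterAgents)))   ; [2.0 4.0]
(shutdown-agents)
```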
Figure 5: Benchmark results for the simulated data sets without clustering structure comparing the runtime of ParaKMeans, R, and Clojure (left panel: 10.000 cases, 100 dimensions; right panel: 1.000.000 cases, 200 dimensions; 20 clusters each). For the smaller data set (left) the computational overhead of the parallelization negatively affects the runtime. For the large data set (right) an improvement of the runtime by a factor of 10 can be observed.

Figure 6: Benchmark results for the simulated data sets with 20.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.

Figure 7: Benchmark results for the simulated data sets with 50.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.

Figure 8: Benchmark results for the simulated data sets with 100.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.

For the smaller data set the computational overhead of the parallelization negatively affects the runtime. For the extremely large data set having 1.000.000 data points an improvement of the runtime by a factor of 10 can be observed, showing that our implementation scales well with the number of computer cores.

Data sets with cluster structure.
We simulated clustered data sets using separate multivariate normal distributions as the basis for each cluster. We used a combination of different conditions for the number of data points (n = 20.000, 50.000, 100.000), the number of dimensions (d = 200, 500), and the number of clusters present in the data set (k = 10, 20) to generate a total of 12 data sets. We generated the artificial data sets to reflect a specific cluster structure. Figure 6, Figure 7, and Figure 8 show the benchmark results for these data sets.

Stability of large data sets.
To further illustrate the need for high computational speed of cluster algorithms we performed simulations to infer the number of clusters inherent in a data set. This is an often discussed issue in cluster analysis and additionally concerns the reliability of the computed clustering / partition. Evaluating the stability of cluster results has been widely discussed in the literature [10]. It has been shown that a repeated cluster analysis with different methods, parameters, feature sets or sample sizes can help to reveal the underlying data structure. For instance, the bootstrap technique is used for estimating the number of clusters [15]. If the fluctuations among the partitions are small, the clustering is called robust. Although there are few theoretical findings on the stability property of clusterings, this methodology has proven to work well in practice [5, 16, 19]. Recently it was shown that for large data sets determining the cluster number via robustness analysis may be problematic [2, 3, 27]. Specifically, it was shown that, considering theoretical properties of k-means-like clustering in large sample size settings, stability is not connected to the number of clusters but solely controlled by properties of the algorithm's underlying objective function. Independent of the parameter k, the stability measure always converges to 1 if there is a unique globally optimal minimum for the input data set [3]. We generated two large data sets to simulate this behavior. The stability is measured by comparing the agreement between the different results of running k-means on subsets of the data. The agreement is measured with the Rand index. A higher value indicates increased stability. Results from two examples are shown in Figure 9 and Figure 10. Here, we simulated clustered data sets using separate multivariate normal distributions as the basis for each cluster. We generated two data sets with 200 and 100.000 samples each
Figure 9: Stability resampling and clustering a small data set (n = 200, d = 100). The Rand index (0.0 to 1.0) is plotted against the number of clusters (2 to 7).

Figure 10: Stability resampling and clustering a large data set (n = 100000, d = 100). The Rand index (0.0 to 1.0) is plotted against the number of clusters (2 to 7).
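The Rand index used as the agreement measure can be computed directly from its pair-counting definition [28]. This sketch, with our own function name, takes two partitions given as label vectors:

```clojure
;; Fraction of point pairs on which two partitions agree, i.e. pairs
;; that are together in both partitions or separated in both.
(defn rand-index [a b]
  (let [n     (count a)
        pairs (for [i (range n), j (range (inc i) n)] [i j])
        agree (fn [[i j]]
                (= (= (nth a i) (nth a j))
                   (= (nth b i) (nth b j))))]
    (/ (count (filter agree pairs))
       (count pairs))))

(println (rand-index [0 0 1 1] [1 1 0 0]))   ; 1   (same clustering, labels permuted)
(println (rand-index [0 0 1 1] [0 1 0 1]))   ; 1/3
```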
containing 3 clusters. Each data set was resampled 10 times, leaving out √n data points. The effect of resampling on the stability of the clustering can be reproduced on both data sets. Both experiments correctly predict a most stable clustering into 3 clusters. Total running time was 0.56 sec for the small data set and 49.26 min for the data set with 100000 samples. In each simulation 70 separate clusterings were performed.

5. DISCUSSION AND CONCLUSIONS

Developing an algorithm usually includes continual rounds of implementation, tests, bug fixing, and code refactoring. The family of Lisp programming languages strongly supports this process by the read-eval-print loop. Immutable data structures reduce the number of bugs produced during the implementation phase. We presented a case study for multi-core parallelization of a data mining algorithm implemented in Clojure, a new Lisp dialect. In contrast to other programming languages targeting the JVM, such as Scala, Clojure integrates a software transactional memory system. The design principle of the STM reduces the additional overhead required by the design of parallel algorithms. In contrast to parallelization of software using MPI or OpenMP, our approach only requires minimal additional effort in designing the software.

The benchmark results show that for large data sets a significant improvement in the runtime can be achieved by a parallel multi-core implementation. This is especially important if simulations are performed repeatedly to infer the inherent cluster number and to quantify the stability and robustness of the predictions. To summarize, we ended up with about 100 lines of code running a multi-core parallel algorithm that is fast, easy to maintain, and yet ready to release as a platform-independent standalone application. As multi-core computers are the standard on the next generation of desktops, implementing parallel multi-core algorithms will be a highly efficient alternative to compute clusters.

6. ACKNOWLEDGEMENTS

This work is supported by the German Science Foundation (SFB 518, Project C5), the Stifterverband für die Deutsche Wissenschaft (HAK), and the Graduate School of Mathematical Analysis of Evolution, Information, and Complexity at the University of Ulm (JMK).

7. REFERENCES

[1] A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency. ACM Queue, 4(10):24–33, 2006.
[2] S. Ben-David, D. Pál, and H. U. Simon. Stability of k-means clustering. In N. H. Bshouty and C. Gentile, editors, Conference on Learning Theory, volume 4539 of Lecture Notes in Artificial Intelligence, pages 20–34, Berlin, 2007. Springer.
[3] S. Ben-David, U. von Luxburg, and D. Pál. A sober look at clustering stability. In J. G. Carbonell and J. Siekmann, editors, Conference on Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, pages 5–19, Berlin, 2006. Springer.
[4] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185–221, 1981.
[5] A. Bertoni and G. Valentini. Random projections for assessing gene expression cluster stability. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), volume 1, pages 149–154. IEEE Computer Society, 2005.
[6] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge, 2007.
[7] E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
[8] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, 1999.
[9] S. Halloway. Programming Clojure. Pragmatic Programmers, Raleigh, 2009.
[10] J. Handl, J. Knowles, and D. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212, 2005.
[11] J. Hill, M. Hambley, T. Forster, M. Mewissen, T. M. Sloan, F. Scharinger, A. Trew, and P. Ghazal. SPRINT: A new parallel framework for R. BMC Bioinformatics, 9(558), 2008.
[12] L. Hubert and P. Arabie. Comparing partitions. Journal of Mathematical Classification, 2:193–218, 1985.
[13] P. Jaccard. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des sciences naturelles, 44:223–270, 1908.
[14] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.
[15] A. K. Jain and J. V. Moreau. Bootstrap technique in cluster analysis. Pattern Recognition, 20(5):547–568, 1987.
[16] H. A. Kestler, A. Müller, F. Schwenker, T. Gress, T. Mattfeldt, and G. Palm. Cluster analysis of comparative genomic hybridization data. Lecture Notes NATO ASI: Artificial Intelligence and Heuristic Methods for Bioinformatics, pages S–40, 2001. Abstract.
[17] D. Koenig, A. Glover, P. King, G. Laforge, and J. Skeet. Groovy in Action. Manning Publications Co., Greenwich, 2007.
[18] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe. ParaKMeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use. BMC Bioinformatics, 9(200), 2008.
[19] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[20] D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison Wesley, Boston, 2nd edition, 2000.
[21] J. MacQueen. Some methods for classification and analysis of multivariate observations. In J. Neyman and L. L. Cam, editors, Proceedings of the 5th Berkeley Symposium on Math, Statistics and Probability, volume 1, pages 281–297, Berkeley, 1967. University of California Press.
[22] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, pages 35–46, Los Alamitos, 2008. IEEE Computer Society.
[23] M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima, Mountain View, 2008.
[24] S. Peyton-Jones. Beautiful concurrency. In A. Oram and G. Wilson, editors, Beautiful Code, chapter 24. O'Reilly, Sebastopol, 2007.
[25] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2009. ISBN 3-900051-07-0.
[26] R. Rajwar and J. Goodman. Transactional execution: Toward reliable, high-performance multithreading. IEEE Micro, 23(6):117–125, 2003.
[27] A. Rakhlin and A. Caponnetto. Stability of k-means clustering. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1121–1128. MIT Press, Cambridge, 2007.
[28] W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
[29] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 204–213, New York, 1995. ACM Press.
[30] S. Thompson. Haskell: The Craft of Functional Programming. Addison Wesley, Boston, 2nd edition, 1999.

APPENDIX

A. SOURCE CODE

Multi-core implementation of k-means in Clojure.

  ; set global constants
  (def dataset (load-tab-file "clusterexample.tab"))
  (def k 2)
  (def maxiter 1000)
  (def numagents 8)
  (def stop (ref '()))

  (defn check-stop [curiter]
    (or (= curiter maxiter)
        (every? #(= true %) (deref stop))))

  (defstruct cluster :number :data)
  (defn init-cluster-agent [number data]
    (agent (struct cluster number data)))

  (defstruct datapoint :number :data :assignment)
  (defn init-data-agent [substart subend]
    (agent (doall (map #(struct datapoint %1 %2)
                       (iterate inc substart)
                       (drop substart
                             (take subend dataset))))))

  ; functions needed for the init step
  (defn init-member-ref []
    (ref '()))

  (defn init-centers [data k]
    (map #(nth data %) (sample (count data) k)))

  ; initialization step of k-means
  (defn init []
    (dosync (ref-set stop '()))
    (def data-agents
      (map #(init-data-agent %1 %2)
           (take numagents
                 (iterate #(+ (int (. Math (ceil (/ (count dataset) numagents)))) %) 0))
           (take numagents
                 (iterate #(+ (int (. Math (ceil (/ (count dataset) numagents)))) %)
                          (int (. Math (ceil (/ (count dataset) numagents))))))))
    (def cluster-agents
      (map #(init-cluster-agent %1 %2)
           (iterate inc 0)
           (init-centers dataset k)))
    (def member-refs
      (map (fn [x] (init-member-ref)) (range k)))
    true)

  ; functions used for the assignment step
  (defn whichmin [#^doubles xs]
    (areduce xs i ret 0
             (if (< (aget xs i) (aget xs ret))
               i
               ret)))

  (defn distance [#^doubles as #^doubles bs]
    (areduce as i ret (double 0)
             (+ ret (* (- (aget as i) (aget bs i))
                       (- (aget as i) (aget bs i))))))

  (defn search-best-cluster [point]
    (let [dist (double-array
                 (map #(distance (:data point)
                                 (:data (deref %)))
                      cluster-agents))]
      (whichmin dist)))

  (defn update-datapoint [point]
    (let [newass (search-best-cluster point)]
      (dosync (commute stop conj
                       (= (:assignment point) newass)))
      (dosync (commute (nth member-refs newass)
                       conj (:data point)))
      (assoc point :assignment newass)))

  (defn update-data-agent [datalist]
    (doall (pmap update-datapoint datalist)))

  ; assignment step of k-means
  (defn assignment []
    (dosync (ref-set stop '()))
    (dorun (map #(send % update-data-agent) data-agents))
    (apply await data-agents))

  ; functions used in the update step
  (defn da+ [#^doubles as #^doubles bs]
    (amap as i ret
          (+ (aget as i) (aget bs i))))

  (defn update-centerpoint [members]
    (let [scale (double (count members))
          #^doubles newcen (reduce da+ members)]
      (amap newcen i ret
            (/ (aget newcen i) scale))))

  (defn update-cluster-agent [clus]
    (let [mem (deref (nth member-refs (:number clus)))
          newcen (update-centerpoint mem)]
      (dosync (ref-set (nth member-refs (:number clus)) '()))
      (assoc clus :data newcen)))

  ; update step of k-means
  (defn update []
    (dorun (map #(send % update-cluster-agent) cluster-agents))
    (apply await cluster-agents))

  ; the k-means function itself
  (defn kmeans []
    (init)
    (loop [curiter 0]
      (assignment)
      (update)
      (if (check-stop curiter)
        curiter
        (recur (inc curiter)))))
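The listing uses two helpers that are not shown, load-tab-file and sample. A compatible sample, drawing k distinct random indices out of n as required by the (sample (count data) k) call in init-centers, might look as follows (our sketch, not the authors' code):

```clojure
;; k distinct indices chosen uniformly at random from 0 .. n-1,
;; e.g. for picking the initial cluster centers from the data set.
(defn sample [n k]
  (take k (shuffle (range n))))

(println (sort (sample 10 3)))   ; three distinct indices below 10
```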