Multi-core Parallelization in Clojure – a Case Study

Johann M. Kraus
Institute of Neural Information Processing
University of Ulm
89069 Ulm
johann.kraus@uni-ulm.de

Hans A. Kestler
Institute of Neural Information Processing
University of Ulm
89069 Ulm
and
Department of Internal Medicine I
University Hospital Ulm
89081 Ulm
hans.kestler@uni-ulm.de

ABSTRACT
In recent years, the demand for computational power in data mining applications has increased due to rapidly growing data sets. As a consequence, standard algorithms need to be parallelized for fast processing of the generated data sets. Unfortunately, most approaches for parallelizing algorithms require a careful software design and a deep knowledge about thread-safe programming. As a consequence they are hardly applicable for rapid prototyping of new algorithms. We outline the process of multi-core parallelization using Clojure, a new functional programming language utilizing the Java Virtual Machine (JVM) that does not require knowledge of thread-safe programming. We provide some benchmark results for our multi-core algorithm to demonstrate its computational power. The rationale behind Clojure is combining the industry-standard JVM with functional programming, immutable data structures, and built-in concurrency support via software transactional memory. This makes it a suitable tool for parallelization and rapid prototyping in many areas. In this case study we present a multi-core parallel implementation of the k-means cluster algorithm. The multi-core algorithm shows an increase in computation speed up to a factor of 10 compared to R or network-based parallelization.
Categories and Subject Descriptors
D.3.2 [Programming Languages]: Concurrent, distributed, and parallel languages; I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms, Experimentation, Performance

Keywords
ACM proceedings, Lisp, Parallel programming, Clustering
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ELW'09, July 6, 2009, Genova, Italy.
Copyright 2009 ACM 978-1-60558-539-0 ...$5.00.

1. CONCEPTS OF PARALLEL PROGRAMMING
Currently, large collections of data are accumulated in many areas, such as customer reports or high-throughput data in the life-sciences. This has increased the need for computer-intensive applications to handle these data sets. Most software is implemented as a series of tasks starting with a main() function and sequentially running through a set of instructions to compute a result. A task is defined here as a small set of instructions. Parallel programs process several tasks simultaneously. Developers have to identify those tasks of a program that can be processed in parallel. These tasks are mapped to processes (threads), which are handled by the different hardware resources available.
There are different challenges when decomposing a linear program into simultaneously working threads. For instance, repeatedly sending large data sets over a network of computers reduces the profit gained from parallelization. Also, load balancing can be required to use all hardware resources efficiently. Furthermore, synchronization problems arise when two simultaneously running threads have to access the same data. In the following we describe concepts to handle the challenges of parallel programming.

Granularity of task decomposition.
Different levels of granularity are used to decompose a program into smaller tasks. A program can be broken down to the instruction level, where a compiler may identify single instructions without data dependencies that can be parallelized. Another level is given by data parallelism. The data is split into independent sets and the same instructions are run on all subsets in parallel. A similar solution to task decomposition is the parallelization of loops. In some languages there are special statements available to mark those loops which can be run simultaneously due to data independence. In this case the compiler is able to parallelize the code. The most important level of parallelism for modern computing languages is a decomposition into functional blocks. This approach requires synchronization techniques, as data or control structures are usually shared between different tasks.

Local vs. distributed hardware.
To overcome the limitations of currently available computer hardware, different software systems for parallelizing software for distributed and local multi-processor hardware
have been proposed, such as the Message Passing Interface (MPI) [8] and Open Multi-Processing (OpenMP) [6]. The idea of message passing is to parallelize independent tasks using a network of one master and several slave computers. While there is no possibility for communication between the slaves, this approach best fits scenarios where the same algorithm is started several times on different data sets or different analyses are calculated in parallel on the same data set [11]. In contrast to the network-based architectures, there are other approaches to parallelize software that are designed to run on a single multi-core computer. In a multi-core parallelization setting like in OpenMP or Clojure, there is no need for network communication, as all threads run on the same computer. Furthermore, the available memory can be shared between different threads. Therefore, developing multi-threaded programs reduces the overhead of communicating through a network. In contrast, as every thread has access to all objects on the heap, there is a need for concurrency control [20]. If two threads simultaneously try to change the state of data, they potentially compromise the data integrity. Concurrency control ensures that correct results are produced by concurrent operations.

Explicit vs. implicit parallelization.
Programming languages can support explicit or implicit parallelization. In a language where only explicit parallelization is available, a developer has to explicitly define communication and synchronization details for each task. This is the standard programming scheme for MPI or thread-based programming languages like Java. With implicit definitions there is no need for manually administrating threads. Functional programming languages can support this type of parallelization. In these languages parallelization is implicitly defined by the parallel processing of a set of functions without side-effects over immutable data.
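As a minimal illustration of this programming style (this snippet is not part of the k-means implementation discussed later), Clojure's pmap applies a pure function to the elements of a collection in parallel without any user-visible thread handling:

; Sketch: the same pure function applied sequentially and in parallel.
; Replacing map with pmap is the only change; no threads or locks
; appear in user code, and the results are identical.
(defn squared-norm [v]
  (reduce + (map #(* % %) v)))

(def vectors (repeatedly 10000 #(vec (repeatedly 100 rand))))

(time (doall (map squared-norm vectors)))    ; sequential
(time (doall (pmap squared-norm vectors)))   ; parallel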
Thread programming.
Multi-core programming is closely linked to thread programming. A sequential program is decomposed into several tasks, which are then processed as threads. The concept of thread programming is available in many programming languages like C (PThreads or OpenMP threads), Java (JThreads), or Fortran (OpenMP threads). Threads are refinements of a process that share the same memory and can be separately and simultaneously processed. Due to the shared memory concept, communication between threads is much faster than the communication of processes through sockets. Invoking threads is also faster than generating processes. Therefore, thread programming offers a flexible alternative to programming with cooperating processes. The execution of threads is handled by a scheduler that manages the available processing time. During its lifetime every thread undergoes different states. Figure 1 shows a typical state diagram for threads. A thread is started in the state new. Then it alternates between the states runnable, running, and waiting until it ends in the state terminated. A scheduler controls changes between runnable and running. Blocking due to I/O operations or concurrency leads to waiting. Different concurrency control mechanisms can be used to administrate blocking of threads. For instance, Java provides mutex (mutual exclusion) to avoid simultaneous access of threads to a common resource. In Java, synchronized access to data has to be encapsulated in a synchronized block.

Figure 1: During its lifetime a thread undergoes different states from creation to termination.

The following code shows an example for a parallelized counter implemented in Java:

public class Counter {
    private int value = 0;
    public synchronized void incr() {
        value = value + 1;
    }
}

Counter counter = new Counter();
counter.incr();

Every object in Java has an implicit mutex variable that is implicitly set when a thread enters a synchronized block. If a mutex variable is already set, then this thread is blocked until the concurrent thread exits the synchronized block. The use of implicit mutex variables in Java handles synchronized access to shared data. But the developer still has to declare the synchronized blocks. Therefore, problems might occur when using too few locks, too many locks, wrong locks, or locks in the wrong order [1]. A typical example is a deadlock, where two threads are each waiting for the other to first release a resource (unlock).

Concurrency control.
Concurrency control ensures that software can be parallelized without violating data integrity. Currently, the most popular approach for managing concurrent programs is the use of locks [6]. Locking and synchronizing ensures that changes to the states of the data are coordinated. But implementing thread-safe programs using locks can be fatally error-prone [24].
Software transactional memory (STM) is a modern alternative to the lock-based concurrency control mechanism [26, 29]. The basic functionality of software transactional memory is analogous to controlling simultaneous access via transactions in database management systems. The execution of a task is encapsulated in a transaction, which ensures that all actions on the data are atomic, consistent, and isolated. Here, the term atomic means that either all changes of a transaction to the data occur or none do. Consistent means that the new data from the transaction is checked for consistency before it is committed. Isolated means that every transaction is encapsulated and cannot see effects of any other transaction while it is running.
Therefore, STM allows for sharing changing states between threads in a synchronous and coordinated manner. In contrast to lock-based strategies, parallelization is possible without explicit thread management and without explicitly locking resources.

2. CLOJURE
Clojure as a new Lisp dialect uses a software transactional memory system as a means of concurrency control. In the following we highlight some features of the programming language Clojure and its access to STM. For an introduction to Clojure refer to Rich Hickey's website (http://clojure.org) and to the book by Halloway [9].

Functional programming.
Clojure is a new dialect of Lisp. Modern programming languages have already adopted many features of Lisp, but the interchangeability of code and data and the macro system are still unique to Lisp. Clojure is not designed to provide backward compatibility to ANSI Common Lisp. For instance, Clojure extends the s-expressions from lists to vectors and maps. All of its basic data structures are immutable by design. This allows for very efficient copying of data and is an important requirement for Clojure's software transactional memory system with multiversion concurrency control [4].

Java Virtual Machine.
Clojure is hosted on the Java Virtual Machine (JVM). The JVM is an industry-standard platform used by many developers and provides access to numerous Java libraries. Its use is not restricted to a specific operating system. Therefore it is already used as an operational basis by several programming languages like Groovy or Scala [17, 23]. Compared to Haskell and its STM system, Clojure can additionally profit from this Java basis [24]. This assures that the final parallel program can be easily complemented by a Graphical User Interface (GUI) and distributed as a platform-independent Java application.
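As a small sketch of this interoperability (not part of the clustering code), the standard Swing classes can be used directly from Clojure to open a result window:

; Minimal Java interop example: a Swing window built from Clojure.
(import '(javax.swing JFrame JLabel))

(doto (JFrame. "k-means")
  (.add (JLabel. "Clustering finished."))
  (.setSize 300 100)
  (.setVisible true))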
Realizing software transactional memory in Clojure.
On the one hand, functional programming languages can help to eliminate concurrency issues while retaining thread-safe code, as their built-in immutable data structures will never change their state [30]. On the other hand, it is often necessary to provide mutable state. Concurrency control via software transactional memory enables an efficient management of mutable shared memory. This enables a fast and parallel design of algorithms without the need for manually administrating threads and concurrency issues. Different variants of realizing software transactional memory systems exist, such as locking / pessimistic, lock-free / optimistic, and hybrids. Clojure uses multiversion concurrency control (MVCC, Figure 2) [4]. Every thread has its own copy of the data and will process all of its modifications to the shared data without blocking other threads. Before committing the changes, it is checked whether other threads have altered the data in use. If so, the whole transaction has to be retried until a consistent commit can be done. As long as the number of failed transactions is small, this optimistic concurrency control technique scales well with the number of threads. Figure 2 shows the principles of Clojure's software transactional model. All changes to the state of data are encapsulated in transactions, i.e. every thread has a copy of its working data and can change its value. During submission of the changes to the shared memory the consistency of the internal state is checked. If no interim changes occurred, the submission is performed, see Figure 2, Thread 0 and Thread n. If another thread working on another copy of the same data has meanwhile submitted its changes, the transaction is rejected and restarted with a new copy of the data, see Figure 2, Thread 1 and Thread 2.

MVCC supplies every actor with its own snapshot of the data. In Clojure, Refs and Agents handle changing mutable states to new values automatically. Here, all interactions with Refs must be encapsulated within a transaction via the dosync macro. Transactions support synchronous changes to multiple Refs. Every actor will see a consistent snapshot of the world at a certain point in time. Any changes to the data made by an actor will not be seen by other actors until the transaction has been committed. A transaction is automatically restarted if the states to be changed have been changed by another actor in the meantime. There are three methods to change the value of a Ref. Using ref-set the value is simply set to a new value; with alter the current in-transaction value of the Ref is modified. The commute operation enables simultaneous write access without checking for intermediate state changes, e.g. for commutative changes to a list. For instance, a parallelized counter is realized with Refs as follows:

(def counter (ref 0))
(dosync (alter counter inc))
agement of mutable shared memory. This enables a fast and 3. A CLUSTERING CASE STUDY
parallel design of algorithms without the need for manually With our case study we wanted to inspect the capability
administrating threads and concurrency issues. Different of Clojure being used to parallelize existing algorithms on a
variants of realizing software transactional memory systems multi-core desktop computer. We decided to implement an
exist, such as locking / pessimistic, lock-free / optimistic, algorithm that does only require some basic transactional
and hybrids. Clojure uses a multiversion concurrency con- properties of Clojure to benchmark the potential benefit of
trol (MVCC, Figure 2) [4]. Every thread has its own copy of the parallelization. Also the final program should give ex-
the data and will process all of its modifications to the shared actly the same results as the single core implementation.
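Since send returns immediately, reading the updated value requires dereferencing the agent, optionally after waiting for all previously sent actions to complete, for example:

(send counter inc)
(await counter)   ; block until the actions sent so far have been applied
@counter          ; dereference the agent to read its current value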
3. A CLUSTERING CASE STUDY
With our case study we wanted to inspect the capability of Clojure to parallelize existing algorithms on a multi-core desktop computer. We decided to implement an algorithm that only requires some basic transactional properties of Clojure in order to benchmark the potential benefit of the parallelization. Also, the final program should give exactly the same results as the single-core implementation. The k-means cluster algorithm fits our requirements and is also included in several parallel benchmark suites [22]. Most of the runtime for large data sets is spent on vector calculations. Therefore, the parallel implementation was expected to scale well with the number of available processors.
Figure 2: The design principle of software transactional memory as implemented in Clojure. Each thread works on its own copy of the shared data; if the data has meanwhile been changed by another thread, the submission fails and the transaction reloads the data and is retried.
The k-means clustering algorithm.
Clustering is a classical example of unsupervised learning, i.e. learning without a teacher. The term cluster analysis summarizes a collection of methods for generating hypotheses about the structure of the data by solely exploring pairwise distances or similarities in the data space. Clustering is often applied as a first step in data analysis for the creation of initial hypotheses. Let X = {x_1, ..., x_N} be a set of data points with the feature vectors x_i ∈ R^d. Cluster analysis is used to build a partition of a data set containing k clusters such that data points within a cluster are more similar to each other than points from different clusters. A partition P(k) is a set of clusters {C_1, C_2, ..., C_k} with 0 < k < N and meets the following conditions:

\bigcup_{i=1}^{k} C_i = X, \quad C_i \neq \emptyset    (1a)

C_i \cap C_j = \emptyset, \quad i \neq j    (1b)

Figure 3 illustrates the idea of clustering data. The artificial two-dimensional data set is split into 9 clusters, where each cluster centroid is placed in the center of gravity of the cluster. The dotted lines mark the borders of the clusters, which are equally distant to all neighboring clusters. The basic clustering task can be formulated as an optimisation problem:

Partitional cluster analysis: For a fixed number of groups k, find that partition P(k) of a data set X out of the set of all possible partitions Φ(X, k) for which a chosen objective function f : Φ(X, k) → R^+ is optimized. For all possible partitions with k clusters compute the value of the objective function f. The partition with the best value is the set of clusters sought.
This brute-force method is computationally infeasible as the cardinality of the set of all possible partitions is huge even for small k and N. The cardinality of Φ(X, k) can be computed by the Stirling numbers of the second kind [14]:

|\Phi(X, k)| = S_N^k = \frac{1}{k!} \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} i^N    (2)
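To illustrate how quickly this number grows, the following small sketch (not part of the clustering code) evaluates Equation (2) with exact integer arithmetic:

; Stirling numbers of the second kind, S(N, k), following Equation (2).
(defn binomial [n r]
  (reduce (fn [acc i] (/ (* acc (- n (dec i))) i))
          1 (range 1 (inc r))))

(defn stirling2 [N k]
  (/ (reduce + (for [i (range 0 (inc k))]
                 (* (if (even? (- k i)) 1 -1)
                    (binomial k i)
                    (.pow (java.math.BigInteger/valueOf i) (int N)))))
     (reduce * (range 1 (inc k)))))   ; division by k!

(stirling2 100 5)   ; about 6.6 * 10^67 possible partitions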
Existing algorithms provide different heuristics for this search problem. K-means is probably one of the most popular of these partitional cluster algorithms [21]. The following listing shows the pseudocode for the k-means algorithm:

Function KMeans
Input:  X = {x_1, ..., x_n}  (Data to be clustered)
        k                    (Number of clusters)
Output: C = {c_1, ..., c_k}  (Cluster centroids)
        m: X -> C            (Cluster assignments)

  Initialize C (e.g. random selection from X)
  While C has changed
    For each x_i in X
      m(x_i) = argmin_j distance(x_i, c_j)
    End
    For each c_j in C
      c_j = centroid({x_i | m(x_i) = j})
    End
  End

Given a number k, the k-means algorithm splits a data set X = {x_1, ..., x_n} into k disjoint clusters. Hereby, the cluster centroids µ_1, ..., µ_k are placed in the centers of gravity of the clusters C_1, ..., C_k.
The algorithm minimizes the objective function:

F(\mu_j, C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2    (3)

This amounts to minimizing the sum of squared distances of data points to their respective cluster centroids. K-means is implemented by iterating between two major steps, which reassign data points to the nearest cluster centroids and update the centroids (often also called prototypes) of the newly assembled clusters. A centroid µ_j is updated by computing the mean of all points in cluster C_j:

\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i    (4)

Figure 3: An artificial two-dimensional dataset clustered into 9 clusters. The centroids of the clusters are marked by crosses. The dotted lines show the border of the clusters where each neighboring centroid is equally distant.

Parallel implementation of k-means in Clojure.
Using software transactional memory and Clojure's built-in thread management, the parallel execution of code is implicitly defined. In Clojure asynchronous state changes are modeled with Agents. Agents independently perform actions and change their state according to the result of an action. For the k-means algorithm the data set is split into smaller pieces that are handled by these Agents. To reduce the administrative overhead, several data points are handled by one agent. Clusters are also implemented as agents to allow asynchronous changes to the cluster centroids. Because every cluster has to know its members in order to update its centroid, a list of members is additionally included. This list is decoupled from the cluster agents and accessed through the STM system (via the commute function) in order to allow simultaneous write access for the data subset agents without the need for synchronization. An overview of our parallel design is given in Figure 4.
The following code shows the assignment step of the parallel k-means algorithm in Clojure:

(defn assignment []
  (map #(send % update-dataagent) DataAgents))

(defn update-dataagent [datapoints]
  (map update-datapoint datapoints))

(defn update-datapoint [datapoint]
  (let [newass (nearest-cluster datapoint)]
    (dosync (commute (nth MemberRefs newass)
                     conj (:data datapoint)))
    (assoc datapoint :assignment newass)))

During the assignment step (red lines in Figure 4) the DataAgents calculate the nearest cluster centroid for each data point and update their assignments. Additionally, each data point is written to the list of members (MemberRef) of its nearest centroid. Simultaneous write access to these lists is possible through Clojure's STM system. Each DataAgent receives a request to update the cluster assignments of its data points via the assignment function. Each agent processes the update-datapoint function over its list of data points. This function searches for the nearest cluster, adds the data point to the respective MemberRef, and updates the assignment of the data point. The following code shows the update step of the parallel k-means algorithm in Clojure:

(defn update []
  (map #(send % update-clusteragent) ClusterAgents))

(defn update-clusteragent [agt]
  (let [members (deref (nth MemberRefs (:number agt)))
        newcen  (reduce calculate-center members)]
    (dosync (ref-set (nth MemberRefs (:number agt)) nil))
    (assoc agt :data newcen)))
During the update step (blue lines in Figure 4) each Cluster Agent calculates its new centroid from its list of members. The update function distributes the call for an update to each cluster agent. The update-clusteragent function collects all cluster members from the n-th MemberRef and then calculates their new center. The n-th element of the MemberRefs is cleared for the next assignment step. Finally, the new center is assigned to the cluster agent. The loop of assignment updates and center updates is repeated until no changes to the centers occur.

Cluster validation.
Cluster validation provides methods to evaluate the results of cluster analysis in an objective way. The validation methods always measure a selected point of view, that is, one has to predefine an expected characteristic of the data. Several validation methods have been proposed. Following Jain and Dubes [14] they are grouped into three types of criteria:

Internal criteria: Measure the overlap between cluster structure and information inherent in the data, for example silhouette or inter-cluster similarity.

External criteria: Compare different partitions, for example Rand index, Jaccard index, Fowlkes and Mallows.

Relative criteria: Decide which of two structures is better in some sense, for example by quantifying the difference between single-linkage and complete-linkage.
Figure 4: Basic design of the multi-core k-means algorithm in Clojure. Each Data Agent calculates its new cluster assignment and adds its value to the list of cluster members. The Cluster Agents calculate their new centers from the list of cluster members.
To analyze cluster algorithms they are often applied to a-priori labeled data sets and compared to the labels with the use of an external criterion. In the following we shortly introduce the basic concept of evaluating cluster results using external cluster indices. An external index describes to which degree two partitions of N objects agree. Given a set of N objects X = {x_1, x_2, ..., x_N} and two different partitions P = {C_1, C_2, ..., C_r} and Q = {D_1, D_2, ..., D_s} into r and s clusters respectively, a contingency table is defined to compare these two partitions. Here n_{ij} denotes the number of objects that are both in clusters C_i and D_j; n_{i.} and n_{.j} denote the total number of objects in cluster C_i and D_j respectively. Hubert and Arabie [12] give some indices based on the contingency table of two partitions, see Table 1. As our artificial data sets have been designed to reflect a specific clustering, we also decided to use the Rand index [28] as an external cluster validity index.

Table 1: External cluster indices describe the agreement of two cluster results. All of the following indices can be derived from the contingency table of two partitions.

Name                      Formula
Rand [28]                 1 + \left[ \sum_{i=1}^{r} \sum_{j=1}^{s} n_{ij}^2 - \tfrac{1}{2} \left( \sum_{i=1}^{r} n_{i.}^2 + \sum_{j=1}^{s} n_{.j}^2 \right) \right] / \binom{n}{2}
Jaccard [13]              \left( \sum_{i=1}^{r} \sum_{j=1}^{s} n_{ij}^2 - n \right) / \left( \sum_{i=1}^{r} n_{i.}^2 + \sum_{j=1}^{s} n_{.j}^2 - \sum_{i=1}^{r} \sum_{j=1}^{s} n_{ij}^2 - n \right)
Fowlkes and Mallows [7]   \left( \sum_{i=1}^{r} \sum_{j=1}^{s} n_{ij}^2 - n \right) / \left( 2 \left[ \sum_{i=1}^{r} \binom{n_{i.}}{2} \sum_{j=1}^{s} \binom{n_{.j}}{2} \right]^{1/2} \right)
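The Rand index, for instance, can be computed directly from two label vectors via their contingency table; a small sketch (not taken from our implementation):

; Rand index of two partitions, each given as a vector of cluster labels.
(defn rand-index [p q]
  (let [n    (count p)
        cell (frequencies (map vector p q))   ; contingency table entries n_ij
        row  (frequencies p)                  ; row sums n_i.
        col  (frequencies q)                  ; column sums n_.j
        sq   (fn [m] (reduce + (map #(* % %) (vals m))))
        nc2  (/ (* n (dec n)) 2)]             ; binomial(n, 2)
    (+ 1 (/ (- (sq cell) (/ (+ (sq row) (sq col)) 2))
            nc2))))

(rand-index [:a :a :b :b] [:x :x :y :y])   ; identical partitions give 1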
4. EXPERIMENTS AND RESULTS
We compared the performance of our multi-core algorithm to ParaKMeans [18] and the k-means function implemented in R [25]. Our test hardware was a Dell Precision T7400 with dual quad-core Intel Xeon 3.2 GHz and 32 GB RAM. ParaKMeans was tested on the web-interface at http://bioanalysis.genomics.mcg.edu/parakmeans, but could not process our larger test data set because the master computer did not provide enough memory. We generated several artificial data sets having different numbers of data points, dimensions, and clusters to simulate different aspects of the k-means cluster analysis. The results are summarized in the following paragraphs.

Large data sets without cluster structure.
These data sets are included to benchmark the runtime of the algorithms neglecting the effect of the random initialization of the centroids. Each feature is uniformly distributed over the interval [0, 1], i.e. there is apparently no cluster structure hidden in the data set. The performance of clustering these simulated data sets into 5 and 20 clusters is summarized in Figure 5. In case of the small data set with 10.000 data points the computational overhead of the parallelization negatively affects the runtime.
For the extremely large data set having 1.000.000 data points an improvement of the runtime by a factor of 10 can be observed, showing that our implementation scales well with the number of computer cores.

Figure 5: Benchmark results for the simulated data sets without clustering structure comparing the runtime of ParaKMeans, R, and Clojure. For the smaller data set (left) the computational overhead of the parallelization negatively affects the runtime. For the large data set (right) an improvement of the runtime by a factor of 10 can be observed.

Figure 6: Benchmark results for the simulated data sets with 20.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.
Figure 7: Benchmark results for the simulated data sets with 50.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.

Figure 8: Benchmark results for the simulated data sets with 100.000 samples, 200 / 500 dimensions, and 10 / 20 clusters. Runtimes for R are given on the left, runtimes for Clojure on the right.

Data sets with cluster structure.
We simulated clustered data sets using separate multivariate normal distributions as the basis for each cluster. We used a combination of different conditions for the number of data points (n = 20.000, 50.000, 100.000), number of dimensions (d = 200, 500), and number of clusters present in the data set (k = 10, 20) to generate a total of 12 data sets. We generated the artificial data sets to reflect a specific cluster structure. Figure 6, Figure 7, and Figure 8 show the benchmark results for these data sets.
Stability of large data sets.
To further illustrate the need for a high computational speed of cluster algorithms, we performed simulations to infer the number of clusters inherent in a data set. This is an often discussed issue in cluster analysis and additionally concerns the reliability of the computed clustering / partition. Evaluating the stability of cluster results has been widely discussed in the literature [10]. It has been shown that a repeated cluster analysis with different methods, parameters, feature sets or sample sizes can help to reveal the underlying data structure. For instance, the bootstrap technique is used for estimating the number of clusters [15]. If the fluctuations among the partitions are small, the clustering is called robust. Although there are few theoretical findings on the stability property of clusterings, this methodology has proven to work well in practice [5, 16, 19]. Recently it was shown that for large data sets determining the cluster number via robustness analysis may be problematic [2, 3, 27]. Specifically, considering theoretical properties of k-means-like clustering in large sample size settings, it was shown that stability is not connected to the number of clusters, but is solely controlled by properties of the algorithm's underlying objective function. Independent of the parameter k the stability measure always converges to 1 if there is a unique global optimal minimum for the input data set [3]. We generated two large data sets to simulate this behavior. The stability is measured by comparing the agreement between the different results of running k-means on subsets of the data. The agreement is measured with the Rand index. A higher value indicates increased stability. Results from two examples are shown in Figure 9 and Figure 10. Here, we simulated clustered data sets using separate multivariate normal distributions as the basis for each cluster. We generated two data sets with 200 and 100.000 samples, each containing 3 clusters.
Each data set was resampled 10 times, leaving out √n data points. The effect of resampling on the stability of the clustering can be reproduced on both data sets. Both experiments correctly predict a most stable clustering into 3 clusters. Total running time was 0.56 sec for the small data set and 49.26 min for the data set with 100000 samples. In each simulation 70 separate clusterings were performed.
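A simplified sketch of such a resampling experiment is given below; it assumes a helper cluster-assignments that runs k-means on a subsample and returns a map from data point to cluster label (the code in the appendix works on global state and would have to be wrapped accordingly), and it reuses the rand-index function sketched above:

; Hypothetical sketch of the stability experiment, not the code used
; to produce Figures 9 and 10.
(defn subsample [data]
  (drop (int (Math/sqrt (count data))) (shuffle data)))   ; leave out sqrt(n) points

(defn stability [data k runs cluster-assignments]
  (let [results (doall (repeatedly runs #(cluster-assignments (subsample data) k)))
        agree   (for [a results, b results :when (not (identical? a b))]
                  (let [common (filter (set (keys a)) (keys b))]
                    (rand-index (map a common) (map b common))))]
    (/ (reduce + agree) (count agree))))   ; mean pairwise Rand index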
Figure 9: Stability resampling and clustering a small data set (n = 200, d = 100).

Figure 10: Stability resampling and clustering a large data set (n = 100000, d = 100).
5. DISCUSSION AND CONCLUSIONS
Developing an algorithm usually includes continual rounds of implementation, tests, bug fixing, and code refactoring. The family of Lisp programming languages strongly supports this process by the read-eval-print loop. Immutable data structures reduce the amount of bugs produced during the implementation phase. We presented a case study for the multi-core parallelization of a data mining algorithm implemented in Clojure, a new Lisp dialect. In contrast to other programming languages targeting the JVM, such as Scala, Clojure integrates a software transactional memory system. The design principle of the STM reduces the additional overhead required by the design of parallel algorithms. In contrast to parallelization of software using MPI or OpenMP, our approach only requires minimal additional effort in designing the software.
The benchmark results show that for large data sets a significant improvement in the runtime can be achieved by a parallel multi-core implementation. This is especially important if simulations are performed repeatedly to infer the inherent cluster number and to quantify the stability and robustness of the predictions. To summarize, we ended up with about 100 lines of code running a multi-core parallel algorithm that is fast, easy to maintain, and yet ready to release as a platform-independent standalone application. As multi-core computers are the standard on next-generation desktops, implementing parallel multi-core algorithms will be a highly efficient alternative to compute clusters.

6. ACKNOWLEDGEMENTS
This work is supported by the German Science Foundation (SFB 518, Project C5), the Stifterverband für die Deutsche Wissenschaft (HAK), and the Graduate School of Mathematical Analysis of Evolution, Information, and Complexity at the University of Ulm (JMK).

7. REFERENCES
[1] A.-R. Adl-Tabatabai, C. Kozyrakis, and B. Saha. Unlocking concurrency. ACM Queue, 4(10):24–33, 2006.
[2] S. Ben-David, D. Pál, and H. U. Simon. Stability of k-means clustering. In N. H. Bshouty and C. Gentile, editors, Conference on Learning Theory, volume 4539 of Lecture Notes in Artificial Intelligence, pages 20–34, Berlin, 2007. Springer.
[3] S. Ben-David, U. von Luxburg, and D. Pál. A sober look at clustering stability. In J. G. Carbonell and J. Siekmann, editors, Conference on Learning Theory, volume 4005 of Lecture Notes in Artificial Intelligence, pages 5–19, Berlin, 2006. Springer.
[4] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185–221, 1981.
[5] A. Bertoni and G. Valentini. Random projections for assessing gene expression cluster stability. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), volume 1, pages 149–154. IEEE Computer Society, 2005.
[6] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge, 2007.
[7] E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
[8] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, Cambridge, 1999.
[9] S. Halloway. Programming Clojure. Pragmatic Programmers, Raleigh, 2009.
[10] J. Handl, J. Knowles, and D. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212, 2005.
[11] J. Hill, M. Hambley, T. Forster, M. Mewissen, T. M. Sloan, F. Scharinger, A. Trew, and P. Ghazal. SPRINT: A new parallel framework for R. BMC Bioinformatics, 9(558), 2008.
[12] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[13] P. Jaccard. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des sciences naturelles, 44:223–270, 1908.
[14] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.
[15] A. K. Jain and J. V. Moreau. Bootstrap technique in cluster analysis. Pattern Recognition, 20(5):547–568, 1987.
[16] H. A. Kestler, A. Müller, F. Schwenker, T. Gress, T. Mattfeldt, and G. Palm. Cluster analysis of comparative genomic hybridization data. Lecture Notes NATO ASI: Artificial Intelligence and Heuristic Methods for Bioinformatics, pages S–40, 2001. Abstract.
[17] D. Koenig, A. Glover, P. King, G. Laforge, and J. Skeet. Groovy in Action. Manning Publications Co., Greenwich, 2007.
[18] P. Kraj, A. Sharma, N. Garge, R. Podolsky, and R. A. McIndoe. ParaKMeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use. BMC Bioinformatics, 9(200), 2008.
[19] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[20] D. Lea. Concurrent Programming in Java: Design Principles and Patterns. Addison Wesley, Boston, 2nd edition, 2000.
[21] J. MacQueen. Some methods for classification and analysis of multivariate observations. In J. Neyman and L. L. Cam, editors, Proceedings of the 5th Berkeley Symposium on Math, Statistics and Probability, volume 1, pages 281–297, Berkeley, 1967. University of California Press.
[22] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, pages 35–46, Los Alamitos, 2008. IEEE Computer Society.
[23] M. Odersky, L. Spoon, and B. Venners. Programming in Scala. Artima, Mountain View, 2008.
[24] S. Peyton-Jones. Beautiful concurrency. In A. Oram and G. Wilson, editors, Beautiful Code, chapter 24. O'Reilly, Sebastopol, 2007.
[25] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2009. ISBN 3-900051-07-0.
[26] R. Rajwar and J. Goodman. Transactional execution: Toward reliable, high-performance multithreading. IEEE Micro, 23(6):117–125, 2003.
[27] A. Rakhlin and A. Caponnetto. Stability of k-means clustering. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1121–1128. MIT Press, Cambridge, 2007.
[28] W. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850, 1971.
[29] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 204–213, New York, 1995. ACM Press.
[30] S. Thompson. Haskell: The Craft of Functional Programming. Addison Wesley, Boston, 2nd edition, 1999.
APPENDIX
A. SOURCE CODE
Multi-core implementation of k-means in Clojure.

; set global constants
(def dataset (load-tab-file "clusterexample.tab"))
(def k 2)
(def maxiter 1000)
(def numagents 8)
(def stop (ref '()))

(defn check-stop [curiter]
  (or (= curiter maxiter)
      (every? #(= true %) (deref stop))))

(defstruct cluster :number :data)

(defn init-cluster-agent [number data]
  (agent (struct cluster number data)))

(defstruct datapoint :number :data :assignment)

(defn init-data-agent [substart subend]
  (agent (doall (map #(struct datapoint %1 %2)
                     (iterate inc substart)
                     (drop substart (take subend dataset))))))

; functions needed for the init step
(defn init-member-ref []
  (ref '()))

(defn init-centers [data k]
  (map #(nth data %) (sample (count data) k)))

; initialization step of k-means
(defn init []
  (dosync (ref-set stop '()))
  (def data-agents
       (map #(init-data-agent %1 %2)
            (take numagents (iterate #(+ (int (. Math (ceil (/ (count dataset) numagents)))) %) 0))
            (take numagents (iterate #(+ (int (. Math (ceil (/ (count dataset) numagents)))) %)
                                     (int (. Math (ceil (/ (count dataset) numagents))))))))
  (def cluster-agents
       (map #(init-cluster-agent %1 %2)
            (iterate inc 0)
            (init-centers dataset k)))
  (def member-refs
       (map (fn [x] (init-member-ref)) (range k)))
  true)

; functions used for the assignment step
(defn whichmin [#^doubles xs]
  (areduce xs i ret 0
           (if (< (aget xs i) (aget xs ret))
             i
             ret)))

(defn distance [#^doubles as #^doubles bs]
  (areduce as i ret (double 0)
           (+ ret (* (- (aget as i) (aget bs i))
                     (- (aget as i) (aget bs i))))))

(defn search-best-cluster [point]
  (let [dist (double-array
               (map #(distance (:data point) (:data (deref %)))
                    cluster-agents))]
    (whichmin dist)))

(defn update-datapoint [point]
  (let [newass (search-best-cluster point)]
    (dosync (commute stop conj
                     (= (:assignment point) newass)))
    (dosync (commute (nth member-refs newass)
                     conj (:data point)))
    (assoc point :assignment newass)))

(defn update-data-agent [datalist]
  (doall (pmap update-datapoint datalist)))

; assignment step of k-means
(defn assignment []
  (dosync (ref-set stop '()))
  (dorun (map #(send % update-data-agent) data-agents))
  (apply await data-agents))

; functions used in the update step
(defn da+ [#^doubles as #^doubles bs]
  (amap as i ret
        (+ (aget as i) (aget bs i))))

(defn update-centerpoint [members]
  (let [scale (double (count members))
        #^doubles newcen (reduce da+ members)]
    (amap newcen i ret
          (/ (aget newcen i) scale))))

(defn update-cluster-agent [clus]
  (let [mem (deref (nth member-refs (:number clus)))
        newcen (update-centerpoint mem)]
    (dosync (ref-set (nth member-refs (:number clus)) '()))
    (assoc clus :data newcen)))

; update step of k-means
(defn update []
  (dorun (map #(send % update-cluster-agent) cluster-agents))
  (apply await cluster-agents))

; the k-means function itself
(defn kmeans []
  (init)
  (loop [curiter 0]
    (assignment)
    (update)
    (if (check-stop curiter)
      curiter
      (recur (inc curiter)))))
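The listing relies on two helper functions that are not shown: load-tab-file, which is assumed to read the tab-separated input file into a sequence of double arrays (one per data point), and sample, which is assumed to draw k distinct indices. Under these assumptions, a complete run reduces to the following sketch:

; Assumed helpers (not shown above): load-tab-file and sample.
(kmeans)                                 ; returns the number of iterations
(map (comp :data deref) cluster-agents)  ; the final cluster centroids
(shutdown-agents)                        ; stop the agent thread pools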

