
ECSE 420 - Fast Cholesky: a serial, OpenMP and MPI comparative study
Renaud Jacques-Dagenais, Bénédicte Leonard-Cannon, Payom Meshgin, Samantha Rouphael
I. INTRODUCTION
The aim of this project is to analyze different parallel implementations of the Cholesky decomposition algorithm and determine which is the most efficient at executing the operation. In particular, we study a shared-memory implementation in OpenMP as well as a message-passing implementation using MPI, all programmed in the C programming language. These parallel versions are also compared against a plain, serial version of the algorithm. To test our implementations of Cholesky decomposition, we employ a series of tests that gauge the performance of the programs under varying matrix sizes and numbers of parallel threads.
II. MOTIVATION
Many mathematical computations involve solving large sets of linear equations. In general, the LU decomposition method applies to any square matrix. However, for certain common classes of matrices, other methods are more efficient. One such method commonly used to solve these problems is Cholesky decomposition, which is notably used in engineering applications such as circuit simulation and finite element analysis.

Unfortunately, Cholesky decomposition is relatively demanding in terms of computational operations. Indeed, the computation time of the algorithm scales quite poorly with matrix size, with a complexity of order O(n^3) [Gallivan]. Luckily, much of the work can be parallelized, allowing the same result to be obtained in less time.
III. BACKGROUND THEORY
The Cholesky decomposition algorithm is a special case of LU decomposition in which the matrix to be factorized is square, symmetric and positive definite. The key difference between LU and Cholesky decomposition is that the upper triangular factor produced by the Cholesky method is forced to be equal to the transpose of the lower triangular factor, rather than being an arbitrary upper triangular matrix. For LL^T-factorable matrices, i.e. matrices that can be represented as the product of a lower triangular matrix L and its transpose, the Cholesky algorithm is about twice as efficient as the LU decomposition method.

The algorithm is iterative and recursive: after each major step, the first column and row of the L factor are known and the algorithm is applied to the remainder of the matrix. Hence, the computational domain of the algorithm successively shrinks.

We begin with an n x n symmetric positive definite matrix A. These properties ensure that A is LL^T-factorable [Heath]. Given the linear system Ax = b, Cholesky factorization decomposes A into the form A = LL^T, such that L is a lower triangular matrix and L^T is its transpose. Once L is computed, the original system can be solved easily by first solving Ly = b (forward substitution), followed by solving L^T x = y (backward substitution). For the scope of this project, we focus strictly on parallelizing the factorization algorithm rather than the entire task of solving Ax = b.
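For reference, the element-wise formulas for the entries of L follow directly from A = LL^T (they are standard and not taken from a specific listing in this report); all implementations below compute exactly these quantities, differing only in the order of the updates and in how the work is distributed:

l_jj = sqrt( a_jj − Σ_{k=1}^{j−1} l_jk² )

l_ij = ( a_ij − Σ_{k=1}^{j−1} l_ik l_jk ) / l_jj,   for i > j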
IV. IMPLEMENTATION
A. Serial Cholesky Algorithm
There are three popular variants of the Cholesky decomposition algorithm, distinguished by which elements of the matrix are computed first within the innermost for loop [Gallivan]:
- row-wise Cholesky (row-Cholesky);
- column-wise Cholesky (column-Cholesky);
- block matrix Cholesky (block-Cholesky).
For the serial implementation of the algorithm, the column-wise version was used to limit the number of cache misses on the system. The column-wise algorithm is shown below:
for j = 1 to n do
    for k = 1 to j-1 do
        for i = j to n do
            a_ij = a_ij - a_ik * a_jk;
        end
    end
    a_jj = sqrt(a_jj);
    for i = j+1 to n do
        a_ij = a_ij / a_jj;
    end
end
Algorithm 1: Column-wise Cholesky algorithm
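A direct C translation of Algorithm 1 looks as follows. This is a minimal sketch with 0-based indexing, operating in place on a matrix whose lower triangle initially holds the corresponding entries of A; the full version used in this project is the cholSerial.c listing in Section XI.

#include <math.h>

/* Sketch of Algorithm 1: in-place column-wise Cholesky on the lower triangle of L. */
static void cholColumnwise(double **L, int n) {
    int i, j, k;
    for (j = 0; j < n; j++) {
        for (k = 0; k < j; k++)
            for (i = j; i < n; i++)
                L[i][j] -= L[i][k] * L[j][k]; /* subtract contributions of earlier columns */
        L[j][j] = sqrt(L[j][j]);              /* new diagonal entry of column j */
        for (i = j + 1; i < n; i++)
            L[i][j] /= L[j][j];               /* scale the remainder of column j */
    }
}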
B. MPI
In order to have an efficient MPI implementation of Cholesky decomposition, it is necessary to distribute computations in a way that minimizes communication between processes. Before we can figure out the best solution, we have to identify the dependencies. By analyzing the algorithm, we notice two things: previous columns need to be updated before any subsequent column, and, when updating entries of the same column, the new value of the diagonal entry has to be computed first.

This is easier to understand with an example. Suppose we have a 5x5 matrix and we wish to update the entries in column 3 (the entries in previous columns have already been updated). The dependencies for each entry in column 3 are then as shown in Figure 1.
Fig. 1. Dependency chart for the message passing implementation
Notice that the values needed to update an entry are located either on the same row as the entry or on the row that contains the diagonal entry. Hence, decomposing the domain into rows is probably the best way to minimize communication between processes.

Cyclic assignment of rows to processes can easily be achieved by taking the remainder of the division of the column number by the total number of processes (i.e. j % nprocesses). Note that we can take the column number instead of the row number because the matrices on which the algorithm is applied are always square. For instance, if we take the same 5x5 matrix as above and run the algorithm with 3 processes, rows would be assigned as illustrated in the sketch below.
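The following sketch is a concrete illustration of this cyclic mapping (the helper name ownerOfRow is ours, not part of the project code). For the 5x5 example with 3 processes it shows that rows 0 and 3 belong to process 0, rows 1 and 4 to process 1, and row 2 to process 2.

#include <stdio.h>

/* Cyclic row-to-process assignment: row (or column) index modulo the number of processes. */
static int ownerOfRow(int row, int npes) {
    return row % npes;
}

int main(void) {
    int n = 5, npes = 3, row;
    for (row = 0; row < n; row++)
        printf("row %d -> process %d\n", row, ownerOfRow(row, npes));
    /* Prints: 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 0, 4 -> 1 */
    return 0;
}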
Next, we need to identify which parts of the algorithm can be parallelized. Let us recapitulate how Cholesky decomposition works. For every column:
1) Replace the entries above the diagonal with zeroes.
2) Update the entry on the diagonal.
3) Update the entries below the diagonal.
The first step is pointless to parallelize because it involves no computations.
The second step could be parallelized. We could start by broadcasting the row that contains the diagonal entry, then use MPI_Reduce() to compute the new value of the diagonal entry, and broadcast again to send the updated value to all processes. However, this would be terribly inefficient because processes would spend a lot of time sending and receiving messages. It is more efficient to simply let the process assigned to that row compute the new value of the diagonal entry and broadcast the whole row once.

The third step is where we gain performance by parallelizing tasks. Since processes already have access to the data in their respective row(s), we only need to broadcast the row that contains the diagonal entry. As soon as a process receives this data, it can proceed to update the entry (or entries) corresponding to its row(s). Thus, at this point in the program, all processes compute in parallel.
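Putting the three steps together, the per-column communication pattern boils down to the sketch below (declarations of L, n, npes, rank and the loop indices are assumed to be as in the full cholMPI.c listing of Section XI): only the owner of row j updates the diagonal entry, row j is broadcast exactly once per column, and every process then updates the entries of column j that lie in its own rows.

for (j = 0; j < n; j++) {
    if (j % npes == rank) {                   /* owner of row j updates the diagonal entry */
        for (k = 0; k < j; k++)
            L[j][j] -= L[j][k] * L[j][k];
        L[j][j] = sqrt(L[j][j]);
    }
    /* one broadcast per column: the row holding the updated diagonal entry */
    MPI_Bcast(L[j], n, MPI_DOUBLE, j % npes, MPI_COMM_WORLD);
    for (i = j + 1; i < n; i++)               /* each process updates only its own rows */
        if (i % npes == rank) {
            for (k = 0; k < j; k++)
                L[i][j] -= L[i][k] * L[j][k];
            L[i][j] /= L[j][j];
        }
}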
We also tried broadcasting each individual entry up to the diagonal. We thought this might yield better performance because the total amount of data transferred is almost halved. However, it turned out to decrease performance by a factor of almost 2! This is due to the communication overhead incurred when transferring data between processes.

Note: we chose to parallelize the row-Cholesky algorithm instead of the column-Cholesky algorithm because there is no simple way to broadcast the columns of a matrix. We considered transposing the matrix before applying the algorithm, but we figured that performance would probably be worse than simply applying the row-Cholesky version.
C. OpenMP
The OpenMP implementation of the Cholesky decomposition algorithm is fairly straightforward. Starting from the code for the serial implementation, only a few lines of code were changed to parallelize the algorithm with OpenMP. Placing the OpenMP #pragma omp parallel for directive before the inner for loops splits the single thread into a number of shared-memory threads. Moreover, a set of synchronization structures (i.e. locks) was explicitly defined in the code to avoid potential race conditions. The result of these modifications is shown in the listings section of the report (Section XI).
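In condensed form, the changes amount to the directives in the sketch below (declarations and the lock setup are assumed to be as in the full cholOMP.c listing of Section XI): the accumulation into the shared diagonal entry is guarded by a lock, and the loop over the rows below the diagonal is split among the threads.

for (j = 0; j < n; j++) {
    for (i = 0; i < j; i++)
        L[i][j] = 0;                        /* clear the entries above the diagonal */
    #pragma omp parallel for shared(L) private(k)
    for (k = 0; k < j; k++) {
        omp_set_lock(&writelock);           /* serialize updates of the shared L[j][j] */
        L[j][j] -= L[j][k] * L[j][k];
        omp_unset_lock(&writelock);
    }
    L[j][j] = sqrt(L[j][j]);
    #pragma omp parallel for shared(L) private(i, k)
    for (i = j + 1; i < n; i++) {           /* rows below the diagonal, updated in parallel */
        for (k = 0; k < j; k++)
            L[i][j] -= L[i][k] * L[j][k];
        L[i][j] /= L[j][j];
    }
}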
V. TESTS
This section outlines the testing methodology undertaken to analyze and compare the serial, OpenMP and MPI implementations. Each set of tests is executed on two 4-core machines with the specifications displayed in the Appendix.
A. Test 1
The purpose of the first test is to verify the correctness of each of the three implementations. To ensure that each program executes the right computations, we begin by decomposing a matrix A into its Cholesky factors LL^T. Then, exploiting the fact that A should be equal to LL^T, we compare the two matrices using the least absolute error formula:

S = Σ_{i=1}^{n} |y_i − f(x_i)|    (1)
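Concretely, the check multiplies the factors back together and accumulates the entry-wise absolute differences against A. A minimal sketch is shown below; the name choleskyError is ours, and matrixMultiply() and transpose() are the helpers from matrix.c in Section XI (the corresponding project routines are ComputeSumOfAbsError() and testBasicOutputOfChol()).

#include <math.h>
#include "matrix.h"

/* Sum of absolute entry-wise differences between A and L * L^T
   (equation (1) applied to the matrix entries); 0 indicates an exact decomposition. */
double choleskyError(double **A, double **L, int n) {
    double **LLT = matrixMultiply(L, transpose(L, n), n);
    double sum = 0.0;
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            sum += fabs(A[i][j] - LLT[i][j]);
    return sum;
}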
B. Test 2
The second test is conducted to measure the accuracy of the decomposition produced by each implementation. To obtain this measure, a set of matrix operations is applied to the original matrix A and to the factor L to produce an experimental result x* that is compared to a theoretical result x. More precisely, we begin by generating a random matrix A and a vector x of the same dimension (x being the theoretical value). We then compute their product b:

Ax = b    (2)

Subsequently, A's Cholesky decomposition LL^T is generated. Using these factors and the vector b found previously, the experimental value x* is computed from:

LL^T x* = b    (3)

Since L and L^T are triangular matrices, finding x* is straightforward. Indeed, one can replace L^T x* by y and solve for y from the system of linear equations:

Ly = b    (4)

Finally, x* can be computed in a similar way using:

L^T x* = y    (5)

The error between the theoretical and experimental vectors x and x* is computed by the least absolute error:

S = Σ_{i=1}^{n} |x_i − x*_i|    (6)
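The two triangular solves are simple loops; the sketch below mirrors the testErrorOfLinearSystemApplication() routine in Section XI (the arrays b, y, x, the matrices L and LT, and the index variables i, j, n are assumed to be declared, with L lower triangular and LT its transpose).

/* Forward substitution: solve L y = b. */
for (i = 0; i < n; i++) {
    y[i] = b[i];
    for (j = 0; j < i; j++)
        y[i] -= L[i][j] * y[j];
    y[i] /= L[i][i];
}
/* Backward substitution: solve L^T x = y. */
for (i = n - 1; i >= 0; i--) {
    x[i] = y[i];
    for (j = i + 1; j < n; j++)
        x[i] -= LT[i][j] * x[j];
    x[i] /= LT[i][i];
}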
C. Test 3
The third test aims to compare the performance of the three implementations. To acquire this information, we measure program execution time with respect to the matrix size and the number of threads (the latter concerning OpenMP and MPI only). The serial, OpenMP and MPI programs are tested with square matrices of sizes ranging from 50 to 5000, 50 to 4000 and 50 to 2000/3000 (see the discussion for more details), respectively. The OpenMP and MPI programs are executed with the number of threads ranging from 2 to 32 and from 2 to 9, respectively.

The matrix size and number of threads are passed as arguments to the executables, while the execution time is obtained by comparing the system time before and after the Cholesky decomposition and printing this value to the command prompt.

For each program execution, a new random matrix is generated. In order to ensure consistency between runs, each test case is executed five times and the average runtime is logged as the measurement.
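For reference, the timing in tests.c brackets the factorization with clock_gettime(CLOCK_MONOTONIC, ...) and converts the difference to seconds. A minimal sketch of that measurement is shown below; the wrapper name timeDecomposition and its function-pointer argument are ours, for illustration only.

#include <time.h>

/* Wall-clock timing of one decomposition run, as in tests.c. */
double timeDecomposition(void (*decompose)(void)) {
    struct timespec begin, end;
    clock_gettime(CLOCK_MONOTONIC, &begin);
    decompose();                          /* run the Cholesky implementation under test */
    clock_gettime(CLOCK_MONOTONIC, &end);
    return ((double) end.tv_sec + 1.0e-9 * end.tv_nsec)
         - ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec);
}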
VI. RESULTS
The following section summarizes the statistics acquired after executing each of the three aforementioned tests.
TABLE II
MEASURE OF THE CORRECTNESS OF EACH ALGORITHM

              Serial   OpenMP   MPI
Correctness   Passed   Passed   Passed
The least absolute deviation between the reconstructed product LL^T and the original matrix is computed with a precision of 0.00000000001 (10^-11) for all implementations. For the serial algorithm, and for the OpenMP and MPI versions with matrix sizes from 50 to 3000 and 1 to 9 threads, the measured deviation is always exactly 0. Therefore, all implementations of the algorithm are correct, although it is not certain whether the parallelization of the task is optimal.
VII. DISCUSSION
In this section, we discuss and analyze our findings based on the data collected during tests 1, 2 and 3.

The results of the first test show that all three implementations we designed are correct, as expected: the error for each of them is exactly 0. This result is required before conducting further testing and comparison between the implementations, since it shows that Cholesky decomposition is properly programmed in each version.
Surprisingly, the least absolute error measured in the second test is precisely 0 for all implementations, when we expected some loss of precision in the MPI implementation. This may be explained by the fact that the test matrices were diagonally dominant, and hence their condition numbers were very low. This property ensures that the decomposition yields little error, even though we did not expect it to yield no error at all.

TABLE III
BEST EXECUTION TIMES AS A FUNCTION OF MATRIX SIZE FOR EACH IMPLEMENTATION

Matrix size   Serial execution time (s)   OpenMP best execution time (s)   MPI best execution time (s)
50            0.00043                     0.00113                          0.000229
100           0.00464                     0.00299                          0.00291
200           0.01739                     0.01047                          0.008708
300           0.07026                     0.03178                          0.023265
500           0.28296                     0.14314                          0.10016
1000          3.86278                     0.94137                          0.833415
1500          11.31204                    2.98861                          2.856436
2000          30.60744                    6.92203                          6.927265
3000          113.79156                   22.94476                         22.438851
4000          284.43736                   54.20875                         -
5000          546.4589                    -                                -
Fig. 2. Runtime of serial Cholesky algorithm
Using the data acquired during the third testing phase, the execution time versus the matrix size is plotted for each implementation. For the serial version, we observe that the execution time increases cubically as a function of the matrix size, as displayed in Figure 2. This result matches our expectations, since Cholesky decomposition consists of three nested for loops, resulting in a time complexity of O(n^3). Similarly, we note that the OpenMP and MPI execution times (for a fixed number of threads) increase with the matrix size, but less dramatically, as displayed in Figures 3 and 4. This result illustrates that parallelizing the computations improves execution time.

Second, one can observe from Figure 3 that for most of the conducted tests, the execution time of the OpenMP algorithm for a fixed matrix size decreases with an increasing number of threads until 5 threads are used. From 5 threads onwards, the execution time generally increases as more threads are added. However, there is a notable exception: for matrix sizes above 1000, calling 9 threads often results in a longer execution time than with 16 or 32 threads. This might be due to a particularity of Cholesky decomposition, but is more likely a result of the underlying computer architecture.
TABLE I
EXECUTION TIME AS A FUNCTION OF THE MATRIX SIZE AND NUMBER OF THREADS FOR THE OPENMP AND MPI IMPLEMENTATIONS

Fig. 3. Average OpenMP implementation execution time

Fig. 4. Average OpenMPI implementation execution time

The speedup plotted in Figure 5 illustrates how much faster the OpenMP implementation runs than its serial counterpart. As can be observed, the speedup grows linearly as the matrix size increases: the larger the matrix, the more OpenMP outperforms the serial algorithm, since the former completes in O(n^2) and the latter in O(n^3).
Fig. 5. Speedups of Parallel Implementations

An intriguing problem encountered with the OpenMP implementation is worth discussing. Initially, a column-Cholesky algorithm was programmed and parallelized. After executing dozens of tests, we noted that the performance improved as the number of threads was increased up to 2000 threads, despite the program being run on a 4-core machine! We suspect that spawning an overwhelming number of threads forces the program to terminate prematurely, without necessarily returning an error message. As a result, increasing the number of threads causes earlier termination and thus smaller execution times. To address this problem, a row-Cholesky algorithm was programmed instead, which produced sensible results, as previously discussed. We chose to completely change the Cholesky implementation from the column to the row version because OpenMP may not deliver the expected level of performance with some algorithms [Chapman].
Figure 4 shows the execution time of the MPI implementation for varying matrix sizes and numbers of threads. As can be seen, the execution time for a fixed matrix size decreases as the number of threads increases, up to 4 threads (corresponding to the number of cores on the test machine). From that point on, the execution time generally increases with the number of threads, and this behaviour is consistent across matrix sizes. An interesting observation can also be made from Table I regarding the execution time with more than 4 threads: when the number of threads is set to an even number, the execution time is lower than with the preceding and following odd numbers. Once again, this might be a result of the underlying computer architecture.
The speedup of the MPI implementation over the serial version is likewise plotted in Figure 5, and we can observe that as the matrix size increases, the speedup increases linearly. This confirms that MPI outperforms the serial implementation by a factor that grows with the matrix size n. Interestingly, we can conclude from these results that MPI's behaviour is very similar to that of OpenMP.

As previously implied, the best OpenMP and MPI performance for most matrix sizes is achieved with 4 threads. This is due to the fact that the tests were conducted on a 4-core machine; maximum performance is achieved when the number of threads matches the number of cores [Jie Chen]. Indeed, too few threads do not fully exploit the computer's parallelization capabilities, while too many threads induce significant context-switching overhead between processes [Chapman].
A. Serial/MPI/OpenMP comparison
As expected, the parallel implementations achieve much higher performance than their serial counterpart; parallelizing independent computations results in much smaller runtimes than executing them serially. Surprisingly, in their best cases, the OpenMP and MPI algorithms complete the Cholesky decomposition in nearly identical runtimes, with MPI holding a slight advantage, as seen in Figure 6. We were expecting OpenMP to perform better, since we believed that message passing between processes would cause more overhead than using a shared memory space. However, we have not been able to explain this result on a theoretical basis.
Fig. 6. Average OpenMPI Speedup
One can observe that the OpenMP and MPI implementations were tested with maximum matrix sizes of 4000 and 3000 respectively, compared to 5000 for the serial implementation. This is because running the programs with sizes beyond these values would completely freeze the computer due to memory limitations. Therefore, the serial implementation seems to have an advantage over the parallel algorithms when the matrix size is significantly large, because it does not require additional memory for elements such as extra threads, processes or messages. Similarly, MPI was not tested with 16 and 32 threads, unlike OpenMP, because these values would freeze the computer. We believe that this behaviour is explained by the fact that MPI requires more memory, generating not only processes but also messages, while OpenMP only requires threads.

Finally, it is essential to mention that it is much simpler to parallelize a program with OpenMP than with MPI [Mallón]; the former only requires a few extra lines of code defining the parallelizable sections and the required synchronization mechanisms (locks, barriers, etc.), while the latter requires the code to be restructured in its entirety around the message-passing architecture.
VIII. CONCLUSION
Throughout this project, we evaluated the performance impact of different parallelization techniques for the Cholesky decomposition. First, a serial implementation was developed and used as the reference baseline. Second, we used multicore programming tools, namely OpenMP and MPI, to apply different parallelization approaches. We inspected the results of the different implementations by varying the matrix size (from 50 up to 5000, depending on the implementation) and the number of threads used (between 1 and 32). As expected, both parallel implementations deliver higher performance than the serial version. We also observed that MPI achieved slightly better execution times than OpenMP.

In the future, we would like to experiment with a mixed mode of OpenMP and MPI in the hope of discovering an even more efficient parallelization scheme for Cholesky decomposition, since such a scheme may further increase performance. Moreover, we would like to conduct further tests with the current implementations by running the programs on computers with more cores, such as 8, 16, 32 and 64. We would expect machines with more cores to provide better execution times.

This project improved our knowledge of the different tools that can be used to parallelize a program. In fact, we were able to apply the parallel computing theory learned in class.
IX. REFERENCES
D. Mallón, et al., "Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, vol. 5759, M. Ropo, et al., Eds. Springer Berlin Heidelberg, 2009, pp. 174-184.

J. Shen and A. L. Varbanescu, "A Detailed Performance Analysis of the OpenMP Rodinia Benchmark," Technical Report PDS-2011-011, 2011.

B. Chapman, et al., Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press, 2008.

M. T. Heath, "Parallel Numerical Algorithms" course, Chapter 7 - Cholesky Factorization. University of Illinois at Urbana-Champaign.

K. A. Gallivan, et al., Parallel Algorithms for Matrix Computations. Society for Industrial and Applied Mathematics, 1990.

L. Smith and M. Bull, "Development of mixed mode MPI/OpenMP applications," Scientific Programming, vol. 9, pp. 83-98, 2001.
X. APPENDIX
A. Test Machine Specifications
1) Machine 1
Hardware:
CPU: Intel Core i3 M350 Quad-Core @ 2.27 GHz
Memory: 3.7 GB
Software:
Ubuntu 12.10 64-bit
Linux kernel 3.5.0-17-generic
GNOME 3.6.0
2) Machine 2
Hardware:
CPU: Intel Core i7 Q 720 Quad-Core @ 1.60 GHz
Memory: 4.00 GB
Software:
Ubuntu 13.04 64-bit
Linux kernel 3.5.0-17-generic
GNOME 3.6.0
XI. CODE
A. matrix.c - Helper Functions
#include "matrix.h"

// Print a square matrix.
void print(double **matrix, int matrixSize) {
    int i, j;
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            printf("%.2f\t", matrix[i][j]);
        }
        printf("\n");
    }
    printf("\n");
}

// Multiply two square matrices of the same size.
double **matrixMultiply(double **matrix1, double **matrix2, int matrixSize) {
    // Allocate memory for the output matrix of doubles.
    int i, j, k;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    double result = 0;
    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            // Multiply each row of matrix 1 with each column of matrix 2.
            for (k = 0; k < matrixSize; k++) {
                result += matrix1[i][k] * matrix2[k][j];
            }
            matrixOut[i][j] = result;
            result = 0; // Reset.
        }
    }
    return matrixOut;
}

// Add two square matrices of the same size.
double **matrixAddition(double **matrix1, double **matrix2, int matrixSize) {
    // Allocate memory for the output matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    // Fill each cell of the output matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            matrixOut[i][j] = matrix1[i][j] + matrix2[i][j];
        }
    }
    return matrixOut;
}

// Multiply a square matrix by a vector. Return NULL on failure.
double *vectorMultiply(double **matrix, double *vector, int matrixSize, int vectorSize) {
    double *result = (double *) malloc(matrixSize * sizeof(double));
    if (vectorSize != matrixSize) {
        return NULL;
    }
    int i, j;
    double sum = 0.0;
    // Multiplication.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            sum += matrix[i][j] * vector[j];
        }
        result[i] = sum;
        sum = 0; // Reset.
    }
    return result;
}

// Return the transpose of a square matrix.
double **transpose(double **matrix, int matrixSize) {
    // Allocate memory for the output matrix of doubles.
    int i, j;
    double **matrixOut = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrixOut[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    // Transpose the matrix.
    for (i = 0; i < matrixSize; i++) {
        for (j = 0; j < matrixSize; j++) {
            matrixOut[i][j] = matrix[j][i];
        }
    }
    return matrixOut;
}

// Create a real symmetric positive definite matrix.
double **initialize(int minValue, int maxValue, int matrixSize) {
    // Allocate memory for the matrices of doubles.
    int i, j;
    double **matrix = (double **) malloc(matrixSize * sizeof(double *));
    double **identity = (double **) malloc(matrixSize * sizeof(double *));
    for (i = 0; i < matrixSize; i++) {
        matrix[i] = (double *) malloc(matrixSize * sizeof(double));
        identity[i] = (double *) malloc(matrixSize * sizeof(double));
    }
    // Fill the matrix with random numbers between minValue and maxValue,
    // and build an identity matrix scaled by maxValue * matrixSize.
    double random;
    for (i = 0; i < matrixSize; i++) {
        identity[i][i] = maxValue * matrixSize;
        for (j = 0; j < matrixSize; j++) {
            if (i != j) {
                identity[i][j] = 0.0;
            }
            random = (maxValue - minValue) * ((double) rand() / (double) RAND_MAX) + minValue;
            if (random == 0.0) {
                random = 1.0; // Avoid division by 0.
            }
            matrix[i][j] = random;
        }
    }
    // Transform to positive definite: A + A^T is symmetric, and adding the
    // scaled identity makes the result diagonally dominant.
    double **transposed = transpose(matrix, matrixSize);
    matrix = matrixAddition(matrix, transposed, matrixSize);
    matrix = matrixAddition(matrix, identity, matrixSize);
    return matrix;
}

// Compute the sum of absolute error between 2 vectors.
double vectorComputeSumofAbsError(double *vector1, double *vector2, int size)
{
    int i;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++)
    {
        sumOfAbsError += fabs(vector2[i] - vector1[i]);
    }
    return sumOfAbsError;
}

// Compute and print the sum of absolute error between 2 matrices.
void ComputeSumOfAbsError(double **matrix1, double **matrix2, int size)
{
    int i, j;
    double sumOfAbsError = 0;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            sumOfAbsError += fabs(matrix1[i][j] - matrix2[i][j]);
        }
    }
    printf("The sum of absolute error is %10.6f\n", sumOfAbsError);
}

// Print a vector, one entry per line.
void printVector(double *vector, int size) {
    int i;
    for (i = 0; i < size; i++) {
        printf("\t%10.6f", vector[i]);
        printf("\n");
    }
    printf("\n");
}

// Allocate an uninitialized square matrix of doubles.
double **initMatrix(int size) {
    double **matrix = (double **) malloc(size * sizeof(double *));
    int i;
    for (i = 0; i < size; i++)
        matrix[i] = (double *) malloc(size * sizeof(double));
    return matrix;
}

// Copy only the lower triangular part of source into dest.
void transCopy(double **source, double **dest, int size) {
    int i, j;
    for (i = 0; i < size; i++) {
        for (j = 0; j <= i; j++) {
            dest[i][j] = source[i][j];
        }
    }
}

// Copy a full square matrix.
void copyMatrix(double **source, double **dest, int size) {
    int i, j;
    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            dest[i][j] = source[i][j];
        }
    }
}
B. cholSerial.c - Serial Cholesky
#include "cholSerial.h"

// Serial column-wise Cholesky factorization. Returns a newly allocated L.
double **cholSerial(double **A, int n) {
    // Copy matrix A and take only its lower triangular part.
    double **L = initMatrix(n);
    transCopy(A, L, n);

    int i, j, k;
    for (j = 0; j < n; j++) {
        for (k = 0; k < j; k++) {
            // Inner sum: subtract the contributions of earlier columns.
            for (i = j; i < n; i++) {
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
        }
        L[j][j] = sqrt(L[j][j]);
        for (i = j + 1; i < n; i++) {
            L[i][j] = L[i][j] / L[j][j];
        }
    }
    return L;
}
C. cholOMP.c - OpenMP Cholesky
#include "matrix.h"
#include <omp.h>

// OpenMP Cholesky factorization.
double **cholOMP(double **L, int n) {
    // Warning: acts directly on the given matrix!
    int i, j, k;
    omp_lock_t writelock;
    omp_init_lock(&writelock);

    for (j = 0; j < n; j++) {
        // Clear the entries above the diagonal in column j.
        for (i = 0; i < j; i++) {
            L[i][j] = 0;
        }
        #pragma omp parallel for shared(L) private(k)
        for (k = 0; k < j; k++) {
            omp_set_lock(&writelock);
            L[j][j] = L[j][j] - L[j][k] * L[j][k]; // Critical section.
            omp_unset_lock(&writelock);
        }
        #pragma omp single
        L[j][j] = sqrt(L[j][j]);
        #pragma omp parallel for shared(L) private(i, k)
        for (i = j + 1; i < n; i++) {
            for (k = 0; k < j; k++) {
                L[i][j] = L[i][j] - L[i][k] * L[j][k];
            }
            L[i][j] = L[i][j] / L[j][j];
        }
    }
    omp_destroy_lock(&writelock); // Release the lock once the factorization is done.
    return L;
}
D. cholMPI.c - OpenMPI Cholesky
#include <mpi.h>
#include "matrix.h"

int testBasicOutput(double **A, double **L, int n);

// MPI Cholesky factorization with cyclic row distribution.
void cholMPI(double **A, double **L, int n, int argc, char **argv) {
    // Warning: cholMPI() acts directly on the given matrix!
    int npes, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start, end;
    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0) {
        start = MPI_Wtime();
        /* Test
        printf("A = \n");
        print(L, n); */
    }

    // For each column
    int i, j, k;
    for (j = 0; j < n; j++) {
        /*
         * Step 0:
         * Replace the entries above the diagonal with zeroes.
         */
        if (rank == 0) {
            for (i = 0; i < j; i++) {
                L[i][j] = 0.0;
            }
        }
        /*
         * Step 1:
         * Update the diagonal element.
         */
        if (j % npes == rank) {
            for (k = 0; k < j; k++) {
                L[j][j] = L[j][j] - L[j][k] * L[j][k];
            }
            L[j][j] = sqrt(L[j][j]);
        }
        // Broadcast the row with the new values to the other processes.
        MPI_Bcast(L[j], n, MPI_DOUBLE, j % npes, MPI_COMM_WORLD);
        /*
         * Step 2:
         * Update the elements below the diagonal element.
         */
        // Divide the rest of the work.
        for (i = j + 1; i < n; i++) {
            if (i % npes == rank) {
                for (k = 0; k < j; k++) {
                    L[i][j] = L[i][j] - L[i][k] * L[j][k];
                }
                L[i][j] = L[i][j] / L[j][j];
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD); /* Timing */
    if (rank == 0) {
        end = MPI_Wtime();
        printf("Testing OpenMPI implementation Output:\n");
        printf("Runtime = %lf\n", end - start);
        printf("Testing MPI implementation Output: ");
        testBasicOutput(A, L, n);
        /* Test
        double **LLT = matrixMultiply(L, transpose(L, n), n);
        printf("LLT = \n");
        print(LLT, n); */
    }
    MPI_Finalize();
}

// Check that L * L^T reproduces A to within the given precision.
int testBasicOutput(double **A, double **L, int n)
{
    double **LLT = matrixMultiply(L, transpose(L, n), n);
    int i, j;
    float precision = 0.0000001;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (!(fabs(LLT[i][j] - A[i][j]) < precision)) {
                printf("FAILED\n");
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}
E. tests.c - General Test Code
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
#include <time.h>
#include <stdlib.h>
#include <omp.h>
#include "matrix.h"

typedef int bool;
enum { false, true };

int testBasicOutputOfChol(double **A, double **L, int n);

struct timespec begin = {0, 0}, end = {0, 0};
time_t start, stop;

int main(int argc, char **argv)
{
    // Generate the random seed.
    srand(time(NULL));

    if (argc != 3)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size] [number of threads]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);
    int threadsNumber = atoi(argv[2]);

    printf("Test basic output for a matrix of size %d:\n", matrixSize);

    // Generate a random SPD matrix.
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);

    // Test the serial program: apply serial Cholesky.
    printf("Testing Serial implementation Output:\n");
    clock_gettime(CLOCK_MONOTONIC, &begin);
    L = cholSerial(A, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Test execution time.
    printf("The serial computation took %.5f seconds\n",
        ((double) end.tv_sec + 1.0e-9 * end.tv_nsec)
        - ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));

    // Test the OpenMP program.
    printf("Testing OpenMP implementation Output:\n");
    omp_set_num_threads(threadsNumber);
    copyMatrix(A, L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &begin);
    cholOMP(L, matrixSize);
    clock_gettime(CLOCK_MONOTONIC, &end); // Get the current time.
    testBasicOutputOfChol(A, L, matrixSize);
    // Test execution time.
    printf("The OpenMP computation took %.5f seconds\n",
        ((double) end.tv_sec + 1.0e-9 * end.tv_nsec)
        - ((double) begin.tv_sec + 1.0e-9 * begin.tv_nsec));
    printf("\n");
    return 0;
}

// Check that L * L^T reproduces A to within the given precision.
int testBasicOutputOfChol(double **A, double **L, int n)
{
    double **LLT = matrixMultiply(L, transpose(L, n), n);
    int i, j;
    float precision = 0.00000000001;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (!(fabs(LLT[i][j] - A[i][j]) < precision)) {
                printf("FAILED\n"); // If it fails, show the error.
                ComputeSumOfAbsError(A, LLT, n);
                return 0;
            }
        }
    }
    printf("PASSED\n");
    return 1;
}

// Time the serial Cholesky for a matrix of size n.
void testTimeforSerialChol(int n)
{
    printf("Test duration for serial version with matrix of size %d\n", n);
    // Generate a random SPD matrix.
    double **A = initialize(0, 10, n);
    clock_t start = clock();
    // Apply Cholesky.
    double **L = cholSerial(A, n);
    clock_t end = clock();
    float seconds = (float) (end - start) / CLOCKS_PER_SEC;
    printf("It took %f seconds\n", seconds);
}

// Solve A x = b through the Cholesky factors and report the error against the known x.
void testErrorOfLinearSystemApplication(int matrixSize)
{
    printf("Test linear system application of Cholesky for matrix size %d:\n",
        matrixSize);
    double **A = initialize(0, 10, matrixSize);
    double *xTheo = (double *) malloc(matrixSize * sizeof(double));
    int index;
    for (index = 0; index < matrixSize; index++)
    {
        xTheo[index] = rand() / (double) RAND_MAX * 10;
    }
    double *b = vectorMultiply(A, xTheo, matrixSize, matrixSize);

    // Apply Cholesky.
    double **L = cholSerial(A, matrixSize);

    double *y = (double *) malloc(matrixSize * sizeof(double));
    // Forward-substitution part: solve L y = b.
    int i, j;
    for (i = 0; i < matrixSize; i++) {
        y[i] = b[i];
        for (j = 0; j < i; j++) {
            y[i] = y[i] - L[i][j] * y[j];
        }
        y[i] = y[i] / L[i][i];
    }
    // Back-substitution part: solve L^T x = y.
    double **LT = transpose(L, matrixSize);
    double *xExpr = (double *) malloc(matrixSize * sizeof(double));
    for (i = matrixSize - 1; i >= 0; i--) {
        xExpr[i] = y[i];
        for (j = i + 1; j < matrixSize; j++) {
            xExpr[i] = xExpr[i] - LT[i][j] * xExpr[j];
        }
        xExpr[i] = xExpr[i] / LT[i][i];
    }
    printf("x experimental is:\n");
    printVector(xExpr, matrixSize);
    printf("The sum of abs error is %10.6f\n",
        vectorComputeSumofAbsError(xTheo, xExpr, matrixSize));
}
F. testMPI.c - Test program for MPI implementation
#include "matrix.h"

void cholMPI(double **A, double **L, int n, int argc, char **argv);

int main(int argc, char **argv)
{
    // Generate the random seed.
    srand(time(NULL));

    if (argc != 2)
    {
        printf("You did not feed me arguments, I will die now :( ...\n");
        printf("Usage: %s [matrix size]\n", argv[0]);
        return 1;
    }
    int matrixSize = atoi(argv[1]);

    // Generate a random SPD matrix.
    double **A = initialize(0, 10, matrixSize);
    /* printf("Chol matrix\n");
    print(A, matrixSize); */
    double **L = initialize(0, 10, matrixSize);

    // Test the OpenMPI program.
    copyMatrix(A, L, matrixSize);
    cholMPI(A, L, matrixSize, argc, argv);
    // Warning: cholMPI() acts directly on the given matrix L.
    return 0;
}
