Computing on Low Latency Clusters
Vittorio Giovara
M. S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents
• Motivation
• OpenMP
• MPI
• Infiniband
• Application
• Results
• Conclusions
Motivation
[Diagram: Flynn's taxonomy, showing MISD and MIMD architectures as instruction pools operating on data pools]
Levels
[Diagram: parallelization abstraction levels, from algorithm to loop level to process management, annotated with recursion, memory management, profiling, data dependency, branching overhead, and control flow]
SMP Multiprogramming
Multithreading and Scheduling
Backfire
OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple construction methods
OpenMP - example
[Diagram: OpenMP fork-join model, in which the master thread forks into Threads A and B executing parallel tasks 1 to 4, then joins back into the master thread]
OpenMP Scheduler
[Charts: OpenMP scheduler benchmarks, execution time in microseconds vs. number of threads (1 to 16)]
Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
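Small-packet latency of the kind charted below is typically measured with a ping-pong loop between two ranks; a hedged sketch (illustrative buffer size and iteration count, requires an MPI runtime, e.g. `mpicc pingpong.c && mpirun -np 2 ./a.out`):

```c
/* Ping-pong latency sketch: rank 0 sends a small message to rank 1
 * and waits for the echo; the round-trip time divided by two
 * approximates one-way latency. Values are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    char buf[16];                         /* small packet */
    const int iters = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (MPI_Wtime() - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```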
MPI over Infiniband
[Chart: message latency in µs (log scale) vs. message size, from bytes up to the GB range, comparing OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: message latency in µs (log scale) vs. message size, comparing OpenMPI and Mvapich2]
Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• program uses linear formulation of mathematical models
Implementation Scheme
[Diagram: the sequential loop of the standard programming model becomes a parallel loop over OpenMP threads, and then a distributed loop across cluster nodes]
Results
• OpenMP – 6-8x speedup
• MPI – 2x speedup
• OpenMP + MPI – 14-16x speedup