
Parallel and Distributed

Computing on Low
Latency Clusters

Vittorio Giovara
M. S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents

• Motivation
• Strategy
• Technologies
• OpenMP
• MPI
• Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation

• The scaling trend of CMOS technology cannot continue indefinitely:
  ✓ direct-tunneling limit in SiO2 ~3 nm
  ✓ distance between Si atoms ~0.3 nm
  ✓ variability
• Fundamental reason: rising fab cost


Motivation

• Multi-core processors are easy to build
• Modifying and adapting software to run concurrently still requires human effort
• A new classification of computer architectures is needed
Classification
[Figure: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD architectures, each shown as one or more CPUs fed by an instruction pool and a data pool]

Levels

[Figure: abstraction levels of parallelization — algorithm, loop level, and process management; the higher the abstraction level, the easier it is to parallelize. Associated concerns: recursion, memory management, profiling, data dependency, branching overhead, control flow; mapped onto SMP, multiprogramming, multithreading and scheduling]
Backfire

• It is difficult to fully exploit the parallelism offered
• Automatic tools are required to adapt software to parallelism
• Compiler support enables manual or semi-automatic enhancement
Applications
• OpenMP and MPI are two popular tools that simplify the parallelization of both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and
Background
• Sally3D is a micromagnetics program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully cover the problem (no Infiniband or combined OpenMP+MPI solutions)
Strategy
• Install a Linux kernel configured ad hoc for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link between the cluster nodes, with the proper kernel- and user-space drivers
• Select an MPI implementation library
Strategy

• Verify the Infiniband network with a few MPI test examples (a minimal example is sketched below)
• Install the target software
• Insert OpenMP and MPI directives into the code
• Run test cases
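
A minimal MPI test of the kind mentioned above can be used to confirm that every rank starts and that the nodes can talk to each other. The following C sketch is illustrative only (not the actual test suite used); compile with mpicc and launch with mpirun:

    /* mpi_hello.c - sanity check that all ranks start and can report their host */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        MPI_Get_processor_name(name, &len);     /* host this rank runs on */

        printf("rank %d of %d running on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }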
OpenMP

• standard
• supported by most modern compilers
• requires little knowledge of the target software
• very simple constructs
OpenMP - example

[Figure: fork-join model — the master thread forks worker threads A and B, parallel tasks 1-4 are distributed among them, and the threads join back into the master thread]
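
The fork-join pattern in the figure maps directly onto OpenMP directives; a minimal C sketch with hypothetical task functions (not taken from the actual code):

    #include <omp.h>
    #include <stdio.h>

    /* placeholder for a unit of work */
    void task(int id) { printf("task %d on thread %d\n", id, omp_get_thread_num()); }

    int main(void)
    {
        /* the master thread forks a team; each section is executed by one
           thread of the team, and all threads join at the end of the block */
        #pragma omp parallel sections
        {
            #pragma omp section
            task(1);
            #pragma omp section
            task(2);
            #pragma omp section
            task(3);
            #pragma omp section
            task(4);
        }
        return 0;   /* only the master thread continues from here */
    }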
OpenMP Scheduler

• Which scheduler best fits the hardware? (see the sketch below)
  - Static
  - Dynamic
  - Guided
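
The choice is expressed through the schedule clause; a C sketch of the three variants on a generic loop (array and loop body are hypothetical):

    #define N 100000

    void scale(double *a)
    {
        int i;

        /* static: iterations are split into fixed chunks of 100,
           assigned to threads round-robin before the loop starts */
        #pragma omp parallel for schedule(static, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;

        /* dynamic: chunks of 100 are handed out on demand as threads finish */
        #pragma omp parallel for schedule(dynamic, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;

        /* guided: chunk size starts large and shrinks down to 100 */
        #pragma omp parallel for schedule(guided, 100)
        for (i = 0; i < N; i++) a[i] *= 2.0;
    }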
OpenMP Scheduler
[Chart: OpenMP static scheduler — execution time in microseconds (0-80000) versus number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]


OpenMP Scheduler
[Chart: OpenMP dynamic scheduler — execution time in microseconds (0-117000) versus number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]


OpenMP Scheduler
[Chart: OpenMP guided scheduler — execution time in microseconds (0-80000) versus number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]


OpenMP Scheduler

[Figure: side-by-side comparison of the static, dynamic, and guided scheduler charts]


MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available:
  - OpenMPI
  - MVAPICH
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: MPI over Infiniband — transfer time in microseconds (logarithmic scale, 1 µs to 10,000,000 µs) versus message size from bytes up to gigabytes, comparing OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: MPI over Infiniband (detail) — transfer time in microseconds (logarithmic scale) versus message size, comparing OpenMPI and Mvapich2]
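
Measurements like the two charts above are typically obtained with a ping-pong microbenchmark between two ranks; a simplified C sketch (not the exact benchmark used for these figures):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;
        const int bytes = 1024;          /* message size under test */
        int rank, i;
        char *buf;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {             /* send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {      /* echo the message back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes: %.2f us round trip\n", bytes, (t1 - t0) / reps * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }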
Optimizations

• Applied at compile time
• Available only after porting the software to standard FORTRAN
• Consistent documentation available
• Unexpectedly positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
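
For illustration, these flags might appear in a gfortran invocation such as the one below (file and binary names are hypothetical, and -fopenmp is added here only because the code also uses OpenMP):

    gfortran -march=native -O3 -ffast-math -Wl,-O1 -fopenmp -o sally3d *.f90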
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models
Implementation Scheme

[Figure: a sequential loop in the standard programming model becomes a parallel loop executed by OpenMP threads, and then a distributed loop in which each host (Host 1, Host 2) runs its own OpenMP threads and the hosts communicate via MPI]
Implementation Scheme

• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization objects required:
  ➡ send() and recv() mechanism
  ➡ critical regions using OpenMP directives
  ➡ function merging
  ➡ matrix conversion
  (a combined OpenMP+MPI sketch follows below)
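
A hedged sketch of how these mechanisms fit together, in C with hypothetical array names and a placeholder update rule (not the actual Sally3D routines): each MPI rank owns a slab of the three-dimensional matrix, OpenMP threads work on the local slab, a critical region protects a shared variable, and boundary planes are exchanged with send/recv.

    #include <mpi.h>
    #include <omp.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    /* hypothetical slab of the 3D matrix owned by this rank */
    static double field[NX][NY][NZ];

    void step(int rank, int size)
    {
        double local_max = 0.0;
        int i, j, k;

        /* OpenMP: parallelize the outer loop of the local computation */
        #pragma omp parallel for private(j, k)
        for (i = 1; i < NX - 1; i++)
            for (j = 0; j < NY; j++)
                for (k = 0; k < NZ; k++) {
                    double v = 0.5 * field[i][j][k];   /* placeholder update */
                    field[i][j][k] = v;
                    /* critical region guarding a shared accumulator;
                       a reduction clause would also work for this case */
                    #pragma omp critical
                    if (v > local_max) local_max = v;
                }

        /* MPI: pass one boundary plane down the chain of ranks */
        if (rank + 1 < size)
            MPI_Send(field[NX - 2], NY * NZ, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        if (rank > 0)
            MPI_Recv(field[0], NY * NZ, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

In the real code the pattern is more involved (temporary arrays, merged functions, matrix conversion), but the fork-join plus message-passing structure is the same.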
Results
OMP   MPI   OPT   seconds
 *     *     *      133
 *     *     -      400
 *     -     *      186
 *     -     -      487
 -     *     *      200
 -     *     -      792
 -     -     *      246
 -     -     -     1062
(* = enabled, - = disabled; OMP = OpenMP, OPT = compiler optimizations)

Total speed increase: 87.52%


Actual Results

OMP   MPI   seconds
 *     *       59
 *     -      129
 -     *      174
 -     -      249

Function Name     Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua     24.5 s   4.7 s    14.4 s   2.8 s
calc_hdmg_tet     16.9 s   3.0 s    10.8 s   1.7 s
calc_mudua        12.1 s   1.9 s    7.0 s    1.1 s
campo_effettivo   17.7 s   4.5 s    9.9 s    2.3 s
Actual Results

• OpenMP – 6-8x speedup
• MPI – 2x speedup
• OpenMP + MPI – 14-16x speedup

Total raw speed increase: 76%


Conclusions and Future Work
• Computation time has been significantly reduced
• The speedup is consistent with the expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform further algorithmic optimizations
• Increase the cluster size
