13.10.2011
Overview
- shared memory
- accelerators: CUDA, OpenCL
- distributed memory
POSIX threads
Pros
- Most basic threading interface
- Straightforward, manageable API
- Dynamic generation and destruction of threads
- Reasonable synchronization primitives
- Full execution control

Cons
- Most basic threading interface
- Higher-level functions (reductions, synchronization, work distribution, task queueing) must be done by hand
- Only available with a C API
- Only available on (near-)POSIX-compliant OSs
- Compiler has no clue about threads
Intel Threading Building Blocks (TBB)

Pros
- High-level programming model
- Task concept is often more natural for real-world problems than the thread concept
- Built-in parallel (thread-safe) containers
- Built-in work distribution (configurable, but not too finely)
- Available for Linux, Windows, MacOS

Cons
- C++ only
- Mapping of threads to resources (cores) not part of the model
- Number-of-threads concept only vaguely implemented
- Dynamic work sharing and task stealing introduce variability, difficult to optimize under ccNUMA constraints
- Compiler has no clue about threads
OpenMP
[Figure: OpenMP memory model: each thread has its own private data; all threads access a common shared memory]
[Figure: fork/join execution model. Only the master thread executes the sequential part; worker threads usually sleep. At a parallel region, a team of worker threads is generated (fork); all threads synchronize when leaving the parallel region (join). Threads numbered 0 to 5.]
Usually optimal: one thread per core
Multicore Briefing - parallel programming models
$$\int_0^1 f(t)\,dt \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i),
\quad \text{where } x_i = \frac{i-0.5}{n} \quad (i = 1,\dots,n)$$

We want

$$\pi = \int_0^1 \frac{4\,dx}{1+x^2}$$
// function to integrate
double f(double x) {
    return 4.0/(1.0+x*x);
}

// midpoint rule: w is the interval width
w = 1.0/n;
sum = 0.0;
for (i = 1; i <= n; ++i) {
    x = w*(i-0.5);
    sum += f(x);
}
pi = w*sum;
...
(printout omitted)
Annotations from the OpenMP version of the code:
- concurrent execution by a team of threads
- worksharing among the threads
- sequential execution
Pros
- High-level programming model
- Available for Fortran, C, C++
- Ideal for data parallelism, some support for task parallelism
- Built-in work distribution
- Directive concept is part of the language
- Good support for incremental parallelization

Cons
- Mapping of threads to resources (cores) not part of the model
- OpenMP parallelization may interfere with compiler optimization
- Parallel data structures are not part of the model
- Only limited synchronization facilities
- Model revolves around the parallel region concept
CUDA
NVIDIA CUDA
Pros
- Relatively straightforward programming model
- Low-level programming, explicit data management
- Compatible with many NVIDIA GPUs; code usually runs without changes
- Available for C, but wrappers for many languages available, including scripting languages

Cons
- Restricted to NVIDIA GPUs
- No support for multicore processors
- No support for AMD GPUs
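As an illustration of the low-level model with explicit data management, a minimal vector-addition sketch (array size, block size, and all names are illustrative; requires an NVIDIA GPU and nvcc):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// kernel: one GPU thread per vector element
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = (float*)malloc(bytes), *b = (float*)malloc(bytes),
          *c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // explicit data management: allocate device memory, copy input to GPU
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // kernel launch: a grid of thread blocks covering all n elements
    int block = 256;
    vadd<<<(n + block - 1) / block, block>>>(da, db, dc, n);

    // copy the result back and clean up
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```

The host/device split and the explicit copies are what "explicit data management" means in the pros list above.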
OpenCL
Pros
- Relatively straightforward programming model
- Low-level programming, explicit data management
- Available for NVIDIA and AMD GPUs, and multicore CPUs
- Potential for overlapping GPU computation with CPU tasks
- CUDA kernel code largely re-usable
- Some support for modern SIMD instruction sets

Cons
- Available only for C(99)/C++
- Just-in-time kernel compilation
- Low-level programming, explicit data management
- Powerful tools are just beginning to emerge
- Largely manual work distribution, but more flexible than CUDA
- Best performance on all architectures requires specialized code for each
MPI
Message Passing Interface
Startup phase:
- launch tasks
- establish a communication context (communicator) among all tasks

Collective communication

Clean shutdown
MPI in a nutshell
Hello World!
program hello
use mpi
implicit none
integer rank, size, ierror
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
write(*,*) 'Hello World! I am ',rank,' of ',size
call MPI_FINALIZE(ierror)
end program
Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4
Hello World! I am 2 of 4
MPI in a nutshell
Transmitting a message
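A hedged sketch of point-to-point transmission in C, using matched blocking MPI_Send/MPI_Recv calls; the payload, tag, and ranks are illustrative:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double msg[2] = {3.14, 2.71};   /* illustrative payload */
    if (rank == 0 && size > 1) {
        /* blocking send of 2 doubles to rank 1, message tag 0 */
        MPI_Send(msg, 2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* blocking receive of the matching message from rank 0 */
        MPI_Recv(msg, 2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %.2f %.2f\n", msg[0], msg[1]);
    }
    MPI_Finalize();
    return 0;
}
```

Type, count, destination/source rank, tag, and communicator together identify a message; sender and receiver must agree on them.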
Pros
- Suitable for distributed-memory and shared-memory machines
- Supports massive parallelism
- Well supported, many free and commercial implementations
- Tremendous code base, huge experience in the field
- Standard supports Fortran and C; wrappers for other languages exist, including scripting languages

Cons
- Execution environment must be set up carefully
- Huge standard (500+ functions) with many obscure bits and pieces
- Incremental parallelization next to impossible; most sequential code needs serious restructuring
- Performance properties sometimes hard to understand, and also implementation-dependent