
Parallel code for multicore systems


An overview of programming models

13.10.2011

Multicore Briefing - parallel programming models

Overview

There are just too many... and remember that the compiler will not help you.

Threading models for multicore processors (shared memory):
POSIX threads
Intel Threading Building Blocks
OpenMP

Threading models for GPGPUs (accelerators):
CUDA
OpenCL

Parallel programming for distributed memory:
MPI

Overall goal: Exploit the parallelism built into the hardware!


POSIX threads


Why threads for parallel programs?

Thread == Lightweight process


Independent instruction stream
In simulation we usually run one thread per (virtual or physical) core, but more is possible

New processes are expensive to generate (via fork())
Threads share all the data of a process, so they are cheap

Inter-process communication is slow and cumbersome
Shared memory between threads provides an easy way to communicate and synchronize (see the sketch below)

A threading model puts threads to use by making them accessible to the programmer
Either explicitly or wrapped in some parallel paradigm
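As a minimal sketch of communication and synchronization through shared memory (not from the original slides; the counter, the mutex and the thread count are made up for illustration; compile with the platform's thread flag, e.g. -pthread), four POSIX threads update one shared variable under a mutex:

#include <pthread.h>
#include <cstdio>

static long counter = 0;                              // shared by all threads
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);                    // enter critical section
        ++counter;                                    // communicate via shared memory
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t tids[4];
    for (int t = 0; t < 4; ++t) pthread_create(&tids[t], NULL, work, NULL);
    for (int t = 0; t < 4; ++t) pthread_join(tids[t], NULL);
    std::printf("counter = %ld\n", counter);          // expect 400000
    return 0;
}

Without the mutex the increments would race and the final value would be unpredictable.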


POSIX threads example: Matrix-vector multiply with 100 threads
#include <pthread.h>

static double a[100][100], b[100], c[100];

static void *mult(void *cp);

int main(int argc, char *argv[]) {
    pthread_t tids[100];
    ...
    for (int i = 0; i < 100; i++)
        pthread_create(tids + i, NULL, mult, (void *)(c + i));
    for (int i = 0; i < 100; i++)
        pthread_join(tids[i], NULL);
    ...
}

/* Each thread computes one entry of c = A*b.
   There are no shared resources here! (not really:
   a and b are shared, but only read) */
static void *mult(void *cp) {
    int i = (double *)cp - c;           /* recover the row index from the pointer */
    double sum = 0;
    for (int j = 0; j < 100; j++)
        sum += a[i][j] * b[j];
    c[i] = sum;
    return NULL;
}
Adapted from material by J. Kleinder

POSIX threads pros and cons

Pros
Most basic threading interface
Straightforward, manageable API
Dynamic generation and destruction of threads
Reasonable synchronization primitives
Full execution control

Cons
Most basic threading interface
Higher-level functionality (reductions, synchronization, work distribution, task queueing) must be implemented by hand
Only a C API is available
Only available on (near-)POSIX-compliant OSs
Compiler has no clue about threads


Intel Threading Building Blocks (TBB)


Intel Threading Building Blocks (TBB)

Introduced by Intel in 2006

C++ threading library


Uses POSIX threads under the hood
Programmer works with tasks rather than threads
Task stealing model
Parallel C++ containers (see the sketch after this list)

Commercial and open source variants exist
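As a hedged illustration of the parallel containers mentioned above (not from the original slides; element values and the filter are arbitrary), tbb::concurrent_vector allows concurrent push_back from inside a parallel_for:

#include "tbb/tbb.h"
#include <cstdio>

int main() {
    tbb::concurrent_vector<int> hits;        // thread-safe, growable container
    // compact parallel_for(first, last, body) overload; the body receives one index
    tbb::parallel_for(0, 1000, [&](int i) {
        if (i % 7 == 0)
            hits.push_back(i);               // concurrent push_back is safe
    });
    std::printf("found %zu multiples of 7\n", hits.size());
    return 0;
}

The elements end up in no particular order, which is typical of such concurrent containers.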


A simple parallel loop in TBB: Apply Foo() to every element of an array
#include "tbb/tbb.h"
using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i )
Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
void ParallelApplyFoo( float a[], size_t n ) {
parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}
Adapted from the Intel TBB tutorial
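With a compiler that supports C++11 lambdas, the same loop can be written without a hand-coded function object. This is a hedged sketch, not from the original tutorial; Foo is the same user-supplied function as above:

#include "tbb/tbb.h"

void Foo(float &x);                          // user-supplied element operation

void ParallelApplyFoo(float a[], size_t n) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [=](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                Foo(a[i]);                   // apply Foo to each array element
        });
}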


TBB pros and cons

Pros
High-level programming model
Task concept is often more natural for real-world problems than thread
concept
Built-in parallel (thread-safe) containers
Built-in work distribution (configurable, but not too finely)
Available for Linux, Windows, MacOS

Cons
C++ only
Mapping of threads to resources (cores) not part of the model
The notion of a fixed number of threads is only vaguely implemented
Dynamic work sharing and task stealing introduce variability, which makes it difficult to optimize under ccNUMA constraints
Compiler has no clue about threads


OpenMP


Parallel Programming with OpenMP

Easy and portable parallel programming of shared-memory computers: OpenMP

Standardized set of compiler directives & library functions: http://www.openmp.org/
Fortran, C and C++ interfaces
Supported by most/all commercial compilers, and by GNU gcc starting with version 4.2
Few free tools are available

An OpenMP program can be written to compile and execute on a single-processor machine just by ignoring the directives


Shared Memory Model used by OpenMP


Central concept of OpenMP programming: threads

Threads access globally shared memory
Data can be shared or private (a small example follows below):
shared data is available to all threads (in principle)
private data is visible only to the thread that owns it

(Figure: a team of threads, each with its own private data, attached to one shared memory.)
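A minimal sketch of the shared/private distinction (not from the original slides; variable names and sizes are arbitrary): every thread sees the same array, but each gets its own copy of the temporary.

#include <cstdio>

int main() {
    double data[8] = {0.0};                 // shared: visible to all threads
    double tmp = 0.0;                       // made private below

    #pragma omp parallel for private(tmp) shared(data)
    for (int i = 0; i < 8; ++i) {
        tmp = 2.0 * i;                      // each thread writes its own copy of tmp
        data[i] = tmp;                      // distinct i per iteration: no conflict
    }

    for (int i = 0; i < 8; ++i)
        std::printf("%g ", data[i]);
    std::printf("\n");
    return 0;
}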


OpenMP Program Execution: Fork and Join

Program start: only the master thread runs
Parallel region: a team of worker threads is generated (fork); the threads synchronize when leaving the parallel region (join)
Only the master executes the sequential parts; the worker threads usually sleep
Task and data distribution via directives (see the sketch below)
Usually optimal: one thread per core

(Figure: fork/join execution timeline for threads 0 to 5.)
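A minimal fork/join sketch (not from the original slides), showing the master thread alone in the sequential parts and a team inside the parallel region:

#include <omp.h>
#include <cstdio>

int main() {
    std::printf("sequential part: master thread only\n");

    #pragma omp parallel                    // fork: a team of threads is created
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        std::printf("parallel region: thread %d of %d\n", id, n);
    }                                       // join: implicit barrier at the end

    std::printf("sequential part again: worker threads idle\n");
    return 0;
}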

Example: Numerical integration in OpenMP


Approximate the integral by a discrete sum:

$\int_0^1 f(t)\,dt \;\approx\; \frac{1}{n}\sum_{i=1}^{n} f(x_i), \qquad x_i = \frac{i-0.5}{n} \quad (i = 1,\dots,n)$

We want $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$; solve this in OpenMP.

// function to integrate
double f(double x) {
    return 4.0/(1.0+x*x);
}

// serial version
w = 1.0/n;
sum = 0.0;
for (i = 1; i <= n; ++i) {
    x = w*(i-0.5);
    sum += f(x);
}
pi = w*sum;
...                 // (printout omitted)


Example: Numerical integration in OpenMP


...
pi = 0.0;
w = 1.0/n;
#pragma omp parallel private(x,sum)   // concurrent execution by a team of threads
{
    sum = 0.0;
    #pragma omp for                   // worksharing among the threads
    for (i = 1; i <= n; ++i) {
        x = w*(i-0.5);
        sum += f(x);
    }
    #pragma omp critical              // executed by one thread at a time
    pi = pi + w*sum;
}
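The critical section serializes only the final accumulation; the same result can be obtained more compactly with a reduction clause. The following is a self-contained sketch (not from the original slides; the value of n is chosen arbitrarily), to be compiled with OpenMP enabled (e.g. -fopenmp):

#include <cstdio>

// function to integrate: the integral of 4/(1+x^2) over [0,1] equals pi
static double f(double x) { return 4.0/(1.0 + x*x); }

int main() {
    const int n = 1000000;
    const double w = 1.0/n;
    double sum = 0.0;

    // reduction(+:sum) gives every thread a private partial sum and
    // combines the partial sums after the loop; no critical section needed
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i) {
        double x = w*(i-0.5);         // x is declared in the loop body, hence private
        sum += f(x);
    }

    double pi = w*sum;
    std::printf("pi is approximately %.12f\n", pi);
    return 0;
}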


OpenMP pros and cons

Pros
High-level programming model
Available for Fortran, C, C++
Ideal for data parallelism, some support for task parallelism
Built-in work distribution
Directive concept is part of the language
Good support for incremental parallelization

Cons
Mapping of threads to resources (cores) not part of the model
OpenMP parallelization may interfere with compiler optimization
Parallel data structures are not part of the model
Only limited synchronization facilities
Model revolves around parallel region concept


CUDA


NVIDIA CUDA

Compute Unified Device Architecture

Hardware architecture and software environment
Convenient programming model for using NVIDIA GPUs as general-purpose compute devices
Implements the Single Instruction Multiple Threads (SIMT) approach

Programming model
Accelerator style: main program runs on the host CPU, kernels are offloaded to the GPU
Unified binary for host + device
Supports multiple GPUs
Data transfer to/from the device is explicit
Kernel execution may be asynchronous to CPU code
Latest devices (Fermi) allow multiple concurrent kernels

(Figure: host CPU connected to GPU #1 and GPU #2 via PCIe links.)

A simple CUDA example: Host code


// allocate memory on the host
h_A     = (float *)malloc(DATA_SZ);
h_C     = (float *)malloc(DATA_SZ);
h_C_GPU = (float *)malloc(RESULT_SZ);

// allocate memory on the CUDA device
cudaMalloc((void **)&d_A, DATA_SZ);
cudaMalloc((void **)&d_C, RESULT_SZ);

// copy data to GPU memory for further processing
cudaMemcpy(d_A, h_A, DATA_SZ,   cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, RESULT_SZ, cudaMemcpyHostToDevice);
cudaThreadSynchronize();

// kernel call: <<<BLOCKS, THREADS_PER_BLOCK>>>
do_work_on_gpu<<<128,128>>>(d_C, d_A, DATA_N);
cudaThreadSynchronize();

// copy the result back to the host
cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);
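The slide stops after the copy-back; as a hedged continuation of the same snippet (d_A, d_C and the host buffers are the ones defined above), one would typically check for errors and release the memory:

// check whether the kernel launch or its execution failed
cudaError_t err = cudaGetLastError();          // picks up launch errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();             // waits and reports execution errors
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

// release device and host memory when done
cudaFree(d_A);
cudaFree(d_C);
free(h_A); free(h_C); free(h_C_GPU);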


A simple CUDA example: CUDA kernel


__global__ void do_work_on_gpu(float *d_C, float *d_A, int elementN)
{
    // grid-stride loop: each thread processes every (blockDim.x*gridDim.x)-th element
    for (int pos = blockIdx.x * blockDim.x + threadIdx.x;
         pos < elementN;
         pos += blockDim.x * gridDim.x) {
        d_C[pos] = 5.0f * d_A[pos];
    }
    __syncthreads();   // block-wide barrier (not strictly needed here)
}


CUDA pros and cons

Pros
Relatively straightforward programming model
Low-level programming, explicit data management
Compatible with many NVIDIA GPUs: code usually runs without changes
Available for C, but wrappers for many other languages exist, including scripting languages

Directive-based compiler extensions available (e.g., PGI)


Potential for overlapping GPU computation with CPU tasks

Cons
Restricted to NVIDIA GPUs
No support for multicore processors
No support for AMD GPUs

Low-level programming, explicit data management


Powerful tools are just beginning to emerge
Largely manual work distribution
Not an open standard

OpenCL


OpenCL

Open Computing Language

Open standard
Convenient programming model for using any kind of accelerator: GPGPUs, multicore CPUs, ...

Programming model similar to CUDA, but more flexible
Pure kernel code is often portable from CUDA without major changes


A simple OpenCL example: Host code


// get a platform (NVIDIA Corp, Intel Corp or AMD Corp)
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// get the devices of that platform
std::vector<cl::Device> devices;
platforms.front().getDevices(DEVTOQUERY, &devices);

// build context and command queue
cl::Context context(devices);
cl::CommandQueue cmdQ(context, devices[0]);

// read the kernel source (fp, source_str, source_size declared elsewhere) and compile it just in time
cl::Program::Sources sourceCode;
source_str  = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
sourceCode.push_back(std::make_pair(source_str, source_size));
cl::Program program = cl::Program(context, sourceCode);
program.build(devices);
cl::Kernel kernel(program, "VectorCopy");

// allocate a buffer on the device
cl::Buffer D_A(context, CL_MEM_READ_WRITE, sizeof(REAL)*Vectorlength);

// copy data to the device
cmdQ.enqueueWriteBuffer(D_A, true, 0, sizeof(REAL)*Vectorlength, &H_A[0]);

// bind launch parameters to the kernel
cl::KernelFunctor kernel_func = kernel.bind(cmdQ,
    cl::NDRange(Globalsize), cl::NDRange(Workgroupsize));

// call the kernel
event = kernel_func(D_A, D_B, D_C, scalar, Vectorlength, i);
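The host code loads a kernel called VectorCopy from a file and compiles it at run time, but the kernel source itself is not shown on the slide. A hypothetical, simplified OpenCL C sketch of such a kernel might look as follows (the real kernel evidently takes more arguments, since the functor call above binds D_A, D_B, D_C, a scalar, the vector length and an index):

// hypothetical kernel source, stored in the file read by the host code above
__kernel void VectorCopy(__global const float *A,
                         __global float *C,
                         const int length)
{
    int i = get_global_id(0);     // one work-item per vector element
    if (i < length)               // guard: the global size may be padded
        C[i] = A[i];
}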

OpenCL pros and cons

Pros
Relatively straightforward programming model
Low-level programming, explicit data management
Available for NVIDIA and AMD GPUs, and multicore CPUs
Potential for overlapping GPU computation with CPU tasks
CUDA kernel code largely re-usable
Some support for modern SIMD instruction sets

Cons
Available for C(99)/C++
Just in time kernel compilation
Low-level programming, explicit data management
Powerful tools are just beginning to emerge
Largely manual work distribution, but more flexible than CUDA
Best performance on all architectures requires specialized code for each


MPI

The Message Passing Interface


The message passing paradigm: A programming model


Distributed memory architecture:
Each process(or) can only access its own dedicated address space; there is no global shared address space.
Data exchange and communication between processes is done by explicitly passing messages through a communication network.

(Figure: processes with local memories exchanging a message over the interconnect.)

Message passing library:
Should be flexible, efficient and portable
Hides the communication hardware and software layers from the application programmer


The message passing paradigm

Widely accepted standard in HPC / numerical simulation: the Message Passing Interface (MPI)

See http://www.mpi-forum.org for the standard documents
Many free and commercial implementations: Intel MPI, OpenMPI, MVAPICH, ...

Process-based approach: all variables are local!
Same program on each processor/machine (SPMD)
This is no restriction of the general message passing model, because processes can be distinguished by their rank (see later)

The program is written in a sequential language (Fortran/C/C++)
Data exchange between processes: send/receive messages via MPI library calls
This is usually the most tedious, but also the most flexible, way of parallelization


MPI in a nutshell: Parallel execution


Processes run throughout program execution: all variables are local
(Figure: timeline of processes with IDs 0 to 4.)

Startup phase:
launches the tasks
establishes the communication context (communicator) among all tasks

Point-to-point data transfer:
between pairs of tasks
may be blocking or non-blocking
explicit synchronization is needed for non-blocking transfers

Collective communication (see the sketch below):
between all tasks or a subgroup of tasks
presently blocking-only
reductions, scatter/gather operations
efficiency is up to the library implementation

Clean shutdown
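As a hedged sketch of a collective operation (not from the original slides; written against the MPI C API, callable from C or C++), every rank contributes one value and MPI_Reduce sums them on rank 0:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;                      // each process owns one value
    int total = 0;
    // sum all local values into "total" on rank 0 (blocking collective)
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();                            // clean shutdown
    return 0;
}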


MPI in a nutshell
Hello World!
program hello
use mpi
implicit none
integer rank, size, ierror
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
write(*,*) 'Hello World! I am ',rank,' of ',size
call MPI_FINALIZE(ierror)
end program

Output with 4 processes (the ordering of the lines is not deterministic):

Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4
Hello World! I am 2 of 4

MPI in a nutshell
Transmitting a message

MPI requires the following information (see the send/receive sketch below):

Which processor is sending the message.
Where is the data on the sending processor.
What kind of data is being sent.
How much data is there.
Which processor(s) are receiving the message.
Where should the data be left on the receiving processor.
How much data is the receiving processor prepared to accept.

Sender and receiver must pass their information to MPI separately.
This holds for point-to-point communication.
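A hedged point-to-point sketch (not from the original slides; MPI C API, buffer size and tag chosen arbitrarily) that spells out exactly the pieces of information listed above:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[100];
    const int tag = 42;

    if (rank == 0) {
        for (int i = 0; i < 100; ++i) buf[i] = i;    // where the data is
        // what kind of data, how much, and to whom it goes
        MPI_Send(buf, 100, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        // where to put it, how much we are prepared to accept, and from whom
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &st);
        std::printf("rank 1 received %g ... %g\n", buf[0], buf[99]);
    }

    MPI_Finalize();
    return 0;
}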


MPI pros and cons

Pros
Suitable for distributed-memory and shared-memory machines
Supports massive parallelism
Well supported, many free and commercial implementations
Tremendous code base, huge experience in the field
Standard supports Fortran and C; wrappers for other languages exist, including scripting languages

Hybrid MPI+X models are supported: X ∈ {OpenMP, CUDA, OpenCL, TBB, ...}

Cons
The execution environment is crucial, and non-trivial to set up
Huge standard (500+ functions) with many obscure bits and pieces
Incremental parallelization is next to impossible; most sequential code needs serious restructuring
Performance properties are sometimes hard to understand, and are also implementation-dependent
