
Parallel code for multicore systems


An overview of programming models

13.10.2011

Multicore Briefing - parallel programming models

Overview

There are just too many... and remember that the compiler will not help you.

Threading models for multicore processors (shared memory):
POSIX threads
Intel Threading Building Blocks
OpenMP

Threading models for GPGPUs (accelerators):
CUDA
OpenCL

Parallel programming for distributed memory:
MPI

Overall goal: Exploit the parallelism built into the hardware!


POSIX threads


Why threads for parallel programs?

Thread == Lightweight process


Independent instruction stream
In simulation we usually run one thread per (virtual or physical) core, but more is possible

New processes are expensive to generate (via fork())
Threads share all the data of a process, so they are cheap

Inter-process communication is slow and cumbersome
Shared memory between threads provides an easy way to communicate and synchronize (see the sketch below)

A threading model puts threads to use by making them accessible to the programmer
Either explicitly or wrapped in some parallel paradigm
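As a minimal sketch of communication and synchronization through shared memory (not from the original slides; the counter, the mutex and the thread count are made up for illustration; compile with the platform's thread flag, e.g. -pthread), four POSIX threads update one shared variable under a mutex:

#include <pthread.h>
#include <cstdio>

static long counter = 0;                              // shared by all threads
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&lock);                    // enter critical section
        ++counter;                                    // communicate via shared memory
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t tids[4];
    for (int t = 0; t < 4; ++t) pthread_create(&tids[t], NULL, work, NULL);
    for (int t = 0; t < 4; ++t) pthread_join(tids[t], NULL);
    std::printf("counter = %ld\n", counter);          // expect 400000
    return 0;
}

Without the mutex the increments would race and the final value would be unpredictable.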


POSIX threads example: Matrix-vector multiply with 100 threads
#include <pthread.h>

static double a[100][100], b[100], c[100];

static void *mult(void *cp);

int main(int argc, char *argv[]) {
    pthread_t tids[100];
    ...
    for (int i = 0; i < 100; i++)
        pthread_create(tids + i, NULL, mult, (void *)(c + i));
    for (int i = 0; i < 100; i++)
        pthread_join(tids[i], NULL);
    ...
}

/* Each thread computes one entry of c = A*b.
   There are no shared resources here! (not really:
   a and b are shared, but only read) */
static void *mult(void *cp) {
    int i = (double *)cp - c;           /* recover the row index from the pointer */
    double sum = 0;
    for (int j = 0; j < 100; j++)
        sum += a[i][j] * b[j];
    c[i] = sum;
    return NULL;
}
Adapted from material by J. Kleinder

POSIX threads pros and cons

Pros
Most basic threading interface
Straightforward, manageable API
Dynamic generation and destruction of threads
Reasonable synchronization primitives
Full execution control

Cons
Most basic threading interface
Higher-level functionality (reductions, synchronization, work distribution, task queueing) must be implemented by hand
Only a C API is available
Only available on (near-)POSIX-compliant OSs
Compiler has no clue about threads


Intel Threading Building Blocks (TBB)


Intel Threading Building Blocks (TBB)

Introduced by Intel in 2006

C++ threading library


Uses POSIX threads under the hood
Programmer works with tasks rather than threads
Task stealing model
Parallel C++ containers (see the sketch after this list)

Commercial and open source variants exist
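As a hedged illustration of the parallel containers mentioned above (not from the original slides; element values and the filter are arbitrary), tbb::concurrent_vector allows concurrent push_back from inside a parallel_for:

#include "tbb/tbb.h"
#include <cstdio>

int main() {
    tbb::concurrent_vector<int> hits;        // thread-safe, growable container
    // compact parallel_for(first, last, body) overload; the body receives one index
    tbb::parallel_for(0, 1000, [&](int i) {
        if (i % 7 == 0)
            hits.push_back(i);               // concurrent push_back is safe
    });
    std::printf("found %zu multiples of 7\n", hits.size());
    return 0;
}

The elements end up in no particular order, which is typical of such concurrent containers.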


A simple parallel loop in TBB: Apply Foo() to every element of an array
#include "tbb/tbb.h"
using namespace tbb;
class ApplyFoo {
float *const my_a;
public:
void operator()( const blocked_range<size_t>& r ) const {
float *a = my_a;
for( size_t i=r.begin(); i!=r.end(); ++i )
Foo(a[i]);
}
ApplyFoo( float a[] ) : my_a(a) {}
};
void ParallelApplyFoo( float a[], size_t n ) {
parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}
Adapted from the Intel TBB tutorial
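With a compiler that supports C++11 lambdas, the same loop can be written without a hand-coded function object. This is a hedged sketch, not from the original tutorial; Foo is the same user-supplied function as above:

#include "tbb/tbb.h"

void Foo(float &x);                          // user-supplied element operation

void ParallelApplyFoo(float a[], size_t n) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
        [=](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                Foo(a[i]);                   // apply Foo to each array element
        });
}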


TBB pros and cons

Pros
High-level programming model
Task concept is often more natural for real-world problems than thread
concept
Built-in parallel (thread-safe) containers
Built-in work distribution (configurable, but not too finely)
Available for Linux, Windows, MacOS

Cons
C++ only
Mapping of threads to resources (cores) not part of the model
The notion of a fixed number of threads is only vaguely implemented
Dynamic work sharing and task stealing introduce variability, which makes it difficult to optimize under ccNUMA constraints
Compiler has no clue about threads


OpenMP


Parallel Programming with OpenMP

Easy and portable parallel programming of shared-memory computers: OpenMP

Standardized set of compiler directives & library functions: http://www.openmp.org/
Fortran, C and C++ interfaces
Supported by most/all commercial compilers, and by GNU gcc starting with version 4.2
Few free tools are available

An OpenMP program can be written to compile and execute on a single-processor machine just by ignoring the directives


Shared Memory Model used by OpenMP


Central concept of OpenMP programming: threads

Threads access globally shared memory
Data can be shared or private (a small example follows below):
shared data is available to all threads (in principle)
private data is visible only to the thread that owns it

(Figure: a team of threads, each with its own private data, attached to one shared memory.)
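A minimal sketch of the shared/private distinction (not from the original slides; variable names and sizes are arbitrary): every thread sees the same array, but each gets its own copy of the temporary.

#include <cstdio>

int main() {
    double data[8] = {0.0};                 // shared: visible to all threads
    double tmp = 0.0;                       // made private below

    #pragma omp parallel for private(tmp) shared(data)
    for (int i = 0; i < 8; ++i) {
        tmp = 2.0 * i;                      // each thread writes its own copy of tmp
        data[i] = tmp;                      // distinct i per iteration: no conflict
    }

    for (int i = 0; i < 8; ++i)
        std::printf("%g ", data[i]);
    std::printf("\n");
    return 0;
}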


OpenMP Program Execution: Fork and Join

Program start: only the master thread runs
Parallel region: a team of worker threads is generated (fork); the threads synchronize when leaving the parallel region (join)
Only the master executes the sequential parts; the worker threads usually sleep
Task and data distribution via directives (see the sketch below)
Usually optimal: one thread per core

(Figure: fork/join execution timeline for threads 0 to 5.)
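A minimal fork/join sketch (not from the original slides), showing the master thread alone in the sequential parts and a team inside the parallel region:

#include <omp.h>
#include <cstdio>

int main() {
    std::printf("sequential part: master thread only\n");

    #pragma omp parallel                    // fork: a team of threads is created
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        std::printf("parallel region: thread %d of %d\n", id, n);
    }                                       // join: implicit barrier at the end

    std::printf("sequential part again: worker threads idle\n");
    return 0;
}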

Example: Numerical integration in OpenMP


Approximate the integral by a discrete sum:

$\int_0^1 f(t)\,dt \;\approx\; \frac{1}{n}\sum_{i=1}^{n} f(x_i), \qquad x_i = \frac{i-0.5}{n} \quad (i = 1,\dots,n)$

We want $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$; solve this in OpenMP.

// function to integrate
double f(double x) {
    return 4.0/(1.0+x*x);
}

// serial version
w = 1.0/n;
sum = 0.0;
for (i = 1; i <= n; ++i) {
    x = w*(i-0.5);
    sum += f(x);
}
pi = w*sum;
...                 // (printout omitted)


Example: Numerical integration in OpenMP


...
pi = 0.0;
w = 1.0/n;
#pragma omp parallel private(x,sum)   // concurrent execution by a team of threads
{
    sum = 0.0;
    #pragma omp for                   // worksharing among the threads
    for (i = 1; i <= n; ++i) {
        x = w*(i-0.5);
        sum += f(x);
    }
    #pragma omp critical              // executed by one thread at a time
    pi = pi + w*sum;
}
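The critical section serializes only the final accumulation; the same result can be obtained more compactly with a reduction clause. The following is a self-contained sketch (not from the original slides; the value of n is chosen arbitrarily), to be compiled with OpenMP enabled (e.g. -fopenmp):

#include <cstdio>

// function to integrate: the integral of 4/(1+x^2) over [0,1] equals pi
static double f(double x) { return 4.0/(1.0 + x*x); }

int main() {
    const int n = 1000000;
    const double w = 1.0/n;
    double sum = 0.0;

    // reduction(+:sum) gives every thread a private partial sum and
    // combines the partial sums after the loop; no critical section needed
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i) {
        double x = w*(i-0.5);         // x is declared in the loop body, hence private
        sum += f(x);
    }

    double pi = w*sum;
    std::printf("pi is approximately %.12f\n", pi);
    return 0;
}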


OpenMP pros and cons

Pros
High-level programming model
Available for Fortran, C, C++
Ideal for data parallelism, some support for task parallelism
Built-in work distribution
Directive concept is part of the language
Good support for incremental parallelization

Cons
Mapping of threads to resources (cores) not part of the model
OpenMP parallelization may interfere with compiler optimization
Parallel data structures are not part of the model
Only limited synchronization facilities
Model revolves around parallel region concept


CUDA


NVIDIA CUDA

Compute Unified Device Architecture

Hardware architecture and software environment
Convenient programming model for using NVIDIA GPUs as general-purpose compute devices
Implements the Single Instruction Multiple Threads (SIMT) approach

Programming model
Accelerator style: main program runs on the host CPU, kernels are offloaded to the GPU
Unified binary for host + device
Supports multiple GPUs
Data transfer to/from the device is explicit
Kernel execution may be asynchronous to CPU code
Latest devices (Fermi) allow multiple concurrent kernels

(Figure: host CPU connected to GPU #1 and GPU #2 via PCIe links.)

A simple CUDA example: Host code


// allocate memory on the host
h_A     = (float *)malloc(DATA_SZ);
h_C     = (float *)malloc(DATA_SZ);
h_C_GPU = (float *)malloc(RESULT_SZ);

// allocate memory on the CUDA device
cudaMalloc((void **)&d_A, DATA_SZ);
cudaMalloc((void **)&d_C, RESULT_SZ);

// copy data to GPU memory for further processing
cudaMemcpy(d_A, h_A, DATA_SZ,   cudaMemcpyHostToDevice);
cudaMemcpy(d_C, h_C, RESULT_SZ, cudaMemcpyHostToDevice);
cudaThreadSynchronize();

// kernel call: <<<BLOCKS, THREADS_PER_BLOCK>>>
do_work_on_gpu<<<128,128>>>(d_C, d_A, DATA_N);
cudaThreadSynchronize();

// copy the result back to the host
cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);
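The slide stops after the copy-back; as a hedged continuation of the same snippet (d_A, d_C and the host buffers are the ones defined above), one would typically check for errors and release the memory:

// check whether the kernel launch or its execution failed
cudaError_t err = cudaGetLastError();          // picks up launch errors
if (err == cudaSuccess)
    err = cudaThreadSynchronize();             // waits and reports execution errors
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

// release device and host memory when done
cudaFree(d_A);
cudaFree(d_C);
free(h_A); free(h_C); free(h_C_GPU);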


A simple CUDA example: CUDA kernel


__global__ void do_work_on_gpu(float *d_C, float *d_A, int elementN)
{
    // grid-stride loop: each thread processes every (blockDim.x*gridDim.x)-th element
    for (int pos = blockIdx.x * blockDim.x + threadIdx.x;
         pos < elementN;
         pos += blockDim.x * gridDim.x) {
        d_C[pos] = 5.0f * d_A[pos];
    }
    __syncthreads();   // block-wide barrier (not strictly needed here)
}


CUDA pros and cons

Pros
Relatively straightforward programming model
Low-level programming, explicit data management
Compatible with many NVIDIA GPUs: code usually runs without changes
Available for C, but wrappers for many other languages exist, including scripting languages

Directive-based compiler extensions available (e.g., PGI)


Potential for overlapping GPU computation with CPU tasks

Cons
Restricted to NVIDIA GPUs
No support for multicore processors
No support for AMD GPUs

Low-level programming, explicit data management


Powerful tools are just beginning to emerge
Largely manual work distribution
Not an open standard

OpenCL


OpenCL

Open Computing Language

Open standard
Convenient programming model for using any kind of accelerator: GPGPUs, multicore CPUs, ...

Programming model similar to CUDA, but more flexible
Pure kernel code is often portable from CUDA without major changes


A simple OpenCL example: Host code


// get a platform (NVIDIA Corp, Intel Corp or AMD Corp)
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// get the devices of that platform
std::vector<cl::Device> devices;
platforms.front().getDevices(DEVTOQUERY, &devices);

// build context and command queue
cl::Context context(devices);
cl::CommandQueue cmdQ(context, devices[0]);

// read the kernel source (fp, source_str, source_size declared elsewhere) and compile it just in time
cl::Program::Sources sourceCode;
source_str  = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
sourceCode.push_back(std::make_pair(source_str, source_size));
cl::Program program = cl::Program(context, sourceCode);
program.build(devices);
cl::Kernel kernel(program, "VectorCopy");

// allocate a buffer on the device
cl::Buffer D_A(context, CL_MEM_READ_WRITE, sizeof(REAL)*Vectorlength);

// copy data to the device
cmdQ.enqueueWriteBuffer(D_A, true, 0, sizeof(REAL)*Vectorlength, &H_A[0]);

// bind launch parameters to the kernel
cl::KernelFunctor kernel_func = kernel.bind(cmdQ,
    cl::NDRange(Globalsize), cl::NDRange(Workgroupsize));

// call the kernel
event = kernel_func(D_A, D_B, D_C, scalar, Vectorlength, i);
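The host code loads a kernel called VectorCopy from a file and compiles it at run time, but the kernel source itself is not shown on the slide. A hypothetical, simplified OpenCL C sketch of such a kernel might look as follows (the real kernel evidently takes more arguments, since the functor call above binds D_A, D_B, D_C, a scalar, the vector length and an index):

// hypothetical kernel source, stored in the file read by the host code above
__kernel void VectorCopy(__global const float *A,
                         __global float *C,
                         const int length)
{
    int i = get_global_id(0);     // one work-item per vector element
    if (i < length)               // guard: the global size may be padded
        C[i] = A[i];
}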

OpenCL pros and cons

Pros
Relatively straightforward programming model
Low-level programming, explicit data management
Available for NVIDIA and AMD GPUs, and multicore CPUs
Potential for overlapping GPU computation with CPU tasks
CUDA kernel code largely re-usable
Some support for modern SIMD instruction sets

Cons
Available for C(99)/C++
Just in time kernel compilation
Low-level programming, explicit data management
Powerful tools are just beginning to emerge
Largely manual work distribution, but more flexible than CUDA
Best performance on all architectures requires specialized code for each


MPI

The Message Passing Interface


The message passing paradigm: A programming model


Distributed memory architecture:
Each process(or) can only access its own dedicated address space; there is no global shared address space.
Data exchange and communication between processes is done by explicitly passing messages through a communication network.

(Figure: processes with local memories exchanging a message over the interconnect.)

Message passing library:
Should be flexible, efficient and portable
Hides the communication hardware and software layers from the application programmer


The message passing paradigm

Widely accepted standard in HPC / numerical simulation: the Message Passing Interface (MPI)

See http://www.mpi-forum.org for the standard documents
Many free and commercial implementations: Intel MPI, OpenMPI, MVAPICH, ...

Process-based approach: all variables are local!
Same program on each processor/machine (SPMD)
This is no restriction of the general message passing model, because processes can be distinguished by their rank (see later)

The program is written in a sequential language (Fortran/C/C++)
Data exchange between processes: send/receive messages via MPI library calls
This is usually the most tedious, but also the most flexible, way of parallelization


MPI in a nutshell: Parallel execution


Processes run throughout program execution: all variables are local
(Figure: timeline of processes with IDs 0 to 4.)

Startup phase:
launches the tasks
establishes the communication context (communicator) among all tasks

Point-to-point data transfer:
between pairs of tasks
may be blocking or non-blocking
explicit synchronization is needed for non-blocking transfers

Collective communication (see the sketch below):
between all tasks or a subgroup of tasks
presently blocking-only
reductions, scatter/gather operations
efficiency is up to the library implementation

Clean shutdown
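As a hedged sketch of a collective operation (not from the original slides; written against the MPI C API, callable from C or C++), every rank contributes one value and MPI_Reduce sums them on rank 0:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;                      // each process owns one value
    int total = 0;
    // sum all local values into "total" on rank 0 (blocking collective)
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();                            // clean shutdown
    return 0;
}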


MPI in a nutshell
Hello World!
program hello
use mpi
implicit none
integer rank, size, ierror
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
write(*,*) 'Hello World! I am ',rank,' of ',size
call MPI_FINALIZE(ierror)
end program

Output with 4 processes (the ordering of the lines is not deterministic):

Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4
Hello World! I am 2 of 4

MPI in a nutshell
Transmitting a message

MPI requires the following information (see the send/receive sketch below):

Which processor is sending the message.
Where is the data on the sending processor.
What kind of data is being sent.
How much data is there.
Which processor(s) are receiving the message.
Where should the data be left on the receiving processor.
How much data is the receiving processor prepared to accept.

Sender and receiver must pass their information to MPI separately.
This holds for point-to-point communication.
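A hedged point-to-point sketch (not from the original slides; MPI C API, buffer size and tag chosen arbitrarily) that spells out exactly the pieces of information listed above:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[100];
    const int tag = 42;

    if (rank == 0) {
        for (int i = 0; i < 100; ++i) buf[i] = i;    // where the data is
        // what kind of data, how much, and to whom it goes
        MPI_Send(buf, 100, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        // where to put it, how much we are prepared to accept, and from whom
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &st);
        std::printf("rank 1 received %g ... %g\n", buf[0], buf[99]);
    }

    MPI_Finalize();
    return 0;
}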


MPI pros and cons

Pros
Suitable for distributed-memory and shared-memory machines
Supports massive parallelism
Well supported, many free and commercial implementations
Tremendous code base, huge experience in the field
Standard supports Fortran and C; wrappers for other languages exist, including scripting languages

Hybrid MPI+X models are supported: X ∈ {OpenMP, CUDA, OpenCL, TBB, ...}

Cons
The execution environment is crucial, and non-trivial to set up
Huge standard (500+ functions) with many obscure bits and pieces
Incremental parallelization is next to impossible; most sequential code needs serious restructuring
Performance properties are sometimes hard to understand, and are also implementation-dependent
