
Advanced Topics of OpenMP Programming

Christian Terboven
terboven@rz.rwth-aachen.de
Center for Computing and Communication
RWTH Aachen University

PPCES 2010
March 24th, RWTH Aachen University


24.03.2010 C. Terboven

Agenda

o Repetition + Additional OpenMP 2.5 Topics
o Tools for OpenMP programming
o OpenMP 3.0 and Tasks
o Example and Case Study
    Fibonacci w/ Tasks
    Parallel Sparse MatVecMult
o OpenMP and the Hardware Architecture
o Summary of second part


Repetition | Tools for OpenMP | OpenMP 3.0 & Tasks | Example + Case Study | OpenMP & Architecture | Summary

Summary of first part

o OpenMP is a parallel programming model for shared-memory
  machines. That is, all threads have access to a shared main
  memory. In addition, each thread may have private data.
o The parallelism has to be expressed explicitly by the
  programmer. The base construct is a Parallel Region:
  a Team of threads is provided by the runtime system.
o Using the available Worksharing constructs, the work can be
  distributed among the threads of a team; the scheduling can
  be influenced as well.
o To control the parallelization, thread exclusion and
  synchronization constructs can be used.

The ordered construct

o Allows a structured block within a parallel loop to be executed
  in sequential order
C/C++
#pragma omp ordered
... structured block ...

o In addition, an ordered clause has to be added to the loop
  directive (for or parallel for) in which this construct occurs
o Can be used e.g. to enforce ordering on printing of data
o May help to determine whether there is a data race
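
A minimal sketch of the ordered construct, assuming a hypothetical function squaresInOrder that is not part of the original slides. The squares may be computed concurrently, while the ordered block appends them in sequential iteration order.

```cpp
#include <vector>

// Hypothetical example: computes i*i for i in [0, n) and stores the results in loop order.
std::vector<int> squaresInOrder(int n) {
    std::vector<int> out;
    // The ordered clause on the loop directive permits an ordered block inside the loop.
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        int sq = i * i;        // may be computed by different threads concurrently
        #pragma omp ordered
        out.push_back(sq);     // executed one iteration at a time, in sequential order
    }
    return out;
}
```

Without the ordered block the concurrent push_back calls would be a data race, so this pattern is mainly useful when the ordered part is small compared to the rest of the loop body.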


OpenMP Environment Variables

o OMP_NUM_THREADS: Controls how many threads will be used
  to execute the program.
o OMP_SCHEDULE: If the schedule type runtime is specified in
  a schedule clause, the value specified in this environment
  variable will be used.
o OMP_DYNAMIC: The OpenMP runtime is allowed to smartly
  guess how many threads might deliver the best performance.
  If you want full control, set this variable to false.
o OMP_NESTED: Most OpenMP implementations require this to
  be set to true in order to enable nested Parallel Regions.
  Remember: Nesting Worksharing constructs directly inside
  each other is not possible.

OpenMP API: Locks

o OpenMP provides a set of low-level locking routines,
  similar to semaphores:
  void omp_func_lock (omp_lock_t *lck), with func:
    init / init_nest: Initialize the lock variable
    destroy / destroy_nest: Remove the lock variable association
    set / set_nest: Set the lock, wait until lock acquired
    test / test_nest: Try to set the lock, but return immediately
    if the lock could not be acquired (these routines return int)
    unset / unset_nest: Unset the lock

  Argument is the address of an instance of the omp_lock_t type
  (omp_nest_lock_t for the nested variants, which are named
  omp_func_nest_lock)
  Simple lock: May not be locked if already in a locked state
  Nested lock: May be locked multiple times by the same thread

Memory Model of OpenMP

o OpenMP: Shared-Memory model
    All threads share a common address space (shared memory)
    Threads can have private data (explicit user control)
    Fork-Join execution model
o Weak memory model
    Temporary View: Memory consistency is guaranteed only after
    certain points, namely implicit and explicit flushes
o Any OpenMP barrier includes a flush
o Entry to and exit from critical regions include a flush
o Entry to and exit from lock routines (OpenMP API) include a
  flush


The flush directive

C/C++
#pragma omp flush [(list)]

o Enforces shared data to be consistent (but be cautious!)
    If a thread has updated some variables, their values will be
    flushed to memory, thus accessible to other threads
    If a thread has not updated a value, the construct will ensure
    that any local copy will get the latest value from memory
o BUT: Do not use this for thread synchronization
    Compiler optimization might get in your way
    Rather use OpenMP lock functions for thread synchronization

Book recommendation

Using OpenMP: Portable Shared Memory Parallel Programming
Barbara Chapman, Gabriele Jost, Ruud van der Pas
o ISBN-10: 0262533022
o ISBN-13: 978-0262533027
o MIT Press, Cambridge, MA, USA

Race Condition

o Data Race: The typical multi-threaded programming error
  occurs when:
    Two or more threads of a single process access the same
    memory location concurrently in between two synchronization
    points, and
    At least one of these accesses modifies the location, and
    The accesses are not protected by locks or critical regions.
o In many cases private clauses or barriers are missing
o Non-deterministic occurrence: The sequence of the
  execution of parallel loop iterations is non-deterministic
  and may change from run to run, for example
o Hard to find using a traditional debugger, instead use
    - Sun Thread Analyzer
    - Intel Thread Checker

Program with a Data Race

There are two OpenMP errors in this code, let's find them ...

#pragma omp parallel
{
    [...]
    /* compute stencil, residual and update */
    #pragma omp for
    for (j=1; j<m-1; j++)
        for (i=1; i<n-1; i++){
            resid =(ax * (UOLD(j,i-1) + UOLD(j,i+1))
                  + ay * (UOLD(j-1,i) + UOLD(j+1,i))
                  + b * UOLD(j,i) - F(j,i) ) / b;
            U(j,i) = UOLD(j,i) - omega * resid;
            error = error + resid*resid;
        }
}
/* end of parallel region */
printf("error: %f\n", error);


Usage: Sun Thread Analyzer

o Compile with the Sun Compilers on Linux or Solaris:
    Add compiler / linker switch: -xinstrument=datarace
o Execute the program (with multiple threads) with collect
    export OMP_NUM_THREADS=2
    collect -r on program arguments
o The verification is done only for the given dataset
    The tool traces all memory accesses: Runtime and memory
    requirements might explode!
    Thus, use the smallest (and still meaningful) dataset
o View result: tha test.1.er



Analysis Result: Sun Thread Analyzer (1/2)

o Three Data Races are found
o Switch to Dual Source to examine the races



Analysis Result: Sun Thread Analyzer (2/2)

o i
o resid
o error

Tool gives you the line of the Data Race; the user has
to find the variable!

Is this program now correct? (1/2)

#pragma omp parallel
{
    [...]
    /* compute stencil, residual and update */
    #pragma omp for private(i, resid, error)
    for (j=1; j<m-1; j++)
        for (i=1; i<n-1; i++){
            resid =(ax * (UOLD(j,i-1) + UOLD(j,i+1))
                  + ay * (UOLD(j-1,i) + UOLD(j+1,i))
                  + b * UOLD(j,i) - F(j,i) ) / b;
            U(j,i) = UOLD(j,i) - omega * resid;
            error = error + resid*resid;
        }
}
/* end of parallel region */


Is this program now correct? (2/2)

o Thread Analyzer tells you so
o It contains no Data Races any more, but with a private error
  each thread only accumulates into its own uninitialized copy
  and the sum is lost: error has to be reduced!

#pragma omp for private(i, resid) \
        reduction(+:error)
for (j=1; j<m-1; j++)
    for (i=1; i<n-1; i++){
        [...]
        error = error + resid*resid;
    }
}
/* end of parallel region */
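
A reduced, runnable sketch of the reduction fix, assuming a hypothetical function sumSquared that stands in for the stencil loop (the stencil itself is omitted). Each thread accumulates into a private copy of error that is initialized to zero, and the copies are combined at the end of the loop.

```cpp
#include <vector>

// Hypothetical stand-in for the stencil loop: sums the squared residuals.
double sumSquared(const std::vector<double>& residuals) {
    double error = 0.0;
    const int n = static_cast<int>(residuals.size());
    #pragma omp parallel for reduction(+:error)
    for (int j = 0; j < n; ++j) {
        double resid = residuals[j];   // declared inside the loop, hence private
        error += resid * resid;        // accumulates into the thread's private copy
    }
    return error;                      // private copies have been added up here
}
```

Note that the summation order differs from the serial loop, so floating-point results may differ in the last digits for general input.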



Our advice

Never put an OpenMP code into production ...
... without using Thread Analyzer or Thread Checker !!!

How to parallelize a While-loop?

o How would you parallelize this code?

typedef list<double> dList;
dList myList;
/* fill myList with tons of items */
dList::iterator it = myList.begin();
while (it != myList.end())
{
    *it = processListItem(*it);
    it++;
}

o One possibility: Create a fixed-sized array containing all list
  items and a parallel loop running over this array
    Concept: Inspector / Executor
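
The Inspector / Executor idea could be sketched as follows. The body of processListItem is a placeholder assumption (the slides do not show it), and processListInspectorExecutor is a hypothetical name.

```cpp
#include <list>
#include <vector>

// Placeholder for the per-item work; the real processListItem() is not shown in the slides.
double processListItem(double x) { return 2.0 * x; }

void processListInspectorExecutor(std::list<double>& myList) {
    // Inspector: walk the list serially and record a pointer to every item.
    std::vector<double*> items;
    items.reserve(myList.size());
    for (double& item : myList) items.push_back(&item);

    // Executor: a counted loop over the array can be distributed with a Worksharing construct.
    const int n = static_cast<int>(items.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        *items[i] = processListItem(*items[i]);
}
```

The serial inspector pass costs one extra traversal, which pays off only if the work per item is large enough.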


How to parallelize a While-loop!

o Or: Use Tasking in OpenMP 3.0

#pragma omp parallel
{
    #pragma omp single
    {
        dList::iterator it = myList.begin();
        while (it != myList.end())
        {
            /* it is firstprivate in the Task: each Task gets its own copy */
            #pragma omp task
            {
                *it = processListItem(*it);
            }
            it++;
        }
    }
}

o All while-loop iterations are independent of each other!



Biggest change in OpenMP 3.0: Task Parallelism

o Tasks make it possible to parallelize irregular problems, e.g.
    unbounded loops
    recursive algorithms
    Producer / Consumer patterns
    and more
o Task: A work unit whose execution may be deferred
    Can also be executed immediately
o Tasks are composed of
    Code to execute
    Data environment
    Internal control variables (ICV)

Tasks in OpenMP: Overview

o Tasks are executed by the threads of the Team
o Data environment of a Task is constructed at creation time
o A Task can be tied to a thread (only that thread may execute
  it) or untied
o Tasks are either implicit or explicit
o Implicit tasks: The thread encountering a Parallel construct
    Creates as many implicit Tasks as there are threads in the Team
    Each thread executes one implicit Task
    Implicit Tasks are tied
    Different description than in 2.5, but equivalent semantics!

The task directive

C/C++
#pragma omp task [clause [[,] clause] ... ]
... structured block ...

o Each encountering thread creates a new Task
    Code and data are packaged up
    Tasks can be nested
        Into another Task directive
        Into a Worksharing construct
o Data scoping clauses:
    shared(list)
    private(list)
    firstprivate(list)
    default(shared | none)
o Schedule clauses:
    untied
o Other clauses:
    if(expr)

Task synchronization (1/2)

o At OpenMP barrier (implicit or explicit)
    All tasks created by any thread of the current Team are
    guaranteed to be completed at barrier exit
o Task barrier: taskwait
    Encountering Task suspends until child tasks are complete
    Only direct children, not descendants!
C/C++
#pragma omp taskwait

Task synchronization (2/2)

o Simple example of Task synchronization in OpenMP 3.0:

#pragma omp parallel num_threads(np)
{
    /* np Tasks created here, one for each thread */
    #pragma omp task
    function_A();

    #pragma omp barrier
    /* All Tasks guaranteed to be completed here */

    #pragma omp single
    {
        /* 1 Task created here */
        #pragma omp task
        function_B();
    }
    /* B-Task guaranteed to be completed here
       (implicit barrier at the end of single) */
}

Tasks in OpenMP: Data Scoping

o Some rules from Parallel Regions apply:
    Static and Global variables are shared
    Automatic Storage (local) variables are private
o If no default clause is given:
    Orphaned Task variables are firstprivate by default!
    Non-Orphaned Task variables inherit the shared attribute!
    Variables are firstprivate unless shared in the enclosing
    context
o So far no verification tool is available to check Tasking
  programs for correctness!

Tasks in OpenMP: Scheduling

o Default: Tasks are tied to the thread that first executes them,
  not necessarily the creator. Scheduling constraints:
    Only the Thread a Task is tied to can execute it
    A Task can only be suspended at a suspend point
        Task creation, Task finish, taskwait, barrier
    If Task is not suspended in a barrier, executing Thread can only
    switch to a direct descendant of all Tasks tied to the Thread
o Tasks created with the untied clause are never tied
    No scheduling restrictions, e.g. can be suspended at any point
    But: More freedom to the implementation, e.g. load balancing


Tasks in OpenMP: if clause

o If the expression of an if clause on a Task evaluates to false
    The encountering Task is suspended
    The new Task is executed immediately
    The parent Task resumes when the new Task finishes
    Used for optimization, e.g. avoid creation of small tasks
o If the expression of an if clause on a Parallel Region
  evaluates to false
    The Parallel Region is executed with a Team of one Thread only
    Used for optimization, e.g. avoid going parallel
o In both cases the OpenMP data scoping rules still apply!



Task pitfalls (1/3)

o It is the user's responsibility to ensure data is alive:

// within Parallel Region
void foo() {
    int a[LARGE_N];
    #pragma omp task
    {
        bar1(a);
    }
    #pragma omp task
    {
        bar2(a);
    }
}

Variable a has to be shared in order to
prevent a copy-in to the task (default firstprivate).

Once a is shared: The parent Task may have exited
foo() by the time bar1() or bar2() accesses a: a is a
variable of automatic storage duration
and thus is disposed when foo() is exited.

Task pitfalls (2/3)

o It is the user's responsibility to ensure data is alive:

// within Parallel Region
void foo() {
    int a[LARGE_N];
    #pragma omp task shared(a)
    {
        bar1(a);
    }
    #pragma omp task shared(a)
    {
        bar2(a);
    }
    /* Wait for all Tasks that have been created on this level. */
    #pragma omp taskwait
}
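
A small runnable sketch of the shared-plus-taskwait pattern. The names fillAndSum and runFillAndSum are hypothetical, and the two tasks simply fill the halves of a local array instead of calling bar1() and bar2().

```cpp
// Hypothetical stand-in for foo(): two child tasks write to a shared local array.
int fillAndSum() {
    int a[8] = {0};
    #pragma omp task shared(a)
    { for (int i = 0; i < 4; ++i) a[i] = 1; }
    #pragma omp task shared(a)
    { for (int i = 4; i < 8; ++i) a[i] = 2; }
    // a lives on this function's stack, so wait here until both child tasks are done.
    #pragma omp taskwait
    int sum = 0;
    for (int i = 0; i < 8; ++i) sum += a[i];
    return sum;
}

// Calls fillAndSum() from one thread of a Parallel Region; the other threads may pick up tasks.
int runFillAndSum() {
    int result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fillAndSum();
    }
    return result;
}
```

Removing the taskwait would let fillAndSum() read a, or even return, while the child tasks are still writing to it.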


Task pitfalls (3/3)

o Examine your code thoroughly before using untied Tasks:

int dummy;
#pragma omp threadprivate(dummy)
void foo() {dummy = ...; }
void bar() {... = dummy; }

#pragma omp task untied
{
    foo();
    bar();
    /* Task could switch to a different Thread
       between foo() and bar(). */
}

Other news in OpenMP 3.0 (1/5)

o Static schedule guarantees

#pragma omp for schedule(static) nowait
for(i = 1; i < N; i++)
    a[i] = ...
#pragma omp for schedule(static)
for (i = 1; i < N; i++)
    c[i] = a[i] + ...

Allowed in OpenMP 3.0 if and only if:
- Number of iterations is the same
- Chunk is the same (or not specified)
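
A runnable sketch of this guarantee, assuming a hypothetical function scaleThenShift with made-up loop bodies. Both loops have the same iteration count and the same static schedule without a chunk size, so each thread reads only the elements of a that it wrote itself, and the barrier after the first loop can be dropped.

```cpp
#include <vector>

// Hypothetical example: c[i] = 2*i + 1, computed in two back-to-back worksharing loops.
std::vector<double> scaleThenShift(int n) {
    std::vector<double> a(n), c(n);
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * i;
        // No barrier here: the same thread gets the same iterations in the next loop.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + 1.0;
    }
    return c;
}
```

If the second loop read a[i-1] or used a different schedule, the nowait would no longer be safe.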


Other news in OpenMP 3.0 (2/5)

o Loop collapsing

#pragma omp for collapse(2)
for(i = 1; i < N; i++)
    for(j = 1; j < M; j++)
        for(k = 1; k < K; k++)
            foo(i, j, k);

Iteration space from i-loop and j-loop is
collapsed into a single one, if loops are
perfectly nested and form a rectangular
iteration space.
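
A runnable sketch of collapse(2), assuming a hypothetical function multiplicationTable. The two loops are perfectly nested and rectangular, so the rows * cols iterations are distributed as one iteration space, which helps when rows alone is small compared to the number of threads.

```cpp
#include <vector>

// Hypothetical example: fills a rows x cols table with (i+1)*(j+1), stored row by row.
std::vector<int> multiplicationTable(int rows, int cols) {
    std::vector<int> table(rows * cols);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            table[i * cols + j] = (i + 1) * (j + 1);
    return table;
}
```

Declaring the loop variables inside the for statements keeps them private; a variable of an inner, non-collapsed loop declared outside would have to be listed in a private clause.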


Other news in OpenMP 3.0 (3/5)

o New variable types allowed in for-Worksharing

#pragma omp for
for (unsigned int i = 0; i < N; i++)
    foo(i);

Legal in OpenMP 3.0:
- Unsigned integer types
- Pointer types
- Random access iterators (C++)

vector v;
vector::iterator it;
#pragma omp for
for (it = v.begin(); it < v.end(); it++)
    foo(it);


Other news in OpenMP 3.0 (4/5)

o Improvements in the API for Nested Parallelism:
    How many nested Parallel Regions?
        int omp_get_level()
    How many nested Parallel Regions are active?
        int omp_get_active_level()
    Which thread-id was my ancestor, in given level?
        int omp_get_ancestor_thread_num(int level)
    How many Threads were in my ancestor's team, at given level?
        int omp_get_team_size(int level)
o This is now well-defined in OpenMP 3.0:

omp_set_num_threads(3);
#pragma omp parallel
{
    omp_set_num_threads(omp_get_thread_num() + 2);
    #pragma omp parallel
    {
        foo();
    }
}

Other news in OpenMP 3.0 (5/5)

o Improved definition of environment interaction
    Env. Var. OMP_MAX_ACTIVE_LEVELS + API functions
        Controls the maximum number of nested active parallel regions
    Env. Var. OMP_THREAD_LIMIT + API functions
        Controls the maximum number of OpenMP threads
    Env. Var. OMP_STACKSIZE
        Controls the stack size of child threads
    Env. Var. OMP_WAIT_POLICY
        Controls the thread idle policy:
            active: Good for dedicated systems
            passive: Good for shared systems (e.g. in batch mode)


Recursive approach to compute Fibonacci

int main(int argc, char* argv[])
{
    [...]
    fib(input);
    [...]
}

int fib(int n)
{
    if (n < 2) return n;
    int x = fib(n - 1);
    int y = fib(n - 2);
    return x+y;
}

o On the following slides we will discuss three approaches to
  parallelize this recursive code with Tasking.

First version parallelized with Tasking (omp-v1)

int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}

int fib(int n)
{
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}

o Only one Task / Thread enters fib() from main(); it is
  responsible for creating the two initial work tasks
o Taskwait is required, as otherwise x and y could be read
  before the child tasks have written them

Scalability measurements (1/3)


[Figure: Speedup of Fibonacci with Tasks. Speedup vs. #Threads; series: optimal, omp-v1]

o Overhead of task creation prevents better scalability!



Improved parallelization with Tasking (omp-v2)

int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}

int fib(int n)
{
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x) \
            if(n > 30)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y) \
            if(n > 30)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}

o Improvement: Don't create yet another task once a certain
  (small enough) n is reached

Scalability measurements (2/3)


[Figure: Speedup of Fibonacci with Tasks. Speedup vs. #Threads; series: optimal, omp-v1, omp-v2]

o Speedup is ok, but we still have some overhead
  when running with 4 or 8 threads

Improved parallelization with Tasking (omp-v3)

int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}

int fib(int n)
{
    if (n < 2) return n;
    if (n <= 30)
        return serfib(n);
    int x, y;
    #pragma omp task shared(x)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}

o Improvement: Skip the OpenMP overhead once a certain n
  is reached (no issue w/ production compilers)
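
A self-contained sketch of the cutoff variant. The body of serfib is an assumption (the slides only name it), parallelFib is a hypothetical wrapper, and the default cutoff of 20 is chosen here only to keep the example quick; the slides use 30.

```cpp
// Serial fallback below the cutoff; assumed to be the plain recursive version.
int serfib(int n) { return n < 2 ? n : serfib(n - 1) + serfib(n - 2); }

// Task-parallel Fibonacci; below the cutoff no OpenMP construct is encountered at all.
int fib(int n, int cutoff = 20) {
    if (n < 2) return n;
    if (n <= cutoff) return serfib(n);
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1, cutoff);
    #pragma omp task shared(y)
    y = fib(n - 2, cutoff);
    #pragma omp taskwait      // x and y are only valid after both child tasks finished
    return x + y;
}

// One thread starts the recursion; the rest of the Team executes the generated tasks.
int parallelFib(int n) {
    int result = 0;
    #pragma omp parallel
    {
        #pragma omp single
        result = fib(n);
    }
    return result;
}
```

The best cutoff depends on the machine and the runtime, so it is worth measuring a few values.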

Scalability measurements (3/3)


[Figure: Speedup of Fibonacci with Tasks. Speedup vs. #Threads; series: optimal, omp-v1, omp-v2, omp-v3]

o Everything ok now

Repetition: Performance Aspects

o Performance Measurements
    Runtime (real wall time, user cpu time, system time)
    FLOPS: number of floating point operations (per sec)
    Speedup: performance gain relative to one core/thread
    Efficiency: Speedup relative to the theoretical maximum
o Performance Impacts
    Load Imbalance
    Data Locality on cc-NUMA architectures
    Memory Bandwidth (consumption per thread)
    Cache Effects
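
Speedup and efficiency as listed above can be written down as follows, assuming T(p) denotes the wall-clock runtime with p threads and that the theoretical maximum speedup with p threads is p. The numbers in the second line are hypothetical and only illustrate the arithmetic.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

% Hypothetical example: T(1) = 80\,\mathrm{s},\; T(8) = 16\,\mathrm{s}
S(8) = \frac{80}{16} = 5, \qquad E(8) = \frac{5}{8} = 0.625
```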


Case Study: Sparse Matrix-Vector-Multiplication


Beijing Botanical Garden
    left: original building
    bottom-right: lattice model
    bottom-left: matrix shape

(image copyright: Beijing Botanical Garden and University of
Florida, Sparse Matrix Collection)


Experiment Setup and Hardware

o Intel Xeon 5450 (2x4 cores), 3.0 GHz; 12 MB Cache
    L2 cache shared between 2 cores
    flat memory architecture (FSB)
o AMD Opteron 875 (4x2 cores), 2.2 GHz; 8 MB Cache
    L2 cache not shared
    ccNUMA architecture (HT-links)
o Matrices in CRS format
    large means 75 MB >> caches
o SMXV is an important kernel in numerical codes (GMRES, ...)
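
A minimal sketch of a sparse matrix-vector multiplication in CRS format with a static row distribution. The struct CrsMatrix, its field names, and the function smxv are assumptions for illustration; the code measured in the case study is not shown in the slides.

```cpp
#include <vector>

// Minimal CRS (compressed row storage) matrix; field names are illustrative.
struct CrsMatrix {
    int rows;
    std::vector<int> rowPtr;     // rows + 1 entries: start of each row in colIdx / val
    std::vector<int> colIdx;     // column index of each nonzero
    std::vector<double> val;     // value of each nonzero
};

// y = A * x; rows are split into equally sized blocks, one block per thread.
std::vector<double> smxv(const CrsMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < A.rows; ++r) {
        double sum = 0.0;
        for (int k = A.rowPtr[r]; k < A.rowPtr[r + 1]; ++k)
            sum += A.val[k] * x[A.colIdx[k]];
        y[r] = sum;
    }
    return y;
}
```

With schedule(static) every thread gets the same number of rows, not the same number of nonzeros, which is one possible source of load imbalance for irregular matrices.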



Performance with Static Row Distribution (1)

o compact: threads packed tightly; scatter: threads distributed

[Figure: static row distribution. mflops vs. threads; series: AMD Opteron; Intel Xeon, compact; Intel Xeon, scatter. Annotations: ca. 1200 mflops; ca. 850 mflops; up to 25% better]

Performance Analysis of (1)

o Load Balancing: Static data distribution is not optimal!
o Data Locality on cc-NUMA architectures: Static data distribution
  is not optimal!
o Memory Bandwidth: Compact thread placement is not optimal
  on Intel Xeon processors!

Performance with Dynamic Row Distribution (2)

o compact: threads packed tightly; scatter: threads distributed

[Figure: dynamic row distribution. mflops vs. threads; series: Intel Xeon, scatter; Intel Xeon, compact; AMD Opteron. Annotations: ca. 985 mflops; ca. 660 mflops]

Performance Analysis of (2)

o Why does the Xeon deliver better performance with the
  Dynamic row distribution, but the Opteron gets worse?
    Data Locality. The Opteron is a cc-NUMA architecture; the
    threads are distributed along the cores over all sockets, but
    the data is not!
    Solution: Parallel initialization of the matrix to employ the
    first-touch mechanism of Operating Systems: Data is placed
    near to where it is first accessed from.
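
A sketch of parallel initialization for first-touch placement, assuming hypothetical functions allocateFirstTouch and sumStatic. The array is allocated without being written, then initialized with the same static schedule that the compute loop uses later, so on a first-touch system each thread's part ends up in memory near that thread.

```cpp
#include <memory>

// Allocates n doubles and lets each thread touch (initialize) its own part first.
std::unique_ptr<double[]> allocateFirstTouch(int n) {
    // new double[n] leaves the elements uninitialized, so no page is touched serially here.
    std::unique_ptr<double[]> data(new double[n]);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        data[i] = 1.0;
    return data;
}

// Compute loop with the same static distribution as the initialization.
double sumStatic(const double* data, int n) {
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}
```

A std::vector<double> would be value-initialized serially in its constructor, which touches all pages from one thread; that is why a plain array is used in this sketch.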

Performance with Pre-Calculated Row Distribution (3)


[Figure: static row distribution, sorted matrix. mflops vs. threads; series: AMD Opteron, Intel Xeon. Annotations: ca. 2000 mflops; ca. 1000 mflops]

Final Performance Analysis

o By exploiting data locality the AMD Opteron is able to reach
  about 2000 mflops!
o The Intel Xeon performance stagnates at about 1000 mflops,
  since the memory bandwidth of the frontside bus is
  already exhausted when using four threads.
o If the matrix were smaller and fit into the cache, the
  result would look different

Comparing Processors and Boxes

Metric \ Server      SF V40z              FSC RX200 S4
Processor Chip       AMD Opteron 875      Intel Xeon 5450
                     2.2 GHz              3.00 GHz
# sockets            4                    2
# cores              8 (dual-core)        8 (quad-core)
# threads
Accumulated L2 $     8 MB                 16 MB
L2 $ Strategy        Separate per core    Shared by 2 cores
Technology           90 nm                45 nm
Peak Performance     35.2 GFLOPS          96 GFLOPS
Dimension            3 units              1 unit

Note: Here we compare machines of different ages, which can be seen as unfair!
For example, newer Opteron-based machines provide similar settings in 1 unit.


Measuring Memory Bandwidth

o Do not look at the CPU performance only: the memory
  subsystem's performance is crucial for your HPC application!

long long *x, *xstart, *xend, mask;
for (x = xstart; x < xend; x++) *x ^= mask;

o Each loop iteration: One load + One store
o We ran this kernel with multiple threads working on private
  data at a time using OpenMP (large memory footprint >> L2)
o Explicit processor binding to control the thread placement
    Linux: taskset command
    Solaris: SUN_MP_PROCBIND environment variable
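
A rough sketch of how such a measurement could be set up. The function names, the mask value, and the timing approach are assumptions; the benchmark behind the numbers on the following slides is not shown in the deck, and thread binding is left to the external tools named above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// One sweep of the kernel from the slide: one load and one store per element.
void xorSweep(long long* xstart, long long* xend, long long mask) {
    for (long long* x = xstart; x < xend; ++x) *x ^= mask;
}

// Rough bandwidth estimate in GB/s; every thread sweeps its own private buffer once.
double measureBandwidthGBs(int threads, std::size_t wordsPerThread) {
    double seconds = 0.0;
    int team = 0;                                       // number of threads that really ran
    #pragma omp parallel num_threads(threads)
    {
        std::vector<long long> buf(wordsPerThread, 1);  // private data, touched by its owner
        #pragma omp atomic
        team++;
        #pragma omp barrier
        auto t0 = std::chrono::steady_clock::now();
        xorSweep(buf.data(), buf.data() + buf.size(), 0x5555LL);
        #pragma omp barrier
        auto t1 = std::chrono::steady_clock::now();
        #pragma omp master
        seconds = std::chrono::duration<double>(t1 - t0).count();
    }
    // Each element is loaded once and stored once.
    const double bytes = 2.0 * sizeof(long long) * wordsPerThread * team;
    return bytes / seconds / 1e9;
}
```

For meaningful numbers the buffer per thread should be much larger than the L2 cache, and the sweep would typically be repeated several times.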

Selected results: 1 thread

2x Clovertown, 2.66 GHz
    1 thread: 3.970 GB/s

4x Opteron 875, 2.2 GHz
    1 thread: 3.998 GB/s


Memory Bandwidth: Dual-Socket Quad-Core Clovertown


[Diagram: thread placement on cores (C) and memory (M) for each measurement]

1 thread: 3.970 GB/s
2 threads: 3.998 GB/s
2 threads: 4.661 GB/s
2 threads: 6.871 GB/s
8 threads: 8.006 GB/s

Memory Bandwidth: Quad-Socket Dual-Core Opteron


[Diagram: thread placement on cores (C) and memory (M) for each measurement]

1 thread: 3.998 GB/s
2 threads: 4.335 GB/s
2 threads: 8.210 GB/s
2 threads: 4.674 GB/s
8 threads: 18.470 GB/s

ccNUMA!

What does that mean for my application?

o Ouch, these boxes all behave differently
o Let's look at a Sparse Matrix Vector multiplication:

                                 SF V40z      FSC RX200 S4
                                 (Opteron)    (Xeon)
Sparse MatVec (small)   GFLOPS   2.17         9.34
Sparse MatVec (large)   GFLOPS   1.47         0.91

o Which architecture is suited best depends on:
    Can your application profit from big caches?
    Can your application profit from shared / separate caches?
    Can your application profit from a high clock rate?
    Is your application memory bound anyway?
    and more factors

Summary of second part

o Tool support is a special strength of OpenMP (compared to
  other multi-threading paradigms): Always use a verification
  tool before putting your parallel code in production!
o The Tasking concept of OpenMP 3.0 makes OpenMP
  applicable to a much broader range of applications than 2.5.
o Multicore architectures are manifold.
o In order to achieve the full performance potential, programs
  have to go parallel and respect the (memory) architecture.

The End

Thank you for your attention!