Anda di halaman 1dari 14

Background

A Little Motivation

Brief Overview of Parallel Programming

Why do we parallel process in the first place?


Performance - either solve a bigger problem, or to reach solution
faster (or both)

M. D. Jones, Ph.D.

Hardware is becoming (actually it has been for a while now)


intrinsically parallel

Center for Computational Research


University at Buffalo
State University of New York

Parallel programming is not easy. How easy is sequential


programming? Now add another layer (a sizable one, at that) for
parallelism ...

Spring 2014

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

1 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Background

Spring 2014

3 / 61

Background

Big Picture

Decomposition

Basic idea of parallel programming is a simple one:


problem

decomposition

1D

CPU
CPU
CPU
CPU
CPU
CPU

2D

CPU
CPU

3D

CPU
CPU
CPU
CPU

in which we decompose a large computational problem into available


processing elements (CPUs). How you achieve that decomposition is
where all the fun lies ...
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

4 / 61

Uniform volume example - decomposition in 1D, 2D, and 3D. Note that
load balancing in this case is simpler if the work load per cell is the
same ...
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

5 / 61

Background

Demand for HPC

Decomposition (continued)

More Speed!

Why Use Parallel Computation?

Note that, in the modern era, inevitably HPC = Parallel (or concurrent)
Computation. The driving forces behind this are pretty simple - the
desire is:
Solve my problem faster, i.e. I want the answer now (and who
doesnt want that?)
I want to solve a bigger problem than I (or anyone else, for that
matter) have ever before been able to tackle, and do so in a
reasonable (generally reasonable = within a graduate students
time to graduate!)
Nonuniform example - decomposition in 3D for a 16-way parallel
adaptive finite element calculation (in this case using a terrain
elevation). Load balancing in this case is much more complex.
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Demand for HPC

Spring 2014

6 / 61

M. D. Jones, Ph.D. (CCR/UB)

More Speed!

Brief Overview of Parallel Programming


Demand for HPC

Spring 2014

8 / 61

More Speed!

A Concrete Example

A Size Example

Well, more of a discrete example, actually. Lets consider the


gravitational N-body problem.

On the other side, suppose that we need to diagonalize (or invert) a


dense matrix being used in, say, an eigenproblem derived from some
engineering or multiphysics problem.

Example
Using classical gravitation, we have a very simple (but long-ranged)
force/potential. For each of N bodies, the resulting force is computed
from the other N 1 bodies, thereby requiring N 2 force calculations
per step. If a galaxy consists of approximately 1012 such bodies, and
even the best algorithm for computing requires N log2 N calculations,
that means ' 1012 ln(1012 )/ ln(2) calculations. If each calculation
takes ' 1 sec, that is 40 106 seconds per step. That is about 1.3
CPU-years per step. Ouch!

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

9 / 61

Example
In this problem, we want to increase the resolution to capture the
essential underlying behavior of the physical process being modeled.
So we determine that we need a matrix on the order of, say, 400000
elements. Simply to store this matrix, in 64-bit representation, requires
' 1.28 1012 Bytes of memory, or 1200 GBytes. We could fit this
onto a cluster with say, 103 nodes, each having 4 GBytes of memory,
by distributing the matrix across the individual memories of each
cluster node.

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

10 / 61

Demand for HPC

More Speed!

Demand for HPC

So ...

Scaling Concepts

So the lessons to take from the preceding examples should be clear:


It is the nature of the research (and commercial) enterprise to
always be trying to accomplish larger tasks with ever increasing
speed, or efficiency. For that matter it is human nature.
These examples, while specific, apply quite generally - do you see
the connection between the first example and general molecular
modeling? The second example and finite element (or finite
difference) approaches to differential equations? How about web
searching?

M. D. Jones, Ph.D. (CCR/UB)

Inherent Limitations

Brief Overview of Parallel Programming


Demand for HPC

Spring 2014

11 / 61

By scaling we typically mean the relative performance of a parallel


vs. serial implementation:
Definition (Scaling): Speedup Factor, S(p) is given by
S(p) =

Sequential execution time (using optimal implementation)


Parallel execution time using p processors

so, in the ideal case, S(p) = p.

M. D. Jones, Ph.D. (CCR/UB)

Inherent Limitations

Brief Overview of Parallel Programming


Demand for HPC

Parallel Efficiency

Spring 2014

12 / 61

Inherent Limitations

Inherent Limitations in Parallel Speedup

Using S as the (best) sequential execution time, we note that


S(p) =

Limitations on the maximum speedup:

S
,
p (p)

Fraction of the code, f , can not be made to execute in parallel

p,

Parallel overhead (communication, duplication costs)


Using this serial fraction, f , we can note that

for a lower bound, and the parallel efficiency is given by

p f S + (1 f ) S /p,

S(p)
,
p
S
=
.
p p

E(p) =

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

13 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

14 / 61

Demand for HPC

Inherent Limitations

Demand for HPC

Inherent Limitations

Amdahls Law

Implications of Amdahls Law

This simplification for p leads directly to Amdahls Law,


Definition (Amdahls Law):

The implications of Amdahls law are pretty straightforward:


The limit as p ,

S
,
f S + (1 f ) S /p
p

.
1 + f (p 1)

S(p)

lim S(p)

1
.
f

If the serial fraction is 5%, the maximum parallel speedup is only


20.

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities, AFIPS Conference
Proceedings 30 (AFIPS Press, Atlantic City, NJ) 483-485, 1967.

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Demand for HPC

Spring 2014

15 / 61

Amdahl's Law

16 / 61

Let p = comm + comp , where comm is the time spent in communication


between parallel processes (unavoidable overhead) and comp is the
time spent in (parallelizable) computation.

256

f=0.001
f=0.01
f=0.02
f=0.05
f=0.1

64

Spring 2014

Inherent Limitations

A Practical Example

Maximum Parallel Speedup

1 pcomp ,

32
MAX(S(p))

Brief Overview of Parallel Programming


Demand for HPC

Implications of Amdahls Law (contd)

128

M. D. Jones, Ph.D. (CCR/UB)

Inherent Limitations

and

16

S(p) =

8
4

1
,
p

pcomp
,
comm + comp

p 1 + comm /comp

1
1

M. D. Jones, Ph.D. (CCR/UB)

8
32
16
Number of Processors, p

64

Brief Overview of Parallel Programming

128

256

Spring 2014

1

The point here is that it is critical to minimize communication time to


time spent doing computation (recurring theme).
17 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

18 / 61

Demand for HPC

Inherent Limitations

Demand for HPC

Defeating Amdahls Law

Inherent Limitations

Gustafsons Law
Now let p be held constant as p is increased,
Ss (p) = S /p ,

There are ways to work around the serious implications of Amdahls


law:

= (f S + (1 f )S )/p ,

We assumed that the problem size was fixed, which is (very) often
not the case.

= (f S + (1 f )S )/(f S + (1 f )S /p),

Now consider a case where the problem size is allowed to vary

' p + p(1 p)f . . . .

Assume that now the problem size is fixed such that p is held
constant

= p/(1 (1 p)f ),

Another way of looking at this is that the serial fraction becomes


negligible as the problem size is scaled. Actually that is a pretty good
definition of a scalable code ...
J. L. Gustafson, Reevaluating Amdahls Law, Comm. ACM 31(5), 532-533 (1988).

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Demand for HPC

Spring 2014

19 / 61

M. D. Jones, Ph.D. (CCR/UB)

Inherent Limitations

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

20 / 61

Basic Terminology

Scalability

The Parallel Zoo

Definition
(Scalable): An algorithm is scalable if there is a minimal nonzero
efficiency as p and the problem size is allowed to take on any
value

There are many parallel programming models, but roughly they break
down into the following categories:
Threaded models: e.g., OpenMP or Posix threads as the
application programming interface (API)

I like this (equivalent) one better:


Definition
(Scalable): For a scalable code the sequential fraction becomes
negligible as the problem size (and the number of processors) grows

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

21 / 61

Message passing: e.g., MPI = Message Passing Interface


Hybrid methods: combine elements of other models

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

23 / 61

Overview of Parallel APIs

Basic Terminology

Overview of Parallel APIs

Thread Models

Basic Terminology

Thread Models
thread 1
thread 2

a.out
thread 1
thread 2

a.out

data

data

time

time
thread N

Shared address space (all threads can access data of program


a.out).
Generally limited to single machine except in very specialized
cases.
GPGPU (general purpose GPU computing) uses thread model
(with a large number of threads on the host machine.

thread N

Typical thread model (e.g., OpenMP)

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

24 / 61

M. D. Jones, Ph.D. (CCR/UB)

Basic Terminology

Brief Overview of Parallel Programming


Overview of Parallel APIs

Message Passing Models

Spring 2014

25 / 61

Basic Terminology

Message Passing Models


a.out

a.out

PID

task 1

msg send/recv

data

a.out

a.out

data

task 1

PID
a.out

PID

task N1
time

Separate address space (tasks can not access data of other


tasks without sending messages or participating in collective
communications).
General purpose model - messages can be sent over network,
shared memory, etc.
Managing communication is generally up to the programmer.

PID

task N1
time

Typical message passing model (e.g., MPI).


Brief Overview of Parallel Programming

PID

data

data

M. D. Jones, Ph.D. (CCR/UB)

msg send/recv

collective communication (broadcast/reduce/...)

a.out

msg send/recv

task 0

task 0

msg send/recv

PID
data

collective communication (broadcast/reduce/...)

data

Spring 2014

26 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

27 / 61

Overview of Parallel APIs

Basic Terminology

Overview of Parallel APIs

Parallelism at Three Levels

Approaching the Problem

The first step in deciding where to add parallelism to an algorithm or


application is usually to analyze from the point of tasks and data:

Task - macroscopic view in which an algorithm is organized


around separate concurrent tasks

Tasks How do I reduce this problem into a set of tasks that can
be executed concurrently?

Data - structure data in (separately) update-able chunks

Data How do I take the key data and represent it in such a way
that large chunks can be operated on independently (and
thus concurrently)?

Instruction - ILP through the flow of data in a predictable fashion

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

28 / 61

M. D. Jones, Ph.D. (CCR/UB)

Basic Terminology

Brief Overview of Parallel Programming


Overview of Parallel APIs

Dependencies

Spring 2014

29 / 61

Basic Terminology

An Hour of Planning ...

Taking the time to carefully analyze an existing application, or plan a


new one, can really pay off later:

Bear in mind that you always want to minimize the dependencies


between your concurrent tasks and data:

Plan for parallel programming by keeping data and task parallel


constructs/opportunities in mind

On the task level, design your tasks to be as independent as


possible - this can also be temporal, namely take into account
any necessity for ordering the tasks and their execution

Analyze an existing or new program to discover/confirm the


locations of the time-consuming portions

Sharing of data will require synchronization and thus introduce


data dependencies, if not race conditions or contention

M. D. Jones, Ph.D. (CCR/UB)

Basic Terminology

Brief Overview of Parallel Programming

Spring 2014

Optimize by implementing a strategy using data or task parallel


constructs

30 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

31 / 61

Overview of Parallel APIs

Auto-Parallelization

Overview of Parallel APIs

Auto-Parallelization

Auto-Vectorization

Often called implicit parallelism, some compilers (usally commercial


ones) have the ability to:
automatically parallelize simple (outer) loops (thread-level
parallelism, or TLP)
Note that thread level parallelism is only usable on an SMP
architecture

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

Like auto-parallelization, a feature of high-performance (often


commercial) compilers on hardware that supports at least some level
of vectorization:
compilers finds low-level operations that can operate
simultaneously on multiple data elements using a single
instruction
Intel/AMD processors with SSE/SSE2/SSE3 usually limited to 21 23 vector length (hardware limitation that true vector processors
do not share)

automatically vectorize (usually innermost loops) using ILP

M. D. Jones, Ph.D. (CCR/UB)

Auto-Vectorization

User can exert some control through compiler directives (usually


vendor specific) and data organization focused on vector
operations

32 / 61

M. D. Jones, Ph.D. (CCR/UB)

Auto-Vectorization

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

33 / 61

Shared Memory APIs

Downside of Automatic Methods

Shared Memory Parallelism

The automatic parallelization schemes have a number of shortfalls:

Here we consider explicit modes of parallel programming on shared


memory architectures:

Very limited in scope - parallelizes only a very small subset of


code
Without directives from the programmer, compiler is very limited in
identifying parallelizable regions of code
Performance is rather easily degraded rather than sped up

Brief Overview of Parallel Programming

Spring 2014

SHMEM, the old Cray shared memory library


Pthreads, UNIX ANSI standard (available on Windows as well)
OpenMP, common specification for compiler directives and API

Wrong results are easy to (automatically) produce

M. D. Jones, Ph.D. (CCR/UB)

HPF, High Performance Fortran

34 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

35 / 61

Overview of Parallel APIs

Distributed Memory APIs

Overview of Parallel APIs

Distributed Memory Parallel Programming

OpenMP

In this case, mainly message passing, with a few other (compiler


directives again) possibilities:
HPF, High Performance Fortran
Cluster OpenMP, a new product from Intel to push OpenMP into
a distributed memory regime
PVM, Parallel Virtual Machine
MPI, Message Passing Interface, has become the dominant API
for general purpose parallel programming
UPC, Unified Parallel C, extensions to ISO 99 C for parallel
programming (new)

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

Synopsis: designed for multi-platform shared memory parallel


programming in C/C++/FORTRAN primarily through the
use of compiler directives and environmental variables
(library routines also available).
Target: Shared memory systems.
Current Status: Specification 1.0 released in 1997, 2.0 in 2002, 2.5 in
2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by
commercial compiler vendors, also now in most
open-source compilers (4.0 specification released in
2013-07, not yet available in production compilers).
More Info: The OpenMP home page:
http://www.openmp.org

CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003


(new, officially part of Fortran 2008)

M. D. Jones, Ph.D. (CCR/UB)

APIs/Language Extensions & Availability

36 / 61

M. D. Jones, Ph.D. (CCR/UB)

APIs/Language Extensions & Availability

Brief Overview of Parallel Programming


Overview of Parallel APIs

OpenMP Availability

Spring 2014

37 / 61

APIs/Language Extensions & Availability

MPI

Message Passing Interface


Platform
Linux x86_64
GNU (>=4.2)
GNU (>=4.4)
GNU (>=4.7)

Version
3.1
3.0
2.5
3.0
3.1

Synopsis: Message passing library specification proposed as a


standard by committee of vendors, implementors, and
users.

Invocation (example)
ifort -openmp -openmp_report2 ...
pgf90 -mp ...
gfortran -fopenmp ...
gfortran -fopenmp ...
gfortran -fopenmp ...

Target: Distributed and shared memory systems.


Current Status: Most popular (and most portable) of the message
passing APIs.
More Info: ANLs main MPI site:
www-unix.mcs.anl.gov/mpi

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

38 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

39 / 61

Overview of Parallel APIs

APIs/Language Extensions & Availability

Overview of Parallel APIs

MPI Availability at CCR

Platform
Linux IA64
Linux x86_64

APIs/Language Extensions & Availability

Unified Parallel C (UPC)

Unified Parallel C (UPC) is a project based at Lawrence Berkeley


Laboratory to extend C (ISO 99) and provide large-scale parallel
computing on both shared and distributed memory hardware:

Version(+MPI-2)
1.2+(C++,MPI-I/O)
1.2+(C++,MPI-I/O),2.x(various)

Explicit parallel execution (SPMD)


Shared address space (shared/private data like OpenMP), but
exploits data locality

I am starting to favor the commercial Intel MPI for its ease of use,
expecially in terms of supporting multiple networks/protocols.

Primitives for memory management


BSD license (free download)
http://upc.lbl.gov/, community website at
http://upc.gwu.edu/

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

40 / 61

M. D. Jones, Ph.D. (CCR/UB)

APIs/Language Extensions & Availability

Brief Overview of Parallel Programming


Overview of Parallel APIs

Spring 2014

41 / 61

APIs/Language Extensions & Availability

Co-Array Fortran
Simple example of some UPC syntax:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to


provide data decomposition for parallel programs (somewhat akin to
UPC and HPF):

shared i n t a l l _ h i t s [THREADS ] ;
...
...
f o r ( i =0; i < m y _ t r i a l s ; i ++) my_hits += h i t ( ) ;
a l l _ h i t s [MYTHREAD] = my_hits ;
upc_barrier ;
i f (MYTHREAD == 0 ) {
t o t a l _ h i t s = 0;
f o r ( i =0; i < THREADS; i ++) {
t o t a l _ h i t s += a l l _ h i t s [ i ] ;
}
p i = 4 . 0 ( ( double ) t o t a l _ h i t s ) / ( ( double ) t r i a l s ) ;
p r i n t f ( " PI e s t i m a t e d t o %10.7 f from %d t r i a l s on %d t h r e a d s . \ n " ,
p i , t r i a l s , THREADS ) ;
}

Original specification by Numrich and Reid, ACM Fortran Forum,


17, no. 2, pp 1-31 (1998).
ISO Fortran committee included co-arrays in next revision to
Fortran standard (c.f. Numrich and Reid, ACM Fortran Forum, 24,
no 2, pp 4-17 (2005), Fortran 2008.
Does not require shared memory (more applicable than OpenMP)
Early compiler work at Rice:
http://www.hipersoft.rice.edu/caf/index.html

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

42 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

43 / 61

Overview of Parallel APIs

APIs/Language Extensions & Availability

Libraries

Leveraging Existing Parallel Libraries


Simple examples of CAF usage:
1
2
3
4
5
6
7
8

One easy way to utilize parallel programming resources is through


the use of existing parallel libraries. Most common examples (note
orientation around standard mathematical routines):

REAL, DIMENSION (N ) [ ] : : X , Y
X
= Y [ PE ]
! g e t from Y [ PE ]
Y [ PE ]
= X
! p u t i n t o Y [ PE ]
Y[:]
= X
! b r oa d c a s t X
Y [ LIST ] = X
! b r oa d c a s t X over subset o f PE s i n a r r a y LIST
Z(:)
= Y[:]
! collect all Y
S = MINVAL (Y [ : ] ) ! min ( reduce ) a l l Y
B ( 1 :M) [ 1 : N] = S ! S s c a l a r , promoted t o a r r a y o f shape ( 1 :M, 1 : N)

BLAS - Basic Linear Algebra Subroutines


LAPACK - Linear Algebra PACKage (uses BLAS)

UPC and CAF are examples of partitioned global address space


(PGAS) language, for which a large push is being made by
AHPCRC/DARPA as part of the Petascale computing initiative
(also one based on java called Titanium)

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

44 / 61

ScaLAPACK - distributed memory (MPI) solvers for common LAPACK


functions
PetSC - Portable, Extensible Toolkit for Scientific Computation
FFTW - Fastest Fourier Transforms in the West

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Libraries

BLAS

Vendor BLAS

The Basic Linear Algebra Subroutines (BLAS) form a standard set of


library functions for vector and matrix operations:

Vendor implementations of the BLAS:

Level 1 Vector-Vector (e.g. xdot, where x=s,d,c,z)


Level 2 Matrix-Vector (e.g. xaxpy, where x=s,d,c,z)
Level 3 Matrix-Matrix (e.g. xgemm, where x=s,d,c,z)
These routines are generally provided by vendors hand-tuned at the
level of assembly code for optimum performance on a particular
processor. Shared memory versions (multithreaded) and distributed
memory versions available.
www.netlib.org/blas

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

46 / 61

Spring 2014

48 / 61

Libraries

Spring 2014

47 / 61

AMD
Apple
Compaq
Cray
HP
IBM
Intel
NEC
SGI
SUN

M. D. Jones, Ph.D. (CCR/UB)

ACML
Velocity Engine
CXML
libsci
MLIB
ESSL
MKL
PDLIB/SX
SCSL
Sun Performance Library

Brief Overview of Parallel Programming

Libraries

Libraries

Performance Example

LAPACK

DDOT Performance

The Linear Algebra PACKage (LAPACK) is a library that lies on top of


the BLAS (for optimium performance and parallelism) and provides:

U2 Compute Node
5000
cMKL 8.1.1
Ref (-lblas)

4500
4000

solvers for systems of simultaneous linear equations

L1 cache
L2 cache

3500

least-squares solutions

MFlop/s

3000

eigenvalue problems

2500
2000

singular value problems

Main memory

1500

On CCR systems, the Intel MKL is generally preferred (includes


optimized BLAS and LAPACK)

1000
500
0
1
10

10

10

10
Vector Length [8B]

10

10

www.netlib.org/lapack

Performance advantage of Intel MKL ddot versus reference (system)


version. Level 3 BLAS routines have even more significant gains.
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

49 / 61

Libraries

Brief Overview of Parallel Programming

Parallel Program Design Considerations

ScaLAPACK

Spring 2014

50 / 61

Programming Costs

Programming Costs

The Scalable LAPACK library, ScaLAPACK, is designed for use on the


largest of problems (typically implemented on distributed memory
systems):
subset of LAPACK routines redesigned for distributed memory
MIMD parallel computers
explicit message passing for interprocessor communication
assumes matrices are laid out in a two-dimensional block cyclic
decomposition
On CCR systems, the Intel Cluster MKL includes
ScaLAPACK libraries (includes optimized BLAS, LAPACK, and
ScaLAPACK)
www.netlib.org/scalapack

M. D. Jones, Ph.D. (CCR/UB)

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

51 / 61

Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and the parallel algorithms
will limit speedups. All codes will eventually reach a state
of decreasing returns at some point
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

53 / 61

Parallel Program Design Considerations

Examples

Parallel Program Design Considerations

Examples

Speedup Limitation Example (MD/NAMD)


Joint Amber-CHARMM Benchmark
NAMD, v2.6b1, u2
1

256
ch_p4
ch_gm

128
79.9

88

Benchmark MD s/step

0.1

32

27.2

16

14.2

7.3
6.46

8
7.01

3.67

0.01

Parallel Speedup

64

49.6

3.54
2.8

1.9

1.89
1

JAC (Joint Amber-CHARMM) Benchmark: DHFR Protein, 7182 residues, 23558 atoms, 7023

TIP3 waters, PME (9cutoff)

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Parallel Program Design Considerations

Spring 2014

54 / 61

Communication Costs

M. D. Jones, Ph.D. (CCR/UB)

8
32
16
Number of Processors (ppn=2)

128

Brief Overview of Parallel Programming

Parallel Program Design Considerations

Communications

64

256

Spring 2014

55 / 61

Communication Costs

Communication Considerations
Communication needs between parallel processes affect parallel
programs in several important ways:
Cost: there is always overhead when communicating:

All parallel programs will need some communication between tasks,


but the amount varies considerably:
minimal communication, also known as embarrassingly parallel
- think Monte Carlo calculations
robust communication, in which processes must exchange or
update information frequently - time evolution codes, domain
decomposed finite difference/element applications

latency - the cost in overhead for a zero length message


(microseconds)
bandwidth - the amount of data that can be sent per unit time
(MBytes/second); many applications utilize many small messages
and are latency-bound

Scope: point-to-point communication is always faster than


collective communication (that involves a large subset or all
processes)
Efficiency: are you using the best available resource (network) for
communicating?

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

56 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

57 / 61

Parallel Program Design Considerations

Communication Costs

Parallel Program Design Considerations

Latency Example - MPI/CCR

Communication Costs

Bandwidth Example - MPI/CCR

MPI PingPong Benchmark Performance

MPI PingPong Benchmark Performance

U2 Cluster Interconnects

U2 Cluster Interconnects

10000
ch_mx
ch_gm
ch_p4
DAPL-QDR-IB

1000

1000

Bandwidth [MByte/s]

Message Time [s]

100

10

100

ch_mx
ch_gm
ch_p4
DAPL-QDR-IB

10

1 0
10

10

10

10
10
Message Length [Bytes]

10

0.1

Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR


Infiniband.
M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Parallel Program Design Considerations

10

10

Spring 2014

58 / 61

10

10

10
10
Message Length [Bytes]

10

0.01

10

Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and


QDR Infiniband.
M. D. Jones, Ph.D. (CCR/UB)

Communication Costs

Brief Overview of Parallel Programming

Parallel Program Design Considerations

Spring 2014

59 / 61

Communication Costs

Collective Example - MPI/CCR


MPI_Alltoall Benchmark

Barriers - force collective synchronization; often implied, and always


expensive
Locks/Semaphores - typically protect memory locations from
simultaneous/conflicting updates to prevent race conditions

Intel MPI, 4KB Buffer, 12-core nodes with Qlogic QDR IB, Xeon E5645
1e+05

10000
talltoall(4KB) [sec]

Visibility: message-passing codes require the programmer to


explicitly manage all communications, but many data-parallel
codes do not, masking the underlying data transfers
Synchronicity: Synchronous communications are called blocking
as they require an explicit handshake between parallel
processes - asynchronous communications (non-blocking offer
the ability to overlap computation with communication, but place
an extra burden on the programmer. Examples include:

ppn=12
ppn=6
ppn=4
ppn=2
ppn=1

1000

100

10

10

100
Number of MPI Processes

1000

Cost of MPI_Alltoall for 4KB buffer on CCR/QDR IB.


M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

60 / 61

M. D. Jones, Ph.D. (CCR/UB)

Brief Overview of Parallel Programming

Spring 2014

61 / 61