The workshop will combine lecture material with hands-on practice. The lecture
will introduce the basic principles of MPI, and the hands-on session will apply them
to selected example programs. Programming will be done in Fortran and
C, so background in at least one of these languages is required.
Level: Introductory
Prerequisites: Knowledge of Fortran, C, or C++
Agenda
10:00-10:30 Introduction to MPI
10:30-11:00 Blocking Communication
11:00-11:30 Hands-on
11:30-12:00 Non-Blocking Communication
• p690+
– Reg 1: 32 × 1.3 GHz Power4 CPUs sharing 64 GB of memory
– Reg 2: 24 × 1.3 GHz Power4 CPUs sharing 24 GB of memory
– Reg 3: 32 × 1.7 GHz Power4 CPUs sharing 128 GB of memory
– Reg 4: 32 × 1.7 GHz Power4 CPUs sharing 64 GB of memory
• p655+
– 13 nodes: 8 × 1.5 GHz Power4 CPUs sharing 16 GB of memory
– 11 nodes: 8 × 1.7 GHz Power4 CPUs sharing 16 GB of memory
– 1 node (interactive): 8 × 1.7 GHz Power4 CPUs
• 4 nodes (file server): 4 × 1.5 GHz Power4 CPUs each
http://www.msi.umn.edu/power4
• Total: 344 processors, 680 GB of memory, and 6 TB of disk space
– Baltix
• 48-processor Altix 3700
• Intel Itanium2 1.3 GHz (Madison)
• IA-64 processor (64-bit)
• 96 GB of memory
• Storage: 5 TB
– Login
• ssh username@balt.msi.umn.edu
http://www.msi.umn.edu/bscl
Netfinity Linux Cluster
• Processors
– 12 nodes: 2 × 1.5 GHz Pentium 4 CPUs per node, 1.5 GB per node (Gigabit Ethernet)
– 59 nodes: 2 × 1.26 GHz Pentium III CPUs per node, 3.25 GB per node (Fast Ethernet and Myrinet)
=> Total of 142 processors and 210 GB of memory
• One interactive node
• 2.5 TB of storage
• Networking:
– Myrinet, Fast Ethernet, and Gigabit Ethernet
http://www.msi.umn.edu/netfinity
Introduction to parallel programming
[Figure: a distributed-memory cluster of four SMP nodes (Node 1 through Node 4), each with four processors (P1 to P4) sharing a local memory (Mem 1 to Mem 4), connected by a network.]
Compiling and Running MPI
SGI Altix
- To compile:
- Fortran: ifort -O3 program.f -lmpi
- C: icc -O3 program.c -lmpi
- To run:
- Interactive:
mpirun -np 2 ./a.out
- Batch: use PBSpro, www.msi.umn.edu/altix/intro
Compiling and Running MPI
IBM Power4:
- To compile:
- Fortran: mpxlf, mpxlf90 e.g.: mpxlf -O3 program.f
- C/C++: mpcc, mpCC e.g.: mpcc -O3 program.c
- To run:
- Interactive:
poe ./a.out -nodes 1 -tasks_per_node 2 -rmpool 1
- Batch: use LoadLeveler
www.msi.umn.edu/tutorials/Loadleveler/ll.html
Compiling and Running MPI
Netfinity Cluster
- To compile:
- Fortran: mpif77 -O3 program.f
- C: mpicc -O3 program.c
- To run:
- Interactive:
mpirun -np 2 -machinefile mynodes ./a.out
where file mynodes contains:
nx38
nx39
nx38 and nx39 are the nodes connected with Myrinet
- Batch: use PBSpro
- http://www.msi.umn.edu/netfinity/tutorials/intro.html
MPI
[Figure: point-to-point message passing. Process 0 calls Send on its buffer A; Process 1 calls Receive into its buffer B.]
• Questions
– To whom is data sent?
– Where is the data?
– What type of data is sent?
– How much data is sent?
• Fortran Bindings:
CALL MPI_XXXXX(…, ierror)
– Case insensitive
– Almost all MPI calls are subroutines
– ierror is always the last parameter
– The program must include 'mpif.h'
• C Bindings:
int ierror = MPI_Xxxxx(…)
– Case sensitive (as C always is)
– All MPI calls are functions; most return an integer error code
– The program must include "mpi.h"
– Most parameters are passed by reference (i.e., as pointers)
MPI Basic Send/Receive
Blocking send:
Blocking receive:
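The standard Fortran forms of these two calls are:
MPI_SEND(buf, count, datatype, dest, tag, comm, ierror)
MPI_RECV(buf, count, datatype, source, tag, comm, status, ierror)
where status is an INTEGER array of size MPI_STATUS_SIZE. The datatype argument takes one of the MPI type constants listed below.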
MPI datatype        Fortran datatype
MPI_REAL            REAL
MPI_REAL8           REAL*8
MPI_COMPLEX         COMPLEX
MPI_LOGICAL         LOGICAL
MPI_CHARACTER       CHARACTER
MPI_BYTE            (no Fortran equivalent)
MPI_PACKED          (no Fortran equivalent)
MPI Communicators
Note: the RECV call is blocking. If both processes post the receive first, the send can
never execute and both processes remain blocked at the RECV, resulting in deadlock.
(A correctly ordered sketch follows the fragment below.)
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
tag = 100
if (rank .eq. 0) then
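A minimal, correctly ordered sketch of a two-process exchange (the buffer names a and b and the value assignments are illustrative):

program exchange
include 'mpif.h'
integer ierror, rank, size, tag, status(MPI_STATUS_SIZE)
real a, b
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
tag = 100
if (rank .eq. 0) then
a = 1.0
call MPI_SEND(a, 1, MPI_REAL, 1, tag, MPI_COMM_WORLD, ierror)
call MPI_RECV(b, 1, MPI_REAL, 1, tag, MPI_COMM_WORLD, status, ierror)
else if (rank .eq. 1) then
a = 2.0
call MPI_RECV(b, 1, MPI_REAL, 0, tag, MPI_COMM_WORLD, status, ierror)
call MPI_SEND(a, 1, MPI_REAL, 0, tag, MPI_COMM_WORLD, ierror)
end if
call MPI_FINALIZE(ierror)
end

Here rank 0 sends before it receives and rank 1 receives before it sends, so neither process waits on a receive that has no matching send.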
Login to SDVL
username: temp01 – temp10
passwd: 5Qw#npT<
Login to Power4
ssh -X p655-22
username: temp01 – temp10
passwd: 5Qw#npT<
MPI
Non-Blocking Communication
• One can also test for completion without blocking using MPI_TEST
MPI_TEST(request, flag, status)
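A sketch of the typical polling pattern with MPI_TEST (in Fortran an ierror argument is appended, as in all MPI calls; the receive parameters and the work being overlapped are illustrative):

real buf(100)
logical flag
integer request, status(MPI_STATUS_SIZE), ierror
call MPI_IRECV(buf, 100, MPI_REAL, 0, 1, MPI_COMM_WORLD, request, ierror)
flag = .false.
do while (.not. flag)
C overlap useful computation here with the pending receive
call MPI_TEST(request, flag, status, ierror)
end do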
Non-blocking Send
Syntax
C:
int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
FORTRAN:
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG,
COMM, REQUEST, IERROR)
[ IN buf ] initial address of send buffer (choice)
[ IN count ] number of elements in send buffer (integer)
[ IN datatype ] datatype of each send buffer element (handle)
[ IN dest ] rank of destination (integer)
[ IN tag ] message tag (integer)
[ IN comm ] communicator (handle)
[ OUT request ] communication request (handle)
Non-blocking Recv
Syntax
C:
int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)
FORTRAN:
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG,
COMM, REQUEST, IERROR)
[ OUT buf ] initial address of receive buffer (choice)
[ IN count ] number of elements in receive buffer (integer)
[ IN datatype ] datatype of each receive buffer element (handle)
[ IN source ] rank of source (integer)
[ IN tag ] message tag (integer)
[ IN comm ] communicator (handle)
[ OUT request ] communication request (handle)
Non-blocking Communication
completion calls
MPI_REQUEST_FREE(request)
[ INOUT request ] communication request (handle)
FORTRAN:
MPI_REQUEST_FREE(REQUEST, IERROR)
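The completion calls used in the examples that follow are MPI_WAIT and MPI_WAITALL; their standard Fortran forms are:
MPI_WAIT(REQUEST, STATUS, IERROR)
MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)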
Non-blocking Communication
Examples
Example: Simple usage of nonblocking operations and MPI_WAIT
IF (rank.EQ.0) THEN
CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
****do some computation to mask latency****
CALL MPI_WAIT(request, status, ierr)
ELSE
CALL MPI_IRECV(a(1), 10, MPI_REAL, 0, tag, comm, request, ierr)
****do some computation to mask latency****
CALL MPI_WAIT(request, status, ierr)
END IF
Non-blocking Communication
Bad Examples
IF(rank.EQ.0) THEN
CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 10, req, ierr)
CALL MPI_REQUEST_FREE(req, ierr)
CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, req, ierr)
CALL MPI_WAIT(req, status, ierr)
ELSE ! rank.EQ.1
CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 10, req, ierr)
CALL MPI_WAIT(req, status, ierr)
CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, req, ierr)
CALL MPI_REQUEST_FREE( req, ierr)
END IF
This fragment has three problems: the communicator argument is missing from every
MPI_ISEND and MPI_IRECV call; a single request variable req is reused, so the pending
operations cannot be distinguished; and MPI_REQUEST_FREE releases a request that may
still be active, so there is no way to know when the send buffer can safely be reused.
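A corrected version of the fragment, with the communicator supplied and separate request variables (sreq and rreq are names introduced here for illustration):

IF (rank.EQ.0) THEN
CALL MPI_ISEND(outval, 1, MPI_REAL, 1, 10, MPI_COMM_WORLD, sreq, ierr)
CALL MPI_IRECV(inval, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, rreq, ierr)
CALL MPI_WAIT(sreq, status, ierr)
CALL MPI_WAIT(rreq, status, ierr)
ELSE ! rank.EQ.1
CALL MPI_IRECV(inval, 1, MPI_REAL, 0, 10, MPI_COMM_WORLD, rreq, ierr)
CALL MPI_ISEND(outval, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, sreq, ierr)
CALL MPI_WAIT(rreq, status, ierr)
CALL MPI_WAIT(sreq, status, ierr)
END IF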
Program P_P
IMPLICIT NONE
INCLUDE 'mpif.h'
INTEGER ierror, rank, size, status(MPI_STATUS_SIZE), requests(2)
LOGICAL flag
REAL as, bs, cs, ds
CALL MPI_INIT(ierror)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
IF (rank.eq.0) THEN
as = 4.2
bs = 8.4
CALL MPI_ISEND(as, 1, MPI_REAL, 1, 101,
& MPI_COMM_WORLD, requests(1), ierror)
C Here one can do some computation which will not store/write as
bs = bs*as
CALL MPI_WAIT(requests(1), status, ierror)
ds = bs*cs
ELSE
as = 14.2
bs = 18.4
CALL MPI_IRECV(cs, 1, MPI_REAL, 0, 101,
& MPI_COMM_WORLD, requests(2), ierror)
C Here one can do some computation which will not store/write cs
bs = bs + as
CALL MPI_WAIT(requests(2), status, ierror)
END IF
CALL MPI_FINALIZE(ierror)
STOP
END
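One way to compile and run this program on the IBM Power4 systems described earlier (the file name p_p.f is illustrative; queue and pool options may differ):
mpxlf -O3 p_p.f -o p_p
poe ./p_p -nodes 1 -tasks_per_node 2 -rmpool 1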
Non-blocking Communication
• Gains
– Avoid deadlocks
– Decrease synchronization overhead
– Can reduce system overhead
• Post non-blocking sends/receives early and do the waits late (see the sketch below)
• It is recommended to post the MPI_IRECV before the matching MPI_RSEND is called.
• Be careful with reads and writes
– Avoid writing to the send buffer between MPI_ISEND and MPI_WAIT
– Avoid reading or writing the receive buffer between MPI_IRECV and MPI_WAIT
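A minimal sketch of the "post early, wait late" pattern, exchanging one REAL with a left and a right neighbour (the neighbour ranks left and right, the tags, and the buffers are illustrative):

INTEGER reqs(4), stats(MPI_STATUS_SIZE,4), ierr
REAL sbufl, sbufr, rbufl, rbufr
C post the receives first
CALL MPI_IRECV(rbufl, 1, MPI_REAL, left, 1, MPI_COMM_WORLD, reqs(1), ierr)
CALL MPI_IRECV(rbufr, 1, MPI_REAL, right, 2, MPI_COMM_WORLD, reqs(2), ierr)
C then post the sends
CALL MPI_ISEND(sbufl, 1, MPI_REAL, left, 2, MPI_COMM_WORLD, reqs(3), ierr)
CALL MPI_ISEND(sbufr, 1, MPI_REAL, right, 1, MPI_COMM_WORLD, reqs(4), ierr)
C computation that touches neither the send nor the receive buffers
CALL MPI_WAITALL(4, reqs, stats, ierr)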
MPI
Collective Communication
MPI Collective Communication
[Figure: MPI_BCAST. Before the call only the root (process 0) holds A0; after the call every process (0 to 3) holds a copy of A0.]
MPI Collective Communication
Broadcast
• Syntax
C: int MPI_Bcast (void* buffer, int count, MPI_Datatype datatype,
int root, MPI_Comm comm)
Fortran: MPI_BCAST (buffer, count, datatype, root, comm, ierr)
where:
buffer: is the starting address of a buffer
count: is an integer indicating the number of data elements in the
buffer
datatype: is MPI defined constant indicating the data type of the
elements in the buffer
root: is an integer indicating the rank of broadcast root process
comm: is the communicator
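A minimal usage sketch in Fortran, broadcasting 100 REALs that rank 0 has initialized (the array name a is illustrative):

REAL a(100)
INTEGER ierr
IF (rank .EQ. 0) a = 1.0
CALL MPI_BCAST(a, 100, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
C after the call, every process holds the root's values in a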
[Figure: MPI_GATHER. Every process contributes the 100 elements of a; at the root they are placed side by side in rbuf, ordered by rank.]
real a(100), rbuf(MAX)
call mpi_gather(a, 100, MPI_REAL, rbuf, 100, MPI_REAL, root, comm, ierr)
MPI Collective Communication
Gather
• Syntax
C:
int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int
rcount, MPI_Datatype rtype, int root, MPI_Comm comm
FORTRAN:
MPI_GATHER (sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
where:
sbuf: is the starting address of a buffer,
scount: is the number of elements in the send buffer,
stype: is the data type of send buffer elements,
rbuf: is the address of the receive buffer
rcount: is the number of elements for any single receive
rtype: is the data type of the receive buffer elements
root: is the rank of receiving process, and
comm: is the communicator
ierr: is the Fortran return code
MPI Collective Communication
Gather Example
INCLUDE 'mpif.h'
DIMENSION A(25, 100), b(100), cpart(25), ctotal(100)
INTEGER root, rank
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
root = 1
A = 10*rank + 0.5
b = 0.5
DO I = 1, 25
cpart(I) = 0.
DO K = 1, 100
cpart(I) = cpart(I) + A(I,K)*b(K)
END DO
END DO
CALL MPI_GATHER (cpart, 25, MPI_REAL, ctotal, 25,
& MPI_REAL, root, MPI_COMM_WORLD, ierr)
IF (rank.eq.root) print*, (ctotal(I), I=15,100,25)
CALL MPI_FINALIZE(ierr)
END
The computation is a distributed matrix-vector product A * b = c over four processes:
- A: matrix distributed by rows (25 rows per process)
- b: vector shared by all processes
- c: result gathered by the root process
MPI Collective Communication
Scatter
[Figure: MPI_SCATTER. The root's sbuf is split into consecutive pieces of 100 elements; piece i is delivered to the rbuf of process i.]
real sbuf(MAX), rbuf(100)
call mpi_scatter(sbuf, 100, MPI_REAL, rbuf, 100, MPI_REAL, root, comm, ierr)
MPI Collective Communication
Scatter Syntax
C:
int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int
rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN:
MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
where:
sbuf: is the address of the send buffer,
scount: is the number of elements sent to each process,
stype: is the data type of the send buffer elements,
rbuf: is the address of the receive buffer,
rcount: is the number of elements in the receive buffer,
rtype: is the data type of the receive buffer elements,
root: is the rank of the sending process, and
comm: is the communicator
[Figure: the root's sendbuf holds the elements 1, 2, 3 (sendcount of each); after the scatter, the recvbufs of processes 1, 2 and 3 hold 1, 2 and 3 respectively.]
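A minimal complete scatter sketch for up to four processes (the array names and value pattern are illustrative):

PROGRAM scatter_demo
INCLUDE 'mpif.h'
INTEGER sbuf(4), rbuf(1), nprocs, myrank, ierr, i
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
IF (myrank .EQ. 0) THEN
DO i = 1, nprocs
sbuf(i) = i
END DO
END IF
CALL MPI_SCATTER(sbuf, 1, MPI_INTEGER, rbuf, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
PRINT *, 'rank', myrank, 'received', rbuf(1)
CALL MPI_FINALIZE(ierr)
END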
MPI Collective Communication
All Gather
[Figure: MPI_ALLGATHER. Each process contributes one block; after the call every process holds all of the blocks A0 B0 C0 D0.]
• The syntax of MPI_ALLGATHER is similar to MPI_GATHER. However,
the argument root is dropped.
MPI Collective Communication
All Gather Syntax
C:
int MPI_Allgather (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount,
MPI_Datatype rtype, MPI_Comm comm)
FORTRAN:
MPI_ALLGATHER (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
In the gather example above, replace
CALL MPI_GATHER (cpart, 25, MPI_REAL, ctotal, 25,
& MPI_REAL, root, MPI_COMM_WORLD, ierr)
print*, (ctotal(I),I=15,100,25)
with
CALL MPI_ALLGATHER (cpart, 25, MPI_REAL, ctotal, 25,
& MPI_REAL, MPI_COMM_WORLD, ierr)
print*, (ctotal(I),I=15,100,25)
MPI Collective Communication
Alltoall
[Figure: MPI_ALLTOALL with six processes. Before the call, process i holds the blocks Ai Bi Ci Di Ei Fi; after the call, process 0 holds A0 A1 A2 A3 A4 A5, process 1 holds B0 B1 B2 B3 B4 B5, and so on. Block j of process i ends up as block i of process j, i.e., the data are transposed across processes.]
MPI Collective Communication
Alltoall Syntax
C:
int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int
rcount, MPI_Datatype rtype, MPI_Comm comm )
FORTRAN:
MPI_ALLTOALL (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
where:
sbuf: is the starting address of send buffer,
scount: is the number of elements sent to each process,
stype: is the data type of send buffer elements,
rbuf: is the address of receive buffer,
rcount: is the number of elements received from any process,
rtype: is the data type of receive buffer elements, and
comm: is the group communicator.
PROGRAM alltoall
INCLUDE 'mpif.h'
INTEGER isend (3), irecv (3)
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
DO i = 1, nprocs
isend (i) = i + nprocs * myrank
ENDDO
PRINT *, 'isend =', isend
CALL MPI_ALLTOALL(isend, 1, MPI_INTEGER,
& irecv, 1, MPI_INTEGER,
& MPI_COMM_WORLD, ierr)
PRINT *, 'irecv =', irecv
CALL MPI_FINALIZE(ierr)
END
Figure of MPI_ALLTOALL (sendcount = recvcount = 1):
sendbuf: rank 0 holds 1 2 3, rank 1 holds 4 5 6, rank 2 holds 7 8 9
recvbuf: rank 0 holds 1 4 7, rank 1 holds 2 5 8, rank 2 holds 3 6 9
Sample execution:
$ a.out -procs 3
0: isend 1 2 3
1: isend 4 5 6
2: isend 7 8 9
0: irecv 1 4 7
1: irecv 2 5 8
2: irecv 3 6 9
Hands-on
http://www.msi.umn.edu/tutorial/scicomp/general/MPI/workshop_MPI
Login to SDVL
username: temp01 – temp10
passwd: 5Qw#npT<
Login to Power4
ssh -X p655-22
username: temp01 – temp10
passwd: 5Qw#npT<
MPI
Collective Computations and
Synchronization
MPI_Reduce
C:
int MPI_Reduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int
root, MPI_Comm comm)
int MPI_Allreduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op,
MPI_Comm comm)
int MPI_Reduce_scatter (void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype,
MPI_Op op, MPI_Comm comm)
where
sbuf: is the address of send buffer,
rbuf: is the address of receive buffer,
count: is the number of elements in send buffer,
stype: is the data type of elements of send buffer,
op: is the reduce operation (which may be MPI predefined, or your own),
root: is the rank of the root processes, and
comm:is the group communicator.
MPI Predefined Reduce Operations
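Commonly used predefined operations include MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_MAXLOC, MPI_MINLOC, and the logical/bitwise operations MPI_LAND, MPI_LOR, MPI_LXOR, MPI_BAND, MPI_BOR, and MPI_BXOR.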
Usage: CALL MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierror)
Parameters
(CHOICE) sendbuf The address of the send buffer (IN)
(CHOICE) recvbuf The address of the receive buffer, sendbuf and recvbuf cannot overlap in memory. (significant only at root) (OUT)
INTEGER count The number of elements in the send buffer (IN)
INTEGER datatype The data type of elements of the send buffer (handle) (IN)
INTEGER op The reduction operation (handle) (IN)
INTEGER root The rank of the root process (IN)
INTEGER comm The communicator (handle) (IN)
INTEGER ierror The Fortran return code
Sample Program
PROGRAM reduce
INCLUDE 'mpif.h'
CALL MPI_INIT (ierr)
CALL MPI_COMM_SIZE (MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
isend = myrank + 1
CALL MPI_REDUCE (isend, irecv, 1, MPI_INTEGER,
& MPI_SUM, 0, MPI_COMM_WORLD, ierr)
IF (myrank .EQ. 0) THEN
PRINT *, 'irecv =', irecv
ENDIF
CALL MPI_FINALIZE (ierr)
END
Figure: MPI_REDUCE for scalar variables. With rank 0 as root, the sendbufs hold 1, 2 and 3 on ranks 0, 1 and 2; the root's recvbuf receives 6 = 1 + 2 + 3.
Sample execution
% a.out -procs 3
0: irecv = 6
MPI_ALLREDUCE
Usage: CALL MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm, ierror)
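A minimal usage sketch, giving every process the sum of one integer contributed per process (the variable names are illustrative):

isend = myrank + 1
CALL MPI_ALLREDUCE(isend, itotal, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
C every process now holds the sum in itotal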
MPI_REDUCE_SCATTER
Usage: CALL MPI_REDUCE_SCATTER (sendbuf, recvbuf, recvcounts, datatype, op, comm, ierror)
Sample Program
PROGRAM reduce_scatter
INCLUDE 'mpif.h'
INTEGER isend (6), irecv (3)
INTEGER ircnt (0:2)
DATA ircnt/1,2,3/
CALL MPI_INIT (ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
DO i = 1, 6
isend (i) = i + myrank * 10
ENDDO
CALL MPI_REDUCE_SCATTER(isend, irecv, ircnt, MPI_INTEGER,
& MPI_SUM, MPI_COMM_WORLD, ierr)
PRINT *, 'irecv =', irecv
CALL MPI_FINALIZE(ierr)
END
Figure: MPI_REDUCE_SCATTER. The sendbufs on ranks 0, 1 and 2 hold (1..6), (11..16) and (21..26). The element-wise sum is (33, 36, 39, 42, 45, 48); with recvcounts = (1, 2, 3), rank 0 receives 33, rank 1 receives 36 and 39, and rank 2 receives 42, 45 and 48.
Sample execution
$ a.out -procs 3
0: irecv = 33 0 0
1: irecv = 36 39 0
2: irecv = 42 45 48
MPI_REDUCE_SCATTER
Parameters
(CHOICE) sendbuf The starting address of the send buffer (IN)
(CHOICE) recvbuf The starting address of the receive buffer, sendbuf and recvbuf cannot
overlap in memory. (OUT)
INTEGER recvcounts(*) Integer array specifying the number of elements of the result
distributed to each process. Must be identical on all calling processes. (IN)
INTEGER datatype The data type of elements of the input buffer (handle) (IN)
INTEGER op The reduction operation (handle) (IN)
INTEGER comm The communicator (handle) (IN)
INTEGER ierror The Fortran return code
Description: MPI_REDUCE_SCATTER first performs an element-wise reduction on a
vector of count = Σ_i recvcounts(i) elements in the send buffer defined by
sendbuf, count and datatype. Next, the resulting vector is split into n
disjoint segments, where n is the number of members in the group.
Segment i contains recvcounts(i) elements. The i-th segment is sent to
process i and stored in the receive buffer defined by recvbuf,
recvcounts(i) and datatype. MPI_REDUCE_SCATTER is
functionally equivalent to MPI_REDUCE with count equal to the sum of
recvcounts(i), followed by MPI_SCATTERV with
sendcounts equal to recvcounts. All processes in comm need to call this
routine.
MPI_SCAN
Scan
A scan or prefix-reduction operation performs partial reductions on distributed data.
Where:
sbuf: is the starting address of the send buffer,
rbuf: is the starting address of receive buffer,
count: is the number of elements in input buffer,
datatype: is the data type of elements of input buffer
op: is the operation, and
comm: is the group communicator.
MPI_SCAN
Usage: CALL MPI_SCAN (sendbuf, recvbuf, count, datatype, op, comm, ierror)
Parameters
(CHOICE) sendbuf The starting address of the send buffer (IN)
(CHOICE) recvbuf The starting address of the receive buffer; sendbuf and recvbuf cannot overlap in memory (OUT)
INTEGER count The number of elements in sendbuf (IN)
INTEGER datatype The data type of elements of sendbuf (handle) (IN)
INTEGER op The reduction operation (handle) (IN)
INTEGER comm The communicator (handle) (IN)
INTEGER ierror The Fortran return code
Figure: MPI_SCAN with op = MPI_SUM. The sendbufs on ranks 0, 1 and 2 hold 1, 2 and 3; the recvbufs receive the partial sums 1, 3 = 1 + 2, and 6 = 1 + 2 + 3.
Sample program
PROGRAM scan
INCLUDE 'mpif.h'
CALL MPI_INIT (ierr)
CALL MPI_COMM_SIZE (MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
isend = myrank + 1
CALL MPI_SCAN (isend, irecv, 1, MPI_INTEGER,
& MPI_SUM, MPI_COMM_WORLD, ierr)
PRINT *, 'irecv =', irecv
CALL MPI_FINALIZE(ierr)
END
Sample execution
$ a.out -procs 3
0: irecv = 1
1: irecv = 3
2: irecv = 6
User-defined Operations
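A sketch of defining and using one's own reduction operation with MPI_OP_CREATE (the subroutine name MYPROD, the buffers sbuf and rbuf, and the count n are illustrative):

SUBROUTINE MYPROD(invec, inoutvec, len, type)
INTEGER len, type, i
REAL invec(len), inoutvec(len)
DO i = 1, len
inoutvec(i) = invec(i) * inoutvec(i)
END DO
END

C in the calling program:
EXTERNAL MYPROD
INTEGER myop, ierr
CALL MPI_OP_CREATE(MYPROD, .TRUE., myop, ierr)
CALL MPI_REDUCE(sbuf, rbuf, n, MPI_REAL, myop, 0, MPI_COMM_WORLD, ierr)
CALL MPI_OP_FREE(myop, ierr)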
Performance Issues
Synchronization is provided by MPI_BARRIER, which blocks each caller until all processes in the communicator have entered the call.
- C:
int MPI_Barrier (MPI_Comm comm)
- FORTRAN:
MPI_BARRIER (comm, ierr)
where:
MPI_Comm: is an MPI predefined structure for communicators,
comm: is an integer denoting the communicator, and
ierr: is an integer return error code.
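A common performance-related use of the barrier is to synchronize all processes before timing a code region with MPI_WTIME (a sketch; the region being timed is illustrative):

DOUBLE PRECISION t0, t1
CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
t0 = MPI_WTIME()
C code region to be timed
CALL MPI_BARRIER(MPI_COMM_WORLD, ierr)
t1 = MPI_WTIME()
IF (myrank .EQ. 0) PRINT *, 'elapsed seconds =', t1 - t0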
Other MPI Topics
www.msi.umn.edu/user_support/workshop_MPI
www.mpi-forum.org
Help?
call 612-626-0802
e-mail: help@msi.umn.edu
Hands-on
http://www.msi.umn.edu/tutorial/scicomp/general/MPI/workshop_MPI
Login to SDVL
username: temp01 – temp10
passwd: 5Qw#npT<
Login to Power4
ssh -X p655-22
username: temp01 – temp10
passwd: 5Qw#npT<
Questions?