
How to Develop Solaris Parallel Applications

Vijay Tatkar
Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
The GHz Chip Clock Race is Over...
Classic CPU efficiencies:
> Clock speed
> Execution optimization
> Cache

Where is my 10GHz chip?

Design impediments:
> Heat
> Power
> Memory that is slower than the chip


Putting transistors to work in a new way ...
UltraSPARC T2: 1.4GHz * 8 cores (64 threads in a chip)
Intel Penryn: 4 cores * 3.1GHz (4 threads in a chip)
AMD Barcelona: 4 cores * 2.3GHz (4 threads in a chip)

The Multicore Revolution
Every new system now has a multi-core chip in it
Things to know about Parallelism
• Parallel processing is not just for massively parallel supercomputers anymore.
  (HPC ≠ High Priced Computing)
• CPU clock speed doubled every 18 months, whereas memory speed doubled only every
  6 years! Heat, power, and slow memory led to multi-core CPUs.
• The free ride is over for serial programs that rely on the hardware to boost
  performance.
• Parallel programming is the BEST BET for speedups
  > Parallelism is all about performance, first and foremost
  > Program correctness is often harder for parallel programs
• Parallelism is often considered hard, but there are several models to choose
  from, and compiler support for each model to ease the choice.


Programming Model
Shared Memory Model
> OpenMP (de-facto standard)
> Java, native multi-threaded programming
Distributed Memory Model
> Message Passing Interface – MPI (de-facto standard)
> Parallel Virtual Machine – PVM (less popular)
Global Address Space
> Unified Parallel C – UPC (research technology)
Grid Computing
> Sun Grid Computing (www.network.com)
> Sun Grid Engine (www.sun.com/software/gridware)


Compiler Support ... To The Rescue?

[Stack diagram: the Application sits on the Sun Studio developer tools (AutoPar, MT, OpenMP and MPI, plus libumem, atomic operations, Solaris threads, POSIX threads and event ports, ordered from easiest to hardest), which run on Solaris, on UltraSPARC T1/T2, SPARC64 VI, UltraSPARC IV+ and Intel/AMD x86/x64 hardware.]
Automatic Parallelization and Vectorization

[The same stack diagram, with the compiler-driven techniques highlighted: instruction-level parallelism, automatic parallelization, automatic vectorization, and tuned MT libraries.]


Instruction level Parallelism
Chips can dispatch multiple instructions in parallel
Compilers know how to schedule code for such processors
Chips and compilers are very mature in this regard, so no programmer
action is required and the gain is automatic, wherever possible
It IS possible to chew gum and walk at the same time!


Automatic Parallelization
Supports Fortran, C and C++ applications
> First introduced for the 4-20 way SPARCserver 600 MP in 1991
Useful for loop-oriented programs
> Every (nested) loop is analyzed for data dependencies and
  parallelized if it is safe to do so
> Non-loop code fragments are not analyzed
Loops are versioned with serial and parallel code (selected at runtime)
Combines with powerful loop optimizations
> There can be subtle interactions between loop transformations and
  parallelization
> Compilers have limited knowledge about the application
Overall gains can be impressive
> The entire SPECfp 2006 suite gains 16% with PARALLEL=2
> Individual gains can be up to 2x for suitable programs; libquantum from
  SPEC CPU2006 speeds up 6-7x on 8 cores!
> Not every program will see a gain
Automatic Parallelization Options
-xautopar
> Automatic parallelization (Fortran, C and C++ compilers); requires -xO3 or
  higher (-xautopar implies -xdepend)
-xreduction
> Parallelize reduction operations
> Recommended to use -fsimple=2 as well
-xloopinfo
> Show parallelization messages on screen
• Only apply these options to the most time-consuming parts of the program
  (see the sketch below)
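As a rough illustration of these flags, here is a minimal sketch (the file name, the
summation loop and the compile line are illustrative assumptions, not taken from the
slides):

  /* red.c -- a reduction loop; -xautopar alone treats the accumulation into
   * sum as a loop-carried dependence, while adding -xreduction lets the
   * compiler parallelize it.
   * Assumed compile line:
   *   cc -xO4 -xautopar -xreduction -fsimple=2 -xloopinfo -c red.c
   */
  double sum_array(const double *a, int n)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += a[i];        /* reduction on sum */
      return sum;
  }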



AutoPar: SPECfp 2006 improvements
Woodcrest box: 3.0GHz dual-core, PARALLEL=2
Overall gain: 16%

[Bar chart: SPECfp 2006 scores with base flags vs. base flags + autopar for
bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII,
soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf and sphinx3.]


Automatic Vectorization
Supports Fortran, C and C++ applications
-xvector=simd exploits special SSE2+ instructions
Works on data in adjacent memory locations
Gains are smaller than with -xautopar
SPECfp 2006 gains are 3% overall and up to 14% individually
Best suited for loop-level SIMD parallelism:

  for (i = 0; i < 1024; i++)          for (i = 0; i < 1024; i += 4)
      c[i] = a[i] * b[i];       ==>       c[i:i+3] = a[i:i+3] * b[i:i+3];
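For reference, a minimal compilable sketch of a loop that fits this pattern (the file
name and compile line are assumptions, not from the slides):

  /* vec.c -- adjacent (unit-stride) memory accesses make this loop a good
   * candidate for -xvector=simd.
   * Assumed compile line:  cc -fast -xvector=simd -xloopinfo -c vec.c
   */
  #define N 1024
  float a[N], b[N], c[N];

  void mul(void)
  {
      for (int i = 0; i < N; i++)
          c[i] = a[i] * b[i];
  }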


Case Study:
Vectorizing STREAM



Tuned MT Libraries – Sun Perf Lib



Compiler Support: OpenMP

[The same stack diagram, with OpenMP highlighted among the Sun Studio parallel programming models.]


What is OpenMP?
• De-facto industry-standard API for writing shared-memory parallel applications
  in C, C++ and Fortran. See: http://www.openmp.org
• Consists of
  > Compiler directives (pragmas)
  > Runtime routines (libmtsk)
  > Environment variables
• Advantages:
  > Incremental parallelization of source code
  > Small(er) amount of programming effort
  > Good performance and scalability
  > Portable across a variety of vendor compilers
• Sun Studio has consistently led on OpenMP
  > Support for the latest version (2.5 now, v3.0 API underway)
  > Consistent world-record SPEC OMP submissions for several years now
OpenMP – Directives with Intelligence


A Loop Parallelized With OpenMP
C/C++:

  #pragma omp parallel default(none) \
          shared(n, x, y) private(i)
  {
     #pragma omp for
     for (i = 0; i < n; i++)
        x[i] += y[i];
  } /*-- End of parallel region --*/

Fortran:

!$omp parallel default(none) &
!$omp shared(n,x,y) private(i)
!$omp do
      do i = 1, n
         x(i) = x(i) + y(i)
      end do
!$omp end do
!$omp end parallel
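To build and run a loop like this with Sun Studio, the usual pattern is along these
lines (a sketch; the source file name and the thread count are illustrative):

  % cc -fast -xopenmp -xloopinfo omp_loop.c -o omp_loop
  % setenv OMP_NUM_THREADS 4
  % ./omp_loop

-xopenmp enables recognition of the OpenMP directives, and OMP_NUM_THREADS sets the
number of threads used by the parallel region at run time.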
Components Of OpenMP

Directives / Pragmas:
● Parallel regions
● Work sharing
● Synchronization
● Data scope attributes
  > private, firstprivate, lastprivate, shared, reduction
● Orphaning

Runtime environment:
● Number of threads
● Thread ID
● Dynamic thread adjustment
● Nested parallelism
● Timers
● API for locking
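A small sketch of the runtime-environment side, using standard OpenMP routines (the
program itself is illustrative, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      omp_set_dynamic(0);                  /* disable dynamic thread adjustment */
      omp_set_num_threads(4);              /* request 4 threads */

      #pragma omp parallel                 /* a parallel region */
      {
          printf("thread %d of %d\n",
                 omp_get_thread_num(),     /* thread ID */
                 omp_get_num_threads());   /* number of threads */
      }

      printf("timestamp: %f\n", omp_get_wtime());   /* timer routine */
      return 0;
  }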


An OpenMP Example
Find the primes up to 3,000,000 (216,816 primes)
Run on a Sun Fire 6800, Solaris 9, 24 processors (1.2GHz UltraSPARC III+), 9.8GB
main memory

Model    # threads   Time (secs)   % change
Serial   N/A         6.636         Base
OpenMP   1           7.210         8.65% drop
         2           3.771         1.76x faster
         4           1.988         3.34x faster
         8           1.090         6.09x faster
         16          0.638         10.40x faster
         20          0.550         12.06x faster
         24          0.931         saturation drop


Compiler Support: Programming Threads

[The same stack diagram, with native multithreading (Solaris threads and POSIX threads) highlighted.]


Programming Threads
• Use the POSIX APIs – pthread_create, pthread_join,
  pthread_exit, et al.
  > Recommendation: consider reducing the thread stack size
    (the default is 1MB); see the sketch below
  > See pthread_attr_init(3C) for this and other attributes
    which can be adjusted
• Do not use the native Solaris threading API (e.g., thr_create).
  > Applications which use it are still supported, but it is
    non-portable.
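A minimal sketch of creating a thread with a reduced stack size via
pthread_attr_init(3C), as recommended above (the 64KB value and the worker function
are illustrative assumptions):

  #include <pthread.h>

  static void *worker(void *arg)
  {
      (void) arg;                 /* real work would go here */
      return NULL;
  }

  int main(void)
  {
      pthread_attr_t attr;
      pthread_t      tid;

      pthread_attr_init(&attr);
      /* shrink the stack from the 1MB default to 64KB (illustrative value) */
      pthread_attr_setstacksize(&attr, 64 * 1024);

      pthread_create(&tid, &attr, worker, NULL);
      pthread_join(tid, NULL);
      pthread_attr_destroy(&attr);
      return 0;
  }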


Data Synchronization
• Concurrent access to shared data requires synchronization
  > Mutexes (pthread_mutex_lock/pthread_mutex_unlock) – see the sketch below
  > Condition variables (pthread_cond_wait)
  > Reader/writer locks (pthread_rwlock_rdlock/pthread_rwlock_wrlock)
  > Spin locks (pthread_spin_lock)
• Synchronization objects can be local to a process or shared between
  processes via shared memory.
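A minimal sketch of mutex-based synchronization around a shared counter (the names
are illustrative):

  #include <pthread.h>

  static int             counter = 0;            /* shared data */
  static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

  void increment_counter(void)
  {
      pthread_mutex_lock(&counter_lock);          /* enter critical section */
      counter++;                                  /* only one thread at a time */
      pthread_mutex_unlock(&counter_lock);        /* leave critical section */
  }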


MT Demo
Multithreading Primes



/* pflag[] is a global flag array shared by all threads:
   pflag[i] == 0 marks i as a known composite */
int is_prime(int v)
{
    int i;
    int bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        /* No need to check against known composites */
        if (!pflag[i])
            continue;
        if (v % i == 0) {
            pflag[v] = 0;
            return 0;
        }
    }
    return (v > 1);
}
void *work(void *arg)
{
    int start;
    int end;
    int i;
    int val = *((int *) arg);

    start = (N/THREADS) * val;
    end   = start + N/THREADS;
    for (i = start; i < end; i++) {
        if (is_prime(i)) {
            /* total and primes[] are shared -- these unsynchronized
               updates are the data race discussed on the next slides */
            primes[total] = i;
            total++;
        }
    }
    return NULL;
}
int main(int argc, char** argv)
{
    int i;
    pthread_t tids[THREADS-1];

    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    for (i = 0; i < (THREADS-1); i++) {
        pthread_create(&tids[i], NULL, work, (void *) &i);  /* passes a pointer to the loop counter */
    }
    i = THREADS - 1;
    work((void *) &i);
    for (i = 0; i < (THREADS-1); i++) {
        pthread_join(tids[i], NULL);
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
    return 0;
}


STOP!
Problem Ahead
RDT Demo, please



Data Race Condition
• A data race condition occurs when
  > multiple threads access a shared memory location
  > without a synchronized access order
  > and at least one of the accesses is a write
• Data races often occur in shared-memory parallel programming
  models such as Pthreads and OpenMP.
  > The effect of a data race is unpredictable and may show up
    only once in hundreds of runs.


Thread Analyzer
Detects data races and deadlocks in a multithreaded application
> Points to non-deterministic or incorrect execution
> Such bugs are notoriously difficult to find by code inspection
> Points out actual and potential deadlock situations

Process (see the sketch below):
> Instrument the code with -xinstrument=datarace
> Detect the runtime condition with collect -r all (or -r race)
> Use the graphical analyzer, tha, to identify conflicts and critical
  regions

Works with OpenMP, Pthreads and Solaris threads
> API provided for user-defined synchronization primitives
Works on Solaris (SPARC, x86/x64) and Linux
Static lock_lint tool detects inconsistent use of locks
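A sketch of the workflow on the threaded primes demo (the source file and experiment
names are assumptions; the flags are the ones listed above):

  % cc -g -xinstrument=datarace prime_pthr.c -lpthread -o prime_pthr
  % collect -r race ./prime_pthr
  % tha tha.1.er

collect records a race-detection experiment (tha.1.er is an assumed experiment name),
and tha displays the detected races together with the source lines involved.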


A True SPEC Story
SPEC OMP benchmark fma3d
> 101 source files; 61,000 lines of Fortran code
> A data race in platq.f90 caused sporadic core dumps
> It took several engineers 6 weeks of work to find the data race manually

Perils of having a data race condition:
> The program exhibits non-deterministic behavior
> The failure may be hard to reproduce
> The program may continue to execute, leading to a failure in unrelated code
> A data race is hard to detect using conventional debugging methods and tools


How did Thread Analyzer help?

SPEC OMP benchmark fma3d:
> 101 source files; 61,000 lines of Fortran code
> A data race in platq.f90 caused sporadic core dumps
> It took several engineers 6 weeks of work to find the data race manually

With the Sun Studio Thread Analyzer, the data race was detected in just a few hours!


Compiler Support: Message Passing Interface

[The same stack diagram, with MPI highlighted among the Sun Studio parallel programming models.]


Message Passing Interface (MPI)
The MPI programming model is a de-facto standard for
distributed-memory parallel programming
The MPI API set is quite large (323 subroutines)
> An MPI application can be programmed with fewer than 10 different calls
  (see the sketch below)
> Implemented with a very small set of low-level device interconnect routines
Open MPI: http://www.open-mpi.org/
MPI home page at Argonne National Laboratory:
http://www-unix.mcs.anl.gov/mpi/
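A minimal sketch of an MPI program that uses just four of those calls (the file name
is an assumption; launching with mpirun is shown on a later slide):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                  /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank   */
      MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */

      printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();                          /* shut the runtime down */
      return 0;
  }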


Message Passing Interface (MPI)
• MPI 2.0 conformance via Open MPI
• Sun HPC ClusterTools 7.0 works with Sun Studio
• Multiple processes run under the Open Run-Time Environment (ORTE)
• Data messages pass between processes in point-to-point or block (collective)
  communication modes
• No race conditions with correct use of MPI message-passing calls
• MPI profiling under the Performance Analyzer


Launching MPI application
• For Single Program Multiple Data (SPMD):
  > mpirun -np x program1
• For Multiple Program Multiple Data (MPMD):
  > mpirun -np x program1 : -np y program2
• Launching on different nodes (hosts):
  > mpirun -np x -host <machineA,...> program1
• And more ... a very flexible way of launching


Comparing OpenMP and MPI
OpenMP                              MPI
----------------------------------  ----------------------------------
De-facto industry standard          De-facto industry standard
Limited to one (SMP) system         Runs on any number of systems
Not (yet?) GRID-ready               GRID-ready
Easier to get started               High and steep learning curve
Assistance from compilers           You're on your own
Mix-and-match model                 All-or-nothing model
Requires data scoping               No data scoping required
Increasingly popular (CMT?)         More widely used (but ...)
Preserves sequential code           No sequential version
Needs a compiler                    No compiler; just a library
No special environment              Requires a runtime environment
Performance issues implicit         Easy to control performance
Thank you!

Vijay Tatkar
Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
Case Study:
AutoPar Matrix Multiply



AutoPar Example Program
// Matrix Multiplication
32  #define MAX 1024
33  void matrix_mul(float (*x_mat)[MAX],
34      float (*y_mat)[MAX], float (*z_mat)[MAX]) {
35
36      for (int j = 0; j < MAX; j++) {
37          for (int k = 0; k < MAX; k++) {
38              z_mat[j][k] = 0.0;
39              for (int t = 0; t < MAX; t++) {
40                  z_mat[j][k] += x_mat[j][t] * y_mat[t][k];
41              }
42          }
43      }
44  }


AutoPar Example Compilation

CC -c mat_mul.cc -g -fast -xrestrict -xautopar -xloopinfo -o mat_mul.o

"mat_mul.cc", line 36: PARALLELIZED
"mat_mul.cc", line 37: not parallelized, not profitable
"mat_mul.cc", line 39: not parallelized, unsafe dependence

The er_src command can be run on the executable binary to see the
compiler's internal commentary.


% CC mat_mul.cc -g -fast -xrestrict -xinline=no -o noautopar
% CC mat_mul.cc -g -fast -xrestrict -xloopinfo -xautopar -xinline=no -o autopar

% ptime noautopar
Finish multiplication of matrix of 1024
real        1.536
user        1.521
sys         0.018

% ptime autopar
Finish multiplication of matrix of 1024
real        1.542
user        1.520
sys         0.016

% setenv PARALLEL 2
% ptime autopar
Finish multiplication of matrix of 1024
real        0.817
user        1.572
sys         0.016
OpenMP Demo
Parallelizing Primes



Parallelizing Primes Example (OpenMP)
• Partition the problem space into smaller chunks and dispatch the processing
  of each partition as individual (micro)tasks
  > A popular and practical example of how parallel software deals with
    large data
  > The basic design concept of this example can be applied to many other
    parallel processing tasks
  > The overall program structure is very simple:
    > a thread worker routine
    > a main program creating multiple worker threads/microtasks


int main_omp(int argc, char** argv)
{
    int i;

#ifdef _OPENMP
    omp_set_num_threads(NTHRS);
    omp_set_dynamic(0);
#endif
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }

    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            /* unsynchronized updates to the shared total and primes[] */
            primes[total] = i;
            total++;
        }
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
}
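Note that the updates to the shared total and primes[] above are unsynchronized (the
same kind of data race discussed earlier). One minimal way to make those updates safe,
at the cost of some serialization, is an OpenMP critical section around them (a
sketch, not the original demo code):

  #pragma omp parallel for
  for (i = 2; i < N; i++) {
      if (is_prime(i)) {
          #pragma omp critical      /* one thread at a time updates the shared list */
          {
              primes[total] = i;
              total++;
          }
      }
  }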


int is_prime(int v)

int is_prime(int v)
{
    int i, bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        /* No need to check against known composites */
        if (!pflag[i])
            continue;
        if (v % i == 0) {
            pflag[v] = 0;
            return 0;
        }
    }
    return (v > 1);
}


General Race Condition
• A general race condition is caused by an undetermined sequence of
  executions that violates program state integrity
  > A data race is a simple form of general race condition
  > A general race problem can occur in both shared-memory and
    distributed-memory parallel programming


General Race Example
void shuffle_objects(...) {
    // remove target objects from the source container
    mutex_lock();
    source.remove_array( target_objects );
    mutex_unlock();
    // Here the containers are in an unstable state, which may cause a general race
    // add target objects to the destination container
    mutex_lock();
    destination.add_array( target_objects );
    mutex_unlock();
}
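One way to remove the unstable intermediate state is to hold the lock across both
container operations, so no other thread can observe the objects belonging to neither
container (a sketch in the same pseudocode style as the example above):

  void shuffle_objects(...) {
      mutex_lock();
      // move the target objects in one critical section
      source.remove_array( target_objects );
      destination.add_array( target_objects );
      mutex_unlock();
  }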


Design Practice to Avoid Races
• Adopt a higher design abstraction such as OpenMP
• Use pass-by-value instead of pass-by-pointer to communicate between threads
• Design data structures to limit global variable usage and restrict access
  to shared memory
• Analyze a race problem to decide whether it is a harmful program bug or a
  benign race
• Understand and fix the real cause of a race condition instead of fixing
  its symptom


MPI: Single Program Multiple Data
• The processes launched are in the same communicator
  > mpirun -np 8 msorts
  > The 8 processes launched belong to the MPI_COMM_WORLD communicator
  > 8 ranks: 0, 1, 2, 3, 4, 5, 6, 7
  > Total size: 8
• All 8 processes run the same program; control flow differs by checking
  the rank:

  MPI_Init(...);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (rank == 0) {
      ...
  } else if (rank == 1) {
      ...
  } else if (rank == 2) {
      ...
  }
  MPI_Finalize();
MPI Example: 7 Sorting Processes
Eight processes all together: a driver process plus seven sorters –
Bubblesort, Binary Insertion Sort, Heapsort, Quicksort, Shakersort,
Straight Insertion Sort and Straight Selection Sort.


MPI Demo
7 Sorting Processes



MPI: Non-Uniform Memory Performance

[Figure: performance vs. working-set size across the memory hierarchy
(registers, L1 cache, L2 cache, main memory, virtual memory).]
The length of each plateau is related to the size of that memory component.
The size of each drop is related to the latency (or bandwidth) of that
memory component.
MPI decomposition can help reduce the per-process program size so that it
fits into the faster regions – that is the tuning area.


Sun Studio and HPC
Sun HPC: http://www.sun.com/servers/HPC/index.jsp
Sun HPC ClusterTools 7 software:
http://www.sun.com/software/products/clustertools
N1 Grid Engine manager software
Other MPI libraries:
> Open-source MPICH library for Solaris SPARC:
  http://www-unix.mcs.anl.gov/mpi/mpich
> LAM/MPI ported library for Solaris x86/x64:
  http://apstc.sun.com.sg/popup.php?l1=research&l2=projects&l3=s10port&f=applications#LAM/MPI
> MVAPICH – MPI over InfiniBand for Solaris x86/x64:
  http://nowlab.cse.ohio-state.edu/projects/mpi-iba


Parallel Computing Environment

[Diagram: parallel computing environments arranged from tightly coupled to
loosely coupled – multi-threaded (serial, MT and OpenMP applications) on one
system, multi-process MPI applications on a local cluster (with UPC/GAS
spanning both), grids (N1 Grid), and Grid & SOA with Web Services at the
global and enterprise level.]
