
How to Develop Solaris Parallel Applications

Vijay Tatkar
Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
The GHz Chip Clock Race is Over...
Classic CPU efficiencies:
> Clock speed
> Execution optimization
> Cache

Where is my 10GHz chip?

Design impediments:
> Heat
> Power
> Memory that is slower than the chip


Putting transistors to work in a new way ...
UltraSPARC T2: 1.4GHz * 8 cores (64 threads in a chip)
Intel Penryn: 4 cores * 3.1GHz (4 threads in a chip)
AMD Barcelona: 4 cores * 2.3GHz (4 threads in a chip)

The Multicore Revolution
Every new system now has a multi-core chip in it
Things to know about Parallelism
• Parallel processing is not just for massively parallel supercomputers anymore.
  (HPC ≠ High Priced Computing)
• CPU clock speed doubled every 18 months, whereas memory speed doubled only every
  6 years! Heat, power, and slow memory led to multi-core CPUs.
• The free ride is over for serial programs that rely on the hardware to boost
  performance.
• Parallel programming is the BEST BET for speedups
  > Parallelism is all about performance, first and foremost
  > Program correctness is often harder for parallel programs
• Parallelism is often considered hard, but there are several models to choose
  from, and compiler support for each model to ease the choice.


Programming Model
Shared Memory Model
> OpenMP (de-facto standard)
> Java, native multi-threaded programming
Distributed Memory Model
> Message Passing Interface – MPI (de-facto standard)
> Parallel Virtual Machine – PVM (less popular)
Global Address Space
> Unified Parallel C – UPC (research technology)
Grid Computing
> Sun Grid Computing (www.network.com)
> Sun Grid Engine (www.sun.com/software/gridware)


Compiler Support ... To The Rescue?

[Stack diagram: the Application sits on the Sun Studio developer tools (AutoPar, MT, OpenMP and MPI, plus libumem, atomic operations, Solaris threads, POSIX threads and event ports, ordered from easiest to hardest), which run on Solaris, on UltraSPARC T1/T2, SPARC64 VI, UltraSPARC IV+ and Intel/AMD x86/x64 hardware.]
Automatic Parallelization and Vectorization

[The same stack diagram, with the compiler-driven techniques highlighted: instruction-level parallelism, automatic parallelization, automatic vectorization, and tuned MT libraries.]


Instruction level Parallelism
Chips can dispatch multiple instructions in parallel
Compilers know how to schedule code for such processors
Chips and compilers are very mature in this regard, so no programmer
action is required and the gain is automatic, wherever possible
It IS possible to chew gum and walk at the same time!


Automatic Parallelization
Supports Fortran, C and C++ applications
> First introduced for the 4-20 way SPARCserver 600 MP in 1991
Useful for loop-oriented programs
> Every (nested) loop is analyzed for data dependencies and
  parallelized if it is safe to do so
> Non-loop code fragments are not analyzed
Loops are versioned with serial and parallel code (selected at runtime)
Combines with powerful loop optimizations
> There can be subtle interactions between loop transformations and
  parallelization
> Compilers have limited knowledge about the application
Overall gains can be impressive
> The entire SPECfp 2006 suite gains 16% with PARALLEL=2
> Individual gains can be up to 2x for suitable programs; libquantum from
  SPEC CPU2006 speeds up 6-7x on 8 cores!
> Not every program will see a gain
Automatic Parallelization Options
-xautopar
> Automatic parallelization (Fortran, C and C++ compilers); requires -xO3 or
  higher (-xautopar implies -xdepend)
-xreduction
> Parallelize reduction operations
> Recommended to use -fsimple=2 as well
-xloopinfo
> Show parallelization messages on screen
• Only apply these options to the most time-consuming parts of the program
  (see the sketch below)
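As a rough illustration of these flags, here is a minimal sketch (the file name, the
summation loop and the compile line are illustrative assumptions, not taken from the
slides):

  /* red.c -- a reduction loop; -xautopar alone treats the accumulation into
   * sum as a loop-carried dependence, while adding -xreduction lets the
   * compiler parallelize it.
   * Assumed compile line:
   *   cc -xO4 -xautopar -xreduction -fsimple=2 -xloopinfo -c red.c
   */
  double sum_array(const double *a, int n)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += a[i];        /* reduction on sum */
      return sum;
  }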



AutoPar: SPECfp 2006 improvements
Woodcrest box: 3.0GHz dual-core, PARALLEL=2
Overall gain: 16%

[Bar chart: SPECfp 2006 scores with base flags vs. base flags + autopar for
bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII,
soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf and sphinx3.]


Automatic Vectorization
Supports Fortran, C and C++ applications
-xvector=simd exploits special SSE2+ instructions
Works on data in adjacent memory locations
Gains are smaller than with -xautopar
SPECfp 2006 gains are 3% overall and up to 14% individually
Best suited for loop-level SIMD parallelism:

  for (i = 0; i < 1024; i++)          for (i = 0; i < 1024; i += 4)
      c[i] = a[i] * b[i];       ==>       c[i:i+3] = a[i:i+3] * b[i:i+3];
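For reference, a minimal compilable sketch of a loop that fits this pattern (the file
name and compile line are assumptions, not from the slides):

  /* vec.c -- adjacent (unit-stride) memory accesses make this loop a good
   * candidate for -xvector=simd.
   * Assumed compile line:  cc -fast -xvector=simd -xloopinfo -c vec.c
   */
  #define N 1024
  float a[N], b[N], c[N];

  void mul(void)
  {
      for (int i = 0; i < N; i++)
          c[i] = a[i] * b[i];
  }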


Case Study:
Vectorizing STREAM



Tuned MT Libraries – Sun Perf Lib



Compiler Support: OpenMP

[The same stack diagram, with OpenMP highlighted among the Sun Studio parallel programming models.]


What is OpenMP?
• De-facto industry-standard API for writing shared-memory parallel applications
  in C, C++ and Fortran. See: http://www.openmp.org
• Consists of
  > Compiler directives (pragmas)
  > Runtime routines (libmtsk)
  > Environment variables
• Advantages:
  > Incremental parallelization of source code
  > Small(er) amount of programming effort
  > Good performance and scalability
  > Portable across a variety of vendor compilers
• Sun Studio has consistently led on OpenMP
  > Support for the latest version (2.5 now, v3.0 API underway)
  > Consistent world-record SPEC OMP submissions for several years now
OpenMP – Directives with Intelligence


A Loop Parallelized With OpenMP
C/C++:

  #pragma omp parallel default(none) \
          shared(n, x, y) private(i)
  {
     #pragma omp for
     for (i = 0; i < n; i++)
        x[i] += y[i];
  } /*-- End of parallel region --*/

Fortran:

!$omp parallel default(none) &
!$omp shared(n,x,y) private(i)
!$omp do
      do i = 1, n
         x(i) = x(i) + y(i)
      end do
!$omp end do
!$omp end parallel
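To build and run a loop like this with Sun Studio, the usual pattern is along these
lines (a sketch; the source file name and the thread count are illustrative):

  % cc -fast -xopenmp -xloopinfo omp_loop.c -o omp_loop
  % setenv OMP_NUM_THREADS 4
  % ./omp_loop

-xopenmp enables recognition of the OpenMP directives, and OMP_NUM_THREADS sets the
number of threads used by the parallel region at run time.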
Components Of OpenMP

Directives / Pragmas:
● Parallel regions
● Work sharing
● Synchronization
● Data scope attributes
  > private, firstprivate, lastprivate, shared, reduction
● Orphaning

Runtime environment:
● Number of threads
● Thread ID
● Dynamic thread adjustment
● Nested parallelism
● Timers
● API for locking
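A small sketch of the runtime-environment side, using standard OpenMP routines (the
program itself is illustrative, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      omp_set_dynamic(0);                  /* disable dynamic thread adjustment */
      omp_set_num_threads(4);              /* request 4 threads */

      #pragma omp parallel                 /* a parallel region */
      {
          printf("thread %d of %d\n",
                 omp_get_thread_num(),     /* thread ID */
                 omp_get_num_threads());   /* number of threads */
      }

      printf("timestamp: %f\n", omp_get_wtime());   /* timer routine */
      return 0;
  }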


An OpenMP Example
Find the primes up to 3,000,000 (216,816 primes)
Run on a Sun Fire 6800, Solaris 9, 24 processors (1.2GHz UltraSPARC III+), 9.8GB
main memory

Model    # threads   Time (secs)   % change
Serial   N/A         6.636         Base
OpenMP   1           7.210         8.65% drop
         2           3.771         1.76x faster
         4           1.988         3.34x faster
         8           1.090         6.09x faster
         16          0.638         10.40x faster
         20          0.550         12.06x faster
         24          0.931         saturation drop


Compiler Support: Programming Threads

[The same stack diagram, with native multithreading (Solaris threads and POSIX threads) highlighted.]


Programming Threads
• Use the POSIX APIs – pthread_create, pthread_join,
  pthread_exit, et al.
  > Recommendation: consider reducing the thread stack size
    (the default is 1MB); see the sketch below
  > See pthread_attr_init(3C) for this and other attributes
    which can be adjusted
• Do not use the native Solaris threading API (e.g., thr_create).
  > Applications which use it are still supported, but it is
    non-portable.
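A minimal sketch of creating a thread with a reduced stack size via
pthread_attr_init(3C), as recommended above (the 64KB value and the worker function
are illustrative assumptions):

  #include <pthread.h>

  static void *worker(void *arg)
  {
      (void) arg;                 /* real work would go here */
      return NULL;
  }

  int main(void)
  {
      pthread_attr_t attr;
      pthread_t      tid;

      pthread_attr_init(&attr);
      /* shrink the stack from the 1MB default to 64KB (illustrative value) */
      pthread_attr_setstacksize(&attr, 64 * 1024);

      pthread_create(&tid, &attr, worker, NULL);
      pthread_join(tid, NULL);
      pthread_attr_destroy(&attr);
      return 0;
  }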


Data Synchronization
• Concurrent access to shared data requires synchronization
  > Mutexes (pthread_mutex_lock/pthread_mutex_unlock) – see the sketch below
  > Condition variables (pthread_cond_wait)
  > Reader/writer locks (pthread_rwlock_rdlock/pthread_rwlock_wrlock)
  > Spin locks (pthread_spin_lock)
• Synchronization objects can be local to a process or shared between
  processes via shared memory.
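A minimal sketch of mutex-based synchronization around a shared counter (the names
are illustrative):

  #include <pthread.h>

  static int             counter = 0;            /* shared data */
  static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

  void increment_counter(void)
  {
      pthread_mutex_lock(&counter_lock);          /* enter critical section */
      counter++;                                  /* only one thread at a time */
      pthread_mutex_unlock(&counter_lock);        /* leave critical section */
  }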


MT Demo
Multithreading Primes



/* pflag[] is a global flag array shared by all threads:
   pflag[i] == 0 marks i as a known composite */
int is_prime(int v)
{
    int i;
    int bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        /* No need to check against known composites */
        if (!pflag[i])
            continue;
        if (v % i == 0) {
            pflag[v] = 0;
            return 0;
        }
    }
    return (v > 1);
}
void *work(void *arg)
{
    int start;
    int end;
    int i;
    int val = *((int *) arg);

    start = (N/THREADS) * val;
    end   = start + N/THREADS;
    for (i = start; i < end; i++) {
        if (is_prime(i)) {
            /* total and primes[] are shared -- these unsynchronized
               updates are the data race discussed on the next slides */
            primes[total] = i;
            total++;
        }
    }
    return NULL;
}
int main(int argc, char** argv)
{
    int i;
    pthread_t tids[THREADS-1];

    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }
    for (i = 0; i < (THREADS-1); i++) {
        pthread_create(&tids[i], NULL, work, (void *) &i);  /* passes a pointer to the loop counter */
    }
    i = THREADS - 1;
    work((void *) &i);
    for (i = 0; i < (THREADS-1); i++) {
        pthread_join(tids[i], NULL);
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
    return 0;
}


STOP!
Problem Ahead
RDT Demo, please



Data Race Condition
• A data race condition occurs when
  > multiple threads access a shared memory location
  > without a synchronized access order
  > and at least one of the accesses is a write
• Data races often occur in shared-memory parallel programming
  models such as Pthreads and OpenMP.
  > The effect of a data race is unpredictable and may show up
    only once in hundreds of runs.


Thread Analyzer
Detects data races and deadlocks in a multithreaded application
> Points to non-deterministic or incorrect execution
> Such bugs are notoriously difficult to find by code inspection
> Points out actual and potential deadlock situations

Process (see the sketch below):
> Instrument the code with -xinstrument=datarace
> Detect the runtime condition with collect -r all (or -r race)
> Use the graphical analyzer, tha, to identify conflicts and critical
  regions

Works with OpenMP, Pthreads and Solaris threads
> API provided for user-defined synchronization primitives
Works on Solaris (SPARC, x86/x64) and Linux
Static lock_lint tool detects inconsistent use of locks
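A sketch of the workflow on the threaded primes demo (the source file and experiment
names are assumptions; the flags are the ones listed above):

  % cc -g -xinstrument=datarace prime_pthr.c -lpthread -o prime_pthr
  % collect -r race ./prime_pthr
  % tha tha.1.er

collect records a race-detection experiment (tha.1.er is an assumed experiment name),
and tha displays the detected races together with the source lines involved.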


A True SPEC Story
SPEC OMP benchmark fma3d
> 101 source files; 61,000 lines of Fortran code
> A data race in platq.f90 caused sporadic core dumps
> It took several engineers 6 weeks of work to find the data race manually

Perils of having a data race condition:
> The program exhibits non-deterministic behavior
> The failure may be hard to reproduce
> The program may continue to execute, leading to a failure in unrelated code
> A data race is hard to detect using conventional debugging methods and tools


How did Thread Analyzer help?

SPEC OMP benchmark fma3d:
> 101 source files; 61,000 lines of Fortran code
> A data race in platq.f90 caused sporadic core dumps
> It took several engineers 6 weeks of work to find the data race manually

With the Sun Studio Thread Analyzer, the data race was detected in just a few hours!


Compiler Support: Message Passing Interface

[The same stack diagram, with MPI highlighted among the Sun Studio parallel programming models.]


Message Passing Interface (MPI)
The MPI programming model is a de-facto standard for
distributed-memory parallel programming
The MPI API set is quite large (323 subroutines)
> An MPI application can be programmed with fewer than 10 different calls
  (see the sketch below)
> Implemented with a very small set of low-level device interconnect routines
Open MPI: http://www.open-mpi.org/
MPI home page at Argonne National Laboratory:
http://www-unix.mcs.anl.gov/mpi/
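A minimal sketch of an MPI program that uses just four of those calls (the file name
is an assumption; launching with mpirun is shown on a later slide):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);                  /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank   */
      MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */

      printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();                          /* shut the runtime down */
      return 0;
  }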


Message Passing Interface (MPI)
• MPI 2.0 conformance via Open MPI
• Sun HPC ClusterTools 7.0 works with Sun Studio
• Multiple processes run under the Open Run-Time Environment (ORTE)
• Data messages pass between processes in point-to-point or block (collective)
  communication modes
• No race conditions with correct use of MPI message-passing calls
• MPI profiling under the Performance Analyzer


Launching MPI application
• For Single Program Multiple Data (SPMD):
  > mpirun -np x program1
• For Multiple Program Multiple Data (MPMD):
  > mpirun -np x program1 : -np y program2
• Launching on different nodes (hosts):
  > mpirun -np x -host <machineA,...> program1
• And more ... a very flexible way of launching


Comparing OpenMP and MPI
OpenMP                              MPI
----------------------------------  ----------------------------------
De-facto industry standard          De-facto industry standard
Limited to one (SMP) system         Runs on any number of systems
Not (yet?) GRID-ready               GRID-ready
Easier to get started               High and steep learning curve
Assistance from compilers           You're on your own
Mix-and-match model                 All-or-nothing model
Requires data scoping               No data scoping required
Increasingly popular (CMT?)         More widely used (but ...)
Preserves sequential code           No sequential version
Needs a compiler                    No compiler; just a library
No special environment              Requires a runtime environment
Performance issues implicit         Easy to control performance
Thank you!

Vijay Tatkar
Sr. Engineering Manager
Sun Studio Developer Tools
http://blogs.sun.com/tatkar
Case Study:
AutoPar Matrix Multiply



AutoPar Example Program
// Matrix Multiplication
32  #define MAX 1024
33  void matrix_mul(float (*x_mat)[MAX],
34      float (*y_mat)[MAX], float (*z_mat)[MAX]) {
35
36      for (int j = 0; j < MAX; j++) {
37          for (int k = 0; k < MAX; k++) {
38              z_mat[j][k] = 0.0;
39              for (int t = 0; t < MAX; t++) {
40                  z_mat[j][k] += x_mat[j][t] * y_mat[t][k];
41              }
42          }
43      }
44  }


AutoPar Example Compilation

CC -c mat_mul.cc -g -fast -xrestrict -xautopar -xloopinfo -o mat_mul.o

"mat_mul.cc", line 36: PARALLELIZED
"mat_mul.cc", line 37: not parallelized, not profitable
"mat_mul.cc", line 39: not parallelized, unsafe dependence

The er_src command can be run on the executable binary to see the
compiler's internal commentary.


% CC mat_mul.cc -g -fast -xrestrict -xinline=no -o noautopar
% CC mat_mul.cc -g -fast -xrestrict -xloopinfo -xautopar -xinline=no -o autopar

% ptime noautopar
Finish multiplication of matrix of 1024
real        1.536
user        1.521
sys         0.018

% ptime autopar
Finish multiplication of matrix of 1024
real        1.542
user        1.520
sys         0.016

% setenv PARALLEL 2
% ptime autopar
Finish multiplication of matrix of 1024
real        0.817
user        1.572
sys         0.016
OpenMP Demo
Parallelizing Primes



Parallelizing Primes Example (OpenMP)
• Partition the problem space into smaller chunks and dispatch the processing
  of each partition as individual (micro)tasks
  > A popular and practical example of how parallel software deals with
    large data
  > The basic design concept of this example can be applied to many other
    parallel processing tasks
  > The overall program structure is very simple:
    > a thread worker routine
    > a main program creating multiple worker threads/microtasks


int main_omp(int argc, char** argv)
{
    int i;

#ifdef _OPENMP
    omp_set_num_threads(NTHRS);
    omp_set_dynamic(0);
#endif
    for (i = 0; i < N; i++) {
        pflag[i] = 1;
    }

    #pragma omp parallel for
    for (i = 2; i < N; i++) {
        if (is_prime(i)) {
            /* unsynchronized updates to the shared total and primes[] */
            primes[total] = i;
            total++;
        }
    }
    printf("Number of prime numbers between 2 and %d: %d\n", N, total);
}
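Note that the updates to the shared total and primes[] above are unsynchronized (the
same kind of data race discussed earlier). One minimal way to make those updates safe,
at the cost of some serialization, is an OpenMP critical section around them (a
sketch, not the original demo code):

  #pragma omp parallel for
  for (i = 2; i < N; i++) {
      if (is_prime(i)) {
          #pragma omp critical      /* one thread at a time updates the shared list */
          {
              primes[total] = i;
              total++;
          }
      }
  }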


int is_prime(int v)

int is_prime(int v)
{
    int i, bound = floor(sqrt((double)v)) + 1;

    for (i = 2; i < bound; i++) {
        /* No need to check against known composites */
        if (!pflag[i])
            continue;
        if (v % i == 0) {
            pflag[v] = 0;
            return 0;
        }
    }
    return (v > 1);
}


General Race Condition
• A general race condition is caused by an undetermined sequence of
  executions that violates program state integrity
  > A data race is a simple form of general race condition
  > A general race problem can occur in both shared-memory and
    distributed-memory parallel programming


General Race Example
void shuffle_objects(...) {
    // remove target objects from the source container
    mutex_lock();
    source.remove_array( target_objects );
    mutex_unlock();
    // Here the containers are in an unstable state, which may cause a general race
    // add target objects to the destination container
    mutex_lock();
    destination.add_array( target_objects );
    mutex_unlock();
}
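One way to remove the unstable intermediate state is to hold the lock across both
container operations, so no other thread can observe the objects belonging to neither
container (a sketch in the same pseudocode style as the example above):

  void shuffle_objects(...) {
      mutex_lock();
      // move the target objects in one critical section
      source.remove_array( target_objects );
      destination.add_array( target_objects );
      mutex_unlock();
  }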


Design Practice to Avoid Races
• Adopt a higher design abstraction such as OpenMP
• Use pass-by-value instead of pass-by-pointer to communicate between threads
• Design data structures to limit global variable usage and restrict access
  to shared memory
• Analyze a race problem to decide whether it is a harmful program bug or a
  benign race
• Understand and fix the real cause of a race condition instead of fixing
  its symptom


MPI: Single Program Multiple Data
• The processes launched are in the same communicator
  > mpirun -np 8 msorts
  > The 8 processes launched belong to the MPI_COMM_WORLD communicator
  > 8 ranks: 0, 1, 2, 3, 4, 5, 6, 7
  > Total size: 8
• All 8 processes run the same program; control flow differs by checking
  the rank:

  MPI_Init(...);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (rank == 0) {
      ...
  } else if (rank == 1) {
      ...
  } else if (rank == 2) {
      ...
  }
  MPI_Finalize();
MPI Example: 7 Sorting Processes
Eight processes all together: a driver process plus seven sorters –
Bubblesort, Binary Insertion Sort, Heapsort, Quicksort, Shakersort,
Straight Insertion Sort and Straight Selection Sort.


MPI Demo
7 Sorting Processes



MPI: Non-Uniform Memory Performance

[Figure: performance vs. working-set size across the memory hierarchy
(registers, L1 cache, L2 cache, main memory, virtual memory).]
The length of each plateau is related to the size of that memory component.
The size of each drop is related to the latency (or bandwidth) of that
memory component.
MPI decomposition can help reduce the per-process program size so that it
fits into the faster regions – that is the tuning area.


Sun Studio and HPC
Sun HPC: http://www.sun.com/servers/HPC/index.jsp
Sun HPC ClusterTools 7 software:
http://www.sun.com/software/products/clustertools
N1 Grid Engine manager software
Other MPI libraries:
> Open-source MPICH library for Solaris SPARC:
  http://www-unix.mcs.anl.gov/mpi/mpich
> LAM/MPI ported library for Solaris x86/x64:
  http://apstc.sun.com.sg/popup.php?l1=research&l2=projects&l3=s10port&f=applications#LAM/MPI
> MVAPICH – MPI over InfiniBand for Solaris x86/x64:
  http://nowlab.cse.ohio-state.edu/projects/mpi-iba


Parallel Computing Environment

[Diagram: parallel computing environments arranged from tightly coupled to
loosely coupled – multi-threaded (serial, MT and OpenMP applications) on one
system, multi-process MPI applications on a local cluster (with UPC/GAS
spanning both), grids (N1 Grid), and Grid & SOA with Web Services at the
global and enterprise level.]
