# Parallel Processing

Kudang B. Seminar

## The Need for High-Performance Computers

Weather forecasting
Aerodynamics
Artificial intelligence: robotics
Genetic engineering

The example applications above involve intensive computation and require high-performance computers.

## Example 1: Weather Prediction

Area, segments: 3000*3000*11 cubic miles, in segments of .1*.1*.1 cubic mile: ~ 10^11 segments
Two-day prediction in half-hour periods: ~ 100 periods
Computation per segment: temperature, pressure, humidity, wind speed, wind direction
Assume ~ 100 FLOPs per segment

## Performance: Weather Prediction

Computational requirement: 10^15 FLOPs (10^11 segments * 100 periods * 100 FLOPs)
Serial supercomputer: 10^9 instr/sec
Total serial time: 10^6 sec = ~280 hours
Not too good for a 48 hour weather prediction
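The serial estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the slide's rounded figures; it assumes one FLOP costs one instruction, and the variable names are illustrative only.

```python
# Back-of-the-envelope serial estimate for the two-day weather prediction.
# Assumption: one FLOP takes one instruction on the serial machine.
segments = 3000 * 3000 * 11 / (0.1 * 0.1 * 0.1)  # grid cells, ~1e11
periods = 100                # 48 hours in half-hour steps (96, rounded up)
flops_per_segment = 100      # assumed cost of updating one segment
serial_rate = 1e9            # instructions per second

total_flops = segments * periods * flops_per_segment
serial_seconds = total_flops / serial_rate
serial_hours = serial_seconds / 3600

print(f"segments    ~ {segments:.1e}")
print(f"total FLOPs ~ {total_flops:.1e}")
print(f"serial time ~ {serial_seconds:.1e} s = {serial_hours:.0f} hours")
```

The exact grid count is slightly below 10^11, so the result lands a little under the slide's rounded 280 hours; either way it is far more than the 48 hours being predicted.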

## Parallel Weather Prediction

1 K workstations, grid connected
10^8 segment computations per processor
10^8 instructions per second
100 instructions per segment computation
100 time steps: 10^4 seconds = ~3 hours
Much more acceptable
Assumption: communication is not a problem here
More workstations: finer grid, better accuracy
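The parallel estimate follows the same arithmetic, divided across processors. The sketch below is illustrative: `parallel_hours` is a hypothetical helper, and it ignores communication entirely, as the slide assumes.

```python
def parallel_hours(total_segments, processors, instr_per_segment,
                   time_steps, instr_per_sec):
    """Wall-clock hours when segments are split evenly across processors.

    Assumption: zero communication cost and perfect load balance.
    """
    segments_per_proc = total_segments / processors
    instructions = segments_per_proc * instr_per_segment * time_steps
    return instructions / instr_per_sec / 3600

# Figures from the slide: 10^11 segments, 1 K workstations at 10^8 instr/sec.
hours = parallel_hours(1e11, 1000, 100, 100, 1e8)
print(f"{hours:.1f} hours")   # prints "2.8 hours"
```

Doubling the number of workstations in this model halves the time, or allows twice as many segments in the same time, which is the "finer grid" option.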

## Example 2: N-body problem

Astronomy: bodies in space
Attract each other: gravitational force (Newton's law)
O(n*n) calculations per snapshot
Galaxy: ~ 10^11 bodies -> ~ 10^22 calculations
Each calculation: 1 microsecond
Snapshot: 10^16 secs = ~ 10^11 days = ~ 3*10^8 years
Is parallelism going to help us? NO
What does help? A better algorithm: Barnes-Hut
Divides the space into a quadtree
Treats far-away quads as one body
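To make the quadtree idea concrete, here is a minimal 2D sketch of the Barnes-Hut approximation, compared against the direct O(n*n) sum for one body. It is an illustration under stated assumptions, not a production implementation: the names `Quad`, `accel`, `direct`, and the opening parameter `theta` are chosen here, G is set to 1, and all bodies are assumed to have distinct positions inside the unit square.

```python
import math
import random

class Quad:
    """A square cell of the quadtree: empty, one body, or four children."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # centre and half-width
        self.mass = 0.0           # total mass inside the cell
        self.mx = self.my = 0.0   # mass-weighted coordinate sums
        self.body = None          # (x, y, m) when the cell holds one body
        self.children = None      # list of four sub-cells once split

    def _child(self, x, y):
        return self.children[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, x, y, m):
        if self.children is None and self.body is None:
            self.body = (x, y, m)                 # empty cell: keep the body
        else:
            if self.children is None:             # occupied leaf: split it
                h = self.half / 2
                self.children = [Quad(self.cx + sx * h, self.cy + sy * h, h)
                                 for sy in (-1, 1) for sx in (-1, 1)]
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

def accel(node, x, y, theta=0.5):
    """Acceleration at (x, y) due to the bodies in node (G = 1)."""
    if node.children is None:
        if node.body is None or node.body[:2] == (x, y):
            return 0.0, 0.0                  # empty cell, or the body itself
        bx, by, m = node.body
    else:                                    # use the cell's centre of mass
        bx, by, m = node.mx / node.mass, node.my / node.mass, node.mass
    dx, dy = bx - x, by - y
    dist = math.hypot(dx, dy)
    far = dist > 0 and 2 * node.half / dist < theta
    if node.children is None or far:         # treat the cell as one body
        a = m / dist ** 3
        return a * dx, a * dy
    ax = ay = 0.0
    for child in node.children:              # too close: open the cell
        cax, cay = accel(child, x, y, theta)
        ax, ay = ax + cax, ay + cay
    return ax, ay

def direct(bodies, x, y):
    """Reference O(n) sum over all other bodies for one target position."""
    ax = ay = 0.0
    for bx, by, m in bodies:
        if (bx, by) != (x, y):
            d = math.hypot(bx - x, by - y)
            ax += m * (bx - x) / d ** 3
            ay += m * (by - y) / d ** 3
    return ax, ay

random.seed(1)
bodies = [(random.random(), random.random(), 1.0) for _ in range(100)]
root = Quad(0.5, 0.5, 0.5)
for b in bodies:
    root.insert(*b)
x, y, _ = bodies[0]
print("direct     :", direct(bodies, x, y))
print("Barnes-Hut :", accel(root, x, y, theta=0.5))
```

With `theta=0` no cell is ever treated as a single body, so the tree walk reduces to the direct sum; larger `theta` values trade accuracy for fewer interactions per body.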

## Other Challenging Applications

Satellite data acquisition: billions of bits / sec
Satellite data processing
Pollution levels, remote sensing of materials
Image recognition
Planning, scheduling, VLSI design
Material modeling
Nuclear weapons modeling (ASCI)
Airplane/satellite/vehicle design

## Application Specific Architectures

Mapping an algorithm directly onto hardware
ASICs: Application Specific Integrated Circuits
Levels of specificity: full custom ASICs, standard cell ASICs, field programmable gate arrays
Computational models: dataflow graphs, systolic arrays
Orders of magnitude better performance
Orders of magnitude lower power

## ASICs (continued)

How much faster than general purpose?
Example: 1D 1024-point FFT
General purpose machine (G4): 25 microseconds
ASIC device (MIT Lincoln Labs): 32 nanoseconds
ASIC device uses 20 milliwatts (100 * less power)
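The speed ratio implied by the FFT figures above is a one-line calculation. The sketch below only restates the slide's two timings; the variable names are illustrative.

```python
# Speedup of the ASIC over the general purpose machine for one 1024-point FFT.
general_purpose_s = 25e-6    # G4: 25 microseconds
asic_s = 32e-9               # ASIC: 32 nanoseconds

speedup = general_purpose_s / asic_s
print(f"ASIC speedup: ~{speedup:.0f}x")   # prints "ASIC speedup: ~781x"
```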

Future designs: 2 tera ops in a small ( < cubic ft ) device
Target applications: FFT, Finite Impulse Response (FIR) filters, matrix multiply, QR decomposition

## A Real Example

A 24-hour weather forecast for the UK involves about 10^12 operations. This takes about 2.7 hours on a Cray-1 (capable of 10^8 operations per second).

How many operations for weekly, monthly, yearly forecasts?

Consider two electronic devices, each capable of 10^12 operations/second and separated by a distance of 0.5 mm. In this case a signal takes longer to travel between the two devices than either device takes to execute one operation (10^-12 seconds).
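Both numbers in this example can be checked quickly. The sketch below assumes signals travel no faster than light in vacuum (rounded to 3*10^8 m/s), which is a best case; real interconnects are slower. Variable names are illustrative.

```python
# Check 1: signal travel time between two devices vs. one operation.
SIGNAL_SPEED = 3.0e8       # m/s, speed of light in vacuum (assumed upper bound)
distance = 0.5e-3          # 0.5 mm expressed in metres
ops_per_second = 1e12

op_time = 1 / ops_per_second           # time for one operation
travel_time = distance / SIGNAL_SPEED  # best-case signal travel time

# Check 2: 10^12 operations on a machine doing 10^8 operations per second.
cray_hours = 1e12 / 1e8 / 3600

print(f"one operation : {op_time:.2e} s")
print(f"signal travel : {travel_time:.2e} s")
print(f"Cray-1 forecast time: {cray_hours:.2f} hours")
```

Even in the best case the signal needs longer than one operation, so a single faster processor eventually runs into a physical limit.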

SOLUTION: exploit parallelism

## Motivation for Parallel Computing

Parallel computing is cost effective
Off-the-shelf commodity processors are very fast
Memory is very cheap
Building a processor that is a small factor faster costs an order of magnitude more
NoW is the time!
Cheapest way to get more performance: multiprocessor
NoW: Networks of Workstations
A workstation can be an SMP (Symmetric Multi Processor): shared memory, bus

Computer

Get a lot of the fastest processors
Get a lot of memory per processor
Get the fastest network
Hook it all together
And then what???
Now you need to program it!
Parallel programming introduces: data partitioning, synchronization, latency issues (hiding, tolerance)
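Two of these concerns, data partitioning and synchronization, show up even in a toy shared-memory program. The sketch below uses Python threads and a lock; `parallel_sum` is a hypothetical helper written for this illustration. Note that in CPython the global interpreter lock means threads illustrate the structure but do not speed up CPU-bound work.

```python
import threading

def parallel_sum(data, workers=4):
    """Sum a list by giving each thread one contiguous block of the data."""
    total = 0
    lock = threading.Lock()
    chunk = max(1, (len(data) + workers - 1) // workers)  # block size

    def work(part):
        nonlocal total
        partial = sum(part)      # local work on private data, no sharing
        with lock:               # synchronization: one writer at a time
            total += partial

    # Data partitioning: slice the list into blocks, one thread per block.
    threads = [threading.Thread(target=work, args=(data[i:i + chunk],))
               for i in range(0, len(data), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait for all partial sums
    return total

print(parallel_sum(list(range(1_000_000))))   # prints 499999500000
```

Each thread touches the shared variable only once, which keeps synchronization cost small compared with the local work.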

## Problem with Wile E. Coyote Architecture

Von Neumann machines are not built for parallelism
To get high speed, processors have lots of state: cache, stack, global memory
To tolerate latency, we need fast context switches. WHY?
No free lunch: can't have both
Certainly not if the processor was not designed for both
Memory wall: memory gets slower and slower in terms of the number of cycles it takes to access
Memory hierarchy gets more and more complex
Memory accesses block
No split-phase memory access

## Sequential vs Parallel Algorithms

Efficient parallel algorithms
Maximize parallelism
Minimize synchronization, remote accesses
Efficiency is architecture dependent
Efficient sequential algorithms
Minimize time, space
Efficiency is portable
Efficient C program on Pentium ~ efficient C program on Alpha

## Speedup

Ideal: n processors, n-fold speedup
Ideal not always possible. WHY?
Not all processors are always busy
Remote data
Super linear speedup: >n speedup
Nonsense! Because we can execute the faster parallel program sequentially
No nonsense!! Because parallel computers do not just have more processors, they have more caches
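Speedup is the serial time divided by the parallel time, and efficiency is speedup per processor. The sketch below uses hypothetical timings chosen only to illustrate the sub-linear and super-linear cases; the function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_processors):
    """Speedup per processor: 1.0 is ideal, above 1.0 is super linear."""
    return speedup(t_serial, t_parallel) / n_processors

# Hypothetical timings in seconds, 4 processors in both cases.
typical = speedup(100.0, 30.0)        # below the ideal of 4
superlinear = speedup(100.0, 20.0)    # above 4, e.g. data now fits in caches

print(f"typical     : {typical:.2f}x, efficiency {efficiency(100.0, 30.0, 4):.2f}")
print(f"super linear: {superlinear:.2f}x, efficiency {efficiency(100.0, 20.0, 4):.2f}")
```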

## Parallel Programming

Super compilers
20 years of parallelizing compilers and what do we get? Not much: we understand loops (a bit)
Message passing
MPI rules. Well, there is also PVM (Parallel Virtual Machine)
Data parallel programming
Niche work, but important

## Implicit vs Explicit Parallelism

Implicit: super compilers
Extract parallelism from a sequential program
The general case is too hard: pointers, aliases, recursion, separate compilation, dynamic dependence distances in array references
Explicit parallelism: threads or messages
Complicates programming: creation, allocation, scheduling of processes; data partitioning; synchronization (locks, messages)
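Explicit parallelism with messages can be sketched without any shared variables: workers receive blocks of data as messages and send partial results back. The example below uses Python's standard `queue` and `threading` modules as a stand-in for a real message-passing system such as MPI; `worker` and `sum_by_messages` are hypothetical names for this illustration.

```python
import queue
import threading

def worker(inbox, outbox):
    """Receive blocks of numbers, reply with their sums, stop on None."""
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel message: no more work
            break
        outbox.put(sum(msg))

def sum_by_messages(data, workers=3):
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(workers)]
    for t in threads:            # process creation
        t.start()
    chunk = max(1, (len(data) + workers - 1) // workers)
    sent = 0
    for i in range(0, len(data), chunk):   # data partitioning
        inbox.put(data[i:i + chunk])
        sent += 1
    for _ in threads:            # one sentinel per worker
        inbox.put(None)
    total = sum(outbox.get() for _ in range(sent))  # gather the replies
    for t in threads:
        t.join()
    return total

print(sum_by_messages(list(range(101))))   # prints 5050
```

All the bookkeeping listed above (creating workers, partitioning data, coordinating through messages) is visible in the code, which is the cost of making parallelism explicit.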

## Sequential & Parallel Processing

[Figure: sequential vs. parallel processing; surviving label: "3 x faster than"]

## Classification of Parallel Machines

Models of Computation (Flynn 1966)

1. Single Instruction Stream, Single Data Stream: SISD.
2. Multiple Instruction Stream, Single Data Stream: MISD.
3. Single Instruction Stream, Multiple Data Stream: SIMD.
4. Multiple Instruction Stream, Multiple Data Stream: MIMD.
5. Single Program, Multiple Data: SPMD (a later addition, not one of Flynn's four original classes).

## SISD Computers

For the operation a1 + a2 + a3 + ... + an, the processor makes n memory accesses and performs n-1 additions.
von Neumann architecture computer
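The operation counts for summing n numbers on a single processor can be checked by instrumenting a plain loop. This is a minimal sketch; `sisd_sum` and its counters are illustrative names, and the list stands in for memory.

```python
def sisd_sum(memory):
    """Sum a non-empty list one element at a time, counting operations."""
    accesses = additions = 0
    total = memory[0]            # first memory access, no addition yet
    accesses += 1
    for i in range(1, len(memory)):
        value = memory[i]        # one memory access per remaining element
        accesses += 1
        total += value           # one addition per remaining element
        additions += 1
    return total, accesses, additions

print(sisd_sum([3, 1, 4, 1, 5]))   # prints (14, 5, 4)
```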

## MISD Computers

N processors, each with its own control unit, share a common memory (shared memory).

The processors perform different operations/tasks simultaneously on the same data.
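The MISD idea, several different operations applied to the same datum, can be simulated sequentially. In the sketch below each "processor" tests the same number for divisibility by a different divisor; the divisibility task and the name `misd_run` are assumptions chosen for illustration, and the loop only models what real MISD hardware would do simultaneously.

```python
def misd_run(operations, datum):
    """Apply every processor's own operation to the same shared datum."""
    return [op(datum) for op in operations]

# One operation per simulated processor: "is the datum divisible by d?"
divisors = (2, 3, 5, 7)
operations = [lambda n, d=d: n % d == 0 for d in divisors]

print(misd_run(operations, 91))   # prints [False, False, False, True]
```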

## SIMD Computers

N processors operate under the control of a single instruction stream issued by a central control unit.

The processors operate synchronously and a global clock is used to ensure lockstep operation.
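Lockstep execution can be simulated sequentially: at each step the control unit issues one instruction, and every processor applies it to its own data item before the next instruction is issued. The sketch below is a model only; `simd_run` and the two-instruction program are illustrative assumptions.

```python
def simd_run(program, data):
    """Run one instruction stream over many data items in lockstep."""
    registers = list(data)               # one local value per processor
    for instruction in program:          # single instruction stream
        # Every processor executes the same instruction in the same step.
        registers = [instruction(r) for r in registers]
    return registers

program = [lambda v: v * 2,              # step 1: double
           lambda v: v + 1]              # step 2: add one
print(simd_run(program, [1, 2, 3, 4]))   # prints [3, 5, 7, 9]
```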

## MIMD Computers


## SPMD Computers

MIMD.