Anda di halaman 1dari 27

Pemrosesan Paralel

Kudang B. Seminar

Kebutuhan Komputer Berkinerja


Tinggi

Peramalan cuaca
Aerodinamik
Kercerdasan buatan: robotik
Rekayasa genetik

Contoh aplikasi di atas


melibatkan komputasi
intensif dan memerlukan

Example 1: Weather Prediction


Area, segments
3000*3000*11 cubic miles
.1*.1*.1 cubic mile: ~ 1011 segments

Two day prediction


half hour periods: ~ 100 periods

Computation per segment


Temp, Pressure, Humidity, Wind speed, Wind
direction
Assume ~ 100 FLOPs

Performance: Weather
Prediction
Computational requirement: 1015
Serial supercomputer: 109 instr/sec
Total serial time: 106 sec = 280 hours
Not too good for 48 hour weather
prediction

Parallel Weather Prediction


1 K workstations, grid connected

108 segment computations per processor


108 instructions per second
100 instructions per segment computation
100 time steps: 104 seconds = ~3 hours
Much more acceptable
Assumption: Communication not a problem here

More workstations:
finer grid
better accuracy

Example 2: N body problem


Astronomy: bodies in space
Attract each other: Gravitational force Newtons
law
O(n*n) calculations per snapshot
Galaxy: ~ 1011 bodies -> ~ 1022 calculations
Calculation 1 micro sec
Snapshot: 1016 secs = ~1011 days = ~ 3*108 years
Is parallelism going to help us? NO
What does help? Better algorithm: Barnes Hut
Divides the space in quad tree
Treats far away quads as one body

Other Challenging
Applications
Satellite data acquisition: billions of bits / sec
Satellite data processing
Pollution levels, Remote sensing of materials
Image recognition

Discrete optimization problems

Planning, Scheduling, VLSI design

Material modeling
Nuclear weapons modeling (ASCI)
Airplane/Satellite/Vehicle design

Application Specific
Architectures
Mapping an algorithm directly onto hardware

ASICs: Application Specific Integrated Circuits


Levels of specificity
Full custom ASICs
Standard cell ASICs
Field programmable gate arrays
Computational models
Dataflow graphs
Systolic arrays
Orders of magnitude better performance
Orders of magnitude lower power

ASICS cont
How much faster than General purpose?
Example: 1D 1024 FFT
General purpose machine (G4): 25 micro secs
ASIC device (MIT Lincoln Labs): 32 nano secs
ASIC device uses 20 milliwatts (100 * less power)

Future designs:

2 tera ops in small ( < cubic ft ) device


Target applications
FFT
Finite Impulse Response (FIR) Filters
Matrix multiply
QR decomposition

Contoh Nyata
Peramalan cuaca 24 jam di UK melibatkan sekitar 1012

operasi untuk dieksekusi. Ini memerlukan waktu 2.7 hours


pada mesin Cray-1 (berkemampuan 108 operasi per detik).

Berapa operasi untuk peramalan


mingguan, bulanan, tahunan?

Menurut Einstein kecepatan cahaya: 3 x 108 m/dt. Dua

peralatan elektronik yang masing-masing mampu


melakukan 1012 operasi/detik dan terpisah dengan jarak 0.5
mm. Dalam hal ini akan lebih lama waktu yang diperlukan
bagi sinyal melakukan perjalanan antar dua peralatan
tersebut daripada waktu yang diperlukan untuk melakukan
eksekusi operasi (10-12 detik) oleh salah satu peralatan

Jadi faktor pembatasnya


adalah kecepatan cahaya.
elektronik tersebut.

SOLUSI: mendayagunakan
paralelisme

Motivation of Parallel
Computing
Parallel Computing is cost effective

Off the shelf, commodity processors are very fast


Memory is very cheap
Building a processor that is a small factor faster
costs an order of magnitude more
NoW is the time!
Cheapest way to get more performance: multiprocessor
NoW: Networks of workstations
Workstation can be an SMP
SMP: Symmetric Multi Processor
Shared memory
Bus

Wile E. Coyotes Parallel


Computer

Get a lot of the fastest processors


Get a lot of memory per processor
Get the fastest network
Hook it all together
And then what ???

Now you need to program


it!
Parallel programming introduces:

Task partitioning, task scheduling


Data partitioning
Synchronization
Load balancing
Latency issues
hiding
tolerance

Problem with Wile E. Coyote


Architecture
Von Neumann Machines not built for //ism
To get high speed, processors have lots of state
Cache, stack, global memory

To tolerate latency, we need fast context switch. WHY?


No free lunch: cant have both
Certainly not if the processor was not designed for both

Memory wall: memory gets slower and slower


in terms of number of cycles it takes to access

Memory hierarchy gets more and more complex


Memory accesses block
No split phase memory access

Sequential vs Parallel
Algorithms
Efficient Parallel Algorithms

Maximize parallelism
Minimize synchronization, remote accesses
Efficiency is Architecture Dependent

Efficient Sequential Algorithms


Minimize time, space
Efficiency is portable
Efficient C program on Pentium ~ Efficient C program on
Alpha

Speedup
Ideal: n processors n fold speed up

Ideal not always possible. WHY?


Tasks are data dependent
Not all processors are always busy
Remote data

Super linear speedup: >n speedup


Nonsense! Because we can execute the faster
parallel program sequentially
No nonsense!! Because parallel computers do not
just have more processors, they have more caches

Parallel Programming
Parallel Programming Paradigms
Super compilers
20 years of parallelizing compilers and what do we get?
..not much: we understand loops (a bit)

Multithreading
Pthreads, Solaris threads, not much difference
Message Passing
MPI rules, ..well, there is PVM (parallel virtual machine)
Data parallel programming
Niche work, but important

Implicit vs Explicit //ism


Implicit: super compilers
Extract parallelism from sequential program
The general case is too hard
pointers, aliases, recursion, separate compilation
dynamic dependence distances in array references

Explicit Parallelism: threads or messages


Complicates programming
creation, allocation, scheduling of processes
data partitioning
Synchronization ( locks, messages )

Pemrosesan Sekuensial &


Paralel

3 x lebih
cepat
dari

Klasifikasi Mesin
Models of Computation ( Flynn
Paralel
1966 )
1. Single Instruction Stream, Single Data Stream : SISD.
2. Multiple Instruction Stream, Single Data Stream : MISD.
3. Single Instruction Stream, Multiple Data Stream : SIMD.
4. Multiple Instruction Stream, Multiple Data Stream :
MIMD.
5. Single Program Multiple Data: SPMD.

SISD Computers

Untuk operasi a1 + a2 + a3 + + an
memerlukan sebanyak n akses ke
memori oleh prosesor dan sebanyak n-1
operasi penjumlahan. Jadi kompleksitas
waktu operasi adalah O(n).

von Neumann Architecture


Computer

MISD Computers
N prosesor yang memiliki unit kontrol pribadi, berbagi guna
memori bersama (shared memori).

Parallelisme diperoleh dengan menugaskan semua prosesor


mengerjakan operasi/tugas yang berbeda secara simultan pada
data yang sama.

SIMD Computers

N prosesor beroperasi dibawah kendali aliran


instruksi tunggal yang dikeluarkan oleh unit
kontrol pusat.

The processors operate synchronously and a


global clock is used to ensure lockstep operation.

MIMD Computers

Potensi dari 4 kelas


komputer

SPMD Computers

Program yang sama dieksekusi pada prosesor komputer


MIMD.
SPMD bukan merupakan paradigma hardware, ini adalah
software ekuivalen dari SIMD, namun bersifat
asynchronous.

Perhatikan instruksi IF X = 0 THEN S1 ELSE S2


Asumsikan X = 0 pada prosesor P1, dan untuk X != 0 pada
prosesor P2
Proses P1 mengeksekusi S1 paralel dengan prosesor P2
mengeksekusi S2 ( ini tidak dapat terjadi pada SIMD )