
CS 258 Parallel Computer Architecture

CS 258, Spring 99. David E. Culler, Computer Science Division, U.C. Berkeley

Today's Goals:
- Introduce you to parallel computer architecture
- Answer your questions about CS 258
- Provide you a sense of the trends that shape the field

What will you get out of CS258?


In-depth understanding of the design and engineering of modern parallel computers:
- technology forces
- fundamental architectural issues: naming, replication, communication, synchronization
- basic design techniques: cache coherence, protocols, networks, pipelining, methods of evaluation
- underlying engineering trade-offs

...from moderate to very large scale, and across the hardware/software boundary.


Will it be worthwhile?
Absolutely!
- even though few of you will become parallel processor designers
- the fundamental issues and solutions translate across a wide spectrum of systems
- crisp solutions come in the context of parallel machines

Techniques pioneered at the thin end of the platform pyramid, on the most demanding applications, migrate downward with time.

Understand the implications for software.

[Figure: the platform pyramid: SuperServers at the top, then Departmental Servers, Workstations, and Personal Computers at the base]


Am I going to read my book to you?

NO!
- The book provides a framework and complete background, so lectures can be more interactive
- You do the reading; we'll discuss it
- Projects will go beyond the book


What is Parallel Architecture?


A parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:
- Resource allocation: how large a collection? how powerful are the elements? how much memory?
- Data access, communication, and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
- Performance and scalability: how does it all translate into performance? how does it scale?

Why Study Parallel Architecture?


Role of a computer architect: to design and engineer the various levels of a computer system so as to maximize performance and programmability within the limits of technology and cost.

Parallelism:
- provides an alternative to a faster clock for performance
- applies at all levels of system design
- is a fascinating perspective from which to view architecture
- is increasingly central in information processing

Why Study it Today?


- History: diverse and innovative organizational structures, often tied to novel programming models
- Rapidly maturing under strong technological constraints
  - the "killer micro" is ubiquitous
  - laptops and supercomputers are fundamentally similar!
  - technological trends cause diverse approaches to converge
- Technological trends make parallel computing inevitable
- Need to understand fundamental principles and design tradeoffs, not just taxonomies
  - naming, ordering, replication, communication performance

Is Parallel Computing Inevitable?


- Application demands: our insatiable need for computing cycles
- Technology trends
- Architecture trends
- Economics

Current trends:
- today's microprocessors have multiprocessor support
- servers and workstations are becoming MP: Sun, SGI, DEC, COMPAQ!...
- tomorrow's microprocessors are multiprocessors


Application Trends
Application demand for performance fuels advances in hardware, which enable new applications, which in turn demand more performance:

    New Applications -> More Performance -> New Applications -> ...

- This cycle drives the exponential increase in microprocessor performance
- It drives parallel architecture even harder, on the most demanding applications

Range of performance demands:
- need a range of system performance with progressively increasing cost

Speedup

    Speedup(p processors) = Performance(p processors) / Performance(1 processor)

For a fixed problem size (input data set), performance = 1/time, so:

    Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
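A quick worked example (hypothetical numbers, ours rather than the slides'): if a fixed problem takes 1,000 seconds on 1 processor and 125 seconds on 16 processors, then Speedup_fixed-problem(16) = 1000 / 125 = 8, a parallel efficiency of 8/16 = 50%.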


Commercial Computing
- Relies on parallelism for its high end
  - computational power determines the scale of business that can be handled
- Databases, online transaction processing, decision support, data mining, data warehousing, ...
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - explicit scaling criteria provided
  - size of enterprise scales with size of system
  - problem size is not fixed as p increases
  - throughput is the performance measure (transactions per minute, or tpm)


TPC-C Results for March 1996


[Figure: TPC-C throughput (tpmC, up to ~25,000) versus number of processors (up to ~120), March 1996, for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems]

- Parallelism is pervasive
- Small to moderate scale parallelism is very important
- Difficult to obtain a snapshot to compare across vendor platforms

Scientific Computing Demand

[Figure: computational demands of large-scale scientific applications; chart not recovered]

Engineering Computing Demand


Large parallel machines are a mainstay in many industries:
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis, combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization: in all of the above; entertainment (films like Toy Story); architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.

Applications: Speech and Image Processing


[Figure: processing demands of speech and image applications over time (1980-1995), on a log scale from 1 MIPS to 10 GIPS: sub-band speech coding, telephone number recognition, 200-word isolated speech recognition, speaker verification, CELP speech coding, ISDN-CD stereo receiver, 1,000-word and 5,000-word continuous speech recognition, CIF video, HDTV receiver]

Also CAD, databases, ...

100 processors gets you 10 years; 1,000 processors gets you 20!

Is better parallel arch enough?

- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Summary of Application Trends


- Transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress in commercial computing
  - database and transactions as well as financial
  - usually smaller scale, but large-scale systems are also used
- Desktop also uses multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
  - greatest use of small-scale multiprocessors
- Solid application demand exists and will increase



- - - Little break - - -


Technology Trends
[Figure: relative performance, 1965-1995, log scale 0.1 to 100: supercomputers, mainframes, and minicomputers improve steadily, while microprocessors improve much faster and overtake them by the mid 1990s]

Today the natural building block is also the fastest!

Can't we just wait for it to get faster?


- Microprocessor performance increases 50%-100% per year
- Transistor count doubles every 3 years
- DRAM size quadruples every 3 years
- Huge investment per generation is carried by the huge commodity market

[Figure: integer and floating-point performance, 1987-1992, for the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]

Technology: A Closer Look


Basic advance is decreasing feature size (λ):
- circuits become either faster or lower in power

Die size is growing too:
- clock rate improves roughly in proportion to the improvement in λ
- number of transistors improves like λ² (or faster)
- performance improves > 100x per decade; clock rate accounts for < 10x of that, the rest is transistor count

How to use more transistors?
- parallelism in processing: multiple operations per cycle reduces CPI
- locality in data access: avoids latency and reduces CPI; also improves processor utilization
- both need resources, so there is a tradeoff

[Figure: a processor and cache ($) attached to an interconnect]

The fundamental issue is resource distribution, as in uniprocessors.
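To make the scaling concrete (illustrative numbers, not from the slides): if λ shrinks by 2x, clock rate roughly doubles while the transistor budget roughly quadruples, about an 8x gain in raw capability per generation; the architect's problem is converting the 4x in transistors, not just the 2x in clock, into delivered performance.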

Growth Rates
[Figure: two charts spanning 1970-2005. Left: clock rate (MHz, log scale 0.1 to 1,000), growing about 30% per year, from the i4004 and i8008 through the i8080, i8086, and i80286 to the i80386, Pentium, and R10000. Right: transistor count (log scale 1,000 to 100,000,000), growing about 40% per year, across the same chips plus the R2000 and R3000]

Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Resolves the tradeoff between parallelism and locality
  - current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  - tradeoffs may change with scale and technology advances
- Understanding microprocessor architectural trends
  - helps build intuition about design issues of parallel machines
  - shows the fundamental role of parallelism even in "sequential" computers

Phases in VLSI Generation


[Figure: transistor count per chip (log scale, 1,000 to 100,000,000), 1970-2005, from the i4004, i8008, and i8080 through the i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, annotated with three eras: bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]

Architectural Trends
Greatest trend in VLSI generations is the increase in parallelism:
- Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  - slows after 32-bit; adoption of 64-bit now under way, 128-bit far off (not a performance issue)
  - great inflection point when a 32-bit micro and cache fit on a chip
- Mid 80s to mid 90s: instruction-level parallelism
  - pipelining and simple instruction sets, plus compiler advances (RISC)
  - on-chip caches and functional units => superscalar execution
  - greater sophistication: out-of-order execution, speculation, prediction, to deal with control transfer and latency problems
- Next step: thread-level parallelism
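A small source-level illustration of instruction-level parallelism (our sketch, not from the slides; the function names are made up). Both loops sum an array, but the first is one long dependence chain, while the second keeps four independent partial sums that a superscalar, out-of-order core can issue and overlap:

/* Illustrative only: exposing ILP to a superscalar core. */
double sum_chain(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];                /* each add waits for the previous one */
    return s;
}

double sum_ilp(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];               /* four independent dependence chains; */
        s1 += a[i + 1];           /* the hardware can overlap these adds */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];               /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}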

How far will ILP go?


[Figure: results of an ILP limit study. Left: fraction of total cycles (%) versus number of instructions issued per cycle (0 to 6+). Right: speedup versus instructions issued per cycle, saturating below 3x even at an issue width of 15]

- Assumes infinite resources and fetch bandwidth, perfect branch prediction and renaming
- but real caches and non-zero miss latencies

Thread-Level Parallelism "on board"


[Figure: four processors (Proc) sharing a memory (MEM) over a bus]

- A micro on a chip makes it natural to connect many processors to shared memory
  - dominates the server and enterprise market, moving down to the desktop
- Faster processors began to saturate the bus, then bus technology advanced
  - today there is a range of sizes for bus-based systems, from desktop to large servers (see the chart of processor counts in fully configured commercial shared-memory systems, two slides ahead)
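To make the shared-memory model concrete, here is a minimal sketch in C with POSIX threads (ours, not from the slides; the array size, thread count, and names are invented for illustration). Each thread sums a slice of a shared array; pthread_join is the synchronization point:

/* Minimal shared-memory sketch: compile with cc -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double a[N];               /* shared data, visible to all threads */
static double partial[NTHREADS];  /* one slot per thread, so no lock needed */

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL); /* wait for, then combine, each result */
        sum += partial[i];
    }
    printf("sum = %g (expected %d)\n", sum, N);
    return 0;
}

Note that the data are named, not sent: communication happens implicitly through loads and stores to the shared array.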

What about Multiprocessor Trends?


[Figure: number of processors in fully configured commercial bus-based shared-memory systems, 1984-1998: from roughly 10-30 processors (Sequent B8000 and B2100, Symmetry 81 and 21, Power, SGI PowerSeries, SS690MP 120/140) through the SGI Challenge and PowerChallenge/XL, Sun SC2000/SC2000E, SS1000/SS1000E, SS10/SS20, SE10 through SE70, AS2100, AS8400, HP K400, P-Pro, Sun E6000, and CRAY CS6400, up to the Sun E10000 at over 60 processors]

Bus Bandwidth
[Figure: shared bus bandwidth (MB/s, log scale 10 to 100,000), 1984-1998: from about 10 MB/s on the Sequent B8000, through roughly 100 MB/s-class buses (Sequent B2100, Symmetry 81/21, Power, SGI PowerSeries, SE60), to roughly 1,000 MB/s-class buses (SS690MP, SS10/SS20, SE70/SE30, SC2000/SC2000E, SS1000/SS1000E, P-Pro, AS2100, AS8400, HP K400, SGI Challenge and PowerChallenge XL, CS6400), and beyond 10,000 MB/s on the Sun E10000]

What about Storage Trends?


Divergence between memory capacity and speed is even more pronounced:
- capacity increased about 1000x from 1980-95, speed only 2x
- gigabit DRAM by c. 2000, but the gap with processor speed is much greater

Larger memories are slower, while processors get faster:
- need to transfer more data in parallel
- need deeper cache hierarchies
- how to organize caches?

Parallelism increases the effective size of each level of the hierarchy, without increasing access time.

Parallelism and locality within memory systems too:
- new designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
- a buffer caches the most recently accessed data

Disks too: parallel disks plus caching.

Economics
- Commodity microprocessors are not only fast but CHEAP
  - development cost is tens of millions of dollars
  - BUT, many more are sold compared with supercomputers
  - crucial to take advantage of the investment, and use the commodity building block
- Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
- Standardization makes small, bus-based SMPs a commodity
- Desktop: a few smaller processors versus one larger one?
  - multiprocessor on a chip?


Can we see some hard evidence?


Consider Scientific Supercomputing


Proving ground and driver for innovative architecture and techniques:
- market smaller relative to commercial as MPs become mainstream
- dominated by vector machines starting in the 70s
- microprocessors have made huge gains in floating-point performance:
  - high clock rates
  - pipelined floating-point units (e.g., multiply-add every cycle)
  - instruction-level parallelism
  - effective use of caches (e.g., automatic blocking)
- plus economics

Large-scale multiprocessors replace vector supercomputers.


Raw Uniprocessor Performance: LINPACK


[Figure: LINPACK performance (MFLOPS, log scale 1 to 10,000), 1975-2000, for Cray vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, DEC Alpha, HP 9000/750, DEC Alpha AXP, HP 9000/735, MIPS R4400, IBM Power2/990, DEC 8200), each shown for n = 100 and n = 1,000; the micros steadily close the gap with the Crays]

Raw Parallel Performance: LINPACK


[Figure: LINPACK performance (GFLOPS, log scale 0.1 to 10,000), 1985-1996, MPP peak versus Cray peak: from the Xmp/416(4) and Ymp/832(8) and early MPPs (iPSC/860, nCUBE/2 with 1,024 processors, CM-2, CM-200, Delta) through the CM-5, Paragon XP/S, T3D, C90(16), T932(32), and Paragon XP/S MP (1,024 and 6,768 processors), up to ASCI Red]

- Even the vector Crays became parallel: X-MP (2-4 processors), Y-MP (8), C-90 (16), T94 (32)
- Since 1993, Cray produces MPPs too (T3D, T3E)

500 Fastest Computers


[Figure: number of systems by class among the 500 fastest computers, 11/93 to 11/96: MPPs grow from 187 to 319, PVPs decline from 313 to 73, SMPs grow from roughly zero to about 106]

Summary: Why Parallel Architecture?


- Increasingly attractive: economics, technology, architecture, application demand
- Increasingly central and mainstream
- Parallelism exploited at many levels:
  - instruction-level parallelism
  - multiprocessor servers
  - large-scale multiprocessors (MPPs)
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory system perspective: increase bandwidth, reduce average latency with many local memories
- A spectrum of parallel architectures makes sense: different cost, performance, and scalability

Where is Parallel Arch Going?


Old view: divergent architectures, no predictable pattern of growth.

[Figure: application software and system software sitting atop many divergent architectures: systolic arrays, dataflow, SIMD, message passing, shared memory]

Uncertainty of direction paralyzed parallel software development!

Today
Extension of computer architecture to support communication and cooperation:
- Instruction Set Architecture plus Communication Architecture

Defines:
- critical abstractions, boundaries, and primitives (interfaces)
- organizational structures that implement the interfaces (hardware or software)

Compilers, libraries, and the OS are important bridges today.


Modern Layered Framework


[Figure: layered framework. Parallel applications (CAD, database, scientific modeling, multiprogramming) run on programming models (shared address, message passing, data parallel); beneath them, the communication abstraction marks the user/system boundary, implemented by compilation or library and by operating systems support; the hardware/software boundary separates these from the communication hardware and the physical communication medium]
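As a counterpart to the shared-memory sketch earlier, here is a minimal sketch of the message-passing model from the diagram above, in C with MPI (ours, not from the slides; the problem and sizes are invented for illustration). Each process owns its slice of the data, and results are combined by explicit communication:

/* Minimal message-passing sketch: compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?   */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many of us total? */

    /* Each process sums only its own slice; there is no shared array. */
    const long n = 1000000;
    long lo = rank * (n / nprocs), hi = lo + n / nprocs;
    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0;

    /* Communication is explicit: combine the partial sums at rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %g\n", total);
    MPI_Finalize();
    return 0;
}

The contrast with the pthreads sketch is exactly the communication abstraction boundary in the figure: the same computation, expressed with different primitives for cooperation.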

How will we spend our time?

http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html


How will grading work?


- 30% homeworks (6)
- 30% exam
- 30% project (teams of 2)
- 10% participation


Any other questions?

