
CS 258 Parallel Computer Architecture

CS 258, Spring 99. David E. Culler, Computer Science Division, U.C. Berkeley

Today's Goals:
- Introduce you to parallel computer architecture
- Answer your questions about CS 258
- Provide you a sense of the trends that shape the field

What will you get out of CS258?


In-depth understanding of the design and engineering of modern parallel computers:
- technology forces
- fundamental architectural issues: naming, replication, communication, synchronization
- basic design techniques: cache coherence, protocols, networks, pipelining, methods of evaluation
- underlying engineering trade-offs

...from moderate to very large scale, and across the hardware/software boundary.


Will it be worthwhile?
Absolutely!
- even though few of you will become parallel processor designers
- the fundamental issues and solutions translate across a wide spectrum of systems
- crisp solutions come in the context of parallel machines

Techniques pioneered at the thin end of the platform pyramid, on the most demanding applications, migrate downward with time.

Understand the implications for software.

[Figure: the platform pyramid: SuperServers at the top, then Departmental Servers, Workstations, and Personal Computers at the base]


Am I going to read my book to you?

NO!
- The book provides a framework and complete background, so lectures can be more interactive
- You do the reading; we'll discuss it
- Projects will go beyond the book


What is Parallel Architecture?


A parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:
- Resource allocation: how large a collection? how powerful are the elements? how much memory?
- Data access, communication, and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
- Performance and scalability: how does it all translate into performance? how does it scale?

Why Study Parallel Architecture?


Role of a computer architect: to design and engineer the various levels of a computer system so as to maximize performance and programmability within the limits of technology and cost.

Parallelism:
- provides an alternative to a faster clock for performance
- applies at all levels of system design
- is a fascinating perspective from which to view architecture
- is increasingly central in information processing

Why Study it Today?


- History: diverse and innovative organizational structures, often tied to novel programming models
- Rapidly maturing under strong technological constraints
  - the "killer micro" is ubiquitous
  - laptops and supercomputers are fundamentally similar!
  - technological trends cause diverse approaches to converge
- Technological trends make parallel computing inevitable
- Need to understand fundamental principles and design tradeoffs, not just taxonomies
  - naming, ordering, replication, communication performance

Is Parallel Computing Inevitable?


- Application demands: our insatiable need for computing cycles
- Technology trends
- Architecture trends
- Economics

Current trends:
- today's microprocessors have multiprocessor support
- servers and workstations are becoming MP: Sun, SGI, DEC, COMPAQ!...
- tomorrow's microprocessors are multiprocessors


Application Trends
Application demand for performance fuels advances in hardware, which enable new applications, which in turn demand more performance:

    New Applications -> More Performance -> New Applications -> ...

- This cycle drives the exponential increase in microprocessor performance
- It drives parallel architecture even harder, on the most demanding applications

Range of performance demands:
- need a range of system performance with progressively increasing cost

Speedup

    Speedup(p processors) = Performance(p processors) / Performance(1 processor)

For a fixed problem size (input data set), performance = 1/time, so:

    Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
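A quick worked example (hypothetical numbers, ours rather than the slides'): if a fixed problem takes 1,000 seconds on 1 processor and 125 seconds on 16 processors, then Speedup_fixed-problem(16) = 1000 / 125 = 8, a parallel efficiency of 8/16 = 50%.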


Commercial Computing
- Relies on parallelism for its high end
  - computational power determines the scale of business that can be handled
- Databases, online transaction processing, decision support, data mining, data warehousing, ...
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - explicit scaling criteria provided
  - size of enterprise scales with size of system
  - problem size is not fixed as p increases
  - throughput is the performance measure (transactions per minute, or tpm)


TPC-C Results for March 1996


[Figure: TPC-C throughput (tpmC, up to ~25,000) versus number of processors (up to ~120), March 1996, for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems]

- Parallelism is pervasive
- Small to moderate scale parallelism is very important
- Difficult to obtain a snapshot to compare across vendor platforms

Scientific Computing Demand

[Figure: computational demands of large-scale scientific applications; chart not recovered]

Engineering Computing Demand


Large parallel machines are a mainstay in many industries:
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis, combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization: in all of the above; entertainment (films like Toy Story); architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.

Applications: Speech and Image Processing


[Figure: processing demands of speech and image applications over time (1980-1995), on a log scale from 1 MIPS to 10 GIPS: sub-band speech coding, telephone number recognition, 200-word isolated speech recognition, speaker verification, CELP speech coding, ISDN-CD stereo receiver, 1,000-word and 5,000-word continuous speech recognition, CIF video, HDTV receiver]

Also CAD, databases, ...

100 processors gets you 10 years; 1,000 processors gets you 20!

Is better parallel arch enough?

- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D

Summary of Application Trends


- Transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress in commercial computing
  - database and transactions as well as financial
  - usually smaller scale, but large-scale systems are also used
- Desktop also uses multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
  - greatest use of small-scale multiprocessors
- Solid application demand exists and will increase



- - - Little break - - -


Technology Trends
[Figure: relative performance, 1965-1995, log scale 0.1 to 100: supercomputers, mainframes, and minicomputers improve steadily, while microprocessors improve much faster and overtake them by the mid 1990s]

Today the natural building block is also the fastest!

Can't we just wait for it to get faster?


- Microprocessor performance increases 50%-100% per year
- Transistor count doubles every 3 years
- DRAM size quadruples every 3 years
- Huge investment per generation is carried by the huge commodity market

[Figure: integer and floating-point performance, 1987-1992, for the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]

Technology: A Closer Look


Basic advance is decreasing feature size (λ):
- circuits become either faster or lower in power

Die size is growing too:
- clock rate improves roughly in proportion to the improvement in λ
- number of transistors improves like λ² (or faster)
- performance improves > 100x per decade; clock rate accounts for < 10x of that, the rest is transistor count

How to use more transistors?
- parallelism in processing: multiple operations per cycle reduces CPI
- locality in data access: avoids latency and reduces CPI; also improves processor utilization
- both need resources, so there is a tradeoff

[Figure: a processor and cache ($) attached to an interconnect]

The fundamental issue is resource distribution, as in uniprocessors.
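To make the scaling concrete (illustrative numbers, not from the slides): if λ shrinks by 2x, clock rate roughly doubles while the transistor budget roughly quadruples, about an 8x gain in raw capability per generation; the architect's problem is converting the 4x in transistors, not just the 2x in clock, into delivered performance.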

Growth Rates
[Figure: two charts spanning 1970-2005. Left: clock rate (MHz, log scale 0.1 to 1,000), growing about 30% per year, from the i4004 and i8008 through the i8080, i8086, and i80286 to the i80386, Pentium, and R10000. Right: transistor count (log scale 1,000 to 100,000,000), growing about 40% per year, across the same chips plus the R2000 and R3000]

Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Resolves the tradeoff between parallelism and locality
  - current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  - tradeoffs may change with scale and technology advances
- Understanding microprocessor architectural trends
  - helps build intuition about design issues of parallel machines
  - shows the fundamental role of parallelism even in "sequential" computers

Phases in VLSI Generation


[Figure: transistor count per chip (log scale, 1,000 to 100,000,000), 1970-2005, from the i4004, i8008, and i8080 through the i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, annotated with three eras: bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]

Architectural Trends
Greatest trend in VLSI generations is the increase in parallelism:
- Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  - slows after 32-bit; adoption of 64-bit now under way, 128-bit far off (not a performance issue)
  - great inflection point when a 32-bit micro and cache fit on a chip
- Mid 80s to mid 90s: instruction-level parallelism
  - pipelining and simple instruction sets, plus compiler advances (RISC)
  - on-chip caches and functional units => superscalar execution
  - greater sophistication: out-of-order execution, speculation, prediction, to deal with control transfer and latency problems
- Next step: thread-level parallelism
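A small source-level illustration of instruction-level parallelism (our sketch, not from the slides; the function names are made up). Both loops sum an array, but the first is one long dependence chain, while the second keeps four independent partial sums that a superscalar, out-of-order core can issue and overlap:

/* Illustrative only: exposing ILP to a superscalar core. */
double sum_chain(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];                /* each add waits for the previous one */
    return s;
}

double sum_ilp(const double *a, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];               /* four independent dependence chains; */
        s1 += a[i + 1];           /* the hardware can overlap these adds */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];               /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}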

How far will ILP go?


[Figure: results of an ILP limit study. Left: fraction of total cycles (%) versus number of instructions issued per cycle (0 to 6+). Right: speedup versus instructions issued per cycle, saturating below 3x even at an issue width of 15]

- Assumes infinite resources and fetch bandwidth, perfect branch prediction and renaming
- but real caches and non-zero miss latencies

Thread-Level Parallelism "on board"


[Figure: four processors (Proc) sharing a memory (MEM) over a bus]

- A micro on a chip makes it natural to connect many processors to shared memory
  - dominates the server and enterprise market, moving down to the desktop
- Faster processors began to saturate the bus, then bus technology advanced
  - today there is a range of sizes for bus-based systems, from desktop to large servers (see the chart of processor counts in fully configured commercial shared-memory systems, two slides ahead)
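To make the shared-memory model concrete, here is a minimal sketch in C with POSIX threads (ours, not from the slides; the array size, thread count, and names are invented for illustration). Each thread sums a slice of a shared array; pthread_join is the synchronization point:

/* Minimal shared-memory sketch: compile with cc -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double a[N];               /* shared data, visible to all threads */
static double partial[NTHREADS];  /* one slot per thread, so no lock needed */

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL); /* wait for, then combine, each result */
        sum += partial[i];
    }
    printf("sum = %g (expected %d)\n", sum, N);
    return 0;
}

Note that the data are named, not sent: communication happens implicitly through loads and stores to the shared array.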

What about Multiprocessor Trends?


[Figure: number of processors in fully configured commercial bus-based shared-memory systems, 1984-1998: from roughly 10-30 processors (Sequent B8000 and B2100, Symmetry 81 and 21, Power, SGI PowerSeries, SS690MP 120/140) through the SGI Challenge and PowerChallenge/XL, Sun SC2000/SC2000E, SS1000/SS1000E, SS10/SS20, SE10 through SE70, AS2100, AS8400, HP K400, P-Pro, Sun E6000, and CRAY CS6400, up to the Sun E10000 at over 60 processors]

Bus Bandwidth
[Figure: shared bus bandwidth (MB/s, log scale 10 to 100,000), 1984-1998: from about 10 MB/s on the Sequent B8000, through roughly 100 MB/s-class buses (Sequent B2100, Symmetry 81/21, Power, SGI PowerSeries, SE60), to roughly 1,000 MB/s-class buses (SS690MP, SS10/SS20, SE70/SE30, SC2000/SC2000E, SS1000/SS1000E, P-Pro, AS2100, AS8400, HP K400, SGI Challenge and PowerChallenge XL, CS6400), and beyond 10,000 MB/s on the Sun E10000]

What about Storage Trends?


Divergence between memory capacity and speed is even more pronounced:
- capacity increased about 1000x from 1980-95, speed only 2x
- gigabit DRAM by c. 2000, but the gap with processor speed is much greater

Larger memories are slower, while processors get faster:
- need to transfer more data in parallel
- need deeper cache hierarchies
- how to organize caches?

Parallelism increases the effective size of each level of the hierarchy, without increasing access time.

Parallelism and locality within memory systems too:
- new designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
- a buffer caches the most recently accessed data

Disks too: parallel disks plus caching.

Economics
- Commodity microprocessors are not only fast but CHEAP
  - development cost is tens of millions of dollars
  - BUT, many more are sold compared with supercomputers
  - crucial to take advantage of the investment, and use the commodity building block
- Multiprocessors are being pushed by software vendors (e.g. database) as well as hardware vendors
- Standardization makes small, bus-based SMPs a commodity
- Desktop: a few smaller processors versus one larger one?
  - multiprocessor on a chip?


Can we see some hard evidence?


Consider Scientific Supercomputing


Proving ground and driver for innovative architecture and techniques:
- market smaller relative to commercial as MPs become mainstream
- dominated by vector machines starting in the 70s
- microprocessors have made huge gains in floating-point performance:
  - high clock rates
  - pipelined floating-point units (e.g., multiply-add every cycle)
  - instruction-level parallelism
  - effective use of caches (e.g., automatic blocking)
- plus economics

Large-scale multiprocessors replace vector supercomputers.


Raw Uniprocessor Performance: LINPACK


[Figure: LINPACK performance (MFLOPS, log scale 1 to 10,000), 1975-2000, for Cray vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, DEC Alpha, HP 9000/750, DEC Alpha AXP, HP 9000/735, MIPS R4400, IBM Power2/990, DEC 8200), each shown for n = 100 and n = 1,000; the micros steadily close the gap with the Crays]

Raw Parallel Performance: LINPACK


[Figure: LINPACK performance (GFLOPS, log scale 0.1 to 10,000), 1985-1996, MPP peak versus Cray peak: from the Xmp/416(4) and Ymp/832(8) and early MPPs (iPSC/860, nCUBE/2 with 1,024 processors, CM-2, CM-200, Delta) through the CM-5, Paragon XP/S, T3D, C90(16), T932(32), and Paragon XP/S MP (1,024 and 6,768 processors), up to ASCI Red]

- Even the vector Crays became parallel: X-MP (2-4 processors), Y-MP (8), C-90 (16), T94 (32)
- Since 1993, Cray produces MPPs too (T3D, T3E)

500 Fastest Computers


[Figure: number of systems by class among the 500 fastest computers, 11/93 to 11/96: MPPs grow from 187 to 319, PVPs decline from 313 to 73, SMPs grow from roughly zero to about 106]

Summary: Why Parallel Architecture?


- Increasingly attractive: economics, technology, architecture, application demand
- Increasingly central and mainstream
- Parallelism exploited at many levels:
  - instruction-level parallelism
  - multiprocessor servers
  - large-scale multiprocessors (MPPs)
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory system perspective: increase bandwidth, reduce average latency with many local memories
- A spectrum of parallel architectures makes sense: different cost, performance, and scalability

Where is Parallel Arch Going?


Old view: divergent architectures, no predictable pattern of growth.

[Figure: application software and system software sitting atop many divergent architectures: systolic arrays, dataflow, SIMD, message passing, shared memory]

Uncertainty of direction paralyzed parallel software development!

Today
Extension of computer architecture to support communication and cooperation:
- Instruction Set Architecture plus Communication Architecture

Defines:
- critical abstractions, boundaries, and primitives (interfaces)
- organizational structures that implement the interfaces (hardware or software)

Compilers, libraries, and the OS are important bridges today.


Modern Layered Framework


[Figure: layered framework. Parallel applications (CAD, database, scientific modeling, multiprogramming) run on programming models (shared address, message passing, data parallel); beneath them, the communication abstraction marks the user/system boundary, implemented by compilation or library and by operating systems support; the hardware/software boundary separates these from the communication hardware and the physical communication medium]
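As a counterpart to the shared-memory sketch earlier, here is a minimal sketch of the message-passing model from the diagram above, in C with MPI (ours, not from the slides; the problem and sizes are invented for illustration). Each process owns its slice of the data, and results are combined by explicit communication:

/* Minimal message-passing sketch: compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?   */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many of us total? */

    /* Each process sums only its own slice; there is no shared array. */
    const long n = 1000000;
    long lo = rank * (n / nprocs), hi = lo + n / nprocs;
    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0;

    /* Communication is explicit: combine the partial sums at rank 0. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %g\n", total);
    MPI_Finalize();
    return 0;
}

The contrast with the pthreads sketch is exactly the communication abstraction boundary in the figure: the same computation, expressed with different primitives for cooperation.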

How will we spend our time?

http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html


How will grading work?


- 30% homeworks (6)
- 30% exam
- 30% project (teams of 2)
- 10% participation


Any other questions?

