Today's Goal:
- Introduce you to parallel computer architecture
- Answer your questions about CS 258
- Provide a sense of the trends that shape the field
8/23/2013
CS258 S99
Will it be worthwhile?
Absolutely!
even though few of you will become parallel-processor designers
The fundamental issues and solutions translate across a wide spectrum of systems.
Crisp solutions in the context of parallel machines.
- SuperServers
- Departmental servers
- Workstations
- Personal computers
NO!
Book provides a framework and complete background, so lectures can be more interactive.
You do the reading; we'll discuss it.
Parallelism:
- Provides an alternative to a faster clock for performance
- Applies at all levels of system design
- Is a fascinating perspective from which to view architecture
- Is increasingly central in information processing
- Technological trends make parallel computing inevitable
- Need to understand fundamental principles and design tradeoffs, not just taxonomies
- Naming, ordering, replication, communication performance
Application Trends
- Application demand for performance fuels advances in hardware, which enables new applications, which in turn demand more performance, and so on
- This cycle drives the exponential increase in microprocessor performance
- It drives parallel architecture harder: the most demanding applications
Speedup

Speedup(p processors) = Performance(p processors) / Performance(1 processor)
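A minimal sketch of the definition above, with hypothetical performance numbers (the 10 and 80 MFLOPS figures are illustrative, not from the lecture):

```python
def speedup(perf_p, perf_1):
    # Speedup(p) = Performance(p processors) / Performance(1 processor)
    return perf_p / perf_1

def efficiency(perf_p, perf_1, p):
    # Fraction of ideal linear speedup actually achieved
    return speedup(perf_p, perf_1) / p

# Hypothetical: 10 MFLOPS on 1 processor, 80 MFLOPS on 16 processors.
print(speedup(80.0, 10.0))         # 8.0
print(efficiency(80.0, 10.0, 16))  # 0.5
```

Note that speedup is defined relative to the best single-processor performance, not to one processor of the parallel machine running inefficiently.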
Commercial Computing
Relies on parallelism for high end
Computational power determines scale of business that can be handled
- Databases, online transaction processing, decision support, data mining, data warehousing, ...
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - Explicit scaling criteria provided
  - Size of enterprise scales with size of system
  - Problem size not fixed as p increases
  - Throughput is the performance measure (transactions per minute, or tpm)
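For throughput-oriented workloads the same speedup definition applies, but with transactions per minute as the performance metric and a problem size that grows with p. A small sketch, with entirely hypothetical tpmC figures:

```python
# Hypothetical tpmC measurements: {processors: throughput}.
# (Illustrative only; not vendor-reported TPC-C results.)
measurements = {1: 250.0, 32: 6000.0, 64: 10500.0, 112: 16000.0}

base = measurements[1]
for p, tpmc in sorted(measurements.items()):
    s = tpmc / base        # scaled speedup: throughput ratio
    print(p, s, s / p)     # processors, speedup, efficiency
```

With these numbers, efficiency falls from 0.75 at 32 processors to about 0.57 at 112: throughput keeps rising, but sublinearly, which is the typical shape of the vendor curves.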
[Figure: TPC-C throughput (tpmC), 0 to 20,000, versus number of processors, 20 to 120]

- Parallelism is pervasive
- Small- to moderate-scale parallelism very important
- Difficult to obtain a snapshot to compare across vendor platforms
[Figure: performance growth from 1 MIPS to 100 MIPS (log scale), 1980-1995]
Also CAD, databases, ... 100 processors gets you 10 years; 1,000 gets you 20!
- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
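The quoted MFLOPS figures imply the following ratios relative to the vector machine (simple arithmetic on the slide's numbers, nothing more):

```python
# MFLOPS figures quoted on the slide for the AMBER case study.
cray_c90 = 145.0  # vector code on the Cray C90
paragon  = 406.0  # final tuned version, 128-processor Paragon
cray_t3d = 891.0  # same version, 128-processor Cray T3D

print(paragon / cray_c90)   # about 2.8x the vector machine
print(cray_t3d / cray_c90)  # about 6.1x
```

The point is that the payoff came from tuning the parallel version, and that the same tuned code ran over 2x faster again on the T3D's stronger memory system.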
- Desktop also uses multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
- Greatest use of small-scale multiprocessors
- - - Little break - - -
Technology Trends
[Figure: performance versus year, 1965-1995, with supercomputers at the top of the range; SPEC Integer and FP performance trends]
- Clock rate improves roughly in proportion to the improvement in feature size
- Number of transistors improves like the square of that (or faster)
- Clock rate contributes less than 10x per decade; the rest is transistor count
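The scaling relationship sketched above can be worked through numerically. The shrink factor and number of generations per decade below are illustrative assumptions, not figures from the lecture:

```python
# Rough CMOS scaling heuristic: if the minimum feature size shrinks by a
# factor s per generation, clock rate improves roughly by s and
# transistor count by about s**2 (or faster).
def one_decade(shrink_per_gen=1.4, generations=5):
    clock = shrink_per_gen ** generations               # ~5.4x
    transistors = (shrink_per_gen ** 2) ** generations  # ~29x
    return clock, transistors

clock, transistors = one_decade()
print(round(clock, 1), round(transistors, 1))  # 5.4 28.9
```

Even under these modest assumptions, transistor count grows far faster than clock rate, which is why most of the performance gain per decade must come from using the extra transistors, i.e. from parallelism.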
Interconnect
Growth Rates
[Figure: growth of clock rate (MHz) and transistor count per chip, 1970-2005, from the i4004, i8008, i8080, i8086, and i80286 through the i80386, Pentium 100, and R10000]
Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Resolves the tradeoff between parallelism and locality
- Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
- Tradeoffs may change with scale and technology advances
[Figure: transistors per chip, 1970-2005 (log scale, 1,000 to 10,000,000): i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Architectural Trends
Greatest trend in VLSI generation is increase in parallelism
- Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  - slows after 32-bit; adoption of 64-bit now under way, 128-bit far off (not a performance issue)
  - great inflection point when a 32-bit micro and cache fit on a chip
- Mid 80s to mid 90s: instruction-level parallelism
  - pipelining and simple instruction sets, plus compiler advances (RISC)
  - on-chip caches and functional units => superscalar execution
  - greater sophistication: out-of-order execution, speculation, prediction to deal with control transfer and latency problems
[Figure: speedup from instruction-level parallelism (up to ~25): an ideal model with infinite resources and fetch bandwidth, perfect branch prediction and renaming, versus a model with real caches and non-zero miss latencies]
[Figure: number of processors in fully configured bus-based shared-memory systems, 1984-1998, up to about 60: Sequent B8000 and B2100, Symmetry 21 and 81, Power, SGI PowerSeries, SS690MP 120 and 140, SGI Challenge, Sun SC2000, SS1000, AS8400, SE10 through SE70, Sun E6000, Sun E10000]
Bus Bandwidth
[Figure: shared-bus bandwidth (MB/s, log scale, roughly 10 to 100,000), 1984-1998, from the Sequent B8000 and B2100 up to the SGI Challenge, SGI PowerCh XL, Sun E6000, SC2000/SC2000E, SS1000/SS1000E, AS2100, AS8400, CS6400, HPK400, and P-Pro systems]
- Parallelism increases the effective size of each level of the hierarchy, without increasing access time
- Parallelism and locality within memory systems too:
  - New designs fetch many bits within the memory chip, then follow with fast pipelined transfer across a narrower interface
  - Buffer caches the most recently accessed data
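A toy latency model of the idea above: pay the slow internal access once, then stream the fetched bits across the narrow interface in a fast pipeline. All timings are hypothetical, chosen only to show the shape of the tradeoff:

```python
# Hypothetical timings (ns), for illustration only.
ROW_ACCESS_NS = 60.0  # one wide fetch inside the memory chip
TRANSFER_NS   = 7.5   # per word across the narrow pipelined interface

def naive_ns(words):
    # Every word pays the full access latency.
    return words * ROW_ACCESS_NS

def pipelined_ns(words):
    # One wide fetch, then a burst of pipelined transfers.
    return ROW_ACCESS_NS + words * TRANSFER_NS

for words in (1, 8, 32):
    print(words, naive_ns(words), pipelined_ns(words))
```

For a single word the two are comparable, but for an 8-word block the pipelined scheme takes 120 ns against 480 ns naive: the access latency is amortized while effective bandwidth comes from the burst.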
Economics
- Commodity microprocessors are not only fast but CHEAP
  - Development costs run to tens of millions of dollars
- Multiprocessors are being pushed by software vendors (e.g. databases) as well as hardware vendors
- Standardization makes small, bus-based SMPs a commodity
- Desktop: a few smaller processors versus one larger one? A multiprocessor on a chip?
[Figure: LINPACK (MFLOPS), 1980-2000: Cray supercomputers (Cray 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) versus microprocessor-based systems (MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP9000/735, DEC Alpha, IBM Power2/990, DEC 8200)]
[Figure: LINPACK (GFLOPS) on massively parallel processors, 1985-1996: CM-2, CM-200, iPSC/860, nCUBE/2 (1024), Delta, CM-5, Paragon XP/S, T3D, T932 (32), C90 (16), Paragon XP/S MP (1024 and 6768 processors), ASCI Red]
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory system perspective: increase bandwidth and reduce average latency with many local memories
[Figure: SIMD, message passing, and shared memory approaches converging toward today's generic parallel architecture]
Extension of computer architecture to support communication and cooperation
Instruction Set Architecture plus Communication Architecture
Defines
- Critical abstractions, boundaries, and primitives (interfaces)
- Organizational structures that implement interfaces (hw or sw)
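The two dominant communication abstractions can be contrasted at the programming level. This is a minimal sketch using Python threads purely for illustration; it models the abstractions, not the hardware mechanisms that implement them:

```python
import queue
import threading

# Message passing: communication is explicit sends and receives on a
# channel; the receiver names the channel, not a memory location.
def message_passing_sum(values):
    chan = queue.Queue()
    def producer():
        for v in values:
            chan.put(v)    # explicit send
        chan.put(None)     # end-of-stream marker
    t = threading.Thread(target=producer)
    t.start()
    total = 0
    while (v := chan.get()) is not None:  # explicit receive
        total += v
    t.join()
    return total

# Shared address space: threads communicate implicitly by reading and
# writing the same location; a lock orders the conflicting accesses.
def shared_memory_sum(values):
    total = [0]
    lock = threading.Lock()
    def worker(chunk):
        for v in chunk:
            with lock:
                total[0] += v  # communicate through a shared location
    mid = len(values) // 2
    ts = [threading.Thread(target=worker, args=(c,))
          for c in (values[:mid], values[mid:])]
    for t in ts: t.start()
    for t in ts: t.join()
    return total[0]

print(message_passing_sum(range(10)), shared_memory_sum(list(range(10))))
```

Both compute the same result; what differs is the interface the architecture must make fast: send/receive primitives in one case, loads, stores, and synchronization on shared locations in the other.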
[Figure: layers of a communication architecture, top to bottom: compilation or library; operating systems support; communication hardware; physical communication medium. The hardware/software boundary lies between OS support and the communication hardware]
http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html