CIS 501
Intro to Computer Architecture

Fall 2001
University of Pennsylvania

Instructor: Prof. Amir Roth (amir@central)
T.A.: Libin Shen (libin@gradient)

Based on slides developed by Profs. Hill, Wood, Sohi, Smith, and Lipasti
at the UW-Madison, and T. N. Vijaykumar at Purdue University

CS/ECE 752 Lecture Notes (c) 1999 by Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
CIS 501 Lecture Notes (c) 2000 by Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti & Roth

Administrivia

- addresses / meeting times / recitation
- list of topics
- expected background
- homeworks
- exams
- projects
- grading + cheating
- tentative syllabus

People + Places

- people
  - instructor: Amir Roth (amir@central)
    - office hrs: 170 GRW, MW 2-3 (I think)
  - TA: Libin Shen (libin@gradient)
    - office hrs: Moore 057A, T 3-4, W 1:30-2:30
- meeting times/places
  - lecture: 216 Moore, TR 1:30-3
  - recitation: 224 Moore, T 10:30-12 (Libin will lead, mostly)
    - I will use it as review for exams, tutorials for homeworks/projects, etc.
    - I will use it if I feel we are getting behind

Topic List

- intro (yeah right) to computer architecture
- state-of-the-art processor design
  - dynamically scheduled, superscalar processors, speculation
  - caches and advanced memory systems
  - examples of real processors: P6, 21264, Itanium, TM5420
- alternative paradigms: vectors, VLIW (EPIC), dataflow
- recent & research ideas: multithreading, trace caches
- I/O, intro to multiprocessors
- approach: quantitative + gut feel

Expected Background

- CSE 371/372 or equivalent: how to build a simple uniprocessor
  - simple instruction sets
  - datapath (ALU, register file, MUXes, decoders, etc.)
  - hardwired/microprogrammed control
  - simple pipelining
  - basic caches
- C/UNIX programming
- compilers a plus, but not necessary
- circuits/VLSI not necessary (I don't know these, either)

Textbooks & Resources

- textbooks
  - H+P, Computer Architecture: A Quantitative Approach
  - H, J+S, Readings in Computer Architecture
- other materials
  - conference papers from ISCA, MICRO, etc. (I will supply)
  - ACM digital library: www.acm.org/dl/
  - WWW comp. arch. page: www.cs.wisc.edu/~arch/www
  - Cahners: Microprocessor Report (I will supply)

Homework

- there will be several (4-5) assignments
  - may require material that is not covered in depth in class: read on your own
  - parts will require C/UNIX programming: first use, later hack, the SimpleScalar simulator
  - assignments may not all be weighted equally
- due in class on the due date; late work not accepted
- ask me/the TA/each other for help, but cite the help (be professional)

Exams

- in-class mid-term: October 25 (tentatively)
- final: during finals week, cumulative

Class Project

- most important part of the class
- some piece of original research
  - examine a modest extension to a paper studied in class
  - validate data in some paper
  - got your own idea? great!
- use the SimpleScalar simulator (I will get it for you)
  - or write your own (not recommended)
- groups of 2-3
- proposal + progress report + final report + presentation
- more details later

Grading

- grade breakdown
  - homework: 10%
  - class participation: 10% (no joke)
  - midterm: 25%
  - final: 25%
  - project: 30%
- cheating
  - don't do it, I will make you sorry

Approximate Schedule

- note I: no class Rosh Hashana, Yom Kippur
- note II: subject to change

- week 1: intro
- week 2: performance/cost, instruction sets
- week 3: instruction sets, basic pipelining
- week 4: basic pipelining, ILP
- week 5: ILP
- week 6: ILP
- week 7: ILP, midterm
- week 8: memory hierarchies
- week 9: memory hierarchies, technology/power
- week 10: recent hot topics
- week 11: case studies of real processors
- week 12: VLIW, EPIC and Itanium
- week 13: interconnect and I/O
- week 14: introduction to multiprocessors
- week 15: project presentations

What is Computer Architecture?

"The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation."
(Gene Amdahl, IBM Journal of R&D, Apr 1964)

Levels of Architecture

[Figure: layered view from SOFTWARE (top) to HARDWARE (bottom): 1 system application, 2 language processors, 3 logical resource management, 4 physical resource management, 5 program execution, 6 input/output processors, controllers, 9 communication paths and devices, storage]

Levels of Architecture

(Myers, Advances in Computer Architecture, Wiley, 1981)

- level 1: system architecture
  - interface to the outside world (e.g., languages, GUI)
- levels 2, 3, 4: software architecture
  - 2, 3: programming language and 3, 4: operating system
- level 5: computer architecture
  - interface between hardware and software
- levels 6, 8, 9: physical I/O and level 7: memory architecture

Instruction Set Processing

"The ART and Science of Instruction-Set Processor Design"
[Gerrit Blaauw & Fred Brooks, 1981]

- ARCHITECTURE (ISA): programmer/compiler view
  - functional appearance to its immediate user/system programmer
  - opcodes, addressing modes, architected registers, IEEE floating point
- IMPLEMENTATION (microarchitecture): processor-designer view
  - logical structure or organization that performs the architecture
  - pipelining, functional units, caches, physical registers
- REALIZATION (chip): chip/system-designer view
  - physical structure that embodies the implementation
  - gates, cells, transistors, wires

Role of the Computer (Micro)Architect

- architect: defines the hardware/software interface
- microarchitect: defines the hardware implementation
  - usually the same person
- decisions based on
  - applications
  - performance
  - cost
  - reliability
  - power . . .

Applications and Requirements

- scientific/numerical: weather prediction, molecular modeling
  - need: large memory, floating-point arithmetic
- commercial: inventory, payroll, web serving, e-commerce
  - need: integer arithmetic, high I/O
- embedded: automobile engines, door knobs, microwaves, PDAs
  - need: low power, low cost, interrupt-driven
- home computing: multimedia, games, entertainment
  - need: high data bandwidth, graphics

Classes of Computers

- high performance (supercomputers)
  - supercomputers: Cray T-90
  - massively parallel computers: Cray T3E
- balanced cost/performance
  - workstations: SPARCstations
  - servers: SGI Origin, UltraSPARC, AS/400
  - high-end PCs: Pentium quads
- low cost/power
  - low-end PCs, laptops, PDAs: mobile Pentiums, TM5400?

Why Study Computer Architecture?

- because this is UPenn!
  - in 1944, in this very building, John Mauchly and J. Presper Eckert built the Electronic Numerical Integrator and Calculator (ENIAC), the first electronic computer
  - "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument", Burks, Goldstine & von Neumann, 1946
    - the report gave us the term "von Neumann computer"
    - most of the ideas were present in ENIAC!!

Why Study Computer Architecture?

- aren't computers fast enough already?
  - are they?
  - fast enough to do everything we will EVER want?
    - AI, VR, protein sequencing, ????
- is speed the only goal?
  - power: heat dissipation + battery life
  - cost
  - reliability
  - etc.
- answer #1: requirements are always changing

Why Study Computer Architecture?

- answer #2: the technology playing field is always changing
  - annual technology improvements (approximate)
    - logic: density +25%, speed +20%
    - DRAM (memory): density +60%, speed +4%
    - disk: density +25%, speed +4%
  - parameters change, and change relative to one another!
- designs change even if requirements are fixed
  - but requirements are not fixed

Examples of Changing Designs

- example I: caches
  - 1970: 10K transistors, DRAM faster than logic = bad idea
  - 1990: 1M transistors, logic faster than DRAM = good idea
  - will caches ever be a bad idea again?
- example II: out-of-order execution
  - 1985: 100K transistors + no precise interrupts = bad idea
  - 1995: 2M transistors + precise interrupts = good idea
  - 2005: 100M transistors + 10GHz clock = bad idea?
- semiconductor technology is an incredible driving force

Moore's Law

- "Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, 1965
  - observation: (DRAM) transistor density doubles annually
  - became known as "Moore's Law"
  - wrong: density doubles every 18 months (he had only 4 data points)
- corollaries
  - cost / transistor halves annually
  - power decreases with scaling
  - speed increases with scaling
  - reliability increases with scaling
- most incredible paper in computer systems (in HJS, read it)

Moore's Law

- performance doubles every 18 months
  - common interpretation of Moore's Law, not the original intent
  - wrong: performance doubles every ~2 years
- self-fulfilling prophecy ("Moore's Curve")
  - doubling every 18 months = ~4% increase per month
  - 4% per month is used to judge performance features: if a feature adds 2 months to the schedule, it should add at least 8% to performance
  - Itanium is under Moore's Curve in a big way

Evolution of Single-Chip Microprocessors

                      1971-1980   1981-1990   1991-2000     2010
  Transistor Count    10K-100K    100K-1M     1M-100M       1B
  Clock Frequency     0.2-2 MHz   2-20 MHz    20 MHz-1 GHz  10 GHz
  IPC                 < 0.1       0.1-0.9     0.9-2.0       10 (?)
  MIPS/MFLOPS         < 0.2       0.2-20      20-2,000      100,000
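As a sanity check on the "Moore's Curve" arithmetic above, a quick sketch in plain Python (purely illustrative):

```python
# Doubling every 18 months compounds to roughly 4% growth per month.
monthly = 2 ** (1 / 18)   # per-month growth factor for an 18-month doubling
print(f"monthly growth: {(monthly - 1) * 100:.1f}%")   # ~3.9%

# The scheduling rule of thumb: a feature that delays shipping by 2 months
# must buy back at least two months of curve growth.
required = monthly ** 2
print(f"2-month delay must add >= {(required - 1) * 100:.1f}% performance")   # ~8.0%
```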

Performance Growth in Perspective

- same absolute increase in computing power
  - big bang - 2001
  - 2001 - 2003
- 1971-2001: performance improved 35,000X!!!
- what if cars improved at this rate?
  - 1971: 60 MPH / 10 MPG
  - 2001: 2,100,000 MPH / 350,000 MPG
  - but... what if cars crashed as often as computers did?

Performance and Cost

- read: H+P chapter 1
- performance metrics
- iron law of processor performance
- benchmarks and benchmarking
- reporting averages
- Amdahl's law
- balance and bursty behavior
- cost

Performance Metrics

- latency (response time, execution time)
  - elapsed time? processor time?
- throughput (bandwidth, work per time)
- performance
  - = 1 / latency when there is NO OVERLAP
  - > 1 / latency when there is
- choose the metric according to your objective
  - throughput: maximize work done in a given interval
  - latency: minimize time to wait for a computation
- both are important

Throughput and Pipelining

[Figure: pipeline with stages fetch, decode, register read, ALU, memory, register write, separated by latches and driven by a common clock]

- in real processors there is always overlap (pipelining)
- throughput is the rate of initiation/completion (ideally 1 per cycle)
  - non-unit latencies, dependences (hazards)
- pretend that latency is 1 if it makes you feel better

MIPS

- MIPS = (instruction count / execution time) x 10^-6
       = (clock rate / CPI) x 10^-6
- the problem is in defining a uniform measure of work
  - instruction sets are not equivalent
  - instruction count is not a reliable indicator of work
    - some optimizations add instructions
    - instructions may do varying work (FP mult >> register move)
  - may vary inversely with performance

Relative MIPS

- relative MIPS = (time_reference / time_new) x MIPS_reference
  - e.g., VAX MIPS
- + a little better than native MIPS
- - but very sensitive to the reference machine
  - which compiler and OS and benchmark?
- upshot: may be useful if same ISA/compiler/OS/workload
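A minimal sketch of the two definitions above (the instruction counts and times below are hypothetical, chosen only to exercise the formulas):

```python
# Native MIPS: millions of instructions retired per second.
def mips(insn_count, time_s):
    return insn_count / time_s / 1e6

# Relative MIPS: rescale a reference machine's rating by the speedup over it.
def relative_mips(time_ref, time_new, mips_ref):
    return (time_ref / time_new) * mips_ref

print(mips(200e6, 1.0))               # 200.0 native MIPS
print(relative_mips(10.0, 2.0, 1.0))  # 5.0 "VAX MIPS" vs. a 1-MIPS reference
```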

MFLOPS

- MFLOPS = (FP ops / execution time) x 10^-6
- like MIPS, but counts only FP operations
  - argument #1: FP ops are the same across machines
  - argument #2: they can't be optimized away
  - argument #3: FP ops have the longest latencies
- may have been valid in 1980 (most programs were FP)
  - most programs today are integer (i.e., FP light)
  - a load from memory takes longer than an FP divide

Normalized MFLOPS

- argument #1: FP operations are the same across machines?
  - Cray does not implement divide (multiplies by the reciprocal)
  - Motorola has SQRT, SIN, and COS
- normalized FP: assign a canonical # of FP ops to a HLL program
  - normalized MFLOPS = (# canonical FP ops / time) x 10^-6
  - what a pain this is

Iron Law of Processor Performance

- processor performance = time / program
- separate into three components:

    seconds     instructions     cycles       seconds
    -------  =  ------------  x  -----------  x  -------
    program     program          instruction     cycle

  - instructions/program: architecture (ISA), the compiler-designer view - the focus of CIS 501
  - cycles/instruction: implementation (micro-architecture), the processor-designer view
  - seconds/cycle: realization (physical layout), the circuit-designer view

Iron Law

- instructions / program (a.k.a. instruction count)
  - dynamic instructions executed, NOT static code size
  - mostly determined by program, compiler, ISA
  - program is the unit of work (transaction in the server domain)
- cycles / instruction (a.k.a. CPI)
  - mostly determined by ISA and CPU/memory organization
  - instruction overlap (ILP) makes this smaller
- seconds / cycle (a.k.a. cycle time, clock time, 1 / clock frequency)
  - mostly determined by technology and CPU organization
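The iron law above can be sketched as a one-line function (the machine parameters below are made up for illustration):

```python
def cpu_time(insn_count, cpi, cycle_time_s):
    """Iron law: time/program = (insns/program) x (cycles/insn) x (seconds/cycle)."""
    return insn_count * cpi * cycle_time_s

# hypothetical machine: 1e9 dynamic instructions, CPI 1.5, 500 MHz clock (2 ns)
t = cpu_time(1e9, 1.5, 2e-9)
print(f"{t:.1f} seconds")  # 3.0 seconds
```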

Iron Law

- comparing iron law performance
  - A is N times faster than B iff
    perf(A)/perf(B) = time(B)/time(A) = N
  - A is X% faster than B iff
    perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
- uses of the iron law
  - high-level performance comparisons
  - back-of-the-envelope calculations
  - helping architects think about compilers and technology

Iron Law Performance Comparison

- famous example: the RISC Wars (RISC vs. CISC)
  - CISC CPU time = P x 8 x T = 8PT
  - RISC CPU time = 2P x 2 x T = 4PT
  - RISC CPU time = CISC CPU time / 2
- the truth is much, much, much more complex
- actual data from the IBM AS/400 experience (CISC -> RISC in 1995):
  - CISC (IMPI) time = P x 7 x T = 7PT
  - RISC (PPC) time = 3.1P x 3 x T/3.1 = 3PT (+1 technology generation)
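The two comparisons above can be sketched in Python (P and T are symbolic scale factors; their absolute values are arbitrary since they cancel in the ratios):

```python
# Idealized RISC-Wars story: RISC runs 2x the instructions at 1/4 the CPI.
P, T = 1.0, 1.0
cisc = P * 8 * T          # 8PT
risc = (2 * P) * 2 * T    # 4PT
print(cisc / risc)        # 2.0 -- RISC halves CPU time in the idealized story

# AS/400 reality check: 3.1x the instructions, CPI 3, and a new technology
# generation that cut cycle time by 3.1x.
impi = P * 7 * T                   # 7PT
ppc = (3.1 * P) * 3 * (T / 3.1)    # 3PT
print(f"{impi / ppc:.2f}")         # 2.33
```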

Iron Law Back-of-the-Envelope Calculation

- base machine
  - 43% ALU ops (1 cycle), 21% loads (1 cycle)
  - 12% stores (2 cycles), 24% branches (2 cycles)
  - note: pretending load latency is 1 because of the pipeline
- Q: should 1-cycle stores be implemented if doing so slows the clock by 15%?
  - old CPI = 0.43 + 0.21 + (0.12 x 2) + (0.24 x 2) = 1.36
  - new CPI = 0.43 + 0.21 + 0.12 + (0.24 x 2) = 1.24
  - speedup = (P x 1.36 x T) / (P x 1.24 x 1.15T) = 0.95
  - A: NO!

Actually Measuring Performance

- how are execution time / CPI actually measured?
  - execution time: time (UNIX cmd): wall-clock, CPU, system
  - CPI = (CPU time x clock frequency) / # instructions
- aggregate CPI is not so useful; want a CPI stack (breakdown)
  - compute time, memory stall time, etc.
  - so we know what the performance problems are (what to fix)
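The store-latency question above, worked in Python (the frequencies and cycle counts are the slide's; the candidate change makes stores 1 cycle but stretches the clock by 15%):

```python
# CPI = sum over instruction classes of (frequency x cycles).
old_cpi = 0.43 * 1 + 0.21 * 1 + 0.12 * 2 + 0.24 * 2   # 1.36
new_cpi = 0.43 * 1 + 0.21 * 1 + 0.12 * 1 + 0.24 * 2   # 1.24

# Instruction count P cancels; only CPI x cycle-time matters.
speedup = old_cpi / (new_cpi * 1.15)
print(f"old CPI {old_cpi:.2f}, new CPI {new_cpi:.2f}, speedup {speedup:.2f}")
# speedup ~0.95 < 1: the "improvement" is a net loss, so don't do it
```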

Measuring CPI Breakdowns

- hardware event counters (PentiumPro, Alpha DCPI)
  - calculate CPI using instruction frequencies/event costs
  - "A Characterization of Processor Performance in the VAX-11/780", Emer+Clark (HJS, read)
- cycle-level microarchitecture simulator (e.g., SimpleScalar)
  - measure exactly what you want
  - must take care to model the microarchitecture faithfully
    - at least the parts you care about
  - method of choice for many architects (yours, too)

Benchmarks and Benchmarking

- program as unit of work
  - there are millions of programs
  - not all are the same; most are very different
  - which ones to use?
- benchmarks
  - standard programs for measuring/comparing performance
  - representative of programs people care about
  - repeatable!!

Benchmarking Process

- steps
  - define workload
  - extract benchmarks from the workload
  - choose a performance metric
  - execute benchmarks on candidate machines
  - project performance on the new machine
  - run the workload on the new machine and compare

Benchmarking Process

[Figure: workload programs w1-w5 with times t1-t5 flow through the steps: Define Workload -> Extract Benchmarks -> Run Benchmarks (measured times t^1-t^5) -> Project Performance (projected workload times w^1-w^5, t^1-t^5)]

Benchmarks: Instruction Mixes

- calculate performance from instruction type frequencies
  - ignore dependences
- ok for a non-pipelined, scalar processor w/o caches
  - the way all processors used to be
- example: Gibson Mix, developed in the 1950s at IBM
  - load/store: 31%, branches: 17%
  - compare: 4%, shift: 4%, logical: 2%
  - fixed add/sub: 6%, float add/sub: 7%
  - float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%
  - qualitatively, these numbers are still useful today!

Benchmarks: Toy Benchmarks

- little programs that no one really runs
  - e.g., fibonacci, 8 queens
- little value: what real programs do these represent?
- scary fact: used to prove the value of RISC in the early 80s

Benchmarks: Kernels

- important (most frequently executed) pieces of real programs
  - e.g., Livermore loops, Linpack
  - example: inner product (from Linpack)
- good for focusing on individual features, not the big picture
- tend to over-emphasize the target feature
  - over-estimate performance if the feature is easy (fast)
  - under-estimate performance if the feature is hard (slow)

Benchmarks: Synthetic Benchmarks

- programs made up for benchmarking purposes
  - e.g., Whetstone, Dhrystone
- often only slightly more complex than kernels
- like toy benchmarks: which programs do these represent?

Benchmarks: Real Programs

- real programs
  - the only accurate way to characterize performance
  - requires considerable work
- Standard Performance Evaluation Corporation (SPEC)
  - collects, standardizes and distributes benchmark suites
  - consortium made up of industry leaders
    - ?!#$: a program is only included if it makes enough members look good
  - SPEC CPU (CPU-intensive benchmarks)
    - SPEC89, SPEC92, SPEC95, SPEC2000 (consortium at work)

SPEC95

- 8 integer programs
  - go (plays a game of go), gcc (compiler)
  - m88ksim (Motorola 88000 simulator)
  - compress (data compression/decompression), li (lisp interpreter)
  - jpeg (JPEG graphics compression/decompression)
  - perl (perl interpreter), vortex (object-oriented database)
- 10 floating point programs
  - tomcatv (vectorized mesh generation), swim (shallow water model)
  - su2cor (quantum physics), hydro2d (galactic jets, Navier-Stokes)
  - mgrid (multigrid solver for 3D field), fppp (quantum chemistry)
  - applu (partial differential equations), wave5 (n-body Maxwell's)
  - turb3d (turbulence in a cube), apsi (temperature and wind velocity)

SPEC2000 Benchmarks

- 12 integer programs
  - gcc, perl, vortex (holdovers from SPEC95)
  - bzip2, gzip (replace compress), crafty (chess, replaces go)
  - eon (rendering), gap (group-theoretic enumerations)
  - twolf, vpr (FPGA place and route)
  - parser (grammar checker), mcf (network optimization)
- 14 floating point programs
  - swim, mgrid, applu, apsi (holdovers from SPEC95)
  - wupwise (quantum chromodynamics), mesa (OpenGL library)
  - art (neural network image recognition), equake (wave propagation)
  - fma3d (crash simulation), sixtrack (accelerator design)
  - lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Benchmarking Pitfalls

- mismatch of benchmark properties with the scale of features studied
  - e.g., using SPEC for large cache studies
- carelessly scaling benchmarks
  - using only the first few million instructions (initialization phase)
  - reducing program data size

Benchmarking Pitfalls

- choosing benchmarks from the wrong application space
  - e.g., in a realtime environment, choosing troff
  - SPEC has other benchmark suites for other environments
    - SPECweb, SPECmail, SPECjvm
  - TPC has transaction processing (database) benchmarks
    - TPC-W simulates amazon.com
- using old benchmarks
  - "benchmark specials": benchmark-specific optimizations
  - benchmarks must be continuously maintained and updated

Reporting Average Performance

- averages: one of the things architects frequently get wrong
- what does the mean mean?
  - arithmetic mean and harmonic mean
  - weighted means
  - geometric mean
- bottom line: only use an average when you have to
  - there is no such thing as the average program

Arithmetic Mean and Harmonic Mean

- arithmetic mean (AM): average execution times of n programs

    AM = (1/n) x sum over i=1..n of time(i)

- harmonic mean (HM): average IPCs of n programs
  - the arithmetic mean cannot be used for rates (like IPCs)
    - 30 MPH for 1 mile + 90 MPH for 1 mile != an average of 60 MPH

    HM = n / (sum over i=1..n of 1/rate(i))

Weighted Means (AM and HM)

- what if programs run at different frequencies within the workload?
  - use weight factors (weights summing to 1)

    weighted AM = sum over i=1..n of weight(i) x time(i)

- weighted HM is similar

    weighted HM = 1 / (sum over i=1..n of weight(i)/rate(i))
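The definitions above can be sketched in a few lines of Python; the 30/90 MPH example shows why rates need the harmonic mean:

```python
# AM for times, HM for rates, as defined above.
def am(times):
    return sum(times) / len(times)

def hm(rates):
    return len(rates) / sum(1 / r for r in rates)

# 1 mile at 30 MPH plus 1 mile at 90 MPH is NOT an average of 60 MPH:
print(am([30, 90]))   # 60.0 -- the (wrong) arithmetic mean of the rates
print(hm([30, 90]))   # 45.0 -- the true average speed over the two miles

# Weighted variants (weights assumed to sum to 1).
def weighted_am(weights, times):
    return sum(w * t for w, t in zip(weights, times))

def weighted_hm(weights, rates):
    return 1 / sum(w / r for w, r in zip(weights, rates))
```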

Geometric Mean

- what about averaging ratios (speedups)?
  - HM / AM change depending on which machine is the base
- use the geometric mean (GM): independent of base choice

    GM = (product over i=1..n of ratio(i)) ^ (1/n)

- SPEC uses GM

    GM = (product over i=1..n of T_base,i / T_new,i) ^ (1/n)

Geometric Mean

- the geometric mean of ratios is not proportional to total time!
- example (times in seconds; B/A and A/B are speedups):

         mach A    mach B    B/A     A/B
  P1     1         10        0.1     10
  P2     1000      100       10      0.1
  AM                         5.05    5.05

- if we take total execution time, B is 9.1 times faster
  - GM says they are equal
- generally, GM will mispredict for three or more machines
- AM for times, HM for rates (IPCs), and GM for ratios (speedups)
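The two-program example above can be reproduced in a few lines of Python (times are the example's, in arbitrary units):

```python
import math

# GM is base-independent but blind to total execution time.
a = {"P1": 1, "P2": 1000}   # machine A times
b = {"P1": 10, "P2": 100}   # machine B times

speedups_b_over_a = [a[p] / b[p] for p in a]   # [0.1, 10.0]
gm = math.prod(speedups_b_over_a) ** (1 / len(speedups_b_over_a))
print(gm)                                 # 1.0 -- GM calls the machines equal
print(sum(a.values()) / sum(b.values()))  # 9.1 -- but B is 9.1x faster overall
```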

Qualitative Performance

- Amdahl's law (i.e., don't forget about the uncommon case)
- balance
- bursty behavior

Amdahl's Law

- why you shouldn't ignore the UNcommon case
- let an optimization speed up a fraction f of the time by a factor of s

    speedup = old / ([(1-f) x old] + [f/s x old]) = 1 / (1 - f + f/s)

- "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
  - read it (in HJS)
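Amdahl's law as a one-liner, a minimal sketch of the formula above:

```python
def amdahl(f, s):
    """Speedup when a fraction f of execution time is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

print(f"{amdahl(0.95, 1.10):.2f}")  # 1.09 -- common case, small speedup
print(f"{amdahl(0.05, 10):.2f}")    # 1.05 -- rare case, big speedup
print(f"{amdahl(0.95, 1e9):.1f}")   # 20.0 -- the 1/(1-f) limit for f = 95%
```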

Amdahl's Law Examples

- f = 95% and s = 1.10 -> speed up the common case a little bit (10%)
  - speedup = 1/[(1-0.95) + (0.95/1.10)] = 1.094
- f = 5% and s = 10 -> speed up the rare case a lot (10x)
  - speedup = 1/[(1-0.05) + (0.05/10)] = 1.047
- f = 5% and s -> infinity
  - speedup = 1/[(1-0.05) + (0.05/infinity)] = 1.052
- f = 95% and s -> infinity
  - speedup = 1/[(1-0.95) + (0.95/infinity)] = 20
- illustrates that the common case should be sped up
- the uncommon case eventually limits performance

Amdahl's Law

    lim as s -> infinity of 1/(1 - f + f/s) = 1/(1-f)   => make the common case fast

[Figure: speedup 1/(1 - f + f/s) in the limit of large s, plotted against f from 0 to 1; speedup rises from 0 to 10 on the y-axis as f approaches 1]

Amdahl's Law

- Amdahl was talking about a parallel processor with large speedup
  - at some point you have to pay attention to the serial part

[Figure: speedup 1/(0.1 + 0.9/x) for f = 0.9, plotted against parallel speedup x from 0 to 10000; the curve saturates near 10]

Making Common Case Fast

- uniprocessor example: the memory hierarchy
  - keep recently referenced data/instructions close (fast)
- exploit locality
  - temporal locality: access the same data in the near future
  - spatial locality: access nearby data
  - most accesses (95% instructions, 90% data) hit in the cache
- implementation facts
  - on-chip faster than off-chip
  - SRAM faster than DRAM faster than disk

Memory Hierarchy

[Figure: hierarchy reg -> L1 -> L2 -> memory -> disk (swap); the top is fast, small, and expensive, the bottom slow, large, and cheap; the next generation inserts an L3 between L2 and memory]

Memory Hierarchy Specs

  type       size       speed    bandwidth
  register   < 1 KB     1-5 ns   9600 MB/s
  L1 cache   < 256 KB   10 ns    3200 MB/s
  L2 cache   < 8 MB     30 ns    800 MB/s
  memory     < 4 GB     100 ns   133 MB/s
  disk       > 1 GB     20 ms    4 MB/s

- next (well, this) generation (2 GHz):
  - 150 GB/s (reg), 25 GB/s (L1), 50 GB/s (L2), 4 GB/s (mem)

Balance

- at a system level, bandwidths & capacities should be balanced
  - each level is capable of demanding/supplying bandwidth
- resource A: if A demand b/w >= A supply b/w, then the computation is A-bound
  - e.g., if processor demand for memory b/w >= available b/w, then the program is memory bound
  - similarly: CPU-bound, I/O-bound
- goal: be bound everywhere at once (why?)

Balance Example

- compute required memory bandwidth
  - IPC = 1.5
  - 30% loads and stores
  - 90% D$ hit rate, 95% I$ hit rate, 32-byte blocks, no L2
- compute required memory b/w
  - data b/w demand = 1.5 x 0.3 x 0.10 x 32 = 1.44 bytes/clock
  - instruction b/w demand = 1.5 x 1.0 x 0.05 x 32 = 2.4 bytes/clock
  - total b/w required = 3.84 bytes/clock
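The memory-bandwidth demand calculation above, worked in Python (all inputs are the slide's):

```python
# 32-byte blocks, no L2: every cache miss pulls a full block from memory.
ipc, mem_frac = 1.5, 0.30          # instructions/clock, load+store fraction
d_miss, i_miss, block = 0.10, 0.05, 32

data_bw = ipc * mem_frac * d_miss * block   # bytes/clock for D$ misses
insn_bw = ipc * 1.0 * i_miss * block        # every instruction probes the I$
print(f"{data_bw:.2f} + {insn_bw:.2f} = {data_bw + insn_bw:.2f} bytes/clock")
# 1.44 + 2.40 = 3.84 bytes/clock
```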

Balancing a System

- balance the system by adjusting the sizes of hierarchy components
  - e.g., larger L1 => higher hit rate => lower L2 demand
  - e.g., larger memory => less paging => lower I/O demand
- Amdahl's rule: 1 MIPS <=> 1 MB memory <=> 1 Mbit/s I/O
  - if corrected to 1 Mbyte/s of I/O, the rule is still good!

Bound

- the term "bound" is used for latency as well as bandwidth
- in general, A-bound means A is the performance limiter
  - if A is bandwidth, it means you don't have enough of A
  - if A is latency, it means you are screwed
- "Bandwidth problems can be fixed with money. Latency problems are harder to fix, because the speed of light is fixed and you can't bribe God."
- in actuality, you can convert a latency problem into a bandwidth problem
  - e.g., prefetching
  - bandwidth/latency tradeoff (a theme for the semester)

Example: Latency Bound

- copy: Z[i] = X[i], saxpy: Z[i] = a*X[i] + Y[i]
- performance cliff (bound) when the data set falls out of the D$
- another cliff when the data set is too large for memory

[Figure: performance vs. problem size on log scales; performance drops when the data falls out of cache, and again when it falls out of main memory]

Example: Bandwidth Bound

- for large arrays + prefetching: copy, sum, scale, saxpy become memory bandwidth bound
- e.g., bandwidths on real systems (MB/s):

  System         copy    scale   sum     saxpy
  Cray C90       7000    7000    9400    9500
  Cray T932      10800   10200   13000   13700
  Alpha 150MHz   98      90      68      90
  Cray T3D       380     330     190     180

Bursty Behavior

- Q: to sustain 2 IPC, how many instructions should you
  - fetch per cycle?
  - execute per cycle?
  - complete per cycle?
- A: NOT 2 (more than 2)
  - dependences will cause stalls (under-utilization)
  - if the desired performance is X, peak performance must be > X
- programs != sand: you cannot level performance peaks and valleys
  - my research: try to level performance in clever ways

Cost

- very important to most real designs
- changes over time
  - the learning curve lowers manufacturing costs
  - technology improvements lower costs, e.g., DRAM

[Figure: DRAM price per chip, 1976-1990, falling from ~$70 toward $10 for successive 16 KB, 64 KB, 256 KB, and 1 MB generations]

IC Cost

    cost(IC) = [cost(die) + cost(testing) + cost(packaging)] / yield(final test)

    cost(die) = cost(wafer) / [(dies per wafer) x yield(die)]

    yield(die) = yield(wafer) x [1 + (defects per cm^2 x die area) / alpha] ^ -alpha

- alpha is often ~3.0
- upshot: cost(die) = f(die area^4)

Total Cost

- costs
  - component: processor, DRAM, disk, power supply, packaging
  - direct: manufacturing (labor, scrap), warranty
  - indirect: R&D + marketing, administrative, profits, taxes
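A sketch of the yield/cost formulas above with made-up inputs (the wafer cost, die count, and defect density below are hypothetical, chosen only to exercise the equations):

```python
# Hypothetical inputs: a $5000 wafer yielding 200 candidate dies,
# 0.6 defects/cm^2, 1 cm^2 dies, alpha = 3.0, perfect wafer yield.
wafer_cost, dies_per_wafer = 5000.0, 200
defect_density, die_area, alpha = 0.6, 1.0, 3.0
wafer_yield = 1.0

# yield(die) = yield(wafer) x [1 + (defects/cm^2 x area)/alpha]^-alpha
die_yield = wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha
die_cost = wafer_cost / (dies_per_wafer * die_yield)
print(f"die yield {die_yield:.2f}, die cost ${die_cost:.2f}")
# die yield 0.58, die cost $43.20
```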

Manufacturing Cost

- learning curve + technology lower manufacturing cost per unit
- startup cost actually increases
  - startup manufacturing costs
    - fabrication plant, clean rooms, lithography equipment, etc.: ~$5B
    - chip testers/debuggers: ~$5M apiece, typically several hundred of them
- not too many companies can play the manufacturing game
  - Intel, IBM, Sun (Compaq used to; sold its fab to Intel)

Price

- price (loosely related to cost)
  - start with component cost
  - + 25%-40% direct cost
  - + 45%-65% gross margin (selling price)
  - + 60%-75% factory discounts and dealer profits (list price)
- components: 15%-30% of list; R&D: 8%-15% of list

Moore's Law and Cost/Price

- corollaries of Moore's Law
  1. performance doubles every 2 years
  2. cost per function halves every 2 years
  - 1 + 2: the price of the highest-performance system is constant
- must control costs to leave yourself a profit margin
  - e.g., you can't spend too much on R&D for the price/volume point

Reading Summary: Performance + Cost

- H+P
  - chapter 1
- HJ+S
  - Moore, "Cramming..."
  - Amdahl, "Validity..."
  - Emer+Clark, "A Characterization..."
- next up: instruction set design (H+P, chapter 2)
