CIS 501
Intro to Computer Architecture

Fall 2001
University of Pennsylvania

Instructor: Prof. Amir Roth (amir@central)
T.A.: Libin Shen (libin@gradient)

Based on slides developed by Profs. Hill, Wood, Sohi, Smith, and Lipasti
at the UW-Madison, and T. N. Vijaykumar at Purdue University

CS/ECE 752 Lecture Notes (c) 1999 by Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti
CIS 501 Lecture Notes (c) 2000 by Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti & Roth

Administrivia

- addresses / meeting times / recitation
- list of topics
- expected background
- homeworks
- exams
- projects
- grading + cheating
- tentative syllabus

People + Places

- people
  - instructor: Amir Roth (amir@central)
    - office hrs: 170 GRW, MW 2-3 (I think)
  - TA: Libin Shen (libin@gradient)
    - office hrs: Moore 057A, T 3-4, W 1:30-2:30
- meeting times/places
  - lecture: 216 Moore, TR 1:30-3
  - recitation: 224 Moore, T 10:30-12 (Libin will lead, mostly)
    - I will use it as review for exams, tutorials for homeworks/projects, etc.
    - I will use it if I feel we are getting behind

Topic List

- intro (yeah right) to computer architecture
- state-of-the-art processor design
  - dynamically scheduled, superscalar processors, speculation
  - caches and advanced memory systems
  - examples of real processors: P6, 21264, Itanium, TM5420
- alternative paradigms: vectors, VLIW (EPIC), dataflow
- recent & research ideas: multithreading, trace caches
- I/O, intro to multiprocessors
- approach: quantitative + gut feel

Expected Background

- CSE 371/372 or equivalent: how to build a simple uniprocessor
  - simple instruction sets
  - datapath (ALU, register file, MUXes, decoders, etc.)
  - hardwired/microprogrammed control
  - simple pipelining
  - basic caches
- C/UNIX programming
- compilers a plus, but not necessary
- circuits/VLSI not necessary (I don't know these, either)

Textbooks & Resources

- textbooks
  - H+P, Computer Architecture: A Quantitative Approach
  - H, J+S, Readings in Computer Architecture
- other materials
  - conference papers from ISCA, MICRO, etc. (I will supply)
  - ACM digital library: www.acm.org/dl/
  - WWW comp. arch. page: www.cs.wisc.edu/~arch/www
  - Cahners: Microprocessor Report (I will supply)

Homework

- there will be several (4-5) assignments
  - may require material that is not covered in depth in class: read on your own
  - parts will require C/UNIX programming: first use, later hack, the SimpleScalar simulator
  - assignments may not all be weighted equally
- due in class on the due date; late work not accepted
- ask me/the TA/each other for help, but cite the help (be professional)

Exams

- in-class mid-term: October 25 (tentatively)
- final: during finals week, cumulative

Class Project

- most important part of the class
- some piece of original research
  - examine a modest extension to a paper studied in class
  - validate data in some paper
  - got your own idea? great!
- use the SimpleScalar simulator (I will get it for you)
  - or write your own (not recommended)
- groups of 2-3
- proposal + progress report + final report + presentation
- more details later

Grading

- grade breakdown
  - homework: 10%
  - class participation: 10% (no joke)
  - midterm: 25%
  - final: 25%
  - project: 30%
- cheating
  - don't do it, I will make you sorry

Approximate Schedule

- note I: no class Rosh Hashana, Yom Kippur
- note II: subject to change

- week 1: intro
- week 2: performance/cost, instruction sets
- week 3: instruction sets, basic pipelining
- week 4: basic pipelining, ILP
- week 5: ILP
- week 6: ILP
- week 7: ILP, midterm
- week 8: memory hierarchies
- week 9: memory hierarchies, technology/power
- week 10: recent hot topics
- week 11: case studies of real processors
- week 12: VLIW, EPIC and Itanium
- week 13: interconnect and I/O
- week 14: introduction to multiprocessors
- week 15: project presentations

What is Computer Architecture?

"The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation."
(Gene Amdahl, IBM Journal of R&D, Apr 1964)

Levels of Architecture

[Figure: layered view from SOFTWARE (top) to HARDWARE (bottom): 1 system application, 2 language processors, 3 logical resource management, 4 physical resource management, 5 program execution, 6 input/output processors, controllers, 9 communication paths and devices, storage]

Levels of Architecture

(Myers, Advances in Computer Architecture, Wiley, 1981)

- level 1: system architecture
  - interface to the outside world (e.g., languages, GUI)
- levels 2, 3, 4: software architecture
  - 2, 3: programming language and 3, 4: operating system
- level 5: computer architecture
  - interface between hardware and software
- levels 6, 8, 9: physical I/O and level 7: memory architecture

Instruction Set Processing

"The ART and Science of Instruction-Set Processor Design"
[Gerrit Blaauw & Fred Brooks, 1981]

- ARCHITECTURE (ISA): programmer/compiler view
  - functional appearance to its immediate user/system programmer
  - opcodes, addressing modes, architected registers, IEEE floating point
- IMPLEMENTATION (microarchitecture): processor-designer view
  - logical structure or organization that performs the architecture
  - pipelining, functional units, caches, physical registers
- REALIZATION (chip): chip/system-designer view
  - physical structure that embodies the implementation
  - gates, cells, transistors, wires

Role of the Computer (Micro)Architect

- architect: defines the hardware/software interface
- microarchitect: defines the hardware implementation
  - usually the same person
- decisions based on
  - applications
  - performance
  - cost
  - reliability
  - power . . .

Applications and Requirements

- scientific/numerical: weather prediction, molecular modeling
  - need: large memory, floating-point arithmetic
- commercial: inventory, payroll, web serving, e-commerce
  - need: integer arithmetic, high I/O
- embedded: automobile engines, door knobs, microwaves, PDAs
  - need: low power, low cost, interrupt-driven
- home computing: multimedia, games, entertainment
  - need: high data bandwidth, graphics

Classes of Computers

- high performance (supercomputers)
  - supercomputers: Cray T-90
  - massively parallel computers: Cray T3E
- balanced cost/performance
  - workstations: SPARCstations
  - servers: SGI Origin, UltraSPARC, AS/400
  - high-end PCs: Pentium quads
- low cost/power
  - low-end PCs, laptops, PDAs: mobile Pentiums, TM5400?

Why Study Computer Architecture?

- because this is UPenn!
  - in 1944, in this very building, John Mauchly and J. Presper Eckert built the Electronic Numerical Integrator and Calculator (ENIAC), the first electronic computer
  - "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument", Burks, Goldstine & von Neumann, 1946
    - the report gave us the term "von Neumann computer"
    - most of the ideas were present in ENIAC!!

Why Study Computer Architecture?

- aren't computers fast enough already?
  - are they?
  - fast enough to do everything we will EVER want?
    - AI, VR, protein sequencing, ????
- is speed the only goal?
  - power: heat dissipation + battery life
  - cost
  - reliability
  - etc.
- answer #1: requirements are always changing

Why Study Computer Architecture?

- answer #2: the technology playing field is always changing
  - annual technology improvements (approximate)
    - logic: density +25%, speed +20%
    - DRAM (memory): density +60%, speed +4%
    - disk: density +25%, speed +4%
  - parameters change, and change relative to one another!
- designs change even if requirements are fixed
  - but requirements are not fixed

Examples of Changing Designs

- example I: caches
  - 1970: 10K transistors, DRAM faster than logic = bad idea
  - 1990: 1M transistors, logic faster than DRAM = good idea
  - will caches ever be a bad idea again?
- example II: out-of-order execution
  - 1985: 100K transistors + no precise interrupts = bad idea
  - 1995: 2M transistors + precise interrupts = good idea
  - 2005: 100M transistors + 10GHz clock = bad idea?
- semiconductor technology is an incredible driving force

Moore's Law

- "Cramming More Components onto Integrated Circuits", G. E. Moore, Electronics, 1965
  - observation: (DRAM) transistor density doubles annually
  - became known as "Moore's Law"
  - wrong: density doubles every 18 months (he had only 4 data points)
- corollaries
  - cost / transistor halves annually
  - power decreases with scaling
  - speed increases with scaling
  - reliability increases with scaling
- most incredible paper in computer systems (in HJS, read it)

Moore's Law

- performance doubles every 18 months
  - common interpretation of Moore's Law, not the original intent
  - wrong: performance doubles every ~2 years
- self-fulfilling prophecy ("Moore's Curve")
  - doubling every 18 months = ~4% increase per month
  - 4% per month is used to judge performance features: if a feature adds 2 months to the schedule, it should add at least 8% to performance
  - Itanium is under Moore's Curve in a big way

Evolution of Single-Chip Microprocessors

                      1971-1980   1981-1990   1991-2000     2010
  Transistor Count    10K-100K    100K-1M     1M-100M       1B
  Clock Frequency     0.2-2 MHz   2-20 MHz    20 MHz-1 GHz  10 GHz
  IPC                 < 0.1       0.1-0.9     0.9-2.0       10 (?)
  MIPS/MFLOPS         < 0.2       0.2-20      20-2,000      100,000
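As a sanity check on the "Moore's Curve" arithmetic above, a quick sketch in plain Python (purely illustrative):

```python
# Doubling every 18 months compounds to roughly 4% growth per month.
monthly = 2 ** (1 / 18)   # per-month growth factor for an 18-month doubling
print(f"monthly growth: {(monthly - 1) * 100:.1f}%")   # ~3.9%

# The scheduling rule of thumb: a feature that delays shipping by 2 months
# must buy back at least two months of curve growth.
required = monthly ** 2
print(f"2-month delay must add >= {(required - 1) * 100:.1f}% performance")   # ~8.0%
```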

Performance Growth in Perspective

- same absolute increase in computing power
  - big bang - 2001
  - 2001 - 2003
- 1971-2001: performance improved 35,000X!!!
- what if cars improved at this rate?
  - 1971: 60 MPH / 10 MPG
  - 2001: 2,100,000 MPH / 350,000 MPG
  - but... what if cars crashed as often as computers did?

Performance and Cost

- read: H+P chapter 1
- performance metrics
- iron law of processor performance
- benchmarks and benchmarking
- reporting averages
- Amdahl's law
- balance and bursty behavior
- cost

Performance Metrics

- latency (response time, execution time)
  - elapsed time? processor time?
- throughput (bandwidth, work per time)
- performance
  - = 1 / latency when there is NO OVERLAP
  - > 1 / latency when there is
- choose the metric according to your objective
  - throughput: maximize work done in a given interval
  - latency: minimize time to wait for a computation
- both are important

Throughput and Pipelining

[Figure: pipeline with stages fetch, decode, register read, ALU, memory, register write, separated by latches and driven by a common clock]

- in real processors there is always overlap (pipelining)
- throughput is the rate of initiation/completion (ideally 1 per cycle)
  - non-unit latencies, dependences (hazards)
- pretend that latency is 1 if it makes you feel better

MIPS

- MIPS = (instruction count / execution time) x 10^-6
       = (clock rate / CPI) x 10^-6
- the problem is in defining a uniform measure of work
  - instruction sets are not equivalent
  - instruction count is not a reliable indicator of work
    - some optimizations add instructions
    - instructions may do varying work (FP mult >> register move)
  - may vary inversely with performance

Relative MIPS

- relative MIPS = (time_reference / time_new) x MIPS_reference
  - e.g., VAX MIPS
- + a little better than native MIPS
- - but very sensitive to the reference machine
  - which compiler and OS and benchmark?
- upshot: may be useful if same ISA/compiler/OS/workload
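A minimal sketch of the two definitions above (the instruction counts and times below are hypothetical, chosen only to exercise the formulas):

```python
# Native MIPS: millions of instructions retired per second.
def mips(insn_count, time_s):
    return insn_count / time_s / 1e6

# Relative MIPS: rescale a reference machine's rating by the speedup over it.
def relative_mips(time_ref, time_new, mips_ref):
    return (time_ref / time_new) * mips_ref

print(mips(200e6, 1.0))               # 200.0 native MIPS
print(relative_mips(10.0, 2.0, 1.0))  # 5.0 "VAX MIPS" vs. a 1-MIPS reference
```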

MFLOPS

- MFLOPS = (FP ops / execution time) x 10^-6
- like MIPS, but counts only FP operations
  - argument #1: FP ops are the same across machines
  - argument #2: they can't be optimized away
  - argument #3: FP ops have the longest latencies
- may have been valid in 1980 (most programs were FP)
  - most programs today are integer (i.e., FP light)
  - a load from memory takes longer than an FP divide

Normalized MFLOPS

- argument #1: FP operations are the same across machines?
  - Cray does not implement divide (multiplies by the reciprocal)
  - Motorola has SQRT, SIN, and COS
- normalized FP: assign a canonical # of FP ops to a HLL program
  - normalized MFLOPS = (# canonical FP ops / time) x 10^-6
  - what a pain this is

Iron Law of Processor Performance

- processor performance = time / program
- separate into three components:

    seconds     instructions     cycles       seconds
    -------  =  ------------  x  -----------  x  -------
    program     program          instruction     cycle

  - instructions/program: architecture (ISA), the compiler-designer view - the focus of CIS 501
  - cycles/instruction: implementation (micro-architecture), the processor-designer view
  - seconds/cycle: realization (physical layout), the circuit-designer view

Iron Law

- instructions / program (a.k.a. instruction count)
  - dynamic instructions executed, NOT static code size
  - mostly determined by program, compiler, ISA
  - program is the unit of work (transaction in the server domain)
- cycles / instruction (a.k.a. CPI)
  - mostly determined by ISA and CPU/memory organization
  - instruction overlap (ILP) makes this smaller
- seconds / cycle (a.k.a. cycle time, clock time, 1 / clock frequency)
  - mostly determined by technology and CPU organization
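The iron law above can be sketched as a one-line function (the machine parameters below are made up for illustration):

```python
def cpu_time(insn_count, cpi, cycle_time_s):
    """Iron law: time/program = (insns/program) x (cycles/insn) x (seconds/cycle)."""
    return insn_count * cpi * cycle_time_s

# hypothetical machine: 1e9 dynamic instructions, CPI 1.5, 500 MHz clock (2 ns)
t = cpu_time(1e9, 1.5, 2e-9)
print(f"{t:.1f} seconds")  # 3.0 seconds
```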

Iron Law

- comparing iron law performance
  - A is N times faster than B iff
    perf(A)/perf(B) = time(B)/time(A) = N
  - A is X% faster than B iff
    perf(A)/perf(B) = time(B)/time(A) = 1 + X/100
- uses of the iron law
  - high-level performance comparisons
  - back-of-the-envelope calculations
  - helping architects think about compilers and technology

Iron Law Performance Comparison

- famous example: the RISC Wars (RISC vs. CISC)
  - CISC CPU time = P x 8 x T = 8PT
  - RISC CPU time = 2P x 2 x T = 4PT
  - RISC CPU time = CISC CPU time / 2
- the truth is much, much, much more complex
- actual data from the IBM AS/400 experience (CISC -> RISC in 1995):
  - CISC (IMPI) time = P x 7 x T = 7PT
  - RISC (PPC) time = 3.1P x 3 x T/3.1 = 3PT (+1 technology generation)
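The two comparisons above can be sketched in Python (P and T are symbolic scale factors; their absolute values are arbitrary since they cancel in the ratios):

```python
# Idealized RISC-Wars story: RISC runs 2x the instructions at 1/4 the CPI.
P, T = 1.0, 1.0
cisc = P * 8 * T          # 8PT
risc = (2 * P) * 2 * T    # 4PT
print(cisc / risc)        # 2.0 -- RISC halves CPU time in the idealized story

# AS/400 reality check: 3.1x the instructions, CPI 3, and a new technology
# generation that cut cycle time by 3.1x.
impi = P * 7 * T                   # 7PT
ppc = (3.1 * P) * 3 * (T / 3.1)    # 3PT
print(f"{impi / ppc:.2f}")         # 2.33
```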

Iron Law Back-of-the-Envelope Calculation

- base machine
  - 43% ALU ops (1 cycle), 21% loads (1 cycle)
  - 12% stores (2 cycles), 24% branches (2 cycles)
  - note: pretending load latency is 1 because of the pipeline
- Q: should 1-cycle stores be implemented if doing so slows the clock by 15%?
  - old CPI = 0.43 + 0.21 + (0.12 x 2) + (0.24 x 2) = 1.36
  - new CPI = 0.43 + 0.21 + 0.12 + (0.24 x 2) = 1.24
  - speedup = (P x 1.36 x T) / (P x 1.24 x 1.15T) = 0.95
  - A: NO!

Actually Measuring Performance

- how are execution time / CPI actually measured?
  - execution time: time (UNIX cmd): wall-clock, CPU, system
  - CPI = (CPU time x clock frequency) / # instructions
- aggregate CPI is not so useful; want a CPI stack (breakdown)
  - compute time, memory stall time, etc.
  - so we know what the performance problems are (what to fix)
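The store-latency question above, worked in Python (the frequencies and cycle counts are the slide's; the candidate change makes stores 1 cycle but stretches the clock by 15%):

```python
# CPI = sum over instruction classes of (frequency x cycles).
old_cpi = 0.43 * 1 + 0.21 * 1 + 0.12 * 2 + 0.24 * 2   # 1.36
new_cpi = 0.43 * 1 + 0.21 * 1 + 0.12 * 1 + 0.24 * 2   # 1.24

# Instruction count P cancels; only CPI x cycle-time matters.
speedup = old_cpi / (new_cpi * 1.15)
print(f"old CPI {old_cpi:.2f}, new CPI {new_cpi:.2f}, speedup {speedup:.2f}")
# speedup ~0.95 < 1: the "improvement" is a net loss, so don't do it
```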

Measuring CPI Breakdowns

- hardware event counters (PentiumPro, Alpha DCPI)
  - calculate CPI using instruction frequencies/event costs
  - "A Characterization of Processor Performance in the VAX-11/780", Emer+Clark (HJS, read)
- cycle-level microarchitecture simulator (e.g., SimpleScalar)
  - measure exactly what you want
  - must take care to model the microarchitecture faithfully
    - at least the parts you care about
  - method of choice for many architects (yours, too)

Benchmarks and Benchmarking

- program as unit of work
  - there are millions of programs
  - not all are the same; most are very different
  - which ones to use?
- benchmarks
  - standard programs for measuring/comparing performance
  - representative of programs people care about
  - repeatable!!

Benchmarking Process

- steps
  - define workload
  - extract benchmarks from the workload
  - choose a performance metric
  - execute benchmarks on candidate machines
  - project performance on the new machine
  - run the workload on the new machine and compare

Benchmarking Process

[Figure: workload programs w1-w5 with times t1-t5 flow through the steps: Define Workload -> Extract Benchmarks -> Run Benchmarks (measured times t^1-t^5) -> Project Performance (projected workload times w^1-w^5, t^1-t^5)]

Benchmarks: Instruction Mixes

- calculate performance from instruction type frequencies
  - ignore dependences
- ok for a non-pipelined, scalar processor w/o caches
  - the way all processors used to be
- example: Gibson Mix, developed in the 1950s at IBM
  - load/store: 31%, branches: 17%
  - compare: 4%, shift: 4%, logical: 2%
  - fixed add/sub: 6%, float add/sub: 7%
  - float mult: 4%, float div: 2%, fixed mul: 1%, fixed div: <1%
  - qualitatively, these numbers are still useful today!

Benchmarks: Toy Benchmarks

- little programs that no one really runs
  - e.g., fibonacci, 8 queens
- little value: what real programs do these represent?
- scary fact: used to prove the value of RISC in the early 80s

Benchmarks: Kernels

- important (most frequently executed) pieces of real programs
  - e.g., Livermore loops, Linpack
  - example: inner product (from Linpack)
- good for focusing on individual features, not the big picture
- tend to over-emphasize the target feature
  - over-estimate performance if the feature is easy (fast)
  - under-estimate performance if the feature is hard (slow)

Benchmarks: Synthetic Benchmarks

- programs made up for benchmarking purposes
  - e.g., Whetstone, Dhrystone
- often only slightly more complex than kernels
- like toy benchmarks: which programs do these represent?

Benchmarks: Real Programs

- real programs
  - the only accurate way to characterize performance
  - requires considerable work
- Standard Performance Evaluation Corporation (SPEC)
  - collects, standardizes and distributes benchmark suites
  - consortium made up of industry leaders
    - ?!#$: a program is only included if it makes enough members look good
  - SPEC CPU (CPU-intensive benchmarks)
    - SPEC89, SPEC92, SPEC95, SPEC2000 (consortium at work)

SPEC95

- 8 integer programs
  - go (plays a game of go), gcc (compiler)
  - m88ksim (Motorola 88000 simulator)
  - compress (data compression/decompression), li (lisp interpreter)
  - jpeg (JPEG graphics compression/decompression)
  - perl (perl interpreter), vortex (object-oriented database)
- 10 floating point programs
  - tomcatv (vectorized mesh generation), swim (shallow water model)
  - su2cor (quantum physics), hydro2d (galactic jets, Navier-Stokes)
  - mgrid (multigrid solver for 3D field), fppp (quantum chemistry)
  - applu (partial differential equations), wave5 (n-body Maxwell's)
  - turb3d (turbulence in a cube), apsi (temperature and wind velocity)

SPEC2000 Benchmarks

- 12 integer programs
  - gcc, perl, vortex (holdovers from SPEC95)
  - bzip2, gzip (replace compress), crafty (chess, replaces go)
  - eon (rendering), gap (group-theoretic enumerations)
  - twolf, vpr (FPGA place and route)
  - parser (grammar checker), mcf (network optimization)
- 14 floating point programs
  - swim, mgrid, applu, apsi (holdovers from SPEC95)
  - wupwise (quantum chromodynamics), mesa (OpenGL library)
  - art (neural network image recognition), equake (wave propagation)
  - fma3d (crash simulation), sixtrack (accelerator design)
  - lucas (primality testing), galgel (fluid dynamics), ammp (chemistry)

Benchmarking Pitfalls

- mismatch of benchmark properties with the scale of features studied
  - e.g., using SPEC for large cache studies
- carelessly scaling benchmarks
  - using only the first few million instructions (initialization phase)
  - reducing program data size

Benchmarking Pitfalls

- choosing benchmarks from the wrong application space
  - e.g., in a realtime environment, choosing troff
  - SPEC has other benchmark suites for other environments
    - SPECweb, SPECmail, SPECjvm
  - TPC has transaction processing (database) benchmarks
    - TPC-W simulates amazon.com
- using old benchmarks
  - "benchmark specials": benchmark-specific optimizations
  - benchmarks must be continuously maintained and updated

Reporting Average Performance

- averages: one of the things architects frequently get wrong
- what does the mean mean?
  - arithmetic mean and harmonic mean
  - weighted means
  - geometric mean
- bottom line: only use an average when you have to
  - there is no such thing as the average program

Arithmetic Mean and Harmonic Mean

- arithmetic mean (AM): average execution times of n programs

    AM = (1/n) x sum over i=1..n of time(i)

- harmonic mean (HM): average IPCs of n programs
  - the arithmetic mean cannot be used for rates (like IPCs)
    - 30 MPH for 1 mile + 90 MPH for 1 mile != an average of 60 MPH

    HM = n / (sum over i=1..n of 1/rate(i))

Weighted Means (AM and HM)

- what if programs run at different frequencies within the workload?
  - use weight factors (weights summing to 1)

    weighted AM = sum over i=1..n of weight(i) x time(i)

- weighted HM is similar

    weighted HM = 1 / (sum over i=1..n of weight(i)/rate(i))
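The definitions above can be sketched in a few lines of Python; the 30/90 MPH example shows why rates need the harmonic mean:

```python
# AM for times, HM for rates, as defined above.
def am(times):
    return sum(times) / len(times)

def hm(rates):
    return len(rates) / sum(1 / r for r in rates)

# 1 mile at 30 MPH plus 1 mile at 90 MPH is NOT an average of 60 MPH:
print(am([30, 90]))   # 60.0 -- the (wrong) arithmetic mean of the rates
print(hm([30, 90]))   # 45.0 -- the true average speed over the two miles

# Weighted variants (weights assumed to sum to 1).
def weighted_am(weights, times):
    return sum(w * t for w, t in zip(weights, times))

def weighted_hm(weights, rates):
    return 1 / sum(w / r for w, r in zip(weights, rates))
```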

Geometric Mean

- what about averaging ratios (speedups)?
  - HM / AM change depending on which machine is the base
- use the geometric mean (GM): independent of base choice

    GM = (product over i=1..n of ratio(i)) ^ (1/n)

- SPEC uses GM

    GM = (product over i=1..n of T_base,i / T_new,i) ^ (1/n)

Geometric Mean

- the geometric mean of ratios is not proportional to total time!
- example (times in seconds; B/A and A/B are speedups):

         mach A    mach B    B/A     A/B
  P1     1         10        0.1     10
  P2     1000      100       10      0.1
  AM                         5.05    5.05

- if we take total execution time, B is 9.1 times faster
  - GM says they are equal
- generally, GM will mispredict for three or more machines
- AM for times, HM for rates (IPCs), and GM for ratios (speedups)
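The two-program example above can be reproduced in a few lines of Python (times are the example's, in arbitrary units):

```python
import math

# GM is base-independent but blind to total execution time.
a = {"P1": 1, "P2": 1000}   # machine A times
b = {"P1": 10, "P2": 100}   # machine B times

speedups_b_over_a = [a[p] / b[p] for p in a]   # [0.1, 10.0]
gm = math.prod(speedups_b_over_a) ** (1 / len(speedups_b_over_a))
print(gm)                                 # 1.0 -- GM calls the machines equal
print(sum(a.values()) / sum(b.values()))  # 9.1 -- but B is 9.1x faster overall
```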

Qualitative Performance

- Amdahl's law (i.e., don't forget about the uncommon case)
- balance
- bursty behavior

Amdahl's Law

- why you shouldn't ignore the UNcommon case
- let an optimization speed up a fraction f of the time by a factor of s

    speedup = old / ([(1-f) x old] + [f/s x old]) = 1 / (1 - f + f/s)

- "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", G. Amdahl, AFIPS, 1967
  - read it (in HJS)
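Amdahl's law as a one-liner, a minimal sketch of the formula above:

```python
def amdahl(f, s):
    """Speedup when a fraction f of execution time is sped up by factor s."""
    return 1 / ((1 - f) + f / s)

print(f"{amdahl(0.95, 1.10):.2f}")  # 1.09 -- common case, small speedup
print(f"{amdahl(0.05, 10):.2f}")    # 1.05 -- rare case, big speedup
print(f"{amdahl(0.95, 1e9):.1f}")   # 20.0 -- the 1/(1-f) limit for f = 95%
```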

Amdahl's Law Examples

- f = 95% and s = 1.10 -> speed up the common case a little bit (10%)
  - speedup = 1/[(1-0.95) + (0.95/1.10)] = 1.094
- f = 5% and s = 10 -> speed up the rare case a lot (10x)
  - speedup = 1/[(1-0.05) + (0.05/10)] = 1.047
- f = 5% and s -> infinity
  - speedup = 1/[(1-0.05) + (0.05/infinity)] = 1.052
- f = 95% and s -> infinity
  - speedup = 1/[(1-0.95) + (0.95/infinity)] = 20
- illustrates that the common case should be sped up
- the uncommon case eventually limits performance

Amdahl's Law

    lim as s -> infinity of 1/(1 - f + f/s) = 1/(1-f)   => make the common case fast

[Figure: speedup 1/(1 - f + f/s) in the limit of large s, plotted against f from 0 to 1; speedup rises from 0 to 10 on the y-axis as f approaches 1]

Amdahl's Law

- Amdahl was talking about a parallel processor with large speedup
  - at some point you have to pay attention to the serial part

[Figure: speedup 1/(0.1 + 0.9/x) for f = 0.9, plotted against parallel speedup x from 0 to 10000; the curve saturates near 10]

Making Common Case Fast

- uniprocessor example: the memory hierarchy
  - keep recently referenced data/instructions close (fast)
- exploit locality
  - temporal locality: access the same data in the near future
  - spatial locality: access nearby data
  - most accesses (95% instructions, 90% data) hit in the cache
- implementation facts
  - on-chip faster than off-chip
  - SRAM faster than DRAM faster than disk

Memory Hierarchy

[Figure: hierarchy reg -> L1 -> L2 -> memory -> disk (swap); the top is fast, small, and expensive, the bottom slow, large, and cheap; the next generation inserts an L3 between L2 and memory]

Memory Hierarchy Specs

  type       size       speed    bandwidth
  register   < 1 KB     1-5 ns   9600 MB/s
  L1 cache   < 256 KB   10 ns    3200 MB/s
  L2 cache   < 8 MB     30 ns    800 MB/s
  memory     < 4 GB     100 ns   133 MB/s
  disk       > 1 GB     20 ms    4 MB/s

- next (well, this) generation (2 GHz):
  - 150 GB/s (reg), 25 GB/s (L1), 50 GB/s (L2), 4 GB/s (mem)

Balance

- at a system level, bandwidths & capacities should be balanced
  - each level is capable of demanding/supplying bandwidth
- resource A: if A demand b/w >= A supply b/w, then the computation is A-bound
  - e.g., if processor demand for memory b/w >= available b/w, then the program is memory bound
  - similarly: CPU-bound, I/O-bound
- goal: be bound everywhere at once (why?)

Balance Example

- compute required memory bandwidth
  - IPC = 1.5
  - 30% loads and stores
  - 90% D$ hit rate, 95% I$ hit rate, 32-byte blocks, no L2
- compute required memory b/w
  - data b/w demand = 1.5 x 0.3 x 0.10 x 32 = 1.44 bytes/clock
  - instruction b/w demand = 1.5 x 1.0 x 0.05 x 32 = 2.4 bytes/clock
  - total b/w required = 3.84 bytes/clock
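The memory-bandwidth demand calculation above, worked in Python (all inputs are the slide's):

```python
# 32-byte blocks, no L2: every cache miss pulls a full block from memory.
ipc, mem_frac = 1.5, 0.30          # instructions/clock, load+store fraction
d_miss, i_miss, block = 0.10, 0.05, 32

data_bw = ipc * mem_frac * d_miss * block   # bytes/clock for D$ misses
insn_bw = ipc * 1.0 * i_miss * block        # every instruction probes the I$
print(f"{data_bw:.2f} + {insn_bw:.2f} = {data_bw + insn_bw:.2f} bytes/clock")
# 1.44 + 2.40 = 3.84 bytes/clock
```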

Balancing a System

- balance the system by adjusting the sizes of hierarchy components
  - e.g., larger L1 => higher hit rate => lower L2 demand
  - e.g., larger memory => less paging => lower I/O demand
- Amdahl's rule: 1 MIPS <=> 1 MB memory <=> 1 Mbit/s I/O
  - if corrected to 1 Mbyte/s of I/O, the rule is still good!

Bound

- the term "bound" is used for latency as well as bandwidth
- in general, A-bound means A is the performance limiter
  - if A is bandwidth, it means you don't have enough of A
  - if A is latency, it means you are screwed
- "Bandwidth problems can be fixed with money. Latency problems are harder to fix, because the speed of light is fixed and you can't bribe God."
- in actuality, you can convert a latency problem into a bandwidth problem
  - e.g., prefetching
  - bandwidth/latency tradeoff (a theme for the semester)

Example: Latency Bound

- copy: Z[i] = X[i], saxpy: Z[i] = a*X[i] + Y[i]
- performance cliff (bound) when the data set falls out of the D$
- another cliff when the data set is too large for memory

[Figure: performance vs. problem size on log scales; performance drops when the data falls out of cache, and again when it falls out of main memory]

Example: Bandwidth Bound

- for large arrays + prefetching: copy, sum, scale, saxpy become memory bandwidth bound
- e.g., bandwidths on real systems (MB/s):

  System         copy    scale   sum     saxpy
  Cray C90       7000    7000    9400    9500
  Cray T932      10800   10200   13000   13700
  Alpha 150MHz   98      90      68      90
  Cray T3D       380     330     190     180

Bursty Behavior

- Q: to sustain 2 IPC, how many instructions should you
  - fetch per cycle?
  - execute per cycle?
  - complete per cycle?
- A: NOT 2 (more than 2)
  - dependences will cause stalls (under-utilization)
  - if the desired performance is X, peak performance must be > X
- programs != sand: you cannot level performance peaks and valleys
  - my research: try to level performance in clever ways

Cost

- very important to most real designs
- changes over time
  - the learning curve lowers manufacturing costs
  - technology improvements lower costs, e.g., DRAM

[Figure: DRAM price per chip, 1976-1990, falling from ~$70 toward $10 for successive 16 KB, 64 KB, 256 KB, and 1 MB generations]

IC Cost

    cost(IC) = [cost(die) + cost(testing) + cost(packaging)] / yield(final test)

    cost(die) = cost(wafer) / [(dies per wafer) x yield(die)]

    yield(die) = yield(wafer) x [1 + (defects per cm^2 x die area) / alpha] ^ -alpha

- alpha is often ~3.0
- upshot: cost(die) = f(die area^4)

Total Cost

- costs
  - component: processor, DRAM, disk, power supply, packaging
  - direct: manufacturing (labor, scrap), warranty
  - indirect: R&D + marketing, administrative, profits, taxes
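A sketch of the yield/cost formulas above with made-up inputs (the wafer cost, die count, and defect density below are hypothetical, chosen only to exercise the equations):

```python
# Hypothetical inputs: a $5000 wafer yielding 200 candidate dies,
# 0.6 defects/cm^2, 1 cm^2 dies, alpha = 3.0, perfect wafer yield.
wafer_cost, dies_per_wafer = 5000.0, 200
defect_density, die_area, alpha = 0.6, 1.0, 3.0
wafer_yield = 1.0

# yield(die) = yield(wafer) x [1 + (defects/cm^2 x area)/alpha]^-alpha
die_yield = wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha
die_cost = wafer_cost / (dies_per_wafer * die_yield)
print(f"die yield {die_yield:.2f}, die cost ${die_cost:.2f}")
# die yield 0.58, die cost $43.20
```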

Manufacturing Cost

- learning curve + technology lower manufacturing cost per unit
- startup cost actually increases
  - startup manufacturing costs
    - fabrication plant, clean rooms, lithography equipment, etc.: ~$5B
    - chip testers/debuggers: ~$5M apiece, typically several hundred of them
- not too many companies can play the manufacturing game
  - Intel, IBM, Sun (Compaq used to; sold its fab to Intel)

Price

- price (loosely related to cost)
  - start with component cost
  - + 25%-40% direct cost
  - + 45%-65% gross margin (selling price)
  - + 60%-75% factory discounts and dealer profits (list price)
- components: 15%-30% of list; R&D: 8%-15% of list

Moore's Law and Cost/Price

- corollaries of Moore's Law
  1. performance doubles every 2 years
  2. cost per function halves every 2 years
  - 1 + 2: the price of the highest-performance system is constant
- must control costs to leave yourself a profit margin
  - e.g., you can't spend too much on R&D for the price/volume point

Reading Summary: Performance + Cost

- H+P
  - chapter 1
- HJ+S
  - Moore, "Cramming..."
  - Amdahl, "Validity..."
  - Emer+Clark, "A Characterization..."
- next up: instruction set design (H+P, chapter 2)
