Performance

Performance
CSE 313:Computer Architecture 1

Performance
 Evaluation is a key component in the iterative
architecture design process
 Performance is crucial to system design
 Also important in many other cases
 Purchasing a computer
 Evaluating new technologies
 Writing software
 Implementing an instruction set
 Designing a new architecture
 Need to quantify performance

Performance
 What is performance?
 Performance Metrics
 How to evaluate and summarize
performance?
 Performance Pitfalls

Airplane Example
 Which one is the fastest for transporting 1 passenger?

 Which one is the fastest for transporting 5000 passengers?

Computer Performance: TIME,
TIME, TIME
 Response Time (latency)/ Execution Time
— How long does it take for my job to run?
— How long does it take to execute a job?
— How long must I wait for the database query?
 Throughput
— How many jobs can the machine run at once?
— What is the average execution rate?
— How much work is getting done?
 If we upgrade a machine with a new processor what do we improve?

 If we add a new machine to the lab what do we improve?

Book's Definition of
Performance
 For some program running on machine X,
PerformanceX = 1 / Execution timeX
 "X is n times faster than Y“ (speedup)
PerformanceX / PerformanceY = n
= Execution time y / Execution time x
 Hard Real Time/soft real time

 Embedded System and DVD playback
Example
 Machine A runs a program in 20 seconds
 Machine B runs the same program in 25
seconds
 How much faster is A than B?
 Performance ratio
25 / 20 = 1.25
 A is 1.25 times faster than B. (the speedup of
A over B is 1.25)

Time
 Elapsed time/Wall Clock time/response time
 Counts everything (disk and memory accesses, I/O,
operating system overhead etc.)
 A useful number, but often not good for comparison
purposes
 Time sharing among multiple programs
 CPU time/CPU Execution Time
 Doesn’t count I/O or time spent in running other programs
 Can be broken into system CPU time and user CPU time
 Our focus: user CPU time
 CPU time spent in executing the lines of code that are “in”
our program
 CPU performance
 System Performance

Clock Cycles
 Instead of reporting execution time in seconds, we often use
cycles
seconds = cycles * seconds / cycle

 Clock “ticks” indicate when to start activities (one abstraction):
time
 cycle time = time between ticks = seconds per cycle
 clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
A 2 Ghz. clock has a 1 / 2×109 = 0.5 nano-second (ns) cycle time

How to Improve
Performance
CPU time = # of Cycles * Clock cycle time
= # of Cycles / Clock rate
So, to improve performance (everything else being
equal) you can either
________ the # of required cycles for a program, or

________ the clock cycle time or, said another way,
________ the clock rate.

Example
 Our favorite program runs in 8 seconds on computer A, which has
a 4 GHz. clock. We are trying to help a computer designer build a
new machine B, that will run this program in 4 seconds. The
designer can use new (or perhaps more expensive) technology to
substantially increase the clock rate, but has informed us that this
increase will affect the rest of the CPU design, causing machine B
to require 1.2 times as many clock cycles as machine A for the
same program. What clock rate should we tell the designer to
target?"
 Don't Panic, you can easily work this out from basic principles

How many cycles are required
for a program?
 Can we assume that # of cycles = # of
instructions?
2nd instruction
3rd instruction
1st instruction
4th
5th
6th
...
time
This assumption is incorrect,

different instructions take different amounts of time on different machines.

Different numbers of cycles
for different instructions
 Multiplication takes more time than addition
 Floating point operations take longer than integer ones
 Accessing memory takes more time than accessing registers

Measure Clock Cycles
 CPU clock cycles/program is not an intuitive or easily determined
value
Clock cycles = # instructions * average clock cycles per instruction
 Cycles per instruction (CPI) is often used
Time = # instructions * CPI * clock cycle time
 CPI is an average since the number of cycles per instruction varies

from instruction to instruction
 CPI varies by application, as well as among implementation with
the same instruction set
 Number of cycles for each instruction
 Frequency of instructions (instruction mix)
 Memory access time

CPI Example
 It takes a program of 1,000,000
instructions 1.2ms to run on computer A.
Computer A has a clock rate of 2 GHz.
What is the CPI of the program on
computer A?

CPI Example
 Suppose we have two implementations of the same
instruction set architecture (ISA).
For some program,
Machine A has a clock cycle time of 1 ns. and a CPI of
2.0
Machine B has a clock cycle time of 2 ns. and a CPI of
1.2
Which machine is faster for this program, and by how
much?

# of Instructions Example
 A compiler designer is trying to decide between two code
sequences for a particular machine. Based on the
hardware implementation, there are three different classes
of instructions: Class A, Class B, and Class C, and they
require 1, 2, and 3 cycles respectively.
The first code sequence has 5 instructions: 2 of A, 1 of B,
and 2 of C
The second sequence has 6 instructions: 4 of A, 1 of B,
and 1 of C.
Which sequence will be faster? How much faster?
What is the CPI for each sequence?

CPI Variability
 Different instruction types require different number of
cycles per instruction
 CPI is often reported for classes of instructions
n
total clock cycles = ∑ CPI i × I i
i =1
where CPIi is the CPI for the instruction in class i and Ii is

the count of instructions in class i

Computing CPI
 To compute the overall average CPI use
∑ CPI ⋅# instruction
i i
CPI = i
total # instruction
Ii
= ∑ CPI i
i ∑ Ii
= ∑ CPI i ⋅ Pi
i
∑ P =1
i
i
Where Pi is the percentage (frequency) of instruction in class i

Computing CPI Example
Instruction Type CPI Frequency
ALU 1 50%
Branch 2 20%
Load 2 20%
Store 2 10%
 Average CPI is
1 x 50% + 2 x 20% + 2 x 20% + 2 x 10%
= 0.5 + 0.4 + 0.4 +0.2 = 1.5

Tradeoffs
 Instruction count, CPI, and clock cycle present tradeoffs
 RISC – reduced instruction set computer (MIPS)
 Simple instructions
 Higher instruction counts for an application
 Lower CPI
 CISC – complex instruction set computer (IA-32)
 More complex instructions
 Lower instruction counts for an application
 Higher CPI

Understanding Program
Performance

MIPS
 MIPS (millions instructions per second)
instruction count clock rate

MIPS = =
execution time × 10 6
CPI × 10 6
 Is it a good measure for performance?

MIPS example
 Two different compilers are being tested for a 1 GHz. machine
with three different classes of instructions: Class A, Class B, and
Class C, which require 1, 2, and 3 cycles (respectively). Both
compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 billion Class A instructions, 1

billion Class B instructions, and 1 billion Class C instructions.
The second compiler's code uses 10 billion Class A instructions, 1

billion Class B instructions, and 1 billion Class C instructions.
 Which sequence will be faster according to MIPS?

 Which sequence will be faster according to execution time?

Performance
 Performance is determined by execution time
 Do any of the other variables equal performance?
 # of cycles to execute program?
 # of instructions in program?
 # of cycles per second?
 average # of cycles per instruction?
 average # of instructions per second?
 Common pitfall: thinking one of the variables is

indicative of performance when it really isn’t.

Now that we understand
Performance
 A given program will require
 some number of instructions (machine instructions)
 some number of cycles
 some number of seconds
 We have a vocabulary that relates these quantities:
 cycle time (seconds per cycle)
 clock rate (cycles per second)
 CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
 MIPS (millions of instructions per second)
this would be higher for a program using simple instructions

Example
 If two machines have the same ISA which
of the following quantities will always be
identical for a same program and
compiler?
 clock rate
 # of instructions
 CPI
 execution time
 MIPS

Example
 Consider a program on a computer of two classes of instructions
 A: CPI = 2, frequency = 40%
 B: CPI = 4, frequency = 60%
 What’s the CPI of this machine?
 If the CPI of the instruction class B is reduced to 3 without changing clock rate, how
much faster is the new machine? What’s its CPI?
 If we can also reduce the number of class A instructions to 50% of the original for the
program, how much faster is the new machine? What’s its CPI?
1. CPI = 2 * 40% + 4 * 60% = 3.2

2. CPI = 2 * 40% + 3 * 60% = 2.6. Because both have the same number of
instructions and clock rate, the ratio of execution time is the ratio of the CPI.
The speedup is 3.2 / 2.6 = 1.23 times faster
3. Assume originally there are I instructions, then the number of cycles is 3.2*I.
The number of cycles after reducing CPI of B and instructions of A is
2 * 0.2 * I + 3 * 0.6 * I = 2.2 *I
So the speedup is 3.2 / 2.2 = 1.45
The CPI of the new machine is 2.2 * I / 0.8 * I = 2.75
We can not compute CPI as 2 * 20% + 3 * 60% = 2.2. Why?

Tools for Evaluating
Performance
 Estimations
 Area/delay estimate
 Analysis
 Queuing theory
 Simulations
 ISA execution, circuit, gate
 Benchmarks

Benchmarks
 Performance best determined by running real applications
 Use programs typical of expected workload
 Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
 Not always feasible
 Small benchmarks
 nice for architects and designers
 easy to standardize
 can be abused
 SPEC (System Performance Evaluation Cooperative)
 companies have agreed on a set of real program and inputs
 valuable indicator of performance (and compiler technology)
 require upgraded to current software applications: SPEC ’89, SPEC ’92,
SPEC ’95, SPEC ’2000 (www.spec.org)
 can still be abused

SPEC CPU2000

Benchmark Games
 An embarrassed Intel Corp. acknowledged Friday that a bug in a software
program known as a compiler had led the company to overstate the speed
of its microprocessor chips on an industry benchmark by 10 percent.
However, industry analysts said the coding error…was a sad commentary
on a common industry practice of “cheating” on standardized performance
tests…The error was pointed out to Intel two days ago by a competitor,
Motorola …came in a test known as SPECint92…Intel acknowledged that it
had “optimized” its compiler to improve its test scores. The company had
also said that it did not like the practice but felt to compelled to make the
optimizations because its competitors were doing the same thing…At the
heart of Intel’s problem is the practice of “tuning” compiler programs to
recognize certain computing problems in the test and then substituting
special handwritten pieces of code…
Saturday, January 6, 1996 New York Times

Compare Performance
Across Multiple Programs
 Benchmark results are often contradictory
Computer A Computer B Computer C
Program 1 (seconds) 1 10 20
A is 10 times faster than B for program 1

B is 10 times faster than A for program 2
A is 20 times faster than C for program 1
C is 50 times faster than A for program 2
B is 2 times faster than C for program 1
C is 5 times faster than B for program 2
Each statement is correct. But we just want to

know which machine is the fastest.
Summarizing Results
Use (weighted) arithmetic mean for with execution times

Example
Computer A Computer B
Program 1 2s (60%) 4s (60%)
Program 2 16s (40%) 4s (40%)
 Computer A average time = 2x60% + 16x40% = 7.6s

 Computer B average time = 4x60% + 4x40% = 4s
 Computer B is 7.6/4 = 1.9 times faster than A for the
benchmark

SPEC ‘95
Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?

10 10
9 9
8 8
7 7
6 6
SPECint
SPECfp
5 5
4 4
3 3
2 2
1 1
0 0
50 100 150 200 250 50 100 150 200 250
Clock rate (MHz) Pentium Clock rate (MHz)

Pentium
Pentium Pro Pentium Pro

Amdahl's Law
Execution Time After Improvement =

Execution Time Unaffected +( Execution Time Affected / speedup )
ExTimeold 1
Speedup = =
ExTimenew Fractionenhanced
(1 − Fractionenhanced ) +
Speedupenhanced

Example
 Floating point instructions of a computer to run 2 times faster; but
only 10% of actual time is used for floating point operations. How
much faster will the computer become?
ExTimenew = ExTimeold × (0.9 + 0.1 / 2) = 0.95 × ExTimeold

Speedup = 1 / 0.95 = 1.053
The speedup resulting from an enhancement is bounded by the inverse of the

fraction that is not enhanced.

Amdahl’s Law: Corollary
 Principle: Make the common case fast
 Examples:
 All instructions requires an instruction fetch, only a
fraction require a data fetch/store => optimize
instruction access over data access
 Programs spend a lot of time accessing memory and
exhibit spatial and temporal locality =>design a storage
hierarchy such that the most frequent access are to the
smallest (fastest) memories

Example
 "Suppose a program runs in 100 seconds on a
machine, with multiply responsible for 80 seconds of
this time. How much do we have to improve the
speed of multiplication if we want the program to run
4 times faster?"
How about making it 5 times faster?

Example
 A program takes 10 seconds to run on the
current computer. The program spends 40% of
its time on floating-point operations, 40% on
integer operations, and 20% on I/O operations.
 If you can make the floating-point operations 2
times faster, what is the overall speedup of the
program?
 If we want the whole program to run 2 times
faster, how much do we need to improve the
speed of integer operations, in additional to
doubling the speed of floating-point operations?

Remember
 Performance is specific to a particular program/s
 Execution time is a consistent summary of performance
 For a given architecture performance increases come from:

 increases in clock rate (without adverse CPI affects)
 improvements in processor organization that lower CPI
 compiler enhancements that lower CPI and/or instruction count
 Pitfall: expecting improvement in one aspect of a machine’s performance
to significantly affect the total performance
 References:
 4.1 4.2
 4.5
 See Slides and solve some exercise

Performance

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Performance

Diunggah oleh

Hak Cipta:

Format Tersedia

Performance

CSE 313:Computer Architecture 1

CSE 313:Computer Architecture 2

CSE 313:Computer Architecture 3

 Which one is the fastest for transporting 1 passenger?

CSE 313:Computer Architecture 4

 If we upgrade a machine with a new processor what do we improve?

CSE 313:Computer Architecture 5

 "X is n times faster than Y“ (speedup)

 Hard Real Time/soft real time

CSE 313:Computer Architecture 7

CSE 313:Computer Architecture 8

seconds = cycles * seconds / cycle

CSE 313:Computer Architecture 9

________ the # of required cycles for a program, or

CSE 313:Computer Architecture 10

CSE 313:Computer Architecture 11

This assumption is incorrect,

CSE 313:Computer Architecture 12

 Floating point operations take longer than integer ones

 Accessing memory takes more time than accessing registers

CSE 313:Computer Architecture 13

Clock cycles = # instructions * average clock cycles per instruction

 Cycles per instruction (CPI) is often used

Time = # instructions * CPI * clock cycle time

 CPI is an average since the number of cycles per instruction varies

CSE 313:Computer Architecture 14

CSE 313:Computer Architecture 15

CSE 313:Computer Architecture 16

CSE 313:Computer Architecture 17

where CPIi is the CPI for the instruction in class i and Ii is

CSE 313:Computer Architecture 18

Where Pi is the percentage (frequency) of instruction in class i

CSE 313:Computer Architecture 19

CSE 313:Computer Architecture 20

CSE 313:Computer Architecture 21

CSE 313:Computer Architecture 22

instruction count clock rate

CSE 313:Computer Architecture 23

The first compiler's code uses 5 billion Class A instructions, 1

The second compiler's code uses 10 billion Class A instructions, 1

 Which sequence will be faster according to MIPS?

CSE 313:Computer Architecture 24

 Common pitfall: thinking one of the variables is

CSE 313:Computer Architecture 25

CSE 313:Computer Architecture 26

CSE 313:Computer Architecture 27

1. CPI = 2 * 40% + 4 * 60% = 3.2

CSE 313:Computer Architecture 28

CSE 313:Computer Architecture 29

CSE 313:Computer Architecture 30

CSE 313:Computer Architecture 31

CSE 313:Computer Architecture 32

A is 10 times faster than B for program 1

Each statement is correct. But we just want to

Use (weighted) arithmetic mean for with execution times

CSE 313:Computer Architecture 34

Program 1 2s (60%) 4s (60%)

Program 2 16s (40%) 4s (40%)

 Computer A average time = 2x60% + 16x40% = 7.6s

CSE 313:Computer Architecture 35

Can a machine with a slower clock rate have better performance?

Clock rate (MHz) Pentium Clock rate (MHz)