
Advanced Computer Architecture
Chapter 4: Advanced Pipelining
Ioannis Papaefstathiou
CS 590.25, Easter 2003
(thanks to Hennessy & Patterson)

Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Chap. 4 - Pipelining I

Chapter Overview

Technique                                   Reduces                        Section
Loop Unrolling                              Control stalls                 4.1
Basic Pipeline Scheduling                   RAW stalls                     4.1
Dynamic Scheduling with Scoreboarding       RAW stalls                     4.2
Dynamic Scheduling with Register Renaming   WAR and WAW stalls             4.2
Dynamic Branch Prediction                   Control stalls                 4.3
Issuing Multiple Instructions per Cycle     Ideal CPI                      4.4
Compiler Dependence Analysis                Ideal CPI & data stalls        4.5
Software Pipelining and Trace Scheduling    Ideal CPI & data stalls        4.5
Speculation                                 All data & control stalls      4.6
Dynamic Memory Disambiguation               RAW stalls involving memory    4.2, 4.6

Instruction Level Parallelism

ILP is the principle that there are many instructions in code that don't depend on each other. That means it's possible to execute those instructions in parallel.
This is easier said than done. Issues include:
- Building compilers to analyze the code,
- Building hardware to be even smarter than that code.
This section looks at some of the problems to be solved.

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Terminology
- Basic Block: the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
- Loop-Level Parallelism: the parallelism that exists within a loop. Such parallelism can cross loop iterations.
- Loop Unrolling: either the compiler or the hardware exploits the parallelism inherent in the loop.

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Simple Loop and its Assembler Equivalent
This is a clean and simple example!

for (i=1; i<=1000; i++)
    x[i] = x[i] + s;

Loop:  LD    F0,0(R1)   ;F0=vector element
       ADDD  F4,F0,F2   ;add scalar from F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,#8   ;decrement pointer 8 bytes (DW)
       BNEZ  R1,Loop    ;branch R1!=zero
       NOP              ;delayed branch slot

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

FP Loop Hazards

Loop:  LD    F0,0(R1)   ;F0=vector element
       ADDD  F4,F0,F2   ;add scalar in F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,#8   ;decrement pointer 8 bytes (DW)
       BNEZ  R1,Loop    ;branch R1!=zero
       NOP              ;delayed branch slot

Where are the stalls?

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

FP Loop Showing Stalls

1  Loop: LD   F0,0(R1)   ;F0=vector element
2        stall
3        ADDD F4,F0,F2   ;add scalar in F2
4        stall
5        stall
6        SD   0(R1),F4   ;store result
7        SUBI R1,R1,#8   ;decrement pointer 8 bytes (DW)
8        stall
9        BNEZ R1,Loop    ;branch R1!=zero
10       stall           ;delayed branch slot

(Latencies as before: FP ALU op to another FP ALU op = 3, FP ALU op to store double = 2, load double to FP ALU op = 1, integer op to integer op = 0.)

10 clocks: rewrite the code to minimize stalls?

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Scheduled FP Loop Minimizing Stalls

1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,#8
3       ADDD F4,F0,F2
4       stall
5       BNEZ R1,Loop    ;delayed branch
6       SD   8(R1),F4   ;address altered from 0(R1) when moved past SUBI

The remaining stall is because SD can't proceed: the ADDD result is not ready yet.
We swap BNEZ and SD by changing the address of SD.
Now 6 clocks: next, unroll the loop 4 times to make it faster.

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Unroll Loop Four Times (straightforward way)

1  Loop: LD   F0,0(R1)
2        stall
3        ADDD F4,F0,F2
4        stall
5        stall
6        SD   0(R1),F4
7        LD   F6,-8(R1)
8        stall
9        ADDD F8,F6,F2
10       stall
11       stall
12       SD   -8(R1),F8
13       LD   F10,-16(R1)
14       stall
15       ADDD F12,F10,F2
16       stall
17       stall
18       SD   -16(R1),F12
19       LD   F14,-24(R1)
20       stall
21       ADDD F16,F14,F2
22       stall
23       stall
24       SD   -24(R1),F16
25       SUBI R1,R1,#32
26       stall
27       BNEZ R1,LOOP
28       NOP             ;delayed branch slot

15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration.
Assumes the trip count is a multiple of 4.
Rewrite the loop to minimize stalls.

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Unrolled Loop That Minimizes Stalls

1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration. No stalls!!
What assumptions were made when we moved the code?
- OK to move the store past SUBI even though it changes a register.
- OK to move loads before stores: do we still get the right data?
- When is it safe for the compiler to make such changes?

Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling

Summary of Loop Unrolling Example
- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
- Eliminate the extra tests and branches and adjust the loop maintenance code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
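The unroll-by-4 transformation above can be sketched at the source level. This is a minimal sketch in Python (not the slides' DLX assembly): the distinct temporaries t0..t3 play the role of registers F4, F8, F12, F16 — four independent adds per iteration that a scheduler can overlap. Like the slides, it assumes the trip count is a multiple of 4.

```python
# Rolled form of the slides' loop: x[i] = x[i] + s.
def add_scalar_rolled(x, s):
    return [xi + s for xi in x]

# Unrolled by 4 with distinct temporaries, so the four adds in each
# iteration carry no name dependences on one another.
def add_scalar_unrolled(x, s):
    out = list(x)
    for i in range(0, len(x), 4):      # assumes len(x) % 4 == 0
        t0 = out[i] + s
        t1 = out[i + 1] + s
        t2 = out[i + 2] + s
        t3 = out[i + 3] + s
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
    return out
```

Both forms compute the same result; the unrolled one simply exposes more independent work per iteration.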

Instruction Level Parallelism
Dependencies

Compiler Perspectives on Code Movement
The compiler is concerned about dependencies in the program; it is not concerned with whether a HW hazard occurs in a given pipeline. It tries to schedule code to avoid hazards.
It looks for data dependencies (RAW if a hazard for HW):
- Instruction i produces a result used by instruction j, or
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
If instructions are dependent, they can't execute in parallel.
Dependences are easy to determine for registers (fixed names), but hard for memory:
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?

Instruction Level Parallelism
Data Dependencies

Compiler Perspectives on Code Movement

1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,#8
4       BNEZ R1,Loop    ;delayed branch
5       SD   8(R1),F4   ;address altered when moved past SUBI

Where are the data dependencies?

Instruction Level Parallelism
Name Dependencies

Compiler Perspectives on Code Movement

Another kind of dependence, called a name dependence: two instructions use the same name (register or memory location) but don't exchange data.
- Anti-dependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence (WAW if a hazard for HW): instructions i and j write the same register or memory location; the ordering between the instructions must be preserved.

Instruction Level Parallelism
Name Dependencies

Compiler Perspectives on Code Movement

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SD   0(R1),F4
4        LD   F0,-8(R1)
5        ADDD F4,F0,F2
6        SD   -8(R1),F4
7        LD   F0,-16(R1)
8        ADDD F4,F0,F2
9        SD   -16(R1),F4
10       LD   F0,-24(R1)
11       ADDD F4,F0,F2
12       SD   -24(R1),F4
13       SUBI R1,R1,#32
14       BNEZ R1,LOOP
15       NOP

Where are the name dependencies?
No data is passed in F0, but we can't reuse F0 in cycle 4.
How can we remove these dependencies?
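Register renaming is the standard answer to the question above: give each write a fresh register and rewrite later reads to use it, and the WAR/WAW name dependences vanish. A minimal sketch in Python, with a hypothetical instruction representation (dest, sources) of our own, not the hardware's:

```python
# Rename destinations to fresh "physical" names P0, P1, ... and rewrite
# each source to the latest name of the register it reads.
def rename(instrs):
    next_id = 0
    current = {}                      # architectural reg -> latest physical name
    out = []
    for dest, srcs in instrs:
        srcs = tuple(current.get(s, s) for s in srcs)  # read renamed sources
        phys = f"P{next_id}"          # fresh name for every write
        next_id += 1
        current[dest] = phys
        out.append((phys, srcs))
    return out

# Two unrolled copies that reuse F0 and F4 (WAR/WAW name dependences):
code = [("F0", ("mem0",)), ("F4", ("F0", "F2")),
        ("F0", ("mem1",)), ("F4", ("F0", "F2"))]
renamed = rename(code)
```

After renaming, every destination is unique, so the second LD/ADDD pair no longer conflicts with the first and the pairs can be scheduled in parallel.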

Instruction Level Parallelism
Name Dependencies

Compiler Perspectives on Code Movement

Again, name dependencies are hard for memory accesses:
- Does 100(R4) = 20(R6)?
- From different loop iterations, does 20(R6) = 20(R6)?
Our example required the compiler to know that if R1 doesn't change, then 0(R1) != 8(R1) != 16(R1) != 24(R1).
There were no dependencies between some loads and stores, so they could be moved around each other.

Instruction Level Parallelism
Control Dependencies

Compiler Perspectives on Code Movement

The final kind of dependence is called a control dependence.
Example:
if p1 { S1; };
if p2 { S2; };
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Instruction Level Parallelism
Control Dependencies

Compiler Perspectives on Code Movement

Two (obvious) constraints on control dependences:
- An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
- An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
Control dependencies can be relaxed to get parallelism; we get the same effect if we preserve the order of exceptions (e.g., an address in a register checked by a branch before use) and the data flow (a value in a register that depends on a branch).

Instruction Level Parallelism
Control Dependencies

Compiler Perspectives on Code Movement

1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F2
3        SD   0(R1),F4
4        SUBI R1,R1,#8
5        BEQZ R1,exit
6        LD   F0,0(R1)
7        ADDD F4,F0,F2
8        SD   0(R1),F4
9        SUBI R1,R1,#8
10       BEQZ R1,exit
11       LD   F0,0(R1)
12       ADDD F4,F0,F2
13       SD   0(R1),F4
14       SUBI R1,R1,#8
15       BEQZ R1,exit
....

Where are the control dependencies?

Instruction Level Parallelism
Loop Level Parallelism

When Safe to Unroll Loop?

Example: where are the data dependencies? (A, B, C distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence between iterations. It implies that the iterations are dependent and can't be executed in parallel.
Note that this was not the case for our prior example: there, each iteration was distinct.

Instruction Level Parallelism
Loop Level Parallelism

When Safe to Unroll Loop?

Example: where are the data dependencies? (A, B, C, D distinct & non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this dependence is absent, interchanging the two statements will not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

Instruction Level Parallelism
Loop Level Parallelism

Now Safe to Unroll Loop? (p. 240)

OLD:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

NEW:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

No circular dependencies. The loop-carried dependence was on B; it has now been eliminated, so the iterations of the new loop are independent.
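The OLD and NEW loops above can be checked for equivalence directly. A quick sketch in Python, using lists padded at index 0 so the 1-based indexing of the slides carries over unchanged:

```python
# Original loop: S1 reads B[i], which iteration i-1's S2 wrote (loop-carried).
def old_form(A, B, C, D):
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2
    return A, B

# Transformed loop: the dependence is kept inside one iteration, so the
# 99 iterations of the new loop are independent of each other.
def new_form(A, B, C, D):
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B
```

Running both on the same inputs yields identical A and B arrays.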

Dynamic Scheduling

Dynamic scheduling is when the hardware rearranges the order of instruction execution to reduce stalls.
Advantages:
- Dependencies unknown at compile time can be handled by the hardware.
- Code compiled for one type of pipeline can be run efficiently on another.
Disadvantages:
- The hardware is much more complex.

Dynamic Scheduling
The idea: HW Schemes for Instruction Parallelism

Why in HW at run time?
- Works when we can't know the real dependence at compile time.
- The compiler is simpler.
- Code for one machine runs well on another.
Key idea: allow instructions behind a stall to proceed.
Key idea: instructions execute in parallel; there are multiple execution units, so use them.

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

Enables out-of-order execution => out-of-order completion.

Dynamic Scheduling
The idea: HW Schemes for Instruction Parallelism

Out-of-order execution divides the ID stage:
1. Issue — decode instructions, check for structural hazards.
2. Read operands — wait until no data hazards, then read operands.
Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions.
A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
First used in the CDC 6600; our example is modified here for DLX.
The CDC had 4 FP units, 5 memory reference units, 7 integer units. DLX has 2 FP multiply, 1 FP adder, 1 FP divider, 1 integer.

Dynamic Scheduling
Using A Scoreboard

Scoreboard Implications

Out-of-order completion => WAR and WAW hazards?
Solutions for WAR:
- Queue both the operation and copies of its operands.
- Read registers only during the Read Operands stage.
For WAW, we must detect the hazard: stall until the other instruction completes.
We need multiple instructions in the execution phase => multiple execution units or pipelined execution units.
The scoreboard keeps track of dependencies and the state of operations.
The scoreboard replaces ID, EX, WB with 4 stages.

Dynamic Scheduling
Using A Scoreboard

Four Stages of Scoreboard Control

1. Issue — decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions will issue until these hazards are cleared.

Dynamic Scheduling
Using A Scoreboard

Four Stages of Scoreboard Control

2. Read operands — wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

Dynamic Scheduling
Using A Scoreboard

Four Stages of Scoreboard Control

3. Execution — operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result — finish execution (WB)
Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If none, it writes the result. If there is a WAR, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The scoreboard would stall SUBD's write until ADDD reads its operands.

Dynamic Scheduling
Using A Scoreboard

Three Parts of the Scoreboard

1. Instruction status — which of the 4 steps the instruction is in.
2. Functional unit status — indicates the state of the functional unit (FU). 9 fields for each functional unit:
   Busy — whether the unit is busy or not
   Op — operation to perform in the unit (e.g., + or -)
   Fi — destination register
   Fj, Fk — source-register numbers
   Qj, Qk — functional units producing source registers Fj, Fk
   Rj, Rk — flags indicating when Fj, Fk are ready
3. Register result status — indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
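The three structures above can be sketched directly. A minimal sketch in Python of the scoreboard's state and the stage-1 issue test (unit free, no WAW on the destination); the field names follow the slides, but this object shape is ours, not the CDC 6600's:

```python
class Scoreboard:
    def __init__(self, units):
        # Functional unit status: the 9 fields per unit from the slides.
        self.fu = {u: {"Busy": False, "Op": None, "Fi": None, "Fj": None,
                       "Fk": None, "Qj": None, "Qk": None,
                       "Rj": False, "Rk": False} for u in units}
        # Register result status: register -> unit that will write it.
        self.result = {}

    def can_issue(self, unit, dest):
        # Structural hazard: unit busy.  WAW hazard: dest already claimed.
        return not self.fu[unit]["Busy"] and dest not in self.result

    def issue(self, unit, op, fi, fj, fk):
        u = self.fu[unit]
        u.update(Busy=True, Op=op, Fi=fi, Fj=fj, Fk=fk,
                 Qj=self.result.get(fj), Qk=self.result.get(fk))
        u["Rj"] = u["Qj"] is None     # source ready iff no unit will write it
        u["Rk"] = u["Qk"] is None
        self.result[fi] = unit

sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", None, "R3")   # LD  F2,45(R3)
sb.issue("Mult1", "Mult", "F0", "F2", "F4")     # MULTD F0,F2,F4
```

After these two issues, Mult1's Qj points at Integer (F2 is still being loaded), so Rj is No while Rk is Yes — exactly the state the worked example below reaches in cycle 6.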

Dynamic Scheduling
Using A Scoreboard

Detailed Scoreboard Pipeline Control

Instruction status     Wait until                            Bookkeeping
Issue                  Not Busy(FU) and not Result(D)        Busy(FU) <- Yes; Op(FU) <- op; Fi(FU) <- D;
                                                             Fj(FU) <- S1; Fk(FU) <- S2;
                                                             Qj <- Result(S1); Qk <- Result(S2);
                                                             Rj <- not Qj; Rk <- not Qk; Result(D) <- FU
Read operands          Rj and Rk                             Rj <- No; Rk <- No
Execution complete     Functional unit done                  (none)
Write result           For all f: (Fj(f) != Fi(FU)           For all f: if Qj(f)=FU then Rj(f) <- Yes;
                       or Rj(f)=No) and (Fk(f) != Fi(FU)     for all f: if Qk(f)=FU then Rk(f) <- Yes;
                       or Rk(f)=No)                          Result(Fi(FU)) <- 0; Busy(FU) <- No

Dynamic Scheduling
Using A Scoreboard

Scoreboard Example
This is the sample code we'll be working with in the example:

LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

What are the hazards in this code?

Latencies (clock cycles): LD 1, MULTD 10, SUBD 2, DIVD 40, ADDD 2.

Scoreboard Example (initial state)
Instruction status (Issue / Read operands / Execution complete / Write result): all entries blank.
Functional unit status (Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk): Integer, Mult1, Mult2, Add, Divide — all Busy = No.
Register result status (F0, F2, F4, F6, F8, F10, F12, ..., F30): all blank.

Scoreboard Example, Cycle 1
Issue LD #1. Table entries show the cycle in which each event occurred.

Instruction status (Issue / Read / Exec complete / Write):
  LD F6,34(R2):  1 / - / - / -
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2, Rk=Yes); all other units free.
Register result status: F6 <- Integer.

Scoreboard Example, Cycle 2
LD #2 can't issue since the integer unit is busy. MULTD can't issue because we require in-order issue.

Instruction status (Issue / Read / Exec complete / Write):
  LD F6,34(R2):  1 / 2 / - / -
Functional unit status: Integer busy (Load, Fi=F6, Fk=R2, Rk=Yes).
Register result status: F6 <- Integer.

Scoreboard Example, Cycle 3
LD #1 completes execution.

Instruction status (Issue / Read / Exec complete / Write):
  LD F6,34(R2):  1 / 2 / 3 / -
Functional unit status: Integer busy (Load, Fi=F6).
Register result status: F6 <- Integer.

Scoreboard Example, Cycle 4
LD #1 writes F6.

Instruction status (Issue / Read / Exec complete / Write):
  LD F6,34(R2):  1 / 2 / 3 / 4

Scoreboard Example, Cycle 5
Issue LD #2 since the integer unit is now free.

Instruction status (Issue / Read / Exec complete / Write):
  LD F6,34(R2):  1 / 2 / 3 / 4
  LD F2,45(R3):  5 / - / - / -
Functional unit status: Integer busy (Load, Fi=F2, Fk=R3, Rk=Yes).
Register result status: F2 <- Integer.

Scoreboard Example, Cycle 6
Issue MULTD.

Instruction status (Issue / Read / Exec complete / Write):
  LD F2,45(R3):    5 / 6 / - / -
  MULTD F0,F2,F4:  6 / - / - / -
Functional unit status: Integer busy (Load, Fi=F2); Mult1 busy (Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes).
Register result status: F0 <- Mult1, F2 <- Integer.

Scoreboard Example, Cycle 7
SUBD issues. MULTD can't read its operands (F2) because LD #2 hasn't finished.

Instruction status (Issue / Read / Exec complete / Write):
  LD F2,45(R3):    5 / 6 / 7 / -
  MULTD F0,F2,F4:  6 / - / - / -
  SUBD F8,F6,F2:   7 / - / - / -
Functional unit status: Integer busy (Load, Fi=F2); Mult1 busy (Mult, Fi=F0, Qj=Integer, Rj=No, Rk=Yes); Add busy (Sub, Fi=F8, Fj=F6, Fk=F2, Qk=Integer, Rj=Yes, Rk=No).
Register result status: F0 <- Mult1, F2 <- Integer, F8 <- Add.

Scoreboard Example, Cycle 8a
DIVD issues. MULTD and SUBD are both waiting for F2.

Instruction status (Issue / Read / Exec complete / Write):
  LD F2,45(R3):    5 / 6 / 7 / -
  MULTD F0,F2,F4:  6 / - / - / -
  SUBD F8,F6,F2:   7 / - / - / -
  DIVD F10,F0,F6:  8 / - / - / -
Functional unit status: Integer busy (Load, Fi=F2); Mult1 busy (Mult, Qj=Integer, Rj=No, Rk=Yes); Add busy (Sub, Qk=Integer, Rj=Yes, Rk=No); Divide busy (Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes).
Register result status: F0 <- Mult1, F2 <- Integer, F8 <- Add, F10 <- Divide.

Scoreboard Example, Cycle 8b
LD #2 writes F2.

Instruction status (Issue / Read / Exec complete / Write):
  LD F2,45(R3):    5 / 6 / 7 / 8
  MULTD F0,F2,F4:  6 / - / - / -
  SUBD F8,F6,F2:   7 / - / - / -
  DIVD F10,F0,F6:  8 / - / - / -
Functional unit status: Integer now free; Mult1's Rj and Add's Rk become Yes (F2 is ready); Divide still has Rj=No.
Register result status: F0 <- Mult1, F8 <- Add, F10 <- Divide (F2 entry cleared).

Scoreboard Example, Cycle 9
Now MULTD and SUBD can both read F2. How can both instructions do this at the same time??

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  SUBD F8,F6,F2:   7 / 9 / - / -
Functional unit status: Mult1 Time=10; Add Time=2; Divide waiting (Qj=Mult1, Rj=No).

Scoreboard Example, Cycle 11
SUBD completes execution. ADDD can't issue because the add unit is busy.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  SUBD F8,F6,F2:   7 / 9 / 11 / -
Functional unit status: Mult1 Time=8; Add Time=0.

Scoreboard Example, Cycle 12
SUBD writes F8 and finishes. DIVD is still waiting for F0.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  SUBD F8,F6,F2:   7 / 9 / 11 / 12
Functional unit status: Add now free; Mult1 Time=7; Divide busy (Qj=Mult1, Rj=No, Rk=Yes).
Register result status: F0 <- Mult1, F10 <- Divide.

Scoreboard Example, Cycle 13
ADDD issues.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  ADDD F6,F8,F2:   13 / - / - / -
Functional unit status: Mult1 Time=6; Add busy (Add, Fi=F6, Fj=F8, Fk=F2, Rj=Yes, Rk=Yes); Divide busy (Qj=Mult1, Rj=No).
Register result status: F0 <- Mult1, F6 <- Add, F10 <- Divide.

Scoreboard Example, Cycle 14
ADDD reads its operands.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  ADDD F6,F8,F2:   13 / 14 / - / -
Functional unit status: Mult1 Time=5; Add Time=2.

Scoreboard Example, Cycle 15
Execution continues.

Functional unit status: Mult1 Time=4; Add Time=1.

Scoreboard Example, Cycle 16
ADDD completes execution.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  ADDD F6,F8,F2:   13 / 14 / 16 / -
Functional unit status: Mult1 Time=3; Add Time=0.

Scoreboard Example, Cycle 17
ADDD can't write its result: DIVD has not yet read F6, so writing F6 now would be a WAR hazard!

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / - / -
  ADDD F6,F8,F2:   13 / 14 / 16 / -
Functional unit status: Mult1 Time=2.

Scoreboard Example, Cycle 18
Nothing happens!!

Functional unit status: Mult1 Time=1.

Scoreboard Example, Cycle 19
MULTD completes execution.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / 19 / -
Functional unit status: Mult1 Time=0.

Scoreboard Example, Cycle 20
MULTD writes F0.

Instruction status (Issue / Read / Exec complete / Write):
  MULTD F0,F2,F4:  6 / 9 / 19 / 20
Functional unit status: Mult1 now free; Divide's Rj becomes Yes (F0 ready).
Register result status: F6 <- Add, F10 <- Divide (F0 entry cleared).

Scoreboard Example, Cycle 21
DIVD reads its operands.

Instruction status (Issue / Read / Exec complete / Write):
  DIVD F10,F0,F6:  8 / 21 / - / -

Dynamic Scheduling

Using A Scoreboard

Scoreboard Example Cycle 22


Instruction status          Issue  Read operands  Execution complete  Write result
LD    F6,34+(R2)              1          2                 3                4
LD    F2,45+(R3)              5          6                 7                8
MULTD F0,F2,F4                6          9                19               20
SUBD  F8,F6,F2                7          9                11               12
DIVD  F10,F0,F6               8         21
ADDD  F6,F8,F2               13         14                16               22

Functional unit status      dest  S1   S2
Time  Name     Busy  Op     Fi    Fj   Fk   Qj   Qk   Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
  40  Divide   Yes   Div    F10   F0   F6             No   No

Register result status
Clock 22    F0  F2  F4  F6  F8  F10     F12 ... F30
       FU                       Divide

Now ADDD can write, since the WAR hazard on F6 has been removed.

Dynamic Scheduling

Using A Scoreboard

Scoreboard Example Cycle 61


Instruction status          Issue  Read operands  Execution complete  Write result
LD    F6,34+(R2)              1          2                 3                4
LD    F2,45+(R3)              5          6                 7                8
MULTD F0,F2,F4                6          9                19               20
SUBD  F8,F6,F2                7          9                11               12
DIVD  F10,F0,F6               8         21                61
ADDD  F6,F8,F2               13         14                16               22

Functional unit status      dest  S1   S2
Time  Name     Busy  Op     Fi    Fj   Fk   Qj   Qk   Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
   0  Divide   Yes   Div    F10   F0   F6             No   No

Register result status
Clock 61    F0  F2  F4  F6  F8  F10     F12 ... F30
       FU                       Divide

DIVD completes execution.

Using A Scoreboard

Dynamic Scheduling

Scoreboard Example Cycle 62


Instruction status          Issue  Read operands  Execution complete  Write result
LD    F6,34+(R2)              1          2                 3                4
LD    F2,45+(R3)              5          6                 7                8
MULTD F0,F2,F4                6          9                19               20
SUBD  F8,F6,F2                7          9                11               12
DIVD  F10,F0,F6               8         21                61               62
ADDD  F6,F8,F2               13         14                16               22

Functional unit status      dest  S1   S2
Time  Name     Busy  Op     Fi    Fj   Fk   Qj   Qk   Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock 62    F0  F2  F4  F6  F8  F10  F12 ... F30
       FU   (all entries empty)

DIVD writes its result. DONE!!

Dynamic Scheduling

Using A Scoreboard

Another Dynamic Algorithm: the Tomasulo Algorithm

Built for the IBM 360/91, about 3 years after the CDC 6600 (1966).

Goal: high performance without special compilers.
Differences between the IBM 360 and CDC 6600 ISAs:
- The IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
- The IBM has 4 FP registers vs. 8 in the CDC 6600.

Why study it? It led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, PowerPC 604, ...

Dynamic Scheduling

Using A Scoreboard

Tomasulo Algorithm vs. Scoreboard

Control and buffers are distributed with the functional units (FUs) rather than centralized in a scoreboard; the FU buffers are called reservation stations and hold pending operands.
Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming, and it avoids WAR and WAW hazards. There are more reservation stations than registers, so the hardware can do optimizations compilers can't.
Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs.
Loads and stores are treated as FUs with reservation stations as well.
Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.

Dynamic Scheduling

Using A Scoreboard

Tomasulo Organization

[Figure: the FP op queue and FP registers feed the FP add and FP multiply reservation stations; load buffers and store buffers sit alongside them; the Common Data Bus carries results back and broadcasts them to all units.]

Dynamic Scheduling

Using A Scoreboard

Reservation Station Components


Op - the operation to perform in the unit (e.g., + or -)
Vj, Vk - the values of the source operands; store buffers have a V field holding the result to be stored
Qj, Qk - the reservation stations producing the source operands (the values to be written). Note: there are no ready flags as in the scoreboard; Qj, Qk = 0 => ready. Store buffers only have Qi, for the RS producing the result.
Busy - indicates that the reservation station or FU is busy
Register result status - indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.
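The fields above map naturally onto a small record type. A minimal sketch in Python (the class and field names are illustrative, not from the 360/91):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None    # operation to perform (e.g. '+' or '*')
    vj: Optional[float] = None  # value of first source operand
    vk: Optional[float] = None  # value of second source operand
    qj: Optional[str] = None    # RS producing Vj; None means Vj is ready
    qk: Optional[str] = None    # RS producing Vk; None means Vk is ready

    def ready(self) -> bool:
        # No explicit ready flags: Qj == Qk == None means both operands ready
        return self.busy and self.qj is None and self.qk is None
```

With this convention, `qj is None and qk is None` plays the role the scoreboard's ready flags played.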

Dynamic Scheduling

Using A Scoreboard

Three Stages of Tomasulo Algorithm


1. Issue - get an instruction from the FP op queue.
   If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming the registers).
2. Execute - operate on the operands (EX).
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result - finish execution (WB).
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
A normal data bus carries data + destination (a "go to" bus); the Common Data Bus carries data + source (a "come from" bus): 64 bits of data + 4 bits of functional-unit source address. A unit captures the value if the tag matches the functional unit it is waiting on (the producer of the result); the bus does the broadcast.
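The write-result stage can be sketched as a tag-matched broadcast over the Common Data Bus. A toy illustration in Python (the dict-based stations and field names are assumptions for brevity):

```python
def broadcast(cdb_tag, cdb_value, stations):
    """Write-result stage: the CDB carries (source tag, value); every
    reservation station waiting on that tag captures the value and
    clears its Q field, marking the operand ready."""
    for rs in stations:
        if rs.get('qj') == cdb_tag:
            rs['vj'], rs['qj'] = cdb_value, None
        if rs.get('qk') == cdb_tag:
            rs['vk'], rs['qk'] = cdb_value, None
```

Note that the producer never names its consumers: every station snoops the bus and self-selects by tag, which is what makes the bus "come from" rather than "go to".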

Using A Scoreboard

Dynamic Scheduling

Tomasulo Example Cycle 0


Instruction status          Issue  Execution complete  Write result
LD    F6,34+(R2)
LD    F2,45+(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Reservation stations                          Load buffers
Time  Name   Busy  Op  Vj  Vk  Qj  Qk         Name   Busy  Address
   0  Add1   No                               Load1  No
   0  Add2   No                               Load2  No
   0  Add3   No                               Load3  No
   0  Mult1  No
   0  Mult2  No

Register result status
Clock 0    F0  F2  F4  F6  F8  F10  F12 ... F30
      FU   (all entries empty)

Dynamic Scheduling

Using A Scoreboard

Review: Tomasulo

Prevents the register file from becoming a bottleneck
Avoids the WAR and WAW hazards of the scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (provided branch prediction)
Lasting contributions:
- Dynamic scheduling
- Register renaming
- Load/store disambiguation

The 360/91's descendants include the PowerPC 604 and 620, MIPS R10000, HP PA-8000, and Intel Pentium Pro.

Dynamic Hardware
Prediction
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not? The hardware can look for clues in the instructions, or it can use past history; we will discuss both of these directions.

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Prediction Buffers

Dynamic Branch Prediction

Performance = f(accuracy, cost of misprediction).
Branch history table: the lower bits of the PC address index a table of 1-bit values that record whether or not the branch was taken last time.

Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution:
- at the end of the loop, when it exits instead of looping as before, and
- the first time through the loop on the next pass through the code, when it predicts exit instead of looping.

[Figure: the low-order address bits (bits 13-2 in the drawing) index a table of prediction bits, entries 0 to 1023.]
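The two-mispredictions-per-loop behavior of a 1-bit BHT is easy to check with a toy simulation (a sketch; the 10-iteration branch trace is made up):

```python
def one_bit_mispredictions(outcomes):
    """Count mispredictions of a 1-bit predictor that always predicts
    whatever the branch did the previous time."""
    state = outcomes[0]   # assume the bit starts agreeing with the first outcome
    misses = 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken     # remember only the last outcome
    return misses

# A loop branch taken 9 times then not taken (exit), executed twice:
trace = ([True] * 9 + [False]) * 2
```

In steady state each loop execution costs two mispredictions: one on the exit, and one on the first iteration of the next execution, when the stale "exit" prediction is still in the table. (Over this trace the count is 3: the first execution is warmed up and only mispredicts its exit.)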

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Prediction Buffers

Dynamic Branch Prediction


Solution: a 2-bit scheme that changes the prediction only after two successive mispredictions (Figure 4.13, p. 264).

[State diagram: two Predict Taken states and two Predict Not Taken states. A taken branch (T) moves toward the strongly-taken state; a not-taken branch (NT) moves toward the strongly-not-taken state. A single misprediction in a strong state only weakens the prediction rather than flipping it.]
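The 2-bit scheme above can be sketched as a saturating counter. On the same kind of loop trace (9 taken iterations, then exit, repeated) it mispredicts only once per loop execution (a sketch; the starting state is an assumption):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict
    taken. Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self, state=3):
        self.state = state  # start strongly taken (an assumption)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
misses = 0
for taken in ([True] * 9 + [False]) * 2:   # two executions of a 10-iteration loop
    if p.predict() != taken:
        misses += 1
    p.update(taken)
```

The exit misprediction only drops the counter to the weakly-taken state, so the first iteration of the next execution is still predicted taken, correctly.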

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Prediction Buffers

BHT Accuracy

Mispredictions come from either:
- a wrong guess for that branch, or
- getting the branch history of the wrong branch when indexing the table.
With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%.
4096 entries is about as good as an infinite table, but 4096 entries is a lot of HW.

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Prediction Buffers

Correlating Branches

Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history).

[Figure: the branch address indexes a row of 2-bit-per-branch predictors; a 2-bit global branch history then selects between, say, four predictions of the next branch, and only the selected prediction is updated.]
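A minimal (2,2)-style correlating predictor can be sketched as follows (the table is a Python dict rather than fixed-size hardware, and the warm-up behavior is only illustrative):

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select one of four
    2-bit counters per branch, keyed here by (branch address, history)."""
    def __init__(self):
        self.history = 0          # 2-bit global taken/not-taken history
        self.counters = {}        # (addr, history) -> 2-bit saturating counter

    def predict(self, addr):
        return self.counters.get((addr, self.history), 0) >= 2

    def update(self, addr, taken):
        key = (addr, self.history)
        c = self.counters.get(key, 0)
        self.counters[key] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

On a strictly alternating branch, the per-history counters quickly learn the pattern and the predictor stops mispredicting, which a single 2-bit counter never does.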

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Prediction Buffers

Frequency of Mispredictions

Accuracy of Different Schemes (Figure 4.21, p. 272)

[Chart: misprediction frequency for three schemes - 4,096 entries at 2 bits per entry; unlimited entries at 2 bits per entry; and 1,024 entries with 2 bits of history and 2 bits per entry - across nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li. The simple 4,096-entry table ranges from 0-1% on the FP codes (nasa7, matrix300, tomcatv) up to 18% (eqntott); the unlimited table is barely better; the (2,2) correlating predictor is best, cutting eqntott from 18% to 6%.]

Dynamic Hardware
Prediction

Basic Branch Prediction:


Branch Target Buffers

Branch Target Buffer

Branch Target Buffer (BTB): use the address of the branch as an index to get the prediction AND the branch-target address (if taken).
Note: the buffer must now check for a branch match, since fetching from the wrong branch's target address would be incorrect (Figure 4.22, p. 273).
Return-instruction addresses are predicted with a stack.

[Figure: on a hit, the BTB supplies a predicted PC together with a taken/not-taken prediction.]

Dynamic Hardware
Prediction

Example

Basic Branch Prediction: Branch Target Buffers

Instruction in buffer   Prediction   Actual branch   Penalty cycles
Yes                     Taken        Taken                0
Yes                     Taken        Not taken            2
No                      --           Taken                2

Example on page 274. Determine the total branch penalty for a BTB using the above penalties. Assume also the following:
- Prediction accuracy of 90% (so 10% of predictions are incorrect)
- Hit rate in the buffer of 90%
- 60% taken-branch frequency

Branch penalty = (buffer hit rate x fraction of incorrect predictions x 2)
               + ((1 - buffer hit rate) x fraction of taken branches x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2) = 0.18 + 0.12 = 0.30 clock cycles
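The arithmetic of the example can be reproduced directly (the numbers are taken from the worked example above):

```python
# BTB branch-penalty model: mispredicted buffer hits and unpredicted
# taken branches each cost 2 cycles.
hit_rate = 0.90    # fraction of branches found in the buffer
incorrect = 0.10   # fraction of buffer hits predicted wrongly
taken = 0.60       # fraction of branches that are taken
penalty = hit_rate * incorrect * 2 + (1 - hit_rate) * taken * 2
```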

Multiple Issue
4.1 Instruction Level Parallelism:
Concepts and Challenges
4.2 Overcoming Data Hazards
with Dynamic Scheduling
4.3 Reducing Branch Penalties
with Dynamic Hardware
Prediction
4.4 Taking Advantage of More ILP
with Multiple Issue
4.5 Compiler Support for
Exploiting ILP
4.6 Hardware Support for
Extracting more Parallelism
4.7 Studies of ILP

Multiple issue is the ability of the processor to start more than one instruction in a given cycle.

Flavor I:
Superscalar processors issue a varying number of instructions per clock (1 to 8) and can be either statically scheduled (by the compiler) or dynamically scheduled (by the hardware, e.g. Tomasulo).
Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP PA-8000.

Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW (Very Long Instruction Word) issues a fixed number of instructions (4-16), formatted either as one very large instruction or as a fixed packet of smaller instructions, scheduled by the compiler, which places operations into the wide templates.
Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (IA-64), a 64-bit-address ISA in the Explicitly Parallel Instruction Computer (EPIC) style.

Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II - continued:

- 3 instructions in 128-bit groups; a field determines whether the instructions are dependent or independent
- Smaller code size than old VLIW, larger than x86/RISC
- Groups can be linked to show independence of more than 3 instructions
- 64 integer registers + 64 floating-point registers, not separate files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 of 64 1-bit flags) => 40% fewer mispredictions?
IA-64 is the name of the instruction set architecture; EPIC is the style; Merced was the name of the first implementation (1999/2000?).

Multiple Issue

A SuperScalar Version of DLX

Issuing Multiple Instructions/Cycle


Fetch 64 bits per clock cycle; the integer instruction is on the left, the FP instruction on the right.
Can only issue the 2nd instruction if the 1st instruction issues.
More ports are needed on the FP register file to do an FP load and an FP op as a pair.
In our DLX example, we can handle 2 instructions/cycle: one floating point + one of anything else.

Type              Pipe stages
Int. instruction  IF  ID  EX  MEM  WB
FP instruction    IF  ID  EX  MEM  WB
Int. instruction      IF  ID  EX   MEM  WB
FP instruction        IF  ID  EX   MEM  WB
Int. instruction          IF  ID   EX   MEM  WB
FP instruction            IF  ID   EX   MEM  WB

A 1-cycle load delay now delays 3 instructions in the superscalar: the instruction in the right half of the pair cannot use the result, nor can the instructions in the next issue slot.

Multiple Issue

A SuperScalar Version of DLX

Unrolled Loop Minimizes Stalls for Scalar

 1  Loop: LD    F0,0(R1)
 2        LD    F6,-8(R1)
 3        LD    F10,-16(R1)
 4        LD    F14,-24(R1)
 5        ADDD  F4,F0,F2
 6        ADDD  F8,F6,F2
 7        ADDD  F12,F10,F2
 8        ADDD  F16,F14,F2
 9        SD    0(R1),F4
10        SD    -8(R1),F8
11        SD    -16(R1),F12
12        SUBI  R1,R1,#32
13        BNEZ  R1,LOOP
14        SD    8(R1),F16    ; 8-32 = -24

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles.
14 clock cycles, or 3.5 per iteration.

Multiple Issue

A SuperScalar Version of DLX

Loop Unrolling in Superscalar

       Integer instruction    FP instruction        Clock cycle
Loop:  LD   F0,0(R1)                                     1
       LD   F6,-8(R1)                                    2
       LD   F10,-16(R1)       ADDD F4,F0,F2              3
       LD   F14,-24(R1)       ADDD F8,F6,F2              4
       LD   F18,-32(R1)       ADDD F12,F10,F2            5
       SD   0(R1),F4          ADDD F16,F14,F2            6
       SD   -8(R1),F8         ADDD F20,F18,F2            7
       SD   -16(R1),F12                                  8
       SD   -24(R1),F16                                  9
       SUBI R1,R1,#40                                   10
       BNEZ R1,LOOP                                     11
       SD   8(R1),F20                                   12

Unrolled 5 times to avoid delays (+1 due to the superscalar pairing).
12 clocks, or 2.4 clocks per iteration.

Multiple Issue

Multiple Instruction Issue &


Dynamic Scheduling

Dynamic Scheduling in Superscalar

Code compiled for the scalar version will run poorly on the superscalar, so we may want the code to vary depending on how superscalar the machine is.
A simple approach: separate Tomasulo control, with separate reservation stations for the integer FU/registers and for the FP FU/registers.

Multiple Issue

Multiple Instruction Issue &


Dynamic Scheduling

Dynamic Scheduling in Superscalar

How do we issue two instructions at a time and keep in-order instruction issue for Tomasulo?
- Issue at 2X the clock rate, so that issue remains in order.
- Only FP loads might cause a dependency between integer and FP issue:
  - Replace the load reservation station with a load queue; operands must be read in the order they are fetched.
  - A load checks addresses in the store queue to avoid RAW violations.
  - A store checks addresses in the load queue to avoid WAR and WAW violations.

Multiple Issue

Multiple Instruction Issue &


Dynamic Scheduling

Performance of Dynamic Superscalar

Iteration  Instruction        Issues  Executes  Writes result
                                    (clock-cycle number)
    1      LD   F0,0(R1)        1        2           4
    1      ADDD F4,F0,F2        1        5           8
    1      SD   0(R1),F4        2        9
    1      SUBI R1,R1,#8        3        4           5
    1      BNEZ R1,LOOP         4        5
    2      LD   F0,0(R1)        5        6           8
    2      ADDD F4,F0,F2        5        9          12
    2      SD   0(R1),F4        6       13
    2      SUBI R1,R1,#8        7        8           9
    2      BNEZ R1,LOOP         8        9

4 clocks per iteration; branches and decrements still take 1 clock cycle.

VLIW

Multiple Issue

Loop Unrolling in VLIW


Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch  Clock
LD F0,0(R1)      LD F6,-8(R1)                                                         1
LD F10,-16(R1)   LD F14,-24(R1)                                                       2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                      3
LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                    4
                                  ADDD F20,F18,F2  ADDD F24,F22,F2                    5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                     6
SD -16(R1),F12   SD -24(R1),F16                                     SUBI R1,R1,#48    7
SD -32(R1),F20   SD -40(R1),F24                                                       8
SD -0(R1),F28                                                       BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays.
7 results in 9 clocks, or 1.3 clocks per iteration.
Need more registers to use VLIW effectively.
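The clocks-per-iteration figures quoted for the three schedules follow directly from the cycle count divided by the unroll factor (counts taken from the slides above):

```python
# (total clocks for the unrolled body, iterations covered by that body)
schedules = {
    "scalar, unrolled 4x": (14, 4),
    "superscalar, unrolled 5x": (12, 5),
    "VLIW, unrolled 7x": (9, 7),
}
per_iteration = {name: clocks / iters
                 for name, (clocks, iters) in schedules.items()}
```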

Multiple Issue

Limitations With Multiple Issue

Limits to Multi-Issue Machines

Inherent limitations of ILP:
- 1 branch in 5 instructions => how do we keep a 5-way VLIW busy?
- Latencies of the units => many operations must be scheduled.
- Need about (pipeline depth x number of functional units) independent operations to keep the machine busy.
Difficulties in building the HW:
- Duplicate functional units to get parallel execution.
- More ports on the register file (the VLIW example needs 6 read and 3 write ports for the integer registers, and 6 read and 4 write ports for the FP registers).
- More ports to memory.
- Superscalar decoding and its impact on clock rate and pipeline depth.

Multiple Issue

Limitations With Multiple Issue

Limits to Multi-Issue Machines

Limitations specific to either the SS or VLIW implementation:
- Decode/issue complexity in SS.
- VLIW code size: unrolled loops plus wasted fields in the long words.
- VLIW lock step => 1 hazard stalls all the instructions in the word.
- VLIW & binary compatibility across implementations.

Multiple Issue

Limitations With Multiple Issue

Multiple Issue Challenges

While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
- exactly 50% FP operations, and
- no hazards.
If more instructions issue at the same time, decode and issue become harder: even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue.
VLIW trades instruction space for simple decoding:
- The long instruction word has room for many operations.
- By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel.
- E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; at 16 to 24 bits per field that is 7*16 = 112 bits to 7*24 = 168 bits wide.
It needs a compiling technique that schedules across several branches.

Compiler Support For ILP


4.1 Instruction Level Parallelism:
Concepts and Challenges
4.2 Overcoming Data Hazards
with Dynamic Scheduling
4.3 Reducing Branch Penalties
with Dynamic Hardware
Prediction
4.4 Taking Advantage of More ILP
with Multiple Issue
4.5 Compiler Support for
Exploiting ILP
4.6 Hardware Support for
Extracting more Parallelism
4.7 Studies of ILP

How can compilers be smart?

1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be really smart to figure out aliases; pointers in C are a real problem.
These techniques lead to:
- Symbolic loop unrolling
- Critical-path (trace) scheduling

Compiler Support For ILP

Symbolic Loop Unrolling

Software Pipelining

Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
Software pipelining reorganizes a loop so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW).

[Figure: a software-pipelined iteration draws one instruction from each of several consecutive iterations of the original loop.]
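The reorganization can be illustrated with a toy array loop in Python: the steady-state kernel stores iteration i's result while doing the add for iteration i+1 and the load for iteration i+2 (a sketch; the prologue/epilogue handling and variable names are assumptions, with f0/f4 mirroring the DLX registers):

```python
def sw_pipelined(x, c):
    """Compute [xi + c for xi in x] with a software-pipelined schedule.
    Assumes len(x) >= 2."""
    n = len(x)
    out = [None] * n
    # Prologue: start the first two iterations.
    f0 = x[0]            # load for iteration 0
    f4 = f0 + c          # add  for iteration 0
    f0 = x[1]            # load for iteration 1
    # Kernel: store iteration i, add for i+1, load for i+2.
    for i in range(n - 2):
        out[i] = f4      # store result of iteration i
        f4 = f0 + c      # add for iteration i+1
        f0 = x[i + 2]    # load for iteration i+2
    # Epilogue: drain the last two iterations.
    out[n - 2] = f4
    out[n - 1] = f0 + c
    return out
```

The kernel body contains no dependent back-to-back operations, which is exactly what lets the hardware overlap them without stalls.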

Compiler Support For ILP

Symbolic Loop Unrolling

SW Pipelining Example
Before: unrolled 3 times
 1  LD    F0,0(R1)
 2  ADDD  F4,F0,F2
 3  SD    0(R1),F4
 4  LD    F6,-8(R1)
 5  ADDD  F8,F6,F2
 6  SD    -8(R1),F8
 7  LD    F10,-16(R1)
 8  ADDD  F12,F10,F2
 9  SD    -16(R1),F12
10  SUBI  R1,R1,#24
11  BNEZ  R1,LOOP

After: software pipelined
 1  SD    0(R1),F4      ; stores M[i]
 2  ADDD  F4,F0,F2      ; adds to M[i-1]
 3  LD    F0,-16(R1)    ; loads M[i-2]
 4  SUBI  R1,R1,#8
 5  BNEZ  R1,LOOP

[Pipeline diagram: in the software-pipelined loop, SD reads F4 before ADDD writes F4, and ADDD reads F0 before LD writes F0, so each instruction uses the value produced one iteration earlier and the reordered instructions do not conflict.]

Compiler Support For ILP

Symbolic Loop Unrolling

SW Pipelining Example

Symbolic loop unrolling (software pipelining) vs. loop unrolling:
- Less code space.
- The loop overhead (SUBI, BNEZ) is paid only once, vs. once per unrolled body in loop unrolling.

[Figure: software pipelining reaches peak overlap after a short prologue and sustains it; loop unrolling re-pays start-up and wind-down overhead in every unrolled block. 100 iterations = 25 loops with 4 unrolled iterations each.]

Compiler Support For ILP

Critical Path Scheduling

Trace Scheduling

Finds parallelism across IF branches, not just LOOP branches. Two steps:
- Trace selection: find a likely sequence of basic blocks (a trace) of long, straight-line code, using static or profile-based prediction.
- Trace compaction: squeeze the trace into a few VLIW instructions; bookkeeping code is needed in case the prediction is wrong.
The compiler undoes a bad guess (discards values in registers). Subtle compiler bugs mean a wrong answer, vs. merely poorer performance; there are no hardware interlocks.

Hardware Support For


Parallelism
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

Software support of ILP works best when the code is predictable at compile time. But what if there's no predictability? Here we'll talk about hardware techniques. These include:
- Conditional or Predicated Instructions
- Hardware Speculation

Hardware Support For


Parallelism

Nullified Instructions

Tell the Hardware To Ignore An Instruction

Avoid branch prediction by turning branches into conditionally executed instructions:
    IF (x) then A = B op C else NOP
If the condition is false, the instruction neither stores its result nor causes an exception.
The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction; IA-64 has 64 1-bit condition fields, allowing conditional execution of any instruction.
Drawbacks of conditional instructions:
- They still take a clock cycle, even if annulled.
- They stall if the condition is evaluated late.
- Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.
This can be a major win because no time is lost taking a branch!

Hardware Support For


Parallelism

Nullified Instructions

Tell the Hardware To Ignore An Instruction


Suppose we have the code:
    if ( VarA == 0 )
        VarS = VarT;

Previous method (branch):
        LD    R1, VarA
        BNEZ  R1, Label
        LD    R2, VarT
        SD    VarS, R2
Label:

Nullified method (compare and nullify the next instruction if not zero):
        LD      R1, VarA
        LD      R2, VarT
        CMPNNZ  R1, #0
        SD      VarS, R2

Nullified method (compare and move if zero):
        LD     R1, VarA
        LD     R2, VarT
        CMOVZ  VarS, R2, R1

Hardware Support For


Parallelism

Compiler Speculation

Increasing Parallelism

The idea is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism.
The primary difficulty is avoiding exceptions; for example, if ( a != 0 ) c = b/a; may cause a divide-by-zero error if the division is hoisted above the test.
Methods for supporting speculation include:
1. A set of status bits (poison bits) associated with the registers, signaling that an instruction's result is invalid until some later time.
2. Not writing the result of an instruction until it is certain that the instruction is no longer speculative.

Hardware Support For


Parallelism

Increasing Parallelism

Example on page 305. Code for:
    if ( A == 0 )  A = B;  else  A = A + 4;
Assume A is at 0(R3) and B is at 0(R2).
Note here that only ONE side needs to take a branch!

Original code:
        LW    R1, 0(R3)     ; load A
        BNEZ  R1, L1        ; test A
        LW    R1, 0(R2)     ; if clause
        J     L2            ; skip else
L1:     ADDI  R1, R1, #4    ; else clause
L2:     SW    0(R3), R1     ; store A

Speculated code:
        LW    R1, 0(R3)     ; load A
        LW    R14, 0(R2)    ; speculative load of B
        BEQZ  R1, L3        ; other arm of the if
        ADDI  R14, R1, #4   ; else clause
L3:     SW    0(R3), R14    ; non-speculative store

Hardware Support For


Parallelism

Compiler Speculation

Poison Bits

In the example on the previous page, if the speculative load (LW*) causes an exception, a poison bit is set on the destination register instead. If a later instruction then tries to use that register, the exception is raised at that point.

Speculated code:
        LW    R1, 0(R3)     ; load A
        LW*   R14, 0(R2)    ; speculative load of B
        BEQZ  R1, L3        ; other arm of the if
        ADDI  R14, R1, #4   ; else clause
L3:     SW    0(R3), R14    ; non-speculative store

Hardware Support For


Parallelism

Hardware Speculation

HW support for More ILP

Need a HW buffer for the results of uncommitted instructions: the reorder buffer.
- The reorder buffer can be an operand source; once an operand commits, the result is found in the register file.
- Each entry has 3 fields: instruction type, destination, value.
- Use the reorder-buffer number instead of the reservation-station tag.
- Discard instructions on mispredicted branches or on exceptions.

[Figure 4.34, page 311: the FP op queue issues to reservation stations feeding the FP adders; results go to the reorder buffer, which retires them in order to the FP registers.]

Hardware Support For


Parallelism

Hardware Speculation

HW support for More ILP

How is this used in practice?
Rather than predicting the direction of a branch, execute the instructions on both sides!
We know the target of a branch early, long before we know whether it will be taken. So begin fetching/executing at that new target PC, but also continue fetching/executing as if the branch were NOT taken.

Studies of ILP
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

There are conflicting studies of the amount of improvement available; the results depend on:
- the benchmarks (vectorized FP Fortran vs. integer C programs),
- hardware sophistication,
- compiler sophistication.
How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?

Studies of ILP

Limits to ILP

Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted => a machine with perfect speculation and an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal.
1-cycle latency for all instructions; an unlimited number of instructions issued per clock cycle.

Studies of ILP

Upper Limit to ILP: Ideal Machine

This is the amount of parallelism when there are no branch mispredictions and we are limited only by data dependencies. (Figure 4.38, page 319)

[Chart: instructions that could theoretically be issued per cycle (IPC) on the ideal machine: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1. FP: 75-150; integer: 18-60.]

Studies of ILP

Impact of Realistic Branch Prediction

What parallelism do we get when we don't allow perfect branch prediction, as in the last picture, but assume some realistic model? Possibilities include:
1. Perfect: all branches perfectly predicted (the last slide).
2. Selective history predictor: a complicated but doable selection mechanism.
3. Standard 2-bit history predictor with 512 2-bit entries.
4. Static prediction based on the profiled past history of the program.
5. None: parallelism is limited to the basic block.

Studies of ILP

Bonus!!

Selective History Predictor

[Figure: an 8K x 2-bit selector table, indexed by the branch address, chooses between two predictors: selector states 11/10 choose the non-correlating predictor (8096 x 2 bits), while 01/00 choose the correlating predictor (2048 x 4 x 2 bits, indexed by the branch address plus 2 bits of global taken/not-taken history). In each 2-bit counter, states 11/10 predict taken and 01/00 predict not taken.]

Impact of Realistic Branch Prediction (Figure 4.42, page 325)

Limiting the type of branch prediction:

[Chart: instruction issues per cycle for gcc, espresso, li, fpppp, doducd, and tomcatv under five schemes: perfect prediction, the selective history predictor, a standard 2-bit BHT (512 entries), static profile-based prediction, and no prediction. With realistic predictors, FP programs sustain roughly 15-45 issues per cycle and integer programs 6-12; perfect prediction reaches about 60 on several programs, and with no prediction parallelism collapses to the basic block.]

More Realistic HW: Register Impact (Figure 4.44, page 328)

Effect of limiting the number of renaming registers:

[Chart: instruction issues per cycle for gcc, espresso, li, fpppp, doducd, and tomcatv with infinite, 256, 128, 64, 32, and no additional renaming registers. FP: 11-45 issues per cycle; integer: 5-15.]

Studies of ILP

More Realistic HW: Alias Impact (Figure 4.46, page 330)

What happens when there may be conflicts with memory aliasing?

[Chart: instruction issues per cycle for gcc, espresso, li, fpppp, doducd, and tomcatv under four alias models: perfect analysis, global/stack-perfect analysis (heap conflicts assumed), analysis by inspection, and none. FP: 4-45 (Fortran, no heap); integer: 4-9.]

Summary
4.1 Instruction Level Parallelism: Concepts and Challenges
4.2 Overcoming Data Hazards with Dynamic Scheduling
4.3 Reducing Branch Penalties with Dynamic Hardware Prediction
4.4 Taking Advantage of More ILP with Multiple Issue
4.5 Compiler Support for Exploiting ILP
4.6 Hardware Support for Extracting more Parallelism
4.7 Studies of ILP

