ACA Unit 3

UNIT-III
Instruction Level Parallelism
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & The Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of The Limitations of ILP
Chap. 3 -ILP 1
Ideas To Reduce Stalls

Technique                                     Reduces                                  Covered in
Dynamic scheduling                            Data hazard stalls                       Chapter 3
Dynamic branch prediction                     Control stalls                           Chapter 3
Issuing multiple instructions per cycle       Ideal CPI                                Chapter 3
Speculation                                   Data and control stalls                  Chapter 3
Dynamic memory disambiguation                 Data hazard stalls involving memory      Chapter 3
Loop unrolling                                Control hazard stalls                    Chapter 4
Basic compiler pipeline scheduling            Data hazard stalls                       Chapter 4
Compiler dependence analysis                  Ideal CPI and data hazard stalls         Chapter 4
Software pipelining and trace scheduling      Ideal CPI and data hazard stalls         Chapter 4
Compiler speculation                          Ideal CPI, data and control stalls       Chapter 4

Instruction Level Parallelism
ILP rests on the observation that many instructions in a program don't depend on each other, so it is possible to execute those instructions in parallel. This is easier said than done. Issues include:
- Building compilers to analyze the code
- Building hardware to be even smarter than that code
This section looks at some of the problems to be solved.
Terminology
Basic Block - the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically this is about 6 instructions long.
Loop Level Parallelism - the parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop Unrolling - replicating the loop body so that either the compiler or the hardware is able to exploit the parallelism inherent in the loop.
Terminology
Basic Block (BB) ILP is quite small
- BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
- Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
- Plus, instructions in a BB are likely to depend on each other
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: loop-level parallelism, to exploit parallelism among iterations of a loop
- Vector is one way
- If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler
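A minimal sketch of static loop unrolling, assuming a simple summation loop. Python is used here only as notation (it does not itself expose ILP), and the function names and the unroll factor of 4 are illustrative choices, not taken from the slides. The point is that the unrolled body performs four independent additions per loop-back branch.

```python
def sum_rolled(xs):
    """One addition and one loop-back branch per element."""
    total = 0
    for x in xs:
        total += x
    return total


def sum_unrolled_by_4(xs):
    """Four independent additions per loop-back branch, plus a cleanup loop."""
    t0 = t1 = t2 = t3 = 0
    n = len(xs) - len(xs) % 4          # largest multiple of 4 that fits
    for i in range(0, n, 4):
        # The four accumulators do not depend on each other, so a
        # multiple-issue machine could overlap these additions.
        t0 += xs[i]
        t1 += xs[i + 1]
        t2 += xs[i + 2]
        t3 += xs[i + 3]
    total = t0 + t1 + t2 + t3
    for x in xs[n:]:                   # leftover 0 to 3 elements
        total += x
    return total
```

Both functions return the same sum; the unrolled version simply has fewer branches per element and a longer straight-line body to schedule.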
Data Dependence and Hazards
InstrJ is data dependent on InstrI: InstrJ tries to read an operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

or InstrJ is data dependent on InstrK, which is dependent on InstrI
- Caused by a True Dependence (compiler term)
- If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
Data Dependence and Hazards
- Dependences are a property of programs
- Presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are a property of the pipeline
- Importance of the data dependencies:
    1) indicates the possibility of a hazard
    2) determines the order in which results must be calculated
    3) sets an upper bound on how much parallelism can possibly be exploited
- Today: looking at HW schemes to avoid hazards
Name Dependence #1: Anti-dependence
Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
InstrJ writes an operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an anti-dependence by compiler writers. This results from reuse of the name r1
- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
Name Dependence #2: Output dependence
InstrJ writes an operand before InstrI writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an output dependence by compiler writers. This also results from the reuse of the name r1
- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
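The three dependence types can be checked mechanically for a pair of instructions. The sketch below is an illustration under simplifying assumptions: the `Instr` tuple and the `dependences` function are hypothetical names, each instruction has one destination register, and only a direct pair (I, J) is examined, ignoring any instruction between them.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def dependences(i: Instr, j: Instr) -> set:
    """Dependences of the later instruction j on the earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")   # true dependence: j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")   # anti-dependence: j overwrites what i reads
    if j.dest == i.dest:
        found.add("WAW")   # output dependence: both write the same name
    return found


# The I/J pairs from the three slides above.
true_dep = dependences(Instr("add", "r1", ("r2", "r3")),
                       Instr("sub", "r4", ("r1", "r3")))
anti_dep = dependences(Instr("sub", "r4", ("r1", "r3")),
                       Instr("add", "r1", ("r2", "r3")))
output_dep = dependences(Instr("sub", "r1", ("r4", "r3")),
                         Instr("add", "r1", ("r2", "r3")))
```

Whether any of these dependences becomes an actual hazard still depends on the pipeline, as noted earlier.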
ILP and Data Hazards
- HW/SW must preserve program order: the order instructions would execute in if executed sequentially, 1 at a time, as determined by the original source program
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
- Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
    - Register renaming resolves name dependences for registers
    - Either by compiler or by HW
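Register renaming can be sketched as giving every write a fresh name while reads use the most recent mapping. This is a minimal illustration, assuming three-operand instructions written as `(op, dest, src1, src2)` and an unlimited supply of hypothetical physical names `p0, p1, ...`; it is not a description of any particular compiler or processor.

```python
from itertools import count


def rename(program):
    """Rename architectural registers so that only true dependences remain."""
    fresh = count()
    mapping = {}                        # architectural name -> current physical name

    def lookup(reg):
        if reg not in mapping:          # value that is live on entry
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    out = []
    for op, dest, s1, s2 in program:
        a, b = lookup(s1), lookup(s2)   # read sources BEFORE remapping dest
        mapping[dest] = f"p{next(fresh)}"
        out.append((op, mapping[dest], a, b))
    return out


# The anti-dependence example: J writes r1 while I still needs to read it.
renamed = rename([("sub", "r4", "r1", "r3"),
                  ("add", "r1", "r2", "r3"),
                  ("mul", "r6", "r1", "r7")])
```

After renaming, `add` writes a different name than the one `sub` reads, so the WAR hazard disappears, while `mul` still reads the name `add` writes, so the true dependence is preserved.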

Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order

    if p1 { S1; };
    if p2 { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

- Control dependence need not be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
- Instead, the 2 properties critical to program correctness are exception behavior and data flow
Exception Behavior
Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
Example:

        DADDU R2,R3,R4
        BEQZ  R2,L1
        LW    R1,0(R2)
    L1:

Problem with moving LW before BEQZ?
Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them
- Branches make the flow dynamic; they determine which instruction is the supplier of data
Example:

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R5,R6
    L:  OR    R7,R1,R8

OR depends on DADDU or DSUBU? Must preserve data flow on execution
Dynamic Scheduling
Advantages of Dynamic Scheduling

- Handles cases when dependences are unknown at compile time (e.g., because they may involve a memory reference)
- Simplifies the compiler
- Allows code that was compiled for one pipeline to run efficiently on a different pipeline
- Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling
Dynamic Scheduling
Logistics
Sections 3.2 and 3.3 of the text use, as an example of Dynamic Scheduling, an algorithm due to Tomasulo. We instead use another technique, scoreboarding, which is discussed in Appendix A.8
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism

Why is this in hardware at run time?
- Works when the real dependence can't be known at compile time
- Compiler is simpler
- Code for one machine runs well on another
Key idea: allow instructions behind a stall to proceed.
Key idea: instructions execute in parallel. There are multiple execution units, so use them.

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

Even though ADDD stalls, the SUBD has no dependencies and can run.
Enables out-of-order execution => out-of-order completion
Dynamic Scheduling
The idea:
HW Schemes: Instruction Parallelism

Out-of-order execution divides the ID stage:
1. Issue - decode instructions, check for structural hazards
2. Read operands - wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.
A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
First used in the CDC 6600. Our example is modified here for MIPS.
- CDC had 4 FP units, 5 memory reference units, 7 integer units.
- MIPS has 2 FP multiply, 1 FP adder, 1 FP divider, 1 integer.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
- Out-of-order completion => WAR, WAW hazards?
- Solutions for WAR:
    - Queue both the operation and copies of its operands
    - Read registers only during the Read Operands stage
- For WAW, must detect the hazard: stall until the other instruction completes
- Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
- Scoreboard keeps track of dependencies and the state of operations
- Scoreboard replaces ID, EX, WB with 4 stages
Dynamic Scheduling
Using A Scoreboard
Four Stages of Scoreboard Control

1. Issue - decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
Dynamic Scheduling
Using A Scoreboard

2. Read operands - wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Dynamic Scheduling
Using A Scoreboard

3. Execution - operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result - finish execution (WB)
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.
Example:

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14

The scoreboard would stall SUBD until ADDD reads its operands
Dynamic Scheduling
Using A Scoreboard
Three Parts of the Scoreboard

1. Instruction status - which of the 4 steps the instruction is in
2. Functional unit status - indicates the state of the functional unit (FU). 9 fields for each functional unit:
    Busy - Indicates whether the unit is busy or not
    Op - Operation to perform in the unit (e.g., + or -)
    Fi - Destination register
    Fj, Fk - Source-register numbers
    Qj, Qk - Functional units producing source registers Fj, Fk
    Rj, Rk - Flags indicating when Fj, Fk are ready
3. Register result status - indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
Dynamic Scheduling
Using A Scoreboard
Detailed Scoreboard Pipeline Control

Instruction status   Wait until                            Bookkeeping
Issue                Not Busy(FU) and not Result(D)        Busy(FU) <- yes; Op(FU) <- op; Fi(FU) <- D;
                                                           Fj(FU) <- S1; Fk(FU) <- S2;
                                                           Qj <- Result(S1); Qk <- Result(S2);
                                                           Rj <- not Qj; Rk <- not Qk; Result(D) <- FU
Read operands        Rj and Rk                             Rj <- No; Rk <- No
Execution complete   Functional unit done
Write result         for all f: (Fj(f) != Fi(FU) or        for all f: (if Qj(f) = FU then Rj(f) <- Yes);
                     Rj(f) = No) and (Fk(f) != Fi(FU)      for all f: (if Qk(f) = FU then Rk(f) <- Yes);
                     or Rk(f) = No)                        Result(Fi(FU)) <- 0; Busy(FU) <- No
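The Issue and Write result rows of the control table can be sketched as plain predicates over the functional unit status and register result status. This is a minimal illustration: the `FU` dataclass, the dictionary layout, and the function names are assumptions made for the example, and a missing key in `result` stands for a blank register result entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FU:
    """One row of the functional unit status table."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None    # destination register
    fj: Optional[str] = None    # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None    # units producing fj / fk (None: no producer)
    qk: Optional[str] = None
    rj: bool = False            # operand ready and not yet read
    rk: bool = False


def can_issue(units, result, name, dest):
    """Unit is free (no structural hazard) and nobody will write dest (no WAW)."""
    return not units[name].busy and result.get(dest) is None


def issue(units, result, name, op, dest, s1, s2):
    """Issue bookkeeping: fill in the unit's row and claim the destination."""
    qj, qk = result.get(s1), result.get(s2)
    units[name] = FU(True, op, dest, s1, s2, qj, qk, qj is None, qk is None)
    result[dest] = name


def can_write(units, name):
    """No unit still has to read the old value of our destination (no WAR)."""
    fi = units[name].fi
    return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
               for f in units.values())


# Situation resembling the example: DIVD F10,F0,F6 waits on Mult1 for F0,
# while ADDD F6,F8,F2 has finished and wants to write F6.
units = {n: FU() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
units["Mult1"] = FU(True, "Mult", "F0", "F2", "F4")   # operands already read
result = {"F0": "Mult1"}
issue(units, result, "Divide", "Div", "F10", "F0", "F6")
issue(units, result, "Add", "Add", "F6", "F8", "F2")
units["Add"].rj = units["Add"].rk = False             # ADDD has read its operands
blocked_before = not can_write(units, "Add")          # DIVD has not read F6 yet
units["Divide"].rj = units["Divide"].rk = False       # DIVD reads its operands
allowed_after = can_write(units, "Add")
```

The sketch leaves out the Read operands and Execution complete rows, which only flip the Rj/Rk flags and wait for the unit to finish.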
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code we'll be working with in the example:

    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

Latencies (clock cycles):

    LD 1, MULTD 10, SUBD 2, DIVD 40, ADDD 2

What are the hazards in this code?
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status (fields: Time, Busy, Op, Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj ready?, Rk = Fk ready?):
  Integer, Mult1, Mult2, Add, Divide: Busy No

Register result status (Clock; FU for each of F0, F2, F4, ..., F30): all blank
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 1

Issue LD #1. The instruction status entries show in which cycle each operation occurred.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy Yes; Op Load; Fi F6; Fk R2; Rk Yes
  Mult1, Mult2, Add, Divide: Busy No

Register result status (clock 1): F6 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 2

LD #2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Op Load; Fi F6; Fk R2; Rk Yes

Register result status (clock 2): F6 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 3

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Op Load; Fi F6; Fk R2; Rk Yes

Register result status (clock 3): F6 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 4

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Op Load; Fi F6; Fk R2; Rk Yes

Register result status (clock 4): F6 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 5

Issue LD #2 since the integer unit is now free.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Op Load; Fi F2; Fk R3; Rk Yes

Register result status (clock 5): F2 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 6

Issue MULT.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6
MULTD F0, F2, F4      6
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy Yes; Op Load; Fi F2; Fk R3; Rk Yes
  Mult1     Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Qj Integer; Rj No; Rk Yes
  Mult2, Add, Divide: Busy No

Register result status (clock 6): F0 Mult1; F2 Integer
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 7

MULT can't read its operands (F2) because LD #2 hasn't finished.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7
MULTD F0, F2, F4      6
SUBD  F8, F6, F2      7
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy Yes; Op Load; Fi F2; Fk R3; Rk Yes
  Mult1     Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Qj Integer; Rj No; Rk Yes
  Add       Busy Yes; Op Sub; Fi F8; Fj F6; Fk F2; Qk Integer; Rj Yes; Rk No
  Mult2, Divide: Busy No

Register result status (clock 7): F0 Mult1; F2 Integer; F8 Add
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 8a

DIVD issues. MULT and SUBD are both waiting for F2.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7
MULTD F0, F2, F4      6
SUBD  F8, F6, F2      7
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy Yes; Op Load; Fi F2; Fk R3; Rk Yes
  Mult1     Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Qj Integer; Rj No; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Sub; Fi F8; Fj F6; Fk F2; Qk Integer; Rj Yes; Rk No
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rj No; Rk Yes

Register result status (clock 8): F0 Mult1; F2 Integer; F8 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 8b

LD #2 writes F2.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6
SUBD  F8, F6, F2      7
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy No
  Mult1     Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rj Yes; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Sub; Fi F8; Fj F6; Fk F2; Rj Yes; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rj No; Rk Yes

Register result status (clock 8): F0 Mult1; F8 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example Cycle 9

Now MULT and SUBD can both read F2. How can both instructions do this at the same time?

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2

Functional unit status:
  Mult1     Time 10; Op Mult; Fi F0; Fj F2; Fk F4; Rj Yes; Rk Yes
  Add       Time 2; Op Sub; Fi F8; Fj F6; Fk F2; Rj Yes; Rk Yes
  Divide    Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rj No; Rk Yes

Register result status (clock 9): F0 Mult1; F8 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 11

ADDD can't start because the add unit is busy.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2

Functional unit status:
  Mult1     Time 8; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Add       Time 0; Op Sub; Fi F8; Fj F6; Fk F2; Rk Yes
  Divide    Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 11): F0 Mult1; F8 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 12

SUBD finishes. DIVD is waiting for F0.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2

Functional unit status:
  Integer   Busy No
  Mult1     Time 7; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rj Yes; Rk Yes
  Mult2     Busy No
  Add       Busy No
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rj No; Rk Yes

Register result status (clock 12): F0 Mult1; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 13

ADDD issues.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13

Functional unit status:
  Integer   Busy No
  Mult1     Time 6; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 13): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 14

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14

Functional unit status:
  Integer   Busy No
  Mult1     Time 5; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Time 2; Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 14): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 15

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14

Functional unit status:
  Integer   Busy No
  Mult1     Time 4; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Time 1; Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 15): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 16

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer   Busy No
  Mult1     Time 3; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Time 0; Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 16): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 17

ADDD can't write because of DIVD: DIVD has not yet read F6. WAR!

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer   Busy No
  Mult1     Time 2; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 17): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 18

Nothing happens!

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer   Busy No
  Mult1     Time 1; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 18): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 19

MULT completes execution.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer   Busy No
  Mult1     Time 0; Busy Yes; Op Mult; Fi F0; Fj F2; Fk F4; Rk Yes
  Mult2     Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Qj Mult1; Rk Yes

Register result status (clock 19): F0 Mult1; F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 20

MULT writes.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19                   20
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer, Mult1, Mult2: Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rj Yes; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Rj Yes; Rk Yes

Register result status (clock 20): F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 21

DIVD reads its operands.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19                   20
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8       21
ADDD  F6, F8, F2      13      14              16

Functional unit status:
  Integer, Mult1, Mult2: Busy No
  Add       Busy Yes; Op Add; Fi F6; Fj F8; Fk F2; Rj Yes; Rk Yes
  Divide    Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Rj Yes; Rk Yes

Register result status (clock 21): F6 Add; F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 22

Now ADDD can write since the WAR hazard is removed.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19                   20
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8       21
ADDD  F6, F8, F2      13      14              16                   22

Functional unit status:
  Integer, Mult1, Mult2, Add: Busy No
  Divide    Time 40; Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Rj Yes; Rk Yes

Register result status (clock 22): F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 61

DIVD completes execution.

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19                   20
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8       21              61
ADDD  F6, F8, F2      13      14              16                   22

Functional unit status:
  Integer, Mult1, Mult2, Add: Busy No
  Divide    Time 0; Busy Yes; Op Div; Fi F10; Fj F0; Fk F6; Rj Yes; Rk Yes

Register result status (clock 61): F10 Divide
Dynamic Scheduling
Using A Scoreboard

Scoreboard Example Cycle 62

DONE!

Instruction status    Issue   Read operands   Execution complete   Write result
LD    F6, 34(R2)      1       2               3                    4
LD    F2, 45(R3)      5       6               7                    8
MULTD F0, F2, F4      6       9               19                   20
SUBD  F8, F6, F2      7       9               11                   12
DIVD  F10, F0, F6     8       21              61                   62
ADDD  F6, F8, F2      13      14              16                   22

Functional unit status:
  Integer, Mult1, Mult2, Add, Divide: Busy No

Register result status (clock 62): all blank
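The cycle numbers in the final instruction status table can be cross-checked with a small simulation. The sketch below is a simplified model written for this example, not the CDC 6600 logic: it assumes one issue per cycle in program order, that a unit or register freed by a write in cycle c becomes usable in cycle c+1, and that execution completes `latency` cycles after the operands are read. The names `simulate`, `PROGRAM`, `UNITS`, and `LATENCY` are illustrative.

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}
UNITS = {"LD": ["Integer"], "MULTD": ["Mult1", "Mult2"],
         "SUBD": ["Add"], "ADDD": ["Add"], "DIVD": ["Divide"]}
# (op, destination, sources) for the six-instruction example
PROGRAM = [
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]


def simulate(program):
    """Return (issue, read, complete, write) cycle numbers per instruction."""
    n = len(program)
    st = [{"issue": None, "read": None, "done": None, "write": None,
           "unit": None, "prod": {}} for _ in range(n)]

    def pending(k, c):
        # Issued, and its result was not written before cycle c.
        s = st[k]
        return s["issue"] is not None and (s["write"] is None or s["write"] >= c)

    nxt, c = 0, 0
    while any(s["write"] is None for s in st):
        c += 1
        # Write result: wait while another instruction still has to read the
        # old value of our destination (WAR check).
        for w in range(n):
            s = st[w]
            if s["write"] is None and s["done"] is not None and s["done"] < c:
                dest = program[w][1]
                war = any(
                    k != w and st[k]["issue"] is not None
                    and (st[k]["read"] is None or st[k]["read"] >= c)
                    and dest in program[k][2] and st[k]["prod"][dest] != w
                    for k in range(n))
                if not war:
                    s["write"] = c
        # Read operands: every producer recorded at issue must have written
        # in an earlier cycle (RAW check).
        for k in range(n):
            s = st[k]
            if s["read"] is None and s["issue"] is not None and s["issue"] < c:
                if all(p is None or (st[p]["write"] is not None and st[p]["write"] < c)
                       for p in s["prod"].values()):
                    s["read"] = c
                    s["done"] = c + LATENCY[program[k][0]]
        # Issue, in order: need a free unit and no pending write to dest (WAW).
        if nxt < n:
            op, dest, srcs = program[nxt]
            free = [u for u in UNITS[op]
                    if not any(pending(k, c) and st[k]["unit"] == u for k in range(n))]
            waw = any(pending(k, c) and program[k][1] == dest for k in range(n))
            if free and not waw:
                s = st[nxt]
                s["issue"], s["unit"] = c, free[0]
                for r in srcs:
                    writers = [k for k in range(nxt)
                               if pending(k, c) and program[k][1] == r]
                    s["prod"][r] = writers[-1] if writers else None
                nxt += 1
    return [(s["issue"], s["read"], s["done"], s["write"]) for s in st]


timeline = simulate(PROGRAM)
```

Under these assumptions, `timeline` matches the table above row by row, including ADDD writing in cycle 22, only after DIVD has read F6 in cycle 21.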
Dynamic Scheduling
Tomasulo Algorithm
Another Dynamic Algorithm: Tomasulo Algorithm

- For the IBM 360/91, about 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between IBM 360 & CDC 6600 ISA:
    - IBM has only 2 register specifiers per instruction vs. 3 in CDC 6600
    - IBM has 4 FP registers vs. 8 in CDC 6600
- Why study? It led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
This is the example that the text uses in Sections 3.2 & 3.3.
Dynamic Scheduling
Tomasulo Algorithm
Tomasulo Algorithm vs. Scoreboard

- Control & buffers are distributed with Function Units (FU) vs. centralized in the scoreboard
    - FU buffers are called reservation stations; they hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
    - Avoids WAR, WAW hazards
    - More reservation stations than registers, so can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
Dynamic Scheduling
Tomasulo Organization
[Figure: block diagram with the FP Op Queue and FP Registers; Load Buffers (Load1 to Load6) from memory; Store Buffers to memory; Reservation Stations Add1 to Add3 in front of the FP adders and Mult1 to Mult2 in front of the FP multipliers; and the Common Data Bus (CDB).]
Dynamic Scheduling
Tomasulo Algorithm
Reservation Station Components

Op - Operation to perform in the unit (e.g., + or -)
Vj, Vk - Value of source operands
    Store buffers have a V field: the result to be stored
Qj, Qk - Reservation stations producing source registers (value to be written)
    Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
    Store buffers only have Qi for the RS producing the result
Busy - Indicates reservation station or FU is busy
Register result status - Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
Dynamic Scheduling
Tomasulo Algorithm
Three Stages of Tomasulo Algorithm

1. Issue - get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
2. Execution - operate on operands (EX)
   When both operands are ready, then execute; if not ready, watch the Common Data Bus for the result
3. Write result - finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
- 64 bits of data + 4 bits of Functional Unit source address
- Write if it matches the expected Functional Unit (produces result)
- Does the broadcast
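The issue and write-result stages can be sketched with reservation stations that hold either a value (Vj, Vk) or the tag of the producing station (Qj, Qk). This is a minimal illustration with assumed names (`RS`, `issue`, `broadcast`, `ready`) and made-up register values; it models only the renaming and the Common Data Bus broadcast, not timing, load/store buffers, or structural hazards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    """One reservation station: operand values or tags of their producers."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None    # None means Vj already holds the value
    qk: Optional[str] = None


def issue(stations, regs, reg_status, name, op, dest, s1, s2):
    """Copy ready operands into the station, or record the producer's tag."""
    rs = stations[name]
    rs.busy, rs.op = True, op
    rs.qj, rs.qk = reg_status.get(s1), reg_status.get(s2)
    rs.vj = regs[s1] if rs.qj is None else None
    rs.vk = regs[s2] if rs.qk is None else None
    reg_status[dest] = name             # renaming: dest now refers to this station


def broadcast(stations, regs, reg_status, tag, value):
    """Common Data Bus: deliver (tag, value) to all waiting stations and registers."""
    for rs in stations.values():
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]
    stations[tag].busy = False          # station becomes available again


def ready(rs):
    return rs.busy and rs.qj is None and rs.qk is None


# MULTD F0,F2,F4 ; DIVD F10,F0,F6 ; ADDD F6,F8,F2 with made-up register values.
stations = {n: RS() for n in ("Add1", "Add2", "Add3", "Mult1", "Mult2")}
regs = {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0, "F10": 0.0}
reg_status = {}
issue(stations, regs, reg_status, "Mult1", "MULTD", "F0", "F2", "F4")
issue(stations, regs, reg_status, "Mult2", "DIVD", "F10", "F0", "F6")
issue(stations, regs, reg_status, "Add1", "ADDD", "F6", "F8", "F2")
divd_waits = not ready(stations["Mult2"])                 # needs F0 from Mult1
broadcast(stations, regs, reg_status, "Add1", 8.0 + 2.0)  # ADDD finishes first
broadcast(stations, regs, reg_status, "Mult1", 2.0 * 4.0)
```

Because DIVD copied the old value of F6 into Vk at issue, ADDD can write F6 early without a WAR stall, which is the case that held ADDD back until cycle 22 in the scoreboard example.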
Dynamic Scheduling
Tomasulo Algorithm
Tomasulo Example Cycle 0

Instruction status    Issue   Execution complete   Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers (Busy, Address):
  Load1, Load2, Load3: Busy No

Reservation stations (fields: Time, Busy, Op, Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k):
  Add1, Add2, Add3, Mult1, Mult2: Time 0; Busy No

Register result status (clock 0; FU for each of F0, F2, F4, ..., F30): all blank
Dynamic Scheduling
Tomasulo Algorithm
Review: Tomasulo
- Prevents registers from becoming a bottleneck
- Avoids the WAR, WAW hazards of the scoreboard
- Allows loop unrolling in HW
- Not limited to basic blocks (provided branch prediction)
- Lasting contributions:
    - Dynamic scheduling
    - Register renaming
    - Load/store disambiguation
- 360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro
