UNIT-III
3.1 Instruction Level Parallelism: Concepts and Challenges
Chap. 3 -ILP 1
Chapter 3
Chapter 4
ILP is the principle that many instructions in a program don't depend on each other, so it's possible to execute those instructions in parallel. This is easier said than done. Issues include building compilers that analyze the code, and building hardware that is even smarter than that code.
Terminology
Basic Block - The set of instructions between entry points and branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
Loop Level Parallelism - The parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop Unrolling - Replicating the loop body so that either the compiler or the hardware is able to exploit the parallelism inherent in the loop.
Basic Block (BB) ILP is quite small
BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
Plus, instructions in a BB are likely to depend on each other
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
Simplest: loop-level parallelism, to exploit parallelism among iterations of a loop
Vector is one way
If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler
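To make loop unrolling concrete, here is a minimal sketch in Python. The function names and the unroll factor of 4 are assumptions chosen for illustration; a real compiler does this on machine instructions, not on source code. The point is only that the four statements in the unrolled body touch different elements, so none of them depends on another.

```python
def add_scalar(x, s):
    """Plain loop: one element per iteration."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + s
    return out


def add_scalar_unrolled(x, s):
    """Same loop unrolled by 4 (hypothetical unroll factor)."""
    out = list(x)
    n = len(out)
    i = 0
    # Unrolled body: the four statements are independent of each other,
    # so a scheduler is free to overlap them.
    while i + 4 <= n:
        out[i] = out[i] + s
        out[i + 1] = out[i + 1] + s
        out[i + 2] = out[i + 2] + s
        out[i + 3] = out[i + 3] + s
        i += 4
    # Cleanup loop for the 0 to 3 leftover elements.
    while i < n:
        out[i] = out[i] + s
        i += 1
    return out
```

Both functions return the same list; the unrolled version simply exposes more independent work per branch.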
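InstrJ is data dependent on InstrI when InstrJ reads an operand that InstrI writes.
If InstrJ tries to read the operand before InstrI writes it, the result is a Read After Write (RAW) hazard.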
Dependences are a property of programs
Presence of a dependence indicates potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
Importance of the data dependences:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
Today: looking at HW schemes to avoid hazards
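Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence:
Antidependence: InstrJ writes an operand before InstrI reads it (a WAR hazard if the order is violated)
Output dependence: InstrJ writes an operand before InstrI writes it (a WAW hazard if the order is violated)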
HW/SW must preserve program order: the order instructions would execute in if executed sequentially, 1 at a time, as determined by the original source program
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
Register renaming resolves name dependences for registers
Either by compiler or by HW
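The three dependence types, and the effect of register renaming on them, can be sketched in a few lines of Python. This is an illustration under assumptions: `Instr`, `classify` and `rename` are hypothetical names, and the fresh names `T0`, `T1`, ... stand in for whatever spare registers a compiler or the hardware would actually use.

```python
from collections import namedtuple

# One instruction: opcode, destination register, tuple of source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def classify(i, j):
    """Dependences of the later instruction j on the earlier instruction i."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # true data dependence: j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # antidependence: j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # output dependence: both write the same name
    return found


def rename(program):
    """Give every written register a fresh name (T0, T1, ...)."""
    current = {}  # architectural name -> latest fresh name
    renamed = []
    for n, ins in enumerate(program):
        # Sources read the most recent fresh name, if there is one.
        srcs = tuple(current.get(s, s) for s in ins.srcs)
        dest = None if ins.dest is None else "T%d" % n
        renamed.append(Instr(ins.op, dest, srcs))
        if ins.dest is not None:
            current[ins.dest] = dest
    return renamed


subd = Instr("SUBD", "F8", ("F6", "F2"))
addd = Instr("ADDD", "F6", ("F8", "F2"))
before = classify(subd, addd)        # RAW on F8 and WAR on F6
after = classify(*rename([subd, addd]))  # only the RAW remains
```

Renaming removes the WAR on F6 because ADDD now writes a fresh name, while the true dependence through F8 is preserved because ADDD reads the renamed result of SUBD.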
Exception Behavior
Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
Example:
    DADDU R2,R3,R4
    BEQZ  R2,L1
    LW    R1,0(R2)
L1:
Problem with moving LW before BEQZ?
Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them
Branches make the flow dynamic; they determine which instruction is the supplier of data
Example:
    DADDU R1,R2,R3
    BEQZ  R4,L
    DSUBU R1,R5,R6
L:  OR    R7,R1,R8
OR depends on DADDU or DSUBU?
Must preserve data flow on execution
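The question in the example can be answered by tracing both branch outcomes. The sketch below models the four instructions directly in Python; `or_supplier` and the register values are assumptions made for illustration only.

```python
def or_supplier(regs):
    """Run the four-instruction example; report who supplies R1 to OR."""
    r = dict(regs)
    r["R1"] = r["R2"] + r["R3"]      # DADDU R1,R2,R3
    supplier = "DADDU"
    if r["R4"] != 0:                 # BEQZ R4,L falls through when R4 != 0
        r["R1"] = r["R5"] - r["R6"]  # DSUBU R1,R5,R6
        supplier = "DSUBU"
    r["R7"] = r["R1"] | r["R8"]      # L: OR R7,R1,R8
    return supplier, r["R7"]


base = {"R2": 1, "R3": 2, "R4": 0, "R5": 9, "R6": 4, "R8": 8}
taken = or_supplier(base)                  # branch taken: DADDU supplies R1
not_taken = or_supplier({**base, "R4": 1})  # branch not taken: DSUBU supplies R1
```

So the supplier of R1 is only known once the branch resolves, which is why data dependences alone are not enough and data flow must be preserved.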
Dynamic Scheduling
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & The Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of The Limitations of ILP
Dynamic Scheduling
Logistics
Sections 3.2 and 3.3 of the text use an algorithm due to Tomasulo as their example of Dynamic Scheduling. We instead use another technique, scoreboarding, which is discussed in Appendix A.8.
Dynamic Scheduling
The idea:
Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions.
A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
We will use in-order issue, out-of-order execution, and out-of-order commit (also called completion).
First used in the CDC 6600. Our example is modified here for MIPS.
The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units.
MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
Out-of-order completion => WAR, WAW hazards?
Solutions for WAR:
    Queue both the operation and copies of its operands
    Read registers only during the Read Operands stage
For WAW, must detect the hazard: stall until the other instruction completes
Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies and the state of operations
Scoreboard replaces ID, EX, WB with 4 stages (Issue, Read operands, Execution complete, Write result)
Dynamic Scheduling
Using A Scoreboard
2. Functional unit status - Indicates the state of the functional unit (FU). 9 fields for each functional unit:
    Busy - Indicates whether the unit is busy or not
    Op - Operation to perform in the unit (e.g., + or -)
    Fi - Destination register
    Fj, Fk - Source-register numbers
    Qj, Qk - Functional units producing source registers Fj, Fk
    Rj, Rk - Flags indicating when Fj, Fk are ready
3. Register result status - Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.
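One way to picture these two tables is as plain data structures. The sketch below is an assumption about layout, not the hardware's actual organization: `FunctionalUnit`, `fu_status` and `result_status` are hypothetical names, and the five units are the ones used in this deck's MIPS configuration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    """The 9 functional unit status fields."""
    busy: bool = False
    op: Optional[str] = None  # operation to perform
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # unit producing Fj, if it is still pending
    qk: Optional[str] = None  # unit producing Fk, if it is still pending
    rj: bool = False          # Fj is ready and not yet read
    rk: bool = False          # Fk is ready and not yet read


# Functional unit status: one entry per unit.
fu_status: Dict[str, FunctionalUnit] = {
    name: FunctionalUnit()
    for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")
}

# Register result status: which unit will write each register (None = blank).
result_status: Dict[str, Optional[str]] = {f"F{n}": None for n in range(0, 32, 2)}

# Example: record a load into F6 that uses base register R2.
fu_status["Integer"] = FunctionalUnit(busy=True, op="Load", fi="F6", fk="R2", rk=True)
result_status["F6"] = "Integer"
```

Keeping the register result status separate from the per-unit fields mirrors the slides: the first answers "what is this unit doing?", the second answers "who will write this register?".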
Instruction status: Issue
    Wait until: not Busy(FU) and not Result(D)
    Bookkeeping: Busy(FU) <- yes; Op(FU) <- op; Fi(FU) <- D; Fj(FU) <- S1; Fk(FU) <- S2; Qj <- Result(S1); Qk <- Result(S2); Rj <- not Qj; Rk <- not Qk; Result(D) <- FU
Instruction status: Read operands
    Wait until: Rj and Rk
    Bookkeeping: Rj <- No; Rk <- No
Instruction status: Execution complete
    Wait until: functional unit done
Instruction status: Write result
    Wait until: for all f, ((Fj(f) != Fi(FU) or Rj(f) = No) & (Fk(f) != Fi(FU) or Rk(f) = No))
    Bookkeeping: for all f, (if Qj(f) = FU then Rj(f) <- Yes); for all f, (if Qk(f) = FU then Rk(f) <- Yes); Result(Fi(FU)) <- 0; Busy(FU) <- No
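The Write result condition is the least obvious entry in the bookkeeping, so here is a small sketch of it in Python. The dictionary layout and the name `can_write_result` are assumptions for illustration. The check says: a unit may write its destination Fi only if no other busy unit still has Fi as a source that is ready but not yet read (Rj or Rk = Yes), which is exactly the WAR case.

```python
def can_write_result(unit, fu_status):
    """True if writing Fi(unit) cannot clobber an operand another unit still needs."""
    fi = fu_status[unit]["Fi"]
    for name, f in fu_status.items():
        if name == unit or not f["Busy"]:
            continue
        # WAR hazard: f has Fi as a source and has not read it yet.
        if (f["Fj"] == fi and f["Rj"]) or (f["Fk"] == fi and f["Rk"]):
            return False
    return True


# ADDD F6,F8,F2 sits in Add; DIVD F10,F0,F6 sits in Divide, still waiting
# for F0, so it has not yet read F6 (Rk = Yes).
fu_status = {
    "Add":    {"Busy": True, "Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
    "Divide": {"Busy": True, "Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
}
blocked = not can_write_result("Add", fu_status)

# Once DIVD has read its operands, Rk becomes No and ADDD may write.
fu_status["Divide"]["Rk"] = False
released = can_write_result("Add", fu_status)
```

Note that a unit waiting on the writer itself has Rj or Rk = No (the operand is not ready yet), so a true dependence never blocks the write.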
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code we'll be working with in the example:
    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
Latencies (clock cycles):
    LD    1
    MULTD 10
    SUBD  2
    DIVD  40
    ADDD  2
Initial state (before clock 1)
Instruction status (columns: Issue, Read operands, Execution complete, Write result), no entries yet:
    LD    F6, 34+R2
    LD    F2, 45+R3
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2
Functional unit status (columns: Time, Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk): Integer, Mult1, Mult2, Add and Divide are all not busy.
Register result status (columns: Clock, F0, F2, ..., F30): all blank.
Clock 1
Issue LD #1. The instruction status columns show in which cycle each operation occurred.
Functional unit status: Integer busy (Op Load, Fi F6, Fk R2, Rk Yes); Mult1, Mult2, Add, Divide not busy.
Register result status: F6: Integer.
Clock 2
LD #2 can't issue since the Integer unit is busy. MULT can't issue because we require in-order issue.
Functional unit status: Integer busy (Op Load, Fi F6, Fk R2, Rk Yes); Mult1, Mult2, Add, Divide not busy.
Register result status: F6: Integer.
Clock 3
Functional unit status: Integer busy (Op Load, Fi F6, Fk R2, Rk Yes); Mult1, Mult2, Add, Divide not busy.
Register result status: F6: Integer.
Clock 4
Functional unit status: Integer busy (Op Load, Fi F6, Fk R2, Rk Yes); Mult1, Mult2, Add, Divide not busy.
Register result status: F6: Integer.
Clock 5
Functional unit status: Integer busy (Op Load, Fi F2, Fk R3, Rk Yes); Mult1, Mult2, Add, Divide not busy.
Register result status: F2: Integer.
Clock 6
Issue MULT.
Functional unit status: Integer (Op Load, Fi F2, Fk R3); Mult1 (Op Mult, Fi F0, Fj F2, Fk F4).
Register result status: F0: Mult1; F2: Integer.
Clock 7
Functional unit status: Integer (Fi F2, Fk R3); Mult1 (Fi F0, Fj F2, Fk F4); Add (Fi F8, Fj F6, Fk F2).
Register result status: F0: Mult1; F2: Integer; F8: Add.
Clock 8
Register result status: F0: Mult1; F2: Integer; F8: Add; F10: Divide.
Clock 8
LD #2 writes F2.
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F8: Add; F10: Divide.
Clock 9
Instruction status (columns: Issue, Read operands, Execution complete, Write result):
    LD    F6, 34+R2     1  2  3  4
    LD    F2, 45+R3     5  6  7  8
    MULTD F0, F2, F4    6  9
    SUBD  F8, F6, F2    7  9
    DIVD  F10, F0, F6   8
    ADDD  F6, F8, F2
Functional unit status: Mult1 (Time 10, Fi F0, Fj F2, Fk F4); Add (Time 2, Fi F8, Fj F6, Fk F2); Divide (Fi F10, Fj F0, Fk F6, Qj Mult1).
Register result status: F0: Mult1; F8: Add; F10: Divide.
Clock 11
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F8: Add; F10: Divide.
Clock 12
Functional unit status: Mult1 (Op Mult, Rk Yes); Divide (Op Div, Fi F10, Fj F0, Fk F6, Qj Mult1, Rj No, Rk Yes).
Register result status: F0: Mult1; F10: Divide.
Clock 13
ADDD issues.
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 14
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 15
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 16
Functional unit status: Mult1 (Time 3); Add (Time 0); Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 17
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 18
Nothing happens!
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 19
Functional unit status: Divide waits on Mult1 (Qj).
Register result status: F0: Mult1; F6: Add; F10: Divide.
Clock 20
MULT writes.
Functional unit status: Rj and Rk are Yes for both busy units (Add and Divide).
Register result status: F6: Add; F10: Divide.
Clock 21
Functional unit status: Rj and Rk are Yes for both busy units (Add and Divide).
Register result status: F6: Add; F10: Divide.
Clock 22
Functional unit status: Rj and Rk are Yes for the one busy unit (Divide).
Register result status: F10: Divide.
Clock 61
Functional unit status: Rj and Rk are Yes for the one busy unit (Divide).
Register result status: F10: Divide.
Clock 62
DONE!
Register result status: all blank.
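The scoreboard example can be replayed with a small simulator. This is a sketch under stated assumptions, not the CDC 6600 design: every decision in a clock is taken from the state at the start of that clock, at most one instruction issues per clock, and any number of units may read or write in the same clock. The names `simulate`, `war_free`, `UNITS` and the tuple layout of `PROGRAM` are hypothetical.

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}
UNITS = {"LD": ["Integer"], "MULTD": ["Mult1", "Mult2"],
         "SUBD": ["Add"], "ADDD": ["Add"], "DIVD": ["Divide"]}

# (op, destination, source 1, source 2); None means "no such operand".
PROGRAM = [
    ("LD", "F6", None, "R2"),
    ("LD", "F2", None, "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]


def war_free(unit, fus):
    """Write result check: no other unit still has to read Fi(unit)."""
    fi = fus[unit]["fi"]
    return all(not ((f["fj"] == fi and f["rj"]) or (f["fk"] == fi and f["rk"]))
               for name, f in fus.items() if name != unit)


def simulate(program):
    fus = {}     # busy functional units only: name -> status fields
    result = {}  # register result status: register -> unit that will write it
    times = [dict(issue=None, read=None, complete=None, write=None) for _ in program]
    next_issue, clock = 0, 0
    while any(t["write"] is None for t in times):
        clock += 1
        # Decide everything from the state at the start of the clock.
        writers = [n for n, f in fus.items()
                   if f["complete"] is not None and f["complete"] < clock
                   and war_free(n, fus)]
        readers = [n for n, f in fus.items()
                   if f["complete"] is None and f["rj"] and f["rk"]]
        issue = None
        if next_issue < len(program):
            op, dest, _, _ = program[next_issue]
            free = [u for u in UNITS[op] if u not in fus]
            if free and dest not in result:  # no structural or WAW hazard
                issue = free[0]
        # Apply the bookkeeping.
        if issue is not None:
            op, dest, s1, s2 = program[next_issue]
            qj, qk = result.get(s1), result.get(s2)
            fus[issue] = dict(idx=next_issue, fi=dest, fj=s1, fk=s2, qj=qj, qk=qk,
                              rj=qj is None, rk=qk is None, complete=None)
            result[dest] = issue
            times[next_issue]["issue"] = clock
            next_issue += 1
        for n in readers:
            f = fus[n]
            f["rj"] = f["rk"] = False
            f["complete"] = clock + LATENCY[program[f["idx"]][0]]
            times[f["idx"]]["read"] = clock
            times[f["idx"]]["complete"] = f["complete"]
        for n in writers:
            done = fus.pop(n)
            del result[done["fi"]]
            for f in fus.values():  # wake up units waiting on this result
                if f["qj"] == n:
                    f["qj"], f["rj"] = None, True
                if f["qk"] == n:
                    f["qk"], f["rk"] = None, True
            times[done["idx"]]["write"] = clock
    return times


timeline = simulate(PROGRAM)
```

Under these assumptions `timeline` ends with DIVD writing its result in clock 62, and ADDD, although it finishes executing in clock 16, cannot write F6 until clock 22 because DIVD has not yet read the old F6.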
Dynamic Scheduling
Tomasulo Algorithm
Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
This is the example that the text uses in Sections 3.2 & 3.3.
Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
    Avoids WAR, WAW hazards
    More reservation stations than registers, so can do optimizations compilers can't
Results go to the FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
Loads and stores are treated as FUs with RSs as well
Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
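A minimal sketch of how reservation stations rename registers, written in Python. Everything here is an assumption for illustration: the class `TomasuloSketch`, the station tags, and the register values are hypothetical, and timing, the FP op queue and memory are left out. At issue, each operand is either copied as a value (Vj, Vk) or recorded as a pointer to the producing station (Qj, Qk); results are then broadcast to every waiting station, as on the Common Data Bus.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


class TomasuloSketch:
    def __init__(self, regs):
        self.regs = dict(regs)  # register -> current value
        self.qi = {}            # register -> tag of the station that will write it
        self.rs = {}            # tag -> reservation station entry

    def issue(self, tag, op, dest, s1, s2):
        entry = {"Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for v, q, src in (("Vj", "Qj", s1), ("Vk", "Qk", s2)):
            if src in self.qi:
                entry[q] = self.qi[src]    # pointer to the producing station
            else:
                entry[v] = self.regs[src]  # value copied now
        self.rs[tag] = entry
        self.qi[dest] = tag  # later readers of dest are renamed to this tag

    def finish(self, tag):
        entry = self.rs[tag]
        if entry["Qj"] is not None or entry["Qk"] is not None:
            raise ValueError("operands not ready")
        del self.rs[tag]
        value = OPS[entry["Op"]](entry["Vj"], entry["Vk"])
        # Broadcast: every station waiting on this tag captures the value.
        for other in self.rs.values():
            if other["Qj"] == tag:
                other["Vj"], other["Qj"] = value, None
            if other["Qk"] == tag:
                other["Vk"], other["Qk"] = value, None
        # Only registers still renamed to this tag are written (avoids WAW).
        for reg in [r for r, t in self.qi.items() if t == tag]:
            self.regs[reg] = value
            del self.qi[reg]
        return value


def demo():
    m = TomasuloSketch({"F2": 2.0, "F4": 3.0, "F6": 12.0, "F8": 1.0})
    m.issue("Mult1", "MULTD", "F0", "F2", "F4")
    m.issue("Mult2", "DIVD", "F10", "F0", "F6")  # copies old F6, waits on Mult1
    m.issue("Add1", "ADDD", "F6", "F8", "F2")
    m.finish("Add1")   # ADDD finishes first and overwrites F6
    m.finish("Mult1")  # broadcast delivers F0 to Mult2
    m.finish("Mult2")  # DIVD still divides by the old F6
    return m.regs
```

In `demo`, ADDD writes F6 before DIVD executes, yet DIVD computes with the old F6 because it copied the value at issue. That is the WAR hazard from the scoreboard example disappearing through renaming.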
Tomasulo Organization
[Figure: block diagram. Labeled components: FP Op Queue, FP Registers, From Mem, To Mem, Reservation Stations, FP adders, FP multipliers.]
Dynamic Scheduling
Tomasulo Algorithm
Reservation station status (columns: Busy, Op, Vj (S1), Vk (S2), Qj (RS for j), Qk (RS for k)): all five stations shown are not busy.
Register result status at Clock 0 (F0, F2, ..., F30): all blank.
Dynamic Scheduling
Tomasulo Algorithm
Review: Tomasulo
Prevents registers from being a bottleneck
Avoids the WAR, WAW hazards of the scoreboard
Allows loop unrolling in HW
Not limited to basic blocks (provided branch prediction)
Lasting contributions:
    Dynamic scheduling
    Register renaming
    Load/store disambiguation
360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro