[Figure: plot of Speedup vs. Stalls / Instruction. The pipelines have equal performance when 3.85 = 1.1 * (1 + stalls), or stalls = 3.85 / 1.1 - 1 = 2.5 stalls per instruction.]
Structural Hazards
• We have already resolved one structural hazard: two
possible cache accesses in one cycle
– This would arise any time we have a load/store instruction
• As it moves down the pipeline and reaches the MEM stage, it would
conflict with the next instruction fetch
• Assuming 35% loads and 15% stores in a program, half of the instructions
would cause this hazard, each requiring a stall; this introduces .5 stalls per
instruction, or an overall CPI of 1.5!
• We avoid this by using 2 caches (separate instruction and data caches)
• The other source of structural hazard occurs in the EX stage
if an operation takes more than 1 cycle to complete
– We cannot have the next instruction move into EX if the current
instruction is still there
– This happens with longer ALU operations: multiplication,
division, floating point operations
– We will resolve this problem later when we add FP to our
pipeline, for now, assume all ALU operations take 1 cycle
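The structural-hazard arithmetic above can be checked with a quick sketch in Python (the fractions are the slide's assumed benchmark mix):

```python
# With a single cache, every load (35%) and store (15%) conflicts with an
# instruction fetch when it reaches the MEM stage, costing one stall each.
load_frac = 0.35
store_frac = 0.15

stalls_per_instr = (load_frac + store_frac) * 1  # one stall per memory access
cpi = 1.0 + stalls_per_instr

print(round(stalls_per_instr, 4), round(cpi, 4))
```

With half of the instructions stalling one cycle, this confirms the .5 stalls per instruction and the CPI of 1.5 quoted above.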
Data Hazards
• The data hazard arises when a value produced by one instruction
is needed by a later instruction before that value has been
written back
– For instance, if we have a LD R1, 0(R2) followed by
DADDI R3, R1, #1, the loaded value is not available until the
LD finishes its MEM stage, well after the DADDI has reached
the ID stage (where it retrieves R1 from the register file)
– We need to stall the DADDI by 3 cycles!
• LD: IF ID EX MEM WB
• DADDI: IF stall stall stall ID …
– Another source of data hazard is when two consecutive
ALU operations access the same register, the first
producing the result for the second
• DADD R1, R2, R3: IF ID EX MEM WB
• DSUB R4, R5, R1: IF stall stall stall ID …
– Yet another source is an ALU operation which produces a
result used in a branch
• DSUBI R1, R1, #1
• BNEZ R1, top
Data Hazards: Solutions
• We will implement 3 solutions to data hazards
– First, register writes will occur only in the first half of
the cycle (in WB) and register reads only in the second half (in ID)
• this permits an instruction to place a result in the register file
and, in the same cycle, another instruction can read the same
register to get the new value
– Second, we will implement forwarding (covered in the
next slide)
• this will shunt a value directly from the ALU as output
directly into the ALU as input
• this will shunt a value received from memory directly into
the ALU as input or directly back to memory
– Third, we will let the compiler fill any remaining stalls
with neutral instructions; this is called compiler
scheduling
Forwarding
• Forwarding can handle ALU-to-ALU data
dependencies, MEM-to-ALU data dependencies,
and MEM-to-MEM data dependencies
[Figure note: the DADD and OR in the figure do not require forwarding, since the WB write happens before the ID read.]
Forwarding is Not Enough
• Forwarding will resolve the following situations:
– DADDI R1, R1, #4
– DSUBI R2, R1, R3
• the value of R1 is passed from ALU output to ALU input
– LD R1, 0(R3)
– SD R1, 0(R4)
• the value of R1 is passed from MEM output to MEM input
– DSUBI R1, R1, #1
– BNEZ R1, foo
• the value of R1 is passed from ALU output to ALU input
• It does not resolve these problems
– LD R1, 0(R2) IF ID EX MEM WB
– DADDI R1, R1, #1 IF ID …. EX
• the value of R1 is available at the end of MEM but needed in DADDI at the
beginning of EX
– LD R1, 0(R2)
– BNEZ R1, foo
• same
Stalling or Scheduling
• To resolve the last two forms of data hazard, the pipeline has to
either stall the latter instruction or the compiler needs to perform
scheduling
– For a stall, the ID/EX latch checks whether one of the source registers
of the instruction entering EX matches the destination register of the
load entering MEM; if so, a 1 cycle stall is inserted, causing the latches in
ID/EX to remain closed
• The compiler can be written to resolve as many of these hazards
as possible by finding an independent instruction (one that does
not use this source/destination register) to place in between the
two dependent instructions
– Consider for example the following code, which loads two data from
arrays and adds them together; the scheduled version on the right removes all stalls

    Original:              Scheduled:
    LD    R1, 0(R2)        LD    R1, 0(R2)
    DADDI R1, R1, #1       LD    R3, 0(R4)
    LD    R3, 0(R4)        DADDI R1, R1, #1
    DADDI R3, R3, #1       DADDI R3, R3, #1
    DADD  R5, R1, R3       DADD  R5, R1, R3
    SD    R5, 0(R6)        SD    R5, 0(R6)
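The load-use check described above (comparing the source registers of the instruction entering EX against the destination of the load entering MEM) can be sketched in Python; the tuple instruction format (dest, srcs, is_load) is an assumption for illustration only:

```python
# Hypothetical sketch of the ID/EX hazard check: stall one cycle when the
# instruction entering EX reads a register that the load entering MEM is
# about to produce.
def needs_load_use_stall(entering_ex, entering_mem):
    mem_dest, _, mem_is_load = entering_mem
    _, ex_srcs, _ = entering_ex
    return mem_is_load and mem_dest is not None and mem_dest in ex_srcs

ld = ("R1", ("R2",), True)       # LD    R1, 0(R2)
daddi = ("R3", ("R1",), False)   # DADDI R3, R1, #1
print(needs_load_use_stall(daddi, ld))  # True: DADDI reads the loaded R1
```

A real pipeline performs this comparison in hardware on the register numbers held in the pipeline latches; the sketch only captures the comparison itself.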
Impact of Stalls
• Assume a benchmark of 35% loads, 15% stores, 10% branches,
40% ALU operations
– Of the loads, 50% of the loaded values are used immediately
afterward
– Of the ALU operations, 25% are used immediately afterward either in
other ALU operations, stores or branches
• Without coordinating the ID/WB stages, forwarding or
scheduling, all stalls result in 3 cycle penalties
– Number of stalls per instruction = .35 * .50 * 3 + .40 * .25 * 3 = .825,
or a CPI of 1.825
• With coordinated ID/WB register access and forwarding, stalls are
reduced to 1 cycle for load-ALU and load-branch dependences
– Number of stalls per instruction = .35 * .50 * 1 = .175, or a CPI of
1.175
• Assuming an optimizing compiler can schedule half of these
situations, number of stalls per instruction = .0875 or a CPI of
1.0875
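The three CPI figures above can be reproduced directly from the benchmark mix:

```python
# Benchmark mix from the slide: 35% loads, 40% ALU operations.
loads, alu = 0.35, 0.40
load_use = 0.50   # fraction of loads whose value is used immediately after
alu_use = 0.25    # fraction of ALU results used immediately after

# No split-cycle register access, forwarding, or scheduling: 3-cycle stalls
cpi_naive = 1.0 + loads * load_use * 3 + alu * alu_use * 3

# Split-cycle register access + forwarding: only 1-cycle load-use stalls remain
cpi_forwarding = 1.0 + loads * load_use * 1

# Compiler schedules away half of the remaining load-use stalls
cpi_scheduled = 1.0 + loads * load_use * 1 * 0.5

print(round(cpi_naive, 4), round(cpi_forwarding, 4), round(cpi_scheduled, 4))
```

This matches the 1.825, 1.175, and 1.0875 figures computed above.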
Branch Hazards
• The last form of stall occurs with any branch that is
taken
– Unconditional branches are always taken
– Conditional branches are taken when the condition is true
• Why is the branch a problem?
– Branch conditions (conditional branches) and branch target
locations (PC + offset) are both computed in the EX stage
(we do not reset the PC until the MEM stage, but let’s
move that MUX into the EX stage to further reduce the
penalty by 1)
• We have a 2 cycle penalty because we fetched two instructions in
the meantime (one is in IF, one is in ID)
– If the branch is taken, those 2 instructions need to be
flushed from the pipeline, thus taken branches cause a
penalty of 2 cycles
– There are several ways to handle the 2 cycle penalty, both
through hardware and software
Branch Penalty
If the branch is taken, instructions i+1 and i+2 should not have been fetched,
but we do not know this until instruction i completes its EX stage
If the branch is not taken, i+1 and i+2 would need to be fetched anyway,
no penalty
MIPS Solutions to the Branch Penalty
• Hardware solution
– There is no particular reason why the PC + offset and
condition evaluation have to wait until the EX stage
– Let’s add an ADDER to the ID stage to do PC + offset
– We can also move the zero tester into the ID stage so that
the comparison takes place after registers are read
• recall that the ID stage is one of the two shortest (time-wise) in
the pipeline, we should have enough time in this stage to read
from registers and do the zero test
– If the branches are now being determined in the ID stage,
it reduces the branch penalty to 1
• Software solution
– The compiler can try to move a neutral instruction into
that penalty location, known as the branch delay slot
Continued
• The new IF and ID stages are shown to the right
– The PC + Offset is computed automatically
– A MUX is used to select which PC value should be used in the
next fetch, PC + 4 or PC + Offset; this is based on two
decisions
• is the instruction in ID a branch, and if the instruction is a
conditional branch, is the condition true? if so, use PC +
Offset
– We simplified our MIPS instruction set so that the only
two branches are BEQZ and BNEZ; that is, an integer
register is simply tested against 0, which can be done quickly
(in essence, all bits are NORed together)

One consequence of this new architecture is a new source of stall:
    LW    R1, 0(R2)
    BEQZ  R1, foo     // 2 cycle stall
    DSUBI R1, R1, #4
    BNEZ  R1, foo     // 1 cycle stall
Filling the Branch Delay Slot
• The compiler will look for a neutral instruction to move
down into the branch delay slot
– A neutral instruction is one that does not impact the branch
condition, nor produces a value that is used by an instruction
between it and the branch
• If a neutral instruction cannot be found, there are two other
possible types of instruction that could be sought, neither of
which is safe in that, if the branch is mispredicted, the
instruction would have to be flushed

[Figure: of the three scheduling strategies shown, (a) is always safe; (b) and (c) are not. Depending on how aggressively the compiler is set up, it may or may not try to schedule (b) and (c) type instructions.]
Impact of Branch Hazards
• Assume a benchmark of 35% loads, 15% stores, 40%
ALU operations, 8% conditional branches and 2%
unconditional branches
– What is the impact on branch hazards if
• we use the original MIPS pipeline with no compiler scheduling
• we use the new MIPS pipeline with no compiler scheduling
• we use the new MIPS pipeline where compiler scheduling can
successfully move a neutral instruction (type a) into the branch
delay slot 60% of the time
– 10% of instructions are branches
• original pipeline has a penalty of 2 cycles per branch, our CPI
goes from 1.0 to 1.0 + 10% * 2 = 1.2
• new pipeline has a penalty of 1 cycle per branch, our CPI goes
from 1.0 to 1.0 + 10% * 1 = 1.1
• new pipeline plus scheduling, our CPI goes from 1.0 to 1.0 +
10% * 40% * 1 = 1.04
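The branch-penalty arithmetic above, reproduced as a quick check:

```python
# 8% conditional + 2% unconditional branches = 10% of instructions
branch_frac = 0.10

cpi_original = 1.0 + branch_frac * 2          # branch resolved in EX: 2-cycle penalty
cpi_new = 1.0 + branch_frac * 1               # branch resolved in ID: 1-cycle penalty
cpi_scheduled = 1.0 + branch_frac * 0.40 * 1  # delay slot filled 60% of the time

print(round(cpi_original, 4), round(cpi_new, 4), round(cpi_scheduled, 4))
```

This matches the CPIs of 1.2, 1.1, and 1.04 derived above.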
Scheduling Examples
Loop: LD R1, 0(R2) IF ID EX MEM WB
DADDI R1, R1, #1 IF ID s EX MEM WB
SD R1, 0(R2) IF s ID EX MEM WB
DADDI R2, R2, #4 IF ID EX MEM WB
DSUB R4, R3, R2 IF ID EX MEM WB
BNEZ R4, Loop IF s ID EX MEM WB
branch delay (LD or next instruction sequential) s IF …
• Stalls arise after the LD (data hazard), after the DSUB (data hazard
caused by moving the branch computation to ID) and after the BNEZ
(branch hazard)
• Below, the code has been scheduled by the compiler to remove all stalls
with the SD filling the branch delay slot
Loop: LD R1, 0(R2) IF ID EX MEM WB
DADDI R2, R2, #4 IF ID EX MEM WB
DSUB R4, R3, R2 IF ID EX MEM WB
DADDI R1, R1, #1 IF ID EX MEM WB
BNEZ R4, Loop IF ID EX MEM WB
SD R1, -4(R2) IF ID EX MEM WB
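The stall counts in these two diagrams can be checked with a rough counter for this pipeline (forwarding on, branch resolved in ID). The model is a simplification: it only looks at adjacent instruction pairs, which is all this example needs; the (dest, srcs, kind) tuple format is an assumption for illustration.

```python
# Count data-hazard stalls between adjacent instructions: a load-use pair
# costs 1 stall, an ALU result feeding the next branch costs 1 (condition
# tested in ID), and a load feeding the next branch would cost 2.
def count_stalls(prog):
    stalls = 0
    for prev, cur in zip(prog, prog[1:]):
        prev_dest, _, prev_kind = prev
        _, srcs, cur_kind = cur
        if prev_dest is None or prev_dest not in srcs:
            continue
        if cur_kind == "branch":
            stalls += 2 if prev_kind == "load" else 1
        elif prev_kind == "load":
            stalls += 1
    return stalls

unscheduled = [
    ("R1", ("R2",),      "load"),    # LD    R1, 0(R2)
    ("R1", ("R1",),      "alu"),     # DADDI R1, R1, #1
    (None, ("R1", "R2"), "store"),   # SD    R1, 0(R2)
    ("R2", ("R2",),      "alu"),     # DADDI R2, R2, #4
    ("R4", ("R3", "R2"), "alu"),     # DSUB  R4, R3, R2
    (None, ("R4",),      "branch"),  # BNEZ  R4, Loop
]
scheduled = [
    ("R1", ("R2",),      "load"),    # LD    R1, 0(R2)
    ("R2", ("R2",),      "alu"),     # DADDI R2, R2, #4
    ("R4", ("R3", "R2"), "alu"),     # DSUB  R4, R3, R2
    ("R1", ("R1",),      "alu"),     # DADDI R1, R1, #1
    (None, ("R4",),      "branch"),  # BNEZ  R4, Loop
    (None, ("R1", "R2"), "store"),   # SD    R1, -4(R2)  (branch delay slot)
]
print(count_stalls(unscheduled), count_stalls(scheduled))  # 2 0
```

The unscheduled loop shows the two stalls from the diagram (after the LD and before the BNEZ), while the scheduled version has none.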
Branches in Other Pipelines
• In some pipelines, the stage where the target PC value is
computed occurs earlier than the stage in which the
condition is determined
– This is in part due to the computation of PC + offset being
available earlier
– The condition is usually a test that requires one or more
registers be read first, whereas PC and offset are already
available, so the PC + offset occurs earlier than say R1 == 0 or
R2 != R3
– Thus, in some pipelines, we might implement “assume taken”,
immediately changing the PC as soon as possible, and then
canceling the incorrectly fetched instruction if the branch is not
taken
• Why assume taken for conditional branches?
– In loops, the conditional branch is typically taken (to branch
back to the top of the loop) and perhaps 50% of conditional
branches are taken for if and if-else statements, so we might
assume a conditional branch is taken 60-70% of the time
Example
• The MIPS R4000 pipeline is 8 stages where branch target
locations are known in stage 3 and branch conditions are
evaluated in stage 4
                      unconditional   conditional         conditional
                      branch          branch not taken    branch taken
Predict taken              2                 3                  2
Predict not taken          2                 0                  3
– Assume a benchmark with 4% unconditional branches and 6%
conditional branches, of which 70% are taken
– Predict taken penalty = .04 * 2 + .06 * .30 * 3 + .06 * .70 * 2 = .218
– Predict not taken penalty = .04 * 2 + .06 * .30 * 0 + .06 * .70 * 3 = .206
• This argues that, like MIPS, assuming a branch is not taken
makes more sense than assuming branches are taken
– However, this may not be the case in even longer pipelines or for
benchmarks that have more conditional branches and fewer
unconditional branches – we will visit this in some example problems
out of class
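As a sketch, the table's penalties can be combined with the branch mix in Python; this assumes the benchmark means 4% unconditional branches and 6% conditional branches, of which 70% are taken (one consistent reading of the slide's numbers):

```python
# Expected branch-penalty cycles per instruction for the R4000-style
# pipeline, using the penalty table above.
uncond, cond, taken = 0.04, 0.06, 0.70

# Predict taken:      uncond = 2, cond not taken = 3, cond taken = 2
predict_taken = uncond * 2 + cond * (1 - taken) * 3 + cond * taken * 2

# Predict not taken:  uncond = 2, cond not taken = 0, cond taken = 3
predict_not_taken = uncond * 2 + cond * (1 - taken) * 0 + cond * taken * 3

print(round(predict_taken, 4), round(predict_not_taken, 4))
```

Since the predict-not-taken penalty is the smaller of the two, the not-taken assumption wins for this mix, as the slide concludes.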
Adding Floating Point to MIPS
• FP operations take longer than integer
– Even a FP addition takes more time because we have to align the
two operands' exponents (line up their binary points) and then
renormalize the result when done with the operation
– We could either lengthen the clock cycle time
• this impacts all operations
– Or alter our EX stage to handle variable lengths
– We choose the latter approach as it has less impact on the
CPU’s performance although it causes new problems with
handling exceptions
• We will replace the current EX stage with a 4-device EX
stage
– The integer ALU
– An FP adder
– An FP multiplier (which will also be used for int multiplies)
– An FP divider
New Pipeline
• The integer EX unit will
still complete all
instructions in 1 cycle
• The EX adder will take 4
cycles
• The EX/int multiplier
will take 7 cycles
• The EX/int divider will
take 25 cycles
• The adder and multiplier
will be pipelined
– No need to pipeline the
integer unit
– The divider will not be
used often enough to
warrant it being
pipelined

Functional Unit        Latency   Initiation Interval
Integer ALU               0              1
Data Memory               1              1
FP Add                    3              1
FP/Int Multiply           6              1
FP/Int Divide/Sqrt       24             25
Pipelining FP Adder and Multiplier
New Complications
• Forwarding is still available from M7/A4/Div/Ex to
Ex/M1/A1/Div but more data hazard stalls may be needed
• What happens if two instructions reach MEM at the same time?
• What happens if a later instruction reaches MEM before an
earlier instruction? (out of order completion)
• What happens if 2 divisions occur within 25 cycles of each
other?
• What happens if an earlier instruction raises an interrupt after a
later instruction leaves the pipeline?
The Need for Stalls
L.D F3, 0(R2) IF ID EX MEM WB
MUL.D F0, F3, F6 IF ID stall M1 M2 M3 M4 M5 M6 M7 MEM WB
ADD.D F2, F0, F8 IF stall ID stall stall stall stall stall stall A1 A2 A3 A4 MEM WB
S.D F2, 0(R2) IF stall stall stall stall stall stall ID EX stall stall MEM WB
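The stall counts in this diagram follow from the latency table. As a rough sketch (assuming full forwarding, back-to-back issue, and ignoring structural stalls), a dependent operation issued d cycles after its producer stalls max(0, latency - (d - 1)) cycles:

```python
# Latencies from the table: Integer ALU 0, Data Memory 1, FP Add 3,
# FP/Int Multiply 6. A store consumes its value in MEM, one stage later
# than an ALU consumer, so it stalls one cycle less than this formula
# gives (the S.D above stalls 2 cycles, not 3, behind the ADD.D).
def fp_stalls(producer_latency, issue_distance):
    return max(0, producer_latency - (issue_distance - 1))

print(fp_stalls(1, 1))  # L.D  -> MUL.D : 1 stall, as in the diagram
print(fp_stalls(6, 1))  # MUL.D -> ADD.D: 6 stalls (the ADD.D's extra
                        # early stall is inherited from the MUL.D's own stall)
```

This is only a data-hazard estimate; the structural MEM/WB conflicts raised in "New Complications" are not modeled.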