Appendix A

Appendix A: Basic Pipelining: Basic and Intermediate Concepts
Rung-Bin Lin
Appendix A. Pipelining: Basic and Intermediate Concepts
What is Pipelining?
Pipelining is an implementation technique whereby multiple
instructions are overlapped in execution.
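Pipe stage (pipe segment): One step of the pipeline, which completes
one part of an instruction.
Throughput: How often an instruction exits the pipeline.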
Machine cycle: The time required between moving an instruction one
step down the pipeline. This time is equal to the time required for the
slowest pipe stage.
In a computer, the machine cycle is usually one clock cycle.
The pipeline designer's goal is to balance the length of each pipe stage.
If the stages are perfectly balanced,
$$\text{Time per instruction (pipelined)} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
A Simple Implementation of A RISC ISA

Five-cycle implementation
Instruction fetch cycle (IF)
Instruction decode/register fetch cycle (ID)
Operand fetches;
Sign-extending the immediate field;
Decoding is done in parallel with reading registers. This technique is
known as fixed-field decoding;
Test the branch condition and compute the branch target address; the
branch is completed at the end of this cycle.
Execution/effective address cycle (EX)

Memory reference;
Register-Register ALU instruction;
Register-Immediate ALU instruction;
Memory access/branch completion cycle (MEM)

Write-back cycle (WB)
Register-Register ALU instruction;
Register-Immediate ALU instruction;
Load instruction;
Performance of the Five-Cycle Implementation

CPI=4.54
Branch instructions (12%) take 2 cycles
Store instructions (10%) require 4 cycles
All other instructions (78%) take 5 cycles
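The CPI figure on this slide can be checked as a weighted average over the instruction mix. The following is a minimal Python sketch; the 78% share for the remaining instructions is inferred as 100% - 12% - 10%, and the names `mix` and `average_cpi` are illustrative only.

```python
# Instruction mix from the slide: (fraction of instructions, cycles each).
# The 78% share for "other" is an inference: 100% - 12% - 10%.
mix = {
    "branch": (0.12, 2),
    "store": (0.10, 4),
    "other": (0.78, 5),
}


def average_cpi(mix):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(fraction * cycles for fraction, cycles in mix.values())


cpi = average_cpi(mix)
print(f"CPI = {cpi:.2f}")  # prints "CPI = 4.54"
```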
The Classic Five-Stage Pipeline for a RISC Processor
The RISC Pipeline with Registers
Instruction Issue
The process of letting an instruction move from the
instruction decode stage (ID) into the execution stage
(EX) of this pipeline.
Basic Performance Issues in Pipelining

Pipelining increases instruction throughput, but it does not
reduce the execution time of an individual instruction. In fact,
it usually increases it slightly because of pipeline overhead:
Register delay
Clock skew
Pipeline depth is limited by:

Pipeline latency
Pipe stage imbalance
Pipeline overhead
Example in A-10.
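The example on page A-10 is not reproduced in these slides. As a rough illustration of how overhead limits speedup, the sketch below compares an ideal pipeline against one whose clock is stretched by a fixed overhead. All numbers (a 1.0 ns unpipelined clock, an unpipelined CPI of 4.4, a 0.2 ns overhead) and the function name `pipelined_speedup` are assumptions chosen for illustration, not values from the slides.

```python
def pipelined_speedup(unpipelined_cycle_ns, unpipelined_cpi, overhead_ns):
    """Speedup of a stall-free pipeline whose clock is stretched by overhead.

    Assumes the pipelined clock equals the unpipelined clock plus a fixed
    overhead (register delay and clock skew), and a pipelined CPI of 1.
    """
    unpipelined_time = unpipelined_cycle_ns * unpipelined_cpi
    pipelined_time = unpipelined_cycle_ns + overhead_ns  # one instruction per cycle
    return unpipelined_time / pipelined_time


# Illustrative values only (assumptions, not taken from the slides).
ideal = pipelined_speedup(1.0, 4.4, 0.0)
with_overhead = pipelined_speedup(1.0, 4.4, 0.2)
print(f"no overhead: {ideal:.2f}x, with 0.2 ns overhead: {with_overhead:.2f}x")
```

Under these assumptions the overhead alone pulls the speedup from 4.4 down to about 3.7, before any stalls are counted.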
The Major Hurdle of Pipelining - Pipelining Hazards
A hazard is a situation that prevents the next instruction in
the instruction stream from executing during its designated
clock cycle.
Three classes of hazards
Structural hazards arise from resource conflicts.
Data hazards arise when an instruction depends on the results of a
previous instruction.
Control hazards arise from branches and other instructions that
change the PC.
A pipeline can be stalled by a hazard. When an instruction is stalled:
Instructions issued later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled one must continue;
otherwise the hazard would never clear.
Note that a cache miss stalls the whole pipeline.
Performance of Pipeline with Stalls

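$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$
When pipelining is thought of as decreasing the CPI,
$$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$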
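When pipelining is thought of as improving the clock cycle time,
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$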
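To get a feel for how stalls erode the ideal speedup, the formula can be evaluated for a few stall rates. This is a small Python sketch that assumes perfectly balanced stages and no clocking overhead; the function name `speedup_with_stalls` and the sample stall rates are illustrative.

```python
def speedup_with_stalls(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Assumes perfectly balanced stages and no clocking overhead, so the
    ideal pipelined CPI is 1.
    """
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


# Five-stage pipeline with increasing stall rates (sample values).
for stalls in (0.0, 0.25, 1.0):
    print(f"{stalls:.2f} stalls/instr -> speedup {speedup_with_stalls(5, stalls):.2f}")
```

With one stall cycle per instruction on average, a five-stage pipeline delivers only half of its ideal speedup.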
Structural Hazards
Due to resource conflicts (Example in A-14):
Some functional unit is not fully pipelined.
Some resource has not been duplicated enough.
Data Hazards
An instruction's operand access depends on the results of earlier
instructions that have not yet finished.
Forwarding (Bypassing) ALU Results To Minimize Hazards
Forwarding (Bypassing) Results to Store
Bypassing Results of LOAD
Data Hazard Classification

Consider two instructions i and j, with i occurring before j,
the possible hazards are,
RAW (read after write) : j tries to read a source before i writes it.
WAW (write after write): j tries to write an operand before it is
written by i. For example,
LW   R1, 0(R2)      IF  ID  EX  MEM1  MEM2  WB
DADD R1, R2, R3         IF  ID  EX    WB
WAR (write after read): j tries to write a destination before it is read
by i. For example, if the read is done in the second half of MEM2 and
the write is done in the first half of WB:
SW   0(R1), R2      IF  ID  EX  MEM1  MEM2  WB
DADD R2, R3, R4         IF  ID  EX    WB
RAR (read after read): not a hazard.
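The three definitions reduce to comparisons between the register sets of i and j. The sketch below is a minimal Python illustration; the `Instr` record and `classify_hazards` function are hypothetical names, and the result is the set of potential hazards only, since whether one actually occurs depends on the pipeline timing shown in the diagrams.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: Optional[str]
    sources: Tuple[str, ...]


def classify_hazards(i: Instr, j: Instr) -> Set[str]:
    """Return the potential data hazards between i and a later instruction j."""
    hazards = set()
    if i.dest is not None and i.dest in j.sources:
        hazards.add("RAW")  # j reads a register that i writes
    if i.dest is not None and i.dest == j.dest:
        hazards.add("WAW")  # both write the same register
    if j.dest is not None and j.dest in i.sources:
        hazards.add("WAR")  # j writes a register that i reads
    return hazards


lw = Instr("LW R1, 0(R2)", dest="R1", sources=("R2",))
dadd = Instr("DADD R1, R2, R3", dest="R1", sources=("R2", "R3"))
sw = Instr("SW 0(R1), R2", dest=None, sources=("R1", "R2"))
dadd2 = Instr("DADD R2, R3, R4", dest="R2", sources=("R3", "R4"))

print(classify_hazards(lw, dadd))   # {'WAW'}
print(classify_hazards(sw, dadd2))  # {'WAR'}
```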
Data Hazards Requiring Stalls

Pipeline interlock
A piece of hardware that detects a hazard and stalls the pipeline
until the hazard is cleared.
Load interlock
Example (Fig. A.10 at A-21)
Control Hazards
Caused by instructions that change the PC.
Some basics
If a branch changes the PC to its target address, it is a taken
branch. If it does not change the PC, it falls through or it is not
taken.
Recall that if an instruction i is a taken branch, the PC is normally
not changed until the end of ID. A stall cycle is required.
Branch Instruction     IF  ID  EX  MEM WB
Branch successor           IF  IF  ID  EX  MEM WB
Branch successor+1                 IF  ID  EX  MEM WB
Branch successor+2                     IF  ID  EX  MEM WB
Branch Penalty
Branch delay: The length of a control hazard.
Branch penalty: The branch delay, unless it is dealt with,
turns into a branch penalty.
The deeper the pipeline, the worse the branch penalty.
The number of branch stalls can be reduced in two steps:
Find out whether the branch is taken or not taken earlier in the
pipeline.
Compute the taken PC (i.e., the address of the branch target)
earlier.
Branch behavior in programs

Average frequency of taken branches : 67%
60% of the forward branches are taken.
85% of the backward branches are taken.
Reducing Pipeline Branch Penalties

Static branch prediction methods (Compile-time guess).
Freeze or flush the pipeline
Hold or delete any instructions after the branch until the branch
destination is known.
Predict-not-taken (untaken) (Fig. A.12 in A-23)

Predict-taken
Does it have any advantage in this pipeline? Ans: no, because the target
address is not known any earlier than the branch outcome.
Delayed branch:
The execution cycle with a branch delay n is
Branch instruction
Sequential successor 1
Sequential successor 2
Sequential successor n (n=1 for MIPS)
Branch target if taken
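The cost of the simpler schemes can be compared with a back-of-the-envelope CPI estimate. The Python sketch below combines numbers from different slides, which is an assumption made here: the 12% branch frequency from the five-cycle performance slide, the 67% taken rate, and the one-cycle stall for a branch resolved at the end of ID. The function name `cpi_with_branches` is illustrative.

```python
def cpi_with_branches(branch_freq, penalty_cycles, penalized_fraction=1.0):
    """Effective CPI = 1 + branch frequency x fraction penalized x penalty."""
    return 1.0 + branch_freq * penalized_fraction * penalty_cycles


BRANCH_FREQ = 0.12  # branch share (assumed to carry over from the CPI slide)
TAKEN = 0.67        # average frequency of taken branches
PENALTY = 1         # one stall cycle, branch resolved at the end of ID

# Freeze/flush: every branch pays the penalty.
flush = cpi_with_branches(BRANCH_FREQ, PENALTY)
# Predict-not-taken: only taken branches pay the penalty.
not_taken = cpi_with_branches(BRANCH_FREQ, PENALTY, TAKEN)
print(f"flush: {flush:.4f}, predict-not-taken: {not_taken:.4f}")
```

The gap between the two schemes widens as the branch penalty grows, which is why deeper pipelines depend more heavily on prediction.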
Scheduling the Branch Delay Slot
Effectiveness of Scheduling Branch Delay Slots

When each scheduling strategy improves performance
Scheduling from before: always
Scheduling from target: when the branch is taken
Scheduling from fall through: when the branch is not taken
The limitation on delayed-branch scheduling arises from

The restrictions on the instructions that are scheduled into the
delay slots.
The ability to predict at compile time whether a branch is likely to
be taken or not.
Using a canceling (nullifying) branch to relieve the limits

In a canceling branch, the instruction includes the direction in which
the branch was predicted. When the branch behaves as predicted,
the instruction in the branch delay slot is simply executed.
Otherwise, the instruction in the branch delay slot is turned
into a no-op.
How Is Pipelining Implemented?

Unpipelined 5-cycle implementation
Simple Pipelining Implementation for MIPS
Implementing the Control for MIPS Pipeline

Implementing the control focuses on detecting hazards and
generating control signals for forwarding.
Hazard detection
All the data hazards can be checked and forwarding control
signals can be set during the ID phase. If a data hazard exists, the
instruction is stalled before it is issued.
Alternatively, hazards and forwarding are checked at the beginning
of a clock cycle that uses an operand (EX and MEM for the MIPS
pipeline).
Implementing the logic for hazard detection

Hazards are detected by comparing the destination and sources of
adjacent instructions (fig. A.20 on page A-34).
An example shows how to detect all load interlocks while the
instruction using the load result is in the ID stage (fig. A.21 on page A-34).
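The load interlock check amounts to comparing the destination of a load in ID/EX with the sources of the instruction in IF/ID. The following is a simplified Python sketch, not the logic of fig. A.21 itself: the `LatchedInstr` record, the `"LD"` opcode string, and the function name are hypothetical, and the sketch treats every source register the same way (it ignores, for example, that store data could be forwarded instead of stalled).

```python
from typing import NamedTuple, Optional, Tuple


class LatchedInstr(NamedTuple):
    """Hypothetical decoded view of the instruction held in a pipeline register."""
    opcode: str               # e.g. "LD", "DADD", "DSUB"
    dest: Optional[str]       # destination register, if any
    sources: Tuple[str, ...]  # source registers read in ID


def needs_load_interlock(id_ex: Optional[LatchedInstr], if_id: LatchedInstr) -> bool:
    """True if the instruction in ID must stall one cycle behind a load in EX."""
    if id_ex is None or id_ex.opcode != "LD":
        return False
    return id_ex.dest is not None and id_ex.dest in if_id.sources


load = LatchedInstr("LD", "R1", ("R2",))
use = LatchedInstr("DSUB", "R4", ("R1", "R5"))
independent = LatchedInstr("DADD", "R6", ("R7", "R8"))

print(needs_load_interlock(load, use))          # True
print(needs_load_interlock(load, independent))  # False
```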
Implementing Forwarding Logic

Forwarding sources: ALU or data memory output.
Forwarding destination: ALU input, data memory input,
or zero detection unit (for BRANCH).
The forwarding can be implemented by checking the following
conditions:
EX/MEM.IR.destination = ID/EX.IR.source ?
MEM/WB.IR.destination = ID/EX.IR.source ?
MEM/WB.IR.destination = EX/MEM.IR.source ?
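The first two conditions decide where each ALU input comes from. The sketch below is one way to express that selection in Python, under stated assumptions: the function name and the string labels are hypothetical, the newest producer (EX/MEM) is given priority over MEM/WB, and the case where the EX/MEM instruction is a load (which needs an interlock rather than forwarding) is not modeled.

```python
from typing import Optional


def select_alu_operand(source_reg: str,
                       ex_mem_dest: Optional[str],
                       mem_wb_dest: Optional[str]) -> str:
    """Pick where one ALU input should come from for the instruction in ID/EX.

    The newest producer wins, so EX/MEM is checked before MEM/WB.
    R0 is hard-wired to zero in MIPS, so it is never forwarded.
    """
    if source_reg == "R0":
        return "register file"
    if ex_mem_dest == source_reg:
        return "EX/MEM"
    if mem_wb_dest == source_reg:
        return "MEM/WB"
    return "register file"


# Sequence: DADD R1,R2,R3 ; DSUB R4,R1,R5 ; AND R6,R1,R7
print(select_alu_operand("R1", ex_mem_dest="R1", mem_wb_dest=None))  # EX/MEM
print(select_alu_operand("R1", ex_mem_dest="R4", mem_wb_dest="R1"))  # MEM/WB
print(select_alu_operand("R5", ex_mem_dest="R1", mem_wb_dest=None))  # register file
```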
Forwarding Data to the Two ALU Inputs
Dealing with Branches in the Pipeline
What Makes Pipelining Hard to Implement

Exceptions (interrupts, faults) make pipelining
difficult to implement.
Instruction set complications
Types of Exceptions
Types
I/O device request

Invoking an OS service from a user program
Tracing instruction execution
Breakpoint
Integer arithmetic overflow or underflow
FP arithmetic anomaly
Page fault
Misaligned memory access
Memory-protection violation
Using an undefined instruction
Hardware malfunction
Power failure
Exceptions for different architectures (fig. A.26 on page A-40).
Classification of Exceptions
Synchronous versus asynchronous
If the event occurs at the same place every time that the program
is executed with the same data and memory allocation, the event is
called synchronous.
User requested versus coerced

User maskable versus nonmaskable
Within versus between instruction
Depends on whether the event prevents instruction completion by
occurring in the middle of execution or whether it is recognized
between instructions.
Resume versus terminate (fig. 3.40 on page 182).
Action Requirements for Different Exception Types (Fig. A.27 on page A-42)
Actions
Resume
Terminate
The most difficult exceptions have two properties:

They occur within instructions (i.e. at EX or MEM stages).
They must be restartable (must save the PC of the
instruction at which to restart).
Exception Handling
Stopping and restarting execution
Force a trap instruction on the next IF
Until the trap is taken, turn off all writes for the faulting instruction and
for all instructions that follow in the pipeline.
After the exception-handling routine in the operating system receives
control, it immediately saves the PC of the faulting instruction.
IF  ID  EX  MEM WB                    <--- Faulting instruction
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM
                    IF  ID  EX        <--- Trap instruction
If delayed branch is used, we need to save and restore as many PCs as the
length of the branch delay plus one.
Precise Interrupt
A pipeline has precise interrupts if it can be stopped so that
the instructions just before the faulting instruction are completed
and those after it can be restarted from scratch.
Supporting precise interrupts is a requirement in many
systems.
Exceptions in DLX
With pipelining, multiple exceptions may occur in the
same clock cycle. (fig. A.28 on page A-44).
Implementations of Precise Exceptions

Principle
The pipeline should be able to handle the exceptions caused by
instruction i prior to the exceptions caused by instruction i+1.
Implementation
Hardware posts all exceptions caused by a given instruction in a
status vector associated with that instruction.
Once an exception indication is set in the exception status vector,
any control signal that may cause a data value to be written is
turned off.
When an instruction enters WB, the exception status vector is
checked. If any exceptions are posted, they are handled in the
order in which they would occur in time on an unpipelined
machine.
This will guarantee that all exceptions will be seen on instruction i
before any are seen on i+1.
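The status vector scheme can be illustrated with a toy model in which a later instruction faults earlier in time than its predecessor. This Python sketch is only an illustration of the ordering rule; the `InFlight` class, the `first_exception_at_wb` function, and the use of stage names as exception labels are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Stage order doubles as the order in which exceptions would occur
# on an unpipelined machine.
STAGES = ("IF", "ID", "EX", "MEM")


@dataclass
class InFlight:
    """Hypothetical record that travels down the pipeline with an instruction."""
    name: str
    status_vector: List[str] = field(default_factory=list)  # stages that faulted
    writes_enabled: bool = True

    def post(self, stage: str) -> None:
        """Record an exception and turn off any later state-changing write."""
        self.status_vector.append(stage)
        self.writes_enabled = False


def first_exception_at_wb(program_order: List[InFlight]) -> Optional[Tuple[str, str]]:
    """Check instructions in the order they reach WB (program order)."""
    for instr in program_order:
        if instr.status_vector:
            earliest = min(instr.status_vector, key=STAGES.index)
            return instr.name, earliest
    return None


i = InFlight("LD")    # instruction i
j = InFlight("DADD")  # instruction i+1
j.post("IF")          # i+1 faults in IF, earlier in time...
i.post("MEM")         # ...than i, which faults later in MEM
print(first_exception_at_wb([i, j]))  # ('LD', 'MEM')
```

Even though the fault on i+1 was posted first, the fault on i is the one handled, because the check happens in program order at WB.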
Instruction Committed
When an instruction is guaranteed to complete, it is called
committed.
In the MIPS pipeline, all instructions are committed when
they reach the end of the MEM stage and no instruction
updates the state before that stage. Thus precise exceptions
are straightforward.
Instruction Set Complications

Some machines have instructions that change the state in
the middle of the instruction execution.
VAX: Autoincrement addressing mode.
VAX or IBM 360: String copy.
Implicitly set condition code.
This causes difficulties in scheduling any pipeline delays between
setting the condition code and the branch.
ADD XXX      <--- Sets condition code C.
             <--- Cannot place instructions that change C here.
BR C, YYY    <--- Uses C for the branch.
In fact, the condition code must be treated as an operand that
requires hazard detection for RAW hazards with branches, regardless of
whether the condition code is set implicitly or explicitly.
Multicycle operations in VAX
Extending the MIPS Pipeline to Handle MultiCycle Operations

Assuming four separate functional units in our MIPS
implementation
Integer unit
Handle loads and stores, ALU operations and branches.
FP and integer multiplier
FP adder
FP and integer divider
If an instruction cannot proceed to the EX stage, the entire
pipeline behind that instruction will be stalled.
MIPS Pipeline with Multi-cycle Functional Units
Pipelining Multi-cycle Functional Units
Latency and Initiation (Repeat) Interval

Latency
The number of intervening cycles between an instruction that
produces a result and an instruction that uses the result.
Initiation (repeat) interval

The number of cycles that must elapse between issuing two
operations of a given type.
Latency and initiation interval for pipelining multi-cycle functional units

Functional unit          Latency   Initiation interval
Integer ALU              0         1
Data memory access       1         1
FP add                   3         1
FP (integer) multiply    6         1
FP (integer) divide      24        25
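The two quantities answer different questions: latency determines how long a dependent instruction waits, while the initiation interval determines how soon the same unit can accept another operation. The Python sketch below encodes the table values; the dictionary `UNITS` and the two helper functions are hypothetical names, and the stall calculation assumes a single-issue pipeline with full forwarding.

```python
# (latency, initiation interval) per functional unit, from the table above.
UNITS = {
    "integer ALU": (0, 1),
    "data memory": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (24, 25),
}


def raw_stall_cycles(producer_unit, instructions_between=0):
    """Stall cycles seen by a consumer that follows the producer after
    `instructions_between` independent instructions."""
    latency, _ = UNITS[producer_unit]
    return max(0, latency - instructions_between)


def earliest_reissue(unit, issue_cycle):
    """First cycle in which the same unit can accept another operation."""
    _, interval = UNITS[unit]
    return issue_cycle + interval


print(raw_stall_cycles("FP multiply"))    # 6
print(raw_stall_cycles("FP add", 2))      # 1
print(earliest_reissue("FP divide", 10))  # 35
```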
Hazards and Forwarding in Longer Latency Pipelines
Hazard detection and forwarding work as before, with these differences:
Structural hazards can occur because the divide unit is not fully
pipelined.
The number of register writes required in a cycle can be larger than 1
because instructions have varying running times.
WAW hazards are possible, but WAR hazards are not, because register
reads always occur in ID.
Instructions can complete in a different order than they were
issued, causing problems with exceptions.
Stalls for RAW hazards will be more frequent because of longer
latency.
Assuming all hazard detection is done in ID, three checks must be
done before issuing an instruction:
Check for structural hazards
Check for a RAW data hazard
Check for a WAW data hazard
RAW Hazards Caused by Longer Pipeline

Fig. A.33
Structural Hazards in Longer Pipeline

Fig. A.34
Maintaining Precise Exceptions (1)

Problems caused by out-of-order completion
DIV.D F0, F2, F4
ADD.D F10, F10, F8
SUB.D F12, F12, F14
Four possible approaches

Ignore the problem and settle for imprecise exceptions
Buffer the results of an operation until all the operations that were
issued earlier are completed.
History file approach: Buffer the original register values.
Future file approach: Keep the newer values of registers.
Allow the exceptions to become somewhat imprecise, but keep
enough information so that the trap-handling routines can create a
precise sequence for exceptions. This means knowing what
operations were in the pipeline and their PCs.
Maintaining Precise Exceptions (2)

Worst-case scenario:
Instruction 1: A long-running instruction that interrupts.
Instruction 2 : not completed.
...
Instruction n-1: not completed.
Instruction n: completed. <-- The latest completed instruction.
The software must simulate instructions 1 through n-1 and restart execution at instruction n+1.
Allow instruction issue to continue only if it is certain that all
the instructions before the issuing instruction will complete
without causing an exception. This sometimes means stalling the
machine to maintain precise exceptions.
Number of Stalls per FP Operation
Performance of a MIPS FP Pipeline
Overview of The MIPS R4000 Pipeline

An implementation of MIPS64
Eight pipeline stages (superpipelining)
Load Delay in MIPS R4000
Branch Delay in MIPS R4000
CPI of MIPS R4000
Concluding Remarks
We can spend a little money to buy a very powerful
computer today.
