Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Lecture 4: Basic Implementation and Pipelining
Lecture Outline
Compilers 101
Compiler Optimizations
High-level optimizations
Done on source, may be source-to-source conversions
Examples: map data for cache efficiency, remove conditions, etc.
Local Optimizations
Optimize code in small straight-line sections
Global Optimizations
Extend local optimizations across branches and do loop optimizations (loop unrolling)
Register Allocation
Assign temporary values to registers, insert spill code
Computer Science 146
David Brooks
Implementation Review
First, let's think about how different instructions get executed
[Figure: instruction execution flow]
All Instructions: Instruction Fetch -> Instruction Decode -> Register Fetch
ALU Ops: Execute -> Write Result
Memory Ops: Calculate Eff. Addr -> Memory Access -> Write Result
Control Ops: Calculate Eff. Addr -> Branch Complete
Instruction Fetch
Send the Program Counter (PC) to memory
Fetch the current instruction from memory
IR <= Mem[PC]
Optimizations
Instruction Caches, Instruction Prefetch
Performance Affected by
Code density, Instruction size variability (CISC/RISC)
Abstract Implementation
[Figure: abstract datapath connecting the PC, instruction memory, registers, ALU, and data memory]
Performance Affected by
Regularity in instruction format, instruction length
[Figure: instruction format fields Rs, Rd, Immediate]
ALUoutput <= A
Register-Immediate
ALUoutput <= A op Imm
Memory Access
Take effective address, perform Load or Store
Load
LMD <= Mem[ALUoutput]
Store
Mem[ALUoutput] <= B
Write-Back
Send results back to register file
Register-register ALU instructions
Regs[IR[15..11]] <= ALUoutput
Load Instruction
Regs[IR[20..16]] <= LMD
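The write-back rule above picks the destination register from different instruction bits depending on the instruction type. A minimal sketch of that selection, assuming a 32-bit MIPS-style word; the helper names `bits` and `writeback_dest` and the two example encodings are hypothetical, not from the slides:

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of an instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def writeback_dest(ir: int, is_load: bool) -> int:
    """Register number written in the write-back stage."""
    # Loads write the register named by IR[20..16];
    # register-register ALU instructions write IR[15..11].
    return bits(ir, 20, 16) if is_load else bits(ir, 15, 11)


# Hypothetical example words, built field by field (MIPS-style layout assumed).
ADD_R3_R1_R2 = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
LW_R10_20_R1 = (0x23 << 26) | (1 << 21) | (10 << 16) | 20

print(writeback_dest(ADD_R3_R1_R2, is_load=False))  # 3
print(writeback_dest(LW_R10_20_R1, is_load=True))   # 10
```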
Final Implementation
[Figure: single-cycle datapath: PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, ALU control, and data memory, with control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, and MemtoReg]
What is Pipelining?
Implementation where multiple instructions are
simultaneously overlapped in execution
Instruction processing has N different stages
Overlap different instructions working on different
stages
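To make the overlap concrete, here is a rough cycle-count sketch. It assumes every stage takes exactly one cycle, the cycle time is the same in both designs, and no hazards occur; the function names are illustrative only:

```python
def unpipelined_cycles(n_stages: int, n_instrs: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_stages * n_instrs


def pipelined_cycles(n_stages: int, n_instrs: int) -> int:
    """The first instruction fills the pipe; each later one finishes a cycle after."""
    if n_instrs == 0:
        return 0
    return n_stages + (n_instrs - 1)


def ideal_speedup(n_stages: int, n_instrs: int) -> float:
    """Ratio of unpipelined to pipelined cycles under the assumptions above."""
    return unpipelined_cycles(n_stages, n_instrs) / pipelined_cycles(n_stages, n_instrs)


print(pipelined_cycles(5, 2))            # 6
print(round(ideal_speedup(5, 1000), 2))  # 4.98
```

For long instruction streams the speedup approaches the number of stages N, which is the ideal case before hazards are taken into account.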
Pipelining Advantages
[Figure: instructions vs. time for unpipelined and pipelined execution, each marked with latency and 1/throughput]
Hazards
CPIReal = CPIIdeal + CPIStall
IBM POWER4 Clock Skew Measurement
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; EX stage: execute/address calculation]
Representation of Pipelines
Time (in clock cycles) across; program execution order (in instructions) down
                   CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
lw  R10, 20(R1)    IF    ID    EX    MEM   WB
Sub R11, R2, R3          IF    ID    EX    MEM   WB
[Figure: the same diagram drawn with the hardware unit used in each stage: IM, Reg, ALU, DM, Reg]
Pipeline Hazards
Hazards
Situations that prevent the next instruction from executing
in its designated clock cycle
Structural Hazards
When two different instructions want to use the same
hardware resource in the same cycle (resource conflict)
Data Hazards
When an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions
Control Hazards
Pipelining of PC-modifying instructions (branch, jump, etc.)
Structural Hazards
Two cases when this can occur
Resource used more than once in a cycle (Memory, ALU)
Resource is not fully pipelined (FP Unit)
[Figure: four overlapped instructions, each passing through IF, ID, EX, MEM, WB]
Structural Hazards
Solutions
Stall
Low Cost, Simple (+)
Increases CPI (-)
Try to use for rare events in high-performance CPUs
Duplicate Resources
Decreases CPI (+)
Increases cost (area), possibly cycle time (-)
Use for cheap resources, frequent cases
Separate I-, D-caches, Separate ALU/PC adders, Reg File Ports
Structural Hazards
Solutions
Pipeline Resources
High performance (+)
Control is simpler than duplication (+)
Tough to pipeline some things (RAMs) (-)
Use when frequency makes it worthwhile
Ex. Fully pipelined FP add/multiplies critical for scientific
Good news
Structural hazards don't occur as long as each instruction uses a resource
At most once
Always in the same pipeline stage
For one cycle
RISC ISAs are designed with this in mind, which reduces structural hazards
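The "good news" rule can be checked mechanically. In steady state every stage holds a different instruction, so any resource needed by two different stages is a potential structural hazard. A small sketch under that assumption; the stage-to-resource maps and names below are hypothetical illustrations:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def structural_conflicts(resource_of_stage: dict) -> list:
    """Return (stage_a, stage_b, resource) for each resource shared by two stages."""
    conflicts = []
    for i, early in enumerate(STAGES):
        for late in STAGES[i + 1:]:
            r_early = resource_of_stage.get(early)
            r_late = resource_of_stage.get(late)
            if r_early is not None and r_early == r_late:
                conflicts.append((early, late, r_early))
    return conflicts


# Hypothetical resource maps: one unified memory vs. separate I- and D-caches.
unified = {"IF": "memory", "ID": "regfile-read", "EX": "alu",
           "MEM": "memory", "WB": "regfile-write"}
split = {"IF": "icache", "ID": "regfile-read", "EX": "alu",
         "MEM": "dcache", "WB": "regfile-write"}

print(structural_conflicts(unified))  # [('IF', 'MEM', 'memory')]
print(structural_conflicts(split))    # []
```

The unified-memory conflict only turns into an actual stall when the instruction in MEM is a load or store, which is why the stall probability depends on the instruction mix.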
Pipeline Stalls
What could the performance impact of unified
instruction/data memory be?
Loads ~15% of instructions, Stores ~10%
Prob(Ifetch + Dfetch) = 0.25
CPIReal = CPIIdeal + CPIStall = 1.0 + 0.25 = 1.25
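The arithmetic above can be written as a small helper. This sketch assumes each instruction/data memory conflict costs exactly one stall cycle, which is what the 1.0 + 0.25 calculation implies; the function name and argument shape are illustrative:

```python
def cpi_real(cpi_ideal: float, stall_events) -> float:
    """CPIReal = CPIIdeal + CPIStall.

    stall_events is an iterable of (events per instruction, stall cycles per event).
    """
    cpi_stall = sum(freq * cycles for freq, cycles in stall_events)
    return cpi_ideal + cpi_stall


load_frac, store_frac = 0.15, 0.10  # instruction mix from the slide
# Assumption: every load or store collides with an instruction fetch for one cycle.
cpi = cpi_real(1.0, [(load_frac + store_frac, 1)])
print(round(cpi, 2))  # 1.25
```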
Data Hazards
Two operands from different instructions use the
same storage location
Must appear as if instructions are executed to
completion one at a time
Three types of Data Hazards
Read-After-Write (RAW)
True data-dependence (Most important)
Write-After-Read (WAR)
Write-After-Write (WAW)
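The three types can be told apart by comparing what each instruction reads and writes. A minimal sketch, assuming each instruction is described by hypothetical `reads` and `writes` register sets and that `first` precedes `second` in program order:

```python
def classify_hazards(first: dict, second: dict) -> set:
    """Return the potential data hazard types between two instructions."""
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # second reads what first writes (true dependence)
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # second overwrites what first still has to read
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both write the same location
    return hazards


# Hypothetical pair: add R1 <- R2 + R3 followed by sub R4 <- R1 - R5
add = {"reads": {"R2", "R3"}, "writes": {"R1"}}
sub = {"reads": {"R1", "R5"}, "writes": {"R4"}}
print(classify_hazards(add, sub))  # {'RAW'}
```

Whether a potential hazard actually causes a stall depends on the pipeline: it matters only if the overlap could change the order of the conflicting accesses.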
RAW Example
[Figure: pipeline diagram by cycle: four overlapped instructions, each passing through IF, ID, EX, MEM, WB]
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with a forwarding unit. The unit compares Rs and Rt against EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the muxes at the ALU inputs]
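The selection a forwarding unit makes for one ALU operand can be sketched as below. This follows the common textbook scheme rather than anything stated on the slides: the RegWrite flags, the treatment of register 0 as a hardwired zero, and the function name are all assumptions:

```python
def forward_select(src_reg: int,
                   ex_mem_rd: int, ex_mem_regwrite: bool,
                   mem_wb_rd: int, mem_wb_regwrite: bool) -> str:
    """Pick where one ALU operand comes from: 'EX/MEM', 'MEM/WB', or 'REG'."""
    # The most recent result wins, so check the EX/MEM register first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REG"  # no hazard: use the value read from the register file


# Operand R1 was written by the instruction now in EX/MEM.
print(forward_select(1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=1, mem_wb_regwrite=True))  # EX/MEM
```

The same function would be evaluated twice per cycle, once with Rs and once with Rt as `src_reg`.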
Forwarding, Bypassing
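[Figure: pipeline diagram by cycle: four overlapped instructions, each passing through IF, ID, EX, MEM, WB]
[Figure: lw R3, 10(R1) followed by two instructions, each passing through IF, ID, EX, MEM, WB]
[Figure: lw R3, 10(R1) followed by two instructions, with a one-cycle stall]
Cycle ->
lw R3, 10(R1)   IF     ID     EX     MEM    WB
Instr +1               IF     ID     stall  EX     MEM    WB
Instr +2                      IF     stall  ID     EX     MEM    WB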
Delayed Loads
Define the pipeline slot after a load to be a delay slot
No interlock hardware: the machine assumes the compiler has scheduled the code correctly
LW Rb, b
LW Rc, c
LW Re, e
ADD Ra, Rb, Rc
LW Rf, f
SW a, Ra
SUB Rd, Re, Rf
SW d, Rd
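Without interlocks, the compiler has to guarantee that the instruction in a load's delay slot does not read the loaded register. The check can be sketched as follows, using a hypothetical `Instr` tuple; the `scheduled` list transcribes the code above, while the `naive` ordering is an assumed unscheduled version for contrast:

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read


def load_delay_violations(program) -> list:
    """Indices of instructions that read a register loaded by the previous LW."""
    violations = []
    for i in range(len(program) - 1):
        cur, nxt = program[i], program[i + 1]
        if cur.op == "LW" and cur.dest in nxt.srcs:
            violations.append(i + 1)
    return violations


scheduled = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()), Instr("LW", "Re", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("LW", "Rf", ()),
    Instr("SW", None, ("Ra",)), Instr("SUB", "Rd", ("Re", "Rf")),
    Instr("SW", None, ("Rd",)),
]
# Assumed unscheduled order: each result is used right after its load.
naive = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("SW", None, ("Ra",)),
    Instr("LW", "Re", ()), Instr("LW", "Rf", ()),
    Instr("SUB", "Rd", ("Re", "Rf")), Instr("SW", None, ("Rd",)),
]

print(load_delay_violations(scheduled))  # []
print(load_delay_violations(naive))      # [2, 6]
```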
Control Hazards
Cycle ->
Branch Instr.   IF     ID     EX     MEM    WB
Instr +1               IF     stall  stall  IF     ID     EX     MEM
Instr +2                      stall  stall  stall  IF     ID     EX

Cycle ->
Branch Instr.   IF     ID     EX     MEM    WB
Instr +1               stall  IF     ID     EX     MEM    WB
Instr +2                      stall  IF     ID     EX     MEM    WB
Control Hazards: Branch Characteristics
Integer Benchmarks: 14-16% of instructions are conditional branches
FP: 3-12%
On Average:
3. Assume taken
4. Delay Branches
Control Hazards: Assume Not Taken
Cycle ->
Untaken Branch    IF     ID     EX     MEM    WB
Instr +1                 IF     ID     EX     MEM    WB
Instr +2                        IF     ID     EX     MEM    WB

Taken Branch      IF     ID     EX     MEM    WB
Instr +1                 IF     flush  flush  flush  flush
Branch Target                   IF     ID     EX     MEM    WB
Branch Target +1                       IF     ID     EX     MEM    WB
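With the assume-not-taken scheme, only taken branches cost cycles. A rough sketch of the effect on CPI follows. The 15% branch frequency sits inside the 14-16% range quoted for integer benchmarks, but the 60% taken fraction is a made-up illustrative value, and the one-cycle penalty assumes a single flushed instruction per taken branch:

```python
def cpi_with_branches(branch_frac: float, taken_frac: float,
                      taken_penalty: float, cpi_ideal: float = 1.0) -> float:
    """CPI when untaken branches are free and taken branches flush the pipeline."""
    return cpi_ideal + branch_frac * taken_frac * taken_penalty


# Hypothetical workload: 15% branches, 60% of them taken, 1 flushed instruction each.
cpi = cpi_with_branches(branch_frac=0.15, taken_frac=0.6, taken_penalty=1)
print(round(cpi, 2))  # 1.09
```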
Control Hazards: Branch Delay Slots
Find one instruction that will be executed no matter which way the branch goes
Now we don't care which way the branch goes!
Harder than it sounds to find such instructions
Interrupt Taxonomy
Synchronous vs. Asynchronous (HW error, I/O)
User Request (exception?) vs. Coerced
User maskable vs. Nonmaskable (Ignorable)
Within vs. Between Instructions
Resume vs. Terminate
The difficult exceptions are resumable interrupts
within instructions
Save the state, correct the cause, restore the state,
continue execution
Interrupt Taxonomy
Exception Type          Sync vs.  User Request  Maskable vs.  Within vs.    Resume vs.
                        Async     vs. Coerced   Nonmask       Between Insn  Terminate
(unlabeled)             Async     Coerced       Nonmask       Between       Resume
Invoke O/S              Sync      User          Nonmask       Between       Resume
Tracing Instructions    Sync      User          Maskable      Between       Resume
Breakpoint              Sync      User          Maskable      Between       Resume
Arithmetic Overflow     Sync      Coerced       Maskable      Within        Resume
(unlabeled)             Sync      Coerced       Nonmask       Within        Resume
Misaligned Memory       Sync      Coerced       Maskable      Within        Resume
(unlabeled)             Sync      Coerced       Nonmask       Within        Resume
(unlabeled)             Sync      Coerced       Nonmask       Within        Terminate
Hardware/Power Failure  Async     Coerced       Nonmask       Within        Terminate
[Figure: pipeline stages IF, ID, EXE, MEM, WB annotated with the exceptions Arithmetic Overflow and Misaligned Memory, followed by a pipeline diagram of two overlapped instructions]
Summary of Exceptions
SuperScalar/Dynamic Scheduling