EE/CE 6304
Prof. Yiorgos Makris
Exam #1A
October 8, 2015
(b) [Total 18 points] Assume that we have a single-cycle implementation of a processor, which
has a clock cycle of 8 ns. Suppose that we split this single-cycle processor into a 5-stage pipelined
processor where the measured times for each of the five stages are:
IF: 1.9 ns
ID: 2.7 ns
EX: 3.7 ns
MEM: 2.6 ns
WB: 2.4 ns
In addition, the pipeline register delay is 0.3 ns and, on average, the pipeline will stall 2 cycles
for every 6 instructions executed.
(i) (8 points) What is the speed-up of the pipelined processor over the single-cycle processor?
(ii) (10 points) We are considering changing this 5-stage processor to a (4+x)-stage processor,
where x ≥ 2, by splitting the EX stage into x stages (each having 1/x of the delay of the original
stage) and adding registers between the new stages accordingly. However, this will result in
more data dependencies between instructions in flight: specifically, the pipeline will now stall,
on average, 2x-1 cycles for every 6 instructions executed. If your objective is to maximize the
number of stages in the pipeline, what is the maximum value of x for which this proposition
makes sense? Justify your answer quantitatively.
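The quantities in this problem can be tabulated with a short script. The following is a minimal sketch, not an official solution: it assumes the pipelined clock period is the slowest stage plus the register delay, that the ideal CPI is 1, and that average time per instruction is clock period times effective CPI. The helper names (`time_per_instruction`, `split_ex`, `stalls_for`) are hypothetical, and the default stall model reads the stall count "2x-1" literally, which is itself an assumption.

```python
# Stage latencies (ns) and overheads as stated in the problem.
STAGES = {"IF": 1.9, "ID": 2.7, "EX": 3.7, "MEM": 2.6, "WB": 2.4}
REG_DELAY = 0.3        # pipeline register delay, ns
SINGLE_CYCLE = 8.0     # single-cycle clock period, ns


def time_per_instruction(stage_delays, stalls_per_6):
    """Average ns per instruction: clock period times effective CPI."""
    clock = max(stage_delays) + REG_DELAY   # slowest stage sets the clock
    cpi = 1 + stalls_per_6 / 6              # ideal CPI of 1 plus stall cycles
    return clock * cpi


def split_ex(x, stalls_for=lambda x: 2 * x - 1):
    """ns per instruction when EX is split into x equal stages.

    stalls_for is an ASSUMPTION: it reads the stall count "2x-1" literally.
    Pass a different function if the intended expression differs.
    """
    others = [d for name, d in STAGES.items() if name != "EX"]
    return time_per_instruction(others + [STAGES["EX"] / x] * x, stalls_for(x))


five_stage = time_per_instruction(list(STAGES.values()), stalls_per_6=2)
print(f"5-stage: {five_stage:.3f} ns/instr, "
      f"speed-up over single-cycle: {SINGLE_CYCLE / five_stage:.2f}")
for x in range(2, 7):
    print(f"x={x}: {split_ex(x):.3f} ns/instr")
```

Comparing each `split_ex(x)` value against `five_stage` (or against `SINGLE_CYCLE`, depending on which baseline "makes sense" is taken to mean) shows where further splitting stops paying off under these assumptions.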
        SUB   R3, R2, R1
        BEQZ  R3, L
        ADD   R1, R4, R5
        SUB   R3, R1, R5
        JUMP  L2            (unconditional branch)
L:      ADD   R2, R3, R5
        SUB   R3, R2, R5
L2:     SUB   R1, R2, R5
(ii) (6 points)
        SUB   R3, R2, R1
        BEQZ  R3, L
        SUB   R2, R4, R5
        ADD   R5, R1, R4
        JUMP  L2            (unconditional branch)
L:      ADD   R2, R4, R5
        SUB   R3, R2, R5
L2:     SUB   R2, R3, R5
(iii) (6 points)
        SUB   R3, R2, R1
        BEQZ  R1, L
        ADD   R1, R4, R5
        SUB   R3, R1, R5
        JUMP  L2            (unconditional branch)
L:      ADD   R2, R4, R5
        SUB   R3, R2, R5
L2:     SUB   R1, R2, R5
Consuming Instruction     Latency
FP ALU Operation          4 cycles in between
Store FP Operand          1 cycle in between
FP ALU Operation          2 cycles in between
Store FP Operand          0 cycles in between
Loop:   LD    F0, 0(R1)
        SUBD  F4, F0, F2
        LD    F6, -8(R1)
        ADDD  F8, F4, F6
        SD    0(R1), F8
        SUBI  R1, R1, 8
        BNEZ  R1, Loop
(i) (6 points) Show the unscheduled code with the stalls it requires.
(ii) (8 points) Show the scheduled code with the stalls it requires. Similar to the example we did
in class, in your scheduling you can use as many register names as you see necessary, you can
modify the loop index if it helps, and you can fill the branch delay slot as you see appropriate.
(iii) (12 points) What is the minimum number, x, of times that you need to unroll the loop so that
you can eliminate all stalls? For this minimum x, show the scheduled, loop-unrolled, stall-free,
minimal instruction-count code, assuming that the number of loop iterations is a multiple of x. In
your scheduling you can use as many register names as you see necessary, you can modify the
loop index if it helps, you can modify register names if it helps, and you can fill the branch delay
slot as you see appropriate.
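As a sanity check on any unrolled version, it can help to model what the loop computes. The sketch below is illustrative only and does not answer part (iii): it models memory as a Python list indexed by element (so the byte offset of 8 becomes an index step of 1, an assumption), uses the hypothetical names `rolled` and `unrolled_by_2`, and unrolls by 2 purely as an example factor.

```python
def rolled(a, c):
    """One element per iteration, walking from the top of the array down."""
    a = list(a)
    i = len(a) - 1                      # plays the role of R1 (an index here)
    while i != 0:                       # BNEZ R1, Loop
        a[i] = (a[i] - c) + a[i - 1]    # LD, SUBD, LD, ADDD, SD
        i -= 1                          # SUBI R1, R1, 8
    return a


def unrolled_by_2(a, c):
    """Two elements per iteration; assumes the trip count is a multiple of 2."""
    a = list(a)
    i = len(a) - 1
    assert i % 2 == 0, "trip count must be a multiple of the unroll factor"
    while i != 0:
        # The first copy reads a[i - 1] before the second copy overwrites it,
        # so the order of this load and that store must be preserved.
        a[i] = (a[i] - c) + a[i - 1]
        a[i - 1] = (a[i - 1] - c) + a[i - 2]
        i -= 2                          # a single index update per unrolled pass
    return a


data = [1.0, 4.0, 9.0, 16.0, 25.0]      # a[0] is only read, never written
print(rolled(data, 2.0) == unrolled_by_2(data, 2.0))
```

The comment in the unrolled body marks the one ordering constraint a scheduler has to respect when interleaving the two copies: each iteration loads the element that the next iteration stores.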
Instruction
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F4, F0, F2
LD      F10, 0(R2)
Instruction Status
Issue, Read Operands:    x x x x x x x
Exec. Complete, Write:   x x
(ii) (3 points)
Instruction
DIVD    F10, F0, F4
MULTD   F6, F10, F4
MULTD   F12, F0, F4
ADDD    F8, F2, F14
Instruction Status
Issue, Read Operands:    x x x x x x x
Exec. Complete, Write:   x x
(iii) (3 points)
Instruction
MULTD   F4, F0, F2
ADDD    F8, F4, F6
ADDD    F6, F0, F2
LD      F10, 0(R2)
Instruction Status
Issue, Read Operands:    x x x x x x x
Exec. Complete, Write:   x x x x
(iv) (3 points)
Instruction
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F6, F10, F4
LD      F10, 0(R2)
Instruction Status
Issue, Read Operands:    x x x x x x x x
Exec. Complete, Write:   x x x x x x
(b) [Total 12 points] A processor implementing Tomasulo's algorithm has reservation stations
associated with 2 floating-point adders (capable of doing both addition and subtraction), 1
floating-point multiplier (capable of doing only multiplication), and 1 floating-point divider
(capable of doing only division). For the purpose of this exercise, assume that the processor has
unlimited reservation stations associated with load and store buffers as well as integer units.
Assume also that execution (EX) takes 4 clock cycles for floating-point addition or subtraction
(ADDD, SUBD), 10 clock cycles for floating-point multiplication (MULTD) and 40 clock cycles
for floating-point division (DIVD). The integer unit finishes execution (EX) in 1 clock cycle.
Please comment on whether the following instruction status table representations (annotated with
the instructions in their static order) can represent legal execution states for this processor AND
explain briefly your reasoning for your answer.
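One way to reason about such tables is to check a few invariants mechanically. The sketch below is a hedged illustration, not a complete legality test: the `Row` class and `violations` function are hypothetical names, and it checks only stage ordering, in-order issue, and read-after-write dependences. It deliberately does not model functional-unit counts or the 4/10/40-cycle latencies, since the tables carry no cycle numbers. The example state at the bottom is invented for illustration and is not one of the exam tables.

```python
from dataclasses import dataclass


@dataclass
class Row:
    op: str
    dest: str
    srcs: tuple
    issued: bool = False
    started: bool = False
    written: bool = False


def violations(rows):
    """Return reasons why a status table cannot be a legal state."""
    problems = []
    for i, r in enumerate(rows):
        name = f"{r.op} {r.dest}"
        # A stage can only be marked if the previous stage is marked too.
        if (r.started and not r.issued) or (r.written and not r.started):
            problems.append(f"{name}: stage marked before its predecessor")
        # Instructions issue in program order.
        if r.issued and any(not p.issued for p in rows[:i]):
            problems.append(f"{name}: issued before an earlier instruction")
        # RAW: each source comes from its latest earlier writer, which must
        # have written its result before this instruction can start executing.
        if r.started:
            for src in r.srcs:
                producers = [p for p in rows[:i] if p.dest == src]
                if producers and not producers[-1].written:
                    problems.append(f"{name}: started before {src} was produced")
    return problems


# Hypothetical state: DIVD has started although MULTD has not yet written F4.
state = [
    Row("MULTD", "F4", ("F0", "F2"), issued=True, started=True),
    Row("DIVD", "F8", ("F4", "F6"), issued=True, started=True),
]
print(violations(state))
```

Write-after-read and write-after-write hazards are not flagged here, on the understanding that register renaming through the reservation stations removes them.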
(i) (3 points)
Instruction
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F4, F0, F2
LD      F10, 0(R2)
Instruction Status
Issue, Exec. Started:    x x x x x x x
Write:                   x x
(ii) (3 points)
Instruction
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F6, F10, F4
LD      F10, 0(R2)
Instruction Status
Issue, Exec. Started:    x x x x x x x x
Write:                   x x x
(iii) (3 points)
Instruction
DIVD    F10, F0, F4
MULTD   F6, F10, F4
MULTD   F12, F0, F4
ADDD    F8, F2, F14
Instruction Status
Issue, Exec. Started:    x x x x x x x
Write:
(iv) (3 points)
Instruction
MULTD   F4, F0, F2
MULTD   F8, F4, F6
ADDD    F6, F0, F2
LD      F10, 0(R2)
Instruction Status
Issue, Exec. Started:    x x x x x x
Write:                   x x
Problem   Points
1         /32
2         /18
3         /26
4         /24
Total     /100