
Name ________________________________________________________________________

EE/CE 6304
Prof. Yiorgos Makris

Exam #1A
October 8, 2015

CLOSED BOOKS CLOSED NOTES NO CALCULATORS


Duration: 75 minutes.
Good Luck!!!

Problem 1 (Total 32 pts): Amdahl's Law / CPI / Speedup


(a) [Total 14 points] You are the design team leader for the next generation of a microprocessor,
and your two main computer architects propose two alternative ideas for improving upon the
performance of the current design on which the new one is based. Both ideas incur the same
area/power overhead.
The first architect recommends the design of a specialized high-speed ALU that cuts execution
time of all arithmetic and logic operations to 40% of the execution time of the original ALU, and
a specialized high-speed memory that cuts execution time of all loads and stores to one third of
the execution time of the original memory.
The second architect recommends a system involving four of the old processors operating in
parallel and communicating with each other through a dedicated bus. The overall additional time
required for the communication between the processors is 5% of the original execution time.
Assuming that the profile of the workload to be executed by the new microprocessor system
involves 30% Load/Store operations and 50% Arithmetic/Logic operations, and that 80% of the
code is parallelizable, which of the two ideas would you choose to pursue?
Justify your response to the architects by using Amdahl's Law to quantitatively assess the
speedup obtained by each of the two ideas.
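The comparison asked for above can be set up as a short calculation. The sketch below is one possible reading, not an official solution: it normalizes the original execution time to 1, treats the remaining 20% of the workload in the first idea as unaffected, and adds the 5% communication time directly to the new execution time in the second idea. The helper name amdahl_speedup and its parameters are illustrative and do not come from the exam.

```python
def amdahl_speedup(enhanced, overhead=0.0):
    """Overall speedup for a list of (fraction, local_speedup) pairs.

    `overhead` is extra time, given as a fraction of the original
    execution time (assumption: it is simply added to the new time).
    """
    untouched = 1.0 - sum(fraction for fraction, _ in enhanced)
    new_time = untouched + sum(f / s for f, s in enhanced) + overhead
    return 1.0 / new_time


# Idea 1: ALU ops (50%) take 40% of their old time -> local speedup 1/0.4;
#         loads/stores (30%) take one third of their old time -> local speedup 3.
idea1 = amdahl_speedup([(0.50, 1 / 0.4), (0.30, 3.0)])

# Idea 2: 80% of the code runs on 4 processors, plus 5% communication time.
idea2 = amdahl_speedup([(0.80, 4.0)], overhead=0.05)

print(f"idea 1: {idea1:.3f}x, idea 2: {idea2:.3f}x")
```

Under these assumptions the new execution times are 0.2 + 0.2 + 0.1 = 0.5 for the first idea and 0.2 + 0.2 + 0.05 = 0.45 for the second, so the comparison reduces to 1/0.5 against 1/0.45.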

(b) [Total 18 points] Assume that we have a single-cycle implementation of a processor, which
has a clock cycle of 8 ns. Suppose that we split this single-cycle processor into a 5-stage pipelined
processor where the measured times for each of the five stages are:
IF: 1.9 ns
ID: 2.7 ns
EX: 3.7 ns
MEM: 2.6 ns
WB: 2.4 ns
In addition, the pipeline register delay is 0.3 ns and, on average, the pipeline will stall 2 cycles
for every 6 instructions executed.

(i) (8 points) What is the speedup of the pipelined processor over the single-cycle processor?
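The quantities involved can be laid out as follows. This is a minimal sketch under stated assumptions: the pipelined clock is set by the slowest stage plus the pipeline register delay, the ideal pipelined CPI is 1, the single-cycle CPI is 1, and pipeline fill time is ignored. The function and constant names are illustrative.

```python
STAGE_NS = {"IF": 1.9, "ID": 2.7, "EX": 3.7, "MEM": 2.6, "WB": 2.4}
REGISTER_NS = 0.3        # pipeline register delay
SINGLE_CYCLE_NS = 8.0    # clock cycle of the single-cycle design


def pipelined_time_per_instruction(stage_ns, register_ns, stalls, per_instructions):
    """Average time per instruction = cycle time * (1 + stall cycles per instruction)."""
    cycle = max(stage_ns.values()) + register_ns
    cpi = 1.0 + stalls / per_instructions
    return cycle * cpi


pipelined_ns = pipelined_time_per_instruction(STAGE_NS, REGISTER_NS, 2, 6)
speedup = SINGLE_CYCLE_NS / pipelined_ns
print(f"pipelined: {pipelined_ns:.3f} ns/instr, speedup: {speedup:.3f}x")
```

With these numbers the cycle time is 3.7 + 0.3 = 4.0 ns and the CPI is 1 + 2/6 = 4/3.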

(ii) (10 points) We are considering changing this 5-stage processor to a (4+x)-stage processor,
where x ≥ 2, by splitting the EX stage into x stages (each having 1/x of the delay of the original
stage) and adding registers between the new stages accordingly. However, this will result in
more data dependencies between instructions in flight: specifically, the pipeline will now stall,
on average, 2x-1 cycles for every 6 instructions executed. If your objective is to maximize the
number of stages in the pipeline, what is the maximum value of x for which this proposition
makes sense? Justify your answer quantitatively.
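One way to explore this trade-off numerically is sketched below. It rests on an interpretation that the exam does not spell out: the proposal "makes sense" as long as the average time per instruction stays below that of the existing 5-stage pipeline. The names are illustrative, and the search bound of 20 is arbitrary.

```python
OTHER_STAGES_NS = [1.9, 2.7, 2.6, 2.4]   # IF, ID, MEM, WB
EX_NS = 3.7
REGISTER_NS = 0.3

# Existing 5-stage pipeline: 2 stall cycles per 6 instructions.
FIVE_STAGE_NS = (EX_NS + REGISTER_NS) * (1 + 2 / 6)


def time_per_instruction_ns(x):
    """Average time per instruction when EX is split into x equal stages."""
    cycle = max(OTHER_STAGES_NS + [EX_NS / x]) + REGISTER_NS
    cpi = 1.0 + (2 * x - 1) / 6   # 2x-1 stall cycles per 6 instructions
    return cycle * cpi


for x in range(2, 7):
    t = time_per_instruction_ns(x)
    print(f"x={x}: {t:.3f} ns/instr, faster than 5-stage: {t < FIVE_STAGE_NS}")

# Largest x that still beats the 5-stage design (assumed criterion).
best_x = max(x for x in range(2, 20) if time_per_instruction_ns(x) < FIVE_STAGE_NS)
```

For every x ≥ 2 the EX slices are shorter than the 2.7 ns ID stage, so the cycle time stays at 3.0 ns while the stall count keeps growing with x. A different baseline, such as the 8 ns single-cycle design, would give a different cutoff.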

Problem 2 (Total 18 pts): Branch Delay Slot Scheduling


You are asked to find the best scheduling (i.e., the one maximizing performance) of the branch delay
slot for each of the following code sequences. Please identify only the instruction that you would use for the
branch delay slot. Assume no cancelling (or nullifying) branches or instructions exist. Also
assume 80% probability of a branch being taken.
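The three usual sources for a delay-slot instruction differ in how often the slot does useful work. The sketch below ranks them for the stated 80% taken probability. It is only a first-order guide: an instruction is a legal candidate only if moving it leaves the results on both paths unchanged, which has to be checked against each code sequence by hand. The strategy names are illustrative labels.

```python
def useful_fraction(strategy, p_taken):
    """Fraction of branch executions in which the delay slot does useful work."""
    if strategy == "from_before":         # independent instruction from above the branch
        return 1.0
    if strategy == "from_target":         # instruction taken from the branch target
        return p_taken
    if strategy == "from_fall_through":   # instruction taken from the not-taken path
        return 1.0 - p_taken
    raise ValueError(f"unknown strategy: {strategy}")


P_TAKEN = 0.8
ranking = sorted(
    ["from_before", "from_target", "from_fall_through"],
    key=lambda s: useful_fraction(s, P_TAKEN),
    reverse=True,
)
print(ranking)
```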
(i) (6 points)

        SUB   R3, R2, R1
        BEQZ  R3, L          (branch if equal to zero)
        ADD   R1, R4, R5
        SUB   R3, R1, R5
        JUMP  L2             (unconditional branch)
L:      ADD   R2, R3, R5
        SUB   R3, R2, R5
L2:     SUB   R1, R2, R5

(ii) (6 points)

        SUB   R3, R2, R1
        BEQZ  R3, L          (branch if equal to zero)
        SUB   R2, R4, R5
        ADD   R5, R1, R4
        JUMP  L2             (unconditional branch)
L:      ADD   R2, R4, R5
        SUB   R3, R2, R5
L2:     SUB   R2, R3, R5

(iii) (6 points)

        SUB   R3, R2, R1
        BEQZ  R1, L          (branch if equal to zero)
        ADD   R1, R4, R5
        SUB   R3, R1, R5
        JUMP  L2             (unconditional branch)
L:      ADD   R2, R4, R5
        SUB   R3, R2, R5
L2:     SUB   R1, R2, R5

Problem 3 (Total 26 pts): Scheduling and Loop Unrolling


You are given the following latencies for a processor:

Producing Instruction    Consuming Instruction    Latency
FP ALU Operation         FP ALU Operation         4 cycles in between
FP ALU Operation         Store FP Operand         1 cycle in between
Load FP Operand          FP ALU Operation         2 cycles in between
Load FP Operand          Store FP Operand         0 cycles in between

For the following code:


Loop:   LD    F0, 0(R1)       | load F0 with element of array
        SUBD  F4, F0, F2      | subtract from it the scalar residing in F2
        LD    F6, -8(R1)      | load F6 with previous element of array
        ADDD  F8, F4, F6      | add to it the result of the previous subtraction
        SD    0(R1), F8       | store result
        SUBI  R1, R1, 8       | decrement loop index by 8
        BNEZ  R1, Loop        | branch to Loop if R1 not equal to zero

(i) (6 points) Show the unscheduled code with the stalls it requires.
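Stall counts of this kind can be cross-checked mechanically. The sketch below models an in-order, single-issue pipeline in which a consumer must wait until the listed number of cycles separates it from its producer. It uses only the four latencies given in the table: any integer-to-branch latency and the branch delay slot are not in the table and are treated as zero here, which is an assumption. The data layout and function name are illustrative.

```python
# (producer kind, consumer kind) -> cycles required in between, from the table.
LATENCY = {
    ("fp_alu", "fp_alu"): 4,
    ("fp_alu", "store"): 1,
    ("load", "fp_alu"): 2,
    ("load", "store"): 0,
}

# (mnemonic, kind, destination register, source registers)
LOOP_BODY = [
    ("LD",   "load",   "F0", ["R1"]),
    ("SUBD", "fp_alu", "F4", ["F0", "F2"]),
    ("LD",   "load",   "F6", ["R1"]),
    ("ADDD", "fp_alu", "F8", ["F4", "F6"]),
    ("SD",   "store",  None, ["F8", "R1"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "branch", None, ["R1"]),
]


def count_stalls(program):
    """Return the number of stall cycles inserted before each instruction."""
    producers = {}      # register -> (issue cycle of producer, producer kind)
    stalls = []
    previous = -1       # issue cycle of the previous instruction
    for _mnemonic, kind, dest, sources in program:
        earliest = previous + 1
        for reg in sources:
            if reg in producers:
                issued_at, producer_kind = producers[reg]
                # Pairs missing from the table are assumed to need no gap.
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, issued_at + gap + 1)
        stalls.append(earliest - (previous + 1))
        previous = earliest
        if dest is not None:
            producers[dest] = (earliest, kind)
    return stalls


stalls = count_stalls(LOOP_BODY)
print(stalls, "total:", sum(stalls))
```

Reordering the entries of LOOP_BODY and calling count_stalls again gives a quick check on a hand-made schedule, within the limits of the assumptions above.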

(ii) (8 points) Show the scheduled code with the stalls it requires. Similar to the example we did
in class, in your scheduling you can use as many register names as you see necessary, you can
modify the loop index if it helps, and you can fill the branch delay slot as you see appropriate.

(iii) (12 points) What is the minimum number, x, of times that you need to unroll the loop so that
you can eliminate all stalls? For this minimum x, show the scheduled, loop-unrolled, stall-free,
minimal instruction-count code, assuming that the number of loop iterations is a multiple of x. In
your scheduling you can use as many register names as you see necessary, you can modify the
loop index if it helps, you can modify register names if it helps, and you can fill the branch delay
slot as you see appropriate.

Problem 4 (Total 24 pts): Scoreboard vs. Tomasulo


(a) [Total 12 points] A scoreboard-based processor has 2 floating-point adders (capable of doing
both addition and subtraction), 1 floating-point multiplier (capable of doing only multiplication),
1 floating-point divider (capable of doing only division) and 1 integer unit. Assume that
execution (EX) takes 4 clock cycles for floating-point addition or subtraction (ADDD, SUBD),
10 clock cycles for floating-point multiplication (MULTD) and 40 clock cycles for floating-point
division (DIVD). The integer unit finishes execution (EX) in 1 clock cycle.
Please comment on whether the following instruction status tables (annotated with the
instructions in their static order) can represent legal execution states for this processor AND
explain briefly your reasoning for your answer.
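Judging such tables starts with the data hazards among the listed instructions. The sketch below enumerates RAW, WAW and WAR pairs for a sequence of MULTD, DIVD, ADDD and LD instructions of the kind used in this problem. It is a simple pairwise check that ignores intervening redefinitions of a register, so it can over-report on longer sequences. The tuple layout and function name are illustrative.

```python
def find_hazards(program):
    """program: list of (text, destination, sources).

    Returns (kind, earlier index, later index, register) tuples.
    Pairwise check only: intervening writes to the same register are ignored.
    """
    hazards = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))   # later reads earlier result
            if dest_j == dest_i:
                hazards.append(("WAW", i, j, dest_j))   # both write the same register
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))   # later overwrites an earlier source
    return hazards


SEQUENCE = [
    ("MULTD F4, F0, F2", "F4", ["F0", "F2"]),
    ("DIVD F8, F4, F6",  "F8", ["F4", "F6"]),
    ("ADDD F4, F0, F2",  "F4", ["F0", "F2"]),
    ("LD F10, 0(R2)",    "F10", ["R2"]),
]

for kind, i, j, reg in find_hazards(SEQUENCE):
    print(f"{kind} on {reg}: '{SEQUENCE[i][0]}' -> '{SEQUENCE[j][0]}'")
```

In the classic scoreboard, a RAW hazard delays Read Operands, a WAW hazard or a busy functional unit delays Issue, and a WAR hazard delays Write. Tomasulo's register renaming removes the WAW and WAR cases.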
(i) (3 points)

Instruction
MULTD  F4, F0, F2
DIVD   F8, F4, F6
ADDD   F4, F0, F2
LD     F10, 0(R2)

Instruction Status
Issue
Read Operands
x
x
x
x
x
x
x

Exec. Complete

Write

x
x

Exec. Complete

Write

x
x

(ii) (3 points)

Instruction
DIVD   F10, F0, F4
MULTD  F6, F10, F4
MULTD  F12, F0, F4
ADDD   F8, F2, F14

Instruction Status
Issue
Read Operands
x
x
x
x
x
x
x

(iii) (3 points)

Instruction
MULTD  F4, F0, F2
ADDD   F8, F4, F6
ADDD   F6, F0, F2
LD     F10, 0(R2)

Instruction Status
Issue
Read Operands
x
x
x
x
x
x
x

Exec. Complete

Write

x
x

x
x

Exec. Complete
x

Write
x

x
x

x
x

(iv) (3 points)

Instruction
MULTD  F4, F0, F2
DIVD   F8, F4, F6
ADDD   F6, F10, F4
LD     F10, 0(R2)

Instruction Status
Issue
Read Operands
x
x
x
x
x
x
x
x

(b) [Total 12 points] A processor implementing Tomasulo's algorithm has reservation stations
associated with 2 floating-point adders (capable of doing both addition and subtraction), 1
floating-point multiplier (capable of doing only multiplication), and 1 floating-point divider
(capable of doing only division). For the purpose of this exercise, assume that the processor has
unlimited reservation stations associated with load and store buffers as well as integer units.
Assume also that execution (EX) takes 4 clock cycles for floating-point addition or subtraction
(ADDD, SUBD), 10 clock cycles for floating-point multiplication (MULTD) and 40 clock cycles
for floating-point division (DIVD). The integer unit finishes execution (EX) in 1 clock cycle.
Please comment on whether the following instruction status table representations (annotated with
the instructions in their static order) can represent legal execution states for this processor AND
explain briefly your reasoning for your answer.

(i) (3 points)

Instruction
MULTD  F4, F0, F2
DIVD   F8, F4, F6
ADDD   F4, F0, F2
LD     F10, 0(R2)

Instruction Status
Issue
Exec. Started
x
x
x
x
x
x
x

Write

x
x

(ii) (3 points)

Instruction
MULTD  F4, F0, F2
DIVD   F8, F4, F6
ADDD   F6, F10, F4
LD     F10, 0(R2)

Instruction Status
Issue
Exec. Started
x
x
x
x
x
x
x
x


Write
x
x
x

(iii) (3 points)

Instruction
DIVD   F10, F0, F4
MULTD  F6, F10, F4
MULTD  F12, F0, F4
ADDD   F8, F2, F14

Instruction Status
Issue
Exec. Started
x
x
x
x
x
x
x

Write

(iv) (3 points)

Instruction
MULTD  F4, F0, F2
MULTD  F8, F4, F6
ADDD   F6, F0, F2
LD     F10, 0(R2)

Instruction Status
Issue
Exec. Started
x
x
x
x

x
x


Write

x
x

Problem    Points
1          /32
2          /18
3          /26
4          /24
Total      /100

