EE6304 Test1 2015 Fall

Name ________________________________________________________________________
EE/CE 6304
Prof. Yiorgos Makris
Exam #1A
October 8, 2015
CLOSED BOOKS CLOSED NOTES NO CALCULATORS

Duration: 75 minutes.
Good Luck!!!
Problem 1 (Total 32 pts): Amdahl's Law / CPI / Speedup

(a) [Total 14 points] You are the design team leader for the next generation of a microprocessor,
and your two main computer architects propose two alternative ideas for improving upon the
performance of the current design on which the new one is based, both of which incur the same
area/power overhead.
The first architect recommends the design of a specialized high-speed ALU that cuts execution
time of all arithmetic and logic operations to 40% of the execution time of the original ALU, and
a specialized high-speed memory that cuts execution time of all loads and stores to one third of
the execution time of the original memory.
The second architect recommends a system involving four of the old processors operating in
parallel and communicating with each other through a dedicated bus. The overall additional time
required for the communication between the processors is 5% of the original execution time.
Assuming that the profile of the workload to be executed by the new microprocessor system
involves 30% Load/Store operations and 50% Arithmetic/Logic operations, and that 80% of the
code is parallelizable, which of the two ideas would you choose to pursue?
Justify your response to the architects by using Amdahl's Law to quantitatively assess the
speedup obtained by each of the two ideas.
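One way to set up the comparison in part (a) is sketched below in Python. It applies Amdahl's Law in its generalized form (new time = sum of fraction / speedup factor) to both proposals. The function name `amdahl_speedup` is made up for illustration, and the sketch assumes that the 5% communication cost is simply added to the normalized execution time of the parallel design, which the problem statement does not spell out.

```python
from fractions import Fraction

def amdahl_speedup(parts, overhead=Fraction(0)):
    """parts: list of (fraction of original time, speedup factor) pairs.

    The fractions must sum to 1. `overhead` is extra time, expressed as a
    fraction of the original execution time (an assumption of this sketch).
    """
    assert sum(f for f, _ in parts) == 1
    new_time = sum(f / s for f, s in parts) + overhead
    return 1 / new_time

# Idea 1: ALU ops take 40% of their old time (factor 5/2),
# loads/stores take one third of their old time (factor 3), the rest is unchanged.
idea1 = amdahl_speedup([
    (Fraction(50, 100), Fraction(5, 2)),
    (Fraction(30, 100), Fraction(3)),
    (Fraction(20, 100), Fraction(1)),
])

# Idea 2: 80% of the code is spread over 4 processors, plus 5% communication time.
idea2 = amdahl_speedup(
    [(Fraction(80, 100), Fraction(4)), (Fraction(20, 100), Fraction(1))],
    overhead=Fraction(5, 100),
)

print(idea1, idea2)
```

Under these assumptions the first idea yields a speedup of 2 and the second yields 20/9 (about 2.22). A different reading of the communication overhead would change the second figure.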
(b) [Total 18 points] Assume that we have a single-cycle implementation of a processor, which
has a clock cycle of 8 ns. Suppose that we split this single-cycle processor into a 5-stage pipelined
processor where the measured times for each of the five stages are:
IF: 1.9 ns
ID: 2.7 ns
EX: 3.7 ns
MEM: 2.6 ns
WB: 2.4 ns
In addition, the pipeline register delay is 0.3 ns and, on average, the pipeline will stall 2 cycles
for every 6 instructions executed.
(i) (8 points) What is the speed-up of the pipelined processor over the single cycle processor?
(ii) (10 points) We are considering changing this 5-stage processor to a (4+x)-stage processor,
where x ≥ 2, by splitting the EX stage into x stages (each having 1/x of the delay of the original
stage) and adding registers in between the new stages accordingly. However, this will result in
more data dependencies between instructions in flight: specifically, the pipeline will now stall,
on average, 2x-1 cycles for every 6 instructions executed. If your objective is to maximize the
number of stages in the pipeline, what is the maximum value of x for which this proposition
makes sense? Justify your answer quantitatively.
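The arithmetic behind part (b) can be sketched as follows. The sketch assumes that the pipelined clock period is the slowest stage plus the pipeline register delay, and that the average time per instruction is the clock period times (1 + stall cycles per instruction). The helper names `time_per_instruction` and `split_ex` are made up for illustration. The problem does not say whether "makes sense" is judged against the 5-stage pipeline or the single-cycle design, so the loop prints both comparisons.

```python
from fractions import Fraction

STAGES = {"IF": Fraction("1.9"), "ID": Fraction("2.7"), "EX": Fraction("3.7"),
          "MEM": Fraction("2.6"), "WB": Fraction("2.4")}
REGISTER_DELAY = Fraction("0.3")
SINGLE_CYCLE_NS = Fraction(8)

def time_per_instruction(stage_delays, stall_cycles, per_instructions=6):
    """Average ns per instruction: clock period times CPI (1 plus stalls per instruction)."""
    clock = max(stage_delays) + REGISTER_DELAY
    cpi = 1 + Fraction(stall_cycles, per_instructions)
    return clock * cpi

def split_ex(x):
    """Stage delays after EX is cut into x equal stages."""
    others = [d for name, d in STAGES.items() if name != "EX"]
    return others + [STAGES["EX"] / x] * x

# Part (i): 5-stage pipeline with 2 stall cycles per 6 instructions.
five_stage = time_per_instruction(STAGES.values(), stall_cycles=2)
speedup_i = SINGLE_CYCLE_NS / five_stage

# Part (ii): (4+x)-stage pipeline with 2x-1 stall cycles per 6 instructions.
for x in range(2, 7):
    t = time_per_instruction(split_ex(x), stall_cycles=2 * x - 1)
    print(x, float(t), t <= five_stage, t <= SINGLE_CYCLE_NS)
```

With these assumptions `speedup_i` is 3/2. For every x ≥ 2 the split EX stage is shorter than ID, so the clock period stays at 3.0 ns and only the stall term grows with x.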
Problem 2 (Total 18 pts): Branch Delay Slot Scheduling

You are asked to do the best scheduling (i.e. maximizing performance) for the branch delay slot
for each of the following codes. Please only identify the instruction that you would use for the
branch delay slot. Assume no cancelling (or nullifying) branches or instructions exist. Also
assume 80% probability of a branch being taken.
[Each listing below carries the labels L: and L2: on two of its instructions; the label positions are not preserved in this copy.]
(i) (6 points)
        SUB     R3, R2, R1
        BEQZ    R3, L           (branch if equal to zero)
        ADD     R1, R4, R5
        SUB     R3, R1, R5
        JUMP    L2              (unconditional branch)
        ADD     R2, R3, R5
        SUB     R3, R2, R5
        SUB     R1, R2, R5
(ii) (6 points)
        SUB     R3, R2, R1
        BEQZ    R3, L           (branch if equal to zero)
        SUB     R2, R4, R5
        ADD     R5, R1, R4
        JUMP    L2              (unconditional branch)
        ADD     R2, R4, R5
        SUB     R3, R2, R5
        SUB     R2, R3, R5
(iii) (6 points)
        SUB     R3, R2, R1
        BEQZ    R1, L           (branch if equal to zero)
        ADD     R1, R4, R5
        SUB     R3, R1, R5
        JUMP    L2              (unconditional branch)
        ADD     R2, R4, R5
        SUB     R3, R2, R5
        SUB     R1, R2, R5
Problem 3 (Total 26 pts): Scheduling and Loop Unrolling

You are given the following latencies for a processor:
Producing Instruction    Consuming Instruction    Latency
FP ALU Operation         FP ALU Operation         4 cycles in between
FP ALU Operation         Store FP Operand         1 cycle in between
Load FP Operand          FP ALU Operation         2 cycles in between
Load FP Operand          Store FP Operand         0 cycles in between
For the following code:

Loop:   LD      F0, 0 (R1)      | load F0 with element of array
        SUBD    F4, F0, F2      | subtract from it the scalar residing in F2
        LD      F6, -8 (R1)     | load F6 with previous element of array
        ADDD    F8, F4, F6      | add to it the result of the previous subtraction
        SD      0 (R1), F8      | store result
        SUBI    R1, R1, 8       | decrement loop index by 8
        BNEZ    R1, Loop        | branch to Loop if R1 not equal to zero
(i) (6 points) Show the unscheduled code with the stalls it requires.
(ii) (8 points) Show the scheduled code with the stalls it requires. Similar to the example we did
in class, in your scheduling you can use as many register names as you see necessary, you can
modify the loop index if it helps, and you can fill the branch delay slot as you see appropriate.
(iii) (12 points) What is the minimum number, x, of times that you need to unroll the loop so that
you can eliminate all stalls? For this minimum x, show the scheduled, loop-unrolled, stall-free,
minimal instruction-count code, assuming that the number of loop iterations is a multiple of x. In
your scheduling you can use as many register names as you see necessary, you can modify the
loop index if it helps, you can modify register names if it helps, and you can fill the branch delay
slot as you see appropriate.
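As a starting point for part (i), the sketch below counts the stalls that the latency table alone forces on the unscheduled loop body, issuing one instruction per cycle in program order. It is a simplified model, not the reference solution: the names `LATENCY`, `LOOP` and `stalls_before_each` are made up for illustration, and the sketch assumes zero extra latency for any producer/consumer pair missing from the table (including SUBI feeding BNEZ) and ignores the branch delay slot.

```python
# Latency table from the problem: (producer kind, consumer kind) -> cycles in between.
LATENCY = {
    ("fpalu", "fpalu"): 4,
    ("fpalu", "store"): 1,
    ("load", "fpalu"): 2,
    ("load", "store"): 0,
}

# (mnemonic, kind, destination register or None, source registers)
LOOP = [
    ("LD",   "load",   "F0", ["R1"]),
    ("SUBD", "fpalu",  "F4", ["F0", "F2"]),
    ("LD",   "load",   "F6", ["R1"]),
    ("ADDD", "fpalu",  "F8", ["F4", "F6"]),
    ("SD",   "store",  None, ["F8", "R1"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "branch", None, ["R1"]),
]

def stalls_before_each(code):
    """Return (mnemonic, stall cycles inserted before it) for in-order issue."""
    producers = {}  # register -> (issue cycle, kind) of its most recent writer
    cycle = 0
    result = []
    for mnemonic, kind, dest, sources in code:
        earliest = cycle + 1
        for reg in sources:
            if reg in producers:
                prod_cycle, prod_kind = producers[reg]
                # Pairs missing from the table are assumed to need no gap.
                gap = LATENCY.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + gap + 1)
        result.append((mnemonic, earliest - (cycle + 1)))
        cycle = earliest
        if dest is not None:
            producers[dest] = (cycle, kind)
    return result

stalls = stalls_before_each(LOOP)
print(stalls, sum(n for _, n in stalls))
```

Under these assumptions the model inserts 2 stalls before SUBD, 3 before ADDD and 1 before SD. Any integer or branch latencies used in class would add to that count.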
Problem 4 (Total 24 pts): Scoreboard vs. Tomasulo

(a) [Total 12 points] A scoreboard-based processor has 2 floating-point adders (capable of doing
both addition and subtraction), 1 floating-point multiplier (capable of doing only multiplication),
1 floating-point divider (capable of doing only division) and 1 integer unit. Assume that
execution (EX) takes 4 clock cycles for floating-point addition or subtraction (ADDD, SUBD),
10 clock cycles for floating-point multiplication (MULTD) and 40 clock cycles for floating-point
division (DIVD). The integer unit finishes execution (EX) in 1 clock cycle.
Please comment on whether the following instruction status tables (annotated with the
instructions in their static order) can represent legal execution states for this processor AND
explain briefly your reasoning for your answer.
(i) (3 points)
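Instruction              Instruction Status
                         Issue   Read Operands   Exec. Complete   Write
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F4, F0, F2
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 7 after Issue / Read Operands; 4 after Exec. Complete / Write.]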
(ii) (3 points)
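Instruction              Instruction Status
                         Issue   Read Operands
DIVD    F10, F0, F4
MULTD   F6, F10, F4
MULTD   F12, F0, F4
ADDD    F8, F2, F14
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 7 after Issue / Read Operands.]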
(iii) (3 points)
Instruction              Instruction Status
                         Issue   Read Operands   Exec. Complete   Write
MULTD   F4, F0, F2
ADDD    F8, F4, F6
ADDD    F6, F0, F2
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 7 after Issue / Read Operands; 10 after Exec. Complete / Write.]
(iv) (3 points)
Instruction              Instruction Status
                         Issue   Read Operands
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F6, F10, F4
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 8 after Issue / Read Operands.]
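When judging whether a status table is legal, it helps to list the data hazards in the instruction sequence first. A scoreboard stalls issue on a WAW hazard and holds back the write on a WAR hazard, whereas Tomasulo's algorithm removes WAR and WAW hazards through register renaming, so only RAW hazards delay execution there. The sketch below enumerates the hazards for sequence (i) of part (a). The function name `find_hazards` is made up for illustration, and the check is deliberately naive: it compares every pair of instructions and ignores functional-unit (structural) conflicts.

```python
def find_hazards(code):
    """code: list of (mnemonic, destination, sources) in static order.

    Returns (kind, earlier index, later index, register) tuples.
    """
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(code):
        for i in range(j):
            _, dest_i, srcs_i = code[i]
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", i, j, dest_i))   # later reads what earlier writes
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))   # later overwrites what earlier reads
            if dest_i is not None and dest_i == dest_j:
                found.append(("WAW", i, j, dest_j))   # both write the same register
    return found

sequence_i = [
    ("MULTD", "F4", ["F0", "F2"]),
    ("DIVD", "F8", ["F4", "F6"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("LD", "F10", ["R2"]),
]
for hazard in find_hazards(sequence_i):
    print(hazard)
```

For this sequence the sketch reports a RAW hazard on F4 between MULTD and DIVD, a WAW hazard on F4 between MULTD and ADDD, and a WAR hazard on F4 between DIVD and ADDD.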
(b) [Total 12 points] A processor implementing Tomasulo's algorithm has reservation stations
associated with 2 floating-point adders (capable of doing both addition and subtraction), 1
floating-point multiplier (capable of doing only multiplication), and 1 floating-point divider
(capable of doing only division). For the purpose of this exercise, assume that the processor has
unlimited reservation stations associated with load and store buffers as well as integer units.
Assume also that execution (EX) takes 4 clock cycles for floating-point addition or subtraction
(ADDD, SUBD), 10 clock cycles for floating-point multiplication (MULTD) and 40 clock cycles
for floating-point division (DIVD). The integer unit finishes execution (EX) in 1 clock cycle.
Please comment on whether the following instruction status table representations (annotated with
the instructions in their static order) can represent legal execution states for this processor AND
explain briefly your reasoning for your answer.
(i) (3 points)
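Instruction              Instruction Status
                         Issue   Exec. Started   Write
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F4, F0, F2
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 7 after Issue / Exec. Started; 2 after Write.]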
(ii) (3 points)
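Instruction              Instruction Status
                         Issue   Exec. Started   Write
MULTD   F4, F0, F2
DIVD    F8, F4, F6
ADDD    F6, F10, F4
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 8 after Issue / Exec. Started; 3 after Write.]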
(iii) (3 points)
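Instruction              Instruction Status
                         Issue   Exec. Started   Write
DIVD    F10, F0, F4
MULTD   F6, F10, F4
MULTD   F12, F0, F4
ADDD    F8, F2, F14
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 7 after Issue / Exec. Started; none after Write.]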
(iv) (3 points)
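Instruction              Instruction Status
                         Issue   Exec. Started   Write
MULTD   F4, F0, F2
MULTD   F8, F4, F6
ADDD    F6, F0, F2
LD      F10, 0 (R2)
[Cell positions of the x marks are not preserved in this copy. Marks as extracted: 6 after Issue / Exec. Started; 2 after Write.]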
Problem    Points
1          /32
2          /18
3          /26
4          /24
Total      /100