College of Engineering
Computer Science Division EECS
Midterm I
March 12, 2003
CS152 Computer Architecture and Engineering
Your Name:
SID Number:
Discussion Section:
Problem   Possible
1         25
2         20
3         25
4         30
Total     100
[ This page left for π ]
3.141592653589793238462643383279502884197169399375105820974944
Problem 1: Short Answer
Problem 1a [3 pts]:
What is Amdahl’s law? Give a formula and define the terms. How is this useful?
Amdahl’s law: $\text{Speedup} = \dfrac{1}{(1 - F) + \dfrac{F}{n}}$
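F is the fraction of the original execution time that is sped up.
n is the speedup ratio of that fraction.
Amdahl’s law tells you that you can NOT speed up the whole system by much by speeding up only one part: even as n grows without bound, the overall speedup is limited to 1/(1 − F). This is what makes it useful: it shows where design effort pays off.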
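A minimal Python sketch of the formula above; the function names are illustrative, not from the course material. It evaluates the overall speedup for a fraction F made n times faster, and the limiting value 1/(1 − F) as n grows.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Overall speedup when a fraction f of execution time is made n times faster."""
    if not 0.0 <= f <= 1.0 or n <= 0:
        raise ValueError("need 0 <= f <= 1 and n > 0")
    return 1.0 / ((1.0 - f) + f / n)


def amdahl_limit(f: float) -> float:
    """Best possible overall speedup as n grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Making half of the run time 10x faster gives less than 2x overall.
    print(round(amdahl_speedup(0.5, 10), 3))  # 1.818
    print(amdahl_limit(0.5))                   # 2.0
```

Even an infinitely fast improvement to half the program cannot do better than 2x.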
Problem 1b [3 pts]:
What are setup and hold-time and how can they be violated in a synchronous circuit? Can you
still use a chip that is experiencing hold-time violations? How about setup violations?
Tsetup is the time for which the input signal must be stable before the clock edge.
Thold is the time for which the input signal must remain unchanged after the clock edge.
If the clock cycle time is too short, you can get a setup-time violation. To avoid it, the cycle time must satisfy
$T_{clk} > T_{clk\text{-}Q} + T_{max} + T_{setup} + T_{skew} + T_{j}$
where T_max is the longest combinational delay between registers, T_skew is the clock skew, and T_j is the clock jitter.
If the shortest combinational delay T_min is too short, you can get a hold-time violation. To avoid it, the following must hold:
$T_{clk\text{-}Q} + T_{min} > T_{skew} + T_{hold}$
If setup time is violated, you can slow down the clock to make the chip work.
If hold time is violated, you can NOT make the chip work, because the hold condition does not depend on the clock period.
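A small Python sketch that checks the two inequalities above for one register-to-register path. The class and function names are hypothetical, the numbers in the usage are made up, and Tj is read as clock jitter. It shows why a slower clock fixes setup but not hold.

```python
from dataclasses import dataclass


@dataclass
class FlopTiming:
    """Delays in ns for one register-to-register path (illustrative values only)."""
    t_clk_q: float   # clock-to-Q delay of the launching register
    t_max: float     # longest combinational delay between the registers
    t_min: float     # shortest combinational delay between the registers
    t_setup: float   # setup time of the capturing register
    t_hold: float    # hold time of the capturing register
    t_skew: float    # clock skew
    t_jitter: float  # clock jitter (read here as the Tj term)


def min_clock_period(p: FlopTiming) -> float:
    # Setup constraint: Tclk > Tclk_Q + Tmax + Tsetup + Tskew + Tj
    return p.t_clk_q + p.t_max + p.t_setup + p.t_skew + p.t_jitter


def setup_ok(p: FlopTiming, t_clk: float) -> bool:
    return t_clk > min_clock_period(p)


def hold_ok(p: FlopTiming) -> bool:
    # Hold constraint: Tclk_Q + Tmin > Tskew + Thold. Tclk does not appear,
    # so no choice of clock period can repair a hold violation.
    return p.t_clk_q + p.t_min > p.t_skew + p.t_hold


if __name__ == "__main__":
    p = FlopTiming(t_clk_q=0.3, t_max=4.0, t_min=0.1, t_setup=0.2,
                   t_hold=0.5, t_skew=0.3, t_jitter=0.1)
    print(setup_ok(p, 4.5), setup_ok(p, 5.0))  # False True: a slower clock fixes setup
    print(hold_ok(p))                          # False, whatever the clock period
```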
Problem 1c [2 pts]:
Is the multi-cycle data path always faster than the single-cycle data path? Explain.
NO. For example, a MIPS lw takes five cycles in the multi-cycle datapath. The cycle time is set by the slowest step and every cycle adds register overhead (Tclk_Q), so lw can be slower than in the single-cycle datapath.
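A toy calculation illustrating the lw argument above. The step delays and overhead are made-up numbers, not values from the course. The multi-cycle clock must fit the slowest step, and each of the five cycles pays the register overhead again.

```python
# Hypothetical delays (ns) of the five steps a lw needs: fetch, register read,
# ALU, memory, write-back. OVERHEAD stands for per-cycle register overhead
# (Tclk_Q plus setup). All numbers are made up for illustration.
LW_STEPS = [2.0, 1.0, 2.0, 2.0, 1.0]
OVERHEAD = 0.5


def single_cycle_time(steps, overhead):
    """One long cycle holds every step of the slowest instruction, with one overhead."""
    return sum(steps) + overhead


def multi_cycle_time(steps, overhead):
    """The cycle must fit the slowest step; lw then spends one cycle per step."""
    t_clk = max(steps) + overhead
    return len(steps) * t_clk


if __name__ == "__main__":
    print(single_cycle_time(LW_STEPS, OVERHEAD))  # 8.5
    print(multi_cycle_time(LW_STEPS, OVERHEAD))   # 12.5
```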
Problem 1d [3 pts]:
What are precise interrupts and why is it easy to provide them for our multi-cycle data path?
Precise interrupts mean that all instructions before the offending instruction have completed and no instruction after it has executed. In our multi-cycle datapath it is easy to provide precise interrupts, because instructions execute one at a time and in order, and interrupts occur only before the WB stage.
Problem 1e [3 pts]:
Suppose that you have analyzed a benchmark that runs on your company’s processor. This
processor runs at 500MHz and has the following characteristics:
What is the CPI and MIPS rating of this processor running this benchmark?
CPI = 1×0.40 + 2×0.30 + 1×0.20 + 12×0.10 = 2.4
MIPS = 500 MHz / 2.4 ≈ 208 MIPS
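The same weighted-CPI arithmetic as a short Python sketch. The (frequency, cycles) pairs are the ones used in the calculation above. The helper names are illustrative.

```python
def weighted_cpi(mix):
    """mix: list of (frequency, cycles) pairs; frequencies must sum to 1."""
    if abs(sum(f for f, _ in mix) - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(f * cycles for f, cycles in mix)


def mips_rating(clock_mhz, cpi):
    """Millions of instructions per second = clock rate (MHz) / CPI."""
    return clock_mhz / cpi


# (frequency, cycles) pairs as used in the calculation above; the
# instruction classes they belong to are not listed here.
MIX = [(0.40, 1), (0.30, 2), (0.20, 1), (0.10, 12)]

if __name__ == "__main__":
    cpi = weighted_cpi(MIX)
    print(round(cpi, 2), round(mips_rating(500, cpi)))  # 2.4 208
```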
Problem 1f [2 pts]:
What is the technique used for a carry-select adder and how can it be used in general to speed up
hardware (hint: this is a very general technique to make a time⇔space tradeoff)
In a carry-select adder, you precompute the addition both with carry-in “0” and with carry-in “1”. The correct result is then selected by the late-arriving carry-in. In general, you can throw in more hardware to precompute results in parallel for each possible value of a late-arriving signal, and then use that signal to select the correct result. This speeds up the hardware at the cost of area.
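A behavioral Python sketch of the carry-select idea. Each adder block is modeled with ordinary integer addition rather than gates. Block size and width are arbitrary choices. Software runs the two speculative additions one after the other; in hardware they run in parallel.

```python
def add_block(a: int, b: int, cin: int, width: int):
    """Behavioral model of one width-bit adder block: returns (sum, carry_out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a: int, b: int, width: int = 16, block: int = 4):
    """Carry-select addition of two unsigned width-bit numbers.

    Every block above the lowest one is added twice, once assuming carry-in 0 and
    once assuming carry-in 1; the carry from the block below then just drives a
    mux. Returns (sum, carry_out)."""
    if width % block:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        a_blk, b_blk = (a >> i) & mask, (b >> i) & mask
        if i == 0:
            s, c = add_block(a_blk, b_blk, 0, block)    # carry-in known: plain add
        else:
            s0, c0 = add_block(a_blk, b_blk, 0, block)  # precomputed for carry-in = 0
            s1, c1 = add_block(a_blk, b_blk, 1, block)  # precomputed for carry-in = 1
            s, c = (s1, c1) if carry else (s0, c0)      # late carry selects the result
        result |= s << i
        carry = c
    return result, carry
```

The mux delay replaces the ripple through each upper block, which is the time-for-space trade the question refers to.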
Problem 1g [3pts]: What does it mean for a branch to have a delay slot, and why does it
complicate the servicing of interrupts?
A delay slot means that the instruction right after the branch is always executed, regardless of whether the branch is taken. If the instruction in the delay slot causes an exception, you have to re-evaluate the branch to figure out the correct next instruction to execute after the interrupt is serviced, since resuming at the delay-slot instruction alone would lose the branch outcome.
Problem 1h[3pts]: The Clark paper on testing talked about using randomness in at least two
different ways during testing of the VAX. What were they?
1. Use random test vectors.
2. Insert random errors
Problem 1i [3 pts]:
The 1-bit Booth algorithm recodes one of the operands of a multiplier from binary into trinary logic with symbols -1, 0, and 1. The transformation occurs one bit at a time, as given in class:
Cur Prev Out
0 0 0
0 1 1
1 0 -1
1 1 0
Suppose we encode 3 bits at a time. Finish filling out the following transformation table:
Cur Prev Out
000 0 0
000 1 1
001 0 1
001 1 2
010 0 2
010 1 3
011 0 3
011 1 4
100 0 -4
100 1 -3
101 0 -3
101 1 -2
110 0 -2
110 1 -1
111 0 -1
111 1 0
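A Python sketch that regenerates the 3-bit table above and checks that the recoded radix-8 digits add back up to the original signed value. The closed form −4·b2 + 2·b1 + b0 + Prev is an observation that matches every row. The function names are illustrative.

```python
def booth3_digit(cur: int, prev: int) -> int:
    """Radix-8 Booth digit for a 3-bit group cur = b2 b1 b0 and the previous bit.

    Equals -4*b2 + 2*b1 + b0 + prev, which reproduces every row of the table."""
    b2, b1, b0 = (cur >> 2) & 1, (cur >> 1) & 1, cur & 1
    return -4 * b2 + 2 * b1 + b0 + prev


def booth3_recode(x: int, width: int) -> list:
    """Recode a width-bit two's-complement value (width a multiple of 3) into
    radix-8 digits in -4..4, least significant first."""
    if width % 3:
        raise ValueError("width must be a multiple of 3")
    digits, prev = [], 0                       # the bit below the LSB is 0
    for i in range(0, width, 3):
        cur = (x >> i) & 0b111
        digits.append(booth3_digit(cur, prev))
        prev = (cur >> 2) & 1                  # this group's top bit is the next Prev
    return digits


if __name__ == "__main__":
    for cur in range(8):
        for prev in (0, 1):
            print(f"{cur:03b} {prev} {booth3_digit(cur, prev)}")
```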
Problem 2: Delay For a Full Adder
A key component of an ALU is a full adder. A symbol for a full adder is:
[Figure: full-adder symbol with inputs A, B, and Carry In, and outputs Sum and Carry Out]
Problem 2a [5pts]:
Implement a full adder using as few 2-input AND, OR, and XOR gates as possible. Keep in
mind that the Carry In signal may arrive much later than the A or B inputs. Thus, optimize your
design (if possible) to have as few gates between Carry In and the two outputs as possible:
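[Figure: full-adder gate diagram with inputs A, B, Cin and outputs S, Cout]
S = (A ⊕ B) ⊕ Cin
Cout = A·B + (A ⊕ B)·Cin
This uses two XOR gates, two AND gates, and one OR gate. Cin passes through one gate to reach S and two gates to reach Cout.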
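A gate-level Python sketch of the five-gate full adder. It assumes the arrangement S = (A ⊕ B) ⊕ Cin, Cout = A·B + (A ⊕ B)·Cin, which matches the 300 fF input load in 2b and the A→XOR→AND→OR carry path in 2c. The function name is illustrative.

```python
def full_adder(a: int, b: int, cin: int):
    """Five-gate full adder (2 XOR, 2 AND, 1 OR) arranged so a late Cin is cheap."""
    p = a ^ b              # XOR 1: depends only on A and B
    g = a & b              # AND 1: depends only on A and B
    s = p ^ cin            # XOR 2: the only gate between Cin and S
    cout = g | (p & cin)   # AND 2 + OR: two gates between Cin and Cout
    return s, cout


if __name__ == "__main__":
    print("A B Cin | S Cout")
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, "|", *full_adder(a, b, cin))
```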
Assume the following characteristics for the gates (load-dependent delays are in ns per fF of load):
Gate  Input load  TPlh    TPhl    TPlhf          TPhlf
AND   100 fF      0.4 ns  0.4 ns  0.0020 ns/fF   0.0021 ns/fF
OR    100 fF      0.2 ns  0.6 ns  0.0020 ns/fF   0.0021 ns/fF
XOR   200 fF      0.8 ns  0.8 ns  0.0040 ns/fF   0.0042 ns/fF
Problem 2b [3pts]:
Compute the input load for each of the 3 inputs to your full adder:
CA = CB = CCin = 200 fF (one XOR input) + 100 fF (one AND input) = 300 fF
Problem 2c [4pts]:
Identify two critical paths from the inputs to the Sum and the Carry Out signal.
Compute the propagation delays for these critical paths based on the information given above.
(You will have 2 numbers for each of these two paths):
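Because XOR is slow compared to AND and OR, both critical paths start at A (or B) and pass through the first XOR:
Sum: A → XOR → XOR → S
Carry out: A → XOR → AND → OR → Cout
The first XOR drives one XOR input and one AND input (200 + 100 = 300 fF). The AND on the carry path drives one OR input (100 fF). The larger (high-to-low) load factor is used for the internal stages as a worst case, since an XOR output can switch in either direction.
Without wire delay:
TphlAS = 0.8 + 0.0042×(200+100) + 0.8 = 2.86 ns
TplhAS = 0.8 + 0.0042×(200+100) + 0.8 = 2.86 ns
TphlACout = 0.6 + 0.0021×100 + 0.4 + 0.0042×(200+100) + 0.8 = 3.27 ns
TplhACout = 0.2 + 0.0021×100 + 0.4 + 0.0042×(200+100) + 0.8 = 2.87 ns
With wire delay (load-dependent terms doubled):
TphlACout = 0.6 + 0.0021×100×2 + 0.4 + 0.0042×(200+100)×2 + 0.8 = 4.74 ns
TplhACout = 0.2 + 0.0021×100×2 + 0.4 + 0.0042×(200+100)×2 + 0.8 = 4.34 ns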
Problem 2d [2pts]:
Compute the Load Dependent delay for your two outputs.
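The load-dependent delay of each output is that of the gate driving it (XOR for S, OR for Cout):
S: TPhlf = 0.0042 ns/fF, TPlhf = 0.0040 ns/fF
Cout: TPhlf = 0.0021 ns/fF, TPlhf = 0.0020 ns/fF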
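A small Python calculator for the path delays of Problems 2c and 2d. It is a sketch under stated assumptions:

* The gate arrangement is S = (A ⊕ B) ⊕ Cin, Cout = A·B + (A ⊕ B)·Cin.
* The carry-path AND drives a single OR input (100 fF).
* Internal stages use the worst-case (high-to-low) numbers.
* Wire capacitance is ignored.

The names GATES, gate_delay, and path_delay are made up for this sketch. The coefficient of out_load_ff is the output's load-dependent delay from 2d.

```python
# Gate data from the table above: input capacitance (fF), intrinsic delays (ns),
# and load factors (ns/fF) for low-to-high ("plh") and high-to-low ("phl") outputs.
GATES = {
    "AND": {"cin": 100, "plh": 0.4, "phl": 0.4, "kplh": 0.0020, "kphl": 0.0021},
    "OR":  {"cin": 100, "plh": 0.2, "phl": 0.6, "kplh": 0.0020, "kphl": 0.0021},
    "XOR": {"cin": 200, "plh": 0.8, "phl": 0.8, "kplh": 0.0040, "kphl": 0.0042},
}


def gate_delay(gate: str, edge: str, load_ff: float) -> float:
    """Intrinsic delay plus load factor times the capacitance being driven."""
    g = GATES[gate]
    return g[edge] + g["k" + edge] * load_ff


def path_delay(path, out_edge: str, out_load_ff: float = 0.0) -> float:
    """path: list of (gate, fanout) from input to output, where fanout lists the
    gates an internal stage drives. Internal stages use the worst-case (phl)
    numbers; the last gate uses out_edge and drives out_load_ff (0 = unloaded)."""
    total = 0.0
    for i, (gate, fanout) in enumerate(path):
        if i == len(path) - 1:
            total += gate_delay(gate, out_edge, out_load_ff)
        else:
            load = sum(GATES[g]["cin"] for g in fanout)
            total += gate_delay(gate, "phl", load)
    return total


# A -> XOR -> XOR -> S; the first XOR drives one XOR and one AND input (300 fF).
S_PATH = [("XOR", ["XOR", "AND"]), ("XOR", [])]
# A -> XOR -> AND -> OR -> Cout; the AND drives a single OR input (100 fF).
COUT_PATH = [("XOR", ["XOR", "AND"]), ("AND", ["OR"]), ("OR", [])]

if __name__ == "__main__":
    for name, path in (("S", S_PATH), ("Cout", COUT_PATH)):
        # prints "S 2.86 2.86" and then "Cout 3.27 2.87"
        print(name, round(path_delay(path, "phl"), 2), round(path_delay(path, "plh"), 2))
```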
Problem 2e [6pts]:
Suppose we wish to build a fast adder. One component might be a 4-bit adder with propagate
and generate signals (such as might be used for carry-lookahead logic). Construct this adder as a
4-bit ripple adder internally (you can use your full-adders) with separate propagate and generate
outputs. Let the output carry be a function of the propagate and generate signals. Draw a circuit
for this component and compute the propagation delay for the slowest signal.
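[Figure: FAST 4-bit adder block with inputs A[3:0], B[3:0], Cin and outputs S, Cout, and the propagate/generate signals. Internally, four 1-bit full adders (A3 B3 … A0 B0) ripple from Cin to produce S3 to S0, and the propagate signals P0 to P3 feed the logic that generates C4]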
The slowest path should be A0 → S3, because the XOR gate is much slower than the AND/OR gates used to generate the carry out C4 (which uses 7 AND/OR levels). Assume that P0 drives 1 AND gate, P1 drives 2 AND gates, P2 drives 3 AND gates, and P3 drives 4 AND gates for C4 generation. Note that the implementation is NOT unique!
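A behavioral Python sketch of one possible 4-bit adder of this kind. The sum ripples through four full adders, while C4 is formed from the group propagate and generate signals as C4 = G + P·Cin. The group equations are the standard lookahead forms. Here P3 appears in four product terms, P2 in three, P1 in two and P0 in one, in line with the fanout assumption above. The function names are illustrative.

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit full adder: S = (A xor B) xor Cin, Cout = A*B + (A xor B)*Cin."""
    p = a ^ b
    return p ^ cin, (a & b) | (p & cin)


def adder4_pg(a: int, b: int, cin: int):
    """4-bit ripple adder with group propagate/generate outputs.

    Returns (sum, c4, P, G); c4 comes from P and G, not from the last full adder."""
    bits_a = [(a >> i) & 1 for i in range(4)]
    bits_b = [(b >> i) & 1 for i in range(4)]
    p = [x ^ y for x, y in zip(bits_a, bits_b)]   # P0..P3
    g = [x & y for x, y in zip(bits_a, bits_b)]   # G0..G3

    s, c = 0, cin
    for i in range(4):                            # ripple chain for S0..S3
        bit, c = full_adder(bits_a[i], bits_b[i], c)
        s |= bit << i

    big_p = p[3] & p[2] & p[1] & p[0]
    big_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    c4 = big_g | (big_p & cin)
    return s, c4, big_p, big_g
```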
Problem 3: Non-Restoring Division
Here is the pseudo-code for an unsigned division algorithm. It is the non-restoring version of the
last divider that we developed. Assume that quotient and remainder are 32-bit global values.
As far as inputs are concerned, the dividend is 32 bits wide, while the divisor is no more than 31 bits.
divide(dividend, divisor)
{
    int count;
    MOBIUS64(remainder, quotient);
    while (count > 0) {
        count--;
        /* Low bit of quotient is inverted sign of remainder */
        if (quotient & 0x1)
            remainder = remainder - divisor;
        else
            remainder = remainder + divisor;
        MOBIUS64(remainder, quotient);
    }
}
The MOBIUS64(hi,lo) instruction rotates the 64-bit value <hi,lo> around to the left 1 bit,
wrapping the top bit of hi back to the bottom of lo and inverting it in the process. So, for
instance: MOBIUS64(0xFFFFFFFF, 0x000000FF) ⇒ 0xFFFFFFFE, 0x000001FE
And: MOBIUS64(0x0FFFFFFF, 0x000000FF) ⇒ 0x1FFFFFFE, 0x000001FF
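A Python sketch of MOBIUS64 as described above, with 32-bit masking standing in for the registers. The function name is illustrative. The usage lines reproduce the two examples.

```python
MASK32 = 0xFFFFFFFF


def mobius64(hi: int, lo: int):
    """Rotate the 64-bit pair <hi, lo> left one bit; the bit leaving the top of hi
    is inverted on its way into the bottom of lo."""
    top_hi = (hi >> 31) & 1
    top_lo = (lo >> 31) & 1
    new_hi = ((hi << 1) & MASK32) | top_lo        # top of lo moves into hi
    new_lo = ((lo << 1) & MASK32) | (top_hi ^ 1)  # inverted top of hi wraps into lo
    return new_hi, new_lo


if __name__ == "__main__":
    print([hex(v) for v in mobius64(0xFFFFFFFF, 0x000000FF)])  # ['0xfffffffe', '0x1fe']
    print([hex(v) for v in mobius64(0x0FFFFFFF, 0x000000FF)])  # ['0x1ffffffe', '0x1ff']
```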
Problem 3a [5pts]:
Implement MOBIUS64($t1, $t0) as 6 MIPS instructions. Assume $t2 contains the constant
0x80000000. Hint: what can different flavors of slt/sltu do to help?
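slt   $t3, $t0, $0     # $t3 = top bit of lo (1 if $t0 < 0)
sltu  $t4, $t1, $t2    # $t4 = inverted top bit of hi (1 if $t1 < 0x80000000)
sll   $t0, $t0, 1      # shift lo left
sll   $t1, $t1, 1      # shift hi left
addu  $t0, $t0, $t4    # inverted top bit of hi enters the bottom of lo
addu  $t1, $t1, $t3    # top bit of lo enters the bottom of hi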
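An instruction-by-instruction Python model of the six-instruction sequence, checked against the MOBIUS64 examples given earlier. It uses addu for the last two adds, since addiu takes an immediate rather than a register. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def as_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer (what slt compares)."""
    return x - (1 << 32) if x & 0x80000000 else x


def mobius64_six_instr(t1: int, t0: int, t2: int = 0x80000000):
    """Model of the sequence with t1 = hi, t0 = lo, and t2 = 0x80000000."""
    t3 = int(as_signed(t0) < 0)     # slt  $t3, $t0, $0   : top bit of lo
    t4 = int(t1 < t2)               # sltu $t4, $t1, $t2  : inverted top bit of hi
    t0 = (t0 << 1) & MASK32         # sll  $t0, $t0, 1
    t1 = (t1 << 1) & MASK32         # sll  $t1, $t1, 1
    t0 = (t0 + t4) & MASK32         # addu $t0, $t0, $t4
    t1 = (t1 + t3) & MASK32         # addu $t1, $t1, $t3
    return t1, t0
```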
Problem 3b [5pts]:
The divide algorithm is incomplete. It is missing some initialization and some final code. What
is missing? (Careful – what if the remainder is negative?)
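Initialization: Remainder = 0; Quotient = Dividend; Count = 32;
Final code: the last MOBIUS64 in the loop shifts the remainder one bit too far, so shift it back. If the last quotient bit is 0, the remainder is negative, so restore its sign bit and add the divisor back:
Remainder = Remainder >> 1;
if (!(Quotient & 0x01)) {
    Remainder = Remainder | 0x80000000;
    Remainder += Divisor;
}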
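A Python sketch of the completed algorithm: the loop from the problem plus the initialization and final correction given above. Masking to 32 bits mimics the hardware registers. The function names are illustrative. Restricting the divisor to 31 bits keeps every partial remainder within [−divisor, divisor), so its sign fits in bit 31 even though intermediate shifts wrap modulo 2^32.

```python
MASK32 = 0xFFFFFFFF


def mobius64(hi: int, lo: int):
    """MOBIUS64 as described in Problem 3: rotate <hi, lo> left, inverting the wrapped bit."""
    top_hi, top_lo = (hi >> 31) & 1, (lo >> 31) & 1
    return ((hi << 1) & MASK32) | top_lo, ((lo << 1) & MASK32) | (top_hi ^ 1)


def divide(dividend: int, divisor: int):
    """Non-restoring unsigned division with the missing initialization and final code.

    dividend is 32 bits, divisor is nonzero and at most 31 bits; arithmetic wraps
    modulo 2**32 as in the hardware. Returns (quotient, remainder)."""
    if not (0 <= dividend <= MASK32 and 0 < divisor < (1 << 31)):
        raise ValueError("need a 32-bit dividend and a nonzero 31-bit divisor")
    remainder, quotient, count = 0, dividend, 32         # initialization
    remainder, quotient = mobius64(remainder, quotient)
    while count > 0:
        count -= 1
        if quotient & 1:                                 # remainder was >= 0
            remainder = (remainder - divisor) & MASK32
        else:                                            # remainder was < 0
            remainder = (remainder + divisor) & MASK32
        remainder, quotient = mobius64(remainder, quotient)
    remainder >>= 1                                      # undo the extra shift
    if not (quotient & 1):                               # final remainder negative:
        remainder |= 0x80000000                          #   restore its sign bit
        remainder = (remainder + divisor) & MASK32       #   and add the divisor back
    return quotient, remainder
```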
Problem 3c [11pts]:
Assume that you have a MIPS processor that is missing the divide instruction. Assume that
dividend and divisor are in $a0 and $a1 respectively, and that remainder and quotient are
returned in registers $v0 and $v1 respectively. You can use MOBIUS64 as a pseudo-instruction
that takes 3 registers (don’t forget the constant in $t2). Make sure to adhere to MIPS register
conventions, and optimize the loop as much as possible.
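Divide:
        move  $t1, $0          # remainder = 0
        move  $t0, $a0         # quotient = dividend
        ori   $t8, $0, 32      # count = 32
        lui   $t2, 0x8000      # constant 0x80000000 for MOBIUS64
loop:
        MOBIUS64($t1, $t0)
        beq   $t8, $0, DONE
        addi  $t8, $t8, -1
        andi  $t3, $t0, 1      # low bit of quotient = inverted sign of remainder
        beq   $t3, $0, MOBIUSNEG
        subu  $t1, $t1, $a1    # remainder >= 0: subtract divisor
        j     loop
MOBIUSNEG:
        addu  $t1, $t1, $a1    # remainder < 0: add divisor
        j     loop
DONE:
        move  $v1, $t0         # quotient
        move  $v0, $t1         # remainder
        srl   $v0, $v0, 1      # undo the extra shift
        andi  $t3, $v1, 1
        bne   $t3, $0, exit
        or    $v0, $v0, $t2    # negative remainder: restore sign bit
        addu  $v0, $v0, $a1    # and add the divisor back
exit:   jr    $ra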
Problem 3d [4pts]:
How will this algorithm have to change if the divisor is allowed to be 32 bits in size?
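Check whether the divisor uses all 32 bits (bit 31 set) and treat that case separately. The quotient can then only be 0 or 1:
Either Q = 1, Remainder = Dividend − Divisor (if Dividend ≥ Divisor),
Or Q = 0, Remainder = Dividend (otherwise).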
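A short Python sketch of the special case. When bit 31 of the divisor is set, dividend < 2^32 ≤ 2 × divisor, so a single comparison decides the quotient. The function name is illustrative.

```python
def divide_32bit_divisor(dividend: int, divisor: int):
    """Quotient and remainder for a divisor with bit 31 set (quotient is 0 or 1)."""
    if not (0x80000000 <= divisor <= 0xFFFFFFFF and 0 <= dividend <= 0xFFFFFFFF):
        raise ValueError("expects a 32-bit dividend and a divisor with bit 31 set")
    if dividend >= divisor:
        return 1, dividend - divisor   # Q = 1, remainder = dividend - divisor
    return 0, dividend                 # Q = 0, remainder = dividend
```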
Problem 4: New instructions for a multi-cycle data path
[Figure: multi-cycle datapath from class: PC, Ideal Memory, Instruction Reg, Mem Data Reg, Reg File, A and B registers, ALU, ALU Out, ALU Control, sign Extend, and << 2, with control signals PCWr, PCWrCond, PCSrc, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, MemtoReg, ExtOp, and ALUOp]
The Multi-Cycle datapath developed in class and the book is shown above. In class, we
developed an assembly language for microcode. It is included here for reference:
In class, we made our multicycle machine support the following six MIPS instructions:
In this problem, we are going to add five new instructions to this data path:
jal <const> ⇒ PC ← zero_ext(Instr[25:0]) ||00
R[31] ← PC + 4
addiu $rt, $rs, <const> ⇒ R[rt] ← R[rs] + sign_ext(Imm16)
1. The jal instruction is familiar to you from the normal MIPS instruction set.
2. The addiu instruction is also a normal MIPS instruction that has an immediate value
3. The multu instruction is an unsigned multiply, and the mfhi/mflo instructions are for getting
the result.
Problem 4a [10pts]:
Describe/sketch the modifications needed to the datapath for the new instructions. Assume that
the original datapath had only enough functionality to implement the original 6 instructions. Try
to add as little hardware as possible. In particular, you must use the existing ALU for multiply.
You can show additions to the data path – you do not need to completely redraw it. Make
sure that you are very clear about your changes. Hint: you need to add hi and lo registers. They
can be made from shift-registers. Your CPI for multiply should be no more than 40. So: how will
you loop 32 times for multu???
jal
1. Expand the RegW (write-register) mux to include register $31.
2. Expand the busW mux to include data from the PC register (which already holds PC + 4 after fetch).
3. Expand the PC source mux to include the jump address.
[Figure: jal datapath changes: RegW mux with inputs Rt, Rd, and 31 driving Rw; busW (RegtoMem) mux with inputs MEM, ALUout, and PC; PCsrc mux extended with the jump address PC[31:28] || ShiftedExtImm]
addiu
No change is needed to implement this instruction
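multu $rs, $rt
1. PC ← PC + 4
2. ALUout ← PC + ShiftedExtImm
3. lo ← R[rt], hi ← 0, A ← R[rs], count ← 32
4. hi ← hi + (lo[0] == 0 ? 0 : A); count ← count − 1
5. srl OVM||hi||lo, 1
6. hi ← hi + (lo[0] == 0 ? 0 : A); count ← count − 1
[Figure: multu datapath changes: shift registers OVM, hi, and lo (hi and lo 32 bits each) with WriteHi, WriteLo, and Shift controls and a HiSrc mux, lo loaded from R[rt]. The ALUsrcA mux gains a 0 input, chosen instead of A when lo[0] is 0. The ALUsrcB mux is expanded to feed Hi to the ALU. A 32 down-counter has load and dec controls and a Zero output]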
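A behavioral Python sketch of the shift-add algorithm that the multu steps above implement. It assumes OVM is the one-bit carry out of the 32-bit hi + A addition, sitting above hi during the shift. It models values only, not cycle counts or control signals. The function name is illustrative.

```python
MASK32 = 0xFFFFFFFF


def multu_shift_add(rs: int, rt: int):
    """lo <- R[rt], hi <- 0, A <- R[rs], count <- 32; per bit:
    hi <- hi + (lo[0] == 0 ? 0 : A), then shift OVM||hi||lo right by one.
    Returns (hi, lo)."""
    a, lo, hi, count = rs & MASK32, rt & MASK32, 0, 32
    while count > 0:
        total = hi + (a if lo & 1 else 0)       # add A or 0, selected by lo[0]
        ovm, hi = total >> 32, total & MASK32   # carry out of the 32-bit add -> OVM
        count -= 1
        lo = (lo >> 1) | ((hi & 1) << 31)       # srl OVM||hi||lo, 1 (lo part)
        hi = (hi >> 1) | (ovm << 31)            # ... (hi part, OVM enters at the top)
    return hi, lo
```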
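mfhi, mflo
[Figure: mfhi/mflo datapath changes: the busW (RegtoMem) mux is extended with Hi and Lo inputs alongside MEM and ALUout, and the RegW mux can select Rs as well as Rt and Rd for Rw]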
Problem 4b [6pts]:
Draw a block diagram of a microcontroller that will support the new instructions (it will be
slightly different than that required for the original instructions). Include sequencing hardware,
the dispatch ROM, the microcode ROM, and decode blocks to turn the fields of the microcode
into control signals. Hint: you will need to provide a branching facility to your sequencer. Make
sure to include this functionality!
[Figure: microcontroller block diagram: microPC register; adders forming microPC + 1 and microPC − 1; a next-address mux; µAddress Select Logic driven by the opcode; microcode ROM]
Problem 4c [4pts]:
Describe changes to the microinstruction assembly language for these new instructions. How
wide are your microinstructions now?
ALUop: no change (3 bits)
SRC1: 1 bit → 2 bits (adds a “0” input)
SRC2: no change (3 bits, but the mux is expanded)
ALUdest: 2 bits → 3 bits (adds r31-PC, rs-Hi, rs-Lo)
Mem: no change (2 bits)
MemReg: no change (1 bit)
PCWrite: no change (2 bits, but now supports the jump address)
Sequence: no change (2 bits, but now supports “dec”)
Multu: new, 3 bits (encodes load, shift, WriteHi, WriteLo, HiSrc)
New microinstruction width: 3 + 2 + 3 + 3 + 2 + 1 + 2 + 2 + 3 = 21 bits
Problem 4d [10pts]:
Write complete microcode for the new instructions. Include the Fetch and Dispatch
microinstructions. If any of the microcode for the original instructions must change, explain how.
label     ALU  SRC1  SRC2      ALUdest  Mem     MemReg  PCwrite  Sequence  MULTU
fetch     add  pc    4                  ReadPC  IR      ALU      seq
dispatch  add  pc    shiftext                                    dispatch
Problem 4e [Extra Credit: 5pts]: Describe a (relatively) small change that will allow you to
implement the signed mult instruction (think Booth!). Be complete with your answer
(datapath/microcode, etc).
The microcode is quite similar to the non-Booth case, but a reset signal is introduced to reset the 1-bit register that holds the previous lo[0] to “0”.
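[Figure: Booth datapath changes: hi and lo shift registers (WriteHi, WriteLo, Shift) plus a 1-bit register holding the previous lo[0], cleared by Reset; the ALUsrcA mux selects among PC, 0, A, and −A under control of Lo[1:0]]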
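A behavioral Python sketch of radix-2 Booth multiplication as outlined above. The 1-bit previous-bit register starts at 0 (the reset). Each step uses (lo[0], prev) to choose 0, A, or −A, then shifts hi||lo right arithmetically. Python's unbounded integers hide one hardware detail: a real hi register would need a guard bit, because hi − A can exceed 32 bits when A = −2^31. The names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def booth_mult(rs: int, rt: int) -> int:
    """Signed 32x32 multiply via radix-2 Booth: lo <- R[rt], hi <- 0, A <- R[rs],
    prev <- 0. Returns the signed 64-bit product as a Python int."""
    a = to_signed32(rs)
    hi, lo, prev = 0, rt & MASK32, 0   # hi is an unbounded Python int here
    for _ in range(32):
        cur = lo & 1
        if (cur, prev) == (1, 0):      # start of a run of 1s: subtract A
            hi -= a
        elif (cur, prev) == (0, 1):    # end of a run of 1s: add A
            hi += a
        prev = cur
        lo = (lo >> 1) | ((hi & 1) << 31)   # low bit of hi moves into lo[31]
        hi >>= 1                            # arithmetic shift (Python >> keeps the sign)
    return (hi << 32) | lo
```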
EXTRA SHEET FOR PROBLEM 4: FEEL FREE TO REMOVE
Field Name   Values for Field   Function of Field
ALU          Add                ALU adds
             Sub                ALU subtracts
             Func               ALU does function code (Inst[5:0])
             Or                 ALU does logical OR
SRC1         PC                 PC ⇒ 1st ALU input
             rs                 R[rs] ⇒ 1st ALU input
SRC2         4                  4 ⇒ 2nd ALU input
             rt                 R[rt] ⇒ 2nd ALU input
             Extend             sign ext imm16 (Inst[15:0]) ⇒ 2nd ALU input
             Extend0            zero ext imm16 (Inst[15:0]) ⇒ 2nd ALU input
             ExtShft            2nd ALU input = sign extended imm16 << 2
ALU Dest     rd-ALU             ALUout ⇒ R[rd]
             rt-ALU             ALUout ⇒ R[rt]
             rt-Mem             Mem input ⇒ R[rt]
Memory       Read-PC            Read memory using the PC for the address
             Read-ALU           Read memory using the ALUout register for the address
             Write-ALU          Write memory using the ALUout register for the address
MemReg       IR                 Mem input ⇒ IR
PC Write     ALU                ALU value ⇒ PC
             ALUoutCond         If ALU Zero is true, then ALUout ⇒ PC
Sequence     Seq                Go to next sequential microinstruction
             Fetch              Go to the first microinstruction
             Dispatch           Dispatch using ROM