
Homework 2 Solutions

B.5
A useful tool for solving this type of problem is to extract all of the available information from the
problem description. It is possible that not all of the information will be necessary to solve the problem,
but having it in summary form makes it easier to think about. Here is a summary:
• CPU: 1.1 GHz (0.909ns equivalent), CPI of 0.7 (excludes memory accesses)
• Instruction mix: 75% non-memory-access instructions, 20% loads, 5% stores
• Caches: split L1 with no hit penalty (i.e., the access time is the time it takes to execute the load/store instruction)
  ◦ L1 I-cache: 2% miss rate, 32-byte blocks (requires 2 bus cycles to fill), miss penalty is 15ns + 2 cycles
  ◦ L1 D-cache: 5% miss rate, write-through (no write-allocate), 95% of all writes do not stall because of a write buffer, 16-byte blocks (requires 1 bus cycle to fill), miss penalty is 15ns + 1 cycle
• L1/L2 bus: 128-bit, 266 MHz bus between the L1 and L2 caches
• L2 (unified) cache: 512 KB, write-back (write-allocate), 80% hit rate, 50% of replaced blocks are dirty (must go to main memory), 64-byte blocks (requires 4 bus cycles to fill), miss penalty is 60ns + 7.52ns = 67.52ns
• Memory: 128 bits (16 bytes) wide, first access takes 60ns, subsequent accesses take 1 cycle on a 133 MHz, 128-bit bus
a. The average memory access time for instruction accesses:
• L1 (inst) miss time in L2: 15ns access time plus two L2 cycles (two = 32 bytes in inst. cache line / 16 bytes width of L2 bus) = 15 + 2 × 3.75 = 22.5ns. (3.75ns is equivalent to one 266 MHz L2 cache cycle.)
• L2 miss time in memory: 60ns plus four memory cycles (four = 64 bytes in L2 cache line / 16 bytes width of memory bus) = 60 + 4 × 7.5 = 90ns. (7.5ns is equivalent to one 133 MHz memory bus cycle.)
• Avg. memory access time for inst = avg. access time in L2 cache + avg. access time in memory + avg. access time for L2 write-back
  = 0.02 × 22.5 + 0.02 × (1 − 0.8) × 90 + 0.02 × (1 − 0.8) × 0.5 × 90 = 0.99ns (1.09 CPU cycles)
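As a quick check, the same three-term formula can be evaluated in a few lines of C. This is only an illustrative sketch; the function and parameter names are invented here, not taken from the problem.

  #include <stdio.h>

  /* AMAT in ns: L1 miss fill from L2, plus the fraction that also misses
   * in L2 and fills from memory, plus the write-back of dirty L2 victims. */
  static double amat_ns(double l1_miss, double l1_fill_ns,
                        double l2_hit, double l2_fill_ns, double dirty)
  {
      return l1_miss * l1_fill_ns
           + l1_miss * (1.0 - l2_hit) * l2_fill_ns
           + l1_miss * (1.0 - l2_hit) * dirty * l2_fill_ns;
  }

  int main(void)
  {
      /* Part (a): 2% I-cache miss rate, 22.5ns L2 fill, 80% L2 hit rate,
       * 90ns memory fill, 50% of replaced L2 blocks dirty. */
      double inst = amat_ns(0.02, 22.5, 0.80, 90.0, 0.50);
      printf("inst AMAT = %.2fns (%.2f CPU cycles at 1.1 GHz)\n",
             inst, inst * 1.1);
      return 0;
  }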
b. The average memory access time for data reads:
Similar to the above formula, with one difference: the data cache block is 16 bytes, which takes one L2 bus cycle to transfer (versus two for the inst. cache), so
• L1 (read) miss time in L2: 15 + 3.75 = 18.75ns
• L2 miss time in memory: 90ns
• Avg. memory access time for read = 0.02 × 18.75 + 0.02 × (1 − 0.8) × 90 + 0.02 × (1 − 0.8) × 0.5 × 90 = 0.92ns (1.01 CPU cycles)
c. The average memory access time for data writes:
Assume that write misses are not allocated in L1; hence all writes use the write buffer. Also assume the write buffer is as wide as the L1 data cache.
• L1 (write) time to L2: 15 + 3.75 = 18.75ns
• L2 miss time in memory: 90ns
• Avg. memory access time for data writes = 0.05 × 18.75 + 0.05 × (1 − 0.8) × 90 + 0.05 × (1 − 0.8) × 0.5 × 90 = 2.29ns (2.52 CPU cycles)
d. What is the overall CPI, including memory accesses?
• Components: base CPI, inst fetch CPI, and read CPI or write CPI; the inst fetch time is added to the data read or write time (for load/store instructions).
"#I = ).- 6 %.)* 6 ).0 ? %.)% 6 ).)5 ? 0.50 = 0.%* "#I.
B.10
a. L1 cache miss behavior when the caches are organized in an inclusive hierarchy and the two caches have identical block size (a code skeleton of this path follows the list):
• Access the L2 cache.
• If the L2 cache hits, supply the block to L1 from L2; the evicted L1 block can be stored in L2 if it is not already there.
• If the L2 cache misses, supply the block to both L1 and L2 from memory; the evicted L1 block can be stored in L2 if it is not already there.
• In both cases (hit, miss), if storing the evicted L1 block in L2 causes a block to be evicted from L2, then L1 has to be checked, and if the L2 block that was evicted is found in L1, it has to be invalidated.
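The skeleton below restates these steps in C. The block_t type and the l1_*/l2_* helpers are hypothetical stand-ins for whatever the cache model provides; only the control flow is meant to match the description above.

  typedef struct block block_t;            /* opaque cache block             */
  extern block_t *l2_lookup(unsigned long addr);
  extern block_t *fetch_from_memory(unsigned long addr);
  extern block_t *l2_install(block_t *b);  /* returns the L2 victim, if any  */
  extern void     l1_install(block_t *b);
  extern int      l1_present(block_t *b);
  extern int      l2_present(block_t *b);
  extern void     l1_invalidate(block_t *b);

  void l1_miss_inclusive(unsigned long addr, block_t *l1_victim)
  {
      block_t *b = l2_lookup(addr);
      if (!b) {                            /* L2 miss: fill L2 from memory   */
          b = fetch_from_memory(addr);
          (void)l2_install(b);
      }
      l1_install(b);                       /* supply the block to L1         */

      /* The L1 victim goes to L2 if it is not already there; if that
       * displaces an L2 block, check L1 and invalidate any copy of the
       * displaced block so that L2 stays a superset of L1. */
      if (l1_victim && !l2_present(l1_victim)) {
          block_t *l2_victim = l2_install(l1_victim);
          if (l2_victim && l1_present(l2_victim))
              l1_invalidate(l2_victim);
      }
  }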
b. L1 cache miss behavior when the caches are organized in an exclusive hierarchy and the two caches have identical block size (again, a sketch follows the list):
• Access the L2 cache.
• If the L2 cache hits, supply the block to L1 from L2, invalidate the block in L2, and write the evicted block from L1 to L2 (it cannot already have been there).
• If the L2 cache misses, supply the block to L1 from memory, and write the evicted block from L1 to L2 (it cannot already have been there).
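The exclusive counterpart, reusing the hypothetical helpers declared in the previous sketch: on an L2 hit the block moves into L1 rather than being copied, and the L1 victim always goes to L2 because it cannot already be there.

  extern void l2_invalidate(block_t *b);   /* additional hypothetical helper */

  void l1_miss_exclusive(unsigned long addr, block_t *l1_victim)
  {
      block_t *b = l2_lookup(addr);
      if (b)
          l2_invalidate(b);                /* L2 hit: hand the block to L1   */
      else
          b = fetch_from_memory(addr);     /* L2 miss: L2 is not filled      */
      l1_install(b);

      if (l1_victim)
          l2_install(l1_victim);           /* victim is guaranteed new to L2 */
  }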
c. When the evicted L1 block is dirty, it must be written back to L2 even if an earlier copy was there (inclusive L2). No change for the exclusive case.
2.1
a. Each element is 8B. Since a 64B cache line has 8 elements, and each column access will result in fetching a new line for the non-ideal matrix, we need a minimum of 8×8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB.
b. The blocked version only has to fetch each input and output element once. The unblocked version will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used again. Hence the unblocked version will have 9 misses (1 row and 8 columns) for every 2 in the blocked version.
c. /* Blocked transpose: the i and j loops step by the blocking factor B,
      and the m and n loops walk one B×B block at a time. */
   for (i = 0; i < 256; i += B) {
     for (j = 0; j < 256; j += B) {
       for (m = 0; m < B; m++) {
         for (n = 0; n < B; n++) {
           output[j+n][i+m] = input[i+m][j+n];
         }
       }
     }
   }
d. 2-way set associative. In a direct-mapped cache the blocks could be allocated so that they map to overlapping regions in the cache (a small worked example follows).
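To see the conflict concretely, the set-index arithmetic for the 1KB cache of part (a) with 64B blocks can be written out. The example addresses below are arbitrary; the point is that blocks whose addresses differ by a multiple of the direct-mapped cache size land in the same set.

  #include <stdio.h>

  /* Direct-mapped: 1KB / 64B blocks = 16 sets, set = (addr / 64) % 16.
   * 2-way of the same size: 8 sets of 2 blocks, set = (addr / 64) % 8. */
  int main(void)
  {
      unsigned long row_elem = 0x10040;              /* arbitrary address */
      unsigned long col_elem = row_elem + 4 * 1024;  /* 4KB apart         */
      printf("direct-mapped sets: %lu and %lu (collide)\n",
             (row_elem / 64) % 16, (col_elem / 64) % 16);
      printf("2-way sets: %lu and %lu (can coexist in the two ways)\n",
             (row_elem / 64) % 8, (col_elem / 64) % 8);
      return 0;
  }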
2.2
Since the unblocked version is too large to fit in the cache, processing eight 8B elements requires fetching one 64B row cache block and 8 column cache blocks. Since each iteration requires 2 cycles without misses, prefetches can be initiated every 2 cycles, and the number of prefetches per iteration is more than one, the memory system will be completely saturated with prefetches. Because the latency of a prefetch is 16 cycles, and one will start every 2 cycles, 16/2 = 8 will be outstanding at a time.
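For reference, software prefetching of this kind might look like the sketch below for the unblocked transpose, using the GCC/Clang __builtin_prefetch hint. The prefetch distance and the choice to prefetch both the row and column streams are illustrative, not taken from the exercise.

  #include <stddef.h>

  #define N    256   /* matrix dimension from Exercise 2.1            */
  #define DIST 8     /* prefetch distance in iterations; illustrative */

  void transpose_prefetch(double output[N][N], double input[N][N])
  {
      for (size_t i = 0; i < N; i++) {
          for (size_t j = 0; j < N; j++) {
              if (j + DIST < N) {
                  __builtin_prefetch(&input[i][j + DIST]);   /* row stream    */
                  __builtin_prefetch(&output[j + DIST][i]);  /* column stream */
              }
              output[j][i] = input[i][j];
          }
      }
  }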
2.8
a. The access time of the direct-mapped cache is 0.86ns, while the 2-way and 4-way are 1.12ns and 1.37ns respectively. This makes the relative access times 1.12/0.86 = 1.30, or 30% more for the 2-way, and 1.37/0.86 = 1.59, or 59% more for the 4-way.
b. The access time of the 16KB cache is 1.27ns, while the 32KB and 64KB are 1.35ns and 1.37ns respectively. This makes the relative access times 1.35/1.27 = 1.06, or 6% larger for the 32KB, and 1.37/1.27 = 1.078, or 8% larger for the 64KB.
c. Avg. access time = hit% × hit time + miss% × miss penalty; miss% = misses per instruction / references per instruction = 2.2% (DM), 1.2% (2-way), 0.33% (4-way), 0.09% (8-way). Hit times and miss penalties are rounded up to whole cycles of each cache's clock.
Direct mapped access time = 0.86ns / 0.5ns cycle time = 2 cycles
2-way set associative = 1.12ns / 0.5ns cycle time = 3 cycles
4-way set associative = 1.37ns / 0.83ns cycle time = 2 cycles
8-way set associative = 2.03ns / 0.79ns cycle time = 3 cycles
Miss penalty = 10/0.5 = 20 cycles for DM and 2-way; 10/0.83 = 13 cycles for 4-way; 10/0.79 = 13 cycles for 8-way.
Direct mapped: (1 − 0.022) × 2 + 0.022 × 20 = 2.396 cycles => 2.396 × 0.5 = 1.2ns
2-way: (1 − 0.012) × 3 + 0.012 × 20 = 3.2 cycles => 3.2 × 0.5 = 1.6ns
4-way: (1 − 0.0033) × 2 + 0.0033 × 13 = 2.036 cycles => 2.04 × 0.83 = 1.69ns
8-way: (1 − 0.0009) × 3 + 0.0009 × 13 = 3 cycles => 3 × 0.79 = 2.37ns
The direct-mapped cache is the best.
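The comparison above can be reproduced with a few lines of C; the sketch below simply re-evaluates the part (c) arithmetic (all constants are the figures quoted above).

  #include <stdio.h>

  int main(void)
  {
      const char  *name[]     = { "direct-mapped", "2-way", "4-way", "8-way" };
      const double miss[]     = { 0.022, 0.012, 0.0033, 0.0009 };
      const double hit_cyc[]  = { 2, 3, 2, 3 };     /* hit time, cycles     */
      const double pen_cyc[]  = { 20, 20, 13, 13 }; /* miss penalty, cycles */
      const double cycle_ns[] = { 0.5, 0.5, 0.83, 0.79 };

      for (int i = 0; i < 4; i++) {
          double amat_cyc = (1 - miss[i]) * hit_cyc[i] + miss[i] * pen_cyc[i];
          printf("%-13s %.3f cycles = %.2fns\n",
                 name[i], amat_cyc, amat_cyc * cycle_ns[i]);
      }
      return 0;
  }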
2.9
a. The average memory access time of the current (4-way, 64KB) cache is 1.69ns. The 64KB direct-mapped cache access time = 0.86ns / 0.5ns cycle time = 2 cycles. The way-predicted cache has a cycle time and access time similar to the direct-mapped cache and a miss rate similar to the 4-way cache. The AMAT of the way-predicted cache has three components: miss, hit with way prediction correct, and hit with way prediction incorrect: 0.0033 × 20 + (0.80 × 2 + (1 − 0.80) × 3) × (1 − 0.0033) = 2.26 cycles = 1.13ns
b. The cycle time of the 64KB 4-way cache is 0.83ns, while the 64KB direct-mapped cache can be accessed in 0.5ns. This provides 0.83/0.5 = 1.66, or 66% faster cache access.
c. With a 1-cycle way-misprediction penalty, the AMAT is 1.13ns (as per part a), but with a 15-cycle misprediction penalty, the AMAT becomes 0.0033 × 20 + (0.80 × 2 + (1 − 0.80) × 15) × (1 − 0.0033) = 4.65 cycles, or 2.3ns.
d. The serial access is 2.4ns/1.59ns = 1.509, or 51% slower.
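A small helper makes the two way-prediction scenarios of parts (a) and (c) easy to compare; the parameter is the hit time seen on a way misprediction (3 cycles in part a, 15 cycles in part c), and the constants are the figures quoted above. This is an illustrative sketch only.

  /* AMAT in cycles for the way-predicted cache. */
  double way_predicted_amat_cycles(double mispredicted_hit_cycles)
  {
      const double miss_rate = 0.0033;  /* 4-way 64KB miss rate           */
      const double pred_ok   = 0.80;    /* way-prediction accuracy        */
      const double hit_fast  = 2.0;     /* direct-mapped-like hit, cycles */

      return miss_rate * 20.0
           + (pred_ok * hit_fast
              + (1.0 - pred_ok) * mispredicted_hit_cycles) * (1.0 - miss_rate);
  }
  /* way_predicted_amat_cycles(3.0)  -> about 2.26 cycles = 1.13ns at 0.5ns
   * way_predicted_amat_cycles(15.0) -> about 4.65 cycles = about 2.3ns    */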
"%
3.1
The baseline performance (in cycles, per loop iteration) of the code sequence in Figure 3.48, if no new instruction's execution could be initiated until the previous instruction's execution had completed, is 40. See Figure S.2. Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units); since every instruction must execute, the loop will take at least that many clock cycles. To that base number, we add the extra latency cycles. Don't forget the branch shadow cycle.
3.2
How many cycles would the loop body in the code sequence in Figure 3.48 require if the pipeline detected true data dependencies and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? The answer is 25, as shown in Figure S.3. Remember, the point of the extra latency cycles is to allow an instruction to complete whatever actions it needs in order to produce its correct output. Until that output is ready, no dependent instructions can be executed. So the first LD must stall the next instruction for three clock cycles. The MULTD produces a result for its successor, and therefore must stall 4 more clocks, and so on.
3.17
