Eleven Advanced Optimizations for Cache Performance

- Reducing hit time
- Reducing miss penalty
- Reducing miss rate
- Reducing miss penalty * miss rate

Ref: 5.2, Computer Architecture: A Quantitative Approach, Hennessy & Patterson, 4th Edition.
PDF version available on course website (Intranet).

A Sahu
Reducing Hit Time

- Small and simple caches
- Pipelined cache access
- Trace caches
- Avoid time loss in address translation

Reducing Cache Hit Time: Avoid Time Loss in Address Translation

- Virtually indexed, physically tagged cache
  - simple and effective approach
  - possible only if cache is not too large
- Virtually addressed cache
  - protection? multiple processes? aliasing? I/O?
Small and Simple Caches

- Small size => faster access
- Small size => fits on the chip, lower delay
- Simple (direct mapped) => lower delay
- Second-level tags may be kept on chip

Cache Access Time Estimates Using CACTI

- 0.8 micron technology, 1 R/W port, 32 b address, 64 b output, 32 B block
CS521 CSE IITG, 11/23/2012
Pipelined Cache Access

- Multicycle cache access, but pipelined
- reduces cycle time, but hit time is more than one cycle
- Used in Pentium 4 (NetBurst architecture): takes 4 cycles
- Greater penalty on branch misprediction
- More clock cycles between issue of load and use of data

Trace Caches: Predecoded

- What maps to a cache block?
  - not statically determined
  - decided by the dynamic sequence of instructions, including predicted branches
- Used in Pentium 4 (NetBurst architecture)
- Starting addresses are not word-size * powers of 2
- Better utilization of cache space
- Downside: same instruction may be stored multiple times
Reducing Cache Miss Penalty

- Multilevel caches
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches
Multi-Level Caches

Average memory access time =
    Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) =
    Hit time(L2) + Miss rate(L2) * Miss penalty(L2)

Critical Word First and Early Restart

- Read policy: Concurrent Read / Forward
- Load policy: Wrap-around load
- More effective when block size is large
Read Miss Priority Over Write

- Provide write buffers
- Processor writes into buffer and proceeds (for write-through as well as write-back)
- On read miss:
  - wait for buffer to be empty, or
  - check addresses in buffer for conflict

Merging Write Buffer

- Merge writes belonging to the same block in case of write-through
Victim Cache: Recycle Bin / Dustbin

- Evicted blocks are recycled
- Much faster than getting a block from the next level
- Size = 1 to 5 blocks
- A significant fraction of misses may be found in the victim cache

[Figure: to proc <-> Cache <-> Victim Cache <-> from mem]
Reducing Cache Miss Rate

- Large block size
- Larger cache
- Higher associativity
- Way prediction and pseudo-associative cache
- Compiler optimizations

Large Block Size

- Takes benefit of spatial locality
- Reduces compulsory misses
- Too large a block size: misses increase
- Miss penalty increases
Large Cache

- Reduces capacity misses
- Hit time increases
- Keep small L1 cache and large L2 cache

Higher Associativity

- Reduces conflict misses
- 8-way is almost like fully associative
- Hit time increases: what to do? Pseudo-associativity
Way Prediction and Pseudo-associative Cache

- Way prediction: low miss rate of SA cache with hit time of DM cache
  - Only one tag is compared initially
  - Extra bits are kept for prediction
  - Hit time in case of misprediction is high
- Pseudo-assoc. or column-assoc. cache: get advantage of SA cache in a DM cache
  - Check sequentially in a pseudo-set
  - Fast hit and slow hit

Compiler Optimizations

- Loop interchange
  - Improve spatial locality by scanning arrays row-wise
- Blocking
  - Improve temporal and spatial locality
Improving Locality

- Matrix multiplication example

Cache Organization for the Example

- Cache line (or block) = 4 matrix elements
- Matrices are stored row-wise
- Cache can't accommodate a full row/column
Matrix Multiplication: Code I

            C       A       B
accesses    LM      LMN     LMN
misses      LM/4    LMN/4   LMN

Matrix Multiplication: Code II

            C       A       B
accesses    LMN     LN      LMN
misses      LMN/4   LN      LMN/4
Matrix Multiplication: Code III

Loop order: i, k, j

for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A      B
accesses    LMN     LN     LMN
misses      LMN/4   LN/4   LMN/4
Reducing Miss Penalty * Miss Rate

- Non-blocking cache
- Hardware prefetching
- Compiler-controlled prefetching

Non-blocking Cache

- In OOO processor
- Hit under a miss
  - complexity of cache controller increases
- Hit under multiple misses or miss under a miss
  - memory should be able to handle multiple misses
MatMul: Code III

for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

            C       A      B
accesses    LMN     LN     LMN
misses      LMN/4   LN/4   LMN/4

Total misses = LMN/4 + LN/4 + LMN/4 = LN(2M+1)/4

Suppose 3 separate prefetchers (1+1+1), one each for A, B and C.
All 3 blocks can be brought to the buffer, and one swapped out, in Te,
where Te = 4 times the execution time of the stmt.
How many misses?

Compiler-Controlled Prefetching

- Semantically invisible (no change in registers or cache contents)
- Makes sense if processor doesn't stall while prefetching (non-blocking cache)
- Overhead of prefetch instruction should not exceed the benefit

  Prefetch(A[i]);  // Prefetch           X = A[i];   // Prefetch instr
  STMT;                                  STMT;
  STMT;                                  STMT;
  STMT;                                  STMT;
  Y + A[i];        // Using data         Y + A[i];   // Using data
  Z = Y + K;                             Z = Y + K;