
CS521 CSE IITG 11/23/2012

Eleven Advanced Optimizations for Cache Performance
- Reducing hit time
- Reducing miss penalty
- Reducing miss rate
- Reducing miss penalty * miss rate

Ref: 5.2, Computer Architecture: A Quantitative Approach,
Hennessy-Patterson Book, 4th Edition.
PDF version available on course website (Intranet).

A. Sahu

Reducing Hit Time
- Small and simple caches
- Pipelined cache access
- Trace caches

Reducing Cache Hit Time
Avoid time loss in address translation:
- Virtually indexed, physically tagged cache
  - simple and effective approach
  - possible only if the cache is not too large
- Virtually addressed cache
  - protection? multiple processes? aliasing? I/O?

Small and Simple Caches
- Small size => faster access
- Small size => fits on the chip, lower delay
- Simple (direct mapped) => lower delay
- Second-level tags may be kept on chip

Cache access time estimates using CACTI
- 0.8 micron technology, 1 R/W port, 32 b address, 64 b output, 32 B block


Pipelined Cache Access
- Multi-cycle cache access, but pipelined
  - reduces cycle time, but hit time is more than one cycle
- Used in Pentium 4 (NetBurst architecture)
  - Pentium 4 takes 4 cycles
- Greater penalty on branch misprediction
- More clock cycles between issue of a load and use of its data
- IF stage repeated in the pipeline

Trace Caches: Predecoded
- What maps to a cache block?
  - not statically determined
  - decided by the dynamic sequence of instructions, including predicted branches
  - starting addresses are not word-size * powers of 2
- Better utilization of cache space
- Downside: the same instruction may be stored multiple times

Reducing Miss Penalty
- Multilevel caches
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches

Multi-Level Caches
- Average memory access time =
  Hit time_L1 + Miss rate_L1 * Miss penalty_L1
- Miss penalty_L1 =
  Hit time_L2 + Miss rate_L2 * Miss penalty_L2

Critical Word First and Early Restart
- Read policy: concurrent read / forward
- Load policy: wrap-around load
- More effective when block size is large


Read Miss Priority Over Write
- Provide write buffers
- Processor writes into the buffer and proceeds (for write-through as well as write-back)
- On a read miss:
  - wait for the buffer to be empty, or
  - check addresses in the buffer for conflict

Merging Write Buffer
- Merge writes belonging to the same block in case of write-through

Victim Cache: Recycle bin / Dustbin
- Evicted blocks are recycled
- Much faster than getting a block from the next level
- Size = 1 to 5 blocks
- A significant fraction of misses may be found in the victim cache
[Figure: processor <-> Cache <-> Victim Cache, filled from memory]

Reducing Cache Miss Rate

Reducing Miss Rate
- Large block size
- Larger cache
- Higher associativity
- Way prediction and pseudo-associative cache
- Compiler optimizations

Large Block Size
- Takes benefit of spatial locality
- Reduces compulsory misses
- Too large a block size: misses increase
- Miss penalty increases


Large Cache
- Reduces capacity misses
- Hit time increases
- Keep a small L1 cache and a large L2 cache

Higher Associativity
- Reduces conflict misses
- 8-way is almost like fully associative
- Hit time increases: what to do? Pseudo-associativity

Way Prediction and Pseudo-associative Cache
- Way prediction: low miss rate of a set-associative (SA) cache with the hit time of a direct-mapped (DM) cache
  - Only one tag is compared initially
  - Extra bits are kept for prediction
  - Hit time in case of misprediction is high
- Pseudo-associative or column-associative cache: get the advantage of an SA cache in a DM cache
  - Check sequentially in a pseudo-set
  - Fast hit and slow hit

Compiler Optimizations
- Loop interchange
  - Improve spatial locality by scanning arrays row-wise
- Blocking
  - Improve temporal and spatial locality

Improving Locality
- Matrix multiplication example: [C] = [A] [B]
  - C is L x M, A is L x N, B is N x M

Cache Organization for the Example
- Cache line (or block) = 4 matrix elements
- Matrices are stored row-wise
- Cache can't accommodate a full row/column
- L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss
- Ignore misses due to conflicts between matrices, as if there were a separate cache for each matrix


Matrix Multiplication: Code I (i j k)

  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      for (k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[k][j];

             C       A       B
  accesses   LM      LMN     LMN
  misses     LM/4    LMN/4   LMN

  Total misses = LM(5N+1)/4

Matrix Multiplication: Code II (k i j)

  for (k = 0; k < N; k++)
    for (i = 0; i < L; i++)
      for (j = 0; j < M; j++)
        C[i][j] += A[i][k] * B[k][j];

             C       A       B
  accesses   LMN     LN      LMN
  misses     LMN/4   LN      LMN/4

  Total misses = LN(2M+4)/4

Matrix Multiplication: Code III (i k j)

  for (i = 0; i < L; i++)
    for (k = 0; k < N; k++)
      for (j = 0; j < M; j++)
        C[i][j] += A[i][k] * B[k][j];

             C       A      B
  accesses   LMN     LN     LMN
  misses     LMN/4   LN/4   LMN/4

  Total misses = LN(2M+1)/4

Reducing Miss Rate * Miss Penalty

Reducing Miss Penalty * Miss Rate
- Non-blocking cache
- Hardware prefetching
- Compiler-controlled prefetching

Non-blocking Cache
- In an out-of-order (OOO) processor
- Hit under a miss
  - complexity of the cache controller increases
- Hit under multiple misses, or miss under a miss
  - memory should be able to handle multiple misses


Hardware Prefetching
- Prefetch items before they are requested
  - both data and instructions
- What and when to prefetch?
  - fetch two blocks on a miss (requested + next)
- Where to keep prefetched information?
  - in cache
  - in a separate buffer (most common case)

Prefetch Buffer / Stream Buffer
[Figure: processor <-> Cache; a prefetch buffer sits beside the cache and is filled from memory]

MatMul: Code III (i k j)

  for (i = 0; i < L; i++)
    for (k = 0; k < N; k++)
      for (j = 0; j < M; j++)
        C[i][j] += A[i][k] * B[k][j];

             C       A      B
  accesses   LMN     LN     LMN
  misses     LMN/4   LN/4   LMN/4

  Total misses = LN(2M+1)/4

- Suppose 3 separate prefetchers for A, B and C
- All 3 blocks can be brought into the buffer, and one swapped out, in Te, where Te = 4 x the execution time of the statement
- How many misses?

Compiler-Controlled Prefetching
- Semantically invisible (no change in registers or cache contents)
- Makes sense if the processor doesn't stall while prefetching (non-blocking cache)
- Overhead of the prefetch instruction should not exceed the benefit

  Prefetch(A[i]); // prefetch instruction      X = A[i];  // load used as prefetch
  STMT;                                        STMT;
  STMT;                                        STMT;
  STMT;                                        STMT;
  Y + A[i];       // using data                Y + A[i];  // using data
  Z = Y + K;                                   Z = Y + K;

