Build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and are kept in main memory. Before they go to the CPU, they are typically brought into the caches.
Technology Trends

[Figure: DRAM improvement ratios since 1980: roughly 4000:1 in capacity but only about 2.5:1 in speed; memory performance improves only about 7%/yr]
Memory Hierarchy

Levels of the Memory Hierarchy:

  Level          Capacity    Access time
  CPU registers  500 bytes   0.25 ns
  Cache          64 KB       1 ns
  Main memory    512 MB      100 ns
  Disk           100 GB      5 ms

Moving down the hierarchy, capacity increases while speed decreases; data moves between levels in blocks.
ABCs of Caches
Cache: in this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU; the term is applied whenever buffering is employed to reuse commonly occurring items, e.g., file caches, name caches, and so on.

Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
[Figure: blocks (e.g., Blk Y) are transferred between the cache and main memory at the request of the processor]
Cache Measures
CPU execution time incorporating cache performance:

  CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time

Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.

  Memory stall clock cycles
    = Number of misses * Miss penalty
    = IC * (Misses/Instruction) * Miss penalty
    = IC * (Memory accesses/Instruction) * Miss rate * Miss penalty
    = IC * Reads per instruction * Read miss rate * Read miss penalty
      + IC * Writes per instruction * Write miss rate * Write miss penalty

Memory accesses consist of fetching instructions and reading/writing data.
Example (p. 395)

Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses were cache hits?
Answer:
(A) If all memory accesses hit in the cache: CPI = 1.0 and there are no memory stalls, so
      CPU time(A) = (IC * CPI + 0) * Clock cycle time = IC * Clock cycle time
(B) With a 2% miss rate and CPI = 1.0, we must add the memory stalls:
      Memory stalls = IC * (Memory accesses/Instruction) * Miss rate * Miss penalty
                    = IC * (1 + 50%) * 2% * 25 = IC * 0.75
      CPU time(B) = (IC + IC * 0.75) * Clock cycle time = 1.75 * IC * Clock cycle time
The performance ratio is the inverse of the ratio of CPU execution times: CPU(B)/CPU(A) = 1.75. The computer with no cache misses is 1.75 times faster.
Direct mapped: the block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).
The Block Offset selects the desired data from the block, the Index field selects the set, and the Tag field is compared against the CPU address for a hit:
- Use the Cache Index to select the cache set
- Check the Tag on each block in that set; there is no need to check the index or block offset
- A Valid bit is added to the tag to indicate whether or not the entry contains a valid address
- Select the desired bytes using the Block Offset

Increasing associativity shrinks the index and expands the tag.
[Figure: a 2-way set-associative cache. The Cache Index selects a set; the Address Tag is compared in parallel against the Valid bit and Cache Tag of both blocks (Compare), the results are ORed to form Hit, and Sel0/Sel1 drive a Mux that selects the matching Cache Block]
Block replacement strategies:
- Random: the block to replace is randomly selected
- LRU: the Least Recently Used block is removed
- FIFO: First In, First Out
Data cache misses per 1000 instructions for various replacement strategies:

            2-way                    4-way                    8-way
  Size      LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
  16 KB     114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.4
  64 KB     103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.3
  256 KB    92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.5
There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random at the smaller cache sizes.
[Figure: Processor -> Cache -> DRAM, with a Write Buffer between the Cache and DRAM]

A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO; typical number of entries: 4
Write allocate: the block is allocated on a write miss, followed by the write-hit actions.

No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.

With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will still be in the cache.
Write allocate:
  Write Mem[100];  1 write miss (block allocated)
  Write Mem[100];  1 write hit
  Read  Mem[200];  1 read miss
  Write Mem[200];  1 write hit
  Write Mem[100];  1 write hit
  => 2 misses; 3 hits

No-write allocate: the same sequence gives 4 misses and 1 hit, since only the read of Mem[200] brings a block into the cache, so every write to Mem[100] misses.
Cache Performance
Example: Split Cache vs. Unified Cache

Which has the better average memory access time?
- A 16-KB instruction cache with a 16-KB data cache (split cache), or
- a 32-KB unified cache?

Miss rates:

  Size    Instruction cache   Data cache   Unified cache
  16 KB   0.4%                11.4%        -
  32 KB   -                   -            3.18%

Assume:
- A hit takes 1 clock cycle and the miss penalty is 100 cycles
- A load or store takes 1 extra clock cycle on a unified cache, since there is only one cache port
- 36% of the instructions are data transfer instructions; about 74% of the memory accesses are instruction references
Answer:

  Average memory access time (split)
    = % instructions * (Hit time + Instruction miss rate * Miss penalty)
      + % data * (Hit time + Data miss rate * Miss penalty)
    = 74% * (1 + 0.4% * 100) + 26% * (1 + 11.4% * 100) = 4.24

  Average memory access time (unified)
    = 74% * (1 + 3.18% * 100) + 26% * (1 + 1 + 3.18% * 100) = 4.44
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
  (reducing miss penalty: Section 5.4; reducing miss rate: Section 5.5; reducing hit time: Section 5.7)

CPU time = IC * (CPI_execution + Memory Accesses/Instruction * Miss Rate * Miss Penalty) * Clock Cycle Time
Multilevel Caches
Approaches:
- Make the cache faster to keep pace with the speed of CPUs
- Make the cache larger to overcome the widening gap
- L1: fast hits; L2: fewer misses

L2 equations:

  Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
  Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
  Average memory access time
    = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))

  Hit time(L1) << Hit time(L2) << ... << Hit time(Mem)
  Miss rate(L1) < Miss rate(L2) < ...

Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate(L1), Miss rate(L2)). The L1 cache skims the cream of the memory accesses.
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate(L1); Miss rate(L1) * Miss rate(L2)). Indicates what fraction of the memory accesses that leave the CPU go all the way to memory.
Design of L2 Cache

Size:
- Since everything in the L1 cache is likely to be in the L2 cache, the L2 cache should be much bigger than L1

Whether data in L1 is also in L2:
- Novice approach: design L1 and L2 independently
- Multilevel inclusion: L1 data are always present in L2
  - Advantage: easy consistency between I/O and the cache (checking L2 only)
  - Drawback: L2 must invalidate all L1 blocks that map onto a 2nd-level block about to be replaced, causing a slightly higher 1st-level miss rate (e.g., Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2)
Problem: write through with write buffers can cause RAW conflicts with main memory reads on cache misses.
- Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000)
- Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue

Write back: suppose a read miss will replace a dirty block.
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, do the read, and then do the write
- The CPU stalls less, since it restarts as soon as the read is done
Write merging: the addresses in the write buffer are checked to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry.

Consider a write buffer with 4 entries, each holding four 64-bit words. Without merging, four sequential word writes occupy four entries; with merging, the four writes are combined into a single entry. Writing multiple words at the same time is faster than writing multiple times.
Idea of recycling: remember what was most recently discarded on a cache miss, in case it is needed again, rather than simply discarding it (or swapping it into L2).

Victim cache: a small, fully associative cache between a cache and its refill path.
- It contains only blocks that are discarded from the cache because of a miss; victims are checked on a miss before going to the next lower-level memory
- Victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
- AMD Athlon: 8 entries
3 Cs of Cache Miss
3Cs Absolute Miss Rate (SPEC92)

2:1 Cache Rule: the miss rate of a 1-way associative (direct-mapped) cache of size X equals the miss rate of a 2-way associative cache of size X/2.
[Figure: absolute miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 to 128 KB, for 1-way through 8-way associativity]
[Figure: the same miss-rate data plotted as a percentage of total miss rate (0-100%), cache size 1 to 128 KB, 1-way through 8-way; conflict misses shrink with associativity]
Five techniques to reduce miss rate:
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction and pseudoassociative caches
5. Compiler optimizations
Using the principle of locality: the larger the block, the greater the chance parts of it will be used again (spatial locality).

[Figure: miss rate vs. block size (16, 32, 64, 128, 256 bytes) for several cache sizes]

- The number of blocks is reduced for a cache of the same size, and the miss penalty increases
- It may increase conflict misses, and even capacity misses if the cache is small
- Usually, high latency and high bandwidth encourage a large block size
[Figure: miss rate vs. cache size (1 to 128 KB), showing the capacity and compulsory components shrinking as capacity grows]

Increasing the capacity of the cache reduces capacity misses (Figures 5.14 and 5.15).
- Drawbacks: possibly longer hit time and higher cost
- Trend: larger L2 or L3 off-chip caches
Beware: execution time is the only final measure! Will the clock cycle time increase as a result of having a more complicated cache? Hill [1988] suggested the hit time for 2-way vs. 1-way: external cache +10%, internal +2%.
On a miss, a second cache entry is checked before going to the next lower level. Invert the most significant bit of the index to find the other block in the pseudoset. The miss penalty may become slightly longer.
Aligning basic blocks: the entry point is placed at the beginning of a cache block, decreasing the chance of a cache miss for sequential code.
Maximize accesses to the data loaded into the cache before it is replaced; this improves temporal locality. Example: matrix multiply X = Y*Z.

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

Total number of memory words accessed = 2N^3/B + N^2; y benefits from spatial locality, z benefits from temporal locality.
Up to 8 simultaneous prefetches. Prefetching may interfere with demand misses, lowering performance.
Faulting vs. nonfaulting: the prefetch address does or does not cause an exception for virtual address faults and protection violations. In this sense, a normal load instruction acts as a faulting register prefetch instruction.
Four techniques:
1. Small and simple caches
2. Avoiding address translation during indexing of the cache
3. Pipelined cache access
4. Trace caches
General design:
- Small and simple cache for the 1st-level cache
- Keep the tags on chip and the data off chip for 2nd-level caches

The recent emphasis is on fast clock time, while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.
2. Context switching: the same virtual address in different processes refers to different physical addresses, requiring the cache to be flushed.
   - Solution: widen the cache address tag with a process-identifier tag (PID)
4. I/O: typically uses physical addresses, so it needs to interact with the cache (see Section 5.12)
[Figure: conventional organization: the virtual address is translated first, and the resulting physical address then indexes the cache before going to MEM]

Overlap the cache access with the virtual address translation: this requires the cache index to remain invariant across translation.
Note that the trace cache increases the bandwidth of instruction delivery rather than decreasing the actual latency of a cache hit.