April 9, 2013
Outline
Memory Mountain
Cache Performance
Memory mountain: Measured read throughput as a function of spatial and temporal locality.
Compact way to characterize memory system performance.
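The measurement behind a memory mountain can be sketched as a strided read loop timed over different working set sizes. This is only an illustration of the loop's shape: the function name `read_throughput`, the sizes, and the strides are assumptions, and in CPython the interpreter overhead dominates, so the numbers will not resemble those of a compiled test kernel.

```python
import time
from array import array


def read_throughput(data, elems, stride):
    """Return (MB/s, checksum) for a strided pass over the first `elems` elements."""
    # Warm-up pass so the timed pass starts with the data already touched.
    total = 0
    for i in range(0, elems, stride):
        total += data[i]

    start = time.perf_counter()
    total = 0
    for i in range(0, elems, stride):
        total += data[i]
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

    bytes_read = len(range(0, elems, stride)) * data.itemsize
    return bytes_read / elapsed / 1e6, total


# Hypothetical sweep: two working set sizes (in elements) and three strides.
data = array("q", range(1 << 14))
mountain = {}
for elems in (1 << 11, 1 << 14):
    for stride in (1, 2, 4):
        mbps, _ = read_throughput(data, elems, stride)
        mountain[(elems, stride)] = mbps
```

Smaller working sets exercise temporal locality (the data stays cached between passes), while smaller strides exercise spatial locality (more of each fetched line is used).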
Intel Core i7
32 KB L1 i-cache
32 KB L1 d-cache
256 KB unified L2 cache
8 MB unified L3 cache
All caches on-chip
[Figure: memory mountain. Vertical axis: read throughput (ticks 0 to 6000). Horizontal axes: stride (s1 to s32) and working set size (2K to 64M).]
[Figure: the same memory mountain, with regions of the surface labeled L2, L3, and Mem.]
A couple of remarks
There is an order of magnitude difference in read throughput between the highest and the lowest points of the mountain
Even when the working set is too large to fit in any of the caches, the highest point of the main-memory region is a factor of 7 higher than its lowest point
Concluding Observations
Programmer can optimize for cache performance
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique
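As an illustration of blocking applied to a nested loop, here is a minimal sketch of a blocked matrix multiply. The function names and the `block` parameter are assumptions for this example. In pure Python the interpreter overhead hides most of the cache effect, so the sketch shows the loop structure rather than a speedup.

```python
def matmul_naive(a, b, n):
    """Reference n x n multiply with the plain i-j-k loop nest."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block):
    """Same product, computed one block x block tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # Inner loops stay inside one tile of a, b, and c, so the
                # tile's data is reused while it is still in the cache.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


# Small check that both versions agree (block size 2 on a 3 x 3 input).
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
result_naive = matmul_naive(a, b, 3)
result_blocked = matmul_blocked(a, b, 3, 2)
```

In a compiled language the block size would typically be chosen so that the tiles being worked on fit in the cache together.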
Example 1
Suppose we have made the following measurements:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are:
1. Decrease the CPI of FPSQR to 2
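The arithmetic for the first alternative can be sketched as follows. This assumes that the clock rate and instruction count stay the same, and that FPSQR operations are already counted inside the overall CPI, so lowering their CPI subtracts their saved cycles from the original average. Only the alternative stated above is evaluated; the variable names are illustrative.

```python
# Measurements from the example.
freq_fp = 0.25        # fraction of instructions that are FP operations
cpi_fp = 4.0          # average CPI of FP operations
cpi_other = 1.33      # average CPI of all other instructions
freq_fpsqr = 0.02     # fraction of instructions that are FPSQR
cpi_fpsqr = 20.0      # current CPI of FPSQR
cpi_fpsqr_new = 2.0   # CPI of FPSQR under alternative 1

# Original CPI is the frequency-weighted average of the two classes.
cpi_original = freq_fp * cpi_fp + (1 - freq_fp) * cpi_other

# Alternative 1 removes (20 - 2) cycles from the 2% of instructions that are FPSQR.
cpi_alt1 = cpi_original - freq_fpsqr * (cpi_fpsqr - cpi_fpsqr_new)

# With equal clock rate and instruction count, speedup is the CPI ratio.
speedup_alt1 = cpi_original / cpi_alt1

print(f"original CPI = {cpi_original:.4f}")
print(f"alternative 1 CPI = {cpi_alt1:.4f}")
print(f"speedup = {speedup_alt1:.3f}")
```

Under these assumptions the original CPI is about 2.0 and alternative 1 brings it to about 1.64, a speedup of roughly 1.22.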
Hit Time
Time to deliver a line in the cache to the processor
includes time to determine whether the line is in the cache
Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2
Miss Penalty
Additional time required because of a miss
typically 50-200 cycles for main memory (Trend: increasing!)
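One common way to combine these metrics is the average access time, hit time plus miss rate times miss penalty. The sketch below uses illustrative numbers chosen from the typical ranges above (1-cycle hit time, 100-cycle miss penalty); the miss rates and the function name are assumptions, not measurements.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average cycles per access: every access pays the hit time,
    and the fraction that misses also pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Illustrative values: 1-cycle hit, 100-cycle miss penalty.
time_97_percent_hits = average_access_time(1, 0.03, 100)
time_99_percent_hits = average_access_time(1, 0.01, 100)

print(f"97% hits: {time_97_percent_hits:.2f} cycles per access")
print(f"99% hits: {time_99_percent_hits:.2f} cycles per access")
```

With these numbers, going from a 97% to a 99% hit rate halves the average access time, because the miss penalty is so much larger than the hit time.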
Miss rates and miss penalties are often different for reads and writes!
Memory stall cycles = IC × Reads per instruction × Read miss rate × Read miss penalty
                    + IC × Writes per instruction × Write miss rate × Write miss penalty
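The formula above translates directly into a small function. The input values in the usage lines are hypothetical and serve only to show how the read and write terms are added.

```python
def memory_stall_cycles(ic,
                        reads_per_instr, read_miss_rate, read_miss_penalty,
                        writes_per_instr, write_miss_rate, write_miss_penalty):
    """Total stall cycles, with reads and writes accounted for separately."""
    read_stalls = ic * reads_per_instr * read_miss_rate * read_miss_penalty
    write_stalls = ic * writes_per_instr * write_miss_rate * write_miss_penalty
    return read_stalls + write_stalls


# Hypothetical workload: 1,000,000 instructions.
stalls = memory_stall_cycles(
    ic=1_000_000,
    reads_per_instr=0.3, read_miss_rate=0.02, read_miss_penalty=100,
    writes_per_instr=0.1, write_miss_rate=0.05, write_miss_penalty=100,
)
print(f"memory stall cycles = {stalls:,.0f}")
```

Here the reads contribute 600,000 stall cycles and the writes 500,000, even though writes are three times less frequent, because their assumed miss rate is higher.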
Example 2
Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all memory accesses were cache hits?
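One way to work this example is sketched below. It assumes that every instruction makes one instruction fetch in addition to its data accesses (so 1.5 memory accesses per instruction on average), that the 2% miss rate applies to both kinds of access, and that the clock rate and instruction count are unchanged. The variable names are illustrative.

```python
cpi_all_hits = 1.0            # CPI when every memory access hits
data_accesses_per_instr = 0.5 # loads and stores are 50% of instructions
fetches_per_instr = 1.0       # assumption: one instruction fetch per instruction
miss_rate = 0.02
miss_penalty = 25             # clock cycles

# Every instruction is fetched, and half of them also access data.
accesses_per_instr = fetches_per_instr + data_accesses_per_instr

# Stall cycles added to each instruction, on average.
stall_cpi = accesses_per_instr * miss_rate * miss_penalty

cpi_with_misses = cpi_all_hits + stall_cpi

# Same instruction count and clock rate, so the speedup is the CPI ratio.
speedup_if_all_hits = cpi_with_misses / cpi_all_hits

print(f"stall cycles per instruction = {stall_cpi:.2f}")
print(f"CPI with misses = {cpi_with_misses:.2f}")
print(f"speedup with all hits = {speedup_if_all_hits:.2f}")
```

Under these assumptions the misses add 0.75 cycles per instruction, so the all-hits computer would be 1.75 times faster.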