
Computer Architecture and Organization

Lecture 15: Cache Performance

Majid Khabbazian mkhabbazian@ualberta.ca


Electrical and Computer Engineering University of Alberta

April 9, 2013

Outline
Memory Mountain
Cache Performance

The Memory Mountain


Read throughput (read bandwidth)
Number of bytes read from memory per second (MB/s)

Memory mountain: Measured read throughput as a function of spatial and temporal locality.
Compact way to characterize memory system performance.

Memory Mountain Test Function


/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];   /* data[] is a global array of ints */
    sink = result;           /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* Warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* Time test(elems, stride) in cycles */
    return (size / stride) / (cycles / Mhz);  /* Convert cycles to MB/s */
}
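A minimal driver sketch (not on the original slide) showing how run() could be swept over working-set sizes and strides to produce the mountain. MINBYTES, MAXBYTES, MAXSTRIDE, and the clock rate are assumed values, and fcyc2 is taken to be an external cycle-timing helper:

#include <stdio.h>

#define MINBYTES (1 << 11)          /* working set from 2 KB ... */
#define MAXBYTES (1 << 26)          /* ... up to 64 MB           */
#define MAXSTRIDE 32                /* strides of 1..32 ints     */

int data[MAXBYTES / sizeof(int)];   /* global array read by test();
                                       declared before test() in a full program */

int main(void)
{
    int size, stride;
    double Mhz = 2667.0;            /* assumed clock rate of the measured machine */

    /* One row of throughputs per working-set size, one column per stride */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.0f\t", run(size, stride, Mhz));
        printf("\n");
    }
    return 0;
}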

The Memory Mountain


[Figure: The memory mountain. Read throughput (MB/s, 0 to 7000) plotted against stride (x8 bytes, s1 to s32) and working set size (2 KB to 64 MB). Measured on an Intel Core i7 with 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, and an 8 MB unified L3 cache, all on-chip.]

The Memory Mountain


[Figure: The same memory mountain, annotated with the slopes of spatial locality: throughput falls along each ridge as the stride grows.]

The Memory Mountain


[Figure: The same memory mountain, annotated with the ridges of temporal locality: the L1, L2, L3, and main-memory (Mem) plateaus reached as the working set size grows past each cache, plus the slopes of spatial locality along the stride axis.]

A couple of remarks
There is an order of magnitude difference between the highest and the lowest points of the mountain
Even when the working set is too large to fit in any cache, the highest point (stride-1 reads) is a factor of 7 higher than the lowest point

Concluding Observations
Programmers can optimize for cache performance:
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique

All systems favor cache-friendly code


Getting the absolute optimum performance is very platform-specific
Cache sizes, line sizes, associativities, etc.

Can get most of the advantage with generic code


Keep the working set reasonably small (temporal locality)
Use small strides (spatial locality); see the sketch below
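A minimal illustration of the stride guideline (not from the slides; the array and function names are hypothetical): summing a matrix row-wise touches memory with stride 1, while summing it column-wise strides by a whole row and squanders spatial locality.

#define N 1024

static int a[N][N];

/* Cache-friendly: stride-1 accesses walk each row in order. */
long sum_rowwise(void)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Cache-unfriendly: consecutive accesses are N * sizeof(int) bytes
   apart, so each one lands in a different cache line. */
long sum_colwise(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}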

The Processor Performance


CPU time = CPU clock cycles for a program x Clock cycle time
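For instance (illustrative numbers, not from the slides): a program that executes 2 x 10^9 clock cycles on a 2 GHz clock (0.5 ns cycle time) takes 2 x 10^9 x 0.5 ns = 1 second.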


The Processor Performance


Different instruction types have different CPIs, so the total is a sum over instruction types:
CPU clock cycles = Sum over i of (CPI_i x IC_i)
where IC_i is the count of instructions of type i and CPI_i is the cycles per instruction of that type


Example 1
Suppose we have made the following measurements:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Assume that the two design alternatives are:
1. Decrease the CPI of FPSQR to 2

2. Decrease the average CPI of all FP operations to 2.5


Compare the two design alternatives using the processor performance equation.
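A sketch of the solution (the arithmetic follows from the measurements above, assuming the FPSQR instructions are included in the 25% FP frequency), using CPI = Sum of (CPI_i x frequency_i):

Original CPI = (4.0 x 25%) + (1.33 x 75%) = 2.0
Alternative 1: CPI = 2.0 - 2% x (20 - 2) = 2.0 - 0.36 = 1.64
Alternative 2: CPI = 2.0 - 25% x (4.0 - 2.5) = 2.0 - 0.375 = 1.625

Since instruction count and clock cycle time are unchanged, alternative 2 is marginally faster: 1.64 / 1.625 = 1.01.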

Cache Performance Metrics


Miss Rate
Fraction of memory references not found in cache: miss rate = misses / accesses = 1 - hit rate
Typical numbers (in percentages):
3-10% for L1
Can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
Time to deliver a line in the cache to the processor
includes time to determine whether the line is in the cache

Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2

Miss Penalty
Additional time required because of a miss
typically 50-200 cycles for main memory (Trend: increasing!)

Let's think about those numbers


Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory

Would you believe 99% hits is twice as good as 97%?


Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles

This is why miss rate is used instead of hit rate
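The same arithmetic as a small C sketch; the formula AMAT = hit time + miss rate x miss penalty is standard, but the function and its parameters here are illustrative:

#include <stdio.h>

/* Average memory access time in cycles: every access pays the hit
   time; the fraction that misses additionally pays the penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    printf("97%% hits: %.0f cycles\n", amat(1.0, 0.03, 100.0)); /* 4 */
    printf("99%% hits: %.0f cycles\n", amat(1.0, 0.01, 100.0)); /* 2 */
    return 0;
}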



CPU execution time: revisited


Let's now account for cycles during which the processor is stalled waiting for a memory access
CPU execution time= (CPU clock cycles + Memory stall cycles) x Clock cycle time


Memory stall cycles


Simplifying assumptions:
CPU clock cycles include the time to handle a cache hit
The processor is stalled during a cache miss
Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty

Miss rates and miss penalties are often different for reads and writes!
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty
                    + IC x Writes per instruction x Write miss rate x Write miss penalty
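A small C sketch of the two formulas above (the names are illustrative, not from the slides):

/* Memory stall cycles = IC x accesses per instruction x miss rate
   x miss penalty; rates are fractions, e.g. 0.02 for 2%. */
double stall_cycles(double ic, double accesses_per_instr,
                    double miss_rate, double miss_penalty)
{
    return ic * accesses_per_instr * miss_rate * miss_penalty;
}

/* For split read/write statistics, sum the two contributions. */
double stall_cycles_rw(double ic,
                       double reads_per_instr, double read_miss_rate,
                       double read_miss_penalty,
                       double writes_per_instr, double write_miss_rate,
                       double write_miss_penalty)
{
    return ic * reads_per_instr * read_miss_rate * read_miss_penalty
         + ic * writes_per_instr * write_miss_rate * write_miss_penalty;
}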


Example 2
Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
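A sketch of the solution, assuming every instruction makes one memory access for its fetch in addition to the data accesses:

Memory accesses per instruction = 1 (fetch) + 0.5 (loads/stores) = 1.5
Memory stall cycles per instruction = 1.5 x 2% x 25 = 0.75
CPI including misses = 1.0 + 0.75 = 1.75
Speedup with a perfect cache = 1.75 / 1.0 = 1.75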

