
Computer Architecture and Organization

Lecture 15: Cache Performance

Majid Khabbazian mkhabbazian@ualberta.ca


Electrical and Computer Engineering University of Alberta

April 9, 2013

Outline
Memory Mountain
Cache Performance

The Memory Mountain


Read throughput (read bandwidth)
Number of bytes read from memory per second (MB/s)

Memory mountain: Measured read throughput as a function of spatial and temporal locality.
Compact way to characterize memory system performance.

Memory Mountain Test Function


/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];   /* data[] is a global array of ints */
    sink = result;           /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* Warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* Time test(elems, stride) in cycles */
    return (size / stride) / (cycles / Mhz);  /* Convert cycles to MB/s */
}
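A minimal driver sketch (not on the original slide) showing how run() could be swept over working-set sizes and strides to produce the mountain. MINBYTES, MAXBYTES, MAXSTRIDE, and the clock rate are assumed values, and fcyc2 is taken to be an external cycle-timing helper:

#include <stdio.h>

#define MINBYTES (1 << 11)          /* working set from 2 KB ... */
#define MAXBYTES (1 << 26)          /* ... up to 64 MB           */
#define MAXSTRIDE 32                /* strides of 1..32 ints     */

int data[MAXBYTES / sizeof(int)];   /* global array read by test();
                                       declared before test() in a full program */

int main(void)
{
    int size, stride;
    double Mhz = 2667.0;            /* assumed clock rate of the measured machine */

    /* One row of throughputs per working-set size, one column per stride */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.0f\t", run(size, stride, Mhz));
        printf("\n");
    }
    return 0;
}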

The Memory Mountain


[Figure: The memory mountain. Read throughput (MB/s, 0 to 7000) plotted against stride (x8 bytes, s1 to s32) and working set size (2 KB to 64 MB). Measured on an Intel Core i7 with 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, and an 8 MB unified L3 cache, all on-chip.]

The Memory Mountain


[Figure: The same memory mountain, annotated with the slopes of spatial locality: throughput falls along each ridge as the stride grows.]

The Memory Mountain


[Figure: The same memory mountain, annotated with the ridges of temporal locality: the L1, L2, L3, and main-memory (Mem) plateaus reached as the working set size grows past each cache, plus the slopes of spatial locality along the stride axis.]

A couple of remarks
There is an order of magnitude difference between the highest and the lowest points of the mountain
Even when the working set is too large to fit in any cache, the highest point (stride-1 reads) is a factor of 7 higher than the lowest point

Concluding Observations
Programmers can optimize for cache performance:
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique

All systems favor cache-friendly code


Getting the absolute optimum performance is very platform-specific
Cache sizes, line sizes, associativities, etc.

Can get most of the advantage with generic code


Keep the working set reasonably small (temporal locality)
Use small strides (spatial locality); see the sketch below
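A minimal illustration of the stride guideline (not from the slides; the array and function names are hypothetical): summing a matrix row-wise touches memory with stride 1, while summing it column-wise strides by a whole row and squanders spatial locality.

#define N 1024

static int a[N][N];

/* Cache-friendly: stride-1 accesses walk each row in order. */
long sum_rowwise(void)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Cache-unfriendly: consecutive accesses are N * sizeof(int) bytes
   apart, so each one lands in a different cache line. */
long sum_colwise(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}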

The Processor Performance


CPU time = CPU clock cycles for a program x Clock cycle time
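For instance (illustrative numbers, not from the slides): a program that executes 2 x 10^9 clock cycles on a 2 GHz clock (0.5 ns cycle time) takes 2 x 10^9 x 0.5 ns = 1 second.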


The Processor Performance


Different instruction types have different CPIs, so the total is a sum over instruction types:
CPU clock cycles = Sum over i of (CPI_i x IC_i)
where IC_i is the count of instructions of type i and CPI_i is the cycles per instruction of that type


Example 1
Suppose we have made the following measurements:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Assume that the two design alternatives are:
1. Decrease the CPI of FPSQR to 2

2. Decrease the average CPI of all FP operations to 2.5


Compare the two design alternatives using the processor performance equation.
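A sketch of the solution (the arithmetic follows from the measurements above, assuming the FPSQR instructions are included in the 25% FP frequency), using CPI = Sum of (CPI_i x frequency_i):

Original CPI = (4.0 x 25%) + (1.33 x 75%) = 2.0
Alternative 1: CPI = 2.0 - 2% x (20 - 2) = 2.0 - 0.36 = 1.64
Alternative 2: CPI = 2.0 - 25% x (4.0 - 2.5) = 2.0 - 0.375 = 1.625

Since instruction count and clock cycle time are unchanged, alternative 2 is marginally faster: 1.64 / 1.625 = 1.01.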

Cache Performance Metrics


Miss Rate
Fraction of memory references not found in cache: miss rate = misses / accesses = 1 - hit rate
Typical numbers (in percentages):
3-10% for L1
Can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
Time to deliver a line in the cache to the processor
includes time to determine whether the line is in the cache

Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2

Miss Penalty
Additional time required because of a miss
typically 50-200 cycles for main memory (Trend: increasing!)

Let's think about those numbers


Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory

Would you believe 99% hits is twice as good as 97%?


Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
Average access time:
97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles

This is why miss rate is used instead of hit rate
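The same arithmetic as a small C sketch; the formula AMAT = hit time + miss rate x miss penalty is standard, but the function and its parameters here are illustrative:

#include <stdio.h>

/* Average memory access time in cycles: every access pays the hit
   time; the fraction that misses additionally pays the penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    printf("97%% hits: %.0f cycles\n", amat(1.0, 0.03, 100.0)); /* 4 */
    printf("99%% hits: %.0f cycles\n", amat(1.0, 0.01, 100.0)); /* 2 */
    return 0;
}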



CPU execution time: revisited


Let's now account for cycles during which the processor is stalled waiting for a memory access
CPU execution time= (CPU clock cycles + Memory stall cycles) x Clock cycle time


Memory stall cycles


Simplifying assumptions:
CPU clock cycles include the time to handle a cache hit
The processor is stalled during a cache miss
Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Memory accesses / Instruction) x Miss rate x Miss penalty

Miss rates and miss penalties are often different for reads and writes!
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty
                    + IC x Writes per instruction x Write miss rate x Write miss penalty
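A small C sketch of the two formulas above (the names are illustrative, not from the slides):

/* Memory stall cycles = IC x accesses per instruction x miss rate
   x miss penalty; rates are fractions, e.g. 0.02 for 2%. */
double stall_cycles(double ic, double accesses_per_instr,
                    double miss_rate, double miss_penalty)
{
    return ic * accesses_per_instr * miss_rate * miss_penalty;
}

/* For split read/write statistics, sum the two contributions. */
double stall_cycles_rw(double ic,
                       double reads_per_instr, double read_miss_rate,
                       double read_miss_penalty,
                       double writes_per_instr, double write_miss_rate,
                       double write_miss_penalty)
{
    return ic * reads_per_instr * read_miss_rate * read_miss_penalty
         + ic * writes_per_instr * write_miss_rate * write_miss_penalty;
}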


Example 2
Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
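A sketch of the solution, assuming every instruction makes one memory access for its fetch in addition to the data accesses:

Memory accesses per instruction = 1 (fetch) + 0.5 (loads/stores) = 1.5
Memory stall cycles per instruction = 1.5 x 2% x 25 = 0.75
CPI including misses = 1.0 + 0.75 = 1.75
Speedup with a perfect cache = 1.75 / 1.0 = 1.75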

