
EEF011 Computer Architecture 計算機結構

Chapter 5 Memory Hierarchy Design

吳俊興

高雄大學資訊工程學系

December 2004

Chapter 5 Memory Hierarchy Design

5.1  Introduction
5.2  Review of the ABCs of Caches
5.3  Cache Performance
5.4  Reducing Cache Miss Penalty
5.5  Reducing Cache Miss Rate
5.6  Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7  Reducing Hit Time
5.8  Main Memory and Organizations for Improving Performance
5.9  Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory

5.1 Introduction

The five classic components of a computer:

Processor

Control

Datapath

Memory

Input

Output

Where do we fetch instructions to execute?

Build a memory hierarchy that includes main memory and caches (internal memory) and hard disk (external memory). Instructions are first fetched from external storage such as the hard disk and kept in main memory. Before they go to the CPU, they are typically brought into the caches.

Memory Performance Index: Capacity and Speed (latency)

Technology Trends

          Capacity           Speed (latency)
CPU:      2x in 1.5 years    2x in 1.5 years
DRAM:     4x in 3 years      2x in 10 years
Disk:     4x in 3 years      2x in 10 years

DRAM generations

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
2000    256 Mb    100 ns

        4000:1!   2.5:1!

Performance Gap between CPUs and Memory

[Figure: CPU vs. memory performance (improvement ratio) over time. CPU performance improved about 1.35X/yr, later 1.55X/yr, while memory improved only about 7%/yr.]

The gap (latency) grows about 50% per year!

Memory Hierarchy

Levels of the Memory Hierarchy

Level           Capacity     Access Time    Transfer unit to next level
CPU Registers   500 bytes    0.25 ns
Cache           64 KB        1 ns           Blocks
Main Memory     512 MB       100 ns         Pages
Disk            100 GB       5 ms           Files
I/O Devices     ???

Upper level: faster, smaller capacity; lower level: slower, larger capacity.

5.2 ABCs of Caches

Cache:

– In this textbook it mainly means the first level of the memory hierarchy encountered once the address leaves the CPU

– the term is applied whenever buffering is employed to reuse commonly occurring items, e.g. file caches, name caches, and so on

Principle of Locality:

– Programs access a relatively small portion of the address space at any instant of time.

Two Different Types of Locality:

– Temporal Locality (Locality in Time): If an item is referenced, it will

tend to be referenced again soon (e.g., loops, reuse)

– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Memory Hierarchy: Terminology

Hit: data appears in some block in the cache (example: Block X)

Hit Rate: the fraction of memory accesses that are found in the cache

Hit Time: Time to access the upper level which consists of RAM access time + Time to determine hit/miss

Miss: data needs to be retrieved from a block in the main memory (Block Y)

Miss Rate = 1 - (Hit Rate)

Miss Penalty: Time to replace a block in cache + Time to deliver the block to the processor

Hit Time << Miss Penalty (e.g. 1 clock cycle vs. 40 clock cycles)

[Figure: on a hit, block X is delivered from the cache to the processor; on a miss, block Y is brought from main memory into the cache and then delivered to the processor.]

Cache Measures

CPU execution time incorporated with cache performance:

CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time

Memory stall cycles: the number of cycles during which the CPU is stalled waiting for a memory access.

Memory stall clock cycles = Number of misses x Miss penalty
  = IC x (Misses/Instruction) x Miss penalty
  = IC x (Memory accesses/Instruction) x Miss rate x Miss penalty
  = IC x Reads per instruction x Read miss rate x Read miss penalty
    + IC x Writes per instruction x Write miss rate x Write miss penalty

Memory access consists of fetching instructions and reading/writing data

P.395 Example

Example: Assume we have a computer where the CPI is 1.0 when all memory accesses hit the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all accesses hit in the cache?

Answer:

(A)

If instructions always hit in the cache, CPI=1.0, no memory stalls, then CPU(A) = (IC*CPI + 0)*clock cycle time = IC*clock cycle time

(B)

If there is a 2% miss rate and CPI = 1.0, we need to calculate the memory stalls:

memory stall cycles = IC x (Memory accesses/Instruction) x miss rate x miss penalty = IC x (1 + 50%) x 2% x 25 = IC x 0.75

then CPU(B) = (IC + IC x 0.75) x Clock cycle time = 1.75 x IC x Clock cycle time

The performance ratio is the inverse of the ratio of CPU execution times:

CPU(B)/CPU(A) = 1.75. The computer with no cache misses is 1.75 times faster.
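A minimal C sketch (just reproducing the arithmetic above with the stated assumptions; the variable names are ours) to check the result:

#include <stdio.h>

int main(void) {
    double cpi_base      = 1.0;   /* CPI when every access hits            */
    double mem_per_instr = 1.5;   /* 1 instruction fetch + 0.5 data access */
    double miss_rate     = 0.02;
    double miss_penalty  = 25.0;  /* clock cycles                          */

    /* Memory stall cycles per instruction: 1.5 * 0.02 * 25 = 0.75 */
    double stall_cpi = mem_per_instr * miss_rate * miss_penalty;

    /* CPU(B)/CPU(A) = (CPI_base + stalls) / CPI_base = 1.75 */
    printf("speedup with a perfect cache = %.2f\n",
           (cpi_base + stall_cpi) / cpi_base);
    return 0;
}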

Four Memory Hierarchy Questions

Q1 (block placement): Where can a block be placed in the upper level?
Q2 (block identification): How is a block found if it is in the upper level?
Q3 (block replacement): Which block should be replaced on a miss?
Q4 (write strategy): What happens on a write?

Q1(block placement): Where can a block be placed?

– Direct mapped: (Block number) mod (Number of blocks in cache)
– Set associative: (Block number) mod (Number of sets in cache)
  • # of sets ≤ # of blocks
  • n-way: n blocks in a set
  • 1-way = direct mapped
– Fully associative: # of sets = 1

Example: block 12 placed in an 8-block cache
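A small C sketch (our illustration of the formulas above, not from the slides) showing where block 12 may be placed in an 8-block cache:

#include <stdio.h>

int main(void) {
    int block = 12, frames = 8;

    /* Direct mapped: exactly one candidate frame */
    printf("direct mapped   : frame %d\n", block % frames);          /* 4 */

    /* 2-way set associative: 8/2 = 4 sets; either frame of that set */
    int sets = frames / 2;
    printf("2-way set assoc.: set %d (either way)\n", block % sets); /* 0 */

    /* Fully associative: a single set containing all frames */
    printf("fully assoc.    : any of the %d frames\n", frames);
    return 0;
}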

Simplest Cache: Direct Mapped (1-way)

[Figure: memory blocks 0 through F mapped onto a 4-block direct-mapped cache (cache indexes 0–3); each memory block maps to cache index (block number) mod 4.]

A block has only one place it can appear in the cache. The mapping is usually (Block address) MOD (Number of blocks in cache).

Example: 1 KB Direct Mapped Cache, 32B Blocks

For a 2^N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag

– The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. A 32-bit address is split into the Cache Tag (bits 31–10, e.g. 0x50), the Cache Index (bits 9–5, e.g. 0x01), and the Byte Select (bits 4–0, e.g. 0x00). Each of the 32 cache entries stores a Valid bit and the Cache Tag as part of the cache "state", plus 32 bytes of Cache Data.]
Q2 (block identification): How is a block found?

Three portions of an address in a set-associative or direct-mapped cache

|              Block Address              |              |
|      Tag      |    Cache/Set Index      | Block Offset |
                                            (Block Size)

The Block Offset selects the desired data from the block, the index field selects the set, and the tag field is compared against the CPU address for a hit.

• Use the Cache Index to select the cache set

• Check the Tag on each block in that set

– No need to check index or block offset

– A valid bit is added to the Tag to indicate whether or not this entry

contains a valid address

• Select the desired bytes using the Block Offset

Increasing associativity shrinks the index and expands the tag.

Example: Two-way set associative cache

• Cache Index selects a “set” from the cache

• The two tags in the set are compared in parallel

• Data is selected based on the tag result

[Figure: a two-way set-associative cache. The Cache Index selects one set; the Valid bit and Cache Tag of both ways are compared in parallel against the address tag (Adr Tag), a multiplexer (Sel0/Sel1) selects the Cache Data of the matching way, and the OR of the two compare results signals Hit.]

Disadvantage of Set Associative Cache

N-way Set Associative Cache vs. Direct Mapped Cache:

– N comparators vs. 1

– Extra MUX delay for the data

– Data comes AFTER Hit/Miss

In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

– Possible to assume a hit and continue. Recover later if miss.


Q3 (block replacement): Which block should be replaced on a cache miss?

Easy for direct mapped – hardware decisions are simplified: only one block frame is checked, and only that block can be replaced.

Set associative or fully associative – there are many blocks to choose from on a miss. Three primary strategies for selecting the block to be replaced (a small LRU bookkeeping sketch follows below):

– Random: the victim block is selected at random
– LRU: the Least Recently Used block is replaced
– FIFO (first in, first out): the oldest block is replaced
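A minimal sketch in C (an illustration of LRU bookkeeping for one 4-way set, not the textbook's hardware) using access timestamps:

#include <stdint.h>

#define WAYS 4

struct way { uint32_t tag; int valid; uint64_t last_used; };

/* Pick the frame to replace in one set: an empty frame if any exists,
   otherwise the least recently used one. */
static int pick_victim(const struct way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                          /* free frame: use it */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                        /* older access wins  */
    }
    return victim;
}
/* On every hit, set[w].last_used is updated with a global access counter. */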

Data cache misses per 1000 instructions for various replacement strategies

Associativity:      2-way                   4-way                   8-way
Size          LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16 KB         114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.4
64 KB         103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.3
256 KB        92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.5

There is little difference between LRU and random for the largest cache size, with LRU outperforming the others for smaller caches. FIFO generally outperforms random for the smaller cache sizes.

Q4(write strategy): What happens on a write?

Reads dominate processor cache accesses: e.g. writes are only 7% of the overall memory traffic and 21% of data cache accesses.

Two options we can adopt when writing to the cache:

– Write through — the information is written to both the block in the cache and to the block in the lower-level memory.
– Write back — the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. To reduce the frequency of writing back blocks on replacement, a dirty bit indicates whether the block was modified in the cache (dirty) or not (clean). If it is clean, no write back is needed, since the lower level already holds identical information.

Pros and cons:
– WT: simple to implement. The cache is always clean, so read misses cannot result in writes to the lower level.
– WB: writes occur at the speed of the cache, and multiple writes within a block require only one write to the lower-level memory.


Write Stall and Write Buffer

When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue as soon as the data are written to the buffer, thereby overlapping processor execution with memory updating.

Processor → Cache → Write Buffer → DRAM

A write buffer is needed between the cache and memory (a small sketch follows the bullets below):

– Processor: writes data into the cache and the write buffer

– Memory controller: writes the contents of the buffer to memory

Write buffer is just a FIFO:

– Typical number of entries: 4
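A small C sketch of such a 4-entry FIFO write buffer (an idealized illustration; real buffers hold whole blocks and more state):

#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { uint32_t addr; uint32_t data; };

struct write_buffer {
    struct wb_entry e[WB_ENTRIES];
    int head, tail, count;                    /* FIFO bookkeeping */
};

/* Processor side: returns 0 if the buffer is full, i.e. the CPU must write stall. */
static int wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return 0;                             /* write stall   */
    wb->e[wb->tail] = (struct wb_entry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return 1;                                 /* CPU continues */
}

/* Memory-controller side: drain the oldest entry to DRAM. */
static int wb_pop(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0)
        return 0;
    *out = wb->e[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return 1;
}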

Write-Miss Policy: Write Allocate vs. Not Allocate

Two options on a write miss

Write allocate – the block is allocated on a write miss, followed by the write hit actions

Write misses act like read misses

No-write allocate – write misses do not affect the cache. The block is modified only in the lower-level memory

Blocks stay out of the cache with no-write allocate until the program tries to read them, but with write allocate even blocks that are only written will still be in the cache.

Write-Miss Policy Example

Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations:

Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100].

What are the numbers of hits and misses (including both reads and writes) when using no-write allocate versus write allocate?

Answer:

No-write allocate:
  Write Mem[100]   write miss
  Write Mem[100]   write miss
  Read  Mem[200]   read miss
  Write Mem[200]   write hit
  Write Mem[100]   write miss
  => 4 misses, 1 hit

Write allocate:
  Write Mem[100]   write miss
  Write Mem[100]   write hit
  Read  Mem[200]   read miss
  Write Mem[200]   write hit
  Write Mem[100]   write hit
  => 2 misses, 3 hits

5.3 Cache Performance

Example: Split Cache vs. Unified Cache

Which has the better avg. memory access time?

A 16-KB instruction cache with a 16-KB data cache (split cache), or

A 32-KB unified cache?

Miss rates:

Size     Instruction Cache   Data Cache   Unified Cache
16 KB    0.4%                11.4%        —
32 KB    —                   —            3.18%

Assume:

• A hit takes 1 clock cycle and the miss penalty is 100 cycles

• A load or store takes 1 extra clock cycle on a unified cache since there is only one cache port

• 36% of the instructions are data transfer instructions.

• About 74% of the memory accesses are instruction references

Answer:

Average memory access time (split)
= % instructions x (Hit time + Instruction miss rate x Miss penalty) + % data x (Hit time + Data miss rate x Miss penalty)
= 74% x (1 + 0.4% x 100) + 26% x (1 + 11.4% x 100) = 4.24

Average memory access time (unified)
= 74% x (1 + 3.18% x 100) + 26% x (1 + 1 + 3.18% x 100) = 4.44


Impact of Memory Access on CPU Performance

Example: Suppose a processor:

– Ideal CPI = 1.0 (ignoring memory stalls)

– Avg. miss rate is 2%

– Avg. memory references per instruction is 1.5

– Miss penalty is 100 cycles

What are the impact on performance when behavior of the cache is included?

Answer:

CPI = CPU execution cycles per instr. + Memory stall cycles per instr.
    = CPI execution + Miss rate x Memory accesses per instr. x Miss penalty

CPI with cache    = 1.0 + 2% x 1.5 x 100 = 4
CPI without cache = 1.0 + 1.5 x 100 = 151

CPU time with cache    = IC x CPI x Clock cycle time = IC x 4.0 x Clock cycle time
CPU time without cache = IC x 151 x Clock cycle time

Without any cache, the CPI of the processor would increase from 1 to 151! Even with the cache (CPI goes from 1 to 4), the processor is stalled waiting for memory 75% of the time!

Impact of Cache Organizations on CPU Performance

Example: What is the impact of two different cache organizations (direct

mapped vs. 2-way set associative) on the performance of a CPU?

– Ideal CPI = 2.0 (ignoring memory stalls)

– Clock cycle time is 1.0 ns

– Avg. memory references per instruction is 1.5

– Cache size: 64 KB, block size: 64 bytes

– For set-associative, assume the clock cycle time is stretched 1.25 times to accommodate the selection multiplexer

– Cache miss penalty is 75 ns

– Hit time is 1 clock cycle

– Miss rate: direct mapped 1.4%; 2-way set-associative 1.0%.

Answer:

• Avg. memory access time (1-way) = 1.0 + (0.014 x 75) = 2.05 ns
  Avg. memory access time (2-way) = 1.0 x 1.25 + (0.01 x 75) = 2.00 ns

• CPU time = IC x (CPI execution x Clock cycle time + Miss rate x Memory accesses per instruction x Miss penalty)
  CPU time (1-way) = IC x (2.0 x 1.0 + (1.5 x 0.014 x 75)) = 3.58 x IC
  CPU time (2-way) = IC x (2.0 x 1.25 + (1.5 x 0.01 x 75)) = 3.63 x IC

Although the 2-way cache has the lower average memory access time, its stretched clock cycle slows every instruction, so its total CPU time is higher.

Summary of Performance Equations


Improving Cache Performance

The next few sections in the text book look at ways to improve cache and memory access times.

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
  (hit time: Section 5.7; miss rate: Section 5.5; miss penalty: Section 5.4)

CPU Time = IC * (CPI Execution + Memory Accesses/Instruction * Miss Rate * Miss Penalty) * Clock Cycle Time

5.4 Reducing Cache Miss Penalty

Time to handle a miss is becoming more and more the controlling factor. This is because of the great improvement in speed of processors as compared to the speed of memory.

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

Five optimizations

1. Multilevel caches

2. Critical word first and early restart

3. Giving priority to read misses over writes

4. Merging write buffer

5. Victim caches

O1: Multilevel Caches

• Approaches

– Make the cache faster to keep pace with the speed of CPUs

– Make the cache larger to overcome the widening gap

L1: fast hits, L2: fewer misses

• L2 equations:

  Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
  Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

  Hit Time_L1 << Hit Time_L2 << ... << Hit Time_Mem;  Miss Rate_L1 < Miss Rate_L2 < ...

• Definitions:

  Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L1, Miss Rate_L2)
    • The L1 cache skims the cream of the memory accesses

  Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1, Miss Rate_L1 x Miss Rate_L2)
    • Indicates what fraction of the memory accesses that leave the CPU go all the way to memory
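A short C sketch applying these L2 equations (the numbers are illustrative assumptions, not data from the text):

#include <stdio.h>

int main(void) {
    /* Assumed, illustrative parameters (in clock cycles) */
    double hit_l1 = 1.0,  local_miss_l1 = 0.04;
    double hit_l2 = 10.0, local_miss_l2 = 0.50;
    double miss_penalty_l2 = 100.0;            /* time to main memory */

    double miss_penalty_l1 = hit_l2 + local_miss_l2 * miss_penalty_l2;
    double amat = hit_l1 + local_miss_l1 * miss_penalty_l1;

    /* Fraction of CPU accesses that go all the way to memory */
    double global_miss_l2 = local_miss_l1 * local_miss_l2;

    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n",
           amat, 100.0 * global_miss_l2);      /* AMAT = 3.40, global = 2.0% */
    return 0;
}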


Design of L2 Cache

•Size

– Since everything in L1 cache is likely to be in L2 cache, L2 cache should be much bigger than L1

•Whether data in L1 is in L2

– novice approach: design L1 and L2 independently

– multilevel inclusion: L1 data are always present in L2

• Advantage: easy for consistency between I/O and cache (checking L2 only)

• Drawback: L2 must invalidate all L1 blocks that map onto the 2nd-level block to be replaced => slightly higher 1st-level miss rate

• e.g. Intel Pentium 4: 64-byte blocks in L1 and 128-byte blocks in L2

– multilevel exclusion: L1 data is never found in L2

• A cache miss in L1 results in a swap of blocks between L1 and L2

• Advantage: prevent wasting space in L2

• e.g. AMD Athlon: 64 KB L1 and 256 KB L2

O2: Critical Word First and Early Restart

Don't wait for the full block to be loaded before restarting the CPU.

Critical word first — Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Early restart — As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

– Given spatial locality, the CPU tends to want the next sequential word, so it is not clear how much early restart helps.
– Generally useful only with large blocks.

O3: Giving Priority to Read Misses over Writes

• Serve reads before writes have been completed

• Write through with write buffers

SW  R3, 512(R0)    ; M[512] <- R3     (cache index 0)
LW  R1, 1024(R0)   ; R1 <- M[1024]    (cache index 0)
LW  R2, 512(R0)    ; R2 <- M[512]     (cache index 0)

Problem: write through with write buffers creates RAW hazards between main memory reads on cache misses and buffered writes (here R2 should receive the value of R3, but if the read of M[512] misses and goes to memory before the write buffer has drained, R2 gets the stale value)

– If we simply wait for the write buffer to empty, the read miss penalty might increase (by 50% on the old MIPS 1000)
– Check the write buffer contents before the read; if there are no conflicts, let the memory access continue

• Write back: suppose a read miss will replace a dirty block

– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, do the read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done

O4: Merging Write Buffer

• If a write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU’s perspective

– Usually a write buffer supports multi-words

• Write merging: the addresses in the write buffer are checked to see whether the address of the new data matches the address of a valid write buffer entry; if so, the new data are combined with that entry

[Figure: a write buffer with 4 entries, each holding four 64-bit words. Left: without merging, four sequential writes occupy four entries. Right: with write merging, the four writes are combined into a single entry.]

• Writing multiple words at the same time is faster than writing them one at a time (a merge-check sketch follows below).
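A minimal C sketch of the merge check (our illustration of the idea with 4 entries of four 64-bit words, not a specific processor's buffer):

#include <stdint.h>

#define ENTRIES     4
#define WORDS       4                      /* 64-bit words per entry */
#define ENTRY_BYTES (WORDS * 8)

struct wb_entry {
    uint64_t base;                         /* entry-aligned address  */
    uint64_t data[WORDS];
    uint8_t  word_valid[WORDS];
    int      used;
};

/* Try to merge a 64-bit store into an existing entry; otherwise take a free one. */
static int wb_write(struct wb_entry buf[ENTRIES], uint64_t addr, uint64_t val) {
    uint64_t base = addr & ~(uint64_t)(ENTRY_BYTES - 1);
    int word = (int)((addr % ENTRY_BYTES) / 8);

    for (int i = 0; i < ENTRIES; i++)
        if (buf[i].used && buf[i].base == base) {   /* address matches: merge */
            buf[i].data[word] = val;
            buf[i].word_valid[word] = 1;
            return 1;
        }

    for (int i = 0; i < ENTRIES; i++)
        if (!buf[i].used) {                          /* allocate a new entry  */
            buf[i] = (struct wb_entry){ .base = base, .used = 1 };
            buf[i].data[word] = val;
            buf[i].word_valid[word] = 1;
            return 1;
        }
    return 0;                                        /* buffer full: stall    */
}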

O5: Victim Caches

Idea of recycling: remember what was most recently discarded because of a cache miss, in case it is needed again
– rather than simply discarding it or swapping it into L2

Victim cache: a small, fully associative cache between a cache and its refill path
– contains only blocks that are discarded from the cache because of a miss ("victims")
– checked on a miss before going to the next lower-level memory
– victim caches of 1 to 5 entries are effective at reducing misses, especially for small, direct-mapped data caches
– AMD Athlon: 8 entries
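A small C sketch of the lookup path (our illustration of the concept with a 4-entry victim cache; block data and the swap with L1 are omitted):

#include <stdint.h>

#define VC_ENTRIES 4

struct vc_entry { uint64_t block_addr; int valid; };

/* On an L1 miss, probe the victim cache before going to the next level.
   Returns the index of the matching entry, or -1 to go on to L2/memory. */
static int victim_lookup(const struct vc_entry vc[VC_ENTRIES], uint64_t block_addr) {
    for (int i = 0; i < VC_ENTRIES; i++)     /* fully associative: check every entry */
        if (vc[i].valid && vc[i].block_addr == block_addr)
            return i;                        /* hit: swap this block back into L1    */
    return -1;
}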


5.5 Reducing Miss Rate

3 C’s of Cache Miss

Compulsory—The first access to a block is not in the cache, so the block

must be brought into the cache. Also called cold start misses or first

reference misses.

(Misses in even an Infinite Cache)

Capacity—If the cache cannot contain all the blocks needed during

execution of a program, capacity misses will occur due to blocks being

discarded and later retrieved. (Misses in Fully Associative Size X Cache)

Conflict—If block-placement strategy is set associative or direct mapped,

conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.

(Misses in N-way Associative but hits in Fully Associative Size X Cache)

3 C’s of Cache Miss

3Cs Absolute Miss Rate (SPEC92)

2:1 Cache Rule

miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2

[Figure: miss rate per type (Compulsory, Capacity, Conflict) vs. cache size from 1 KB to 128 KB, for 1-way, 2-way, 4-way, and 8-way associativity. The compulsory component is vanishingly small.]

3Cs Relative Miss Rate

[Figure: the same data plotted as a percentage (0%–100%) of the total miss rate for each cache size (1 KB to 128 KB) and associativity (1-way to 8-way).]

Flaw: the model assumes a fixed block size. Good: insight => invention.


Five Techniques to Reduce Miss Rate

1. Larger block size

2. Larger caches

3. Higher associativity

4. Way prediction and pseudoassociative caches

5. Compiler optimizations

O1: Larger Block Size

 

[Figure: miss rate (0%–25%) vs. block size (16 to 256 bytes) for cache sizes of 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB.]

• Take advantage of spatial locality: the larger the block, the greater the chance that parts of it will be used again.

• Larger blocks raise the miss penalty, and for a cache of the same size they reduce the number of blocks.

• Larger blocks may increase conflict misses, and even capacity misses if the cache is small.

• Usually high latency and high bandwidth encourage large block size

O2: Larger Caches

[Figure: the same miss rate vs. cache size plot as above (Figures 5.14 and 5.15): increasing cache size reduces capacity misses.]

• Increasing capacity of cache reduces capacity misses (Figure 5.14 and 5.15)

• Drawbacks: possibly longer hit time and higher cost

• Trends: Larger L2 or L3 off-chip caches


O3: Higher Associativity

• Figures 5.14 and 5.15 show how miss rates improve with higher associativity

– 8-way set associative is as effective as fully associative for practical purposes

– 2:1 Cache Rule:

Miss Rate Direct Mapped cache size N = Miss Rate 2-way cache size N/2

• Tradeoff: higher associative cache complicates the circuit

– May have longer clock cycle

• Beware: execution time is the only final measure!
  – Will the clock cycle time increase as a result of having a more complicated cache?
  – Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal (on-chip) cache

O4: Way Prediction & Pseudoassociative Caches

way prediction: extra bits are kept in cache to predict the way, or block within the set of the next cache access

Example: 2-way I-cache of Alpha 21264

– If the predictor is correct, I-cache latency is 1 clock cycle

– If incorrect, tries the other block, changes the way predictor, and has a latency of 3 clock cycles

– prediction accuracy in excess of 85% reduces conflict misses while maintaining the hit speed of a direct-mapped cache

pseudoassociative or column associative

– On a miss, a 2nd cache entry is checked before going to the next lower level

• one fast hit and one slow hit

– Invert the most significant bit of the index field to find the other block in the "pseudoset"

– The miss penalty may become slightly longer


O5: Compiler Optimizations

Improve hit rate by compile-time optimization

• Reordering instructions with profiling information (McFarling[1989])

– Reduce misses by 50% for a 2KB direct-mapped 4-byte-block I-cache, and 75% in an 8KB cache

– Get best performance when it was possible to prevent some instruction from entering the cache

• Aligning basic block: the entry point is at the beginning of a cache block

– Decrease the chance of a cache miss for sequential code

• Loop Interchange: exchanging the nesting of loops

– Improve spatial locality => reduce misses

– Make data be accessed in order => maximize use of data in a cache block before discarded

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];
/* skips through memory in strides of 100 words */

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
/* accesses all words in a cache block before moving on */

Blocking: operating on submatrices or blocks

–Maximize accesses to the data loaded into the cache before replaced –Improve temporal locality

X = Y * Z (matrix multiplication)

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After: B = blocking factor */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

• # of capacity misses depends on N and the cache size
• total # of memory words accessed = 2N^3/B + N^2
• y benefits from spatial locality, z benefits from temporal locality


5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism

Three techniques that overlap the execution of instructions with memory accesses:

1. Nonblocking caches to reduce stalls on cache misses
   • to match out-of-order processors
2. Hardware prefetching of instructions and data
3. Compiler-controlled prefetching

O1: Nonblocking cache to reduce stalls on cache miss

For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss.

• Separate I-cache and D-cache
  – continue fetching instructions from the I-cache while waiting for the D-cache to return the missing data

• Nonblocking cache (lockup-free cache)
  – "hit under miss": the D-cache continues to supply cache hits during a miss
  – "hit under multiple miss" or "miss under miss": overlap multiple misses

[Figure: ratio of average memory stall time for a blocking cache relative to hit-under-miss schemes. For the first 14 (FP) programs the averages are 76% for hit under 1 miss, 51% for 2 misses, and 39% for 64 misses; for the final 4 (INT) programs the averages are 81%, 78%, and 78%.]


O2: Hardware Prefetching of Instructions and Data

Prefetch instructions or data before requested by the CPU

– either directly into the caches or into an external buffer (faster than accessing main memory)

•Instruction prefetch: frequently done in hardware outside cache

– Fetch two blocks on a miss

• the requested block is placed in I-cache when it returns

• the prefetched block is placed in instruction stream buffer (ISB)

• A single ISB would catch 15% to 25% of the misses from a 4 KB direct-mapped I-cache with 16-byte blocks; 4 stream buffers increase the data hit rate to 43% (Jouppi, 1990)

•UltraSPARC III: data prefetch

– If a load hits in the prefetch cache

• the block is read from the prefetch cache

• the next prefetch request is issued: calculating the “stride” of the next prefetched block using the difference between the current address and the previous address

– Up to 8 simultaneous prefetches

Prefetching may interfere with demand misses, lowering performance.

O3: Compiler-Controlled Prefetching

•Compiler-controlled prefetching

– Register prefetch: load the value into a register

– Cache prefetch: load data only into the cache (not register)

•Faulting vs. nonfaulting: the address does or does not cause an exception for virtual address faults and protection violations

– normal load instruction = faulting register prefetch instruction

•Most effective prefetch: “semantically invisible” to a program

– doesn’t change the contents of registers and memory, and

– cannot cause virtual memory faults

•nonbinding prefetch: nonfaulting cache prefetch

– Overlapping execution: CPU proceeds while the prefetched data are being fetched

– Advantage: The compiler may avoid unnecessary prefetches in hardware

– Drawback: prefetch instructions incur instruction overhead
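A sketch of a nonbinding cache prefetch in C using GCC/Clang's __builtin_prefetch intrinsic (one possible realization; the prefetch distance is a tuning assumption, and the slides do not prescribe a particular instruction set):

/* Sum an array while prefetching data a fixed distance ahead. */
#define PREFETCH_AHEAD 16                 /* assumed tuning parameter */

double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* nonfaulting, nonbinding: data go into the cache, not a register */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1 /* low locality */);
        s += a[i];
    }
    return s;
}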

5.7 Reducing Hit Time

• Importance of cache hit time
  – Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
  – More importantly, cache access time limits the clock rate of many processors today!

• Fast hit time:
  – quickly and efficiently find out whether the data is in the cache, and
  – if it is, get the data out of the cache

• Four techniques:
  1. Small and simple caches
  2. Avoiding address translation during indexing of the cache
  3. Pipelined cache access
  4. Trace caches

O1: Small and Simple Caches

•A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address

Guideline: smaller hardware is faster

– Why does the Alpha 21164 have an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache? A small data cache allows a fast clock rate.

Guideline: simpler hardware is faster

–Direct Mapped, on chip

•General design:

– a small and simple cache for the 1st-level cache
– keeping the tags on chip and the data off chip for 2nd-level caches

The recent emphasis is on a fast clock time while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.

O2: Avoiding address translation during cache indexing

• Two tasks: indexing the cache and comparing addresses

• Virtually vs. physically addressed cache
  – virtual cache: use the virtual address (VA) for the cache
  – physical cache: use the physical address (PA) obtained by translating the virtual address

•Challenges to virtual cache

1. Protection: page-level protection (RW/RO/Invalid) must be checked
   – it is normally checked as part of the virtual-to-physical address translation
   – solution: an additional field to copy the protection information from the TLB and check it on every access to the cache

2. Context switching: the same VA in different processes refers to different PAs, requiring the cache to be flushed
   – solution: increase the width of the cache address tag with a process-identifier tag (PID)

3. Synonyms or aliases: two different VAs for the same PA
   – inconsistency problem: two copies of the same data in a virtual cache
   – hardware antialiasing solution: guarantee every cache block a unique PA
     • Alpha 21264: checks all possible locations; if one is found, it is invalidated
   – software page-coloring solution: force aliases to share some address bits
     • Sun's Solaris: all aliases must be identical in the last 18 bits => no duplicate PAs in the cache

4.I/O: typically use PA, so need to interact with cache (see Section 5.12)

Virtually indexed, physically tagged cache

[Figure: three organizations.
 – Conventional organization: CPU → TB (VA→PA) → physically addressed cache → MEM.
 – Virtually addressed cache: CPU → cache accessed with the VA; translation (TB) is done only on a miss; suffers from the synonym problem.
 – Virtually indexed, physically tagged: cache access is overlapped with the VA-to-PA translation, using VA bits for the index and PA bits for the tags, backed by an L2 cache; this requires the cache index to remain invariant across translation.]

O3: Pipelined Cache Access

Simply pipeline the cache access

– a 1st-level cache hit then takes multiple clock cycles

• Trade-off: fast cycle time but slow hits

Example: accessing instructions from I-cache

– Pentium: 1 clock cycle

– Pentium Pro ~ Pentium III: 2 clocks

– Pentium 4: 4 clocks

•Drawback: Increasing the number of pipeline stages leads to

– greater penalty on mispredicted branches and

– more clock cycles between the issue of the load and the use of the data

Note that it increases the bandwidth of instructions rather than decreasing the actual latency of a cache hit

O4: Trace Caches

Trace cache for instructions: find a dynamic sequence of instructions including taken branches to load into a cache block

– The cache blocks contain

• dynamic traces of executed instructions determined by CPU

• rather than static sequences of instructions determined by memory

– branch prediction is folded into the cache: validated along with the addresses to have a valid fetch

– e.g. the Intel NetBurst microarchitecture

•advantage: better utilization

– Trace caches store instructions only from the branch entry point to the exit of the trace

– In a conventional I-cache, when a taken branch enters or exits a long block in the middle, the unused part of the block occupies cache space even though it may never be fetched

•Downside: store the same instructions multiple times

Cache Optimization Summary

5.4 miss penalty

5.5 miss rate

5.6 parallelism

5.7 hit time

Summary

Chapter 5 Memory Hierarchy Design

5.1  Introduction
5.2  Review of the ABCs of Caches
5.3  Cache Performance
5.4  Reducing Cache Miss Penalty
5.5  Reducing Cache Miss Rate
5.6  Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7  Reducing Hit Time
5.8  Main Memory and Organizations for Improving Performance
5.9  Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory