William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken,
NJ. All rights reserved.
Chapter 4
Cache Memory
Figure: Top-level computer structure. The CPU (registers, ALU, and control unit connected by an internal bus) communicates over the system bus with main memory and I/O.
Figure: Control unit internal structure: sequencing logic, control unit registers and decoders, and control memory.
Figure: The memory hierarchy, from inboard memory (registers, cache, main memory) through outboard storage (magnetic disk, CD-ROM, CD-RW, DVD-RW, DVD-RAM, Blu-Ray) to off-line storage (magnetic tape).
Figure: Relative speeds of the memory levels: register file (fastest), cache memory (fast), main memory (less fast), and secondary storage (slow).
Temporal locality: accesses to the same address are typically clustered in time.
Spatial locality: when a location is accessed, nearby locations tend to be accessed also.
Figure: A program's working set plotted over time.
The sketch below illustrates both kinds of locality.
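A minimal sketch of these effects in code, not from the slides (the array name and size are illustrative): summing a matrix row by row touches consecutive addresses and so exploits spatial locality, while summing it column by column strides across whole rows and wastes most of each fetched cache line.

#include <stdio.h>

#define N 1024

static double a[N][N];

double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)        /* consecutive addresses:    */
        for (int j = 0; j < N; j++)    /* each miss loads a line    */
            s += a[i][j];              /* holding the next operands */
    return s;
}

double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)        /* stride of N*8 bytes:      */
        for (int i = 0; i < N; i++)    /* little reuse of each      */
            s += a[i][j];              /* fetched cache line        */
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}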
Figure: Cache read operation. START: receive address RA from the CPU. If the block containing RA is already in the cache, deliver the RA word to the CPU. Otherwise, load the main memory block into a cache line, then deliver the RA word. DONE.
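A minimal sketch of this read flow over a toy direct-mapped cache (all sizes and names here are illustrative, not from the text):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINES      8     /* cache lines            */
#define WORDS_PER  4     /* 32-bit words per line  */

static uint32_t memory[1024];          /* toy main memory (words) */

struct line {
    int      valid;
    uint32_t tag;
    uint32_t data[WORDS_PER];
};
static struct line cache[LINES];

/* Receive address RA; on a miss, load the main-memory block into a
 * cache line; then deliver the RA word. */
uint32_t cache_read(uint32_t ra) {
    uint32_t word  = ra % WORDS_PER;
    uint32_t block = ra / WORDS_PER;
    uint32_t index = block % LINES;
    uint32_t tag   = block / LINES;
    struct line *l = &cache[index];

    if (!l->valid || l->tag != tag) {               /* miss:           */
        memcpy(l->data, &memory[block * WORDS_PER], /* load block into */
               sizeof l->data);                     /* the cache line  */
        l->tag = tag;
        l->valid = 1;
    }
    return l->data[word];                           /* deliver RA word */
}

int main(void) {
    for (uint32_t i = 0; i < 1024; i++) memory[i] = i;
    printf("%u %u\n", cache_read(5), cache_read(5)); /* miss, then hit */
    return 0;
}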
In our example, suppose 95% of the memory accesses are found in level 1. Then the average time to access a word can be expressed as

    Average access time = 0.95(0.01 μs) + 0.05(0.01 μs + 0.1 μs)
                        = 0.0095 + 0.0055 = 0.015 μs

The average access time is much closer to 0.01 μs than to 0.1 μs, as desired.
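The same calculation as a tiny function (a sketch of the formula above; h is the level-1 hit ratio and a miss is charged with both access times):

#include <stdio.h>

static double avg_access(double h, double t1, double t2) {
    return h * t1 + (1.0 - h) * (t1 + t2);  /* a miss pays both levels */
}

int main(void) {
    printf("%.4f us\n", avg_access(0.95, 0.01, 0.1));  /* 0.0150 us */
    return 0;
}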
Figure: Average access time of a two-level memory versus the fraction of accesses involving only level 1 (the hit ratio): the curve falls from T1 + T2 at hit ratio 0 toward T1 at hit ratio 1.
Figure: Two L2 cache organizations. (a) Level 2 between level 1 and main memory; (b) Level 2 connected to a "backside" bus.
Example: the average CPI for an L1/L2 cache system with no misses is 1.0. L1 has a hit rate of 95% and L2 a hit rate of 80%. The miss penalty for L1 is 8 cycles and for L2 is 60 cycles; 1 cycle takes 0.1 ns. What is the average access time, or C_effective?

Level  Local hit rate  Miss penalty
L1     95%             8 cycles
L2     80%             60 cycles

Figure: CPU registers feed a Level-1 cache (95% of accesses hit there), backed by a Level-2 cache (4% of accesses are satisfied there, after an 8-cycle penalty) and main memory (the remaining 1%, after a further 60-cycle penalty).
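One way to work this out (a sketch, charging the 8-cycle L2 penalty on every L1 miss and the further 60-cycle main-memory penalty on the 1% of accesses that miss in both caches):

    C_effective = 1 + 0.05 × 8 + (0.05 × 0.20) × 60
                = 1 + 0.4 + 0.6 = 2.0 cycles

At 0.1 ns per cycle, the average access time is 0.2 ns.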
Table 4.2
Elements of Cache Design
Cache Memory Design Parameters
Cache size (in bytes or words). A larger cache can hold more of the program's useful data but is more costly and likely to be slower.
Block or cache-line size (the unit of data transfer between cache and main memory). With a larger cache line, more data is brought into the cache on each miss. This can improve the hit rate, but it may also bring in low-utility data.
Placement policy. Determines where an incoming cache line is stored. More flexible policies imply higher hardware cost and may or may not have performance benefits (because locating data becomes more complex).
Replacement policy. Determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten. Typical policies: choose a random block, or the least recently used one.
Write policy. Determines whether updates to cache words are immediately forwarded to main memory (write-through) or modified blocks are copied back to main memory if and when they must be replaced (write-back or copy-back). The sketch after this list shows how these parameters determine the address split.
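A minimal sketch, not from the text, of how cache size, line size, and associativity determine the address split; the numbers chosen reproduce the direct-mapping example used later (16-MByte memory, 16K lines of 4 bytes):

#include <stdio.h>

static int log2i(unsigned x) {          /* x assumed a power of two */
    int n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    unsigned cache_bytes = 16384 * 4;   /* 16K lines of 4 bytes     */
    unsigned line_bytes  = 4;
    unsigned ways        = 1;           /* 1 = direct mapped        */
    unsigned addr_bits   = 24;          /* 16-MByte main memory     */

    unsigned sets       = cache_bytes / (line_bytes * ways);
    int      offset_bits = log2i(line_bytes);
    int      index_bits  = log2i(sets);
    int      tag_bits    = addr_bits - index_bits - offset_bits;

    /* prints: sets=16384 offset=2 index=14 tag=8 */
    printf("sets=%u offset=%d index=%d tag=%d\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}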
Mapping Function
Because there are fewer cache lines than main memory
blocks, an algorithm is needed for mapping main memory
blocks into cache lines
Figure: Cache and main memory structure. (a) A cache of C lines (0 to C–1), each holding one block of K words plus a tag. (b) A main memory of 2^n addressable words (word length shown), viewed as M blocks of K words each (block 0 to block M – 1).
Figure: Direct-mapping cache organization. The line field of the s+w-bit address selects a single cache line Li; that line's stored tag is compared with the (s – r)-bit tag field of the address (1 if match: hit in cache; 0 if no match: miss in cache). On a hit, the word field w selects one of the words W4j .. W(4j+3) of the mapped block Bj.
Figure: Direct mapping example. A 16-MByte main memory (24-bit addresses, shown in binary) maps into a 16-Kline cache with 8-bit tags and 32-bit data per line (line numbers 0000-3FFF). Memory address values are in binary representation; all other values are in hexadecimal.
Figure: A direct-mapped cache holding 32 words within eight 4-word lines (word-address ranges 64-67, 68-71, 72-75, ...). Each line is associated with a tag and a valid bit.
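A sketch of the field split for the direct-mapping example above (24-bit addresses, 2-bit word field, 14-bit line number, 8-bit tag); the helper name is illustrative:

#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 2
#define LINE_BITS 14

static void split(uint32_t addr) {
    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);
    printf("addr %06X -> tag %02X line %04X word %u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)line, (unsigned)word);
}

int main(void) {
    split(0x160000);  /* tag 16, line 0000, word 0, as in the example */
    split(0x160004);  /* tag 16, line 0001, word 0                    */
    return 0;
}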
Direct Mapping Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = m = 2^r
Size of tag = (s – r) bits
Figure: Associative-mapping cache organization. The s-bit tag of the address is compared simultaneously with the tag of every line L0 .. Lm–1 (1 if any line matches: hit in cache; 0 if none matches: miss in cache). On a hit, the word field w selects one of the words W4j .. W(4j+3) of the matching block Bj.
Figure: Associative mapping example. A 16-Kline cache with 22-bit tags and 32-bit data per line (line numbers 0000-3FFF) can hold blocks from anywhere in a 16-MByte main memory; for example, the word at address 000101100011001110011100 (data FEDCBA98) is cached under tag 058CE7. Memory address values are in binary representation; other values are in hexadecimal.
Main memory address = Tag (22 bits) | Word (2 bits)
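A sketch of the lookup this figure depicts. In hardware the tag is compared against every line in parallel; the loop below stands in for those parallel comparators, and for brevity each line holds a single data word (contents seeded from the example):

#include <stdint.h>
#include <stdio.h>

#define LINES 16384

struct line { int valid; uint32_t tag; uint32_t data; };
static struct line cache[LINES];

/* 24-bit address = 22-bit tag + 2-bit word, as in the example */
int assoc_lookup(uint32_t addr, uint32_t *out) {
    uint32_t tag = addr >> 2;
    for (int i = 0; i < LINES; i++) {          /* every line checked */
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data;              /* hit in cache       */
            return 1;
        }
    }
    return 0;                                  /* miss in cache      */
}

int main(void) {
    cache[1].valid = 1;
    cache[1].tag   = 0x058CE7;     /* tag of address 0x16339C above */
    cache[1].data  = 0xFEDCBA98;
    uint32_t v = 0;
    printf("%d %08X\n", assoc_lookup(0x16339C, &v), (unsigned)v);
    return 0;
}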
Figure: k-way set-associative cache organization. The d-bit set field of the s+w-bit address selects one set (Set 0, Set 1, ...); the (s – d)-bit tag is compared with the tags of the k lines F0 .. Fk–1 of that set only (0 if no line matches: miss in cache). On a hit, the word field selects from the mapped block Bj.
Figure: Two-way set-associative mapping example. A 16-MByte main memory maps into a cache of two-way sets with 9-bit tags, a 13-bit set number, and a 2-bit word field: for example, addresses with tag 02C, such as 000101100000000000000000 (data 77777777), map to set 0000. Memory address values are in binary representation; other values are in hexadecimal.
Figure: Two-way set-associative cache lookup (word-address ranges 64-67, 80-83, 96-99, 112-115, ...). The tag and the specified word are read from each option; valid bits qualify the two parallel tag comparisons, which select the data out on a hit and signal a cache miss otherwise.
Number of sets = v = 2^d
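A sketch of the set-associative lookup, using the two-way example's 9-bit tag / 13-bit set / 2-bit word split (one data word per line for brevity; names are illustrative):

#include <stdint.h>
#include <stdio.h>

#define WAYS      2
#define SET_BITS  13                     /* v = 2^13 sets */
#define SETS      (1u << SET_BITS)

struct line { int valid; uint32_t tag; uint32_t data; };
static struct line cache[SETS][WAYS];

int set_assoc_lookup(uint32_t addr, uint32_t *out) {
    uint32_t set = (addr >> 2) & (SETS - 1);   /* 2-bit word offset */
    uint32_t tag = addr >> (2 + SET_BITS);     /* 9-bit tag         */
    for (int w = 0; w < WAYS; w++) {           /* k lines per set   */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *out = cache[set][w].data;         /* hit               */
            return 1;
        }
    }
    return 0;                                  /* miss              */
}

int main(void) {
    uint32_t v = 0;
    printf("%d\n", set_assoc_lookup(0x160000, &v));  /* 0: cache empty */
    return 0;
}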
For a given cache size, the following design issues and tradeoffs exist:
Line width (2^W). Too small a value for W causes a lot of main memory accesses; too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast; greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses.
Line replacement policy. Usually the LRU (least recently used) algorithm or some approximation thereof; not an issue for direct-mapped caches. Somewhat surprisingly, random selection works quite well in practice.
Write policy. Modern caches are very fast, so write-through is seldom a good choice. We usually implement write-back or copy-back, using write buffers to soften the impact of main memory latency.
Effect of Associativity on Cache Performance
Figure: Miss rate (0 to 0.3) versus associativity (direct-mapped, 2-way, 4-way, 8-way, 16-way, 32-way, 64-way). Performance improvement of caches with increased associativity.
Figure: Hit ratio (0.0 to 1.0) versus cache size (1k to 1M bytes) for direct-mapped and 2-, 4-, 8-, and 16-way set-associative caches.
Once the cache has been filled, when a new block is brought into the cache one of the existing blocks must be replaced.
For direct mapping there is only one possible line for any particular block, so no choice is possible.
First-in-first-out (FIFO): replace the block in the set that has been in the cache longest. Easily implemented as a round-robin or circular buffer technique (see the sketch below).
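A sketch of that round-robin implementation (set and way counts are illustrative): each set keeps a pointer to the next victim way, advanced after every replacement, so ways are evicted in the order they were filled.

#include <stdint.h>
#include <stdio.h>

#define SETS 8192
#define WAYS 4

/* One circular-buffer pointer per set: the next way to replace. */
static uint8_t next_victim[SETS];

int fifo_choose_victim(int set) {
    int way = next_victim[set];
    next_victim[set] = (uint8_t)((way + 1) % WAYS);  /* advance pointer */
    return way;
}

int main(void) {
    for (int i = 0; i < 6; i++)                /* picks 0,1,2,3,0,1 */
        printf("victim way %d\n", fifo_choose_victim(0));
    return 0;
}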
If at least one write operation has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block.
A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache: if a word is altered in one cache, it could conceivably invalidate a word in other caches.
Write back:
Minimizes memory writes
Updates are made only in the cache; a dirty bit marks modified lines (see the sketch below)
Portions of main memory are invalid, and hence accesses by I/O modules can be allowed only through the cache
This makes for complex circuitry and a potential bottleneck
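A minimal sketch of write-back with a dirty bit (toy sizes and illustrative helper names, not the book's code): writes touch only the cache, and a modified line is copied back to main memory only when it is evicted.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINES     8
#define LINE_SIZE 16

struct line {
    int      valid, dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
};
static struct line cache[LINES];
static uint8_t memory[1 << 16];

static void evict(int i) {
    struct line *l = &cache[i];
    if (l->valid && l->dirty) {                       /* copy back only */
        uint32_t base = (l->tag * LINES + i) * LINE_SIZE; /* if dirty   */
        memcpy(&memory[base], l->data, LINE_SIZE);
    }
    l->valid = l->dirty = 0;
}

void cache_write_byte(uint32_t addr, uint8_t v) {
    uint32_t off   = addr % LINE_SIZE;
    uint32_t block = addr / LINE_SIZE;
    int      i     = block % LINES;
    uint32_t tag   = block / LINES;
    struct line *l = &cache[i];
    if (!l->valid || l->tag != tag) {           /* miss: fetch block */
        evict(i);
        memcpy(l->data, &memory[block * LINE_SIZE], LINE_SIZE);
        l->tag = tag; l->valid = 1;
    }
    l->data[off] = v;                           /* update cache only */
    l->dirty = 1;                               /* mark modified     */
}

int main(void) {
    cache_write_byte(0x0010, 0xAB);        /* write goes to cache only */
    printf("mem=%02X\n", memory[0x0010]);  /* still 00: not written    */
    return 0;
}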
The on-chip cache reduces the processor's external bus activity, shortening execution time and increasing overall system performance.
When the requested instruction or data is found in the on-chip cache, the bus access is eliminated.
On-chip cache accesses will complete appreciably faster than would even zero-wait-state bus cycles.
During this period the bus is free to support other transfers.
Two-level cache:
Internal cache designated as level 1 (L1)
External cache designated as level 2 (L2)
Potential savings due to the use of an L2 cache depends on the hit rates
in both the L1 and L2 caches
The use of multilevel caches complicates all of the design issues related
to caches, including size, replacement algorithm, and write policy
Figure 4.17: Total hit ratio (L1 and L2) for 8-Kbyte (L1 = 8k) and 16-Kbyte (L1 = 16k) L1 caches, plotted against L2 cache size (1k to 2M bytes); the total hit ratio ranges from about 0.78 to 0.98.