Memory hierarchy levels (upper level = smaller and faster):
- Registers: transfer unit is instructions and operands
- Cache: transfer unit is blocks (lines)
- Main memory: transfer unit is pages
- Disk: transfer unit is files
CPE 432
Taking Advantage of Locality
- Organize the computer memory system in a hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to the CPU
Chapter 5 Large and Fast: Exploiting Memory Hierarchy
Memory Hierarchy Levels
- Block (aka line): unit of copying
  - May be multiple words
- How do we know if the data is present?
- Where do we look?
32-bit address breakdown: Tag = bits 31-10 (22 bits), Index = bits 9-4 (6 bits), Offset = bits 3-0 (4 bits)
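The field extraction above can be sketched with shifts and masks; this is an illustrative snippet (not from the slides), assuming the 22/6/4 split shown:

```python
# Extract cache-lookup fields from a 32-bit address,
# assuming the 22-bit tag / 6-bit index / 4-bit offset split above.
OFFSET_BITS = 4   # 16-byte blocks
INDEX_BITS = 6    # 64 sets

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

The index selects the set to look in, and the tag is compared against the stored tag to decide whether the data is present.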
n-way Set Associative Caches
- Organize the cache into sets (groups) of blocks
- Each set contains n cache blocks
- The main memory block number determines which set in the cache will store the memory block:
  - (Memory block number) modulo (#Sets in cache)
- All cache blocks in a given set are searched simultaneously
- n comparators are needed (less expensive than a fully associative cache organization)
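The set selection and parallel tag search can be sketched as follows; this is a minimal illustrative model (the set sizes, FIFO eviction, and function names are assumptions, not from the slides):

```python
# Minimal sketch of an n-way set-associative lookup: the block number
# picks the set, and all tags in that set are compared (in hardware,
# by n parallel comparators).
NUM_SETS = 4
N_WAYS = 2

cache = [[] for _ in range(NUM_SETS)]  # each set holds up to N_WAYS tags

def access(block_number):
    """Return True on hit, False on miss (filling the block on a miss)."""
    set_index = block_number % NUM_SETS        # (block number) mod (#sets)
    tag = block_number // NUM_SETS
    ways = cache[set_index]
    if tag in ways:                            # all ways compared at once
        return True
    if len(ways) == N_WAYS:                    # set full: evict one block
        ways.pop(0)                            # FIFO eviction for simplicity
    ways.append(tag)
    return False
```

Blocks 0, 4, and 8 all map to set 0 here, so they compete for the same two ways while other sets stay empty.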
Associative Cache Example
Access sequence of block addresses: 0, 8, 0, 6, 8. Number of misses = 3.

Block address | Hit/miss | Cache content after access
            0 | miss     | Mem[0]
            8 | miss     | Mem[0] Mem[8]
            0 | hit      | Mem[0] Mem[8]
            6 | miss     | Mem[0] Mem[8] Mem[6]
            8 | hit      | Mem[0] Mem[8] Mem[6]
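The table above can be replayed in a few lines; this sketch assumes a fully associative cache with LRU replacement and a capacity of 4 blocks (large enough that no eviction occurs in this sequence):

```python
# Replaying the access sequence 0, 8, 0, 6, 8 on a fully associative
# cache with LRU replacement (assumed 4-block capacity).
from collections import OrderedDict

CAPACITY = 4
cache = OrderedDict()  # block address -> None, ordered from LRU to MRU

def access(block):
    if block in cache:
        cache.move_to_end(block)       # refresh LRU order on a hit
        return "hit"
    if len(cache) == CAPACITY:
        cache.popitem(last=False)      # evict the least-recently used block
    cache[block] = None
    return "miss"

results = [access(b) for b in (0, 8, 0, 6, 8)]
# results reproduces the table: miss, miss, hit, miss, hit (3 misses)
```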
Set Associative Cache Organization
Block Replacement Policy
- Direct mapped: no choice
- Set associative:
  - Prefer an invalid block, if there is one
  - Otherwise, choose among the blocks in the set
- Least-recently used (LRU)
  - Choose the block unused for the longest time
  - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity
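The victim-selection rules above can be sketched as one function; this is an illustrative model (the `Block` class, timestamp-based LRU, and function names are assumptions, not from the slides):

```python
# Choosing a victim block within a set: prefer an invalid block;
# otherwise fall back to LRU or random replacement.
import random

class Block:
    def __init__(self):
        self.valid = False
        self.last_used = 0   # timestamp of the most recent access

def choose_victim(blocks, policy="lru"):
    # Prefer an invalid block, if there is one
    for i, b in enumerate(blocks):
        if not b.valid:
            return i
    if policy == "lru":
        # Otherwise choose the block unused for the longest time
        return min(range(len(blocks)), key=lambda i: blocks[i].last_used)
    return random.randrange(len(blocks))   # random replacement

# Example: a 4-way set where way 2 is invalid, so it is chosen first
ways = [Block() for _ in range(4)]
for i, b in enumerate(ways):
    b.valid = (i != 2)
    b.last_used = i
```

Real hardware approximates LRU with a few state bits per set rather than full timestamps, which is why exact LRU becomes too hard beyond 4-way.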
- Misses depend on memory access patterns
  - Program behavior
  - Compiler optimization for memory access
- Hardware caches
  - Reduce comparisons to reduce cost
- Virtual memory
  - Full table lookup makes full associativity feasible
  - Benefit in reduced miss rate
32-bit address breakdown: Tag = bits 31-14 (18 bits), Index = bits 13-4 (10 bits), Offset = bits 3-0 (4 bits)
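The cache geometry follows directly from this split; a small worked check, assuming a direct-mapped organization (one block per set):

```python
# Deriving the cache geometry from the 18/10/4 address split above,
# assuming a direct-mapped organization.
OFFSET_BITS = 4
INDEX_BITS = 10
ADDRESS_BITS = 32

block_size_bytes = 2 ** OFFSET_BITS             # 4 offset bits -> 16-byte blocks
num_sets = 2 ** INDEX_BITS                      # 10 index bits -> 1024 sets
tag_bits = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # 18 tag bits
data_capacity = num_sets * block_size_bytes     # bytes of data storage
```

This gives 1024 sets of 16-byte blocks, i.e. 16 KiB of data (tag and valid bits add further storage on top).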
Interface Signals

CPU <-> Cache:
- Read/Write
- Valid (when set to 1, indicates that there is a cache operation)
- Address (32 bits)
- Write Data (32 bits)
- Read Data (32 bits)
- Ready

Cache <-> Memory:
- Read/Write
- Valid (when set to 1, indicates that there is a memory operation)
- Address (32 bits)
- Write Data (128 bits)
- Read Data (128 bits)
- Ready
CSE431 Chapter 5A.79 Irwin, PSU, 2008
5.8 Parallelism and Memory Hierarchies: Cache Coherence
Cache Coherence in Multicores
- In multicore processors, it is likely that the cores will share a common physical address space, causing a cache coherence problem.
- Example: Core 1 and Core 2 each read X (initially 0) from the unified (shared) L2 into their L1 data caches. Core 1 then writes 1 to X: Core 1's L1 D$ now holds X = 1, but Core 2's L1 D$ still holds the stale value X = 0.
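The stale-copy scenario above can be modeled in a few lines; this is a hypothetical software model, not real hardware, with private L1 caches, a write-through shared L2, and no coherence protocol:

```python
# Model of the incoherence example: private L1 caches over a shared L2,
# with no coherence mechanism, so a write leaves stale copies behind.
shared_l2 = {"X": 0}
l1 = [dict(), dict()]   # private L1 data caches for core 0 and core 1

def read(core, addr):
    if addr not in l1[core]:          # L1 miss: fill from the shared L2
        l1[core][addr] = shared_l2[addr]
    return l1[core][addr]

def write(core, addr, value):
    l1[core][addr] = value            # update only the writer's own L1
    shared_l2[addr] = value           # write through to the shared L2

read(0, "X")           # core 0 caches X = 0
read(1, "X")           # core 1 caches X = 0
write(0, "X", 1)       # core 0 writes 1 to X
stale = read(1, "X")   # core 1 hits on its old copy and still sees 0
```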
Coherence Defined
- Informally: if caches are coherent, then reads return the most recently written value
- Formally:
  - If core 1 writes X and then core 1 reads X (with no intervening writes), the read returns the written value
  - If core 1 writes X and then core 2 reads X (sufficiently later), the read returns the written value
  - If core 1 writes X and then core 2 writes X, cores 1 and 2 see the writes in the same order
    - All cores end up with the same final value for X
A Coherent Memory System
- Any read of a data item should return the most recently written value of that data item
  - Coherence defines what values can be returned by a read
    - Writes to the same location are serialized (two writes to the same location must be seen in the same order by all cores)
  - Consistency determines when a written value will be returned by a read
A Coherent Memory System (cont.)
- To enforce coherence, caches must provide
  - Replication of shared data items in multiple cores' caches (e.g., without accessing the L2 cache in our example)
    - Replication reduces both the latency of access and contention for a read-shared data item
  - Migration of shared data items to a core's local cache
    - Migration reduces both the latency of accessing the data and the bandwidth demand on the shared memory (L2 in our example)
Cache Coherence Protocols
- A hardware protocol is needed to ensure cache coherence
- The most popular class of cache coherence protocols is snooping
  - The cache controllers monitor (snoop on) the broadcast medium (e.g., a bus) to determine whether their cache has a copy of a requested block
- Write-invalidate protocol: writes require exclusive access and invalidate all other copies
  - Exclusive access ensures that no other readable or writable copies of an item exist
Cache Coherence Protocols (cont.)
- If two cores attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated
- For the other core to complete its write, it must obtain a new copy of the data, which now contains the updated value, thus enforcing write serialization
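The write-invalidate behavior can be sketched by extending the earlier incoherent model with a snooped "bus"; this is a simplified illustration, not a full MSI/MESI implementation, and the function names are assumptions:

```python
# Sketch of a write-invalidate snooping protocol: every write is
# broadcast, and all other caches holding the block invalidate their copy.
shared_l2 = {"X": 0}
l1 = [dict(), dict()]   # private L1 caches for core 0 and core 1

def snoop_invalidate(writer, addr):
    """All caches except the writer's snoop the bus and drop addr."""
    for core, cache in enumerate(l1):
        if core != writer and addr in cache:
            del cache[addr]

def read(core, addr):
    if addr not in l1[core]:          # miss: fill from the shared L2
        l1[core][addr] = shared_l2[addr]
    return l1[core][addr]

def write(core, addr, value):
    snoop_invalidate(core, addr)      # gain exclusive access first
    l1[core][addr] = value
    shared_l2[addr] = value           # write through to the shared L2

read(0, "X")           # core 0 caches X = 0
read(1, "X")           # core 1 caches X = 0
write(0, "X", 1)       # broadcast invalidates core 1's copy
value = read(1, "X")   # core 1 misses, re-fetches, and sees X = 1
```

Invalidating on write is what restores coherence here: the stale copy is removed, so the next read is forced to fetch the updated value.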
Example of Snooping Invalidation
- Core 1 and Core 2 each read X (initially 0) from the unified (shared) L2 into their L1 data caches.
- Core 1 writes 1 to X: the write is broadcast, Core 2's copy of X is invalidated, and Core 1's L1 D$ holds X = 1.
- When Core 2 next reads X, it misses, fetches the updated block, and sees X = 1.
Skip sections 5.9, 5.10, 5.11