
AC14L08

Memory hierarchy
A structure that uses multiple levels of memories; as the distance from the processor
increases, the size of the memories and the access time both increase.

The gap in performance between Processors and DRAM memory

Memory devices -- typical values for 2012

Levels in a typical memory hierarchy


Pairs of levels -- upper and lower level.
Why hierarchy works -- The principle of locality
Temporal locality: if an item is referenced, it will tend to be referenced again soon
Spatial locality: if an item is referenced, items whose addresses are close by will tend to
be referenced soon

Pipeline -- Physical (not virtual) cache and memory addresses


Cache: a safe place for hiding or storing things.
Cache memory -- a small amount of fast memory between two levels of the memory hierarchy,
-- to bridge the gap in access times
Terminology
block (or line) -- minimum unit of information that can be either present or not in a cache.
hit rate
The fraction of memory accesses found in a level of the memory hierarchy.
cache miss Requested data is not present in the cache.
miss rate
Fraction of memory accesses not found in a level of the memory hierarchy.
hit time
Time required to access a level of the memory hierarchy, including the time
needed to determine whether the access is a hit or a miss.
miss penalty Time required to fetch a block into a level of the memory hierarchy from the lower
level, including the time to access the block, transmit it from one level to the other,
insert it in the level that experienced the miss, and then pass the block to the
requestor.

Cache -- main memory organization in blocks

Cache Read Operation

Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty.
Performance: to reduce AMAT, reduce the miss rate, the miss penalty, or the hit time
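A quick worked example (the numbers are assumed for illustration, not taken from the lecture): with a hit time of 1 cycle, a miss rate of 5% and a miss penalty of 100 cycles,
AMAT = 1 + 0.05 x 100 = 6 cycles;
halving the miss rate to 2.5% gives AMAT = 1 + 0.025 x 100 = 3.5 cycles.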
Cache Parameters
Cache Addresses
Logical
Physical
Cache Size
Only the data bytes
Mapping Function
Direct
Associative
Set associative
Replacement Algorithm
Least recently used (LRU)
First in first out (FIFO)
Least frequently used (LFU)
Random
Write Policy
Write through
Write back
Write-allocate
Write no-allocate
Line (block)Size
Bytes sharing a tag
Number of Caches
Single or two level
Unified or split
Elements of Cache Design
Direct-mapped cache
-- each memory location is mapped to exactly one location in the cache

Mapping main memory- direct mapped cache


Direct-mapped cache with 8 entries
Memory words -- that map to the same cache locations
Direct-mapped cache address mapping to find a block:
(Block address) modulo (Number of blocks in the cache)
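A minimal C sketch of this mapping, assuming a 32-bit byte address, 2^m words per block and 2^n blocks (the parameter values below are illustrative, not from the slides):

#include <stdint.h>

/* Illustrative parameters: 2^N_BITS blocks, 2^M_BITS words per block. */
#define N_BITS 10   /* index bits -> 1024 blocks   */
#define M_BITS 2    /* word bits  -> 4 words/block */

typedef struct {
    uint32_t tag;    /* upper 32 - (N_BITS + M_BITS + 2) bits */
    uint32_t index;  /* selects the cache block               */
    uint32_t word;   /* word within the block                 */
    uint32_t byte;   /* byte within the word                  */
} cache_addr_t;

/* Split a byte address into the fields used by a direct-mapped cache. */
static cache_addr_t split_address(uint32_t addr)
{
    cache_addr_t a;
    a.byte  = addr & 0x3;                                     /* 2 byte-offset bits      */
    a.word  = (addr >> 2) & ((1u << M_BITS) - 1);             /* word within the block   */
    a.index = (addr >> (2 + M_BITS)) & ((1u << N_BITS) - 1);  /* (block address) mod 2^n */
    a.tag   = addr >> (2 + M_BITS + N_BITS);                  /* remaining upper bits    */
    return a;
}

The index computation is exactly the modulo above because the number of blocks is a power of two, so the modulo reduces to keeping the low n bits of the block address.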

Direct mapped cache with 8 sets


8-entry x (1+27+32)-bit SRAM
Accessing the cache memory

Direct-mapped 4kB (data) cache block diagram


The address fields

The three portions of an address -- set-associative or direct-mapped cache.


Tag field -- used to compare with the value of the tag field of the cache
Cache index -- used to select the block
Byte offset -- selects the byte within the block (one word here)
Cache size
32-bit addresses
A direct-mapped cache
The cache size is 2^n blocks -- n bits are used for the index
The block size is 2^m words (2^(m+2) bytes) --
m bits are used for the word within the block, and
2 bits are used for the byte part of the address
The size of the tag field is
32 - (n + m + 2).
The total number of bits in a direct-mapped cache is
2^n x (block size + tag size + valid field size).
The naming convention is -- to exclude the size of the tag and valid field and to count only the
size of the data.
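A worked example of this counting (the parameters are chosen for illustration): a direct-mapped cache holding 16 KB of data with 4-word blocks and 32-bit addresses has 2^12 words = 2^10 blocks, so n = 10 and m = 2. The tag is 32 - (10 + 2 + 2) = 18 bits, and the total storage is 2^10 x (4 x 32 + 18 + 1) = 2^10 x 147 = 147 Kibits, about 18.4 KB -- but by the convention above it is called a 16 KB cache.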
Reducing Cache Misses by More Flexible Placement of Blocks -- Cache organization
Direct mapped cache -- a block can be placed in exactly one location
Fully associative cache -- a block can be placed in any location in the cache
Set associative cache -- a block can be placed in a fixed number of locations
A set-associative cache with n locations for a block -- n-way set-associative cache.

The location of a memory block whose address is 12 in a cache with 8 blocks


direct mapped - 1 cache block where memory block 12 can be found, (12 modulo 8) = 4
two-way set-associative, 4 sets, and memory block 12 must be in set (12 mod 4) = 0;
in either element of the set.
fully associative placement, the memory block 12 can appear in any of the 8 cache blocks
Advantage -of increasing the degree of associativity -- usually decreases the miss rate
Disadvantage -- a potential increase in the hit time.

8-block cache with different degrees of associativity

Set Index in Cache

Locating a Block in the Cache

Direct mapped cache with two sets and a four-word block size

address 0x8000009C

Cache contents with a block size (b) of four words

Direct-mapped cache with multiple data words per tag -- taking advantage of spatial locality


64KB cache, 4k blocks, 4 words per block;
byte offset ignored, next 2 bits block offset, next 12 bits index into cache

Two-way set associative cache

4-way set-associative cache. Size of cache: 1K blocks = 256 sets x 4 blocks/set, i.e.
a 4 KB cache: 4 x 256 blocks, 1 word/block
The output enable signals of the cache RAMs, driven by the tag comparators, select the entry
in the set that drives the output and thus deliver the required data.
This organization eliminates the need for the multiplexor.
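A minimal C sketch of an n-way set-associative lookup matching the example above (256 sets, 4 ways, 1 word per block; the structure names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WAYS  4
#define SETS  256

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;          /* one word per block in this example */
} cache_line_t;

static cache_line_t cache[SETS][WAYS];

/* Look up a word; returns true on a hit and stores the word in *out. */
static bool cache_lookup(uint32_t addr, uint32_t *out)
{
    uint32_t index = (addr >> 2) & (SETS - 1);   /* 8 set-index bits           */
    uint32_t tag   = addr >> (2 + 8);            /* remaining bits are the tag */

    for (int way = 0; way < WAYS; way++) {       /* hardware compares all ways in parallel */
        cache_line_t *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *out = line->data;
            return true;                         /* hit: this way drives the output */
        }
    }
    return false;                                /* miss */
}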
Advantages of Set associative caches
Higher Hit rate for the same cache size - reduced Conflict Misses.
Disadvantages of Set Associative Caches
N-way Set Associative Cache versus Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data comes AFTER Hit/Miss decision and set selection
In a direct mapped cache, Cache Block is available BEFORE Hit/Miss
Possible to assume a hit and continue. Recover later if miss.
Fully Associative Cache

Eight-block fully associative cache


No Cache Index - Compare the Cache Tags of all cache entries in parallel.
Usually implemented using content addressable memory (CAM).
By definition: Conflict Miss = 0 for a fully associative cache.

Choosing Which Block to Replace at a miss -- replacement policy


Replacement Algorithm Least recently used (LRU)
First in first out (FIFO)
Least frequently used (LFU)
Random
Direct mapping -- one possible line for any particular block -- no choice is possible.
Associative and set-associative mapping hardware replacement algorithm
Least recently used (LRU) -- replace the block that is least recently used -- cost
For two-way set associative, each block includes a USE bit.
When a block is referenced, its USE bit is set to 1 and the USE bit of the other is set to 0.
When a new block is to be brought into the set, the line whose USE bit is 0 is replaced.
For more associativity -- LRU is difficult/expensive
Pseudo LRU

Practical pseudo-LRU -- use a binary tree; each node records which half is older/newer
Update nodes on each reference
Follow older pointers to find LRU victim
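A minimal C sketch of tree pseudo-LRU for one 4-way set, using three node bits; the bit encoding here is one common convention and only illustrative:

#include <stdint.h>

/* Tree pseudo-LRU state for a 4-way set: three bits.
 * bits[0] : which half is older             (0 = ways 0-1, 1 = ways 2-3)
 * bits[1] : within ways 0-1, which is older (0 = way 0, 1 = way 1)
 * bits[2] : within ways 2-3, which is older (0 = way 2, 1 = way 3)
 */
typedef struct { uint8_t bits[3]; } plru4_t;

/* On a reference, update the nodes so they point AWAY from the accessed way. */
static void plru4_touch(plru4_t *s, int way)
{
    if (way < 2) {
        s->bits[0] = 1;                          /* the right half is now the older one */
        s->bits[1] = (uint8_t)(way ^ 1);
    } else {
        s->bits[0] = 0;                          /* the left half is now the older one  */
        s->bits[2] = (uint8_t)((way - 2) ^ 1);
    }
}

/* Follow the "older" pointers to find the victim way. */
static int plru4_victim(const plru4_t *s)
{
    return (s->bits[0] == 0) ? s->bits[1] : 2 + s->bits[2];
}

With n ways this needs only n-1 bits per set, which is why it scales better than true LRU.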
LRU for a fully associative cache.
The cache mechanism maintains a separate list of indexes to all the lines in the cache.
When a line is referenced, it moves to the front of the list.
For replacement, the line at the back of the list is used.
First-in-first-out (FIFO) -- Replace that block in the set that has been in the cache longest.
FIFO is easily implemented as a round-robin or circular buffer technique.
Least Frequently Used (LFU) -- Counter per block, incremented on reference
Evictions choose lowest count
Random -- Victim blocks are randomly selected
Simulation -- random replacement provides only slightly inferior performance to an
algorithm based on usage
Handling Writes
Write Policy Choices
Write through -- write to both the cache and the lower-level memory.
Slow - always requires memory write, higher traffic
Performance improved with a write buffer -- data is held in the buffer while waiting to be written to
memory (by the MMU); the processor can continue execution until the write buffer is full.
+ Read misses cannot result in writes, + data coherency
Write back -- write only to the cache.
The modified cache block is written to main memory only when it is replaced.
More efficient than write-through, more complex to implement.
Dirty bit per block indicating modified value can further reduce the traffic
+ Less memory bandwidth
Write miss: should a block be allocated in the cache on a miss?
Write allocate -- the block is allocated on a write miss, followed by the write hit actions.

In this option, write misses act like read misses.


No-write allocate -- write misses do not affect the cache.
The block is modified only in the lower-level memory.
Common combinations: write through and no write allocate
write back with write allocate
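A minimal C sketch contrasting the two combinations on a write; blocks are reduced to a single word and the backing-memory helpers are empty stubs, so all names here are illustrative:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;
} line_t;

/* Stand-ins for the lower-level memory, stubbed out for the sketch. */
static uint32_t memory_read_block(uint32_t tag)              { (void)tag; return 0; }
static void     memory_write_block(uint32_t tag, uint32_t v) { (void)tag; (void)v; }

/* Write-back + write-allocate: on a miss, allocate the block, then
 * perform the write-hit actions; memory is updated only on eviction. */
static void write_back_allocate(line_t *line, uint32_t tag, uint32_t value)
{
    if (!line->valid || line->tag != tag) {              /* write miss            */
        if (line->valid && line->dirty)
            memory_write_block(line->tag, line->data);   /* write back dirty line */
        line->data  = memory_read_block(tag);            /* allocate the block    */
        line->tag   = tag;
        line->valid = true;
    }
    line->data  = value;                                 /* write hit actions     */
    line->dirty = true;
}

/* Write-through + no-write-allocate: always write memory; on a miss
 * the cache is left unchanged.                                        */
static void write_through_no_allocate(line_t *line, uint32_t tag, uint32_t value)
{
    if (line->valid && line->tag == tag)
        line->data = value;                              /* update cache on a hit */
    memory_write_block(tag, value);                      /* always write memory   */
}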
Write Buffer
Evicted dirty lines for a write-back cache, or all writes in a write-through cache
Write Buffer reduces Read Miss Penalty
Processor is not stalled on writes, read misses can go ahead of write to main memory

Write buffer: FIFO queue holds data to be written to memory


Memory controller writes contents of the buffer to memory and
o Frees the write buffer entry after memory write;
o Stalls the CPU if the write buffer is full.
Problem: Write buffer may hold updated value of location needed by a read miss.
o On a read miss, wait for the write buffer to go empty, or
o Check write buffer addresses against read miss addresses,
o if no match, allow read miss to go ahead of writes, else,
return value from write buffer.
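A minimal C sketch of that address check, with the write buffer modelled as a small FIFO array (sizes and names are assumptions for the sketch):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct {
    uint32_t addr;
    uint32_t data;
} wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];
static int        wb_count;     /* number of writes still pending (enqueue/drain omitted) */

/* On a read miss, scan the pending writes; if one matches the missing
 * address, forward its value instead of waiting for the buffer to drain. */
static bool write_buffer_forward(uint32_t miss_addr, uint32_t *value)
{
    for (int i = wb_count - 1; i >= 0; i--) {   /* newest matching entry wins */
        if (write_buffer[i].addr == miss_addr) {
            *value = write_buffer[i].data;
            return true;                        /* value returned from the write buffer */
        }
    }
    return false;                               /* no match: the read miss may go ahead of the writes */
}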

Four Memory Hierarchy Questions


Q1: Where can a block be placed in the upper level? (Block placement)
Associativity: Fully Associative, Set Associative, Direct Mapped

Organization
Variations on a set-associative organization
The advantage of increasing the degree of associativity -- usually decreases the miss rate
Potential disadvantages of associativity -- increased cost and slower access time.

Q2: How is a block found if it is in the upper level? (Block identification)

Addressing
Tag / Index / Block
Q3: Which block should be replaced on a miss? (Block replacement)
Random, LRU, FIFO, LFU
Q4: What happens on a write? (Write strategy)
Write Back or Write Through, Write Buffer
Write-back advantages:
Individual words can be written by the processor at the rate that the cache, rather than the
memory, can accept them.
Multiple writes within a block require only one write to the lower level in the hierarchy.
When blocks are written back, the system can make effective use of a high bandwidth
transfer, since the entire block is written.
Write-through advantages:
Misses are simpler and cheaper because they never require a block to be written back to the
lower level.
Write-through is easier to implement than write-back, although to be practical, a write-through cache will still need to use a write buffer.
Cache performance
Causes for Cache Misses
Average memory access time = Hit time + Miss rate x Miss penalty
3 Cs model -- cache misses: compulsory, capacity, and conflict misses.
compulsory miss -- also called cold-start miss.
A cache miss caused by the first access to a block that has never been in the cache.
capacity miss -- the cache cannot contain all the blocks needed to satisfy the requests
conflict miss -- also called collision miss.
Occurs in a set-associative or direct mapped cache when multiple blocks compete for
the same set (are eliminated in a fully associative cache of the same size)
4th C: Coherence - Misses caused by cache coherence (Multiprocessors)

Miss rate versus cache size and associativity
Miss rate versus block size and cache size (SPEC2000 benchmarks)
Increased associativity -- decreases the number of conflict misses
Larger blocks exploit spatial locality -- to lower miss rates.
Increasing the block size
decreases the miss rate, but reduces the number of blocks that can be held in the cache
-- more competition for those blocks
the cost of a miss (the miss penalty) increases
the miss rate actually goes up if the block size is too large relative to the cache size

Memory hierarchy design challenges


Cache Performance -- Six Basic Cache Optimizations
Six cache optimizations - three categories:
_ Reducing the miss rate: larger block size, larger cache size, and higher associativity
_ Reducing the miss penalty: multilevel caches and giving reads priority over writes
_ Reducing the hit time: avoiding address translation when indexing the cache

Summary of basic cache optimizations.

Virtually indexed (by the page offset), physically tagged cache


Address aliasing - two Virtual Addresses mapped to the same Physical Address.

Virtual memory

Pipeline and VM

Analogous cache and virtual memory terms


virtual memory -- main memory as a cache for secondary storage
Motivations -- sharing of memory among multiple programs, and
o remove the programming burdens of a small, limited amount of main memory
o illusion of an unbounded amount of virtual memory
CPU generates virtual addresses (VA) -- memory is accessed using physical addresses (PA)
address translation -- address mapping: VA → PA
Virtual and physical memory are broken into fixed size blocks -- pages
addresses composed of -- page number and page offset.
page fault -- accessed page is not present in main memory.
relocation -- maps VAs to different PAs before the addresses are used to access memory
protection -- mechanisms to isolate multiple processes and OS in memory

Virtual page mapped to Physical page


Address translation
Virtual page in main memory or disk.
page size 2^12 bytes = 4 KB.
Physical pages can be shared -- two virtual addresses map to the same physical address.
High cost of a page fault.
A page fault to disk will take millions of clock cycles to process.
Pages should be large enough to try to amortize the high access time -- 4 KB, 16 KB...
Fully associative placement of pages in memory.
Page faults handled in software - small overhead compared to the disk access time.
Write-through will not work for virtual memory -- writes take too long.
Virtual memory systems use write-back.
Page Table VA → PA translation

Page table -- indexed with the VPn to obtain the PPn (figure: 4 GB virtual space, 1 GB physical memory)
Page table register pointer to the starting address of the page table.
The number of entries in the page table is 2^20, or about 1 million entries.
Valid bit -- if it is off, the page is not present in main memory.
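As a quick size check (assuming 32-bit virtual addresses and, say, 4 bytes per page table entry): with 4 KB pages the virtual page number has 32 - 12 = 20 bits, so the page table has 2^20 entries and occupies about 4 MB per process -- one motivation for the multi-level page tables discussed below.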

Extra bits in page table -- additional information -- protection


Page Faults
A page fault raises an exception handled by the OS -- the OS must find the page in the next level of the
hierarchy and decide where to place the requested page in main memory.
The virtual address alone is not enough to tell where the page is on disk.
We must keep track of the location on disk of each page in virtual address space.

A single Page table holds either the physical page number or the disk address.
Page table entry -- PPn or disk address
VPn used as an index into the page table
If the valid bit is on, the page table supplies the PPn (the starting address of the page in memory)
If the valid bit is off, the page currently resides only on disk, at a specified disk address.
The table of physical page addresses and disk page addresses is logically one table but may
be stored in two separate data structures
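A minimal C sketch of this translation with a single-level page table whose entries hold either a PPn or a disk address (field names and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS  12                   /* 4 KB pages               */
#define VPN_BITS   20                   /* 32-bit virtual addresses */
#define PT_ENTRIES (1u << VPN_BITS)

typedef struct {
    bool     valid;                     /* page present in main memory       */
    uint32_t ppn_or_disk;               /* PPn if valid, else a disk address */
} pte_t;

static pte_t page_table[PT_ENTRIES];

/* Translate a virtual address; returns false to signal a page fault,
 * in which case the OS fetches the page using the stored disk address. */
static bool translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_BITS;
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);
    pte_t   *pte    = &page_table[vpn];

    if (!pte->valid)
        return false;                   /* page fault: handled in software */

    *pa = (pte->ppn_or_disk << PAGE_BITS) | offset;
    return true;
}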
Multi-level page tables -- used to avoid a large page table size. Each program has its own page table.

multi-level page tables

Only the active page tables are in the main memory

Two memory accesses (page table, then data) -- slow


One memory access to obtain the physical address, a second access to get the data
A cache for address translations: the TLB is a cache for recently used page table entries

Address translation using a two-entry TLB

Making Address Translation Fast - TLB (translation lookaside buffer)


The TLB acts as a cache of the page table, holding only entries that map to pages present in physical memory.
If there is no matching entry in the TLB for a page -- the page table must be examined.
Page table -- supplies PPn for the page (which can then be used to build a TLB entry) or
indicates that the page resides on disk, in which case a page fault occurs.
Page table -- entry for every virtual page, no tag field is needed -- page table is not a cache
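A minimal C sketch of the TLB sitting in front of a page-table walk (fully associative, a handful of entries, no replacement policy; everything here is illustrative):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS   12
#define TLB_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Page-table walk, stubbed out here with an identity mapping;
 * see the translate() sketch above for a fuller version.       */
static bool page_table_walk(uint32_t vpn, uint32_t *ppn)
{
    *ppn = vpn;
    return true;
}

/* Translate through the TLB first; fall back to the page table on a TLB miss. */
static bool tlb_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_BITS;
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {        /* hardware: parallel tag compare (CAM) */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;                           /* TLB hit */
        }
    }

    uint32_t ppn;
    if (!page_table_walk(vpn, &ppn))
        return false;                              /* page fault */

    tlb[0] = (tlb_entry_t){ true, vpn, ppn };      /* refill one entry (replacement policy omitted) */
    *pa = (ppn << PAGE_BITS) | offset;
    return true;
}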

Translation VA → PA, page tables + TLB


Cache Placement and Address Translation

Physically addressed -- longer hit time (translation before the cache access)
Virtually addressed -- aliasing problem
Virtually indexed, physically tagged -- cache index limited by the page size
virtually indexed and virtually tagged cache
the address translation hardware (TLB) is unused during the normal cache access
This takes the TLB out of the critical path, reducing cache latency.
Aliasing -- one object with two names: two virtual addresses for the same physical page
A word on such a page may be cached in two different locations, each corresponding to
different virtual addresses.
This ambiguity would allow one program to write the data without the other program being
aware that the data had changed.
Virtually addressed caches - design limitations to ensure that aliases do not occur.
virtually indexed but physically tagged caches -- page offset used as cache index

Accessing the TLB and the (direct-mapped) cache in parallel; k = Index + Dsp = number of page-offset bits
the last k bits of the virtual and physical addresses are the same
For the index to stay within the page offset: cache size ≤ page size x associativity
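A quick numeric check (4 KB pages assumed): the page offset is 12 bits, so index + byte offset must fit within 12 bits. A direct-mapped cache is then limited to 4 KB; a 16 KB cache can still be indexed with the page offset if it is 4-way set associative, since 16 KB / 4 ways = 4 KB per way.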
Implementing Protection with Virtual Memory
Write access bit in -- TLB, Page Table
The OS implements the protection; the CPU provides user and supervisor (kernel) modes
-- special instructions that are only available in supervisor mode.
Mechanism for changing -- user mode to supervisor mode and vice versa.
system call exception (type) -- special instruction (syscall in MIPS) -- transfers control to a
dedicated location in supervisor code space.
o PC from the point of the system call is saved in the exception PC (EPC), and the
processor is placed in supervisor mode.
Return to user mode from the exception -- return from exception (ERET) instruction
resets to user mode and jumps to the address in EPC.
Process protection
Each process has its own virtual address space.
Page tables -- in the protected address space of the OS.
When processes want to share information in a limited way -- OS must assist them
Access right bits for a page must be included in both the page table and the TLB
Protection with Virtual Memory

Virtual memory and process protection


Address translation and protection
TLB miss handling:
in hardware (SPARC v8, x86, PowerPC) or in software (MIPS, Alpha)

Example: Pentium III Memory System


2-level paging system (2-level page table)
Page directory contains pointers to page tables, one page directory per process.
Page table contains pointers to pages.
Page table entries (PTE) and page directory entries (PDE) are 32 bits wide.
L1: 16 KB, 128 sets, 4 x 32 Bytes/set, 4-way set associative.
VPN virtual page number, VPO virtual page offset
TLBT TLB Tag, TLBI TLB Index.
PPN physical page number, PPO physical page offset
CT cache tag, CI cache index, CO cache offset

MIPS R3000 pipeline

Support hardware for memory-mapped I/O

Hardware for driving the SPO256 speech synthesizer

Evolution of Intel IA-32 microprocessor memory systems
