
Chapter 5: Memory Hierarchy Design

Rung-Bin Lin


Introduction
The use of a memory hierarchy in computer system design is motivated by two factors:
Locality of reference: the nature of program behavior.
The large gap in speed between the CPU and memory devices such as DRAM.

Levels of the memory hierarchy

High level <----> low level: CPU registers, cache, main memory, disk.
The levels of the hierarchy subset one another: all data in one level is also found in the level below.


Memory Hierarchy


Speed Gap between CPU and DRAM


Memory Hierarchy Difference between Desktops and Embedded Processors


Memory hierarchy for desktops
Speed (average performance) is the main concern.

Memory hierarchy for embedded processors
Real-time applications must care about worst-case performance.
Power consumption is a concern.
For simple, fixed applications running on embedded processors, no memory hierarchy may actually be needed.
Main memory itself may be quite small.


ABCs of Caches
Recalling some terms:
Cache: the name given to the first level of the memory hierarchy encountered once the address leaves the CPU.
Miss rate: the fraction of accesses not found in the cache.
Miss penalty: the additional time required to service a miss.
Block: the minimum unit of information that can be present in the cache.

Four questions about any level of the hierarchy:
Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level? (block identification)
Q3: Which block should be replaced on a miss? (block replacement)
Q4: What happens on a write? (write strategy)


Cache Performance
Formulas for performance evaluation:

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
                   = IC × (CPI_execution + Memory stall clock cycles / IC) × Clock cycle time

Memory stall cycles = IC × Memory references per instruction × Miss rate × Miss penalty

Measure of memory-hierarchy performance:

Average memory access time = Hit time + Miss rate × Miss penalty

Example on page 395. Example on page 396.
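As a worked illustration of these formulas, the following C sketch (all parameter values are hypothetical, not taken from the textbook examples) computes the memory stall cycles, the CPU time, and the average memory access time:

#include <stdio.h>

int main(void) {
    /* Hypothetical machine parameters, for illustration only. */
    double ic            = 1e9;   /* instruction count                 */
    double cpi_execution = 1.0;   /* CPI without memory stalls         */
    double refs_per_inst = 1.5;   /* memory references per instruction */
    double miss_rate     = 0.02;  /* fraction of accesses that miss    */
    double miss_penalty  = 100.0; /* clock cycles per miss             */
    double hit_time      = 1.0;   /* clock cycles per hit              */
    double clock_cycle   = 1e-9;  /* seconds (1 GHz clock)             */

    /* Memory stall cycles = IC x refs/inst x miss rate x miss penalty */
    double stall_cycles = ic * refs_per_inst * miss_rate * miss_penalty;

    /* CPU time = IC x (CPI_execution + stall cycles / IC) x clock cycle time */
    double cpu_time = ic * (cpi_execution + stall_cycles / ic) * clock_cycle;

    /* Average memory access time = hit time + miss rate x miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;

    printf("Memory stall cycles: %.0f\n", stall_cycles);
    printf("CPU time:            %.3f s\n", cpu_time);
    printf("AMAT:                %.2f cycles\n", amat);
    return 0;
}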


Four Memory Hierarchy Questions


Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level? (block identification)
Q3: Which block should be replaced on a miss? (block replacement)
Q4: What happens on a write? (write strategy)


Block Placement (1)


Q1: Where can a block be placed in a cache?
Direct mapped: Each block has only one place it can appear in the cache. The mapping is usually
(Block address) MOD (Number of blocks in the cache)

Fully associative: A block can be placed anywhere in the cache.

Set associative: A block can be placed in a restricted set of places in the cache. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. The set is usually obtained by
(Block address) MOD (Number of sets in the cache)
If there are n blocks in a set, the cache is called n-way set associative.


Block Placement (2)


Block Identification
Q2: How is a block found if it is in the cache?
Each cache block consists of
Address tag: gives the block address.
Valid bit: indicates whether or not the entry contains a valid address.
Data

Relationship of a CPU address to the cache
Address presented by the CPU: Block address ## Block offset
Index: selects the set.
Block offset: selects the desired data from the block.


Identification Steps
The index field of the CPU address is used to select a set.
The tag field presented by the CPU is compared in parallel to all address tags of the blocks in the selected set.
If any address tag matches the tag field of the CPU address and its valid bit is set, it is a cache hit.
The offset field is used to select the desired data.
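A minimal C sketch of these identification steps for an n-way set-associative cache follows; the field widths and cache organization are illustrative assumptions, not a particular processor's design:

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 6                  /* 64-byte blocks (assumed)        */
#define INDEX_BITS  9                  /* 512 sets (assumed)              */
#define NUM_SETS    (1u << INDEX_BITS)
#define WAYS        2                  /* 2-way set associative (assumed) */

struct cache_line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[1 << OFFSET_BITS];
};

static struct cache_line cache[NUM_SETS][WAYS];

/* Returns true on a hit and copies out the addressed byte. */
bool cache_lookup(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);        /* select byte   */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* select set    */
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compare field */

    for (int way = 0; way < WAYS; way++) {     /* done in parallel in hardware */
        struct cache_line *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];
            return true;                                       /* cache hit  */
        }
    }
    return false;                                              /* cache miss */
}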


Associativity versus Index Field


If the total cache size is kept the same,
increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag.

The following formula characterizes this property:

2^index = (Cache size) / (Block size × Set associativity)
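As a small numerical check of this formula (the sizes below are illustrative assumptions):

#include <stdio.h>

int main(void) {
    unsigned cache_size = 64 * 1024;  /* bytes (assumed)  */
    unsigned block_size = 64;         /* bytes per block  */
    unsigned assoc      = 2;          /* blocks per set   */

    unsigned num_sets   = cache_size / (block_size * assoc);  /* = 2^index */
    unsigned index_bits = 0;
    while ((1u << index_bits) < num_sets)
        index_bits++;

    /* For a fixed cache size, doubling the associativity halves the number
       of sets, shrinking the index by one bit and growing the tag by one bit. */
    printf("sets = %u, index bits = %u\n", num_sets, index_bits);
    return 0;
}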


Block Replacement
Q3: Which block should be replaced on a cache miss?
For a direct-mapped cache the answer is obvious. For a set-associative or fully associative cache, the following strategies can be used:
Random
Least-recently used (LRU)
First in, first out (FIFO)


Comparison of Miss Rate between Random and LRU


Fig. 5.6 on page 400


Write Strategy
Q4: What happens on a write?
Traffic patterns
Writes make up about 7% of the overall memory traffic and about 25% of the data cache traffic. Though reads dominate processor cache traffic, writes still cannot be ignored in a high-performance design.

Reads can be done faster than writes
In a read, the block data can be read at the same time that the tag is read and compared. In a write, modifying a block cannot begin until the tag is checked to see whether the address is a hit.


Write Policies and Write Miss Options


Write policies
Write through (or store through): write to both the block in the cache and the block in the lower-level memory.
Write back: write only to the block in the cache. A dirty bit, attached to each block in the cache, is set when the block is modified. When a block is being replaced and its dirty bit is set, the block is copied back to main memory. This can reduce bus traffic.

Common options on a write miss
Write allocate: the block is loaded on a write miss, followed by the write-hit actions.
No-write allocate (write around): the block is modified in the lower level and not loaded into the cache.

Either write-miss option can be used with write through or write back, but write-back caches generally use write allocate and write-through caches often use no-write allocate.
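The sketch below contrasts the two combinations on a store; the cache is modeled as a small direct-mapped array for brevity, and all structure and parameter choices are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK   64                     /* bytes per block (assumed)          */
#define LINES   1024                   /* direct mapped, for brevity         */
#define MEMSIZE (1u << 20)             /* lower-level memory size (assumed)  */

struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK]; };

static struct line cache[LINES];
static uint8_t     memory[MEMSIZE];    /* addresses are assumed < MEMSIZE    */

static struct line *lookup(uint32_t addr, uint32_t *tag)
{
    *tag = addr / (BLOCK * LINES);
    struct line *l = &cache[(addr / BLOCK) % LINES];
    return (l->valid && l->tag == *tag) ? l : NULL;
}

/* Write back + write allocate: only the cache copy is updated; the dirty
   bit marks the block for copy-back when it is later replaced.           */
void write_back_store(uint32_t addr, uint8_t value)
{
    uint32_t idx = (addr / BLOCK) % LINES, tag;
    struct line *l = lookup(addr, &tag);
    if (!l) {                                                 /* write miss: allocate   */
        l = &cache[idx];
        if (l->valid && l->dirty)                             /* copy dirty victim back */
            memcpy(&memory[(l->tag * LINES + idx) * BLOCK], l->data, BLOCK);
        memcpy(l->data, &memory[addr - addr % BLOCK], BLOCK); /* fetch the block        */
        l->valid = true; l->dirty = false; l->tag = tag;
    }
    l->data[addr % BLOCK] = value;
    l->dirty = true;                                          /* written back on evict  */
}

/* Write through + no-write allocate: lower-level memory is always updated;
   the cache is updated only on a write hit.                                */
void write_through_store(uint32_t addr, uint8_t value)
{
    uint32_t tag;
    struct line *l = lookup(addr, &tag);
    if (l)
        l->data[addr % BLOCK] = value;                        /* write hit              */
    memory[addr] = value;                                     /* always write through   */
}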


Comparison between Write Through and Write Back


Write back can reduce bus traffic, but the contents of cache blocks can at times be inconsistent with the corresponding blocks in main memory. Write through increases bus traffic, but the contents are consistent all the time.

Reducing write stalls
Use a write buffer. As soon as the CPU places the write data into the write buffer, the CPU is allowed to continue.

Example on page 402


An Example: the Alpha 21264 Data Cache


Features
64K bytes of data in 64-byte blocks.
Two-way set associative.
Write back with a dirty bit.
Write allocate on a write miss.

The CPU address
48-bit virtual address.
44-bit physical address.
38-bit block address.
29-bit tag.
9-bit index, obtained from 2^index = 65536 / (64 × 2) = 512.
6-bit block offset.

FIFO replacement strategy.

What happens on a miss?
The 64-byte block is fetched from main memory in four transfers, each taking 5 clock cycles.


Cache Access Steps


Unified versus Split Caches


Unified cache: a single cache contains both instructions and data.
Split caches: data is held only in the data cache, while instructions are held in the instruction cache.
Fig. 5.8 on page 406.


Cache Performance
Average memory access time for processors with in-order execution

Average memory access time = Hit time + Miss rate × Miss penalty
Examples on pages 408 and 409.

Miss penalty and out-of-order execution processors

Memory stall cycles / instruction = Misses / instruction × (Total miss latency − Overlapped miss latency)
Length of memory latency: the time between the start and the end of a memory reference in an out-of-order processor.
Length of latency overlap: the portion of the memory latency that overlaps with operations of the processor.
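A small numerical illustration of this formula, with hypothetical values:

#include <stdio.h>

int main(void) {
    /* Hypothetical values, for illustration only. */
    double misses_per_inst    = 0.02;  /* misses per instruction           */
    double total_miss_latency = 100.0; /* cycles from start to end of miss */
    double overlapped_latency = 60.0;  /* cycles hidden by other work      */

    /* Only the non-overlapped part of the miss latency stalls the processor. */
    double stalls_per_inst =
        misses_per_inst * (total_miss_latency - overlapped_latency);

    printf("Memory stall cycles per instruction: %.2f\n", stalls_per_inst);
    return 0;
}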


Improving Cache Performance


Reduce the miss rate
Reduce the miss penalty
Reduce the hit time
Reduce the miss penalty or miss rate via parallelism


Reducing Cache Miss Penalty


Multilevel caches
Critical word first and early restart
Giving priority to read misses over writes
Merging write buffers
Victim caches


Multilevel Caches
Question: a larger cache or a faster cache? These are contradictory goals. Solution: add another level of cache. A second-level cache complicates the performance evaluation of the cache memory:

Average memory access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1

where

Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
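A short C sketch of this two-level calculation (all parameter values are hypothetical):

#include <stdio.h>

int main(void) {
    /* Hypothetical two-level cache parameters. */
    double hit_time_l1     = 1.0;    /* cycles                 */
    double miss_rate_l1    = 0.04;   /* local = global for L1  */
    double hit_time_l2     = 10.0;   /* cycles                 */
    double miss_rate_l2    = 0.20;   /* local miss rate of L2  */
    double miss_penalty_l2 = 100.0;  /* cycles to main memory  */

    double miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat            = hit_time_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("L1 miss penalty:     %.1f cycles\n", miss_penalty_l1);
    printf("AMAT:                %.2f cycles\n", amat);
    printf("Global L2 miss rate: %.3f\n", miss_rate_l1 * miss_rate_l2);
    return 0;
}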


Local and Global Miss Rates


The second-level miss rate is measured on the leftovers from the first-level cache.
Local miss rate (Miss rate_L2)
The number of misses in a cache divided by the total number of memory accesses to this cache.

Global miss rate (Miss rate_L1 × Miss rate_L2)
The number of misses in the cache divided by the total number of memory accesses generated by the CPU.


Miss Rate versus Cache size


Two Insights and Questions


Two insights from the results shown above:
The global cache miss rate is very similar to the single-cache miss rate of the second-level cache.
The local miss rate is not a good measure of secondary caches; the global miss rate should be used, because the effectiveness of the second-level cache is a function of the miss rate of the first-level cache.

Two questions for the design of the second-level cache:
Will it lower the average memory access time portion of the CPI?
How much does it cost?


Example (P417)


Influence of L2 Hit Time


Early Restart and Critical Word First


Basic idea: don't wait for the full block to be loaded before sending the requested word and restarting the CPU.

Two strategies:
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.

Example on page 419.


Giving Priority to Read Misses over Writes


A write buffer can free the CPU from waiting for writes to complete, but it could hold the updated value of a location needed on a read miss. This complicates memory access, i.e., it may cause a RAW hazard.

Two solutions:
The read miss waits until the write buffer is empty. This certainly increases the miss penalty. Or,
Check the contents of the write buffer on a read miss, and let the read miss fetch the data from the write buffer if it is found there.

Example on page 419.


Merging Write Buffers


Victim Caches (1)


Victim cache
A small, fully associative cache that contains only blocks discarded from a cache because of a miss -- the victims. The blocks of the victim cache are checked on a miss to see if they hold the desired data before going to the next lower-level memory. If the data is found there, the victim block and the cache block are swapped. A four-entry victim cache can remove 20% to 95% of the conflict misses in a 4-KB direct-mapped data cache.


Victim Caches (2)


Reducing Miss Rate


Larger block size
Larger caches
Higher associativity
Way prediction and pseudoassociative caches
Compiler optimizations


Miss Categories
Compulsory miss
The first access to a block cannot be in the cache.

Capacity miss
Occurs when blocks are discarded and later retrieved because the cache cannot contain all the blocks needed during execution of a program.

Conflict miss
Occurs when a block is discarded and later retrieved because too many blocks map to its set in a direct-mapped or set-associative cache.

What can a designer do about the miss rate?
Reducing conflict misses is the easiest: use full associativity, but it is very expensive.
Reducing capacity misses: use a larger cache.
Reducing compulsory misses: use larger blocks.


Miss Rate for Each Category


Larger Block Size


Reduces compulsory misses by taking advantage of spatial locality.

Increases the miss penalty.

Increases capacity misses if the cache is small. The selection of block size depends on both the latency and the bandwidth of the lower-level memory:
High latency and high bandwidth encourage larger block sizes.
Low latency and low bandwidth encourage smaller block sizes.

Example on page 426.


Example (P426)


Miss Rate, Block Size versus Cache Size


Average Memory Access Time, Block Size versus Cache Size


Fig. 5.18 on page 428


Larger Caches
Drawbacks
Longer hit time
Higher cost


Higher Associativity
Two general rules of thumb
Eight-way set associative is, for practical purposes, as effective in reducing misses as fully associative.
2:1 cache rule of thumb: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2.

The pressure of a fast processor clock cycle encourages simple caches, but the increasing miss penalty rewards associativity.
Example on page 429.


Average Memory Access Time versus Associativity


Fig. 5.19 on page 430


Way Prediction
Reduces conflict misses and yet maintains the hit speed of a direct-mapped cache.

Way prediction
Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This means the multiplexer can be set early to select the desired block. A miss results in checking the other blocks for matches. The Alpha 21264 uses this technique:
Hits take 1 cycle.
Misses take 3 cycles.

Way prediction can also be used to reduce power consumption.


Pseudoassociative Caches
Accesses proceed just as in a direct-mapped cache on a hit. On a miss, a second cache entry is checked to see if the block matches there.


Compiler Optimizations
Loop interchange
Reduces misses by improving spatial locality.

Blocking
Reduces capacity misses.

Both transformations are sketched in the code below.
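The two transformations can be illustrated with small C loops; the array and block sizes are arbitrary, and this is a sketch of the idea rather than a tuned kernel:

#define N 1024
#define B 32   /* blocking factor (assumed) */

/* Before loop interchange: the inner loop strides through x[][] column by
   column, so for row-major C arrays nearly every access touches a new block. */
void before_interchange(double x[N][N])
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After loop interchange: the inner loop walks sequentially through memory,
   so each cache block is fully used before it is replaced (spatial locality). */
void after_interchange(double x[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

/* Blocking (tiling) for matrix multiply: operate on B-by-B submatrices so the
   data touched by the inner loops fits in the cache, reducing capacity misses. */
void blocked_matmul(double c[N][N], double a[N][N], double b[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + B; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
}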


Blocking


Reducing Cache Miss Penalty or Miss Rate via Parallelism


Nonblocking caches to reduce stalls on cache misses
Hardware prefetching of instructions and data
Compiler-controlled prefetching


Nonblocking Caches to Reduce Stalls on Cache Misses


For pipelined machines that implement Tomasulo's algorithm and allow out-of-order completion, the CPU need not stall on a cache miss. A nonblocking cache increases the potential benefit of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This is called hit under miss. When more than one outstanding miss is allowed, it is called hit under multiple misses. Example on page 436.


Performance of Hit-Under-Miss


Hardware Prefetching of Instructions and Data


The processor fetches two (consecutive) blocks on a miss:
The requested block is placed in the instruction (data) cache when it returns.
The prefetched block is placed into the instruction (data) stream buffer.
When a requested block is found in the stream buffer, it is read from there and the next prefetch request is issued.

With four instruction (data) stream buffers, the hit rate improves to 50% (43%).


Compiler-Controlled Prefetching
The compiler inserts prefetch instructions to request data before they are needed.
Register prefetch: load the value into a register.
Cache prefetch: load the data only into the cache, not into a register.

See the sketch below.
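A minimal cache-prefetch sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 iterations and the function itself are illustrative assumptions:

/* Request a[i + DIST] while still working on a[i], so the memory access
   overlaps with computation instead of stalling the loop when a[i] is used. */
#define DIST 8   /* prefetch distance in iterations (tuning assumption) */

double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}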


Reducing Hit Time


Hit time is critical because it affects the clock rate of the processor.

Strategies to reduce hit time
Small and simple caches (direct mapped)
Avoiding address translation during indexing of the cache
Pipelined cache access
Trace caches


Access Time versus Cache Size


Summary of Cache Optimizations


Fig. 5.26


Main Memory and Organization for Improving Performance


Performance measures of main memory emphasize both latency and bandwidth.
Traditionally, latency is the primary concern of the cache, while bandwidth is the primary concern of I/O. However, with second-level caches and their larger block sizes, bandwidth becomes important to caches as well. It is easier to improve memory bandwidth with new organizations than to reduce latency.


Techniques for Improving Bandwidth


Techniques
Wider main memory
Simple interleaved memory
Independent memory banks

Assume the performance of the basic organization is
4 clock cycles to send the address
56 clock cycles for the access time per word (8 bytes)
4 clock cycles to send a word of data

Given a cache block of four words, the miss penalty is 4 × (4 + 56 + 4) = 256 clock cycles.


Wider Main Memory (1)


With a main memory width of two words, the miss penalty for the above example drops from 256 cycles to 128 cycles.

Drawbacks:
The critical path is lengthened by the multiplexer introduced between the CPU and the cache.
Memory with error correction has difficulty with writes to a portion of the protected block (e.g., a write of a byte).


Wider Main Memory (2)


Simple Interleaved Memory


Basic concept
Memory chips can be organized in banks to read or write multiple words at a time rather than a single word. Sending the address to several banks permits them all to read at the same time. The miss penalty with this scheme becomes 4 + 56 + 4 × 4 = 76 cycles. The mapping of addresses to banks affects the behavior of the memory system. Usually the addresses are interleaved at the word level.

Example on page 452.
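The miss penalties of the three organizations can be checked with the timing assumptions given above (4 cycles to send the address, 56 cycles access time per word, 4 cycles per word of data); this sketch simply reproduces that arithmetic:

#include <stdio.h>

int main(void) {
    int addr = 4, access = 56, xfer = 4;  /* cycles, from the slides */
    int words = 4;                        /* block size = four words */

    int basic       = words * (addr + access + xfer);        /* one word wide  */
    int wide        = (words / 2) * (addr + access + xfer);  /* two words wide */
    int interleaved = addr + access + words * xfer;          /* four banks     */

    printf("basic:         %d cycles\n", basic);        /* 256 */
    printf("two-word wide: %d cycles\n", wide);         /* 128 */
    printf("interleaved:   %d cycles\n", interleaved);  /*  76 */
    return 0;
}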


Independent Memory Banks


Multiple memory controllers allow banks to operate independently. Each bank needs separate address lines and possibly a separate data bus.
Such a design also enables the use of nonblocking caches.


Memory Technology
Performance metrics
Latency: two measures
Access time: the time between when a read is requested and when the desired word arrives.
Cycle time: the minimum time between requests to memory.

Usually cycle time > access time.


DRAM
Refresh time takes less than 5% of the total time; DRAM speed increases only slowly.


SRAM, ROM and Flash Technology


SRAM
No refresh needed.
8 to 16 times faster than DRAM.
8 to 16 times more expensive than DRAM.
Suitable for embedded applications.

ROM and flash
Non-volatile.
Well suited to embedded processors.


Improving Memory Performance in a Standard DRAM Chip


A multi-bank organization provides larger bandwidth. Three other methods to increase bandwidth:

Fast page mode
Allows repeated accesses to a row without incurring another row access time.

Synchronous DRAM (SDRAM)
Has a programmable register to hold the number of bytes requested, and hence can send many bytes over several cycles per request without the overhead of synchronizing with the controller on every transfer.

Double Data Rate (DDR) DRAM
Uses both the falling and rising edges of the clock for transferring data.


RAMBUS DRAM (RDRAM)


Each chip has interleaved memory and a high-speed interface, and acts more like a memory system than a memory component.

RDRAM: first-generation RAMBUS DRAM
Drops RAS/CAS, replacing them with a bus that allows other accesses between the sending of an address and the return of the data (called a packet-switched bus or split-transaction bus). Uses both edges of the clock. Runs at 300 MHz.

Direct RDRAM (DRDRAM): second generation
Separate data, row, and column buses, so that three transactions can be performed on these buses simultaneously. Runs at 400 MHz.

Comparing RAMBUS and DDR SDRAM
Both increase memory bandwidth. Neither helps in reducing latency.


Virtual Memory (VM)


VM divides physical memory into blocks and allocates them to different processes, each of which has its own address space. A protection scheme is needed to restrict a process to the blocks belonging only to that process. With VM, not all code and data need to be in physical memory before a program can begin. VM also provides process (program) relocation.

Virtual address
Generated by the CPU.

Physical address
Used to access main memory.

Address translation
Converts a virtual address to a physical address. It can easily form the critical path that limits the clock cycle time.


Types of VM
Paged
Segmented
Paged segments
Fig. 5.34 on page 463.


Address Space Mapping in VM


Differences between Caches and VM


Replacement
In caches it is managed by hardware, while in VM it is managed by the OS.

Size
The size of VM is determined by the size of the processor address; the size of a cache is independent of the processor address size.

Secondary storage in VM that is occupied by the file system is not normally part of the address space.


Parameter Ranges of Caches versus VM


Fig. 5.32 on page 462.


Four Memory Hierarchy Questions


Q1: Where can a block be placed in main memory?
Anywhere (fully associative).

Q2: How is a block found if it is in main memory?
Use a page table, or
use an inverted page table, which reduces the size of the page table by hashing.
Problem: two memory accesses are needed to obtain the requested data. The solution is to use a translation lookaside buffer.

Q3: Which block should be replaced on a VM miss?
LRU.

Q4: What happens on a write?
Write back.


Concept of Address Mapping Using a Page Table


Techniques for Fast Address Translation


Problem with pure page-table translation
Each reference needs two memory accesses (one for the page table entry, one for the data).

A Translation Lookaside Buffer (TLB), sketched below, solves the problem.
It exploits the locality of page table references.
It is a fully associative memory whose entries record the most recently used page translations. Each entry consists of:
Tag
Physical page frame number
Protection field
Valid bit, dirty bit, and used bit
ASN (Address Space Number): identifies which process owns the corresponding page.
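A minimal sketch of such a TLB lookup, assuming 8 KB pages and a 128-entry fully associative TLB (both values are illustrative, not the Alpha's exact parameters):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   13                 /* 8 KB pages (assumed)        */
#define TLB_ENTRIES 128                /* number of entries (assumed) */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;                      /* virtual page number (tag)  */
    uint64_t pfn;                      /* physical page frame number */
    uint8_t  asn;                      /* address space number       */
    uint8_t  prot;                     /* protection field           */
    bool     dirty, used;              /* status bits                */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns true on a TLB hit. On a miss the
   page table would be walked and the TLB refilled (not shown here).     */
bool tlb_translate(uint64_t vaddr, uint8_t asn, uint64_t *paddr)
{
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++) {        /* fully associative search */
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asn == asn) {
            tlb[i].used = true;
            *paddr = (tlb[i].pfn << PAGE_BITS) | offset;
            return true;                           /* TLB hit                   */
        }
    }
    return false;                                  /* TLB miss: walk page table */
}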


Alpha 21264 TLB


Summary of VM and Caches


Protection and Examples of VM


Process
A running program plus any state needed to continue running it.

Process (context) switch
One process stops executing and another process is brought into execution.

Requirements for context switches
Be able to save the CPU state so execution can continue later
(a computer designer's responsibility).
Protect a process from being interfered with by another process
(the OS's responsibility).

Computer designers can make protection easy for the OS to implement through the VM design.


Protecting Process
The simplest mechanism
Use base and bound registers.
An access is valid if Base <= Address <= Bound (see the sketch after this list).

To enable this protection, computer designers have the following three responsibilities:
Provide at least two execution modes: user mode and kernel (OS, supervisor) mode.
Provide a portion of the CPU state that a user process can read but not write.
Provide mechanisms whereby the CPU can go from user mode to kernel mode and back.

More Sophisticated Mechanisms
Rings
Capabilities: a program can't unlock access to the data unless it has keys (capabilities).
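A minimal sketch of the base-and-bound check described above (the function and its arguments are illustrative; real hardware performs this comparison on every user-mode access):

#include <stdint.h>
#include <stdbool.h>

/* A user access is valid only if it falls between the base and bound
   registers that the operating system set up for the process.         */
bool access_is_valid(uint64_t address, uint64_t base, uint64_t bound)
{
    return base <= address && address <= bound;
}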


The Alpha Memory Management and the 21264 TLB


Alpha VM architecture
A combination of segmentation and paging, providing protection while minimizing page table size.
64-bit address space, but with 48-bit virtual addresses.
Three segments, each of which is paged:
seg0 (bits 63~46 = 000): holds user processes.
seg1 (bits 63~46 = 111):
kseg (bits 63~46 = 010): reserved for the operating system kernel.


Mapping of an Alpha Virtual Address


Each page table is held in a page.


Memory Protection in Alpha 21264


Each page table entry (PTE) has 64 bits.
The first 32 bits contain the physical page frame number.
The other half includes the following protection fields:
Valid
User read enable
Kernel read enable
User write enable
Kernel write enable

The Alpha obeys only the protection requirements imposed by the bottom-level PTEs.


Concluding Remarks
The primary challenge for the memory hierarchy designer is in choosing parameters that work well together, not in inventing new techniques (there are already enough of them).
