
Memory Hierarchy - Introduction

Computer programmers want an unlimited amount of fast memory.
An economical solution is a memory hierarchy, which takes advantage of the principle of locality and of the cost-performance of memory technologies.
Principle of Locality: most programs do not access all code or data uniformly.

Multilevel Memory Hierarchy

Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next lower level.
The goal is to provide a memory system whose cost per byte is almost as low as that of the cheapest level and whose speed is almost as high as that of the fastest level.
Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy.

Terminologies

Cache: the name given to the highest or first level of the memory hierarchy once the address leaves the processor. A cache is a temporary storage area where frequently accessed data can be kept for rapid access.
Cache hit: the processor finds the requested data in the cache.
Cache miss: the processor does not find the needed data item in the cache.

Three categories of cache misses

1. Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that would occur even with an infinite cache.
2. Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because of blocks being discarded and later retrieved.
3. Conflict: if the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved if conflicting blocks map to its set.

The time required for a cache miss depends on both the latency and the bandwidth of the memory.
Latency determines the time to retrieve the first word of the block.
Bandwidth determines the time to retrieve the rest of the block.
Cache misses are handled by hardware and cause processors using in-order execution to stall.
With out-of-order execution, an instruction using the result must wait, but other instructions may proceed during the miss.

Contd.

Block: a fixed-size collection of data containing the requested word, also known as a line.
Virtual memory: not all objects referenced by a program need to reside in main memory; some objects may reside on disk.
Pages: the address space is usually broken into fixed-size blocks called pages.
Page fault: at any time, a page resides either in memory or on disk. When the processor references an item within a page that is not present in memory, a page fault occurs, and the entire page is moved from disk to memory. Since page faults take so long, they are handled in software, and the processor is not stalled; it usually switches to some other task.

Cache Performance

A memory hierarchy can substantially improve performance because of locality of reference and the higher speed of smaller memories.
The processor execution time equation, once we account for the cycles during which the processor is stalled waiting for memory accesses (the memory stall cycles), is:
CPU execution time = (CPU clock cycles + Memory stall clock cycles) * Clock cycle time
This equation assumes that the CPU clock cycles include the time to handle a cache hit, and that the processor is stalled during a cache miss.

Contd.

The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty.
Memory stall cycles = Number of misses * Miss penalty
= IC * (Misses / Instruction) * Miss penalty
= IC * (Memory accesses / Instruction) * Miss rate * Miss penalty
The miss rate is simply the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses).

The formula above is an approximation, since miss rates and miss penalties actually differ for reads and writes.
Memory stall cycles can then be defined in terms of the number of memory accesses per instruction, the miss penalties (in clock cycles) for reads and writes, and the miss rates for reads and writes:
Memory stall cycles = IC * Reads per instruction * Read miss rate * Read miss penalty + IC * Writes per instruction * Write miss rate * Write miss penalty

Some designers measure miss rate as misses per instruction rather than misses per memory reference. The two are related:
Misses / Instruction = (Miss rate * Memory accesses) / Instruction count = Miss rate * (Memory accesses / Instruction)
Misses per instruction is often reported as misses per 1000 instructions, to show integers instead of fractions.
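As a concrete sketch of how these formulas combine, the following C program computes memory stall cycles and misses per 1000 instructions. All numeric values are invented for illustration; they are not from the slides.

```c
#include <stdio.h>

int main(void) {
    /* Invented example values -- not taken from the slides. */
    double ic                 = 1e9;   /* instruction count               */
    double accesses_per_instr = 1.5;   /* memory accesses per instruction */
    double miss_rate          = 0.02;  /* misses per memory access        */
    double miss_penalty       = 100.0; /* clock cycles per miss           */

    /* Memory stall cycles = IC * (accesses/instr) * miss rate * miss penalty */
    double stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty;

    /* Misses per instruction = miss rate * (memory accesses / instruction) */
    double misses_per_instr = miss_rate * accesses_per_instr;

    printf("Memory stall cycles    : %.0f\n", stall_cycles);
    printf("Misses per 1000 instrs : %.1f\n", misses_per_instr * 1000.0);
    return 0;
}
```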

Four Memory Hierarchy Questions

The cache is the first level of the memory hierarchy. Answering the following questions helps us understand the trade-offs of memories at the different levels of the hierarchy:
1. Where can a block be placed in the upper level? (block placement)
2. How is a block found if it is in the upper level? (block identification)
3. Which block should be replaced on a miss? (block replacement)
4. What happens on a write? (write strategy)

Where can a block be placed in a cache?


Three categories of cache organization

1. If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (Block address) MOD (Number of blocks in cache).
2. If a block can be placed anywhere in the cache, the cache is said to be fully associative.
3. If a block can be placed in a restricted set of places in the cache, the cache is set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then it can be placed anywhere within that set. The set is usually chosen by bit selection, that is, (Block address) MOD (Number of sets in cache).

Contd.

If there are n blocks in a set, the cache placement is called n-way set associative. Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Equivalently, direct mapped can be thought of as having m sets, and fully associative as having one set. The vast majority of processor caches today are direct mapped, two-way set associative, or four-way set associative.
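To make the placement rules concrete, here is a small C sketch mapping a block address to its candidate locations under each organization. The cache size and block address are arbitrary example values.

```c
#include <stdio.h>

#define NUM_BLOCKS 8   /* total block frames in the example cache */

int main(void) {
    unsigned block_addr = 12;  /* example block address */

    /* Direct mapped: exactly one candidate frame. */
    unsigned dm_frame = block_addr % NUM_BLOCKS;

    /* 2-way set associative: NUM_BLOCKS/2 sets, 2 candidate frames per set. */
    unsigned num_sets = NUM_BLOCKS / 2;
    unsigned set      = block_addr % num_sets;

    printf("Direct mapped : block %u -> frame %u\n", block_addr, dm_frame);
    printf("2-way assoc   : block %u -> set %u (either frame in the set)\n",
           block_addr, set);
    /* Fully associative: block 12 may go in any of the 8 frames. */
    return 0;
}
```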

How is a block found if it is in the cache?


Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired data is checked to see whether it matches the block address from the processor. All possible tags are searched in parallel because speed is critical. A valid bit is added to the tag to indicate whether or not the entry contains a valid address; if this bit is not set, there cannot be a match on this address.

Relationship of a processor address to the cache

The first division is between the block address and the block offset. The block address is further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against the stored tag for a hit.
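A short C sketch of how hardware extracts the three fields with shifts and masks. The field widths (64-byte blocks giving 6 offset bits, 128 sets giving 7 index bits) are assumptions chosen for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed geometry: 64-byte blocks (6 offset bits), 128 sets (7 index bits). */
#define OFFSET_BITS 6
#define INDEX_BITS  7

int main(void) {
    uint32_t addr = 0x12345678;  /* example processor address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);           /* byte in block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* set    */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);          /* compared     */

    printf("address 0x%08x -> tag 0x%x, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}
```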

Memory Hierarchy - Review
By Chandru, 1RV08SCS05

Objectives

> Basic Information
> Four Memory Hierarchy Questions
Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level? (block identification)
Q3: Which block should be replaced on a miss? (block replacement)
Q4: What happens on a write? (write strategy)
> An Example: The Opteron Data Cache

Memory Hierarchy

The hierarchical arrangement of storage in current computer architectures is called the memory hierarchy. It is designed to take advantage of memory locality in computer programs.
Most modern CPUs are so fast that, for most program workloads, the locality of reference of memory accesses and the efficiency of caching and memory transfer between the different levels of the hierarchy are the practical limitation on processing speed.

An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom:
L0: registers - CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM) - holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM) - holds cache lines retrieved from main memory.
L3: main memory (DRAM) - holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks) - holds files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses).
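The classic illustration of these two kinds of locality is array traversal order. In the C sketch below, the row-major loop touches consecutive addresses (good spatial locality), the column-major loop strides across rows (poor spatial locality), and the repeated use of sum in both loops is temporal locality.

```c
#include <stdio.h>

#define N 512

static double a[N][N];   /* zero-initialized static array */

int main(void) {
    double sum = 0.0;
    int i, j;

    /* Row-major order: consecutive addresses, good spatial locality. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major order: stride of N*sizeof(double) bytes, poor locality. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);  /* reuse of 'sum' is temporal locality */
    return 0;
}
```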

Cache Algorithm (Read)

Look at the processor address and search the cache tags to find a match. Then either:
HIT - found in cache: return a copy of the data from the cache.
Hit rate = fraction of accesses found in the cache
Miss rate = 1 - Hit rate
Hit time = RAM access time + time to determine HIT/MISS
MISS - not in cache: read the block of data from main memory, wait, then return the data to the processor and update the cache.
Miss time = time to replace a block in the cache + time to deliver the block to the processor
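A minimal sketch of this read algorithm for a direct-mapped cache. The geometry (64 blocks of 32 bytes) and the 64 KB array standing in for main memory are assumptions for illustration, not any particular machine.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS 64
#define BLOCK_SIZE 32              /* bytes per block           */
#define MEM_SIZE   (1 << 16)       /* toy 64 KB "main memory"   */

struct line { int valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };

static struct line cache[NUM_BLOCKS];
static uint8_t memory[MEM_SIZE];   /* stand-in for the next level */

static uint8_t cache_read(uint32_t addr) {
    uint32_t offset     = addr % BLOCK_SIZE;
    uint32_t block_addr = addr / BLOCK_SIZE;
    uint32_t index      = block_addr % NUM_BLOCKS;  /* bit selection */
    uint32_t tag        = block_addr / NUM_BLOCKS;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) {        /* HIT: return copy from cache  */
        printf("hit  0x%04x\n", addr);
        return l->data[offset];
    }
    printf("miss 0x%04x\n", addr);          /* MISS: fetch block and update */
    memcpy(l->data, &memory[block_addr * BLOCK_SIZE], BLOCK_SIZE);
    l->valid = 1;
    l->tag   = tag;
    return l->data[offset];                 /* then return to the processor */
}

int main(void) {
    memory[0x1234] = 42;
    cache_read(0x1234);   /* miss: block brought into the cache */
    cache_read(0x1234);   /* hit: found in the cache            */
    return 0;
}
```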

Caching in a Memory Hierarchy

The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1. Data is copied between levels in block-sized transfer units. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks (e.g., blocks 0-15, of which level k might currently hold 4, 9, 10, and 14).

Types of cache misses:
Cold (compulsory) miss: cold misses occur because the cache is empty.
Conflict miss: if the block placement strategy is not fully associative, conflict misses will occur because a block may be discarded and later retrieved when conflicting blocks map to its set. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur.

Q3: Which block should be replaced on a miss?

Easy for direct mapped: there is only one candidate. For set associative or fully associative caches:
Random: candidate blocks are randomly selected; some systems generate pseudorandom block numbers.
Least Recently Used (LRU): relies on a corollary of locality: if recently used blocks are likely to be used again, then the least recently used block is a good candidate for eviction.
First In, First Out (FIFO): an approximation to LRU, used in highly associative caches.
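A minimal sketch of LRU bookkeeping for one 4-way set, using an age counter per block. This is a common textbook approximation, not any particular processor's implementation.

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

static uint32_t tags[WAYS];
static int      valid[WAYS];
static unsigned age[WAYS];     /* larger age = less recently used */

/* Way to fill on a miss: an invalid way if any, else the LRU way. */
static int victim(void) {
    int w, lru = 0;
    for (w = 0; w < WAYS; w++)
        if (!valid[w]) return w;
    for (w = 1; w < WAYS; w++)
        if (age[w] > age[lru]) lru = w;
    return lru;
}

static void access_set(uint32_t tag) {
    int w, hit = -1;
    for (w = 0; w < WAYS; w++)
        if (valid[w] && tags[w] == tag) { hit = w; break; }
    if (hit < 0) {                        /* miss: evict the LRU victim   */
        hit = victim();
        tags[hit]  = tag;
        valid[hit] = 1;
        printf("miss tag %u -> way %d\n", tag, hit);
    } else {
        printf("hit  tag %u in way %d\n", tag, hit);
    }
    for (w = 0; w < WAYS; w++) age[w]++;  /* every block ages ...          */
    age[hit] = 0;                         /* ... the one just used is newest */
}

int main(void) {
    uint32_t seq[] = {1, 2, 3, 4, 1, 5};  /* tag 5 evicts tag 2, the LRU block */
    for (unsigned i = 0; i < 6; i++) access_set(seq[i]);
    return 0;
}
```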

General Caching Concepts

A program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k (e.g., block 14).
Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, then some current block must be replaced (evicted). Which one is the victim?
Placement policy: where can the new block go? (e.g., b mod 4)
Replacement policy: which block should be evicted? (e.g., LRU)

Q4: What happens on a write?

Cache hit:
Write through: write to both the cache and memory. Generally higher traffic, but simplifies cache coherence.
Write back: write to the cache only (memory is written only when the entry is evicted). A dirty bit per block can further reduce the traffic.

Q4 (continued)

Cache miss:
No-write allocate: only write to main memory (the lower-level memory); the block is not brought into the cache.
Write allocate: the block is allocated in the cache on a write miss, so write misses act like read misses.

Example

Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100].
What are the numbers of hits and misses when using no-write allocate versus write allocate?

Answer: For no-write allocate, the address 100 is not in the cache, and there is no allocation on a write, so the first two writes result in misses. Address 200 is also not in the cache, so the read is a miss as well. The subsequent write to address 200 is a hit. The last write to 100 is still a miss. The result for no-write allocate is four misses and one hit. For write allocate, the first accesses to 100 and 200 are misses, and the rest are hits, since 100 and 200 are then both found in the cache. Thus, the result for write allocate is two misses and three hits.
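The example can be checked with a short simulation. This C sketch models only hit/miss accounting for a fully associative cache that starts empty; the data values and write-back behavior are irrelevant to the count and are omitted.

```c
#include <stdio.h>

#define MAX 16

static int resident[MAX];   /* block addresses currently in the cache */
static int count;

static int lookup(int addr, int allocate) {
    for (int i = 0; i < count; i++)
        if (resident[i] == addr) return 1;           /* hit  */
    if (allocate && count < MAX) resident[count++] = addr;
    return 0;                                         /* miss */
}

static void run(int write_allocate) {
    /* WriteMem[100]; WriteMem[100]; ReadMem[200]; WriteMem[200]; WriteMem[100] */
    int addrs[] = {100, 100, 200, 200, 100};
    int is_wr[] = {1, 1, 0, 1, 1};
    int hits = 0, misses = 0;
    count = 0;
    for (int i = 0; i < 5; i++) {
        /* Reads always allocate; writes allocate only under write allocate. */
        int alloc = is_wr[i] ? write_allocate : 1;
        if (lookup(addrs[i], alloc)) hits++; else misses++;
    }
    printf("%s: %d misses, %d hits\n",
           write_allocate ? "write allocate   " : "no-write allocate",
           misses, hits);
}

int main(void) {
    run(0);   /* prints: no-write allocate: 4 misses, 1 hit  */
    run(1);   /* prints: write allocate   : 2 misses, 3 hits */
    return 0;
}
```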

Six Basic Cache Optimizations
By Sandeep Singh, M.Tech (CSE), 2nd Sem

Average memory access time = Hit time + Miss rate * Miss penalty

Three categories of cache optimizations:
Reducing the miss rate: larger block size, larger cache size, and higher associativity.
Reducing the miss penalty: multilevel caches and giving reads priority over writes.
Reducing the time to hit in the cache: avoiding address translation when indexing the cache.

Three categories of misses

Compulsory: the very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. Compulsory misses are those that occur even in an infinite cache.
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because of blocks discarded and later retrieved. Capacity misses are those that occur in a fully associative cache (beyond the compulsory misses).
Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses. Conflict misses are those that occur in going from fully associative to eight-way associative, four-way associative, and so on.

Four Divisions of Conflict Misses

Eight-way: conflict misses due to going from fully associative to eight-way associative.
Four-way: conflict misses due to going from eight-way associative to four-way associative.
Two-way: conflict misses due to going from four-way associative to two-way associative.
One-way: conflict misses due to going from two-way associative to one-way associative (direct mapped).

First Optimization: Larger Block Size to Reduce Miss Rate

The simplest way to reduce the miss rate is to increase the block size; a larger block size reduces compulsory misses. This reduction occurs because the principle of locality has two components, temporal locality and spatial locality, and larger blocks take advantage of spatial locality.
But larger blocks also increase the miss penalty, and since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small.

Contd.

The selection of block size depends on both the latency and the bandwidth of the lower-level memory. High latency and high bandwidth encourage a large block size, since the cache gets many more bytes per miss for a small increase in miss penalty. Low latency and low bandwidth encourage a smaller block size, since there is little time saved by a larger block.

Second Optimization: Larger Caches to Reduce Miss Rate

The obvious way to reduce capacity misses is to increase the capacity of the cache. The drawback is a potentially longer hit time and higher cost and power. This technique has been especially popular in off-chip caches.

Third Optimization: Higher Associativity to Reduce Miss Rate

There are two rules of thumb.
The first is that eight-way set associative is, for practical purposes, as effective in reducing misses for these sized caches as fully associative.
The second, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.
The drawback is that higher associativity can increase hit time and thus the average memory access time.

Fourth Optimization: Multilevel Caches to Reduce Miss Penalty

Due to the performance gap between processor and memory, designers added another level of cache between the original cache and memory. The first-level cache can be small enough to match the clock cycle time of the fast processor, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty.
Average memory access time for a two-level cache:
Average memory access time = Hit time(L1) + Miss rate(L1) * Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2)
So: Average memory access time = Hit time(L1) + Miss rate(L1) * (Hit time(L2) + Miss rate(L2) * Miss penalty(L2))

Terms adopted for a two-level cache system:

Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. For the first-level cache it equals Miss rate(L1), and for the second-level cache it is Miss rate(L2).
Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the processor. The global miss rate for the first-level cache is still just Miss rate(L1), but for the second-level cache it is Miss rate(L1) * Miss rate(L2).
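A short numeric sketch putting the two-level formula and the local/global miss rates together. The latencies and miss rates are invented example values.

```c
#include <stdio.h>

int main(void) {
    /* Invented example values (cycles and miss fractions). */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* local = global for L1  */
    double hit_l2 = 10.0, miss_rate_l2 = 0.40;   /* local miss rate of L2  */
    double penalty_l2 = 200.0;                   /* main-memory access     */

    /* Miss penalty(L1) = Hit time(L2) + Miss rate(L2) * Miss penalty(L2) */
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    /* Global L2 miss rate: fraction of ALL processor accesses missing both. */
    double global_l2 = miss_rate_l1 * miss_rate_l2;

    printf("AMAT                = %.2f cycles\n", amat);  /* 1 + 0.05*90 = 5.50 */
    printf("Global L2 miss rate = %.3f\n", global_l2);    /* 0.020 */
    return 0;
}
```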

Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce Miss Penalty

This optimization serves reads before earlier writes have been completed.
With a write-through cache, the most important improvement is a write buffer of the proper size. Write buffers, however, complicate memory accesses because they might hold the updated value of a location needed on a read miss. The simplest way out is for the read miss to wait until the write buffer is empty. The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.
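A minimal sketch of the second alternative: on a read miss, scan the write buffer for a pending write to the same address before going to memory. The buffer layout and entry count are assumptions for illustration only.

```c
#include <stdio.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { int valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss: forward data from the write buffer if a pending write
   matches; otherwise the read may go to memory at once, without waiting
   for the buffer to drain. Returns 1 on a conflict (data forwarded). */
static int read_miss_check(uint32_t addr, uint32_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data = write_buffer[i].data;   /* conflict: use buffered value */
            return 1;
        }
    return 0;                               /* no conflict: read memory     */
}

int main(void) {
    write_buffer[0] = (struct wb_entry){1, 0x100, 42};
    uint32_t v;
    if (read_miss_check(0x100, &v))
        printf("read 0x100 satisfied from write buffer: %u\n", v);
    if (!read_miss_check(0x200, &v))
        printf("read 0x200 has no conflict; memory read proceeds\n");
    return 0;
}
```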

Sixth Optimization: Avoiding Address Translation during Indexing of the Cache to Reduce Hit Time

We could use virtual addresses for the cache, since hits are much more common than misses. Such caches are termed virtual caches, with physical cache used to identify the traditional cache that uses physical addresses. The cache has two important tasks: indexing the cache and comparing addresses. Full virtual addressing for both indices and tags eliminates address translation time from a cache hit.

Some reasons for not building virtually addressed caches:

Protection: page-level protection is checked as part of the virtual-to-physical address translation, and it must still be enforced. One solution is to copy the protection information from the TLB on a miss, add a field to hold it, and check it on every access to the virtually addressed cache.
Process switching: every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed. One solution is to increase the width of the cache address tag with a process-identifier tag (PID). If the operating system assigns these tags to processes, it only needs to flush the cache when a PID is recycled; that is, the PID distinguishes whether or not the data in the cache are for this program.

A third reason is that operating systems and user programs may use different virtual addresses for the same physical address. These duplicate addresses, called synonyms or aliases, could result in two copies of the same data in a virtual cache; if one is modified, the other will have the wrong value. With a physical cache this cannot happen, since the accesses would first be translated to the same physical cache block. Hardware solutions to the synonym problem, called antialiasing, guarantee every cache block a unique physical address. Software can make the problem much easier by forcing aliases to share some address bits; this restriction is called page coloring.

Finally, I/O typically uses physical addresses and would thus require mapping to virtual addresses to interact with a virtual cache.
One alternative is to use part of the page offset (the part that is identical in both virtual and physical addresses) to index the cache. While the cache is being read using that index, the virtual part of the address is translated, and the tag match uses physical addresses. This alternative allows the cache read to begin immediately, and yet the tag comparison still uses physical addresses. The limitation of this virtually indexed, physically tagged approach is that a direct-mapped cache can be no bigger than the page size.
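The size constraint follows from simple arithmetic: the index and offset bits must fit inside the page offset, so each way can be at most one page, and n ways allow n times the page size. This sketch (assuming 4 KB pages, an illustrative value) computes the limit for a few associativities.

```c
#include <stdio.h>

int main(void) {
    unsigned page_size = 4096;   /* assumed 4 KB pages */

    /* Each way is limited to one page; n ways allow n * page_size total. */
    for (unsigned ways = 1; ways <= 8; ways *= 2)
        printf("%u-way VIPT cache limit: %u KB\n",
               ways, ways * page_size / 1024);
    return 0;
}
```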
