CPS 104 Computer Organization and Programming Lecture-28: Cache Memory, Virtual Memory

March 26, 2004 Gershon Kedem http://kedem.duke.edu/cps104/Lectures
CPS104 Lec28.1
GK Spring 2004
Admin.
Homework-6 is posted.
- The due date was extended to March 29.
- No further extension!
- The second part of this assignment will be posted Monday, March 22.
- This assignment is harder than it looks.
- These two assignments have larger weight than the other homework assignments.
- Please start ASAP!
Homework-7 is posted.
Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Separate Instruction and Data Caches

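Separate instruction and data caches (Harvard architecture)
- Both caches can be accessed at the same time.
Combined L2 cache
- L2 is much larger than L1 (L2 >> L1).
[Figure: the data path connects to an instruction cache and a data cache; both are backed by a combined L2 cache, which is backed by DRAM.]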
Cache Performance
CPU time = (CPU_execution_clock_cycles + Memory_stall_clock_cycles) x clock_cycle_time
Memory_stall_clock_cycles = Memory_accesses x Miss_rate x Miss_penalty

Example
- Assume every instruction takes 1 cycle
- Miss penalty = 20 cycles
- Miss rate = 10%
- 1000 total instructions, 300 memory accesses
Memory stall cycles? CPU clocks?
Cache Performance
Memory stall cycles = 300 x 0.10 x 20 = 600
CPU clocks = 1000 + 600 = 1600
The program takes 60% more cycles because of cache misses!
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
Classifying Misses: 3 Cs
- Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in Infinite Cache)
- Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Size X Cache)
- Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
Cache Performance
Your program and caches
- Can you affect performance?
- Think about the 3 Cs.
Reducing Misses by Compiler Optimizations
Instructions
- Reorder procedures in memory so as to reduce misses
- Profiling to look at conflicts
- McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
Data
- Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
- Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
- Blocking: improve temporal locality by accessing blocks of data repeatedly vs. going down whole columns or rows
Merging Arrays Example

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key
Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before: 2 misses per access to a & c. After: one miss per access.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

Two Inner Loops:
- Read all NxN elements of z[]
- Read N elements of 1 row of y[] repeatedly
- Write N elements of 1 row of x[]
Capacity Misses a function of N & Cache Size:
- If the cache can hold all 3 NxN matrices => no capacity misses; otherwise ...
Idea: compute on BxB submatrix that fits
Blocking Example
/* After */
/* x[][] must be zero on entry: each kk block adds a partial sum */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

Capacity Misses from 2N^3 + N^2 to 2N^3/B + N^2
B called Blocking Factor
Conflict Misses Too?
Reducing Conflict Misses by Blocking

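[Figure: plot against blocking factor (x-axis, 0 to 150; y-axis, 0 to 0.15) with one curve for a direct-mapped cache and one for a fully associative cache.]
Conflict misses in caches that are not fully associative vs. blocking size
- Lam et al [1991]: a blocking factor of 24 had a fifth the misses of a blocking factor of 48, even though both fit in the cache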
Summary of Compiler Optimizations to Reduce Cache Misses

[Figure: bar chart of performance improvement (x-axis, 1 to 3) from merged arrays, loop interchange, loop fusion, and blocking for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
Summary
- Cost Effective Memory Hierarchy
- Split Instruction and Data Cache
- 4 Questions
- CPU cycles/time, Memory Stall Cycles
- Your programs and cache performance
Next: Virtual Memory
Virtual Memory
Memory System Management Problems
Different programs have different memory requirements.
- How to manage program placement?
Different machines have different amounts of memory.
- How to run the same program on many different machines?
At any given time each machine runs a different set of programs.
- How to fit the program mix into memory? Reclaiming unused memory? Moving code around?
The amount of memory consumed by each program is dynamic (changes over time).
- How to effect changes in memory location: add or subtract space?
Program bugs can cause a program to generate reads and writes outside the program address space.
- How to protect one program from another?
Virtual Memory
Provides illusion of very large memory
- Sum of the memory of many jobs greater than physical memory
- Address space of each job larger than physical memory
Allows available (fast and expensive) physical memory to be well utilized.
Simplifies memory management: code and data movement, protection, ... (main reason today)
Exploits memory hierarchy to keep average access time low.
Involves at least two storage levels: main and secondary
Virtual Address -- address used by the programmer
Virtual Address Space -- collection of such addresses
Memory Address -- address in physical memory, also known as physical address or real address
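Review: A Simple Program's Memory Layout
- What are the possible addresses generated by the program?
- How big is our DRAM?
- Is there more than one program running? If so, how do we allocate memory to each?
[Figure: a program's address space, with addresses up to 2^n - 1, divided into Stack, Data, Text, and Reserved regions; the example contents shown are 5 (stack), 3 (data), and the instruction add r,s1,s2 (text).]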
Paged Virtual Memory: Main Idea
Divide memory (virtual and physical) into fixed size blocks (Pages, Frames).
- Pages in Virtual space.
- Frames in Physical space.
Make page size a power of 2: (page size = 2^k)
All pages in the virtual address space are contiguous.
Pages can be mapped into physical Frames in any order.
Some of the pages are in main memory (DRAM), some of the pages are on secondary memory (disk).
All programs are written using Virtual Memory Address Space.
The hardware does on-the-fly translation between virtual and physical address spaces.
Use a Page Table to translate between Virtual and Physical addresses
Basic Issues in Virtual Memory System Design

- Size of the information blocks (pages) that are transferred from secondary to main storage (M)
- When a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy
- Which region of M is to hold the new block --> placement policy
- A missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging Organization: virtual and physical address space partitioned into blocks of equal size (pages and frames)
[Figure: storage hierarchy of reg, cache, mem, and disk, labeled with frames and pages.]
Virtual and Physical Memories

[Figure: Virtual Memory pages (Page-0 through Page N) mapped either to Physical Memory frames (Frame-0 through Frame-5) or to Disk.]

Virtual to Physical Address translation

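Page size: 4K (12-bit page offset)
Virtual Address: Virtual Page Number (bits 31-12), Page offset (bits 11-0)
Page Table: maps the Virtual Page Number to a Physical Frame Number; the Page offset is unchanged
Physical Address: Physical Frame Number (bits 29-12), Page offset (bits 11-0)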