a. True
b. False
Entry Quiz
2. SRAM is implemented using
a. Flip-flop
b. Magnetic core
c. Capacitor
d. Non-volatile technology
Entry Quiz
3. Main memory (200 ns) is slower than a register (0.2 ns) by an order of
a. 3
b. 4
c. 1000
d. 10000
Entry Quiz
4. Virtual memory is
a. Same as caching
b. Same as associative memory
c. Different from caching
d. Same as disk memory
Entry Quiz
5. A cache miss occurs when the
a. Required instruction is not found in the cache
b. Required data is not found in the cache
c. Required instruction or data is not found in the cache
d. Required instruction or data is not found in the main memory
e. For all of the above conditions
Module Objective
• To understand
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it
3. Moore’s Law
4. Principles of Locality
5. Cache Memory and its implementation
6. Cache Performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency, Bandwidth, SRAM, DRAM, Order of Magnitude, Direct Mapping, Associative Mapping, Set-Associative Mapping, Write Through, Write Allocate, Write Back, Dirty Bit, and Valid Bit
Memory Requirements
• In general we would like to have
– Faster memory (lower access time or latency)
– Larger memory (capacity and bandwidth)
– Simpler memory
Memory Requirements – Server, Desktop, and Embedded Devices
• Desktop
– Lower access time
– Larger memory
• Server
– Lower access time
– Higher bandwidth*
– Better protection*
– Larger memory
• Embedded
– Lower access time
– Simpler memory*
Processor-Memory Gap
Moore’s Law
• Transistor density on a chip die doubles roughly every 1.5 to 2 years.
• Short reference: http://en.wikipedia.org/wiki/Moore's_law
What is Memory Hierarchy and Why?
• CPU (registers): 0.25 ns
• Main memory: 250 ns
• Storage & I/O devices (reached through the bus adapter): 2,500,000 ns!
Memory Hierarchy &
Cache
Cache
• Cache is a smaller, faster, and more expensive memory.
• It improves the throughput/latency of the slower memory next to it in the memory hierarchy.
• Blocking reads and delaying writes to the slower memory offers better performance.
• There are two cache memories, L1 and L2, between the CPU and main memory.
• L1 is built into the CPU.
• L2 is an SRAM.
Cache Operation
• Cache Hit: the CPU finds the required data item (or instruction) in the cache.
• Cache Miss: the CPU does not find the required item in the cache.
– The CPU is stalled.
– Hardware loads the entire block that contains the required data item from memory into the cache.
– The CPU resumes execution once the block is loaded into the cache.
• Spatial Locality
• Principle of Locality
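As an illustrative sketch (not from the slides), the loop below shows why loading a whole block pays off under spatial locality: the first access to each cache block misses, and the following accesses to neighbouring elements in the same block hit. The array size and block size are assumed example values.

#include <stdio.h>

#define N 1024          /* assumed array size */
#define BLOCK_WORDS 8   /* assumed words per cache block */

int main(void) {
    static int a[N];
    long sum = 0;

    /* Sequential traversal: roughly one miss per block, followed by
       BLOCK_WORDS - 1 hits, because the whole block is brought into
       the cache on the first miss. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld, about %d misses expected\n", sum, N / BLOCK_WORDS);
    return 0;
}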
Hit and Miss
• Hit (the item is found in the higher level)
– Hit Rate
– Hit Time
• Miss (the item must be fetched from the lower level)
– Miss Rate
– Miss Penalty
• Hit Time << Miss Penalty
Cache Performance
Program Execution Time
• CPU Clock Cycles
• Cycle Time
• IC – Instruction Count
• Program Execution Time (simple model)
= IC × (Clock Cycles / Instruction) × Cycle Time
= IC × CPI × Cycle Time
Example 1
• Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when
all memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions.
If the miss penalty is 25 clock cycles and
miss rate is 2%, how much faster would
the computer be if all instructions were
cache hits?
Example 1 …
1. Execution time for the program on a computer with 100% hits
= IC × CPI × Cycle Time = IC × 1.0 × Cycle Time
2. Memory stall cycles per instruction
= Memory accesses per instruction × Miss rate × Miss penalty
= (1 + 0.5) × 0.02 × 25 = 0.75
3. Execution time with misses
= IC × (1.0 + 0.75) × Cycle Time = 1.75 × IC × Cycle Time
4. Ratio
= 1.75 × IC × Cycle Time / (IC × 1.0 × Cycle Time)
= 1.75
The computer is 1.75 times faster with no misses.
Example 2
• Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when
all memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions.
Miss penalty for read miss is 25 clock
cycles and miss penalty for write miss is
50 clock cycles. Miss rate is 2% and out
of this 80% is Read miss. How much
faster would the computer be if all
instructions were cache hits?
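The slides do not show a worked solution for Example 2; the following is a sketch using the same simple model as Example 1 (1.5 memory accesses per instruction: one instruction fetch plus 0.5 data accesses).

Average miss penalty = 0.8 × 25 + 0.2 × 50 = 30 clock cycles
Memory stall cycles per instruction = 1.5 × 0.02 × 30 = 0.9
CPI with misses = 1.0 + 0.9 = 1.9
The computer would be 1.9 / 1.0 = 1.9 times faster if all accesses were cache hits.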
Cache Operation
• The cache contains C = 16 lines; each line holds a tag plus one block of K words.
• Main memory contains 2^N words, organized as blocks of K words; any block can be loaded into a cache line.
Elements of Cache Design
• Cache Size
• Block Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Write Miss
• Number of caches
• Split versus Unified/Mixed Cache
Mapping Functions
• The address from the CPU is divided into Tag | Line | Word fields.
• Lookup: 1. Select the cache line given by the Line field. 2. Copy its stored tag. 3. Compare it with the Tag field of the address. 4. Hit or miss. 5. On a hit, load the addressed word; on a miss, fetch the containing block (Block 0 … Block n-1) from main memory.
Mapping Function
• Direct
– The Line field of the address uniquely identifies one line in the cache: 1 tag comparison (see the address-decomposition sketch after this list).
• Set Associative
– The Line (set) field of the address points to a set of lines in the cache (typically 2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way set associative.
• Associative
– There is no Line field (its value is effectively 0), so a block may sit in any line; every line's tag is compared (m tag comparisons, 4 in the example).
– Uses Content Addressable Memory (CAM) for the comparison.
– Needs a non-trivial replacement algorithm.
• Example: 2-way set-associative lookup. The address from the CPU is split into Tag, Set, and Word fields. The Set field selects one set of two cache lines (SET 0 holds Lines 0–1, SET 1 holds Lines 2–3); the tags of both lines in the set are compared with the Tag field. On a hit the addressed word is loaded; on a miss the containing block (Block 0 … Block 127) is fetched from main memory.
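A minimal sketch (not from the slides) of how the Tag, Line, and Word fields might be extracted from a word address in a direct-mapped cache; the field widths are assumed example values.

#include <stdint.h>
#include <stdio.h>

/* Assumed example geometry: 4-word blocks, 16-line direct-mapped cache. */
#define WORD_BITS 2   /* 4 words per block  -> 2-bit Word field */
#define LINE_BITS 4   /* 16 lines in cache  -> 4-bit Line field */

int main(void) {
    uint32_t addr = 0x1234;   /* arbitrary example word address */

    uint32_t word = addr & ((1u << WORD_BITS) - 1);
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);

    /* Direct mapping: the Line field picks exactly one cache line,
       so the lookup needs only a single tag comparison. */
    printf("tag=%u line=%u word=%u\n",
           (unsigned)tag, (unsigned)line, (unsigned)word);
    return 0;
}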
• Write back
– Information is written only to the cache. The content of the cache block is written to main memory only when that block is replaced or the program terminates.
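As an illustrative sketch (not the slides' own code), a write-back cache line can track modification with a dirty bit and write the block to memory only when the line is evicted; the structures and helper below are assumptions, not a real cache implementation.

#include <stdbool.h>
#include <stdio.h>

#define BLOCK_WORDS 4

struct cache_line {
    unsigned tag;
    bool valid;
    bool dirty;               /* set on writes; block written back only on eviction */
    int data[BLOCK_WORDS];
};

/* stand-in for copying a block back to main memory */
static void memory_write_block(unsigned tag, const int *block) {
    printf("write back block with tag %u (first word %d)\n", tag, block[0]);
}

/* Write-back policy: a store updates only the cache and sets the dirty bit. */
static void cache_write(struct cache_line *line, unsigned word, int value) {
    line->data[word] = value;
    line->dirty = true;
}

/* On replacement, a dirty block must first be written back to memory. */
static void cache_evict(struct cache_line *line) {
    if (line->valid && line->dirty)
        memory_write_block(line->tag, line->data);
    line->valid = false;
    line->dirty = false;
}

int main(void) {
    struct cache_line line = { .tag = 7, .valid = true, .dirty = false };
    cache_write(&line, 0, 42);   /* hits the cache; memory is not touched */
    cache_evict(&line);          /* the dirty bit forces the write back now */
    return 0;
}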
Improving Cache
Performance
• Reducing miss penalty
• Reducing misses
– Compiler optimizations that attempt to reduce cache misses fall under this category
• Stalled cycles per instruction = Misses per instruction × Miss penalty
Improving Cache
Performance Using Compiler
• Compilers are built with the following optimizations:
Instructions:
– Reordering instructions to avoid conflict misses
Data:
– Merging arrays
– Loop interchange
– Loop fusion
– Blocking
Merging Arrays
/* Conflict: two separate arrays accessed with the same index */
int key[SIZE];
int value[SIZE];
/* instead: one array of structures keeps key and value together */
struct record { int key; int value; };
struct record records[SIZE];
Loop Interchange
/* Conflict: the inner loop runs over i, touching x[i][j] with a large stride */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2*x[i][j];

/* instead: the inner loop runs over j, touching consecutive elements of x[i][j] */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2*x[i][j];

Memory addresses in row-major order: X(0,0), X(0,1), X(0,2), X(0,3), X(0,4), X(0,5), …, X(0,4999), X(1,0), X(1,1), …
Blocking
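The slide content for Blocking is not shown above; as an illustrative sketch (not from the source), blocking (tiling) restructures a loop nest so that each sub-block of the data is reused while it still fits in the cache. The array size N and blocking factor B below are assumed example values.

#define N 512
#define B 64   /* assumed blocking factor: a BxB tile fits comfortably in the cache */

static double x[N][N], y[N][N];

/* Blocking (tiling): process the arrays in BxB tiles so each tile of x and y
   is reused from the cache instead of being re-fetched from memory. */
void blocked_transpose(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    y[j][i] = x[i][j];
}

int main(void) {
    x[3][5] = 1.0;
    blocked_transpose();
    return y[5][3] == 1.0 ? 0 : 1;   /* the element has been transposed */
}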
Loop Fusion
/* Conflict: two separate loops traverse a[] and b[] */
for (i = 0; i < 5000; i = i+1)
  a[i] = i;
for (i = 0; i < 5000; i = i+1)
  b[i] = b[i] + a[i];
/* instead: the fused loop reuses a[i] while it is still in the cache */
for (i = 0; i < 5000; i = i+1) {
  a[i] = i;
  b[i] = b[i] + a[i];
}
Exit Quiz
1. In the memory hierarchy, the top layer is occupied by
a. EPROM
b. DRAM
c. SRAM
d. Flash
Exit Quiz
3. A server's memory requirements are different from a desktop's memory requirements
a. True
b. False
Exit Quiz
4. Hit Rate is (1 – Miss Rate)
a. True
b. False
Exit Quiz
6. Gordon Moore's law states that the transistor density
a. Triples every year
b. Doubles every two years
c. Doubles every 1.5 years
d. Doubles every year
Improving Cache
Performance
• Two options
– By Reducing Miss Penalty
– By Reducing Miss Rate
Reducing Cache Miss
Penalty
1. Multilevel caches
2. Critical word first & early restart
3. Priority for read misses over writes
4. Merging write buffers
5. Victim cache
Reducing Miss Penalty -
Multilevel caches (1)
• Without L2: CPU ↔ L1 cache (2 ns, 16 KB) ↔ main memory (100 ns, 128 MB)
• With L2: CPU ↔ L1 cache (2 ns, 16 KB) ↔ L2 cache (10 ns, 512 KB) ↔ main memory (100 ns, 128 MB)
• Obviously!
Reducing Cache Misses – Higher Associativity (3)
Reducing Hit Time – Avoiding Address Translation
• Addressing
– VA -> Physical Address -> cache
– Skip the two levels: the VA maps directly to the cache
• Problems:
– No page boundary checks
– Building a direct mapping between VA and cache for every process is not easy.
Reducing Hit Time – Pipelined Cache (3)
• Note: no need to describe these two policies
• Write-through does not buy anything extra for a multi-level cache
[Figure: cache lookup — tags are compared; on a hit the addressed word is loaded; on a miss the request goes to main memory and the containing block (e.g., Block 126 or Block 127) is loaded into a cache line.]
Assignment I – Due same
day next week
• Mapping functions
• Replacement algorithms
• Write policies
• Write Miss policies
• Split Cache versus Unified Cache
• Primary Cache versus Secondary
Cache
• Compiler cache optimization
techniques with examples
Assignment II - Due same
day next week
• Multilevel Cache
• Cache Inclusion/Exclusion Property
• Thumb rules of cache
• Compiler pre-fetch
• Multi-level Caching + one another
Miss Penalty Optimization
technique
• Two miss Rate Optimization
Techniques
Assignment III - Due 2nd
class of next week
• All odd-numbered problems from the cache module of your textbook.
Assignment IV - Due 2nd
class of next week
• All even-numbered problems from the cache module of your textbook.
CPU Execution Time & Average Access Time
• Single cache: CPU ↔ cache (1 CC) ↔ memory (100 CC)
• With multi-level cache: CPU ↔ L1 cache (1 CC) ↔ L2 cache (10 CC) ↔ memory (100 CC)
(CC – clock cycle)
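A sketch of how the average memory access time (AMAT) could be computed for the two configurations above; the miss rates used below are assumed example values, not taken from the slides.

#include <stdio.h>

int main(void) {
    /* Access times from the figure, in clock cycles. */
    double l1_hit = 1.0, l2_hit = 10.0, mem = 100.0;

    /* Assumed example miss rates. */
    double l1_miss = 0.05, l2_miss = 0.20;

    /* Single cache: AMAT = hit time + miss rate * miss penalty. */
    double amat_single = l1_hit + l1_miss * mem;

    /* Two-level cache: the L1 miss penalty is itself an access into L2. */
    double amat_multi = l1_hit + l1_miss * (l2_hit + l2_miss * mem);

    printf("AMAT, single-level cache: %.2f CC\n", amat_single);  /* 6.00 */
    printf("AMAT, two-level cache:    %.2f CC\n", amat_multi);   /* 2.50 */
    return 0;
}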
Memory Hierarchy
Main Memory
Main Memory
• Module Objective
– To understand main memory latency and bandwidth
– To understand techniques for improving latency and bandwidth
Memory Hierarchy &
Cache
Main Memory – Cache – I/O
• CPU (0.25 ns) ↔ cache ↔ main memory (250 ns)
• Bus adapter ↔ storage & I/O devices (2,500,000 ns!)
• CPU ↔ cache ↔ main memory, connected by separate address and data buses
• Sending the address takes 4 CC, the memory access takes 56 CC, and returning the data takes 4 CC (CC – clock cycle)
• Access time per word = 4 + 56 + 4 = 64 CC
• One word is 8 bytes, so the latency works out to about 1 bit/CC
Improving Memory
Performance
• Improving latency (time to access one memory unit – a word)
• Improving bandwidth (bytes accessed per unit time)
Improving Memory Bandwidth
• Simple design: one-word-wide memory and bus
• Wider bus and memory: several words are transferred per access
• Interleaved memory: consecutive words are spread across banks (word 0 → Bank 0, word 1 → Bank 1, word 2 → Bank 2, word 3 → Bank 3, word 4 → Bank 0, …) so the banks can be accessed in parallel
• Cache block = 4 words; one word = 8 bytes
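Using the figures above together with the 4 + 56 + 4 CC word access from the earlier slide, a sketch of how the three organizations compare when fetching one 4-word cache block (the overlap assumptions are illustrative, not from the slides):

Simple design (1-word bus and memory): 4 × (4 + 56 + 4) = 256 CC per block
Wider bus and memory (4 words wide): 4 + 56 + 4 = 64 CC per block
Interleaved memory (4 banks, 1-word bus): 4 + 56 + 4 × 4 = 76 CC per block
Bandwidth: 32 bytes / 256 CC ≈ 0.125 bytes/CC for the simple design versus 32 bytes / 76 CC ≈ 0.42 bytes/CC with interleaving.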