Learning Objectives
• Why are some memories slow?
• What is a memory hierarchy?
• Why do we need a memory hierarchy?
• What is a cache?
Memory in a Modern System
High-Performance Microprocessors: Logic Chip or Memory Chip?
Processor-DRAM Memory Gap (latency)
[Figure: µProc speed increases ~55% per year while DRAM latency improves far more slowly, so the processor-memory performance gap grows ~50% per year. Source: David Patterson, UCB]
[Die photos: Montecito* (dual-core, 90nm technology, L2 24MB, L1 2MB) and Penryn (dual-core, 45nm technology, two 6MB L2 caches). V. George et al., "Penryn: 45-nm Next Generation Intel® Core™ 2 Processor," Proceedings of the IEEE Solid-State Circuits Conference. *Intel]
Review: Datapath Operations
• Load operation: loads data from data memory into the register file (RF)
• ALU operation: transforms data by passing one or two RF register values through the ALU, performing an operation (ADD, SUB, AND, OR, etc.), and writing the result back into the RF
• Store operation: stores an RF register value back into data memory
• Each operation can be done in one clock cycle
[Figure: datapath with the load, ALU, and store operation paths highlighted]
What are the Common Memory Technologies?
[Figure: memory in a system — processor (control + datapath with on-chip registers and cache), second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (tape)]

Memory Arrays
[Figure: internal organization of a memory array — address buffers feed row and column decoders; some address bits select bits within a row via mux/demux after the sense amplifiers; CS (chip select) and WE (write enable) control the data buffers. Example: with word lines of 1024 bits and 8-bit words, M/L = 1024/8 = 128 rows (words 16-23 share one row, words 1016-1023 another). Large memories are split into blocks, each gated by a block-enable (Blk EN) signal decoded from the block address and connected to a global bus]
Random Access Memory (RAM)
• RAM – readable and writable memory
  – "Random access memory" is a strange name—created several decades ago to contrast with sequentially-accessed storage like tape drives
  – Logically the same as a register file—memory with address inputs, data inputs/outputs, and control
• RAM vs. RF
  – RAM usually has one port; an RF usually has two or more
  – RAM is typically larger than about 512 or 1024 words
  – RAM typically stores bits using a bit storage approach that is more efficient than a flip-flop
  – RAM is typically implemented on a chip in a square rather than rectangular shape—this keeps the longest wires (hence delay) short
[Figure: a 16×32 register file (W_data, W_addr, W_en; R_data, R_addr, R_en) next to a 1024×32 RAM block symbol (32-bit data, 10-bit addr, rw, en)]
RAM Internal Structure
• Similar internal structure to a register file
  – Let A = log2(M); an A×M decoder enables the appropriate word based on the address inputs addr0 … addr(A-1)
  – rw controls whether a cell is written or read
  – The rd and wr data lines are typically combined, with rw routed to all cells
  – Let's see what's inside each RAM cell
[Figure: a 1024×32 RAM built from an array of bit storage blocks (aka "cells"), each connected to a word enable line d0 … d(M-1) from the decoder and to shared data lines wdata0 … wdata(N-1) / rdata0 … rdata(N-1); clk, en, and rw are distributed to all cells]
Static RAM (SRAM)
• "Static" RAM cell
  – 6 transistors (recall an inverter is 2 transistors)
  – Writing this cell: the word enable input comes from the decoder
    • When word enable is 0, the value d loops around the two cross-coupled inverters—that loop is where a bit stays stored
    • When word enable is 1, the data bit value enters the loop
  – data is the bit to be stored in this cell; data' enters on the other side
  – The example shows a "1" being written into the cell
[Figure: SRAM cell with complementary data/data' lines and a word enable line, shown holding a stored bit and then being written with a 1]
Static RAM (SRAM)
[Figure: the 1024×32 SRAM block symbol (32-bit data, 10-bit addr, rw, en)]

SRAM Column Example
[Figure: one SRAM column — bitline conditioning circuits precharge bit/bit_b, a word line selects one cell among the others in the column, and separate read and write circuitry senses or drives the bitlines]
Dynamic RAM (DRAM)
• "Dynamic" RAM cell
  – 1 transistor (rather than 6)
  – Relies on a capacitor to store the bit; the word enable line connects the capacitor (holding d) to the bit line
  – The capacitor slowly discharges, so the stored bit must be refreshed
[Figure: the 1024×32 RAM block symbol and one DRAM cell — (a) the capacitor slowly discharging while storing d; (b) the capacitor discharged]
Dynamic RAM 1-Transistor Cell: Layout
[Figure: cross-section of a 1-T DRAM cell — word line over a transfer gate, storage capacitor (cell plate, dielectric, storage electrode, refilling poly), insulating layer, and isolation insulator on the cell plate Si]
DRAM Latency
[Figure: manufacturing variation across a chip, from the largest cells (low DRAM latency) to the smallest (high latency). Source: Prof. Onur Mutlu, ETH Zürich]
[Figure: a DRAM array — the row decoder drives a wordline, each access transistor connects its capacitor to a bitline, and sense amplifiers (SA) restore the bitline voltage from 0.5 Vdd after bitline charge sharing. Process variation during manufacturing leads to cells having unique characteristics: some cells have a high % chance to fail with reduced tRCD, others a low % chance. Timeline: ACTIVATE → SA enable → READ; the bitline voltage must reach Vmin ("ready to access") within tRCD]
DRAM Accesses and Failures
[Figure: the same bitline-voltage timeline (ACTIVATE → SA enable → READ, within tRCD) — weaker cells charge their bitline more slowly, so they have a higher probability to fail]
Comparing Memory
• An M×N memory can be implemented as a:
  – Register file – fastest
  – SRAM – between the two in speed and density
  – DRAM – slowest (capacitor), and refreshing takes time, but very compact (lower cost)
• Use a register file for small items, SRAM for large items, and DRAM for huge items
  – Note: DRAM's big capacitor requires a special chip design process, so DRAM is often a separate chip
[Figure: size comparison of the three cell types for the same number of bits (not to scale)]
Read Timing for Conventional DRAM
[Timing diagram]

Read Timing for Extended Data Out
[Timing diagrams]
Memory Hierarchy
Actual Memory Systems (Cache Principle)

Locality
• Principle of locality: programs tend to use data and instructions with addresses near or equal to those they have used recently
• Temporal locality: recently referenced items are likely to be referenced again in the near future
• Spatial locality: items with nearby addresses tend to be referenced close together in time
Locality Example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data references: the loop references array elements in succession → spatial locality (and it references sum on every iteration → temporal locality)
Source: http://www.eetimes.com/document.asp?doc_id=1275470
The Memory Hierarchy
By taking advantage of the principle of locality, a memory hierarchy can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
[Figure: on-chip components (control, instruction cache, data cache in SRAM/eDRAM), second-level cache (SRAM), main memory (DRAM), secondary storage]
• L0: Registers – CPU registers hold words retrieved from the L1 cache
• L1: L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
• L2: L2 cache (SRAM) – holds cache lines retrieved from the L3 cache
• L3: L3 cache (SRAM) – holds cache lines retrieved from main memory
• L4: Main memory (DRAM) – holds disk blocks retrieved from local disks
• L5: Local secondary storage (local disks) – holds files retrieved from disks on remote servers
• L6: Remote secondary storage (e.g., Web servers)
Moving up this pyramid, storage devices are smaller, faster, and costlier per byte; moving down, they are larger, slower, and cheaper per byte.
Cache Example
[Figure: CPU ↔ cache ↔ main memory. Time 1: hit—the item is found in the cache; on a miss, Time 2 + Time 3 are spent fetching from main memory]
Average Memory Access Time = hit time (tc) + miss rate (m) × miss penalty (tp)
                           = tc + m × tp
Hit time = Time 1 (tc); miss penalty (tp) = Time 2 + Time 3
Example: Impact on Performance
• Suppose a processor executes at
  – Clock rate = 1000 MHz (1 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations incur a 50-cycle miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles) + ( 0.30 (data mops/instr) × 0.10 (miss/data mop) × 50 (cycles/miss) )
      = 1.1 cycles + 1.5 cycles
      = 2.6
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
Source: David Patterson
Let’s think about those numbers
• Huge difference between a hit and a miss
  – Could be 100×, if just L1 and main memory
[Figure: direct-mapped cache lookup — a 20-bit tag and a 10-bit index select one of 1024 entries (0 … 1021, 1022, 1023); the stored tag is compared against the address tag to produce the hit signal, and the 32-bit data is returned]
4KB Direct Mapped Cache Example
• Byte address (showing bit positions 31 30 … 13 12 11 … 2 1 0) from the CPU:
  – Tag field: 20 bits
  – Index field: 10 bits → 2^10 = 1K = 1024 blocks
  – Block offset: 2 bits (each block = one word = 4 bytes)
• Can cache up to 2^32 bytes = 4 GB of memory
• Each SRAM entry holds a valid bit, the tag, and the data; the stored tag is compared against the address tag to decide hit or miss
• Hit access time = SRAM delay + hit/miss logic delay
Direct Mapped Cache [contd…]
• What is the size of the cache? 4K entries (each holding a 16-byte block, i.e., 64 KB of data)
• If I read
  0000 0000 0000 0000 0000 0000 1000 0001
  the index and block offset fields select the entry and the word within it
• Address fields (block address = 28 bits):
  – Tag = 16 bits
  – Index = 12 bits → 4K SRAM entries
  – Block offset = 4 bits → each block holds 128 bits = four 32-bit words; after tag matching decides hit or miss, a mux selects the requested 32-bit word
• Can cache up to 2^32 bytes = 4 GB of memory
• Typical cache block or line size: 32-64 bytes
• Larger cache blocks take better advantage of spatial locality and thus may result in a lower miss rate
• Mapping function: cache block frame number = (block address) MOD 4096, i.e., the index field, or the 12 low bits of the block address
• Hit access time = SRAM delay + hit/miss logic delay
Direct Mapped Cache [contd…]
Taking advantage of spatial locality, we read 4 bytes at a time
Caching Terminology
block/line: the unit of information that can be exchanged between two adjacent levels in the memory hierarchy
cache hit: when the CPU finds a requested data item in the cache
cache miss: when the CPU does not find a requested data item in the cache
miss rate (m): fraction of cache accesses that result in a miss
hit rate (h): 1 − miss rate; the fraction of memory accesses found in the faster level (cache)
hit time (tc): time required to access the level-1 (cache)
miss penalty (tp): time required to replace a data unit in the faster level (cache) + time to deliver the requested data to the processor from the cache (hit time << miss penalty)
average memory access time (amat) = hit time (tc) + miss rate (m) × miss penalty (tp) = tc + (1 − h) × tp
Cache Misses
• Compulsory (cold start or process migration, first reference): first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run billions of instructions, compulsory misses are insignificant
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
Two Way Set Associative Example
[Figure: lookup of address 16339C in a two-way set associative cache]
4K Four-Way Set Associative Cache: MIPS Implementation Example
• Nominal capacity: 1024 block frames, each block = one word
• 4-way set associative → 1024 / 4 = 2^8 = 256 sets
• Byte address (bit positions 31 30 … 12 11 10 9 8 … 3 2 1 0):
  – Tag field: 22 bits
  – Index field (set number): 8 bits
  – Block offset field: 2 bits
• Can cache up to 2^32 bytes = 4 GB of memory
[Figure: the index selects one of the 256 sets (0 … 253, 254, 255); the four stored tags in the set are compared against the address tag in parallel, the comparator outputs are ORed into the hit signal, and Sel1/Sel0 mux out the matching cache block]
Problem 1
(a) A 64KB, direct mapped cache has 16-byte blocks. If addresses are 32 bits, how many bits are used for the tag, index, and offset in this cache?
(b) How would the address be divided if the cache were 4-way set associative instead?
(c) How many bits is the index for a fully associative cache?
Problem 2
An ISA has 44-bit addresses, with each addressable item being a byte. Design a four-way set associative cache with each of the four lines in a set containing 64 bytes. Assume that you have 256 sets in the cache.
Calculating Number of Cache Bits Needed
[Figure: address fields (tag, index, block offset of the block address) and a cache block frame (valid bit V, tag, data)]
• How many total bits are needed for a direct-mapped cache with 64 KBytes of data and one-word blocks, assuming a 32-bit address?
  – 64 Kbytes = 16K words = 2^14 words = 2^14 blocks (nominal capacity = 64 KB; number of cache block frames)
  – Block size = 4 bytes => offset size = log2(4) = 2 bits
  – #sets = #blocks = 2^14 => index size = 14 bits
  – Tag size = address size − index size − offset size = 32 − 14 − 2 = 16 bits
  – Bits/block = data bits + tag bits + valid bit = 32 + 16 + 1 = 49 (actual number of bits in a cache block frame)
  – Bits in cache = #blocks × bits/block = 2^14 × 49 = 98 Kbytes
• How many total bits would be needed for a 4-way set associative cache to store the same amount of data?
  – Block size and #blocks do not change
  – #sets = #blocks/4 = (2^14)/4 = 2^12 => index size = 12 bits
  – Tag size = address size − index size − offset size = 32 − 12 − 2 = 18 bits
  – Bits/block = data bits + tag bits + valid bit = 32 + 18 + 1 = 51
  – Bits in cache = #blocks × bits/block = 2^14 × 51 = 102 Kbytes
• Increasing associativity => increases bits in cache
  (For comparison, with 8-word blocks: 64 Kbytes = 2^14 words = (2^14)/8 = 2^11 block frames.)
[Figure: two-way set associative cache — the cache index selects a set; the two valid/tag/data ways (Cache Block 0 in each) are tag-compared in parallel, the comparator outputs are ORed into the hit signal, and Sel1/Sel0 mux out the matching cache block]
Fully Associative Cache
• Fully associative cache
  – Forget about the cache index
  – Compare the cache tags of all cache entries in parallel
  – Example: with 32 B blocks, we need N 27-bit comparators
• By definition: conflict misses = 0 for a fully associative cache

Fully Associative Cache Organization
[Figure: address bits 31…5 form the 27-bit cache tag and bits 4…0 the byte select (e.g., 0x01); every entry's tag is compared simultaneously (=) while the data bytes (Byte 0 … Byte 31, Byte 32 … Byte 63, …) are read out]
Cache Organization
Set associative caches reduce cache misses by reducing conflicts between blocks that would have been mapped to the same cache block frame in the case of a direct mapped cache.
[Figure: organizations of an 8-frame cache, each frame holding a valid/tag/data entry; set select chooses the set, data select chooses the block within it]
• 1-way set associative (direct mapped): 1 block frame per set (8 sets, blocks 0-7)
• 2-way set associative: 2 block frames per set (4 sets, 0-3)
• 8-way set associative: 8 block frames per set — in this case it becomes fully associative, since the total number of block frames = 8
[Figure: Processor ↔ Cache ↔ DRAM, with a write buffer between the cache and DRAM]
Intel Cache Evolution
Problem: External memory slower than the system bus.
  Solution: Add external cache using faster memory technology. (Processor on which the feature first appears: 386)
Problem: Increased processor speed results in the external bus becoming a bottleneck for cache access.
  Solution: Move the external cache on-chip, operating at the same speed as the processor. (First appears: 486)
Problem: The internal cache is rather small, due to limited space on the chip.
  Solution: Add external L2 cache using faster technology than main memory. (First appears: 486)
Consider logical address 1025 and the following page table for some process P0: page 0 → frame 8, page 1 → frame 0, page 2 → frame 2. Assume a 15-bit address space with a page size of 1K. What is the physical address to which logical address 1025 maps?
Step 1. Convert to binary: 000010000000001
Step 2. Split into page number and offset: 00001 | 0000000001 (page 1, offset 1)
Step 3. Look up page 1 in the page table: frame 0. Physical address = 0 × 1K + 1 = 1.
Paging Model of Logical and Physical Memory
Shared Pages Example
Valid (v) or Invalid (i) Bit in a Page Table
Hierarchical Page Tables
• Break up the logical address space into multiple page tables
• A simple technique is a two-level page table
Two-Level Paging Example
• A logical address (on a 32-bit machine with a 1K page size) is divided into:
  – a page number consisting of 22 bits
  – a page offset consisting of 10 bits
• Since the page table is paged, the page number is further divided into:
  – a 12-bit outer page number
  – a 10-bit inner page number
• Thus, a logical address is as follows: | p1 (12 bits) | p2 (10 bits) | d (10 bits) |
  where p1 is an index into the outer page table, and p2 is the displacement within the page of the outer page table
Address-Translation Scheme
Paging in 64 bit Linux
Address
Page Paging Address
Platform Bits
Size Levels Splitting
Used
Alpha 8 KB 43 3 10+10+10+13
IA64 4 KB 39 3 9+9+9+12
PPC64 4 KB 41 3 10+10+9+12
sh64 4 KB 41 3 10+10+9+12
X86_64 4 KB 48 4 9+9+9+9+12
TLB lookup and cache access

Cache: sequence of events for read and write access

The Arm Cortex-A8 memory subsystem
Summary
• SRAM is fast but expensive and not very dense:
  – Good choice for providing the user FAST access time
• DRAM is slow but cheap and dense:
  – Good choice for presenting the user with a BIG memory system
• Flash memory: reads are fast, but writes are slow