Techniques
Sparsh Mittal
IIT Hyderabad, India
A Background on TLB
Useful terms and concepts
Useful terms
TLB Shootdown and HW/SW-managed TLB
TLB in real processors
TLB Architecture in Older Processors
Challenges in using superpage (1/2)
Challenges in using variable page size
Minimizing TLB flushing overhead
TLB design tradeoffs
Key Ideas of TLB Management Techniques (1/2)
Key Ideas of TLB Management Techniques (2/2)
TLB Architectures for Multicore Processors
Private vs. shared TLB designs
• In multicore processors
– Private per-core TLBs reduce access latency and scale well to large core counts
– A shared TLB provides capacity benefits => reduces TLB misses
Inter-core prefetching (1/2)
• To avoid such repeated misses across cores, on each TLB miss a core can fill its own TLB and also place this translation in the prefetch buffers of other cores
• Case 1:
(1a): DTLB miss but prefetch buffer (PB) hit on core 0
(1b): Remove entry from core 0's PB and add to its DTLB
• Case 2:
(2a): miss in both DTLB and PB of core 1 => fetch translation
from page table into DTLB
(2b): Also push this translation to PBs of other cores
Synergistic TLB design
Motivation for TLB power management
DTLB optimization techniques
1. Array interleaving
2. Instruction scheduling
3. Operand reordering

(a) Before array interleaving: P[0..99] on Page 1, Q[0..99] on Page 2 => 100 page transitions
for (i = 0 to 99)
{
  R[i] = P[i] + Q[i];
}

(b) After array interleaving: PQ[0..199] laid out contiguously across Page 1 and Page 2 => 1 page transition
for (i = 0 to 99)
{
  R[i] = PQ[2i] + PQ[2i+1];
}
Loop-unrolling, instruction scheduling and operand reordering
ITLB optimization techniques
• Some processors use a two-level TLB design with a separate L1 TLB for each page size
[Figure: L1-4KB TLB, L1-2MB TLB and L1-1GB TLB, all backed by a unified L2 TLB]
4 possible cache/TLB organizations
Legend: xI = x (V/P) Index; xT = x (V/P) Tag
(a) VA -> TLB -> L1$ -> L2$: physically indexed, physically tagged
(b) VA -> L1$ -> L2$, with the TLB accessed in parallel with L1$ indexing: virtually indexed, physically tagged (commonly used)
(c) VA -> L1$ -> TLB -> L2$: virtually indexed, virtually tagged (reduces TLB accesses but presents challenges)
(d) VA -> TLB -> L1$ -> L2$: physically indexed, virtually tagged
Physical address subdivision
Special case: all bits of the cache set index come from the page offset
=> Cache set location of an address is independent of the VA-to-PA translation
[Figure: address split into page number and page offset]
Useful term: Synonyms
Ideas for designing virtual cache
ASID = address-space identifier
[Figure: each memory access (ASID+VA) first probes a synonym filter. Non-synonyms access the cache hierarchy with the virtual address, and translation to PA is delayed until the access reaches memory; synonym candidates (including false positives) are translated through the TLB up front.]
Park et al. ISCA'16
References