Pentium Memory Hierarchy (By Indranil Nandy, IIT KGP)

HPCA Assignment
(Memory Hierarchy of Pentium System)
Submitted by:
Indranil Nandy
MTech, 2006
Roll no. : 06CS6010
Memory Hierarchy of Pentium System
First, we see a simplified memory hierarchy block diagram of a Pentium system.
In its most simplified form, the memory hierarchy of the Pentium can be explained as
below:
The L1 cache is on-chip and is split into separate data and instruction caches. The data
cache is 16 KB, 4-way set associative, and write-through, with 32-byte lines. The
instruction cache is also 16 KB and 4-way set associative.
The L2 cache is unified and is at least 128 KB. It uses a write-allocate policy
and is also 4-way set associative.
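As a rough illustration of what these parameters imply, the sketch below derives the number of sets and the tag/index/offset split for a 16 KB, 4-way cache with 32-byte lines. The function names and the sample address are hypothetical, chosen only for this example, and the arithmetic assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# L1 data cache as described above: 16 KB, 4-way, 32-byte lines.
sets, off_bits, idx_bits = cache_geometry(16 * 1024, 4, 32)
print(sets, off_bits, idx_bits)  # 128 5 7

# Hypothetical address, used only to show the split.
print(split_address(0x12345678, off_bits, idx_bits))  # (74565, 51, 24)
```

With 128 sets, addresses that are 4 KB apart (128 sets x 32 bytes) map to the same set, and up to four such lines can be resident at once.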
Now, we will look at more details and particulars:

First, we show the more complex, modern Pentium IV memory system:
Block Diagram of Pentium IV memory system

Storage Hierarchy
• CPU cache - memory located on the processor chip (VOLATILE)
• on-board cache - located on circuit board; fastest external memory available
(VOLATILE)
• main memory - software managed (VOLATILE)
• secondary memory - hard drive (NON-VOLATILE)
• slow secondary memory - tapes, diskettes (NON-VOLATILE)
Cache Memory
• A cache is a very fast block of memory that speeds up the performance of
another device. Frequently used data are stored in the cache. The computer
looks in the cache first to see if what it needs is there.
• Level 1 Cache is located directly inside the CPU itself, and stores frequently
used data or commands. Although relatively small, Level 1 Cache has the most
direct effect on overall performance.
• Level 2 Cache is located on the motherboard. It stores frequently used data
from the computer's main memory (RAM). In Intel Pentium chips, Advanced
Transfer Cache is an improved version of the Level 2 Cache, in which the
cache memory operates at the same speed as the processor, which is as much
as four times the speed of a standard Level 2 Cache.
Pentium:
• primary or Level 1 (L1) cache - located in the Pentium processor chip
- 32 KB, with 16 KB for instructions and 16 KB for data
• integrated Level 2 (L2) cache, called the Advanced Transfer Cache (EITHER THIS
OR THE NEXT CHOICE) - extra memory located in the Pentium processor chip
- accessed by a 256-bit data bus
- 8-way set associative
• discrete L2 cache (EITHER THIS OR THE PREVIOUS CHOICE)
- connected to the processor with a dedicated 64-bit cache bus
- faster than cache-on-motherboard implementations
• main memory - SDRAMs
• secondary memory - hard drive
• slow secondary memory - tapes, diskettes
Cache memory
The Pentium processor has two caches, called the primary or Level 1 (L1) cache and
the secondary or level 2 (L2) cache.
L1 Cache: The L1 cache is where the processor stores frequently accessed
instructions or data for faster performance. The Pentium processor incorporates a 32 KB
Level 1 cache, consisting of 16 KB for instructions and 16 KB for data. This
cache provides the highest information access speeds available. It is non-blocking.
L2 Cache: Working along with the L1 cache, the Pentium processor has either a
512 KB unified, non-blocking Level 2 (L2) cache or an integrated 256 KB Advanced
Transfer Cache. The integrated 256 KB L2 cache is located on-die and runs at the core
frequency of the processor. The L2 cache is an area of high-speed memory that
improves performance by reducing the average memory access time.
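The effect of an L2 cache on average memory access time can be sketched with the usual textbook relation: average time = hit time + miss rate x miss penalty. All cycle counts and miss rates below are made-up illustrative values, not measurements of any Pentium processor.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers, chosen only to show the shape of the calculation.
L1_HIT = 2         # cycles for an L1 hit
L1_MISS = 0.05     # fraction of accesses that miss in L1
L2_HIT = 7         # cycles for an L2 hit
L2_MISS = 0.20     # fraction of L1 misses that also miss in L2
MEMORY = 100       # cycles to reach main memory

# Without an L2 cache, every L1 miss goes straight to main memory.
without_l2 = amat(L1_HIT, L1_MISS, MEMORY)

# With an L2 cache, the L1 miss penalty is itself an average access time.
with_l2 = amat(L1_HIT, L1_MISS, amat(L2_HIT, L2_MISS, MEMORY))

print(f"{without_l2:.2f} cycles without L2, {with_l2:.2f} cycles with L2")
```

Under these assumed values the L2 cache roughly halves the average access time, because most L1 misses are served in a few cycles instead of paying the full trip to main memory.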
Pentium 4's Cache Organization
Cache Organization in the Memory Hierarchy
There is usually a trade-off between cache size and speed. This is mostly because of
the extra capacitive loading on the signals that drive the larger SRAM arrays. Refer to the
block diagram of the Pentium 4 memory system. Intel has chosen to keep the L1 caches
rather small so that they can reduce the latency of cache accesses. Even a data cache
hit will take 2 cycles to complete (6 cycles for floating-point data). We'll talk about the L1
caches in a moment, but further down the hierarchy we find that the L2 cache is an 8-way,
unified (includes both instruction and data), 256KB cache with a 128B line size.
The 8-way structure means that each set holds 8 cache lines, so a given address can be
placed in any of 8 locations and 8 tags are compared on each lookup. This provides about
the same cache miss rate as a "fully-associative" cache (as good as it gets). It makes the
256KB cache more effective than its size indicates, since the miss rate of this cache is
approximately 60% of the miss rate for a direct-mapped (1-way) cache of the same size.
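To see why associativity reduces conflict misses, here is a minimal cache model with LRU replacement. The LRU policy and the access trace are assumptions made for this sketch (the text does not state the Pentium 4's replacement policy), and the trace is a deliberately bad case for a direct-mapped cache, so it does not reproduce the 60% figure quoted above.

```python
from collections import OrderedDict


def count_misses(addresses, size_bytes, ways, line_bytes):
    """Count misses in a set-associative cache with LRU replacement."""
    num_sets = size_bytes // (ways * line_bytes)
    # One OrderedDict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)          # mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)   # evict least recently used
            cache_set[tag] = True
    return misses


SIZE = 256 * 1024   # 256 KB, as for the L2 cache described above
LINE = 128          # 128-byte lines

# Hypothetical trace: 8 addresses spaced 256 KB apart, visited 100 times.
trace = [k * SIZE for k in range(8)] * 100

print(count_misses(trace, SIZE, ways=1, line_bytes=LINE))  # 800
print(count_misses(trace, SIZE, ways=8, line_bytes=LINE))  # 8
```

In the direct-mapped case all 8 addresses compete for one line and every access misses. With 8 ways they all fit in one set, so only the 8 first-touch misses remain.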
The downside is that an 8-way cache will be slower to access. Intel states that the
load latency is 7 cycles (this reflects the time it takes an L2 cache line to be fully retrieved to
either the L1 data cache or the x86 instruction prefetch/decode buffers), but the cache is able
to transfer new data every 2 cycles (which is the effective throughput assuming multiple
concurrent cache transfers are initiated). Again, notice that the L2 cache is shared between
instruction fetches and data accesses (unified).
System Bus Architecture is Matched to Memory Hierarchy Organization
One interesting change for the L2 cache is to make the line size 128 bytes, instead of
the familiar 32 bytes. The larger line size can slightly improve the hit rate (in some
cases), but requires a longer latency for cache line refills from the system bus. This is
where the new Pentium 4 bus comes into play. Using a 100 MHz clock and transferring data
four times on each bus clock (which Intel calls a 400 MHz data rate), the 64-bit system bus
can bring in 32 bytes on each bus clock. This translates to a bandwidth of 3.2 GB/sec.
Filling an L2 cache line requires four bus cycles (the same number of cycles the P6
bus needs for a 32-byte line). Note that the system bus protocol has a 64-byte access length
(matching the line size of the L1 cache) and requires 2 main memory request operations to fill
an L2 cache line. However, the faster bus only helps overcome the latency of getting the extra
data into the CPU from the North Bridge. The longer line size still causes a longer latency
before getting all the burst data from main memory. In fact, some analysts note that P4
systems have about 19% more memory latency than Pentium III systems (measured in
nanoseconds for the demand word of a cache refill). Smart pre-fetching is critical or else the
P4 will end up with less performance on many applications.
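The bus figures quoted above can be checked with a few lines of arithmetic. The constants come from the text; the function and variable names are chosen for this sketch, and the P6 bus is assumed to move one 8-byte transfer per bus clock.

```python
BUS_CLOCK_HZ = 100_000_000   # 100 MHz bus clock
TRANSFERS_PER_CLOCK = 4      # "quad-pumped": four data transfers per clock
BUS_WIDTH_BYTES = 8          # 64-bit data bus
L2_LINE_BYTES = 128
BUS_ACCESS_BYTES = 64        # access length of the bus protocol


def bytes_per_clock(width_bytes, transfers_per_clock):
    """Bytes moved across the bus in one bus clock."""
    return width_bytes * transfers_per_clock


p4_bytes = bytes_per_clock(BUS_WIDTH_BYTES, TRANSFERS_PER_CLOCK)
bandwidth = p4_bytes * BUS_CLOCK_HZ

print(p4_bytes)                           # 32 bytes per bus clock
print(bandwidth / 1e9)                    # 3.2 (GB/sec)
print(L2_LINE_BYTES // p4_bytes)          # 4 bus clocks per 128-byte L2 line
print(L2_LINE_BYTES // BUS_ACCESS_BYTES)  # 2 memory requests per L2 line

# Assumed P6 bus: same 64-bit width, one transfer per clock, 32-byte lines.
p6_bytes = bytes_per_clock(BUS_WIDTH_BYTES, 1)
print(32 // p6_bytes)                     # 4 bus clocks per 32-byte line
```

Both buses need four clocks per line fill, but the Pentium 4 bus moves four times as much data in those clocks.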
Pre-Fetching Hardware Can Help if Data Accesses Follow a Regular Pattern
The L2 cache has pre-fetch hardware to request the next 2 cache lines (256
bytes) beyond the current access location. This pre-fetch logic has some intelligence that
allows it to monitor the history of cache misses and try to avoid unnecessary pre-fetches
(which waste bandwidth and cache space). When accesses follow a regular pattern, the
hardware pre-fetch logic should easily notice the pattern of cache misses and then pre-load
data, leading to much better performance on applications such as streaming media (for
example, video).
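A highly idealized sketch of next-2-lines pre-fetching is shown below. It assumes an unbounded cache, pre-fetches that complete instantly, and no miss-history filtering, none of which holds for the real hardware, so it only shows why a sequential stream benefits and a widely strided one does not. The function name and the traces are hypothetical.

```python
def demand_misses(addresses, line_bytes=128, prefetch_lines=0):
    """Count demand misses when each access also pre-fetches the next lines.

    Idealized model: unbounded cache, zero pre-fetch latency.
    """
    cached = set()
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line not in cached:
            misses += 1
            cached.add(line)
        # Request the lines immediately following the current access.
        for k in range(1, prefetch_lines + 1):
            cached.add(line + k)
    return misses


# Streaming access: one touch per consecutive 128-byte line.
stream = [i * 128 for i in range(1000)]
print(demand_misses(stream, prefetch_lines=0))  # 1000
print(demand_misses(stream, prefetch_lines=2))  # 1

# Strided access that jumps 10 lines at a time: pre-fetching never helps.
strided = [i * 10 * 128 for i in range(100)]
print(demand_misses(strided, prefetch_lines=2))  # 100
```

In the strided case every pre-fetched line is wasted bandwidth and cache space, which is the situation the history-monitoring logic is meant to avoid.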
Designing for Data Cache Hits
Intel boasts of "new algorithms" to allow faster access to the 8KB, four-way, L1 data
cache. They are most likely referring to the fact that the Pentium 4 speculatively processes
load instructions as if they always hit in the L1 data cache (and data TLB). By optimizing for
this case, there aren't any extra cycles burned while cache tags are checked for a miss.
The load instruction is sent on its merry way down the pipeline; if a cache miss delays the
load, the processor passes temporarily incorrect data to dependent instructions that assumed
the data arrived in 2 cycles. Once the hardware discovers the L1 data cache miss and brings
in the actual data from the rest of the memory hierarchy, the machine must "replay" any
instructions that had data dependencies and grabbed the wrong data.
The Pentium 4 design seems to have been optimized for the case of streaming media
(just as Intel claims), since these algorithms are much more regular and demand high
performance. The designers probably hope that the pathological worst case only occurs for
code that doesn't need high performance. When the L1 data cache does have a miss, it has a
"fat pipe" (32 bytes wide) to the L2 cache, allowing each 64-byte cache line to be refilled in 2
clocks. However, there is a 7-cycle latency before the L2 data starts arriving, as we
mentioned previously. The Pentium 4 can have up to four L1 data cache misses in progress at once.
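The refill figures can be worked through as follows. The constants are taken from the text; treating the 7-cycle L2 latency and the 2-clock transfer as simply additive is an assumption of this sketch, since the text does not say how the two overlap.

```python
L1_HIT_CYCLES = 2        # load-to-use latency on an L1 data cache hit
L2_LATENCY_CYCLES = 7    # cycles before L2 data starts arriving
L1_LINE_BYTES = 64       # L1 data cache line size
L1_L2_PATH_BYTES = 32    # width of the path between L1 and L2


def refill_clocks(line_bytes=L1_LINE_BYTES, path_bytes=L1_L2_PATH_BYTES):
    """Clocks needed to move one full L1 line across the L1-L2 path."""
    return line_bytes // path_bytes


def l1_miss_cycles():
    """Estimated cycles until the whole line is in L1 (additive assumption)."""
    return L2_LATENCY_CYCLES + refill_clocks()


print(refill_clocks())                    # 2
print(l1_miss_cycles())                   # 9
print(l1_miss_cycles() - L1_HIT_CYCLES)   # 7 extra cycles over an L1 hit
```

Dependent instructions that were scheduled for the 2-cycle hit case would have to be replayed once the line arrives.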
Pentium 4's Trace Cache
The Trace Cache Depends on Good Branch Prediction
Instead of a classic L1 instruction cache, the Pentium 4 designers felt confident
enough in their branch prediction algorithms to implement a trace cache. Rather than storing
standard x86 instructions, the trace cache stores the instructions after they've already
been decoded into RISC-style instructions. Intel calls them "µops" (micro-ops) and stores
6 µops for each "trace line". The trace cache can house up to 12K µops. Since the
instructions have already been decoded, hardware knows about any branches and fetches
instructions that follow the branch. We know that it's the conditional branches that could really
cause a problem, since we won't know if we're wrong until the branch condition check in
Arithmetic Logic Unit 0 (ALU0) of the execution core. By then, our trace cache could have
pre-fetched and decoded a lot of instructions we don't need. The pipeline could also allow
several out-of-order instructions to proceed if the branch instruction was forced to wait
for ALU0.
Hopefully, the alternative branch address is somewhere in the trace cache.
Otherwise, we'll have to pay those 7 cycles of latency to get the proper instructions from the
L2 cache plus the time to decode the fetched x86 instructions. Intel's reference to the
20-stage P4 pipeline actually starts with the trace cache, and does not include the cycles for
instruction or data fetches from system memory or L2 cache.
The Trace Cache has Several Advantages
If predictors work well, then the trace cache is able to provide (the correct) three µops
per cycle to the execution scheduler. Since the trace cache is (hopefully) only storing
instructions that actually get executed, then it makes more efficient use of the limited
cache space. Since the branch target instruction has already been decoded and fetched in
execution order, there isn't any extra latency for branches. One more interesting point:
we never mentioned a TLB check for the trace cache, because it does not use one. So, the
Pentium 4 isn't so complicated after all. This cache uses virtual addressing, so there
isn't any need to convert to physical addresses until we access the L2 cache. Intel
documents don't give the size of the instruction TLB used for L2 cache accesses.
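The idea of caching already-decoded µops, indexed by virtual address, can be sketched with a toy model. The class, the decoder function, and the µop strings are all hypothetical, and the model ignores eviction, branch prediction, and how real trace lines are built. Only the 6 µops per line and the 12K µop capacity come from the text.

```python
UOPS_PER_LINE = 6
CAPACITY_UOPS = 12 * 1024
TRACE_LINES = CAPACITY_UOPS // UOPS_PER_LINE
print(TRACE_LINES)  # 2048 trace lines


class ToyTraceCache:
    """Toy model: maps a virtual address to a line of decoded micro-ops."""

    def __init__(self, decode):
        self.decode = decode   # hypothetical decoder: vaddr -> list of uops
        self.lines = {}        # virtual address -> stored trace line
        self.decodes = 0       # how many times the slow decoder ran

    def fetch(self, vaddr):
        if vaddr not in self.lines:
            # Miss: decode the x86 instructions once and keep the result.
            self.decodes += 1
            self.lines[vaddr] = self.decode(vaddr)[:UOPS_PER_LINE]
        return self.lines[vaddr]


def fake_decode(vaddr):
    """Stand-in for the x86 decoder; returns made-up micro-op names."""
    return [f"uop{i}@{vaddr:#x}" for i in range(UOPS_PER_LINE)]


tc = ToyTraceCache(fake_decode)
for _ in range(100):           # a loop body executed 100 times
    tc.fetch(0x401000)
print(tc.decodes)  # 1
```

The loop body is decoded once and then served from the cache on every later iteration, with no address translation in the lookup because the key is the virtual address itself.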
