
Memory Hierarchy


1. This question has two parts.

a. For a fixed-size set-associative cache, if the number of sets is reduced, what is the impact on the three classes of misses: conflict, compulsory, and capacity? Provide a brief explanation.

As the number of sets is reduced, the associativity increases (total cache size and block size are fixed). Increasing associativity reduces the number of conflict misses. Compulsory misses are independent of cache size and remain the same. Because the overall miss rate falls as associativity increases, capacity misses account for a larger percentage of the remaining misses.

b. Consider the data remapping optimization discussed in class for improving the locality of memory references. Can you describe (in language form, not physical addresses) a memory reference pattern (stream of addresses) for which this optimization will be effective in improving the hit rate?

In general, data remapping is effective in improving the hit rate when the profiled access patterns are representative of the actual access patterns. Profile information, in turn, is an effective predictor only if the data sets used to generate it are representative. Accesses to arrays of records allocated on the heap are the primary target. A specific example is structured access to the elements of a list of records. The storage layout of the records can be remapped statically to substantially improve the hit rate, or to reduce the cache size needed for the same hit rate.

2. This question concerns using victim caches and pre-fetch buffers to optimize the performance of a cache.

a. What is a victim cache? How does it improve the performance of an L1 cache?

A victim cache stores recently evicted lines in a fully associative store on the fill path between the cache and main memory. It reduces the miss penalty for accesses to recently discarded lines, because such accesses no longer have to go to the next level of the memory hierarchy.

b. What is the difference between a victim cache and a pre-fetch buffer? Why have one vs. the other?

In contrast to a victim cache, which holds recently evicted lines, a pre-fetch buffer (as the name suggests) stores lines optimistically fetched from memory in anticipation of their use.

c. The following figure shows the organization of the data cache in the Alpha 21264 microprocessor. Show where you would place a pre-fetch buffer or victim cache (pick one or the other) in the memory hierarchy. Show all modified connections correctly. Briefly describe how it would operate, emphasizing its benefits.

[Figure: Alpha 21264 data cache organization with a victim cache (VC) added.]

The victim cache sits between the L1 cache and the next lower level of the memory hierarchy. If a victim is found on a miss, it is swapped with the contents of the L1 cache. This requires the tag to be made available to the victim cache as well. Additional interconnect dependencies arise depending on whether there is an L2 cache (are lines replaced in L2 also placed in the victim cache?) and on the update strategy. The preceding figure shows the simplest, common paths that are necessary.

The pre-fetch buffer is similarly placed between the L1 cache and the next level of the memory hierarchy. However, it does not take evicted lines from the cache; instead it holds anticipated lines from the next level of the memory hierarchy. Typically, tags are checked concurrently with the TLB on a miss in L1.

3. Consider a unified, 64 Kbyte, direct mapped, instruction and data cache that exhibits the following miss rates as a function of block size, keeping the cache size fixed.

Block size (bytes)   Miss rate
4                    35.18%
8                    22.57%
16                   16.009%
32                   12.47%
64                   10.79%
128                  10.27%
256                  10.72%
512                  12.75%
1K                   16.41%
2K                   23.15%
4K                   32.97%

a. Why does the miss rate initially decrease as the block size increases? Why does it eventually start increasing again?

Initially, larger blocks capture more spatial locality, which decreases the miss rate. Because the cache size is fixed, larger blocks also mean fewer blocks in the cache, so this gain is eventually overrun by decreased temporal locality.

b. Can the 128 byte block size result be guaranteed to provide the best average memory access time? Justify your answer.

No. A smaller block size (e.g., 64 bytes) may have a slightly higher miss rate, but this can be outweighed by the much larger miss penalty of a 128 byte block. The answer depends on the design of the cache-memory interface. For average memory access time, the key is the behavior of the miss term (1-h) * tp, where tp is the miss penalty (the time to fetch a block from memory) and (1-h) is the miss rate.
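The point of part b can be checked numerically. The sketch below uses the miss rates from the table together with a hypothetical miss penalty model (a fixed latency plus a per-byte transfer time); the hit time, latency, and bus bandwidth values are assumptions chosen only for illustration and do not come from the problem.

```python
# Miss rates from the table above, keyed by block size in bytes.
MISS_RATE = {4: 0.3518, 8: 0.2257, 16: 0.16009, 32: 0.1247, 64: 0.1079,
             128: 0.1027, 256: 0.1072, 512: 0.1275, 1024: 0.1641,
             2048: 0.2315, 4096: 0.3297}

def amat(block, hit_time, latency, bytes_per_cycle):
    """Average memory access time = hit time + miss rate * miss penalty."""
    # Hypothetical penalty model: fixed latency plus transfer time.
    penalty = latency + block / bytes_per_cycle
    return hit_time + MISS_RATE[block] * penalty

def best_block(hit_time, latency, bytes_per_cycle):
    """Block size with the lowest average memory access time."""
    return min(MISS_RATE,
               key=lambda b: amat(b, hit_time, latency, bytes_per_cycle))

# Assumed interfaces: a narrow bus (4 bytes/cycle) and a very wide one.
narrow_bus = best_block(hit_time=1, latency=40, bytes_per_cycle=4)
wide_bus = best_block(hit_time=1, latency=40, bytes_per_cycle=1024)
print(narrow_bus, wide_bus)
```

Under these assumed parameters the narrow bus favors a block size smaller than 128 bytes, while the wide bus (where the penalty is nearly independent of block size) favors the block size with the lowest miss rate.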

4. We have a 4 Kbyte cache operating with 256 byte lines. Consider two designs for this cache. The first is a direct mapped design. The second is a fully associative cache with an LRU replacement policy. Physical addresses are 16 bits. Provide a sequence of physical addresses (in hexadecimal notation) that, when repeated indefinitely, will result in a lower miss rate in the direct mapped cache than in the fully associative cache. The address stream should be accompanied by an explanation of this behavior.

The following stream of addresses will cause this to happen (the 0x prefix signifies hexadecimal notation):

0x0000, 0x0100, 0x0200, ..., 0x0F00, 0x1000, 0x0000, 0x0100, 0x0200, 0x0300, ... (repeating pattern)

There are sixteen lines in the cache. The sequence consists of 17 addresses that reference 17 consecutive blocks of memory, and then repeats. In a fully associative cache with an LRU policy, once the cache has filled up with the first 16 addresses, the least recently used line is always removed just before it is referenced again. Thus, every reference becomes a miss. In a direct mapped cache, the 17th and 1st references will always miss (0x1000 and 0x0000 map to the same line), but the remaining references will always hit. In this case the rigidity of the direct mapped cache causes misses to be localized to two blocks, while the flexibility of fully associative placement causes a miss on every reference. Draw the LRU stack at each reference and you will see why, starting with the 17th reference, every successive reference will miss. It may be simpler to consider a four line cache with a sequence of 5 consecutive memory addresses.
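The behavior described above can be reproduced with a small simulation. This is a minimal sketch of the two cache organizations from the problem; the function names and the choice of 10 passes over the pattern are illustrative assumptions.

```python
from collections import OrderedDict

LINE_BYTES = 256
NUM_LINES = 16   # 4 Kbyte cache / 256 byte lines

def direct_mapped_misses(addresses):
    """Count misses in a direct mapped cache (one block per index)."""
    lines = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        block = addr // LINE_BYTES
        index = block % NUM_LINES
        if lines[index] != block:
            misses += 1
            lines[index] = block
    return misses

def fully_associative_lru_misses(addresses):
    """Count misses in a fully associative cache with LRU replacement."""
    cache = OrderedDict()   # oldest (least recently used) entry first
    misses = 0
    for addr in addresses:
        block = addr // LINE_BYTES
        if block in cache:
            cache.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(cache) == NUM_LINES:
                cache.popitem(last=False)   # evict the LRU block
            cache[block] = True
    return misses

one_pass = [i * LINE_BYTES for i in range(17)]   # 0x0000 .. 0x1000
PASSES = 10                                      # assumed repeat count
stream = one_pass * PASSES

dm = direct_mapped_misses(stream)
fa = fully_associative_lru_misses(stream)
print(dm, fa)
```

After the first pass, the direct mapped cache misses twice per pass, while the fully associative LRU cache misses on all 17 references of every pass.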

5. Consider the complete memory hierarchy. You have a paged memory system. The processor has 512 Mbytes of memory and an 8 Gbyte virtual address space with 4 Kbyte pages. The L1 cache is a 64 Kbyte, 2-way set associative cache with 64 byte lines. The L2 cache is a 1 Mbyte, 8-way set associative cache with 256 byte lines. Address translation is supported by a 4 entry TLB. Each page table entry is 32 bits.

a. Show the breakdown of the fields of the virtual address, the physical address, the physical address as interpreted by the L1 cache, and the physical address as interpreted by the L2 cache. Clearly mark the number of bits in each address and the number of bits in each field.

Address               Bits        Fields (number of bits)
Virtual address       33 [32:0]   virtual page number 21 | page offset 12
Physical address      29 [28:0]   physical page number 17 | page offset 12
L1 cache address      29 [28:0]   tag 14 | index 9 | byte offset 6
L2 cache address      29 [28:0]   tag 12 | index 9 | byte offset 8

b. Fill in the following table.

Number of L2 cache lines/page          16
Number of entries in the page table    2M
Page table size in pages               2K
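The field widths and table entries for this problem follow mechanically from the stated sizes. The sketch below recomputes them; the helper names `bits` and `cache_fields` are illustrative and not part of the original solution.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

def bits(n):
    """Return log2(n) for a power of two n."""
    b = n.bit_length() - 1
    assert 1 << b == n, "expected a power of two"
    return b

def cache_fields(cache_bytes, ways, line_bytes, phys_bits):
    """Return (tag, index, offset) widths for a set associative cache."""
    sets = cache_bytes // (ways * line_bytes)
    offset = bits(line_bytes)
    index = bits(sets)
    return phys_bits - index - offset, index, offset

virt_bits = bits(8 * GB)        # virtual address width
phys_bits = bits(512 * MB)      # physical address width
page_offset = bits(4 * KB)

l1 = cache_fields(64 * KB, 2, 64, phys_bits)
l2 = cache_fields(1 * MB, 8, 256, phys_bits)

lines_per_page = (4 * KB) // 256
page_table_entries = 1 << (virt_bits - page_offset)
page_table_pages = page_table_entries * 4 // (4 * KB)   # 4 byte entries

print(virt_bits, phys_bits, l1, l2)
print(lines_per_page, page_table_entries, page_table_pages)
```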

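c. What is the function of the TLB (translation lookaside buffer)?

The TLB caches recently used page table entries so that virtual addresses can be translated into physical addresses without accessing the page table in memory on every reference.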

d. In the preceding example, is it possible for the system to address the L1 cache concurrently with the address translation provided by the TLB? Justify your answer using the address breakdowns shown above. What would be the benefit of doing so?

The L1 cache is addressed by the 15 lower bits of the address (9 index bits + 6 offset bits), but only the 12 page offset bits are unchanged by translation. The upper 3 index bits are therefore provided by the virtual to physical translation, so the L1 cache cannot be fully indexed before the TLB lookup completes, and concurrent addressing of the data cache by itself provides no benefit here. In general, the benefit would be that addressing of the L1 cache and the address translation can be overlapped.

6. Consider a processor with the following memory hierarchy:

i) a 64 Kbyte direct mapped unified L1 cache with 128 byte lines,
ii) a 2 Mbyte, 4-way set-associative L2 cache with 512 byte lines that can supply the L1 cache with a line in 8 cycles on an L2 hit,
iii) a 256 Mbyte main memory organized as 8 banks, each bank supplying one word with an access time of 20 cycles, connected to the L2 cache via a 256 bit wide bus, and
iv) a 4 Gbyte virtual address space with 16 Kbyte pages.

a. Write an expression for the average memory access time using the following notation:

m1, m2: miss rates for the L1 and L2 caches
tp1, tp2: miss penalties for the L1 and L2 caches
tL1, tL2: access times of the L1 and L2 caches on a hit, respectively

tavg = tL1 + m1 * tL2 + m1 * m2 * tp2

b. Write an expression for the global miss rate using the preceding notation.

m1 * m2

c. What is the L2 cache miss penalty in cycles, assuming that data (256-bit) and address transfers across the bus each take 1 cycle?

A 512 byte line requires 512/32 = 16 interleaved accesses of 32 bytes (256 bits) each.

(address bus cycle + memory access + data delivery bus cycle) * 16 = (1 + 20 + 1) * 16 = 352 cycles

The assumption is that each 32 byte interleaved access requires its own address bus cycle.

d. Show the address breakdown for addressing the L1 and L2 caches.

Physical address [27:0]   Tag   Index   Byte offset
L1 cache                  12    9       7
L2 cache                  9     10      9

e. Name two optimizations to improve the performance of this memory hierarchy. Specify whether each optimization improves hit time, miss rate, or miss penalty.

Any pair from Chapter 5.

f. If the local miss rate for the L1 is 4% and the local miss rate for the L2 cache is 40%, what is the increase in CPI if there are 1.25 memory references per instruction?

Stall cycles from an L2 miss = 0.04 * 0.4 * 352 cycles = 0.016 * 352 cycles = 5.632 cycles
Stall cycles from an L1 miss serviced by the L2 = 0.04 * 8 cycles = 0.32 cycles
Increase in CPI = 1.25 * (5.632 + 0.32) = 7.44

g. On a page fault a page is transferred from disk. Write an expression showing all of the components of the latency in transferring a page from disk into main memory. Which of these components depend on the page size?

Latency = controller delay + seek time + rotational delay + transfer time.

With optimized layouts, only the last term depends on the page size.

7. Consider a 64-bit server with 2 GByte of main memory, a 4 Gbyte virtual address space, 8 Kbyte pages, and an 8 entry TLB. Each page table entry is 32 bits. The system has split data and instruction L1 caches. Each is a direct mapped 64 Kbyte cache with 64 byte lines and a miss penalty to the L2 cache of 10 cycles. The L1 cache is backed by a 2 entry victim cache. The L2 cache is a 1 Mbyte, 16 way set associative cache with 1024 byte lines. Main memory is organized as 16 banks where each bank can deliver one 64-bit word in 60 cycles. Cache lines are interleaved across the 16 memory banks. Main memory is connected by a 128-bit data, 32-bit address, split transaction bus to the L2 cache. Transfer of an address across this bus takes 2 cycles. Data transfers take 4 cycles for each 128 bit word pair.

The victim cache is initially empty. Now consider the following sequence of addresses submitted to the L1 cache: 0x40006448, 0x80006438, 0x40006420, 0x66446418, 0x80006438. Ignore the initial state of the L1 cache. Fill in the contents of the victim cache tag entries at the end of this sequence.

Victim cache tag entries:

Tag
0x40006420
0x66446418
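The victim cache contents can be traced with a short simulation. This is a sketch under stated assumptions: the L1 is treated as initially empty (the problem says to ignore its initial state), the victim cache replaces its oldest entry first, and a hit in the victim cache swaps the line with the conflicting L1 line. The function name `run` is illustrative.

```python
LINE_BYTES = 64
L1_LINES = 1024        # 64 Kbyte direct mapped / 64 byte lines
VICTIM_ENTRIES = 2

def run(addresses):
    """Return the victim cache contents (block numbers, oldest first)."""
    l1 = {}        # index -> block number currently resident
    victim = []    # assumed oldest-first replacement
    for addr in addresses:
        block = addr // LINE_BYTES
        index = block % L1_LINES
        resident = l1.get(index)
        if resident == block:
            continue                     # L1 hit, nothing moves
        if block in victim:
            victim.remove(block)         # victim hit: line returns to L1
        l1[index] = block
        if resident is not None:
            victim.append(resident)      # evicted L1 line becomes a victim
            if len(victim) > VICTIM_ENTRIES:
                victim.pop(0)
    return victim

sequence = [0x40006448, 0x80006438, 0x40006420, 0x66446418, 0x80006438]
final_lines = [block * LINE_BYTES for block in run(sequence)]
print([hex(a) for a in final_lines])
```

The simulation reports line base addresses, so the two entries correspond to the lines containing 0x40006420 and 0x66446418.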

8. The questions pertaining to memory systems will use the following memory hierarchy. Consider a 64-bit server with 2 GByte of main memory, a 4 Gbyte virtual address space, 8 Kbyte pages, and an 8 entry TLB. Each page table entry is 32 bits. The system has split data and instruction L1 caches. Each is a direct mapped 64 Kbyte cache with 64 byte lines and a miss penalty to the L2 cache of 10 cycles. The L1 cache is backed by a 2 entry victim cache. The L2 cache is a 1 Mbyte, 16 way set associative cache with 1024 byte lines. Main memory is organized as 16 banks where each bank can deliver one 64-bit word in 60 cycles. Cache lines are interleaved across the 16 memory banks. Main memory is connected by a 128-bit data, 32-bit address, split transaction bus to the L2 cache. Transfer of an address across this bus takes 2 cycles. Data transfers take 4 cycles for each 128 bit word pair.

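Address             Fields (number of bits)
Virtual address     virtual page number 19 | page offset 13
Physical address    physical page number 18 | page offset 13
L1 address          tag 15 | index 10 | byte offset 6
L2 address          tag 15 | index 6 | byte offset 10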

a. You wish to share data between two processes and all of the shared data fits within a single page. The L1 cache is virtually addressed.

i. To what set of virtual pages in each process's address space can the shared page be mapped to ensure that synonyms will not exist in the L1 cache?

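Virtual address    virtual page number 19 | page offset 13
L1 address         tag 15 | index 10 | byte offset 6

The L1 index and byte offset together cover address bits 15:0, while the page offset covers only bits 12:0. The three least significant bits of the virtual page number (bits 15:13) are therefore part of the L1 index. These three bits must be the same for the virtual pages that contain shared data. In this case, a reference to the shared page will always be mapped to the same location in the L1 cache by both processes.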

ii. How many alternative sets of virtual pages are there in the virtual address space that can be used to map shared virtual pages?

Based on the preceding analysis, there are 2^3 = 8 sets of pages.

b. Imagine we have implemented the critical word first optimization. The processor issues an access for which a miss is generated in the L2 cache. What would be the improvement in latency in cycles to the L2 (ignore transfer to the processor) if we implemented the critical word first optimization?

First we must compute the L2 miss penalty in cycles in the normal case. The 1024 byte line holds 128 64-bit words, which requires 8 memory cycles from the 16 banks. With a split transaction bus, 7 of the address cycles can be overlapped with memory accesses (we assume that the memory system can support seven pending accesses). Thus we have

address cycle + (memory cycle + data transfer cycles) * 8

The data transfer of 16 64-bit words takes 8 * 4 = 32 cycles (two words are transferred at a time). Thus the total penalty is

2 + (60 + 32) * 8 = 738 cycles

With the critical word first optimization, the requested word can at best be delivered after

address cycle + memory cycle + one transfer = 2 + 60 + 4 = 66 cycles

The maximal reduction in latency is therefore 738 - 66 = 672 cycles.
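The arithmetic for this part can be restated as a short calculation. The constant names below are illustrative; the values are those given in the problem statement.

```python
ADDRESS_CYCLES = 2          # address transfer on the bus
MEMORY_CYCLES = 60          # bank access time
PAIR_TRANSFER_CYCLES = 4    # one 128-bit word pair on the bus
LINE_WORDS = 1024 // 8      # 64-bit words per L2 line
BANKS = 16

accesses = LINE_WORDS // BANKS                              # memory cycles per line
transfer_per_access = (BANKS // 2) * PAIR_TRANSFER_CYCLES   # 8 pairs per access

full_penalty = ADDRESS_CYCLES + (MEMORY_CYCLES + transfer_per_access) * accesses
critical_word_latency = ADDRESS_CYCLES + MEMORY_CYCLES + PAIR_TRANSFER_CYCLES
max_reduction = full_penalty - critical_word_latency

print(full_penalty, critical_word_latency, max_reduction)
```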

9. All of the following questions pertaining to memory systems, unless otherwise stated, will use the following memory hierarchy. We have a processor with 1 GByte of main memory, a 4 Gbyte virtual address space, 4 Kbyte pages, and a 4 entry TLB. Each page table entry is 32 bits. The unified L1 cache is a direct mapped 64 Kbyte cache with 64 byte lines and a miss penalty to the L2 cache of 10 cycles. The L1 cache is backed by a 4 entry victim cache. The L2 cache is a 512 Kbyte, 8 way set associative cache with 512 byte lines. Main memory is organized as 16 banks where each bank can deliver a word in 50 cycles. Main memory is connected by a one word wide, split transaction bus to the L2 cache. Transfer of one word across the bus takes one cycle.

a. Answer the following.

i. What is the L2 cache miss penalty in cycles?

Each access can deliver 64 bytes (16 modules and 4 bytes/module), so a 512 byte line needs 8 accesses. Each access takes 50 cycles. With a split transaction bus, the 7 remaining addresses can be hidden (overlapped with the memory access time), assuming that the memory system can indeed buffer these addresses. In this case the miss penalty is

1 + (50 + 16) * 8 cycles = 529 cycles

Without a split transaction bus the penalty would be

[1 + 50 + 16] * 8 cycles = 536 cycles

ii. If each page table entry is 64 bits, what is the size of the TLB and page table in bytes?

TLB:          32 bytes
Page table:   8 Mbytes
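Both answers in part a can be recomputed from the problem parameters. In this sketch the function name `l2_miss_penalty` and its arguments are illustrative; the TLB size counts only the 64-bit entries, as the answer above does.

```python
def l2_miss_penalty(line_bytes, banks, word_bytes, mem_cycles, split_transaction):
    """Miss penalty for an interleaved memory on a one-word-wide bus."""
    accesses = line_bytes // (banks * word_bytes)
    transfer = banks              # one bus cycle per word delivered
    if split_transaction:
        # Only the first address cycle is exposed; the rest overlap.
        return 1 + (mem_cycles + transfer) * accesses
    return (1 + mem_cycles + transfer) * accesses

split = l2_miss_penalty(512, 16, 4, 50, split_transaction=True)
non_split = l2_miss_penalty(512, 16, 4, 50, split_transaction=False)

ENTRY_BYTES = 8                                  # 64-bit page table entries
tlb_bytes = 4 * ENTRY_BYTES                      # 4 entry TLB
virtual_pages = (4 << 30) // (4 << 10)           # 4 Gbyte / 4 Kbyte pages
page_table_bytes = virtual_pages * ENTRY_BYTES

print(split, non_split, tlb_bytes, page_table_bytes)
```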

b. A 16 x 16 matrix is mapped into memory using a (3,1) skewing scheme.

i. Assuming the first element is indexed as (0,0), in what memory module will you find (11,13)? Show how you obtained the module number.

Memory module 14 = [(11*3) mod 16 + 13] mod 16

ii. What is the minimum number of memory accesses necessary to fetch all elements of the main diagonal? For example, if conflict free access is possible the minimum number is 1.

Elements along the main forward diagonal are skewed by (3+1) = 4. The number of elements that can be concurrently accessed is 16/gcd(4,16) = 4. The number of accesses is gcd(4,16) = 4.

c. We wish to overlap indexing of the L1 cache access with virtual to physical address translation. Answer either i. or ii.

i. Demonstrate that this is indeed possible.

Not feasible.
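The skewing results in part b can be verified by enumeration. This is a minimal sketch; the function names are illustrative, and the access count assumes that elements residing in the same module must be fetched in separate accesses.

```python
from collections import Counter

MODULES = 16

def module(i, j, d1=3, d2=1):
    """Memory module of element (i, j) under a (d1, d2) skewing scheme."""
    return (d1 * i + d2 * j) % MODULES

def accesses_needed(elements):
    """Accesses required when each module serves one element per access."""
    load = Counter(module(i, j) for i, j in elements)
    return max(load.values())

main_diagonal = [(i, i) for i in range(16)]
diagonal_modules = sorted({module(i, j) for i, j in main_diagonal})

print(module(11, 13), diagonal_modules, accesses_needed(main_diagonal))
```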

ii. It is not possible, in which case determine the degree of associativity required to make it possible.

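Virtual address                         virtual page number 20 | page offset 12
Physical address presented to the L1    tag 14 | index 10 (upper 4 bits translated) | byte offset 6

The upper 4 bits of the index field lie above the 12-bit page offset, so they must be translated before you can index the cache. This can be avoided by increasing the cache associativity to 16, which shrinks the index to 6 bits so that the index and byte offset fit within the page offset.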

d. You wish to share data between two processes and all of the shared data fits within a single physical page which starts at 0x30001000. Process A maps the shared page to a virtual page starting at 0x4000A000 and Process B maps the shared physical page to a virtual page starting at 0x80006000. During operation, the machine context switches between processes A and B. The L1 cache is virtually addressed and is not flushed during context switches. Process A has a reference to 0x4000A424. Process B has a reference to 0x80006424. Will these references result in synonyms in the L1 and, if so, in what cache lines?

They will result in synonyms because they address the same word in the shared page but produce different index values for the cache.

The 10 bit cache line address from Process A's address is 1010 0100 00
The 10 bit cache line address from Process B's address is 0110 0100 00
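The two line addresses can be extracted directly from the virtual addresses. This sketch assumes the L1 geometry of this problem (64 byte lines, 1024 lines); `l1_index` is an illustrative helper name.

```python
LINE_BITS = 6      # 64 byte lines
INDEX_BITS = 10    # 64 Kbyte direct mapped / 64 byte lines = 1024 lines

def l1_index(vaddr):
    """Cache line address (index) selected by a virtual address."""
    return (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

index_a = l1_index(0x4000A424)
index_b = l1_index(0x80006424)
print(f"{index_a:010b} {index_b:010b}")
```

The indices differ only in their upper four bits, which are exactly the bits that lie above the 12-bit page offset.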

e. If the L1 local miss rate is 3% and the L2 local miss rate is 30%, what is the increase in CPI due to the cache misses?

Increase in cycles per reference = 0.03 * 10 + 0.03 * 0.3 * 529 = 5.061 (this expression ignores the miss detection time in the L2 cache = 10 cycles)

If the number of memory references/instruction is M, the increase in CPI is M * 5.061.

f. Can references to physical pages 0x20008000 and 0x30004000 generate conflict misses between each other in the L2 cache? Provide the analysis to justify your answer.

Physical address as interpreted by the L2:   tag 14 | index 7 | byte offset 9

The page offset is 12 bits, so the upper 4 bits of the index (address bits 15:12) come from the physical page number.

To conflict, accesses to both pages must map into the same set. The upper four bits of the index are distinct for all addresses in the two pages (1000 for 0x20008000 and 0100 for 0x30004000). Therefore no conflict misses will arise between accesses to these two physical pages.
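Parts e and f can both be checked numerically. The sketch below uses the L2 geometry of this problem (512 byte lines, 128 sets); the helper names are illustrative.

```python
def stall_cycles_per_reference(m1, l1_penalty, m2, l2_penalty):
    """Two-level stall model: L1 misses pay the L2 access, global misses pay memory."""
    return m1 * l1_penalty + m1 * m2 * l2_penalty

per_reference = stall_cycles_per_reference(0.03, 10, 0.30, 529)

def l2_sets_of_page(page_base, page_bytes=4096, line_bytes=512, sets=128):
    """Set of L2 set numbers touched by the lines of one physical page."""
    return {(addr // line_bytes) % sets
            for addr in range(page_base, page_base + page_bytes, line_bytes)}

sets_a = l2_sets_of_page(0x20008000)
sets_b = l2_sets_of_page(0x30004000)

print(round(per_reference, 3), sorted(sets_a), sorted(sets_b))
```

The two pages occupy disjoint groups of eight sets, so their lines never compete for the same set.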

10. All of the questions pertaining to memory systems, unless otherwise stated, will use the following memory hierarchy. We have a processor with 32-bit words, 1 GByte of main memory, a 4 GByte virtual address space, 16 Kbyte pages, and an 8 entry TLB. Each page table entry is 32 bits. We have split instruction and data L1 caches, each a direct mapped 64 Kbyte cache with 32 byte lines. The miss penalty to L1 from L2 is 4 cycles for the first word and 2 cycles/word thereafter. The L2 cache is a 2 Mbyte, 8 way set associative cache with 1024 byte lines. Main memory is organized as 32 banks where each bank can deliver a 32 bit word in 50 cycles. Main memory is connected by a 64 bit wide, split transaction bus to the L2 cache. Transfer of an address across the memory bus takes 2 cycles. Transfer of a 64-bit data item takes 1 cycle. The data miss rate in the L1 is 3% and the instruction miss rate is 2%. The local miss rate of the L2 is 20% for instructions and 40% for data.

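Address                                                     Fields (number of bits)
Virtual address                                             virtual page number 18 | page offset 14
Physical address                                            physical page number 16 | page offset 14
Addressing the L1 (breakdown of the physical address)       tag 14 | index 11 | byte offset 5
Addressing the L2 (breakdown of the physical address)       tag 12 | index 8 | byte offset 10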

The preceding assumes that the physical address is constructed to use as many bits as necessary. Note that the physical address could have been 32 bits, where the higher order 2 bits would be 00.

The miss penalty from L2 to L1 = 4 + 7 x 2 = 18 cycles.

A miss in the L2 requires 256/32 = 8 accesses (256 words/line, and 32 memory banks can deliver 32 words in one memory cycle). For a split transaction bus and a memory system that can handle multiple requests, the L2 miss penalty = 2 + (50 + 16) x 8 = 530 cycles. Note that delivering the 32 words accessed in a memory cycle requires 16 bus cycles, since each cycle can transfer 64 bits.

a. The following is a sequence of short questions. Keep your answers to the point.

i. What is a non-blocking cache?

This cache does not block on a miss; memory referencing continues while the miss is serviced. The number of outstanding misses that can be supported is a prime factor in the design of such a cache.

ii. How many unique sets in the L2 cache does a memory page occupy?

A memory page contains 16 lines. These lines are (direct) mapped across 16 sets. The L2 cache contains more than 16 sets. Therefore each page will span 16 distinct sets.

b. If 60% of all memory accesses are for instructions, what is the increase in CPI due to misses in the memory hierarchy?

If 60% of all memory references are for instructions, the remaining 40% are for data. We first compute the number of memory references for each instruction. There is one reference for fetching each instruction, plus on average 40/60 of a data reference. Therefore we have an average of 1.667 memory references per instruction (one could arrive at this more quickly by computing 1/0.6).

We separately compute the effect of instruction misses and data misses. For each class, we must compute the effect of the L1 and L2 caches, using the expressions for multi-level caches.

The increase in CPI due to instruction misses = 0.02 x (18 + 0.2 x 530)
The increase in CPI due to data misses = 0.67 x [0.03 x (18 + 0.4 x 530)]
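The two expressions above can be evaluated as follows. This is a sketch that uses the exact ratio 0.4/0.6 for data references per instruction rather than the rounded 0.67, so the data term differs slightly from a hand calculation with the rounded value; the variable names are illustrative.

```python
l1_penalty = 4 + 7 * 2             # 8-word L1 line: first word 4 cycles, then 2 each
l2_penalty = 2 + (50 + 16) * 8     # L2 miss penalty derived earlier for this hierarchy

data_refs_per_instr = 0.4 / 0.6    # data references per instruction fetch

instr_cpi = 0.02 * (l1_penalty + 0.2 * l2_penalty)
data_cpi = data_refs_per_instr * 0.03 * (l1_penalty + 0.4 * l2_penalty)
total_cpi_increase = instr_cpi + data_cpi

print(round(instr_cpi, 3), round(data_cpi, 3), round(total_cpi_increase, 3))
```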

c. Assume that the matrix is stored in row major order (increasing value of the second index), that the first element of the matrix is [0][0], and that a word interleaved memory organization is used.

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

i. Ignoring the L2 cache for a moment, what is the impact of loop interchange on conflict misses in the L1? The first element of the matrix is stored at 0x40040000.

[Figure: layout of the matrix in memory. Each successive group of 8 elements corresponds to a cache line, there are 64 cache lines/row, and 32 rows fit in the L1 cache.]

The preceding figure shows how the matrix is mapped into memory. If accesses occur along a row, you will have one miss every 8 references, corresponding to one miss per cache line. The first 32 rows will fill the cache through compulsory misses. Then the first cache line in the 33rd row will conflict with the first line in the first row. From this point on, the first reference to every line will generate a miss.

When we reference the matrix in column order, every reference is to a new cache line, since the matrix is mapped into memory in row major order. There are 256 rows, so there is not enough room in the cache to hold all lines in a column. Column major access order will therefore generate more misses than row major access order. The above loop is written to access the matrix in column major order (the inner loop varies the first dimension). Interchanging the loops so that the inner loop traverses a row will reduce the number of misses considerably.
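The effect of loop interchange on the L1 can be estimated with a small direct mapped cache model. This is a sketch under simplifying assumptions: only the data accesses to x are modeled, each element counts as one reference (the write hits immediately after the read), and the L2 is ignored as in the question. The function name `misses` is illustrative.

```python
BASE = 0x40040000                # address of x[0][0]
ROWS, COLS, WORD = 256, 512, 4   # int x[256][512] with 32-bit words
LINE_BYTES, L1_LINES = 32, 2048  # 64 Kbyte direct mapped, 32 byte lines

def misses(order):
    """Count L1 misses for a sequence of (i, j) element accesses."""
    resident = [None] * L1_LINES     # line number held at each index
    count = 0
    for i, j in order:
        line = (BASE + (i * COLS + j) * WORD) // LINE_BYTES
        index = line % L1_LINES
        if resident[index] != line:
            resident[index] = line
            count += 1
    return count

# Original loop nest: j outer, i inner (walks down a column).
column_order = ((i, j) for j in range(COLS) for i in range(ROWS))
# Interchanged loop nest: i outer, j inner (walks along a row).
row_order = ((i, j) for i in range(ROWS) for j in range(COLS))

column_misses = misses(column_order)
row_misses = misses(row_order)
print(column_misses, row_misses)
```

In this model the column order misses on every reference, while the row order misses once per cache line.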

ii. Provide a skewing scheme for the matrix such that the total miss penalty to the L2 for accessing (once) a column is minimized. Justify your answer and provide the memory module number for element x[2][18].

A (1,1) skewing scheme will work, in which element (2,18) is in module [(2 x 1) + (18 x 1)] mod 32 = 20. The first term corresponds to the module number of the first element in row 2. The second term is the relative module address of element 18 within that row. Successive column elements are in successive memory modules. If we had 256 memory modules, all elements of the column would be in distinct modules. Since we have only 32 modules, the 512 unique module addresses are mapped modulo 32.

d. Consider the virtual memory system.

i. We decide to virtually address the L1 caches and implement a new optimization such that, rather than prevent synonyms, we flush or invalidate potential synonyms in the L1 on a miss. For virtual address 0x6A304004, which lines in the data cache would have to be flushed/invalidated?

0110 1010 0011 0000 0100 0000 0000 0100

Bits 13:0 (14 bits): offset in the page, i.e., will not change under translation.
Bits 15:5 (11 bits): L1 line address.
Bits 15:14 (2 bits): translated by the virtual to physical address translation (note that these are part of the line address).

From the binary representation of 0x6A304004 we see that the upper two bits of the line address will be translated by the virtual to physical address translation mechanism. That means there can be four possible locations in the cache for a virtual address. For the above address, the four possible 11 bit line addresses are 0x000, 0x200, 0x400, and 0x600.

ii. Define a placement policy for virtual pages that will enable concurrent TLB access and L1 cache access, i.e., a policy that determines where a virtual page can be placed in main memory.

Note from the solution to the preceding part that two bits of the line address are changed by the virtual to physical address translation. This would preclude concurrent TLB and cache addressing. The placement policy should therefore ensure that these bits do not change: bits 14 and 15 of the physical address (the least significant bit is bit 0) must be the same as the corresponding bits of the virtual address. We can do so by placing virtual pages only on physical pages where the least significant 2 bits of the physical page number match those of the virtual page number.
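The four candidate line addresses for 0x6A304004 can be enumerated by holding the untranslated index bits fixed and varying the translated ones. This sketch assumes the L1 geometry of this problem (32 byte lines, 2048 lines) and 16 Kbyte pages; `candidate_lines` is an illustrative helper name.

```python
PAGE_OFFSET_BITS = 14   # 16 Kbyte pages
LINE_BITS = 5           # 32 byte lines
INDEX_BITS = 11         # 2048 lines in the direct mapped L1

def candidate_lines(vaddr):
    """Line addresses where a synonym of vaddr could reside."""
    index = (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    untranslated_bits = PAGE_OFFSET_BITS - LINE_BITS   # index bits inside the page offset
    translated_bits = INDEX_BITS - untranslated_bits   # index bits above the page offset
    low = index & ((1 << untranslated_bits) - 1)
    return [(high << untranslated_bits) | low
            for high in range(1 << translated_bits)]

lines = candidate_lines(0x6A304004)
print([f"0x{line:03x}" for line in lines])
```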

11. All of the questions pertaining to memory systems, unless otherwise stated, will use the following memory hierarchy. We have a processor with 32-bit words, 1 GByte of main memory, a 4 GByte virtual address space, 8 Kbyte pages, and separate 4 entry instruction and data TLBs. Each page table entry is 32 bits. We have split instruction and data L1 caches, each a direct mapped 32 Kbyte cache with 16 byte lines. The L1 hit time is 1 cycle. The miss penalty to L1 from L2 is 4 cycles for the first word and 2 cycles/word thereafter. The L2 cache is a 1 Mbyte, 8 way set associative cache with 512 byte lines. Main memory is organized as 16 banks where each bank can deliver a 32 bit word in 50 cycles. Main memory is connected by a 64 bit wide, split transaction bus to the L2 cache. Transfer of an address across the memory bus takes 2 cycles. Transfer of a 64-bit datum from memory to the L2 cache takes 1 cycle. The data miss rate in the L1 is 3% and the instruction miss rate is 2%. The local miss rate of the L2 is 20% for instructions and 40% for data.

a. Short questions. Please be to the point.

i. Distinguish an instruction cache from a trace cache.

An instruction cache stores copies of instructions from the static program description, in which instructions are stored in memory in program order. The location of instructions in the cache is determined by program order and placement in memory. A trace cache stores a segment of the dynamic trace of instructions that is captured during execution.

ii. On a miss in L1 that hits in L2, how many cycles will be saved by using the critical word first optimization in transferring a line to the L1?

L1 cache-line size = 16 bytes = 16/4 words = 4 words

With the critical word first optimization, the word requested by the processor is the first word passed from L2 to L1. This word becomes available in the L1 cache after 4 cycles, so the word can be provided to the processor in (4 + L1-hit-time) cycles.

Time to provide the accessed word to the processor without the critical word first optimization = L1-hit-time + 1/4 * (4 + 6 + 8 + 10) = (L1-hit-time + 7) cycles on average

This assumes that the processor is equally likely to access any of the 4 words in the L1 cache line, and can access the word as soon as L1 gets it. Otherwise, there is always a 10-cycle latency before the processor gets its word (with savings = 10 - 4 = 6 cycles).

Therefore, savings = 7 - 4 = 3 cycles due to the critical word first optimization.

iii. What is the average memory access time?

No critical word first optimization here. Calculating for the data caches:

L2-cache-line size = 512 bytes = 512/4 words = 128 words
L2-miss-penalty = 2 + (50 + 16/2) * (128/16) + 1 = 467 cycles

This is because:
- A bus transaction consists of the read of 16 words (one from each bank), so (128/16) bus transactions are needed to read an entire L2 cache line.
- Each of the 16 banks produces its word in 50 cycles in parallel, but only 2 words (64 bits) can be transferred per cycle over the bus, so each transaction takes 50 + 16/2 cycles.
- 2 cycles are needed to receive the address of a new bus transaction (overlapped with the previous bus transaction, since the bus is split-transaction).
- 1 cycle of additional latency is needed for a 64-bit result from memory to arrive at the L2 cache (pipelined over all 64-bit transfers and split-transaction).

L2-hit-time = 4 (for the 1st word) + 3 * 2 = 10 cycles
AMAT_L1 = L2-hit-time + L2-miss-rate * L2-miss-penalty = 10 + 0.40 * 467 = 196.8 (this is the average L1-miss-penalty)
AMAT_Proc = L1-hit-time + L1-miss-rate * L1-miss-penalty = 1 + 0.03 * 196.8 = 6.904 cycles
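The chain of calculations in parts ii and iii can be restated compactly. The variable names below are illustrative; the values are those of the data caches in this problem.

```python
L1_HIT_TIME = 1
L1_LINE_WORDS = 16 // 4          # 4 words per L1 line
L2_LINE_WORDS = 512 // 4         # 128 words per L2 line
BANKS = 16

# Part ii: arrival time of each word of the line at the L1 (4, 6, 8, 10 cycles).
arrival = [4 + 2 * k for k in range(L1_LINE_WORDS)]
average_without_cwf = sum(arrival) / len(arrival)
cwf_savings = average_without_cwf - arrival[0]

# Part iii: L2 miss penalty and average memory access time for data.
transactions = L2_LINE_WORDS // BANKS
l2_miss_penalty = 2 + (50 + BANKS // 2) * transactions + 1
l2_hit_time = arrival[-1]                            # full line delivered to the L1
l1_miss_penalty = l2_hit_time + 0.40 * l2_miss_penalty
amat = L1_HIT_TIME + 0.03 * l1_miss_penalty

print(cwf_savings, l2_miss_penalty, round(l1_miss_penalty, 1), round(amat, 3))
```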

b. Assume that the matrix X[16][16] is stored in main memory. The first element of the matrix is [0][0] and a word interleaved memory organization is used. Provide a skewing scheme that minimizes the number of main memory accesses for access to all elements of the reverse diagonal (do not worry about unscrambling the elements). Assume no elements are initially in the cache.

There are 16 banks (let's label them 0 to 15). There are 16 elements in the reverse diagonal of X (elements X[i][15-i] with i = 0 to 15). The skewing scheme should put each X[i][15-i] in a different bank.

Let's try to put X[i][15-i] in bank (15-i). A (0, 1) skewing scheme achieves this, since all elements X[i][0] go into bank 0, and so all elements X[i][k] go into bank k.

This can also be determined analytically as follows. Assume a (d1, d2) skewing. This puts element X[i][j] into bank# = (d1 * i + d2 * j) mod 16.

For reverse-diagonal element X[i][15-i]:

bank# = (d1 * i + d2 * (15 - i)) mod 16
      = ((d1 - d2) * i + d2 * 15) mod 16
      = ((d1 - d2) * i + offset) mod 16

This requires that, for values of i = 0 to 15, the produced bank numbers are all different. This is achieved for all (d1, d2) such that (d1 - d2) is odd. It will not work for any even (d1 - d2), as this will put the elements into fewer banks (banks 1, 3, 5, ..., 15 for (d1 - d2) = 2; banks 3, 7, 11, 15 for (d1 - d2) = 4; etc.). Of course, for (d1 - d2) = 1 (or -1), it is trivial to show that all banks will be used for reverse diagonal elements. Thus, a (0, 1) or (1, 0) skewing works.
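The parity condition derived above can be confirmed by brute force over all skewing pairs. This is a minimal sketch; `conflict_free` is an illustrative helper name and the search range 0..15 for d1 and d2 is an assumption (values repeat modulo 16).

```python
BANKS = 16

def conflict_free(d1, d2):
    """True if the reverse diagonal X[i][15-i] touches 16 distinct banks."""
    banks = {(d1 * i + d2 * (15 - i)) % BANKS for i in range(16)}
    return len(banks) == BANKS

good = [(d1, d2)
        for d1 in range(BANKS)
        for d2 in range(BANKS)
        if conflict_free(d1, d2)]

# Every conflict-free pair has an odd difference, and every odd difference works.
all_odd = all((d1 - d2) % 2 == 1 for d1, d2 in good)
print(len(good), all_odd)
```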

c. We wish to minimize the time it takes to handle an L1 miss. Can the access to the L2 cache be overlapped with access to the L1 cache? Justify your answer and include the breakdown of the address bits used to address the L1 and L2 caches.

Yes. When the processor makes a memory access to the L1 cache, the address can also be forwarded immediately to the L2 cache, since both operate on the physical address. Thus, regardless of whether there is an L1 hit, an L2 lookup is initiated. By the time an L1 miss is determined, the L2 cache has already spent time = L1-hit-time doing its lookup. This saves time = L1-hit-time = 1 cycle if the L1 misses but the L2 hits.

With such a scheme, if the L2 cache misses, it should only initiate a memory access if the L1 cache also missed. Otherwise, unnecessary memory traffic would be generated for accesses that hit in L1 but miss in L2, and this traffic would stall the L2 cache. Subsequent L1 misses would then stall, since the L2 cache is stalled.

Using byte addresses:

Cache       Tag   Index   Byte offset
L1 cache    15    11      4
L2 cache    13    8       9