
Department of Computer Science & Engineering

Chalmers University of Technology

DAT105: Computer Architecture


Exercise 4 (5.1, 5.2, 5.3, 5.4, 5.5)
By

Minh Quang Do
2007-11-29

Cache Access and Cycle Time model


(http://quid.hpl.hp.com:9081/cacti/)
References:
[1] CACTI 4.0, David Tarjan et al., HPL-2006-86, HP Laboratories Palo Alto, USA, June 2, 2006
[2] CACTI: An Enhanced Cache Access and Cycle Time Model, Steven J.E. Wilton et al., TR-1993, Western Research Laboratory, Palo Alto, USA, July 1994
[3] eCACTI: Enhanced Power Estimation Model for On-chip Caches, Mahesh Mamidipaka et al., CECS Technical Report #04-28, University of California, Irvine, USA, Sept. 14, 2004
[4] HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects, Yan Zhang et al., TR-CS-2003-05, Dept. of Computer Science, University of Virginia, USA, March 2003


Cache Organization (CACTI)


Parameters [2]:
  C: cache size (bytes)
  B: block size (bytes)
  A: associativity
  b0: output width (bits)
  baddr: address width (bits)
  Ndwl, Ndbl, Nspd: data array partitioning parameters
  Ntwl, Ntbl, Ntspd: tag array partitioning parameters

CACTI Valid Transformation


(Nspd, Ndbl, Ndwl)


Cache Organization
Size of the tag field = s − (n + m + w)
where:
  s: number of memory address bits
  w: byte offset (2^w = bytes per word)
  m: word offset (2^m = words per block)
  n: index (2^n = number of sets)

Direct-mapped cache: 4KB

s = 32, w = 2, m = 0, n = 10 → tag = 20 bits
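As a quick check of the formula, here is a small Python sketch (not part of the original slides; the helper name and parameters are my own) that reproduces the 20-bit tag of the 4KB direct-mapped example:

```python
from math import log2

# Tag size for a direct-mapped cache: tag = s - (n + m + w)
def tag_bits(s, cache_bytes, words_per_block, bytes_per_word):
    w = int(log2(bytes_per_word))               # byte offset within a word
    m = int(log2(words_per_block))              # word offset within a block
    block_bytes = words_per_block * bytes_per_word
    n = int(log2(cache_bytes // block_bytes))   # index bits (2^n sets)
    return s - (n + m + w)

# 4KB direct-mapped cache, one-word (4B) blocks, 32-bit addresses -> 20 tag bits
print(tag_bits(s=32, cache_bytes=4 * 1024, words_per_block=1, bytes_per_word=4))
```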



Direct-mapped cache: 64KB (4-word blocks)

s = 32, w = 2, m = 2, n = 12 → tag = 16 bits


Cache Organization
Size of the tag field = s − (n + m + w), where 2^n = number of sets = C / (A × B).
For an A-way set associative cache the index is log2(A) bits shorter, and the tag log2(A) bits longer, than for a direct-mapped cache of the same size.


4-way set associative cache: 4KB

s = 32, w = 2, m = 0, n = 8 → tag = 22 bits
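The same check, generalized to A-way set associative caches (again a sketch with my own parameter names; A = 1 gives the direct-mapped case):

```python
from math import log2

# tag = s - (n + m + w), with 2^n = number of sets = cache_bytes / (A * block_bytes)
def tag_bits(s, cache_bytes, assoc, words_per_block, bytes_per_word):
    w = int(log2(bytes_per_word))                               # byte offset
    m = int(log2(words_per_block))                              # word offset
    sets = cache_bytes // (assoc * words_per_block * bytes_per_word)
    n = int(log2(sets))                                         # index bits
    return s - (n + m + w)

# 4KB 4-way set associative cache, one-word (4B) blocks: 256 sets, n = 8 -> 22 tag bits
print(tag_bits(s=32, cache_bytes=4 * 1024, assoc=4, words_per_block=1, bytes_per_word=4))
```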

Figure 5.3 (p 292): Memory Hierarchy

L1: direct-mapped, 8KB
L2: direct-mapped, 4MB
L1 and L2 use 64B blocks
Page size: 8KB
TLB: direct-mapped with 256 entries
Virtual address: 64 bits; physical address: 40 bits
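Under these parameters the address fields can be worked out directly; the following sketch is not from the slides, just a hedged check assuming the 40-bit physical address is used for the cache tags and the 64-bit virtual address for the TLB:

```python
from math import log2

# (tag, index, offset) widths for a direct-mapped structure of 'entries' entries,
# where 'low_bytes' covers the block offset (caches) or page offset (TLB).
def split(addr_bits, entries, low_bytes):
    offset = int(log2(low_bytes))
    index = int(log2(entries))
    return addr_bits - index - offset, index, offset

print(split(40, 8 * 1024 // 64, 64))         # L1, physical address:  (27, 7, 6)
print(split(40, 4 * 1024 * 1024 // 64, 64))  # L2, physical address:  (18, 16, 6)
print(split(64, 256, 8 * 1024))              # TLB, virtual address:  (43, 8, 13)
```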

SRAM Memory Partitioning


Block diagram of a typical physically partitioned SRAM array (within a bank)


CACTI algorithm for finding the configuration with the best power*delay product, and its area estimate
(from [2])


Web Interface CACTI 4.0


Web Interface CACTI 4.2

http://quid.hpl.hp.com:9081/cacti/

Web Interface


Exercise 5.1a
Using CACTI 4.2, for direct-mapped, 2-way, and 4-way set associative 32KB caches with 64B block size in a 90-nm process (note: no leakage power is reported for 90 nm):

90nm (Vdd = 1.04869097076), cache size 32KB, 64B line:

  Metric                                        1-way          2-way           4-way
  Access time (ns)                              0.727237609    0.95916641708   0.883799463
  Total dynamic read power at max. freq. (W)    0.041520158    0.05737028351   0.149968343
  Cycle time (ns)                               0.474678046    0.47078647474   0.421119816
  Total area, subbanked (mm^2)                  0.555439619    0.78984622349   0.743759781
  Nbank                                         1              1               1
  N of sets per bank                            512            256             128
  Ndwl                                          1              4               16
  Ndbl                                          8              4               1
  Nspd                                          1              2               2
  Ntwl                                          32             16              1
  Ntbl                                          4              1               16
  Ntspd                                         16             32              1

Access time: the 2-way cache is about 32% slower, and the 4-way cache about 21% slower, than the direct-mapped cache.

Exercise 5.1b
Using CACTI 4.2, for 2-way set associative caches of 16KB, 32KB, and 64KB with 64B block size in a 90-nm process:

90nm (Vdd = 1.04869097076), 2-way, 64B line:

  Metric                                        16K             32K             64K
  Access time (ns)                              0.8154413713    0.95916641708   1.004488549
  Total dynamic read power at max. freq. (W)    0.0668911783    0.05737028351   0.078455944
  Cycle time (ns)                               0.4630303771    0.47078647474   0.495774489
  Total area, subbanked (mm^2)                  0.3412720341    0.78984622349   1.142408636
  No of banks                                   1               1               1
  No of sets per bank                           128             256             512
  Ndwl                                          1               4               4
  Ndbl                                          4               4               4
  Nspd                                          0.5             2               2
  Ntwl                                          8               16              8
  Ntbl                                          1               1               1
  Ntspd                                         16              32              16

Access time: the 32KB cache is about 18% slower, and the 64KB cache about 23% slower, than the 16KB cache.

Exercise 5.1c (1)


Using CACTI 4.2, for 2-way set associative caches of 8KB, 16KB, 32KB, and 64KB with 64B block size in a 90-nm process:

90nm (Vdd = 1.04869097076), 2-way, 64B line:

  Metric                                 8K            16K           32K           64K
  Access time (ns)                       0.721949734   0.81544137    0.959166417   1.0044885
  Total dynamic read power (W)           0.075579576   0.06689117    0.057370283   0.0784559
  Cycle time (ns)                        0.38809874    0.46303037    0.470786474   0.4957744
  Total area, subbanked (mm^2)           0.184715483   0.34127203    0.789846223   1.1424086
  Nbank                                  1             1             1             1
  N of sets per bank                     64            128           256           512
  Ndwl                                   1             1             4             4
  Ndbl                                   4             4             4             4
  Nspd                                   0.25          0.5           2             2
  Ntwl                                   1             8             16            8
  Ntbl                                   4             1             1             1
  Ntspd                                  2             16            32            16

From 8KB to 64KB the access time increases by about 38% for an 8× increase in cache size: a roughly logarithmic relation.

Exercise 5.1c (2)


[Chart: increase in access time (normalized to the 8KB cache) versus cache size (normalized to the 8KB cache), showing a 38% increase in access time for an 8× increase in size, a roughly logarithmic relation.]


Exercise 5.1d

From Fig. 5.29, the current version of CACTI reports that a 16KB 8-way set associative cache with 64-byte blocks has an access time of 0.88 ns. This configuration has the lowest miss rate among 16KB caches, except for a fully associative cache, whose access time would exceed 0.90 ns.


Exercise 5.2a: AMAT

Assumptions: miss penalty = 20 cycles; miss rate (32KB L1 cache, 2-way, single bank) = 0.0056101
AMAT = 0.0056101 × 20 + (1 − 0.0056101) × 1 ≈ 1.11 cycles

Way-predicted cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 2 cycles:
AMAT = 0.0056101 × 20 + (0.85 × 1 + (1 − 0.85) × 2) × (1 − 0.0056101) = 1.26 cycles
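A small Python check of the two AMAT values above (the function and its argument names are my own; it assumes a 1-cycle hit on a correct way prediction, as in the slide):

```python
# AMAT in cycles: miss_rate * miss_penalty + hit_time * (1 - miss_rate),
# where the hit time with way prediction is
#   p_correct * 1 + (1 - p_correct) * mispredict_penalty
def amat(miss_rate, miss_penalty, p_correct=1.0, mispredict_penalty=1):
    hit_time = p_correct * 1 + (1 - p_correct) * mispredict_penalty
    return miss_rate * miss_penalty + hit_time * (1 - miss_rate)

print(round(amat(0.0056101, 20), 2))                                        # 1.11, no way prediction
print(round(amat(0.0056101, 20, p_correct=0.85, mispredict_penalty=2), 2))  # 1.26
```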


Exercise 5.2b

Using CACTI 4.2, for a 16KB direct-mapped cache with 64B block size in a 90-nm process:
  Access time = 0.66318353 ns; cycle time = 0.36661061 ns
  Total dynamic power = 0.033881430 W
For a 32KB 2-way set associative cache with 64B block size in the same process:
  Access time = 0.95916641708 ns
Ratio of access times: 0.9591 / 0.6631 = 1.446, so the 16KB direct-mapped cache is 44.6% faster.


Exercise 5.2c: Way prediction on a data cache

Assumptions: miss penalty = 20 cycles; miss rate (32KB L1 cache, 2-way, single bank) = 0.0056101
Way-predicted data cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 15 cycles:
AMAT = 0.0056101 × 20 + (0.85 × 1 + (1 − 0.85) × 15) × (1 − 0.0056101) = 3.19 cycles
Increase: 3.19 − 1.26 = 1.93 cycles


Exercise 5.2d
Using CACTI 4.2, for a 1MB 4-way cache with 64B block size, 144-bit read-out, 1 bank, 1 read/write port, and 30-bit tags in a 90-nm process:

90nm (Vdd = 1.04869097076), cache size 1MB, 64B line:

  Metric                                        Normal         Fast           Serial
  Access time (ns)                              2.542393257    1.715168589    3.500265285
  Total dynamic read power at max. freq. (W)    0.360252018    0.611948165    0.302312556
  Cycle time (ns)                               0.466345737    0.513336315    0.523116448
  Total area, subbanked (mm^2)                  19.71918143    27.28437726    56.92118501
  Nbank                                         1              1              1
  N of sets per bank                            4096           4096           4096
  Ndwl                                          32             8              8
  Ndbl                                          8              32             32
  Nspd                                          4.5            1.125          1.125
  Ntwl                                          4              8              8
  Ntbl                                          32             32             32
  Ntspd                                         4              16             16

Compared with normal access, serial tag/data access gives about a 37% increase in access time and a 17% reduction in total dynamic read power.

Exercise 5.3a: Pipelined vs. Banked

Using CACTI 4.2, for a 64KB 2-way cache, 2 banks, with 64B block size in a 90-nm process:
  Access time = 0.958448597337 ns
  Cycle time = 0.47078647474 ns
  Total dynamic power = 0.114334683539 W
  Total area, subbanked = 1.64216420153 mm^2
Number of potential pipeline stages = 0.958 / 0.471 = 2.03, i.e. roughly two stages.


Exercise 5.3b: Deeper Pipelined

Assumptions (from Fig. 5.29): miss penalty = 40 cycles; miss rate (64KB L1 cache, 2-way, 1 bank) = 0.0036625
AMAT = 0.00367 × 40 + (1 − 0.00367) × 1 = 1.14 cycles
If deeper pipelining increases the cache hit time by 20% (to 1.2 cycles):
AMAT = 0.00367 × 40 + (1 − 0.00367) × 1.2 = 1.34 cycles
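A minimal check of these two values, treating the hit time as a parameter (1 cycle originally, 1.2 cycles under the 20%-deeper-pipeline assumption; names are my own):

```python
# AMAT = miss_rate * miss_penalty + (1 - miss_rate) * hit_time
def amat(miss_rate, miss_penalty, hit_time):
    return miss_rate * miss_penalty + (1 - miss_rate) * hit_time

print(round(amat(0.00367, 40, 1.0), 2))  # 1.14 cycles, original hit time
print(round(amat(0.00367, 40, 1.2), 2))  # 1.34 cycles, hit time 20% longer
```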


Exercise 5.3c

Assumptions:
  2 banks; a bank conflict causes a 1-cycle delay
  A random distribution of addresses and a steady stream of accesses, so each access has a 50% probability of conflicting with the previous access
  Miss rate (as for a 64KB L1 cache, 2-way, 1 bank) = 0.0036625
  Miss penalty = 20 cycles
AMAT = 0.00367 × 20 + (0.5 × 1 + 0.5 × 2) × (1 − 0.00367) = 1.57 cycles
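A short sketch of the same calculation, first forming the expected hit time from the 50% conflict probability (variable names are my own):

```python
# Expected hit time with 2 banks: half of the accesses use a free bank (1 cycle),
# half conflict with the previous access and pay 1 extra cycle (2 cycles).
p_conflict = 0.5
expected_hit_time = (1 - p_conflict) * 1 + p_conflict * 2   # 1.5 cycles

miss_rate, miss_penalty = 0.00367, 20
amat = miss_rate * miss_penalty + (1 - miss_rate) * expected_hit_time
print(round(amat, 2))  # 1.57 cycles
```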


Exercise 5.4a: Early restart and critical word first


Assumptions:
  1MB L2 cache with 64B block size and a 16B refill path
  The L2 can be written with 16B every 4 processor cycles
  Time to receive data from main memory: the first 16B block in 100 cycles, each additional 16B block in 16 cycles
  Ignore the cycles needed to transfer the miss request to L2 and the requested data to the L1 cache
With critical word first and early restart: an L2 cache miss requires 100 cycles
Without critical word first and early restart: an L2 cache miss requires 100 + 3 × 16 = 148 cycles
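A tiny sketch of the miss-service-time comparison under the assumptions above (variable names are my own):

```python
# First 16B chunk of the block arrives after 100 cycles; each additional 16B takes 16 cycles.
first_chunk_cycles, extra_chunk_cycles = 100, 16
chunks_per_block = 64 // 16   # 64B block over a 16B refill path

with_cwf = first_chunk_cycles                                                    # restart on the critical 16B
without_cwf = first_chunk_cycles + (chunks_per_block - 1) * extra_chunk_cycles   # wait for the whole block
print(with_cwf, without_cwf)  # 100 148
```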


Exercise 5.4b: Early restart and critical word first

It depends on:
  1. The contribution of L1 and L2 cache misses to AMAT
  2. The percentage reduction in miss service time provided by critical word first and early restart
If (2) is approximately the same for L1 and L2, then the relative AMAT contributions of L1 and L2 misses decide how important critical word first and early restart are at each level.


Exercise 5.5: Optimizing Write Buffer


Assumptions: write-through L1 and write-back L2 cache; the L2 write data bus is 16B wide and can perform a write every 4 processor cycles.

A) How many bytes wide should each write-buffer entry be?
   As wide as the L2 write data bus: 16B.
B) What speedup can a merging write buffer give for a stream of 32-bit (4B) stores, assuming 16B-wide buffer entries?
   A non-merging buffer takes 4 cycles × (16B / 4B) = 16 cycles to write four 4B stores.
   A merging buffer combines the four stores into one 16B entry and writes it in 4 cycles.
   Speedup: 16 / 4 = 4× faster.
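A small sketch of the part-B comparison under the stated assumptions (variable names are my own):

```python
entry_bytes, store_bytes, cycles_per_write = 16, 4, 4
stores = entry_bytes // store_bytes              # four 4B stores fill one 16B entry

non_merging_cycles = stores * cycles_per_write   # one 16B-wide write per store: 16 cycles
merging_cycles = cycles_per_write                # all four stores merge into a single write: 4 cycles
print(non_merging_cycles // merging_cycles)      # 4x speedup
```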
