
Department of Computer Science & Engineering

Chalmers University of Technology

DAT105: Computer Architecture


Exercise 4 (5.1, 5.2, 5.3, 5.4, 5.5)
By

Minh Quang Do
2007-11-29

Cache Access and Cycle Time model


(http://quid.hpl.hp.com:9081/cacti/)
References:
[1] CACTI 4.0, David Tarjan et al., HPL-2006-86, HP Laboratories Palo Alto, USA, June 2, 2006
[2] CACTI: An Enhanced Cache Access and Cycle Time Model, Steven J.E. Wilton et al., TR-1993, Western Research Laboratory, Palo Alto, USA, July 1994
[3] eCACTI: Enhanced Power Estimation Model for On-chip Caches, Mahesh Mamidipaka et al., CECS Technical Report #04-28, University of California, Irvine, USA, Sept. 14, 2004
[4] HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects, Yan Zhang et al., TR-CS-2003-05, Dept. of Computer Science, University of Virginia, USA, March 2003


Cache Organization (CACTI)


Parameters [2]:
  C: cache size (bytes)
  B: block size (bytes)
  A: associativity
  b0: output width (bits)
  baddr: address width (bits)
  Ndwl, Ndbl, Nspd: data array partitioning parameters
  Ntwl, Ntbl, Ntspd: tag array partitioning parameters

CACTI Valid Transformation


(Nspd, Ndbl, Ndwl)


Cache Organization
Size of the tag field = s − (n + m + w)
where:
  s: number of memory address bits
  w: byte offset (2^w = bytes per word)
  m: word offset (2^m = words per block)
  n: index (2^n = number of sets)

Direct-mapped cache: 4KB

s = 32, w = 2, m = 0, n = 10 → tag = 20 bits
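As a quick check of the formula, here is a small Python sketch (not part of the original slides; the helper name and parameters are my own) that reproduces the 20-bit tag of the 4KB direct-mapped example:

```python
from math import log2

# Tag size for a direct-mapped cache: tag = s - (n + m + w)
def tag_bits(s, cache_bytes, words_per_block, bytes_per_word):
    w = int(log2(bytes_per_word))               # byte offset within a word
    m = int(log2(words_per_block))              # word offset within a block
    block_bytes = words_per_block * bytes_per_word
    n = int(log2(cache_bytes // block_bytes))   # index bits (2^n sets)
    return s - (n + m + w)

# 4KB direct-mapped cache, one-word (4B) blocks, 32-bit addresses -> 20 tag bits
print(tag_bits(s=32, cache_bytes=4 * 1024, words_per_block=1, bytes_per_word=4))
```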



Direct-mapped cache: 64KB (4-word blocks)

s = 32, w = 2, m = 2, n = 12 → tag = 16 bits


Cache Organization
Size of the tag field = s − (n + m + w), where 2^n = number of sets = C / (A × B).
For an A-way set associative cache the index is log2(A) bits shorter, and the tag log2(A) bits longer, than for a direct-mapped cache of the same size.


4-way set associative cache: 4KB

s = 32, w = 2, m = 0, n = 8 → tag = 22 bits
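The same check, generalized to A-way set associative caches (again a sketch with my own parameter names; A = 1 gives the direct-mapped case):

```python
from math import log2

# tag = s - (n + m + w), with 2^n = number of sets = cache_bytes / (A * block_bytes)
def tag_bits(s, cache_bytes, assoc, words_per_block, bytes_per_word):
    w = int(log2(bytes_per_word))                               # byte offset
    m = int(log2(words_per_block))                              # word offset
    sets = cache_bytes // (assoc * words_per_block * bytes_per_word)
    n = int(log2(sets))                                         # index bits
    return s - (n + m + w)

# 4KB 4-way set associative cache, one-word (4B) blocks: 256 sets, n = 8 -> 22 tag bits
print(tag_bits(s=32, cache_bytes=4 * 1024, assoc=4, words_per_block=1, bytes_per_word=4))
```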

Figure 5.3 (p 292): Memory Hierarchy

L1: direct-mapped, 8KB
L2: direct-mapped, 4MB
L1 and L2 use 64B blocks
Page size: 8KB
TLB: direct-mapped with 256 entries
Virtual address: 64 bits; physical address: 40 bits
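Under these parameters the address fields can be worked out directly; the following sketch is not from the slides, just a hedged check assuming the 40-bit physical address is used for the cache tags and the 64-bit virtual address for the TLB:

```python
from math import log2

# (tag, index, offset) widths for a direct-mapped structure of 'entries' entries,
# where 'low_bytes' covers the block offset (caches) or page offset (TLB).
def split(addr_bits, entries, low_bytes):
    offset = int(log2(low_bytes))
    index = int(log2(entries))
    return addr_bits - index - offset, index, offset

print(split(40, 8 * 1024 // 64, 64))         # L1, physical address:  (27, 7, 6)
print(split(40, 4 * 1024 * 1024 // 64, 64))  # L2, physical address:  (18, 16, 6)
print(split(64, 256, 8 * 1024))              # TLB, virtual address:  (43, 8, 13)
```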

SRAM Memory Partitioning


Block diagram of a typical physically partitioned SRAM array (within a bank)


CACTI algorithm for finding the configuration with the best power*delay product, and its area estimate
(from [2])


Web Interface CACTI 4.0


Web Interface CACTI 4.2

http://quid.hpl.hp.com:9081/cacti/

Web Interface


Exercise 5.1a
Using CACTI 4.2, for direct-mapped, 2-way, and 4-way set associative 32KB caches with 64B block size in a 90-nm process (note: no leakage power is reported for 90 nm):

90nm (Vdd = 1.04869097076), cache size 32KB, 64B line:

  Metric                                        1-way          2-way           4-way
  Access time (ns)                              0.727237609    0.95916641708   0.883799463
  Total dynamic read power at max. freq. (W)    0.041520158    0.05737028351   0.149968343
  Cycle time (ns)                               0.474678046    0.47078647474   0.421119816
  Total area, subbanked (mm^2)                  0.555439619    0.78984622349   0.743759781
  Nbank                                         1              1               1
  N of sets per bank                            512            256             128
  Ndwl                                          1              4               16
  Ndbl                                          8              4               1
  Nspd                                          1              2               2
  Ntwl                                          32             16              1
  Ntbl                                          4              1               16
  Ntspd                                         16             32              1

Access time: the 2-way cache is about 32% slower, and the 4-way cache about 21% slower, than the direct-mapped cache.

Exercise 5.1b
Using CACTI 4.2, for 2-way set associative caches of 16KB, 32KB, and 64KB with 64B block size in a 90-nm process:

90nm (Vdd = 1.04869097076), 2-way, 64B line:

  Metric                                        16K             32K             64K
  Access time (ns)                              0.8154413713    0.95916641708   1.004488549
  Total dynamic read power at max. freq. (W)    0.0668911783    0.05737028351   0.078455944
  Cycle time (ns)                               0.4630303771    0.47078647474   0.495774489
  Total area, subbanked (mm^2)                  0.3412720341    0.78984622349   1.142408636
  No of banks                                   1               1               1
  No of sets per bank                           128             256             512
  Ndwl                                          1               4               4
  Ndbl                                          4               4               4
  Nspd                                          0.5             2               2
  Ntwl                                          8               16              8
  Ntbl                                          1               1               1
  Ntspd                                         16              32              16

Access time: the 32KB cache is about 18% slower, and the 64KB cache about 23% slower, than the 16KB cache.

Exercise 5.1c (1)


Using CACTI 4.2, for 2-way set associative caches of 8KB, 16KB, 32KB, and 64KB with 64B block size in a 90-nm process:

90nm (Vdd = 1.04869097076), 2-way, 64B line:

  Metric                                 8K            16K           32K           64K
  Access time (ns)                       0.721949734   0.81544137    0.959166417   1.0044885
  Total dynamic read power (W)           0.075579576   0.06689117    0.057370283   0.0784559
  Cycle time (ns)                        0.38809874    0.46303037    0.470786474   0.4957744
  Total area, subbanked (mm^2)           0.184715483   0.34127203    0.789846223   1.1424086
  Nbank                                  1             1             1             1
  N of sets per bank                     64            128           256           512
  Ndwl                                   1             1             4             4
  Ndbl                                   4             4             4             4
  Nspd                                   0.25          0.5           2             2
  Ntwl                                   1             8             16            8
  Ntbl                                   4             1             1             1
  Ntspd                                  2             16            32            16

From 8KB to 64KB the access time increases by about 38% for an 8× increase in cache size: a roughly logarithmic relation.

Exercise 5.1c (2)


[Chart: increase in access time (normalized to the 8KB cache) versus cache size (normalized to the 8KB cache), showing a 38% increase in access time for an 8× increase in size, a roughly logarithmic relation.]


Exercise 5.1d

From Fig. 5.29, the current version of CACTI reports that a 16KB 8-way set associative cache with 64-byte blocks has an access time of 0.88 ns. This configuration has the lowest miss rate among 16KB caches, except for a fully associative cache, whose access time would exceed 0.90 ns.


Exercise 5.2a: AMAT

Assumptions: miss penalty = 20 cycles; miss rate (32KB L1 cache, 2-way, single bank) = 0.0056101
AMAT = 0.0056101 × 20 + (1 − 0.0056101) × 1 ≈ 1.11 cycles

Way-predicted cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 2 cycles:
AMAT = 0.0056101 × 20 + (0.85 × 1 + (1 − 0.85) × 2) × (1 − 0.0056101) = 1.26 cycles
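A small Python check of the two AMAT values above (the function and its argument names are my own; it assumes a 1-cycle hit on a correct way prediction, as in the slide):

```python
# AMAT in cycles: miss_rate * miss_penalty + hit_time * (1 - miss_rate),
# where the hit time with way prediction is
#   p_correct * 1 + (1 - p_correct) * mispredict_penalty
def amat(miss_rate, miss_penalty, p_correct=1.0, mispredict_penalty=1):
    hit_time = p_correct * 1 + (1 - p_correct) * mispredict_penalty
    return miss_rate * miss_penalty + hit_time * (1 - miss_rate)

print(round(amat(0.0056101, 20), 2))                                        # 1.11, no way prediction
print(round(amat(0.0056101, 20, p_correct=0.85, mispredict_penalty=2), 2))  # 1.26
```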


Exercise 5.2b

Using CACTI 4.2, for a 16KB direct-mapped cache with 64B block size in a 90-nm process:
  Access time = 0.66318353 ns; cycle time = 0.36661061 ns
  Total dynamic power = 0.033881430 W
For a 32KB 2-way set associative cache with 64B block size in the same process:
  Access time = 0.95916641708 ns
Ratio of access times: 0.9591 / 0.6631 = 1.446, so the 16KB direct-mapped cache is 44.6% faster.


Exercise 5.2c: Way prediction on a data cache

Assumptions: miss penalty = 20 cycles; miss rate (32KB L1 cache, 2-way, single bank) = 0.0056101
Way-predicted data cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 15 cycles:
AMAT = 0.0056101 × 20 + (0.85 × 1 + (1 − 0.85) × 15) × (1 − 0.0056101) = 3.19 cycles
Increase: 3.19 − 1.26 = 1.93 cycles


Exercise 5.2d
Using CACTI 4.2, for a 1MB 4-way cache with 64B block size, 144-bit read-out, 1 bank, 1 read/write port, and 30-bit tags in a 90-nm process:

90nm (Vdd = 1.04869097076), cache size 1MB, 64B line:

  Metric                                        Normal         Fast           Serial
  Access time (ns)                              2.542393257    1.715168589    3.500265285
  Total dynamic read power at max. freq. (W)    0.360252018    0.611948165    0.302312556
  Cycle time (ns)                               0.466345737    0.513336315    0.523116448
  Total area, subbanked (mm^2)                  19.71918143    27.28437726    56.92118501
  Nbank                                         1              1              1
  N of sets per bank                            4096           4096           4096
  Ndwl                                          32             8              8
  Ndbl                                          8              32             32
  Nspd                                          4.5            1.125          1.125
  Ntwl                                          4              8              8
  Ntbl                                          32             32             32
  Ntspd                                         4              16             16

Compared with normal access, serial tag/data access gives about a 37% increase in access time and a 17% reduction in total dynamic read power.

Exercise 5.3a: Pipelined vs. Banked

Using CACTI 4.2, for a 64KB 2-way cache, 2 banks, with 64B block size in a 90-nm process:
  Access time = 0.958448597337 ns
  Cycle time = 0.47078647474 ns
  Total dynamic power = 0.114334683539 W
  Total area, subbanked = 1.64216420153 mm^2
Number of potential pipeline stages = 0.958 / 0.471 = 2.03, i.e. roughly two stages.


Exercise 5.3b: Deeper Pipelined

Assumptions (from Fig. 5.29): miss penalty = 40 cycles; miss rate (64KB L1 cache, 2-way, 1 bank) = 0.0036625
AMAT = 0.00367 × 40 + (1 − 0.00367) × 1 = 1.14 cycles
If deeper pipelining increases the cache hit time by 20% (to 1.2 cycles):
AMAT = 0.00367 × 40 + (1 − 0.00367) × 1.2 = 1.34 cycles
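A minimal check of these two values, treating the hit time as a parameter (1 cycle originally, 1.2 cycles under the 20%-deeper-pipeline assumption; names are my own):

```python
# AMAT = miss_rate * miss_penalty + (1 - miss_rate) * hit_time
def amat(miss_rate, miss_penalty, hit_time):
    return miss_rate * miss_penalty + (1 - miss_rate) * hit_time

print(round(amat(0.00367, 40, 1.0), 2))  # 1.14 cycles, original hit time
print(round(amat(0.00367, 40, 1.2), 2))  # 1.34 cycles, hit time 20% longer
```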


Exercise 5.3c

Assumptions:
  2 banks; a bank conflict causes a 1-cycle delay
  A random distribution of addresses and a steady stream of accesses, so each access has a 50% probability of conflicting with the previous access
  Miss rate (as for a 64KB L1 cache, 2-way, 1 bank) = 0.0036625
  Miss penalty = 20 cycles
AMAT = 0.00367 × 20 + (0.5 × 1 + 0.5 × 2) × (1 − 0.00367) = 1.57 cycles
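A short sketch of the same calculation, first forming the expected hit time from the 50% conflict probability (variable names are my own):

```python
# Expected hit time with 2 banks: half of the accesses use a free bank (1 cycle),
# half conflict with the previous access and pay 1 extra cycle (2 cycles).
p_conflict = 0.5
expected_hit_time = (1 - p_conflict) * 1 + p_conflict * 2   # 1.5 cycles

miss_rate, miss_penalty = 0.00367, 20
amat = miss_rate * miss_penalty + (1 - miss_rate) * expected_hit_time
print(round(amat, 2))  # 1.57 cycles
```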


Exercise 5.4a: Early restart and critical word first


Assumptions:
  1MB L2 cache with 64B block size and a 16B refill path
  The L2 can be written with 16B every 4 processor cycles
  Time to receive data from main memory: the first 16B block in 100 cycles, each additional 16B block in 16 cycles
  Ignore the cycles needed to transfer the miss request to L2 and the requested data to the L1 cache
With critical word first and early restart: an L2 cache miss requires 100 cycles
Without critical word first and early restart: an L2 cache miss requires 100 + 3 × 16 = 148 cycles
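A tiny sketch of the miss-service-time comparison under the assumptions above (variable names are my own):

```python
# First 16B chunk of the block arrives after 100 cycles; each additional 16B takes 16 cycles.
first_chunk_cycles, extra_chunk_cycles = 100, 16
chunks_per_block = 64 // 16   # 64B block over a 16B refill path

with_cwf = first_chunk_cycles                                                    # restart on the critical 16B
without_cwf = first_chunk_cycles + (chunks_per_block - 1) * extra_chunk_cycles   # wait for the whole block
print(with_cwf, without_cwf)  # 100 148
```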


Exercise 5.4b: Early restart and critical word first

It depends on:
  1. The contribution of L1 and L2 cache misses to AMAT
  2. The percentage reduction in miss service time provided by critical word first and early restart
If (2) is approximately the same for L1 and L2, then the relative AMAT contributions of L1 and L2 misses decide how important critical word first and early restart are at each level.


Exercise 5.5: Optimizing Write Buffer


Assumptions: write-through L1 and write-back L2 cache; the L2 write data bus is 16B wide and can perform a write every 4 processor cycles.

A) How many bytes wide should each write-buffer entry be?
   As wide as the L2 write data bus: 16B.
B) What speedup can a merging write buffer give for a stream of 32-bit (4B) stores, assuming 16B-wide buffer entries?
   A non-merging buffer takes 4 cycles × (16B / 4B) = 16 cycles to write four 4B stores.
   A merging buffer combines the four stores into one 16B entry and writes it in 4 cycles.
   Speedup: 16 / 4 = 4× faster.
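A small sketch of the part-B comparison under the stated assumptions (variable names are my own):

```python
entry_bytes, store_bytes, cycles_per_write = 16, 4, 4
stores = entry_bytes // store_bytes              # four 4B stores fill one 16B entry

non_merging_cycles = stores * cycles_per_write   # one 16B-wide write per store: 16 cycles
merging_cycles = cycles_per_write                # all four stores merge into a single write: 4 cycles
print(non_merging_cycles // merging_cycles)      # 4x speedup
```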
