
A NEW X86 CORE ARCHITECTURE FOR THE NEXT GENERATION OF COMPUTING

MIKE CLARK
SENIOR FELLOW

1 | HOT CHIPS 28 | AUGUST 23, 2016


AGENDA

THE ROAD TO ZEN
HIGH LEVEL ARCHITECTURE
‐ IMPROVEMENTS IN CORE ENGINE
‐ FLOATING POINT
‐ IMPROVEMENTS IN CACHE SYSTEM
‐ SMT DESIGN TO MAXIMIZE THROUGHPUT
‐ NEW ISA EXTENSIONS
SUMMARY
NEXT STEP UP



AMD X86 CORES: DRIVING COMPETITIVE PERFORMANCE

[Chart: instructions per clock for the "Bulldozer", "Excavator", and "Zen" cores; "Zen" delivers 40% more instructions per clock*]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.



AMD CPU DESIGN OPTIMIZATION POINTS

[Chart: CPU design points from low power ("Jaguar") to high performance ("Excavator", "Zen"); one axis runs from lower power to more performance, the other from smaller to more area and power]

ONE CORE FROM FANLESS NOTEBOOKS TO SUPERCOMPUTERS




DEFYING CONVENTION: A WIDE, HIGH PERFORMANCE, EFFICIENT CORE

[Chart: instructions per clock for "Bulldozer", "Piledriver", "Steamroller", "Excavator", and "Zen"; at the same energy per cycle, "Zen" delivers +40% work per cycle*, a total efficiency gain]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
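The slide's efficiency claim reduces to simple arithmetic: performance per watt is work per cycle divided by energy per cycle, so +40% work at unchanged energy is a 1.4x efficiency gain. A minimal sketch, with all numbers normalized to "Excavator" = 1.0:

```python
# Back-of-the-envelope model of the slide's claim: "Zen" does +40% work
# per cycle at roughly the same energy per cycle as "Excavator".
# All values are illustrative, normalized to Excavator = 1.0.

def efficiency_gain(work_per_cycle_uplift, energy_per_cycle_ratio):
    """Perf/W = (work/cycle) / (energy/cycle); return the relative gain."""
    return (1.0 + work_per_cycle_uplift) / energy_per_cycle_ratio

# +40% work per cycle at equal energy per cycle:
gain = efficiency_gain(0.40, 1.0)
print(f"relative perf/W: {gain:.2f}x")  # 1.40x
```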



ZEN PERFORMANCE & POWER IMPROVEMENTS

BETTER CORE ENGINE
‐ Two threads per core
‐ Branch mispredict improved
‐ Better branch prediction with 2 branches per BTB entry
‐ Large Op Cache
‐ Wider micro-op dispatch: 6 vs. 4
‐ Larger instruction schedulers: Integer 84 vs. 48 | FP 96 vs. 60
‐ Larger retire: 8 ops vs. 4 ops
‐ Quad-issue FPU
‐ Larger Retire Queue: 192 vs. 128
‐ Larger Load Queue: 72 vs. 44
‐ Larger Store Queue: 44 vs. 32

BETTER CACHE SYSTEM
‐ Write-back L1 cache
‐ Faster L2 cache
‐ Faster L3 cache
‐ Faster load to FPU: 7 vs. 9 cycles
‐ Better L1 and L2 data prefetchers
‐ Close to 2x the L1 and L2 bandwidth
‐ Total L3 bandwidth up 5x

LOWER POWER
‐ Aggressive clock gating with multi-level regions
‐ Write-back L1 cache
‐ Large Op Cache
‐ Stack Engine
‐ Move elimination
‐ Power focus from project inception
‐ Low-power design methodologies

40% IPC PERFORMANCE UPLIFT


ZEN MICROARCHITECTURE

[Block diagram: the 64K 4-way I-Cache and branch prediction feed decode (4 instructions/cycle) and the op cache into the micro-op queue; 6 ops dispatch to separate integer and floating-point clusters, each with its own rename, schedulers, and register file (4 ALUs and 2 AGUs on the integer side, 2 MUL and 2 ADD pipes on the FP side); the load/store queues reach a 32K 8-way D-Cache (2 loads + 1 store per cycle) backed by a 512K 8-way L2 (I+D) cache]

 Fetch four x86 instructions per cycle; the op cache supplies micro-ops
 4 integer units
‒ Large rename space: 168 registers
‒ 192 instructions in flight / 8-wide retire
 2 load/store units
‒ 72 out-of-order loads supported
 2 floating-point units × 128-bit FMACs
‒ Built as 4 pipes: 2 FADD, 2 FMUL
 I-Cache: 64K, 4-way
 D-Cache: 32K, 8-way
 L2 cache: 512K, 8-way
 Large shared L3 cache
 2 threads per core



FETCH

[Block diagram: next-PC logic, with redirects from DE/EX, drives the branch-prediction pipe (L0/L1/L2 TLBs, L1/L2 BTB, hash perceptron, return stack, ITA) and a physical request queue with micro-tags in front of the 64K instruction cache; 32 bytes/cycle arrive from L2, and 32 bytes go to decode and the op cache]

 Decoupled branch prediction
 TLB in the BP pipe
‒ 8-entry L0 TLB, all page sizes
‒ 64-entry L1 TLB, all page sizes
‒ 512-entry L2 TLB, no 1G pages
 2 branches per BTB entry
 Large L1 / L2 BTBs
 32-entry return stack
 Indirect Target Array (ITA)
 64K, 4-way instruction cache
 Micro-tags for IC & op cache
 32-byte fetch
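The "hash perceptron" box above refers to perceptron-based branch prediction: a table of weight vectors, indexed by a hash of the branch PC, is dotted with the global history to produce a prediction, and trained on mispredicts or weak outputs. A minimal sketch of that general technique; table size, history length, and threshold here are illustrative, not Zen's actual parameters:

```python
# Minimal perceptron branch predictor sketch (the general technique behind
# the "hash perceptron" block; all sizing parameters are illustrative).

class PerceptronPredictor:
    def __init__(self, n_entries=256, hist_len=16):
        self.hist_len = hist_len
        self.theta = int(1.93 * hist_len + 14)   # classic training threshold
        self.table = [[0] * (hist_len + 1) for _ in range(n_entries)]
        self.history = [1] * hist_len            # +1 = taken, -1 = not taken

    def _index(self, pc):
        return hash(pc) % len(self.table)        # stand-in for a PC hash

    def predict(self, pc):
        w = self.table[self._index(pc)]
        y = w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))
        return y >= 0, y

    def update(self, pc, taken):
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        w = self.table[self._index(pc)]
        if pred != taken or abs(y) <= self.theta:  # train on mispredict/weak
            w[0] += t
            for i in range(self.hist_len):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]      # shift in the outcome

# An always-taken loop branch is learned quickly:
p = PerceptronPredictor()
for _ in range(32):
    p.update(0x400123, taken=True)
print(p.predict(0x400123)[0])  # True
```

The threshold-based training rule is what lets a perceptron exploit long histories cheaply, which is the usual motivation for this predictor family.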



DECODE

[Block diagram: instruction bytes from the IC and hits from the micro-tags fill the instruction byte buffer; pick/decode (x86 instructions) and the op cache (micro-ops) feed the micro-op queue, assisted by the microcode ROM, stack engine, and memfile; 4 micro-ops dispatch to FP, 6 to EX]

 Inline instruction-length decoder
 Decode 4 x86 instructions
 Op cache
 Micro-op queue
 Stack engine
 Branch fusion
 Memory file for store-to-load forwarding



EXECUTE

[Block diagram: 6-micro-op dispatch feeds the map and retire queue; scheduling queues ALQ0–ALQ3 and AGQ0–AGQ1 issue through the 168-entry physical register file and forwarding muxes to ALU0–ALU3 and AGU0–AGU1, which feed the load/store unit; redirects go back to fetch]

 6 × 14-entry scheduling queues
 168-entry physical register file
 6 issue per cycle
‒ 4 ALUs, 2 AGUs
 192-entry retire queue
 Differential checkpoints
 2 branches per cycle
 Move elimination
 8-wide retire


LOAD/STORE AND L2

[Block diagram: AGU0/AGU1 feed the store queue, load queue, and prefetch logic; L0, L1, and store-pipe picks flow through the TLB and data pipes (TLB0/DAT0/DAT1/TLB1/STP) to store commit, the L1/L2 TLB + DC tags, and the 32K data cache, with MAB and WCB buffers and a 32-byte path to/from L2]

 72 out-of-order loads
 44-entry store queue
 Split TLB/data pipes, plus a store pipe
 64-entry L1 TLB, all page sizes
 1.5K-entry L2 TLB, no 1G pages
 32K, 8-way data cache
‒ Supports two 128-bit accesses
 Optimized L1 and L2 prefetchers
 512K, private (2 threads), inclusive L2
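Store-to-load forwarding, which the decode slide's memfile accelerates, works against the store queue described here: a load searches older in-flight stores for a matching address and, on a hit, takes the youngest matching store's data instead of reading the data cache. A simplified sketch (full-address, full-width matches only; the 44-entry depth is from the slide, everything else is an assumption):

```python
# Sketch of store-to-load forwarding against a bounded store queue.
# Matches are whole-address and whole-value only; real hardware also
# handles partial overlaps, ordering, and speculation.

from collections import deque

STORE_QUEUE_DEPTH = 44                         # from the slide

store_queue = deque(maxlen=STORE_QUEUE_DEPTH)  # (address, data), oldest first

def execute_store(addr, data):
    store_queue.append((addr, data))

def execute_load(addr, dcache):
    # Youngest matching store wins; otherwise fall through to the D-cache.
    for st_addr, st_data in reversed(store_queue):
        if st_addr == addr:
            return st_data
    return dcache.get(addr, 0)

dcache = {0x1000: 11}
execute_store(0x2000, 22)
execute_store(0x2000, 33)
print(execute_load(0x2000, dcache))  # 33 (forwarded from the youngest store)
print(execute_load(0x1000, dcache))  # 11 (from the data cache)
```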



FLOATING POINT

[Block diagram: 4-micro-op dispatch enters the NSQ and scheduling queue; 128-bit loads, int-to-FP and FP-to-int paths, the 160-entry physical register file, and forwarding muxes feed the MUL0/ADD0/MUL1/ADD1 pipes, with 8-micro-op retire into the 192-entry retire queue]

 2-level scheduling queue
 160-entry physical register file
 8-wide retire
 1 pipe for a 1 × 128b store
 Accelerated recovery on flushes
 SSE, AVX1, AVX2, AES, SHA, and legacy MMX/x87 compliant
 2 AES units



ZEN CACHE HIERARCHY

[Block diagram: per core, a 64K 4-way I-Cache (32B fetch) and a 32K 8-way D-Cache (2 × 16B loads, 1 × 16B store) connect at 32B/cycle to a private 512K 8-way L2 (I+D) cache, which connects at 32B/cycle to a shared 8M 16-way L3 (I+D) cache]

 Fast private 512K L2 cache
 Fast shared L3 cache
 High bandwidth enables prefetch improvements
 L3 is filled from L2 victims
 Fast cache-to-cache transfers
 Large queues for handling L1 and L2 misses
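The per-core capacities and link widths above reduce to simple arithmetic. A sketch; the capacities and 32 B/cycle width are from the slide, while the 3.0 GHz clock is an assumed example frequency, not a figure from the presentation:

```python
# Per-core cache capacity and the bandwidth implied by a 32 B/cycle link.
# Capacities are from the slide; the clock frequency is an assumption.

KIB = 1024

l1i, l1d, l2 = 64 * KIB, 32 * KIB, 512 * KIB
l3_shared = 8 * 1024 * KIB                    # 8M L3 shared across the complex

per_core_private = l1i + l1d + l2
print(per_core_private // KIB, "KiB private per core")  # 608 KiB

bytes_per_cycle = 32
clock_hz = 3.0e9                              # assumed example frequency
print(bytes_per_cycle * clock_hz / 1e9, "GB/s per 32B/cycle link")  # 96.0
```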



CPU COMPLEX

[Diagram: four cores, each with a private 512K L2, arranged around an 8MB L3 built from four slices of paired 1MB L3 macros with their control/tag blocks]

 A CPU complex (CCX) is four cores connected to an L3 cache.
 The L3 cache is 16-way associative, 8MB, mostly exclusive of L2.
 The L3 cache is made of 4 slices, interleaved by low-order address bits.
 Every core can access every cache slice with the same average latency.
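Low-order address interleave means consecutive cache lines land on different L3 slices, spreading bandwidth across all four. A natural sketch uses the address bits just above the 64-byte line offset to pick the slice; the exact bit positions are an assumption, not stated on the slide:

```python
# Sketch of 4-way L3 slice selection by low-order address interleave.
# Using the two bits directly above the 64B line offset is an assumption.

LINE_BYTES = 64
NUM_SLICES = 4

def l3_slice(addr):
    """Pick one of 4 slices from the bits above the cache-line offset."""
    return (addr // LINE_BYTES) % NUM_SLICES

# Consecutive cache lines rotate across the four slices:
print([l3_slice(line * LINE_BYTES) for line in range(8)])
# [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the whole line maps to one slice, every byte of a line is served by the same slice, while a streaming access pattern naturally load-balances across all four.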



SMT OVERVIEW

[Block diagram: the core pipeline color-coded by SMT sharing policy: competitively shared structures (decode, rename, schedulers, register files, execution units, caches); competitively shared and SMT-tagged structures (I-cache, op cache, TLBs); competitively shared with algorithmic priority; statically partitioned queues (store queue, retire queue); vertically threaded micro-op queue]

 All structures fully available in 1T mode
 Front-end queues are round-robin with priority overrides
 Increased throughput from SMT
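"Round robin with priority overrides" means the two threads normally alternate ownership of a front-end slot, but the arbiter can hand the slot to the other thread rather than waste it. A toy sketch; the override condition (the preferred thread has no work) is illustrative, not Zen's actual policy:

```python
# Toy SMT front-end arbiter: alternate threads each cycle, but let the
# other thread take the slot when the preferred one has nothing to do.
# The override condition here is an illustrative assumption.

def pick_thread(cycle, ready):
    """ready[t] is True if thread t has work this cycle; None if neither."""
    preferred = cycle % 2                 # plain round-robin
    if ready[preferred]:
        return preferred
    if ready[1 - preferred]:              # override: don't waste the slot
        return 1 - preferred
    return None

# With thread 1 stalled, thread 0 takes every slot instead of half of them:
picks = [pick_thread(c, ready={0: True, 1: False}) for c in range(4)]
print(picks)  # [0, 0, 0, 0]
```

This is also why all structures are "fully available in 1T mode": when one thread is idle, nothing is reserved for it.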



NEW INSTRUCTIONS

Feature                         | Notes                                         | Excavator | Zen
ADX                             | Extends multi-precision arithmetic support    |           | ✓
RDSEED                          | Complement to RDRAND random number generation |           | ✓
SMAP                            | Supervisor Mode Access Prevention             |           | ✓
SHA1/SHA256                     | Secure hash implementation instructions       |           | ✓
CLFLUSHOPT                      | CLFLUSH ordered by SFENCE                     |           | ✓
XSAVEC/XSAVES/XRSTORS           | New compact and supervisor save/restore       |           | ✓
CLZERO (AMD exclusive)          | Clear cache line                              |           | ✓
PTE Coalescing (AMD exclusive)  | Combines 4K page tables into 32K page size    |           | ✓

Zen also supports the full standard ISA, including AVX & AVX2, BMI1 & BMI2, AES, RDRAND, and SMEP.



“ZEN”
DESIGNED FROM THE GROUND UP FOR OPTIMAL BALANCE OF PERFORMANCE AND POWER

 Totally new high-performance core design
 New high-bandwidth, low-latency cache system
 Simultaneous multithreading (SMT) for high throughput
 Energy-efficient FinFET design scales from enterprise to client products



A COMMITTED ROADMAP TO X86 PERFORMANCE
[Chart: instructions per clock for the "Bulldozer" and "Excavator" cores, rising 40% more instructions per clock* to "Zen", with "Zen+" beyond]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.



DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and
motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like.
AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time
to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS
THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR
ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United
States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

