
A NEW X86 CORE ARCHITECTURE FOR THE NEXT GENERATION OF COMPUTING

MIKE CLARK
SENIOR FELLOW

1 | HOT CHIPS 28 | AUGUST 23, 2016


AGENDA

THE ROAD TO ZEN
HIGH LEVEL ARCHITECTURE
‐ IMPROVEMENTS IN CORE ENGINE
‐ FLOATING POINT
‐ IMPROVEMENTS IN CACHE SYSTEM
‐ SMT DESIGN TO MAXIMIZE THROUGHPUT
‐ NEW ISA EXTENSIONS
SUMMARY
NEXT STEP UP



AMD X86 CORES: DRIVING COMPETITIVE PERFORMANCE

[Chart: instructions per clock for the "Bulldozer", "Excavator", and "Zen" cores; "Zen" delivers 40% more instructions per clock*]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.



AMD CPU DESIGN OPTIMIZATION POINTS

[Chart: CPU design points from low power ("Jaguar") to high performance ("Excavator", "Zen"); one axis runs from lower power to more performance, the other from smaller to more area and power]

ONE CORE FROM FANLESS NOTEBOOKS TO SUPERCOMPUTERS




DEFYING CONVENTION: A WIDE, HIGH PERFORMANCE, EFFICIENT CORE

[Chart: instructions per clock for "Bulldozer", "Piledriver", "Steamroller", "Excavator", and "Zen"; at the same energy per cycle, "Zen" delivers +40% work per cycle*, a total efficiency gain]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.
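The slide's efficiency claim reduces to simple arithmetic: performance per watt is work per cycle divided by energy per cycle, so +40% work at unchanged energy is a 1.4x efficiency gain. A minimal sketch, with all numbers normalized to "Excavator" = 1.0:

```python
# Back-of-the-envelope model of the slide's claim: "Zen" does +40% work
# per cycle at roughly the same energy per cycle as "Excavator".
# All values are illustrative, normalized to Excavator = 1.0.

def efficiency_gain(work_per_cycle_uplift, energy_per_cycle_ratio):
    """Perf/W = (work/cycle) / (energy/cycle); return the relative gain."""
    return (1.0 + work_per_cycle_uplift) / energy_per_cycle_ratio

# +40% work per cycle at equal energy per cycle:
gain = efficiency_gain(0.40, 1.0)
print(f"relative perf/W: {gain:.2f}x")  # 1.40x
```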



ZEN PERFORMANCE & POWER IMPROVEMENTS

BETTER CORE ENGINE
‐ Two threads per core
‐ Branch mispredict improved
‐ Better branch prediction with 2 branches per BTB entry
‐ Large Op Cache
‐ Wider micro-op dispatch: 6 vs. 4
‐ Larger instruction schedulers: Integer 84 vs. 48 | FP 96 vs. 60
‐ Larger retire: 8 ops vs. 4 ops
‐ Quad-issue FPU
‐ Larger Retire Queue: 192 vs. 128
‐ Larger Load Queue: 72 vs. 44
‐ Larger Store Queue: 44 vs. 32

BETTER CACHE SYSTEM
‐ Write-back L1 cache
‐ Faster L2 cache
‐ Faster L3 cache
‐ Faster load to FPU: 7 vs. 9 cycles
‐ Better L1 and L2 data prefetchers
‐ Close to 2x the L1 and L2 bandwidth
‐ Total L3 bandwidth up 5x

LOWER POWER
‐ Aggressive clock gating with multi-level regions
‐ Write-back L1 cache
‐ Large Op Cache
‐ Stack Engine
‐ Move elimination
‐ Power focus from project inception
‐ Low-power design methodologies

40% IPC PERFORMANCE UPLIFT


ZEN MICROARCHITECTURE

[Block diagram: the 64K 4-way I-Cache and branch prediction feed decode (4 instructions/cycle) and the op cache into the micro-op queue; 6 ops dispatch to separate integer and floating-point clusters, each with its own rename, schedulers, and register file (4 ALUs and 2 AGUs on the integer side, 2 MUL and 2 ADD pipes on the FP side); the load/store queues reach a 32K 8-way D-Cache (2 loads + 1 store per cycle) backed by a 512K 8-way L2 (I+D) cache]

 Fetch four x86 instructions per cycle; the op cache supplies micro-ops
 4 integer units
‒ Large rename space: 168 registers
‒ 192 instructions in flight / 8-wide retire
 2 load/store units
‒ 72 out-of-order loads supported
 2 floating-point units × 128-bit FMACs
‒ Built as 4 pipes: 2 FADD, 2 FMUL
 I-Cache: 64K, 4-way
 D-Cache: 32K, 8-way
 L2 cache: 512K, 8-way
 Large shared L3 cache
 2 threads per core



FETCH

[Block diagram: next-PC logic, with redirects from DE/EX, drives the branch-prediction pipe (L0/L1/L2 TLBs, L1/L2 BTB, hash perceptron, return stack, ITA) and a physical request queue with micro-tags in front of the 64K instruction cache; 32 bytes/cycle arrive from L2, and 32 bytes go to decode and the op cache]

 Decoupled branch prediction
 TLB in the BP pipe
‒ 8-entry L0 TLB, all page sizes
‒ 64-entry L1 TLB, all page sizes
‒ 512-entry L2 TLB, no 1G pages
 2 branches per BTB entry
 Large L1 / L2 BTBs
 32-entry return stack
 Indirect Target Array (ITA)
 64K, 4-way instruction cache
 Micro-tags for IC & op cache
 32-byte fetch
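The "hash perceptron" box above refers to perceptron-based branch prediction: a table of weight vectors, indexed by a hash of the branch PC, is dotted with the global history to produce a prediction, and trained on mispredicts or weak outputs. A minimal sketch of that general technique; table size, history length, and threshold here are illustrative, not Zen's actual parameters:

```python
# Minimal perceptron branch predictor sketch (the general technique behind
# the "hash perceptron" block; all sizing parameters are illustrative).

class PerceptronPredictor:
    def __init__(self, n_entries=256, hist_len=16):
        self.hist_len = hist_len
        self.theta = int(1.93 * hist_len + 14)   # classic training threshold
        self.table = [[0] * (hist_len + 1) for _ in range(n_entries)]
        self.history = [1] * hist_len            # +1 = taken, -1 = not taken

    def _index(self, pc):
        return hash(pc) % len(self.table)        # stand-in for a PC hash

    def predict(self, pc):
        w = self.table[self._index(pc)]
        y = w[0] + sum(wi * h for wi, h in zip(w[1:], self.history))
        return y >= 0, y

    def update(self, pc, taken):
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        w = self.table[self._index(pc)]
        if pred != taken or abs(y) <= self.theta:  # train on mispredict/weak
            w[0] += t
            for i in range(self.hist_len):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]      # shift in the outcome

# An always-taken loop branch is learned quickly:
p = PerceptronPredictor()
for _ in range(32):
    p.update(0x400123, taken=True)
print(p.predict(0x400123)[0])  # True
```

The threshold-based training rule is what lets a perceptron exploit long histories cheaply, which is the usual motivation for this predictor family.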



DECODE

[Block diagram: instruction bytes from the IC and hits from the micro-tags fill the instruction byte buffer; pick/decode (x86 instructions) and the op cache (micro-ops) feed the micro-op queue, assisted by the microcode ROM, stack engine, and memfile; 4 micro-ops dispatch to FP, 6 to EX]

 Inline instruction-length decoder
 Decode 4 x86 instructions
 Op cache
 Micro-op queue
 Stack engine
 Branch fusion
 Memory file for store-to-load forwarding



EXECUTE

[Block diagram: 6-micro-op dispatch feeds the map and retire queue; scheduling queues ALQ0–ALQ3 and AGQ0–AGQ1 issue through the 168-entry physical register file and forwarding muxes to ALU0–ALU3 and AGU0–AGU1, which feed the load/store unit; redirects go back to fetch]

 6 × 14-entry scheduling queues
 168-entry physical register file
 6 issue per cycle
‒ 4 ALUs, 2 AGUs
 192-entry retire queue
 Differential checkpoints
 2 branches per cycle
 Move elimination
 8-wide retire


LOAD/STORE AND L2

[Block diagram: AGU0/AGU1 feed the store queue, load queue, and prefetch logic; L0, L1, and store-pipe picks flow through the TLB and data pipes (TLB0/DAT0/DAT1/TLB1/STP) to store commit, the L1/L2 TLB + DC tags, and the 32K data cache, with MAB and WCB buffers and a 32-byte path to/from L2]

 72 out-of-order loads
 44-entry store queue
 Split TLB/data pipes, plus a store pipe
 64-entry L1 TLB, all page sizes
 1.5K-entry L2 TLB, no 1G pages
 32K, 8-way data cache
‒ Supports two 128-bit accesses
 Optimized L1 and L2 prefetchers
 512K, private (2 threads), inclusive L2
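Store-to-load forwarding, which the decode slide's memfile accelerates, works against the store queue described here: a load searches older in-flight stores for a matching address and, on a hit, takes the youngest matching store's data instead of reading the data cache. A simplified sketch (full-address, full-width matches only; the 44-entry depth is from the slide, everything else is an assumption):

```python
# Sketch of store-to-load forwarding against a bounded store queue.
# Matches are whole-address and whole-value only; real hardware also
# handles partial overlaps, ordering, and speculation.

from collections import deque

STORE_QUEUE_DEPTH = 44                         # from the slide

store_queue = deque(maxlen=STORE_QUEUE_DEPTH)  # (address, data), oldest first

def execute_store(addr, data):
    store_queue.append((addr, data))

def execute_load(addr, dcache):
    # Youngest matching store wins; otherwise fall through to the D-cache.
    for st_addr, st_data in reversed(store_queue):
        if st_addr == addr:
            return st_data
    return dcache.get(addr, 0)

dcache = {0x1000: 11}
execute_store(0x2000, 22)
execute_store(0x2000, 33)
print(execute_load(0x2000, dcache))  # 33 (forwarded from the youngest store)
print(execute_load(0x1000, dcache))  # 11 (from the data cache)
```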



FLOATING POINT

[Block diagram: 4-micro-op dispatch enters the NSQ and scheduling queue; 128-bit loads, int-to-FP and FP-to-int paths, the 160-entry physical register file, and forwarding muxes feed the MUL0/ADD0/MUL1/ADD1 pipes, with 8-micro-op retire into the 192-entry retire queue]

 2-level scheduling queue
 160-entry physical register file
 8-wide retire
 1 pipe for a 1 × 128b store
 Accelerated recovery on flushes
 SSE, AVX1, AVX2, AES, SHA, and legacy MMX/x87 compliant
 2 AES units



ZEN CACHE HIERARCHY

[Block diagram: per core, a 64K 4-way I-Cache (32B fetch) and a 32K 8-way D-Cache (2 × 16B loads, 1 × 16B store) connect at 32B/cycle to a private 512K 8-way L2 (I+D) cache, which connects at 32B/cycle to a shared 8M 16-way L3 (I+D) cache]

 Fast private 512K L2 cache
 Fast shared L3 cache
 High bandwidth enables prefetch improvements
 L3 is filled from L2 victims
 Fast cache-to-cache transfers
 Large queues for handling L1 and L2 misses
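The per-core capacities and link widths above reduce to simple arithmetic. A sketch; the capacities and 32 B/cycle width are from the slide, while the 3.0 GHz clock is an assumed example frequency, not a figure from the presentation:

```python
# Per-core cache capacity and the bandwidth implied by a 32 B/cycle link.
# Capacities are from the slide; the clock frequency is an assumption.

KIB = 1024

l1i, l1d, l2 = 64 * KIB, 32 * KIB, 512 * KIB
l3_shared = 8 * 1024 * KIB                    # 8M L3 shared across the complex

per_core_private = l1i + l1d + l2
print(per_core_private // KIB, "KiB private per core")  # 608 KiB

bytes_per_cycle = 32
clock_hz = 3.0e9                              # assumed example frequency
print(bytes_per_cycle * clock_hz / 1e9, "GB/s per 32B/cycle link")  # 96.0
```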



CPU COMPLEX

[Diagram: four cores, each with a private 512K L2, arranged around an 8MB L3 built from four slices of paired 1MB L3 macros with their control/tag blocks]

 A CPU complex (CCX) is four cores connected to an L3 cache.
 The L3 cache is 16-way associative, 8MB, mostly exclusive of L2.
 The L3 cache is made of 4 slices, interleaved by low-order address bits.
 Every core can access every cache slice with the same average latency.
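Low-order address interleave means consecutive cache lines land on different L3 slices, spreading bandwidth across all four. A natural sketch uses the address bits just above the 64-byte line offset to pick the slice; the exact bit positions are an assumption, not stated on the slide:

```python
# Sketch of 4-way L3 slice selection by low-order address interleave.
# Using the two bits directly above the 64B line offset is an assumption.

LINE_BYTES = 64
NUM_SLICES = 4

def l3_slice(addr):
    """Pick one of 4 slices from the bits above the cache-line offset."""
    return (addr // LINE_BYTES) % NUM_SLICES

# Consecutive cache lines rotate across the four slices:
print([l3_slice(line * LINE_BYTES) for line in range(8)])
# [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the whole line maps to one slice, every byte of a line is served by the same slice, while a streaming access pattern naturally load-balances across all four.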



SMT OVERVIEW

[Block diagram: the core pipeline color-coded by SMT sharing policy: competitively shared structures (decode, rename, schedulers, register files, execution units, caches); competitively shared and SMT-tagged structures (I-cache, op cache, TLBs); competitively shared with algorithmic priority; statically partitioned queues (store queue, retire queue); vertically threaded micro-op queue]

 All structures fully available in 1T mode
 Front-end queues are round-robin with priority overrides
 Increased throughput from SMT
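"Round robin with priority overrides" means the two threads normally alternate ownership of a front-end slot, but the arbiter can hand the slot to the other thread rather than waste it. A toy sketch; the override condition (the preferred thread has no work) is illustrative, not Zen's actual policy:

```python
# Toy SMT front-end arbiter: alternate threads each cycle, but let the
# other thread take the slot when the preferred one has nothing to do.
# The override condition here is an illustrative assumption.

def pick_thread(cycle, ready):
    """ready[t] is True if thread t has work this cycle; None if neither."""
    preferred = cycle % 2                 # plain round-robin
    if ready[preferred]:
        return preferred
    if ready[1 - preferred]:              # override: don't waste the slot
        return 1 - preferred
    return None

# With thread 1 stalled, thread 0 takes every slot instead of half of them:
picks = [pick_thread(c, ready={0: True, 1: False}) for c in range(4)]
print(picks)  # [0, 0, 0, 0]
```

This is also why all structures are "fully available in 1T mode": when one thread is idle, nothing is reserved for it.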



NEW INSTRUCTIONS

Feature                         | Notes                                         | Excavator | Zen
ADX                             | Extends multi-precision arithmetic support    |           | ✓
RDSEED                          | Complement to RDRAND random number generation |           | ✓
SMAP                            | Supervisor Mode Access Prevention             |           | ✓
SHA1/SHA256                     | Secure hash implementation instructions       |           | ✓
CLFLUSHOPT                      | CLFLUSH ordered by SFENCE                     |           | ✓
XSAVEC/XSAVES/XRSTORS           | New compact and supervisor save/restore       |           | ✓
CLZERO (AMD exclusive)          | Clear cache line                              |           | ✓
PTE Coalescing (AMD exclusive)  | Combines 4K page tables into 32K page size    |           | ✓

Zen also supports the full standard ISA, including AVX & AVX2, BMI1 & BMI2, AES, RDRAND, and SMEP.



“ZEN”
DESIGNED FROM THE GROUND UP FOR OPTIMAL BALANCE OF PERFORMANCE AND POWER

 Totally new high-performance core design
 New high-bandwidth, low-latency cache system
 Simultaneous multithreading (SMT) for high throughput
 Energy-efficient FinFET design scales from enterprise to client products



A COMMITTED ROADMAP TO X86 PERFORMANCE
[Chart: instructions per clock for the "Bulldozer" and "Excavator" cores, rising 40% more instructions per clock* to "Zen", with "Zen+" beyond]

*Based on internal AMD estimates for “Zen” x86 CPU core compared to “Excavator” x86 CPU core.



DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and
motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like.
AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time
to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS
THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR
ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United
States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

