AMD Zen

ISSCC 2017 / SESSION 3 / DIGITAL PROCESSORS / 3.
3.2 Zen: A Next-Generation High-Performance x86 Core

Teja Singh¹, Sundar Rangarajan¹, Deepesh John¹, Carson Henrion², Shane Southard¹, Hugh McIntyre³, Amy Novak¹, Stephen Kosonocky², Ravi Jotwani¹, Alex Schaefer¹, Edward Chang², Joshua Bell¹, Michael Co¹

¹AMD, Austin, TX
²AMD, Fort Collins, CO
³AMD, Sunnyvale, CA

Codenamed “Zen”, AMD’s next-generation, high-performance x86 core targets server, desktop, and mobile client applications. Utilizing Global Foundries’ energy-efficient 14nm LPP FinFET process, the 44mm² Zen core complex unit (CCX) has 1.4B transistors and contains a shared 8MB L3 cache and four cores (Fig. 3.2.7). The 7mm² Zen core contains a dedicated 0.5MB L2 cache, 32KB L1 data cache, and 64KB L1 instruction cache. Each core has a digital low drop-out (LDO) voltage regulator and digital frequency synthesizer (DFS) to independently vary frequency and voltage across power states.

The scalable single Zen core combines both low power and high performance to replace AMD’s current two-core portfolio. The built-from-scratch Zen architecture improves instructions per clock cycle by 40% [1] without increasing power over Excavator (XV), and introduces simultaneous multi-threading, allowing eight active threads per CCX. Zen increases the issue width and execution resources by 150% and the instruction scheduler window by 175% over XV. The 168 entry integer register file has 12 read and six write ports. The integer unit can execute four ALU operations and two AGU operations, while the 128b FPU can execute two MUL and two ADD operations. The L2 cache supports an overall bandwidth of 32B/cycle in each direction and the L2 latency reduces vs. the previous generation. The L3 operates with all cores powered down and flushes itself, which proves invaluable in multiple-CCX SoC configurations. L3 cache bandwidth is 32B/cycle in each direction for a single core or 128B/cycle in each direction for four cores. The L3 includes a duplicated L2 tag in a power-optimized structure to filter transactions to the core. Single-thread power ranges from <1W to 8W as Zen reduces AC capacitance (Cac) by >15% over XV for an average workload similar to the SpecInt06 benchmark. Emphasizing power efficiency, the team carefully optimized Cac across various workloads and process points. Zen adds an operation cache that stores decoded instructions, which increases ops/cycle and saves power by reducing effective pipeline length.

Zen uses an 11-layer telescoping metal stack combining thin metals for lower level density, while utilizing wider metals for critical signals and clocks. An additional low-resistance tall aluminum layer provides connections to ESD protection and power redistribution necessary to support power headers. FinFET’s increased current density drives careful cell placement and a more robust power grid.

Zen timing is optimized across a wide voltage range to support fanless client and high-end desktop applications (Fig. 3.2.1). The achieved Vmax is considerably higher than the nominal process voltage and requires detailed analysis of gate and intra-dielectric breakdown. The core’s power supply is generated through two PFET header channels located at the top and bottom of each core. These channels are used for both power gating and voltage regulation through the digital LDO (Fig. 3.2.2). The digital LDO consists of a high-precision slow loop utilizing a digital compensator and power supply monitor (PSM) for voltage monitoring, and a fast loop with a high-speed droop detector to provide charge injection in response to worst-case current transients.

Zen CCX achieves low Vmin by implementing wordline boost circuitry (Fig. 3.2.3) for the L1 macros and powering the L2 and L3 bitcells on a separate memory voltage plane. The L1 boost circuit supports fuse controllable overdrive options to reduce the product Vmin. The system management unit (SMU) controls the boost circuit to activate only at low voltages to maintain reliability. To avoid circuit contention for improved low-voltage performance, the L1 local bitlines use delay onset keepers [2] and the global bitlines employ a contention-free read dynamic circuit. Further robustness at Vmin is achieved by aggressively avoiding circuits with poor roll-off or high variation at low voltages.

A shared PLL per CCX and a fine-grain DFS allow each core and the L3 cache to operate at different frequencies, while maintaining a synchronous timing interface. The synchronous interface and floorplan optimization reduces the average L3 latency by >30% compared to the previous generation. The fine-grain DFS contains programmable structures that can adjust insertion delay up to 15% of cycle and duty cycle up to 5% of cycle for silicon tuning.

Global clock construction leverages configurable custom clocking cells and signal pre-routes to build a skew and power optimized recombinant mesh. Zen employs coarse-grain clock gating with an efficiency >50%. Coarse gating and optimizing gater cloning reduces clock power as a percentage of total power by 30% over Jaguar [3] in average workloads like the SpecInt06 benchmark. L3 active power is reduced 35% for average workloads like the SpecInt06 benchmark and 60% during idle by clock gating the mesh. Excessive voltage guardbands are reduced through per-core regulation using the LDO in conjunction with an AVFS [4] system which runs on the SMU, utilizing frequency-to-PSM curves fused at test-time to determine optimal per-core voltage for a given target frequency.

Zen CCX incorporates several circuit solutions to mitigate frequency loss due to power supply droop events. An integrated power supply droop detector (DD) detects voltage droops that trigger a coarse-grain DFS for a short amount of time with a low-latency response time. Secondly, DD triggers the digital LDO to turn on more drivers limiting the droop longer term. Lastly, DD triggers a fine-grain DFS to reduce clock frequency as a percentage of the cycle for the event duration (Fig. 3.2.4).

The team relies on standard place and route tools carefully tuned for a high-performance design. Zen is partitioned into blocks of <0.7M instances to minimize turnaround time, while maintaining high frequencies and area efficiency. Critical interfaces and global interconnects are manually placed and constrained to yield high quality repeatable results. Each block has a mix of preplaced structures and standard place and route logic with tuned optimization recipes. Extensive use of latch and flop arrays, with custom-structured read muxes, reduces area, power, and route congestion in regular storage queues and structures. Zen utilizes 3 primary Vt types with longer length variants that allow for aggressive swapping algorithms to reduce leakage power (Fig. 3.2.5).

The available sequential cell palette is rich with a full set of low-power to high-speed flipflops for both inverting and non-inverting variants. The rich flop library gives designers granularity to close tough critical paths. The fastest flop achieves a 7% frequency advantage at roughly 1.7× the cell dynamic power (Fig. 3.2.6).

The Zen CCX is scalable for low-power and high-performance market segments and provides substantially better performance/W over previous generation AMD cores.

Acknowledgements:
The authors would like to acknowledge AMD’s talented Zen design team for their contributions.

References:
[1] M. Clark, “A New x86 Core Architecture for the Next Generation of Computing,” Hot Chips, 2016.
[2] R. Jotwani, S. Sundaram, et al., “An x86-64 Core Implemented in 32nm SOI CMOS,” ISSCC, pp. 106-107, 2010.
[3] T. Singh, J. Bell, et al., “Jaguar: A Next-Generation Low-Power x86-64 Core,” ISSCC, pp. 52-53, 2013.
[4] A. Grenat, S. Sundaram, et al., “Increasing the Performance of a 28nm x86-64 Microprocessor Through System Power Management,” ISSCC, pp. 74-75, 2016.
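
As an illustration of the per-core AVFS idea in the text (curves fused at test time, then used to pick a voltage for a target frequency), the following Python sketch interpolates a hypothetical per-core characterization table. The paper does not describe the curve format or the SMU algorithm: the table values, the use of voltage rather than PSM counts, the linear interpolation, the `guardband_mv` parameter, and the function names are all assumptions made for illustration, not AMD's implementation.

```python
from bisect import bisect_left

# Hypothetical fused characterization per core: (frequency in MHz, minimum voltage in mV).
# These numbers are invented for illustration and do not come from the paper.
CORE_CURVES = {
    0: [(1600, 700), (2400, 850), (3200, 1000), (3600, 1150)],
    1: [(1600, 720), (2400, 880), (3200, 1040), (3600, 1200)],
}


def min_voltage_for(curve, target_mhz, guardband_mv=0):
    """Linearly interpolate the lowest voltage (mV) that supports target_mhz."""
    freqs = [f for f, _ in curve]
    if target_mhz <= freqs[0]:
        # Below the characterized range: fall back to the lowest fused point.
        return curve[0][1] + guardband_mv
    if target_mhz > freqs[-1]:
        raise ValueError("target frequency is above the characterized range")
    i = bisect_left(freqs, target_mhz)  # first point with frequency >= target
    (f0, v0), (f1, v1) = curve[i - 1], curve[i]
    volts = v0 + (v1 - v0) * (target_mhz - f0) / (f1 - f0)
    return volts + guardband_mv


def per_core_voltages(curves, target_mhz, guardband_mv=0):
    """Return {core_id: voltage in mV} for one target frequency."""
    return {
        core: min_voltage_for(curve, target_mhz, guardband_mv)
        for core, curve in curves.items()
    }


if __name__ == "__main__":
    print(per_core_voltages(CORE_CURVES, 2800))
```

In this toy data the slower core (core 1) is assigned a higher voltage than core 0 at the same frequency, which is the sense in which per-core regulation avoids applying one worst-case guardband to every core.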
52 • 2017 IEEE International Solid-State Circuits Conference 978-1-5090-3758-2/17/$31.00 ©2017 IEEE

ISSCC 2017 / February 6, 2017 / 2:00 PM
Figure 3.2.1: Zen CCX optimization for different market segments.
Figure 3.2.2: Digital LDO implementation.
Figure 3.2.3: L1 cache wordline boost and contention-free dynamic circuit.
Figure 3.2.4: Zen CCX clock stretch block diagram.
Figure 3.2.5: Power breakdown and Vt usage per core.
Figure 3.2.6: Flop library breakdown.

ISSCC 2017 PAPER CONTINUATIONS
Figure 3.2.7: Zen CCX die photo.
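
The fine-grain DFS figures quoted in the text (insertion delay adjustable up to 15% of a cycle, duty cycle up to 5% of a cycle) and the clock stretch of Fig. 3.2.4 can be put into absolute terms with a little arithmetic. The Python sketch below is only that: the paper gives no operating frequency or stretch amount for these figures, so any values passed in are assumptions, and reading "reduce clock frequency as a percentage of the cycle" as lengthening each period by a fraction is an interpretation rather than a statement from the authors.

```python
def cycle_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    if freq_ghz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_ghz


def tuning_range_ps(freq_ghz, insertion_frac=0.15, duty_frac=0.05):
    """Maximum insertion-delay and duty-cycle adjustment in ps.

    The default fractions are the 15% and 5% of cycle quoted in the text.
    """
    period = cycle_ps(freq_ghz)
    return {
        "insertion_delay_ps": insertion_frac * period,
        "duty_cycle_ps": duty_frac * period,
    }


def stretched_freq_ghz(freq_ghz, stretch_frac):
    """Effective frequency when each cycle is lengthened by stretch_frac of its period."""
    if stretch_frac < 0:
        raise ValueError("stretch fraction must be non-negative")
    return freq_ghz / (1.0 + stretch_frac)


if __name__ == "__main__":
    # 4.0 GHz and a 25% stretch are assumed example values, not figures from the paper.
    print(cycle_ps(4.0))
    print(tuning_range_ps(4.0))
    print(stretched_freq_ghz(4.0, 0.25))
```

Expressing the stretch as a fraction of the period keeps the calculation independent of the operating point, which matches how the text states the tuning ranges.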