Carrizo: A High Performance, Energy Efficient 28 nm APU

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 51, NO. 1, JANUARY 2016
Benjamin Munger, Member, IEEE, David Akeson, Srikanth Arekapudi, Tom Burd, Member, IEEE,
Harry R. Fair, III, Member, IEEE, Jim Farrell, Dave Johnson, Guhan Krishnan, Member, IEEE,
Hugh McIntyre, Member, IEEE, Edward McLellan, Samuel Naffziger, Fellow, IEEE,
Russell Schreiber, Member, IEEE, Sriram Sundaram, Jonathan White, Member, IEEE, and
Kathryn Wilcox, Member, IEEE
Abstract: AMD's 6th generation "Carrizo" APU, targeted at 12–35 W mobile computing form factors, contains 3.1 billion transistors, occupies 250.04 mm² and is implemented in a 28 nm HKMG planar dual-oxide FET technology with 12 metal layers. The design achieves a 29% improvement in transistor density compared to the 5th generation "Kaveri" APU, also a 28 nm design, and implements several power management features resulting in area and power improvements similar to a technology shrink. Increased power density makes meeting the thermal limits required for reliability and power distribution to the APU's processors substantial design challenges. Pre-silicon thermal analysis is used to understand and take advantage of thermal gradients. Adaptive voltage-frequency scaling in the processor core as well as wordline and bitline assist techniques in the L2 cache enable lower minimum voltage requirements.
Index Terms: AVFS, high-frequency CMOS design, microprocessors, power efficiency, power management, 28 nm.
I. INTRODUCTION
AMD's next-generation mobile performance accelerated processing unit (APU) is AMD's 6th generation APU, codenamed Carrizo (Fig. 1) [1]. The APU includes two processor core-pair modules, codenamed Excavator, and eight AMD Radeon Graphics Core Next (GCN) cores. Carrizo is implemented in a 28 nm HKMG dual-oxide planar FET technology featuring 3 Vts of thin-oxide devices and 12 Cu metal layers. This is a density-focused version of the 28 nm technology used in AMD's previous generation processor core, codenamed Steamroller [3]. The technology features eight 1× metals for dense routing, one 2× and one 4× metal for lower RC routing and two 16× metals for power distribution. At 250.04 mm² this APU has a similar area footprint to the 5th generation APU, codenamed Kaveri [2], but fits 29% more transistors (3.1 billion) into a die than Kaveri. The Excavator module, excluding the L2 cache, has an area of 14.48 mm² and includes 102 M total transistors, 18% more transistors than Steamroller. The increased Excavator transistor count enables IPC improvements, including doubling the data cache from 16 KB to 32 KB per core, while the higher density allows more of the SoC area to be allocated to graphics performance and multimedia offload features. Excavator enables further power and frequency improvement by adding adaptive voltage-frequency scaling (AVFS) capabilities to reduce the required voltage guard band at a particular operating point and by implementing wordline and bitline assist techniques in the L2 cache to lower the minimum operating voltage (Vmin).
Manuscript received May 05, 2015; revised June 25, 2015; accepted July 22, 2015. Date of publication August 25, 2015; date of current version December 30, 2015. This paper was approved by Guest Editor Jinuk Luke Shin.
B. Munger, D. Akeson, H. R. Fair, J. Farrell, G. Krishnan, J. White, and K. Wilcox are with AMD, Boxborough, MA, USA.
S. Arekapudi, T. Burd, and H. McIntyre are with AMD, Sunnyvale, CA, USA.
D. Johnson, S. Naffziger, and R. Schreiber are with AMD, Fort Collins, CO, USA.
E. McLellan is with Cavium Networks, Marlborough, MA, USA.
S. Sundaram is with AMD, Austin, TX, USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2015.2464688
II. POWER VS FREQUENCY TRADEOFFS
Carrizo uses a process with eight 1× layers, improving routing density compared to the six 1× layers in Kaveri, and a high density (HD) 9-track standard cell library in the Excavator module.
The HD library enables better logic density in Excavator than
the high performance (HP) 13-track library allows in Steamroller. This provides significant area and power reductions, similar to a technology shrink, while staying in a 28 nm process. Meanwhile the area reduction achieved using the HD library decreases wire lengths, which offsets the higher wire RC of the extra 1× metal layers. Comparisons of designs constructed with HD cells to the same designs constructed with HP cells show an average area reduction of approximately 24% and a frequency impact of only 10% at constant voltage. Fig. 2 shows examples of this area reduction. At constant power, frequency is increased because the power reduction at constant voltage allows a higher voltage. In addition, Carrizo selected a lower leakage device suite resulting in a factor of 2.5 lower leakage while losing less than 10% of drive current compared to the devices used by Kaveri (Fig. 3). Excavator leverages this leakage profile to reduce power by 40% compared to Steamroller, enabling Carrizo to fit into the target power budgets (12–35 W). Each Excavator module is optimized to deliver frequency uplifts in the 2 W to 18 W power range while only slightly compromising the maximum frequency at higher power levels (Fig. 4).
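The constant-power argument above can be illustrated with a toy model. The sketch below is not from the paper: it assumes dynamic power scales as C·V²·f and frequency scales linearly with (V − Vt), and the threshold voltage, operating voltage and switched-capacitance ratio are hypothetical values chosen only for illustration. Only the roughly 10% frequency impact at constant voltage is taken from the text.

```python
# Illustrative model only: all constants are assumptions except FREQ_SCALE_HD.

def power(v, cap_scale, freq_scale, v_t):
    """Dynamic power ~ C * V^2 * f, with f ~ freq_scale * (V - Vt) (assumed model)."""
    return cap_scale * v * v * freq_scale * (v - v_t)


def iso_power_voltage(p_target, cap_scale, freq_scale, v_t, lo, hi, steps=60):
    """Bisection search for the voltage at which power() equals p_target."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if power(mid, cap_scale, freq_scale, v_t) < p_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


V_T = 0.35            # hypothetical threshold voltage (V)
V_HP = 0.90           # hypothetical operating voltage of the HP-cell design (V)
CAP_SCALE_HD = 0.60   # hypothetical switched-capacitance ratio, HD vs. HP
FREQ_SCALE_HD = 0.90  # ~10% frequency impact at constant voltage (from the text)

p_hp = power(V_HP, 1.0, 1.0, V_T)
v_hd = iso_power_voltage(p_hp, CAP_SCALE_HD, FREQ_SCALE_HD, V_T, lo=V_HP, hi=1.5)
freq_ratio = FREQ_SCALE_HD * (v_hd - V_T) / (V_HP - V_T)

print(f"iso-power voltage for HD design: {v_hd:.3f} V")
print(f"HD/HP frequency ratio at constant power: {freq_ratio:.3f}")
```

Under these assumed numbers the HD design can be run at a higher voltage for the same power, so its frequency at constant power exceeds that of the HP design even though it is slower at constant voltage.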
To maximize graphics performance in the 12–35 W range, the Carrizo graphics core is optimized for operation between 5 W and 25 W. Graphics performance is dictated by the number of enabled parallel graphics compute units, the graphics engine speed and the available DRAM bandwidth. Carrizo optimizes graphics performance by enabling more efficient parallel operation at lower voltage and power levels. The graphics design leverages the lower leakage device suite by using more lower Vt devices to achieve 18% leakage reduction for the same total device width as Kaveri and a 58% frequency improvement at Vmin.
0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. Carrizo die photo and block diagram.
Fig. 2. Examples of the area reductions provided by the HD library.
III. FEATURES FOR HIGH PERFORMANCE MOBILE COMPUTING AT LOW POWER
Fig. 3. Universal Curve comparing Carrizo and Kaveri thin oxide devices. Each process has 3 Vts: the two smallest channel lengths for each Vt are shown.
The 29% density improvement in the Excavator module enables a larger area allocation for graphics IP, multimedia offload and the integration of a system controller into a single BGA package. The increased graphics and multimedia area allocation enables Carrizo to implement twice as many video compression engines as Kaveri, a new video decoder to facilitate high throughput H.264 decode, new high efficiency video coding (HEVC) content decode and 4K at 60 frames per second (fps) video playback at Vmin. Carrizo supports HEVC standards and can transcode nine real time 1080p video streams, a factor of 3 improvement over Kaveri.
For thinner notebook form factors, thermal solutions limit an APU to between 12 and 35 W, requiring each IP in Carrizo to operate more efficiently than the previous generation. In spite of the area growth in the decoder engine, the net power profile of 1080p video playback improves significantly over Kaveri (Fig. 5) due to new power management features. The majority of video is still 1080p resolution at 24 or 30 fps, which requires much less time to decode than the 60 fps Carrizo supports. Therefore, to save power the universal video decoder (UVD) dynamically power gates in the idle time between decoding frames, resulting in almost a factor of 3 reduction in standard cell leakage power.
Carrizo reduces the typical-use energy consumption to improve battery life through several techniques. The Excavator module adds a low power retention mode (RCC3) by using the ring power gating structure as a linear regulator to reduce the core voltage to retention levels (0.5 V), providing up to 80% leakage reduction with entry and exit times of less than 1 µs. Carrizo also moves the graphics engine onto its own power supply, which is powered only when work is queued to it. Previous APU designs shared this power supply with the integrated Northbridge, which introduced conflicting voltage requirements and wasted power. Moving the graphics engine to its own supply increases battery life and also allows the graphics engine to operate at the minimum voltage required for a given use case (Fig. 6). The ability of the graphics core to operate at the optimal voltage, along with the high density implementation of the graphics engine, allows all eight GCN cores to operate at Vmin, resulting in up to 20% improvement in performance at 15 W over Kaveri at 15 W. Finally, Carrizo achieves a significant power reduction across all battery use cases by integrating and power gating the system hub. This integration reduces the higher analog and digital voltages needed by an external Southbridge, and eliminates the PCIe interconnect to the external Southbridge.
Fig. 4. Normalized frequency comparison of Excavator and Steamroller across Carrizo target operating range.
Fig. 5. Power improvements for Carrizo over Kaveri for common battery-life cases.
IV. DATA CACHE
The Excavator data cache achieves a nearly 50% reduction in dynamic power compared to Steamroller, despite doubling the cache's capacity. The cache design (Fig. 7) consists of the 8-way 32 KB data cache and a way predictor (WP) which supplies way select signals to the data array. The cache has 512 64 B lines of data, with 64 lines per way. All 8 ways of the WP are read at the same time to determine whether there is a match. The way match signals disable clock gaters in cache ways that do not need to be read, reducing the power required for a cache read substantially compared to Steamroller. Power is reduced further by disabling Local Wordlines (LWL) to mask out bytes that are not needed.
Steamroller has a fixed 16:1 bank conflict rate, strictly based on address [7:4]. Excavator improves the conflict rate to at least 32:1, made up of 4:1 address and 8:1 way. For typical integer code the Excavator conflict rate is much better than 32:1 because Excavator also avoids conflicts for smaller than 128 bit reads by qualifying the LWL with a byte enable, which allows two different addresses to be read from the same macro (as long as the bytes do not overlap). In all cases, reading from the same address on two different ports is not a conflict.
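The conflict rules described above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the actual hardware logic: the text gives address [7:4] for Steamroller, but the Excavator address bits used for the 4:1 split (taken here as [5:4]), the 16-byte bank width, and the definition of "same address" as the same 16-byte chunk are all hypothetical choices, as are the names `Read`, `chunk` and `byte_enable`.

```python
from dataclasses import dataclass

CHUNK_BYTES = 16  # assumed bank width: one 128-bit read


@dataclass(frozen=True)
class Read:
    addr: int  # byte address of the first byte read
    size: int  # number of bytes; must stay inside one 16-byte chunk
    way: int   # cache way selected by the way predictor (0..7)


def chunk(r):
    """Index of the 16-byte chunk a read falls in (assumed 'same address' granularity)."""
    return r.addr // CHUNK_BYTES


def byte_enable(r):
    """Bit mask of the byte lanes a read touches within its 16-byte chunk."""
    offset = r.addr % CHUNK_BYTES
    assert r.size >= 1 and offset + r.size <= CHUNK_BYTES
    return ((1 << r.size) - 1) << offset


def steamroller_conflict(a, b):
    """16 banks selected purely by address[7:4]."""
    if chunk(a) == chunk(b):
        return False  # same address on two ports is never a conflict
    return ((a.addr >> 4) & 0xF) == ((b.addr >> 4) & 0xF)


def excavator_conflict(a, b):
    """4 address banks x 8 ways, with byte enables allowing disjoint-byte sharing."""
    if chunk(a) == chunk(b):
        return False
    same_addr_bank = ((a.addr >> 4) & 0x3) == ((b.addr >> 4) & 0x3)  # assumed bits [5:4]
    same_macro = same_addr_bank and a.way == b.way
    return same_macro and (byte_enable(a) & byte_enable(b)) != 0


r0 = Read(addr=0x100, size=4, way=3)
r1 = Read(addr=0x200, size=4, way=3)  # same bank, same way, same byte lanes
r2 = Read(addr=0x204, size=4, way=3)  # same bank and way, different byte lanes
r3 = Read(addr=0x200, size=4, way=5)  # same bank, different way
for other in (r1, r2, r3):
    print(hex(other.addr), other.way,
          steamroller_conflict(r0, other), excavator_conflict(r0, other))
```

In this model all three pairs conflict under the Steamroller rule, while under the Excavator rule only the pair that shares bank, way and byte lanes does.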
Each bank is divided into eight macro sites (Fig. 8) consisting of an array macro, which contains clocking circuits and the array of bitcells, as well as logic implemented using standard cells. The standard cell logic implements the decoders, byte select logic, read muxes and other control logic. Each array macro is partitioned into 4 quadrants and consists of 64 8 B entries with byte and quadrant controls. Although each bitslice in the macro supports one read and one write, the design supports up to two reads with non-overlapping bytes by using the byte and quadrant controls. The decoders outside of the array macro provide two sets of pre-decoded Global Word Line (GWL) signals: GWL0[31:0] and GWL1[7:0]. GWL0[31:0] provides four sets of eight signals, one set per quadrant, where each set can belong to either pipe A or pipe B so that both ports may read the same macro. The 8 GWL1 signals work in zero-hot pairs which, for a given way, are driven low if there is no read in that quadrant. The 4 Byte-Sel signals, one for each quadrant in a byte, are one-hot and are asserted high if the byte is read. The combination of the GWL0, GWL1 and Byte-Sel signals ensures that the bitslice within a byte gets a 64 entry zero-hot wordline.
Fig. 6. Frequency-power curves with and without a separate graphics voltage domain.
Fig. 7. D-cache block diagram.
The same array macro is used to implement the data cache and way predict arrays. This macro uses a 0.241 µm² 8T bitcell with 8 cells per local bitline and two-stage dynamic sensing. Steamroller uses a 2-stack pulldown in the second dynamic sensing stage while the Excavator micro banking scheme enables the use of a single-high stack, resulting in 10% area reduction and 8% faster read time in the array macro.
V. POWER DENSITY
The density improvement in Excavator more than offsets the
lower leakage technology resulting in higher power density and
increased operating temperatures at the same power as Steamroller. A variety of techniques are used to mitigate power density
in order to enable Carrizo to operate across the desired power
range (up to 35 W).
A useful tool for managing die temperature is pre-silicon thermal analysis. The inputs to this analysis are power estimates for each IP in the die floorplan and a 2D matrix that models the effective thermal resistance (including both the package and the platform-level cooling solution). After the thermal resistance matrix for a particular package/platform has been created using 3D heat-flow modeling tools, many die floorplan what-if scenarios can be evaluated extremely quickly (seconds). Based on insights from pre-silicon thermal analysis, Excavator is placed away from the edge of the die and other high-power-density IPs, improving peak temperature by more than 3 degrees, as shown in Fig. 9. The temperature differences between Fig. 9(b) and Fig. 9(a) illustrate the benefit of placing Excavator away from the die edge. The thermal improvement in Excavator that results from having lower power IPs adjacent can be seen by comparing Fig. 9(b) to Fig. 9(c). Pre-silicon thermal analysis also extends the peak voltage by using the worst-case thermal gradients over the design in product reliability calculations to lower Excavator voltage guard-bands by 2–3% of Vmax. In addition, the cooler regions of Excavator such as the L2 cache are designed assuming 5–10% lower maximum operating temperatures, significantly reducing overdesign for electromigration (EM) in these regions.
Fig. 8. D-cache macro site.
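The speed of the what-if evaluation comes from the fact that, once the thermal resistance matrix exists, each floorplan only needs a matrix-vector product. The sketch below illustrates this with a hypothetical one-dimensional die of four tiles; the matrix entries, ambient temperature, IP names and power values are all invented for illustration and are not taken from the paper.

```python
# Hypothetical values throughout: a 4-tile die strip, K/W resistances, per-IP power.
AMBIENT_C = 45.0

# R_THETA[i][j] = temperature rise at tile i (K) per watt dissipated in tile j.
# Edge tiles (0 and 3) are given a higher self-resistance than interior tiles.
R_THETA = [
    [3.00, 1.00, 0.50, 0.25],
    [1.00, 2.00, 1.00, 0.50],
    [0.50, 1.00, 2.00, 1.00],
    [0.25, 0.50, 1.00, 3.00],
]

IP_POWER_W = {"cpu": 10.0, "gpu": 8.0, "io": 2.0, "media": 1.0}


def tile_temperatures(floorplan):
    """floorplan lists one IP name per tile; returns the temperature of each tile."""
    tile_power = [IP_POWER_W[ip] for ip in floorplan]
    return [
        AMBIENT_C + sum(r * p for r, p in zip(row, tile_power))
        for row in R_THETA
    ]


def peak_temperature(floorplan):
    return max(tile_temperatures(floorplan))


cpu_at_edge = ["cpu", "gpu", "io", "media"]   # CPU on the die edge, next to the GPU
cpu_inboard = ["io", "cpu", "media", "gpu"]   # CPU inboard, flanked by low-power IPs

for name, plan in (("cpu_at_edge", cpu_at_edge), ("cpu_inboard", cpu_inboard)):
    print(name, [round(t, 2) for t in tile_temperatures(plan)])
```

With these made-up numbers, moving the high-power block away from the edge and next to low-power neighbors lowers the peak tile temperature, which mirrors the qualitative conclusion drawn from Fig. 9.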
Instead of the distributed power gate headers used in Steamroller, a power gate ring is used to reduce the size of the power
switch required to support Excavator's higher power density.
Distributed headers must be sized to accommodate the worst
case current draw within small regions of the design, whereas a
power gate ring need only be sized to accommodate the worst
case across the entire power gated region. This enables a 3.54× reduction in FET width in the Excavator power switch and a corresponding leakage power reduction when it is in CC6 (sleep state).
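The sizing difference between distributed headers and a ring can be shown with a short calculation. This sketch assumes that required switch width is roughly proportional to the peak current the switch must carry; the current samples are hypothetical and chosen only to show how local peaks that do not coincide in time inflate the distributed total.

```python
# Hypothetical current samples: samples[t][r] is the current (A) drawn by
# region r of the power gated domain at time step t.
samples = [
    [4.0, 1.0, 0.5, 0.5],
    [0.5, 4.0, 1.0, 0.5],
    [0.5, 0.5, 4.0, 1.0],
    [1.0, 0.5, 0.5, 4.0],
]


def distributed_capacity(samples):
    """Each local header must carry its own region's worst case: sum of per-region peaks."""
    num_regions = len(samples[0])
    return sum(max(step[r] for step in samples) for r in range(num_regions))


def ring_capacity(samples):
    """The ring must carry only the worst-case total across the whole domain."""
    return max(sum(step) for step in samples)


dist = distributed_capacity(samples)
ring = ring_capacity(samples)
print(f"distributed: {dist} A, ring: {ring} A, ratio: {dist / ring:.2f}")
```

The ring capacity can never exceed the distributed capacity, because the peak of a sum is bounded by the sum of the peaks; the gap grows as activity becomes more localized in time and space.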
Using a ring requires that the entire current of the power gated domain is funneled through a single ring of bumps, which introduces two design challenges. First, having fewer bumps to deliver the un-gated supply can make EM limits impossible to meet if the current is too unevenly distributed. The bumps on the left side of Excavator, farthest from the low activity L2 cache, are the most sensitive to non-uniform local current distribution because they draw the most current. Second, the single ring of bumps creates the potential for a large IR drop between the bumps and the middle of the gated power domain. Designs that implement a power ring can use low resistance package plane layers to redistribute the switched supply across the entire gated power domain [8], creating a very flat voltage gradient across it. In Excavator, using package layers limits the IR drop difference between the edge and center of Excavator to less than 1% of the un-gated supply voltage. Redistributing the supply on these layers also makes EM limits in the bumps and power ring easier to meet by spreading the current drawn by the left side evenly among the bumps on that side (Fig. 10).
Fig. 9. Temperature differentials and Carrizo optimized floor plan.
Fig. 10. Current distribution in Excavator power bumps.
Fig. 11. Shadow flop and replica path schematic.
Much of the Excavator EM budget is reserved for the power bumps and connections into and out of the power gate headers on the left side of the Excavator module (Fig. 10). The design team uses a statistical electromigration budgeting (SEB) flow [7] to estimate the failure rate in time (FIT) for each copper interconnect and SnAg bump, and to characterize the FIT for each cell in the standard cell library. In order to reserve EM budget, each standard cell is required to meet a maximum capacitance limit corresponding to a value that keeps the combined FIT of all the cells low. Additionally, the power EM FIT for the bottom eight 1× metals in Excavator is analyzed with worst-case cell and custom macro switching activity. Therefore the FIT contribution from 1× metal used in standard cells and to distribute power is minimized. The entire Excavator module including upper metal layers, power headers, and the supply bumps is then analyzed with a lower switching activity corresponding to the maximum possible processor power to verify that the design meets the die-level reliability targets.
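The budgeting idea can be sketched as a simple tally: individual FIT contributions are summed and compared against a module-level target, so keeping the standard-cell and 1× metal contributions small leaves room for the bumps and power gate connections. The categories follow the text, but the target and every FIT value below are hypothetical, and the real SEB flow [7] is considerably more detailed than this sum.

```python
# Hypothetical numbers: a module-level FIT target and per-category contributions.
FIT_TARGET = 10.0

fit_contributions = {
    "supply bumps": 4.0,
    "power gate connections": 3.5,
    "upper metal layers": 1.5,
    "1x metals (standard cells and power)": 0.5,
}


def total_fit(contributions):
    """Treat failures as independent, so the combined FIT is the sum of the parts."""
    return sum(contributions.values())


def largest_contributor(contributions):
    """Name of the category consuming the most budget."""
    return max(contributions, key=contributions.get)


total = total_fit(fit_contributions)
headroom = FIT_TARGET - total
print(f"total FIT {total}, headroom {headroom}, "
      f"largest contributor: {largest_contributor(fit_contributions)}")
```

A budget of this form makes the trade explicit: tightening the per-cell capacitance limit shrinks the last entry, which directly frees budget for the current-carrying structures on the left side of the module.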
VI. ADAPTIVE VOLTAGE-FREQUENCY SCALING

Excavator supports AMD's first implementation of adaptive
voltage-frequency scaling (AVFS). AVFS collects information
about each part's frequency capability (Fmax) across process,
voltage and temperature. By allowing parts to self-calibrate,
AVFS enables reduced guard-bands for voltage uncertainty and
provides the potential to minimize or eliminate costly system
level tests in production. The AVFS system can be triggered either by microcode or by an on-chip system-management unit
(SMU) and is transparent to normal core operation. The Excavator AVFS system has similar characteristics to other adaptive
voltage approaches [4]–[6] but also adds infrastructure to enable
replica paths to function as a statistical sample of the full set of
Fmax limiting paths. In addition, the ability to separate the impact of voltage variation on delay from intrinsic circuit speed
is added by coupling path margin assessments with a voltage
reading from the integrated Power Supply Monitors (PSMs) [3].
AVFS uses a set of timing-critical replica paths and shadow flops to self-calibrate. The replica paths include gate-dominated, wire-dominated and cache/memory array paths. Each shadow flop compares the output of a replica path with a data-delayed version (Fig. 11) and monitors it for a late transition, indicating a near miss in the part's ability to capture the correct value at the operating frequency and voltage. A critical path accumulator (CPA) steps through each delay setting and accumulates information about near misses. The SMU summarizes the statistics of the near misses collected by the CPAs and creates a voltage-frequency-temperature (VFT) table which describes the part's voltage requirement for any frequency-temperature combination. P-state changes during normal operation reference the VFT table to determine the optimal voltage. The system can automatically adjust for local voltage noise by having the CPAs modify their timing margin assessments to compensate for PSM indicated voltage levels (Fig. 12). Sampling of critical path variability is provided by 10 instances of the CPA dispersed across the entire Excavator module. Each CPA exercises 50 critical paths for a total of 500 paths (300 gate-dominated, 100 wire-dominated and 100 macro replica paths). AVFS extracts descriptive statistics for the paths and calculates the actual timing margin using sampling statistics (Fig. 13). Gates, wires, and macros are treated separately to differentiate the distribution, and guard-band is added for sampling uncertainty.
Fig. 12. Measured AVFS results with sampling-theory-based prediction and self-measured voltage compensation.
Allowing AVFS to set the minimum voltage required across
the entire voltage range results in up to 30% power savings while the full implementation cost is less than 1% of
Excavator area.
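The sampling-statistics step can be sketched as follows. This is not AMD's algorithm: it assumes each path category is roughly normally distributed, extrapolates to the slow tail with a fixed number of standard deviations, and subtracts a standard-error term as the guard-band for sampling uncertainty. The function name, the `k_sigma` and `z_conf` parameters, and the synthetic margin data are all hypothetical; only the 300/100/100 split of replica paths comes from the text.

```python
import math
import random
import statistics


def category_margin(samples, k_sigma=3.0, z_conf=2.0):
    """Estimate the timing margin of the full path population from a sample.

    k_sigma: how far into the slow tail to extrapolate (assumed value).
    z_conf:  multiplier on the standard error, used as the sampling guard-band.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    tail_estimate = mean - k_sigma * sigma
    sampling_guard_band = z_conf * sigma / math.sqrt(n)
    return tail_estimate - sampling_guard_band


# Synthetic near-miss margins in picoseconds, generated only for illustration.
rng = random.Random(0)
replica_margins_ps = {
    "gate": [rng.gauss(40.0, 5.0) for _ in range(300)],
    "wire": [rng.gauss(45.0, 4.0) for _ in range(100)],
    "macro": [rng.gauss(50.0, 6.0) for _ in range(100)],
}

# Gates, wires and macros are evaluated separately; the module is limited
# by whichever category has the least margin.
per_category = {name: category_margin(s) for name, s in replica_margins_ps.items()}
module_margin_ps = min(per_category.values())

for name, margin in per_category.items():
    print(f"{name}: {margin:.1f} ps")
print(f"module margin: {module_margin_ps:.1f} ps")
```

Keeping the categories separate matters because pooling paths with different means and spreads would blur the tail of the worst category; a smaller sample also pays a larger guard-band through the 1/sqrt(n) term.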
Fig. 13. Near miss statistics and extrapolation to estimate timing margin.
VII. L2 DATA AND TAG ARRAY ASSIST TECHNIQUES

The Excavator L2 cache contains two 6T macros, one for the data array and one for the tag array. These macros employ read and write assist techniques, improving Vmin by 40 mV. The L2 data macro wordline is asserted for a full cycle and is designed to be able to use a combination of wordline underdrive and wordline boost for read and write assist. The wordline can be driven to a voltage lower than VDD during the first phase of the access by turning on a PFET pulldown (Fig. 14), allowing the bitlines to partially discharge while the wordline is underdriven, which reduces susceptibility to read disturb. The PFET pulldown can then be turned off during the second phase of the access, allowing the wordline to reach VDD. Each set of 16 wordlines shares a power header which can be turned off after a wordline returns to VDD. The virtual supply can then be boosted above VDD via an nFET used as a capacitor (Fig. 14). An nFET keeper ensures the wordline never leaks further than a Vt below VDD. The circuit can be configured to allow any combination of first phase underdrive, return to VDD and second phase boost.
Fig. 14. L2 wordline assist.
The L2 tag macro wordline is only asserted for half a cycle.
It uses a combination of wordline underdrive and negative bitline for its assist techniques because the phase-bound wordline
does not leave enough time for the bitlines to discharge prior to
the start of the write assist. The negative-bitline circuitry uses a
single capacitor per logical bit column which is coupled to the
bitline through an nFET passgate (Fig. 15). The circuit couples
the bitline down after a self-timed delay which is designed to
delay the coupling event until after the bitline has discharged to
ground.
Since both of these techniques involve extending the voltage
applied to some transistors beyond the supply voltage, they
are implemented to be programmable and can be disabled. A
microcode-controllable signal called superVminEnable turns
the assist functions off at high voltage to prevent damage to the
transistors.
VIII. CONCLUSION
Carrizo achieves power and area improvements over Kaveri similar to a process shrink while staying in a cost-effective 28 nm process. By reducing overall power consumption across all voltages, Excavator shifts its operating power range lower at the expense of frequency scaling at higher power ranges (which are not achievable in Carrizo's intended mobile platforms). GCN leverages Carrizo's lower leakage device palette to reduce leakage power at Vmin, enabling up to 20% performance improvement over Kaveri at 15 W. Carrizo's multimedia offload engines provide 4K at 60 fps video playback while power gating when lower fps is required to improve battery life. The power density challenges posed by Excavator's improved area density are managed by using a lower leakage power gating scheme than Steamroller and through use of pre-silicon thermal analysis to identify and minimize thermal hot spots. Carrizo's technology, design choices and power management features enable power-constrained frequency uplifts for mobile platforms designed around a 12 to 35 W SoC.
Fig. 15. L2 bitline assist.
REFERENCES
[1] K. Wilcox et al., "A 28 nm x86 APU optimized for power and area efficiency," in IEEE ISSCC Dig. Tech. Papers, 2015, pp. 84–85.
[2] D. Bouvier et al., "Applying AMD's Kaveri APU for heterogeneous computing," presented at the Hot Chips 2014 Symp., Cupertino, CA, USA, 2014.
[3] K. Gillespie et al., "Steamroller: An x86-64 core implemented in 28 nm bulk CMOS," in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 104–105.
[4] M. Floyd et al., "Introducing the adaptive energy management features of the Power 7 chip," IEEE Micro, vol. 31, no. 2, pp. 60–75, 2011.
[5] D. Blaauw et al., "Razor II: In situ error detection and correction for PVT and SER tolerance," in IEEE ISSCC Dig. Tech. Papers, 2008, pp. 400–401.
[6] K. Bowman et al., "Resilient microprocessor design for improving performance and energy efficiency," in IEEE Int. Conf. Computer-Aided Design, 2010, pp. 85–88.
[7] S. Krishnamoorthy et al., "Switching constraint-driven thermal reliability analysis of nanometer designs," in Proc. Int. Symp. Quality Electronic Design, 2011, pp. 473–480.
[8] R. Jotwani et al., "An x86-64 core implemented in 32 nm SOI CMOS," in IEEE ISSCC Dig. Tech. Papers, 2010, pp. 106–107.
Benjamin Munger (M'09) received the B.S. and
M.S. degrees in electrical engineering in 1997 and
1999 from the University of Rochester and the
University of Michigan at Ann Arbor, respectively.
He has 16 years of high speed and low power
digital circuit design experience. He has worked on
CPU designs at Compaq, Hewlett Packard, Intel,
and Advanced Micro Devices as well as on high
frequency asynchronous circuit designs at Achronix.
Since joining AMD, he has led several small CPU
design and design methodology teams working on
the K8G processor, the Bulldozer processor family, and future AMD products.
David Akeson is a Senior Manager in AMD's CPU
Core team in Boxborough, MA. After receiving the
B.S. and M.S. degrees in electrical engineering from
Northeastern University in 1993 and 1994, he contributed to several Alpha microprocessors at Digital
Equipment Corp., and the UltraSPARC V design at
Sun Microsystems. He has led design teams on multiple generations of processors since joining AMD
in 2004, and is currently the integration lead for a
next-generation core. He holds one U.S. patent, and
has co-authored three other IEEE publications.
Srikanth Arekapudi received the B.Tech. degree in electronics and communication engineering from the National Institute of Technology Warangal in 1999, the M.S. degree in electrical and computer engineering from the University of Massachusetts at Amherst in 2001, and the D.Eng. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2004.
He worked at Agilent Technologies and Silicon Graphics Inc. before he joined Advanced Micro Devices in 2003. At AMD he contributed to several custom circuit and stdcell designs for K8 and Bulldozer based projects. In the past he served as a design lead for Integer Execution Unit, Load Store Unit, High Speed Clock Design and RTL lead for Integer Execution Unit and Decode Unit. He is now an AMD Fellow and is involved in the micro-architecture and design of next-generation ARM/x86 processors.
Tom Burd (M'94) received the B.S., M.S., and Ph.D. degrees from the University of California, Berkeley, CA, USA, in 1992, 1994, and 2001, respectively.
After consulting at multiple start-ups, he joined Advanced Micro Devices in 2005, where he is a Senior Fellow Design Engineer. He has worked on multiple generations of high-performance x86 cores, including K8, the Bulldozer family of cores, and more recently, next generation Zen and K12 cores, in the areas of physical design and analysis methodology, power delivery, and design for reliability.
Dr. Burd has served on the Technical Program Committee for the Symposium on VLSI Circuits (2012–2015), the International Conference on Computer Aided Design (2003–2005), and Hot Chips (1996). He has authored or co-authored 21 conference and journal publications, in addition to the book Energy Efficient Microprocessor Design, and is an inventor on four U.S. patents. He was the recipient of the 2001 ISSCC Lewis Winner Award for best conference paper, and the recipient of the 1998 Analog Devices Outstanding Student Award for recognition of excellence in IC design.
Harry R. Fair, III (M'89) is the Sr. Director for the
AMD CPU Core team in Boxborough, MA. He was
the project director for the Excavator Core and is
currently the project director for a future CPU Core
design. After receiving his B.S.E.E. from Purdue
University in 1989, he joined Digital Equipment
Corporation and contributed to multiple CMOS
VAX and Alpha designs. After leaving Digital he
co-founded a new Sun Microsystems CPU design
team in the Boston area to work on an UltraSPARC
design. Mr. Fair and many former Sun colleagues
later joined AMD in 2004 where they have made significant contributions to
several x86 processors including K8 Rev G, Greyhound, Bulldozer, Piledriver
and Steamroller. He is an inventor on seven US Patents and has co-authored
six publications.
Mr. Fair was a past member of the Digital subcommittee of the IEEE International Solid-State Circuits Conference for two years.
Jim Farrell received the B.S.E.E. degree in 1984 and
the M.S.E.E. degree in 1985 from Rensselaer Polytechnic Institute, Troy, NY, USA.
He is Senior Fellow at AMD working in design
and silicon technology interaction, with particular
emphasis on client products. He has almost 30 years
of experience, starting at Digital Equipment Corporation in CPU design, then Network processors at
C-Port/Motorola before joining AMD in the CPU
design group.
Dave Johnson is the Director of the AMD CPU
Core team in Fort Collins, CO, USA. After receiving
the B.S.E.E. degree from the University of Illinois,
Urbana-Champaign, IL, USA, in 1991, he spent
two years with Amdahl Corporation designing
mainframe CPUs. He received the M.S.E.E. degree
from Stanford University, Stanford, CA, USA, and
spent 14 years with Hewlett-Packard designing six
generations of PA-RISC CPUs. Since joining AMD
in 2007, he has made significant contributions to
the physical designs of AMD's Bulldozer family of
Cores. Mr. Johnson holds 6 U.S. patents and has co-authored several papers
and presentations.
Guhan Krishnan (M'15) received the B.E. degree
in electrical engineering from Birla Institute of Technology and Sciences, Pilani, India, in 1995, and the
M.S. degree in computer engineering from the State
University of New York, Buffalo, NY, USA, in 1998.
He is an AMD Fellow in the SOC architecture team
in Boxborough, MA, USA. He has over 18 years of
industry experience in designing complex high performance system-on-chips. He is currently the SoC
architect for AMD's future APU offering. He holds
12 U.S. patents in processor caches, clocking, DRAM
controllers and power management.
Hugh McIntyre (M'92) received the B.A. and M.A. degrees in engineering and electrical and information sciences in 1987 and the Diploma in computer science in 1988 from Cambridge University, Cambridge, U.K.
He is currently a Principal Member of Technical Staff at AMD, where he has worked since 2005 on design of x86-64 and ARM CPUs, including Macro Lead for Excavator, and circuit methodology, macros, technology, and implementation for other projects. He is currently working on circuit methodology, technology, and ESD for future AMD processors. From 1996 to 2005
he worked at Sun Microsystems on SRAMs and custom circuits, including
memory compilers, for multiple UltraSPARC processors. Prior to joining Sun,
Hugh was at Inmos Limited, then ST Microelectronics, working on high-speed
SRAMs, Flash EPROMs, and a media processor. Mr. McIntyre holds eleven
U.S. patents and has four other IEEE publications.
Edward McLellan is a Distinguished Engineer
with Cavium Inc. developing next generation ARM
cores and power management controls. He has
contributed to the design of multiple PDP-11, VAX,
PRISM, Alpha, MIPS and x86 processors for DEC,
C-Port/Motorola and AMD over 35 years after
graduating from Rensselaer Polytechnic Institute,
Troy, NY, USA, with a B.S. degree in computer and
systems engineering. He is an inventor on over 20
patents issued and pending.
Samuel Naffziger (M'02–SM'10–F'14) received the
B.S.E.E. degree from the California Institute of Technology (CalTech), Pasadena, CA, USA, in 1988 and
the M.S.E.E. degree from Stanford University, Stanford, CA, USA, in 1993.
He is a Corporate Fellow at AMD responsible
for low power technology development, and has
been the key innovator behind many of AMD's low
power features. He has been in the industry 27 years
with a background in microprocessors and circuit
design, starting at Hewlett Packard, moving to Intel
and then at AMD since 2006. He holds 113 U.S. patents in processor circuits,
architecture and power management. He has authored dozens of publications
and presentations in the field.
Russell Schreiber (M'14) received the B.S. degree
in computer engineering from Pennsylvania State
University, State College, PA, USA, in 2001, and
the M.S. degree in electrical engineering from the
University of Illinois, Urbana-Champaign, IL, USA,
in 2004.
He joined AMD in 2004 and is currently a Principal Member of Technical Staff. While at AMD, he
has worked on custom circuitry in L2 and L3 cache
designs on several generations of processor cores. He
is currently lead circuit designer for the L2 and L3
cache hierarchy of AMD's next generation processor core. He is listed as an inventor on 14 issued and pending U.S. patents.
Sriram Sundaram received the Bachelor's degree in
electronics and telecommunications from the College
of Engineering, Pune, India, in 2000, the Master's degree in electrical engineering from Texas Tech University, Lubbock, TX, USA, in 2001, and is currently
pursuing a part-time Ph.D. at the University of Texas
at Austin, TX, USA.
He has 14 years of industry experience. He started
his career at AMD as a circuit design engineer
in 2001. He has experience in high-speed custom
memory array design, high-speed clock distribution,
and custom circuit design methodologies. In addition, he has co-developed
a framework for power-frequency modeling that has been used for roadmap
projections for various AMD SoCs. He is currently also a Si lead for adaptive
voltage frequency scaling (AVFS) implementations across multiple AMD
products.
Jonathan White (M'01) received the B.S. degree in
computer systems engineering in 1992 and the M.S.
degree in electrical and computer engineering in 1996
from the University of Massachusetts, Amherst, MA,
USA.
He worked for Digital Equipment Corp. in
Hudson, MA, USA, until 1999 designing Alpha
CPUs. From 1999 to 2004, he worked on Millennium
CPU for Sun Microsystems, Burlington, MA, USA.
In 2004, he joined AMD, where he has worked on
several standard cell designs and power analysis, and
served as a lead for large teams. He holds one patent on low power design.
Kathryn Wilcox (M'11) is an AMD Fellow in the
CPU Core team in Boxborough, MA, USA. After
earning the B.S.E.E. at Cornell University, Ithaca,
NY, USA, in 1990, she joined Digital Equipment
Corporation and worked on multiple Alpha processors. She worked on network processors at C-Port
Corp./Motorola from 2000 to 2003. She has been
with AMD since 2003 working on custom designs
and methodology for K8G, BullDozer, SteamRoller
and Excavator processor cores.
Ms. Wilcox has served on the Technical Program
Committee for the Symposium on VLSI Circuits from 2011 to 2014 and the
Technical Program Committee for ISSCC since 2014.
