Abstract: This paper describes the implementation of a Qualcomm Hexagon digital signal processor (DSP) in a 28 nm high-k
metal gate technology. The DSP is a multi-threaded very-long-instruction-word (VLIW) machine optimized for low leakage and
energy efficiency. It uses a clock distribution network, clock gating
cells, and pulsed latches that are optimized for low switching
energy. The processor can be powered using a low-dropout (LDO)
voltage regulator or a head switch. It operates from 255 MHz at
0.60 V to 1.24 GHz at 1.05 V. When operating from the LDO,
the power consumption of the core can be as low as 58 µW/MHz,
which is two to three times lower than comparable cores optimized
for ultra-low voltage operation.
TABLE I
INTERCONNECT STACK
I. INTRODUCTION
0018-9200 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
increases complexity even more. In contrast, this VLIW architecture relies on the compiler to form packets of instructions
that can execute in parallel. The packets themselves are executed in order. This significantly reduces the complexity of the
issue logic, which helps to minimize area and power. To keep
the code density high, each packet has a variable length and
can contain between one and four instructions.
The pipeline is built so that all instructions in a packet
complete before the next packet is dispatched. This allows the
compiler to schedule a dependent packet immediately after
the source packet. Consequently, the compiler does not need
to find unrelated instructions to form independent packets to
be scheduled between the source and the dependent packets.
Finding enough independent instructions tends to be difficult.
Not having to schedule these independent packets helps the
compiler increase the number of instructions per packet, which
increases the utilization of the functional units.
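As a rough illustration of the packet formation described above, the following Python sketch greedily groups an in-order instruction stream into packets of one to four instructions, closing a packet whenever the next instruction depends on one already in it. The `form_packets` helper, the instruction names, and the dependency sets are hypothetical. A real compiler also reorders instructions and honors functional-unit constraints, which this sketch ignores.

```python
def form_packets(instructions, max_width=4):
    """Greedily pack instructions, given in program order, into packets.

    instructions: list of (name, set_of_names_it_depends_on) tuples.
    A packet is closed when it is full or when the next instruction
    depends on an instruction already in the current packet.
    """
    packets = []
    current = []
    current_names = set()
    for name, deps in instructions:
        if len(current) == max_width or deps & current_names:
            packets.append(current)
            current, current_names = [], set()
        current.append(name)
        current_names.add(name)
    if current:
        packets.append(current)
    return packets


# Hypothetical instruction stream: i2 needs i0, and i4 needs i2.
stream = [
    ("i0", set()),
    ("i1", set()),
    ("i2", {"i0"}),
    ("i3", set()),
    ("i4", {"i2"}),
]
packets = form_packets(stream)
print(packets)
```

Because all instructions in a packet complete before the next packet is dispatched, the dependent packets in this sketch can follow their source packets directly, with no filler packets in between.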
As shown in Fig. 1, the core has two 64 bit load/store units.
These units can also perform 32 bit ALU operations. In addition, the core has two 64 bit vector execution units that, together, are capable of performing eight 16 bit multiply-accumulate operations per cycle. This core also supports dynamic
multithreading. The threads are typically scheduled in a simple
round-robin fashion, where a packet from a different thread is
dispatched each clock cycle. This temporal multithreading is
particularly useful to help ensure deterministic real-time thread
performance. Certain events, like an L2 cache miss, can cause
a stalled thread to be skipped, thus improving performance for
the other threads that are not stalled. Each thread has a unified
register file. The 16 kB L1 instruction cache, the 32 kB L1 data
cache, and the 256 kB L2 cache are shared between all threads.
The L2 cache can be reconfigured as a tightly coupled memory.
This is useful for critical interrupt handlers or real-time tasks as
the unpredictable latency of a cache would be highly undesirable. Finally, the core communicates with the rest of the system
and the external memory through asynchronous bus interfaces.
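The round-robin thread selection described above, including the skipping of stalled threads, can be sketched as a small behavioral model. The function name `dispatch_order`, the thread count, and the stall pattern are assumptions made for illustration only. The text does not specify the number of hardware threads or the exact arbitration logic.

```python
def dispatch_order(num_threads, cycles, stalled_by_cycle):
    """Return the thread chosen in each cycle (None if every thread is stalled).

    stalled_by_cycle: dict mapping a cycle number to the set of thread ids
    that are stalled in that cycle (for example, waiting on an L2 miss).
    """
    order = []
    next_thread = 0
    for cycle in range(cycles):
        blocked = stalled_by_cycle.get(cycle, set())
        chosen = None
        # Start at the thread whose turn it is and skip any stalled thread.
        for offset in range(num_threads):
            candidate = (next_thread + offset) % num_threads
            if candidate not in blocked:
                chosen = candidate
                break
        order.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % num_threads
    return order


# Three hypothetical threads; thread 1 is stalled during cycles 1 and 2.
no_stalls = dispatch_order(3, 6, {})
with_stalls = dispatch_order(3, 6, {1: {1}, 2: {1}})
print(no_stalls)
print(with_stalls)
```

With no stalls, each thread receives every third dispatch slot, which is the deterministic behavior the text refers to. When thread 1 stalls, its slots go to the other threads instead of being wasted.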
III. CLOCKING
The DSP core can be clocked from multiple sources. Because
the bus interfaces of the DSP are asynchronous, the phase or frequency of the bus clocks can be independent from the phase or
SAINT-LAURENT et al.: A 28 nm DSP POWERED BY AN ON-CHIP LDO FOR HIGH-PERFORMANCE AND ENERGY-EFFICIENT MOBILE APPLICATIONS
logic gates.
on
and , the pull-down stack could turn on. However, it
would then harmlessly help the keeper hold the state node low.
The simulated global clock distribution power is
3.96 µW/MHz at 0.90 V. When running a typical workload,
79.8% of the regional clocks and 94.3% of the local clocks are
gated off.
Fig. 4. Low-power clock gating cell topology.
path. The auxiliary latch would then act as a master latch and
transform each pulsed latch into a flip-flop when a test pattern is
shifted in. However, in this technology, simply adding high-delay stages was found to require less area while still providing
enough timing margin to avoid the hold-time race between successive bits. The scan chain is gated in functional mode to save
power. Otherwise, whenever the pulsed latch captures new data,
the delay stages and the input capacitance of the scan multiplexer of the next pulsed latch would toggle unnecessarily.
The pulse generator supports clock gating. Compared to
a clock gating cell driving 32 conventional master-slave
flip-flops, the pulsed latches consume 2.5X less clock power.
In designs with aggressive clock gating, the power consumed
when the clock is gated and the data input D is switching is
particularly important. This power is 2.0X lower with pulsed
latches. This is because the entire master latch toggles in a
conventional flip-flop. The power when the data output Q
is switching is 13% higher with pulsed latches. Obviously, Q
cannot change unless D also changes. However, since the clock
can be gated, not all input transitions lead to output transitions.
Therefore, the switching activity of the output cannot exceed
the switching activity of the input. For this core, the switching
activity of the output is typically three to four times lower than
the input. Because of this, the lower clock and input power of
the pulsed latches overwhelmingly dominate the slightly higher
output power.
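The trade-off above can be checked with a rough numerical sketch. The ratios (2.5X, 2.0X, 13%) and the three-to-four-times lower output activity come from the text. Normalizing every flip-flop component to 1.0, the input activity of 0.2, and the simple weighted-sum model are assumptions chosen only for illustration, so the absolute totals carry no meaning.

```python
# Flip-flop components are normalized to 1.0 each (an assumption; the text
# gives only ratios). Pulsed-latch values apply the ratios quoted in the text.
FLIP_FLOP = {"clock": 1.0, "input": 1.0, "output": 1.0}
PULSED_LATCH = {"clock": 1.0 / 2.5, "input": 1.0 / 2.0, "output": 1.13}


def relative_power(cell, input_activity, output_to_input_ratio):
    """Crude weighted sum of clock, input-switching, and output-switching power."""
    output_activity = input_activity * output_to_input_ratio
    return (cell["clock"]
            + input_activity * cell["input"]
            + output_activity * cell["output"])


INPUT_ACTIVITY = 0.2     # hypothetical input switching activity
OUTPUT_RATIO = 1 / 3.5   # output toggles three to four times less often (from the text)

ff_total = relative_power(FLIP_FLOP, INPUT_ACTIVITY, OUTPUT_RATIO)
pl_total = relative_power(PULSED_LATCH, INPUT_ACTIVITY, OUTPUT_RATIO)

# Extra output power of the pulsed latch versus what it saves on the input.
extra_output = INPUT_ACTIVITY * OUTPUT_RATIO * (PULSED_LATCH["output"] - FLIP_FLOP["output"])
input_saving = INPUT_ACTIVITY * (FLIP_FLOP["input"] - PULSED_LATCH["input"])
print(f"flip-flop: {ff_total:.3f}, pulsed latch: {pl_total:.3f}")
print(f"extra output power: {extra_output:.4f}, input saving: {input_saving:.4f}")
```

Under these assumptions the 13% output penalty is weighted by the lower output activity, so it stays well below the input and clock savings.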
Fig. 6. Asynchronous FIFOs. Combined level shifter and isolation cells drive
the bitlines and the write wordlines.
V. ASYNCHRONOUS FIFOS
The core clock of the DSP is not synchronized to any other
clock. Additionally, the core voltage can be different than the
chip voltage. Asynchronous FIFOs with four or eight entries
bridge these clock and voltage boundaries and move data across
the bus interfaces of the DSP.
Fig. 6 shows the structure of the FIFOs. The memory array
is implemented using a custom bitcell with a tristate output.
Voltage level shifters are placed between the input flip-flops and
the array. A level shifter is also placed on each write wordline.
This makes the write timing relatively insensitive to the difference between the core and chip voltages. An entry can be read
by enabling the appropriate row of tristate drivers.
In an asynchronous FIFO, ensuring the proper ordering of
events between the read and write domains is a major challenge.
Data is written into the memory array using a write pointer
maintained in the input clock domain. Similarly, data is read
from the memory array using a read pointer managed in the
output clock domain. When a write occurs to a certain entry,
the write pointer is updated and the new value is sent to the receiver. When the read and write pointers become different, the
receiver assumes that the entry is valid and readable.
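The pointer protocol described above can be sketched as a purely behavioral model. The class name `PointerFifo` and the modulo `2 * depth` pointer range used to tell a full FIFO from an empty one are assumptions, since the text does not describe full detection. Clock domains, level shifters, and synchronizers are not modeled.

```python
class PointerFifo:
    """Behavioral model of a pointer-based FIFO (no clocks or voltages)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = [None] * depth
        # Pointers count modulo 2 * depth so that "full" and "empty"
        # differ (an assumption, not taken from the text).
        self.write_ptr = 0  # maintained in the input clock domain
        self.read_ptr = 0   # managed in the output clock domain

    def is_full(self):
        return (self.write_ptr - self.read_ptr) % (2 * self.depth) == self.depth

    def can_read(self):
        # The receiver treats an entry as valid once the pointers differ.
        return self.read_ptr != self.write_ptr

    def write(self, data):
        if self.is_full():
            raise OverflowError("FIFO is full")
        self.entries[self.write_ptr % self.depth] = data
        # The pointer is advanced only after the entry has been stored,
        # mirroring the ordering requirement between data and pointer.
        self.write_ptr = (self.write_ptr + 1) % (2 * self.depth)

    def read(self):
        if not self.can_read():
            raise LookupError("FIFO is empty")
        data = self.entries[self.read_ptr % self.depth]
        self.read_ptr = (self.read_ptr + 1) % (2 * self.depth)
        return data


fifo = PointerFifo(depth=4)
for value in (10, 20, 30, 40):
    fifo.write(value)
drained = [fifo.read() for _ in range(4)]
print(drained)
```

In this model the write method stores the data before advancing the pointer. In hardware, that ordering is what the timing margin between the pointer path and the array write path has to guarantee.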
Sending the updated write pointer too early to the receiver
could cause it to incorrectly read an entry that has not yet been
properly written. Therefore, a timing shift of the write-pointer
path relative to the actual write path of the array could lead
to read failures. The handshaking protocol used between the
domains must include enough margin to cover any variation
between the write pointer and array timings. To reduce this
overhead, the FIFO includes a write-pointer tracking circuit
that mimics the path from the input register to the latch array.
first performs the cell swaps that provide the most leakage
reduction for the least timing margin degradation. More specifically, the first pass considers the cells having the leakiest device
type (NVT30) and attempts to swap them to the next most leaky
and slightly slower device type (NVT35). Compared to a swap
to the least leaky device type (UHVT40), this recovers most
of the leakage because the leakage decreases quickly (approximately exponentially) as the delay of the device type increases.
However, it avoids most of the timing degradation. Leaving
more timing margin in the paths going through a particular cell
increases the likelihood of being able to later swap other cells
also using the leakiest device type in these paths. After the
completion of this first pass, the algorithm attempts to swap the
cells using the second leakiest device type (NVT35) to the third
leakiest device type (NVT40). This continues with swaps from
NVT40 to HVT40 and, finally, from HVT40 to UHVT40.
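The multi-pass swap order described above can be sketched as follows. Only the device-type ordering comes from the text. The function `recover_leakage`, the per-step delay penalties, and the path slack values are hypothetical, and the timing model (a fixed penalty subtracted from the slack of every path through a cell) is a deliberate simplification of a real static timing analysis.

```python
# Device types from leakiest/fastest to least leaky/slowest, as in the text.
VT_ORDER = ["NVT30", "NVT35", "NVT40", "HVT40", "UHVT40"]


def recover_leakage(cell_vt, paths, step_delay):
    """Swap cells one device type at a time, one pass per device type.

    cell_vt: dict cell name -> current device type.
    paths: list of (set of cell names on the path, timing slack).
    step_delay: dict device type -> delay added by moving one step slower
                (hypothetical numbers in the same unit as the slack).
    """
    cell_vt = dict(cell_vt)
    slack = [s for _, s in paths]
    for from_vt, to_vt in zip(VT_ORDER, VT_ORDER[1:]):
        penalty = step_delay[from_vt]
        for name in cell_vt:
            if cell_vt[name] != from_vt:
                continue
            affected = [i for i, (members, _) in enumerate(paths) if name in members]
            # Swap only if every path through the cell still meets timing.
            if all(slack[i] >= penalty for i in affected):
                cell_vt[name] = to_vt
                for i in affected:
                    slack[i] -= penalty
    return cell_vt, slack


cells = {"a": "NVT30", "b": "NVT30", "c": "NVT30"}
paths = [({"a", "b"}, 10.0), ({"c"}, 100.0)]
step_delay = {"NVT30": 5.0, "NVT35": 6.0, "NVT40": 8.0, "HVT40": 12.0}
final_vt, final_slack = recover_leakage(cells, paths, step_delay)
print(final_vt, final_slack)
```

In this example the shared path through cells a and b has just enough slack for both to leave the leakiest device type, which would not be possible if cell a alone were pushed all the way to UHVT40 first. Cell c sits on a relaxed path and ends up on the least leaky device type.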
This leakage recovery process provides two important side
benefits. By slowing down the gates that are too fast, the process
significantly increases the minimum-delay (hold) timing margin
of a large number of paths. This extra delay naturally fixes a
large number of hold-time violations at no area cost. This considerably reduces the number of delay buffers that must be inserted to fix the remaining violations. The second side benefit
is related to crosstalk. The slower cells drive their outputs with
higher transition times and are, therefore, weaker aggressors to
other signals. Occasionally, one of these victims is part of a
critical path. Then, reducing the strength of the aggressor improves the timing of the critical path. Because several critical
paths have a large number of aggressors, it is not uncommon
for the maximum frequency of the core to increase slightly after
leakage recovery.
Fig. 9 shows the final leakage and area distributions per
device type. The NVT30 device type dominates the overall
leakage despite its relatively low use. Fig. 10 illustrates the
power profile of the core while running a typical workload and
a maximum-activity workload. For these workloads, most of
the power consumed by the core is dynamic. However, the
leakage power can be a much larger fraction of the total for
other workloads containing long periods of inactivity. Fig. 10
also breaks down the dynamic power consumed by each circuit
type. For the maximum-activity workload, the power required
for distributing the global, regional, and local clocks is higher
During power down, the BHS can cut the DSP leakage to
practically zero.
Fig. 13 shows a block diagram of the LDO, where an analog
and a digital loop operate in parallel to supply the load current.
The LDO does not require an external capacitor. However, when
the current demand of the core increases, not having a large
capacitor to quickly supply charge forces the LDO itself to be
faster.
To improve the transient response of the LDO, a mechanism
is introduced to limit the maximum current that the analog loop
has to deliver. The analog loop has a current mirror driving an
analog-to-digital converter (ADC). The ADC senses the current
provided by the analog loop. The digital loop can then offload
the excess leakage from the analog loop. This mechanism takes
advantage of the analog loop to respond to the fast current transients and of the low-bandwidth digital loop to supply the static
or close-to-static load current. Sizing the analog pass transistor
to deliver the worst-case leakage current would make it much
larger. This would increase its gate capacitance and the time required to change its gate voltage, degrading the ability of the
LDO to respond quickly to transient events.
Another challenge is that the load current of the LDO can
vary by over three orders of magnitude. When the DSP is idle,
the load of the LDO is only the leakage of the core. For slow
silicon at low temperatures and low voltages, this leakage can
be roughly 30 times smaller than for typical silicon at 25 °C
(4.08 mW at 0.90 V). When the DSP is active, the load current
also includes a dynamic component and can increase well above
100 mA, especially at high temperatures.
With this load current variation, ensuring the stability of the
LDO is more challenging. Here, the dominant pole is at the
output of the error amplifier and a second pole is located at
the load [7]. The frequency of the dominant pole is a function of the effective capacitance driven by the error amplifier.
This capacitance includes the gate-to-source and gate-to-drain
capacitances $C_{gs}$ and $C_{gd}$ of the pass transistor. However, due
to the Miller effect, $C_{gd}$ increases the effective capacitance by
$(1 + A)\,C_{gd}$, where $A$
is the voltage gain of the pass transistor. Anything that changes
$A$ will perturb the frequency of
the first pole. Similarly, anything that changes the output resistance
$R_{out}$ of the LDO will perturb the frequency of the second
pole. The two are related because $A = g_m R_{out}$, where $g_m$ is the
transconductance of the pass transistor. Load current variations
affect both $g_m$ and $R_{out}$, although
tends to increase when
Fig. 15. Dynamic level shifter. The select is the only VDDQ6 signal.
Fig. 17. Measured frequency.
a cost in area, these arrays use 8T bitcells. These bitcells are connected to VDDQ6. In the arrays using the 8T bitcells, the write
wordlines are boosted to VDDMX to ensure functionality at low
voltages. This allows much of the dynamic power consumption
and leakage of the instruction cache to scale down with the core
voltage.
IX. MEASUREMENTS
A micrograph of the DSP is shown in Fig. 16 with the caches
and other major structures identified. A large fraction of the
logic is synthesized.
When the LDO is powering the DSP, VDDQ6 can be scaled
down from VDDCX. The measurements in Fig. 17 show that
the DSP is operational from 255 MHz at 0.60 V up to 1.20 GHz
(5640 DMIPS) at 1.05 V. There, VDDCX is fixed at 1.15 V.
With the block head switch (BHS), VDDQ6 is electrically
tied to VDDCX. Due to a limitation of the rest of the system,
VDDCX cannot go below 0.65 V, which constrains the lowest
DSP voltage. At higher voltages, the BHS allows the DSP
to run slightly faster than with the LDO and reach 1.24 GHz
at 1.05 V. This indicates that the voltage regulation loop of
the LDO can increase the current through the pass transistor
quickly enough to respond to the load transients of the core.
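A short worked calculation puts the measured numbers above in relative terms. All four input values are taken from the text. The derived quantities (DMIPS per MHz and the percentage frequency gain with the BHS) are simple ratios and are not reported in the source.

```python
# Measured operating points quoted in the text, both at 1.05 V.
LDO_FREQ_MHZ = 1200.0   # DSP powered through the LDO
LDO_DMIPS = 5640.0      # Dhrystone performance at that frequency
BHS_FREQ_MHZ = 1240.0   # DSP powered through the block head switch

# Throughput per unit of clock frequency at the LDO operating point.
dmips_per_mhz = LDO_DMIPS / LDO_FREQ_MHZ

# Relative frequency advantage of the head switch over the LDO.
bhs_gain_percent = (BHS_FREQ_MHZ / LDO_FREQ_MHZ - 1.0) * 100.0

print(f"{dmips_per_mhz:.2f} DMIPS/MHz")
print(f"BHS is {bhs_gain_percent:.1f}% faster than the LDO")
```

The ratio works out to about 4.7 DMIPS/MHz, and the head switch buys a little over 3% in maximum frequency, which matches the text's description of the DSP running only slightly faster with the BHS.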
TABLE II
COMPARABLE PROCESSORS
described in Section VI. It also suggests that the energy efficiency of these cores would degrade significantly at high temperatures, where the leakage tends to become a much larger fraction of the total power consumption.
Around the point of optimal energy efficiency, the energy-versus-voltage curve is fairly flat. This implies that the
voltage can be increased without significantly degrading the
energy efficiency. In [9], increasing the optimal voltage by
100 mV to 0.55 V degrades the energy efficiency by approximately 12%. This suggests that the slightly higher operating
voltage (0.60 V) targeted by the Hexagon DSP is close enough
to the near-threshold region to reap most of the benefits of
NTC, but high enough to keep the benefits of the aggressive
power optimization techniques discussed in this paper.
Ken Lin received the B.S.E.E. degree from National Taiwan University, Taiwan, in 1991, and the
M.S.E.E. degree from Syracuse University, NY,
USA, in 1995.
He is a Principal Engineer at Qualcomm Inc.,
San Diego, CA, USA. He joined Qualcomm in
2004 where he has worked on several generations
of Hexagon processors. Previously, he worked for
Digital Equipment Corporation, Compaq, and Sun
Microsystems, where he contributed to Alpha and
SPARC microprocessor circuit designs. He holds
REFERENCES
[1] S. Y. Wu et al., "A highly manufacturable 28 nm CMOS low power platform technology with fully functional 64 Mb SRAM using dual/triple gate oxide process," in IEEE Symp. VLSI Circuits, 2009, pp. 210-211.
[2] L. Codrescu et al., "Hexagon DSP: An architecture optimized for mobile multimedia and communications," IEEE Micro, vol. 34, pp. 34-43, Mar. 2014.
[3] P. Bassett and M. Saint-Laurent, "Energy efficient design techniques for a digital signal processor," in IEEE Int. Conf. IC Design and Tech., 2012, pp. 41-44.
[4] M. Saint-Laurent and A. Datta, "A low-power clock gating cell optimized for low-voltage operation in a 45-nm technology," in ACM/IEEE Int. Symp. Low-Power Electronics and Design, 2010, pp. 159-163.
[5] J. V. Faricelli, "Layout-dependent proximity effects in deep nanoscale CMOS," in Proc. IEEE Custom Integrated Circuits Conf., 2010, pp. 1-8.
[6] M. Saint-Laurent and M. Swaminathan, "Impact of power-supply noise on timing in high-frequency microprocessors," IEEE Trans. Adv. Packag., vol. 27, pp. 135-144, Feb. 2004.
[7] J. Torres et al., "Low drop-out voltage regulators: Capacitor-less architecture comparison," IEEE Circuits Syst. Mag., vol. 14, no. 2, pp. 6-26, 2014.
[8] B. Mohammad, P. Bassett, A. Aziz, and J. Abraham, "Cache organizations for embedded processors: CAM-vs-SRAM," in IEEE Int. SOC Conf., 2006, pp. 299-302.
[9] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2012, pp. 66-67.
[10] N. Ickes et al., "A 28 nm 0.6 V low power DSP for mobile applications," IEEE J. Solid-State Circuits, vol. 47, pp. 35-46, Jan. 2012.
[11] L. Chang and W. Haensch, "Near-threshold operation for power-efficient computing? It depends," in Proc. ACM/EDAC/IEEE Design Automation Conf. (DAC), 2012, pp. 1155-1159.
[12] M. Severson, K. Yuen, and Y. Du, "Not so fast my friend: Is near-threshold computing the answer for power reduction of wireless devices?," in ACM/EDAC/IEEE Design Automation Conf. (DAC), 2012, pp. 1160-1162.
[13] G. Ruhl, S. Dighe, S. Jain, S. Khare, and S. R. Vangal, "IA-32 processor with a wide-voltage-operating range in 32-nm CMOS," IEEE Micro, vol. 33, pp. 28-36, Mar. 2013.
Martin Saint-Laurent (SM'09) received the B.Eng.
degree (with honors) from McGill University, Montreal, Canada, and the M.S. and Ph.D. degrees from
the Georgia Institute of Technology (Georgia Tech),
Atlanta, GA, USA, all in electrical engineering.
From 1998 to 2005, he was with Intel Corp., where
he worked on two generations of high-frequency
IA-32 processors as a custom circuit designer. His
responsibilities included clock distribution and
sequential element design. In 2005, he joined Qualcomm, Inc., Austin, TX, USA, where he is currently
a Principal Engineer. He manages the team responsible for low-power design,
power delivery, and power-aware verification. He has 24 patents granted or
pending related to clocking, energy-efficient circuits, and voltage regulation.
Maen Alradaideh received the B.S. degree in electrical and computer engineering at Jordan University
of Science and Technology, Irbid, Jordan, and the
M.S. degree in electrical and computer engineering
at the University of Alabama, Huntsville, AL, USA.
He spent over 20 years in the field of processor verification and validation. He is currently leading silicon validation and customer enablement.
Tom Wernimont received the B.S. degree in electrical and computer engineering at the University of
Notre Dame, Notre Dame, IN, USA, and the M.S.
degree in electrical and computer engineering at
the University of Illinois at Urbana-Champaign, IL,
USA.
He is currently a Senior Staff Engineer at Qualcomm Inc., Austin, TX, USA. He has five patents
issued.
Dan Bui received the B.S. degree in computer engineering at the University of Texas at Austin, TX,
USA.
He has over 20 years working experience and
is currently a Principal Engineer/Manager at Qualcomm in Austin, TX.