
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 50, NO. 1, JANUARY 2015

A 28 nm DSP Powered by an On-Chip LDO for High-Performance and Energy-Efficient Mobile Applications
Martin Saint-Laurent, Senior Member, IEEE, Paul Bassett, Ken Lin, Baker Mohammad, Senior Member, IEEE,
Yuhe Wang, Xufeng Chen, Maen Alradaideh, Tom Wernimont, Kartik Ayyar, Dan Bui, Dwight Galbi,
Allan Lester, Marzio Pedrali-Noy, and Willie Anderson

Abstract—This paper describes the implementation of a Qualcomm Hexagon digital signal processor (DSP) in a 28 nm high-κ metal-gate technology. The DSP is a multi-threaded very-long-instruction-word (VLIW) machine optimized for low leakage and energy efficiency. It uses a clock distribution network, clock gating cells, and pulsed latches that are optimized for low switching energy. The processor can be powered using a low-dropout (LDO) voltage regulator or a head switch. It operates from 255 MHz at 0.60 V to 1.24 GHz at 1.05 V. When operating from the LDO, the power consumption of the core can be as low as 58 μW/MHz, which is two to three times lower than comparable cores optimized for ultra-low voltage operation.

TABLE I
INTERCONNECT STACK

Index Terms—Capacitor-less LDO, clock power reduction, DSP, leakage optimization, low power design, near-threshold computing, power gating, pulsed latches.

I. INTRODUCTION

THIS paper describes a Qualcomm Hexagon digital signal processor (DSP) fabricated using a 28 nm high-κ metal-gate process technology optimized for mobile applications [1].
The processor is designed to deliver superior energy efficiency
compared to mobile CPU alternatives and thereby help achieve
long battery life for important mobile applications.
The core is optimized for a heterogeneous computing environment. Specializing a subsystem to a task can improve performance and power beyond what is possible with a homogeneous CPU-based computing platform. The DSP targets high performance and low power across a wide variety of multimedia and modem applications, under aggressive area targets.
Its architecture pursues executing a high number of instructions
per cycle (IPC) as opposed to high frequency [2]. It includes a
32 kB L1 data cache (D$), a 16 kB L1 instruction cache (I$),
and a 256 kB L2 cache (L2$).
The process technology supports three threshold voltages
(NVT, HVT, UHVT) and three channel lengths. The drawn
channel lengths (30 nm, 35 nm, 40 nm) are shrunk by 10% and
Manuscript received May 14, 2014; revised August 26, 2014; accepted
October 27, 2014. This paper was approved by Guest Editor Stephen
Kosonocky.
The authors are with Qualcomm Technologies, Inc., Austin, TX 78759 USA
(e-mail: martins@qti.qualcomm.com).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2014.2371454

then biased to get the final silicon dimensions. The interconnect


stack has eight copper layers. As shown in Table I, six of these
layers have a narrow pitch and two are wider. The metal stack
also includes a thick aluminum layer that is used primarily for
power routing. The six-transistor (6T) SRAM cell offered in this technology has an area of 0.127 μm².
This paper discusses the low-power techniques used to
implement the Hexagon DSP. Section II gives an overview
of the architecture of the processor. Sections III and IV focus
on the clock distribution network and the sequential elements.
Section V describes the FIFOs used to bridge the asynchronous
clock and voltage boundaries of the DSP and move data in
and out of the core. Section VI discusses leakage optimization
and the importance of minimizing glitch power. Section VII
gives details about the head switch that power gates the
processor and the on-chip LDO used to independently scale
its voltage. Section VIII discusses the L1 and L2 caches.
Section IX presents the frequency and power measured for
the DSP. Section X concludes the paper with a discussion on
near-threshold computing.
II. ARCHITECTURE
The Qualcomm Hexagon DSP is a very-long-instruction-word (VLIW) machine. It has the ability to issue multiple
instructions per cycle to increase performance by exploiting
instruction-level parallelism. In superscalar architectures, complicated hardware is required to check for resource and data
dependencies before dispatching instructions. Architectures
supporting out-of-order execution also need hardware for
register renaming and reordering the results. This ultimately

0018-9200 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 2. Clock distribution network.

Fig. 1. Processor architecture.

increases complexity even more. In contrast, this VLIW architecture relies on the compiler to form packets of instructions
that can execute in parallel. The packets themselves are executed in order. This significantly reduces the complexity of the
issue logic, which helps to minimize area and power. To keep
the code density high, each packet has a variable length and
can contain between one and four instructions.
The pipeline is built so that all instructions in a packet
complete before the next packet is dispatched. This allows the
compiler to schedule a dependent packet immediately after
the source packet. Consequently, the compiler does not need
to find unrelated instructions to form independent packets to
be scheduled between the source and the dependent packets.
Finding enough independent instructions tends to be difficult.
Not having to schedule these independent packets helps the
compiler increase the number of instructions per packet, which
increases the utilization of the functional units.
As shown in Fig. 1, the core has two 64 bit load/store units.
These units can also perform 32 bit ALU operations. In addition, the core has two 64 bit vector execution units that, together, are capable of performing eight 16 bit multiply-accumulate operations per cycle. This core also supports dynamic
multithreading. The threads are typically scheduled in a simple
round-robin fashion, where a packet from a different thread is
dispatched each clock cycle. This temporal multithreading is
particularly useful to help ensure deterministic real-time thread
performance. Certain events, like an L2 cache miss, can cause
a stalled thread to be skipped, thus improving performance for
the other threads that are not stalled. Each thread has a unified
register file. The 16 kB L1 instruction cache, the 32 kB L1 data
cache, and the 256 kB L2 cache are shared between all threads.
The L2 cache can be reconfigured as a tightly coupled memory.
This is useful for critical interrupt handlers or real-time tasks as
the unpredictable latency of a cache would be highly undesirable. Finally, the core communicates with the rest of the system
and the external memory through asynchronous bus interfaces.
III. CLOCKING
The DSP core can be clocked from multiple sources. Because
the bus interfaces of the DSP are asynchronous, the phase and frequency of the bus clocks can be independent of those of the core clock. Since clock phase alignment is not required, frequency-locked loops (FLLs) can be used as clock
sources instead of phase-locked loops (PLLs). The same clock
source can therefore simultaneously feed the DSP and other
cores in the system without having to balance the insertion delay
of the DSP clock path. Frequency dividers can also be used as
clock sources.
A glitch-free clock multiplexer is used to dynamically switch
between these various clock sources. During a transition from
one source to another, the control logic first waits for the low
phase of the current clock. At this point, it can force the output of
the multiplexer to zero without producing a clock glitch. Next,
the circuit waits for the low phase of the new clock. Then, it can
finally let the output of the multiplexer follow the new clock.
The insertion delay through the multiplexer is comparable to
the delay of a clock gating cell. Therefore, the glitch-free multiplexer introduces very little clock jitter.
The clock distribution network is designed for low dynamic
power consumption. As shown in Fig. 2, the clock enters the
core through a voltage level shifter. The level shifter is optimized for low clock insertion delay and duty cycle distortion
across the range of voltages at which the DSP can operate. The
clock is then routed to the middle of the core through a duty
cycle correction (DCC) circuit. The circuit can apply different
delays to the rising and falling edges of the clock to intentionally
adjust the duty cycle to potentially improve the timing of phase
paths. From there, the clock is distributed using four horizontal
clock bays. Physically, each clock bay consists of a long metal-8
wire driven from the middle by a large inverter. The wire is designed to have a low resistance to reduce the clock propagation
delay from the inverter to its end points.
When the DSP is idle, the entire clock distribution network
can be gated off at its root. However, when the core is active,
the global clock bays are free running. The first level of clock
gating after the global clock is implemented by regional clock buffers (RCBs). The RCBs are aligned under the clock bays to minimize the interconnect capacitance associated with the gated regional clock. A few RCBs are placed far from their clock bay driver. These remote RCBs tend to experience more skew, especially when high temperatures make the clock bay wires more resistive. However, most RCBs are fairly close to their driver and do not suffer from this problem. The RCBs drive 526 regional clocks. These clocks are not shielded, but are routed with
extra spacing to reduce power. The second level of clock gating
is performed by local clock buffers (LCBs) or pulse generators.


Fig. 3. Conventional clock gating cell topology.

Fig. 5. Pulsed latches. The scan out path uses high-threshold-voltage logic gates.

Depending on the enable inputs, the pull-down stack could turn on. However, it would then harmlessly help the keeper hold the state node low.
The simulated global clock distribution power is 3.96 μW/MHz at 0.90 V. When running a typical workload, 79.8% of the regional clocks and 94.3% of the local clocks are gated off.
Fig. 4. Low-power clock gating cell topology.

The LCBs drive 1565 local clocks and the pulse generators drive 1907 pulsed clocks. There are always four inversions between the global clock and the clock received by the sequential elements. The physical implementation of this clock structure is discussed in more detail in [3].
Instead of the conventional clock gating cell (CGC) topology
shown in Fig. 3, the RCBs use the low-power topology shown in
Fig. 4. Both are logically equivalent. The low-power topology
adds a second enable to the circuit presented in [4] to enhance
testability. With the conventional topology, three nets and 10
transistors switch whenever the input clock switches. Instead
of using the input clock to control the latch holding the enable,
the low-power topology takes advantage of the low switching
activity of the gated clock. With this topology, only one net
and four transistors switch when the input clock does. The low-power topology has a negligible impact on the clock insertion
delay, and uses 19 transistors instead of 24. The CGCs are designed to minimize variations in electrical performance due to
layout context effects.
The low-power topology is well suited to support a wide
range of operating voltages because it is fully static and free
of contention. When the input clock is low, the active-low functional enable and the test enable control the state node of the latch. The pull-up and pull-down stacks are on and effectively form a NAND gate driving the state node. Because it is
controlled by a gated clock, the keeper is completely interrupted
and does not produce contention. Then, the circuit electrically
behaves like a combinational logic gate: the sizing of the transistors matters for timing, but not functionality. When the input
clock rises with the state node high, the pull-up and pull-down stacks controlled by the enables are disabled. The upper half of
the keeper statically holds the state node. When the clock rises
with the state node low, the other half of the keeper is used to
maintain its value. Then, the pull-up stack controlled by the enables is again disabled, but not the pull-down stack.

IV. PULSED LATCHES


The pulsed latches in Fig. 5 implement most sequential elements. These pulsed latches are scannable and have an asynchronous reset. They are optimized for energy efficiency as opposed to high speed. Compared to conventional master-slave
flip-flops, these pulsed latches consume significantly less clock
and data power [3]. They also offer more data transparency,
which improves performance.
The pulse generator is combined with up to 32 latches in a
single cell. This bundling provides several advantages. First, it ensures that the clock wires going from the pulse generator to the latches are short to reduce power. Second,
it eliminates several sources of design variability, which improves robustness. For example, the number of latches driven
by the pulse generator and the metal layer, width, and spacing
of the clock interconnects are completely known. Third, this
bundling reduces the device performance variability caused by
layout-dependent proximity effects [5]. One such effect is the
oxide definition (OD) region spacing effect. The proximity of
other OD regions (also known as active device regions) affects
the mechanical stress that shallow trench isolation (STI) induces
on a device, which in turn affects its performance. When the
relative placement of the latches with respect to the pulse generator is fixed, the STI stress can be systematically predicted
and ceases to be a source of random variability. Finally, this
bundling makes it possible to route non-critical signals using the metal gate layer and internally stitch the scan chain, which reduces routing congestion. The number
of bits per cell is limited to 32 to ensure that the delay of the
clock wires is small compared to the delay of the latches.
The scan chain is internally connected from one bit to the
next using high-threshold-voltage gates. This ensures
that the delay between successive bits exceeds the clock pulse
width, in particular at low voltages. Otherwise, the scan output
of one pulsed latch could race through the next pulsed latch
and be incorrectly captured further in the chain. To avoid this
hold-time race, an auxiliary latch could be added to the scan


path. The auxiliary latch would then act as a master latch and
transform each pulsed latch into a flip-flop when a test pattern is
shifted in. However, in this technology, simply adding high-threshold-voltage delay stages was found to require less area while still providing
enough timing margin to avoid the hold-time race between successive bits. The scan chain is gated in functional mode to save
power. Otherwise, whenever the pulsed latch captures new data,
the delay stages and the input capacitance of the scan multiplexer of the next pulsed latch would toggle unnecessarily.
The pulse generator supports clock gating. Compared to
a clock gating cell driving 32 conventional master-slave
flip-flops, the pulsed latches consume 2.5X less clock power.
In designs with aggressive clock gating, the power consumed
when the clock is gated and the data input is switching is particularly important. This power is 2.0X lower with pulsed latches. This is because the entire master latch toggles in a conventional flip-flop. The power when the data output is switching is 13% higher with pulsed latches. Obviously, the output cannot change unless the input also changes. However, since the clock can be gated, not all input transitions lead to output transitions.
Therefore, the switching activity of the output cannot exceed
the switching activity of the input. For this core, the switching
activity of the output is typically three to four times lower than
the input. Because of this, the lower clock and input power of
the pulsed latches overwhelmingly dominate the slightly higher
output power.
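As a rough illustration, the relative numbers quoted in this section can be combined to estimate the net saving. In the sketch below, the baseline power components are arbitrary units; only the ratios (2.5X clock, 2.0X input, 13% output, and a 4X lower output activity) come from the text.

```python
# Back-of-the-envelope combination of the relative power numbers for
# 32 bundled pulsed latches versus a clock gating cell driving 32
# conventional master-slave flip-flops.

def relative_latch_power(clock_p, data_in_p, data_out_p, out_to_in_activity):
    """Return (flip-flop total, pulsed-latch total) per-cycle power in
    the same arbitrary units, weighting output power by its activity."""
    ff_total = clock_p + data_in_p + data_out_p * out_to_in_activity
    pl_total = (clock_p / 2.5        # 2.5X less clock power
                + data_in_p / 2.0    # 2.0X less power on input toggles
                + data_out_p * 1.13 * out_to_in_activity)  # 13% more output
    return ff_total, pl_total

ff, pl = relative_latch_power(clock_p=1.0, data_in_p=1.0, data_out_p=1.0,
                              out_to_in_activity=0.25)  # output 4X less active
```

With these assumed unit weights, the pulsed latches come out roughly 45% lower overall, consistent with the claim that the clock and input savings dominate the slightly higher output power.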

Fig. 6. Asynchronous FIFOs. Combined level shifter and isolation cells drive
the bitlines and the write wordlines.

Tracking is achieved by widening the FIFO. The extra bits


are used to write the write pointer itself into the latch array
before sending it to the receiver. The write pointer therefore
goes through the same flip-flops, level shifters, and bitlines as
the data. This helps guarantee proper timing at any frequency
and voltage combination and allows a low-overhead handshake
protocol.

V. ASYNCHRONOUS FIFOS
The core clock of the DSP is not synchronized to any other
clock. Additionally, the core voltage can be different from the
chip voltage. Asynchronous FIFOs with four or eight entries
bridge these clock and voltage boundaries and move data across
the bus interfaces of the DSP.
Fig. 6 shows the structure of the FIFOs. The memory array
is implemented using a custom bitcell with a tristate output.
Voltage level shifters are placed between the input flip-flops and
the array. A level shifter is also placed on each write wordline.
This makes the write timing relatively insensitive to the difference between the core and chip voltages. An entry can be read
by enabling the appropriate row of tristate drivers.
In an asynchronous FIFO, ensuring the proper ordering of
events between the read and write domains is a major challenge.
Data is written into the memory array using a write pointer
maintained in the input clock domain. Similarly, data is read
from the memory array using a read pointer managed in the
output clock domain. When a write occurs to a certain entry,
the write pointer is updated and the new value is sent to the receiver. When the read and write pointers become different, the
receiver assumes that the entry is valid and readable.
Sending the updated write pointer too early to the receiver
could cause it to incorrectly read an entry that has not yet been
properly written. Therefore, a timing shift of the write-pointer
path relative to the actual write path of the array could lead
to read failures. The handshaking protocol used between the
domains must include enough margin to cover any variation
between the write pointer and array timings. To reduce this
overhead, the FIFO includes a write-pointer tracking circuit
that mimics the path from the input register to the latch array.
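The widened-FIFO handshake can be captured in a small behavioral model. The sketch below uses unwrapped pointer counters for clarity; a real implementation transfers pointers across the clock boundary with metastability-safe (typically Gray-coded) synchronization and level shifters, which this model omits. Class and method names are illustrative assumptions.

```python
# Behavioral model of the asynchronous FIFO: each entry is widened so
# the write pointer travels through the same latch array as the data.
# The receiver therefore cannot observe an updated pointer before the
# matching data has been written.

class AsyncFifoModel:
    def __init__(self, depth=4):
        self.depth = depth
        self.array = [None] * depth  # latch array: (data, write pointer)
        self.writes = 0              # write pointer (input clock domain)
        self.reads = 0               # read pointer (output clock domain)

    def write(self, data):
        assert self.writes - self.reads < self.depth, "FIFO full"
        # Widened entry: the updated write pointer rides with the data.
        self.array[self.writes % self.depth] = (data, self.writes + 1)
        self.writes += 1

    def read(self):
        entry = self.array[self.reads % self.depth]
        if entry is None:
            return None              # entry never written
        data, wptr = entry
        if wptr <= self.reads:
            return None              # pointers match: entry not yet valid
        self.reads += 1
        return data
```

Because the pointer is read out of the same entry as the data, an entry can only appear valid after the data bits of that entry have been written, which is the timing guarantee the tracking circuit provides in hardware.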

VI. MINIMIZATION OF POWER CONSUMPTION


In addition to the use of a low-power clock distribution network and pulsed latches, several techniques are employed to reduce the dynamic power consumption of the core. The core is
micro-architected to avoid wasteful speculation. The switching
activity of the logic is simulated across a wide range of program
traces to verify that only the blocks performing a useful computation consume power. For certain datapath blocks, additional
registers are added to keep the inputs stable when the output is
not needed. Care is taken to avoid broadcasting the results of a
block to multiple other blocks if the results are only needed by a
subset of the receivers. The data switching activity of every sequential element is compared to its clock switching activity. A
relatively low data-to-clock activity ratio indicates that the sequential element tends to capture the same value multiple times.
It suggests an opportunity for clock gating it more aggressively.
A relatively high activity ratio indicates that the sequential element often receives data that it is not capturing. It suggests an
opportunity for gating of the upstream logic generating this data
more aggressively. At times, the output of a sequential element
may not be observable by the sequential elements located downstream. This can occur if its output goes to a multiplexer branch
that is not selected. Then, the clock of the sequential element
generating the unobservable output can be gated off.
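The activity-ratio screen described above amounts to a simple filter over simulation statistics. The thresholds, record format, and suggestion strings below are illustrative assumptions.

```python
# Screen sequential elements by their data-to-clock activity ratio, as
# described in the text: a low ratio suggests more aggressive clock
# gating; a high ratio suggests gating the upstream logic.

def screen_activity(elements, low=0.1, high=1.0):
    """elements: iterable of (name, data_toggles, clock_toggles)."""
    suggestions = {}
    for name, data_t, clk_t in elements:
        if clk_t == 0:
            continue  # clock already fully gated for this workload
        ratio = data_t / clk_t
        if ratio < low:
            # Captures the same value repeatedly.
            suggestions[name] = "clock-gate more aggressively"
        elif ratio > high:
            # Receives data it is not capturing.
            suggestions[name] = "gate upstream logic more aggressively"
    return suggestions
```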
Fig. 7 shows that the power of logic glitches is a significant
fraction of the total dynamic power. Each point represents a different program. The horizontal axis gives the normalized functional power of the core. This functional power is the dynamic
power estimated when only functional transitions are present,
i.e., when the logic gates cannot produce glitches because they


Fig. 8. Normalized leakage per device type.


Fig. 7. Glitch versus functional power.

are simulated with a delay of zero. These functional transitions


are the ones that cause the values of signals to change between
the beginning and the end of the clock cycle. They are essential
to the correct behavior of the core. The maximum of the horizontal axis corresponds to the program producing the highest
power consumption for the core. The vertical axis shows the dynamic power increase caused by the glitches that appear when
non-zero delays are used for the logic gates. These delays are
extracted from the database used for timing signoff and are relatively accurate. The glitch power tends to become a larger fraction of the functional power when the activity of the core increases. It can reach 34% of the functional dynamic power.
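The glitch-power estimate can be reproduced in miniature by comparing switching activity from the zero-delay and annotated-delay simulations. The net names and toggle counts below are invented for illustration; they are chosen so the example lands at the 34% figure quoted above.

```python
# Glitch power fraction: transitions present with signoff-extracted
# delays but absent in a zero-delay simulation are glitches.

def glitch_power_fraction(zero_delay_toggles, real_delay_toggles):
    """Both arguments map net name -> toggle count for the same
    workload; equal per-toggle energy is assumed for simplicity."""
    functional = sum(zero_delay_toggles.values())
    total = sum(real_delay_toggles.values())
    return (total - functional) / functional

frac = glitch_power_fraction(
    {"alu_out": 100, "mux_sel": 40, "bus": 60},    # zero-delay simulation
    {"alu_out": 130, "mux_sel": 44, "bus": 94},    # annotated delays
)
```

In practice the per-net counts would be weighted by switched capacitance, but the bookkeeping is the same.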
On wide buses, special attention is paid to the select lines of
multiplexers to minimize this glitch power. Whenever possible,
the selects are driven directly from sequential elements, as opposed to combinational logic gates. This reduces the likelihood
of generating glitches for the entire bus when multiple selects
change and the multiplexer temporarily points to an undesired
source.
The logic leakage is minimized by optimizing the threshold
voltages and channel lengths of the standard cells used in the
core. The maximum-delay (setup) timing margin through each
cell is evaluated. If the margin through a cell is sufficiently positive, it is replaced by another one that is slower but leaks less.
To avoid perturbing the placement of the cells, the replacement
cannot be larger than the original cell. Since the DSP targets operation across a wide range of voltage, process, and temperature conditions, the timing margins vary widely. Before swapping one cell for another, the timing margin at every operating corner must therefore be considered. In particular, the performance of high- and ultra-high-threshold-voltage devices degrades greatly at low voltages, especially at cold temperatures. That makes the
low voltage and cold temperature corner one of the dominant
ones for leakage recovery. Certain cells, like the ones used on
the clock distribution network, are excluded from the leakage
optimization process.
Various device types, i.e., combinations of threshold voltage and channel length, are available in this technology for leakage
recovery. Fig. 8 shows the relative leakage for each type. To
achieve the lowest leakage possible, the cell swap process
cannot be done in an arbitrary order. The heuristic used here

first performs the cell swaps that provide the most leakage
reduction for the least timing margin degradation. More specifically, the first pass considers the cells having the leakiest device
type (NVT30) and attempts to swap them to the next most leaky
and slightly slower device type (NVT35). Compared to a swap
to the least leaky device type (UHVT40), this recovers most
of the leakage because the leakage decreases quickly (approximately exponentially) as the delay of the device type increases.
However, it avoids most of the timing degradation. Leaving
more timing margin in the paths going through a particular cell
increases the likelihood of being able to later swap other cells
also using the leakiest device type in these paths. After the
completion of this first pass, the algorithm attempts to swap the
cells using the second leakiest device type (NVT35) to the third
leakiest device type (NVT40). This continues with swaps from
NVT40 to HVT40 and, finally, from HVT40 to UHVT40.
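The ordered swap heuristic can be sketched as a multi-pass greedy loop. The leakage values, delay penalties, and margin bookkeeping below are illustrative assumptions, not foundry numbers; only the NVT30 → NVT35 → NVT40 → HVT40 → UHVT40 pass order comes from the text.

```python
# Ordered leakage recovery: swap cells one device-type step at a time,
# from the leakiest type toward the least leaky, only when the worst
# setup margin across all corners can absorb the delay penalty.

ORDER = ["NVT30", "NVT35", "NVT40", "HVT40", "UHVT40"]
LEAKAGE = {"NVT30": 100.0, "NVT35": 35.0, "NVT40": 12.0,
           "HVT40": 4.0, "UHVT40": 1.0}            # assumed relative units
DELAY_PENALTY = {"NVT30": 0.0, "NVT35": 2.0, "NVT40": 5.0,
                 "HVT40": 10.0, "UHVT40": 20.0}    # assumed delay units

def recover_leakage(cells):
    """cells: list of dicts with 'type' and 'margin' (worst setup
    margin across corners). Returns the total leakage after swaps."""
    for src, dst in zip(ORDER, ORDER[1:]):
        for cell in cells:
            penalty = DELAY_PENALTY[dst] - DELAY_PENALTY[src]
            if cell["type"] == src and cell["margin"] >= penalty:
                cell["type"] = dst
                cell["margin"] -= penalty          # path becomes slower
    return sum(LEAKAGE[c["type"]] for c in cells)

cells = [{"type": "NVT30", "margin": 25.0},
         {"type": "NVT30", "margin": 1.0},
         {"type": "NVT35", "margin": 50.0}]
```

Because leakage falls roughly exponentially with each step, a cell that cannot afford even the first small step (like the second cell above) stays on the leakiest type, while cells with ample margin walk all the way to UHVT40.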
This leakage recovery process provides two important side
benefits. By slowing down the gates that are too fast, the process
significantly increases the minimum-delay (hold) timing margin
of a large number of paths. This extra delay naturally fixes a
large number of hold-time violations at no area cost. This considerably reduces the number of delay buffers that must be inserted to fix the remaining violations. The second side benefit
is related to crosstalk. The slower cells drive their outputs with
higher transition times and are, therefore, weaker aggressors to
other signals. Occasionally, one of these victims is part of a
critical path. Then, reducing the strength of the aggressor improves the timing of the critical path. Because several critical
paths have a large number of aggressors, it is not uncommon
for the maximum frequency of the core to increase slightly after
leakage recovery.
Fig. 9 shows the final leakage and area distributions per
device type. The NVT30 device type dominates the overall
leakage despite its relatively low use. Fig. 10 illustrates the
power profile of the core while running a typical workload and
a maximum-activity workload. For these workloads, most of
the power consumed by the core is dynamic. However, the
leakage power can be a much larger fraction of the total for
other workloads containing long periods of inactivity. Fig. 10
also breaks down the dynamic power consumed by each circuit
type. For the maximum activity workload, the power required
for distributing the global, regional, and local clocks is higher


Fig. 12. Logical connectivity of the BHS.

Fig. 9. Final leakage and area distributions per device type.

Fig. 10. Dynamic and leakage power profile.

Fig. 11. Package connectivity for the BHS and LDO.

than for the typical workload. However, it is a smaller fraction of the total.
VII. CORE VOLTAGE SCALING AND POWER GATING
The DSP uses several power switches for leakage control
and an on-chip low-dropout (LDO) voltage regulator to support dynamic voltage scaling. As shown in Fig. 11, a block head
switch (BHS) is connected in parallel with the LDO to produce
the DSP voltage (VDDQ6) from a chip power rail (VDDCX).
The BHS and LDO occupy 0.015 mm² and 0.027 mm², respectively. VDDQ6 is routed over the DSP using a metal layer in the
package and then connected to an on-chip power grid through
several bumps. This scheme requires the package to be co-designed with the chip to ensure that the bumps are placed appropriately in relation to the BHS and LDO. It avoids sharing the

on-chip wiring resources between VDDQ6 and VDDCX, which


helps improve power-supply noise and timing [6].
Logically, the BHS behaves like the single transistor shown
in Fig. 12. Its n-well is always on. However, when VDDQ6 is
power-gated, the n-well of the core devices leaks toward zero.
This eliminates most of the junction leakage. The BHS is implemented using 24 tiles. Each tile has an array of identical HVT40
devices. One control signal enables a few (3%) of these devices and a second signal enables the rest (97%). When powering up the core, the first signal is asserted for the first tile. After some delay, it is asserted for the second tile, then for the third, and so on until it is asserted for all the tiles. At this point, the voltage is expected to be very close to its full value. Finally, the signals enabling the remaining devices are asserted one tile at a time in a similar fashion, but with less delay. When powering down, all the tiles are turned off
almost instantaneously. Theoretically, for a high-leakage core,
turning off the power switch too quickly could trigger a large
di/dt event. However, this scenario is not a primary concern for
this core because of its low leakage. The BHS is sized such that
its worst-case on resistance remains sufficiently low across all
process, voltage, and temperature conditions.
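The staged wake-up can be modeled as an ordered event sequence. The tile count (24) and the weak/strong enable split come from the text; the delays and signal names are illustrative assumptions.

```python
# Staged BHS wake-up: the weak enables (3% of the devices) ripple
# through all 24 tiles first to ramp the rail gently, then the strong
# enables (97%) follow with less delay between tiles.

def bhs_power_up(tiles=24, weak_delay=10, strong_delay=2):
    """Yield (time, signal, tile) events in assertion order."""
    t = 0
    for tile in range(tiles):          # weak devices: slow, gentle ramp
        yield (t, "weak_en", tile)
        t += weak_delay
    for tile in range(tiles):          # strong devices: faster sequence
        yield (t, "strong_en", tile)
        t += strong_delay

events = list(bhs_power_up())
```

Staggering the enables bounds the inrush current; by the time the strong devices turn on, the rail is already near its final value, so each strong tile contributes little additional di/dt.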
Following the proper sequence is important to safely power
down the core. If data retention is desired, the L1 data cache
must first be flushed to the L2 cache. Next, the signals controlling the sleep state of the L2 cache must be latched on the
memory supply and the outputs of the VDDQ6 power domain
must be isolated. This includes the core outputs as well as the
internal interface to the L2 cache. Isolation prevents the flow
of short-circuit currents and the propagation of unknown logic
values into the power domains that are still on. Finally, the BHS
can be turned off. The power up sequence resets the core before
removing the isolation.
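The power-down ordering above can be captured as an explicit checked sequence; the step names are descriptive assumptions.

```python
# Safe power-down ordering for the core, as described in the text.
# The check only validates ordering, not completeness: the cache
# flush, for example, is required only when retention is desired.

POWER_DOWN_ORDER = [
    "flush_l1_dcache_to_l2",      # only if data retention is desired
    "latch_l2_sleep_controls",    # on the always-on memory supply
    "isolate_vddq6_outputs",      # core outputs and internal L2 interface
    "turn_off_bhs",
]

def check_sequence(observed):
    """Return True iff the observed steps (unknown steps ignored)
    occur in the required relative order."""
    positions = [POWER_DOWN_ORDER.index(s) for s in observed
                 if s in POWER_DOWN_ORDER]
    return positions == sorted(positions)
```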
VDDCX is driven by a high-efficiency external switching
regulator. Since VDDCX also supports voltage scaling, using
the LDO to supply power to the DSP is not always optimal
to maximize energy efficiency. The BHS is used whenever
VDDCX is at (or slightly above) the minimum voltage required by the DSP to support a given clock frequency. Using
the BHS is advantageous because it avoids the energy loss
associated with the IR drop across the LDO's pass transistor. In
high-performance mode, the low resistance of the BHS and of
the package power grid allows the DSP to run with minimal IR
drop. In low-power mode, the LDO can reduce the voltage of
the DSP while the rest of the system stays at a higher voltage.


Fig. 13. Block diagram of the LDO.

During power down, the BHS can cut the DSP leakage to
practically zero.
Fig. 13 shows a block diagram of the LDO, where an analog
and a digital loop operate in parallel to supply the load current.
The LDO does not require an external capacitor. However, when
the current demand of the core increases, not having a large
capacitor to quickly supply charge forces the LDO itself to be
faster.
To improve the transient response of the LDO, a mechanism
is introduced to limit the maximum current that the analog loop
has to deliver. The analog loop has a current mirror driving an
analog-to-digital converter (ADC). The ADC senses the current
provided by the analog loop. The digital loop can then offload the excess current from the analog loop. This mechanism takes
advantage of the analog loop to respond to the fast current transients and of the low-bandwidth digital loop to supply the static
or close-to-static load current. Sizing the analog pass transistor
to deliver the worst-case leakage current would make it much
larger. This would increase its gate capacitance and the time required to change its gate voltage, degrading the ability of the
LDO to respond quickly to transient events.
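The division of labor between the two loops can be sketched behaviorally. In this toy model (all names, units, and thresholds are illustrative, not taken from the paper), the fast analog loop absorbs each transient instantly, while the low-bandwidth digital loop steps in to carry any sustained current the ADC reports:

```python
# Behavioral sketch of the current-sharing scheme described above. All names,
# units (mA), and thresholds are illustrative, not from the paper.

def simulate_hybrid_ldo(load_profile, adc_threshold=10.0, digital_step=1.0):
    """Return per-cycle (analog, digital) current traces for a load profile."""
    digital = 0.0
    analog_trace, digital_trace = [], []
    for load in load_profile:
        # The analog loop instantly supplies whatever the digital loop does not.
        analog = max(load - digital, 0.0)
        # The slow digital loop moves one small step per cycle, driven by the
        # ADC reading of the analog-loop current.
        if analog > adc_threshold:
            digital += digital_step        # offload sustained current
        elif analog < adc_threshold / 2 and digital > 0.0:
            digital -= digital_step        # release as the load drops
        analog_trace.append(analog)
        digital_trace.append(digital)
    return analog_trace, digital_trace

# Step load: the analog loop takes the 50 mA transient at once, then the
# digital loop gradually assumes the steady-state current.
analog, digital = simulate_hybrid_ldo([0.0] * 5 + [50.0] * 40)
```

This keeps the analog pass transistor sized for transients only, mirroring the sizing argument in the text.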
Another challenge is that the load current of the LDO can
vary by over three orders of magnitude. When the DSP is idle,
the load of the LDO is only the leakage of the core. For slow
silicon at low temperatures and low voltages, this leakage can
be roughly 30 times smaller than for typical silicon at 25 °C
(4.08 mW at 0.90 V). When the DSP is active, the load current
also includes a dynamic component and can increase well above
100 mA, especially at high temperatures.
With this load current variation, ensuring the stability of the LDO is more challenging. Here, the dominant pole is at the output of the error amplifier and a second pole is located at the load [7]. The frequency of the dominant pole is a function of the effective capacitance driven by the error amplifier. This capacitance includes the gate-to-source and gate-to-drain capacitances, C_gs and C_gd, of the pass transistor. However, due to the Miller effect, C_gd contributes an effective capacitance of C_gd(1 + A_v), where A_v is the voltage gain of the pass transistor. Anything that changes A_v will perturb the frequency of the first pole. Similarly, anything that changes the output resistance R_out of the LDO will perturb the frequency of the second pole. The two are related because A_v = g_m R_out, where g_m is the transconductance of the pass transistor. Load current variations affect both g_m and R_out, although g_m tends to increase when R_out decreases. This partial cancellation usually makes the first pole significantly less sensitive than the second to load current variations. A large output capacitance and a small load current move the second pole towards the first. Without proper compensation, this would reduce the LDO's phase margin and threaten stability.
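The pole behavior can be illustrated numerically. The component values below are arbitrary placeholders chosen only to show the trend; they are not measurements from this design:

```python
import math

# Numerical illustration of the two-pole picture described above.
# All component values are illustrative placeholders.

def poles(r_amp, c_gs, c_gd, g_m, r_out, c_load):
    """First pole at the error-amplifier output, second pole at the load."""
    a_v = g_m * r_out                    # voltage gain of the pass transistor
    c_eff = c_gs + c_gd * (1 + a_v)      # Miller effect multiplies Cgd
    f_p1 = 1 / (2 * math.pi * r_amp * c_eff)
    f_p2 = 1 / (2 * math.pi * r_out * c_load)
    return f_p1, f_p2

# Heavy load: high gm, low Rout.
heavy = poles(r_amp=100e3, c_gs=20e-12, c_gd=5e-12, g_m=0.5, r_out=2.0,
              c_load=1e-9)
# Light load: gm falls ~10x while Rout rises ~100x. The partial cancellation
# in Av = gm * Rout keeps the first pole relatively steady, while the second
# pole collapses towards the first, eroding phase margin.
light = poles(r_amp=100e3, c_gs=20e-12, c_gd=5e-12, g_m=0.05, r_out=200.0,
              c_load=1e-9)
```

With these numbers the first pole moves by only a few times between the two load conditions, while the second pole moves by two orders of magnitude, consistent with the sensitivity argument above.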
Under certain circumstances, the digital loop is also used to
handle fast transients. The digital controller receives information from the DSP about future pipeline events likely to cause
large voltage droops. When the current demand of the core is
expected to rise, the LDO can preemptively and quickly lower
its resistance through the digital loop.
Switching between the LDO and the BHS is allowed while
the DSP is running. When transitioning from the LDO to
the BHS, the BHS is progressively turned on. Lowering the
impedance of the BHS gradually pinches the headroom of
the LDO to nearly zero. Then, the LDO no longer regulates
its output voltage and can be turned off. When transitioning
from the BHS to the LDO, the LDO is forced to its minimum
impedance state by the digital controller. Then, the BHS
is turned off. Finally, the controller gradually increases the
impedance of the LDO until the output voltage drops to its
target value.
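The handoff protocol above can be summarized as ordered action lists, together with a check that the core always retains at least one supply path. All function and action names here are hypothetical; each step mirrors one sentence of the text:

```python
# Behavioral sketch of the LDO/BHS handoff. Action names are illustrative.

def ldo_to_bhs_sequence(bhs_steps=4):
    """Actions, in order, for switching from the LDO to the BHS while running."""
    seq = [f"bhs_strength_{i}" for i in range(1, bhs_steps + 1)]  # gradual turn-on
    seq.append("ldo_off")  # headroom pinched to ~0: the LDO no longer regulates
    return seq

def bhs_to_ldo_sequence(impedance_steps=4):
    """Actions, in order, for switching from the BHS back to the LDO."""
    seq = ["ldo_min_impedance", "bhs_off"]   # LDO forced to a near-short first
    seq += [f"ldo_impedance_up_{i}" for i in range(1, impedance_steps + 1)]
    return seq

def always_powered(seq, ldo_on, bhs_on):
    """Make-before-break check: the core must never lose both supply paths."""
    for action in seq:
        if action == "ldo_off":
            ldo_on = False
        elif action == "bhs_off":
            bhs_on = False
        elif action.startswith("bhs_strength"):
            bhs_on = True
        elif action == "ldo_min_impedance" or action.startswith("ldo_impedance"):
            ldo_on = True
        if not (ldo_on or bhs_on):
            return False
    return True
```

Both directions are make-before-break: the incoming path is conducting before the outgoing path is turned off, which is why the switch is allowed while the DSP is running.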
VIII. ARRAY DESIGN
As discussed earlier, the processor has a 32 kB L1 data cache
(D$), a 16 kB L1 instruction cache (I$), and a unified 256 kB
L2 cache (L2$). The L1 data cache is physically tagged and
virtually indexed. It is eight-way set associative and has 32 bytes
per line. The data cache has two 64 bit load ports and a 64 bit
store port. It also supports 256 bit evictions and fills. An access
takes two cycles. A tag array access, a tag comparison, and a
state array access are performed in the first cycle. If the cache
line is valid and the tag comparison indicates a hit, the data
array is accessed during the second cycle. Serializing the tag and
data array accesses is advantageous from an energy efficiency
perspective. It eliminates an unnecessary access to the data array
when a read miss occurs. It also eliminates reading from the data
array and later discarding the ways that are not needed.
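A minimal behavioral model makes the energy argument concrete. Class and method names below are illustrative (not from the paper), but the geometry matches the 32 kB, eight-way, 32-byte-line organization described above; counting data-array reads shows that a miss never touches the data array and a hit reads exactly one way:

```python
# Minimal model of the serialized tag-then-data access described above.
# Names are illustrative; the geometry matches the text.

WAYS, LINE_BYTES, SETS = 8, 32, 128          # 8 ways x 32 B x 128 sets = 32 kB

class L1DataCache:
    def __init__(self):
        self.tags = [[0] * WAYS for _ in range(SETS)]
        self.valid = [[False] * WAYS for _ in range(SETS)]
        self.data = [[None] * WAYS for _ in range(SETS)]
        self.data_array_reads = 0

    def fill(self, addr, line, way=0):       # replacement policy elided
        index = (addr // LINE_BYTES) % SETS
        self.tags[index][way] = addr // (LINE_BYTES * SETS)
        self.valid[index][way] = True
        self.data[index][way] = line

    def load(self, addr):
        index = (addr // LINE_BYTES) % SETS
        tag = addr // (LINE_BYTES * SETS)
        # Cycle 1: tag array read, tag comparison, and state (valid) check.
        for way in range(WAYS):
            if self.valid[index][way] and self.tags[index][way] == tag:
                # Cycle 2: read only the matching way of the data array.
                self.data_array_reads += 1
                return self.data[index][way]
        return None                          # read miss: no data-array access
```

A parallel-access cache would instead read all eight ways of the data array in the first cycle and discard seven of them.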
The tags of the L1 data cache are stored in an SRAM array.
This array is accessed to read in parallel the eight tags of the set
that might contain the data. Then, the eight tags are compared to
the tag of the address being looked up in the cache to determine
if there is a match. The DSP is multi-threaded and, in general, each thread operates in its own virtual memory region with a different virtual page number (VPN). Alternating between the interleaved threads therefore increases the activity on the VPN bus. With the SRAM-based implementation used here, the high-activity VPN bus is only routed to the comparators at the output of the SRAM array. With an implementation based on a content-addressable memory (CAM), the area would be significantly higher. In addition, the high-activity VPN bus would be broadcast across the
entire cache and consume more power [8]. The tag array of the
L1 data cache has timing-critical interfaces from the register
file, from the TLB, and to the data array. As a result, it uses
eight-transistor (8T) bitcells.
As shown in Fig. 14, the data array is built using 16 banks
of 2 kB each. It uses six-transistor (6T) SRAM cells. Each

Fig. 14. Organization of the L1 data cache.

Fig. 15. Dynamic level shifter. The select is the only VDDQ6 signal.

Fig. 16. Die photograph.

Fig. 17. Measured frequency.

2 kB bank consists of two 1 kB arrays, each with 64 rows, 128 columns, and a four-to-one column multiplexer. The banks are
organized in quadrants and each quadrant contains a double
word from each cache line. Each bit of the fill and evict buses
is routed to one quadrant only. The eight hit signals are routed
to all the 2 kB banks and are factored into the wordline drivers.
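As a quick sanity check, the stated geometry is self-consistent:

```python
# Consistency check of the data-array geometry given above: two 1 kB
# subarrays per bank, 16 banks, and a 32-bit output per subarray.
ROWS, COLS, MUX = 64, 128, 4
subarray_bytes = ROWS * COLS // 8     # 64 rows x 128 columns of bits = 1 kB
bank_bytes = 2 * subarray_bytes       # two 1 kB arrays per 2 kB bank
cache_bytes = 16 * bank_bytes         # 16 banks = 32 kB data array
bits_per_access = COLS // MUX         # 4:1 column mux -> 32 bits per subarray
```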
The 6T bitcells of the D$ and L2$ require a relatively high minimum voltage to operate correctly. They are connected to a dedicated memory voltage (VDDMX) to avoid limiting the minimum voltage of the core. The address decoder is
on the core supply (VDDQ6), which can be much lower than
VDDMX. As shown in Fig. 15, dynamic level shifters are embedded with the wordline drivers to reduce the area and delay
penalty associated with the power domain crossing. When the
select (the only VDDQ6 input) is low, there is no contention.
However, transistor P1 can be weakly on when the select is
high, so the pull-down stack must be optimized appropriately
to ensure its robustness. Independent PMOS power switches are
added to the bitcells and the peripheral logic to support different
leakage saving modes. In the L2$, a source biasing circuit reduces the retention leakage while the memory is idle. Finally, a
circuit tracking the level shifters automatically shifts the time at
which the sense amps fire to maintain a sufficient sense margin.
The tag and data arrays of the L1 instruction cache must be
accessed in a single cycle. Due to this speed requirement, and at

a cost in area, these arrays use 8T bitcells. These bitcells are connected to VDDQ6. In the arrays using the 8T bitcells, the write
wordlines are boosted to VDDMX to ensure functionality at low
voltages. This allows much of the dynamic power consumption
and leakage of the instruction cache to scale down with the core
voltage.
IX. MEASUREMENTS
A micrograph of the DSP is shown in Fig. 16 with the caches
and other major structures identified. A large fraction of the
logic is synthesized.
When the LDO is powering the DSP, VDDQ6 can be scaled
down from VDDCX. The measurements in Fig. 17 show that
the DSP is operational from 255 MHz at 0.60 V up to 1.20 GHz
(5640 DMIPS) at 1.05 V. There, VDDCX is fixed at 1.15 V.
With the block head switch (BHS), VDDQ6 is electrically
tied to VDDCX. Due to a limitation of the rest of the system,
VDDCX cannot go below 0.65 V, which constrains the lowest
DSP voltage. At higher voltages, the BHS allows the DSP
to run slightly faster than with the LDO and reach 1.24 GHz
at 1.05 V. This indicates that the voltage regulation loop of
the LDO can increase the current through the pass transistor
quickly enough to respond to the load transients of the core.


TABLE II
COMPARABLE PROCESSORS

(1) Caches on.

Fig. 18. Measured power consumption.

Fig. 18 shows the power measured on VDDCX for the LDO and the DSP when VDDCX is fixed at 1.05 V and the DSP is executing a typical workload. At 230 MHz, the voltage scaling made possible by the LDO reduces the total power to 23.4 mW. This is 2.25 times lower than what is measured when the DSP operates with the BHS from the fixed supply. A separate measurement, taken with VDDCX reduced to 0.70 V, indicates that the best energy efficiency is achieved when the LDO regulates VDDQ6 down to its lowest voltage, 0.60 V. There, the measured VDDCX power is 13.4 mW at 230 MHz (58 µW/MHz), including the internal power consumption of the LDO.

A comparison to other low-power processors is shown in Table II. The table includes an x86 core [9] and a C64x+ VLIW DSP [10], both optimized for ultra-low voltage operation. The x86 core is implemented in a 32 nm technology. It does not have an L2 cache. It can operate down to 0.28 V at 3.0 MHz, but reaches its best energy efficiency of 170 µW/MHz at 0.45 V. With its caches on, the 28 nm C64x+ DSP can operate down to 14.4 MHz at 0.60 V. It is at its best efficiency at 0.75 V. The C64x+ DSP can function at a lower voltage when its caches are disabled. However, in modern processors, disabling the caches and executing from external memory invariably leads to a large degradation in performance. Therefore, disabling the caches to enable operation at a lower voltage is unlikely to improve energy efficiency. For the Hexagon DSP, the minimum energy per cycle of 58 µW/MHz is at least two to three times lower than for the other processors, including the ones optimized for operation in or near the sub-threshold region. Again, this figure includes the power loss of the LDO. At 25 °C, the average VDDQ6 leakage measured on five typical parts is 4.08 mW at 0.90 V. The leakage of the Hexagon DSP is therefore about five to ten times lower than for these processors.

X. CONCLUSION

When operating from the LDO, the power consumption of the Hexagon DSP can be as low as 58 µW/MHz, which is two to three times better than comparable cores optimized for ultra-low voltage operation. Near-threshold computing (NTC) has the potential to improve energy efficiency, but it is not guaranteed to do so [11], [12].
For a given core, operating at near-threshold or sub-threshold
voltages can significantly improve the energy efficiency compared to operating the same core at the maximum voltage. For
instance, the energy efficiency of the x86 core discussed in [9]
improves by 4.7 times when the supply voltage is reduced from
1.20 V to 0.45 V at 25 °C.
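The headline efficiency figures follow from simple arithmetic: power in mW divided by frequency in MHz is numerically equal to energy per cycle in pJ.

```python
# Arithmetic behind the efficiency figures quoted in this section, using the
# measured numbers reported in the text.
p_total_mw = 13.4        # measured VDDCX power at the best-efficiency point
f_mhz = 230.0
energy_per_cycle = p_total_mw * 1000.0 / f_mhz   # ~58 µW/MHz, i.e. ~58 pJ/cycle

x86_best = 170.0         # best efficiency of the x86 core [9], in µW/MHz
advantage = x86_best / energy_per_cycle          # ~2.9x: "two to three times"
```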
However, optimizing a core to operate at ultra-low voltages
can also lead to a degradation in energy efficiency, especially
if conventional power optimization techniques have to be abandoned to reach these ultra-low voltages. In particular, restricting
the usage of high-Vt, complex, or minimum-sized logic
gates [9], [10] due to their higher sensitivity to process variability can significantly increase area, leakage, and dynamic
power. For the x86 core, an area penalty approaching 80% is reported in [13]. Making the sequential elements and the bitcells
more tolerant to variability tends to exacerbate these negative
effects.
As discussed in the previous section, the cores optimized for
ultra-low voltage operation leak five to 10 times more than the
Hexagon DSP at room temperature. This suggests that, in order
to operate in or near the sub-threshold region, the x86 core and
the C64x+ DSP were unable to aggressively reduce leakage as

described in Section VI. It also suggests that the energy efficiency of these cores would degrade significantly at high temperatures, where the leakage tends to become a much larger fraction of the total power consumption.
Around the point of optimal energy efficiency, the energy-versus-voltage curve is fairly flat. This implies that the
voltage can be increased without significantly degrading the
energy efficiency. In [9], increasing the optimal voltage by
100 mV to 0.55 V degrades the energy efficiency by approximately 12%. This suggests that the slightly higher operating
voltage (0.60 V) targeted by the Hexagon DSP is close enough
to the near-threshold region to reap most of the benefits of
NTC, but high enough to keep the benefits of the aggressive
power optimization techniques discussed in this paper.

REFERENCES

[1] S. Y. Wu et al., "A highly manufacturable 28 nm CMOS low power platform technology with fully functional 64 Mb SRAM using dual/tripe gate oxide process," in IEEE Symp. VLSI Circuits, 2009, pp. 210–211.
[2] L. Codrescu et al., "Hexagon DSP: An architecture optimized for mobile multimedia and communications," IEEE Micro, vol. 34, pp. 34–43, Mar. 2014.
[3] P. Bassett and M. Saint-Laurent, "Energy efficient design techniques for a digital signal processor," in IEEE Int. Conf. IC Design and Tech., 2012, pp. 41–44.
[4] M. Saint-Laurent and A. Datta, "A low-power clock gating cell optimized for low-voltage operation in a 45-nm technology," in ACM/IEEE Int. Symp. Low-Power Electronics and Design, 2010, pp. 159–163.
[5] J. V. Faricelli, "Layout-dependent proximity effects in deep nanoscale CMOS," in Proc. IEEE Custom Integrated Circuits Conf., 2010, pp. 1–8.
[6] M. Saint-Laurent and M. Swaminathan, "Impact of power-supply noise on timing in high-frequency microprocessors," IEEE Trans. Adv. Packag., vol. 27, pp. 135–144, Feb. 2004.
[7] J. Torres et al., "Low drop-out voltage regulators: Capacitor-less architecture comparison," IEEE Circuits Syst. Mag., vol. 14, no. 2, pp. 6–26, 2014.
[8] B. Mohammad, P. Bassett, A. Aziz, and J. Abraham, "Cache organizations for embedded processors: CAM-vs-SRAM," in IEEE Int. SOC Conf., 2006, pp. 299–302.
[9] S. Jain et al., "A 280 mV-to-1.2 V wide-operating-range IA-32 processor in 32 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2012, pp. 66–67.
[10] N. Ickes et al., "A 28 nm 0.6 V low power DSP for mobile applications," IEEE J. Solid-State Circuits, vol. 47, pp. 35–46, Jan. 2012.
[11] L. Chang and W. Haensch, "Near-threshold operation for power-efficient computing? It depends…," in Proc. ACM/EDAC/IEEE Design Automation Conf. (DAC), 2012, pp. 1155–1159.
[12] M. Severson, K. Yuen, and Y. Du, "Not so fast my friend: Is near-threshold computing the answer for power reduction of wireless devices?," in ACM/EDAC/IEEE Design Automation Conf. (DAC), 2012, pp. 1160–1162.
[13] G. Ruhl, S. Dighe, S. Jain, S. Khare, and S. R. Vangal, "IA-32 processor with a wide-voltage-operating range in 32-nm CMOS," IEEE Micro, vol. 33, pp. 28–36, Mar. 2013.

Martin Saint-Laurent (SM'09) received the B.Eng. degree (with honors) from McGill University, Montreal, Canada, and the M.S. and Ph.D. degrees from the Georgia Institute of Technology (Georgia Tech), Atlanta, GA, USA, all in electrical engineering.
From 1998 to 2005, he was with Intel Corp., where he worked on two generations of high-frequency IA-32 processors as a custom circuit designer. His responsibilities included clock distribution and sequential element design. In 2005, he joined Qualcomm, Inc., Austin, TX, USA, where he is currently a Principal Engineer. He manages the team responsible for low-power design, power delivery, and power-aware verification. He has 24 patents granted or pending related to clocking, energy-efficient circuits, and voltage regulation.

Paul Bassett received the B.S. degree from Texas A&M University, College Station, TX, USA, in 1982, and the M.S. degree from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, in 1985, both in electrical engineering.
His primary area of focus has been custom circuit design for high-performance, low-power processors. He has worked on circuit designs for processors developed by BBN, Kendall Square Research, Ross Technologies, Alchemy Semiconductor, and Qualcomm.

Ken Lin received the B.S.E.E. degree from National Taiwan University, Taiwan, in 1991, and the M.S.E.E. degree from Syracuse University, NY, USA, in 1995.
He is a Principal Engineer at Qualcomm Inc., San Diego, CA, USA. He joined Qualcomm in 2004 and has worked on several generations of Hexagon processors. Previously, he worked for Digital Equipment Corporation, Compaq, and Sun Microsystems, where he contributed to Alpha and SPARC microprocessor circuit designs. He holds eight U.S. patents.
Baker Mohammad (M'04–SM'13) received the Ph.D. degree from the University of Texas at Austin, TX,
USA, in 2008, the M.S. degree from Arizona State
University, Tempe, AZ, USA, and the B.S. degree
from the University of New Mexico, Albuquerque,
NM, USA, all in electrical and computer engineering.
He is an Assistant Professor of electronic engineering at Khalifa University, Abu Dhabi, UAE,
and a consultant for Qualcomm Inc., Austin. Prior
to joining Khalifa University, he was a Senior
Staff Engineer/Manager at Qualcomm, where he was engaged in designing
high-performance and low-power DSPs for communication and multimedia
applications. Before joining Qualcomm, he worked on a wide range of microprocessors design at Intel Corp., from high-performance IA-64 server chips to
mobile embedded XScale processors. He has over 16 years of industrial experience in microprocessor design with emphasis on memory, low-power circuit,
and physical design. His research interests include power-efficient computing,
high-yield embedded memory, and emerging technologies such as memristor
and STTRAM. In addition, he is engaged in research on microwatt computing
platforms for WSN focusing on energy harvesting and power management,
including efficient DC/DC and AC/DC convertors. He holds eight issued U.S.
patents and has several pending patent applications. He authored the book titled
Embedded Memory Design for Multi-Core and SOC.
Dr. Mohammad has served IEEE in many editorial and administrative capacities. He is a member of the Technical Program Committee for several IEEE conferences such as International Conference on Computer Design (ICCD), ICECS,
and VLSI-SOC. He is a regular reviewer for IEEE TRANSACTIONS ON VLSI and
IEEE TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS.

Yuhe Wang received the M.S. degree in electrical and computer engineering from the University of California, San Diego, CA, USA.
He works in the RFA (RF and analog/mixedsignal) group at Qualcomm Inc., San Diego. He has
one patent issued and two patents pending.


Xufeng Chen received the B.S. degree in physics from Fudan University, Shanghai, China, and the
M.S.E.E. degree from Carnegie Mellon University,
Pittsburgh, PA, USA.
He has been working on ASIC design with Qualcomm, San Diego, CA, USA, since 1998. He holds
nine patents on digital signal processors and their
power management.

Dwight Galbi received the B.S. degree in electrical engineering from Duke University, Durham, NC, USA,
and the M.S. degree in electrical and computer engineering at the University of Texas at Austin, TX,
USA.
He has 25 years of semiconductor work experience
and is currently a Principal Engineer/Manager for the
Physical Design team at Qualcomm Inc. in Austin.

Maen Alradaideh received the B.S. degree in electrical and computer engineering at Jordan University
of Science and Technology, Irbid, Jordan, and the
M.S. degree in electrical and computer engineering
at the University of Alabama, Huntsville, AL, USA.
He spent over 20 years in the field of processor verification and validation. He is currently leading silicon validation and customer enablement.

Allan Lester received the B.S. degree in electrical engineering from Texas A&M University, College
Station, TX, USA, in 1990.
He worked for Compaq Computer Corporation
(later Hewlett Packard) from 1991 to 2004, where he
designed chipsets for desktop and server products.
In 2004, he joined Qualcomm as a founding member
of the Austin, TX, based DSP design team. He is
currently Senior Director of Engineering, responsible for logic design and physical design of all
Qualcomm DSP cores.

Tom Wernimont received the B.S. degree in electrical and computer engineering at the University of
Notre Dame, Notre Dame, IN, USA, and the M.S.
degree in electrical and computer engineering at
the University of Illinois at Urbana-Champaign, IL,
USA.
He is currently a Senior Staff Engineer at Qualcomm Inc., Austin, TX, USA. He has five patents
issued.

Kartik Ayyar received the Bachelor of Engineering degree in electronics and instrumentation and the
Master of Science (Honors) degree in physics from
Birla Institute of Technology and Science, Pilani,
India.
He is currently working as a Staff Engineer at
Qualcomm India Pvt. Ltd., Bangalore, India.

Dan Bui received the B.S. degree in computer engineering at the University of Texas at Austin, TX,
USA.
He has over 20 years of work experience and
is currently a Principal Engineer/Manager at Qualcomm in Austin, TX.

Marzio Pedrali-Noy received the M.S. degree (summa cum laude) in electrical engineering from
the University of Pavia, Italy, in 1996.
From April 1996 to December 2000, he worked
as an Analog IC Design Engineer with Lawrence
Berkeley National Laboratory, Berkeley, CA, USA,
where he designed circuits for high-energy physics
and medical imaging applications. In 2001 he joined
Qualcomm's Mixed Signal design group in San
Diego, CA, USA, and worked on circuits for data
converters, phase-locked loops and delay-locked
loops. He is currently Director of Engineering, Mixed Signal IP design group.
His research interests include the design of timing circuits and integrated power
regulators.

Willie Anderson is VP, Engineering for Qualcomm Technologies Inc., and has worked for Qualcomm
since 2003. He leads DSP core and software tool
development for QCT and started the Qualcomm
Austin design center, as well as led the development
of a new DSP core by that team. That DSP has
now shipped in dozens of application processor and
baseband products, and has a production volume
of well over 3 billion cores. He has over 25 years
of experience in semiconductor and software engineering, including senior positions at Analog
Devices, Motorola and in private fabless semiconductor companies. He holds
over 25 patents in arithmetic unit, processor, system and software design.
