A 28 nm DSP Powered by an On-Chip LDO for High-Performance and Energy-Efficient Mobile Applications
10/27/16
Contents
Introduction
Architecture
Clock distribution
FIFO
Leakage optimization
Core voltage scaling & power gating
Array design
Conclusion
Introduction
The design uses the Qualcomm Hexagon digital signal processor (DSP).
It targets high performance and low power for multimedia and modern applications.
The process supports three threshold voltages and three channel lengths.
The interconnect stack has eight layers.
A thick Al metal layer is used for power routing.
Architecture
Very long instruction word (VLIW) machine
Many instructions per cycle (parallelism)
Instructions are grouped into packets for parallel execution
Variable-length packets (1 to 4 instructions) improve code density
Pipelining is supported
The compiler schedules dependent packets after their source packets
Contains a 32 kB L1 data cache, a 16 kB L1 instruction cache, and a 256 kB L2 cache

Dependent packets help increase the number of instructions per packet and the utilization of the functional units.
Eight 16-bit multiply-accumulate operations per cycle
Dynamic multithreading provides real-time thread performance
Communicates with external memory over an asynchronous bus
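
The packet scheduling described in the Architecture slides can be illustrated with a small greedy packetizer. This is a minimal sketch, not the Hexagon compiler's actual algorithm: the instruction tuple format, the `packetize` name, and the in-order greedy policy are assumptions made for illustration. It fills packets with up to four instructions and conservatively pushes any instruction that needs a result from the current packet into a later one.

```python
from typing import List, Set, Tuple

# Hypothetical instruction format: (name, destination register, source registers)
Instr = Tuple[str, str, Tuple[str, ...]]

MAX_PACKET = 4  # the slides state that a packet holds 1 to 4 instructions


def packetize(instrs: List[Instr]) -> List[List[str]]:
    """Greedy in-order packing: close the current packet when it is full or
    when the next instruction conflicts with a register written in it."""
    packets: List[List[str]] = []
    current: List[str] = []
    written: Set[str] = set()  # registers written by the current packet
    for name, dst, srcs in instrs:
        depends = any(src in written for src in srcs)
        conflict = depends or dst in written
        if current and (len(current) == MAX_PACKET or conflict):
            packets.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dst)
    if current:
        packets.append(current)
    return packets


program: List[Instr] = [
    ("load_a", "r0", ()),
    ("load_b", "r1", ()),
    ("mul", "r2", ("r0", "r1")),  # reads r0 and r1, so it opens a new packet
    ("load_c", "r3", ()),
    ("add", "r4", ("r2", "r3")),  # reads r2 and r3, so it opens a new packet
]
packets = packetize(program)
print(packets)
```

Shorter packets appear only where a dependency forces them, which is why variable-length packets help code density: no padding instructions are needed to fill a fixed-width slot.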
Figure: Processor architecture
Clock Distribution
The DSP can be clocked from multiple sources.
Clock phase alignment is not required, so a frequency-locked loop can be used.
The same clock source can feed the DSP and other cores in the system.
A frequency divider can also be used as a clock source.
The clock distribution is designed for low dynamic power consumption.

The clock enters the core through a voltage level shifter optimized for low clock insertion delay and low duty-cycle distortion across the DSP's voltage range.
The circuit can apply different delays to the rising and falling edges to adjust the duty cycle and improve the timing of phase paths.
Each clock bay consists of a long metal-8 wire driven from the middle by an inverter (low resistance).
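
The duty-cycle adjustment mentioned above follows from simple edge arithmetic: delaying the falling edge more than the rising edge stretches the high phase, and the reverse shrinks it. The sketch below is illustrative only; the function name and all numeric values are assumptions, not figures from the design.

```python
def adjusted_duty_cycle(period_ps: float, duty_in: float,
                        rise_delay_ps: float, fall_delay_ps: float) -> float:
    """Duty cycle after the rising and falling edges are delayed separately."""
    high_in = duty_in * period_ps
    # The high phase starts rise_delay later and ends fall_delay later.
    high_out = high_in + fall_delay_ps - rise_delay_ps
    if not 0.0 < high_out < period_ps:
        raise ValueError("delays would collapse the clock pulse")
    return high_out / period_ps


# Hypothetical case: a 1000 ps clock arrives with a 46 % duty cycle.
duty = adjusted_duty_cycle(1000.0, 0.46, rise_delay_ps=20.0, fall_delay_ps=60.0)
print(f"corrected duty cycle: {duty:.2f}")
```

Equal delays on both edges leave the duty cycle unchanged and only add insertion delay, so only the difference between the two delays matters for the correction.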

Levels
Global clock
Regional clock buffer (526 rclk)
Local clock buffer (1565 lclk)
Pulse generator (1907 pclk)

This topology uses 19 transistors instead of the 24 in a conventional clock-gating cell.
Pulsed Latches
Pulsed latches implement most sequential elements.
They are scannable and have an asynchronous reset.
They consume less clock and data power.
They offer more data transparency, which improves performance.
Advantages:
Reduces power.
Improves robustness.
Reduces the impact of device performance variability.

FIFO
The DSP core clock is not synchronized with the other clocks.
The core voltage differs from the chip voltage.
Proper ordering of reads and writes across the two domains is a major challenge.
Data is written using the write pointer maintained by the input clock domain.
Data is read using the read pointer maintained by the output clock domain.

When a read or write operation completes, the pointer is updated and the new value is sent to the other domain; after a write, this tells the receiver that the entry is valid and readable.
Handshaking protocols should include enough margin to cover any variation between the write pointer and array timing.
The FIFO includes a write pointer tracking circuit that mimics the path from the input register to the latch array.
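
The pointer scheme above can be sketched as a small behavioral model. This is a software illustration under stated assumptions, not the circuit from the design: the class and method names are hypothetical, and Gray-coding the pointers that cross the clock boundary is a common technique that the slides do not confirm is used here. Synchronizer delay, voltage level shifting, and the tracking circuit are not modeled.

```python
from typing import Any, List, Optional, Tuple


def to_gray(value: int) -> int:
    """Binary to Gray code; consecutive values differ in exactly one bit."""
    return value ^ (value >> 1)


class AsyncFifoModel:
    """Behavioral model of a two-pointer FIFO (hypothetical names)."""

    def __init__(self, depth: int = 4) -> None:
        if depth <= 0 or depth & (depth - 1):
            raise ValueError("depth must be a power of two")
        self.depth = depth
        self.array: List[Any] = [None] * depth  # stands in for the latch array
        self.wptr = 0  # maintained by the input (write) clock domain
        self.rptr = 0  # maintained by the output (read) clock domain

    def _count(self) -> int:
        # Pointers carry one extra wrap bit, so they count modulo 2 * depth.
        return (self.wptr - self.rptr) % (2 * self.depth)

    def write(self, item: Any) -> bool:
        if self._count() == self.depth:  # full: the writer must wait
            return False
        self.array[self.wptr % self.depth] = item
        # The pointer advances only after the data is stored, so an entry
        # is never reported as valid before it has been written.
        self.wptr = (self.wptr + 1) % (2 * self.depth)
        return True

    def read(self) -> Optional[Any]:
        if self._count() == 0:  # empty: no valid entry to read
            return None
        item = self.array[self.rptr % self.depth]
        self.rptr = (self.rptr + 1) % (2 * self.depth)
        return item

    def crossing_pointers(self) -> Tuple[int, int]:
        """Gray-coded pointer values that would cross the clock boundary."""
        return to_gray(self.wptr), to_gray(self.rptr)


fifo = AsyncFifoModel(depth=4)
for sample in (10, 20, 30, 40):
    fifo.write(sample)
print(fifo.write(50))            # the FIFO is full, so this write is refused
print(fifo.read(), fifo.read())  # entries come out in write order
```

Updating the pointer after the data write is the software analogue of the timing margin the handshake needs: the receiver must not act on a new pointer value before the array entry is stable.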
Figure: Asynchronous FIFO
Leakage optimization
A low data-to-clock ratio means that the receiving sequential element tends to capture values many times.
A high data-to-clock ratio means that values arriving at the sequential element are often not captured.
Many techniques are used for leakage optimization and minimum power consumption.

The leakage recovery process provides other benefits:
crosstalk reduction
better delay timing
fewer delay circuits
The maximum frequency of the core improves after leakage recovery.
Core voltage scaling & power gating
A low-dropout regulator (LDO) is used for dynamic voltage scaling.
It improves power supply noise and timing.
The block head switch (BHS) is implemented as an array of 24 tiles of identical HVT40 devices.
When the enf signal is applied, some of the tiles power up, with a slightly higher delay.
When the enr signal is applied, the complete system turns on, with less delay.
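
A staged power-up like the one above is commonly used to limit inrush current, and the effect can be estimated with Ohm's law. The sketch below is a rough illustration, not data from the design: reading enf and enr as "enable few" and "enable rest" is an assumption, as are the tile resistance, the number of tiles in the first stage, and the rail voltage at which enr is asserted.

```python
TILES = 24  # tile count stated in the slides


def peak_inrush_a(tiles_on: int, v_supply: float, v_core: float,
                  r_tile_ohm: float) -> float:
    """Initial current when `tiles_on` parallel tiles connect the supply
    to a core rail that currently sits at `v_core`."""
    if not 0 < tiles_on <= TILES:
        raise ValueError("tiles_on must be between 1 and 24")
    return (v_supply - v_core) * tiles_on / r_tile_ohm


# Hypothetical values, chosen only to show the trend.
V_SUPPLY = 1.0   # volts
R_TILE = 12.0    # ohms per tile when switched on
FEW = 2          # tiles assumed to be enabled by enf
V_AT_ENR = 0.9   # core rail voltage assumed when enr is asserted

all_at_once = peak_inrush_a(TILES, V_SUPPLY, 0.0, R_TILE)
stage_enf = peak_inrush_a(FEW, V_SUPPLY, 0.0, R_TILE)
stage_enr = peak_inrush_a(TILES, V_SUPPLY, V_AT_ENR, R_TILE)

print(f"all tiles at once: {all_at_once:.2f} A")
print(f"enf stage:         {stage_enf:.2f} A")
print(f"enr stage:         {stage_enr:.2f} A")
```

The few-tile stage charges the rail slowly through a high resistance, and the remaining tiles switch on only once the voltage difference is small, so neither stage approaches the current of closing every tile at once.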

When powering down, all the tiles are turned off simultaneously.
A proper sequence is important to power down the core safely.
If data retention is desired, the L1 cache data is first written to the L2 cache.
The signal controlling the sleep state of the L2 cache must be latched onto the memory supply, and outputs to the VDDQ6 power domain must be isolated.

This includes the core outputs as well as the internal interface to the L2 cache.
Isolation prevents short-circuit currents and the propagation of unknown logic values into the power domain that remains on.
The BHS is turned off last. The power-up sequence resets the core before removing the isolation.
In low-power mode, the LDO can reduce the DSP voltage while the rest of the system stays at a higher voltage.
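
The benefit of lowering only the DSP voltage can be estimated with a first-order model. The sketch below assumes switching power of the form C·V²·f, ignores leakage, and treats the LDO as an ideal linear regulator with no quiescent current, so its efficiency is simply the ratio of output to input voltage. The function names and all numeric values are hypothetical.

```python
def dynamic_power_w(c_eff_f: float, v_core: float, f_hz: float) -> float:
    """First-order switching power of the core: C * V^2 * f."""
    return c_eff_f * v_core ** 2 * f_hz


def ldo_input_power_w(c_eff_f: float, v_in: float, v_core: float,
                      f_hz: float) -> float:
    """Power drawn from the shared rail when an ideal linear regulator
    drops v_in to v_core; the load current passes through unchanged."""
    i_core = c_eff_f * v_core * f_hz
    return v_in * i_core


# Hypothetical operating point.
C_EFF = 1e-9   # farads of effective switched capacitance
F_CLK = 500e6  # hertz
V_RAIL = 1.0   # volts on the shared system rail
V_DSP = 0.8    # volts at the LDO output

direct = dynamic_power_w(C_EFF, V_RAIL, F_CLK)               # core on the rail
via_ldo = ldo_input_power_w(C_EFF, V_RAIL, V_DSP, F_CLK)     # core behind LDO
core_only = dynamic_power_w(C_EFF, V_DSP, F_CLK)             # power in the core
efficiency = core_only / via_ldo

print(f"direct: {direct:.2f} W, via LDO: {via_ldo:.2f} W, "
      f"LDO efficiency: {efficiency:.2f}")
```

Under these assumptions the power taken from the rail falls linearly with the DSP voltage rather than quadratically, because the regulator dissipates the voltage difference as heat. The saving is still real, and it comes without lowering the rail that the rest of the system shares.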

During power-down, the BHS can cut the DSP leakage to practically zero.
The LDO does not require an external capacitor.
When the current demand of the core increases, the absence of a large capacitor that can quickly supply charge forces the LDO itself to be faster.
Increasing the gate capacitance of the LDO's pass device lengthens the time required to change its gate voltage, which degrades the LDO's ability to respond quickly to transient events.

Switching between the LDO and the BHS is allowed while the DSP is running.
In a transition from LDO to BHS, the BHS is turned on progressively; the LDO then no longer regulates its output voltage and can be turned off.
In a transition from BHS to LDO, the digital controller first forces the LDO into its minimum-impedance state and the BHS is turned off. The controller then gradually increases the LDO impedance until the output voltage drops to its target value.
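
The BHS-to-LDO handoff can be sketched by treating the LDO as a series resistance that the controller raises in steps. This is a simplified illustration, not the actual controller: the function name, the constant load current, the fixed step size, and every numeric value are assumptions.

```python
from typing import List


def bhs_to_ldo_handoff(v_in: float, v_target: float, i_load: float,
                       r_min: float, r_step: float) -> List[float]:
    """Return the core voltage after each controller step."""
    if i_load <= 0 or r_step <= 0:
        raise ValueError("i_load and r_step must be positive")
    # Step 1: the LDO is forced to its minimum impedance, so the core
    # voltage barely moves when the BHS is switched off.
    r_ldo = r_min
    voltages = [v_in - i_load * r_ldo]
    # Step 2: raise the impedance gradually, stopping at the last step
    # that keeps the core voltage at or above the target.
    while v_in - i_load * (r_ldo + r_step) >= v_target:
        r_ldo += r_step
        voltages.append(v_in - i_load * r_ldo)
    return voltages


# Hypothetical operating point: 1.0 V rail, 0.8 V target, 0.5 A load.
trace = bhs_to_ldo_handoff(v_in=1.0, v_target=0.8, i_load=0.5,
                           r_min=0.01, r_step=0.05)
print([round(v, 3) for v in trace])
```

Starting from minimum impedance keeps the handoff free of a sudden voltage drop, and the gradual ramp lets the running core reach the lower voltage without a large transient.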
Figure: BHS and LDO
Array Design
The processor has a 32 kB L1 data cache, a 16 kB L1 instruction cache, and a unified 256 kB L2 cache.
The L1 cache is virtually indexed and physically tagged.
It is eight-way set associative with 32 bytes per line.
The data cache has a 64-bit load port and a 64-bit store port.
It supports 256-bit evictions and fills.
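
The cache parameters above determine how an address splits into offset and index bits, which is what makes virtual indexing with physical tagging practical. The sketch below assumes that the eight-way, 32-byte-line organization applies to the 32 kB L1 data cache and that the smallest page is 4 kB; both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Tuple


def cache_geometry(size_bytes: int, ways: int,
                   line_bytes: int) -> Tuple[int, int, int]:
    """Return (sets, offset_bits, index_bits) for a set-associative cache.
    Sizes are expected to be powers of two."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


sets, offset_bits, index_bits = cache_geometry(32 * 1024, ways=8, line_bytes=32)

PAGE_OFFSET_BITS = 12  # assumption: 4 kB pages
# If offset plus index bits fit inside the page offset, the set index is
# identical in the virtual and physical address, so the cache lookup can
# start before address translation finishes.
index_untranslated = offset_bits + index_bits <= PAGE_OFFSET_BITS

print(sets, offset_bits, index_bits, index_untranslated)
```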

An access takes two cycles.
This eliminates unnecessary accesses to the data array when a read miss occurs.
The L1 data cache is stored in an SRAM array.
The DSP is multithreaded, and each thread operates in its own virtual memory region with a different virtual page number.
A content-addressable memory (CAM) that matches virtual page numbers (VPNs) has a significantly larger area and high power dissipation.

The tag and data arrays of the L1 cache must be accessed in a single cycle.
This allows much of the dynamic power
consumption and leakage of the instruction
cache to scale down with the core voltage
Conclusion
The power consumption of the DSP is 58 µW/MHz.
Near-threshold computing (NTC) has the potential to improve energy efficiency.
The voltage can be increased significantly without degrading the energy efficiency.
QUESTIONS
Thank you.