
A 28nm DSP Powered by an On-Chip LDO for High Performance and Energy Efficient Mobile Applications

10/27/16

Contents

Introduction
Architecture
Clock distribution
FIFO
Leakage optimization
Core voltage scaling & power gating
Array design
Conclusion


Introduction

The design is based on the Qualcomm Hexagon digital signal processor (DSP).
It targets high performance and low power for multimedia and modem applications.
The process supports three threshold voltages and three channel lengths.
The interconnect stack has eight layers.
A thick Al metal layer is used for power routing.

Architecture

Very long instruction word (VLIW) machine
Issues multiple instructions per cycle (parallelism)
Instructions are grouped into packets for parallel execution
Variable-length packets (1 to 4 instructions) improve code density
Pipelined execution is supported
The compiler schedules dependent packets after their source packets
Contains a 32 kB L1 data cache, a 16 kB L1 instruction cache and a 256 kB L2 cache
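
The packet rules above can be illustrated with a small sketch. It is a toy greedy packetizer, not the Hexagon compiler's algorithm: the function name `packetize`, the dependency map and the instruction names are all hypothetical, and it assumes the simplest possible rule, namely that an instruction may only join a packet once everything it depends on was issued in an earlier packet.

```python
from typing import Dict, List, Set

MAX_PACKET_SIZE = 4  # the slides give a packet length of 1 to 4 instructions


def packetize(instrs: List[str], deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Greedily group in-order instructions into packets (illustrative only).

    An instruction joins the current packet only if all of its sources were
    issued in an earlier, already closed packet. Otherwise the current packet
    is closed and a new one is started.
    """
    packets: List[List[str]] = []
    issued: Set[str] = set()   # instructions in packets that are already closed
    current: List[str] = []
    for ins in instrs:
        ready = deps.get(ins, set()) <= issued
        if current and (not ready or len(current) == MAX_PACKET_SIZE):
            packets.append(current)
            issued.update(current)
            current = []
        current.append(ins)
    if current:
        packets.append(current)
    return packets
```

With five instructions where `c` reads the result of `a` and `e` reads the result of `d`, this sketch produces three packets: `a, b`, then `c, d`, then `e`.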



Dependent packets help increase the number of instructions per packet and the utilization of the functional units.
Eight 16-bit multiply-accumulate operations per cycle
Dynamic multithreading for real-time thread performance
Communicates with external memory over an asynchronous bus
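
To make the "eight 16-bit multiply-accumulate operations per cycle" figure concrete, here is a minimal behavioural sketch of one such cycle. The function names and the unbounded Python accumulator are assumptions for illustration; the slides do not state the accumulator width or how the lanes are organised in hardware.

```python
from typing import Sequence

LANES = 8  # eight 16-bit multiply-accumulates per cycle, per the slides


def to_int16(x: int) -> int:
    """Wrap an integer into the signed 16-bit range."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mac_cycle(acc: int, a: Sequence[int], b: Sequence[int]) -> int:
    """Model one cycle: add the sum of eight 16-bit products to acc."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("expected exactly eight operands per input")
    return acc + sum(to_int16(x) * to_int16(y) for x, y in zip(a, b))
```

A dot product of length N would then take N / 8 such cycles in this simplified model.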

Processor Architecture


Clock Distribution

The DSP can be clocked from multiple sources.
Clock phase alignment is not required, so a frequency-locked loop can be used.
The same clock source can feed the DSP and other cores in the system.
A frequency divider can also be used as a clock source.
The clock distribution is designed for low dynamic power consumption.


The clock enters the core through a voltage shifter optimized for low clock insertion delay and low duty-cycle distortion across the DSP voltage range.
The circuit can apply different delays to the rising and falling edges, which adjusts the duty cycle and improves the timing of phase paths.
Each clock bay consists of a long metal-8 wire driven from its middle by an inverter (low resistance).
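
The effect of delaying the two clock edges by different amounts can be worked through numerically. The sketch below is plain arithmetic, not a model of the actual circuit: the function name and all values are hypothetical, and it assumes an ideal square wave whose period is unchanged by the delays.

```python
def adjusted_duty_cycle(period: float, duty: float,
                        rise_delay: float, fall_delay: float) -> float:
    """Duty cycle after delaying rising and falling edges independently.

    The high phase grows by (fall_delay - rise_delay); the period is
    unchanged. All times use the same unit, for example picoseconds.
    """
    high_time = duty * period + (fall_delay - rise_delay)
    if not 0.0 < high_time < period:
        raise ValueError("delays would swallow one clock phase")
    return high_time / period
```

For example, a 1000 ps clock arriving with a 45% duty cycle is restored to 50% if the falling edge is delayed 50 ps more than the rising edge.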


Levels

Global clock

Regional clock buffers (526 rclk)

Local clock buffers (1565 lclk)

Pulse generators (1907 pclk)


This clock-gating topology uses 19 transistors instead of the 24 in a conventional clock-gating cell.


Pulsed Latches

Implement most sequential elements.
Are scannable and have an asynchronous reset.
Consume less clock and data power.
Offer more data transparency, which improves performance.
Advantages:
Reduce power.
Improve robustness.
Reduce the impact of device performance variability.



FIFO

The DSP core clock is not synchronized with the other clocks.
The core voltage differs from the chip voltage.
Keeping reads and writes properly ordered across the two domains is a major challenge.
Data is written using the write pointer, which is maintained by the input clock domain.
Data is read using the read pointer, which is maintained by the output clock domain.
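
The two-pointer scheme can be sketched in software. This is a single-threaded behavioural model only: it ignores the clock-domain crossing and voltage differences that make the real design hard, and the class name `PointerFifo` and its methods are hypothetical.

```python
class PointerFifo:
    """Circular buffer addressed by a write pointer and a read pointer."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.mem = [None] * depth
        self.wr_ptr = 0  # owned by the input (write) clock domain
        self.rd_ptr = 0  # owned by the output (read) clock domain

    def count(self) -> int:
        """Number of entries written but not yet read."""
        return self.wr_ptr - self.rd_ptr

    def write(self, value) -> bool:
        if self.count() == self.depth:   # full: reject the write
            return False
        self.mem[self.wr_ptr % self.depth] = value
        self.wr_ptr += 1                 # advance only after the data is stored
        return True

    def read(self):
        if self.count() == 0:            # empty: no valid entry to read
            return None
        value = self.mem[self.rd_ptr % self.depth]
        self.rd_ptr += 1
        return value
```

Advancing the write pointer only after the entry is stored is the software analogue of the ordering requirement: the reader must never see a pointer value that covers data not yet in the array.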


When a read or write completes, the pointer value is updated and the new value is sent to the receiver, so that the entry becomes valid and readable.
The handshaking protocol should include enough margin to cover any variation between the write-pointer timing and the array timing.
The FIFO includes a write-pointer tracking circuit that mimics the path from the input register to the latch array.
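
The slides do not say how the pointer value is encoded when it is sent to the other clock domain. A common technique in asynchronous FIFOs, shown here purely as an assumption and not as this design's method, is to send the pointer in Gray code so that only one bit changes per increment and a receiver sampling mid-transition sees either the old or the new value.

```python
def bin_to_gray(n: int) -> int:
    """Convert a non-negative binary count to its Gray-code equivalent."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a Gray-code value back to a binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

Consecutive pointer values differ in exactly one bit after `bin_to_gray`, which is the property that makes the encoding safe to sample from an unrelated clock.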

Asynchronous FIFO


Leakage optimization

A low data-to-clock ratio means that a sequential element tends to capture the value it receives many times.
A high data-to-clock ratio means that values received by the sequential element are often not captured.
Many techniques are used for leakage optimization and minimum power consumption.



The leakage recovery process provides other benefits:
improves crosstalk
improves delay timing
reduces the number of delay circuits
improves the maximum frequency of the core after leakage recovery


Core voltage scaling & power gating

The low-dropout regulator (LDO) is used for dynamic voltage scaling.
It improves power supply noise and timing.
The block head switch (BHS) is implemented as an array of 24 tiles of identical HVT40 devices.
When the enf signal is applied, some of the tiles power up, with a somewhat higher delay.
When the enr signal is applied, the complete system turns on, with less delay.
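
The two-step turn-on can be summarised as a small truth table in code. The sketch assumes that enf enables a small subset of the 24 tiles and that enr then enables all of them; the subset size `few_tiles` is a made-up illustrative value, since the slides do not give it.

```python
TOTAL_TILES = 24  # tile count from the slides


def tiles_on(enf: bool, enr: bool, few_tiles: int = 4) -> int:
    """Number of conducting BHS tiles for a given (enf, enr) state.

    few_tiles is hypothetical: the slides only say "some tiles".
    """
    if not 0 < few_tiles < TOTAL_TILES:
        raise ValueError("few_tiles must be between 1 and 23")
    if enr:
        return TOTAL_TILES   # complete switch is on
    if enf:
        return few_tiles     # weak first stage limits the in-rush current
    return 0                 # powered down: all tiles off together
```

Turning on a few tiles first charges the core supply through a weaker switch, which is a plausible reason for the staged sequence, although the slides do not spell the motivation out.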



At power-down, all the tiles are turned off simultaneously.
A proper sequence is important to power down the core safely.
If data retention is desired, L1 cache data is transferred to the L2 cache.
The signals controlling the sleep state of the L2 cache must be latched onto the memory supply, and outputs to the VDDQ6 power domain must be isolated.


This includes the core outputs as well as the internal interface to the L2 cache.
Isolation prevents short-circuit currents and the propagation of unknown logic values to powered domains.
The BHS is turned off last. The power-up sequence resets the core before removing the isolation.
In low-power mode, the LDO can reduce the DSP voltage while the rest of the system stays at a higher voltage.
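
The power-down and power-up ordering described above can be written as a small state model. The class `CoreDomain` and its step labels are hypothetical; the sketch only encodes the order stated in the slides (latch the L2 sleep control, isolate, switch the BHS off; then BHS on, reset, remove isolation).

```python
from typing import List


class CoreDomain:
    """Tracks the state of a power-gated core and logs each step taken."""

    def __init__(self) -> None:
        self.powered = True
        self.isolated = False
        self.log: List[str] = []

    def power_down(self) -> None:
        self.log.append("latch L2 sleep control")  # held on the memory supply
        self.isolated = True
        self.log.append("isolate outputs")
        self.powered = False                       # the BHS goes off last
        self.log.append("BHS off")

    def power_up(self) -> None:
        self.powered = True
        self.log.append("BHS on")
        self.log.append("reset core")              # reset while still isolated
        self.isolated = False
        self.log.append("remove isolation")

    def outputs_defined(self) -> bool:
        """Other domains see known values if outputs are clamped or the core is up."""
        return self.isolated or self.powered
```

In this model `outputs_defined()` stays true at every step, which is the point of isolating before the switch opens and of resetting before the isolation is removed.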


During power-down, the BHS can cut DSP leakage to practically zero.
The LDO does not require an external capacitor.
When the current demand of the core increases, there is no large capacitor to supply charge quickly, so the LDO itself must be faster.
Increasing its gate capacitance, and with it the time required to change its gate voltage, degrades the LDO's ability to respond quickly to transient events.
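
Why a capacitor-less LDO must respond faster follows from the charge balance on the output node: until the regulator reacts, a load step I is supplied by the available capacitance C, so the droop after time t is roughly I·t/C. The sketch below applies that first-order relation; the function names and every number are hypothetical, not measurements from this design.

```python
def droop_volts(current_step_a: float, response_time_s: float,
                capacitance_f: float) -> float:
    """First-order supply droop before the regulator responds: dV = I * t / C."""
    return current_step_a * response_time_s / capacitance_f


def max_response_time_s(current_step_a: float, capacitance_f: float,
                        allowed_droop_v: float) -> float:
    """Longest response time that keeps the droop within the allowed value."""
    return allowed_droop_v * capacitance_f / current_step_a
```

With only on-chip capacitance, C is orders of magnitude smaller than an external capacitor would provide, so the allowed response time shrinks by the same factor for a given droop budget.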


Switching between the LDO and the BHS is allowed while the DSP is running.
In the transition from LDO to BHS, the BHS is turned on progressively; the LDO then no longer regulates its output voltage and can be turned off.
In the transition from BHS to LDO, the digital controller forces the LDO into its minimum-impedance state and the BHS is turned off; the controller then gradually increases the impedance of the LDO until the output voltage drops to its target value.
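
The BHS-to-LDO handover can be pictured with a deliberately crude model: treat the LDO as a variable series resistance R feeding a constant load current, so that the output is V_in minus I·R, and let the controller raise R in steps until the target is reached. A real LDO regulates through a feedback loop; the function name, the resistive model and all values here are assumptions for illustration only.

```python
from typing import List


def ramp_to_target(v_in: float, i_load: float, v_target: float,
                   r_step: float) -> List[float]:
    """Output voltages seen while the series impedance is raised step by step.

    Starts at minimum impedance (R = 0, output equals v_in) and stops at the
    first step where the output is at or below v_target.
    """
    if i_load <= 0 or r_step <= 0 or v_target >= v_in:
        raise ValueError("need positive load and step, and v_target below v_in")
    r = 0.0
    trace = [v_in]
    while trace[-1] > v_target:
        r += r_step
        trace.append(v_in - i_load * r)
    return trace
```

Starting from the minimum-impedance state means the output initially matches what the BHS was delivering, so the core sees no step when the BHS is switched off.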

BHS and LDO


Array Design

The processor has a 32 kB L1 data cache, a 16 kB L1 instruction cache and a unified 256 kB L2 cache.
The L1 cache is virtually indexed and physically tagged.
It is eight-way set associative with 32 bytes per line.
The data cache has a 64-bit load port and a 64-bit store port.
It supports 256-bit evictions and fills.
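
The cache parameters above fix the number of sets and the address bit split. The sketch below does that arithmetic for the 32 kB L1 data cache, assuming the eight-way, 32-byte-line figures apply to it; the function name and the returned dictionary keys are hypothetical.

```python
from typing import Dict


def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> Dict[str, int]:
    """Derive set count and address bit fields for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    for value in (sets, line_bytes):
        if value <= 0 or value & (value - 1):
            raise ValueError("sets and line size must be powers of two")
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # selects a byte in the line
        "index_bits": sets.bit_length() - 1,         # selects the set
    }
```

For 32 kB, eight ways and 32-byte lines this gives 128 sets, 5 offset bits and 7 index bits. A 256-bit fill or eviction is 32 bytes, that is, exactly one line.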


An access takes two cycles.
This eliminates unnecessary accesses to the data array when a read miss occurs.
L1 cache data is stored in an SRAM array.
The DSP is multithreaded, and each thread operates in its own virtual memory region, under a different virtual page number.
A CAM (content-addressable memory) holding VPNs (virtual page numbers) has significantly large area and high power dissipation.
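
One way to read the two-cycle access is that the tag check happens first and the data array is read only on a hit. The sketch below models that interpretation with a counter of data-array reads; it is direct-mapped for brevity (the real cache is eight-way), and the class and method names are hypothetical.

```python
from typing import Dict, Optional


class TagFirstCache:
    """Toy cache that checks the tag before touching the data array."""

    def __init__(self) -> None:
        self.tags: Dict[int, int] = {}    # set index -> stored tag
        self.data: Dict[int, bytes] = {}  # set index -> cache line
        self.data_array_reads = 0         # counts data-array activations

    def fill(self, index: int, tag: int, line: bytes) -> None:
        self.tags[index] = tag
        self.data[index] = line

    def read(self, index: int, tag: int) -> Optional[bytes]:
        # First cycle: tag comparison only.
        if self.tags.get(index) != tag:
            return None                   # miss: data array is never accessed
        # Second cycle: read the data array, now known to be a hit.
        self.data_array_reads += 1
        return self.data[index]
```

Every miss in this model leaves `data_array_reads` unchanged, which is the power saving the bullet refers to.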


The tag and data arrays of the L1 cache must be accessed in a single cycle.
This allows much of the dynamic power consumption and leakage of the instruction cache to scale down with the core voltage.


Conclusion

The power consumption of the DSP is 58 µW/MHz.
Near-threshold computing (NTC) has the potential to improve energy efficiency.
The voltage can be increased significantly without degrading the energy efficiency.
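
Taking the power figure above as 58 microwatts per megahertz (a figure in watts per megahertz would not be physically plausible for a mobile DSP), it converts directly into energy per clock cycle, because 1 µW/MHz equals 1 pJ per cycle. The sketch below does the conversion; the operating frequency in the example is hypothetical and not taken from the slides.

```python
UW_PER_MHZ = 58.0  # slide figure, read here as microwatts per megahertz


def power_mw(freq_mhz: float) -> float:
    """Power in milliwatts at a given clock frequency, assuming linear scaling."""
    return UW_PER_MHZ * freq_mhz / 1000.0


def energy_per_cycle_pj() -> float:
    """Energy per clock cycle in picojoules (1 uW/MHz = 1 pJ per cycle)."""
    return UW_PER_MHZ
```

At a hypothetical 1000 MHz this corresponds to 58 mW, or 58 pJ for every clock cycle.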



QUESTIONS


Thank you.

