
Sub-pJ per Operation Parallel Computing in CMOS

Foundations to Implementation
Luca Benini
IIS-ETHZ & DEI-UNIBO
http://www.pulp-platform.org

IoT Near-Sensor Processing

Sense -> Analyze and Classify -> Transmit

Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT (100 µW - 2 mW)
Analyze and classify: controller (e.g. Cortex-M) + L2 memory + IOs; compute demand 25 MOPS - 112000 MOPS; 1 - 10 mW
Transmit: short range, high BW for low-rate (periodic) data; long range, low BW for SW updates and commands
Idle: ~1 µW, Active: ~50 mW
Battery + harvesting powered -> a few mW power envelope
2


NSP - A quantitative view

Application                          Input BW    Computational demand   Output BW    Compression factor
Image recognition [*Ioffe2015]       607 Kbps    4.00 GOPS              10 bps       60700x
Voice/sound, speech [*VoiceControl]  256 Kbps    100 MOPS               0.02 Kbps    12800x
Inertial, Kalman [*Nilsson2014]      2.4 Kbps    7.7 MOPS               0.02 Kbps    120x
Biometrics, SVM [*Benatti2014]       16 Kbps     150 MOPS               0.08 Kbps    200x
Extremely compact output (single index, alarm, signature)
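The compression factor is simply the ratio of input to output bandwidth; for the image-recognition row, for example:

$$\text{compression} = \frac{\text{input BW}}{\text{output BW}} = \frac{607\,\text{Kbps}}{10\,\text{bps}} = \frac{607{,}000\,\text{bps}}{10\,\text{bps}} \approx 60{,}700\times$$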


Computational power of ULP controllers is not enough: these are parallel workloads.
4

Microcontrollers Landscape (*not exhaustive)

(Figure: landscape of high-performance MCUs and low-power MCUs, with our target marked.)


Near-Threshold Multiprocessing

Minimum energy operation


(Figure, source: Vivek De, Intel, DATE 2013: energy/cycle [nJ] vs. logic Vcc / memory Vcc [V], 32nm CMOS, 25°C, plotting total, dynamic, and leakage energy. The total-energy curve reaches its minimum near threshold, about 4.7x below the energy at nominal supply.)

Near-Threshold Computing (NTC):

1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
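A minimal sketch of why the minimum-energy point appears (standard first-order model, not from the slide): energy per cycle is dynamic plus leakage energy,

$$E_{cycle}(V) \approx C_{eff}\,V^2 \;+\; I_{leak}(V)\cdot V\cdot T_{cycle}(V).$$

Lowering V shrinks the $C_{eff}V^2$ term quadratically, but below threshold $T_{cycle}$ grows very rapidly, so the leakage term dominates and total energy rises again; the optimum sits near the threshold voltage, and the speed lost there is recovered by running many cores in parallel.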
7

Near-Threshold Multiprocessing

4-stage OpenRISC and RISC-V (IMC, BP) cores

(Cluster diagram: N processing elements PE0 ... PEN-1 share a multi-banked L1 TCDM with test-and-set support (banks MB0 ... MBM), a multi-banked shared instruction cache (I$B0 ... I$Bk) behind a demux, a tightly coupled DMA, and a peripheral/external-memory port.)

Shared L1 data memory + atomic variables
1..8 PEs per cluster, 1..32 clusters
NT but parallel -> maximum energy efficiency when active
+ strong power management for (partial) idleness


8

Near-threshold FDSOI technology

Body bias: Highly effective knob for power & variability management!
9

Body Biasing for Variability Management

(Figure: process variation and temperature inversion from -40°C to 120°C on RVT transistors; roughly 100x spread @ 0.5V, e.g. 25 MHz vs. 7 MHz (3σ); FBB/RBB used to compensate.)

10

Selective, fine-grained BB

The cluster is partitioned into separate clock-gating and body-bias regions.
Body bias multiplexers (BBMUXes) control the well voltages of each region.
A Power Management Unit (PMU) automatically manages transitions between the operating modes.

Power modes of each region:
- Boost mode: active + FBB
- Normal mode: active + no BB
- Idle mode: clock gated + no BB (in LVT) / RBB (in RVT)
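A minimal sketch of how software might request these modes, assuming a memory-mapped PMU mode register per region (the register address and encoding below are illustrative placeholders, not the actual PULP PMU interface; the PMU itself sequences the FBB/RBB and clock-gating transitions in hardware):

#include <stdint.h>

/* Illustrative per-region power-mode encoding (assumed values). */
typedef enum {
    MODE_BOOST  = 2,   /* active, forward body bias                */
    MODE_NORMAL = 1,   /* active, no body bias                     */
    MODE_IDLE   = 0    /* clock gated, no BB (LVT) / RBB (RVT)     */
} bb_mode_t;

#define PMU_REGION_MODE(r)  (*(volatile uint32_t *)(0x10200000u + 4u * (r)))  /* assumed address */

static inline void set_region_mode(int region, bb_mode_t m) {
    PMU_REGION_MODE(region) = (uint32_t)m;   /* PMU performs the actual mode transition */
}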


11

ULP (NT) Bottleneck: Memory


256x32 6T SRAMs vs. SCM

Standard 6T SRAMs:
- High VDDMIN -> bottleneck for energy efficiency

Near-threshold SRAMs (8T):
- Lower VDDMIN
- Area/timing overhead (25%-50%)
- High active energy
- Low technology portability

Standard Cell Memories (SCM):
- Wide supply voltage range
- Lower read/write energy (2x - 4x)
- Easy technology portability
- Major area overhead (4x)

12

Heterogeneous + Reconfigurable Memory Architecture

(Diagram: each PE has an L0 buffer with prefetcher and an MMU; the L1 TCDM is built from SCM banks (SCM0 ... SCMM-1) and SRAM banks (SRAM0 ... SRAMM-1); shared multi-banked I$ (I$B0 ... I$Bk).)

MMU (logical/physical address map):
- Interleaved/private addresses
- Shutdown of SRAM banks

Reconfigurable pipeline stages to cope with SRAM degradation @ low VDD
L0 buffer combined with a prefetcher
Shared multi-ported SCM I$
SCM for part of the TCDM, to widen the operating range and boost energy efficiency; SRAM for density


13

HW Synchronizer
(Figure: four plots of CPU cycles vs. #threads for the OpenMP #parallel, #barrier, #critical, and #sections constructs, comparing the hardware event unit (EU-v2) against the software LibGOMP runtime; the y-axes range up to roughly 800-2000 cycles, with EU-v2 far below LibGOMP at every thread count.)
Cost of OpenMP runtime reduced by more than one order of magnitude


Better scalability with number of cores
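For reference, a minimal OpenMP kernel exercising the four constructs benchmarked above (plain OpenMP C, nothing PULP-specific):

#include <omp.h>

void sync_primitives_demo(int *shared_counter)
{
    #pragma omp parallel              /* #parallel: fork/join of the thread team       */
    {
        #pragma omp barrier           /* #barrier: all threads rendezvous here         */

        #pragma omp critical          /* #critical: mutually exclusive update          */
        (*shared_counter)++;

        #pragma omp sections          /* #sections: work units split across threads    */
        {
            #pragma omp section
            { /* task A */ }
            #pragma omp section
            { /* task B */ }
        }
    }
}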

14

ISA Extensions
Sensor data never comes in 32-bit words:
1. Dot product between vectors
2. Saturation to [-2^n or 0 ; 2^n - 1]
3. Mul/Add/Sub plus round and normalization
4. Shuffle operations for vectors
5. Packed-SIMD ALU operations
6. Bit manipulations
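A minimal sketch (plain C, no PULP intrinsics assumed) of the kind of inner loop these extensions collapse into a few instructions: a Q1.15 fixed-point dot product with multiply-accumulate, round/normalize, and saturation.

#include <stdint.h>

#define Q 15                           /* Q1.15 fixed-point format (assumption) */

static int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;   /* saturate to [-2^15, 2^15 - 1] */
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

int16_t dot_q15(const int16_t *a, const int16_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];       /* MAC: candidate for a SIMD dot-product op */
    acc = (acc + (1 << (Q - 1))) >> Q;     /* round and normalize back to Q1.15        */
    return sat16(acc);                     /* clip instead of wrapping around          */
}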


The Evolution of the Species

                    PULPv1                    PULPv2                    PULPv3
# of cores          4                         4                         4
L2 memory           16 kB                     64 kB                     128 kB
TCDM                16 kB SRAM                32 kB SRAM + 8 kB SCM     32 kB SRAM + 16 kB SCM
DVFS                no                        yes                       yes
I$                  4 kB SRAM, private        4 kB SCM, private         4 kB SCM, shared
DSP extensions      no                        no                        yes
HW synchronizer     no                        no                        yes
Status              silicon proven            silicon proven            post tape-out
Technology          FD-SOI 28nm, conv. well   FD-SOI 28nm, flip-well    FD-SOI 28nm, conv. well
Voltage range       0.45 V - 1.2 V            0.3 V - 1.2 V             0.5 V - 0.7 V
BB range            -1.8 V - 0.9 V            0.0 V - 1.8 V             -1.8 V - 0.9 V
Max freq.           475 MHz                   1 GHz                     200 MHz
Max perf.           1.9 GOPS                  4 GOPS                    1.8 GOPS
Peak en. eff.       60 GOPS/W                 135 GOPS/W                385 GOPS/W

16


Cluster Energy Efficiency

193 MOPS/mW @ 40.5 MHz, 0.46V, 0V FBB, 840 µW
10 mW @ 1 GOPS, i.e. 100 MOPS/mW, @ 0.66V, 0.5V FBB
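The two operating points are internally consistent (a quick check, not from the slide):

$$\frac{1\ \mathrm{GOPS}}{10\ \mathrm{mW}} = 100\ \mathrm{MOPS/mW}, \qquad 193\ \mathrm{MOPS/mW}\times 0.84\ \mathrm{mW}\approx 162\ \mathrm{MOPS}\approx 4\ \mathrm{ops/cycle}\times 40.5\ \mathrm{MHz}$$

(four cores at roughly one operation per cycle).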


18

Approximate Computing to the Rescue

Approximate?
Less-than-perfect results perceived as correct by the users
e.g. image processing (filtering)

RGB to GRAYSCALE

RGB to GRAYSCALE (+ 10% error)

Approximation is not always acceptable


Application and program phase dependent!
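A minimal sketch of the kind of error-tolerant kernel meant here (standard luma-weighted RGB-to-grayscale, integer-only): a flipped low-order bit in a pixel shifts its gray level by only a few counts, which is why such filtering kernels tolerate approximation.

#include <stdint.h>

/* ITU-R BT.601-style luma weights, scaled to /256 integer arithmetic. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, int n_pixels) {
    for (int i = 0; i < n_pixels; i++) {
        uint32_t r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);  /* ~0.299R + 0.587G + 0.114B */
    }
}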
20

Approximate Storage?
Retention voltage: SCM 0.25V, 6T-SRAM 0.29V

Probability of a bit-flip error on a single bit during read/write operations:

Voltage (V)        0.50     0.55     0.60     0.65      0.70      0.75      0.80
P(flip-bit) SCM    0.0      0.0      0.0      0.0       0.0       0.0       0.0
P(flip-bit) 6T     0.0037   0.0012   0.0003   5.24e-5   4.35e-6   4.16e-8   0.0
21

TCDM Design

EFFICIENCY GOAL: switch down the voltage of the 6T-SRAM domains to save energy.
APPROXIMATION: the least significant bytes (LSBs) of error-tolerant variables are kept in 6T-SRAM (split region); the mapping can be re-configured on-the-fly by SW.
RELIABILITY MANAGEMENT UNITS (RMUs): combinational logic blocks used to remap the physical address range of the TCDM into logical memory areas.
VOLTAGE DOMAINS: separate 6T-SRAM domains (banks #0-#15 each) plus the SCM banks, connected to the interconnect through level shifters.

Voltage levels: 0.8V is reliable, 0.5V is not reliable (for 6T-SRAM).

Logical memory areas (diagram, address map 0x0000 - 0x14000):
- SCM:      0.5V, reliable
- 6T-SRAM:  0.8V, reliable
- SPLIT:    0.5V, MSBs reliable (SCM), LSBs not reliable (6T-SRAM)
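A hedged illustration of what the "split" logical area achieves. In the real design the RMUs do this remapping transparently in hardware; the sketch below only mimics the effect in software, and the split granularity (16/16 bits) and array sizes are assumptions.

#include <stdint.h>

static uint16_t msb_part[1024];   /* stands in for the SCM (reliable) region      */
static uint16_t lsb_part[1024];   /* stands in for the 0.5V 6T-SRAM (relaxed) region */

static inline void split_store(int idx, uint32_t v) {
    msb_part[idx] = (uint16_t)(v >> 16);     /* protected: no large-magnitude error possible */
    lsb_part[idx] = (uint16_t)(v & 0xFFFF);  /* tolerant: a bit flip perturbs the value by < 2^16 */
}

static inline uint32_t split_load(int idx) {
    return ((uint32_t)msb_part[idx] << 16) | lsb_part[idx];
}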

Programming model

We propose an extension to OpenMP:
- a #pragma omp tolerant directive to annotate statements
- a var_list() clause to specify error-tolerant variables

int main()
{
    int sparse_M[];
    int i = 0, index;
    while (func(i))
    {
        index = compute_index(i);
        #pragma omp tolerant var_list(sparse_M)
        sparse_M[index] = compute_element(sparse_M, index);
        update(i);
    }
    output_value(sparse_M);
}

Marking other variables (e.g. index) as tolerant, or applying tolerant computation to them, would probably lead to fatal errors!

Compiler support

Clang + LLVM: directives are translated into annotated tokens of the intermediate representation (LLVM IR).

Three variable sets to allocate:
- Tolerant variables (TV) - the LSB part could stay in 6T-SRAM:
  - variable live range overlaps with tolerant regions
  - at least one var_list clause contains the variable
- Non-tolerant variables (RV) - strictly in SCM:
  - variable live range overlaps with tolerant regions
  - no var_list clause contains the variable
- Safe variables (SV) - no allocation constraints:
  - variable live range does not overlap with any tolerant region

Two operation modes for each 6T-SRAM domain (low-voltage and high-voltage), with split-area control.
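The classification rule above can be summarized as a small decision function. This is only a sketch of the stated rule: the helper predicates are hypothetical, and the real pass operates on LLVM IR, not on this C-level view.

typedef enum { TOLERANT_TV, NONTOLERANT_RV, SAFE_SV } var_class_t;

var_class_t classify(int overlaps_tolerant_region, int in_some_var_list) {
    if (!overlaps_tolerant_region)
        return SAFE_SV;          /* never live inside a tolerant region: no constraint  */
    if (in_some_var_list)
        return TOLERANT_TV;      /* listed as tolerant: LSB half may go to 6T-SRAM      */
    return NONTOLERANT_RV;       /* live in a tolerant region but not listed:
                                    must stay entirely in reliable SCM                  */
}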

Hardware evaluation

Energy consumption (data memory) and area compared to other solutions. (Chart: average energy, our energy, and our area across memory technologies and applications.)

Accuracy

Why not simply discard the LSBs? Because of application accuracy constraints.

Pushing Beyond pJ/OP

Recovering more silicon efficiency (figure): energy efficiency spans from ~1 GOPS/W for general-purpose computing (SW, CPU), through throughput computing (mixed, GPGPU), to >100 GOPS/W for dedicated HW IPs; the region in between is the accelerator gap.

Closing the accelerator efficiency gap with agile customization.

28

Learn to Accelerate

Brain-inspired systems (deep convolutional networks) are high performers in many tasks over many domains.

Image recognition [Russakovsky et al., 2014]: CNN 93.4% accuracy (ImageNet 2014) vs. human 85% (untrained) and 94.9% (trained) [Karpathy15].
Speech recognition [Hannun et al., 2014].

Flexible acceleration: the learned CNN weights are the program.

29

Computational Effort

(Figure: computational effort breakdown, ~90%.)
- 7.5 GOp for a 320x240 image
- 260 GOp for FHD
- 1050 GOp for 4k UHD

Origami chip.
30

Origami: A CNN Accelerator

Is floating point needed? 12-bit signals are sufficient: from input to classification, double-precision vs. 12-bit gives an accuracy loss of <0.5% (80.6% -> 80.1%).
31

Sub pJ/OP?

0% bit flips: 437 GOPS/W @ 1.2V, 803 GOPS/W @ 0.8V
1% bit flips: 1.84x energy improvement -> 1.2 pJ/OP
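These figures are mutually consistent (a quick check, not from the slide):

$$\frac{803\ \mathrm{GOPS/W}}{437\ \mathrm{GOPS/W}} \approx 1.84, \qquad \frac{1}{803\ \mathrm{GOPS/W}} = \frac{1}{8.03\times 10^{11}\ \mathrm{op/J}} \approx 1.25\ \mathrm{pJ/op} \approx 1.2\ \mathrm{pJ/OP}.$$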
32

Pushing Further: YodaNN(1)

Approximation on the algorithmic side: binary weights.
BinaryConnect [Courbariaux, NIPS15]:
- Reduce weights to a binary value -1/+1
- Stochastic gradient descent with binarization in the forward path:

$$w_b = \begin{cases} +1 & \text{with probability } p = \sigma(w)\\ -1 & \text{with probability } 1-p \end{cases} \;\text{(stochastic)} \qquad
w_b = \begin{cases} +1 & w \ge 0\\ -1 & w < 0 \end{cases} \;\text{(deterministic)}$$

- Learning large networks is still an issue with BinaryConnect

Ultra-optimized HW is possible!
- Power reduction because of arithmetic simplification
- Major arithmetic density improvements
- Area can be used for more energy-efficient weight storage
- SCM memories allow lower-voltage operation; dynamic energy scales with V^2

(1) After the Jedi Master from Star Wars: "Small in size but wise and powerful" (cit. www.starwars.com)
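A quick worked number for the voltage-scaling argument (standard first-order CMOS relation, not from the slide): switching energy scales quadratically with supply, so

$$E_{dyn} \propto C\,V_{DD}^{2} \;\Rightarrow\; \left(\frac{1.2\ \mathrm{V}}{0.6\ \mathrm{V}}\right)^{2} = 4\times$$

lower energy per access when the SCM-based weight storage runs at 0.6 V instead of 1.2 V (leakage and timing effects excluded).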
33

SoP-Unit Optimization

(Figure: image bank and filter bank feed the sum-of-products (SoP) units; one unit is equivalent to a 7x7 SoP, with image mapping supporting 3x3, 5x5 and 7x7 filters.)
34

SCM for Energy Efficiency

Same area: 832 SoP units + all-SCM storage -> 12x energy-efficiency improvement.

(Figure: core energy efficiency [GOp/s/mW] for the 8x8 configuration vs. supply voltage, 0.5 V - 1.3 V; SCM allows operation down to 0.6 V in 65 nm.)
35

Comparison with SoA

Publication   Throughput [GOPS]   En. Eff. [GOPS/W]   Supply [V]   Area Eff. [GOPS/MGE]
Neuflow       320                 490                 1.0          17
Leuven        102                 2600                0.5 - 1.1    64
Eyeriss       84                  160                 0.8 - 1.2    46
NINEX         569                 1800                1.2          51
k-Brain       411                 1930                1.2          109
Origami       196/74              437/803             1.2/0.8      90/34
This Work*    1510/55             9800/61200          1.2/0.6      1135/41

This work vs. the best prior entries: 2.7x throughput, 23x energy efficiency, 10x area efficiency.

*UMC 65nm technology, post place & route, 25°C, tt corner, 1.2V/0.6V

Breakthrough for ultra-low-power CNN ASIC implementation.
fJ/op in sight: manufacturing in Globalfoundries 22FDX.


36

Heterogeneous PULP Cluster

Hardware Convolutional Engine (HWCE) in the cluster: SoP units and a line buffer inside an HWPE wrapper with its control block, attached to the logarithmic interconnect for data and to the peripheral interconnect for control.


37

Heterogeneous PULP Cluster

Hardware Convolutional Engine (HWCE) in the cluster: ~6.9 mm2 PULP chip, 1/2016.
- 4 OR10Nv2 cores
- 1 HW crypto accelerator
- 1 CNN HWCE (SoP units, line buffer, control)
- Greatly improved DMA engine + uDMA
- ~1514 kGE for the cluster, of which 232 kGE for the CNN HWCE


38

HWCE CNN Performance

Average performance and energy efficiency on a 32x16 CNN frame:
- Performance: 8 GOPS (61x)
- Energy efficiency: 6500 GOPS/W (47x)

PULPv3 architecture; corner: tt 28nm, 25°C, VDD = 0.5V, FBB = 0.5V



39


Doing Nothing (very) Well

Sleepwalker IO interface.

(SoC block diagram: the cluster - OpenRISC cores with debug units, shared I$, TCDM banks and TCDM interconnect, DMA, HWCE, CRYPT, cluster peripherals, DC FIFOs to the SoC - next to the SoC side with L2 memory, SoC controller, FLLs and FLL control, ROM, uDMA, APB peripherals (SPI master/slave, UART, I2C, I2S, GPIO), JTAG TAP / advanced debug IF / JTAG2AXI, pad mux, and AXI bridges (64-bit cluster bus, 32-bit SoC bus, AXI-to-APB, AXI2MEM).)

The micro-DMA (uDMA) gives a major boost to cluster sleep time.



42

Sleepwalking on PULP

GOAL: automatically manage shutdown of cores and cluster during data transfers from peripherals to L2.

Programming sequence (a minimal sketch follows below):
1) Set the events mask
2) Program the transfer
3) Trigger the transfer
4) Shut down the cores
5) If all cores are idle, the cluster automatically shuts down (power gating)

48 maskable events: general purpose, DMA, timers, peripherals (SPI, I2C, GPIO, ...)

(Diagram: PE0-PE3, HW synchronizer and event unit, SPI, peripheral interconnect, L2 memory.)
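A hedged sketch of the sequence above. The function and event names are illustrative placeholders, not the actual PULP HAL API.

#include <stdint.h>

/* Placeholder prototypes (not the real driver interface). */
void eu_set_mask(uint32_t events);
void udma_spi_rx_setup(void *dst, uint32_t len);
void udma_spi_rx_start(void);
void cluster_wait_event(uint32_t events);

#define EVT_UDMA_SPI_DONE  (1u << 5)   /* hypothetical event ID */

void spi_transfer_while_sleeping(void *l2_dst, uint32_t len)
{
    eu_set_mask(EVT_UDMA_SPI_DONE);          /* 1) unmask the wake-up event          */
    udma_spi_rx_setup(l2_dst, len);          /* 2) program the uDMA transfer         */
    udma_spi_rx_start();                     /* 3) trigger it                        */
    cluster_wait_event(EVT_UDMA_SPI_DONE);   /* 4) core gates its clock and sleeps;
                                                5) with all cores idle, the cluster
                                                   is power-gated automatically      */
    /* execution resumes here once the uDMA has filled the L2 buffer */
}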



43

Back to System-Level

Smart visual sensor: idle most of the time (nothing interesting to see).
GrainCam (mixed-signal event-based imager) + PULPv3 (digital parallel processor).

- Event-driven computation, which occurs only when relevant events are detected by the sensor
- Event-based sensor interface to minimize IO energy (vs. a frame-based interface)
- Mixed-signal event triggering with an ULP imager with internal analog/mixed-signal (AMS) processing capability
- Doing nothing VERY well

GrainCam imager (FBK): pixel-level spatial-contrast extraction, analog internal image processing:
- Contrast extraction
- Motion extraction: differencing two successive frames
- Background subtraction: differencing the reference image, stored in memory, against the current frame
45

Energy-Efficient Sensor IF

Event-based + prefiltering + buffering.

Control block:
- Always-on 32 kHz controller that continuously handles sensor timing signals (e.g. integration and readout periods) and control signals (e.g. readout mode, reference-frame sampling)
- Activates the data-path block during the imager readout phase
- Triggers events

Data path:
- Stream processing on sensed post-triggering data
- The uDMA autonomously transfers pixel events to L2 memory during imager readout, without requiring cluster action; the sensor IF peripheral acts as the transfer initiator
- A slave control port is used to set up the sensor-interface parameters

(Diagram: imager IF -> input stage -> stream processor -> uDMA -> L2 over the SoC bus/APB; the control block runs off the 32 kHz clock, raises interrupts and a wake-up event to the always-on SoC controller, and hands data events to the cluster event unit, DMA and TCDM.)

Open Source

Parallel ULP computing for the IoT: a (sub)-pJ/op computing platform - let's make it open!
- Processor & hardware IPs
- Compiler infrastructure
- Virtualization layer
- Programming model
47

ETHZ and UNIBO released PULPino

Hundreds of git forks in a few weeks!


Integrated Systems Laboratory

48

Conclusions

IoT: a challenge and an opportunity - computing everywhere!

Energy-efficiency requirement: pJ/OP and below.
- Technology scaling alone is not doing the job for us
- Ultra-low-power architectures and circuits are needed, but not sufficient in the long run
- Most promising technologies: 3D integration, low-leakage, non-volatile silicon-compatible memory/computing devices - but the adoption rate will be SLOW

Approximate computing to the rescue:
- A holistic approach is needed - math, number systems, algorithms, tools, runtime, architecture, circuits, devices - from approximate to transprecision computing

Non-Von-Neumann + transprecision: jackpot potential!

Open-source HW & SW approach for building an innovation ecosystem.
49

Thanks for your attention!!!


www.pulp-platform.org
www-micrel.deis.unibo.it/pulp-project
iis-projects.ee.ethz.ch/index.php/PULP

PULP related chips

Main PULP chips (ST 28nm FDSOI):
- PULPv1 (see slide 7)
- PULPv2 (see slide 7)
- PULPv3 (under test)
- PULPv4 (in progress)
- Sir10us, Or10n

Mixed-signal systems (SMIC 130nm):
- VivoSoC, VivoSoC2
- EdgeSoC (in planning)

PULP development (UMC 65nm):
- Artemis - IEEE 754 FPU
- Hecate - shared FPU
- Selene - logarithmic number system FPU
- Diana - approximate FPU
- Mia Wallace - full system
- Imperio - PULPino chip
- Fulmine - secure cluster + CNN

Early building blocks (UMC 180nm) and IcySoC approximate-computing platforms (ALP 180nm):
- Diego, Manny, Sid

RISC-V based systems (GF 28nm):
- Honey Bunny

More to come.

(Chip photos labeled with their technology nodes: 28nm, 65nm, 130nm, 180nm.)

51

Example

int main()
{
    int sparse_A[];
    int i = 0, index;        /* marking index as tolerant would probably lead to fatal errors! */
    while (func(i))
    {
        index = compute_index(i);
        #pragma omp tolerant var_list(sparse_A)
        sparse_A[index] = compute_element(sparse_A, index);
        update(i);           /* tolerant computation here would probably lead to fatal errors! */
    }
    use_value(sparse_A);
}


52
