
Sub-pJ per Operation Parallel Computing in CMOS

Foundations to Implementation
Luca Benini
IIS-ETHZ & DEI-UNIBO
http://www.pulp-platform.org

IoT Near-Sensor Processing

Sense -> Analyze and Classify -> Transmit

Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT (100 µW - 2 mW)
Analyze and classify: controller (e.g. Cortex-M) + L2 memory + IOs; compute demand 25 MOPS - 112000 MOPS; 1 - 10 mW
Transmit: short range, high BW for low-rate (periodic) data; long range, low BW for SW updates and commands
Idle: ~1 µW, Active: ~50 mW
Battery + harvesting powered -> a few mW power envelope
2


NSP - A quantitative view

Application                          Input BW    Computational demand   Output BW    Compression factor
Image recognition [*Ioffe2015]       607 Kbps    4.00 GOPS              10 bps       60700x
Voice/sound, speech [*VoiceControl]  256 Kbps    100 MOPS               0.02 Kbps    12800x
Inertial, Kalman [*Nilsson2014]      2.4 Kbps    7.7 MOPS               0.02 Kbps    120x
Biometrics, SVM [*Benatti2014]       16 Kbps     150 MOPS               0.08 Kbps    200x
Extremely compact output (single index, alarm, signature)
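The compression factor is simply the ratio of input to output bandwidth; for the image-recognition row, for example:

$$\text{compression} = \frac{\text{input BW}}{\text{output BW}} = \frac{607\,\text{Kbps}}{10\,\text{bps}} = \frac{607{,}000\,\text{bps}}{10\,\text{bps}} \approx 60{,}700\times$$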


Computational power of ULP controllers is not enough: these are parallel workloads.
4

Microcontrollers Landscape (*not exhaustive)

(Figure: landscape of high-performance MCUs and low-power MCUs, with our target marked.)


Near-Threshold Multiprocessing

Minimum energy operation


(Figure, source: Vivek De, Intel, DATE 2013: energy/cycle [nJ] vs. logic Vcc / memory Vcc [V], 32nm CMOS, 25°C, plotting total, dynamic, and leakage energy. The total-energy curve reaches its minimum near threshold, about 4.7x below the energy at nominal supply.)

Near-Threshold Computing (NTC):

1. Don't waste energy pushing devices into strong inversion
2. Recover performance with parallel execution
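A minimal sketch of why the minimum-energy point appears (standard first-order model, not from the slide): energy per cycle is dynamic plus leakage energy,

$$E_{cycle}(V) \approx C_{eff}\,V^2 \;+\; I_{leak}(V)\cdot V\cdot T_{cycle}(V).$$

Lowering V shrinks the $C_{eff}V^2$ term quadratically, but below threshold $T_{cycle}$ grows very rapidly, so the leakage term dominates and total energy rises again; the optimum sits near the threshold voltage, and the speed lost there is recovered by running many cores in parallel.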
7

Near-Threshold Multiprocessing

4-stage OpenRISC and RISC-V (IMC, BP) cores

(Cluster diagram: N processing elements PE0 ... PEN-1 share a multi-banked L1 TCDM with test-and-set support (banks MB0 ... MBM), a multi-banked shared instruction cache (I$B0 ... I$Bk) behind a demux, a tightly coupled DMA, and a peripheral/external-memory port.)

Shared L1 data memory + atomic variables
1..8 PEs per cluster, 1..32 clusters
NT but parallel -> maximum energy efficiency when active
+ strong power management for (partial) idleness


8

Near-threshold FDSOI technology

Body bias: Highly effective knob for power & variability management!
9

Body Biasing for Variability Management

(Figure: process variation and temperature inversion from -40°C to 120°C on RVT transistors; roughly 100x spread @ 0.5V, e.g. 25 MHz vs. 7 MHz (3σ); FBB/RBB used to compensate.)

10

Selective, fine-grained BB

The cluster is partitioned into separate clock-gating and body-bias regions.
Body bias multiplexers (BBMUXes) control the well voltages of each region.
A Power Management Unit (PMU) automatically manages transitions between the operating modes.

Power modes of each region:
- Boost mode: active + FBB
- Normal mode: active + no BB
- Idle mode: clock gated + no BB (in LVT) / RBB (in RVT)
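A minimal sketch of how software might request these modes, assuming a memory-mapped PMU mode register per region (the register address and encoding below are illustrative placeholders, not the actual PULP PMU interface; the PMU itself sequences the FBB/RBB and clock-gating transitions in hardware):

#include <stdint.h>

/* Illustrative per-region power-mode encoding (assumed values). */
typedef enum {
    MODE_BOOST  = 2,   /* active, forward body bias                */
    MODE_NORMAL = 1,   /* active, no body bias                     */
    MODE_IDLE   = 0    /* clock gated, no BB (LVT) / RBB (RVT)     */
} bb_mode_t;

#define PMU_REGION_MODE(r)  (*(volatile uint32_t *)(0x10200000u + 4u * (r)))  /* assumed address */

static inline void set_region_mode(int region, bb_mode_t m) {
    PMU_REGION_MODE(region) = (uint32_t)m;   /* PMU performs the actual mode transition */
}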


11

ULP (NT) Bottleneck: Memory


256x32 6T SRAMs vs. SCM

Standard 6T SRAMs:
- High VDDMIN -> bottleneck for energy efficiency

Near-threshold SRAMs (8T):
- Lower VDDMIN
- Area/timing overhead (25%-50%)
- High active energy
- Low technology portability

Standard Cell Memories (SCM):
- Wide supply voltage range
- Lower read/write energy (2x - 4x)
- Easy technology portability
- Major area overhead (4x)

12

Heterogeneous + Reconfigurable Memory Architecture

(Diagram: each PE has an L0 buffer with prefetcher and an MMU; the L1 TCDM is built from SCM banks (SCM0 ... SCMM-1) and SRAM banks (SRAM0 ... SRAMM-1); shared multi-banked I$ (I$B0 ... I$Bk).)

MMU (logical/physical address map):
- Interleaved/private addresses
- Shutdown of SRAM banks

Reconfigurable pipeline stages to cope with SRAM degradation @ low VDD
L0 buffer combined with a prefetcher
Shared multi-ported SCM I$
SCM for part of the TCDM, to widen the operating range and boost energy efficiency; SRAM for density


13

HW Synchronizer
(Figure: four plots of CPU cycles vs. #threads for the OpenMP #parallel, #barrier, #critical, and #sections constructs, comparing the hardware event unit (EU-v2) against the software LibGOMP runtime; the y-axes range up to roughly 800-2000 cycles, with EU-v2 far below LibGOMP at every thread count.)
Cost of OpenMP runtime reduced by more than one order of magnitude


Better scalability with number of cores
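For reference, a minimal OpenMP kernel exercising the four constructs benchmarked above (plain OpenMP C, nothing PULP-specific):

#include <omp.h>

void sync_primitives_demo(int *shared_counter)
{
    #pragma omp parallel              /* #parallel: fork/join of the thread team       */
    {
        #pragma omp barrier           /* #barrier: all threads rendezvous here         */

        #pragma omp critical          /* #critical: mutually exclusive update          */
        (*shared_counter)++;

        #pragma omp sections          /* #sections: work units split across threads    */
        {
            #pragma omp section
            { /* task A */ }
            #pragma omp section
            { /* task B */ }
        }
    }
}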

14

ISA Extensions
Sensor data never comes in 32-bit words:
1. Dot product between vectors
2. Saturation to [-2^n or 0 ; 2^n - 1]
3. Mul/Add/Sub plus round and normalization
4. Shuffle operations for vectors
5. Packed-SIMD ALU operations
6. Bit manipulations
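A minimal sketch (plain C, no PULP intrinsics assumed) of the kind of inner loop these extensions collapse into a few instructions: a Q1.15 fixed-point dot product with multiply-accumulate, round/normalize, and saturation.

#include <stdint.h>

#define Q 15                           /* Q1.15 fixed-point format (assumption) */

static int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;   /* saturate to [-2^15, 2^15 - 1] */
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

int16_t dot_q15(const int16_t *a, const int16_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];       /* MAC: candidate for a SIMD dot-product op */
    acc = (acc + (1 << (Q - 1))) >> Q;     /* round and normalize back to Q1.15        */
    return sat16(acc);                     /* clip instead of wrapping around          */
}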


The Evolution of the Species

                    PULPv1                    PULPv2                    PULPv3
# of cores          4                         4                         4
L2 memory           16 kB                     64 kB                     128 kB
TCDM                16 kB SRAM                32 kB SRAM + 8 kB SCM     32 kB SRAM + 16 kB SCM
DVFS                no                        yes                       yes
I$                  4 kB SRAM, private        4 kB SCM, private         4 kB SCM, shared
DSP extensions      no                        no                        yes
HW synchronizer     no                        no                        yes
Status              silicon proven            silicon proven            post tape-out
Technology          FD-SOI 28nm, conv. well   FD-SOI 28nm, flip-well    FD-SOI 28nm, conv. well
Voltage range       0.45 V - 1.2 V            0.3 V - 1.2 V             0.5 V - 0.7 V
BB range            -1.8 V - 0.9 V            0.0 V - 1.8 V             -1.8 V - 0.9 V
Max freq.           475 MHz                   1 GHz                     200 MHz
Max perf.           1.9 GOPS                  4 GOPS                    1.8 GOPS
Peak en. eff.       60 GOPS/W                 135 GOPS/W                385 GOPS/W

16


Cluster Energy Efficiency

193 MOPS/mW @ 40.5 MHz, 0.46V, 0V FBB, 840 µW
10 mW @ 1 GOPS, i.e. 100 MOPS/mW, @ 0.66V, 0.5V FBB
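The two operating points are internally consistent (a quick check, not from the slide):

$$\frac{1\ \mathrm{GOPS}}{10\ \mathrm{mW}} = 100\ \mathrm{MOPS/mW}, \qquad 193\ \mathrm{MOPS/mW}\times 0.84\ \mathrm{mW}\approx 162\ \mathrm{MOPS}\approx 4\ \mathrm{ops/cycle}\times 40.5\ \mathrm{MHz}$$

(four cores at roughly one operation per cycle).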


18

Approximate Computing to the Rescue

Approximate?
Less-than-perfect results perceived as correct by the users
e.g. image processing (filtering)

RGB to GRAYSCALE

RGB to GRAYSCALE (+ 10% error)

Approximation is not always acceptable


Application and program phase dependent!
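A minimal sketch of the kind of error-tolerant kernel meant here (standard luma-weighted RGB-to-grayscale, integer-only): a flipped low-order bit in a pixel shifts its gray level by only a few counts, which is why such filtering kernels tolerate approximation.

#include <stdint.h>

/* ITU-R BT.601-style luma weights, scaled to /256 integer arithmetic. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, int n_pixels) {
    for (int i = 0; i < n_pixels; i++) {
        uint32_t r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);  /* ~0.299R + 0.587G + 0.114B */
    }
}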
20

Approximate Storage?
Retention voltage: SCM 0.25V, 6T-SRAM 0.29V

Probability of a bit-flip error on a single bit during read/write operations:

Voltage (V)        0.50     0.55     0.60     0.65      0.70      0.75      0.80
P(flip-bit) SCM    0.0      0.0      0.0      0.0       0.0       0.0       0.0
P(flip-bit) 6T     0.0037   0.0012   0.0003   5.24e-5   4.35e-6   4.16e-8   0.0
21

TCDM Design

EFFICIENCY GOAL: switch down the voltage of the 6T-SRAM domains to save energy.
APPROXIMATION: the least significant bytes (LSBs) of error-tolerant variables are kept in 6T-SRAM (split region); the mapping can be re-configured on-the-fly by SW.
RELIABILITY MANAGEMENT UNITS (RMUs): combinational logic blocks used to remap the physical address range of the TCDM into logical memory areas.
VOLTAGE DOMAINS: separate 6T-SRAM domains (banks #0-#15 each) plus the SCM banks, connected to the interconnect through level shifters.

Voltage levels: 0.8V is reliable, 0.5V is not reliable (for 6T-SRAM).

Logical memory areas (diagram, address map 0x0000 - 0x14000):
- SCM:      0.5V, reliable
- 6T-SRAM:  0.8V, reliable
- SPLIT:    0.5V, MSBs reliable (SCM), LSBs not reliable (6T-SRAM)
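A hedged illustration of what the "split" logical area achieves. In the real design the RMUs do this remapping transparently in hardware; the sketch below only mimics the effect in software, and the split granularity (16/16 bits) and array sizes are assumptions.

#include <stdint.h>

static uint16_t msb_part[1024];   /* stands in for the SCM (reliable) region      */
static uint16_t lsb_part[1024];   /* stands in for the 0.5V 6T-SRAM (relaxed) region */

static inline void split_store(int idx, uint32_t v) {
    msb_part[idx] = (uint16_t)(v >> 16);     /* protected: no large-magnitude error possible */
    lsb_part[idx] = (uint16_t)(v & 0xFFFF);  /* tolerant: a bit flip perturbs the value by < 2^16 */
}

static inline uint32_t split_load(int idx) {
    return ((uint32_t)msb_part[idx] << 16) | lsb_part[idx];
}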

Programming model

We propose an extension to OpenMP:
- a #pragma omp tolerant directive to annotate statements
- a var_list() clause to specify error-tolerant variables

int main()
{
    int sparse_M[];
    int i = 0, index;
    while (func(i))
    {
        index = compute_index(i);
        #pragma omp tolerant var_list(sparse_M)
        sparse_M[index] = compute_element(sparse_M, index);
        update(i);
    }
    output_value(sparse_M);
}

Marking other variables (e.g. index) as tolerant, or applying tolerant computation to them, would probably lead to fatal errors!

Compiler support

Clang + LLVM: directives are translated into annotated tokens of the intermediate representation (LLVM IR).

Three variable sets to allocate:
- Tolerant variables (TV) - the LSB part could stay in 6T-SRAM:
  - variable live range overlaps with tolerant regions
  - at least one var_list clause contains the variable
- Non-tolerant variables (RV) - strictly in SCM:
  - variable live range overlaps with tolerant regions
  - no var_list clause contains the variable
- Safe variables (SV) - no allocation constraints:
  - variable live range does not overlap with any tolerant region

Two operation modes for each 6T-SRAM domain (low-voltage and high-voltage), with split-area control.
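The classification rule above can be summarized as a small decision function. This is only a sketch of the stated rule: the helper predicates are hypothetical, and the real pass operates on LLVM IR, not on this C-level view.

typedef enum { TOLERANT_TV, NONTOLERANT_RV, SAFE_SV } var_class_t;

var_class_t classify(int overlaps_tolerant_region, int in_some_var_list) {
    if (!overlaps_tolerant_region)
        return SAFE_SV;          /* never live inside a tolerant region: no constraint  */
    if (in_some_var_list)
        return TOLERANT_TV;      /* listed as tolerant: LSB half may go to 6T-SRAM      */
    return NONTOLERANT_RV;       /* live in a tolerant region but not listed:
                                    must stay entirely in reliable SCM                  */
}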

Hardware evaluation

Energy consumption (data memory) and area compared to other solutions. (Chart: average energy, our energy, and our area across memory technologies and applications.)

Accuracy

Why not simply discard the LSBs? Because of application accuracy constraints.

Pushing Beyond pJ/OP

Recovering more silicon efficiency (figure): energy efficiency spans from ~1 GOPS/W for general-purpose computing (SW, CPU), through throughput computing (mixed, GPGPU), to >100 GOPS/W for dedicated HW IPs; the region in between is the accelerator gap.

Closing the accelerator efficiency gap with agile customization.

28

Learn to Accelerate

Brain-inspired systems (deep convolutional networks) are high performers in many tasks over many domains.

Image recognition [Russakovsky et al., 2014]: CNN 93.4% accuracy (ImageNet 2014) vs. human 85% (untrained) and 94.9% (trained) [Karpathy15].
Speech recognition [Hannun et al., 2014].

Flexible acceleration: the learned CNN weights are the program.

29

Computational Effort

(Figure: computational effort breakdown, ~90%.)
- 7.5 GOp for a 320x240 image
- 260 GOp for FHD
- 1050 GOp for 4k UHD

Origami chip.
30

Origami: A CNN Accelerator

Is floating point needed? 12-bit signals are sufficient: from input to classification, double-precision vs. 12-bit gives an accuracy loss of <0.5% (80.6% -> 80.1%).
31

Sub pJ/OP?

0% bit flips: 437 GOPS/W @ 1.2V, 803 GOPS/W @ 0.8V
1% bit flips: 1.84x energy improvement -> 1.2 pJ/OP
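These figures are mutually consistent (a quick check, not from the slide):

$$\frac{803\ \mathrm{GOPS/W}}{437\ \mathrm{GOPS/W}} \approx 1.84, \qquad \frac{1}{803\ \mathrm{GOPS/W}} = \frac{1}{8.03\times 10^{11}\ \mathrm{op/J}} \approx 1.25\ \mathrm{pJ/op} \approx 1.2\ \mathrm{pJ/OP}.$$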
32

Pushing Further: YodaNN(1)

Approximation on the algorithmic side: binary weights.
BinaryConnect [Courbariaux, NIPS15]:
- Reduce weights to a binary value -1/+1
- Stochastic gradient descent with binarization in the forward path:

$$w_b = \begin{cases} +1 & \text{with probability } p = \sigma(w)\\ -1 & \text{with probability } 1-p \end{cases} \;\text{(stochastic)} \qquad
w_b = \begin{cases} +1 & w \ge 0\\ -1 & w < 0 \end{cases} \;\text{(deterministic)}$$

- Learning large networks is still an issue with BinaryConnect

Ultra-optimized HW is possible!
- Power reduction because of arithmetic simplification
- Major arithmetic density improvements
- Area can be used for more energy-efficient weight storage
- SCM memories allow lower-voltage operation; dynamic energy scales with V^2

(1) After the Jedi Master from Star Wars: "Small in size but wise and powerful" (cit. www.starwars.com)
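A quick worked number for the voltage-scaling argument (standard first-order CMOS relation, not from the slide): switching energy scales quadratically with supply, so

$$E_{dyn} \propto C\,V_{DD}^{2} \;\Rightarrow\; \left(\frac{1.2\ \mathrm{V}}{0.6\ \mathrm{V}}\right)^{2} = 4\times$$

lower energy per access when the SCM-based weight storage runs at 0.6 V instead of 1.2 V (leakage and timing effects excluded).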
33

SoP-Unit Optimization

(Figure: image bank and filter bank feed the sum-of-products (SoP) units; one unit is equivalent to a 7x7 SoP, with image mapping supporting 3x3, 5x5 and 7x7 filters.)
34

SCM for Energy Efficiency

Same area: 832 SoP units + all-SCM storage -> 12x energy-efficiency improvement.

(Figure: core energy efficiency [GOp/s/mW] for the 8x8 configuration vs. supply voltage, 0.5 V - 1.3 V; SCM allows operation down to 0.6 V in 65 nm.)
35

Comparison with SoA

Publication   Throughput [GOPS]   En. Eff. [GOPS/W]   Supply [V]   Area Eff. [GOPS/MGE]
Neuflow       320                 490                 1.0          17
Leuven        102                 2600                0.5 - 1.1    64
Eyeriss       84                  160                 0.8 - 1.2    46
NINEX         569                 1800                1.2          51
k-Brain       411                 1930                1.2          109
Origami       196/74              437/803             1.2/0.8      90/34
This Work*    1510/55             9800/61200          1.2/0.6      1135/41

This work vs. the best prior entries: 2.7x throughput, 23x energy efficiency, 10x area efficiency.

*UMC 65nm technology, post place & route, 25°C, tt corner, 1.2V/0.6V

Breakthrough for ultra-low-power CNN ASIC implementation.
fJ/op in sight: manufacturing in Globalfoundries 22FDX.


36

Heterogeneous PULP Cluster

Hardware Convolutional Engine (HWCE) in the cluster: SoP units and a line buffer inside an HWPE wrapper with its control block, attached to the logarithmic interconnect for data and to the peripheral interconnect for control.


37

Heterogeneous PULP Cluster

Hardware Convolutional Engine (HWCE) in the cluster: ~6.9 mm2 PULP chip, 1/2016.
- 4 OR10Nv2 cores
- 1 HW crypto accelerator
- 1 CNN HWCE (SoP units, line buffer, control)
- Greatly improved DMA engine + uDMA
- ~1514 kGE for the cluster, of which 232 kGE for the CNN HWCE


38

HWCE CNN Performance

Average performance and energy efficiency on a 32x16 CNN frame:
- Performance: 8 GOPS (61x)
- Energy efficiency: 6500 GOPS/W (47x)

PULPv3 architecture; corner: tt 28nm, 25°C, VDD = 0.5V, FBB = 0.5V



39


Doing Nothing (very) Well

Sleepwalker IO interface.

(SoC block diagram: the cluster - OpenRISC cores with debug units, shared I$, TCDM banks and TCDM interconnect, DMA, HWCE, CRYPT, cluster peripherals, DC FIFOs to the SoC - next to the SoC side with L2 memory, SoC controller, FLLs and FLL control, ROM, uDMA, APB peripherals (SPI master/slave, UART, I2C, I2S, GPIO), JTAG TAP / advanced debug IF / JTAG2AXI, pad mux, and AXI bridges (64-bit cluster bus, 32-bit SoC bus, AXI-to-APB, AXI2MEM).)

The micro-DMA (uDMA) gives a major boost to cluster sleep time.



42

Sleepwalking on PULP

GOAL: automatically manage shutdown of cores and cluster during data transfers from peripherals to L2.

Programming sequence (a minimal sketch follows below):
1) Set the events mask
2) Program the transfer
3) Trigger the transfer
4) Shut down the cores
5) If all cores are idle, the cluster automatically shuts down (power gating)

48 maskable events: general purpose, DMA, timers, peripherals (SPI, I2C, GPIO, ...)

(Diagram: PE0-PE3, HW synchronizer and event unit, SPI, peripheral interconnect, L2 memory.)
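A hedged sketch of the sequence above. The function and event names are illustrative placeholders, not the actual PULP HAL API.

#include <stdint.h>

/* Placeholder prototypes (not the real driver interface). */
void eu_set_mask(uint32_t events);
void udma_spi_rx_setup(void *dst, uint32_t len);
void udma_spi_rx_start(void);
void cluster_wait_event(uint32_t events);

#define EVT_UDMA_SPI_DONE  (1u << 5)   /* hypothetical event ID */

void spi_transfer_while_sleeping(void *l2_dst, uint32_t len)
{
    eu_set_mask(EVT_UDMA_SPI_DONE);          /* 1) unmask the wake-up event          */
    udma_spi_rx_setup(l2_dst, len);          /* 2) program the uDMA transfer         */
    udma_spi_rx_start();                     /* 3) trigger it                        */
    cluster_wait_event(EVT_UDMA_SPI_DONE);   /* 4) core gates its clock and sleeps;
                                                5) with all cores idle, the cluster
                                                   is power-gated automatically      */
    /* execution resumes here once the uDMA has filled the L2 buffer */
}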



43

Back to System-Level

Smart visual sensor: idle most of the time (nothing interesting to see).
GrainCam (mixed-signal event-based imager) + PULPv3 (digital parallel processor).

- Event-driven computation, which occurs only when relevant events are detected by the sensor
- Event-based sensor interface to minimize IO energy (vs. a frame-based interface)
- Mixed-signal event triggering with an ULP imager with internal analog/mixed-signal (AMS) processing capability
- Doing nothing VERY well

GrainCam imager (FBK): pixel-level spatial-contrast extraction, analog internal image processing:
- Contrast extraction
- Motion extraction: differencing two successive frames
- Background subtraction: differencing the reference image, stored in memory, against the current frame
45

Energy-Efficient Sensor IF

Event-based + prefiltering + buffering.

Control block:
- Always-on 32 kHz controller that continuously handles sensor timing signals (e.g. integration and readout periods) and control signals (e.g. readout mode, reference-frame sampling)
- Activates the data-path block during the imager readout phase
- Triggers events

Data path:
- Stream processing on sensed post-triggering data
- The uDMA autonomously transfers pixel events to L2 memory during imager readout, without requiring cluster action; the sensor IF peripheral acts as the transfer initiator
- A slave control port is used to set up the sensor-interface parameters

(Diagram: imager IF -> input stage -> stream processor -> uDMA -> L2 over the SoC bus/APB; the control block runs off the 32 kHz clock, raises interrupts and a wake-up event to the always-on SoC controller, and hands data events to the cluster event unit, DMA and TCDM.)

Open Source

Parallel ULP computing for the IoT: a (sub)-pJ/op computing platform - let's make it open!
- Processor & hardware IPs
- Compiler infrastructure
- Virtualization layer
- Programming model
47

ETHZ and UNIBO released PULPino

Hundreds of git forks in a few weeks!


Integrated Systems Laboratory

48

Conclusions

IoT: a challenge and an opportunity - computing everywhere!

Energy-efficiency requirement: pJ/OP and below.
- Technology scaling alone is not doing the job for us
- Ultra-low-power architectures and circuits are needed, but not sufficient in the long run
- Most promising technologies: 3D integration, low-leakage, non-volatile silicon-compatible memory/computing devices - but the adoption rate will be SLOW

Approximate computing to the rescue:
- A holistic approach is needed - math, number systems, algorithms, tools, runtime, architecture, circuits, devices - from approximate to transprecision computing

Non-Von-Neumann + transprecision: jackpot potential!

Open-source HW & SW approach for building an innovation ecosystem.
49

Thanks for your attention!!!


www.pulp-platform.org
www-micrel.deis.unibo.it/pulp-project
iis-projects.ee.ethz.ch/index.php/PULP

PULP related chips

Main PULP chips (ST 28nm FDSOI):
- PULPv1 (see slide 7)
- PULPv2 (see slide 7)
- PULPv3 (under test)
- PULPv4 (in progress)
- Sir10us, Or10n

Mixed-signal systems (SMIC 130nm):
- VivoSoC, VivoSoC2
- EdgeSoC (in planning)

PULP development (UMC 65nm):
- Artemis - IEEE 754 FPU
- Hecate - shared FPU
- Selene - logarithmic number system FPU
- Diana - approximate FPU
- Mia Wallace - full system
- Imperio - PULPino chip
- Fulmine - secure cluster + CNN

Early building blocks (UMC 180nm) and IcySoC approximate-computing platforms (ALP 180nm):
- Diego, Manny, Sid

RISC-V based systems (GF 28nm):
- Honey Bunny

More to come.

(Chip photos labeled with their technology nodes: 28nm, 65nm, 130nm, 180nm.)

51

Example

int main()
{
    int sparse_A[];
    int i = 0, index;        /* marking index as tolerant would probably lead to fatal errors! */
    while (func(i))
    {
        index = compute_index(i);
        #pragma omp tolerant var_list(sparse_A)
        sparse_A[index] = compute_element(sparse_A, index);
        update(i);           /* tolerant computation here would probably lead to fatal errors! */
    }
    use_value(sparse_A);
}


52
