Foundations to Implementation
Luca Benini
IIS-ETHZ & DEI-UNIBO
http://www.pulp-platform.org
Near-Sensor Processing
[Figure: sensor node block diagram. Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT. Controller (e.g. Cortex-M) with L2 memory and IOs. Transmit: short range; long range, low BW; SW update, commands. Power and throughput labels: 25 MOPS; 1 to 10 mW; 100 µW to 2 mW; idle ~1 µW; active ~50 mW.]
Workload                               Input        Computational   Output       Compression
                                       bandwidth    demand          bandwidth    factor
Recognition [*Ioffe2015]               607 Kbps     4.00 GOPS       10 bps       60700x
Voice/Sound, Speech [*VoiceControl]    256 Kbps     100 MOPS        0.02 Kbps    12800x
Inertial, Kalman [*Nilsson2014]        2.4 Kbps     7.7 MOPS        0.02 Kbps    120x
Biometrics, SVM [*Benatti2014]         16 Kbps      150 MOPS        0.08 Kbps    200x
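The compression factors follow directly from the bandwidth figures above: each is the input bandwidth divided by the output bandwidth. A minimal sketch in C that recomputes them, assuming 1 Kbps = 1000 bps; the type and function names are illustrative and not from the slides.

```c
/* One workload: input and output bandwidth in bits per second. */
typedef struct {
    const char *workload;
    double input_bps;
    double output_bps;
} workload_t;

/* Bandwidth figures copied from the slide (assumption: 1 Kbps = 1000 bps). */
const workload_t WORKLOADS[] = {
    {"Recognition", 607e3, 10.0},
    {"Speech",      256e3, 20.0},
    {"Kalman",      2.4e3, 20.0},
    {"SVM",         16e3,  80.0},
};

/* Compression factor = input bandwidth / output bandwidth. */
double compression_factor(const workload_t *w)
{
    return w->input_bps / w->output_bps;
}
```

Calling `compression_factor(&WORKLOADS[0])` reproduces the 60700x entry for recognition; the other three rows give 12800x, 120x and 200x.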
Microcontrollers Landscape
[Figure: landscape of high performance MCUs and low-power MCUs (not exhaustive), with "Our Target" marked.]
Near-Threshold Multiprocessing
[Figure: energy per cycle (nJ), 32nm CMOS, 25 C, plotted over an axis running from 0.2 to 1.2. Curves: total energy, leakage energy, dynamic energy. Annotation: 4.7X.]
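Near-Threshold Multiprocessing
[Figure: cluster architecture. N cores (PE0 ... PEN-1), instruction cache banks (I$B0 ... I$Bk), L1 TCDM+T&S with memory banks MB0 ... MBM, DEMUX, tightly coupled DMA, peripherals and external memory (Periph + ExtM).]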
Body bias: Highly effective knob for power & variability management!
[Figure: RVT transistors with FBB/RBB, -40C to 120C. Annotations: thermal inversion; 100x @0.5V.]
Selective, fine-grained BB
- The cluster is partitioned in separate clock gating and body bias regions
- Body bias multiplexers (BBMUXes) control the well voltages of each region
- A Power Management Unit (PMU) automatically manages transitions between the operating modes
- Power modes of each region:
  - Boost mode: active + FBB
  - Normal mode: active + no BB
  - Idle mode: clock gated + no BB (in LVT) or RBB (in RVT)
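The three power modes can be read as a small lookup from (mode, transistor flavor) to (clock state, body bias). The sketch below encodes that reading in C. It assumes the two flavors named in the idle-mode line are LVT and RVT, and all type and function names are hypothetical; it does not model the PMU or the BBMUX hardware.

```c
#include <stdbool.h>

typedef enum { MODE_BOOST, MODE_NORMAL, MODE_IDLE } power_mode_t;
typedef enum { WELL_LVT, WELL_RVT } well_type_t;            /* assumed flavors */
typedef enum { BIAS_NONE, BIAS_FBB, BIAS_RBB } body_bias_t;

typedef struct {
    bool clock_enabled;   /* false means the region is clock gated */
    body_bias_t bias;     /* body bias applied to the region's wells */
} region_config_t;

/* Map a power mode to the clock and body-bias setting of one region. */
region_config_t region_config(power_mode_t mode, well_type_t well)
{
    region_config_t cfg = { true, BIAS_NONE };
    switch (mode) {
    case MODE_BOOST:              /* active + FBB */
        cfg.bias = BIAS_FBB;
        break;
    case MODE_NORMAL:             /* active + no BB */
        break;
    case MODE_IDLE:               /* clock gated; RBB only in RVT */
        cfg.clock_enabled = false;
        cfg.bias = (well == WELL_RVT) ? BIAS_RBB : BIAS_NONE;
        break;
    }
    return cfg;
}
```

Keeping the table in one pure function makes it easy to check each mode in isolation before wiring it to real control registers.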
2x-4x
High VDDMIN
Bottleneck for energy efficiency
Lower VDDMIN
Area/timing overhead (25%-50%)
High active energy
Low technology portability
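Heterogeneous + Reconfigurable Memory Architecture
[Figure: cluster with instruction cache banks (I$B0 ... I$Bk), per-core L0+PFB, cores PE0 ... PEN-1 each with an MMU, and an L1 TCDM built from SCM banks (SCM0 ... SCMM-1) and SRAM banks (SRAM0 ... SRAMM-1), with private and interleaved regions. Callout: reconfigurable pipeline stages for SRAM degradation at low VDD.]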
HW Synchronizer
[Figure: four panels (#Parallel, #Barrier, #Critical, #Sections) plotting CPU cycles against number of threads for EU-v2 and LibGOMP.]
ISA Extensions
Sensor data NEVER comes with 32-bit
Davide Rossi
The Evolution of the Species

                    PULPv1                          PULPv2                          PULPv3
# of cores          4                               4                               4
L2 memory           16 kB                           64 kB                           128 kB
TCDM                16kB SRAM                       32kB SRAM + 8kB SCM             32kB SRAM + 16kB SCM
DVFS                no                              yes                             yes
I$                  4kB SRAM private                4kB SCM private                 4kB SCM shared
DSP Extensions      no                              no                              yes
HW Synchronizer     no                              no                              yes
Status              silicon proven                  silicon proven                  post tape out
Technology          FD-SOI 28nm conventional-well   FD-SOI 28nm flip-well           FD-SOI 28nm conventional-well
Voltage range       0.45V - 1.2V                    0.3V - 1.2V                     0.5V - 0.7V
BB range            -1.8V - 0.9V                    0.0V - 1.8V                     -1.8V - 0.9V
Max freq.           475 MHz                         1 GHz                           200 MHz
Max perf.           1.9 GOPS                        4 GOPS                          1.8 GOPS
Peak en. eff.       60 GOPS/W                       135 GOPS/W                      385 GOPS/W
Approximate?
Less-than-perfect results perceived as correct by the users
e.g. image processing (filtering)
RGB to GRAYSCALE
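As an illustration of the RGB to grayscale example, here is a minimal integer-only conversion in C. The slides do not give the coefficients; this sketch assumes the commonly used luma weights 0.299, 0.587 and 0.114, approximated as 77/256, 150/256 and 29/256, and the function name is hypothetical. Small rounding errors in the result are the kind of less-than-perfect output a viewer would not notice.

```c
#include <stdint.h>

/*
 * Approximate luma = 0.299 R + 0.587 G + 0.114 B with integer weights
 * that sum to 256, so the final division is a right shift by 8.
 */
uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t acc = 77u * r + 150u * g + 29u * b;
    return (uint8_t)(acc >> 8);   /* truncates instead of rounding */
}
```

Because the weights sum to exactly 256, pure white maps to 255 and pure black to 0, so the approximation never leaves the 8-bit range.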
Approximate Storage?

Retention voltage
  SCM       0.25V
  6T-SRAM   0.29V

Voltage (V)       0.50    0.55    0.60    0.65    0.70    0.75    0.80
P(flip-bit) SCM   0.0     0.0     0.0     0.0     0.0     0.0     0.0
P(flip-bit) 6T    0.0037  0.0012  0.0003
TCDM Design

- EFFICIENCY GOAL: switch down the voltage of 6T-RAM domains to save energy
- APPROXIMATION: least significant bytes (LSBs) of error-tolerant variables are stored in 6T-RAM (split region), and the region can be re-configured on-the-fly by SW (1 more)
- RELIABILITY MANAGEMENT UNITS (RMUs): combinational logic blocks used to remap the physical address range of the TCDM into logical memory areas

Voltage levels of the logical memory areas:
- SCM: 0.5V: reliable
- 6T-SRAM: 0.8V: reliable; 0.5V: not reliable
- SPLIT: 0.5V: MSBs reliable, LSBs not reliable

[Figure: TCDM organization. SCM banks #0 to #15 and 6T-SRAM banks #0 to #15 in separate voltage domains (labels D1, D2, D4), level shifters, RMUs and the interconnect; banks labeled MSB #0 to #15 and LSB #0 to #15. Address labels: 0x0000, OFFSET, 0x2000, 0x2000 + OFFSET, 0x08000, 0x0C000, 0x14000.]
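The effect of the split region can be sketched in software: if only the least significant bytes sit in unreliable memory, a bit flip can change a value by a bounded amount and never touches the most significant bytes. The C model below assumes a 32-bit word split into a reliable upper half and an unreliable lower half; the 16/16 split, the mask and the function names are assumptions for illustration, not the RMU implementation.

```c
#include <stdint.h>

/* Assumed split: the low 16 bits live in unreliable 6T-SRAM. */
#define LSB_MASK 0x0000FFFFu

/*
 * Model a read from the split region. flip_mask marks bit cells that
 * flipped; flips outside the unreliable LSB half are ignored because
 * those bits are held in reliable memory.
 */
uint32_t read_split_word(uint32_t stored, uint32_t flip_mask)
{
    return stored ^ (flip_mask & LSB_MASK);
}

/* Absolute difference between the exact and the approximate value. */
uint32_t abs_error(uint32_t exact, uint32_t approx)
{
    return exact > approx ? exact - approx : approx - exact;
}
```

Even in the worst case where every unreliable cell flips, `abs_error` stays at or below `LSB_MASK`, which is why only error-tolerant variables belong in this region.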
Programming model
- Compiler support: Clang + LLVM. Directives are translated into annotated tokens of the intermediate representation (LLVM IR).
- Hardware evaluation: energy consumption (data-memory) and area compared to other solutions.

[Figure: evaluation charts. Labels: avg energy, our energy, our area, memory tech., applications, accuracy.]
Why not simply discard the LSBs?
Application constraints
[Figure: implementation spectrum from SW through mixed to HW. Labels: general-purpose computing (CPU), throughput computing (GPGPU), accelerator gap, HW IP. Annotation: > 100.]
Learn to Accelerate
Brain-inspired (deep convolutional networks) systems are high performers in many tasks over many domains
Image recognition [Russakovsky et al., 2014]: CNN: 93.4% accuracy (Imagenet 2014); Human: 85% (untrained), 94.9% (trained) [Karpathy15]
Speech recognition [Hannun et al., 2014]
Computational Effort
Computational effort
~90%
Origami chip
FP needed?
12-bit signals sufficient
Input to classification double-vs-12-bit
accuracy loss <0.5% (80.6% to 80.1%)
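To make the 12-bit claim concrete, the following C sketch quantizes a double to a signed 12-bit fixed-point value and back. The slides do not state the number format; a format with 1 sign bit and 11 fractional bits (range -1 to just under 1) is assumed here, and the names are hypothetical.

```c
#include <stdint.h>

#define Q12_SCALE 2048.0   /* 2^11 fractional steps (assumed format) */
#define Q12_MAX   2047
#define Q12_MIN   (-2048)

/* Round to the nearest 12-bit step, saturating at the range limits. */
int16_t quantize12(double x)
{
    double scaled = x * Q12_SCALE;
    if (scaled >= (double)Q12_MAX) return Q12_MAX;
    if (scaled <= (double)Q12_MIN) return Q12_MIN;
    /* Round half away from zero without needing the math library. */
    return (int16_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Convert a 12-bit fixed-point value back to a double. */
double dequantize12(int16_t q)
{
    return q / Q12_SCALE;
}
```

For in-range inputs the round-trip error is at most half a step, 1/4096, which gives a feel for how little precision is lost relative to a double.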
Sub pJ/OP?
0% bit flips
437 GOPS/W @1.2V
803 GOPS/W @0.8V
1% bit flips
1.84x energy improvement
1.2pJ/OP
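The figures on this slide are linked by simple arithmetic: 1 GOPS/W is 10^9 operations per joule, so the energy per operation in pJ is 1000 divided by the efficiency in GOPS/W. A small C sketch of that conversion (function names are illustrative):

```c
/*
 * 1 GOPS/W = 1e9 operations per joule, so one operation costs
 * 1e-9 J / efficiency, which is 1000 pJ / efficiency.
 */
double pj_per_op(double gops_per_watt)
{
    return 1000.0 / gops_per_watt;
}

/* Ratio between two efficiency figures, e.g. 0.8V versus 1.2V. */
double efficiency_gain(double eff_low, double eff_high)
{
    return eff_high / eff_low;
}
```

With the numbers above, `pj_per_op(803.0)` is about 1.25 pJ/op, in line with the quoted 1.2pJ/OP, and `efficiency_gain(437.0, 803.0)` is about 1.84, matching the 1.84x improvement.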
Ultra-optimized HW is possible!
Power reduction because of arithmetic simplification
Major arithmetic density improvements
Area can be used for more energy-efficient weight storage
The Jedi Master from Star Wars: small in size but wise and powerful (cit. www.starwars.com)
SoP-Unit Optimization
[Figure: SoP unit datapath with ImageBank and FilterBank.]
[Figure: plot over supply voltage, 0.5 V to 1.3 V.]
                 Throughput   En.Eff.      Supply      Area Effic.
                 [GOPS]       [GOPS/W]     [V]         [GOPS/MGE]
Neuflow          320          490          1.0         17
Leuven           102          2600         0.5 - 1.1   64
Eyeriss          84           160          0.8 - 1.2   46
NINEX            569          1800         1.2         51
k-Brain          411          1930         1.2         109
Origami          196/74       437/803      1.2/0.8     90/34
This Work*       1510/55      9800/61200   1.2/0.6     1135/41

Highlighted improvements: 2.7x (throughput), 23x (energy efficiency), 10x (area efficiency)
*UMC 65nm Technology, post place & route, 25C, tt, 1.2V/0.6V
[Figure: accelerator integration in the cluster. Labels: logarithmic interconnect, SOP units & line buffer, HWPE wrapper, CTRL, peripheral interconnect.]
4 OR10Nv2 cores
1 HW CRYPTO accelerator
1 CNN HWCE
greatly improved DMA engine
uDMA,
ENERGY EFFICIENCY
8 GOPS
6500 GOPS/W
61x
47x
Sleepwalker IO Interface
[Figure: SoC block diagram. Cluster: OpenRISC cores #0 ... #N-1 with debug units (DU), shared I$, HWCE, CRYPT, DMA, TCDM banks #0 ... #M-1, TCDM interconnect, peripheral interconnect, cluster peripherals, cluster bus. SoC: L2 memory, ROM, SoC Ctrl, FLLs and FLL Ctrl, uDMA, SoC bus, AXI to APB bridge and APB bus, SPI M, SPI S, UART, I2C, I2S, GPIO, pad mux, JTAG TAP, advanced debug IF, JTAG2AXI, AXI2MEM. Cluster and SoC are connected through DC FIFOs and the cluster bridge; buses are labeled 64-bit and 32-bit.]
Sleepwalking on PULP
Programming sequence:
1) Set events mask
2) Program transfer
3) Trigger transfer
4) Shut down cores
5) If all cores are idle, the cluster automatically shuts down (power gating)

48 maskable events:
- General purpose
- DMA
- Timers
- Peripherals (SPI, I2C, GPIO...)

[Figure: PE0 to PE3 with the HW synch / event unit, SPI, peripheral interconnect and L2 memory.]

GOAL: automatically manage shutdown of cores and cluster during data transfers from peripherals to L2.
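The programming sequence can be walked through with a small software model. The C sketch below is not the PULP register interface: the struct, the function names and the wake-up behavior on a masked-in event are assumptions made only to illustrate the five steps.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES  4
#define NUM_EVENTS 48
#define EVENT_MASK_ALL ((UINT64_C(1) << NUM_EVENTS) - 1)

/* Hypothetical model of the cluster state, not a real register map. */
typedef struct {
    uint64_t event_mask;            /* one bit per maskable event */
    bool transfer_programmed;
    bool transfer_running;
    bool core_active[NUM_CORES];
    bool cluster_powered;
} cluster_model_t;

void cluster_init(cluster_model_t *c)
{
    c->event_mask = 0;
    c->transfer_programmed = false;
    c->transfer_running = false;
    for (int i = 0; i < NUM_CORES; i++) c->core_active[i] = true;
    c->cluster_powered = true;
}

/* Step 1: select which of the 48 events may wake the cluster. */
void set_event_mask(cluster_model_t *c, uint64_t mask)
{
    c->event_mask = mask & EVENT_MASK_ALL;
}

/* Step 2: describe the peripheral-to-L2 transfer. */
void program_transfer(cluster_model_t *c) { c->transfer_programmed = true; }

/* Step 3: start the transfer; ignored if nothing was programmed. */
void trigger_transfer(cluster_model_t *c)
{
    if (c->transfer_programmed) c->transfer_running = true;
}

/* Steps 4 and 5: idle one core; power gate the cluster once all are idle. */
void shut_down_core(cluster_model_t *c, int core)
{
    c->core_active[core] = false;
    for (int i = 0; i < NUM_CORES; i++)
        if (c->core_active[i]) return;
    c->cluster_powered = false;
}

/* Assumed wake-up: only an event enabled in the mask restores power. */
void raise_event(cluster_model_t *c, int event)
{
    if (!((c->event_mask >> event) & 1u)) return;
    c->transfer_running = false;
    c->cluster_powered = true;
    for (int i = 0; i < NUM_CORES; i++) c->core_active[i] = true;
}
```

In this model an event that is masked out leaves the cluster power gated, which is the point of setting the mask before the cores are shut down.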
10/1
1/16
Back to System-Level
Smart Visual Sensor idle most of the time (nothing interesting to see)
GrainCam: mixed-signal event-based imager
PULPv3: digital parallel processor
- Contrast extraction
- Motion extraction: differencing two successive frames
- Background subtraction: differencing the reference image, stored in memory, with the current frame
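Motion extraction and background subtraction are the same per-pixel operation with a different reference: the previous frame in the first case, the stored reference image in the second. A minimal software sketch in C, assuming 8-bit pixels and a fixed threshold; the imager performs this in mixed-signal hardware, and the function name and threshold parameter are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mark pixels whose absolute difference from the reference exceeds the
 * threshold. Pass the previous frame as reference for motion extraction
 * or the stored background image for background subtraction.
 * Returns the number of marked pixels.
 */
size_t frame_difference(const uint8_t *reference, const uint8_t *current,
                        size_t num_pixels, uint8_t threshold, uint8_t *mask)
{
    size_t changed = 0;
    for (size_t i = 0; i < num_pixels; i++) {
        int diff = (int)current[i] - (int)reference[i];
        if (diff < 0) diff = -diff;
        mask[i] = (diff > threshold) ? 1 : 0;
        changed += mask[i];
    }
    return changed;
}
```

A return value of zero corresponds to the "nothing interesting to see" case, in which the processor has no reason to wake up.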
Energy-Efficient Sensor IF
Event-based + Prefiltering + Buffering
Control Block:
- Always-on 32 kHz controller for continuously handling sensor timing signals (e.g. integration, readout periods) and control signals (e.g. readout mode, reference frame sampling)
- Activates data-path block during imager readout phase
- Triggers events

[Figure: sensor interface block diagram. Labels: imager IF, input stage, data path, stream processor, control block (32 kHz, enable, done), interrupt manager, master data port, slave control port, always-on domain, wake-up event, SoC clock, SoC Ctrl, uDMA, APB bus, L2, SoC bus, data and event paths, event unit, cluster, DMA, TCDM.]
Open Source
Parallel ULP computing for the IoT
(sub)-pJ/op computing platform - let's make it Open!
- Processor & Hardware IPs
- Compiler Infrastructure
- Virtualization Layer
- Programming Model
Conclusions
IoT a Challenge and an opportunity
Computing Everywhere!
Sir10us
Or10n
More to come.
[Figure: gallery of chips, labeled by technology node (180nm, 130nm, 65nm, 28nm).]
Example
    int main()
    {
        #pragma omp tolerant var
        int sparse_A[];
        /* Marking index as tolerant would probably lead to fatal errors! */
        int i = 0, index;

        while (func(i))
        {
            index = compute_index(i);
            #pragma omp tolerant computation
            sparse_A[index] = compute_element(sparse_A, index);
            update(i);
        }
        use_value(sparse_A);
    }
Tolerant computation here would