Deep Learning and Reconfigurable Platforms in the Internet of Things
Challenges and Opportunities in Algorithms and Hardware
As the Internet of Things (IoT) continues its run as one of the most popular technology buzzwords of today, the discussion really turns from how the massive data sets are collected to how value can be derived from them, i.e., how to extract knowledge out of such (big) data. IoT devices are used in an ever-growing number of application domains (see Figure 1), ranging from sports gadgets (e.g., Fitbits and Apple Watches) or more serious medical devices (e.g., pacemakers and biochips) to smart homes, cities, and self-driving cars, to predictive maintenance in mission-critical systems (e.g., in nuclear power plants or airplanes). Such applications introduce endless possibilities for better understanding, learning, and informedly acting (i.e., situational awareness and actionable information in government lingo). Although the rapid expansion of devices and sensors brings terrific opportunities for taking advantage of terabytes of machine data, the mind-boggling task of understanding the growth of data remains …

Digital Object Identifier 10.1109/MIE.2018.2824843
Date of publication: 25 June 2018
FIGURE 2 – The big data six Vs and their connection with deep learning. (Figure labels: volume, velocity, variability, and value on the big data side; multiple layers of abstraction, training benefit from larger data sets, and automatic feature extraction on the deep learning side.)
…timized software and hardware structures, as opposed to the trend of the last 30 years, where waiting for the next generation of devices was more profitable than investing in optimization. All these facts combined make it more difficult than ever for designers to decide the best possible architecture for their applications.

The digital processing platforms currently available in the market are summarized in Figure 3, where they can be compared in terms of performance and flexibility. Flexibility refers here to ease of development, portability, and the possibility of adapting to changes in specifications. For high-end deep neural network applications, where performance is the most important parameter, general-purpose GPUs (GPGPUs) are the dominant solution. Their parallel structure, the latest efforts by manufacturers to compete for machine-learning applications (e.g., adding specific instructions for fast neuron inference), and their reduced cost due to mass production for personal computers made them ideal for training and inference of deep neural networks.

FIGURE 3 – The performance versus flexibility of digital processing platforms (adapted from [52]). (Figure labels, from highest performance to highest flexibility: ASICs, GPGPUs, FPSoCs, FPGAs, DSPs, multicore CPUs, and microcontrollers.)

The latest NVIDIA Volta GV100 GPU platform, including 21.1 billion transistors within a die size of 815 mm², is capable of doing inference 100 times faster than the fastest current central processing unit (CPU) on the market [49]. This unparalleled brute force comes at a price: high power consumption, the need for custom data types (not necessarily float), irregular parallelism (alternating sequential and parallel processing), and divergence (not all cores executing the same code simultaneously). That is why some companies are investing in neural network application-specific integrated circuits (ASICs) for improved performance at the expense of losing flexibility. Examples are the first and second generations (optimized for inference and for both inference and training, respectively) of the Google tensor processing unit (TPU), slowly stealing high-performance computing applications from GPUs.

While this is the pace for high-performance computing, the lack of flexibility in ASICs and the high power consumed by GPUs do not fit in wide areas of the IoT world that demand power-efficient, flexible embedded systems. This explains why many IoT devices are currently based on microcontrollers, digital signal processors (DSPs), and multicore CPUs. However, as the IoT market grows, both manufacturers and designers face a problem due to the diversification of applications and the increasing demand for computing power (particularly for machine-learning algorithms), leading to a transformation from sense making to decision making [50].

Offering a wider portfolio of devices to cover the different applications means less market share per device, increasing manufacturing costs. However, offering complex heterogeneous devices that can be used in several applications implies higher integration of functionality and a waste of silicon, also increasing the overall cost [51]. In this scenario, FPGAs, located in the middle of Figure 3, appear as a balanced solution to add flexibility and efficient computing power for machine-learning algorithms to the next generation of IoT devices. Combining processors and FPGAs in a single package results in the FPSoC concept. In the following sections, FPSoC architecture is presented along with an analysis of the usefulness of its hardware resources for implementing …

FPSoC Architecture
FPSoCs feature a hard processing system (HPS) and FPGA fabric on the same chip. Both parts are connected by means of high-throughput bridges, which provide faster communications and power savings compared to multichip solutions [53]. The HPS in first-generation FPSoCs featured single- or dual-core ARM application processors and some widely used peripherals, such as timers and controllers for different types of communication protocols, i.e., Ethernet, universal serial bus (USB), interintegrated circuit (I2C), universal asynchronous receiver-transmitter (UART), and controller area network (CAN).

Pushed by increasing application requirements, some devices in the newest FPSoC families include quad-core ARM processors, GPUs, and real-time processors in the HPS, with FPSoCs becoming complex heterogeneous computing platforms. Resources in the FPGA fabric also evolved from the basic structure consisting of standard logic resources and relatively simple specialized hardware blocks (e.g., fixed-point DSP multipliers, memory blocks, and transceivers). Current devices include much more complex blocks, e.g., DSP blocks with floating-point capabilities, video codecs for video compression, soft-decision forward error correction (SD-FEC) units to speed up encoding/decoding in wireless applications, or analog-to-digital converters (ADCs). Figure 4 shows the generic block diagram of a modern FPSoC device, where the location and connection of the aforementioned elements are depicted.

All computing elements (processors and GPU) have their own cache memory and share a common synchronous dynamic random access memory (SDRAM) external memory, usually controlled by a single multiport controller. A main switch interconnects masters and slaves in the HPS. The FPGA fabric can be accessed as any …
…and large FPGA fabrics, focused on higher-end applications, such as fifth-generation communications, artificial intelligence, data centers, or video processing. Microsemi and Quicklogic offer simpler devices with real-time processors, focusing on data acquisition, wearables, and smartphones. Despite the additional components that manufacturers provide in some …

FIGURE 4 – The block diagram of a modern FPSoC. DAC: digital-to-analog converter. (Figure labels include the HPS with processor, GPU, DMA controller, SDRAM controller, and general-purpose peripherals (UART, USB, CAN, I2C, …), plus the external SDRAM.)
TABLE 2 – THE RESOURCE USAGE AND LATENCY FOR USUAL FLOATING-POINT OPERATIONS IN ARRIA FPSoCs [75].

                                      ARRIA V (FIXED-POINT DSP BLOCKS)                ARRIA 10 (FLOATING-POINT DSP BLOCKS)
FLOATING-POINT OPERATION   PRECISION  LATENCY (CYCLES)  LEs    DSP BLOCKS  fMAX (MHz)  LATENCY (CYCLES)  LEs    DSP BLOCKS  fMAX (MHz)
Addition/subtraction       Single     9                 1,193  0           250         5                 1,208  0           319
Multiplication             Single     5                 390    1           281         3                 123    1           289
…little programming effort.

Artificial neural network implementation in FPGA-based devices is becoming so popular that a neural network compiler, which generates HDL code from high-level specifications, has recently been created [86]. Designers only have to select the structure, activation function, and other parameters of the artificial neural network, and the compiler automatically generates the HDL code, applying the most suitable optimization options in each case. This reduces the design time …

FIGURE 8 – The implementation of a Deep-Q network on Zynq-7000. SD: secure digital. (Figure labels include a controller with processing elements (PEs), input and output buffers, DMA, FPGA-to-SDRAM, FPGA-to-HPS, and HPS-to-FPGA bridges, the HPS main switch with processor and general-purpose peripherals (UART, USB, …), the SDRAM controller, and external SDRAM.)

…possible solution. After some iterations, the algorithm should converge to the global solution. Several families of such algorithms exist. They are characterized by the search policy of the individuals: ant colony optimization (which emulates ant colony food search), particle swarm optimization (which emulates the movement of a flock of birds, where the distance between individuals is important), or genetic algorithms (where individuals experience gene evolution through, e.g., mutation and crossover), to name just the most popular ones.

Although the fit function can be evaluated in parallel for each individual, evolutionary computing algorithms are not always as suitable for FPGA implementation as artificial neural networks, because their arithmetic operations are completely dependent on the application and the algorithm used. The application defines the fit function and, depending on the operations involved, it will be more or less appropriate for FPGA implementation. Generally speaking, the more pipelineable and parallelizable the fit …
FIGURE 9 – (a) A performance comparison of a particle swarm optimization algorithm for different Cyclone V SoC implementations and a desktop computer. (b) The system based on a Cyclone V SoC board. (Panel (a) plots generations/s, in float and double precision, for a PC, a Cyclone V SoC without the FPGA, with the FPGA, and with the FPGA plus hardware/software coprocessing.)