Deep Learning and Reconfigurable Platforms in the Internet of Things
Challenges and Opportunities in Algorithms and Hardware
As the Internet of Things (IoT) continues its run as one of the most popular technology buzzwords of today, the discussion really turns from how the massive data sets are collected to how value can be derived from them, i.e., how to extract knowledge out of such (big) data. IoT devices are used in an ever-growing number of application domains (see Figure 1), ranging from sports gadgets (e.g., Fitbits and Apple Watches) or more serious medical devices (e.g., pacemakers and biochips) to smart homes, cities, and self-driving cars, to predictive maintenance in mission-critical systems (e.g., in nuclear power plants or airplanes). Such applications introduce endless possibilities for better understanding, learning, and informedly acting (i.e., situational awareness and actionable information in government lingo). Although the rapid expansion of devices and sensors brings terrific opportunities for taking advantage of terabytes of machine data, the mind-boggling task of understanding the growth of data remains …

Digital Object Identifier 10.1109/MIE.2018.2824843
Date of publication: 25 June 2018
FIGURE 2 – The big data six Vs and their connection with deep learning. (Figure labels: volume, velocity, variability, and value on the big data side; multiple layers of abstraction, training benefit from larger data sets, and automatic feature extraction on the deep learning side.)
…timized software and hardware structures, as opposed to the trend of the last 30 years, where waiting for the next generation of devices was more profitable than investing in optimization. All these facts combined make it more difficult than ever for designers to decide the best possible architecture for their applications.

The digital processing platforms currently available in the market are summarized in Figure 3, where they can be compared in terms of performance and flexibility. Flexibility refers here to ease of development, portability, and the possibility of adapting to changes in specifications. For high-end deep neural network applications, where performance is the most important parameter, general-purpose GPUs (GPGPUs) are the dominant solution. Their parallel structure, the latest efforts by manufacturers to compete for machine-learning applications (e.g., adding specific instructions for fast neuron inference), and their reduced cost due to mass production for personal computers made them ideal for training and inference of deep neural networks.

FIGURE 3 – The performance versus flexibility of digital processing platforms (adapted from [52]). (Figure labels, from highest performance to highest flexibility: ASICs, GPGPUs, FPSoCs, FPGAs, DSPs, multicore CPUs, and microcontrollers.)

The latest NVIDIA Volta GV100 GPU platform, including 21.1 billion transistors within a die size of 815 mm², is capable of doing inference 100 times faster than the fastest current central processing unit (CPU) on the market [49]. This unparalleled brute force comes at a price: high power consumption, the need for custom data types (not necessarily float), irregular parallelism (alternating sequential and parallel processing), and divergence (not all cores executing the same code simultaneously). That is why some companies are investing in neural network application-specific integrated circuits (ASICs) for improved performance at the expense of losing flexibility. Examples are the first and second generations (optimized for inference and for both inference and training, respectively) of the Google tensor processing unit (TPU), slowly stealing high-performance computing applications from GPUs.

While this is the pace for high-performance computing, the lack of flexibility in ASICs and the high power consumed by GPUs do not fit in wide areas of the IoT world that demand power-efficient, flexible embedded systems. This explains why many IoT devices are currently based on microcontrollers, digital signal processors (DSPs), and multicore CPUs. However, as the IoT market grows, both manufacturers and designers face a problem due to the diversification of applications and the increasing demand for computing power (particularly for machine-learning algorithms), leading to a transformation from sense making to decision making [50].

Offering a wider portfolio of devices to cover the different applications means less market share per device, increasing manufacturing costs. However, offering complex heterogeneous devices that can be used in several applications implies higher integration of functionality and a waste of silicon, also increasing the overall cost [51]. In this scenario, FPGAs, located in the middle of Figure 3, appear as a balanced solution to add flexibility and efficient computing power for machine-learning algorithms to the next generation of IoT devices. Combining processors and FPGAs in a single package results in the FPSoC concept. In the following sections, FPSoC architecture is presented along with an analysis of the usefulness of its hardware resources for implementing …

FPSoC Architecture
FPSoCs feature a hard processing system (HPS) and FPGA fabric on the same chip. Both parts are connected by means of high-throughput bridges, which provide faster communications and power savings compared to multichip solutions [53]. The HPS in first-generation FPSoCs featured single- or dual-core ARM application processors and some widely used peripherals, such as timers and controllers for different types of communication protocols, i.e., Ethernet, universal serial bus (USB), interintegrated circuit (I2C), universal asynchronous receiver-transmitter (UART), and controller area network (CAN).

Pushed by increasing application requirements, some devices in the newest FPSoC families include quad-core ARM processors, GPUs, and real-time processors in the HPS, with FPSoCs becoming complex heterogeneous computing platforms. Resources in the FPGA fabric also evolved from the basic structure consisting of standard logic resources and relatively simple specialized hardware blocks (e.g., fixed-point DSP multipliers, memory blocks, and transceivers). Current devices include much more complex blocks, e.g., DSP blocks with floating-point capabilities, video codecs for video compression, soft-decision forward error correction (SD-FEC) units to speed up encoding/decoding in wireless applications, or analog-to-digital converters (ADCs). Figure 4 shows the generic block diagram of a modern FPSoC device, where the location and connection of the aforementioned elements are depicted.

All computing elements (processors and GPU) have their own cache memory and share a common synchronous dynamic random access memory (SDRAM) external memory, usually controlled by a single multiport controller. A main switch interconnects masters and slaves in the HPS. The FPGA fabric can be accessed as any …
…and large FPGA fabrics, focused on higher-end applications, such as fifth-generation communications, artificial intelligence, data centers, or video processing. Microsemi and Quicklogic offer simpler devices with real-time processors, focusing on data acquisition, wearables, and smartphones. Despite the additional components that manufacturers provide in some …

FIGURE 4 – The block diagram of a modern FPSoC. DAC: digital-to-analog converter. (Figure labels include the HPS with processor, GPU, DMA controller, SDRAM controller, and general-purpose peripherals (UART, USB, CAN, I2C, …), plus the external SDRAM.)
TABLE 2 – THE RESOURCE USAGE AND LATENCY FOR USUAL FLOATING-POINT OPERATIONS IN ARRIA FPSoCs [75].

                                      ARRIA V (FIXED-POINT DSP BLOCKS)                ARRIA 10 (FLOATING-POINT DSP BLOCKS)
FLOATING-POINT OPERATION   PRECISION  LATENCY (CYCLES)  LEs    DSP BLOCKS  fMAX (MHz)  LATENCY (CYCLES)  LEs    DSP BLOCKS  fMAX (MHz)
Addition/subtraction       Single     9                 1,193  0           250         5                 1,208  0           319
Multiplication             Single     5                 390    1           281         3                 123    1           289
…little programming effort.

Artificial neural network implementation in FPGA-based devices is becoming so popular that a neural network compiler, which generates HDL code from high-level specifications, has recently been created [86]. Designers only have to select the structure, activation function, and other parameters of the artificial neural network, and the compiler automatically generates the HDL code, applying the most suitable optimization options in each case. This reduces the design time …

FIGURE 8 – The implementation of a Deep-Q network on Zynq-7000. SD: secure digital. (Figure labels include a controller with processing elements (PEs), input and output buffers, DMA, FPGA-to-SDRAM, FPGA-to-HPS, and HPS-to-FPGA bridges, the HPS main switch with processor and general-purpose peripherals (UART, USB, …), the SDRAM controller, and external SDRAM.)

…possible solution. After some iterations, the algorithm should converge to the global solution. Several families of such algorithms exist. They are characterized by the search policy of the individuals: ant colony optimization (which emulates ant colony food search), particle swarm optimization (which emulates the movement of a flock of birds, where the distance between individuals is important), or genetic algorithms (where individuals experience gene evolution through, e.g., mutation and crossover), to name just the most popular ones.

Although the fit function can be evaluated in parallel for each individual, evolutionary computing algorithms are not always as suitable for FPGA implementation as artificial neural networks, because their arithmetic operations are completely dependent on the application and the algorithm used. The application defines the fit function and, depending on the operations involved, it will be more or less appropriate for FPGA implementation. Generally speaking, the more pipelineable and parallelizable the fit …
FIGURE 9 – (a) A performance comparison of a particle swarm optimization algorithm for different Cyclone V SoC implementations and a desktop computer. (b) The system based on a Cyclone V SoC board. (Panel (a) plots generations/s, in float and double precision, for a PC, a Cyclone V SoC without the FPGA, with the FPGA, and with the FPGA plus hardware/software coprocessing.)