


Approximate Computing:
Making Mobile Systems
More Efficient
Thierry Moreau, Adrian Sampson, and Luis Ceze, University of Washington

Mobile devices run at the limits of what is possible in computer system designs, with performance and battery life both paramount. Unfortunately, battery technology is advancing slowly. Making mobile devices truly better from generation to generation will necessitate new, creative ways to extract more from each joule of a battery's capacity.

Approximate computing is an emerging research area that promises to offer drastic energy savings. It exploits the fact that many applications don't require perfect correctness. Many important mobile applications use "soft" error-tolerant computations, including computer vision, sensor data analysis, machine learning, augmented reality, signal processing, and search. A few small errors while detecting faces or displaying game graphics, for example, might be acceptable or even unnoticeable, yet today's systems faithfully compute precise outputs even when the inputs are imprecise.

Approximate computing research builds software and hardware that are allowed to make mistakes when applications are willing to tolerate them. Approximate systems can reclaim energy that's currently lost to the "correctness tax" imposed by traditional safety margins designed to prevent worst-case scenarios. In particular, research at the University of Washington is exploring programming language extensions, a compiler, and a hardware co-processor to support approximate acceleration.

Approximate Computing
There are two key challenges to realizing approximate computing's full potential: we need ways to safely program approximate computers, and we need hardware technologies that can smoothly trade off energy for accuracy.

Regarding programmability, an approximate system must make it tractable for programmers to write correct software even when the hardware can be incorrect. Programmers need to isolate parts of the program that must be precise from those that can be approximated so that a program functions correctly even as quality degrades. For example, an image renderer can tolerate errors in the pixel data it outputs; a small number of erroneous pixels might be acceptable or even undetectable. However, an error in a jump table could lead to a crash, and even small errors in the image file format might make the output unreadable.

Regarding the technology, approximate hardware must offer appealing quality-performance tradeoffs that can be exposed to the compiler. For example, an approximate adder implemented using power gating can rely on ISA (instruction set architecture, or hardware/software interface) extensions to specify the amount of error allowed in an addition operation. The challenge with most approximate hardware techniques is that they require modifications to existing processor designs, making their near-term adoption difficult, and they often present little energy reduction over their precise counterparts.

To address these challenges, our end-to-end system includes two building blocks. First, a new programmer-guided compiler framework transforms programs to use approximation in a controlled way. An Approximate C Compiler for Energy and Performance Tradeoffs (Accept) uses programmer annotations, static analysis, and dynamic profiling to find parts of a program that are amenable to approximation.

Second, the compiler targets a system on a chip (SoC) augmented with a co-processor that can efficiently evaluate coarse regions of approximate code. A Systolic Neural Network Accelerator in Programmable logic (Snnap) is a hardware accelerator prototype that can efficiently evaluate approximate regions of code in a general-purpose program.1 The prototype is implemented on the FPGA fabric of an off-the-shelf ARM SoC, which makes its near-term adoption possible. Hardware acceleration with Snnap is enabled by neural acceleration, an algorithmic transformation that substitutes regions of code in a program with approximate versions that can be efficiently evaluated on a specialized accelerator. Using Accept and Snnap, a software programmer can leverage the benefits of approximate acceleration with minimal effort by annotating legacy software with intuitive approximate datatype annotations.

Published by the IEEE CS • 1536-1268/15/$31.00 © 2015 IEEE • IEEE Pervasive Computing, April–June 2015
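To make the approximate-adder discussion above concrete, the following C sketch models an adder whose low-order carry logic is power-gated. The `bits_gated` parameter plays the role of the hypothetical ISA extension that specifies how much error an addition may incur; this is an illustrative software model, not any real hardware's implementation.

```c
#include <stdint.h>

/* Software model of an approximate adder with power-gated low-order
 * carry logic. `bits_gated` mimics a hypothetical ISA knob: the carry
 * out of the gated low-order bits is simply dropped, so the result is
 * exact in the high bits only and the error is bounded by 2^bits_gated. */
uint32_t approx_add(uint32_t a, uint32_t b, unsigned bits_gated) {
    uint32_t mask = (bits_gated >= 32) ? 0xFFFFFFFFu
                                       : ((1u << bits_gated) - 1u);
    uint32_t hi = (a & ~mask) + (b & ~mask); /* exact high-order sum   */
    uint32_t lo = (a + b) & mask;            /* low bits, carry dropped */
    return hi | lo;
}
```

Because the worst-case error is bounded by the gated width, a compiler could reason statically about whether a given `bits_gated` setting satisfies an annotation's accuracy requirement.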



An Approximate Compiler
The Accept compiler framework combines programmer annotations, code analysis, optimizations, and profiling feedback to make approximation safe and keep control in the hands of programmers. Its front end, built atop the LLVM compiler infrastructure, extends the syntax of C and C++ to incorporate an APPROX keyword that programmers use to annotate datatypes. Accept's analysis identifies code that can affect only variables marked as APPROX. Optimizations use these analysis results to avoid transforming the precise parts of the program. An autotuning component measures program executions and uses heuristics to identify program variants that maximize performance and output quality. The final output is a set of Pareto-optimal versions of the input program that reflect its efficiency-quality tradeoff space.

Safety Constraints and Feedback
Because program relaxations can have significant effects on program behavior, programmers need visibility into, and control over, the transformations the compiler applies. To give the programmer fine-grained control over relaxations, Accept extends an existing lightweight annotation system for approximate computing based on type qualifiers.2 Accept gives programmers visibility into the relaxation process via feedback that identifies which transformations can be applied and which annotations are constraining them. Through annotation and feedback, the programmer iterates toward an annotation set that unlocks new performance benefits while relying on an assurance that critical computations are unaffected.
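The annotation style described above can be sketched as follows. Under Accept's modified front end, APPROX is a real type qualifier; here it is stubbed out as an empty macro so the sketch builds with any C compiler, and the image filter and quality metric are hypothetical examples in the spirit of the article's image-renderer discussion.

```c
#include <stdlib.h>

/* APPROX is a genuine type qualifier in Accept's extended C/C++;
 * stubbed here so the example compiles with a stock compiler. */
#define APPROX

/* Pixel data may be approximated, but the loop bounds and indices
 * must stay precise, or the program could crash. */
void brighten(APPROX unsigned char *pix, int n, int delta) {
    for (int i = 0; i < n; i++) {        /* i, n: precise control flow */
        int v = pix[i] + delta;          /* approximable data flow */
        pix[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

/* Application-level quality metric: mean absolute pixel difference
 * between a precise and a relaxed output (lower is better). */
double quality_metric(const unsigned char *precise,
                      const unsigned char *relaxed, int n) {
    double err = 0.0;
    for (int i = 0; i < n; i++)
        err += abs((int)precise[i] - (int)relaxed[i]);
    return err / n;
}
```

The quality metric is what lets the autotuner judge whether a candidate relaxation of `brighten` keeps the overall output acceptable.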
Automatic Program Transformations
Based on programmer annotations, Accept's compiler passes can apply program transformations that involve only approximate data. To this end, Accept provides a compiler analysis library that finds regions of code that are amenable to transformation. An ensemble of optimization strategies transforms these regions. One critical optimization targets Snnap, our neural accelerator (described in more detail later).

Autotuning
Although a set of annotations might permit many different safe program relaxations, not all of them are beneficial in the quality-performance tradeoff they offer. A practical approximation mechanism must help programmers choose from among many candidate relaxations for a given program to strike an optimal balance between performance and quality. Accept's autotuner heuristically explores the space of possible relaxed programs to identify Pareto-optimal variants.
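Accept's autotuner is more sophisticated than this, but its core selection step, keeping only candidate relaxations that no other candidate beats on both speedup and accuracy, can be sketched as follows (the `variant` struct and candidate values are hypothetical):

```c
typedef struct {
    const char *name;
    double speedup;  /* higher is better */
    double error;    /* lower is better  */
} variant;

/* A variant is Pareto-optimal if no other variant is at least as fast
 * AND at least as accurate, with at least one of the two strictly
 * better. Writes the surviving variants to `out`; returns their count. */
int pareto_filter(const variant *v, int n, variant *out) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++) {
            if (j == i) continue;
            if (v[j].speedup >= v[i].speedup && v[j].error <= v[i].error &&
                (v[j].speedup > v[i].speedup || v[j].error < v[i].error))
                dominated = 1;
        }
        if (!dominated) out[kept++] = v[i];
    }
    return kept;
}
```

For instance, a slow and inaccurate relaxation is discarded the moment any other candidate is both faster and more accurate; what remains is the efficiency-quality frontier presented to the programmer.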
Neural Acceleration
Neural acceleration is a powerful approach to approximate computing that works by substituting entire regions of code in a program with machine-learning models.2 Neural acceleration trains neural networks to mimic and replace regions of approximate imperative code. Once the neural network is trained, the system no longer executes the original code and instead invokes the neural network model on a neural processing unit (NPU) accelerator. Neural networks have efficient hardware implementations, so this workflow can offer significant energy savings over traditional execution.

Neural acceleration consists of three phases: programming, compilation, and execution.

Programming
To use neural acceleration in Accept, the programmer uses profiling information and type annotations to mark code that's amenable to approximation. For many applications, it's easy to identify the "core" approximate data that dominates the program's execution, such as the pixel array in an image-filter algorithm. The programmer also provides a quality metric that measures the accuracy of the program's overall output.

Compilation
The compiler implements neural acceleration in four phases: region selection, execution observation, training, and code generation. Accept first identifies large regions of code that are safe to approximate and nominates them as candidates for neural acceleration. Next, it executes the program with test cases and records the inputs and outputs to each target code region. It then uses this input-output data to train a neural network that mimics the original code. Training can use standard techniques for neural networks; we use the standard backpropagation algorithm. Finally, the compiler generates an executable that replaces the original code with invocations of a special accelerator (the NPU), which implements the trained neural network.

Execution
During deployment, the transformed program begins execution on the main core and configures the NPU. Throughout execution, the program invokes the NPU to perform a neural network evaluation in lieu of executing the code region it replaced. Invoking the NPU is faster and more energy-efficient than executing the original code region on the CPU, so the program as a whole runs faster.



Hardware Support for Approximate Acceleration
Our NPU implementation, Snnap, runs on off-the-shelf FPGAs. Using existing, affordable hardware means that Snnap can provide benefits today, without waiting for new silicon. Snnap uses an emerging class of heterogeneous computing devices called programmable systems-on-chips (PSoCs). These devices combine a set of hard processor cores with programmable logic on the same die.

Compared to conventional FPGAs, this integration provides a higher-bandwidth and lower-latency interface between the main CPU and the programmable logic. However, the latency is still higher than in previous proposals for neural acceleration with special-purpose hardware.3 Our design covers this additional latency by exploiting parallelism.

Figure 1. The system diagram for the Systolic Neural Network Accelerator in Programmable logic (Snnap). Each processing unit (PU) contains a chain of processing elements (PEs) feeding into a sigmoid unit (SIG).

Implementation on the Zynq
We've implemented Snnap on a commercially available PSoC: the Xilinx Zynq-7020 on the ZC702 evaluation platform (silicon-devices/soc.html).4 The Zynq includes a dual-core ARM Cortex-A9 and an FPGA fabric. The CPU-NPU interface composes three communication mechanisms on the Zynq PSoC4 for high bandwidth and low latency. First, when the program starts, it configures Snnap using the medium-throughput general-purpose I/O (GPIO) interface. Then, to use Snnap during execution, the program sends inputs using the high-throughput ARM Accelerator Coherency Port (ACP). The processor then uses the ARMv7 SEV/WFE signaling instructions to invoke Snnap and enter sleep mode. Finally, the accelerator writes outputs back to the processor's cache via the ACP interface and, when finished, signals the processor to wake up.

Micro-Architecture
Our design, shown in Figure 1, consists of a cluster of processing units (PUs) connected through a bus. Each PU is composed of a control block, a chain of processing elements (PEs), and a sigmoid unit, denoted by the SIG block. The PEs form a one-dimensional systolic array that feeds into the sigmoid unit. Systolic arrays excel at exploiting the regular data parallelism found in neural networks, and they're amenable to efficient implementation on modern FPGAs.

When evaluating a layer of a neural network, PEs read the neuron weights from a local scratchpad memory where temporary results can also be stored. The sigmoid unit implements a nonlinear neuron-activation function using a lookup table. The PU control block contains a configurable sequencer that orchestrates communication between the PEs and the sigmoid unit. The PUs can be programmed to operate independently, so different PUs can be used to either parallelize the invocations of a single neural network or evaluate different neural networks concurrently.
Experience and Results
We applied Accept and Snnap to a set of approximable benchmarks. Our goal was to show that programmers can unlock significant efficiency gains at a small accuracy cost with minimal effort.

Writing Approximate Programs
To evaluate the effort required to apply approximation, we annotated a set of benchmarks for Accept's language. The programmers included three undergraduate researchers, all of whom were beginners with the C and C++ languages and new to approximate computing, as well as graduate students more familiar with the field.

Programmers tended to approach annotation by finding the central approximable data in the program, for example, the vector coordinates in a clustering algorithm or the pixels in imaging code. Accept's type errors guided programmers toward other parts of the code that needed annotation. Programmers needed to balance effort with potential reward during annotation, so auxiliary tools, such as profilers and call-graph generators, were useful for finding hot spots.

Snnap Acceleration Efficiency
Our evaluation targeted seven benchmarks from many application domains: option pricing, signal processing, robotics (the inverse kinematics for a 2-joint arm, inversek2j), lossy image compression, machine learning (k-means), and image processing. We compared each program's performance, power, and energy consumption when using Snnap versus running the original software on the ARM processor.



We limited each application's output error to 10 percent. Snnap incorporates eight processing units and runs at one quarter of the CPU's core frequency.

Figure 2. Whole-application speedup rates across the benchmark suite. The geometric mean was 3.78.

Performance and energy efficiency. Figure 2 shows the geometric mean for the whole-application speedup, which was 3.78 times faster across our benchmark suite. As shown, the speedup ranged from 1.30 times faster for k-means to 38.12 times for inversek2j. Inverse kinematics saw the largest speedup because the bulk of its execution was offloaded to Snnap. The target code for that benchmark includes trigonometric function calls that are expensive to evaluate on an ARM CPU. Neural acceleration approximates these expensive functions with a compact neural network that can be quickly evaluated on Snnap. Conversely, the benchmark that had the smallest speedup couldn't offload most of its execution to Snnap, and the neural network that was trained to approximate the target region was relatively complex and didn't present a significant advantage over executing the original target region on the CPU.

The energy-efficiency results were similar: energy use was reduced 2.77 times on the SoC+DRAM subsystem, ranging from 0.87 times for k-means to 28.01 times for inversek2j. The primary energy benefit of Snnap comes from racing to completion: faster execution times must compensate for Snnap's fixed power overhead.
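The racing-to-completion effect can be made concrete with a back-of-the-envelope model: energy is power times time, so an accelerator that adds a fixed power draw saves energy only when its speedup outweighs that overhead. The power figures in the usage note below are hypothetical, chosen only to illustrate both regimes, not measurements from our evaluation.

```c
/* Toy energy model for racing to completion.
 * Baseline: the CPU runs the region for unit time at power p_cpu.
 * Accelerated: the region takes 1/speedup of that time, but the system
 * draws p_cpu_sleep (CPU waiting) + p_npu (accelerator) meanwhile.
 * Returns the energy-reduction factor (baseline / accelerated); values
 * below 1.0 mean the "accelerated" version actually costs MORE energy. */
double energy_reduction(double speedup, double p_cpu,
                        double p_cpu_sleep, double p_npu) {
    double e_base  = p_cpu * 1.0;
    double e_accel = (p_cpu_sleep + p_npu) * (1.0 / speedup);
    return e_base / e_accel;
}
```

With hypothetical numbers, a large speedup (say 38x at 0.8 W total accelerated power versus a 1 W baseline) yields a large energy win, whereas a marginal 1.1x speedup against 1.2 W of combined sleep-plus-accelerator power loses energy, mirroring how inversek2j gained 28.01x while k-means fell to 0.87x.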
Comparing Snnap with application-specific datapaths. Specialized hardware accelerators are another way to improve applications' energy efficiency using FPGAs. We compared Snnap's performance to specialized FPGA designs generated by a commercial high-level synthesis (HLS) tool, Vivado HLS 2014.2. For each benchmark, we generated a specialized datapath by compiling through HLS the same region of code that we offload to Snnap via neural acceleration.

For a fair comparison, we normalized the performance of each approach by its resource usage on the FPGA. To our surprise, neural acceleration offered better resource-normalized throughput on four out of seven benchmarks. In particular, when HLS couldn't fit a fully pipelined datapath on the available resources, the resulting throughput suffered.

In addition to competitive performance, Snnap and Accept also offer programmability and generality advantages over specialized datapaths. Neural acceleration doesn't require hardware design knowledge, which was often needed to debug or optimize the performance of HLS kernels. Also, each of the seven benchmarks we compiled through HLS generated a different specialized datapath, whereas Snnap provides a single fabric for accelerating all seven benchmarks, making virtualization and context switching possible.

Accept and Snnap represent the first steps toward near-term approximate computing on PSoCs, but compilation and neural acceleration aren't the only challenges in approximate computing. We're also developing high-level tools to help programmers better navigate and understand performance-quality tradeoffs, including special-purpose debuggers for approximate programs. We also wish to explore the rich design space for approximate acceleration; neural acceleration is just one coarse-grained technique among others. Future work will establish better error guarantees for neural acceleration using robust training.

Approximate computing research is in its infancy and needs more tools for prototyping and evaluating ideas. The Accept framework and Snnap prototype, both of which we plan to make open source, demonstrate a practical and efficient implementation of approximate transformation.

Approximate computing is especially relevant in mobile environments for two important reasons. First, mobile devices are energy-constrained. Second, most of the applications we run on mobile devices are inherently approximable, including video chatting, games, video creation and consumption, and sensor data collection and summarization.



If we can make approximate computing a reality on smartphones, it could significantly increase performance while simultaneously decreasing energy consumption, thereby potentially enabling new, more demanding applications.

Acknowledgments
We thank everyone involved in doing research on approximate computing in collaboration with UW: Andre Baixo, James Bornholt, Doug Burger, Hadi Esmaeilzadeh, Dan Grossman, Kathryn McKinley, Todd Mytkowicz, Jacob Nelson, Mark Oskin, Karin Strauss, and Mark Wyse. The research in this article was supported in part by the National Science Foundation; DARPA; the Center for Future Architectures Research (C-FAR), one of six centers of STARnet; and gifts by Microsoft, Google, and Qualcomm.

References
1. T. Moreau et al., "SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration," Proc. 21st IEEE Symp. High Performance Computer Architecture (HPCA), 2015; http://homes.cs.washington.edu/~moreau/media/papers/snnap-hpca2015.pdf.
2. A. Sampson et al., "EnerJ: Approximate Data Types for Safe and General Low-Power Computation," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI), 2011, pp. 164-174.
3. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. Int'l Symp. Microarchitecture (MICRO), 2012, pp. 449-460.
4. Xilinx, Zynq-7000 All Programmable SoC: Technical Reference Manual, Feb.; documentation/user_guides/ug585-Zynq-7000-TRM.pdf.

Thierry Moreau is a PhD student at the University of Washington.

Adrian Sampson is a PhD student at the University of Washington. Contact him at asampson@cs.washington.edu.

Luis Ceze is the Torode Family Professor of Computer Science and Engineering at the University of Washington. Contact him at luisceze@cs.washington.edu.
