
Data-centric Computation Mode for Convolution in Deep Neural Networks


Peiqi Wang∗†, Zhenyu Liu†, HaiXia Wang† and Dongsheng Wang†
∗ Department of Computer Science and Technology
Tsinghua University, Beijing, China 100084
Email: wpq14@mails.tsinghua.edu.cn
† Tsinghua National Laboratory for Information Science and Technology
Beijing, China 100084
Email: liuzhenyu73, hx-wang, wds@tsinghua.edu.cn

Abstract—Deep Convolutional Neural Network (CNN) based methods have shown outstanding performance in a wide range of applications. Neural networks are becoming deeper and deeper, demanding substantial computation and memory resources. Customized hardware is one option that maintains high performance at lower energy consumption than general-purpose CPUs or GPUs. When designing such hardware, we need to address the problem of massive data transmission while still ensuring high throughput; in fact, moving data consumes more energy than computing on it, and the larger neural networks become, the harder this problem is to solve. In this paper we focus on the convolution operation, which accounts for nearly 90% of the computation and runtime of deep CNNs. We propose a data-centric computation mode for convolution, which efficiently reduces the total data transfer required during the convolution processing period and exploits data locality to achieve high throughput. Different from previous methods, which adopt efficient on-chip memory hierarchies or focus on the movement of partial results, our method concentrates on the convolution operands themselves, minimizing data transfer right from the start; it can therefore be combined with other techniques to achieve higher energy efficiency. Furthermore, we simulate and analyse the hardware overhead of data-centric convolution, corroborating its potential for high throughput at low energy consumption.

I. INTRODUCTION

Deep Convolutional Neural Networks (CNNs) are showing great power in a broad range of applications, such as face detection, object localization, scene classification, tracking, autonomous driving and speech recognition [1]–[5]. In recent years, ‘going deeper’ has become a trend for CNNs to improve accuracy: large numbers of layers, millions of filter weights, different filter sizes and more complex network structures are adopted in today's deep CNNs. For example, the GoogLeNet model [6], one of the champions of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, proposes the Inception module, which uses filters of different sizes in parallel.

While achieving these advances, CNN-based methods demand substantial computation and memory resources, and generally large and expensive servers are required. This becomes a crucial problem in situations that require high throughput at low power, such as embedded systems. Customized hardware addresses these problems. Many researchers [7]–[12] have focused on massive off-chip memory access when designing accelerators, trying to decrease energy consumption by reducing off-chip data transmission. One popular method is to design an efficient on-chip memory hierarchy [13]–[15]. For instance, DaDianNao [13] lays tens of eDRAM banks on chip in order to store the parameters of the whole network; for deep networks with substantial parameters, this requires a huge on-chip storage capacity. Another method is to compress the data [8], [10], [11], decreasing the pressure on bandwidth; its drawback is the extra hardware overhead or time required for compression and decompression. Furthermore, the energy consumed by on-chip data transfer also plays an important role: one off-chip access costs hundreds of nanojoules and one on-chip access tens of nanojoules [16], but the number of on-chip accesses is ten to a hundred times that of off-chip accesses.

To address these problems, we propose a data-centric computation mode that reduces the total amount of on-chip data transfer during the entire processing. We focus on the convolution operation, which plays an important role in contemporary deep neural networks. By exploiting the data reuse patterns of convolution, operands in the data-centric computation mode are moved less often, while high throughput is still guaranteed. In terms of hardware implementation, the mode can be realized with simple computational primitives. We simulate and analyse the hardware details and compare against the traditional convolution mode used by many previous works. The experimental results demonstrate that our method works at low energy consumption within a small footprint. Moreover, our method can be combined with previous approaches to achieve higher performance. Based on these points, our data-centric computation mode uses data sufficiently, decreases data accesses efficiently, and thereby reduces energy consumption.

The paper is organized as follows. Related work is introduced in Section II. Section III provides the background on CNNs. Section IV investigates the data-centric computation mode for the convolution operation. The evaluation is presented in Section V, and we conclude the paper in Section VI.



II. RELATED WORK

Due to the growing demand for computational capacity, hardware-accelerated methods have been studied by many groups of researchers. Compared with software acceleration, hardware implementations can more easily satisfy large and complex networks. Embedded and real-time applications commonly adopt hardware accelerators for deep CNN models.

In the 1980s and early 1990s, LeCun and his colleagues already embarked on an early path toward custom hardware for CNNs. The ANNA chip [17], an analog multiplier, signified the important role of hardware accelerators. NeuFlow [18] is a real-time, low-power accelerator providing fast computation of the feed-forward path of CNNs. In the past few years, a series of studies has appeared to optimize the memory system. Zhang et al. demonstrated a CNN performance model in [12] and took buffer management and bandwidth optimization into consideration. In [7], Chen et al. proposed efficient on-chip storage to address the memory bottleneck: they separated different kinds of data and stored them in different buffers to boost performance, and a tiling strategy was also helpful because of locality. In their further studies [13] and [19], they enlarged the on-chip memory to store all the weights of the CNN model on chip. In this manner there is almost no data traffic between on-chip and off-chip memories, showing great performance. However, for many new deep CNN models it is barely possible to store the whole model's weights in on-chip memory, and if the input is large and the buffer is not big enough, thrashing will happen. Eyeriss [10] focused on data transfer energy. It implements a spatially stationary dataflow, using a point-to-point network and single-cycle multi-cast to map a pair of filter row and image row to each processing engine. Basically, Eyeriss mainly concentrates on minimizing the data movement of partial-sum accumulation, while ours decreases the movement of convolution operands from buffer to PEs.

At the same time, other researchers pay more attention to the computing engines, since a complex computing engine seems more powerful and leads to great performance. In [20], Chakradhar et al. presented a complex computation structure to accelerate CNNs. To enable design space exploration for dynamic configuration across different CNN layers, they added dedicated switches between the computing modules, and an associated compiler was needed to exploit the parallelism among the CNN working sets. The study in [9] used a similar method: it employed a dual-range multiply-accumulate block for low-power convolution operations and integrated a number of such blocks into a CNN-optimized neuron processing engine. Angel-Eye [8] adopted a line-buffer design in its processing engines to achieve operator-level parallelism; moreover, a tiling strategy exploited data locality to improve Angel-Eye's performance. Complex structures often bring problems, the most important one being flexibility: previous designs [8], [18], [20] support only several popular models or layer settings with fixed parameters.

The emergence of some advanced technologies also encourages new acceleration methods. The works in [14], [15] utilize the features of memristors to improve performance. Other novel technologies such as processing-in-memory (PIM) and 3D-stacked memory [21] are adopted as well.
III. BACKGROUND

Although different CNN models have different network structures, they all consist of a few basic layer components: the Convolution Layer (CONV), the Pooling Layer and the Full Connection Layer (FC). Fig. 1 shows a typical CNN model. The key operations in these layers are convolution, vector multiplication, sub-pooling and non-linear operations.

Fig. 1. Sketch of the structure of a typical CNN model (convolution, pooling, convolution and FC layers producing class probabilities such as ‘woman’, ‘hat’, ‘feather’ and ‘dog’).

Convolution Layer extracts features of the input. It connects each output neuron to a local region of the input. A CONV layer computes dot products between the convolution filters and the connected regions of the input feature maps to obtain the output feature maps. The j-th output feature map can be calculated as follows:

\[
fout_j(x, y) = \sigma\!\left(\sum_{i=0}^{n_{input}} \sum_{k,l} fin_i(x+k,\, y+l) \times filter_{ij}(k, l) + bias_j\right) \tag{1}
\]

A computational example is shown in Fig. 2, where the symbol ⊗ represents a convolution operation. The function σ(x) is used for activation; we will discuss it later.

Fig. 2. A convolution operation example (image, filter and convolved feature).
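For concreteness, the following Python sketch evaluates (1) directly with nested loops. The array names, shapes, the stride-1/no-padding layout and the use of ReLU for σ are illustrative assumptions rather than details fixed by the paper:

```python
import numpy as np

def conv_layer(f_in, filters, bias):
    """Direct evaluation of Eq. (1): f_in has shape (n_input, H, W),
    filters has shape (n_input, n_output, K, K), bias has shape (n_output,)."""
    n_in, H, W = f_in.shape
    _, n_out, K, _ = filters.shape
    H_out, W_out = H - K + 1, W - K + 1          # no padding, stride 1 (assumed)
    f_out = np.zeros((n_out, H_out, W_out))
    for j in range(n_out):
        for x in range(H_out):
            for y in range(W_out):
                acc = bias[j]
                for i in range(n_in):            # sum over input channels
                    for k in range(K):
                        for l in range(K):
                            acc += f_in[i, x + k, y + l] * filters[i, j, k, l]
                f_out[j, x, y] = max(acc, 0.0)   # sigma(): ReLU chosen for illustration
    return f_out
```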
Pooling Layer is an optional sub-sampling step after a convolution layer. Max pooling and average pooling are the two main methods in CNN models. As their names imply, pooling layers return the maximum or average value of contiguous subareas, performing down-sampling along the spatial dimensions.
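As a small illustration of this down-sampling, the sketch below applies max or average pooling over non-overlapping P × P windows; the window size and stride are illustrative assumptions, not parameters taken from the paper:

```python
import numpy as np

def pool2d(fmap, P=2, mode="max"):
    """Down-sample one feature map by taking the max or average of each P x P subarea."""
    H, W = fmap.shape
    out = np.zeros((H // P, W // P))
    for x in range(H // P):
        for y in range(W // P):
            window = fmap[x * P:(x + 1) * P, y * P:(y + 1) * P]
            out[x, y] = window.max() if mode == "max" else window.mean()
    return out
```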
Full Connection Layer is almost the same as a multilayer perceptron. Each output neuron connects to all the neurons (i.e. every value in every feature map) of the previous layer. Full connection linearly transforms the input, and an activation function is applied before the output values are produced. The j-th value of the output vector can be calculated as follows:

\[
fout(j) = \sigma\!\left(\sum_{i=0}^{size_{input}} fin(i) \times filter_j(i) + bias_j\right) \tag{2}
\]

The output of the last full connection layer can be treated as the class scores, i.e. this layer works as a classifier.
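A minimal Python sketch of (2); the ReLU activation and the argument layout are illustrative assumptions:

```python
def fc_layer(f_in, weights, bias):
    """Eq. (2): f_in is a flat list of length size_input, weights[j] holds filter_j,
    bias[j] is bias_j.  Returns the output vector after the activation."""
    out = []
    for j in range(len(weights)):
        acc = bias[j]
        for i, v in enumerate(f_in):
            acc += v * weights[j][i]
        out.append(max(acc, 0.0))  # sigma(): ReLU chosen for illustration
    return out
```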
Activation function simulates the active state of biological neurons, denoted σ(x) in (1); it is applied after convolution layers and full connection layers. Usually, threshold functions are applied to model the firing rate of the neurons. Classic functions such as sigmoid and tanh, as well as newer ones such as the Rectified Linear Unit (ReLU), PReLU and Leaky ReLU, are all popular in recent CNN models.

IV. DATA-CENTRIC CONVOLUTION

A. Analysis of data reuse

According to (1) and (2), we can identify two different grains of data reuse:
• Inter-output level: For CONV layers and FC layers, multiplying the same input data with different weights creates different outputs. Therefore, the input data is reused among all outputs.
• Operator level: For CONV layers, a lot of data is reused within one convolution operation because of the sliding-window pattern. As (1) shows, one input pixel can be reused K² times (where K × K is the size of one convolution filter), excluding the inter-output reuse. In shared-weight convolution, the filter is also reused for the whole computation of one input channel. FC layers, being multilayer-perceptron operations, have no such reuse pattern.

Considering the inter-output level of data reuse, some previous works [7], [8] exchange the order of the processing dimensions to exploit this property, and a tiling strategy also benefits performance in both CONV and FC layers because of the limited resources. These approaches have shown great power in practice. Our work mainly focuses on how to exploit the operator-level data reuse pattern efficiently.

Fig. 3. Convolution operations on a channel of an input feature map.

Fig. 3 displays an example of using a 3 × 3 filter to convolve an input feature map. Each 3 × 3 box stands for one convolution, producing one value of the output. Obviously, a datum such as X22 is used in 9 convolution operations. In previous methods, convolution is treated almost as an atomic operator, calculated as (3):

\[
out_{IJ} = \sum_{i=I}^{I+K-1} \sum_{j=J}^{J+K-1} X_{ij} \times K_{ij} \tag{3}
\]

This atomic-like approach calculates 9 multiplications and 8 additions to get one output value and only then starts computing the next convolution. Therefore, X22 is loaded 9 times into the computing engines, and the number of movements keeps growing as the filter size increases. The data-centric method easily addresses this problem: once X22 is loaded into the computing engines, we calculate all computations involving X22 and subsequently accumulate all products belonging to the same final value. It is simple to implement without much extra overhead. The hardware details and the comparison with the original method are presented in the following sections.
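The K² reuse factor is easy to verify with a few lines of Python that replay the atomic schedule of (3) and count how often each input pixel is fetched; the block size below is an arbitrary assumption, and boundary pixels are naturally fetched fewer times:

```python
import numpy as np

def load_counts(H=7, W=7, K=3):
    """Count how many times each input pixel is sent to the PEs when every
    K x K convolution window is processed atomically, as in Eq. (3)."""
    counts = np.zeros((H, W), dtype=int)
    for I in range(H - K + 1):           # one output pixel per window position
        for J in range(W - K + 1):
            counts[I:I + K, J:J + K] += 1
    return counts

print(load_counts()[2, 2])  # an interior pixel such as X22 -> 9 = K*K loads
```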
B. Data-centric computation mode for convolution

For simplicity and without loss of generality, we take convolution with 3 × 3 filters as an example. The basic processing engines (PEs) in our case are simple, essentially multipliers. One convolution operation with a K × K filter can be divided into K fragments, where each fragment is the sum of the K products in one row of the connected region. Equation (4) gives a formal representation, which is the expansion of (3):

\[
out_{IJ} = \sum_{i=I}^{I+K-1} row_i, \qquad row_i = \sum_{j=J}^{J+K-1} X_{ij} \times K_{ij} \tag{4}
\]

As the filter window slides, each X_ij is used in all K fragments. In the previous convolution computation method, it is natural to compute all fragments of one output pixel, sum them together and then turn to the next output pixel; however, as analysed for Fig. 3, this leads to K² loads of each input pixel. The key point of data-centric convolution is to calculate all fragments involving X_ij as soon as X_ij is loaded. This means that in each cycle we calculate fragments belonging to different output pixels. Each fragment is sent to a delay flip-flop, waiting for the other fragments calculated in subsequent cycles; when the next fragment belonging to the same output pixel is computed, it is added directly to the fragment saved in the delay flip-flop. After an initial period of K cycles, all fragments of one output pixel are complete and already summed. The entire process is executed as a pipeline, and one output pixel is streamed out every cycle.
For instance, consider using a 3 × 3 filter to convolve a 7 × 7 input block. Fig. 4 demonstrates the dataflow of data-centric convolution. Due to the symmetry of rows and columns, we take column-major computation as an example. To simplify the discussion, we roughly treat the total time of one multiplication plus one 4-input addition as the unit of time, namely one cycle, regardless of the concrete implementation. Initially, the filter weights are stored in the corresponding PEs. In cycle #0, shown in Fig. 4, we load the first three pixels of the input and send them to every PE row: the PEs in the first column all receive X00, the PEx1 PEs receive X01 and the PEx2 PEs receive X02. Evidently, the sum of row 0 is the first fragment of output O00, denoted O00-1, whereas the sums of rows 1 and 2 are useless because the pipeline is still filling. In cycle #2, input data X20 to X22 are loaded: row 0 calculates fragment O20-1, row 1 produces fragment O10-2, and the sum of row 2, O00-3, is the last fragment of output O00, so the computation of O00 is complete. The rest proceeds in the same way.

Fig. 4. A case of data-centric convolution. The input block size is 7 × 7 and the filter size is 3 × 3.
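The following Python sketch is a software model of this column-major schedule, under our own assumptions rather than a description of the hardware: in every cycle one K-pixel segment of an input row is broadcast to the K PE rows, each PE row produces one row fragment, and fragments belonging to the same output pixel are accumulated across consecutive cycles. The model loads each input segment only once per output column and reproduces the result of a direct convolution:

```python
import numpy as np

def data_centric_conv(X, F):
    """Single-channel convolution computed in data-centric, column-major order.
    X: (H, W) input block, F: (K, K) filter.  Returns the (H-K+1, W-K+1) output."""
    H, W = X.shape
    K = F.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for J in range(W - K + 1):                 # one output column per pass
        acc = np.zeros(H - K + 1)              # models the per-output delay flip-flops
        for t in range(H):                     # cycle t: broadcast X[t, J:J+K] to all PE rows
            segment = X[t, J:J + K]
            for r in range(K):                 # PE row r holds filter row r
                p = t - r                      # output row this fragment belongs to
                if 0 <= p <= H - K:
                    acc[p] += np.dot(F[r], segment)   # one row fragment of Eq. (4)
        out[:, J] = acc
    return out

# sanity check against a direct evaluation of Eq. (3)
X = np.arange(49, dtype=float).reshape(7, 7)
F = np.arange(9, dtype=float).reshape(3, 3)
ref = np.array([[np.sum(X[i:i+3, j:j+3] * F) for j in range(5)] for i in range(5)])
assert np.allclose(data_centric_conv(X, F), ref)
```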
processing the data in next column. Adopting this mechanism,
C. Reference Mechanism pipeline keeps practicing. Reference Mechanism eliminates the
Actually, when we meet the bottom of input block, the bubbles successfully, improving the performance of pipeline in
pipeline of data-centric convolution is interrupted. For in- low overhead.
stance, in our previous case there is no output in cycle #7
and #8. The order of sending input data to PEs is row by D. Performance evaluation
row in a width of filters constantly until the last row of this Fig. 6 demonstrates the dataflow of traditional convolution
column, and then turning back to the top of next column. In computation. It is apparent that the throughput of these two
cycle #5 in Fig. 4, the result of row 0 is useless because of methods are similar, i.e. producing one output pixel per cycle.
the boundary. The next cycle is in the same circumstance. This The initial time of data-centric convolution can be ignored
leads to two bubbles in pipeline, emerging in cycle #7 and #8. when input is large. Meanwhile, the advantages of data-centric
Therefore, we introduce reference mechanism to address this mode are obvious.
problem. Firstly, less data transfer per cycle. During computation, we
We send two different groups of input pixels, Ref1 and need to transfer data from on-chip buffer to PEs constantly
Ref2. PEs choose one group between them as operands. In the to ensure the high throughput. It’s easy to figure out that
ordinary situation, Ref1 is valid. In other words, the group of for K × K filter, data-centric convolution requires K data

D. Performance evaluation

Fig. 6 demonstrates the dataflow of traditional convolution computation. The throughput of the two methods is apparently similar, i.e. one output pixel is produced per cycle, and the initial latency of data-centric convolution can be ignored when the input is large. Meanwhile, the advantages of the data-centric mode are obvious.

Fig. 6. An example dataflow of traditional convolution computation.

Firstly, less data transfer per cycle. During computation we need to transfer data from the on-chip buffer to the PEs constantly to sustain the throughput. For a K × K filter, data-centric convolution requires K data per cycle, which are then broadcast to all rows of PEs, whereas traditional convolution, shown in Fig. 6, needs K × K data in total. This means data-centric convolution reduces the bandwidth requirement between the processing engines and the on-chip buffer, which is conducive to a low wire density in the hardware implementation.

Secondly, less total data transfer during the whole convolution processing period. Considering our previous case in Fig. 4, data X20 to X22 are loaded into the PEs just once, in cycle #2, during the computation of the first column. Traditional convolution, however, has to load them three times to finish the computation, as marked in red in Fig. 6. The total amount of data transfer is thus reduced by a factor of K, showing potential for decreasing energy consumption. Equation (5) gives the amount of data transferred while processing one input channel with data-centric convolution:

\[
T_{data\text{-}centric} = W \times K^2 + W \times (H-1) \times K \tag{5}
\]

In (5), W is the width of the output feature map, H is its height, and the filter size is K × K. The formula is easy to understand: only the computation of the first line of the output image needs K² data movements, while every other line requires only K data transfers. The corresponding amount for traditional convolution is shown in (6):

\[
T_{traditional} = W \times H \times K^2 \tag{6}
\]

The change of total data transfer with filter size is shown in Fig. 7. To simplify the comparison, we plot the ratio of the amount of data transfer to the output size on the y-axis. Data-centric convolution grows nearly linearly with the filter size, while traditional convolution shows a polynomial relationship, which coincides with our earlier analysis. As the filter size grows, the data-centric computation mode performs increasingly better. In fact, the filter sizes adopted in some applications [22], [23] are much larger than those of some popular CNNs: for instance, the convolution filter in YouTube video object recognition [23] is 20 × 20, while the filter size in VGG [24] is 3 × 3.

Fig. 7. Data transfer using different convolution methods (total data transfer / output size versus filter size).
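Equations (5) and (6) are easy to tabulate. The short script below prints the per-output-pixel transfer counts for a few filter sizes; the 256 × 256 output size is an arbitrary assumption and barely affects the comparison:

```python
def transfers_per_output(K, W=256, H=256):
    """Per-output-pixel data transfer for one input channel, Eqs. (5) and (6)."""
    t_dc = W * K**2 + W * (H - 1) * K      # data-centric, Eq. (5)
    t_trad = W * H * K**2                  # traditional,  Eq. (6)
    outputs = W * H
    return t_dc / outputs, t_trad / outputs

for K in (3, 5, 7, 11, 20):
    dc, trad = transfers_per_output(K)
    print(f"K={K:2d}: data-centric {dc:6.2f}, traditional {trad:6.1f}, ratio {trad/dc:5.2f}")
```

The printed ratio approaches K, which matches the near-linear versus polynomial behaviour plotted in Fig. 7.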
V. EVALUATION

A. Basic processing engines

Fig. 8 illustrates the structures of traditional and data-centric convolution. We implement the processing engines with multipliers. For the 3 × 3 filter case shown in Fig. 8, we set the width of the PE arrays equal to the filter width, so one output pixel can be streamed out per cycle. Obviously, the traditional convolution structure can also compute FC layers. To let the data-centric structure compute full connection layers smoothly as well, we add selectors to support multilayer-perceptron operations. The selectors choose either the data from the delay flip-flops in CONV layers, as the black wires in Fig. 8(b) show, or the result of the 4-input adder in FC layers, as the yellow wires show. In FC layers, all input values are multiplied by their corresponding weights and all products are added together, producing one output point. We therefore divide the input data into sets of M points (M is the width of our basic PE unit) and broadcast the sets one by one to all basic units. Every PE holds a different corresponding weight, so each row in a basic unit computes one output point independently. Processing in parallel ensures the performance in FC layers.

Fig. 8. Structures of convolution with a 3 × 3 filter: (a) traditional convolution, (b) data-centric convolution.
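A rough functional model of this FC mapping is sketched below. The function name, the padding step and the default M are our own illustrative choices; the paper's array uses M = 16, as described in the next subsection:

```python
import numpy as np

def fc_on_pe_array(f_in, weights, M=16):
    """FC layer mapped onto a PE array of width M: the input vector is broadcast
    in sets of M values, and each PE row accumulates one output neuron."""
    n_out, n_in = weights.shape
    pad = (-n_in) % M                          # pad the input to a multiple of M
    x = np.concatenate([f_in, np.zeros(pad)])
    w = np.concatenate([weights, np.zeros((n_out, pad))], axis=1)
    acc = np.zeros(n_out)                      # one accumulator per PE row / output neuron
    for s in range(x.size // M):               # broadcast one set of M inputs per step
        chunk = x[s * M:(s + 1) * M]
        acc += w[:, s * M:(s + 1) * M] @ chunk # each row: M multiplies plus an adder sum
    return acc
```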

B. Evaluation

To evaluate the performance of our data-centric convolution, we implement architectures adopting both the traditional and the data-centric method. In fact, some previous works such as [7] and [13] adopt the traditional method. The on-chip memory and controller designs are similar to previous work; only the central computation components, the PE arrays, use different methods. In order to measure the intrinsic performance of data-centric convolution, we report the results of the PE array components only, excluding the influence of other factors.

We create cycle-accurate simulators in Verilog HDL, and area and energy are measured on synthesized implementations. We synthesize with the Synopsys Design Compiler using the TSMC 65 nm GP standard-VT library. The buffers and memories are modeled with the Artisan single-ported register file memory compiler.

To balance the computation performance between convolution layers and full connection layers, and to fit the largest filter size in our benchmark, we set the width of the PE array to M = 16. We adopt 16-bit fixed-point operations, as in previous work, and run the architecture at 704.2 MHz. For the buffers, we set the depth equal to the height of our input image, i.e. H = 256. In the experiments we focus on the inference phase only, because for many industrial applications off-line learning is sufficient: neural networks are first trained on a set of data and then used only in inference mode by the end user.

Besides the different dataflows, the adders in data-centric convolution are also better than traditional adder trees. As the filter size grows, the adder tree in traditional convolution becomes more complex and slower than the special adder in data-centric convolution. The special adder converts the addends and adds them together in one cycle; it has a negligible effect on accuracy because the input data has been normalized. The overhead statistics of the adders are presented in Table I. As the number of addends increases, the advantage of the special adder in data-centric convolution becomes more significant.

TABLE I
OVERHEAD OF ADDERS @ 700 MHz

Item        | Traditional adder tree      | Special adder
            | 4-input     | 8-input       | 4-input    | 8-input
Power (mW)  | 0.576       | 2.887         | 0.164      | 0.694
Area (μm²)  | 2002.8      | 4146.8        | 1548.8     | 2960.4

Table II reports the power and area of PE arrays employing the two methods. The PE array adopting data-centric convolution works at lower power, and its footprint is smaller than that of traditional convolution. In future work we will optimize the on-chip memory of the whole architecture to accommodate data-centric convolution so that it can achieve even higher performance. We also leave the optimization of full connection layer computation performance for future work.

TABLE II
OVERHEAD OF PE ARRAYS @ 700 MHz

Item        | Traditional convolution | Data-centric convolution
Power (mW)  | 164.96                  | 142.49
Area (mm²)  | 0.729                   | 0.703

VI. CONCLUSION AND DISCUSSION

This paper presents a data-centric convolution method, which achieves low energy consumption by reducing data movements while ensuring high throughput. It adopts a novel dataflow that reduces the total operand transfer during the whole convolution processing. In deep neural networks the energy spent on data transfer is more expensive than that spent on computation; our data-centric computation mode reduces power at its source. Moreover, data-centric convolution benefits hardware implementation through low wire density. We also evaluate our method in terms of power and area overhead, showing that it works at low power within a small footprint.

Our ongoing research introduces a reconfiguration mechanism as well as an optimized memory hierarchy into an architecture adopting the data-centric computation mode; the high hardware utilization and low energy consumption should result in great performance. The fundamentals of our data-centric computation mode can also be extended to a range of other fields: even when programming on CPUs or GPUs, the data-centric idea can be used to arrange the computing dataflow at a fine grain, which is essential and helpful to program performance.

ACKNOWLEDGMENT

This work is partially supported by the National Key Research and Development Plan of China (Grant No. 2016YFB0200505) and the National Natural Science Foundation of China (Grant No. 61373025).

REFERENCES

[1] C. Garcia and M. Delakis, “Convolutional face finder: A neural architecture for fast and robust face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408–1423, 2004.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[3] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
[4] A.-M. Zou, K. D. Kumar, Z.-G. Hou, and X. Liu, “Finite-time attitude tracking control for spacecraft using terminal sliding mode and chebyshev neural network,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 4, pp. 950–963, 2011.
[5] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn,
and D. Yu, “Convolutional neural networks for speech recognition,”
IEEE/ACM Transactions on audio, speech, and language processing,
vol. 22, no. 10, pp. 1533–1545, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“Diannao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in ACM Sigplan Notices, vol. 49, no. 4. ACM,
2014, pp. 269–284.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song et al., “Going deeper with embedded fpga platform for
convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. ACM,
2016, pp. 26–35.
[9] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, “14.6 a
1.42 tops/w deep convolutional neural network recognition processor for
intelligent ioe systems,” in 2016 IEEE International Solid-State Circuits
Conference (ISSCC). IEEE, 2016, pp. 264–265.
[10] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 eyeriss: An
energy-efficient reconfigurable accelerator for deep convolutional neural
networks,” in 2016 IEEE International Solid-State Circuits Conference
(ISSCC). IEEE, 2016, pp. 262–263.
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “Eie: efficient inference engine on compressed deep neural
network,” arXiv preprint arXiv:1602.01528, 2016.
[12] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
fpga-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[13] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,”
in Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[14] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and
Y. Xie, “Prime: A novel processing-in-memory architecture for neural
network computation in reram-based main memory,” in Proceedings of
ISCA, vol. 43, 2016.
[15] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
in Proc. ISCA, 2016.
[16] D. Pandiyan, “Data movement energy characterization of emerging
smartphone workloads for mobile platforms,” Ph.D. dissertation, ARI-
ZONA STATE UNIVERSITY, 2014.
[17] E. Säckinger, B. E. Boser, J. M. Bromley, Y. LeCun, and L. D. Jackel,
“Application of the anna neural network chip to high-speed character
recognition,” IEEE Transactions on Neural Networks, vol. 3, no. 3, pp.
498–505, 1992.
[18] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for
vision,” in Cvpr 2011 Workshops. IEEE, 2011, pp. 109–116.
[19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “Shidiannao: shifting vision processing closer to the
sensor,” in ACM SIGARCH Computer Architecture News, vol. 43, no. 3.
ACM, 2015, pp. 92–104.
[20] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, “A dynam-
ically configurable coprocessor for convolutional neural networks,” in
ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM,
2010, pp. 247–257.
[21] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neu-
rocube: A programmable digital neuromorphic architecture with high-
density 3d memory,” in Computer Architecture (ISCA), 2016 ACM/IEEE
43rd Annual International Symposium on. IEEE, 2016, pp. 380–392.
[22] Q. V. Le, “Building high-level features using large scale unsupervised
learning,” in 2013 IEEE international conference on acoustics, speech
and signal processing. IEEE, 2013, pp. 8595–8598.
[23] B. Catanzaro, “Deep learning with cots hpc systems,” 2013.
[24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

