
CNNA 2016, August 23-25, 2016, Dresden, Germany

Cellular Neural Networks for FPGAs with OpenCL

Franz Richter-Gottfried and Dietmar Fey

Chair of Computer Science 3 (Computer Architecture)

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)

91058 Erlangen, Germany Email: {franz.richter-gottfried, dietmar.fey}@fau.de

Abstract—Cellular Neural Networks (CNNs) are an inherently parallel computational model for many applications, and they are especially appropriate for image processing tasks. Besides being implemented with analogue electronic circuits, they can be simulated on digital processor architectures like CPUs and GPUs, with the drawback of limited parallelism. FPGAs offer fine-grained parallel execution with low power consumption and are thus attractive for embedded systems like smart cameras, which cannot accommodate a full-featured CPU or GPU consuming tens or hundreds of watts. The drawback of implementing CNNs with FPGAs, to profit from the high performance-to-power ratio, is the time-consuming design process with conventional hardware description languages. High-level synthesis (HLS), e.g., from OpenCL, eases the process of generating CNNs in FPGAs. By using the OpenCL programming model, the programmer can explicitly express the parallel nature of CNNs in a platform-independent way. To investigate its applicability to CNNs, we compare the execution of an unmodified OpenCL kernel on a recent CPU with an FPGA design generated with Altera's SDK for OpenCL. The results show that, although the CPU is faster, the FPGA solution performs better in terms of energy efficiency and is well suited for smart camera systems.

I. INTRODUCTION

Many applications based on CNNs have been presented, in particular image processing operations, since Chua and Yang introduced CNNs [1]. Originally, CNNs were implemented as analogue circuits, taking advantage of the parallel processing of multiple realized CNN cells. Simulating CNNs on CPUs suffers from the limited number of available arithmetic resources, so the cells' computation has to be time-multiplexed. FPGAs offer a higher degree of parallelism than CPUs or even GPUs, but at the cost of a complex and time-consuming design process. High-level synthesis (HLS) may shorten this drastically if the source language is capable of expressing the algorithm's properties. OpenCL, supported by HLS tools, allows and simplifies the expression of an array of parallel operating CNN cells connected among each other. Our experiments show a loss of performance on the FPGA compared to a realization of CNNs on a CPU. However, when we also consider energy efficiency, the OpenCL design of CNNs offers a reduction of both design time and energy consumption, making it attractive for smart cameras. The paper is structured as follows. After a short introduction to CNNs in Section II, we outline related work in Section III. In Section IV we briefly introduce OpenCL, followed by a description of an OpenCL kernel implementation of CNNs in Section V. Section VI compares the results achieved on an

ISBN 978-3-8007-4252-3


FPGA with a multi-core CPU. Finally, we conclude the paper with Section VII.

II. CELLULAR NEURAL NETWORKS

CNNs are modeled as cells connected in a regular 2D grid. Only neighboring cells are linked and communicate. Each cell maintains a state x which iteratively evolves over time based on its current state and the feedback y from its neighbors. The change of the cell state is given by (1).

x_{t+1} = x_t + Σ_{k∈N} a_k y_k + Σ_{k∈N} b_k u_k + z        (1)

The output of a cell is a function of its current state (2).

y_k(x_k) = 0.5 (|x_k + 1| − |x_k − 1|)        (2)

This basic concept is valid for all CNNs. The actual functionality of a CNN depends on two filtering masks (of size N), a and b. Mask b is used to convolve the input signal u once, and a is applied to the neighbors' outputs to produce their feedback for computing the cell's new state. This step is applied iteratively until a defined convergence criterion is met. The state is also influenced by a constant bias z. The size of the masks may differ among applications, but it strongly influences the amount of calculations and memory transfers, and thus the execution time on most devices. In general, smaller masks are desirable.
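A minimal numeric sketch of equations (1) and (2) for a single cell may clarify the update rule; the function names and the example values are our own illustration, not code from the paper:

```python
def cell_output(x):
    # Output nonlinearity (2): y = 0.5 * (|x + 1| - |x - 1|),
    # i.e. the state x clamped to the range [-1, 1].
    return 0.5 * (abs(x + 1) - abs(x - 1))

def next_state(x, ay, bu, z):
    # One cell's update (1): ay stands for the neighborhood sum of
    # a_k * y_k, bu for the sum of b_k * u_k (both assumed precomputed).
    return x + ay + bu + z

# Saturated states map to +/-1, interior states pass through unchanged:
print(cell_output(3.0))    # 1.0
print(cell_output(-0.25))  # -0.25
```

Note how (2) makes the cell output piecewise linear, so only states in the interval [-1, 1] propagate graded values to the neighbors.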

III. RELATED WORK

Several papers deal with software implementations of CNNs on CPUs and GPUs, but also on FPGAs, to take advantage of the flexibility and applicability for embedded applications. To the best of our knowledge, only few publications investigate CNNs in OpenCL, and they compare CPUs and GPUs, but not FPGAs. Potluri et al. [2] present an OpenCL implementation on CPU and GPU, but they miss an architectural description as well as a detailed performance analysis. Dolan and DeSouza [3] focus on the implementation but without any optimization, resulting in poor CPU performance. In [4], optimized CPU and GPU implementations are compared, and the GPU outperforms the CPU by a factor of ten. However, our FPGA implementation easily exceeds their CPU performance, of course also influenced by the technical progress of the devices. The authors of [5] and [6] present custom FPGA architectures, but they have to trade flexibility regarding image and filter size for performance and design simplicity, which is the advantage of using OpenCL for FPGA design.

© VDE VERLAG GMBH, Berlin, Offenbach


IV. OPENCL

OpenCL defines a library interface for the host, typically a normal CPU, to control devices like GPUs or FPGAs. The algorithm itself, referred to as a kernel, is written in the C-style language OpenCL-C. Devices offer compute units (CUs), each of which consists of processing elements (PEs) to execute the kernel in parallel. Work is distributed among PEs according to the OpenCL execution model. A work item, which is an entity in the problem space, represents a single instance of the kernel implementing the functionality of a single CNN cell. Multiple work items combined in work groups can be rapidly defined in OpenCL to realize a whole CNN array exchanging data through fast local memory.
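The mapping from the OpenCL index space to CNN cells can be illustrated in plain Python; the helper name and the 2D NDRange parameters are our own illustration, mimicking OpenCL's `get_global_id`, `get_local_id` and `get_group_id`:

```python
# Illustrative emulation of a 2-D OpenCL NDRange: every work item gets
# a global id (gx, gy) and handles exactly one cell of a height x width
# CNN grid; work groups tile the grid in blocks of group_w x group_h.

def ndrange_2d(width, height, group_w, group_h):
    """Yield (global_id, local_id, group_id) for each work item."""
    for gy in range(height):
        for gx in range(width):
            global_id = (gx, gy)
            local_id = (gx % group_w, gy % group_h)   # id inside the group
            group_id = (gx // group_w, gy // group_h) # which group
            yield global_id, local_id, group_id
```

On a real device the work items of one group can share fast local memory, which is what makes the group tiling relevant for a CNN array.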

V. IMPLEMENTATION

In the OpenCL implementation, each cell is represented by a single work item. Due to the regular memory access pattern, we omit buffering data in local memory and rely on efficient streaming access to main memory and caching. The CNN simulation is split into three kernels, buz, update and output, to express the temporal dependencies between the neighboring cells' current states and their outputs. The host first transfers the input image to the OpenCL device and calls the first kernel to compute Σ_{k∈N} b_k u_k + z once, as it is constant during execution. In each iteration, the other two kernels compute the output for each cell, based on its current state, and the next state from the processed input, the cell's current state and the neighbors' outputs. The performance of our implementation is measured using the filter setup and the mask described as EDGEGRAY in [7], which are shown in (3).
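The buz/update/output split described above can be sketched in plain Python; the data layout (lists of lists), the zero boundary handling and the function signatures are our own illustrative assumptions, with the actual kernels being written in OpenCL-C:

```python
# Structure of the simulation: "buz" runs once per image, "output" and
# "update" run once per iteration (kernel names follow the text).

def buz(u, b, z):
    # Precompute sum_{k in N} b_k * u_k + z for every cell; this term is
    # constant over all iterations. Zero boundary assumed for illustration.
    h, w = len(u), len(u[0])
    res = [[z] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        res[i][j] += b[di + 1][dj + 1] * u[ni][nj]
    return res

def output(x):
    # Kernel "output": apply the nonlinearity (2) to every cell state.
    return [[0.5 * (abs(v + 1) - abs(v - 1)) for v in row] for row in x]

def update(x, y, buz_term, a):
    # Kernel "update": next state per (1), using the precomputed buz term
    # and the neighbors' outputs y.
    h, w = len(x), len(x[0])
    nxt = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = x[i][j] + buz_term[i][j]
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        acc += a[di + 1][dj + 1] * y[ni][nj]
            nxt[i][j] = acc
    return nxt

def simulate(x, u, a, b, z, iterations):
    bz = buz(u, b, z)            # once, before the iteration loop
    for _ in range(iterations):  # the two per-iteration kernels
        x = update(x, output(x), bz, a)
    return x
```

Splitting output from update mirrors the temporal dependency: all outputs of iteration t must exist before any state of iteration t+1 is computed.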

    a = | 0  0  0 |        b = | -1 -1 -1 |
        | 0  2  0 |            | -1  8 -1 |        z = 0.5        (3)
        | 0  0  0 |            | -1 -1 -1 |

VI. RESULTS

We evaluated our implementation on a CPU (Intel Core i7-4790, 3.60 GHz) and an FPGA accelerator card (Bittware S5-PCIe-HQ D5). For both platforms, automatic selection of the work group size gave the best performance. Table I shows the execution times and effective memory bandwidth for a single iteration. In total, seven memory transfers are needed for an update, but two are only needed once for initialization, resulting in five additional transfers for each further iteration. As can be seen, the CPU outperforms the FPGA by a factor of 10 for smaller images and 5 for larger ones. See Table II for the FPGA resources of the design. Even smaller FPGAs could be used, as there are enough free resources, provided the memory bandwidth stays the same. With a higher memory bandwidth, multiple kernel instances could increase performance. Besides raw performance, power consumption is of interest, especially for embedded applications like smart camera systems. The Intel CPU used has a thermal design power (TDP) of 84 W, leading to a system power of more than

TABLE I
PERFORMANCE OF ONE ITERATION

Device | Image size | Time [ms] | Bandwidth [GB/s]
CPU    | 512x512    |      0.28 |            24.09
CPU    | 4096x4096  |     39.94 |            10.95
FPGA   | 512x512    |      3.21 |             2.13
FPGA   | 4096x4096  |    192.80 |             2.27

TABLE II
FPGA RESOURCES

Component      | Total (Percent)
Logic Elements | 71484 (21%)
FlipFlops      | 104582 (15%)
RAMs           | 531 (26%)
DSPs           | 15 (1%)

100 W. The FPGA board consumes at most 25 W, and by using OpenCL pipes the FPGA can read the input directly, so the host becomes redundant. Smaller FPGAs may be even more efficient.

VII. CONCLUSION

We implemented a typical image processing application for a CNN in OpenCL and generated an FPGA design using Altera's SDK for OpenCL to compare its performance to a recent CPU's. Though the CPU outperforms the FPGA, the lower energy consumption and the flexibility of the FPGA solution compensate for this. We show that HLS from OpenCL is a reasonable tradeoff between performance and design complexity.

REFERENCES

[1] L. O. Chua and L. Yang, "Cellular neural networks: Applications," IEEE Transactions on Circuits and Systems, vol. 35, no. 10, pp. 1273–1290, Oct 1988.
[2] S. Potluri, A. Fasih, L. K. Vutukuru, F. A. Machot, and K. Kyamakya, "CNN based high performance computing for real time image processing on GPU," in Proceedings of the Joint INDS'11 & ISTET'11, July 2011, pp. 1–7.
[3] R. Dolan and G. DeSouza, "GPU-based simulation of cellular neural networks for image processing," in 2009 International Joint Conference on Neural Networks, June 2009, pp. 730–735.
[4] T.-Y. Ho, P.-M. Lam, and C.-S. Leung, "Parallelization of cellular neural networks on GPU," Pattern Recognition, vol. 41, no. 8, pp. 2684–2692, 2008.
[5] O. Y. H. Cheung, P. H. W. Leong, E. K. C. Tsang, and B. E. Shi, "A scalable FPGA implementation of cellular neural networks for Gabor-type filtering," in The 2006 IEEE International Joint Conference on Neural Network Proceedings, 2006, pp. 15–20.
[6] R. Grech, E. Gatt, I. Grech, and J. Micallef, "Digital implementation of cellular neural networks," in Electronics, Circuits and Systems, 2008. ICECS 2008. 15th IEEE International Conference on, Aug 2008, pp. 710–713.
[7] L. O. Chua and T. Roska, Cellular Neural Networks and Visual Computing: Foundations and Applications. New York, NY, USA: Cambridge University Press, 2002.
