Thomas Faict
Supervisors: Prof. dr. ir. Dirk Stroobandt, Prof. dr. ir. Erik D'Hollander
Counsellors: Prof. dr. ir. Bart Goossens, Alexandra Kourfali
The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use. In the case of any other
use, the copyright terms have to be respected, in particular with regard to the obligation
to state expressly the source when quoting results from this master dissertation.
Acknowledgments
I would like to thank my supervisors Prof. dr. ir. Dirk Stroobandt and Prof. dr. ir.
Erik D’Hollander for the opportunity to work on this thesis. My thanks also go out to my
counsellors, Alexandra Kourfali, Vijaykumar Guddad and Prof. dr. ir. Bart Goossens.
Furthermore, I would like to thank ir. Dries Vercruyce for his help with the Paladin server.
Finally, I would like to thank my family and friends for their support over the past years.
Many thanks to Sophie for supporting me during this busy year.
Exploring OpenCL on a CPU-FPGA
Heterogeneous Architecture Research
Platform
by
Thomas Faict
Supervisors: Prof. dr. ir. Dirk Stroobandt, Prof. dr. ir. Erik D’Hollander
Counsellors: Prof. dr. ir. Bart Goossens, Alexandra Kourfali
Abstract
Intel recently introduced the Heterogeneous Architecture Research Platform (HARP).
In this platform, the Central Processing Unit (CPU) and the Field-Programmable Gate
Array (FPGA) are connected through a high bandwidth, low latency interconnect and
both share DRAM memory. For this platform, OpenCL, a High-Level Synthesis (HLS)
language, is made available. By making use of HLS, a faster design cycle can be
achieved than with programming in a traditional Hardware Description Language
(HDL). This, however, comes at the cost of having less control over the hardware
implementation. We investigate how OpenCL can be applied to the HARP
platform. In a first phase, a number of benchmarks are executed on the HARP platform
to analyze performance-critical parameters of the OpenCL programming model for
FPGAs. In a second phase, the guided image filter algorithm is implemented using the
insights gained in the first phase. Both a floating point and a fixed point version were
developed, based on a sliding window approach. This resulted in a maximum floating
point performance of 135 GFLOPS and a maximum fixed point performance of 430 GOPS.
Keywords
HARP, OpenCL, High-Level Synthesis, Guided Image Filter
Exploring OpenCL on a CPU-FPGA Heterogeneous
Architecture Research Platform (HARP)
Thomas Faict
1 Introduction
  1.1 Problem
  1.2 Goal
  1.3 Organization
2 Background
  2.1 FPGA
    2.1.1 Architecture
    2.1.2 Development Models
  2.2 Reconfigurable Computing
  2.3 Roofline Model
4 Software
  4.1 Open Computing Language
    4.1.1 OpenCL Architecture
    4.1.2 OpenCL Kernel
    4.1.3 OpenCL Host
  4.2 Intel FPGA SDK for OpenCL
    4.2.1 Board Support Package
    4.2.2 FPGA Bitstream
  4.3 FPGA Performance Model
    4.3.1 Pipeline Parallelism
    4.3.2 Initiation Interval
  4.4 Development Process
  4.5 Performance Tuning Phase
    4.5.1 Compilation Reports
    4.5.2 Profiling Reports
  4.6 Compiler Directives
    4.6.1 Loop Unrolling
    4.6.2 Compilation Options
  5.4.2 Results
  5.5 Conclusion
List of Figures
2.1 High-level FPGA architecture. The main building blocks are logic cells, embedded memory, DSPs and I/O blocks. These elements are connected through reconfigurable routing lanes.
2.2 Roofline model graph. For a low operational intensity, the performance will ultimately be limited by the memory bandwidth. For a higher operational intensity, on the other hand, the performance will ultimately be limited by the maximum floating point performance of the computing system.
4.1 The Board Support Package (BSP) separates the kernel logic part from the I/O part.
4.2 The Board Support Package (BSP) is included when compiling the kernel code.
4.3 Basic pipelined implementation of adding two variables.
4.4 OpenCL-based design flow.
4.5 Screenshot of a compilation report with the 4 different panes indicated.
4.6 Example of a loops analysis report.
4.7 Example of a system viewer report.
4.8 Example of a profile report.
5.1 HARP bandwidth when reading values on FPGA from the CPU. The different lines denote the different word sizes of the memory transfers.
5.2 Roofline model for the HARP platform. The blue line denotes the performance limit based on the maximum theoretical bandwidth, while the green line denotes the performance limit based on the maximum achieved bandwidth.
5.3 HARP bandwidth for reading 4 B values on FPGA from the CPU. The blue line denotes the bandwidth when reading 4 B each clock cycle, while the green line is the result of unrolling the loop with unroll factor 16.
5.4 HARP bandwidth for reading 4 B values on FPGA from the CPU. The loop was unrolled by various unroll factors.
5.5 HARP bandwidth for reading 64 B struct variables on FPGA from the CPU.
5.6 HARP bandwidth for reading ulong8 variables on FPGA from the CPU. The blue graph corresponds with the bandwidth calculated by the formula buffer size/kernel execution time, while the green graph displays the bandwidth as measured by the OpenCL profiler.
5.7 Different types of memory hierarchies.
  5.7a Memory hierarchy when using the volatile keyword. The memory hierarchy consists of three levels: the QPI cache, the CPU's Last-Level Cache (LLC) and DRAM memory.
  5.7b Memory hierarchy without using the volatile keyword. The memory hierarchy consists of four levels: the OpenCL cache, the QPI cache, the CPU's Last-Level Cache (LLC) and DRAM memory.
5.8 Bandwidth of reading src when executing kernel cache read for 64 B values.
  5.8a Bandwidth of reading src when executing kernel cache read for ulong8 pointer types. The bandwidth for both the kernel with and without the volatile keyword is displayed.
  5.8b Bandwidth of reading src when executing kernel cache read for volatile ulong8 pointer types. In this graph, only the FIU cache is used. This graph zooms in on the blue graph in the above figure.
5.9 Bandwidth of reading src when executing kernel cache read for 128 B values.
  5.9a Bandwidth of reading src when executing kernel cache read for ulong16 pointer types. The bandwidth for both the kernel with and without the volatile keyword is displayed.
  5.9b Bandwidth of reading src when executing kernel cache read for volatile ulong16 pointer types. In this graph, only the FIU cache is used. This graph zooms in on the blue graph in the above figure.
5.10 Memory buffer location when using different types of platforms.
  5.10a In a PCIe-based CPU-FPGA architecture, the original data resides in CPU DRAM while the memory buffer is located in the FPGA DRAM.
  5.10b In the HARP platform, both the original data and the memory buffer are located in the DRAM memory.
5.11 In the HARP platform, both the CPU and the FPGA can access SVM.
5.12 Bandwidths when copying data in the HARP platform. The blue line shows the bandwidth when writing data to a write buffer while the red line denotes the bandwidth when reading from a memory buffer.
6.1 The guided image filter algorithm smooths the filtering input image by making use of a guidance image. The filtering output consists of the filtering input, enhanced by the data in the guidance image.
6.2 Conceptual example of sliding window filtering.
6.3 Sliding window implementation. Kernel 1 computes the first three steps in the guided image filter algorithm, while kernel 2 computes the last two steps.
6.4 Data structure image data.
6.5 Principle of the sliding window. The pixels in green and blue are stored row-wise in a shift register. The pixels in blue are used to calculate the window function f_mean,r.
  6.5a Shift register state at clock cycle n. In the next clock cycle, the red pixel will be loaded into the shift register in local memory.
  6.5b Shift register state at clock cycle n+1. All values are shifted one place in the shift register compared to the previous clock cycle.
6.6 Roofline model of the HARP platform. The red dots indicate the performance of the floating point kernel for different values of radius, while the blue dots indicate the performance of the fixed point implementation.
List of Tables
6.1 Resource usage for OpenCL kernels implementing the sliding window implementation using fixed point and floating point computations.
6.2 Execution times when executing the sliding window implementation of the guided filter on the HARP platform.
6.3 Profiling results when executing the sliding window implementation of the guided image filter algorithm.
6.4 Cache hit rates, derived from the OpenCL profiler, when executing the sliding window implementation of the guided image filter algorithm.
6.5 Execution times of the guided filter on the HARP platform with SVM and with memory buffers.
6.6 Kernel execution times when processing a different number of pixels in parallel.
Acronyms
B Byte
CPU Central Processing Unit
FLOPS Floating Point OPerations per Second
FPGA Field-Programmable Gate Array
GB/s 10^9 Byte/second
GPU Graphics Processing Unit
HARP Heterogeneous Architecture Research Platform
HLS High-Level Synthesis
II Initiation Interval
IO Input/Output
KiB 2^10 Byte
MiB 2^20 Byte
OpenCL Open Computing Language
PCIe Peripheral Component Interconnect Express
QPI QuickPath Interconnect
SVM Shared Virtual Memory
Chapter 1
Introduction
development time. Even though hardware mapping is done automatically, this process can
often be guided by making use of compiler directives. Some examples of HLS languages
and tools are SystemC, Vivado HLS and Open Computing Language (OpenCL).
In traditional CPU-FPGA platforms, both the CPU and the FPGA have their own DRAM
memory, and they are connected through a low bandwidth connection with high latency.
This is depicted in Figure 1.1a. Intel, however, introduced the Heterogeneous Architec-
ture Research Platform (HARP), in which an Intel Xeon CPU is combined with an Intel
Arria 10 FPGA. This platform differs from traditional CPU-FPGA platforms in that the
CPU and the FPGA are connected through a high bandwidth, low latency communica-
tion link. Moreover, the CPU and the FPGA share DRAM memory, and a soft IP cache
is implemented in the FPGA. This configuration is illustrated in Figure 1.1b. Furthermore,
an OpenCL Software Development Kit (SDK) is made available for the HARP platform.
Therefore, FPGA applications can be developed in an HLS language.
1.1 Problem
Using OpenCL on the HARP platform poses several questions. As OpenCL is an HLS
language, the digital implementation is determined by the OpenCL compiler. As a result,
there is less control over the hardware implementation itself.
1.2 Goal
The goal of this thesis is to evaluate OpenCL as an HLS language for applications on the
HARP platform. In a first phase, different performance characteristics of the HARP plat-
form are benchmarked by making use of OpenCL. It is, for instance, investigated whether
an OpenCL application can obtain high bandwidths or whether it can benefit from the soft
IP cache in the FPGA. Based on the achieved results, it is formulated which performance
characteristics of the HARP platform an OpenCL developer can exploit. Moreover, it is
investigated to what degree the hardware implementation can be guided using OpenCL
directives.
In a second phase, a case study is performed. As a design example, the guided image filter
algorithm is considered. Guided image filtering is an image processing technique in which
Figure 1.1: (a) Traditional CPU-FPGA platform: the CPU and the FPGA are connected through a low bandwidth, high latency link. (b) HARP's CPU-FPGA platform: the CPU and the FPGA are connected through a high bandwidth, low latency interconnect, and they share DRAM memory.
images are smoothed. This application is implemented on the HARP platform, using the
insights gained in the first phase.
1.3 Organization
Chapter 2
Background
2.1 FPGA
Field-Programmable Gate Arrays (FPGAs) are digital chips that can be programmed to
implement arbitrary digital circuits. In contrast to Application-Specific Integrated
Circuits (ASICs), FPGAs can be reconfigured, and therefore their functionality can be
changed. This reconfigurability, however, comes at the cost of higher area usage, lower
frequency and higher power requirements than ASIC implementations.
In this section, the architecture of FPGAs and the possible development models are
described.
2.1.1 Architecture
Logic blocks consist of Lookup Tables (LUTs). LUTs can perform any boolean function
of a limited number of inputs, and therefore provide great flexibility. Even though
LUTs are useful for bit manipulations, they can be too fine-grained to efficiently
implement many types of circuits, such as a floating point multiplication. Therefore, an
FPGA also contains Digital Signal Processing (DSP) blocks. In DSP blocks, frequently
occurring operations, such as multiplications, are directly implemented in a highly
optimized IP core. Embedded memory blocks, on the other hand, allow storage of frequently
Figure 2.1: High-level FPGA architecture. The main building blocks are logic cells, embedded memory, DSPs and I/O blocks. These elements are connected through reconfigurable routing lanes [3].
used data. Due to the proximity of these memory elements, data can be accessed quickly,
with low latency. Finally, I/O blocks connect the computing fabric of an FPGA to the
outside world [3].
In addition to DSPs, embedded memory and programmable logic blocks, an FPGA also
contains routing lanes. These routing lanes connect all blocks and can be reconfigured
to interconnect the desired elements [3].
2.1.2 Development Models

In HDL development, the digital system is directly defined at the Register Transfer Level
(RTL). RTL is an abstraction level in which all state information is stored in registers and
where logic between the registers is used to generate or compute new states. Consequently,
the RTL level describes all memory elements (e.g., flip-flops, registers, or memories) and
the used logic. For each clock cycle, it is specified which values are transferred to which
storage locations, and therefore the flow of data through a circuit is directly determined
[4].
An RTL description is specified in an HDL, with VHDL and Verilog being the most popular
ones. These languages give full control over the generated hardware. Therefore, not only
the algorithm itself is defined, but also its implementation in the digital system. However,
these languages are typically used by trained hardware design engineers, and a strong
hardware background is needed to harness the full potential of this approach. Moreover,
since the cycle-by-cycle behavior of the system is completely specified, development at
such a low level is hard [4].
High-Level Synthesis
In HLS, the FPGA bitstream is defined at the algorithmic level, typically in a C-based
language (e.g. SystemC), from which a digital circuit is automatically generated. HLS is
beneficial for FPGA developers for several reasons. First, no extensive hardware
expertise has to be built up. Furthermore, HLS allows developers to design systems faster
at a high level of abstraction, and to rapidly explore the design space. These benefits,
however, come at the cost of having less control over the hardware implementation [4], [5].
2.2 Reconfigurable Computing

FPGAs fill the gap between hardware and software accelerators, and therefore have several
benefits. Firstly, an FPGA directly implements a digital function, and therefore it is,
just like an ASIC, instructionless. Because no instructions need to be fetched, a high
performance and a low power consumption are achieved. However, in contrast to ASICs,
FPGAs can be reprogrammed, so their functionality is not fixed and they are able to
meet changing requirements [6].
Reconfigurable computing is mainly used in embedded systems, as FPGAs offer low power
requirements and high reliability. FPGAs are, for instance, commonly used in network
equipment, avionics and the automotive industry. However, due to emerging big data
applications such as machine learning, high computing performance is also required in
other domains, such as data centers. Because energy consumption is at the same time
becoming an issue in data centers, reconfigurable computing is attracting attention and is
expected to gain a broader use in the future [6]. Microsoft, for instance, equipped servers
with FPGAs for the operation of the Bing search engine. They demonstrated a doubling
of the ranking throughput of Bing, while increasing the power consumption by only 10 %
[7].
2.3 Roofline Model

A comprehensible model that offers performance guidelines can be valuable to guide a
programmer in application development. In [8], a model was proposed that relates processor
performance to off-chip memory traffic. This model is called the roofline model.
In the roofline model, operational intensity is related to floating point performance. The
operational intensity is defined as the number of floating point operations (FLOP) per byte
(B) of DRAM traffic. Note that DRAM traffic does not include all memory traffic:
memory requests that are served by the cache, for instance, are not passed to DRAM
memory and are therefore not taken into account. The floating point performance, on
the other hand, is defined as the number of floating point operations per second
(FLOPS), and defines the performance of the application.
In Figure 2.2, the roofline model is illustrated. Both axes in the graph are logarithmically
scaled. The x-axis defines the operational intensity, and the y-axis defines the achievable
performance. Given an application, the achieved performance is equal to the operational
intensity times the obtained bandwidth. The obtained bandwidth, however, can never be
higher than the peak memory bandwidth, and the performance of the application can never
exceed the peak performance of the hardware platform. This is represented by the roofline
model. If the performance hits the roof, it either hits the flat part of the roof, which means
performance is compute bound, or it hits the slanted part, which means performance is
memory bound. Based on these two performance limits, the following formula can be used
to calculate the attainable GFLOPS.
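In the notation of the original roofline paper [8], this bound reads:

```latex
\text{Attainable GFLOPS} = \min\left(\text{Peak GFLOPS},\;\; \text{Peak Memory Bandwidth} \times \text{Operational Intensity}\right)
```

The first argument of the minimum is the flat (compute) roof, the second the slanted (memory) roof.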
Chapter 3
Heterogeneous Architecture Research Platform (HARP)
3.1 Introduction
In the next sections, the hardware architecture of HARPv2 will be analyzed. First, some
details on the Arria 10 FPGA will be listed. Thereafter, the interconnect structure of
the platform will be investigated. In the last section, a comparative study between the
HARPv1 platform and a traditional CPU-FPGA platform will be discussed.
In the HARPv2 platform, the implemented FPGA is the Intel Arria 10 GX1150. A short
overview of the most important specifications can be found in Table 3.1. This table shows
a peak floating point performance of 1366 GFLOPS, which should be interpreted in the
following way. The Arria 10 FPGA contains 1,518 DSPs. Each of these DSPs can perform
2 FLOP/clock cycle, which results
in 3036 FLOP/clock cycle for all DSPs together. At a rate of 450 MHz, this results in
3036 FLOP/clock cycle · 450 MHz = 1366 GFLOPS [10].
Table 3.1: Most important resources of the Intel Arria 10 GX1150 [10].

Resource                  Value
Logic Elements            1,150,000
Adaptive Logic Modules    427,200
Registers                 1,708,800
Memory                    65.7 Mb
DSPs                      1,518
Peak GFLOPS               1,366
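As a quick arithmetic check of the peak performance derivation above (illustrative only, using the figures from the text):

```python
# Peak floating point performance of the Arria 10 GX1150, following the
# derivation in the text: 1,518 DSPs x 2 FLOP per clock cycle at 450 MHz.
dsps = 1518
flop_per_dsp_per_cycle = 2
clock_rate_hz = 450e6

peak_gflops = dsps * flop_per_dsp_per_cycle * clock_rate_hz / 1e9
print(peak_gflops)  # 1366.2, i.e. the 1366 GFLOPS quoted above
```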
The CPU and the FPGA in the HARPv2 platform are connected through three physi-
cal channels: one QuickPath Interconnect (QPI) channel and two Peripheral Component
Interconnect Express (PCIe) channels. QPI is an interconnect technology, developed by
Intel, that connects processors in a distributed shared-memory style [12]. PCIe, on the
other hand, is a general purpose I/O interconnect defined for a wide variety of computer
and communication platforms [13]. The HARPv2 uses a PCIe Gen3 x8 interconnect: the
connection uses third generation PCIe technology and is 8 lanes wide, so 8 bits are
transferred in parallel.
3.3.1 Bandwidth
Raw Bandwidth
The raw bandwidth is the amount of data that can be transferred per second, without
accounting for overhead or other effects. It can be calculated as the number of bytes
sent in one transfer multiplied by the number of transfers per second. The number of
transfers per second is expressed in gigatransfers per second (GT/s).
The QPI channel has 16 data connections, and therefore a width of 2 B per transfer. Its
maximum transfer rate is 6.4 GT/s. This results in a maximum, one-way bandwidth of 12.8 GB/s
[12]. Similar calculations can be done for the PCIe channels. The PCIe connection is of
the third generation and transfers 8 bits (1 B) per transfer. However, compared to the QPI
channel, PCIe uses a less efficient data encoding scheme, which results in a physical
channel efficiency of 98.5 %: for every 130 bits sent, only 128 bits are useful.
The maximum amount of transfers per second equals 8 GT/s. This results in a maximum
raw bandwidth of 98.5 % · 1 byte · 8 GT/s = 7.88 GB/s [13].
Both the QPI and PCIe bandwidths are unidirectional. Because both connections consist
of a separate read and write channel, the total available bandwidth can be doubled. This
leads to a total raw bandwidth of 25.6 GB/s for the QPI channel, and 15.75 GB/s for the
PCIe channel. However, the individual read or write bandwidth cannot exceed the
unidirectional bandwidth. If the three physical channels are combined, a theoretical read
and write bandwidth of 1 · 12.8 GB/s + 2 · 7.88 GB/s = 28.56 GB/s is obtained.
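The channel bandwidths above can be tallied with a short script (all numbers taken from the text):

```python
# Raw one-way bandwidths of the HARPv2 channels, in GB/s.
qpi_bw = 2 * 6.4            # QPI: 2 B per transfer x 6.4 GT/s = 12.8 GB/s
pcie_bw = 0.985 * 1 * 8     # PCIe Gen3 x8: 98.5 % encoding efficiency x 1 B x 8 GT/s

# One QPI channel plus two PCIe channels, read (or write) direction only.
total_read_bw = qpi_bw + 2 * pcie_bw
print(total_read_bw)  # ~28.56 GB/s
```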
Achievable Bandwidth
Based on the previous calculations, one could conclude that the HARP platform can obtain
read and write bandwidths up to 28.56 GB/s. The above results are, however, theoretical
bandwidths and do not account for packet overhead and other effects. Therefore, the
maximum bandwidth that can be achieved will be lower, and can also vary for different
types of data transfers. In [12], the packet overhead for the QPI channel and for a second
generation PCIe channel was calculated. For a payload size of 64 B, QPI and PCIe obtained
efficiencies of 89 % and 68 %, respectively. For a larger payload size of 256 B, however,
the efficiency of the PCIe connection increased to 79 %.
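Applying these measured efficiencies to the raw bandwidths gives a rough feel for the achievable rates. This is only a back-of-the-envelope sketch: [12] measured a second generation PCIe channel, so applying its efficiency to the Gen3 raw rate is merely indicative.

```python
# Effective bandwidths for 64 B payloads, using the efficiencies from [12].
qpi_effective = 0.89 * 12.8    # ~11.4 GB/s
pcie_effective = 0.68 * 7.88   # ~5.4 GB/s (Gen2 efficiency applied to Gen3 raw rate)
print(qpi_effective, pcie_effective)
```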
The efficiency of the connection, and therefore the bandwidth, depends on multiple factors.
However, even though the calculated raw bandwidths are only theoretical values, they give
a good indication of the potential of a connection and of the maximum bandwidth a
connection can physically provide.
An important responsibility of the FIU is to abstract the QPI and PCIe channels by
providing a transparent interface to the AFU. Therefore, a developer does not need to
program low-level details of the physical communication channels. This interface is termed
the Core-Cache Interface (CCI-P), and will be discussed in this section.
[Figure 3.1: block diagram of the connection between DRAM and the AFU, showing the CCI-P interface and the MPF.]
Virtual Channels
The CCI-P abstracts the three different physical channels in the HARP platform. This
is done by mapping them to four so-called virtual channels. PCIe 0 is mapped to virtual
channel VH0 (Virtual channel with High latency), PCIe 1 to VH1 and QPI to VL0 (Virtual
channel with Low latency). A fourth virtual channel, VA (Virtual Auto), combines the
three physical channels in one virtual channel and automatically selects the appropriate
physical link during execution by making use of Virtual Channel (VC) steering logic [14],
[15].
The CCI-P does not provide memory services other than abstracting the physical channels.
For instance, no ordering of memory requests is enforced. Therefore, each memory request
can bypass another request, even to the same memory address. Even though the CCI-P
memory model does not provide ordering, this can be acceptable for applications that do
not read from and write to the same memory address. Other applications, however, require
memory transactions to be ordered. In order to support more complex requirements, Intel
provides so-called basic building blocks. Basic building blocks are reference designs that
users can instantiate in their AFU. The Memory Properties Factory (MPF) is a basic
building block that provides a collection of extensions to the CCI-P. Even though the
MPF is predefined, it is the responsibility of the user to instantiate the MPF in their
design. For this reason, the MPF is placed on the AFU-side in Figure 3.1 [14].
The MPF transforms a CCI-P request into a CCI-P request with certain added properties.
An example of a property the MPF can add is memory request reordering, which guarantees
the correct ordering of memory requests [14]. Another example is
virtual to physical address translation. By using this translation, a HDL developer is able
to use virtual addresses in the AFU, which heavily simplifies the development process. The
benefits of the MPF, however, come at a cost. In [2], it was found that transaction
reordering adds three FPGA clock cycles to a read request and one clock cycle to a write
request. Address translation, on the other hand, requires an extra four clock cycles for
both read and write requests.
The FIU not only contains the CCI-P; it also provides a soft IP cache for the QPI channel.
This cache has a total capacity of 64 KiB and is organized as a direct-mapped cache with
64 B cache lines, resulting in 1,024 cache lines. As the FPGA cache is included in
the cache coherence domain of the CPU, the data in the FPGA cache is coherent with
the CPU-cache and the DDR-memory. This results in the coherence domain depicted in
Figure 3.2 by the dotted line.
Figure 3.2: Cache coherence domain [15]. This figure uses UltraPath Interconnect (UPI),
which is the successor of QPI.
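The cache geometry can be sketched as follows; the `line_index` helper is a hypothetical illustration of direct mapping, not HARP code:

```python
# FIU soft IP cache: 64 KiB total, direct-mapped, 64 B cache lines.
CACHE_SIZE = 64 * 1024   # bytes
LINE_SIZE = 64           # bytes
NUM_LINES = CACHE_SIZE // LINE_SIZE

def line_index(byte_addr):
    """Cache line that a byte address maps to in a direct-mapped cache."""
    return (byte_addr // LINE_SIZE) % NUM_LINES

print(NUM_LINES)  # 1024
# Two addresses exactly 64 KiB apart map to the same line and evict each other.
print(line_index(0) == line_index(CACHE_SIZE))  # True
```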
Because of the expanded cache coherence domain, AFU memory requests can be serviced
by three different memory levels. This results in the AFU memory read paths displayed
in Table 3.3. The lowest latency and highest bandwidth are obtained when reading from
the FPGA cache itself. In this case, no read request to the CPU is required. If the FPGA
cache does not contain the requested cache line, the next memory level to be addressed is
the CPU Last Level Cache (LLC). This results in a higher latency and lower bandwidth
than when reading from the FPGA cache. The highest latency and lowest bandwidth are
obtained when both the FPGA cache and the CPU LLC read requests result in a miss. In
this case, the read request is serviced by the DRAM.
Since the FPGA cache is only implemented for the QPI channel, the memory read path
depends on the used virtual channel. When the channels VH0 or VH1 are used, the read
request will not pass the FPGA cache as they are mapped to PCIe channels 0 and 1. A
read request to virtual channel VL0 will always pass the FPGA cache. When using the
VA virtual channel, the Virtual Channel (VC) steering logic will map the request to one of
the three different physical channels. The VC steering logic takes into account parameters
such as the link utilization and traffic distribution. It does not, however, take cache locality
into account. Therefore, a read request can be mapped to a PCIe channel, even though
the requested data is present in the FPGA cache [14].
The FIU framework supports two development models: HDL design and OpenCL design.
In the case of HDL design, the HDL code is synthesized through the Intel Quartus
tool chain to generate the FPGA bitstream. OpenCL, on the other hand, is a high-level
synthesis framework for writing FPGA code at a higher level of C-like abstraction. The
OpenCL code is compiled to HDL code by the Intel OpenCL compiler, after which this HDL
code is synthesized using Intel Quartus. As this thesis focuses on OpenCL development,
this development model will be further elaborated in Chapter 4 [14].
Figure 3.3: Traditional PCIe-based CPU-FPGA platform. In contrast to the HARP platform, the CPU and the FPGA are not connected through QPI and they do not share DRAM memory [2].
Benchmarks
For both platforms, a series of benchmarks was executed. In a first test, the effective
bandwidth between the CPU and the FPGA was measured. Measuring the bandwidth for
both platforms is highly different. In the HARP-platform, data originates from the shared
DRAM memory, and therefore, only one FPGA read is required. In the Alpha Data board,
on the other hand, CPU-FPGA communication includes two steps. In a first phase, data
is copied from the CPU DRAM to the FPGA DRAM through the PCIe channel. After
completing the copy operation, the FPGA can access the data in the FPGA DRAM. This
led to the results shown in Table 3.4. It must be noted that the lower read latency of the
HARPv1 platform is only obtained for fine-grained memory accesses (<4 kB).
Table 3.4: Read bandwidth and read latency comparison between the Alpha Data board
and the HARPv1 platform [2].
In a second test, the cache behavior of the HARPv1 platform was examined. In Table 3.5,
the read and write latencies of both a cache hit and a cache miss are shown.
Conclusions
Based on the results, the authors of the study formulated suggestions concerning the use
of the HARPv1 platform and its improvements over a traditional PCIe-based platform. A
first finding is that the DRAM-latency of the HARP platform is lower than the FPGA
DRAM-latency of the Alpha Data platform. As this latency advantage only holds
for small payload sizes (<4 kB), a QPI-based platform is preferred for latency-sensitive
applications, especially those that require frequent (random) fine-grained CPU-FPGA
communication. Secondly, the authors observed that the long latency (14 clock cycles) and
small size (64 KiB) of the FPGA-side cache pose a serious challenge for users trying to take
advantage of this cache. In particular, since the access time to embedded FPGA memory is
only one clock cycle, the gap between this embedded memory and the cache is too large.
Chapter 4
Software
An OpenCL application consists of two parts: the host and the kernel. Host code is
executed on the CPU, while kernel code is executed on a selected accelerator. Both com-
ponents have a different function in the OpenCL framework. The host can be considered
as the executive part of an OpenCL application as it manages the kernel. It is responsible
for providing data to the kernel, invoking the kernel etc. The kernel, on the other hand,
is a function executed on the accelerator. Therefore, it typically consists of critical code
that is executed more efficiently on the accelerator architecture. A host can call multiple
kernels on different accelerator devices [16].
An example of an OpenCL kernel is displayed in Listing 4.1. This kernel is called vectoradd,
and has three input parameters a, b and c. These parameters are defined as floating point
variables, and are characterized by the keyword global. This keyword indicates that each
parameter can be accessed by all invocations of the OpenCL kernel. This is useful
when using NDRange kernels, as explained later in this section. The kernel vectoradd
calculates the sum of two vectors a and b, and writes the outcome to vector c. As all
parameters are passed by reference, as pointers to a memory location, the OpenCL kernel
can read from and write to the location the pointer points to.
Kernel Types
OpenCL kernels can be subdivided into two types: single work-item kernels and
NDRange kernels. In a single work-item kernel, only one instance of the kernel is executed,
while in an NDRange kernel, multiple instances of the kernel are executed in parallel. The
difference between these two types of kernels can be illustrated by the examples displayed
in Listing 4.1 and Listing 4.2. In both code fragments, the sum of vectors a and b is
calculated and written to vector c. This functionality is implemented in Listing 4.1 as a
single work-item kernel, while Listing 4.2 implements this functionality as an NDRange
kernel. Even though both kernels calculate the same vector-summation, they execute in
an entirely different way.
The code in the single work-item kernel is implemented as a single compute unit. This
means that the functionality of the kernel is implemented only once, and that this single
implementation processes all data. Since the implemented for-loop iterates serially over the
indices i, only a single element of vector c is calculated at each time step. An NDRange
kernel, on the other hand, consists of multiple compute units. Therefore, the kernel function
is executed in a SIMD way. As can be seen in Listing 4.2, no for-loop is present. Instead
of sequentially calculating the sum of vectors a and b, the OpenCL framework assigns a
specific index to each compute unit by using the function get_global_id(0). In this paradigm,
the keyword global is relevant. As all compute units access the data from a and b, and
write to c, these pointers must be made available to every compute unit. This is done by
using the global keyword. There exist other keywords that make data available for only a
specific work-item, but this is out of scope.
Both kernel types have different characteristics, and the choice between them
heavily depends on the hardware for which the kernel is developed. Executing, for instance,
a single work-item kernel on a GPU results in one GPU-core being used while all others
are idle. An NDRange kernel execution, on the other hand, will lead to the simultaneous
execution of different compute units on different cores. Therefore, an NDRange kernel is
highly suited for a GPU architecture. An FPGA, on the other hand, has entirely different
hardware characteristics. Even though an FPGA can implement data-level parallelism
like a GPU, it can implement pipeline parallelism particularly efficiently. In pipeline
parallelism, a computation is split in different pipeline stages. Intermediate results are
stored in registers, hereby splitting lengthy calculations in several smaller computation
steps. Several iterations can then be executed simultaneously in different pipeline stages. If
a single work-item kernel is implemented as a pipeline, this can result in a high performance
on an FPGA.
In Chapter 3, it was explained that an FPGA bitstream for the HARP platform consists
of the FPGA Interface Unit (FIU), containing low-level I/O-logic, and the Accelerator
Function Unit (AFU), containing the custom FPGA logic written by the developer. The
BSP of the HARP therefore contains the logic in the FIU, such as the Core-Cache Interface
(CCI-P), described in Section 3.4.1.
An FPGA bitstream is generated in two major steps, as depicted in Figure 4.2. In a first step,
the OpenCL kernel is compiled to HDL-code by making use of the Altera OpenCL (AOCL)
compiler. The hardware information of the FPGA is incorporated in the generated HDL
code by making use of the BSP. In the second step, an FPGA bitstream is generated based
on the compiled HDL code. In this step, all necessary implementation details, such as
selecting the appropriate FPGA frequency, are determined by the Quartus toolchain. This
results in an Altera OpenCL executable (.aocx ) file that contains the FPGA bitstream.
The host code, on the other hand, is compiled using an appropriate C compiler.
Figure 4.1: The Board Support Package (BSP) separates the kernel logic part from the
I/O part [18].
As indicated in previous sections, there are two types of OpenCL kernels: single work-item
kernels and NDRange kernels. In the Intel FPGA SDK for OpenCL, both types of kernels
are implemented completely differently. When compiling an NDRange kernel, the FPGA
implements a GPU-like architecture in which a number of threads execute simultaneously.
This implementation, however, does not make use of pipeline parallelism. Single work-
item kernels, on the other hand, are implemented in a pipelined fashion. Since pipelining
is important for FPGA applications, some general principles of pipeline parallelism will be
discussed in this section.
Intel Quartus
.aocx file
CPU FPGA
Figure 4.2: The Board Support Package (BSP) is included when compiling the kernel code
[18].
This principle can be explained by making use of the OpenCL kernel vectoradd in Listing
4.1. In this kernel, the elements of two vectors are added and the result is stored in the ele-
ment of a third vector. A basic pipelined implementation of the statement c[i]=a[i]+b[i]
in this kernel is displayed in Figure 4.3.
Figure 4.3: Pipeline of the statement c[i]=a[i]+b[i]: two Load stages, an Add stage and a Store stage, separated by registers.
The statement is subdivided in three different stages. In a first stage, the values a[i] and
b[i] are loaded and saved in registers. In the second pipeline stage, these two values are
added and the result is stored in a register. Finally, the result is stored in memory to c[i].
Because the pipeline in Figure 4.3 consists of three pipeline stages, three iterations can
execute simultaneously. In this case, the maximum throughput is obtained, as the pipeline
is always in use. It is, however, possible that only two or even a single loop iteration is being
executed. The number of loop iterations that can execute simultaneously is determined by
the number of clock cycles between two subsequent iterations. This is termed the Initiation
Interval (II): the number of cycles that must elapse between issuing two iterations [19].
The II of a pipelined loop largely determines the performance. The desired value for the II is
one. In this case, a new loop iteration can be issued every clock cycle, and the implemented
hardware will be used 100 % of the time. This leads to the optimal performance. If the II,
however, is higher, a lower performance will be obtained. For an II of two, only one loop
iteration will be started every two clock cycles. Therefore, each pipeline stage will only be
used 50 % of the time, and the loop will take twice as long to finish. An II of
three leads to a usage of 33.33 % and an execution time that is three times as long, and so
forth.
A large II can have several possible causes, such as I/O delay, a limited number of
resources, or dependencies in the algorithm. When developing a single
work-item OpenCL kernel for FPGA, the purpose should always be to obtain an II of one
for implemented loops. Higher values lead to a lower utilization rate and therefore a lower
performance.
Three main development phases when developing an OpenCL project with the Intel FPGA
SDK for OpenCL are distinguished in [18]: the emulation phase, the performance tuning
phase and the execution phase. In the emulation phase, the behavior of the kernel is
verified by emulating the code on a CPU. In the performance tuning phase, performance
bottlenecks are identified by inspecting compilation and profiling reports. In the execution
phase, the performance of the OpenCL kernel is analyzed by executing the kernel on the
FPGA platform. A flowchart of a typical OpenCL FPGA design flow, in which these three
phases are indicated, can be seen in Figure 4.4. Since the performance tuning phase is the
most critical stage for improving the performance, this will be discussed more in detail in
following sections.
In the performance tuning phase, the performance of the OpenCL kernel is analyzed and
improved. The Intel FPGA SDK for OpenCL provides two types of reports that can be
used in this phase: compilation reports and profiling reports.
As explained earlier in this chapter, the Intel FPGA SDK for OpenCL generates FPGA
bitstreams in two steps: a compilation step and a bitstream generation step. In the
compilation step, the OpenCL code is compiled into HDL code, and based on this code,
the synthesis step generates an FPGA bitstream. During the compilation process, reports
of the produced HDL code are generated. These reports contain information on the resource
usage of the kernel, the compilation transformations, the initiation interval of loops and
so forth. By investigating these reports, more insight in the implemented hardware can be
gained. Because of the compilation reports, an early performance analysis can be made,
which prevents going through the full, lengthy FPGA development cycle to assess the
OpenCL kernel performance.
Structure
The compilation reports are generated as a set of .html files, and can be viewed using a web
browser. On the home page of the compilation report, shown in Figure 4.5, an overview is
given of the compiled kernel. On this home page, four different panes can be distinguished.
• View reports pane: in this pane, different types of reports can be selected.
• Source code pane: shows the source code. This pane is extremely useful for linking
implementation details of the kernel to the OpenCL source code.
• Analysis pane: in this pane, the analysis details as selected in the ’view reports pane’
can be read.
• Details pane: shows details from selected elements in the ’analysis pane’.
Using the ’view reports pane’, six different types of reports can be selected. A first one is
’Summary’, which results in the home page indicated in Figure 4.5 with a summary of the
compiled kernel. The ’Loops analysis’ view gives an overview of all loops in the OpenCL
kernel. ’Area analysis of system’ and ’Area analysis of source’ both give an overview
of the resource usage of the kernel. ’System viewer’ shows a control-flow graph of the
developed system. In the last report, the ’Kernel memory viewer’, a visual representation of
FPGA embedded memory usage is shown. Since this thesis mainly focuses on performance
analysis, and less on area analysis, only the ’Loops analysis’ and the ’System viewer’ reports
will be discussed.
Chapter 4. Software 33
Figure 4.5: Screenshot of a compilation report with the 4 different panes indicated.
Loops Analysis
In the loops analysis, information on the loops in the kernel is given. An example is shown
in Figure 4.6. In the analysis pane, all loops in a kernel are listed. For each of the listed
loops, there are four different columns listing details.
• Pipelined: indicates whether a loop is pipelined or not, and therefore whether or not
pipeline parallelism can be applied.
• II: the initiation interval of the loop.
• Bottleneck: indicates which factor limits the initiation interval.
• Details: gives additional information on the listed bottleneck.
The loop analysis report can be read in the following way. As can be seen in Figure 4.6, the
implemented kernel is pipelined and has an initiation interval of 525. As explained earlier,
such a high initiation interval is dramatic for performance; the II is clearly a
bottleneck for the kernel. This is indicated in the third column, which shows
that the initiation interval forms the bottleneck. The fourth column gives details on
this poor initiation interval: apparently, it is caused by a memory dependency.
In the details pane at the bottom of the screen, a more elaborate description of the cause
of the bottleneck is given. This description refers to lines in the source code. In this way,
a developer can easily find the source of a poorly performing OpenCL kernel.
System Viewer
A visual representation of the kernel can be found in the system viewer. In this view,
the OpenCL kernel is displayed as a control flow graph of basic blocks. A basic block is
a sequence of statements that is always entered at the beginning and exited at the
end [20]. Besides the operation flow, all memory operations to ’Global Memory’, i.e. the
DRAM-memory, are displayed in this report.
The system viewer report of the vectoradd example is displayed in Figure 4.7. In this view,
the vectoradd kernel is subdivided in three basic blocks. The basic block vectoradd.B1
contains the for-loop in which the addition operation is executed. As can be seen in this
figure, this basic block is marked red. This indicates a performance bottleneck, which is
the poor initiation interval of 525. When hovering with the mouse pointer over this basic
block, some more information can be observed, such as the latency (expressed as number
of clock cycles).
The Intel FPGA SDK for OpenCL enables reviewing the kernel’s performance by incorpo-
rating the OpenCL Profiler. When generating an FPGA bitstream using the profile-option,
the FPGA program will measure and save performance metrics during execution. This en-
ables reviewing the kernel's behaviour, and thereby detecting, for example, the causes of a
kernel's poor performance.
The profiling information is saved in a .mon-file, and can be viewed through a graphical
user interface. An example of a profile report is illustrated in Figure 4.8. For each I/O
statement, the following performance metrics are shown.
• Stall (%): the percentage of the overall profiled time frame during which the memory
access causes a pipeline stall. Since pipeline stalls are undesirable, the preferable
stall value is 0 %.
• Occupancy (%): the percentage of the overall profiled time frame that a memory
instruction is issued. In a pipelined implementation, the best performance is achieved
when a loop iteration can be issued every clock cycle. Therefore, the most
desired value for the occupancy is 100 %.
• Bandwidth efficiency (%): the percentage of loaded memory that is actually used by
the kernel. Since data is loaded by reading memory words, it is possible that parts
of the loaded word are not used by the kernel. In the optimal case, all loaded data
is used. Therefore, an efficiency of 100 % is desired.
#pragma unroll 8
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
Listing 4.3: Example of the use of #pragma unroll.
When unrolling a loop in the Intel FPGA SDK for OpenCL, a spatial implementation of
the different loop iterations is obtained. The statements in the loop will be duplicated
in hardware, and multiple loop iterations will be executed in parallel. Unrolling a loop
therefore results in a higher resource usage but a possible shorter execution time.
Loop unrolling is indicated by #pragma unroll, which instructs the compiler to unroll the
loop. By giving an extra parameter, the unroll factor, the developer specifies how many
times the loop should be unrolled. An example of loop unrolling is given in Listing 4.3. In
this listing, the for-loop is spatially unrolled eight times.
Besides loop unrolling, other compiler directives are possible as well. When compiling an
OpenCL kernel, several compilation options can be given. Some examples are the options
-fp-relaxed or -fpc. -fp-relaxed relaxes the order of floating-point operations. This
potentially leads to a lower resource usage, but can also result in a lower precision. -fpc
adds additional precision to floating-point operations by disabling intermediate rounding
operations.
Chapter 5
OpenCL Performance on HARP
5.1 Introduction
In this chapter, we investigate whether OpenCL is able to exploit the performance-enhancing
features of the HARP platform. This will be done by a number of tests that each focus
on a specific hardware characteristic. In a first test, the HARP read bandwidth will be
examined by making use of several OpenCL kernels. Thereafter, the cache performance of
the FPGA cache is investigated. Finally, the effect of Shared Virtual Memory (SVM) will
be analyzed.
5.2 Bandwidth
In a first test, the read bandwidth of the HARP platform is investigated. This is done by
reading a variable amount of data on the FPGA from the CPU.
The OpenCL kernel used for this test is displayed in Listing 5.1. The kernel memory_read
is implemented as a single task kernel, which is indicated by __attribute__((task)). This
kernel has three parameters. The first two parameters, src and dst, are pointers to addresses
from which data is read and to which data is written. The third parameter, lines, is used
to determine the buffer size of the data that is read. The variables src and dst are defined
as ulong8 pointers. ulong8 is an OpenCL vector data type that consists of eight ulong
(unsigned long) elements.
1  kernel void
2  __attribute__((task))
3  memory_read(
4      global ulong8 * restrict src,
5      global ulong8 * restrict dst,
6      unsigned int lines)
7  {
8      ulong8 output = (ulong8)(0);
9      for (unsigned int i = 0; i < lines; i++) {
10         output += src[i];
11     }
12     *dst = output;
13 }
Listing 5.1: Kernel code used to test the maximum achievable read bandwidth of the
HARP platform.
Therefore, as a single ulong variable has a size of 8 B, reading a
single ulong8 variable results in reading 8 · 8 B = 64 B. This size equals the size of a cache
line in the Core-Cache Interface (CCI-P). Therefore, the number of ulong8 variables that
are read equals the number of cache lines necessary for loading the data.
The first statement in the kernel on line 8 declares and initializes the variable output. This
variable is used to prevent data reads from being compiled away. After initializing output,
a for-loop iterates over all elements of input variable src, and adds the elements to output.
Finally, output is written to the location dst points to. If this additional write to dst were
absent, the complete for-loop would be compiled away.
In order to investigate the influence of the used data type, the kernel in Listing 5.1 was
implemented for other data types of variables src, dst and output. In a first kernel, these
variables were defined as ulong16 data type, resulting in a size of 128 B for one memory
transfer. In a second kernel, the variables were defined as int data types, resulting in 4 B
memory transfer sizes.
The three different kernels were synthesized together into a single .aocx file, resulting in a
clock frequency of 258.3 MHz. All kernels resulted in a fully pipelined implementation,
in which the for-loop obtained an initiation interval equal to one. Therefore, a new
iteration is started, and a new value read, every clock cycle. Each of the three
kernels was executed using variable buffer sizes. The execution time of the kernel
was recorded by making use of the OpenCL profiler, and the resulting bandwidth was
calculated as bandwidth = buffer size / kernel execution time.
5.2.2 Results
The execution of the OpenCL kernel in Listing 5.1 led to the bandwidths displayed in
Figure 5.1 for the three different kernels.
Figure 5.1: HARP bandwidth when reading values on FPGA from the CPU. The different
lines denote the different word sizes of the memory transfers.
General Observations
When analyzing the graph in Figure 5.1, some elementary observations can be made. First,
it can be noticed that the bandwidth increases for larger buffer sizes. This can be explained
by the influence of overhead. For smaller buffer sizes, factors such as kernel set-up time,
communication set-up time etc. will not be negligible. For this reason, the bandwidth
will increase for higher buffer sizes, where these factors will become small compared to the
data transfer time. Secondly, it can be seen that the maximum bandwidth for large buffer
sizes increases to roughly 15 GB/s. However, this is only the case for the kernels that
load ulong8 and ulong16 variables. The kernel that loads int values reaches a maximum
bandwidth of around 1 GB/s.
Requested Bandwidth
Based on the frequency of the compiled kernels and the data size that is transferred every
clock cycle, it is possible to calculate the bandwidth requested by the kernel. Since all
kernels are fully pipelined, one variable is read every clock cycle. This results in a requested
data type size
bandwidth that can be calculated as . As the frequency of the kernel
clock cycle time
equals 258.3 MHz, the cycle time equals 3.87 ns. Table 5.1 summarizes the requested and
maximum achieved bandwidths of the different kernels.
Table 5.1: Requested bandwidth versus maximum achieved bandwidth when reading vari-
ables of a different size.
The achieved bandwidth for the kernel that loads int values exactly matches the requested
bandwidth, i.e. 1.03 GB/s. This implies that the CPU-FPGA interconnect can satisfy
the requested bandwidth. The measured bandwidths for the kernels that load ulong8 and
ulong16 values, however, do not correspond to the requested bandwidths. The requested
bandwidth when loading ulong16 variables even exceeds the raw bandwidth of the
interconnect, which, as calculated in Chapter 3, equals 28.56 GB/s; it can therefore
physically not be achieved. Despite these high requested bandwidths, the achieved
bandwidth never exceeds 15.23 GB/s. Therefore, the kernels that load ulong8 and
ulong16 variables are communication limited, with a maximum bandwidth of 15.23 GB/s.
Given this limit, it is possible to construct the roofline model for the HARP platform,
resulting in the model in Figure 5.2. In this figure, the compute limit is not displayed
since the operational intensity is too low. The blue graph is the performance limit as
defined by the raw bandwidth. However, since this bandwidth can never be achieved, a
corrected version is constructed based on the bandwidth tests. The maximum bandwidth
was rounded up to 16 GB/s, which results in the green graph. The black dots indicate the
three kernels that were tested. Both the ulong8- and the ulong16-loading kernels, which
achieve an operational intensity of 1/8, hit the performance roof, and their performance
cannot be increased. The kernel loading int values, on the other hand, which obtains an
operational intensity of 1/4, only achieves 1/16 of the possible performance. Since this
kernel only requests a bandwidth of 1.03 GB/s, as can be seen in Table 5.1, not
enough data is requested to achieve high performance. Increasing the requested bandwidth
can therefore potentially boost the performance. Note that in this figure, we use OP/B
and OPS since the used data types are not floating-point data types.
Figure 5.2: Roofline model for the HARP platform. The blue line denotes the perfor-
mance limit based on the maximum theoretical bandwidth, while the green line denotes
the performance limit based on the maximum achieved bandwidth.
A possible solution to increase the requested bandwidth is to unroll the for-loop in the
...
#pragma unroll 16
for (unsigned int i = 0; i < int_values; i++) {
    output += src[i];
}
...
Listing 5.2: Unroll pragma in OpenCL kernel.
OpenCL-kernel, such that more data is requested per clock cycle. In a first attempt, the
for-loop is unrolled 16 times. This should result in a requested bandwidth of 16.53 GB/s.
The modified kernel is displayed in Listing 5.2. Even though this appears to be a viable
solution, problems emerge when compiling this kernel. The compilation reports indicate
that the compiled for-loop has an initiation interval of 7. Therefore, 16 int-values would
be loaded every 7 clock cycles. This massively decreases the requested bandwidth, which
is then equal to 16.53 GB/s / 7 = 2.36 GB/s.
The achieved bandwidth is displayed in Figure 5.3. As suggested by the compilation
reports, the bandwidth only increases marginally, from 1.03 GB/s to 1.86 GB/s. This
maximum achieved bandwidth is even lower than the expected bandwidth of
2.36 GB/s. When profiling the kernel, however, no memory stalls were found, and
the maximum bandwidth efficiency of 100 % was achieved. Therefore, the interconnect did
not limit the achieved bandwidth.
In the compilation reports, no specific information is given on the cause of the poor ini-
tiation interval. The cause, however, is the unknown number of iterations. Since the
parameter int_values is passed to the kernel by the host, this parameter is unknown at
compile time. When choosing a fixed value for this variable, the compiler is able to gen-
erate a for-loop with an initiation interval of 1. The downside of this approach is that
the buffer size needs to be known before generating hardware. In order to find the maxi-
mum bandwidth using this kernel, a buffer size of 32 MiB was chosen. This resulted in a
maximum bandwidth of 9.36 GB/s. Once again, the achieved bandwidth did not reach
the earlier found limit of approximately 16 GB/s. In a last attempt to further increase the
bandwidth using the unroll pragma, the for-loop was unrolled using higher unroll factors
for a fixed buffer size. This leads to the maximum bandwidths displayed in Figure 5.4.
It can be seen in this figure that an upper limit of approximately 14 GB/s is reached for
large unroll factors.
Figure 5.3: HARP bandwidth for reading 4 B values on FPGA from the CPU. The blue
line denotes the bandwidth when reading 4 B each clock cycle, while the green line is the
result of unrolling the loop with unroll factor 16.
Figure 5.4: HARP bandwidth for reading 4 B values on FPGA from the CPU. The loop
was unrolled by various unroll factors.
Another approach to increase the requested bandwidth of the kernel loading int variables,
is to define a custom data type such that more data is loaded every clock cycle. For
this purpose, a struct was declared consisting of 16 int values. The used struct is illus-
trated in Listing 5.3. Since this struct contains 16 int values, it has a size of 64 B. The
packed attribute indicates to the compiler that the struct members are stored without
padding.
When executing the kernel that reads this struct, the bandwidth displayed in Figure 5.5 is
obtained. As can be seen in this figure, a bandwidth of nearly 16 GB/s is obtained, which
is the desired outcome.
Profiling Bandwidth
__attribute__((packed))
struct integer
{
    int a;
    int b;
    int c;
    ...
    int p;
};
Listing 5.3: Struct containing 16 int values.
Figure 5.5: HARP bandwidth for reading 64 B struct variables on FPGA from the CPU.
When compiling the OpenCL kernel with the profile-flag, performance information is measured during execution and saved in
a profile file. As the profiling information captures kernel information more precisely, this
will generally be a more accurate way of analyzing the kernel bandwidth.
The bandwidth that was achieved for the kernel loading ulong8 -variables was compared
with the bandwidth as measured by the OpenCL-profiler. Both bandwidths are displayed
in Figure 5.6. In this figure, it can be seen that the achieved bandwidth for low buffer
sizes is systematically lower than the profiled bandwidth. For higher buffer sizes, however,
both bandwidths correspond with each other. Again, this can be explained by transient
phenomena. Since some of these phenomena, such as kernel set-up time, are not included
in the CPU-FPGA communication as measured by the profiler, the profiled bandwidth will
be higher than the achieved bandwidth.
Figure 5.6: HARP bandwidth for reading ulong8 variables on FPGA from the CPU. The
blue graph corresponds with the bandwidth calculated by the formula buffer size/kernel
execution time, while the green graph displays the bandwidth as measured by the OpenCL
profiler.
As explained in Chapter 3, the FPGA Interface Unit (FIU) implements a soft IP cache.
The responsibility of the FIU cache is to prevent memory requests from being passed to DRAM
memory. If the requested data is already available on the FPGA itself, a potentially higher
bandwidth and lower latency can be achieved. The FPGA cache has a capacity of 64 KiB
with 64 B direct-mapped cache lines.
In the HARP platform, the Intel Arria 10 FPGA is connected to the processor through one
QPI and two PCIe channels. The FIU cache, however, is only implemented for the QPI
channel. Therefore, if a memory request is passed to a PCIe channel, no cache look-up
will be executed, and the memory request will be sent to DRAM memory. Since the OpenCL
board-support package for the HARP platform uses the Virtual Auto (VA)-channel, a
memory request can be assigned to each of the three physical channels. As a result, an
FPGA memory request in the HARP platform will not necessarily pass the FIU cache, and
cache performance is unpredictable.
Locality of memory accesses is typically subdivided into two types: spatial locality and temporal
locality. Spatial locality means that data elements located close together in memory
are accessed; temporal locality means that the same data is accessed several times
within a short time span.
If the OpenCL kernel in Listing 5.1 is considered, the kernel that loads int-variables could
make use of spatial locality. As one cache line has a size of 64 B, 16 int-values fit
in one cache line. Therefore, when reading one int-value, the 15 surrounding values can
potentially be read from the cache line that was loaded. In fact, it is possible
that the FIU cache has already been used in this kernel. However, in this test case, the
requested bandwidth was only 1.03 GB/s. Since cache behaviour is especially interesting in
those cases where the requested bandwidth exceeds the upper bound of the interconnect,
this kernel will not be considered.
Temporal locality, on the other hand, can be exploited in each of the three kernels. When the same data is loaded multiple times within a short period of time, the data potentially resides in cache, thereby exploiting temporal locality. When loading 64 B or 128 B values in the previous section, the interconnect could not cope with the high requested bandwidths. Therefore, this section investigates whether the FIU cache can satisfy these high requested bandwidths.

__kernel void
__attribute__((task))
cache_read(
    __global volatile ulong8 *restrict src,
    __global volatile ulong8 *restrict dst,
    unsigned int lines)
{
    ulong8 output = (ulong8)(0);
    for (unsigned int i = 0; i < ITERATIONS; i++) {
        unsigned int index = i % lines;
        output += src[index];
    }
    *dst = output;
}
Listing 5.4: Kernel code used to test the cache effectiveness of OpenCL on the HARP
platform.
The OpenCL kernel used to test the FIU cache is displayed in Listing 5.4. This kernel
differs in three ways from the memory-read kernel in Listing 5.1.
• The for-loop iterates for a fixed number of iterations, indicated by the constant
ITERATIONS. The number of iterations is determined at compile time, and is chosen
sufficiently high such that overhead effects become insignificant.
• The array src is indexed by the variable index. This variable is calculated as the
remainder when dividing loop variable i by the parameter lines. If lines is, for
instance, equal to 4, the first four variables will be accessed repeatedly. This leads to
the following access pattern: 0, 1, 2, 3, 0, 1, 2, 3, . . . . By using this modulo operation,
temporal locality is enforced, and cache performance can be evaluated. The input
parameter lines determines the degree of temporal locality.
• The pointers src and dst are marked volatile. This keyword indicates that the
data to which a pointer points may change over time. Therefore, every time the
variable is read, a memory request should be sent. If the volatile keyword is not
used, on the other hand, the loaded data is buffered in the OpenCL kernel. In
this case, the OpenCL compiler adds a software-implemented cache to the FPGA
bitstream through which all memory requests pass [21]. The memory hierarchy when
using the volatile keyword is displayed in Figure 5.7a, while Figure 5.7b displays the
memory hierarchy without using the volatile keyword. When testing the hardware-
implemented QPI cache, it is therefore important to mark src and dst as volatile,
such that no software cache prevents the FIU cache from being read. The software cache,
however, can be used as a comparison to the FIU cache. Therefore, the kernel
cache_read was also implemented without using the volatile keyword.
5.3.2 Results
The kernel was developed for both the ulong8 and the ulong16 data type. It was synthesized
to an .aocx file, which resulted in a clock frequency of 240.6 MHz. The variable lines was
varied from 1 to 2048. This corresponds to repeated reads of a memory region of 64 B
to 128 KiB for the ulong8 variant, and 128 B to 256 KiB for the ulong16 variant.
In Figure 5.8a and Figure 5.9a, the obtained average bandwidths for both kernels are
displayed with and without using the volatile keyword. As there are large variations in
obtained bandwidth for small values of the variable lines, a scatter plot is displayed in
Figure 5.8b and Figure 5.9b for the first 20 values of lines.
As the kernels are fully pipelined, one variable is read every clock cycle. With a
frequency of 240.6 MHz, and therefore a clock cycle time of 4.16 ns, the requested bandwidth
can be calculated as read size / clock cycle time. This leads to the requested bandwidths displayed
in Table 5.2.
(a) Memory hierarchy when using the volatile keyword. The memory hierarchy consists of three
levels: the QPI cache, the CPU’s Last-Level Cache (LLC) and DRAM memory.
(b) Memory hierarchy without using the volatile keyword. The memory hierarchy consists of four
levels: the OpenCL cache, the QPI cache, the CPU’s Last-Level Cache (LLC) and DRAM memory.
Figure 5.7: Memory hierarchy of the OpenCL kernel with (a) and without (b) the volatile keyword.
(a) Bandwidth of reading src when executing kernel cache_read for 64 B values, with
(FIU cache) and without (OpenCL cache) the volatile keyword.
(b) Bandwidth of reading src when executing kernel cache_read for volatile ulong8
pointer types. In this graph, only the FIU cache is used. This graph zooms in on the
blue graph in the figure above.
Figure 5.8: Bandwidth of reading src when executing kernel cache_read for 64 B values.
(a) Bandwidth of reading src when executing kernel cache_read for 128 B values, with
(FIU cache) and without (OpenCL cache) the volatile keyword.
(b) Bandwidth of reading src when executing kernel cache_read for volatile ulong16
pointer types. In this graph, only the FIU cache is used. This graph zooms in on the
blue graph in the figure above.
Figure 5.9: Bandwidth of reading src when executing kernel cache_read for 128 B values.
OpenCL Cache
As can be seen in Figure 5.8a and Figure 5.9a, the OpenCL-implemented cache is able to
fulfill the requested bandwidths of 15.4 GB/s and 30.8 GB/s. In both graphs, the size of
this software cache can be deduced from the point where the bandwidth drops. When
reading ulong8-variables, the OpenCL cache has a size of 64 KiB; when reading ulong16-
variables, the cache is 128 KiB in size. When the memory region is larger than the cache
size, the bandwidth decreases and converges to an asymptotic bandwidth of 9.3 GB/s.
FIU Cache
In Figure 5.8a and Figure 5.9a, it is difficult to see a trace of the FIU cache in the bandwidth
results. While there is a clear point at which the bandwidth of the OpenCL cache decreases,
the bandwidth of the volatile kernel is mainly constant.
However, a more detailed view of these kernels, in Figure 5.8b and Figure 5.9b, reveals
some variations that indicate the use of the hardware cache. In Figure 5.8b, there are
points at which the system was able to achieve a higher bandwidth than the asymptotic
bandwidth. When reading single cache lines, i.e. 64 B values, there are measurements
that achieve a bandwidth of 15.4 GB/s, which is the requested bandwidth. In Figure 5.9b,
there are fewer variations in this area. However, there are measurements in which a
bandwidth of more than 30 GB/s is achieved. Since the maximum unidirectional
theoretical bandwidth of the CPU-FPGA interconnect is roughly 28 GB/s, this bandwidth
exceeds the physical limit. Therefore, this read probably originates from the FIU cache.
Bandwidth
In the previous section, a kernel was developed to achieve as high a bandwidth as possible.
This resulted in a maximum bandwidth of roughly 16 GB/s. Even though the CPU-
FPGA interconnect clearly supports these high bandwidths, the asymptotic bandwidth
when loading 64 B values is only approximately 9 GB/s. The kernel is fully pipelined, has
an initiation interval equal to one, and no memory stalls were encountered during execution.
Therefore, no memory read problems were encountered, and the low bandwidth is probably
caused by the OpenCL kernel itself.
A third aspect that is considered is the use of Shared Virtual Memory (SVM). In the
OpenCL framework, data is communicated between the host and the kernel by reading
from and writing to allocated memory areas. There are two different types of memory
areas: memory buffers and SVM.
Memory Buffers
When using memory buffers, the data is copied to a specially allocated part of the RAM
memory called a memory buffer. The host can only access memory buffers through
the OpenCL API. Moreover, the memory location of the memory buffers depends on the
accelerator architecture. In a PCIe-based CPU-FPGA architecture, both the processor
and the FPGA have their own DRAM memory, as illustrated in Figure 5.10a. In this type
of architecture, the memory buffer is located in the FPGA DRAM memory [18]. The data,
which is located in CPU DRAM, is therefore copied from CPU DRAM to FPGA DRAM,
after which the FPGA reads from the FPGA DRAM. This is the conventional operation
in a traditional PCIe-based CPU-FPGA platform. In a shared memory architecture such
as the HARP platform, on the other hand, both the original data and the OpenCL buffer
are stored in the shared DRAM. This is indicated in Figure 5.10b.
SVM
SVM, on the other hand, is a paradigm in which the original data can be accessed by
both the OpenCL kernel and the host without having to pass through the OpenCL API.
The host can access the allocated memory area by using conventional memory operations.
Therefore, data is not duplicated, as illustrated in Figure 5.11.
In order to characterize the impact of traditional memory buffers, their overhead was
investigated. This overhead is characterized by the amount of time required to write to
and read from the memory buffer, and will be expressed as the achieved bandwidth.
(a) In a PCIe-based CPU-FPGA architecture, the original data resides in CPU DRAM while the
memory buffer is located in the FPGA DRAM.
(b) In the HARP platform, both the original data and the memory buffer are located in the
shared DRAM memory.
Figure 5.10: Memory buffer location when using different types of platforms.
Figure 5.11: In the HARP platform, both the CPU and the FPGA can access SVM.
5.4.2 Results
Figure 5.12: Bandwidths when copying data in the HARP platform. The blue line shows
the bandwidth when writing data to a memory buffer, while the red line denotes the bandwidth
when reading from a memory buffer.
The results are displayed in Figure 5.12. Writing data to a memory buffer results in an
asymptotic bandwidth of nearly 13 GB/s, while reading from the buffers results in nearly
12 GB/s.
The influence of these bandwidths on the overall performance largely depends on the buffer size
and on the execution time of the kernel itself. For a small amount of data and a kernel
with a long execution time, the influence of using memory buffers on the performance is
relatively small. When a large amount of data has to be read and written for a kernel
that only requires a short execution time, however, memory buffers will degrade the
performance. Consider, for instance, an OpenCL kernel that requires 128 MiB of data
to be written and read, and that has an execution time of 20 ms. The achieved bandwidths
for this payload size can be read from the graph in Figure 5.12. Writing the data will
require approximately 128 MiB / 11 GB/s ≈ 12 ms, while reading from the buffer takes
approximately 128 MiB / 7 GB/s ≈ 19 ms. In this case, the total kernel performance
will be seriously degraded.
5.5 Conclusion
Bandwidth
The maximum bandwidth that was achieved in the different tests is 15.73 GB/s. In Chap-
ter 3, it was found that the maximum raw one-way bandwidth of the interconnect is
28.56 GB/s. Due to communication overhead, this raw bandwidth will never be achieved.
However, a more detailed analysis of which communication overhead aspects affect the
bandwidth would be useful.
It was noticed that boosting the bandwidth by means of the pragma unroll did not
yield good results. When using a custom struct, on the other hand, a high bandwidth
was obtained. Therefore, using a struct is recommended when data has to be read for
which no native data type is available.
Cache
In the executed tests, traces of the FIU cache were found. When loading 128 B values,
for instance, bandwidths of over 30 GB/s were obtained. Because this bandwidth exceeds
the physical limits of the interconnect, the FIU cache was presumably used. However,
it is only in certain cases, and only for a couple of measurements, that the FIU cache
was effective. In all other cases, the achieved bandwidths do not satisfy the requested
bandwidths. A possible cause for this problem is the virtual auto channel, which selects
the physical channel without taking cache locality into account. However, no hard claims
can be made.
Even though it may be hard for an OpenCL developer to exploit the FIU cache, the software
cache added by the OpenCL compiler is useful. With this software-implemented
cache, high bandwidths were obtained. As the volatile keyword has the drawback that
this software cache is not implemented, it should only be used when necessary.
This is the case if the data could change during the execution of the OpenCL kernel.
SVM
The HARP platform offers a shared memory architecture that can be exploited by using
SVM in OpenCL. The use of memory buffers results in memory and performance overhead.
Therefore, SVM should be used on the HARP platform.
Chapter 6
Guided Image Filtering
6.1 Introduction
An example application of this algorithm is the enhancement of LIght Detection And Rang-
ing (LIDAR) data. By scanning an environment using LIDAR, a point cloud consisting
of (x, y, z)-coordinates is obtained. This point cloud can act as the filtering input in the
guided image filter algorithm. By making use of a color image of the scene as the guidance
image, the LIDAR data can be denoised.
6.2 Algorithm
The guided image filter algorithm consists of 5 steps, which are displayed in pseudocode
in Algorithm 6.1. The algorithm has 4 input parameters, the filtering input I, the guidance
input G, the filtering radius r and a regularization parameter ε, and one output, the
filtering output O [22].
Chapter 6. Guided Image Filtering 63
Figure 6.1: The guided image filter algorithm smooths the filtering input image by making
use of a guidance image. The filtering output consists of the filtering input, enhanced by
the data in the guidance image [22].
In a first step, the function fmean,r is applied. fmean,r calculates the mean of all pixels
within a square window. The size of this window is determined by the radius r. A
radius of 2, for instance, leads to a square window in which all values within a distance
of 2 pixels of the central pixel are considered. This window therefore has a size of
(2 · r + 1) × (2 · r + 1) = 5 × 5 pixels. This function is applied to the filtering input, the
guidance input, the filtering input multiplied with the guidance input, and the guidance
input multiplied with itself. The multiplication of two images is executed pixel-wise, as
denoted by the point-multiplication .∗ . Based on the values calculated in step 1, the
variance of the guidance input, varG , and the covariance between the filtering input
and the guidance input, covIG , are calculated for each pixel in step 2. In step 3, images a and b
are calculated using the previously calculated varG and covIG , and using the regularization
parameter ε. Then, in step 4, images a and b are filtered using the same fmean,r as used in
step 1, which results in images meana and meanb . Finally, in step 5, the filtering output is
calculated as meana . ∗ G + meanb .
1. meanI = fmean,r (I); meanG = fmean,r (G); corrIG = fmean,r (I . ∗ G); corrG = fmean,r (G . ∗ G)
2. varG = corrG − meanG . ∗ meanG ; covIG = corrIG − meanI . ∗ meanG
3. a = covIG ./ (varG + ε); b = meanI − a . ∗ meanG
4. meana = fmean,r (a); meanb = fmean,r (b)
5. O = meana . ∗ G + meanb
Algorithm 6.1: Guided Image Filter algorithm.
In the sliding window implementation of an image processing algorithm, the window function
is slid over the input image. Each output pixel is produced by applying the chosen window
operator to the input pixels under the window. The result is a pixel value that is assigned
to the centre of the window in the output image [23]. This principle is illustrated in Figure 6.2.
In the guided image filter algorithm, the applied window function is fmean,r , which calculates
the mean of all pixels under the window of size (2 · r + 1) × (2 · r + 1).
In the OpenCL implementation, the guided image filter algorithm is split into two kernels,
as shown in Figure 6.3. The first kernel calculates steps 1, 2 and 3 in Algorithm 6.1, while
the second kernel calculates steps 4 and 5. Kernel 1 starts by reading the filtering image I
and the guidance image G from global memory. It then slides over I and G and calculates
meanI , meanG , corrIG and corrG . Based on these four arrays, it then calculates the values
in step 2 and step 3.
Figure 6.2: Conceptual example of sliding window filtering [23].
The results of these first three steps, arrays a and b, are sent directly
to the second OpenCL kernel, together with the guidance image, which is required in step
5 of the algorithm. Kernel 2 slides fmean,r over a and b, which results in arrays meana and
meanb . Based on meana and meanb , the filtering output O is calculated and written back
to global memory.
The main motivation to split the calculations into two kernels is ease of programming.
Kernel 1 contains the sliding window implementation of the four fmean,r functions in step
1 of the algorithm, while kernel 2 contains the sliding window implementation of the two
fmean,r functions in step 4. Separating these two algorithmic steps decreases the
implementation complexity. Because OpenCL makes it possible to send data directly
between kernels, no extra bandwidth to global memory is required for this configuration.
In the following subsections, the specific OpenCL implementation details will be discussed.
The implemented guided image filter algorithm operates on data that is read from global
memory. This data contains either different color channels or 3D-coordinates. The different
dimensions of a specific pixel, however, are stored together. This is displayed in Figure 6.4.
Figure 6.3: Sliding window implementation. Kernel 1 computes the first three steps in the
guided image filter algorithm, while kernel 2 computes the last two steps.
This figure shows a 1D array containing image data, as data is stored sequentially row-wise
in memory. The three color channels, for instance Red (R), Green (G) and Blue (B),
of pixel 0 are stored before continuing to the three color channels of pixel 1.
R0 G0 B0 R1 G1 B1 R2 G2 B2 . . .
Figure 6.4: Interleaved storage of the color channels in memory.
This data structure, however, poses problems for the guided image filter implementation.
Because the guided image filter operates on each color channel individually, the best possible
situation would be that pixels are grouped by color channel, i.e. first all R-values,
then all G-values, then all B-values. In this way, memory could be read entirely
sequentially, without having to discard data that was read.
In order to cope with the interleaved storage of color channels, all computations are
triplicated, once for each color channel. In this way, data can be read sequentially
without discarding color channels. This leads to a SIMD implementation in which the
three color channels are processed simultaneously.
This approach, however, also requires adaptations to the data type used. The Intel FPGA
SDK for OpenCL does not support native vectorized data types containing three elements.
Therefore, a custom struct was defined that contains the three channels. The
filtering input and output images consist of real-number values, while the guidance image
consists of color values between 0 and 255. Therefore, two different structs were defined, as
displayed in Listing 6.1. The struct filter_image consists of three float variables,
while the guide_image struct consists of three unsigned char values. Besides these two
input and output image types, a third struct, called struct_a_b, was defined that contains
the values of arrays a and b. For each color channel, the vectorized OpenCL data type
int2 contains a value of array a and a value of array b. The packed attribute indicates
that subsequent struct elements are stored in memory without padding.
OpenCL Channels
OpenCL channels are used to directly send data between different OpenCL kernels without
passing through global memory. By using OpenCL channels, the number of accesses to
global memory can be reduced, hereby saving CPU-FPGA bandwidth.
__attribute__((packed))
struct filter_image
{
    float x;
    float y;
    float z;
};

__attribute__((packed))
struct guide_image
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
};

__attribute__((packed))
struct struct_a_b
{
    int2 x;
    int2 y;
    int2 z;
};
Listing 6.1: Structs that define the input and output image, the guidance image and the
array containing values a and b.
An OpenCL channel behaves as a FIFO, with a producer and a consumer. The producer
places data in the FIFO, while the consumer pulls the data from the FIFO. Moreover, an
OpenCL channel has a limited capacity, and can therefore be saturated. If this occurs,
the producer will stall until data can again be placed in the channel. In the OpenCL
implementation of the guided image filter algorithm, arrays a and b are directly sent from
kernel 1 to kernel 2 through an OpenCL channel, by making use of the earlier defined struct
in Listing 6.1. Moreover, as kernel 2 also needs the guidance image for its computations,
kernel 1 sends this array to kernel 2 through an OpenCL channel as well. This is indicated
in Figure 6.3.
OpenCL channels are implemented as shown in Listing 6.2. In line 2, the use of channels
is activated by enabling the pragma OPENCL EXTENSION cl_altera_channels. In lines 3
and 4, the channels are declared. Channel ch_0, which uses the earlier defined struct
struct_a_b, will be used to transfer a and b, while channel ch_1 transfers the guidance image
by means of the struct guide_image. Both channels are accompanied by the depth attribute.
As explained earlier, an OpenCL channel has a limited capacity. In order to prevent the
producer, i.e. kernel 1, from blocking, a sufficiently high capacity of 128 struct elements was
chosen. In line 6, kernel 1 is declared. A pointer to the filtering input and the guide input
is provided such that they can be read from global memory. After applying the required
computations, the data, contained in the variables output_buf and guide_buf_out, is sent
through the channels ch_0 and ch_1 to kernel 2. Kernel 2, declared on line 16, reads data
from the channels, performs its computations, and then writes its output to the memory
location defined by its input parameter output.
Shift Register
In the sliding window implementation, one filtering output value is calculated every clock
cycle. Therefore, exactly one filtering input value and one guide input value are loaded
every clock cycle. Since the loaded values are required for multiple window operations,
they remain in local memory. Previously loaded input and guide values that are no
longer required by the computation leave local memory at a rate of one value per clock
cycle. Because one value enters local memory while another one leaves it every clock
cycle, the local memory can be organized as a shift register.
A shift register is an array of registers in which the elements are shifted every clock cycle.
The use of a shift register in the guided image filter implementation can be clarified by
1  ...
2  #pragma OPENCL EXTENSION cl_altera_channels : enable
3  channel struct struct_a_b ch_0 __attribute__((depth(128)));
4  channel struct guide_image ch_1 __attribute__((depth(128)));
5  ...
6  __kernel
7  void kernel_1(__global struct filter_image *restrict input,
8                __global struct guide_image *restrict guide)
9  {
10     ...
11     write_channel_altera(ch_0, output_buf);
12     write_channel_altera(ch_1, guide_buf_out);
13 }
14
15 __kernel
16 void kernel_2(__global struct filter_image *restrict output)
17 {
18     struct struct_a_b input_buf = read_channel_altera(ch_0);
19     struct guide_image guide_buf = read_channel_altera(ch_1);
20     ...
21 }
Listing 6.2: OpenCL channel implementation in the guided image filter algorithm.
making use of Figure 6.5. In this figure, the parameter r is equal to one, and therefore a
window of size (2 · 1 + 1) × (2 · 1 + 1) = 3 × 3 pixels is obtained. The blue pixels indicate the
values that are required for the computation of fmean,r in the current clock cycle; these
pixels are therefore present in local memory. The green pixels are also present in local
memory: their values are not needed for the current computation, but will be at a later
stage. The white pixels are not in local memory. The red pixel in Figure 6.5a is not yet
present in local memory, but will be loaded into local memory in the next clock cycle.
As all green and blue pixels are located in the shift register in local memory, their
position in the shift register is indicated by the index in the pixels. A high index implies
that the pixel has resided in local memory for a longer time, while a low index indicates
that the pixel was loaded recently. After calculating the mean function fmean,r ,
which results in the mean of all blue pixels, the output value is obtained and the window
slides one pixel to the right. By shifting the window, all values in the shift register are
shifted one place. Since the blue pixel with index 18 in Figure 6.5a is no longer required
for any future computation, this pixel leaves local memory. The red pixel in Figure
6.5a, on the other hand, is necessary for the fmean,r computation in the next clock cycle,
and is therefore loaded into local memory. All pixels in local memory shift one position in
the shift register, and the red pixel shifts in. The new situation, one clock cycle after the
situation in Figure 6.5a, is displayed in Figure 6.5b.
Using a shift register results in a highly efficient hardware implementation. As can be seen
in Figure 6.5, the pixels that are required for the computation of the function fmean,r , i.e.
the blue pixels, are always located in registers 0, 1, 2, 8, 9, 10, 16, 17 and 18. As these
indices are read only once every clock cycle, no memory stalls will occur, and therefore no
duplication of local memory is necessary. Moreover, other positions in the shift register
will never be read for computations. Therefore, these registers do not require an extra read
port. This leads to a low resource usage.
A shift register is created when the OpenCL compiler detects a memory transfer pattern
in local memory that fits a shift register architecture. The shift register transfer pattern
in the OpenCL implementation of the guided image filter algorithm is displayed in Listing
6.3. In this code fragment, the data in variable shift register is shifted by one position by
looping over all indices. By spatially unrolling this loop using the pragma unroll, all shifts
are executed in the same clock cycle, and a shift register is obtained.
(a) Shift register state at clock cycle n. In the next clock cycle, the red pixel will be loaded into
the shift register in local memory.
(b) Shift register state at clock cycle n+1. All values are shifted one place in the shift register
compared to the previous clock cycle.
Figure 6.5: Principle of the sliding window. The pixels in green and blue are stored row-wise
in a shift register. The pixels in blue are used to calculate the window function
fmean,r .
...
#pragma unroll
for (int i = shift_register_size - 1; i > 0; --i) {
    shift_register[i] = shift_register[i - 1];
}
...
Listing 6.3: Shift register implementation in OpenCL.
...
int mean = 0;
#pragma unroll
for (int i = 0; i < 2 * R + 1; i++) {
    #pragma unroll
    for (int j = 0; j < 2 * R + 1; j++) {
        int value = shift_register[i * IMAGE_WIDTH + j];
        mean += value;
    }
}
mean = mean / ((2 * R + 1) * (2 * R + 1));
...
Listing 6.4: Implemented window operation in the sliding window implementation.
Window Operation
In the sliding window approach, an output value can be calculated every clock cycle. This
is achieved by a fully spatial implementation of the window operation.
As the window operation is computed by iterating over two dimensions, which results in
two for-loops, both loops must be fully spatially unrolled. This is shown in Listing 6.4.
Both for-loops, which each iterate over an index range of size 2 · r + 1, are fully unrolled
using the pragma unroll. fmean,r is then calculated by adding all values in the window and
dividing the sum by the size of the window.
Spatially unrolling the window operation, however, has the downside that the radius is
limited by the available FPGA resources. A higher radius leads to larger for-loops and
therefore a higher resource usage.
Fixed-Point Calculations
As can be seen in Listing 6.1, the filtering input consists of floating point variables.
Using floating point variables, however, results in a higher resource usage: not only does
the number of DSPs increase, the overall logic usage increases as well. Therefore, the
implemented kernel makes use of fixed point calculations.
Resource Usage To illustrate the effect of implementing fixed point instead of floating
point calculations, the resource usage of two kernels is listed in Table 6.1. Both kernels
implement the sliding window implementation of the guided image filter, and use a radius
equal to 3, which results in a window size of 7 × 7 pixels. The first kernel uses fixed
point calculations while the second kernel uses floating point calculations. As can be seen,
both the logic usage and the DSP usage are higher when using floating point operations.
The maximum radius for the floating point kernel is equal to 3; for higher radii, the
FPGA bitstream no longer fits on the Arria 10 FPGA. The fixed point kernel,
on the other hand, can go up to a radius of 6, which corresponds with a window of
13 × 13 pixels. Furthermore, since both the logic and DSP usage of the floating point
kernel approach 100 %, this limits the ability to further optimize the OpenCL kernel
by, for instance, applying a higher SIMD factor. Therefore, the implemented kernel uses
fixed point calculations.
Kernel                  Logic   RAM    DSP
Fixed Point Kernel      63 %    41 %   35 %
Floating Point Kernel   91 %    42 %   87 %
Table 6.1: Resource usage for OpenCL kernels implementing the sliding window implementation using fixed point and floating point computations.
When using fixed point numbers, the required fractional precision is balanced against the value range. The higher the number of fractional bits, the higher the precision, but the smaller the range of representable values, and vice versa. In order to obtain a fractional precision of, for instance, one decimal, at least 4 fractional bits are required, as 1/2^4 = 0.0625 ≈ 0.06 < 0.1. By assessing the required fractional precision, the appropriate number of fractional bits can be determined.
Conversion The conversion of floating point to fixed point values and vice versa is displayed in Listing 6.5. Floating point to fixed point conversion is done by shifting the floating point number to the left by the specified number of fractional bits, determined by the variable fractional_bits. The resulting value is then cast to an int. Therefore, only the integer part, which now contains a portion of the fractional component, is retained. Converting a fixed point variable into a floating point variable, on the other hand, is done by first casting the fixed point variable to a float data type, and then shifting this variable to the right by fractional_bits bits.
As indicated in Chapter 5, the shared memory architecture of the HARP platform can make use of Shared Virtual Memory (SVM). With SVM, both the CPU and the FPGA have full access to a specially allocated memory region. Therefore, only one instance of the data is present in memory, and no explicit data copies to a separate, FPGA-only memory region are necessary.
In the OpenCL implementation of the guided image filter, a series of images was filtered. In order to process the different images, only two steps are required when using SVM. In the first step, the input parameters, i.e. pointers to the SVM region, are passed to the OpenCL kernels. In the second step, the OpenCL kernels are invoked. Since it is no longer necessary to copy the input and output data to and from memory buffers, the use of SVM dramatically reduces the number of operations that must be executed to pass data to the OpenCL kernel.
6.3.2 Results
The implemented kernel was synthesized to filter Full HD images, which have a resolution of 1080 × 1920 pixels, with a radius equal to 3. This led to an FPGA bitstream with a frequency of 237.5 MHz. The bitstream was then executed on the HARP platform by filtering 100 images. Two time recordings were registered. The first recording measured the execution time of the OpenCL kernel itself, without accounting for passing the input parameters. The second recording measured the total execution time at the host side; this includes the FPGA execution, but also passing the input parameters and invoking the OpenCL kernel. Filtering the 100 images on the HARP platform led to the average execution times shown in Table 6.2.
Table 6.2: Execution times when executing the sliding window implementation of the
guided filter on the HARP platform.
Profiling Results
Since the kernel is synthesized at a frequency of 237.5 MHz, a clock cycle time of 4.21 ns is obtained. As noted in the compilation reports, the kernel is fully pipelined and has an initiation interval equal to one. Therefore, an output pixel can be calculated every clock cycle. This leads to a theoretical execution time of approximately the total number of pixels multiplied by the cycle time, which equals 8.78 ms. Comparing this value to the obtained execution time of 18.20 ms shows that the obtained execution time is considerably higher. In order to investigate this difference, the profiling results, displayed in Table 6.3, are examined.
Table 6.3: Profiling results when executing the sliding window implementation of the guided image filter algorithm.

Low Occupancy The occupancy is defined as the fraction of the kernel's execution time during which a memory operation is issued. While the optimal value is an occupancy of 100 %, only 48.6 % is achieved in this kernel. The load and read statements are therefore only issued in 48.6 % of the clock cycles, which means that a read takes around 2 clock cycles, and the bandwidth is correspondingly lower. The theoretically requested bandwidth is 12 B/4.21 ns = 2850 MB/s. Because of the occupancy of 48.6 %, however, only a bandwidth of 2850 MB/s · 48.6 % ≈ 1385 MB/s is obtained, which closely matches the measured bandwidth of 1371.6 MB/s.

This low bandwidth is caused by the defined structs. The filtering input, the guidance input and the filtering output all consist of a data type whose size is a multiple of three. The filtering input, for instance, has a size of 12 B and is aligned at 4 B. Therefore, three separate read requests need to be issued, which dramatically reduces the bandwidth and explains the low occupancy.
High Bandwidth Efficiency The efficiency of all I/O operations equals 100 %, which is the best possible value. This is because no read values are discarded in the implementation: as explained earlier, all three color channels are used. The use of structs, which decreases the occupancy of the memory operations, therefore results in an increase of the bandwidth efficiency, as no read values go unused.
OpenCL Cache
As can be seen in Listing 6.2, the input parameters are not marked by the keyword volatile.
Therefore, the OpenCL compiler adds a software implemented cache. As has been found
in Chapter 5, this cache can achieve high bandwidths. The performance of this cache can
be analyzed by making use of the profiler, in which the cache hit rate is displayed. These
hit rates are displayed in Table 6.4.
Table 6.4: Cache hit rates, derived from the OpenCL profiler, when executing the sliding window implementation of the guided image filter algorithm.

In this table, it can be seen that the cache hit rate for the filtering input equals 81.5 %. This can be explained by considering the size of the struct that is used, which equals 12 B, and the size of the OpenCL cache lines, which is 64 B. Data is read from memory as 64 B cache lines; if one struct element is read, the remaining 64 B − 12 B = 52 B resides in cache memory. This leads to a cache hit ratio of 52 B/64 B = 81.25 % ≈ 81.5 %. Moreover, as a separate cache is created for the filtering input and the guidance input, the two cache streams do not interfere with each other. The same reasoning applies to the cache hit ratio of the guidance image input.
SVM
In order to evaluate the benefit of using SVM, the kernel was also implemented using memory buffers. This adds a write of the filtering and guidance inputs to memory buffers, and a read of the filtering output from the memory buffer. The average execution time for filtering one image, for both the implementation with SVM and the one with memory buffers, is listed in Table 6.5. As can be seen in this table, the use of memory buffers adds approximately 10 ms to the total execution time. The frame rate decreases from 45 frames/s to 30 frames/s when using memory buffers.
Table 6.5: Execution times of the guided filter on the HARP platform with SVM and with memory buffers.
Performance
In order to evaluate the achieved performance, the roofline model, explained in Chapter 2, can be used. This model relates the operational intensity to the achieved performance. In the guided image filter, however, the operational intensity depends on the value of the radius. A larger radius results in more operations being performed for an equal amount of data that is read, and therefore in a higher operational intensity.
As discussed earlier, the maximum radius for a floating point implementation of the guided image filter algorithm is 3; for higher values, the synthesized hardware no longer fits in the FPGA. Therefore, a fixed point implementation was used, which can go up to a radius of 6. Both the floating point and fixed point implementations of the guided image filter were executed for the possible values of the radius. This results in the achieved performances in Figure 6.6. In this figure, the performance of the kernels for the different operational intensities is shown. The fixed point kernel performance is indicated by the blue dots, while the floating point kernel performance measurements are shown in red. Moreover, the roofline model of the HARP platform is shown as well. As peak memory bandwidth, the bandwidth of 16 GB/s obtained in Chapter 5 is used, while the performance limit is based on the technical data of the HARP platform. Note that for the fixed point kernel, the performance should be expressed in OPerations per Second (OPS) instead of FLOPS.
As can be seen in Figure 6.6, the implementation clearly benefits from using fixed point. While the maximum achievable performance of the floating point kernel is 135 GFLOPS, the fixed point kernel achieves a top performance of 430 GOPS. For radii that both the floating point and the fixed point kernel can implement, on the other hand, no significant difference in performance can be observed. In that case, using the floating point implementation is advantageous, as it avoids the specific issues of the fixed point implementation, such as under- and overflow.
Furthermore, it can be seen in Figure 6.6 that a higher performance is obtained for larger operational intensities. The larger computational demands therefore do not hamper the performance, and the kernel is only limited by the achieved bandwidth. It can, however, be seen that the maximum performance of the floating point kernel, 135 GFLOPS, is well below the compute limit of 1366 GFLOPS. The cause of this large difference is unknown; the implementation should be analyzed further to explain it.
Further Optimizations
When the fixed point kernel was developed for a radius of 3, there were still resources left.
In order to increase performance, the kernel was implemented using a higher SIMD factor.
Figure 6.6: Roofline model of the HARP platform. The vertical axis shows performance in GFLOPS (logarithmic scale, 32 to 1024). The red dots indicate the performance of the floating point kernel for different values of the radius, while the blue dots indicate the performance of the fixed point implementation.
Up until this point, one pixel, consisting of three different color channels, was processed
in parallel. It is, however, possible to process two or four pixels simultaneously, resulting
in either six or twelve parallel calculations. The execution times for these different kernels
are listed in Table 6.6.
Table 6.6: Kernel execution times when processing a different number of pixels in parallel.
As can be seen in Table 6.6, the maximum performance is obtained when processing 2 pixels simultaneously, for which a minimum kernel execution time of 9.45 ms is obtained. However, including the overhead of setting the input parameters and invoking the kernel, a total execution time of 13.45 ms is achieved. This results in a maximum frame rate of 74 frames/s.
6.4 Conclusion
As a case study, we implemented the guided image filter in OpenCL. By making use of the insights gained in Chapter 5, the performance was boosted as much as possible. Key points in achieving a high performance are the use of SVM and the compiler-implemented OpenCL cache. A maximum floating point performance of 135 GFLOPS and a maximum fixed point performance of 430 GOPS were obtained, and a maximum frame rate of 74 frames/s can be achieved.
Chapter 7
Conclusion and future work
The goal of this master’s dissertation was to evaluate OpenCL as an HLS language for the HARP platform. Based on the findings in this thesis, it can be stated that the HARP platform offers developers several benefits to increase the performance of OpenCL applications. The main advantages are the shared memory and the high bandwidth. HARP uses DRAM memory shared between the CPU and the FPGA. As this shared memory approach is supported by shared virtual memory in OpenCL, it can be fully exploited. Furthermore, a high CPU-FPGA bandwidth is available to OpenCL developers.
Besides the HARP platform itself, the OpenCL compiler offers performance-enhancing features. A highly effective one is the OpenCL cache: it provides data at a high rate and reduces the number of memory requests. The OpenCL profiler is useful to evaluate the OpenCL kernel, as performance characteristics are visualized in an uncomplicated way, which allows an easy performance analysis.
Several aspects of the HARP platform are useful for an OpenCL developer. Other characteristics, however, are difficult to exploit. It is, for instance, difficult for a developer to directly address the soft IP FPGA cache. This is caused by the automatic channel selection, in which cache locality is not considered. Moreover, as has been stated in other studies, implementing a larger cache would be beneficial, as the current cache has a size of only 64 KiB.
However, as HLS goes together well with the HARP platform, this is a promising approach for many emerging compute-intensive applications, such as machine learning. Performing research in these areas could further boost the use of platforms such as HARP in these domains.