Thomas Faict
Supervisors: Prof. dr. ir. Dirk Stroobandt, Prof. dr. ir. Erik D'Hollander
Counsellors: Prof. dr. ir. Bart Goossens, Alexandra Kourfali
The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use. In the case of any other
use, the copyright terms have to be respected, in particular with regard to the obligation
to state expressly the source when quoting results from this master dissertation.
Acknowledgments
I would like to thank my supervisors Prof. dr. ir. Dirk Stroobandt and Prof. dr. ir.
Erik D’Hollander for the opportunity to work on this thesis. My thanks also go out to my
counsellors, Alexandra Kourfali, Vijaykumar Guddad and Prof. dr. ir. Bart Goossens.
Furthermore, I would like to thank ir. Dries Vercruyce for his help with the Paladin server.
Finally, I would like to thank my family and friends for their support over the past years.
Many thanks to Sophie for supporting me during this busy year.
Exploring OpenCL on a CPU-FPGA
Heterogeneous Architecture Research
Platform
by
Thomas Faict
Supervisors: Prof. dr. ir. Dirk Stroobandt, Prof. dr. ir. Erik D’Hollander
Counsellors: Prof. dr. ir. Bart Goossens, Alexandra Kourfali
Abstract
Intel recently introduced the Heterogeneous Architecture Research Platform (HARP).
In this platform, the Central Processing Unit (CPU) and the Field-Programmable Gate
Array (FPGA) are connected through a high bandwidth, low latency interconnect and
both share DRAM memory. For this platform, OpenCL, a High-Level Synthesis (HLS)
language, is made available. By making use of HLS, a faster design cycle can be
achieved than with programming in a traditional Hardware Description Language
(HDL). This, however, comes at the cost of having less control over the hardware
implementation. We investigate how OpenCL can be applied to the HARP
platform. In a first phase, a number of benchmarks are executed on the HARP platform
to analyze performance-critical parameters of the OpenCL programming model for
FPGAs. In a second phase, the guided image filter algorithm is implemented using the
insights gained in the first phase. Both a floating point and a fixed point version were
developed, based on a sliding window approach. This resulted in a maximum floating
point performance of 135 GFLOPS and a maximum fixed point performance of 430 GOPS.
Keywords
HARP, OpenCL, High-Level Synthesis, Guided Image Filter
Exploring OpenCL on a CPU-FPGA Heterogeneous
Architecture Research Platform (HARP)
Thomas Faict
1 Introduction
  1.1 Problem
  1.2 Goal
  1.3 Organization
2 Background
  2.1 FPGA
    2.1.1 Architecture
    2.1.2 Development Models
  2.2 Reconfigurable Computing
  2.3 Roofline Model
4 Software
  4.1 Open Computing Language
    4.1.1 OpenCL Architecture
    4.1.2 OpenCL Kernel
    4.1.3 OpenCL Host
  4.2 Intel FPGA SDK for OpenCL
    4.2.1 Board Support Package
    4.2.2 FPGA Bitstream
  4.3 FPGA Performance Model
    4.3.1 Pipeline Parallelism
    4.3.2 Initiation Interval
  4.4 Development Process
  4.5 Performance Tuning Phase
    4.5.1 Compilation Reports
    4.5.2 Profiling Reports
  4.6 Compiler Directives
    4.6.1 Loop Unrolling
    4.6.2 Compilation Options
  5.4.2 Results
  5.5 Conclusion
List of Figures
2.1 High-level FPGA architecture. The main building blocks are logic cells, embedded memory, DSPs and I/O blocks. These elements are connected through reconfigurable routing lanes.
2.2 Roofline model graph. For a low operational intensity, the performance will ultimately be limited by the memory bandwidth. For a higher operational intensity, on the other hand, the performance will ultimately be limited by the maximum floating point performance of the computing system.
4.1 The Board Support Package (BSP) separates the kernel logic part from the I/O part.
4.2 The Board Support Package (BSP) is included when compiling the kernel code.
4.3 Basic pipelined implementation of adding two variables.
4.4 OpenCL-based design flow.
4.5 Screenshot of a compilation report with the 4 different panes indicated.
4.6 Example of a loops analysis report.
4.7 Example of a system viewer report.
4.8 Example of a profile report.
5.1 HARP bandwidth when reading values on FPGA from the CPU. The different lines denote the different word sizes of the memory transfers.
5.2 Roofline model for the HARP platform. The blue line denotes the performance limit based on the maximum theoretical bandwidth, while the green line denotes the performance limit based on the maximum achieved bandwidth.
5.3 HARP bandwidth for reading 4 B values on FPGA from the CPU. The blue line denotes the bandwidth when reading 4 B each clock cycle, while the green line is the result of unrolling the loop with unroll factor 16.
5.4 HARP bandwidth for reading 4 B values on FPGA from the CPU. The loop was unrolled by various unroll factors.
5.5 HARP bandwidth for reading 64 B struct variables on FPGA from the CPU.
5.6 HARP bandwidth for reading ulong8 variables on FPGA from the CPU. The blue graph corresponds with the bandwidth calculated by the formula buffer size/kernel execution time, while the green graph displays the bandwidth as measured by the OpenCL profiler.
5.7 Different types of memory hierarchies.
  5.7a Memory hierarchy when using the volatile keyword. The memory hierarchy consists of three levels: the QPI cache, the CPU's Last-Level Cache (LLC) and DRAM memory.
  5.7b Memory hierarchy without using the volatile keyword. The memory hierarchy consists of four levels: the OpenCL cache, the QPI cache, the CPU's Last-Level Cache (LLC) and DRAM memory.
5.8 Bandwidth of reading src when executing kernel cache read for 64 B values.
  5.8a Bandwidth of reading src when executing kernel cache read for ulong8 pointer types. The bandwidth for both the kernel with and without the volatile keyword is displayed.
  5.8b Bandwidth of reading src when executing kernel cache read for volatile ulong8 pointer types. In this graph, only the FIU cache is used. This graph zooms in on the blue graph in the above figure.
5.9 Bandwidth of reading src when executing kernel cache read for 128 B values.
  5.9a Bandwidth of reading src when executing kernel cache read for ulong16 pointer types. The bandwidth for both the kernel with and without the volatile keyword is displayed.
  5.9b Bandwidth of reading src when executing kernel cache read for volatile ulong16 pointer types. In this graph, only the FIU cache is used. This graph zooms in on the blue graph in the above figure.
5.10 Memory buffer location when using different types of platforms.
  5.10a In a PCIe-based CPU-FPGA architecture, the original data resides in CPU DRAM while the memory buffer is located in the FPGA DRAM.
  5.10b In the HARP platform, both the original data and the memory buffer are located in the DRAM memory.
5.11 In the HARP platform, both the CPU and the FPGA can access SVM.
5.12 Bandwidths when copying data in the HARP platform. The blue line shows the bandwidth when writing data to a write buffer while the red line denotes the bandwidth when reading from a memory buffer.
6.1 The guided image filter algorithm smooths the filtering input image by making use of a guidance image. The filtering output consists of the filtering input, enhanced by the data in the guidance image.
6.2 Conceptual example of sliding window filtering.
6.3 Sliding window implementation. Kernel 1 computes the first three steps in the guided image filter algorithm, while kernel 2 computes the last two steps.
6.4 Data structure image data.
6.5 Principle of the sliding window. The pixels in green and blue are stored row-wise in a shift register. The pixels in blue are used to calculate the window function f_mean,r.
  6.5a Shift register state at clock cycle n. In the next clock cycle, the red pixel will be loaded into the shift register in local memory.
  6.5b Shift register state at clock cycle n+1. All values are shifted one place in the shift register compared to the previous clock cycle.
6.6 Roofline model of the HARP platform. The red dots indicate the performance of the floating point kernel for different values of radius, while the blue dots indicate the performance of the fixed point implementation.
List of Tables
6.1 Resource usage for OpenCL kernels implementing the sliding window implementation using fixed point and floating point computations.
6.2 Execution times when executing the sliding window implementation of the guided filter on the HARP platform.
6.3 Profiling results when executing the sliding window implementation of the guided image filter algorithm.
6.4 Cache hit rates, derived from the OpenCL profiler, when executing the sliding window implementation of the guided image filter algorithm.
6.5 Execution times of the guided filter on the HARP platform with SVM and with memory buffers.
6.6 Kernel execution times when processing a different number of pixels in parallel.
Acronyms
B Byte
CPU Central Processing Unit
FLOPS Floating Point OPerations per Second
FPGA Field-Programmable Gate Array
GB/s 10^9 Byte/second
GPU Graphics Processing Unit
HARP Heterogeneous Architecture Research Platform
HLS High-Level Synthesis
II Initiation Interval
IO Input/Output
KiB 2^10 Byte
MiB 2^20 Byte
OpenCL Open Computing Language
PCIe Peripheral Component Interconnect Express
QPI QuickPath Interconnect
SVM Shared Virtual Memory
Chapter 1
Introduction
development time. Even though hardware mapping is done automatically, this process can
often be guided by making use of compiler directives. Some examples of HLS languages
and tools are SystemC, Vivado HLS and Open Computing Language (OpenCL).
In traditional CPU-FPGA platforms, both the CPU and the FPGA have their own DRAM
memory, and they are connected through a low bandwidth connection with high latency.
This is depicted in Figure 1.1a. Intel, however, introduced the Heterogeneous Architec-
ture Research Platform (HARP), in which an Intel Xeon CPU is combined with an Intel
Arria 10 FPGA. This platform differs from traditional CPU-FPGA platforms in that the
CPU and the FPGA are connected through a high bandwidth, low latency communica-
tion link. Moreover, the CPU and the FPGA share DRAM memory, and a soft IP cache
is implemented in the FPGA. This configuration is illustrated in Figure 1.1b. Furthermore,
an OpenCL Software Development Kit (SDK) is made available for the HARP platform.
Therefore, FPGA applications can be developed in an HLS language.
1.1 Problem
Using OpenCL on the HARP platform poses several questions. As OpenCL is an HLS
language, the digital implementation is determined by the OpenCL compiler. As a result,
there is less control over the hardware implementation itself.
1.2 Goal
The goal of this thesis is to evaluate OpenCL as an HLS language for applications on the
HARP platform. In a first phase, different performance characteristics of the HARP plat-
form are benchmarked by making use of OpenCL. It is, for instance, investigated whether
an OpenCL application can obtain high bandwidths or whether it can benefit from the soft
IP cache in the FPGA. Based on the achieved results, it is formulated which performance
characteristics of the HARP platform an OpenCL developer can exploit. Moreover, it is
investigated to what degree the hardware implementation can be guided using OpenCL
directives.
In a second phase, a case study is performed. As a design example, the guided image filter
algorithm is considered. Guided image filtering is an image processing technique in which
Figure 1.1: (a) Traditional CPU-FPGA platform: the CPU and the FPGA are connected through a low bandwidth, high latency link. (b) HARP's CPU-FPGA platform: the CPU and the FPGA are connected through a high bandwidth, low latency interconnect, and they share DRAM memory.
images are smoothed. This application is implemented on the HARP platform, using the
insights gained in the first phase.
1.3 Organization
Chapter 2
Background
2.1 FPGA
Field-Programmable Gate Arrays (FPGAs) are digital chips that can be programmed to
implement arbitrary digital circuits. In contrast to Application-Specific Integrated
Circuits (ASICs), FPGAs can be reconfigured, and therefore their functionality can be
changed. This reconfigurability, however, comes at the cost of higher area usage, lower
frequency and higher power requirements than ASIC implementations.
In this section, the architecture of FPGAs and the possible development models are
described.
2.1.1 Architecture
Logic blocks consist of Lookup Tables (LUTs). LUTs can perform any boolean function
of a limited number of inputs, and therefore provide great flexibility. Even though
LUTs are useful for bit manipulations, they can be too fine-grained to efficiently
implement many types of circuits, such as a floating point multiplication. Therefore, an
FPGA also contains Digital Signal Processing (DSP) blocks. In DSP blocks, frequently
occurring operations, such as multiplications, are directly implemented in a highly
optimized IP core. Embedded memory blocks, on the other hand, allow storage of frequently
Figure 2.1: High-level FPGA architecture. The main building blocks are logic cells, embedded memory, DSPs and I/O blocks. These elements are connected through reconfigurable routing lanes [3].
used data. Due to the proximity of these memory elements, data can be accessed quickly,
with low latency. Finally, I/O blocks connect the computing fabric of an FPGA to the
outside world [3].
In addition to DSPs, embedded memory and programmable logic blocks, an FPGA also
contains routing lanes. These routing lanes connect all blocks and can be reconfigured
to interconnect the desired elements [3].
2.1.2 Development Models

In HDL development, the digital system is directly defined at the Register Transfer Level
(RTL). RTL is an abstraction level in which all state information is stored in registers and
where logic between the registers is used to generate or compute new states. Consequently,
the RTL level describes all memory elements (e.g., flip-flops, registers, or memories) and
the used logic. For each clock cycle, it is specified which values are transferred to which
storage locations, and therefore the flow of data through a circuit is directly determined
[4].
An RTL description is specified in an HDL, with VHDL and Verilog being the most popular
ones. These languages give full control over the generated hardware. Therefore, not only
the algorithm itself is defined, but also its implementation in the digital system. However,
these languages are typically used by trained hardware design engineers, and a strong
hardware background is needed to harness the full potential of this approach. Moreover,
since the cycle-by-cycle behavior of the system is completely specified, development at
such a low level is hard [4].
High-Level Synthesis
In HLS, the FPGA bitstream is defined at the algorithmic level, typically in a C-based
language (e.g. SystemC), from which a digital circuit is automatically generated. HLS is
beneficial for FPGA developers for several reasons. First, no extensive hardware
expertise has to be built up. Furthermore, HLS allows developers to design systems faster
at a high level of abstraction, and to rapidly explore the design space. These benefits,
however, come at the cost of having less control over the hardware implementation [4], [5].
2.2 Reconfigurable Computing

FPGAs fill the gap between hardware and software accelerators, and therefore have several
benefits. Firstly, an FPGA directly implements a digital function, and therefore it is,
just like an ASIC, instructionless. Because no instructions need to be fetched, a high
performance and a low power consumption are achieved. However, in contrast to ASICs,
FPGAs can be reprogrammed, so their functionality is not fixed and they are able to
meet changing requirements [6].
Reconfigurable computing is mainly used in embedded systems, as FPGAs offer low power
requirements and high reliability. FPGAs are, for instance, commonly used in network
equipment, avionics and the automotive industry. However, due to emerging big data
applications such as machine learning, high computing performance is also required in
other domains, such as data centers. Because energy consumption is at the same time
becoming an issue in data centers, reconfigurable computing is attracting attention and is
expected to gain a broader use in the future [6]. Microsoft, for instance, equipped servers
with FPGAs for the operation of the Bing search engine. They demonstrated a doubling
of the ranking throughput of Bing, while increasing the power consumption by only 10 %
[7].
2.3 Roofline Model

A comprehensible model that offers performance guidelines can be valuable to guide a
programmer in application development. In [8], a model was proposed that relates processor
performance to off-chip memory traffic. This model is called the roofline model.
In the roofline model, operational intensity is related to floating point performance. The
operational intensity is defined as the number of floating point operations (FLOP) per byte
(B) of DRAM traffic. Note that DRAM traffic does not include all memory traffic:
memory requests that are served by the cache, for instance, are not passed to DRAM
memory and are therefore not taken into account. The floating point performance, on
the other hand, is defined as the number of floating point operations per second
(FLOPS), and defines the performance of the application.
In Figure 2.2, the roofline model is illustrated. Both axes in the graph are logarithmically
scaled. The x-axis defines the operational intensity, and the y-axis defines the achievable
performance. Given an application, the achieved performance is equal to the operational
intensity times the obtained bandwidth. The obtained bandwidth, however, can never be
higher than the peak memory bandwidth, and the performance of the application can never
exceed the peak performance of the hardware platform. This is represented by the roofline
model. If the performance hits the roof, it either hits the flat part of the roof, which means
performance is compute bound, or it hits the slanted part, which means performance is
memory bound. Based on these two performance limits, the following formula can be used
to calculate the attainable GFLOPS.
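In the notation of the original roofline paper [8], this bound reads:

```latex
\text{Attainable GFLOPS} = \min\left(\text{Peak GFLOPS},\;\; \text{Peak Memory Bandwidth} \times \text{Operational Intensity}\right)
```

The first argument of the minimum is the flat (compute) roof, the second the slanted (memory) roof.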
Chapter 3
Heterogeneous Architecture Research Platform (HARP)
3.1 Introduction
In the next sections, the hardware architecture of HARPv2 will be analyzed. First, some
details on the Arria 10 FPGA will be listed. Thereafter, the interconnect structure of
the platform will be investigated. In the last section, a comparative study between the
HARPv1 platform and a traditional CPU-FPGA platform will be discussed.
In the HARPv2 platform, the implemented FPGA is the Intel Arria 10 GX1150. A short
overview of the most important specifications can be found in Table 3.1. This table shows
a peak floating point performance of 1366 GFLOPS, which should be interpreted in the
following way. The Arria 10 FPGA contains 1,518 DSPs. Each of these DSPs can perform
2 FLOP/clock cycle, which results
in 3036 FLOP/clock cycle for all DSPs together. At a rate of 450 MHz, this results in
3036 FLOP/clock cycle · 450 MHz = 1366 GFLOPS [10].
Table 3.1: Most important resources of the Intel Arria 10 GX1150 [10].

Resource                  Value
Logic Elements            1,150,000
Adaptive Logic Modules    427,200
Registers                 1,708,800
Memory                    65.7 Mb
DSPs                      1,518
Peak GFLOPS               1,366
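As a quick arithmetic check of the peak performance derivation above (illustrative only, using the figures from the text):

```python
# Peak floating point performance of the Arria 10 GX1150, following the
# derivation in the text: 1,518 DSPs x 2 FLOP per clock cycle at 450 MHz.
dsps = 1518
flop_per_dsp_per_cycle = 2
clock_rate_hz = 450e6

peak_gflops = dsps * flop_per_dsp_per_cycle * clock_rate_hz / 1e9
print(peak_gflops)  # 1366.2, i.e. the 1366 GFLOPS quoted above
```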
The CPU and the FPGA in the HARPv2 platform are connected through three physi-
cal channels: one QuickPath Interconnect (QPI) channel and two Peripheral Component
Interconnect Express (PCIe) channels. QPI is an interconnect technology, developed by
Intel, that connects processors in a distributed shared-memory style [12]. PCIe, on the
other hand, is a general purpose I/O interconnect defined for a wide variety of computer
and communication platforms [13]. The HARPv2 uses a PCIe Gen3 x8 interconnect: the
connection uses third generation PCIe technology and is 8 lanes wide, so 8 bits are
transferred in parallel.
3.3.1 Bandwidth
Raw Bandwidth
The raw bandwidth is the amount of data that can be transferred per second, without
accounting for overhead or other effects. It can be calculated as the number of bytes
sent in one transfer multiplied by the number of transfers per second. The number of
transfers per second is expressed in gigatransfers per second (GT/s).
The QPI channel has 16 data connections, and therefore a width of 2 B per transfer. Its
maximum transfer rate is 6.4 GT/s. This results in a maximum, one-way bandwidth of 12.8 GB/s
[12]. Similar calculations can be done for the PCIe channels. The PCIe connection is of
the third generation and transfers 8 bits (1 B) per transfer. However, compared to the QPI
channel, PCIe uses a less efficient data encoding scheme, which results in a physical
channel efficiency of 98.5 %: for every 130 bits sent, only 128 bits are useful.
The maximum amount of transfers per second equals 8 GT/s. This results in a maximum
raw bandwidth of 98.5 % · 1 byte · 8 GT/s = 7.88 GB/s [13].
Both the QPI and PCIe bandwidths are unidirectional. Because both connections consist
of a separate read and write channel, the total available bandwidth can be doubled. This
leads to a total raw bandwidth of 25.6 GB/s for the QPI channel, and 15.75 GB/s for the
PCIe channel. However, the individual read or write bandwidth cannot exceed the
unidirectional bandwidth. If the three physical channels are combined, a theoretical read
and write bandwidth of 1 · 12.8 GB/s + 2 · 7.88 GB/s = 28.56 GB/s is obtained.
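The channel bandwidths above can be tallied with a short script (all numbers taken from the text):

```python
# Raw one-way bandwidths of the HARPv2 channels, in GB/s.
qpi_bw = 2 * 6.4            # QPI: 2 B per transfer x 6.4 GT/s = 12.8 GB/s
pcie_bw = 0.985 * 1 * 8     # PCIe Gen3 x8: 98.5 % encoding efficiency x 1 B x 8 GT/s

# One QPI channel plus two PCIe channels, read (or write) direction only.
total_read_bw = qpi_bw + 2 * pcie_bw
print(total_read_bw)  # ~28.56 GB/s
```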
Achievable Bandwidth
Based on the previous calculations, one could conclude that the HARP platform can obtain
read and write bandwidths up to 28.56 GB/s. The above results are, however, theoretical
bandwidths and do not account for packet overhead and other effects. Therefore, the
maximum bandwidth that can be achieved will be lower, and can also vary for different
types of data transfers. In [12], the packet overhead for the QPI channel and for a second
generation PCIe channel was calculated. For a payload size of 64 B, QPI and PCIe obtained
efficiencies of 89 % and 68 %, respectively. For a larger payload size of 256 B, however,
the efficiency of the PCIe connection increased to 79 %.
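Applying these measured efficiencies to the raw bandwidths gives a rough feel for the achievable rates. This is only a back-of-the-envelope sketch: [12] measured a second generation PCIe channel, so applying its efficiency to the Gen3 raw rate is merely indicative.

```python
# Effective bandwidths for 64 B payloads, using the efficiencies from [12].
qpi_effective = 0.89 * 12.8    # ~11.4 GB/s
pcie_effective = 0.68 * 7.88   # ~5.4 GB/s (Gen2 efficiency applied to Gen3 raw rate)
print(qpi_effective, pcie_effective)
```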
The efficiency of the connection, and therefore the bandwidth, depends on multiple factors.
However, even though the calculated raw bandwidths are only theoretical values, they give
a good indication of the potential of a connection and of the maximum bandwidth a
connection can physically provide.
An important responsibility of the FIU is to abstract the QPI and PCIe channels by
providing a transparent interface to the AFU. Therefore, a developer does not need to
program low-level details of the physical communication channels. This interface is termed
the Core-Cache Interface (CCI-P), and will be discussed in this section.
[Figure 3.1: block diagram of the connection between DRAM and the AFU, showing the CCI-P interface and the MPF.]
Virtual Channels
The CCI-P abstracts the three different physical channels in the HARP platform. This
is done by mapping them to four so-called virtual channels. PCIe 0 is mapped to virtual
channel VH0 (Virtual channel with High latency), PCIe 1 to VH1 and QPI to VL0 (Virtual
channel with Low latency). A fourth virtual channel, VA (Virtual Auto), combines the
three physical channels in one virtual channel and automatically selects the appropriate
physical link during execution by making use of Virtual Channel (VC) steering logic [14],
[15].
The CCI-P does not provide memory services other than abstracting the physical channels.
For instance, no ordering of memory requests is enforced. Therefore, each memory request
can bypass another request, even to the same memory address. Even though the CCI-P
memory model does not provide ordering, this can be acceptable for applications that do
not read from and write to the same memory address. Other applications, however, require
memory transactions to be ordered. In order to support more complex requirements, Intel
provides so-called basic building blocks. Basic building blocks are reference designs that
users can instantiate in their AFU. The Memory Properties Factory (MPF) is a basic
building block that provides a collection of extensions to the CCI-P. Even though the
MPF is predefined, it is the responsibility of the user to instantiate the MPF in their
design. For this reason, the MPF is placed on the AFU-side in Figure 3.1 [14].
The MPF transforms a CCI-P request into a CCI-P request with certain added properties.
An example of a property the MPF can add is memory request reordering, which guarantees
the correct ordering of memory requests [14]. Another example is
virtual to physical address translation. By using this translation, a HDL developer is able
to use virtual addresses in the AFU, which heavily simplifies the development process. The
benefits of the MPF, however, come at a cost. In [2], it was found that transaction
reordering adds three FPGA clock cycles to a read request and one clock cycle to a write
request. Address translation, on the other hand, requires an extra four clock cycles for
both read and write requests.
The FIU not only contains the CCI-P; it also provides a soft IP cache for the QPI channel.
This cache has a total capacity of 64 KiB and is organized as a direct-mapped cache with
64 B cache lines, resulting in 1,024 cache lines. As the FPGA cache is included in
the cache coherence domain of the CPU, the data in the FPGA cache is coherent with
the CPU-cache and the DDR-memory. This results in the coherence domain depicted in
Figure 3.2 by the dotted line.
Figure 3.2: Cache coherence domain [15]. This figure uses UltraPath Interconnect (UPI),
which is the successor of QPI.
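The cache geometry can be sketched as follows; the `line_index` helper is a hypothetical illustration of direct mapping, not HARP code:

```python
# FIU soft IP cache: 64 KiB total, direct-mapped, 64 B cache lines.
CACHE_SIZE = 64 * 1024   # bytes
LINE_SIZE = 64           # bytes
NUM_LINES = CACHE_SIZE // LINE_SIZE

def line_index(byte_addr):
    """Cache line that a byte address maps to in a direct-mapped cache."""
    return (byte_addr // LINE_SIZE) % NUM_LINES

print(NUM_LINES)  # 1024
# Two addresses exactly 64 KiB apart map to the same line and evict each other.
print(line_index(0) == line_index(CACHE_SIZE))  # True
```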
Because of the expanded cache coherence domain, AFU memory requests can be serviced
by three different memory levels. This results in the AFU memory read paths displayed
in Table 3.3. The lowest latency and highest bandwidth are obtained when reading from
the FPGA cache itself. In this case, no read request to the CPU is required. If the FPGA
cache does not contain the requested cache line, the next memory level to be addressed is
the CPU Last Level Cache (LLC). This results in a higher latency and lower bandwidth
than when reading from the FPGA cache. The highest latency and lowest bandwidth are
obtained when both the FPGA cache and the CPU LLC read requests result in a miss. In
this case, the read request is serviced by the DRAM.
Since the FPGA cache is only implemented for the QPI channel, the memory read path
depends on the used virtual channel. When the channels VH0 or VH1 are used, the read
request will not pass the FPGA cache as they are mapped to PCIe channels 0 and 1. A
read request to virtual channel VL0 will always pass the FPGA cache. When using the
VA virtual channel, the Virtual Channel (VC) steering logic will map the request to one of
the three different physical channels. The VC steering logic takes into account parameters
such as the link utilization and traffic distribution. It does not, however, take cache locality
into account. Therefore, a read request can be mapped to a PCIe channel, even though
the requested data is present in the FPGA cache [14].
The FIU framework supports two development models: HDL design and OpenCL design.
In the case of HDL design, the HDL code is synthesized through the Intel Quartus
tool chain to generate the FPGA bitstream. OpenCL, on the other hand, is a high-level
synthesis framework for writing FPGA code at a higher level of C-like abstraction. The
OpenCL code is compiled to HDL code by the Intel OpenCL compiler, after which this HDL
code is synthesized using Intel Quartus. As this thesis focuses on OpenCL development,
this development model will be further elaborated in Chapter 4 [14].
Figure 3.3: Traditional PCIe-based CPU-FPGA platform. In contrast to the HARP platform, the CPU and the FPGA are not connected through QPI and they do not share DRAM memory [2].
Benchmarks
For both platforms, a series of benchmarks was executed. In a first test, the effective
bandwidth between the CPU and the FPGA was measured. Measuring the bandwidth for
both platforms is highly different. In the HARP-platform, data originates from the shared
DRAM memory, and therefore, only one FPGA read is required. In the Alpha Data board,
on the other hand, CPU-FPGA communication includes two steps. In a first phase, data
is copied from the CPU DRAM to the FPGA DRAM through the PCIe channel. After
completing the copy operation, the FPGA can access the data in the FPGA DRAM. This
led to the results shown in Table 3.4. It must be noted that the lower read latency of the
HARPv1 platform is only obtained for fine-grained memory accesses (<4 kB).
Table 3.4: Read bandwidth and read latency comparison between the Alpha Data board
and the HARPv1 platform [2].
In a second test, the cache behavior of the HARPv1 platform was examined. In Table 3.5,
the read and write latencies of both a cache hit and a cache miss are shown.
Conclusions
Based on the results, the authors of the study formulated suggestions concerning the use
of the HARPv1 platform and its improvements over a traditional PCIe-based platform. A
first finding is that the DRAM-latency of the HARP platform is lower than the FPGA
DRAM-latency of the Alpha Data platform. As this latency advantage only holds
for small payload sizes (<4 kB), a QPI-based platform is preferred for latency-sensitive
applications, especially those that require frequent (random) fine-grained CPU-FPGA
communication. Secondly, the authors observed that the long latency (14 clock cycles) and
small size (64 KiB) of the FPGA-side cache pose a serious challenge for users trying to take
advantage of this cache. In particular, since the access time to embedded FPGA memory is
only one clock cycle, the gap between this embedded memory and the cache is too large.
Chapter 4
Software
An OpenCL application consists of two parts: the host and the kernel. Host code is
executed on the CPU, while kernel code is executed on a selected accelerator. Both com-
ponents have a different function in the OpenCL framework. The host can be considered
as the executive part of an OpenCL application as it manages the kernel. It is responsible
for providing data to the kernel, invoking the kernel etc. The kernel, on the other hand,
is a function executed on the accelerator. Therefore, it typically consists of critical code
that is executed more efficiently on the accelerator architecture. A host can call multiple
kernels on different accelerator devices [16].
An example of an OpenCL kernel is displayed in Listing 4.1. This kernel is called vectoradd,
and has three input parameters a, b and c. These parameters are defined as floating point
variables, and are characterized by the keyword global. This keyword indicates that each
parameter can be accessed by all invocations of the OpenCL kernel. This is useful
when using NDRange kernels, as explained later in this section. The kernel vectoradd
calculates the sum of two vectors a and b, and writes the outcome to vector c. As all
parameters are passed by reference, as pointers to a memory location, the OpenCL kernel
can read from and write to the location the pointer points to.
Kernel Types
OpenCL kernels can be subdivided into two types: single work-item kernels and
NDRange kernels. In a single work-item kernel, only one instance of the kernel is executed,
while in an NDRange kernel, multiple instances of the kernel are executed in parallel. The
difference between these two types of kernels can be illustrated by the examples displayed
in Listing 4.1 and Listing 4.2. In both code fragments, the sum of vectors a and b is
calculated and written to vector c. This functionality is implemented in Listing 4.1 as a
single work-item kernel, while Listing 4.2 implements this functionality as an NDRange
kernel. Even though both kernels calculate the same vector-summation, they execute in
an entirely different way.
The code in the single work-item kernel is implemented as a single compute unit. This
means that the functionality of the kernel is implemented only once, and that this single
implementation processes all data. Since the implemented for-loop iterates serially over the
indices i, only a single element of vector c is calculated at each time step. An NDRange
kernel, on the other hand, consists of multiple compute units. Therefore, the kernel function
is executed in a SIMD way. As can be seen in Listing 4.2, no for-loop is present. Instead
of sequentially calculating the sum of vectors a and b, the OpenCL framework assigns a
specific index to each compute unit by using the function get_global_id(0). In this paradigm,
the keyword global is relevant. As all compute units access the data from a and b, and
write to c, these pointers must be made available to every compute unit. This is done by
using the global keyword. There exist other keywords that make data available for only a
specific work-item, but this is out of scope.
Both kernel types have different characteristics, and the choice between them
heavily depends on the hardware for which the kernel is developed. Executing, for instance,
a single work-item kernel on a GPU results in one GPU-core being used while all others
are idle. An NDRange kernel execution, on the other hand, will lead to the simultaneous
execution of different compute units on different cores. Therefore, an NDRange kernel is
highly suited for a GPU architecture. An FPGA, on the other hand, has entirely different
hardware characteristics. Even though an FPGA can implement data-level parallelism
like a GPU, it can implement pipeline parallelism particularly efficiently. In pipeline
parallelism, a computation is split in different pipeline stages. Intermediate results are
stored in registers, hereby splitting lengthy calculations in several smaller computation
steps. Several iterations can then be executed simultaneously in different pipeline stages. If
a single work-item kernel is implemented as a pipeline, this can result in a high performance
on an FPGA.
In Chapter 3, it was explained that an FPGA bitstream for the HARP platform consists
of the FPGA Interface Unit (FIU), containing low-level I/O-logic, and the Accelerator
Function Unit (AFU), containing the custom FPGA logic written by the developer. The
BSP of the HARP therefore contains the logic in the FIU, such as the Core-Cache Interface
(CCI-P), described in Section 3.4.1.
An FPGA bitstream is generated in two major steps, as depicted in Figure 4.2. In a first step,
the OpenCL kernel is compiled to HDL-code by making use of the Altera OpenCL (AOCL)
compiler. The hardware information of the FPGA is incorporated in the generated HDL
code by making use of the BSP. In the second step, an FPGA bitstream is generated based
on the compiled HDL code. In this step, all necessary implementation details, such as
selecting the appropriate FPGA frequency, are determined by the Quartus toolchain. This
results in an Altera OpenCL executable (.aocx ) file that contains the FPGA bitstream.
The host code, on the other hand, is compiled using an appropriate C compiler.
Figure 4.1: The Board Support Package (BSP) separates the kernel logic part from the
I/O part [18].
As indicated in previous sections, there are two types of OpenCL kernels: single work-item
kernels and NDRange kernels. In the Intel FPGA SDK for OpenCL, both types of kernels
are implemented completely differently. When compiling an NDRange kernel, the FPGA
implements a GPU-like architecture in which a number of threads execute simultaneously.
This implementation, however, does not make use of pipeline parallelism. Single work-
item kernels, on the other hand, are implemented in a pipelined fashion. Since pipelining
is important for FPGA applications, some general principles of pipeline parallelism will be
discussed in this section.
Intel Quartus
.aocx file
CPU FPGA
Figure 4.2: The Board Support Package (BSP) is included when compiling the kernel code
[18].
This principle can be explained by making use of the OpenCL kernel vectoradd in Listing
4.1. In this kernel, the elements of two vectors are added and the result is stored in the ele-
ment of a third vector. A basic pipelined implementation of the statement c[i]=a[i]+b[i]
in this kernel is displayed in Figure 4.3.
Figure 4.3: Pipeline of the statement c[i]=a[i]+b[i]: two Load stages, an Add stage and a Store stage, separated by registers.
The statement is subdivided in three different stages. In a first stage, the values a[i] and
b[i] are loaded and saved in registers. In the second pipeline stage, these two values are
added and the result is stored in a register. Finally, the result is stored in memory to c[i].
Because the pipeline in Figure 4.3 consists of three pipeline stages, three iterations can
execute simultaneously. In this case, the maximum throughput is obtained, as the pipeline
is always in use. It is, however, possible that only two or even a single loop iteration is being
executed. The number of loop iterations that can execute simultaneously is determined by
the number of clock cycles between two subsequent iterations. This is termed the Initiation
Interval (II): the number of cycles that must elapse between issuing two iterations [19].
The II of a pipelined loop largely determines the performance. The desired value for the II is
one. In this case, a new loop iteration can be issued every clock cycle, and the implemented
hardware will be used 100 % of the time. This leads to the optimal performance. If the II,
however, is higher, a lower performance will be obtained. For an II of two, only one loop
iteration will be started every two clock cycles. Therefore, each pipeline stage will only be
used 50 % of the time, and the loop will take twice as long to finish. An II of
three leads to a usage of 33.33 % and an execution time that is three times as long, and so
forth.
A large II can have several possible causes, such as I/O delay, a limited number of
resources, or dependencies in the algorithm. When developing a single
work-item OpenCL kernel for FPGA, the purpose should always be to obtain an II of one
for implemented loops. Higher values lead to a lower utilization rate and therefore a lower
performance.
Three main development phases when developing an OpenCL project with the Intel FPGA
SDK for OpenCL are distinguished in [18]: the emulation phase, the performance tuning
phase and the execution phase. In the emulation phase, the behavior of the kernel is
verified by emulating the code on a CPU. In the performance tuning phase, performance
bottlenecks are identified by inspecting compilation and profiling reports. In the execution
phase, the performance of the OpenCL kernel is analyzed by executing the kernel on the
FPGA platform. A flowchart of a typical OpenCL FPGA design flow, in which these three
phases are indicated, can be seen in Figure 4.4. Since the performance tuning phase is the
most critical stage for improving the performance, this will be discussed more in detail in
following sections.
In the performance tuning phase, the performance of the OpenCL kernel is analyzed and
improved. The Intel FPGA SDK for OpenCL provides two types of reports that can be
used in this phase: compilation reports and profiling reports.
As explained earlier in this chapter, the Intel FPGA SDK for OpenCL generates FPGA
bitstreams in two steps: a compilation step and a bitstream generation step. In the
compilation step, the OpenCL code is compiled into HDL code, and based on this code,
the synthesis step generates an FPGA bitstream. During the compilation process, reports
of the produced HDL code are generated. These reports contain information on the resource
usage of the kernel, the compilation transformations, the initiation interval of loops and
so forth. By investigating these reports, more insight in the implemented hardware can be
gained. Because of the compilation reports, an early performance analysis can be made,
which prevents going through the full, lengthy FPGA development cycle to assess the
OpenCL kernel performance.
Structure
The compilation reports are generated as a set of .html files, and can be viewed using a web
browser. On the home page of the compilation report, shown in Figure 4.5, an overview is
given of the compiled kernel. On this home page, four different panes can be distinguished.
• View reports pane: in this pane, different types of reports can be selected.
• Source code pane: shows the source code. This pane is extremely useful for linking
implementation details of the kernel to the OpenCL source code.
• Analysis pane: in this pane, the analysis details as selected in the ’view reports pane’
can be read.
• Details pane: shows details from selected elements in the ’analysis pane’.
Using the ’view reports pane’, six different types of reports can be selected. A first one is
’Summary’, which results in the home page indicated in Figure 4.5 with a summary of the
compiled kernel. The ’Loops analysis’ view gives an overview of all loops in the OpenCL
kernel. ’Area analysis of system’ and ’Area analysis of source’ both give an overview
of the resource usage of the kernel. ’System viewer’ shows a control-flow graph of the
developed system. In the last report, the ’Kernel memory viewer’, a visual representation of
FPGA embedded memory usage is shown. Since this thesis mainly focuses on performance
analysis, and less on area analysis, only the ’Loops analysis’ and the ’System viewer’ reports
will be discussed.
Chapter 4. Software 33
Figure 4.5: Screenshot of a compilation report with the 4 different panes indicated.
Loops Analysis
In the loops analysis, information on the loops in the kernel is given. An example is shown
in Figure 4.6. In the analysis pane, all loops in a kernel are listed. For each of the listed
loops, there are four different columns listing details.
• Pipelined: indicates whether a loop is pipelined or not, and therefore whether or not
pipeline parallelism can be applied.
• II: the initiation interval of the loop.
• Bottleneck: indicates which factor limits the initiation interval.
• Details: gives additional information on the listed bottleneck.
The loop analysis report can be read in the following way. As can be seen in Figure 4.6, the
implemented kernel is pipelined and has an initiation interval of 525. As explained earlier,
such a high initiation interval is dramatic for performance; the II is clearly a
bottleneck for the kernel. This is indicated in the third column, which shows
that the initiation interval forms the bottleneck. The fourth column gives details on
this poor initiation interval: apparently, it is caused by a memory dependency.
In the details pane at the bottom of the screen, a more elaborate description of the cause
of the bottleneck is given. This description refers to lines in the source code. In this way,
a developer can easily find the source of a poorly performing OpenCL kernel.
System Viewer
A visual representation of the kernel can be found in the system viewer. In this view,
the OpenCL kernel is displayed as a control flow graph of basic blocks. A basic block is
a sequence of statements that is always entered at the beginning and exited at the
end [20]. Besides the operation flow, all memory operations to ’Global Memory’, i.e. the
DRAM-memory, are displayed in this report.
The system viewer report of the vectoradd example is displayed in Figure 4.7. In this view,
the vectoradd kernel is subdivided in three basic blocks. The basic block vectoradd.B1
contains the for-loop in which the addition operation is executed. As can be seen in this
figure, this basic block is marked red. This indicates a performance bottleneck, which is
the poor initiation interval of 525. When hovering with the mouse pointer over this basic
block, some more information can be observed, such as the latency (expressed as number
of clock cycles).
The Intel FPGA SDK for OpenCL enables reviewing the kernel’s performance by incorpo-
rating the OpenCL Profiler. When generating an FPGA bitstream using the profile-option,
the FPGA program will measure and save performance metrics during execution. This en-
ables reviewing the kernel's behaviour, and thereby detecting, for example, the causes of a
kernel's poor performance.
The profiling information is saved in a .mon-file, and can be viewed through a graphical
user interface. An example of a profile report is illustrated in Figure 4.8. For each I/O
statement, the following performance metrics are shown.
• Stall (%): the percentage of the overall profiled time frame during which the memory
access causes a pipeline stall. Since pipeline stalls are undesirable, the preferable
stall value is 0 %.
• Occupancy (%): the percentage of the overall profiled time frame that a memory
instruction is issued. In a pipelined implementation, the best performance is achieved
when a loop iteration can be issued every clock cycle. Therefore, the most
desired value for the occupancy is 100 %.
• Bandwidth efficiency (%): the percentage of loaded memory that is actually used by
the kernel. Since data is loaded by reading memory words, it is possible that parts
of the loaded word are not used by the kernel. In the optimal case, all loaded data
is used. Therefore, an efficiency of 100 % is desired.
#pragma unroll 8
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
Listing 4.3: Example of the use of #pragma unroll.
When unrolling a loop in the Intel FPGA SDK for OpenCL, a spatial implementation of
the different loop iterations is obtained. The statements in the loop will be duplicated
in hardware, and multiple loop iterations will be executed in parallel. Unrolling a loop
therefore results in a higher resource usage but a possible shorter execution time.
Loop unrolling is indicated by #pragma unroll, which instructs the compiler to unroll the
loop. By giving an extra parameter, the unroll factor, the developer specifies how many
times the loop should be unrolled. An example of loop unrolling is given in Listing 4.3. In
this listing, the for-loop is spatially unrolled eight times.
Besides loop unrolling, other compiler directives are possible as well. When compiling an
OpenCL kernel, several compilation options can be given. Some examples are the options
-fp-relaxed or -fpc. -fp-relaxed relaxes the order of floating-point operations. This
potentially leads to a lower resource usage, but can also result in a lower precision. -fpc
adds additional precision to floating-point operations by disabling intermediate rounding
operations.
Chapter 5
OpenCL Performance on HARP
5.1 Introduction
In this chapter, we investigate whether OpenCL is able to exploit the performance-enhancing
features of the HARP platform. This will be done by a number of tests that each focus
on a specific hardware characteristic. In a first test, the HARP read bandwidth will be
examined by making use of several OpenCL kernels. Thereafter, the cache performance of
the FPGA cache is investigated. Finally, the effect of Shared Virtual Memory (SVM) will
be analyzed.
5.2 Bandwidth
In a first test, the read bandwidth of the HARP platform is investigated. This is done by
reading a variable amount of data on the FPGA from the CPU.
The OpenCL kernel used for this test is displayed in Listing 5.1. The kernel memory_read
is implemented as a single task kernel, which is indicated by __attribute__((task)). This
kernel has three parameters. The first two parameters, src and dst, are pointers to addresses
from which data is read and to which data is written. The third parameter, lines, is used
to determine the buffer size of the data that is read. The variables src and dst are defined
as ulong8 pointers. ulong8 is an OpenCL vector data type that consists of eight ulong
(unsigned long) elements.
1  kernel void
2  __attribute__((task))
3  memory_read(
4      global ulong8 * restrict src,
5      global ulong8 * restrict dst,
6      unsigned int lines)
7  {
8      ulong8 output = (ulong8)(0);
9      for (unsigned int i = 0; i < lines; i++) {
10         output += src[i];
11     }
12     *dst = output;
13 }
Listing 5.1: Kernel code used to test the maximum achievable read bandwidth of the
HARP platform.
Therefore, as a single ulong variable has a size of 8 B, reading a
single ulong8 variable results in reading 8 · 8 B = 64 B. This size equals the size of a cache
line in the Core-Cache Interface (CCI-P). Therefore, the number of ulong8 variables that
are read equals the number of cache lines necessary for loading the data.
The first statement in the kernel on line 8 declares and initializes the variable output. This
variable is used to prevent data reads from being compiled away. After initializing output,
a for-loop iterates over all elements of input variable src, and adds the elements to output.
Finally, output is written to the location dst points to. If this additional write to dst were
absent, the complete for-loop would be compiled away.
In order to investigate the influence of the used data type, the kernel in Listing 5.1 was
implemented for other data types of variables src, dst and output. In a first kernel, these
variables were defined as ulong16 data type, resulting in a size of 128 B for one memory
transfer. In a second kernel, the variables were defined as int data types, resulting in 4 B
memory transfer sizes.
The three different kernels were synthesized together into a single .aocx file, resulting in a
clock frequency of 258.3 MHz. All kernels resulted in a fully pipelined implementation,
in which the for-loop obtained an initiation interval equal to one. Therefore, a new
iteration is started, and a new value read, every clock cycle. Each of the three
kernels was executed using variable buffer sizes. The execution time of the kernel
was recorded by making use of the OpenCL profiler, and the resulting bandwidth was
calculated as bandwidth = buffer size / kernel execution time.
5.2.2 Results
The execution of the OpenCL kernel in Listing 5.1 led to the bandwidths displayed in
Figure 5.1 for the three different kernels.
Figure 5.1: HARP bandwidth when reading values on FPGA from the CPU. The different
lines denote the different word sizes of the memory transfers.
General Observations
When analyzing the graph in Figure 5.1, some elementary observations can be made. First,
it can be noticed that the bandwidth increases for larger buffer sizes. This can be explained
by the influence of overhead. For smaller buffer sizes, factors such as kernel set-up time,
communication set-up time etc. will not be negligible. For this reason, the bandwidth
will increase for higher buffer sizes, where these factors will become small compared to the
data transfer time. Secondly, it can be seen that the maximum bandwidth for large buffer
sizes increases to roughly 15 GB/s. However, this is only the case for the kernels that
load ulong8 and ulong16 variables. The kernel that loads int values reaches a maximum
bandwidth of around 1 GB/s.
Requested Bandwidth
Based on the frequency of the compiled kernels and the data size that is transferred every
clock cycle, it is possible to calculate the bandwidth requested by the kernel. Since all
kernels are fully pipelined, one variable is read every clock cycle. This results in a requested
data type size
bandwidth that can be calculated as . As the frequency of the kernel
clock cycle time
equals 258.3 MHz, the cycle time equals 3.87 ns. Table 5.1 summarizes the requested and
maximum achieved bandwidths of the different kernels.
Table 5.1: Requested bandwidth versus maximum achieved bandwidth when reading vari-
ables of a different size.
The achieved bandwidth for the kernel that loads int values exactly matches the requested
bandwidth, i.e. 1.03 GB/s. This implies that the CPU-FPGA interconnect can satisfy
the requested bandwidth. The measured bandwidths for the kernels that load ulong8 and
ulong16 values, however, do not correspond to the requested bandwidths. The requested
bandwidth when loading ulong16 variables even exceeds the raw bandwidth of the
interconnect, which, as calculated in Chapter 3, equals 28.56 GB/s; it can therefore
physically not be achieved. Despite these high requested bandwidths, the achieved
bandwidth never exceeds 15.23 GB/s. Therefore, the kernels that load ulong8 and
ulong16 variables are communication limited, with a maximum bandwidth of 15.23 GB/s.
Given this limit, it is possible to construct the roofline model for the HARP platform,
resulting in the model in Figure 5.2. In this figure, the compute limit is not displayed
since the operational intensity is too low. The blue graph is the performance limit as
defined by the raw bandwidth. However, since this bandwidth can never be achieved, a
corrected version is constructed based on the bandwidth tests. The maximum bandwidth
was rounded up to 16 GB/s, which results in the green graph. The black dots indicate the
three kernels that were tested. Both the ulong8- and the ulong16-loading kernels, which
achieve an operational intensity of 1/8, hit the performance roof, and their performance
cannot be increased. The kernel loading int values, on the other hand, which obtains an
operational intensity of 1/4, only achieves 1/16 of the possible performance. Since this
kernel only requests a bandwidth of 1.03 GB/s, as can be seen in Table 5.1, not
enough data is requested to achieve high performance. Increasing the requested bandwidth
can therefore potentially boost the performance. Note that in this figure, we use OP/B
and OPS since the used data types are not floating-point data types.
Figure 5.2: Roofline model for the HARP platform. The blue line denotes the perfor-
mance limit based on the maximum theoretical bandwidth, while the green line denotes
the performance limit based on the maximum achieved bandwidth.
A possible solution to increase the requested bandwidth is to unroll the for-loop in the
...
#pragma unroll 16
for (unsigned int i = 0; i < int_values; i++) {
    output += src[i];
}
...
Listing 5.2: Unroll pragma in OpenCL kernel.
OpenCL-kernel, such that more data is requested per clock cycle. In a first attempt, the
for-loop is unrolled 16 times. This should result in a requested bandwidth of 16.53 GB/s.
The modified kernel is displayed in Listing 5.2. Even though this appears to be a viable
solution, problems emerge when compiling this kernel. The compilation reports indicate
that the compiled for-loop has an initiation interval of 7. Therefore, 16 int-values would
be loaded every 7 clock cycles. This massively decreases the requested bandwidth, which
is then equal to 16.53 GB/s / 7 = 2.36 GB/s.
The achieved bandwidth is displayed in Figure 5.3. As suggested by the compilation
reports, the bandwidth only increases marginally, from 1.03 GB/s to 1.86 GB/s. This
maximum achieved bandwidth is even lower than the expected bandwidth of
2.36 GB/s. When profiling the kernel, however, no memory stalls were found, and
the maximum bandwidth efficiency of 100 % was achieved. Therefore, the interconnect did
not limit the achieved bandwidth.
In the compilation reports, no specific information is given on the cause of the poor ini-
tiation interval. The cause, however, is the unknown number of iterations. Since the
parameter int_values is passed to the kernel by the host, this parameter is unknown at
compile time. When choosing a fixed value for this variable, the compiler is able to gen-
erate a for-loop with an initiation interval of 1. The downside of this approach is that
the buffer size needs to be known before generating hardware. In order to find the maxi-
mum bandwidth using this kernel, a buffer size of 32 MiB was chosen. This resulted in a
maximum bandwidth of 9.36 GB/s. Once again, the achieved bandwidth did not reach
the earlier found limit of approximately 16 GB/s. In a last attempt to further increase the
bandwidth using the unroll pragma, the for-loop was unrolled using higher unroll factors
for a fixed buffer size. This leads to the maximum bandwidths displayed in Figure 5.4.
It can be seen in this figure that an upper limit of approximately 14 GB/s is reached for
large unroll factors.
Figure 5.3: HARP bandwidth for reading 4 B values on FPGA from the CPU. The blue
line denotes the bandwidth when reading 4 B each clock cycle, while the green line is the
result of unrolling the loop with unroll factor 16.
Figure 5.4: HARP bandwidth for reading 4 B values on FPGA from the CPU. The loop
was unrolled by various unroll factors.
Another approach to increase the requested bandwidth of the kernel loading int variables,
is to define a custom data type such that more data is loaded every clock cycle. For
this purpose, a struct was declared consisting of 16 int values. The used struct is illus-
trated in Listing 5.3. Since this struct contains 16 int values, it has a size of 64 B. The
packed attribute indicates to the compiler that the struct members are stored without
padding.
When executing the kernel that reads this struct, the bandwidth displayed in Figure 5.5 is
obtained. As can be seen in this figure, a bandwidth of nearly 16 GB/s is obtained, which
is the desired outcome.
Profiling Bandwidth
__attribute__((packed))
struct integer
{
    int a;
    int b;
    int c;
    ...
    int p;
};
Listing 5.3: Struct containing 16 int values.
Figure 5.5: HARP bandwidth for reading 64 B struct variables on FPGA from the CPU.
When compiling the OpenCL kernel with the profile-flag, performance information is measured during execution and saved in
a profile file. As the profiling information captures kernel information more precisely, this
will generally be a more accurate way of analyzing the kernel bandwidth.
The bandwidth that was achieved for the kernel loading ulong8 -variables was compared
with the bandwidth as measured by the OpenCL-profiler. Both bandwidths are displayed
in Figure 5.6. In this figure, it can be seen that the achieved bandwidth for low buffer
sizes is systematically lower than the profiled bandwidth. For higher buffer sizes, however,
both bandwidths correspond with each other. Again, this can be explained by transient
phenomena. Since some of these phenomena, such as kernel set-up time, are not included
in the CPU-FPGA communication as measured by the profiler, the profiled bandwidth will
be higher than the achieved bandwidth.
Figure 5.6: HARP bandwidth for reading ulong8 variables on FPGA from the CPU. The
blue graph corresponds with the bandwidth calculated by the formula buffer size/kernel
execution time, while the green graph displays the bandwidth as measured by the OpenCL
profiler.
As explained in Chapter 3, the FPGA Interface Unit (FIU) implements a soft IP cache.
The responsibility of the FIU cache is to prevent memory requests from being passed to DRAM
memory. If the requested data is already available on the FPGA itself, a potentially higher
bandwidth and lower latency can be achieved. The FPGA cache has a capacity of 64 KiB
with 64 B direct-mapped cache lines.
In the HARP platform, the Intel Arria 10 FPGA is connected to the processor through one
QPI and two PCIe channels. The FIU cache, however, is only implemented for the QPI
channel. Therefore, if a memory request is passed to a PCIe channel, no cache look-up
will be executed, and the memory request will be sent to DRAM memory. Since the OpenCL
board-support package for the HARP platform uses the Virtual Auto (VA)-channel, a
memory request can be assigned to each of the three physical channels. As a result, an
FPGA memory request in the HARP platform will not necessarily pass the FIU cache, and
cache performance is unpredictable.
Locality of memory accesses is typically subdivided into two types: spatial locality and temporal
locality. Spatial locality means that data elements located close together in memory
are accessed; temporal locality means that the same data is accessed several times
within a short time span.
If the OpenCL kernel in Listing 5.1 is considered, the kernel that loads int-variables could
make use of spatial locality. As one cache line has a size of 64 B, 16 int-values fit
in one cache line. Therefore, when reading one int-value, the 15 surrounding values can
potentially be read from the cache line that was loaded. In fact, it is possible
that the FIU cache has already been used in this kernel. However, in this test case, the
requested bandwidth was only 1.03 GB/s. Since cache behaviour is especially interesting in
those cases where the requested bandwidth exceeds the upper bound of the interconnect,
this kernel will not be considered.
Temporal locality, on the other hand, can be exploited in each of the three kernels. When the same data is loaded multiple times within a short period of time, the data potentially resides in cache, thereby exploiting temporal locality. When loading 64 B or 128 B values in the previous section, the interconnect could not cope with the high requested bandwidths. Therefore, this section investigates whether the FIU cache can satisfy these high requested bandwidths.

__kernel void
__attribute__((task))
cache_read(
    __global volatile ulong8 *restrict src,
    __global volatile ulong8 *restrict dst,
    unsigned int lines)
{
    ulong8 output = (ulong8)(0);
    for (unsigned int i = 0; i < ITERATIONS; i++) {
        unsigned int index = i % lines;
        output += src[index];
    }
    *dst = output;
}
Listing 5.4: Kernel code used to test the cache effectiveness of OpenCL on the HARP
platform.
The OpenCL kernel used to test the FIU cache is displayed in Listing 5.4. This kernel
differs in three ways from the memory-read kernel in Listing 5.1.
• The for-loop iterates for a fixed number of iterations, indicated by the constant
ITERATIONS. The number of iterations is determined at compile time, and is chosen
sufficiently high such that overhead effects become insignificant.
• The array src is indexed by the variable index. This variable is calculated as the
remainder when dividing loop variable i by the parameter lines. If lines is, for
instance, equal to 4, the first four variables will be accessed repeatedly. This leads to
the following access pattern: 0, 1, 2, 3, 0, 1, 2, 3, . . . . By using this modulo operation,
temporal locality is enforced, and cache performance can be evaluated. The input
parameter lines determines the degree of temporal locality.
• The pointers src and dst are marked volatile. This keyword indicates that the
data to which a pointer points may change over time. Therefore, every time the
variable is read, a memory request should be sent. If the volatile keyword is not
used, on the other hand, the loaded data is buffered in the OpenCL kernel. In
this case, the OpenCL compiler adds a software-implemented cache to the FPGA
bitstream through which all memory requests pass [21]. The memory hierarchy when
using the volatile keyword is displayed in Figure 5.7a, while Figure 5.7b displays the
memory hierarchy without using the volatile keyword. When testing the hardware-
implemented QPI cache, it is therefore important to mark src and dst as volatile,
such that no software cache prevents the FIU cache from being read. The software cache,
however, can be used as a comparison to the FIU cache. Therefore, the kernel
cache_read was also implemented without using the volatile keyword.
5.3.2 Results
The kernel was developed for both the ulong8 and the ulong16 data type. It was synthesized
to an .aocx file, which resulted in a clock frequency of 240.6 MHz. The variable lines was
varied from 1 to 2048. This corresponds to repeated reads of a memory region of 64 B
to 128 KiB for the ulong8 variant, and 128 B to 256 KiB for the ulong16 variant.
In Figure 5.8a and Figure 5.9a, the obtained average bandwidths for both kernels are
displayed with and without using the volatile keyword. As there are large variations in
obtained bandwidth for small values of the variable lines, a scatter plot is displayed in
Figure 5.8b and Figure 5.9b for the first 20 values of lines.
As the kernels are fully pipelined, one variable is read every clock cycle. With a
frequency of 240.6 MHz, and therefore a clock cycle time of 4.16 ns, the requested bandwidth
can be calculated as read size / clock cycle time. This leads to the requested bandwidths displayed
in Table 5.2.
(a) Memory hierarchy when using the volatile keyword. The memory hierarchy consists of three
levels: the QPI cache, the CPU’s Last-Level Cache (LLC) and DRAM memory.
(b) Memory hierarchy without using the volatile keyword. The memory hierarchy consists of four
levels: the OpenCL cache, the QPI cache, the CPU’s Last-Level Cache (LLC) and DRAM memory.
Figure 5.7: Memory hierarchy of the OpenCL kernel with (a) and without (b) the volatile keyword.
(a) Bandwidth of reading src when executing kernel cache_read for 64 B values, with
(FIU cache) and without (OpenCL cache) the volatile keyword.
(b) Bandwidth of reading src when executing kernel cache_read for volatile ulong8
pointer types. In this graph, only the FIU cache is used. This graph zooms in on the
blue graph in the figure above.
Figure 5.8: Bandwidth of reading src when executing kernel cache_read for 64 B values.
(a) Bandwidth of reading src when executing kernel cache_read for 128 B values, with
(FIU cache) and without (OpenCL cache) the volatile keyword.
(b) Bandwidth of reading src when executing kernel cache_read for volatile ulong16
pointer types. In this graph, only the FIU cache is used. This graph zooms in on the
blue graph in the figure above.
Figure 5.9: Bandwidth of reading src when executing kernel cache_read for 128 B values.
OpenCL Cache
As can be seen in Figure 5.8a and Figure 5.9a, the OpenCL-implemented cache is able to
fulfill the requested bandwidths of 15.4 GB/s and 30.8 GB/s. In both graphs, the size of
this software cache can be deduced from the point where the bandwidth drops. When
reading ulong8-variables, the OpenCL cache has a size of 64 KiB; when reading ulong16-
variables, the cache is 128 KiB in size. When the memory region is larger than the cache
size, the bandwidth decreases and converges to an asymptotic bandwidth of 9.3 GB/s.
FIU Cache
In Figure 5.8a and Figure 5.9a, it is difficult to see a trace of the FIU cache in the bandwidth
results. While there is a clear point at which the bandwidth of the OpenCL cache decreases,
the bandwidth of the volatile kernel is mainly constant.
However, a more detailed view of these kernels, in Figure 5.8b and Figure 5.9b, reveals
some variations that indicate the use of the hardware cache. In Figure 5.8b, there are
points at which the system was able to achieve a higher bandwidth than the asymptotic
bandwidth. When reading single cache lines, i.e. 64 B values, there are measurements
that achieve a bandwidth of 15.4 GB/s, which is the requested bandwidth. In Figure 5.9b,
there are fewer variations in this area. However, there are measurements in which a
bandwidth of more than 30 GB/s is achieved. Since the maximum unidirectional
theoretical bandwidth of the CPU-FPGA interconnect is roughly 28 GB/s, this bandwidth
exceeds the physical limit. Therefore, this read probably originates from the FIU cache.
Bandwidth
In the previous section, a kernel was developed to achieve as high a bandwidth as possible.
This resulted in a maximum bandwidth of roughly 16 GB/s. Even though the CPU-
FPGA interconnect clearly supports these high bandwidths, the asymptotic bandwidth
when loading 64 B values is only approximately 9 GB/s. The kernel is fully pipelined, has
an initiation interval equal to one, and no memory stalls were encountered during execution.
Therefore, no memory read problems were encountered, and the low bandwidth is probably
caused by the OpenCL kernel itself.
A third aspect that is considered is the use of Shared Virtual Memory (SVM). In the
OpenCL framework, data is communicated between the host and the kernel by reading
from and writing to allocated memory areas. There are two different types of memory
areas: memory buffers and SVM.
Memory Buffers
When using memory buffers, the data is copied to a specially allocated part of the RAM
memory called a memory buffer. The host can only access memory buffers through
the OpenCL API. Moreover, the memory location of the memory buffers depends on the
accelerator architecture. In a PCIe-based CPU-FPGA architecture, both the processor
and the FPGA have their own DRAM memory, as illustrated in Figure 5.10a. In this type
of architecture, the memory buffer is located in the FPGA DRAM memory [18]. The data,
which is located in CPU DRAM, is therefore copied from CPU DRAM to FPGA DRAM,
after which the FPGA reads from the FPGA DRAM. This is the conventional operation
in a traditional PCIe-based CPU-FPGA platform. In a shared memory architecture such
as the HARP platform, on the other hand, both the original data and the OpenCL buffer
are stored in the shared DRAM. This is indicated in Figure 5.10b.
SVM
SVM, on the other hand, is a paradigm in which the original data can be accessed by
both the OpenCL kernel and the host without having to pass through the OpenCL API.
The host can access the allocated memory area by using conventional memory operations.
Therefore, data is not duplicated, as illustrated in Figure 5.11.
In order to characterize the impact of traditional memory buffers, their overhead was
investigated. This overhead is characterized by the amount of time required to write to
and read from the memory buffer, and will be expressed as the achieved bandwidth.
(a) In a PCIe-based CPU-FPGA architecture, the original data resides in CPU DRAM while the
memory buffer is located in the FPGA DRAM.
(b) In the HARP platform, both the original data and the memory buffer are located in the
shared DRAM memory.
Figure 5.10: Memory buffer location when using different types of platforms.
Figure 5.11: In the HARP platform, both the CPU and the FPGA can access SVM.
5.4.2 Results
Figure 5.12: Bandwidths when copying data in the HARP platform. The blue line shows
the bandwidth when writing data to a memory buffer, while the red line denotes the bandwidth
when reading from a memory buffer.
The results are displayed in Figure 5.12. Writing data to a memory buffer results in an
asymptotic bandwidth of nearly 13 GB/s, while reading from the buffers results in nearly
12 GB/s.
The influence of these bandwidths on the overall performance largely depends on the buffer size
and on the execution time of the kernel itself. For a small amount of data and a kernel
with a long execution time, the influence of using memory buffers on the performance is
relatively small. When a large amount of data has to be read and written for a kernel
that only requires a short execution time, however, memory buffers will degrade the
performance. Consider, for instance, an OpenCL kernel that requires 128 MiB of data
to be written and read, and that has an execution time of 20 ms. The achieved bandwidths
for this payload size can be read from the graph in Figure 5.12. Writing the data will
require approximately 128 MiB / 11 GB/s ≈ 12 ms, while reading from the buffer takes
approximately 128 MiB / 7 GB/s ≈ 19 ms. In this case, the total kernel performance
will be seriously degraded.
5.5 Conclusion
Bandwidth
The maximum bandwidth that was achieved in the different tests is 15.73 GB/s. In Chap-
ter 3, it was found that the maximum raw one-way bandwidth of the interconnect is
28.56 GB/s. Due to communication overhead, this raw bandwidth will never be achieved.
However, a more detailed analysis of which communication overhead aspects affect the
bandwidth would be useful.
It was noticed that boosting the bandwidth by means of the pragma unroll did not
yield good results. When using a custom struct, on the other hand, a high bandwidth
was obtained. Therefore, using a struct is recommended when data has to be read for
which no native data type is available.
Cache
In the executed tests, traces of the FIU cache were found. When loading 128 B values,
for instance, bandwidths of over 30 GB/s were obtained. Because this bandwidth exceeds
the physical limits of the interconnect, the FIU cache was presumably used. However,
it is only in certain cases, and only for a couple of measurements, that the FIU cache
was effective. In all other cases, the achieved bandwidths do not satisfy the requested
bandwidths. A possible cause for this problem is the virtual auto channel, which selects
the physical channel without taking cache locality into account. However, no hard claims
can be made.
Even though it may be hard for an OpenCL developer to exploit the FIU cache, the software
cache added by the OpenCL compiler is useful. With this software-implemented
cache, high bandwidths were obtained. As the volatile keyword has the drawback that
this software cache is not implemented, it should only be used when necessary.
This is the case if the data could change during the execution of the OpenCL kernel.
SVM
The HARP platform offers a shared memory architecture that can be exploited by using
SVM in OpenCL. The use of memory buffers results in memory and performance overhead.
Therefore, SVM should be used on the HARP platform.
Chapter 6
Guided Image Filtering
6.1 Introduction
An example application of this algorithm is the enhancement of LIght Detection And Rang-
ing (LIDAR) data. By scanning an environment using LIDAR, a point cloud consisting
of (x, y, z)-coordinates is obtained. This point cloud can act as the filtering input in the
guided image filter algorithm. By making use of a color image of the scene as the guidance
image, the LIDAR data can be denoised.
6.2 Algorithm
The guided image filter algorithm consists of 5 steps, which are displayed in pseudocode
in Algorithm 6.1. The algorithm has 4 input parameters, the filtering input I, the guidance
input G, the filtering radius r and a regularization parameter ε, and one output, the
filtering output O [22].
Chapter 6. Guided Image Filtering 63
Figure 6.1: The guided image filter algorithm smooths the filtering input image by making
use of a guidance image. The filtering output consists of the filtering input, enhanced by
the data in the guidance image [22].
In a first step, the function fmean,r is applied. fmean,r calculates the mean of all pixels
within a square window. The size of this window is determined by the radius r. A
radius of 2, for instance, leads to a square window in which all values within a distance
of 2 pixels of the central pixel are considered. This window therefore has a size of
(2 · r + 1) × (2 · r + 1) = 5 × 5 pixels. This function is applied to the filtering input, the
guidance input, the filtering input multiplied with the guidance input, and the guidance
input multiplied with itself. The multiplication of two images is executed pixel-wise, as
denoted by the point-multiplication .∗ . Based on the values calculated in step 1, the
variance of the guidance input, varG , and the covariance between the filtering input
and the guidance input, covIG , are calculated for each pixel in step 2. In step 3, images a and b
are calculated using the previously calculated varG and covIG , and using the regularization
parameter ε. Then, in step 4, images a and b are filtered using the same fmean,r as used in
step 1, which results in images meana and meanb . Finally, in step 5, the filtering output is
calculated as meana . ∗ G + meanb .
1. meanI = fmean,r (I); meanG = fmean,r (G); corrIG = fmean,r (I . ∗ G); corrG = fmean,r (G . ∗ G)
2. varG = corrG − meanG . ∗ meanG ; covIG = corrIG − meanI . ∗ meanG
3. a = covIG ./ (varG + ε); b = meanI − a . ∗ meanG
4. meana = fmean,r (a); meanb = fmean,r (b)
5. O = meana . ∗ G + meanb
Algorithm 6.1: Guided Image Filter algorithm.
In the sliding window implementation of an image processing algorithm, the window function
is slid over the input image. Each output pixel is produced by applying the chosen window
operator to the input pixels under the window. The result is a pixel value that is assigned
to the centre of the window in the output image [23]. This principle is illustrated in Figure 6.2.
In the guided image filter algorithm, the applied window function is fmean,r , which calculates
the mean of all pixels under the window of size (2 · r + 1) × (2 · r + 1).
In the OpenCL implementation, the guided image filter algorithm is split into two kernels,
as shown in Figure 6.3. The first kernel calculates steps 1, 2 and 3 in Algorithm 6.1, while
the second kernel calculates steps 4 and 5. Kernel 1 starts by reading the filtering image I
and the guidance image G from global memory. It then slides over I and G and calculates
meanI , meanG , corrIG and corrG . Based on these four arrays, it then calculates the values
in step 2 and step 3.
Figure 6.2: Conceptual example of sliding window filtering [23].
The results of these first three steps, arrays a and b, are sent directly
to the second OpenCL kernel, together with the guidance image, which is required in step
5 of the algorithm. Kernel 2 slides fmean,r over a and b, which results in arrays meana and
meanb . Based on meana and meanb , the filtering output O is calculated and written back
to global memory.
The main motivation to split the calculations into two kernels is ease of programming.
Kernel 1 contains the sliding window implementation of the four fmean,r functions in step
1 of the algorithm, while kernel 2 contains the sliding window implementation of the two
fmean,r functions in step 4. Separating these two algorithmic steps decreases the
implementation complexity. Because OpenCL makes it possible to send data directly
between kernels, no extra bandwidth to global memory is required for this configuration.
In the following subsections, the specific OpenCL implementation details will be discussed.
The implemented guided image filter algorithm operates on data that is read from global
memory. This data contains either different color channels or 3D-coordinates. The different
dimensions of a specific pixel, however, are stored together. This is displayed in Figure 6.4.
Figure 6.3: Sliding window implementation. Kernel 1 computes the first three steps in the
guided image filter algorithm, while kernel 2 computes the last two steps.
This figure shows a 1D array containing image data, as data is stored sequentially row-wise
in memory. The three color channels, for instance Red (R), Green (G) and Blue (B),
of pixel 0 are stored before continuing to the three color channels of pixel 1.
R0 G0 B0 R1 G1 B1 R2 G2 B2 . . .
Figure 6.4: Interleaved storage of the color channels in memory.
This data structure, however, poses problems for the guided image filter implementation.
Because the guided image filter operates on each color channel individually, the best possible
situation would be that pixels are grouped by color channel, i.e. first all R-values,
then all G-values, then all B-values. In this way, memory could be read entirely
sequentially, without having to discard data that was read.
In order to cope with the interleaved storage of color channels, all computations are
triplicated, once for each color channel. In this way, data can be read sequentially
without discarding color channels. This leads to a SIMD implementation in which the
three color channels are processed simultaneously.
This approach, however, also requires adaptations to the data type used. The Intel FPGA
SDK for OpenCL does not support native vectorized data types containing three elements.
Therefore, a custom struct was defined that contains the three channels. The
filtering input and output images consist of real-number values, while the guidance image
consists of color values between 0 and 255. Therefore, two different structs were defined, as
displayed in Listing 6.1. The struct filter_image consists of three float variables,
while the guide_image struct consists of three unsigned char values. Besides these two
input and output image types, a third struct, called struct_a_b, was defined that contains
the values of arrays a and b. For each color channel, the vectorized OpenCL data type
int2 contains a value of array a and a value of array b. The packed attribute indicates
that subsequent struct elements are stored in memory without padding.
OpenCL Channels
OpenCL channels are used to directly send data between different OpenCL kernels without
passing through global memory. By using OpenCL channels, the number of accesses to
global memory can be reduced, hereby saving CPU-FPGA bandwidth.
__attribute__((packed))
struct filter_image
{
    float x;
    float y;
    float z;
};

__attribute__((packed))
struct guide_image
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
};

__attribute__((packed))
struct struct_a_b
{
    int2 x;
    int2 y;
    int2 z;
};
Listing 6.1: Structs that define the input and output image, the guidance image and the
array containing values a and b.
An OpenCL channel behaves as a FIFO, with a producer and a consumer. The producer
places data in the FIFO, while the consumer pulls the data from the FIFO. Moreover, an
OpenCL channel has a limited capacity, and can therefore be saturated. If this occurs,
the producer will stall until data can again be placed in the channel. In the OpenCL
implementation of the guided image filter algorithm, arrays a and b are directly sent from
kernel 1 to kernel 2 through an OpenCL channel, by making use of the earlier defined struct
in Listing 6.1. Moreover, as kernel 2 also needs the guidance image for its computations,
kernel 1 sends this array to kernel 2 through an OpenCL channel as well. This is indicated
in Figure 6.3.
OpenCL channels are implemented as shown in Listing 6.2. In line 2, the use of channels
is activated by enabling the pragma OPENCL EXTENSION cl_altera_channels. In lines 3
and 4, the channels are declared. Channel ch_0, which uses the earlier defined struct
struct_a_b, will be used to transfer a and b, while channel ch_1 transfers the guidance image
by means of the struct guide_image. Both channels are accompanied by the depth attribute.
As explained earlier, an OpenCL channel has a limited capacity. In order to prevent the
producer, i.e. kernel 1, from blocking, a sufficiently high capacity of 128 struct elements was
chosen. In line 6, kernel 1 is declared. A pointer to the filtering input and the guide input
is provided such that they can be read from global memory. After applying the required
computations, the data, contained in the variables output_buf and guide_buf_out, is sent
through the channels ch_0 and ch_1 to kernel 2. Kernel 2, declared on line 16, reads data
from the channels, performs its computations, and then writes its output to the memory
location defined by its input parameter output.
Shift Register
In the sliding window implementation, one filtering output value is calculated every clock
cycle. Therefore, exactly one filtering input value and one guide input value are loaded
every clock cycle. Since the loaded values are required for multiple window operations,
they remain in local memory. Previously loaded input and guide values that are no
longer required by the computation leave local memory at a rate of one value per clock
cycle. Because one value enters local memory while another one leaves it every clock
cycle, the local memory can be organized as a shift register.
A shift register is an array of registers in which the elements are shifted every clock cycle.
The use of a shift register in the guided image filter implementation can be clarified by
1  ...
2  #pragma OPENCL EXTENSION cl_altera_channels : enable
3  channel struct struct_a_b ch_0 __attribute__((depth(128)));
4  channel struct guide_image ch_1 __attribute__((depth(128)));
5  ...
6  __kernel
7  void kernel_1(__global struct filter_image *restrict input,
8                __global struct guide_image *restrict guide)
9  {
10     ...
11     write_channel_altera(ch_0, output_buf);
12     write_channel_altera(ch_1, guide_buf_out);
13 }
14
15 __kernel
16 void kernel_2(__global struct filter_image *restrict output)
17 {
18     struct struct_a_b input_buf = read_channel_altera(ch_0);
19     struct guide_image guide_buf = read_channel_altera(ch_1);
20     ...
21 }
Listing 6.2: OpenCL channel implementation in the guided image filter algorithm.
making use of Figure 6.5. In this figure, the parameter r is equal to one, and therefore a
window of size (2 · 1 + 1) × (2 · 1 + 1) = 3 × 3 pixels is obtained. The blue pixels indicate the
values that are required for the computation of fmean,r in the current clock cycle; these
pixels are therefore present in local memory. The green pixels are also present in local
memory: their values are not needed for the current computation, but will be at a later
stage. The white pixels are not in local memory. The red pixel in Figure 6.5a is not yet
present in local memory, but will be loaded into local memory in the next clock cycle.
As all green and blue pixels are located in the shift register in local memory, their
position in the shift register is indicated by the index in the pixels. A high index implies
that the pixel has resided in local memory for a longer time, while a low index indicates
that the pixel was loaded recently. After calculating the mean function fmean,r ,
which results in the mean of all blue pixels, the output value is obtained and the window
slides one pixel to the right. By shifting the window, all values in the shift register are
shifted one place. Since the blue pixel with index 18 in Figure 6.5a is no longer required
for any future computation, this pixel leaves local memory. The red pixel in Figure
6.5a, on the other hand, is necessary for the fmean,r computation in the next clock cycle,
and is therefore loaded into local memory. All pixels in local memory shift one position in
the shift register, and the red pixel shifts in. The new situation, one clock cycle after the
situation in Figure 6.5a, is displayed in Figure 6.5b.
Using a shift register results in a highly efficient hardware implementation. As can be seen
in Figure 6.5, the pixels that are required for the computation of the function fmean,r , i.e.
the blue pixels, are always located in registers 0, 1, 2, 8, 9, 10, 16, 17 and 18. As these
indices are read only once every clock cycle, no memory stalls will occur, and therefore no
duplication of local memory is necessary. Moreover, other positions in the shift register
will never be read for computations. Therefore, these registers do not require an extra read
port. This leads to a low resource usage.
A shift register is created when the OpenCL compiler detects a memory transfer pattern
in local memory that fits a shift register architecture. The shift register transfer pattern
in the OpenCL implementation of the guided image filter algorithm is displayed in Listing
6.3. In this code fragment, the data in variable shift register is shifted by one position by
looping over all indices. By spatially unrolling this loop using the pragma unroll, all shifts
are executed in the same clock cycle, and a shift register is obtained.
(a) Shift register state at clock cycle n. In the next clock cycle, the red pixel will be loaded into
the shift register in local memory.
(b) Shift register state at clock cycle n+1. All values are shifted one place in the shift register
compared to the previous clock cycle.
Figure 6.5: Principle of the sliding window. The pixels in green and blue are stored row-wise
in a shift register. The pixels in blue are used to calculate the window function
fmean,r .
...
#pragma unroll
for (int i = shift_register_size - 1; i > 0; --i) {
    shift_register[i] = shift_register[i - 1];
}
...
Listing 6.3: Shift register implementation in OpenCL.
...
int mean = 0;
#pragma unroll
for (int i = 0; i < 2 * R + 1; i++) {
    #pragma unroll
    for (int j = 0; j < 2 * R + 1; j++) {
        int value = shift_register[i * IMAGE_WIDTH + j];
        mean += value;
    }
}
mean = mean / ((2 * R + 1) * (2 * R + 1));
...
Listing 6.4: Implemented window operation in the sliding window implementation.
Window Operation
In the sliding window approach, an output value can be calculated every clock cycle. This
is achieved by a fully spatial implementation of the window operation.
As the window operation is computed by iterating over two dimensions, which results in
two for-loops, both loops must be fully spatially unrolled. This is shown in Listing 6.4.
Both for-loops, which each iterate over an index range of size 2 · r + 1, are fully unrolled
using the pragma unroll. fmean,r is then calculated by adding all values in the window and
dividing the sum by the size of the window.
Spatially unrolling the window operation, however, has the downside that the radius is
limited by the available FPGA resources. A higher radius leads to larger for-loops and
therefore a higher resource usage.
Fixed-Point Calculations
As can be seen in Listing 6.1, the filtering input consists of floating point variables.
Using floating point variables, however, results in a higher resource usage: not only does
the number of DSPs increase, the overall logic usage increases as well. Therefore, the
implemented kernel makes use of fixed point calculations.
Resource Usage To illustrate the effect of implementing fixed point instead of floating
point calculations, the resource usage of two kernels is listed in Table 6.1. Both kernels
implement the sliding window implementation of the guided image filter, and use a radius
equal to 3, which results in a window size of 7 × 7 pixels. The first kernel uses fixed
point calculations while the second kernel uses floating point calculations. As can be seen,
both the logic usage and the DSP usage are higher when using floating point operations.
The maximum radius for the floating point kernel is equal to 3; for higher radii, the
FPGA bitstream no longer fits on the Arria 10 FPGA. The fixed point kernel,
on the other hand, can go up to a radius of 6, which corresponds with a window of
13 × 13 pixels. Furthermore, since both the logic and DSP usage of the floating point
kernel approach 100 %, this limits the ability to further optimize the OpenCL kernel
by, for instance, applying a higher SIMD factor. Therefore, the implemented kernel uses
fixed point calculations.
Kernel                  Logic   RAM    DSP
Fixed Point Kernel      63 %    41 %   35 %
Floating Point Kernel   91 %    42 %   87 %
Table 6.1: Resource usage for OpenCL kernels implementing the sliding window implementation using fixed point and floating point computations.
When using fixed point numbers, the required fractional precision is balanced against the value range. The higher the number of fractional bits, the higher the precision, but the smaller the range of representable values, and vice versa. In order to obtain a fractional precision of, for instance, one decimal, at least 4 fractional bits are required, as 1/2^4 = 0.0625 ≈ 0.06 < 0.1. By assessing the required fractional precision, the appropriate number of fractional bits can be determined.
Conversion The conversion of floating point to fixed point values and vice versa is displayed in Listing 6.5. Floating point to fixed point conversion is done by shifting the floating point number to the left by the specified number of fractional bits, determined by the variable fractional_bits. The resulting value is then cast to an int. Therefore, only the integer part, which now contains a portion of the fractional component, is retained. Converting a fixed point variable into a floating point variable, on the other hand, is done by first casting the fixed point variable to a float data type, and then shifting this variable to the right by fractional_bits bits.
As indicated in Chapter 5, the shared memory architecture of the HARP platform can make use of Shared Virtual Memory (SVM). With SVM, both the CPU and the FPGA have full access to a specially allocated memory region. Therefore, only one instance of the data is present in memory, and no explicit data copies to a separate, FPGA-only memory region are necessary.
In the OpenCL implementation of the guided image filter, a series of images was filtered. In order to process the different images, only two steps are required when using SVM. In the first step, the input parameters, i.e. pointers to the SVM region, are passed to the OpenCL kernels. In the second step, the OpenCL kernels are invoked. Since it is no longer necessary to copy the input and output data to and from memory buffers, the use of SVM dramatically reduces the number of operations that must be executed to pass data to the OpenCL kernel.
6.3.2 Results
The implemented kernel was synthesized to filter Full HD images, which have a resolution of 1080 × 1920 pixels, with a radius equal to 3. This led to an FPGA bitstream with a frequency of 237.5 MHz. The bitstream was then executed on the HARP platform by filtering 100 images. Two time recordings were registered. The first recording measured the execution time of the OpenCL kernel itself, without accounting for passing the input parameters. The second recording measured the total execution time at the host side; this includes the FPGA execution, but also passing the input parameters and invoking the OpenCL kernel. Filtering the 100 images on the HARP platform led to the average execution times shown in Table 6.2.
Table 6.2: Execution times when executing the sliding window implementation of the
guided filter on the HARP platform.
Profiling Results
Since the kernel is synthesized at a frequency of 237.5 MHz, a clock cycle time of 4.21 ns is obtained. As noted in the compilation reports, the kernel is fully pipelined and has an initiation interval equal to one. Therefore, an output pixel can be calculated every clock cycle. This leads to a theoretical execution time of approximately the total number of pixels multiplied by the cycle time, which equals 8.78 ms. Comparing this value to the obtained execution time of 18.20 ms shows that the obtained execution time is considerably higher. In order to investigate this difference, the profiling results, displayed in Table 6.3, are examined.
Table 6.3: Profiling results when executing the sliding window implementation of the guided image filter algorithm.

Low Occupancy The occupancy is defined as the fraction of the kernel's execution time during which a memory operation is issued. While the optimal value is an occupancy of 100 %, only 48.6 % is achieved in this kernel. The load and read statements are therefore only issued in 48.6 % of the clock cycles, which means that a read takes around 2 clock cycles, and the bandwidth is correspondingly lower. The theoretically requested bandwidth is 12 B/4.21 ns = 2850 MB/s. Because of the occupancy of 48.6 %, however, only a bandwidth of 2850 MB/s · 48.6 % ≈ 1385 MB/s is obtained, which closely matches the measured bandwidth of 1371.6 MB/s.

This low bandwidth is caused by the defined structs. The filtering input, the guidance input and the filtering output all consist of a data type whose size is a multiple of three. The filtering input, for instance, has a size of 12 B and is aligned at 4 B. Therefore, three separate read requests need to be issued, which dramatically reduces the bandwidth and explains the low occupancy.
High Bandwidth Efficiency The efficiency of all I/O operations equals 100 %, which is the best possible value. This is because no read values are discarded in the implementation: as explained earlier, all three color channels are used. The use of structs, which decreases the occupancy of the memory operations, therefore results in an increase of the bandwidth efficiency, as no read values go unused.
OpenCL Cache
As can be seen in Listing 6.2, the input parameters are not marked by the keyword volatile.
Therefore, the OpenCL compiler adds a software implemented cache. As has been found
in Chapter 5, this cache can achieve high bandwidths. The performance of this cache can
be analyzed by making use of the profiler, in which the cache hit rate is displayed. These
hit rates are displayed in Table 6.4.
Table 6.4: Cache hit rates, derived from the OpenCL profiler, when executing the sliding window implementation of the guided image filter algorithm.

In this table, it can be seen that the cache hit rate for the filtering input equals 81.5 %. This can be explained by considering the size of the struct that is used, which equals 12 B, and the size of the OpenCL cache lines, which is 64 B. Data is read from memory as 64 B cache lines; if one struct element is read, the remaining 64 B − 12 B = 52 B resides in cache memory. This leads to a cache hit ratio of 52 B/64 B = 81.25 % ≈ 81.5 %. Moreover, as a separate cache is created for the filtering input and the guidance input, the two cache streams do not interfere with each other. The same reasoning applies to the cache hit ratio of the guidance image input.
SVM
In order to evaluate the benefit of using SVM, the kernel was also implemented using memory buffers. This adds a write of the filtering and guidance inputs to memory buffers, and a read of the filtering output from the memory buffer. The average execution time for filtering one image, for both the implementation with SVM and the one with memory buffers, is listed in Table 6.5. As can be seen in this table, the use of memory buffers adds approximately 10 ms to the total execution time. The frame rate decreases from 45 frames/s to 30 frames/s when using memory buffers.
Table 6.5: Execution times of the guided filter on the HARP platform with SVM and with memory buffers.
Performance
In order to evaluate the achieved performance, the roofline model, explained in Chapter 2, can be used. This model relates the operational intensity to the achieved performance. In the guided image filter, however, the operational intensity depends on the value of the radius. A larger radius results in more operations being performed for an equal amount of data that is read, and therefore in a higher operational intensity.
As discussed earlier, the maximum radius for a floating point implementation of the guided image filter algorithm is 3; for higher values, the synthesized hardware no longer fits in the FPGA. Therefore, a fixed point implementation was used, which can go up to a radius of 6. Both the floating point and fixed point implementations of the guided image filter were executed for the possible values of the radius. This results in the achieved performances in Figure 6.6. In this figure, the performance of the kernels for the different operational intensities is shown. The fixed point kernel performance is indicated by the blue dots, while the floating point kernel performance measurements are shown in red. Moreover, the roofline model of the HARP platform is shown as well. As peak memory bandwidth, the bandwidth of 16 GB/s obtained in Chapter 5 is used, while the performance limit is based on the technical data of the HARP platform. Note that for the fixed point kernel, the performance should be expressed in OPerations per Second (OPS) instead of FLOPS.
As can be seen in Figure 6.6, the implementation clearly benefits from using fixed point. While the maximum achievable performance of the floating point kernel is 135 GFLOPS, the fixed point kernel achieves a top performance of 430 GOPS. For radii that both the floating point and the fixed point kernel can implement, on the other hand, no significant difference in performance can be observed. In that case, using the floating point implementation is advantageous, as it avoids the specific issues of the fixed point implementation, such as under- and overflow.
Furthermore, it can be seen in Figure 6.6 that a higher performance is obtained for larger operational intensities. The larger computational demands therefore do not hamper the performance, and the kernel is only limited by the achieved bandwidth. It can, however, be seen that the maximum performance of the floating point kernel, 135 GFLOPS, is well below the compute limit of 1366 GFLOPS. The cause of this large difference is unknown; the implementation should be analyzed further to explain it.
Further Optimizations
When the fixed point kernel was developed for a radius of 3, there were still resources left.
In order to increase performance, the kernel was implemented using a higher SIMD factor.
Figure 6.6: Roofline model of the HARP platform. The vertical axis shows performance in GFLOPS (logarithmic scale, 32 to 1024). The red dots indicate the performance of the floating point kernel for different values of the radius, while the blue dots indicate the performance of the fixed point implementation.
Up until this point, one pixel, consisting of three different color channels, was processed
in parallel. It is, however, possible to process two or four pixels simultaneously, resulting
in either six or twelve parallel calculations. The execution times for these different kernels
are listed in Table 6.6.
Table 6.6: Kernel execution times when processing a different number of pixels in parallel.
As can be seen in Table 6.6, the maximum performance is obtained when processing 2 pixels simultaneously, for which a minimum kernel execution time of 9.45 ms is obtained. However, including the overhead of setting the input parameters and invoking the kernel, a total execution time of 13.45 ms is achieved. This results in a maximum frame rate of 74 frames/s.
6.4 Conclusion
As a case study, we implemented the guided image filter in OpenCL. By making use of the insights gained in Chapter 5, the performance was boosted as much as possible. Key points in achieving a high performance are the use of SVM and the compiler-implemented OpenCL cache. A maximum floating point performance of 135 GFLOPS and a maximum fixed point performance of 430 GOPS were obtained, and a maximum frame rate of 74 frames/s can be achieved.
Chapter 7
Conclusion and future work
The goal of this master’s dissertation was to evaluate OpenCL as an HLS language for the HARP platform. Based on the findings in this thesis, it can be stated that the HARP platform offers developers several benefits to increase the performance of OpenCL applications. The main advantages are the shared memory and the high bandwidth. HARP uses DRAM memory shared between the CPU and the FPGA. As this shared memory approach is supported by shared virtual memory in OpenCL, it can be fully exploited. Furthermore, a high CPU-FPGA bandwidth is available to OpenCL developers.
Besides the HARP platform itself, the OpenCL compiler offers performance-enhancing features. A highly effective one is the OpenCL cache: it provides data at a high rate and reduces the number of memory requests. The OpenCL profiler is useful to evaluate the OpenCL kernel, as performance characteristics are visualized in an uncomplicated way, which allows an easy performance analysis.
Several aspects of the HARP platform are useful for an OpenCL developer. Other characteristics, however, are difficult to exploit. It is, for instance, difficult for a developer to directly address the soft IP FPGA cache. This is caused by the automatic channel selection, in which cache locality is not considered. Moreover, as has been stated in other studies, implementing a larger cache would be beneficial, as the current cache has a size of only 64 KiB.
However, as HLS goes together well with the HARP platform, this is a promising approach for many emerging compute-intensive applications, such as machine learning. Performing research in these areas could further boost the use of platforms such as HARP in these domains.