
Impact of Run-Time Reconfiguration on Design and Speed - A Case Study Based on a Grid of Run-Time Reconfigurable Modules inside a FPGA

Jochen Strunk∗, Toni Volkmer∗, Klaus Stephan∗, Wolfgang Rehm∗ and Heiko Schick‡

Chemnitz University of Technology
Computer Architecture Group
Email: {sjoc,tovo,stekl,rehm}@cs.tu-chemnitz.de

IBM Deutschland Research & Development GmbH
Email: schickhj@de.ibm.com

Abstract

This paper examines the feasibility of utilizing a grid of run-time reconfigurable (RTR) modules on a dynamically and partially reconfigurable (DPR) FPGA. The aim is to create a homogeneous array of RTR regions on an FPGA, which can be reconfigured on demand during run-time. We study its setup, implementation and performance in comparison with its static counterpart. Such a grid of partially reconfigurable regions (PRRs) on an FPGA could be used as an accelerator for computers to offload compute kernels or as an enhancement of functionality in the embedded market, which uses FPGAs. An in-depth look at the methodology of creating run-time reconfigurable modules and at the associated tools is given. Because the tools lack support for handling hundreds of dynamically reconfigurable regions, a framework is presented which supports the user in the creation process of the design. A case study which uses state-of-the-art Xilinx Virtex-5 FPGAs compares the run-time reconfigurable implementation and achievable clock speeds of a grid with up to 47 reconfigurable module regions with its static counterpart. For this examination a high-performance module is used, which finds patterns in a bit stream (pattern matcher). This module is replicated for each partially reconfigurable region. In particular, design considerations for the controller, which manages the modules, are introduced. Beyond this, the paper also addresses further challenges of the implementation of such an RTR grid and limitations of the reconfigurability of Xilinx FPGAs.

1. Motivation

Dynamically and partially reconfigurable (DPR) FPGAs offer the possibility to reconfigure distinct regions of the FPGA during run-time while maintaining the configuration and functionality of the remaining parts. This feature enables customers to "update the design" without reconfiguring the full device, which gives more flexibility and leads to faster reconfiguration times.

With DPR-capable FPGAs an array of run-time reconfigurable (RTR) modules within a single FPGA is conceivable. These processing units could be configured on demand, e.g. with special purpose processing engines or compute kernels to offload tasks. Due to the highly parallel fabric of FPGAs, massive parallel processing power is imaginable.

With the implementation of such a run-time reconfigurable grid various questions arise. One important question is whether the design, implementation and management are possible with currently available FPGA software tools. Another issue is to what extent the design and the performance are influenced, perhaps restricted, by the physical DPR capability of the FPGAs and by the software. What are the current design limitations, and how do the number, the placement and the density of RTR modules affect the clock speed? How does the RTR design perform in comparison with its static counterpart?

These questions are addressed in this paper.

2. Introduction

FPGAs have gained access to various fields in research and industry. Applications using FPGAs take advantage of the possibility of creating their own processing units, utilizing the highly parallel nature of FPGAs. Woods et al. [1] gained a speedup of more than 50 compared with a CPU when accelerating a Quasi-Monte Carlo simulation. Zang et al. [2] reached a 25 times speedup on another Monte Carlo simulation. With the emergence of partially reconfigurable FPGAs a new degree of flexibility is added. Parts (regions) of the silicon device can be reconfigured during run-time, e.g. with special purpose processing units, without interfering with the remaining units.
Only a few manufacturers provide the feature of run-time reconfigurability. Xilinx is one of them and has implemented this feature starting with its first Virtex [3] series. To date, each generation of the Virtex series offers dynamic partial reconfiguration, i.e. Virtex, Virtex-E, Virtex-II (Pro), Virtex-4 and Virtex-5 FPGAs. In this paper we focus on the state-of-the-art Xilinx Virtex-5 FPGAs.

The rest of the paper is organized as follows: Chapter 3 focuses on related work. In chapter 4 we describe the principles of a run-time reconfigurable design flow. The mapping of a grid of RTR modules on the FPGA is described in chapter 5. A framework needed to handle a large number of reconfigurable regions and their modules is explained in chapter 6. A case study is conducted on an 8x6 grid in chapter 7, and the impact of different implementations and placements of modules is discussed in chapter 8. Chapter 9 concludes the paper.

3. Related Work

Different approaches have been proposed to make use of the DPR capability of FPGAs. Hagemeyer et al. [4] presented a tool-flow, built upon the Xilinx design flow, which generates a homogeneous communication infrastructure for DPR-capable FPGAs. Another tool, named Recobus-builder, which also uses static buses between the modules but does not apply Xilinx's PR flow, was proposed by Koch et al. [5]. Only Virtex-II and Spartan-3 FPGAs are supported so far. Besides static buses between the modules, switch architectures with routers have been examined [6] [7]. Most of these tools, e.g. Recobus-builder, claim that partially reconfigurable regions can be made very tiny, e.g. only a few slices in length. This might be possible but is not applicable to most RTR modules, which need hard block primitives like block RAMs or DSP slices. These blocks are not scattered homogeneously over the FPGA. Instead, they are clustered in the same columns of the FPGA. Besides this, most frameworks were built for research only and are not publicly available. For our examination we decided to build our framework on top of the Xilinx design flow to get an adaptable platform which uses single slice macros for state-of-the-art Xilinx Virtex-5 FPGAs. In contrast to the bus and router topologies mentioned above, cf. [4] [5] [6] [7], inter-module communication is not utilized in this paper. Instead a controller distributes the data and collects the results from the modules using point-to-point connections. The reason is that inter-module communication requires more complex logic, which in turn could lead to longer critical paths and would make it difficult to compare a static with a dynamic controller/host driven design. For dealing with host coupled accelerators, including FPGAs, we have designed a common interface, built on top of a virtual file system (ACCFS) [8].

4. RTR Design Flow

In this chapter the principal design flow of "Module based Partial Reconfiguration" [9] for Xilinx FPGAs is introduced, and its challenges and limitations are shown. For an RTR design the HDL sources must be divided into the part which is constantly available (static) during run-time inside the FPGA and the modules which are exchangeable. All communication between the static and the dynamic parts has to travel through hard macros, also called bus macros, which are typically 2 to 8 bits wide each and must be assigned to a fixed location inside the FPGA. The HDL source of the static part also instantiates the clocks and the hard macros, which are also needed by the RTR modules. For each partially reconfigurable region a module is instantiated as a black box, i.e. without supplying the implementation. The interface (entity) of such a black box module is fixed and cannot be changed during run-time. A common interface has to be found if distinct modules are to be loaded into the same PR region. For each PR region the position and dimension must be specified using an "AREA GROUP" constraint. The location of every hard macro must also be defined by a "LOC" constraint. If many PR regions are used and a wide bus between the static and the dynamic part is to be implemented, this can be time-consuming work. Hence we have developed a framework which is capable of handling this, see chapter 6. In contrast, defining area groups and locations of hard macros is not required for a static implementation.

A special patch, provided by Xilinx, has to be applied to the standard synthesis tools to run an RTR design flow and to generate the partial bit stream files.

5. RTR Grid

For a run-time reconfigurable design it is necessary to define partially reconfigurable regions (PRRs). This should be done in such a way that the resources within the regions satisfy the demands of the partially reconfigurable modules (PRMs). To conduct our examination a homogeneous array of PRRs is preferable. The PRRs should contain the same amounts of resources, i.e. LUTs, block RAMs and DSP slices, whenever possible. Figure 1 is a schematic view of the Virtex-5 XC5VLX155 FPGA used in our case study. An 8x6 two-dimensional array (cf. Figure 2) is mapped on the physical resources of the FPGA in such a way that each PRR contains the same amount of LUTs and block RAMs. A finer grained mesh of up to hundreds of PRRs would be feasible if the amount of block RAMs could be neglected. DSP slices are not utilized, because they are only available in two columns of the Virtex-5 XC5VLX155 FPGA, i.e. they are not homogeneously distributed over the FPGA. The partitioning follows the rules which originate from the granularity of a configuration frame (cf. chapter 7). One region (x3/y5), which is located near the clock buffers in the middle of the FPGA, will be used later for a static controller and not as a PRR. To take advantage of synchronous design (without asynchronous FIFOs) and to investigate the influence of the placement of the RTR modules, all modules are driven by the same clock source. The clock is propagated over the total
Figure 1. Schematic view of Xilinx XC5VLX155 FPGA (labeled elements: IO, DCM, BUFG, DSP, BRAM, clock region)

Figure 3. Design hierarchy (directory tree; labels in order of appearance: common, vhdl_templates, top_vhdl, controller_vhdl, script_templates, matcher_edn, ucf_template, PRR_definition, PRM_definition_1 ... 47, macros, modules_1, modules_2, ..., modules_47, generated_sources, top_vhdl, controller_vhdl, ngc, top, controller, ncd, top, controller, modules, mod1, ..., mod47)

FPGA. All modules should operate at the same clock speed.

Figure 2. Example RTR grid (rows 1 to 8, columns 1 to 6)

6. Framework for RTR Grid

Originally, Xilinx PlanAhead 10.1 was going to be used for the RTR design flow. During the first test runs, two major issues emerged. The first one is that bus macros wider than 4 bits are not supported for Virtex-5 FPGAs. Secondly, the place and route process (implementation step) must be done for each RTR module within the grid, because the location of the block RAM and the bus macro can be different (left, middle, right) within the PR regions. For modules placed in similar PR regions, there is no "copy to PR region" tool available so far that could move the routed design relatively. Since each module in this case study utilizes 48 signal lines and 47 PRMs are placed at the maximum, 564 bus macros (47 x 48 = 2256 signal lines at 4 bits per macro) would have to be placed manually in PlanAhead. Consequently, a framework was constructed as follows:

First, template files for the top module and controller were added to the "common/vhdl_templates" directory (cf. Figure 3). These templates contain a VHDL constant for the number of PRMs and heavily use VHDL generate statements to instantiate PRMs and bus macros. Since the Xilinx tools do not support multiple PRMs having the same component name in the partial design flow, an EDIF file has been created for the matcher, in which the component name of each matcher can be replaced by a simple text search and replace action.

Then, a "ucf" template was created that contains the clock speed constraint of 350 MHz and additionally the area constraints of the controller and bus macros. Afterwards, a PRR definition file, which assigns a symbolic name to each grid element, and PRM definition files, which map the modules to the grid elements using the symbolic names, were created.

The framework script works as follows: First, the top and controller VHDL files are generated from the templates for each grid configuration and copied into each "generated_sources" subdirectory (cf. Figure 3), together with the synthesis scripts, implementation scripts, ucf files and matcher EDIFs. Then, the synthesis is performed and the results are moved to the ngc subdirectory.

Afterwards, a ucf file based on the ucf template and the PRR and PRM definitions is generated for each module configuration and copied to the "ncd" subdirectories. Next, the implementation scripts for the static and dynamic parts are started, performing the map, place and route processes.
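The generation steps of the framework can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' script: the instance names (`matcher_<i>`, `bm_<i>_<k>`), the slice coordinates and all function names are hypothetical. Only the counts (48 signal lines per module, 4-bit bus macros, up to 47 PRMs) are taken from the text. The emitted lines follow the general shape of Xilinx UCF `AREA_GROUP` and `LOC` constraints; which further constraints the PR flow needs is left open here.

```python
from math import ceil

SIGNALS_PER_MODULE = 48   # signal lines per matcher (from the case study)
BUS_MACRO_WIDTH = 4       # widest bus macro usable on Virtex-5 (per the text)
MAX_PRMS = 47             # maximum number of PRMs in the grid


def bus_macros_needed(num_modules, signals=SIGNALS_PER_MODULE,
                      width=BUS_MACRO_WIDTH):
    """Number of bus macros that must be placed for num_modules PRMs."""
    return num_modules * ceil(signals / width)


def rename_component(edif_text, old_name, new_name):
    """Give a matcher copy a unique component name by plain text replacement."""
    return edif_text.replace(old_name, new_name)


def ucf_for_module(index, slice_range, macro_sites):
    """Return UCF constraint lines for one PRM (hypothetical names and sites)."""
    inst = f"matcher_{index}"
    group = f"AG_{inst}"
    lines = [
        f'INST "{inst}" AREA_GROUP = "{group}";',
        f'AREA_GROUP "{group}" RANGE = {slice_range};',
    ]
    # One LOC constraint per bus macro of this module.
    for k, site in enumerate(macro_sites):
        lines.append(f'INST "bm_{index}_{k}" LOC = {site};')
    return lines


if __name__ == "__main__":
    print(bus_macros_needed(MAX_PRMS))  # 564, as stated in the text
    for line in ucf_for_module(1, "SLICE_X0Y0:SLICE_X19Y19", ["SLICE_X20Y0"]):
        print(line)
```

Generating the constraints from a per-region definition keeps the placement data in one place, which is the point of the PRR and PRM definition files described above.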
7. Case Study Pattern Matcher Grid

Figure 4. Module placement: circular distribution (panels for 1, 2, 4, 8, 16, 32, 46 and 47 modules on the 8x6 grid; C marks the controller region)

Figure 5. Module placement: bottom-up distribution (panels for 1, 2, 4, 8, 16, 24, 32 and 47 modules on the 8x6 grid; C marks the controller region)

7.1. Overview and Goals

To study the behavior of modules placed within an RTR grid on a single FPGA, a homogeneous array of reconfigurable regions should be applied. To assess the difference in the achievable clock speed, the RTR based design is compared with two statically built implementations. The first static version, named "static free", does not use any placement constraints. The synthesis tools can optimize, place and route the modules wherever they want in order to achieve maximal overall design speed. The second one is the statically built variant of the region based RTR design. This implies that the modules are placed at the same sites as in the RTR grid. To investigate the deviation in clock speed among modules placed at different locations within the grid, the same module is replicated many times. As RTR module we use a "matcher", which finds patterns within a bit stream. Such pattern matchers are used, for example, in the Human Genome Project to find sequences. The matcher can be driven at high clock rates because it uses many pipeline stages in the design. This is important to ensure that the module itself does not decrease the overall clock speed. To get applicable results we use one block RAM for each module and the controller. This is a more realistic approach than just using LUTs for the implementation of the modules. As we will see in subchapter 7.5, this restricts the placement of the modules. In this test case all modules are driven by a single clock source (synchronous design) and managed by a single controller. We have implemented two distinct versions of the controller to assess the influence on the design. The RTR grid is implemented on an XC5VLX155, one of the largest Virtex-5 FPGAs available in the Xilinx PR 9.2 tools.

7.2. RTR Module Pattern Matcher

The pattern matcher represents the reconfigurable part of the design. It compares a four-byte string pattern against an input stream. While the string pattern can be configured by data input signals, the input stream is generated from a search database which resides in a Virtex-5 block RAM for each module.

The input stream is implemented as a 2 x 4-byte sliding window over the search database. The string matching is performed by simultaneously comparing the string pattern with the first four bytes of the window and with the four-byte groups shifted by 1, 2 and 3 bytes. Therefore, the sliding window can be shifted by four bytes each clock cycle. This means that four 32-bit comparisons, i.e. 16 one-byte comparisons, are realized each clock cycle.

On the input signal "start", the matcher begins the string comparisons. After the execution has finished, a status signal indicates that the result is available on the data output.

Synthesized as a reconfigurable module in the RTR design flow, a single pattern matcher utilizes about 240 LUTs and 1 block RAM. Due to the highly pipelined design of the matcher, a maximum clock speed of approximately 350 MHz is achievable on a Xilinx Virtex-5 XC5VLX155 FPGA.

7.3. Pattern Matcher Controller

The task of the controller is to provide the matchers with string patterns, to start the matching sequence and to gather the results using control and data lines. Each matcher is checked periodically by the controller. After a matcher has finished its work, its results are collected. Thereafter, a new string pattern is loaded into this matcher and the signal "start" is asserted.

Two distinct implementations of the controller exist, which differ in the maximal achievable clock speed. The first variant, named "controller", is based on a finite state machine (FSM) with 12 states. Due to the synchronous behavior of the utilized hard macros (bus macros) two wait states are added, one for each direction. The second variant, named "controller-2", contains one additional buffer for each control and data line as well as two more wait states.
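The interplay of matcher (section 7.2) and controller (section 7.3) can be illustrated with a small behavioural model in Python. It is a sketch under stated assumptions, not the authors' VHDL: the result format (a list of byte offsets) is assumed, since the paper does not specify what the data output carries; the model matcher finishes instantly, whereas the hardware needs many clock cycles; and all class and function names are hypothetical.

```python
def match_cycle(pattern, window):
    """One clock cycle: compare the 4-byte pattern at offsets 0..3 of the window."""
    return [off for off in range(4) if window[off:off + 4] == pattern]


def run_matcher(pattern, database):
    """Scan the whole search database; return byte offsets of all matches."""
    if len(pattern) != 4:
        raise ValueError("pattern must be exactly four bytes")
    hits = []
    for base in range(0, len(database), 4):   # window advances 4 bytes per cycle
        window = database[base:base + 8]      # 2 x 4-byte sliding window
        hits.extend(base + off for off in match_cycle(pattern, window))
    return hits


class Matcher:
    """Model of one PRM with its private search database (block RAM)."""

    def __init__(self, database):
        self.database = database
        self.done = False      # status signal
        self.result = None     # data output (assumed: list of offsets)

    def start(self, pattern):
        # Simplification: the model completes immediately.
        self.result = run_matcher(pattern, self.database)
        self.done = True


def controller(matchers, patterns):
    """Poll matchers in turn, collect results, hand out the next pattern."""
    pending = list(patterns)
    active = {}                # matcher index -> pattern in flight
    results = []
    while pending or active:
        for i, m in enumerate(matchers):
            if i in active and m.done:
                results.append((i, active.pop(i), m.result))
                m.done = False
            if i not in active and pending:
                active[i] = pending.pop(0)
                m.start(active[i])
    return results


if __name__ == "__main__":
    grid = [Matcher(b"xxabcdyyabcd"), Matcher(b"abcdabcd")]
    for entry in controller(grid, [b"abcd", b"abcd", b"yyab"]):
        print(entry)
```

Because the window covers eight bytes and advances by four, every byte offset of the database is examined exactly once, which mirrors the four parallel 32-bit comparisons per cycle described in section 7.2.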
7.4. Design Validation and that there are no placement constraints (free placement).
In the ”static region” variant the placement is bounded to the
The correctness of the design was validated by post route regions of the grid. For the ”dynamic” version the placement is
simulation, implementation and testing the static and RTR constrained to the PRRs of the RTR grid, it is a RTR module
module based design on a Xilinx PCIe board ML555 with a and this also implies that the communication goes through ”bus
Virtex-5 XC5VLX50T at the different clock speeds. macros”. All in all we have synthesized and built over 700 RTR
modules and more than 80 full static versions. We utilized a
7.5. Module Placement in RTR Grid Virtex-5 XC5VLX155 with speed grade two for our case study.
All ”opt” versions where done with timing driven ”map” option
For defining a RTR grid, i.e. quantum, location and size and high effort for place and route.
of the partially reconfigurable regions, two conditions must be
met. First the primitives, e.g. LUTs, block RAMs, used by the 400
matcher opt static free speed
module ”matcher” must reside within a PR region and secondly matcher opt static region speed
matcher opt dynamic speed
these regions must obey the DPR rules of the FPGA fabric. The 350
smallest portion of Virtex-5 FPGAs, which can be configured,
are represented by the configuration frames [10]. One frame 300
covers for example 20 configurable logic blocks (CLB), 40 IO

speed (MHz)
block (IOB) or 4 block RAMs. These frames represent at the 250
same time the finest granularity of conceivable modules. This
means that modules can be organized on a Virtex-5 FPGA in 200
a two dimensional array in contrast to Virtex-2 (Pro) FPGAs
where only full columns can be reconfigured. The height of a
150
frame corresponds with the vertical dimension of a clock region
of the Virtex-5 FPGAs, c.f. Figure 1. Although a matcher needs
100
only one block RAM the other three within in the same column 12 4 8 16 32 47
of a clock regions are not available for other modules. This #modules

is due to the fact that a single configuration frame contains Figure 6. Static free, static with regions and dynamic
four block RAMs of the same column. The total amount of the circular distribution, all built with controller-2
possible PR regions for the matcher case study is 48, which
results from the number of clock regions (16) multiplied by the
amount of block RAM columns (3) per clock region, c.f. Figure 400
matcher speed
1. The Virtex-5 XC5VLX155 FPGA has an equal number of matcher opt speed
matcher* speed
block RAMs on the right and left side of the FPGA, which 350 matcher* opt speed
makes it relatively easy to create a 8 by 6 two dimensional
array (cf. Figure 2). The partially reconfigurable regions are 300
equal in the amounts of block RAMs (4) and LUTs (1600). In
speed (MHz)

this case study one of these available regions, located in the 250
middle of the device is used for the controller.
To examine the behavior two distinct distributions of the RTR
200
modules are analyzed. The first one is a circular placement
around the controller (cf. Figure 4) and the other is a bottom-
150
up filling (cf. Figure 5) with an increasing number of matcher
modules.
100
12 4 8 16 32 47
8. Results of Case Study #modules

Figure 7. Speed of static design with free placement, built


To assess the impact of using RTR modules in a RTR grid with controller and controller-2
on the maximal achievable clock speed, we have implemented
two versions of controller, ”controller” and ”controller-2”, Figure 6 shows the difference in clock speed between the
used different placements, ”circular” and ”bottom-up”, applied ”static free”, ”static region” and run-time reconfigurable con-
default and optimization (”opt”) implementation strategies for figuration ”dynamic” for different amounts of modules using
place and route (PAR) and built two full static implementations the circular placement and ”controller-2” (with one additional
(”static free” and ”static region”) as counterpart of all variants. pipeline stage). Three main issues can be seen. In the ”static
The version ”static free” means that the implementation is static free” version, in which the placer can decide where to put
the modules, the clock speed stays constant from 16 modules upwards. The frequency of the "static region" and the "dynamic" version drops with increasing module count. Both curves behave alike: both fall from approx. 330 MHz (1 module) to approx. 250 MHz (47 modules). This leads to the conclusion that the maximal achievable clock speed depends only on routing delays, i.e. data path delay and clock delay.

Figure 7 shows that there is no significant difference between the "static free" version (matcher), the "static free" version with higher PAR effort (matcher opt) and the version built with the controller with one additional pipeline stage (matcher*) for the static design with free placement. This changes with the "region" placement (cf. Figure 8). All implementations encounter a decrease in clock speed, but the version with one additional pipeline stage (matcher*) is up to 100 MHz faster than the one without the buffering.

Figure 8. Speed of static design with regional placement, built with controller and controller-2
(Plot: speed (MHz), 100 to 400, over #modules 1, 2, 4, 8, 16, 32, 47. Series: matcher speed, matcher opt speed, matcher* speed, matcher* opt speed.)

Figure 9 reveals that up to 8 matchers the matcher itself (both versions) is the limiting factor. This changes from 16 modules upwards, where the controller limits the overall design speed. Again, controller-2 with the additional buffering performs up to 100 MHz better. At this point it should be emphasized that the design of the controller is crucial for the total speed.

Figure 9. Speed of controller and RTR modules with optimization for circular distribution
(Plot: speed (MHz), 100 to 400, over #modules 1, 2, 4, 8, 16, 32, 47. Series: controller, matcher modules min speed, controller opt, matcher opt modules min speed, controller2, matcher* modules min speed, controller2 opt, matcher* opt modules min speed.)

When we change the placement of the modules so that we start with the location farthest from the controller (bottom-up), we observe an almost constant controller clock speed over all module counts, cf. Figure 10. This means that the bottom-up setup with one RTR module gives an estimate of the clock speed at which the fully occupied grid can be driven.

Figure 10. Speed of static free, static regional, controller-2 and RTR modules for bottom-up distribution
(Plot: speed (MHz), 100 to 400, over #modules 1, 2, 4, 8, 16, 32, 47. Series: matcher opt static free speed, matcher opt static region speed, controller2 opt, matcher opt modules min speed.)

Figure 11. Speed of "controller" and RTR modules with optimization for circular distribution
(Plot: speed (MHz), 100 to 400, over #modules 1, 2, 4, 8, 16, 32, 47. Series: controller opt, matcher opt modules min speed.)

Another interesting question was how the clock speed of the modules is affected by the irregularities of the PR regions, i.e. the location of block RAM, LUTs and bus macros. Figures 11, 12 and 13 show the spread between the RTR modules. The maximal deviation in clock speed is about 8 percent.
Figure 12. Deviation of clock speed between RTR modules in circular distribution
(Six plots: (a) 4 RTR modules, (b) 8 RTR modules, (c) 16 RTR modules, (d) 32 RTR modules, (e) 46 RTR modules, (f) 47 RTR modules. Each shows speed (MHz), 100 to 400, over the n-th module. Series: controller2, matcher opt module.)

This difference cannot be attributed to the irregularities, because it falls within the noise of the place and route process, which is NP-complete.

Figure 13. Speed of "controller-2" and RTR modules with optimization for circular distribution
(Plot: speed (MHz), 100 to 400, over #modules 1, 2, 4, 8, 16, 32, 47. Series: controller2 opt, matcher opt modules min speed.)

To find out why the static design with modules constrained to regions and the "dynamic" design are about 100 MHz slower, the worst case is examined. Figure 14 shows the placed and routed design of the "bottom-up" version with one RTR module, which is located at the lower corner of the FPGA. The data path delay from a single slice synchronous bus macro, which is located within the region of the RTR module, to the controller is about 3.4 ns and the clock path skew is about -0.7 ns. This means that the clock reaches the bus macro, placed inside the PR region, 0.7 ns later than the controller. The clock path skew is subtracted from the data path delay; since the skew is negative, the magnitudes of both values are added (3.4 ns + 0.7 ns = 4.1 ns). The result determines the maximal clock speed at which data can be transferred.

From the view of device utilization we have seen that only a quarter of the pattern matcher modules could be placed due to the granularity of configuring the FPGAs by configuration frames. Despite this restriction we achieve over 45 billion 32-bit comparisons per second with 47 RTR modules running at 240 MHz.

9. Conclusion

The implementation of a homogeneous grid of run-time reconfigurable modules within an FPGA is feasible. We showed a framework which is potentially capable of handling hundreds of RTR modules. We have shown in our case study that a grid with 47 RTR pattern matcher modules can be run at more than 240 MHz, if the design is carefully chosen.

We have examined what impact run-time reconfiguration has on the design and the clock speed.

The resource utilization of an RTR module, particularly of hard block primitives like block RAM and DSP slices, should match the resources of the reconfigurable region. Otherwise, the total device utilization could be
poor. This is important especially for the design of fine-grained processing engines. These could suffer from the granularity of the partial reconfiguration capabilities.

Figure 14. Placed and routed design of bottom-up with one PRM

For a synchronously driven grid, the design of a central controller, which could be a router or an interface to a host system, is crucial. A pipelined controller, where incoming and outgoing data is buffered near the bus macros, is recommended. The same applies to the RTR modules. The achievable clock speed of the grid depends on the slowest component, which is most likely the controller for huge grids.

There is no difference in clock speed between an RTR grid and a statically built grid with the same placement of modules in the grid areas. In comparison with a statically built version, where the placement of the modules is free, the RTR grid variant can suffer from routing delays, i.e. data path delays and clock path skew. For an RTR grid, optimization by the synthesis tools can only be done on the basis of each module, not on the global design.

The deviation in clock speed measured when using only one type of module for the total grid configuration is about 8 percent maximum.

The results presented should encourage the community to think about run-time reconfigurable processing engines organized in a grid within an FPGA.

10. Future Work

In this paper we concentrated on a synchronous design of a grid with RTR modules. The overall clock speed is always restricted by the slowest component. Examining asynchronously clocked grids seems promising.

11. Acknowledgment

The project is performed in collaboration with the Center of Advanced Study Böblingen, IBM Deutschland Research & Development GmbH in Germany.

References

[1] N. A. Woods and T. VanCourt, “FPGA Acceleration of Quasi-Monte Carlo in Finance,” in FPL. IEEE, 2008, pp. 335–340.

[2] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. C. Cheung, D.-U. Lee, R. C. C. Cheung, and W. Luk, “Reconfigurable acceleration for monte carlo based financial simulation,” in FPT, G. J. Brebner, S. Chakraborty, and W.-F. Wong, Eds. IEEE, 2005, pp. 215–222.

[3] “Xilinx Virtex family,” Website, 2008. [Online]. Available: http://www.xilinx.com/products/

[4] J. Hagemeyer, B. Kettelhoit, M. Koester, and M. Porrmann, “Design of Homogeneous Communication Infrastructures for Partially Reconfigurable FPGAs,” in ERSA. CSREA Press, 2007.

[5] D. Koch, C. Beckhoff, and J. Teich, “ReCoBus-Builder a Novel Tool and Technique to Build Statically and Dynamically Reconfigurable Systems for FPGAs,” in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL 08), Heidelberg, Germany, 2008.

[6] J. Surisi, C. Patterson, and P. Athanas, “An efficient run-time router for connecting modules in FPGAs,” in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL 08), Heidelberg, Germany, 2008.

[7] T. Pionteck, C. Albrecht, K. Maehle, E., Hübner, M., and Becker, J., “Communication Architectures for Dynamically Reconfigurable FPGA Designs,” in Proceedings of IEEE International Parallel and Distributed Processing Symposium, IPDPS USA, 2007.

[8] A. Heinig, R. Oertel, J. Strunk, W. Rehm, and H. Schick, “Generalizing the spufs concept - a case study towards a common accelerator interface,” in Proceedings of the Many-core and Reconfigurable Supercomputing Conference (MRSC), 2008.

[9] Xilinx, “Two flows for partial reconfiguration: Module based or difference based,” in Application Note: Virtex, Virtex-E, Virtex-II, Virtex-II Pro Families (XAPP290), 2004.

[10] Xilinx, “Configuration Memory Frames,” in Virtex-5 FPGA Configuration User Guide (UG191), 2008.
