Jochen Strunk∗, Toni Volkmer∗, Klaus Stephan∗, Wolfgang Rehm∗ and Heiko Schick‡
∗ Chemnitz University of Technology, Computer Architecture Group
Email: {sjoc,tovo,stekl,rehm}@cs.tu-chemnitz.de
‡ IBM Deutschland Research & Development GmbH
Email: schickhj@de.ibm.com
[Figure: FPGA device layout. Labels: IO, DCM, DSP, BRAM, BUFG, CLOCK REGION.]

[Figure 3. Design hierarchy. Directory labels shown, in order: modules_47; generated_sources; top_vhdl; controller_vhdl; ngc; top; controller; ncd; top; controller; modules; mod1; ...; mod47.]
Originally, Xilinx PlanAhead 10.1 was going to be used for the RTR design flow. During the first test runs, two major issues emerged. First, bus macros wider than 4 bits are not supported for Virtex-5 FPGAs. Second, the place and route process (implementation step) must be run for each RTR module within the grid, because the location of the block RAM and the bus macro can differ (left, middle, right) within the PR regions. For modules placed in similar PR regions, no "copy to PR region" tool is available so far that could relocate the routed design. Since each module in this case study uses 48 signal lines and at most 47 PRMs are placed, 564 bus macros would have to be placed manually.

The framework script works as follows. First, the top and controller VHDL files are generated from the templates for each grid configuration and copied into each "generated sources" subdirectory (cf. Figure 3), together with the synthesis scripts, implementation scripts, ucf files and matcher EDIFs. Then the synthesis is performed and the results are moved to the ngc subdirectory.

Afterwards, a ucf file based on the ucf template and the PRR and PRM definitions is generated for each module configuration and copied to the "ncd" subdirectories. Next, the implementation scripts for the static and dynamic parts are started, performing the map, place and route processes.
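The figure of 564 bus macros quoted for the PlanAhead flow follows from the numbers in the text: 48 signal lines per module, at most 4 bits per bus macro, and 47 modules. A minimal Python check, assuming every bus macro is used at its full 4-bit width (the function and parameter names are illustrative and not part of the framework):

```python
import math


def bus_macros_needed(signal_lines_per_module: int,
                      bits_per_bus_macro: int,
                      module_count: int) -> int:
    """Count bus macros when each macro carries at most bits_per_bus_macro signals."""
    # Round up so that a partially used macro still counts as one macro.
    per_module = math.ceil(signal_lines_per_module / bits_per_bus_macro)
    return per_module * module_count


# 48 signal lines / 4 bits = 12 macros per module; 12 * 47 modules = 564.
total = bus_macros_needed(48, 4, 47)
print(total)
```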
7. Case Study Pattern Matcher Grid

[Figure 4. Module placement: circular distribution. Panels show the grid (rows 1 to 8, columns 1 to 6, one region marked C) for 1, 2, 4, 8, 16, 32, 46 and 47 modules.]

[Figure 5. Module placement: bottom-up distribution. Panels show the grid (rows 1 to 8, columns 1 to 6, one region marked C) for 1, 2, 4, 8, 16, 24, 32 and 47 modules.]

7.1. Overview and Goals

To study the behavior of modules placed within an RTR grid on a single FPGA, a homogeneous array of reconfigurable regions is used. To assess the difference in achievable clock speed, the RTR-based design is compared with two statically built implementations. The first static version, named "static free", does not use any placement constraints: the tools can optimize, place and route the modules wherever they want in order to achieve maximal overall design speed. The second is the statically built variant of the region-based RTR design, which means that the modules are placed at the same sites as in the RTR grid. To investigate the deviation in clock speed among modules placed at different locations within the grid, the same module is replicated many times. As the RTR module we use a "matcher", which finds patterns within a bit stream. Such pattern matchers are used, for example, in the Human Genome Project to find sequences. The matcher can be driven at high clock rates because the design uses many pipeline stages. This is important to ensure that the module itself does not decrease the overall clock speed. To get applicable results we use one block RAM for each module and for the controller. This is a more realistic approach than using only LUTs for the implementation of the modules. As we will see in Section 7.5, this restricts the placement of the modules. In this test case all modules are driven by a single clock source (synchronous design) and managed by a single controller. We have implemented two distinct versions of the controller to assess its influence on the design. The RTR grid is implemented on an XC5VLX155, one of the largest Virtex-5 FPGAs available in the Xilinx PR 9.2 tools.

7.2. RTR Module Pattern Matcher

The pattern matcher represents the reconfigurable part of the design. It compares a four-byte string pattern against an input stream. While the string pattern can be configured by data input signals, the input stream is generated from a search database which resides in a Virtex-5 block RAM for each module.

The input stream is implemented as a 2 x 4-byte sliding window over the search database. The string matching is performed by simultaneously comparing the string pattern with the first four bytes and with the four bytes shifted by 1, 2 and 3. Therefore, the sliding window can be shifted by four bytes each clock cycle. This means that four 32-bit comparisons, or equivalently 16 one-byte comparisons, are performed each clock cycle.

On the input signal "start", the matcher begins the string comparisons. After the execution has finished, a status signal indicates that the result is available on the data output.

Synthesized as a reconfigurable module in the RTR design flow, a single pattern matcher utilizes about 240 LUTs and 1 block RAM. Due to the highly pipelined design of the matcher, a maximum clock speed of approximately 350 MHz is achievable on a Xilinx Virtex-5 XC5VLX155 FPGA.

7.3. Pattern Matcher Controller

The task of the controller is to provide the matchers with string patterns, to start the matching sequence and to gather the results using control and data lines. Each matcher is checked periodically by the controller. After a matcher has finished its work, its results are collected. Thereafter, a new string pattern is loaded into this matcher and the signal "start" is asserted. Two distinct implementations of the controller exist, which differ in the maximal achievable clock speed.

The first variant, named "controller", is based on a finite state machine (FSM) with 12 states. Due to the synchronous behavior of the utilized hard macros (bus macros), two wait states are added, one for each direction.

The second variant, named "controller-2", contains one additional buffer for each control and data line as well as two more wait states.
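The sliding-window comparison performed by each matcher (Section 7.2) can be illustrated with a small software model. This is only a sketch of the comparison scheme, not the VHDL implementation: the text does not specify what the data output contains, so returning the list of match offsets is an assumption, as are the function and variable names.

```python
from typing import List

PATTERN_LEN = 4   # four-byte string pattern
WINDOW_LEN = 8    # 2 x 4-byte sliding window
STEP = 4          # the window advances four bytes per clock cycle


def match_positions(database: bytes, pattern: bytes) -> List[int]:
    """Return all byte offsets in database at which pattern occurs."""
    if len(pattern) != PATTERN_LEN:
        raise ValueError("pattern must be exactly four bytes long")
    hits = []
    # Each loop iteration models one clock cycle of the matcher.
    for base in range(0, len(database), STEP):
        window = database[base:base + WINDOW_LEN]
        # Four 32-bit comparisons: pattern against the window shifted by 0..3.
        for shift in range(STEP):
            if window[shift:shift + PATTERN_LEN] == pattern:
                hits.append(base + shift)
    return hits


# Occurrences at an unaligned offset (2) and one that straddles windows (9).
print(match_positions(b"xxABCDyyyABCD", b"ABCD"))
```

Because the window is eight bytes wide and the four shifted comparisons cover offsets 0 to 3, advancing by four bytes per cycle visits every byte offset exactly once, which is why no match is missed at the boundary between consecutive windows.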
7.4. Design Validation

The correctness of the design was validated by post-route simulation, and by implementing and testing the static and the RTR module based designs on a Xilinx ML555 PCIe board with a Virtex-5 XC5VLX50T at the different clock speeds.

7.5. Module Placement in RTR Grid

To define an RTR grid, i.e. the number, location and size of the partially reconfigurable regions, two conditions must be met. First, the primitives used by the module "matcher", e.g. LUTs and block RAMs, must reside within a PR region. Second, these regions must obey the DPR rules of the FPGA fabric. The smallest portion of a Virtex-5 FPGA that can be configured is the configuration frame [10]. One frame covers, for example, 20 configurable logic blocks (CLBs), 40 IO blocks (IOBs) or 4 block RAMs. These frames are at the same time the finest granularity of conceivable modules. This means that modules can be organized on a Virtex-5 FPGA in a two-dimensional array, in contrast to Virtex-2 (Pro) FPGAs, where only full columns can be reconfigured. The height of a frame corresponds to the vertical dimension of a clock region of the Virtex-5 FPGAs, cf. Figure 1. Although a matcher needs only one block RAM, the other three within the same column of a clock region are not available to other modules, because a single configuration frame contains four block RAMs of the same column. The total number of possible PR regions for the matcher case study is 48, which results from the number of clock regions (16) multiplied by the number of block RAM columns (3) per clock region, cf. Figure 1. The Virtex-5 XC5VLX155 FPGA has an equal number of block RAMs on its right and left sides, which makes it relatively easy to create an 8 by 6 two-dimensional array (cf. Figure 2). The partially reconfigurable regions are equal in their amounts of block RAMs (4) and LUTs (1600). In this case study one of these regions, located in the middle of the device, is used for the controller.

To examine the behavior, two distinct distributions of the RTR modules are analyzed. The first is a circular placement around the controller (cf. Figure 4) and the other is a bottom-up filling (cf. Figure 5), each with an increasing number of matcher modules.

8. Results of Case Study

[...] and that there are no placement constraints (free placement). In the "static region" variant the placement is bound to the regions of the grid. For the "dynamic" version the placement is constrained to the PRRs of the RTR grid; the module is an RTR module, which also implies that the communication goes through "bus macros". All in all we have synthesized and built over 700 RTR modules and more than 80 full static versions. We utilized a Virtex-5 XC5VLX155 with speed grade two for our case study. All "opt" versions were built with the timing-driven "map" option and high effort for place and route.

[Figure 6. Static free, static with regions and dynamic circular distribution, all built with controller-2. Axes: speed (MHz), 100 to 400, versus #modules (1, 2, 4, 8, 16, 32, 47). Curves: matcher opt static free speed, matcher opt static region speed, matcher opt dynamic speed.]

[Figure 7. Caption not recovered. Axes: speed (MHz), 100 to 400, versus #modules (1, 2, 4, 8, 16, 32, 47). Curves: matcher speed, matcher opt speed, matcher* speed, matcher* opt speed.]
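The region count derived in Section 7.5 can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in the text (16 clock regions, 3 block RAM columns per clock region, 4 block RAMs and 1600 LUTs per region, about 240 LUTs and 1 block RAM per matcher); the constant names and the utilization figures computed from them are illustrative additions rather than results reported by the authors.

```python
CLOCK_REGIONS = 16
BRAM_COLUMNS_PER_CLOCK_REGION = 3
GRID_ROWS, GRID_COLS = 8, 6

BRAMS_PER_REGION = 4
LUTS_PER_REGION = 1600
MATCHER_LUTS = 240    # "about 240 LUTs" per matcher
MATCHER_BRAMS = 1

# One PR region per block RAM column in each clock region.
pr_regions = CLOCK_REGIONS * BRAM_COLUMNS_PER_CLOCK_REGION

# One region in the middle of the device is reserved for the controller.
matcher_regions = pr_regions - 1

# Fraction of a region's resources that a single matcher occupies.
lut_utilization = MATCHER_LUTS / LUTS_PER_REGION
bram_utilization = MATCHER_BRAMS / BRAMS_PER_REGION

print(pr_regions, GRID_ROWS * GRID_COLS, matcher_regions)
print(lut_utilization, bram_utilization)
```

The 48 regions match the 8 by 6 array, and reserving one region for the controller leaves the 47 matcher modules used as the largest configuration in the plots.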
[...] the modules, the clock speed stays constant from 16 modules upwards. The frequency of the "static region" and "dynamic" versions drops with increasing module count. Both curves behave alike: they fall from approx. 330 MHz (1 module) to approx. 250 MHz (47 modules). This leads to the conclusion that the maximal achievable clock speed depends only on routing delays, i.e. data path and clock delay.

Figure 7 shows that there is no significant difference between the "static free" version (matcher), the "static free" version with higher PAR effort (matcher opt) and the controller with one additional pipeline stage (matcher*) for the static design with free placement. This changes with the "region" placement (cf. Figure 8). All implementations encounter a decrease in clock speed, but the matcher version (matcher*) with one additional pipeline stage is up to 100 MHz faster than the one without the buffering. Figure 9 reveals that for up to 8 matchers the matcher itself (both versions) is the limiting factor. This changes from 16 modules [...]

[...] that we start with the farthermost location from the controller (bottom-up), we envisage an almost constant controller clock speed over all module counts, cf. Figure 10. This means that the bottom-up setup with one RTR module gives an estimate of the clock speed at which the fully occupied grid can be driven.

Another interesting question was how the clock speed of the modules responds to the irregularities of the PR regions, i.e. the location of block RAM, LUTs and bus macros. Figures 11, 12 and 13 show the spread between the RTR modules. The maximal deviation in clock speed is about 8 percent.

[Figure 8. Speed of static design with regional placement, built with controller and controller-2. Axes: speed (MHz), 100 to 250, versus #modules (1, 2, 4, 8, 16, 32, 47).]

[Figure 9. Speed of controller and RTR modules with optimization for circular distribution. Axes: speed (MHz), 100 to 400, versus #modules (1, 2, 4, 8, 16, 32, 47).]

[Figure 10. Speed of static free, static regional, controller-2 and RTR modules for bottom-up distribution. Axes: speed (MHz), 100 to 400, versus #modules (1, 2, 4, 8, 16, 32, 47).]

[Figure 11. Speed of "controller" and RTR modules with optimization for circular distribution. Axes: speed (MHz), 100 to 400, versus #modules (1, 2, 4, 8, 16, 32, 47).]

[Figure 12. Deviation of clock speed between RTR modules in circular distribution. Three panels, each showing speed (MHz) for controller2 and matcher opt module.]

[...] a single slice synchronous bus macro, which is located within the region of the RTR module, to the controller is about 3.4 ns and the clock path skew about -0.7 ns. This means that the clock reaches the bus macro, placed inside the PR region, 0.7 ns later than the controller. The clock path skew is subtracted from the data path delay, i.e. both values are added. The result [...]
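The timing argument above can be made concrete with a short calculation. The sketch assumes that the minimum clock period is simply the data path delay minus the clock path skew, ignoring setup time and clock uncertainty, which the text does not quote; the function name is illustrative.

```python
def max_clock_mhz(data_path_delay_ns: float, clock_skew_ns: float) -> float:
    """Estimate the maximum clock frequency from a data path delay and clock skew."""
    # A negative skew (destination clocked earlier than the source)
    # lengthens the required period, because the skew is subtracted.
    min_period_ns = data_path_delay_ns - clock_skew_ns
    return 1000.0 / min_period_ns


# 3.4 ns - (-0.7 ns) = 4.1 ns, i.e. roughly 244 MHz.
estimate = max_clock_mhz(3.4, -0.7)
print(round(estimate, 1))
```

Under these simplifications the estimate lies close to the approx. 250 MHz reported for the fully occupied grid, which is in line with the conclusion that routing delays dominate the achievable clock speed.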
11. Acknowledgment
The project is performed in collaboration with the Center of
Advanced Study Böblingen, IBM Deutschland Research &
Development GmbH in Germany.
References
[1] N. A. Woods and T. VanCourt, “FPGA Acceleration of Quasi-
Monte Carlo in Finance,” in FPL. IEEE, 2008, pp. 335–340.