An On Chip Network Inside A FPGA For Run-Time Reconfigurable Low Latency Grid Communication

2009 12th Euromicro Conference on Digital System Design / Architectures, Methods and Tools
Jochen Strunk∗, Toni Volkmer∗, Wolfgang Rehm∗ and Heiko Schick‡

∗ Chemnitz University of Technology, Computer Architecture Group
Email: {sjoc,tovo,rehm}@cs.tu-chemnitz.de

‡ IBM Deutschland Research & Development GmbH
Email: schickhj@de.ibm.com
978-0-7695-3782-5/09 $25.00 © 2009 IEEE
DOI 10.1109/DSD.2009.133

Abstract

This paper proposes a low latency on-chip communication network (NoC) for a run-time reconfigurable (RTR) grid inside dynamically and partially reconfigurable (DPR) FPGAs. The network supports the arbitrary placement of run-time reconfigurable modules (RTRMs) inside the grid. The dedicated, fully meshed silicon network supports the arrangement of communication channels between the RTRMs within the different partially reconfigurable regions (PRRs) on the FPGA. The design of the network guarantees low latency communication between RTRMs without mutual interference. Compared with an implementation using FPGA resources, the dedicated silicon network could save a huge number of transistors. The new degree of parallel communication provided for an RTR grid with arbitrarily placeable RTRMs opens new application fields for DPR capable FPGAs: multiple user applications with inter-communicating offload compute kernels can be loaded on a host coupled FPGA accelerator, a real-time (RT) system with concurrent communication tasks becomes possible, and enhancing the functionality of embedded systems on demand is conceivable. A case study was conducted as a proof of concept and to verify the run-time environment system, which manages the configurable network.

1. Introduction

FPGAs are utilized in various fields of application. In high performance computing they have gained acceptance as special purpose accelerators for offloading compute kernels. Due to the highly parallel, fine grained architecture of FPGAs, massively parallel processing engines can be realized. Woods et al. [1] accelerated a Quasi-Monte Carlo simulation by a factor of 50 compared to a CPU solution. Zhang et al. [2] presented a speedup of 25 for another Monte Carlo simulation. A high performance implementation of a pattern matcher and of a pseudo random number generator based on a Mersenne Twister was implemented as offload functions on an FPGA in [3]. In the field of embedded systems FPGAs are used for real-time and stand-alone solutions.

Partially reconfigurable FPGAs offer a new degree of flexibility. Partially reconfigurable regions (PRRs) on DPR capable FPGAs (e.g. Xilinx Virtex FPGAs) can be reconfigured during run-time without interfering with the remaining parts, which remain fully functional.

The rest of the paper is organized as follows: Section 2 focuses on related work covering run-time reconfiguration. Section 3 discusses the costs of dealing with arbitrarily placeable RTRMs in relation to communication latencies, bandwidth and the number of configuration bit stream files. Section 4 describes the configurable fully meshed network supporting RTRMs, covering the network architecture, the interface, the protocol, the latencies and the logical and physical implementation. The case study in Section 5 describes the verification and the run-time environment system managing run-time reconfiguration and the configurable network. Section 6 concludes the paper.

2. Related Work

Utilizing run-time reconfiguration imposes requirements on the design, the communication with RTRMs, the design flow, the synthesis tools and the run-time environment system. Different approaches
have been proposed which make use of the DPR capability of FPGAs. The design, i.e. its hierarchy and interfaces, depends on the synthesis tools and frameworks used. Most of the tool flows are built upon the Xilinx design flow (cf. [5]) and partial synthesis software for Virtex FPGAs. The design methodology is based on point-to-point communication with RTRMs using hard macros. A tool flow for more complex structures, e.g. a homogeneous communication infrastructure, which is also based upon the Xilinx design flow, was presented by Hagemeyer et al. [6]. In contrast, the ReCoBus-Builder tool proposed by Koch et al. [7], [8] does not depend on the Xilinx design flow and builds static buses between RTRMs. Only Virtex-II and Spartan-3 FPGAs are supported so far. To increase throughput and to allow more than one RTRM to communicate at the same time, topologies with routers have been examined in [9], [10]. Beyond methodologies for creating a run-time reconfigurable infrastructure on the FPGA, run-time management systems for the integration of FPGAs into a host system have been proposed.

ReconOS [11], a real-time OS with static FPGA threads, is based on memory mapping and targets embedded systems. BORPH [12] runs on the embedded PowerPC and communicates through the UNIX IPC mechanism. For dealing with host coupled accelerators, including FPGAs, a common interface was designed on top of a virtual file system (ACCFS), which supports run-time reconfiguration for HyperTransport coupled FPGAs (cf. [3]).

3. Communication and Arbitrary Placement of RTRMs

This section assesses the costs of dealing with arbitrarily placeable RTRMs on a fixed mesh of PRRs, which is defined at design time. The goal is to find a manageable solution for the arbitrary placement of RTRMs which on the one hand keeps the number of partial configuration bit stream files small and on the other hand provides constant low communication latencies and simultaneous communication access for multiple RTRMs. Figure 1.a shows an example in which RTRMs are logically grouped in communication domains, and Figure 1.b represents a valid placement of these RTRMs in PRRs on an FPGA.

Figure 1. Logical and physical placement of RTRMs in a RTR grid grouped in communication domains. (a) Logical view of RTRMs in a RTR grid grouped in communication domains. (b) Actual placement of RTRMs grouped in communication domains located inside a FPGA.

3.1. Costs of Communication with RTRMs

Different approaches have been proposed. The FPGA manufacturer Xilinx uses hard macros, also known as bus macros, even though such a macro provides a single directed point-to-point connection to an RTRM or a static module and not a bus with more than two participants. The module based partial design flow with hard macros is described in [5]. Tri-state buffers, which can be used to build a bus and are implemented with AND-OR logic in Virtex-II (Pro) FPGAs (cf. [13]), are no longer available in Virtex-4/-5/-6 FPGAs, presumably because of high latencies due to long routing delays between the most distant logic blocks (slices). Another technique for inter-module communication, called ReCoBus (cf. [7]), connects all peers in a long directed combinatorial ORed chain with local AND gates to enable master read operations. Only Virtex-II (Pro) and Spartan-3 FPGAs are supported and homogeneous PRRs are assumed.

All buses have in common that there can be only one transmitter at a time. If more than one message is allowed to travel through the network, routers are used, cf. [9], [10]. Because low and constant latencies are difficult to achieve on routed topologies and buses are limited to a single message transmission at a time, a fully meshed network, in which each RTRM has a dedicated communication path to each of the other RTRMs, is considered. If the placement of RTRMs is fixed (non-arbitrary), a fully meshed network built from FPGA logic and routing resources costs only routing resources, provided the communication paths are unregistered (unbuffered). All interconnect topologies for communication with RTRMs have in common that the communication is limited in bandwidth and that latencies can be high compared to a fully static design.
3.2. Costs of Arbitrary Placement of RTRMs

For the calculation of the number of configuration bit stream files needed, we make the following assumptions: Due to the heterogeneity of the different PRRs it is not possible to copy a configuration bit stream file for a fixed PRR to another PRR. The heterogeneity originates, for example, in the different number of BlockRAMs (BRAMs) and multipliers and in the orientation of the module interconnects to the internal network in a PRR. Each RTRM has a single network port (snp) to access its communication partners. If n is the number of RTRMs we want to place in r PRRs, then n times r configuration bit stream files (c_snp, cf. equation 1) are needed for the arbitrary placement of the RTRMs, provided that the RTRMs access their communication partners in the same way.

\[ c_{snp}(n, r) = n \cdot r \qquad (1) \]

For a fully meshed network topology which has multiple network ports (mnp) per RTRM, in which the ports are directly connected point to point and arbitrary placement is provided, the number of configuration bit stream files (c_mnp) is given in equation 2. When n is equal to r, each permutation of the possible placements of RTRMs is needed to ensure that the communication reaches the predefined ports. These are the costs, in terms of the number of configuration bit stream files, of arbitrary RTRM placement without a configurable fully meshed network, i.e. with directly connected point-to-point routing. It should be noted that such a solution is not feasible: for n = r = 16 the number of configuration files is 16! (≈ 21 trillion).

\[ c_{mnp}(n, r) = \begin{cases} n! & n = r \\ n! \cdot \binom{r}{n} & n < r \\ r! \cdot \binom{n}{r} & n > r \end{cases} \qquad (2) \]

If we equip the fully meshed network with outgoing and incoming switches on each RTRM port, we can adjust the route during run-time when a new RTRM is added or replaced. This leads to a massive reduction of the number of configuration bit stream files, from the one stated in equation 2 to the one in equation 1.

Figure 2. Application defined interface of com-ports. Each of the two connected modules (RTRM 1, RTRM 2) exposes the signals sb_out[5], sb_in[5], data_out[32], data_in[32], valid_out, valid_in, wr/rd_out, wr/rd_in, rq/rp_out and rq/rp_in.

4. Configurable Fully Meshed Network on Chip

4.1. Network Architecture

To enable arbitrary placement of RTRMs with a fully meshed network topology, a configurable fully meshed network (cFMN) is needed. This means that the route of the outgoing and incoming communication of an RTRM to the others must be directed (switched) in such a way that no change is needed to the RTRM, i.e. to the configuration bit stream file itself. The network switch is automatically configured such that the RTRMs always communicate with their defined communication partners independently of their actual placement on the FPGA (cf. Figure 1.b). A fully meshed network allows constant low latencies for most communication patterns, e.g. single transaction (unicast), multicast and all-to-all communication. The bandwidth is directly proportional to the number of signals within a communication channel.

4.2. Interface and Protocol

The interface and the protocol should be as universal as possible to allow common standard protocols to be used. Therefore only the width of a communication port, which is connected to a communication channel, is fixed by the architecture. We propose a width of 40 bit for Virtex-5 FPGAs, which allows 32 bit to be transferred per clock cycle and leaves enough signals for control signals, status signals and side band signals (sb). Figure 2 depicts a typical interface defined by an application module (e.g. an RTRM) and the connection to another module. The speed of a clocked transmission can potentially be chosen differently for each communication channel. In this paper we assume one overall synchronous clock driving the communication channels.
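As a rough illustration of how the 40 bit com-port of Figure 2 could be used, the following Python sketch packs the interface signals (32 data bits, 5 side band bits and the valid, wr/rd and rq/rp flags) into one 40 bit word and unpacks them again. The bit positions, the polarity of the flags and the function names pack_word and unpack_word are assumptions made for this example; the paper fixes only the total port width and leaves the protocol to the application.

```python
DATA_BITS = 32      # data width shown in Figure 2
SB_BITS = 5         # side band width shown in Figure 2
PORT_WIDTH = 40     # proposed com-port width for Virtex-5

def pack_word(data, sb, valid, wr, rq):
    """Pack one transfer into a 40 bit integer (bit layout is an assumption)."""
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit into 32 bits")
    if not 0 <= sb < (1 << SB_BITS):
        raise ValueError("side band must fit into 5 bits")
    word = data                        # bits 31..0: data
    word |= sb << DATA_BITS            # bits 36..32: side band
    word |= int(bool(valid)) << 37     # bit 37: valid
    word |= int(bool(wr)) << 38        # bit 38: wr/rd (assumed 1 = write)
    word |= int(bool(rq)) << 39        # bit 39: rq/rp (assumed 1 = request)
    return word

def unpack_word(word):
    """Inverse of pack_word: returns (data, sb, valid, wr, rq)."""
    data = word & ((1 << DATA_BITS) - 1)
    sb = (word >> DATA_BITS) & ((1 << SB_BITS) - 1)
    valid = bool((word >> 37) & 1)
    wr = bool((word >> 38) & 1)
    rq = bool((word >> 39) & 1)
    return data, sb, valid, wr, rq

example = pack_word(0xDEADBEEF, 0b10101, valid=True, wr=False, rq=True)
```

The five fields add up to exactly 32 + 5 + 1 + 1 + 1 = 40 bits, so under this assumed layout no bit of the port is left unused.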
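The bit stream counts of equations 1 and 2 in Section 3.2 can be checked numerically. The short Python sketch below evaluates both formulas as they are stated there; the function names c_snp and c_mnp simply mirror the symbols of the equations and are otherwise an assumption of this example.

```python
from math import comb, factorial

def c_snp(n, r):
    """Equation 1: bit stream files with a single network port per RTRM."""
    return n * r

def c_mnp(n, r):
    """Equation 2: bit stream files with directly wired multiple network ports."""
    if n == r:
        return factorial(n)
    if n < r:
        return factorial(n) * comb(r, n)
    return factorial(r) * comb(n, r)

# The case named in the text: 16 RTRMs placed in 16 PRRs.
with_switches = c_snp(16, 16)      # count achievable with the configurable network
without_switches = c_mnp(16, 16)   # 16! for direct point-to-point wiring
```

For n = r = 16 the first function yields 256 files, while the second yields 16! = 20,922,789,888,000, which is the "≈ 21 trillion" quoted in Section 3.2.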
Figure 3. Schematic view of the communication of two RTRMs through the communication channel (sender com-port S, selectable sender register, selectable receiver register, receiver com-port R, common clk).

4.3. Adjustable Latencies

To allow asynchronous (unregistered) as well as synchronous (registered) data transfer through the communication channel, the operating mode can be chosen on demand during run-time. On both sides, i.e. sender and receiver, the use of a register in which the data is buffered can be selected, cf. Figure 3 for a schematic view of the communication channel. The four different modes of selecting or deselecting the register on the sender and receiver side are depicted in Figure 4. When the registered path is selected on the sender side, the data is clocked into the register in the first clock cycle and clocked to the communication partner in the second. The asynchronous (unregistered) mode is most suitable for a path which originates at a register inside RTRM M1, runs through combinatorial logic, reaches the communication port, travels through the communication channel and, presumably after another stage of combinatorial logic at the receiving RTRM M2, ends at a register, all within one clock cycle. It should be noted that this mode of operation requires special attention by the synthesis software to ensure that the signal reaches the destination in time for RTR support. The earliest setup time t_setup is predetermined by the logic at the receiving RTRM and the latest hold time t_hold is defined by the sending RTRM. The time ∆t_n is subtracted from t_setup as the worst case network latency through the communication channel, and the remaining time ∆t_m should be divided and assigned equally to both RTRMs (cf. Figure 5). Registered communication relaxes the constraints for the FPGA place and route software and is most suitable for achieving higher clock frequencies.

Figure 4. Communication through com-ports dependent on the operating mode.

Network operating modes and the steps of a transfer in each mode:
  1  non registered                A.1, D
  2  registered at sender side     A.2, B, D
  3  registered at receiver side   A.3, C, D
  4  registered at both sides      A.2, B, C, D

Legend: A.1 sender transmit; A.2 sender transmit to sender reg; A.3 sender transmit to receiver reg; B in internal sender reg; C in internal receiver reg; D available for receiver.

Figure 5. Asynchronous path from one RTRM to another through the network (timing diagram showing clk, t_hold, t_setup, the network latency ∆t_n and the two halves ½ ∆t_m assigned to the RTRMs).

4.4. Physical Implementation

To allow arbitrary placement of RTRMs in a fully meshed network under the conditions stated in equation 1, the network must be able to assign the outgoing communication ports of an RTRM, of which there are at most the number of PRRs − 1 (= r − 1) per RTRM, at run-time to a desired incoming communication port of another RTRM without interfering with the other selected communication paths of the network. To achieve such a behavior two switches are used per RTRM, i.e. per PRR. The first one, used for all outgoing communication, is able to choose the communication partner (i.e. the distant RTRM) for a communication port of the sending RTRM, as defined by the RTRM itself. The second one routes the incoming data from a distant RTRM to a local communication port, which is defined by the receiving RTRM. Figure 6 shows the schematic view of a configurable, fully meshed network for four RTRMs. The interconnection between the switches is chosen in such a way that the configuration (config) for the switches can be easily determined for a specific route. For each outgoing port on the network side (i.e. behind a sending switch), one of the RTRM's sending ports can be selected. The same applies to the receiving switch, but here the configuration is done such that the communication port of an RTRM can decide which network connection from a distant RTRM should be used. If equal port numbers are taken for both the sending and the receiving port, i.e. data is sent on port p of the sending ports of RTRM M1 to RTRM M2 and data from M2 is also received on the incoming port p of the receiving ports of M1, an easy transformation from the configuration setting of the sending switch to that of the receiving switch is possible. Different clock speeds of communication ports are possible but not discussed further here.
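How the two switches per PRR described in Section 4.4 decouple an RTRM from its placement can be sketched with a small behavioural model. The Python code below is an assumption made for illustration, not the authors' run-time system: switch_configs, placement and port_maps are hypothetical names, and the model uses the simplification mentioned in the text that a module sends to and receives from a partner on the same port number p.

```python
def switch_configs(placement, port_maps):
    """Derive switch settings from a placement.

    placement: module name -> index of the PRR it is loaded into
    port_maps: module name -> {local port number: partner module name}
    Returns (send_cfg, recv_cfg):
      send_cfg[(src_prr, dst_prr)] = sending port that drives the link
      recv_cfg[(prr, recv_port)]   = PRR whose link feeds that receiving port
    """
    send_cfg = {}
    recv_cfg = {}
    for module, ports in port_maps.items():
        here = placement[module]
        for port, partner in ports.items():
            there = placement[partner]
            send_cfg[(here, there)] = port   # sending switch of this PRR
            recv_cfg[(here, port)] = there   # receiving switch of this PRR
    return send_cfg, recv_cfg

# One communication domain: A talks to B on port 0 and to C on port 1.
PORT_MAPS = {"A": {0: "B", 1: "C"}, "B": {0: "A"}, "C": {0: "A"}}

first = switch_configs({"A": 2, "B": 0, "C": 3}, PORT_MAPS)
second = switch_configs({"A": 1, "B": 3, "C": 0}, PORT_MAPS)
```

The port maps of the modules are identical for both placements; only the switch settings differ, which is why one bit stream per module and PRR is sufficient. In this model the receiving configuration of a PRR is simply the inverse mapping of its sending configuration, one way to read the "easy transformation" mentioned in the text.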
Figure 6. Schematic view of a configurable, fully meshed network for 4 RTRMs (each RTRM with com-ports 1 to 3, a register per port, and clk and config inputs on the sending and receiving switches).

For the physical silicon implementation of all sending and receiving switches, 2 · r · (r − 1) · w multiplexers of type (r − 1)-to-1 can be utilized, where r is the maximum number of RTRMs which can be placed on the FPGA at the same time and w is the width of a single communication port. The number of transistors can be reduced further if tristate buffers are utilized. This is possible because the transmission is only unidirectional and point to point, and can be terminated at the receiver side. The buffers depend neither on the load nor on the length of the transmission line, which should be made equal for all communication channels independently of the physical location, to ease the timing analysis during the place and route process of a customer design using RTRM designs. The number of tristate buffers for a configurable fully meshed network with r PRRs/RTRMs is 2 · r · (r − 1)² · w. For a schematic view of an outgoing and incoming switch implemented with tristate buffers see Figure 7. The implementation of a fully meshed silicon network for 16 to 24 RTRMs with 40 bit port width is feasible if we compare it to the number of transistors used for Virtex-5 (1.1 billion) and Altera Stratix IV (2.5 billion) FPGAs. An implementation with 16 (24) RTRMs and 40 bit width (with registers for buffers and config for switches) using tristate buffers takes about 1.4‰ (4.4‰) of the Virtex-5 FPGA transistors, or about 2.4‰ (8.1‰) for the multiplexer solution. Table 1 shows the number of transistors needed for the silicon implementation of a configurable, fully meshed network on chip for different sizes of RTR grids with 40 bit communication port width. The transistors used for the switches and for the D flip-flops (D-FF) of the buffers and config registers are listed in detail. The fully meshed network is a silicon hard block and is exposed to an RTRM through communication channel hard blocks (primitives), one for each communication partner. The hard blocks of the communication ports must be located within a PRR or clock region of the FPGA. The sending and receiving ports can be equally distributed over the PRR (cf. Figure 8.a) or placed near the center of the FPGA (cf. Figure 8.b), which is the preferred variant because it allows additional RTRMs to work at a slower speed behind a master RTRM. Building larger PRRs allows channel combining to double the bandwidth (cf. Figure 9.a and Figure 9.b for 8 RTRMs with 16 communication channels). The silicon solution provides a high speed, i.e. low jitter and low latency, path compared to a fully meshed network routed with FPGA resources, which needs up to 10 ns (cf. [14]) to reach the most distant RTRMs.

4.5. HDL Design

RTRMs which want to utilize the configurable, fully meshed network have to instantiate the communication channel primitive COM_CHANNEL for each connection to a distant RTRM (cf. Figure 10) in HDL (e.g. Verilog or VHDL). This is comparable with the instantiation of a BlockRAM (BRAM) or multiplier
(DSP Slice). The COM_CHANNELs are hard blocks located inside a PRR or a clock region of an FPGA. Designs in which the hard block is located near (not inside) a PRR are also possible but not recommended, due to likely higher routing delays. The synthesis tool (placer) decides which physical hard block to select for the actual PRR.

Figure 7. Tristate buffer solution for the sending and receiving switch in a RTR grid with a cFMN (RTRM 1 with com-ports 1 to 3, a register per port, clk and config).

Figure 8. Location of com-ports as hard blocks for a RTR grid with 16 PRRs/RTRMs. (a) Location of hard blocks equally distributed inside a PRR. (b) Location of hard blocks near the center of the FPGA.

Table 1. Resource utilization for RTR grids with different numbers of PRRs using a NoC - both with 40 bit / com-port

#PRRs, cFMN (40 bit/port)                          2      4      8     16     24
#transistors for tristate sol. (mil.)           0.00   0.01   0.13   1.15   4.06
#transistors for multiplexer sol. (mil.)        0.00   0.02   0.25   2.30   8.13
#transistors for regs (D-FF) (mil.)             0.00   0.02   0.07   0.31   0.71
#transistors for config (D-FF)                     0    768   5376    30T    88T
Σ transistors for tristate sol. (mil.)          0.00   0.03   0.20   1.49   4.86
Σ transistors for muxer sol. (mil.)             0.00   0.04   0.33   2.64   8.92
Virtex-5 transistors, tristate util. (%)        0.00   0.00   0.01   0.14   0.44
Virtex-5 transistors, muxer util. (%)           0.00   0.00   0.03   0.24   0.81

Figure 9. Location of com-ports as hard block for a RTR grid with 8 PRRs/RTRMs. (a) Location of hard blocks equally distributed inside a PRR. (b) Location of hard blocks near the center of the FPGA.

5. Case Study

5.1. Proof of Concept

To assess the device utilization needed for a configurable, fully meshed network with 16 PRRs (RTRMs) and 40 bit port width built from configurable logic blocks (CLBs), an implementation was done on the largest Virtex-5 FPGA, the XC5VLX330. The utilization, i.e. the transistor count, can then be compared with a dedicated silicon NoC solution. The utilization is about 40% for the configurable, fully meshed network alone and about 7% for the RTRMs themselves (FSMs). Figure 11.a shows a placed design with PRRs located in adjacent regions, which could not be routed. See Figure 11.b for a routed design in which the PRRs are distributed over the FPGA. The utilization of about 40% for the configurable, fully meshed network corresponds to the theoretical minimal resource utilization of 38% shown in Table 2. The theoretical minimal resource utilization for 24 PRRs using FPGA resources is 152% (cf. Table 2), i.e. more than the device provides. Figure 12 compares the transistors needed for a configurable, fully meshed network for different RTR grid sizes implemented with
configurable logic on an FPGA and with a dedicated silicon network on chip solution. Due to the long synthesis runs, a smaller configuration with 8 PRRs and a smaller FPGA (XC5VLX50T) was chosen to verify the port interface definition, the communication protocols, the multi-port HDL design and simultaneous transmissions (particularly all-to-all communication).

    com_channel_inst : COM_CHANNEL
      generic map (
        CHANNEL_NUMBER => number_per_module
      )
      port map (
        data_send    => rtrm_com_data_out,
        data_receive => rtrm_com_data_in,
        config       => config,
        clk          => clk
      );

Figure 10. Instantiation of a communication channel primitive

Table 2. Minimal resource utilization for RTR grids with different numbers of PRRs implemented with FPGA resources at 40 bit / com-port

#PRRs, cFMN (40 bit/port)            2      4      8      16       24
#slices for cFMN muxers              0    240   2240   19200    77280
#slices for cFMN registers          40    240   1120    4800    11040
#slices for cFMN config regs         0     12     84     480     1380
Σ slices for cFMN                   40    252   2324   19680    78660
Σ transistors for cFMN (mil.)      0.8    5.3   49.3   417.6   1669.1
utilization (%)                    0.1    0.5    4.5    38.0    151.7

Figure 11. RTR grid with 16 RTRMs and fully meshed network (40 bit / com-port) implemented with FPGA resources on a XC5VLX330. (a) Non-routable design. (b) Routed design.

Figure 12. Resource utilization in transistor count for cFMN with FPGA resources and cFMN with NoC (multiplexer and tristate) - both with 40 bit / com-port. (Plot of #transistors (mil), logarithmic axis from 0.001 to 10000, over #PRRs 2, 4, 8, 16, 24, with the series FPGA resources, muxer and tristate.)

5.2. Run-Time Environment System

At run-time an FPGA application is called a process; it consists of at least one RTRM. A process can create new threads (RTRMs) within itself. All threads within a process belong to the same communication domain. RTRMs within the same communication domain can build a fully meshed sub-network for all threads.

The task of the run-time environment system is to load the configuration data of an RTRM, to partially reconfigure the PRR and to adjust the routing of the communication channels as intended by the application. The run-time environment system can pursue different strategies depending on the field of application. Three different fields have been identified: a run-time reconfigurable FPGA accelerator in a host system, a real-time system and an embedded system. Different requirements need to be satisfied for each type. In an accelerator environment for offloading compute kernels, multi-user and multi-process functionality is desirable. The number of threads within a process is most likely fixed. The run-time environment system for accelerators is throughput or performance driven. In contrast, a run-time environment system for real-time applications is event driven, has to meet deadlines and should allow dynamic thread (RTRM) creation. For embedded systems the run-time environment system is generally demand driven: additional functionality is loaded on demand to exchange or enhance functionality. To verify the overall concepts, a run-time environment system was
tested on a host system in conjunction with a Xilinx XC5VLX50T FPGA PCI Express card.

6. Conclusion

In this paper a solution has been presented for the arbitrary placement of RTRMs in an RTR grid on DPR capable FPGAs, based on a dedicated silicon communication network. The implementation, configuration and utilization of the dedicated, configurable, fully meshed silicon network was shown. The silicon network provides a high speed, i.e. low jitter and low latency, solution and consumes only a tiny number of transistors compared to an implementation with FPGA resources.

In contrast to other known bus or routed network solutions for FPGAs, the communication infrastructure allows different applications built on RTRMs to coexist and communicate without interference.

The communication channels are customizable to the requirements of the application, which is able to run self-defined protocols over the channels. For real-time applications the communication channels provide a constant and low latency. Additionally, the run-time environment system for managing such a network has been presented for different fields of application.

7. Future Work

In this paper we have concentrated on a network design with only one network clock frequency for all modules. Communication channels working at different clock speeds seem promising.

8. Acknowledgment

The project is performed in collaboration with the Center of Advanced Study Böblingen, IBM Deutschland Research & Development GmbH in Germany.

References

[1] N. A. Woods and T. VanCourt, “FPGA Acceleration of Quasi-Monte Carlo in Finance,” in FPL. IEEE, 2008, pp. 335–340.

[2] G. L. Zhang, P. H. W. Leong, C. H. Ho, K. H. Tsoi, C. C. C. Cheung, D.-U. Lee, R. C. C. Cheung, and W. Luk, “Reconfigurable Acceleration for Monte Carlo Based Financial Simulation,” in FPT, G. J. Brebner, S. Chakraborty, and W.-F. Wong, Eds. IEEE, 2005, pp. 215–222.

[3] J. Strunk, A. Heinig, T. Volkmer, W. Rehm, and H. Schick, “Run-Time Reconfiguration for Hypertransport coupled FPGAs using ACCFS,” in Proceedings of the Workshop on HyperTransport Research and Applications (WHTRA), 2009.

[4] P. Chen and A. Ye, “The effect of sparse switch patterns on the area efficiency of multi-bit routing resources in field-programmable gate arrays,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sept. 2008, pp. 427–430.

[5] Xilinx, “Two flows for partial reconfiguration: Module based or difference based,” Application Note: Virtex, Virtex-E, Virtex-II, Virtex-II Pro Families (XAPP290), 2004.

[6] J. Hagemeyer, B. Kettelhoit, M. Koester, and M. Porrmann, “Design of Homogeneous Communication Infrastructures for Partially Reconfigurable FPGAs,” in ERSA. CSREA Press, 2007.

[7] D. Koch, C. Beckhoff, and J. Teich, “ReCoBus-Builder a Novel Tool and Technique to Build Statically and Dynamically Reconfigurable Systems for FPGAs,” in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL 08), Heidelberg, Germany, 2008.

[8] D. Koch, C. Beckhoff, and J. Teich, “A Communication Architecture for Complex Runtime Reconfigurable Systems and its Implementation on Spartan-3 FPGAs,” in Proceedings of the 17th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2009). Monterey, California, USA: ACM, Feb. 2009, pp. 233–236.

[9] J. Suris, C. Patterson, and P. Athanas, “An efficient run-time router for connecting modules in FPGAs,” in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL 08), Heidelberg, Germany, 2008.

[10] T. Pionteck, C. Albrecht, K. Maehle, E., Hübner, M., and Becker, J., “Communication Architectures for Dynamically Reconfigurable FPGA Designs,” in Proceedings of IEEE International Parallel and Distributed Processing Symposium, IPDPS USA, 2007.

[11] E. Lübbers and M. Platzner, “ReconOS: An RTOS Supporting Hard- and Software Threads,” in Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, Aug. 2007, pp. 441–446.

[12] H. K.-H. So and R. Brodersen, “File System Access From Reconfigurable FPGA Hardware Processes In BORPH,” in Proceedings of the 2008 IEEE International Conference on Field-Programmable Logic, FPL 2008, 8-10 September, Heidelberg. IEEE, 2008.

[13] “Virtex-II Platform FPGAs: Complete Data Sheet,” Xilinx DS031, vol. 3.5, p. 20, 2007.

[14] J. Strunk, T. Volkmer, K. Stephan, W. Rehm, and H. Schick, “Impact on Run-Time Reconfiguration on Design and Speed - A Case Study Based on a Grid of Run-Time Reconfigurable Modules inside a FPGA,” in Proceedings of the Reconfigurable Architectures Workshop (RAW) / IPDPS, 2009.
