
RECONNECT: A NoC for polymorphic ASICs using a Low Overhead Single Cycle Router
Nimmy Joseph, Ramesh Reddy C, Keshavan Varadarajan, Mythri Alle, Alexander Fell, S K Nandy
CAD Lab, SERC, Indian Institute of Science, Bangalore
{jnimmy, crreddy, keshavan, mythri, alefel, nandy}@cadl.iisc.ernet.in
Ranjani Narayan
Morphing Machines, Bangalore, India
ranjani.narayan@morphingmachines.com

Abstract


A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a future proof custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme that limits the control overhead to 9 bits per packet. The 8x8 honeycomb NoC (RECONNECT), implemented in the 130nm UMC CMOS standard cell library, operates at 500MHz and has a bisection bandwidth of 28.5GBps. The network is characterized for random, self-similar and application specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and numbers of virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.

1 Introduction

Emerging embedded applications in domains such as multimedia, networking, security and bioinformatics are implemented on ASICs and bound to that specific hardware platform. A frequent exchange or later upgrade of these applications and their derivatives calls for flexible hardware solutions. While ASICs offer a significant power and performance advantage over other programmable solutions, their NRE costs can only be amortized by high production volumes. A flexible and reprogrammable ASIC would be future proof and cost effective, since a single ASIC can then support multiple applications and their derivatives. We refer to such a flexible ASIC as a polymorphic ASIC. In FPGAs, switching applications at runtime cannot be achieved easily because of the latency of the configuration reload required on every application switch. A polymorphic ASIC, on the other hand (an abstract model is shown in Figure 1), is a programmable multiprocessor on a chip in which both the data path and the control path are redefined during execution to meet the performance requirements of applications. The polymorphic ASIC is a hardware substrate comprising a regular array of tiles. A tile consists of a Compute Element (CE) and a network router. These tiles are connected through a high performance Network on Chip (NoC) which can be reconfigured at runtime.

Figure 1. Abstract model of polymorphic ASIC (Compute Elements on the execution fabric, connected through a programmable interconnect, with Metadata Orchestration Support Logic)

Given an application specification in a high level language, the application is fragmented into substructures called HyperOps [7]. A HyperOp is a subgraph of the application data flow graph. At runtime the CEs are programmed according to the operations represented by the nodes of the subgraph. Since several CEs of the polymorphic ASIC are aggregated at runtime to map the HyperOp and its operations, the aggregate is equivalent to customized hardware (an ASIC) for that HyperOp.

The polymorphic ASIC can execute several scheduled HyperOps simultaneously if sufficient CEs are available. The Metadata Orchestration Logic is responsible for selecting HyperOps that are ready to execute and for loading them onto a set of CEs. The set of CEs assigned to a HyperOp follows static dataflow semantics, while the Metadata Orchestration Logic supports dynamic dataflow semantics among multiple HyperOps. A polymorphic ASIC is therefore akin to a dynamic dataflow machine.

The data paths of an ASIC are predefined and static, whereas the paths of a polymorphic ASIC are realized through the NoC, giving the flexibility to change with every newly loaded HyperOp at the cost of increased latency. Since the edges between operations of a HyperOp represent communication between CEs, the physical mapping of a HyperOp onto the CEs in the execution fabric must meet specific communication criteria such as minimal distances. This ensures a low hop count and a lower probability of congestion for each packet routed through the NoC.

The rest of the paper is organized as follows. Section 2 describes the NoC features and design considerations with respect to topology, interconnects, the router and its flow control mechanism. Simulation and synthesis results are presented in Section 3. Section 4 concludes the paper.
2 RECONNECT: The NoC

The NoC is a regular interconnection of tiles. Each tile is an aggregation of elementary hardware resources comprising compute and communication elements (routers). NoC efficiency depends on the topology and the router performance, while the design trade-offs are silicon area, power and delay. In this paper we explain the proposed NoC architecture and the router for communication among tiles. Over-design of the router is avoided, since polymorphic ASICs are designed for specific application domains with known bounds on latency and throughput.

To exploit the instruction level parallelism exhibited by the HyperOps, multiple operations are performed by different CEs simultaneously. Depending on the instruction level parallelism, a HyperOp is split into several subsets called p-HyperOps [7], where each p-HyperOp is a set of instructions assigned to a CE. This allows parallel execution of multiple instructions and necessitates parallel transfer of instructions to the polymorphic ASIC.

Our implementation consists of 64 tiles arranged in a regular 8x8 structure connected through a honeycomb topology, as shown in Figure 2. The CE of a tile interfaces to the router through a Network Interface (NI) unit. The router in a tile has 4 ports, of which one is connected to the NI of the tile and the remaining three are connected to the adjacent routers (except for routers at the periphery). Routers at the periphery of the honeycomb have at least one free port; we term these ports launching ports. There are 22 launching ports in an 8x8 honeycomb topology through which instructions can be loaded to the CEs in parallel. The HyperOps are stored in the HyperOp Store, an on-chip memory. The HyperOp Launcher acts as an interface between the launching ports and the HyperOp Store; it has 22 NI units for parallel transfer of 22 packets. The topology, NoC interconnects and router are explained in the following sections.

2.1 Topology and Interconnects

The honeycomb topology is chosen as the interconnection network on the fabric since it has a lower degree per node than a 2-D mesh [9]. This reduces the complexity and area of the network router. A detailed comparison of the honeycomb and mesh topologies is provided in [10].

The interconnects are divided into two logical sets. The first set of interconnections is called Express Lanes (Figure 2). These facilitate parallel instruction transfer from the HyperOp Launcher to the boundary tiles and reduce the load on the internal network. If the destination is not a boundary tile, the instructions are routed from the boundary tiles to the destination CEs through the internal network. The second set of wires connects the tiles and is used for inter-CE data transfer in addition to instruction transfer from the boundary tiles to the internal CEs. Thus RECONNECT is better suited to a polymorphic ASIC than some of the other proposed NoC architectures [5, 6].

Figure 2. Proposed Network on Chip (HyperOp Store and HyperOp Launcher feed 22 Express Lanes into the fabric; Tile = CE + Router)

For an instruction width of 57 bits (Figure 3), a total of 1254 (57 x 22) wires is required to route 22 parallel Express Lanes. Using metal-6 wires with a 480 nm pitch at the 65nm CMOS technology node, the total width of the Express Lanes is 1254 x 480 nm = 0.6019mm [2]. This is permissible for a 64 tile (8x8) fabric in which each tile is 3 mm^2, giving a die size of 196 mm^2 (14mm x 14mm). Long wires with repeaters can support up to 2 GHz across a 14mm x 14mm fabric at the 65nm CMOS technology node, and more than 2 GHz at the 130nm node, since global wire delay does not scale [1]. Therefore routing long wires between the HyperOp Store and the launching ports is not the frequency limiting factor for NoCs operating below 2GHz.

2.2 Router
In a typical ASIC, the functionality of CEs and routers is integrated into a monolithic hardware element. In a polymorphic ASIC, the functionality of CEs and routers is decoupled to facilitate reconfiguration. At gigascale levels of integration, a full on-chip crossbar interconnect is not a viable solution. Since the router is only a data forwarder and does not contribute to logic, the router overhead should be as low as possible, within the limits of meeting the traffic requirements of a given application.

The components of the router are shown in Figure 4. It has four input ports and four output ports. Of these, three are connected to the adjacent routers and the fourth is connected to the Network Interface (NI) unit of the CE. The NI is connected to the east port if that port is not connected to an adjacent router; otherwise the NI is connected to the west port. The router consists of input buffers, input and output arbiters, and 1:4 mux and demux units.
Two levels of arbitration are used in the router. The first level is virtual channel arbitration at each input port, which selects one of the virtual channels; with K virtual channels at an input port, a K:1 arbiter is needed. The second level of arbitration is for the output port (Output Arbiter). Since the router has four input ports and no loop back is provided, a 3:1 arbiter suffices at each output, reducing the router area.
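As an illustration of the first arbitration level, a minimal K:1 round robin arbiter for K = 4 is sketched below. This is our own illustrative code: the module and signal names and the reset priority are assumptions, not the paper's RTL.

    // Sketch of a K:1 round-robin arbiter for K = 4 virtual channels.
    // Illustrative only; names and reset behavior are our assumptions.
    module rr_arbiter4 (
        input  wire       clk,
        input  wire       rst,
        input  wire [3:0] req,    // one request line per virtual channel
        output reg  [3:0] grant   // one-hot grant
    );
        reg [1:0] last;           // most recently granted VC
        reg [1:0] idx;
        reg       found;
        integer   i;

        // Combinational scan: start just after the last granted VC so
        // that every requester is served within four arbitration rounds.
        always @(*) begin
            grant = 4'b0000;
            found = 1'b0;
            idx   = 2'd0;
            for (i = 1; i <= 4; i = i + 1)
                if (!found && req[(last + i) % 4]) begin
                    idx   = (last + i) % 4;
                    found = 1'b1;
                end
            if (found)
                grant = 4'b0001 << idx;
        end

        // Rotate the priority pointer to the granted VC.
        always @(posedge clk) begin
            if (rst)
                last <= 2'd3;     // so that VC0 wins the first arbitration
            else if (found)
                last <= idx;
        end
    endmodule

The same structure, narrowed to three requesters, would serve as the 3:1 output arbiter described above.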
Flow Control Mechanism: Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels [3]. Each physical channel is associated with several small queues, called virtual channels, rather than a single long queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. We use a virtual channel flow control mechanism at each input port to increase throughput and to prevent head-of-line blocking.

Figure 3. Packet format

The packet size (Figure 3) is taken as the flit size, a flit being the smallest flow control unit of the network. The network bandwidth between two adjacent routers is therefore equal to the packet size. Splitting the packet into multiple flits would make the network inefficient in our case, since the CE would be idle until the whole packet is received; it would also require additional logic for splitting packets into flits and reassembling them at the destination.
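The field layout of Figure 3 (a 1-bit New Data Indicator, two 4-bit 2's complement relative addresses and a 48-bit payload, 57 bits in all) amounts to a simple concatenation at the source. The sketch below is illustrative; the mapping of the paper's "bit 0" (the MSB) onto Verilog vector index 56 is our assumption.

    // Sketch of source-side packet assembly per Figure 3 (9 + 48 = 57 bits).
    module packet_pack (
        input  wire        new_data_ind,  // toggled for every new packet
        input  wire [3:0]  x_rel,         // 4-bit 2's-complement X offset
        input  wire [3:0]  y_rel,         // 4-bit 2's-complement Y offset
        input  wire [47:0] payload,
        output wire [56:0] packet
    );
        // packet[56] carries the New Data Indicator, which is the
        // packet's "bit 0" (its MSB) in the paper's numbering.
        assign packet = {new_data_ind, x_rel, y_rel, payload};
    endmodule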

Figure 4. Data and control path scheme of the router with four virtual channels (VC). Each input port feeds four VCs through a demultiplexer; a VC Arbiter, Request Decoder and Port Update Logic serve each port, and the Address Update Logic and Output Arbiters drive the four output ports.

Each input port has a port_status_out bit (Figure 4) indicating whether one of its virtual channels is free; synchronization between two adjacent routers is based on this bit. Within the router, received packets are stored in one of the free virtual channels in round robin fashion. A router outputs a packet only if one of the virtual channels of the adjacent router is free; otherwise the packet is blocked at the input of the current router, so packets are never dropped in the network. The unavailability of virtual channels in an adjacent router does, however, increase the average latency of the overall network. Once a packet is sent to an adjacent router, the corresponding virtual channel status of the current router is updated by the Port Update Logic to indicate that it is free to receive another packet. A toggle in the MSB (bit 0) of the received packet indicates a new packet: a packet is stored only if its MSB differs from that of the previous packet.

Routing Algorithm: The routing algorithm is described with respect to Figure 4. Packets are routed along the shortest path to the destination. The honeycomb topology has horizontal links only on alternate nodes, so the routing algorithm prioritizes horizontal links over vertical ones (Figure 2). The output request signal (output_req_from_vc) generated from the packets in the virtual channels indicates the direction in which each packet has to travel. The K:1 virtual channel arbiter at each input port selects one of the virtual channels for which the requested adjacent router input port is free and sends the corresponding request (req_from_east etc.) to the Request Decoder, which generates decoded_req for the Output Arbiter.

At each router, the output port to which a packet is to be sent is determined from the packet's relative address. At the source, a packet is formed by concatenating the X and Y relative addresses of the destination with the payload. 4-bit 2's complement arithmetic is used for address updating in the router. If a packet has to travel in the X direction at a node where no horizontal link is available and its Y relative address is zero, it takes the south (north) direction and its Y relative address is modified accordingly, so that after traveling in the X direction the packet moves in the north (south) direction.
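To make the shortest path rule concrete, the sketch below models the direction decision of one router. The sign conventions (a positive X offset meaning the destination lies east, a positive Y meaning south) and the single horizontal link per router are our reading of the text; the actual RTL's priority rules may differ.

    // Sketch of the output-direction decision. In the honeycomb each
    // router has one horizontal neighbor, to the east or to the west
    // (the other horizontal port serves the local NI).
    module route_decide (
        input  wire       horiz_is_east, // 1: east link present, 0: west
        input  wire [3:0] x_rel,
        input  wire [3:0] y_rel,
        output reg  [1:0] out_dir        // 00 N, 01 S, 10 W, 11 E
    );
        localparam NORTH = 2'b00, SOUTH = 2'b01, WEST = 2'b10, EAST = 2'b11;
        wire x_zero    = (x_rel == 4'd0);
        wire y_zero    = (y_rel == 4'd0);
        wire want_east = ~x_rel[3];                // positive X offset
        wire horiz_ok  = (want_east == horiz_is_east);

        always @(*) begin
            if (!x_zero && horiz_ok)
                out_dir = want_east ? EAST : WEST; // prefer horizontal links
            else if (!y_zero)
                out_dir = y_rel[3] ? NORTH : SOUTH;
            else if (!x_zero)
                // X travel needed, but the available horizontal link points
                // the wrong way and y_rel == 0: detour south; the Y address
                // update compensates so the packet later returns north.
                out_dir = SOUTH;
            else
                out_dir = NORTH; // x = y = 0: packet is for the local CE
                                 // (delivered to the NI; value unused here)
        end
    endmodule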

The relative address updating logic for the four directions is:

North : Y_relative_address_new = Y_relative_address + 1
South : Y_relative_address_new = Y_relative_address - 1
West  : X_relative_address_new = X_relative_address + 1
East  : X_relative_address_new = X_relative_address - 1
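A matching sketch of this 4-bit 2's complement update, using the direction encoding of the previous sketch (again with our own names):

    // Sketch of the relative address update performed before a packet
    // leaves on the chosen output port.
    module addr_update (
        input  wire [1:0] out_dir,       // 00 N, 01 S, 10 W, 11 E
        input  wire [3:0] x_rel, y_rel,
        output reg  [3:0] x_new, y_new
    );
        always @(*) begin
            x_new = x_rel;
            y_new = y_rel;
            case (out_dir)
                2'b00: y_new = y_rel + 4'd1;  // north
                2'b01: y_new = y_rel - 4'd1;  // south
                2'b10: x_new = x_rel + 4'd1;  // west
                2'b11: x_new = x_rel - 4'd1;  // east
            endcase
        end
    endmodule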

The relative address is updated before a packet is routed to an adjacent router, depending on the chosen direction. The Output Arbiter arbitrates among the outputs of the four virtual channel arbiters at level 1. The arbitration among the virtual channels of an input port and at the output port is done using a round robin algorithm. To indicate new data to the next router, the New Data Indicator bit (the MSB of the packet) is toggled at the output port. When a packet is received, the output port direction is calculated in the same clock cycle, including the arbitration and the routing; thus, if all four adjacent routers are free, four packets can be routed to the adjacent routers in a single clock cycle. When the destination is reached, the packet is sent to the corresponding CE in the next cycle.

Router Crossbar: The crossbar forms the second arbitration level (Output Arbiter) of the router and can route four packets to four different ports in a single cycle. The crossbar design is optimized by omitting loop back, which the routing algorithm renders unnecessary; i.e., an input received on a port is never transmitted back out on the same port, reducing the area of the crossbar.

3 Performance Analysis

We evaluated the proposed NoC for traffic patterns that are representative of real applications. The NoC is implemented in RTL Verilog HDL and simulated with Mentor Graphics ModelSim, driven by a testbench also written in Verilog HDL. Synopsys Design Compiler is used to synthesize the NoC.

The NoC is simulated with four types of traffic: random, self-similar, and two application specific traffic patterns that model the execution of multimedia and DSP kernels on the polymorphic ASIC. In random traffic, each tile generates packets with random destinations; packets are injected into the network from all ports every clock cycle whenever the ports are available.

Self-similar traffic has been observed in the bursty traffic between on-chip modules in typical MPEG-2 video applications [4]. It has been shown that self-similar traffic can be modeled by aggregating a large number of ON-OFF message sources.

The length of time each source spends in the ON or OFF state is selected according to a distribution that exhibits long-range dependence. The Pareto distribution (F(x) = 1 - x^(-α), with 1 < α < 2) has been found to fit this kind of traffic well. A packet train remains in the ON state for t_ON = (1 - r)^(-1/α_ON) and in the OFF state for t_OFF = (1 - r)^(-1/α_OFF), where r is a random number uniformly distributed between 0 and 1, α_ON = 1.9, and α_OFF = 1.25 [8].
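A testbench-style sketch of such an ON-OFF source follows; the use of $dist_uniform, the fixed seed and the printed granularity are our choices, not the paper's testbench.

    // Sketch of a Pareto ON/OFF period generator for self-similar
    // traffic, using the alpha values quoted above.
    module pareto_source;
        real    r, t_on, t_off;
        integer seed;

        initial begin
            seed = 42;                                // arbitrary fixed seed
            repeat (10) begin
                // r uniform in [0, 1); avoid r = 1.0 exactly
                r     = $dist_uniform(seed, 0, 999999) / 1000000.0;
                t_on  = (1.0 - r) ** (-1.0 / 1.9);    // ON duration, alpha_ON
                r     = $dist_uniform(seed, 0, 999999) / 1000000.0;
                t_off = (1.0 - r) ** (-1.0 / 1.25);   // OFF duration, alpha_OFF
                $display("ON %f cycles, OFF %f cycles", t_on, t_off);
            end
        end
    endmodule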
The two application specific traffic patterns are generated from traffic traces produced by a topology (honeycomb) aware simulator of the polymorphic ASIC for which this NoC is targeted. We simulated these to mimic the realistic scenario of the polymorphic ASIC, in which the HyperOp Launcher, realized in Verilog HDL, transfers 22 packets simultaneously onto the fabric through the nearest boundary ports. Intermediate results generated by the CEs are transferred over the network at specified intervals, as demanded by the application. Communication generated by the multimedia kernels spans 4x4 tessellations on the execution fabric; four such tessellations execute in parallel, occupying the entire fabric. Communication generated by the DSP kernels involves multiple 2x2 and 4x4 tessellations on the execution fabric.

3.1 Throughput and Latency

We compare the throughput and latency of RECONNECT for the various traffic patterns. Figure 5 shows the variation of throughput with the number of virtual channels for the different types of traffic. Throughput is the maximum traffic accepted by the network and relates to the peak data rates sustainable by the system. The accepted traffic depends on the rate at which data is injected into the network. Ideally, accepted traffic should increase linearly with injection load; in practice, because routing resources are limited, it saturates at a certain injection load.

Throughput, expressed as a fraction of the network capacity, is measured by applying a saturation source to each port that injects a new packet into the network whenever the port is free; more packets are inserted when the router has more virtual channels. As Figure 5 shows, throughput saturates once the number of virtual channels exceeds 4. Higher throughput is achieved for application specific traffic than for random or self-similar traffic because of the higher degree of near-neighbor communication across tiles. The variation of throughput with virtual channels shows a similar trend for all four types of traffic.

Figure 5. Throughput variation with virtual channels (throughput as a fraction of capacity vs. number of virtual channels, for DSP kernel, multimedia kernel, self-similar and random traffic)

Figure 6 shows the variation of latency with the number of virtual channels. Network latency is measured as the average difference between the arrival time at the destination and the launch time at the source, including the time spent buffered at the source. The average packet latency increases with the number of virtual channels and the injection load. In general, low latency is achieved because of the single cycle router implementation (the packet size is the same as the flit size). For the more realistic application specific traffic, lower latencies are observed due to near-neighbor traffic.

Figure 6. Latency variation with virtual channels (average latency in cycles vs. number of virtual channels, same four traffic types)

Throughput of the network saturates beyond 4 virtual channels while latency continues to increase. To keep the latency low while maintaining acceptable throughput, the number of virtual channels in the final router design is set to 4, which strikes an appropriate balance between high throughput, low latency and conservation of silicon area. This result is consistent with previous work on the number of virtual channels [3] and validates our simulation approach.

Figure 7 shows the average network latency versus injected load (random traffic) for varying numbers of virtual channels. The latency approaches infinity when the injected traffic exceeds the saturation throughput. For example, with 4 virtual channels the saturation throughput is 62% of the total network capacity (Figure 5); beyond this point the latency grows without bound (Figure 7) because of the limited routing resources.

Figure 7. Latency variation with injected traffic (average latency in cycles vs. injected load for 2, 4 and 8 virtual channels)
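As an illustration of this measurement, the sketch below accumulates per-packet latency in a testbench monitor; the module and signal names (pkt_arrived, pkt_id, etc.) are ours, not the paper's testbench.

    // Sketch of a latency monitor: latency is counted from entry into
    // the source queue to arrival at the destination, as in the text.
    module latency_monitor (
        input wire        clk,
        input wire        pkt_injected,   // packet enters a source queue
        input wire [11:0] inj_id,
        input wire        pkt_arrived,    // packet delivered to its CE
        input wire [11:0] pkt_id
    );
        reg [31:0] cycle_count   = 0;
        reg [31:0] total_latency = 0;
        reg [31:0] delivered     = 0;
        reg [31:0] inject_time [0:4095];  // injection timestamp per id

        always @(posedge clk) begin
            cycle_count <= cycle_count + 1;
            if (pkt_injected)
                inject_time[inj_id] <= cycle_count;
            if (pkt_arrived) begin
                total_latency <= total_latency
                               + (cycle_count - inject_time[pkt_id]);
                delivered     <= delivered + 1;
            end
        end
        // Average latency (cycles) = total_latency / delivered at the
        // end of the simulation run.
    endmodule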

3.2 Synthesis Results and Comparison

To estimate the size and speed of the routers, we synthesized our NoC design for the 130nm UMC CMOS standard cell library using Synopsys Design Compiler. Latency, area, power and operating frequency are tabulated below for honeycomb and mesh routers with four virtual channels.

Table 1. Comparison of routers in different topologies

  Topology     Latency (cycles)   Area (mm^2)   Power (mW)   Frequency (MHz)
  Honeycomb           6            0.134481       36.33            500
  Mesh                4            0.153767       43.93            500

We compare the latency (for traffic traces of the DSP kernel), area and power of the router in the honeycomb topology with those of a router in a mesh topology, both operating at 500 MHz. As Table 1 shows, the honeycomb router has a 12.5% area advantage and a 17.3% power advantage over the mesh router, at the cost of higher latency (6 cycles versus 4).

4 Conclusion

In this paper we have presented the architecture of a single cycle NoC router operating at 500MHz as the communication backbone of a polymorphic ASIC. The NoC router facilitates runtime reconfiguration of the hardware substrate to create ASIC-like custom hardware regions for the various application substructures. The router design is optimized for area and occupies less than 5% of the typical 3 mm^2 tile size of NoC based SoCs. Network latency and throughput increase with the number of virtual channels. With 4 virtual channels, the NoC achieves an average latency of 6 clock cycles and a throughput of 87% of the network capacity for application specific traffic. This considerable improvement in latency and throughput for application specific traffic shows that the RECONNECT NoC is well suited to polymorphic ASICs.

Acknowledgments: This work was supported in part by the Ministry of Communication and Information Technology, Govt. of India. Keshavan Varadarajan is supported by an IBM Ph.D. Fellowship.


References

[1] International Technology Roadmap for Semiconductors, 2001.
[2] P. Bai et al. A 65nm Logic Technology featuring 35nm Gate Lengths, Enhanced Channel Strain, 8 Cu Interconnect Layers, Low-k ILD and 0.57 um^2 SRAM Cell. In IEDM '04, pages 657-660, Dec 2004.
[3] W. J. Dally. Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194-205, 1992.
[4] G. Varatkar and R. Marculescu. Traffic Analysis for On-Chip Networks Design of Multimedia Applications. In Design Automation Conference, pages 510-517. IEEE, June 2002.
[5] K. Goossens et al. AEthereal Network on Chip: Concepts, Architectures and Implementations. IEEE Design and Test of Computers, 2005.
[6] M. Millberg et al. Guaranteed Bandwidth using Looped Containers in Temporally Disjoint Networks within the Nostrum Network on Chip. In Proceedings of Design, Automation and Test in Europe (DATE). IEEE Computer Society, 2004.
[7] M. Alle et al. Synthesis of Application Accelerators on Runtime Reconfigurable Hardware. In ASAP '08, 2008.
[8] K. Park and W. Willinger. Self-Similar Network Traffic and Performance Evaluation. John Wiley & Sons, 2002.
[9] A. N. Satrawala et al. REDEFINE: Architecture of a SOC Fabric for Runtime Composition of Computation Structures. In FPL '07, 2007.
[10] I. Stojmenovic. Honeycomb Networks: Topological Properties and Communication Algorithms. IEEE Transactions on Parallel and Distributed Systems, volume 8, pages 1036-1042, 1997.

