Region-Based Routing: An Efficient Routing Mechanism To Tackle Unreliable Hardware in Network On Chips


J. Flich, A. Mejia, P. López, J. Duato
Dept. de Informática de Sistemas y Computadores
Universidad Politécnica de Valencia
P.O.B. 22012, 46022 - Valencia, Spain
E-mail: {jflich,andres,plopez,jduato}@gap.upv.es
Abstract
The design of scalable and reliable interconnection networks
for System on Chips (SoCs) introduces new design
constraints not present in current multicomputer systems.
Although regular topologies are preferred for building
NoCs, heterogeneous blocks, fabrication faults and reliabil-
ity issues derived from the high integration scale may lead
to irregular topologies. In this situation, efficient routing
becomes a challenge. Although table-based routing allows
the use of most routing algorithms on any topology, it does
not scale in terms of latency and area.
In this paper we propose the region-based routing me-
chanism that avoids the scalability problems of table-based
solutions. From an initial topology and routing algorithm,
the mechanism groups, at every switch, destinations into di-
fferent regions based on the output ports. By doing this,
redundant routing information typically found in routing ta-
bles is eliminated. Evaluation results show that the mecha-
nism requires only four regions to support several routing
algorithms in a 2D mesh with no performance degradation.
Moreover, when dealing with link failures, our results indi-
cate that the mechanism combined with the Segment-Based
Routing algorithm is able to pack all the routing information
into eight regions providing high throughput. The paper
also provides a simple and efficient hardware implementation
of the mechanism requiring only 240 logic gates per
switch to support eight regions in a 2D mesh topology.
1 Introduction
System on Chips (SoCs) integrate into a single chip all
the parts that were found on a small printed circuit board of
the past. They are typically composed of several processor
cores together with application-specific circuitry. A SoC
generally reduces size and lowers power compared to less
integrated solutions.

This work was supported by the European Commission in the context
of the SCALA integrated project no. 27648 (FP6), by CONSOLIDER-
INGENIO 2010 under Grant CSD2006-00046 and by CICYT under Grant
TIN2006-15516-C04-01.

The solutions for SoC communication structures have
been characterized by custom designed ad hoc mixes of
buses and point-to-point links [2]. However, as the size of
the system increases, buses become not only a bottleneck
but also an inefficient solution in terms of power usage and
arbitration. This has motivated a shift in the communication
design paradigm from non-scalable bus-based architectures
to a shared global communication structure based on the use
of an on-chip data-routing network or a Network on Chip
(NoC) based architecture [1].
In a SoC, each core node (end node) is attached to the
NoC through a network adapter. The NoC is composed of
several switches and links interconnected following some
connection pattern or topology. End nodes communicate
with each other by sending packets to/from the NoC. Pack-
ets are transmitted across the links and through the switches
according to the switching technique. Wormhole switching
is the preferred choice for NoCs, as it requires small buffers
at each switch. In the absence of a full connection between all
switches, packets must be forwarded through several inter-
mediate switches until they reach their destination.
Concerning topology, most NoCs implement regular
forms of network topology that can be easily laid out on
a chip surface (a 2-dimensional plane). Therefore, 2D tori
or 2D meshes are the preferred choices. In some designs
[10], torus is discarded in favor of a mesh because in the
latter wires are shorter, thus leading to a lower and con-
stant link delay. However, some SoCs are composed of a
heterogeneous collection of ALUs, memories, FPGAs, etc.
Due to the heterogeneity of the SoC components, the resulting
interconnection topology is no longer regular. Moreover,
the high integration scale used in SoCs raises a number
of communication reliability issues. Crosstalk, power supply
noise, electromagnetic and intersymbol interference are
Proceedings of the First International Symposium on Networks-on-Chip (NOCS'07)
0-7695-2773-6/07 $20.00 2007
some of these issues. Moreover, fabrication faults may ap-
pear, in the form of defective core nodes, wires or switches.
In these cases, while some regions of the chip are defec-
tive, the remaining chip area may be fully functional. From
the NoC point of view, in the presence of such fabrication defects,
the initial regular topology has become an irregular
one. Nevertheless, the fact that the original topology was regular
may be exploited.
Routing determines the path that each packet follows between
a source-destination node pair. Routing is deterministic
if only one path is provided per node pair, or adaptive,
if several paths are available. Adaptive routing better balances
network traffic, thus allowing the network to obtain a
higher throughput. Routing strategies can also be classified
as source or distributed. In source routing, the source end
node computes the path and stores it in the packet header.
Since the header itself must be transmitted through the net-
work, it consumes network bandwidth. Source routing has
been used in some networks because switches are very sim-
ple. They only have to select the output port for a packet ac-
cording to the information stored in its header. In distributed
routing, each switch computes the next link that will be used
while the packet travels across the network. By repeating
this process at each switch, the packet reaches its destination
end node. The packet header only contains the destination
ID. Distributed routing has been used in most hardware
switches for efficiency reasons, since it allows more flexibility,
as different ports can be available to reach a given
destination at each switch (distributed adaptive routing).
Distributed routing can be implemented in different
ways. The approach followed in regular topologies of
large multicomputers was the so-called algorithmic routing,
which relies on a combinational logic circuit that computes
the output port to be used as a function of the current and
destination nodes and the status of the output ports. The
implementation is very efficient in terms of both area and
speed, but the algorithm is specific to the topology and to
the routing strategy used on that topology. Indeed, when us-
ing a fault-tolerant routing algorithm, additional hardware
must be used to properly support it. With the introduction
of clusters of workstations, switches based on forwarding
tables were proposed. In this case, there is a table at each
switch that stores, for each destination end node, the output
port that must be used. This scheme can be easily extended
to support adaptive routing by storing several outputs in
each table entry [7]. The main advantage of table-based
routing is that any topology and any routing algorithm can
be used, including fault-tolerant routing algorithms.
However, routing based on forwarding tables suffers
from a lack of scalability. The size of the forwarding table
grows linearly with network size and, more importantly,
the time required to access the table also depends on its
size, and, thus, on network size. In addition, power
consumption is not negligible. For these reasons, forwarding
tables are not suitable for NoCs. Several NoC design
frameworks for application-specific NoCs that address the
problem of routing in irregular meshes using forwarding tables
have been proposed in the literature [14, 12, 11]. In
[14] Srinivasan et al. proposed a linear programming algorithm
that minimizes the number of routers utilized in the
topology in order to reduce the traffic flowing in the network.
It takes into account the communication traces of
the application and generates only a unique route for every
communication. Unfortunately, the routing method provides
no adaptivity at all and does not state how deadlock freedom
is guaranteed. In [12] Schafer et al. proposed a floorplanning
method based on the Odd-Even routing algorithm.
This method is enhanced with special rules that define routing
patterns under certain types of irregularities. In order
to avoid deadlocks, the IP cores are required to have special
dimensions and, at the same time, the connection point of a
core must be situated in its south-west edge. This presents a
drawback as it limits the use of general purpose IP cores.
In [11], a mechanism that compresses forwarding tables
has been proposed by Palesi et al. Although it helps in
improving forwarding table scalability, it is aimed at supporting
a particular kind of routing algorithm where the actual
communication pattern among end nodes is considered
for deadlock freedom (application-specific routing). Moreover,
the mechanism is proposed for minimal path routing,
thus not allowing routing through networks with some
failed links/switches.
Our challenge in this paper is to develop a routing framework
for NoCs that is able to implement not only routing
algorithms for regular topologies but also for the irregular
ones that arise when some defects or faults appear in the network
(i.e., agnostic routing algorithms that can route packets
on any topology). For efficiency reasons, distributed
adaptive routing is a concern. In addition, the routing
algorithm should be computed quickly and the silicon area
and power should be as low as possible. Specifically designed
algorithmic routing is not feasible, as the topology
could be irregular, and distributed routing based on forwarding
tables is inefficient due to its scalability problems.
2 Motivation
When a packet arrives at a switch, the routing algorithm
obtains the corresponding output port by analyzing (i) the
input port used by the packet to reach the switch and (ii)
the destination node of the packet. The routing algorithm
is defined in the domain C × N → C (input port, destination
node, output port). Up*/down* [13] and derivatives are
examples of these routing algorithms. Sometimes, the information
about the input port is not required, obtaining the
output port only as a function of the destination node. In this
case, the domain is N × N → C (current node, destination
node, output port). Dimension-order routing algorithms for
meshes and tori are examples of this class.
From the domain point of view, a routing algorithm associates
with each input-output port combination a set that contains
the destination nodes that can be reached from that
port. Given a packet entering through a given input port, it
must be checked whether its destination belongs to the corresponding
set. If so, the output port can be used to forward
the packet.
Forwarding tables are a way to easily implement this
check. The input port and destination identifier of the packet
are used to obtain the output port. In this way, the set of destinations
associated with an input-output port combination is
spread throughout the table. The table describes this set by
enumerating all its elements.
Usually, the elements of the set have some common
properties that allow representing it in a more compact way.
This happens in regular topologies. For instance, in XY routing,
the input port is not used for routing, and the port in the
X dimension can be used if the X coordinates of the current
and destination nodes differ, regardless of the Y coordinate.
The destinations that match this condition define a planar
figure (a rectangle), and can be represented by using the
identifiers of two opposite corners. To check whether the
destination of a given packet belongs to the rectangle, all
we have to do is compare the X coordinate of the destination
with the X coordinates of the nodes that define the rectangle.
This is the approach followed by hardwired routing
algorithms, Interval Routing [6] and derivatives [4]. The
implementation is very efficient in terms of area, speed and
power consumption. The problem with these approaches is
that they do not support faults.
When some faults appear in a regular network or a truly
irregular network is used, the input port is again required
for routing, and the set of nodes associated with a given input-output
port combination is no longer represented by
a single rectangle. However, we could deal with this case
by representing this set by several rectangles that may intersect.
The corresponding input-output port combination
may be used if the destination is found inside any of them.
We will refer to each rectangle as a region. As several regions
will be associated with each input-output port combination,
the implementation of this region-based routing will
be half-way between hardwired algorithmic routing and forwarding
tables. The higher the number of regions, the closer
to the complexity of forwarding tables. In fact, a forwarding
table is a particular case of region-based routing, where
each region is composed of only one destination identifier.
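The membership check described above can be illustrated with a minimal sketch. It assumes that switches are identified by (row, col) pairs with row 0 at the top; the names `in_rect`, `in_any_region`, and the example rectangles are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Coord = Tuple[int, int]        # (row, col) of a switch; row 0 assumed at the top
Rect = Tuple[Coord, Coord]     # (top-left corner, bottom-right corner)

def in_rect(dst: Coord, rect: Rect) -> bool:
    """True if dst lies inside the rectangle, corners included."""
    (r1, c1), (r2, c2) = rect
    return r1 <= dst[0] <= r2 and c1 <= dst[1] <= c2

def in_any_region(dst: Coord, regions: List[Rect]) -> bool:
    """The port combination may be used if dst falls inside any region."""
    return any(in_rect(dst, rect) for rect in regions)

# Hypothetical destination set described by two intersecting rectangles.
regions = [((0, 0), (2, 3)), ((2, 2), (5, 3))]

# Degenerate case: a forwarding table enumerates every destination, which
# corresponds to one single-switch region per table entry.
table_like = [((r, c), (r, c)) for (r, c) in [(0, 0), (0, 1), (1, 0)]]
```

With few rectangles the check costs a handful of comparisons; with one rectangle per destination it degenerates into the enumeration performed by a forwarding table.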
The total amount of logic for routing using regions de-
pends on the number of required regions. At the same time,
the number of regions depends on the regularity of the to-
pology (the number of failures and their position) and the
routing algorithm used. However, in some cases (regular
topologies with faults) different input ports may share the
same set of destinations mapped to the same output port.
We can exploit this fact by associating each region only to
an output port, and also by coding in the region which input
ports share the set of destinations. This trick may help in re-
ducing the number of regions. On the other hand, the num-
ber of regions is closely related to the number of routing
options. In fact, a way of reducing the number of regions is
to limit the adaptiveness of the routing algorithm.
In this paper, we propose the region-based routing as an
alternative implementation of routing algorithms in NoCs.
While providing the required flexibility to support fault
tolerance, it is efficient in terms of area, speed and power consumption,
thus being appropriate for large networks.
Additionally, we combine in this paper the region-based
routing mechanism with an appropriate routing algorithm
for regular networks with failures. The Segment-Based
Routing (SR) algorithm [8] will be key to provide high net-
work throughput while requiring an acceptable number of
regions. SR will benefit from the regular structure
of the topology even in the presence of failures.
The rest of the paper is organized as follows. Section
3 describes the region-based routing mechanism, first by
providing the foundations of the mechanism, then by de-
scribing and analyzing a very simple hardware implementa-
tion. Then, we will describe the way regions are computed.
Later, in Section 4, we present an analysis of the number of
regions and the performance achieved by region-based routing
for different traffic patterns. Cost, area, and latency of
the hardware proposal are also explored. Finally, in Section
5 we set the basis for future work and draw some conclu-
sions.
3 Region-Based Routing
3.1 Overview
The basic idea behind the mechanism is to group at
every switch different destinations within different regions.
Roughly, each region will be defined by all the destinations
that can be reached using the same set of output ports at
the switch where the region is defined. As an example, Figure 1
shows an 8 × 8 mesh network with seven link failures.
Ports at every switch are labeled as N (north), E (east), W
(west), S (south) and I (internal). The figure also shows
the set of bidirectional routing restrictions defined by the
applied routing algorithm (see footnote 1). A routing restriction is defined
by two consecutive links in a given switch. Both links cannot
be used consecutively by any packet in order to guarantee
deadlock freedom.
(1) SR has been applied for this case; it will be described in Section 3.4.
[Figure 1. Example of region definitions.]
In Figure 1, some switches have been grouped into regions
(r1, r2, r3). Regions are defined for the switch highlighted
with a circle (this switch will be referred to as the
reference switch). All the packets arriving at the reference
switch (or being generated at its local port) addressed to any
destination included in region r1 necessarily need to use the
W output port at the reference switch. Notice that if they
used the N port, they would need either to cross a routing restriction,
which is forbidden, or to use a non-minimal route, which
is inefficient. The same happens when using the E and S
ports. Therefore, at the reference switch, region r1 includes
destinations that can be reached only through the W port.
There are other destinations that can be reached using
only the W port (i.e., switches within region r3). However,
the mechanism relies on defining rectangular regions for the
sake of implementation simplicity; thus, two regions (r1 and r3) are
defined for the same output port.
An interesting property of the mechanism is that regions
can be defined for more than one output port, thus allowing
the adaptiveness inherent to the applied routing algorithm.
For instance, region r2 is defined by those switches that can
be reached either using the N or W ports at the reference
switch.
The input port used by the packet to arrive at the reference
switch must also be taken into account when defining
regions. For instance, at first sight we can use ports W
and S to reach region r4. However, the W output port should
only be used when the packet does not reach the reference
switch through the N port, as it would otherwise cross a routing restriction
at the reference switch and could potentially lead
to deadlock. Therefore, regions must be defined based on
the set of destinations, the output ports and the input ports.
In the figure, region r4 is defined for packets using the input
ports E and I, and region r5 is defined for packets using
the input port N. Notice that the sets of destinations of two
regions can overlap, but they will differ in the set of
input ports.
To summarize, a set of regions is computed at every
switch. Each region is defined by the possible input ports
used by the packets, a subset of the output ports that may
be used and the potential destinations that can be reached.
Additionally, as regions are defined by rectangular boxes of
switches in a 2D mesh network, we can denote a region for
switch x as R_x(iport_list, {sw_ref1, sw_ref2}, oport_list),
where sw_ref1 is the top-left-most switch and sw_ref2 is the
bottom-right-most switch.
Therefore, whenever a packet arrives at a switch, all
the regions must be inspected in order to detect which
regions are suitable for routing the packet. The routing control
unit will thus select the set of output ports provided by
the matched regions and deliver all the possible output ports
to the arbiter in charge of selecting the most convenient one.
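As an illustration of this lookup, the following sketch models a region as a record with an input port list, two corner switches and an output port list, and collects the output ports of every matching region. The `Region` class, the `candidate_ports` function, the coordinates and the output port set given to r5 are assumptions made for the example; only the input port sets of r4 and r5 come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Coord = Tuple[int, int]  # (row, col); row 0 is assumed to be the top row

@dataclass(frozen=True)
class Region:
    iports: FrozenSet[str]   # input ports the region applies to
    top_left: Coord          # sw_ref1
    bottom_right: Coord      # sw_ref2
    oports: FrozenSet[str]   # output ports offered by the region

def candidate_ports(regions: List[Region], iport: str, dst: Coord) -> FrozenSet[str]:
    """Union of the output ports of all regions matching (iport, dst)."""
    ports = set()
    for reg in regions:
        (r1, c1), (r2, c2) = reg.top_left, reg.bottom_right
        if iport in reg.iports and r1 <= dst[0] <= r2 and c1 <= dst[1] <= c2:
            ports |= reg.oports
    return frozenset(ports)

# Two regions covering the same (hypothetical) destinations but differing in
# their input ports, in the spirit of r4 and r5 in the text.
r4 = Region(frozenset("EI"), (5, 0), (7, 2), frozenset("WS"))
r5 = Region(frozenset("N"), (5, 0), (7, 2), frozenset("S"))
regions = [r4, r5]
```

The returned set is what would be handed to the arbiter; an empty set means that no region at this switch covers the packet for that input port.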
3.2 Hardware Description
In this section we describe a simple implementation of
the region-based routing mechanism for an N × N mesh network.
Figure 2 shows the proposed implementation. The
mechanism requires nearly the same functional blocks as
any other NoC switch, that is: the input ports, the routing
control unit, the arbiter and scheduler, the crossbar and the
output ports.
The input port for wormhole switches generally requires
two different blocks, an input port controller (IPC) that
manages the input buffer and transmits the status informa-
tion to neighboring nodes, and a header decoder (HD) that
decodes the header information of every packet. The packet
header should be as compact as possible in order to keep the
minimum overhead. Among other information it contains
the destination identifier of the packet. In our proposal,
absolute addressing is used, and the packet ID identifies
the X and Y coordinates of the destination (the MSBs
indicate the row (RowDst) and the LSBs indicate the column
(ColDst)). Once decoded, the coordinates of the destination
are compared against the coordinates of the current
switch. If equal, the packet is delivered to the internal port,
otherwise, the coordinates of the destination are sent to the
routing control unit.
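The decoding step can be sketched as follows, assuming that N is a power of two so that each coordinate takes exactly log2(N) bits. The function names and the "internal" return value are hypothetical choices for the example.

```python
from typing import Tuple, Union

def decode_destination(dest_id: int, n: int) -> Tuple[int, int]:
    """Split a destination ID into (RowDst, ColDst) for an n x n mesh.

    Assumes n is a power of two: the MSBs hold the row, the LSBs the column.
    """
    bits = (n - 1).bit_length()            # bits per coordinate
    row_dst = dest_id >> bits              # most significant bits
    col_dst = dest_id & ((1 << bits) - 1)  # least significant bits
    return row_dst, col_dst

def header_decoder(dest_id: int, n: int, cur_row: int, cur_col: int
                   ) -> Union[str, Tuple[int, int]]:
    """Deliver locally if the packet has arrived, else forward the coordinates."""
    row_dst, col_dst = decode_destination(dest_id, n)
    if (row_dst, col_dst) == (cur_row, cur_col):
        return "internal"                  # packet goes to the internal port
    return (row_dst, col_dst)              # coordinates sent to the RC unit
```

For an 8 × 8 mesh, the ID 0b101011 decodes to row 5 and column 3.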
The routing control (RC) unit is made of different logic
modules, one for each possible region defined in the switch
for the region-based routing. Each module has six registers:
the input port (IP) register with one bit per input
port (NEWSI; North, East, West, South, and Internal), the
ROW1, COL1, ROW2, COL2 registers with log2(N) bits
each (these registers contain the coordinates of the top-left-most
switch and the bottom-right-most switch of the region),
and the output port (OP) register with one bit per output
port (NEWS) (this register does not take into account the
internal port as the packets going to the current switch have
previously been delivered to the internal port). These registers
must be programmed at network boot time, before routing any packet.
The way register values are computed will
be described in Section 3.3.
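The register widths listed above give a simple storage count per region. The sketch below only adds up those widths, assuming N is a power of two; it does not model the gate count of the logic, and the function name is hypothetical.

```python
def region_register_bits(n: int) -> int:
    """Register bits needed by one region module in an n x n mesh."""
    coord_bits = (n - 1).bit_length()  # log2(n) bits per coordinate register
    ip_bits = 5                        # IP register: N, E, W, S, I
    op_bits = 4                        # OP register: N, E, W, S
    return ip_bits + 4 * coord_bits + op_bits

bits_per_region = region_register_bits(8)   # 5 + 4 * 3 + 4 = 21
bits_for_eight_regions = 8 * bits_per_region  # 168
```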
[Figure 2. Proposed hardware implementation of the region-based routing mechanism. The figure shows the switch input ports (North, East, West, South and Local, each with an IPC and a header decoder (HD) that extracts RowDst and ColDst), the routing control unit (RC) with one module per routing region (a region trigger (RT) fed by the IP register, a region matcher (RM) comparing RowDst and ColDst against Row1, Col1, Row2 and Col2, and an output port selector fed by the OP register), the output port status signals, the arbiter, the crossbar and the switch output ports.]
Once the packet header has been decoded, it is passed to
the RC unit together with the identifier of the input port through which the
packet arrived at the switch. At the RC unit this information
is compared against the predefined regions. In order to
save power, a pre-selection of regions is made
according to the IP registers. That is, we skip checking
the regions whose input port registers do not match the
input port of the packet being routed. For example, if the packet has arrived
at the switch through the south port, we discard regions
defined at the south of the switch. To achieve this in hardware,
the RT (region trigger) unit matches the input port of
the packet with the input port register of every region. The
RT unit consists of five AND gates (one per input port) and
a single OR gate. The output of the RT unit is a signal that
triggers the rest of the logic for the region.
After checking the input port, and only on success, the
rest of the logic for a region is activated. In particular, the
logic determines if the destination is within the boundaries
defined by the region. To achieve this, the row and column
of the packet's destination are compared with the contents
of the ROW1, COL1, ROW2, and COL2 registers. For this,
four magnitude comparators are used at the region matcher
(RM) unit. If all the comparisons are true, the destination
of the packet is within the region and therefore the output
ports defined for the region must be considered for routing
purposes. Thus, the logic selects the output ports previously
stored in the OP register. Notice that the implementation
allows different output ports to be selected from the
same region (adaptivity).
Once all regions have been evaluated, all the selected
output ports from all the regions that matched the packet are
ORed and passed to the arbiter. The arbiter may choose one
output port based on different criteria. Then the packet will
be routed to the next switch until it arrives at its destination.
As shown, the control unit of the region-based routing
mechanism can be implemented using a very simple and
fast combinational logic circuit (see footnote 2), as shown within the
dotted square in Figure 2. Note that the RT and RM units have
been serialized (one is performed after the other). Although
such a decision will increase latency, it will reduce energy
consumption. Thus, there is a trade-off between the latency
penalty (routing time) and the energy consumed that must be
evaluated when designing the NoC.
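A bit-level software model of the RC unit may help to follow the data path: region trigger, region matcher, then the OR of the selected OP registers. This is only a behavioural sketch; the one-hot bit order, the `RegionRegs` tuple and the function names are assumptions, not the authors' hardware description.

```python
from typing import List, NamedTuple

# One-hot port encodings (assumed bit order; the paper only fixes the labels).
IN_BITS = {"N": 0b10000, "E": 0b01000, "W": 0b00100, "S": 0b00010, "I": 0b00001}
OUT_BITS = {"N": 0b1000, "E": 0b0100, "W": 0b0010, "S": 0b0001}

class RegionRegs(NamedTuple):
    ip: int    # IP register, one bit per input port (NEWSI)
    row1: int  # top-left row
    col1: int  # top-left column
    row2: int  # bottom-right row
    col2: int  # bottom-right column
    op: int    # OP register, one bit per output port (NEWS)

def region_trigger(regs: RegionRegs, in_port_onehot: int) -> bool:
    """Five AND gates (bitwise AND) followed by one OR gate (non-zero test)."""
    return (regs.ip & in_port_onehot) != 0

def region_matcher(regs: RegionRegs, row_dst: int, col_dst: int) -> bool:
    """Four magnitude comparators; all must hold for a match."""
    return (row_dst >= regs.row1 and regs.row2 >= row_dst
            and col_dst >= regs.col1 and regs.col2 >= col_dst)

def routing_control(regions: List[RegionRegs], in_port: str,
                    row_dst: int, col_dst: int) -> int:
    """OR together the OP registers of every region that matches the packet."""
    selected = 0
    for regs in regions:
        # The matcher is evaluated only if the trigger fires, mirroring the
        # serialized RT -> RM arrangement described in the text.
        if region_trigger(regs, IN_BITS[in_port]) and region_matcher(regs, row_dst, col_dst):
            selected |= regs.op
    return selected

# Hypothetical register contents for two regions of one switch.
regions = [
    RegionRegs(IN_BITS["E"] | IN_BITS["I"], 0, 0, 3, 2, OUT_BITS["W"]),
    RegionRegs(IN_BITS["E"] | IN_BITS["N"] | IN_BITS["I"], 0, 0, 1, 7, OUT_BITS["N"]),
]
```

The returned bit mask plays the role of the candidate output ports passed to the arbiter; a value of zero means that no region matched.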
3.3 Software Description
The algorithm for computing regions is divided into se-
quential phases. Figure 3 shows the different phases. Ini-
tially, as input parameters, the algorithm receives the topo-
logy of the network and a set of routing restrictions. The
network can be a 2D mesh of any size and with or without
failures. As an example, Figure 3 shows a 2D mesh with a
link failure.
The set of routing restrictions depends on the routing algorithm
applied to the network. Notice that considering the
set of routing restrictions allows computing, in a later phase,
the entire set of possible routing paths for every source-destination
pair. Thus, routing is not restricted at this stage.
(2) Latency and area analysis are performed in Section 4.
(3) Minimal paths are searched by assuming that the underlying topology
is a mesh network, possibly with some failed links.
[Figure 3. Stages in the computation process of regions: the topology and routing restrictions set feeds path computation, followed by region computation and region merge (which takes max_regions), producing the region set.]
In a first phase, the algorithm computes the possible set
of routing paths for every source-destination pair. This is a challenging
problem since the number of possible paths can be
extremely high. For this, the algorithm first tries to compute
all the minimal paths for every source-destination pair (see footnote 3). If,
for a particular pair of nodes, there is no valid minimal path
(in case the topology is irregular), then it searches for paths
allowing an increasing number of misrouting hops. Once
a non-minimal valid path is found, the algorithm switches
to the next pair of nodes. Thus, the algorithm finds all the
possible minimal paths and, when necessary, finds a unique
non-minimal path. As an example, Figure 4 shows different
paths computed for the 2D mesh. All the paths have in
common the destination (switch d).
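The paper does not detail how paths are searched, so the following is only one possible sketch of the idea. It runs a breadth-first search over (switch, input port) states, skipping failed links and forbidden turns, and reports how many hops beyond the mesh distance the shortest legal path needs. It does not enumerate all minimal paths as the algorithm above does. The data layout (`failed`, `restricted`), the function names and the ban on U-turns are assumptions.

```python
from collections import deque

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def shortest_legal_hops(rows, cols, failed, restricted, src, dst):
    """Hop count of the shortest path that respects the restrictions, or None.

    failed:     set of frozenset({a, b}) links that are down.
    restricted: set of (switch, in_port, out_port) turns that are forbidden.
    """
    start = (src, "I")                 # packets are injected at the internal port
    dist = {start: 0}
    queue = deque([start])
    while queue:
        sw, iport = queue.popleft()
        if sw == dst:
            return dist[(sw, iport)]
        for oport, (dr, dc) in MOVES.items():
            nxt = (sw[0] + dr, sw[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue               # outside the mesh
            if oport == iport:
                continue               # assumed: no U-turns
            if frozenset((sw, nxt)) in failed or (sw, iport, oport) in restricted:
                continue               # failed link or forbidden turn
            state = (nxt, OPPOSITE[oport])
            if state not in dist:
                dist[state] = dist[(sw, iport)] + 1
                queue.append(state)
    return None

def misrouting_hops(rows, cols, failed, restricted, src, dst):
    """Extra hops with respect to the mesh (Manhattan) distance, or None."""
    hops = shortest_legal_hops(rows, cols, failed, restricted, src, dst)
    if hops is None:
        return None
    return hops - (abs(src[0] - dst[0]) + abs(src[1] - dst[1]))
```

A result of zero means that a valid minimal path exists for the pair; a positive value is the smallest number of misrouting hops a valid path requires.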
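[Figure 4. Routing paths and routing options. For one of the two highlighted switches, the routing options (N,d,S), (W,d,E), (I,d,S), (W,d,S) and (I,d,E) are packed into ({N},d,{S}) and ({W,I},d,{S,E}). For the other, the routing options (N,d,S), (N,d,E), (W,d,S), (W,d,E), (I,d,S) and (I,d,E) are packed into ({N,W,I},d,{S,E}).]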
The routing paths are computed and stored in a distributed
way. In particular, the algorithm packs all the routing
information into a set of routing options. Also, routing options
are grouped per switch where they are applied. Each routing
option (ip, dst, op) is defined by an input port (ip), a
destination (dst), and an output port (op), and indicates that
a packet at the switch where the routing option is defined
that is coming through the input port ip and going to destination
dst may take output port op. For instance, in Figure
4, routing options for the switches plotted as a box and as a circle
are shown. There, (N, d, S) at the box switch
indicates that a packet coming through the N (north) port and going
to destination d can be routed through the S output port.
Additionally, routing options may be packed per output
port and per input port. First, packing per output port is performed.
Two routing options at the same switch are packed
if both have the same input port and the same destination
ID. Therefore, both routing options (N,d,S) and (N,d,E)
can be packed into the routing option (N,d,{S,E}). Notice
that by doing this, we are representing all the possible
adaptive routing options in a single routing option.
Once routing options are packed per output port, further
packing can be done, now depending on the input port. If
two routing options at a given switch have the same destination
ID and the same set of output ports, then they
can be further packed into one routing option. The set of input
ports of the resulting routing option is the union of all the
input ports of the routing options being packed. For instance,
in Figure 4, routing option ({N,W,I},d,{S,E})
has been obtained by packing the (N,d,{S,E}), (W,d,{S,E}),
and (I,d,{S,E}) routing options.
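The two packing steps can be sketched for a single switch as follows. The function name and the tuple layout are hypothetical; the example data are the routing options listed in Figure 4.

```python
from collections import defaultdict

def pack_routing_options(options):
    """Pack (ip, dst, op) triples of one switch into (iports, dst, oports).

    Step 1 merges options that share input port and destination (packing
    per output port). Step 2 merges the results that share destination and
    output port set (packing per input port).
    """
    by_ip_dst = defaultdict(set)
    for ip, dst, op in options:
        by_ip_dst[(ip, dst)].add(op)

    by_dst_ops = defaultdict(set)
    for (ip, dst), ops in by_ip_dst.items():
        by_dst_ops[(dst, frozenset(ops))].add(ip)

    return {(frozenset(ips), dst, ops) for (dst, ops), ips in by_dst_ops.items()}

# Routing options of the two switches shown in Figure 4.
first = [("N", "d", "S"), ("W", "d", "E"), ("I", "d", "S"),
         ("W", "d", "S"), ("I", "d", "E")]
second = [("N", "d", "S"), ("N", "d", "E"), ("W", "d", "S"),
          ("W", "d", "E"), ("I", "d", "S"), ("I", "d", "E")]
```

Applied to the first list, the sketch yields ({N},d,{S}) and ({W,I},d,{S,E}); applied to the second, it yields the single option ({N,W,I},d,{S,E}), matching the figure.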
In the second phase, the algorithm computes the routing
regions from the routing options. At every switch the algorithm
groups destinations reachable through the same output
ports at the switch and coming from the same set of input
ports. For instance, Figure 5 shows some regions computed
for the selected switch. In particular, only regions defined
for output ports S and E are plotted. The first region
(R1) includes switches inside the box defined by switches
x1 and x2. This region is defined for destinations that are
reachable only through the S output port and for packets
coming from input port W, N, or I. In the same sense,
region R2 is defined for packets using output port E and
coming through input ports N, W, or I. Finally, region R3
is defined for packets coming from input ports N, W
or I, which can use either output port E or S. Notice that
routing is not restricted, since the different regions are computed
based exclusively on the output ports used.
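The text does not say how the boxes are found, so the sketch below uses a simple greedy heuristic chosen for illustration: destinations are grouped by their (input ports, output ports) pair and each group is split into rectangles by growing a box to the right and then downwards. The function names and the heuristic are assumptions, and the result is a valid cover, not necessarily the smallest one.

```python
from collections import defaultdict

def split_into_boxes(cells):
    """Greedy partition of a set of (row, col) cells into rectangles."""
    remaining = set(cells)
    boxes = []
    while remaining:
        r1, c1 = min(remaining)              # top-most, then left-most cell
        c2 = c1
        while (r1, c2 + 1) in remaining:     # grow to the right
            c2 += 1
        r2 = r1
        while all((r2 + 1, c) in remaining for c in range(c1, c2 + 1)):
            r2 += 1                          # grow downwards while the row is full
        boxes.append(((r1, c1), (r2, c2)))
        remaining -= {(r, c) for r in range(r1, r2 + 1)
                      for c in range(c1, c2 + 1)}
    return boxes

def compute_regions(packed_options):
    """Turn (iports, dst, oports) options of one switch into regions.

    dst is a (row, col) pair; each region is (iports, box, oports).
    """
    groups = defaultdict(set)
    for iports, dst, oports in packed_options:
        groups[(frozenset(iports), frozenset(oports))].add(dst)
    return [(ip, box, op)
            for (ip, op), cells in groups.items()
            for box in split_into_boxes(cells)]
```

For an L-shaped group of five cells the heuristic returns two boxes, which mirrors why r1 and r3 in Figure 1 are separate regions for the same output port.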
[Figure 5. Some computed regions. The figure lists R1 ({N,W,I}, {x1,x2}, S), R2 ({N,W,I}, {x3,x4}, E) and R3 ({N,W,I}, {x5,x6}, {S,E}). It also lists R1 ({N,W,I}, {x1,x2}, S) together with R2 ({N,W,I}, {x3,x6}, E), the latter covering the box from x3 to x6.]
Finally, in the last phase, the algorithm tries to pack all
the regions in order to bound the maximum number of allowed
regions. For this, the algorithm takes a third parameter
(the max_regions parameter) that indicates the maximum
number of regions allowed at any switch. The algorithm
will pack regions by restricting routing. This will be done
by reducing the number of output ports that can be used at
a given switch to reach a set of destinations. However, the
algorithm will guarantee that there will always be an output
port that can be used (i.e., connectivity is ensured by the
algorithm).
In order to reduce the number of routing options, the
algorithm focuses on switches using a number of regions
higher than the maximum allowed (the max_regions parameter).
Then, at a given switch, it compares each region with
the remaining ones, checking if they can be merged. Two
regions can be merged if the combination of both regions
defines a box of switches and the output ports of one of the
regions are a subset of the output ports of the other region.
For instance, from the previous example, the algorithm detects
that regions R2 and R3 can be merged into one region.
Indeed, the output port of region R2 (E) is a subset of
the output ports of R3 ({S, E}). However, the resulting region
will restrict routing. In order to keep deadlock freedom, the
output ports of the resulting region will result from the intersection
of the output ports of the two regions being merged.
Thus, in the example, the output port of the resulting region
is E. This way, we prevent the use of the S output port for
packets going to R2 (which is not allowed by the original
routing algorithm).
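The merge test can be written down as a short sketch. Regions are modelled as (iports, box, oports) tuples with hypothetical coordinates. Requiring both regions to have the same input port set is an assumption of this sketch: the text only states the box and subset conditions, and in its example both regions share the input ports {N,W,I}.

```python
def bounding_box(a, b):
    """Smallest box containing boxes a and b (each ((r1, c1), (r2, c2)))."""
    (ar1, ac1), (ar2, ac2) = a
    (br1, bc1), (br2, bc2) = b
    return ((min(ar1, br1), min(ac1, bc1)), (max(ar2, br2), max(ac2, bc2)))

def cells(box):
    """Set of (row, col) switches covered by a box."""
    (r1, c1), (r2, c2) = box
    return {(r, c) for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)}

def try_merge(reg_a, reg_b):
    """Return the merged region, or None if the two regions cannot be merged."""
    ip_a, box_a, op_a = reg_a
    ip_b, box_b, op_b = reg_b
    merged_box = bounding_box(box_a, box_b)
    # The combination must itself be a box of switches.
    forms_box = (cells(box_a) | cells(box_b)) == cells(merged_box)
    # The output ports of one region must be a subset of the other's.
    nested_ports = op_a <= op_b or op_b <= op_a
    if ip_a != ip_b or not forms_box or not nested_ports:
        return None
    # Intersecting the output ports restricts routing but never adds an option.
    return (ip_a, merged_box, op_a & op_b)

ip = frozenset({"N", "W", "I"})
R1 = (ip, ((4, 0), (7, 0)), frozenset({"S"}))
R2 = (ip, ((0, 4), (3, 5)), frozenset({"E"}))
R3 = (ip, ((0, 6), (3, 7)), frozenset({"S", "E"}))
```

Here `try_merge(R2, R3)` gives a single region with output port E, as in the example above, while `try_merge(R1, R2)` returns None because neither output port set contains the other.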
Although routing is restricted, we expect this limitation
to have a low impact on final performance. There are two
reasons for this claim. First, routing is restricted
at some particular switches; thus, for a given particular path,
alternative routing options will be eliminated only at some
switches. The second reason is that the packing will be performed
only at those switches with a high number of regions.
As we will see, most of the switches will require a
very low number of regions, thus requiring no packing.
As an example, Figure 6 shows a mesh with 64 switches
and seven link failures. The applied routing algorithm is
SR (see Section 3.4). The figure shows the computed regions
for the switch marked with a black circle before
(Figure 6.a) and after restricting routing (Figure 6.b).
We can see that initially up to 14 regions (r0, r1, ...,
r13 in the figure) are required for the switch to properly
route packets (the example shows the case for the switch
with the maximum number of regions). For instance, r0
defines the switches that are reachable from the switch by
using the W output port. As can be noticed, regions r0 and
r1 are reached by using different sets of output ports
(output ports N and W for r1). Thus, the maximum adaptivity
offered by the underlying routing algorithm is preserved
by the computed regions.
In order to reduce the number of regions, routing
restrictions are applied. In particular, Figure 6.b shows
the minimum number of regions achievable. As can be
observed, former regions r0, r3, r4, r6, r8, r10, and r12
have been packed into the new region r0. For this, routing
has been restricted, and all of these destinations can now
be reached only through the W output port. Also, former
regions r1, r2, and r5 have been packed into region r1,
and regions r7 and r11 into region r4. Notice also that
former regions r13 and r9 have been kept (new regions r2
and r3, respectively). In the end, only five regions are
required for the plotted switch, at the expense of some
routing restrictions. If, however, the number of allowed
regions is higher than five, more routing options are
preserved.
3.4 Segment-based Routing (SR)
The segment-based routing algorithm (SR) is a simple,
effective, and topology-agnostic routing algorithm that
does not rely on the use of virtual channels and obtains
good performance for any topology. Its key characteristic
is that, in the presence of any possible combination of
faults, SR benefits from the regularity inherent in the
underlying topology. It always provides a valid and
deadlock-free path for every source-destination pair of
nodes, and it computes these paths quickly. SR has been
proposed for networks with regular [8] and irregular [9]
topologies.
Figure 7. Example of computing SR. (Figure labels: switches A to P, links 1 to 22, segments S1 to S7, Starting Switch, first subnet.)
Figure 6. Regions for switch (3, 7). (a) Before packing the regions and (b) after maximum packing.
Basically, the algorithm splits the network into different
network segments (groups of switches and links). Thus, a
network is formed by one or more segments, and each
network component belongs to a unique segment. As an
example, Figure 7 shows a regular topology with two induced
link failures. From this topology, SR computes the
segments described in Figure 7. In particular, seven
segments have been computed (from S1 to S7). A segment is
defined as a list of interconnected switches and links.
Segment S1 consists of switches {A, B, F, E} and links
{1, 2, 3, 4}. Segment S2 consists of switches {C, G} and
links {5, 6, 7}, while segment S6 is formed only by switch
{I} and links {16, 17}, and so on for the rest of the
segments. The main rule followed to compute segments is
that every newly computed segment (except the first one)
must start and end on an already computed one. Notice that,
except for the initial segment, the remaining segments
start and end on a switch that is already part of a
previously computed segment (e.g., S2 starts and ends on
S1).
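The two properties stated above (every component belongs to exactly one segment, and every segment after the first starts and ends on already computed segments) can be checked with a short sketch. The function name valid_segments and the link endpoints in LINK_ENDS are assumptions made for illustration: the text gives the switch and link sets of S1 and S2, but not which switches each numbered link connects.

```python
from typing import Dict, List, Sequence, Set, Tuple

Segment = Tuple[Sequence[str], Sequence[int]]   # (switches, links)


def valid_segments(segments: List[Segment],
                   link_ends: Dict[int, Tuple[str, str]]) -> bool:
    """Check uniqueness of membership and the start/end rule."""
    seen_switches: Set[str] = set()
    seen_links: Set[int] = set()
    for index, (switches, links) in enumerate(segments):
        own_switches, own_links = set(switches), set(links)
        # Each switch and link may belong to one segment only.
        if own_switches & seen_switches or own_links & seen_links:
            return False
        if index > 0:
            # Link ends that fall outside the segment's own switches are the
            # points where the segment attaches to the rest of the network.
            attach = [s for l in own_links for s in link_ends[l]
                      if s not in own_switches]
            # Exactly two attachment points, both on earlier segments.
            if len(attach) != 2 or any(s not in seen_switches for s in attach):
                return False
        seen_switches |= own_switches
        seen_links |= own_links
    return True


# Assumed endpoints: S1 is the cycle A-B-F-E, S2 hangs off B and F via C and G.
LINK_ENDS = {1: ("A", "B"), 2: ("B", "F"), 3: ("F", "E"), 4: ("E", "A"),
             5: ("B", "C"), 6: ("C", "G"), 7: ("G", "F")}
S1 = (["A", "B", "F", "E"], [1, 2, 3, 4])
S2 = (["C", "G"], [5, 6, 7])
ok = valid_segments([S1, S2], LINK_ENDS)
```

The check treats the first segment as unconstrained, and it does not model segments that end in a dead end, which the text above does not describe.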
As soon as the complete topology is partitioned into seg-
ments, SR adds a bidirectional routing restriction to every
segment. By partitioning the network into independent seg-
ments, SR is able to place a routing restriction in a seg-
ment independently of the remaining segments. Any pos-
sible combination will end up in a routing algorithm that is
deadlock-free and keeps connectivity among all end nodes
(see [8] for a demonstration). Figure 7 shows all the possi-
ble routing restrictions that can be placed on any segment.
For example, in S1 we can place a bidirectional routing re-
striction at switches B, E or F. From all the possible com-
binations, SR selects only one routing restriction in each
segment. As an example, we have selected just one rou-
ting restriction (the ones in boldface) in each of the seven
segments in Figure 7. Notice that the way the network
is divided into segments and the way routing restrictions
are placed in every segment ensure that no cycles are
formed anywhere in the network.
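A bidirectional routing restriction can be modelled as a forbidden turn between two links at a switch, in both directions. The sketch below only illustrates that reading; the helper names and the link numbering (link 1 = A-B, link 2 = B-F, link 5 = B-C) are hypothetical and do not come from Figure 7.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Restriction = Tuple[str, FrozenSet[int]]   # (switch, the two links it separates)
Turn = Tuple[int, str, int]                # (incoming link, switch, outgoing link)


def restriction(switch: str, link_a: int, link_b: int) -> Restriction:
    """Build a bidirectional restriction between two links at a switch."""
    return (switch, frozenset((link_a, link_b)))


def path_allowed(turns: Iterable[Turn], restrictions: Set[Restriction]) -> bool:
    """True if no turn of the path crosses a routing restriction."""
    # Storing the link pair as a frozenset makes the check direction-agnostic.
    return all((switch, frozenset((link_in, link_out))) not in restrictions
               for link_in, switch, link_out in turns)


# Hypothetical restriction at switch B between links 1 and 2.
restrictions = {restriction("B", 1, 2)}
blocked = path_allowed([(1, "B", 2)], restrictions)     # crosses the restriction
detour = path_allowed([(1, "B", 5)], restrictions)      # uses another link at B
```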
Finally, source-destination paths are calculated follow-
ing the path balancing algorithm described in [3]. This
method minimizes the deviation of link weight. Then, rou-
ting tables are calculated and distributed to every switch
within the network. However, in this paper, the adaptive
version of SR is proposed. To achieve this objective, SR
follows the routing methodology for meshes reported in [8]
(shown in Figure 7). The main difference with the previ-
ous method is that we are not required to calculate routing
tables. Instead, the region-based routing mechanism will
automatically calculate the paths based on the locations of
the routing restrictions provided by SR.
The way segments are computed and restrictions are
placed influences the final performance of the algorithm.
Figure 8 shows two methods that will be explored in this
paper. The first one (Figure 8.(a)), referred to as SR,
searches segments from top to bottom, changing direction
at each row. The second one (Figure 8.(b)), referred to as
SR2, searches segments from left to right, changing
direction at each column. The arrows in the figure show
the direction in which segments are searched.
Figure 8. Different SR setups: (a) SR and (b) SR2.
4 Evaluation
In this section we evaluate the region-based routing
mechanism and the algorithm used for computing regions.
First, we analyze the number of regions required for
different combinations of topologies and routing
algorithms. For this, we use different routing algorithms
in a regular mesh network: Dimension Order Routing (DOR),
Odd-Even (OE), SR, and up*/down* (UD; a BFS spanning tree
is formed with the root at switch (0,0)). For SR, two
different layouts (SR and SR2) are explored, both shown in
Figure 8. Also, we analyze the SR, SR2, and UD routing
algorithms on different topologies (mesh networks with
different link failures injected). In this way we
determine the minimum number of regions required for each
topology-routing combination.
Then, we explore the performance achieved by the
different routing algorithms when using the regions. This
analysis is performed in two directions. First, the
performance degradation of the network when using a lower
number of regions is analyzed. Second, the performance
degradation as we increase the number of failures in the
network is explored. The goal of this analysis is to
determine how many regions are required to achieve
acceptable levels of performance and which routing
algorithm performs best. Finally, we analyze the proposed
hardware in terms of latency, area, and cost.
4.1 Simulation Tool
We developed a flit-accurate simulator that models
arbitrary switch-based NoCs with point-to-point links. Each
switch has a non-blocking crossbar connecting input and
output ports, which allows multiple packets to be
transmitted simultaneously. The crossbar is able to
transmit one flit per connection per cycle.
In our simulations, every switch has an input-buffer size
of 5 flits. The maximum bandwidth of each link is set to
1 flit per cycle. We use a constant packet size of 5
flits. We assume the same constant packet rate for each
end node and simulate 5 million packets after a warm-up
period of 1 million arrived packets.
The routing decision is made at every switch by checking
the regions as specified by the region-based routing
mechanism. If multiple output ports are available for a
header flit, the output is selected at random. When the
routing decision is complete, flits are forwarded over a
link only if there is available buffer space at the next
input port. The routing time at each switch is set to 2
cycles. This includes the time to check the regions,
perform the crossbar arbitration, and set up the crossbar
connections.
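The per-switch routing decision described above can be sketched as a lookup over the region registers followed by a random choice. The sketch is only illustrative: the names RegionRegs and route are hypothetical, taking the union of the output ports of all matching regions is an assumption, and so is the reading of the IP register as covering the four mesh ports plus the local port ("L").

```python
import random
from typing import List, NamedTuple, Tuple


class RegionRegs(NamedTuple):
    row1: int
    col1: int
    row2: int
    col2: int
    ip: frozenset   # input ports for which the region applies ("L" = local)
    op: frozenset   # output ports the region allows


def route(regions: List[RegionRegs], in_port: str,
          dest: Tuple[int, int], rng: random.Random) -> str:
    """Pick an output port for a header flit arriving on in_port."""
    row, col = dest
    candidates = set()
    for r in regions:
        inside = r.row1 <= row <= r.row2 and r.col1 <= col <= r.col2
        if in_port in r.ip and inside:
            candidates |= r.op          # assumed: matching regions add options
    if not candidates:
        raise ValueError("no region covers this destination")
    # Random selection among the available output ports, as in the simulator.
    return rng.choice(sorted(candidates))


# Hypothetical regions for one switch of an 8x8 mesh.
ALL_IN = frozenset({"N", "E", "S", "W", "L"})
regions = [RegionRegs(0, 0, 7, 2, ALL_IN, frozenset({"W"})),
           RegionRegs(0, 3, 2, 7, ALL_IN, frozenset({"N", "E"}))]
rng = random.Random(0)
port_west = route(regions, "L", (5, 1), rng)
port_adaptive = route(regions, "L", (1, 6), rng)
```

Sorting the candidates before the random choice only makes the selection reproducible for a seeded generator; in hardware all regions would be compared concurrently rather than in a loop.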
We consider several specific communication patterns
between pairs of nodes: uniform, bit reversal, perfect
shuffle, butterfly, matrix transpose, and hot spot. For
hot-spot traffic, in addition to regular uniform traffic,
every packet has a 6% probability of being destined to a
hot-spot node located at coordinates (3,3).
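For reference, the permutation patterns can be written as bit manipulations of the source node identifier. The definitions below are the ones commonly used in the interconnection-network literature and are assumed here, since the text does not spell them out; the node numbering (identifier = row × 8 + column) and all function names are likewise assumptions.

```python
import random

BITS = 6                      # 64 nodes in an 8x8 mesh
MASK = (1 << BITS) - 1


def bit_reversal(src: int) -> int:
    """Destination bits are the source bits in reverse order."""
    return int(format(src, f"0{BITS}b")[::-1], 2)


def perfect_shuffle(src: int) -> int:
    """Rotate the source bits left by one position."""
    return ((src << 1) | (src >> (BITS - 1))) & MASK


def butterfly(src: int) -> int:
    """Swap the most and least significant bits."""
    msb, lsb = (src >> (BITS - 1)) & 1, src & 1
    return src if msb == lsb else src ^ ((1 << (BITS - 1)) | 1)


def matrix_transpose(src: int) -> int:
    """Swap the upper and lower halves of the bits (row <-> column)."""
    half = BITS // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)


def hot_spot(rng: random.Random, p: float = 0.06,
             hot: tuple = (3, 3), side: int = 8) -> int:
    """With probability p target the hot-spot node, otherwise pick uniformly."""
    if rng.random() < p:
        return hot[0] * side + hot[1]
    return rng.randrange(side * side)
```

For example, under this numbering node 1 (row 0, column 1) sends to node 8 (row 1, column 0) with matrix transpose and to node 32 with bit reversal.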
We have evaluated 8×8 meshes, and we have modeled
these topologies with up to seven randomly injected
link failures. For all simulations, we measure both network
throughput and packet latency.
4.2 Required Number of Regions
Figure 9 shows the number of regions required for
implementing different routing algorithms in an 8×8 mesh
network without failures. The figure also shows the minimum
number of regions required to guarantee network
connectivity (reducing adaptiveness). The figure shows that
DOR can be implemented with at most four regions at every
switch. As for the other routing algorithms, UD requires
six regions, SR seven, and OE eight. However, these routing
algorithms are partially adaptive (they provide more than
one routing option), unlike DOR, which is a deterministic
routing algorithm. Thus, the number of regions for all
these routing algorithms can be reduced to at most four
regions at every switch. Obviously, this reduction may come
with an impact on performance, which we analyze later in
this section.
Figure 9. Number of regions in an 8×8 mesh.
Figure 10 shows the number of regions (max and min)
required for an 8×8 mesh network with a different number of
link failures (injected randomly; the positions of the link
failures are shown in Figure 6) for SR, SR2, and UD. As can
be observed, for SR the number of required regions grows
logarithmically with the number of link failures. Fourteen
regions are enough to tolerate seven link failures (5.4% of
failed links). Also, the maximum number of regions can
be bounded by the algorithm to only eight in all cases.
However, for SR2 and UD, a higher minimum number of
regions is required. Ten regions are needed to cope with
the seven-link-failure case. Also, SR2 would require up to
18 regions without restricting routing.
Figure 11 shows the percentage of switches requiring
different numbers of regions for the 8×8 mesh without link
failures and without restricting adaptivity. As expected,
all the switches require four regions at most with DOR.
More interestingly, when using other routing algorithms,
almost half of the switches require only four regions (for
SR2 and UD) or five regions (for SR and OE). This result
may indicate that the overall power consumption of the
mechanism will be low, since not all the implemented
regions will be activated in all the switches.
Figure 10. Number of regions in an 8×8 mesh when increasing the number of failures.
Figure 11. Percentage of switches requiring different numbers of regions without restricting adaptivity in an 8×8 mesh.
Figure 12 shows, for SR, the distribution of regions for
the 8×8 mesh network with an increasing number of link
failures and maximum adaptivity. An interesting observation
is that, regardless of the number of link failures, more
than 50% of the switches require only seven regions. Also,
as the number of link failures increases, some switches
require more regions.
Figure 12. Percentage of switches requiring different numbers of regions for an 8×8 mesh with an increasing number of link failures and maximum adaptivity. SR routing algorithm.
4.3 Performance Evaluation
Figure 13 shows the network throughput achieved by the
different routing algorithms in different traffic scenarios
for the 8×8 mesh network with no link failures. Results
are shown for both the maximum and the minimum number of
regions. The first observation is that DOR achieves
higher throughput than the rest of the algorithms for the
uniform traffic pattern. This result is well known. In
fact, DOR behaves better than adaptive routing algorithms
in meshes with no link failures. However, for other traffic
patterns DOR obtains similar or lower network throughput
than the rest of the routing algorithms. SR and OE benefit
from their adaptiveness for the bit-reversal, shuffle, and
transpose patterns. Also, SR achieves higher network
throughput for the butterfly pattern. These results are
achieved using 7 regions in SR and 8 regions in OE (see
Figure 9).
Figure 13. Network throughput in an 8×8 mesh.
When reducing the number of regions down to the minimum
(four), an interesting result can be observed. OE
and SR2 suffer a significant drop in network throughput.
Throughput decreases significantly for the uniform,
bit-reversal, shuffle, and transpose traffic patterns. In
some cases (uniform, shuffle, and transpose) OE and SR2
perform significantly worse than the rest of the routing
algorithms. On the other hand, the rest of the routing
algorithms achieve roughly the same throughput when using
four regions.
For UD routing, it can be noticed that its performance is
marginally related to the number of regions. Practically, this
routing achieves the same performance with any number of
regions. To summarize, on average (last set of columns in
Figure 13), all the routing algorithms achieve similar per-
formance, except OE and SR2 that are highly sensitive to
the number of regions. Indeed, when reducing the num-
ber of regions down to four, the percentage of routing op-
tions eliminated for OE is 22%. For the rest of routing al-
gorithms, the percentage is lower than 15%. When eight
regions are allowed, on average, SR and SR2 exhibit the
highest throughput taking advantage of their adaptiveness.
Now, let us focus on the performance achieved when
dealing with meshes with some link failures. Figure 14
shows the performance achieved in a network with one link
failure for different routing algorithms, different traffic
patterns, and different numbers of regions. As can be
noticed, most of the network throughput achievable by SR is
obtained with only six regions (practically the maximum for
hot spot and shuffle, and more than 80% for the rest of the
traffic patterns). Also, when using eight regions,
throughput is roughly the maximum for all the traffic
patterns.
Figure 14. Network throughput in an 8×8 mesh with one link failure.
Figure 15 shows the performance achieved in a mesh
network with seven link failures (the topology is shown in
Figure 6). In this case, eight regions again provide most
of the network throughput achievable by SR. Also, with an
increasing number of link failures (up to seven), the
percentage of routing options eliminated is only 5% in the
worst case (when reducing the number of regions to eight in
a mesh network with seven link failures). On average, SR
achieves 20% higher throughput than UD and 30% higher than
SR2.
Figure 15. Network throughput in an 8×8 mesh with seven link failures.
Finally, Figure 16 shows the throughput degradation
with an increasing number of link failures for different
traffic patterns (the throughput achieved in a mesh network
with no link failures (0F) is included). SR with eight
regions is used in all cases. As can be observed,
performance decreases as the number of failures increases,
up to the point at which network throughput becomes
constant. As the number of failures increases, the number
of routing options eliminated to keep the number of regions
constant also increases, and adaptiveness is lost. Also, a
larger part of the network becomes irregular. However, SR
combined with the region-based mechanism with 8 regions
maintains network throughput even with link failures for
hot-spot and transpose traffic. For the rest, throughput
decreases by 30% for bit-reversal and uniform, 25% for
butterfly, and 40% for shuffle.
Figure 16. Network throughput with an increasing number of link failures. SR with eight regions.
4.4 Area and Cost Evaluation
As an approximation of the area and timing constraints of
the region-based routing mechanism, we have synthesized
our implementation onto a Xilinx Virtex 2 FPGA at 3.3V and
133 MHz. Results show that the circuit requires about
30 logic gates per region. Therefore, for an 8×8 mesh
where 8 regions are required to support adaptiveness
and fault tolerance, the total overhead required for
implementing the region-based control portion is around
240 gates, which does not seem to be a significant overhead
when considered in the context of an ASIC.
Table 1 shows the equivalent memory size required for
an implementation with routing tables. In particular,
routing with forwarding tables requires a table with as
many entries as the number of nodes and input ports, and
every entry needs to store the ports returned by the
routing function. Hence the cost of this alternative is
N × log(d) × d bits, where N is the number of nodes and d
is the number of ports. As can be seen, for memory-based
solutions the per-switch memory requirements grow linearly
with the network size, and the per-chip requirements
quadratically. Contrary to this, region-based routing
requires per region four registers of size log(N)/2 bits
(the ROW1, COL1, ROW2, and COL2 registers), one register
with d + 1 bits (the IP register), and one register with d
bits (the OP register). Table 1 shows the number of bits
required (per switch and per chip) for an implementation
with 16 regions. Table 1 also shows the number of logic
gates required. As can be observed, the requirements of
the mechanism are very modest: less than half a kilobyte
is required per switch for any network size. Even though
the required logic per switch increases slightly with
network size, the total logic and memory requirements
remain low. Note that, with the memory requirements and
the number of logic gates shown in Table 1, all the
routing algorithms could be applied to all the evaluated
networks without any routing restriction (and hence with
no performance degradation).
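The two storage formulas can be evaluated directly. The sketch below assumes base-2 logarithms and d = 4 ports; neither is stated explicitly in the text, but together they reproduce the per-switch figures quoted for the 8×8 mesh (512 bits for the table, 336 bits for 16 regions). The function names are hypothetical.

```python
from math import log2


def table_bits_per_switch(n_nodes: int, d: int) -> int:
    """Forwarding-table cost N * log2(d) * d, in bits."""
    return int(n_nodes * log2(d) * d)


def region_bits_per_switch(n_nodes: int, d: int, regions: int) -> int:
    """Register cost of the region-based mechanism, in bits."""
    coord_bits = int(log2(n_nodes) / 2)          # one row or column coordinate
    per_region = 4 * coord_bits + (d + 1) + d    # ROW1, COL1, ROW2, COL2, IP, OP
    return regions * per_region


D = 4          # assumed number of ports per switch in a 2D mesh
REGIONS = 16   # as in Table 1
for side in (8, 16, 32, 64):
    n = side * side
    print(f"{side}x{side}: table {table_bits_per_switch(n, D)} b, "
          f"regions {region_bits_per_switch(n, D, REGIONS)} b per switch")
```

Multiplying either per-switch value by N gives the corresponding per-chip value, which is why the table-based total grows with the square of the network size while the region-based total grows only slightly faster than linearly.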
Also, as a reference point, the NoC FPGA implementation
of a full switch reported in [5] required approximately
10,000 to 26,000 gates, depending on the routing algorithm
selected. A hardware implementation of the region-based
routing would therefore add less than 5 percent of
hardware overhead.

          Memory-based routing     Region-based routing
          Memory      Memory       Bits (registers)  Logic gates  Bits (registers)  Logic gates
NoC       per switch  in chip      per switch        per switch   in chip           in chip
8×8       512 b       32 Kb        336 b             480          21 Kb             31 K
16×16     2 Kb        512 Kb       400 b             608          100 Kb            156 K
32×32     8 Kb        8 Mb         464 b             736          464 Kb            754 K
64×64     32 Kb       128 Mb       528 b             1024         2 Mb              4.2 M
Table 1. Memory requirements for table-based routing and region-based routing. 16 regions.
The measured latency of the critical path for the
region-based routing mechanism is 6 ns (for the FPGA
implementation). This delay includes the generation of the
Region-Trigger signal plus the time required to check
whether the destination address of the packet is inside the
region (all the regions perform the same comparison
concurrently), thus generating the output port assignments
of the region. A table lookup system, in contrast, would
require an access time that grows with network size.
5 Conclusion and Future Work
We have proposed a scalable and versatile implementation
of routing algorithms in NoCs. When combined with
an appropriate routing algorithm, the region-based routing
mechanism compresses redundant routing information
while providing the flexibility required to support fault
tolerance. The strategy is able to implement the most
commonly used routing algorithms in any topology with no
performance degradation and, at the same time, it is
efficient in terms of area, speed, and power consumption.
The hardware required to apply the routing mechanism is not
complex and does not depend on the network size, which
makes it appropriate for large network implementations.
In order to increase performance, the proposed strategy
can be combined with virtual channels. The use of
region-based routing also opens the door to dynamic
routing algorithms that take real traffic characteristics
into account when calculating routing restrictions. We
plan to take this approach in future work.
References
[1] L. Benini and G. D. Micheli. Networks on chips: A new SoC
paradigm. Computer, 35(1):70-78, 2002.
[2] T. Bjerregaard and S. Mahadevan. A survey of research and
practices of network-on-chip. ACM Comput. Surv., 38(1):1,
2006.
[3] J. Flich, J. Duato, et al. Combining in-transit buffers
with optimized routing schemes to boost the performance of
networks with source routing. In 2000 International
Symposium on High Performance Computing, Oct. 2000.
[4] M. E. Gómez, P. López, and J. Duato. A memory-effective
routing strategy for regular interconnection networks. In
Proceedings of the 19th IEEE International Parallel and
Distributed Processing Symposium (IPDPS'05), page 41.2,
Washington, DC, USA, 2005. IEEE Computer Society.
[5] J. Hu and R. Marculescu. DyAD: smart routing for
networks-on-chip. In DAC '04: Proceedings of the 41st
annual conference on Design automation, pages 260-263,
New York, NY, USA, 2004. ACM Press.
[6] J. V. Leeuwen and R. B. Tan. Interval routing. Comput.
Journal, 30(4):298-307, 1987.
[7] J. C. Martinez, J. Flich, A. Robles, P. López, and J. Duato.
Supporting adaptive routing in InfiniBand switches. J. Syst.
Archit., 49(10-11):441-456, 2003.
[8] A. Mejia, J. Flich, J. Duato, S. Reinemo, and T. Skeie.
Segment-based routing: An efficient fault-tolerant routing
algorithm for meshes and tori. In International Parallel and
Distributed Processing Symposium: 20th IPDPS 2006,
Rhodos, Greece, April 2006.
[9] A. Mejia, J. Flich, J. Duato, S. Reinemo, and T. Skeie.
Boosting ethernet performance by segment-based routing.
In Euromicro Conference on Parallel, Distributed and
Network-based Processing: 15th PDP 2007, Naples, Italy,
February 2007.
[10] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch.
The nostrum backbone - a communication protocol stack for
networks on chip. In VLSID '04: Proceedings of the 17th
International Conference on VLSI Design, page 693,
Washington, DC, USA, 2004. IEEE Computer Society.
[11] M. Palesi, S. Kumar, and R. Holsmark. A method for router
table compression for application specific routing in mesh
topology NoC architectures. In SAMOS, pages 373-384,
2006.
[12] M. K. F. Schafer, T. Hollstein, H. Zimmer, and M. Glesner.
Deadlock-free routing and component placement for
irregular mesh-based networks-on-chip. In ICCAD '05:
Proceedings of the 2005 IEEE/ACM International conference
on Computer-aided design, pages 238-245, Washington, DC,
USA, 2005. IEEE Computer Society.
[13] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray,
R. M. Needham, and T. L. Rodeheffer. Autonet: a
high-speed, self-configuring local area network using
point-to-point links. IEEE Journal on Selected Areas in
Communications, 9(8), October 1991.
[14] K. Srinivasan, K. S. Chatha, and G. Konjevod. An
automated technique for topology and route generation of
application specific on-chip interconnection networks. In
ICCAD '05: Proceedings of the 2005 IEEE/ACM International
conference on Computer-aided design, pages 231-237,
Washington, DC, USA, 2005. IEEE Computer Society.