Low-Power Cumulative Flit Count Based Routing for
Network-on-Chip
<Authors>
<Affiliations>
<Email addresses>
Abstract: The selection function in a network-on-chip (NoC)
router is critical in determining the effectiveness of an on-chip
adaptive router, particularly in mitigating the effects of
network congestion. While most adaptive routers attempt to find
a greedy or non-greedy local latency optimum while selecting the
forwarding path, most such selection function designs do not
have explicit provisions for minimizing energy consumption.
Given that the average number of hops traversed by a message is
typically higher in adaptively routed NoCs, energy-aware
selection in the router would be important in achieving low
power targets. In this paper, we propose a selection function,
Low-Power Cumulative Flit Count (LPCFC), that aims to
optimize both latency and power so as to reduce the overall
power-delay product. Using a cycle-accurate NoC simulator, we
show for various traffic patterns that our selection function
outperforms existing selection functions in adaptive NoC routers
in terms of power-delay product, while achieving competitive
latency values.
Keywords: Network-on-Chip, selection function, adaptive routing, low-power router
I. INTRODUCTION
Energy consumption in interconnects and routers is the chief
contributor to overall power dissipation in a network-on-chip (NoC). Several approaches to NoC power optimization
have been proposed through router microarchitecture design
[1], [2], or low-power interconnect design [3], [4], or use of
novel on-chip interconnect technologies [5], [6], [7]. These, in
conjunction with system-level approaches such as dynamic
voltage and frequency scaling (DVFS), have been quite
effective in achieving low-power targets. These optimization
approaches have been largely independent of the actual routing
functions used on the NoC, and most consider deterministic
routing algorithms with standard traffic patterns. Although adaptive routing is
preferred in high congestion scenarios to improve overall
latency, there are hardware overheads incurred in terms of
router complexity and deadlock management. These overheads
may again lead to higher power consumption. The selection
function in an adaptive router determines the effectiveness of
the adaptive routing strategy in terms of congestion
management, and also has a bearing on the overall router
complexity.
In this paper, we propose a selection function that not only
aims to optimize latency but also power consumption while
deciding on the path to which to forward a message. The
selection function estimates the future latency and power
consumption to be incurred along each of the allowed paths,
provides an output that optimizes both criteria, and yet

possesses low hardware complexity. We compare our selection
function with several state-of-the-art strategies and
demonstrate that our design achieves a better power-delay
product with at most a slight latency penalty.
The paper is organized as follows. We provide a brief
background of our work in Section II. We discuss the proposed
architecture of the selection function in Section III. Section IV
describes our experimental and evaluation results. We
conclude and outline future work in Section V.
II. RELATED WORK
The optimization of routing algorithms for the NoC is a key
concern in enhancing performance and minimizing
energy consumption. Selecting the best path to
route packets in the NoC reduces the traffic flow on the
network, resulting in an energy-efficient design. Some popular
techniques of reducing energy consumption employ Dynamic
Voltage and Frequency Scaling (DVFS) and creation of
Voltage Islands (VIs) [9]. DVFS achieves energy efficiency
by lowering the applied voltage and clock speed when the
workload decreases. A real-time dynamic power management
algorithm proposed in [9] partitions the NoC into multiple VIs
and provides superior energy/execution time trade-off
compared to the lowest and highest power modes. This improves
performance by 2x compared with running at the highest voltage
and frequency. A novel framework (VISION) for automating
the synthesis of regular networks on chip (NoC) with VIs is
proposed in [10]. The mapping approach uses a distributed
decision making over the whole mesh, and the routing path
allocation algorithm integrates link-insertion and routing,
leading to 11% saving in power dissipation.
Figure 1. Output channel selection flow chart.

Figure 2. Architecture of selection function in LPCFC.

A routing algorithm primarily consists of two steps: the
routing function and the next-node selection function (see Fig. 1).
For mesh-based NoCs, the former is a simple process, and the
choice of the latter impacts performance of the routing
algorithm. Conventional metrics that determine the usefulness
of a routing algorithm include adaptability, resilience to
deadlocks, etc. Many deterministic routing algorithms like XY
routing, although deadlock-free for mesh topology, lack
adaptability and often lead to skewed traffic distribution. This
results in hotspots, congestion and other reliability issues.
There is a plethora of adaptive routing techniques, such as
those based on Hamiltonian path computation and network
partitioning [12], dynamic programming [13], multi-objective
ant-colony algorithm [14], Q-learning [15], etc. Although most
of these algorithms have shown appreciable performance, their
implementation inside NoC router hardware leads to significant
hardware overhead, execution latency and/or learning delay
(where applicable). In the interest of simplicity and feasibility
of implementation, we have chosen the odd-even turn model
based routing algorithm [16] with minimal number of hops,
which provides a fair degree of adaptability and deadlock avoidance.
The routing function produces a choice of output ports for
the selection function to choose from in a significant
percentage of cases on a mesh NoC [17]. This fraction
increases with increase in the size of the NoC and injection
rate. Hence, the selection function plays a significant role in traffic
congestion management. Most selection functions attempt to
optimize latency to determine the final output port. Path Based
Randomized Oblivious Minimal Routing [18] is based on
balancing the load by random channel selection. Congestion
Aware Deterministic Routing [19] estimates the congestion
level of the network based on past traffic patterns and
computes the optimal routing paths for incoming flits. This is
well suited for NoC systems with bursty and self-similar
traffic patterns. The Neighbor on Path (NOP) [20] method tracks
the busy/free status of virtual channels of neighboring routers
and weighs them to select the routing path. There are selection
functions based on fluidity of buffers [21] that follow the

philosophy that forwarding data to routers with more fluid
buffers will lead to smoother flow of traffic.
As noted earlier, most of these selection functions
optimize latency, whereas energy consumption is an increasing
concern in adaptive routing techniques. Many techniques
address this issue at the system level, but our proposed
approach addresses this issue at the level of the selection
function in each router. The central idea is that the proposed
selection function jointly optimizes projected latency and
energy consumption with minimal time or implementation
overhead. The following section describes our proposed
approach in detail.
III. PROPOSED APPROACH: LOW-POWER CFC
Our approach is directed to enhance the selection function
by incorporating both power and latency metrics into its
design. A metric based on cumulative flit count (CFC) [17] has
been shown to be a simple and reliable one to optimize latency.
CFC measures the number of flits that went past the neighbors
of a given tile over a specified duration. CFC has been shown
to be better equipped to mitigate congestion than free virtual
channel count [17] for a wide range of traffic patterns,
especially near saturation. Since congestion is a cumulative
process, free VC count or other metrics based on instantaneous
values poorly represent network congestion scenarios. We
augment this metric to include potential energy consumption
via all possible output ports and jointly optimize the CFC
metric with the energy consumption metric to determine the
output of the selection function. We call this selection function
Low-Power Cumulative Flit Count (LPCFC). In order to
account for power consumption, we estimate the power
consumption of the links associated with each potential output
port of the router as follows:
$$P_{link} = C_{total} V^{2} f \qquad (1)$$
where V is the supply voltage, f is the clock frequency and
Ctotal can be expressed as:
$$C_{total} = T_{n} (C_{self} + C_{load}) \qquad (2)$$
Here, Tn is the average number of effective transitions per cycle, and Cself and
Cload are the self-capacitance and load capacitance respectively
seen at the router output port (see Figure 2). For a particular
technology node, Cself and Cload are fixed for a given router
implementation. In a design with a single voltage domain with
no frequency scaling, link power consumption is proportional
to toggle count.
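The proportionality between link power and toggle count can be checked with a short numerical sketch. It assumes the standard dynamic-power form P = Ctotal * V^2 * f for Equation (1) and Ctotal = Tn * (Cself + Cload) for Equation (2). The function name `link_power` and the capacitance values are hypothetical and chosen only for illustration; the voltage and frequency are those of the experimental setup in Section IV.

```python
def link_power(toggle_count: int, c_self: float, c_load: float,
               v_supply: float, f_clock: float) -> float:
    """Estimate dynamic link power (watts) for one router output port.

    toggle_count stands in for Tn, the transitions per cycle.
    Assumes Ctotal = Tn * (Cself + Cload) and P = Ctotal * V^2 * f.
    """
    c_total = toggle_count * (c_self + c_load)
    return c_total * v_supply ** 2 * f_clock


# Hypothetical capacitances in farads (not taken from the paper).
C_SELF, C_LOAD = 10e-15, 20e-15

# 1.1 V supply and 1 GHz clock, as in the experimental setup.
p_low = link_power(4, C_SELF, C_LOAD, 1.1, 1e9)
p_high = link_power(16, C_SELF, C_LOAD, 1.1, 1e9)

# With V, f and the capacitances fixed, power scales with toggle count.
ratio = p_high / p_low
```

Because every factor other than the toggle count cancels in `ratio`, a selection function in a single voltage and frequency domain can compare toggle counts directly instead of evaluating the full power expression.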
A. CFC Architecture
We measure CFC by installing a flip-flop and a register in
each of the four sides of a tile corresponding to all directional
output ports (N, E, W, S) for a mesh NoC. In every clock
cycle, the flip-flop is set if a flit passes through that directional
output port; else, it is reset. These bits are additively
accumulated in the register, which in turn updates the status
register of the neighboring tiles. In a similar way, three flip-flop
output bits from each neighboring tile, one for each of its three non-adjacent directions, update the status registers in the current tile.
There are twelve status registers, three for each direction.
These status registers can be considered as log-books of past
flit paths chosen. In order to avoid overflow, the register

contents are periodically right shifted by a fixed value. It is
clear that the current value stored in each status register
represents CFC (or weighted CFC). For example, the value
present in the first status register at north port denotes the
number of flits the northern neighbor of current tile has sent to
its east output port, and so on [17]. Further, we classify the
CFC values for a particular port into three categories (low,
medium and high) based on their frequency of occurrence (see
Figure 3 (a)). We elaborate on the methodology for this
approach further in Section IV-B.
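The accumulate-and-decay behavior of a single status register can be modeled in software as follows. This is a minimal behavioral sketch, not the hardware design: the class name `CFCStatusRegister` and the parameters `shift_period` and `shift_amount` are hypothetical, since the text only states that the contents are periodically right shifted by a fixed value.

```python
class CFCStatusRegister:
    """Behavioral model of one cumulative flit count (CFC) status register."""

    def __init__(self, shift_period: int = 64, shift_amount: int = 1):
        # shift_period and shift_amount are assumed values for illustration.
        self.value = 0
        self.shift_period = shift_period
        self.shift_amount = shift_amount
        self._cycle = 0

    def tick(self, flit_passed: bool) -> None:
        """Advance one clock cycle; flit_passed models the flip-flop bit."""
        self.value += 1 if flit_passed else 0
        self._cycle += 1
        if self._cycle % self.shift_period == 0:
            # Periodic right shift keeps the register from overflowing
            # and gives older traffic a smaller weight.
            self.value >>= self.shift_amount


reg = CFCStatusRegister(shift_period=4, shift_amount=1)
for bit in [True, True, False, True]:
    reg.tick(bit)
count_after_decay = reg.value  # 3 flits counted, then shifted right once
```

In this model a tile would hold twelve such registers, three per direction, each fed by the corresponding flip-flop bit of a neighboring tile.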
B. Link Power Estimation
From Equations (1) and (2), we determine the link power
consumption directly by counting the number of transitions that
would occur if a particular path is chosen. This is tracked by
including a separate register that holds the last flit sent through
that output port by the current tile. In order to determine the
candidacy of a certain allowed output port, we calculate the
potential toggle count if the current flit is forwarded through
that port (see Figure 2). Further, we classify the toggle count
values into three classes (low, medium and high) based on their
frequency of occurrence (see Figure 3 (b)). We explain this
approach in further detail in Section IV-B.
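The potential toggle count for a candidate port can be sketched as the Hamming distance between the last flit sent on that port and the flit about to be forwarded. The sketch below assumes that a transition is counted once per bit position that changes and that the flit width is 32 bits, as used in the experiments. The names `toggle_count` and `OutputPort` are hypothetical.

```python
FLIT_WIDTH = 32  # flit width in bits, as used in the experiments


def toggle_count(last_flit: int, next_flit: int, width: int = FLIT_WIDTH) -> int:
    """Number of link wires that change value between two consecutive flits."""
    mask = (1 << width) - 1
    return bin((last_flit ^ next_flit) & mask).count("1")


class OutputPort:
    """Holds the last flit sent through one directional output port."""

    def __init__(self) -> None:
        self.last_flit = 0

    def candidate_toggles(self, flit: int) -> int:
        """Toggle count that would result if flit were sent on this port."""
        return toggle_count(self.last_flit, flit)

    def send(self, flit: int) -> None:
        """Record flit as the most recent one sent on this port."""
        self.last_flit = flit & ((1 << FLIT_WIDTH) - 1)


north, east = OutputPort(), OutputPort()
north.send(0x0000FFFF)
east.send(0x00FF00FF)

flit = 0x00FF00FF
toggles = {"N": north.candidate_toggles(flit), "E": east.candidate_toggles(flit)}
```

Here the east port would be the cheaper candidate in terms of link power, because the flit is identical to the one last sent on that port and causes no transitions.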
C. Joint Cost Determination
Once CFC and toggle count values have been individually
classified, our next step is the creation of a joint classification
table for LPCFC. For the joint cost, we propose a new five-class
domain set. This set contains five classes: very low, low,
medium, high and very high. The classification table
determines a joint cost by looking at the classes that the CFC and
toggle count values belong to. The mapping function is shown
in Table 1 (a). As mentioned earlier, the odd-even routing function
in a mesh NoC provides up to two alternative paths as outputs at
a node. The role of the selection function is to choose
one of these paths. Since we carry out joint cost determination
with both potential latency (CFC) and potential power
consumption (toggle count), we choose the path with the lower
cost, as shown in Table 1 (b). Since we use a coarse
quantization of all the costs (five levels), we choose either path
when the joint cost for each path is at the same level. As we
show later, this approach is very effective in optimizing both
latency and energy without the need for complex and time-consuming predictive computation at each router node.

Table 1 (a). Classification rules based on joint CFC and
link power metrics
Cumulative Flit Count   Link Power   Cost
Low                     Low          Very Low
Low                     Medium       Low
Low                     High         Medium
Medium                  Low          Low
Medium                  Medium       Medium
Medium                  High         High
High                    Low          Medium
High                    Medium       High
High                    High         Very High

Table 1 (b). Joint cost based path selection from two
alternative paths given by odd-even routing function
Path1       Path2                  Path chosen
Very Low    Very Low               Either
Very Low    Low to Very High       Path1
Low         Very Low               Path2
Low         Low                    Either
Low         Medium to Very High    Path1
Medium      Very Low/Low           Path2
Medium      Medium                 Either
Medium      High/Very High         Path1
High        Very Low to Medium     Path2
High        High                   Either
High        Very High              Path1
Very High   Very Low to High       Path2
Very High   Very High              Either

Figure 3. Plots showing cumulative frequency
distribution and quantization approach for (a) CFC
values and (b) toggle count
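The two lookup steps of Table 1 can be sketched in a few lines. This is an illustrative software model rather than the hardware implementation. The names `JOINT_COST`, `joint_cost` and `select_path` are hypothetical, and breaking ties with a random choice is an assumption, since the table only specifies "Either".

```python
import random

LOW, MEDIUM, HIGH = 0, 1, 2
COST_NAMES = ["Very Low", "Low", "Medium", "High", "Very High"]

# Table 1 (a): (CFC class, link power class) -> joint cost level 0..4.
JOINT_COST = {
    (LOW, LOW): 0, (LOW, MEDIUM): 1, (LOW, HIGH): 2,
    (MEDIUM, LOW): 1, (MEDIUM, MEDIUM): 2, (MEDIUM, HIGH): 3,
    (HIGH, LOW): 2, (HIGH, MEDIUM): 3, (HIGH, HIGH): 4,
}


def joint_cost(cfc_class: int, power_class: int) -> int:
    """Look up the joint cost level for one candidate output port."""
    return JOINT_COST[(cfc_class, power_class)]


def select_path(cost1: int, cost2: int, tie_break=random.choice) -> str:
    """Table 1 (b): pick the path with the lower joint cost.

    On equal cost either path is acceptable; the random tie-break
    is an assumption made for this sketch.
    """
    if cost1 < cost2:
        return "Path1"
    if cost2 < cost1:
        return "Path2"
    return tie_break(["Path1", "Path2"])


c1 = joint_cost(LOW, MEDIUM)   # Path1: lightly used, moderate toggling
c2 = joint_cost(HIGH, LOW)     # Path2: heavily used, little toggling
chosen = select_path(c1, c2)
```

Every entry of Table 1 (a) equals the sum of the two class indices (with low = 0, medium = 1, high = 2), so the lookup could also be realized as a small adder followed by a comparator.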
IV. EXPERIMENTAL RESULTS
A. Experimental Setup
We used Noxim [22], a cycle-accurate mesh-based NoC
simulator. We implemented our selection function LPCFC by
writing SystemC modules on top of the existing router
infrastructure in Noxim. We carried out performance
evaluation for a wide range of standard traffic patterns. We
simulated a 4x4 mesh topology with the odd-even routing
algorithm. In order to account for the steady-state traffic scenario,
results were noted after a warm-up period of a few hundred
clock cycles. We implemented our design with CMOS 90 nm
standard cell library (UMC) [23] and verified it to work with a
clock frequency of 1 GHz and a global operating voltage of 1.1
V.
We compared NoC performance with our selection
function against four other selection functions: CFC [17], NOP
[20], Random [18] and Buffer Level Monitor [19]. These four
selection functions are chosen so as to cover a broad array of
the most relevant and popular choices available. We chose the
following standard traffic patterns for our analysis: bit
reversal, shuffle, random, butterfly, transpose1 and
transpose2.
B. Determination of Quantization Levels
Here we explain the experimental methodology we used to
classify different CFC and toggle count values. As mentioned
earlier, we classify the CFC and toggle count values into three
classes each to serve as broad indicators of flit throughput and
power consumption respectively. In order to determine the
quantization levels, we generated CFC values by simulating
the system for over 1000 clock cycles and recorded the
distribution of values (see Figure 3 (a)). We then determined
the quantization levels by finding the tertiles (or 3-quantiles) in the cumulative frequency distribution as
indicated in Figure 3 (a) and Table 2. For quantization of
toggle count into three classes, these values were recorded
over the same time period and their cumulative frequency
distribution was recorded (see Figure 3 (b)). Note that
although the flit width in this case is 32 bits, very high toggle
counts are not frequent. We find the tertiles of the toggle
count values to determine their quantization levels,
as shown in Figure 3 (b) and Table 2.
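The tertile-based quantization can be illustrated with a small sketch. It assumes a simple order-statistic definition of the tertiles and treats each threshold as the upper bound of its class. The function names `tertile_thresholds` and `classify` and the sample values are hypothetical; only the thresholds 2 and 7 in the last lines come from the CFC column of Table 2.

```python
def tertile_thresholds(samples):
    """Return (t1, t2), the upper bounds of the lower and middle thirds."""
    ordered = sorted(samples)
    n = len(ordered)
    t1 = ordered[(n - 1) // 3]
    t2 = ordered[(2 * (n - 1)) // 3]
    return t1, t2


def classify(value, t1, t2):
    """Map a recorded value to one of the three quantization classes."""
    if value <= t1:
        return "Low"
    if value <= t2:
        return "Medium"
    return "High"


# Hypothetical recorded values, for illustration only.
samples = [0, 1, 2, 3, 4, 5, 6, 7, 8]
t1, t2 = tertile_thresholds(samples)
classes = [classify(v, t1, t2) for v in samples]

# CFC thresholds read from Table 2 (Low: 0-2, Medium: 3-7).
# Treating a value of exactly 8 as High is an assumption of this sketch.
cfc_class_of_7 = classify(7, 2, 7)
cfc_class_of_8 = classify(8, 2, 7)
```

Once the thresholds have been fixed offline in this way, classification at run time reduces to two comparisons per value.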
C. Latency
We compare the average message latency using our
proposed selection function and other existing selection
functions (as mentioned above). As Figure 4 shows, both CFC
and LPCFC offer appreciable latency improvements.
Comparisons of the latency improvement offered by CFC and
LPCFC over the competing selection functions are detailed in
Table 3 and Table 4 respectively. One can see that CFC and
LPCFC both perform better than other prevalent selection
functions for almost all kinds of traffic. This shows that
cumulative tracking of traffic congestion gives a better result
than instantaneous measures of congestion, such as buffer
level monitoring and vacant virtual channel counting. We
note, however, that the reduction in latency is slightly smaller in
some cases when we use LPCFC as compared to CFC. This is
due to the inclusion of the power consumption metric in the
selection function.

Table 2. Quantization levels for CFC and toggle count
values
Level    CFC values   Toggle count values
Low      0-2          0-13
Medium   3-7          14-17
High     Above 8      Above 18

Table 3. Latency reduction using CFC over competing
selection functions
pir    bit reversal   butterfly   random   shuffle   trans.1   trans.2
0.8    14.9           13.17       1.52     2.31      3.82      13.2
0.11   7.08           3.5         3.86     1.28      8.5       5.26
0.14   2.48           1.16        1.82     7.76      1.95      1.41
0.17   5.51           1.03        0.81     4.52      1.89      1.92
0.20   1.25           1.03        4.79     1.52      2.5

Table 4. Latency reduction of LPCFC over competing
(non-CFC based) selection functions
pir    bit reversal   butterfly   random   shuffle   trans.1   trans.2
0.8    13.6           2.94        2.3      2.11      0.61      8.16
0.11   4.42           2.56        0.71     1.44      2.63      2.75
0.14   2.61           0.42        1.23     1.36      3.5
0.17   2.99           0.53        1.61     0.96      1.26      2.06
0.20   2.39           0.44        1.91     2.59      1.06      1.54

Table 5. Reduction in power-delay product of LPCFC over CFC
for different traffic patterns at 0.2 packet injection rate
Traffic pattern   Reduction (%)
Bit reversal      6.01
Shuffle           2.79
Butterfly         5.45
Random            5.84
Transpose1        3.07
Transpose2        5.38

Figure 4. Plots of average latency vs packet injection rate for competing selection functions and different traffic patterns.
D. Energy Efficiency
A better metric to accurately evaluate the performance of
the selection function is the power-delay product because it
reflects the energy efficiency of the design. We plot
the power-delay product of the routers using the selection functions
CFC and LPCFC in Figure 5. It can be clearly seen that using
LPCFC as a selection function offers a better power-delay
product compared to CFC for all traffic patterns, thereby
establishing the superior energy efficiency of our
selection function.

Figure 5. Power-delay product for different traffic patterns for selection functions LPCFC and CFC.
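For reference, the comparison can be written out explicitly. This is a sketch under two assumptions: that the power-delay product is the product of average power P and average message latency D, and that a percentage reduction is measured relative to CFC. The symbols P, D and the delta notation are introduced here only for illustration.

```latex
\mathrm{PDP} = P \cdot D, \qquad
\Delta_{\mathrm{PDP}}\,(\%) =
  \frac{\mathrm{PDP}_{\mathrm{CFC}} - \mathrm{PDP}_{\mathrm{LPCFC}}}
       {\mathrm{PDP}_{\mathrm{CFC}}} \times 100
```

Under this definition, a positive value means that LPCFC has the lower power-delay product, even if its latency alone is slightly higher than that of CFC.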
Although CFC alone can guarantee low latency, greedy
adaptive routing based on latency considerations alone can lead
to certain paths (links and routers) being overused, leading to
the creation of hotspots. Our inclusion of the power metric
in the selection function is primarily aimed at reducing the
incidence and intensity of such hotspots, and at reducing overall
power consumption in data transfer among on-chip processing
units. Our results show that although there is a slight increase
in average message latency when compared with CFC (which is
expected), we achieve better overall energy efficiency by using
our proposed LPCFC selection function.
Another interesting observation from Figure 5 is that
LPCFC proves to be more energy efficient as the intensity of
injected traffic increases. In fact, at low injection rates with traffic patterns such as
shuffle or butterfly, using LPCFC leads to a higher power-delay product than using CFC. For such low intensity traffic,
the probability of hotspot creation is very low and the latency-power consumption tradeoff does not provide any significant
benefit. However, as the intensity of injected traffic (packet
injection rate) goes up, we see that using LPCFC indeed gives
us an advantage over CFC, which has already been shown [17]
to be more energy efficient compared with other popular
selection function based routers.
V. CONCLUSION AND FUTURE WORK
Power consumption and energy efficiency are key concerns
in many-core NoC-based platforms, and one of the principal
contributors to the overall power consumption is the NoC
router. While system-level dynamic power optimization
techniques have targeted this problem, our approach has been
to factor in power consumption while making adaptive routing
decisions at the router level. We propose a simple, elegant
method to establish a latency-power tradeoff that decides the
output of the selection function. Experimental evaluation of
our design shows that our method achieves superior energy
efficiency compared with existing selection functions with
zero or negligible latency penalty. This approach at the router
level could be combined with system-level dynamic power
optimization algorithms to achieve even better results.
REFERENCES
[1] Hangsheng Wang; Li-Shiuan Peh; Malik, S., "Power-driven design of
    router microarchitectures in on-chip networks," Microarchitecture,
    2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International
    Symposium on, pp. 105-116, 3-5 Dec. 2003.
[2] Thonnart, Y.; Vivet, P.; Clermidy, F., "A fully-asynchronous low-power
    framework for GALS NoC integration," Design, Automation & Test in
    Europe Conference & Exhibition (DATE), 2010, pp. 33-38, 8-12 March 2010.
[3] Hui Zhang; Varghese George; Rabaey, J.M., "Low-swing on-chip
    signaling techniques: effectiveness and robustness," Very Large Scale
    Integration (VLSI) Systems, IEEE Transactions on, vol. 8, no. 3,
    pp. 264-272, June 2000.
[4] Owens, J.D.; Dally, W.J.; Ho, R.; Jayasimha, D.N.; Keckler, S.W.;
    Li-Shiuan Peh, "Research Challenges for On-Chip Interconnection
    Networks," Micro, IEEE, vol. 27, no. 5, pp. 96-108, Sept.-Oct. 2007.
[5] Ganguly, A.; Chang, K.; Deb, S.; Pande, P.P.; Belzer, B.; Teuscher, C.,
    "Scalable Hybrid Wireless Network-on-Chip Architectures for Multicore
    Systems," Computers, IEEE Transactions on, vol. 60, no. 10,
    pp. 1485-1502, Oct. 2011.
[6] Majumder, T.; Pande, P.P.; Kalyanaraman, A., "High-throughput,
    energy-efficient network-on-chip-based hardware accelerators,"
    Sustainable Computing: Informatics and Systems, vol. 3, no. 1,
    pp. 36-46, March 2013.
[7] Majumder, T.; Pande, P.P.; Kalyanaraman, A., "Wireless NoC Platforms
    With Dynamic Task Allocation for Maximum Likelihood Phylogeny
    Reconstruction," Design & Test, IEEE, vol. 31, no. 3, pp. 54-64, June
    2014.
[8] M. Ebrahimi, et al., "LEAR - A Low-weight and Highly Adaptive
    Routing Method for Distributing Congestions in On-Chip Networks," in
    Proc. 20th IEEE Euromicro Conf. on Parallel, Distributed and
    Network-Based Computing (PDP), pp. 520-525, Feb. 2012.
[9] R. David, P. Bogdan, R. Marculescu, and U. Ogras, "Dynamic power
    management of voltage-frequency island partitioned networks-on-chip
    using Intel Single-chip Cloud Computer," in International Symposium
    on Networks-on-Chip, 2011, pp. 257-258.
[10] N. Kapadia, S. Pasricha, "VISION: a framework for voltage island
    aware synthesis of interconnection networks-on-chip," Proc. Great
    Lakes Symposium on VLSI, 2011, pp. 31-36.
[11] X. Jin and S. Goto, "Hilbert transform-based workload prediction and
    dynamic frequency scaling for power efficient video encoding," IEEE
    Transactions on Computer-Aided Design of Integrated Circuits and
    Systems, vol. 31, no. 5, pp. 649-661, May 2012.
[12] M. Daneshtalab, M. Ebrahimi, T. C. Xu, P. Liljeberg and H.
    Tenhunen, "A generic adaptive path-based routing method for
    MPSoCs," J. Syst. Architect., vol. 57, no. 1, pp. 109-120, 2011.
[13] T. Mak, K. Lam, P. Cheung, and W. Luk, "Adaptive routing for
    network-on-chips using a dynamic programming network," IEEE Trans.
    Ind. Electron., pp. 116, 2010.
[14] Y. Liu, Y. Ruan, Z. Lai, and W. Jing, "Energy and Thermal Aware
    Mapping for Mesh-based NoC Architectures using Multi-objective Ant
    Colony Algorithm," in ICCRD '11: Proceedings of the 2011
    International Conference on Computer Research and Development,
    2011.
[15] M. Ebrahimi et al., "HARAQ: Congestion-Aware Learning Model for
    Highly Adaptive Routing Algorithm in On-Chip Networks," Proc.
    ACM/IEEE Sixth Int'l Symp. Networks-on-Chip (NOCS), pp. 19-26,
    May 2012.
[16] Ge-Ming Chiu, "The odd-even turn model for adaptive routing,"
    Parallel and Distributed Systems, IEEE Transactions on, vol. 11, no. 7,
    pp. 729-738, Jul 2000.
[17] Jose, J.; Mahathi, K. V.; Shankar, J.S.; Mutyam, M., "TRACKER: A
    low overhead adaptive NoC router with load balancing selection
    strategy," Computer-Aided Design (ICCAD), 2012 IEEE/ACM
    International Conference on, pp. 564-568, 5-8 Nov. 2012.
[18] Myong Hyon Cho; Lis, M.; Keun Sup Shim; Kinsy, M.; Devadas, S.,
    "Path-based, Randomized, Oblivious, Minimal routing," Network on
    Chip Architectures, 2009. NoCArc 2009. 2nd International Workshop on,
    pp. 23-28, 12 Dec. 2009.
[19] Abbas Eslami Kiasari, Axel Jantsch, and Zhonghai Lu, "A
    framework for designing congestion-aware deterministic routing," in
    Proceedings of the Third International Workshop on Network on Chip
    Architectures (NoCArc '10), ACM, New York, NY, USA, pp. 45-50, 2010.
[20] Ascia, G.; Catania, V.; Palesi, M.; Patti, D., "Implementation and
    Analysis of a New Selection Strategy for Adaptive Routing in
    Networks-on-Chip," Computers, IEEE Transactions on, vol. 57, no. 6,
    pp. 809-820, June 2008.
[21] Ying-Cherng Lan; Chen, M.C.; Su, A.P.; Yu-Hen Hu; Sao-Jie Chen,
    "Fluidity concept for NoC: A congestion avoidance and relief routing
    scheme," SOC Conference, 2008 IEEE International, pp. 65-70,
    17-20 Sept. 2008.
[22] Fazzino, Fabrizio, Maurizio Palesi, and David Patti, "Noxim:
    Network-on-Chip simulator," URL: http://sourceforge.net/projects/noxim
    (2008).
[23] UMC 90 nm (http://www.umc.com/English/pdf/90nm_DM.pdf). Last
    accessed 5 March 2015.
