Performance Improved Network On Chip Router For Low Power Applications

INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY
VOLUME 4 ISSUE 1 APRIL 2015 - ISSN: 2349 - 9303
Ms. A. Sambavi (1), Mr. I. Koil Pandi (2)
(1) Ratna Vel Subramaniam College of Engineering & Technology, ECE Department,
sambavirama7@email.com
(2) Ratna Vel Subramaniam College of Engineering & Technology, ECE Department,
gracemejar@email.com
Abstract: On-chip routers typically have buffers dedicated to their input or output ports for temporarily storing packets
when contention occurs. These buffers consume a significant portion of the router area. While running a traffic trace,
however, not all input ports of a router have incoming packets to transfer at the same time, so many buffer queues in the
network are empty while other queues are mostly busy. This observation motivates the design of a Router architecture with
Shared Queues (RoShaQ), which maximizes buffer utilization by sharing multiple buffer queues among the input ports. In the
design of an NoC, the most essential elements are the network topology and the routing algorithm. Routers route packets
according to the algorithm they use, and every system has its own requirements for the routing algorithm. A new adaptive
weighted XY routing algorithm for an eight-port router architecture is proposed in order to decrease the latency of the
network-on-chip router.
Index Terms: adaptive weighted routing, networks on chip, router architecture, shared buffer.

1 INTRODUCTION
Network on chip (NoC) is a communication subsystem on an
integrated circuit. In the design of NoCs, high throughput and
low latency are both important design parameters, and the router
microarchitecture plays a vital role in achieving these performance
goals. In a typical router, each input port has an input buffer for
temporarily storing packets in case the output channel is busy.
This buffer can be a single queue, as in a wormhole (WH) router, or
multiple queues in parallel, as in Virtual Channel (VC) routers.
High-throughput routers allow an NoC to satisfy the communication
needs of multicore and many-core applications; alternatively, the
higher achievable throughput can be traded off for power savings,
with fewer resources used to attain a target bandwidth. Further,
achieving high throughput is also critical from a delay perspective
for applications with heavy communication workloads, because
queueing delays grow rapidly as the network approaches saturation.
Another approach to higher throughput is to share buffer queues,
which allows idle buffers to be utilized or an output-buffered router
to be emulated. Our work differs from those router designs by allowing
packets at the input ports to bypass the shared queues; hence, it
achieves lower zero-load latency. In addition, the proposed router
architecture has simple control circuitry, so it dissipates less
energy per packet than VC routers, and it achieves higher throughput
by letting queues share workloads when the network load becomes
heavy. The proposed routing algorithm, which is transparent with
respect to the router implementation, is presented, discussed,
and assessed by means of simulation on synthetic and real traffic
scenarios. The analysis takes into account several aspects and
metrics of the design.
2 RELATED WORKS AND CONTRIBUTIONS

Peh et al. [19] and Mullins et al. [16] proposed speculative
techniques for VC routers that allow a packet to arbitrate
simultaneously for both VCA and SA, giving non-speculative packets
higher priority to win SA. This reduces zero-load latency when the
probability of failed speculation is small. This low latency,
however, comes with high complexity in the SA circuitry and wastes
power each time the speculation fails.
Sophisticated extensions to input-buffered router (IBR)
microarchitectures have been proposed for improving throughput,
latency, and power. For throughput, techniques like flit-reservation
flow control, variable allocation of VCs, and express VCs [13] have
been proposed. As these designs are input-buffered, they can only
multiplex arriving packets from their input ports across the crossbar
switch, unlike our proposed router architecture, which can shuffle
incoming packet flows from all input ports and then pass them onto
the crossbar switch.
Recently, Passas et al. [18] designed a 128 × 128 crossbar that
connects 128 tiles while occupying only 6% of their total area.
This fact encourages us to build RoShaQ, which has two crossbars
while sharing costly buffer queues.
Nicopoulos et al. [17] proposed ViChaR, a router architecture that
allows packets to share flit slots inside the buffer queues so that
it can achieve higher throughput. Our design manages buffers at a
coarser grain, at queue level rather than flit level, and hence
allows reusing an existing generic queue design, which makes the
buffer and router design much simpler and more straightforward.
Ramanujam et al. [21] recently proposed a router architecture with
shared queues, named DSB, which emulates an output-buffered router.
The majority of state-of-the-art on-chip router designs use input
queuing buffers; a few output queuing router architectures can,
however, be found in the literature [9]. Looking at the whole network
picture, buffers at an output port of a router should act the same as
the input buffers of its downstream router. Depending on the network
load, RoShaQ can dynamically adapt to use the bypass paths or the
shared queues.

Ms. A. Sambavi is currently pursuing a master's degree program in applied
electronics at Ratna Vel Subramaniam College of Engineering &
Technology, Dindigul, PH-624005. E-mail: sambavirama7@mail.com
Mr. I. Koil Pandi is an Assistant Professor at Ratna Vel Subramaniam
College of Engineering & Technology, Dindigul, PH-624005. E-mail:
gracemajar@mail.com
3 DESCRIPTION OF THE METHOD

Routing on an NoC is quite similar to routing on any network. A routing
algorithm determines how data is routed from sender to receiver.
Oblivious algorithms route packets without any information about
traffic amounts and network conditions, deterministic algorithms
always route packets along the same route, and stochastic routing is
based on randomness.
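To make the deterministic case concrete, here is a minimal sketch of plain dimension-order (XY) routing on a 2-D mesh. It is offered only as an illustration of a deterministic algorithm, not as the weighted algorithm proposed in this paper; the function name `xy_route` and the (x, y) tuple convention are assumptions made for this example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a node in the mesh (assumed convention)


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return every node visited from src to dst under XY routing."""
    x, y = src
    path = [(x, y)]
    # First travel along the x dimension until the column matches.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    # Then travel along the y dimension until the row matches.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    # The same source/destination pair always yields the same path.
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination coordinates, two packets between the same pair of nodes always take the same route, regardless of traffic.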
3.1 WH Router architecture

The router is the most important component in the design of an NoC
system. Fig. 1 shows how a flit travels through a WH router. First,
when a flit arrives at an input port, it is written to the
corresponding buffer queue. This step is called buffer write or Queue
Write (QW). Assuming there are no other packets ahead of it in the
queue, the packet starts deciding the output port for its next router
(based on the destination information contained in its head flit)
instead of for the current router; this is known as Lookahead Routing
Computation (LRC). Simultaneously, it arbitrates for its output port
at the current router, because multiple packets from different input
queues may request the same output port. This step is called Switch
Allocation (SA). If it wins the SA, it traverses the crossbar. This
step is called crossbar traversal or Switch Traversal (ST). It then
traverses the output link toward the next router. This step is called
Link Traversal (LT).
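The stage sequence above suggests a simple back-of-the-envelope estimate of zero-load latency. The sketch below assumes that each of the four steps (QW, LRC together with SA, ST, LT) takes exactly one clock cycle, that there is no contention, and that the flits of a packet follow each other one cycle apart. These timing assumptions and the function name are introduced here for illustration only.

```python
# Pipeline steps of the WH router as described in the text; LRC and SA
# are performed simultaneously, so they share one entry.
WH_STAGES = ("QW", "LRC+SA", "ST", "LT")


def wh_zero_load_latency(hops: int, flits_per_packet: int) -> int:
    """Cycles from head-flit arrival at the first router until the tail
    flit leaves the last link, assuming one cycle per stage and no
    contention (both are assumptions of this sketch)."""
    if hops < 1 or flits_per_packet < 1:
        raise ValueError("hops and flits_per_packet must be at least 1")
    head_latency = hops * len(WH_STAGES)   # head flit passes every stage of every router
    serialization = flits_per_packet - 1   # remaining flits trail one cycle apart
    return head_latency + serialization


if __name__ == "__main__":
    print(wh_zero_load_latency(hops=3, flits_per_packet=4))
```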
3.2 VC Router architecture

A Virtual Channel Router (VCR) is a router that uses a source routing
algorithm and wormhole network flow control with virtual channels.
It is suitable for on-chip networks with two-dimensional topologies.
The datapath of the router consists of buffers and a switch. In this
VC router design, an input buffer has multiple queues in parallel,
each of which is called a VC. This allows packets from different
queues to bypass each other and advance to the crossbar stage instead
of being blocked by a packet at the head of the queue (however, all
queues at one input port can still be blocked if none of them wins SA
or if all corresponding output VC queues are full). Because an input
port now has multiple VC queues, each packet has to choose a VC at the
input port of its next router before arbitrating for the output
switch. The routing operation comprises four steps: Routing
Computation (RC), Virtual Channel Allocation (VCA), Switch Allocation
(SA), and Switch Traversal (ST). These are often implemented as four
pipeline stages in modern virtual-channel routers.
When a head flit (the first flit of a packet) arrives at an input
channel, the router stores the flit in the buffer for the allocated
virtual channel and determines the next-hop node for the packet. This
is the RC stage. Given the next hop, the router then allocates a
virtual channel in the next hop. This is the VCA stage. The next-hop
node and virtual channel decision is then used for the remaining flits
of the packet.
The relevant virtual channel is exclusively allocated to that packet
until the packet transmission completes. Finally, if the next hop can
accept the flit, the flit competes for the switch. This is the SA
stage. The flit then moves to the output port. This is the ST stage.
Fig. 2 shows the VC router architecture.
Fig. 2. VC router architecture.
An output VC is granted to a packet by a Virtual Channel Allocator
(VCA), and this VC allocation is performed in parallel with the LRC.
A VC router achieves higher saturation throughput than a WH router
with the same number of buffer entries per input port, but it also has
higher zero-load latency due to its deeper pipeline.
Fig. 1. WH router architecture.
Both LRC and SA are performed by the head flit of each packet; body
and tail flits follow the route already reserved by the head flit,
and the tail flit releases the reserved resources once it leaves the
queue. In a WH router, if a packet at the head of a queue is blocked
(because it is not granted by the SA or the corresponding input queue
of the downstream router is full), all packets behind it also stall.
This section describes the power model that contains different
components of power dissipation of a link.
3.3 Shared Buffer Router Architecture

When an input port receives a packet, it calculates the packet's
output port for the next router (lookahead routing); at the same time,
the packet arbitrates for both its decided output port and the shared
queues. If it receives a grant from the output port allocators (OPAs),
it advances to its output port in the next cycle. Otherwise, if it
receives a grant to a shared queue, it is written to that shared queue
in the next cycle. If it receives both grants, it gives priority to
advancing to the output port. The shared queue allocator (SQA)
receives requests from all input queues and grants their packets
permission to access non-full shared queues. Fig. 3 shows the shared
buffer router architecture. Packets from input queues are allowed to
write to a shared queue only if the shared queue is empty or contains
packets with the same output port as the requesting packet. The OPA
receives requests from both input queues and shared queues. Both the
SQA and the OPA grant these requests in a round-robin manner to
guarantee fairness and to avoid starvation and livelock.
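The allocation rules described above can be sketched in a few lines of Python. This is a simplified behavioural model, not the RTL of the proposed router: the class and function names (`SharedQueue`, `next_step`, `RoundRobinArbiter`) and the way a packet is reduced to its output port number are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedQueue:
    capacity: int
    # Each stored packet is represented only by its output port number.
    packets: List[int] = field(default_factory=list)

    def accepts(self, out_port: int) -> bool:
        """A packet may be written only if the queue is not full and is
        either empty or holds packets with the same output port."""
        if len(self.packets) >= self.capacity:
            return False
        return not self.packets or self.packets[0] == out_port


def next_step(opa_grant: bool, sqa_grant: bool) -> str:
    """Where an input-queue packet goes in the next cycle.
    An output-port grant takes priority over a shared-queue grant."""
    if opa_grant:
        return "output_port"
    if sqa_grant:
        return "shared_queue"
    return "wait"


class RoundRobinArbiter:
    """Grants one requester per call, starting the search just after the
    most recently granted requester."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # so that requester 0 is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None  # nobody requested


if __name__ == "__main__":
    arb = RoundRobinArbiter(3)
    # Requesters 0 and 1 keep requesting; grants alternate between them.
    print([arb.grant([True, True, False]) for _ in range(4)])
```

Rotating the starting point after every grant is what gives each requester a turn, which is the fairness property the text attributes to round-robin allocation.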
Fig. 3. Shared buffer router architecture.
Input queue, output port, and shared queue states maintain the status
(idle, wait, or busy) of all queues and output ports, and cooperate
with the SQA and OPA to control the overall operation of the router.
Only the input queues of RoShaQ have routing computation logic,
because packets in the shared queues were written from input queues
and hence already carry their output port information. RoShaQ has
the same I/O interface as a typical router, that is, the same number
of I/O channels with flit-level flow control and credit-based
backpressure management.
3.4 RoShaQ Datapath Pipeline

After a packet is written into an input queue in the first cycle, in
the second cycle it simultaneously performs three operations: LRC,
OPA, and SQA. LRC stands for Lookahead Routing Computation, OPA stands
for Output Port Allocator, and SQA stands for Shared Queue Allocator.
At low network load, the packet has a high chance of winning the OPA
because congestion at its desired output port is low; it is then
granted to traverse the output crossbar and the output link toward
the next router.
3.5 NoC Performance Parameters
Routing algorithms are evaluated primarily by measures of average
message latency and average system throughput. The hardware
requirements, in terms of the buffer size required per node and the
number of virtual channels per physical channel, are also used in
comparing routing algorithms. The performance of a network on chip
can be evaluated by three parameters: bandwidth, throughput, and
latency.
1. Bandwidth: The bandwidth refers to the maximum rate of
data propagation once a message is in the network. The unit
of measure for bandwidth is bits per second (bps), and it
usually considers the whole packet, including the bits of the
header, payload and tail.
2. Throughput: Throughput is defined as the maximum traffic
accepted by the network, that is, the maximum amount of
information delivered per time unit. Throughput is measured
in messages per second or messages per clock cycle. A
normalized throughput (independent of the size of the
messages and of the network) is obtained by dividing it by
the size of the messages and by the size of the network.
3. Latency: Latency is the time elapsed between the beginning
of the transmission of a message (or packet) and its complete
reception at the target node. Latency is measured in time
units and is mostly used as a basis for comparison among
different design choices. In this case, latency can also be
expressed in terms of simulator clock cycles. Normally, the
latency of a single packet is not meaningful, and the average
latency is used to evaluate the network performance. On the
other hand, when some messages present a much higher latency
than the average, this may be important.
4 WEIGHTED ROUTING ALGORITHM

In a physically static NoC, the routing decision can be distributed, or
a source-based deterministic routing scheme may be employed. The
weighted XY routing algorithm assigns each output port a weight
based on its available bandwidth and on the distance between the
current node and the destination node along the x coordinate
(columns), x_d - x, or along the y coordinate (rows), y_d - y.
This ideally gives the packet a maximum number of sensible routing
choices along its route, as it allows the packet to be routed toward
its destination in both the x and y directions. The weight is also
proportional to the available bandwidth. If the output port with the
highest available bandwidth is chosen, the used bandwidth is
distributed as evenly as possible among the output ports. Thus, the
other output ports are more likely to be able to accommodate future
transactions. The weight of each port is given as
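$$W_N = \begin{cases} B_N \cdot |y_d - y| + B_{max}, & y_d - y < 0 \\ 0, & B_N < B_p \\ B_N, & \text{else} \end{cases} \tag{1}$$

$$W_S = \begin{cases} B_S \cdot (y_d - y) + B_{max}, & y_d - y > 0 \\ 0, & B_S < B_p \\ B_S, & \text{else} \end{cases} \tag{2}$$

$$W_W = \begin{cases} B_W \cdot |x_d - x| + B_{max}, & x_d - x < 0 \\ 0, & B_W < B_p \\ B_W, & \text{else} \end{cases} \tag{3}$$

$$W_E = \begin{cases} B_E \cdot (x_d - x) + B_{max}, & x_d - x > 0 \\ 0, & B_E < B_p \\ B_E, & \text{else} \end{cases} \tag{4}$$

$$W_{NE} = \begin{cases} (B_N + B_E) \cdot (y_d - x_d) + B_{max}, & y_d - x_d < 0 \\ 0, & (B_N + B_E) < B_p \\ B_N + B_E, & \text{else} \end{cases} \tag{5}$$

$$W_{NW} = \begin{cases} (B_N + B_W) \cdot (y_d + x_d) + B_{max}, & y_d + x_d > 0 \\ 0, & (B_N + B_W) < B_p \\ B_N + B_W, & \text{else} \end{cases} \tag{6}$$

$$W_{SE} = \begin{cases} (B_S + B_E) \cdot (y_d - x_d) + B_{max}, & y_d - x_d < 0 \\ 0, & (B_S + B_E) < B_p \\ B_S + B_E, & \text{else} \end{cases} \tag{7}$$

$$W_{SW} = \begin{cases} (B_S + B_W) \cdot (y_d - x_d) + B_{max}, & y_d + x_d > 0 \\ 0, & (B_S + B_W) < B_p \\ B_S + B_W, & \text{else} \end{cases} \tag{8}$$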
In the above equations, the bandwidths of the north, south, west and
east ports are denoted as B_N, B_S, B_W and B_E. B_max is the maximum
bandwidth and B_p is the available port bandwidth. The weights of the
north, south, west, east, northeast, northwest, southeast and
southwest ports are denoted as W_N, W_S, W_W, W_E, W_NE, W_NW, W_SE
and W_SW. They are calculated to be proportional to the distance from
source to destination and to the available bandwidth if the output
direction faces the destination, and proportional to the available
bandwidth alone if it does not.
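As a rough illustration of the weight rule for the four cardinal ports only, the following sketch computes the weights and picks the port with the highest one. Several points are assumptions of this example rather than statements from the paper: north and west are taken to face the destination when y_d - y and x_d - x are negative (south and east when they are positive), the insufficient-bandwidth check is applied before the direction check, the diagonal ports are left out, and the names `cardinal_weights` and `choose_port` are hypothetical.

```python
from typing import Dict


def cardinal_weights(x: int, y: int, xd: int, yd: int,
                     bw: Dict[str, float],
                     b_max: float, b_p: float) -> Dict[str, float]:
    """Weights of the N, S, W and E ports for a packet at (x, y)
    heading to (xd, yd); bw maps each port to its available bandwidth."""
    # Positive value = number of hops this port brings the packet closer
    # (assumed convention: north/west correspond to decreasing y/x).
    toward = {"N": y - yd, "S": yd - y, "W": x - xd, "E": xd - x}
    weights: Dict[str, float] = {}
    for port, dist in toward.items():
        b = bw[port]
        if b < b_p:
            weights[port] = 0.0               # not enough bandwidth
        elif dist > 0:
            weights[port] = b * dist + b_max  # port faces the destination
        else:
            weights[port] = b                 # port faces away
    return weights


def choose_port(weights: Dict[str, float]) -> str:
    """Pick the direction with the highest weight (first one on ties)."""
    return max(weights, key=weights.get)


if __name__ == "__main__":
    w = cardinal_weights(x=2, y=2, xd=0, yd=0,
                         bw={"N": 4, "S": 4, "W": 1, "E": 4},
                         b_max=8, b_p=2)
    print(w, choose_port(w))
```

In this example the west port also faces the destination, but its bandwidth is below the threshold, so its weight is zero and the north port is selected.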
If there is not enough bandwidth available, the weights are zero.
The route chosen is then the direction with the highest weight. If a
packet enters the west port of the router, it is delivered through
one of the router ports according to the router port constraints. The
latency of a packet is measured from the time its head flit is
generated by the source to the time its tail flit is consumed by the
destination. Average network latency is the mean of the latencies of
all packets in the network.
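The two latency definitions above can be written compactly. The symbols are introduced here for illustration and are not the paper's notation: N is the number of packets, t_i^gen is the time the head flit of packet i is generated by the source, and t_i^cons is the time its tail flit is consumed by the destination.

```latex
L_i = t_i^{\mathrm{cons}} - t_i^{\mathrm{gen}}, \qquad
L_{\mathrm{avg}} = \frac{1}{N} \sum_{i=1}^{N} L_i
```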
5 RESULTS AND DISCUSSIONS
Our proposed router architecture achieves both the low latency of
input-buffered routers and the high throughput of output-buffered
ones without needing internal router speedup. Input packets can also
bypass the shared queues to achieve low latency when the network load
is low. RoShaQ thus forms a router with fine-grained shared buffers,
which could further improve network performance. The simulation is
performed for an eight-port network-on-chip router design, which is
based on the weight of each port of the router, to demonstrate the
effectiveness of the proposed design. Fig. 4 shows the output of the
eight-port router when the packet is delivered to the northwest port.
In this case the destination latitude is greater than the node
latitude while the destination longitude is less than the node
longitude.
Fig. 5 shows the RTL schematic diagram, which indicates the inputs
and the outputs.
Fig. 5. RTL schematic.
Fig. 6 shows the technology synthesis diagram, which gives a
cross-sectional front view of the IC.
Fig. 6. Technology synthesis.
Fig. 7 shows the synthesis report, which indicates that the average
latency of the eight-port router is 12.171 ns.
Fig. 7. Synthesis report.
Fig. 4. Northwest port output.
6 CONCLUSION
In this paper, an eight-port network-on-chip router using shared buffers
was proposed. A new routing algorithm that supports real-time traffic
direction arbitration while avoiding deadlock and starvation has been
presented. Experimental results verified that the proposed eight-port
router architecture can significantly reduce latency. Compared to a
conventional router, the bandwidth utilization of our eight-port router
also exhibited higher efficiency. In this project, the eight-port
network-on-chip router is designed based on the weight of each port of
the router. The proposed router design achieves an average latency of
12.171 ns.
REFERENCES
[1] A. Banerjee, P. T. Wolkotte, R. D. Mullins, S. W. Moore, and G. J. M. Smit,
"An energy and performance exploration of network-on-chip
architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17,
no. 3, pp. 319-329, Mar. 2009.
[2] E. Beigne, "An asynchronous power aware and adaptive NoC based
circuit," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1167-1177,
Apr. 2009.
[3] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L.
Benini, and G. De Micheli, "NoC synthesis flow for customized
domain specific multiprocessor systems-on-chip," IEEE Trans.
Parallel Distrib. Syst., vol. 16, no. 2, pp. 113-129, Feb. 2005.
[4] A. Bianco, P. Giaccone, G. Masera, and M. Ricca, "Power control
for crossbar-based input-queued switches," IEEE Trans. Comput.,
vol. 62, no. 1, pp. 74-84, Jan. 2013.
[5] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching
output queueing with a combined input/output-queued switch,"
IEEE J. Sel. Areas Commun., vol. 17, no. 6, pp. 1030-1039,
Jun. 1999.
[6] M. Galles, "Spider: A high-speed network interconnect," IEEE Micro,
vol. 17, no. 1, pp. 34-39, Jan. 1997.
[7] D. Gebhardt, J. You, and K. S. Stevens, "Design of an energy-efficient
asynchronous NoC and its optimization tools for heterogeneous
SoCs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol.
30, no. 9, pp. 1387-1399, Sep. 2011.
[8] O. He, S. Dong, W. Jang, J. Bian, and D. Z. Pan, "UNISM: Unified
scheduling and mapping for general networks on chip," IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp. 1496-1509,
Aug. 2012.
[9] M. G. Hluchyj and M. J. Karol, "Queueing in high-performance
packet switching," IEEE J. Sel. Areas Commun., vol. 6, no. 9, pp.
1587-1597, Dec. 1988.
[10] J. Hu and R. Marculescu, "Energy- and performance-aware mapping
for NoC architecture," IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 24, no. 4, pp. 551-562, Apr. 2005.
[11] H. Ito, M. Kimura, K. Miyashita, T. Ishii, K. Okada, and K. Masu,
"A bidirectional and multi-drop-transmission-line interconnect for
multipoint-to-multipoint on-chip communications," IEEE J.
Solid-State Circuits, vol. 43, no. 4, pp. 1020-1029, Apr. 2008.
[12] J. Kim, C. Nicopoulos, P. Dongkook, V. Narayanan, M. S. Yousif,
and C. R. Das, "A gracefully degrading and energy-efficient modular
router architecture for on-chip networks," in Proc. 33rd IEEE/ACM
ISCA, Jun. 2006, pp. 4-15.
[13] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, "Towards ideal
on-chip communication using express virtual channels," IEEE Micro,
vol. 28, no. 1, pp. 80-90, Jan. 2008.
[14] Y. C. Lan, H. A. Lin, S. H. Lo, Y. H. Hu, and S. J. Chen, "A
bidirectional NoC (BiNoC) architecture with dynamic
self-reconfigurable channel," IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 30, no. 3, pp. 427-440, Mar. 2011.
[15] B. Lin and I. Keslassy, "The concurrent matching switch
architecture," in Proc. IEEE Comput. Commun. Soc. (INFOCOM),
Apr. 2006.
[16] R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel
routers for on-chip networks," in Proc. 31st ISCA, Mar. 2004, p. 188.
[17] C. A. Nicopoulos, P. Dongkook, K. Jongman, N. Vijaykrishnan, M.
S. Yousif, and C. R. Das, "ViChaR: A dynamic virtual channel
regulator for network-on-chip routers," in Proc. 39th IEEE/ACM Int.
Symp. Microarchitect. (MICRO), Dec. 2006, pp. 333-346.
[18] G. Passas, M. Katevenis, and D. Pnevmatikatos, "A 128 × 128 × 24
Gb/s crossbar interconnecting 128 tiles in a single hop and occupying
6% of their area," in Proc. NOCS, 2010, pp. 87-95.
[19] L. Peh and W. J. Dally, "A delay model and speculative architecture
for pipelined routers," in Proc. Int. Symp. HPCA, Jan. 2001, pp.
255-266.
[20] A. Prakash, "Randomized parallel schedulers for
switch-memory-switch routers: Analysis and numerical studies," in
Proc. IEEE INFOCOM, vol. 3, Mar. 2004, pp. 2026-2037.
[21] R. S. Ramanujam, V. Soteriou, B. Lin, and L.-S. Peh, "Extending the
effective throughput of NoCs with distributed shared-buffer routers,"
IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no.
4, pp. 548-561, Apr. 2011.
[22] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, "Near-optimal
worst-case throughput routing for 2-D mesh networks," in Proc. 32nd
IEEE/ACM ISCA, Jun. 2005, pp. 432-443.
