Networks On Chip, Router Architectures and Performance Challenges

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 6 (2016) pp 4392-4397
© Research India Publications. http://www.ripublication.com
Networks on Chip, router architectures and performance challenges
Manel Langar
Department of Electrical Engineering, National Engineering School of Tunis, Tunisia.
Riadh Bourguiba
Department of Electrical Engineering, National Engineering School of Tunis, Tunisia.
Jaouhar Mouine
Department of Electrical Engineering, Prince Sattam Bin Abdulaziz University, Saudi Arabia.
Abstract communication architecture can considerably decrease the

With the continuous technology scaling, System On Chips final system performance.
(SoCs) have evolved considerably and can integrate an To cope with the increased need of communication
important number of Intellectual Property (IP) cores in the bandwidth, integrated systems have evolved from simple
same chip. However, global interconnects are become the point-to-point (P2P) links, to crossbars, then buses and finally
main performance limitation of SoCs. The Network on Chip Networks-on-Chips (NoCs). Submitted to very strong timing
(NoC) paradigm has emerged as an efficient interconnection constraints, NoCs have been optimized in a way, very specific
structure addressing the global wire delay problem. This to SoCs, where time is counted in tens nanoseconds. In this
solution outperforms traditional on chip interconnects, such as paper, we expose these optimization techniques and we
Point to Point (P2P) links, busses and bridges. Since the propose a new one, based on buffers sharing.
introduction of the NoC paradigm in the last decade, several In the next section, we start by retracing the communication
methodologies have been presented by researchers to enhance techniques evolution in the SoC domain during the past
its performances in terms of latency, area occupancy and decades. In the third section, we present the NoC and describe
power consumption. In this paper we described the its characteristics. Then, in the fourth section we gradually
interconnect evolution from point to point links to network on present the optimizations elaborated to reduce latency and
chip paradigm. Then, we focused on the NoC concept and increase throughput of NoCs, before exposing our buffers
presented some of their principal characteristics. Finally, we sharing technique and its experimental results in the fifth
proposed a new switch architecture enabling an adaptive inter- section. Finally, the last section concludes this paper.
port buffers sharing. The proposed design optimizes the
virtual channels exploitation which results in an improvement
of the NoC zero load latency and throughput without inducing Communication in SoCs:
neither area nor power consumption overhead. The communication function is very important in any
embedded system: it can help express the computation power
Keywords: Network-on-Chip (NoC), System-on-Chip (SoC), of the whole architecture if well designed, as it can ruin all the
switch router, Multi-processors SoC (MPSoC), Buffer efforts if badly conceived. In the context of SoCs, it becomes
sharing, Mesh2D. of highest importance, since you have a very limited visibility
on the internal functioning of the system, you can not modify
it once it was fabricated, and any correction will require
Introduction several months and will cost hundreds thousand dollars...
Since late nineties, embedded systems have entered a new era,
where a whole system can be integrate into a single IP block
chip.These so called, Systems on Chip, or SoCs, combine During the 90’s, with the wide adoption of Hardware
software flexibility and hardware computation power to Description Languages (HDLs), such as VHDL or Verilog, as
provide real-time performance. Usually, they embarks several the standard way to conceive new digital circuits, a new
processors and coprocessors, with all the memories, concept has emerged: Intellectual Property blocs (IP blocs).
peripherals and interfaces required to implement any Based on the fact that hardware can be as reusable as
application chosen in a given domain (video, cell phone, etc.). software, and encouraged by design reuse methodologies, this
Two key aspects must be carefully taken in account when led to the idea that a system should not be built from scratch,
designing such devices: software/hardware partitioning and but rather built from already verified and proven components.
computational power. However, they both would be useless if Since then, all parts of a system are not seen as any scattered
not supported by a strong communication scheme. In fact, this components anymore, but as highly valuable and reusable IP
latter is even of highest importance, since an inadequate blocs, or shortly expressed: IPs. So designing a new SoCs
mainly signifies choosing among existing IPs, those that fit the
needs of the targeted application, and connect them properly.
4392
Only non-existing functionalities are specifically designed, 2. Arbitration: the arbiter analyzes the requests of all the
then deeply verified and documented, before being cataloged masters connected to the bus and designates a winner,
for future designs. depending on a specific arbitration policy (fixed priority,
To foster IPs reuse, secure investments and improve the round robin, time division multiple access,...).
overall quality, several standard have been specified that 3. Address decoding: the winning master places an address
precisely define IP interfaces (connectivity and protocol). on the bus for the decoder. This latter decodes the high
These later considerably facilitate IPs exchange between order bits of the address to select a slave, which in turn
design projects and companies. decodes the low order bits of the address to select an
internal storage location and prepares itself for
P2P link responding.
In early devices, communication was quite simple: a producer 4. Data exchange: Depending on the transfer direction (read
synchronizes with a consumer using a simple handshake, in or write), data is either placed on the bus by the slave and
order to transmit data. Today, IPs still implement the same read by the master, or placed by the master and written in
principles, but adapt them to the SoC flavour. Interfaces can the slave.
be master or slave, depending on who order data exchange.
Also, signals have specific meanings and can be optional, In practice, these four phases do not take place sequentially in
depending on the kind of IP (for example, a processor can the same clock cycle, but are rather distributed on four
eventually have cache management signals in its interface). consecutive cycles, in a pipelined fashion. With this
Given the complexity of real components, often IPs have technique, overall latency stays the same, but the throughput
multiple interfaces dedicated to different usages. Thus, a is almost multiplied by four, since four data transfers can
Direct Memory Access (DMA) coprocessor can have both a overlay at a single cycle. Pipelined transfers are supported by
slave interface to receive instruction from the main processor, a panel of burst transfers that avoids pipeline stalls during re-
and a master interface to read/write data from/to memory. arbitration.
Crossbar Network on Chip

P2P links are not adapted to resource sharing. For example, if In 2002, Luca Benini and Giovanni De Micheli introduced a
two processors share two memories and a serial new interconnect structure: the Network on Chip (NoC) [1].
communication controller, P2P links would be of poor help, NoCs outperform bus based interconnects by allowing a high
while a crossbar would perfectly fit the need. This latter number of IPs to communicate simultaneously. They provide
provides simultaneous access to different shared slave IPs, higher bandwidth, higher scalability and better modularity.
together with arbitration between master IPs when conflict Packet switched NoCs are composed of three main
accesses attempts occur. components:
Unfortunately, their ease of use and their great bandwidth is - Network Interfaces (NI): adapt NoC protocol to that of IP
paid a high price, as it implies complex routing schemes, that cores and vice versa.
are not scalable with an increasing number of masters and - Switches (routers): route messages from source nodes to
slaves in the system. destination nodes according to a predefined routing
algorithm.
Shared Bus - Links: physical links that connect switches and NIs.
In order to reduce the routing complexity of SoCs, while still
maintaining resource sharing, system architects have Figure 1 below illustrates an example of a 2D mesh NoC.
integrated communication buses. With this kind of
architecture, IPs are connected to the bus through their master
and/or slave interface(s), if natively compatible with the bus
protocol.
Otherwise, an adapter will bridge the gap between between
the IP’s interface protocol and the bus protocol.
Again, to improve design reuse and interoperability, standard
protocols have been specified. Most of them suppose two bus
architectures: one dedicated to high performance IPs
(processors, co-processors, SRAM, DRAM controller,...) and
another one to slow peripherals (input/output controllers,
general purpose I/O, sensors interfaces,...). Both have an
address decoder to select a slave, but usually only the first one
has an arbitration unit to take a decision when several masters
try to access the bus simultaneously. A SoC can embed as
many buses as required from each type,that can be connected
with bridges.
A bus transfer passes by four phases:
1. Bus request: the master behind the transfer notifies to the Figure 1: Network on Chip with mesh topology
arbiter that it needs to access the bus.
4393
NoC characteristics: - Wormhole: It is similar to the virtual cut-through but the

NoCs are characterized by several design parameters: packet is slit into sub-packets called flits (Flow Control
topology, routing algorithm, flow control and switch Unit). This technique has the best performances in terms
architecture. of packet latency and required buffer resources [4].
Topology Flow Control

The topology describes the way in which switches, network The flow control manages network resources, such as channel
interfaces and links are organized. There are several bandwidth, buffer slots, which are allocated to a packet
topologies for NOC architecture like mesh, torus, ring, circulating in the NoC. The flow control can be either buffered
butterfly, octagon, fat tree, heterogeneous [4]. The NoC or bufferless. The Bufferless Flow Control is not widely used
topology impacts its implementation cost. If the number of because it has more latency and fewer throughput when
connections per switch is high, the total bandwidth of the compared to the Buffered Flow Control. There are many
system improves, but results in an overhead in terms of Buffered Flow Control such as Credit Based [3],
implementation cost. Then, not all network topologies are ACK/NACK[2] and Handshaking Signal based Flow Control
suited for a silicon implementation of a NoC. Mesh topology [22].
is one of the most used topologies due to its regular structure
and its simple architecture (Figure 1).
Implementation Techniques
Routing algorithm Implementing a NoC implies constraints that are very specific
In packet switching NoCs, the routing algorithm describes to the microelectronic context. In this domain, time and
how a packet is routed to its destination node. latency are measured in nanosecond, which prevents using
There are two main types of routing algorithms: sophisticated algorithms for routing or resource allocation.
- Deterministic routing: all the packets of a source- Contrariwise, performance can be found by promoting
destination couple will traverse the same path. simplicity and regularity, which in turn allows reducing of
- Adaptive routing: packets of a source-destination couple silicon area and increasing the throughput, thanks to
can traverse different paths. The routing path can be pipelining.
modified depending on a metric such as the network
congestion. Basic architecture
Each flit arriving at an input port of the switch is buffered,
Adaptive algorithms can reduce congestion situations and then routed, before being authorized to pass the crossbar to its
enhances the reliability when compared to deterministic ones. destination. So a basic implementation of a NoC switch must
However, when using adaptive algorithms, the reception order inevitably provide a FIFO buffer for each port, plus a router, a
of packets can be different to that by which there were sent. So crossbar and an arbiter for managing accesses conflict to the
reordering buffers are needed to reorder the received packets same output port (Figure 2). As we use the wormhole flow
and their area is important. control, the FIFO buffer can be quite smaller than the packet
During the routing task, some failure situations can occur: length.
- Deadlock: a packet is waiting for an event that can not Usually, the FIFO depth is four to eight flits and it stores data
happen. synchronously with the clock rising-edge.
- Livelock: a packet never arrives at its destination and The router must implement a strategy that can give a response
remain indefinitely in the NoC. within a clock cycle. Nonetheless, routing is made only for the
header flit, as consecutive flits of the same packet follow the
The XY routing algorithm is one of the most known same way.
deterministic deadlock-free algorithms used by 2D mesh
topology. It routes packets in the horizontal direction, then in
vertical.
Switching techniques
The Switching technique describes how the packets are
transmitted between switches from source to destination
nodes.
There are three main switching techniques:
- Store-and-forward: packets are transferred to the next
switch only if the whole packet has been received [19]. So
switches must have enough buffer resources to receive the
longest packet circulating in the NoC. Also, packets
latency depends on their length.
- Virtual cut-through: It is similar to the store-and-forward
but it allows the packet to be transferred to the next switch
once its header is received. This technique reduces the
packet latency and the required buffering resources [10]. Figure 2: Structure of Wormhole switch
4394
Pipeline Look-ahead routing

With the previous architecture, in most favorable conditions, a With Virtual channels, the execution time of the second stage
flit has a single cycle to be routed, to get access to the output of the pipeline has been extended with the delay necessary to
port, to pass the crossbar and to cross the link to the next allocate a virtual channel, which results in unbalanced
switch where it will be written. Since the delay for all these pipeline stages and a decreased throughput.
operations is quite long, it is better to detach them into To solve this problem, the solution is to anticipate the routing
separate operations that would be performed in different clock of the packet at the previous switch. By doing so, a packet is
cycles, at an increased clock frequency, in a pipelined manner. already routed when it arrives at the input port. Thus virtual
Typically, the pipeline is made of three stages: one for routing channel allocation, followed by switch allocation, can take
and switch allocation, another for bridging the crossbar, and place immediately after extracting the flit from the FIFO
the last one for passing the link to the input buffer of the next buffer (Figure 5). Of course, routing for the next switch still
switch (Figure 3). As for the pipelined buses, this technique has to be done, but since it depends only on the destination
leaves the latency untouched, but multiplies the maximum node and the position of the current node in the NoC, it can be
theoretical throughput by three. done in parallel. However, this restricts the choice for the
routing algorithm.
Figure 3: Wormhole switch pipeline stages

Figure 5: Virtual Channel switch pipeline stages
Virtual channels
With wormhole switching, a blocked packet can block all the
switches over which it is spanned, resulting in severe Speculative switch allocation
performance decrease. To reduce this risk, one solution Another technique used to reduce the execution time of the
consists in providing a bypass to get around the blocked second pipeline stage of a VC switch speculates about the
packet. For this purpose, few FIFO buffers are added to the results of the virtual channel allocation. This gives the
previous single one, in each input port. Thus each FIFO opportunity to the header flit of a new packet to try getting
buffer, or so called virtual channel, will play the role of a access to an output port, before having effectively acquired an
highway lane, that would let the fast packets to exceed slow output VC [17]. Hence, virtual channel allocation and switch
ones (Figure 4). allocation can be done in parallel during a single clock cycle
However, virtual channel raises a new problem: virtual (Figure 6).
channels management, or more precisely virtual channel
allocation. This is done by a new hardware unit, that takes
place in the second stage of the pipeline, after routing and
before switch allocation.
Figure 6: Virtual Channel switch pipeline stages: speculation

impact
Intra-port buffer sharing

The memory used for FIFO buffers can represent about 40%
of the total switch area. So two ways have been explored to
reduce it, based on memory sharing. One [16] reconfigures
FIFO buffer size, while the other [7] proposes dynamically
Figure 4: Structure of Virtual Channel switch
shared buffers. The saved silicon area can be regarded as an
4395
economy, or can be re-invested to increase the number of

virtual channels, and enhance the switch throughput at limited
costs.
Inter-ports buffer sharing

In this paper, we propose another technique to reduce the
memory area, based on sharing unused queues between
neighboring input ports (Figure 7).
So, when an input port doesn’t use a FIFO queue, this latter
can be lent to another input port. It is particularly beneficial,
because in a real NoC, the local data traffic is never isotropic,
but rather oriented in a specific direction, that evolves with
time. Inter-ports buffer sharing lets the switch re-organize its
input FIFO buffers to absorb the data traffic, as the sail of a Figure 8: Latency-throughput curves over uniform random
ship that would rotate to capture the wind. traffic
As shown by the curves, our proposed switch improves the

saturation throughput (0.29 flits/cycle) when compared to the
VC2 switch (0.28 flits/cycle). Also, it reduces the zero-load
latency by 3 cycles (33 cycles versus 36 cycles).
As for the power and area consumption, the two switches have
very similar architectures with the same buffers size, so their
performances are very close.
The two routers were written in VHDL code and simulated
using Mentor Graphics Modelsim. We implemented them
using TSMC 65nm technology. The synthesis of the VHDL
code is done using Cadence RTL Compiler. We constrained
the designs and verified that they work properly at 500Mhz
target frequency.
Figure 7: Example of inter-ports buffer sharing

Conclusion
Progress in silicon technology has enabled the integration of a
Experimental Results whole electronic system into a single chip. Systems on chip
To evaluate performance of the proposed inter-port buffers combine software and hardware to provide both adaptability
sharing NoC switch and the typical speculative VC switch and power computation for demanding embedded
[17], we developed two cycle-accurate simulators. applications. In this paper, we started by relating the on chip
We considered a NoC with a regular 2D mesh topology using interconnect evolution. Then we introduced the network on
the XY deterministic routing algorithm. Both switches chip paradigm and described its main characteristics and
implement the virtual channel wormhole flow control with design parameters. Finally, we proposed a new switch
credit based protocol. They have five bi-directional ports: four architecture allowing adaptive virtual channel buffers sharing
ports (east, west, north and south) connected to neighbor among different input ports, in order to improve buffer usage
switches and a local port connected to an IP via the network of switches in 2D Mesh NoCs. A simulation platform was
interface (Figure 1). Each input port has 2 VCs with 4 flit developed and experiments over uniform traffic pattern have
buffers per VC. shown significant improvements of the zero load latency and
The packet is split in four 32 bits flits. Simulation of each saturation throughput metrics.
switch runs firstly for 10.000 warm-up cycles. Then, 100.000
packets are injected to the NoC and the simulation progresses
till the reception of all these packets. References
Latency of a packet is measured from the time its head flit is
created by the source IP core, to the time its tail flit is [1] Luca Benini and Giovanni De Micheli. Networks on
consumed by the destination. Average network latency is the chips: A new soc paradigm. Computer, 35(1):70-78,
mean of all packet latencies in the network. January 2002.
Experiments are performed over uniform random traffic [2] Davide Bertozzi, Antoine Jalabert, Srinivasan
pattern across a 8-by-8 mesh network. Murali, Rutuparna Tamhankar, Stergios Stergiou,
Figure 8 bellow represents the latency-throughput curves of Luca Benini, and Giovanni De Micheli.Noc synthesis
the proposed switch and VC2 switch over uniform random flow for customized domain specific multiprocessor
traffic. systems-on-chip. IEEE Trans. Parallel Distrib. Syst.,
16(2):113-129, February 2005.
4396
[3] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and [16] Masoud Oveis-Gharan and Gul N. Khan. Statically
Avinoam Kolodny. Qnoc: Qos architecture and adaptive multi fifo buffer architecture for network on
design process for network on chip. Journal of chip. Microprocess. Microsyst., 39(1):11-26,
systems architecture, 50:105-128, 2004. February 2015.
[4] William Dally and Brian Towles. Principles and [17] Li-Shiuan Peh and William J. Dally. A delay model
Practices of Interconnection Networks. Morgan and speculative architecture for pipelined routers. In
Kaufmann Publishers Inc., San Francisco,CA, USA, Proceedings of the 7th International Symposium on
2003. High-Performance Computer Architecture, HPCA
[5] William J. Dally. Virtual-channel flow control. ’01, pages 255-, Washington, DC, USA, 2001. IEEE
SIGARCH Comput.Archit. News, 18(2SI):60-68, Computer Society.
May 1990. [18] Rohit Sunkam Ramanujam, Vassos Soteriou, Bill
[6] William J. Dally and Brian Towles. Route packets, Lin, and Li-Shiuan Peh. Design of a high-throughput
not wires: On-chip inteconnection networks. In distributed shared-buffer noc router. In Proceedings
Proceedings of the 38th Annual Design Automation of the 2010 Fourth ACM/IEEE International
Conference, DAC ’01, pages 684-689, New York, Symposium on Networks-on-Chip, NOCS ’10, pages
NY, USA, 2001. ACM. 69-78, Washington, DC, USA, 2010. IEEE Computer
[7] Ming Fang, Songqiao Chen, and Zhang Heying. Society.
Design a dynamically shared buffer with prefetching [19] Andrew Tanenbaum. Computer Networks. Prentice
for multiple virtual channel network on chip. Journal Hall Professional Technical Reference, 4th edition,
of Convergence Information Technology, 8(5):235- 2002.
242, March 2015. [20] Anh T. Tran and Bevan M. Baas. Roshaq: High-
[8] Christopher J. Glass and Lionel M. Ni. The turn performance on-chip router with shared queues. In
model for adaptive routing. J. ACM, 41(5):874-902, Proceedings of the 2011 IEEE 29th International
September 1994. Conference on Computer Design, ICCD ’11, pp.
[9] Yatin Hoskote, Sriram Vangal, Arvind Singh, Nitin 232-238, Washington, DC, USA, 2011. IEEE
Borkar, and Shekhar Borkar. A 5-ghz mesh Computer Society.
interconnect for a teraflops processor. IEEE Micro, [21] Hangsheng Wang, Li-Shiuan Peh, and Sharad Malik.
27(5):51-61, September 2007. Power driven design of router microarchitectures in
[10] Parviz Kermani and Leonard Kleinrock. Virtual cut- on-chip networks. In Proceedings of the 36th Annual
through: a new computer communication switching IEEE/ACM International Symposium on
technique. Computer Networks, 3:267-286, 1979. Microarchitecture, MICRO 36, pages 105-,
[11] Khalid Latif, Amir-Mohammad Rahmani, Ethiopia Washington, DC, USA, 2003. IEEE Computer
Nigussie, Tiberiu Seceleanu, Martin Radetzki, and Society.
Hannu Tenhunen. Partial virtual channel sharing: A [22] Cesar Albenes Zeferino and Altamiro Amadeu Susin.
generic methodology to enhance resource Socin: A parametric and scalable network-on-chip. In
management and fault tolerance in networks-on-chip. Proceedings of the 16th Symposium on Integrated
J. Electron. Test., 29(3):431-452, June 2013. Circuits and Systems Design, SBCCI ’03, pp. 169-,
[12] De Micheli and L. Benini. Networks on Chips. Washington, DC, USA, 2003. IEEE Computer
Morgan Kaufmann Publishers Inc., San Francisco, Society.
CA, USA, 2006.
[13] Robert Mullins, Andrew West, and Simon Moore.
Low-latency virtual-channel routers for on-chip
networks. In Proceedings of the 31st Annual
International Symposium on Computer Architecture,
ISCA ’04, pages188-, Washington, DC, USA, 2004.
IEEE Computer Society.
[14] M. H. Neishaburi and Zeljko Zilic. Reliability aware
noc router architecture using input channel buffer
sharing. In Proceedings of the 19th ACM Great
Lakes Symposium on VLSI, GLSVLSI ’09, pages
511-516, New York, NY, USA, 2009. ACM.
[15] Chrysostomos A. Nicopoulos, Dongkook Park,
Jongman Kim, N. Vijaykrishnan, Mazin S. Yousif,
and Chita R. Das. Vichar: A dynamic virtual channel
regulator for network-on-chip routers. In Proceedings
of the 39th Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO 39, pages
333-346, Washington, DC, USA, 2006. IEEE
Computer Society.
4397

Networks On Chip, Router Architectures and Performance Challenges

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Networks On Chip, Router Architectures and Performance Challenges

Hochgeladen von

Copyright:

Verfügbare Formate

International Journal of Applied Engineering Research ISSN 0973-4562 Volume 11, Number 6 (2016) pp 4392-4397

© Research India Publications. http://www.ripublication.com

Networks on Chip, router architectures and performance challenges

Abstract communication architecture can considerably decrease the

Crossbar Network on Chip

NoC characteristics: - Wormhole: It is similar to the virtual cut-through but the

Topology Flow Control

Pipeline Look-ahead routing

Figure 3: Wormhole switch pipeline stages

Figure 6: Virtual Channel switch pipeline stages: speculation

Intra-port buffer sharing

economy, or can be re-invested to increase the number of

Inter-ports buffer sharing

As shown by the curves, our proposed switch improves the

Figure 7: Example of inter-ports buffer sharing

Das könnte Ihnen auch gefallen