Teamour Esmaeili, Dep. of Computer Engineering, DareShahr Branch, Islamic Azad University, Iran
Ghazal Lak, Dep. of Computer Engineering, DareShahr Branch, Islamic Azad University, Iran
Akram Noori Rad, Dep. of Computer Engineering, DareShahr Branch, Islamic Azad University, Iran

Abstract - The Folded Torus interconnection topology is widely used in massively parallel machines. Defects in the manufacturing of integrated circuits are almost inevitable, and fast technology scaling has made the components of a Network-on-Chip (NoC) more susceptible to faults. Therefore, it is crucial to sustain chip production yield and reliable operation in the presence of defects.

Index Terms - Network on Chip (NOC), NOC-torus-Folded, NS-2, simulation.

1 INTRODUCTION

In chip multiprocessors (CMPs), data access latency depends on the memory hierarchy organization, the on-chip interconnect (NoC), and the running workload. Reducing data access latency is vital to achieving performance improvements and scalability of threaded applications. Multithreaded applications generally share data among the program threads, which generates coherence and data traffic on the NoC. Many NoC designs exploit communication locality to reduce communication latency by configuring special fast paths on which communication is faster than on the rest of the NoC. Communication patterns are directly affected by the cache organization. However, many cache organizations are designed in isolation from the underlying NoC or assume a simple NoC design, thus possibly missing optimization opportunities. In this work, we present a NoC-aware design to reduce data access latency, improve network utilization, and improve overall system performance.

The number of processor, memory and accelerator cores on systems-on-chip is rapidly increasing to support evolving standards and new applications.
Computation and communication complexity is skyrocketing, and scalability-centric design paradigms are critically needed [1]. Networks-on-Chip (NoCs) have emerged as the best alternative to provide high communication performance for future Systems-on-Chip (SoCs) with dozens of cores integrated on a single silicon die. Mapping an application to the on-chip network is the first and most important step in the design flow, as it dominates the overall performance and cost [2]. Several approaches to topological mapping in NoCs have been proposed in the literature [3].

Mapping algorithms mostly focus on the 2D mesh topology, which is the most popular topology in NoC design due to its layout efficiency, good electrical properties and simplicity in addressing on-chip resources. Another concern in NoC implementation is selecting an efficient routing strategy while providing freedom from deadlock. The routing algorithm determines the path that each packet follows between a source-destination pair.

In future chip generations, faults will appear with increasing probability due to the susceptibility of shrinking feature sizes to process variability, age-related degradation, crosstalk, and single-event upsets. To sustain chip production yield and reliable operation, very large numbers of faults will have to be tolerated [4, 5]. This argument strengthens the notion that chips need to be designed with some level of built-in fault tolerance. Furthermore, relaxing the requirement of 100% correctness in the operation of various components and on-chip channels profoundly reduces the manufacturing cost as well as the cost incurred by test and verification [6].

Multi-core processor performance depends on the data access latency, which in turn depends heavily on the design of the on-chip interconnect (NoC) and the organization of the memory caches.
The cache organization affects the distance between where a data block is stored on chip and the core(s) accessing the data. The cache organization also affects the utilization of the cache capacity, which in turn affects the number of misses that require costly off-chip accesses. As the number of cores in the system increases, the data access latency becomes an even greater bottleneck.

In the domain of network-on-chip (NoC), previous research has attempted to reduce communication latency by a variety of approaches. One approach is based on reducing the global hop count [7, 8, 9, 10, 11, 12]. Another approach provides fast delivery of high priority cache blocks [13]. A third approach configures fast paths or circuits, which may be optical, through the NoC such that traffic traveling on these fast paths enjoys lower latency than regular traffic [14, 15, 16]. Communication locality is exploited in [14, 15, 17] such that the communication over a subset of source-destination node pairs is given higher priority than the rest of the traffic, and explicit circuits connecting the source-destination pairs are configured to carry this higher priority traffic.

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 6, JUNE 2012, ISSN (Online) 2151-9617
https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Static non-uniform cache architecture (S-NUCA) [18] and Private [19] caches represent the two ends of the cache organization spectrum. However, neither of them is a perfect solution for CMPs. S-NUCA caches make better use of cache capacity, given that only one copy of a data block is retained in the cache, but suffer from high data access latency since they interleave data blocks across physically distributed cache banks, rarely associating the data with the core or cores that use it.
Private caches allow fast access to on-chip data blocks but suffer from low cache capacity utilization due to data replication, thus resulting in many costly off-chip data accesses. Many researchers have suggested hybrid cache organizations that attempt to keep the benefits of both S-NUCA and private caches while avoiding their shortcomings [20-26]. Most of these cache proposals assumed a simple 2-D packet-switched mesh interconnect. Such interconnects can be augmented with the ability to configure fast paths [15-17].

2 SYSTEM ARCHITECTURE

Network topology determines the connectivity among nodes and is therefore a first-order determinant of network performance and energy efficiency. Since the ability of the network to efficiently disseminate information depends largely on the topology, we focus especially on different types of topologies. Figure 1 shows the torus NoC and FOLDED-TORUS topologies.

3 SIMULATION METHODOLOGY

In the last few years, different network simulation tools have been developed. One of the most popular is the ns2 Network Simulator (Breslau et al. 2000). Ns2 is an open source, event driven network simulator. It provides support for the simulation of IP-based networks. In particular, ns2 provides researchers with:
- Multicast and unicast routing protocols;
- Different transport protocols (TCP, UDP, RTP, etc.);
- Most common applications (FTP, Telnet, HTTP).

Localization of a network element (agent in ns2 terminology) in a simulation scenario is a two-step process, which requires:
- The localization of the node in which the agent must be instantiated;
- The instantiation of the agent.

To support agent instantiation in large topologies we have included in the extended NAM Editor the concept of Node Set. A Node Set is a set of nodes selected according to one of the following criteria:
- Leaf node,
- Mutual distance,
- Randomly.
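The three Node Set criteria listed above can be sketched on a k x k torus such as the one in Figure 1. This is a minimal Python illustration and not the NAM Editor or ns-2 implementation: all function names are hypothetical, "leaf node" is assumed to mean a node with exactly one neighbour, and "mutual distance" is assumed to mean that the selected nodes are pairwise at least a given number of hops apart. Only the logical links are modelled, since a folded torus has the same connectivity as a torus and differs in physical layout.

```python
import random
from collections import deque


def torus_links(k):
    """Adjacency of a k x k torus (logical links only; wraps at the edges)."""
    adj = {}
    for x in range(k):
        for y in range(k):
            adj[(x, y)] = {
                ((x + 1) % k, y), ((x - 1) % k, y),
                (x, (y + 1) % k), (x, (y - 1) % k),
            }
    return adj


def hop_distances(adj, src):
    """Breadth-first hop counts from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def leaf_nodes(adj):
    """Assumed reading of 'leaf node': nodes with exactly one neighbour."""
    return sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)


def mutual_distance_set(adj, min_hops):
    """Greedy selection of nodes pairwise at least min_hops apart (assumed reading)."""
    chosen = []
    for node in sorted(adj):
        dist = hop_distances(adj, node)
        if all(dist[c] >= min_hops for c in chosen):
            chosen.append(node)
    return chosen


def random_set(adj, size, seed=0):
    """Reproducible random selection of `size` distinct nodes."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(adj), size))


if __name__ == "__main__":
    adj = torus_links(4)  # k = 4, as in Figure 1
    print("leaf nodes:", leaf_nodes(adj))
    print("mutual distance >= 2:", mutual_distance_set(adj, 2))
    print("random set of 4:", random_set(adj, 4))
```

Under these assumptions every node of a torus has four neighbours, so the leaf criterion selects nothing there; it is only meaningful on tree-like topologies such as the transit-stub case.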
When a network topology is created with one of the supported topology generators, some Node Sets are automatically created, reflecting the topology model. For instance, in the case of a transit-stub topology, one Node Set is associated with each transit domain and a different one with each stub domain. To make this tool even more flexible, we made the GUI of our Extended NAM Editor customizable by the end user.

3.1 Simulation Details

In this paper, we have modeled our architecture concepts with the widely used network simulator ns-2 [4]. NS-2 has been widely applied in research related to the design and evaluation of computer networks and to evaluate various design options for architectures [27], including the design of routers, communication protocols, etc. Ns-2 [28] is a discrete event network simulator designed for the simulation of ordinary computer networks. As many models of network components are provided, the user can simulate at a high abstraction level. Yet, it is possible to implement new components in the network model. ns-2 has support for local area networks, mobile networks and even satellite networks. Two computer languages are used in ns-2, namely C++ and OTcl.

Fig. 1. (a) Torus NoC topology (k=4) (b) FOLDED-TORUS (k=4)

We use the Network Simulator ns-2 [29], [30], which has been extensively used in research on the design and evaluation of public domain computer networks, to evaluate various design options for the NoC architecture, including the design of routers, communication protocols, and routing algorithms. NS-2 is an open source, object-oriented, discrete event driven network simulator written in C++ and OTcl. It is a very common and widely used tool to simulate small and large area networks [31].

4 SIMULATION RESULTS

In this section, simulation results are presented.
We have simulated different levels of NOC-torus-Folded topologies, which have a recursive structure, using the NS-2 simulator. Each topology is simulated at different sizes. The simulation figures are shown below.

4.1 NOC-torus-Folded 4*4

Simulations with a high number of nodes may be shown in different views. For example, Figures 2 and 3 show different views of the NOC-torus-Folded topology, each consisting of 16 nodes.

Fig. 2. The 1st view of 4*4 NOC-FOLDED-TORUS
Fig. 3. The 2nd view of 4*4 NOC-FOLDED-TORUS

4.2 NOC-torus-Folded 6*6

Figure 4 shows the NOC-torus-Folded topology, which consists of 36 nodes.

Fig. 4. The 6*6 NOC-FOLDED-TORUS

REFERENCES
[1] L. Benini, "Application Specific NoC Design," date, vol. 1, pp. 105, Proceedings of the Design Automation & Test in Europe Conference Vol. 1, 2006.
[2] W. Shen, C. Chao, Y. Lien, A. Wu, "A New Binomial Mapping and Optimization Algorithm for Reduced-Complexity Mesh-Based On-Chip Network," nocs, pp. 317-322, First International Symposium on Networks-on-Chip (NOCS'07), 2007.
[3] A. RoshanFekr, M. Janidarmian, V. Samadi Bokharaei, A. Khademzadeh, "Yield Enhancement with a Novel Method in Design of Application-Specific Networks on Chips," Electrical Engineering and Applied Computing, Volume 90, 247-257, 2011.
[4] S. Furber, "The future of computer technology and its implications for the computer industry," Comput. J., vol. 51, no. 6, pp. 735-740, 2008.
[5] S. Borkar, "Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, Nov./Dec. 2005.
[6] T. Dumitras, R. Marculescu, "On-Chip Stochastic Communication," date, vol. 1, pp. 10790, Design, Automation and Test in Europe Conference and Exhibition (DATE'03), 2003.
[7] J. D. Balfour and W. J.
Dally. Design tradeoffs for tiled CMP on-chip networks. In ICS, pages 187-198, 2006.
[8] S. Bourduas and Z. Zilic. A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing. In NOCS, pages 195-204, 2007.
[9] R. Das, S. Eachempati, A. K. Mishra, N. Vijaykrishnan, and C. R. Das. Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. In HPCA, pages 175-186, 2009.
[10] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Express cube topologies for on-chip interconnects. In HPCA, pages 163-174, 2009.
[11] J. Kim, J. D. Balfour, and W. J. Dally. Flattened butterfly topology for on-chip networks. In MICRO, pages 172-182, 2007.
[12] Y. Xu, Y. Du, B. Zhao, X. Zhou, Y. Zhang, and J. Yang. A low-radix and low-diameter 3D interconnection network design. In HPCA, pages 30-42, 2009.
[13] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny. The power of priority: NoC based distributed cache coherency. In NOCS, pages 117-126, 2007.
[14] I. Artundo, W. Heirman, M. Loperena, C. Debaes, J. V. Campenhout, and H. Thienpont. Low-power reconfigurable network architecture for on-chip photonic interconnects. High-Performance Interconnects, Symposium on, 0:163-169, 2009.
[15] N. D. E. Jerger, L.-S. Peh, and M. H. Lipasti. Circuit-switched coherence. In NOCS, pages 193-202, 2008.
[16] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha. Express virtual channels: towards the ideal interconnection fabric. In ISCA, pages 150-161, 2007.
[17] A. Abousamra, R. Melhem, and A. K. Jones. Winning with pinning in NoC. In Proc. of Hot Interconnects (HOTI), 2009.
[18] C. Kim, D. Burger, and S. W. Keckler. Nonuniform cache architectures for wire-delay dominated on-chip caches. IEEE Micro, 23(6):99-107, 2003.
[19] J. A. Brown, R. Kumar, and D. M. Tullsen. Proximity-aware directory-based coherence for multi-core processor architectures. In SPAA, pages 126-134, 2007.
[20] B. M. Beckmann, M. R. Marty, and D. A. Wood.
ASR: Adaptive selective replication for CMP caches. In MICRO, pages 443-454, 2006.
[21] J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In ISCA, pages 264-276, 2006.
[22] Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Optimizing replication, communication, and capacity allocation in CMPs. In ISCA, pages 357-368, 2005.
[23] Z. Guz, I. Keidar, A. Kolodny, and U. C. Weiser. Utilizing shared data in chip multiprocessors with the Nahalal architecture. In SPAA, pages 1-10, 2008.
[24] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: near-optimal block placement and replication in distributed caches. In ISCA, pages 184-195, 2009.
[25] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler. A NUCA substrate for flexible CMP cache sharing. In ICS, pages 31-40, 2005.
[26] M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In ISCA, pages 336-345, 2005.
[27] R. Lemaire, F. Clermidy, Y. Durand, D. Lattard, and A. Jerraya, "Performance Evaluation of a NoC-Based Design for MC-CDMA Telecommunications Using NS-2," in The 16th IEEE International Workshop on Rapid System Prototyping, Jun. 2005, pp. 24-30.
[28] L. Breslau, D. Estrin, K. Fall, S. Floyd, J. Heidemann, A. Helmy, P. Huang, S. McCanne, K. Varadhan, Ya Xu, and Haobo Yu. "Advances in network simulation", IEEE Computer, 33(5):59-67, May 2000.
[29] LBNL Network Simulator, http://www-nrg.ee.lbl.gov/ns/
[30] The network simulator ns-2, available at http://www.isi.edu/nsnam/ns/
[31] M. Ali, M. Welzl, A. Adnan, F. Nadeem, "Using the NS-2 Network Simulator for Evaluating Network on Chips (NoC)," International Conference on Emerging Technologies, pp. 506-512, 2006.