
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 4, APRIL 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG


Pyramid-NOC: A Heterogeneous and Scalable Network-on-Chip Architecture


Reza Kourdy, Department of Computer Engineering, Islamic Azad University, Khorramabad Branch, Iran
Mohammad Reza Nouri rad, Department of Computer Engineering, Islamic Azad University, Khorramabad Branch, Iran

Abstract: Most network-on-chip (NoC) architectures are based on a mesh interconnection structure. In this paper, we present a new NoC architecture, which relies on source-synchronous data transfer. We consider variations in Pyramid architectures that can lead to higher performance or greater cost-effectiveness in certain applications. For large-scale networks, our topology reduces hop and switch count, which decreases latency and power. We also carry out a high-level simulation of the on-chip network using NS-2 to verify the analysis.

Index Terms: Pyramid, Network-on-chip (NoC), System-on-Chip (SoC), Chip-level multiprocessors (CMPs), quality-of-service (QoS).

1 INTRODUCTION
Complexities of scaling single-threaded performance have pushed processor designers in the direction of chip-level integration of multiple cores. Today's state-of-the-art general-purpose chips integrate up to one hundred cores [1, 2], while GPUs and other specialized processors may contain hundreds of execution units [3]. In addition to the main processors, these chips often integrate cache memories, specialized accelerators, memory controllers, and other resources. Likewise, modern systems-on-a-chip (SOCs) contain many cores, accelerators, memory channels, and interfaces. As the degree of integration increases with each technology generation, chips containing over a thousand discrete execution and storage resources are likely in the near future. Chip-level multiprocessors (CMPs) require an efficient communication infrastructure for operand, memory, coherence, and control transport [4, 5, 6], motivating researchers to propose structured on-chip networks as replacements for the buses and ad-hoc wiring solutions of single-core chips [7].

Chip density is increasing, allowing even larger systems to be implemented on a single chip. With increasing demands on flexibility and performance, these systems, known as Systems-on-Chips (SoCs), combine several types of processor cores, memories, and custom modules of widely different sizes to form MultiProcessor Systems-on-Chips (MPSoCs). The bottleneck in such systems is shifting from computation to communication [8]. The traditional way of using bus-based mechanisms for inter-module communication has two main limitations. Firstly, it does not scale well with increasing system complexity. Secondly, it couples the computation and communication of the system, leading to longer design times. Networks-on-Chip (NoCs) have been proposed as an efficient and scalable alternative to shared buses that allows systems to be designed modularly. Initial NoC proposals rely only on a Best-Effort service based on packet-switching techniques.

QNoC [9] and xPipes [10] are examples of packet-based NoCs. Another group of proposals is based on circuit-switching techniques, whose approach is to reserve the entire path from the source to the destination before data is sent out from the source. Examples include PNoC [11] and ProtoNoC [12]. For more demanding systems, it is necessary to have predictable performance, as connections between the IP blocks are subject to timing constraints. For such applications, it is necessary to be able to guarantee throughput before run time. To achieve this, link allocation is done statically at design time.

The design of these networks-on-chip (NOCs) typically requires satisfaction of multiple conflicting constraints, including minimizing packet latency, reducing router area, and lowering communication energy overhead. In addition to basic packet transport, future NOCs will be expected to provide certain advanced services. In particular, quality-of-service (QOS) is emerging as a desirable feature due to the growing popularity of server consolidation, cloud computing, and the real-time demands of SOCs. Despite recent advances aimed at improving the efficiency of individual NOC components such as buffers, crossbars, and flow control mechanisms [13, 14, 15, 16], as well as features such as QOS [17, 18], little attention has been paid to network scalability beyond several dozen terminals.

3 BACKGROUND
This section reviews key NOC concepts, draws on prior work to identify important Kilo-NOC technologies, and analyzes their scalability bottlenecks. We start with conventional NOC attributes (topology, flow control, and routing), followed by quality-of-service technologies.


3.1 Conventional NOC Attributes

a) Topology


Network topology determines the connectivity among nodes and is therefore a first-order determinant of network performance and energy-efficiency. To avoid the large hop counts associated with rings and meshes of early NOC designs [19, 20], researchers have turned to richly-connected low-diameter networks that leverage the extensive on-chip wire budget. Such topologies reduce the number of costly router traversals at intermediate hops, thereby improving network latency and energy efficiency, and constitute a foundation for a Kilo-NOC. One low-diameter NOC topology is the flattened butterfly (FBfly), which maps a richly-connected butterfly network to planar substrates by fully interconnecting nodes in each of the two dimensions via dedicated point-to-point channels [21]. An alternative topology called Multidrop Express Channels (MECS) uses point-to-multipoint channels to also provide full intra-dimension connectivity but with fewer links [22]. Each node in a MECS network has four output channels, one per cardinal direction. Light-weight drop interfaces allow packets to exit the channel into one of the routers spanned by the link.
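To make the hop-count argument concrete, the following sketch compares the worst-case and average hop counts of a k x k mesh with those of a topology that fully connects every row and every column, as the flattened butterfly and MECS do. The function names and the 8 x 8 example size are illustrative assumptions and are not taken from the cited papers; a hop here means one channel traversal between routers.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hops between two nodes of a 2D mesh: the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def full_dim_hops(src, dst):
    """Hops when each row and column is fully connected: at most one per dimension."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])


def hop_stats(k, hops):
    """Return (diameter, average hops) over all ordered pairs of distinct nodes."""
    nodes = list(product(range(k), repeat=2))
    dists = [hops(s, d) for s in nodes for d in nodes if s != d]
    return max(dists), sum(dists) / len(dists)


if __name__ == "__main__":
    for name, fn in (("mesh", mesh_hops), ("fully connected dimensions", full_dim_hops)):
        diameter, average = hop_stats(8, fn)
        print(f"{name}: diameter={diameter}, average={average:.2f}")
```

The mesh diameter grows as 2(k - 1), whereas full intra-dimension connectivity bounds it at two hops regardless of k, which is the property the low-diameter topologies above exploit.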

b) Flow Control

Flow control governs the flow of packets through the network by allocating channel bandwidth and buffer slots to packets. Conventional interconnects have traditionally employed packet-granularity bandwidth and storage allocation, exemplified by Virtual Cut-Through (VCT) flow control [23]. In contrast, NOCs have relied on flit-level flow control [24], refining the allocation granularity to reduce the per-node storage requirements.

c) Routing

A routing function determines the path of a packet from its source to the destination. Most networks use deterministic routing schemes, whose chief appeal is simplicity. In contrast, adaptive routing can boost throughput of a given topology at the cost of additional storage and allocation complexity.

4 TREES AND HIERARCHICAL STRUCTURES

4.1. Binary and Ternary Tree

'Trees' are hierarchical structures that bear some resemblance to natural trees. A tree starts with a node at the top called the root, which is connected to other nodes by 'edges' or 'branches'. These nodes may spawn further nodes, forming a multilayered structure. Nodes at one level can only connect to nodes in adjacent levels. Furthermore, a node may stem from only one other node (i.e., it has exactly one parent), even though it may give rise to several nodes (children). The connections are such that the branches are disjoint and there are no loops in the structure. Figures 1 and 2 show 'binary' and 'ternary' trees respectively. Nodes that do not have any children are called 'terminal nodes'. The number of processors in a chain between the root and the deepest terminal node is called the 'depth' of the tree.

Fig.1. Binary Tree

Fig.2. Ternary Tree

By examining the figures it can be seen that there is only one path between any two nodes. Even so, the nature of the tree makes the routing algorithm slightly more complex than that for chains. A message from one terminal node to another terminal node has to be routed back up the tree to the first node that is common to both the sender and the receiver. Once the message arrives at the common parent, it can travel back down the tree to the receiving node. A single message path leaves most of the communication links free, so the tree structure can support many messages at the same time. This requires that the routing algorithm be extended to cope with messages clashing on particular routes. The extended algorithm must also make sure that two frequently communicating processors do not completely block out other processors, a situation called lockout.

Binary trees allow a simplification of the general tree routing algorithm. The binary representation of a node's address can be used in determining the message path. Multiplying the current node address by two gives the address of the node below and to the left. Multiplying by two and adding one gives the address of the node below and to the right. An important step in finding the message path is finding the first node that is common to both sender and receiver. This can be done by generating two lists of successive parents all the way up to the root, one list for the sender and one for the receiver. The parent of the current node can be found by dividing the address by two and discarding the remainder. (This is equivalent to an 'arithmetic shift right by one bit', which is available as a single machine instruction on most processors.) The step is repeated until the root node is reached (i.e., the address is 00001).
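The binary-tree address arithmetic described above can be sketched as follows. This is a minimal illustration that assumes the root has address 1 and that addresses are positive integers; the function names are hypothetical, and contention handling (clashing messages, lockout) is deliberately left out.

```python
def parent(addr):
    """Parent address: halve the address and drop the remainder (shift right by one bit)."""
    return addr >> 1


def children(addr):
    """Addresses of the nodes below-left and below-right of addr."""
    return 2 * addr, 2 * addr + 1


def ancestors(addr):
    """Addresses from addr up to the root (address 1), both inclusive."""
    path = [addr]
    while addr > 1:
        addr = parent(addr)
        path.append(addr)
    return path


def route(src, dst):
    """Path of a message: up from src to the first common node, then down to dst."""
    up = ancestors(src)
    down = ancestors(dst)
    on_dst_path = set(down)
    # First node on the sender's list that also appears on the receiver's list.
    common = next(a for a in up if a in on_dst_path)
    climb = up[: up.index(common) + 1]
    descend = down[: down.index(common)][::-1]
    return climb + descend


if __name__ == "__main__":
    print(route(8, 11))
```

For a message from node 8 to node 11 the two parent lists are 8, 4, 2, 1 and 11, 5, 2, 1, so node 2 is the first common node and the path is 8, 4, 2, 5, 11.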

4.2 X-Tree

The major disadvantage of the tree topology is that nodes higher up the tree tend to become congested if several processors further down wish to communicate. One way to alleviate this communication problem is to add additional connections between levels, as in a 'fat-tree'. Another possible solution to the problem of congestion is to add links between branches at the same level. This structure is called an 'x-tree' (see Figure 3). Both of these solutions increase the complexity of the routing algorithm. (Fat-trees and x-trees are no longer proper trees, as there are loops in the structure and the branches are no longer all disjoint.)

4.3 Diamond Tree

Several important problems map very well onto tree topologies. These include searches and sorts. Search algorithms map particularly well onto the diamond tree structure illustrated in Figure 4.

Fig.4. Diamond Tree

The number of nodes in a tree is given by the formula for the sum of a geometric progression:

$N = (d^w - 1)/(d - 1)$

The number of nodes can be increased either by increasing the depth of the tree, $w$, or by increasing the fan-out of the nodes, $d$. The number of links in a tree is given by the formula:

$L = (d^w - d)/(d - 1)$

The growth complexity $G$ is given by the formula:

$G = (d - 1)N + 1$

When a ternary tree is enlarged, another twice as many plus one nodes must be added in order to complete the next level. The growth complexity of trees is high compared to other topologies.
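As a numerical check of the tree formulas, the sketch below evaluates the node and link counts for a complete tree with fan-out d and w levels, and computes the growth as the size of the next level, d^w. That reading of growth is an assumption made here; it agrees with the statement that a ternary tree needs twice as many nodes plus one to complete its next level. The function names are illustrative.

```python
def node_count(d, w):
    """N = (d^w - 1) / (d - 1): nodes in a complete tree with fan-out d and w levels."""
    return (d**w - 1) // (d - 1)


def link_count(d, w):
    """L = (d^w - d) / (d - 1): every node except the root has one link to its parent."""
    return (d**w - d) // (d - 1)


def growth(d, w):
    """Nodes needed to complete the next level, (d - 1) * N + 1, which equals d^w."""
    return (d - 1) * node_count(d, w) + 1


if __name__ == "__main__":
    for d, name in ((2, "binary"), (3, "ternary"), (4, "quaternary")):
        w = 3
        print(name, node_count(d, w), link_count(d, w), growth(d, w))
```

For a ternary tree with three levels this gives 13 nodes, 12 links, and 27 further nodes (2 x 13 + 1) to add the fourth level.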

5 PYRAMIDS

Pyramid topologies are a subset of the tree topology. The pyramid in Figure 5 can be arrived at by redrawing a quaternary tree. All the topological properties of the pyramid are the same as those of the quaternary tree.
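One way to picture the redrawing is to place level k of the quaternary tree on a 2^(k-1) x 2^(k-1) grid, which matches the level sizes (1, 2, 4, 8, 16) used by the ns-2 script later in the paper. The sketch below is an illustration under that assumption: the (k, i, j) coordinates with 1-based indices and the parent/child mapping are hypothetical choices, since the text does not spell out how the links are numbered.

```python
def level_side(k):
    """Side length of level k, where k = 1 is the apex: 1, 2, 4, 8, ..."""
    return 1 << (k - 1)


def parent(k, i, j):
    """Parent of node (k, i, j); each 2 x 2 block of a level shares one parent."""
    if k == 1:
        return None  # the apex has no parent
    return (k - 1, (i + 1) // 2, (j + 1) // 2)


def children(k, i, j):
    """The four children of node (k, i, j) on level k + 1."""
    return [(k + 1, 2 * i - 1 + di, 2 * j - 1 + dj) for di in (0, 1) for dj in (0, 1)]


def total_nodes(levels):
    """Total number of nodes in a pyramid with the given number of levels."""
    return sum(level_side(k) ** 2 for k in range(1, levels + 1))


if __name__ == "__main__":
    print(children(1, 1, 1))
    print(total_nodes(5))
```

With five levels this layout has 341 nodes, the same count as a quaternary tree (d = 4) with five levels.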

Fig.3. X Tree


Fig.5. Quaternary Tree and its Pyramid

6. SIMULATION FRAMEWORK
In this paper, we have modeled our MPLS-NoC architecture concepts with the widely used network simulator ns-2 [26]. ns-2 has been widely applied in research on the design and evaluation of computer networks, and to evaluate various design options for NoC architectures [27], including the design of routers, communication protocols, etc.

7. SIMULATION EXPERIMENTS
All of the topology parameters can be described in a Tcl script file. The part of the ns-2 script that constructs the topology is shown below.

    # Level k of the pyramid is an x-by-x grid of switches, with x = 2^(k-1).
    # One colour per level, for n <= 5.
    set colors {blue red green brown black}
    for {set k 1} {$k <= $n} {incr k} {
        set x [expr {1 << ($k - 1)}]
        set c [lindex $colors [expr {$k - 1}]]
        for {set i 1} {$i <= $x} {incr i} {
            for {set j 1} {$j <= $x} {incr j} {
                # Two-digit fields for i and j keep the key unique when x = 16.
                set id [expr {$k*10000 + $i*100 + $j}]
                set sw($id) [$ns node]
                $sw($id) color $c
                $sw($id) label "sw$id"
            }
        }
    }

8. SIMULATION RESULTS
Figures 6 and 7 show different views of a Pyramid-NOC.

Fig.6. Pyramid-NOC

Fig.7. Pyramid-NOC
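The topology script keys each switch by an integer built from its level k and its grid position (i, j). The following sketch, whose function names and multiplier parameters are hypothetical, checks whether a key of the form k*a + i*b + j identifies every switch uniquely for a given number of levels. It is a side check on the indexing only and does not model the ns-2 simulation itself.

```python
def switch_keys(levels, level_mult, row_mult):
    """Map each integer key k*level_mult + i*row_mult + j to the switches that use it."""
    keys = {}
    for k in range(1, levels + 1):
        side = 1 << (k - 1)  # level k is a side x side grid
        for i in range(1, side + 1):
            for j in range(1, side + 1):
                key = k * level_mult + i * row_mult + j
                keys.setdefault(key, []).append((k, i, j))
    return keys


def collisions(levels, level_mult, row_mult):
    """Keys shared by more than one switch; an empty dict means the keys are unique."""
    all_keys = switch_keys(levels, level_mult, row_mult)
    return {key: nodes for key, nodes in all_keys.items() if len(nodes) > 1}


if __name__ == "__main__":
    print(len(collisions(4, 100, 10)))
    print(len(collisions(5, 100, 10)))
    print(len(collisions(5, 10000, 100)))
```

With multipliers of 100 and 10 the keys are unique for up to four levels, but on a fifth level of 16 x 16 switches positions such as (1, 11) and (2, 1) map to the same key. Multipliers of 10000 and 100 leave two decimal digits for each of i and j and avoid the overlap.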


REFERENCES
[1] J. Shin, K. Tam, D. Huang, B. Petrick, H. Pham, C. Hwang, H. Li, A. Smith, T. Johnson, F. Schumacher, D. Greenhill, A. Leon, and A. Strong. A 40nm 16-core 128-thread CMT SPARC SoC Processor. In International Solid-State Circuits Conference, pages 98-99, February 2010.
[2] Tilera TILE-Gx100.
[3] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2009.
[4] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. IEEE Computer, 30(9):86-93, September 1997.
[5] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W. Keckler, and D. Burger. On-Chip Interconnection Networks of the TRIPS Chip. IEEE Micro, 27(5):41-50, September/October 2007.
[6] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. B. III, and A. Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15-31, September/October 2007.
[7] W. J. Dally and B. Towles. Route Packets, Not Wires: On-chip Interconnection Networks. In International Conference on Design Automation, pages 684-689, June 2001.
[8] K. Goossens, J. Dielissen, and A. Radulescu. Æthereal Network on Chip: Concepts, Architectures, and Implementations. IEEE Design and Test of Computers, 22(5):414-421, 2005.
[9] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS Architecture and Design Process for Network on Chip. Journal of Systems Architecture, 50(2-3):105-128, 2004.
[10] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli. NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip. IEEE Transactions on Parallel and Distributed Systems, 16(2):113-129, 2005.
[11] C. Hilton and B. Nelson. PNoC: A Flexible Circuit-Switched NoC for FPGA-Based Systems. IEE Proceedings: Computers and Digital Techniques, 153(3):181-188, 2006.
[12] D. Castells-Rufas, J. Joven, and J. Carrabina. A Validation and Performance Evaluation Tool for ProtoNoC. In 2006 International Symposium on System-on-Chip (SOC), 2006.
[13] T. Moscibroda and O. Mutlu. A Case for Bufferless Routing in On-Chip Networks. In International Symposium on Computer Architecture, pages 196-207, 2009.
[14] H. Wang, L.-S. Peh, and S. Malik. Power-driven Design of Router Microarchitectures in On-chip Networks. In International Symposium on Microarchitecture, pages 105-116, December 2003.
[15] J. Kim. Low-cost Router Microarchitecture for On-chip Networks. In International Symposium on Microarchitecture, pages 255-266, December 2009.
[16] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha. Express Virtual Channels: Towards the Ideal Interconnection Fabric. In International Symposium on Computer Architecture, pages 150-161, May 2007.
[17] J. W. Lee, M. C. Ng, and K. Asanovic. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. In International Symposium on Computer Architecture, pages 89-100, June 2008.
[18] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive Virtual Clock: a Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip. In International Symposium on Microarchitecture, pages 268-279, December 2009.
[19] D. Pham et al. Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor. IEEE Journal of Solid-State Circuits, 41(1):179-196, January 2006.
[20] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: RAW Machines. IEEE Computer, 30(9):86-93, September 1997.
[21] J. Kim, J. Balfour, and W. Dally. Flattened Butterfly Topology for On-chip Networks. In International Symposium on Microarchitecture, pages 172-182, December 2007.
[22] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. Express Cube Topologies for on-Chip Interconnects. In International Symposium on High-Performance Computer Architecture, pages 163-174, February 2009.
[23] P. Kermani and L. Kleinrock. Virtual Cut-through: a New Computer Communication Switching Technique. Computer Networks, 3:267-286, September 1979.
[24] W. J. Dally. Virtual-channel Flow Control. In International Symposium on Computer Architecture, pages 60-68, June 1990.
