International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 5, Issue 7, July 2015)
Hardware Implementation of Improved Adaptive NoC Router
with Flit Flow History based Load Balancing Selection Strategy
Parag Parandkar1, Sumant Katiyal2, Geetesh Kwatra3
1,3Research Scholar, School of Electronics, Devi Ahilya University, Indore, M. P., India
2Professor, School of Electronics, Devi Ahilya University, Indore, M. P., India
Abstract— Several techniques exist in the literature to improve load balancing in NoCs, such as Regional Congestion Awareness (RCA). Other techniques rely on output port selection, using metrics such as the count of free virtual channels, the count of fluid buffers, the buffer occupancy time at reachable downstream neighbors, and flit flow history, the last of which is used by the algorithm named Tracker. Among these techniques, Tracker has performed significantly better than the others. However, Tracker has so far been simulated and verified only with a NoC simulation tool, and to the authors' knowledge no hardware implementation of the Tracker architecture, or of any flit flow based algorithm, has been reported in the Network on Chip research literature. The proposed work implements an improved version of the flit flow based technique used by Tracker on programmable hardware (Xilinx Virtex-5 FPGA) and achieves a maximum clock frequency of 686.81 MHz, as validated by experimental synthesis results. The improvement over the existing architecture is the insertion of additional buffers in the Tracker internal logic to achieve a better area/performance trade-off for chip multiprocessors.
Keywords— NoC, Tracker, Adaptive routing, MOE, Load balancing routing, Virtual Channel router, Verilog, Virtex FPGA.
I. INTRODUCTION
Over the last decade, optimizing the computational efficiency of intellectual property (IP) cores has been the preferred focus of technological innovation in Systems on Chip (SoCs). At the same time, reliable and efficient communication must be given greater emphasis if SoCs are to achieve high performance and throughput. This is because wire delays are becoming comparable to gate delays as feature sizes continue to shrink [1].
Networks on chip are considered an efficient replacement for other forms of interconnect in chip multiprocessors and system on chip designs [1, 3]. A Network on Chip (NoC) architecture combines a topology, routing algorithms, switching, power optimization, etc. Among the different topologies, the mesh is considered the most competent.
In the 2D mesh topology, the processing cores are arranged as rectangular tiles. Each processing IP core is connected to a local router, and the routers are interconnected with their neighbors to form the mesh.
Communication among the IP cores takes place in units of packets, flits and phits at different levels of abstraction. Among the three switching techniques (store and forward, wormhole and virtual cut through), wormhole switching is mostly preferred because it needs less buffer space and therefore less area and power. In wormhole switching, each packet is subdivided into a sequence of flow control units called flits. The control information is carried by the header flit.
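The subdivision of a packet into flits can be illustrated with a small behavioural sketch in Python. It is only an illustration under stated assumptions: the 8-bit flit width is borrowed from the data ports described later in this paper, while the names Flit and packetize and the head/body/tail labelling are hypothetical and are not taken from [2].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str     # "head", "body" or "tail" (assumed labelling)
    payload: int  # one 8-bit value per flit (assumed flit width)


def packetize(dest_node: int, data: bytes) -> List[Flit]:
    """Split a packet into a header flit followed by data flits."""
    if not data:
        raise ValueError("a packet needs at least one data byte")
    # The header flit carries the control information (here: the destination).
    flits = [Flit("head", dest_node & 0xFF)]
    # All data bytes except the last travel as body flits.
    flits += [Flit("body", b) for b in data[:-1]]
    # The last data byte is marked as the tail flit that closes the packet.
    flits.append(Flit("tail", data[-1]))
    return flits
```

For example, packetize(15, bytes([1, 2, 3])) yields one header flit addressed to node 15, two body flits and one tail flit.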
The router of a 2D mesh topology contains five bidirectional ports: one for each of the four directions (east, west, north and south) and one for the local tile (IP core). Every input port of a router can optionally be associated with a set of flit buffers, termed virtual channels (VCs). The use of virtual channels is a trade-off between reducing average network latency and the area and power cost of the added buffers [5]. Information about the available buffer space is exchanged among the routers through control signals. Flits belonging to several VCs within the same input port arbitrate among themselves; the successful flits from the various input ports then undergo switch arbitration and allocation, and are finally forwarded through the crossbar switch to their respective output ports [2].
An existing NoC architecture typically chooses among deterministic, oblivious and adaptive routing algorithms to determine the route taken by a packet to its destination. Although adaptive routing entails a more complex implementation, it is still preferred because it offers better fault tolerance, higher network throughput and lower network latency than oblivious policies under non-uniform or bursty traffic. Despite these advantages, the performance of adaptive routing degrades when routing decisions are based only on local network load, which disturbs load balancing within the NoC.
This may create undesired hot spots. Adaptive routers also choose the best route for incoming packets among the available routes by using the dynamically varying network congestion status, among other selection metrics.
Buffers together with virtual channels are used to keep packets flowing through the routers without obstruction. At the same time, handshaking mechanisms based on a set of control signals keep information synchronized among the routers. Computational units such as IP cores thus communicate efficiently by producing and processing data packets and control signals through the NoC infrastructure. Adaptive routing chooses the best route for incoming packets from a set of available routes by using a proper selection metric that captures the dynamically varying congestion status [3].
Recently designed Network on Chip architectures generally need to satisfy crucial requirements such as low latency, load balancing and deadlock-free routing as far as possible in order to maintain an optimized routing implementation. At higher packet injection rates, the capability to handle packet flow among neighboring routers and the number of allocated virtual channel buffers pose the biggest limitations. A typical adaptive router realizes its adaptivity by choosing the best possible output port for an incoming flit. For this, the output port selection function chooses one of the output ports for the flit using an appropriate metric that accounts for congestion [6]. The selection metric chosen to represent congestion determines the optimum selection strategy.
Link utilization of the network can be improved by balancing the traffic across all the links. The flit flow history based analysis proposed by Tracker [2] was chosen over existing metrics such as the availability of free virtual channels [5, 6], buffer fluidity values [10] and buffer occupancy values [11]. Routing decisions are taken in such a way that less frequently used links are preferred.
II. RELATED WORK
The Tracker algorithm uses minimal odd-even (MOE) routing, which is one of the simplest and most commonly used deadlock-free adaptive routing algorithms in mesh NoCs [4]. By itself, the MOE algorithm makes a random selection among the admissible output ports. Selection functions therefore work on top of MOE to complete the routing function.
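As an illustration of how the set of admissible output ports arises, the following Python sketch follows the minimal routing function of the odd-even turn model as it is commonly described after [4]. It is a behavioural sketch, not the hardware implementation of this paper. The node numbering (node id = 4*y + x, with x growing eastward and y growing northward) is an assumption derived from the 4 x 4 mesh of Fig. 2, and the function names are hypothetical.

```python
MESH_WIDTH = 4  # assumed: node id = MESH_WIDTH * y + x


def xy(node):
    """Return the (column, row) coordinates of a node id."""
    return node % MESH_WIDTH, node // MESH_WIDTH


def moe_ports(cur, src, dst):
    """Admissible output ports at node cur for a packet from src to dst."""
    (cx, cy), (sx, _), (dx, dy) = xy(cur), xy(src), xy(dst)
    ex, ey = dx - cx, dy - cy
    if ex == 0 and ey == 0:
        return {"local"}  # the packet has arrived
    vertical = "north" if ey > 0 else "south"
    ports = set()
    if ex == 0:
        # Same column as the destination: only a vertical move remains.
        ports.add(vertical)
    elif ex > 0:
        # Eastbound packet.
        if ey == 0:
            ports.add("east")
        else:
            # Turning from east to north/south is only allowed in an odd
            # column; at the source column no turn is involved.
            if cx % 2 == 1 or cx == sx:
                ports.add(vertical)
            # Keep going east unless that would force a forbidden turn
            # in an even destination column.
            if dx % 2 == 1 or ex != 1:
                ports.add("east")
    else:
        # Westbound packet.
        ports.add("west")
        if ey != 0 and cx % 2 == 0:
            ports.add(vertical)
    return ports
```

Under these assumptions, a packet from node 4 to node 15 that is currently at node 5 may leave through the east or the north port, whereas at node 6 only the east port is admissible.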
The use of load information from neighboring switches for channel selection is described in [7]. A load balancing routing scheme based on random channel selection is proposed in [9]. Based on the past flow pattern, the authors of [8] estimate the network's congestion level and deterministically calculate optimized routing paths for all traffic flows. The count of free virtual channels in the adjacent downstream routers is also explored as a selection metric [5, 12]. The free VC status of the reachable neighbors of the routers adjacent to the current node is investigated in [6]. The count of fluid buffers is explored in [10]. The history of buffer occupancy within a realistic time interval is discussed in [11].
The Tracker architecture includes a virtual channel router [3] that monitors the flow of flits through all its outgoing ports and exchanges this flit flow information with its neighbors. The flit flow information is computed as a Cumulative Flit Count (CFC), which indicates the contention level of an output port of the neighboring router.
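The monitoring of flit flow per outgoing port can be pictured with the following minimal Python sketch. It assumes that the CFC is a plain running count per port; the counter width, any reset or saturation behaviour, and the exact exchange format used in [2] are not stated in this paper and are not modelled. The class and method names are hypothetical.

```python
class FlitFlowMonitor:
    """Running flit count per outgoing port (assumed form of the CFC)."""

    PORTS = ("north", "south", "east", "west")

    def __init__(self):
        # One counter per outgoing port, all starting at zero.
        self.cfc = {port: 0 for port in self.PORTS}

    def record_flit(self, port):
        """Count one flit that left through the given outgoing port."""
        self.cfc[port] += 1

    def status_for_neighbours(self):
        """Snapshot of the counters, as it would be sent to neighbors."""
        return dict(self.cfc)
```

A neighbor that receives a larger count for a port can interpret that port as more heavily used.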
The architecture of the selection logic of the Tracker router is shown in Fig. 1.
Fig. 1: Internal architecture of the selection logic in Tracker [2]
Flit forwarding in a Tracker router is illustrated in Fig. 2. Flit F is sourced from node 4 and destined to node 15. It first reaches node 5. For flit F, the MOE routing function [3] offers the east port (link to node 6) and the north port (link to node 9) as the possible output ports. At the same time, the flip-flop (FF) values from node 9 and node 6 reach node 5.
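The selection step at node 5 can be sketched in Python as follows. The sketch assumes that the value received from each neighbor can be compared as a number and that the less used link is preferred, in line with the load balancing goal stated in the introduction. The function name, the tie-breaking rule and the numeric values in the usage note are hypothetical.

```python
def select_output_port(admissible, neighbour_cfc):
    """Pick the admissible port whose neighbor reports the lowest count.

    admissible    -- ports offered by the routing function, e.g. {"east", "north"}
    neighbour_cfc -- value received over each port from the neighbor behind it
    """
    # Sorting first gives a fixed, repeatable tie-break (assumed rule).
    return min(sorted(admissible), key=lambda port: neighbour_cfc[port])
```

With hypothetical values of 7 from node 6 (east) and 3 from node 9 (north), select_output_port({"east", "north"}, {"east": 7, "north": 3}) returns "north".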
[Fig. 1 labels: four flip-flops (FF), one each for the West, North, East and South ports; status register of node 5; links to and from router 4.]
[Fig. 2 layout: 4 x 4 mesh, rows from top to bottom: 12 13 14 15 / 8 9 10 11 / 4 5 6 7 / 0 1 2 3; node 4 is the source (S) and node 15 is the destination (D).]
Fig. 2: Router Architecture Network Model
III. PROPOSED IMPLEMENTATION
The proposed architecture of the improved Tracker design is shown in Fig. 3. Additional buffers are inserted in the existing Tracker internal logic to achieve a better area/performance trade-off for chip multiprocessors.
The router network architecture model (Fig. 2) designed in [2] is implemented using the improved internal architecture.
Sixteen nodes are configured in a 4 x 4 mesh network model, and the behavior of the Tracker (node) within the network is tested. Each node is referred to as a test_node, and an array of 16 such nodes connected together forms a typical NoC architecture.
The modified router architecture network model is written in Verilog HDL as a module named router_top. It has the following groups of input and output signals:
Fig. 3: Improved internal architecture of internal logic of tracker
Inputs:
a) clk, and reset of 16 bits.
b) west_data0_in, south_data0_in, south_data1_in, south_data2_in, south_data3_in, east_data3_in, east_data7_in, east_data11_in, east_data15_in, north_data15_in, north_data14_in, north_data13_in, north_data12_in, west_data12_in, west_data8_in, west_data4_in, of 8 bits each.
c) req0_w, req4_w, req8_w, req12_w, req12_n, req13_n, req14_n, req15_n, req15_e, req11_e, req7_e, req3_e, req0_s, req1_s, req2_s, req3_s, of 1 bit each.
d) busy0, busy1, busy2, busy3, busy4, busy5, busy6, busy7, busy8, busy9, busy10, busy11, busy12, busy13, busy14, busy15, of 4 bits each.
Outputs:
west_out0_out, south_out0_out, south_out1_out, south_out2_out, south_out3_out, east_out3_out, east_out7_out, east_out11_out, east_out15_out, north_out15_out, north_out14_out, north_out13_out, north_out12_out, west_out12_out, west_out8_out, west_out4_out, of 8 bits each.
Among the inputs, group a) contains clk and a 16-bit reset, with one reset bit for each test_node. Group b) consists of the 8-bit data inputs on the outer edges of the 4 x 4 mesh network model, named after the test_nodes (0 to 15) to which they belong. Group c) consists of control signals, namely the request signals from neighboring nodes for data transfer; each request signal is 1 bit wide. Group d) contains the status of the output ports of each test_node in the form of a busy signal.
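The names in group b) follow directly from the position of each test_node on the edge of the mesh, which the following Python sketch makes explicit. It assumes the numbering node id = 4*y + x, with x growing eastward and y growing northward, as suggested by Fig. 2; the helper name boundary_inputs is hypothetical and is not part of router_top.

```python
MESH_WIDTH = 4  # assumed: node id = MESH_WIDTH * y + x


def boundary_inputs(width=MESH_WIDTH):
    """List the data inputs that sit on the outer edge of the mesh."""
    names = []
    for node in range(width * width):
        x, y = node % width, node // width
        if x == 0:
            names.append(f"west_data{node}_in")   # west edge of the mesh
        if x == width - 1:
            names.append(f"east_data{node}_in")   # east edge of the mesh
        if y == 0:
            names.append(f"south_data{node}_in")  # south edge of the mesh
        if y == width - 1:
            names.append(f"north_data{node}_in")  # north edge of the mesh
    return names
```

For the 4 x 4 mesh this produces 16 names, four per edge, matching the signals listed in group b); corner nodes such as test_node 0 contribute two inputs each.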
[Fig. 3 labels: West in, North in, East in, South in; West out, North out, East out, South out; four Buffer blocks; status register of node 5; To Router 4; From Router 4.]
[Fig. 4 block diagram labels: Test_Node with inputs north_data, south_data, east_data, west_data, nsew_req, nsew_busy and outputs north_out, south_out, east_out, west_out, o_nsew_req, nsew_ack.]
Each test_node is associated with a 4-bit busy signal, with one bit for each of the directions north, south, east and west. A busy bit is driven to 0 when the Tracker algorithm running within the test_node confirms that the output path in the corresponding direction is available. Among the outputs, only the 8-bit data ports of the outermost test_nodes are listed, named by port direction.
Test_Node:
The test_node is shown in Fig. 4. It is the basic building block of the NoC architecture network model. It holds the information related to, and required by, all of its surrounding neighbors.
A typical test_node has four 8-bit input data ports named north_data, south_data, east_data and west_data, corresponding to the directions north, south, east and west respectively, along with the associated buffers. It also has a 4-bit request input, nsew_req, driven by the upstream nodes. The 4-bit nsew_busy signal gives the desired direction of propagation of the flit as proposed by the MOE algorithm running within the test_node.
The test_node has four 8-bit output data ports named north_out, south_out, east_out and west_out, corresponding to the directions north, south, east and west respectively. Requests are forwarded downstream through the 4-bit signal o_nsew_req, and requests are answered with acknowledgements on the 4-bit signal nsew_ack.
One path is selected and configured, with one receiving side and one output side only. For example, if node 5 needs to receive data on its west side and output it to the north, the settings are chosen such that the west request bit is 1 and the north busy bit is 0.
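The condition in this example can be written as a small Python check. The bit order within the 4-bit signals is not specified in this paper; the sketch assumes the order suggested by the signal names (north, south, east, west from the most to the least significant bit). The dictionary BIT and the function path_enabled are hypothetical.

```python
# Assumed bit positions inside the 4-bit nsew_* signals (not given in the paper).
BIT = {"north": 3, "south": 2, "east": 1, "west": 0}


def path_enabled(nsew_req, nsew_busy, in_side, out_side):
    """True when a request is present on in_side and out_side is not busy."""
    request_set = ((nsew_req >> BIT[in_side]) & 1) == 1
    output_free = ((nsew_busy >> BIT[out_side]) & 1) == 0
    return request_set and output_free
```

With a west request of 1 and a north busy bit of 0, path_enabled(0b0001, 0b0000, "west", "north") evaluates to True; setting the north busy bit (0b1000) makes it False.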
Fig. 4: Test_node
Inputs : north_data, south_data, east_data, west_data are of size 8 bits and nsew_req, nsew_busy are of size 4 bits.
Outputs: north_out, south_out, east_out, west_out are of 8 bits and o_nsew_req, nsew_ack are of size 4 bits.
Depending on the top-level configuration bits, each node can receive and forward a data value.
Test_wrapper:
To drive the above 4 x 4 network model with data inputs, clock and resets, and to observe the data outputs, a top-level module named test_wrapper is designed, which encompasses the router_top module. It has clk, a 16-bit reset, an 8-bit data_in and an 8-bit data_out. It is shown in Fig. 5.
Fig. 5: Test_wrapper
The router_top module is instantiated within test_wrapper, and data is sent from the west direction of test_node 4 according to the minimal odd-even (MOE) routing algorithm, routed by applying the improved Tracker algorithm of the architecture shown in Fig. 3. The neighboring nodes are examined and their Cumulative Flit Count (CFC) values are observed on the way to test_node 15. The code is tested for one path, with the configuration bits generated according to the MOE algorithm. As depicted by the dark arrows in Fig. 2, the data is routed from the west port of test_node 4 to the west of test_node 5, then to the north of test_node 9, then to the east of test_node 10, then to the east of test_node 11 and finally to the north of test_node 15. The data also has the option of diverting to the east of node 5, but owing to the turn model restrictions of the MOE algorithm it cannot turn north from node 6 to node 10, and it therefore follows the most conveniently available path towards the north, through node 9.
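The turn restriction that excludes the route through node 6 can be checked with a short Python sketch of the odd-even turn rules from [4]: east-to-north and east-to-south turns are forbidden in even columns, and north-to-west and south-to-west turns are forbidden in odd columns. The sketch assumes the numbering node id = 4*y + x of Fig. 2 and paths made only of hops between adjacent nodes; the function names are hypothetical.

```python
MESH_WIDTH = 4  # assumed: node id = MESH_WIDTH * y + x

# Direction of a hop, derived from the difference of adjacent node ids.
STEP = {1: "east", -1: "west", MESH_WIDTH: "north", -MESH_WIDTH: "south"}


def turn_allowed(prev_dir, next_dir, column):
    """Odd-even turn rules applied at a node in the given column."""
    even = column % 2 == 0
    if even and prev_dir == "east" and next_dir in ("north", "south"):
        return False  # EN and ES turns are forbidden in even columns
    if not even and prev_dir in ("north", "south") and next_dir == "west":
        return False  # NW and SW turns are forbidden in odd columns
    return True


def path_is_legal(path):
    """Check every turn along a list of adjacent node ids."""
    dirs = [STEP[b - a] for a, b in zip(path, path[1:])]
    for i in range(1, len(dirs)):
        column = path[i] % MESH_WIDTH  # node at which the turn is taken
        if not turn_allowed(dirs[i - 1], dirs[i], column):
            return False
    return True
```

Under these assumptions, path_is_legal([4, 5, 9, 10, 11, 15]) is True, while path_is_legal([4, 5, 6, 10]) is False because node 6 lies in an even column.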
The test_wrapper is functionally verified with a test bench. After reset is applied during the first clock cycle, the expected response appears, as shown in the simulation result in Fig. 6.
The test_wrapper code has been synthesized for a Xilinx Virtex-5 FPGA; the resulting maximum clock frequency of 686.81 MHz is reported in the results section.
[Fig. 5 port labels: clk, reset, data_in.]
IV. RESULTS AND DISCUSSIONS
The simulation result is shown in Fig. 6. According to the simulation, the data applied at the west input of the source test_node 4 takes around 23 clock cycles to reach the east port of the destination test_node 15.
Synthesis for the Xilinx Virtex-5 device (XC5VLX50T-2FF1136) yields a maximum clock frequency of 686.81 MHz.
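The two figures above can be combined into a rough end-to-end latency estimate. The following Python sketch assumes that the design is clocked at the reported maximum frequency and that the path takes exactly 23 cycles; both are assumptions, since the cycle count is approximate and the simulation clock is not stated. The function name is hypothetical.

```python
def latency_ns(cycles, f_mhz):
    """Latency in nanoseconds for a number of cycles at a clock in MHz."""
    period_ns = 1000.0 / f_mhz  # one clock period in nanoseconds
    return cycles * period_ns
```

Under these assumptions the clock period is about 1.456 ns, and latency_ns(23, 686.81) gives roughly 33.5 ns for the path from test_node 4 to test_node 15.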
Fig. 6: Functional Simulation of test_wrapper
Fig. 7: Synthesized top level block Test wrapper
V. CONCLUSION
Several techniques based on output port selection exist, using metrics such as the count of free virtual channels, the count of fluid buffers, the buffer occupancy time at reachable downstream neighbors, and flit flow history, the last of which is used by the algorithm named Tracker. Among these techniques, Tracker has performed significantly better than all others. A hardware implementation of an improved adaptive NoC router with a flit flow history based load balancing selection strategy has been proposed.
The proposed technique is an improved version of the Tracker architecture that inserts additional buffers in the existing Tracker internal logic to achieve a better area/performance trade-off for chip multiprocessors. The proposed work implements the improved flit flow history based technique used by Tracker on programmable hardware (Xilinx Virtex-5 FPGA) using Verilog HDL and achieves a maximum clock frequency of 686.81 MHz, as validated by experimental synthesis results.
REFERENCES
[1] W. Dally and B. Towles, “Route packets, not wires: On-chip interconnection networks”, in DAC, pp. 684-689, 2001.
[2] John Jose, K.V. Mahathi, J. Shiva Shankar and Madhu Mutyam, “TRACKER: A Low Overhead Adaptive NoC Router with Load Balancing Selection Strategy”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD) Nov. 5-8, 2012, San Jose, California, USA.
[3] W. Dally and B. Towles, “Principles and Practices of Interconnection Networks”, Morgan Kaufmann Publishers Inc., USA, 2003.
[4] G. M. Chiu, “The odd-even turn model for adaptive routing”, IEEE TPDS, vol. 11, no. 7, pp. 729-738, 2000.
[5] W. Dally, “Virtual-channel flow control”, IEEE TPDS, vol. 3, no. 2, pp. 194-205, 1992.
[6] G. Ascia, et al., “Implementation and analysis of a new selection strategy for adaptive routing in NoC”, IEEE TOC, vol. 57, no. 6, pp. 809-820, 2008.
[7] E. Nilsson, et al., “Load distribution with the proximity congestion awareness in a network-on-chip”, in DATE, pp. 1126-1127, 2003.
[8] A. E. Kiasari, et al., “A framework for designing congestion-aware deterministic routing”, in NoCArc, pp. 45-50, 2010.
[9] M. H. Cho, et al., “Path-based, randomized, oblivious, minimal routing”, in NoCArc, pp. 23-28, 2009.
[10] Y. C. Lan, et al., “Fluidity concept for NoC: A congestion avoidance and relief routing scheme”, in SoC, pp. 65-70, 2008.
[11] J. Jose, et al., “BOFAR: Buffer occupancy factor based adaptive router for mesh NoCs”, in NoCArc, pp. 23-28, 2011.
[12] J. Kim, et al., “A low latency router supporting adaptivity for