
CHAPTER 5: EXPERIMENTAL RESULTS

5.1 K-ary n-cube simulation framework

5.3.2 Throughput of 3D-bus vs shared-bus

Throughput measures the number of bits transmitted per unit of time. The interconnect settings for this simulation resemble those used for the latency simulation. Message size increases in multiples of two, and the channel width varies from 1b to 32b. The interconnect is composed of 8 cubes. The shared bus is configured with a 16b channel width and the same length as the 8 cubes connected in series.

Fig. 5.35 shows that the shared bus outperforms the 3D-bus in throughput for small messages, because switching delays dominate when messages are short. For long messages, only the header of each message incurs switching delay, while the remaining packets follow the header through the interconnect at one propagation delay per channel.

Thus, once the header reaches its destination, the rest of the message is transmitted with lower latency and higher throughput. On the shared bus, by contrast, long messages generated by the attached modules cause large arbitration latencies and therefore degrade performance. The 3D-bus reached a peak throughput of 405 Gbps. Throughput also increases by a larger increment each time the channel width doubles.
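The header-versus-body behaviour described above can be illustrated with a minimal sketch. It assumes a simple wormhole model in which the header pays one switching delay plus one propagation delay per hop, and each following flit arrives one propagation delay after the previous one. The function names, the hop count, and the delay values are hypothetical and are not taken from the simulator.

```python
def wormhole_latency(message_bits, channel_width_bits, hops, t_switch, t_prop):
    """Latency of one worm, in the same time unit as t_switch and t_prop.

    Assumed model (not the simulator's): the header flit incurs switching
    and propagation delay at every hop; body flits are pipelined behind it
    and each adds one propagation delay.
    """
    flits = -(-message_bits // channel_width_bits)  # ceiling division
    header_latency = hops * (t_switch + t_prop)
    body_latency = (flits - 1) * t_prop
    return header_latency + body_latency


def throughput(message_bits, latency):
    """Bits delivered per unit of time for a single message."""
    return message_bits / latency


if __name__ == "__main__":
    # Hypothetical values: 8 hops, 10 ns switching, 1 ns propagation, 32b channel.
    for size in (32, 128, 512, 1024):
        lat = wormhole_latency(size, 32, hops=8, t_switch=10, t_prop=1)
        print(size, lat, round(throughput(size, lat), 3))
```

Under these assumed delays the fixed header cost dominates short messages, so the per-message throughput rises as the message grows, which is the qualitative trend discussed for the 3D-bus.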

Figure 5.35: Throughput of 3D-bus vs. shared-bus

5.3.3 3D-bus routing accuracy

Routing accuracy measures how close the actual paths taken by all worms generated within one simulation are to the ideal (shortest) paths. If the actual path taken equals the shortest path, routing accuracy is 100%. In this simulation, our goal is to investigate the effect of virtual channels (VC) and sub-channeling (SC) on routing accuracy.
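One way to compute such a metric is sketched below. The text does not give the exact formula, so this sketch assumes that accuracy is the ratio of the total shortest-path hop count to the total actual hop count over all worms, expressed as a percentage. The function name and the input layout are hypothetical.

```python
def routing_accuracy(paths):
    """Return routing accuracy in percent.

    paths: sequence of (actual_hops, shortest_hops) pairs, one per worm.
    Assumed definition: 100 * sum(shortest) / sum(actual), so a run in
    which every worm follows its shortest path scores exactly 100.
    """
    paths = list(paths)
    total_actual = sum(actual for actual, _ in paths)
    total_shortest = sum(shortest for _, shortest in paths)
    if total_actual == 0:
        return 100.0  # no hops taken, nothing deviated from the ideal
    return 100.0 * total_shortest / total_actual


if __name__ == "__main__":
    # Two worms on their shortest paths, one that took a detour.
    print(routing_accuracy([(4, 4), (6, 6), (10, 5)]))
```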

Figure 5.36: 3D-bus average routing accuracy

Fig. 5.36 shows that both VC and SC significantly increase the average routing accuracy. Enabling SC increased routing accuracy by 16%, while VC contributed a 10% increase.

5.3.4 3D-bus failure rate

Failure rate measures the number of worms that were retransmitted because they could not reach their destination. This can occur in multiple situations. First, if a worm at its current node cannot be forwarded to an output port and virtual channels are not enabled, the worm fails and is retransmitted.

Figure 5.37: 3D-bus worm failure rate

Second, when a virtual channel is enabled but reaches its maximum capacity while all ports are still busy, the worm must be retransmitted. At the same time, the virtual channel is flushed so that a new worm can occupy it. Failure rate is measured as the ratio of retransmitted worms to the total number of worms generated (Fig. 5.37).
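The two failure cases and the resulting ratio can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, capacity is counted in packets as an assumption, and the simulator's actual bookkeeping may differ.

```python
class VirtualChannel:
    """Bounded queue attached to a node (capacity in packets, assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def try_enqueue(self, worm_packets):
        """Queue a blocked worm; on overflow, flush and report failure."""
        if self.used + worm_packets > self.capacity:
            self.used = 0  # flushed so that a new worm can occupy it
            return False
        self.used += worm_packets
        return True


def blocked_worm_fails(vc, worm_packets):
    """Return True if a worm that found all output ports busy must be retransmitted.

    vc is None when virtual channels are disabled (first failure case);
    otherwise the worm fails only if it overflows the channel (second case).
    """
    if vc is None:
        return True
    return not vc.try_enqueue(worm_packets)


def failure_rate(retransmitted, generated):
    """Retransmitted worms divided by the total number of worms generated."""
    return retransmitted / generated if generated else 0.0


if __name__ == "__main__":
    vc = VirtualChannel(capacity=4)
    outcomes = [blocked_worm_fails(vc, 3), blocked_worm_fails(vc, 2)]
    print(outcomes, failure_rate(sum(outcomes), 10))
```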

Simulation results show that the failure rate diminished when virtual channels and sub-channeling were used. The highest failure rate occurred when neither virtual channels nor channel partitioning were enabled. In that configuration, heavy congestion and deadlocks cause many worms to collide, so many of them must be retransmitted. Virtual channels reduce the number of retransmissions by allowing worms to be queued until the path is cleared. The size of a virtual channel determines the number of packets that can be queued. If a virtual channel becomes full because a worm overflows it, that worm is retransmitted. Sub-channeling, on the other hand, significantly reduced retransmissions because worms have more flexibility in choosing a route.

5.3.5 3D-bus latency with memory and PE interfaces

The 3D-bus interconnect can connect more than 4 PEs and 4 memories on each of its sides by adding two interfaces: a memory interface and a PE interface. Modules connected to each interface cannot send messages simultaneously, since there are only four input nodes on each side of the interconnect. Therefore, in each simulation cycle only eight modules can transmit data into the interconnect, while the other modules wait their turn under round-robin scheduling. In addition, if the interfaces are enabled, any module selected to transmit data in a given simulation cycle incurs additional switching delay, which accounts for the delay the interface introduces when connected to the 3D-bus interconnect.
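The round-robin sharing of the four input nodes on one side can be sketched as below. The grouping of modules into consecutive batches of four and the function names are assumptions made for illustration; the sketch counts waiting rounds only and does not model the extra interface switching delay or the measured latency.

```python
def round_robin_slots(num_modules, input_nodes=4):
    """Return the module indices granted access in each successive round.

    Assumes modules on one side are served in index order, input_nodes
    at a time, and that the sequence repeats after the last round.
    """
    return [
        list(range(start, min(start + input_nodes, num_modules)))
        for start in range(0, num_modules, input_nodes)
    ]


def worst_case_wait_rounds(num_modules, input_nodes=4):
    """Rounds the last-served module waits before its first transmission."""
    if num_modules <= 0:
        return 0
    rounds = -(-num_modules // input_nodes)  # ceiling division
    return rounds - 1


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, round_robin_slots(n)[:2], worst_case_wait_rounds(n))
```

With four modules per side no module waits, which matches the base configuration without interfaces; every additional group of four adds one more round of waiting for the modules served last.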

Fig. 5.38 shows that as the number of modules (PEs and/or memories) on each side of the interconnect increases, latency increases almost exponentially as well. As more PEs/memories are connected, the time each element must wait grows rapidly, and as a result latency reaches almost 1 ms (about 10 times higher than the latency recorded at the peak throughput of 308 Gbps). Latency at this level reduces throughput dramatically (from PE=16/M=16 and above).

Figure 5.38: 3D-bus latency with interfaces attached

The figure also highlights the effect of worm size on latency. Longer worms mean longer messages, which require more transmission time. Hence, other modules must wait significantly longer for their turn to transmit data.

5.3.6 3D-bus performance comparison with common interconnects

In this simulation we compare our 3D-bus with other currently used high-performance interconnect technologies such as InfiniBand [70], HyperTransport [72] and PCI Express [73].

Figure 5.39: 3D-bus throughput comparison with common interconnects

For a fair comparison, the 3D-bus is configured with the same parameters used for each of the compared interconnects. The following values are used: the channel width is 32b, the interconnect size is 8 cubes, the number of worms generated is 10, and each worm is 1 KB in size. Virtual channels and channel partitions are not used. The results are shown in Fig. 5.39. The 3D-bus shows superior results compared to all of its competitors, even though none of its enhanced features (VC, sub-channeling) were used; enabling them can improve its performance even further.

In document OFF-CHIP COMMUNICATIONS ARCHITECTURES FOR HIGH THROUGHPUT NETWORK PROCESSORS (Page 154-161)