Performance Evaluation Graphs

3.4 RTL Simulator Infrastructure

3.4.3 Performance Evaluation Graphs

Based on the features of the TPG and the TRE units attached to the RTL Simulator Infrastructure of the XHiNoC, several graphs can be plotted to show the performance evaluation results of the XHiNoC. The graphs are described in the following items.

• 2D graph of the last flit transfer latency vs. injection rate. This graph presents the acceptance delay of the last flit of each communication pair when the expected data injection rate at each source node is increased or decreased. The average last-flit acceptance delay of all communication pairs over the expected data injection rate can also be presented in a graph. This type of graph is commonly used to evaluate NoC performance, as exhibited in [182] and [18].

• 2D graph of the last flit transfer latency vs. workload. This graph presents the acceptance delay of the last flit of each communication pair when the number of data flits injected at each source node is increased or decreased. The average last-flit acceptance delay of all communication pairs over the workload size can also be presented in a graph.

• 2D graph of the communication bandwidth vs. injection rate. This graph presents the actual measured communication bandwidth (throughput) of each communication pair when the expected data injection rate at each source node is increased or decreased. The average actual throughput of all communication pairs over the expected data injection rate can also be presented in a graph.

• 2D graph of the communication bandwidth vs. workload. This graph presents the actual measured communication bandwidth (throughput) of each communication pair when the number of data flits injected at each source node is increased or decreased. The average actual throughput of all communication pairs over the workload size can also be presented in a graph.

• 3D graphs of the link and bandwidth occupancy. These graphs present the link occupancy, represented by the number of reserved ID slots and the reserved bandwidth space at each output port of the NoC routers. The total outgoing link occupancy of all output ports can also be presented in a graph. This graph is useful for identifying hotspots in a 2D network topology.

• 2D graphs of the injection and acceptance rate transient response. These graphs show the transient responses of the actual injection rate at a source node and of the acceptance rate at a destination node, measured at runtime during a certain time period at certain active nodes, which are selected by the user. From these graphs, we can see the time response of each communication partner and compare its steady-state point with the expected data rate of each communication.
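The averaged curves mentioned in the items above can be derived from per-pair measurements with a few lines of post-processing. The following Python sketch is only an illustration under stated assumptions: the tuple layout, the function names `average_curve` and `measured_throughput`, and the sample values are hypothetical and are not taken from the TPG/TRE output format.

```python
from statistics import mean


def average_curve(samples):
    """Average a per-pair metric over all communication pairs for each x value.

    samples: iterable of (x, pair_id, value) tuples, where x is the expected
    injection rate (or the workload size) and value is the measured metric,
    e.g. the last-flit acceptance delay or the measured throughput.
    Returns a list of (x, mean_value) tuples sorted by x.
    """
    by_x = {}
    for x, _pair, value in samples:
        by_x.setdefault(x, []).append(value)
    return [(x, mean(values)) for x, values in sorted(by_x.items())]


def measured_throughput(flits_accepted, cycles):
    """Actual acceptance rate in flits per cycle over an observation window."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return flits_accepted / cycles


# Hypothetical samples: (expected injection rate, pair, last-flit latency in cycles)
latency_samples = [
    (0.1, "A->B", 120), (0.1, "C->D", 140),
    (0.2, "A->B", 150), (0.2, "C->D", 190),
]
latency_curve = average_curve(latency_samples)
```

The same `average_curve` helper serves the latency and the bandwidth graphs, since both only differ in the metric stored in the third tuple field and in whether x is the injection rate or the workload size.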

3.5 Summary

The main issue related to the implementation of the local ID management technique is the number of available ID slots on each communication link. If the parameterizable ID field of each flit is set to 4 bits, then at most 16 packets (2^4) can be in flight on the same link. The number of available ID slots can be increased by widening the ID field presented in the packet format, at the cost of a larger routing table and a larger ID slot table in the ID management unit. The number of required ID slots is application-dependent, whereas the number of available ID slots can no longer be increased once the NoC has been implemented as an ASIC. Hence, an optimal post-manufacture application mapping should be made in order to avoid more than 16 packets interfering with each other on the same link.
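The slot arithmetic and the mapping check described above can be sketched in Python. This is a minimal illustration, assuming that every message occupies exactly one ID slot on each link it traverses; the function names and the route representation are hypothetical.

```python
def id_slots(id_field_bits):
    """Number of distinct local IDs (slots) per link for a given ID field width."""
    if id_field_bits < 1:
        raise ValueError("ID field needs at least one bit")
    return 2 ** id_field_bits


def overloaded_links(routes, id_field_bits):
    """Return the links whose message count exceeds the available ID slots.

    routes: dict mapping a message name to the list of links it traverses
    (assumption: one ID slot per message per link).
    """
    capacity = id_slots(id_field_bits)
    load = {}
    for links in routes.values():
        for link in links:
            load[link] = load.get(link, 0) + 1
    return {link: count for link, count in load.items() if count > capacity}


# Hypothetical mapping: 17 messages share the link from router R0 to router R1.
routes = {f"msg{i}": [("R0", "R1")] for i in range(17)}
too_many = overloaded_links(routes, id_field_bits=4)
```

With a 4-bit ID field the shared link is reported as overloaded, while the same mapping fits when the field is widened to 5 bits, which is the trade-off against table size discussed in the text.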

In Chip-level Multiprocessor (CMP) systems running coarse-grain multiprocessing applications, i.e. applications whose computation-to-communication ratio is greater than one, 16 ID slots per channel appear to be enough to run several applications. However, if the computation-to-communication ratio is less than one (fine-grain), then the number of available ID slots per channel must be taken into account. Programmers must ensure that no channel is overloaded with excessive communication traffic. This is easy to achieve when an explicit parallel programming model is used. The problem may appear when implicit parallel programming models are used, such as shared-memory and multithreaded programming models. Therefore, it is reasonable to anticipate the ID run-out problem in CMP systems by setting a minimum acceptable number of ID slots per link and by setting the number of available ID slots at each local output port equal to the number of processing element cores. This issue is addressed in Section 3.1.2.

Fortunately, in the context of embedded Multiprocessor System-on-Chip (MPSoC), application traffic patterns are predictable. Hence, it is possible in this case to map the application onto the NoC platform in such a way that every considered traffic stream is able to reserve one ID slot per link and to perform its data communication with its bandwidth requirement fulfilled. Although it would be a rare case that more than 15 messages are in flight on the same link, it is nevertheless a good decision to apply the packet dropping mechanism in this case to avoid data flow stalls. If the number of ID tags per link Nslot is set to cover all considered traffic, e.g. equal to Equ. 3.1 when using a minimal adaptive routing algorithm, then the packet dropping mechanism can be omitted. There is a design trade-off in this respect. By setting the minimum number of ID slots (Nslot) per link as discussed in Section 3.1.2, the size of the RRT and ID Slot Table units becomes larger, but there is no need for a retransmission protocol. When data dropping is applied and the number of entries in the table units is reduced, the router becomes smaller, but a retransmission protocol must be applied, leading to area overhead in the network interface and probably to time overhead when a data drop occurs.
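The trade-off between table size and packet dropping can be illustrated with a toy model of a per-link ID slot table. The sketch below is not the XHiNoC RTL: the class name, the first-free allocation policy, and the release of a slot once a message has completely passed are assumptions made only for illustration.

```python
class IdSlotTable:
    """Toy model of a per-link ID slot table with packet dropping."""

    def __init__(self, n_slots):
        self.free = list(range(n_slots))  # unreserved local IDs
        self.owner = {}                   # message -> reserved local ID
        self.dropped = 0                  # reservation requests that were dropped

    def reserve(self, message):
        """Reserve an ID for a message; return None and count a drop if full."""
        if message in self.owner:
            return self.owner[message]
        if not self.free:
            self.dropped += 1
            return None
        slot = self.free.pop(0)
        self.owner[message] = slot
        return slot

    def release(self, message):
        """Free the slot of a message that has left the link (assumed policy)."""
        slot = self.owner.pop(message)
        self.free.append(slot)


# Hypothetical scenario: a link with only 2 slots and 3 competing messages.
table = IdSlotTable(n_slots=2)
first = table.reserve("m0")
second = table.reserve("m1")
third = table.reserve("m2")   # table is full, so this request is dropped
table.release("m0")
retry = table.reserve("m2")   # succeeds after a slot has been released
```

In this model a small table forces the dropped message to be sent again, which corresponds to the retransmission protocol in the network interface, whereas a table sized to cover all considered traffic never increments `dropped`.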

Chapter 4 Wormhole Cut-Through Switching: Flit-Level Messages Interleaving

4.1 Blocking Problem in Traditional Wormhole Switching

In document Microarchitecture and Implementation of Networks-on-Chip with a Flexible Concept for Communication Media Sharing (Page 116-119)
