9.2 Efficiency of Multi-Threaded Model
9.2.1 Area Efficiency
Figure 9.8 shows the area efficiency for each implementation of the kernel relative to four copies of the static single-threaded model. The figure uses the best results obtained across the different optimizations that select which optional multi-threaded stages are used (see Section 9.3). For comparison, Figure 9.9 shows the efficiency without any optimizations applied. It can be seen that without area optimizations about half of the benchmarks become area-inefficient. Section 9.3 will show that these optimizations can be applied with only a very small impact on the runtime efficiency.
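The efficiency metric can be illustrated with a minimal sketch. It assumes that area efficiency is the area of four single-threaded copies divided by the area of one multi-threaded implementation, which is consistent with values greater than one favouring the multi-threaded model. The function name and the area numbers are hypothetical and not taken from the measurements.

```python
def area_efficiency(single_threaded_area: float,
                    multi_threaded_area: float,
                    copies: int = 4) -> float:
    """Area of `copies` single-threaded kernels relative to one
    multi-threaded kernel (assumed definition, see lead-in)."""
    if multi_threaded_area <= 0:
        raise ValueError("multi_threaded_area must be positive")
    return copies * single_threaded_area / multi_threaded_area


# Hypothetical areas in arbitrary units (not measured values).
efficient = area_efficiency(1000.0, 2500.0)    # 4000 / 2500
inefficient = area_efficiency(1000.0, 5000.0)  # 4000 / 5000

print(efficient > 1.0)    # multi-threaded model is more area-efficient
print(inefficient > 1.0)  # four single-threaded copies are smaller
```

Under this assumed definition, a value of exactly one means that the multi-threaded kernel occupies the same area as the four single-threaded copies.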
While the multi-threaded model performs well for many benchmarks, there are also benchmarks which are less efficient than with the single-threaded model. To understand why some applications are more suitable for the multi-threaded model than others, the efficient and inefficient applications are analysed in order of the benchmarks (see Section 9.1).
1 The hardware/software co-simulation has a problem with some benchmarks with larger input data when executed with eight threads. A few benchmarks were executed on the DINI system with eight threads.
df... Benchmarks
As these benchmarks contain no loops, most features of the multi-threaded model are not used. In general, they are more efficient with the multi-threaded model than with the single-threaded model. However, Nymble currently handles non-loops as loops with a single iteration. Because of that, they carry some overhead that is only necessary for loops. Optimizing the back-end to handle non-loops more effectively would also help the multi-threaded basic block model, as each basic block is generally implemented as such a non-loop. Additionally, because basic blocks are generally much smaller than the df... benchmarks, the overhead has an even bigger impact there.
The modified versions of these benchmarks have a lower area efficiency because of the additional loop and its multi-threading overhead. However, the overall runtime of the modified versions is much lower than that of the unmodified ones, because fewer transitions through the hardware/software interface reduce the total interface latency.
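The runtime argument for the modified versions can be sketched with a simple cost model. It assumes that every kernel invocation pays a fixed hardware/software interface latency and that the added loop lets one invocation process several inputs. All names and numbers are hypothetical and serve only to illustrate the trade-off.

```python
def total_runtime(inputs: int,
                  inputs_per_call: int,
                  interface_latency: int,
                  compute_per_input: int) -> int:
    """Total cycles under an assumed model: one interface transition
    per kernel call plus a fixed compute cost per input."""
    if inputs_per_call < 1:
        raise ValueError("inputs_per_call must be at least 1")
    calls = -(-inputs // inputs_per_call)  # ceiling division
    return calls * interface_latency + inputs * compute_per_input


# Hypothetical values: 100 inputs, 50 cycles per interface
# transition, 10 cycles of computation per input.
unmodified = total_runtime(100, 1, 50, 10)    # one call per input
modified = total_runtime(100, 100, 50, 10)    # one call with a loop

print(unmodified, modified)
```

In this model the compute cost is identical in both cases, and only the number of interface transitions changes.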
Remaining CHStone and integer MachSuite Benchmarks
These benchmarks are mostly only slightly more efficient with the multi-threaded model. The cause is that the relative multi-threading overhead increases when there are no large operations such as floating point ones. This is supported by the fact that most benchmarks with floating point operations are more efficient than all benchmarks without floating point operations.
While queues have a negligible impact on the area efficiency, enabling optional reordering stages can have a significant impact on area efficiency. Both features also have an impact on the runtime efficiency which is analysed in Section 9.2.2.
[Figure 9.8 (bar chart). Axis: Area Efficiency, 0 to 3. Benchmark suites: CHStone, MachSuite. Benchmarks: dfadd, dfdiv, dfmul, dfsin, adpcm, blowfish, dfadd_mod, dfdiv_mod, dfmul_mod, dfsin_mod, gsm, gsm_mod, mips, sha, aes, bfs_bulk, bfs_queue, gemm_blocked, gemm_blocked_mod, gemm_ncubed, gemm_ncubed_mod, kmp, nw, sort_merge, sort_radix, stencil2d, stencil3d, fft_strided, md_grid, md_knn, spmv_crs, spmv_ellpack, spmv_ellpack_mod, viterbi. Group markers: No Loops, Floats. Legend: without opt. MT, with opt. MT, without queues, with queues.]
Figure 9.8: Area efficiency of multi-threaded model compared to four copies of static single-threaded model with optimizations (minimum values out of all optimizations, see Section 9.3). The multi-threaded model is more efficient for values greater than one
Another important factor for the efficiency is the use of complex operators. Without a relatively high number of complex operators, as is the case for most benchmarks without floating point operations, the efficiency is not as good as for benchmarks with many complex operators. For simple operators like integer addition and logic, the multi-threading logic overhead is much bigger relative to the operator itself than for complex operators. For example, if a simple one bit comparator result has to be stored in TCS, it uses at least two additional bits for addressing (with four threads) in addition to the multiplexing logic. When a 32 bit float MCO is multiplexed, at best the same two additional addressing bits are needed. Of course, in many cases the operator requires additional queues, reducing the efficiency as shown in Sections 3.2.4.1 and 4.3.7.1. dfdiv and spmv_crs are the most extreme outliers. Unlike dfadd and dfmul, dfdiv contains a single integer divider (a complex operation). On the other hand, spmv_crs has only two floating point operations, which do not provide enough potential for the multi-threaded model.
Overall, the multi-threaded model is more area-efficient for benchmarks with many complex operators.
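The addressing argument for the one bit comparator and the 32 bit float operator discussed in this section can be made concrete with a small calculation. The sketch below counts only the thread address bits and deliberately ignores the multiplexing logic and any queues, so it gives a lower bound on the overhead. The function names are hypothetical.

```python
def address_bits(threads: int) -> int:
    """Minimum number of bits needed to select one of `threads` threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return (threads - 1).bit_length()


def relative_address_overhead(data_bits: int, threads: int) -> float:
    """Address bits per data bit (multiplexers and queues ignored)."""
    if data_bits < 1:
        raise ValueError("data_bits must be at least 1")
    return address_bits(threads) / data_bits


# Four threads need two address bits in both cases.
print(address_bits(4))                   # 2
print(relative_address_overhead(1, 4))   # 2.0 for a one bit comparator result
print(relative_address_overhead(32, 4))  # 0.0625 for a 32 bit float value
```

The same two address bits double the storage of a one bit result but add only about six percent to a 32 bit value, which matches the observation that complex operators amortize the multi-threading overhead better.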
[Figure 9.9 (bar chart). Axis: Area Efficiency, 0.5 to 3. Benchmark suites: CHStone, MachSuite. Benchmarks: dfadd, dfdiv, dfmul, dfsin, adpcm, blowfish, dfadd_mod, dfdiv_mod, dfmul_mod, dfsin_mod, gsm, gsm_mod, mips, sha, aes, bfs_bulk, bfs_queue, gemm_blocked, gemm_blocked_mod, gemm_ncubed, gemm_ncubed_mod, kmp, nw, sort_merge, sort_radix, stencil2d, stencil3d, fft_strided, md_grid, md_knn, spmv_crs, spmv_ellpack, spmv_ellpack_mod, viterbi. Group markers: No Loops, Floats. Legend: without opt. MT, with opt. MT, without queues, with queues.]
Figure 9.9: Area efficiency of multi-threaded model compared to four copies of static single-threaded model without optimizations. The multi-threaded model is more efficient for values greater than one