
9.2 Efficiency of Multi-Threaded Model

9.2.2 Runtime Efficiency

Similar to the area efficiency, Figure 9.10 shows the runtime efficiency for each implementation of the kernel relative to four consecutive sequential executions of the single-threaded models (static or dynamic). Thus, the multi-threaded and single-threaded versions perform the same number of computations.
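The comparison above can be sketched as a simple ratio. This is a minimal illustration, assuming that runtime efficiency is the cycle count of four back-to-back single-threaded runs divided by the cycle count of one run of the four-thread model. The function name and the cycle counts are hypothetical and are not taken from the measurements.

```python
def runtime_efficiency(single_thread_cycles: int,
                       multi_thread_cycles: int,
                       num_threads: int = 4) -> float:
    """Ratio of sequential single-threaded runtime to multi-threaded runtime.

    Assumed definition (not confirmed by the source): num_threads consecutive
    single-threaded executions are compared against one multi-threaded run
    that performs the same number of computations.
    """
    return num_threads * single_thread_cycles / multi_thread_cycles


# Hypothetical cycle counts, for illustration only.
print(runtime_efficiency(1000, 2500))  # 1.6
```

Under this definition a value above one means the multi-threaded model finishes the same work in fewer clock cycles than the four sequential runs.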

Again, the different features have almost no impact on the unmodified df... benchmarks, as they contain no loops (see discussion in Section 9.2.1). Currently, the hardware threads are all issued by the software. The context switch latency between hardware and software is higher than the runtime of a single thread in the small df... data-paths. As a result, multiple threads are almost never at the same point in the pipeline at the same time, so thread reordering is rarely required. That is why there is almost no difference in runtime efficiency between the different parameter options.

Overall, the impact of queues and optional multi-threaded stages on the runtime efficiency of the CHStone benchmarks is limited. The threads spread out enough that reordering at memory accesses is not necessary most of the time. This effect can be seen more clearly in the optimizations that aim to improve area efficiency by reducing the number of optional multi-threaded stages in use (see Section 9.3). For CHStone, these optimizations have almost no impact on the runtime efficiency.

The reduced efficiency of df_sin_mod results from a starved thread: a thread waits at multi-threaded stages much longer than the other threads before it is selected to move into the next stage. This starvation is caused by a combination of the static reordering priorities and a specific placement of multi-threaded stages. In the modified version of df_sin, the new outer loop has a small II = 7 with a pipeline length of 39. In addition, df_sin is the only benchmark out of all df... which actually contains an inner loop (which is not long enough to hide the previously discussed context switching latency). This inner loop is located almost at the end of the new outer loop, and the stage directly before the inner loop's VLO contains a memory access VLO. Because of the small II compared to the pipeline length and the enabled queues, many iterations of the outer loop are started before this memory access (a cache miss) has completed once. As a result, the next iteration of the highest priority thread is already in the input queue of the memory access and thus immediately issues the next request (a cache hit).
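The relation between the initiation interval and the pipeline length can be made concrete with a rough estimate. The sketch below assumes one clock cycle per pipeline stage and no stalls, so it gives only an upper bound on how many iterations of a single thread overlap in the outer loop. The function name is hypothetical.

```python
import math


def iterations_in_flight(pipeline_length: int, initiation_interval: int) -> int:
    """Upper bound on overlapping loop iterations of one thread.

    Assumes one cycle per stage and no stalls: a new iteration enters every
    `initiation_interval` cycles and stays for `pipeline_length` cycles.
    """
    return math.ceil(pipeline_length / initiation_interval)


# Values from the text: II = 7, pipeline length 39.
print(iterations_in_flight(39, 7))  # 6
```

With about six iterations of one thread in flight, several of them can pile up in front of a memory access that is still waiting on its first cache miss.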

By itself, this does not starve the other threads, because only cache misses require the single shared memory controller. However, before the second highest priority thread has finished its first access (a cache miss), the ninth access of the highest priority thread again results in a cache miss, because it has already read all data from the previously loaded cache-line and has to wait for other accesses to finish. After that, these two threads alternate between reading from the cache and having to access the shared memory controller. This starves all threads with lower priority until the highest priority thread is finished, and the same behaviour then continues with the second and third highest priority threads. In fact, this continues until only a single thread is left. An improved prioritization in the memory system (a fair arbiter based on the wait time) would solve this by making sure that no thread is starved. In general, however, the simple static priority works for most benchmarks.
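The difference between static priorities and a wait-time-based arbiter can be illustrated with a toy model. This is a sketch under strong assumptions, not the memory system described here: every thread issues a fixed number of cache misses, each miss occupies the single shared controller for `service_cycles`, and after a miss the thread spends `hit_cycles` on cache hits before it misses again. All names and parameter values are hypothetical.

```python
def simulate(num_threads, misses_per_thread, service_cycles, hit_cycles, fair):
    """Return (finish_time, longest_wait) lists, one entry per thread.

    Thread 0 has the highest static priority. With fair=True the controller
    instead serves the thread that has been waiting the longest.
    """
    remaining = [misses_per_thread] * num_threads
    ready_at = [0] * num_threads   # cycle at which a thread requests the controller
    longest_wait = [0] * num_threads
    finish = [0] * num_threads
    now = 0
    while any(remaining):
        waiting = [t for t in range(num_threads)
                   if remaining[t] and ready_at[t] <= now]
        if not waiting:
            # Controller idle: jump to the next request.
            now = min(ready_at[t] for t in range(num_threads) if remaining[t])
            continue
        if fair:
            # Oldest request first, thread index breaks ties.
            chosen = min(waiting, key=lambda t: (ready_at[t], t))
        else:
            # Static priority: lowest thread index wins.
            chosen = min(waiting)
        longest_wait[chosen] = max(longest_wait[chosen], now - ready_at[chosen])
        now += service_cycles
        remaining[chosen] -= 1
        ready_at[chosen] = now + hit_cycles
        finish[chosen] = now
    return finish, longest_wait


for fair in (False, True):
    finish, waits = simulate(4, 4, service_cycles=10, hit_cycles=8, fair=fair)
    print("fair" if fair else "static", finish, waits)
# static [70, 80, 150, 160] [2, 10, 80, 90]
# fair [130, 140, 150, 160] [22, 22, 22, 30]
```

With static priorities the two highest priority threads alternate on the controller and the other two wait until they are done, which mirrors the pattern described above. The wait-time-based choice bounds the longest wait of every thread. In this toy model the controller is always busy, so the total runtime is identical in both cases; the sketch only illustrates starvation, not its effect on the measured efficiency.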

For many benchmarks of MachSuite, however, the impact of using queues and optional multi-threaded stages can be clearly seen. Using them increases the runtime efficiency of almost all MachSuite benchmarks, but it also decreases the area efficiency, which can be seen especially for spmv_crs in Section 9.2.1. The higher impact of thread reordering comes from the larger number of instances where threads block each other at memory accesses in the pipeline. However, unlike the problem with df_sin_mod, in these benchmarks the reordering in the optional multi-threaded stages improves the runtime efficiency: although there are multiple threads in a stage with memory access VLOs, no single thread is starved by the other threads. The timing of the cache hits and misses leaves enough room for the reordering to be useful.

[Figure 9.10, plot. Axis: Runtime (#Clocks) Efficiency, ticks from 1 to 4.5. Benchmarks shown, CHStone group followed by MachSuite group: dfadd, dfdiv, dfmul, dfsin, adpcm, blowfish, dfadd_mod, dfdiv_mod, dfmul_mod, dfsin_mod, gsm, gsm_mod, mips, sha, aes, bfs_bulk, bfs_queue, gemm_blocked, gemm_blocked_mod, gemm_ncubed, gemm_ncubed_mod, kmp, nw, sort_merge, sort_radix, stencil2d, stencil3d, fft_strided, md_grid, md_knn, spmv_crs, spmv_ellpack, spmv_ellpack_mod, viterbi. Legend: without opt. MT, with opt. MT, without queues, with queues. Annotations: No Loops, Floats.]

Figure 9.10: Runtime efficiency of the multi-threaded model compared to four consecutive runs of the single-threaded model without optimizations. The multi-threaded model is more efficient for values greater than one.

In document An Execution Model and High-Level-Synthesis System for Generating SIMT Multi-Threaded Hardware from C Source Code (Page 166-169)