Different thread priority schemes for Runahead Threads

3.6 Design analysis for Runahead Threads implementation

3.6.2 Different thread priority schemes for Runahead Threads

In the previous study, we evaluated the ICOUNT configuration assuming that all kinds of threads have the same opportunity to be considered for fetching instructions under the ICOUNT policy. Nevertheless, with the introduction of the RaT mechanism in the SMT architecture, two types of threads compete for the processor resources: normal (non-speculative) threads and runahead (speculative) threads. The question is whether thread priority rules other than the standard ICOUNT ones would manage this new situation better. To analyze this issue, we investigate several modifications of the ICOUNT scheme that vary the thread priorities depending on the kind of thread. Recall that, so far, ICOUNT only takes into account the number of instructions in the pre-execute stages to calculate the thread priority, regardless of the kind of thread (speculative or not).

We evaluate three priority proposals derived from ICOUNT. The first gives normal threads higher priority than runahead threads at the fetch stage. Thus, normal threads get into the pipeline first, and runahead threads then take advantage of any fetch slots that remain after the normal threads have been scheduled. Within each category, however, this policy follows the same priority rules as standard ICOUNT: among normal threads, and likewise among runahead threads, priority is decided by the number of instructions in the pre-execute stages. The second fetch priority scheme inverts the first: it gives runahead threads priority over normal ones. The third approach is similar to the first, but adds a second level of priority, with the same criterion, at the issue stage: instructions from normal threads are issued first, and then those from runahead threads. In this case, normal threads have the highest priority not only at the fetch stage but also when competing for the functional units at the issue stage.
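The fetch-stage ordering implied by these schemes can be illustrated with a minimal sketch. This is not the simulator code used in the study: the `Thread` record, the `fetch_order` function, and the tie-break by thread id are assumptions introduced here for illustration only. The thread class acts as the primary sort key and the ICOUNT instruction count as the secondary key within each class.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thread:
    tid: int        # hypothetical thread identifier
    icount: int     # instructions currently in the pre-execute stages
    runahead: bool  # True while the thread runs in speculative runahead mode


def fetch_order(threads: List[Thread], scheme: str = "ICOUNT") -> List[int]:
    """Return thread ids from highest to lowest fetch priority.

    Ties are broken by thread id, which is an assumption of this sketch.
    """
    if scheme == "ICOUNT":
        # Thread type is ignored: fewer front-end instructions wins.
        key = lambda t: (t.icount, t.tid)
    elif scheme in ("FPNT", "BPNT"):
        # Normal threads first (False sorts before True), ICOUNT within a class.
        # BPNT uses the same fetch rule; its extra rule applies at issue.
        key = lambda t: (t.runahead, t.icount, t.tid)
    elif scheme == "FPRT":
        # Runahead threads first, ICOUNT within a class.
        key = lambda t: (not t.runahead, t.icount, t.tid)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [t.tid for t in sorted(threads, key=key)]
```

With two normal and two runahead threads, a runahead thread with the lowest instruction count is chosen first under plain ICOUNT and under FPRT, but only after both normal threads under FPNT.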

Figure 3.30 shows the average performance obtained by RaT as a function of the thread fetch priority policy. The four bars for each kind of workload represent the average throughput of RaT with the base ICOUNT policy and with each of the thread priority schemes described above: fetch priority to normal threads (FPNT), fetch priority to runahead threads (FPRT), and both fetch and issue priority to normal threads (BPNT).

Figure 3.30: Performance of RaT as a function of the different thread priority schemes

This figure shows that giving priority to runahead threads (FPRT) is not a good choice, since performance decreases considerably. This approach favors speculative runahead threads while the non-speculative threads, which do the “real” work, are delayed. As a consequence, FPRT suffers a slowdown relative to RaT+ICOUNT of 9% for 2-thread workloads and 19% for 4-thread workloads. The performance loss is higher for 4-thread workloads because, with more threads, it is more likely that several of them are in runahead mode, and thus there is more interference with the normal threads.

However, the worst-performing priority scheme evaluated is BPNT, with an average performance loss of 40%. This slowdown is especially significant for memory-intensive workloads (66%), followed by mixed workloads (41%). The cause of this poor performance is a side effect of prioritizing normal threads over runahead threads at the issue stage. Since normal instructions have the highest priority at the issue stage, they are always issued before the runahead instructions in each cycle. Consequently, most of the time the issue bandwidth is filled with normal instructions and runahead instructions have no room to advance. As a result, the BPNT scheme does not let the processor execute enough runahead instructions, which leads to many very short runahead threads (20 executed instructions on average per runahead thread). Our results show that BPNT produces four times more runahead thread activations on average than RaT with the original ICOUNT rules. These short runahead threads are therefore practically useless: they cannot advance and issue almost no prefetches, while the far more frequent activation and deactivation of runahead threads, triggered by every long-latency load, causes a large performance degradation. Therefore, this is clearly not a valid priority scheme.
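The starvation effect described above can be reproduced with a toy model of a single issue cycle. This is a hedged sketch rather than the evaluated hardware: the `select_for_issue` function, the `(age, is_runahead)` tuple layout, and the oldest-first baseline are assumptions made here for illustration.

```python
from typing import List, Tuple

# Each ready instruction is (age, is_runahead); a smaller age means older.
ReadyInstr = Tuple[int, bool]


def select_for_issue(ready: List[ReadyInstr],
                     issue_width: int,
                     normal_first: bool) -> List[ReadyInstr]:
    """Pick up to issue_width ready instructions for one cycle.

    normal_first=False: oldest-first regardless of thread type
                        (an assumed baseline for this sketch).
    normal_first=True:  BPNT-like rule, where every ready normal instruction
                        is issued before any runahead instruction.
    """
    if normal_first:
        # Sort by class first (normal = False sorts first), then by age.
        ordered = sorted(ready, key=lambda e: (e[1], e[0]))
    else:
        ordered = sorted(ready, key=lambda e: e[0])
    return ordered[:issue_width]


def runahead_issued(selected: List[ReadyInstr]) -> int:
    """Count how many of the selected instructions are runahead ones."""
    return sum(1 for _, is_runahead in selected if is_runahead)
```

Whenever the normal threads supply at least `issue_width` ready instructions, the normal-first rule issues no runahead instruction in that cycle, even if runahead instructions are the oldest ones waiting. This matches the mechanism the text identifies behind the very short runahead threads.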

On the other hand, giving higher priority to normal threads only at the fetch stage (FPNT) seems a good option, especially for ILP and MIX workloads. This scheme favors threads that have long periods of high instruction-level parallelism. The instruction streams of these threads flow quickly through the multithreaded pipeline, leaving fewer instructions in the front-end stages. However, this effect harms memory-bound threads (according to the MEM workload results) because it prevents the speculative runahead instructions from advancing quickly enough to issue their prefetches early. Thus, while FPNT achieves a performance improvement of 2% and 4.5% for ILP and MIX workloads respectively, it suffers a 6% slowdown for MEM workloads.

Therefore, based on the throughput results obtained in this study, we consider the original ICOUNT to be the best choice of underlying thread priority policy to combine with RaT. ICOUNT is able to schedule the threads effectively regardless of whether they are speculative or not. Besides, it is the simplest option, requiring no additional logic or hardware complexity to manage priorities according to thread type.
