3.4 Evaluation of Runahead Threads
3.4.1 Performance of Runahead Threads
We compare the performance of RaT against the baseline using the throughput and Hmean metrics.
Throughput
First, we evaluate how RaT performs in terms of throughput compared to the baseline processor with ICOUNT. We show the overall SMT throughput of the baseline ICOUNT and of the RaT mechanism for every workload. Figure 3.5 shows the throughput for 2-thread workloads and Figure 3.6 shows it for 4-thread workloads, grouped by workload category. Each figure contains three groups of bars: one for the ILP-intensive workloads (ILP), one for the mixed workloads (MIX) and one for the memory-intensive workloads (MEM). Each group includes a bar with the corresponding average.
Figure 3.5: Runahead Threads throughput performance vs. SMT baseline for 2-thread workloads
These figures show that RaT achieves higher throughput than ICOUNT for all kinds of workloads. In Figure 3.6, the baseline throughput is higher than for the 2-thread workloads because the 4-thread workloads execute a greater number of threads. The figures also show that the influence of the runahead threads on the overall SMT throughput differs depending on workload characteristics. For ILP workloads, the throughput difference between RaT and ICOUNT is smaller than for the other two workload categories, MIX and MEM. ILP workloads do not contain memory-intensive threads, so the prefetching benefits for them are smaller. Even so, the average performance improvement is 11% for 2-thread ILP workloads and 2.5% for 4-thread ILP workloads.
Figure 3.6: Runahead Threads throughput performance vs. SMT baseline for 4-thread workloads
MIX workloads, which include both memory-bound and high-ILP benchmarks, do not behave like the workloads composed only of ILP benchmarks. In this case, RaT performs on average 78.6% better than ICOUNT for MIX2 and 15.1% better for MIX4. On the one hand, RaT improves the performance of memory-intensive threads by prefetching. On the other hand, computing-intensive threads also improve because RaT eliminates the cases in which memory-intensive threads monopolize resources.
Finally, RaT provides significant throughput improvements for MEM workloads. As Figures 3.5 and 3.6 show, RaT outperforms ICOUNT by 102.8% for MEM2 and by 52% for MEM4. As we show later, these workloads benefit mainly from the prefetching effect, thanks to the aggressive exploitation of MLP.
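As a rough illustration of how percentages like the ones above can be derived, the following sketch computes the throughput of a workload and the relative improvement of one configuration over another. It assumes that throughput is the sum of the per-thread IPCs, which this section does not state explicitly; the function names and the IPC values are hypothetical and are not taken from the figures.

```python
from typing import Sequence


def throughput(ipcs: Sequence[float]) -> float:
    """Workload throughput, ASSUMED here to be the sum of per-thread IPCs."""
    return sum(ipcs)


def improvement_pct(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (new / baseline - 1.0) * 100.0


# Hypothetical per-thread IPCs for one 2-thread workload (not measured data).
icount_ipcs = [0.5, 1.5]
rat_ipcs = [1.0, 2.0]

base = throughput(icount_ipcs)   # 2.0
rat = throughput(rat_ipcs)       # 3.0
print(f"improvement: {improvement_pct(base, rat):.1f}%")  # improvement: 50.0%
```

Under this convention, an improvement of about 100% corresponds to roughly doubling the baseline throughput.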
Per-thread Speedup
The previous figures show the total throughput of each workload as a whole. We now study the speedup of each separate thread in the multiprogrammed workloads, that is, the IPC variation of RaT compared to ICOUNT for each benchmark in the multithreaded scenario. For each thread that composes a workload, we compare its baseline performance (IPC) to the performance of the same thread when running under the RaT mechanism.
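The per-thread comparison described above can be sketched as follows. This is a minimal illustration that assumes the speedup of a thread is the ratio of its IPC under RaT to its IPC under ICOUNT; the function name and the IPC values are hypothetical.

```python
from typing import List, Sequence


def per_thread_speedup(icount_ipcs: Sequence[float],
                       rat_ipcs: Sequence[float]) -> List[float]:
    """Speedup of each thread, taken as IPC under RaT / IPC under ICOUNT."""
    if len(icount_ipcs) != len(rat_ipcs):
        raise ValueError("both runs must report the same number of threads")
    return [rat / base for base, rat in zip(icount_ipcs, rat_ipcs)]


# Hypothetical IPCs for a 2-thread workload (thread 1, thread 2).
speedups = per_thread_speedup([0.25, 2.0], [1.0, 2.0])
print(speedups)  # [4.0, 1.0]

# A value below 1.0 would indicate that the thread slows down under RaT.
slowed = [i for i, s in enumerate(speedups) if s < 1.0]
print(slowed)  # []
```

A speedup of 1.0 means the thread is unaffected, so in this notation a "5X" speedup corresponds to a value of 5.0.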
Figure 3.7: Individual thread speedup between baseline and runahead performance for 2-thread workloads
Figure 3.7 shows the speedup of RaT over the baseline ICOUNT for every individual benchmark in the 2-thread workloads. Each workload has two bars in the figure: one for the speedup of thread 1 and one for the speedup of thread 2. The speedups for the ILP2 workloads are more or less uniform, with only a few threads standing out in some workloads, such as gcc or galgel. For the MIX2 workloads, we see higher speedups for both threads in almost all cases. As noted before, the threads in MIX workloads obtain their performance improvements from RaT in a cooperative way: the ILP thread performs better because more resources are available, and the MEM thread improves its performance through prefetching. Workloads that contain mcf are a special case. This benchmark has many loads that depend on one another, which makes it more difficult to improve its own performance through runahead. In these cases, RaT eliminates the long periods of resource monopolization caused by the instructions that depend on the many long-latency loads of mcf. RaT therefore improves resource availability for the other thread in the workload, thereby improving its performance (speedups around 5X).
For the MEM2 workloads, the overall speedups are higher than for the other workloads, as the figure illustrates. In general, all MEM threads individually achieve performance improvements. The minimum is the 6% speedup of mcf in the (mcf,twolf) workload and the maximum is the 3.5X speedup of swim in the (swim,mcf) workload. The magnitude of these speedups depends on the ability of the runahead threads to issue the different prefetches in advance and to avoid the resource contention caused by these memory-intensive threads.
Figure 3.8: Individual thread speedup between baseline and runahead performance for 4-thread workloads
In the same way, Figure 3.8 shows the performance speedup of every thread for the 4-thread workloads. The improvement trend is similar to that of the 2-thread workloads, but the speedup is spread over more threads instead of only two. In a few 4-thread workloads, a particular thread suffers a slight performance slowdown, mainly in ILP4 workloads, although the speedups of the other threads compensate for it in the total workload throughput, as Figure 3.6 shows. This happens because, in the absence of useful work in runahead, the speculative instructions of runahead threads can hinder the execution of normal instructions for computing-intensive benchmarks. However, this is not the general rule, and RaT provides good speedups for most threads, with ratios greater than 2X for benchmarks in MIX4 and MEM4 workloads.
Hmean
As another important factor in evaluating the performance of Runahead Threads, we analyze the performance-fairness results using the harmonic mean of individual speedups (Hmean metric). Figures 3.9 and 3.10 show the Hmean results for the baseline ICOUNT and RaT for 2-thread and 4-thread workloads respectively. In these figures, a higher bar is better.
These Hmean results confirm that the RaT mechanism presents a better throughput/fairness balance than the ICOUNT policy. Because computing-intensive benchmarks are not the target of RaT, the gains in ILP workloads are modest, but RaT still outperforms ICOUNT: the Hmean improvement over ICOUNT is 12.5% for ILP2 and 2.4% for ILP4. The MIX and MEM workloads contribute more to the overall average Hmean gains. For MIX workloads, RaT achieves Hmean improvements over ICOUNT of 79% and 12.9% for 2-thread and 4-thread workloads respectively. Finally, the Hmean metric indicates that RaT is also much fairer than ICOUNT for the MEM workloads, improving the Hmean by 75.4% for MEM2 workloads and by 42.9% for MEM4.
Figure 3.10: Hmean of Runahead Threads vs. SMT baseline for 4-thread workloads
Therefore, these results demonstrate that RaT provides good performance improvement as well as fairness, especially for the memory-intensive programs present in MIX and MEM workloads. RaT takes fairness among threads into account: compared to ICOUNT, it avoids favoring only the threads with high IPC and boosts memory-intensive threads (with low IPC).
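To make the Hmean metric more concrete, the sketch below computes the harmonic mean of individual speedups for one workload. It assumes that the individual speedup of a thread is its IPC in the multithreaded run divided by its IPC when running alone, a common convention for this metric that this section does not spell out. The function name and all IPC values are hypothetical.

```python
from typing import Sequence


def hmean(smt_ipcs: Sequence[float], alone_ipcs: Sequence[float]) -> float:
    """Harmonic mean of per-thread relative IPCs (SMT IPC / stand-alone IPC).

    ASSUMPTION: "individual speedup" is taken relative to single-thread execution.
    """
    if len(smt_ipcs) != len(alone_ipcs) or not smt_ipcs:
        raise ValueError("need one stand-alone IPC per SMT thread")
    # N / sum(1 / relative_ipc_i), with 1 / relative_ipc_i = alone_i / smt_i
    return len(smt_ipcs) / sum(a / s for s, a in zip(smt_ipcs, alone_ipcs))


alone = [2.0, 1.0]        # hypothetical stand-alone IPCs
unbalanced = [1.8, 0.1]   # high throughput (1.9), but thread 2 is starved
balanced = [1.0, 0.5]     # lower throughput (1.5), both threads at half speed

print(round(hmean(unbalanced, alone), 2))  # approximately 0.18
print(hmean(balanced, alone))              # 0.5
```

In this example the unbalanced run has the higher raw throughput but the lower Hmean, which illustrates why the metric is read as a balance between throughput and fairness: starving one thread pulls the harmonic mean down sharply.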