5.3 Analysis of the approaches
5.3.3 Length of runahead threads
Having shown the prediction accuracy and predicted distances of the two approaches, we now complete this part of the analysis with the number of instructions executed by the controlled runahead threads. Figure 5.17 shows the average length of runahead threads for the original RaT mechanism and the two approaches of this chapter. This length is measured as the average number of instructions executed per runahead thread. It therefore depends on both the total number of executed runahead instructions and the number of runahead thread activations (for instance, ILP workloads execute fewer runahead threads than MIX or MEM workloads). For comparison purposes, the RaT bars also include a striped part that indicates the average ideal useful distance for the executed runahead threads.
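The metric above can be illustrated with a minimal sketch. The Python fragment below is not part of the simulation infrastructure used in this thesis; it simply divides a total runahead instruction count by the number of runahead thread activations. The function name and the counter values are hypothetical and chosen only for illustration.

```python
def average_runahead_length(total_runahead_instructions: int, activations: int) -> float:
    """Average number of instructions executed per runahead thread."""
    if activations == 0:
        # No runahead thread was started, so there is no length to average.
        return 0.0
    return total_runahead_instructions / activations


# Hypothetical counter values (not measured data): same total of runahead
# instructions, but a different number of runahead thread activations.
few_long_threads = average_runahead_length(60_000, 100)
many_short_threads = average_runahead_length(60_000, 300)
print(few_long_threads, many_short_threads)
```

With the same total of runahead instructions, a workload that activates three times as many runahead threads shows one third of the average length, which is why this metric has to be read together with the number of activations.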
As the figure shows, runahead threads controlled by RDP and R2DP are shorter than those of RaT. The average length reduction per runahead thread for 2-thread workloads is 462 instructions for RDP and 282 instructions for R2DP, whereas for 4-thread workloads this reduction is 341 and 201 instructions respectively. Therefore, RDP causes runahead threads to execute the lowest number of speculative instructions, resulting in runahead threads with 202 executed instructions on average, as opposed to 604 for RaT. R2DP, with an average of 362 instructions, still reduces the average number of runahead instructions per thread by 37% relative to RaT.
Figure 5.17: Average number of instructions executed per runahead thread as a function of the mechanism used (RaT, RDP and R2DP)
Likewise, the ideal computed useful runahead distances (striped part) indicate how well each technique manages the executed runahead threads to obtain the best performance while limiting useless instruction execution. As the figure shows, the RDP bars are below this optimum distance of RaT for each workload category, while the R2DP bars are above it. So, RDP effectively reduces the speculative runahead instructions, but its mispredictions yield predicted distances below the useful runahead distances needed to obtain performance similar to the original RaT. RDP eliminates a large number of runahead threads and therefore also eliminates the prefetching benefits provided by many useful runahead threads.
With R2DP, the length of runahead threads is above, although close to, the ideal useful runahead distance. Therefore, these data show that R2DP results in shorter runahead threads than RaT alone, but more efficient ones: runahead threads that execute fewer instructions than those of RaT with a similar performance improvement.
5.3.4 Number of runahead threads
Figure 5.18 shows the average number of runahead threads executed by each technique (RaT, RDP and R2DP) for each kind of workload. On average, R2DP increases the number of runahead threads by 11% over RaT across all workloads, especially for MIX and MEM workloads, with 13% and 12% more runahead thread executions respectively. For ILP workloads the increase is smaller (9%), since these workloads have a low ratio of long-latency misses. For RDP the increase is larger than for R2DP, with 41% more runahead threads than RaT. However, although these techniques execute more runahead threads, the runahead threads controlled by the runahead distance prediction techniques are shorter than those of baseline RaT, as shown in the previous section.
Figure 5.18: Number of runahead threads executed by each mechanism
5.3.5 Distribution of controlled runahead threads
In this last section, we present several experiments that give insight into how R2DP, the most efficient approach, works. Figure 5.19 analyzes the distribution of runahead threads according to the different R2DP decisions during 2-thread workload executions. In this figure, each bar is broken up into three parts (from top to bottom): the percentage of runahead threads fully executed, the percentage of runahead threads limited by a predicted useful runahead distance and, finally, the percentage of possible runahead threads that were not started (because their predicted useful distances do not meet the activation threshold).
Figure 5.19: Runahead thread executions breakdown
This figure indicates how R2DP manages the runahead threads in the different workloads and how often a runahead thread is controlled or not according to the predicted useful distances. For example, if we analyze how often our technique eliminates a runahead thread completely, we observe that some workloads present a high percentage. This is the case for bzip2-mcf and mcf-eon, since mcf is a benchmark with a huge number of dependent loads. This feature causes invalid runahead loads that do not issue prefetches and are therefore not taken into account for the useful distance computation, thereby reducing the useful distance value for the corresponding runahead threads. On average, the percentage of runahead threads not initiated due to small distances is 34% for 2-thread workloads (37% for 4-thread workloads). This ratio represents the useless runahead threads detected and avoided.
On the other hand, some workloads have a high ratio of runahead threads limited by the corresponding useful runahead distance. Examples are apsi-eon, mgrid-galgel, galgel-equake, swim-mgrid and applu-art, with around 40% of runahead threads whose execution R2DP controls according to the useful runahead distances. This percentage is 22.5% over all 2-thread workloads. This ratio reflects the elimination of the useless part of executed runahead threads, which contributes to reducing the speculatively executed useless instructions and their energy consumption, as we have already shown in the previous section.
However, for some kinds of programs (especially compute-intensive threads) there are many loads with isolated long-latency misses. In addition, these loads do not recur: each generates a runahead thread just once. Therefore, the runahead thread is always fully executed the first and only time (for instance in gcc,mgrid or lucas,crafty). In these cases, the useful distance information is rarely used again for the same load, so it is difficult to control and avoid more useless runahead executions.
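The three categories of this breakdown can be summarized in a short sketch. It is a simplification written for illustration and not the R2DP hardware logic: the function names, the `natural_length` parameter (the instructions a runahead thread would execute if left uncontrolled) and the threshold value of 32 are assumptions made here.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

ACTIVATION_THRESHOLD = 32  # assumed value, used only for illustration


def classify(predicted_distance: Optional[int], natural_length: int,
             threshold: int = ACTIVATION_THRESHOLD) -> str:
    """Classify one potential runahead thread into a breakdown category."""
    if predicted_distance is None:
        # No useful-distance history for this load (e.g. first occurrence).
        return "full"
    if predicted_distance <= threshold:
        # Predicted useful distance does not meet the activation threshold.
        return "not_started"
    if predicted_distance < natural_length:
        # Runahead execution is cut at the predicted useful distance.
        return "limited"
    # The prediction does not shorten the thread, so it runs to completion.
    return "full"


def breakdown(events: Iterable[Tuple[Optional[int], int]]) -> dict:
    """Percentage of events per category, as in a stacked-bar breakdown."""
    counts = Counter(classify(pred, length) for pred, length in events)
    total = sum(counts.values())
    categories = ("full", "limited", "not_started")
    if total == 0:
        return {c: 0.0 for c in categories}
    return {c: 100.0 * counts[c] / total for c in categories}


# Hypothetical (predicted_distance, natural_length) pairs, not measured data.
sample = [(None, 500), (16, 500), (200, 500), (800, 500)]
print(breakdown(sample))
```

In this sketch, a load that triggers runahead only once always falls in the first branch, which matches the observation that such loads leave no opportunity for control.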
To complete the previous study, Figure 5.20 shows a histogram of how many runahead threads execute between N and M instructions (in ranges of 32 instructions) for the particular art,gzip workload using the RaT and R2DP techniques. The R2DP bar for 0-32 instructions indicates the number of runahead threads not started for this workload because their runahead distance is not higher than 32.
Figure 5.20: Runahead threads distance histogram for R2DP and RaT
As we can observe, R2DP reduces the number of runahead threads with larger distances compared to RaT. This reduction is concentrated in the central part of the distribution, where most runahead threads are found. The distribution is shifted towards runahead threads with smaller runahead distances. Thus, the figure shows that R2DP effectively eliminates the useless part of larger, inefficient runahead threads, producing more efficient runahead threads in the range of smaller distances.
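The binning behind such a histogram can be sketched as follows. The fragment is illustrative only: the function name and the sample lengths are hypothetical, and we assume half-open ranges of the form [32k, 32(k+1)), a boundary convention that the text does not specify.

```python
from collections import Counter
from typing import Dict, Iterable

BIN_WIDTH = 32  # range width in instructions, as used in the histogram


def distance_histogram(lengths: Iterable[int],
                       bin_width: int = BIN_WIDTH) -> Dict[int, int]:
    """Map the lower bound of each range to the number of runahead threads
    whose executed-instruction count falls in [bound, bound + bin_width)."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# Hypothetical runahead thread lengths (not measured data).
sample_lengths = [0, 10, 31, 32, 63, 64, 600]
for lower, count in distance_histogram(sample_lengths).items():
    print(f"{lower}-{lower + BIN_WIDTH}: {count}")
```

Comparing two such histograms, one per mechanism, makes a shift of the distribution towards smaller distances directly visible as counts moving to lower ranges.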
5.4 Summary
In this chapter, we have developed and evaluated two fine-grain approaches to make runahead threads more efficient. To improve energy efficiency, these approaches focus on predicting the maximum MLP achievable by a particular runahead thread while at the same time reducing the extra useless speculative work.
The two mechanisms described are similar designs that aim to improve the efficiency of runahead threads with low hardware cost and complexity. Both schemes are based on the useful runahead distance, a concept that indicates how far a thread should run ahead such that the speculative runahead execution is efficient. One approach is called Runahead Distance Prediction (RDP) and the other Runahead Two Distance Predictor (R2DP). Aside from the specific operation of each approach, the general scheme has two main actions. First, it predicts whether or not a thread should employ runahead execution (i.e. whether or not runahead execution is useful) to avoid the execution of useless runahead threads. Second, it predicts how long the thread should execute in runahead mode to reduce the unnecessary speculative instructions executed at the end of useful runahead threads. Of the two approaches, the R2DP distance prediction scheme captures the useful runahead distance more accurately. Limiting the runahead execution of a thread by this distance prediction not only avoids unnecessary speculative execution but also reduces the executed instructions, and thereby the resource requirements of runahead threads. We have evaluated both RDP and R2DP in terms of performance improvement, reduction of extra instructions, and energy efficiency. Although both approaches effectively reduce the speculative extra instructions, RDP by 42% and R2DP by 28%, results have shown that R2DP is more energy-efficient than RaT, reducing the power consumption by 12% on average without affecting its performance. Therefore, R2DP provides not only high performance but also an efficiency-aware way of managing runahead threads in SMT processors.
Chapter 6
Related Work
Both Simultaneous Multithreading and Runahead execution are well-known microarchitectural models focused on overcoming the limitations of superscalar processors. The former exploits thread-level parallelism to improve throughput, whereas the latter provides an alternative to building large instruction windows to tolerate long-latency operations. Nevertheless, they are two different and clearly separate proposals that had not been considered together prior to this work.
In this chapter we describe the related work that involves the SMT and Runahead research lines. We cover simultaneous multithreading mechanisms, thread-based speculative techniques and the Runahead background close to the scope of this thesis. We describe the functionality and benefits of the most relevant approaches and discuss their differences, advantages and disadvantages in relation to Runahead Threads and the techniques proposed in this dissertation.
6.1 Simultaneous Multithreading
Simultaneous Multithreading (SMT) [53][76][84] emerged as a solution to the limitations of superscalar processors, increasing performance by exploiting thread-level parallelism and tolerating long main memory latencies better. An SMT processor is a single physical processor able to dispatch instructions from more than one hardware thread context simultaneously. The processor maintains a list of active threads and decides which instructions from those threads to issue into the pipeline.
Depending on the level of sharing, threads have exclusive use of some machine resources, like the reorder buffer, and share other resources, like the issue queues, the functional units, and the physical registers. Shared resources are dynamically allocated among the threads competing for them. This dynamic resource distribution among threads determines not only the final processor performance, but also the performance of individual threads. If a single thread monopolizes most of the resources, it will run at almost full speed, but the other threads will suffer a slowdown because of resource unavailability. Therefore, the design of an SMT processor requires additional resource policies that determine how the resources should be shared.
Although software approaches exist that try to reduce the interference in SMT shared resources, like compiler techniques [42][54] or operating system techniques [57], in this thesis we mainly focus on hardware techniques. SMT processors are an active research topic, and new SMT techniques and policies were proposed while this thesis was being developed. Research in this line includes instruction fetch policies, dynamic resource allocation mechanisms and memory-level parallelism aware techniques. We describe the different hardware proposals for each category in the next sections. In Chapter 7, we provide a detailed quantitative comparison of the mechanisms from this thesis with the main resource control policies described in this section.