Memory-intensive threads can clog shared resources on SMT processors because of long-latency memory operations, without making forward progress, thereby hindering overall system performance. In this chapter, we have presented Runahead Threads (RaT) as an alternative mechanism to alleviate these problems in SMT scenarios.
In contrast to existing fetch policies and resource control schemes, which usually restrict memory-bound threads in order to obtain higher throughput, RaT offers a new point of view on resource management through a speculative execution mechanism. RaT turns any running thread into a runahead thread when the SMT processor detects that the thread is undergoing a long-latency load. While in this mode, the thread behaves as a fast speculative thread, using runahead execution until the load is resolved. The runahead thread uses the shared resources without negatively affecting their availability for the other threads.
This simple functionality of the RaT mechanism has several important advantages on SMT processors:
• First, RaT alleviates the SMT problem of handling long-latency loads, especially in the case of memory-intensive threads. RaT allows memory-bound threads to advance speculatively instead of stalling, doing beneficial work without disturbing the other threads. In this sense, RaT balances resource usage between computation-intensive and memory-intensive threads.
• Second, RaT significantly improves single-thread performance through prefetching, which exploits the memory-level parallelism available while a long-latency load is serviced. This benefits each individual thread, something that multithreading alone does not provide, and therefore also improves overall processor performance.
• Third, RaT provides not only a high-performance but also an efficient way of using shared resources in SMT processors in the presence of long-latency memory operations. On the one hand, it avoids possible resource monopolization by memory-bound threads, transforming them into light speculative threads and allowing the other threads to continue executing with the remaining resources. On the other hand, RaT also prevents resources from being under-used, since runahead threads take advantage of the free resources to perform their speculative execution.
• Fourth, RaT increases register file efficiency and provides higher performance for the same number of registers. It is also worth noting that an SMT processor that implements RaT can use a smaller register file and still obtain performance improvements.
To support these advantages, we have provided a detailed evaluation of the mechanism. It shows that RaT performs better than the SMT processor baseline in terms of throughput (44%) and Hmean (38%). RaT outperforms ICOUNT on average for all categories of workloads, especially for MIX and MEM workloads. These results show the significant performance benefits of using RaT: higher throughput indicates higher utilization of processor resources, while good fairness results under the Hmean metric indicate that all threads are given similar opportunities and that no thread is forced to starve.
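The difference between the two metrics can be illustrated with a minimal sketch. It assumes that throughput is the sum of per-thread IPCs and that Hmean is the harmonic mean of per-thread relative IPCs (IPC in the SMT run divided by IPC when running alone), a definition commonly used in the SMT literature. The function names and the IPC values are hypothetical and are not taken from the evaluation.

```python
def throughput(smt_ipcs):
    """Throughput, assumed here to be the sum of per-thread IPCs in the SMT run."""
    return sum(smt_ipcs)


def hmean(smt_ipcs, single_ipcs):
    """Harmonic mean of relative IPCs (assumed definition of Hmean).

    smt_ipcs[i]    : IPC of thread i when co-scheduled on the SMT processor
    single_ipcs[i] : IPC of thread i when it runs alone
    """
    if not smt_ipcs or len(smt_ipcs) != len(single_ipcs):
        raise ValueError("need one single-thread IPC per SMT IPC")
    relative = [m / s for m, s in zip(smt_ipcs, single_ipcs)]
    if any(r <= 0 for r in relative):
        raise ValueError("relative IPCs must be positive")
    return len(relative) / sum(1.0 / r for r in relative)


# Hypothetical two-thread workload; single-thread IPCs are 2.0 and 1.0.
single = [2.0, 1.0]
starving = [1.8, 0.1]   # first thread dominates, second thread almost starves
balanced = [1.0, 0.5]   # both threads slowed down by the same factor

print(throughput(starving), hmean(starving, single))
print(throughput(balanced), hmean(balanced, single))
```

In this made-up case the starving schedule has the higher throughput but the lower Hmean, which is why both metrics are reported together.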
Chapter 4
Code Semantic-Aware Runahead Threads
In the previous chapter, we introduced Runahead Threads (RaT) as a promising solution to alleviate the memory-intensive thread problem in SMT processors. RaT employs runahead execution to enable a thread to speculatively execute instructions and prefetch data instead of stalling because of a long-latency load. However, as runahead threads speculatively execute large portions of the instruction stream, an SMT processor with RaT executes more instructions than a non-speculative SMT processor. RaT therefore improves overall processor performance by prefetching and by alleviating resource contention among threads, but it has a shortcoming: these benefits come at the cost of executing a large number of instructions speculatively. If a runahead thread execution does not provide prefetching benefits, it can degrade energy efficiency by executing a large number of useless instructions without any performance gain.
This chapter addresses this drawback of runahead threads by proposing several solutions to enhance the effectiveness of RaT. The objective is to decrease the number of useless instructions executed by runahead threads while still preserving the performance improvement provided by RaT. We present a research line for improving runahead thread efficiency through simple and complementary code semantic control techniques. These proposals perform a coarse-grain analysis to capture the prefetch opportunities (usefulness) of executed code structures, such as loops and subroutines, during runahead thread executions. We propose to dynamically use code semantic information to detect these particular program structures and to analyze whether they are useful, in order to control the runahead thread execution. Based on this runtime analysis, the proposed techniques make a control decision either to avoid or to stall the execution of the particular loop or subroutine in runahead threads. By means of these control actions, our goal is to make runahead threads more efficient and to reduce the dynamic energy consumption of SMT processors that use RaT.
4.1 Efficiency and Runahead Threads
By using RaT, SMT processors provide a performance- and complexity-effective framework to improve memory latency tolerance and reduce resource clogging on long-latency loads. Nevertheless, RaT requires the speculative processing of extra instructions while an L2 cache miss is in progress for a thread. When a runahead thread ends its speculative execution, the processor resumes the normal thread by flushing the pipeline of its hardware context and restarting from the instruction that triggered the runahead thread. Hence, an SMT processor with the RaT mechanism can execute the same instructions in the instruction stream several times, and RaT therefore increases the number of instructions executed compared with a conventional SMT processor.
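The re-execution effect described above can be sketched with a toy, single-thread model. This is only an illustration under strong assumptions, not the simulator used in this work: every runahead episode covers a fixed number of instructions (`runahead_window`, a hypothetical parameter), any L2 miss reached during runahead is turned into a prefetch, and all timing and dependence effects are ignored.

```python
def simulate_rat(is_miss, runahead_window):
    """Toy count of normal and runahead instructions for one thread.

    is_miss[i] is True when instruction i is a load that misses in L2.
    runahead_window is the assumed fixed number of instructions executed
    speculatively while a single miss is being serviced.
    Returns (normal, runahead, prefetches).
    """
    prefetched = set()  # misses already covered by an earlier runahead episode
    normal = runahead = prefetches = 0
    n = len(is_miss)
    for pc in range(n):
        if is_miss[pc] and pc not in prefetched:
            # Enter runahead mode: speculatively execute the following instructions.
            for j in range(pc + 1, min(n, pc + 1 + runahead_window)):
                runahead += 1
                if is_miss[j] and j not in prefetched:
                    prefetched.add(j)  # this future miss becomes a useful prefetch
                    prefetches += 1
            # The miss is resolved: flush and restart from the triggering load.
        normal += 1  # the instruction at pc is (re)executed in normal mode
    return normal, runahead, prefetches


# Useful episode: the second miss is prefetched during runahead.
print(simulate_rat([False, True, False, False, True, False], 3))
# Useless episode: three speculative instructions, no prefetch.
print(simulate_rat([True, False, False, False], 3))
```

In both calls the instructions that follow the triggering load are executed twice, once speculatively and once in normal mode, but only the first call turns that extra work into a prefetch.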
To expose this fact, Figure 4.1 shows the distribution of speculative runahead instructions over the total number of instructions executed for the different workloads. Each bar shows the ratio of normal instructions and of runahead instructions with regard to the total number of executed instructions. As we can observe, the portion of additional speculative instructions executed due to runahead threads increases from ILP to MEM for both 2-thread and 4-thread workloads. There are far fewer cache misses when executing ILP workloads than MEM workloads, and therefore far fewer runahead threads are generated. For ILP workloads, the percentages of runahead instructions with regard to the total number of instructions are 8.9% for 2 threads and 11.7% for 4 threads. In contrast, these ratios increase up to 47.9% and 46.3% for 2-thread and 4-thread MEM workloads respectively. That is, for memory-intensive workloads, almost half of the total instructions executed are speculative runahead instructions.
Therefore, this figure exposes the additional speculative instructions executed by RaT, which result in an increase in the dynamic energy dissipated by the processor. The inefficiency problem arises because RaT has no information in advance about the existence of useful future prefetches for a particular runahead thread. As a consequence, it can perform useless extra work if no prefetching is available. When a runahead thread executes without prefetching, it does not contribute to improving performance, yet it still executes useless speculative instructions. This useless extra work impacts the overall power consumption and thus wastes energy per instruction. The main drawback in these cases is that the benefits of RaT can be spoiled if the thread executes a large number of useless speculative instructions without doing any prefetching.
Figure 4.1: Distribution of executed instructions for 2- and 4-thread workloads. (a) 2-thread workloads. (b) 4-thread workloads.
To be worthwhile, runahead threads should maintain an efficient relation between the performance gain and the extra speculative work performed. That is, the efficiency of a runahead thread can be seen as the relation between the performance improvement obtained through prefetching and the number of additional speculative instructions executed due to runahead execution. For instance, RaT increases the number of executed instructions by 29.7% to achieve a 64% speedup on average for 2-thread workloads, whereas for 4-thread workloads RaT increases throughput by 24% at the cost of increasing the number of executed instructions by 30% on average. Therefore, even if this extra work is considered acceptable from the performance point of view, we can make RaT a more efficient mechanism by controlling the useless portions of runahead threads.
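One simple way to put these figures side by side is to divide the performance gain by the extra work. The ratio below is only an illustrative assumption about how such a relation could be quantified; the function name `runahead_efficiency` is hypothetical, and only the percentages come from the text above.

```python
def runahead_efficiency(gain_pct, extra_instr_pct):
    """Performance gain (in %) obtained per 1% of extra executed instructions."""
    if extra_instr_pct <= 0:
        raise ValueError("extra instruction percentage must be positive")
    return gain_pct / extra_instr_pct


# Average figures quoted above for RaT.
two_thread = runahead_efficiency(64.0, 29.7)    # 64% speedup, 29.7% more instructions
four_thread = runahead_efficiency(24.0, 30.0)   # 24% throughput gain, 30% more instructions

print(f"2-thread: {two_thread:.2f}")
print(f"4-thread: {four_thread:.2f}")
```

Under this assumed measure, each 1% of extra instructions yields roughly 2.15% of performance for 2-thread workloads but only 0.8% for 4-thread workloads, so removing useless runahead instructions would raise the ratio without giving up the gain.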