Runahead paradigm - Runahead threads

The Runahead approach (RA) is a speculative paradigm whose goal is to bring data and instructions into the caches ahead of time. Runahead was first proposed and evaluated as a method to improve the data cache performance of a pipelined in-order execution machine [23]. This first mechanism pre-executes instructions under a cache miss on an in-order processor that does not employ any hardware prefetching techniques. It was shown to be effective at tolerating the latency of first-level data and instruction cache misses for this type of processor.

Later, Runahead was extended to out-of-order superscalar processors [51] as an alternative to large instruction window processors [1][18][65]. In this scenario, Runahead prevents the instruction window from blocking on long-latency operations, e.g. a load that misses in the second-level (L2) cache. Instead, the processor continues executing instructions speculatively, trying to follow the most likely program path until the load that triggered runahead mode is resolved. The main benefit of runahead comes from the pre-execution of these speculative instructions, which improves data and instruction cache efficiency.

In this work, we extend the Runahead paradigm to improve the performance of multithreaded processors, which are nowadays the basis of high-performance computing designs. We call this approach Runahead Threads (RaT). As described in previous chapters, RaT is a valuable solution both for exploiting memory-level parallelism (MLP) and for reducing resource monopolization in SMT processors. We propose a new use of Runahead on SMT processors as a speculative policy that improves the performance of memory-intensive threads without penalizing compute-intensive threads. With RaT, memory-intensive threads tolerate memory latencies better, and the other threads can proceed without resource clogging problems.

Energy-efficiency techniques for Runahead

Previous research [50] shows that Runahead executes significantly more instructions than an out-of-order processor, sometimes without providing any performance benefit. In these cases, runahead execution is not efficient. That work identifies three causes of inefficiency in runahead execution: short, overlapping, and useless runahead periods. Based on this study, it proposes several simple techniques to reduce their occurrence and improve the efficiency of runahead execution in single-threaded processors.

To eliminate useless runahead periods, the authors propose a binary MLP predictor based on two-bit saturating counters. If the predictor indicates that there is no MLP to be exploited, a runahead period is not initiated at all. To eliminate short and overlapping periods, they present different threshold-based heuristics. To avoid many short runahead periods, this work [50] suggests keeping track of the number of cycles each L2 miss spends waiting for data from memory. The processor is then only allowed to enter runahead when this count crosses a threshold (rather than when the load reaches the head of the ROB). Overlapping runahead periods are another source of inefficiency. Two runahead periods overlap if some of the instructions executed during the first period are re-executed during the second. To avoid this inefficiency, Mutlu et al. proposed beginning a new runahead interval only when it will not overlap a previous runahead interval. To achieve this, the processor counts the number n of instructions pseudo-retired during runahead mode and saves that value in a register on return to normal mode. In normal mode, it also records the number of instructions i fetched since the last runahead period. When an L2 miss reaches the head of the ROB in normal mode, the processor enters runahead mode only if i is greater than n.
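The two entry checks described above, the two-bit saturating counter predictor and the i > n overlap test, can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design of [50]: the class and method names, the table size, indexing the predictor by load PC, and the initial counter state are all assumptions made for the example. The cycle-count threshold heuristic is left out.

```python
class TwoBitCounter:
    """Two-bit saturating counter with states 0..3."""

    def __init__(self, state: int = 2) -> None:
        # Assumed initial state: weakly "MLP available".
        self.state = state

    def predict(self) -> bool:
        # States 2 and 3 predict that a runahead period would be useful.
        return self.state >= 2

    def update(self, outcome: bool) -> None:
        # Move one step toward 3 on a useful period, toward 0 otherwise.
        if outcome:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class RunaheadEntryFilter:
    """Hypothetical model of the entry decision taken when an L2 miss
    reaches the head of the ROB in normal mode."""

    def __init__(self, table_size: int = 64) -> None:
        # Assumption: one counter per table entry, indexed by load PC.
        self.counters = [TwoBitCounter() for _ in range(table_size)]
        self.n = 0  # instructions pseudo-retired in the last runahead period
        self.i = 0  # instructions fetched since the last runahead period

    def _counter(self, load_pc: int) -> TwoBitCounter:
        return self.counters[load_pc % len(self.counters)]

    def on_fetch(self, count: int = 1) -> None:
        # Called in normal mode for every fetched instruction (or group).
        self.i += count

    def should_enter(self, load_pc: int) -> bool:
        # Useless-period filter: skip runahead if no MLP is predicted.
        if not self._counter(load_pc).predict():
            return False
        # Overlap filter: enter only if i is greater than n.
        return self.i > self.n

    def on_runahead_exit(self, load_pc: int, pseudo_retired: int,
                         found_mlp: bool) -> None:
        # Save n on return to normal mode, restart i, and train the predictor.
        self.n = pseudo_retired
        self.i = 0
        self._counter(load_pc).update(found_mlp)
```

For instance, after a runahead period that pseudo-retired 100 instructions, `should_enter` stays false until more than 100 instructions have been fetched in normal mode, so a new period cannot re-execute the instructions already covered by the previous one.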

These efficient runahead techniques for single-threaded processors differ from our runahead distance prediction because they only try to predict and eliminate ineffective runahead periods (short, overlapping, and useless). If a runahead period is predicted to be effective (i.e., not short, not overlapping, and not useless), the processor stays in runahead mode until the L2 miss that caused entry into runahead mode is serviced. Consequently, even if runahead execution provides no benefit beyond some point in runahead mode (which we call the useful runahead distance), these techniques continue executing instructions speculatively until that L2 miss is serviced.

Another technique [17] combines the advantages of the MLP-aware flush and RaT mechanisms. That paper proposes MLP-aware runahead threads to reduce the number of useless runahead periods. If the MLP predictor predicts that there is far-distance MLP to be exploited, the long-latency thread enters runahead execution. If not, the MLP-aware flush policy is applied to free allocated resources while exposing short-distance MLP. That proposal predicts the MLP only to decide whether the thread goes into full runahead execution (far-distance MLP) or is flushed/stalled (short-distance MLP). The novelty of our proposal is that we also predict how long a thread should stay in runahead mode, by means of the useful runahead distance prediction. For this reason, we will show that our mechanism is able to eliminate more useless speculative instructions (even in useful runahead periods) that cannot be eliminated by these previous techniques. In Chapter 7, we provide a detailed qualitative and quantitative comparison of our proposals to efficient runahead execution and find that our proposal provides significantly higher performance and energy efficiency for Runahead Threads.
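The contrast between these policies can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the mechanism of [17] or of this thesis: the function names are hypothetical, the MLP prediction is taken as a given boolean, and the useful runahead distance is assumed to be measured in speculatively executed instructions.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RUNAHEAD = "runahead"  # thread enters full runahead execution
    FLUSH = "flush"        # MLP-aware flush policy frees the thread's resources


def mlp_aware_action(predicts_far_distance_mlp: bool) -> Action:
    """Entry decision of MLP-aware runahead threads: runahead only when
    far-distance MLP is predicted, otherwise fall back to the flush policy."""
    if predicts_far_distance_mlp:
        return Action.RUNAHEAD
    return Action.FLUSH


def should_exit_runahead(miss_serviced: bool,
                         executed_in_runahead: int,
                         useful_distance: Optional[int] = None) -> bool:
    """Exit decision once a thread is in runahead mode.

    With useful_distance=None this models the earlier techniques: the thread
    leaves runahead only when the triggering L2 miss is serviced. With a
    predicted useful distance (assumed here to be an instruction count), the
    thread also leaves as soon as that distance has been covered.
    """
    if miss_serviced:
        return True
    return useful_distance is not None and executed_in_runahead >= useful_distance
```

With a predicted distance of 200 instructions, for example, a thread that has already executed 500 instructions in runahead mode would leave it immediately, whereas without the prediction it would keep executing until the miss returns.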
