Runahead Execution - Energy Efficient Load Latency Tolerance: Single-Thread Performance for the

An out-of-order processor uses Runahead Execution [20, 54] to expose MLP in the presence of last-level cache misses. When a pending last-level cache miss reaches the head of the re-order buffer, Runahead checkpoints architected register state and enters Runahead execution mode. In this mode, loads which miss the last-level cache undergo a pseudo-execution which produces “poison” rather than an actual output value. Poison is represented by an extra bit in the destination register which marks the value as unknown. The pseudo-execution of the missing load causes the load to release its issue queue entry and wake up dependent instructions as if it had executed. When dependent instructions read their input values, they ingest poison, pseudo-execute, and propagate poison to their own outputs. Runahead mode retirement processes executed instructions in program order, removes them from the ROB, and frees their physical registers, but does not commit them to architected state; specifically, Runahead stores do not write the data cache.
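The poison mechanism described above can be illustrated with a minimal sketch. It is a software model only, not the hardware design from [20, 54]; the names `Reg`, `runahead_load`, and `pseudo_execute` are hypothetical, and the last-level cache is assumed to be a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    """A register value plus the extra poison bit (hypothetical model)."""
    value: int = 0
    poison: bool = False  # True means the value is not known


def runahead_load(addr, cache):
    """A Runahead load: a hit returns data, a miss produces poison."""
    if addr in cache:
        return Reg(value=cache[addr])
    return Reg(poison=True)


def pseudo_execute(op, srcs):
    """Execute op on the sources; any poisoned input poisons the output."""
    if any(s.poison for s in srcs):
        return Reg(poison=True)
    return Reg(value=op(*(s.value for s in srcs)))
```

For example, with `cache = {0x10: 7}`, a load of `0x20` returns a poisoned register, and an add that consumes it is poisoned as well, while an instruction that depends only on the load of `0x10` computes a real value.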

When the miss that triggered the Runahead episode returns, the processor restores the register checkpoint and re-fetches and re-executes all instructions younger than the miss. Re-execution of the instructions that already executed in Runahead mode is accelerated because Runahead execution initiated parallel last-level cache misses and warmed up the caches. Runahead Execution is not a true load latency tolerant design as it discards all miss-independent work after the miss returns and does not exploit ILP under cache misses. Figure 2.11 shows an example of Runahead execution. When Runahead encounters A’s miss, it checkpoints register state and enters Runahead mode. Runahead mode exposes MLP by executing H’s miss. However, it is unable to expose ILP: it discards and re-executes the independent instructions B, C, F, and G.

2.2.1 Load and Store Queues

Runahead’s approach to virtually scaling the load and store queues hinges on the fact that it discards all instructions from Runahead mode. Since Runahead instructions will be discarded, it is acceptable if load ordering violations are not detected for Runahead loads, so Runahead simply removes loads from the load queue when they exit the ROB. For stores, forwarding is performed in a best-effort fashion via a Runahead cache. When Runahead stores exit the ROB, they write the Runahead cache and release their store queue entries. Younger Runahead loads check the Runahead cache to see if there are any matching stores. When a Runahead episode ends, the Runahead cache is cleared. As stores may age out of the Runahead cache, a proper forwarding may be missed. This possibility is acceptable because all loads and stores re-execute when the Runahead episode ends.
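The best-effort forwarding behavior can be sketched as a small bounded structure. This is an illustrative model under stated assumptions: the class name `RunaheadCache`, the oldest-first replacement policy, and the word-granularity addresses are all hypothetical choices, not details given in the text, and `capacity` is assumed to be at least 1.

```python
from collections import OrderedDict


class RunaheadCache:
    """Best-effort store-to-load forwarding for Runahead mode (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed >= 1
        self.entries = OrderedDict()      # addr -> value, oldest first

    def store(self, addr, value):
        """Called when a Runahead store exits the ROB."""
        if addr in self.entries:
            self.entries.pop(addr)        # re-insert as the youngest entry
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # oldest store ages out
        self.entries[addr] = value

    def load(self, addr):
        """Return the forwarded value, or None if no matching store is held."""
        return self.entries.get(addr)

    def end_episode(self):
        """Clear all forwarding state when the Runahead episode ends."""
        self.entries.clear()
```

A load that returns `None` here models the missed forwarding case: the store has aged out, so the Runahead load may read a stale value, which is tolerable only because everything re-executes after the episode.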

2.2.2 Efficiency

If Runahead mode does not expose any MLP, it not only fails to help performance but also wastes energy. To avoid such waste, Runahead can implement heuristics which predict whether a Runahead episode is likely to expose MLP [53]. One heuristic avoids useless periods of Runahead by learning which static loads have historically exposed MLP and which have not. Another heuristic avoids overlapping periods of Runahead by tracking how many instructions were pseudo-retired during Runahead mode, and preventing the processor from re-entering Runahead mode until at least that many instructions have been retired conventionally. These techniques occasionally harm Runahead’s performance, but greatly improve its overall energy efficiency (see Section 5.3).
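The second heuristic, which prevents overlapping Runahead periods, amounts to a pair of counters. The following is a minimal sketch of that bookkeeping; the class and method names are hypothetical, and the real mechanism in [53] may differ in detail.

```python
class RunaheadThrottle:
    """Blocks re-entry into Runahead until the last episode's
    pseudo-retired instructions have been retired conventionally (sketch)."""

    def __init__(self):
        self.pseudo_retired = 0  # instructions pseudo-retired this episode
        self.remaining = 0       # conventional retirements still required

    def on_pseudo_retire(self):
        self.pseudo_retired += 1

    def on_episode_end(self):
        # The next episode may not start until this many instructions retire.
        self.remaining = self.pseudo_retired
        self.pseudo_retired = 0

    def on_retire(self):
        if self.remaining > 0:
            self.remaining -= 1

    def may_enter_runahead(self):
        return self.remaining == 0
```

After an episode that pseudo-retires three instructions, `may_enter_runahead()` stays false until three instructions have retired conventionally, so a new episode cannot cover the same instructions again.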

2.3 Checkpointed Early Load Retirement and Checkpoint Assisted Value Prediction

Two similar designs, Checkpointed Early Load Retirement (CLEAR) [42] and Checkpoint Assisted Value Prediction (CAVA) [14], use value prediction to tolerate long load latency. When a load misses the last-level cache, a predictor is used to guess the output value. The load then binds this value and dependent instructions execute normally, releasing their issue queue entries.

CLEAR and CAVA both couple value prediction with speculative retirement to scale the physical register file. Speculative retirement is not traditional retirement made speculative: instructions are not made globally visible and then pulled back. Instead, it is a more resource-efficient way of buffering speculative instructions. Specifically, it uses register checkpointing to buffer a large number of instructions without explicitly representing the register output of each instruction. In CLEAR and CAVA, speculative retirement begins when a value-predicted load reaches the head of the ROB. The processor checkpoints architected register state and begins retiring instructions speculatively. Speculatively retired instructions exit the processor and release their physical registers, allowing younger instructions to enter behind them. When value-predicted load misses return, the actual output value is compared against the predicted value. If all predictions are correct, then the checkpoint is released, making speculative retirement non-speculative. If any mismatch occurs, then speculation is aborted to the checkpoint. Both CLEAR and CAVA use more than one checkpoint to reduce the number of instructions squashed on a mis-speculation.
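The checkpoint, verify, and release-or-abort flow can be sketched in software. The model below is a simplification under stated assumptions: it uses a single checkpoint (CLEAR and CAVA use several), register state is a dictionary, and the names `SpeculativeRetirement`, `begin`, `retire`, and `miss_returns` are hypothetical.

```python
class SpeculativeRetirement:
    """Single-checkpoint model of value-predicted speculative retirement."""

    def __init__(self, regs):
        self.regs = dict(regs)   # architected register state
        self.checkpoint = None   # saved state while speculating
        self.pending = {}        # load id -> predicted value

    def begin(self, load_id, predicted):
        """A value-predicted load reaches the ROB head."""
        if self.checkpoint is None:
            self.checkpoint = dict(self.regs)
        self.pending[load_id] = predicted

    def retire(self, reg, value):
        """Speculatively retire an instruction that writes reg."""
        self.regs[reg] = value

    def miss_returns(self, load_id, actual):
        """Verify one prediction when its miss returns."""
        predicted = self.pending.pop(load_id)
        if predicted != actual:
            self.regs = self.checkpoint  # abort to the checkpoint
            self.checkpoint = None
            self.pending.clear()
            return "abort"
        if not self.pending:
            self.checkpoint = None       # all verified: release checkpoint
            return "commit"
        return "pending"
```

With one checkpoint, a single wrong prediction discards every instruction retired since speculation began, which is the cost that multiple checkpoints are meant to reduce.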

As mis-speculations would result in squashes of many independent instructions, it is only beneficial to apply value prediction to those loads whose value is very predictable. Unfortunately, most loads have difficult-to-predict values (CLEAR reports that only 24.3% (integer) and 39.6% (floating point) of loads are high confidence [42]), meaning that value prediction is not a general enough solution for a processor to achieve true load latency tolerance.

2.3.1 Load and Store Queues

CAVA uses a transactional cache to buffer speculatively retired loads and stores. A transactional cache is a data cache which accepts speculative writes and can abort them if needed. The transactional cache handles the verification of loads relative to stores from other threads by tracking which cache lines have been speculatively read and signaling a violation if one of these lines must be evicted before the read is made non-speculative. The transactional cache works well with value prediction-based designs like CAVA, but is ill-suited to other forms of load latency tolerance which re-execute miss-dependent instructions. In designs which depend on the re-execution of miss-dependent instructions, a re-executing load may need a value which has been overwritten in the cache, preventing the load from executing properly.
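The behavior attributed to the transactional cache (buffered speculative writes, a tracked read set, and a violation on eviction of a speculatively read line) can be sketched as follows. This is an illustrative model, not CAVA's implementation; the class name, the line-granularity dictionary, and the method names are assumptions.

```python
class TransactionalCache:
    """Data cache model that buffers speculative writes (sketch)."""

    def __init__(self, memory):
        self.memory = memory      # committed state: line -> value
        self.spec_writes = {}     # speculative writes, not yet visible
        self.spec_reads = set()   # lines read speculatively

    def read(self, line):
        self.spec_reads.add(line)
        # A speculative write to the same line takes precedence.
        return self.spec_writes.get(line, self.memory.get(line))

    def write(self, line, value):
        self.spec_writes[line] = value

    def evict(self, line):
        """Return True if evicting this line signals a violation."""
        return line in self.spec_reads

    def commit(self):
        """Make speculative writes non-speculative."""
        self.memory.update(self.spec_writes)
        self._reset()

    def abort(self):
        """Discard all speculative state."""
        self._reset()

    def _reset(self):
        self.spec_writes.clear()
        self.spec_reads.clear()
```

Note that a speculative write replaces the only buffered copy of a line's speculative value, which illustrates why a later re-executing load cannot recover an overwritten value from this structure.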

CLEAR assumes large load and store queues, but notes that other solutions exist, including a transactional cache.

In document Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era (Page 43-46)