
3.2 Fundamentals of Runahead Threads

3.2.2 Executing a runahead thread

Once a thread becomes a runahead thread following the process just described, every instruction fetched by this thread is marked as a runahead instruction. That is, all instructions from this thread in the instruction window are identified as “runahead operations” until the runahead thread finishes. Runahead instructions are executed much like instructions in a normal thread. The difference is that their execution is purely speculative: runahead instructions do not update the architectural state.
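The difference between retiring and pseudo-retiring can be sketched in a few lines of Python. This is a minimal model under our own assumptions, not the simulator's code: `HardwareContext`, `fetch`, and `commit` are hypothetical names, each instruction carries a precomputed result, and instructions are tagged only at fetch time.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    dest: str               # destination register name
    value: int              # result value, precomputed for illustration
    runahead: bool = False  # runahead tag, set at fetch time


@dataclass
class HardwareContext:
    in_runahead: bool = False
    arch_regs: dict = field(default_factory=dict)  # architectural register state

    def fetch(self, dest: str, value: int) -> Instr:
        # Every instruction fetched while the context is in runahead mode is tagged.
        return Instr(dest, value, runahead=self.in_runahead)

    def commit(self, instr: Instr) -> str:
        if instr.runahead:
            # Pseudo-retire: no architectural update is performed.
            return "pseudo-retired"
        self.arch_regs[instr.dest] = instr.value  # normal retirement updates state
        return "retired"


ctx = HardwareContext()
ctx.commit(ctx.fetch("r1", 5))   # "retired"; r1 = 5 architecturally
ctx.in_runahead = True
ctx.commit(ctx.fetch("r1", 99))  # "pseudo-retired"; r1 is still 5
```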

The main issue in executing runahead instructions is tracking the L2-miss instructions and controlling their dependents. During a runahead thread execution, we distinguish two kinds of instructions according to the validity of their source operands: valid and invalid instructions. An invalid instruction is any instruction that reads a source register whose value belongs to the dependence chain of the load that caused the runahead thread, so that the value is not available (invalid). An invalid instruction is therefore not executed and is sent directly to retirement. A valid instruction, in contrast, is executed and writes the result of its operation to its physical destination register. Both kinds of instructions then pseudo-retire in program order at the commit stage.

In this execution, the first instruction that introduces an invalid (bogus) value is the load that started the runahead thread. When this load is retired to start the runahead thread, it marks its destination register as invalid (INV). Likewise, every valid L2-miss load executed in the runahead thread is marked INV, even though its memory access is still performed. This prevents the retirement logic from blocking on these speculative long-latency instructions. Any instruction that reads an INV register (i.e., an L2-miss dependent instruction) is in turn marked INV together with its destination register. Consequently, conditional branches marked INV are not actually resolved: since the value needed to evaluate the branch condition is not available during runahead execution, the processor follows the branch predictor's prediction for that branch rather than resolving it with a bogus, stale value.
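The INV propagation rules above can be sketched as a single pass over a runahead trace. This is a simplified model, not the actual hardware: `RInstr` and `run_runahead` are hypothetical names, registers are tracked by architectural name rather than by physical register, and whether a load misses in L2 is supplied as an input.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class RInstr:
    op: str                # "alu", "load", or "branch"
    dest: Optional[str]    # destination register (None for branches)
    srcs: Tuple[str, ...]  # source register names
    l2_miss: bool = False  # loads only: does the access miss in L2? (given as input)


def run_runahead(trigger_dest: str, trace: List[RInstr],
                 predictions: Dict[int, str]) -> Tuple[Set[str], List[str]]:
    """Propagate INV bits through a runahead trace.

    trigger_dest: destination of the load that started the runahead thread.
    predictions:  trace index -> branch predictor outcome, used for INV branches.
    """
    inv = {trigger_dest}  # the runahead-causing load marks its destination INV
    log = []
    for i, ins in enumerate(trace):
        if any(src in inv for src in ins.srcs):
            # Invalid instruction: not executed; its destination becomes INV.
            if ins.dest is not None:
                inv.add(ins.dest)
            if ins.op == "branch":
                log.append("INV branch: follows prediction " + predictions[i])
            else:
                log.append("INV")
        elif ins.op == "load" and ins.l2_miss:
            # Valid L2-miss load: the access is issued, but the result is marked INV
            # so that retirement does not block on it.
            inv.add(ins.dest)
            log.append("valid L2 miss: access issued, dest INV")
        else:
            # Valid instruction: executed; a valid result clears any INV mark.
            if ins.dest is not None:
                inv.discard(ins.dest)
            log.append("valid")
    return inv, log


trace = [
    RInstr("alu", "r2", ("r1", "r3")),            # depends on the triggering load
    RInstr("load", "r4", ("r5",), l2_miss=True),  # independent L2 miss
    RInstr("alu", "r6", ("r3", "r5")),            # independent short-latency op
    RInstr("branch", None, ("r2",)),              # INV branch
    RInstr("alu", "r1", ("r3",)),                 # valid overwrite of r1
]
inv, log = run_runahead("r1", trace, {3: "taken"})
# inv == {"r2", "r4"}
```

Clearing the INV mark on a valid overwrite of `r1` stands in for what register renaming does in hardware, where the new value simply lands in a fresh physical register with its own valid INV bit.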

The characteristics of valid and invalid instructions bring some important and interesting benefits in multithreaded scenarios, which we describe next and summarize in Figure 3.2:

• When a thread is turned into a runahead thread, its invalid instructions use almost no processor resources, since they are immediately pseudo-retired. INV instructions are not executed and can be removed from the instruction window without waiting for the L2 miss they depend on to complete. Hence, they do not occupy functional units, registers, issue queues, and so on. This reduces the resource requirements of runahead threads and allows other threads to make forward progress without suffering resource starvation.

• Other valid long-latency loads are also invalidated, just like the load that started the runahead thread, but their memory references are still issued to the memory system. The L2-miss requests to main memory generated by these runahead memory operations are treated as prefetch requests. Conversely, load and store instructions that are already invalid are not allowed to generate memory requests, since their addresses would be bogus. This reduces the probability of polluting the caches with bogus memory requests. In addition, because these invalid memory instructions never access memory, they consume neither memory bandwidth nor cache ports.

• The remaining valid instructions executed in the runahead thread are usually short-latency instructions that occupy the shared resources only briefly during their execution. They therefore allocate and deallocate resources faster than long-latency instructions, which are not explicitly executed in a runahead thread.
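The memory-side rules in the second bullet amount to a filter on runahead memory operations. The sketch below is hypothetical (`MemOp` and `filter_runahead_memory` are illustrative names, and L2 hit/miss outcomes are given as inputs). It collects the prefetch addresses and counts the suppressed requests, and it deliberately ignores valid L2 hits and valid stores, whose handling this section does not describe.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemOp:
    kind: str        # "load" or "store"
    addr_inv: bool   # is the address operand INV?
    addr: int        # effective address (meaningless when addr_inv is True)
    l2_miss: bool    # would the access miss in L2? (given as input)


def filter_runahead_memory(ops: List[MemOp]) -> Tuple[List[int], int]:
    """Return the addresses sent to main memory as prefetches and the number
    of INV memory operations that generated no request at all."""
    prefetches: List[int] = []
    suppressed = 0
    for op in ops:
        if op.addr_inv:
            # Bogus address: no cache access and no memory request.
            suppressed += 1
        elif op.kind == "load" and op.l2_miss:
            # Valid L2-miss load: its request is treated as a prefetch.
            prefetches.append(op.addr)
        # Valid L2 hits and valid stores are not modeled in this sketch.
    return prefetches, suppressed


ops = [
    MemOp("load", False, 0x100, True),   # valid L2 miss -> prefetch
    MemOp("load", True, 0, True),        # INV address -> suppressed
    MemOp("store", True, 0, False),      # INV address -> suppressed
    MemOp("load", False, 0x200, False),  # valid L2 hit -> no main-memory request
]
print(filter_runahead_memory(ops))  # ([256], 2)
```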

Figure 3.2: Types of instructions during runahead thread execution

As a result of this kind of speculative execution, runahead threads behave as fast threads with low resource requirements. In this sense, runahead threads are much less aggressive than normal threads (especially memory-bound ones) in their use of valuable processor resources, allocating and deallocating them within short periods of time. In addition, the prefetches issued in runahead mode increase single-thread performance by exploiting memory-level parallelism, as we will show later.

3.2.3 Exiting a runahead thread

A runahead thread exits when the L2 cache miss that caused it is resolved. This point is not a fixed interval of time, since the number of cycles required to service an L2 cache miss varies with bank conflicts, port contention, signal delays, etc. In our memory model, an L2 cache miss has a minimum latency of 300 cycles.
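The exit point can be expressed as simple cycle arithmetic. In this sketch, the 300-cycle minimum comes from our memory model, while `runahead_exit_cycle` and its delay parameters are hypothetical stand-ins for the variable components (bank conflicts, port contention, signal delays).

```python
MIN_L2_MISS_LATENCY = 300  # minimum L2 miss latency in our memory model (cycles)


def runahead_exit_cycle(miss_issue_cycle: int, bank_conflict: int = 0,
                        port_contention: int = 0, signal_delay: int = 0) -> int:
    """Cycle at which the runahead-causing L2 miss is serviced, i.e. when the
    runahead thread exits. The delay terms are hypothetical, non-negative inputs."""
    if min(bank_conflict, port_contention, signal_delay) < 0:
        raise ValueError("delays cannot be negative")
    extra = bank_conflict + port_contention + signal_delay
    return miss_issue_cycle + MIN_L2_MISS_LATENCY + extra


print(runahead_exit_cycle(1000))                                      # 1300
print(runahead_exit_cycle(1000, bank_conflict=12, port_contention=5))  # 1317
```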

Figure 3.3: Runahead thread recovery

To return a runahead thread to normal mode, we need to recover the hardware context state from before the runahead thread started. This requires a pipeline flush: all runahead instructions of this hardware context are discarded and the resources allocated to them are deallocated. Next, the context state is restored from the thread's architectural registers and from the branch history register and RAS checkpoints. Finally, the thread returns to normal mode. From this point onwards, the thread fetches normal instructions, starting with the instruction that caused the runahead thread.
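The recovery sequence can be sketched as follows. This is a simplified, hypothetical model (`Context`, `Checkpoint`, `enter_runahead`, and `exit_runahead` are illustrative names): we assume the architectural registers, branch history register, and RAS are all copied when the context enters runahead, and we model the shared instruction window as a list of `(thread_id, instruction)` entries.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Checkpoint:
    arch_regs: dict   # architectural register values
    bhr: int          # branch history register
    ras: list         # return address stack contents
    restart_pc: int   # PC of the load that caused the runahead thread


@dataclass
class Context:
    tid: int
    arch_regs: dict = field(default_factory=dict)
    bhr: int = 0
    ras: list = field(default_factory=list)
    pc: int = 0
    in_runahead: bool = False
    checkpoint: Optional[Checkpoint] = None

    def enter_runahead(self, load_pc: int) -> None:
        # Assumption: all recoverable state is copied on entry.
        self.checkpoint = Checkpoint(deepcopy(self.arch_regs), self.bhr,
                                     list(self.ras), load_pc)
        self.in_runahead = True

    def exit_runahead(self, window: list) -> None:
        """Called when the runahead-causing L2 miss is resolved."""
        cp = self.checkpoint
        assert self.in_runahead and cp is not None
        # 1. Flush: drop every (runahead) instruction of this context from the
        #    shared window; entries of other threads are kept.
        window[:] = [entry for entry in window if entry[0] != self.tid]
        # 2. Restore architectural registers, branch history and RAS.
        self.arch_regs = cp.arch_regs
        self.bhr = cp.bhr
        self.ras = cp.ras
        # 3. Return to normal mode and refetch from the runahead-causing load.
        self.pc = cp.restart_pc
        self.in_runahead = False
        self.checkpoint = None


ctx = Context(tid=0, arch_regs={"r1": 5}, bhr=0b1010, ras=[0x400], pc=0x100)
ctx.enter_runahead(load_pc=0x100)
ctx.bhr, ctx.pc = 0b1111, 0x180   # speculative progress during runahead
ctx.ras.append(0x500)
window = [(0, "ra1"), (1, "t1"), (0, "ra2")]
ctx.exit_runahead(window)
# window == [(1, "t1")], ctx.pc == 0x100, ctx.ras == [0x400]
```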

3.3 Implementation of Runahead Threads

Translating the Runahead Threads operation into an actual implementation does not introduce any complexity into the design of a multithreaded processor. This section describes these implementation issues and addresses the design trade-offs that need to be considered when integrating the Runahead Threads mechanism into an SMT processor. The final implementation details of Runahead Threads may differ slightly among particular multithreaded architectures, but the basic mechanism is applicable to any of them, provided the relevant design details are taken into account. RaT implementation issues are common to all hardware contexts, since the mechanism treats all contexts equally. Next, we discuss the design trade-offs and the required modifications for each of the three phases of runahead thread operation.

In document Runahead threads (Page 48-51)