3.2 Fundamentals of Runahead Threads
3.3.2 Implementation issues for runahead thread execution
The main issues involved in the execution of runahead threads are the treatment of the different runahead instructions and the propagation and communication of INV results. Here, we describe the hardware required to support this new functionality, together with other important factors specific to the SMT context.
Register validation control
The control mechanism that communicates data between dependent instructions (the wake-up and select logic) is already present in the processor. Therefore, INV bits can be propagated through the datapath along with the data they are associated with. In an SMT environment, since the register file is shared, we only need one INV bit per physical register to track the propagation of register invalidations. This bit indicates whether or not a register holds a bogus (INV) value during runahead thread execution. INV bits are used to prevent bogus prefetches and the resolution of branches using bogus data. If a source register is invalid, its value is not available to the runahead thread, and the corresponding invalid instruction is not executed. Any INV instruction marks its destination register as INV after it is scheduled. Any valid operation that writes to a register resets the INV bit associated with its physical destination register. Consequently, a physical register that is invalid (INV bit set to 1) can be released early and reused, either by the same thread to continue exploiting MLP or by the rest of the threads.
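The propagation rules above can be illustrated with a minimal software sketch. It is only a toy model under stated assumptions, not the hardware design: the names `PhysRegFile`, `mark_invalid`, and `execute` are hypothetical, and scheduling, renaming, and register release are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PhysRegFile:
    """Toy shared physical register file with one INV bit per register."""
    size: int
    values: List[int] = field(default_factory=list)
    inv: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.values = [0] * self.size
        self.inv = [False] * self.size


def mark_invalid(rf: PhysRegFile, dest: int) -> None:
    """Mark a register INV, e.g. the destination of a long-latency load."""
    rf.inv[dest] = True


def execute(rf: PhysRegFile, dest: int, srcs: List[int],
            op: Callable[..., int]) -> bool:
    """Run one runahead instruction; return True if it produced a valid result."""
    if any(rf.inv[s] for s in srcs):
        # An INV source makes the instruction INV: it is not executed and
        # only propagates the INV bit to its destination register.
        rf.inv[dest] = True
        return False
    # A valid operation writes its result and resets the INV bit.
    rf.values[dest] = op(*(rf.values[s] for s in srcs))
    rf.inv[dest] = False
    return True
```

In this sketch, an add that reads a register marked by `mark_invalid` leaves its destination INV without computing anything, while a later valid add to the same destination clears the bit again.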
Load and store management
As with registers, runahead execution does not require significant hardware to handle memory data invalidations. Dependencies between runahead stores and loads can be handled through the store buffer. The forwarding of INV bits from the store buffer to a dependent load is accomplished similarly to the forwarding of data from the store buffer. For Runahead Threads, we need to add only one more bit (the INV bit) per entry in this structure and in the forwarding data path. When the address of a memory operation is INV, the operation is simply treated as a no-operation (NOP).
Mutlu et al. [51] introduced the runahead cache to provide extra communication of data and invalid status between runahead loads and stores for runahead execution in out-of-order processors. This structure holds the results and INV status of runahead stores that have already pseudo-retired. Based on this information, some loads dependent on stores can be identified as valid or invalid. Nevertheless, there are other cases in which this memory dependency cannot be identified. For example, a store that has an invalid effective address cannot save its status or data in the runahead cache.
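As an illustration of store-to-load forwarding with an INV bit, consider the following sketch. It is a simplified model built on assumptions: the class names `StoreEntry` and `StoreBuffer`, the exact-address match, and the `(forwarded, data, inv)` return convention are all hypothetical, and marking the destination of a load with an INV address as invalid is our assumption about how the NOP treatment is reflected in the register state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    addr: int
    data: int
    inv: bool  # the single extra INV bit added to each store buffer entry


class StoreBuffer:
    """Toy store buffer that forwards data together with its INV bit."""

    def __init__(self) -> None:
        self.entries: List[StoreEntry] = []  # oldest entry first

    def store(self, addr: int, data: int,
              addr_inv: bool = False, data_inv: bool = False) -> bool:
        """Insert a runahead store; return False if it was treated as a NOP."""
        if addr_inv:
            return False  # INV address: the store is dropped as a NOP
        self.entries.append(StoreEntry(addr, data, data_inv))
        return True

    def load(self, addr: int,
             addr_inv: bool = False) -> Tuple[bool, Optional[int], bool]:
        """Return (forwarded, data, inv) for a runahead load."""
        if addr_inv:
            return (False, None, True)  # NOP; destination assumed INV
        for entry in reversed(self.entries):  # youngest matching store wins
            if entry.addr == addr:
                if entry.inv:
                    return (True, None, True)  # INV bit forwarded, no data
                return (True, entry.data, False)
        return (False, None, False)  # no match: access the memory hierarchy
```

A load that matches a store with INV data receives the INV bit instead of a value, so it cannot trigger a bogus dependent prefetch in this model.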
From the SMT point of view, using a runahead cache is expensive in terms of extra hardware. In a multithreaded processor, the runahead cache needs to be larger to avoid aliasing and line contention among threads. Likewise, it is necessary to include a new identification tag to distinguish the owner thread of each runahead cache block. The runahead cache would therefore be a large structure in this multithreaded framework. Nevertheless, we measure the performance with and without the runahead cache to assess whether it needs to be included in the final Runahead Threads mechanism implementation (see Section 3.6.4). We will show that using the runahead cache does not have a significant impact on performance in our SMT model. Based on this result and the fact that a runahead cache implies the use of more area in the SMT core, we decide not to use it in our RaT implementation. As runahead threads are purely speculative, there are no strict consistency requirements that demand correct propagation of data between store instructions and dependent load instructions. Hence, the functional difference is that some loads that depend on previously retired stores, and that were not identified as invalid, will use bogus values for speculative memory accesses. This only affects runahead execution; it does not affect correct program execution.
Floating-point resources
The performance improvement of Runahead Threads mainly comes from the pre-execution of memory operations. Generally, the computation of the address for memory operations involves a base register plus an offset. This is an integer arithmetic operation, so floating-point (FP) operations are not needed to compute the effective addresses. Based on this observation, we can decrease the resource demand of runahead threads by avoiding the execution of FP instructions in the RaT mechanism. This modification was considered for runahead execution in out-of-order processors [50]. We also apply it in our implementation for an additional benefit in the SMT scenario. If a runahead thread does not execute FP instructions, it does not need the floating-point resources of the SMT processor. So, once an instruction is detected to be an FP operation in the decode stage, it is invalidated and proceeds directly to pseudo-commit. With this implementation, FP instructions in a runahead thread do not use any processor resources after they are decoded. Therefore, the FP issue queue, the FP functional units, and the FP physical register file are not used by most FP runahead instructions. The exceptions are FP loads and stores, which are treated as prefetch instructions because their effective addresses are obtained in the integer datapath.
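The decode-stage decision described above can be sketched as a small routing function. This is only an illustrative model: the enumerations `Op` and `Route`, the `fp_data` flag, and the function `route_runahead` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Op(Enum):
    INT_ALU = auto()
    FP_ALU = auto()
    LOAD = auto()
    STORE = auto()


class Route(Enum):
    INT_PIPELINE = auto()       # executed normally in the integer datapath
    PREFETCH = auto()           # address computed, access acts as a prefetch
    PSEUDO_COMMIT_INV = auto()  # invalidated at decode, uses no FP resources


@dataclass(frozen=True)
class Instr:
    op: Op
    fp_data: bool = False  # True for loads/stores that move FP data


def route_runahead(instr: Instr) -> Route:
    """Decide at decode how a runahead thread handles an instruction."""
    if instr.op is Op.FP_ALU:
        # FP computation: invalidate and send straight to pseudo-commit.
        return Route.PSEUDO_COMMIT_INV
    if instr.op in (Op.LOAD, Op.STORE) and instr.fp_data:
        # FP load/store: the address comes from the integer datapath,
        # so the access is kept and treated as a prefetch.
        return Route.PREFETCH
    return Route.INT_PIPELINE
```

Under these assumptions, only FP arithmetic is filtered out, which matches the observation that effective addresses are computed with integer operations.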
Synchronization
Finally, an important issue in the context of multithreaded processors is that both independent and parallel programs may be in execution. Independent multiprogramming workloads do not need any synchronization, since their threads execute different programs or belong to different users. In contrast, parallel programs normally use a scheme that allows threads to synchronize with each other when using shared memory. Usually, this is done with lock operations (basically through block, acquire, and release special instructions), which force data in memory to be used atomically. The basic mechanism relies on serializing the operations to ensure that critical sections are not executed concurrently by different threads of a parallel program.
Because they execute speculatively, runahead threads cannot make any changes to the lock variable or to the critical data protected by the lock variable, thereby avoiding inconsistency among parallel threads. When a parallel thread switches to runahead mode, these lock instructions are ignored, since the runahead thread does not need to obey the atomicity semantics. The instructions inside the critical section are executed speculatively, because they do not modify program state or any critical shared data. This could enable faster progress in runahead execution without incurring the latency associated with lock operations.
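The treatment of lock operations in runahead mode can be sketched as follows. This is a deliberately simplified software model, not the hardware mechanism: the `Thread` class, the string opcodes, and the dictionary used as shared memory are hypothetical, and a normal thread that finds the lock taken simply reports that it is blocked instead of spinning.

```python
from typing import Dict, Optional, Union

LOCK_OPS = {"acquire", "release"}


class Thread:
    """Toy thread model that distinguishes normal and runahead execution."""

    def __init__(self, memory: Dict[str, int], runahead: bool = False) -> None:
        self.memory = memory      # shared memory, including lock variables
        self.runahead = runahead

    def execute(self, op: str, addr: str,
                value: Optional[int] = None) -> Union[str, int, None]:
        if op in LOCK_OPS:
            if self.runahead:
                return "ignored"  # no atomicity semantics in runahead mode
            if op == "acquire":
                if self.memory.get(addr, 0) == 1:
                    return "blocked"  # lock held by another thread
                self.memory[addr] = 1
            else:
                self.memory[addr] = 0
            return "done"
        if op == "store":
            if self.runahead:
                return "speculative"  # shared state is never modified
            self.memory[addr] = value
            return "done"
        if op == "load":
            return self.memory.get(addr)  # loads proceed, acting as prefetches
        raise ValueError(f"unknown operation: {op}")
```

In this model, a runahead thread passes through a critical section guarded by a held lock without waiting and without changing the lock variable or the protected data, whereas a normal thread is blocked at the acquire.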