
4.6 Evaluation of code semantic techniques

4.6.3 Extra work

Having shown that the proposed code semantic control techniques obtain performance similar to the original uncontrolled RaT, the other way to enhance RaT efficiency is to reduce the number of instructions executed under runahead threads. By doing so, we also reduce power consumption, as we show in this section.

Speculative instructions

Figure 4.21 shows the number of speculative instructions executed by runahead threads for each evaluated mechanism (Figure 4.21(a)) and the percentage of these instructions when applying our control techniques relative to RaT (Figure 4.21(b)). As these graphs show, the results differ considerably among the kinds of proposals. While the stall action techniques (LS, SS) and PSU reduce the number of speculative instructions for all workloads (up to 48% in the case of ILP workloads for PSU), the other techniques (LR and SK) increase it (by 15% and 20% on average, respectively). LS and SS activate more runahead threads (8% more) because the stall action prevents some runahead thread executions from capturing further long-latency loads. However, stopping the runahead thread significantly reduces the number of extra speculative instructions executed, especially for MIX and MEM workloads, as Figure 4.21(a) demonstrates.

On the other hand, the LR and SK techniques execute a higher number of instructions, since their control actions let runahead threads run further ahead each time a loop is reversed or a subroutine is skipped. For MEM workloads, these techniques execute between 25% and 30% more speculative instructions. They are therefore able to maintain RaT performance, but they have a negative impact on the extra work executed. Consequently, LR and SK are not a good choice for improving RaT efficiency; in light of these results, the other techniques, based on the stall action, are more suitable. Among the best performing techniques, we highlight the LS+PSU combination, which achieves a 33% average reduction in the number of speculative instructions executed compared to RaT. This mechanism exploits the benefits of both techniques (LS and PSU).

Power consumption

Finally, we quantitatively evaluate the power reduction obtained by the different techniques. To estimate power, we use the Wattch-based power model integrated in our simulator. Figure 4.22 shows the power reduction relative to RaT for the evaluated mechanisms. These power results follow a trend similar to the instruction reduction results in Figure 4.21. Note that power consumption is reduced more for MIX and MEM workloads than for ILP ones, even though ILP workloads showed the higher percentage reduction in speculative instructions. The instruction percentage is only a ratio, whereas the average power (which is correlated with energy) is strongly affected by the number of executed instructions. Thus, the larger absolute number of instructions removed (not the percentage) for MIX and MEM (see Figure 4.21(a)) produces higher energy savings.

Figure 4.21: Analysis of speculative executed instructions as a function of the code semantic-aware techniques. (a) Number of speculative executed instructions. (b) Speculative executed instructions ratio.

Figure 4.22: Average extra power reduction
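The difference between a percentage reduction and an absolute reduction in speculative instructions can be illustrated with a small arithmetic sketch. The instruction counts and the MEM reduction fraction below are hypothetical values chosen only to show the effect; they are not measurements taken from Figure 4.21. The function name `removed_instructions` is likewise an illustrative assumption.

```python
# Hypothetical baselines (millions of speculative instructions under RaT)
# and hypothetical reduction fractions. These are NOT data from the figures.
workloads = {
    # name: (speculative instructions under RaT, fraction removed)
    "ILP": (100.0, 0.48),
    "MEM": (1000.0, 0.30),
}


def removed_instructions(baseline, fraction_removed):
    """Absolute number of speculative instructions eliminated."""
    return baseline * fraction_removed


for name, (baseline, fraction) in workloads.items():
    removed = removed_instructions(baseline, fraction)
    print(f"{name}: {fraction:.0%} of {baseline:.0f}M -> {removed:.0f}M removed")
```

Under these assumed numbers, the workload with the smaller percentage reduction still removes far more instructions in absolute terms, which is the quantity that drives the power savings.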

Regarding individual techniques, whereas LR and SK incur a slight average power increase (2% and 4.5% respectively), the LS+PSU technique achieves the best result, with an average 6% power reduction compared to RaT. The LS and SS techniques each reduce power consumption by close to 5%.

4.7 Summary

The Runahead Threads mechanism generates an excess of speculative work that produces an energy overhead. If this overhead does not improve processor performance in return, the performance-power efficiency of RaT is diminished. To avoid this shortcoming, we have presented code semantic-aware techniques that make the RaT mechanism more efficient. These mechanisms rely on runtime analysis to detect code patterns that identify loops or subroutines. This coarse-grain analysis also tracks the usefulness of loops or subroutines according to the prefetch opportunities they offer during runahead thread execution. Using this information, different control techniques decide whether to stall or skip the loop or subroutine execution, reducing the number of useless runahead instructions.

These techniques achieve performance comparable to the original RaT mechanism, even improving on it (by 1%) in some cases. The evaluation in this research shows that the techniques that stall the execution of useless loops or subroutines give better results from an efficiency point of view. The best combination of these techniques yields a performance-efficient mechanism that maintains the performance improvement of RaT while reducing the extra speculative work required (33% fewer speculative instructions on average) and the power consumption (6%).

Chapter 5

Efficiency-aware Runahead Threads

As we explained in the previous chapter, the benefits of RaT come at the cost of executing a large number of instructions speculatively in runahead mode, which leads to extra power consumption. The proposed code semantic-aware techniques remove part of that useless extra work by performing a coarse-grain analysis of the code patterns executed by the runahead threads. These techniques thus improve runahead thread efficiency by tracking the usefulness of loops and subroutines during runahead mode and avoiding the useless speculative work of RaT. Their effectiveness and potential energy reduction are therefore highly dependent on the presence of loops and subroutines during runahead thread execution.

In this sense, code semantic-aware runahead threads are effective but specific to loops and subroutines. The ability of these techniques to make runahead threads more energy efficient is determined by the number and characteristics of those loops and subroutines, which in turn depend on the kind of software and on the way the programmer develops an application. In addition, some compiler techniques, such as loop unrolling or subroutine inlining, can cause loops or subroutines present in the original high-level source code not to appear as such in the optimized machine code generated by the compiler. Hence, aggressive compiler optimizations may reduce the number of loops or subroutines in the final binary code. The effectiveness of code semantic-aware techniques therefore depends on the features of the program, not on the features of the runahead threads themselves.

In this chapter, we devise a more generic scheme to enhance the efficiency of RaT. This scheme predicts the efficiency of runahead threads based on the execution of each particular runahead thread and does not depend on what type of code is being executed. Thus, useless runahead executions are detected independently of the executed code patterns. The key idea behind this scheme is to perform a fine-grain analysis of each runahead thread to collect information aimed at optimizing the speculative work done. Based on this information, it is possible to predict how far a thread should run ahead speculatively so that its speculative execution is efficient.

5.1 Runahead distance prediction

As we have already pointed out in previous chapters, the usefulness of a runahead thread is given by the total amount of prefetching that it is able to exploit during its speculative execution. With the goal of making the RaT mechanism more energy-efficient, we propose new approaches that analyze the number of long-latency loads per runahead thread to strike a balance between useful prefetching and useless instructions. In other words, we want to dynamically find the useful lifetime of a runahead thread, so as to expose as much MLP as possible with the minimum number of speculative instructions.
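The notion of a useful lifetime can be made concrete with a minimal sketch. It assumes, purely for illustration, that a runahead episode is recorded as a list of flags, one per speculative instruction in program order, where a flag is true if that instruction is a long-latency load. The names `useful_distance` and `useless_instructions` are hypothetical, and this is not the prediction scheme proposed in the chapter; it only shows the quantity such a scheme would try to estimate.

```python
def useful_distance(long_latency_flags):
    """Instructions executed up to and including the last long-latency load.

    Returns 0 when the episode issued no long-latency load at all.
    """
    last_useful = 0
    for position, is_long_latency_load in enumerate(long_latency_flags, start=1):
        if is_long_latency_load:
            last_useful = position
    return last_useful


def useless_instructions(long_latency_flags):
    """Speculative instructions executed after the last long-latency load."""
    return len(long_latency_flags) - useful_distance(long_latency_flags)


# Hypothetical episode: 8 speculative instructions, loads at positions 2 and 5.
episode = [False, True, False, False, True, False, False, False]
print(useful_distance(episode), useless_instructions(episode))
```

In this hypothetical episode, everything executed after the second long-latency load contributes no further prefetching, so a thread that stopped at that point would expose the same MLP with fewer speculative instructions.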
