4.4 Performance evaluation
4.4.2 Tagged prefetcher analysis
In this section, we present the average results for the Tagged prefetcher, followed by the per-benchmark results.
Average analysis
We evaluate the confidence of our technique by measuring the average accuracy of the requests classified in each group. Figure 4.8a shows the result of this prediction for the Tagged prefetcher. In this figure, we can see that the techniques derived from the baseline (PH and BP) present very similar accuracy for each of the three groups. We can also see that the predictions of these techniques are fairly reasonable: the requests have an average accuracy of 55% in the high confidence group, 45% in the medium confidence group and, finally, 10% in the low confidence group. We can also see that, in this case, BP has about 2% lower accuracy in the high confidence predictions, a difference that is found instead in the medium confidence group. This is because the predictor works slightly worse than the rest in this case, so some requests with high accuracy have been wrongly predicted as medium. However, the results for the code region predictor stand out from the others. Most of the requests wrongly predicted by the other predictors (mainly high confidence requests predicted as medium confidence) are predicted accurately. The accuracy of the requests predicted with high confidence is more than 80%, medium confidence requests have an average accuracy of about 30%, and low confidence requests have an accuracy of about 5%. Although the results of the stream-based predictor are better than the baseline, for this prefetcher this technique does not work better than the code region predictor. Nevertheless, the combined predictor manages to take advantage of the two techniques that form it, as it is the approach with the best results for this prefetcher.
However, the accuracy does not take into account the total number of requests. This means that if a technique classifies only one request as high confidence and this request turns out to be useful, the predictor would be 100% accurate in its high confidence prediction, yet it would be far from working properly. Hence, we complement the accuracy evaluation with a study of the useful and non-useful requests for each of the confidence levels.
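The difference between the two views can be illustrated with a minimal sketch. It assumes each prefetch request is recorded as a pair (confidence level, whether it was useful); the function name `summarize`, the record layout, and the trace below are hypothetical and are not taken from our evaluation.

```python
from collections import Counter

LEVELS = ("high", "medium", "low")

def summarize(requests):
    """requests: iterable of (confidence_level, was_useful) pairs (assumed layout)."""
    total = Counter()   # requests classified in each level
    useful = Counter()  # useful requests classified in each level
    for level, was_useful in requests:
        total[level] += 1
        if was_useful:
            useful[level] += 1
    all_useful = sum(useful.values())
    # Accuracy: fraction of the requests in a level that were useful.
    accuracy = {lvl: (useful[lvl] / total[lvl] if total[lvl] else None)
                for lvl in LEVELS}
    # Distribution: fraction of all useful requests that landed in a level.
    useful_share = {lvl: (useful[lvl] / all_useful if all_useful else None)
                    for lvl in LEVELS}
    return accuracy, useful_share

# Made-up trace: a single useful request predicted as high confidence,
# while nine other useful requests end up in the low confidence group.
trace = [("high", True)] + [("low", True)] * 9 + [("low", False)] * 10
accuracy, useful_share = summarize(trace)
print(accuracy["high"], useful_share["high"])
```

In this made-up trace the high confidence group is perfectly accurate, but it captures only a tenth of the useful requests, which is the situation that the accuracy metric alone cannot reveal.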
Fig. 4.8 Results from evaluating the confidence predictors with the Tagged prefetcher as the stream prefetcher: (a) accuracy prediction; (b) useful requests prediction; (c) non-useful requests prediction.
Figure 4.8b shows the distribution of useful requests among the confidence predictions. We can see that all the proposed techniques increase the number of useful requests classified with high confidence. However, most of them still classify many useful requests as medium or low confidence. As in the previous case, the code region predictor is the one that works best, followed by the stream-based predictor. Again, the combined predictor benefits from both of the predictors that comprise it. Nevertheless, although it seems to be the best proposal for this configuration, it classifies as many useful requests as low confidence as it does as high confidence. This happens because of the large number of non-useful requests that the prefetcher issues. Note that this version of the Tagged prefetcher issues two requests per trigger and triggers once per miss or useful prefetch. This means that it generates approximately twice as many requests as misses. Thus, in the best-case scenario, if all the misses were prefetched, the prefetcher would generate 50% useful requests and 50% non-useful requests.
Figure 4.8c shows the distribution of these non-useful requests. As we can see, all the techniques manage to classify over 70% of these non-useful requests as low confidence. The percentage of useful requests is so small compared to that of non-useful requests that it is very difficult to classify them accurately. Nevertheless, the results show that the code region predictor still classifies most of the non-useful requests as medium or low priority, followed by the stream-based predictor. The combined version again makes the most of the two techniques and classifies the majority of requests as medium or low priority.
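The best-case bound of 50% useful requests mentioned for the Tagged prefetcher can be reproduced with a short calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes that every access that would miss without prefetching produces exactly one trigger (either as a miss or as a hit on a prefetched line) and that each useful prefetch covers exactly one such access.

```python
def best_case_useful_fraction(would_be_misses, requests_per_trigger=2):
    """Upper bound on the fraction of issued prefetch requests that can be useful.

    Assumptions (not measured behaviour): one trigger per would-be miss,
    and at most one useful prefetch per would-be miss.
    """
    if would_be_misses <= 0 or requests_per_trigger <= 0:
        raise ValueError("both arguments must be positive")
    triggers = would_be_misses
    issued = triggers * requests_per_trigger
    # At best, every would-be miss is covered by one prefetch.
    useful = min(would_be_misses, issued)
    return useful / issued

# Two requests per trigger, as in the Tagged prefetcher version discussed here.
print(best_case_useful_fraction(1000, requests_per_trigger=2))
```

Under these assumptions the bound depends only on the number of requests issued per trigger, so even a perfectly timed prefetcher of this kind would leave half of its requests non-useful.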
Per benchmark analysis
In order to provide a more detailed analysis, we have evaluated each benchmark with each confidence predictor. We have analyzed qualitatively which benchmarks' prefetching requests each heuristic can classify with better accuracy. Recall that one of the reasons for using three confidence levels is that, from a qualitative point of view, it allows the high category to be associated with being almost sure that those prefetches are going to be useful, the low category with being almost sure that they are going to be non-useful, and the medium category with not knowing or not being sure. Therefore, we consider that a confidence predictor has done a good job classifying the requests from a given benchmark if there are more useful prefetches classified as high confidence than as medium confidence, and more useful prefetches classified as medium confidence than as low confidence.
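The qualitative criterion above can be written as a simple check. This is only a sketch: the function name, the dictionary layout, and the counts are hypothetical placeholders, not values read from our figures.

```python
def well_classified(useful_high, useful_medium, useful_low):
    """True if useful prefetches satisfy high > medium > low (the criterion above)."""
    return useful_high > useful_medium > useful_low

# Made-up counts of useful prefetches per confidence level for two
# placeholder benchmarks (names and numbers are purely illustrative).
useful_counts = {
    "bench_a": {"high": 60, "medium": 30, "low": 10},
    "bench_b": {"high": 20, "medium": 25, "low": 55},
}

verdicts = {
    name: well_classified(c["high"], c["medium"], c["low"])
    for name, c in useful_counts.items()
}
print(verdicts)
```

Here `bench_a` would count as well classified, whereas `bench_b` would not, because most of its useful prefetches fall in the low confidence group.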
Let us start by commenting on the behaviour of the baseline, the Last Phase predictor, whose per-benchmark results are shown in Figure 4.9a. We can see that this heuristic is only able to classify the facesim benchmark with high accuracy. Its predictions are especially poor for the blackscholes, dedup, and vips benchmarks, for which it predicts many useful prefetches with low confidence. In Figure 4.9b, we see the predictions of the Phase History technique, which slightly improves the prediction of the facesim benchmark, making the resulting classification almost optimal. Nevertheless, it still mispredicts the same benchmarks as the baseline does. The next heuristic is the Balanced Phase History, whose results are shown in Figure 4.9c. As we can see, the resulting chart is very similar to that of the Phase History predictor without the balancing, which means that, for this prefetcher, the optimization applied in this heuristic is irrelevant. If we look at the results of the Code Region predictor in Figure 4.9d, we can see important improvements over the classification of the previous heuristics. The Code Region predictor is able to increase the number of useful prefetches classified as high priority in almost all the benchmarks. Moreover, this heuristic succeeds in classifying the requests from benchmarks that were wrongly classified by the previous predictors, such as blackscholes, dedup, and vips. As can be seen in Figure 4.9e, the Stream Position predictor also improves on the predictions of the baseline predictors. However, its classification of the requests is not as accurate as that of the Code Region predictor. Finally, as we can see in Figure 4.9f, the Combined predictor takes advantage of each of the predictors that compose it, producing the most accurate classification for each of the benchmarks.
Fig. 4.9 Percentage of useful and non-useful requests generated by the Tagged prefetcher classified with high, medium, or low priority by the different predictors: (a) Last Phase predictor; (b) Phase History predictor; (c) Balanced Phase History predictor; (d) Code Region predictor; (e) Stream Position predictor; (f) Combined predictor.