5.6.5 Combined filtering and prioritization experiments

In this subsection, we evaluate our confidence predictor when the filtering and prioritization techniques are combined.

Figure 5.7 was obtained by running a first simulation without the filtering and prioritization techniques and comparing its result with the same simulation with both techniques enabled. The figure shows the same trends as Figure 5.5; in fact, the additional speedup is almost negligible. The reason for this small increment is that, although both techniques can be applied at the same time, their benefits do not add up, because both techniques try to solve similar problems. As filtering reduces the congestion in the interconnect, the impact of prioritization becomes smaller. Nevertheless, the speedup is higher in all the benchmarks, and on average the combined technique is 1% faster than the filtering technique alone.
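
The interplay between the two techniques can be illustrated with a minimal sketch. The confidence scale, the two thresholds, and all names below are assumptions made for illustration only; they are not the values or structures of the evaluated hardware. The sketch only shows why filtering leaves less work for prioritization: every request that is dropped never competes for the interconnect.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop"                        # filtered: never enters the interconnect
    LOW_PRIORITY = "low_priority"        # sent, but yields to other traffic
    NORMAL_PRIORITY = "normal_priority"  # sent like a regular request


# Hypothetical thresholds on a confidence value in [0.0, 1.0].
FILTER_THRESHOLD = 0.2
PRIORITY_THRESHOLD = 0.6


def classify_prefetch(confidence: float) -> Action:
    """Decide how to treat a prefetch request from its predicted confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    if confidence < FILTER_THRESHOLD:
        return Action.DROP
    if confidence < PRIORITY_THRESHOLD:
        return Action.LOW_PRIORITY
    return Action.NORMAL_PRIORITY


def injected_requests(confidences):
    """Return the actions of the requests that survive filtering."""
    actions = [classify_prefetch(c) for c in confidences]
    return [a for a in actions if a is not Action.DROP]
```

Under these assumed thresholds, a stream with many low-confidence requests injects few requests after filtering, so there is little congestion left for prioritization to relieve.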

Fig. 5.7 RPT request reduction vs. IPC speedup.

To evaluate the effect of the hardware limitations, we compare the accuracy of our combined technique with the predicted accuracy from Figure 5.3. Recall that the experiment of Figure 5.3 had no hardware limitations. Figure 5.8 compares the accuracy of our technique under the hardware restrictions (essentially the number of entries of the several tables) with the idealized case without hardware restrictions. Our technique achieves 65% accuracy on average. In Figure 5.3, we saw that the potential of this technique was about 73%, which means that we lose less than 7% of accuracy due to the hardware limitations.

Fig. 5.8 RPT accuracy increment.

5.7 Conclusions

In this chapter, we have shown the potential of our confidence predictor heuristic by applying it to two techniques that manage prefetching according to confidence: filtering and prioritization. We have proposed a feasible hardware implementation for these techniques and evaluated their performance by comparing them with state-of-the-art techniques.

In the results, we have seen that our technique is more accurate than the existing prioritization technique: it makes better confidence predictions and obtains more speedup when prioritizing. We have also seen that the current filtering technique is very aggressive and filters many requests. Our technique reduces the number of filtered requests without generating slowdown. The combined technique improves on both the prioritization and the filtering techniques, which means that both techniques can work together successfully. Moreover, we have seen that the hardware limitations of our technique have a negligible effect on the performance obtained.

Chapter 6

Prefetching Challenges in Distributed Shared Memories for CMPs

It’s lack of faith that makes people afraid of meeting challenges, and I believed in myself.

Muhammad Ali

6.1 Introduction

Processor design has evolved toward architectures that implement multiple processing cores on a single die, commonly known as chip multiprocessors (CMPs). However, in those chips, the network on chip that connects all the cores to a single unified memory becomes a new bottleneck. As the number of cores increases, the contention at the port of the unified memory and the time required to send a request from a core to the memory port make a unified memory hierarchy infeasible. Thus, as shown in Figure 6.1, future multicore CMPs with tens (or even hundreds) of processor cores will probably be designed as arrays of replicated tiles (with shared and distributed memories) connected over an on-chip switched direct network [78]. These tiled architectures are reported to provide a scalable solution for managing design complexity and making effective use of resources in advanced VLSI technologies.

As stated before, prefetching is a latency-hiding strategy, as is multithreading. For this reason, the potential of prefetching is lower when both techniques are combined. Nevertheless, there is still a lot of work to do in multithreading: nowadays, performance does not scale proportionally to the number of threads. In this context, Surendra Byna et al. [10] show that smart prefetching mechanisms can still improve performance in multithreaded processors.

As shown in Figure 6.1, the most common memory hierarchy organization of these architectures consists of one or two private cache levels per tile and, as the last level of cache, a Distributed and Shared Memory (DSM). This DSM has a banked organization with one bank in each tile. Moreover, each cache-block-sized unit of memory is statically mapped to one of the banks based on its address, in an interleaved way.
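
The static interleaved mapping can be sketched as follows. This is a minimal illustration, not the exact scheme of any particular design: the block size, the number of tiles, the choice of the low-order block-number bits for interleaving, and the function name are all assumptions.

```python
BLOCK_SIZE_BYTES = 64  # assumed cache block size
NUM_TILES = 16         # assumed number of tiles, one DSM bank per tile


def home_bank(address: int,
              block_size: int = BLOCK_SIZE_BYTES,
              num_banks: int = NUM_TILES) -> int:
    """Return the DSM bank (tile) that statically owns the block at `address`."""
    if address < 0:
        raise ValueError("address must be non-negative")
    block_number = address // block_size  # drop the offset within the block
    return block_number % num_banks       # interleave consecutive blocks over banks


if __name__ == "__main__":
    # Consecutive blocks of a sequential stream land on consecutive banks.
    for addr in range(0, 4 * BLOCK_SIZE_BYTES, BLOCK_SIZE_BYTES):
        print(hex(addr), "->", home_bank(addr))
```

With such a mapping, consecutive blocks of a sequential access stream are spread over all banks, so each bank observes only every `num_banks`-th block of that stream.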

In these systems, if prefetching is placed at a DSM cache level, the prefetcher also has to be distributed, so that each tile of the CMP contains a prefetch engine. In a straightforward implementation derived from unified memory systems, these prefetch engines would work independently from each other by analyzing the set of memory accesses

Fig. 6.1 Top: CMP with unified memory hierarchy. Bottom: Tiled CMP architecture with distributed and shared memory system.