Dynamic management - Improving prefetching mechanisms for tiled CMP platforms

Because the effectiveness of prefetching depends heavily on the workload and on run-time effects, several techniques have been proposed to adapt the behavior of the prefetching mechanism dynamically, based on run-time profiling of the parameters presented previously. Indeed, given the extra interaction complexity inherent to CMP systems, research in this environment has focused more on improving the overall performance of existing prefetching mechanisms than on proposing new prefetching engines that detect new kinds of patterns. The most relevant of these techniques are summarized in the following sections.

2.4.1 Filtering

Filtering suppresses specific prefetch requests according to some heuristic. For instance, the Pollution Filter approach [91], [90] is a set of filtering techniques implemented as a standalone module that examines the addresses generated by the prefetcher. The prefetch engine generates a prefetch request and routes it to the prefetch pollution filter, which checks whether the request should proceed. If the filter rejects the request, the prefetch operation is terminated and no prefetch is issued to the L1 cache; otherwise, the request is placed in the prefetch queue. The heuristic used to decide whether a prefetch may proceed can vary with the filtering technique. The heuristic presented in the literature, however, is based on a branch-predictor technique: it tries to predict the confidence of the prefetch requests triggered by a given static instruction, using a table indexed by the static instruction address. Each entry of this table contains a two-bit counter that is incremented when a prefetch request triggered by that static instruction is useful and decremented when it is useless.
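The table of two-bit counters described above can be sketched as follows. This is a minimal illustration, not the design of [91], [90]: the class name, the table size, the modulo indexing, the initial counter value, and the rule "allow the prefetch when the counter is at least 2" are all assumptions made for the example.

```python
class PollutionFilter:
    """Table of 2-bit saturating counters indexed by the address of the
    static instruction that triggered the prefetch (hypothetical sketch)."""

    def __init__(self, entries=1024, threshold=2, initial=2):
        self.entries = entries        # assumed table size
        self.threshold = threshold    # assumed decision point
        self.table = [initial] * entries

    def _index(self, pc):
        # Assumed indexing scheme: low-order bits of the instruction address.
        return pc % self.entries

    def allow(self, pc):
        """Return True if a prefetch triggered by `pc` may proceed."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, useful):
        """Train the counter once the outcome of a prefetch is known."""
        i = self._index(pc)
        if useful:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0


f = PollutionFilter()
pc = 0x400123
f.update(pc, useful=False)  # counter goes from 2 to 1
print(f.allow(pc))          # False
```

As with a two-bit branch predictor, a single useful or useless prefetch does not flip a strongly biased entry, because a saturated counter needs two opposite outcomes before the decision changes.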

2.4.2 Throttling

Throttling modifies the aggressiveness of the prefetching engine according to some heuristic. The aggressiveness of a prefetcher refers to the degree of speculation and the number of prefetch requests generated each time the prefetching engine is triggered. Unlike filtering, throttling does not discard prefetch requests after they have been generated; instead, it increases or decreases the aggressiveness of the prefetching mechanism. For instance, Feedback Directed Prefetching (FDP) [74] estimates characteristics such as prefetcher accuracy, prefetcher timeliness, and prefetcher-caused cache pollution, and uses these estimates to adjust the aggressiveness of the data prefetcher dynamically. This adjustment can increase the performance improvement provided by prefetching while reducing its negative impact on performance and bandwidth. Hierarchical Prefetcher Aggressiveness Control (HPAC) [19] is a global throttling approach that proposes a low-cost mechanism to make prefetching between the shared L2 and off-chip memory more effective, minimizing interference among cores. The idea is to tune the aggressiveness of the per-core prefetchers dynamically by means of a global control system that accepts or overrides the decisions made by the local control systems. Moreover, filtering and throttling are orthogonal, so they can be applied at the same time.
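A feedback-driven throttling loop of this kind might look like the sketch below. It is a simplified illustration under stated assumptions, not the decision table of FDP [74] or the rules of HPAC [19]: the aggressiveness levels, the thresholds, and both function names are hypothetical.

```python
# Illustrative aggressiveness levels as (prefetch distance, prefetch degree).
# The values are assumptions for this sketch.
LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4), (64, 4)]


def next_level(level, accuracy, lateness, pollution,
               acc_hi=0.75, acc_lo=0.40, late_thr=0.10, poll_thr=0.05):
    """Local decision: move one step up or down, or stay.

    accuracy  = useful prefetches / issued prefetches
    lateness  = late prefetches / useful prefetches
    pollution = demand misses caused by prefetching / demand misses
    All thresholds are assumed values.
    """
    if pollution > poll_thr or accuracy < acc_lo:
        step = -1   # prefetcher is harmful or inaccurate: throttle down
    elif accuracy >= acc_hi and lateness > late_thr:
        step = +1   # accurate but late: run further ahead
    else:
        step = 0
    return max(0, min(len(LEVELS) - 1, level + step))


def global_override(current, proposed, core_interferes):
    """Global decision in the spirit of a hierarchical controller:
    accept the local proposal unless the core is flagged as interfering
    with others, in which case force at least one step down."""
    if not core_interferes:
        return proposed
    return min(proposed, max(0, current - 1))


level = 2
proposed = next_level(level, accuracy=0.9, lateness=0.2, pollution=0.0)
level = global_override(level, proposed, core_interferes=False)
print(LEVELS[level])  # (32, 4)
```

Because the request stream is reshaped at its source rather than pruned afterwards, a filter such as the one in Section 2.4.1 can still be placed behind the throttled prefetcher.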

2.4.3 Prioritization

Traditionally, memory systems do not distinguish between prefetch requests and demand requests. More recently, several approaches have assigned different priorities to the two types of request, since it has been shown that delaying demand requests may degrade performance, particularly when prefetch requests are inaccurate [42]. For instance, approaches such as issuing prefetches only when the memory channels are idle [49], or the insertion policy of the Feedback Directed Prefetching (FDP) mechanism [74], always prioritize demand requests over prefetch requests. Nevertheless, this rigid prioritization does not always provide the best performance, because a useful prefetch request that is delayed may become completely useless if it arrives too late. In that case, the prefetch adds traffic congestion and energy consumption without any benefit. The Prefetch-Aware DRAM Controller (PADC) [42] is an adaptive DRAM controller that adjusts the priorities of prefetch requests to minimize the negative impact of useless prefetches and maximize the benefits of useful ones.
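The contrast between rigid and adaptive prioritization can be illustrated with a small scheduling sketch. It does not attempt to reproduce the scheduling rules of PADC [42]; the `Request` type, the per-core accuracy dictionary, and the promotion threshold are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    core: int          # core that issued the request
    is_prefetch: bool  # False for a demand request
    arrival: int       # arrival time, used for oldest-first ordering


def pick_next(queue, accuracy, threshold=0.85):
    """Return the next request to service, or None if the queue is empty.

    Demand requests are always critical. A prefetch is promoted to the
    same class only when its core's measured prefetch accuracy reaches
    `threshold` (an assumed value). Within a class, the oldest wins.
    """
    def key(r):
        critical = (not r.is_prefetch) or accuracy.get(r.core, 0.0) >= threshold
        return (0 if critical else 1, r.arrival)

    return min(queue, key=key) if queue else None


queue = [Request(core=0, is_prefetch=True, arrival=1),
         Request(core=1, is_prefetch=False, arrival=2)]

# Accurate prefetcher on core 0: the older prefetch is served first.
print(pick_next(queue, {0: 0.90}).is_prefetch)  # True
# Inaccurate prefetcher on core 0: the demand request goes first.
print(pick_next(queue, {0: 0.20}).is_prefetch)  # False
```

Setting `threshold` above 1.0 reduces the sketch to the rigid demand-first policy, which makes the two policies easy to compare in a simulator.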

Another approach differentiates prefetch requests from the rest by means of a heterogeneous interconnect [21]. The idea is to use low-power wires for prefetching messages, while all other messages travel on regular baseline wires. Finally, prioritizing demand packets over prefetch packets inside the NoC routers [13], [45] has also been considered. Here, a regular arbitration (round-robin or age-based) is applied among all the demand packets, and only when there are no demand candidates is a round-robin arbitration applied among the prefetch packets. The main difference between these two works is that [45] also uses accuracy information to determine the priority of the prefetch requests.
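The two-class router arbitration described above can be sketched as follows, assuming round-robin in both classes and one candidate packet per input port. The class and method names are hypothetical, and the accuracy-based refinement of [45] is not modeled.

```python
class TwoClassArbiter:
    """Round-robin among demand packets; prefetch packets are considered
    only when no demand packet is waiting (hypothetical sketch)."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_demand = 0    # round-robin pointer for demand packets
        self.next_prefetch = 0  # round-robin pointer for prefetch packets

    def _round_robin(self, candidates, start):
        # Scan the ports in circular order, starting at the pointer.
        for offset in range(self.num_ports):
            port = (start + offset) % self.num_ports
            if port in candidates:
                return port
        return None

    def grant(self, demand_ports, prefetch_ports):
        """Return (port, kind) for the winning input port, or None."""
        port = self._round_robin(demand_ports, self.next_demand)
        if port is not None:
            self.next_demand = (port + 1) % self.num_ports
            return port, "demand"
        port = self._round_robin(prefetch_ports, self.next_prefetch)
        if port is not None:
            self.next_prefetch = (port + 1) % self.num_ports
            return port, "prefetch"
        return None


arb = TwoClassArbiter(num_ports=4)
print(arb.grant({1, 3}, {0}))   # (1, 'demand')
print(arb.grant({3}, {0}))      # (3, 'demand')
print(arb.grant(set(), {0}))    # (0, 'prefetch')
```

Keeping a separate pointer per class means that a burst of demand traffic does not disturb the fairness order among the prefetch packets once they are finally served.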

2.4.4 Summary of other dynamic management techniques

In [51], the authors proposed analytical models for bandwidth partitioning that identify when prefetching can improve system performance. In [71], the authors used reuse-distance-based cache modeling to insert non-temporal prefetch instructions, so that data that is not reused bypasses the lower-level caches. Similarly, in [41] the authors proposed a run-time mechanism that finds opportunities to insert non-temporal prefetch instructions in batch applications, conserving LLC space so that the performance of user-facing applications in data centers remains predictable. In [33], the authors implemented a run-time mechanism that explores and adjusts the hardware prefetcher configuration on a POWER7 processor to maximize performance. The POWER7 processor allows the prefetcher aggressiveness to be configured at 7 different levels. Their run-time method explores the hardware prefetcher settings on a per-core basis (for two cores only) and applies the one that performs best.
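The explore-then-apply idea of the last approach can be sketched as an exhaustive search per core. This is only an illustration of the general scheme, not the method of [33]: `apply_setting` and `measure_performance` are hypothetical callbacks standing in for whatever platform interface and performance counters are available, and numbering the levels 1 to 7 is an assumption.

```python
def explore_prefetch_settings(cores, levels, apply_setting, measure_performance):
    """Try every aggressiveness level on each core and keep the best one.

    apply_setting(core, level)  -- hypothetical hook that configures the prefetcher
    measure_performance(core)   -- hypothetical hook returning a score (e.g. IPC)
                                   for the currently applied setting
    """
    best = {}
    for core in cores:
        scores = {}
        for level in levels:
            apply_setting(core, level)
            scores[level] = measure_performance(core)
        best[core] = max(scores, key=scores.get)
        apply_setting(core, best[core])  # leave the core on its best setting
    return best


# Toy stand-in for real hardware: each core has a preferred level.
applied = {}
preferred = {0: 3, 1: 5}

def apply_setting(core, level):
    applied[core] = level

def measure_performance(core):
    return 1.0 - 0.1 * abs(applied[core] - preferred[core])

print(explore_prefetch_settings([0, 1], range(1, 8), apply_setting, measure_performance))
# {0: 3, 1: 5}
```

With 7 levels and two cores, exploring each core independently costs 14 measurements, whereas searching all joint combinations would cost 49.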
