The presented implementation of the adaptive prefetch is based on a user-level runtime. Compared to an OS implementation, a user-level runtime provides the maximum flexibility and portability. An OS-based implementation would provide several advantages, though. For instance, the overhead for reading performance counters as well as for changing the DSCR register would be reduced, since it would not be necessary to change the privilege mode to do so.
adaptive prefetch within the Linux OS. For that purpose, we have implemented OS-based adaptive prefetch algorithms similar to the runtime-based ones.
Algorithm 4 OS-based implementation of Algorithm 1
 1: ct = get_current_running_thread()
 2: if mode = EXPLORATION then
 3:     perf[ct, curr_ps[ct]] = read_ipc()
 4:     if curr_ps[ct] ≠ last_ps() then
 5:         curr_ps[ct] = next_ps(curr_ps[ct])
 6:         set_dscr(ct, curr_ps[ct])
 7:     else
 8:         best_ps = arg max_ps(perf[ct])
 9:         set_dscr(ct, best_ps)
10:         run_quantum[ct] = RUN_QUANTUM
11:         mode = RUNNING
12:     end if
13: else if mode = RUNNING then
14:     run_quantum[ct] = run_quantum[ct] − 1
15:     if run_quantum[ct] = 0 then
16:         curr_ps[ct] = first_ps()
17:         set_dscr(ct, curr_ps[ct])
18:         mode = EXPLORATION
19:     end if
20: end if
We rely on the timer interrupt to divide the execution of threads into intervals, each containing an exploration phase and a running phase. At each timer interrupt, a reference to the thread running on the current context is first obtained (line 1 of Algorithm 4). The behavior of the algorithm then depends on the current phase: i) If the exploration phase is active, the performance of the current prefetch setting (curr_ps) is recorded (line 3) and the next setting is selected (lines 5-6). If no more settings are available, the algorithm selects the best setting found during the exploration phase and starts the running phase (lines 8-11). ii) If the running phase is active, the running quantum is first decremented (line 14). Once the quantum reaches zero, the algorithm resets the prefetch setting to the first one and starts a new exploration phase (lines 15-18). The quantum therefore determines how long a running phase lasts. A larger value reduces the effect of the inefficient prefetch settings tried during exploration, at the expense of coarser adaptability.
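The two-phase logic described above can be sketched as a small user-space simulation. This is only an illustrative model under stated assumptions, not the kernel code: `ThreadState`, `on_timer_tick` and `simulate` are hypothetical names, the phase is kept per thread (Algorithm 4 writes `mode` without a thread index), prefetch settings are abstract labels rather than real DSCR values, and `read_ipc` and `set_dscr` are passed in as plain callables.

```python
from dataclasses import dataclass, field

EXPLORATION, RUNNING = "exploration", "running"


@dataclass
class ThreadState:
    """Per-thread bookkeeping (hypothetical; mirrors the arrays of Algorithm 4)."""
    settings: list                 # candidate prefetch settings, in exploration order
    run_quantum_len: int           # RUN_QUANTUM, measured in timer ticks
    mode: str = EXPLORATION        # assumption: phase tracked per thread
    idx: int = 0                   # position of curr_ps within settings
    perf: dict = field(default_factory=dict)   # setting -> last measured IPC
    run_quantum: int = 0           # ticks left in the running phase


def on_timer_tick(st, read_ipc, set_dscr):
    """Handle one timer interrupt for the thread described by st."""
    if st.mode == EXPLORATION:
        # Record the IPC observed under the setting that was just in effect.
        st.perf[st.settings[st.idx]] = read_ipc()
        if st.idx != len(st.settings) - 1:
            # More settings to try: move on to the next one.
            st.idx += 1
            set_dscr(st.settings[st.idx])
        else:
            # Exploration finished: apply the best setting and start running.
            best = max(st.perf, key=st.perf.get)
            set_dscr(best)
            st.run_quantum = st.run_quantum_len
            st.mode = RUNNING
    elif st.mode == RUNNING:
        st.run_quantum -= 1
        if st.run_quantum == 0:
            # Running phase over: restart exploration from the first setting.
            st.idx = 0
            set_dscr(st.settings[0])
            st.mode = EXPLORATION


def simulate(ipc_by_setting, run_quantum_len, ticks):
    """Replay `ticks` timer interrupts; return the sequence of applied settings."""
    settings = list(ipc_by_setting)
    st = ThreadState(settings=settings, run_quantum_len=run_quantum_len)
    applied = [settings[0]]        # setting in effect before the first tick
    read_ipc = lambda: ipc_by_setting[applied[-1]]   # made-up IPC per setting
    for _ in range(ticks):
        on_timer_tick(st, read_ipc, applied.append)
    return applied
```

With three hypothetical settings whose made-up IPC values are `{0: 1.0, 1: 1.4, 2: 1.2}` and a running quantum of 2 ticks, `simulate(..., run_quantum_len=2, ticks=6)` returns `[0, 1, 2, 1, 0, 1]`: the three settings are explored, setting 1 is kept for the running phase, and exploration then restarts from the first setting.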
These promising results encourage us to further pursue this path. We leave, however, the exploration of other OS-based adaptive schemes for future work.
5.5 Conclusions
Prefetching engines in processors are getting more sophisticated over time. When designing a new processor, it is not easy to select a prefetch setting that performs well under all the workloads that may later run on it. In response, processor manufacturers are exposing multiple knobs that users can tweak in an attempt to improve the performance of their workloads. However, doing so typically requires a costly profiling step to determine the best prefetch setting for a particular workload. Moreover, when the workload set changes over time, the profile results may no longer be useful. Therefore, this manual approach does not scale in a scenario where systems are shared among multiple users and workload consolidation is becoming pervasive.
In this chapter we present an adaptive prefetch mechanism capable of boosting performance by leveraging these prefetching knobs. We evaluate its impact on performance for single-threaded and multiprogrammed workloads, showing that significant speedups can be obtained with respect to the default prefetch setting. We also compare the adaptive scheme to an approach where applications are first profiled and the best prefetch setting found is used for future executions. Our dynamic approach, however, frees users from profiling every application in order to find the best static prefetch setting.
6 Bandwidth Shifting: Improving System-Wide Performance
6.1 Introduction
As newer systems become capable of running a larger thread count, effectively sharing the available bandwidth to memory is becoming even more important. Total bandwidth continues to increase through multiple architectural improvements. But bandwidth per thread is actually becoming scarcer in newer systems. Therefore in this section we place the focus on a solution that balances the bandwidth usage of the different workloads running on the system. This approach will attempt to maximize the utilization of memory bandwidth, potentially improving system performance and/or reducing power consumption (e.g., by turning off the prefetcher for applications that are not amenable to prefetching). To the best of our knowledge, this solution is the first one that addresses prefetch bandwidth management for CMP processors without requiring hardware support. Because of its design, our solution should work on any multicore system with a programmable prefetch engine; most modern processors allow users to control prefetching in different ways.
Figure 6.1: Effect of bandwidth shifting on system performance when a prefetch-efficient benchmark (bwaves) and a prefetch-inefficient one (omnetpp) run together. The X axis shows the number of omnetpp threads (x), from 4 to 28; the Y axis shows the harmonic speedup. The number of bwaves threads is 32 − x.
Figure 6.1 shows an illustrative example of the effect of bandwidth shifting on system performance. In this example we run two benchmarks: bwaves (prefetch friendly) and omnetpp (prefetch unfriendly). For every execution (represented as a tick on the X axis) we run 32 processes in total: x omnetpp copies and 32 − x bwaves copies. We compute the system speedup using the harmonic speedup between two configurations: 1) both benchmarks use the most aggressive prefetch setting, and 2) bwaves keeps using the most aggressive setting, but prefetching is disabled for omnetpp. Our bandwidth shifting mechanism would effectively shift prefetch resources from the prefetch-unfriendly to the prefetch-friendly benchmark. As the number of omnetpp copies increases, that benchmark keeps adding pressure on the available memory bandwidth, thus taking bandwidth away from bwaves. If we shift bandwidth between both benchmarks by disabling prefetching for omnetpp, we observe very significant speedups. This is especially noticeable as the number of omnetpp copies increases, since the prefetches issued for that benchmark saturate the bandwidth to memory. When we shift bandwidth between the applications in this way, the system speedup exceeds 60% for 28 omnetpp threads. As Figure 6.1 demonstrates, there is ample room for an intelligent bandwidth shifting mechanism that takes bandwidth resources away from prefetch-inefficient workloads and gives them to more efficient workloads.
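The speedup computation for a mix of copies can be sketched as follows. This section does not spell out the formula, so the sketch assumes one common definition of harmonic speedup: the harmonic mean of the per-process speedups, where each speedup is the performance in the new configuration divided by the performance in the baseline. The function names and the example performance values are hypothetical and are not measurements from this work.

```python
def harmonic_speedup(baseline_perf, new_perf):
    """Harmonic mean of per-process speedups (assumed definition).

    baseline_perf[i] and new_perf[i] are the performance (e.g. IPC) of
    process i in the baseline and in the new configuration.
    """
    if not baseline_perf or len(baseline_perf) != len(new_perf):
        raise ValueError("need two non-empty sequences of equal length")
    # Sum of inverse speedups: 1 / (new / base) == base / new.
    inverse_sum = sum(base / new for base, new in zip(baseline_perf, new_perf))
    return len(baseline_perf) / inverse_sum


def mix_harmonic_speedup(x, omnetpp_speedup, bwaves_speedup, total=32):
    """Harmonic speedup for x omnetpp copies and (total - x) bwaves copies.

    Assumes every copy of a benchmark sees the same speedup, which is a
    simplification made only for this illustration.
    """
    baseline = [1.0] * total
    new = [omnetpp_speedup] * x + [bwaves_speedup] * (total - x)
    return harmonic_speedup(baseline, new)
```

Because the harmonic mean is dominated by its smallest terms, a configuration only scores well under this definition if no process is slowed down badly. For instance, with made-up speedups of 2.0 for one process and 1.0 for another, the harmonic speedup is 2 / (0.5 + 1.0), about 1.33, rather than the arithmetic mean of 1.5.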
In this chapter we first introduce a metric that estimates prefetch usefulness for a given thread based solely on performance counters commonly available in current processors. Then we present a novel bandwidth shifting mechanism capable of significantly improving system performance by taking bandwidth away from benchmarks that do not use prefetching efficiently and giving it to prefetch-efficient benchmarks. The mechanism does not require any hardware support, and it obtains up to 18.5% speedup (10-11% on average). We also study the impact of bandwidth shifting in extreme cases where one benchmark is highly prefetch-efficient and the other uses prefetching inefficiently. Our results show that in those cases bandwidth shifting achieves much larger speedups (>1.6X). Finally, we evaluate the impact of the bandwidth shifting mechanism on power consumption.