5.7.2
Roles of the Authors
The roles of the authors were accurately captured by the description in Section 5.6.2.
5.8
Other Publications
In addition to the above publications, I have contributed to the following articles:
• Marius Grannæs, Magnus Jahre and Lasse Natvig, Multi-Level Hardware Prefetching using Low Complexity Delta Correlating Prediction Tables with Partial Matching, 5th International Conference on High Performance and Embedded Architectures and Compilers, 2010, Best Paper Nominee
• Marius Grannæs, Magnus Jahre and Lasse Natvig, Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables, 1st JILP Data Prefetching Championship, 2009
• Guttorm Sindre, Lasse Natvig and Magnus Jahre, Experimental Validation of the Learning Effect for a Pedagogical Game on Computer Fundamentals, IEEE Transactions on Education, 2009
• Magnus Jahre and Lasse Natvig, Performance Effects of a Cache Miss Handling Architecture in a Multi-core Processor, Norwegian Informatics Conference, 2007
Furthermore, we have been asked to submit extensions of two of our papers to special issues of two different journals:
• Marius Grannæs, Magnus Jahre and Lasse Natvig, Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables, To appear in: Journal of Instruction Level Parallelism, 2010
• Marius Grannæs, Magnus Jahre and Lasse Natvig, Multi-Level Hardware Prefetching using Low Complexity Delta Correlating Prediction Tables with Partial Matching, To appear in: Transactions on High-Performance Embedded Architectures and Compilers, 2010
Chapter 6
Concluding Remarks
6.1
Conclusion
In this thesis, we have illustrated how managing bandwidth allocations can improve system performance. We have targeted two partially overlapping types of bandwidth: miss bandwidth and off-chip bandwidth.
We have explored several feedback-directed techniques for managing miss bandwidth. Our starting point was Paper A.II, which uses memory bus utilization measurements to control miss bandwidth allocations and improve performance. In Paper A.III, we showed that managing miss bandwidth could also improve throughput and fairness. Finally, Paper A.IV models the performance effects of miss bandwidth allocations at runtime and uses this model to allocate bandwidth. The resulting system can improve performance, throughput or fairness depending on which performance metric it is configured to optimize for. The model-based approach of Paper A.IV is powered by the Dynamic Interference Estimation Framework (DIEF) from Paper B.II. DIEF provides accurate estimates of the average memory latency a process would experience with exclusive access to all hardware-managed shared resources. We also provide a performance model that uses DIEF to accurately estimate what the performance of a process would be in the absence of interference.
In addition, we improve performance by using prefetching to utilize off-chip bandwidth more efficiently. Here, we choose prefetches that can be efficiently executed in the rows, columns and banks of modern DRAMs. In this way, we increase bus utilization, which makes it possible to execute prefetches without severely interfering with demand reads. Furthermore, this technique reduces the performance degradation experienced by processes that do not benefit from prefetching.
6.2
Contributions
The main research question of this thesis is:
How, and at what cost, can performance be improved by managing resource sharing in CMP memory systems?
I approach this question through three subquestions. In this section, I review the subquestions in light of the results presented in this thesis.
6.2.1
Research Question 1
RQ1: How, and at what cost, can CMP performance be improved by managing miss bandwidth?
Papers A.II, A.III and A.IV shed light on this research question. These papers manage miss bandwidth allocations to reduce interference and improve performance. We provide two fundamentally different approaches to miss bandwidth management. In Papers A.II and A.III, we use an iterative approach. In these papers, miss bandwidth allocations are chosen based on measurements and evaluated in the real system. Performance feedback is used to check whether the new allocation is better than the previous one. In Paper A.IV, we provide an analytical model of the performance effects of miss bandwidth allocations and use this model to make allocation decisions.
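The iterative approach can be illustrated with a small sketch. This is not the algorithm from Papers A.II or A.III; it is a minimal greedy search under assumed names. The functions `iterative_allocation`, `reduce_one` and `toy_measure` are hypothetical, and `toy_measure` is a stand-in for the performance feedback that a real system would obtain from hardware counters.

```python
from typing import Callable, List


def iterative_allocation(
    initial: List[int],
    candidates: Callable[[List[int]], List[List[int]]],
    measure: Callable[[List[int]], float],
    rounds: int = 10,
) -> List[int]:
    """Try candidate allocations and keep one only if measured performance improves."""
    best = list(initial)
    best_perf = measure(best)
    for _ in range(rounds):
        improved = False
        for cand in candidates(best):
            perf = measure(cand)      # evaluate the candidate (stand-in for the real system)
            if perf > best_perf:      # performance feedback: keep only improvements
                best, best_perf = cand, perf
                improved = True
        if not improved:
            break                     # no candidate was better; keep the current allocation
    return best


def reduce_one(alloc: List[int]) -> List[List[int]]:
    """Hypothetical candidate generator: lower one core's miss allotment by one (minimum 1)."""
    out = []
    for i, a in enumerate(alloc):
        if a > 1:
            cand = list(alloc)
            cand[i] = a - 1
            out.append(cand)
    return out


def toy_measure(alloc: List[int]) -> float:
    """Toy feedback (assumption): performance peaks when 8 misses are outstanding in total."""
    return -abs(sum(alloc) - 8)


result = iterative_allocation([4, 4, 4, 4], reduce_one, toy_measure)
```

The sketch only shows the control flow of a measure-and-compare loop; the choice of candidates and of the feedback metric is where the actual techniques differ.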
The implementation cost of the miss bandwidth allocation techniques is closely related to their complexity. For the iterative methods, we use relatively simple algorithms. Furthermore, the storage overhead is low since a few strategically placed counters are sufficient to provide the required feedback. For the model-based method, the algorithms are fairly complex, and the accurate feedback mechanisms have a moderate storage cost. Since allocation decisions are used for a long period of time, we can reduce the implementation cost with a high-latency implementation. In fact, the total storage requirement for the 4-core CMP feedback mechanisms in Paper A.IV is only 13.8 KB. The main storage cost drivers are the memory bus interference measurement techniques and the Auxiliary Tag Directory (ATD).
6.2.2
Research Question 2
RQ2: How, and at what cost, can off-chip bandwidth management be used to increase CMP performance?
We approach this research question in both a direct and indirect manner. In the direct approach (Papers C.I and C.II), we use prefetching to provide larger bandwidth allocations to processes with predictable memory access patterns. Thus, we aim to increase the performance of these processes. By choosing prefetches to improve DRAM page locality, we increase off-chip memory bus utilization. In Paper
C.I, we increase bus utilization by piggybacking prefetches to open pages on demand reads to these pages. Since this makes prefetches cheaper in terms of latency than demand reads, it is worthwhile to issue prefetches even when prefetcher accuracy is moderately low.
In Paper C.II, we propose the opportunistic prefetch scheduling strategy. In this paper, we use a novel hardware structure called the Page Vector Table (PVT) to make prefetch information easily available to the memory bus scheduler. We use the PVT to issue prefetches to open pages when the scheduler closes a DRAM page. In this way, we increase the performance improvement of prefetching at the same time as we reduce the performance regressions of the processes which are not helped by prefetching. The implementation cost of these techniques is mostly due to adding a prefetcher. In addition, the PVT has a small storage cost.
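To make the idea concrete, the following is a minimal sketch of how a table of per-page prefetch candidates could be consulted when a page is about to be closed. The layout is an assumption made for illustration (one bit vector per DRAM page, one bit per cache block) and is not the PVT design from Paper C.II; the names `PageVectorTable`, `add_candidate`, `drain` and `close_page` are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


class PageVectorTable:
    """Hypothetical PVT: one bit vector per DRAM page marking prefetch candidate blocks."""

    def __init__(self, blocks_per_page: int = 8) -> None:
        self.blocks_per_page = blocks_per_page
        self.vectors = defaultdict(int)  # page number -> bit vector of candidate blocks

    def add_candidate(self, page: int, block: int) -> None:
        """Record that the prefetcher wants the given block of the given page."""
        if not 0 <= block < self.blocks_per_page:
            raise ValueError("block index out of range")
        self.vectors[page] |= 1 << block

    def drain(self, page: int) -> List[int]:
        """Return and clear all candidate blocks recorded for a page."""
        bits = self.vectors.pop(page, 0)
        return [b for b in range(self.blocks_per_page) if (bits >> b) & 1]


def close_page(pvt: PageVectorTable, open_page: int) -> List[Tuple]:
    """Issue pending prefetches to the still-open page, then precharge it."""
    commands: List[Tuple] = [("prefetch", open_page, b) for b in pvt.drain(open_page)]
    commands.append(("precharge", open_page))
    return commands


pvt = PageVectorTable()
pvt.add_candidate(3, 1)
pvt.add_candidate(3, 5)
pvt.add_candidate(7, 0)
first_close = close_page(pvt, 3)
second_close = close_page(pvt, 3)
```

In this sketch, the prefetches are issued while the row is still open, so they avoid the activation cost that a later demand read to the same page would pay.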
Papers A.II, A.III and A.IV primarily manage miss bandwidth. Since off-chip bandwidth can be an important part of the miss bandwidth, these systems indirectly manage off-chip bandwidth. They differ from the Category C approaches by focusing on improving system performance rather than improving the performance of individual processes. The implementation cost of these techniques was discussed in relation to RQ1.
6.2.3
Research Question 3
RQ3: How, and at what cost, can the performance and latency effects of resource sharing be estimated?
We provide two methodologies for estimating the latency effects of interference. In Paper B.I, we provide an offline methodology for estimating interference latency. We observed that the memory bus is the main interference source for the CMPs we considered. We also found that cache capacity interference may not be a significant part of the total interference. However, alleviating cache interference can have large performance effects for some benchmarks since it can remove a large part of the total memory latency. Finally, we observed that shared cache MSHRs can have a non-negligible interference impact.
An offline approach can provide important insights, but it cannot be used for runtime resource management. Therefore, Paper B.II provides an online technique called the Dynamic Interference Estimation Framework (DIEF) which provides accurate runtime estimates of private mode memory latencies. The underlying idea is that by either measuring interference or estimating private mode latencies in each shared unit, the total interference latency can be quantified. In Paper B.II, we provide interference estimation techniques for crossbar and ring interconnects, shared caches and a DDR2 memory bus. Consequently, we explore two possible implementations of the general framework.
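The underlying idea can be sketched as a simple accumulation over shared units. This is an illustration under assumptions, not the DIEF implementation: the class `UnitSample`, the function `estimate_private_latency` and all latency values below are hypothetical, and each unit is assumed to report average per-request latencies in cycles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnitSample:
    """Average latency observed in one shared unit (hypothetical record layout)."""
    name: str
    shared_latency: float                     # measured in shared mode
    interference: Optional[float] = None      # set if the unit measures interference directly
    private_estimate: Optional[float] = None  # set if the unit estimates private mode latency

    def interference_latency(self) -> float:
        """Interference in this unit, from whichever quantity the unit provides."""
        if self.interference is not None:
            return self.interference
        if self.private_estimate is not None:
            return self.shared_latency - self.private_estimate
        raise ValueError(f"unit {self.name} provides no interference information")


def estimate_private_latency(units: List[UnitSample]) -> float:
    """Private mode latency = total shared mode latency minus total interference."""
    shared = sum(u.shared_latency for u in units)
    interference = sum(u.interference_latency() for u in units)
    return shared - interference


# Made-up example values, for illustration only.
units = [
    UnitSample("interconnect", shared_latency=20.0, interference=5.0),
    UnitSample("shared cache", shared_latency=30.0, private_estimate=25.0),
    UnitSample("memory bus", shared_latency=150.0, interference=60.0),
]
private_latency = estimate_private_latency(units)
```

The two optional fields mirror the two options named above: a unit either measures its interference directly or estimates its private mode latency, and both reduce to an interference latency that can be summed.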
Paper A.IV shows how the latency measurements of DIEF can be used to provide private mode performance estimates. This is made possible by the observation that the ability of a process to utilize the parallelism available in the memory system is very similar in the shared and private modes. Consequently, we combine shared mode Memory Level Parallelism (MLP) measurements with DIEF’s private mode latency estimates to provide runtime estimates of private mode performance. The storage overhead of this technique was discussed in relation to RQ1.
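A first-order sketch shows how these two quantities could be combined. It is not the performance model of Paper A.IV: it assumes that memory stall cycles scale as misses times average latency divided by MLP, and that MLP is the same in both modes. The function `estimate_private_ipc`, its parameters and the example numbers are all hypothetical.

```python
def estimate_private_ipc(
    instructions: int,
    shared_cycles: float,
    misses: int,
    shared_latency: float,
    private_latency: float,
    mlp: float,
) -> float:
    """Estimate private mode IPC from shared mode measurements (illustrative model only).

    Assumption: stall cycles = misses * average latency / MLP, with MLP unchanged
    between the shared and private modes.
    """
    if mlp <= 0:
        raise ValueError("MLP must be positive")
    # Stall cycles that disappear when interference latency is removed.
    stall_saved = misses * (shared_latency - private_latency) / mlp
    private_cycles = shared_cycles - stall_saved
    if private_cycles <= 0:
        raise ValueError("inconsistent inputs: estimated private cycles not positive")
    return instructions / private_cycles


# Made-up example values, for illustration only.
shared_ipc = 1_000_000 / 2_000_000
private_ipc = estimate_private_ipc(
    instructions=1_000_000,
    shared_cycles=2_000_000,
    misses=10_000,
    shared_latency=200.0,
    private_latency=130.0,
    mlp=2.0,
)
```

With these inputs, removing 70 cycles of interference per miss at an MLP of 2 saves 350,000 cycles, so the estimated private mode IPC is higher than the measured shared mode IPC.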