Conclusion - Managing Shared Resources in Chip Multiprocessor Memory Systems

When designing Miss Handling Architectures (MHAs), the aim has been to support as many outstanding misses as possible in an area efficient manner. Unfortunately, applying this strategy to a CMP will not realize its performance potential. The

reason is that allowing too much miss parallelism creates congestion in the off-chip memory bus.

The first contribution of this paper is a thorough investigation of conventional MHA performance in a CMP. The main result of this investigation was that a majority of applications need large miss parallelism. However, this must be provided in a way that avoids memory bus congestion. Our Adaptive MHA (AMHA) scheme serves this purpose and is the second contribution in this paper. AMHA increases CMP performance by dynamically adapting the allowed number of outstanding misses in the private L1 data caches to the current memory bus utilization.

Bibliography

[1] K. Asanovic and et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California at Berkeley, December 2006.

[2] S. Belayneh and D. R. Kaeli. A Discussion on Non-Blocking/Lockup-Free Caches. SIGARCH Comp. Arch. News, 24(3):18–25, 1996.

[3] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52–60, 2006.

[4] D. Burger, J. R. Goodman, and A. Kagi. Memory Bandwidth Limitations of Future Microprocessors. In ISCA ’96: Proc. of the 23rd An. Int. Symp. on Comp. Arch., 1996.

[5] J. Chang and G. S. Sohi. Cooperative Cache Partitioning for Chip Multipro- cessors. In ICS ’07: Proc. of the 21st Annual Int. Conf. on Supercomputing, pages 242–252, 2007.

[6] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. In Proc. of the 26th Inter. Symp. on Comp. Arch., pages 222–233, 1999.

[7] H. Dybdahl and P. Stenstrom. An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. In HPCA ’07: Proc. of the 13th Int. Symp. on High-Performance Comp. Arch., 2007.

[8] S. Eyerman and L. Eeckhout. System-Level Performance Metrics for Multi- program Workloads. IEEE Micro, 28(3):42–53, 2008.

[9] K. I. Farkas and N. P. Jouppi. Complexity/Performance Tradeoffs with Non- Blocking Loads. In ISCA ’94: Proc. of the 21st An. Int. Symp. on Comp. Arch., pages 211–222, 1994.

Bibliography 97

[10] J. L. Hennessy and D. A. Patterson. Computer Architecture - A Quantitative Approach, Fourth Edition. Morgan Kaufmann Publishers, 2007.

[11] J. Huh, D. Burger, and S. W. Keckler. Exploring the Design Space of Future CMPs. In PACT ’01: Proc. of the 2001 Int. Conf. on Parallel Architectures and Compilation Techniques, pages 199–210, 2001.

[12] ITRS. International Technology Roadmap for Semiconductors. http://www. itrs.net/, 2006.

[13] DDR2 SDRAM Specification. JEDEC Solid State Tech. Association, May 2006.

[14] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In ISCA ’81: Proc. of the 8th An. Symp. on Comp. Arch., pages 81–87, 1981. [15] K. Luo, J. Gummaraju, and M. Franklin. Balancing Throughput and Fairness

in SMT Processors. In ISPASS, 2001.

[16] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A. Wood. Bandwidth Adaptive Snooping. In HPCA ’02: Proc. of the 8th Int. Symp. on High- Performance Comp. Arch., page 251, 2002.

[17] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO 40: Int. Symp. on Microarchitecture, 2007. [18] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair Queuing Memory Systems. In MICRO 39: Int. Symp. on Microarchitecture, pages 208–222, 2006.

[19] K. J. Nesbit, J. Laudon, and J. E. Smith. Virtual private caches. In ISCA ’07: Proc. of the 34th An. Int. Symp. on Comp. Arch., pages 57–68, 2007. [20] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A

Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO 39: Proc. of the 39th An. IEEE/ACM Int. Symp. on Microarch., pages 423–432, 2006.

[21] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective Management of DRAM Bandwidth in Multicore Processors. In PACT ’07: Proc. of the 16th Int. Conf. on Parallel Architecture and Compilation Techniques (PACT 2007), pages 245–258, 2007.

[22] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA ’00: Int. Symp. on Comp. Arch., pages 128–138, 2000.

[23] S. L. Scott and G. S. Sohi. The Use of Feedback in Multiprocessors and Its Application to Tree Saturation Control. IEEE Trans. Parallel Distrib. Syst., pages 385–398, 1990.

[24] J. Shao and B. Davis. A Burst Scheduling Access Reordering Mechanism. In HPCA ’07: Proc. of the 13th Int. Symp. on High-Performance Comp. Arch., 2007.

[25] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In Arch. Support for Programming Languages and Operating Systems, pages 234–244, 2000.

[26] G. S. Sohi and M. Franklin. High-bandwidth Data Memory Systems for Su- perscalar Processors. In ASPLOS-IV: Proc. of the fourth Int. Conf. on Ar- chitectural Support for Programming Languages and Operating Systems, pages 53–62, 1991.

[27] SPEC. SPEC CPU 2000 Web Page. http://www.spec.org/cpu2000/. [28] M. Thottethodi, A. Lebeck, and S. Mukherjee. Exploiting Global Knowledge

to achieve Self-tuned Congestion Control for k-ary n-cube Networks. IEEE Trans. on Parallel and Distributed Systems, 15(3):257–272, 2004.

[29] J. Tuck, L. Ceze, and J. Torrellas. Scalable Cache Miss Handling for High Memory-Level Parallelism. In MICRO 39: Proc. of the 39th An. IEEE/ACM Int. Symp. on Microarchitecture, pages 409–422, 2006.

[30] L. Zhao, R. Iyer, R. Illikkal, J. Moses, S. Makineni, and D. Newell. CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms. In PACT ’07: Proc. of the 16th Int. Conf. on Parallel Arch. and Comp. Tech., pages 339–352, 2007.

Appendix B

Paper A.III

A Light-Weight Fairness

Mechanism for Chip

Multiprocessor Memory

Systems

Magnus Jahre and Lasse Natvig

ACM International Conference on Computing Frontiers

2009

101

Abstract

Chip Multiprocessor (CMP) memory systems suffer from the effects of destructive thread interference. This interference reduces performance predictability because it depends heavily on the memory access pattern and intensity of the co-scheduled threads. In this work, we confirm that all shared units must be thread-aware in order to provide memory system fairness. However, the current proposals for fair memory systems are complex as they require an interference measurement mechanism and a fairness enforcement policy for all hardware-controlled shared units. Furthermore, they often sacrifice system throughput to reach their fairness goals which is not desirable in all systems.

In this work, we show that our novel fairness mechanism, called the Dynamic Miss Handling Architecture (DMHA), is able to reduce implementation complexity by using a single fairness enforcement policy for the complete hardware-managed shared memory system. Specifically, it controls the total miss bandwidth available to each thread by dynamically manipulating the number of Miss Status Holding Registers (MSHRs) available in each private data cache. When fairness is chosen as the metric of interest and we compare to a state-of-the-art fairness-aware memory system, DMHA improves fairness by 26% on average with the single program baseline. With a different configuration, DMHA improves throughput by 13% on average compared to a conventional memory system.

B.1. Introduction 103

B.1 Introduction

The multi-core paradigm has become the norm for high performance processors. Commonly, these processors share part of the memory system which creates the possibility that memory requests from different processing cores can interfere with each other and increase latencies. Depending on the amount of Instruction Level Parallelism (ILP) available in the application, this can increase the processors’ stall time and degrade performance. This performance reduction is unpredictable as it depends heavily on the memory access patterns and access intensity of the applications running on the other processing cores. A large, unpredictable latency variation is clearly undesirable, and an important design goal for a Chip Multi- processor (CMP) memory system is to limit these effects by providing some form of performance isolation. Unfortunately, the memory systems employed in CMPs today have no means of controlling this interference, but a number of researchers have proposed techniques that alleviate this problem [3, 18, 19, 28].

A good performance isolation technique should provide both fairness and Quality of Service (QoS). A memory system is fair if the performance reduction due to interference between threads is distributed across all processes in proportion to their priorities [14]. QoS is provided if it is possible to put a limit on the maximum slowdown a process can experience when co-scheduled with any other process [3]. Furthermore, the allowed slow-down can depend on the priority of the process. It is also common to divide fairness/QoS into mechanisms and policies [19]. Here, the policy decides the desired resource allocation and implements it with the primitives provided by the mechanism.

CMPs are commonly used in enterprise IT data centres. In this setting, it is important to have absolute control over how different threads and processes interfere with each other in order to guarantee that the resources specified in the Service Level Agreement are made available. Therefore, complex techniques that can provide QoS are needed. However, CMPs are also used in desktop computers, often running a collection of single-threaded processes. Here, a less complex solution that achieves good fairness can be more appropriate.

The main contribution of this work is a novel, light-weight mechanism called the Dynamic Miss Handling Architecture (DMHA). DMHA’s key feature is that it makes it possible to change the number of Miss Status Holding Registers (MSHRs) available in the private L1 data caches at runtime. These registers determine the number of misses the cache can sustain without blocking, and a blocked cache can not receive any requests. Consequently, the processor will quickly stall as it will be unable to fetch more data, thus reducing its execution speed. DMHA uses this effect to match the execution speeds of the processors such that the slowdown due to memory system interference is equalized across threads.

DMHA divides the total miss bandwidth between all cores at the end of the private memory system. In this work, we show that DMHA is able to provide fairness values comparable to a state-of-the-art fairness-aware memory system or improve

throughput compared to a conventional memory system. We establish this result by exhaustively simulating all 256 combinations of 1, 4, 8 and 16 L1 data cache MSHRs in a 4-core CMP for 10 randomly generated multiprogrammed workloads with SPEC CPU2000 benchmarks. In other words, we measure the metric value DMHA is able to attain if provided with a good policy for this metric. A fairness policy that uses DMHA as its mechanism can improve on this result by adapting the number of MSHRs to interference patterns at runtime as well as using more fine grained MSHR allocation.

To show that DMHA can be used as the mechanism in a practical fairness implementation, we implement a simple interference measurement scheme which we call Interference Points (IP). Here, we add a few registers to each shared unit and in- crement these with a fixed value each time we detect a certain type of interference. This measurement technique builds on previous work, and our contribution is to integrate it into a coherent whole [17, 28]. In addition, we implement a simple hardware policy that searches through different DMHA configurations at runtime to reduce interference. The combination of these techniques result in a fairness management system which we call the Fair Adaptive Miss Handling Architecture (FAMHA). FAMHA is suitable for systems where strict QoS is not needed and a trade-off between fairness and throughput is desired.

The rest of this paper has the following outline. First, section B.2 gives the nec- essary background information on fairness techniques, MHAs and metrics. Then, section B.3 presents our DMHA mechanism before section B.4 discusses IP interference measurement and our hardware policy. Section B.5 discusses our simulation methodology before we present our results in section B.6. Finally, section B.7 discusses possible DMHA extensions before section B.8 concludes the paper and gives indications for further work.

In document Managing Shared Resources in Chip Multiprocessor Memory Systems (Page 117-126)