6.4 Outlook
The number of processing cores on a chip is expected to increase [6, 62]. As core counts grow, it may not be worth the cost to ensure that all cores see the same shared cache latency. Therefore, future CMPs will likely have Non-Uniform Cache Access (NUCA) latencies [78]. Furthermore, Amdahl’s law implies that as the number of threads increases, the relative impact of the serial part of the program grows [3]. For this reason, it may be helpful to have some specialized cores for the serial parts of programs and simpler cores for the parallel parts, making CMPs heterogeneous [82]. In a heterogeneous CMP, it might also be useful to have special-purpose cores that execute important program types with high performance and good energy efficiency. These possibilities indicate that future resource management systems will operate in a considerably more complicated environment than today’s. This increases both the need for resource allocation schemes and the complexity of implementing them.
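The effect of Amdahl’s law on the serial part can be made concrete with a small calculation. The following is a minimal sketch, not taken from the thesis: the function names and the 5% serial fraction are illustrative assumptions, and the model assumes that the parallel part scales perfectly with the number of cores.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup over one core when only the parallel part scales with cores."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def serial_share_of_runtime(serial_fraction: float, cores: int) -> float:
    """Fraction of the multi-core execution time spent in the serial part."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_time = (1.0 - serial_fraction) / cores
    total_time = serial_fraction + parallel_time
    return serial_fraction / total_time if total_time > 0.0 else 0.0


if __name__ == "__main__":
    serial_fraction = 0.05  # assumed value, chosen only for illustration
    for cores in (1, 4, 16, 64):
        speedup = amdahl_speedup(serial_fraction, cores)
        share = serial_share_of_runtime(serial_fraction, cores)
        print(f"{cores:3d} cores: speedup {speedup:5.2f}, "
              f"serial share of runtime {share:.0%}")
```

Under these assumptions, the serial part takes up a growing share of the execution time as cores are added, and the speedup is bounded by the inverse of the serial fraction. This is the motivation for giving the serial part a faster, specialized core.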
Traditionally, programmers have been accustomed to processor manufacturers providing significant annual performance improvements while upholding the abstraction of serial execution [50]. With CMPs, programmer effort is required to fully achieve the performance potential. In the short term, we can achieve good utilization of the available hardware by concurrently executing independent programs. However, there is often a limit on how many different tasks a user needs to carry out at the same time. Consequently, the focus will likely shift towards parallel processing, which enables using many of the on-chip resources for completing a single task. To achieve this, it is helpful to develop support software (e.g., libraries, runtime systems and debug tools) that facilitates rapid development of parallel programs. These systems might need hardware support to provide accurate and timely information that can be used for run-time and design-time performance optimization as well as debugging. The mechanisms and methodologies proposed in this thesis can become important primitives for such support systems.
Bibliography
[1] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures. SIGARCH Comput. Archit. News, 28(2):248–259, 2000.
[2] A. Alameldeen and D. Wood. IPC Considered Harmful for Multiprocessor Workloads. IEEE Micro, 26(4):8–17, July–Aug. 2006.
[3] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In AFIPS ’67 (Spring): Proceedings of the April 18-20, Spring Joint Computer Conference, pages 483–485, 1967.
[4] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. The IBM 360 Model 91: Processor Philosophy and Instruction Handling. IBM J. Research and Development, pages 8–24, 1967.
[5] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Technical report, University of California at Berkeley, 2009.
[6] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California at Berkeley, December 2006.
[7] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35:59–67, 2002.
[8] S. Belayneh and D. R. Kaeli. A Discussion on Non-Blocking/Lockup-Free Caches. SIGARCH Comp. Arch. News, 24(3):18–25, 1996.
[9] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT ’08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81, 2008.
[10] M. V. Biesbrouck, T. Sherwood, and B. Calder. A Co-Phase Matrix to Guide Simultaneous Multithreading Simulation. IEEE International Symposium on Performance Analysis of Systems and Software, pages 45–56, 2004.
[11] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52–60, 2006.
[12] R. Bitirgen, E. Ipek, and J. F. Martinez. Coordinated Management of Multiple Resources in Chip Multiprocessors: A Machine Learning Approach. In MICRO 41: Proc. of the 41st IEEE/ACM Int. Symp. on Microarchitecture, 2008.
[13] E. Bloch. The Engineering Design of the Stretch Computer. In IRE-AIEE-ACM ’59 (Eastern): Eastern Joint IRE-AIEE-ACM Computer Conference, pages 48–58, 1959.
[14] D. Burger, J. R. Goodman, and A. Kagi. Memory Bandwidth Limitations of Future Microprocessors. In ISCA ’96: Proc. of the 23rd An. Int. Symp. on Comp. Arch., 1996.
[15] J. F. Cantin, M. H. Lipasti, and J. E. Smith. Stealth Prefetching. SIGPLAN Notices, 41(11), 2006.
[16] J. Casazza. Intel Core i7-800 Processor Series and the Intel Core i5-700 Processor Series Based on Intel Microarchitecture (Nehalem). White paper, Intel Corp., 2009.
[17] J. Chang and G. S. Sohi. Cooperative Caching for Chip Multiprocessors. In ISCA ’06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 264–276, 2006.
[18] J. Chang and G. S. Sohi. Cooperative Cache Partitioning for Chip Multiprocessors. In ICS ’07: Proc. of the 21st Annual Int. Conf. on Supercomputing, pages 242–252, 2007.
[19] T. Chen and J. Baer. Effective Hardware-Based Data Prefetching for High-performance Processors. IEEE Transactions on Computers, 44:609–623, 1995.
[20] D. Chiou, L. Rudolph, S. Devadas, and B. S. Ang. Dynamic Cache Partitioning via Columnization. Computation Structures Group Memo 430, Massachusetts Institute of Technology, 1999.
[21] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A Performance Comparison of Contemporary DRAM Architectures. In Proc. of the 26th Inter. Symp. on Comp. Arch., pages 222–233, 1999.
[22] F. Dahlgren and P. Stenstrom. Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 7(4):385–398, Apr. 1996.
[23] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., 2003.
[24] W. J. Dally and B. Towles. Route Packets, Not Wires: On-chip Interconnection Networks. In DAC ’01: Proceedings of the 38th annual Design Automation Conference, pages 684–689, 2001.
[25] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das. Application-aware Prioritization Mechanisms for On-Chip Networks. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 280–291, 2009.
[26] P. Diaz and M. Cintra. Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching. In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 81–92, 2009.
[27] H. Dybdahl. Architectural Techniques to Improve Cache Utilization. PhD thesis, Norwegian University of Science and Technology, 2007.
[28] H. Dybdahl and P. Stenstrom. An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors. In HPCA ’07: Proc. of the 13th Int. Symp. on High-Performance Comp. Arch., 2007.
[29] H. Dybdahl, P. Stenstrom, and L. Natvig. A Cache-Partition Aware Replacement Policy for Chip Multiprocessors. In Proceedings of 13th International Conference of High Performance Computing (HiPC), pages 22–34, 2006.
[30] H. Dybdahl, P. Stenstrom, and L. Natvig. An LRU-based Replacement Algorithm Augmented with Frequency of Access in Shared Chip-Multiprocessor Caches. In MEDEA ’06: Proc. of the 2006 workshop on MEmory performance, pages 45–52, 2006.
[31] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. Coordinated Control of Multiple Prefetchers in Multi-core Systems. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 316–326, 2009.
[32] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. Patt. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems. In ASPLOS XV: Proc. of the 15th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 2010.
[33] J. Emer, P. Ahuja, E. Borch, A. Klauser, C.-K. Luk, S. Manne, S. S. Mukherjee, H. Patil, S. Wallace, N. Binkert, R. Espasa, and T. Juan. Asim: A Performance Model Framework. Computer, 35(2):68–76, 2002.
[34] S. Eyerman and L. Eeckhout. System-Level Performance Metrics for Multiprogram Workloads. IEEE Micro, 28(3):42–53, 2008.
[35] K. I. Farkas and N. P. Jouppi. Complexity/Performance Tradeoffs with Non-Blocking Loads. In ISCA ’94: Proc. of the 21st An. Int. Symp. on Comp. Arch., pages 211–222, 1994.
[36] A. Fedorova, M. Seltzer, and M. D. Smith. Cache-fair Thread Scheduling for Multicore Processors. Technical report, Harvard University, 2006.
[37] A. Fedorova, M. Seltzer, and M. D. Smith. Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler. In PACT ’07: Proc. of the 16th Int. Conf. on Parallel Architecture and Compilation Techniques, pages 25–38, 2007.
[38] R. Gabor, S. Weiss, and A. Mendelson. Fairness and Throughput in Switch on Event Multithreading. In MICRO 39: Proc. of the 39th Int. Symp. on Microarchitecture, pages 149–160, 2006.
[39] S. Gochman, A. Mendelson, A. Naveh, and E. Rotem. Introduction to Intel Core Duo Processor Architecture. Intel Technology Journal, 2006.
[40] P. Goyal, H. M. Vin, and H. Chen. Start-time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks. In SIGCOMM ’96: Conf. Proc. on App., Tech., Arch., and Protocols for Comp. Com., pages 157–168, 1996.
[41] M. Grannæs. Bandwidth-Aware Prefetching in Chip Multiprocessors. Master’s thesis, Norwegian University of Science and Technology, 2006.
[42] M. Grannæs. Reducing Memory Latency by Improving Resource Utilization. PhD thesis, Norwegian University of Science and Technology, 2010.
[43] M. Grannæs, M. Jahre, and L. Natvig. Low-Cost Open-Page Prefetch Scheduling in Chip Multiprocessors. In XXVI IEEE International Conference on Computer Design (ICCD), 2008.
[44] M. Grannæs, M. Jahre, and L. Natvig. Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables. In Data Prefetching Championships, 2009.
[45] M. Grannæs, M. Jahre, and L. Natvig. Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching. In International Conference on High-Performance Embedded Architectures and Compilers, 2010.
[46] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive Virtual Clock: a Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 268–279, 2009.
[47] F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A Framework for Providing Quality of Service in Chip Multi-Processors. In MICRO 40: Proc. of the 40th An. IEEE/ACM Int. Symp. on Microarchitecture, 2007.
[48] G. Hamerly, E. Perelman, J. Lau, and B. Calder. Simpoint 3.0: Faster and More Flexible Program Analysis. In Journal of Instruction Level Parallelism, 2005.
[49] N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. SimFlex: A Fast, Accurate, Flexible Full-System Simulation Framework for Performance Evaluation of Server Architecture. In SIGMETRICS Perform. Eval. Rev., pages 31–34, 2004.
[50] J. L. Hennessy and D. A. Patterson. Computer Architecture - A Quantitative Approach, Fourth Edition. Morgan Kaufmann Publishers, 2007.
[51] J. L. Henning. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, 2006.
[52] A. Herdrich, R. Illikkal, R. Iyer, D. Newell, V. Chadha, and J. Moses. Rate-based QoS Techniques for Cache/Memory in CMP Platforms. In ICS ’09: Proceedings of the 23rd International Conference on Supercomputing, pages 479–488, 2009.
[53] M. Hill and A. Smith. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers, 38:1612–1630, 1989.
[54] M. D. Hill. Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, 1987.
[55] H. Hofstee. Power Efficient Processor Architecture and the Cell Processor. HPCA 11: 11th Int. Symp. on High-Performance Comp. Arch., pages 258–262, 2005.
[56] L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource. In PACT ’06: Proc. of the 15th Int. Conf. on Parallel Arch. and Comp. Tech., pages 13–22, 2006.
[57] C. J. Hughes, V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM: Simulating Shared-Memory Multiprocessors with ILP Processors. Computer, 35(2):40–49, 2002.
[58] J. Huh, D. Burger, and S. W. Keckler. Exploring the Design Space of Future CMPs. In PACT ’01: Proc. of the 2001 Int. Conf. on Parallel Architectures and Compilation Techniques, pages 199–210, 2001.
[59] I. Hur and C. Lin. Adaptive History-Based Memory Schedulers. In MICRO 37: Proc. of the 37th An. IEEE/ACM Int. Symp. on Microarch., pages 343–354, 2004.
[60] E. Ipek, O. Mutlu, J. Martinez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA ’08: Proc. of the 35th Int. Symp. on Computer Architecture, pages 39–50, 2008.
[61] ITRS. International Technology Roadmap for Semiconductors. http://www.itrs.net/, 2006.
[62] ITRS. International Technology Roadmap for Semiconductors - 2007 Edition. http://www.itrs.net/, 2007.
[63] R. Iyer. CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms. In ICS ’04: Proceedings of the 18th An. Int. Conf. on Supercomputing, pages 257–266, 2004.
[64] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS Policies and Architecture for Cache/Memory in CMP Platforms. In SIGMETRICS ’07, pages 25–36, 2007.
[65] M. Jahre. Improving the Performance of Parallel Applications in Chip Multiprocessors with Architectural Techniques. Master’s thesis, Norwegian University of Science and Technology, 2007.
[66] M. Jahre and L. Natvig. Performance Effects of a Cache Miss Handling Architecture in a Multi-core Processor. In Norwegian Informatics Conference, 2007.
[67] M. Jahre and L. Natvig. A High Performance Adaptive Miss Handling Architecture for Chip Multiprocessors. Transactions on High Performance Embedded Architecture and Compilation, 4(1), 2009.
[68] M. Jahre and L. Natvig. A Light-Weight Fairness Mechanism for Chip Multiprocessor Memory Systems. In CF ’09: Proc. of the 6th ACM Conf. on Computing Frontiers, pages 1–10, 2009.
[69] M. Jahre, M. Grannæs, and L. Natvig. A Quantitative Study of Memory System Interference in Chip Multiprocessor Architectures. In 11th IEEE International Conference on High Performance Computing and Communications (HPCC), pages 622–629, 2009.
[70] M. Jahre, M. Grannæs, and L. Natvig. DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems. In International Conference on High-Performance Embedded Architectures and Compilers, pages 292–306, 2010.
[71] DDR2 SDRAM Specification. JEDEC Solid State Tech. Association, May 2006.
[72] L. K. John. More on Finding a Single Number to Indicate Overall Performance of a Benchmark Suite. SIGARCH Comput. Archit. News, 32(1):3–8, 2004.
[73] L. K. John and L. Eeckhout, editors. Performance Evaluation and Benchmarking. CRC Press, 2005.
[74] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, 2004.
[75] T. S. Karkhanis and J. E. Smith. A First-Order Superscalar Processor Model. ISCA ’04: Proceedings of the 31st An. Int. Symp. on Computer Architecture, 2004.
[76] D. Kaseridis, J. Stuecheli, J. Chen, and L. K. John. A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems. In HPCA ’10: Proc. of the 16th Int. Symp. on High-Performance Comp. Arch., 2010.
[77] T. Kilburn, D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner. One-level Storage System. IRE Transactions on Electronic Computers, 11(2):223–235, 1962.
[78] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. SIGPLAN Not., 37(10):211–222, 2002.
[79] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In PACT ’04: Proc. of the 13th Int. Conf. on Parallel Architectures and Compilation Techniques, pages 111–122, 2004.
[80] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005.
[81] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In ISCA ’81: Proc. of the 8th An. Symp. on Comp. Arch., pages 81–87, 1981.
[82] R. Kumar, D. M. Tullsen, N. P. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. Computer, 38(11):32–38, 2005.
[83] R. Kumar, V. Zyuban, and D. M. Tullsen. Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling. In ISCA ’05: Proc. of the 32nd Int. Symp. on Comp. Arch., pages 408–419, 2005.
[84] A. J. Lande. Evaluering av Chip Multiprosessor Simulatorer (in Norwegian). Master’s thesis, Norwegian University of Science and Technology, Norway, June 2006.
[85] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt. Prefetch-Aware DRAM Controllers. In MICRO ’08: Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture, pages 200–209, 2008.
[86] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-Level Parallelism in the Presence of Prefetching. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 327–336, 2009.
[87] J. W. Lee, M. C. Ng, and K. Asanovic. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks. In ISCA ’08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 89–100, 2008.
[88] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems. In HPCA ’08: Proc. of the 14th Int. Symp. on High-Perf. Comp. Arch., 2008.
[89] W. Lin, S. K. Reinhardt, and D. Burger. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In HPCA ’01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, pages 301–312, 2001.
[90] W. Lin, S. K. Reinhardt, and D. Burger. Designing a Modern Memory Hierarchy with Hardware Prefetching. IEEE Transactions on Computers, 50(11), 2001.
[91] F. Liu, X. Jiang, and Y. Solihin. Understanding How Off-Chip Memory Bandwidth Partitioning in Chip Multiprocessors Affects System Performance. In 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, 2010.
[92] G. H. Loh. 3D-Stacked Memory Architectures for Multi-core Processors. In ISCA ’08: Proceedings of the 35th International Symposium on Computer Architecture, pages 453–464, 2008.
[93] K. Luo, J. Gummaraju, and M. Franklin. Balancing Throughput and Fairness in SMT Processors. In ISPASS, 2001.
[94] M5 Documentation. SPEC2006 Benchmarks. http://www.m5sim.org/wiki/index.php/SPEC2006_benchmarks. Retrieved 11.03.2010.
[95] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Haallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, 2002.
[96] M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A. Wood. Bandwidth Adaptive Snooping. In HPCA ’02: Proc. of the 8th Int. Symp. on High-Performance Comp. Arch., page 251, 2002.
[97] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s General Execution-Driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Comput. Archit. News, 33(4):92–99, 2005.
[98] C. J. Mauer, M. D. Hill, and D. A. Wood. Full-System Timing-First Simulation. In SIGMETRICS ’02: Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 108–116, 2002.
[99] M. Moreto, F. J. Cazorla, A. Ramirez, and M. Valero. Online Prediction of Applications Cache Utility. In Int. Conf. on Embedded Comp. Systems: Architectures, Modeling and Simulation (IC-SAMOS), pages 169–177, 2007.
[100] M. Moreto, F. J. Cazorla, A. Ramirez, R. Sakellariou, and M. Valero. FlexDCP: A QoS Framework for CMP Architectures. SIGOPS Oper. Syst. Rev., 43(2):86–96, 2009.
[101] T. Moscibroda and O. Mutlu. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In SS’07: Proceedings of 16th USENIX Security Symposium, pages 1–18, 2007.
[102] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO 40: Int. Symp. on Microarchitecture, 2007.
[103] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA ’08: Proc. of the 35th An. Int. Symp. on Comp. Arch., pages 63–74, 2008.
[104] C. Natarajan, B. Christenson, and F. Briggs. A Study of Performance Impact of Memory Controller Features in Multi-processor Server Environment. In WMPI ’04: Proc. of the 3rd Workshop on Memory Perf. Issues, pages 80–87, 2004.
[105] K. Nesbit and J. Smith. Data Cache Prefetching Using a Global History Buffer. In 10th International Symposium on High Performance Computer Architecture, HPCA-10, page 96, 2004.
[106] K. Nesbit, M. Moreto, F. Cazorla, A. Ramirez, M. Valero, and J. Smith. Multicore Resource Management. IEEE Micro, 28(3):6–16, 2008.
[107] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. IEEE Micro, 25:90–97, 2005.
[108] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An Adaptive Data Cache Prefetcher. In Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques, pages 135–145, 2004.
[109] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair Queuing Memory Systems. In MICRO 39: Int. Symp. on Microarchitecture, pages 208–222, 2006.
[110] K. J. Nesbit, J. Laudon, and J. E. Smith. Virtual private caches. In ISCA ’07: Proc. of the 34th An. Int. Symp. on Comp. Arch., pages 57–68, 2007.
[111] NOTUR. NOTUR Web Page. http://www.notur.no/.
[112] K. Olukotun and L. Hammond. The Future of Microprocessors. Queue, 3(7):26–29, 2005.
[113] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single-Chip Multiprocessor. SIGPLAN Notices, 31(9):2–11, 1996.
[114] E. Perelman, G. Hamerly, and B. Calder. Picking Statistically Valid and Early Simulation Points. In PACT ’03: Proc. of the 12th Int. Conf. on Parallel Architectures and Compilation Techniques, page 244, 2003.
[115] D. G. Perez, G. Mouchard, and O. Temam. MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms. In MICRO 37: Int. Symp. on Microarchitecture, pages 43–54, 2004.
[116] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO 39: Proc. of the 39th An. IEEE/ACM Int. Symp. on Microarch., pages 423–432, 2006.
[117] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A Case for MLP-Aware Cache Replacement. In ISCA ’06: Int. Symp. on Comp. Arch., pages 167–178, 2006.
[118] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural Support for Operating System-driven CMP Cache Management. In PACT ’06: Proc. of the 15th Int. Conf. on Parallel Architectures and Compilation Techniques, pages 2–12, 2006.
[119] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective Management of DRAM Bandwidth in Multicore Processors. In PACT ’07: Proc. of the 16th Int. Conf. on Parallel Architecture and Compilation Techniques (PACT 2007), pages 245–258, 2007.
[120] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. In ISCA ’00: Int. Symp. on Comp. Arch., pages 128–138, 2000.
[121] M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta. Complete Computer System Simulation: The SimOS Approach. IEEE Parallel & Distributed Technology: Systems & Applications, 3(4):34–43, 1995.
[122] S. L. Scott and G. S. Sohi. The Use of Feedback in Multiprocessors and Its Application to Tree Saturation Control. IEEE Trans. Parallel Distrib. Syst., pages 385–398, 1990.
[123] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a Many-core x86 Architecture for Visual Computing. In ACM SIGGRAPH 2008, pages 1–15, 2008.
[124] A. Settle, D. Connors, E. Gibert, and A. Gonzalez. A Dynamically Reconfigurable Cache for Multithreaded Processors. J. Embedded Comput., 2(2):221–233, 2006.
[125] J. Shao and B. Davis. A Burst Scheduling Access Reordering Mechanism. In HPCA ’07: Proc. of the 13th Int. Symp. on High-Performance Comp. Arch., 2007.
[126] J. Shao and B. T. Davis. The Bit-Reversal SDRAM Address Mapping. In SCOPES ’05: Proc. of the 2005 Workshop on Software and Compilers for Embedded Systems, pages 62–71, 2005.
[127] G. Sindre, L. Natvig, and M. Jahre. Experimental Validation of the Learning Effect for a Pedagogical Game on Computer Fundamentals. IEEE Transactions on Education, 52(1):10–18, Feb. 2009.
[128] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. SIGARCH Comput. Archit. News, 20(1):5–44, 1992.
[129] K. Skadron, M. Martonosi, D. I. August, M. D. Hill, D. J. Lilja, and V. S.