CONCLUSIONS
Throughput performance on data parallel processors such as GPUs is fundamentally limited by chip power budgets. Therefore, memory and functional unit latency tolerance, the key enabler for throughput performance, must be designed with energy-efficiency in mind. However, prior latency tolerance techniques are either too complex or suffer performance and energy pitfalls when commonly occurring code patterns arise during application runtime.
This dissertation proposes a novel decoupled architecture developed specifically for high energy-efficiency. Because traditional decoupled architectures have energy consumption and performance pitfalls, techniques are developed to extract more strand parallelism, implement control speculation, and enable a single decoupled instruction stream. Additionally, hybrid latency tolerance techniques that leverage both multithreading and decoupling are developed to provide robust performance and energy-efficiency. While multithreading and decoupling in isolation each have performance pitfalls on different code patterns commonly found in data parallel workloads, a hybrid latency tolerance technique can avoid these pitfalls and significantly improve energy-efficiency.
High-fidelity performance and physical design models are leveraged to perform a comprehensive design space exploration to compare the energy efficiency of common latency tolerance techniques on a 1024-core data parallel processor. By designing a decoupled architecture specifically for energy efficiency, robust energy-efficiency across a wide range of code patterns is achieved. The proposed decoupled architecture improves energy-efficiency over other techniques by 28% to 89% on data parallel benchmarks. A hybrid of multithreading and decoupling can improve energy-efficiency by another 14% on average across data parallel benchmarks.