CONCLUSIONS

Throughput performance on data parallel processors such as GPUs is fundamentally limited by chip power budgets. Therefore, tolerance of memory and functional unit latency, the key enabler for throughput performance, must be designed with energy efficiency in mind. However, prior latency tolerance techniques are either too complex or suffer performance and energy pitfalls on code patterns that commonly occur at application runtime.

This dissertation proposes a novel decoupled architecture developed specifically for high energy efficiency. Because traditional decoupled architectures have energy consumption and performance pitfalls, techniques are developed to extract more strand parallelism, implement control speculation, and enable a single decoupled instruction stream. Additionally, hybrid latency tolerance techniques that leverage both multithreading and decoupling are developed to provide robust performance and energy efficiency. Multithreading and decoupling in isolation each have performance pitfalls on different code patterns commonly found in data parallel workloads; a hybrid of the two can avoid these pitfalls and improve energy efficiency significantly.

High-fidelity performance and physical design models are leveraged to perform a comprehensive design space exploration that compares the energy efficiency of common latency tolerance techniques on a 1024-core data parallel processor. Designing the decoupled architecture specifically for energy efficiency yields robust energy efficiency across a wide range of code patterns. The proposed decoupled architecture improves energy efficiency over other techniques by 28% to 89% on data parallel benchmarks. A hybrid of multithreading and decoupling can improve energy efficiency by another 14% on average across data parallel benchmarks.

