Heterogeneous Multi-threading

9.3 Future Work

9.3.4 Heterogeneous Multi-threading

RPPM uses a new version of StatStack to model the memory behavior of multi-threaded applications. To model cache interference in the private and shared caches, it records memory operations during the profiling phase, in the order in which they execute on the profiling machine. This ordering is later used to model write invalidation in the private caches and positive or negative interference in the shared cache. As a consequence, RPPM cannot accurately model heterogeneous multicore systems, where the ordering of the memory operations may differ substantially from the ordering observed during profiling.
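To illustrate why the recorded ordering matters, the following is a minimal sketch of counting write invalidations from a globally ordered trace. It is not RPPM's or StatStack's implementation: the `Access` record, the `count_write_invalidations` function, and the assumption of private caches without capacity evictions are all hypothetical simplifications.

```python
from collections import Counter, namedtuple

# Hypothetical record for one profiled memory operation; `order` is its
# position in the global ordering observed during profiling.
Access = namedtuple("Access", ["order", "thread", "address", "is_write"])


def count_write_invalidations(accesses):
    """Count, per thread, accesses that find the thread's private copy
    invalidated by another thread's write (assumes no capacity evictions)."""
    valid = {}        # address -> threads holding a valid private copy
    invalidated = {}  # address -> threads whose copy was invalidated
    misses = Counter()
    for acc in sorted(accesses, key=lambda a: a.order):
        holders = valid.setdefault(acc.address, set())
        lost = invalidated.setdefault(acc.address, set())
        if acc.thread in lost:
            # The line was present earlier but a remote write removed it.
            misses[acc.thread] += 1
            lost.discard(acc.thread)
        if acc.is_write:
            # A write invalidates every other thread's valid copy.
            lost.update(holders - {acc.thread})
            holders.clear()
        holders.add(acc.thread)
    return misses


# The same three operations under two different global orderings.
write_between_reads = [
    Access(0, thread=0, address=0x40, is_write=False),
    Access(1, thread=1, address=0x40, is_write=True),
    Access(2, thread=0, address=0x40, is_write=False),
]
write_after_reads = [
    Access(0, thread=0, address=0x40, is_write=False),
    Access(1, thread=0, address=0x40, is_write=False),
    Access(2, thread=1, address=0x40, is_write=True),
]
print(count_write_invalidations(write_between_reads)[0])  # 1
print(count_write_invalidations(write_after_reads)[0])    # 0
```

The two traces contain identical operations, yet only the first yields an invalidation for thread 0. This is the sensitivity to ordering that makes a profile recorded on one configuration unreliable for another.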

To accurately model such heterogeneous systems, a solution has to be developed that does not depend on the ordering of the memory operations observed during profiling. To predict write invalidation without ordering the samples, we need to sample all accesses to the same address. One possible approach to do this without an infeasible increase in the number of samples is to use a different sampling technique. StatStack currently uses random sampling. With a pre-run, however, we could possibly identify interesting memory operations and sample these during the profiling phase. When modeling write invalidation, the memory accesses can then be reordered as if they were executing on the configuration being modeled, rather than on the configuration on which the profiling was done.
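The reordering step could be sketched as follows. This is only an illustration under strong assumptions that are not part of the original text: each sample carries its position in its own thread's instruction stream, each thread runs at a constant cycles-per-instruction (CPI) rate on its target core, and synchronization is ignored. The names `Sample`, `reorder_for_target`, and `remote_write_precedes_read` are hypothetical.

```python
from collections import namedtuple

# Hypothetical sample: `instr_index` is the position of the access in its
# own thread's instruction stream, which does not depend on the machine.
Sample = namedtuple("Sample", ["thread", "instr_index", "address", "is_write"])


def reorder_for_target(samples, cpi_per_thread):
    """Order samples by their estimated cycle on the target configuration.

    Assumes a constant CPI per thread (a crude stand-in for a real
    per-core performance prediction); ties are broken by thread id.
    """
    def estimated_cycle(sample):
        return sample.instr_index * cpi_per_thread[sample.thread]

    return sorted(samples, key=lambda s: (estimated_cycle(s), s.thread))


def remote_write_precedes_read(ordered, reader, address):
    """Return True if a read by `reader` follows another thread's write."""
    seen_remote_write = False
    for sample in ordered:
        if sample.address != address:
            continue
        if sample.is_write and sample.thread != reader:
            seen_remote_write = True
        elif not sample.is_write and sample.thread == reader and seen_remote_write:
            return True
    return False


# All sampled accesses to one address: thread 0 reads, thread 1 writes.
samples = [
    Sample(thread=0, instr_index=100, address=0x40, is_write=False),
    Sample(thread=1, instr_index=150, address=0x40, is_write=True),
]

homogeneous = reorder_for_target(samples, {0: 1.0, 1: 1.0})
heterogeneous = reorder_for_target(samples, {0: 2.0, 1: 1.0})  # thread 0 on a slower core

print(remote_write_precedes_read(homogeneous, reader=0, address=0x40))    # False
print(remote_write_precedes_read(heterogeneous, reader=0, address=0x40))  # True
```

Because the samples are keyed by per-thread instruction position rather than by a recorded global order, the same profile can be replayed against different per-core speeds. On the slower core, thread 0's read moves behind thread 1's write.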

