Sambamba - Efficient runtime systems for speculative parallelization

The Sambamba framework contains the work of different authors, and many parts are unstable or unusable in their current form. We therefore decided not to publish the source code of Sambamba yet. Interested research groups can, however, get access to the source code subject to a corresponding disclaimer.

In order for Sambamba to effectively support speculation, a few changes are needed:

Adaptive switching between runtime systems. As shown in the individual evaluation sections, no single runtime system for thread-level speculation provides the best performance in all cases. In particular, it is hard to decide a priori which granularity performs best for K-TLS+. Sambamba already supports dynamically generating the parallel code per section for any of the presented runtime systems. What is missing, though, is a feedback mechanism that iteratively tries to find the best runtime system per section, potentially even considering the input values that determine the behaviour of the parallel tasks.
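One possible shape for such a feedback mechanism is sketched below. It is only an illustration under stated assumptions, not part of Sambamba: the class `RuntimeSelector`, the helper `simulate`, and the candidate labels `rt_a`, `rt_b`, `rt_c` are hypothetical, and the measured durations are simulated constants. Each section first tries every candidate runtime system a fixed number of times and then keeps the one with the lowest mean duration.

```python
from collections import defaultdict


class RuntimeSelector:
    """Picks a runtime system per parallel section from measured durations.

    Hypothetical sketch: all names here are assumptions for illustration.
    """

    def __init__(self, candidates, trials_per_candidate=2):
        self.candidates = list(candidates)
        self.trials = max(1, trials_per_candidate)
        # section id -> candidate -> list of measured durations
        self.samples = defaultdict(lambda: defaultdict(list))

    def choose(self, section):
        stats = self.samples[section]
        # Exploration: run every candidate a fixed number of times first.
        for cand in self.candidates:
            if len(stats[cand]) < self.trials:
                return cand
        # Exploitation: keep the candidate with the lowest mean duration.
        return min(self.candidates,
                   key=lambda c: sum(stats[c]) / len(stats[c]))

    def record(self, section, candidate, duration):
        self.samples[section][candidate].append(duration)


def simulate(selector, section, cost, rounds):
    """Drive the selector with fixed, made-up per-candidate durations."""
    for _ in range(rounds):
        cand = selector.choose(section)
        selector.record(section, cand, cost[cand])
    return selector.choose(section)


selector = RuntimeSelector(["rt_a", "rt_b", "rt_c"])
winner = simulate(selector, "section_1",
                  {"rt_a": 3.0, "rt_b": 1.0, "rt_c": 2.0}, rounds=6)
print(winner)
```

Because the statistics are kept per section, two sections can settle on different runtime systems. This sketch never re-explores after the initial trials; a mechanism that also reacts to changing input values would need to key the samples on some input feature and repeat the exploration phase from time to time.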

Include speculation in ILP-based parallelization. Sambamba uses integer linear programming (ILP) to find the optimal parallel schedule per set of basic blocks with the same control dependencies. To better detect speculative parallelization opportunities, this ILP formulation would need to be extended so that it can ignore certain dependencies. Information from previous runs can be used to make a better-informed decision here.
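The trade-off behind ignoring a dependency can be illustrated on a toy instance. The sketch below is not Sambamba's ILP formulation and does not use an ILP solver: to stay self-contained it enumerates all subsets of ignorable dependencies by brute force. The cost model is an assumption made for illustration: the schedule length is the critical path with unlimited cores, and ignoring a dependency costs the probability that it occurred in previous runs times the duration of the dependent task, which would have to be re-executed. The functions `critical_path` and `best_speculation` are hypothetical names.

```python
from itertools import combinations


def critical_path(durations, deps):
    """Longest dependency chain, i.e. the makespan with unlimited cores.

    Assumes the dependencies form an acyclic graph.
    """
    finish = {}

    def finish_time(task):
        if task not in finish:
            start = max((finish_time(p) for p, s in deps if s == task),
                        default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(t) for t in durations)


def best_speculation(durations, hard_deps, soft_deps):
    """Choose which soft dependencies to ignore.

    soft_deps maps (pred, succ) to the fraction of previous runs in
    which the dependency actually occurred (assumed profile data).
    Returns (expected cost, set of ignored dependencies).
    """
    soft = list(soft_deps)
    best = None
    for k in range(len(soft) + 1):
        for ignored in combinations(soft, k):
            kept = list(hard_deps) + [d for d in soft if d not in ignored]
            # Assumed misspeculation penalty: re-run the dependent task.
            penalty = sum(soft_deps[(p, s)] * durations[s]
                          for p, s in ignored)
            cost = critical_path(durations, kept) + penalty
            if best is None or cost < best[0]:
                best = (cost, set(ignored))
    return best


durations = {"A": 4, "B": 4, "C": 1}
cost, ignored = best_speculation(
    durations,
    hard_deps=[("B", "C")],
    soft_deps={("A", "B"): 0.1},  # dependency seen in 10% of earlier runs
)
print(cost, ignored)
```

In this instance, respecting the rarely occurring dependency from A to B gives a sequential chain of length 9, whereas ignoring it shortens the critical path to 5 at an expected penalty of 0.4. In an ILP, the same choice could be modelled with one binary variable per ignorable dependency; the enumeration above only scales to a handful of dependencies.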

