
Compiler and Runtime System Optimizations for the Insieme Compiler Infrastructure

PhD thesis in Computer Science

by

Peter Zangerl

Submitted to the Faculty of Mathematics, Computer Science and Physics of the University of Innsbruck

In partial fulfillment of the requirements for the degree of doctor of philosophy

Advisor: Prof. Dr. Thomas Fahringer, Institute of Computer Science

Innsbruck, September 27, 2019


Certificate of Authorship and Originality

I certify that the work in this thesis has not previously been submitted for a degree nor has it been submitted as part of requirements for a degree except as fully acknowledged within the text.

I also certify that the thesis has been written by me. Any help that I have received in my research work and the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Peter Zangerl, Innsbruck, September 27, 2019


Abstract

Developing programs that fully utilize the available computing capabilities of the underlying hardware has always been a challenge. The introduction of high-level programming languages and their optimizing compilers aimed at shielding developers from platform-specific details and fostering the creation of efficient applications. Following the adoption of parallelism in hardware, along with the associated complexity and error potential, specialized task-parallel runtime systems intended to ease the creation of correct and maintainable parallel programs were developed.

Still, most research in this area is focused on just one of these two components of the toolchain – either the compiler or the runtime system. By broadening our view and researching combined compiler and runtime system optimizations, we can tap into further optimization potential. This goal is achieved by forwarding static analysis results obtained during the compile time of a program to the runtime system. Along with dynamic data gathered by the runtime system itself, this information enables improved scheduling decisions and more precise parameter settings within a runtime system.

This thesis describes research in three different areas of combined compiler and runtime system optimizations. First, we investigate the cache and performance impacts of code multi-versioning for parallel programs. Second, we study and develop runtime system parameter optimization by means of compiler analyses. Third, we research compiler analyses for progress estimation of parallel programs.

We base our work on the Insieme compiler and runtime system, which offers a source-to-source compiler with a parallelism-aware intermediate representation fostering the development of high-level analyses and transformations. In combination with Insieme's parallel runtime system, this enables us to research novel ideas and exploit additional optimization potential for parallel programs along the whole toolchain.


Acknowledgements

There are several people I would like to thank for their continuous support during the course of my work on this PhD thesis.

Foremost, I would like to thank Prof. Dr. Thomas Fahringer for his guidance and advice throughout the years I have been working in the DPS group. His knowledge and experience in the field contributed significantly to the direction of my research and eventually this thesis.

Additionally, I would also like to thank the external reviewers of this thesis, as well as all peer reviewers of the publications I made in the course of my work. The input and feedback we receive from them often plays a major role in the quality of our work.

I would also like to acknowledge the efforts of the countless people who contributed to all the software components I used during my work. Most of the software is open source, and I think that the countless developers working on these projects rarely get the acknowledgement they deserve for their invaluable work, which immensely eases the daily life for so many people.

Working on a thesis for such a long time would have been a very dull and boring task if it had not been for all the discussions and fruitful talks regarding work, research, technology, and also everyday life I had with my co-workers. Sharing this time during and even after work with them made it not only productive but also worthwhile and entertaining. For this reason I would like to thank all of my current as well as my former co-workers from the Insieme team, as well as my more or less regular lunch group.

Among all of my co-workers, a very special thanks goes to Dr. Peter Thoman. Not only did I develop and write all of my published research together with him, but he also guided and mentored me through most of my PhD, and always had an open ear when I encountered new problems.

Last but not least, I would like to express my appreciation to all my friends and family. They encouraged and supported me in all my decisions throughout these years.


Contents

Certificate of Authorship
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Listings
List of Algorithms
List of Acronyms
Glossary

1 Introduction
  1.1 Motivation
    1.1.1 Historical Overview
    1.1.2 Parallel Runtime Systems and Libraries
  1.2 Our Approach
  1.3 Thesis Scope
  1.4 Research Objectives
  1.5 Research Areas
  1.6 State-of-the-Art
    1.6.1 Code Multi-Versioning Impacts
    1.6.2 Runtime System Optimization by Compiler Analysis
    1.6.3 Compiler Generated Progress Estimation
  1.7 Contributions
  1.8 List of Publications
  1.9 Organization

2 The Insieme Compiler and Runtime System
  2.1 Overview
  2.2 Architecture
    2.2.1 Compiler
    2.2.2 INSPIRE
    2.2.3 Runtime System
  2.3 Extensions to Insieme
  2.4 Related Publications

3 Code Multi-Versioning Impacts
  3.1 Introduction
  3.2 Method
    3.2.1 Parameter Space
    3.2.2 Version Generation
  3.3 Characterization
    3.3.1 Evaluation Platforms and Setup
    3.3.2 Hardware Evaluation
    3.3.3 Algorithmic Evaluation
    3.3.4 Tuning Runtime Search Strategies
  3.4 Related Work
  3.5 Summary

4 Runtime System Optimization by Compiler Analysis
  4.1 Introduction
    4.1.1 Motivation
  4.2 Method
    4.2.1 Runtime System
    4.2.2 Compiler Analysis
    4.2.3 Result Aggregation
  4.3 Evaluation
    4.3.1 Evaluation Platform and Setup
    4.3.2 Benchmarks
    4.3.3 Quality of Analysis
    4.3.4 Benchmark Performance Evaluation
    4.3.5 State-of-the-Art Comparison
  4.4 Related Work
  4.5 Summary

5 Compiler Generated Progress Estimation
  5.1 Introduction
    5.1.1 Motivation
  5.2 Method
    5.2.1 Compiler Component
    5.2.2 Compiler Backend
    5.2.3 Runtime System
  5.3 Evaluation
    5.3.1 Evaluation Setup
    5.3.2 Benchmarks
    5.3.3 Estimation Overhead
    5.3.4 Estimation Accuracy
  5.4 Related Work
  5.5 Summary

6 Conclusions and Future Work
  6.1 Major Results and Observations
  6.2 Future Work

Bibliography

List of Figures

1.1 High-level approach overview
2.1 Insieme architecture overview
2.2 Example INSPIRE tree structure
2.3 Insieme runtime system overview
2.4 Insieme extensions
3.1 Binary code size increase per version on the Intel machine
3.2 Intel system L1i cache misses, single-threaded
3.3 Intel system L2i cache misses, single-threaded
3.4 PowerPC system L1i cache misses, single-threaded
3.5 Intel system normalized wall-clock time, single-threaded
3.6 PowerPC system normalized wall-clock time, single-threaded
3.7 Intel system normalized wall-clock time, single-threaded
3.8 Intel system L1 instruction cache misses, single-threaded
3.9 Intel system normalized wall-clock time, single iteration
3.10 PowerPC system normalized wall-clock time, two iterations
3.11 AMD system normalized wall-clock time, two iterations
3.12 Random strategy: worst-case wall-clock time impact
3.13 Converging strategy: worst-case wall-clock time impact
3.14 Dynamic strategy: worst-case wall-clock time impact
3.15 PowerPC system L1i cache misses, single-threaded
3.16 PowerPC system normalized wall-clock time, single-threaded
4.1 Execution time of the Strassen benchmark
4.2 Memory consumption of the Strassen benchmark
4.3 Overview of our method
4.4 Behavior of available queue policies
4.5 Fundamental parallel program structures
4.6 Example INSPIRE address tree structure
4.7 Execution time improvement comparison box plot
4.8 QAP2 execution times with varying queue size
4.9 Overall memory consumption comparison box plot
4.10 Floorplan execution times with optimal parameter buffer size
4.11 Performance charts for the Fibonacci benchmark
4.12 Performance charts for the Strassen benchmark
4.13 Performance charts for the SparseLU benchmark
4.14 Performance comparison of all evaluated technologies
5.1 Comparison of different progress reporting methods
5.2 Method overview for our automatic progress estimation
5.3 Overheads by benchmark
5.4 Accuracy by benchmark
5.5 Accuracy by number of threads

List of Tables

1.1 Thesis organization
3.1 Evaluation platforms hardware and software setup
4.1 Tuned parameters with all evaluated values
4.2 INSPIRE constructs related to task parallelism
4.3 Benchmark overview
4.4 Benchmark properties
4.5 Default, optimum and achieved execution times by benchmark
4.6 Minimum execution time comparison for different approaches
5.1 Requirements and features of progress estimation methods
5.2 Benchmark overview

List of Listings

2.1 Example input code
3.1 Code template used for version generation
4.1 Example C++ code snippet
5.1 Runtime system API for progress reporting

List of Algorithms

4.1 Task context identification
4.2 Determine recursive parallel paths
4.3 Determine loop-like parallelism
4.4 Effort estimation
4.5 Stack size estimation
4.6 Closure size computation
5.1 Progress report generation
5.2 Parallelism handling

List of Acronyms

API Application Programming Interface
CPU Central Processing Unit
DAG Directed Acyclic Graph
DVFS Dynamic Voltage and Frequency Scaling
GCC GNU Compiler Collection
GPL GNU General Public License
GPU Graphics Processing Unit
HPC High Performance Computing
INSPIRE Insieme Parallel Intermediate Representation
IR Intermediate Representation
JIT Just-in-Time
L1 Level 1 Cache
L1d Level 1 Data Cache
L1i Level 1 Instruction Cache
L2 Level 2 Cache
L3 Level 3 Cache
MPI Message Passing Interface
MSE Mean Squared Error
NUMA Non-Uniform Memory Access
SMT Simultaneous (Hardware) Multi-Threading

Glossary

Application Programming Interface (API)

In the context of a software library, a set of functions and data structures offered by the library which can be used by a client program. This interface specifies the behavior of the library and all offered functionality. Thus, an API serves as an abstraction layer between different software components. There might be several different implementations offering the same interface and thus adhering to the same specification.

Code Multi-Versioning

A technique for program optimization where the compiler generates multiple versions of certain program regions tuned for different optimization goals, allowing the runtime system to select among these versions during execution time. This selection can be done according to the current state of the program or available resources.

Compiler

A program translating code (typically program code) from a source language to a target language. In most cases, a compiler will translate program source code into machine code, which can then be executed directly by the CPU.

Source-to-Source Compiler

A compiler which translates source code into source code again, rather than generating executable machine code. Such a compiler can also apply analyses or transformations along the way.

Hardware Parallelism

Parallelism provided directly by hardware entities within a system, ranging all the way from parallel functional units within a single CPU core to parallel systems interconnected via a network.

Non-Uniform Memory Access (NUMA)

A hardware architecture where not all of the memory available within a system can be accessed with the same latency or bandwidth by every processing unit. Most modern systems with multiple cores per CPU or multiple CPUs fall into this category.

Shared Memory Architecture

Computer architecture where the available system memory can be accessed by all CPUs via the same address space. This is typically the case for single systems with one or more CPUs, but there also exist software and hardware solutions which enable the creation of a large virtually shared address space over a network. Programming parallel applications for shared memory systems is often easier, as the underlying hardware is responsible for the necessary communication of memory updates and synchronization to ensure coherence.

Distributed Memory Architecture

Computer architecture where each CPU can only access its own memory and has to use other means of communication (usually over some network) to exchange data with other processes. Programming such a system is normally done with the help of specialized libraries which handle the communication and exchange of data.

Intermediate Representation (IR)

A data structure used internally by a compiler to represent the source code of the program to be translated. This representation is specifically designed to be suitable for transformations and analyses during the compilation process.

Just-in-Time (JIT) Compilation

A technique for program optimization where the executed code is recompiled during program run-time with modifications intended to optimize for the current program environment and behavior.

Mean Squared Error (MSE)

A measure of the quality of an estimator in statistics. The MSE is calculated as follows:

    MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2

where Y_i is the value returned by the estimator and \hat{Y}_i the ideal expected value for the i-th of n estimates. MSE values are always positive numbers, with values closer to zero being better.

Memory Gap

The gap between CPU speed and memory access speed. As CPU speeds increase much more quickly than memory access speeds, this gap has been growing for decades.


Runtime System

A software component responsible for the orchestration of the execution of a program along with the management of the available hardware resources. Especially useful for parallel programs.

Task Parallelism

A popular concept to parallelize software. Task-parallel runtime systems or libraries offer the ability to submit a piece of work – a task – to be computed in parallel to the main control flow and provide facilities to specify dependencies between them or wait for the completion of these tasks.


Chapter 1

Introduction

1.1 Motivation

Writing programs that achieve good performance and utilize the hardware effectively has always been a challenge. The introduction of parallelism in hardware architectures at various levels further complicated this process.

For this reason, compiler technologies as well as parallel runtime systems and libraries have been the target of many research efforts with the goal of easing development, improving developer productivity and maximizing execution performance.

However, this research is often limited to only one part of the whole toolchain – either the compiler or the runtime system. By applying a combined approach to both components at the same time, we can tap into further, otherwise unattainable optimization potential. In this thesis, we define a set of research areas with promising potential for combined compiler and runtime system optimizations, present novel solutions that take advantage of these optimizations, and evaluate their effectiveness in comparison to other state-of-the-art systems.

1.1.1 Historical Overview

Thanks to the introduction of high-level programming languages like C and Fortran, writing efficient programs that effectively utilize the available hardware became possible without too much hassle. These languages enable developers to focus on solving a problem by specifying what the processor should accomplish, without having to worry too much about how this will be achieved or focus on low-level details. The compilers for these languages were responsible for the translation of the high-level language to efficient machine code and for applying optimizations without changing program semantics.

With processor speeds increasing and memory access speeds not keeping up with this development [16, 50, 83, 17], newly introduced CPU features such as caches, out-of-order execution, and instruction-level parallelism were added to bridge or at least mitigate this ever-growing memory gap. The addition of these elements however again complicated the work of software developers and called for novel compiler optimizations capable of dealing with the newly introduced processor features, in order to achieve good performance and hardware utilization.

Once it became apparent that increasing processor speeds further would eventually hit physical limitations, the hardware industry sought different ways to keep up with the constant demand for faster systems. Hardware parallelism was chosen as the new way to achieve higher performance in processors: initially with the introduction of multiple CPUs in a single system and later on with multiple processor cores per CPU [32, 4]. However, with the introduction of parallel hardware architectures, new levels of complexity as well as additional error potential in software development emerged [70, 69].

Initially, these early parallel systems were programmed using multiple concurrent processes and later on with low-level threading primitives offered by the operating system. However, this approach turned out to have several drawbacks: i) it is very cumbersome to parallelize programs using these primitives, and the resulting code is hard to write as well as hard to maintain; ii) the programmer needs to take care of correct data distribution and synchronization, which causes additional burden and complexity; and iii) tuning the program performance often leads to over-fitting of the application to the system that was used to develop the program, and thus to source code that does not perform well on a different hardware setup.

1.1.2 Parallel Runtime Systems and Libraries

The aforementioned drawbacks motivated the development of specialized parallel runtime systems and libraries intended to ease the creation of correct, maintainable and portable parallel programs [11, 10, 64]. For ease of use, most of these systems either focus completely on, or at least provide some form of, task-based parallelism. These systems offer the ability to submit a piece of work – a task – to be computed in parallel to the main control flow and provide facilities to specify dependencies between them or wait for the completion of these tasks.

By using such a task-parallel system, developers could now more easily leverage the potential of parallel systems without having to consider all the low level details and peculiarities of a given hardware platform or software stack. As a consequence, this development moved parts of the responsibility for efficient hardware usage away from application developers to the runtime system and library designers.

For efficient software development, however, it turned out to be advantageous for the compilers to also participate in this process. This observation led to the creation of new programming languages, dialects and variants that all feature compiler modifications or extensions designed to translate higher-level parallelism concepts to code leveraging their respective runtime systems.

The popularity of task-based parallelization approaches is exemplified by the large number of programming languages, runtime systems and libraries providing support for task parallelism with widespread use in academia and industry – several of these also featuring a compiler component along with a specialized runtime system or library. Popular examples are: OpenMP [24], StarPu [5], the whole StarSs family [54] with all its flavors (OmpSs [34], CellSs, SMPSs, . . . ), X10 [19], Chapel [18], Cilk [12] and many more. Thoman et al. [74] give a good overview of various task-parallel systems with their respective focus and properties.

1.2 Our Approach

Two main software components play an important role in the efficient execution of task-parallel programs: a compiler component translating the program source code into a binary, and a runtime system executing the translated program on the desired target hardware. As a consequence, both of these systems have been the target of much research and development in order to exploit the possible optimization potential.

From the perspective of a traditional compiler, a task-parallel program is mostly the same as any other sequential program, and it is translated into machine code by applying the same optimizations. Even though parallel runtime systems often require a specialized compiler component, the sole responsibility of these compilers is mostly to map certain pragmas or markers to specific library function calls. As the compiler lacks the information on the actual environment and dynamic program behavior under which the task-parallel runtime system will be operating, there is not much it can do to further optimize the translated executable.

The runtime system on the other hand can take advantage of its detailed knowledge about the underlying hardware platform to improve program performance during execution. It has all the dynamic information needed for better decision making regarding the execution of the program and can also monitor its behavior. Still, it is limited to running the executable that has been generated by the compiler and normally cannot apply any changes to the structure of the program or its low-level optimizations for further performance improvements. Additionally, tuning the runtime system's internal parameters is also a challenging task, as the optimal settings often highly depend on the behavior of the executed program.

One way to address these shortcomings and bridge this gap is to make static information about the source code that was analyzed by the compiler available to the runtime system. This enables the runtime system to perform actions like selecting between several specialized program versions or making better scheduling decisions for improved execution performance based on the program behavior, system environment and user-defined optimization priorities.

Figure 1.1: High-level overview of the information flow in combined compiler and runtime system research.

The goal of the research in this thesis is centered on combined compiler and runtime system optimizations for improved task-parallel program execution performance. This is carried out in the context of the Insieme compiler and runtime system, a source-to-source compiler and runtime system developed at the University of Innsbruck. The Insieme compiler supports parallel C and C++ programs as inputs and translates them into a parallelism-aware, high-level intermediate representation (IR). This IR can then be analyzed and transformed, before it is translated back to code suitable for running within the Insieme runtime system. The runtime system is then responsible for the parallel execution on the given target hardware.

Crucially, it is also possible to influence several runtime system parameters directly from within the compilation phase, as well as to pass additional meta-information to the runtime system. This allows us to perform research on the additional optimization potential achievable by this tight integration between a compiler and a parallel runtime system.

Figure 1.1 depicts a general overview of the information flow in such a combined compiler and runtime system approach. The compiler component (which is split into two phases here, but could also be a single component) can analyze the input code and apply transformations to it. With the knowledge about the static structure of the program, the compiler can apply optimizations, transform code or add additional meta-data and thus forward this knowledge to the runtime system. Additionally, it can also influence runtime system settings and parameters directly based on its analysis results.

After the compilation by the backend compiler, the runtime system is responsible for the efficient execution of the program on the target platform by considering available hardware resources and adapting to the dynamic program behavior of the executed application. In this process, it can also base decisions on the information gathered during the compilation phase and thus improve overall program performance.

1.3 Thesis Scope

The scope of this thesis specifically includes:

• Programs running on homogeneous multi-core shared-memory systems. This also includes multi-socket systems featuring several CPUs and thus also Non-Uniform Memory Access (NUMA) systems. While our approaches would in principle also work for heterogeneous shared-memory systems, we did not focus on them due to their (still) lower relevance in the high performance computing (HPC) domain.

• Programs employing loop- or task-based parallelism. More specifically, programs that are parallelized using one of the input languages supported by the Insieme compiler (OpenMP or Cilk at the moment).

• The compilation and program run-time of parallel applications. Our research does not cover other parts of the software development lifecycle like the design, development, testing, debugging or tuning of programs.

Regarding the hardware architectures, accelerator-computing as well as distributed-memory computing were not researched in the course of this thesis, since the Insieme compiler and runtime system we based our research on does not offer support for those at the moment. Adding support for these architectures as well as additional input languages would have exceeded the scope of this thesis.

1.4 Research Objectives

The main research objectives covered by this thesis can be summarized as follows:

Research Objective 1 (O1)

Investigate cache and performance impacts of code multi-versioning for parallel programs


Research Objective 2 (O2)

Study and develop runtime system optimization by means of compiler analyses

Research Objective 3 (O3)

Research compiler analyses for (parallel) program progress estimation

1.5 Research Areas

In order to investigate the research objectives specified above, this thesis will cover research in the following areas:

Code Multi-Versioning

One pattern often employed to allow for more flexibility and optimized program execution in parallel runtime systems is code multi-versioning. The compiler generates multiple semantically equivalent versions of the same code section that are all optimized for distinct runtime characteristics or input-data dependent program features (e.g. different loop-unrolling factors or loop-tiling thresholds). During program execution, the runtime system then selects among the different versions available in the program binary.

The decision which version to execute might depend on the current state of the runtime system, available hardware components, system workload or input data characteristics.

Runtime Parameter Selection

Most parallel runtime systems have several tunable parameters that influence program performance in one way or another. The optimal values for these parameters highly depend on the executed programs as well as the targeted hardware platform and input data. Finding the best values is often achieved manually by performing several profiling runs of the application with different configurations and selecting good candidates. Additionally, the runtime systems themselves might adapt their parameters during program execution as a reaction to certain events or internal monitoring.


Progress Estimation

In order to perform adaptations to runtime system internal parameters and optimize execution performance, the runtime system needs some kind of feedback mechanism on whether a particular change was beneficial to program performance or not. This feedback is needed to decide if the parameter change should be reverted again, or whether the parameter might be changed even more in the same direction for improved performance. To get this feedback information, there are several different ways for a runtime system to reason about the progress of the executed program.

1.6 State-of-the-Art

There is a large body of related work on the optimization of compilers for parallel programs, task-parallel runtime systems and the combination of both.

In this section we only give an overview of the related work covering our research presented in Chapters 3, 4 and 5. The related work sections in the respective chapters will then go into more detail.

1.6.1 Code Multi-Versioning Impacts

Code multi-versioning offers a promising way to improve program performance by generating multiple versions for the (presumed) most computationally demanding functions, each specialized by the compiler for different environmental characteristics like cache sizes, input-data properties or loop unrolling factors. Several works try to improve this process by generating better targeted code versions [68, 31, 23, 59] or reducing binary size increase by focusing on more promising code variants [88]. Others focus on the selection process during program runtime and aim to improve this by adapting to the environment [26], pruning the search space [39], or applying machine learning to achieve better results [57, 21].

Open Problems

Even though the use of code multi-versioning as a tool to improve program performance has been the target of previous research, it is still not completely clear what the negative consequences of multi-versioning are. Most of the research referenced above tries to optimize application performance by generating better code versions at compile time, or finding good versions at program runtime more efficiently.

However, packing a high number of versions for the same function (or even worse – for multiple functions) into the program binary not only results in more flexibility and tuning potential for the runtime system, but as a side effect also yields a program binary of (potentially significantly) increased size.

Still, the actual side effects of increased binary size on execution time and cache behavior have largely remained unexplored until now. Our work [86] presented in Chapter 3 characterizes and evaluates the side effects of larger binary size, different version properties, selection strategies and binary layout on different hardware platforms. Additionally, we also develop a tuning strategy that can be employed to reduce the negative consequences of code multi-versioning.

1.6.2 Runtime System Optimization by Compiler Analysis

Another possibility of taking advantage of a combined compiler and runtime system approach is runtime parameter selection based on compiler analysis.

By applying specialized analyses to the program source code, a compiler can gather static information about program behavior which is inaccessible to the runtime system. Making these results available to the runtime system can enable it to make better decisions or perform additional optimizations.

There is a large body of work with a focus on these combined compiler and runtime system approaches. These range from automatic parallelization of codes [9], through task granularity control and overhead reduction [58, 75, 81], to more efficient scheduling decisions based on the characteristics of the input code [20, 82, 62, 35].

Open Problems

The related work mentioned above mostly focuses on forwarding compiler analysis results to the runtime system for improved decision making and execution optimization. All of these combined dynamic approaches share the inherent drawback that decisions made at program runtime incur a performance overhead. Even though this cost can be minimized by careful runtime system design, it can never be completely eliminated.

Also, certain parameters need to be set at compile time of the program binary in order to exploit the full optimization potential.

Our focus is to perform static analyses of the program source code in the compiler and to configure compile-time constant parameters for program binary creation according to the analysis results. By taking advantage of the parallelism-aware intermediate representation of the Insieme compiler and runtime system, we are able to develop new analyses that statically set these compile-time constant parameters based on program features like parallelism structure or task granularity. Our work [80] detailed in Chapter 4 presents static analyses that are also able to reduce memory consumption in addition to achieving program performance improvements.


1.6.3 Compiler Generated Progress Estimation

Most related work in the area of progress estimation within parallel runtime systems focuses on monitoring hardware counters to reason about an application's progress. This information is often used to make scheduling decisions [84, 2, 37] or reduce energy consumption [56]. In addition, some research does not rely on hardware counters alone, but also considers multiple different metrics like input/output events or inter-process communication [42, 40].

Open Problems

A majority of the approaches in the literature only use indirect metrics to reason about the progress of an application running within a runtime system. This poses a problem, as there might not be a direct relation between the employed metrics and actual program progress. Steere et al. [67] recognize the need for direct feedback, and suggest an interface for the application to report on its progress as a solution. This reporting can then be used to make better scheduling decisions within the runtime system.

Achieving good reporting accuracy by using such interfaces, however, requires a considerable development effort and especially a very good understanding of the input program. Our solution to this dilemma, presented in Chapter 5 and to appear in [87], offers a compiler-based approach that can insert the required reporting calls into the program source code automatically. These reports enable the runtime system to track the progress of the running application with high accuracy and thus obtain a direct feedback mechanism to validate parameter tuning decisions.

1.7 Contributions

This section describes the contributions presented in this thesis for the research areas outlined in Section 1.5.

Code Multi-Versioning

In the area of code multi-versioning, our contributions include a definition and categorization of the parameter space potentially affecting multi-versioning performance, as well as a set of utilities to explore and evaluate it. We also perform an in-depth characterization of the actual performance impact in terms of wall-clock time and highly relevant CPU metrics across three distinct hardware platforms. Finally, we analyze the instruction cache behavior of common runtime optimizations and propose a tuning method that can reduce the number of cache misses induced by multi-versioning significantly.


Runtime Parameter Selection

Regarding runtime parameter selection, we develop a method to determine task contexts within parallel programs, perform analyses on each of them, and derive a set of compile-time parameters for a parallel runtime system.

This approach offers a set of six task-specific analyses to determine code features that significantly influence parameter selection, such as the parallel structure or granularity of execution. Our implementation targets five runtime parameters that affect execution time and memory consumption, and we evaluate it on 12 task-parallel programs on a shared-memory parallel system with up to 64 hardware threads. To put our findings into a broader context, we evaluate our approach empirically and compare against four state-of-the-art task-parallel runtime systems.

Progress Estimation

In this area, we present a compiler-based analysis and transformation supporting sequential as well as parallel OpenMP input programs, along with an application programming interface (API) for progress information collection and reporting in the runtime system. We develop a prototype implementation of our approach and evaluate the achieved progress estimation accuracy for eight benchmark applications on a shared memory system running in different configurations, along with a comparison with the use of CPU counters, task throughput metrics and manual code instrumentation.

1.8 List of Publications

This section lists all publications that I contributed to in the course of my PhD studies at the University of Innsbruck. For the publications relevant for this thesis, the list will also include a description of my contributions to the publication, as well as a rough estimate of the extent of this contribution:

• Peter Zangerl, Peter Thoman, and Thomas Fahringer. Characterizing Performance and Cache Impacts of Code Multi-Versioning on Multicore Architectures. In 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 209–213. IEEE, 2017 [86].

Contribution:1 My contributions in this publication are the development of the evaluation framework (20%), conducting the measurements (90%) and analyzing the resulting data (50%) to develop a tuning method (30%) applicable for multi-versioned programs, and finally writing the manuscript (20%).

1The figures in parentheses are a rough estimate of my share of work on the respective part of this publication. As all these publications were team efforts with constant and interactive development within our group, it would be unfeasible to give more exact figures.

• Peter Thoman, Peter Zangerl, and Thomas Fahringer. Task-parallel runtime system optimization using static compiler analysis. In Proceedings of the Computing Frontiers Conference (CF), pages 201–210. ACM, 2017 [79].

Contribution: Here my contributions are the design (30%) and development (90%) of the compiler analyses, conducting the measurements (80%) and evaluating the results (60%) on a set of task-parallel benchmark applications, as well as writing the manuscript (40%).

• Peter Thoman, Peter Zangerl, and Thomas Fahringer. Static Compiler Analyses for Application-specific Optimization of Task-Parallel Runtime Systems. Journal of Signal Processing Systems, pages 1–18, 2018 [80].

Contribution: My contributions to this journal article are the design (20%) and development (90%) of the additional compiler analyses, conducting measurements and comparison with state-of-the-art implementations (80%), as well as writing of the manuscript (40%).

• Peter Zangerl, Peter Thoman, and Thomas Fahringer. Compiler Generated Progress Estimation for OpenMP Programs. To appear in 15th International Conference on Parallel Computing Technologies (PaCT). Springer, 2019 [87].

Contribution: I contributed to the design (50%) as well as the development (90%) of the compiler analyses, conducted the measurements (90%) and evaluated the results (80%), before creating the manuscript (80%).

The following additional publications with my contributions are not part of this thesis:

• Jesus Carretero, Javier Garcia-Blas, David Singh, Florin Isaila, Alexey Lastovetsky, Thomas Fahringer, Radu Prodan, Peter Zangerl, Christi Symeonidou, Afshin Fassihi, and Horacio Pérez-Sánchez. Acceleration of MPI mechanisms for sustainable HPC applications. Supercomputing Frontiers and Innovations, 2(2), 2015 [15].


• Herbert Jordan, Peter Thoman, Peter Zangerl, Thomas Heller, and Thomas Fahringer. A context-aware primitive for nested recursive parallelism. In Euro-Par 2016: Parallel Processing Workshops, pages 149–161, Cham, 2017. Springer International Publishing [49].

• Peter Zangerl, Herbert Jordan, Peter Thoman, Philipp Gschwandtner, and Thomas Fahringer. Exploring the Semantic Gap in Compiling Embedded DSLs. In Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pages 195–201. ACM, 2018 [85].

• Herbert Jordan, Thomas Heller, Philipp Gschwandtner, Peter Zangerl, Peter Thoman, Dietmar Fey, and Thomas Fahringer. The AllScale Runtime Application Model. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pages 445–455. IEEE, 2018 [45].

1.9 Organization

The remainder of this thesis is organized as follows: Chapter 2 presents an overview of the Insieme compiler and runtime system, which provides the foundation for the research results in Chapters 3 to 5. Table 1.1 depicts the connection between our research objectives, the respective thesis chapters which investigate them, as well as our related publications.

Chapter 3 discusses the cache and runtime effects of applying code multi-versioning to task-parallel programs. Following this, Chapter 4 presents a combined compiler and runtime system approach to tune program execution by setting compile-time constant runtime parameters according to compiler analyses. Chapter 5 describes a novel compiler analysis and transformation developed to automatically create progress estimations for task-parallel programs that enable a runtime system to better evaluate its parameter tuning decisions.

Finally, Chapter 6 concludes this thesis with a list of its contributions and potential topics for future work and further research.


Table 1.1: Thesis organization showing the relationship between our research objectives, the respective thesis chapters and relevant publications.

Research Objective   Thesis Chapter   Publication(s)

O1   Chapter 3   Peter Zangerl, Peter Thoman, and Thomas Fahringer. Characterizing Performance and Cache Impacts of Code Multi-Versioning on Multicore Architectures [86]

O2   Chapter 4   Peter Thoman, Peter Zangerl, and Thomas Fahringer. Task-parallel runtime system optimization using static compiler analysis [79]; Peter Thoman, Peter Zangerl, and Thomas Fahringer. Static Compiler Analyses for Application-specific Optimization of Task-Parallel Runtime Systems [80]

O3   Chapter 5   Peter Zangerl, Peter Thoman, and Thomas Fahringer. Compiler Generated Progress Estimation for OpenMP Programs [87]


Chapter 2

The Insieme Compiler and Runtime System

This chapter provides an overview of the Insieme compiler and runtime system [27]. The Insieme system is a combination of a research source-to-source C and C++ compiler and a parallel runtime system, which have been created by the Distributed and Parallel Systems Group at the University of Innsbruck.

The project is developed and published under the GNU General Public License (GPL). All source files of the project including the compiler, the runtime system, unit and integration tests as well as build instructions are available online1.

2.1 Overview

The Insieme project’s mission statement [29] summarizes parts of its goals as follows:

The main goal of the Insieme project of the University of Innsbruck is to research ways of automatically optimizing parallel programs for homogeneous and heterogeneous multi-core architectures and to provide a source-to-source compiler that offers such capabilities to the user.

What is mentioned only indirectly in this statement is the second main component of the project, which is just as important for this thesis – the Insieme runtime system. The combination of a compiler and runtime system enables research on the complete toolchain – all the way from the source code to the running program.

1https://github.com/insieme/insieme


[Figure: block diagram – input code enters the compiler frontend and is translated to the intermediate representation, on which analyses & transformations and data modules operate; the compiler backend emits output code, which together with the runtime system is built into the program binary; the compiler additionally forwards meta-information and runtime parameters to the runtime system.]

Figure 2.1: Insieme architecture overview depicting the main components.

2.2 Architecture

Figure 2.1 gives a high-level architectural overview of the Insieme compiler and runtime system. The frontend of the compiler component is responsible for translating the input program into a parallelism-aware intermediate representation, which can be analyzed and transformed thereafter. This IR is then converted back to source code – our target code – in the backend. Compiling the target code together with the Insieme runtime system eventually yields a program binary for execution. Crucially, the Insieme compiler can also influence compile-time parameters of this backend compilation step, as well as forward additional meta-information to the runtime system. This attached meta-information can be evaluated by the runtime system to improve execution performance.

All contributions of this thesis use or extend the compiler and the runtime system. The relevant changes are described in detail in Chapters 3 to 5.

2.2.1 Compiler

Compiler Frontend

The parsing of C and C++ input programs in the frontend of the Insieme compiler is achieved with the help of the C language family frontend (Clang) [71] of the LLVM project [55]. This pass translates the input program into our intermediate representation INSPIRE. Insieme's compiler frontend features several extensions for supported parallel languages like Cilk and OpenMP that translate their parallelism-related markers and constructs to specialized IR nodes. Note that the frontend component can easily be extended with additional plugins to add support for different C-style languages or new language extensions.

Analyses and Transformations

After the input program has been converted to the intermediate representation, a set of analyses and transformations can be applied to the IR. These passes can perform modifications or annotate the IR with additional meta-information for subsequent transformations and the conversion to source code in the compiler backend. The high-level structure of the intermediate representation allows for easy access to high-level features of the input program in this phase and thus facilitates the development of analyses and transformations.

Compiler Backend

As Insieme is a source-to-source compiler, the compiler backend is responsible for translating the intermediate representation back to C or C++ source code. Insieme offers two different backend variants: The default one will create parallel code to be run within the Insieme runtime system, while the other one creates sequential output code for debugging purposes. Again, the backend can be extended with custom plugins which are engaged in this translation process to perform additional modifications and changes.

2.2.2 INSPIRE

The Insieme Parallel Intermediate Representation (INSPIRE) is the intermediate representation used by the Insieme compiler [46, 44]. In contrast to the intermediate representations of most traditional compilers (like GCC or LLVM), it is a mostly structural and very high-level representation of the input program. This structure fosters easier implementation of high-level analyses and transformations, since they often can make decisions locally instead of having to consider the whole program representation [48].

The crucial difference to other popular intermediate representations, however, is the direct representation of parallelism in INSPIRE. These specialized nodes in the IR enable analyses and transformations to directly target, create or modify parallelism-related constructs of the input program without having to perform intricate searches or traversals.

Figure 2.2 shows a simplified example of the IR structure that is a result of the compilation of the code shown in Listing 2.1 with the Insieme compiler.

Listing 2.1: Example parallel OpenMP program code used to generate the IR displayed in Figure 2.2.

#include <omp.h>

#define N 10000

void foo() {
    // ...
}

int main() {
    int data[N];

    #pragma omp parallel for nowait          /* A */
    for (int i = 0; i < N; ++i) {
        data[i] = i;                         /* B */
    }

    if (...) {
        // ...
        #pragma omp task                     /* C */
        foo();
        #pragma omp taskwait                 /* D */
        // ...
    } else {
        // ...
        foo();                               /* E */
        // ...
    }
}

We can see how the OpenMP parallelism constructs are represented within INSPIRE. A for loop that is marked for parallelization with OpenMP's #pragma omp parallel for will be translated to a call of the merge primitive A, representing a synchronization point in the IR. The child of this object is a call to parallel, which represents an operation that is to be executed concurrently by multiple workers in parallel. Here, JobRange encodes the number of workers to use (unbound in this example) and the loop body as the code section to be executed by the workers B. The task parallelism employed at C is represented by a call to a parallel primitive with a job range bound to a single worker. Finally, synchronizing on this newly spawned task is achieved by the call to mergeAll at D. To exemplify the difference between sequential and parallel code parts, E shows the direct call to function foo.


[Figure: INSPIRE tree for Listing 2.1 – LambdaExpr main containing a CallExpr to merge (A) over a parallel JobExpr with JobRange [1..] whose body calls pfor (B) followed by barrier; an IfStmt whose then-branch contains a CallExpr to parallel with JobRange [1..1] referencing foo (C) and a CallExpr to mergeAll (D), and whose else-branch calls foo directly (E).]

Figure 2.2: Example INSPIRE tree structure generated from the input code shown in Listing 2.1.


Figure 2.3: Insieme runtime system overview showing all the main modules and abstractions (Image source: [28]).

2.2.3 Runtime System

The Insieme runtime system is responsible for the parallel execution of programs that have been processed with the Insieme compiler. Figure 2.3 depicts a high-level overview of the runtime system with its main components.

Though it also offers a parallel library interface that can be used to manually write programs using the Insieme runtime system, normally programs intended to be executed by the runtime system are generated by the runtime backend of the Insieme compiler.

The runtime system itself has been carefully designed for low-overhead processing of task-parallel programs [73]. To this end, the implementation is based on custom, light-weight user-level threading routines to minimize context switching and scheduling overheads. This design enables the Insieme runtime system to compete with – and often exceed – the performance of other state-of-the-art task-parallel library implementations like OpenMP and Cilk+ [75].

Internally, the runtime system manages a set of worker threads, which are generally mapped to a specific CPU core or hardware thread. These worker threads all manage their own double-ended queue (deque) of open work items to schedule. Once a worker runs out of work locally, it will steal some work from another worker, governed by the current scheduling policy.

If the deque of a worker thread is full, newly created work items will be executed directly instead of being queued.

During the execution of a program, the runtime system is responsible for the monitoring and re-configuration of the underlying parallel hardware system – including aspects like thread affinity settings or applying dynamic voltage and frequency scaling (DVFS). Also, the facilities provided by the Insieme runtime system allow for detailed instrumentation of the whole program as well as of specific code regions.


[Figure: Insieme modules (Frontend, Core, Backend, Runtime System) annotated with the contributions of this thesis – in the Core: Task Identification, Progress Report Generation, Recursive / Loop Parallel Paths, Effort Estimation, Stack Size Estimation, Closure Size and Reporting Call Conversion; in the Runtime System: Progress Reporting Interface and Parallelism Handling.]

Figure 2.4: Extensions added to the Insieme compiler and runtime system in the course of this thesis.

2.3 Extensions to Insieme

The extensions to the various modules of the Insieme compiler and runtime system developed in the course of this thesis are depicted by Figure 2.4.

New components have mainly been added to the Core component of Insieme – the part where the input program has already been translated by the compiler frontend into the INSPIRE intermediate representation. Here these components can analyze the IR or apply transformations to it.

2.4 Related Publications

The development of the Insieme compiler and runtime system inspired novel research and led to several peer-reviewed publications. Along with the ones already referenced in this chapter, we would like to highlight some specific publications, as they also showcase the added research potential gained by having the ability to perform research on a system that combines a compiler with a runtime system.

Thoman et al. [77] investigated combined compiler and runtime optimizations, where the compiler generates effort estimation functions that can be evaluated by the runtime system for improved OpenMP loop scheduling.

Jordan et al. [47] used the compiler to create multi-versioned code to enable the runtime system to tune the execution for multiple objectives. Another work by Thoman et al. [75, 76] creates multi-versioned programs containing different granularities of selected functions, which allows the runtime system to select the best one depending on the current system state.


In the area of multi-objective (auto-)tuning, the works published by Gschwandtner et al. [43], Kofler et al. [53] and Alessi et al. [1] all take advantage of the Insieme ecosystem by forwarding compiler analysis results to the runtime system. Finally, the work of Thoman et al. [78] investigates the optimization potentials of applying library-semantics-aware compilation for parallel programs.

The full list of publications related to the Insieme system can be found online [30].


Chapter 3

Code Multi-Versioning Impacts

This chapter presents research on code multi-versioning and its runtime impacts, published in shortened form under the title Characterizing Performance and Cache Impacts of Code Multi-Versioning on Multicore Architectures in [86]. We provide definitions and categorizations of the relevant parameter space as well as an in-depth evaluation of this space across multiple hardware platforms, along with an analysis of the results and tuning strategies.

My contributions1 in this publication are the development of the evaluation framework (20%), conducting the measurements (90%) and analyzing the resulting data (50%) to develop a tuning method (30%) applicable for multi-versioned programs, and finally writing the manuscript (20%).

3.1 Introduction

Over the past years, there has been an increasing interest in adapting program optimizations to runtime conditions, particularly for parallel systems.

These runtime conditions might include factors unknown at compile time, such as the size or structure of input data, the presence or absence of external load on a system, or shifting optimization priorities, e.g. due to the battery state of an embedded device.

Three common methods exist for enabling this type of runtime adaptivity: i) Program-level and runtime system flags and parameters, ii) Just-in-time compilation, and iii) Compile-time multi-versioning with dynamic version selection at runtime.

Option i provides the flexibility of covering a practically unlimited parameter space by varying combinations of these parameters; however, it is infeasible for many purposes. For example, varying the loop unrolling factor of a hot loop nest is well-known to be an effective optimization [25] that cannot be performed without actually generating code implementing it. Even in cases that can be parameterized in principle, such as tiling factors, introducing a dependency on runtime parameters might prevent a compiler from performing important optimizations such as vectorization.

Just-in-time (JIT) compilation, listed as option ii, addresses these issues in the most straightforward manner, by moving at least part of the compilation process to program execution time. While this approach offers the most complete space of possible optimizations and adaptations, it comes at the cost of compilation times affecting execution time. Due to this fact, high-end compiler optimization and analysis is only viable for very long-running code bases in a just-in-time compilation setting. Recent research approaches, such as performing compilation asynchronously with program execution [13], improve the situation, but can still only mitigate and not eliminate the performance impact. Furthermore, JIT compilation approaches might be precluded on many consumer, embedded or HPC platforms, either due to security concerns or issues with distributing the entire toolchain required.

Option iii, compile-time multi-versioning with dynamic version selection at runtime, is an attempt at attaining many of the advantages of JIT compilation with a significantly smaller overhead during program execution, and no additional complexities in program distribution or security vetting. As we will illustrate with a short survey in Section 3.4, this approach is broadly used in current practice and research.

Note that in all three methods outlined above, their use in modern optimization techniques varies from being applied to smaller code fragments all the way to the selection of different implementations of entire parallel algorithms [47]. Our characterization captures the full range of these scenarios.

A common point of discussion in the community regarding these multi-versioning approaches with runtime version selection concerns their practical limits, as well as if and to what extent they impact performance on modern hardware when approaching these limits. Of particular interest in this context is the relation between the number and size of the generated versions and the pressure on the instruction caches of a given hardware platform. Despite these concerns, to the best of our knowledge, there has been no comprehensive study so far that analyzes the performance impact of multi-versioning across execution scenarios and hardware architectures in depth. Furthermore, the relationship between hardware parallelism implementations such as multi-core or simultaneous hardware multi-threading (SMT) and multi-versioning remains largely unexplored.

In this chapter, we will provide a comprehensive analysis of the topic of code multi-versioning and how it affects performance across a wide variety of scenarios and hardware architectures. Our concrete contributions are as follows:

• A definition and categorization of the parameter space potentially affecting multi-versioning performance, as well as a set of utilities to explore and evaluate it.

• In-depth characterization of the actual performance impact, both in terms of wall-clock time and highly relevant CPU metrics (such as instruction cache misses) across three distinct hardware platforms and the complete parameter space identified previously.

• Analysis of the instruction cache behavior of common runtime optimizations making use of compile-time multi-versioning, as well as a tuning method applicable to many such approaches that can reduce the number of cache misses induced by multi-versioning significantly.

The remainder of this chapter is structured as follows. Section 3.4 will provide an overview of related work and illustrate the importance of multi-versioning in current program optimization research. Section 3.2 defines and justifies the parameter space of code multi-versioning, and clarifies our method of exploring it. In Section 3.3 we apply this method to a set of target platforms, interpret the results, and investigate how distinct runtime version selection mechanisms perform in terms of their instruction cache usage. We also provide an optimization strategy for such selection mechanisms, and demonstrate its effectiveness.

3.2 Method

3.2.1 Parameter Space

In order to fully characterize the potential impact of multi-versioning on program performance on contemporary parallel hardware, we have identified a large set of parameters that may influence the performance and cache behavior of a particular multi-versioned code fragment. We designed our experimental setup such that each of these parameters can be explored individually, as well as having the option of analyzing how arbitrary combinations of them affect the execution behavior.

In the following we explain the parameter space that may influence the execution behavior of multi-versioned programs:

Version selection strategy How the runtime system selects which version of a multi-versioned code fragment to run is an important aspect and has a considerable influence on the cache impact of an application. Our experimental setup enables us to simulate different strategies employed in practice, also including the two extremes – always executing the same version, which will take best advantage of the CPU's prediction and prefetching capabilities, as well as selecting a random version every time.

CPU architecture and cache properties The properties of the CPU's caches – especially their size and whether they are dedicated or shared between different CPU cores – affect the runtime behavior of a multi-versioned program. We performed our experiments on three different hardware platforms with different cache properties to investigate their effects on the execution behavior.

Concurrency Depending on the design of the processor, certain parts like caches or arithmetic and functional units are shared between multiple cores and/or hardware threads, and thus there is a potential for resource competition. To observe the effects of these shared resources on the runtime behavior of a multi-versioned code fragment, we performed evaluations for different numbers of threads, where each thread independently runs the same program with the same properties. We also include platforms with no, 2-way, and 4-way SMT.

Number of versions The number of versions generated for a certain code fragment is likely the most immediately obvious parameter to investigate in a study of multi-versioning performance impact. A large number of versions increases the size of the generated executable and might lead to an increase in instruction cache misses during program execution. In the course of our experiments we generated a set of executables with different version counts to investigate the performance and cache impact of multi-versioning.

Code size per version The size of each individual generated function version in the executable also influences the cache behavior of the application. Larger versions increase instruction cache requirements, potentially resulting in more cache misses; however, they also require more time to execute. Our toolset allows us to change the code size per version, from generating very small functions with only a few bytes worth of instructions up to an arbitrarily large number of instructions per function.

Execution time per version Another influential parameter is the time spent in multi-versioned code fragments. Short execution times exacerbate the relative overhead caused by cache misses, while long ones mitigate it. This parameter is only indirectly related to the code size of each version, as loops or function calls can increase the execution time of a certain code fragment without taking up additional binary space.

Our implementation inserts a loop with an arbitrary upper bound in
