Static Parallelization - Runtime-adaptive generalized task parallelism

Burke et al. [53] describe the exploitation of nested fork-join parallelism while taking into account the possibility of resolving (or eliminating) data dependences through privatization. Parallelism is introduced greedily in the form of DOALL loops and COBEGIN ... COEND blocks of parallel processes. The approach does not trade parallelism against overhead: it parallelizes everything it can and uses every opportunity for privatization to increase parallelism, without taking profitability or the available computational resources into account. The model of parallelism in that work, which is described as being “general and simple”, shares important properties with the model described in this thesis, but is less expressive. Furthermore, we do not introduce parallelism greedily but instead take the introduced overhead into account when statically and dynamically striving for profitable parallel execution.
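To make the role of privatization concrete, the following minimal sketch shows a loop whose only cross-iteration dependence is a reused temporary. The function names and the worker count are hypothetical and not taken from Burke et al. [53]; the sketch only illustrates why giving each iteration a private copy of the temporary turns the loop into a DOALL loop.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential(a, b):
    """Original loop: every iteration writes and reads the same scalar t."""
    out = [0] * len(a)
    t = 0  # one storage location shared by all iterations
    for i in range(len(a)):
        t = a[i] + b[i]
        out[i] = t * t
    return out


def doall_privatized(a, b, workers=4):
    """After privatization each iteration owns its t, so iterations are independent."""

    def body(i):
        t = a[i] + b[i]  # private to this iteration
        return t * t

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in iteration order
        return list(pool.map(body, range(len(a))))
```

Both functions compute the same list. Note that a greedy scheme would apply this transformation even for very short loops, where the cost of starting workers exceeds the work performed per iteration.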

In a similar fashion, Sarkar [54] presents a heuristics-based approach to statically parallelize task trees computed from the program dependence graph of FORTRAN functions. The approach takes into account the overhead introduced by parallelization as well as profiling information collected during dedicated executions of the target application to statically estimate the profitability of the parallelized code. The enforced tree structure, motivated by the requirement to generate a parallel FORTRAN program with structured parallelism, limits the flexibility of the approach. The linear-programming-based scheduling of hierarchical task graphs for embedded systems by Cordes et al. [55] shares this limitation and further imposes restrictions on the shape of the generated parallel code regions.

Rugina and Rinard [56] propose a simple method to automatically parallelize divide-and-conquer algorithms as a use case for their sophisticated region-based inter-procedural memory analysis: parallelism is introduced into a C program in the form of a Cilk spawn instruction preceding every relevant call site, and a Cilk sync succeeding it. The parallel region formed this way is then expanded by moving the sync along the control-flow path until the dependence analysis forbids further propagation because of a potential conflict between the next statement and the spawned function. While the dependence analysis proposed by Rugina and Rinard is quite strong, in particular for regular data structures, the simple parallelization approach is very limited because it cannot abstract from the control flow as implemented.
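The spawn/sync placement can be illustrated with a small divide-and-conquer sum. This is a sketch only: it uses Python threads in place of Cilk, and the cutoff and depth limit are hypothetical parameters that are not part of the method in [56]. The join plays the role of the sync, which has been moved from directly after the spawned call to just before the first statement that uses its result.

```python
import threading

CUTOFF = 1024  # hypothetical size below which no thread is spawned


def tree_sum(data, lo, hi, depth=0, max_depth=3):
    """Sum data[lo:hi] by divide and conquer with spawn/sync-style parallelism."""
    if hi - lo <= CUTOFF or depth >= max_depth:
        return sum(data[lo:hi])
    mid = (lo + hi) // 2
    result = {}

    def left_half():
        result["left"] = tree_sum(data, lo, mid, depth + 1, max_depth)

    child = threading.Thread(target=left_half)
    child.start()  # "spawn" placed in front of the first recursive call
    # This statement does not conflict with the spawned call, so the sync
    # can be moved past it.
    right = tree_sum(data, mid, hi, depth + 1, max_depth)
    child.join()  # "sync": the next statement reads the spawned result
    return result["left"] + right
```

The structure of the parallel code follows the structure of the sequential code exactly: the only freedom exercised is how far the sync travels along the existing control flow.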

Zhong et al. [57] describe an approach to automatic speculative DOALL parallelization of loops that relies on hardware transactional memory, hardware-based low-cost thread spawning, and low-latency inter-core communication. Mehrara et al. [58] implement a software transactional memory (STM) system to remove these hardware requirements. The described STM, however, is specialized for and limited to the automatic DOALL parallelization of loops. Kim et al. [59] apply speculative DOALL parallelization to distribute the computation performed by a loop across a cluster of machines.
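The general idea behind speculative DOALL execution can be sketched in software as follows. This is not the STM of Mehrara et al. [58]; it is a deliberately simplified illustration under the assumption that every memory access of the loop body goes through hypothetical `read` and `write` callbacks. All iterations run against a snapshot while logging their accesses, the logs are committed in iteration order, and an iteration that read a location written by an earlier iteration is re-executed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_iteration(body, i, state):
    """Execute one iteration against state, buffering writes and logging reads."""
    reads, writes = set(), {}

    def read(key):
        if key in writes:  # the iteration sees its own buffered writes
            return writes[key]
        reads.add(key)
        return state[key]

    def write(key, value):
        writes[key] = value

    body(i, read, write)
    return reads, writes


def speculative_doall(body, memory, n_iters):
    """Run the loop speculatively; return the number of re-executed iterations."""
    snapshot = dict(memory)
    with ThreadPoolExecutor() as pool:
        # Speculative phase: iterations only read the snapshot and fill private logs.
        logs = list(pool.map(lambda i: run_iteration(body, i, snapshot), range(n_iters)))
    written = set()
    reexecuted = 0
    for i, (reads, writes) in enumerate(logs):  # commit in iteration order
        if reads & written:
            # Misspeculation: the iteration read a stale value, so redo it
            # against the committed state of all earlier iterations.
            reads, writes = run_iteration(body, i, memory)
            reexecuted += 1
        memory.update(writes)
        written.update(writes)
    return reexecuted
```

For a loop without cross-iteration dependences no iteration is re-executed. For a loop such as an accumulation, every iteration except the first is re-executed, which shows why speculation only pays off when conflicts are rare.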

Madriles et al. [60] propose Anaphase, a fine-grained speculative parallelization technique that finds regions for parallel execution and schedules the code using a multi-level graph partitioning approach. The approach speculatively parallelizes a given sequential application at the level of single instructions, driven by several heuristics that estimate the affinity of computation nodes. Anaphase relies on hardware support for efficient recovery from misspeculation.

Suesskraut et al. [61] introduce Prospect, a compiler framework based on an approach they call predictor/executor. It resembles the master/slave parallelization concept of Zilles and Sohi [62] in that it provides a fast but potentially incorrect variant and a slow but correct variant of the code. Parallelism is introduced by executing the fast variants on the critical path and multiple slow variants in parallel to verify the results of the fast variants. The approach introduces very high overhead: even in the best case, i.e., when no roll-backs occur, the slow variants occupy more computational resources than the actual (fast) computation.
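The predictor/executor scheme can be sketched as follows. The representation of a program as a list of (fast, slow) step pairs and the function name are assumptions made for illustration; they do not reflect Prospect's actual interface. The fast variants run back to back, the slow variants check each step in parallel starting from the predicted inputs, and the first mismatch triggers a roll-back to the verified state.

```python
from concurrent.futures import ThreadPoolExecutor


def predictor_executor(steps, state0):
    """steps is a list of (fast, slow) pairs, each mapping a state to the next state.

    Returns (final_state, index_of_first_misprediction_or_None).
    """
    # Predictor: the fast variants form the critical path.
    predicted = [state0]
    for fast, _ in steps:
        predicted.append(fast(predicted[-1]))

    # Executors: each slow variant re-computes one step from its predicted input.
    with ThreadPoolExecutor() as pool:
        checked = list(pool.map(lambda k: steps[k][1](predicted[k]), range(len(steps))))

    for k, (correct, guess) in enumerate(zip(checked, predicted[1:])):
        if correct != guess:
            # Roll back: predicted[k] was verified, so checked[k] is the correct
            # state after step k. Continue with the slow variants from there.
            state = correct
            for _, slow in steps[k + 1:]:
                state = slow(state)
            return state, k
    return predicted[-1], None
```

Even when every prediction is correct, each step is executed twice, once by the fast and once by the slow variant, which reflects the overhead noted above.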

To optimize loop nests, in particular for mathematical and scientific applications that are mostly based on regular data structures and regular control flow, so-called polyhedral loop nest optimization has been proposed by Feautrier in his seminal work on scheduling in the polyhedral model for one-dimensional [25] and multi-dimensional [26] time. The focus of that work has been on efficient scheduling, while parallelization has been described as one possible use case. Later work by Lengauer [27] and Feautrier [63] specifically dealt with parallelization based on polyhedral scheduling. Pluto by Bondhugula et al. [64] is a C source-to-source compiler which uses polyhedral scheduling to produce a parallelized OpenMP [6, 7] program. Pluto can achieve very high performance, outperforming state-of-the-art production compilers by a wide margin, but only if the polyhedral model is applicable at all, which remains a drawback of polyhedral optimization. The cost of using this very clean and elegant mathematical model is limited applicability with respect to irregular applications. A loop, or more precisely a static control part (SCoP), represented in the model typically needs to fulfill certain criteria: loop bounds as well as the predicates of conditionals used in the loop body have to be representable by affine functions of the surrounding loop indices and (provably) loop-invariant parameters. Dependences between individual statements are only allowed via accesses to indexed variables (arrays) whose access functions are affine, again in the loop indices and parameters mentioned above. Furthermore, called functions need to be statically known and provably pure¹. These are severe restrictions whose mitigation has been the goal of extensive research [65–70], conducted also by ourselves [16, 18] and Doerfert et al. [71]. Parallelization in the polyhedral domain is related to, but not addressed by, the work described in this thesis. Its mathematically clean representation and optimization-based scheduling, however, have had a strong influence on our work.
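The SCoP criteria can be illustrated with three small loop nests. The functions below are hypothetical examples written for this purpose and do not come from the cited works. The first satisfies the criteria, the second is the same computation under a different schedule (loop interchange), and the third violates the criteria in several ways.

```python
def scop_example(A, n, m):
    """Fits the model: bounds, predicate, and subscripts are affine in i, j, n, m."""
    for i in range(1, n):
        for j in range(i, m):  # lower bound i is affine in the outer index
            if i + j <= n + m - 3:  # affine predicate
                A[i][j] = A[i - 1][j] + 1  # affine subscripts (i - 1, j)


def scop_interchanged(A, n, m):
    """Same iteration domain with the loops interchanged.

    The only dependence has distance (1, 0) in (i, j). After interchange it
    becomes (0, 1), which is still carried forward, so the result is the same.
    """
    for j in range(1, m):
        for i in range(1, min(n, j + 1)):  # same points: 1 <= i < n and i <= j < m
            if i + j <= n + m - 3:
                A[i][j] = A[i - 1][j] + 1


def non_scop_example(A, idx, n):
    """Does not fit: data-dependent loop exit and non-affine subscripts."""
    i = 0
    while i < n and A[idx[i]] != 0:  # exit condition depends on array contents
        A[idx[i]] = A[(i * i) % n]  # idx[i] and i * i are not affine in i
        i += 1
```

In the interchanged version, different values of j never touch the same array element, so the outer loop could be executed as a parallel loop. A polyhedral scheduler derives such schedules automatically from the affine description. No such description exists for the third function.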

¹ A pure function does not have any observable side effect, and it computes the same result when called with the same arguments.

Decoupled Software Pipelining (DSWP) aims at parallelizing sequential loops by forming patterns of pipelined execution [72]. Loops are decomposed into pipeline stages, which may execute in parallel with each other. Each stage communicates the values it produces to the threads executing later stages as needed. DSWP has been extended in multiple ways over the years: Ottoni et al. describe how to perform thread extraction automatically [73]; the work of Raman et al. allows a single pipeline stage to be distributed across multiple threads [74], introducing further parallelism. The work of Vachharajani et al. describes how to parallelize speculatively [75], and August et al. enable cross-invocation parallelism among iterations of different loop instances [76] for loops of a specific shape. Huang et al. [77] generalize the idea of Raman et al. [74] and enable the parallelization of individual DSWP stages by manually applying a secondary loop parallelization scheme. The work clearly shows that different parallelization schemes can be profitably combined. However, the authors consider the question of how to automatically select and prioritize different approaches a challenging open research question. While modern implementations of DSWP, such as Parcae [78], avoid it, the earlier approaches rely on specialized hardware for inter-thread communication and recovery from misspeculation. The approach described in this thesis instead runs on commodity systems.
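A two-stage pipeline in the spirit of DSWP can be sketched with two threads and a bounded queue. The linked-list representation, the queue size, and the function name are assumptions chosen for illustration. The loop is split so that each loop-carried dependence stays inside one stage: the list traversal in the first stage and the accumulation in the second.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the value stream


def dswp_sum_of_squares(head):
    """head is a linked list of (value, next) tuples, terminated by None."""
    channel = queue.Queue(maxsize=64)  # hypothetical buffer size
    result = []

    def traverse():  # stage 1: carries the dependence node -> next node
        node = head
        while node is not None:
            value, node = node
            channel.put(value)  # communicate the produced value downstream
        channel.put(_DONE)

    def accumulate():  # stage 2: carries the dependence on the running total
        total = 0
        while True:
            value = channel.get()
            if value is _DONE:
                break
            total += value * value
        result.append(total)

    stages = [threading.Thread(target=traverse), threading.Thread(target=accumulate)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return result[0]
```

Communication flows in one direction only, from the earlier to the later stage, so the first stage can run ahead of the second by as many iterations as the queue can buffer.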

Vandierendonck et al. [79] (see also [80]) describe Paralax, a semi-automatic approach to parallelization in a DSWP-like fashion. Like the approach presented in this thesis, Paralax relies on DSA [23] for its dependence analysis and therefore suffers from the same imprecision. To address this concern, Vandierendonck et al. [79] motivate a set of user annotations, which partially inspired our approach to semi-automatic parallelization presented in Chapter 10.

In Helix [81], adjacent loop iterations are automatically distributed in a round-robin fashion to different threads executing on adjacent cores of the same processor. The latency of the inter-core communication necessary to transfer values that fulfill loop-carried dependences is hidden by exploiting the SMT capabilities of modern multi-core processors: potentially needed values are continuously pre-fetched so that they are available in the local L1 cache without latency once they are used by the target core. While the performance results are impressive, the authors show in their own follow-up work [82] that the approach does not scale to more than four cores. To overcome this limitation, they propose hardware support in the form of a proactive ring cache that interconnects all participating compute cores and can send values from one core to the next with a delay of one clock cycle. Both approaches are limited to the parallel execution of a single loop on the cores of a single processor at a time.
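The round-robin distribution of iterations with ordered hand-over of loop-carried values can be sketched as follows. This is a software analogy only: events stand in for the inter-core signals, the pre-fetching through SMT is not modeled, and the function name and thread count are hypothetical.

```python
import threading


def helix_like_prefix_sums(xs, n_threads=3):
    """Running sums of squares; iteration i runs on thread i % n_threads."""
    n = len(xs)
    # ready[i] is set once the sequential segment of iteration i - 1 has finished.
    ready = [threading.Event() for _ in range(n + 1)]
    ready[0].set()
    state = {"acc": 0}  # the loop-carried value
    out = [None] * n

    def worker(tid):
        for i in range(tid, n, n_threads):  # round-robin assignment
            local = xs[i] * xs[i]  # independent part, overlaps with other iterations
            ready[i].wait()  # wait for the signal from the previous iteration
            state["acc"] += local  # sequential segment
            out[i] = state["acc"]
            ready[i + 1].set()  # signal the thread that executes iteration i + 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Every iteration waits for its predecessor, so the time between a signal being sent and being observed lies on the critical path of the loop. This is the latency that Helix hides through pre-fetching and that the ring cache of the follow-up work reduces in hardware.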
