ParA_γ: Relevant Implementation Details - Runtime-adaptive generalized task parallelism

Concerning the implementation of ParA_γ, the following techniques are of particular importance for understanding the results achieved.

8.2.1 Block Splitting

While the principal ideas and techniques used by ParA_γ are conceptually not limited to operating on the basic block level, the current implementation does so for reasons of scalability. This is a trade-off: the precision and power of parallelization might suffer, as the scheduler is limited in its freedom to independently schedule potentially costly instructions contained within the same basic block².

To counter this limitation, ParA_γ seeks to isolate two kinds of instructions by extracting them from their containing block:

Potentially costly instructions are isolated to give the scheduler the freedom to schedule them in parallel to each other or the surrounding code. Costly instructions isolated by ParA_γ include call and invoke instructions, except if the called function is known to be very short running.

Parallelization hindering instructions are isolated to break chains of dependences between basic blocks and give the scheduler the freedom to parallelize the code surrounding the isolated instructions. Examples of such instructions include the computation of loop iteration variables and reduction operations.

² Theoretically, the size of a basic block is not limited. The average number of instructions per basic block varies considerably with, for instance, the compiler and the source language. Calder et al. [113] report 5 to 8 instructions on average per basic block for C/C++ applications.

Extracting an instruction from a basic block is, however, not as simple as splitting the basic block before and after the respective instruction. Doing so typically introduces a dependence pattern that is just as likely to hinder parallelization. Consider a basic block as follows:

    a=b+c
    d=e+f
    call h(a)
    call i(d)

Naively splitting this block to isolate both calls would result in three new blocks:

    Block 1:
        a=b+c
        d=e+f

    Block 2:
        call h(a)

    Block 3:
        call i(d)

Unfortunately, both blocks containing the isolated calls depend on the first block, which contains the operand computation. The scheduler is now free to execute the calls in parallel to each other, but only after the argument computation has completed. This is an unnecessary restriction. The desired result instead is as follows:

    Block 1:
        a=b+c
        call h(a)

    Block 2:
        d=e+f
        call i(d)

This configuration does not impose any dependences between basic blocks and gives the scheduler all freedom to schedule the calls in parallel to each other.

Instead of performing naive splitting, ParA_γ isolates complete intra-block dependence chains originating from a target instruction. In case multiple target instructions with non-disjoint dependence chains exist within the same basic block, dedicated basic blocks are created for the overlapping computation in order to maximize scheduling freedom within the bounds of preserving the sequential program semantics.
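The chain-based splitting described above can be illustrated with a minimal Python sketch. It is not the ParA_γ implementation: the `Instr` record, the `chain` and `split_block` helpers, and the restriction to register data flow (no memory effects, no anti or output dependences) are assumptions made purely for illustration. Each instruction is assigned to the set of target instructions whose dependence chain contains it; instructions shared by several targets end up in a dedicated block of their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str                      # printable form, e.g. "a=b+c"
    defs: frozenset = frozenset()  # names written by the instruction
    uses: frozenset = frozenset()  # names read by the instruction
    target: bool = False           # True if the instruction is to be isolated


def chain(block, idx):
    """Indices of all instructions that block[idx] transitively depends on
    through register data flow, including idx itself."""
    needed = set(block[idx].uses)
    members = {idx}
    for j in range(idx - 1, -1, -1):       # walk backwards through the block
        if block[j].defs & needed:
            members.add(j)
            # block[j] provides these names; its own operands are needed now
            needed = (needed - block[j].defs) | block[j].uses
    return members


def split_block(block):
    """Group instructions by the set of targets whose chain contains them."""
    targets = [i for i, ins in enumerate(block) if ins.target]
    chains = {t: chain(block, t) for t in targets}
    groups = {}
    for i in range(len(block)):
        owners = frozenset(t for t in targets if i in chains[t])
        groups.setdefault(owners, []).append(i)
    return [[block[i].text for i in idxs] for idxs in groups.values()]


# The example block from the text, with both calls marked as targets.
block = [
    Instr("a=b+c", frozenset({"a"}), frozenset({"b", "c"})),
    Instr("d=e+f", frozenset({"d"}), frozenset({"e", "f"})),
    Instr("call h(a)", uses=frozenset({"a"}), target=True),
    Instr("call i(d)", uses=frozenset({"d"}), target=True),
]
print(split_block(block))
# [['a=b+c', 'call h(a)'], ['d=e+f', 'call i(d)']]
```

If both calls consumed the same value, for instance `t=b+c` followed by `call h(t)` and `call i(t)`, the sketch would place `t=b+c` in a separate block and each call in a block of its own. A real implementation would additionally have to record the resulting inter-block dependences and respect memory effects, which this sketch omits.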

8.2.2 Schedule Cache

Upon a typical recompilation of an application after changing a source file, large parts of the code remain unchanged. ParA_γ makes use of this fact to speed up the compilation process by reusing pre-computed parallel schedules (i.e., parallelization candidates). This way, it effectively saves the time to re-run the linear optimizer for each unchanged function.

The idea is motivated by ccache [114], which indexes a cache persisted on disk by computing a hash over the source to compile (after running the preprocessor). ParA_γ follows this example but indexes its schedule cache by computing a deterministic hash over the PDG structure. This implies two things: the cache operates on function granularity, and two functions having the same PDG share the same entry in the cache. As the PDG contains all memory effects, which are also reflected in the hash computation (in contrast to function or variable names and other identifiers, which are abstracted away), this is exactly the desired behavior.

In case ParA_γ finds an entry in its schedule cache for a given PDG, the corresponding schedule is taken immediately and the ILP is not even constructed.
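The following Python sketch illustrates the idea of a cache keyed by a structural hash. It makes several simplifying assumptions that are not taken from the text: a PDG is modeled as a list of `(name, opcode)` nodes in program order plus a list of `(source, destination, kind)` edges, the hash is SHA-256 over a JSON encoding, and the cache is an in-memory dictionary rather than a store persisted on disk. The names `pdg_hash` and `ScheduleCache` are hypothetical. Node names are replaced by their positions before hashing, so identifiers do not influence the key.

```python
import hashlib
import json


def pdg_hash(nodes, edges):
    """Deterministic hash over PDG structure, ignoring node names.

    nodes: list of (name, opcode) in program order
    edges: list of (src_name, dst_name, kind), e.g. kind = "data" or "memory"
    """
    index = {name: i for i, (name, _) in enumerate(nodes)}
    shape = {
        "ops": [op for _, op in nodes],
        "edges": sorted([index[s], index[d], kind] for s, d, kind in edges),
    }
    blob = json.dumps(shape, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


class ScheduleCache:
    """In-memory stand-in for a schedule cache persisted on disk."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, nodes, edges, solve):
        key = pdg_hash(nodes, edges)
        if key not in self._store:          # miss: run the expensive solver
            self._store[key] = solve(nodes, edges)
        return self._store[key]             # hit: solver is skipped entirely
```

Two functions whose PDGs differ only in identifiers map to the same key, so the second lookup returns the stored schedule without invoking `solve`. For this to be useful, the stored schedule would have to refer to nodes by position rather than by name.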

8.2.3 ILP Cloud

In case ParA_γ does not find a suitable pre-computed parallel schedule for a given PDG in its cache, it typically constructs multiple integer linear programs to find local parallelization candidates, as described in detail in Chapter 6. As solving those ILPs, some of which are of non-negligible complexity, takes most of the compilation time, a second cache is introduced: the ILP Cloud is indexed by a hash over the ILP structure and stores feasible solutions for a given ILP.

After constructing the ILP for a given PDG block as described earlier, ParA_γ checks the availability of a feasible solution to the ILP. In case an optimal solution exists, ParA_γ takes it and does not need to run the optimizer. In case a feasible but non-optimal solution exists, ParA_γ can, subject to the given configuration, proceed in one of the following ways:

1. Check if the CPU time spent computing this solution (which is also stored in the ILP cloud) is above its own ILP solving timeout, and if so, take the feasible solution as is.

2. If the CPU time spent so far to optimize the available solution is below ParA_γ’s own threshold, the feasible solution can be used as a starting point for the ILP solver, which in turn spends the difference between ParA_γ’s time budget and the time already spent on polishing the solution further towards an optimum.

3. ParA_γ can, independently of the CPU time spent so far to solve the cloud solution, use its own time budget to polish the solution.

In any case, ParA_γ stores a newly found or improved solution in the ILP cloud for later reuse. In addition to the instances of Sambamba/ParA_γ that contribute to the solutions stored in the ILP cloud, we employed available compute resources to regularly take feasible but non-optimal solutions from the ILP cloud and optimize them further.
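The lookup policy described above can be summarized in a short Python sketch. The record `CloudEntry`, the function `plan`, and the flag `always_polish` are hypothetical names introduced for illustration; the text also does not say what happens when the stored CPU time equals the local budget exactly, so treating that case as reuse is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudEntry:
    solution: list        # variable assignment stored in the cloud
    optimal: bool         # True if the solution is proven optimal
    cpu_seconds: float    # CPU time already invested in this solution


def plan(entry: Optional[CloudEntry], budget: float,
         always_polish: bool = False):
    """Return (action, seconds): what to do and how long the solver may run."""
    if entry is None:
        return ("solve", budget)            # nothing stored: solve from scratch
    if entry.optimal:
        return ("reuse", 0.0)               # optimal: no optimizer run needed
    if always_polish:
        return ("polish", budget)           # option 3: ignore time already spent
    if entry.cpu_seconds >= budget:
        return ("reuse", 0.0)               # option 1: take the solution as is
    # option 2: warm-start the solver and spend only the remaining budget
    return ("polish", budget - entry.cpu_seconds)


print(plan(CloudEntry([1, 0, 1], optimal=False, cpu_seconds=4.0), budget=10.0))
# ('polish', 6.0)
```

Whatever action is chosen, the result of a "solve" or "polish" run would be written back to the cloud together with the accumulated CPU time, so that later lookups can apply the same policy.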

Note that the ILPs constructed by ParA_γ are based on local dependence DAGs, which abstract away identifiers, control flow, and program order and contain fewer than 10 nodes on average (see Table 9.1). Chances are therefore that ILP solutions can be shared between independent programs.

The ILP cloud was implemented together with Clemens Hammacher, who implemented in particular the server-side part of the cloud, including ILP serialization and deserialization.
