Dynamic Blocking - Runtime-adaptive generalized task parallelism

Dynamic blocking⁴, another important feature of the ParAγ runtime system, aims at increasing the size of parallel tasks, because the overhead of enqueuing parallel tasks can easily outweigh the actual work done by the tasks. This is frequently the case for parallelized loops doing little work per iteration of the loop body, as for instance in the BiCG example shown in Figure 1.4a. The reentrant parallel section for such a loop contains exactly one reentrant task representing the iteration part of the loop, and thus also all loop-carried dependences. One or several non-reentrant tasks in the same section represent the actual parallel work to be spawned off for parallel processing in each iteration. If these tasks contain only a small amount of code, such as loading a value from an array and performing a reduction operation, then the overhead of handling the parallel tasks nullifies the benefit of parallel execution.

Such a loop should, however, by no means be discarded from the parallelizer’s perspective. Plenty of research in automatic parallelization has shown that the parallelism in such loops can be exploited successfully. Modern instances of Decoupled Software Pipelining [10, 74, 76–78] and the Helix [81, 82] family of approaches in particular have shown impressive performance improvements by effectively reducing the overhead of parallelization and of the synchronization and communication mechanisms needed to enable parallelism. Those and most other approaches to automatic parallelization, however, usually look for program patterns suitable for the specifically chosen approach to parallelization. Other program parts that might be parallel but do not fit are ignored, effectively ruling out parallelization candidates right away.

⁴ Dynamic blocking was conceived, and an early instance implemented, by myself. The currently used, and by far improved, version was devised by Johannes Doerfert and myself. The implementation was done mostly by Johannes Doerfert.

ParAγ comes from a different direction: without initially taking profitability and suitability into account, the static parts of ParAγ find a very wide range of parallelism and expose it to the runtime system. The number of parallelization candidates found by ParAγ is therefore expected⁵ to be considerably higher than for approaches that look for special patterns of parallelism only. Candidates are then left for selection and composition as described in the previous sections to finally compile ParCFGs for relevant parts of the application. In order to be profitable in the above-mentioned case of long-running loops with minimal work per body instance, ParAγ has to perform further, parallelism-specific optimizations on the ParCFGs. ParAγ’s dynamic blocking is one very important example of such an optimization. It increases the task size of a reentrant parallel section (i.e., a parallel loop) by dynamically joining (blocking) a number of parallel tasks before enqueuing the whole block as one batch task. This greatly reduces the number of small tasks doing negligible work, as well as the associated overhead and pressure on the dynamic scheduler.
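
The blocking step can be illustrated with a minimal sketch. It is not ParAγ's implementation: the names `dynamic_blocking` and `run_block`, the default block size, and the use of a Python thread pool as a stand-in for the dynamic scheduler are all assumptions made for illustration. Python threads show only the task structure here, not an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def run_block(block, body):
    # One batch task: execute the loop body for every iteration in the block.
    return [body(item) for item in block]


def dynamic_blocking(items, body, block_size=64, workers=4):
    """Hypothetical sketch: enqueue one batch task per block of iterations."""
    it = iter(items)  # the iteration range need not be known in advance
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # Collect up to block_size iterations before enqueuing anything.
            block = list(islice(it, block_size))
            if not block:
                break
            futures.append(pool.submit(run_block, block, body))
        # Gather the results in iteration order.
        return [r for f in futures for r in f.result()]
```

With a block size of 64, a loop of 6400 iterations enqueues 100 batch tasks instead of 6400 single-iteration tasks. Because the blocks are collected from an iterator, the sketch also works when the number of iterations is only discovered while the loop runs.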

Apart from the direct effect of decreasing the overhead-to-work ratio of small parallel tasks, blocking also amortizes necessary parallelization-specific per-task overhead. Any task accessing privatized data, for instance a privatized reduction location or profiling counters, profits from blocking: the private copy needs to be determined only once per batch task, and for atomic operations involved in a reduction, the shared location needs to be updated only once per batch task.
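
The effect on a privatized reduction can be sketched as follows. The class `SharedSum`, its `updates` counter, and the function names are hypothetical and exist only to make the number of synchronized updates visible; a lock stands in for the atomic operation mentioned above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedSum:
    """Shared reduction location (hypothetical), guarded by a lock."""

    def __init__(self):
        self.value = 0
        self.updates = 0  # counts synchronized updates, for illustration only
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self.value += amount
            self.updates += 1


def reduce_block(block, shared):
    private = 0          # private copy, set up once per batch task
    for x in block:
        private += x     # unsynchronized work on the private copy
    shared.add(private)  # a single synchronized update per block


def blocked_sum(data, block_size=4, workers=4):
    shared = SharedSum()
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so that every batch task has finished.
        list(pool.map(lambda b: reduce_block(b, shared), blocks))
    return shared
```

Summing 100 values with a block size of 10 performs 10 synchronized updates in this sketch, where one task per iteration would perform 100.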

Equally important is the reduction in communication overhead and required storage: instead of computing a simple iteration variable in the reentrant part of a loop and communicating its value to every spawned task, the value is communicated at most once per block of tasks, provided its computation is deterministic and free of side effects. In that case, the computation code is replicated per block⁶, and both communication and the stalling of tasks waiting for input are reduced. This technique is not only applied to iteration variables. The results of side-effect-free and deterministic computations of values consumed by a blocked task are not communicated, but instead recomputed within the consuming block. Loop-invariant values are communicated only once per block.
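
A minimal sketch of this recomputation, under the assumption of a simple iteration variable with a positive constant step: each batch task receives only the first value of the iteration variable and the number of iterations, and replicates the update locally. The names `next_index`, `run_block`, and `blocked_loop` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def next_index(i, step):
    # Deterministic, side-effect-free update of the iteration variable.
    return i + step


def run_block(first, count, step, body):
    # Only `first` and `count` are communicated to the block. All further
    # values of the iteration variable are recomputed inside the block.
    out = []
    i = first
    for _ in range(count):
        out.append(body(i))
        i = next_index(i, step)
    return out


def blocked_loop(start, stop, step, body, block_size=8, workers=4):
    """Hypothetical sketch; assumes step > 0."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        i = start
        while i < stop:
            # Reentrant part: advance the iteration variable over one block.
            first, count = i, 0
            while i < stop and count < block_size:
                i = next_index(i, step)
                count += 1
            futures.append(pool.submit(run_block, first, count, step, body))
        return [r for f in futures for r in f.result()]
```

The update code appears twice, once in the reentrant part and once in the block, which mirrors the replication described above: two values per block are communicated instead of one value per task.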

⁵ We say “expected” because, during the course of our studies, none of the most promising approaches was made available to us for evaluation, despite our asking for them multiple times. Reasons mentioned were the quality of the code, ongoing refactorings, and assumptions about hypothetical hardware (Helix, for instance, is simulated because it assumes a non-existing inter-core ring cache).

⁶ Replicating the loop control structure and necessary value computations is also an essential part of decoupled software pipelining.

Note that none of the described measures is necessary in the well-known and regularly applied (recursive) range splitting, since they are included by design. In contrast to recursive range splitting, however, dynamic blocking is more general and thus more widely applicable, because it requires neither a statically known nor even a loop-invariant iteration range. A downside is that the thread that collects the tasks before spawning them as one block may become the bottleneck of parallel execution. Therefore, ParAγ applies another optimization: if the loop iteration range is loop invariant, i.e., known before entering the loop, the ParAγ runtime system produces code that immediately distributes the loop execution equally among the available threads, which basically corresponds to a one-level (i.e., non-recursive) range splitting.
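
One-level range splitting for a loop-invariant iteration range can be sketched as below. This is an illustration under assumptions, not the code the runtime system generates: `split_range` and `range_split_loop` are hypothetical names, and the range is taken to be `0..n-1`.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n, parts)
    chunks, lo = [], 0
    for p in range(parts):
        hi = lo + base + (1 if p < extra else 0)
        chunks.append(range(lo, hi))
        lo = hi
    return chunks


def range_split_loop(n, body, workers=4):
    # The iteration count n is known before the loop is entered, so the work
    # is distributed up front and no thread has to collect tasks.
    chunks = split_range(n, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: [body(i) for i in c], chunks)
        return [r for part in parts for r in part]
```

For 10 iterations and 4 threads, the chunks have sizes 3, 3, 2, and 2. Unlike the collecting approach, no single thread sits between the loop and the scheduler, which is the bottleneck this optimization avoids.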

Finally, the dynamic nature of ParAγ’s blocking allows the block size to be changed arbitrarily at runtime in case the dynamic scheduler (or the executing system) is under- or oversubscribed. ParAγ does not currently do this; instead, the block size can be selected as a parameter to the parallelized binary, or a sensible default is chosen automatically.
