5.1.2 Sequentialization of the Program Dependence Graph
As stated earlier, the program dependence graph (PDG) is well suited to representing parallel programs. It removes the structurally imposed program execution order and leaves only the strictly necessary control and data dependences, which must be respected to preserve the sequential program semantics. To be executed by the machine, however, the program's statements need to be put into an execution order in the form of a control flow graph (CFG). This process is called sequentialization of the PDG. It has been the subject of extensive research covering different forms of programs, ranging from loop-less programs [104–106] to programs containing only single-entry loops [107] to irreducible programs containing arbitrary loops [98]. The more recent work of Zeng et al. [108] addresses the efficiency of code sequentialized from a PDG with possible interleaving, i.e., circular dependences between disjoint PDG sub-graphs. The goal of an optimal sequentialization is minimal duplication of code or a minimal number of guards³.
Figure 5.3: A simple PDG for sequentialization (nodes: root; A, B, F; group nodes b1, b2; C, D, E).
As an example, consider the simple PDG shown in Figure 5.3 and the two possible sequentializations in Figures 5.4a and 5.4b. The latter is optimal, while the former requires duplication of node D due to a sub-optimal order of generating code for the children of group nodes b1 and b2.
³ Sequentialization can always be duplication-free provided enough predicates are inserted into the code guarding the execution.
Figure 5.4: Sequentialization of the PDG shown in Figure 5.3. (a) Non-optimal with D duplicated (nodes A, B, D1, D2, C, E, F). (b) Optimal (A, B, C, E, D, F).
Steensgaard [98] shows how to optimally sequentialize PDGs even for irregular code based on the so-called external edge condition (EEC). Our situation is special, however, and in fact allows for a simpler solution: the PDG used in Sambamba/ParAγ is computed from an existing control-flow graph and mostly used for analysis purposes only. Since ParAγ does not introduce new dependences into the PDG, a duplication- and guard-free sequentialization of the PDG clearly exists: the original CFG. In the general case this is not always true (see Ball and Horwitz [107]). Furthermore, and more importantly, we can keep the information on this sequentialization, at least implicitly, by storing for each PDG node a group-unique ID based on the post-order numbering of the blocks in the control flow graph with loop-closing edges removed. IDs are group-unique, i.e., unique among the children of each PDG group node rather than unique for the whole PDG, since group nodes themselves have no correspondence in the CFG and therefore no corresponding ID. Instead, group nodes inherit the ID of the decision node they belong to. The basic idea then is that during sequentialization of a PDG group node, the children are ordered in descending order of their respective IDs, which for a loop-less program results in the original, duplication-free CFG.
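The numbering step can be sketched as follows. This is a minimal illustration, not the Sambamba implementation: the function name `post_order_ids`, the adjacency-list encoding of the CFG, and the five-block example graph are all assumptions made for the example. Loop-closing edges are approximated here as edges to a block that is still on the depth-first search stack.

```python
def post_order_ids(cfg, entry):
    """Number CFG blocks in post-order, skipping loop-closing edges.

    cfg maps each block name to the list of its successors (hypothetical
    encoding). An edge to a block that is still on the DFS stack is
    treated as loop-closing and ignored.
    """
    ids = {}
    on_stack = set()
    counter = 0

    def visit(block):
        nonlocal counter
        on_stack.add(block)
        for succ in cfg[block]:
            if succ in on_stack:
                continue  # loop-closing edge: do not follow it
            if succ not in ids:
                visit(succ)
        on_stack.discard(block)
        counter += 1
        ids[block] = counter  # a block is numbered after all its successors

    visit(entry)
    return ids


# Hypothetical CFG: A -> B -> C, C branches to E (exit) or D, and D loops back to B.
cfg = {"A": ["B"], "B": ["C"], "C": ["E", "D"], "D": ["B"], "E": []}
ids = post_order_ids(cfg, "A")

# Sorting by descending ID recovers a valid block order for the loop-free part.
descending = sorted(ids, key=ids.get, reverse=True)
```

The resulting numbers depend on the order in which successors are visited; any consistent order yields a usable numbering for the purpose described above.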
Loops change the situation: without loops, the children of the reachable PDG sub-graph rooted in node n always have a smaller ID than n itself. This follows from the definition of control dependence, which requires reachability in the CFG, and from post-order numbering, which guarantees that all nodes reachable from n are numbered before n. CFG loops, however, also result in loops in the control-dependence sub-graph of the PDG, which in turn make nodes with higher IDs reachable in the PDG. This needs to be reflected when ordering the children of a group node g with post-order ID poid(g) for sequentialization by sorting in two steps: first come all children c with poid(c) < poid(g), sorted in descending order of their IDs, followed by all children with poid(c) ≥ poid(g), also in descending order of their IDs. We call all children with poid(c) < poid(g) regular and all children with poid(c) ≥ poid(g) loop-back. The boundary between those two groups is called the loop-back boundary; the one child with the smallest ID greater than or equal to that of its parent is called reentrant, as it is the one closing the loop. Figure 5.5 illustrates the order of children and the terminology introduced above.
Figure 5.5: Order of children of a PDG group node grp with poid(grp) = n, and children a to g with subscripted CFG post-order IDs, where m < n and o ≥ n. From left to right: regular children a_m, b_(m−1), c_(m−2), d_(m−3); loop-back boundary; loop-back children e_(o+2), f_(o+1), g_o, with g_o being the reentrant node.
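The two-step ordering and its terminology can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `order_children` is hypothetical, children are represented only by their integer post-order IDs, and the concrete numbers are invented to mirror the layout of Figure 5.5 with n = 10, m = 9, and o = 10.

```python
def order_children(parent_id, child_ids):
    """Split and order the children of a group node by post-order ID.

    Returns (regular, loop_back, reentrant):
      regular   - IDs smaller than parent_id, in descending order
      loop_back - IDs greater than or equal to parent_id, in descending order
      reentrant - the smallest loop-back ID, or None if there is no loop
    """
    regular = sorted((c for c in child_ids if c < parent_id), reverse=True)
    loop_back = sorted((c for c in child_ids if c >= parent_id), reverse=True)
    reentrant = loop_back[-1] if loop_back else None
    return regular, loop_back, reentrant


# Hypothetical group node with poid 10 and seven children in arbitrary order.
regular, loop_back, reentrant = order_children(10, [12, 7, 9, 11, 8, 6, 10])

# The sequentialization order is the regular children followed by the
# loop-back children; the loop-back boundary lies between the two lists.
sequence = regular + loop_back
```

Keeping the two groups as separate lists, rather than one merged list, makes the loop-back boundary explicit, which is convenient when something has to be placed at that boundary.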
Figure 5.6: A CFG and its corresponding PDG. (a) A CFG with post-order numbers of blocks as subscripts (blocks A5, B4, C3, E1, D2). (b) A PDG with CFG post-order numbers as node IDs (nodes root, A5, C3, E1, group node g3, D2, B4).
The loop-back boundary has a special meaning during parallelization: a schedule that executes all regular children of a group node in parallel with all loop-back children typically results in a DOALL-parallelized loop. In cases where loop parallelization requires a reduction or privatization, for instance, the loop-back boundary is the location at which to introduce fix-up code.
As a simple but complete example, consider the CFG and its corresponding PDG in Figure 5.6, in particular the group node labeled g3 with its three children D2, B4, and C3, which have to be sequentialized in exactly this order to guarantee a minimal CFG. The root node's children have to be scheduled in the order A5, B4, C3, E1.
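The orders stated for this example can be reproduced with a single sort key. The sketch below is illustrative only: the function name `sequentialize` and the dictionary encoding are hypothetical, and treating the root as if its ID were larger than every block ID is an assumption made here so that all of its children count as regular.

```python
def sequentialize(parent_id, children):
    """Order child names: regular children first, then loop-back children,
    each group in descending order of post-order ID.

    children maps a node name to its post-order ID (hypothetical encoding).
    """
    return sorted(
        children,
        # False sorts before True, so regular children come first;
        # the negated ID gives descending order within each group.
        key=lambda name: (children[name] >= parent_id, -children[name]),
    )


# Group node g3 from Figure 5.6 and its children D2, B4, C3.
g3_order = sequentialize(3, {"B": 4, "C": 3, "D": 2})

# Root node: assumed here to compare greater than every block ID.
root_order = sequentialize(float("inf"), {"C": 3, "E": 1, "A": 5, "B": 4})
```

Under these assumptions, `g3_order` is D, B, C and `root_order` is A, B, C, E, matching the orders given in the text.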
Storing the post-order ID as described above for each PDG node might seem like a significant overhead, especially since it has to be kept throughout the whole compilation process, within the binary produced by Sambamba, and at runtime. Indeed, for exactly that reason, we initially implemented Steensgaard's algorithm to recompute and store the relevant ordering only when needed. It turned out, however, that this particular information is needed frequently throughout the whole process. Additionally, we were willing to trade memory for runtime efficiency. Finally, a unique ID per node is needed anyway for several implementation-related reasons.
5.1.2.1 Sequentialization of Parallelized PDGs
So far we have discussed sequentializing a PDG with the goal of obtaining the original, sequential CFG without duplicating code or introducing execution guards. If the children of a group node are scheduled for parallel execution, however, generating the ParCFG might require duplication of blocks, not only of those directly contained in the parallel region but especially of those preceding it. Note that relying on the simple node ordering defined above still results in a minimum of duplicated code. Placing multiple copies of blocks, however, requires selecting among the cloned values for later use by control-flow successors. This is handled during ParCFG generation.