Discovering Parallelism Patterns - Partitioning and Parallelism Extraction

5.3 Partitioning and Parallelism Extraction

5.3.2 Discovering Parallelism Patterns

The initial clustering approach of the MAPS framework relied heavily on the user to produce the final parallel implementation. It was up to the user to decide whether two clusters that were loosely connected by data dependencies would actually feature a good degree of TLP. Moreover, a plain clustering approach does not explicitly address DLP and PLP. This section presents new heuristics that expose all kinds of parallelism and reduce the amount of user interaction needed to obtain a parallel implementation.

Figure 5.5 shows the parallelism patterns introduced in Section 2.4.1.5, as seen in a dependence flow graph. Discovering those patterns from a DFG is a straightforward process. It is more challenging to decide when a given pattern exposes relevant parallelism. The way this is done is described in Sections 5.3.2.1–5.3.2.3. The discussion in those sections is based on a measure of the efficiency of the parallel execution.

Definition 5.2. The parallel efficiency of a portion of code with sequential execution time t_seq, executed in parallel on n_PE processing elements in time t_par, is

$$\eta = \frac{t_{seq}}{t_{par} \cdot n_{PE}} \qquad (5.4)$$

[Figure 5.5: Patterns in a partitioned DFG and sample Gantt charts. Lines in the nodes represent program points. a) Examples of TLP with different data dependencies. b) Example of DLP with two different levels of parallelism. c) Example of PLP with and without balancing.]

The sequential execution time can be obtained by applying one of the performance estimation functions to any given processor type. The parallel execution time must be estimated, depending on the type of parallelism, communication costs and other factors. The goal of the parallelism extraction process is to reduce t_par while retaining an acceptable parallel efficiency, as close as possible to unity. For acceptable results, η ∈ (1/n_PE, 1). That is, a parallelization scheme for which t_par > t_seq is not acceptable.
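As a minimal sketch of Definition 5.2, the helper below computes η and checks the acceptance condition η > 1/n_PE. The function names are illustrative and not part of MAPS. The sample values t_seq = 22 and t_par = 12 on 4 PEs are the ones quoted for the DLP example in Section 5.3.2.2; the value t_par = 30 is made up to show a rejected case.

```python
def parallel_efficiency(t_seq: float, t_par: float, n_pe: int) -> float:
    """Parallel efficiency eta = t_seq / (t_par * n_pe), as in Equation 5.4."""
    if t_par <= 0 or n_pe < 1:
        raise ValueError("t_par must be positive and n_pe at least 1")
    return t_seq / (t_par * n_pe)


def is_acceptable(t_seq: float, t_par: float, n_pe: int) -> bool:
    """True when eta > 1/n_pe, which is equivalent to t_par < t_seq."""
    return parallel_efficiency(t_seq, t_par, n_pe) > 1.0 / n_pe


eta = parallel_efficiency(t_seq=22, t_par=12, n_pe=4)
print(round(eta, 2))             # 0.46
print(is_acceptable(22, 12, 4))  # True: the parallel version is faster
print(is_acceptable(22, 30, 4))  # False: t_par > t_seq
```

The lower bound 1/n_PE follows directly from the definition: η > 1/n_PE holds exactly when t_seq > t_par.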

5.3.2.1 Analysis for TLP

As mentioned before, clusters in a partitioned graph explicitly express potential TLP. In general, TLP is characterized by nodes which feature few data dependencies, or no dependency at all. The benefit of actually creating tasks for each cluster depends on several factors. This is shown graphically in Figure 5.5a for two different configurations of a two-cluster graph. As the figure shows, the efficiency of the parallel execution depends on the program points that create the dependencies and the time it takes to communicate the data (∆t in the figure). If the data is produced late and needed early in the clusters, the parallel execution could be even longer than the sequential one.

The pseudocode for TLP analysis is shown in Algorithm 5.3. It receives as inputs the DFG of a function and the results of graph clustering (C_i^{DFG_f}) and returns a new clustering (C_o^{DFG_f}) improved for TLP. The function CollapseCF in Line 2 collapses all control flow structures in the function’s DFG so that the remaining control flow is linear. This means that complete if-then-else regions and loops are clustered together. Loops are therefore ignored in this algorithm and are later analyzed for DLP and PLP. The loop in Lines 4–11 walks the linear list and merges every two consecutive clusters that display a parallel efficiency lower than a threshold (η∗ in Line 8). The target-dependent sequential time t_seq is computed using Equation 5.1 for all the basic blocks in clusters C and C′. The parallel execution time t_par is obtained from an ASAP scheduling of the two clusters on any given PE, using the fastest communication primitive to estimate data communication costs. The computation of t_seq and t_par is illustrated on the right-hand side of Figure 5.5a.

Algorithm 5.3 Improving TLP.
 1: procedure TLPfind(DFG_f, C_i^{DFG_f})
 2:   CollapseCF(C_i^{DFG_f})
 3:   C ← First(C_i^{DFG_f}), C_o^{DFG_f} ← ∅
 4:   while succ(C) ≠ ∅ do
 5:     C′ ← succ(C)    → Next cluster in the linear control flow
 6:     t_par ← ASAP({C, C′}, DFG_f), t_seq ← SeqTime({C, C′}, DFG_f)
 7:     η ← t_seq/(2 · t_par)
 8:     if η < η∗ then C ← C ∪ C′    → η∗ ≥ 0.5
 9:     else C_o^{DFG_f} ← C_o^{DFG_f} ∪ {C}, C ← C′
10:     end if
11:   end while
12:   if C ∉ C_o^{DFG_f} then C_o^{DFG_f} ← C_o^{DFG_f} ∪ {C}
13:   end if
14:   return C_o^{DFG_f}
15: end procedure

Although the example in Figure 5.5a has only one dependency edge, it is easy to understand how this applies to cases with more data dependencies. Since the purpose of this phase is to expose all available parallelism in the application, ASAP scheduling is used, which ignores resource constraints. Whether or not two clusters will actually be exported as tasks is decided later. For this reason, Algorithm 5.3 ignores other costs associated with the parallel execution, e.g., the time needed for task creation.
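The merging walk of Algorithm 5.3 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: clusters are modeled as frozensets of node names, and the two timing callbacks stand in for SeqTime and the ASAP schedule. The cost table and the "fully serializing edge" model are hypothetical and much cruder than a real ASAP schedule with communication costs.

```python
from typing import Callable, FrozenSet, List

Cluster = FrozenSet[str]


def tlp_find(
    clusters: List[Cluster],
    seq_time: Callable[[Cluster], float],
    par_time: Callable[[Cluster, Cluster], float],
    eta_min: float = 0.5,
) -> List[Cluster]:
    """Merge consecutive clusters whose pairwise efficiency is below eta_min."""
    if not clusters:
        return []
    result: List[Cluster] = []
    current = clusters[0]
    for nxt in clusters[1:]:
        t_seq = seq_time(current | nxt)
        t_par = par_time(current, nxt)
        eta = t_seq / (2 * t_par)      # two candidate tasks, hence the factor 2
        if eta < eta_min:
            current = current | nxt    # not worth splitting: merge the pair
        else:
            result.append(current)     # keep the cluster as its own candidate task
            current = nxt
    result.append(current)
    return result


# Hypothetical timing model: per-node costs plus edges that fully serialize.
COST = {"A": 10.0, "B": 10.0, "C": 2.0}
SERIAL = {("B", "C")}  # C needs the result of B at its very first program point


def seq_time(c: Cluster) -> float:
    return sum(COST[n] for n in c)


def par_time(a: Cluster, b: Cluster) -> float:
    dependent = any((x, y) in SERIAL for x in a for y in b)
    return seq_time(a) + seq_time(b) if dependent else max(seq_time(a), seq_time(b))


linear = [frozenset({"A"}), frozenset({"B"}), frozenset({"C"})]
print(tlp_find(linear, seq_time, par_time, eta_min=0.7))
```

With a threshold of 0.7, A stays a separate cluster (η = 1.0 against B), while B and C are merged because the serializing dependency yields η = 0.5.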

5.3.2.2 Analysis for DLP

The DLP graph pattern is shown on the left-hand side of Figure 5.5b. It is characterized by a loop body that has no loop carried dependencies, other than those generated by the loop induction variables. The loop has input and output dependency edges, but the memory locations that create these dependencies are disjoint across loop iterations. This means that the body of the loop can be copied several times and can work in parallel on different input and output data containers. The following restrictions are also checked before marking a loop as DLP:

• There must be one induction variable.

• The loop body must have no side effects, such as calls to non-reentrant functions.

• Incoming and outgoing data edges must refer to a single array.

• The loop must be well structured.

All this information is contained in the annotations produced by the graph analysis phase at the level of loops and functions (see Section 5.2.4). Since no data dependencies have to be considered as in the case of TLP, the efficiency computation is simpler for DLP. The parallel time is modeled by t_par = t_seq/n_PE + t_comm(n_PE), where t_comm measures the communication and synchronization overhead, depending on the number of cores n_PE that are used. Note that if ∀n_PE, t_comm(n_PE) → 0, then η → 1. In reality, the synchronization overhead is not negligible, as illustrated by the Gantt charts on the right-hand side of Figure 5.5b. The efficiency of the 4-PE and 2-PE configurations is 22/(12 · 4) = 0.46 and 22/(15 · 2) = 0.73, respectively.

Algorithm 5.4 DLP Annotation.
 1: procedure DLPAnnot(DFG_f, C, c̄, n)
 2:   PA_C ← (PA_C^type = ∅, X^{PA_C} = ∅, V^{PA_C} = {v_1})    → Initial empty parallel annotation
 3:   D_{v_1}^{PA_C} ← {1, . . . , c̄}    → c̄: Average trip count
 4:   if c̄ < K · n then return PA_C    → K: minimum ratio of trip count to loop executions
 5:   end if
 6:   t_seq ← SeqTime({C}, DFG_f)
 7:   PA_C^type ← DLP
 8:   m ← 0
 9:   repeat
10:     m ← m + 1
11:     t_comm(m) ← CommCost(DFG_f, C, m), t_par ← t_seq/m + t_comm(m)
12:     η ← t_seq/(m · t_par)
13:   until (η < η∗) ∨ (m ≥ m_max)    → m_max: e.g., number of processors
14:   D_{v_1}^{PA_C} ← D_{v_1}^{PA_C} ∩ {1, . . . , m}
15:   return PA_C    → With associated domain D_{v_1}^{PA_C}
16: end procedure

The heuristic for DLP is quite simple, as shown in Algorithm 5.4. It receives as input a cluster C from a function’s DFG that represents the body of a loop that meets the conditions for DLP stated above. It also receives the average trip count (c̄) and the number of times the loop was executed (n). These two values are annotated in the loop header after graph analysis. The function returns a parallel annotation in the sense of Definition 2.34. Recall the definition of a parallel annotation for DLP, PA_C = (DLP, ∅, V^{PA_C} = {v_1}), where the variable v_1 determines the number of data parallel tasks to create from the parallel loop. Algorithm 5.4 iterates over the possible number of tasks m in Lines 9–13 and stops once the efficiency falls below a threshold η∗ or m reaches a maximum value m_max, e.g., the number of PEs in the platform. The domain of the variable is then restricted to {1, . . . , min(c̄, m)} in Line 14. The code in Lines 4–5 is used to discard loops that provide DLP but would produce an overall low gain. This is determined by comparing the number of times the loop is started with the average trip count. For example, a constant K = 10 enforces that the average trip count is at least 10 times the number of times the loop is instantiated. This helps to reduce the task creation overhead, which is ignored in the computation of t_par.
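The search over the number of tasks in Algorithm 5.4 can be sketched as follows. The sketch mirrors the pseudocode literally, so the returned domain still contains the first task count whose efficiency fell below the threshold. The function name, the parameter names, and the overhead table are assumptions for illustration: the overheads for 2 and 4 PEs are chosen to reproduce the parallel times 15 and 12 quoted for Figure 5.5b, while the value for 3 PEs is invented.

```python
from typing import Callable, List, Optional


def dlp_domain(
    t_seq: float,
    comm_cost: Callable[[int], float],
    avg_trip_count: int,
    n_executions: int,
    eta_min: float,
    m_max: int,
    k: int = 10,
) -> Optional[List[int]]:
    """Candidate task counts for a DLP loop, or None if the loop is discarded."""
    # Discard loops that are started often but iterate little.
    if avg_trip_count < k * n_executions:
        return None
    m = 0
    while True:  # repeat ... until
        m += 1
        t_par = t_seq / m + comm_cost(m)
        eta = t_seq / (m * t_par)
        if eta < eta_min or m >= m_max:
            break
    # Restrict the domain {1, ..., avg_trip_count} to {1, ..., m}.
    return list(range(1, min(avg_trip_count, m) + 1))


# Hypothetical overhead table t_comm(m); only m = 2 and m = 4 match Figure 5.5b.
overhead = {1: 0.0, 2: 4.0, 3: 5.0, 4: 6.5}
domain = dlp_domain(22.0, overhead.__getitem__, avg_trip_count=100,
                    n_executions=1, eta_min=0.6, m_max=4)
print(domain)  # [1, 2, 3]
```

Here the efficiencies are 1.0, 0.73 and about 0.59 for one, two and three tasks, so the search stops at m = 3.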

5.3.2.3 Analysis for PLP

The PLP pattern is shown on the left-hand side of Figure 5.5c, where the DFG loop nodes were omitted for the sake of clarity. PLP is characterized by a loop with several clusters and mostly forward dependencies. Loop carried dependencies are allowed, as long as they are not between the last and the first clusters.

The pseudocode in Algorithm 5.5 describes how the parallel annotations for PLP are created. The algorithm receives the same inputs as for DLP in Algorithm 5.4, together with the previous clustering in the hierarchy (C_{i−1}^{DFG_f}). The function returns a PLP parallel annotation, PA_C = (PLP, X^{PA_C}, V^{PA_C} = {v_1, υ_2}), where X^{PA_C} is the set of partitions that constitute the loop body, v_1 is a variable that determines the number of pipeline stages and υ_2 is a function that maps clusters to pipeline stages.

Algorithm 5.5 PLP Annotation.
 1: procedure PLPAnnot(DFG_f, C, c̄, n, C_{i−1}^{DFG_f})
 2:   PA_C ← (PA_C^type = ∅, X^{PA_C} = ∅, V^{PA_C} = {v_1, υ_2})    → Initial empty parallel annotation
 3:   X^{PA_C} ← DeCluster(C, C_{i−1}^{DFG_f})
 4:   X^{PA_C} ← CollapseCF(X^{PA_C}); m ← |X^{PA_C}|    → X^{PA_C} = {C_1, C_2, . . . , C_m} from previous partition
 5:   if (c̄ < K · n) ∨ m < 2 then return PA_C
 6:   end if
 7:   D_{v_1}^{PA_C} ← {2, . . . , m}, PA_C^type ← PLP, t_seq ← c̄ · SeqTime({C}, DFG_f), D_{υ_2}^{PA_C} ← ∅
 8:   for i ∈ D_{v_1}^{PA_C} do
 9:     (D, t_stage) ← BalancePipe(DFG_f, X^{PA_C}, i)
10:     t_par ← (c̄ + i − 1) · (t_stage + t_comm), η ← t_seq/(i · t_par)
11:     if η > η∗ then D_{υ_2}^{PA_C} ← D_{υ_2}^{PA_C} ∪ D
12:     end if
13:   end for
14:   return PA_C    → With associated variable domains
15: end procedure
16: procedure BalancePipe(DFG_f, L, n_stage)
17:   t_budget ← SeqTime(L, DFG_f)/n_stage    → Optimal time per stage
18:   C ← First(L), t_stage ← SeqTime(C, DFG_f), s ← 1, t∗_stage ← t_stage, D ← {(n_stage, C, s)}
19:   while succ(C) ≠ ∅ do
20:     C′ ← succ(C), t′ ← SeqTime(C′, DFG_f)
21:     if t′ + t_stage > K · t_budget then s ← min(s + 1, n_stage), t_stage ← 0    → K: Imbalance factor
22:     end if
23:     D ← D ∪ {(n_stage, C′, s)}, t_stage ← t_stage + t′, C ← C′
24:     if t∗_stage < t_stage then t∗_stage ← t_stage
25:     end if
26:   end while
27:   return (D, t∗_stage)
28: end procedure

The partitions that constitute the loop body (C) are retrieved from the clustering C_{i−1}^{DFG_f} by the function DeCluster in Line 3. Thereafter, the control flow is collapsed in Line 4, in the same way as for TLP in Algorithm 5.3. The code in Lines 5–6 discards the loop in a similar way as for DLP. In addition to the execution condition, the loop is rejected if the loop body contains fewer than two clusters (m < 2). The number of clusters m = |X^{PA_C}| defines the maximum number of pipeline stages and therefore restricts the domain of the first variable of the parallel annotation (see Line 7). The speedup of a pipeline cannot be measured per iteration as done for DLP, but requires a complete execution of the loop. For this reason, the sequential estimation returned for C is multiplied by the average trip count c̄ in Line 7.

The loop in Lines 8–13 determines the domain of the second variable υ_2 by iterating over all possible pipeline lengths. Note that this variable defines a mapping of clusters to pipeline stages depending on the number of stages to use. This can be modeled as a relation on D_{v_1}^{PA_C} × X^{PA_C} × D_{v_1}^{PA_C}, where a triple (x ∈ D_{v_1}^{PA_C}, y ∈ X^{PA_C}, z ∈ D_{v_1}^{PA_C}) means that in an x-stage pipeline, the cluster y is mapped to the z-th stage. The actual mapping happens within the function BalancePipe, which returns a mapping D for a given number of stages n_stage. The function also returns the time of the largest pipeline stage t∗_stage. The algorithm in BalancePipe corresponds to the greedy first-fit solution of the bin packing problem [84]. The first-fit decreasing variant is not used in order to retain the original control flow. The constant K ∈ (1.0, 1.5) in Line 21 allows some imbalance. A value of K = 1.0 often results in a very unbalanced last stage.
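The greedy stage assignment of BalancePipe can be sketched as below. This is an illustrative reading rather than the MAPS implementation: clusters are given as (name, cost) pairs in control-flow order, and the costs 1, 1, 2 for T1, T2, T3 are hypothetical. One deliberate assumption differs from the pseudocode: once the last stage is reached, its accumulated time is no longer reset, so the returned maximum reflects the full load of that stage.

```python
from typing import List, Tuple

Mapping = List[Tuple[int, str, int]]  # (number of stages, cluster, stage index)


def balance_pipe(
    clusters: List[Tuple[str, float]],
    n_stage: int,
    k: float = 1.2,
) -> Tuple[Mapping, float]:
    """Greedily map clusters, in order, onto n_stage pipeline stages.

    Returns the mapping and the time of the longest stage.
    """
    if not clusters or n_stage < 1:
        raise ValueError("need at least one cluster and one stage")
    t_budget = sum(cost for _, cost in clusters) / n_stage  # ideal time per stage
    name, t_stage = clusters[0]
    s = 1
    t_max = t_stage
    mapping: Mapping = [(n_stage, name, s)]
    for name, cost in clusters[1:]:
        # Open a new stage when the (relaxed) budget would be exceeded.
        # Assumption: the last stage keeps accumulating instead of being reset.
        if cost + t_stage > k * t_budget and s < n_stage:
            s += 1
            t_stage = 0.0
        mapping.append((n_stage, name, s))
        t_stage += cost
        t_max = max(t_max, t_stage)
    return mapping, t_max


loop_body = [("T1", 1.0), ("T2", 1.0), ("T3", 2.0)]  # hypothetical costs
for stages in (2, 3):
    print(balance_pipe(loop_body, stages))
```

With these costs the two-stage mapping groups T1 and T2 into the first stage and puts T3 into the second, while the three-stage mapping gives each cluster its own stage. In both cases the longest stage takes 2 time units.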

After obtaining a mapping of the clusters to pipeline stages in Line 9, the time of the parallel execution is estimated. For the estimation, the time of the largest pipeline stage is considered. The time needed to initialize the pipeline and the time needed to flush it are each approximated by (i − 1) · t_stage, so that the total execution time of the pipeline is 2 · (i − 1) · t_stage + (c̄ − (i − 1)) · t_stage = (c̄ + i − 1) · t_stage. The communication cost is also added to the computation time, as shown in Line 10. A pipeline configuration is added to the domain of variable υ_2 only if its efficiency is above the threshold η∗. For the example in Figure 5.5c, the domain of variable υ_2 would then be D_{υ_2}^{PA_C} = {(2, T1, 1), (2, T2, 1), (2, T3, 2), (3, T1, 1), (3, T2, 2), (3, T3, 3)}.
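The pipeline estimate of Line 10 can be checked numerically with a small helper. The function name and all sample values (a loop body of 4 time units, a longest stage of 2, a communication cost of 0.1 and an average trip count of 100) are assumptions chosen for illustration, not measurements from the thesis.

```python
def pipeline_efficiency(
    t_body: float,
    t_stage_max: float,
    t_comm: float,
    avg_trip_count: int,
    n_stages: int,
) -> float:
    """Efficiency of an n_stages pipeline over a complete execution of the loop."""
    t_seq = avg_trip_count * t_body
    # Fill and flush add (n_stages - 1) extra stage slots on top of the trip count.
    t_par = (avg_trip_count + n_stages - 1) * (t_stage_max + t_comm)
    return t_seq / (n_stages * t_par)


for stages in (2, 3):
    eta = pipeline_efficiency(4.0, 2.0, 0.1, 100, stages)
    print(stages, round(eta, 3))
# 2 0.943
# 3 0.622
```

Under these assumed numbers, a threshold of 0.6 would admit both configurations into the domain of the mapping variable, whereas a threshold of 0.7 would keep only the two-stage pipeline. The third stage adds a processing element without shortening the longest stage.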

In document Programming heterogeneous MPSoCs : tool flows to close the software productivity gap (Page 92-97)