8.1.2 Greedy constructive procedures

To construct a solution, the greedy procedures require only the target number of processors. A solution is generated in negligible time, since no performance estimation needs to be performed.

Workload Balance (WB)

The concept of balancing the workload in order to minimize the bottlenecks of the program, and hence maximize the throughput, has already been employed successfully for partitioning systems of different types [185]. The first constructive heuristic is inspired by such approaches. The algorithm starts by calculating the total workload of each group throughout the program execution. It is expressed as the sum of the $p_j$ over all jobs (firings) $j$ belonging to one group (actor) $g$: $wl(g) = \sum_{j \in g} p_j$. The actors are then sorted in decreasing order of this sum of weights (workload). The partitioning decision is based on the sum of the workloads of the actors already assigned to a processor $\rho$: $wl(\rho) = \sum_{g \in \rho} wl(g)$. The next actor on the list is always assigned to the processor with the smallest sum of workloads $wl(\rho)$. In this way, the total workload of the partitions should be balanced and the workload of the most occupied processor is likely to be minimized.

To illustrate the flow of the algorithm, the sample network depicted earlier in Figure 2.4 is used. The sample file containing processing weights for the actions of the actors is presented in Listing 8.1. For the considered set of weights, Table 8.1 presents all the steps of the algorithm, assuming that a partitioning on 2 machines is to be established. For each step, it indicates which actor is selected for assignment, its total workload, the target partition it is assigned to ($\rho_1$ or $\rho_2$, in this case) and the value of $wl(\rho)$ of that partition after the assignment. Notice that the resulting values of $wl(\rho)$ are very close to each other (406 for $\rho_1$ and 410 for $\rho_2$).

| Step id | actor/group | $wl(g)$ | target $\rho$ | $wl(\rho)$ |
|---|---|---|---|---|
| 1 | D | 270 | $\rho_1$ | 270 |
| 2 | F | 180 | $\rho_2$ | 180 |
| 3 | G | 120 | $\rho_2$ | 300 |
| 4 | B | 110 | $\rho_1$ | 380 |
| 5 | C | 60 | $\rho_2$ | 360 |
| 6 | A | 50 | $\rho_2$ | 410 |
| 7 | E | 26 | $\rho_1$ | 406 |
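The steps described above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function name `workload_balance` is hypothetical, the workloads are hard-coded from Table 8.1, and ties between equally loaded processors are broken by taking the lowest index, which the text does not specify.

```python
def workload_balance(workloads, num_processors):
    """Greedy WB sketch: heaviest actor first, always onto the least-loaded processor."""
    loads = [0] * num_processors                      # wl(rho) for each processor
    partitions = [[] for _ in range(num_processors)]  # actors assigned to each processor
    steps = []
    # Sort actors by decreasing workload wl(g).
    for actor, wl in sorted(workloads.items(), key=lambda kv: -kv[1]):
        # Assumption: ties are resolved in favour of the lowest processor index.
        target = loads.index(min(loads))
        partitions[target].append(actor)
        loads[target] += wl
        steps.append((actor, wl, target + 1, loads[target]))
    return partitions, loads, steps


# Actor workloads wl(g) as listed in Table 8.1.
workloads = {"A": 50, "B": 110, "C": 60, "D": 270, "E": 26, "F": 180, "G": 120}
partitions, loads, steps = workload_balance(workloads, 2)

for step_id, (actor, wl, target, load) in enumerate(steps, start=1):
    print(step_id, actor, wl, f"rho{target}", load)
```

With these inputs the sketch follows the same sequence of assignments as Table 8.1 and ends with workloads of 406 and 410 on the two processors.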

Listing 8.1 – Sample (processing) weights file for the program from Figure 2.4

<?xml version="1.0" ?>
<network name="Sample_Network">
  <actor id="A">
    <action id="a" clockcycles="10" firings="5"/>
  </actor>
  <actor id="B">
    <action id="b1" clockcycles="20" firings="4"/>
    <action id="b2" clockcycles="15" firings="2"/>
  </actor>
  <actor id="C">
    <action id="c" clockcycles="10" firings="6"/>
  </actor>
  <actor id="D">
    <action id="d1" clockcycles="50" firings="3"/>
    <action id="d2" clockcycles="40" firings="3"/>
  </actor>
  <actor id="E">
    <action id="e1" clockcycles="5" firings="2"/>
    <action id="e2" clockcycles="4" firings="4"/>
  </actor>
  <actor id="F">
    <action id="f" clockcycles="30" firings="6"/>
  </actor>
  <actor id="G">
    <action id="g" clockcycles="20" firings="6"/>
  </actor>
</network>
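As a sketch of how the actor workloads could be derived from a weights file such as Listing 8.1, the following Python snippet parses the XML with the standard library and sums `clockcycles` times `firings` over the actions of each actor. That product is an assumption about how the file is interpreted; it is inferred from the fact that it reproduces the $wl(g)$ values of Table 8.1. The function name `actor_workloads` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Content of Listing 8.1, embedded so that the example is self-contained.
WEIGHTS_XML = """<?xml version="1.0" ?>
<network name="Sample_Network">
  <actor id="A">
    <action id="a" clockcycles="10" firings="5"/>
  </actor>
  <actor id="B">
    <action id="b1" clockcycles="20" firings="4"/>
    <action id="b2" clockcycles="15" firings="2"/>
  </actor>
  <actor id="C">
    <action id="c" clockcycles="10" firings="6"/>
  </actor>
  <actor id="D">
    <action id="d1" clockcycles="50" firings="3"/>
    <action id="d2" clockcycles="40" firings="3"/>
  </actor>
  <actor id="E">
    <action id="e1" clockcycles="5" firings="2"/>
    <action id="e2" clockcycles="4" firings="4"/>
  </actor>
  <actor id="F">
    <action id="f" clockcycles="30" firings="6"/>
  </actor>
  <actor id="G">
    <action id="g" clockcycles="20" firings="6"/>
  </actor>
</network>"""


def actor_workloads(xml_text):
    """Return {actor id: total workload}, assuming workload = clockcycles * firings."""
    root = ET.fromstring(xml_text)
    return {
        actor.get("id"): sum(
            int(action.get("clockcycles")) * int(action.get("firings"))
            for action in actor.findall("action")
        )
        for actor in root.findall("actor")
    }


workloads = actor_workloads(WEIGHTS_XML)
print(workloads)
```

For actor D, for example, this gives 50 · 3 + 40 · 3 = 270, the value used in the first step of Table 8.1.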

Balanced Pipeline (BP)

The algorithm starts by giving each actor a dedicated processor. Next, the number of processors is iteratively reduced, and the members of the least occupied processor are attached to the remaining processors. The optimization criteria of the algorithm include equalizing the average preceding workload (APW) between the partitions and maximizing the number of common predecessors (ACP) for each partition. APW is defined as the maximal sum of weights of the jobs belonging to the actors (groups) that precede the given actor in the network in terms of topological order: $\max \sum_{j \in g_j,\; g_j \prec G} p_j$, where $g_j$ is the actor that job $j$ belongs to and $g_j \prec G$ means that $g_j$ precedes the given actor $G$. The ACP number is evaluated for each pair of actors and denotes the number of actors appearing on the topological lists of predecessors of both actors. An actor is also considered to be its own predecessor. In addition, the list of predecessors must take into account the cycles between the actors, if they appear. The idea behind these criteria is to join the units where the overall APW is small with those with a large APW, so that the actors which are about to fire at a similar time during the execution do not block each other. An additional criterion favors a high ACP value between actors inside one unit, as most likely there is a pipeline between them that would prevent their parallel execution anyway.

The flow of the algorithm is illustrated with the same example (the network and the weights) as for the WB algorithm. Again, partitioning on 2 processing units is considered. First, the settings for the algorithm are presented. Table 8.2 summarizes the calculated values of AW and APW for each actor, and Table 8.3 presents the values of ACP for each pair of actors. Table 8.4 illustrates the steps of the algorithm. Each step indicates the initial partitioning configuration, the value of APW for each partition (listed next to each partition) and the move chosen to be performed. Notice that, unlike for the previous algorithm, the created partitions have close values of the preceding workload rather than of the workload. The resulting configuration is also completely different from the one established by the WB algorithm.

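| actor/group | AW | APW |
|---|---|---|
| A | 50 | 0 |
| B | 110 | 330 |
| C | 60 | 186 |
| D | 270 | 546 |
| E | 26 | 220 |
| F | 180 | 246 |
| G | 120 | 246 |

Table 8.2 – BP partitioning algorithm: AW and APW settings.

|   | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| A | - | 1 | 1 | 1 | 1 | 1 | 1 |
| B | 1 | - | 4 | 4 | 4 | 4 | 4 |
| C | 1 | 4 | - | 4 | 4 | 4 | 4 |
| D | 1 | 4 | 4 | - | 4 | 5 | 5 |
| E | 1 | 4 | 4 | 4 | - | 4 | 4 |
| F | 1 | 4 | 4 | 5 | 4 | - | 5 |
| G | 1 | 4 | 4 | 5 | 4 | 5 | - |

Table 8.3 – BP partitioning algorithm: ACP settings.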

| Step id | Partitioning configuration | Chosen connection |
|---|---|---|
| 0 | {D} = 546, {B} = 330, {F} = 246, {G} = 246, {E} = 220, {C} = 186, {A} = 0 | {A} → {D} |
| 1 | {A, D} = 273, {B} = 330, {F} = 246, {G} = 246, {E} = 220, {C} = 186 | {C} → {B} |
| 2 | {A, D} = 273, {B, C} = 258, {F} = 246, {G} = 246, {E} = 220 | {E} → {B, C} |
| 3 | {A, D} = 273, {F} = 246, {G} = 246, {B, C, E} = 245 | {B, C, E} → {F} |
| 4 | {A, D} = 273, {G} = 246, {B, C, E, F} = 245 | {B, C, E, F} → {G} |
| 5 | {A, D} = 273, {B, C, E, F, G} = 245 | - |

Table 8.4 – BP partitioning algorithm: sample flow.
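The per-partition values in Table 8.4 can be checked with a small Python sketch that replays the listed moves. It does not implement the selection rule of the BP algorithm (the combination of the APW and ACP criteria is not fully specified here); the moves are simply taken from the table. The assumption that the value of a partition is the mean APW of its members, rounded down, is inferred from the numbers in the table, and the helper names `partition_apw` and `merge` are hypothetical.

```python
# APW of each actor, as listed in Table 8.2.
APW = {"A": 0, "B": 330, "C": 186, "D": 546, "E": 220, "F": 246, "G": 246}


def partition_apw(partition):
    """Mean APW of the partition members, rounded down (assumption inferred from Table 8.4)."""
    return sum(APW[actor] for actor in partition) // len(partition)


def merge(partitions, source, target):
    """Attach all members of `source` to `target` and return the new configuration."""
    if source not in partitions or target not in partitions:
        raise ValueError("both partitions must exist in the current configuration")
    merged = tuple(sorted(source + target))
    return [p for p in partitions if p not in (source, target)] + [merged]


# Moves copied from the "Chosen connection" column of Table 8.4 (steps 0 to 4).
moves = [
    (("A",), ("D",)),
    (("C",), ("B",)),
    (("E",), ("B", "C")),
    (("B", "C", "E"), ("F",)),
    (("B", "C", "E", "F"), ("G",)),
]

partitions = [(actor,) for actor in "ABCDEFG"]  # one dedicated processor per actor
history = []
for source, target in moves:
    partitions = merge(partitions, source, target)
    history.append({p: partition_apw(p) for p in partitions})

for step_id, configuration in enumerate(history, start=1):
    print(step_id, configuration)
```

After the last move, two partitions remain, {A, D} and {B, C, E, F, G}, with values 273 and 245, as in step 5 of the table.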

The algorithm can operate in two modes. If the number of partitions is fixed, the algorithm proceeds until the given number is reached. Otherwise, the number of processing units must be established. For that purpose, two additional parameters are introduced: (a) the Average Partitioning Occupancy (APO), calculated as the average of the processing time of each unit expressed in percent; (b) the Standard Deviation of Occupancy (SDO), calculated as the statistical standard deviation of the occupancy of the units. These parameters are calculated during the performance estimation. Preliminary experiments and observations suggest that a balanced partitioning configuration is characterized by a high average occupancy and, at the same time, a low standard deviation. With such a distribution of values, in the ideal case, all partitions should be equally and maximally occupied. Therefore, the ratio of APO to SDO is used to evaluate a partitioning configuration. As the reduction procedure continues, this ratio quite naturally increases. If the opposite occurs, it usually means that a strong inequality of the workload among the units has been introduced. Hence, a decrease of the ratio determines the stop condition of the algorithm.
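The stop condition based on the APO-to-SDO ratio can be sketched in Python as follows. This is only an illustration: the occupancy values are invented, the function names `occupancy_ratio` and `stop_index` are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of the standard deviation is used.

```python
from statistics import mean, pstdev


def occupancy_ratio(occupancies):
    """APO / SDO for one configuration; occupancies are per-unit percentages."""
    apo = mean(occupancies)    # Average Partitioning Occupancy
    sdo = pstdev(occupancies)  # Standard Deviation of Occupancy (population form: assumption)
    # Perfectly equal occupancies give SDO = 0; treat this as the best possible ratio.
    return float("inf") if sdo == 0 else apo / sdo


def stop_index(configurations):
    """Index of the last configuration reached before the ratio first decreases."""
    for i in range(1, len(configurations)):
        if occupancy_ratio(configurations[i]) < occupancy_ratio(configurations[i - 1]):
            return i - 1
    return len(configurations) - 1


# Hypothetical occupancies (in percent) after successive reduction steps: 4, 3, then 2 units.
configurations = [[40, 50, 60, 70], [60, 70, 80], [30, 95]]
chosen = stop_index(configurations)
print(chosen, len(configurations[chosen]))
```

In this made-up sequence the ratio rises when going from four to three units and drops sharply with two units, so the sketch stops at the three-unit configuration.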

Once an initial partitioning configuration is established, a further optimization procedure can be applied, for instance one of the descent local search methods (idle time or communication volume minimization) described in the following section. Alternatively, instead of using a performance estimation during the search, it is possible to specify a fixed percentage of the most idle (or most communicative, respectively) actors, which are then moved to different processing units.

In document Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-core Platforms (Page 134-137)