
7.4 Costing Derivation Steps

7.4.2 BSP Implementations of Derivation Steps

We outline the BSP implementation of each BSP program (8)-(12) based on the implementation strategy given in 3.3 and the implementation skeletons for combinators in 6.3.

Algorithm (8): fold(↑) (map (fold (+)) (foldconcat (map (tails) (inits x))))

• inits phase:

1. the contents of x in the master processor are scattered to the p processors;

2. synchronisation;

3. each processor computes the local initial segments;

4. the last element of the local initial segment in processor i is sent to processor i + 1 and prepended to each of the partial initial segments held by processor i + 1, followed by synchronisation; this is repeated p times.

• map (tails) phase:

5. each processor computes the tail segments for each partial initial segment held by the processor;

6. the local result in each processor is gathered to the master processor;

7. synchronisation.

• foldconcat phase:

8. the master processor computes foldconcat for the result of 7.

• map (fold (+)) phase:

9. the contents of the result of 8 in the master processor are scattered to the p processors in vector-block manner;

10. each processor computes the fold(+) for each vector held by the processor.

• fold (↑) phase:

11. each processor computes fold(↑), that is, takes the maximum of the result of 10 held by the processor;

12. the local result in each processor is gathered to the master processor;

13. synchronisation;

14. the master processor computes fold(↑), that is, takes the maximum of the gathered local results.
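As a sequential reference for what the steps above compute, the following minimal sketch evaluates the expression of algorithm (8) directly. It assumes that inits and tails return only non-empty segments, shortest first, which matches the ordering used in the worked example later in this section; the function names are illustrative and no BSP communication is modelled.

```python
from functools import reduce

def inits(x):
    """Non-empty initial segments of x, shortest first (assumed convention)."""
    return [x[:i] for i in range(1, len(x) + 1)]

def tails(x):
    """Non-empty tail segments of x, shortest first (assumed convention)."""
    return [x[i:] for i in range(len(x) - 1, -1, -1)]

def foldconcat(xss):
    """Flatten one level of nesting."""
    return [xs for group in xss for xs in group]

def algorithm8(x):
    # foldconcat (map (tails) (inits x)): every contiguous segment of x
    segments = foldconcat([tails(i) for i in inits(x)])
    # map (fold (+)): the sum of each segment
    sums = [reduce(lambda a, b: a + b, s) for s in segments]
    # fold (↑): the maximum of the sums
    return reduce(max, sums)
```

A BSP implementation distributes exactly these intermediate lists over the p processors; the sketch only fixes the value that the parallel version should return.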

Algorithm (9): fold(↑) (foldconcat (map (map (fold (+))) (map(tails) (inits x))))

• inits phase:

This is the same as steps 1-4 of algorithm (8).

• map (tails) phase:

5. each processor computes the tail segments for each partial initial segment held by the processor.

• map (map (fold (+))) phase:

6. each processor computes the fold(+) for each inner vector held by the processor;

7. the local result in each processor is gathered to the master processor;

8. synchronisation.

• foldconcat phase:

9. the master processor computes foldconcat for the result of 7.

• fold (↑) phase:

10. the contents of the result of 9 in the master processor are scattered to the p processors;

11. each processor computes fold(↑) for the result of 10;

12. the local result in each processor is gathered to the master processor;

13. synchronisation;

14. the master processor computes fold(↑) for the gathered local results.

The BSP implementation of algorithm (9) is similar to that of algorithm (8). The difference between (8) and (9) is the point at which foldconcat is applied. We look at the difference using an example in which the input vector is [1, 2, 3, 4] and the number of processors is 4. In (8), after the computation of map(tails) (step 5), the local results are

P0 : [[1]],

P1 : [[2], [1, 2]],

P2 : [[3], [2, 3], [1, 2, 3]],

P3 : [[4], [3, 4], [2, 3, 4], [1, 2, 3, 4]]

The local result in each processor is gathered into the master processor (step 6) followed by foldconcat (step 8), and redistributed among the worker processors evenly in vector-block manner (step 9). The data distribution at this time is

P0 : [1], [2], [1, 2],

P1 : [3], [2, 3], [1, 2, 3],

P2 : [4], [3, 4],

P3 : [2, 3, 4], [1, 2, 3, 4]

and then fold(+) is applied to each vector, resulting in

P0 : 1, 2, 3,

P1 : 3, 5, 6,

P2 : 4, 7,

P3 : 9, 10

As we can see, after the application of map(tails) (inits x), the outermost elements have different numbers of inner vector elements, and these inner vectors become the outermost elements after the subsequent foldconcat in the master processor. Since the redistribution for the next fold(+) is made in terms of these new outermost elements, the load imbalance caused by inits and map(tails) is reduced and the computation costs of the following operations decrease, at the price of the gather and redistribution cost needed to perform foldconcat.
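The rebalancing effect in this example can be checked with a small sketch. It assumes that "vector-block manner" means contiguous blocks in which the first (n mod p) processors receive one extra vector, which reproduces the distribution shown above; `block_distribute` and the variable names are illustrative, and element counts serve only as a rough proxy for the fold(+) work.

```python
def block_distribute(items, p):
    """Split items into p contiguous blocks; the first len(items) % p
    blocks receive one extra item (assumed block rule)."""
    q, r = divmod(len(items), p)
    blocks, start = [], 0
    for i in range(p):
        size = q + (1 if i < r else 0)
        blocks.append(items[start:start + size])
        start += size
    return blocks

# Local results after map(tails) (step 5), one entry per processor.
local = [
    [[1]],
    [[2], [1, 2]],
    [[3], [2, 3], [1, 2, 3]],
    [[4], [3, 4], [2, 3, 4], [1, 2, 3, 4]],
]

# Steps 6-8: gather to the master and flatten (foldconcat).
gathered = [seg for group in local for seg in group]
# Step 9: redistribute in vector-block manner.
redistributed = block_distribute(gathered, 4)
# Step 10: fold(+) on each vector.
sums = [[sum(seg) for seg in block] for block in redistributed]

# Number of elements each processor must read, before and after rebalancing.
load_before = [sum(len(seg) for seg in group) for group in local]
load_after = [sum(len(seg) for seg in block) for block in redistributed]
```

Under these assumptions the heaviest processor holds 10 elements before redistribution and 7 afterwards, while the total of 20 elements is unchanged.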

In contrast, in algorithm (9), for the result of map(tails) (step 5)

P0 : [[1]],

P1 : [[2], [1, 2]],

P2 : [[3], [2, 3], [1, 2, 3]],

P3 : [[4], [3, 4], [2, 3, 4], [1, 2, 3, 4]]

map(map (fold (+))) is applied immediately (step 6), resulting in

P0 : [1],

P1 : [2, 3],

P2 : [3, 5, 6],

P3 : [4, 7, 9, 10]

The local result in each processor is now gathered into the master processor (step 7) followed by foldconcat (step 9), and redistributed among the worker processors evenly in vector-block manner in step 10, resulting in

P0 : 1, 2, 3,

P1 : 3, 5, 6,

P2 : 4, 7,

P3 : 9, 10

As we can see, map(map (fold (+))) is applied to the unevenly distributed data, which could introduce a considerable parallel computation cost. However, the gather and redistribution costs to perform foldconcat are relatively small because less data has to be communicated after the application of map(map (fold (+))).
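The trade-off can be made concrete for the same example with a short sketch. It counts additions per processor in step 6 of algorithm (9) and compares the number of values gathered to the master in (8) and (9); these counts are a simplified proxy, not the cost model of this chapter, and the variable names are illustrative.

```python
# Local results after map(tails) (step 5), one entry per processor.
local = [
    [[1]],
    [[2], [1, 2]],
    [[3], [2, 3], [1, 2, 3]],
    [[4], [3, 4], [2, 3, 4], [1, 2, 3, 4]],
]

# Algorithm (9), step 6: fold(+) is applied before any redistribution.
local_sums = [[sum(seg) for seg in group] for group in local]

# Additions per processor: a segment of length n needs n - 1 additions.
adds_per_proc = [sum(len(seg) - 1 for seg in group) for group in local]

# Values sent to the master in the gather step of each algorithm.
gathered_in_8 = sum(len(seg) for group in local for seg in group)
gathered_in_9 = sum(len(sums) for sums in local_sums)
```

For this input the last processor performs 6 of the 10 additions, but only 10 scalars are gathered instead of the 20 vector elements gathered in algorithm (8).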

Algorithm (10): fold(↑) (map (fold (↑)) (map(map (fold (+))) (map (tails)(inits x))))

• inits phase:

This is the same as steps 1-4 of algorithms (8) and (9).

• map (tails) phase:

5. each processor computes the tail segments for each partial initial segment held by the processor.

• map (map (fold (+))) phase:

6. each processor computes the map(fold (+)) for each vector held by the processor.

• map (fold (↑)) phase:

7. each processor computes the fold(↑) for each vector held by the processor.

• fold (↑) phase:

8. each processor takes the maximum of the result of 7;

9. the local result in each processor is gathered to the master processor;

10. synchronisation;

11. the master processor takes the maximum of the gathered local results.

The BSP implementation of algorithm (10) is similar to (8) and (9). In algorithm (10), foldconcat is not used and gathering occurs only in the last phase, fold(↑).
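A minimal sketch of steps 6-11 for the input [1, 2, 3, 4] on 4 processors illustrates why only p values need to be gathered. It assumes each processor holds the tail segments of one initial segment, as in the example for algorithms (8) and (9); `local_phase` and `held` are illustrative names, and the gather is modelled by an ordinary list.

```python
def local_phase(groups):
    """Steps 6-8 on one processor. `groups` holds one list of tail
    segments per initial segment stored on that processor."""
    sums = [[sum(seg) for seg in tails_of_init] for tails_of_init in groups]  # step 6
    maxima = [max(s) for s in sums]                                           # step 7
    return max(maxima)                                                        # step 8

# Data held after step 5: processor -> initial segments -> tail segments.
held = [
    [[[1]]],
    [[[2], [1, 2]]],
    [[[3], [2, 3], [1, 2, 3]]],
    [[[4], [3, 4], [2, 3, 4], [1, 2, 3, 4]]],
]

local_results = [local_phase(groups) for groups in held]  # one value per processor
result = max(local_results)                               # steps 9-11 at the master
```

Each processor contributes a single scalar to the gather, so the communication volume is independent of the input length, although the local work remains as unbalanced as in algorithm (9).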

Algorithm (11): fold(↑) (map (λv. ((fst A) ↑ (snd A)))(inits x))

• inits phase:

This is the same as steps 1-4 of algorithms (8), (9) and (10).

• map (λv. ((fst A) ↑ (snd A))) phase:

6. each processor computes A, that is, makes a pair with 0 for each vector segment and folds with;

7. each processor takes the maximum of each resulting pair in 6.

• fold (↑) phase:

8. each processor computes fold(↑) for the result of 7;

9. the local result in each processor is gathered to the master processor;

10. synchronisation;

11. the master processor takes the maximum of the gathered local results.

Algorithm (12): fold(↑) (shiftright(0) (map (↑) (scan () (map (pair 0) v))))

• map (pair 0) phase:

1. the contents of x in the master processor are scattered to the p processors;

2. synchronisation;

3. each processor makes a pair with 0 for local element.

• scan () phase:

4. each processor computes local scan with;

5. the final element of the local scan in each processor is scanned in parallel across the processors with using the obvious tree algorithm;

6. the result of the global scan in processor i (< p) is sent to processor i + 1;

7. synchronisation;

8. each processor applies to the pair of the value sent to the processor in 6 and each element of the results of 4.

• map (↑) phase:

9. each processor takes the maximum element of each pair of the result of 8.

• shiftright (0) phase:

10. shiftright(0) rotates the entire list right one place, moving a single element from each processor to the next and inserting 0 at the left end.

• fold (↑) phase:

11. each processor computes fold(↑) for the result of 10;

12. the local result in each processor is gathered to the master processor;

13. synchronisation;

14. the master processor takes the maximum of the gathered local results.
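The structure of the scan phase (steps 4-8) can be sketched for an arbitrary associative operator. The sketch below is a sequential simulation under stated assumptions: `bsp_scan` is an illustrative name, every block is assumed to be non-empty, the cross-processor scan of step 5 is done with a plain sequential scan rather than a tree, and ordinary addition or string concatenation stands in for the operator of algorithm (12), which is not reproduced here.

```python
from itertools import accumulate

def bsp_scan(blocks, op):
    """Inclusive scan over the concatenation of `blocks`, organised like
    steps 4-8. `op` must be associative; each block must be non-empty."""
    # Step 4: each processor scans its own block.
    local = [list(accumulate(block, op)) for block in blocks]
    # Step 5: scan the last local elements across processors
    # (sequential stand-in for the tree algorithm).
    lasts = [scanned[-1] for scanned in local]
    global_scan = list(accumulate(lasts, op))
    # Steps 6-7: processor i passes its global value to processor i + 1.
    # Step 8: each processor combines the received value with its local scan.
    result = [local[0]]
    for i in range(1, len(local)):
        carry = global_scan[i - 1]
        result.append([op(carry, v) for v in local[i]])
    return result

# Usage with stand-in operators.
prefix_sums = bsp_scan([[1, 2], [3, 4], [5]], lambda a, b: a + b)
prefix_strs = bsp_scan([["a", "b"], ["c"]], lambda a, b: a + b)
```

The received value is always placed on the left of `op`, so the sketch also works for associative operators that are not commutative, as the string example shows.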

In document Shape-based Cost Analysis of Skeletal Parallel Programs (Page 165-171)