
4.2 Implementation and Costing of the Parallel Combinators

4.2.5 c_prod

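c_prod applies a function to each member of the cross-product of two vectors x and y.

\[
\begin{array}{l}
\mathit{c\_prod}\ f\ [x_1, x_2, \dots, x_m]\ [y_1, y_2, \dots, y_n] = \\
\quad [[f\,x_1\,y_1,\ f\,x_2\,y_1,\ \dots,\ f\,x_m\,y_1], \\
\quad\ \,[f\,x_1\,y_2,\ f\,x_2\,y_2,\ \dots,\ f\,x_m\,y_2], \\
\quad\ \,\dots \\
\quad\ \,[f\,x_1\,y_n,\ f\,x_2\,y_n,\ \dots,\ f\,x_m\,y_n]]
\end{array}
\]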
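The definition can be read as a nested comprehension. The following is a minimal sequential sketch in Python, not the thesis's implementation; the function name `c_prod` mirrors the combinator, while the example function and vectors are assumptions chosen for illustration.

```python
def c_prod(f, xs, ys):
    """Apply f to every pair of the cross product of xs and ys.

    Row j of the result holds f(x, ys[j]) for every x in xs,
    which is the row layout given in the definition.
    """
    return [[f(x, y) for x in xs] for y in ys]


# Illustrative inputs (assumed, not from the text).
table = c_prod(lambda x, y: x * y, [1, 2, 3], [10, 20])
print(table)  # [[10, 20, 30], [20, 40, 60]]
```

The outer list has one row per element of y and each row has one entry per element of x.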

It is used for a class of algorithms in which each object interacts with every other, and corresponds to the All-Pairs Paradigm in [19] and to the RaMP (Reduce-and-Map-over-Pairs) skeleton among Darlington's skeletons [29].

The BSP implementation strategy for c_prod is:

1. the data in f and in x are broadcast, and the data in y are scattered, to all p processors;

2. synchronisation;

3. each processor applies f to all members of the cross product of x and its local segment of y;

4. the local result in each processor is gathered to the master processor;

5. synchronisation.

In SPMD pseudo-code this is:

    bsp_get(data describing f from P0);
    bsp_get(copy of x from P0);
    bsp_get(local share of y from P0);
    bsp_sync();
    for each local item y' from y
        for each item x' from copy of x
            apply f to x' and y';
    bsp_put(results to P0);
    bsp_sync();
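The strategy can be simulated in a single process to check that scattering y and gathering the local results reproduces the sequential definition. This is only a sketch under stated assumptions: the names `split_into_segments` and `c_prod_bsp_simulated` are hypothetical, y is assumed to be scattered in contiguous, near-equal segments (the text does not say how the scatter is laid out), and no real BSP library is involved.

```python
def split_into_segments(ys, p):
    """Scatter ys into p contiguous segments whose lengths differ by at most one.

    The contiguous layout is an assumption made for this sketch.
    """
    base, extra = divmod(len(ys), p)
    segments, start = [], 0
    for pid in range(p):
        size = base + (1 if pid < extra else 0)
        segments.append(ys[start:start + size])
        start += size
    return segments


def c_prod_bsp_simulated(f, xs, ys, p):
    """Simulate the five steps of the c_prod strategy on p virtual processors."""
    # Step 1: every processor gets f, a full copy of xs and its share of ys.
    local_xs = [list(xs) for _ in range(p)]
    local_ys = split_into_segments(ys, p)
    # Step 2: a barrier synchronisation would happen here.
    # Step 3: each processor computes the rows for its local segment of ys.
    local_results = [
        [[f(x, y) for x in local_xs[pid]] for y in local_ys[pid]]
        for pid in range(p)
    ]
    # Step 4: the master concatenates the local results in processor order.
    gathered = []
    for rows in local_results:
        gathered.extend(rows)
    # Step 5: a second barrier synchronisation would happen here.
    return gathered


xs, ys = [1, 2, 3], [10, 20, 30, 40, 50]
result = c_prod_bsp_simulated(lambda x, y: x + y, xs, ys, p=3)
```

Because the segments are contiguous and are gathered in processor order, the rows come back in the same order as in the sequential definition.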

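The application cost for c_prod is: the computation cost

\[
\mathit{apcost}\bigl(\mathit{shp}(f\,(\mathit{eshp}(x)))\,(\mathit{eshp}(y))\bigr) \cdot \frac{\mathit{len}(y)}{p} \cdot \mathit{len}(x)
\]

for step 3, the communication cost

\[
\Bigl(\mathit{size}\bigl(\mathit{shp}(f\,(\mathit{eshp}(x)))\,(\mathit{eshp}(y))\bigr) \cdot \frac{\mathit{len}(y)}{p} \cdot \mathit{len}(x) \cdot (p - 1)\Bigr) \cdot g
\]

for step 4 and the synchronisation cost l for step 5. Thus, the overall cost is expressed as

\[
\begin{aligned}
\mathit{ap\_cost\_c\_prod}\ f\ x\ y\ \langle p, g, l \rangle
 ={}& \mathit{apcost}\bigl(\mathit{shp}(f\,(\mathit{eshp}(x)))\,(\mathit{eshp}(y))\bigr) \cdot \frac{\mathit{len}(y)}{p} \cdot \mathit{len}(x) \\
 &+ \Bigl(\mathit{size}\bigl(\mathit{shp}(f\,(\mathit{eshp}(x)))\,(\mathit{eshp}(y))\bigr) \cdot \frac{\mathit{len}(y)}{p} \cdot \mathit{len}(x) \cdot (p - 1)\Bigr) \cdot g + l
\end{aligned}
\]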
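To see how the terms of the overall cost combine, the formula can be evaluated for concrete numbers. The sketch below is illustrative only: the function name and parameter names are hypothetical, the per-application cost and per-result size are passed in as plain numbers rather than derived from shapes, and the sample values are assumptions.

```python
def ap_cost_c_prod(apply_cost, result_size, len_x, len_y, p, g, l):
    """Evaluate the overall c_prod cost for concrete numbers.

    apply_cost  -- cost of one application of f (the apcost(...) term)
    result_size -- size of one result of f (the size(...) term)
    p, g, l     -- BSP parameters: processors, communication gap, barrier cost
    """
    pairs_per_processor = (len_y / p) * len_x
    computation = apply_cost * pairs_per_processor                     # step 3
    communication = result_size * pairs_per_processor * (p - 1) * g   # step 4
    return computation + communication + l                            # plus step 5


# Assumed sample values: 100 x-items, 80 y-items, 4 processors.
cost = ap_cost_c_prod(apply_cost=2, result_size=1,
                      len_x=100, len_y=80, p=4, g=3, l=50)
print(cost)  # 22050.0
```

With these numbers each processor handles 20 · 100 = 2000 pairs, so the computation term is 4000, the communication term is 2000 · 3 · 3 = 18000, and the barrier adds 50.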

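The shape of c_prod is:

\[
\lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle
\]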

This packs in the following information. Working from the right-hand end, it takes no time to apply c_prod to a given function f, and its application pattern is SEQ. The application cost to apply c_prod f to a given vector x is 0, and its application pattern is SEQ. The application cost to apply c_prod f x to a given vector y is ap_cost_c_prod, which is given above. The application pattern involved in applying c_prod f x to the given vector y is MAP, since the implementation skeleton ends by gathering the local results. The resulting shape of the application of c_prod f x to the given vector y is (len y, (len x, shp (shp (f x) y))).

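The cost function of c_prod is

\[
\mathit{cost}(\mathit{c\_prod}) = \langle \lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ 0,\ \mathrm{SEQ},\ 0 \rangle
\]

The first 0 from the right-hand end indicates that it takes no time to evaluate the term c_prod itself, and SEQ indicates that the data pattern of the term c_prod itself is SEQ. The next 0 indicates that it carries no data. The next component

\[
\lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle
\]

is the shape of the term c_prod, which was explained above.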

4.3 Chapter Conclusion

A common way to treat cost in the skeleton approach is to formulate the cost of each skeleton from its low-level implementation, in terms of a small set of parameters. In our context, skeletons are higher-order functions, each of which has a predetermined template based on the BSP computation model. We presented details of the algorithm and its SPMD pseudo-code for the implementation of each skeleton and defined a cost formula in the form of a function of shape and the BSP parameters.

In skeleton-based models in which a parallel algorithm is expressed using more than one skeleton, cost would be expressed as some kind of combination of the cost formulae of the individual skeletons. However, a simple summation of the formulae does not work well, because the input size (or shape) parameterised in each formula will take different values in the general case. We need to take account of the impact of size (or shape) changes between skeletons. The distinguishing feature of shape-based cost analysis is that the composition of these formulae can be automated by the incorporation of automatic shape analysis. Our analysis adds to this feature the ability to compute the communication and synchronisation costs, taking into account the impact of architecture characteristics through the BSP parameters.
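The point about summation can be illustrated with a toy model in which each stage carries both a shape function and a cost function, and the shape produced by one stage is fed to the next before its cost is evaluated. This is a deliberately simplified sketch and not the analysis defined in this work: the name `compose_costs`, the two toy stages, and the use of a plain vector length as the "shape" are all assumptions.

```python
def compose_costs(stages, input_shape):
    """Sum stage costs while threading each stage's output shape into the next.

    Each stage is a pair (shape_fn, cost_fn): shape_fn maps an input shape to
    the output shape, and cost_fn maps an input shape to that stage's cost.
    """
    total, shape = 0, input_shape
    for shape_fn, cost_fn in stages:
        total += cost_fn(shape)
        shape = shape_fn(shape)
    return total, shape


# Toy stages on vectors, where a shape is just the vector length (assumed).
pairwise = (lambda n: n * n, lambda n: n * n)  # yields n*n items, costs n*n
reduce_all = (lambda n: 1, lambda n: n)        # collapses to 1 item, costs n

total, final_shape = compose_costs([pairwise, reduce_all], 10)
naive_total = pairwise[1](10) + reduce_all[1](10)  # ignores the shape change
```

Here the shape-aware total is 200, because the second stage sees 100 items, whereas evaluating both formulae at the original length 10 gives 110.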

The efficiency of the BSP implementation of each skeleton could be improved by investigating the costs of possible alternative implementations and the implications of parameter sizes. This remains as future work.

Implementation of Cost Analysis

This chapter outlines the Haskell implementation of our cost analysis, which was described in chapters 3 and 4. It illustrates some details of the system structure and definitions of functions by using examples rather than full source code. The system was developed by modifying the Haskell implementation of PRAM cost analysis developed at the University of Technology Sydney, reflecting the amendments to achieve our BSP cost analysis. The basic structure of the system is based on that of the original PRAM cost analysis implementation.

5.1 Automating Cost Analysis

The natural use of our system would be as an aid during program development, allowing the programmer to experiment with the behaviour of various equivalent program structures on various data sets. Since the cornerstone of shapely programming is that behavioural structure is independent of data content, it would be both unnecessary and time-consuming to require the provision of real data sets during development (e.g. constructing an array of 1000 by 1000 values only for the cost calculator to immediately throw them away). Thus, for development purposes we add a new constructor dummyvec, which allows the programmer to directly specify the input shape as its argument, and use dummyvec ishp instead of the real input data vector. This would be replaced by calls to IO operations in the real program. The cost function for dummyvec ishp is simple, as the programmer provides the input shape directly.

\[
\mathit{cost}(\mathit{dummyvec}\ \mathit{ishp}) = \langle \mathit{ishp},\ \mathit{sz}\ \mathit{ishp},\ 0,\ \mathrm{SEQ},\ 0 \rangle
\]

Note that this implies that we are not costing the I/O for the real program.
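The idea that a cost tuple can be produced from a shape alone, without building the data, can be sketched as follows. This is an illustration in Python rather than the Haskell system described here, and everything about the encoding is an assumption: shapes are modelled as either an integer (the size of an atomic value) or a pair (length, element shape), and `sz` and `cost_dummyvec` are hypothetical names that only mirror the symbols in the formula.

```python
def sz(shape):
    """Size of a shape under the assumed encoding.

    An int is the size of an atomic value; a pair (n, elem) is a vector
    of n elements that all have shape elem.
    """
    if isinstance(shape, int):
        return shape
    length, element_shape = shape
    return length * sz(element_shape)


def cost_dummyvec(ishp):
    # Mirrors the tuple in the text: the shape and its size are read off
    # ishp, and the remaining components are the constants 0, SEQ, 0.
    return (ishp, sz(ishp), 0, "SEQ", 0)


# A 1000 by 1000 array of unit-size values, described by shape only.
array_shape = (1000, (1000, 1))
cost_tuple = cost_dummyvec(array_shape)
```

No million-element array is ever constructed: only the two lengths and the element size are needed to produce the tuple.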

In document Shape-based Cost Analysis of Skeletal Parallel Programs (Page 104-110)