4.2 Implementation and Costing of the Parallel Combinators
4.2.5 c_prod
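c_prod applies a function to each member of the cross-product of two vectors x and y:
\[
\begin{array}{l}
\mathit{c\_prod}\ f\ [x_1, x_2, \ldots, x_m]\ [y_1, y_2, \ldots, y_n] = \\
\quad [[f\,x_1\,y_1,\ f\,x_2\,y_1,\ \ldots,\ f\,x_m\,y_1], \\
\quad\ [f\,x_1\,y_2,\ f\,x_2\,y_2,\ \ldots,\ f\,x_m\,y_2], \\
\quad\ \ \vdots \\
\quad\ [f\,x_1\,y_n,\ f\,x_2\,y_n,\ \ldots,\ f\,x_m\,y_n]]
\end{array}
\]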
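As a point of reference, the definition above can be transcribed as a sequential Haskell function over lists. This is only an illustrative sketch: the name cProd and the use of lists in place of vectors are assumptions made here, and it is not the parallel implementation discussed in this section.

```haskell
-- Sequential reference semantics for c_prod, using lists as vectors.
-- The name cProd is illustrative only.
cProd :: (a -> b -> c) -> [a] -> [b] -> [[c]]
cProd f xs ys = [[f x y | x <- xs] | y <- ys]

-- Small example: one row per element of ys, one column per element of xs.
exampleCProd :: [[Int]]
exampleCProd = cProd (\x y -> x * y) [1, 2, 3] [10, 20]
-- exampleCProd == [[10, 20, 30], [20, 40, 60]]
```

The outer comprehension ranges over the second vector, so the result has one inner list per element of y, matching the row layout of the definition.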
It is used for a class of algorithms in which each object interacts with every other, and corresponds to the All-Pairs Paradigm in [19] and to the RaMP (Reduce-and-Map-over-Pairs) skeleton among Darlington's skeletons [29].
The BSP implementation strategy for c_prod is:
1. the data in f and in x are broadcast, and the data in y are scattered, to all p processors;
2. synchronisation;
3. each processor applies f to all members of the cross-product of x and its local segment of y;
4. the local result in each processor is gathered to the master processor;
5. synchronisation.
In SPMD pseudo-code this is:
bsp_get(data describing f from P0);
bsp_get(copy of x from P0);
bsp_get(local share of y from P0);
bsp_sync();
for each local item y0 from y
    for each item x0 from copy of x
        apply f to x0 and y0;
bsp_put(results to P0);
bsp_sync();
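The strategy above can be mimicked sequentially to check that scattering the second vector and gathering the local results reproduces the full cross-product. The following is a minimal sketch under stated assumptions: the names splitInto and cProdBSP are hypothetical, the "processors" are simulated as list segments, p is assumed to be at least 1, and no real BSP communication takes place.

```haskell
-- Sequential simulation of the BSP strategy for c_prod.
-- All names here (splitInto, cProdBSP) are hypothetical.

-- Split a list into p contiguous segments whose lengths differ by at most one.
-- Assumes p >= 1.
splitInto :: Int -> [a] -> [[a]]
splitInto p xs = go p (length xs) xs
  where
    go 0 _ _ = []
    go k n rest =
      let size = (n + k - 1) `div` k          -- ceiling of n / k
          (seg, rest') = splitAt size rest
      in seg : go (k - 1) (n - size) rest'

-- Steps 1-2: every simulated processor sees f and all of xs,
--            plus one segment of ys.
-- Step 3:    each processor computes its local rows.
-- Steps 4-5: the master concatenates the local results in processor order.
cProdBSP :: Int -> (a -> b -> c) -> [a] -> [b] -> [[c]]
cProdBSP p f xs ys = concat localResults
  where
    segments     = splitInto p ys
    localResults = [ [[f x y | x <- xs] | y <- seg] | seg <- segments ]
```

Because the segments are contiguous and gathered in processor order, the result does not depend on p; only the amount of work per simulated processor changes.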
The application cost for c_prod consists of the computation cost
\[
\mathit{apcost}(\mathit{shp}(f(\mathit{eshp}(x)))(\mathit{eshp}(y))) \cdot (\mathit{len}(y)/p) \cdot \mathit{len}(x)
\]
for step 3, the communication cost
\[
(\mathit{size}(\mathit{shp}(f(\mathit{eshp}(x)))(\mathit{eshp}(y))) \cdot (\mathit{len}(y)/p) \cdot \mathit{len}(x) \cdot (p-1)) \cdot g
\]
for step 4, and the synchronisation cost l for step 5. Thus, the overall cost is expressed as
\[
\begin{array}{l}
\mathit{ap\_cost\_c\_prod}\ f\ x\ y\ \langle p, g, l \rangle \\
\quad = \mathit{apcost}(\mathit{shp}(f(\mathit{eshp}(x)))(\mathit{eshp}(y))) \cdot (\mathit{len}(y)/p) \cdot \mathit{len}(x) \\
\quad + (\mathit{size}(\mathit{shp}(f(\mathit{eshp}(x)))(\mathit{eshp}(y))) \cdot (\mathit{len}(y)/p) \cdot \mathit{len}(x) \cdot (p-1)) \cdot g + l
\end{array}
\]
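To make the formula concrete, it can be evaluated numerically once the shape-dependent factors are known. The sketch below is an assumption-laden illustration, not the thesis implementation: the record BSP, the function apCostCProd and the parameters apF and sizeF are hypothetical names, where apF stands for the application cost of one call of f and sizeF for the size of one result, and all quantities are treated as real numbers.

```haskell
-- Numeric evaluation of the application cost of c_prod.
-- Record and function names are hypothetical.
data BSP = BSP { procs :: Double, gap :: Double, latency :: Double }

-- apF   : application cost of one call of f (the apcost factor)
-- sizeF : size of one result of f (the size factor)
-- lenX, lenY : lengths of the two input vectors
apCostCProd :: Double -> Double -> Double -> Double -> BSP -> Double
apCostCProd apF sizeF lenX lenY (BSP p g l) = computation + communication + l
  where
    perProc       = (lenY / p) * lenX               -- applications per processor
    computation   = apF * perProc                   -- step 3
    communication = sizeF * perProc * (p - 1) * g   -- step 4

-- Purely illustrative values: apF = 2, sizeF = 1, len x = 100, len y = 80,
-- p = 4, g = 3, l = 50.
exampleCost :: Double
exampleCost = apCostCProd 2 1 100 80 (BSP 4 3 50)
-- computation = 4000, communication = 18000, synchronisation = 50
-- exampleCost == 22050
```

With these made-up parameters the gather in step 4 dominates, which shows how the g term can outweigh the computation term when results are as large as the work needed to produce them.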
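The shape of c_prod is:
\[
\lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle
\]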
This packs in the following information. Working from the right-hand end, it takes no time to apply c_prod to a given function f, and its application pattern is SEQ. The application cost to apply c_prod f to a given vector x is 0, and its application pattern is SEQ. The application cost to apply c_prod f x to a given vector y is ap_cost_c_prod, which is given above. The application pattern involved in applying c_prod f x to the given vector y is MAP, since the implementation skeleton ends by gathering the local result. The resulting shape of the application of c_prod f x to the given vector y is (len y, (len x, shp (shp (f x) y))).
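The cost function of c_prod is
\[
\mathit{cost}(\mathit{c\_prod}) = \langle \lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ 0,\ \mathrm{SEQ},\ 0 \rangle
\]
The first 0 from the right-hand end indicates that it takes no time to evaluate the term c_prod itself, and SEQ indicates that the data pattern of the term c_prod itself is SEQ. The next 0 indicates that it carries no data. The next component
\[
\lambda f.\langle \lambda x.\langle \lambda y.\langle (\mathit{len}(y), (\mathit{len}(x), \mathit{shp}(\mathit{shp}(f\,x)\,y))),\ \mathrm{MAP},\ \mathit{ap\_cost\_c\_prod}\ f\ x\ y \rangle,\ \mathrm{SEQ},\ 0 \rangle,\ \mathrm{SEQ},\ 0 \rangle
\]
is the shape of the term c_prod, which was explained above.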
4.3 Chapter Conclusion
A common approach to costing in skeleton-based programming is to formulate the cost of each skeleton from its low-level implementation in terms of some parameters. In our context, skeletons are higher-order functions, each of which has a predetermined template based on the BSP computation model. We presented the algorithm and SPMD pseudo-code for the implementation of each skeleton, and defined a cost formula as a function of shape and the BSP parameters.
In skeleton-based models in which a parallel algorithm is expressed using more than one skeleton, the cost would be expressed as some kind of combination of the cost formulae of the individual skeletons. However, a simple summation of the formulae does not work well, because the input size (or shape) parameterised in each formula will in general take different values. We need to take account of the impact of size (or shape) changes between skeletons. The distinguishing feature of shape-based cost analysis is that the composition of these formulae can be automated by incorporating automatic shape analysis. Our analysis adds to this feature the ability to compute the communication and synchronisation costs, taking into account the impact of architecture characteristics through the BSP parameters.
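The point about composition can be illustrated with a small sketch in which each stage reports both its cost and the shape of its output, so that the next stage is costed on the shape it actually receives. Everything here is a simplifying assumption made for illustration: shapes are reduced to a single vector length, the names Stage, andThen, allPairs and linearPass are hypothetical, and the cost figures are invented unit costs rather than BSP costs.

```haskell
-- A stage maps an input shape to (output shape, cost).
-- Shapes are simplified to a vector length; all names are hypothetical.
type Shape = Int
type Stage = Shape -> (Shape, Double)

-- Sequential composition: the second stage is costed on the shape
-- produced by the first, not on the original input shape.
andThen :: Stage -> Stage -> Stage
andThen s1 s2 shape =
  let (mid, c1) = s1 shape
      (out, c2) = s2 mid
  in (out, c1 + c2)

-- Illustrative stages: pairing up all elements squares the length and
-- costs one unit per pair; a linear pass keeps the length and costs one
-- unit per element.
allPairs :: Stage
allPairs n = (n * n, fromIntegral (n * n))

linearPass :: Stage
linearPass n = (n, fromIntegral n)
```

For an input of length 10, (allPairs `andThen` linearPass) 10 yields a cost of 200, whereas naively summing both formulas at the original length would give 100 + 10 = 110, because it ignores that the second stage sees 100 elements.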
The efficiency of the BSP implementation of each skeleton could be improved by investigating the costs of possible alternative implementations and the implications of parameter sizes. This remains as future work.
5 Implementation of Cost Analysis
This chapter outlines the Haskell implementation of our cost analysis, which was described in chapters 3 and 4. It illustrates some details of the system structure and the definitions of functions by using examples rather than full source code. The system was developed by modifying the Haskell implementation of PRAM cost analysis developed at the University of Technology Sydney, with amendments to support our BSP cost analysis. Its basic structure follows that of the original PRAM cost analysis implementation.
5.1 Automating Cost Analysis
The natural use of our system would be as an aid during program development, allowing the programmer to experiment with the behaviour of various equivalent program structures on various data sets. Since the cornerstone of shapely programming is that behavioural structure is independent of data content, it would be both unnecessary and time-consuming to require the provision of real data sets during development (e.g. constructing an array of 1000 by 1000 values only for the cost calculator to immediately throw them away). Thus, for development purposes we add a new constructor dummyvec, which allows the programmer to specify the input shape directly as its argument, and use dummyvec ishp instead of the real input data vector. This would be replaced by calls to IO operations in the real program. The cost function for dummyvec ishp is simple, as the programmer provides the input shape directly.
\[
\mathit{cost}(\mathit{dummyvec}\ \mathit{ishp}) = \langle \mathit{ishp},\ \mathit{sz}\ \mathit{ishp},\ 0,\ \mathrm{SEQ},\ 0 \rangle
\]
Note that this implies that we are not costing the I/O for the real program.
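The idea of a shape-only input can be sketched in Haskell as follows. This is a hypothetical illustration and not the actual system: the types Shape and Input, the constructors RealVec and DummyVec, and the function inputShapeAndSize are names invented here, and only the shape and size components of the cost are modelled, with the remaining components omitted.

```haskell
-- Hypothetical sketch: a vector shape is a length plus a common element shape.
data Shape = Atom | Vec Int Shape
  deriving (Eq, Show)

-- Number of atomic data items described by a shape (the role played by sz).
sz :: Shape -> Int
sz Atom        = 1
sz (Vec n esh) = n * sz esh

-- Input terms: either real data or a shape-only placeholder.
data Input a = RealVec [a] | DummyVec Shape

-- Shape and size of an input. DummyVec needs no data to be constructed;
-- RealVec is treated here as a flat vector of atoms.
inputShapeAndSize :: Input a -> (Shape, Int)
inputShapeAndSize (DummyVec ishp) = (ishp, sz ishp)
inputShapeAndSize (RealVec xs)    = let s = Vec (length xs) Atom in (s, sz s)
```

For the 1000 by 1000 array mentioned above, DummyVec (Vec 1000 (Vec 1000 Atom)) describes one million items without allocating any of them.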