New Cost Functions for Combinators - Shape-based Cost Analysis of Skeletal Parallel Programs

Defining cost functions for combinators involves defining functions which capture the shape change, and determining the application cost and the application pattern based on assumed implementation skeletons. For some combinators which take a vector as the argument we need to define two kinds of functions, selected according to whether the argument shape expression is a pair (referred to as pair shape) or a vector (referred to as vector shape). The cost functions given below, except for map and fold when x is pair shape, are introduced for the first time in this chapter. In particular, the introduction of foldconcat, inits and tails is made possible by the new shape expression.

map

The modelled implementation of map is: apply the function sequentially on the vector segments in each processor then gather the results to the master. Its cost function is:

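$$
\mathit{cost}(\mathit{map}) = \langle \lambda f.\, \langle \lambda x.\, \mathit{shape}_{\mathit{map}},\ \mathrm{SEQ},\ +0,\ 0 \rangle,\ 0,\ \mathrm{SEQ},\ 0,\ 0 \rangle
$$

$$
\mathit{shape}_{\mathit{map}} =
\begin{cases}
\langle (\mathit{t\,len}\ x,\ \mathit{t\,shp}\ (f\ (\mathit{t\,eshp}\ x))),\ \mathrm{MAP},\ \mathit{t\,lchange}\ (f\ x),\ \mathit{t\,apcost}\ (f\ (\mathit{t\,eshp}\ x)) \cdot (\mathit{t\,len}(x)/p) + \mathit{t\,size}\ (f\ (\mathit{t\,eshp}\ x)) \cdot (\mathit{t\,len}(x)/p) \cdot (p-1) \cdot g + l \rangle & \text{if } x \text{ is pair shape} \\
\langle \mathit{map}\ (\mathit{t\,shp})\ \mathit{tuples},\ \mathrm{MAP},\ \mathit{t\,lchange}\ (f\ x),\ \mathit{maxsum}\ (\mathit{map}\ (\mathit{t\,apcost})\ \mathit{tuples}) + \mathit{gath\,sz}\ (\mathit{map}\ (\mathit{t\,size})\ \mathit{tuples}) \cdot g + l \rangle & \text{if } x \text{ is vector shape}
\end{cases}
$$

where tuples = map f x.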

The second 0 from the right hand end means the vector level of map is 0. The +0 means the application of map to a given function t_f does not change the vector level.

The level change function of map t_f is the same as the level change function of t_f, that is t lchange (f x). When x is a pair shape, the cost function of map is the same as that in chapter 4 except for the additional information of the vector level and the level change. When x is a vector shape, the analysis performs map f x, that is, the shape function f is applied to each element of x, generating one application tuple per element, which becomes an element of the resulting vector. The result shape of map t_f t_x is obtained by taking the result shape component of each application tuple, that is map (t shp) tuples.
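The pair-shape case of the application cost of map can be evaluated numerically. The following is a minimal sketch, not the thesis's implementation: the function name `map_cost_pair` and its parameter names are hypothetical, and the arguments `f_apcost` and `f_result_size` are assumed to stand for the already evaluated values of t apcost (f (t eshp x)) and t size (f (t eshp x)).

```python
def map_cost_pair(length, f_apcost, f_result_size, p, g, l):
    """Application cost of map for a pair-shape argument.

    length        -- t len x, the number of elements in the vector
    f_apcost      -- assumed value of t apcost (f (t eshp x))
    f_result_size -- assumed value of t size (f (t eshp x))
    p, g, l       -- number of processors, communication and
                     synchronisation parameters used in the formula
    """
    block = length / p                             # elements held per processor
    compute = f_apcost * block                     # sequential map over the local segment
    gather = f_result_size * block * (p - 1) * g   # results sent by the other p - 1 processors
    return compute + gather + l


# Example: 8 elements on 4 processors.
print(map_cost_pair(length=8, f_apcost=3, f_result_size=2, p=4, g=1.5, l=10))
```

The three summands correspond one to one to the local computation, the gather to the master and the single synchronisation in the formula.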

Since the data size of the vector elements allocated to each processor can differ, the local computation cost of applying t_f to the segment in each processor can also differ. The cost of this parallel computation part is therefore the maximum local computation cost over all processors. It is computed by taking the application cost component from each application tuple (that is, map (t apcost) tuples), summing every block of (length x)/p elements, and then taking the maximum of these sums. The last two steps are expressed by the operator maxsum. The communication cost is computed from the message size needed to gather the local results: the size of the result shape of each element is taken (that is, map (t size) tuples) and these sizes are summed, excluding the first (length x)/p elements, which are already held in the master. This last step is expressed by the operator gath sz. The definitions of maxsum and gath sz are

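$$
\begin{aligned}
\mathit{maxsum}\ x &= (\mathit{sum}\ (\mathit{take}\ (\mathit{t\,len}(x)/p)\ x)) \uparrow (\text{if } \mathit{length}\ (\mathit{drop}\ (\mathit{t\,len}(x)/p)\ x) > 0 \text{ then } \mathit{maxsum}\ (\mathit{drop}\ (\mathit{t\,len}(x)/p)\ x) \text{ else } 0), \\
\mathit{gath\,sz}\ x &= \mathit{sum}\ (\mathit{drop}\ (\mathit{length}(x)/p)\ x)
\end{aligned}
$$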

where take is a function which takes an integer n and a vector xs and returns the first n elements from xs.
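The two operators, and the vector-shape cost of map built from them, can be sketched in a few lines. This is an illustrative reading rather than the thesis's code: the block size is computed once from the length of the original vector (read literally, the recursive definition would re-evaluate it on the shortened vector), integer division is assumed, and the name `map_cost_vector` is hypothetical.

```python
def maxsum(costs, p):
    """Largest sum over consecutive blocks of len(costs) // p elements."""
    n = len(costs) // p  # block size, fixed from the original length (assumption)
    if n == 0:
        raise ValueError("fewer elements than processors")
    return max(sum(costs[i:i + n]) for i in range(0, len(costs), n))


def gath_sz(sizes, p):
    """Total size gathered by the master: everything except its own first block."""
    n = len(sizes) // p
    return sum(sizes[n:])


def map_cost_vector(apcosts, sizes, p, g, l):
    """Vector-shape application cost of map.

    apcosts and sizes are assumed to be the already evaluated lists
    map (t apcost) tuples and map (t size) tuples.
    """
    return maxsum(apcosts, p) + gath_sz(sizes, p) * g + l


# Six elements on three processors: the blocks are [1, 2], [3, 4], [5, 6].
print(maxsum([1, 2, 3, 4, 5, 6], 3))
print(gath_sz([1, 2, 3, 4, 5, 6], 3))
print(map_cost_vector([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], 3, 2, 10))
```

Fixing the block size up front keeps every processor's segment the same length, which matches the description of summing every (length x)/p elements.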

fold

The modelled implementation of fold is: the sub-vectors are folded sequentially on each processor, and the results are then transferred to the master processor, which folds them together. As in Skillicorn's calculus, fold is used only with operators which take constant space, that is, the shapes of their results are the same as the shapes of their arguments. Fold operations with a non-constant space operator are defined individually (e.g. foldconcat, defined below). The cost function of fold is:

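$$
\mathit{cost}(\mathit{fold}) = \langle \lambda \oplus.\, \langle \lambda x.\, \mathit{shape}_{\mathit{fold}},\ \mathrm{SEQ},\ +0,\ 0 \rangle,\ 0,\ \mathrm{SEQ},\ 0,\ 0 \rangle
$$

$$
\mathit{shape}_{\mathit{fold}} =
\begin{cases}
\langle \mathit{t\,eshp}\ x,\ \mathrm{FOLD},\ (-1),\ \mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{t\,eshp}\ x))\ (\mathit{t\,eshp}\ x)) \cdot (\mathit{t\,len}(x)/p + p - 2) + \mathit{size}\ (\mathit{t\,eshp}\ x) \cdot (p-1) \cdot g + l \rangle & \text{if } x \text{ is pair shape} \\
\langle \mathit{hd}\ x,\ \mathrm{FOLD},\ (-1),\ \mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{hd}\ x))\ (\mathit{hd}\ x)) \cdot (\mathit{length}(x)/p + p - 2) + \mathit{size}\ (\mathit{hd}\ x) \cdot (p-1) \cdot g + l \rangle & \text{if } x \text{ is vector shape}
\end{cases}
$$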

The second 0 from the right hand end means the vector level of fold is 0. The +0 means the application of fold to a given function t_⊕ does not change the vector level.

The level change function of fold t_⊕ is (−1) because fold t_⊕ reduces the vector level by 1. When x is a pair shape, the shape of the result of fold t_⊕ t_x is t eshp x because of the assumption of constant space. The application cost of fold t_⊕ t_x is the sum of three parts. The first is the computation cost for t len (x)/p + p − 2 applications of t_⊕ (of which t len (x)/p − 1 are for the parallel part and p − 1 are for the sequential part), that is

$$
\mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{t\,eshp}\ x))\ (\mathit{t\,eshp}\ x)) \cdot (\mathit{t\,len}(x)/p + p - 2).
$$

The second is the communication cost for gathering the local results, that is size (t eshp x) · (p − 1) · g, and the third is the synchronisation cost l. The case when x is a vector shape is similar, but the computation of the shape and the application cost is performed using hd and length instead of t eshp and t len to take the element shape and the length respectively.
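The application cost of fold can be evaluated as follows. This is a minimal sketch under stated assumptions: `fold_cost` and its parameter names are hypothetical, `op_apcost` stands for the already evaluated cost of one application of the operator to two element shapes, and `elem_size` stands for the size of one element shape. The same function covers the vector-shape case if the element shape is taken with hd and the length with length.

```python
def fold_cost(length, op_apcost, elem_size, p, g, l):
    """Application cost of fold with a constant-space operator."""
    local_apps = length / p - 1      # folding the local sub-vector on each processor
    master_apps = p - 1              # folding the p partial results on the master
    compute = op_apcost * (local_apps + master_apps)   # length/p + p - 2 applications
    gather = elem_size * (p - 1) * g                   # one partial result per other processor
    return compute + gather + l


# Example: 16 elements on 4 processors.
print(fold_cost(length=16, op_apcost=2, elem_size=1, p=4, g=3, l=5))
```

Because the operator is constant space, each partial result sent to the master has the size of a single element, which is why the gather term does not depend on the vector length.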

foldconcat

In the expanded version described in this chapter, foldconcat (fold with concatenate) is added to the primitive functions in order to express one of the intermediate algorithms in the mss derivation. The modelled implementation of foldconcat is to concatenate sequentially in the master processor rather than in parallel to avoid distribution cost.

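$$
\mathit{cost}(\mathit{foldconcat}) = \langle \lambda x.\, \mathit{shape}_{\mathit{foldconcat}},\ 0,\ \mathrm{SEQ},\ 0,\ 0 \rangle
$$

$$
\mathit{shape}_{\mathit{foldconcat}} =
\begin{cases}
\langle (\mathit{t\,len}\ x \cdot \mathit{t\,len}\ (\mathit{t\,eshp}\ x),\ \mathit{t\,eshp}\ (\mathit{t\,eshp}\ x)),\ \mathrm{SEQ},\ (-1),\ \mathit{concatConst} \cdot (\mathit{t\,len}\ x - 1) \rangle & \text{if } x \text{ is pair shape} \\
\langle \mathit{foldconcat}\ x,\ \mathrm{SEQ},\ (-1),\ \mathit{concatConst} \cdot (\mathit{length}(x) - 1) \rangle & \text{if } x \text{ is vector shape}
\end{cases}
$$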

Working from the right hand end, the cost of foldconcat itself and its vector level are 0. The data pattern is SEQ and the message size is 0. When x is a pair shape, the result shape of foldconcat x is (t len x · t len (t eshp x), t eshp (t eshp x)). The application pattern is SEQ since it is a sequential function. The level change is (−1) because foldconcat reduces the vector level by 1. The application cost is some constant time for concatenation multiplied by t len x − 1. When x is a vector shape, the result shape is foldconcat x and the other components are the same as those when x is a pair shape.
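The two cases of foldconcat differ only in how the result shape is built. The sketch below is illustrative and rests on assumptions not stated in the text: a pair shape is encoded as a Python tuple `(length, element_shape)`, a vector shape as a Python list of element shapes, and in the vector case every element is itself assumed to be a vector shape. The function names are hypothetical.

```python
def foldconcat_pair(shape, concat_const):
    """Result shape and application cost for a pair-shape argument."""
    outer_len, (inner_len, inner_eshp) = shape
    result_shape = (outer_len * inner_len, inner_eshp)   # all inner vectors have equal length
    return result_shape, concat_const * (outer_len - 1)


def foldconcat_vector(shape, concat_const):
    """Result shape and application cost for a vector-shape argument."""
    result_shape = [e for inner in shape for e in inner]  # concatenate the inner shapes
    return result_shape, concat_const * (len(shape) - 1)


# Four inner vectors of three elements each.
print(foldconcat_pair((4, (3, "int")), concat_const=2))
# Inner vectors of different lengths need the vector shape.
print(foldconcat_vector([["a", "b"], ["c"]], concat_const=2))
```

In both cases the cost counts one constant-time concatenation between each pair of neighbouring inner vectors.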

scan

The implementation of scan is: the assigned block of elements is scanned sequentially; the final value of each local scan is scanned across processors in parallel using the obvious tree algorithm; the result of the global scan on processor i (< p) is sent to processor i + 1. In each processor, t_⊕ is applied to the pair of the result of the global scan and the local results; the result in each processor is gathered to the master processor. We make the same restriction concerning constant space operators as for fold. Its cost function is:

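$$
\mathit{cost}(\mathit{scan}) = \langle \lambda \oplus.\, \langle \lambda x.\, \mathit{shape}_{\mathit{scan}},\ \mathrm{SEQ},\ +0,\ 0 \rangle,\ 0,\ \mathrm{SEQ},\ 0,\ 0 \rangle
$$

$$
\mathit{shape}_{\mathit{scan}} =
\begin{cases}
\langle x,\ \mathrm{MAP},\ 0,\ \mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{t\,eshp}(x)))\ (\mathit{t\,eshp}(x))) \cdot (2 \cdot (\mathit{t\,len}(x)/p) - 1 + \log(p)) + (\mathit{size}\ (\mathit{t\,eshp}(x)) \cdot (\log(p) + 1) + (\mathit{size}(x)/p) \cdot (p-1)) \cdot g + (\log(p) + 2) \cdot l \rangle & \text{if } x \text{ is pair shape} \\
\langle x,\ \mathrm{MAP},\ 0,\ \mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{hd}(x)))\ (\mathit{hd}(x))) \cdot (2 \cdot (\mathit{length}(x)/p) - 1 + \log(p)) + (\mathit{size}\ (\mathit{hd}(x)) \cdot (\log(p) + 1) + (\mathit{size}(x)/p) \cdot (p-1)) \cdot g + (\log(p) + 2) \cdot l \rangle & \text{if } x \text{ is vector shape}
\end{cases}
$$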

The second 0 from the right hand end means the vector level of scan is 0. The +0 means the application of scan to a given function t_⊕ does not change the vector level. The level change function of scan t_⊕ is 0 since scan does not change the vector level. When x is a pair shape, the shape of the result of scan t_⊕ t_x is x because of the assumption of constant space. The application cost of scan t_⊕ t_x has three parts. The computation cost is that of 2 · (t len (x)/p) − 1 + log(p) applications of t_⊕ (of which (t len (x)/p) − 1 are for the local scan, log(p) are in the tree algorithm and t len (x)/p are for combining each element of the local scan result with the global scan result), that is

$$
\mathit{t\,apcost}\ (\mathit{t\,shp}\ (\oplus\ (\mathit{t\,eshp}(x)))\ (\mathit{t\,eshp}(x))) \cdot (2 \cdot (\mathit{t\,len}(x)/p) - 1 + \log(p)).
$$

The communication cost is size (t eshp (x)) · log(p) · g in the tree algorithm, size (t eshp (x)) · g for sending the local result to the next processor and (size (x)/p) · (p − 1) · g for gathering the local results, in total

$$
(\mathit{size}\ (\mathit{t\,eshp}(x)) \cdot (\log(p) + 1) + (\mathit{size}(x)/p) \cdot (p-1)) \cdot g.
$$

The synchronisation cost is log(p) · l in the tree algorithm, l after sending the local result to the next processor and l for gathering, in total (log(p) + 2) · l. The case when x is a vector shape is similar, but the computation of the shape and the application cost is performed using hd and length instead of t eshp and t len to take the element shape and the length respectively.
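The three parts of the scan cost can be added up as in the following sketch. It is illustrative only and makes assumptions beyond the text: the logarithm is taken to base 2 (a binary tree), size (x) for a pair shape is taken to be the length times the element size, and the name `scan_cost` and its parameters are hypothetical.

```python
import math


def scan_cost(length, op_apcost, elem_size, p, g, l):
    """Application cost of scan for a pair-shape argument."""
    block = length / p
    log_p = math.log2(p)                 # assumption: binary tree, base-2 logarithm

    # (block - 1) local scan + log_p tree steps + block final combinations
    applications = 2 * block - 1 + log_p
    compute = op_apcost * applications

    total_size = length * elem_size      # assumption: size(x) = length * element size
    comm = (elem_size * (log_p + 1) + (total_size / p) * (p - 1)) * g
    sync = (log_p + 2) * l               # tree steps, neighbour send, gather

    return compute + comm + sync


# Example: 16 unit-size elements on 4 processors with g = l = 1.
print(scan_cost(length=16, op_apcost=1, elem_size=1, p=4, g=1, l=1))
```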

inits and tails

The modelled implementation of inits begins with each processor computing the local initial segments of its part of the list. The last element of this local result is then passed to the processor immediately to its right, where it is prepended to each of the partial initial segments held by that processor. After p − 1 steps, the values from the first processor are prepended to each of the segments in the last processor, and then the local results from all processors are gathered to the master processor. Its cost function is:

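$$
\mathit{cost}(\mathit{inits}) = \langle \lambda x.\, \mathit{shape}_{\mathit{inits}},\ 0,\ \mathrm{SEQ},\ 0,\ 0 \rangle
$$

$$
\mathit{shape}_{\mathit{inits}} =
\begin{cases}
\langle [(1, \mathit{t\,eshp}(x)), (2, \mathit{t\,eshp}(x)), \cdots, (\mathit{t\,len}(x), \mathit{t\,eshp}(x))],\ \mathrm{MAP},\ +1,\ \mathit{concatConst} \cdot (\mathit{t\,len}(x)/p) + (\mathit{size}\ (\mathit{t\,eshp}(x)) \cdot (\mathit{t\,len}(x)/p) \cdot g + (\mathit{t\,len}(x)/p) \cdot \mathit{concatConst} + l) \cdot (p-1) + \mathit{size}\ (\mathit{drop}\ (\mathit{t\,len}(x)/p)\ [(1, \mathit{t\,eshp}(x)), (2, \mathit{t\,eshp}(x)), \cdots, (\mathit{t\,len}(x), \mathit{t\,eshp}(x))]) \cdot g + l \rangle & \text{if } x \text{ is pair shape} \\
\langle \mathit{inits}\ x,\ \mathrm{MAP},\ +1,\ \mathit{concatConst} \cdot (\mathit{length}(x)/p) + (\mathit{size}\ (\mathit{take}\ (\mathit{length}(x)/p)\ x) \cdot g + \mathit{concatConst} + l) \cdot (p-1) + \mathit{size}\ (\mathit{drop}\ (\mathit{length}(x)/p)\ (\mathit{inits}\ x)) \cdot g + l \rangle & \text{if } x \text{ is vector shape}
\end{cases}
$$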

Working from the right hand end, the cost of inits itself and its vector level are 0. The data pattern is SEQ and the message size is 0. When x is a pair shape, the result shape of inits x is

$$
[(1, \mathit{t\,eshp}(x)), (2, \mathit{t\,eshp}(x)), \cdots, (\mathit{t\,len}(x), \mathit{t\,eshp}(x))]
$$

which is a vector shape. The level change function is (+1) since inits increases the vector level by 1. The application pattern is MAP since the modelled implementation ends by gathering the local results. The application cost is: concatConst · (t len (x)/p) for computing the local initial segments; (size (t eshp (x)) · (t len (x)/p) · g + (t len (x)/p) · concatConst + l) · (p − 1) for the p − 1 steps of passing the last element of the local result followed by prepending to each element; and

$$
\mathit{size}\ (\mathit{drop}\ (\mathit{t\,len}(x)/p)\ [(1, \mathit{t\,eshp}(x)), (2, \mathit{t\,eshp}(x)), \cdots, (\mathit{t\,len}(x), \mathit{t\,eshp}(x))]) \cdot g + l
$$

for gathering the local results followed by a synchronisation. When x is a vector shape, the result shape is inits x. The application cost is concatConst · (length (x)/p) for computing the local initial segments, ((size (take (length (x)/p) x)) · g + concatConst + l) · (p − 1) for the p − 1 steps of passing the last element of the local result followed by prepending to each element, and size (drop (length (x)/p) (inits x)) · g + l for gathering the local results followed by a synchronisation.
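The pair-shape application cost of inits can be evaluated as in the sketch below. It is a hedged illustration, not the thesis's implementation: the size of a pair shape (k, e) is assumed to be k times the size of e, the block size is assumed to be an integer, and the name `inits_cost_pair` and its parameters are hypothetical.

```python
def inits_cost_pair(length, elem_size, p, g, l, concat_const):
    """Application cost of inits for a pair-shape argument."""
    block = length // p                       # elements per processor (assumed exact)

    # Computing the local initial segments.
    local = concat_const * block

    # One of the p - 1 passing steps: send a segment, prepend it, synchronise.
    step = elem_size * block * g + block * concat_const + l

    # Result shape [(1, e), (2, e), ..., (length, e)] as a list of sizes,
    # assuming size (k, e) = k * size e.
    result_sizes = [k * elem_size for k in range(1, length + 1)]

    # Gather everything except the master's own first block, then synchronise.
    gather = sum(result_sizes[block:]) * g + l

    return local + step * (p - 1) + gather


# Example: 4 unit-size elements on 2 processors.
print(inits_cost_pair(length=4, elem_size=1, p=2, g=1, l=1, concat_const=1))
```

The gather term grows quadratically with the vector length because the k-th initial segment has k elements, which is why it is written with size and drop rather than as a simple product.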

tails is the analogue of inits that computes the suffix segments of its argument vector; its cost function is the same as that of inits.
