Cost Modeling - Cost Analysis - Shape-based Cost Analysis of Skeletal Parallel Programs

3.5 Cost Analysis

3.5.3 Cost Modeling

The BSP cost modeling of our assumed implementation model is carried out by adding the mechanism to compute communication and synchronisation phase costs into the definition ofcappfor the PRAM cost modeling, resulting in the definition ofbsppapp. bsppapp also includes the mechanism to optimise communication. We explain first how to cost the communication phase, next how to detect the inefficient case, and finally the definition ofbsppapp.

Costing Communication

The communication which is involved in our computation model occurs in two situ-ations, firstly in C when the parallel pattern is used, and secondly within A when a parallel template which includes a communication process is used. Here we discuss how to compute communication cost in the former situation, giving the reasons for the rules imposed in the implementation strategy. The communication cost in the latter situation is counted as the part of application cost, which is defined for each skeletal combinator in chapter 4.

In the BSP cost model, the cost of a communication phase is determined by the formula h· g, where g is one of the BSP parameters and h is the largest message size h sent or received by any one processor during the phase. Since g is determined as a parameter to capture the communication performance of the target architecture, which is obtained

experimentally by running a benchmark program on the architecture, what we need is machinery to determine the value of h at every communication phase in our analytic framework.

In the general case, determining h in the communication part C requires the following information:

• the distribution pattern of components evaluation, that is how the data of the results of E_t⁰ and Et are distributed among the processors.

• the size of the data of the results of E_t⁰ and E_t placed in each processor.

• the distribution pattern which is required by the application.

• the communication pattern to realise the data rearrangement from the data dis-tribution at the end of component evaluation to the data disdis-tribution which is required by the application.

To get all this information in the framework of shape analysis would require a complex mechanism and consequently may impose expensive analysis costs. To simplify the issue, we used one processor as the master processor and imposed the rule that the data of the result of each part is eventually stored in the master processor. This made in-formation on distribution pattern and size of the result of components evaluation quite simple. We also restricted the data distribution patterns at the beginning of application parts. If the function is sequential, we use the master processor for the application part so that there is no communication in C since the necessary data all resides in the master processor. For a parallel function, we placed a restriction on the parallel application templates so that the data of the argument are always distributed evenly among the processors and all processors perform the same operation. Thus, the communication pattern in C in the case of parallel application is determined uniquely that is, the data of the result of E_t⁰ is scattered to the processors evenly and that of E_t is broadcast to the processors. Whether a function is sequential or parallel can be known from in-formation on the application pattern in the cost tuple. Consequently, the inin-formation components, message size in cost tuples and application pattern in application tuples are sufficient to determine h.

The communication cost in C is now determined by computing the number of words transmitted by processor 0, that is

s· (p − 1) · g

for broadcasting s words of the results of E_tto the worker processors, and s⁰· p− 1

p · g

for scattering s⁰words of the results of E_t⁰ to the worker processors.

Optimising Communication

The strategy described above allows us to compute the communication cost in C but it caused an efficiency problem as explained in 3.4. It requires further refinement of our analytic framework to solve this problem.

In our translation framework, information on the application pattern is used for two purposes. One is that it tells whether the function is parallel or sequential, which determines the data distribution pattern required in A, and is used to compute the com-munication cost. The other is that it tells whether the application template finishes by gathering the local results or not. This information is not used for costing this appli-cation process itself but is kept in the cost tuple of the result as a component, the data pattern, giving information on how the data is generated, which is used when the re-sult becomes an argument of another function. Thus, the cost tuple of any argument always has information on the data pattern. Note that the data pattern of a primitive datum is predefined as SEQ. To indicate information for the second purpose, one of the three notationsSEQ,MAP,FOLDwhich indicate sequential, parallel with a gather at the end, and other parallel, respectively is used and it also gives information for the first purpose (SEQis sequential, other values are parallel). The inefficient case is de-tected by checking the combination of the application pattern in the application tuple of the function and the data pattern in the cost tuple of the argument. If the application pattern indicates that the function is parallel (MAPorFOLD), and the data pattern indi-cates that the data of the argument was generated by gathering the local results (that is,

MAP), the inefficient case is detected and the cost for the gather and the scatter is not counted in the resulting cost. Note that this decision would be available to the compiler to make the corresponding optimisation in a real implementation.

In document Shape-based Cost Analysis of Skeletal Parallel Programs (Page 80-83)