3. Performance Tools
3.5 Model Analysis
In using model analysis to predict the performance of a parallel program, there is often a conflict between exactitude and generality: a general technique which is inexact exposes the project to risk, and an exact technique which is not general limits the scope of program design. A key limitation of model analysis at the state of the art is that simple, low-cost techniques have proved to be either accurate but specific to particular types of parallel application, or general but very approximate. For example, Pritchard’s model analysis [Pritchard90] of a farm computation is accurate, but it applies only to the specialised case in which a “farm” of worker transputers™ sends messages of a constant size to a “master” transputer™. The analytical method used to address this problem was a form of recursive linear programming. However, this would become excessively complicated either for two-way message traffic or for variable grain sizes. In a further example, Pritchard modelled a two-dimensional geometrical array computation that maps a two-dimensional physical model, defined over a rectangle of space, onto an array of transputers™, as shown in Figure 15. The computation comprised alternating phases of computation and exchange of border results between neighbouring processors. The simple model predicted the improvement in performance produced by using concurrency between communication and computation. For simple array computations, it delivers useful information at low cost but, as with the farm model, more sophisticated model analysis techniques are needed to predict performance for other types of parallel program.
[Figure 15 labels: physical model; processor array; regions processed on processor A, with border results sent to neighbouring processors B and D]
Figure 15 : 2D Geometric Array Computation
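The benefit of overlapping communication with computation in such an array computation can be illustrated with a minimal sketch. This is not Pritchard's model, whose details are not reproduced here; it assumes, purely for illustration, that each iteration on a processor has a fixed computation time and a fixed border-exchange time, and that perfect overlap hides the shorter of the two. The function and parameter names are hypothetical.

```python
def iteration_time(compute_time, exchange_time, overlapped):
    """Time for one compute/exchange iteration on a single processor.

    Assumption (not from the source): without overlap the two phases run
    back to back; with perfect overlap the shorter phase is fully hidden.
    """
    if overlapped:
        return max(compute_time, exchange_time)
    return compute_time + exchange_time


def overlap_improvement(compute_time, exchange_time):
    """Ratio of non-overlapped to overlapped iteration time."""
    sequential = iteration_time(compute_time, exchange_time, overlapped=False)
    concurrent = iteration_time(compute_time, exchange_time, overlapped=True)
    return sequential / concurrent


if __name__ == "__main__":
    # Illustrative values only: 8 time units of computation, 2 of exchange.
    print(overlap_improvement(8.0, 2.0))
    # Under this assumption the gain is largest when the phases are equal.
    print(overlap_improvement(5.0, 5.0))
```

Under these assumptions the improvement can never exceed a factor of two, and it shrinks as either phase comes to dominate the other.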
To achieve greater generality, several researchers have developed variants on the “speedup model”, by which run-times are calculated by dividing the work among the number of processors. Amdahl’s important refinement to this [Amdahl67] took serial portions of the program into account, revealing the important general limitation that the run-time is at least as long as the serial portions of the program. Gustafson [Gustafson88] used the same model to argue that, while run-time cannot be decreased, parallelism can still be used to address larger or more detailed problems without significantly increasing the overall run-time. Gelenbe [Gelenbe89] developed Amdahl’s model to take load balancing and communications into account; this illustrates the limits of Amdahl’s model in that, when such effects are large and are combined, they give rise to behavioural complexity outside the scope of low-cost model analysis. To see why this is so, consider the DAG farm introduced in Chapter 2. A simple parallel application is a farm that uses $N$ processors to carry out $N$ portions of a computation, each of which takes time $T$. If the initial and final serial portions of the program take times $t_1$ and $t_2$, and if the times to communicate data and results are neglected, the time for the program can be reduced from $t_1 + NT + t_2$ to $t_1 + T + t_2$. This produces important benefits if $t_1$ and $t_2$ are much smaller than $T$ or, as Gustafson pointed out [Gustafson88], if $N$ is large. The ideal behaviour of such a farm computation is depicted as a chart of processor activity against time in the upper half of Figure 16.
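The idealised farm arithmetic above can be checked with a short sketch. It assumes exactly the conditions stated in the text: $N$ tasks of equal time $T$, one per processor, serial portions $t_1$ and $t_2$, and communication neglected. The function names and the sample values are illustrative and do not come from the source.

```python
def serial_time(n_tasks, task_time, t1, t2):
    """Run-time on one processor: t1 + N*T + t2."""
    return t1 + n_tasks * task_time + t2


def ideal_farm_time(task_time, t1, t2):
    """Run-time with one task per processor, communication neglected: t1 + T + t2."""
    return t1 + task_time + t2


def ideal_speedup(n_tasks, task_time, t1, t2):
    """Ratio of serial run-time to idealised farm run-time."""
    return serial_time(n_tasks, task_time, t1, t2) / ideal_farm_time(task_time, t1, t2)


if __name__ == "__main__":
    # Illustrative values: 4 workers, T = 10, t1 = t2 = 1.
    print(serial_time(4, 10.0, 1.0, 1.0))      # 1 + 4*10 + 1
    print(ideal_farm_time(10.0, 1.0, 1.0))     # 1 + 10 + 1
    print(ideal_speedup(4, 10.0, 1.0, 1.0))
    # The speedup approaches N as T grows relative to t1 and t2.
    print(ideal_speedup(4, 1000.0, 1.0, 1.0))
```

Even in this ideal case the farm run-time never falls below $t_1 + t_2$, which is the limitation Amdahl identified.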
The behaviour of a parallel application with a superficially simple design can become complex for several reasons, so that simplistic application of performance analysis is no longer appropriate. Firstly, if communication times are not negligible, worker processors must wait for data and the master must wait for the results. Secondly, the grain sizes can vary in length between tasks. Finally, the master can only proceed to end the computation when all processors have finished their work. The overall effect is that the performance achieved may be much worse than the ideal; such behaviour is depicted in the lower half of Figure 16.
[Figure 16 panels: idealised (upper) and realistic (lower); rows: master (farmer) and workers 1 to 4; time axis marked at $t_1$, $t_1 + T$ and $t_1 + T + t_2$]
Figure 16 : Idealised and Realistic Behaviour of a Processor Farm
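The three effects listed above (communication delays, variable grain sizes, and waiting for the slowest worker) can be made concrete with a small deterministic sketch. The communication model here is an assumption for illustration only: the master sends work to the workers one after another, each send taking `send_time`, and later receives results one at a time, each taking `recv_time`. None of these names or values comes from the source.

```python
def realistic_farm_time(grains, t1, t2, send_time, recv_time):
    """Run-time of a one-task-per-worker farm under an assumed serialised master.

    grains    -- computation time of each worker's task (may differ)
    t1, t2    -- initial and final serial portions on the master
    send_time -- assumed time for the master to send one task
    recv_time -- assumed time for the master to receive one result
    """
    master_free = t1
    finish_times = []
    for grain in grains:
        # Sends pass through the master one at a time, so later workers start later.
        master_free += send_time
        finish_times.append(master_free + grain)
    for done in sorted(finish_times):
        # The master takes results in completion order, waiting if none is ready.
        master_free = max(master_free, done) + recv_time
    # The final serial portion starts only after the last result has arrived.
    return master_free + t2


if __name__ == "__main__":
    # With no communication cost and equal grains, this reduces to t1 + T + t2.
    print(realistic_farm_time([10, 10, 10, 10], 1, 1, 0, 0))
    # Communication costs alone lengthen the run.
    print(realistic_farm_time([10, 10, 10, 10], 1, 1, 1, 1))
    # Variable grains alone: the longest task determines the finish.
    print(realistic_farm_time([8, 10, 12, 14], 1, 1, 0, 0))
```

Even this simplified model needs a step-by-step calculation rather than a closed-form expression once the effects are combined, which is the sense in which the behaviour moves outside low-cost model analysis.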
Hence, low-cost analysis techniques applicable to programs in general tend to be very approximate. In Chapter 4, analysis techniques will be developed which define an upper limit on what can be achieved through tuning. Because they define limits on performance, rather than estimates of performance, such models can be valid and applicable across a wide range of programs. These models will be shown to be useful for predicting when a program cannot be tuned to meet specification or cannot be tuned to meet the specification consistently. To predict when a program will meet specification, other techniques are needed.