4.5 Large Programs and Re-Use

As the number of DAG nodes increases, program behaviour becomes more complicated and the risk of tuning increases. Complex behaviour arises from large numbers of grains of irregular size, from contention between a number of tasks on each processor, from complex data-dependencies between tasks and from large numbers of messages or large messages. In the extreme, the behaviour is so complex that the monitoring tools cease to support the effective diagnosis of the problem. As program complexity and scale increase, the number of events needed to understand the application behaviour becomes larger. Eventually, the task of understanding the nature of the events and their relationships to one another overwhelms the user. Tools such as VPB/View™ can provide various means of ordering event information, by distinguishing between types of events and organising events into logical groups, and of abstracting from the information, particularly by using utilisation monitoring measurements. This increases the scale at which the diagnosis becomes impractical, but it does not ultimately solve the problem of diagnosis in the face of behavioural complexity. The term large program will be used for programs with so many DAG nodes and such complex behaviour that problem diagnosis becomes impossible. The point at which the limit occurs depends not just on the number of nodes, but also on the behavioural complexity and the effectiveness of the performance visualisation tools.

The issue of large programs is important because the most substantial benefits of parallelism come from exploiting it to large degrees. It is for large programs running on highly parallel machines that the highest absolute performance is achieved.

There are two important ways in which a high degree of parallelism can be achieved in a structured manner, thus potentially reducing the problem of complex behaviour. The first way is by scaling the program, in which the degree of task replication in a program is increased to provide more parallelism. By replicating the same behaviour, the problem of understanding overall behaviour is reduced. For example, a geometric array computation can be sub-divided to run on an arbitrarily large number of processors, so as to process the corresponding physical model faster and in more detail. A farm can be replicated to run on more and more processors, achieving a higher throughput of data, as illustrated in Figure 27. The limit of this technique generally occurs because the communications required increase as a proportion of the computation times; the scalability is said to be limited.
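The way communication comes to dominate as a farm is replicated can be sketched with a toy cost model. The model is an assumption made for illustration only: computation divides evenly across the workers, while communication cost grows linearly with the number of workers. The function names and the figures used are hypothetical and are not taken from any measured program.

```python
def farm_run_time(work, comm_per_worker, workers):
    """Run time of an idealised farm under the ASSUMED cost model:
    computation is shared evenly, communication grows with the worker count."""
    return work / workers + comm_per_worker * workers


def comm_fraction(work, comm_per_worker, workers):
    """Share of the run time spent on communication rather than computation."""
    comm = comm_per_worker * workers
    return comm / farm_run_time(work, comm_per_worker, workers)


if __name__ == "__main__":
    # Hypothetical figures: 1000 units of work, 0.5 units of communication per worker.
    for n in (1, 4, 16, 64, 256):
        t = farm_run_time(1000.0, 0.5, n)
        share = comm_fraction(1000.0, 0.5, n)
        print(f"workers={n:4d}  run_time={t:8.2f}  comm_share={share:.3f}")
```

Under these assumed figures the communication share rises steadily with the number of workers, and beyond some point adding workers lengthens the run time rather than shortening it, which is the sense in which scalability is limited.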

The second way to increase parallelism is by hierarchical decomposition, in which a parallel program is built up from a number of component sub-programs with simple relationships between them, as illustrated in Figure 28. If the sub-programs can be tuned separately, the manageability of the performance problem is improved. This is particularly true if sub-programs are repeated in the program design. The use of sub-programs can occur at a number of levels, so that a hierarchy of sub-programs is formed. However, the sub-programs can only be tuned separately if they do not contend for processors and links and if overall run-time can be calculated from that of the sub-programs in a simple way, in particular by using critical path analysis techniques. If sub-programs contend for processors and links or if messages are passed between sub-programs in complex ways, there can be complex interactions between their behaviours and arbitrary delays may be caused. In these cases, the neglect of the complex interactions between sub-programs is likely to be a very optimistic assumption. Analysis based upon it will yield a valid optimistic limit, but there is also likely to be a large gap between the optimistic limit and the tuning that can be achieved. It is then better to regard the collection of sub-programs as a single, complex program for the purposes of monitoring analysis. Where there is then a high degree of contention in the program, critical load analysis should be used, but this is of limited use because the assumptions used in the analysis are likely to prove very optimistic. The next chapter discusses methods for dealing with these problems.
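The calculation of overall run-time from separately tuned sub-programs can be sketched as a recursive critical path computation. This is a minimal sketch under the optimistic assumptions stated above: sub-programs do not contend for processors or links, and a part starts as soon as the parts it depends on have finished. The data layout (`parts`, `after`), the function name and the example hierarchy are hypothetical and do not reproduce the structure of Figure 28.

```python
def run_time(node):
    """Optimistic run time of a leaf task (a number) or a sub-program (a dict).

    A sub-program is {"parts": {name: node}, "after": {name: [predecessors]}}.
    The dependency graph inside each sub-program is assumed to be acyclic.
    """
    if isinstance(node, (int, float)):
        return float(node)
    parts = node["parts"]
    after = node.get("after", {})
    finish = {}

    def visit(name):
        # Finish time = latest predecessor finish + this part's own run time.
        if name not in finish:
            start = max((visit(p) for p in after.get(name, ())), default=0.0)
            finish[name] = start + run_time(parts[name])
        return finish[name]

    return max(visit(name) for name in parts)


# Hypothetical hierarchy: task D of the main program is itself a sub-program.
sub_d = {"parts": {"D1": 2, "D2": 3, "D3": 1},
         "after": {"D2": ["D1"], "D3": ["D1"]}}
main = {"parts": {"A": 1, "C": 4, "D": sub_d, "E": 2},
        "after": {"C": ["A"], "D": ["A"], "E": ["C", "D"]}}

if __name__ == "__main__":
    print(run_time(sub_d))  # critical path D1 -> D2
    print(run_time(main))   # critical path A -> D -> E
```

Because contention between sub-programs is ignored, the figure returned is an optimistic limit rather than a prediction. When sub-programs share processors or links, the gap between this limit and the achievable run-time may be large.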

[Figure 27: Replicated Tasks. The figure shows a program with six replicated tasks.]

[Figure 28 (caption not recovered). Recoverable labels: main program; sub-program for task D (D1 to D6); sub-program for task C.]

4.6 Summary

□ Because optimistic assumptions can be very effective at simplifying model analysis computations, analysis to define an optimistic limit on performance after tuning is much more generally applicable than analysis to predict the actual level of performance.

□ The use of critical load and critical path techniques for model analysis was examined. Critical path analysis assumes that there is no contention on the critical path. Critical load analysis assumes that there will be no idle time on the critical load.

□ For event tuning, analysis techniques can be applied directly to monitoring results. They can set optimistic limits on what can be achieved by event tuning by assuming that tuning will ensure either that the events on critical paths are not delayed or that processors or links supporting the critical loads are fully utilised.

□ Event tuning may not succeed in fulfilling these assumptions, so there is a risk that the limits on performance defined by the analysis of monitoring results will not be achieved. The degree and likelihood of the performance deficit tend to increase as the optimism of the assumptions in the analysis increases.

□ For event tuning, assumptions are most likely to prove optimistic for programs with complex behaviours; this is especially true for large programs constructed as a hierarchy of sub-programs unless contention between the sub-programs can be removed.

□ More extensive tuning, especially that brought about by grain tuning, parallelism tuning and hardware tuning, can easily cause performance to be limited by an unexpected critical load or path, so the risks involved in applying analysis techniques to the existing critical load or path in a simple way can have very severe consequences.
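The two optimistic limits summarised above can be sketched as follows. This is an illustrative sketch and not the analysis procedure of the source: task durations, dependencies and processor placement are hypothetical inputs, links are omitted for brevity, and taking the larger of the two limits as the overall optimistic limit is an assumption of the sketch (both are lower bounds on run-time, so the larger one is also a lower bound).

```python
def critical_path_limit(duration, preds):
    """Length of the longest dependency chain.
    Assumes no contention delays the tasks on the critical path."""
    finish = {}

    def visit(task):
        if task not in finish:
            start = max((visit(p) for p in preds.get(task, ())), default=0.0)
            finish[task] = start + duration[task]
        return finish[task]

    return max(visit(task) for task in duration)


def critical_load_limit(duration, placement):
    """Total work on the most heavily loaded processor.
    Assumes that processor is never idle."""
    load = {}
    for task, proc in placement.items():
        load[proc] = load.get(proc, 0.0) + duration[task]
    return max(load.values())


def optimistic_limit(duration, preds, placement):
    """Larger of the two lower bounds on run time (an assumption of this sketch)."""
    return max(critical_path_limit(duration, preds),
               critical_load_limit(duration, placement))


# Hypothetical program: four tasks placed on two processors.
duration = {"t1": 3, "t2": 2, "t3": 4, "t4": 1}
preds = {"t2": ["t1"], "t4": ["t2"]}
placement = {"t1": "p0", "t3": "p0", "t2": "p1", "t4": "p1"}

if __name__ == "__main__":
    print(critical_path_limit(duration, preds))      # chain t1 -> t2 -> t4
    print(critical_load_limit(duration, placement))  # processor p0 carries t1 and t3
    print(optimistic_limit(duration, preds, placement))
```

In this hypothetical case the critical load exceeds the critical path. Moving t3 to the other processor would change which limit applies, which illustrates how more extensive tuning can leave performance limited by a different critical load or path.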

In document Methods for Improving the Performance of Software for DM MIMD Systems. (Page 86-90)