Other Considerations - The SMG DSM system: enabling shared memory for the grid

Constructing parallel applications can involve many, and often conflicting, considerations. Some of these have already been mentioned, such as whether the functional (task) or domain (data) approach is taken to problem decomposition, which is in turn dependent on the underlying algorithm. Other factors that need consideration are:

• Granularity: a qualitative measure of the ratio of computation to communication and synchronisation; in essence, it reflects the frequency at which data dependencies arise. Applications that exhibit a coarse granularity are more suitable for parallelisation on distributed memory architectures. Consider a simple application that sums an array of N integers. The application can be parallelised so that each process in a process pool of size M performs the required calculation on a sub-array of size N/M. If M = N/2, then each process will initially perform only one addition, in which case the communication and/or synchronisation overhead would result in a significant slowdown, as the time for a modern processor to execute one addition is negligible compared with the communication cost.
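The array-summing example can be made concrete with a rough cost model. The sketch below is illustrative only: the function name `estimated_time` and the per-operation costs `t_add` and `t_comm` are assumptions chosen for the example, not measurements from any real system.

```python
def estimated_time(n, m, t_add=1e-9, t_comm=1e-5):
    """Rough time to sum n integers with m workers (hypothetical model).

    Each worker sums a sub-array of n/m elements (n/m - 1 additions), then
    one process combines the m partial sums, paying one message and one
    addition for each of the other m - 1 partial sums.
    """
    local = (n / m - 1) * t_add           # additions inside each sub-array
    combine = (m - 1) * (t_comm + t_add)  # gather and add the partial sums
    return local + combine


n = 1_000_000
serial = estimated_time(n, 1)        # no communication at all
coarse = estimated_time(n, 4)        # coarse grain: 250,000 elements each
fine = estimated_time(n, n // 2)     # fine grain: one addition per worker
```

Under these assumed costs the coarse-grained split is faster than the serial sum, while the M = N/2 split is far slower, because communication dominates the single addition each worker performs.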

• Data Access Patterns: there have been a number of attempts to classify parallel applications and their sharing patterns. The developers of the Munin DSM [34] identified distinct categories of shared data. Data sharing that is responsible for inter-process communication contributes to lowering the granularity and thus is responsible for a degradation in performance and/or scalability of parallelisation. Poor algorithm design can also be a contributing factor, since accessing data in a non-structured manner will generate more traffic than structured access where optimisations can be taken advantage of.

• Load balancing: if a distributed memory cluster is very loosely composed, e.g. from a network of workstations (NOW) [7], then it is likely to be composed of different architectures and platforms, and so different threads of a parallel application will have different performance characteristics. Even machines with the same architecture and platform can have different performance metrics. This is an important fact to consider, as in many applications the performance will be governed by that of the slowest processor. Load balancing attempts to match tasks to processors such that overall performance is maximised. For example, if the job in the matrix example of the previous sections is divided among M processors, where the performance of one is greater than that of the others, then, as there is a data dependency between the M sub-tasks, the performance of the system is determined by the last thread to finish its task.

On symmetric-memory architectures this is less important so long as the developer partitions the work evenly among the threads, as each thread will normally get the same resources at the same performance level. Any asymmetry exacerbates the problem and so requires load balancing.
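The effect of the slowest processor can be illustrated with a small sketch. The helper names (`makespan`, `even_split`, `proportional_split`) and the relative speeds are hypothetical, and the model assumes that work is perfectly divisible and that processor speeds are known in advance.

```python
def makespan(shares, speeds):
    """Finish time of the last worker: assigned work divided by its speed."""
    return max(share / speed for share, speed in zip(shares, speeds))


def even_split(total, speeds):
    """Give every processor the same amount of work."""
    return [total / len(speeds)] * len(speeds)


def proportional_split(total, speeds):
    """Give each processor work in proportion to its speed."""
    capacity = sum(speeds)
    return [total * s / capacity for s in speeds]


speeds = [1.0, 1.0, 1.0, 2.0]   # assumed: the last processor is twice as fast
total = 1000.0                  # abstract units of work

even = makespan(even_split(total, speeds), speeds)
balanced = makespan(proportional_split(total, speeds), speeds)
```

With an even split the fast processor sits idle while the slow ones finish; a proportional split lets all four finish together and shortens the overall run.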

• Data Distribution: it is often necessary for the programmer to consider data distribution when developing the algorithm for the application. Employing a data provisioning strategy allows the caching principle of data locality to persist across distributed machines, and additionally reduces the potential for load imbalance. Data parallel languages, such as HPF [35], allow the programmer to annotate variables so that a strategy can be followed (or at least suggested to the compiler) for data distribution that employs one of the traditional categories, such as block, cyclic, block-cyclic, replicated, and local [8].
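The first three distribution categories can be expressed as index-to-owner mappings. The following is a minimal sketch under one common convention (block size rounded up to ceil(n/p)); the function names are hypothetical and are not HPF syntax.

```python
def block_owner(i, n, p):
    """Block: contiguous chunks of ceil(n / p) elements per process."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk


def cyclic_owner(i, p):
    """Cyclic: elements are dealt out round-robin."""
    return i % p


def block_cyclic_owner(i, b, p):
    """Block-cyclic: blocks of b elements are dealt out round-robin."""
    return (i // b) % p


n, p = 8, 2
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
block_cyclic = [block_cyclic_owner(i, 2, p) for i in range(n)]
```

For eight elements on two processes, the block mapping assigns the first half to process 0 and the second half to process 1, the cyclic mapping alternates owners element by element, and the block-cyclic mapping alternates owners in pairs.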

• Fault Tolerance: fault tolerance is the ability of a system to recover from a situation that would otherwise result in failure, possibly necessitating the complete restart of an application. In parallel computing, one errant resource could be responsible for aborting a job involving thousands of processors. When such an event occurs, the subsequent action is determined by the availability of a fault-tolerant recovery mechanism, e.g. a restart from a saved (checkpointed) state. Efforts have been made to address this area; for example, fault-tolerant message passing implementations exist, such as [36], described in Section D.

Applying fault tolerance to parallel programming is another important factor that is often not considered by the application developer, as it introduces extra effort into the development process; transparent fault tolerance is therefore a desirable attribute. Fault tolerance can be supported in DSM systems, but some direction from the programmer will still be necessary, as the application would need to be checkpointed at global synchronisation points.
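Restarting from a checkpointed state can be sketched as follows. This is a single-process illustration only: the function `run`, the `fail_at` parameter, and the file name are hypothetical, and each loop iteration merely stands in for a global synchronisation point at which a parallel application would save its state.

```python
import os
import pickle
import tempfile


def run(steps, path, fail_at=None):
    """Sum 0..steps-1, saving a checkpoint after every iteration."""
    state = {"i": 0, "total": 0}
    if os.path.exists(path):                 # resume from the saved state
        with open(path, "rb") as f:
            state = pickle.load(f)
    while state["i"] < steps:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += state["i"]
        state["i"] += 1
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:           # write to a temporary file,
            pickle.dump(state, f)            # then rename over the old one so
        os.replace(tmp, path)                # a crash cannot tear a checkpoint
    return state["total"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "state.ckpt")
try:
    run(10, ckpt, fail_at=6)                 # fails part-way through
except RuntimeError:
    pass
result = run(10, ckpt)                       # resumes at iteration 6
```

The second call does not repeat iterations 0 to 5; it reloads the saved state and completes the remaining work, giving the same total as an uninterrupted run.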

• Data Divergence: a parallel implementation of an algorithm may obtain a different result from a comparable serial implementation [37]. Such a situation can arise because floating-point arithmetic is not associative, or because of differences in the internal representation of floating-point variables, which lead to a variance in precision when performing floating-point arithmetic (e.g. the x87 floating-point unit on Intel processors holds IEEE 754 double precision values internally as 80-bit values, while PowerPC has a direct 64-bit representation).
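The non-associativity of floating-point addition is easy to demonstrate. In this minimal sketch the two groupings stand in for a serial left-to-right sum and the order in which a parallel reduction might combine partial sums; the variable names are illustrative.

```python
# Floating-point addition is not associative: the grouping changes where
# rounding occurs, so the two sums differ in the last bit.
left = (0.1 + 0.2) + 0.3     # serial, left-to-right order
right = 0.1 + (0.2 + 0.3)    # grouping a two-process reduction might use
differs = left != right      # True for IEEE 754 double precision
gap = abs(left - right)      # tiny, but not zero
```

The discrepancy is on the order of one unit in the last place, which is why comparisons between serial and parallel results are usually made within a tolerance rather than for exact equality.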

• Multi-thread per process support: Section 2.2 identified the likely future direction of computer architecture. Multi-core processors are now standard and many observers predict that the number of cores will rise sharply in the near future. Support for multi-threaded systems will be vital in order to leverage this. Even with the copy-on-write and shared code features that are now standard in modern operating systems, it is not feasible to allow many processes with multiple threads of execution to share the increasingly scarce resource that is memory bandwidth. To maximise resource utilisation, one should ensure that the number of application threads is at least equal to, if not greater than, the number of processing cores, i.e. 1 process : N user threads : M cores per processing node (N ≥ M).
