Performance Impacts of Distributed Program Design Decisions

sions

During the process of compilation there are many different choices that must be made, which may have an effect on performance. The choices that can be made relate to how the workload of the program is divided among the coarse-grain tasks, how the tasks communicate and how they are allocated to compute nodes. Improving performance of a distributed program is often a matter of increasing the utilisation of the compute nodes or reducing the communication overhead. Decisions which affect performance are generally made during the agglomeration step of the design process.

To explore these various choices two case studies were examined, the Game of Life and a prime counting program. These programs were hand-translated into distributed Java programs. Performance experiments were done to access the performance implications of various design decisions.

8.3.1 How the Workload is Divided Up Among the Coarse-grain Tasks

One decision which was required in each of these case studies was how to divide the workload up among the coarse grain tasks. When doing this, it is important to take the communication costs into account. The amount of communication between tasks should be minimised by placing the fine-grain tasks which communicate with each other together in the same coarse-grain tasks when possible. The workload should be balanced among the compute nodes to reduce the amount of time tasks have to wait for data to be sent by other tasks.

For the prime counting program, two different programs were evaluated:

• A program which divides up the search range. In this program, each task calculates the initial set

of primes that is sufficient to filter out the all the non-primes. Each task then counts the number of primes within its sub range. The subtotals of the primes counted are then sent to one task which adds the subtotals together to get the total number of primes.

• A filter pipeline program. In this program, each task is allocated a different set of primes. Each

task filters out multiples of the primes allocated to the task. The tasks are organised as a pipeline so that the stream of possible primes flows through the pipeline, with non-primes being identified and removed by filter tasks. After the numbers have been through all the tasks, only primes remain.

The program which divided the search range up into subranges performed very well, being almost embarrassingly parallel. However, the pipeline version of the program performed much less well. The main reason for this was that the cost of the filtering operations was less than the cost of communication. Also, the primes pipeline proved to be difficult to load balance because most of the work was done by the tasks filtering out the low prime numbers, making it difficult to find enough work to keep the other compute nodes busy. Unless the load across the compute nodes is balanced then this will cause a bottleneck which forces compute nodes to wait rather than do productive work. However, pipelines could work quite well in other situations where the compute costs are much higher than the communication costs and the load can easily be balanced across the compute nodes.

For the Game of Life program, the board is divided up into pieces and each piece is handled by a different task. The size and shape of the pieces that the board is broken up into does have an effect on the communication time. Two different piece shapes were experimented with, vertical slices where each piece has two neighbours and a brick layout where each piece has six neighbours. Neither of these layouts was found to be universally superior to the other. Which one had the lowest communication cost depended on the interactions with other decisions, such as the size of each piece and the amount of haloing being used.

8.3.2 Haloing

One technique that significantly reduced communication time was haloing, or duplicating computation on multiple nodes. Haloing lowers the cost of communication by reducing the number of messages required. Haloing considerably reduced communication costs in the Game of Life case study. Haloing was found to yield the greatest improvement when tasks sent large numbers of small messages. For tasks that send a small number of large messages then the improvement is likely to be minimal or to reduce performance as the cost of duplicating computation exceeds any savings gained from reducing the number of messages.

8.3.3 Communication

It was found that small message sizes (less than 800 bytes or so) had relativity low throughput. If a large quantity of data is being sent using small messages, then communication overhead can be reduced by packing the small messages into larger messages.

Using the eager mode communication protocol rather than the synchronous mode protocol was found to significantly increase communication throughput, especially with small message sizes.

8.3.4 Allocation of Coarse Grain Tasks to Compute Nodes

In this thesis, each compute node has only one task allocated to it. The reason for doing this was to create an environment in which each compute node was a single processor. However, as compute nodes these days will certainly have many cores in one processor, allocating more than one task to each compute node will allow the program to take advantage of a compute node with multiple cores. Another approach that could be investigated in future work is to have multi-level parallelism, where inter-node communication could use message passing, but tasks on the same node could use shared-memory. This is discussed in Section 8.4.4.

In some situations, the load is not evenly balanced between tasks. This was the case for the Pascal triangle program presented in Chapter 4. In such cases allocating the tasks so that workload is balanced across compute nodes may increase performance.

In document Parallelization of JStar Programs on a Distributed Computer (Page 117-119)