Distributed Life Program - Parallelization of JStar Programs on a Distributed Computer

7.4.1 Program Outline

A distributed version of the Game of Life was written to conduct experiments to assess the performance of various design decisions. The program was written in Java using the MPJ Express[5] message passing library.

Each task is responsible for calculating part of the board. The task asynchronously sends the cell values on the edge of its block to other tasks that require them. The task then waits to receive cell values from neighbouring tasks. After cell values arrive from neighbouring tasks the next generation cell values are calculated. These steps are repeated until the specified number of generations has been calculated.

1300 1320 1340 1360 1380 1400 1420 1440

1280x1280, 65536 Gens 2560x2560, 16384 Gens 5120x5120, 4096 Gens 10240x10240, 1024 Gens 20480x20480, 256 Gens

Runtime (s)

Board Size, Number of Generations

Figure 7.3: A box and whisker graph showing the runtimes of the single node version of life (for nine runs).

generation_counter = 0

while generation_counter < generations_to_run: send edges to neighbours

receive edges from neighbours calculate next board generation generation_counter++

Figure 7.4: Pseudocode for the distributed life program.

These steps are the main loop which runs on each task. The pseudocode for this main loop is shown in Figure 7.4.

The pseudocode for calculating life cells on the board is shown in Figure 7.5. Each task’s part of the board is stored in an array of arrays of byte. Two of these are used, one to store the current generation cell values and one for holding the next generation of the board.

For communication between tasks the cells to be transmitted are bit-packed into an array of byte, with each bit in the byte holding one cell value.

After each generation has been calculated the current generation reference is updated to refer to the array containing the newly calculated values as they become the new current generation. The next generation reference is updated to point to the old current generation array, so that the array will be

for (x=0; x < width; x++): for (y=0; y < height; y++):

surroundingAlive = count live cells surrounding (x,y)

if (x,y) is alive and (surroundingAlive = 2 or surroundingAlive = 3): next generation (x,y) = alive

else if (x,y) is alive and surroundingAlive = 3: next generation (x,y) = alive

else

next generation (x,y) = empty

Figure 7.5: Pseudocode for calculating board life cells.

reused and overwritten by the new next generation values.

7.4.2 Measuring the Performance of the Distributed Program

When measuring the runtime of a computer program there are often factors that affect the runtime which are difficult for the experimenter to control. For example, most operating systems have background processes which may consume the computers resources during the experiment and cause the program to run more slowly than normal. This problem becomes more serious when measuring the runtime of the life program running on a distributed computer, as using multiple compute nodes increases the likelihood that one of the compute nodes will run more slowly than usual. When this occurs all the other compute nodes are held up as the neighbours of the slow node must wait to receive data, which in turn holds up their neighbours and so on. This makes the entire program run much more slowly than usual. In Figure 7.6 the effect of one compute node holding up the rest can be seen.

As the number of compute nodes used increases, this sort of slowdown becomes increasingly likely. When using 32 nodes, more than 20% of the experiment runs had at least one node whose calculation time exceeded the median node calculation time by more than 10% .

The problem with this is it makes it more difficult to measure the difference in runtime between different configurations as the difference in runtime is overshadowed by the slowdown caused by one node running more slowly than the others. Each board configuration is run twelve times times. However the slowdowns are large enough and frequent enough that they have a considerable impact on the mean value. The median value is usually less affected by runs with a slow node, but unfortunately slowdowns are common enough that for some experiments the majority of twelve runs had a slowdown, which means the median is affected.

To make the data from different experiments easier to compare, runs that contained an slow node that caused the runtime for that run to be abnormally high are excluded. To determine which runs

Figure 7.6: Timelines for two runs of a distributed life program. Each bar represents a compute node. The coloured area of the bar is time spent by the node computing life cells, the white area is time spent communicating with other nodes or waiting data to arrive from other nodes. In the top graph the nodes are all computing life cells at a similar rate and thus there no significant bottlenecks. In the bottom graph one compute node here is computing more slowly that the others. The other nodes are held up by this slow node and spend much of their time waiting for data to arrive. The delays propagate from node to node, as the nodes that are held up then hold up their neighbours. The layout that is used for this experiment is the 2-row brick layout which causes the double “V” pattern visible in the bottom graph.

Figure 7.7: Slices, Grid and Brick layouts.

contained a slow node the time spent calculating life cells is measured using System.nanoTime. Any node which spends more than 6.1% more time calculating life cells compared to the median for that run is considered a slow node. All runs with such a node are excluded. The median is taken of the remaining runs.

The tolerance of 6.1% was chosen because this percentage discarded abnormally high runtimes but retained at least 3 runs for each experiment.

In document Parallelization of JStar Programs on a Distributed Computer (Page 88-92)