Single Node Runs - Parallelization of JStar Programs on a Distributed Computer

To calculate the speedup of the distributed versions of the prime counting program, the runtime of the program on a single node was measured. For this purpose a single node variant of the prime counting program was created. This variant shares the prime filtering code with the distributed versions, but runs on only one node and does not use the MPJ Express communication library. The runtime is measured by calling System.nanoTime at the beginning and end of the program. The runtime of the single node version was measured using both the priority queue filter and the mark and check filter.
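The timing approach described above can be sketched as follows. This is a minimal illustration, not the thesis code: the class name SingleNodeTimer, the method timeSeconds and the placeholder workload are assumptions made for the example. Only System.nanoTime is taken from the text.

```java
final class SingleNodeTimer {
    // Runs the task once and returns the elapsed wall-clock time in seconds.
    // (Hypothetical helper; the thesis only states that System.nanoTime is
    // called at the beginning and end of the program.)
    static double timeSeconds(Runnable task) {
        long start = System.nanoTime();
        task.run();
        long end = System.nanoTime();
        return (end - start) / 1e9;
    }

    public static void main(String[] args) {
        double seconds = timeSeconds(() -> {
            // Placeholder workload standing in for the prime counting run.
            long sum = 0;
            for (long i = 0; i < 1_000_000L; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.printf("Runtime: %.3f s%n", seconds);
    }
}
```

System.nanoTime is suited to measuring elapsed intervals because it is not tied to the system clock, so the difference between two readings is meaningful even if the wall-clock time is adjusted during the run.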

6.5.1 Priority Queue Filter

The single node version of the program using the priority queue filter was run five times on five different compute nodes, counting primes up to 10 billion. The median runtime was 7181.2 seconds.

To see whether this runtime could be improved, several optimisations were tested:

• Storing the value of multiple instead of recalculating it.

The prime-multiplier pairs in the priority queue are ordered on the value of the prime multiplied by the multiplier. This value would need to be recalculated many times during operations on the heap. Instead of recalculating the value each time, it would likely be faster if it was calculated once and stored.

• Skip checking multiples of two.

Instead of generating multiples of two and then filtering them out, multiples of two can be skipped by checking only odd numbers from three onwards. As even numbers greater than 2 are skipped, it is not necessary to put multiples of two into the priority queue.

• Partitioning the priority queue into multiple priority queues.

The complexity of inserting an item into a heap is O(log n), where n is the number of items in the heap. By dividing the priority queue into several smaller priority queues, the amount of time taken to insert prime-multiple pairs could be reduced. In these single node experiments, a series of priority queues is kept in an array. The maximum number of items stored in each priority queue is set to $2^{\mathit{arrayindex}+3} - 1$, that is {7, 15, 31, 63, 127, 255, ...}. These maximum sizes were chosen because heaps containing these numbers of items are perfect binary heaps. Perfect binary heaps are heaps where all leaves are at the same depth.

The priority queues are filled from the left of the array, so that multiples of smaller primes are filtered out by priority queues with fewer items stored in them. As multiples of small primes occur much more frequently than multiples of larger primes, they will have to be inserted into the priority queues much more frequently. Therefore it makes sense to store multiples of those primes in priority queues with a smaller number of items.
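The three optimisations above can be combined in a small sketch. It is an illustration under stated assumptions rather than the thesis implementation: the class PartitionedQueueSieve, the Entry type and the method names are hypothetical, and the way a candidate is checked (scanning every queue's smallest stored multiple) is an assumption, since the text does not say how the queues are consulted. The queue capacities follow the 7, 15, 31, ... sequence given above, and queues are filled from the left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class PartitionedQueueSieve {
    // Heap entry: the next multiple is stored, so ordering the heap
    // never has to recompute prime * multiplier.
    static final class Entry implements Comparable<Entry> {
        final long prime;
        long multiple;

        Entry(long prime, long multiple) {
            this.prime = prime;
            this.multiple = multiple;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(multiple, other.multiple);
        }
    }

    // Maximum size of the queue at a given array index: 2^(index+3) - 1.
    static int capacity(int arrayIndex) {
        return (1 << (arrayIndex + 3)) - 1;
    }

    static long countPrimes(long limit) {
        if (limit < 2) {
            return 0;
        }
        long count = 1; // the prime 2; even candidates are never generated
        List<PriorityQueue<Entry>> queues = new ArrayList<>();
        for (long n = 3; n <= limit; n += 2) {
            boolean composite = false;
            // Assumption: every queue is inspected for a stored multiple equal to n.
            for (PriorityQueue<Entry> queue : queues) {
                while (!queue.isEmpty() && queue.peek().multiple <= n) {
                    Entry entry = queue.poll();
                    if (entry.multiple == n) {
                        composite = true;
                    }
                    entry.multiple += 2 * entry.prime; // next odd multiple
                    queue.add(entry); // stays in the same queue
                }
            }
            if (!composite) {
                count++;
                // Only primes whose square is within the limit need an entry.
                if (n <= limit / n) {
                    int last = queues.size() - 1;
                    if (last < 0 || queues.get(last).size() >= capacity(last)) {
                        queues.add(new PriorityQueue<>());
                        last++;
                    }
                    queues.get(last).add(new Entry(n, n * n));
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000L));
    }
}
```

Because primes are discovered in increasing order, filling the queues from the left places the smallest primes, whose multiples are re-inserted most often, in the smallest heaps.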

Optimisations                    Single Node Runtime (s)
None                             7181.2
Skip Multiples of 2              5683.3
Storing Prime · Multiplier       6103.1
Partitioning Priority Queue      3706.3
All of above optimisations       2864.8

Table 6.1: Single Node Runtimes for the priority queue algorithm with different optimisations.

[Plot: Runtime (s), from 0 to 500, against Buffer Size, from 10000 to 1e+08. Series: Runtime; Runtime Skip 2s.]

Figure 6.7: Median runtime of the single node implementation with the mark and check filter, using different buffer sizes.

6.5.2 Mark and Check Filter

The single node primes program was run using the mark and check filter with various buffer sizes. Each experiment was run five times on five different compute nodes, counting up to 10 billion. The experiment was repeated with the optimisation of skipping the checking of multiples of two.

The median runtimes are shown in Figure 6.7.
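To make the role of the buffer size concrete, the following sketch counts primes one buffer at a time: multiples of known primes are marked in the buffer, then the unmarked positions are checked and counted. This section does not describe the internals of the mark and check filter, so the sketch assumes it behaves like a segmented sieve; the class MarkAndCheckSketch, the method countPrimes and the parameter bufferSize are hypothetical names, and the skip-multiples-of-two variant is not shown.

```java
import java.util.Arrays;

final class MarkAndCheckSketch {
    static long countPrimes(long limit, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        if (limit < 2) {
            return 0;
        }
        // Integer square root of limit, corrected for floating-point error.
        int root = (int) Math.sqrt((double) limit);
        while ((long) root * root > limit) {
            root--;
        }
        while ((long) (root + 1) * (root + 1) <= limit) {
            root++;
        }
        // Base primes up to sqrt(limit), found with a simple sieve.
        boolean[] smallComposite = new boolean[root + 1];
        int[] basePrimes = new int[root + 1];
        int baseCount = 0;
        for (int i = 2; i <= root; i++) {
            if (!smallComposite[i]) {
                basePrimes[baseCount++] = i;
                for (long j = (long) i * i; j <= root; j += i) {
                    smallComposite[(int) j] = true;
                }
            }
        }
        long count = 0;
        boolean[] marked = new boolean[bufferSize];
        for (long low = 2; low <= limit; low += bufferSize) {
            long high = Math.min(low + bufferSize - 1, limit);
            Arrays.fill(marked, false);
            // Mark phase: flag multiples of each base prime inside [low, high].
            for (int k = 0; k < baseCount; k++) {
                long p = basePrimes[k];
                if (p * p > high) {
                    break;
                }
                long firstInBuffer = ((low + p - 1) / p) * p;
                long start = Math.max(p * p, firstInBuffer);
                for (long m = start; m <= high; m += p) {
                    marked[(int) (m - low)] = true;
                }
            }
            // Check phase: every unmarked number in the buffer is prime.
            for (long n = low; n <= high; n++) {
                if (!marked[(int) (n - low)]) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1_000_000L, 1_000));
    }
}
```

In a scheme like this, a small buffer repeats the per-buffer setup work many times, while a very large buffer may no longer fit in the processor cache. This is one plausible explanation for the shape of the curve in Figure 6.7, not a measured result.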

6.5.3 Conclusions

For the priority queue algorithm, the optimisations discussed (storing the multiple rather than recalculating it, skipping multiples of two, partitioning the multiples into a number of different priority queues) significantly reduce the program runtime.

For the mark and check algorithm, using a buffer size that is too small can significantly increase the runtime, as can be seen in Figure 6.7. For buffer sizes from 200,000 up to 800,000 the program runtime slowly decreases as the buffer size becomes larger. For buffer sizes greater than 3,200,000 the program runtime sharply increases. As 800,000 is the optimal buffer size, all experiments in this chapter using the mark and check algorithm use this buffer size.

The above experiments show that the mark and check algorithm performs much better than the priority queue algorithm. The mark and check algorithm with a buffer size of 800,000, skipping multiples of two, is about 36 times faster than the priority queue algorithm with all the discussed optimisations (78.9 seconds against 2864.8 seconds).

For calculating the speedup over the single node version of the program, the median runtime of the fastest single node program is used. This is the mark and check algorithm using a buffer size of 800,000 skipping multiples of two. The median runtime of this program is 78.9 seconds.

In Section 6.6, the distributed primes program using the priority queue algorithm is compared to the single node priority queue program. For calculating this speedup, the median runtime of the single node program using the priority queue with all the discussed optimisations is used. The median runtime of this program is 2864.8 seconds.
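The ratios quoted in these conclusions follow directly from the reported medians. The sketch below recomputes them, assuming the usual definition of speedup as baseline runtime divided by measured runtime; the class SpeedupSketch and the method speedup are hypothetical names, and the runtimes are the values reported above.

```java
final class SpeedupSketch {
    // Speedup of a measured run relative to a baseline run (assumed definition).
    static double speedup(double baselineSeconds, double measuredSeconds) {
        return baselineSeconds / measuredSeconds;
    }

    public static void main(String[] args) {
        double priorityQueueUnoptimised = 7181.2; // Table 6.1, no optimisations
        double priorityQueueOptimised = 2864.8;   // Table 6.1, all optimisations
        double markAndCheckBest = 78.9;           // buffer size 800,000, skipping multiples of two

        // Gain from the priority queue optimisations.
        System.out.printf("Priority queue optimisations: %.2fx%n",
                speedup(priorityQueueUnoptimised, priorityQueueOptimised));
        // Mark and check against the optimised priority queue.
        System.out.printf("Mark and check vs priority queue: %.1fx%n",
                speedup(priorityQueueOptimised, markAndCheckBest));
    }
}
```

The same function applies to the distributed runs: the baseline is 78.9 seconds in general, and 2864.8 seconds when the distributed priority queue program is compared with its single node counterpart.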
