Shared Memory Overheads - Hybrid Code Performance

4.3 Hybrid Code Performance

4.3.4 Shared Memory Overheads

Comparing the performance of routines in the hybrid code that contain OpenMP parallelisation with the performance of the same routines in the pure MPI code, where no OpenMP parallelisation is present, reveals the effect of the OpenMP additions on performance.

The forces routine contains two for loops that have been parallelised with OpenMP in the hybrid version. If the performance of the MPI version is taken as a baseline, the difference in performance of the hybrid code can be calculated by subtracting the runtime of the MPI version from the runtime of the hybrid code. Table 4.6 shows the difference for the small simulation, while Table 4.7 shows the difference for the large simulation. As well as the absolute timing difference between the hybrid and MPI code, the tables give the percentage of the hybrid runtime that this difference represents. The absolute differences for both the small and large simulations are illustrated graphically in Figure 4.18. The figure shows the difference between the average timings as a line plot, with error bars giving the sum of the errors from the hybrid and pure MPI forces measurements, and shows the difference between the minimum observed timings as marked points.

Standard errors are again relatively small, suggesting that the variance between the timings of the three runs is small when considering the runtime of the forces routine. This would be expected, as the routine is deterministic and entirely node bound, involving no communication. Sources of potential variance in runtime are therefore limited.

         Hybrid (1 MPI)              Hybrid (2 MPI)
Cores    Absolute (s)   Percentage   Absolute (s)   Percentage
64       16.49761       14.24%       5.65062        5.38%
128      1.66713        3.18%        0.95764        1.85%
192      0.01156        0.03%        0.55513        1.60%
256      -0.09332       -0.37%       0.69562        2.64%
320      2.25459        9.92%        3.42417        14.32%
384      0.27799        1.59%        0.76414        4.24%
448      0.08347        0.56%        1.14503        7.21%
512      0.14007        1.06%        0.69516        5.07%

Table 4.6: Absolute difference between minimum observed pure MPI timing and minimum observed hybrid timing for the forces routine on Merlin running the small simulation. Absolute differences given in seconds.
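As a minimal sketch of how the quantities reported in Tables 4.6 and 4.7 and Figure 4.18 could be derived from repeated timings, consider the following. The function names and sample timings are hypothetical, and two details are assumptions rather than statements from the text: that the percentage is taken relative to the minimum hybrid runtime, and that the standard error uses the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(runs):
    """Standard error of the mean (assumes the sample standard deviation)."""
    return stdev(runs) / sqrt(len(runs))


def forces_overhead(hybrid_runs, mpi_runs):
    """Summarise hybrid-vs-MPI timing differences for one core count.

    Both arguments are lists of forces-routine runtimes in seconds.
    """
    # Table columns: difference between the minimum observed timings, and
    # that difference as a percentage of the (minimum) hybrid runtime.
    min_diff = min(hybrid_runs) - min(mpi_runs)
    min_pct = 100.0 * min_diff / min(hybrid_runs)

    # Figure line: difference between the averages, with an error bar equal
    # to the sum of the errors on the two measurements.
    avg_diff = mean(hybrid_runs) - mean(mpi_runs)
    error_bar = standard_error(hybrid_runs) + standard_error(mpi_runs)

    return {
        "min_diff": min_diff,
        "min_pct": min_pct,
        "avg_diff": avg_diff,
        "error_bar": error_bar,
    }


# Hypothetical timings (seconds) for three runs of each code; not measured data.
summary = forces_overhead([10.0, 10.2, 10.4], [8.0, 8.1, 8.2])
for name, value in summary.items():
    print(f"{name}: {value:.5f}")
```

A negative difference, as at 256 cores for Hybrid (1 MPI) in Table 4.6, indicates that the hybrid code was faster than the pure MPI code for that measurement.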

         Hybrid (1 MPI)              Hybrid (2 MPI)
Cores    Absolute (s)   Percentage   Absolute (s)   Percentage
64       93.98499       25.76%       44.91361       8.16%
128      17.98513       11.58%       7.35597        8.00%
192      4.24421        4.36%        1.15308        4.63%
256      4.09148        5.65%        3.5239         7.85%
320      3.21845        5.57%        2.93613        9.67%
384      1.56781        3.33%        2.06792        6.99%
448      8.05073        17.13%       7.00104        37.30%
512      1.86674        5.17%        2.2139         14.21%

Table 4.7: Absolute difference between minimum observed pure MPI timing and minimum observed hybrid timing for the forces routine on Merlin running the large simulation. Absolute differences given in seconds.

[Plot: Difference to Pure MPI (seconds), from -20 to 100, against Cores, from 64 to 512. Series: Hybrid (1 MPI) and Hybrid (2 MPI) for the small and large simulations, each shown as an average and as a difference between minimums.]

Figure 4.18: Absolute difference between average pure MPI timing and average hybrid timing for the forces routine on Merlin (line), presented with error bars representing the sum of errors on both forces timing measurements, and difference between the minimum observed timings for the hybrid and pure MPI codes (markers). The large difference between the codes at small core counts decreases rapidly as the number of cores increases.

There is a clear relationship between the number of cores and the difference in performance between the hybrid and pure MPI codes. On 64 cores the difference between the two is relatively large, but it shrinks rapidly as the number of cores increases (with isolated exceptions at 320 cores for the small simulation and 448 cores for the large simulation). This suggests that part of the performance difference is due to the size of the problem (or subdomain) on each node. When running on a low number of cores (and therefore with a low number of MPI processes), the subdomain on each node in the hybrid version will be relatively large compared to the subdomains in the pure MPI code, meaning the work to be done in the main work loop of the forces method will also be relatively large. As the number of cores increases, the subdomain shrinks and the work to be done decreases. Similarly, the amount of data to be copied into private copies within each OpenMP thread will be much larger at low core counts than at high core counts; the OpenMP benchmarks in Section 3.1.4 have already shown that this adds overhead. The loop runtime and the OpenMP overheads therefore reduce as the number of cores increases, and the performance approaches that of the MPI version.

A further possible cause of the performance difference is the simple parallelisation strategy used in the code. In order to see whether simple hybridisation approaches can deliver performance improvements, the code has not been significantly restructured; only simple loop-level parallelisation has been added to the hybrid version. The memory structures and data layout may therefore not be optimal for shared memory parallelisation of this type, which may cause an extra performance hit in the hybrid message passing + shared memory code. These indirect overheads of the OpenMP parallelisation become less of an issue as the number of nodes increases and the work done by each thread decreases. Testing this theory would require a much more in-depth restructuring of the code to see whether performance improvements can be gained by altering the memory and loop structures.
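The subdomain-size argument can be illustrated with some simple arithmetic. The sketch below assumes a fixed total problem size divided evenly between MPI processes. The function name, the problem size and the value of 8 cores per node are hypothetical, and reading "Hybrid (1 MPI)" and "Hybrid (2 MPI)" as one and two MPI processes per node is likewise an assumption made for illustration.

```python
def subdomain_size(total_items, cores, cores_per_node, mpi_per_node):
    """Items owned by each MPI process under an even decomposition."""
    nodes = cores // cores_per_node
    return total_items / (nodes * mpi_per_node)


TOTAL_ITEMS = 1_000_000  # hypothetical problem size
CORES_PER_NODE = 8       # assumed node size, not taken from the text

for cores in (64, 128, 256, 512):
    # Pure MPI: one process per core, so CORES_PER_NODE processes per node.
    pure_mpi = subdomain_size(TOTAL_ITEMS, cores, CORES_PER_NODE, CORES_PER_NODE)
    # Hybrid: one or two processes per node, with threads using the other cores.
    hybrid_1 = subdomain_size(TOTAL_ITEMS, cores, CORES_PER_NODE, 1)
    hybrid_2 = subdomain_size(TOTAL_ITEMS, cores, CORES_PER_NODE, 2)
    print(f"{cores:4d} cores: pure MPI {pure_mpi:9.1f}  "
          f"hybrid (1 MPI) {hybrid_1:9.1f}  hybrid (2 MPI) {hybrid_2:9.1f}")
```

Under these assumptions the hybrid subdomain is always a fixed multiple of the pure MPI subdomain, but its absolute size falls by a factor of eight between 64 and 512 cores. The volume of data that each thread may need to copy into private storage would shrink accordingly.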

In document Performance engineering of hybrid message passing + shared memory programming on multi-core clusters (Page 112-115)