Hybrid Model Performance - Hybrid Programming

2.3 Hybrid Programming

2.3.1 Hybrid Model Performance

“There seems to be a general lore that pure MPI can often outperform hybrid, but counterexamples do exist and results tend to vary” [99].

A hybrid MPI/OpenMP version of the NASA Advanced Supercomputing (NAS) benchmarks is compared and contrasted with the pure MPI versions in [26], and it is concluded that performance depends on several parameters such as memory access patterns and hardware performance, and that “even optimized hybrid codes may provide insignificant performance improvements compared to the original MPI version” [26]. It is also noted that “[f]or all benchmarks, the MPI computation times are less than the MPI+OpenMP ones” [26], suggesting that the addition of OpenMP to the code has actually slowed the computational parts of the code. The specific case of a hybrid message passing + shared memory Discrete Element Modelling (DEM) code is examined in [55], which agrees with [26] in finding that OpenMP overheads result in the pure MPI code outperforming the hybrid code. However, [55] also concludes that the fine-grain parallelism required by the hybrid model results in poorer performance than in a pure OpenMP code, suggesting that the hybrid message passing + shared memory code is therefore worse than either a pure message passing or pure shared memory code. A direct contradiction is found in [104], where the conclusion of examining a hybrid message passing + shared memory code is that in certain situations the hybrid model can offer better performance than pure MPI codes, but that it is not ideal for all applications. An example using SMP nodes is found in [54], which concludes that performance improvements can be made with the hybrid model, and that memory overhead is drastically reduced in hybrid code compared to pure MPI code. A further example with SMP nodes in [24] shows performance improvements with kernel algorithms, although a significant amount of work is needed to achieve them, and finds that pure MPI offers better performance with real applications. It concludes that problems with memory bandwidth, cache access and threading overheads result in lower performance.
Again, an example hybrid application does not perform as well as either pure OpenMP or pure MPI versions in [101], while a Simulated Annealing optimisation example in [36] finds better performance with MPI + OpenMP hybrid code than with pure MPI. The hybrid approach has even been used across Grid systems in [119]. Another example is found in [40], again focused on SMP clusters, but suggesting that performance can be improved with the hybrid model. In [67] the hybrid model is found to deliver performance benefits, with the performance gains a result of “inherently lower latency of shared memory threads across processors within a node” [67], while in [107] the load balancing of hybrid CFD codes on SMP clusters is examined. In [105] a hybrid Quantum Monte Carlo code is developed, but only tested on one SMP node, and results are inconclusive. In [53] a positive conclusion is reached on the experience and performance of “developing hybrid MPI and OpenMP parallel paradigms for real applications”, and in [68] it is found that a hybrid code performs better than pure MPI over Gigabit Ethernet, but that overall scalability of the code decreases. Again, in [68] one of the NAS parallel benchmarks is examined, and it is found that the hybrid model has benefits on slower connection fabrics. The plane wave Car Parrinello code, CPMD [28], has been parallelised in a hybrid fashion based on a distributed-memory coarse-grain algorithm with the addition of loop level parallelism using OpenMP compiler directives and multi-threaded libraries (BLAS and FFT). Good performance of the code has been achieved on distributed computers with shared memory nodes and several thousands of CPUs [10, 63]. In [86] the hybrid and pure MPI approaches are found to deliver similar performance, but on large numbers of SMP nodes the hybrid approach outperforms MPI. However, in [44] it is found that the hybrid paradigm is “inferior compared to a pure MPI parallelization” [44].
This poorer performance is attributed to the claim that “MPI libraries tend to be highly optimized for message passing communication and provide poor support for thread management” [44]; however, as discussed in the section on MPI, this may not actually be the case. A micro-benchmark suite for analysing hybrid code performance, together with results from the suite, is presented in [23], showing that understanding the performance of the hybrid message passing + shared memory model is of importance to HPC.

Some of the literature is flawed in that it tests hybrid codes on only one SMP node and so does not capture the impact of reduced communication requirements, which much of the other literature (and this thesis) shows to be an important part of hybrid model performance. Even ignoring these studies, though, there is no consensus in the literature as to the effectiveness of the hybrid message passing + shared memory model. The contrasting set of results and conclusions found throughout the literature shows that, as with the problem of MPI vs. OpenMP, the performance of the hybrid message passing + shared memory model depends heavily on the algorithm or application being parallelised and the hardware used for testing. There is no simple way to know if a hybrid version of an application will perform better or worse than a pure message passing one without a deep understanding of the performance of the hardware it is to be run on and the structure of the application itself.
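The point about single-node tests and reduced communication requirements can be made concrete with a small counting exercise. The sketch below is a toy model and is not taken from the thesis or from any of the cited studies: the `Layout` class, the function names, and the example cluster (4 nodes of 8 cores, 100 MiB of replicated data per rank) are all hypothetical. It assumes a naive all-to-all exchange in which every rank sends one message to every other rank, and it ignores message sizes, optimised collective algorithms, and threading overheads, so it says nothing about which version actually runs faster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical mapping of a job onto a cluster of SMP nodes."""
    nodes: int
    ranks_per_node: int      # MPI processes placed on each node
    threads_per_rank: int    # OpenMP threads inside each MPI process

    @property
    def ranks(self) -> int:
        return self.nodes * self.ranks_per_node

    @property
    def cores_used(self) -> int:
        return self.ranks * self.threads_per_rank


def all_to_all_messages(layout: Layout) -> int:
    """Messages in a naive all-to-all: each rank sends to every other rank."""
    p = layout.ranks
    return p * (p - 1)


def inter_node_messages(layout: Layout) -> int:
    """The subset of those messages whose two endpoints are on different nodes."""
    p = layout.ranks
    same_node_peers = layout.ranks_per_node - 1
    return p * (p - 1 - same_node_peers)


def replicated_bytes_per_node(layout: Layout, bytes_per_rank: int) -> int:
    """Memory per node spent on data that every rank keeps a private copy of."""
    return layout.ranks_per_node * bytes_per_rank


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Same 32 cores, two ways of using them (assumed example values).
    pure_mpi = Layout(nodes=4, ranks_per_node=8, threads_per_rank=1)
    hybrid = Layout(nodes=4, ranks_per_node=1, threads_per_rank=8)

    for name, layout in (("pure MPI", pure_mpi), ("hybrid", hybrid)):
        print(
            name,
            "ranks:", layout.ranks,
            "all-to-all messages:", all_to_all_messages(layout),
            "inter-node messages:", inter_node_messages(layout),
            "replicated MiB per node:",
            replicated_bytes_per_node(layout, 100 * MIB) // MIB,
        )

    # On a single node nothing crosses the network in either layout,
    # so a one-node test cannot show any inter-node communication saving.
    one_node = Layout(nodes=1, ranks_per_node=8, threads_per_rank=1)
    print("single node, inter-node messages:", inter_node_messages(one_node))
```

Under these assumptions the pure MPI layout exchanges 992 messages (768 of them between nodes) against 12 for the hybrid layout, and keeps eight copies of the replicated data per node rather than one. Whether that saving outweighs OpenMP overheads is exactly the application- and hardware-dependent question the literature above disagrees on.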

In document Performance engineering of hybrid message passing + shared memory programming on multi-core clusters (Page 53-55)