OpenMP - Performance engineering of hybrid message passing + shared memory programming on multi

“OpenMP is probably the most commonly used communication standard for shared-memory based parallel computing” [21]. “OpenMP is becoming the standard programming model for shared memory parallel architectures” [45].

The OpenMP Application Programming Interface (API) [92] defines a set of library routines and compiler directives that facilitate parallelisation on shared memory systems using threading. A newer standard than MPI, it specifies some library routines for general tasks (such as discovering the number of threads in a parallel application) but relies primarily on compiler directives to parallelise a code. These directives are translated at compile time into parallel code by a compatible compiler, which will produce the threaded application code. It is now a widely accepted standard for “annotating programs for parallel executions” [66]. The bulk of directives (in OpenMP 2.0) are related to work sharing constructs and synchronisation.

In order to use OpenMP, the programmer specifies which sections of the code should be carried out in parallel by declaring parallel regions, using the omp parallel and omp end parallel directives. Within these regions, the programmer can then specify the distribution of work via constructs such as the sharing of iterations in a loop (using an omp for directive), or (if using OpenMP 3.0) the specification of tasks that may be carried out concurrently. Data access may also be controlled, allowing variables to be declared as private to a thread or shared between all threads. Other constructs allow for the synchronisation of threads at barriers (omp barrier), and for other useful operations such as the copying in of data values to a parallel region, or the reduction of values across a group of threads. OpenMP also allows nested parallelism (i.e. threads spawned from within other threads, which may match underlying hierarchical architecture features in multi-core systems), but performance may be reduced when using this style [37].
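As an illustration of the constructs described above, the following is a minimal sketch in C of a parallel region combined with a work-shared loop and a reduction. The function name sum_squares is hypothetical and chosen only for this example; it is not taken from any of the cited works.

```c
#include <assert.h>

/* Hypothetical helper: sum of the squares of 0..n-1. */
long sum_squares(int n)
{
    assert(n >= 0);
    long total = 0;   /* visible to all threads; combined via the reduction */

    /* Open a parallel region and share the loop iterations between the
       threads. Each thread accumulates into its own private copy of
       total, and the copies are added together when the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        long sq = (long)i * i;   /* declared inside the loop: private to each thread */
        total += sq;
    }
    return total;
}
```

Calling sum_squares(10) should return 285 regardless of the number of threads used, since the reduction combines the per-thread partial sums into a single value.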

As with MPI, the use of OpenMP introduces overheads to a parallel code. In contrast to MPI, these overheads are not generally incurred explicitly through calls to external libraries; rather, they are inherent in the operation of the shared memory threading. Many of the operations carried out as part of the parallel execution of the code consume run time. Threads must be forked and joined at the start and end of each parallel region. Synchronisation may also need to be carried out while an application is running, as may the reduction of global variables. Some shared memory related constructs, such as the locking of variables for atomic updates, can be very expensive in terms of run time. Various systems and toolkits exist to characterise the overheads caused by OpenMP, such as CLOMP [19] or the EPCC micro-benchmarks [22]. While OpenMP is designed primarily for shared memory hardware, distributed shared memory versions do exist, but these do not provide the same level of performance as pure MPI on distributed memory hardware [19].
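To make the cost of synchronised updates more concrete, the sketch below shows two ways of computing the same count. The function names count_atomic and count_reduction are hypothetical. The first performs an atomic update for every hit, while the second lets each thread accumulate privately and combines the results once at the end of the loop. Which version is faster, and by how much, depends on the hardware and the OpenMP implementation, so no timings are claimed here.

```c
#include <assert.h>

/* Count the multiples of 3 below n; every hit is a synchronised update. */
long count_atomic(int n)
{
    assert(n >= 0);
    long hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (i % 3 == 0) {
            /* All threads contend for the same shared variable. */
            #pragma omp atomic
            hits++;
        }
    }
    return hits;
}

/* Same count, but each thread updates a private copy of hits and the
   copies are summed once when the loop finishes. */
long count_reduction(int n)
{
    assert(n >= 0);
    long hits = 0;
    #pragma omp parallel for reduction(+:hits)
    for (int i = 0; i < n; i++) {
        if (i % 3 == 0)
            hits++;
    }
    return hits;
}
```

Both functions return the same value for any n; benchmarks such as those cited above could be used to measure the difference in overhead on a particular system.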

Advantages

A clear advantage of OpenMP is its high-level nature, which makes it much easier to perform simple parallelisation of a code [75]. The low-level implementation is done by the compiler; this allows the programmer to specify what to parallelise without having to specify exactly how it is parallelised. The API defines a large range of directives and routines, allowing a vast range of codes to use OpenMP for parallelisation. It defines the exact behaviour of routines and directives in both C and Fortran, appealing to the wide C and Fortran code base already in use in HPC. The ease of use is pointed out in [21]: “OpenMP ... can be easily adapted to existing sequential software”.

2.1 Parallel Programming 29

with far less programming effort required. OpenMP is also overall much easier to use than ‘hand-threading’ using pThreads [75]. Code parallelised using OpenMP can be easily made sequential by simply turning off the OpenMP compilation option.
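The point about sequential builds can be sketched as follows. The helper names available_threads and scale are hypothetical. The example relies on the _OPENMP macro, which the OpenMP specification requires a compiler to define when OpenMP compilation is enabled, so that calls to the library routines can be guarded and the same source still builds when the option is switched off.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>   /* only needed when OpenMP compilation is enabled */
#endif

/* Hypothetical helper: upper bound on the threads a parallel region may use. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;      /* sequential build: the directives are ignored */
#endif
}

/* Scale an array in place. Without OpenMP the pragma is ignored and the
   loop runs sequentially, giving the same result. */
void scale(double *x, int n, double factor)
{
    assert(x != NULL && n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}
```

With GCC, for example, the OpenMP option is -fopenmp; omitting it yields the sequential version of this code.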

Many tools exist for debugging OpenMP, and the nature of the shared memory threading makes it easier to debug and tune than a distributed application [75] such as one using MPI, or a shared memory application using pThreads or similar mechanisms. [74] has found that: “OpenMP performance profiling can be done more easily than for performance profiling of a program making manual calls to a threads library”, and that for OpenMP: “current tools can very accurately locate threading correctness problems” [74]. These examples point to the advantage OpenMP applications have over distributed MPI based applications when debugging and maintaining application code.

OpenMP also has some speed advantages; on some architectures, OpenMP barriers are faster than the equivalent MPI barriers, and broadcasting data can also be much faster [60].

Disadvantages

Although simple parallelisation is very straightforward with OpenMP, it can be much harder to get very good performance when a more complicated strategy is required. It may also be hard to debug problems such as false sharing caused by an incorrect parallelisation strategy. The shared memory model is only scalable as far as the shared memory hardware, and this disadvantage is carried into OpenMP. The knowledge base for OpenMP among the HPC community, while growing, is still smaller than that of MPI. Additionally, there may be performance problems caused by OpenMP itself. Synchronisation of OpenMP threads may cause cache-related issues that slow down an application [98], and furthermore OpenMP may have performance problems on cache-coherent Non-Uniform Memory Access (ccNUMA) hardware [20]. The use of OpenMP during compilation may also stop the compiler carrying out certain loop optimisations, which may affect code performance [52].
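False sharing arises when threads repeatedly write to distinct variables that happen to lie on the same cache line. The sketch below shows one common way of avoiding it, by padding per-thread counters so that each occupies its own line. The names count_odd, padded_counter, CACHE_LINE and MAX_THREADS are hypothetical, and the 64-byte cache line size is an assumption that does not hold on every architecture.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64    /* assumed cache line size in bytes */
#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* One counter per thread, padded so that the value fields of two
   different counters never fall on the same cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Count the odd entries of data[0..n-1] using per-thread counters. */
long count_odd(const int *data, int n)
{
    assert(data != NULL && n >= 0);
    struct padded_counter counts[MAX_THREADS] = {{0}};

    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;   /* never index past the counter array */
#endif

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (data[i] % 2 != 0)
                counts[tid].value++;   /* each thread writes only its own counter */
    }

    /* Combine the per-thread counts sequentially. */
    long total = 0;
    for (int t = 0; t < nthreads; t++)
        total += counts[t].value;
    return total;
}
```

In a case as simple as this one, a reduction clause would give the same result with less code; the padded layout is shown only to illustrate the kind of manual tuning that false sharing can require.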

30 2.2 Multi-core Clusters

In document Performance engineering of hybrid message passing + shared memory programming on multi-core clusters (Page 41-44)