Guidelines for an Adjoinable MPI Code - Semantics driven adjoints of the message passing interface

This section summarizes this chapter by extracting relevant practical guidelines for developers of numerical simulation codes. The guidelines are listed in order of priority as follows:

• Adjoint memory wall

• Two-sided communication (MPI standard 1.0)

• One-sided communication (MPI standard 2.0)

First, the issue of decreasing memory availability on single cluster nodes has been discussed in Section 3.1.2. Before implementing code, one should decide whether the adjoint code needs to run in parallel. Adjoint code generated by an AD tool may perform well on single-processor machines while scaling very badly when run in parallel. If the adjoint memory wall risks becoming an issue, one should look into continuous adjoints, as discrete adjoints become infeasible.

Second, if the adjoint memory wall is not an issue, one should start writing MPI code using two-sided communication. Two-sided communication has been shown to be robustly adjoinable (see Chapter 5). Only the product reduction poses a challenge, since it requires a decision at runtime on how it should be adjoined (see Section 3.6.2).
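To illustrate why a product reduction may need a runtime decision, consider y = x_0 * ... * x_(n-1), whose partial derivative with respect to x_i is the product of all other factors. The following is a minimal sketch in plain C, not the scheme of Section 3.6.2 (which is not reproduced here). It assumes that the decision depends on how many factors are zero, and it uses a local array in place of the values contributed by the individual processes. The name product_adjoint is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of AMPI: adjoint of
 * y = x[0] * ... * x[n-1]. Adds y_bar * dy/dx[i] to x_bar[i]. */
void product_adjoint(const double *x, size_t n, double y_bar, double *x_bar)
{
    size_t zeros = 0;          /* number of zero factors in the forward values */
    size_t zero_idx = 0;       /* index of the last zero factor found */
    double prod_nonzero = 1.0; /* product of all nonzero factors */

    assert(x != NULL && x_bar != NULL);

    for (size_t i = 0; i < n; ++i) {
        if (x[i] == 0.0) {
            ++zeros;
            zero_idx = i;
        } else {
            prod_nonzero *= x[i];
        }
    }

    if (zeros == 0) {
        /* No zero factor: dy/dx[i] = y / x[i] is well defined. */
        for (size_t i = 0; i < n; ++i) {
            x_bar[i] += y_bar * (prod_nonzero / x[i]);
        }
    } else if (zeros == 1) {
        /* Exactly one zero factor: only its partial derivative is nonzero. */
        x_bar[zero_idx] += y_bar * prod_nonzero;
    }
    /* Two or more zero factors: all partial derivatives vanish. */
}
```

The branch taken depends on the values present at runtime, which is why such a decision cannot be fixed when the adjoint code is generated.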

Third, given a use case for one-sided communication in the original code, one should seriously reconsider implementing one-sided MPI if adjoint code thereof is to be generated by an AD tool. One-sided MPI code is not generically adjoinable. In order to keep one-sided communication generically adjoinable and to expect reasonable performance, one has to enforce the access restriction axiom (see Section 3.8). It constrains the original code by restricting concurrent read and write access to the public window and the local memory inside an access epoch. Moreover, communications such as the accumulate with a product operation require serious workarounds in order to generate correct adjoints (see Section 3.8.4), whereby the original communication pattern is altered considerably. This may have drawbacks performance-wise. Additionally, passive target synchronization should only be used with manual user input, since it is not possible to adjoin it generically. Developing adjoinable passive target code requires an in-depth understanding of AD, which reduces the maintainability of the original code (see Section 3.8.5).

Chapter 4

Adjoint MPI Library

There are currently two generic adjoint MPI libraries. The first one, called “Adjoint MPI library” or simply AMPI, was developed by the author from 2010 to 2014 for prototyping purposes. It was eventually integrated into several in-house AD tools that will all be described in Chapter 5 (dco/c++ in Section 5.4, dcc in Section 4.3.2 and CompAD in Section 5.3.3). The other library, called “Adjoinable MPI”, started in early 2013 as a joint effort between the Argonne National Laboratory (US), INRIA (France) and the STCE at RWTH Aachen University (Germany) with the aim of developing a common AMPI library. It offers more control over the internal structure of the library, for example support for zero-copy (see Section 2.3.3) for tools with a contiguous memory layout. It was designed from the ground up to work with ADOL-C, dco/c++, and Tapenade. However, it breaks the MPI function signatures in favor of more granular performance tweaks and settings for the adjoint communication.

In this chapter the development of the adjoint MPI library (AMPI) based on the results of the previous chapter will be discussed. One-sided communication is left out for the time being, since there was no real use case and no hardware with specific support that achieved any measurable runtime benefits. A first implementation based on the results of this work has been made by the author in the Adjoinable MPI library and is available in the repository (see Section A.2).

The adjoint MPI library is a straightforward implementation of the patterns derived in Chapter 3. However, some design choices had to be made, and those implementation specifics will be discussed in this chapter. First, the usability requirements are elaborated in Section 4.1. The library should be portable, generic, efficient and self-contained. The portable data layout design is illustrated in Section 4.2.1. In Section 4.2.2 the nonblocking communication handling is presented. A first attempt at tracing the reduction operations is illustrated in Section 4.2.3. Finally, the implementation of higher order adjoint communication is shown in Section 4.3.

4.1 Usability

Usability is about how the user perceives the developed implementation. Actually, there are two users with regard to the AMPI library. The first one is the AD tool developer and the second one is the end user who applies AD to some code using an AD tool. There are four design requirements that have an impact on either the AD tool developer or the end user.

Portability (tool developer, end user) The MPI standard is the de facto standard for parallel programming. Although its implementations may be highly specialized and hardware specific, there is hardly any cluster with no MPI library available. The same requirement applies to the adjoint MPI library: any system that supports MPI should be able to run code that relies on it. The MPI standard itself is defined as a C or Fortran interface, and C is chosen as the implementation language for our library. At the communication level, AMPI is then linked against the MPI library available on the particular system. Consequently, the minimum requirements for running adjoint MPI code are a valid ANSI C compiler and an MPI library.

[Figure 4.1 diagram. Recoverable labels: generic AMPI library (generic MPI reversal library); AD tool specific wrapper; AD tool based on either source transformation or operator overloading.]

Figure 4.1: Coupling the AMPI library with an arbitrary AD tool. The wrapper has to be implemented jointly by the AD tool developers and AMPI developers.

Generic (tool developer) The adjoint MPI library should not be bound to any particular AD tool. The interface of the library should be very generic, as small as possible and free of any tool specific code. The only tool specific code will reside in the tool specific wrapper (see Figure 4.1). Inside the wrapper, any internal adjoint MPI logic and structure should be opaque to and hidden from the user.
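One way to picture this separation is a small table of callbacks that the tool specific wrapper fills in, so that the generic layer never sees how a tool stores its adjoints. The sketch below is an illustration under that assumption and does not reproduce the actual AMPI interface. All names (tool_hooks, generic_pack_adjoints, generic_unpack_adjoints, toy_tape, toy_hooks) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table filled in by a tool specific wrapper.
 * These names are illustrative and not taken from AMPI. */
typedef struct {
    void  *tool_state;  /* opaque to the generic layer */
    double (*get_adjoint)(void *tool_state, size_t index);
    void   (*add_adjoint)(void *tool_state, size_t index, double value);
} tool_hooks;

/* Generic layer: collect adjoints into a plain buffer that could then be
 * handed to an ordinary MPI call. */
void generic_pack_adjoints(const tool_hooks *hooks, const size_t *indices,
                           size_t count, double *buffer)
{
    assert(hooks != NULL && hooks->get_adjoint != NULL);
    for (size_t k = 0; k < count; ++k) {
        buffer[k] = hooks->get_adjoint(hooks->tool_state, indices[k]);
    }
}

/* Generic layer: increment the tool's adjoints with received values. */
void generic_unpack_adjoints(const tool_hooks *hooks, const size_t *indices,
                             size_t count, const double *buffer)
{
    assert(hooks != NULL && hooks->add_adjoint != NULL);
    for (size_t k = 0; k < count; ++k) {
        hooks->add_adjoint(hooks->tool_state, indices[k], buffer[k]);
    }
}

/* Toy stand-in for an AD tool: adjoints live in a flat array. */
typedef struct {
    double *adjoints;
    size_t  size;
} toy_tape;

static double toy_get(void *state, size_t index)
{
    toy_tape *tape = (toy_tape *)state;
    assert(index < tape->size);
    return tape->adjoints[index];
}

static void toy_add(void *state, size_t index, double value)
{
    toy_tape *tape = (toy_tape *)state;
    assert(index < tape->size);
    tape->adjoints[index] += value;
}

/* Tool specific wrapper: the only place that knows about toy_tape. */
tool_hooks toy_hooks(toy_tape *tape)
{
    tool_hooks hooks = { tape, toy_get, toy_add };
    return hooks;
}
```

In this arrangement, supporting another AD tool would only require a different pair of callbacks, while the two generic routines stay unchanged.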

Efficiency (end user) Chapter 3 derived the structure and constraints that any adjoint MPI code has to fulfill. However, the implementation of these constraints may have a huge effect on the performance of the adjoint code. To date, there is no formal way of deriving the most efficient code with regard to MPI, let alone for adjoint code. In this chapter implementations will be proposed that fulfill the constraints of Chapter 3. As the efficiency of MPI implementations tends to be very hardware specific, it is hard to reach a general consensus on what the most efficient MPI code should look like. Hence, we propose the most generic implementations, which may nonetheless yield bad performance on particular hardware. However, in all the use case validations in Chapter 5 no such degradation of performance could be observed.

Self-contained (tool developer, end user) The adjoint MPI library should be closed within MPI, that is, it should depend on nothing beyond MPI itself. One reason is the aforementioned portability: increasing the number of dependencies would only decrease it. Moreover, the library should still be usable in the years to come. By relying only on the MPI standard, we adhere to one of the oldest parallelization standards in a research field that is moving at a very high pace.

Besides our requirements, MPI itself imposes some restrictions. First, MPI is inherently a runtime library. Most execution properties are determined at runtime, for example a process's rank, which determines the data flow of a parallel run. Differentiating MPI using source transformation is therefore only achievable by going to great lengths in order to parse the source code of an MPI implementation.
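The dependence of the data flow on the rank can be seen in a simple ring exchange. The sketch below is plain C without MPI calls and uses hypothetical names (partners, forward_partners, adjoint_partners). It assumes the standard reversal rule for two-sided communication, namely that the adjoint of a send receives adjoints from the former destination and the adjoint of a receive sends adjoints to the former source.

```c
#include <assert.h>

/* Communication partners of one process in a ring exchange. */
typedef struct {
    int send_to;   /* rank this process sends to */
    int recv_from; /* rank this process receives from */
} partners;

/* Forward run: every rank passes its data to the next rank in the ring.
 * Both partners are only known once rank and size are known at runtime. */
partners forward_partners(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    partners p = { (rank + 1) % size, (rank + size - 1) % size };
    return p;
}

/* Reverse run: the roles of sender and receiver are exchanged, so the
 * adjoints travel around the ring in the opposite direction. */
partners adjoint_partners(int rank, int size)
{
    partners f = forward_partners(rank, size);
    partners a = { f.recv_from, f.send_to };
    return a;
}
```

With four processes, rank 0 sends to rank 1 and receives from rank 3 in the forward run, and does the opposite in the reverse run. Neither pattern can be read off the source text without knowing the values of rank and size.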
