2.6 MPI Extended SAC
3.1.1 Efficiency
Our runtime analysis takes a generic approach to understanding the effects of applying AD. In particular, it highlights expectations about runtime behaviour that may turn out to be wrong. Let $S(p)$ denote the parallel speedup of the original passive implementation, where $p$ is the number of processes, $T_1$ is the runtime of the sequential code and $T_p$ is the runtime with $p$ processes:
\[ S(p) = \frac{T_1}{T_p}. \]
According to Amdahl’s law, the maximum speedup is given by
\[ S(p) = \frac{1}{(1 - \alpha) + \frac{\alpha}{p}}, \]
where $\alpha$ is the proportion of code run in parallel. In essence, the law states that the number of processes $p$ only affects the parallel code, whereas the runtime of the sequential part is unaffected. In this special case $T_1 = 1$ and $T_p = (1 - \alpha) + \frac{\alpha}{p}$. The general case of $T_1$ and $T_p$ is
\[ T_1 = t_1 \quad \text{and} \quad T_p = (1 - \alpha)\, t_1 + \frac{\alpha \cdot t_1}{p}, \]
with $t_1$ being the runtime of the sequential code. Without loss of generality, we assume that in a message passing context the only non-parallel part of the code consists of the MPI communication operations. We also assume an embarrassingly parallel code with a linear speedup $p$, so that the parallel computation time is $t_{comp}/p$. A communicated message, even if split, is not assumed to be communicated faster with increasing $p$. Hence, the communication time increases with the number of processes $p$. The entire computation time $t_{comp}$ is equal to the runtime with one process $T_1$, where no communication takes place. Thus, we have
\[ T_1 = t_{comp} \quad \text{and} \quad T_p = t_{comm}(p) + \frac{t_{comp}}{p}. \]
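Before turning to the communication-based speedup, the general form of Amdahl's law given above can be checked numerically. The following is a minimal sketch; the function names and the value `alpha = 0.9` are illustrative assumptions, not taken from the benchmarks. It shows that the speedup $T_1 / T_p$ does not depend on $t_1$ and stays below $1 / (1 - \alpha)$.

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Speedup S(p) = 1 / ((1 - alpha) + alpha / p) for parallel fraction alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / ((1.0 - alpha) + alpha / p)


def amdahl_runtime(alpha: float, t1: float, p: int) -> float:
    """General runtime T_p = (1 - alpha) * t1 + alpha * t1 / p."""
    return (1.0 - alpha) * t1 + alpha * t1 / p


if __name__ == "__main__":
    alpha = 0.9  # hypothetical parallel fraction, chosen for illustration only
    t1 = 10.0    # hypothetical sequential runtime
    for p in (1, 2, 8, 64, 1024):
        # T_1 / T_p reproduces the normalized formula regardless of t1.
        ratio = t1 / amdahl_runtime(alpha, t1, p)
        print(p, round(ratio, 3), round(amdahl_speedup(alpha, p), 3))
```

For `alpha = 0.9` the speedup can never exceed 10, no matter how many processes are used.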
The delay function $t_{comm}(p)$ is unknown. However, it is assumed that $t_{comm}$ is increasing and $t_{comm}(1) = 0$. Our speedup amounts to the ratio
\[ S(p) = \frac{t_{comp}}{t_{comm}(p) + \frac{t_{comp}}{p}}. \]
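To get a feeling for this speedup, one can plug in a concrete delay function. Since $t_{comm}(p)$ is unknown, the sketch below assumes a purely hypothetical choice, $t_{comm}(p) = c \cdot \log_2(p)$, which is increasing and satisfies $t_{comm}(1) = 0$. The constant `c` and the value of `t_comp` are illustrative as well.

```python
import math


def t_comm(p: int, c: float = 1.0) -> float:
    """Hypothetical delay function: increasing in p, with t_comm(1) = 0."""
    return c * math.log2(p)


def speedup(p: int, t_comp: float, c: float = 1.0) -> float:
    """S(p) = t_comp / (t_comm(p) + t_comp / p)."""
    return t_comp / (t_comm(p, c) + t_comp / p)


if __name__ == "__main__":
    t_comp = 1000.0  # hypothetical sequential computation time
    for p in (1, 4, 16, 64, 256, 1024, 4096):
        print(f"p={p:5d}  S(p)={speedup(p, t_comp):8.2f}")
```

Under this assumption the speedup first grows with $p$ and then declines once the communication term dominates the shrinking computation term.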
Assuming a “store everything” approach and neglecting any recomputation scheme through checkpointing [17], we conclude a constant slowdown of $\delta_a$ for the computation and a delay of $\delta_c$ for the communication in the adjoint code. This gives us the following speedup of the adjoint code:
\[ S_{(1)}(p) = \frac{\delta_a \cdot t_{comp}}{\delta_c \cdot t_{comm}(p) + \frac{\delta_a \cdot t_{comp}}{p}}. \]
The significant factor is the efficiency, which describes how the speedup evolves with an increasing number of processes $p$. Neglecting superlinear speedup, an efficiency of 1 would be a perfect speedup for an arbitrary number of processes. The original efficiency $E(p) = \frac{S(p)}{p}$ is now compared with the efficiency of the adjoint code $E_{(1)}(p) = \frac{S_{(1)}(p)}{p}$:
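\[ \frac{E_{(1)}(p)}{E(p)} = \frac{\delta_a \cdot t_{comp} \cdot (t_{comm}(p) \cdot p + t_{comp})}{(p \cdot \delta_c \cdot t_{comm}(p) + \delta_a \cdot t_{comp}) \cdot t_{comp}} \]

Finally, we look at what happens in an exascale environment where $p \to \infty$: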
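\[ \lim_{p \to \infty} \frac{E_{(1)}(p)}{E(p)} = \lim_{p \to \infty} \frac{\delta_a \cdot t_{comp} \cdot t_{comm}(p)}{\delta_c \cdot t_{comm}(p) \cdot t_{comp}} = \frac{\delta_a}{\delta_c} \]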
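The limit can be illustrated numerically. The sketch below again assumes the hypothetical delay function $t_{comm}(p) = \log_2(p)$ and illustrative slowdown factors `delta_a = 3` and `delta_c = 2`; none of these values are measurements. It evaluates the efficiency ratio for growing $p$ and shows it approaching $\delta_a / \delta_c$.

```python
import math


def t_comm(p: int) -> float:
    """Hypothetical delay function with t_comm(1) = 0 (an assumption)."""
    return math.log2(p)


def efficiency(p: int, t_comp: float, delta_a: float = 1.0, delta_c: float = 1.0) -> float:
    """E(p) = S(p) / p, with S(p) = da * t_comp / (dc * t_comm(p) + da * t_comp / p).

    delta_a = delta_c = 1 gives the efficiency of the original code.
    """
    s = delta_a * t_comp / (delta_c * t_comm(p) + delta_a * t_comp / p)
    return s / p


def efficiency_ratio(p: int, t_comp: float, delta_a: float, delta_c: float) -> float:
    """Ratio of adjoint efficiency to original efficiency."""
    return efficiency(p, t_comp, delta_a, delta_c) / efficiency(p, t_comp)


if __name__ == "__main__":
    t_comp, delta_a, delta_c = 1000.0, 3.0, 2.0  # illustrative values only
    for exponent in (0, 4, 8, 16, 32):
        p = 2 ** exponent
        print(f"p=2^{exponent:<2d}  ratio={efficiency_ratio(p, t_comp, delta_a, delta_c):.6f}")
    print("delta_a / delta_c =", delta_a / delta_c)
```

For $p = 1$ the ratio is exactly 1, since no communication takes place; for large $p$ it is governed by the two slowdown factors alone.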
Pattern                   Original        Adjoint forward       Adjoint reverse   ≈ δc
Point-to-point            O(n)            O(n)                  O(2n)             3
Broadcast                 O(n log(p))     O(n log(p))           O(2n log(p))      3
Reduction (sum)           O(2n log(p))    O(2n log(p))          O(n log(p))       3/2
Reduction (prod) v1       O(2n log(p))    O(2n log(p) + np)     O(n log(p))       NC
Reduction (prod) v2       O(2n log(p))    O(3n log(p))          O(n log(p))       2
Allreduction (sum)        O(3n log(p))    O(3n log(p))          O(3n log(p))      2
Allreduction (prod) v1    O(3n log(p))    O(3n log(p) + np)     O(3n log(p))      NC
Allreduction (prod) v2    O(3n log(p))    O(3n log(p))          O(3n log(p))      2
Get                       O(n)            O(n)                  O(2n)             3
Put                       O(n)            O(n)                  O(2n)             3
Accumulate (sum)          O(2n)           O(2n)                 O(2n)             2
Accumulate (prod) v1      O(2n)           NC                    NC                NC
Accumulate (prod) v2      O(2n)           O(4n)                 O(5n)             4.5

Table 3.1: Summary of the adjoint pattern complexities and the estimated slowdown factor $\delta_c$. Constants in big-O notation hint at the constant ratio between original and adjoint pattern. Patterns with a non-constant slowdown ratio are marked as NC (non-constant).

For the adjoint computation, the important factor is the average ratio of operations in the adjoint code with respect to the original code. Consider a multiplication in SAC notation where each variable is used only once (no increment and no nullification in the adjoint code, see Section 2.4): it requires 1 operation for the value ($z = x \cdot y$) and 2 for the adjoints ($x_{(1)} = y \cdot z_{(1)}$ and $y_{(1)} = x \cdot z_{(1)}$). In that case the slowdown of the adjoint computation is at best $\delta_a = 3$ (one value operation versus one value and two adjoint operations). If this is the general slowdown of the adjoint code, then a communication slowdown smaller than $\delta_c = 3$ yields a ratio $\delta_a / \delta_c > 1$, that is, an increase in scalability for our adjoint code. In particular, an AD tool with a rather high value of $\delta_a$ may lead to apparently good scalability, which may be mistakenly attributed to the adjoint MPI implementation. This is a very rough estimate; specialized communication patterns may indeed yield a more complex adjoint pattern, as described in the coming sections.
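The operation count for the multiplication $z = x \cdot y$ discussed above can be made explicit in a few lines. This is a hand-written sketch, not the output of an AD tool; the function name and the explicit counters are assumptions introduced for illustration.

```python
def adjoint_multiply(x: float, y: float, z_adj: float):
    """Value and adjoints of z = x * y, each variable used only once.

    Returns (z, x_adj, y_adj, delta_a), where delta_a is the ratio of all
    operations in the adjoint code to the operations in the original code.
    """
    ops_value = 0
    ops_adjoint = 0

    # Forward section: one multiplication for the value.
    z = x * y
    ops_value += 1

    # Reverse section: two multiplications for the adjoints
    # (no increment and no nullification, as each variable is used once).
    x_adj = y * z_adj
    ops_adjoint += 1
    y_adj = x * z_adj
    ops_adjoint += 1

    delta_a = (ops_value + ops_adjoint) / ops_value
    return z, x_adj, y_adj, delta_a


if __name__ == "__main__":
    print(adjoint_multiply(3.0, 4.0, 1.0))
```

Seeding the adjoint of $z$ with 1 returns the partial derivatives $y$ and $x$, at a cost of three multiplications instead of one.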
An estimate of the communication slowdown factor $\delta_c$ is provided for each of the communication patterns in Section 3.3 (point-to-point), Section 3.4 (point-to-point nonblocking), Section 3.6 (collective) and Section 3.8 (one-sided). Note that these pattern implementations do not rely on an implementation of an adjoint MPI library. The benchmarks were conducted by implementing the patterns directly with MPI. No data handling besides the communication itself is measured. The tests were conducted on the RWTH Compute Cluster and the institute workstation Heisenberg (see Section 5.1). The code is available on the CD in the folder patterns of the adjoint MPI repository (see Appendix A). A summary of the pattern complexities is provided in Table 3.1. For the collective communications it is assumed that the network has a binary tree topology, leading for example to a communication complexity of $O(n \log(p))$ for the reduction, with $n$ being the message length and $p$ the number of processes. Point-to-point communication is assumed to have a linear complexity of $O(n)$. The ratio of the adjoint pattern complexity to the original runtime complexity defines the expected slowdown factor $\delta_c$.
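As a plausibility check, the estimated factors in Table 3.1 can be recomputed from the leading constants of the complexities. The sketch below assumes that the adjoint cost is the sum of the forward and the reverse section, so that $\delta_c \approx (\text{forward} + \text{reverse}) / \text{original}$. This reading is an assumption; it does, however, reproduce the tabulated values for all patterns with a constant ratio. The dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Leading constants (original, adjoint forward, adjoint reverse) from Table 3.1.
# Patterns marked NC (non-constant) are omitted.
PATTERNS = {
    "Point-to-point": (1, 1, 2),
    "Broadcast": (1, 1, 2),
    "Reduction (sum)": (2, 2, 1),
    "Reduction (prod) v2": (2, 3, 1),
    "Allreduction (sum)": (3, 3, 3),
    "Allreduction (prod) v2": (3, 3, 3),
    "Get": (1, 1, 2),
    "Put": (1, 1, 2),
    "Accumulate (sum)": (2, 2, 2),
    "Accumulate (prod) v2": (2, 4, 5),
}


def delta_c(original: int, forward: int, reverse: int) -> Fraction:
    """Estimated slowdown, assuming adjoint cost = forward + reverse section."""
    return Fraction(forward + reverse, original)


if __name__ == "__main__":
    for name, constants in PATTERNS.items():
        print(f"{name:24s} delta_c ~ {float(delta_c(*constants)):.1f}")
```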