
5.8 AMR Implementation

5.9.1 Single grid timings

For the single grid problem, we set the Dirichlet boundary conditions to one and update the ghost cells representing the boundaries after every iteration. Since we use a single sub-domain per MPI rank, the domain decomposition completely determines the shape of the sub-domain. For example, if a 24³ domain is decomposed with a Dx × Dy × Dz = 1 × 4 × 6 topology on 24 cores (single node), each sub-domain has Px × Py × Pz = 24 × 6 × 4 cells, that is, 24 cells in the X-direction, 6 cells in the Y-direction and 4 cells in the Z-direction. This is achieved by using the layout scheme shown in Listing 5.3, and Figure 5.10a shows the resulting box shapes. By comparison, the mpi_dims_create() topology of 4 × 3 × 2 for 24 cores produces a sub-domain of shape 6 × 8 × 12 on a 24³ domain, as shown in Figure 5.10b. The evolution of the solution for a 3-D domain is shown in Figure 5.11 for iteration counts 0 (initial guess, Figure 5.11a) and 800 (Figure 5.11b) for a uniform mesh having 24³ cells. The numerical solution advances from an initial guess of zero towards the exact solution, i.e. it approaches unity everywhere on the domain (as the Dirichlet boundary conditions are set to one).
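The relation between topology and sub-domain shape described above can be sketched in a few lines of Python. This is only an illustration, not the layout scheme of Listing 5.3: the function name `subdomain_shape` is hypothetical, and the sketch assumes that every domain extent divides evenly by the corresponding topology dimension.

```python
def subdomain_shape(domain, topology):
    """Return the cells per sub-domain (Px, Py, Pz).

    domain   -- number of cells in X, Y, Z, e.g. (24, 24, 24)
    topology -- process grid (Dx, Dy, Dz), one sub-domain per MPI rank
    Assumes each extent divides evenly; raises ValueError otherwise.
    """
    shape = []
    for n_cells, n_procs in zip(domain, topology):
        if n_cells % n_procs != 0:
            raise ValueError(f"{n_cells} cells do not divide over {n_procs} ranks")
        shape.append(n_cells // n_procs)
    return tuple(shape)


domain = (24, 24, 24)
print(subdomain_shape(domain, (1, 4, 6)))  # (24, 6, 4)
print(subdomain_shape(domain, (4, 3, 2)))  # (6, 8, 12)
```

Both printed shapes match the two decompositions quoted in the text for the 24³ domain on 24 cores.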

(a) Iterations=0 (b) Iterations=800

Figure 5.11: 2-D slices of a 3-D domain having 24³ cells at x = 0.5, y = 0.5 and z = 0.5 showing the evolution of the numerical solution for ∇²u = 0 with Dirichlet boundaries set to 1 at iteration counts 0 and 800

Table 5.2 compares the execution times per iteration of the topology returned by the default mpi_dims_create() subroutine of MPI (henceforth referred to as MDC) and the best topology for 24 to 1536 MPI processes. It is also appropriate to compare the best timings with the reverse of mpi_dims_create() (referred to as Rev. MDC), as the code was written in Fortran, where the first dimension is the contiguous dimension. For 24 cores (single node), it can be seen that the Rev. MDC outperforms the MDC for all the domain sizes except 768³. Further, in no case is MDC the best topology. The number of topologies performing better than the MDC or Rev. MDC is significant for most of the domain sizes and core counts. In BoxLib, by default, communication is not overlapped with computation, yet the communication-minimizing topology is outperformed by several topologies. For example, for 96 cores and 3.62 billion degrees of freedom, there are 28 topologies which outperform the MDC topology, the corresponding figure being 23 topologies for 48 cores. The best topology (Dx × Dy × Dz) for 96 cores is 6 × 16 × 1, whose Dy = 16 is much higher than the Dy of the MDC topology (which is 4).
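To make the notion of candidate topologies concrete, the following Python sketch enumerates every ordered factorisation Dx × Dy × Dz of a core count and forms the reverse of a given topology. It is an assumption that the candidate set is exactly the set of ordered factorisations; the helper names `cartesian_topologies` and `reverse_topology` are hypothetical and do not come from BoxLib or MPI.

```python
def cartesian_topologies(n_procs):
    """List all ordered triples (Dx, Dy, Dz) with Dx * Dy * Dz == n_procs."""
    topologies = []
    for dx in range(1, n_procs + 1):
        if n_procs % dx != 0:
            continue
        rest = n_procs // dx
        for dy in range(1, rest + 1):
            if rest % dy == 0:
                topologies.append((dx, dy, rest // dy))
    return topologies


def reverse_topology(topology):
    """Reverse the dimension order, e.g. MDC 4 x 3 x 2 -> Rev. MDC 2 x 3 x 4."""
    return tuple(reversed(topology))


candidates = cartesian_topologies(24)
print(len(candidates))               # 30
print((1, 4, 6) in candidates)       # True
print(reverse_topology((4, 3, 2)))   # (2, 3, 4)
```

Under this assumption, 24 cores admit 30 ordered topologies, of which the MDC and Rev. MDC topologies are only two.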

Let Dbx, Dby, Dbz denote the MPI Cartesian topology process dimensions of the best topologies and Dsx, Dsy, Dsz those of the mpi_dims_create() topology. It can be seen from Table 5.2 that Dbx Dby ≥ Dsx Dsy holds with only two exceptions (Cores=24, Domain=384³ and Cores=48, Domain=384³). Since an X-Y plane of a sub-domain on an N³ domain holds (N/Dx)(N/Dy) cells, this implies that the three planes of the compute kernel to be brought into the cache for updating a single plane of data are smaller for the best topologies than for the communication-minimizing topology (MDC). For all the best performing topologies, Dby ≥ Dbz, a criterion that is in agreement with our discussion on optimal sub-domain dimensions in Chapter 4 and [141, 142]. We also expand on these relations in Chapter 6. Further, for non-cubic sub-domains Dsx Dsy > Drx Dry, where Drx and Dry denote the Cartesian topology dimensions of the reverse of MDC (or Rev. MDC). For example, if MDC = 4 × 3 × 2 then Rev. MDC = 2 × 3 × 4 and thus Dsx Dsy = 4 × 3 > Drx Dry = 2 × 3.

[Figure 5.12: two bar charts, (a) default mpi_dims_create() and (b) Rev. mpi_dims_create(). Horizontal axis: Domain Size (48³, 96³, 192³, 384³, 768³, 1536³, 3072³). Vertical axis: Frequency (0 to 40). Series: 24, 48, 96, 192, 384, 768 and 1536 cores.]

Figure 5.12: Number of topologies outperforming the default mpi_dims_create() and Rev. mpi_dims_create() topology at various domain sizes and number of cores
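The plane-size argument can be illustrated numerically. The sketch below is a rough estimate under stated assumptions: ghost cells are ignored, each cell is taken to hold one 8-byte double-precision value, the MDC topology for 96 cores is assumed to be 6 × 4 × 4 (the text only states that its Dy is 4), and the best topology 6 × 16 × 1 is assumed to refer to the 1536³ domain. The function names are hypothetical.

```python
def plane_cells(n, topology):
    """Cells in one X-Y plane of a sub-domain of an n^3 domain (no ghost cells)."""
    dx, dy, _ = topology
    return (n // dx) * (n // dy)


def three_plane_bytes(n, topology, bytes_per_cell=8):
    """Approximate size of the three planes read to update a single plane."""
    return 3 * plane_cells(n, topology) * bytes_per_cell


n = 1536
best = (6, 16, 1)   # best topology quoted for 96 cores
mdc = (6, 4, 4)     # assumed MDC topology for 96 cores

print(plane_cells(n, best), three_plane_bytes(n, best))  # 24576 589824
print(plane_cells(n, mdc), three_plane_bytes(n, mdc))    # 98304 2359296

# Product comparison for the MDC / Rev. MDC example in the text
print(4 * 3 > 2 * 3)  # True
```

Under these assumptions the three planes of the best topology occupy roughly 0.6 MB against roughly 2.4 MB for the MDC topology, a factor of four that equals the ratio of the Dx Dy products (96 against 24).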

At all core counts and domain sizes, we were able to find topologies which performed better than both the MDC and the Rev. MDC topology. Figures 5.12a and 5.12b show the number of topologies which outperformed the MDC and Rev. MDC topology at various domain sizes and core counts. Interestingly, even at a domain size of 3072³, or approximately 29 billion degrees of freedom, there existed 21 topologies which outperformed the MDC and 43 topologies which performed better than the Rev. MDC. The percentage gains of the best topologies over the MDC and Rev. MDC are shown in Figures 5.13a, 5.13b, 5.13c and 5.13d for 24, 48, 96 and 192 cores, respectively. At these core counts, the percentage gain of the best topology ranged from approximately 1% to 70% over MDC and from approximately 1% to 66% over Rev. MDC. The percentage gain of the best topology over the MDC for 384 cores was 19.8% at a domain size of 768³ and 9.67% at a domain size of 1536³. For 768 cores the gain was 11.30%, and for 1536 cores it was 11.11%. This showed that the gains need not decrease with an increasing domain size or core count.
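The text does not spell out how the percentage gain is computed. A common definition, assumed here, is the relative reduction in time per iteration with respect to the reference topology. The function name `percentage_gain` and the timings in the example are hypothetical and are not taken from Table 5.2.

```python
def percentage_gain(t_reference, t_best):
    """Relative reduction in time per iteration, in percent.

    t_reference -- time per iteration of the reference topology (MDC or Rev. MDC)
    t_best      -- time per iteration of the best topology
    Assumed definition: 100 * (t_reference - t_best) / t_reference.
    """
    if t_reference <= 0:
        raise ValueError("reference time must be positive")
    return 100.0 * (t_reference - t_best) / t_reference


# Hypothetical timings in seconds, for illustration only
print(percentage_gain(1.25, 1.00))  # 20.0
```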

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 165-167)