Hybrid Parallelization - Scalable space-time adaptive simulation tools for computational electr

structure of the mesh. The octree in PROPAG-5 is built in integer-coordinate space and does not

require floating-point operations.

The individual steps of the bootstrap and mesh distribution algorithm in PROPAG-5 are collected in Algorithm 3.3.

1: Read voxel-types using MPI_File_read_all and extract mesh entities

2: Compute connectivity between mesh cells and mesh nodes and compute part using ParMETIS_V3_PartMeshKway

3: Extend part to an array defined on all voxels in the local (to the process) sub-tile and exchange boundary values to fill in the halo

4: Compute number of peers and set up a mapping from cells to peers and from nodes to the list of peers

5: Identify nodes on inter-process boundaries and store their index after redistribution and the ranks of the processes storing a copy

6: Exchange data

7: Assign (consistently between processes) owners and mark inter-process connections as “in” and “out”

8: Build communication traces for in-going and out-going communications (Section 3.3)

9: Reorder the nodes in the communication trace according to coordinates to ensure consistency

10: Compute connectivity information using an octree

11: Reorder mesh entities locally (according to coordinates)

Algorithm 3.3. Bootstrap and mesh distribution algorithm.

Using this approach we have been able to bootstrap a mesh with 1.56 billion nodes (X = 2176, Y = 1920, Z = 3024) in less than 79 seconds on 4224 cores of the Cray XT5 at CSCS (see Section 3.4). Roughly 19 seconds where required for Step 1 of the algorithm (corresponding to a read performance exceeding 0.6 GiB/s). The partition was read from a file (as PARMETIS was

unable to partition such a large mesh, we computed part in a pre-processing step by interpolation from a coarser mesh) in about 47 seconds. The remaining portions of Algorithm 3.3 took about 12 seconds.

3.3 Hybrid Parallelization

The currently largest shared-memory machines are limited to a few thousand cores per machine while the largest distributed-memory architectures scale to millions of cores. To efficiently utilize these resources, we ported PROPAG-4 to an MPI code that can run on distributed-memory architectures. Such systems usually consist of a large number of multi-socket compute nodes connected by a high-speed interconnect. In recent years, the number of cores per socket has increased signifi-

30 3.3 Hybrid Parallelization

cantly. Within a compute node, memory is shared between cores, usually with NUMA architecture. Therefore, we retained the existing OpenMP parallelization, which is efficient for intra-node parallelization, and added an MPI layer for inter-node parallelism. Such a hybrid parallelization approach has been used for a variety of codes and has proven beneficial for several reasons:

1. It simplifies adding new levels of concurrency beyond what is easily accomplished with MPI and hence can be used to overcome algorithmic scaling limitations (e.g., GTC61).

2. It allows to mitigate efficiency loss in applications that are limited by the scaling of all-to-all communication (e.g., PARATEC119and CPMD88) or where communication time is a signif- icant part of the runtime.

3. Since the shared memory often renders halo (or overlap) zones unnecessary, hybrid codes can use less memory. If additional work must be performed on the halo, scalability can be enhanced by increasing the number of threads per process (e.g., FISH94).

4. It simplifies the load balancing of applications with dynamic or complicated structure since intra-process load balancing is possible using dynamic or guided loop scheduling (e.g., NPB BT-MZ Benchmark130).

It is worth noting, though, that hybrid parallelization is not always beneficial. Mahinthakumar and Saied111 report no improvement in a hybrid implicit finite element solver. In general, there are many factors contributing to the performance of hybrid execution and results can vary between simulation setups105.

3.3.1 MPI Parallelization

For the MPI parallelization of the code, we exploited techniques that have proven to be very efficient for the parallelization of general (unstructured) finite element applications. Hence, we use a cell-wise distribution of the geometry. The decomposition is computed through an interface to existing graph-partitioning libraries (e.g., PARMETIS95) as described in Section 3.2.2. Differently than

previous versions of PROPAG, all arrays range only over cells and nodes and connectivity information is stored explicitly. Hence, the stencil-based computation of Idifis replaced by a sparse-matrix

vector multiplication. We use an ELLPACK-ITPACK format136 _{that is suitable for vectorization}

by the compiler. In Figure 3.2 the impact of this change on the time required for computing Idifis

shown. The additional indirect addressing and the corresponding increase in memory bandwidth usage reduces performance which, however, is compensated for by better scalability of the MPI layer.

Since the mesh in PROPAG-5 is distributed cell-wise, nodes are duplicated on multiple processes. One of these processes is distinguished as the owner of the node. For inter-process communication, we use the notion of communication traces introduced by Sahni et al.137. In PROPAG-5 a communication trace consists of a set of nodes (located on an inter-process boundary) and the rank of a peer process. On the peer, a matching communication trace is built with a consistent ordering of the interface entities. Hence, by means of a communication trace, inter-process communication is

31 3.3 Hybrid Parallelization

possible without the need for a global numbering of mesh entities. All communication is based on two primitives: The function sumup_at_owner gathers data on the owner and copy_to_others overwrites the data at each copy by the data at the owner (scatter). These communication steps are implemented on top of non-blocking MPI send/receive calls and an extended interface (start, test, wait) is provided to overlap these operations with computations.

Using these communication primitives, we can rewrite Algorithm 3.1 as shown in Algorithm 3.4. The algorithm is written in such a way that it allows for overlapping communication of the diffusion currents with the computation of Iapp (to hide the communication in sumup_at_owner)

and with the evaluation of Iion for the interior nodes (to hide copy_to_others), assuming the

necessary hardware capabilities. In our tests, we have not seen improvements in scalability or runtime due to overlap. Nevertheless, by construction, all receive calls are pre-posted timely before the wait call. This is important for good MPI performance on many systems including the targeted Cray XT5.

1: whilet < T do

2: fori = 1,...,Llapdo

3: Evaluate I_difi = χ−1∇∇_{∇ · G}_mono∇∇∇Vilocally

4: Call sumup_at_owner_start(I_difi )

5: Evaluate I_appi

6: Call sumup_at_owner_wait(I_difi )

7: Call copy_to_others_start(I_difi )

8: Compute I_ioni and integrate state variables for all owned nodes ⊲ In ion_step

9: Call copy_to_others_wait(I_difi ) ⊲ In ion_step

10: Compute I_ioni and integrate state variables for all other nodes ⊲ In ion_step

11: Update Vi+1= Vi+ τIi

dif− Iioni + Iappi

12: end for

13: Gather statistics using collective communication

14: Write the (downsampled) solution to disk

15: t ← t + τ · Llap

16: end while

Algorithm 3.4.Parallel monodomain solver in PROPAG-5.

3.3.2 MPI Threading Support

The intra-process parallelization via OpenMP was retained and extended to new code segments. As in PROPAG-4, we mostly use parallel for worksharing constructs. This approach (in compari- son to the use of large parallel sections) incurs some overhead but simplifies the implementation. Experiments with PROPAG-4 (Figure 3.1) show that OpenMP overhead does not significantly affect

In document Scalable space-time adaptive simulation tools for computational electrocardiology (Page 51-54)