puts written to data files that were subsequently combined to compute the SGCT solution. Complementary work to ours on load balancing of GENE sub-grid instances and an alternative hierarchization-based implementation of the SGCT have been reported in [Heene et al., 2013] and [Hupp et al., 2013], respectively. None of the aforementioned efforts has investigated the fault-tolerance possibilities of the SGCT for this application, nor have they implemented any alternative fault-tolerance techniques.
The effectiveness of ABFT, obtained by applying the FT-SGCT to GENE, was analyzed in [Parra Hinojosa et al., 2015; Pflüger et al., 2014]. An analysis of solution accuracy when several sub-grids are lost, and of the overhead of computing redundant smaller sub-grids, was presented there. The load balancing implemented there was based on a load model developed for a master-slave parallelism model. In this thesis, we contribute tolerance of real process and node failures with ULFM MPI, which was absent there. Moreover, we provide a load balancing scheme on a global Single Program Multiple Data (SPMD) parallelism model and show how several SGCTs could be applied to the non-SGCT dimensions concurrently⁴.
2.6 Summary
This chapter introduces key background material. We discuss the importance of fault tolerance, the reasons behind the failure of supercomputer nodes, and some failure recovery techniques, including fault tolerance techniques implemented on top of the MPI library. We also discuss the SGCT and a fault-tolerant version of the SGCT, which provide the necessary background for the key contributing chapters of the thesis. We further discuss some research work closely related to the contributions of this thesis. Before moving to the primary contributions of the thesis, we next give an overview of our implementation and experimental platform.
⁴ The concept of non-SGCT dimensions arises when the number of dimensions of the SGCT is smaller than that of the application solution field. The non-SGCT dimensions can be any of the lower dimensions, which usually form blocks that fit into the SGCT dimensions. For example, if we have two SGCT dimensions (say, x and y) and three field dimensions (say, x, y, and z, where z is the lower dimension), then z is the non-SGCT dimension. In this scenario, each element in the SGCT dimensions is a block of elements whose size equals the size of dimension z. For details, see Section 5.2.2.
Chapter 3
Implementation Overview and Experimental Platform
This chapter presents an overview of our implementation, hardware and software platform, fault injection technique, and measurement methodologies that we use throughout the evaluations presented in this thesis.
3.1 Parallel SGCT Algorithm Implementation
An implementation of the direct SGCT algorithm is used in this thesis. The key idea of this algorithm is to perform a scaled addition of part of each sub-grid’s solution $u_i$ in $P_i$ to $P^c$ to get the combined (or sparse) grid solution $u^c_I$.
With the direct SGCT algorithm, each PDE instance whose solution is $u_i$ is run on a distinct set of processes denoted by $P_i$, arranged in a logical $d$-dimensional grid. The algorithm consists first of a gather stage, where each process in $P_i$ sends its portion of $u_i$ to each of the corresponding (in terms of physical space) processes in a logical $d$-dimensional grid $P^c$, to be scaled and added into $u^c_I$. This is illustrated in Figure 3.1. The portion of $u_i$ is selected based on the local-to-global mapping of processes in each of the $d$ dimensions from $P_i$ to $P^c$. Suppose a 2D process grid $P_i$ is represented by $\{P^x_i, P^y_i\}$, and $P^c$ by $\{P^c_x, P^c_y\}$. If $P^x_i = P^c_x$ and $P^y_i = P^c_y$, then the whole solution $u_i$ is scaled and added into $u^c_I$ (initially $u^c_I$ is empty) with an exact mapping of processes. Otherwise, the solution $u_i$ is split into $P^c_x / P^x_i$ and $P^c_y / P^y_i$ parts in the $x$ and $y$ dimensions, respectively, and each chunk of $u_i$ is then scaled and added into $u^c_I$. In this case, each process in $P^x_i$ and $P^y_i$ is mapped to $P^c_x / P^x_i$ and $P^c_y / P^y_i$ processes of $P^c_x$ and $P^c_y$, respectively.
Finally, each process in $P^c$ gathers the $|I|$ versions of each point of the full grid (using interpolation where necessary), and performs the summation according to formula (2.1) to get the combined (or sparse) grid solution $u^c_I$, which can be used as an approximation to the full grid solution. The use of interpolation in turn requires that a ‘halo’ of neighbouring points (in the positive direction, for our implementation) has been filled by a halo exchange operation by each process in each $P_i$, and that this halo is also sent in the gather stage. For reasons of efficient resource utilization, $P^c$ is made up of a (normally near-maximal) subset of all processes in $\bigcup_{i \in I} P_i$.
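The process mapping used in the gather stage can be sketched as follows. This is a minimal Python illustration, not the thesis implementation: the function name `gather_targets` and its arguments are hypothetical, and it assumes a block distribution in which each extent of $P^c$ is an integer multiple of the corresponding extent of $P_i$.

```python
from itertools import product

def gather_targets(src_coord, src_dims, dst_dims):
    """Coordinates in the combined process grid P^c that receive a part
    of the data held by the process at src_coord in the process grid P_i.

    src_dims and dst_dims are the per-dimension extents of P_i and P^c.
    Assumption: each extent of P^c is a multiple of the extent of P_i.
    """
    ranges = []
    for c, ps, pc in zip(src_coord, src_dims, dst_dims):
        if pc % ps != 0:
            raise ValueError("P^c extent must be a multiple of the P_i extent")
        parts = pc // ps  # number of parts the local data is split into
        ranges.append(range(c * parts, (c + 1) * parts))
    # Cartesian product over dimensions gives all receiving processes.
    return list(product(*ranges))

# A 2x1 sub-grid process grid sending to a 2x2 combined process grid:
print(gather_targets((0, 0), (2, 1), (2, 2)))
print(gather_targets((1, 0), (2, 1), (2, 2)))
```

When the two process grids have identical extents, every process maps to exactly one process with the same coordinates, which corresponds to the exact-mapping case described above.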
Figure 3.1: Figure 4 from [Larson et al., 2013b]. Message paths for the gather stage for the 2D direct combination method (not truncated) on a level $l = 5$ combined grid. The combined grid and component grid (3,3) have 2×2 process grids; all others have 2×1 or 1×2 process grids.
Similarly, in the scatter stage, a reverse mapping of processes from $P^c$ to $P_i$ is done to scatter a down-sampled version of $u^c_I$ to $P_i$, iteratively, for each $i \in I$.
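As a minimal illustration of the down-sampling step, the Python sketch below extracts from a 1D combined solution the points that belong to a coarser sub-grid. It assumes, which the text above does not state, that a grid of level $n$ has $2^n + 1$ equally spaced points per dimension (boundaries included), so that the sub-grid points are nested in the combined grid. The function name and arguments are hypothetical.

```python
def downsample_1d(full_values, full_level, sub_level):
    """Pick the values of a level-`full_level` grid that coincide with
    the points of a coarser level-`sub_level` grid.

    Assumption: a level-n grid has 2**n + 1 equally spaced points.
    """
    if sub_level > full_level:
        raise ValueError("sub-grid must not be finer than the combined grid")
    if len(full_values) != 2 ** full_level + 1:
        raise ValueError("unexpected number of points for the given level")
    stride = 2 ** (full_level - sub_level)  # spacing ratio between grids
    return full_values[::stride]

# Level-3 combined grid (9 points) down-sampled to a level-1 sub-grid.
combined = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(downsample_1d(combined, 3, 1))
```

In several dimensions the same stride would be applied independently along each SGCT dimension, with the sub-grid level taken per dimension.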
Further details on the algorithm using a full grid representation of the combined grid $G^c_I$ are available in [Strazdins et al., 2015]. An improved version of this algorithm, where a partial sparse grid, rather than a full grid, representation of $G^c_I$ is used to perform an efficient interpolation on $G^c_I$, is available in [Strazdins et al., 2016b]. We used this improved version of the algorithm in this thesis, except where otherwise indicated.
In terms of load balancing, we used a simple strategy to balance the load among the processes. The same number ($p_0 \in \mathbb{N}$) of processes is allocated to each of the distinct sets of processes $P_i$ for each sub-grid on the uppermost diagonal (i.e., for the 2D case, each $G_{i,j}$ with $i + j = l - 1$) in the grid index space. Each sub-grid on the next lower diagonals (i.e., for the 2D case, each $G_{i,j}$ with $i + j = l - 2$, $i + j = l - 3$, and $i + j = l - 4$) is allocated $\lceil p_0/2 \rceil$, $\lceil p_0/4 \rceil$, and $\lceil p_0/8 \rceil$ processes, respectively. Details are discussed and analyzed in Section 6.3 of Chapter 6.
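The allocation rule above amounts to halving the process count, rounded up, for each successive lower diagonal. A minimal Python sketch follows; the function name and the `num_lower_diagonals` parameter are hypothetical and introduced only for illustration.

```python
def processes_per_diagonal(p0, num_lower_diagonals=3):
    """Number of processes given to each sub-grid on the uppermost
    diagonal (index 0) and on each of the next lower diagonals.

    Diagonal k receives ceil(p0 / 2**k) processes.
    """
    if p0 < 1:
        raise ValueError("p0 must be a positive integer")
    # -(-a // b) is integer ceiling division, avoiding float rounding.
    return [-(-p0 // 2 ** k) for k in range(num_lower_diagonals + 1)]

print(processes_per_diagonal(8))
print(processes_per_diagonal(5))
```

With `p0 = 8` the four diagonals receive 8, 4, 2, and 1 processes per sub-grid; with `p0 = 5` the ceiling gives 5, 3, 2, and 1.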
A failure of computing nodes or application processes causes the loss of some processes on some grids $G_i$. This is handled as follows. Before the SGCT algorithm is applied, the loss of any processes in $P_i$ is detected using ULFM MPI (see Chapter 4 for details). Replacement processes are then created (with the same process grid size as $P_i$) on the same node when that node is still available (i.e., the failure is not