• No results found

puts written to data files that were subsequently combined to compute the SGCT solution. Complimentary work to ours on load balancing of GENE sub-grid in- stances and an alternative hierarchization based implementation of the SGCT has been reported in [Heene et al., 2013] and [Hupp et al., 2013], respectively. None of these aforementioned efforts has investigated the fault-tolerant possibilities of the SGCT for this application, nor have they implemented any alternative fault-tolerant techniques.

The effectiveness of ABFT by applying the FT-SGCT to GENE was analyzed in [Parra Hinojosa et al., 2015; Pflüger et al., 2014]. An analysis of solution accura- cies in the event of several sub-grids lost, and the overhead of computing redundant smaller sub-grids were presented there. The load balancing implemented there was from the developed load model from amaster-slaveparallelism model. In this thesis, we contribute the tolerance of real process and node failures with ULFM MPI, which was absent there. Moreover, we provide a load balancing scheme on a globalSingle Program Multiple Data(SPMD) parallelism model and show how several SGCTs could be applied to the non-SGCT dimensions concurrently4.

2.6

Summary

This chapter introduces key background material. We discuss the importance of fault tolerance, the reasons behind the failure of supercomputer nodes, an overview of some failure recovery techniques, including fault tolerance techniques implemented on the top of the MPI library. We discuss the SGCT and a fault-tolerant version of the SGCT, which provide necessary background for the key contributing thesis chapters. We further discuss some research work closely related to the contributions of this thesis. Before we move to the primary contributions of the thesis, we next give an overview of our implementation, and experimental platform.

4The concept of non-SGCT dimensions arises from the scenario where the total number of dimen-

sions of the SGCT is smaller than that of the application solution field. Non-SGCT dimensions could be any of the dimensions among the lower dimensions which are usually forming blocks to fit into the SGCT dimensions. As for example, if we have two SGCT dimensions (say, xandy), and three field dimensions (say, x,y, andz; wherezis the lower dimension), thenzis the non-SGCT dimension. In this scenario, each element in the SGCT dimensions will be a block of elements with block size equals to the size of dimensionz. For details, see Section 5.2.2.

Chapter 3

Implementation Overview and

Experimental Platform

This chapter presents an overview of our implementation, hardware and software platform, fault injection technique, and measurement methodologies that we use throughout the evaluations presented in this thesis.

3.1

Parallel SGCT Algorithm Implementation

An implementation of thedirectSGCT algorithm is used in this thesis. The key idea of this algorithm is to perform a scaled addition of part of each sub-grid’s solution ui inPi to Pcto get the combined (or sparse) grid solution ucI.

With the direct SGCT algorithm, each PDE instance whose solution isuiis run on

a distinct set of processes denoted by Pi and is arranged in a logicald-dimensional

grid. The algorithm consists of first a gather stage, where each process in Pi sends

its portion of ui to each of the corresponding (in terms of physical space) processes

in a logical d-dimensional grid Pc to be scaled added into uc

I. This is illustrated

in Figure 3.1. The portion of ui is selected based on the local to global mapping

of processes in each d dimension from Pi to Pc. Suppose, a 2D process grid Pi is

represented by {Pix,Piy}, and Pc by {Pxc,Pyc}. If Pix = Pxc and Piy = Pyc, then the whole solution ui is scaled added intoucI (initially ucI is empty) with an exact mapping of

processes. Otherwise, solutionui is split into Pxc Px

i and Pc

y

Piy parts inxandydimensions,

respectively, and then scaled added each chunk of ui into ucI. In this case, each

process inPix and Piy is mapped into Pxc Px i and Pc y Piy processes of P c x and Pyc, respectively.

Finally, each process in Pc then gathers the|I|versions of each point of the full grid (using interpolation where necessary), and performs the summation according to formula (2.1) to get the combined (or sparse) grid solution uc

I, which can be used as

an approximation to the full grid solution. The use of interpolation in turn requires that a ‘halo’ of neighbouring points (in the positive direction, for our implementation) have been filled by a halo exchange operation by each process in eachPi and is also

sent in the gather stage. For reasons of efficient resource utilization, Pc is made up

of a (normally near-maximal) subset of all processes in∪i∈IPi. 21

Figure 3.1: Figure 4 from [Larson et al., 2013b]. Message paths for the gather stage for the

2D direct combination method (not truncated) on a levell=5 combined grid. The combined

grid and component grid (3,3) have 2×2 process grids, all others have 2×1 or 1×2 process

grids.

Similarly, in the scatter stage, a reverse mapping of processes from Pc to Pi is

done to scatter a down sample ofucI toPi, iteratively, for eachi∈ I.

Further details on the algorithm using a full grid representation of the combined gridGcIare available in [Strazdins et al., 2015]. An improved version of this algorithm, where a partial sparse grid, rather than a full grid, representation of Gc

I is used to

perform an efficient interpolation onGcI, is available in [Strazdins et al., 2016b]. We used this improved version of the algorithm in this thesis, except where otherwise indicated.

In terms of load balancing, we used a simple strategy to balance the loads among the processes. The same number (p0 ∈ N) of processes is allocated on each of the distinct set of processes Pi for each sub-grid on the uppermost diagonal (i.e., for the

2D case, eachGi,jwithi+j=l−1) in the grid index space. Each sub-grid on the next

lower diagonals (i.e., for the 2D case, each Gi,j with i+j = l−2, i+j = l−3, and i+j= l−4) is allocated dp0/2e,dp0/4e, anddp0/8eprocesses, respectively. Details are discussed and analyzed in Section 6.3 of Chapter 6.

A failure of computing nodes or application processes causes the loss of some processes on some gridsGi. This is handled as follows. Before the SGCT algorithm

is applied, the loss of any processes inPi is detected using ULFM MPI (see Chapter 4

for details). Replacement processes are then created (with the same process grid size as Pi) on the same node when that node is still available (i.e., the failure is not