MPI-ULFM Overview - Resilience in high level parallel programming languages

As high performance computing applications run longer on larger numbers of cores, they are more likely to experience process failures. If HPC programmers do not prepare their applications to deal with process failures, much time and energy will be wasted on supercomputers due to restarting failed computations. MPI-ULFM addresses this problem by extending MPI with fault tolerance features that enable programmers to design a variety of failure recovery mechanisms. It focuses on fail-stop failures, in which a process impacted by a hardware or software fault crashes or stops communicating.

The following paragraph, quoted from the ULFM specification document [MPI-ULFM, 2018], describes its main design principles at a high level:

The expected behaviour of MPI in the case of an MPI process failure is defined by the following statements: any MPI operation that involves a failed process must not block indefinitely but either succeed or raise an MPI error . . . ; an MPI operation that does not involve a failed process will complete normally, unless interrupted by the user through provided functionality. Errors indicate only the local impact of the failure on an operation and make no guarantee that other processes have also been notified of the same failure. Asynchronous failure propagation is not guaranteed or required, and users must exercise caution when determining the set of processes where a failure has been detected and raised an error. If an application needs global knowledge of failures, it can use the interfaces defined in Section . . . to explicitly propagate the notification of locally detected failures. [MPI-ULFM, 2018]

In this section we describe ULFM’s features in more detail, focusing on the aspects most relevant to a task-based runtime system like X10. Because X10 does not use MPI’s RDMA operations, we do not cover those operations in this thesis.

3.3.1 Fault-Tolerant Communicators

A communicator in MPI represents a group of processes that are allowed to communicate with each other. The main communicator MPI_COMM_WORLD includes all the processes created at application start-up. By default, MPI communicators are not fault-tolerant. When the first error arises, whether it is a memory error, a communication error or otherwise, the runtime system immediately halts. That is because all MPI communicators use MPI_ERRORS_ARE_FATAL as a default error handler. MPI provides a second error handler called MPI_ERRORS_RETURN which, instead of halting the runtime system, reports the error to the calling process. However, for this error handler to be useful for recovering from process failures, the behaviour of the runtime system after a process failure must be specified.

ULFM closes the failure specification gap in MPI so that communicator error handlers start to have a practical value for fault tolerance. By assigning an error handler to the communicator, ULFM can reliably report process failure errors to active processes and permit them to continue communicating according to precisely defined rules. The following three error classes are added in ULFM for reporting process failures:

• MPI_ERR_PROC_FAILED_PENDING: for non-blocking receive operations where the source process is MPI_ANY_SOURCE, this error indicates that no matching send has been detected yet, but one potential sender has failed.

• MPI_ERR_PROC_FAILED: for all communication operations, this error indicates that a process involved in the communication request has failed.

• MPI_ERR_REVOKED: for all communication operations, this error indicates that the communicator has been invalidated by the application.

3.3.2 Failure Notification

ULFM specifies the conditions that lead to notifying a process of the failure of another process. Processes that may be notified of a failure are those communicating with a dead process directly (through point-to-point operations) or indirectly (through collectives). When a failure is detected by one process, it is not propagated by default to other processes. Unless propagating the notification is requested explicitly by one process, failure knowledge remains local and may differ between processes. The following are the main rules governing failure notification for non-blocking operations, collectives, and the use of MPI_ANY_SOURCE.

When non-blocking communication is used, failure notification is deferred from the initiation functions, such as MPI_Isend or MPI_Iallreduce, to the completion functions MPI_Test and MPI_Wait. For example, in Figure 3.2, we marked in red the return values that can notify failures. If the source place is dead, ULFM will not notify X10 of that failure in step 7, where MPI_Irecv is called. However, it will notify it in step 8, when MPI_Test is called.
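The deferral rule can be illustrated with a small toy model. The sketch below is plain Python, not the MPI API: the names DeferredComm, Request and ProcFailedError are hypothetical stand-ins chosen for illustration, and the model only mimics the rule that initiation never reports a failure while completion may.

```python
class ProcFailedError(Exception):
    """Stands in for the MPI_ERR_PROC_FAILED error class (toy model)."""


class Request:
    """Hypothetical handle returned by a non-blocking receive."""

    def __init__(self, comm, source):
        self.comm = comm
        self.source = source

    def test(self):
        # Completion function: the first point where a failure may surface.
        if self.source in self.comm.failed:
            raise ProcFailedError(f"rank {self.source} has failed")
        # Return the pending message, or None if nothing has arrived yet.
        return self.comm.mailbox.pop(self.source, None)


class DeferredComm:
    """Toy communicator: tracks failed ranks and one pending message per rank."""

    def __init__(self, size):
        self.size = size
        self.failed = set()
        self.mailbox = {}

    def irecv(self, source):
        # Initiation function: never reports a failure, even for a dead source.
        return Request(self, source)


comm = DeferredComm(size=4)
comm.failed.add(2)            # rank 2 dies before the receive is posted
req = comm.irecv(source=2)    # initiation succeeds (like step 7)
try:
    req.test()                # completion reports the failure (like step 8)
    notified = False
except ProcFailedError:
    notified = True
```

In this model the caller only needs error handling around the completion call, which mirrors where a runtime built on ULFM would place its failure checks.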

ULFM provides non-uniform failure reporting for the collective operations [Bland et al., 2013]. Depending on the collective implementation, when a process dies, some processes may report the collective as successful, while others may report it as failed.

The wildcard value MPI_ANY_SOURCE is useful for detecting failures of any process in a communicator. When invoking a receive or a probe operation from MPI_ANY_SOURCE, an error is raised when MPI detects the failure of any potential sender.

3.3.3 Failure Mitigation

The user-level resilience model of MPI-ULFM provides great flexibility for application recovery. Rather than transparently recovering applications by a single fault tolerance mechanism, ULFM provides a few failure mitigation functions that can be integrated in different ways to implement different recovery mechanisms. We show these functions in Table 3.1.

Table 3.1: MPI-ULFM failure mitigation functions.

| # | Operation | Type | Description |
|---|-----------|------|-------------|
| 1 | MPI_COMM_FAILURE_ACK(comm) | local | acknowledges the notified failures |
| 2 | MPI_COMM_FAILURE_GET_ACKED(comm, grp) | local | returns the group of acknowledged failed processes |
| 3 | MPI_COMM_REVOKE(comm) | remote | invalidates the communicator |
| 4 | MPI_COMM_SHRINK(comm, newcomm) | collective | creates a new communicator by excluding dead processes from the original communicator |
| 5 | MPI_COMM_AGREE(comm, flag) | collective | provides a fault-tolerant agreement algorithm |

When a process is notified of a failure, ULFM expects that some applications might need to silence repeated notifications of the same failure. The first function, MPI_COMM_FAILURE_ACK, is added for that purpose. It enables a user to acknowledge the failures notified by a communicator so far. After acknowledging a failure, receive operations that use MPI_ANY_SOURCE and the new collective operations MPI_COMM_SHRINK and MPI_COMM_AGREE will proceed successfully even if they involve the failed process. The second function, MPI_COMM_FAILURE_GET_ACKED, is used for identifying the failed processes that were acknowledged. Figure 3.4 illustrates the use of these two functions. In this example, two ranks have failed, rank 1 followed by rank 3. The failure of rank 1 was acknowledged by calling MPI_COMM_FAILURE_ACK, but the failure of rank 3 was not acknowledged. As a result, a call to MPI_COMM_FAILURE_GET_ACKED following the failure of rank 3 returns only rank 1.

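[Figure 3.4 shows a 5-rank communicator (ranks 0 to 4) and the failure state after each event:]

| Event | Failures | Acknowledged Failures |
|-------|----------|-----------------------|
| (initial state) | {} | {} |
| Rank 1 fails | {1} | {} |
| MPI_COMM_FAILURE_ACK | {1} | {1} |
| Rank 3 fails | {1,3} | {1} |

MPI_COMM_FAILURE_GET_ACKED = {1}

Figure 3.4: Use of MPI_COMM_FAILURE_ACK and MPI_COMM_FAILURE_GET_ACKED for failure detection and acknowledgement.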
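The sequence in Figure 3.4 can be replayed with a toy model of one process's local failure knowledge. This is a sketch in plain Python rather than the MPI API: the class AckComm and its method names are hypothetical, and the wildcard receive is simplified to raise a generic error whenever an unacknowledged failure exists.

```python
class AckComm:
    """Toy model of per-process failure knowledge on one communicator."""

    def __init__(self):
        self.failures = set()   # failures this process has been notified of
        self.acked = set()      # failures acknowledged so far

    def notify_failure(self, rank):
        self.failures.add(rank)

    def failure_ack(self):
        # Models MPI_COMM_FAILURE_ACK: acknowledge everything notified so far.
        self.acked = set(self.failures)

    def failure_get_acked(self):
        # Models MPI_COMM_FAILURE_GET_ACKED: return the acknowledged group.
        return sorted(self.acked)

    def recv_any_source(self):
        # A wildcard receive raises while an unacknowledged failure exists.
        pending = self.failures - self.acked
        if pending:
            raise RuntimeError(f"unacknowledged failures: {sorted(pending)}")
        return "ok"


comm = AckComm()
comm.notify_failure(1)             # rank 1 fails
comm.failure_ack()                 # failure of rank 1 is acknowledged
after_ack = comm.recv_any_source() # wildcard receive proceeds again
comm.notify_failure(3)             # rank 3 fails, not acknowledged
acked = comm.failure_get_acked()   # only rank 1 is reported
```

The model keeps the two sets separate on purpose: acknowledging is a snapshot of what was known at the time of the call, so later failures stay unacknowledged until the next call.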


The third function, MPI_COMM_REVOKE, is added for explicit global failure propagation. ULFM does not implicitly propagate failure notifications to all processes in a communicator. Thus, processes that do not communicate with the failed process will not detect its failure as long as the communicator is valid. If failure propagation is needed, an application may call MPI_COMM_REVOKE to invalidate the communicator. After one process revokes a communicator, all processes sharing the same communicator will receive MPI_ERR_REVOKED when they call any non-local MPI operation, except MPI_COMM_SHRINK and MPI_COMM_AGREE. These processes must recover the communicator to resume execution.

Communicator recovery strategies are generally classified as shrinking or non-shrinking (see Section 2.2.3.1 for the description of these classes). In shrinking recovery, a new communicator is created by excluding the dead processes in a given communicator. The fourth function, MPI_COMM_SHRINK, is added for that purpose. Non-shrinking recovery requires replacing dead processes with new ones, so that the application can restore its state using the same number of processes. The MPI-3 standard, on which MPI-ULFM is based, supports dynamic process creation using MPI_COMM_SPAWN. Thus, by combining MPI_COMM_SHRINK and MPI_COMM_SPAWN, MPI applications can implement non-shrinking recovery methods [Ali et al., 2014].

Figure 3.5 demonstrates the use of MPI_COMM_REVOKE and MPI_COMM_SHRINK for performing shrinking recovery. This example presents a 6-rank communicator, where each rank is communicating with only one other rank. When rank 1 fails, only rank 0 can detect the failure. However, rank 0 cannot shrink the communicator alone; the collective function MPI_COMM_SHRINK requires the participation of every non-failed rank in the communicator. By calling MPI_COMM_REVOKE, rank 0 invalidates the communicator, not only for itself but for all the other ranks. In this example, each rank handles an invalid communicator by calling MPI_COMM_SHRINK. This results in creating a 5-rank communicator for resuming the application.

[Figure 3.5 shows the following sequence: a 6-rank communicator (ranks 0 to 5); rank 1 fails; rank 0 gets a process error; rank 0 calls MPI_COMM_REVOKE; active ranks get revoke errors; active ranks collectively call MPI_COMM_SHRINK; a new 5-rank communicator (ranks 0 to 4).]

Figure 3.5: Use of MPI_COMM_REVOKE and MPI_COMM_SHRINK to perform shrinking recovery. Here we assume that communication is performed between pairs of processes.
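The recovery sequence of Figure 3.5 can be sketched as a single-process toy model. This is plain Python, not the MPI API: PairComm, ProcFailedError and RevokedError are hypothetical names, the pairing of ranks (0-1, 2-3, 4-5) is an assumption made for the example, and the collective shrink is collapsed into one call for brevity.

```python
class ProcFailedError(Exception):
    """Stands in for MPI_ERR_PROC_FAILED (toy model)."""


class RevokedError(Exception):
    """Stands in for MPI_ERR_REVOKED (toy model)."""


class PairComm:
    """Toy communicator whose state is shared by all ranks."""

    def __init__(self, size):
        self.ranks = list(range(size))
        self.dead = set()
        self.revoked = False

    def exchange(self, peer):
        # Any non-local operation fails once the communicator is revoked.
        if self.revoked:
            raise RevokedError("communicator revoked")
        if peer in self.dead:
            raise ProcFailedError(f"rank {peer} has failed")
        return "ok"

    def revoke(self):
        self.revoked = True

    def shrink(self):
        # Collective in ULFM; here one call builds the survivors' communicator.
        survivors = [r for r in self.ranks if r not in self.dead]
        return PairComm(len(survivors))


partner = {0: 1, 2: 3, 3: 2, 4: 5, 5: 4}   # surviving ranks and their peers
comm = PairComm(6)
comm.dead.add(1)                            # rank 1 fails

# Ranks 2 to 5 never talk to rank 1, so they notice nothing yet.
unaware = [comm.exchange(partner[r]) for r in (2, 3, 4, 5)]

try:
    comm.exchange(partner[0])               # rank 0 gets a process error
except ProcFailedError:
    comm.revoke()                           # rank 0 propagates the failure

revoked_seen = []
for r in (2, 3, 4, 5):
    try:
        comm.exchange(partner[r])
    except RevokedError:
        revoked_seen.append(r)              # active ranks get revoke errors

new_comm = comm.shrink()                    # new 5-rank communicator
```

The run shows why the revoke step is needed: without it, the ranks that never communicate with rank 1 would keep succeeding and would never enter the shrink.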

The fifth and last function, MPI_COMM_AGREE(comm, flag), provides a fault-tolerant agreement algorithm. It is a collective operation in which all the active processes agree on the value of the integer in/out parameter flag and also implicitly agree on the group of failed processes. The agreed value is the bitwise AND of the values provided by all active processes. While normal collectives may succeed at some processes and fail at others, ULFM guarantees that MPI_COMM_AGREE will either complete successfully at all the processes or fail at all the processes.
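The agreed value can be illustrated with a short computation. The sketch below is plain Python and models only the result of the agreement, not the fault-tolerant protocol behind it; the function name toy_agree and the example flag values are hypothetical.

```python
from functools import reduce
from operator import and_


def toy_agree(flags, failed):
    """Return the value each surviving rank would see: the bitwise AND of
    the flags contributed by the surviving ranks (toy model, no messaging)."""
    alive = {rank: flag for rank, flag in flags.items() if rank not in failed}
    agreed = reduce(and_, alive.values())
    # Every active rank ends up with the same value.
    return {rank: agreed for rank in alive}


# Four ranks contribute flags; rank 3 has failed and does not take part.
flags = {0: 0b111, 1: 0b101, 2: 0b110, 3: 0b011}
result = toy_agree(flags, failed={3})
```

One way an application might use this, offered here as an assumption rather than something the text prescribes, is for each rank to contribute 1 if its local step succeeded and 0 otherwise, so that the agreed value is 1 only when every active rank succeeded.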

This concludes our overview of the MPI-ULFM specification. We find MPI-ULFM promising for bridging the gap between performance and resilience. By providing user-controlled resilience in a performance-oriented programming model, it allows many optimizations to be leveraged for handling failures more efficiently at scale.
