2.3 Resilience Support in Distributed Programming Models
2.3.1 Message-Passing Model
The message-passing model is a widely-used programming model for high perfor- mance computing. It is based on the Single Program Multiple Data (SPMD) model, which offers coarse-grain parallelism by executing multiple copies of the same pro- gram, each having a separate process and address space. Communication between the processes for data sharing and synchronization is performed by message pass- ing. The standardization efforts that started in the early 1990s resulted in the MPI
standard, which now has a unique portability advantage that can hardly be found in other runtime systems. The generality of the model has been proven by the diverse styles of algorithms that have been implemented using it for more than two decades. The two main open-source implementations ofMPIare OpenMPI [OpenMPI,2018] and MPICH [MPICH,2018].
A communicator in MPI refers to a specific group of processes that are allowed to communicate with each other. A collective operation is a global communication function that involves all the processes in the communicator.
A core task for an MPI programmer is distributing the application data across the processes. Therefore, we classify MPI under the explicit resource mapping category in our taxonomy. The first release of MPI, MPI-1, provides an application with a fixed set of processes that cannot be extended during execution, and it supports only two-sided communication in which a send request at the source process is expected to be matched with a receive request at the destination process. Starting
from MPI-2, new functions have been added to support dynamic process creation and one-sided communication. More advanced features have been added to MPI-3 including support for non-blocking collectives, neighborhood collectives, and an improved one-sided communication interface.
Unfortunately, the MPI standard still lacks support for process-level fault tolerance. By default, all errors are fatal, and they leave the runtime system in an undetermined state. The lack of fault tolerance in MPI has captured the interest of many research groups, and different methods have been proposed to support system-level and user- level fault tolerance. Despite the simplicity advantage of providing fault tolerance transparently without application changes, this approach has clear limitations. First, it forces one fault tolerance technique on all applications, and second, it prevents the use of algorithmic-based fault tolerance techniques. Meanwhile, the MPI programming model is well-suited for user-level fault tolerance. That is because of adopting an explicit resource mapping policy that makes reasoning about the lost parts of the computation easy for the programmer.
In the following, we start by reviewing four representative frameworks that pro- vide system-level fault tolerance for MPI applications: MPICH-V, FMI, rMPI, and RedMPI. After that, we review three approaches for extending MPI with user-level fault tolerance: FT-MPI, MPI-ULFM, and FA-MPI.
2.3.1.1 MPICH-V
MPICH-V is a research framework designed for studying different protocols for sup- porting fault tolerance in MPI without requiring any application changes [Bosilca et al.,2002]. It has been used for comparing coordinated and non-coordinated check- pointing [Lemarinier et al., 2004], as well as comparing blocking and non-blocking implementations of coordinated checkpointing [Coti et al.,2006]. Process failure is detected by a heartbeat mechanism at the TCP communication layer, which can be configured by specialkeep-aliveparameters. Because MPICH-V is based on MPI-2, dynamic process creation is supported; however, the application is not required to use this capability for fault tolerance. While checkpointing is done automatically, recovering the application requires a manual restart.
2.3.1.2 FMI
The Fault Tolerant Messaging Interface (FMI) [Sato et al.,2014] is a research prototype implementation of MPI that uses in-memory checkpointing for recovering failed pro- cesses. It relies on the failure detection capability of the Infiniband Verbs Application Programming Interface (API), ibverbs, for raising communication errors at the pro- cesses directly connected to the failed process. To propagate the failure signal to the rest of the processes, FMI connects the processes in a so-called log-ring overlay network that can propagate the signal in O(log(n))messages, wheren is the number of processes. To relieve the programmer from the complexities of shrinking recovery and to avoid degrading the performance after losing some resources, FMI supports
§2.3 Resilience Support in Distributed Programming Models 21
non-shrinking recovery by replacing a failed process with a spare process allocated in advance. Shrinking recovery is not supported by FMI.
2.3.1.3 rMPI
Process replication was attempted for MPI for its potential to achieve better forward progress than checkpoint/restart for highly unreliable systems and also for its ability to tolerate both fail-stop and soft errors. rMPI [Ferreira et al.,2011] uses replication to target fail-stop process failure. It implements a replication framework that creates two replicas of each MPI rank, ensures the sequential consistency of the replicas, and performs forward recovery by restarting failed replicas on pre-allocated spare nodes using the state of the corresponding active replicas. rMPI assumes the availability of aRASservice that notifies the active ranks when other ranks fail. Ropars et al.[2015] propose a user-level replication scheme for MPI that aims to achieve more than 50% resource utilization by allowing the replicas to share work rather than performing the same work twice. Application programming interfaces are provided to enable the programmer to define segments of the code that are eligible for work sharing.
2.3.1.4 RedMPI
RedMPI [Fiala et al.,2012] uses replication to detect and even correct soft faults cor- rupting the MPI messages. Each send request at the application level is transparently forwarded to all the replicas of the receiver. The receiver compares the received mes- sages to detect data corruption. If three or more replicas per process are available, voting is applied to discover and drop the corrupted message. Otherwise, RedMPI terminates the application once a soft fault is detected.
2.3.1.5 FT-MPI
FT-MPI [Fagg and Dongarra,2000] was the first significant step towards providing failure awareness to MPI applications. The application discovers the failure from the return code of the MPI function calls. Communications with the failed process will always return an error. Communications with non-failed processes can be configured to either succeed or fail based on the application requirements. When a communicator detects a communication failure, it immediately propagates the failure information to all the processes of the communicator. A communicator can be recovered by creating a new communicator using existing MPI communicator creation functions whose semantics were modified to allow the following three recovery modes: SHRINK — removes the failed processes and updates the ranks of the remaining processes, BLANK— replaces the failed processes withMPI_PROC_NULL, andREBUILD— replaces the failed processes with new ones. Collective functions were significantly modified to consider the above three modes and to ensure that a failure will not result in producing an incorrect result at the surviving processes. By the above semantic modifications, FT-MPI aims to provide enough flexibility to applications in defining the appropriate method for recovery. However, the practicality of FT-MPI has been
criticized due to introducing significant semantic changes to MPI’s functions which leads to difficult library composition [Gropp and Lusk,2004;Bland et al.,2012b].
2.3.1.6 MPI-ULFM
Driven by the increasing demand for a standard fault tolerant specification for MPI, the MPI Fault Tolerance Working Group (FTWG) was formed to meet this objective. The FTWGstudied two proposals: Run Through Stabilization (RTS) [Hursey et al., 2011] and MPI User Level Failure Mitigation (MPI-ULFM) [Bland et al.,2012b]. In both proposals, failure reporting is done on a per-operation basis using special error codes as in FT-MPI. Unlike FT-MPI, both proposals avoid propagating the failures implicitly and do not guarantee uniform failure reporting in collective operations. The
RTSproposal has been rejected by the MPI Forum citing implementation complexities imposed by resuming communications on failed communicators [Bland et al.,2012b]. Therefore, MPI-ULFMis currently the only active fault tolerance proposal for MPI-4. MPI-ULFM specifies the behavior of an MPI runtime in cases of process failure and adds a minimal set of functions to provide failure propagation, process agree- ment, and communicator recovery services. The specification covers all MPI functions (blocking/non-blocking, point-to-point/collectives, one-sided/two-sided); however, support for one-sided communication is still lacking in the currently released refer- ence implementation ULFM-2. Failure detection in the reference implementation of MPI-ULFM is based on heartbeating and communication timeout errors as detailed in [Bosilca et al.,2016]. The new communicator recovery operationMPI_COMM_SHRINK is added to facilitate shrinking recovery. By combining this operation with the ex- isting standard functionMPI_COMM_SPAWN, applications can implement non-shrinking recovery using dynamically created processes.
2.3.1.7 FA-MPI
Outside of the MPIFTWG, FA-MPI (Fault-Aware MPI) [Hassani et al.,2015] proposes the addition of a transaction concept to MPI to serve as a configurable unit for failure detection and recovery. Unlike MPI-ULFM, which provides failure reporting at the granularity of a single function call, FA-MPI enables the customization of the granu- larity of failure reporting to include one or more communication functions. FA-MPI restricts the communication within a transaction to non-blocking communications. The transaction waits for the completion of any pending communication and aggre- gates any detected failures. Special functions are provided to enable the application to query for the detected failures. FA-MPI allows the application to raise algorithmic- specific errors that will also be detected by the transaction, therefore it can be used for handling both soft faults and hard faults. Like MPI-ULFM, performance recov- ery is the application’s responsibility. Both shrinking and non-shrinking recovery functions are provided to support different application requirements. The FA-MPI prototype implementation is limited to detecting failures due to segmentation faults. External daemon processes detect these failures and propagate them to the remaining processes.
§2.3 Resilience Support in Distributed Programming Models 23