2.2 Taxonomy of Resilient Programming Models
2.2.1 Adaptability
2.2.2.5 Fault Tolerance Technique
In the following, we describe the common fault tolerance techniques used for han- dling soft faults and hard faults.
Checkpoint/Restart
Checkpoint/Restart is the most widely used fault tolerance technique in HPC. It prepares the application to face failures by capturing snapshots of the application state during execution. Checkpoint/restart is a rollback-recovery mechanism — when failures occur, the application is restarted from an old state using the latest saved checkpoint. In a distributed execution, capturing a consistent state across several processes in the absence of a global clock is challenging. Many protocols have been investigated in the literature for tackling this challenge [Elnozahy et al.,2002;Maloney and Goscinski,2009], where the protocols are generally classified into: coordinated protocols and uncoordinated protocols.
Coordinated checkpointing: In a coordinated protocol, checkpointing is performed as a collective operation between all the processes. At the start of a checkpointing phase, each process pauses executing application actions, completes any pending communications with the other processes, then saves its local state to a reliable storage. Because application-level communications are stopped, checkpointing does not require capturing the communication state between the processes. A failure of one or more processes requires restarting all the processes from the latest checkpoint.
Uncoordinated checkpointing with message logging: An uncoordinated protocol avoids synchronizing the processes globally for checkpointing or restarting. Each process decides the checkpointing time independently from the other processes. Upon a failure, only the failed process is restarted using its latest checkpoint. In order to bring the restarted process to a state that is consistent with the rest of the processes, the other processes replay the communication events that were performed since the checkpoint time of the restarting process. Because of that, most uncoordinated checkpointing protocols require the processes to log the communication messages and to store them as part of their checkpoint.
Replication
Process replication is a forward-recovery mechanism. It is based on the intuitive idea of executing multiple instances of the same computation in parallel, so that when one instance is impacted by a failure another instance can proceed towards normal completion. The main challenge of this procedure is ensuring that the different instances remain identical. This goal can be achieved using atomic broadcast protocols
that ensure that changes to the application state are applied on all of the replicas in the correct order. The main drawback of process replication is introducing high resource utilization overhead. Despite that, replication is being investigated as a more practical fault tolerance technique for exascale systems than checkpoint/restart [Bougeret et al., 2014;Ropars et al.,2015]. Some studies suggest that global checkpoint/restart may not be feasible on exascale systems because checkpointing the state of an exascale machine is likely to exceed its Mean Time Between Failures (MTBF) [Cappello et al., 2009;Dongarra et al.,2011].
In addition to replicating processes, replication may also be performed for fine- grained work units (e.g. actors, tasks) for failure recovery or load balancing.
Migration
Migration [Wang et al.,2008;Meneses et al.,2014] is a proactive fault tolerance tech- nique that aims to avoid failures by predicting imminent faults and migrating the work units (e.g. actors, tasks or processes) from nodes that are likely to fail to other healthier nodes. The accuracy of the prediction model used is a determining factor for the performance overhead imposed by migration and for the level of protection it can deliver to applications. False positives lead to unnecessary migrations, while false negatives lead to crashed executions. To mitigate for false negatives, migration is often used with another fault tolerance technique, such as checkpoint/restart [Egwu- tuoha et al.,2013]. The ability to predict some of the failures enables the application to checkpoint its state less frequently, thereby enhancing the overall execution perfor- mance.
Transactions
A distributed transaction provides atomic execution of multiple distributed actions, such that either all the actions take effect (i.e. the transaction commits) or none of them do (i.e. the transaction aborts). Atomic commit protocols are employed behind the scenes to track the issued actions, acquire the needed locks, detect conflicts with other transactions, and finalize the transaction properly by either committing or aborting it. Handling process failure during transaction execution has been widely studied by the database community [Skeen,1981;Bernstein and Goodman,1984].
Resilient transactions can simplify the development of fault tolerant applications as they remove the burden of data consistency and failure recovery from the user to the transactional memory system. MostHPCprogramming models lack support for distributed transactions due to the scalability limitations of the distributed commit protocols [Harding et al., 2017]. These protocols require multiple phases of com- munication between all the transaction participants which incurs high performance overhead.
§2.2 Taxonomy of Resilient Programming Models 17
Task Restart
Task restart is a popular fault tolerance mechanism for dataflow systems that are represented by various big-data systems [Isard et al.,2007;Dean and Ghemawat,2008; Zaharia et al.,2010] andHPCtask-based runtime systems [Aiken et al.,2014;Mattson et al., 2016]. In these systems, the computation is expressed as a graph of inter- dependent tasks. The dependencies between the tasks are explicitly defined at the program level and passed to the runtime system. Process failures result in losing tasks and data objects that other tasks may depend on. From the task-dependence graph, the runtime system can identify and re-execute the set of tasks that can regenerate the lost data in order to satisfy the dependencies of the pending tasks to direct the execution towards successful termination. Checkpointing can be used in conjunction with task restart to avoid restarting long-running tasks from the beginning.
Algorithmic-Based Fault Tolerance
Algorithmic-Based Fault Tolerance (ABFT) was first introduced byHuang and Abra- ham[1984]. They designed methods for detecting and correcting soft errors in matrix operations by adding checksum information into matrix data. The same idea has been used later on for tolerating fail-stop failures in distributed matrix computa- tions [Chen and Dongarra, 2008; Hakkarinen and Chen, 2010; Davies et al., 2011; Du et al.,2012]. Such algorithmic recovery methods often outperform generic fault tolerance techniques such as checkpoint/restart [Graham et al.,2012]. More naturally fault tolerant algorithms can be found in the approximate computing domain. For example, monte-carlo methods generate approximate answers using a huge number of random samples. A process failure that causes the loss of some of these samples can be tolerated by either ignoring the lost samples and producing a less accurate result or replacing lost samples with new ones. Checkpointing or replication would be unnecessarily expensive for these algorithms.