Checkpoint/restart, or checkpointing, is a well-known rollback recovery technique deployed to mitigate fail-stop errors [37]. In checkpointing the application state, called checkpoint, is saved and the application is restarted by using the checkpoint when a failure occurs. The most common types of checkpointing are coordinated and uncoordinated checkpointing for distributed applications. In coordinated checkpointing, processes coordinate with each other to
reach a consistent global state before taking a checkpoint. The main drawback of coordinated checkpointing is the expensive coordination. Uncoordinated checkpointing was proposed to avoid the expensive coordination [37]. In uncoordinated checkpointing, the states of processes are saved independently, which avoids the coordination but supplementary application data is also logged such that a consistent global state could be formed in a restart. The logging of supplementary data is called message logging which is a technique used to prevent domino effect [37]. Domino effect is a condition in which, in the lack of message logging, the restart of a process causes other processes to restart; and in the end all processes have to restart losing all useful computation. Uncoordinated checkpointing has its own disadvantages. The log size, communication overhead, and fault-free execution overheads can be prohibitive.
To address these issues, hierarchical coordinated checkpointing schemes are proposed such as [38] and [39]. These schemes partition the application processes into groups in which each group can checkpoint independently. They limit the coordination among the processes in a single group and they perform costly global checkpoints less frequently than those within each group. However message logging is performed across the groups to be able for a global restart. Moreover the coordination among the processes belonging to the same group can still be ex- pensive.
Basic checkpointing and message logging algorithms often suffer from low scalability and large memory footprint. Considerable work has been done to address these issues [40]. Some researchers suggest considering the knowledge on correlation between hardware faults in order to lower the message log memory overhead [41], [42]; others combine coordinated checkpoint- ing with message logging for the same purpose [43]. To the best of our knowledge, however, no prior work has tackled message logging for hybrid PMs.
The Berkeley Lab Checkpoint/Restart [44] is a widely adopted approach in high perfor- mance computing for fault-tolerance. As a system-level solution, the most important drawback of this mechanism is the large memory footprint that is generated during checkpointing process due to the lack of knowledge of user-level structures. This mechanism provides a binary check- point solution that increases dependency with operating system, it generates the checkpoint using the I/O interaction at Linux kernel level. BLCR needs an operating system modification for a good performance. Our proposed solution in Chapter4is runtime-level, which does not create OS dependency and decreases the usage of memory.
Application-level approaches exist such as [45], however such solutions are not generic and have to be adapted for each single application with developer/programmer knowledge.
These approaches are more efficient than system-level ones. Our novel solution (Chapter 4) has the knowledge of the explicit asynchronous execution flow, which enables us to build a more accurate non-centralized incremental checkpoint and to share information through all the active threads.
System-level solutions have recently evolved into multi-level nature with the main idea to be performing expensive global checkpoints less frequently. Sato et. al. [39] proposes an approach based on multi-level checkpointing for petascale and exascale systems. Their check- pointing system merges non-blocking and multi-level checkpointing. This system is shown to be efficient on future post-petascale systems. Another hybrid solution proposed for hybrid petascale systems is the Fault Tolerance Interface by Bautista-Gomez et. al [38]. Our proposed framework is complementary to these system-wide solutions (Chapter4).
Ma and Krishnamoorthy [46] propose an approach for task-parallel computations in the presence of work-stealing. The main objective of the work is to identify tasks to be recovered and to recover partial data updates accurately and efficiently. They track the communication operations and maintain an idempotent data store for this purpose. They target fault tolerance within a task parallel phase complementary to checkpointing and fault detection which consti- tute some parts of our work (Chapter4). Chung et al. [47] introduce the Containment Domains (CD) for system resiliency. CDs are programming constructs for the programmer to express resilience concerns as well as to specialize fault detection and recovery according to applica- tion under consideration. In addition, CDs provides flexibility in terms of state preservation, location of preservation as well as the method of fault detection.
Algorithm-Based Fault Tolerance [48] requires additional programmer effort for generating specific redundant calculation on the algorithm.
Bronevetsky et al. [49] proposed an application-level checkpointing technique for OpenMP applications. The application state is regularly stored to disks and in case of a failure it is loaded and application is restarted. Fu and Yang [50] explore thread-level redundancy to detect and recover from transient errors. Tahan and Shawky [51] propose triple modular redundancy fault-tolerance for OpenMP programs via task replication at compiler level.