4.3. Checkpoint/Restart Optimization Techniques. There are many optimizations that checkpoint/restart systems can take advantage of in order to reduce the checkpoint overhead and/or checkpoint latency. In this section we will explore a set of commonly used optimizations.

4.3.1. Forked Checkpointing. Forked checkpointing relies on the copy-on-write semantics of process creation in modern operating systems in order to reduce checkpoint overhead [57, 153, 208, 238, 265]. At the point of the checkpoint, the parent process forks a child process to perform the checkpoint operation while the parent program continues execution. Since the child only reads pages, which it then writes to stable storage as the checkpoint, there is no need for a complete copy of the parent's memory image; only those pages of memory that change after the fork operation need to be copied. By using copy-on-write semantics, the child process can be created quickly and run concurrently with the parent process, thus reducing the impact on checkpoint overhead.
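The fork-and-continue pattern described above can be sketched in a few lines of Python on a POSIX system. This is a minimal illustration, not the implementation used by any of the cited systems: the function names `forked_checkpoint` and `wait_for_checkpoint` are hypothetical, and pickling a Python object stands in for writing the process image to stable storage.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Assumption: `state` is picklable and stands in for the process image.
    """
    pid = os.fork()
    if pid == 0:
        # Child: sees the parent's memory as it was at the fork. The kernel
        # shares pages copy-on-write, so nothing is copied up front.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())  # push the checkpoint to stable storage
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup handlers
    return pid  # Parent: returns immediately and keeps computing.


def wait_for_checkpoint(pid):
    """Reap the checkpointing child; return True if it exited successfully."""
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    state = {"iteration": 1, "data": list(range(5))}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    pid = forked_checkpoint(state, path)
    state["iteration"] = 2  # the parent modifies its state while the child writes
    print("checkpoint ok:", wait_for_checkpoint(pid))
```

Any change the parent makes after the fork is invisible to the child, so the file on disk reflects the state at the moment of the fork.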

The performance gains diminish if the parent process writes a large amount of data while the child process is writing the checkpoint, due to paging overhead. Additionally, in memory-constrained systems, where the application requires most or all of main memory, the child process may be forced to page to disk, causing thrashing which drastically diminishes the performance of both the checkpointing operation and the parent process execution. This technique works well for programs that do not use fork-sensitive libraries or resources. Unfortunately, high performance interconnect drivers are typically sensitive to fork calls due to the operating system bypass techniques used for optimized memory transfers, making them ill suited for these techniques.

Similar to forked checkpointing is a mirror copy (a.k.a. continuous checkpointing) technique which seeks to reduce the overhead of copy-on-write paging by keeping a duplicate copy of the process image in memory at all times [59, 127]. This duplicate image is then occasionally written out to stable storage concurrently by the operating system [59, 127].

For real-time systems, the mirror copy technique provides a more consistent checkpoint latency than copy-on-write, making it more appealing to systems with hard deadlines on the completion of application execution [59].

Instead of maintaining this duplicate image throughout the life of the process, a pre-copy technique has been suggested that creates a full duplicate copy of the parent at the point of the checkpoint [83, 257]. It was shown by [83] that forked checkpointing performed better in practice than the pre-copy technique, indicating that the overhead of maintaining the copy-on-write properties of main memory is not a significant enough factor to warrant such an optimization.

4.3.2. Checkpoint Compression. One way to reduce the checkpoint latency is to reduce the amount of checkpoint data that is pushed to stable storage, thus reducing the I/O required to store the checkpoint [166, 208, 236, 238]. This technique may also reduce the amount of space required on stable storage for checkpoint files. Most of these techniques employ in-line compression while writing the checkpoint [208], but few have looked at compression as part of a larger staging process at the stable storage level.
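As an illustration of in-line compression, the sketch below compresses checkpoint data chunk by chunk as it is written, so the uncompressed image is never staged on disk. It assumes zlib as the compressor and uses the hypothetical names `write_compressed_checkpoint` and `read_compressed_checkpoint`; the cited systems may use other algorithms and file formats.

```python
import zlib


def write_compressed_checkpoint(chunks, path, level=6):
    """Compress checkpoint chunks in-line while writing them to `path`.

    Returns (raw_bytes, stored_bytes) so the caller can see the reduction.
    """
    compressor = zlib.compressobj(level)
    raw = stored = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            raw += len(chunk)
            out = compressor.compress(chunk)  # may be empty while zlib buffers
            stored += len(out)
            f.write(out)
        tail = compressor.flush()  # emit whatever the compressor still holds
        stored += len(tail)
        f.write(tail)
    return raw, stored


def read_compressed_checkpoint(path):
    """Read a checkpoint written by write_compressed_checkpoint."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read())
```

The returned byte counts make it easy to log the achieved reduction per checkpoint, which depends heavily on the application state being saved.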

The degree to which a checkpoint can be compressed depends on the application state. Most studies have shown that checkpoints, especially application generated checkpoints, can be significantly compressed. Though these studies typically focus on the size reductions, it is equally important to consider the performance of the checkpointing algorithm as it impacts checkpoint overhead. [208] notes that while compression may reduce the size of the checkpoint, “... it only improves the overhead of checkpointing if the speed of compression is faster than the speed of disk writes, and if the checkpoint is significantly compressed.” Informally, researchers have found that even when checkpoint overhead is increased by using compression, the reduction in stress on the stable storage system often improves system reliability, allowing the application to checkpoint less frequently.
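The condition quoted from [208] can be made concrete with a simple cost model. The sketch below is an assumption for illustration only: it models the compressed path as compressing the whole checkpoint and then writing it, with no overlap between the two steps, and all names and parameters are hypothetical. Under this model compression wins only when the compression bandwidth exceeds disk_bw / (1 - 1/ratio), which is always greater than the disk bandwidth itself.

```python
def checkpoint_times(size, disk_bw, comp_bw, ratio):
    """Return (plain_seconds, compressed_seconds) for one checkpoint.

    size    : checkpoint size in bytes
    disk_bw : stable storage write bandwidth in bytes/second
    comp_bw : compressor throughput in bytes/second of input
    ratio   : compression ratio (uncompressed size / compressed size)

    Assumption: compression and writing happen back to back, not pipelined.
    """
    plain = size / disk_bw
    compressed = size / comp_bw + (size / ratio) / disk_bw
    return plain, compressed


def compression_pays_off(size, disk_bw, comp_bw, ratio):
    """True if the compressed path finishes sooner than the plain write."""
    plain, compressed = checkpoint_times(size, disk_bw, comp_bw, ratio)
    return compressed < plain
```

For example, with a 1 GB checkpoint, a 100 MB/s disk, and a 2:1 ratio, a 400 MB/s compressor lowers the time from 10 s to 7.5 s, whereas a 150 MB/s compressor raises it above 10 s even though it is faster than the disk.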

4.3.3. Memory Exclusion. Another method of reducing checkpoint latency is to exclude temporary or unused buffers from the checkpoint [209]. This can reduce the size of the checkpoint and therefore the amount of I/O required to transfer it to stable storage. Most of the techniques focus on augmenting applications to mark regions of memory that should be excluded from the checkpoint [3, 51, 208]. Alternatively, the checkpointing library may be able to transparently find memory to exclude by augmenting the memory allocator [51] and/or using pre-compiler analysis [175].

This technique relies on the ability of the application writer to highlight both critical and temporary regions of memory. It is usually employed by support libraries as a way to reduce the amount of additional state that their use by the application adds to the checkpoint [3].
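A minimal sketch of application-directed memory exclusion follows. The `RegionRegistry` class and its methods are hypothetical and do not correspond to the interface of any cited library; named Python buffers stand in for regions of process memory.

```python
import pickle


class RegionRegistry:
    """Tracks named memory regions and which of them to leave out of a checkpoint."""

    def __init__(self):
        self._regions = {}
        self._excluded = set()

    def register(self, name, buffer):
        """Make a region known to the checkpointer (included by default)."""
        self._regions[name] = buffer

    def exclude(self, name):
        """Mark a temporary or unused region so it is not saved."""
        self._excluded.add(name)

    def include(self, name):
        """Undo a previous exclude()."""
        self._excluded.discard(name)

    def checkpoint(self):
        """Serialize only the regions that are not excluded."""
        saved = {n: b for n, b in self._regions.items() if n not in self._excluded}
        return pickle.dumps(saved)

    def restore(self, blob):
        """Restore saved regions; excluded ones must be rebuilt by the application."""
        self._regions.update(pickle.loads(blob))
```

A support library could, for instance, register its persistent tables normally and call `exclude` on its scratch buffers, so that only the tables contribute to the checkpoint size.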

4.3.4. Incremental Checkpointing. Incremental checkpointing techniques focus on reducing checkpoint latency by checkpointing only the changes made by the application since the last checkpoint [208]. During recovery the last N checkpoint files are required in order to recreate the process. In order to reduce the number of checkpoint files necessary for recovery, a full checkpoint is taken less often than the incremental checkpoints and used as a base for the next set of incremental checkpoints to be combined against. Similar to forked checkpointing, these techniques commonly rely on the paging system of modern operating systems, particularly the modified or dirty bit in modern paging hardware [51, 83, 87, 95, 113, 127, 208, 215, 243, 273]. When a checkpoint is taken the modified (a.k.a. dirty) bits are cleared, and as the program executes the bits are set on the pages that it modifies. During the next pass of the CRS, only those pages that have been modified are included in the incremental checkpoint saved to stable storage. Upon recovery, the incremental checkpoints are combined with the last full checkpoint to recreate the process state.
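The dirty-bit scheme and the recovery step can be illustrated with a small software simulation. This is a sketch under stated assumptions: the `PagedMemory` class, its page size, and the `restore` function are hypothetical, and a Python set plays the role of the dirty bits that real systems obtain from the paging hardware.

```python
PAGE_SIZE = 4096  # assumed page size for the simulation


class PagedMemory:
    """Simulated address space with one software 'dirty bit' per page."""

    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()

    def write(self, addr, data):
        """Store bytes at addr and mark every touched page dirty."""
        for i, byte in enumerate(data):
            page, offset = divmod(addr + i, PAGE_SIZE)
            self.pages[page][offset] = byte
            self.dirty.add(page)

    def full_checkpoint(self):
        """Save every page and clear all dirty bits."""
        self.dirty.clear()
        return {i: bytes(p) for i, p in enumerate(self.pages)}

    def incremental_checkpoint(self):
        """Save only pages modified since the previous checkpoint."""
        delta = {i: bytes(self.pages[i]) for i in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(full, increments):
    """Rebuild the pages from a full checkpoint plus increments, oldest first."""
    pages = dict(full)
    for delta in increments:
        pages.update(delta)  # later increments overwrite earlier page contents
    return [bytearray(pages[i]) for i in sorted(pages)]
```

Note that the increments must be applied in the order they were taken, since a page may appear in several of them and only the most recent copy is valid.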

The smallest incremental checkpoint is based on the smallest segment of memory, the word. Since the overhead involved in tracking changes to individual words in memory is prohibitively expensive, most techniques use pages, which contain a set of words, and paging hardware to support incremental checkpointing at the page level. However, even a single word change may cause the entire page to be included in the incremental checkpoint, adding expense to the checkpoint operation. Probabilistic checkpointing allows for incremental checkpointing based on blocks of memory which are larger than a word, but smaller than a page, in order to reduce this overhead [3, 186]. These techniques use a memory block encoding algorithm to determine if a block of memory has changed, since they cannot rely on paging hardware to highlight differences. The encoding technique explored often suffers from aliasing, in which differences may be masked because the encoding algorithm encodes the old and modified blocks to the same value. Though the authors of the technique present analysis that shows this to be a sufficiently rare event [186], later research found that aliasing occurs much more frequently in practice, making probabilistic techniques dangerous to employ [81].
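The aliasing problem can be demonstrated with a short sketch of block-level change detection. The additive checksum below is deliberately weak and is chosen only to make aliasing easy to reproduce; it is not the encoding analyzed in [186], and all names and the block size are hypothetical.

```python
import hashlib

BLOCK_SIZE = 64  # assumed block size: larger than a word, smaller than a page


def additive_signature(block):
    """Deliberately weak encoding: byte sum modulo 256 (aliases easily)."""
    return sum(block) % 256


def sha256_signature(block):
    """Much stronger encoding, shown for comparison."""
    return hashlib.sha256(block).digest()


def block_signatures(memory, encode, block_size=BLOCK_SIZE):
    """Encode every block of `memory` with the given encoding function."""
    return [encode(memory[i:i + block_size])
            for i in range(0, len(memory), block_size)]


def changed_blocks(old_sigs, memory, encode, block_size=BLOCK_SIZE):
    """Indices of blocks whose encoding differs from the previous checkpoint."""
    new_sigs = block_signatures(memory, encode, block_size)
    return [i for i, (a, b) in enumerate(zip(old_sigs, new_sigs)) if a != b]


if __name__ == "__main__":
    old = bytearray(256)
    old[0], old[1] = 1, 2
    new = bytearray(old)
    new[0], new[1] = 2, 1  # block 0 changes, but its byte sum does not
    new[130] = 7           # block 2 changes
    weak = block_signatures(old, additive_signature)
    strong = block_signatures(old, sha256_signature)
    print(changed_blocks(weak, new, additive_signature))  # misses block 0
    print(changed_blocks(strong, new, sha256_signature))
```

With the weak encoding, the swap in block 0 goes undetected and would be silently omitted from the incremental checkpoint, which is exactly the failure mode that makes aliasing dangerous.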

In document Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems (Page 35-38)