MANAGING PARALLEL ENVIRONMENTS
5. To delete a parallel environment, select it, and then click Delete.
Note: Performing a tight integration with a parallel environment is an advanced task that can require expert knowledge of the parallel environment and of the Grid Engine software parallel environment interface. You might want to contact your Sun support representative for assistance.
3.1.3 Managing Checkpointing Environments
3.1.3.1 Checkpointing Overview
Checkpointing is a facility that does the following tasks:
■ Freezes the status of a running job or application
■ Saves this status (the checkpoint) to disk
■ Restarts the job or application from the checkpoint if the job or application has otherwise not finished, for example, due to a system shutdown
If you move a checkpoint from one host to another host, checkpointing can migrate jobs or applications in a cluster without significant loss of resources. Hence, dynamic load balancing can be provided with the help of a checkpointing facility.
The Grid Engine system supports two levels of checkpointing:
■ User-level checkpointing - At this level, providing the checkpoint generation mechanism is entirely the responsibility of the user or the application. Examples of user-level checkpointing include the following:
    ■ The periodic writing of restart files that are encoded in the application at prominent algorithmic steps, combined with proper processing of these files when the application is restarted
    ■ The use of a checkpoint library that must be linked to the application and that thereby installs a checkpointing mechanism
Note: A variety of third-party applications provide an integrated checkpoint facility that is based on the writing of restart files. Checkpoint libraries are available from hardware vendors or from the public domain. Refer to the Condor project of the University of Wisconsin, for example.
■ Kernel-level transparent checkpointing - This level of checkpointing must be provided by the operating system, or by enhancements to it, that can be applied to any job. You do not need to change the source code or relink your application to use kernel-level checkpointing.
Kernel-level checkpointing can be applied to complete jobs, that is, to the process hierarchy created by a job. By contrast, user-level checkpointing is usually restricted to single programs. Therefore, the job in which such programs are embedded needs to properly handle cases where the entire job gets restarted.
Kernel-level checkpointing, as well as checkpointing based on checkpointing libraries, can consume many resources. The complete virtual address space that is in use by the job or application at the time of the checkpoint must be dumped to disk. By contrast, user-level checkpointing based on restart files can restrict the data that is written to the checkpoint to the important information only.
■ Transparent checkpointing - This method of checkpointing uses a checkpointing library, such as the one provided by the public domain package, Condor.
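The restart-file style of user-level checkpointing described above can be illustrated with a minimal sketch. It is not taken from the Grid Engine distribution: the file name restart.state, the variable names, and the step counter are hypothetical, and the "work" is a placeholder. The script records the last completed step after each iteration and, when started again, resumes from the recorded step.

```shell
#!/bin/sh
# Hypothetical user-level checkpointing sketch based on a restart file.
# RESTART_FILE and TOTAL_STEPS are illustrative names, not Grid Engine settings.
RESTART_FILE=${RESTART_FILE:-./restart.state}
TOTAL_STEPS=${TOTAL_STEPS:-5}

# Resume from the restart file if one exists; otherwise start at step 0.
if [ -r "$RESTART_FILE" ]; then
    read -r step < "$RESTART_FILE"
else
    step=0
fi

while [ "$step" -lt "$TOTAL_STEPS" ]; do
    step=$((step + 1))
    # ... the real work for this step would go here ...

    # Write the checkpoint to a temporary file and rename it, so that an
    # interrupted write does not leave a truncated restart file behind.
    echo "$step" > "$RESTART_FILE.tmp" && mv "$RESTART_FILE.tmp" "$RESTART_FILE"
done

echo "finished at step $step"
```

Because only the step counter is saved, the checkpoint stays small, which is the advantage of restart files over dumping the complete address space.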
3.1.3.2 Checkpointing Environments Overview
The Grid Engine software provides a configurable attribute description for each checkpointing method used. Different attribute descriptions reflect the different checkpointing methods and the potential variety of derivatives from these methods on different operating system architectures.
This attribute description is called a checkpointing environment. Default
checkpointing environments are provided with the distribution of the Grid Engine system and can be modified according to the site's needs.
New checkpointing methods can be integrated in principle. However, integrating a new method can be a challenging task that should be performed only by experienced personnel or by your Grid Engine system support team.
3.1.3.3 How to Configure Checkpointing Environments From the Command Line
To configure the checkpointing environment from the command line, use the following arguments for the qconf command:
■ To display a checkpointing environment, type the following command:
qconf -sckpt <checkpointname>
The -sckpt option (show checkpointing environment) prints the configuration of the specified checkpointing environment to standard output.
■ To display a list of all currently configured checkpointing environments, type the following command:
qconf -sckptl
The -sckptl option (show checkpointing environment list) displays a list of the names of all checkpointing environments currently configured.
■ To add a checkpointing environment, type the following command:
qconf -ackpt <checkpointname>
The -ackpt option (add checkpointing environment) displays an editor containing a checkpointing environment configuration template. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable. The parameter checkpointname specifies the name of the checkpointing environment. The name is already provided in the corresponding field of the template. To configure the checkpointing environment, change and save the template.
■ To add a checkpointing environment from file, type the following command:
qconf -Ackpt <filename>
The -Ackpt option (add checkpointing environment from file) parses the specified file and adds the new checkpointing environment configuration. The file must have the format of the checkpointing environment template.
■ To modify a checkpointing environment, type the following command:
qconf -mckpt <checkpointname>
The -mckpt option (modify checkpointing environment) displays an editor containing the specified checkpointing environment as a configuration template. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable. To modify the checkpointing environment, change and save the template.
■ To modify a checkpointing environment from file, type the following command:
qconf -Mckpt <filename>
The -Mckpt option (modify checkpointing environment from file) parses the specified file and modifies the existing checkpointing configuration. The file must have the format of the checkpointing environment template.
3.1.3.3.1 Example - Modifying the Migration Command of a Checkpoint Environment
#!/bin/sh
# ckptmod.sh: modify the migration command
# of a checkpointing environment
# Usage: ckptmod.sh <checkpoint-env-name> <full-path-to-command>
TMPFILE=/tmp/ckptmod.$$
CKPT=$1
MIGMETHOD=$2
# Dump the current configuration without its migr_command line
qconf -sckpt "$CKPT" | grep -v '^migr_command' > "$TMPFILE"
# Append the new migration command
echo "migr_command $MIGMETHOD" >> "$TMPFILE"
# Load the modified configuration, and then remove the temporary file
qconf -Mckpt "$TMPFILE"
rm "$TMPFILE"
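The core of the example script is a text filter: drop the existing migr_command line, then append a new one. The sketch below isolates that step so that it can be tried without a running Grid Engine system. The function name set_ckpt_attr, the sample configuration text, and the path /usr/local/bin/migrate.sh are hypothetical and serve only as an illustration.

```shell
#!/bin/sh
# set_ckpt_attr ATTR VALUE
# Hypothetical helper: reads a checkpointing environment configuration on
# standard input and writes it to standard output with ATTR set to VALUE.
set_ckpt_attr() {
    # Drop any existing line for the attribute ...
    grep -v "^$1[[:space:]]"
    # ... and append the replacement line.
    printf '%s %s\n' "$1" "$2"
}

# Made-up sample input; on a real system this would come from qconf -sckpt.
sample='ckpt_name demo
migr_command none
ckpt_dir /tmp'

result=$(printf '%s\n' "$sample" | set_ckpt_attr migr_command /usr/local/bin/migrate.sh)
printf '%s\n' "$result"
```

The filtered text can then be saved to a file and loaded with qconf -Mckpt, as the example script does.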
■ To delete a checkpointing environment, type the following command:
qconf -dckpt <checkpointname>
The -dckpt option (delete checkpointing environment) deletes the specified checkpointing environment.
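As a rough illustration of the file-based options, the following sketch writes a configuration file and passes it to qconf -Ackpt. The field names and values other than migr_command are an assumption based on our reading of the checkpoint file format, and the environment name demo_ckpt is hypothetical. Compare the file against the output of qconf -sckpt on your own system before relying on it.

```shell
#!/bin/sh
# Sketch: add a checkpointing environment from a file.
# CKPT_NAME is a hypothetical environment name.
CKPT_NAME=${CKPT_NAME:-demo_ckpt}
CKPT_FILE=${TMPDIR:-/tmp}/$CKPT_NAME.ckpt

# Assumed template layout; verify field names with: qconf -sckpt <checkpointname>
cat > "$CKPT_FILE" <<EOF
ckpt_name          $CKPT_NAME
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               sx
EOF

# Only call qconf if it is installed; otherwise leave the file for inspection.
if command -v qconf >/dev/null 2>&1; then
    qconf -Ackpt "$CKPT_FILE" && qconf -sckptl
else
    echo "qconf not found; template left in $CKPT_FILE"
fi
```

Keeping the configuration in a file makes it possible to review or version the environment definition before it is loaded, which the interactive -ackpt editor does not offer.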
3.1.3.4 How to Configure Checkpointing Environments With QMON
1. On the QMON Main Control window, click the Checkpoint Configuration button.