2.3 CONDOR DAGMAN
2.3.3 Condor
Condor [23] is a specialized resource management system developed at the University of Wisconsin-Madison. It builds a High Throughput Computing (HTC) environment from large collections of distributed computing resources, ranging from desktop computers to supercomputers.
Condor also acts as a batch scheduler, and accordingly provides queuing mechanisms, scheduling policies, and priority schemes. End-users submit their computational jobs through an interface, and the jobs are placed in one of the Condor job queues. Condor then takes care of handling and executing each job: the job is dispatched to a specific resource and executed, after which the user is informed of the results.
Job dispatch in Condor is driven by its matchmaking mechanism. Through matchmaking, jobs and available resources are paired according to the requirements stated by the job and the classified advertisements (ClassAds) published by the resources. When more than one resource satisfies the requirements of a job, the resource with the highest rank value is chosen.
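The selection rule described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: machine advertisements are reduced to plain dictionaries, the attribute names (`memory_mb`, `mips`) and machine names are hypothetical, and only the job side of the match is modelled, whereas in Condor the resource may also state its own requirements and rank.

```python
from typing import Callable, Optional

# A machine advertisement reduced to a dict of attributes (hypothetical names).
Machine = dict


def match(requirements: Callable[[Machine], bool],
          rank: Callable[[Machine], float],
          machines: list[Machine]) -> Optional[Machine]:
    """Return the eligible machine with the highest rank, or None if none qualifies."""
    eligible = [m for m in machines if requirements(m)]
    if not eligible:
        return None
    return max(eligible, key=rank)


# Illustrative pool; these machines and their values are made up.
machines = [
    {"name": "desk01", "memory_mb": 2048, "mips": 900},
    {"name": "node07", "memory_mb": 8192, "mips": 1500},
    {"name": "node12", "memory_mb": 16384, "mips": 1200},
]

# The job requires at least 4 GB of memory and prefers faster machines.
best = match(
    requirements=lambda m: m["memory_mb"] >= 4096,
    rank=lambda m: m["mips"],
    machines=machines,
)
print(best["name"])  # node07
```

Two machines satisfy the memory requirement here, and the one with the higher rank value is selected.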
Condor provides support for checkpoint and migration mechanisms. This way, should there be a fault or change in the utilization of a resource, execution of a task can be continued on a different resource without going over the same calculations again.
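The idea of resuming without repeating earlier calculations can be sketched at the application level. This is not how Condor implements checkpointing (Condor checkpoints the process itself, without the application saving its own state); the sketch below is an explicit, hand-written analogue, and the function name, file name, and checkpoint interval are all hypothetical.

```python
import json
import os
import tempfile


def sum_of_squares(n: int, checkpoint_path: str, every: int = 1000) -> int:
    """Sum i*i for i in [0, n), saving progress so that a restart can resume."""
    i, total = 0, 0
    if os.path.exists(checkpoint_path):
        # Resume from the last saved state instead of starting over.
        with open(checkpoint_path) as f:
            state = json.load(f)
        i, total = state["i"], state["total"]
    while i < n:
        total += i * i
        i += 1
        if i % every == 0:
            # Write to a temporary file first so that a crash cannot
            # leave a half-written checkpoint behind.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"i": i, "total": total}, f)
            os.replace(tmp, checkpoint_path)
    return total


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "state.json")
    first = sum_of_squares(2500, path)
    # The second call finds the checkpoint saved at i = 2000 and only
    # computes the remaining 500 terms.
    resumed = sum_of_squares(2500, path)
```

If the checkpoint file is carried to another machine, the second call could equally run there, which is the essence of migration.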
Condor also makes it easy for users to run the same job many times with varying input parameters or data. Such job submissions are simple to construct, and the outputs of the individual runs are kept organized. This is an important feature for scientific experimentation, where parameter studies are common.
2.3.3.1 Submitting and Running Jobs with Condor
The following steps are needed to run a job using Condor.
Code preparation: A job run under Condor must be able to run as a background batch job. As such, applications that require interactive input/output are not suitable to run under Condor. However, an application may be made Condor-ready by replacing its interactive portions with file-based input and output.
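As an illustration of this preparation step, the sketch below shows a small program that takes its parameters from an input file and writes its result to an output file instead of prompting the user. The program itself, its file format (two integers separated by whitespace), and all function names are hypothetical.

```python
import sys


def read_bounds(path: str) -> tuple[int, int]:
    """Read 'low high' from a text file instead of prompting the user."""
    with open(path) as f:
        low, high = f.read().split()
    return int(low), int(high)


def is_prime(n: int) -> bool:
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def count_primes(low: int, high: int) -> int:
    """Count the primes p with low <= p < high."""
    return sum(1 for n in range(low, high) if is_prime(n))


def main(in_path: str, out_path: str) -> None:
    low, high = read_bounds(in_path)
    with open(out_path, "w") as f:
        f.write(f"{count_primes(low, high)}\n")


# Batch-style invocation: python app.py <input file> <output file>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because nothing is read from the keyboard or written to an interactive terminal, such a program can run unattended as a background job.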
Choosing the Condor universe: A universe in Condor defines an execution environment for the application. Condor supports several universes for user jobs: standard, vanilla, grid, java, scheduler, local, parallel, and VM. The universe under which a job runs is specified in the submit description file. If a universe is not specified, the vanilla universe is selected by default. The standard universe provides checkpointing and job migration support; however, it imposes some restrictions on applications before they are eligible to use it. The vanilla universe provides less reliability, but it is the simplest and most straightforward environment and is suitable for most applications.
The grid universe allows users to submit jobs to a remote resource management system with little effort. The java universe allows users to run jobs using the Java Virtual Machine (JVM), with Condor taking care of the Java-specific configuration details (e.g. locating the JVM binary, setting the ‘classpath’). The scheduler universe is typically used to run the Condor DAGMan metascheduler; as a consequence, Condor DAGMan itself runs as a Condor job on the submit host. The parallel universe is intended for distributed-memory programs (e.g. MPI jobs). The VM universe allows users to run jobs on a virtual machine by providing the proper disk image and infrastructure for the application.
Preparing the submit description file: All the requirements and details of a job have to be specified in a submit description file. The basic information found in this file includes the executable/binary to be run for the job, input parameters, file information regarding input/output data, resource-related requirements, the ranking method for eligible resources, and user notification information (e.g. an email address). In this file, the user can also specify how many times to run the job and where to put the associated data for each individual run.
Job submission: The job is submitted to the Condor queue for execution via a simple ‘condor_submit’ command.
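The submission step could be scripted as in the sketch below, which invokes ‘condor_submit’ with the name of a submit description file as its argument. The helper names and the file name `prime.sub` are hypothetical, and the sketch assumes that ‘condor_submit’ is on the PATH of the submit host.

```python
import shutil
import subprocess


def submit_command(submit_file: str) -> list[str]:
    """Build the command line that hands a submit description file to Condor."""
    return ["condor_submit", submit_file]


def submit(submit_file: str) -> str:
    """Run condor_submit and return whatever it prints on standard output."""
    if shutil.which("condor_submit") is None:
        raise RuntimeError("condor_submit was not found on PATH")
    result = subprocess.run(
        submit_command(submit_file),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the submission is rejected
    )
    return result.stdout


# Hypothetical file name; nothing is executed by building the command.
print(" ".join(submit_command("prime.sub")))
```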
2.3.3.2 Condor Submit Description File
A Condor submit description file specifies all the requirements and details of the job to be run by Condor. Accordingly, it contains various keywords and parameters that describe the job, such as the executable/binary to be run, command-line arguments, working directory, input/output files, resource requirement specifications, and so on.
Condor offers an easy and convenient way to run multiple copies of the same program. In the submit description file, users can specify that each run should use a different data set and read and write its own files. Each run can have its own working directory, input/output/error files, and command-line arguments.
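One way to express such a multi-run submission is with the `$(Process)` macro, which Condor expands to 0, 1, 2, ... for the successive jobs created by a single `queue N` statement. The sketch below generates a submit description file of this kind; the executable path, the directory naming scheme `run_$(Process)`, the output file names, and the choice of passing the process number as the only argument are illustrative assumptions.

```python
def sweep_submit_text(executable: str, runs: int) -> str:
    """Return a submit description that queues `runs` copies of one program."""
    lines = [
        "# Each copy runs in its own directory: run_0, run_1, ...",
        "universe   = vanilla",
        f"executable = {executable}",
        "initialdir = run_$(Process)",
        "# Each copy receives its own process number as its argument.",
        "arguments  = $(Process)",
        "output     = job.out",
        "error      = job.err",
        "log        = job.log",
        f"queue {runs}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder path; replace with the real location of the application.
print(sweep_submit_text("/path/to/app", 3))
```

The directories named by `initialdir` are expected to exist before submission, so a script using this approach would typically create them first.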
The sample submit description file illustrated in Fig. 2.8 submits two copies of the application ‘prime’ found under the ‘/home/selim’ directory. The first copy runs under directory prime_1, and the second runs under directory prime_2. Output, error, and log files are associated with each run separately: the first copy generates the files prime.out, prime.err, and prime.log under the prime_1 directory, whereas the second copy generates files with the same names under the prime_2 directory. The two copies, however, receive different command-line arguments. The first copy finds the prime numbers between 2 and 1000000, whereas the second copy finds the prime numbers between 1000000 and 2000000. For both copies, the standard universe is selected as the execution environment, which provides additional reliability for the execution of ‘prime’. Both copies require that the resource executing the application has at least 4 GB of RAM.
Figure 2.8: Sample Condor submit description file
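A submit description file consistent with the description of Fig. 2.8 might look like the sketch below, held here as a Python string so that it can be written to disk or inspected. It is a reconstruction from the prose, not a copy of the figure: the exact form of the arguments and the memory expression `Memory >= 4096` (the machine attribute `Memory` is expressed in megabytes) are assumptions.

```python
# Hypothetical reconstruction of a submit description file for 'prime'.
PRIME_SUBMIT = """\
# Settings shared by both copies
universe     = standard
executable   = /home/selim/prime
output       = prime.out
error        = prime.err
log          = prime.log
requirements = Memory >= 4096

# First copy: primes between 2 and 1000000, run under prime_1
initialdir   = prime_1
arguments    = 2 1000000
queue

# Second copy: primes between 1000000 and 2000000, run under prime_2
initialdir   = prime_2
arguments    = 1000000 2000000
queue
"""

print(PRIME_SUBMIT)
```

Settings placed before the first `queue` statement apply to both copies, while `initialdir` and `arguments` are redefined before each `queue` so that every copy gets its own working directory and range.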