Condor is a batch scheduling DRM first developed in 1988 at the Computer Science Department, University of Wisconsin-Madison. It aims to harvest unused computational cycles from a set of user workstations. Condor provides scheduling with different policies and prioritization, resource monitoring, and job management. Resource owners keep full control of their hardware and can set policies on its acceptable use. Condor uses its own ClassAd language [Ram00], which allows hardware owners to describe their resources and users to specify resource requests. A matchmaker component compares the two to match job requirements with appropriate execution hardware. Currently the UCL Condor Pool contains about 1400 Windows PCs, and the UCL CS Condor pool has about 60 lab machines running Linux.
A.1.1 Run Applications in Condor
A job is submitted to Condor using the condor_submit command. condor_submit takes as its argument the name of a file called the submit description file. This file contains items such as the name of the executable to run, the initial working directory, command-line arguments to the program, and resource specifications.
condor_submit creates a job ClassAd from this information, and Condor then works toward running the job. Condor acts as a matchmaker by continuously reading all the job ClassAds and all the machine ClassAds, then matching and ranking job ads against machine ads. Condor makes certain that all requirements in both ClassAds are satisfied.
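The matching and ranking step described above can be illustrated with a small Python sketch. It is not Condor's matchmaker and does not use the ClassAd language: the `Ad` class, the `match` function, and the sample machines are hypothetical names introduced only for illustration, and requirements are modelled as plain Python predicates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Ad:
    """Toy stand-in for a ClassAd (hypothetical, not Condor's data model)."""
    attrs: Dict[str, object]
    # Predicate evaluated against the OTHER party's attributes.
    requirements: Callable[[Dict[str, object]], bool]
    # Preference score for the other party; higher is better.
    rank: Callable[[Dict[str, object]], float] = lambda other: 0.0


def match(job: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Return the machine the job ranks highest among those where
    both the job's and the machine's requirements are satisfied."""
    candidates = [
        m for m in machines
        if job.requirements(m.attrs) and m.requirements(job.attrs)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job.rank(m.attrs))


# Assumed sample data: one job ad and three machine ads.
job = Ad(
    attrs={"Owner": "alice", "ImageSize": 28},
    requirements=lambda m: m["Memory"] >= 32
    and m["OpSys"] == "IRIX65" and m["Arch"] == "SGI",
    rank=lambda m: float(m["Memory"] >= 64),
)
small = Ad({"Name": "small", "Memory": 48, "OpSys": "IRIX65", "Arch": "SGI"},
           requirements=lambda j: j["ImageSize"] <= 48)
big = Ad({"Name": "big", "Memory": 128, "OpSys": "IRIX65", "Arch": "SGI"},
         requirements=lambda j: j["ImageSize"] <= 128)
linux = Ad({"Name": "linux", "Memory": 256, "OpSys": "LINUX", "Arch": "INTEL"},
           requirements=lambda j: True)

best = match(job, [small, big, linux])
```

Both `small` and `big` satisfy the two-way requirements check, and the job's rank expression then prefers `big`. The `linux` machine is excluded even though it has the most memory, because the job's requirements reject its platform.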
For example, Listing A.1 queues 150 batch tasks of a program called foo, which has been compiled and linked using condor_compile specifically for Silicon Graphics workstations running IRIX 6.5. The job requires Condor to run the program on machines of the specified platform that have at least 32 megabytes of physical memory. It expresses a preference for machines with at least 64 megabytes of physical memory, if such machines are available. The Image_Size command advises Condor that the program will use up to 28 megabytes of memory when running. Because these tasks are independent of each other, Condor dispatches them one by one for as long as resources are available.
Listing A.1: Condor simple job submission sample
    Executable   = foo
    Requirements = Memory >= 32 && OpSys == "IRIX65" && Arch == "SGI"
    Rank         = Memory >= 64
    Image_Size   = 28 Meg
    Error        = err.$(Process)
    Input        = in.$(Process)
    Output       = out.$(Process)
    Log          = foo.log
    Queue 150
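To make the effect of Queue 150 and the $(Process) macro concrete, the following Python sketch expands a submit description into one entry per queued task. It is only an illustration under stated assumptions: `expand_queue` is a hypothetical helper, not part of Condor, and it assumes that $(Process) takes the values 0 to N-1 for Queue N.

```python
from typing import Dict, List


def expand_queue(template: Dict[str, str], count: int) -> List[Dict[str, str]]:
    """Expand a submit description into one dict per queued process,
    substituting the process number for every $(Process) occurrence."""
    jobs = []
    for process in range(count):  # assumed numbering: 0 .. count-1
        jobs.append({
            key: value.replace("$(Process)", str(process))
            for key, value in template.items()
        })
    return jobs


# The file-related commands from Listing A.1.
template = {
    "Executable": "foo",
    "Error": "err.$(Process)",
    "Input": "in.$(Process)",
    "Output": "out.$(Process)",
    "Log": "foo.log",
}

jobs = expand_queue(template, 150)
print(jobs[0]["Input"], jobs[149]["Output"])
```

Each task therefore reads its own input file and writes its own output and error files, while all 150 tasks share the single log file foo.log.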
Condor's parallel universe supports parallel jobs that need to be co-scheduled. A co-scheduled job has more than one process, and all of its processes must be running at the same time on different machines for it to work correctly. Condor must be configured so that the resources (machines) running parallel jobs are dedicated. Listing A.2 specifies the universe as parallel, letting Condor know that dedicated resources are required. The machine_count command gives the number of machines required by the job. When the job is submitted, the dedicated scheduler allocates eight machines with the same architecture and operating system as the submit machine. It waits until all eight machines are available before starting the job. When all the machines are ready, it invokes the /bin/foo command on all eight machines more or less simultaneously.
Listing A.2: Condor parallel job submission sample
    universe      = parallel
    executable    = /bin/foo
    machine_count = 8
    queue
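The all-or-nothing behaviour of co-scheduling can be sketched in Python as follows. This is a simplified illustration, not the dedicated scheduler's actual algorithm: `allocate_dedicated` and the machine dictionaries are hypothetical, and the sketch only captures the rule that the job does not start until the full machine count is available on the submit machine's platform.

```python
from typing import Dict, List, Optional


def allocate_dedicated(idle_machines: List[Dict[str, str]],
                       submit_arch: str,
                       submit_opsys: str,
                       machine_count: int) -> Optional[List[Dict[str, str]]]:
    """Return exactly machine_count idle machines matching the submit
    machine's platform, or None if too few are currently available."""
    matching = [
        m for m in idle_machines
        if m["Arch"] == submit_arch and m["OpSys"] == submit_opsys
    ]
    if len(matching) < machine_count:
        return None  # keep waiting; a partial start would break the job
    return matching[:machine_count]


# Assumed pool: seven idle Linux/Intel machines, then two more become idle.
pool = [{"Name": f"node{i}", "Arch": "INTEL", "OpSys": "LINUX"} for i in range(7)]
first_try = allocate_dedicated(pool, "INTEL", "LINUX", 8)

pool += [{"Name": f"node{i}", "Arch": "INTEL", "OpSys": "LINUX"} for i in (7, 8)]
second_try = allocate_dedicated(pool, "INTEL", "LINUX", 8)
```

With seven idle machines the allocation fails and the job stays queued. Once nine are idle, eight are claimed together and the executable can be started on all of them.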
Listing A.3: Condor-PVM job submission sample
    universe      = PVM
    # The executable of the master PVM program is "master.exe".
    executable    = master.exe
    Requirements  = (Arch == "INTEL") && (OpSys == "LINUX")
    machine_count = 2..4
    queue
Condor has a special parallel environment (Condor-PVM) to support dynamic PVM applications. In Condor-PVM [Con, GKLY00], when a PVM program asks for nodes (machines), the request is remapped to Condor. Condor then finds a machine in the Condor pool via its usual mechanisms and adds it to the PVM virtual machine. If a machine needs to leave the pool, the PVM program is notified of that as well via the normal PVM mechanisms. Listing A.3 shows an example submit file for Condor-PVM. By using machine_count = <min>..<max>, the submit file tells Condor that at least <min> machines of the current class must be available before the PVM master is started. It also asks Condor to provide up to <max> machines. During the execution of the program, the application may request more machines of each class by calling pvm_addhosts() with a string