Chapter 7 Implementation of DECGrid and MODO Services
7.5 Services implementation
7.5.7 Computational service
The optimisation processes are performed within DECGrid, which consists of 8 computers as described in section 7.3. Processing power is used more efficiently by allocating jobs to machines, especially when they are idle. This service is provided by the Condor scheduling system, whose resource manager is configured to allow remote nodes to submit jobs to Condor pools. A Condor pool is made up of machines that put their resources together to accomplish a computational task. Condor allows jobs to flock from one pool to another when a pool is idle. A matchmaking algorithm is used to match job requests for resources to the available computational resources. Condor uses the Globus GRAM service to accomplish the cycle-stealing task.
MODO tasks are interdependent. For example, the mathematical model is built once a domain has been selected, and after that the criteria, parameters and constraints are provided. Condor uses the Directed Acyclic Graph (DAG) concept to locate resources on the network and allocate them to jobs. A DAG is a method of representing dependencies in which the inputs of one process are the outputs or data of another process. The DAGMan (DAG Manager) capability in Condor is used for efficient resource workflow management. This capability is used when inputs, outputs and some data depend on other programs, which is a common feature in MODO resource sharing.
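The dependency ordering described above can be illustrated with a minimal Python sketch. The task names and the `<task>.sub` submit-file naming are hypothetical and chosen only for illustration; the sketch orders the tasks so that each one runs after the tasks it depends on, and renders the same dependencies as DAGMan-style JOB and PARENT ... CHILD lines. It is not the workflow definition used in DECGrid.

```python
from graphlib import TopologicalSorter

# Hypothetical MODO tasks: each key depends on the tasks in its value set.
# The task names and the "<task>.sub" file names are illustrative assumptions.
DEPENDENCIES = {
    "select_domain": set(),
    "build_model": {"select_domain"},
    "set_criteria": {"build_model"},
    "run_optimisation": {"set_criteria"},
}


def execution_order(deps):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(deps).static_order())


def to_dagman(deps):
    """Render the dependencies as DAGMan input text."""
    order = execution_order(deps)
    # One JOB line per task, naming the (hypothetical) submit file.
    lines = [f"JOB {task} {task}.sub" for task in order]
    # One PARENT ... CHILD line per dependency edge.
    for child in order:
        for parent in sorted(deps[child]):
            lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dagman(DEPENDENCIES))
```

Because every task appears after its parents in the computed order, a task such as model building is never scheduled before domain selection has finished.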
The submission of jobs can be done by two methods. The first method is the use of Condor-G (Condor-Globus Resource Allocation Management) and the second method is the Globus submit command. To submit a job, a submit description file is required. This file contains information on the executable file, the certificate, the host that the job should be sent to for execution and the name of the grid. An example of a submit description file is shown in Figure 7.16.
Figure 7.16 is a description file that allows users to submit multiple jobs at the same time to different machines with different architectures running different operating systems in DECGrid. This is one of the advantages of using the grid platform for job processing: a job can be parallelised into sub-jobs that are sent to different nodes, which speeds up the computation. MyProxyHost and MyProxyServerDN are variables that show the host and server machines. Requirements is a variable that indicates the operating systems and architectures that the job can run on. For security reasons, a password is usually required before jobs are submitted so that the user is verified at the host.
Universe = DECGrid
Grid_resource = gt4 condor
MyProxyHost = isxp1314c.sims.cranfield.ac.uk:7512
MyProxyServerDN = /O=sims.cranfield.ac.uk/OU=People/CN=Gokop Goteng
MyProxyPassword = condorsuser
Executable = modoFile.$$(OpSys).$$(Arch)
Requirements = (OpSys == "WINDOWS" && Arch == "INTEL") || (OpSys == "LINUX" && Arch == "INTEL")
Error = modoErrFile.err
Input = modoInputFile.in
Output = modoOutputFile.out
Figure 7.16: Condor job submission description file
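The effect of the Requirements expression and of the $$(OpSys) and $$(Arch) placeholders in Figure 7.16 can be sketched in Python. The machine list below is hypothetical, and the functions only imitate the behaviour (a machine is eligible when it satisfies the expression, and each placeholder is replaced by the matched machine's attribute value). This is an illustration under those assumptions, not Condor's own matchmaking code.

```python
import re

# Hypothetical machine descriptions; attribute names follow Figure 7.16.
MACHINES = [
    {"Name": "node-a", "OpSys": "LINUX", "Arch": "INTEL"},
    {"Name": "node-b", "OpSys": "WINDOWS", "Arch": "INTEL"},
    {"Name": "node-c", "OpSys": "SOLARIS", "Arch": "SPARC"},
]


def meets_requirements(machine):
    """Python rendering of the Requirements expression in Figure 7.16."""
    return (
        (machine["OpSys"] == "WINDOWS" and machine["Arch"] == "INTEL")
        or (machine["OpSys"] == "LINUX" and machine["Arch"] == "INTEL")
    )


def resolve_executable(template, machine):
    """Replace each $$(Attr) with the matched machine's value for Attr."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: machine[m.group(1)], template)


def plan(template, machines):
    """Map each eligible machine name to the executable it would run."""
    return {
        m["Name"]: resolve_executable(template, m)
        for m in machines
        if meets_requirements(m)
    }


if __name__ == "__main__":
    for name, exe in plan("modoFile.$$(OpSys).$$(Arch)", MACHINES).items():
        print(name, "->", exe)
```

Under these assumptions a single submit description serves several platforms, provided a separate executable is built for each combination of operating system and architecture.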
The Condor configuration file is also set up so that jobs submitted from the Condor Master (server) are flocked (migrated) to worker (client) nodes for processing, especially when their processors are idle. This capability is implemented in DECGrid by configuring the FLOCK_FROM and FLOCK_TO properties as follows.
$FLOCK_FROM = "hostnames with semi-colon after each hostname if more than 1"
$FLOCK_TO = "hostnames with semi-colon after each hostname if more than 1"
$FLOCK_FROM names the hosts from which jobs migrate to the hosts listed in $FLOCK_TO. This means that when the nodes listed in $FLOCK_FROM are occupied with running jobs, they can send jobs to hosts in $FLOCK_TO that are idle. This improves the computational throughput of the grid. The cycle-stealing concept (using the processors of idle systems) is implemented in Condor to speed up job processing by setting the following properties in the Condor configuration: StartIdleTime (the amount of time for which the keyboard must be idle before a job is started at the node), KeyboardBusy (a Boolean expression that is TRUE when the keyboard is busy and FALSE when it is not), CPUIdle (a Boolean expression that is TRUE when the CPU is idle) and CPUBusy (a Boolean expression that is TRUE when the CPU is busy). With these properties set in the Condor configuration file, jobs were flocked to idle worker nodes to speed up processing and ensure efficient utilisation of the systems.
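The cycle-stealing decision described above can be sketched as a small Python model. The threshold values and the `NodeState` fields are illustrative assumptions, not the settings used in DECGrid; in Condor itself these conditions are written as expressions in the configuration file rather than as program code.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    keyboard_idle_s: float  # seconds since the last keyboard activity
    load_avg: float         # CPU load caused by the machine's owner


# Illustrative thresholds (assumptions, not DECGrid's actual settings).
START_IDLE_TIME_S = 15 * 60   # StartIdleTime
KEYBOARD_BUSY_BELOW_S = 60    # keyboard counts as busy if touched this recently
CPU_IDLE_MAX_LOAD = 0.3       # CPU counts as idle at or below this load
CPU_BUSY_MIN_LOAD = 0.5       # CPU counts as busy at or above this load


def keyboard_busy(node):
    return node.keyboard_idle_s < KEYBOARD_BUSY_BELOW_S


def cpu_idle(node):
    return node.load_avg <= CPU_IDLE_MAX_LOAD


def cpu_busy(node):
    return node.load_avg >= CPU_BUSY_MIN_LOAD


def should_start(node):
    """Start a job only when the keyboard has been idle long enough and the CPU is idle."""
    return node.keyboard_idle_s > START_IDLE_TIME_S and cpu_idle(node)


def should_suspend(node):
    """Give the machine back to its owner when the keyboard or the CPU becomes busy."""
    return keyboard_busy(node) or cpu_busy(node)


if __name__ == "__main__":
    print(should_start(NodeState(keyboard_idle_s=1200, load_avg=0.1)))
    print(should_suspend(NodeState(keyboard_idle_s=10, load_avg=0.1)))
```

Using separate idle and busy thresholds for the CPU leaves a band between them in which a running job is neither started nor suspended, which avoids rapid switching when the load hovers near a single limit.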
The design of DECGrid also takes into consideration resource management among distributed experts carrying out MODO. The important components for resource management are GRAM and GASS. GRAM handles the allocation of resources to different clients (worker nodes) for remote execution and job submission. When a job is submitted, the request is sent to the remote host and handled by a daemon built into GRAM that runs on each host. The daemon creates the job manager, which starts the job and monitors its progress. The job manager notifies the client when the job has finished processing. GASS provides file transfer between the worker nodes and the master server. A web-enabled GUI is used for parameter inputs, which are stored in files managed by GASS. Jobs are submitted through Globus GRAM using the Resource Specification Language (RSL).
This research uses the master-worker model to parallelise optimisation jobs. The optimisation algorithm runs on the master node (server) and the objective function evaluations are distributed to free nodes for computation. For each generation, the fitness functions are computed on separate worker nodes for each population and the results are sent back to the server for ranking, crossover and mutation; the new population is then sent back to the worker nodes for the next round of fitness computations. This process continues until the last generation. To do this, a multiple job submission description is provided. This is shown in Figure 7.17.
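The master-worker loop described above can be sketched in Python. This is a minimal illustration under stated assumptions: a local thread pool stands in for the grid worker nodes, the objective function is a toy sum of squares, and the selection, crossover and mutation operators are simple placeholders rather than the operators used in this research.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(individual):
    """Toy objective (an assumption): sum of squares, lower is better."""
    return sum(x * x for x in individual)


def evolve(pop_size=8, genes=3, generations=20, workers=4, seed=0):
    """Return (best_score, best_individual) after the given number of generations."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(genes)]
                  for _ in range(pop_size)]
    # The thread pool stands in for the worker nodes of the grid.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Workers: evaluate the fitness of every individual in parallel.
            scores = list(pool.map(fitness, population))
            # Master: rank, keep the better half, then apply crossover and mutation.
            ranked = [ind for _, ind in
                      sorted(zip(scores, population), key=lambda p: p[0])]
            parents = ranked[: pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, genes)       # one-point crossover
                child = a[:cut] + b[cut:]
                i = rng.randrange(genes)
                child[i] += rng.gauss(0, 0.1)       # small Gaussian mutation
                children.append(child)
            population = parents + children
        final_scores = list(pool.map(fitness, population))
    return min(zip(final_scores, population), key=lambda p: p[0])


if __name__ == "__main__":
    score, best = evolve()
    print(score, best)
```

Only the fitness evaluations are sent to the workers; ranking, crossover and mutation stay on the master, which matches the division of work described in the text.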
Figure 7.17 enables the submission of 2 jobs at the same time. The values 1 and 2 shown in the count elements give the number of jobs submitted. These jobs are submitted to different nodes. This is the technique used to parallelise jobs for submission in this research. The file ensures that the times of submission, execution and completion of the jobs are recorded.
<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:wsa="http://www.w3.org/addressing">
  <factoryEndpoint>
    <wsa:Address>https://isxp1313c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
  </factoryEndpoint>
  <directory>/usr/local/globus-4/</directory>
  <count>1</count>
  <job>
    <factoryEndpoint>
      <wsa:Address>https://isxp1314c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    </factoryEndpoint>
    <executable>/bin/date</executable>
    <count>2</count>
  </job>
  <job>
    <factoryEndpoint>
      <wsa:Address>https://isxp1315c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    </factoryEndpoint>
    <count>1</count>
  </job>
</multiJob>
Figure 7.17: Job submission definition file
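The structure of the multi-job description in Figure 7.17 can be inspected with a short Python sketch that lists the target node, executable and count of each sub-job. The function name `list_sub_jobs` is hypothetical, and the embedded XML is a copy of the figure without the XML declaration; the sketch only reads the file and does not submit anything to Globus.

```python
import xml.etree.ElementTree as ET

# Namespace URI exactly as it appears in Figure 7.17.
WSA = "http://www.w3.org/addressing"

MULTI_JOB = """<multiJob xmlns:wsa="http://www.w3.org/addressing">
  <factoryEndpoint>
    <wsa:Address>https://isxp1313c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
  </factoryEndpoint>
  <directory>/usr/local/globus-4/</directory>
  <count>1</count>
  <job>
    <factoryEndpoint>
      <wsa:Address>https://isxp1314c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    </factoryEndpoint>
    <executable>/bin/date</executable>
    <count>2</count>
  </job>
  <job>
    <factoryEndpoint>
      <wsa:Address>https://isxp1315c.sims.cranfield.ac.uk:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    </factoryEndpoint>
    <count>1</count>
  </job>
</multiJob>"""


def list_sub_jobs(xml_text):
    """Return one dict per <job> element with its address, executable and count."""
    root = ET.fromstring(xml_text)
    jobs = []
    for job in root.findall("job"):
        address = job.findtext(f"factoryEndpoint/{{{WSA}}}Address", default="")
        jobs.append({
            "address": address.strip(),
            # None when the sub-job does not name an executable.
            "executable": job.findtext("executable"),
            "count": int(job.findtext("count", default="1")),
        })
    return jobs


if __name__ == "__main__":
    for entry in list_sub_jobs(MULTI_JOB):
        print(entry)
```

A check of this kind makes it easy to confirm, before submission, that each sub-job points at a different node.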