FAULT TOLERANT SCHEDULING STRATEGY FOR COMPUTATIONAL GRID ENVIRONMENT

MALARVIZHI NANDAGOPAL

Ramanujan Computing Centre Anna University Chennai Chennai, Tamilnadu, India

V. RHYMEND UTHARIARAJ

Ramanujan Computing Centre Anna University Chennai Chennai, Tamilnadu, India

Abstract:

Computational grids have the potential for solving large-scale scientific applications using heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise because of the unreliable nature of grid infrastructure. Two major problems that are critical to the effective utilization of computational resources are efficient scheduling of jobs and providing fault tolerance in a reliable manner. This paper addresses these problems by combining a checkpoint-replication-based fault tolerance mechanism with the Minimum Total Time to Release (MTTR) job scheduling algorithm. TTR includes the service time of the job, the waiting time in the queue, and the transfer of input and output data to and from the resource. The MTTR algorithm minimizes the TTR by selecting a computational resource based on job requirements, job characteristics and hardware features of the resources. The fault tolerance mechanism used here sets job checkpoints based on the resource failure rate. If a resource failure occurs, the job is restarted from its last successful state using a checkpoint file from another grid resource. A critical aspect of automatic recovery is the availability of checkpoint files; a strategy to increase their availability is replication. A Replica Resource Selection Algorithm (RRSA) is proposed to provide a Checkpoint Replication Service (CRS). The Globus Toolkit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach. The monitoring tools Ganglia and NWS (Network Weather Service) are used to gather hardware and network details respectively. The experimental results demonstrate that the proposed approach effectively schedules grid jobs in a fault-tolerant way, thereby reducing the TTR of jobs submitted to the grid. It also increases the percentage of jobs completed within the specified deadline, making the grid trustworthy.

Keywords: Grid Resource Management; Grid Job Scheduling; Checkpoint Replication; Fault Tolerance.

1. Introduction

Grid computing has emerged as the next-generation parallel and distributed computing methodology that aggregates dispersed heterogeneous resources for solving various kinds of large-scale parallel applications in science, engineering and commerce [6]. A Grid enables sharing, selection, and aggregation of a wide variety of geographically distributed resources including supercomputers, storage systems, data sources and specialized devices owned by different organizations. Management of these resources is an important infrastructure in the grid computing environment. It becomes complex as the resources are geographically distributed, heterogeneous in nature, owned by different individual or organizations with their own policies, have different access models, and have dynamically varying loads and availability.

With efficient scheduling methods, the system can achieve better performance and applications can avoid unnecessary delays.

The emergence of grid computing further increases the importance of fault tolerance. Grid computing will impose a number of unique new conceptual and technical challenges to fault-tolerance researchers. Some of the factors due to which the probability of faults in a grid environment is much higher than a traditional distributed system [9],[10] are lack of centralized environment, predominant execution of long jobs, highly dynamic resource availability, diverse geographical distribution of resources and heterogeneous nature of grid resources. Thus, fault tolerance related features must be incorporated in grid job scheduling to improve the performance of the grid system.

The number of dynamic resources in the grid system increases continuously, so fault tolerance becomes a critical property for applications running on these resources. However, in traditional implementations, when a failure occurs, the whole application is shut down and has to be restarted from the beginning [8]. A technique to avoid restarting the application from the beginning is rollback recovery [3], which is based on the concept of a checkpoint. The checkpoint mechanism is used to reduce the limitations imposed by the high volatility of resources. It periodically saves the application's state to stable storage, so whenever a failure interrupts a computation, the application can be resumed from the last stable checkpoint.

Some of the existing fault tolerance and recovery mechanisms, such as checkpoint-recovery and over-provisioning, are discussed in [16], [17]. Checkpoint-recovery techniques make it possible for a job to resume execution from the last checkpoint instead of restarting from the beginning whenever a failure occurs. Over-provisioning [11] techniques replicate a job on more than one resource to increase the probability of successful execution. Although these techniques address the reliability challenges to some extent, no large-scale study has been done on how effective they are when coupled with scheduling.

The usual solution for saving generated checkpoints is to install checkpoint servers and connect them to the resources through a high-speed network. But a dedicated server can easily become a bottleneck as the grid size increases, and checkpoint data stored on a single server may be unavailable when requested to restart a failed application. One way to solve this problem is to store multiple replicas of the checkpoint data, so that the stored data can be recovered even when the checkpoint server is unavailable. This way, the computation time lost since the last usable checkpoint was saved is reduced. In this paper, the performance of the checkpoint replication and checkpoint recovery based fault tolerance mechanism is analyzed in combination with the MTTR scheduling technique.

Some of the main contributions of the study include:

(i) The need for a job scheduling strategy with a fault tolerance mechanism for a computational grid environment is advocated. An architecture for a fault-tolerant job scheduling model with CRS is proposed. The model for the MTTR scheduling strategy presented in [15] is modified: a checkpoint-recovery-based fault-tolerant feature, named the MTTR with Checkpoint Set (MTTR_CS) strategy, is added to MTTR scheduling along with CRS.

(ii) The proposed strategy uses a job checkpointing recovery mechanism, which enables the grid to complete jobs within the specified deadline by using the result of the last saved checkpoint when a fault occurs at a grid resource.

(iii) The proposed strategy is implemented using the Globus Toolkit. NWS and Ganglia are integrated into the Monitoring and Discovery System (MDS) service of Globus to collect the additional resource details required by the algorithm.

(iv) A checkpoint replication service is provided using the RRSA algorithm.

(v) Finally, the performance of MTTR is compared with MTTR_CS (i.e., without fault tolerance versus with fault tolerance). MTTR_CS is also compared with the time optimization strategy.

2. Related Work


In the grid environment, in case of a resource failure, an application is restarted on another grid resource. If the application's execution state is saved, the application can be restarted from its last successful state. Checkpoint files are needed to store the state of the application, and they are stored in a checkpoint server. As the grid size increases, the application may not be restartable from another resource, because a dedicated checkpoint server can easily become a bottleneck. Hence, replication of the stored checkpoint files [13] is essential: checkpoints can then be retrieved from any of the replica resources and the application restarted from its last successful execution state, thereby reducing its response time. Grid scheduling based on job execution time is discussed elaborately in [12]. In that research, a program is analyzed in segments for execution time, and these times are combined to give the total execution time of the program; a job is scheduled to the resource that executes it fastest. That experiment considers neither the waiting time of a job in the resource queue nor the transfer time, and a fault tolerance feature is also missing. It differs from the proposed work, which considers transfer time and queue wait time when scheduling jobs to resources and also provides a checkpoint-based fault tolerance strategy.

This paper focuses on job scheduling with a checkpointing-based fault tolerance strategy along with a checkpoint replication service for the computational grid environment. The total time to release needed for a job in the grid system is analyzed. A scheduling algorithm selects and assigns jobs to resources in a grid computing system. The fault tolerance (FT) mechanism sets job checkpoints based on the failure rate of the grid resource. The RRSA algorithm selects the resources for checkpoint replication. This avoids unnecessary overheads and improves grid system performance.

3. Problem Formulation

Grid jobs are executed by the computational grid as follows:

(i) Grid users submit their jobs to the grid scheduler by specifying their QoS requirements, i.e., deadline in which users want their jobs to be executed, the number of processors and type of operating system.

(ii) The grid scheduler schedules user jobs on the best available resource by optimizing time.

(iii) The result of the job is returned to the user upon successful completion of the job.

Such a computational grid environment has two major drawbacks:

1. If a fault occurs at a grid resource, the job is rescheduled on another resource, which eventually results in failing to satisfy the user's QoS requirement, i.e., the deadline. The reason is simple: as the job is re-executed, it consumes more time.

2. In computational grid environments, there are resources that fulfill the deadline constraint but have a tendency toward faults. In such a scenario, the grid scheduler still selects such a resource for the mere reason that it promises to meet the user's requirements for the grid jobs. This eventually results in compromising the user's QoS parameters in order to complete the job.

In this paper, the first problem is addressed by using a job checkpointing strategy to tolerate faults gracefully, as it is able to restore a partially completed job from the last checkpoint. The second problem is addressed by making the checkpointing strategy adaptive through a fault index, maintained from the fault occurrence history of each grid resource. In this way, checkpoints are introduced mostly when necessary. The application can only be restarted from the last known state if the checkpoint is available; to increase the availability of checkpoints, CRS is used. Using CRS, the current application state can be obtained at any time. Simulation experiments reveal that the proposed strategy is able to tolerate faults by taking appropriate measures according to resource vulnerability toward faults.

4. Fault Tolerant Based Job Scheduling with CRS


The fault index of every available resource in the grid is maintained and updated. The fault index of a grid resource suggests its vulnerability to faults (i.e., the higher the fault index, the higher the failure rate). The fault index of a grid resource is incremented every time the resource does not complete the assigned job within the deadline, and also on resource failure. The fault index of a resource is decremented whenever the resource completes the assigned job within the deadline. The availability of created checkpoints is increased by providing CRS. A key feature of this service is the ability to replicate and monitor checkpoints in the grid environment. Hence the current application state can be obtained at any time and does not depend on the checkpoint server.

4.1 Components of Proposed Strategy

The interaction between the different components of a computational grid in the proposed scheduling and fault tolerance strategy is shown in figure 1. In the proposed approach, in addition to the scheduler, which reduces the response time of the job, the fault tolerance mechanism avoids wasting time redoing a partially completed job from scratch. Instead, it restarts the failed job from the last saved checkpoint.

A grid resource is a member of a grid that offers computing services to grid users. Grid users register themselves with the Grid Information Server (GIS) of a grid by specifying QoS requirements such as the deadline to complete the execution, the number of processors, the type of operating system and so on.

Fig. 1. Architecture Diagram for Grid Scheduling with Fault Tolerance Strategy

The components used in the proposed architecture are described below:

Scheduler

Scheduler is an important entity of a grid. Scheduler receives jobs from grid users. It selects feasible resources for those jobs according to acquired information from GIS. Then it generates job-to-resource mappings. The entities of scheduler are Schedule Manager, MatchMaker, Response Time Estimator, Resource Selector and Job Dispatcher.

The Response Time Estimator estimates the response time of a job from its transmission time, waiting time and service time. The Resource Selector selects the resource with the minimum response time. The Job Dispatcher dispatches the jobs one by one to the checkpoint manager.

GIS

GIS contains information about all available grid resources. It maintains details of the resource such as processor speed, memory available, load and so on. All grid resources that join and leave the grid are monitored by GIS. Whenever a scheduler has jobs to execute, it consults GIS to get information about available grid resources.

Checkpoint Manager

It receives the scheduled job from the scheduler and sets a checkpoint interval based on the failure rate of the resource on which the job is scheduled. Then it submits the job to the resource. The checkpoint manager receives a job completion or job failure message from the grid resource and responds accordingly. During execution, if a job failure occurs, the job is rescheduled from the last checkpoint instead of running from scratch. The checkpoint manager implements the checkpoint setter algorithm to set job checkpoints.

Checkpoint Server

On each checkpoint set by the checkpoint manager, the job status is reported to the checkpoint server. The checkpoint server saves the job status and returns it on demand, i.e., during job/resource failure. For a particular job, the checkpoint server discards the result of the previous checkpoint when a new checkpoint result is received.

Fault Index Manager

The Fault Index Manager maintains the fault index value of each resource, which indicates the failure rate of that resource. The fault index of a grid resource is incremented every time the resource does not complete the assigned job within the deadline, and also on resource failure. The fault index of a resource is decremented whenever the resource completes the assigned job within the deadline. The fault index manager updates the fault index of a grid resource using the fault index update algorithm.

Checkpoint Replication Server

When a new checkpoint is created, the Checkpoint Replication Server initiates CRS, which replicates the created checkpoints onto remote resources by applying RRSA. Once replicated, the details are stored in the Checkpoint Server. To obtain information about all checkpoint files, the Replication Server queries the Checkpoint Server. During the entire application runtime, CRS monitors the Checkpoint Server to detect newer checkpoint versions. Information about available resources, hardware, memory and bandwidth is obtained from GIS; the NWS and Ganglia tools determine these details and periodically propagate them to the GIS. Depending on transfer sizes, the available storage of the resources and current bandwidths, CRS selects suitable resources using RRSA to replicate the checkpoint file.

Estimating Total Time to Release (TTR)

The TTR of a job includes transmission time of input and output data to and from the resource, waiting time of the job in the resource queue and the service time of the job when assigned to the resource. TTR is defined as Eq. (1)

TTR = Transmission Time + Waiting Time + Service Time   (1)

The estimated transmission time of job i from the scheduler to resource r can be determined by Eq. (2), where M_i is the size of job i and Bandwidth_r is the network bandwidth between the scheduler and resource r.

Transmission Time_ir = M_i / Bandwidth_r   (2)

The waiting time of job i in the queue of resource r can be estimated as the sum of the service times of all jobs assigned to r that are still waiting in its queue when job i arrives, as defined in Eq. (3).

Waiting Time_ir = Σ (j = 1 .. Queue Length) Service Time_jr   (3)

The service time of job j on resource r is defined as the number of instructions in job j divided by the processing speed of resource r, as in Eq. (4).

Service Time_jr = Number of Instructions_j / Speed_r   (4)
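Eqs. (1)–(4) can be combined in a small sketch (argument names are illustrative; units are assumed consistent, e.g. MB, MB/s, instructions, instructions/s):

```python
def total_time_to_release(job_size, bandwidth, queued_service_times,
                          num_instructions, speed):
    """Estimate the TTR of a job on a candidate resource, per Eqs. (1)-(4)."""
    transmission_time = job_size / bandwidth       # Eq. (2): M_i / Bandwidth_r
    waiting_time = sum(queued_service_times)       # Eq. (3): jobs already queued at r
    service_time = num_instructions / speed        # Eq. (4): instructions / Speed_r
    return transmission_time + waiting_time + service_time  # Eq. (1)
```

The MTTR scheduler would evaluate this estimate for every feasible resource and select the one with the minimum value.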

Checkpoint Setter Algorithm

Input: Fault index of a resource
Output: Interval at which the checkpoint is to be set

F: fault index of the selected grid resource
F(i), i = 0, 1, . . ., N, are integers such that F(0) < F(1) < · · · < F(N)
C(i), i = 1, 2, . . ., N, are the times in milliseconds (ms) at which the checkpoint is to be set, such that C(1) > C(2) > · · · > C(N)

1. IF (F == F(1)) THEN
   (i) The job is queued to that resource with a checkpoint interval C(1).
   (ii) GOTO STEP 6.
2. IF (F == F(2)) THEN
   (i) The job is queued to that resource with a checkpoint interval C(2).
   (ii) GOTO STEP 6.
3. IF (F == F(3)) THEN
   (i) The job is queued to that resource with a checkpoint interval C(3).
   (ii) GOTO STEP 6.
4. IF (F == F(N)) THEN
   (i) The job is queued to that resource with a checkpoint interval C(N).
   (ii) GOTO STEP 6.
5. IF (F > F(N)) THEN
   (i) Remove the resource from the available resources list and label it as unavailable, i.e., no job is assigned to that resource.
   (ii) Add the job to the unassigned job list and reschedule it.
6. EXIT
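As a sketch, the case analysis above can be generalized to threshold ranges (reading the equality tests of steps 1–4 as "falls at or below the next threshold"; names are illustrative, not from the paper):

```python
def checkpoint_interval(fault_index, thresholds, intervals):
    """Checkpoint Setter sketch: map a resource's fault index to an interval.

    thresholds: increasing list F(1) < F(2) < ... < F(N)
    intervals:  decreasing list C(1) > C(2) > ... > C(N), in ms
    Returns the checkpoint interval, or None when F > F(N), meaning the
    resource is labelled unavailable and the job must be rescheduled (step 5).
    """
    for f_i, c_i in zip(thresholds, intervals):
        if fault_index <= f_i:
            return c_i        # reliable resources get longer intervals
    return None               # step 5: remove resource, reschedule job
```

With the experimental values from Section 6 (F(1)=2 ... F(10)=20, C(1)=10 ... C(10)=1), a fault index of 3 yields the interval C(2)=9.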

Fault Index Update Algorithm

1. IF the checkpoint manager receives the job completion result from a resource THEN
   (i) IF the resource fault index >= 1 THEN
       Send a message to the fault index manager to decrement the fault index of the resource that completed the assigned job.
       Send details of the finished job to the scheduler.
       END IF
   (ii) GOTO Step 3.
   END IF
2. IF the checkpoint manager receives the job failure message from a resource THEN
   (i) Send a message to the fault index manager to increment the fault index of the resource that failed to complete the assigned job.
   (ii) Ask the checkpoint server whether there is any checkpoint result for this job.
   (iii) IF a checkpoint result of the job exists in the checkpoint server THEN
       Submit the remaining part of the job after the last received checkpoint to the scheduler for rescheduling.
       GOTO Step 3.
       END IF
   (iv) IF a checkpoint result of the job does not exist in the checkpoint server THEN
       Submit the job from the start to the scheduler for rescheduling.
       GOTO Step 3.
       END IF
   END IF
3. EXIT
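A minimal sketch of the update rule and the restart decision (the dictionary-based checkpoint store is an assumption for illustration only):

```python
def update_fault_index(fault_index, job_completed):
    """Fault Index Update sketch: decrement on success (never below 0),
    increment on failure, mirroring steps 1 and 2 above."""
    if job_completed:
        return fault_index - 1 if fault_index >= 1 else fault_index
    return fault_index + 1

def restart_offset(checkpoint_store, job_id):
    """Steps 2(iii)/2(iv): resume from the last checkpoint if one exists,
    otherwise resubmit the job from the start (offset 0)."""
    return checkpoint_store.get(job_id, 0)
```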

Replica Resource Selection Algorithm (RRSA)

1. Specify the replica resources of the grid in the map file ReplicaResourceMap.

2. Specify each resource and its checkpoint file location in the map file CheckpointFileLocMap.

3. Specify the target replica storage details in the map file ReplicaLocMap.

4. FOR each periodic interval of time
   FOR each resource in the grid (CheckpointFileLocMap)
      Traverse all the checkpoint files in the checkpoint file location.
      IF a checkpoint file is updated THEN
         (i) Obtain the checkpoint file size.
         (ii) Check the memory availability of all the replica resources (ReplicaResourceMap).
         (iii) Obtain the bandwidth for each replica resource (ReplicaResourceMap).
         (iv) Choose the best two replica resources based on available memory and bandwidth.
         (v) Replicate the checkpoint file.
         (vi) Store the details in ReplicaLocMap.
      END IF
   END FOR
   END FOR
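Step 4(iv), choosing the best two replica resources by available memory and bandwidth, can be sketched as follows (the bandwidth-first ranking is an assumption; the paper does not specify how the two criteria are weighted):

```python
def select_replica_resources(resources, checkpoint_size, k=2):
    """RRSA sketch: pick up to k replica resources for a checkpoint file.

    resources: dict mapping resource name -> (free_memory, bandwidth)
    Resources that cannot hold the file are skipped; the rest are ranked
    by bandwidth, then by free memory.
    """
    eligible = [(bw, mem, name)
                for name, (mem, bw) in resources.items()
                if mem >= checkpoint_size]
    eligible.sort(reverse=True)      # highest bandwidth (then memory) first
    return [name for _, _, name in eligible[:k]]
```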

5. Simulation Environment

To set up a grid environment, Globus Toolkit 4.0 [7] is used as the grid middleware. Globus is a software infrastructure that enables applications to view distributed heterogeneous computing resources as a single virtual machine. It provides key services such as resource access, data management, security infrastructure and communications. Experiments are done by connecting 10 machines in a network, each with the Globus Toolkit installed. Each machine is integrated with Ganglia 3.0.7, the cluster monitoring tool used to monitor the machines connected in the grid cluster with one head node; Ganglia runs on all machines including the head node. NWS collects the bandwidth between systems in the network.

The Torque (Tera-scale Open-source Resource and QUEue manager) resource manager [18] is used as a service that provides the low-level functionality to start, hold, cancel and monitor jobs. It also preserves the job queue state in the event of a head node failure. To implement the checkpoint/restart technique, Berkeley Lab's Linux Checkpoint/Restart (BLCR) [19] is used. BLCR is a checkpoint/restart package for multi-threaded applications. Torque with BLCR is integrated into Globus to provide fault tolerance to the grid system. Globus provides base services for managing replicas: the Replica Location Service, Local Replica Catalogs and the Data Replication Service.


Fig. 2. Sequence Diagram for Fault Tolerant Based Job Scheduling with CRS

1. When the simulation starts, the grid resources that form the simulated grid environment register themselves with GIS.

2. Scheduler queries the GIS to get the list of available resources.

3. GIS returns a list of registered resources and their details to Scheduler.

4. After receiving the available resource list from GIS, the Scheduler performs matchmaking of the job and resource requirements and estimates the response time of the job based on transfer time, queue wait time and the service time of the job.

5. The scheduler selects the resource with minimum response time and submits the scheduled job to the Checkpoint Manager to set job checkpoints.

6. Checkpoint Manager obtains resource fault index from Fault Index Manager.

7. Based on the resource fault index, the job checkpoint interval is set by the Checkpoint Manager and the job is submitted to the selected resource for execution. The process of checkpointing itself involves some overhead and needs time; this time is considered during simulation and is added to the total service time of the job.

8. Grid resource performs one of the following steps:

(i) During the execution of job, the grid resource saves the status of partly executed job on each checkpoint into the Checkpoint Server.

(ii) When the submitted job is finished, the grid resource sends the message of completion of last checkpoint to the Checkpoint Manager.

(iii)When the selected resource fails to execute the job, the grid resource sends job failure message to the Checkpoint Manager.

9. Checkpoint Replication Service performs the following:

(i) Receives the recent checkpoint result from the Checkpoint Server.

(ii) Applies RRSA to perform replication of the checkpoint files.

(iii) Stores the replication details in the Checkpoint Server.

10. The Checkpoint Manager on receiving the job failure message from grid resource performs the following:

(i) Checkpoint Manager sends a message to the Fault Index Manager to increment the fault index of the resource that failed to complete the assigned job.


(iii) Upon receipt of the last successful checkpoint result from the Checkpoint Server, the Checkpoint Manager submits the job to the scheduler for rescheduling. The scheduler reschedules the remaining part of the job to some other resource, from that checkpoint result onward.

11. The Checkpoint Manager, on receiving the job completion message from the grid resource, performs the following:

(i) Checkpoint Manager sends a message to the Fault Index Manager to decrement the fault index of the resource that completed the assigned job.

(ii) Checkpoint Manager sends details of the finished job to the Scheduler.

12. Upon job completion, the Scheduler returns the result to the user.
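The checkpointing overhead accounting mentioned in step 7 can be sketched as follows (a uniform per-checkpoint overhead is an assumption; the paper only states that the overhead time is added to the job's service time):

```python
def service_time_with_checkpoints(service_time_ms, interval_ms, overhead_ms):
    """Add checkpointing cost to a job's service time: one overhead
    charge per checkpoint taken during execution."""
    num_checkpoints = service_time_ms // interval_ms
    return service_time_ms + num_checkpoints * overhead_ms
```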

6. Experimental Results

In this section, the performance of the proposed strategy is evaluated using the simulation setup described above by varying/fixing number of jobs/resources and deadline for different scenarios. The following values are assigned to the parameters of checkpoint setter algorithm discussed in Section 4:

F(0)=0, F(1)=2, F(2)=4, F(3)=6, F(4)=8, F(5)=10, F(6)=12, F(7)=14, F(8)=16, F(9)=18, F(10)=20

C(1)=10, C(2)=9, C(3)=8, C(4)=7, C(5)=6, C(6)=5, C(7)=4, C(8)=3, C(9)=2, C(10)=1

6.1 Experiment I – Scheduling With FT vs. Scheduling Without FT

Although scheduling and fault tolerance have traditionally been considered independently of each other, there is a strong correlation between them. As a matter of fact, each time a fault tolerance action is performed, i.e., a checkpointed job is restarted, a scheduling decision must be taken to decide where the job must run and when its execution must start. Therefore, scheduling and fault tolerance should be jointly addressed in order to achieve satisfactory performance. In this experiment, the TTR is plotted against the number of jobs. The MTTR scheduling approach schedules dynamically arriving jobs with resource requirements and reclaims resources from early-completing jobs, but does not address fault tolerance requirements explicitly. The MTTR_CS strategy uses the MTTR scheduling approach and builds fault-tolerant solutions around it to support checkpoint-based fault tolerance. By adding fault tolerance features within the scheduling approach, the overall performance of the grid system is improved. The graph in figure 3 clearly reveals that, throughout this experiment, the MTTR_CS strategy performs better than the MTTR strategy in terms of TTR.

Fig. 3. TTR comparison by varying number of jobs

6.2 Experiment II - Deadline vs. Number of Jobs Completed

Throughout this experiment, the number of jobs MTTR_CS executes within the deadline is greater than for the MTTR strategy. MTTR_CS applies job checkpointing depending on the failure-prone nature of the resources, as suggested by the history information maintained by the Fault Index Manager. If a failure occurs at a grid resource, the MTTR_CS strategy is able to tolerate the fault by using the job result from the last checkpoint received. It reschedules that job from that point onward only, saving considerable time, since it is not necessary to execute the job from scratch. In the MTTR strategy this feature is missing, resulting in delayed job completion, which ultimately affects the total number of jobs completed within the deadline.

Fig. 4. Number of jobs completed for different deadline

6.3 Experiment III - Number of Jobs vs. Execution Time

In this experiment, the execution time is plotted against the number of jobs. Checkpoints are preserved in the Checkpoint Server. If the server becomes a bottleneck, the job execution status information cannot be retrieved, so the job needs to restart from scratch, which increases its response time. If CRS is available, it replicates the checkpoint file to remote resources, so the job is restarted from its last successful state using a replicated checkpoint file. This reduces the response time of the job. By including CRS in the fault-tolerant scheduling approach, the overall performance of the grid system is improved. Figure 5 shows the improvement in execution time with the CRS strategy compared to without it.

Fig. 5. Execution Time comparison by varying number of jobs

6.4 Experiment IV - Number of Replicas vs. Execution Time

In this experiment, the sensitivity of CRS is investigated by varying the number of replica resources from 1 to 5 while submitting 400 jobs. During simulation, a drastic increase in execution time is observed when the number of replicas exceeds two, as shown in figure 6; a maximum of two replicas improves system performance. From the CRS implementation it is found that as the number of replicas increases, the overhead also increases, which results in increased execution time. Thus, to increase system performance, an optimal number of replicas is selected. Therefore, in the simulation, two resources are selected using the RRSA algorithm to replicate the checkpoint file.

Fig. 6. Execution Time comparison by varying number of replica resources

7. Conclusion and Future Work

In this study, a job scheduling strategy with fault tolerance is proposed. The proposed scheme maintains a history of fault occurrences for each resource in the fault index manager. Before scheduling a job to a resource, the checkpoint manager uses the resource's fault occurrence history and, depending on this information, sets a different intensity of checkpointing. The availability of checkpoint files is increased through replication. The performance of the proposed strategy is evaluated using the Globus Toolkit middleware. The experimental results demonstrate that the proposed strategy effectively schedules grid jobs in a fault-tolerant way in spite of the highly dynamic nature of the grid. Throughout the experiments, the number of jobs completed within the deadline for the MTTR_CS strategy is greater than for the MTTR strategy. Moreover, MTTR_CS improves the TTR of user jobs compared with the MTTR strategy. Using the CRS implementation, the execution time of a job is greatly reduced by selecting an optimal number of replica resources. The comparative analysis clearly shows that the MTTR_CS strategy maintains its superiority over the MTTR strategy throughout all the experimental scenarios simulated.

In the future, it is planned to explore the potential of these scheduling strategies by embedding them into real-world grid computing environments. It is also planned to improve the checkpoint replication service by optimizing the recovery of checkpoint replicas, i.e., retrieving a replica in a faster manner.

References

[1] Benoit, Anne, Cole, Murray, Gilmore, Stephen, Hillston, Jane (2005): Enhancing the effective utilization of Grid clusters by exploiting on-line performability analysis, IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 317-324.

[2] Buyya, R., Murshed, M., Abramson, D. (2002): A deadline and budget constrained cost-time optimization algorithm for scheduling task farming applications on global grids, In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, USA, pp. 24-27.

[3] Elnozahy, E. N., Alvisi, Lorenzo, Wang, Yi-min, Johnson, B. (2002): A survey of rollback recovery protocols in message-passing systems, ACM Computing Surveys, 34:375-408.

[5] Foster, I., Kesselman, C. (1999): The Grid: Blueprint for a Future Computing Infrastructure. San Mateo, CA: Morgan Kaufmann.

[6] Foster, I., Kesselman, C., Tuecke, S. (2001): The anatomy of the Grid: Enabling scalable virtual organizations, International Journal of Supercomputing Applications, 15:200-222.

[7] Foster, I. (2006): Globus Toolkit Version 4: Software for service-oriented systems, In Proceedings of the IFIP International Conference on Network and Parallel Computing, pp. 2-13.

[8] Gropp, W., Lusk, E. (2002): Fault tolerance in MPI programs, Special issue of the Journal of High Performance Computing Applications, 18:363-372.

[9] Gupta, I., Chandra, T., Goldszmidt, G. (2001): On scalable and efficient distributed failure detectors, In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing, ACM Press, New York, pp. 170-179.

[10] Hayashibara, N., Cherif, A., Katayama, T. (2002): Failure detectors for large-scale distributed systems, In Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems, pp. 404-409.

[11] Kandaswamy, G., Mandal, Anirban, Reed, Daniel A. (2008): Fault tolerance and recovery of scientific workflows on computational grids, In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp. 777-782.

[12] Kiran, M., Aisha-Hassan A. Hashim, Lim Mei Kuan, Yap Yee Jiun (2009): Execution time prediction of imperative paradigm tasks for grid scheduling optimization, International Journal of Computer Science and Network Security, 9:155-163.

[13] Luckow, A., Schnor, B. (2008): Adaptive checkpoint replication for supporting the fault tolerance of applications in the Grid, Seventh IEEE International Symposium on Network Computing and Applications, pp. 299-306.

[14] Malarvizhi, N., Rhymend Uthariaraj, V. (2008): A broker-based approach to resource discovery and selection in Grid environments, IEEE International Conference on Computer and Electrical Engineering, pp. 322-326.

[15] Malarvizhi, N., Rhymend Uthariaraj, V. (2009): A new mechanism for job scheduling in computational grid network environments, Springer Lecture Notes in Computer Science, 5820:490-500.

[16] Plankensteiner, Kassian, Prodan, Radu, Fahringer, Thomas, Kertesz, Attila, Kacsuk, Peter K. (2007): Fault-tolerant behavior in state-of-the-art grid workflow management systems, Technical Report.

[17] Sindrilaru, Elvin, Costan, Alexandru, Cristea, Valentin (2010): Fault tolerance and recovery in Grid workflow management systems, International Conference on Complex, Intelligent and Software Intensive Systems, pp. 475-480.

[18] The Torque Resource Manager, http://www.clusterresources.com/products/torque/

[19] Yeo, Chee Shin, Buyya, R. (2005): Service level agreement based allocation of cluster resources: handling penalty to enhance
