A grid provides its users with a variety of services, which can be broadly classified into four types: task management, data management, security and monitoring. The task manager is responsible for accepting tasks from users and scheduling them on appropriate resources. The data manager provides tools for transferring data between locations in a secure and reliable manner. Security services allow verified users to access resources and to delegate their credentials to appropriate resources, which then perform tasks on their behalf. The monitoring services inform users about the status of resources and perform accounting and bookkeeping functions to ensure fair usage. All of these services are essential for the sound operation of any grid. The software infrastructure that provides them is termed grid middleware. The following subsections describe two grid middleware systems that are relevant to this thesis. A detailed and comprehensive review of other desktop grid middleware can be found in [Cho07].
3.6.1 Globus Toolkit
One of the fundamental requirements for building an ideal computational grid is the development of standard protocols for grid services. The Globus Alliance [GLO10] is an international association that develops these standards and the associated technology. The Globus Toolkit [Fos06] is its open source implementation and is considered the de facto standard for computational grids. The toolkit is based on the Open Grid Services Architecture (OGSA) [FKNT02] and the Web Services Resource
3. Computing on Shared Resources 44
Framework (WSRF). It is a software package that allows users to build computational grids by sharing their resources with each other without giving up their local autonomy.
The suite of tools includes components that implement grid services such as resource management, data management, communication, fault detection and security. Developers can choose from these components to build their own grid applications and services without worrying about the underlying protocols. Many commercial vendors, such as Hewlett-Packard, IBM, DataSynapse and United Devices, use the Globus Toolkit to create such applications. It is also used all over the world by large-scale production computational grids such as the National Grid Service (NGS) [NGS10], TeraGRID [TER10], the Network for Earthquake Engineering and Simulation Grid (NEESgrid) [PKG∗04] and the Worldwide Large Hadron Collider Grid (WLCG) [WLC10]. The main components of the Globus Toolkit include [Fos06]:
Grid Resource Allocation and Management (GRAM): This service is responsible for task submission, monitoring and cancellation on local resources. The toolkit does not provide a task scheduling mechanism of its own; instead, GRAM communicates with local schedulers, with support for Portable Batch System (PBS), Platform LSF and Condor. It also facilitates the delegation of credentials and the transfer of files from the submitting host to the execution host.
Grid Security Infrastructure (GSI): The security service of the Globus Toolkit uses asymmetric cryptography to authenticate users and services. A security certificate encoded in X.509 format and signed by a Certificate Authority (CA) is used for secure communication between the entities of the computational grid. GSI provides single sign-on and credential delegation using proxy certificates, which are valid for a limited time period, without requiring a central security system.
Grid File Transfer Protocol (GridFTP): A large amount of data needs to be transferred between the resources of a computational grid, which may lie on different networks. GridFTP is a protocol for the fast, secure and efficient transmission of bulk data between grid elements. It uses advanced file transfer mechanisms, such as data striping across multiple connections, for enhanced performance. Along with the Reliable File Transfer (RFT) service and the Replica Location Service (RLS), it forms the complete set of data management services needed to build a computational grid.
Monitoring and Discovery System (MDS): This system provides status information about resources and locates new resources on a computational grid by identifying those that form part of the VO. It is composed of two services: an index service and a trigger service. The index service gathers information by querying the local information providers, and the trigger service launches any actions that may be needed based on the information received. Users and services such as resource brokers can query the MDS to check which resources are available. It acts as a central data manager that publishes information so that it becomes available to multiple sites on a computational grid.
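The division of labour between the index service and the trigger service described for the MDS can be illustrated with a small Python sketch. This is not the Globus API: the class names `IndexService` and `TriggerService`, the record fields `free_cpus` and `load`, and the node names are all hypothetical and chosen only to show the idea of polling local providers, answering queries and firing actions on the gathered information.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class IndexService:
    """Hypothetical index: caches status records polled from local providers."""
    providers: Dict[str, Callable[[], dict]] = field(default_factory=dict)
    records: Dict[str, dict] = field(default_factory=dict)

    def register(self, name, provider):
        self.providers[name] = provider

    def refresh(self):
        # Query every local information provider and keep its latest record.
        for name, provider in self.providers.items():
            self.records[name] = provider()

    def query(self, predicate):
        # Names of all resources whose cached record satisfies the predicate.
        return sorted(n for n, r in self.records.items() if predicate(r))


@dataclass
class TriggerService:
    """Hypothetical trigger: runs an action whenever a condition holds."""
    rules: List[Tuple[Callable, Callable]] = field(default_factory=list)

    def add_rule(self, condition, action):
        self.rules.append((condition, action))

    def evaluate(self, index):
        fired = []
        for name, record in sorted(index.records.items()):
            for condition, action in self.rules:
                if condition(record):
                    fired.append(action(name, record))
        return fired


def demo_mds():
    index = IndexService()
    index.register("node-a", lambda: {"free_cpus": 4, "load": 0.2})
    index.register("node-b", lambda: {"free_cpus": 0, "load": 0.95})
    index.refresh()

    triggers = TriggerService()
    triggers.add_rule(lambda r: r["load"] > 0.9,
                      lambda n, r: f"alert: {n} overloaded")

    # A resource broker would issue a query like this one.
    available = index.query(lambda r: r["free_cpus"] > 0)
    return available, triggers.evaluate(index)
```

In this sketch a resource broker corresponds to the caller of `query`, while the trigger rules stand in for whatever actions a site administrator might configure.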
3.6.2 Condor
Condor [CON10] is an open source high throughput computing platform developed at the Department of Computer Science, University of Wisconsin, Madison. It is a middleware for seamlessly tapping unused CPU cycles in an institutional environment and provides a batch system for job queuing, execution, management and monitoring on vacant resources. The resources can be either idle desktop machines or dedicated cluster nodes. The aim of the project is to deliver a large number of computation cycles to users over a long period of time, in contrast to high performance computing, which focuses on delivering high computational power over a shorter time period. Condor supports popular operating systems (Windows, Unix, MacOS) and can be used to combine heterogeneous resources.

A group of connected machines forms a Condor pool. A self-sufficient Condor pool needs three types of machines: central manager, submit and execute. Any machine in the pool can perform any combination of these roles; however, only one machine can be the central manager. The central manager keeps a record of all the connected resources. When a user submits a job to a Condor pool, the central manager finds an idle resource that satisfies the job requirements. Each resource publishes information about its capabilities, such as processor speed, memory, operating system and any other user-specified attributes, in a structure known as a ClassAd [RLS98]. These ClassAds are used to identify the resources that match the job requirements. A submit machine allows the user to provide Condor with a job description specifying the job requirements. An execute machine, when idle, processes jobs submitted to the Condor pool. Several Condor pools can be configured to share load using a flocking mechanism [ELvD∗96], as long as the central managers of the pools can communicate with each other. This is useful for sharing resources across multiple departments of an organisation, where each department may have its own pool.
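The matching of job requirements against published ClassAds can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Condor's ClassAd language or matchmaker: here an advertisement is a plain dictionary, a requirement is a predicate on a single attribute, and the attribute names (`state`, `os`, `memory_mb`) and machine names are assumptions made for the example.

```python
def satisfies(machine_ad, requirements):
    """True if the machine advertisement meets every job requirement."""
    for attr, predicate in requirements.items():
        # A machine that does not advertise the attribute cannot match.
        if attr not in machine_ad or not predicate(machine_ad[attr]):
            return False
    return True


def match_job(job, machine_ads):
    """Name of the first idle machine that satisfies the job, or None."""
    for ad in machine_ads:
        if ad.get("state") == "idle" and satisfies(ad, job["requirements"]):
            return ad["name"]
    return None


def demo_matchmaking():
    # Advertisements as the central manager might hold them (hypothetical).
    machine_ads = [
        {"name": "desk-01", "state": "busy", "os": "LINUX", "memory_mb": 4096},
        {"name": "desk-02", "state": "idle", "os": "WINDOWS", "memory_mb": 2048},
        {"name": "node-07", "state": "idle", "os": "LINUX", "memory_mb": 8192},
    ]
    job = {
        "requirements": {
            "os": lambda v: v == "LINUX",
            "memory_mb": lambda v: v >= 2048,
        }
    }
    return match_job(job, machine_ads)
```

With these inputs, `demo_matchmaking()` selects `node-07`: `desk-01` is busy and `desk-02` runs the wrong operating system. The sketch checks only the job's requirements against the machine, and it omits any ranking of the candidate machines.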
Condor monitors user input and CPU load to determine which resources are idle, and administrators can configure the thresholds to implement the desired job execution policies. This allows the donor to control the interference that a job may cause on the resource. If the donor withdraws the resource, the job is migrated to another vacant machine. To prevent the loss of computations due to failures, Condor can create system-level checkpoints of an executing job by saving the image of the running process [LTBL97]. However, there are some restrictions on the jobs that can be checkpointed using this mechanism.
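A threshold-based execution policy of this kind can be sketched as follows. The sketch does not reproduce Condor's configuration expressions: the class `IdlePolicy`, the function `decide`, the parameter names and the default threshold values are all hypothetical, and the CPU load is assumed to be the load caused by the owner's own processes rather than by the grid job.

```python
from dataclasses import dataclass


@dataclass
class IdlePolicy:
    # Hypothetical thresholds that an administrator might tune.
    min_keyboard_idle_s: float = 900.0   # seconds without user input
    max_owner_cpu_load: float = 0.3      # load from the owner's processes


def decide(policy, job_running, keyboard_idle_s, owner_cpu_load):
    """Return the action for one execute machine: start, continue, vacate or wait."""
    idle = (keyboard_idle_s >= policy.min_keyboard_idle_s
            and owner_cpu_load <= policy.max_owner_cpu_load)
    if job_running:
        # The owner has returned: give the machine back and migrate the job.
        return "continue" if idle else "vacate"
    return "start" if idle else "wait"
```

For example, `decide(IdlePolicy(), True, 5, 0.1)` yields `"vacate"` because the keyboard was used five seconds ago, which corresponds to the donor withdrawing the resource.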
The MW tool [GKYL01] of Condor provides an API for developing and executing master-worker applications in an opportunistic environment. The API acts as a broker between the tasks and idle workers. It combines the user application with the Condor resource management and message passing libraries. The user application specifies the list of tasks to be performed, and the variation in available resources is handled by the underlying Condor mechanisms.
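The master-worker pattern in an opportunistic environment can be illustrated with a short Python sketch. It is not the MW API, which is not reproduced here: the names `run_master`, `WorkerLost` and `execute` are hypothetical, the workers are simulated by a function call, and tasks are processed sequentially for simplicity. The point shown is that a task whose worker disappears is simply returned to the queue.

```python
from collections import deque


class WorkerLost(Exception):
    """Raised when a simulated worker is withdrawn before finishing a task."""


def run_master(tasks, execute):
    """Hand out tasks until all are done, rescheduling those whose worker was lost."""
    pending = deque(tasks)
    results = {}
    attempts = {}
    while pending:
        task = pending.popleft()
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = execute(task, attempts[task])
        except WorkerLost:
            # The resource vanished: put the task back for another worker.
            pending.append(task)
    return results, attempts


def demo_master_worker():
    def execute(task, attempt):
        # Simulate the worker running task 3 being reclaimed on its first attempt.
        if task == 3 and attempt == 1:
            raise WorkerLost(task)
        return task * task

    return run_master([1, 2, 3, 4], execute)
```

In the demonstration, task 3 is attempted twice and every other task once, while the user-level code only supplies the task list and the work function.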