Job Management - Service oriented grids and problem solving environments

The Computation Web Service offers resource sharing to support multiple users with tasks containing multiple jobs. An essential requirement for a multi-user environment is security that, in addition to providing access control to the service, protects users’ files and data from unintended damage by other users or jobs and denies users access to each other’s files. We have employed WS-Security [133] for message-level security, which provides encryption and signing of messages, and authentication and authorisation of users, enabling user access rights and roles to be configured in the service. WS-Security has scalability advantages over traditional transport security, such as SSL, as messages can be verified on a trust basis, negating the need to repeatedly gain authentication from the original sender for every service that a message may pass through.

In addition, the service provides systematic and robust job management to cater for jobs that may run from days to months, offering recovery from computer failure. Each compute job is assigned a unique directory that stores its files locally before submission to a compute resource. Upon completion of a job’s execution, its files are transferred back into this directory. The directory is assured to be unique by its path and name, which are generated respectively from the owner’s unique username, taken from their X.509 certificate, and a Universally Unique Identifier (UUID) job identifier generated by the service. Jobs from the same user consequently sit in the same directory hierarchy. This provides a useful way to ensure security amongst users and to manage and insulate jobs from accidentally writing over each other’s files.
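The directory scheme described above can be illustrated with a minimal sketch. It assumes a layout of the form root/username/UUID; the function name `job_directory`, the username filter and the use of a version 4 UUID are illustrative assumptions rather than details of the service itself.

```python
import re
import uuid
from pathlib import Path


def job_directory(root: Path, username: str) -> Path:
    """Create and return a fresh job directory at <root>/<username>/<uuid>."""
    # Hypothetical policy: reject names that could escape the user's subtree.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", username) or username in {".", ".."}:
        raise ValueError(f"unsafe username: {username!r}")
    job_id = str(uuid.uuid4())                # job identifier generated by the service
    path = root / username / job_id
    path.mkdir(parents=True, exist_ok=False)  # fail loudly if the name already exists
    return path
```

Two calls for the same user return sibling directories under that user's subtree, which is what keeps one user's jobs together and apart from everyone else's.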

The service provides robustness and secure caching by using the same directory structure to hold information that links a job’s submission with its directory, and to hold a unique cache of files for the user that only they can access. The stored submission information enables the service to re-establish which job is associated with which directory if the service fails.
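One way such recovery could work is sketched below. It assumes that each job directory holds a small JSON record naming the scheduler's job identifier. The file name `submission.json`, the field name and both function names are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path

RECORD_NAME = "submission.json"  # hypothetical record file kept in each job directory


def record_submission(job_dir: Path, scheduler_job_id: str) -> None:
    """Persist the link between a job directory and its scheduler job ID."""
    tmp = job_dir / (RECORD_NAME + ".tmp")
    tmp.write_text(json.dumps({"scheduler_job_id": scheduler_job_id}))
    tmp.replace(job_dir / RECORD_NAME)  # rename so a crash never leaves a half-written record


def recover_jobs(root: Path) -> dict[str, Path]:
    """Rebuild the job-ID to directory map by scanning <root>/<user>/<job>/."""
    jobs = {}
    for record in root.glob(f"*/*/{RECORD_NAME}"):
        data = json.loads(record.read_text())
        jobs[data["scheduler_job_id"]] = record.parent
    return jobs
```

Because the mapping lives on disk beside the job's files, a restarted service can call `recover_jobs` once at start-up instead of relying on anything held in memory.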

6.4.1 State Management

As discussed in Section 4.3.5, a service must be able to contend with interactions and internal operations spanning multiple users and jobs. A cluster management system contains many shared operations, in particular job submission and execution, that demand state management.

Pure Web Services technologies neither address state management issues nor provide protocols defining the lifetime of a shared operation, which is essential to preserving the integrity of a system. In order to provide state and lifetime management, a developer may either create their own state management mechanisms, from scratch or based on existing Web Server technologies, or employ extension technologies to the stateless Web Service model. We have implemented two versions of the Computation Web Service that explore both ideas.

The first version of our Web Service employed session and application state management functionality provided by the underlying Web Server hosting environment, IIS [134] and ASP.NET [135]. This method of implementation has distinct performance and reliability advantages, with support for persistent long-term state storage on a SQL database server. However, it also has many disadvantages stemming from its design as a mechanism for maintaining state across Web pages and from the HTTP/Web security model. Importantly, support for long-term session state requires HTTP cookies [136] on the client side, a known security problem. In addition, a damaged or lost cookie would mean that the user could not recover their session. Whilst this issue can be overcome by mapping session state onto application state, it is then not possible to use the cookies’ lifetime management mechanism for the creation, management and destruction of sessions.

A revision of the Web Service implemented its own lifetime and persistent state management mechanism, recording state information using the directory structure discussed in the previous section. Upon creation of a session, for instance on a successful request for job submission, a new unique directory is created and its name returned to the user as the unique session handle. All subsequent operations within the context of the resource management must pass this handle back to the Web Service, enabling multiple sessions to be maintained concurrently. Each of these operations checks the certificate of the user to ensure that they are allowed to access that session. Subsequent operations store or check for state information, such as the job ID returned by Condor or a record of the file requirements, to control and maintain the integrity of the job submission and resource management process. The lifetime of the session in this instance represents the life of the job. Only after the job has completed and the result files have been uploaded can the session be allowed to time out and be destroyed. Care must be taken not to destroy the session after the job has been assigned a compute resource, otherwise data may be lost. Consequently, a simple timeout mechanism for cleaning up sessions is not possible because the length of a job’s execution is unknown.
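The handle-based scheme and its lifetime rule can be sketched in memory as follows. This is an illustration under stated assumptions, not the service's implementation: the state names, the `SessionStore` class and the use of a UUID string as the handle (the service uses the directory name) are all hypothetical, and the caller's identity is reduced to a plain string standing in for the certificate subject.

```python
import uuid
from dataclasses import dataclass
from enum import Enum, auto


class JobState(Enum):
    CREATED = auto()           # session exists, no compute resource assigned yet
    ASSIGNED = auto()          # job holds a compute resource and may be running
    RESULTS_RETURNED = auto()  # job finished and its result files are back


@dataclass
class Session:
    owner: str                 # identity taken from the caller's certificate
    state: JobState = JobState.CREATED


class SessionStore:
    def __init__(self) -> None:
        self._sessions: dict[str, Session] = {}

    def create(self, owner: str) -> str:
        handle = str(uuid.uuid4())        # assumption: stands in for the directory name
        self._sessions[handle] = Session(owner)
        return handle

    def get(self, handle: str, caller: str) -> Session:
        session = self._sessions[handle]  # KeyError for an unknown handle
        if session.owner != caller:
            raise PermissionError("caller does not own this session")
        return session

    def try_expire(self, handle: str) -> bool:
        """Destroy the session unless its job still holds a compute resource."""
        if self._sessions[handle].state is JobState.ASSIGNED:
            return False                  # a timeout must not discard a live job
        del self._sessions[handle]
        return True
```

The point of `try_expire` is that expiry is decided by job state rather than by elapsed time, which is why a plain timeout cannot be used for clean-up.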

A completely new Web Service was created using the transient Web Service model proposed in the OGSI [57]. OGSI supports dynamic creation and lifetime management of Web Service instances whose operations and data are tied to the specific context in which they were created. We employed this mechanism to represent the submission and management of a resource for a job as a Web Service instance. There is a one-to-one relationship between the job, the resource it was assigned and the Web Service instance. Therefore, operations on an instance act only on the job and compute resource to which it was assigned. A stateless Web Service factory is responsible for the creation of the instances. It performs all the checks previously carried out by the request-for-submission operation before creating the new Web Service instance and returning its unique Grid Service Handle (GSH) [57]. This allows the implementation of the Web Service instance to operate under the assumption that the initial context is valid. This approach simplifies the way state information is stored and managed, removing the need for arbitrary measures such as the storage of Condor job IDs.
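The factory pattern can be illustrated in plain Python, without any OGSI toolkit. In this sketch the `registry` dictionary stands in for the hosting environment that holds live instances, and the `urn:uuid:` string is a hypothetical handle format, not a real GSH. All names are assumptions made for illustration.

```python
import uuid


class JobInstance:
    """Stand-in for one transient service instance bound to a single job."""

    def __init__(self, owner: str, executable: str) -> None:
        self.owner = owner
        self.executable = executable
        self.status = "created"

    def submit(self) -> str:
        # No handle lookup or ownership check here: the factory validated the context.
        self.status = "submitted"
        return self.status


def create_instance(registry: dict, authorised: set, caller: str, executable: str) -> str:
    """Stateless factory: validate the request, then register a new instance."""
    if caller not in authorised:
        raise PermissionError(f"{caller} may not submit jobs")
    if not executable:
        raise ValueError("an executable is required")
    handle = f"urn:uuid:{uuid.uuid4()}"  # hypothetical handle format, not a real GSH
    registry[handle] = JobInstance(caller, executable)
    return handle
```

Compared with the handle-and-directory revision, the job's state lives in the instance's own attributes, so each operation no longer has to look up and re-validate a session.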

6.4.2 File Transfer Management

Jobs in an HTC environment may need to transfer huge amounts of data to and from the compute resource. Whilst this overhead is insignificant compared to the length of time the job runs, a Web Service handling multiple jobs could be swamped, effectively suffering a form of denial of service, if too many large file transfer operations occur concurrently. Therefore, it is important that file transfer operations are as fast and as efficient as possible.

Within the Condor system this problem does not exist because file transfers are distributed across the system, with each submission machine transferring a job’s files directly to the compute resource. However, our Web Service must receive and transfer the files of every job, causing a potential bottleneck. To overcome this architectural deficiency, we improved the data transfer mechanism of Web Services by employing a DIME [137] implementation of the WS-Attachments standard [138]. In addition, file transfer control and caching schemes were employed to make our Web Service’s file transfer operations more efficient.

Jobs generally need to transfer binary files, such as the compiled executable. Unfortunately, due to the text markup nature of the SOAP and XML standards, they do not handle binary-encoded data and files efficiently. The only means of transferring binary files using SOAP is to encode their bytes as text (Base64 encoding) and pass the result as a fragment of an XML document. Consequently, these documents are larger than the original file: Base64 represents every three bytes as four characters, an increase of about one third before any XML overhead is counted. Transfer of huge files is unreliable as SOAP documents cannot be split and must be sent in one large chunk to the Web Service. DIME offers a more efficient means of transfer, with files sent as binary attachments to the SOAP document. DIME has recently lost favour with the Grid community because its functionality is not sufficient to support all the file sharing operations envisioned by the Grid. Alternatives such as SOAP MTOM.
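The size penalty of Base64 encoding is easy to measure. The short sketch below encodes a random payload and reports the growth; the payload size and the helper name `base64_size` are arbitrary choices for illustration, and the figure covers the encoding alone, not the surrounding SOAP envelope.

```python
import base64
import os


def base64_size(n_bytes: int) -> int:
    """Return the length of the Base64 text for a random payload of n_bytes."""
    return len(base64.b64encode(os.urandom(n_bytes)))


original = 3_000_000
encoded = base64_size(original)
print(f"{original} bytes -> {encoded} Base64 characters ({encoded / original:.2f}x)")
```

Since every three input bytes become four output characters, the three-million-byte payload grows to four million characters. A binary attachment mechanism such as DIME avoids this expansion by carrying the bytes unencoded.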
