9.3 Future Work
The objective of this work was to develop a RESTful grid model, starting with a generic model for representing resources via GRDL. There remain many areas where this can be more fully developed. The first step is the formalisation of a RESTful grid process model. Early work has begun on this and needs further development. The objective is to enable GRDL to represent resource state throughout the resource’s life cycle.
Many smaller points have been touched on but not fully developed. Each has potential for further exploration, either as a contribution to an effective RESTful grid model or as an optimisation point for implementations or usability. Within the context of the work described here, two clear areas remain: simulation of the model in the context of a large dynamic computational grid, and performance evaluation of a full implementation.
Simulation of large, dynamic, generic computational grids is a difficult task, as it is not well supported by any of the available simulation tools. SimGrid[113] is the best available tool; however, it caters for simulation of a single fixed distributed algorithm with well defined workflow/task properties throughout the simulation. Dynamic traces of resource behaviour are possible, but creating such a model was beyond the scope of what was achievable in this work. It is only on the scale of a large, long-running, heterogeneous, dynamic computational grid that the key characteristics of the model’s performance can be observed. Realisable simulations are so simplistic that they show the same results for the REST model and a traditional batch management/scheduling system. In order to simulate a large computational grid it is necessary to characterise the dynamic profiles of various aspects of a grid over a long period: storage, network, processing power, and workload generation. It is also necessary to inject failures into all of those aspects, and to represent the RESTful model within the simulation.
As has been discussed at length, the concepts presented in the REST model were motivated by the proto-REST implementation found in the DIRAC architecture; there is therefore some real-world validation for these concepts. The two implementations of the REST scheduling model which were prepared in conjunction with this work (one in Python and one in Haskell) were developed for exploring the properties of the model, rather than as part of a complete grid resource management system. As such, they were unsuitable for performance benchmarking. A full implementation would enable performance measures to be taken and provide a comparison against DIRAC and LCG task management.
The final aspect which requires careful consideration is a security model. This ties in with both the GRDL model and an overall model for a RESTful grid process.
In the context of particle physics computing a RESTful grid implementation would empower users to experiment with, extend, and improve the strategies for grid resource management. A staged grid process finite state machine model would facilitate task management from creation, to scheduling, to staging, to execution, to completion, to archiving. It is even conceivable that a GRDL-like model, coupled with a finite state machine, could allow checkpointing, recovery, and workflow management. It is still necessary to focus on the basic objective of large scale distribution, execution, and management of single grid tasks (i.e. embarrassingly parallel problems, or high volume decoupled tasks). To achieve this, greater degrees of resource logging, experimentation, and simulation are required, with particular attention given to translating the experience from preemptive operating systems to a grid domain. Added to that is the requirement for a comprehensible security infrastructure with a strong emphasis on groups (Virtual Organisations) and roles. The experience from DC04 suggests that a single identity approach such as basic X.509 certificates is insufficient: the reality is that users have many different identity tokens, all of which need to be made accessible in a grid environment and which may form part of an operation within a grid task. The complexity of current X.509 systems, security policies, and role-based access control clearly indicates that a significant amount of work remains to simplify this to a level which is usable by the ordinary user.
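The staged task life cycle mentioned above (creation, scheduling, staging, execution, completion, archiving) can be illustrated with a minimal finite state machine sketch. This is not the process model proposed for future work, which remains to be formalised; the names `TaskState`, `TRANSITIONS`, and `Task` are hypothetical, and the strictly linear transition table is an assumption made for brevity. A realistic model would also need failure, retry, and recovery transitions.

```python
from enum import Enum, auto


class TaskState(Enum):
    """Stages of the task life cycle, in the order listed in the text."""
    CREATED = auto()
    SCHEDULED = auto()
    STAGED = auto()
    EXECUTING = auto()
    COMPLETED = auto()
    ARCHIVED = auto()


# Assumed transition table: each state may only move to the next stage.
TRANSITIONS = {
    TaskState.CREATED: {TaskState.SCHEDULED},
    TaskState.SCHEDULED: {TaskState.STAGED},
    TaskState.STAGED: {TaskState.EXECUTING},
    TaskState.EXECUTING: {TaskState.COMPLETED},
    TaskState.COMPLETED: {TaskState.ARCHIVED},
    TaskState.ARCHIVED: set(),
}


class Task:
    """A hypothetical grid task that tracks its current state and history."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.state = TaskState.CREATED
        # The recorded history is the kind of information a checkpointing
        # or recovery mechanism could build on.
        self.history = [TaskState.CREATED]

    def advance(self, new_state):
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Keeping the transition table as data, rather than encoding it in control flow, means a different user group could substitute its own life cycle without changing the `Task` class.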
The principles which underlie improving computational grid architectures must be independent of any particular technology or implementation approach, Web Services in particular. While Web Services provide a strong foundation for a service oriented grid architecture, their weaknesses in the grid domain quickly became clear. Again, this focuses attention on the plethora of Internet standards which manage to inter-operate or co-exist as part of a global computing infrastructure. Decoupling aspects of a grid architecture into simple, efficient, scalable, and reliable services and protocols has the benefit of following a path which the Internet has proven can lead to success. A REST approach for representing the entities within the grid opens the way for different user groups to operate on grid entities in their own way, and allows the best mechanisms and implementations to rise to the top “organically”, rather than by asserting a priori a particular set of services, or worse, an entire grid infrastructure.
The experience and examples found in the successful Internet standards should form the basis of future work in computational grids. While the model presented here breaks from tradition in the distributed computing sense, it very much builds on a long tradition of large scale computing as established through standards such as DNS, HTTP, HTML, and XML. It is hoped that this RESTful approach will provide a new perspective on scheduling strategies and large grid architectures which will move grid computing closer to its desired goal of Internet-scale federated heterogeneous dynamic distributed computing.
Task and Executor Description Languages
A.1 Globus Resource Specification Language
The following table is a summary of the properties described in the RSL Specification [99].
Attribute         Description
----------------  -----------
directory         Specifies the path of the directory the jobmanager will use as the default directory for the requested job.
executable        The name of the executable file to run on the remote machine.
arguments         The command line arguments for the executable.
stdin             The name of the file to be used as standard input for the executable on the remote machine.
stdout            The name of the remote file to store the standard output from the job.
stderr            The name of the remote file to store the standard error from the job.
count             The number of executions of the executable.
environment       The environment variables that will be defined for the executable in addition to the default set that is given to the job by the jobmanager.
maxTime           The maximum walltime or cputime for a single execution of the executable.
maxWallTime       Explicitly sets the maximum walltime for a single execution of the executable.
maxCpuTime        Explicitly sets the maximum cputime for a single execution of the executable.
jobType           Specifies how the jobmanager should start the job (single, multiple, mpi, condor).
gramMyJob         Specifies how the gram myjob interface will behave in the started processes.
queue             Targets the job to a queue (class) name as defined by the scheduler at the defined (remote) resource.
project           Targets the job to be allocated to a project account as defined by the scheduler at the defined (remote) resource.
hostCount         Defines the number of nodes ("pizza boxes") to distribute the "count" processes across.
dryRun            If dryRun = yes, the jobmanager will not submit the job for execution and will return success.
minMemory         Specifies the minimum amount of memory required for this job.
maxMemory         Specifies the maximum amount of memory required for this job.
save_state        Causes the jobmanager to save job state/information to a persistent file on disk.
two_phase         Implements a two-phase commit for job submission and completion.
restart           Starts a new jobmanager but, instead of submitting a new job, starts watching over an existing job.
stdout_position   Specifies where in the file streaming should be restarted from for streamed output.
stderr_position   Specifies where in the file streaming should be restarted from for streamed error.
remote_io_url     Provides the base URL prefix for remote IO operations.
Table A.1: Summary of RSL properties.
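To show how the attributes in Table A.1 combine into a job request, the following sketch renders a mapping of attribute names to values as a single RSL conjunction of the form `&(name=value)(name=value)`. The helper names `to_rsl` and `_render` are hypothetical and are not part of any Globus library. The quoting is deliberately simplified: strings are always double-quoted with embedded quotes doubled, integers are left bare, and structured attributes such as environment, which takes name/value pairs, are not handled.

```python
def _render(value):
    """Render one RSL value: bare integers, double-quoted strings."""
    if isinstance(value, int):
        return str(value)
    # Simplified quoting (assumption): wrap in double quotes and double
    # any embedded double quote.
    return '"' + str(value).replace('"', '""') + '"'


def to_rsl(attributes):
    """Build an RSL conjunction string from a mapping of attributes.

    A list or tuple value becomes a space-separated sequence, as used
    for attributes such as arguments.
    """
    parts = []
    for name, value in attributes.items():
        if isinstance(value, (list, tuple)):
            rendered = " ".join(_render(v) for v in value)
        else:
            rendered = _render(value)
        parts.append(f"({name}={rendered})")
    return "&" + "".join(parts)


# Example request using attribute names from Table A.1.
request = to_rsl({
    "executable": "/bin/echo",
    "arguments": ["hello", "grid"],
    "count": 2,
    "maxWallTime": 10,
})
print(request)
# &(executable="/bin/echo")(arguments="hello" "grid")(count=2)(maxWallTime=10)
```

The attributes are emitted in the order given, so the same mapping always yields the same request string.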
A.2 Job Description Language
The following is a summary of the attributes provided by the latest published version of the EGEE Project Job Description Language[101]. These attributes are utilised in Condor ClassAd, and processed by the LCG Workload Management System (WMS).
This attribute set originated with the EDG project. Its purpose is to describe in a standard way the pre-conditions and execution details for a task or task set to be executed on a computational grid. Table A.2 provides an overview of the key attributes in JDL. It is not definitive.