Security is a key issue in computational grids, where resource interaction passes between different administrative regions over a federated and geographically distributed network. Neither GRDL nor the overall model presented here makes any assumptions about, or in any way limits, the approach to resource security. Non-repudiation of GRDL resource representations can be achieved through the use of digital signatures, and in an XML context this is particularly convenient as the XML Digital Signature standard allows different sections of GRDL to be signed by different identities.
Access control to queues is the other major point for security, and again this is left to implementations to specify. No work has been done on approaches for specifying interaction policies within GRDL, and only limited work (not reported here) on how various identities can be specified, transported, and utilised via GRDL. This is part of the larger work on an overall RESTful grid process model which remains in the early stages. In this respect, security is the least satisfied of the Key Goals described in the introduction.
8.8 Summary
This chapter has discussed a number of applications of GRDL and the REST model for a computational grid. It has described the value of composition profiles, and established the theoretical basis for resource queues and templating. Together, priority queues and templating overcome the NP-complete limitation of optimal resource scheduling and centralised workload management, providing a scalable distributed architecture for grid resource management. While many other desirable features of computational grids could be integrated into the REST model, these are left as areas for future work.
Future Work and Conclusions
9.1 A Scalable Computational Grid Architecture
As was described in the introduction, this dissertation is a work of two parts: one part presenting the design and operational experience of a large computational grid infrastructure, the other part presenting an abstract and general model for grid resource descriptions as the basis for a RESTful grid, where this second part is motivated by experience from the first.
The success of the DIRAC infrastructure and the strategies employed in its implementation contrasted with the difficulties of utilising LCG, thus revealing the need to investigate alternative grid architectures. DIRAC was deemed to be successful because it employed a simple, distributed, service oriented architecture which followed a “high throughput” scheduling model utilising a “task-pull” approach rather than the traditional “high performance” model which utilises “task-push”. A further contributing factor to its success was the emphasis on robustness through various mechanisms: asynchronous messaging between services, a lightweight client, “pilot” jobs, service watchdogs, dynamic configuration, dynamic software deployment, and decoupled file transfer queues. It also provided insight into the use of instant messaging as a lightweight communications infrastructure for grid resources. Finally, it served as a testing ground for real deployment of OGSI/GT3 Grid Services, and established many shortcomings both with the OGSI approach and the available implementations.
In contrast, it was shown that LCG could provide a usable distributed computational infrastructure, albeit with a failure rate exceeding 30%. Even this was only achieved through careful management and augmentation of the LCG system.
The centralised Resource Broker was a major bottleneck and source of failures, while the overall infrastructure did not provide the level of control, programmatic APIs, or logging to make it easily usable. It was the unanimous conclusion of the LHCb computing team that an omnipotent and omniscient Resource Broker was impossible to realise, and that an architecture which required one could not be part of a long term, robust, grid solution. The distribution of state information throughout a grid is such that it is impossible to maintain in a single location complete, consistent, and timely details of all grid resources. Furthermore, an architecture which apparently could only be realised by an opaque, monolithic, homogeneous system was a very long way from the vision of a grid infrastructure of heterogeneous hardware and software, with federated administrative domains and a plethora of dynamic virtual organisations[15]. This redoubled the emphasis on an ARDA-like services model[47] which facilitated multiple implementations, transparency, and extensibility.
Analysis of the LHCb Data Challenge 2004 results confirmed the high degree of heterogeneity in a grid across all dimensions, for example network bandwidth, CPU architecture, processor loading, and memory distribution. It highlighted the need for task logging throughout the task lifecycle and the need for a handle to the executing task in order to debug or recover stalled tasks. It also demonstrated the reality of a large computational grid with tens of thousands of queued tasks, tens of thousands of executing tasks, hundreds of sites, and thousands of nodes. In contrast to common distributed computing systems, which focus on the management of a single process or a small number of concurrent processes, a grid environment must support operations on thousands of concurrent processes. Issues around security, roles, virtual organisations, and delegation were also discovered. It is essential that users can operate with a selection of identities and roles at different times or with respect to different tasks. LCG did not make any of this easy, if it was possible at all.
The work on DIRAC and experience from DC04 motivated a number of further refinements to the DIRAC architecture. This dissertation focused on one of them: outlining a REST model for computational grids, which emphasised a common representation of grid resources, and in particular refined the Condor ClassAd model for symmetric resource matching. The REST approach was radical in that it emphasised the description of the resources within the system while saying little about the operations on those descriptions (representations). This was in contrast to an Object Oriented approach, which hides the description and focuses on interface and behaviour, or a Service Oriented Architecture, which describes the system in terms of interacting services. Both XML and HTTP/HTML have benefited from this RESTful approach. In a grid environment it is argued that resource consumers will wish to act on a resource description in arbitrary ways; therefore the most effective aspect to specify is a common resource description, rather than a common service interface.
While a sketch was provided of an overall RESTful grid architecture, the work here was limited to the aspects concerning generic resource description and composition.
In particular, the REST principles described in Section 2.3 have guided the RESTful grid model in several ways:
1. All entities can be described in a common way as resources (GRDL).
2. All resources can interact in a common way, via the set theoretic compositional model.
3. Grid resources have a hidden resource state, with a public representation of that state.
4. Content negotiation to present a representation of a resource relevant to the client or customised based on the client request.
5. Cacheable representations.
6. Dynamic representations.
7. Client-driven rendering or interpretation of resources.
8. Stateless services to transact representations.
9. Elimination of any specific services or “resource stores”, enabling decentralisation and therefore scalability.
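Principles 3 to 6 in the list above can be illustrated with a minimal Python sketch. It is not GRDL syntax and is not taken from the DIRAC code base: the class `NodeResource`, its fields, and the choice of two media types are hypothetical names introduced purely for illustration. The resource keeps its state private, renders a representation in the format the client asks for, and attaches a content hash which a client or intermediate cache could use as a validator.

```python
import hashlib
import json
from xml.etree import ElementTree as ET


class NodeResource:
    """Hypothetical grid node: state is hidden, only representations are public."""

    def __init__(self, name, free_slots, memory_mb):
        # Hidden resource state (principle 3); never handed out directly.
        self._state = {"name": name, "free_slots": free_slots, "memory_mb": memory_mb}

    def represent(self, accept="application/json"):
        """Return (media_type, body, etag) for the requested media type.

        Unknown media types fall back to JSON (an assumption of this sketch).
        """
        public = dict(self._state)  # copy, so callers cannot mutate the state
        if accept == "application/xml":
            root = ET.Element("resource")
            for key, value in sorted(public.items()):
                ET.SubElement(root, key).text = str(value)
            body = ET.tostring(root, encoding="unicode")
        else:
            accept = "application/json"
            body = json.dumps(public, sort_keys=True)
        # The hash changes only when the representation changes, so it can
        # serve as a cache validator (principle 5).
        etag = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return accept, body, etag

    def claim_slot(self):
        """Change the hidden state; later representations reflect it (principle 6)."""
        self._state["free_slots"] -= 1


node = NodeResource("node1", free_slots=2, memory_mb=2048)
media_type, body, etag = node.represent("application/xml")
```

Because the client receives only a representation, it is free to interpret or render it as it sees fit (principle 7), and the service holding the resource needs no per-client session (principle 8).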
The RESTful model built on DIRAC in a number of ways. While DIRAC contained the principles of distributed services and flexible resource matching, it was, nonetheless, focused on a Service Oriented Architecture with a central task queue, and performed explicit task/executor matching, as opposed to general resource composition. Furthermore, the key entities within the system were the services, and the architecture consisted of the service configuration and service APIs. In contrast, the REST model focuses on the description of resources within the system and making those resources directly accessible. DIRAC’s proto-RESTful features were its stateless interactions, replicable services, independent clients, and simple, lightweight API.
The model has been developed from a strong foundation in set theory in order to benefit from the properties of sets. The model consisted of characteristics, requirements, and preferences, with each inheriting the structure of the more basic property, thus providing a generic basis for interaction between the different property classes. The semantics of each of these properties was explored in depth. The model eliminated the complexity of tri-state logic, used in ClassAds for requirements, and generalised the concept of “type” and attribute comparability via equivalence classes and a formal structure for the transformation of properties. The entire model was also presented formally in Haskell in Appendix C.
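As an informal illustration of symmetric matching over characteristics, requirements, and preferences, consider the following Python sketch. It is not the Haskell formalisation of Appendix C, and every name in it (`Resource`, `satisfies`, `matches`, `score`) is hypothetical. Requirements and preferences are modelled simply as predicates over the other party's characteristics. Treating a requirement on a missing attribute as unsatisfied is an assumption made here to keep evaluation two-valued; the formal model may handle this case differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A predicate over a single characteristic value (illustrative type alias).
Predicate = Callable[[object], bool]


@dataclass
class Resource:
    """Hypothetical resource description with the three property classes."""
    name: str
    characteristics: Dict[str, object] = field(default_factory=dict)
    requirements: Dict[str, Predicate] = field(default_factory=dict)
    preferences: Dict[str, Predicate] = field(default_factory=dict)


def satisfies(provider: Resource, consumer: Resource) -> bool:
    """True if every requirement of `consumer` holds for `provider`.

    Assumption: a requirement on an attribute the provider does not describe
    is unsatisfied, so the result is always a plain boolean.
    """
    for attr, pred in consumer.requirements.items():
        if attr not in provider.characteristics:
            return False
        if not pred(provider.characteristics[attr]):
            return False
    return True


def matches(a: Resource, b: Resource) -> bool:
    """Symmetric match: each side's requirements are met by the other side."""
    return satisfies(a, b) and satisfies(b, a)


def score(provider: Resource, consumer: Resource) -> int:
    """Count how many of the consumer's preferences the provider meets."""
    return sum(
        1
        for attr, pred in consumer.preferences.items()
        if attr in provider.characteristics and pred(provider.characteristics[attr])
    )


task = Resource(
    "task",
    characteristics={"owner": "lhcb", "cpu_hours": 4},
    requirements={"os": lambda v: v == "linux", "memory_mb": lambda v: v >= 1024},
    preferences={"site": lambda v: v == "CERN"},
)
node = Resource(
    "node",
    characteristics={"os": "linux", "memory_mb": 2048, "site": "CERN"},
    requirements={"owner": lambda v: v in {"lhcb", "atlas"}},
)
```

Here `matches(task, node)` checks both directions, so the node may constrain the task just as the task constrains the node, while `score(node, task)` separates ranking from validity.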
One of the greatest features of the model was the ability to combine priority queues with resource templates, thus allowing resource representations to be replicated to multiple queues in order to maximise the likelihood of finding a “good” match, and reducing the computational effort of evaluating candidate compositions through the use of templates. This allows a relatively small number of distributed resource queues to hold, in principle, an unlimited number of resources, and for the scheduling problem to be rephrased as matching through resource composition with queue templates. These concepts were not present in Condor, nor in other grid scheduling systems, which generally struggle with more than 10e4 queued tasks and rely on a single central scheduling engine, or a set of centralised schedulers. This new strategy was initially investigated in DIRAC and has been formally presented in this work. The variations of asymmetric, symmetric, pair, multi-way, and aggregated resource matching were all developed, providing a comprehensive range of resource composition alternatives. These fit into a framework of comparators, matchers, and rankers which can be used to evaluate resource compositions and select from valid alternatives.
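The combination of priority queues and templates can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the DIRAC implementation: the names `TemplateQueue`, `replicate`, and `pull` are hypothetical, and properties are reduced to sets of (attribute, value) pairs so that conformance is plain set inclusion. A task whose requirements are a subset of a queue's template may be placed in that queue, and an executor whose characteristics are a superset of the template can take any task from it without examining the tasks individually.

```python
import heapq
from itertools import count


class TemplateQueue:
    """Hypothetical priority queue whose tasks all conform to one template."""

    def __init__(self, template):
        self.template = frozenset(template)
        self._heap = []
        self._tie = count()  # keeps insertion order among equal priorities

    def accepts(self, requirements):
        """A task fits if the template implies all of its requirements."""
        return frozenset(requirements) <= self.template

    def serves(self, characteristics):
        """An executor fits if it provides everything in the template."""
        return self.template <= frozenset(characteristics)

    def push(self, task_id, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._tie), task_id))

    def pop(self, claimed):
        """Pop the highest-priority task not already claimed via another queue."""
        while self._heap:
            _, _, task_id = heapq.heappop(self._heap)
            if task_id not in claimed:
                return task_id
        return None


def replicate(queues, task_id, requirements, priority):
    """Place the task in every queue that accepts it; return how many did."""
    targets = [q for q in queues if q.accepts(requirements)]
    for q in targets:
        q.push(task_id, priority)
    return len(targets)


def pull(queues, characteristics, claimed):
    """Task-pull: take a task from the first serving queue that has one.

    Only one template check is made per queue, never one per queued task.
    """
    for q in queues:
        if q.serves(characteristics):
            task_id = q.pop(claimed)
            if task_id is not None:
                claimed.add(task_id)
                return task_id
    return None


queues = [
    TemplateQueue({("os", "linux")}),
    TemplateQueue({("os", "linux"), ("arch", "x86_64")}),
]
replicate(queues, "t1", {("os", "linux")}, priority=1)
```

The inclusion chain (task requirements within template, template within executor characteristics) is what this sketch uses to stand in for the transitive property of templates. The shared `claimed` set is one simple way to stop a replicated task being run twice; a distributed deployment would need a real coordination mechanism in its place.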
Finally, a selection of applications of GRDL was discussed. These covered issues such as composition contracts, validation, extension, reservation, and security. They illustrated some of the properties which can be derived from the formal set theory model of GRDL.
In total, the REST model presented here provided the foundation for RESTful generic computational grids in a Condor ClassAds style, but with a significantly more robust relationship between characteristics, requirements, and preferences. The simplicity and consistency of the model make it easily realisable, thus facilitating multiple implementations. By decoupling resource descriptions from a particular service interface or scheduling strategy, the REST paradigm allows resource representations to be replicated and cached which, when combined with priority queues and the transitive properties of templates, enables scalable distributed resource scheduling. A complete implementation utilised in a production environment is required to fully validate this model; however, the concept has been verified both in practice via the DIRAC implementation and performance results from DC04, and established in theory via the work presented in the latter half of this dissertation.