7.2 Globus Middleware in the Cloud
7.2.2 An Implementation: Globus Virtual Cluster
To investigate the possibility of running Grids on Clouds in an interoperable fashion above and beyond the current state of the art, a Virtual Grid was implemented using the Globus Toolkit at its core. The middleware was configured to enable the execution of Web Services (via Java WS Core containers) and of job and batch applications via the integration of GRAM4 with Torque [247]. Torque is an actively developed open source resource manager forked from PBS that provides control of batch jobs over distributed compute nodes, with additional features and functionality that enhance scalability and reliability. Torque was configured to use Maui [164], an open source cluster scheduler supporting a number of scheduling policies including advance reservation and fairshare. Support for secure data management and for authorization/authentication was enabled via the use of GridFTP and GSI respectively. In addition, MPI job submission was enabled by installing and configuring MPICH2 [215] with a Torque-enabled mpiexec binary. This avoids the need to configure mpd, the MPI daemon, on worker nodes, which complicates MPI job submission. The architecture of the Globus Virtual Cluster can be seen in Figure 7.5.
Figure 7.5. Globus Virtual Cluster Architecture.
No components in the Globus Virtual Cluster interact with the underlying Cloud infrastructure. The Grid middleware is thus unaware that it is running virtualised across a number of virtual machines in a dynamically changing environment. This enables interoperability with a wide range of IaaS providers, as long as the base virtual machine image in which the Grid middleware software stack is installed is of a format that is supported by the IaaS provider’s hypervisor.
to set the networking context in either dynamic DHCP or static environments and selects an appropriate pre-generated client and server certificate at boot time, enabling the contextualized node to be accessible by other nodes in the cluster. The first node to come online automatically takes on the role of the head node, maintains a list of available worker nodes and acts as a gateway to GRAM via a “globus” client account accessible via GSI OpenSSH that uses X.509 certificates and the OpenCA [186] Public Key Infrastructure (PKI).
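The boot-time behaviour described above can be pictured with a minimal Python sketch. It is not the contextualization script used in this work: the `ClusterRegistry` class, its `join` method and the certificate naming scheme (`node-0001.pem`) are hypothetical, and a real deployment would need a shared registry that is safe against two nodes booting at the same moment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ClusterRegistry:
    """Hypothetical in-memory stand-in for the head node's bookkeeping."""
    max_nodes: int = 1000                 # size of the pre-generated certificate pool
    head: Optional[str] = None
    workers: List[str] = field(default_factory=list)
    issued: int = 0                       # certificates handed out so far

    def join(self, hostname: str) -> Tuple[str, str]:
        """Register a booting node and return (role, certificate file name)."""
        if self.issued >= self.max_nodes:
            raise RuntimeError("certificate pool exhausted")
        self.issued += 1
        cert = f"node-{self.issued:04d}.pem"   # assumed naming scheme
        if self.head is None:
            self.head = hostname           # first node online becomes the head node
            return "head", cert
        self.workers.append(hostname)      # later nodes are tracked as workers
        return "worker", cert
```

The sketch only captures the two rules stated in the text: certificates are taken from a fixed, pre-generated pool, and the first node to register becomes the head node.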
Fault tolerance and reliability mechanisms in the Grid middleware software stack are configured so that the cluster can take advantage of new virtual resources that come online and can accommodate VMs that are taken offline for cost savings when enhanced scalability and elasticity are no longer required. Jobs that fail because a resource was terminated at the Cloud infrastructure level are reassigned to nodes that are currently online and re-executed when these nodes become available.
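The reassignment policy can be illustrated with a short Python sketch. In the implementation this behaviour comes from the configuration of the middleware itself; the function below and its data structures (`running`, `online_nodes`, `queue`) are hypothetical and only show the policy of returning lost jobs to the queue.

```python
from collections import deque


def requeue_failed_jobs(running, online_nodes, queue):
    """Move jobs whose node has disappeared back to the front of the queue.

    running:      dict mapping job id -> name of the node the job was sent to
    online_nodes: set of node names that are currently reachable
    queue:        deque of job ids waiting to be scheduled
    Returns the list of job ids that were requeued.
    """
    lost = [job for job, node in running.items() if node not in online_nodes]
    for job in lost:
        del running[job]
    # extendleft reverses its input, so reverse first to keep the original order.
    queue.extendleft(reversed(lost))
    return lost
```

Placing the lost jobs at the head of the queue is a design choice of this sketch, so that interrupted work is retried before newly submitted jobs.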
There are a number of limitations with the current implementation. Since certificates are pre-generated for use with GSI, there is a predefined upper limit on the number of virtual resources that can be brought online. This limit is currently statically configured to 1000 nodes and can be increased if needed. In addition, the Cloud IaaS performs no monitoring of the Grid middleware that would enable the dynamic scaling of resources on demand according to a KPI such as the queue depth of the Maui scheduler. This is planned future work; in the meantime, the user who created the cluster is able to proactively provision more resources manually to reduce the runtime of an application.
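As an illustration of how the two limitations interact, the following Python sketch computes a scaling decision from a queue-depth KPI while respecting the certificate limit. It is not part of the implementation, since such monitoring is future work. The function name, the `jobs_per_node` parameter and its default value are assumptions made for this example.

```python
def nodes_to_provision(queue_depth, online_nodes, jobs_per_node=4, max_nodes=1000):
    """Return how many extra VMs to request for a given scheduler queue depth.

    queue_depth:   number of jobs waiting in the scheduler queue (the KPI)
    online_nodes:  number of nodes already running
    jobs_per_node: assumed number of queued jobs one new node can absorb
    max_nodes:     upper limit imposed by the pre-generated certificates
    """
    if queue_depth < 0 or online_nodes < 0 or jobs_per_node <= 0:
        raise ValueError("arguments must be non-negative, jobs_per_node positive")
    wanted = -(-queue_depth // jobs_per_node)    # ceiling division
    headroom = max(max_nodes - online_nodes, 0)  # certificates still unused
    return min(wanted, headroom)
```

For example, with ten queued jobs and five nodes online the sketch requests three further nodes, whereas a cluster already at 1000 nodes is never asked to grow.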
Owing to the complexity of the system, and to facilitate interoperability testing on IaaS providers, a number of tests were created to confirm the correct configuration of all the components within the software stack of the Globus Virtual Cluster. To validate that all the components within the Globus Toolkit were configured correctly, two tests were created that submitted dummy jobs using both WS-enabled GRAM and traditional Pre-WS GRAM. The WS-GRAM test utilised the Resource Specification Language (RSL) [214] to define a job, while the Pre-WS GRAM test used the gatekeeper daemon to dispatch a job using the GSI library for communication.
Listing 7.1. GRAM4 Test RSL Job Description
<job>
    <executable>my_echo</executable>
    <directory>/home/globusclient/ws-gram-test</directory>
    <argument>Hello</argument>
    <argument>World!</argument>
    <stdout>/home/globusclient/ws-gram-test/stdout</stdout>
    <stderr>/home/globusclient/ws-gram-test/stderr</stderr>
    <fileStageIn>
        <transfer>
            <sourceUrl>gsiftp://globus01:2811/bin/echo</sourceUrl>
            <destinationUrl>file:///home/globusclient/ws-gram-test/my_echo</destinationUrl>
        </transfer>
    </fileStageIn>
    <fileCleanUp>
        <deletion>
            <file>file:///home/globusclient/ws-gram-test/my_echo</file>
        </deletion>
    </fileCleanUp>
</job>
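A job description with the structure of Listing 7.1 can also be generated programmatically, which is convenient when the same test is repeated on several IaaS providers with different paths or host names. The following Python sketch uses only the standard library. The function name and its parameters are illustrative assumptions rather than part of the test suite described here.

```python
import xml.etree.ElementTree as ET


def build_job_description(workdir, staged_name, source_url, arguments):
    """Return a job description with the element layout of Listing 7.1."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = staged_name
    ET.SubElement(job, "directory").text = workdir
    for arg in arguments:
        ET.SubElement(job, "argument").text = arg
    ET.SubElement(job, "stdout").text = f"{workdir}/stdout"
    ET.SubElement(job, "stderr").text = f"{workdir}/stderr"

    # The staged file is addressed by a file:// URL inside the working directory.
    local_url = f"file://{workdir}/{staged_name}"
    transfer = ET.SubElement(ET.SubElement(job, "fileStageIn"), "transfer")
    ET.SubElement(transfer, "sourceUrl").text = source_url
    ET.SubElement(transfer, "destinationUrl").text = local_url

    # Remove the staged executable again once the job has finished.
    deletion = ET.SubElement(ET.SubElement(job, "fileCleanUp"), "deletion")
    ET.SubElement(deletion, "file").text = local_url
    return ET.tostring(job, encoding="unicode")
```

Because the destination of the stage-in and the target of the clean-up are derived from the same variable, the two URLs cannot drift apart when the working directory changes.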
As part of the WS-GRAM test, shown in Listing 7.1, the RSL specified the use of GridFTP for staging data to worker nodes, confirming that GSI-secured GridFTP (GSIFTP) was fully operational. An additional test was created to confirm that the execution of MPI jobs was possible. This test utilised GRAM2 RSL and the underlying “qsub” command to test Torque and Maui functionality. The RSL used in this test is shown in Listing 7.2.
Listing 7.2. Job Description for Testing MPI with GRAM2 RSL
&(jobtype=mpi)
 (executable='/home/globusclient/mpi-test/helloworld.bin')
 (count='80')
 (stdout='/home/globusclient/mpi-test/stdout.txt')
 (stderr='/home/globusclient/mpi-test/stderr.txt')
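The attribute/value structure of Listing 7.2 is simple enough to be rendered from a dictionary, as the following Python sketch shows. The helper `to_rsl` is hypothetical and deliberately conservative: it quotes every value that is not a plain alphanumeric token and rejects values containing a single quote instead of attempting to escape them.

```python
def to_rsl(attributes):
    """Render a mapping as a GRAM2-style RSL conjunction: &(k=v)(k='v')..."""
    parts = []
    for name, value in attributes.items():
        text = str(value)
        if "'" in text:
            # Escaping rules are left out of this sketch on purpose.
            raise ValueError(f"unsupported quote in value for {name!r}")
        if not text.isalnum():
            text = f"'{text}'"        # quote paths and other non-trivial values
        parts.append(f"({name}={text})")
    return "&" + "".join(parts)
```

Calling the helper with the attributes `jobtype`, `executable`, `count`, `stdout` and `stderr` yields a single-line equivalent of the listing, which can then be written to a file and submitted in the same way as a hand-written description.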
An unforeseen benefit of the development of the Globus Virtual Cluster, in addition to evaluating whether running Grids on Clouds in an interoperable fashion is possible, has been the facilitation of research in Grid Computing by a number of postgraduate students at the School of Computing, University of Leeds [55, 224]. These students have made use of the Globus Virtual Cluster to perform experiments on SLA renegotiation and to gain an understanding of Web Services in Grid architectures. In addition, a number of final year undergraduate students have performed quantitative evaluations of the negligible performance overheads of running MPI jobs on virtualised Grid infrastructure.