
Development of Co-Pilot Agent – Co-Pilot Adapter communication protocol

CHAPTER 4. DEVELOPMENT OF TWO MODELS OF INTEGRATION OF CLOUD

4.4 Development of 'Co-Pilot' model for integration of cloud computing resources

4.4.1 Development of Co-Pilot Agent – Co-Pilot Adapter communication protocol

Agents communicate with the Adapter service over the Jabber/XMPP [149] instant messaging protocol using XML-formatted messages. Each message is enclosed in a <message> tag, which has two attributes, from and to (the sender's and receiver's Jabber IDs), and contains a <body> tag. All values inside the <body> tag are encoded using the Base64 [152] algorithm. Each message body contains an <info> tag. The <info> tag has the command attribute, which contains the command to be performed by the peer (e.g. the Agent sends the 'jobDone' command when it has finished executing the job and uploading files), as well as other attributes necessary to perform the command. The commands which the Agent sends, together with the response commands which it expects from the Adapter service, are presented below.

The first command sent from the Agent to the Adapter is the job request command, getJob. The request contains information about the host where the Agent runs (e.g. available disk space and memory), formatted in JDL [116], as well as the hostname of that machine. An example message containing the getJob command is given in Figure 4.10.

<message to='[email protected]' from='[email protected]'>
  <body>
    <info agentHost='_BASE64:Y3ZtYXBwaTI0LmNlcm4uY2g='
          jdl='_BASE64:CiAgICBbCiAgICAgICAgUmVxdW…..i4yNi44LTAuMS5zbXAuZ2NjMy40Lng4Ni5pNjg2IjsgCiAgICAgICAgRnJlZU1lbW9yeSA9IDY4MzsgCiAgICAgICAgU3dhcCA9IDEyNzsgCiAgICAgICAgRnJlZVN3YXAgPSAxMjcKICAgIF0='
          command='_BASE64:Z2V0Sm9i'/>
  </body>
</message>

Figure 4.10. Co-Pilot Agent – Co-Pilot Adapter communication protocol message containing getJob command
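The attribute encoding used in Figure 4.10 can be illustrated with a minimal Python sketch. It assumes that every attribute of the <info> tag carries the _BASE64: prefix seen in the figures; the helper names decode and parse_info and the example.org Jabber IDs are hypothetical and are not taken from the Co-Pilot code. The jdl attribute is left out because its value is truncated in the figure.

```python
import base64
import xml.etree.ElementTree as ET

PREFIX = "_BASE64:"  # prefix seen on every attribute value in the figures


def decode(value):
    """Strip the _BASE64: prefix and return the decoded text."""
    if not value.startswith(PREFIX):
        raise ValueError("attribute value is not Base64-prefixed: %r" % value)
    return base64.b64decode(value[len(PREFIX):]).decode("utf-8")


def parse_info(xml_text):
    """Return the decoded attributes of the <info> tag of a message."""
    message = ET.fromstring(xml_text)
    info = message.find("./body/info")
    return {name: decode(value) for name, value in info.attrib.items()}


# Jabber IDs below are placeholders; attribute values are copied from Figure 4.10.
example = (
    "<message to='adapter@example.org' from='agent@example.org'>"
    "<body><info agentHost='_BASE64:Y3ZtYXBwaTI0LmNlcm4uY2g=' "
    "command='_BASE64:Z2V0Sm9i'/></body></message>"
)
fields = parse_info(example)
print(fields["command"], fields["agentHost"])  # prints: getJob cvmappi24.cern.ch
```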

Upon receiving a job request from the Agent, the Adapter contacts the AliEn Job Broker and requests a job for execution. After receiving the job JDL from the AliEn Job Broker, the Adapter gets the job input data files from the AliEn file catalogue and copies them to a directory from which the Agent can download them by means of the Chirp file system [153]. The Adapter then sends the Agent a message containing the runJob command as well as a <job> tag with the following attributes:


• id - Job ID
• chirpUrl - Address of the Chirp server from which to get the job input
• inputDir - Directory on the Chirp server from which to get the input files of the job
• inputFiles - List of job input files
• environment - Environment variables which must be set before job execution
• packages - List of application software packages needed by the job
• command - Job execution command
• arguments - Command line arguments of the job execution command
• validation - Name of the script for validating the job after it is finished (optional)

An example message containing the runJob command is presented in Figure 4.11.

<message to='[email protected]' from='[email protected]'>
  <body>
    <info command='_BASE64:cnVuSm9i'>
      <job id='_BASE64:MzA3NTczOTQ='
           chirpUrl='_BASE64:Y3ZtYXBwaTIxLmNlcm4uY2g6OTA5NA=='
           inputDir='_BASE64:L2FsaWVuLWpvY...MDUwNTYxM2JhMDQ='
           inputFiles='_BASE64:Y29tbWFuZA=='
           environment='_BASE64:IEFMSUVOX...Sk9CX1RPS0VOPSd1dU1oUWNsOWw5'
           packages='_BASE64:'
           command='_BASE64:Y29tbWFuZA=='
           arguments='_BASE64:'/>
    </info>
  </body>
</message>

Figure 4.11. Co-Pilot Agent – Co-Pilot Adapter communication protocol message containing runJob command
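The construction of such a runJob message can be sketched as follows. This is an illustration only, not the Adapter's actual code: the function names encode and build_run_job and the example.org Jabber IDs are hypothetical. The job values are the decoded forms of the attributes shown in Figure 4.11; inputDir and environment are omitted because their values are truncated in the figure.

```python
import base64
import xml.etree.ElementTree as ET

PREFIX = "_BASE64:"


def encode(value):
    """Base64-encode a string and add the prefix used in the figures."""
    return PREFIX + base64.b64encode(value.encode("utf-8")).decode("ascii")


def build_run_job(to_jid, from_jid, job):
    """Build a <message> carrying a runJob command and a <job> tag."""
    message = ET.Element("message", {"to": to_jid, "from": from_jid})
    body = ET.SubElement(message, "body")
    info = ET.SubElement(body, "info", {"command": encode("runJob")})
    ET.SubElement(info, "job", {name: encode(value) for name, value in job.items()})
    return ET.tostring(message, encoding="unicode")


# Values decoded from Figure 4.11; empty strings encode to a bare prefix.
job = {
    "id": "30757394",
    "chirpUrl": "cvmappi21.cern.ch:9094",
    "inputFiles": "command",
    "packages": "",
    "command": "command",
    "arguments": "",
}
xml_text = build_run_job("agent@example.org", "adapter@example.org", job)
print(xml_text)
```

Encoding an empty string yields just the prefix, which matches the packages='_BASE64:' and arguments='_BASE64:' attributes in the figure.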

When the Agent finishes job execution, it sends the Adapter a request to provide an output directory for the job output data. That request contains the getJobOutputDir command, the hostname of the Agent and the ID of the job. After receiving the output directory request from the Agent, the Adapter creates the output directory for the job on the Chirp server and makes it writeable for the Agent. The message with the output directory information contains the storeJobOutputDir command as well as the following attributes:


• outputChirpUrl - Address of the Chirp server to which the job output must be uploaded
• outputDir - Directory on the Chirp server in which the job output files must be put
• jobID - Job ID

When the output files are uploaded, the Agent sends the Adapter a message containing the jobDone command, the job execution exit code and the ID of the job. After receiving the jobDone command, the Adapter uploads the job output files to the Storage Element specified in the job JDL, registers the files in the AliEn file catalogue and changes the job status to 'DONE'. Errors that occur during the operation of the Adapter or the Agent can be reported to the corresponding peer using a message containing the jobError command with the errorCode and errorMessage attributes.
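The request and reply pairs described so far can be summarised in a small Python sketch. The table and the function classify_reply are hypothetical helpers written for illustration; the text does not describe a reply to jobDone, so no entry is given for it.

```python
# Command sent by the Agent -> command expected back from the Adapter,
# as described in the text above.
EXPECTED_REPLY = {
    "getJob": "runJob",
    "getJobOutputDir": "storeJobOutputDir",
}


def classify_reply(request, reply):
    """Classify the Adapter's reply to an Agent request."""
    if reply == "jobError":
        return "error"  # either peer may report a failure this way
    expected = EXPECTED_REPLY.get(request)
    if expected is not None and reply == expected:
        return "ok"
    return "unexpected"


print(classify_reply("getJob", "runJob"))
print(classify_reply("getJobOutputDir", "jobError"))
```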

The protocol supports redirection of messages, i.e. a Co-Pilot Adapter can redirect a request from an Agent to another Adapter. This feature allows different kinds of Adapters to be implemented for different tasks while giving the Agent only a single communication address. For example, one can set up an Adapter which is used to retrieve job details and input files (Job Request Adapter) and another Adapter which is used to upload job execution results and set the final job status (Job Completion Adapter). The message redirection command is called redirect; the Jabber ID of the service to which the message must be redirected is passed in the referral attribute, and the message to be redirected is enclosed in the <info> tag. An example redirection message sent from the Adapter to the Agent is given in Figure 4.12.

<message to='[email protected]' from='[email protected]'>
  <body>
    <info referral='_BASE64:c3RvcmFnZXJlYWxAY3ZtYXBwaTIxLmNlcm4uY2g='
          command='_BASE64:cmVkaXJlY3Q='>
      <info exitCode='_BASE64:MA=='
            jobID='_BASE64:MzA4MDE3NDE='
            command='_BASE64:am9iRG9uZQ=='/>
    </info>
  </body>
</message>

Figure 4.12. Co-Pilot Agent – Co-Pilot Adapter communication protocol message containing redirect command

After receiving this message, the Agent decodes the value of the referral attribute to get the Jabber ID to which the message must be sent (in the given example the string '_BASE64:c3RvcmFnZXJlYWxAY3ZtYXBwaTIxLmNlcm4uY2g=' decodes to 'storagereal@cvmappi21.cern.ch'), generates a new message from the contents of the inner <info> tag of the original message, and sends it:

<message to='[email protected]' from='[email protected]'>
  <body>
    <info exitCode='_BASE64:MA=='
          jobID='_BASE64:MzA4MDE3NDE='
          command='_BASE64:am9iRG9uZQ=='/>
  </body>
</message>

Figure 4.13. Co-Pilot Agent – Co-Pilot Adapter communication protocol message containing redirect command
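The redirection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Agent's actual implementation: the function follow_redirect and the example.org Jabber IDs are hypothetical, while the attribute values are copied from Figure 4.12.

```python
import base64
import xml.etree.ElementTree as ET

PREFIX = "_BASE64:"


def decode(value):
    """Strip the _BASE64: prefix and return the decoded text."""
    return base64.b64decode(value[len(PREFIX):]).decode("utf-8")


def follow_redirect(xml_text, agent_jid):
    """Build the message an Agent would send after receiving a redirect."""
    outer = ET.fromstring(xml_text).find("./body/info")
    if decode(outer.get("command")) != "redirect":
        raise ValueError("not a redirect message")
    inner = outer.find("info")  # the message to be forwarded
    forwarded = ET.Element(
        "message", {"to": decode(outer.get("referral")), "from": agent_jid}
    )
    ET.SubElement(forwarded, "body").append(inner)
    return forwarded


redirect_xml = (
    "<message to='agent@example.org' from='adapter@example.org'><body>"
    "<info referral='_BASE64:c3RvcmFnZXJlYWxAY3ZtYXBwaTIxLmNlcm4uY2g=' "
    "command='_BASE64:cmVkaXJlY3Q='>"
    "<info exitCode='_BASE64:MA==' jobID='_BASE64:MzA4MDE3NDE=' "
    "command='_BASE64:am9iRG9uZQ=='/>"
    "</info></body></message>"
)
forwarded = follow_redirect(redirect_xml, "agent@example.org")
print(forwarded.get("to"))  # prints: storagereal@cvmappi21.cern.ch
```

The inner <info> element is forwarded unchanged, so its attributes stay Base64-encoded; only the referral value needs to be decoded by the Agent.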

The protocol also supports a secure mode of operation, in which all messages exchanged by Agents and Adapters are encrypted using the AES symmetric encryption algorithm [154]. Encryption is done using a 256-bit key, which is generated by the Agent and sent along with the request to the Adapter (in the session_key attribute). The session key itself is encrypted using the RSA [155] algorithm with the Adapter's public key, so it can be decrypted only with the corresponding private key. The Adapter encrypts the replies it sends to the Agent using the key it received with the request. An example message exchanged in secure mode is given in Figure 4.14.

<message to='[email protected]' from='[email protected]'>
  <body info='_BASE64:VTJGc2RHVmJ...MFAwakc3V2hiM3J6VjQKcDRiRgo='
        session_key='_BASE64:YlcrWXlUHJw...UmmpPNGFmSmVSb1dZdz09Cg=='/>
</message>

Figure 4.14. Co-Pilot Agent – Co-Pilot Adapter communication protocol message exchanged in secure mode

In secure mode Agents are required to authenticate to the Adapters. Authentication is done using a special authentication ticket, which the Agent retrieves from a dedicated service called the Key Manager. The ticket specifies the Adapters with which the Agent is allowed to communicate. The ticket has a limited lifetime and is signed using the RSA algorithm with the private key of the Key server. To get the ticket, the Agent sends the Key Manager a message containing the getTicket command as well as the credential attribute, the value of which is used by the Key Manager to authenticate the Agent. After successful authentication of the Agent, the Key Manager sends the Agent a message containing the storeTicket command and the ticket, which is sent as the value of the ticket attribute. Once the ticket is obtained, it is included in all messages sent by the Agent as the value of the serviceAuthenticationTicket attribute of the <info> element.

4.5 Comparison of 'Classic' and 'Co-Pilot' models. Measurement of their timing characteristics

The major advantage of the 'classic' model is that it is very easy to implement, because its implementation does not require modification of the code of either AliEn or the Nimbus toolkit. The drawback of the model is that one needs a separate virtual machine (and, under high load, several of them) to run site services. The deployment of these virtual machines is time consuming, and by excluding them from the setup one could potentially use more virtual machines for deploying worker nodes and running more jobs. The 'classic' model assumes that the application software is brought to the cloud by the PackMan service and is made available to worker nodes through a server running the Network File System (NFS). This is not optimal, because the CernVM image already provides the application software, and it would be better if worker nodes, instead of waiting for the PackMan service to install the software and then accessing it through NFS, could directly use the application packages available via CVMFS. Such an approach would eliminate the PackMan service from the deployment; however, it requires modification of the code of the AliEn Job Agent service, which is not feasible in our case.

The implementation of the 'Co-Pilot' model does not require deployment of service nodes and thus potentially allows running more jobs on the same number of virtual machines (since some of them can be used as worker nodes rather than service nodes). In addition, it uses the application software available from CernVM and does not require a Grid package management service such as AliEn PackMan. The current implementation of the Co-Pilot Adapter can be used to execute jobs on the cloud only from the AliEn Grid. However, it can be extended to communicate with any pilot job framework, e.g. PanDA, the distributed production and analysis system [156] used by the CERN ATLAS [157] experiment, or Dirac [158], the Grid solution used by the CERN LHCb experiment [159]. The current implementation of the Co-Pilot Agent contains nothing AliEn-specific and is written in such a way that running jobs fetched from other frameworks should not require extra development.

For the implementations of both models we measured the time which elapses between a) the launch of the site deployment command and the start-up of the node(s) on the cloud, b) the launch of the site deployment command and the request of job(s) by the worker nodes, and c) the request of job(s) by the worker nodes and the assignment of jobs to them by the AliEn Job Broker. To perform the measurements we deployed sites with different numbers of worker nodes, from 1 to 10. For each number of worker nodes 3 deployments were performed, so 30 deployments were done for the 'classic' model (launching 165 worker nodes overall) and 30 deployments for the 'Co-Pilot' model (also launching 165 worker nodes overall). For each deployment the timings were measured, and afterwards the mean values of the recorded times were calculated. The deployment commands were launched from a machine located at CERN to the Nimbus Science Cloud of the University of Chicago. The plot in Figure 4.15 shows the mean values of the time which elapses between:

• launch of the site deployment command and start-up of the node(s) on the cloud (light green bars for the 'classic' model implementation and light orange bars for the 'Co-Pilot' model implementation)
• launch of the site deployment command and request of job(s) by the worker nodes (dark green bars for the 'classic' model implementation and dark orange bars for the 'Co-Pilot' model implementation)

Figure 4.15. Virtual site deployment timings

The plot (Figure 4.15) shows that:

• The time of worker node deployment on the cloud is proportional to the number of virtual machines being launched.
• For the same number of worker nodes, the start-up duration, that is the period from the issuing of the site deployment command to the booting of the OS of the nodes, is longer in the 'classic' model, because a deployment following the 'classic' model launches an additional node for running the AliEn site services.
• The time interval between the start-up and the first job request is practically independent of the number of worker nodes (Figure 4.15, dark green and dark orange bars). It is about 400 seconds in the case of the 'classic' model and about 15 seconds in the case of the 'Co-Pilot' model. In the 'classic' model this interval is needed for starting the site and Job Agent services of AliEn, while in the 'Co-Pilot' model it is needed for starting the Co-Pilot Agent.

The plot in Figure 4.16 shows the mean values of the time which elapses between the request of job(s) by the worker nodes and the assignment of jobs to them by AliEn (dark green bars for 'classic' sites and dark orange bars for Co-Pilot sites).

In the case of a 'classic' site the mean time does not exceed 5 seconds, and in the case of a Co-Pilot site it varies from 3 to 30 seconds. The reason for this is that the current implementation of the Co-Pilot Adapter serves requests sequentially, whereas the AliEn services process several requests simultaneously.

Figure 4.16. Job request and arrival timings
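The effect of sequential request handling can be illustrated with a toy model. The sketch below is not based on the Adapter's code: it assumes that all worker nodes of a site request jobs at the same moment and that the Adapter needs a constant 3 seconds per request (a hypothetical value chosen to match the observed minimum), so the k-th request is answered after 3k seconds.

```python
SERVICE_TIME = 3  # seconds per request; an assumption, not a measured constant


def sequential_waits(n_requests, service_time):
    """Waiting times when n simultaneous requests are served one at a time."""
    return [service_time * k for k in range(1, n_requests + 1)]


# Same layout as the measurements: sites with 1..10 worker nodes, 3 deployments each.
waits = []
for n_nodes in range(1, 11):
    for _ in range(3):
        waits.extend(sequential_waits(n_nodes, SERVICE_TIME))

print(len(waits), min(waits), max(waits), sum(waits) / len(waits))
# prints: 165 3 30 12.0
```

Under these assumptions the model reproduces the 3 to 30 second range and gives a mean of 12 seconds, close to the 11.327 seconds reported for the Co-Pilot site in Table 4.1. This is consistent with, but does not prove, the sequential-processing explanation.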

Table 4.1 shows the minimum, maximum, mean and standard deviation of the measured time values for all 165 worker nodes launched during the 30 deployments according to the 'classic' model, as well as for the 165 worker nodes launched during the 30 deployments according to the 'Co-Pilot' model.

Interval                     Minimum   Maximum   Mean       Standard deviation

'classic' site
Launch – Node up             529       2031      1566.915   350
Launch – Job request         982       2489      2010.188   350.558
Job request – Job arrival    1         5         2.381      0.82

Co-Pilot site
Launch – Node up             476       1906      1241.679   383.219
Launch – Job request         491       1921      1256.588   383.184
Job request – Job arrival    3         30        11.327     7.717

Table 4.1. Min., Max., Avg. and SD of measured variables (values in seconds)
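Two figures quoted in this section can be cross-checked with simple arithmetic, sketched here in Python. The variable names are hypothetical; the numbers are taken from Table 4.1 and from the description of the measurement setup. The difference of the two means is used as an estimate of the mean interval between node start-up and the first job request.

```python
# Sites with 1..10 worker nodes, 3 deployments per size.
deployments_per_size = 3
worker_nodes = deployments_per_size * sum(range(1, 11))
print(worker_nodes)  # prints: 165

# Mean values from Table 4.1.
means = {
    "classic": {"node_up": 1566.915, "job_request": 2010.188},
    "co_pilot": {"node_up": 1241.679, "job_request": 1256.588},
}

# Mean time between node start-up and the first job request.
gap = {
    model: round(values["job_request"] - values["node_up"], 3)
    for model, values in means.items()
}
print(gap)
```

The resulting differences, about 443 seconds for the 'classic' site and about 15 seconds for the Co-Pilot site, are roughly in line with the values read from Figure 4.15.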

The minimum values of 'Launch – Node up' and 'Launch – Job request' were recorded during the deployment of sites with 1 worker node, and the maximum values were recorded during the deployment of sites with 10 worker nodes. The large variations of these intervals were expected, because the load on the Nimbus Workspace Service naturally grows with the number of workspaces which have to be launched simultaneously.