Int. J. Computational Science and Engineering, Vol. 1, No. 1/1,


Improving Data Transfer Performance of Web Service Workflows in the Cloud Environment

Donglai Zhang

School of Computer Science, University of Adelaide, Adelaide 5000, Australia

E-mail: [email protected] *Corresponding author

Paul Coddington

eResearch SA and University of Adelaide Thebarton 5031, Australia

[email protected]

Andrew Wendelborn

School of Computer Science, University of Adelaide, Adelaide 5000, Australia

E-mail: [email protected]

Abstract: Web Service Data Forwarding (WSDF) is a framework for centralized web service workflows, in which the intermediate result from a previous service is treated as a resource of the composite service and can be used directly by its subsequent service, without being sent back to the centralized control centre. To improve the data transfer performance of web service workflows in the cloud environment, we carried out a test of the WSDF framework in the ScienceCloud, provided by the Nimbus cloud infrastructure. The experiment showed that, in the cloud environment, the WSDF framework has a significant performance advantage over the normal web service framework for workflows with large data transfers, and that the performance improvement agrees with the expected theoretical value.

Keywords: web service workflow; wsrf; stateful; data transfer; cloud.

Reference to this paper should be made as follows: Zhang, D., Coddington, P. and Wendelborn, A.L. (2012) ‘Improving Data Transfer Performance of Web Service Workflows in the Cloud Environment’, Int. J. Computational Science and Engineering, Vol. 1, Nos. 1/1, pp.1–10.

Biographical notes:

Andrew L Wendelborn received his PhD degree in computer science from the University of Adelaide in 1986. He was a software engineer for ICL in Australia and the UK. He commenced PhD studies in 1978, took up a position in Computer Science Department at the University of Adelaide in 1985, and is currently a Senior Lecturer. His research interests are programming models and applications in cloud and grid computing, e-Research and data intensive computing, parallel functional programming, and reflective computing. He is a member of ACM and the IEEE Computer Society, and active on several conference committees.

Donglai Zhang is currently a PhD candidate at the Computer Science Department of the University of Adelaide. He graduated as a Master of Computer Science from the University of Adelaide in 2005. After that, he worked as a programmer in South Australia Partnership of Advanced Computing (SAPAC). He commenced his PhD study in 2007. His research interests are scientific workflows and applications in cloud and grid environments, and data transfer and management in distributed environments.

Dr. Paul Coddington is Deputy Director of eResearch SA, a South Australian eResearch service provider. He has a PhD in computational physics from the University of Southampton. He subsequently worked at Caltech, Syracuse University and the University of Adelaide on research and development projects focussing on high-performance and distributed computing and the Web, particularly their application to a variety of scientific problems, including the development of online scientific data repositories.

1 Introduction

E-Science aims to serve scientists from various disciplines in their research work. For researchers, an ideal scientific workflow automatically carries out a complete set of data processing steps for their research at the click of a run button (Atkinson et al., 2007; Bramley et al., 2006). For example, the data can be generated from instruments located at remote sites and processed by a sequential series of data analysis steps. The final result is then returned to the user or stored in a location where it can be accessed by collaborators from different organizations.

The increasing size of data being processed in workflows has led to significant overhead for data transfer between different services within a workflow, especially when the services composed into a workflow are web services. Research has addressed both centralized and decentralized workflow systems. In a centralized workflow, it is relatively easy for the user to control the workflow, but the data generated by each service needs to be sent back to the centralized control point before it is forwarded to the next service provider as input, which increases both time and resource (network bandwidth, CPU and memory) consumption. In a decentralized workflow, on the other hand, data can be shared directly between the services in the workflow. However, it is often hard to control the whole workflow in a decentralized manner. To overcome the efficiency problems of the centralized workflow model, in our previous work (Zhang et al., 2011) we introduced the WSDF framework to improve data transfer. A single web service is recognized as a stateful web service if the state of a specific service instance can be kept between invocations. In the WSDF framework, we define the concept of a stateful workflow, by which we mean that the intermediate data is preserved between successive services in the same workflow (Zhang et al., 2011). In a stateful workflow, all atomic services need to be stateful and all intermediate data is shared directly between atomic services, as in the decentralized workflow model; this is discussed further in section 4.2.

Within a WSDF workflow, when the client invokes a service, the data to be processed is passed to the service together with the resource forwarding information, which specifies where to forward the result data once processing is complete. After the current service processes the input data, it forwards the result data to the next web service according to the resource forwarding information, and the data is saved on the successor service as a WSRF (Web Service Resource Framework) resource. The resource is given a unique resource reference, represented by an Endpoint Reference (EPR), which is sent back to the workflow engine (client) and used by the client later to invoke the next service in the workflow. In previous work (Zhang et al., 2011), we used a simulated environment to test the feasibility of the WSDF framework and found that the time used for intermediate result transfer between different services was significantly reduced.

The emerging cloud technology provides another platform for hosting the web services involved in a workflow. Cloud computing provides IT related services via network connections. It is a distributed computing paradigm in which the service provider supplies large scale, scalable computational resources (e.g. virtual machines) and storage capacity (Foster, I. et al., 2005; Geelan, J., 1991). Different types of cloud services are provided: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). The cloud has been widely used as a platform for workflow execution, and the cost and effectiveness of cloud computing platforms for scientific workflows have been studied in (Deelman et al. and Marcellin, 2008; Kondo et al., 2009; Hoffa et al., 2008). According to (Hoffa et al., 2008), workflows can suffer from wide-area communications, particularly when the data transfer time constitutes a large part of the overall computation time. When not using WSDF, centralized workflows suffer the same shortcomings as workflows in a standard distributed environment: the data needs to go via the workflow engine, which is often located remotely from the cloud, e.g. on the user's desktop. By using WSDF in a cloud environment, this shortcoming can be eliminated.

Consider the following scenario: a large set of data is generated at the user's experiment site and is to be processed by a workflow. To take advantage of the computational power provided by a cloud platform, e.g. Amazon EC2, the scientists carefully select, before the experiment, the services to be used in the workflow that will process the scientific data. Services are built into images and stored in the cloud. Before the experiment starts, these images are instantiated on virtual machines, according to the expected resources, such as CPU, memory size, and network connections.


Therefore the services are initiated in the cloud. By controlling a centralized workflow for which the workflow engine sits on a desktop machine, the scientists initiate the generation of the data from scientific instrument(s) and the generated data will be processed by the services within the cloud. Finally, the result data is sent back to the workflow engine on client side.

Within a cloud environment, data sharing between consecutive services is even more efficient than in other distributed environments. In an IaaS cloud, a basic IaaS type cloud service is provided to the client to acquire the necessary computational resources, from a single Linux virtual machine to a virtual cluster with over a hundred nodes. Data sharing can therefore be completed without leaving the cloud. For the different processing components within the workflow, users can submit an image that contains the service; by instantiating the image as a virtual machine, the user can share the software within the cloud environment. The client is nevertheless still able to control the whole workflow from his/her desktop in a centralized workflow.

In this environment, the composite service is hosted within the cloud. At the beginning of the workflow, the data is sent to the service that is hosted within the cloud environment. After the web service processes the input data, the output data is forwarded to the next web service in the workflow. As the intermediate data does not need to be sent back to the workflow engine, bandwidth and execution time are saved. Furthermore, as the forwarding of the data is carried out between different web services within the workflow in the cloud, the web services can be configured in a local network. If they are in the same cloud, they can be hosted within the same data centre, physically close to each other and connected by a high bandwidth network.

This article is organized as follows. In section two, we review the WSDF framework and our previous work; in section three, we compare the cloud environment with the previous distributed environment for hosting web service workflows under the WSDF framework; in section four, the experiment environment is introduced; in section five, we explain the model and the equations used to calculate the expected performance improvement and give the results of our experiments. We also compare the performance improvement for the WSDF framework in the cloud (i.e. the ScienceCloud) and in a normal distributed environment; in section six, we review the related work; finally, we present the conclusion and discuss future work.

2 Web Service Data Forwarding (WSDF) Framework

With the development of information technology, more and more scientific research uses workflows as a powerful tool for carrying out the work.

Figure 1 Composite Service (diagram labels: Remote Distributed Environment; Service One to Service Six; Composite Service One; Composite Service Two)

These works are often data oriented and therefore involve large scale data processing. Data transfer speed between the different components in the workflow is one of the key aspects that affects the overall performance of a scientific workflow.

There are two different kinds of workflow according to the control point: centralized workflows and distributed workflows. In most real world cases, users want to have full control of the running workflow. This is natural for security and convenience reasons. One well known centralized workflow system is Kepler (Ludascher et al., 2006).

To save data transfer time while giving the user full control of the workflow, we introduced the WSDF framework (Zhang et al., 2011).

2.1 WSDF Model Assumptions

The primary target of WSDF framework is to address the data sharing issue within a web service workflow, as the current web service framework is based on the client-server model.

The WSDF framework is based on the following assumptions:

• Participants (e.g. researchers in a common research project) share computational resources and data to solve a large scale, collaborative task.

• The service workflow is composed of web services and the current service output can be directly consumed by the successor service.

2.2 The Stateful workflow

With normal web service workflows, a centralized workflow engine works as the client of the different web services: it sends a request to invoke each web service in the workflow. The intermediate result from each participant service is sent back to its requester, the workflow engine, even if the result will be used by the successor service. On the other hand, the services can be seen as a composite service, as shown in Figure 1. To the workflow engine, the combination of different services can be seen as a composite service on the remote side. If the workflow engine provides the initial data and does not need to get the intermediate data, then the composite service as a whole is as simple as a normal web service: it gets input, processes the data and sends back the result. To achieve this, the intermediate data needs to be stored on the composite service side and sent to the next service.

We define a workflow that provides this functionality as a stateful workflow. In a stateful workflow, each web service is a stateful web service. The de facto standard for stateful web services is the Web Service Resource Framework (Zhang et al., 2011). In a stateful workflow, successive web services share the intermediate data: instead of being sent back to the workflow engine, it is stored as a resource in the individual web services, and a reference to it is created and returned to the workflow engine. In this way, the state of the whole workflow is saved and stored on the composite service side. In our design, the workflow engine has complete control of the workflow execution and uses the returned resource reference to invoke the next service in the composite service.

2.3 Data Forwarding between stateful web services

Within the composite service, a stateful web service (the current service) that is invoked not only processes the data forwarded to it, but also needs to store the result data on the composite service side as a resource and return the resource reference. To allow data sharing between the current service and the successor service, either a push or a pull mechanism can be applied. We implemented the push model, as we assume the workflow engine already knows which service will be invoked next and will need the output of the current service. Under the push model, the current service needs to forward the result to the successor service, so the workflow engine should send the data (if it is the first service in the workflow, or extra data as input) as well as information about the successor service (the forwarding information) to the current service. To distinguish the normal data parameters from the resource forwarding information, we introduced a specific namespace, wsdf (Zhang et al., 2011). When the current service is invoked, the normal data and the resource forwarding information, which is under the wsdf namespace, are both sent to that service.
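As an illustration of how a service might separate ordinary data parameters from forwarding information, the following minimal sketch splits a flat parameter dictionary by a namespace prefix. The `wsdf:` prefix string, the parameter names and the example URL are assumptions made for this sketch; the actual WSDF message layout is defined in (Zhang et al., 2011) and is not reproduced here.

```python
# Hypothetical sketch: names and the "wsdf:" prefix are illustrative assumptions,
# not the message format used by the actual WSDF implementation.
WSDF_PREFIX = "wsdf:"


def split_request(params):
    """Split request parameters into (data_params, forwarding_info).

    Parameters whose name starts with the wsdf prefix are treated as
    resource forwarding information; everything else is ordinary input data.
    """
    data, forwarding = {}, {}
    for name, value in params.items():
        if name.startswith(WSDF_PREFIX):
            forwarding[name[len(WSDF_PREFIX):]] = value
        else:
            data[name] = value
    return data, forwarding


request = {
    "image": b"raw image bytes",                       # ordinary data parameter
    "wsdf:successor": "http://vm02.example.org/rgb",   # hypothetical successor address
}
data, forwarding = split_request(request)
```

Keeping the forwarding information under its own namespace means a service that does not understand it can still read the ordinary data parameters unchanged.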

3 Distributed Environment Comparison

The cooperation between different participants in a cooperative research work often involves building a workflow to process scientific data. Each participant can provide software, data and/or host the services. To support these services, there are two general approaches:

• Each software provider hosts their programs/services on their local site. Any user can access these services by composing them into a workflow. This is normally hard for services that require non-trivial resources to run. The software providers are often unable to predict and supply enough computational/storage/network resources to host these services. It also requires more system administration effort to monitor and administer the servers.

• Software and services are all hosted by a regional computational centre or a computer centre for a specific research discipline. These service providers can provide better outcomes than the first approach, as these entities often have better network connections, more powerful computational facilities and larger storage capacity. All the servers are run under a single administrative domain, which is another great advantage over separate administrations.

The second approach has shown significant advantages over the first one. Because the organization and management of the services is maintained within a single administration scope, the availability of services and the interconnection between the services can be greatly improved. Different services can be connected within a local network, typically by an Ethernet connection, which can save a significant amount of data transfer time, particularly when using the WSDF framework. However, the second approach also has the following problems, for which the cloud infrastructure is a better alternative.

• The administration is not completely automatic. Human intervention and communication are necessary to provide service hosting. For example, a service provider asks the system administrator to maintain the service by providing a Linux box that has the necessary service image installed. A system administrator will take care of the machine, including providing the necessary network configuration and firewall settings. With very large data storage, which possibly exceeds the capacity of a single machine, it may take some time for system administrators to provide sufficient storage, which may require procuring and installing additional disks and could potentially introduce uncertainty as well as delay in the deployment of the whole workflow. The cloud, on the other hand, is built to essentially eliminate these problems: resource instantiation and allocation are controlled by resource management software to achieve the highest efficiency. These services cover almost all requests that a workflow might make. Furthermore, within a cloud, it is not necessary to talk to a system administrator, since the provisioning of resources is all automated.

• Resources are limited in a specific organization. The service provider has a reasonable amount of resources, for example in terms of machines and network connections, with which to provide services. But compared with such a provider, cloud providers such as Amazon and Apple have hundreds or even thousands of servers in their resource pools, a more stable power supply, better network connections and system administration expertise, so the cloud often provides a better solution for these kinds of services.

We utilize the cloud infrastructure provided by ScienceCloud to test the performance of WSDF workflow to see how effective this new framework can be within the cloud environment.

4 Facilities and Experiments Setting

4.1 Facilities

We use ScienceCloud as our cloud platform to carry out data transfer experiments of scientific workflows. ScienceCloud is an instance of the Nimbus cloud management software, provided by the University of Chicago. It provides free access to its computational resources to the academic community.

According to the cloud configuration requirement, we use the default configuration file provided by ScienceCloud. We initialize three virtual machines, vm01, vm02 and vm03. Each of these virtual machines is allocated 2 CPU cores and 3 gigabytes of memory. The firewall settings allow the three virtual machines to communicate with each other directly. Using iperf to test the connection between the virtual machines, we found an average of about 910 Mbits per second. The network connection between the cloud virtual machines and the client desktop at the University of Adelaide was also tested, and the average speed measured by iperf with its default configuration was 54.0 Mbits/sec.
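To give a feel for what these two bandwidth figures mean for a single file, the sketch below estimates an idealized transfer time from file size and bandwidth. It is only a back-of-the-envelope model: it ignores latency, protocol overhead and congestion, and it assumes 1 MB = 10^6 bytes and 1 Mbit = 10^6 bits. The function name and the 50 MB example are illustrative choices, not part of the experiment code.

```python
def transfer_seconds(size_bytes, bandwidth_mbits_per_sec):
    """Idealized transfer time in seconds (no latency or protocol overhead)."""
    bits = size_bytes * 8
    return bits / (bandwidth_mbits_per_sec * 1_000_000)


CLIENT_TO_CLOUD = 54.0  # Mbits/sec, iperf figure quoted in the text
INSIDE_CLOUD = 910.0    # Mbits/sec, iperf figure quoted in the text

size = 50 * 1_000_000   # a 50 MB file (assumed decimal megabytes)
print("client <-> cloud:", round(transfer_seconds(size, CLIENT_TO_CLOUD), 2), "s")
print("inside the cloud:", round(transfer_seconds(size, INSIDE_CLOUD), 2), "s")
```

Under these assumptions the same file moves between two virtual machines in well under a tenth of the time it takes to cross the client link, which is the gap the WSDF framework exploits.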

Based on a simple cloud image, hello-world, which is provided by ScienceCloud, we built an image for our workflow experiments. The new image is named wsdf-hello-world and its size is 10 Gigabytes. When the experiment was carried out, ScienceCloud did not provide a data storage service as Amazon does. To avoid extra overhead that could be introduced by a data storage service, we enlarged the hello-world image to 10 Gigabytes. The new image is submitted to the cloud and saved into the user's repository by using the client side tool provided by the Nimbus cloud.

On the client side, a Linux box with kernel version 2.6.18 is used as the workflow engine.

4.2 Workflow Setting

The following figure shows the relationship between the normal web service workflow and a WSDF workflow in a cloud environment.

Figure 2 Workflows in Cloud. (a) Web Service Workflow (Workflow Engine; Cloud Environment with Web Service One, Two and Three; combined Data and Control Flow). (b) WSDF Workflow (Workflow Engine; Cloud Environment with WSDF Service One, Two and Three; separate Data Flow and Control Flow).

Figure 2 illustrates that within a workflow, for any web service which is hosted by the cloud, all its input/output data and the control information (overlapped with the data flow) between this web service and the client engine needs to be passed between the workflow engine and the cloud. This is not efficient as the cloud provider can be far from its client. Within a WSDF workflow, on the other hand, only the initial data needs to be transferred between the client and cloud. Within the cloud, as network connection for data transfer between different services is very efficient, and the WSDF framework has provided the functionality of direct data transfer between different participants within the web service workflow, it will be much more efficient to apply WSDF workflow in a cloud environment.

We use the workflows we built for the previous experiment (Zhang et al., 2009) to test performance in the cloud environment. Every workflow has two versions: a WSDF version and a normal web service version. The WSDF workflow uses WSDF services as its service providers, and the normal web service workflow uses normal web services as its service providers. Each workflow is made up of a set of instances of the same service, called RGB. A WSDF RGB service provides create, setAttachAsResource and convert operations. The convert operation is a functional operation that takes the content of a .bmp image file as input and changes the color of its pixels: red to green, green to blue and blue to red. The create operation is used to create an EPR for a service instance on this service, and the reference is sent back to the client. The setAttachAsResource operation is used to set the attachment of the request as a resource, which is then processed by the convert operation. The convert operation of a normal RGB web service also consumes an input file (.bmp format) and changes the color of each pixel in the file; the updated content is returned. In the cloud environment, RGB services run on two different web service servers: a normal web service server or a WSDF web service server.
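The colour change performed by convert can be illustrated with a small sketch. It assumes the mapping "red to green, green to blue and blue to red" means that each channel value moves to the next channel, so a pixel (r, g, b) becomes (b, r, g); it also works on already decoded pixel tuples rather than on the bytes of a .bmp file. Both choices are assumptions of this sketch, and the function names are hypothetical.

```python
def convert_pixel(pixel):
    """Rotate channels: the red value moves to green, green to blue, blue to red."""
    r, g, b = pixel
    return (b, r, g)


def convert(pixels):
    """Apply the channel rotation to every pixel of a decoded image."""
    return [convert_pixel(p) for p in pixels]


image = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # pure red, green, blue
once = convert(image)                             # one RGB service
thrice = convert(convert(once))                   # three chained RGB services
```

Under this interpretation, chaining three RGB services returns the original image, which makes a three-service workflow easy to check end to end.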

Figure 3 shows the steps in a WSDF workflow. In this figure, there are three RGB services involved in the workflow. For each of them, first, a request to create an EPR is sent to the service and the created EPR is returned (steps 1, 5 and 9); then the data to be processed is sent to the service and saved as a resource referenced by the EPR created in the previous step (steps 2, 6 and 10); after that, a convert request is sent to the service to process the saved resource (steps 3, 7 and 11); finally, the result is sent back to the workflow engine (step 13).

Figure 3 RGB Workflow Steps in a WSDF Framework (participants: Workflow Engine, RGB_A, RGB_B, RGB_C; arrows: Control Flow, Data Flow; panels: Step 1. Create EPR; Step 2. Set Resource; Step 3. Invoke Convert Operation; Step 4. Process saved Resource; Step 5. Create Resource Instance; Step 6. Set Resource; Step 7. Invoke Convert Operation; Step 9. Create EPR; Step 10. Set Resource; Step 11. Invoke Convert Operation; Step 13. Return Processed Content)
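The sequence of create, set resource and convert calls can be sketched with in-memory stand-ins for the services. This is a toy model, not the WSRF-based implementation: the class and function names are hypothetical, EPRs are plain strings, and the sketch assumes that the current service itself creates the resource on its successor and pushes the result there, returning only the new EPR to the engine.

```python
class WsdfServiceStub:
    """In-memory stand-in for a WSDF service (hypothetical, no SOAP or WSRF)."""

    def __init__(self, name, operation):
        self.name = name
        self.operation = operation  # function applied by convert
        self.resources = {}         # EPR string -> stored data
        self._next_id = 1

    def create(self):
        """Create an empty resource and return its reference (toy EPR)."""
        epr = f"{self.name}/resource-{self._next_id}"
        self._next_id += 1
        self.resources[epr] = None
        return epr

    def set_attach_as_resource(self, epr, data):
        """Store data as the resource referenced by epr."""
        self.resources[epr] = data

    def convert(self, epr, successor=None):
        """Process the stored resource; push the result to the successor if any."""
        result = self.operation(self.resources[epr])
        if successor is None:
            return result                       # last service: content goes back
        next_epr = successor.create()
        successor.set_attach_as_resource(next_epr, result)
        return next_epr                         # engine only receives a reference


def run_workflow(services, data):
    """Engine side: upload once, then drive each service by reference.

    Assumes services is a non-empty list.
    """
    epr = services[0].create()
    services[0].set_attach_as_resource(epr, data)
    for i, service in enumerate(services):
        successor = services[i + 1] if i + 1 < len(services) else None
        out = service.convert(epr, successor)
        if successor is not None:
            epr = out
    return out


def tag(name):
    """Toy operation: append the service name to a list."""
    return lambda data: data + [name]


services = [WsdfServiceStub(n, tag(n)) for n in ("RGB_A", "RGB_B", "RGB_C")]
result = run_workflow(services, [])
```

In this model the engine handles the payload only twice, at the initial upload and at the final return; every intermediate result stays on the service side and is addressed by reference.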

5 Experiments In the Cloud

We carried out testing within the cloud and give performance results of both normal web service workflow and WSDF workflow. We also give the expected performance improvement for WSDF workflow in the cloud environment according to the formula we derived from our previous work (Zhang. D, 2011) and verify that the performance improvement of WSDF workflows meets our expectation.

5.1 Total Time Consuming

Figure 4 shows the total time consumption of the web service workflow and the WSDF workflow. From this figure we can see that, across different file sizes and different numbers of web services involved, the WSDF workflow always has a significant advantage over the normal web service workflow. This matches the results of the experiment we carried out in the simulated distributed environment.

5.2 Performance Improvement Expectation

In our previous work, we built equations to represent the theoretical time saving on data transfer for a WSDF workflow. We define the percentage of time saving from WSDF to be:

\[ P = \frac{T - T'}{T} \times 100\% \tag{1} \]

Figure 4 Total time consuming in cloud for different file sizes and different number of web services for both WSDF and normal web services (panels (a) and (b)).

In equation (1), $T$ is the overall transfer time for the normal web service workflow and $T'$ is the overall transfer time for the WSDF workflow. A derivation of the expected theoretical values of $T$, $T'$ and $P$ is given in (Zhang et al., 2011). For services hosted in a cloud, we can assume that the bandwidths between the client and all the services are the same (represented by $BW_{C,S}$) and that the bandwidths between all services are the same ($BW_{S,S}$). Based on these assumptions, the performance improvement is given by:

\[ P = \frac{\sum_{i=1}^{n-1}\left(\frac{DO_i + DI_{i+1}}{BW_{C,S}} - \frac{DO_i}{BW_{S,S}}\right)}{\sum_{i=1}^{n}\left(\frac{DI_i + DO_i}{BW_{C,S}}\right)} \times 100\% \tag{2} \]

Within a workflow, the output data $DO_i$ ($i \in (1, n-1)$) of one service is often used as the input data $DI_{i+1}$ ($i \in (1, n-1)$) of the next service. If we use $DO_i$ to replace $DI_{i+1}$, then equation (2) can be simplified to:

\[ P = \frac{\sum_{i=1}^{n-1}\left(\frac{2\,DO_i}{BW_{C,S}} - \frac{DO_i}{BW_{S,S}}\right)}{\sum_{i=1}^{n}\left(\frac{DI_i + DO_i}{BW_{C,S}}\right)} \times 100\% \tag{3} \]

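In our experiment, the WSDF workflow is built from $n$ instances of RGB services, where the input data $DI_i$, $i \in (1, n)$, and the output data $DO_i$, $i \in (1, n)$, have the same size, which is represented by $D$. In this case, equation (3) becomes:

\[ P = \frac{D \sum_{i=1}^{n-1}\left(\frac{2}{BW_{C,S}} - \frac{1}{BW_{S,S}}\right)}{D \sum_{i=1}^{n}\left(\frac{2}{BW_{C,S}}\right)} \times 100\% \tag{4} \]

Equation (4) can be further simplified to:

\[ P = \frac{n-1}{n} \left(1 - 0.5 \times \frac{BW_{C,S}}{BW_{S,S}}\right) \times 100\% \tag{5} \]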
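For readers who want the intermediate algebra between equations (4) and (5), the step can be written out as follows. This is a worked restatement using only the symbols defined above; the summands in (4) do not depend on $i$, so each sum reduces to a count of its terms.

```latex
\begin{align*}
P &= \frac{D\,(n-1)\left(\frac{2}{BW_{C,S}} - \frac{1}{BW_{S,S}}\right)}
          {D\,n\,\frac{2}{BW_{C,S}}} \times 100\% \\
  &= \frac{n-1}{n} \cdot \frac{BW_{C,S}}{2}
     \left(\frac{2}{BW_{C,S}} - \frac{1}{BW_{S,S}}\right) \times 100\% \\
  &= \frac{n-1}{n}\left(1 - 0.5 \times \frac{BW_{C,S}}{BW_{S,S}}\right) \times 100\%
\end{align*}
```

The data size $D$ cancels, so under these assumptions the expected saving depends only on the number of services and on the ratio of the two bandwidths.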

We measured the network bandwidth on the cloud instances that we obtained. The bandwidth between the client and the servers was 54.0 Mbits/sec, and the bandwidth between different servers was 910 Mbits/sec. If there are 6 services in total in the workflow, then by equation (5) the theoretical result is:

\[ P = \frac{6-1}{6} \times \left(1 - 0.5 \times \frac{54.0}{910}\right) \times 100\% \approx 81\% \tag{6} \]

In the following section, we will compare this expected data transfer performance improvement with the real performance improvement we measured from the experiments to verify whether the practical result agrees with our proposed theory.
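The expected saving can also be evaluated numerically. The sketch below implements the general form of equation (2) and the equal-size form of equation (5), then evaluates the latter for several workflow lengths with the measured bandwidths. Function and variable names are illustrative; lists are 0-based, so `di[i]` and `do[i]` stand for $DI_{i+1}$ and $DO_{i+1}$.

```python
def saving_general(di, do, bw_cs, bw_ss):
    """Equation (2): expected transfer-time saving in percent.

    di, do: input and output data sizes per service (same units, same length).
    bw_cs:  client-to-service bandwidth; bw_ss: service-to-service bandwidth.
    """
    n = len(di)
    saved = sum((do[i] + di[i + 1]) / bw_cs - do[i] / bw_ss for i in range(n - 1))
    total = sum((di[i] + do[i]) / bw_cs for i in range(n))
    return saved / total * 100


def saving_equal_sizes(n, bw_cs, bw_ss):
    """Equation (5): all inputs and outputs have the same size."""
    return (n - 1) / n * (1 - 0.5 * bw_cs / bw_ss) * 100


BW_CS, BW_SS = 54.0, 910.0  # Mbits/sec, measured values quoted in the text

for n in (3, 6, 9, 12):
    print(n, "services:", round(saving_equal_sizes(n, BW_CS, BW_SS), 1), "%")
```

With equal data sizes the two functions agree, and for six services the result rounds to the 81% given in equation (6).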

5.3 Performance Improvement Analysis

Our interest is not limited to the general trend of performance improvement obtained by applying the WSDF workflow. We also analyse the performance impact of the file size and the number of web services involved in the workflow. The time consumption of a workflow can be classified into two categories: functional processing time and I/O time. The WSDF workflow provides the same computational functionality, which consumes the same amount of time. However, it saves time in the data transfer (I/O) part. We therefore compare the performance improvement after removing the processing time from the total time consumption of both workflows.
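The comparison described here, subtracting the processing time before computing the saving, can be sketched as a small helper. The function name and the timing values in the example are hypothetical and are not measurements from the experiments; `bst` stands for the total processing time, assumed equal for both workflows.

```python
def transfer_saving_percent(total_normal, total_wsdf, bst):
    """Percentage of data transfer time saved by WSDF.

    total_normal: total time of the normal web service workflow
    total_wsdf:   total time of the WSDF workflow
    bst:          processing time common to both workflows (same units)
    """
    io_normal = total_normal - bst
    io_wsdf = total_wsdf - bst
    if io_normal <= 0:
        raise ValueError("processing time must be smaller than the total time")
    return (io_normal - io_wsdf) / io_normal * 100


# Hypothetical timings in seconds, for illustration only.
print(transfer_saving_percent(total_normal=100.0, total_wsdf=50.0, bst=20.0))
```

Removing the shared processing time matters because leaving it in would dilute the measured saving for compute-heavy workflows, even though the transfer behaviour is unchanged.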

Figures 5, 6 and 8 show the performance of the WSDF workflow compared with the normal web service workflow. Each workflow is also run with different numbers of web services. The BST (Basic Service Time) represents the time used by the web service for functional processing of the input data.

In Figure 5, the experiment is based on files with sizes ranging from 100K bytes to 1M bytes. As we can see from this figure, the BST accounts for very little of the total time, as the data size is very small. The majority of the time used by both workflows is spent transferring data between participants in the workflow. According to the results, when 3 web services are involved, the workflow with a 100K bytes file achieves a 14% time saving on data transfer, the 500K bytes file achieves a 30.72% improvement, and for a 1M bytes input file the transfer time saving reaches 41.58%. This shows the following trend: the larger the file, the higher the improvement.

Figure 5 Comparison of performance of WSDF and normal web services in the cloud for small file sizes (panels (a), (b) and (c))

As we have seen, in a WSDF workflow a WSDF service manages the result generated by the current service and forwards it to the successor service, where it is saved as a resource. This means the successor service needs to create a resource reference for the result and save it, which involves extra resource management cost. With small files, the time cost of resource creation and management is steady and relatively high compared with the total time consumption. As the input file size increases, the proportion of time used in resource management decreases quickly, so the overall performance improvement becomes significant.

In Figure 6, the performance improvement of WSDF in the cloud environment is given for medium size files. For a workflow with 3 RGB services, the performance improvement is 58.71% with a 5M bytes file, 60.33% with a 10M bytes file and about 57.79% with a 50M bytes file. This means that the gain from further increasing the file size is relatively small and the improvement becomes reasonably stable; this also applies to the larger files shown in Figure 8.

The other factor that affects the performance improvement of a WSDF workflow is the number of services involved in the workflow. As shown in Figure 2 (a) and (b), three web services hosted in the cloud are invoked in a workflow. In Figure 2 (a), there are three data transfers between the client and the servers, all of them bidirectional. In Figure 2 (b), the cloud hosts three WSDF services, and in total there are two unidirectional data transfers between the services and the client. This means that for a workflow with three service invocations, the WSDF workflow needs two one-way data transfers, whereas a normal web service workflow needs six. If more services are involved in a workflow, a WSDF workflow still needs only two one-way data transfers between client and cloud; only the number of data transfers between services, which are all hosted in the cloud, increases. For normal web services, in contrast, the number of data transfers between client and server grows in proportion to the number of services. Consequently, the more services are involved in a WSDF workflow, the greater the performance improvement.
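The transfer counts in this argument can be written down directly. The sketch below counts one-way transfers for a linear workflow of n services, following the reasoning in the text; the function names are hypothetical, and the count of n - 1 transfers inside the cloud for WSDF follows from each service forwarding its output to its successor.

```python
def client_cloud_transfers(n_services, wsdf):
    """One-way data transfers crossing the client-to-cloud link."""
    if n_services < 1:
        raise ValueError("a workflow needs at least one service")
    # WSDF: initial upload and final download only.
    # Normal: one request and one response per service.
    return 2 if wsdf else 2 * n_services


def in_cloud_transfers(n_services, wsdf):
    """One-way data transfers between services inside the cloud."""
    if n_services < 1:
        raise ValueError("a workflow needs at least one service")
    return n_services - 1 if wsdf else 0


for n in (3, 6, 9, 12):
    print(n, client_cloud_transfers(n, wsdf=False), client_cloud_transfers(n, wsdf=True))
```

The slow link is crossed a constant number of times under WSDF, while the extra hops it introduces all travel over the fast network inside the cloud.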

For example, as shown in Figure 6, with a file size of 10M bytes, when there are three services involved, the time saving is 58.71%; when there are 6, 9 and 12 services involved, the time savings are 78.44%, 81.97% and 85.09% respectively. According to our experiments, the same trend also holds for other file sizes.

Figure 6 Comparison of performance of WSDF and normal web services in the cloud for medium file sizes (panels (a), (b) and (c))

Figure 7 WSDF Performance Improvement in Normal Distributed Environment (panels (a), (b) and (c))

Figure 8 WSDF vs. Normal Web Service Performance in Cloud (with large files) (panels (a), (b) and (c))

As discussed above, the time saved by using WSDF increases as more of the data transfer takes place between web services that are in the cloud. Here we assume that the web services hosted in the cloud are located within a single data centre; ideally, they are physically no more than a few racks apart.

5.4 WSDF performance improvement comparison

In our previous work (Zhang et al., 2009), we ran the WSDF performance tests within a simulated wide-area network environment, using the WANem simulation software (see Note 10). Some of the results are shown in Figure 7. These results were obtained with WANem configured to give a network performance of 100 Mbits/sec between the web services (under the assumption that the services were all on a local area network connected by 100 Mbits/sec Fast Ethernet), and with the simulated network between the client and the services configured to match measured intercontinental (Australia to the USA) latencies and bandwidths. Hence this simulated network is a close match to the real network environment for the experiments we have done using the ScienceCloud. The only significant difference is that, for the ScienceCloud, the bandwidth that we measured between the services was significantly higher than in our simulation: 941 Mbits/sec rather than 100 Mbits/sec.

The performance improvement of the WSDF workflows in the cloud is moderately better than that measured in the simulated distributed environment, although the difference is small. For example, with a 5M bytes input file and 3, 6, 9 and 12 services, WSDF achieves performance improvements of 56.21%, 72.51%, 77.40% and 80.85% over normal web services in the simulated environment, while the same WSDF workflows in the cloud achieve 60.33%, 76.37%, 82.35% and 83.25%.

For a workflow with 6 services and file sizes of 5M, 10M, 50M and 100M bytes, the performance improvements in the normal distributed environment are 72.51%, 72.68%, 68.27% and 68.30%. In the cloud, the same WSDF workflow achieves performance improvements of 76.37%, 78.44%, 79.37% and 79.94%. These two groups of data show two things. First, the WSDF performance improvement in a cloud environment is close to the theoretical result from section 5.2, where the expected time saving is about 81% for a workflow of 6 services. Second, the measured performance improvement in the cloud was slightly better than in the simulated environment. We believe the reason is that the ScienceCloud provided faster network connections between the WSDF services (910 Mbits/sec) than the value we used in the simulated environment (100 Mbits/sec), which is based on the measured bandwidths of real networks.
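The comparison for the 6-service workflow can be made explicit with a short calculation over the percentages quoted above. The variable and function names are illustrative only; the numbers are those reported in the text.

```python
# Reported performance improvements (%) for a workflow with 6 services.
file_sizes_mbytes = [5, 10, 50, 100]
simulated = [72.51, 72.68, 68.27, 68.30]   # normal distributed (simulated) environment
cloud = [76.37, 78.44, 79.37, 79.94]       # ScienceCloud


def gaps(cloud_vals, simulated_vals):
    """Difference, in percentage points, between cloud and simulated improvements."""
    return [round(c - s, 2) for c, s in zip(cloud_vals, simulated_vals)]


if __name__ == "__main__":
    for size, gap in zip(file_sizes_mbytes, gaps(cloud, simulated)):
        print(f"{size:>3}M bytes: cloud - simulated = {gap:.2f} percentage points")
```

With these figures, the gap between the two environments is smallest for the 5M bytes file and largest for the 100M bytes file.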

6 Related work

Scientific workflows often involve the processing of large data sets and are often data-flow oriented. Within a centralized workflow model, data sharing between different web services can often become the bottleneck of a workflow (Barker et al., 2009). Some research in this area suggests using a decentralized workflow model to solve this problem (Barker et al., 2008).

Any workflow has control flow as well as data flow aspects. The relationships between them are explored in (Liu et al., 2005).

In (Barker et al., 2008), the authors describe algorithms to convert a workflow into smaller units that run on different servers with direct communication between them. The expected benefit of this approach is to prevent the central point of workflow orchestration from becoming a bottleneck and to significantly improve the overall throughput. This approach is especially suitable for data-driven workflows. However, it also makes the whole system more complex, whereas our approach does not change the workflow.

In (Walter et al., 2006), a proxy model is proposed and a hybrid architecture is built. Here a proxy is defined as a piece of middleware that is closely coupled to a functional service and acts as its gateway. It delegates the invocation of the functional service, manages input/output data storage and is responsible for sending result data between workflow components. This work has identified several research issues in the data sharing problem between web services in a centralized workflow, such as the storage, forwarding and retrieval of result data. Compared with our approach, the drawback of the proxy model is that it addresses the data sharing problem at the application level rather than at the server level, which leaves programmers with the work of maintaining these services themselves.

7 Conclusion and Future Work

In our previous work, we proposed the WSDF framework to allow direct data sharing between consecutive web services within a web service workflow. We also built prototype workflows and a simulation environment to verify the performance improvement. Building on that work, we carried out similar experiments in a cloud-based environment.

The experiments in the cloud show that the measured improvement in this particular cloud environment was even greater than that obtained in a similar experiment in a normal distributed environment. However, the advantage of the cloud is more significant in terms of management and stability.

In most circumstances, for a user of IT resources who has basic knowledge of cloud infrastructure, hosting disk images in a research centre rather than in the cloud is more time consuming in terms of locating hardware, installing the proper software, setting up all the services, configuring firewalls and carrying out other system administration work. Computational power, network connections and data storage cannot be expanded as quickly as in the cloud. In addition, normal centre-based services need person-to-person communication (by email, face-to-face conversation, etc.) to set up the services, which is inefficient and less predictable. The cloud allows users to specify the machine instances they are looking for, and all configuration can be carried out by the users directly. Finally, all these functions are exposed to users via web or web service interfaces through APIs provided by the cloud provider, so all of the work can be automated.

In our experiments, we compared the performance of a WSDF workflow with that of a normal web service workflow in a remotely provided cloud environment, and a significant performance improvement was achieved. The percentage performance improvements from using WSDF are very similar in the different environments. However, as the cloud can often be more efficient in system administration, network connection and the provision of storage capacity, a cloud environment is often a better choice for deploying workflows based on the WSDF framework.

Another advantage of the cloud is that cloud providers also offer data storage services to their users. For example, Amazon provides a data storage service (see Note 4) which can be used together with its EC2 service. We carried out our experiments without using such services, as the data sets used in the workflows are relatively small and can be saved on the disk image. In the real world, a scientific workflow could process large data sets, e.g. at terabyte or even petabyte scale, so this functionality is vital. On the other hand, most normal service providers, such as small computing centres, cannot provide this level of data storage in a convenient way.

Acknowledgment

The staff members who maintain the ScienceCloud gave us extensive support in running our experiments.

References

Atkinson, I., et al. (2007) ‘Developing cima-based cyberinfrastructure for remote access to scientific instruments and collaborative e-research’, Australian Symposium on Grid Computing and Research, Conferences in Research and Practice in Information Technology, Australian Computer Society, Australia, 2007, Vol. 68, pp.3–10.

Ludascher, B. et al. (2006) ‘Scientific workflow management and the Kepler system: Research Articles’, Concurrency and Computation: Practice and Experience, Vol. 18, No. 10, pp.1039–1065.

Zhang, D., Coddington, P., and Wendelborn, A.L. (2011) ‘Web Services workflow with result data forwarding as resources’, Future Generation Computer Systems, Vol. 27, pp.694–702.

Hoffa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K., Berriman, B., and Good, J. (2008) ‘On the use of cloud computing for scientific workflows’, IEEE Fourth International Conference on eScience, 2008, pp.640–645.

Bramley, A., Chiu, K., Devadithya, T., Gupta, N., Hart, C., Huffman, J.C., Huffman, K., Ma, Y., and Mcmullen, D.F. (2006) ‘Instrument monitoring, data sharing and archiving using common instrument middleware architecture’, Journal of Chemical Information and Modeling, Vol. 46, No. 3, pp.1017–1025.

Deelman, E., Singh, G., Livny, M., Berriman, B., and Good, J. (2008) ‘The cost of doing science on the cloud: the montage example’, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, Austin, Texas, 2008, pp.50:1–50:12.

Kondo, D., Javadi, B., Malecot, P., Cappello, F., and Anderson, D. (2009) ‘Cost-benefit analysis of cloud computing versus desktop grids’, Parallel Distributed Processing, 2009, IEEE International Symposium on, pp.1–12.

Geelan, J. (2011) ‘Twenty-one experts define cloud computing’, http://cloudcomputing.sys-con.com/node/612375/, 2011.

Foster, I., Zhao, Y., Raicu, I. and Lu, S. (2005) ‘Cloud Computing and Grid Computing 360-Degree Compared’, Grid Computing Environments Workshop, 2008. GCE ’08, pp.1–10.

Zhang, D., Coddington, P., and Wendelborn, A.L. (2011) ‘Technical Report: Web Services workflow with result data forwarding as resources’, http://www.dhpc.adelaide.edu.au/reports/198/dhpc-198.pdf.

Barker, A., Besana, P., Robertson, D., and Weissman, J.B. ‘The benefits of service choreography for data-intensive computing’, Proceedings of the 7th international workshop on Challenges of large applications in distributed environments, 2009. CLADE ’08, pp.1–10.

Chafle, G., Chandra, S., Mann, V., and Nanda, M.G. ‘Orchestrating composite Web services under data flow constraints’, Web Services, 2005. ICWS 2005. Proc. 2005 IEEE Int. Conference on, Vol. 1, pp.211–218.

Liu, D., Law, K.H., and Wiederhold, G. ‘Data-flow Distribution in FICAS Service Composition Infrastructure’, in Proceedings of the 15th International Conference on Parallel and Distributed Computing Systems, 2003.

Barker, A., Weissman, J.B., and van Hemert, J.I. ‘Orchestrating Data-Centric Workflows’, Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid, pp.210–217, 2008.

Walter, B., Ion, C., and Boi, F. ‘Orchestrating Data-Centric Workflows’, ICWS ’06: Proc. of the IEEE Int. Conference on Web Services, pp.869–876, 2006.

Note

1 http://www.oasis-open.org, ‘Oasis open homepage’

2 http://docs.amazonwebservices.com, 2010

3 http://aws.amazon.com/, 2010

4 http://aws.amazon.com/s3, 2010

5 http://www.microsoft.com/windowsazure, 2009

6 http://www.scienceclouds.org

7 http://www.kepler-project.org/

8 http://www.apple.com/icloud/, iCloud

9 http://www.nimbusproject.org/, Nimbus

10 http://www.wanem.sourceforge.net/, ‘WANem: The Wide Area Network emulator’