A scientific workflow framework for scientific data querying and processing


Wayne State University Dissertations

1-1-2011

A scientific workflow framework for scientific data querying and processing

Xubo Fei

Wayne State University,

Follow this and additional works at: http://digitalcommons.wayne.edu/oa_dissertations

This Open Access Dissertation is brought to you for free and open access by DigitalCommons@WayneState. It has been accepted for inclusion in Wayne State University Dissertations by an authorized administrator of DigitalCommons@WayneState.

Recommended Citation

Fei, Xubo, "A scientific workflow framework for scientific data querying and processing" (2011). Wayne State University Dissertations. Paper 347.


A SCIENTIFIC WORKFLOW FRAMEWORK FOR SCIENTIFIC DATA QUERYING AND PROCESSING

by

XUBO FEI

DISSERTATION

Submitted to the Graduate School of Wayne State University,

Detroit, Michigan

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

2011

MAJOR: COMPUTER SCIENCE

Approved by:




To my parents with love


First and foremost, I would like to thank my advisor, Dr. Shiyong Lu, for helping me finish my thesis. In my time under his advisorship, I have learned skills that I had struggled with all my life: how to write and present clearly; how to conquer large and complex problems; and how to formalize a mathematical model. He often challenged me for creative ideas and gave me a lot of insightful comments and unending encouragement. Dr. Lu’s supervision in my study and research shifted my fundamental perspective from viewing computer science as a useful tool to viewing it as a world of precise logic, subtle complexity, and artistic design.

Many thanks to Dr. Jeffrey Ram from the Physiology department. I collaborated with Dr. Ram on the TangoInSilico project for about 2 years. We worked closely and held a regular meeting every week. I am always impressed by his profound knowledge, great intelligence and passion for research. I really enjoy our discussions and have learned a lot of biology knowledge and scientific methodologies during our collaboration.

I am very grateful to Dr. Shiyong Lu, Dr. Farshad Fotouhi, Dr. Jeffrey Ram, and Dr. Chandan Reddy, for serving on my dissertation committee and providing profuse encouragement and productive advice on my dissertation.

I would also like to thank every member of the Scientific Workflow Research Laboratory, Dr. Yi Lu, Dr. Mustafa Atay, Dr. Jamal Ali Alhiyafi, Dr. Seunghan Chang, Dr. Artem Chebotko, Dr. Cui Lin, Chunhyeok Lim, Dong Ruan, and Sha Liu, for their sincere friendship and great help.

Finally, I would like to thank my parents, Honglie Fei and Jianli Xu, for their love, support, and encouragement throughout my whole life.


Dedication

Acknowledgments

List of Tables

List of Figures

CHAPTER 1 Introduction

1.1 Scientific Workflows and Scientific Workflow Management Systems

1.2 Scientific Data Management

1.3 Research Challenges

1.4 Contributions

1.5 Roadmap

CHAPTER 2 Related Work

2.1 Workflow Modeling

2.1.1 Business workflow modeling

2.1.2 Scientific workflow modeling

2.2 Scientific Workflow Data Models

2.3 Scientific Workflow Management Systems

2.4 Chapter Summary

3.1 Key Requirements for a Scientific Workflow Composition Model

3.2 Scientific Workflow Model

3.3 Scientific Workflow Constructs and Composition

3.3.1 The Map Construct

3.3.2 The Reduce Construct

3.3.3 The Tree Construct

3.3.4 The Conditional Construct

3.3.5 The Loop Construct

3.3.6 The Curry Construct

3.4 Workflow Composition

3.5 A Dataflow Based Approach for Exception Handling

3.5.1 Exception Handling

3.5.2 The Exception Construct

3.6 Case Studies

3.6.1 Workflow for Freebase Processing

3.6.2 Workflows for Matrix Summation

3.7 Chapter Summary

CHAPTER 4 Collectional Data Model

4.1 A Motivating Example of Biological Simulation

4.2 The Collectional Data Model

4.3 Collectional Scientific Workflow Composition

CHAPTER 5 VIEW: A Prototypical Scientific Workflow Management System

5.1 VIEW Architecture

5.2 Workflow Engine

5.3 Data Product Manager

5.3.1 Architecture of the Data Product Manager

5.3.2 Interface of the Data Product Manager

5.4 Data Type System in VIEW

5.5 Scientific Workflow Approach for Collectional Data Querying

5.6 Chapter Summary

CHAPTER 6 Conclusions and Future Work

Appendix A Scientific Workflow Language (SWL)

Appendix B Data Product Language (DPDL)

Appendix C WSDL Specification for Workflow Engine Web Services

Appendix D WSDL Specification for Data Product Manager Web Services

Bibliography

Abstract

Autobiographical Statement

Table 5.1: Scalar data type mappings among VIEW, MySQL, and XML.

Figure 1.1: A Reference Architecture for SWFMSs.

Figure 3.1: (a) Correct data dependencies under the single-assignment property; (b) incorrect data dependencies due to violation of the single-assignment property.

Figure 3.2: (a) Traditional scientific workflow model; (b) our proposed scientific workflow model.

Figure 3.3: (a) a graph-based workflow; (b) a unary-construct-based workflow.

Figure 3.4: Six unary workflow constructs.

Figure 3.5: Workflow W2 created by applying the Map construct on W1.

Figure 3.6: Workflow W3 created by applying the Reduce construct on an Add workflow.

Figure 3.7: Workflow W4 created by applying the Tree construct on an Add workflow.

Figure 3.8: (a) W5 created by applying the Conditional construct on the Projection workflow with a predicate p = (PI(1) < PI(2)); (b) W6 created by applying the Conditional construct on the Projection workflow with an opposite predicate p = (PI(1) >= PI(2)).

Figure 3.9: Workflow W7 created by applying the Loop construct on an Add workflow.

Figure 3.10: Workflow W8 created by applying the Curry construct on an Add workflow.

... created by the composition of two Reduce constructs on the Add workflow; (c) unary-construct-based workflow W11 created by the composition of the Map construct and the Reduce construct on the Add workflow; (d) unary-construct-based workflow W12 created by applying the composition of the Map construct and the Tree construct on the Add workflow; (e) unary-construct-based workflow W15 created by applying the Loop construct on a graph-based workflow; and (f) graph-based workflow W17 created by applying the G2W construct on a workflow graph.

Figure 3.12: Workflow exception handling.

Figure 3.13: Workflow exception propagation.

Figure 3.14: The exception construct.

Figure 3.15: Workflow W18 created by applying the Exception construct on a Divide workflow.

Figure 3.16: Freebase Processing Workflow.

Figure 3.17: Performance comparison of two workflows for matrix summation.

Figure 4.1: The Parameters collection.

Figure 4.2: Two collections that are union-compatible: (a) collection M1; and (b) collection M2.

Figure 4.3: The results of (a) $M1 \cup^{c} M2$; and (b) $M1 -^{c} M2$.

Figure 4.4: The results of the selection and projection operations (a) $\sigma^{c}_{Model='m2' \;\mathrm{AND}\; Experiment='1'}(Parameters)$; and (b) $\pi^{c}_{Experiment}(Parameters)$.

Figure 4.5: The result of the composition of the Cartesian product and the renaming operations $\rho^{c}_{M1.Model/Model}(\rho^{c}_{M1.Result/Result}(M1)) \times^{c} \rho^{c}_{M2.Model/Model}(\rho^{c}_{M2.Result/Result}(M2))$.

Figure 4.7: The ParallelAggregation workflow.

Figure 4.8: An example of the parallel Reduce construct.

Figure 4.9: The Query workflow.

Figure 5.1: Overall architecture of the VIEW system [85].

Figure 5.2: A typical scientific workflow execution diagram.

Figure 5.3: (a) A SWL specification example of the workflowInterface definition of a unary-construct-based workflow. (b) a SWL specification example of the workflowBody for graph-based workflow (b-4), primitive workflow (b-1), and unary-construct-based workflow (b-2); (b-3) a SWL specification example of the workflowBody definition for unary-construct-based workflow with a composition of the Map construct and the Reduce construct; (b-5) a SWL specification example of the exception handling.

Figure 5.4: Relational database schema for our scientific workflow composition model.

Figure 5.5: An example specification of a primitive workflow.

Figure 5.6: Architecture of the data product manager.

Figure 5.7: Example of the Compress operator: (a) the original relation Parameters; (b) the result collection RParameters from the operation %(%(Parameters)).

Figure 5.8: Example of the XML description of a collectional data product.

Figure 5.9: Example of a query workflow.


CHAPTER 1 INTRODUCTION

In recent decades, computational technologies have played an essential role in modern scientific research. While the coupling of scientists and computers has brought significant progress, it has also created new challenges. On the one hand, scientists increasingly rely on information and computation technologies to enable and accelerate scientific discoveries. High-performance computing resources such as supercomputers, clusters, and grids have become common in many scientific laboratories [9] [10] [31]. On the other hand, computer simulation has become a popular tool for scientists from many disciplines to explore domains that are inaccessible or extremely expensive for real experiments, such as exploring the evolution of the universe [112] [28], predicting global climate change [40] [2], and numerous “in silico” simulations of biological processes [109] [54]. Moreover, scientific instruments, computations, and computer simulations are creating vast data stores. Researchers in many areas of science, especially astrophysics, physics, climatology, and biology, now face tremendous increases in data volumes, which exceed our capacity to store and analyze the data.

Scientists demand better frameworks to support the new-generation scientific research cycle, from data capture and data curation to data analysis and data visualization [67]. The increasing availability of massive volumes of scientific data and corresponding analysis tools requires an integrated system to manage the data, the applications that analyze the data, and the whole scientific discovery process. A recent Science article titled “Beyond the Data Deluge” [25] concluded that “In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.”

1.1 Scientific Workflows and Scientific Workflow Management Systems

Workflow in general refers to the “automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules” [68]. The workflow concept evolved from the notion of process in manufacturing and the office, and has been developed in the business world as the so-called business workflow to provide computerized facilitation and automation of business processes, including the assessment, analysis, modeling, definition, and subsequent operational implementation of the core business processes of an organization.

As the computational “e-science” component of scientific research becomes more extensive and complex, a systematic architecture to manage various computational processes and large amounts of data becomes increasingly important [116]. Workflow concepts have recently been applied to organize scientific computations, giving rise to so-called scientific workflows. A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the analytical and computational steps that a scientist needs to go through, from dataset selection and integration, through computation and analysis, to final data product presentation and visualization. Scientific workflows share many features of business workflows, but also go beyond them. One of the main differences is that scientific workflows are more concerned with the throughput of data through various stages of programs and applications, while business workflows focus on the correct, timely, and secure execution of business logic. Therefore, scientific workflows are usually data-driven in that the tasks are orchestrated mostly by dataflows rather than by traditional control flows. In a scientific workflow, each task has several input and output ports. The input ports receive tokens from predecessor components (such as a task or a workflow input); the output ports send tokens to successor components (such as a task or a workflow output). From a dataflow perspective, receipt of all the input data tokens triggers the task. When the task completes, it generates data tokens and sends them to the related output ports.

Figure 1.1: A Reference Architecture for SWFMSs.

Another difference is that scientific workflows are usually dynamic and highly interactive, while business workflows are static. Complex scientific experiments often involve many parameters that are changed frequently by the domain scientists in order to refine the model. Moreover, the workflow itself is often changed during exploratory research. Scientists therefore require higher-level tools with a friendly working environment that enables them to plug together problem-solving components to test a scientific hypothesis. Business workflow tools look more like traditional programming languages and are at the wrong level of abstraction for scientists to take advantage of. Instead, scientific workflow systems try to provide an environment that aids the scientific discovery process through the combination of scientific data management, analysis, simulation, and provenance.
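The token-driven firing rule described in this section (a task is triggered once every input port has received a token) can be illustrated with a minimal Python sketch. The `Task` class, its `receive` method, and the `Add` example are hypothetical names introduced only for illustration; they are not part of VIEW or of any workflow system discussed here, and the sketch handles a single output token rather than multiple output ports.

```python
_EMPTY = object()  # sentinel marking an input port with no pending token


class Task:
    """A hypothetical task that fires only when every input port holds a token."""

    def __init__(self, name, input_ports, body):
        self.name = name
        self.body = body  # function that computes the output token
        self.ports = dict.fromkeys(input_ports, _EMPTY)

    def receive(self, port, token):
        """Deliver a token to one input port.

        Returns the output token if this delivery triggers the task,
        otherwise None.
        """
        if port not in self.ports:
            raise KeyError(f"{self.name} has no input port {port!r}")
        self.ports[port] = token
        if any(value is _EMPTY for value in self.ports.values()):
            return None  # still waiting on at least one input
        args = dict(self.ports)
        # Consume the tokens so the task can fire again on fresh inputs.
        self.ports = dict.fromkeys(self.ports, _EMPTY)
        return self.body(**args)


add = Task("Add", ["a", "b"], lambda a, b: a + b)
first = add.receive("a", 2)   # only one port filled: the task does not fire
second = add.receive("b", 3)  # all ports filled: the task fires
print(first, second)          # prints: None 5
```

Note that no explicit control flow orders the two deliveries; the arrival of data alone decides when the task runs, which is the sense in which such workflows are data-driven.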

A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, failure handling, and monitoring of a scientific workflow, using the workflow logic to control the order in which workflow tasks are executed. SWFMSs have become a fundamental instrument for current and future scientific research and collaboration, providing rich support for scientists to describe experiments, analyze data, share descriptions and results with colleagues, and automate the recording of vast amounts of data products and provenance information. While business workflow management systems (BWFMSs) focus on the management, coordination, and verification of business processes, SWFMSs focus on supporting data-intensive and computation-intensive scientific research projects. Figure 1.1 illustrates a reference architecture for SWFMSs proposed in [85].

1.2 Scientific Data Management

Scientific data management is one of the greatest challenges in the coming data-intensive science paradigm, not only in terms of volume, but also in terms of heterogeneity and distributed organization. While the relational data model [37] and SQL have become standards in the commercial world, no single existing data model has critical mass in the scientific community, and different data models and representations exist even within the same domain. Although much scientific data is in the form of numeric arrays and tables, relational databases are not well accepted by scientists because the relational model lacks some common scientific data types and SQL cannot support complex scientific computations. Several simple and convenient data models have emerged to represent arrays, tables, and the relationships among them, such as HDF [5], NetCDF [8], and FITS [4]. A common standard has yet to be proposed.

Recently, the coming data deluge generated by scientific instruments and simulations has posed new requirements for scientific data management. Data volumes are doubling every year, and many of today's datasets can easily reach the terabyte or even petabyte level. New techniques are needed to analyze and organize the data. Moreover, the increasingly used distributed high-performance computing platforms, such as Grid computing and Cloud computing, often involve distributed and hierarchically organized data sets. However, current data models in Grid computing and Cloud computing are mainly file oriented and loosely organized. Data management in such systems often relies on hard-coded programs or even requires manual operations from the users.

1.3 Research Challenges

Scientific workflows are proving to be the preferred vehicle for computational knowledge extraction at a large scale. However, research on scientific workflows is still in its infancy. This dissertation explores formal methodologies for modeling scientific workflows. Specifically, the goal of this dissertation is to address the following challenges.

How to define a dataflow-based scientific workflow composition model to support scientific workflow compositions. Scientific workflows have become a new paradigm for scientists to integrate, structure, and orchestrate heterogeneous and distributed services and applications into scientific processes to enable and accelerate many scientific discoveries. In contrast to business workflows, which focus on the modeling of controlflow-oriented business processes, scientific workflows aim to model often large-scale data-intensive and computation-intensive scientific processes. This poses new exciting challenges for the management of scientific workflows [85].

We argue that there is a great need to design and implement a dataflow-based scientific workflow composition model. First, as more and more scientific research projects use scientific workflows as an enabling technology to automate and speed up the scientific discovery process, productive workflow composition that promotes workflow sharing and reuse becomes increasingly important. Second, while the goal of business workflows is to reduce human resources (and other costs) and increase revenue, the goal of scientific workflows is to reduce both human and computation costs and to accelerate the turning of large amounts of bits and bytes into knowledge and discovery. As a result, while business workflows are typically controlflow oriented, scientific workflows tend to be dataflow oriented. Therefore, instead of using an existing business workflow language (such as BPEL [23] and YAWL [118]), it is highly desirable to have a dataflow-based scientific workflow language to support the specification and execution of complex data-driven scientific workflows. Finally, although several dataflow-based scientific workflow languages have been implemented [89], [97], [56], none of them provides dataflow constructs (e.g., Map and Reduce) that are fully compositional with one another.
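To make the notion of constructs that compose with one another concrete, the following Python sketch treats a workflow as a plain function and each construct as a function from workflows to workflows. This is only an informal analogy under stated assumptions: the names `map_construct`, `reduce_construct`, and `add` are hypothetical, and the formal semantics of the Map and Reduce constructs are those defined in Chapter 3, not this code.

```python
from functools import reduce


def map_construct(workflow):
    """Lift a workflow over one item into a workflow over a list of items."""
    return lambda items: [workflow(item) for item in items]


def reduce_construct(workflow):
    """Lift a binary workflow into a workflow that folds a non-empty list."""
    return lambda items: reduce(workflow, items)


def add(a, b):
    """A primitive two-input workflow, analogous to an Add workflow."""
    return a + b


# Because each construct takes a workflow and returns a workflow,
# the results can be fed to further constructs without special cases.
sum_list = reduce_construct(add)    # Reduce applied to Add
sum_rows = map_construct(sum_list)  # Map applied to the result of Reduce

matrix = [[1, 2, 3], [4, 5, 6]]
row_sums = sum_rows(matrix)         # one sum per row
total = sum_list(row_sums)          # Reduce over the row sums
print(row_sums, total)              # prints: [6, 15] 21
```

The point of the sketch is the typing discipline: since workflows are both the inputs and the outputs of every construct, any construct can be applied to the output of any other.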

How to define a formal scientific data model with well-defined operators. In contrast to business data, which is usually relational and stored in databases, scientific data is often hierarchically organized and collection oriented. We argue that a scientific workflow data model should meet the following requirements. First, a scientific workflow data model should be collection oriented. Scientists often work with collection-oriented datasets, such as arrays, lists, tables, or file collections, which are generated from various instruments or simulations [62]. Therefore, it is important that a scientific workflow data model support such collection-oriented data structures. Moreover, a collection-oriented data model enables data parallelism in scientific workflows, such that multiple runs of the same workflow can be performed in parallel over collections of data. Second, a scientific workflow data model should support nested data structures. On the one hand, scientific data is often hierarchically organized. For example, physiologists often classify their clinical data by patient and date, forming a hierarchical cluster of data. On the other hand, in scientific workflows, workflow tasks often produce lists of data products, and the execution of a workflow composed from such tasks can create increasingly nested data collections [91]. Finally, a scientific workflow data model should provide well-defined operators, together with their arbitrary compositions, to manipulate and query scientific data collections. Such operators can become the basis for a higher-level declarative workflow language and provide a mathematical foundation for query and workflow optimization. Although several collection-oriented data structures have been proposed for SWFMSs [117] [91] [132] [32], a formal data model with a set of well-defined operators is still missing.

How to design and implement a scientific workflow management system to integrate proposed techniques. A scientific workflow management system aims to provide a framework to support the whole cycle of scientific research. Realizing the proposed techniques in an integrated SWFMS remains a big challenge. First, while advanced computer science techniques have enabled and accelerated many scientific discoveries, they also place a burden on domain scientists, who are forced to learn computer science technologies. SWFMSs are designed to provide a higher-level programming abstraction and to free scientists from complicated technical details, so that they can concentrate on the research problem. Therefore, SWFMSs should provide a simple and friendly user interface, and the detailed techniques should be hidden behind the scenes. Second, modern SWFMSs often consist of several subsystems, either loosely coupled or tightly coupled, each implementing part of the functionality. Therefore, it is very important to maintain consistency between subsystems, for example in data typing and system status. Finally, the coordination and communication between subsystems need to be clearly identified, including the functional interfaces, state transitions, and the data and message interchange protocols.

1.4 Contributions

Contributions of this dissertation are as follows:

• A Dataflow-based Scientific Workflow Composition Model. We identified seven key requirements for a scientific workflow composition model based on a comprehensive literature review and our experience in developing the VIEW system. Based on those requirements, we proposed a dataflow-based scientific workflow composition model consisting of: i) a dataflow-based scientific workflow model that separates the declaration of the workflow interface from the definition of its functional body; ii) a set of dataflow constructs, including Map, Reduce, Tree, Loop, Conditional, and Curry, which are fully compositional with one another; iii) a dataflow-based exception handling approach to support hierarchical exception propagation and user-defined exception handling. Our workflow composition framework is unique in that workflows are the only operands for composition; in this way, our approach elegantly solves the two-world problem in existing composition frameworks, in which composition needs to deal with both the world of tasks and the world of workflows.

• A Collectional Data Model. We formalized a collection-oriented data model, called the collectional data model, to model hierarchical collection-oriented scientific data. The new collectional model naturally extends the relational model to support hierarchical scientific data. We also proposed a set of well-defined operators to manipulate and query such data, including union and set difference, selection, projection, Cartesian product, and renaming. The proposed collectional operators can be composed arbitrarily to form more complex operations, and the result will always be a collection.

• An Integrated Scientific Workflow Management System. We designed and implemented a prototypical scientific workflow management system, called VIEW. The VIEW system comprises six loosely coupled subsystems implementing our proposed techniques: a workbench to visually design and compose workflows and visualize data products, a workflow engine realizing our proposed model to execute workflows, a task manager to manage and execute heterogeneous tasks, a data product manager to store and manage scientific data products based on the collectional data model, a workflow monitor to display system status and track exceptions, and a provenance manager to store and query workflow provenance.
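The closure property claimed for the collectional operators in the second contribution above (every operator returns a collection, so operators compose freely) can be illustrated with a deliberately simplified Python sketch. It assumes a flat collection represented as a list of dictionaries with hashable values, which ignores the hierarchical nesting that the actual collectional data model supports; all function names and sample data below are hypothetical and are not taken from the dissertation.

```python
def _key(row):
    """Hashable identity of a row, independent of attribute order."""
    return tuple(sorted(row.items()))


def _dedupe(rows):
    """Drop duplicate rows while preserving first-seen order."""
    seen, out = set(), []
    for row in rows:
        if _key(row) not in seen:
            seen.add(_key(row))
            out.append(row)
    return out


def union(c1, c2):
    return _dedupe(c1 + c2)


def difference(c1, c2):
    excluded = {_key(row) for row in c2}
    return [row for row in _dedupe(c1) if _key(row) not in excluded]


def select(predicate, c):
    return [row for row in c if predicate(row)]


def project(attributes, c):
    return _dedupe([{a: row[a] for a in attributes} for row in c])


def rename(mapping, c):
    """Rename attributes: mapping is {old_name: new_name}."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in c]


def product(c1, c2):
    """Cartesian product; attribute names are assumed to be disjoint."""
    return [{**r1, **r2} for r1 in c1 for r2 in c2]


# Hypothetical sample data.
parameters = [
    {"Model": "m1", "Experiment": "1"},
    {"Model": "m2", "Experiment": "1"},
    {"Model": "m2", "Experiment": "2"},
]

# Each result is again a list of dictionaries, so calls can be nested.
models_in_exp1 = project(["Model"],
                         select(lambda r: r["Experiment"] == "1", parameters))
print(models_in_exp1)  # prints: [{'Model': 'm1'}, {'Model': 'm2'}]
```

Renaming before taking a Cartesian product, as in `product(rename({"Model": "M1.Model"}, a), rename({"Model": "M2.Model"}, b))`, keeps attribute names distinct in the combined rows.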

1.5 Roadmap

The remaining chapters of the dissertation are organized as follows. Chapter 2 reviews related research covering the state of the art in scientific workflow modeling, scientific data modeling, and scientific workflow systems. Chapter 3 proposes a new dataflow-based scientific workflow model and a set of workflow constructs to enable arbitrary hierarchical workflow compositions. Chapter 4 formalizes a collectional data model to support hierarchical collection-oriented scientific data, and a set of operators to manipulate and query such data. Chapter 5 presents the detailed design and implementation of the VIEW system, which integrates all the proposed techniques. Finally, Chapter 6 concludes the dissertation and outlines some future research work.


CHAPTER 2 RELATED WORK

Scientific workflow has become an increasingly popular paradigm for scientific data querying and processing. As a multi-disciplinary research area, scientific workflows involve technologies from various domains. This chapter presents existing workflow and data management technologies that are pertinent to this thesis. Section 2.1 provides an overview of business workflow and scientific workflow research with a focus on models and languages. Section 2.2 discusses existing scientific data models. Section 2.3 further surveys existing scientific workflow management systems. Finally, Section 2.4 summarizes this chapter.

2.1 Workflow Modeling

Workflow technology has been successfully used in business and scientific applications for many years, and numerous competing proposals from rival companies have been put forward to model workflow processes. There are two main architectural approaches to implementing workflows: service orchestration and service choreography [24]. Service orchestration refers to an executable business process that may interact with both internal and external services (e.g., Web services). Services interact with each other through explicitly defined controlflow or dataflow. Orchestrations can span multiple applications and/or organizations, while a central process acts as a controller of the involved services, and the services themselves have no knowledge of their involvement in a higher-level application. BPEL [23] [12] and YAWL [118] are two representative business workflow languages that are widely adopted by the community for defining processes that can be executed on an orchestration engine. Most current scientific workflow languages, including MoML [89], XScufl [15], and our proposed language SWL, are also based on a service orchestration model. Service choreography focuses more on the collaboration among a collection of services. Choreography describes interactions from a global perspective, meaning that all participating services are treated equally, in a peer-to-peer fashion. Each party involved in the process describes only the part it plays in the interaction, and no process acts as a controller. All involved services are aware of their partners and of when to invoke operations. WS-CDL [75] is a representative business workflow language in this area.

2.1.1 Business workflow modeling

Workflow technology was first adopted in the business community and has been developed for many years. The main purpose of business workflow is the automation of processing steps (activities) in order to accomplish some business process [14]. Below, we will introduce several representative business workflow languages.

Business Process Execution Language for Web Services (BPEL4WS) [23] [12] [76] has gained broad acceptance in industry and research. It is an XML-based language for the formal specification of business processes and business interaction protocols. BPEL4WS extends the Web services interaction model [124] and enables both the composition of Web services and the rendering of the composition itself as a Web service [96]. BPEL provides control constructs including <sequence>, <flow>, <switch>, <pick>, and <while>. BPEL also provides control links, together with the associated notions of join condition and transition condition, to support the definition of task precedence, synchronization, and conditional dependencies [100]. BPEL is supported by a significant number of business workflow tools and has also been used to structure some simple scientific workflows [19]. However, while most modern scientific workflows are data driven, BPEL does not support explicit dataflow. Data in BPEL is stored in shared variables that can be accessed by activities (e.g., the <assign> activity). Moreover, it has been noted that standard BPEL fails to support human tasks, that is, tasks that are allocated to human actors and that require these actors to complete actions, possibly involving a physical performance. Although some extensions to BPEL, such as BPEL4People [77], have been developed to support human interactions, they are designed mainly to model business activities rather than scientific experiments.

Yet Another Workflow Language (YAWL) [118] is a formal language that was originally proposed to support most of the workflow patterns [17]. Those patterns characterize the desirable properties of workflow languages from the controlflow perspective. The YAWL language is based on high-level Petri nets [119] and extends them with three main constructs: or-join, cancellation sets, and multi-instance activities. YAWL also introduces some other constructs, such as simple choice (xor-split), simple merge (xor-join), and multiple choice (or-split), to support workflow patterns that are not easily represented using Petri nets. YAWL was recently extended as newYAWL [105], which aims to provide holistic support for the controlflow, data, and resource perspectives, and to cover many new patterns for which YAWL is unable to provide direct support, including the partial join, transient and persistent triggers, iteration, and recursion.

The XML Process Definition Language (XPDL) [16] is a format standardized by the Workflow Management Coalition (WfMC) to interchange process designs, both the graphics and the semantics of a workflow business process, between different workflow products such as modeling tools and workflow engines. XPDL defines an XML schema for specifying the declarative part of a workflow. In XPDL, the Process Definition entity provides contextual information that applies to other entities within the process. It consists of one or more activities, each comprising a logical, self-contained unit of work to be performed by either some resource or a computer application. An activity may be a subflow containing the execution of a process definition that is separately specified, or a block activity that executes an activity set, or map of activities and transitions. Activities are related to one another via flow control conditions. Each individual transition has three elementary properties: the from-activity, the to-activity, and the condition under which the transition is made. XPDL also contains elements to hold graphical information, such as the X and Y position of the activity nodes as well as the coordinates of points along the lines that link those nodes. This distinguishes XPDL from BPEL, as the latter does not contain elements to represent the graphical aspects of a process diagram.

WS-Choreography Definition Language (WS-CDL) [75] is an XML-based language that describes peer-to-peer collaborative and complementary behavior of multiple participants. The major difference between WS-CDL and BPEL is that the former defines the information formats exchanged by all participants, while the latter defines only the information formats exchanged by one participant. Thus WS-CDL describes the global message exchange between participants without taking a specific point of view.

Many other business workflow languages have been proposed. Huang et al. proposed a policy language [71] in support of project-oriented workflows. In their model, a project can be divided into many functional modules defined in a sub-process definition, either composite activities or atomic activities, and composite activities can be divided further. Stefansen et al. proposed a SMAll Workflow Language Based on CCS (SMAWL) [114], which aims to reduce the amount of user-specified internal synchronization while still providing elegant constructs for the workflow patterns [17]. Gregory et al. proposed Workflow Prolog [63], which leverages the properties of Prolog, such as its familiarity and efficiency, and allows workflow systems to be implemented in a novel declarative style. Han et al. proposed the Ubiquitous Workflow Description Language (uWDL) [64] to support adaptive services and specify context information on the transition constraints. Charfi et al. introduced a new unit, called an aspect, to modularize crosscutting concerns in complex systems and proposed an aspect-oriented workflow language called AO4BPEL [33]. Handl proposed HotFLow [65], a visual language for controlling the dynamic workflow of negotiating and contracting, for the B2B electronic commerce project MALL2000. Wirtz introduced the Object Coordination Nets (OCoN) [127] approach, which carries the benefits of visual software engineering techniques to the workflow area. Wong et al. proposed a process-algebraic approach [128] to model workflows as CSP processes and support various controlflow patterns. Ontology [120]


the parallel execution of semantic Web services while the Web Service Modeling Ontology (WSMO) [103] also supports parallel workflows through a set of controlflow-based transition rules which are executed in parallel.

All the above business workflow models and languages are driven by controlflows because business workflows are driven by business rules, and it is important to maintain the state of a business process and to provide controlflow constructs to formulate state-based business rules. Although some constructs, such as ForEach, If, and While in BPEL 2.0 [12] and Multiple Instance, Structured Loop, Multiple Choice, and Parallel Split in YAWL [118], have been proposed for business workflows to support iteration and concurrency, they cannot be directly applied to a dataflow-based scientific workflow composition framework due to the fundamental differences between controlflow and dataflow. For example, in contrast to the Map construct that we propose, which returns a list of results, the ForEach construct returns nothing (since it is a controlflow construct). Given the dataflow-oriented nature of scientific workflows, the Map construct is more natural for them, as its results can be directly fed to the input of subsequent workflows or tasks.
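The contrast between the two constructs can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Map construct defined later in this dissertation or the BPEL ForEach semantics; the names `for_each`, `map_construct`, and `square` are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def for_each(items: Iterable[A], body: Callable[[A], None]) -> None:
    """Controlflow-style iteration: runs the body for its effects, returns nothing."""
    for item in items:
        body(item)


def map_construct(task: Callable[[A], B], items: Iterable[A]) -> List[B]:
    """Dataflow-style iteration: returns one result per input, in input order."""
    return [task(item) for item in items]


def square(x: int) -> int:
    return x * x


# The Map output is itself a data product that a downstream task can consume.
squares = map_construct(square, [1, 2, 3])
total = sum(squares)

# ForEach needs an external mutable variable to get any results out.
collected: List[int] = []
foreach_result = for_each([1, 2, 3], lambda x: collected.append(square(x)))
```

In the sketch, `squares` can be wired directly into `sum`, whereas the ForEach version only communicates through the side effect on `collected`.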

Recently, data-centric approaches have received much recognition for modeling medium- or large-sized business workflows. IBM introduced an artifact-centric approach [57] [47], which focuses on recording “business artifacts” including business objects, their life cycles, and provenance information. E-BioFlow [123], a workflow system built on top of YAWL [118], provides three perspectives (controlflow, dataflow, and resource) to support workflow design. The information in all three perspectives is translated to controlflows at runtime. However, in essence, these approaches are still controlflow based rather than dataflow based.


2.1.2 Scientific workflow modeling

Scientific workflows share many similarities with business workflows, but also go beyond them [44]. There has been significant discussion of the similarities and differences between scientific workflows and business workflows [115] [121] [19] [111]. First, scientific workflows are more concerned with the throughput of data through various stages of programs and applications, while business workflows focus on the correct, timely, and secure execution of business logic. Therefore, scientific workflows are usually data-driven in that the tasks are orchestrated mostly by dataflows rather than traditional controlflows. In a scientific workflow, each task has several input and output ports. The input ports receive tokens from a predecessor component (such as a task or workflow input); the output ports send tokens to successor components (such as a task or workflow output). From the dataflow perspective, receipt of all the input data tokens triggers the task. When the task completes, it generates data tokens and sends them to the related output ports. Second, scientific workflows are usually dynamic, with intensive user interaction, while business workflows are static. Complex scientific experiments often involve many parameters that domain scientists change frequently in order to refine the model. Moreover, the workflow itself usually changes very often during exploratory research. Scientists require higher-level tools with a friendly visual working environment that enables them to plug together problem-solving components to prove a scientific hypothesis. Finally, while business workflows mainly deal with Web services coordinated via simple messages, scientific workflows often involve heterogeneous and distributed computation resources in order to process huge and complex scientific data.
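The token-driven firing rule described above can be illustrated with a small Python sketch. It assumes a deliberately simplified model (one token per port, no queues, no scheduler); the `Task` class and its method names are hypothetical and do not correspond to any particular workflow system.

```python
from typing import Any, Callable, Dict, List, Optional


class Task:
    """Hypothetical dataflow task: fires once every input port holds a token."""

    def __init__(self, name: str, input_ports: List[str],
                 func: Callable[..., Dict[str, Any]]) -> None:
        self.name = name
        self.input_ports = list(input_ports)
        self.func = func
        self.tokens: Dict[str, Any] = {}

    def receive(self, port: str, token: Any) -> Optional[Dict[str, Any]]:
        """Place a token on an input port.

        Returns the output tokens (keyed by output port) if this token
        completes the input set, otherwise None.
        """
        if port not in self.input_ports:
            raise KeyError(f"{self.name} has no input port {port!r}")
        self.tokens[port] = token
        if len(self.tokens) < len(self.input_ports):
            return None  # still waiting for tokens on other ports
        inputs, self.tokens = self.tokens, {}  # consume the tokens
        return self.func(**inputs)


add = Task("add", ["a", "b"], lambda a, b: {"sum": a + b})
first = add.receive("a", 2)    # only one of two ports filled: no firing
second = add.receive("b", 3)   # all ports filled: the task fires
```

No explicit ordering between tasks is stated anywhere; execution is triggered purely by the arrival of the last missing token.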

As scientific workflows are more dataflow driven, we briefly review the literature of dataflow languages [73]. The name dataflow comes from the conceptual notion that a program is a directed graph and that data flows along its arcs between instructions (components). Many developments have taken place within dataflow programming languages in the past


dataflow languages. It was designed to be compiled into a dataflow graph with data streams in a relatively straightforward way. TDFL consists of a series of concepts including modules, analogous to procedures in other languages. Each module is made up of a series of statements such as assignments, conditional statements, or calls to other modules. Iteration was not provided directly. LAU [58] is a single-assignment language that includes conditional branching and loops that are compatible with this rule. It was one of the few dataflow languages that provided explicit parallelism. Cantata [102] is a coarse-grained visual dataflow language in which nodes contain entire functions (as in workflows), rather than just a primitive operation. Each input is designated a name by the programmer, who also specifies either a loop variable and bounds, or a WHILE-condition, using the names. Many features and principles of dataflow research have been inherited by scientific workflows. As a matter of fact, most scientific workflow models are dataflow based. However, scientific workflows are specifically designed to help scientists process scientific data; therefore, they are usually more coarse-grained and also provide support for modern supercomputing techniques.

Many modern scientific workflows originate from Grid applications. Grid workflows have been proposed to enhance cyberinfrastructure for a wide range of scientific domains. Grid computing satisfies the high-performance requirements of complex scientific applications and enables resource sharing between collaborating organizations. Grid workflows provide an integrated and user-friendly environment for domain scientists to utilize the advantages of Grid computing. Pegasus [45] aims to take advantage of Grids for parallel processing at the task level, and its workflow language DAX [3] can describe controlflow-based sequential and parallel workflows. The DAX language uses the notion of <job> to denote a task and <child> to define the control-flow dependencies between jobs. DAX does not explicitly support dataflow. Instead, data is transferred as parameters or files. The


a C-like syntax to represent XML Schema types and procedures. It enables programmers to describe the types of both datasets (including file system data) and workflow components. It also supports the invocation of remote procedure calls to perform computations on those data objects and provides an implicitly parallel, functional programming model based on dataflow concepts. ASKALON [49] proposed an Abstract Grid Workflow Language (AGWL) [50] [51]. AGWL is an XML-based language designed specifically for describing Grid applications at a high level of abstraction, in terms of activities, without dealing with implementation details. AGWL includes the most essential workflow constructs, including activities, sequences of activities, sub-activities, controlflow mechanisms, dataflow mechanisms, and data repositories, as well as some Grid workflow constructs such as parallel activities, parallel loops with pre- and post-conditions, synchronization mechanisms, and event-based selection of activities. Many other proposals exist, including JXPL [72] for the GridNexus [29] system, DPML for the DiscoveryNet system [41], GWorkflowDL [22] based on high-level Petri nets, GPEL [121] [122] and GSWEL [129] extended from BPEL4WS, and SWFL [69] and MPFL [70] extended from WSFL [80] (an XML language developed by IBM for the description of Web Services compositions as part of a business process definition). Surveys of Grid programming environments and representative Grid workflow systems can be found in [78] [130].

While Grid workflows provide a high-level abstraction on top of distributed Grid resources, they are limited to Grid applications and lack the ability to manage scientific data and to utilize heterogeneous resources. Recently, several general scientific workflow models and systems have been developed. Below, we review several of the most representative proposals.

Kepler [89] inherits the actor-oriented modeling design [26] from the Ptolemy II system [30]. Actor-oriented modeling clearly separates two modeling concerns: component communication (dataflow) and overall workflow coordination (orchestration). A scientific workflow is modeled as a composition of independent components, called actors. Actors are reusable independent blocks of computation. They consume data from a set of input ports and write data to a set of output ports. The interaction between the actors is defined by a Model of Computation (MoC) [60]. The MoC specifies the communication semantics among ports and the flow of control and data among actors. Directors are responsible for implementing particular MoCs, and thus define the orchestration semantics for workflows. A variety of models of computation are supported in Kepler, including Process Networks (PN) for pipelined concurrent execution, Dataflow (DDF and SDF) for dataflow-based execution, Continuous Time (CT) for time-based execution, and Finite State Machines (FSM) and Modal Models for state-based execution. Kepler also inherits Ptolemy’s own Modeling Markup Language (MoML) [79].

Taverna [97] implements its own XML Simple Conceptual Unified Flow Language (Xscufl) [15]. A Taverna workflow consists of a collection of processors with both data and control links among them. A control link establishes a control dependency, indicating that a processor can only begin its execution after some other processor has successfully completed its execution. Taverna [97] provides implicit iteration by allowing a user to specify the iteration strategy of each processor (Taverna’s term for a workflow task). Taverna can simulate control links using data links [117], and If-Else behavior can be supported by using control links and two distinguished processors called “Fail-if-false” and “Fail-if-true”. Recently, a successor of Taverna has been developed, called Taverna 2 [94]. Taverna 2 implements a new model [113], which improves the original model in two main ways: (i) support for data streaming, through pipelined execution of workflows; and (ii) support for extensibility of the set of workflow operators, by wrapping each processor P with a stack of execution layers such as Loop for iterative execution, Branch for conditional execution, and Bounce, Failover, and Retry for exception handling.

Vistrail [32] features an action-based mechanism to automatically capture workflow evolution provenance, that is, all the trial-and-error steps followed to construct a set of data products. In Vistrail, a workflow is represented by a sequence of actions, called a vistrail. A vistrail is essentially a tree in which each node corresponds to a version of a workflow, and the edges between parent nodes and their children represent the actions applied to the parent nodes to obtain the child nodes. In this way, it allows scientists to explore visualizations by returning to and modifying previous versions of a workflow.
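A version tree of this kind can be sketched as follows. This is a minimal illustration of the idea, not the Vistrail data structure: a workflow is assumed to be just a list of module names, an action is assumed to be a function from workflow to workflow, and `VersionTree`, `apply`, `materialize`, and `add_module` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Workflow = List[str]                      # assumption: a workflow is a list of module names
Action = Callable[[Workflow], Workflow]   # assumption: an action maps a workflow to a new one


@dataclass
class Version:
    id: int
    parent: Optional[int]
    action: Optional[Action]


class VersionTree:
    """Stores only actions; any version is rebuilt by replaying them from the root."""

    def __init__(self) -> None:
        # Version 0 is the root and stands for the empty workflow.
        self.versions: Dict[int, Version] = {0: Version(0, None, None)}

    def apply(self, parent: int, action: Action) -> int:
        """Record a new child version obtained by applying an action to a parent."""
        new_id = len(self.versions)
        self.versions[new_id] = Version(new_id, parent, action)
        return new_id

    def materialize(self, version_id: int) -> Workflow:
        """Walk up to the root, then replay the collected actions in order."""
        actions: List[Action] = []
        node = self.versions[version_id]
        while node.parent is not None:
            actions.append(node.action)
            node = self.versions[node.parent]
        workflow: Workflow = []
        for action in reversed(actions):
            workflow = action(workflow)
        return workflow


def add_module(name: str) -> Action:
    return lambda wf: wf + [name]


tree = VersionTree()
v1 = tree.apply(0, add_module("load"))
v2 = tree.apply(v1, add_module("isosurface"))
v3 = tree.apply(v1, add_module("volume_render"))  # branch off an earlier version
```

Returning to `v1` and branching to `v3` leaves `v2` intact, which is the property that lets a scientist revisit and modify earlier versions.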

Triana [35] provides a clear separation between the abstract workflow model and the concrete task model. A component in Triana is the unit of execution. Components are Java classes with an identifying name, input and output ports, a number of optional name/value parameters, and a single process method. Components can also be written in other languages with appropriate wrapping code. Each component has a definition encoded in XML with a format similar to WSDL [11], which specifies the name, input and output specifications, and parameters. Triana uses both dataflow and controlflow for component execution but does not provide any explicit control constructs. Instead, loops and conditional branching in Triana are handled by specific components, i.e., a specific loop component that controls repeated execution over a sub-workflow and a logical component that controls workflow branching.

None of the existing scientific workflow models provides constructs that are composable and can be applied to arbitrary workflows. For example, Kepler [89] provides an IteratorOverArray actor (Kepler’s term for a workflow task) to support iterated execution. However, this actor does not directly support parallel execution of its contained actor. A recently proposed scientific workflow language, Martlet [61], provides the map and fold constructs to support MapReduce-style workflows. However, because it is controlflow-based, the constructs introduced in Martlet are inapplicable to dataflow-based scientific workflows in which input ports and output ports are well-defined. Moreover, the composability of Martlet is very limited, as Martlet constructs cannot be applied in a nested way. Similarly, MOTEUR [59] supports both the parallel processing of independent data with a single service on different computing resources (called “data parallelism”) and the parallel execution of different services with different datasets (called “services parallelism”). However, arbitrary composition of constructs is still not supported.


2.2 Scientific Workflow Data Models

Business workflows mainly deal with two data models: the relational data model [37] [39] and the XML data model [13]. Business data, such as financial records, medical records, personal information, and manufacturing and logistical data, are usually relational and stored in relational databases. A relation is defined as a set of tuples that have the same attributes. A tuple usually represents an object and information about that object. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints. The relational model offers an abstracted view of data. It separates the physical structure of data storage from the logical structure of data, and provides a set of algebraic operators to query and manipulate relations. It also offers a declarative interface (relational calculus) for the specification of data manipulation, which has been proved to be equivalent to relational algebra [38]. The relational model is realized in the Structured Query Language (SQL) [7] and implemented in a variety of relational database management systems including Oracle, MySQL, and MS SQL Server. Business workflows also standardize on the XML data model to transfer data between business processes. XML (eXtensible Markup Language) is a markup language for documents containing structured information. Documents here refer not only to traditional documents, but also to XML data formats such as e-commerce transactions, objects, and thousands of other kinds of structured information. Since XML data is self-describing, XML is considered a means to represent semi-structured data. The basic construct of an XML document is the element. Elements can contain subelements. The content of an element is delimited by special markup known as the start tag and the end tag. The start tag is the name of the element in angle brackets; the end tag adds an extra slash character before the name. XML is a semi-structured model and provides a flexible format for data exchange between different types of databases. However, in XML, queries cannot be made as efficiently as in a more constrained structure.
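As a small illustration of elements, subelements, and start and end tags, the following Python sketch parses a made-up document with the standard library's `xml.etree.ElementTree` module. The element names `order`, `item`, and `quantity` are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET

# <order> ... </order> is an element; <item> and <quantity> are its subelements.
document = "<order><item>widget</item><quantity>3</quantity></order>"

root = ET.fromstring(document)
child_tags = [child.tag for child in root]   # names of the subelements, in document order
quantity = int(root.find("quantity").text)   # content delimited by the start and end tags
```

Note that the content arrives as text and has to be converted explicitly, which reflects the self-describing but loosely typed nature of the format.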

As scientific workflow becomes an active research area, there is a growing interest in the development of a data model for scientific workflow management systems. Kepler [91] [48] proposes a collection-oriented model in which a collection is a named set of heterogeneous data that can contain sub-collections to form a nested collection. Our collectional data model is different from Kepler’s nested data collection model. On one hand, a collection in Kepler is an XML-like semistructured data structure, consisting of labeled data items, metadata items, and nested collections with possibly different types and nesting levels, while our collection is structured, consisting of data items of the same type, or of nested collections with the same schema and nesting levels. On the other hand, we have defined several collectional operators that generalize their relational counterparts, whereas no such operators have been defined in Kepler’s nested data collection model. Taverna [117] adopts a list-based data model, in which string is the only atomic data type and the nested list is the only data construct. Taverna provides implicit iteration to support parallel processing of a list of data products and allows a user to specify the iteration strategy on the processor to combine multiple lists with a cross product or dot product. Swift [132] supports atomic data types such as integer and string, as well as a “mapped type”, which maps data directly to files on disk. Swift also supports the Array structure and user-defined structures, which are similar to those used in conventional programming languages. Pegasus [45] supports File as the only data type, and data operations rely on user-defined tasks. VisTrails [56] supports common atomic data types including File and provides List and Tuple data structures. GridDB [87] introduces the relational model into Grid workflows by using a Set construct to cast atomic data into relations. The relational operators can then be introduced into workflows as primitive tasks. However, GridDB does not support hierarchical data collections.
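The difference between the two iteration strategies mentioned for Taverna can be sketched in Python. This is a simplified illustration rather than Taverna's implementation: results are returned as a flat list (an assumption made for brevity), and the names `dot_product`, `cross_product`, and `implicit_iteration` are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Pairs = List[Tuple[str, str]]


def dot_product(xs: Sequence[str], ys: Sequence[str]) -> Pairs:
    """Pair elements position by position; both lists must have equal length."""
    if len(xs) != len(ys):
        raise ValueError("dot product requires lists of equal length")
    return list(zip(xs, ys))


def cross_product(xs: Sequence[str], ys: Sequence[str]) -> Pairs:
    """Pair every element of xs with every element of ys."""
    return list(product(xs, ys))


def implicit_iteration(processor: Callable[[str, str], str],
                       xs: Sequence[str], ys: Sequence[str],
                       strategy: Callable[[Sequence[str], Sequence[str]], Pairs]) -> List[str]:
    """Invoke a two-input processor once per pair chosen by the strategy."""
    return [processor(x, y) for x, y in strategy(xs, ys)]


def concat(a: str, b: str) -> str:
    return a + b


dot_results = implicit_iteration(concat, ["a", "b"], ["x", "y"], dot_product)
cross_results = implicit_iteration(concat, ["a", "b"], ["x", "y"], cross_product)
```

With two lists of length two, the dot product triggers two invocations and the cross product four, so the choice of strategy directly determines how many task instances run.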

Google MapReduce [42] adopts a simple data model which is a collection of key-value pairs. However, this model does not support nested collections. Pig Latin [99] proposes a nested data model in which tuples are basic building blocks. Pig Latin provides the Bag structure to construct collections of tuples and the Map structure to construct collections of key-value pairs where the values can be of any data type. The schemas of Bag and Map are loose in that data items within one collection can be of different types. Pig Latin does not provide operators except for basic storage and retrieval. DryadLINQ [131] adopts the LINQ data model consisting of strongly-typed collections of .NET objects. LINQ supports data collections including the dictionary data structure, which contains key-value pairs, and provides SQL-like operators. However, nested dictionary structures are not supported so far.

2.3 Scientific Workflow Management Systems

Business workflow management systems (BWFMSs) originated from office automation systems about four decades ago and have grown rapidly in industry during the last two decades. Many business workflow management systems have been developed to orchestrate and coordinate business processes. For example, the YAWL system [118] [105] is built on a service-oriented architecture and consists of four YAWL services: the YAWL worklist handler, which assigns work to users of the system so that users can accept work items and signal their completion; the YAWL web services broker, which discovers services; the YAWL interoperability broker, which interconnects different workflow engines; and custom YAWL services, which connect the engine with an entity in the environment of the system. Other systems include [66] [107] [21] [93] [81] [92].

Scientific workflow management systems (SWFMSs) have emerged in recent years to provide an integrated platform that helps scientists design workflows, monitor workflow execution, visualize data products, and query provenance. Compared to BWFMSs, research and development of SWFMSs are still in their infancy. Only recently was a reference architecture [85] proposed, which clearly defined the responsibilities of a SWFMS and clarified its functionalities. Most SWFMSs have been mentioned in Section 2.1 from the aspect of modeling and language; in this section we review their system design and implementation.

Kepler [89] inherited the Ptolemy II system [30], which is a tightly coupled system that includes a user interface to design workflows and an engine to execute workflows. Kepler’s strengths include its mature library of actors, which are mainly local applications for biology, ecology, geology, astrophysics, and chemistry, and its suite of directors that provide flexible control strategies for the composition of actors. The Kepler system also implements a novel hybrid type system for modeling scientific data that separates structural data types and semantic data types [26]. A well-defined data type system can facilitate the design and implementation of workflows by constraining the possible values and interpretations of data in a scientific workflow.

Taverna [97] [98] focuses particularly on the orchestration of applications and services in the bioinformatics domain. Taverna is designed as a three-tiered model for describing resources and their interoperation at different levels of abstraction: an abstract layer that presents the workflow from a user’s view, hiding the complexity of the service interactions; a low-level layer in which a Freefluo enactor manages different services through an extensible processor plug-in architecture; and an execution layer in between that interprets the internal object model and handles controlflows such as implicit iteration and fault recovery on behalf of the user.

Triana [35] was originally developed as a data analysis problem-solving environment for a gravitational wave detection project. The system is designed as a two-layer architecture. First, users compose workflows graphically by dragging programming components, called units or tools, onto a workspace. Components are connected by data and control links. Triana workflows are then recorded and sent to the Grid Application Prototype Interface (GAP Interface), which can execute any sub-workflow and communicate with other connected Triana services. GAP provides a subset of the functionality of the GAT (Grid Application Toolkit, created by GridLab [20]). The GAP is used to interface with Triana services and provides a middleware-independent view of the underlying services and interactions across the Grid. Three bindings to GAP are currently supported in Triana: Web services, P2PS (a lightweight P2P middleware capable of advertisement, discovery, and virtual communication within ad hoc P2P networks), and Jxta (a set of protocols for P2P discovery and communication within P2P networks).

Pegasus [45] is a framework that maps scientific workflows onto distributed resources such as a Grid. Abstract workflows designed by a domain scientist are independent of any resources they will be executed on. In this way, Pegasus leverages abstraction in workflow description to obtain ease of use, scalability, and portability. Pegasus provides a compiler to map high-level descriptions to executable workflows, and it then uses Artificial Intelligence planning techniques to find a mapping of the tasks to the available resources for execution at runtime. The execution of tasks is handled by the Condor system, an open source high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks on a Grid.

ASKALON [49] is designed with an architecture similar to that of Pegasus. ASKALON allows the user to compose a Grid workflow by using a graphical user interface or by writing an AGWL program directly. It then uses a transformation system to compile AGWL into a concrete representation by mapping abstract activities onto specific Activity Deployments deployed in the Grid. Finally, the concrete representation is interpreted by the underlying workflow runtime environment of ASKALON to construct and execute the Grid workflow application on a Grid infrastructure.

VisTrail [32] is the first system to provide support for tracking workflow evolution by maintaining detailed provenance of the exploration process, both within and across different versions of a dataflow [56]. Users create and edit dataflows using the VisTrail Builder user interface. The dataflow specifications are saved in the VisTrail Repository and users can interact with saved dataflows by invoking them through the VisTrail Server or by importing them into the Visualization Spreadsheet, which stores all dataflow instances. The VisTrail Cache Manager keeps track of operations that are invoked and their respective parameters. Therefore, only new combinations of operations and parameters need to be executed.
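The caching behavior described in the last two sentences amounts to memoizing on the pair (operation, parameters). The sketch below illustrates that idea only; it is not the VisTrail Cache Manager, and the class `CacheManager`, its `run` method, and the `scale` operation are hypothetical. It assumes that parameter values are hashable and that operations are deterministic.

```python
from typing import Any, Callable, Dict, Tuple


class CacheManager:
    """Hypothetical cache keyed by an operation and its keyword parameters."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[Any, ...], Any] = {}
        self.executions = 0  # counts how many times an operation actually ran

    def run(self, operation: Callable[..., Any], **params: Any) -> Any:
        # Sorting makes the key independent of keyword order.
        key = (operation, tuple(sorted(params.items())))
        if key not in self._results:
            self.executions += 1
            self._results[key] = operation(**params)
        return self._results[key]


def scale(value: int, factor: int) -> int:
    return value * factor


cache = CacheManager()
r1 = cache.run(scale, value=2, factor=3)   # new combination: executed
r2 = cache.run(scale, factor=3, value=2)   # same combination: served from the cache
r3 = cache.run(scale, value=2, factor=4)   # new parameters: executed
```

Only the first and third calls execute `scale`; the second is answered from the stored result.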

2.4 Chapter Summary

This chapter has presented background information that is relevant to the rest of this thesis. It was composed of three main sections. The first section introduced the background of business workflow modeling and the state-of-the-art scientific workflow modeling research, and discussed their differences. The second section continued with a review of recent trends in scientific data management. The third section presented several representative existing scientific workflow management systems.


CHAPTER 3 A SCIENTIFIC WORKFLOW COMPOSITION MODEL

Scientific workflows are designed to integrate and structure various local and remote heterogeneous data and service resources to perform in silico experiments to produce significant scientific discoveries. Although several scientific workflow management systems (SWFMSs) have been developed, a formal scientific workflow composition model in which workflow constructs are fully compositional one with another is still missing. In this chapter, we propose a new scientific workflow composition model. We first discuss key requirements for a scientific workflow composition model in Section 3.1. We then propose a new scientific workflow model in Section 3.2, with a set of workflow constructs in Section 3.3, including Map, Reduce, Tree, Loop, Conditional, and Curry, which are fully compositional one with another. Section 3.5 introduces a dataflow-based exception handling approach. We also present two case studies in Section 3.6 to validate our proposed techniques. Section 3.7 concludes this chapter.

3.1 Key Requirements for a Scientific Workflow Composition Model

Based on a comprehensive study of the workflow literature and our own experience from the development of the VIEW system [85], we identify the following seven key requirements for a scientific workflow composition model.

R1: Programming-in-the-large. The concepts of “programming-in-the-large” and “programming-in-the-small” were first introduced by Frank DeRemer and Hans Kron in 1976 [46]. While programming-in-the-large focuses on high-level abstractions of modules and the modeling of their interactions and coordination, programming-in-the-small focuses on low-level programmatic implementation of modules and functionalities. Given the high-level orchestration and integration nature of scientific workflow composition, a scientific workflow composition model should fall in the programming-in-the-large paradigm.

R2: Dataflow programming model. While in the imperative (controlflow-based) programming model the order of program execution is explicitly specified by controlflow constructs, such as sequential, conditional, and loop, in the dataflow-based programming model the availability of input data for a module initiates the execution of the module, and the movement of data through modules determines the execution order of the whole program. Since most scientific workflows aim at data processing and scientific analysis problems, a scientific workflow composition model should be dataflow-based. Although from a user’s perspective constructs such as Loop and If-Else are important, we show later in this section that their dataflow-based counterparts are possible. Moreover, a dataflow-based workflow model features implicit parallelism: workflow modules run in parallel by default unless there is an explicit specification that one module needs input data that is to be produced as the output of another module. Since the dataflow-based programming model [74] eliminates the shared memory assumption and the need for a program counter and control sequencer, a dataflow-based scientific workflow composition model can more easily leverage the parallelism enabled by today’s variety of parallel and distributed computing infrastructures (Grids, Clouds, multicore, and multiprocessor systems).
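Implicit parallelism can be made concrete with a short Python sketch that derives, from data dependencies alone, which modules could run at the same time. It is an illustration under simplifying assumptions, not a scheduler from any workflow system: the function name `parallel_waves` is hypothetical, each module is assumed to run only after all modules it depends on have finished, and every module named as a dependency is assumed to appear as a key in the input.

```python
from typing import Dict, List, Set


def parallel_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group modules into waves; all modules in one wave may run in parallel.

    depends_on maps each module to the set of modules whose outputs it needs.
    Assumes every dependency is itself a key of depends_on.
    """
    remaining = {module: set(deps) for module, deps in depends_on.items()}
    waves: List[List[str]] = []
    while remaining:
        # A module is ready when none of its inputs is still outstanding.
        ready = sorted(m for m, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("no module is ready: the dependencies are cyclic")
        waves.append(ready)
        for module in ready:
            del remaining[module]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# A and B need no inputs from other modules, C needs both, D needs C.
waves = parallel_waves({"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}})
```

Nothing in the input says that A and B run together; that follows solely from the absence of a data dependency between them.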

R3: Composable dataflow constructs. Current dataflow-based workflow languages are usually very simple and contain only basic data links between components. In order to address the requirements of increasingly complex e-science applications, some languages have borrowed several common controlflow constructs from business workflow languages. However, the semantics is thus obscured and becomes difficult to formalize because of the combination of controlflow and dataflow. Therefore, we argue that composable dataflow constructs are essential for a scientific workflow composition model. In contrast with controlflow constructs, which are used to control and coordinate processes, dataflow constructs feature efficient and systematic data processing, including data parallelism and aggregation, recursive data processing with finite or infinite loops, and data-dependent conditional branching. Dataflow constructs should also be composable. The ability to combine basic constructs and build more complicated ones will greatly improve the expressive power of the scientific workflow composition model.

R4: Workflow encapsulation and hierarchical composition. A scientific workflow composition model should facilitate encapsulation and support hierarchical workflow composition. On one hand, one of the most important features of scientific workflows is to allow the reuse and sharing of scientific processes by workflow encapsulation [27] [110]. A scientific workflow model should separate the input/output interface from the implementation details. Such well-encapsulated modules represent a separation of concerns and improve maintainability. On the other hand, a scientific workflow model should support hierarchical composition so that users are able to compose workflows using existing scientific workflows and break down large-scale scientific workflows into smaller ones. This ability greatly improves the power of modeling complex scientific processes and encourages scientific collaboration [88].

R5: Single-assignment property. To ease provenance tracking and workflow scheduling, a scientific workflow composition model should have the single-assignment property, in which data products are treated as immutable artifacts; they can be created and transported, but never updated. First, scientific discoveries produced by scientific workflows must be reproducible, which requires the acyclicity of provenance graphs and the immutability of data products [95]. Violating this property might lead to incorrect data dependencies and thus compromise reproducibility. Figure 3.1 illustrates an example of provenance for a workflow consisting of three tasks: T1 takes d1 as input and produces d2; T2 consumes d2 and generates d3; T3 consumes d3 and generates d1′ to replace d1. If the single-assignment property is respected, then d1′ will be a different data product, and we can derive the acyclic dependency graph shown in Figure 3.1(a): d1′ depends on d3, d3 depends on d2, and d2 depends on d1. However, if the single-assignment property is not enforced, then d1′ and d1 will be treated as the same data product, yielding the cyclic dependency

[Figure 3.1, panel (a): d1 → TR1:T1 → d2 → TR2:T2 → d3 → TR3:T3 → d1′; panel (b): d1 → TR1:T1 → d2 → TR2:T2 → d3 → TR3:T3 → d1.]

Figure 3.1: (a) Correct data dependencies under the single-assignment property; (b) incorrect data dependencies due to violation of the single-assignment property.

graph shown in Figure 3.1(b). Based on transitivity, one can then infer that d2 depends on d3, which is a false data dependency. Second, the single-assignment property eliminates the interference caused by parallel access (reads and writes) to data products, which can produce inconsistent and undesirable intermediate or final results that would not be obtainable if the workflow tasks were run serially. Third, the single-assignment property greatly facilitates massive parallelism: multiple workflow tasks can be started as soon as their input data products become available; the single-assignment property ensures a well-defined availability time for each data product; and data products can be transported to their consumers directly and removed after consumption, without first being stored and then retrieved. As a result, the single-assignment property is assumed by many functional programming languages and dataflow programming languages [74] [126]. Finally, unlike the transaction data in business workflows, which need to be updated frequently, most scientific datasets are accessed in a read-only manner, and updates to datasets are usually not required [34]. Therefore, single-assignment is unlikely to have a negative impact on the computation and processing of scientific datasets.
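The provenance argument of R5 can be checked with a small Python sketch. It is an illustration under our own assumptions rather than the provenance model used in this dissertation: `SingleAssignmentStore`, `transitive_dependencies`, the name `d1_prime` (standing for d1′), and the numeric values are hypothetical. The store rejects any attempt to update an existing data product, and the helper follows "depends on" edges transitively.

```python
from typing import Dict, Iterable, Set, Tuple


def transitive_dependencies(derived_from: Dict[str, Iterable[str]],
                            product: str) -> Set[str]:
    """Return every product reachable from `product` via 'depends on' edges."""
    seen: Set[str] = set()
    stack = list(derived_from.get(product, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(derived_from.get(current, ()))
    return seen


class SingleAssignmentStore:
    """Data products can be created and read, but never updated."""

    def __init__(self) -> None:
        self._values: Dict[str, object] = {}
        self.derived_from: Dict[str, Tuple[str, ...]] = {}

    def put(self, name: str, value: object, inputs: Iterable[str] = ()) -> None:
        if name in self._values:
            raise ValueError(f"data product {name!r} is already assigned")
        self._values[name] = value
        self.derived_from[name] = tuple(inputs)

    def get(self, name: str) -> object:
        return self._values[name]


store = SingleAssignmentStore()
store.put("d1", 3)
store.put("d2", store.get("d1") + 1, inputs=["d1"])        # T1
store.put("d3", store.get("d2") * 2, inputs=["d2"])        # T2
try:
    store.put("d1", store.get("d3") - 1, inputs=["d3"])    # T3 tries to replace d1
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True
store.put("d1_prime", store.get("d3") - 1, inputs=["d3"])  # T3 creates d1' instead

# Case (a): single assignment respected, so the graph is acyclic.
deps_a = transitive_dependencies(store.derived_from, "d2")

# Case (b): the edges that would result if d1 had been overwritten by T3.
graph_b = {"d2": ["d1"], "d3": ["d2"], "d1": ["d3"]}
deps_b = transitive_dependencies(graph_b, "d2")
```

In case (a) the dependencies of d2 contain only d1, whereas in case (b) they also contain d3 and d2 itself, which is the false dependency discussed above.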

R6: Physical and logical data models. Scientific applications usually involve heterogeneous and distributed data [62]. Data management is thus becoming one of the key challenges for SWFMSs [43]. We argue that a scientific workflow composition model should provide both a physical data model and a logical data model, as well as the mapping between them. First, a physical data model is important for managing distributed data storage (such as local files, databases, and remote files) and heterogeneous physical representations (such as different formats representing the same data). Second, the logical data model provides data typing and data structures. To maintain the integrity and consistency of the scientific workflow composition model, a formal data typing system is required to detect “type errors”. Furthermore, data structures with well-defined operators and constructs are essential for storing and organizing collections of data tokens. Third, separating the two data models allows workflow users to operate only on the logical data model and frees them from physical data management [55]. As a result, changes to the underlying physical data model will not break the scientific workflow model. Finally, an explicit and standard data mapping layer with precise metadata and explicit data access is necessary to guarantee an efficient and consistent mapping between the physical data model and the logical data model.
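The separation argued for in R6 can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration and not the data model of this dissertation: the classes `PhysicalData` and `LogicalData`, the `READERS` mapping table, the type name `list<int>`, and the file names are hypothetical, and payloads are held in memory so that the example is self-contained.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PhysicalData:
    """Physical model: where the data lives and how it is encoded."""
    location: str  # a label such as a file path; not opened in this sketch
    fmt: str       # physical format, e.g. "csv" or "json"
    payload: str   # raw content


@dataclass(frozen=True)
class LogicalData:
    """Logical model: a typed value that workflow tasks operate on."""
    type_name: str
    value: object


# Mapping layer: (physical format, logical type) -> reader function.
READERS: Dict[Tuple[str, str], Callable[[str], object]] = {
    ("csv", "list<int>"): lambda raw: [int(x) for x in raw.split(",")],
    ("json", "list<int>"): lambda raw: [int(x) for x in json.loads(raw)],
}


def to_logical(physical: PhysicalData, type_name: str) -> LogicalData:
    """Map a physical representation to a logical value of the given type."""
    reader = READERS.get((physical.fmt, type_name))
    if reader is None:
        raise TypeError(f"no mapping from {physical.fmt!r} to {type_name!r}")
    return LogicalData(type_name, reader(physical.payload))


def total(data: LogicalData) -> int:
    """A task that sees only the logical model and checks its input type."""
    if data.type_name != "list<int>":
        raise TypeError(f"type error: expected list<int>, got {data.type_name}")
    return sum(data.value)


# The same logical data stored in two physical representations.
from_csv = to_logical(PhysicalData("local.csv", "csv", "1,2,3"), "list<int>")
from_json = to_logical(PhysicalData("remote.json", "json", "[1, 2, 3]"), "list<int>")
totals = (total(from_csv), total(from_json))
```

The task `total` is unchanged when the physical format changes; only the mapping table needs a new entry, which is the benefit claimed for separating the two models.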

R7: Task level and workflow level exception handling. Exceptions in scientific workflows may happen in both the task layer and the workflow layer. A scientific workflow composition model should be able to capture and handle exceptions in both layers. First, business workflows usually consist of Web services, and exception handling in business workflows focuses on service exceptions such as service failure or deadline expiry [18], [106], [36]. Scientific workflows, in contrast, may involve heterogeneous tasks (e.g., local executables, grid applications, cloud services), so exception handling in scientific workflows must be able to detect and integrate the heterogeneous exceptions generated by those tasks. Second, because scientific workflows are usually hierarchical and even distributed, exception handling in scientific workflows should also be hierarchical, and exception propagation should be supported. Third, exceptions in scientific workflows are sometimes very important for scientists to detect hidden problems, improve scientific models, and even achieve new scientific discoveries. Therefore, beyond traditional failure handling techniques, a scientific workflow composition model should allow users to introduce new exceptions and provide user-defined handlers.
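The three points of R7 can be sketched in Python as follows. This is a hedged illustration, not the exception handling mechanism of this dissertation: the classes `WorkflowError`, `Task`, and `Workflow`, the user-defined exception `NegativeConcentration`, and the handler are hypothetical. Tasks wrap whatever they raise in one uniform error type, workflows propagate unhandled errors upward while recording the path, and a workflow may register user-defined handlers by exception kind.

```python
import math


class WorkflowError(Exception):
    """Uniform wrapper that integrates heterogeneous task failures."""

    def __init__(self, path, kind, cause):
        super().__init__(f"{'/'.join(path)}: {kind}: {cause}")
        self.path = path    # names from the outermost workflow to the task
        self.kind = kind    # name of the original exception class
        self.cause = cause  # the original exception object


class NegativeConcentration(Exception):
    """A user-defined, domain-specific exception."""


class Task:
    def __init__(self, name, action):
        self.name = name
        self.action = action

    def run(self, value):
        try:
            return self.action(value)
        except Exception as exc:  # task level: capture and wrap
            raise WorkflowError([self.name], type(exc).__name__, exc) from exc


class Workflow:
    def __init__(self, name, steps, handlers=None):
        self.name = name
        self.steps = steps              # tasks or nested workflows
        self.handlers = handlers or {}  # kind -> handler(error, value)

    def run(self, value):
        for step in self.steps:
            try:
                value = step.run(value)
            except WorkflowError as err:  # workflow level
                handler = self.handlers.get(err.kind)
                if handler is None:
                    # No handler here: propagate to the enclosing workflow.
                    raise WorkflowError([self.name] + err.path,
                                        err.kind, err.cause) from err
                value = handler(err, value)
        return value


def check(v):
    if v < 0:
        raise NegativeConcentration(v)
    return v


handled_paths = []


def on_negative(err, value):
    """User-defined handler: record where the problem arose, substitute 0.0."""
    handled_paths.append("/".join(err.path))
    return 0.0


inner = Workflow("inner", [Task("check", check), Task("sqrt", math.sqrt)])
outer = Workflow("outer", [inner, Task("double", lambda v: v * 2)],
                 handlers={"NegativeConcentration": on_negative})

ok = outer.run(9.0)
recovered = outer.run(-4.0)
```

In the second run the exception raised inside `inner` is not handled there, so it propagates to `outer`, whose handler records the path `inner/check` and lets the remaining steps continue.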