
2.3 Workflows

2.3.3 Lifecycle of Scientific Workflows

Understanding the lifecycle of a workflow clarifies the steps needed to set up and run workflows. Leymann and Roller [118] provide a workflow lifecycle that is widely accepted by the research community, but it focuses on business-oriented workflow applications. However, the lifecycle of scientific workflows differs from its business counterpart, as explained by Görlach et al. [119]. Unlike business-oriented workflows, which may involve different user groups (e.g. business specialists, administrators, and analysts), scientific workflows involve a single user group (e.g. scientists) that plays different roles in designing, composing, executing, and managing workflows and in analysing their results. Scientists are typically interested in a single workflow instance that is developed by trial and error. Hence, there is no rigid organisation in the steps required to conduct a workflow, as it may be repeatedly modelled and executed using different datasets. Many studies have focused on domain-specific workflows and attempted to define the lifecycle of scientific workflows. For example, Görlach et al. [119] proposed a scientific workflow lifecycle based on observations of how scientists create and conduct experiments, and on common characteristics of standard data analyses and simulations described by Barga and Gannon [120]. This lifecycle consists of modelling, execution, monitoring, and analysis steps. Ludäscher et al. [27] proposed a similar scientific workflow lifecycle that consists of workflow design, preparation, execution, and post-execution analysis steps. This lifecycle requires the datasets needed for the workflow execution to be staged onto computational resources before execution begins. Deelman et al. [121] provide a comprehensive study that focuses on the lifecycle of scientific workflows and highlights the significance of provenance and the ability to reproduce scientific experiments. Based on this study, this thesis discusses a scientific workflow lifecycle that consists of the following steps:

1. Composition.
2. Resource mapping.
3. Execution.
4. Provenance.

Figure 2.9 provides an adaptation of this scientific workflow lifecycle from Deelman et al. [121]. Typically, a different set of tools is used to handle each step in the workflow lifecycle. Section 2.3.4 provides a general background of workflow management systems, which provide the tools needed for composing workflows, mapping workflow tasks onto computing resources, executing workflow tasks, and capturing provenance.

[Figure 2.9: The scientific workflow lifecycle, adapted from Deelman et al. [121]. Diagram labels: Scientist; Composition; Resource Mapping; Execution; Provenance; Abstract workflow; Concrete workflow; Composition provenance; Mapping provenance; Execution provenance; Refinement; Catalogues; Logging; Compute Resources; Execution Records; Resource metrics; Information flow.]

Composition

Scientists use workflows to precisely compose experiments that process and analyse data gathered from sophisticated scientific instruments. During composition, workflow tasks are specified and combined from a high-level perspective using a workflow language. This step produces a logical model that describes the workflow tasks to be executed and the order in which to execute them. This logical model is commonly known as an abstract workflow [21], [22], [23] because it does not describe the mapping of tasks onto computing resources. Following composition, the workflow provenance is captured and encoded in a suitable format for storage in a workflow catalogue or repository. This permits scientists to select existing workflows from a storage facility and either reuse them or refine them by introducing changes to the workflow specification. Scientists may alternatively choose to use workflows from public repositories such as myExperiment [20].
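As an illustration only, an abstract workflow can be sketched as a dependency graph in which each task lists the tasks that must finish before it. The task names below are hypothetical and are not drawn from any of the cited systems; Python's standard `graphlib` module is used simply to derive one valid execution order. Note that no computing resource appears anywhere in the description.

```python
from graphlib import TopologicalSorter

# Hypothetical abstract workflow: each task maps to the set of tasks it
# depends on. Nothing here says WHERE a task runs, only WHAT runs and in
# which order, which is what makes the description "abstract".
abstract_workflow = {
    "fetch_data": set(),
    "clean_data": {"fetch_data"},
    "run_simulation": {"clean_data"},
    "summarise": {"clean_data"},
    "write_report": {"run_simulation", "summarise"},
}


def execution_order(workflow):
    """Return one task order that respects every declared dependency."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(abstract_workflow)
print(order)
```

Several orders may be valid for the same graph (here, `run_simulation` and `summarise` are independent of each other), which is one reason the abstract description leaves scheduling decisions to later steps.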

Resource Mapping

Resource mapping is an essential step in the workflow lifecycle because it is responsible for finding a group of appropriate services to execute the workflow tasks. It generates an executable plan called a concrete workflow [21], [24], [25] which may affect the overall workflow performance. Typically, a concrete workflow is encoded in a machine-readable form that describes the mapping of tasks onto computing resources, and how the data is transferred between the resources to support their collaboration. Resource information may be collected from a resource catalogue based on high-level specifications of desired Quality of Service (QoS) properties [122] needed to execute the workflow, or may be gathered from available computing resources in the execution environment automatically.
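The following minimal sketch shows one way such a mapping could be derived; it is not the method of any cited system. The resource catalogue, the QoS requirement (a minimum core count per task), and the selection rule (lowest latency among eligible resources) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    cpu_cores: int
    latency_ms: float


# Hypothetical resource catalogue.
catalogue = [
    Resource("node-a", cpu_cores=4, latency_ms=20.0),
    Resource("node-b", cpu_cores=16, latency_ms=35.0),
    Resource("node-c", cpu_cores=8, latency_ms=5.0),
]

# Hypothetical QoS requirement: minimum number of cores per task.
requirements = {"clean_data": 2, "run_simulation": 16, "plot_results": 1}

# Data dependencies of the abstract workflow as (producer, consumer) pairs.
edges = [("clean_data", "run_simulation"), ("run_simulation", "plot_results")]


def map_resources(requirements, catalogue):
    """Assign each task the lowest-latency resource with enough cores."""
    assignment = {}
    for task, min_cores in requirements.items():
        candidates = [r for r in catalogue if r.cpu_cores >= min_cores]
        if not candidates:
            raise ValueError(f"no resource satisfies the requirements of {task!r}")
        assignment[task] = min(candidates, key=lambda r: r.latency_ms).name
    return assignment


def plan_transfers(edges, assignment):
    """List the data transfers needed when linked tasks run on different resources."""
    return [
        (assignment[src], assignment[dst])
        for src, dst in edges
        if assignment[src] != assignment[dst]
    ]


assignment = map_resources(requirements, catalogue)
transfers = plan_transfers(edges, assignment)
print(assignment)
print(transfers)
```

Together, `assignment` and `transfers` play the role of the concrete workflow in this sketch: they state where each task runs and which data movements between resources are implied by that placement.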

Execution

Based on the outcome of the resource mapping step, the workflow tasks are deployed onto computing resources for execution. Typically, a workflow engine is used to manage the overall workflow execution. This engine is responsible for executing tasks in a particular order, and for coordinating the data movement between the computing resources that execute the workflow tasks. Multiple workflow engines may collaborate with each other to execute the workflow tasks. During execution, a particular engine may be responsible for monitoring the overall progress of the workflow execution and the execution environment by collecting information about the computing resources and the network condition. Such information can be analysed to optimise the workflow performance and, if necessary, to adapt to dynamic changes in the execution environment. Section 2.4 presents different approaches for executing workflows and reviews existing workflow engine technology.
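To make the role of the engine concrete, the sketch below runs tasks in dependency order on a single machine, passes each task the outputs of its predecessors, and records how long each task took as a simple stand-in for monitoring. It is a deliberately simplified illustration under these assumptions; the task names and functions are hypothetical, and real engines additionally handle remote deployment, data movement, and failures.

```python
import time
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run every task after its dependencies and time each one.

    Outputs of the dependencies are passed as positional arguments,
    ordered alphabetically by dependency name.
    """
    results, durations = {}, {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [results[dep] for dep in sorted(dependencies[name])]
        started = time.perf_counter()
        results[name] = tasks[name](*inputs)
        durations[name] = time.perf_counter() - started
    return results, durations


# Hypothetical tasks and their dependencies.
tasks = {
    "load": lambda: [3, 1, 2],
    "sort": lambda data: sorted(data),
    "total": lambda data: sum(data),
    "report": lambda ordered, total: {"ordered": ordered, "total": total},
}
dependencies = {
    "load": set(),
    "sort": {"load"},
    "total": {"load"},
    "report": {"sort", "total"},
}

results, durations = run_workflow(tasks, dependencies)
print(results["report"])
print(sorted(durations))
```

The `durations` mapping is the kind of execution record that could later be analysed to tune performance, in the spirit of the monitoring described above.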

Provenance

Provenance records the history of the workflow data, including initial inputs, intermediate data, and final outputs. This history can be used to predict and improve the workflow performance before execution using different datasets, and may be used to refine the workflow structure. Many studies focus on provenance issues [123], [124], [125], [126] and highlight related challenges [127]. However, the discussion of this area is beyond the scope of this thesis.