Scientific workflows - Representation of process knowledge

2.7 Representation of process knowledge

2.7.5 Scientific workflows

Research on workflows covers a variety of aspects, from reproducibility to validation, preservation, tracing and decay [Belhajjame et al. (2013); Di Francescomarino et al. (2009); Garijo and Gil (2011); Weber et al. (2008); Wolstencroft et al. (2013)]. These aspects are of primary importance in the context of science, where “the reproducibility of scientific experiments is crucial for corroborating, consolidating and reusing new scientific discoveries” [Garijo (2017)]. In the last decade, Scientific Workflows have been proposed to support the reproducibility of scientific experiments by means of a structured description of the processing components and data artefacts involved. In particular, these objects are meant to be executed in silico, meaning that they include all the elements required to perform the experiment [Taylor et al. (2014)]. Scientific Workflows have been used in several domains, including astronomy, neuroscience and bioinformatics [Olabarriaga et al. (2014)].

As with Web Services, research on Scientific Workflows addresses a number of issues, ranging from their efficient execution [Callahan et al. (2006); Deelman et al. (2005); Gil et al. (2011); Ludäscher et al. (2006); Wolstencroft et al. (2013)] to workflow component reuse [De Roure et al. (2007); Goderis et al. (2005)], discovery [Wroe et al. (2007)], and recommendation [Zhang et al. (2011)]. Research Data Platforms are based on Workflow Management Systems such as Pegasus [Deelman et al. (2005)], Chiron [Ogasawara et al. (2013)], Galaxy [Goecks et al. (2010)], Triana [Shields and Taylor (2004)], Kepler [Altintas et al. (2004)], and Taverna [Wolstencroft et al. (2013)] (see also [Liu et al. (2015)] for a recent survey). Current research includes the problem of monitoring workflow executions for resource consumption and data quality in large research data infrastructures [Mannocci (2017)]. Recently, a number of repositories of scientific workflows have been published; Wings (footnote 75), My experiment (footnote 76) and SHIWA (footnote 77) are prominent examples.

From the point of view of knowledge representation, Scientific Workflows inherit the approach developed in three decades of process modelling: they are structured as a directed graph whose nodes are processing units and whose arcs represent a dependency relation (often an output-to-input connection). Workflows are built on the concept of processor as the unit of operation (footnote 78). A processor includes one or more input and output ports, and a specification of the operation to be performed. Processors are linked to each other through a set of data links, each connecting an output port of one processor to an input port of another, resulting in a composite tree-like structure. Figure 2.15 shows an example of a workflow taken from the "My Experiment" repository (footnote 79).

A major challenge in understanding workflows is their complexity. A workflow may contain several phases whose role in the scientific analysis can be opaque when looking only at the workflow implementation. This difficulty in understanding the intention behind implementations stands in the way of workflow component reuse. Semantic technologies have been used to analyse the components of workflows, for example, to extract common structural patterns [Ferreira et al.

75 Wings: http://www.wings-workflows.org/.

76 My experiment: http://www.myexperiment.org/.

77 SHIWA: http://www.shiwa-workflow.eu/wiki/-/wiki/Main/SHIWA+Repository

78 Here we use the terminology of the SCUFL2 specification developed in the context of the Taverna workflow management system. However, the basic structure is a common one. In Kepler, for example, this concept maps to the one of Actor.

79 "LipidMaps Query" workflow from My experiment: http://www.myexperiment.org/workflows/

Figure 2.13: Scientific workflow motifs (content from [Garijo et al. (2014)]).

Data-Operation motifs
  - Data preparation
      - Combine
      - Filter
      - Format transformation
      - Input augmentation
      - Output extraction
      - Group
      - Sort
      - Split
  - Data analysis
  - Data cleaning
  - Data movement
  - Data retrieval
  - Data visualization

Workflow-Oriented motifs
  - Inter workflow motifs
      - Atomic workflows
      - Composite workflows
      - Workflow overloading
  - Intra workflow motifs
      - Internal macros
      - Human interactions
      - Stateful (asynchronous) invocations


Figure 2.14: PROV example scenario. Figure taken from [Moreau et al. (2015)].

Figure 2.15: A workflow from the My Experiment repository: “LipidMaps Query”.

(2011)]. Recently, more attention has been given to eliciting the activity of workflows in a knowledge-principled way, for example by labelling data artefacts to produce high-level execution traces (provenance). This research highlighted the need for adding semantics to the representation of workflows, as well as the challenges involved in producing such annotations [Alper et al. (2014)]. A recent line of research focuses on understanding the activities behind processes in scientific workflows, with the primary objective of supporting the preservation and reusability of workflow components [Garijo et al. (2014)]. This analysis resulted in a set of workflow motifs, divided into data-intensive motifs, which represent the data-related activities that are observable in workflows, and workflow-oriented motifs, which show the different ways in which activities are implemented. Scientific workflow motifs are shown in Figure 2.13 (content from [Garijo et al. (2014)]). An interesting finding of this analysis is that a significant share of data-intensive operations are related to data preparation [Garijo et al. (2014)].
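As a rough illustration of the structure described in this section, the sketch below models a workflow as processors with named ports connected by data links, derives an execution order from the dependency relation, and tallies motif labels attached to each processor. The class names, the three example processors and the labels assigned to them are assumptions made for illustration only; they are not taken from SCUFL2, from the "LipidMaps Query" workflow, or from the corpus analysed in [Garijo et al. (2014)].

```python
from collections import Counter
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Processor:
    """Unit of operation: named ports plus a description of the operation."""
    name: str
    operation: str
    inputs: tuple = ()
    outputs: tuple = ()
    motifs: tuple = ()  # hypothetical motif labels, assigned by hand


@dataclass(frozen=True)
class DataLink:
    """Connects an output port of one processor to an input port of another."""
    source: str
    source_port: str
    target: str
    target_port: str


@dataclass
class Workflow:
    processors: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add(self, processor: Processor) -> None:
        self.processors[processor.name] = processor

    def connect(self, source, source_port, target, target_port) -> None:
        # A data link must start at a declared output and end at a declared input.
        if source_port not in self.processors[source].outputs:
            raise ValueError(f"{source} has no output port {source_port!r}")
        if target_port not in self.processors[target].inputs:
            raise ValueError(f"{target} has no input port {target_port!r}")
        self.links.append(DataLink(source, source_port, target, target_port))

    def execution_order(self) -> list:
        # Map each processor to the set of processors it depends on.
        depends_on = {name: set() for name in self.processors}
        for link in self.links:
            depends_on[link.target].add(link.source)
        return list(TopologicalSorter(depends_on).static_order())

    def motif_counts(self) -> Counter:
        return Counter(m for p in self.processors.values() for m in p.motifs)


# Illustrative three-step workflow, invented for this sketch.
workflow = Workflow()
workflow.add(Processor("fetch", "query a remote service",
                       outputs=("records",), motifs=("Data retrieval",)))
workflow.add(Processor("filter", "drop incomplete records",
                       inputs=("records",), outputs=("clean",),
                       motifs=("Data preparation",)))
workflow.add(Processor("plot", "render a chart",
                       inputs=("table",), motifs=("Data visualization",)))
workflow.connect("fetch", "records", "filter", "records")
workflow.connect("filter", "clean", "plot", "table")

order = workflow.execution_order()
counts = workflow.motif_counts()
print(order)
print(dict(counts))
```

Keeping the motif labels on the processor, rather than in a separate annotation file, is a simplification; in practice such annotations are typically stored alongside the workflow definition.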

However, workflow knowledge encompasses several aspects (control, data, implementation compliance, semantics) and is therefore naturally spread across several different artefacts (configurations, datasets, logs, and so forth). Workflow-centric Research Objects have been designed with the objective of making research experiments persistent (and reproducible) in the scientific discourse. The approach is to bundle all the elements relevant to a research finding in a single


Figure 2.16: From workflows to dataflows. Workflow models are focused on actions, to support multiple and parametric executions (2.16a). There are scenarios in which we need to focus on the data (2.16b) and understand how the data is affected by the actions of the workflow (2.16c). In our work, data flows (2.16d) are characterised as an expression of the implications of the process on the data.

artefact, including the workflow formalisation, the required input data, the provenance of the results obtained by its enactment, and any other digital object involved (papers, datasets, etc.), as well as semantic annotations that describe all these objects [Corcho et al. (2012)].

In document Knowledge Components and Methods for Policy Propagation in Data Flows (Page 103-109)