Workflow-centric Systems - Provenance Systems

2.3 Provenance Systems

2.3.6 Workflow-centric Systems

As discussed in Section 2.1, workflows currently play an important role in allowing scientists to design multi-institutional experiments. Consequently, several provenance systems are workflow-centric. In the Provenance Challenge, eight of the seventeen systems were based on a particular workflow environment.

Barga and Digiampietri propose that the workflow enactment engine should be responsible for the automatic collection of provenance information at runtime [15]. The system they describe, called REDUX, is a modification of Windows Workflow Foundation [31] and stores provenance information according to a multi-layered model. The four-level model starts with abstract service descriptions. The second level represents the actual services used at runtime. The third level provides even greater detail, containing the data and parameters used during workflow execution. The final level contains runtime-specific information, such as timing information and information about the machines used during execution. Thus, a user can start from an abstract representation of the provenance of an item and drill down to the actual instance-level information. By leveraging this layered model, the size of the stored information can be significantly decreased [16].
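The drill-down idea behind a layered model can be illustrated with a minimal sketch. The class, field names and example values below are hypothetical and are not taken from REDUX; they only mirror the four levels described above.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredProvenance:
    """Hypothetical four-level record; names are illustrative, not REDUX's schema."""
    abstract_service: str                                      # level 1: abstract service description
    concrete_service: str                                      # level 2: service actually used at runtime
    data_and_parameters: dict = field(default_factory=dict)    # level 3: data and parameters
    runtime_info: dict = field(default_factory=dict)           # level 4: timing, machines used

    def drill_down(self, level: int) -> dict:
        """Return what is known about the item from level 1 down to `level`."""
        if not 1 <= level <= 4:
            raise ValueError("level must be between 1 and 4")
        layers = [
            ("abstract_service", self.abstract_service),
            ("concrete_service", self.concrete_service),
            ("data_and_parameters", self.data_and_parameters),
            ("runtime_info", self.runtime_info),
        ]
        return dict(layers[:level])


# Invented example values, for illustration only.
record = LayeredProvenance(
    abstract_service="SequenceAlignment",
    concrete_service="http://example.org/services/align",
    data_and_parameters={"threshold": 0.001},
    runtime_info={"host": "node-17", "elapsed_seconds": 4.2},
)
abstract_view = record.drill_down(1)   # only the abstract description
instance_view = record.drill_down(4)   # full instance-level detail
```

A user who only needs the abstract view never touches the lower levels, which is the sense in which the layers support starting abstract and drilling down.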

In the context of the myGrid project, the Taverna workflow enactment engine has been modified to generate provenance according to a Resource Description Framework (RDF) based data model [200]. Data generated according to Taverna’s provenance model is stored in a triple store. Once stored, a query application programming interface (API), ProQA, enables a wide variety of provenance queries to be executed over the provenance RDF graph [199]. Because provenance information is stored as RDF, workflow annotations and provenance information can be queried simultaneously [198]. Moreover, using the semantic web browser Haystack, provenance can be visualised and browsed [200]. One of the drawbacks of the Taverna provenance model is that it retains remnants of its origins in bioinformatics, namely the use of Life Science Identifiers (LSIDs) to label all provenance information. The use of LSIDs requires that the infrastructure to issue them be in place and accessible in order for Taverna to create provenance information.
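To make concrete why a single graph lets annotations and provenance be queried together, the following is a minimal sketch using plain Python tuples as triples. It does not use Taverna's provenance vocabulary, ProQA, or LSIDs; the `ex:` predicates and `urn:ex:` identifiers are invented for illustration.

```python
# Triples are (subject, predicate, object). All identifiers are hypothetical.
triples = {
    ("urn:ex:data/2", "ex:derivedFrom", "urn:ex:data/1"),      # provenance
    ("urn:ex:data/2", "ex:generatedBy", "urn:ex:run/7"),       # provenance
    ("urn:ex:run/7", "ex:usedService", "urn:ex:service/align"),  # provenance
    ("urn:ex:service/align", "ex:annotation", "sequence alignment"),  # annotation
}


def match(pattern, store):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


def services_behind(data_item, store):
    """Follow provenance edges (data -> run -> service), then read the annotation."""
    results = []
    for _, _, run in match((data_item, "ex:generatedBy", None), store):
        for _, _, service in match((run, "ex:usedService", None), store):
            for _, _, note in match((service, "ex:annotation", None), store):
                results.append((service, note))
    return results


answer = services_behind("urn:ex:data/2", triples)
```

Because the annotation triple lives in the same store as the provenance triples, one traversal answers a question that spans both kinds of information.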

Like Taverna and Windows Workflow Foundation, VisTrails provides a graphical user interface for building workflows. However, instead of just capturing the execution of a workflow, VisTrails also captures how the workflow was created by the user [154]. This system is the first to be specifically designed to capture the workflow evolution process [72]. While VisTrails' motivating domain is visualization [159], the provenance-based querying and workflow construction techniques such as “Query by Example” and

Chapter 2 A Critical Analysis of Provenance Systems 28

“Creation by Analogy” are domain independent [155]. Using the system, users can return to previous versions of workflows and workflow runs to compare their results. Furthermore, portions of workflows can be combined to create new workflows, and the origin of each new workflow is also tracked. Essentially, VisTrails allows the user to serendipitously explore a space of workflows without losing any previous work. The data describing the provenance of a workflow is captured using an open, self-describing data model to enable sharing and publication [72].

Another workflow system is Kepler, which aims to support multiple kinds of workflows, from those designed for high-level conceptual bioinformatics experiments to those designed for job control and data movement in Grids [117]. To allow for a wide range of workflows, Kepler adopts a formal model that supports different computational styles. It contains components called actors that are very similar to services, taking input and producing output. Furthermore, it adds the notion of a Director that governs how a pair of actors interact. For example, a Director may state that one actor cannot execute until it receives data from another actor [117]. Based on this formal model, a provenance model centered on recording read, write and state-reset events in an event log is outlined by Bowers et al. [30]. Using the log, write events (i.e. the output of data) can be paired with several read events (i.e. the input of data) and a dependency graph can be built. State-reset events in the event log identify when previous read events can be discarded as precursors to subsequent write events. Using this approach, several provenance queries can be answered [30]. However, these queries depend on the workflow to bound them, because there is one event log per workflow. Furthermore, to support queries about the functional relationships between inputs and outputs, the model depends on actors implementing only one kind of functionality (i.e. actors cannot support more than one function). With modifications to support explicit dependencies and metadata, the execution of workflows in Kepler can be captured with the Kepler Provenance Recorder [10]. The implementation allows for the smart rerun of workflows based on the algorithms from VisTrails [10].
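The pairing of write events with preceding read events can be sketched as follows. This is an illustrative reading of the description above, not the algorithm of Bowers et al.: the event tuple layout, actor names and token names are all assumptions.

```python
from collections import defaultdict


def build_dependencies(event_log):
    """Pair each write with the reads its actor made since its last state reset.

    Events are hypothetical tuples: ("read", actor, token),
    ("write", actor, token) or ("reset", actor).
    """
    pending_reads = defaultdict(list)   # actor -> tokens read since last reset
    depends_on = {}                     # written token -> tokens it depends on
    for event in event_log:
        kind, actor = event[0], event[1]
        if kind == "read":
            pending_reads[actor].append(event[2])
        elif kind == "write":
            depends_on[event[2]] = list(pending_reads[actor])
        elif kind == "reset":
            # Earlier reads are no longer precursors of later writes.
            pending_reads[actor].clear()
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return depends_on


def lineage(token, depends_on):
    """Return every token that transitively contributed to `token`."""
    seen, stack = set(), [token]
    while stack:
        for parent in depends_on.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


log = [
    ("read", "A", "t1"), ("read", "A", "t2"), ("write", "A", "t3"),
    ("reset", "A"),
    ("read", "A", "t4"), ("write", "A", "t5"),
    ("read", "B", "t3"), ("write", "B", "t6"),
]
graph = build_dependencies(log)
```

In this log the reset prevents `t1` and `t2` from being counted as precursors of `t5`, while `t6` still traces back through `t3` to both of them.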

The above systems capture provenance-related information only from the workflow enactment engine. The Karma Provenance Framework [165], on the other hand, supports the capture of this information both from the workflow enactment engine and from the services used. This capture is facilitated by a notification model. The workflow enactment engine and services publish information about their execution to a notification service as XML. The Provenance Repository then listens for those notifications and stores them. This has the benefit that services can asynchronously submit their provenance information and thus not delay execution. The data model used by Karma is centered on the notion of workflow activities (i.e. the invocation of a service). To tie together the various XML submissions about each activity, a workflow identifier is attached to each. Furthermore, activities are ordered by attaching a logical time stamp generated by the workflow enactment engine. The model also assumes that each service instance within a workflow run is identified uniquely [162]. A variety of provenance-related queries can be answered by Karma using a combination of SQL queries and a Web Services API [164].

The Virtual Data System (VDS) is a workflow system that focuses on data-intensive scientific applications [202]. The system takes a functional approach: executable applications are described as transformations (i.e. functions) and the inputs to those applications are described by derivations that bind particular data to a transformation (i.e. function calls). The syntax to describe derivations and transformations is called the Virtual Data Language (VDL) [71]. Workflows described in VDL can then be submitted to workflow planners such as Pegasus [55] or converted to run in workflow enactment engines such as Condor DAGMan [76]. When the concrete workflow is executed, the parameters, along with information about the runtime environment, are stored in the VDS. Like Karma, this parameter and runtime information is submitted back to the VDS by the invoked services. Once stored in the VDS, parameter and runtime information can then be combined with lineage information inferred from the VDL to answer a variety of provenance queries [201]. This reliance on the existence of a workflow to infer provenance is one of the major disadvantages of the VDS approach, because it does not allow provenance to be determined in cases where the workflow no longer exists.
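The way lineage can be inferred from derivations, and why that inference fails once the workflow description is gone, can be sketched as follows. This is not VDL syntax or the VDS implementation; the `Derivation` class, the file names and the transformation names are hypothetical, and a transformation is represented only by its name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """Hypothetical derivation: binds named data to a transformation (a function call)."""
    transformation: str
    inputs: tuple
    outputs: tuple


def lineage(data_name, derivations):
    """Return the derivations needed to produce `data_name`, upstream first.

    Assumes the derivations form an acyclic graph.
    """
    by_output = {out: d for d in derivations for out in d.outputs}
    ordered = []

    def visit(name):
        derivation = by_output.get(name)
        if derivation is None or derivation in ordered:
            return  # raw input, or already accounted for
        for upstream in derivation.inputs:
            visit(upstream)
        ordered.append(derivation)

    visit(data_name)
    return ordered


extract = Derivation("extract", inputs=("raw.dat",), outputs=("clean.dat",))
plot = Derivation("plot", inputs=("clean.dat",), outputs=("figure.png",))

with_workflow = lineage("figure.png", [extract, plot])
without_workflow = lineage("figure.png", [])   # workflow description no longer available
```

With the derivations present, the lineage of `figure.png` is recovered purely from the workflow description; with them absent, nothing can be inferred, which is the disadvantage noted above.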

Szomszor and Moreau argue for infrastructure support for provenance in Grid and Web Service applications [178]. An architecture and implementation were developed around a workflow enactment engine recording data into a separate repository. To cater for reproducibility, all the inputs and outputs of Web Services are recorded along with the workflow script and the interface definitions of services. The recording interface provided by the implementation supports both the asynchronous and synchronous submission of data. The implementation also has a validation capability that determines whether a particular result is current by re-executing the workflow and comparing the execution to the one documented in the repository.
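The validation idea, re-executing and comparing against what was recorded, can be sketched in a few lines. This is a simplification under stated assumptions: it re-invokes each recorded call individually rather than re-running a workflow script, and the record layout and service names are hypothetical rather than taken from the implementation described above.

```python
def find_stale_calls(recorded_invocations, services):
    """Re-invoke each recorded call and list the services whose output has changed."""
    stale = []
    for call in recorded_invocations:
        fresh_output = services[call["service"]](*call["inputs"])
        if fresh_output != call["output"]:
            stale.append(call["service"])
    return stale


# Stand-ins for Web Services as they behave today.
services = {
    "double": lambda x: 2 * x,
    "increment": lambda x: x + 1,
}

# What the repository documented at the time of the original run.
recorded = [
    {"service": "double", "inputs": (3,), "output": 6},
    {"service": "increment", "inputs": (6,), "output": 8},   # service has since changed
]

stale = find_stale_calls(recorded, services)
result_is_current = not stale
```

Recording every input and output is what makes this comparison possible: without them there would be nothing to check the re-execution against.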

The workflow systems described here can all answer a variety of queries about the provenance of data. However, as multi-institutional scientific systems become increasingly decentralised, centralised workflow enactment engines do not have all the information necessary to provide the complete provenance of various results. For example, if a workflow enactment engine, A, called a service, B, which itself executes a workflow, the provenance-related information stored by A would not contain the information about how B produced its results, and thus the complete provenance of the output of A could not be determined. Furthermore, with the exception of VDS and Karma, the systems described only capture the workflow enactment engine's view of a service invocation. This leaves open the possibility that provenance-related information produced by the workflow enactment engine is manipulated. In a multi-institutional environment, both the workflow enactment engine and the service in an interaction need to record documentation of their involvement with each other. Finally, while all these systems have accessible, well-specified data models, they are all tied to the particular notion of workflow implemented by each system. In heterogeneous environments, a model is needed that is not tied to any particular workflow environment.

In document The Origin of Data: Enabling the Determination of Provenance in Multi institutional Scientific Systems through the Documentation of Processes (Page 39-42)