
2.3 Provenance Systems

2.3.2 Application-Specific Systems

Many provenance systems are designed for the needs of one application or domain. We review these systems here.

Some of the first research in provenance was in the area of Geographic Information Systems (GIS) [113]. Knowing the provenance of map products is critical in GIS applications because it allows one to determine the quality of those derived map products [113]. Lanter developed two systems for recording and retrieving the provenance of map products in a GIS. The first system was a meta-database for recording data about GIS processes.

4 http://www.apple.com/macosx/leopard/features/timemachine.html

5 The Oxford English Dictionary defines Wiki as a type of web page designed so that its content can be edited by anyone who accesses it, using a simplified markup language. Implementations of Wikis allow for the revision history of a page to be seen and older versions retrieved [11].

Chapter 2 A Critical Analysis of Provenance Systems 21

The second system was for tracking operations in the Arc/Info GIS system’s graphical user interface and command line [112,114].

Provenance is also needed in the area of statistical analysis. S is an interactive system for statistical analysis in which the results of user commands are automatically recorded in an audit file [18]. These results include the modification or creation of data objects as well as the commands themselves. S's AUDIT utility can then be used to analyse the audit file and retrieve the provenance of a statistical analysis. The utility can also create a script to re-execute a series of commands from the audit file.
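The audit-file idea can be illustrated with a minimal sketch. It is not S's actual AUDIT utility or file format: the `AuditFile` class, its methods, and the command strings are hypothetical names introduced here, and the sketch assumes each logged command declares which data objects it reads and writes.

```python
from dataclasses import dataclass, field


@dataclass
class AuditEntry:
    """One executed command plus the data objects it read and wrote."""
    command: str
    reads: tuple
    writes: tuple


@dataclass
class AuditFile:
    """Append-only command log (hypothetical stand-in for an audit file)."""
    entries: list = field(default_factory=list)

    def record(self, command, reads=(), writes=()):
        self.entries.append(AuditEntry(command, tuple(reads), tuple(writes)))

    def provenance(self, obj):
        """Return, in execution order, the commands that contributed to obj."""
        needed = {obj}
        relevant = []
        # Walk the log backwards, following each write back to what it read.
        for entry in reversed(self.entries):
            if needed & set(entry.writes):
                relevant.append(entry)
                needed -= set(entry.writes)  # this command explains these objects
                needed |= set(entry.reads)   # its inputs still need explaining
        return [e.command for e in reversed(relevant)]

    def replay_script(self, obj):
        """Build a script that re-executes only the contributing commands."""
        return "\n".join(self.provenance(obj))


# Illustrative command strings, not real S syntax.
audit = AuditFile()
audit.record("x <- load survey", writes=["x"])
audit.record("y <- load other", writes=["y"])
audit.record("m <- mean of x", reads=["x"], writes=["m"])
print(audit.replay_script("m"))
```

Asking for the provenance of `m` skips the unrelated command that created `y`, so the replay script contains only the steps needed to rebuild the requested object.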

Manipulating arrays is an important procedure in many computational models. In the context of the Array Manipulation Language (AML), the Sub-pushdown algorithm tracks all the array operations performed when executing AML programs [120]. Using an implementation of the algorithm called ArrayDB, the provenance of output arrays produced by AML programs can be retrieved.

The above systems are inadequate for multi-institutional environments because they are domain specific, program specific, and cannot work in distributed environments. The next five systems are designed for distributed settings but are again application specific.

Another system in the GIS domain is GOOSE [8], a tool for the creation of large geographical models by multiple participants. GOOSE was designed to work in a GIS environment that resembles a multi-institutional system, where multiple tools, data sources, and researchers work together to create scientific results. GOOSE deals with this environment by providing a user interface to a GIS modelling engine and a shared repository accessible by multiple users. As the user creates an object for a geographical model, an operations log is kept. When the user stores the object, the operations log is stored along with it. Then, when another user makes use of the object, they can retrieve the operations log for that object and thus its lineage. GOOSE mandates a common user interface and storage layer and is specific to one domain; it is therefore not sufficient for the kind of environment we consider.

Another domain where provenance is of interest is satellite image processing. The Earth System Science Workbench (ESSW) is designed for processing satellite imagery locally [73]. It provides a lab notebook service for tracking processing steps and a No-Duplicate Write Once Read Many storage service for storing files. Essentially, the system provides wrappers for each program a user calls, which transparently gather the input and output of the executing programs. After the fact, ESSW can recreate the lineage of science objects from the data stored in the notebook service. Newer versions of the software, named ES3 [74], are closer to the operating-system-level provenance systems discussed below in that the software captures operating system calls. To aggregate the data produced by separate workbenches, a lineage server is proposed that merges data to produce a searchable repository of lineage data [26]. ESSW identifies important parts of a generic provenance solution; however, it is still tied to the particular domain of satellite image processing.
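The wrapper approach can be sketched as follows. This is an illustration of the general technique, not ESSW's implementation: `notebook`, `run_wrapped`, and `lineage` are hypothetical names, the notebook is an in-memory list rather than a service, and the caller names the input and output files explicitly, whereas the wrappers described above gather them transparently.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path

notebook = []  # hypothetical in-memory stand-in for a lab notebook service


def digest(path):
    """Content hash used to identify a file version."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_wrapped(argv, inputs, outputs):
    """Run a program, then log which files it consumed and produced."""
    in_ids = {str(p): digest(p) for p in inputs}
    subprocess.run(argv, check=True)
    out_ids = {str(p): digest(p) for p in outputs}
    notebook.append({"argv": list(argv), "inputs": in_ids, "outputs": out_ids})


def lineage(path):
    """Trace a file back through the notebook entries that produced it."""
    steps, frontier, seen = [], [str(path)], set()
    while frontier:
        target = frontier.pop()
        if target in seen:
            continue
        seen.add(target)
        for entry in notebook:
            if target in entry["outputs"] and entry not in steps:
                steps.append(entry)
                frontier.extend(entry["inputs"])
    return steps


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.txt"
    calibrated = Path(tmp) / "calibrated.txt"
    raw.write_text("pixel data")
    # Stand-in processing step: copy the input to the output in upper case.
    step = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
    run_wrapped([sys.executable, "-c", step, str(raw), str(calibrated)],
                inputs=[raw], outputs=[calibrated])
    history = lineage(calibrated)
```

Because lineage is reconstructed after the fact from the logged entries, the processing program itself needs no modification; only the call is wrapped.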

The Collaborative Analysis Versioning Environment System (CAVES) and Collaborative Development Shell (CODESH) are designed to provide a virtual logbook for distributed collaborative groups [28]. Both CAVES and CODESH are interactive shells that users log into to perform various data analysis tasks. Similar to S, the systems track the user's interaction with the shell and store it as session logs. These logs are then published to a server, allowing other members of the collaborative group to investigate and replay other users' sessions. The system is designed specifically for sharing users' interactive analysis sessions in multi-institutional collaboratories; however, it is not a complete solution because it does not capture what goes on outside the interactive shell.

In the context of distributed job execution on the Grid, work has concentrated on gathering statistical information and re-running jobs. Both Quill++ [151] and gLite Job Provenance [60] support these tasks. These systems are designed to be scalable and to minimise the impact of provenance on job execution. Capturing data for provenance in job execution environments is an important part of an overall solution for multi-institutional scientific systems; however, both Quill++ and gLite are tied completely to their execution engines (Condor and gLite respectively) and thus are not adequate as a total solution for heterogeneous systems.