Summary - The Origin of Data: Enabling the Determination of Provenance in Multi institutional S

In this chapter, a detailed description of multi-institutional scientific systems was given. This description identified the evolution of these systems towards greater decentralisation through the use of the SOA architectural style. After describing the environment considered, three characteristics of provenance were given by analysing its use in the field of art. Provenance was identified as being about the past and process was identified as playing a key role. Thus, the important distinction between past and prospective processes was explained. In this context, we reviewed a range of provenance systems from those designed for specific domains to those integrated with workflow construction and enactment environments. We identified three cross-cutting concerns: support for multiple levels of abstraction, a framework for understanding provenance queries, and the role of causality in understanding the provenance of a result. To understand the nature of causality, we briefly reviewed work in the area and arrived at a definition of causality suited to provenance in multi-institutional systems.

From our analysis, we arrived at six conclusions, which can be summarised as follows. First, current systems do not cater fully for heterogeneous SOA-based multi-institutional systems; they are often domain or technology specific. Second, they do not take into account the difference between a prospective process (such as a workflow) and a past process (such as documentation of the workflow's execution). Third, they do not specifically define the causal nature of the relationships they express between data, nor do they support multiple levels of abstraction and nesting. Fourth, they do not distinguish between the data they capture and store (process documentation) and the representation of provenance that they retrieve from that data. Finally, open, generic data models are important in allowing provenance describing multi-site processes to be retrieved. In the next chapter, we describe an open data model that supports high-quality characteristics and takes into account the conclusions of our analysis.

Chapter 3

A Model of Process Documentation

At the beginning of this dissertation, we outlined the need for the provenance of results in order to establish confidence in those results, especially when they are produced by dynamic multi-institutional scientific systems. Furthermore, we introduced the notion that provenance is a question answered by querying documentation of an application’s process. This novel distinction between provenance and process documentation allows provenance questions, unknown at the time of application execution, to be successfully answered provided enough documentation has been produced. Additionally, this separation of concerns allows components to be specialised for their specific role, either the creation of process documentation or its querying. To enable the creation and querying of process documentation by distributed software components, there must be some shared understanding between all these components. In this chapter, we present a data model for process documentation that provides this shared understanding.

But what should this data model look like? In Chapter 2, we arrived at six conclusions about the state of the art for determining provenance in multi-institutional systems. Taking these conclusions into account, this chapter specifies a model of process documentation that is compatible with SOAs, provides explicit veridical relationships between occurrences, and is both technology and domain independent. Furthermore, as discussed later, the data model is designed to support high quality characteristics derived from a use case analysis.

Therefore, the contributions of this Chapter are as follows:

• A more detailed description of the set of characteristics that define high quality process documentation.

• A precise conceptual definition of a generic data model for process documentation that supports the creation of high quality process documentation and allows for the provenance of results to be determined.

The rest of the Chapter is organised as follows. First, a generic, shared model of process documentation is further motivated and a simple example application is introduced. Next, a set of characteristics that define high quality process documentation is enumerated. The data model is then conceptually specified. The specification begins by describing our definition of process and then describes how to create process documentation for applications once they are mapped to this perspective. After completely defining the model, we show how the provenance of a data item can be extracted from it and how the characteristics defined earlier are supported. Finally, related work is briefly revisited and we conclude.

3.1 Motivation for a Generic, Shared Process Documentation Data Model

Just as the provenance of a work of art may include multiple owners, institutions, and handlers, the provenance of a particular digital object may include processes that occurred at different sites, at different institutions, and at different times. Because these processes may be different in terms of domain focus, underlying assumptions, and implementation technology, it is helpful to have a generic data model for their documentation so that the provenance of results can be traced back through these various interconnected processes.

Take the simple example application in Figure 3.1: at the request of a client initiator, a mathematical calculation is performed on a collated sample, S, provided by a service that collates a sample from data provided by several sources. This application contains at least two sub-processes: Process A, the mathematical calculation, and Process B, the collation of the sample. The process documentation for Process A would document actions of addition, subtraction, and formula evaluation. On the other hand, the process documentation for Process B would document the collation of S, for example, by documenting who was responsible for the data items being collated, what institutions the data items are from, and whether the data items were produced experimentally or were synthesised from publications. The documentation for these two sub-processes differs in its level of detail and the kind of information included. However, using a generic model, we can still obtain the provenance of the numerical result that includes the whole of the process.
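To make this concrete, the following is a minimal sketch and not the data model specified later in this chapter. It assumes a hypothetical generic record type, `Record`, with the fields `output`, `inputs`, `process`, and a free-form `details` dictionary; these names, the data item identifiers (`d1`, `d2`, `sum`, `result`), and the recorded values are all illustrative assumptions. The sketch shows how documentation of differing detail from Process A and Process B might still be traversed uniformly to obtain the provenance of the numerical result.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical generic record: one documented step of a past process."""
    output: str      # identifier of the data item produced
    inputs: tuple    # identifiers of the data items used
    process: str     # sub-process that created this record
    details: dict = field(default_factory=dict)  # domain-specific content


def provenance(item, records):
    """Return every record that contributed to `item`, nearest first."""
    by_output = {r.output: r for r in records}
    trace, pending, seen = [], [item], set()
    while pending:
        current = pending.pop(0)
        if current in seen or current not in by_output:
            continue  # already visited, or an item with no documentation
        seen.add(current)
        record = by_output[current]
        trace.append(record)
        pending.extend(record.inputs)
    return trace


# Process B documents where the collated data came from (assumed values).
process_b = [
    Record("S", ("d1", "d2"), "B: collate sample",
           {"d1": "experimental", "d2": "synthesised from publications"}),
]
# Process A documents the mathematical steps applied to the sample S.
process_a = [
    Record("sum", ("S",), "A: mathematical calculation",
           {"action": "addition"}),
    Record("result", ("sum",), "A: mathematical calculation",
           {"action": "formula evaluation"}),
]

trace = provenance("result", process_a + process_b)
for record in trace:
    print(record.process, "->", record.output, record.details)
```

Because both sub-processes use the same record structure, the traversal does not need to know that one documents arithmetic and the other documents data sources; the domain-specific content stays in `details`.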

Figure 3.1: A simple example application. The components are the Client Initiator, the Mathematical Function, and Collate Sample; the interactions are I1 (initiator request), I2 (collate sample request), I3 (collate sample response), and I4 (numerical result).

A model that is applicable to multiple processes could be generated on a case-by-case basis. However, a number of benefits arise from a data model that is not only generic but also shared across applications and application components. The benefits of a generic, shared model of process documentation are enumerated below.

1. Future proofing: It allows application developers to be sure that the process documentation their applications create today will be understandable by future applications and usable with process documentation generated by these future applications. This is vital since today’s process documentation will be part of the provenance of tomorrow’s results.

2. Sharing: It allows different institutions to share their process documentation without the need for conversion between models.

3. Common tools: Using it, tools can be designed that allow for the visualisation, reasoning, and filtering of process documentation irrespective of the domain.

4. Independent creation: With it, process documentation created independently by application components can be integrated together, which allows trailing analyses to be performed over unanticipated groupings of process documentation.

5. Clear guidelines: Application developers may want their applications to create process documentation; however, they may not know what data belongs in process documentation and what its structure should be. A generic, shared model provides a set of guidelines that help developers determine what data should be part of process documentation and how that data should be structured.


6. Platform independence: Internet applications are often developed using a variety of platforms (i.e. operating systems, programming languages, architectures). Such a model allows for provenance to be determined from process documentation generated by application components running on any platform.
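The sharing, independent creation, and platform independence benefits can be illustrated with a minimal sketch. It assumes, purely for illustration, that two components serialise their process documentation as JSON lists of objects with hypothetical `output`, `inputs`, and `creator` keys. Neither these keys, the service names, nor the JSON encoding is prescribed by the text above.

```python
import json

# Documentation produced independently by two hypothetical components,
# possibly on different platforms; only the shared structure matters.
doc_from_site_x = (
    '[{"output": "S", "inputs": ["d1", "d2"], "creator": "collate-service"}]'
)
doc_from_site_y = (
    '[{"output": "result", "inputs": ["S"], "creator": "math-service"}]'
)


def merge(*documents):
    """Combine JSON-encoded record lists without any model conversion."""
    merged = []
    for text in documents:
        merged.extend(json.loads(text))
    return merged


def used_by(item, records):
    """Domain-agnostic filter: which records used `item` as an input?"""
    return [r for r in records if item in r["inputs"]]


records = merge(doc_from_site_x, doc_from_site_y)
consumers = used_by("S", records)
print([r["creator"] for r in consumers])
```

Since both components follow the same assumed structure, the merge step is plain concatenation and the filter works without knowing which domain or platform produced each record.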

In document The Origin of Data: Enabling the Determination of Provenance in Multi institutional Scientific Systems through the Documentation of Processes (Page 51-56)