Provenance Data - Document DNA: Distributed Content-Centered Provenance Data Tracking

3. Annotations are stored in different formats and are not usable by other systems.

The third issue is supported by Svensson (2009), who stated that most annotation approaches limit annotations to context that can be gained directly from sensors, such as location, user or activity. However, each approach Svensson reviewed stored the contextual annotations in a different way, making it hard to re-use this information. We believe that this issue is magnified when using semantic annotations, since different approaches assign different identifiers to categories. It could be argued that cloud-based systems, such as Google Docs, are slowly replacing the file-folder system. However, these systems also suffer from the above-mentioned issues, as they are based on file-folder structures.

However, there were two valuable concepts in the reviewed systems:

1. The concept of tracking the flow of information.

2. The concept of expressing different levels of strength of relations.

The flow of information named in the first concept is considered data provenance. Data provenance is a wide term and we therefore need to define it in relation to this work.

2.4 Provenance Data

Provenance of an object refers to how the current object’s state came to be, i.e., the history of the object. Provenance for physical objects is typically limited to the history of the ownership and storage of the object. Provenance data is valuable for determining the authenticity and condition of objects like art, antiques, wine, books or collections of records. One example of such use is shown by Schibille et al. (2008), who discuss the provenance of late antique window glass from the Petra church in Jordan.

In this research, we limit provenance to information. For digital information, provenance must be considered differently from that of physical objects. Both the storage and the ownership of digital objects are very different from those of physical objects. For example, physical objects cannot exist in several locations at the same time, whereas digital objects can. The rate at which digital objects change location and ownership is also vastly higher than the rate for physical objects.

The other main difference between physical objects and digital objects is that physical objects are rarely manipulated in the same way, especially when goods of great value like antiques, wines or archaeological artifacts are concerned. Digital objects, on the other hand, are constantly evolving. This is especially true when considering that a change of location or ownership often comes with a change of format, which changes the underlying structure of the digital information.

We now introduce the most prominent examples of provenance data of digital objects.

2.4.1 Provenance in Scientific Research

One of the cornerstones of scientific research is the production of data sets. The quality of produced data sets is often used to judge the overall quality of a piece of scientific work. Provenance is one of the tools used to measure the quality of a scientific data set. In this case, provenance means the methods used (such as algorithms and experimental setups) and the sources used (such as data gained by sensors or cited sources) to produce the data set (Barga et al., 2010). Provenance of scientific data sets is then used to judge the reproducibility and value of the data set.

2.4.2 Provenance in Databases

Data in databases consists of tuples. Using queries, these tuples can be manipulated, combined, aggregated and filtered in order to create views or result sets for data warehouses. According to Tannen (2008), the history of actions that preceded a view or result set is considered provenance for data sets in databases. Tannen states that this provenance data is highly important for establishing the relationship between the result set or view and the source data used to create it. The nature of this relationship matters when judging the quality or suitability of the produced result set or view, because the relationship illustrates how the view was created. In databases, provenance is thus used to establish relationships between raw and processed data, giving the users of this data a tool to judge the usefulness of the processed data.
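As an illustration of this idea (not taken from Tannen or from the reviewed systems), the following minimal Python sketch filters and aggregates a small table while recording which source tuples contributed to each row of the result. The table contents, the tuple identifiers and the function name `total_per_city_with_lineage` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical source table: (tuple_id, city, amount)
orders = [
    ("t1", "Oslo", 10),
    ("t2", "Oslo", 25),
    ("t3", "Bergen", 7),
    ("t4", "Bergen", 40),
]


def total_per_city_with_lineage(rows, min_amount):
    """Filter rows by amount, sum per city, and record contributing tuple ids."""
    totals = defaultdict(int)
    lineage = defaultdict(list)
    for tuple_id, city, amount in rows:
        if amount >= min_amount:          # filter step
            totals[city] += amount        # aggregation step
            lineage[city].append(tuple_id)  # provenance of the result row
    return {city: (totals[city], lineage[city]) for city in totals}


view = total_per_city_with_lineage(orders, min_amount=10)
for city, (total, sources) in view.items():
    print(city, total, sources)
```

Each row of the result carries the identifiers of the source tuples it was computed from, so a user can see, for instance, that tuple `t3` was filtered out and did not influence the total for Bergen.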


2.4.3 Provenance in Semantic Web

The Semantic Web utilizes structured data (for example, ontologies) and automated reasoning to allow users to answer complex questions using the World Wide Web as a knowledge source. Provenance in Semantic Web research is used to assess the quality of an answer gained using the Semantic Web, by including the reasoning process in addition to the sources used (Shadbolt et al., 2006). Issues often arise when the quality of the information sources is unknown due to frequent modifications; thus the versions of the sources and algorithms used are also important. In this domain, provenance is used to allow users to assess the quality of an answer gained through querying the Semantic Web.

2.4.4 Provenance as used in this Thesis

Our evaluation in Section 2.3.3 has shown that a focus on digital files is not sufficient; instead, an annotation system needs to recognize parts of files, which we refer to as digital objects. Digital objects in this context are units of digital information that can be transported on their own and are meaningful to the sender or recipient. Provenance for digital objects is data that answers the following four questions, which we derived from the definitions of provenance discussed above:

1. What is the origin of a digital object?

2. Which digital objects are derived from the current object?

3. Do digital objects share a common origin?

4. How do objects that share the same origin differ?

A digital object is considered to originate from another digital object if a series of manipulations leads from one object to the other.
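To make the first three questions concrete, the following Python sketch models derivation as a simple mapping from each object to the object it was directly derived from. This is only an illustration under stated assumptions, not the mechanism proposed in this thesis: the object names are hypothetical, and each object is assumed to have at most one direct predecessor, whereas real digital objects may combine content from several sources.

```python
# Hypothetical derivation records: each object maps to the object it was
# directly derived from (None marks an object with no recorded predecessor).
# Assumption: at most one direct predecessor per object.
derived_from = {
    "report_v1": None,
    "report_v2": "report_v1",
    "summary": "report_v2",
    "slides": "report_v1",
    "notes": None,
}


def origin(obj):
    """Q1: follow derivation steps back to the earliest recorded object."""
    while derived_from[obj] is not None:
        obj = derived_from[obj]
    return obj


def descendants(obj):
    """Q2: all objects reachable from obj via one or more derivation steps."""
    found = set()
    frontier = [obj]
    while frontier:
        current = frontier.pop()
        for child, parent in derived_from.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def share_origin(a, b):
    """Q3: two objects are related if they trace back to the same origin."""
    return origin(a) == origin(b)


print(origin("summary"))
print(sorted(descendants("report_v1")))
print(share_origin("summary", "slides"), share_origin("summary", "notes"))
```

In this toy data, `summary` and `slides` are related because both trace back to `report_v1`, while `notes` has no recorded relation to either.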

This means that provenance in this context enables the user to find all other forms of a digital object (including its origins and the objects that originated from it) and provides the ability to determine which of those objects is the most recent one. These four questions can be encapsulated into the following two requirements:

R1 Relationship Detection — The system needs to be able to determine if two digital objects are related, i.e., is one originating from the other, or do they share a common point of origin? (Q1 & Q2 & Q3)

R2 Relationship Metric — The system needs to enable the user to determine the nature of the difference between two related digital objects. For example, how much do they differ in content? (Q4)
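As a rough illustration of what a relationship metric for R2 could look like, the sketch below scores the textual difference between two objects using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for brevity and not the metric developed in this thesis; the function name `content_difference` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def content_difference(a: str, b: str) -> float:
    """Return a score in [0, 1]: 0.0 for identical text, 1.0 for no overlap.

    Assumption: objects are plain text; the score is one minus the
    similarity ratio computed by difflib.SequenceMatcher.
    """
    return 1.0 - SequenceMatcher(None, a, b).ratio()


original = "The quick brown fox jumps over the lazy dog."
edited = "The quick red fox jumps over the lazy dog."

print(content_difference(original, original))
print(content_difference(original, edited))
```

A lower score suggests that the two objects differ only slightly in content, while a score close to 1 suggests that little of the original content remains.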

Additionally, we found that the reliance on a centralized architecture is a major disadvantage to an annotation system. The same holds true for relying on either manual user input for annotation, or inaccurate automatic annotation. We therefore formulate these two additional requirements:

R3 Distributed — The metadata needs to be stored with the content, instead of separately. (Q3)

R4 Automated — The metadata needs to be created automatically and accurately. (Q3)
