Collaborative Data Integration - Collaborative Data Sharing with Mappings and Provenance

A recent paper [94] proposes a paradigm for data integration called collaborative data integration (CDI), and an implementation in a system called Youtopia, that is similar in many of its aims and technical features to CDSS. As in CDSS, CDI proposes a data exchange-style setup involving peer schemas related by networks of declarative mappings (tgds), and computation of materialized solutions. Youtopia employs an interesting new approach to resolving the technical difficulties associated with cycles in mappings, based on semi-automated user intervention. This allows the system to support mappings with arbitrary cycles. The system also supports update exchange, and incorporates sophisticated new transaction facilities. In contrast to CDSS, it does not track data provenance.

7.2 Provenance and Annotated Data Models

Lineage and why-provenance were introduced in [34,35,17] (the last paper uses a tree data model), but the relationship with [79] was not noticed. The papers on probabilistic databases [50,138,96] note the similarities with [79] but do not attempt a generalization.

Datalog with bag semantics in which derivation trees are counted was considered in several papers, among them [109,113,114]. The evaluation algorithms presented in these papers do not terminate if some output tuple has infinite multiplicity. Datalog on incomplete and on probabilistic databases is considered in [40,97], again with non-terminating algorithms. Later, [115] gave an algorithm for detecting infinite multiplicities in Datalog with bag semantics and [51] gave a terminating algorithm for Datalog on probabilistic databases.

Two recent papers on provenance, although independent of our work, have a closer relationship to our approach. Like us, [25] identifies the limitations of why-provenance and proposes route-provenance, which is also related to derivation trees. The issue of infinite routes in recursive programs is avoided by considering only minimal ones. [8] proposes a notion of lineage of tuples for a type of incomplete databases but does not consider recursive queries. We have seen that the annotations in that paper can be captured using a semiring, Trio(X).

The first attempt at a general theory of relations with annotations appears to be [81], where axiomatized label systems are introduced in order to study containment.

A line of work in artificial intelligence on soft constraint satisfaction problems (soft CSP) [11] is also based on semirings: the constraints there are in effect semiring-annotated relations, and the two operations on constraints correspond indeed to relational join and projection. CSP solutions are expressed as projection-join queries or Prolog programs. Computing solutions is the same as the evaluation of join and projection in Section 2.2, and [11] also uses fixed points on semirings. There are some important differences though. The semirings used in [11] are such that + is idempotent and 1 is a top element in the resulting order. This rules out our semirings N, N∞, N[X], N∞[[X]], hence the bag and provenance semantics.¹ More importantly, much of the focus in CSP is on choosing optimal solutions rather than on how these solutions depend on the constraints.
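The two conditions imposed on the semirings of [11] can be checked mechanically on small samples. The sketch below is illustrative only: the helper names are hypothetical, the fuzzy semiring ([0,1], max, min) is an assumed example of a semiring satisfying both conditions, and a finite sample can refute a property but cannot prove it.

```python
def plus_is_idempotent(plus, samples) -> bool:
    """Check a + a == a on every sampled element."""
    return all(plus(a, a) == a for a in samples)


def one_is_top(plus, one, samples) -> bool:
    """In the order a <= b iff a + b == b, 1 is the top element when a + 1 == 1."""
    return all(plus(a, one) == one for a in samples)


fuzzy_samples = [0.0, 0.25, 0.5, 1.0]   # elements of [0, 1] with + = max
nat_samples = [0, 1, 2, 5]              # elements of N with ordinary +

fuzzy_ok = plus_is_idempotent(max, fuzzy_samples) and one_is_top(max, 1.0, fuzzy_samples)
nat_idempotent = plus_is_idempotent(lambda a, b: a + b, nat_samples)
nat_top = one_is_top(lambda a, b: a + b, 1, nat_samples)
```

On these samples the fuzzy semiring passes both checks, while N fails both (1 + 1 is neither 1 under idempotence nor absorbed by 1), which is why bag semantics falls outside that framework.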

We left open the question of how to incorporate the relational algebra difference operator into the framework of semiring annotations. This problem was addressed in [54], which proposed the use of a monus operation for this purpose. The same paper also initiated the study of expressiveness of query languages on semiring-annotated relations.
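As a rough illustration of the monus idea, the sketch below assumes the usual definition on a naturally ordered semiring (a monus b is the least c with a ≤ b + c) and applies it tuple-wise to K-relations represented as Python dicts. The names `monus_nat`, `monus_bool`, and `difference` are hypothetical and are not taken from [54].

```python
from typing import Callable, Dict, Hashable, TypeVar

K = TypeVar("K")
Relation = Dict[Hashable, K]  # tuple -> annotation; absent tuples carry the semiring zero


def monus_nat(a: int, b: int) -> int:
    """Monus on N (bag semantics): truncated subtraction."""
    return max(a - b, 0)


def monus_bool(a: bool, b: bool) -> bool:
    """Monus on the Boolean semiring (set semantics): a and not b."""
    return a and not b


def difference(r: Relation, s: Relation, monus: Callable[[K, K], K], zero: K) -> Relation:
    """Annotate each tuple t with r(t) monus s(t), dropping zero annotations."""
    out = {}
    for t, a in r.items():
        c = monus(a, s.get(t, zero))
        if c != zero:
            out[t] = c
    return out


bag_r = {("a",): 3, ("b",): 1}
bag_s = {("a",): 1, ("b",): 2}
bag_diff = difference(bag_r, bag_s, monus_nat, 0)         # bag difference

set_r = {("a",): True, ("b",): True}
set_s = {("b",): True}
set_diff = difference(set_r, set_s, monus_bool, False)    # set difference
```

Under these assumptions the same `difference` function yields bag difference for N and ordinary set difference for the Booleans.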

7.3 Update Exchange

As already mentioned, our work on update exchange takes advantage of previous work on data integration, PDMS (e.g., [73]) and data exchange [45,104,118]. With our encoding in Datalog, we reduce the problem of incremental updates in CDSS to that of recursive view maintenance, where we contribute an improvement on the classic algorithm of [66]. Incremental maintenance of recursive views is also considered in [108] in the context of databases with constraints using the Gabbrielli-Levi fixpoint operator. The AutoMed system [110] implements data transformations between pairs of peers (or via a public schema) using a language called BAV, which is bidirectional but less expressive than tgds; the authors consider incremental maintenance and lineage [46] under this model. In [52], the authors use target-to-source tgds to express trust. Our approach to trust conditions has several benefits: (1) trust conditions can be specified between target peers or on mappings themselves; (2) each peer may express different levels of trust for other peers, i.e., trust conditions are not always “global”; (3) our trust conditions compose along paths of mappings. Finally, our approach does not increase the complexity of computing a solution.

[8] proposes a notion of lineage of tuples which is a combination of sets of relevant tuple ids and bag semantics. As best we can tell, this is more detailed than why-provenance, but as we have seen in Chapter 2, we can also describe it by means of a special commutative semiring, so our approach is more general. The paper also does not mention recursive queries, which are critical for our work. Moreover, [8] does not support any notion of incremental translation of updates over mappings or incompleteness in the form of tuples with labeled nulls. Our provenance model also generalizes the duplicate (bag) semantics for Datalog [114] and supports generalizations of the results in [115].

¹ Another difference is that for Datalog semantics we require our semirings to be ω-continuous, while [11] uses the less well-behaved fixed points given by Tarski’s theorem for monotone operators on complete lattices. However, the semiring examples in [11] appear to be in fact ω-continuous.

A recent paper [106] adopts an approach to incremental view maintenance similar in spirit to our work in its use of provenance annotations to speed up propagation of deletions. There, the annotations are from what is essentially PosBool(X) (cf. Chapter 2), the semiring of positive Boolean expressions over variables from X.

7.4 Query Containment and Equivalence

The seminal paper by Chandra and Merlin [22] introduced the fundamental concepts of containment mappings and canonical databases in showing the decidability of containment of CQs under set semantics and identifying its complexity as NP-complete. The extension to UCQs is due to Sagiv and Yannakakis [123]. We have built upon the techniques from these papers.

The papers by Ioannidis and Ramakrishnan [81] and Chaudhuri and Vardi [24] initiated the study of query containment under bag semantics. Chaudhuri and Vardi showed that bag-equivalence of CQs is the same as isomorphism, established the Π₂ᵖ-hardness of checking bag-containment of CQs, and gave partial conditions for checking bag-containment (see Section 4.4 for further connections with our results). Chaudhuri and Vardi [24] also introduced the study of bag-set semantics, and showed that bag-set equivalence of CQs (without repeated atoms in the body) is the same as isomorphism. This was essentially a rediscovery of a well-known result in graph theory due to Lovász [107] (see also [74]), who showed that for finite relational structures F, G, if |Hom(F,H)| = |Hom(G,H)| for all finite relational structures H, where Hom(A,B) is the set of homomorphisms h : A → B, then F ≅ G. In database terminology, this says that bag-set equivalence of Boolean CQs (without repeated atoms in the body) is the same as isomorphism. Ioannidis and Ramakrishnan showed that bag-containment of UCQs is undecidable.
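The homomorphism-counting view can be made concrete with a small brute-force sketch. It assumes structures with a single binary relation (directed graphs), and the function name `count_homs` is hypothetical. Under bag-set semantics, the answer of a Boolean CQ with body F on an instance H is the number of homomorphisms from F to H, so a single H on which the counts differ witnesses that two queries are not bag-set equivalent.

```python
from itertools import product
from typing import List, Tuple

Edge = Tuple[int, int]


def count_homs(f_nodes: List[int], f_edges: List[Edge],
               h_nodes: List[int], h_edges: List[Edge]) -> int:
    """Count maps from F's nodes to H's nodes that send every edge of F to an edge of H."""
    h_edge_set = set(h_edges)
    count = 0
    for image in product(h_nodes, repeat=len(f_nodes)):
        mapping = dict(zip(f_nodes, image))
        if all((mapping[u], mapping[v]) in h_edge_set for u, v in f_edges):
            count += 1
    return count


# F: directed path 0 -> 1 -> 2; G: directed 3-cycle (not isomorphic to F).
path = ([0, 1, 2], [(0, 1), (1, 2)])
cycle = ([0, 1, 2], [(0, 1), (1, 2), (2, 0)])

# Use the path itself as the target structure H.
homs_path_to_path = count_homs(*path, *path)
homs_cycle_to_path = count_homs(*cycle, *path)
```

Here the path admits exactly one homomorphism into itself, while the cycle admits none into an acyclic target, so the two structures are distinguished by this choice of H. The enumeration is exponential in the size of F and is meant only for toy inputs.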

In Section 4.4 we have discussed the results of Cohen et al. [30] and Cohen [27] on bag equivalence and bag-set equivalence of UCQs. The decidability of bag-containment of CQs remains open. Recent progress was made on the problem by Jayram et al. [84], who established the undecidability of checking bag-containment of CQs with inequalities.

Semiring-annotated relations are also related to the lattice-annotated relations used in parametric databases by Lakshmanan and Shiri [98]. That paper also studied query containment and equivalence, giving a number of positive decidability results. None of our provenance models fall into this framework (with the exception of PosBool(X), cf. Theorem 4.34).

containment and equivalence of positive relational queries on bilattice-annotated relations.

Tan [130] showed that query containment is decidable for CQs on relations with where-provenance information. Our results here on why-provenance complement the where-provenance results (why-provenance and where-provenance were introduced together in [17]).

Cohen [28] recently initiated the study of query optimization under combined semantics, which generalizes bag semantics and bag-set semantics by enriching the relational algebra with a duplicate elimination operator. “Duplicate elimination” also makes sense for K-relations in the form of the support operator:

$$\mathrm{supp}(R) \overset{\mathrm{def}}{=} \lambda t.\ \begin{cases} 0 & \text{if } R(t) = 0 \\ 1 & \text{otherwise} \end{cases}$$

For K = N, this is duplicate elimination; for K = PosBool(X) it corresponds to the poss operator of [5], which returns the “possible” tuples of an incomplete relation. It would be interesting to see whether the decidability results presented here can be extended to queries using supp.
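The support operator is simple enough to sketch directly. The code below assumes a K-relation is stored as a Python dict from tuples to annotations, with the semiring's 0 and 1 passed in explicitly; the representation and parameter names are illustrative choices, not part of the original development.

```python
from typing import Dict, Hashable, TypeVar

K = TypeVar("K")


def supp(relation: Dict[Hashable, K], zero: K, one: K) -> Dict[Hashable, K]:
    """Map every tuple with a non-zero annotation to `one`; zero-annotated tuples are dropped."""
    return {t: one for t, k in relation.items() if k != zero}


# K = N: multiplicities collapse to 1, i.e. duplicate elimination.
bag = {("alice",): 3, ("bob",): 0, ("carol",): 1}
deduplicated = supp(bag, zero=0, one=1)
```

For K = PosBool(X) the test `k != zero` would have to decide whether a positive Boolean expression is equivalent to false, which this sketch does not attempt.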

7.5 Ring-Annotated Relations and Updates

Exact query reformulation using views has been studied extensively, due to its applications in query optimization, data integration, and view maintenance, starting with the papers by Levy et al. [101] and Chaudhuri et al. [23]. The former paper established fundamental results for UCQs with built-in predicates (our UCQ<s) under set semantics. The latter paper considered CQs with built-in predicates (CQ<s) under bag semantics, but it did not provide a complete reformulation algorithm or consider UCQs.

The view adaptation problem was introduced in [65], which gives a case-based algorithm for adapting materialized views under changes to view definitions (under bag semantics). In contrast, our methods apply to view adaptation, but use a more general term rewrite system to develop a sound and complete query reformulation algorithm.

Our Z-relations appeared in an early form as the deltas in the count incremental view maintenance algorithm for UCQs of [66]. That paper did not consider query equivalence for deltas or make a general study of query reformulation.
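To make the role of deltas concrete, the following sketch treats a Z-relation as a dict from tuples to integers, where negative multiplicities encode deletions. It is not the algorithm of [66]; the view definition, the function names `join_project` and `add`, and the data are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Tuple

ZRel = Dict[Tuple, int]  # tuple -> integer multiplicity (may be negative in a delta)


def join_project(r: ZRel, s: ZRel) -> ZRel:
    """V(a, c) :- R(a, b), S(b, c), multiplying multiplicities and summing over b."""
    out: Counter = Counter()
    for (a, b), m in r.items():
        for (b2, c), n in s.items():
            if b == b2:
                out[(a, c)] += m * n
    return {t: k for t, k in out.items() if k != 0}


def add(r: ZRel, s: ZRel) -> ZRel:
    """Tuple-wise sum of two Z-relations, dropping tuples whose multiplicity is 0."""
    out = Counter(r)
    for t, k in s.items():
        out[t] += k
    return {t: k for t, k in out.items() if k != 0}


R = {(1, "x"): 2, (2, "y"): 1}
S = {("x", 10): 1, ("y", 20): 3}
V = join_project(R, S)

# Delta on R: delete one copy of (1, "x") and insert one copy of (3, "y").
delta_R = {(1, "x"): -1, (3, "y"): +1}

# The view is linear in R, so the change to V is the same query applied to the delta.
delta_V = join_project(delta_R, S)
V_incremental = add(V, delta_V)
V_recomputed = join_project(add(R, delta_R), S)
```

In this example the incrementally maintained view coincides with the view recomputed from scratch, which is the property that makes signed multiplicities convenient for representing updates.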

Z[X]-relations made their first appearance in [54]. That paper did not consider the use of Z[X] to represent data updates, nor did it consider questions of query equivalence and query reformulation.
