Normal Form and Decidability - Collaborative Data Sharing with Mappings and Provenance

Equivalence of relational queries under set semantics has long been known to be undecidable [123].

We have also seen (cf. footnote in the preamble to this chapter) that bag equivalence of relational queries is undecidable, via an easy reduction from containment of UCQs (which makes essential use of proper subtraction). In contrast, the form of subtraction used in Z-relations turns out to be surprisingly well-behaved: we will show in this section that Z-equivalence of relational queries is actually decidable. The key idea is that, in contrast to bag and set semantics, under Z-semantics every relational query is equivalent to a single difference of positive queries:

Theorem 5.2 (Normalization). For any Q ∈ RA we can effectively find A, B ∈ RA+ such that Q ≡_Z A − B.

Proof. For any Q ∈ RA we define a difference normal form, denoted diffNF(Q), by structural recursion on Q as follows:

• If R is a predicate symbol then diffNF(R) = R − ∅.

• If diffNF(Q) = A − B then

– diffNF(πX(Q)) = πX(A) − πX(B), and

– diffNF(σP(Q)) = σP(A) − σP(B).

• If diffNF(Q1) = A1 − B1 and diffNF(Q2) = A2 − B2 then

– diffNF(Q1 ⋈ Q2) = (A1 ⋈ A2 ∪ B1 ⋈ B2) − (A1 ⋈ B2 ∪ B1 ⋈ A2),

– diffNF(Q1 ∪ Q2) = (A1 ∪ A2) − (B1 ∪ B2), and

– diffNF(Q1 − Q2) = (A1 ∪ B2) − (A2 ∪ B1).

(A − B) − C ≡_Z A − (B ∪ C)
A − (B − C) ≡_Z (A ∪ C) − B   (!)
A ∪ (B − C) ≡_Z (A ∪ B) − C   (!)
(A − B) ∪ C ≡_Z (A ∪ C) − B   (!)
A ⋈ (B − C) ≡_Z (A ⋈ B) − (A ⋈ C)
(A − B) ⋈ C ≡_Z (A ⋈ C) − (B ⋈ C)
σP(A − B) ≡_Z σP(A) − σP(B)
πX(A − B) ≡_Z πX(A) − πX(B)   (!)
ρ_β(A − B) ≡_Z ρ_β(A) − ρ_β(B)

Figure 5.1: Algebraic identities for the difference operator under Z-semantics

Clearly diffNF(Q) has the form A − B with A, B ∈ RA+. It is straightforward to show by induction on Q that Q ≡_Z diffNF(Q) using the algebraic Z-semantics identities in Figure 5.1. Note that in general diffNF(Q) may be of size exponential in the size of Q.

The given definition of diffNF(Q) proliferates redundant occurrences of ∅. We will assume that simplifications are made based on identities such as A ∪ ∅ ≡ A or A ⋈ ∅ ≡ ∅ (true for Z-semantics, bag semantics, set semantics, and in fact all semiring-annotated relations semantics). As a result, it is easy to check that for RA+ queries Q we have diffNF(Q) = Q − ∅.
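The recursion and the ∅-simplifications can be illustrated with a minimal Python sketch. The encoding of queries as nested tuples, the function names, the extra case for a literal ∅, and the simplification of πX(∅) and σP(∅) to ∅ are assumptions made for this illustration; they are not part of the definition above.

```python
# Hypothetical encoding: ("rel", name), ("empty",), ("proj", X, q), ("sel", P, q),
# ("join", q1, q2), ("union", q1, q2), ("diff", q1, q2).
EMPTY = ("empty",)

def union(a, b):
    # Simplify A ∪ ∅ ≡ A.
    if a == EMPTY:
        return b
    if b == EMPTY:
        return a
    return ("union", a, b)

def join(a, b):
    # Simplify A ⋈ ∅ ≡ ∅.
    return EMPTY if EMPTY in (a, b) else ("join", a, b)

def unary(tag, arg, q):
    # Assumed simplification: projecting or selecting on ∅ gives ∅.
    return EMPTY if q == EMPTY else (tag, arg, q)

def diff_nf(q):
    """Return (A, B), both difference-free, standing for the normal form A − B."""
    tag = q[0]
    if tag == "rel":
        return q, EMPTY
    if tag == "empty":  # assumed extra base case
        return EMPTY, EMPTY
    if tag in ("proj", "sel"):
        a, b = diff_nf(q[2])
        return unary(tag, q[1], a), unary(tag, q[1], b)
    (a1, b1), (a2, b2) = diff_nf(q[1]), diff_nf(q[2])
    if tag == "join":
        return (union(join(a1, a2), join(b1, b2)),
                union(join(a1, b2), join(b1, a2)))
    if tag == "union":
        return union(a1, a2), union(b1, b2)
    if tag == "diff":
        return union(a1, b2), union(a2, b1)
    raise ValueError(f"unknown operator: {tag}")

R, S, T = ("rel", "R"), ("rel", "S"), ("rel", "T")
nested = diff_nf(("diff", R, ("diff", S, T)))   # R − (S − T)
positive = diff_nf(("join", R, S))              # a query without difference
```

Here `nested` stands for (R ∪ T) − S, and `positive` has ∅ as its second component, matching the observation that diffNF(Q) = Q − ∅ for RA+ queries.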

Note that the identities in Figure 5.1 that are flagged by (!) in fact fail for both set and bag semantics, and indeed Theorem 5.2 fails for set or bag semantics.

Corollary 5.3. Z-equivalence of RA queries is decidable.

Proof. Let Q, Q′ ∈ RA. By Theorem 5.2, Q ≡_Z Q′ iff diffNF(Q) ≡_Z diffNF(Q′). Let A, B, C, D ∈ RA+ be such that diffNF(Q) = A − B and diffNF(Q′) = C − D. But A − B ≡_Z C − D iff A ∪ D ≡_Z B ∪ C iff A ∪ D ≡_N B ∪ C. But A ∪ D and B ∪ C are in RA+ and hence equivalent to UCQs, so the result follows from the decidability of UCQ bag equivalence [58].
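The arithmetic behind this proof, and behind the flagged identities of Figure 5.1, can be checked on a single concrete instance. The following Python sketch is only an illustration on one instance, not a decision procedure; representing a relation as a dictionary from tuples to integer multiplicities, and all function names, are assumptions of the sketch.

```python
def combine(a, b, sign=1):
    """Pointwise a + sign * b on tuple -> integer maps, dropping zero entries."""
    out = dict(a)
    for t, n in b.items():
        out[t] = out.get(t, 0) + sign * n
    return {t: n for t, n in out.items() if n != 0}

def union(a, b):
    # Union adds multiplicities (same for bags and Z-relations).
    return combine(a, b, +1)

def z_diff(a, b):
    # Z-semantics: plain integer subtraction, negative multiplicities allowed.
    return combine(a, b, -1)

def bag_diff(a, b):
    # Bag semantics: proper subtraction, truncated at zero.
    return {t: n - b.get(t, 0) for t, n in a.items() if n > b.get(t, 0)}

A, B, C = {("x",): 1}, {}, {("x",): 1}

# Flagged identity A − (B − C) ≡ (A ∪ C) − B on this instance.
z_lhs, z_rhs = z_diff(A, z_diff(B, C)), z_diff(union(A, C), B)
bag_lhs, bag_rhs = bag_diff(A, bag_diff(B, C)), bag_diff(union(A, C), B)

# Step from the proof: A − B and C − D agree exactly when A ∪ D and B ∪ C agree.
C2, D2 = {("x",): 3}, {("x",): 2}
same_difference = z_diff(A, B) == z_diff(C2, D2)
same_union = union(A, D2) == union(B, C2)
```

On this instance the two sides of the flagged identity coincide under Z-semantics but differ under proper subtraction, and `same_difference` and `same_union` agree, as the chain of equivalences in the proof predicts.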

Corollary 5.3 can be used to show a PSPACE upper bound on the complexity of checking Z-equivalence of RA queries; however, the exact complexity remains open. A lower bound is given by Proposition 5.4, which shows that the problem is at least GI-hard.³

DUCQs

As we shall see below, our reformulation algorithm uses RA queries in difference normal form. In fact, as with CQs and UCQs, it is notationally convenient to use a Datalog-style syntax for the queries on which reformulation operates directly. We define differences of unions of conjunctive queries (DUCQs) for this purpose, for example:

Q(x,z) :- R(x,y),S(y,z)

Q(x,y) :- R(x,y),R(y,x),R(x,x)

−Q(x,y) :- R(x,y),T(y,y)

The − marks a “negated” CQ, so this example corresponds (roughly) to the algebraic form

Q = (π(R ⋈ S) ∪ (R ⋈ R ⋈ R)) − (R ⋈ T)

DUCQs are related to the elementary differences of [123]. Under Z-semantics, any RA query can be written equivalently as a DUCQ; this follows from Theorem 5.2 and the fact that any RA+ query can be rewritten as a UCQ. The semantics of DUCQs on bag relations/Z-relations/set relations can be given formally by translation to RA.
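To make the Z-semantics of a DUCQ concrete, here is a small Python sketch that evaluates the example above directly rather than by translation to RA. The rule encoding (sign, head variables, body atoms), the dictionary representation of Z-relations, and the sample instance are all assumptions made for this illustration.

```python
def eval_cq(head, body, inst):
    """Sum, over all valuations of the body, the product of the atom annotations."""
    out = {}

    def go(i, env, weight):
        if i == len(body):
            t = tuple(env[v] for v in head)
            out[t] = out.get(t, 0) + weight
            return
        rel, args = body[i]
        for tup, n in inst.get(rel, {}).items():
            new = dict(env)
            # Bind each variable, or check it against its existing binding.
            if all(new.setdefault(v, c) == c for v, c in zip(args, tup)):
                go(i + 1, new, weight * n)

    go(0, {}, 1)
    return out

def eval_ducq(rules, inst):
    """Add the results of positive rules and subtract those of negated rules."""
    total = {}
    for sign, head, body in rules:
        for t, n in eval_cq(head, body, inst).items():
            total[t] = total.get(t, 0) + sign * n
    return {t: n for t, n in total.items() if n != 0}

# The example DUCQ from the text.
ducq = [
    (+1, ("x", "z"), [("R", ("x", "y")), ("S", ("y", "z"))]),
    (+1, ("x", "y"), [("R", ("x", "y")), ("R", ("y", "x")), ("R", ("x", "x"))]),
    (-1, ("x", "y"), [("R", ("x", "y")), ("T", ("y", "y"))]),
]

# A hypothetical instance: relation name -> {tuple: multiplicity}.
instance = {
    "R": {(1, 1): 2, (1, 2): 1},
    "S": {(1, 3): 1},
    "T": {(2, 2): 1},
}

answer = eval_ducq(ducq, instance)
```

On this instance the tuple (1, 2) is produced only by the negated rule, so it appears in `answer` with a negative multiplicity, which is exactly what distinguishes Z-relations from bags.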

Although we do not have a precise complexity for the Z-equivalence of RA queries, we can fully characterize the complexity of checking Z-equivalence of DUCQs:

Proposition 5.4. For DUCQs Q, Q′, checking Q ≡_Z Q′ is GI-complete.

Proof. Following similar reasoning as in the proof of Corollary 5.3, we have A ∪ −B ≡_Z C ∪ −D iff A ∪ D ≡_Z B ∪ C iff A ∪ D ≡_N B ∪ C. But by the results of [30], this holds iff A ∪ D and B ∪ C are isomorphic. Thus Z-equivalence of DUCQs is polynomial time interreducible with the GI-complete problem of checking isomorphism of UCQs.
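The reduction in this proof can be sketched in Python as follows. This is a brute-force illustration for very small queries only: it takes the isomorphism characterization of [30] as given, tests CQ isomorphism by trying every renaming of the variables (exponential time), and assumes that queries contain no constants and share one head predicate. The rule encoding and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def canon(head, body):
    """Canonical form of a CQ: the least renaming of its variables to 0..k-1."""
    variables = sorted({v for _, args in body for v in args} | set(head))
    best = None
    for perm in permutations(range(len(variables))):
        ren = dict(zip(variables, perm))
        form = (
            tuple(ren[v] for v in head),
            tuple(sorted((rel, tuple(ren[v] for v in args)) for rel, args in body)),
        )
        if best is None or form < best:
            best = form
    return best

def z_equivalent(q1, q2):
    """Compare A ∪ D with B ∪ C up to isomorphism, for q1 = A ∪ −B and q2 = C ∪ −D."""
    left, right = Counter(), Counter()
    for sign, head, body in q1:
        (left if sign > 0 else right)[canon(head, body)] += 1
    for sign, head, body in q2:
        (right if sign > 0 else left)[canon(head, body)] += 1
    return left == right

# Rules are (sign, head variables, body atoms); −1 marks a negated CQ.
q1 = [
    (+1, ("x",), [("R", ("x", "y"))]),
    (-1, ("x",), [("R", ("x", "y")), ("S", ("y",))]),
]
q2 = [
    (+1, ("a",), [("R", ("a", "b"))]),
    (+1, ("u",), [("T", ("u",))]),
    (-1, ("a",), [("R", ("a", "c")), ("S", ("c",))]),
    (-1, ("w",), [("T", ("w",))]),
]
q3 = [(+1, ("x",), [("R", ("x", "y"))])]
```

In this example `q2` differs from `q1` only by a CQ over T that occurs once positively and once negated, so the two sides match after moving the negated rules across; `q3` lacks the negated rule and is not matched.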

5.3 Reformulation Using Views

Let Σ be a relational schema and V a finite set of views over Σ. Let Σ_V be the schema consisting of the view names (disjoint from Σ). A Σ∪Σ_V instance is V-compatible if it consists of a Σ-instance I and a Σ_V-instance J such that J = ⟦V⟧^I. Orthogonally, these instances may consist of Z-relations, N-relations or B-relations. For K ∈ {B, N, Z} we say that two relational queries Q, Q′ over Σ∪Σ_V are K-equivalent under V, denoted Q ≡^V_K Q′, if Q and Q′ agree on all V-compatible K-instances. Given a Σ-query Q, a reformulation of Q using V is a Σ∪Σ_V-query Q′ such that Q ≡^V_K Q′. We also call Q′ an equivalent rewriting using V.

³ GI is the class of problems polynomial time reducible to graph isomorphism. Graph isomorphism is known to be in NP, but is not known or believed to be either NP-complete or in PTIME.

Query reformulation algorithms typically work by effectively enumerating certain queries (call this set the search space), filtering out the queries that are not equivalent rewritings, and finding a minimum-cost query (according to some cost model). There are three requirements for designing such algorithms: (1) the search space must be finite so the algorithm terminates, (2) the returned rewriting should actually be equivalent so the algorithm is sound, and (3) if equivalent rewritings exist then at least one of them should be found so the algorithm is complete. A more subtle requirement is that the returned rewriting have cost no greater than that of any other equivalent rewriting, even when there are infinitely many of them.
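The enumerate, filter, and minimize pattern just described can be written as a short generic skeleton. The Python sketch below is not any particular published algorithm: the three callbacks (`search_space`, `equivalent_under_views`, `cost`) are hypothetical parameters, and the toy stand-ins that follow only exercise the control flow.

```python
def reformulate(query, search_space, equivalent_under_views, cost):
    """Return a minimum-cost equivalent rewriting from a finite search space.

    search_space(query) must return a finite collection (termination),
    equivalent_under_views must be a decision procedure (soundness), and the
    space must contain an equivalent rewriting whenever one exists
    (completeness). Returns None if no candidate passes the filter.
    """
    candidates = search_space(query)
    rewritings = [c for c in candidates if equivalent_under_views(query, c)]
    return min(rewritings, key=cost, default=None)

# Toy stand-ins: queries are plain strings and equivalence is a lookup table.
def toy_space(query):
    return ["R join S", "V1", "V1 join V2", "V2"]

def toy_equivalent(query, candidate):
    return candidate in {"R join S", "V1"}

best = reformulate("R join S", toy_space, toy_equivalent, cost=len)
none_found = reformulate("R join S", toy_space, lambda q, c: False, cost=len)
```

The skeleton makes the dependence on a decidable equivalence test explicit: without a total `equivalent_under_views`, the filtering step cannot be carried out.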

In the case of set semantics one can construct infinitely many rewritings of CQs even without views, simply by adding to their bodies atoms that can be homomorphically mapped to existing ones [2]. This can be dealt with by considering only queries that have no non-trivial endomorphism; call them locally minimal. It was shown in [101] that the space of locally minimal CQ reformulations of a CQ query Q using CQ views V is finitely bounded. Further work focused on pruning and efficiently exploring this search space (looking for not just equivalent rewritings but also maximally contained rewritings) [102, 44, 119, 111]. Even if certain classes of integrity constraints (capturing views and more) are considered, a corresponding notion of minimality can be used to define a finite search space for rewritings, using the chase and backchase technique [38].

For CQs and bag semantics a finite search space exploration is described in [23]. Interestingly, the UCQ and/or bag-semantics analogs of the complexity results in [101] do not seem to appear anywhere (to the best of our knowledge). For bag semantics, we attempt to remedy this omission in Subsection 5.3 below.

In the reformulation procedures we have mentioned, the search spaces are constructed and enumerated combinatorially, and thus include non-equivalent rewritings. One takes advantage of the decidability of ≡^V_K to filter them out. But such decidability is not just sufficient, it is also necessary. Indeed, all such reasonable approaches describe a total recursive function that associates to each Σ-query Q a finite set searchSpace(Q) of Σ∪Σ_V-queries. Moreover, procedures like local minimization can be abstracted by another total recursive function that associates to each Σ-query Q and each Σ∪Σ_V-query Q′ another Σ∪Σ_V-query µ(Q,Q′) such that Q′ ≡_K µ(Q,Q′). And finally, one shows that Q′ ≡^V_K Q if and only if µ(Q,Q′) ∈ searchSpace(Q). Since the latter is decidable, so is ≡^V_K. In particular, K-equivalence of Σ-queries must also be decidable, which is why under bag or set semantics we cannot hope to extend the ideas that have worked for positive queries to rewritings of relational queries with difference.

We have seen, however, that shifting to Z-semantics leads to the decidability of Z-equivalence of RA queries (hence of DUCQs). According to the discussion above, we next need to explore Z-equivalence under a set of views, and we do so in Section 5.3 using a uniquely terminating term rewrite system.

In the same subsection we discuss adding the reverse of the term rewrite rules as the basis for a reformulation algorithm. In the last two subsections we consider limiting our search to diff-irredundant rewritings, and we develop a cost model and a procedure for limiting the search space further.
