Rewriting in a Query Optimizer - Collaborative Data Sharing with Mappings and Provenance

In principle, one could search the space of query rewritings by building a layer above the query optimizer that enumerates possible rewritings and then optimizes each one separately. However, a more efficient approach is to extend an existing optimizer to incorporate the rewriting system into its enumeration. Unlike with rewriting of conjunctive queries, most existing cost-based optimizers cannot easily be extended to DUCQ rewritings. The System-R optimizer [125] only does cost-based optimization of joins, instead relying on heuristics for applying unions and differences. Starburst [67] can rewrite queries with unions and differences, but only at its heuristics-based query rewrite stage; it has only limited facilities for cost-based comparison of alternative rewritings.

Fortunately, the Volcano [55] optimizer generator (as well as its successors) can be modified to incorporate our rewrite scheme within an optimizer. Volcano models a query initially as a plan of logical operators representing the algebraic operations, including unions and differences; it uses transformation rules to describe algebraic equivalences that can be used to find alternate plans. Implementation rules describe how to rewrite a logical operator (or set of operators) into a series of physical algorithms, which have associated costs.

Our rewrite rules of Section 5.3 can be expressed as transformation rules for Volcano, and we would not need to change any implementation rules. However, we would also need to modify Volcano's pruning algorithm: we must place a finite bound on the size of the rewritings explored, as with our prune function in the previous subsection.
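To see why a finite bound is needed, here is a toy sketch of bounded exploration. It is not Volcano's actual search, nor the prune function of the previous subsection. Plans are nested tuples, the two rules (`commute_union`, `pad_with_s`) and the `max_size` parameter are hypothetical, and rules are applied only at the plan root. The padding rule rewrites Q to (Q ∪ S) − S, which preserves Z-equivalence but is always applicable, so the search terminates only because oversized plans are discarded.

```python
from collections import deque

def size(plan):
    """Number of operator and relation nodes in a plan."""
    if isinstance(plan, str):
        return 1
    return 1 + sum(size(child) for child in plan[1:])

def commute_union(plan):
    """Hypothetical transformation rule: A ∪ B -> B ∪ A (root only)."""
    if not isinstance(plan, str) and plan[0] == "union":
        yield ("union", plan[2], plan[1])

def pad_with_s(plan):
    """Hypothetical rule: Q -> (Q ∪ S) − S. Always applicable, always grows."""
    yield ("diff", ("union", plan, "S"), "S")

def explore(start, rules, max_size):
    """Breadth-first closure of `start` under `rules`,
    keeping only plans with at most `max_size` nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        plan = queue.popleft()
        for rule in rules:
            for candidate in rule(plan):
                if size(candidate) <= max_size and candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
    return seen

start = ("union", "R", "T")
plans = explore(start, [commute_union, pad_with_s], max_size=7)
for p in sorted(plans, key=size):
    print(size(p), p)
```

Without the `size(candidate) <= max_size` test, `pad_with_s` would keep producing ever larger plans and the loop would never finish.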

5.5 Applications to Bag and Set Semantics

Having obtained a sound and complete algorithm for reformulation of RA queries using RA views under $\mathbb{Z}$-semantics, we wish to see for which RA queries/views this same algorithm actually provides reformulations under bag semantics. Thus we are naturally led to study the following class of queries:

Definition 5.14. We denote by $\overline{\mathrm{RA}}$ the class of all queries $Q \in \mathrm{RA}$ such that for all bag instances $I$, $\mathrm{Eval}_{\mathbb{N}}(Q, I) = \mathrm{Eval}_{\mathbb{Z}}(Q, I)$.
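As an illustration of the condition in Definition 5.14, here is a minimal sketch of an evaluator for expressions built from union and difference only. The tuple encoding of queries and the function name `evaluate` are assumptions made for this example. Relations map tuples to multiplicities; under bag semantics (N) difference is proper subtraction, while under Z-semantics multiplicities may go negative. Agreement on one instance does not establish membership in the class, which requires agreement on all bag instances.

```python
def evaluate(expr, instance, semantics):
    """Evaluate a union/difference expression.

    expr is ("rel", name), ("union", e1, e2) or ("diff", e1, e2);
    semantics is "N" (bags) or "Z" (integer multiplicities).
    """
    kind = expr[0]
    if kind == "rel":
        return dict(instance[expr[1]])
    left = evaluate(expr[1], instance, semantics)
    right = evaluate(expr[2], instance, semantics)
    sign = 1 if kind == "union" else -1
    result = {}
    for t in set(left) | set(right):
        m = left.get(t, 0) + sign * right.get(t, 0)
        if semantics == "N":
            m = max(m, 0)  # bag difference never goes below zero
        if m != 0:
            result[t] = m
    return result

R = ("rel", "R")
I = {"R": {("a",): 2, ("b",): 1}}

q1 = ("diff", ("union", R, R), R)   # (R ∪ R) − R
q2 = ("diff", R, ("union", R, R))   # R − (R ∪ R)

for name, q in [("q1", q1), ("q2", q2)]:
    print(name, evaluate(q, I, "N"), evaluate(q, I, "Z"))
```

On this instance the two semantics agree for `q1` but not for `q2`, whose Z-answer has negative multiplicities, so `q2` cannot belong to the class.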

Right away we see that:

Lemma 5.15. For any $Q_1, Q_2 \in \overline{\mathrm{RA}}$, $Q_1 \equiv_{\mathbb{Z}} Q_2$ implies $Q_1 \equiv_{\mathbb{N}} Q_2$.

Then

Lemma 5.16. If $A, B \in \mathrm{RA}^{+}$ then $A - B \in \overline{\mathrm{RA}}$ if and only if $B \sqsubseteq_{\mathbb{N}} A$.

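Proof. "$\Rightarrow$": suppose $A - B \in \overline{\mathrm{RA}}$ and consider an arbitrary bag instance $I$ and tuple $t$. Since $\mathrm{Eval}_{\mathbb{Z}}(A - B, I)(t) = \mathrm{Eval}_{\mathbb{N}}(A - B, I)(t) \geq 0$ and, by definition, $\mathrm{Eval}_{\mathbb{Z}}(A - B, I)(t) = \mathrm{Eval}_{\mathbb{Z}}(A, I)(t) - \mathrm{Eval}_{\mathbb{Z}}(B, I)(t)$, it follows that $\mathrm{Eval}_{\mathbb{Z}}(A, I)(t) \geq \mathrm{Eval}_{\mathbb{Z}}(B, I)(t)$. Since $A$ and $B$ are positive queries, by Lemma 5.1, $\mathrm{Eval}_{\mathbb{Z}}(A, I) = \mathrm{Eval}_{\mathbb{N}}(A, I)$ and $\mathrm{Eval}_{\mathbb{Z}}(B, I) = \mathrm{Eval}_{\mathbb{N}}(B, I)$, and therefore $\mathrm{Eval}_{\mathbb{N}}(A, I)(t) \geq \mathrm{Eval}_{\mathbb{N}}(B, I)(t)$. Since $I$ and $t$ were chosen arbitrarily, it follows that $B \sqsubseteq_{\mathbb{N}} A$.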

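"$\Leftarrow$": suppose $B \sqsubseteq_{\mathbb{N}} A$ and consider an arbitrary bag instance $I$. By Lemma 5.1, $\mathrm{Eval}_{\mathbb{N}}(A, I) = \mathrm{Eval}_{\mathbb{Z}}(A, I)$ and $\mathrm{Eval}_{\mathbb{N}}(B, I) = \mathrm{Eval}_{\mathbb{Z}}(B, I)$. But since $\mathrm{Eval}_{\mathbb{N}}(B, I) \leq \mathrm{Eval}_{\mathbb{N}}(A, I)$, it follows that $\mathrm{Eval}_{\mathbb{Z}}(A - B, I) = \mathrm{Eval}_{\mathbb{N}}(A - B, I) \geq 0$. Since $I$ was chosen arbitrarily, it follows that $A - B \in \overline{\mathrm{RA}}$.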

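Proof. Suppose $Q \in \overline{\mathrm{RA}}$, let $\mathrm{diffNF}(Q) = A - B$ (with $A, B \in \mathrm{RA}^{+}$), and choose an arbitrary bag instance $I$ and tuple $t$. Then $\mathrm{Eval}_{\mathbb{N}}(Q, I)(t) = \mathrm{Eval}_{\mathbb{Z}}(Q, I)(t) = \mathrm{Eval}_{\mathbb{Z}}(A - B, I)(t) = \mathrm{Eval}_{\mathbb{Z}}(A, I)(t) - \mathrm{Eval}_{\mathbb{Z}}(B, I)(t) \geq 0$. It follows that $\mathrm{Eval}_{\mathbb{Z}}(A, I)(t) \geq \mathrm{Eval}_{\mathbb{Z}}(B, I)(t)$ and therefore (using Lemma 5.1) that $\mathrm{Eval}_{\mathbb{Z}}(A - B, I)(t) = \mathrm{Eval}_{\mathbb{N}}(A - B, I)(t)$. Since $I$ and $t$ were chosen arbitrarily, it follows that $Q \equiv_{\mathbb{N}} A - B$, as required, and also that $B \sqsubseteq_{\mathbb{N}} A$. By Lemma 5.16 this in turn implies that $A - B \in \overline{\mathrm{RA}}$.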

Corollary 5.18. Bag-equivalence of $\overline{\mathrm{RA}}$ queries is decidable.

Next we show that if we start with a DUCQ in $\overline{\mathrm{RA}}$ and a set of DUCQ views also in $\overline{\mathrm{RA}}$, then the exploration of the space of reformulations prescribed by the algorithm of Section 5.4 examines only queries in $\overline{\mathrm{RA}}$ that are $\mathbb{N}$-equivalent, under the views, to the original DUCQ. Suppose that $\mathcal{V}$ is a set of views in $\overline{\mathrm{RA}}$ expressed as DUCQs.

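Proposition 5.19. If $Q$ is a DUCQ in $\overline{\mathrm{RA}}$ and $Q'$ is a DUCQ obtained from $Q$ either by a step of augmentation or by a step of folding with a view from $\mathcal{V}$, then $Q' \in \overline{\mathrm{RA}}$ and $Q'$ is $\mathbb{N}$-equivalent to $Q$ under $\mathcal{V}$.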

Therefore, the reformulation algorithm can be used with $\overline{\mathrm{RA}}$ queries and views. Unfortunately, but not unexpectedly, it is undecidable whether an RA query or even a DUCQ is actually in $\overline{\mathrm{RA}}$ (Lemma 5.16 provides a reduction from bag-containment of UCQs). But there are interesting classes of queries for which membership in $\overline{\mathrm{RA}}$ is guaranteed. The simplest but still very useful case is based on the observation that $\mathrm{RA}^{+} \subset \overline{\mathrm{RA}}$. It follows that our algorithm is also complete for finding DUCQ reformulations of UCQs using UCQ views, as in the example in Section 5.1.

Next we identify a subclass of $\overline{\mathrm{RA}}$ for which membership, while still undecidable in general, may be easier to check in certain cases.

Definition 5.20. We denote by $\widehat{\mathrm{RA}}$ the class of all RA queries $Q$ such that for every occurrence $A - B$ of the difference operator in $Q$, we have $B \sqsubseteq_{\mathbb{N}} A$.

Theorem 5.21 (Normalization for $\widehat{\mathrm{RA}}$). For any $Q \in \widehat{\mathrm{RA}}$ one can find $A, B \in \mathrm{RA}^{+}$ such that $Q \equiv_{\mathbb{N}} A - B$.

Proof. Straightforward induction on $Q$, using the same algebraic identities as in Theorem 5.2. Although these identities fail under bag semantics for RA queries, they hold for $\widehat{\mathrm{RA}}$ queries.

Corollary 5.22. Bag-equivalence of $\widehat{\mathrm{RA}}$ queries is decidable.

Theorem 5.23. $\widehat{\mathrm{RA}} \subset \overline{\mathrm{RA}}$.

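Proof. Checking $\widehat{\mathrm{RA}} \subseteq \overline{\mathrm{RA}}$ is a straightforward application of Theorem 5.21. To see that the inclusion is proper, consider the query $Q \stackrel{\mathrm{def}}{=} (R - (R \cup R)) - (R - (R \cup R))$. $Q$ is both $\mathbb{Z}$-equivalent and $\mathbb{N}$-equivalent to the unsatisfiable query $\emptyset$; it follows that $Q$ is in $\overline{\mathrm{RA}}$. However, since $R \cup R \not\sqsubseteq_{\mathbb{N}} R$, it is clear that $Q \notin \widehat{\mathrm{RA}}$.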
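The witness query in this proof can be checked numerically. Because it uses only union and difference over the single relation R, the multiplicity of a tuple in the answer depends only on that tuple's multiplicity m in R, so a sketch can work on one integer at a time. The function names `monus`, `q_bag` and `q_z` are chosen for this example only.

```python
def monus(x, y):
    """Proper subtraction, as used for bag difference."""
    return max(x - y, 0)

def q_bag(m):
    """Multiplicity of a tuple in (R − (R ∪ R)) − (R − (R ∪ R)) under bag semantics."""
    inner = monus(m, m + m)      # R − (R ∪ R)
    return monus(inner, inner)

def q_z(m):
    """The same query under Z-semantics, where difference is ordinary subtraction."""
    inner = m - (m + m)          # equals −m
    return inner - inner

for m in range(0, 6):
    # Last column: does R ∪ R stay within R at this multiplicity?
    print(m, q_bag(m), q_z(m), (m + m) <= m)
```

Both evaluations give 0 for every m, matching the claim that the query is equivalent to the empty query under both semantics, while the last column is false for every m > 0, which is the containment failure that keeps the query out of the syntactically defined class.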

Although $\overline{\mathrm{RA}}$ contains queries which are not in $\widehat{\mathrm{RA}}$ (because of their syntactic structure), it turns out that, semantically, $\widehat{\mathrm{RA}}$ captures $\overline{\mathrm{RA}}$, as the following theorem makes precise:

Theorem 5.24. For all $Q \in \overline{\mathrm{RA}}$ there exists $Q' \in \widehat{\mathrm{RA}}$ such that $Q \equiv_{\mathbb{N}} Q'$.

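Proof. For any $Q \in \overline{\mathrm{RA}}$, we show that there must exist $A, B \in \mathrm{RA}^{+}$ such that $B \sqsubseteq_{\mathbb{N}} A$ and $Q \equiv_{\mathbb{N}} A - B$. Fix a $Q \in \overline{\mathrm{RA}}$. By Theorem 5.2, there exist $A, B \in \mathrm{RA}^{+}$ such that $Q \equiv_{\mathbb{Z}} A - B$. We argue by contradiction that $B \sqsubseteq_{\mathbb{N}} A$. Suppose $B \not\sqsubseteq_{\mathbb{N}} A$. Then there exists a bag instance $I$ and tuple $t$ such that $\mathrm{Eval}_{\mathbb{N}}(A, I)(t) < \mathrm{Eval}_{\mathbb{N}}(B, I)(t)$. Then $\mathrm{Eval}_{\mathbb{Z}}(A - B, I)(t) < 0$, i.e., $\mathrm{Eval}_{\mathbb{Z}}(A - B, I) = \mathrm{Eval}_{\mathbb{Z}}(Q, I)$ is not even a bag instance. However, this is a contradiction, because by assumption (since $Q \in \overline{\mathrm{RA}}$), we must have $\mathrm{Eval}_{\mathbb{Z}}(Q, I) = \mathrm{Eval}_{\mathbb{N}}(Q, I)$. Finally, we argue that $Q \equiv_{\mathbb{N}} A - B$. To see this, fix an arbitrary bag instance $I$. We want to show that $\mathrm{Eval}_{\mathbb{N}}(Q, I) = \mathrm{Eval}_{\mathbb{N}}(A - B, I)$. Since $Q \in \overline{\mathrm{RA}}$, we have $\mathrm{Eval}_{\mathbb{N}}(Q, I) = \mathrm{Eval}_{\mathbb{Z}}(Q, I)$. Since $Q \equiv_{\mathbb{Z}} A - B$, we have $\mathrm{Eval}_{\mathbb{Z}}(Q, I) = \mathrm{Eval}_{\mathbb{Z}}(A - B, I)$. Finally, since $A - B \in \widehat{\mathrm{RA}}$, by Theorem 5.23 we have $\mathrm{Eval}_{\mathbb{Z}}(A - B, I) = \mathrm{Eval}_{\mathbb{N}}(A - B, I)$. This completes the proof.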

Membership in $\widehat{\mathrm{RA}}$ is also undecidable. However, in some practical situations, such as incremental view maintenance of $\mathrm{RA}^{+}$ views using delta rules [66], the difference operator is used in a very controlled way where the containment requirement is satisfied (e.g., it is just necessary for the system to enforce that only tuples actually present in source tables are ever deleted from the sources).

We are also interested in reformulating and answering queries under $\mathbb{Z}$-semantics, but then "eliminating duplicates" to obtain the answer under set semantics. Even for $\widehat{\mathrm{RA}}$ queries, this is not in general straightforward: for example, consider the query $Q = (R \cup R) - R$. Under set semantics, this is equivalent to the unsatisfiable query, while under bag semantics or $\mathbb{Z}$-semantics, it is equivalent to the identity query $R$. We can, however, restrict the use of negation in $\widehat{\mathrm{RA}}$ further, to obtain another fragment of RA suitable for this purpose.
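The mismatch in this example is easy to reproduce. The sketch below uses a hypothetical two-tuple relation `R`, evaluates (R ∪ R) − R with Python sets for set semantics, and with per-tuple multiplicities for bag and Z-semantics, then collapses every nonzero multiplicity to "present".

```python
R = {("a",), ("b",)}

# Set semantics: union is idempotent, so the difference removes everything.
set_answer = (R | R) - R

# View R as a bag/Z-instance in which every tuple has multiplicity 1.
R_mult = {t: 1 for t in R}

# Z-semantics: ordinary integer arithmetic on multiplicities.
z_answer = {t: (m + m) - m for t, m in R_mult.items()}

# Bag semantics: the same arithmetic, truncated at zero.
bag_answer = {t: max((m + m) - m, 0) for t, m in R_mult.items()}

# Collapse nonzero multiplicities to plain membership.
collapsed = {t for t, m in z_answer.items() if m != 0}

print("set semantics:   ", set_answer)
print("Z-semantics:     ", z_answer)
print("bag semantics:   ", bag_answer)
print("Z, then collapse:", collapsed)
```

Collapsing the Z-answer returns all of `R`, whereas direct set evaluation returns nothing, so evaluating under Z-semantics and then eliminating duplicates is not sound for this query.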

Definition 5.25. An RA query $Q$ over a schema $\Sigma$ is said to be a base-difference query if $A - B$ can appear in $Q$ only when $A$ and $B$ are both base relations (names in $\Sigma$). Further, a base-difference query $Q$ is said to be positive-difference w.r.t. a set instance $I$ if for each $A - B$ appearing in $Q$ we have $A^{I} \supseteq B^{I}$ (where $A^{I}$ is the relation in $I$ that corresponds to $A \in \Sigma$).

Although the use of negation in base-difference queries is highly restricted when they are considered on instances w.r.t. which they are positive-difference, it still captures the form needed for incremental view maintenance, where negation just relates old and new versions of source relations via the tables of deleted and inserted tuples.

For conversion between bag semantics / $\mathbb{Z}$-semantics and set semantics we also define the duplicate elimination operator $\delta : \mathbb{Z} \to \mathbb{B}$, which maps $0$ to false and everything else (positive or negative) to true. Conversely, we can view any set instance as a bag/$\mathbb{Z}$-instance by replacing false with $0$ and true with $1$. With this we can state the salient property of base-difference queries:

Proposition 5.26. Let $Q \in \mathrm{RA}$ be a base-difference query and let $I$ be a set instance w.r.t. which $Q$ is positive-difference. Then we can compute the answer to $Q$ on $I$ under set semantics by viewing $I$ as a $\mathbb{Z}$-instance, computing $\mathrm{Eval}_{\mathbb{Z}}(Q, I)$, and finally applying $\delta$.
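The following sketch illustrates this property on a view-maintenance style query, (old − deleted) ∪ inserted, whose only difference is between base relations. The relation contents and helper names (`as_z`, `delta`, `z_combine`, `via_z`, `via_sets`) are assumptions made for the example. It compares direct set evaluation with evaluation under Z-semantics followed by δ, once on an instance where the deleted tuples are contained in the old relation and once where they are not.

```python
def as_z(rel):
    """View a set relation as a Z-relation: every present tuple gets multiplicity 1."""
    return {t: 1 for t in rel}

def delta(z_rel):
    """Duplicate elimination: any nonzero multiplicity (positive or negative) means present."""
    return {t for t, m in z_rel.items() if m != 0}

def z_combine(a, b, sign):
    """Union (sign=+1) or difference (sign=-1) of Z-relations."""
    out = {}
    for t in set(a) | set(b):
        m = a.get(t, 0) + sign * b.get(t, 0)
        if m != 0:
            out[t] = m
    return out

def via_z(old, deleted, inserted):
    """Evaluate (old − deleted) ∪ inserted under Z-semantics, then apply delta."""
    kept = z_combine(as_z(old), as_z(deleted), -1)
    return delta(z_combine(kept, as_z(inserted), +1))

def via_sets(old, deleted, inserted):
    """Evaluate the same query directly under set semantics."""
    return (old - deleted) | inserted

old = {(1,), (2,), (3,)}
inserted = {(3,), (4,)}
contained_delete = {(2,)}   # subset of old: positive-difference holds
stray_delete = {(5,)}       # not in old: positive-difference fails

print(via_z(old, contained_delete, inserted) == via_sets(old, contained_delete, inserted))
print(via_z(old, stray_delete, inserted) == via_sets(old, stray_delete, inserted))
```

In the second case the stray deletion leaves a tuple with multiplicity −1, which δ maps to true, so the two answers differ. This is why the proposition requires the instance to be positive-difference.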

Consequently, the optimization techniques in this chapter (which replace a query with a $\mathbb{Z}$-equivalent one) will also apply to set semantics, provided we restrict ourselves to base-difference queries applied to instances w.r.t. which they are positive-difference.
