Deciding Schema-Hit Containment - Incremental Query Evaluation with the RCADG Cache

10.5 Incremental Query Evaluation with the RCADG Cache

10.5.3 Deciding Schema-Hit Containment

This subsection presents an algorithm for computing the RCADG Cache overlap (see Definition 10.2 on page 146) for a new query Q^n to be evaluated incrementally, given the cache look-up result L^{Q^n}. At the heart of the algorithm is the decision procedure for schema-hit containment. For each cached schema hit χ in L^{Q^n} that was retrieved for a schema hit χ^{Q^n} of Q^n, we check whether χ ⊃_s χ^{Q^n} as defined on page 144. If the test succeeds, we add a cache hit stating that χ ⊃_s χ^{Q^n} to the set H^{Q^n} of cache hits for Q^n. Before explaining the containment test and the creation of cache hits, let us take a brief look at the set of cache hits that are eventually produced for the query Q^n in Figure 10.1 on page 143, assuming the cache C_{{Q′,Q}} that contains Q′ and Q, as before.

Figure 10.6 depicts H^{Q^n}_{{Q′,Q}}, i.e., the set of cache hits obtained for Q^n in the example above. Two cache hits have been created from the look-up result L^{Q^n}_{{Q′,Q}} in Figure 10.5 on the preceding page. Each cache hit specifies in the three leftmost columns how to obtain the matches to a specific schema hit for Q^n (in the example, χ^{Q^n}_1) from the matches to a particular schema hit in the cache (χ^Q or χ^{Q′}) using a fixed snapshot (steps s^Q_2 and s^{Q′}_2, respectively). For instance, the cache hit κ in the first row in Figure 10.6 tells us that ans(χ^{Q^n}_1) is a subset of ans(⟦χ^Q⟧_{s^Q_2}). Furthermore, from the pairs of corresponding edges in the queries Q^n and Q (middle), we see that the matches to the query node q^n_1 are among the matches to q_1 in the mentioned subset, and likewise for q^n_2, q_2 as well as q^n_3, q_4. Finally, the remainder query in the rightmost column indicates which subset of the cached results is relevant to Q^n. In the case of κ, a single keyword constraint narrows ans(⟦χ^Q⟧_{s^Q_2}) down to those tuples where the elements matching q^n_2 (i.e., q_2) contain the keyword “Lee”. Alternatively, the second cache hit κ′ shows how to compute the same set ans(χ^{Q^n}_1) from ans(⟦χ^{Q′}⟧_{s^{Q′}_2}). Note that in this case, the remainder query has two keyword restrictions instead of one as with κ, because q′_3 in Q′ does not enforce the constraint Contains“female” that is required by q^n_3 (see Figure 10.1 c. on page 143), unlike the node q_4 in Q that is used by κ.

Columns, left to right: cache hit; final step; schema hits; corresponding edges in new and cached queries (new / cached); constraints in remainder query.

κ = ⟨ s^Q_2, { ⟦χ^Q⟧_{s^Q_2} ⊃_s χ^{Q^n}_1 }, { Parent^*_*(q^n_2, q^n_1) / Parent^*_*(q_2, q_1), Parent^*_*(q^n_3, q^n_1) / Parent^*_*(q_4, q_1) }, { Contains“Lee”(q^n_2) } ⟩

κ′ = ⟨ s^{Q′}_2, { ⟦χ^{Q′}⟧_{s^{Q′}_2} ⊃_s χ^{Q^n}_1 }, { Parent^*_*(q^n_2, q^n_1) / Parent^*_*(q′_2, q′_1), Parent^*_*(q^n_3, q^n_1) / Parent^*_*(q′_3, q′_1) }, { Contains“Lee”(q^n_2), Contains“female”(q^n_3) } ⟩

Figure 10.6: The set H^{Q^n}_{{Q′,Q}} of cache hits for Q^n, constructed from L^{Q^n}_{{Q′,Q}} in Figure 10.5 on page 149. The two cache hits κ and κ′ both obtain ans(χ^{Q^n}_1) from the cache, whereas ans(χ^{Q^n}_2) must be computed from scratch. The cache hit in the first row, κ, reuses the intermediate result that was cached after the second step in the evaluation of the query Q, with one keyword constraint as remainder query. The cache hit in the second row, κ′, needs two keyword constraints against the final answer to the query Q′ in the cache.
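To make the role of the remainder query concrete, the following Python sketch filters a cached result table by keyword constraints. It is only an illustration under stated assumptions: the dictionary-based tuple layout, the sample element texts, and the helper names `contains` and `apply_remainder` are hypothetical and do not come from the RCADG Cache itself, which stores element identifiers rather than text.

```python
from typing import Callable, Dict, List

# Hypothetical representation: one cached match maps each cached query node
# (e.g. "q2") to the text of the element it was matched to.
Match = Dict[str, str]
Constraint = Callable[[Match], bool]


def contains(node: str, keyword: str) -> Constraint:
    """Keyword constraint Contains_keyword evaluated on a cached query node."""
    return lambda match: keyword in match[node]


def apply_remainder(cached: List[Match], remainder: List[Constraint]) -> List[Match]:
    """Keep only the cached matches that satisfy every remainder constraint."""
    return [m for m in cached if all(c(m) for c in remainder)]


# Invented sample data standing in for the snapshot of Q after step 2.
cached_q = [
    {"q1": "person", "q2": "Ann Lee", "q4": "female"},
    {"q1": "person", "q2": "Bob Ray", "q4": "female"},
]

# Cache hit κ: q^n_2 corresponds to q2, so Contains"Lee"(q^n_2) is checked on q2.
kappa_result = apply_remainder(cached_q, [contains("q2", "Lee")])
```

A cache hit such as κ′ would pass two constraints instead of one, because the cached query it draws on enforces fewer of the new query's keyword conditions.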

Creating cache hits. Algorithm 10.1 on the following page lists pseudocode for processing a schema hit χ^{Q^n} of Q^n, given the cache look-up result L^{Q^n} and an initially empty set H^{Q^n} of cache hits to be created for χ^{Q^n}. The procedure createCacheHits successively visits all sets of match edges for χ^{Q^n} and distinct evaluation steps in L^{Q^n}. Evaluation steps belonging to the same query plan are processed one after the other, in the order defined by the plan. Remember that the match edges for a specific evaluation step indicate which pairs of query nodes and edges in Q^n and a cached query might correspond. The outer for loop in Algorithm 10.1 (lines 8–41) finds all consistent combinations of match edges in each step s_i (lines 27–30), and tests for which of these combinations there is a cached schema hit χ such that ⟦χ⟧_{s_i} ⊃_s χ^{Q^n}. In line with Definition 10.1 on page 144, the containment test involves the comparison of keyword constraints attached to corresponding query nodes (lines 15–25) as well as of the binary D-constraints that have been matched up to step s_i (lines 32–40). These two issues are elaborated below.

Each combination of match edges is represented as a cache hit containing the corresponding pairs of new and cached query edges as well as the remaining constraints in Q^n. Cache hits that were successful in step s_i are added to the set H_cur of currently active cache hits. If there is another iteration for step s_{i+1}, these cache hits are extended with additional match edges from that step to find out whether ⟦χ⟧_{s_{i+1}} ⊃_s χ^{Q^n} holds true, too. Successful cache hits for step s_i that fail in step s_{i+1} are removed from H_cur and are collected in H_old instead. They remember s_i as the last reusable snapshot of the results they represent, but do not participate in any further iterations. The other cache hits enter yet another round of containment tests until there are either no more steps in the current plan, or one step is missing in L^{Q^n} (lines 10–12). A missing step indicates that none of the constraints matched in this step is mirrored in Q^n↓χ^{Q^n}. As a consequence, all subsequent snapshots of the cached query result after the missing step cannot be reused for χ^{Q^n}.

In the end, all cache hits that were successful for any step in any plan are added to the result set H^{Q^n} (lines 43–48). H^{Q^n} collects the cache hits for all schema hits of Q^n, which are computed in successive calls to createCacheHits. Cache hits that represent the same combination of corresponding query edges for the same evaluation step are merged. Thus a single cache hit in H^{Q^n} may specify multiple schema-hit containment pairs for different schema hits of Q^n (hence the curly braces in the third column in Figure 10.6). This way each cache hit for a step s_i can be translated into a single remainder query plan operating on the matches to multiple schema hits at once, which are all stored in the result table for s_i (and maybe those of its successors). Query planning for Q^n is explained in Section 10.5.4.
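The merging of cache hits described above (lines 43–48 of Algorithm 10.1) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the dissertation: the `CacheHit` record, its field names, and the string identifiers for steps, edges and schema hits are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

EdgePair = Tuple[str, str]  # (edge in the new query, edge in the cached query)


@dataclass
class CacheHit:
    """Hypothetical cache-hit record; field names are illustrative only."""
    step: str                     # last reusable snapshot
    edges: FrozenSet[EdgePair]    # corresponding query edges
    # schema hit of the new query -> cached schema hit that contains it
    containments: Dict[str, str] = field(default_factory=dict)


def merge_into(result: List[CacheHit], new_hits: List[CacheHit]) -> None:
    """Add new_hits to result, merging hits that share edges and step."""
    for hit in new_hits:
        if not hit.containments:
            continue  # no schema-hit containment was established: nothing to reuse
        twin = next(
            (h for h in result if h.step == hit.step and h.edges == hit.edges),
            None,
        )
        if twin is None:
            result.append(hit)
        else:
            twin.containments.update(hit.containments)


# Two calls for different schema hits of the new query yield one merged hit.
edges = frozenset({("e1_new", "e1_cached")})
hits: List[CacheHit] = []
merge_into(hits, [CacheHit("s2", edges, {"chi1_new": "chi_a"})])
merge_into(hits, [CacheHit("s2", edges, {"chi2_new": "chi_b"})])
```

Keying the merge on the pair (step, edges) mirrors the condition in line 44, so that one remainder query plan can later serve several schema hits stored in the same result table.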


   1  // createCacheHits: creation of cache hits for a new schema hit
   2  // → χ^{Q^n}: a schema hit for a new query Q^n
   3  // → L^{Q^n}: the cache look-up result for Q^n
   4  // ⇄ H^{Q^n}: the set of cache hits to be created
   5  procedure createCacheHits (χ^{Q^n}: schema hit, L^{Q^n}: map, H^{Q^n}: set of cache hits)
   6    group the steps with key χ^{Q^n} in L^{Q^n} by the plan they belong to
   7    H_cur := ∅; H_old := ∅ for each new plan being processed
   8    for all steps s_i in a given plan, in the order of their execution do
   9      // only results obtained in successive evaluation steps can be used
  10      if i > 1 and the step before s_i was skipped then
  11        break loop
  12      end if
  13      // find cached and new query edges whose D-constraints can be reconciled
  14      M := ∅
  15      for all match edges c^m associated with s_i in L^{Q^n} do
  16        c^n := the query edge from Q^n in c^m
  17        c := the query edge from the cache in c^m
  18        q^n_s, q^n_t := the source and target nodes of c^n
  19        q_s, q_t := the source and target nodes of c
→ 20        K_s := call checkKeywords (q^n_s, q_s)
→ 21        K_t := call checkKeywords (q^n_t, q_t)
  22        if K_s ≠ nil and K_t ≠ nil then
  23          M := M ∪ {⟨c^n, c, K_s ∪ K_t⟩}
  24        end if
  25      end for
  26      // update the set of cache hits with new pairs of corresponding query edges
  27      H := the cache hits in H_cur that are inconsistent with any subset of edge pairs in M
  28      H_cur := H_cur \ H; H_old := H_old ∪ H
  29      H := all consistent cache hits created from H_cur using any subset of edge pairs in M
  30      H_cur := H_cur ∪ H
  31      // keep only cache hits contributing a schema hit that contains χ^{Q^n}
  32      for all cache hits κ ∈ H_cur do
→ 33        X := call checkSnapshot (κ, s_i, L^{Q^n})
  34        if X = ∅ then
  35          H_cur := H_cur \ {κ}; H_old := H_old ∪ {κ}
  36        else
  37          for an arbitrary χ ∈ X, add ⟦χ⟧_{s_i} ⊃_s χ^{Q^n} to κ (replacing any existing statement for χ^{Q^n})
  38          replace the step in κ with s_i
  39        end if
  40      end for
  41    end for
  42    // collect and possibly merge successful cache hits for all steps and plans
  43    for all cache hits κ ∈ H_cur ∪ H_old with a schema-hit containment for χ^{Q^n} do
  44      if ∃ κ′ ∈ H^{Q^n}: κ, κ′ have the same corresponding query edges and step then
  45        add κ’s schema-hit containment for χ^{Q^n} to κ′
  46      else
  47        H^{Q^n} := H^{Q^n} ∪ {κ}
  48      end if
  49    end for
  50  end procedure

Algorithm 10.1: Creation of cache hits with the RCADG Cache. The input is a schema hit χ^{Q^n} for the new query Q^n to be evaluated, the result L^{Q^n} of looking up Q^n in the RCADG Cache, and a set H^{Q^n} for collecting the cache hits to be created. A sample output is shown in Figure 10.6 on the previous page.
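The control flow of the outer loop in Algorithm 10.1 (active versus retired cache hits, and the early exit on a missing step) can be studied in isolation with the simplified Python sketch below. It makes strong assumptions that are not part of the algorithm as listed: cache hits are opaque strings, the callback `passes` stands in for the combined tests of lines 13–40, and the extension of cache hits with new edge pairs (lines 27–30) is omitted entirely.

```python
from typing import Callable, Dict, List, Optional, Set


def last_reusable_steps(
    plan: List[str],
    looked_up: Set[str],
    hits: List[str],
    passes: Callable[[str, str], bool],
) -> Dict[str, Optional[str]]:
    """For each candidate hit, find the last step of `plan` whose snapshot it can reuse.

    plan      -- evaluation steps of one cached query plan, in execution order
    looked_up -- steps of that plan that occur in the look-up result
    passes    -- hypothetical stand-in for the containment tests at one step
    """
    last: Dict[str, Optional[str]] = {h: None for h in hits}
    active = list(hits)                 # plays the role of H_cur
    for step in plan:
        if step not in looked_up:       # a missing step ends the plan (lines 10-12)
            break
        still_active = []
        for h in active:
            if passes(h, step):
                last[h] = step          # remember the newest reusable snapshot
                still_active.append(h)
            # otherwise h retires with its previous step (it moves to H_old)
        active = still_active
    return last


# Step s3 is missing from the look-up result, so s4 is never considered.
result = last_reusable_steps(
    plan=["s1", "s2", "s3", "s4"],
    looked_up={"s1", "s2", "s4"},
    hits=["a", "b"],
    passes=lambda h, s: not (h == "b" and s == "s2"),
)
```

In this toy run, hit `a` keeps `s2` as its last reusable snapshot and hit `b` retires with `s1`, which corresponds to a cache hit being moved from H_cur to H_old while remembering its last successful step.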


Checking unary D-constraints. The only unary D-constraints to be compared in the containment test are keyword constraints.³ The procedure createCacheHits in Algorithm 10.1 on the facing page compares the keyword constraints of every pair of query nodes that are the source or target nodes of two query edges in the same match edge (lines 15–25). Only edges whose source and target node constraints can be reconciled pairwise are added to the set M (line 23) that is used to create new cache hits (lines 27–30).

The actual comparison of keyword constraints is triggered by calls to checkKeywords in lines 20 and 21 of Algorithm 10.1. The pseudocode for checkKeywords is given in Algorithm 10.2. The procedure compares the keyword constraints of two query nodes q^n and q belonging to the new query Q^n and a cached query Q, respectively. It returns the subset of q^n’s keyword constraints that remain to be checked against the cached matches to q, or nil if q’s keyword constraints are too strict for q^n. The empty set is returned (line 60) if q^n and q specify the same keywords with essentially the same Boolean junctor (conjunction or disjunction) and scope (containment or government). If only q has keyword constraints, nil is returned (line 63). If, on the contrary, only q^n has keyword constraints, all these constraints must be matched (line 66). In all remaining cases the keyword constraints of q^n and q must be compared more thoroughly, as shown in Figure 10.7 on page 155. The right-hand side of the figure (coloured) comprises sixteen areas of eight squares each, most of them containing a relational symbol, which are arranged in pairs (a grey square on the left and a coloured or white square on the right). Each of the sixteen areas corresponds to a particular combination of the following four parameters: junctor(q), scope(q) (horizontal) and junctor(q^n), scope(q^n) (vertical). The upper left area, e.g., applies if both nodes specify a disjunction of containment constraints.

The four pairs of relation symbols in each area are to be read as follows: “=”, “⊂”, “⊃” and “⊃⊂” denote the equality, containment (in either direction) and non-empty intersection (overlap) of sets, respectively. Any pair ⟨θ, θ′⟩ of a grey and a coloured symbol indicates that if the two sets of keywords used in the constraints of q and q^n are in relation θ (grey square), then the two sets of elements that satisfy these constraints are in relation θ′ (coloured square). For instance, consider the upper left pair ⟨=, =⟩ in Figure 10.7. It says that if q and q^n both specify a disjunction of containment constraints for the same set of keywords, then they will be matched by the same set of elements (as far as keyword constraints are concerned, i.e., ignoring all other query constraints that q and q^n may be involved in). This obvious fact is captured by the first conditional branch of the procedure checkKeywords in Algorithm 10.2 on the following page, along with the other four ⟨=, =⟩ pairs (highlighted grey and red).

The other pairs in Figure 10.7 deal with less obvious cases. All pairs with a “⊃” symbol on the right-hand side (highlighted yellow) indicate that q’s keyword constraints are no more restrictive than those of q^n, which is exploited in lines 74 and 77 of Algorithm 10.2. If q and q^n both specify a conjunction of such constraints with the same scope (the two yellow “⊃” symbols directly below the two lower-right red “=” symbols in Figure 10.7), then only the constraints in q^n that are missing in q need to be part of the remainder query (line 74). For instance, given two sets of constraints Contains_{k0}(q) ∧ Contains_{k1}(q) and Contains_{k0}(q^n) ∧ Contains_{k1}(q^n) ∧ Contains_{k2}(q^n) for q and q^n, respectively, only Contains_{k2}(q^n) must be checked against the matches to q in the cache. In all other cases where the keyword constraints can be reconciled (remaining pairs with yellow “⊃” symbols in Figure 10.7), the remainder query includes the entire set of keyword constraints of q^n.

For all but the yellow and red pairs in Figure 10.7 (symbols “⊃” and “=”, respectively), either the set of elements matching q’s keyword constraints is known to be a subset of q^n’s set of matches (blue “⊂” symbols), or no specific relation between the match sets can be inferred (white squares with no symbol). For these junctor/scope/keyword combinations, the procedure checkKeywords returns nil (line 69 in Algorithm 10.2 on the next page), which causes the corresponding match edge to be discarded from cache-hit creation (line 23 in Algorithm 10.1 on the facing page).
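The branching of checkKeywords can be traced with the following Python sketch. It is a hedged illustration rather than the dissertation's code: the `KeywordSpec` record is a hypothetical representation, the function returns sets of keywords instead of constraint objects, and `documented_superset_cases` is only a partial stand-in for the yellow cells of Figure 10.7, covering just the two cases spelled out in the text and the figure caption.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class KeywordSpec:
    """Keyword constraints of one query node (hypothetical representation)."""
    keywords: FrozenSet[str] = frozenset()
    junctor: str = "and"          # "and" (conjunction) or "or" (disjunction)
    scope: str = "containment"    # "containment" or "government"


def documented_superset_cases(q: KeywordSpec, qn: KeywordSpec) -> bool:
    """Partial stand-in for the yellow cells of Figure 10.7.

    Returning False for every other combination is a conservative
    assumption made here, not the actual content of the figure.
    """
    if q.junctor == qn.junctor == "and" and q.scope == qn.scope:
        return q.keywords < qn.keywords      # cached node demands fewer keywords
    if q.junctor == qn.junctor == "or" and q.scope == qn.scope == "containment":
        return q.keywords > qn.keywords      # cached node accepts more keywords
    return False


def check_keywords(
    qn: KeywordSpec,
    q: KeywordSpec,
    cached_is_superset: Callable[[KeywordSpec, KeywordSpec], bool] = documented_superset_cases,
) -> Optional[FrozenSet[str]]:
    """Return the keywords of qn left for the remainder query, or None for nil."""
    if (qn.keywords == q.keywords
            and (qn.junctor == q.junctor or len(qn.keywords) < 2)
            and (qn.scope == q.scope or len(qn.keywords) == 0)):
        return frozenset()                   # similar constraints (line 60)
    if not qn.keywords:
        return None                          # only q is constrained (line 63)
    if not q.keywords:
        return qn.keywords                   # only qn is constrained (line 66)
    if not cached_is_superset(q, qn):
        return None                          # q is too restrictive (line 69)
    if qn.junctor == "and" and q.junctor == "and" and qn.scope == q.scope:
        return qn.keywords - q.keywords      # partly subsumed by q (line 74)
    return qn.keywords                       # everything must be matched (line 77)


# The conjunction example from the text: only k2 remains to be checked.
q = KeywordSpec(frozenset({"k0", "k1"}))
qn = KeywordSpec(frozenset({"k0", "k1", "k2"}))
remainder = check_keywords(qn, q)
```

Passing the figure lookup as a parameter keeps the branching logic separate from the table of junctor/scope combinations, so a complete encoding of Figure 10.7 could be substituted without touching `check_keywords`.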

Checking binary D-constraints. The notion of schema-hit containment in Definition 10.1 on page 144 implies that the schematized cached query does not contain any D-constraints which make its extension too restrictive with respect to the schematized new query Q^n. For every binary D-constraint in the cached query, this means that if the constraint has a counterpart in Q^n, they must be reconciled, and if not, the

³ Recall from Definition 2.7 on page 11 that the other unary query constraints specifying tag, type and level conditions are S-constraints. Being fully captured by schema nodes, they need not be matched on the document level.


  51  // checkKeywords: comparison of keyword constraints
  52  // → q^n: a query node in the new query Q^n
  53  // → q: a query node in a cached query Q
  54  // ← a set of keyword constraints for the remainder query, or nil
  55  procedure checkKeywords (q^n: query node, q: query node)
  56    // q^n and q have similar constraints for the same keywords
  57    if keywords(q^n) = keywords(q) and
  58       (junctor(q^n) = junctor(q) or |keywords(q^n)| < 2) and
  59       (scope(q^n) = scope(q) or |keywords(q^n)| = 0) then
→ 60      return ∅
  61    // only q has keyword constraints
  62    else if keywords(q^n) = ∅ then
→ 63      return nil
  64    // only q^n has keyword constraints
  65    else if keywords(q) = ∅ then
→ 66      return the constraints for keywords(q^n)
  67    // q’s keyword constraints are too restrictive
  68    else if ⟨q, q^n⟩ does not have a yellow “⊃” in the “matches” column in Figure 10.7 then
→ 69      return nil
  70    // some keyword constraints in Q^n are already subsumed by q
  71    else if junctor(q^n) = “∧” and
  72            junctor(q^n) = junctor(q) and
  73            scope(q^n) = scope(q) then
→ 74      return the constraints for keywords(q^n) \ keywords(q)
  75    // all keyword constraints in Q^n must be matched
  76    else
→ 77      return the constraints for keywords(q^n)
  78    end if
  79  end procedure

Algorithm 10.2: Comparison of keyword constraints with the RCADG Cache. This procedure is needed for verifying the second condition in Definition 10.1 on page 144. The input is a query node in the new query Q^n to be evaluated and a query node from a query Q in the RCADG Cache. The output is the (possibly empty) subset of the keyword constraints of q^n that need to be matched as part of the remainder query for Q^n. A return value nil indicates that the keyword constraints of q^n and q cannot be reconciled. For a given query node q, junctor(q) is the Boolean operator (“∧” or “∨”), and scope(q) is either containment or government.


Figure 10.7: Comparison of the keyword constraints of a node q in a cached query and a node q^n from a new query to be evaluated incrementally. Each cell in the table represents a specific relation between the two sets of keywords used in the constraints (left half of the cell, highlighted grey) and the resulting relation between the two sets of elements that satisfy these constraints (right half of the cell, white or coloured). The pairs of relations vary with the nature of the keyword constraints in q and q^n. For instance, if both nodes specify a disjunction of containment constraints (four upper-left pairs) and q has more keywords in the disjunction than q^n (third pair, symbol “⊃” highlighted grey), then it may also have a superset of the matches to q^n (symbol “⊃” highlighted yellow). By contrast, if both query nodes feature a conjunction of government constraints (four lower-right pairs) and q has again more keywords than q^n, then it may only have a subset of the matches to q^n (third pair, symbol “⊂” highlighted blue).

constraint must not introduce a proper restriction. This is verified by the procedure checkSnapshot listed in Algorithm 10.3 on the following page. The procedure is called repeatedly by createCacheHits in line 33 of Algorithm 10.1 on page 152 for a (preliminary) cache hit κ and an evaluation step s_i of a particular query Q in the cache. At this point in time, κ contains a set of corresponding query edges from Q and Q^n as well as a set of remainder query constraints for Q^n, as illustrated in Figure 10.6 on page 151. The pairs of query edges in κ indicate which binary constraints in Q have which counterparts in Q^n after schematization. Q^n is schematized with a specific schema hit χ^{Q^n} given as a parameter to createCacheHits (see above).

In document Weigel, Felix (2006): Structural Summaries as a Core Technology for Efficient XML Retrieval. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 162-169)