8.4 Query Evaluation with the RCADG
8.5.2 Runtime Performance
RCADG versus CADG. A first set of experiments measures the performance gain the RCADG achieves over native XML retrieval with X2 (see the RCADG and CADG columns in Table 8.6). To avoid a handicap for the X2 system, which always matches the entire query graph, the systems treated all query nodes as result nodes. For the RCADG, the runtime performance therefore differs from the results in Table 8.5 on the preceding page.
Queries I1a to I1d retrieve the place of birth and the movies of different people mentioned in the movie database IMDb. Note how the performance of both the RCADG and the CADG remains stable as the selectivity of the query keyword decreases: while the keyword “mastroianni” is contained only in 406 elements, the frequency of “felix” is almost ten times higher; “cooper” occurs in 10,398 elements and “steve” in 38,983 elements. The RCADG’s performance gain is two orders of magnitude for the most selective keyword (I1a) and still more than a factor of 20 for the most frequent keyword (I1d). As queries I2 and I3 illustrate, the overhead incurred by the CADG is mainly due to “output” nodes like place and movie which are not subject to keyword constraints. While the CADG is highly competitive for queries without such unselective nodes, such as I2, the production year node in I3 slows down the native system by two orders of magnitude. Unlike the RCADG, the CADG retrieves matches to the title and production year nodes in the element table and transfers them into main memory for deciding their binary query constraint (the child step).4 By contrast, the RCADG translates binary constraints into join conditions supported by relational indices on the element table, and therefore faces no such overhead for loading large element sets.
Query I4 illustrates another potential weak spot in native retrieval systems which compose matches to tree queries in main memory: processing large intermediate result sets containing tens or hundreds of thousands of tuples easily exceeds the hardware capacities. During the evaluation of I4, X2 quickly ran out of memory; allocating more than 800 MB on our 1-GB machine avoided a crash but resulted in swapping. The RCADG, however, copes well with large result sets.
4 The huge overhead for I3 compared to I2 might not be faced by native systems which do not compose path occurrences in this way but retrieve entire tree fragments instead, like the NatiX system [May et al. 2004; Fiebig et al. 2002].
CHAPTER 8. THE RELATIONAL CADG (RCADG)
Corpus  QID  closest XPath query  result size       processing time (s)
                                  RCADG  XRel       RCADG  XRel
INEX    N1a  //p                  609    609        <0.01  0.09
        N1b  //p[sub]/b           27     3,485,916  0.01   515.22
Table 8.7: Query performance comparison for RCADG and XRel on the schema level, in seconds. Processing times and intermediate result sizes are measured at the end of phase 1. The original queries are given here as their closest XPath equivalent.
Query X21 against the XMark 1100 collection examines how the systems cope with tasks whose complexity is in the query structure, not the result size. The RCADG invests 85% of a total of 321 ms in generating extremely efficient SQL code that involves the reconstruction of four Parent constraints. By contrast, X2 is again trapped in too many decision operations. The results for X1 and X2 in Table 8.6 confirm the earlier observations for I3 and I4, respectively. Note that when returning only matches to the XPath result nodes, the RCADG answers the same queries again up to 7 times faster (see Table 8.5), retrieving more than half a million matches in less than 20 seconds.
RCADG versus XPath Accelerator. It has been mentioned before that XPath Accelerator decides all binary constraints via selfjoins on the node table, lacking both reconstruction capabilities and schema-level information. Consequently, in our test with different keyword selectivities (queries I1a to I1d in Table 8.6 on the preceding page), the evaluation time rapidly grows with the size of intermediate results, reaching 820 seconds for I1d compared to only 0.52 seconds with the RCADG. Less selective queries like I2 to I4 also take longer than ten minutes to evaluate. Only for highly selective queries like I1a or X1 is XPath Accelerator slightly faster than the RCADG, possibly because the latter issues multiple SQL queries rather than only a single one. The impact of a complex query graph like X21 is much higher for XPath Accelerator than for the native or relational CADGs. Since XPath Accelerator selects tuples in the element table based only on singleton tags rather than tag paths, it has to join large intermediate results.
The most unselective query in our test suite, X2, has a much simpler structure (no branches and no descendant steps). Here XPath Accelerator is faster than for X21, but still takes more than twice as long as the RCADG. Query X2 is reported as critical by Grust et al. [2004], too. Note that when retrieving only matches to the leaf of the path, in XPath style, the RCADG outperforms XPath Accelerator by one order of magnitude (22 seconds versus 220 seconds). As query I4 shows, XPath Accelerator does not scale well to unselective queries with many descendant steps, which involve range conditions in the selfjoin of the node table. Here the RCADG is two orders of magnitude faster. Note that even the special relational index structures and join operators employed by Grust [2002], which are reported to recover up to one order of magnitude of processing time, are unlikely to remedy this handicap completely. Obviously the RCADG benefits considerably from BIRD reconstruction when answering query I4, using the keyword-restricted genre node as a starting point in the query plan.
Summing up, the experiments prove that the native XML indexing techniques underlying the RCADG entail a decisive performance gain in the relational domain.
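The range conditions mentioned above for descendant steps can be illustrated with a small sketch. It assumes the pre/post-order labelling scheme described by Grust [2002], in which a node d is a descendant of a node a exactly if pre(a) < pre(d) and post(d) < post(a). The sample tree and the names label and descendant_step are hypothetical; the sketch does not reproduce the actual relational schema or join operators of XPath Accelerator.

```python
from itertools import count


def label(tree):
    """Assign (pre, post) ranks to each node of a nested (tag, children) tuple."""
    pre_counter, post_counter = count(), count()
    rows = []  # one row per node: (tag, pre, post)

    def visit(node):
        tag, children = node
        pre = next(pre_counter)          # rank in document order
        for child in children:
            visit(child)
        rows.append((tag, pre, next(post_counter)))  # rank when the node closes

    visit(tree)
    return sorted(rows, key=lambda row: row[1])


def descendant_step(rows, context_tag, target_tag):
    """Selfjoin of the node table with range conditions on pre and post."""
    return [
        (a, d)
        for a in rows if a[0] == context_tag
        for d in rows if d[0] == target_tag
        and a[1] < d[1] and d[2] < a[2]
    ]


# Hypothetical document: only the first person has a gender descendant.
doc = ("site", [
    ("person", [("name", []), ("profile", [("gender", [])])]),
    ("person", [("name", [])]),
])
rows = label(doc)
pairs = descendant_step(rows, "person", "gender")
```

Because the selection is by tag only, every person row is paired with every gender row before the range predicate filters the pairs, which is the source of the large intermediate results discussed above.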
RCADG versus XRel. As explained in Section 7.4.1, XRel’s atomic representation of tag paths as strings has a number of disadvantages compared to the compositional path representation of the RCADG. First, string matching tends to be slower than the comparison of numeric node labels, especially for query paths starting with a descendant step. The following experiment quantifies this overhead using queries against the INEX collection. Table 8.7 compares how fast RCADG and XRel match a query graph on the schema level (phase 1) and how many matches they retain for document-level matching (phase 2). For N1a both systems retrieve 609 matches, but the RCADG is slightly faster. Second, XRel produces many partial matches to be discarded later in phase 2: for N1b its intermediate result is five orders of magnitude larger than that of the RCADG. As explained in Section 7.4.1, XRel’s atomic path representation is not precise enough to discard combinations of sub and b elements that do not belong to the same p parent.
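The effect described for N1b can be pictured with a toy example. The sketch below is only an illustration under simplifying assumptions: the tag paths are invented (they are not taken from the INEX collection), regular expressions stand in for XRel's string patterns, and the shared-parent check is a crude stand-in for a compositional path representation rather than the RCADG's actual labelling.

```python
import re

# Hypothetical tag paths of a small schema.
schema_paths = [
    "/article/bdy/sec/p",
    "/article/bdy/sec/p/sub",
    "/article/bdy/sec/p/b",
    "/article/fm/abs/p",
    "/article/fm/abs/p/b",
    "/article/bm/app/p",
    "/article/bm/app/p/sub",
]


def match(pattern):
    """Match one path pattern against the atomic path strings."""
    regex = re.compile(pattern)
    return [path for path in schema_paths if regex.fullmatch(path)]


# //p[sub]/b decomposes into the two root-to-node paths //p/sub and //p/b.
sub_paths = match(r".*/p/sub")
b_paths = match(r".*/p/b")

# Matching each path independently retains every sub/b combination.
independent = [(s, b) for s in sub_paths for b in b_paths]


def parent(path):
    """Strip the last step of a tag path."""
    return path.rsplit("/", 1)[0]


# Requiring both paths to extend the same p path prunes most combinations.
shared_parent = [(s, b) for s, b in independent if parent(s) == parent(b)]
```

In this toy schema, independent matching keeps four combinations while only one pair shares a p path; the gap widens quickly as the number of distinct paths ending in p grows.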
8.5. EXPERIMENTAL EVALUATION
Corpus      QID  closest XPath query                                 result size       processing time (s)
                                                                     RCADG    XRel     RCADG  XRel
DBLP        D1a  //article[author=“codd”]/title                      34       34       0.12   9.18
            D1b  /dblp/article[author=“codd”]/title                  34       34       0.12   9.14
XMark 1100  X1   /site/people/person[@id=“person0”]/name             1        1        0.09   3.96
            X22  //parlist[.//text[.=“zenelophon”]]/listitem/text    133      183      0.14   27.95
            X14  /site//item[contains(description,“gold”)]/name      9,461    9,461    3.34   >600
            X13  /site/regions/australia/item[name and description]  22,000   22,000   0.88   >600
            X23  //regions[contains(.,“zyda@ask”)]//keyword          416,175  416,175  32.21  310.03
            X2   /site/open_auctions/open_auction/bidder/increase    597,777  597,777  17.54  6.12
Table 8.8: Query performance comparison for RCADG and XRel, in seconds (phases 1 and 2). The original queries are given here as their closest XPath equivalent. Only matches to XPath result nodes were computed (unlike Table 8.6 on page 120). The symbol “ ” indicates that a specific query was not answered properly.
This also slows down the subsequent document-level matching, as shown in Table 8.8. Here the processing time covers the entire query evaluation process (phases 1 and 2), and the result size counts only elements that are part of the final answer to the query. On the DBLP collection the RCADG is almost two orders of magnitude faster than XRel (D1a), even for an absolute query path (D1b). On XMark 1100, the difference is between one and three orders of magnitude. XRel outperforms the RCADG only for a single unselective query without branching nodes and descendant steps (X2). For such queries matching exactly one path in the schema, the RCADG’s compositional path representation has no extra benefit, but rather entails a small overhead compared to exact string matching without wildcards.
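The orders of magnitude quoted above can be checked against the processing times in Table 8.8 with a few lines of arithmetic. The sketch below copies the timings from the table and omits the two rows reported only as “>600”, for which just a lower bound is known; the function name orders_of_magnitude is chosen here for illustration.

```python
import math

# (RCADG seconds, XRel seconds) as listed in Table 8.8.
timings = {
    "D1a": (0.12, 9.18),
    "D1b": (0.12, 9.14),
    "X1": (0.09, 3.96),
    "X22": (0.14, 27.95),
    "X23": (32.21, 310.03),
    "X2": (17.54, 6.12),
}


def orders_of_magnitude(rcadg_s, xrel_s):
    """log10 of the XRel/RCADG time ratio; a negative value means XRel is faster."""
    return math.log10(xrel_s / rcadg_s)


gain = {qid: round(orders_of_magnitude(*pair), 2) for qid, pair in timings.items()}
```

The ratio for D1a is 76.5, that is, a little under two orders of magnitude, while X2 is the only query with a negative value.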
By contrast, for proper tree queries with descendant steps, XRel not only takes more processing time but may also produce wrong final results on recursive collections like XMark 1100. For instance, in the case of query X22, XRel is two orders of magnitude slower than the RCADG and retrieves 50 false hits. The query evaluation with the RCADG, on the other hand, is fast and correct, owing to its compositional path representation and BIRD reconstruction. This phenomenon is explained as follows. For illustration, reconsider the query in Figure 8.5 a. on page 104. The RCADG answers this query with only two element-table joins, as specified by the corresponding query plan in Figure 8.11 d. on page 109. The SQL code generated to answer the same query with XRel is given in Figure 8.18 on the next page. Here we ignore the query node q6 and the NextSib edge because XRel does not support sibling constraints. For the resulting query graph comprising the five query nodes q1 to q5, XRel combines a five-fold join of the path table with another five-fold join of the node and content tables (see the FROM clause in Figure 8.18). As described in Section 7.4.1, tag path patterns are created from the query and matched against the pathexp column in the path table (black part of the WHERE clause in Figure 8.18). The path IDs retrieved this way act as foreign keys to the node and content tables (blue part of the WHERE clause in Figure 8.18). Finally, all binary query constraints are decided on the document level, using region encoding (green part of the WHERE clause in Figure 8.18). Note how matches to distinct tag paths are first retrieved independently and then combined through the join predicates on the node and content tables. This causes the large intermediate result after phase 1 for N1b in Table 8.7.
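The region-encoding test used in the green predicates of Figure 8.18 can be sketched as follows. This is a minimal illustration under assumptions: the document, the position counter and the names region_encode and contains are hypothetical, and element names are assumed to be unique so that they can serve as dictionary keys.

```python
def region_encode(tree):
    """Return {name: (start, end)} for a nested (name, children) tuple."""
    regions = {}
    position = 0

    def visit(node):
        nonlocal position
        name, children = node
        start = position          # position of the opening tag
        position += 1
        for child in children:
            visit(child)
        regions[name] = (start, position)  # position of the closing tag
        position += 1

    visit(tree)
    return regions


def contains(ancestor, descendant):
    """Same shape as 'NT1.start < NT2.start AND NT1.end > NT2.end'."""
    return ancestor[0] < descendant[0] and ancestor[1] > descendant[1]


# Hypothetical document with unique element names.
doc = ("person", [("watches", [("watch", [])]), ("profile", [("gender", [])])])
regions = region_encode(doc)
person_has_watch = contains(regions["person"], regions["watch"])
watches_has_gender = contains(regions["watches"], regions["gender"])
```

Note that interval containment alone expresses an ancestor relationship of any depth; it holds for person and watch here although watch is a grandchild.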
Compared to XRel, the RCADG (1) replaces suffix and infix string matching involving numerous wildcards with efficient numeric equality predicates in the selfjoin of the path table, (2) saves three out of five expensive joins with the element table through BIRD reconstruction, (3) looks up fewer schema hits in the element table in cases where the individual query paths have disparate partial matches in the documents (as in query N1b above), and (4) correctly discards partial matches from the final result in the presence of a recursive schema. For instance, assume that the sample query from Figure 8.5 a. is run against a document collection containing nested person elements. Then the code in Figure 8.18 on the facing page wrongly accepts those person elements which lack a suitable watches child, but instead have a person descendant with such a watches child. The reason is that XRel loses track of the common person ancestors of matches to node q2 (watches) and q4 (profile), which are treated simply as matches to two distinct path patterns (#%/person#/watches and #%/person#%/profile in the WHERE part of the SQL statement). By contrast, the RCADG keeps tuples of matches to all nodes in the query graph as intermediate results and hence never mixes up distinct person ancestors. Faced with two nested partial person matches as just described (one satisfying only the constraints related to q2 and the other to q4), the RCADG rejects both during phase 2 at the latest, but possibly even earlier during schema matching. In the same way, it discards nested parlist elements that only partially match the root of query X22 in Table 8.8.
SELECT
  NT3.start, NT3.end, NT4.start, NT4.end             -- add matches to q3 and q4
FROM
  PathTable PT1, PathTable PT2, PathTable PT3,       -- join path, node and content tables
  PathTable PT4, PathTable PT5,
  NodeTable NT1, NodeTable NT2, NodeTable NT3, NodeTable NT4,
  ContentTable CT5
WHERE
  PT1.pathexp LIKE '#%/person' AND                   -- match tag paths
  PT2.pathexp LIKE '#%/person#/watches' AND
  PT3.pathexp LIKE '#%/person#/watches#%/open_auction' AND
  PT4.pathexp LIKE '#%/person#%/profile' AND
  PT5.pathexp LIKE '#%/person#%/profile#/gender' AND
  NT1.pathid = PT1.pathid AND                        -- match unary constraints
  NT2.pathid = PT2.pathid AND
  NT3.pathid = PT3.pathid AND
  NT4.pathid = PT4.pathid AND
  CT5.pathid = PT5.pathid AND
  CT5.value = 'XML' AND
  NT1.start < NT2.start AND NT1.end > NT2.end AND    -- decide Child(q1,q2)
  NT2.start < NT3.start AND NT2.end > NT3.end AND    -- decide Parent∗1(q3,q2)
  NT1.start < NT4.start AND NT1.end > NT4.end AND    -- decide Parent∗1(q4,q1)
  NT4.start < CT5.start AND NT4.end > CT5.end        -- decide Parent(q5,q4)
ORDER BY
  NT3.start, NT3.end, NT4.start, NT4.end             -- order result as needed
Figure 8.18: SQL code for query evaluation with XRel (see Section 7.4.1). Blue colour highlights code related to joins with the node or content table, whereas green colour is used for the decision of binary query constraints. The query being evaluated is a variant of the query in Figure 8.5 a. on page 104 where the node q6 and the binary constraint NextSib(q6,q5) have been removed (since XRel does not support sibling constraints).
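The false hit described above for nested person elements can be reproduced on a tiny example. The sketch below runs a reduced form of the query in Figure 8.18 (only q1, q2 and q4) in an in-memory SQLite database. Everything about the data is an assumption made for illustration: the five-element document, the region values, and the column names start_pos and end_pos, which stand in for start and end. The parent map at the end is a hypothetical reference check, not the RCADG's evaluation strategy.

```python
import sqlite3

# Hypothetical document:
# <site><person><person><watches/></person><profile/></person></site>
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PathTable (pathid INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE NodeTable (pathid INTEGER, start_pos INTEGER, end_pos INTEGER);
""")
conn.executemany("INSERT INTO PathTable VALUES (?, ?)", [
    (1, "#/site"),
    (2, "#/site#/person"),
    (3, "#/site#/person#/person"),
    (4, "#/site#/person#/person#/watches"),
    (5, "#/site#/person#/profile"),
])
conn.executemany("INSERT INTO NodeTable VALUES (?, ?, ?)", [
    (1, 0, 9),  # site
    (2, 1, 8),  # outer person: has a profile, but no watches child
    (3, 2, 5),  # inner person: has a watches child, but no profile
    (4, 3, 4),  # watches
    (5, 6, 7),  # profile
])

# Reduced variant of Figure 8.18: person (q1), watches (q2), profile (q4).
QUERY = """
    SELECT NT1.start_pos, NT1.end_pos
    FROM PathTable PT1, PathTable PT2, PathTable PT4,
         NodeTable NT1, NodeTable NT2, NodeTable NT4
    WHERE PT1.pathexp LIKE '#%/person'
      AND PT2.pathexp LIKE '#%/person#/watches'
      AND PT4.pathexp LIKE '#%/person#%/profile'
      AND NT1.pathid = PT1.pathid
      AND NT2.pathid = PT2.pathid
      AND NT4.pathid = PT4.pathid
      AND NT1.start_pos < NT2.start_pos AND NT1.end_pos > NT2.end_pos
      AND NT1.start_pos < NT4.start_pos AND NT1.end_pos > NT4.end_pos
    ORDER BY NT1.start_pos
"""
false_hits = conn.execute(QUERY).fetchall()

# Reference answer from explicit parent links (child start -> parent start).
parent = {1: 0, 2: 1, 3: 2, 6: 1}


def ancestors(node):
    """Yield the start positions of all ancestors of a node."""
    while node in parent:
        node = parent[node]
        yield node


expected = [
    p for p in (1, 2)                        # both person elements
    if parent[3] == p and p in ancestors(6)  # watches child and profile descendant
]
```

The query returns the outer person, whose region merely contains the watches element of the nested person, whereas the reference check finds no person that satisfies both constraints.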