
8.4 Query Evaluation with the RCADG

8.4.6 Computing the Final Query Result

The last intermediate result table created during phase 2 contains all query matchings as defined before (see Definition 2.3 on page 10). Recall from Section 2.2 that the final query answer ans(Q) is obtained by restricting these matchings to the set Q_r of result nodes given as part of the query specification. Also, the result tables contain the schema hits and weights corresponding to each matching, which are not part of the query answer. Therefore a final query is needed to extract all relevant data from the last intermediate result table and return it as the final result ans(Q) to the user. This is done by the procedure createResult called in line 14 of Algorithm 8.1 on page 101.

The SQL code for creating the final result of the query Q in Figure 8.11 a. on page 109 is given in Figure 8.17 on the next page. It consists of a single statement that simply projects the last intermediate result table, Q_s3 in Figure 8.4 d. on page 102, onto those columns e_i which contain the matches to result query nodes q_i. In the example we assume that all nodes in Q are result nodes, i.e., Q_r = Q_v. Note the

CHAPTER 8. THE RELATIONAL CADG (RCADG)

... -- CREATE, SELECT, FROM as before
WHERE
  ET4.pid = p4 AND      -- select matches v to q4
  ET4.key = '' AND
  EXISTS (              -- match Governs disjunction using a single descendant w of v
    SELECT eid
    FROM ElementTable ET4desc
    WHERE
      ET4desc.eid >= ET4.eid AND ET4desc.eid < ET4.eid + w4 AND   -- match Child∗0(v,w)
      (ET4desc.key = 'female' OR ET4desc.key = 'PhD')             -- match Contains disjunction on w
  )

a. disjunction of government constraints on node q4 in query Q

... -- CREATE, SELECT, FROM as before
WHERE
  ET4.pid = p4 AND      -- select matches v to q4
  ET4.key = '' AND
  EXISTS (              -- match Governs"female"(q4) using a descendant w of v
    SELECT eid
    FROM ElementTable ET4desc
    WHERE
      ET4desc.eid >= ET4.eid AND ET4desc.eid < ET4.eid + w4 AND   -- match Child∗0(v,w)
      ET4desc.key = 'female'                                      -- match Contains"female"(w)
  ) AND
  EXISTS (              -- match Governs"PhD"(q4) using another descendant w of v
    SELECT eid
    FROM ElementTable ET4desc
    WHERE
      ET4desc.eid >= ET4.eid AND ET4desc.eid < ET4.eid + w4 AND   -- match Child∗0(v,w)
      ET4desc.key = 'PhD'                                         -- match Contains"PhD"(w)
  )

b. conjunction of government constraints on node q4 in query Q

Figure 8.16: SQL code for matching keyword government constraints. The code is generated for a variant of query Q in Figure 8.11 a. on page 109 where the containment constraint Contains“female”(q4) on node q4 has been replaced with two government constraints Governs“female”(q4) and Governs“PhD”(q4). The two government constraints are either disjunctive (a.) or conjunctive (b.). Each of the two statements is meant to replace the code for the first evaluation step s1 in the query plan P_Q in Figure 8.11 b. on page 109. The CREATE, SELECT and FROM clauses remain unchanged (see Figure 8.12 a. on page 113). Only the WHERE clause is modified according to the document matching rule DA^Governs_k in Figure 8.15 on page 115, as follows. Let v be a match to the query node q4. a. A disjunction of the two government constraints is translated into a single subquery selecting any descendant w of v that contains either keyword. b. A conjunction of the two government constraints translates to a conjunction of two separate subqueries. Each subquery independently selects a descendant w of v that contains a specific keyword.
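The difference between the two translations in Figure 8.16 can be tried out on a toy element table. The following is a minimal sketch using Python's built-in sqlite3 module, not the code generated by the system described here. The table contents, the path ids, and the per-row `width` column (standing in for the figure's w4) are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature element table; `width` is an assumed column that
# plays the role of w4 (size of the label interval covering v's subtree).
conn.execute(
    "CREATE TABLE ElementTable (eid INTEGER PRIMARY KEY, pid INTEGER, key TEXT, width INTEGER)"
)
conn.executemany(
    "INSERT INTO ElementTable VALUES (?, ?, ?, ?)",
    [
        (10, 4, "", 3), (11, 5, "female", 1), (12, 5, "PhD", 1),  # both keywords below 10
        (20, 4, "", 3), (21, 5, "female", 1), (22, 5, "MSc", 1),  # only 'female' below 20
        (30, 4, "", 2), (31, 5, "other", 1),                      # neither keyword below 30
    ],
)

# Interval test "w is v itself or a descendant of v", as in the figure.
IN_SUBTREE = "w.eid >= v.eid AND w.eid < v.eid + v.width"

# a. disjunction: one subquery, a single descendant may carry either keyword
disjunction = f"""
    SELECT v.eid FROM ElementTable v
    WHERE v.pid = 4 AND v.key = '' AND EXISTS (
        SELECT 1 FROM ElementTable w
        WHERE {IN_SUBTREE} AND (w.key = 'female' OR w.key = 'PhD'))
    ORDER BY v.eid
"""

# b. conjunction: two independent subqueries, one descendant per keyword
conjunction = f"""
    SELECT v.eid FROM ElementTable v
    WHERE v.pid = 4 AND v.key = ''
      AND EXISTS (SELECT 1 FROM ElementTable w WHERE {IN_SUBTREE} AND w.key = 'female')
      AND EXISTS (SELECT 1 FROM ElementTable w WHERE {IN_SUBTREE} AND w.key = 'PhD')
    ORDER BY v.eid
"""

disj = [row[0] for row in conn.execute(disjunction)]
conj = [row[0] for row in conn.execute(conjunction)]
print("disjunction:", disj)
print("conjunction:", conj)
```

In this toy data the conjunctive form is the stricter one: a node qualifies only if each keyword is found somewhere in its subtree, possibly at different descendants.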

SELECT DISTINCT
  e1, e2, e3, e4      -- copy matches to result nodes
FROM
  Q_s3                -- retrieve answer from the last intermediate result
ORDER BY
  e1, e2, e3, e4      -- order result as needed

Figure 8.17: SQL code for computing the final result of the query Q in Figure 8.11 a. on page 109. The last intermediate result from phase 2 (see Figure 8.4 d. on page 102) is projected onto matches to the result nodes (in this case, all query nodes). This produces the query answer shown in Figure 8.4 e. on page 102.


elimination is only needed when some match columns are dropped, i.e., when Q_r ⊊ Q_v. The ORDER BY clause serves to return the query answer in some specific order. In this case, it is sorted so that all matches to the query node q1 appear in document order. (Tatarinov et al. [2002] mention different output modes to be applied analogously.)

The output of the SQL query in Figure 8.17 is shown in Figure 8.4 e. on page 102. Here ans(Q) consists of the tuple ⟨18, 21, 25, 26⟩ of node labels that denotes exactly the document subtree depicted in Figure 2.1 e. on page 8 (the corresponding node labels are given in Figure 4.3 a. on page 46). Of course a different result presentation may be chosen for the user. For instance, given the original XML representation of the documents and a mapping from node labels to the corresponding byte offsets in the XML code, the query answer could be presented as XML fragments (possibly rendered using stylesheets). Alternatively, XML code might be generated on the fly. However, these presentation details are beyond the scope of this work.
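To illustrate why duplicates can only arise when match columns are dropped, the projection step can be sketched with Python's built-in sqlite3 module. This is a hedged illustration rather than the createResult procedure itself: the helper name `create_result`, the `weight` column and the rows in Q_s3 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical last intermediate result: one row per matching, plus a weight
# column that is not part of the query answer.
conn.execute("CREATE TABLE Q_s3 (e1 INTEGER, e2 INTEGER, e3 INTEGER, e4 INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO Q_s3 VALUES (?, ?, ?, ?, ?)",
    [
        (18, 21, 25, 26, 1.0),
        (18, 21, 25, 27, 0.8),  # differs from the first matching only in e4
        (30, 33, 35, 36, 0.6),
    ],
)

def create_result(result_columns):
    """Project Q_s3 onto the match columns of the result nodes.

    The column names are interpolated into the statement, so they must come
    from the query plan (trusted input), never from user-supplied text.
    """
    cols = ", ".join(result_columns)
    sql = f"SELECT DISTINCT {cols} FROM Q_s3 ORDER BY {cols}"
    return conn.execute(sql).fetchall()

all_nodes = create_result(["e1", "e2", "e3", "e4"])  # every query node is a result node
subset = create_result(["e1", "e2"])                 # only two result nodes kept
print(all_nodes)
print(subset)
```

With all four columns kept, the three hypothetical matchings remain distinct. Once e3 and e4 are dropped, the first two rows collapse to the same pair, which is the case where DISTINCT does real work.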

8.5 Experimental Evaluation

To evaluate the practical use of XML indexing with the RCADG, we created path and element tables for different document collections in an RDBS and implemented the evaluation procedure evaluateQuery (see Algorithm 8.1 on page 101) in a retrieval engine called Document eXplorer (DoX). DoX evaluates XML queries like those used throughout this work by translating them into SQL statements against the path and element tables, as described above. The system is compared to (1) the native XML engine X2 that was already used for the experiments with the CADG in Section 6.4; (2) the relational node indexing scheme XPath Accelerator by Grust et al. [2002; 2004] (see Section 7.3.2); and (3) the relational path indexing scheme XRel by Yoshikawa et al. [2001] (see Section 7.4.1). All query engines have been implemented (or reimplemented, in the case of XPath Accelerator and XRel) in Java. Details of the hardware and software set-up are given in the appendix (see Test Environment A in Section 13.1).

We ran a number of queries against the four document collections IMDb, XMark 1100, INEX and DBLP listed in Section 13.2 of the appendix. The Internet Movie Database (IMDb) comprises more than 8 GB of XML documents describing movies and actors from a commercial web site [IMDB], whereas XMark 1100 consists of 1 GB of recursive XML synthetically generated by a benchmarking tool [XMark]. The highly heterogeneous INEX benchmark [INEX] contains full-text scientific articles. DBLP [DBLP] is an online collection of bibliographic data from computer science. The key results of the evaluation are the following:

1. The RCADG outperforms both the native and the relational baseline systems by two orders of magnitude and more in terms of retrieval speed. Complex queries with large results causing the baseline systems to break down are answered within seconds by the RCADG. Querying XML in a relational database system benefits greatly from native XML indexing techniques (see Section 8.5.2 below). To a certain extent this also confirms previous findings reported by Chen et al. [2004] for the BLAS storage scheme (see Section 7.4.2).

2. The RCADG easily scales up to collections of multiple gigabytes both in terms of retrieval speed and storage demands. The path table is typically several orders of magnitude smaller than the original data (see Section 8.5.4).

3. Query planning has a significant impact on the performance of the RCADG. While very encouraging results were obtained with the planning strategies described above, in some cases inappropriate planning may prevent a performance gain. Also, enhancing the relational optimizer with tree statistics seems promising (see Section 8.5.3).

4. Keyword-driven schema matching using signatures in the path table does not entail a significant performance gain in our experiments. The overhead for signature comparison lies between 100 ms and 300 ms, whereas the time needed for creating signatures is negligible.

Table 8.5 on the next page summarizes the performance results for the RCADG (averaged after removing the best and worst of five runs). Sample queries are given as their closest XPath equivalents. The


Corpus      QID  Result size  Closest XPath query                                            Processing time (s)

IMDb        I3          6507  //*[title="love"]/production_year                               1.27
            I4       118,150  //movie[.//genre="documentary"]//actor                          8.77

XMark 1100  X4             2  /site/open_auctions/open_auction[                               0.44
                                bidder[personref/@person="person20"]/following-sibling::
                                bidder[personref/@person="person17290"]]/reserve
            X15         1890  /site/closed_auctions/closed_auction/annotation/description/    0.52
                                parlist/listitem/parlist/listitem/text/emph/keyword
            X14         9461  /site//item[contains(description, "gold")]/name                 3.34
            X13       22,000  /site/regions/australia/item[name and description]              0.88
            X2       597,777  /site/open_auctions/open_auction/bidder/increase               17.54

Table 8.5: RCADG query performance, in seconds. The original queries are given here as their closest XPath equivalent. XMark 1100 queries are adapted from the XQuery benchmark [XMark]. Only matches to XPath result nodes were computed (unlike Table 8.6 on the following page).

XMark 1100 queries X2, X4, X13, X14 and X15 as well as X1 (in Table 8.6 on the following page) capture the XPath portion in the corresponding queries from the XQuery benchmark [XMark]. As can be seen in Table 8.5, the RCADG scales well with both the size of the document collection and the number of query results. The rest of this section discusses more results (see Tables 8.6, 8.7 and 8.8) in greater detail.
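The timings in Table 8.5 are described as averages taken after removing the best and worst of five runs. A minimal sketch of that aggregation is given below, assuming that "best" and "worst" mean the smallest and largest measured time. The function name `trimmed_mean` and the sample timings are hypothetical and are not measurements from this evaluation.

```python
def trimmed_mean(times):
    """Average the timings after discarding the single fastest and slowest run."""
    if len(times) < 3:
        raise ValueError("need at least three runs to drop the best and the worst")
    kept = sorted(times)[1:-1]  # drop the minimum and the maximum
    return sum(kept) / len(kept)

# Hypothetical timings (in seconds) of five runs of one query.
runs = [1.31, 1.25, 1.27, 1.90, 1.29]
avg = trimmed_mean(runs)
print(f"{avg:.2f} s")
```

Dropping the extremes keeps a single slow outlier (here the 1.90 s run, for example one hit by cold caches) from dominating the reported figure.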