Compositional Path Indexing with BLAS

7.4 Path Indexing

7.4.2 Compositional Path Indexing with BLAS

The Bi-Labelling Based System (BLAS) by Chen et al. [2004] is so far the only relational storage scheme for XML we are aware of that represents (suffixes of) tag paths in a compositional manner. The name of the approach alludes to the fact that there are two different kinds of labels, D-labels for elements and P-labels for tag paths, which are used to match structural query constraints on the document and schema levels, respectively. D-labels are simply integer intervals following region encoding, as with the XRel scheme above. P-labels are generated on the fly during indexing and query evaluation for any tag path suffix encountered in a document or query. A P-label is an integer interval denoting the set of all possible tag paths which share a specific suffix. For instance, the P-label for the tag path suffix /person/name represents all possible tag paths /.../person/name. In particular, each root-to-leaf tag path p (a special case of a path suffix) is assigned a P-label that is stored with each occurrence of p in the node table, similar to the path ID used by XRel above.

The idea is to choose P-labels in such a way that given the P-label P of any tag path suffix in the query, one can easily retrieve all elements with that tag path suffix by inspecting their P-labels in the node table. To this end, the labelling ensures that for any two tag path suffixes s and s′ with P-labels P_s and P_s′, respectively, P_s contains P_s′ (as an interval) iff s is a suffix of s′. If neither suffix is a suffix of the other, P_s and P_s′ are disjoint. For instance, the P-label for the suffix /name contains the P-label for /person/name, which in turn contains the P-label for /people/person/name. Thus a query path s = //person/name can be matched by selecting all tuples in the node table whose P-label is contained in P_s. This would include, e.g., elements reached by /people/person/name, but not those below /people/person/profile/name (whose P-label is disjoint from P_s).

5 Regular expressions are not part of the SQL-92 standard [SQL2], but included in SQL:1999 [SQL3].
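The containment property of P-labels described above can be illustrated with a small sketch. It is not necessarily the construction used by Chen et al.; it merely shows one way to obtain intervals with the required behaviour, assuming that the number of distinct tags and the maximum document height are known in advance. The function names p_label and contains, the tag list and the height value are illustrative assumptions. A tag path is read from its last tag backwards as a number in base (number of tags + 1), where the digit 0 is reserved to mark the document root.

```python
def p_label(steps, tags, height, rooted=False):
    """Return an interval (lo, hi) for a tag path suffix (illustrative scheme).

    steps  -- tag names of the suffix, e.g. ["person", "name"]
    tags   -- list of all distinct tag names (assumed known)
    height -- maximum number of tags on any rooted path (assumed known)
    rooted -- True if the path starts at the document root
    """
    base = len(tags) + 1          # digit 0 is reserved for the root marker
    num_digits = height + 1       # up to `height` tags plus the root marker
    # The last tag of the path becomes the most significant digit.
    digits = [tags.index(t) + 1 for t in reversed(steps)]
    if rooted:
        digits.append(0)
    if len(digits) > num_digits:
        raise ValueError("path is longer than the assumed document height")
    prefix = 0
    for d in digits:
        prefix = prefix * base + d
    # All remaining (less significant) digits are free.
    width = base ** (num_digits - len(digits))
    lo = prefix * width
    return lo, lo + width - 1


def contains(outer, inner):
    """True if interval `inner` lies within interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


TAGS = ["people", "person", "name", "profile", "edu"]
name = p_label(["name"], TAGS, height=4)
person_name = p_label(["person", "name"], TAGS, height=4)
rooted_hit = p_label(["people", "person", "name"], TAGS, height=4, rooted=True)
rooted_miss = p_label(["people", "person", "profile", "name"], TAGS,
                      height=4, rooted=True)
```

With these definitions, contains(name, person_name) and contains(person_name, rooted_hit) both hold, whereas the interval for /people/person/profile/name lies inside the one for /name but outside the one for /person/name.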

P-labels, D-labels and textual contents of elements are all stored together, i.e., there is no separate path table as with XRel. Chen et al. suggest using a separate node table for all elements with the same tag name, similar to the Edge scheme above. Each element v is represented as a tuple consisting of v’s D-label (i.e., its start and end positions in the documents), the P-label of v’s tag path as well as the level of v and its textual content, if any. The P-labels are created on the fly for all tag paths encountered during indexing, based on schema statistics like the total number of distinct tags and the height of the document tree. (In this sense BLAS uses a schema-based storage scheme.)
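A node table of this kind, together with a look-up by P-label containment, might be sketched as follows using Python's built-in sqlite3 module. The table name name_nodes, the column names, the D-label positions and the concrete P-label values are all hypothetical and chosen only for illustration; storing a P-label as two integer columns is likewise an assumption of this sketch, not a detail taken from BLAS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One node table per tag name; this one holds all <name> elements.
# d_start/d_end form the D-label, p_lo/p_hi the P-label (hypothetical layout).
conn.execute("""
    CREATE TABLE name_nodes (
        d_start INTEGER,
        d_end   INTEGER,
        p_lo    INTEGER,
        p_hi    INTEGER,
        level   INTEGER,
        content TEXT
    )
""")

rows = [
    (3, 4, 4356, 4361, 3, "Ann"),   # reached by /people/person/name
    (7, 8, 4830, 4830, 4, "CV"),    # reached by /people/person/profile/name
]
conn.executemany("INSERT INTO name_nodes VALUES (?, ?, ?, ?, ?, ?)", rows)


def select_by_suffix(connection, p_lo, p_hi):
    """Return all name nodes whose P-label lies within [p_lo, p_hi]."""
    cursor = connection.execute(
        "SELECT d_start, d_end, content FROM name_nodes "
        "WHERE p_lo >= ? AND p_hi <= ? ORDER BY d_start",
        (p_lo, p_hi),
    )
    return cursor.fetchall()


# Hypothetical P-label of the query suffix //person/name.
hits = select_by_suffix(conn, 4320, 4535)
```

Here hits contains only the first tuple, since the P-label of the second one falls outside the interval of the query suffix.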

Similarly, when a query Q comes in, P-labels are created for all tag path suffixes in Q. The tag path suffixes in Q are obtained by extracting all sequences of consecutive non-branching Child steps from the query path expressions. For instance, the tree query Q3 = /people//person[name]//edu is cut into four path suffixes, namely, /people, /person, /name and /edu. Both the “//” symbol denoting a Child+ step and XPath predicates indicating a branch act as breakpoints for dividing path expressions into suffixes. These suffixes are looked up as P-labels in the node tables. The resulting four sets of people, person, name and edu nodes are then combined through structural joins on their D-labels, in order to filter out those quadruples which indeed form a subtree with the specified structure.
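The decomposition of a query into path suffixes can be sketched as follows. This is a simplified illustration rather than the procedure of Chen et al.: the function name split_into_suffixes is hypothetical, and the sketch assumes a restricted query syntax with tag names, "/" and "//" steps, and non-nested predicates that contain a single relative path.

```python
import re


def split_into_suffixes(query):
    """Cut a simple tree query into maximal chains of non-branching Child steps."""
    suffixes = []
    current = []          # tags of the chain currently being collected

    def flush():
        # Close the current chain (if any) and record it as a suffix.
        if current:
            suffixes.append("/" + "/".join(current))
            current.clear()

    # Tokens: '//', '/', a bracketed predicate, or a tag name.
    for token in re.findall(r"//|/|\[[^\]]*\]|[^/\[\]]+", query):
        if token == "//":
            flush()                       # descendant step acts as a breakpoint
        elif token == "/":
            continue                      # child step: keep extending the chain
        elif token.startswith("["):
            flush()                       # branching node acts as a breakpoint
            suffixes.extend(split_into_suffixes("/" + token[1:-1]))
        else:
            current.append(token)
    flush()
    return suffixes


q3_suffixes = split_into_suffixes("/people//person[name]//edu")
```

For Q3 this yields the four suffixes /people, /person, /name and /edu, in this order.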

The example above illustrates that path suffixes without Child+ steps are generally less selective than the original query paths (e.g., compare the four suffixes that BLAS extracts from Q3 to the three rooted query paths used by XRel above). To obtain more selective look-up predicates, Chen et al. propose two optimizations. First, longer path suffixes can be created for children of a branching query node: in Q3, e.g., we can use /person/name instead of /name because the person and name nodes are connected through a Child step. This might reduce the number of name nodes participating in the structural joins. However, the technique does not apply to the edu node in Q3, because of the descendant step. Thus BLAS still tolerates even more false hits on the schema level than XRel, despite its compositional path representation. Since only path suffixes are matched in the first place, there is no way to select only edu nodes below a specific person node in the schema tree, or even below any person at all, let alone to rule out combinations of person, name and edu nodes that do not belong to the same schema hit.

The second optimization makes use of schema information in a DTD (if available) to unfold (i.e., instantiate) path expressions like /people//person//edu in Q3 into a set of root-to-leaf paths without Child+ steps and tag wildcards. This way few look-ups for unselective path suffixes in the node table are replaced with many look-ups for very selective rooted tag paths, in a sort of query expansion. Note that the idea is similar to the path matching that XRel performs through string matching in the path table and that native systems realize by traversing the schema tree. However, with prescriptive schema information as specified by DTDs, the query expansion proposed by Chen et al. is likely to produce many tag paths that do not occur in the documents. For recursive DTDs the unfolding does not even terminate unless a maximum length for the resulting tag paths is fixed. Finally, the unfolding with BLAS seems to happen outside the RDBS, and it is not explained how this could be best done in the relational model.
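The unfolding step, including the length bound needed for recursive DTDs, might look as follows. This is only a sketch under simplifying assumptions: the DTD is reduced to a hypothetical dictionary mapping each element name to its permitted child elements, the query contains only tag names with "/" and "//" steps, and the function name unfold as well as the sample DTD are invented for illustration.

```python
import re


def unfold(dtd, root, query, max_len):
    """Enumerate rooted tag paths permitted by `dtd` that match `query`.

    Paths longer than `max_len` tags are not explored, which guarantees
    termination even if the DTD is recursive.
    """
    # Translate the query into a regular expression over rooted tag paths:
    # '//' may skip any number of intermediate tags, '/' allows none.
    pattern = re.compile("".join(
        ("(?:/[^/]+)*/" if axis == "//" else "/") + re.escape(tag)
        for axis, tag in re.findall(r"(//|/)([^/]+)", query)
    ))
    results = []

    def walk(path):
        text = "/" + "/".join(path)
        if pattern.fullmatch(text):
            results.append(text)
        if len(path) < max_len:
            for child in dtd.get(path[-1], []):
                walk(path + [child])

    walk([root])
    return results


# Hypothetical recursive DTD: a section may contain further sections.
DTD = {
    "people": ["person"],
    "person": ["name", "profile"],
    "profile": ["edu", "section"],
    "section": ["edu", "section"],
}
paths = unfold(DTD, "people", "/people//person//edu", max_len=5)
```

With a maximum length of five tags, the query expands to /people/person/profile/edu and /people/person/profile/section/edu. Raising max_len produces ever longer paths through nested section elements, whether or not such paths occur in the documents.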

7.5 Summary and Discussion

Given that today’s relational database technology is efficient, scalable, mature and widely deployed, the prospect of seamlessly integrating XML retrieval with RDBSs is particularly tempting. The literature abounds with different ways to store and query XML data as tuples. While many approaches depend on DTDs or other specifications of the document structure to choose a database schema, and some use labelling schemes as decentralized structural summaries of tree relations between individual tuples, very few relational storage schemes leverage the benefit of indexing schema information with a centralized structural summary. Systems that only index singleton elements with their tags, but not paths (as with the Edge scheme), must often join large node sets only to find out that few candidates are actually part of the query result. Sophisticated join algorithms have been developed as a compensation (like the Staircase Join by Grust et al. [2003] for XPath Accelerator). Still, experimental results such as the ones reported by Chen et al. [2004] or those presented in the next chapter show that path indexing can speed up query evaluation in RDBSs just as much as in a native or hybrid environment.

CHAPTER 7. XML RETRIEVAL IN RELATIONAL DATABASE SYSTEMS

The observations made in Chapter 5 for native path indexing also apply to relational systems. On the one hand, atomic path indices like XRel do prevent irrelevant elements from being retrieved and joined in certain cases, but their string representation of tag paths is redundant, awkward to match and of limited use for branching path expressions and recursive document collections. By separating document-level and schema-level information into two distinct tables, XRel can match schema constraints without accessing the full document data, but the resulting path information is often not precise enough to pick exactly the relevant elements in the node table. On the other hand, the compositional path representation of BLAS is quite compact, but produces even more false positives on the schema level than XRel and also requires query preprocessing outside the RDBS (for creating P-labels and unfolding query paths). Moreover, BLAS stores and compares both schema-level and document-level information in node tables, which means larger index scans during schema matching and more I/O needed for updates when the document structure changes.

The next chapter shows how to avoid these shortcomings to make relational XML retrieval benefit even more from path indexing with a centralized structural summary. The Relational CADG (RCADG) presented below is based on a compositional path representation which is simpler and more precise than BLAS. It builds on the interval labelling of schema nodes described for the ICADG [Weigel 2003] in Section 6.3.2. As a matter of fact, this approach is dual to BLAS in the following sense. In the ICADG, the interval label of a schema node represents all rooted tag paths with a common prefix. The interval of a longer tag path is contained in the intervals of shorter ones with the same prefix. For instance, the interval for /people/person contains the one for /people/person/name. By contrast, the P-labels used by BLAS represent sets of tag path suffixes. For the purpose of analogy, they may be regarded as interval-labelled nodes of a modified schema tree containing all inverse (i.e., leaf-to-root) tag paths or path suffixes in the documents. The examples above illustrate how indexing path prefixes rather than suffixes can reduce the number and size of intermediate results to be joined. The next chapter explains how the RCADG takes advantage of this observation.
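The prefix-based view can be made concrete with a generic preorder interval labelling of a schema tree. The sketch below is not claimed to be the labelling used by the ICADG; it only illustrates the containment property for path prefixes stated above. The function name label_schema_tree and the sample paths are assumptions made for this example.

```python
def label_schema_tree(paths):
    """Assign each rooted tag path an interval (lo, hi) by preorder numbering.

    The interval of a path contains the intervals of all paths it is a
    prefix of, i.e. of all its descendants in the schema tree.
    """
    # Build the schema tree as nested dictionaries.
    tree = {}
    for path in paths:
        node = tree
        for tag in path.strip("/").split("/"):
            node = node.setdefault(tag, {})

    labels = {}
    counter = 0

    def visit(node, prefix):
        nonlocal counter
        for tag, children in node.items():
            path = prefix + "/" + tag
            lo = counter              # preorder number of this schema node
            counter += 1
            visit(children, path)
            labels[path] = (lo, counter - 1)   # hi = last number in the subtree

    visit(tree, "")
    return labels


labels = label_schema_tree([
    "/people/person/name",
    "/people/person/profile/edu",
])
```

In this example the interval for /people/person is (1, 4), which contains the interval (2, 2) for /people/person/name as well as (4, 4) for /people/person/profile/edu. A look-up for all elements below a given person node in the schema tree thus becomes a single interval containment test.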

Chapter Eight: The Relational CADG (RCADG)

8.1 Overview

This chapter introduces the Relational CADG (RCADG), a new time- and space-efficient approach to XML retrieval in relational database systems. The aim of this work is to bring together sophisticated XML indexing techniques and the mature and highly optimized relational technology in order to get the best from both worlds. The RCADG builds on much of the work presented so far, most prominently: the BIRD labelling scheme explained in Chapter 4, a decentralized structural summary with powerful decision and reconstruction capabilities, and the CADG index presented in Chapter 6, a centralized structural summary that combines the schema tree in main memory with a materialization of the content/structure join on disk. The main contributions of the RCADG are (1) a relational storage scheme for the CADG and (2) query planning, translation and evaluation algorithms that together

1. leverage the full schema matching precision of the CADG in an RDBS,

2. preserve its compositional path representation to rule out many false schema hits early,

3. exploit the power of BIRD reconstruction to avoid needless disk I/O and joins of large intermediate results,

4. enable query planning and optimization based on path and keyword selectivity statistics and an analysis of reconstructible relations in the query, and

5. exploit standard relational techniques as much as possible.

The rest of this chapter discusses these issues in more detail. The next section explains the relational storage scheme used by the RCADG and outlines the query evaluation process from an intuitive point of view. Section 8.3 briefly reviews the child-balanced BIRD encoding introduced in Chapter 4, focusing on how to realize decision and reconstruction in the RDBS. Based on these preliminaries, Section 8.4 describes the nuts and bolts of XML retrieval with the RCADG, including query planning and rewriting as well as the generation of SQL code for query matching on the schema and document levels. Section 8.5 reports the results of comparing our implementations of the RCADG, XPath Accelerator and XRel schemes with the original CADG. Section 8.6 provides a quick wrap-up of the RCADG’s contributions compared to the related work reviewed in the previous chapter. The last section mentions some remaining issues and open questions.

In document Weigel, Felix (2006): Structural Summaries as a Core Technology for Efficient XML Retrieval. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 105-109)