Optimizations and Open Problems - Weigel, Felix (2006): Structural Summaries as a Core T

Weighting comparison criteria. As the discussion above illustrates, the decision as to which labelling scheme to use for a particular application depends on a number of different criteria and factors that must be weighed against each other. Most prominently, the importance of robustness depends on whether the document collection to be labelled is frequently updated and, if so, in which way (see Section 3.6). Similar constraints and preferences may apply to the storage available, the runtime performance (e.g., on large collections), and the support for handling specific tree relations. For instance, the fact that mPID and Virtual Nodes labels in general do not reflect the document order can be an important disadvantage, especially for the evaluation of XPath and XQuery, whose semantics rely heavily on node sets being sorted in document order. Lack of support for document order deeply affects the evaluation algorithm and seriously limits the use of most common structural join algorithms. However, if it is guaranteed that at any time during the query evaluation only labels are compared that belong to elements with the same tag path, then the mPID scheme may actually be a good choice, because mPID labels of such nodes do respect document order (see Section 3.4.2).7

Further criteria to be taken into account when choosing a suitable labelling scheme include the indexing performance (e.g., how many traversals of the document tree are needed for labelling), specific mappings to physical storage [Bremer and Gertz 2006] or other labelling schemes [Wang et al. 2003a], or whether global data structures such as the schema tree or an FST can be used [Gavoille and Peleg 2003; Peleg 1999]. Also, manipulating node labels in a restricted environment (such as standard SQL without user-defined extensions) may be an issue (see Chapter 7). For instance, some approaches require full regular expressions [Yoshikawa et al. 2001] or bitwise parsing [O’Neil et al. 2004], which may or may not be supported by the runtime environment.

As a general finding, however, the experiments in Section 4.6 show that the ability of a labelling scheme to reconstruct certain query constraints (most notably, parent_i) is key to efficient XML query evaluation. This is confirmed in different settings by Christophides et al. [2003] and by Lu et al. [2005]. Consequently, while the subtree encodings reviewed in Section 3.3 produce small node labels that can be used in structural joins to decide Child+ constraints, they are usually outperformed by schemes like BIRD that exploit the power of reconstruction. We empirically support this claim in further experiments to be presented later (see Chapter 8), where BIRD competes with the Pre/Post labelling (see Section 3.3.2) in a relational environment. The same effect can be expected for other schemes with reconstruction support, e.g., Dewey or ORDPATH. As the use of ORDPATH in a commercial RDBS [O’Neil et al. 2004] shows, these approaches are of great practical interest. The plain Dewey scheme is easy to implement and fairly robust, but needs of course a binary label encoding to prevent excess label size. ORDPATH is particularly attractive due to its support for unlimited updates, which in a highly dynamic setting will outweigh by far the loss of a little expressivity and space efficiency.
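The difference between deciding and reconstructing a relation can be illustrated with a minimal sketch. It assumes that a plain Dewey label is represented as a tuple of sibling positions and a Pre/Post label as a pair of preorder and postorder ranks; the function names and representations are illustrative assumptions, not definitions taken from the schemes discussed above.

```python
from typing import Optional, Tuple

Dewey = Tuple[int, ...]    # assumed representation: path of sibling positions, root = ()
PrePost = Tuple[int, int]  # assumed representation: (preorder rank, postorder rank)


def dewey_parent(label: Dewey) -> Optional[Dewey]:
    """Reconstruct the parent label from the child label alone."""
    return label[:-1] if label else None


def dewey_is_ancestor(u: Dewey, v: Dewey) -> bool:
    """Decide whether u is a proper ancestor of v (u is a proper prefix of v)."""
    return len(u) < len(v) and v[:len(u)] == u


def dewey_before(u: Dewey, v: Dewey) -> bool:
    """Document order: lexicographic tuple comparison, a prefix sorts first."""
    return u < v


def prepost_is_ancestor(u: PrePost, v: PrePost) -> bool:
    """Decide the ancestor relation; both labels must already be known."""
    return u[0] < v[0] and u[1] > v[1]
```

With the Dewey tuples the parent label is computed from a single label, whereas the Pre/Post test can only confirm a relation between two labels that have both been retrieved beforehand.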

4.8 Optimizations and Open Problems

Layered BIRD and unbalanced BIRD labelling. The comparison and experimental evaluation of multiple labelling schemes above has shown that the child-balanced, non-layered BIRD scheme is highly efficient and expressive. The practical performance and benefit of the Layered BIRD labelling outlined in Section 4.5.2 remains to be evaluated. As a matter of fact there is also an unbalanced variant of BIRD labelling, which emerges naturally when fixing a balancing factor of b = 0. Additional work omitted here shows that the unbalanced labelling scheme creates labels and weights that are smaller and less likely to be affected by node insertions. Intuitively, this is explained by the fact that without balancing fewer document nodes are forced to have the same weight and hence labels that are multiples of a specific number. While a weight overflow in any balanced BIRD scheme invalidates the weights and labels of all document nodes that are represented by a sibling, cousin, ... of the schema node causing the overflow, the unbalanced BIRD labelling restricts this to those elements with exactly the same schema node.

However, without balancing certain tree relations such as i-th-child or nextSib_i can no longer be reconstructed. Furthermore, the creation of unbalanced labels turns out to be more complex than in the balanced case. In particular, the memory consumption during labelling is probably prohibitively high because for each element visited in the first pass through the document tree, the sequence of tags of its children must be recorded, rather than only the number of children as in the current labelling procedure. This issue would need to be solved before the unbalanced BIRD scheme might become a more space-efficient and robust alternative to the balanced BIRD labelling described above.

7 In fact we exploit this feature, to the benefit of mPID, in our experiments with the X² system, whose query kernel processes

Structural summaries of document subtrees. By contrast, there are other ways in which the BIRD scheme could be optimized to obtain labels that are smaller and more robust against modifications of the document tree (most notably, node insertions at arbitrary positions). As suggested by the position of BIRD in the trade-off space in Figure 4.10 on page 65, these are the major challenges faced by our approach. A possible technique for reducing the size of BIRD labels and weights has been hinted at in Section 4.6.1. There we sketched an alternative structural summary which differs from the schema tree that we used as weight index throughout this chapter. Currently all document nodes with the same tag path are assigned the same weight, as stated by the first invariant on page 43. Obviously this may cause many labels to be reserved for virtual nodes, namely, when some document nodes with a given path have a large subtree (and hence, a large weight) while other document nodes with the same tag path would only need a much smaller weight. The sample document in Figure 4.3 a. on page 46 illustrates this effect: although the node with the BIRD label 36 (the rightmost child of the document root) has only two children, which would require a BIRD weight of 3 (see Section 4.2.1), the actual weight of the node 36 is 9. The reason is that other document nodes with the same tag path as node 36 (namely, its three siblings 9, 18 and 27) all have larger subtrees which do not fit a weight of 3.
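The arithmetic behind this example can be sketched as follows. The sketch rests on simplifying assumptions that are not spelled out in this section: the two children of node 36 are leaves of weight 1, a node needs room for itself plus its children, every label is a multiple of the node's weight, and the parent label is obtained by rounding a label down to the nearest multiple of the parent's weight. The helper names and the child labels 37 and 38 are hypothetical.

```python
def required_weight(num_children: int, child_weight: int = 1) -> int:
    """Smallest weight with room for the node itself plus its children
    (assumption: all children share one weight, as in the balanced case)."""
    return (num_children + 1) * child_weight


def reserved_unused(actual_weight: int, num_children: int, child_weight: int = 1) -> int:
    """Number of labels reserved below a node beyond what it actually needs."""
    return actual_weight - required_weight(num_children, child_weight)


def parent_label(label: int, parent_weight: int) -> int:
    """Assumed reconstruction rule: round down to a multiple of the parent's weight."""
    return label - label % parent_weight


# Node 36 needs a weight of 3 but carries the weight 9 shared by its tag path.
needed = required_weight(2)        # 3
wasted = reserved_unused(9, 2)     # 6 labels kept free for virtual nodes

# The siblings named in the text are all multiples of the shared weight 9.
sibling_labels = [9, 18, 27, 36]
aligned = all(label % 9 == 0 for label in sibling_labels)

# Hypothetical child labels 37 and 38 both map back to node 36.
parents = [parent_label(37, 9), parent_label(38, 9)]
```

Under these assumptions, two thirds of the label range reserved below node 36 stays unused, which is the overhead that a summary based on subtree size would aim to avoid.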

It therefore seems promising to decouple the weights from the tag paths by using a structural summary in which every node represents elements with a similar subtree size, rather than elements with the same tag path. As a matter of fact, BIRD can be used with a variety of structural summaries covered by Definition 2.5 on page 11. The only restriction is that the Child relation on document nodes must be preserved by the structural summary in the obvious sense, so that ancestor weights are available when reconstructing parent_i. Clearly this is true for the schema tree: recall from Section 2.3 that given two document nodes u and v with respective tag paths π(u) and π(v), if we have a D-constraint Child(u,v) in the document tree then the corresponding S-constraint Child′(π(u),π(v)) holds true in the schema tree. An open question is which other structural summaries could be used that satisfy the above condition and at the same time treat elements as equivalent that have subtrees of a similar size or structure. Note that this could not only help to decrease labels and weights, but could also make BIRD labelling more robust: after all, weight changes caused by overflows would no longer propagate to all document nodes with the same tag path, regardless of their subtree size. Instead only nodes with a specific sort of subtree would be affected. Depending on how heterogeneous the document structure is, this may mean that many node labels that are currently invalidated for no reason remain unchanged.
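The condition on the Child relation can be checked mechanically for a candidate summary. The following minimal sketch assumes that the document tree is given as a list of parent/child pairs, the summary as a set of edges, and the assignment of document nodes to summary nodes as a dictionary; all names and the toy data are hypothetical and serve only to illustrate the condition stated above.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def preserves_child(
    doc_edges: Iterable[Edge],
    summary_edges: Set[Edge],
    summary_of: Dict[Hashable, Hashable],
) -> bool:
    """True if every document edge (u, v) maps to an edge in the summary."""
    return all((summary_of[u], summary_of[v]) in summary_edges for u, v in doc_edges)


# Hypothetical document: root r with children b1 (four children) and b2 (one child).
DOC_EDGES = [("r", "b1"), ("r", "b2"),
             ("b1", "c1"), ("b1", "c2"), ("b1", "c3"), ("b1", "c4"),
             ("b2", "c5")]

# Summary 1: the schema tree, grouping nodes by tag path.
BY_PATH = {"r": "/a", "b1": "/a/b", "b2": "/a/b",
           **{c: "/a/b/c" for c in ("c1", "c2", "c3", "c4", "c5")}}
PATH_EDGES = {("/a", "/a/b"), ("/a/b", "/a/b/c")}

# Summary 2: an assumed size-aware grouping that separates large and small b nodes.
BY_SIZE = {"r": "a", "b1": "b-large", "b2": "b-small",
           **{c: "c" for c in ("c1", "c2", "c3", "c4", "c5")}}
SIZE_EDGES = {("a", "b-large"), ("a", "b-small"), ("b-large", "c"), ("b-small", "c")}
```

Both toy summaries pass the check, while a summary lacking the edge from b-small to c would fail it, because the weight of the summary node representing the parent of c5 could then not be looked up.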

Part III

Index Structures for XML

CHAPTER FIVE

Index Structures for Structured Documents

5.1 Overview

This chapter surveys existing techniques for indexing both the structure and the textual contents of XML documents. The various table- or tree-shaped data structures presented here are all instances of centralized structural summaries (see Definition 2.5 on page 11). As such they could in principle be complemented by decentralized summaries such as those discussed before (see Chapters 3 and 4). From the wealth of centralized approaches to capturing the structure of XML documents, only a few representative indexing schemes can be reviewed within the scope of this thesis. For a more detailed survey, the reader is referred to earlier work [Weigel 2002].

In document Weigel, Felix (2006): Structural Summaries as a Core Technology for Efficient XML Retrieval. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 78-83)