3.2 Succinct Data Structures
3.2.2 Succinct Tree Representations
3.2.2.1 Fully-functional Succinct Tree
The main component of this representation is a novel data structure called the range min-max tree. This single structure suffices to answer in constant time not only the core operations but also the more complex ones. This approach differs from previous works, in which each operation requires its own auxiliary data structure [MRR01, MR04, CLL05, Sad07, LY08].
The fully-functional succinct tree proposal reduces the large number of relevant tree operations considered in the literature to a few primitives that are efficiently carried out by the range min-max tree. Let P[0 . . . n − 1] be a balanced parentheses sequence representing a tree, and let $excess(i) = rank_{(}(i) - rank_{)}(i)$ be a function that gives the difference between the numbers of opening and closing parentheses in P[0 . . . i]. Note that when P[i] is an opening parenthesis, excess(i) is the depth of the corresponding node, while for a closing parenthesis it is the depth minus 1. Then, the core parentheses operations can be defined as:
• findclose(i) returns the position j of the closing parenthesis matching the opening parenthesis at P[i]: $\min_{j>i}\{j \mid excess(j) = excess(i) - 1\}$.
• findopen(i) returns the position j of the opening parenthesis matching the closing parenthesis at P[i]: $\max_{j<i}\{j \mid excess(j - 1) = excess(i)\}$.
• enclose(i) returns the position j of the opening parenthesis enclosing the opening parenthesis at P[i], that is, the opening parenthesis corresponding to the parent of the node: $\max_{j<i}\{j \mid excess(j - 1) = excess(i) - 2\}$.
Now, let us consider excess(i, j) = excess(j) − excess(i − 1).¹⁶ Two primitive operations constitute the kernel of the fully-functional (FF) approach:
• fwd_search(i, d) returns the smallest j > i such that excess(i, j) = excess(j) − excess(i − 1) = d.
• bwd_search(i, d) returns the greatest j < i such that excess(j, i) = excess(i) − excess(j − 1) = d.
These operations can be used to express the aforementioned core parenthesis operations (the basis of basic tree operations such as parent, subtreesize, nextsibling, or prevsibling [MR01]), together with other sophisticated tree operations:
findclose(i) ≡ fwd_search(i, 0)
findopen(i) ≡ bwd_search(i, 0)
enclose(i) ≡ bwd_search(i, 2)
level_ancestor(i, d) ≡ bwd_search(i, d + 1)
level_next(i) ≡ fwd_search(findclose(i), 0)
level_prev(i) ≡ findopen(bwd_search(i, 0))
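The equivalences above can be checked with a minimal Python sketch. It is an illustration only: the two primitives are implemented as plain linear scans over an explicit excess array, so nothing here runs in constant time, and the helper name excess_array and the convention excess(−1) = 0 are assumptions made for this example.

```python
def excess_array(P):
    """excess(i) for every i: count of '(' minus count of ')' in P[0..i]."""
    E, e = [], 0
    for c in P:
        e += 1 if c == "(" else -1
        E.append(e)
    return E

def fwd_search(E, i, d):
    """Smallest j > i with excess(j) - excess(i-1) == d, or None."""
    target = (E[i - 1] if i > 0 else 0) + d   # assumption: excess(-1) = 0
    for j in range(i + 1, len(E)):
        if E[j] == target:
            return j
    return None

def bwd_search(E, i, d):
    """Greatest j < i with excess(i) - excess(j-1) == d, or None."""
    target = E[i] - d
    for j in range(i - 1, -1, -1):
        if (E[j - 1] if j > 0 else 0) == target:
            return j
    return None

def findclose(E, i):
    return fwd_search(E, i, 0)

def findopen(E, i):
    return bwd_search(E, i, 0)

def enclose(E, i):
    return bwd_search(E, i, 2)

def level_ancestor(E, i, d):
    return bwd_search(E, i, d + 1)

def level_next(E, i):
    c = findclose(E, i)
    return None if c is None else fwd_search(E, c, 0)

def level_prev(E, i):
    c = bwd_search(E, i, 0)   # closing parenthesis of the previous node at this depth
    return None if c is None else findopen(E, c)

# The parentheses sequence of Figure 3.15.
P = "(()((()())())(()())())"
E = excess_array(P)
print(findclose(E, 3))   # 12
```

Only fwd_search and bwd_search touch the excess values directly; every other operation is a one-line composition of them, which is the reduction the FF approach relies on.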
Hence, the efficiency of FF stems from its ability to compute fwd_search and bwd_search in constant time thanks to the range min-max tree. This data structure is built over the (virtual) array of excess(i) values as follows. The sequence P is split into blocks of size s = w/2.¹⁷ Then, for each block, the minimum and maximum excess values within the block are stored. After that, blocks are recursively assembled into groups of size k = O(w/log w), in such a way that each newly formed superblock stores the minimum and maximum excess within the blocks it holds. This results in a k-ary balanced search tree, the so-called range min-max tree. The total amount of space used is O(n log(s)/s) = o(n) bits. Figure 3.15 shows an example of a range min-max tree, where s = k = 3.
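The construction just described can be sketched in a few lines of Python. This is an illustrative sketch only: it materializes the excess array and stores explicit tuples, whereas the actual structure keeps the array virtual and packs its values into o(n) bits. The function name build_rmm_tree and the tuple layout (first position, last position, minimum, maximum) are assumptions made for this example.

```python
def build_rmm_tree(P, s, k):
    """Return the tree level by level, leaves first.

    Each node is a tuple (lo, hi, min, max): the range P[lo..hi] it covers
    and the minimum and maximum excess inside that range.
    """
    if s < 1 or k < 2:
        raise ValueError("need s >= 1 and k >= 2")
    # Excess array: +1 for '(' and -1 for ')', accumulated left to right.
    E, e = [], 0
    for c in P:
        e += 1 if c == "(" else -1
        E.append(e)
    n = len(E)
    # Leaf level: one node per block of s consecutive positions.
    level = [(lo, min(lo + s, n) - 1, min(E[lo:lo + s]), max(E[lo:lo + s]))
             for lo in range(0, n, s)]
    levels = [level]
    # Group k consecutive nodes into a parent until a single root remains.
    while len(level) > 1:
        groups = [level[p:p + k] for p in range(0, len(level), k)]
        level = [(g[0][0], g[-1][1], min(x[2] for x in g), max(x[3] for x in g))
                 for g in groups]
        levels.append(level)
    return levels

# The parentheses sequence of Figure 3.15, with s = k = 3 as in the figure.
P = "(()((()())())(()())())"
for level in build_rmm_tree(P, s=3, k=3):
    print(level)
```

With these parameters the sketch produces eight leaves, three internal nodes with minimum/maximum 1/4, 1/3 and 0/2, and a root with 0/4, which are the values shown in Figure 3.15.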
To compute fwd_search(i, d) by using the range min-max tree, we first check if the answer is in the block i belongs to. Let us consider that this block, q = ⌊i/s⌋, corresponds to range [l_q, r_q] of P. The block scanning is done in constant time, with table lookups over a simple precomputed table.¹⁸ If unsuccessful, the range [r_q + 1, n − 1] of P, represented by range min-max tree nodes, is then examined. For each node, we verify if its minimum/maximum excess range, translated into absolute values, contains excess(i − 1) + d. Once the proper range min-max tree node is found, we know that the answer to fwd_search(i, d) lies within it. If it corresponds to an internal node, we iteratively go down finding the leftmost child that contains the desired excess,¹⁹ until reaching a leaf block, which will be finally scanned to find the exact value by table lookups, as before. An analogous procedure is performed to compute bwd_search(i, d).

¹⁶ Notice that |excess(i) − excess(i − 1)| = 1 for all i. If P[i] is an opening parenthesis, then excess(i) − excess(i − 1) = 1. If P[i] is a closing parenthesis, then the same subtraction results in −1.
¹⁷ Remember that w is the machine word length and that w ≥ log n.
¹⁸ This table stores, for all the different s-bit streams that constitute the different blocks of size s in P, the position where a target excess occurs.

i:       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
P:       (  (  )  (  (  (  )  (  )  )  (  )  )  (  (  )  (  )  )  (  )  )
excess:  1  2  1  2  3  4  3  4  3  2  3  2  1  2  3  2  3  2  1  2  1  0

Nodes, given as label [range of P] min/max excess:
a [0..21] 0/4 (root)
  b [0..8] 1/4:   c [0..2] 1/2,   d [3..5] 2/4,   e [6..8] 3/4
  f [9..17] 1/3:  g [9..11] 2/3,  h [12..14] 1/3, i [15..17] 2/3
  j [18..21] 0/2: k [18..20] 1/2, l [21..21] 0/0

Figure 3.15: An example of the range min-max tree.
For instance, let us compute findclose(3) = fwd_search(3, 0) in the example of Figure 3.15. Notice that this is equivalent to finding the first j > 3 such that excess(j) = excess(3 − 1) + 0 = excess(2) = 1. Therefore, we start by examining block ⌊3/s⌋ = 1, that is, node d in Figure 3.15. Since the target value 1 is not in that block, we continue the process by checking the minimum/maximum values of the nodes that cover the range [6 . . . 21], which turn out to be nodes e ([6 . . . 8]), f ([9 . . . 17]), and j ([18 . . . 21]). In this way, we next examine node e. Node e does not contain the answer either, so we examine node f. Because 1 ≤ 1 ≤ 3, that is, the minimum and maximum values of f enclose the target value, the answer must exist in its subtree. Therefore we explore the children of f from left to right and find the leftmost one that contains the target value. In this case, it is node h. Given that it is already a leaf, we just scan its content using a precomputed table, and obtain that the answer to findclose(3) is 12.
¹⁹ Again, it is done in constant time, by using a precomputed table that provides, for all the patterns of k/c (c being a constant) minimum/maximum values stored in the children of a node of the range min-max tree, the first child of the node whose minimum and maximum values enclose the target value.
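The fwd_search procedure and the findclose(3) walk-through can be traced with the following Python sketch. It is a simplified illustration under stated assumptions, not the constant-time structure itself: the precomputed tables are replaced by plain scans (over a leaf block and over the children of a node), the excess array is stored explicitly, and all function names are hypothetical. Nodes are addressed by level and index, so the children of node r are nodes r·k to r·k + k − 1 one level below.

```python
def excess_array(P):
    E, e = [], 0
    for c in P:
        e += 1 if c == "(" else -1
        E.append(e)
    return E

def build_levels(E, s, k):
    """levels[0][q] = (min, max) of leaf block q; each upper node groups k nodes."""
    level = [(min(E[lo:lo + s]), max(E[lo:lo + s])) for lo in range(0, len(E), s)]
    levels = [level]
    while len(level) > 1:
        level = [(min(m for m, _ in level[p:p + k]),
                  max(M for _, M in level[p:p + k]))
                 for p in range(0, len(level), k)]
        levels.append(level)
    return levels

def scan(E, lo, hi, target):
    """Left-to-right scan of E[lo..hi]; stands in for the table lookup."""
    for j in range(lo, min(hi, len(E) - 1) + 1):
        if E[j] == target:
            return j
    return None

def descend(E, levels, s, k, t, r, target):
    """From node r at level t, follow the leftmost enclosing child down to a leaf."""
    while t > 0:
        t -= 1
        children = range(r * k, min(r * k + k, len(levels[t])))
        # Excess changes by +-1 per position, so min <= target <= max
        # guarantees that the target occurs somewhere in that child.
        r = next(c for c in children
                 if levels[t][c][0] <= target <= levels[t][c][1])
    return scan(E, r * s, r * s + s - 1, target)

def fwd_search(E, levels, s, k, i, d):
    """Smallest j > i with excess(j) - excess(i-1) == d, or None."""
    target = (E[i - 1] if i > 0 else 0) + d
    q = i // s
    j = scan(E, i + 1, q * s + s - 1, target)   # rest of the block holding i
    if j is not None:
        return j
    t, p = 0, q                                  # current level and node index
    while t < len(levels) - 1:
        last = min((p // k + 1) * k, len(levels[t]))
        for r in range(p + 1, last):             # right siblings of node p
            mn, mx = levels[t][r]
            if mn <= target <= mx:
                return descend(E, levels, s, k, t, r, target)
        t, p = t + 1, p // k                     # climb to the parent
    return None

P = "(()((()())())(()())())"                     # sequence of Figure 3.15
E = excess_array(P)
levels = build_levels(E, 3, 3)
print(fwd_search(E, levels, 3, 3, 3, 0))         # findclose(3): 12
```

On the example, the call scans the rest of block d, rejects sibling e, climbs to the parent level, accepts f because 1 ≤ 1 ≤ 3, descends to h and finds position 12, the same path as in the walk-through.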
Chapter 4
XML Storage and Querying - State of the Art Revision
Since their introduction, the growing interest in XML query languages and the challenges they pose have triggered much research into efficient solutions, either as theoretical proposals or in the form of real systems. Likewise, alongside the development of query-oriented systems, several works have addressed the space overhead that the verbosity of XML documents entails, in the form of XML compression techniques. Many of these methods also try to keep some kind of query support, leading to the so-called queriable compression tools.
In this chapter, we review some of the most relevant solutions from both areas. Section 4.1 first presents some well-known systems specifically designed to provide XML query support, either as streaming approaches (Section 4.1.1) or as indexed proposals (Section 4.1.2). In turn, Section 4.2 focuses on XML compression, and starts by introducing a classification of XML compressors in Section 4.2.1. Then, Sections 4.2.2 and 4.2.3 close the chapter by providing a detailed description of the most important queriable and non-queriable XML compression tools.
4.1 XPath Query Systems
Regarding the XPath query language, typical query systems are usually divided into two different categories: those that follow a streaming approach (such as XSQ [PC05], SPEX [Olt07] and GCX [SSK07]), which must read the document sequentially to answer each query; and the indexed ones (such as Galax [FSC+03], Saxon [Kay08], Qizx/DB [Qiz], MonetDB/XQuery [BGvK+06], etc.), which first preprocess the document to build additional data structures over it that are then used to solve the queries without sequentially traversing the whole document. Indexed approaches can be further categorized into in-memory engines and database systems. Next, we describe some of the most representative examples from each category.