3.1.8 One Step beyond Text Compression
Word-based byte-oriented compression techniques are well-established solutions for natural language text databases, since they achieve competitive compression ratios, fast random access, and direct sequential searching. For semi-static statistical methods, compression has gone one step further. A novel reorganization of the codeword bytes of any natural language text compressed with an encoding scheme of this category has recently been presented [BFLN08, BFLN12]. This codeword rearrangement, called Wavelet Trees on Bytecodes (WTBC) for its similarity to the original wavelet trees [GGV03], basically consists of placing the different bytes of each codeword at different nodes of a tree, instead of concatenating them sequentially as in a typical compressed text. This minor change nevertheless leads to a new, implicitly indexed representation of the compressed text in which search times improve drastically while using a negligible amount of additional space. In fact, the experimental results in [BFLN12] show that WTBC is much more efficient not only than sequential searches over compressed text, but also than explicit inverted indexes when little extra space is used. WTBC is especially effective when searching for single words and short phrases. This structure provided the starting point that inspired this thesis. We next describe it conceptually in detail.
The essence of this codeword rearrangement is the following: the root of the WTBC is formed by the first bytes of all the codewords, following the same order as the words they encode in the original text. That is, let us assume we have the text words $\langle w_1, w_2, \ldots, w_n \rangle$, whose codewords are $cw_1, cw_2, \ldots, cw_n$, respectively, and let us denote the bytes of a codeword $cw_i$ as $\langle cw_i^1 \ldots cw_i^m \rangle$, where $m$ is the size of the codeword in bytes. The root is then formed by the byte sequence $\langle cw_1^1, cw_2^1, cw_3^1, \ldots, cw_n^1 \rangle$. At position $i$, we place the first byte of the codeword that encodes the $i$-th word in the source text, so the root node has as many bytes as the text has words.
We consider the root of the tree as the first level. Therefore, the second bytes of the codewords longer than one byte are placed in nodes of a second level. The root has as many children as there are different bytes that can be the first byte of a codeword of two or more bytes. For instance, in a (190, 66)-DC encoding scheme, the root will always have 66 children, because there are 66 bytes that are continuers. Each node $X$ in this second level contains all the second bytes of the codewords whose first byte is $x$, again following the order of the source text. That is, the second byte corresponding to the $j$-th occurrence of byte $x$ in the root is placed at position $j$ in node $X$. Formally, let us suppose there are $f$ words coded by codewords $cw_{i_1}, \ldots, cw_{i_f}$ (longer than one byte) whose first byte is $x$. Then the second bytes of those codewords, $\langle cw_{i_1}^2, cw_{i_2}^2, cw_{i_3}^2, \ldots, cw_{i_f}^2 \rangle$, form the node $X$ in the second level. The same idea is used to create the lower levels of the tree. Continuing the example, and supposing that there are $d$ words whose codewords have $x$ as first byte and $y$ as second byte, node $XY$ is a node of the third level, child of node $X$, and it stores the byte sequence $\langle cw_{j_1}^3, cw_{j_2}^3, cw_{j_3}^3, \ldots, cw_{j_d}^3 \rangle$ given by all the third bytes of those codewords. Those bytes again follow the original text order. Therefore, the resulting tree has as many levels as the longest codeword has bytes.
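The level-by-level construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [BFLN12]: the function name `build_wtbc`, the dictionary-of-lists representation of the nodes, and the toy four-word code are assumptions introduced here. Byte values are written as small integers, and each node is labelled by the tuple of bytes that leads to it (the empty tuple for the root).

```python
from collections import defaultdict


def build_wtbc(words, code):
    """Spread the codeword bytes of `words` over the nodes of a WTBC.

    `code` maps each word to its codeword, given as a tuple of byte values.
    The result maps a node label (the bytes leading to the node, () for the
    root) to the list of bytes stored in that node.
    """
    nodes = defaultdict(list)
    for word in words:                      # words are visited in text order,
        codeword = code[word]               # so every node keeps text order
        for k in range(len(codeword)):
            # The (k+1)-th byte goes to the node labelled by the first k bytes.
            nodes[codeword[:k]].append(codeword[k])
    return dict(nodes)


# Hypothetical prefix-free code over a toy vocabulary (an assumption made
# for illustration, not a codeword assignment taken from the source).
code = {"A": (0,), "B": (2, 0), "C": (2, 1), "D": (3, 2, 1)}
text = ["B", "A", "D", "C", "A", "B"]

tree = build_wtbc(text, code)
for label in sorted(tree):
    print(label, tree[label])
```

The root receives one byte per text word, node (2,) receives the second bytes of the three codewords starting with byte 2 in text order, and the deepest node sits at the third level because the longest codeword has three bytes.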
[Figure content. Text: "MAKE EVERYTHING AS SIMPLE AS POSSIBLE BUT NOT SIMPLER". Vocabulary (frequency): AS (2), POSSIBLE (1), SIMPLE (1), EVERYTHING (1), NOT (1), MAKE (1), BUT (1), SIMPLER (1). Root node, positions 1 to 9: b3 b2 b0 b2 b0 b1 b1 b3 b3. The child nodes at the second and third levels hold the second and third bytes of the codewords.]
Figure 3.10: Example of WTBC structure.
To better understand this reorganization of codewords, Figure 3.10 shows an example where a WTBC is built from the text MAKE EVERYTHING AS SIMPLE AS POSSIBLE BUT NOT SIMPLER, with the alphabet Σ = {AS, BUT, EVERYTHING, MAKE, NOT, POSSIBLE, SIMPLE, SIMPLER}. Once codewords are assigned to all the different words in the text, using any word-based, byte-oriented semi-static statistical compressor, their bytes are spread over a tree following the reorganization explained above. That is, all the first bytes of the codewords are placed in the root following the text order, while the remaining bytes are in the corresponding nodes of the following levels. For example, b3 is the 9th byte of the root because it is the first byte of the codeword assigned to 'SIMPLER', which is the 9th word in the text. In turn, its second byte, b1, is placed in the third position of the child node B3 because 'SIMPLER' is the third word in the root having b3 as first byte. Likewise, its third byte, b2, is placed at the third level in the child node B3B1, since the first and second bytes of the codeword are b3 and b1, respectively. Observe that only the shaded byte sequences are stored; the rest of the text is shown only for clarity. Notice that the space needed for all the nodes of a WTBC representation matches the size of the text compressed with the compression method used to create the WTBC structure. That is, WTBC only reorganizes the codeword bytes. Yet this simple rearrangement provides important implicit indexing properties, which have a definite impact on the searching capabilities of this structure [BFLN12].
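The walk through the example, and the claim that the nodes occupy exactly the space of the compressed text, can be checked with a small Python sketch. The codeword table below is hypothetical: it was chosen to be prefix-free and to agree with the root sequence and with the codeword b3 b1 b2 of 'SIMPLER' described in the text, and it is not claimed to be the assignment used in Figure 3.10. Bytes b0 to b3 are written as the integers 0 to 3, positions are 0-based, and a plain linear count stands in for whatever rank structure an actual implementation would use; `build_wtbc` and `decode_at` are names introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical prefix-free codewords (b0..b3 written as 0..3). Assumption:
# only the root sequence and SIMPLER = b3 b1 b2 are taken from the text.
CODE = {
    "AS": (0,), "POSSIBLE": (1, 3), "BUT": (1, 2, 3),
    "SIMPLE": (2, 1), "EVERYTHING": (2, 3),
    "MAKE": (3, 0), "NOT": (3, 2), "SIMPLER": (3, 1, 2),
}
TEXT = "MAKE EVERYTHING AS SIMPLE AS POSSIBLE BUT NOT SIMPLER".split()


def build_wtbc(words, code):
    """Place the k-th byte of each codeword in the node labelled by its prefix."""
    nodes = defaultdict(list)
    for word in words:
        codeword = code[word]
        for k in range(len(codeword)):
            nodes[codeword[:k]].append(codeword[k])
    return dict(nodes)


def decode_at(nodes, words_by_codeword, i):
    """Recover the word at 0-based text position i by descending the tree."""
    label, pos = (), i
    while True:
        byte = nodes[label][pos]
        codeword = label + (byte,)
        if codeword in words_by_codeword:   # prefix-free: codeword is complete
            return words_by_codeword[codeword]
        # The j-th occurrence of `byte` in this node maps to position j of the
        # child node; with 0-based positions that is the count before `pos`.
        pos = nodes[label][:pos].count(byte)
        label = codeword


nodes = build_wtbc(TEXT, CODE)
words_by_codeword = {cw: w for w, cw in CODE.items()}

stored_bytes = sum(len(seq) for seq in nodes.values())
compressed_bytes = sum(len(CODE[w]) for w in TEXT)
decoded = [decode_at(nodes, words_by_codeword, i) for i in range(len(TEXT))]

print(nodes[()])                       # root: first byte of every codeword
print(stored_bytes, compressed_bytes)  # both count every codeword byte once
print(decoded[8])                      # word at the 9th text position
```

Under these assumptions the byte of 'SIMPLER' in node B3 sits at the third position and node B3B1 holds its last byte, as in the description above, and decoding every position returns the original word sequence.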