5.2 XWT Basic Procedures
5.2.1 Decompression
5.2.1.1 Random Word Decompression
If we want to decode a word at any position of the document we will use rank operations and perform a top-down traversal of the XWT. Hence, to extract a random document word j (random decompression), we rst access the jth byte of
the root node of the XWT to get the rst byte of its codeword. If according to the encoding scheme it is the last byte of a codeword (that is, it is a stopper), we nish the procedure. However, if the codeword has more than one byte (since the read byte is a continuer), we will continue traversing the XWT top-down to get the rest of the bytes. Notice that, at this point, we have to check if the byte read, bi, matches any of the continuers reserved to mark the codewords of any word of
the special vocabularies. Depending on this, going down in the XWT to obtain the remaining bytes will be done by using the s and c values of the corresponding vocabulary. Whatever the case, by reading bi as the rst byte, we already know
that the second byte of the codeword is stored in the node Bi. As all the words whose codeword starts by byte bi will have their second bytes placed at node Bi,
we only have to count how many times the byte bi occurs in the root node until
position j. So we compute rankbi(Root, j) = k, that tells us that the second byte
of the codeword we are decoding is the kthbyte of node Bi. Again, if that byte is
not yet the last one (that is, if it is not a stopper), we proceed in a same way until the last byte of the codeword is reached. Algorithm 5.2 depicts the pseudocode to decode a word at a given position of the document.
Algorithm 5.2:Display text position x Input: p, a position of the document
Output: w, word at position p in the document 1. cN ode← root; cw ←ø 2. b← XW T [cNode][p] 3. cw← cw || b 4. while b is a continuer do 5. p← rankb(XW T [cN ode], p) 6. cN ode← getChildNode(cNode, b) 7. b← XW T [cNode][p] 8. cw← cw || b 9. w← getW ord(cw) 10. return w
Example To know which is the third word in the source document of Figure 5.2, we will proceed as follows. We start by reading the third byte of the XWT root, that is, we get Root[3] = b3. According to the encoding scheme, we know that byte
b3is a continuer, but also that it is one of the reserved continuers, in particular, that
reserved to mark tags codewords. So, on the one hand, we know that the codeword is not complete yet, and we will have to read a second byte in the second level of the XWT, more precisely, in node B3, since it holds all the codewords starting by b3. On the other hand, we also know that hereafter the process will continue
by using the encoding scheme associated to the tags vocabulary. Therefore, the next step will be to nd out which position of node B3 we have to read. By using rankb3(Root, 3) = 2we obtain that there are 2 bytes b3 in the root until position
3. Thereby, B3[2] = b6, gives us the second byte of the codeword we are looking
for. Again b6 is not a stopper, so we need to continue the procedure. In the child
node B3B6, that corresponds to the rst two read bytes of the codeword we are decoding, we have to read the byte at position rankb6(B3, 2) = 2. Finally, we obtain
B3B6[2] = b0. But b0 is a stopper and, therefore, it marks the end of the searched
codeword. The complete codeword is b3b6b0, corresponding to the start-tag <film,
5.2.1.2 Full Text Extraction
If we want to decompress the whole document from the beginning (full decompres- sion), we can proceed by extracting each word individually. However, we can take advantage of a more ecient procedure. Full decompression implies sequentially covering the bytes of the root node and getting the codewords whose rst byte is stored there. Then the same process as the previous seen to decode a word could be applied from the beginning of the root, j = 1. But, given that the sequences of bytes of all the XWT nodes follow the original order of the words in the source document, full decompression can be eciently implemented using pointers to the next positions to be read in each node. That is, when going to a child node to read the following byte of an uncomplete codeword, we do not need to compute any rank operation to nd out which position of this child node sequence we have to read. It always will be the next one to process in that child node. The pseudocode for full decompression is described in Algorithm 5.3.
Algorithm 5.3:Full text extraction Output: d, original XML document
1. foreach node ∈ nodeS do
2. track[node]← 1 3. d←ø 4. foreach i = 1 . . . sizeNodeS[root] do 5. cN ode← root; cw ←ø 6. b← XW T [cNode][i] 7. cw← cw || b 8. while b is a continuer do 9. cN ode← getChildNode(cNode, b) 10. b← XW T [cNode][track[cNode]] 11. cw← cw || b
12. track[cN ode]← track[cNode] + 1
13. d← d || getW ord(cw)
14. return d
We will illustrate this procedure with the example of Figure 5.2. The rst step consists of initializing an array, we call track, that holds the positions of the rst unprocessed entry of each XWT node with the value 1. Then we start by reading the byte at position 1 in the root node. Since it is a continuer, b3, we know that
the codeword is not complete, so we have to move to the second level of the tree, in particular, to node B3, and read the second byte of the codeword. It is at this point that, by using the basic decoding procedure of a word, a rank operation is performed to know which position of node B3 should be read. Instead, we just have to read the byte of node B3 at the position given by track[B3] = 1. Therefore, we
obtain the byte b6, and we update the value track[B3] to 2. Again, according to
the encoding scheme, the codeword is still not complete, so we proceed in the same way, but in node B3B6. That is, we read the byte of that node placed at position track[B3B6] = 1, which is byte b2, and then update the value of track[B3B6] to
the next unprocessed entry of that node, 2. Since byte b2 is a stopper, we can get
the rst decoded word, corresponding to the codeword b3b6b2, the word <movies.
After that, we continue with the second word at the root node. The byte of the root node at position 2 is b0. It is the last byte of a complete codeword, so we nish the
decompression of the second word of the document, by obtaining the corresponding word, >. The following word at the root node is the third one. At position 3 in the root node, we get the byte b3, hence we have to read a second byte of the codeword
in node B3. Since the value of track[B3] = 2, we know that this second byte of the codeword we are searching for is at position 2 in node B3. So we read byte b6, that
newly leads us to node B3B6. Thus, we rst update the value of track[B3] to 3, and then we proceed in an analogous way, but in node B3B6. Finally, we obtain the third word of the document, <film. We will continue the same procedure, until reaching the last word of the root node, saving unnecessary rank operations and making faster the complete document decompression.
5.2.1.3 Partial Decompression
The same smart procedure applied to eciently perform full decompression can be extended to extract a fragment of the document starting from a random position, instead of from the rst one. The only dierence lies in the initialization of the track array, since we do not start by decompressing the document from the rst position. A priori, we cannot know the rst unprocessed entries of each XWT node, so when decompressing a word, whenever the track value of a visited node is uninitialized, we will perform a rank operation to set the value of the next byte to be read. Otherwise, we will just read the byte at the position given by the corresponding track entry. That is, at most we have to pay one rank operation for each XWT node, because once a XWT node has been visited, rank operations are avoided.
Notice also that this mechanism can be used, for instance, to speed up snippet extraction around a found occurrence of a word, by just applying the procedure from the earlier position of the snippet in the root node of the XWT up to the last one.