
5.2 XWT Basic Procedures

5.2.2 Searching

Locating. In general, we can find the position in the document of any occurrence of a word by first searching for the last byte of its codeword in the corresponding XWT node, and then performing consecutive select operations up to the root of the XWT. This procedure follows from the way the codeword bytes are organized. Given a codeword $\langle cw_1 \dots cw_m \rangle$, if byte $cw_i$ occurs at position $j$ in the corresponding XWT node (that is, in node $B_{cw_1}B_{cw_2}\dots B_{cw_{i-1}}$), then the previous byte of that codeword, $cw_{i-1}$, will be the $j$th one occurring in the parent node (that is, in node $B_{cw_1}B_{cw_2}\dots B_{cw_{i-2}}$). Therefore, when the root node is reached, we have the position of the word in the document. This procedure is sketched in Algorithm 5.4.

Algorithm 5.4: Locate the jth occurrence of word w
Input: w, a word; j, an integer
Output: pos, position of the jth occurrence of w
    cw ← getCode(w)
    cNode ← computeLastNode(cw)    // the node where cw_{|cw|} is placed
    pos ← j
    foreach i = |cw| ... 1 do
        pos ← select_{cw_i}(XWT[cNode], pos)
        cNode ← getParentNode(cNode)
    return pos

For instance, let us assume we want to locate the first occurrence of love_att in the example of Figure 5.2. The codeword of this word is $b_6 b_3 b_0$, so we have to start the search at node $B_6B_3$, since $b_6 b_3$ are the bytes of the codeword that precede the last one. Next, we find the position in node $B_6B_3$ at which the first byte $b_0$ occurs (the last byte of the love_att codeword) by computing $select_{b_0}(B_6B_3, 1) = 1$. This tells us that it is at position 1, that is, the first occurrence of word love_att is the first of the words held in node $B_6B_3$ (i.e., words with codewords starting with $b_6 b_3$). We also know that all the codewords whose last byte is stored in node $B_6B_3$ are represented in node $B_6$ by a byte $b_3$, and that they appear in the same text order. Therefore, the value 1 obtained with the previous select operation indicates that the first byte $b_3$ in node $B_6$ corresponds to the first occurrence of word love_att in the document. Again, we compute $select_{b_3}(B_6, 1) = 2$, which in turn indicates that our codeword is the second one starting with $b_6$ in the root node. Finally, by computing $select_{b_6}(Root, 2) = 8$, we can answer that the first occurrence of love_att is the 8th word in the document.
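The bottom-up traversal of Algorithm 5.4 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the vocabulary, the codewords, and the text are a hypothetical toy example (not the data of Figure 5.2), nodes are stored in a dictionary keyed by codeword prefix, and `select` is a plain linear scan rather than the sublinear select structure a real XWT would use.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: bytes 0-1 act as stoppers, bytes 2-3 as continuers.
CODES = {"the": (0,), "of": (1,), "love": (2, 0), "rose": (2, 1), "sweet": (3, 2, 0)}
TEXT = ["the", "love", "of", "rose", "love", "the", "sweet", "love"]


def build_xwt(text, codes):
    """Map each codeword prefix to the byte sequence of its node; () is the root."""
    nodes = defaultdict(list)
    for word in text:
        cw = codes[word]
        for i, byte in enumerate(cw):
            nodes[cw[:i]].append(byte)  # byte cw[i] lives in the node named by cw[:i]
    return dict(nodes)


def select(seq, byte, j):
    """1-based position of the j-th occurrence of byte in seq (linear scan)."""
    seen = 0
    for idx, value in enumerate(seq, start=1):
        if value == byte:
            seen += 1
            if seen == j:
                return idx
    raise ValueError("fewer than j occurrences of the byte in this node")


def locate(nodes, cw, j):
    """Position (1-based) in the text of the j-th occurrence of the word coded by cw."""
    pos = j
    # Walk from the node holding the last byte up to the root.
    for i in range(len(cw) - 1, -1, -1):
        pos = select(nodes[cw[:i]], cw[i], pos)
    return pos


XWT = build_xwt(TEXT, CODES)
print(locate(XWT, CODES["love"], 3))   # 8
print(locate(XWT, CODES["sweet"], 1))  # 7
```

Shortening the prefix `cw[:i]` by one byte plays the role of `getParentNode`, so the sketch needs no explicit parent pointers.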

To locate all the occurrences of a word, this procedure is repeated for each one. Since the traversed XWT nodes are the same for every occurrence, and the occurrences are processed consecutively, the select operations, and thus the whole process, can be sped up by using pointers to the positions already found in the XWT nodes.
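One possible reading of the pointer optimization is sketched below: each traversed node keeps a cursor (occurrences seen so far, scan index), and every new select resumes scanning from that cursor instead of from the start of the node. This is an illustrative interpretation, not the thesis implementation; the toy vocabulary, text, and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: bytes 0-1 act as stoppers, bytes 2-3 as continuers.
CODES = {"the": (0,), "of": (1,), "love": (2, 0), "rose": (2, 1), "sweet": (3, 2, 0)}
TEXT = ["the", "love", "of", "rose", "love", "the", "sweet", "love"]


def build_xwt(text, codes):
    """Map each codeword prefix to the byte sequence of its node; () is the root."""
    nodes = defaultdict(list)
    for word in text:
        cw = codes[word]
        for i, byte in enumerate(cw):
            nodes[cw[:i]].append(byte)
    return dict(nodes)


def locate_all(nodes, cw):
    """All text positions (1-based) of the word coded by cw, in increasing order."""
    m = len(cw)
    # cursors[i] = [occurrences of cw[i] seen so far, scan index] in node cw[:i]
    cursors = [[0, 0] for _ in range(m)]
    results = []
    j = 0
    while True:
        j += 1
        pos = j  # we want the j-th occurrence of the last byte
        for i in range(m - 1, -1, -1):
            seq = nodes.get(cw[:i], [])
            seen, idx = cursors[i]
            # Resume the scan where the previous occurrence left off.
            while seen < pos and idx < len(seq):
                if seq[idx] == cw[i]:
                    seen += 1
                idx += 1
            if seen < pos:  # no further occurrence: all of them were reported
                return results
            cursors[i] = [seen, idx]
            pos = idx  # 1-based position of the occurrence just found
        results.append(pos)


XWT = build_xwt(TEXT, CODES)
print(locate_all(XWT, CODES["love"]))  # [2, 5, 8]
print(locate_all(XWT, CODES["the"]))   # [1, 6]
```

Because the requested positions grow monotonically at every level, each node is scanned at most once over the whole enumeration.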

Counting. Counting the number of occurrences of a given word is equivalent to computing how many times the last byte of the codeword assigned to that word appears in its corresponding XWT node. This node is identified by all the previous bytes of the codeword. Therefore, in the general case, if a word is encoded with a codeword $b_x b_y b_z$ (where $b_x$ and $b_y$ are continuers and $b_z$ is a stopper), it is only necessary to count the number of bytes $b_z$ in node $B_XB_Y$. That is, we only have to perform $rank_{b_z}(B_XB_Y, i)$, where $i$ is the size of node $B_XB_Y$. In turn, if the codeword has just one byte, $b_z$, we compute $rank_{b_z}(Root, n)$, where $n$ is the number of words in the document, that is, the number of bytes in the root of the XWT.

Taking the example of Figure 5.2, if we want to count the number of occurrences of Shakespeare_att, we first obtain its codeword, $b_6 b_4 b_0$, and then count the number of times its last byte, $b_0$, appears in the node identified by the first bytes of its codeword ($b_6 b_4$), that is, in node $B_6B_4$. In the same way, to count how many times the word <name appears in the document, given its codeword $b_3 b_0$, we only have to count the number of times the byte $b_0$ (the last byte of its codeword) occurs in node $B_3$ (since $b_3$ is the first byte of its codeword). For words whose codeword has only one byte, like One in the same example of Figure 5.2, which is encoded by $b_2$, we only have to find out how many times the byte $b_2$ (the only byte, and hence also the last byte, of the codeword) appears in the root of the XWT (since all the first codeword bytes are placed in that node).

Algorithm 5.5: Count operation for a word w
Input: w, a word
Output: occ, number of occurrences of w
    cw ← getCode(w)
    cNode ← root
    foreach i = 1 ... (|cw| − 1) do
        cNode ← getChildNode(cNode, cw_i)
    occ ← rank_{cw_{|cw|}}(XWT[cNode], sizeNodes[cNode])
    return occ
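As a rough illustration of Algorithm 5.5, the following Python sketch counts a word with a single rank over the node that holds the last byte of its codeword. The vocabulary, codewords, and text are hypothetical toy data, nodes are keyed by codeword prefix, and `rank` is a naive slice count standing in for a real rank structure.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: bytes 0-1 act as stoppers, bytes 2-3 as continuers.
CODES = {"the": (0,), "of": (1,), "love": (2, 0), "rose": (2, 1), "sweet": (3, 2, 0)}
TEXT = ["the", "love", "of", "rose", "love", "the", "sweet", "love"]


def build_xwt(text, codes):
    """Map each codeword prefix to the byte sequence of its node; () is the root."""
    nodes = defaultdict(list)
    for word in text:
        cw = codes[word]
        for i, byte in enumerate(cw):
            nodes[cw[:i]].append(byte)
    return dict(nodes)


def rank(seq, byte, i):
    """Number of occurrences of byte among the first i entries of seq."""
    return seq[:i].count(byte)


def count(nodes, cw):
    """Occurrences of the word coded by cw: one rank over its last-byte node."""
    node = ()
    for byte in cw[:-1]:
        node = node + (byte,)  # descend to the child node (getChildNode)
    seq = nodes.get(node, [])  # an absent node means the word never occurs
    return rank(seq, cw[-1], len(seq))


XWT = build_xwt(TEXT, CODES)
print(count(XWT, CODES["love"]))   # 3
print(count(XWT, CODES["the"]))    # 2
print(count(XWT, CODES["sweet"]))  # 1
```

Only the node reached by the descent is inspected; the root sequence is never scanned for multi-byte codewords.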

Algorithm 5.6: Count operation for a word w until a position p
Input: w, a word; p, a position of the document
Output: occ, number of occurrences of w up to position p
    cw ← getCode(w)
    cNode ← root; occ ← p
    foreach i = 1 ... (|cw| − 1) do
        occ ← rank_{cw_i}(XWT[cNode], occ)
        cNode ← getChildNode(cNode, cw_i)
    occ ← rank_{cw_{|cw|}}(XWT[cNode], occ)
    return occ

Notice that, by applying this procedure, the count operation turns into the search for a byte inside a single node of the XWT, instead of a search for the occurrences of a word throughout the whole document, so the benefit is clear. Algorithm 5.5 shows the pseudocode of this operation. Moreover, we can also count the number of occurrences of a word up to a given position of the document. In that case, we follow the same strategy, but perform a rank for each codeword byte, mapping the end position down through the XWT toward the node that holds the last byte of the word's codeword. The pseudocode for that scenario is presented in Algorithm 5.6.
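The top-down mapping of Algorithm 5.6 can be sketched in Python as follows. As before, this is only an illustration under stated assumptions: the toy vocabulary and text are hypothetical, nodes are keyed by codeword prefix, and `rank` is a naive slice count.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: bytes 0-1 act as stoppers, bytes 2-3 as continuers.
CODES = {"the": (0,), "of": (1,), "love": (2, 0), "rose": (2, 1), "sweet": (3, 2, 0)}
TEXT = ["the", "love", "of", "rose", "love", "the", "sweet", "love"]


def build_xwt(text, codes):
    """Map each codeword prefix to the byte sequence of its node; () is the root."""
    nodes = defaultdict(list)
    for word in text:
        cw = codes[word]
        for i, byte in enumerate(cw):
            nodes[cw[:i]].append(byte)
    return dict(nodes)


def rank(seq, byte, i):
    """Number of occurrences of byte among the first i entries of seq."""
    return seq[:i].count(byte)


def count_until(nodes, cw, p):
    """Occurrences of the word coded by cw among the first p words of the text."""
    node, occ = (), p
    for byte in cw[:-1]:
        # Map the end position from this node into the child node.
        occ = rank(nodes.get(node, []), byte, occ)
        node = node + (byte,)
    # The final rank, on the last byte, is the answer.
    return rank(nodes.get(node, []), cw[-1], occ)


XWT = build_xwt(TEXT, CODES)
print(count_until(XWT, CODES["love"], 5))   # 2
print(count_until(XWT, CODES["sweet"], 6))  # 0
```

Each rank converts a prefix length in the current node into the corresponding prefix length in the child, so the last rank counts only the occurrences that end at or before position `p`.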

5.2.2.2 Phrase Patterns

Locating and counting. Apart from individual words, we may also be interested in locating sequences of several words, that is, in searching for phrase patterns. To perform this efficiently over the XWT structure, we start by locating the first occurrence of the least frequent word of the pattern in the root node. Then we check whether the first bytes of the codewords of the other words of the phrase pattern match the bytes that precede and follow that position in the root node. If they match, we continue by validating the rest of the bytes of the corresponding codewords, until we either detect a false match or find the complete phrase pattern. If they do not match, we avoid going down the XWT and simply locate the next occurrence of the least frequent word, which is processed in the same way. This same basic procedure is used for both locating and counting a phrase pattern, and it is shown in Algorithm 5.7.
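The strategy described above can be sketched in Python as follows. Since Algorithm 5.7 is not reproduced here, this is a reconstruction from the prose only, not the thesis pseudocode; the helper names (`matches_at`, `locate_phrase`), the toy vocabulary, and the linear-scan `rank` and `select` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy vocabulary: bytes 0-1 act as stoppers, bytes 2-3 as continuers.
CODES = {"the": (0,), "of": (1,), "love": (2, 0), "rose": (2, 1), "sweet": (3, 2, 0)}
TEXT = ["the", "love", "of", "rose", "love", "the", "sweet", "love"]


def build_xwt(text, codes):
    """Map each codeword prefix to the byte sequence of its node; () is the root."""
    nodes = defaultdict(list)
    for word in text:
        cw = codes[word]
        for i, byte in enumerate(cw):
            nodes[cw[:i]].append(byte)
    return dict(nodes)


def rank(seq, byte, i):
    return seq[:i].count(byte)


def select(seq, byte, j):
    """1-based position of the j-th occurrence of byte in seq."""
    hits = [idx for idx, value in enumerate(seq, start=1) if value == byte]
    return hits[j - 1]


def count(nodes, cw):
    seq = nodes.get(cw[:-1], [])
    return rank(seq, cw[-1], len(seq))


def locate(nodes, cw, j):
    pos = j
    for i in range(len(cw) - 1, -1, -1):
        pos = select(nodes[cw[:i]], cw[i], pos)
    return pos


def matches_at(nodes, cw, pos):
    """Check top-down that the word at root position pos has codeword cw."""
    node = ()
    for byte in cw:
        seq = nodes.get(node, [])
        if pos < 1 or pos > len(seq) or seq[pos - 1] != byte:
            return False
        pos = rank(seq, byte, pos)  # position of this word inside the child node
        node = node + (byte,)
    return True


def locate_phrase(nodes, codes, phrase):
    """Start positions (1-based) of every occurrence of the phrase."""
    cws = [codes[w] for w in phrase]
    k = min(range(len(cws)), key=lambda i: count(nodes, cws[i]))  # least frequent word
    root = nodes[()]
    hits = []
    for j in range(1, count(nodes, cws[k]) + 1):
        start = locate(nodes, cws[k], j) - k  # where the phrase would begin
        if start < 1 or start + len(cws) - 1 > len(root):
            continue
        # Cheap filter: compare first codeword bytes in the root only.
        if any(root[start + i - 1] != cw[0] for i, cw in enumerate(cws)):
            continue
        # Full validation: descend only for candidates that passed the filter.
        if all(matches_at(nodes, cw, start + i) for i, cw in enumerate(cws)):
            hits.append(start)
    return hits


XWT = build_xwt(TEXT, CODES)
print(locate_phrase(XWT, CODES, ["the", "sweet", "love"]))  # [6]
print(locate_phrase(XWT, CODES, ["of", "love"]))            # []
```

The second query shows the false-match case: the word after "of" is "rose", whose first byte equals that of "love", so the candidate passes the root filter and is rejected only during the descent. Counting a phrase amounts to taking the length of the returned list.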

5.3 XWT Connection with a Balanced Parentheses

In document Compressed self-indexed XML representation with efficient XPath evaluation (Page 142-145)