
3.4 Text Prioritization Approach

3.4.3 Index Traversal Methods

3.4.3.2 TF-MBW: Multiple-Query Traversal

Given a set of queries Q, our goal is to minimize the total I/O cost when the queries share keywords and/or are spatially close to one another. Algorithm 7 shows the pseudocode of our proposed approach, TF-MBW, to answer multiple queries as a batch. For each query q ∈ Q, a separate priority queue Hq is maintained that stores the current top-k objects of that query (Line 7.3).
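
Such a per-query priority queue also exposes the current k-th best score, which is the pruning threshold Rk(q). A minimal sketch of this bookkeeping in Python is given below; the class and method names (TopK, update, threshold) are illustrative and not part of the thesis.

import heapq

class TopK:
    """Min-heap of (score, object_id) pairs holding the current top-k of one query.

    The heap root is the k-th best score seen so far, i.e., the threshold Rk(q).
    """

    def __init__(self, k):
        self.k = k
        self.heap = []  # smallest score sits at heap[0]

    def update(self, obj_id, score):
        # Insert until the heap holds k objects, then only replace the weakest one.
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, obj_id))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, obj_id))

    def threshold(self):
        # Rk(q): 0 until k objects have been scored, then the k-th best score.
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

One such structure per query corresponds to the array H initialized in Line 7.3.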

Let TU be the union of the terms of all the queries in Q. We maintain a buffer Buf of size |TU| that keeps, for each term t ∈ TU, the block most recently accessed by any query (denoted Buft), i.e., only one block per term.

Figure 3.7: List traversal for multiple queries. (a) Retrieving blocks for q1; (b) retrieving and sharing blocks for q2. (The figure shows the posting-list blocks of terms t1, t2, t3, and t4, where q1.d = {t1, t2, t3, t4}, q2.d = {t2, t3}, the pivot of q1 is 316, and the pivot of q2 is 360.)

For each query q ∈ Q, the spatial-textual similarity score Rk(q) of the current k-th ranked object and the pointers CPt,q are also maintained (Line 7.7). We initialize the pivot object νq of each individual query once, in the same way as described in Section 3.4.3.1. If we reach the termination condition for a query q, there is no object left that can be a top-k object of that query, so we exclude it from Q (Lines 7.8 - 7.14).
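
The batch-level bookkeeping can be pictured as follows. This is only a sketch: the field names are illustrative, and the Algorithm 5 routines (cursor initialization, pivot-term selection) are assumed to exist elsewhere and are not shown.

from dataclasses import dataclass, field

@dataclass
class BatchState:
    """Shared state kept by TF-MBW while a batch Q is processed (hypothetical layout).

    H     -- one TopK heap per query (the array H of Line 7.3)
    buf   -- Buf: the block most recently retrieved for each term t in TU (Line 7.5)
    cp    -- CP[(t, q)]: the current cursor of query q into the posting list of t
    pivot -- the current pivot object ID of each still-active query
    """
    H: dict = field(default_factory=dict)
    buf: dict = field(default_factory=dict)
    cp: dict = field(default_factory=dict)
    pivot: dict = field(default_factory=dict)

A query whose pivot cannot be advanced any further (Lines 7.10 - 7.14) is simply dropped from the set of active queries.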

Let ν⇓ be the smallest pivot ID among the current pivots for any of the queries q ∈ Q, and q⇓ be the corresponding query for which ν⇓ is selected. In each iteration, we take ν⇓ and process the query q⇓ for that pivot (Lines 7.16 - 7.18). While retrieving any block that is required to compute the total score of ν⇓, one of the following two conditions can occur: (i) the block that contains the current ν⇓ was retrieved in a previous iteration; (ii) the block was never retrieved in any prior step.
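
Selecting the globally smallest pivot is a single pass over the active queries; a small sketch, assuming the pivot map of the previous snippet:

def select_min_pivot(active, pivot):
    # Return (nu_min, q_min): the smallest current pivot ID and the query it belongs to.
    # Processing the smallest pivot first is what lets later queries reuse blocks
    # that earlier queries have already pulled into the buffer.
    q_min = min(active, key=lambda q: pivot[q])
    return pivot[q_min], q_min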

If condition (ii) holds, the block must be retrieved from disk, and after the computation the block is stored in Buf for the corresponding term. If condition (i) holds, the block is guaranteed to be found in the corresponding block buffer. This holds because Wand guarantees that the pivot selected in each step is always greater than or equal to every pivot ID selected in any previous step. For multiple queries, since we process the minimum of all current pivot IDs in each iteration, the ν⇓ of a step is also greater than or equal to the ν⇓ of any previous step. Thus, at every step, all objects with IDs smaller than the current ν⇓ are guaranteed to have already been processed for every query. Recall that the blocks are stored in sorted order by object ID, and the objects within each block are sorted as well.

ALGORITHM 7: TF-MBW

7.1  Input: A set of queries Q, a positive integer k, and an SIF over the set of objects O.
7.2  Output: The top-k results for each query in Q.
7.3  Initialize an array H of |Q| min-priority queues
7.4  TU ← ∪q∈Q q.d
7.5  Initialize an array Buf of |TU| blocks
7.6  for each q ∈ Q do
7.7      Execute Lines 5.5 - 5.7 of Algorithm 5
7.8      Sort posting lists by CPt,q
7.9      νt ← FindPivotTerm(Rk(q), q)
7.10     if νt = ∅ then
7.11         Q ← Q − q
7.12     νq ← CPνt,q
7.13     if νq < ID⇑ then
7.14         Q ← Q − q
7.15 while Q ≠ ∅ do
7.16     ν⇓ ← minq∈Q(νq)
7.17     q⇓ ← the query for which ν⇓ is selected
7.18     Execute Lines 5.14 - 5.17 of Algorithm 5 for q⇓
7.19     for each t ∈ TU, where CPt,q⇓ ≤ νq⇓ do
7.20         if block pointed by CPt,q⇓ not retrieved before then
7.21             b ← Retrieve block pointed by CPt,q⇓
7.22             Mark b as retrieved
7.23             Buft ← b
7.24     Execute Lines 5.20 - 5.26 of Algorithm 5 for q⇓
7.25     Execute Lines 7.8 - 7.14 of Algorithm 7 for q⇓
7.26 Return H

So if a block containing the current ν⇓ was retrieved for any prior ν⇓, that block is guaranteed to be found in the block buffer in this approach.

As we maintain the pointers of the terms for all of the individual queries, forwarding the pointers is achieved for q⇓ in the same way as described for GeoBW (Line 7.24). The pivot object of q⇓ is then computed for the next iteration, as shown in Line 7.25. We now illustrate I/O sharing among queries with the following example.
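
Putting the pieces together, the main loop can be sketched as follows. This is not a literal transcription of Algorithm 7: process_pivot and recompute_pivot are hypothetical hooks standing in for the single-query (GeoBW / Algorithm 5) steps of Lines 7.18, 7.24, and 7.25, index.read_block and block.contains are assumed interfaces of the SIF index, each query object is assumed to expose its term set as q.d (mirroring the thesis's notation), and only the I/O-sharing logic of Lines 7.15 - 7.23 is spelled out.

def tfmbw_loop(active, state, index, process_pivot, recompute_pivot):
    """Process a batch of queries, sharing disk reads through state.buf (sketch)."""
    disk_reads = 0
    while active:                                      # Line 7.15
        nu, q = select_min_pivot(active, state.pivot)  # Lines 7.16 - 7.17

        # Lines 7.19 - 7.23: make sure every block needed to score the pivot is buffered.
        for t in q.d:
            cursor = state.cp[(t, q)]                  # object ID the cursor points to
            if cursor > nu:                            # this cursor is already past the pivot
                continue
            block = state.buf.get(t)
            if block is None or not block.contains(cursor):
                block = index.read_block(t, cursor)    # condition (ii): go to disk
                disk_reads += 1
                state.buf[t] = block                   # overwrite the old block for t
            # condition (i): the needed block is already buffered, no I/O

        process_pivot(q, nu, state)        # Algorithm 5 steps for q: score the pivot, update
                                           # H and Rk(q), forward the CP pointers (Lines 7.18, 7.24)
        if not recompute_pivot(q, state):  # Line 7.25: no candidate pivot is left for q
            active.discard(q)              # the top-k of q is final (Lines 7.8 - 7.14)
    return disk_reads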

Example 7. Figure 3.7 shows two queries q1 and q2, where q1.d = {t1, t2, t3, t4} and q2.d = {t2, t3}. Let the starting pivot of q1 be 316 and the pivot of q2 be 360 in this example. In this case, we take 316 as the minimum pivot ID ν⇓ and process 316 for q1. Suppose we need to compute the score of object 316 for q1; then the blocks shown with red stripes (Figure 3.7a) are retrieved from disk to compute this score, and these blocks are stored in the block buffer for the corresponding terms.

The pivot of q1 is computed again for the next iteration. Let the ν⇓ selected in the next iteration be 360 for q2. After checking the conditions, suppose the score of 360 needs to be computed for q2. The blocks that are required to be accessed for q2 are shown with blue stripes in Figure 3.7b. As the block for the term t2 was retrieved for q1 previously, that block can be found in the block buffer Buf for t2. Thus, the block shown with two colors of stripes (red and blue) is shared among the queries. The block stored in the block buffer for t3, however, is not the one required by q2. Therefore, we need to retrieve the block that contains the pivot object 360 and update Buf for t3.
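
The behaviour in this example is exactly the buffer test of the sketch above. The block boundaries below are invented purely for illustration (they are not taken from Figure 3.7); only the two pivots 316 and 360 come from the example.

class Block:
    # Hypothetical block abstraction: a block knows the range of object IDs it covers.
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def contains(self, obj_id):
        return self.lo <= obj_id <= self.hi

# Blocks pulled into the buffer while scoring object 316 for q1 (made-up boundaries).
buf = {"t2": Block(300, 420), "t3": Block(300, 330)}

pivot_q2 = 360
print(buf["t2"].contains(pivot_q2))  # True:  the block is shared with q2, no disk read
print(buf["t3"].contains(pivot_q2))  # False: fetch the block containing 360, update Buf for t3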

If a block of an inverted list is skipped by the pointers of all the queries that share the corresponding term, that block is never retrieved from disk. The priority queue of each q ∈ Q and the threshold Rk(q) are also updated in this process. If we reach the termination condition for a query q, such that there is no object left that can be one of its top-k objects, q is excluded from Q. We continue until Q is empty, which indicates that the results for all the queries have been found.

In the TF-MBW approach using the SIF index, a separate priority queue of size k is maintained for each query to track its current best objects, i.e., k × |Q| objects in total for the batch. We also maintain a block buffer that stores the most recently retrieved block for each unique query term, so a total of |TU| blocks are kept in memory at a time.
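
As a back-of-the-envelope check of this footprint (all numbers below are hypothetical, not taken from the experiments):

k, num_queries, num_unique_terms = 10, 100, 500   # hypothetical batch parameters
block_size = 64 * 1024                            # hypothetical SIF block size in bytes

heap_entries = k * num_queries                    # k x |Q| candidate objects across all heaps
buffer_bytes = num_unique_terms * block_size      # |TU| blocks resident at any one time
print(heap_entries, buffer_bytes)                 # 1000 entries and roughly 32 MB of buffered blocks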