Indexes - Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processing

We employ the IR-tree [3] and spatially gridded posting lists (SGPL) [11] to index the objects.

The IR-tree is an R-tree [25] extended with inverted files [29]. An inverted file index has two main components: (i) a vocabulary of all distinct words appearing in the text descriptions of the objects in the data set being indexed, and (ii) a posting list for each word t, i.e., a sequence of pairs (id, w), where id is the identifier of an object whose text description contains t and w is the word’s weight in the object. Each leaf node of an IR-tree contains entries of the form o = (id, λ), where o.id refers to an object identifier and o.λ refers to a minimum bounding rectangle (MBR) of the spatial location of the object.

Each leaf node also contains a pointer to an inverted file indexing the documents of all objects stored in the node. Each non-leaf node contains entries of the form e = (id, λ) representing the children of the node, where e.id is a child node identifier and e.λ is the MBR of all entries contained in the child node identified by e.id. Each non-leaf node also contains a pointer to an inverted file indexing the text descriptions of the entries stored in the node’s subtree.
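As an illustration of the two components of an inverted file, the following minimal Python sketch builds a vocabulary and posting lists for a few toy documents. The function name build_inverted_file, the toy documents, and the use of raw term frequency as the weight w are assumptions made for this example; the text above does not fix a weighting scheme.

```python
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Return (vocabulary, postings) for a dict {object_id: text}.

    postings[t] is a list of (object_id, weight) pairs in ascending id order.
    The weight is a plain term frequency (an assumption for this sketch).
    """
    postings = defaultdict(list)
    for obj_id in sorted(docs):
        counts = Counter(docs[obj_id].lower().split())
        for term, tf in counts.items():
            postings[term].append((obj_id, tf))
    vocabulary = sorted(postings)
    return vocabulary, dict(postings)


# Hypothetical text descriptions of three objects.
docs = {1: "jeans tshirt", 2: "hat jeans jeans", 3: "hat"}
vocabulary, postings = build_inverted_file(docs)
print(vocabulary)
print(postings["jeans"])
```

In an IR-tree, one such structure would be attached to every node and would cover the objects (or entries) below that node.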

Example: An IR-tree with fanout 4 indexing the example dataset given in Table F.1a is illustrated in Figure F.2.


An SGPL [11] is a grid-based index structure proposed for selectivity estimation and processing of range queries. First, an n×n grid is created on the data set, and the grid cells are indexed by a space-filling curve. Then, for each word w_i, an SGPL is created. The SGPL of w_i is a sorted list of entries of the form ⟨c_j, S_{w_i,c_j}⟩, where c_j is the index value of a grid cell and S_{w_i,c_j} is the set of objects that contain w_i in their documents and are located in cell c_j. Although we only show the identifiers of the PoIs in the example SGPL below for the sake of simplicity, the index structure also stores the location of each object and its textual weight for w_i.

Paper F.

Fig. F.3: Example Spatially Gridded Posting Lists

Example: Figure F.3a illustrates a 4×4 grid on the 14 objects given in Table F.1a. Grid cells are indexed using a second-order Z-curve. The numbers in bold are the Z-values of the cells. Table F.3b shows the SGPLs for the words “jeans”, “tshirt”, and “hat”. For instance, the first entry of “hat” indicates that the grid cell with Z-value 0 has two PoIs (p11 and p13) that contain “hat” in their documents.
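The construction described above can be sketched in a few lines of Python. The sketch is illustrative only: the helper names z_value and build_sgpl, the toy objects, the unit cell size, and the bit-interleaving convention (column bits in the even positions) are assumptions and need not match the cell numbering of Figure F.3a. As in the simplified example, only object identifiers are stored.

```python
from collections import defaultdict


def z_value(col, row, order=2):
    """Interleave the bits of (col, row); col fills the even bit positions."""
    z = 0
    for bit in range(order):
        z |= ((col >> bit) & 1) << (2 * bit)
        z |= ((row >> bit) & 1) << (2 * bit + 1)
    return z


def build_sgpl(objects, cell_size, order=2):
    """objects: {id: (x, y, words)} -> {word: [(z, [ids]), ...] sorted by z}."""
    cells = defaultdict(lambda: defaultdict(list))
    for obj_id, (x, y, words) in sorted(objects.items()):
        z = z_value(int(x // cell_size), int(y // cell_size), order)
        for word in words:
            cells[word][z].append(obj_id)
    return {word: sorted(by_cell.items()) for word, by_cell in cells.items()}


# Hypothetical PoIs on a 4x4 grid with unit cells.
objects = {
    "p1": (0.5, 0.5, {"hat"}),
    "p2": (0.2, 0.9, {"hat", "jeans"}),
    "p3": (3.5, 0.5, {"jeans"}),
    "p4": (1.5, 1.5, {"hat"}),
}
sgpl = build_sgpl(objects, cell_size=1.0)
print(sgpl["hat"])
print(sgpl["jeans"])
```

Because each list is sorted by cell index, the lists of several words can be merged in a single pass, which is what the merge operators discussed later rely on.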

4.2 DBSCAN-based Algorithm

We present the DBSCAN-based algorithm that uses the IR-tree. Then we present how to use the SGPL instead of the IR-tree for the DBSCAN-based algorithm.

4. Proposed Method

Algorithm

The DBSCAN-based algorithm is built on top of the algorithms proposed to process k-STC queries. Given a query, we determine the set of possible ε and minpts values and construct the corresponding set of k-STC queries. Then we process these queries in parallel using the methods proposed by Wu and Jensen [11].

The clusters are then sorted with respect to one of the cost functions given in Equation F.3 and Equation F.4, and the top-k disjoint clusters are returned as the result.

Algorithm F.1 DBSCAN-based Algorithm

Input: irtree - IR-tree, q - k-TMSTC query, minmp - Minimum minpts value, maxmp - Maximum minpts value, inc_th - Increase threshold value to determine ε values

Output: clusters - The list of clusters

1: Δ_ub, ε_ub ← GetBounds(q.tm_c, q.tm_i);
2: D_ψ ← irtree.RangeQuery(q.λ, q.ψ, Δ_ub);
3: dbscanParams ← ∅;
4: for minpts = minmp to maxmp do
5:     paramList ← GetEpsValues(irtree, q, D_ψ, minpts, ε_ub, inc_th);  ▷ Gets executed in parallel using fork-join model
6:     dbscanParams ← dbscanParams.Add(paramList);
7: end for
8: queries ← The list of k-STC queries corresponding to the query q and dbscanParams;
9: cSet ← ∅;
10: for all query in queries do
11:     c_q ← ProcessKstcQuery(irtree, query, D_ψ);  ▷ Gets executed in parallel using fork-join model
12:     cSet ← cSet ∪ c_q;
13: end for
14: clusters ← top-k disjoint clusters of cSet sorted with respect to the cost;
15: return clusters;

The DBSCAN-based algorithm is given in Algorithm F.1. Given a query, the algorithm first determines the upper bounds for the distance between the query location and the cluster (Δ_ub) and for the ε parameter of the DBSCAN algorithm (ε_ub) according to the transportation mode parameters tm_c and tm_i (line 1). We assume that the algorithm has access to a rule-based method (the GetBounds call in the algorithm) to determine the upper bounds according to the transportation mode parameters and the underlying spatial region. For instance, if a user plans to walk to the cluster, Δ_ub should not be set to more than 3–4 kilometers. However, if the user plans to drive, Δ_ub can be set to 40–50 kilometers. We can determine ε_ub similarly. For instance, if a user plans to walk within the cluster, ε_ub should not be set to more than 0.5–1 kilometers. However, if the user plans to cycle, it can be set to 2–3 kilometers.
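A rule-based GetBounds can be as simple as a lookup table. The following Python sketch is only one possible realization: the table names, the choice of the upper end of each quoted range, and the restriction to the modes mentioned in the text are assumptions, and the dependence on the underlying spatial region is left out.

```python
# Upper bounds in kilometers, using the upper end of each range quoted in the text.
DELTA_UB_KM = {"walk": 4.0, "drive": 50.0}  # travel to the cluster (tm_c)
EPS_UB_KM = {"walk": 1.0, "cycle": 3.0}     # travel within the cluster (tm_i)


def get_bounds(tm_c, tm_i):
    """Return (delta_ub, eps_ub) for the given transportation modes."""
    try:
        return DELTA_UB_KM[tm_c], EPS_UB_KM[tm_i]
    except KeyError as missing:
        raise ValueError(f"no rule for transportation mode {missing}") from None


print(get_bounds("drive", "walk"))
```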

Then, the algorithm obtains the relevant object set D_ψ with regard to the query keywords by issuing a range query using the query location, the query keywords, and the upper bound for the distance, Δ_ub (line 2). The algorithm then determines the possible ε values for minpts values in the range [minmp, maxmp] and the given query using the relevant object set (lines 3–7). The algorithm has an additional increase percentage threshold inc_th that is used in determining ε values. We set it to 5% by default. To determine possible DBSCAN parameters, the algorithm first initializes dbscanParams as an empty set (line 3). Then the list of (ε, minpts) pairs (paramList) is determined for each minpts value in parallel. At the end of each parallel execution, the resulting paramList is appended to dbscanParams (line 6). The default value for the maximum minpts value (maxmp) is set to 10 since we assume that a user is unlikely to visit more than 10 PoIs from a single query result. Furthermore, the default value for the minimum minpts value (minmp) is set to 3 since users are interested in clusters and since we do not want to miss small clusters in the result.

After the DBSCAN parameters are determined, the algorithm constructs the corresponding k-STC query for each DBSCAN parameter tuple and processes them in parallel (lines 8–13). It uses a simple fork-join model to process the queries in parallel. The algorithm initializes the set of clusters cSet to an empty set (line 9) and populates the set with the results of the queries by forming the union with c_q, which is the set of clusters for the current query (lines 11 and 12). Then the clusters are sorted in ascending order with respect to their cost, and the top-k disjoint clusters with the least cost are returned as the result (lines 14 and 15).
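The fork-join step and the final selection can be sketched as follows in Python. Several points are assumptions: ThreadPoolExecutor merely stands in for the fork-join model, process_query is a hypothetical stand-in for ProcessKstcQuery, clusters are represented as (cost, members) pairs, and the greedy pass in ascending cost order is one plausible reading of "top-k disjoint clusters".

```python
from concurrent.futures import ThreadPoolExecutor


def top_k_disjoint(clusters, k):
    """Greedily keep the cheapest clusters that share no object with earlier picks."""
    chosen, used = [], set()
    for cost, members in sorted(clusters, key=lambda c: c[0]):
        if used.isdisjoint(members):
            chosen.append((cost, members))
            used |= members
            if len(chosen) == k:
                break
    return chosen


def run_queries(queries, process_query, k):
    """Fork one task per k-STC query, join the results, then rank them."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_query, queries))
    c_set = {cluster for result in results for cluster in result}
    return top_k_disjoint(c_set, k)


# Hypothetical per-query results: lists of (cost, frozenset of object ids).
FAKE_RESULTS = {
    1: [(2.0, frozenset({"a", "b"}))],
    2: [(1.0, frozenset({"b", "c"})), (3.0, frozenset({"d"}))],
}


def fake_process(query):
    return FAKE_RESULTS[query]


result = run_queries([1, 2], fake_process, k=2)
print(result)
```

In the toy run, the cluster {a, b} is skipped because it shares object b with the cheaper cluster {b, c}.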

In the following, we describe the three subroutines called from the main algorithm.

The function RangeQuery (Algorithm F.2) is used to find the relevant objects in the ε-neighborhood of a given location using an IR-tree on the objects. It is a standard range query algorithm with an additional check for text relevance (lines 8 and 14).

Algorithm F.2 RangeQuery(λ, ψ, ε)

Input: λ - Query location, ψ - Query keywords, ε - Query range

Output: neighbors - The list of neighbors

1: Queue queue ← ∅;
2: queue.Enqueue(root);
3: while queue is not empty do
4:     e ← queue.Dequeue();
5:     N ← ReadNode(e);
6:     if N is a leaf node then
7:         for all object o in N do
8:             if o is relevant to ψ and ‖λ, o.λ‖_min ≤ ε then
9:                 neighbors.Add(o);
10:             end if
11:         end for
12:     else
13:         for all entry e′ in N do
14:             if e′ is relevant to ψ and ‖λ, e′.λ‖_min ≤ ε then
15:                 queue.Enqueue(e′);
16:             end if
17:         end for
18:     end if
19: end while
20: return neighbors;

The DBSCAN-based algorithm uses the approach employed by the VDBSCAN algorithm [14] to determine the ε values. VDBSCAN first computes the k-dist values for each object in the dataset for a given k value and determines the cut-off points from the k-dist plot. The k-dist value for an object is the distance between the object and its k-th nearest neighbor. We utilize the function GetEpsValues (Algorithm F.3) to determine possible ε values using the IR-tree given a query, a minpts value, an upper bound (ε_ub) for ε, a relevant object set, and an increase percentage threshold (inc_th). The algorithm initializes the list of k-dist values as an empty list. Then it determines the k-dist value for k = minpts for each relevant object and adds it to the list if the k-dist value is not UNDEFINED (lines 3–8). The function GetKDist returns the k-dist value if it is less than ε_ub. Otherwise, it returns UNDEFINED.

Next, the algorithm checks if the list is populated. If not, it returns the empty list (lines 9–11), which means that the algorithm is unable to find ε values for the given parameters.

The list is then sorted. After sorting, the algorithm iterates over the list of k-dist values and determines the cut-off points for different density levels in the relevant object set (lines 13–27). We define a density level as a sorted list of k-dist values that contains at least minpts values, does not have a percentage of increase between consecutive k-dist values exceeding the given inc_th, and has a percentage of increase exceeding inc_th after the last k-dist value in the list. The algorithm checks whether the percentage of increase between the previous and current values is more than inc_th (line 17). If there is such an increase and the number of consecutive k-dist values without the required percentage of increase (nic) is at least minpts, the algorithm adds a pair consisting of the current k-dist value, which corresponds to a density level, and the minpts input to the list of parameters (lines 18 and 19). Whenever there is such an increase, the algorithm resets nic to 0 (line 21). If the increase is insufficient, nic is incremented (line 23). The algorithm terminates when the list of k-dist values is exhausted.

Algorithm F.3 GetEpsValues(irtree, q, D_ψ, minpts, ε_ub, inc_th)

Input: irtree - IR-tree, q - k-TMSTC query, D_ψ - Relevant object set, minpts - The minpts value, ε_ub - The upper bound value for ε, inc_th - Increase percentage threshold

Output: params - The list of DBSCAN parameter tuples (ε, minpts)

1: params ← ∅;
2: kdistValues ← ∅;
3: for all object o in D_ψ do
4:     kdist ← irtree.GetKDist(o.λ, q.ψ, ε_ub, minpts);
5:     if kdist is not UNDEFINED then
6:         kdistValues.Add(kdist);
7:     end if
8: end for
9: if kdistValues.Size = 0 then
10:     return params;
11: end if
12: Sort(kdistValues);
13: prev ← 0;
14: nic ← 0;  ▷ The number of k-dist values that do not have the required increase percentage (inc_th).
15: for i = 1 to kdistValues.Size do
16:     curr ← kdistValues[i];
17:     if prev ≠ 0 ∧ the increase percentage between curr and prev exceeds inc_th then
18:         if nic ≥ minpts then
19:             params.Add((curr, minpts));
20:         end if
21:         nic ← 0;
22:     else
23:         nic ← nic + 1;
24:     end if
25:     prev ← curr;
26: end for
27: return params;

Example. Let us assume that the PoIs provided in the example dataset given in Figure F.1 are the relevant PoIs for a k-TMSTC query. We set minpts to 3, ε_ub to 5 units, and inc_th to 15%.

k-dist        1      1.41   1.41   1.41   1.41   2      2      2      2.24   3.61   3.61   3.61   4      4.24
nic           0      0      1      2      3      0      1      2      3      0      1      2      3      4
increase (%)  0      41     0      0      0      41.84  0      0      12     61.16  0      0      9.75   6

Table F.1: List of k-dist values and corresponding nic and increase percentage values

The first row of Table F.1 shows the sorted list of 3-dist values for the relevant objects. The algorithm iterates over the list and computes the nic and increase percentage values. The second and third rows of the table show the nic and increase percentage values, respectively. The output ε values are 2 and 3.61 since the increase percentage at each of these values exceeds the given inc_th and nic is equal to 3 immediately before each of them. However, 1.41 is not included in the output set since the corresponding nic value is below the minpts parameter.
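The cut-off detection of Algorithm F.3 (lines 12–27) can be replayed on the values of Table F.1 with a short Python sketch. It takes precomputed k-dist values instead of calling GetKDist, and it measures the increase relative to the previous value; both choices, as well as the function name get_eps_values, are assumptions of this sketch.

```python
def get_eps_values(kdist_values, minpts, inc_th):
    """Return (eps, minpts) pairs found at the cut-off points.

    inc_th is given in percent (15 means 15%).
    """
    params = []
    prev, nic = 0.0, 0
    for curr in sorted(kdist_values):
        if prev != 0 and (curr - prev) / prev * 100 > inc_th:
            if nic >= minpts:
                params.append((curr, minpts))
            nic = 0  # a sufficiently large jump starts a new run
        else:
            nic += 1  # value belongs to the current run
        prev = curr
    return params


kdist = [1, 1.41, 1.41, 1.41, 1.41, 2, 2, 2, 2.24, 3.61, 3.61, 3.61, 4, 4.24]
print(get_eps_values(kdist, minpts=3, inc_th=15))  # [(2, 3), (3.61, 3)]
```

The jump from 1 to 1.41 is large enough, but the run before it is too short, so 1.41 is not reported, which matches the discussion above.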

The algorithm to determine ε values employs the IR-tree on the objects to determine the distance from an object to its k-th nearest neighbor. Function GetKDist is quite similar to RangeQuery given in Algorithm F.2. The query location parameter is the object’s location, and the query range is set to ε_ub. This function has an additional parameter (k = minpts), which is the order of the neighbor whose distance is of interest. The algorithm employs a priority queue instead of a regular queue, and the nodes are added to the queue with the priority value being their distance to the query location. Apart from using a priority queue, the only part that is different in the algorithm is line 9. Instead of adding to the neighbors list, the algorithm counts the number of neighbors it has processed. If the current object is the k-th neighbor, it just returns the distance between the input location and the current object.
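The counting idea behind GetKDist can be shown without an IR-tree. The Python sketch below places all candidate locations in a priority queue ordered by distance and returns the distance of the k-th one popped, or None (standing in for UNDEFINED) when fewer than k candidates lie within ε_ub. The flat candidate list, the assumption that the candidates are already text-relevant and exclude the query object itself, and the use of ≤ for the range check are simplifications of this sketch.

```python
import heapq
import math


def get_k_dist(location, candidates, k, eps_ub):
    """Distance from location to its k-th nearest candidate within eps_ub."""
    heap = [(math.dist(location, c), c) for c in candidates]
    heapq.heapify(heap)
    seen = 0
    while heap:
        dist, _ = heapq.heappop(heap)  # nearest remaining candidate
        if dist > eps_ub:
            break  # every remaining candidate is farther away
        seen += 1
        if seen == k:
            return dist
    return None


points = [(1, 0), (0, 2), (3, 0), (0, 4)]
print(get_k_dist((0, 0), points, k=3, eps_ub=5.0))
print(get_k_dist((0, 0), points, k=3, eps_ub=2.5))
```

With the IR-tree, the queue would hold tree nodes as well as objects, so that subtrees far from the query location are never read.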

Algorithm F.4 ProcessKstcQuery(irtree, q, D_ψ)

Input: irtree - IR-tree, q - k-STC query, D_ψ - Relevant object set

Output: clusters - The list of clusters

1: slist ← sort objects in D_ψ in ascending order of d_{q.λ}(o);
2: clusters ← ∅;
3: while slist ≠ ∅ do
4:     o ← first element in slist;
5:     c ← GetCluster(irtree, q, o, slist);
6:     if c ≠ ∅ then
7:         Compute cost of c;
8:         clusters.Add(c);
9:     end if
10: end while
11: return clusters;

Paper F.

The function ProcessKstcQuery (Algorithm F.4) takes an IR-tree, a k-STC query, and a relevant object set as input and returns the density-based clusters. It iterates over the relevant objects in ascending order of their distance to the query location and, for each object, gets the cluster that has the object as its core object (lines 4 and 5). If the cluster is not empty, its cost is computed, and it is added to the result list (lines 6–9).

Algorithm F.5 GetCluster(irtree, q, o, slist)

Input: irtree - IR-tree, q - k-STC query, o - The core object, slist - Sorted list of relevant objects

Output: C - The cluster

1: C ← ∅;
2: neighbors ← irtree.RangeQuery(o.λ, q.ψ, q.ε);
3: if neighbors.size < q.minpts then
4:     Remove o from slist;
5:     Mark o as noise;
6:     return C;
7: else
8:     Add neighbors to C;
9:     Remove neighbors from slist;
10:     Remove o from neighbors;
11:     while neighbors ≠ ∅ do
12:         Object o_i ← remove an element from neighbors;
13:         neighbors_i ← irtree.RangeQuery(o_i.λ, q.ψ, q.ε);
14:         if neighbors_i.size ≥ q.minpts then
15:             for all Object o_j ∈ neighbors_i do
16:                 if o_j is noise then
17:                     Add o_j to C;
18:                 else if o_j ∉ C then
19:                     Add o_j to C;
20:                     Remove o_j from slist;
21:                     Add o_j to neighbors;
22:                 end if
23:             end for
24:         end if
25:     end while
26: end if
27: return C;

To get the cluster of a given core object, the function GetCluster (Algorithm F.5) is utilized. It issues a range query centered at o with a range of q.ε on the IR-tree (line 2). The goal is to check whether o is a core object. If the result set neighbors has fewer than q.minpts objects, object o is marked as noise, and an empty set is returned (lines 3–6). Otherwise, the algorithm initiates a cluster C containing the objects in neighbors. Next, this cluster is expanded by checking the ε-neighborhood of each object o_i in neighbors except o (lines 12 and 13). If the ε-neighborhood of o_i is dense (line 14), the objects inside the neighborhood that were previously marked as noise are added to the cluster (lines 16 and 17). The objects that are not processed yet are also added to the cluster, removed from the sorted list, and added to neighbors (lines 18–22). If no more objects can be added, cluster C is returned as the result (line 27). It is important to note that GetCluster is the same as in the regular DBSCAN algorithm. We provide the pseudocode here to facilitate understanding of the more advanced algorithms described later.
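The expansion step can be tried out on a handful of points with the following Python sketch. It replaces the IR-tree range query by a brute-force scan and ignores keywords, and it tolerates objects that are no longer in slist; these simplifications, together with the function names and the toy coordinates, are assumptions of the sketch.

```python
import math


def range_query(points, center_id, eps):
    """Ids of all points within eps of the given point (brute force)."""
    cx, cy = points[center_id]
    return [pid for pid, (x, y) in points.items()
            if math.hypot(x - cx, y - cy) <= eps]


def get_cluster(points, o, eps, minpts, slist, noise):
    """Expand a cluster from o; slist and noise are updated in place."""
    neighbors = range_query(points, o, eps)
    if len(neighbors) < minpts:
        slist.remove(o)
        noise.add(o)
        return set()
    cluster = set(neighbors)
    for pid in neighbors:
        if pid in slist:
            slist.remove(pid)
    pending = [pid for pid in neighbors if pid != o]
    while pending:
        oi = pending.pop()
        neighbors_i = range_query(points, oi, eps)
        if len(neighbors_i) >= minpts:  # oi is itself a core object
            for oj in neighbors_i:
                if oj in noise:
                    cluster.add(oj)
                elif oj not in cluster:
                    cluster.add(oj)
                    if oj in slist:
                        slist.remove(oj)
                    pending.append(oj)
    return cluster


points = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (10, 10)}
slist, noise = ["a", "b", "c", "d"], set()
first = get_cluster(points, "a", eps=1.2, minpts=2, slist=slist, noise=noise)
second = get_cluster(points, "d", eps=1.2, minpts=2, slist=slist, noise=noise)
print(sorted(first), sorted(second), sorted(noise))
```

Object c is not within range of a but is reached through b, which illustrates the density-reachability that the loop in lines 11–25 implements.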

SGPL-based Improvements

The DBSCAN-based algorithm explained above uses the IR-tree for range queries and does not utilize the SGPL-based improvements. We proceed to explain how we can make use of the optimizations, namely selectivity estimation and the FastRange algorithm [11], in the DBSCAN-based algorithm. We also propose a FastKDist algorithm based on SGPL to compute the k-th nearest neighbor distances of all objects.

Selectivity Estimation

Wu and Jensen [11] propose a selectivity estimation method based on SGPL to decrease the number of range queries issued to process queries. Given a set q.ψ containing m query keywords, the corresponding m SGPLs are merged to estimate the selectivity of a range query. Wu and Jensen define a merging operator ⊕ on several SGPLs that produces a count for each non-empty grid cell as follows:

\[
\bigoplus_{w_i \in q.\psi} (q_s) = \Big\{ \Big\langle c_j, \Big| \bigcup_{w_i \in q.\psi} S_{w_i,c_j} \Big| \Big\rangle \;\Big|\; C_{c_j} \cap q_s \neq \emptyset \Big\} \tag{F.5}
\]

The merge operator merges the SGPLs of the query keywords and returns a set of pairs, each of which contains a cell id and the number of relevant objects in the cell. The definition approximates the circular query region defined by λ and ε by its circumscribed square (q_s) to check the intersection efficiently, and C_{c_j} corresponds to the spatial region of the cell with id c_j. The idea is that if the number of relevant objects within the circumscribed square is less than q.minpts, there is no need to issue a range query. To incorporate this optimization, we just need to add a selectivity estimation check before issuing range queries in lines 2 and 13 of GetCluster as given in Algorithm F.5. In other words, the algorithm should only issue a range query if the selectivity estimate is at least q.minpts.
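A minimal Python sketch of this check is given below. For readability it keys grid cells by (column, row) instead of Z-values and stores each SGPL as a dictionary from cell to a set of object ids; this layout, the helper names, and the toy data are assumptions of the sketch, not the data structure of [11].

```python
def cell_rect(cell, cell_size):
    """Axis-aligned rectangle (x0, y0, x1, y1) covered by a grid cell."""
    col, row = cell
    return (col * cell_size, row * cell_size,
            (col + 1) * cell_size, (row + 1) * cell_size)


def intersects(a, b):
    """True if rectangles a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_counts(sgpls, keywords, center, eps, cell_size):
    """Per-cell count of distinct relevant objects in cells that meet q_s."""
    qs = (center[0] - eps, center[1] - eps, center[0] + eps, center[1] + eps)
    merged = {}
    for word in keywords:
        for cell, ids in sgpls.get(word, {}).items():
            if intersects(cell_rect(cell, cell_size), qs):
                merged.setdefault(cell, set()).update(ids)
    return {cell: len(ids) for cell, ids in merged.items()}


def may_be_dense(sgpls, keywords, center, eps, cell_size, minpts):
    """Selectivity check: a range query is only worthwhile if this is True."""
    counts = merge_counts(sgpls, keywords, center, eps, cell_size)
    return sum(counts.values()) >= minpts


sgpls = {
    "hat": {(0, 0): {"p1", "p2"}, (3, 3): {"p9"}},
    "jeans": {(0, 0): {"p2"}, (1, 0): {"p3"}},
}
counts = merge_counts(sgpls, ["hat", "jeans"], (0.5, 0.5), 0.6, 1.0)
print(counts)
print(may_be_dense(sgpls, ["hat", "jeans"], (0.5, 0.5), 0.6, 1.0, minpts=4))
```

The estimate is an upper bound on the true result size, since it counts every object in each intersecting cell, so skipping the range query when the estimate is below minpts never discards a dense neighborhood.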

FastRange Algorithm

Wu and Jensen [11] propose an algorithm to process range queries using SGPL. To be able to process queries with several keywords, they override the merging operator (⊕) as follows:

\[
\bigoplus_{w_i \in q.\psi} (q_s) = \Big\{ \Big\langle c_j, \bigcup_{w_i \in q.\psi} S_{w_i,c_j} \Big\rangle \;\Big|\; C_{c_j} \cap q_s \neq \emptyset \Big\} \tag{F.6}
\]

The overridden merge operator given in Equation F.6 produces a set of objects instead of a count for each grid cell so that range queries can be processed efficiently.

Algorithm F.6 FastRange(λ, ψ, ε, sgplList)

Input: λ - Query location, ψ - Query keywords, ε - Query range, sgplList - The list of SGPLs

Output: neighbors - The list of neighbors

1: neighbors ← ∅;
2: q_s ← The circumscribed square around the circular query region defined by λ and ε;
3: mergedSgpl ← ⊕_{w_i ∈ ψ}(q_s);
4: for all Cell c ∈ mergedSgpl do
5:     if c is completely inside the query region then
6:         Add all the objects inside c to neighbors;
7:     else
8:         for all Object o inside c do
9:             if ‖λ, o.λ‖ ≤ ε then
10:                 Add o to neighbors;
11:             end if
12:         end for
13:     end if
14: end for
15: return neighbors;

The FastRange algorithm (Algorithm F.6) takes a location (λ), a set of query keywords (ψ), a query range (ε), and the list of SGPLs (sgplList) as arguments. It first applies the overridden merge operator to the given list of SGPLs for the circumscribed square around the query region and assigns the result to mergedSgpl (lines 2 and 3). If a cell c from mergedSgpl is completely inside the query region, all objects in c are added to the result (lines 5 and 6). If a cell only partially intersects the query region, only the objects in c whose distance to the query location is no greater than ε are added to the result (lines 9 and 10).

The FastRange algorithm can be utilized in the DBSCAN-based algorithm by just changing its calls to irtree.RangeQuery to calls to FastRange in lines 2 and 13 of the GetCluster function (Algorithm F.5).
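The following Python sketch mirrors the structure of FastRange on a toy grid. As a simplification, cells are keyed by (column, row) rather than Z-values, each SGPL maps a cell to a dictionary of object locations, and a cell is treated as completely inside the query region when all four of its corners are within ε of the query location; the names and data are assumptions of this sketch.

```python
import math


def cell_rect(cell, cell_size):
    col, row = cell
    return (col * cell_size, row * cell_size,
            (col + 1) * cell_size, (row + 1) * cell_size)


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def fast_range(sgpls, keywords, center, eps, cell_size):
    """Ids of relevant objects within eps of center, using the grid only."""
    qs = (center[0] - eps, center[1] - eps, center[0] + eps, center[1] + eps)
    merged = {}  # cell -> {object id: location}, restricted to cells meeting q_s
    for word in keywords:
        for cell, objs in sgpls.get(word, {}).items():
            if intersects(cell_rect(cell, cell_size), qs):
                merged.setdefault(cell, {}).update(objs)
    neighbors = []
    for cell, objs in merged.items():
        x0, y0, x1, y1 = cell_rect(cell, cell_size)
        corners = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]
        if all(math.dist(center, c) <= eps for c in corners):
            neighbors.extend(objs)  # whole cell lies in the circle: no distance checks
        else:
            neighbors.extend(oid for oid, loc in objs.items()
                             if math.dist(center, loc) <= eps)
    return sorted(neighbors)


sgpls = {
    "hat": {(0, 0): {"p1": (0.5, 0.5)}, (2, 2): {"p2": (2.9, 2.9)},
            (1, 1): {"p4": (1.2, 1.8)}},
    "jeans": {(0, 0): {"p3": (0.1, 0.9)}, (3, 0): {"p5": (3.5, 0.5)}},
}
result = fast_range(sgpls, ["hat", "jeans"], center=(1.0, 1.0), eps=1.5, cell_size=1.0)
print(result)
```

Only the objects of the partially covered cell (2, 2) need an explicit distance computation in this run; the cell (3, 0) is pruned by the square test alone.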

FastKDist Algorithm

The DBSCAN-based algorithm needs to process the relevant object set to determine the appropriate ε values for the set. To do this, the GetEpsValues function given in Algorithm F.3 is employed. This requires a large amount of compute time because the algorithm traverses the IR-tree for each relevant object. For this reason, a more efficient way to find the distance of the k-th nearest neighbor (k-dist) for all relevant objects is desirable. We utilize the approach proposed by Wu and Tan [30] together with the SGPL index.

Fig. F.4: Levels and Groups for an Example kNN Query (redrawn from [30]). Panels: (a) Levels, (b) Groups.

Wu and Tan [30] propose an algorithm on top of a grid-based index to process k-nearest neighbor (kNN) queries. They propose a method to build a visit order consisting of cells, levels, and groups in order to reduce the number of cells that should be visited to answer the kNN query. Assuming that c_{a,b} is the cell that contains the query object, level l is the set of cells such that each cell c_{i,j} satisfies either i = a ± l ∧ b − l ≤ j ≤ b + l or j = b ± l ∧ a − l ≤ i ≤ a + l. Group g (1 ≤ g ≤ l + 1) of level l is the set of cells c_{i,j} that satisfy either i = a ± (g − 1) ∧ j = b ± l or i = a ± l ∧ j = b ± (g − 1). L_lG_g denotes group g of level l. Figures F.4a and F.4b show the corresponding levels and groups for an example kNN query (q) that is located in cell c_{3,3}.
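The two definitions reduce to simple arithmetic on the cell offsets: the level is the larger of the two absolute offsets from the query cell, and the group is the smaller one plus 1. The Python sketch below applies this to cells around c_{3,3}; the function names are assumptions of the sketch, and it only labels cells without reproducing the visit order of [30].

```python
from collections import Counter


def level_and_group(cell, query_cell):
    """Return (level, group) of a cell relative to the query cell."""
    di = abs(cell[0] - query_cell[0])
    dj = abs(cell[1] - query_cell[1])
    return max(di, dj), min(di, dj) + 1


def label(cell, query_cell):
    level, group = level_and_group(cell, query_cell)
    return f"L{level}G{group}"


q = (3, 3)
for cell in [(3, 4), (4, 4), (5, 3), (1, 2), (5, 5)]:
    print(cell, label(cell, q))

# Number of cells per group on a 7x7 grid centered on the query cell.
sizes = Counter(label((i, j), q) for i in range(7) for j in range(7))
print(sizes["L2G1"], sizes["L2G2"], sizes["L2G3"])
```

Within a level, group 1 holds the four cells that share a row or column with the query cell, and the highest group holds the four corner cells.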

FastKDist employs the same idea on top of SGPLs since we need a grid index that takes textual relevance to the given query into account. It also makes use of caching in order to reuse the distance computations between
