Thesis organization - Efficient query processing on spatial and textual data: beyond individual

The rest of the thesis is organized as follows. Chapter 2 presents an overview of the fundamental related work. In Chapter 3, we investigate the processing of batch queries. Chapter 4 presents the optimal location and keyword selection problem and its solutions. Chapter 5 explores top-m rank aggregation on streaming queries. Finally, Chapter 6 summarizes the thesis and discusses possible extensions of the current work.

Literature review

In this thesis, we address problems involving multiple queries on spatial and spatial-textual data. Our work relies on state-of-the-art spatial and textual indexes, and on different combinations of these indexes, to process the queries. First, we briefly discuss the general index structures and the basic query processing techniques that use them. As introduced in Section 1.1.2, the basic spatial query types are: (i) range queries, (ii) k nearest neighbor (kNN) queries, and (iii) reverse k nearest neighbor (RkNN) queries. The basic textual queries are (i) Boolean queries and (ii) ranking queries based on text similarity. Spatial-textual queries usually combine these spatial and textual query constraints. Finally, we present an overview of the research studies that address the processing of individual and multiple queries in this area, and discuss where our work stands in the literature.

2.1 Spatial Databases: Indexes and Query Processing

2.1.1 Spatial Indexes

Space partitioning indexes. A grid-based index [94] partitions a space into multiple non-overlapping regions, denoted as grid cells. Each object is allocated to the grid cell corresponding to its spatial position, and a data structure is maintained to access the objects by their grid cell IDs. The main advantage of this structure is that the index can be created first, and the spatial objects can then be inserted into the cells without changing the structure. Such a structure is called a ‘space-driven’ or data-independent structure.
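The grid scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the structure used in [94]: the class name `GridIndex`, the square cells, and the `cell_size` parameter are hypothetical choices made for the example.

```python
from collections import defaultdict


class GridIndex:
    """Uniform grid with square cells; origin and cell_size are assumed parameters."""

    def __init__(self, x_origin, y_origin, cell_size):
        self.x_origin = x_origin
        self.y_origin = y_origin
        self.cell_size = cell_size
        # Maps a grid cell ID (column, row) to the objects stored in that cell.
        self.cells = defaultdict(list)

    def cell_id(self, x, y):
        """Return the (column, row) ID of the cell containing point (x, y)."""
        col = int((x - self.x_origin) // self.cell_size)
        row = int((y - self.y_origin) // self.cell_size)
        return (col, row)

    def insert(self, obj_id, x, y):
        """Insertion never changes the partitioning: the grid is space-driven."""
        self.cells[self.cell_id(x, y)].append((obj_id, x, y))

    def range_query(self, x1, y1, x2, y2):
        """Return IDs of objects inside the rectangle [x1, x2] x [y1, y2]."""
        c1, r1 = self.cell_id(x1, y1)
        c2, r2 = self.cell_id(x2, y2)
        result = []
        for col in range(c1, c2 + 1):
            for row in range(r1, r2 + 1):
                # Only cells that intersect the query rectangle are inspected.
                for obj_id, x, y in self.cells.get((col, row), []):
                    if x1 <= x <= x2 and y1 <= y <= y2:
                        result.append(obj_id)
        return result
```

Because the cell of an object depends only on its coordinates, the partitioning is fixed before any data arrives, which is the sense in which the structure is data independent.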

Figure 2.1: The example of an R-tree. (a) Objects and MBRs. (b) R-tree.

Finkel and Bentley [40] propose the Quadtree, which partitions a two-dimensional space by recursively subdividing it into four quadrants or regions, denoted as Quadtree cells. Each cell has a maximum capacity; a cell is split into four lower-level cells if its maximum capacity is reached during insertion. Thus, unlike a grid index, the granularity of the partitions can be varied according to the number and/or the nature of the data to be stored. As shown in the study by Kothuri et al. [69], Quadtrees are suitable for update-intensive applications.
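The capacity-driven splitting of Quadtree cells can be illustrated with a short sketch. It is a simplified point Quadtree written for this example, not the original structure from [40]: the class name, the choice to split when the capacity is exceeded, and the `min_size` guard against endless splitting on duplicate points are all assumptions.

```python
class Quadtree:
    """A square cell with lower-left corner (x, y) and side length `size`."""

    def __init__(self, x, y, size, capacity=2, min_size=1e-9):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.min_size = min_size   # assumed guard: stop splitting tiny cells
        self.points = []
        self.children = None       # four sub-cells once this cell is split

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        # Split into four lower-level cells once the capacity is exceeded.
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _split(self):
        half = self.size / 2
        args = (half, self.capacity, self.min_size)
        self.children = [
            Quadtree(self.x, self.y, *args),                 # south-west
            Quadtree(self.x + half, self.y, *args),          # south-east
            Quadtree(self.x, self.y + half, *args),          # north-west
            Quadtree(self.x + half, self.y + half, *args),   # north-east
        ]
        points, self.points = self.points, []
        for px, py in points:
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        half = self.size / 2
        index = (1 if px >= self.x + half else 0) + (2 if py >= self.y + half else 0)
        return self.children[index]

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(child.depth() for child in self.children)
```

Dense regions end up in deeper, smaller cells while sparse regions stay coarse, which is the variable granularity that distinguishes the Quadtree from a uniform grid.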

Bentley [5] proposed the k-dimensional tree (k-d tree), which is a space-partitioning data structure to index points in a k-dimensional space. A k-d tree is a binary tree where each node is a k-dimensional point. One of the dimensions is chosen for each non-leaf node, and an implicit hyperplane is generated that passes through the point and is perpendicular to that dimension’s axis. This hyperplane partitions the space into two parts. Points to the left of the hyperplane are represented by the left subtree of that node, and points to the right of the hyperplane are represented by the right subtree.
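As an illustration of the k-d tree partitioning, the sketch below builds a tree by cycling through the dimensions and splitting at the median. The cycling rule, the median choice, and the names `KDNode`, `build_kdtree`, and `contains` are assumptions made for this example; other ways of choosing the splitting dimension are possible.

```python
from typing import NamedTuple, Optional


class KDNode(NamedTuple):
    point: tuple                 # the k-dimensional point stored at this node
    axis: int                    # dimension of the splitting hyperplane
    left: Optional["KDNode"]     # points with a smaller coordinate on `axis`
    right: Optional["KDNode"]    # points with an equal or larger coordinate


def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of equal-length tuples."""
    if not points:
        return None
    axis = depth % len(points[0])          # assumed rule: cycle through dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    # Move to the first point with the median coordinate so that the left
    # subtree holds strictly smaller coordinates on this axis.
    while mid > 0 and pts[mid - 1][axis] == pts[mid][axis]:
        mid -= 1
    return KDNode(
        pts[mid],
        axis,
        build_kdtree(pts[:mid], depth + 1),
        build_kdtree(pts[mid + 1:], depth + 1),
    )


def contains(node, target):
    """Exact-match search: follow one side of each hyperplane."""
    while node is not None:
        if node.point == target:
            return True
        if target[node.axis] < node.point[node.axis]:
            node = node.left
        else:
            node = node.right
    return False
```

For the points (2,3), (5,4), (9,6), (4,7), (8,1), (7,2), the root splits on the first dimension at (7,2), and each child then splits on the second dimension.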

R-tree and its variants. The R-tree [53] is a commonly used spatial index that can be used to store any geometric shape. The idea is to group the spatially close objects in a Minimum Bounding Rectangle (MBR). The R-tree is a hierarchy of nodes, each containing a number of entries. Each entry of a non-leaf node consists of the identifier of a child node and the MBR of all entries of that child node. The entries of a leaf node are the data objects. The number of entries in a node is bounded by a maximum value. The construction of R-trees is ‘data-driven’, that is, the resulting structure depends on the dataset being indexed. Figure 2.1a shows the locations of an example set of objects O = {o1, o2, . . . , o7} and Figure 2.1b illustrates the R-tree for O. In this example, the maximum capacity of a node is set to 2.

Figure 2.2: An example of the Z-curve ordering

The performance of the index depends on how the MBRs are constructed. Two variants of the R-tree, namely the R+-tree [110] and the R*-tree [4], were proposed to improve the performance by minimizing the ‘coverage’ and the ‘overlap’. Coverage refers to the total area covered by the MBRs. Overlap is the total area that is contained in more than one node. Minimizing the coverage reduces the amount of empty space covered by the nodes. Minimizing the overlap allows a single path to be followed during query processing, since an area is then contained in only one node. The R+-tree ensures that the entries of the internal nodes do not overlap by allowing partitions to split rectangles. The R*-tree attempts to reduce both coverage and overlap, using node splitting and reinsertion.
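To make the two quantities concrete, the following sketch computes the MBR of a point set, the area of an MBR, and the area shared by two MBRs. The tuple layout `(x1, y1, x2, y2)` and the function names are assumptions for this example; the cited variants use their own, more involved cost functions.

```python
def mbr(points):
    """Minimum bounding rectangle (x1, y1, x2, y2) of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def area(rect):
    """Area of a rectangle; summed over a level's MBRs this gives its coverage."""
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def overlap_area(a, b):
    """Area contained in both rectangles, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0
```

A query point that falls inside the shared area of two sibling MBRs forces the traversal to descend into both subtrees, which is why a smaller overlap tends to mean fewer node accesses.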

Space filling curves. Space filling curves, such as the Z-order curve [90] and the Hilbert curve [56], are used to impose a linear ordering on spatial objects, such that the objects close to each other in space are also close to each other in the ordering. After the data is ordered, any one-dimensional data structure (e.g., the B-tree [27]) can be used to index them. Figure 2.2 shows an example where the IDs of a set of objects O = {o1, o2, . . . , o7} are assigned according to their position in a Z-curve.
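One common way to compute a position on the Z-order curve is to interleave the bits of the integer cell coordinates. The sketch below assumes non-negative integer coordinates and places the x bits in the even positions; both the function name `z_value` and this bit convention are choices made for the example.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bit goes to an even position
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bit goes to an odd position
    return z


def z_order(points):
    """Sort (x, y) points along the Z-curve; the result can be stored in any
    one-dimensional index, such as a B-tree or a sorted array."""
    return sorted(points, key=lambda p: z_value(p[0], p[1]))
```

For example, the four cells (0,0), (1,0), (0,1), (1,1) receive the keys 0, 1, 2, 3, which traces the characteristic ‘Z’ shape.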

2.1.2 Spatial Query Processing

All of the spatial indexes support efficient processing of spatial range queries. As an example, in R-trees, the nodes are recursively traversed from the root node to the leaf nodes to answer a range query. If a node does not intersect the query range, the subtree rooted at that node is pruned. When a leaf node is reached, each of its entries (data objects) is verified and returned as a result if it is actually contained in the query range.
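The recursive range query traversal can be sketched as below. The `Node` layout and the hand-built two-level tree in the usage example are hypothetical: the sketch only covers the search, not R-tree insertion or node splitting, and it assumes point objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                     # (x1, y1, x2, y2)
    children: list = field(default_factory=list)   # child nodes (non-leaf node)
    objects: list = field(default_factory=list)    # (obj_id, (x, y)) pairs (leaf)


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains_point(rect, p):
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]


def range_query(node, query):
    """Return IDs of all objects inside the query rectangle."""
    if not intersects(node.mbr, query):
        return []                                  # prune the whole subtree
    if not node.children:                          # leaf: verify each object
        return [oid for oid, p in node.objects if contains_point(query, p)]
    result = []
    for child in node.children:
        result.extend(range_query(child, query))
    return result
```

A small usage example with two leaves under one root:

```python
leaf1 = Node(mbr=(1, 1, 2, 2), objects=[("o1", (1, 1)), ("o2", (2, 2))])
leaf2 = Node(mbr=(8, 7, 9, 8), objects=[("o3", (8, 8)), ("o4", (9, 7))])
root = Node(mbr=(1, 1, 9, 8), children=[leaf1, leaf2])
print(range_query(root, (0, 0, 3, 3)))   # only leaf1 is visited
```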


Roussopoulos et al. [106] proposed a branch-and-bound traversal algorithm to answer kNN queries. Starting from the root node, the minimum and the maximum distances of each node from the query location are computed; they give an upper bound (overestimation) and a lower bound (underestimation) of the spatial similarity, respectively. The k best lower bound spatial similarities are maintained and updated as the nodes are traversed. If the upper bound spatial similarity of a node is worse than the k-th best lower bound similarity score, the subtree rooted at that node is pruned from consideration. When a leaf node is reached, the actual distances of the objects in that node from the query are calculated. Finally, the k objects with the smallest distances are returned as the result. Hjaltason and Samet [57] proposed an incremental algorithm to answer kNN queries that is applicable to a large class of hierarchical spatial data structures. In this approach, the nodes of the tree are accessed in best-first order of their minimum distances from the query location, that is, in order of their upper bound spatial similarities. Finally, the first k data objects that are accessed in this order are returned as the result. Their experiments show that the algorithm significantly outperforms the branch-and-bound approach.
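The best-first idea can be sketched with a priority queue keyed on the minimum distance between the query and each entry. This is a simplified illustration in the spirit of the incremental approach, not the algorithm of [57] as published: the dictionary-based node layout, the point objects, and the function names are assumptions for the example.

```python
import heapq
import itertools
import math


def min_dist(mbr, q):
    """Smallest possible distance from point q to rectangle (x1, y1, x2, y2)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)


def knn_best_first(root, q, k):
    """Return up to k (obj_id, distance) pairs in increasing distance from q.

    Assumed node layout: {"mbr": ..., "children": [...]} for non-leaf nodes
    and {"mbr": ..., "objects": [(obj_id, (x, y)), ...]} for leaf nodes.
    """
    tie = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(min_dist(root["mbr"], q), next(tie), "node", root)]
    result = []
    while heap and len(result) < k:
        dist, _, kind, item = heapq.heappop(heap)
        if kind == "object":
            # Every remaining entry is at least this far away, so the object
            # is the next nearest neighbor.
            result.append((item, dist))
        elif "children" in item:
            for child in item["children"]:
                heapq.heappush(heap, (min_dist(child["mbr"], q), next(tie), "node", child))
        else:
            for oid, p in item["objects"]:
                heapq.heappush(heap, (math.dist(p, q), next(tie), "object", oid))
    return result
```

Because the queue is ordered by minimum distance, a subtree is expanded only when no unexplored entry could contain a closer object, and the search can stop as soon as k objects have been reported.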
