• No results found

SPATIAL OLAP QUERY ENGINE: PROCESSING AGGREGATE QUERIES ON SPATIAL OLAP DATA

N/A
N/A
Protected

Academic year: 2021

Share "SPATIAL OLAP QUERY ENGINE: PROCESSING AGGREGATE QUERIES ON SPATIAL OLAP DATA"

Copied!
75
0
0

Loading.... (view fulltext now)

Full text

(1)

SPATIAL OLAP QUERY ENGINE:

PROCESSING AGGREGATE QUERIES

ON SPATIAL OLAP DATA

by

Yongchul (Kenneth) Choi

LL.B., Seoul National University, 1991

B.Sc., Simon Fraser University, 2005

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

In the

School of Computing Science

© Yongchul (Kenneth) Choi 2009

SIMON FRASER UNIVERSITY

Summer 2009

All rights reserved. However, in accordance with the Copyright Act of Canada,

this work may be reproduced, without authorization, under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private

study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

(2)

APPROVAL

Name: Yongchul (Kenneth) Choi Degree: Master of Science

Title of Thesis: SPATIAL OLAP QUERY ENGINE: PROCESSING AGGREGATE QUERIES ON SPATIAL OLAP DATA

Examining Committee:

Chair: Dr. Ze-Nian Li

Professor of Computing Science

______________________________________

Dr. Wo-Shun Luk

Senior Supervisor

Professor of Computing Science

______________________________________

Dr. Jian Pei

Supervisor

Associate Professor of Computing Science

______________________________________

Dr. Ke Wang

Examiner

Professor of Computing Science

Date Defended/Approved: July-17-2009

(3)

Declaration of

Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library Burnaby, BC, Canada

(4)

ABSTRACT

A spatial OLAP can be characterised as a practical union of OLAP analysis and geographic mapping. A spatial OLAP query has a spatial confinement along with the conventional non-spatial predicate. An existing framework we opt for is to convert a spatial OLAP query into a set of queries for a general-purpose ROLAP engine. However, little has been done at the query optimization level, once the queries are submitted to the query engine.

This thesis introduces three query engines on an experimental MOLAP system. The first is the implementation of the framework in the MOLAP context. The second increases the efficiency by adopting a novel merging technique to screen out many useless queries. The third does all aggregation on the fly, which outperforms the first two query engines by a wide margin under many

circumstances. Detailed experimental performance data are presented, using a real-life database with 1/3 million of spatial objects.

Keywords: OLAP; Spatial Data Management; Spatial OLAP; Spatial OLAP Query

(5)

DEDICATION

This thesis is dedicated to my family, who kept me strong in the course of working on my thesis.

(6)

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. Wo-Shun Luk. Your sound advice and careful guidance were invaluable in working on my thesis. Without your time and patience to guide me through the maze, this thesis would not have been possible.

(7)

TABLE OF CONTENTS

Approval... ii  Abstract... iii  Dedication ... iv  Acknowledgements ... v  Table of Contents ... vi 

List of Figures ... viii 

List of Tables ... ix 

1: Introduction ... 1 

1.1  Traditional data warehouse and OLAP system ... 3 

1.2  Two approaches for spatial OLAP query processing ... 4 

1.3  Improved query processing strategies ... 6 

1.4  Organization of the thesis ... 8 

2: Spatial Data Model and Spatial Query ... 10 

2.1  Spatial OLAP Database Model ... 10 

2.2  Spatial OLAP Query... 14 

3: Architecture for Query Processing System ... 16 

3.1  Overview ... 16 

3.2  Data Model ... 17 

3.3  Query Submission and User Interface ... 18 

3.4  Spatial Predicate Processor and Spatial Index (R-tree) ... 21 

3.5  Query Formatter ... 21 

3.6  Query Engine ... 23 

3.7  Storage Engine and Spatial OLAP Database (B+-tree) ... 23 

4: Storage Engine and Spatial OLAP Database (B+-tree)... 24 

4.1  Overview ... 24 

4.2  Search Key (B-key) ... 25 

5: Spatial Predicate Processor and Spatial Index (R-tree) ... 27 

5.1  Overview ... 27 

5.2  Generating the Spatial Hierarchy ... 28 

5.3  Retrieving a list of Spatial Objects within a Query Window ... 30 

(8)

6.3  SIQUE (Sequential-Index Query Engine) ... 41 

6.4  PICTQUE (Primary Cell Table Query Engine) ... 45 

7: Experimental Evaluation and Analysis ... 52 

7.1  Data Set ... 52 

7.2  R-tree Height (Fan-Out) ... 53 

7.3  Experiments on Query Engines ... 55 

7.4  Scalability ... 60 

8: Conclusion and Future Direction ... 62 

8.1  Conclusion ... 62 

8.2  Future Direction ... 63 

(9)

LIST OF FIGURES

Figure 1: Dimension Hierarchy for Non-Spatial Dimension ... 10 

Figure 2: Dimension Hierarchy for Spatial Dimension ... 12 

Figure 3: Architecture of Spatial OLAP Query Processing System ... 16 

Figure 4: User Interface to Specify a Query Window ... 19 

Figure 5: SQL Statement for an Example Query ... 20 

Figure 6: Re-written SQL Statement ... 22 

Figure 7: An Example of a Point Query ... 22 

Figure 8: B+-tree and Storage Engine Interfaces ... 25 

Figure 9: Embedded & Intersected Nodes ... 31 

Figure 10: Overlap Search Algorithm in an R-tree ... 33 

Figure 11: Fan-outs & Number of points to be processed ... 34 

Figure 12: Query Translation to B-key ... 37 

Figure 13: Generic Algorithm ... 40 

Figure 14: Merge Process in SIQUE ... 42 

Figure 15: SIQUE Algorithm ... 44 

Figure 16: Architecture of PICTQUE(a) ... 46 

Figure 17: Algorithm of Locate-pCR in PICT ... 47 

Figure 18: Algorithm of Filter Process ... 48 

Figure 19: PICTQUE(a) Algorithm ... 49 

Figure 20: Architecture of Spatial OLAP Query Processing System with PICTQUE(b) ... 50 

Figure 21: PICTQUE(b) Algorithm ... 51 

Figure 22: Efficiency of Query Engines & Number of Queries ... 59 

Figure 23: Average Query Time & Size of Databases ... 60 

(10)

LIST OF TABLES

Table 1: Star Schema of a Spatial OLAP Database ... 17 

Table 2: Dimension Hierarchies of Real Estate Database ... 52 

Table 3: R-tree Fan-out & Performance (SIQUE Used) ... 54 

Table 4: Numbers of Primary & Group SOID's Inside Query Window ... 56 

Table 5: Running Times (in ms): Generic vs. SIQUE ... 57 

Table 6: Running Times (in ms): PICTQUE(a) vs. PICTQUE(b) ... 58 

(11)

1: INTRODUCTION

Data warehouses and business intelligence (BI) tools are indispensable for modern decision makers in businesses and government agencies to make better decisions. A data warehouse is a large electronic repository consolidated to store the organization's historical data gathered from operational databases. It is well structured not merely to store data as archives for future references or evidences, but to facilitate complex queries, analyses, and reports. An Online Analytical Processing (OLAP), a very prominent business intelligence tool, is a series of operations or protocols on a data warehouse to provide answers for sophisticated analytical queries with various points-of-view. OLAP has been an essential tool for decision makers, which allows them to extract meaningful information at various angles from a pile of raw data in a data warehouse.

In recent years, it has been observed that most transactions in operational business databases involve the spatial component as one of attributes of a transaction. For example, let us consider a database of sales from electronic stores in a chain. The location of each electronic store typically will be stored as an attribute of it, and each transaction can be related to the specific location. We can simply call such transactions spatial data. The area of spatial data

(12)

its focus. Currently, an R-tree (or any of its variants) is a predominant indexing structure for spatial data management. A spatial query is in the form of a query window, which usually is a 2-dimensional rectangle on a map specified by the user. The answer to the spatial query is the set of spatial objects located inside the query window.

Unsurprisingly, a vast amount of data accumulated into data warehouses are also spatial data. However, traditional data warehouses and OLAP systems have not been able to process spatial data very well. Recently, researchers have begun to focus on developing specialized OLAP techniques to handle spatial data efficiently, by noting radical differences between spatial data and non-spatial data. We can call this discipline of research spatial OLAP. Indeed, spatial OLAP refers to the confluence of two technologies that are on the rise. They are spatial data management and OLAP technologies. In geographic information system (GIS) application areas, there has been considerable interest in employing the potential of spatial OLAP (e.g., [3] and [8]), although they do not define spatial OLAP in exactly same ways as we, database researchers, do.

This thesis introduces a MOLAP based spatial OLAP framework and efficient spatial OLAP query processing strategies on the framework. A spatial OLAP query typically requests aggregate information about the non-spatial aspects of the spatial objects inside the query window the user has drawn. The following subsections are short reviews of related work, and a brief introduction of improved query processing strategies to be presented in this thesis.

(13)

1.1 Traditional data warehouse and OLAP system

Traditionally, the system to process the spatial condition of a spatial OLAP query has been considered as a separate database from a general-purpose non-spatial OLAP database. The non-spatial condition of a query is usually a query window on a map. The system to process the spatial condition filters spatial objects based on the query window and the spatial information of spatial objects. When a spatial OLAP query is submitted, the system to process the spatial condition retrieves the objects inside the query window, and the non-spatial OLAP database returns objects that satisfy the conditions about non-spatial aspects. Then, the intersection operation is done on two result sets to get the set of objects whose measures should contribute to the aggregate answer.

However, the traditional approach is not efficient. It is because the size of result sets to be intersected is usually very large (especially for the set of spatial objects inside a query window), and the intersection operation is very expensive. To minimize the number of spatial objects to process, and to find an efficient alternative to costly intersection operation, methods of arranging spatial objects into a hierarchy and aggregating on that hierarchy have been proposed. Those methods incorporate the system to process the spatial condition into the spatial OLAP database system. As soon as the result set of spatial condition is

computed, the system needs to check the non-spatial condition only on elements in the set. To make it work, they had to overcome two main difficulties that the traditional data warehouses and OLAP systems have not been able to solve

(14)

concept hierarchies of dimensions are crucial in building a data cube. As for regular non-spatial dimensions, OLAP cube designers and implementers are able to pre-define concept hierarchies using well-thought business perspectives to meet the needs of a specific application. However, the hierarchical structure for spatial aspects of spatial objects cannot be defined easily in advance. This lack of pre-definability demands special techniques to define a hierarchy for spatial aspects of spatial objects. Secondly, OLAP query processing engine of the traditional OLAP system lacks facilities to process an ad hoc grouping of dimension hierarchies. In spatial OLAP query processing, members in a

hierarchy for spatial aspects should be decided by the system on the fly after the user specifies a query window on a map, whereas members in a hierarchy of any spatial dimension can be directly chosen by the user in a query as a non-spatial predicate.

1.2 Two approaches for spatial OLAP query processing

In the literature, there have been two major approaches to minimize the number of spatial objects to process, and to find an efficient alternative to costly intersection operation. They are quite different to each other and the choice seems to be application dependent.

The first approach is presented in [6] and [7], where an application of traffic supervision is considered. The system is to track the positions of cars on the roads. A typical query to the system is to find the total number of cars on the road segments inside a query window. An aggregate R-Tree (aR-tree) has been proposed to store the necessary information for update and query answering.

(15)

The aggregate R-tree augments a regular R-tree with aggregate information, e.g., the number of cars, stored in each tree node, which is actually a window in the map. The non-spatial data could be stored in the node, but the aggregation must be done in conjunction with the spatial data. It builds an aR-tree, a new type of an OLAP database, which can be described as the OLAP database embedded in the spatial indexing structure.

The second approach is to store the spatial data as a regular dimension along with non-spatial dimensions. An OLAP cube is built for these dimensions. The big advantage of this approach over the aR-tree approach is the existing OLAP cube-building techniques can be reused. This is possible by adopting a spatial indexing structure as a module to a system. A spatial indexing structure functions to serve two purposes: The one is to derive a hierarchy for a dimension about spatial aspects necessary for building an OLAP cube. The other is to locate interesting spatial objects inside the query window on the fly. This

approach is adopted by two papers, e.g., [10] and [11]. The example application described in these two papers is about gas sales by gas station in a

neighbourhood. The database contains details of each gas sale: gas station ID, gas type, customer ID, transaction time, sales amount and quantity per

transaction. In addition to the non-spatial data, the location of each gas station, in x-y coordinates, is also stored associated with each gas station ID. A typical spatial query is: what are the total sales of each type of gas for taxis (one type of customer) at gas stations within a query window? To process queries of this type

(16)

non-spatial dimensions and a hierarchy for a spatial-related dimension generated by a spatial indexing structure. Here, a spatial indexing structure plays dual roles: deriving a spatial hierarchy, and locating interesting spatial objects inside the query window. A viable R-tree is chosen as a spatial indexing structure. The main query processing strategy of this approach is converting a spatial OLAP query into a set of regular ROLAP queries, and processing them against a general-purpose OLAP engine.

The efficiency gains of two new approaches over the traditional approach are due to eliminating costly intersection set operation by incorporating the spatial indexing engine and the spatial hierarchy into a system. Under these approaches, once the spatial indexing engine retrieves spatial objects inside a query window, the system needs to look into the non-spatial conditions only for those objects returned. It does not need to check non-spatial conditions to a whole database. As soon as the result set for the spatial condition is computed, the members in a dimension about the spatial aspects are filtered out. In

addition, the number of members in a result set is much smaller than that in the traditional approach because the objects with members in a higher level of spatial hierarchy can contribute the answer, eliminating the need for processing many objects with members in a base level of spatial hierarchy.

1.3 Improved query processing strategies

Between aforementioned two approaches, the latter approach is chosen as the framework of the research in this thesis. The main rationale behind this choice is that advantages from utilizing an established OLAP paradigm seem to

(17)

outweigh any benefits an aR-tree approach may derive, in terms of cost effectiveness or system reusability. Under this approach, by simply adding an indexing engine as a system module to the existent OLAP system, many well-established OLAP techniques can be leveraged across applications.

This thesis is concerned with the query optimization level inside an OLAP query engine. Efficiency is crucial especially in spatial OLAP query processing system where the system is supposed to do much ad hoc processing. However, very little research has been done so far. In [11], for example, a regular ROLAP will take it over for processing once the query is submitted. In other words, the query optimization techniques that are applied to the spatial OLAP query is not domain specific. Working with an experimental OLAP database system [5], which features a multidimensional interface (MOLAP), it is relatively easy, in

comparison to ROLAP, to implement a specific query processing strategy and assess quantitatively the direct impact of the strategy on a query performance. Three query-processing strategies are developed and evaluated in this thesis.

The first strategy is the implementation of the approach in [11], except that we choose to implement it for our experimental MOLAP system. It will

demonstrate how the framework works systematically especially in the context of MOALP. It also serves as the baseline to other strategies.

The second strategy selects only the spatial objects inside the query rectangle to be processed in a batch against the non-spatial predicate on our

(18)

inside a query window. Processing them in a batch may yield performance improvement over processing each spatial object one at a time. Depending on the non-spatial predicate of the query in question, a large number of the spatial objects thus identified may eventually be excluded from the answer because they fail to satisfy the non-spatial predicate. In [5], an OLAP query-processing engine, called SIRQUE (Sequential-Indexed-Range QUery Engine) is developed to process a batch of OLAP queries efficiently. This efficiency gain is due to its capability of discarding, without any processing, an overwhelming number of queries that are doomed to fail the query predicates.

The third strategy is designed and implemented from rather a different perspective. It does not rely on any pre-computed aggregates at all, but instead, on computation of the required aggregates on the fly. This strategy is based on an observation that the characteristics of the input data and the hardware (CPU and main memory) capacities can make dealing with the input data on the fly more efficient. Instead of searching the OLAP cube, which could be huge in size, the system can efficiently search the input data set to compute the answer of the spatial query on the fly under this strategy.

1.4 Organization of the thesis

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of the spatial data model and the spatial query, and gives a

motivating example. Chapter 3 presents the overall architecture of the spatial OLAP query processing system in this thesis. Chapter 4 introduces the storage engine used in the system. Chapter 5 explains the spatial predicate processor

(19)

along with an R-tree indexing structure in the system. Chapter 6 illustrates three query-processing engines developed and evaluated in this thesis. Chapter 7 shows experimental results to demonstrate the efficiency of proposed query engines. Finally, chapter 8 concludes the thesis and suggests a future research direction.

(20)

2: SPATIAL DATA MODEL AND SPATIAL QUERY

2.1 Spatial OLAP Database Model

An OLAP database is seen as a cube which is defined in this thesis to be a k-dimensional array of cells, where a k is a positive integer greater than zero,

under our MOLAP framework. Each dimension Di of a cube has |Di| members, for 1 ≤ i ≤ k, which are organized as a hierarchy. The members at the leaf level are called primary members. All other members in any higher level of the dimension

hierarchy are called group members. The root of hierarchy is named after the

dimension name itself. The hierarchy is a tree hierarchy, where a member is assumed to have exactly one parent, except for the root, which has no parent. In particular, there exists exactly one path between a group member and any of its descendants. Figure 1 shows an example of the dimension hierarchy of a non-spatial dimension.

(21)

A spatial OLAP database is in structure identical to a general-purpose MOLAP database under our chosen framework. The main difference between the spatial OLAP database and the general non-spatial OLAP database is about the basic design of the cube. In a spatial OLAP database, one of the dimensions is designated as the dimension for the hierarchy for spatial aspects of spatial OLAP data, e.g., kth dimension. In this thesis, we just call such dimension the spatial-object (SO) dimension or the spatial dimension in short. Associated with any primary member in the SO dimension is the unique identifier (ID) of that spatial object. This is called a primary spatial object ID, or a primary SOID

(pSOID) in short. In this thesis, a spatial object is assumed to be 2-dimensional,

which is identified by a minimum bounded rectangle (MBR). An MBR is

represented by the x-y coordinates of the upper-right and the lower-left corner points of a rectangle. A 2D point, with x-y coordinates, is considered to be a degenerate MBR. The hierarchy associated with the SO dimension is a spatial index, which is assumed to be an R-tree in this thesis. A group SOID (gSOID)

corresponds to a node in an R-tree. Associated with this group member, say G,

is an MBR which is the MBR of all rectangles associated with the child nodes of

G or the elements of G. The root node of this R-tree is called the map. To comply

with the definition of an OLAP cube, each node of an R-tree has exactly one parent, except the root, which has no parent. Figure 2 shows an example of the dimension hierarchy of a spatial dimension, which actually is an R-tree in this thesis.

(22)

Figure 2: Dimension Hierarchy for Spatial Dimension

A cell in the cube is identified uniquely by a k-tuple, which is composed of

its coordinates along the k dimensions. A cell is a group cell if at least one

coordinate of the cell is a group member of some dimension; otherwise it is a

primary cell. A cell stores some numeric values, which are called measures.

Measures of all primary cells are input from a data source(s). The measure of a group cell has to be derived from the measures of some of the primary cells or other group cells. In this thesis, we assume that all measures of all group cells are pre-computed. The process of pre-computing measures of all group

members is called aggregation. Over the last decade, there has been a large

number of aggregation algorithms published in the literature. A detailed

description of an aggregation algorithm used in an experimental MOLAP system can be found in [5]. Usually, only a tiny fraction of cells in a multi-dimensional array of cells are non-empty, which is called a sparse cube. The user may submit

(23)

a query to the spatial OLAP database to retrieve the measures stored in the cell. This query is essentially the set of k coordinates of the cell.

As an example of the spatial data model, consider a real estate database about the properties around the San Diego city area in California in the United States. The input data consists of data records, each of which describes a property, with the following attributes: property_type, year_built,

number_of_bedrooms, lot_size, region (9-digit ZIP code in ZIP+4 format), property_ID, tax_value, and property_value. The name of each attribute has

been chosen to be self-explanatory as possible as it can be. The first five

attributes are non-spatial dimensions, the property_ID is the SO dimension, and

the last two are measures.

For a non-spatial dimension, its dimension hierarchy is defined by the user by grouping the primary members together according to the requirement of the application. In an example, the hierarchy of a year_built dimension, which is one

of non-spatial dimensions of an example spatial data model, has a group member called modern, which has, as its child members, 2001, …, and 2009

(Figure 1) and was chosen to represent years in the current century.

For a spatial dimension, the node (or the element in it) of an R-tree as the spatial indexing structure (Figure 2) can be mapped to a member of the

dimension hierarchy on a one-to-one basis. In our example, a group member of a spatial dimension property_ID is an ID for a node in an R-tree. Associated with

(24)

primary member of a property_ID is an ID for an element in a leaf node of an

R-tree, which are read from an input data. Associated with this primary member are the longitude (x-coordinate) and the latitude (y-coordinate) pairs of the upper-right and the lower-left corner points of its MBR, where the longitude and the latitude can be same if the member has the point data instead of the region data.

2.2 Spatial OLAP Query

A spatial OLAP query has a select clause and two predicates: a spatial predicate and a non-spatial predicate. The select clause, just like the one in

standard SQL, contains a subset of measure attributes, and their associated distributive aggregation function(s), e.g., sum(), count(), max(), or min(). We

assume here the measure is always a designated measure, e.g., tax_value, and sum() is the distributive aggregation function. For brevity, we represent a spatial

predicate as a query window, which is a rectangle on the map. Likewise, the non-spatial predicate is represented by a vector <p1, p2, …, pk-1>, such that each element pi, for 1 ≤ i ≤ k-1, is a member of the hierarchy of a dimension Di.

The answer to this spatial query will be the measure value which is derived by applying the distributive aggregation function to the set of measures contained in the cells that satisfy both non-spatial and spatial predicates. A cell satisfies the spatial predicate if the spatial location of the associated spatial object, by virtue of its x-y coordinates, is found to be inside the query window. A cell satisfies the non-spatial predicate if each of its non-spatial coordinate vi in a cell coordinate <v1, v2, …, vk-1>, where 1 ≤ i ≤ k-1, is equal to, or a descendant of, pi, in a hierarchy of that dimension. Note that the spatial object represented in the

(25)

cell could be an element in a leaf node, a leaf node, or a non-leaf node in the R-tree.

An example of a spatial query is to find the sum of tax assessment values of all properties that satisfies conditions on non-spatial dimensions and lies inside a query window represented by the longitude (x-coordinate) and the latitude (y-coordinate) pairs of the upper-right and the lower-left corner points. The example query is “What are the total tax values of all properties of any property type with three bedrooms, the lot size of acreage, and ZIP code of 91901-2400, which is built in the current century and located inside a query window drawn on the map?” The non-spatial predicate in this query is: ‘property_type’ =

property_type AND ‘year_built’ = modern AND ‘number_of_bedrooms’ = 3 AND

‘lot_size’ = acreage AND ‘region’ = 91901-2400. A cell, whose property is located

inside the query window, of property type ‘RSFR’, built in year 2003, with 3 bedrooms, on a lot with 48000 sq. ft, with ZIP code 91901-2400 as the 9-digit zip code, will satisfy both predicates and will contribute to the final aggregate

answer, assuming year 2003 is a child member of year modern and the lot size 48000 sq. is a child member of lot size acreage, as is the case under our

(26)

3: ARCHITECTURE FOR QUERY PROCESSING SYSTEM

3.1 Overview

The architecture for spatial OLAP query processing system is shown in Figure 3. This chapter illustrates how a spatial OLAP query is processed systematically using various module engines in the system.

(27)

3.2 Data Model

As stated in the previous chapter, data models for non-spatial dimensions are pre-defined by database designers with specific application needs or

business perspectives. That’s how traditional OLAP databases are designed. As for the spatial dimension, under the framework we opt for, a spatial index (R-tree) built from a spatial data works directly as a data model for it. The entire data model (containing hierarchies of both non-spatial and spatial dimensions) works not only as a blueprint to build an OLAP database, but also as a protocol how the user can make query conditions.

In our work, the star schema is used as an OLAP database model at the

logical level. The dimension tables and a fact table of our OLAP database are as

follows:

Table 1: Star Schema of a Spatial OLAP Database

Dimension Tables

property_type (type_id, type_name)

Non-Spatial Dimension(s)

year_built (year_id, year_value)

number_of_bedrooms (count_id,

count_value)

lot_size (size_id, size_value)

region (region_id, zip_code)

property_ID (property_id, location*) Dimension Spatial

Fact Table

property_info

(type_id, year_id, count_id, size_id, region_id, property_id, tax_value, property_value)

(28)

Here, the location attribute of a spatial dimension property_ID is x-y

coordinates of two diagonal corner points of a rectangle (each point has a longitude and latitude pair).

Each dimension has its own dimension hierarchy. They are either

conceptual groupings for non-spatial dimensions (Figure 1), or the spatial index for a spatial dimension (Figure 2). The mechanism of design and construction of the hierarchy for a spatial dimension will be explained in chapter 5 in detail.

3.3 Query Submission and User Interface

We start with a user submitting a spatial query described in a query example in Chapter 2. The semantic of a query is “What are the total tax values of all properties of any property type with three bedrooms, the lot size of acreage, and ZIP code of 91901-2400, which is built in the current century and located inside a query window drawn on the map?” The end

user can form a query in a textual format or on a graphical user interface. In either case, the spatial query window is translated to the x-y earth coordinates of two diagonal corner points (each point has a longitude and latitude pair). In this thesis, the graphical user interface has been built mainly to accommodate the conveniences of evaluation routines. This will be further developed to be the front-end part of the whole spatial OLAP query processing system, as a web application, in future. The user can draw a rectangle on a map, which is built with the customized Google Maps Interface, to specify the spatial predicate. Figure 4

(29)

shows a snapshot of a user interface when the user draws a query rectangle on a map.

Figure 4: User Interface to Specify a Query Window

When the user draw a query rectangle, the x-y earth coordinates of two corner points are computed and sent to the spatial predicate processor in the spatial OLAP query processing system (Figure 3).

(30)

Now, the equivalent SQL statement to an example query can be written as follows:

SELECT SUM (pinfo.tax_value) AS answer

FROM property_type pt, year_built yb, number_of_bedrooms nb, lot_size ls, region rg, property_ID pid, property_info pinfo WHERE pinfo.property_id = pid.property_id AND

pid.location WITHIN query_window (llx, lly, urx, ury)* AND

pinfo.type_id = pt.type_id AND pinfo.year_id = yb.year_id AND yb.year_value IN (‘2001’ … ‘2009’) AND

pinfo.count_id = nb.count_id AND nb.count_value = 3 AND pinfo.size_id = ls.size_id AND ls.size_value IN (‘10k’ … ∞ ) AND pinfo.region_id = rg.region_id AND rg.zip_code = ‘91901-2400’

Figure 5: SQL Statement for an Example Query

*The llx is the longitude (x-coordinate) and the lly is the latitude (y-coordinate) of the lower-left corner point for the query window. Similarly, the urx and the ury are coordinates for the upper-right corner point of the query window.

The non-spatial predicate is directly sent to the query formatter. However, the non-spatial predicate needs to be augmented with each of the SOID’s that represent the spatial objects inside the query window before

processed against a general-purpose OLAP. For this augmentation, the spatial predicate is sent to the spatial predicate processor. As illustrated (Figure 4), the user interface translates the user drawn rectangular query window into the format

(31)

(the x-y earth coordinates of two diagonal corner points are typical) that the spatial predicate processor can understand, and sends that geo-information to the spatial predicate processor.

3.4 Spatial Predicate Processor and Spatial Index (R-tree)

The job of the spatial predicate processor is to call the R-tree interface to retrieve the list of spatial objects (R-tree nodes or elements in nodes), where the MBR of each of them is contained inside a query window. Then, the list of spatial objet ID’s of them is sent to the query formatter.

The list contains both primary spatial object ID’s (pSOID’s) and group

spatial object ID’s (gSOID’s). The mechanisms about the spatial predicate

processor and the spatial index (R-tree) will be discussed in chapter 5.

3.5 Query Formatter

Based on the non-spatial predicate and the list of SOID’s (say the name of the list to be SOID_list), the query formatter forms a series of point queries to

be fed into the query engine.

When we interpret our MOLAP database into ROLAP database parlance, the summary table concerning a spatial dimension can be as follows:

cell_summary (SOID, property_type, year_built, number_of_bedrooms,

lot_size, region, tax_value, property_value)

(32)

SELECT SUM (cs.tax_value) AS answer FROM cell_summary cs

WHERE cs.SOID IN “SOID_list” AND

cs.property_type = null AND

cs.year_built IN (‘2000’ … ‘2009’) AND cs.number_of_bedrooms = 3 AND cs.lot_size IN (‘10k’ … ∞ ) AND cs.region = ‘91901-2400’

Figure 6: Re-written SQL Statement

Physically, under our MOLAP query processing system, each point query generated by the query formatter can be represented by a select clause and a vector <p1,…, pk-1, pk>, such that each element pi, for 1 ≤ i ≤ k-1, is a member of

the hierarchy of a non-spatial dimension Di, and the element pi, for i = k, is a member of SOID_list the spatial predicate processor computed. For example,

one of the point queries in a batch can be represented as follows: Point Query # 003 =

{

Measure = tax_value

Aggregation Function = SUM()

Query Coordinates = <RSFR, 2003, 3, 48000, 91901-2400, G8*>

}

Figure 7: An Example of a Point Query

(33)

3.6 Query Engine

The development of the query engine is the main focus of this thesis. The workflow in the spatial OLAP query processing system before the query engine is similar to the workflow adopted in [11], except the fact that we use the MOLAP model instead of the ROLAP model. However, there has been little research about how efficiently these queries can be processed against the underlying general-purpose ROLAP database. In this thesis, we introduce a module called the query engine (Figure 3), which oversees how queries coming from the query formatter are to be fed into the storage engine, before they are processed by it. The answer to the spatial query will be made available by this query engine based on data coming from the storage engine. In chapter 6, we will present, in detail, three versions of query engines developed and evaluated in this thesis.

3.7 Storage Engine and Spatial OLAP Database (B+-tree)

At the bottom layer of the query processing architecture is the database storage engine which features two interfaces: sequential and indexed ones. The

main task of the storage engine is to search the B+-tree, which physically stores the contents of the OLAP database (cube). Detailed explanation of the Storage Engine and the spatial OLAP database (B+-tree) will be in the next chapter (chapter 4).

(34)

4: Storage Engine and Spatial OLAP Database (B+-tree)

4.1 Overview

Our experimental MOLAP system uses the B+-tree as a physical storage structure, which is simple and already well-integrated into the relational DBMS. By storing the OLAP database in the B-tree(s), updatability, spreadsheet-like querying, partial pre-aggregation, and parallelization can be accommodated more easily. More importantly, optimizing the query processing performance can be done more systematically by exploiting the sequential processing capabilities of the B+-tree.

The storage engine in the system has two distinctive but cooperative interfaces: Sequential and Indexed ones (Figure 8). These two interfaces make

searching the B+-tree, which stores the contents of the OLAP database, at the physical level, very efficient.

(35)

Figure 8: B+-tree and Storage Engine Interfaces

The storage system can be considered as an attempt to integrate the MOLAP concept (an OLAP cube) at the logical level with the B+-tree technology at the physical or implementation level. About the detail of cube building

(aggregation) algorithm used for the OLAP database in our system, refer to the paper [5]. In this thesis, only the relevant part of the algorithm is presented.

4.2 Search Key (B-key)

As for the structure of an OLAP database (cube), only non-empty cells (data entries) are stored in nodes at the leaf level of a B+-tree (Figure 8). Each cell contains a unique key and measure value(s). The unique key, called B-key,

(36)

Let D0, D1, …, Dk-1 be the 1st to the kth dimension respectively, and |Di| be the total number of primary and group members in the hierarchy for Di, for 0 ≤ i ≤

k-1. The dimensions are arranged such that |Di| ≤ |Di+1|, for 0 ≤ i ≤ k-2. All

members of each dimension, say dimension i, are mapped, on one-to-one basis, into a range of integers, {0, …, |Di|-1}. The B-key of a cell, with the numeric cell coordinates <v0, v1, …, vk-1>, is given by the formula v0*|D1|*…*|Dk-1| +

v1*|D2|*…*|Dk-1| + … + vk-1. Note that this B-key can be converted back uniquely to the coordinates of the cell.

Associated with every non-empty cell is a cell record, which consists of its uniquely assigned B-key and its measures read from the input data (primary cell) or aggregated (group cell). A B+-tree is built with B-key as the index, where all cell records are stored on the leaf nodes. This storage structure is sometimes called integrated B+-tree clustered on the B-key [4]. A query to the spatial OLAP database, which is in the form of a cell address, should be first converted into the B-key of the target cell by the query engine before the call to the storage engine.

(37)

5: Spatial Predicate Processor and Spatial Index (R-tree)

5.1 Overview

The job of the spatial predicate processor is to call the spatial index interface to retrieve the list of data entries (spatial objects) in the spatial indexing structure, where an MBR of each data entry is contained inside the query

window. The list of spatial object ID’s of data entries retrieved are sent to the query formatter, in which a series of point queries are constructed using this list and the non-spatial predicate specified by and sent from the user.

An R-tree is chosen as a spatial indexing structure, which is the most popular choice to handle spatial data with geo-coordinates as one of its

properties. It was proposed a long time ago (about 25 years ago) [2]. However, it (or any of its variants, such as R*-tree, R+-tree, Priority R-tree, or Hilbert R-tree) is still the most important and effective indexing structure for spatial data.

As stated, an R-tree spatial indexing structure plays dual roles in the spatial OLAP query processing system: The one is to derive the hierarchy for a spatial dimension required to build a spatial OLAP database. The other is to locate interesting spatial objects inside the query window at a query time to augment the non-spatial predicate. In the following sections, the mechanism of performing each role will be illustrated.

(38)

5.2 Generating the Spatial Hierarchy

Applications associated with techniques used in the spatial database management systems have been very popular nowadays in the areas such as GIS, CAD/CAM, or medical imaging. Because the conventional linear index / storage structure (e.g., B+-tree) for the relational database model turned out to be inadequate for handling multi-dimensional spatial data, various spatial

indexing structures have been proposed to exploit the existence of some kind of spatial relationship among spatial objects, to organize spatial data entries in a way to search them efficiently. An R-tree spatial indexing structure is one of those, and the most successful one.

In our system, the basic techniques in Guttman’s R-tree [2] are used to build an R-tree on an input data set. Each data entry in an input data set we use has geo-coordinates as one of attributes. If we look at the schema (Table 1), the dimension “property_ID (property_id, location)” contains the unique identifier to

represent each data entry and its associated geo-spatial coordinate location,

where location has x-y coordinates of two diagonal corner points of an MBR of a

property.

When building an R-tree, all data entries in an input data set become the primary members for the spatial dimension, which will reside in the leaf nodes of an R-tree. From the leaf level to the root, dynamic hierarchical organizations of a set of multi-dimensional spatial objects will be formed. The definitions and the mechanisms to generate the hierarchy for the spatial dimension by building an R-tree in our system are as follows.

(39)

• Each object entry in the leaf node represents a primary member for a spatial dimension: Each object entry is of the form <pSOID, MBR>, where pSOID is

the unique identifier of the object and MBR is the minimum bounded rectangle

of the object, which can be a point location. The number of object entries contained in each leaf node is between M/2 and M.

• Each node (a leaf node or an internal node) represents a group member for a spatial dimension: Each node is of the form <gSOID, MBR, list_ids>, where

gSOID is the unique spatial object ID of the node, and MBR is the minimum

bounded rectangle that encloses MBR’s of all object entries inside or child nodes pointed. If the node is a leaf node, list_ids is the list of object entries

that are bounded in MBR. If a node is an internal node, list_ids is the list of

pointers to child nodes that are bounded in MBR. The number of entries in list_ids is between M/2 and M (fan-out).

• All leaf nodes are at the same level (height-balanced tree).

• Insertion is similar as that of B+-tree. Choosing a proper leaf node and adjusting tree are straightforward. The overlap search algorithm (Figure 10:

Algorithm 5.3) and the technique to ensure the least enlargement of MBR have been used when searching the best node from the root to the leaf level. When a node has no room for a new data entry, splitting a node is done. Guttman [2] proposed three splitting algorithms: The exponential-cost exhaustive split, the quadratic-cost split, and the linear-cost split. In our

(40)

entries as seeds for the two nodes, which are located as far as possible to each other (computed by the greatest normalized separation). One of

technical difficulties we faced is MBR’s can overlap each other while objects inside do not. This case sometimes causes costly false signals. An R-tree should have some filtering mechanism to prevent the search process from following unnecessary paths. In addition, the insertion order of object entries seems to affect the building speed and the shape of resulting R-tree. Hence, some kinds of pre-processing (e.g., sorting) of input object entries helped to improve the insertion and the searching speeds.

Once an R-tree is built on the input data, all members (primary and group) of a spatial dimension are organized as a dimension hierarchy identical to the hierarchy of an R-tree itself. This spatial hierarchy will be processed in the exactly same way as the hierarchy for the non-spatial dimension when building the spatial OLAP database.

5.3 Retrieving a list of Spatial Objects within a Query Window

As stated, when the user submit a query, the spatial predicate is sent to the spatial predicate processor. Then, the spatial predicate processor calls the spatial index interface to retrieve the list of SOID’s of spatial objects in the spatial index, where the MBR of each spatial object is contained inside the query

(41)

Figure 9: Embedded & Intersected Nodes

Figure 9 is the 2-dimensional map representation of a dimension hierarchy of a spatial dimension shown in Figure 2. The dimension hierarchy in Figure 2 also matches the spatial index (R-tree) of the same example data. Let’s suppose the user draws a query rectangle on a map (the rectangle with the dashed line in Figure 9). There are two types of nodes to be returned: the intersected nodes and the embedded nodes. An embedded node is one which is totally contained in the confines of the query window. In Figure 9, G8 is an embedded node.

Because primary spatial objects, such as P5 and P6, are also embedded in G8, they are not returned. G7 is an intersected node. In G7, primary spatial objects, such as P1, P2, and P3, are embedded in the query window, but not P4.

(42)

The papers [10] and [11] present different solutions on how these

intersected nodes are processed by the spatial predicate processor. In [11], the list of primary SOID’s of all intersected nodes inside the query window, e.g. P1, P2, and P3, is returned, together with the selected group SOID’s corresponding to the embedded nodes, such as G8, to the query formatter. In [10] however, a heuristic is applied, so that only group SOID’s are returned. The heuristic is to derive an estimate on how much percentage of the measure of an intersected node should be included in the answer. Assuming that the user is interested in only exact query answer, we opt for the approach in [11].

Search starts from the root of the R-tree, and looks for nodes or objects in a top-down manner. If any node is an embedded node, the node ID is returned and the search stops. If any node is an intersected node, the search process looks at its child nodes recursively. The spatial index interface uses the following

overlap search algorithm (Figure 10: Algorithm 5.3) to find and return group

(43)

A

Allggoorriitthhmm55..33::OOvveerrllaappSSeeaarrcchh((RReeccuurrssiivvee))

I

INNPPUUTT: 1. The root node N of an R-tree

2. A query rectangle (a spatial predicate) R O

OUUTTPPUUTT: L: The list of primary and group spatial object ID’s

F

FUNUNCCTTIIOONN OOVVEERRLLAAPP SSEAEARRCCHH ((NN,,RR,,LL) )

// Base Case 1: Embedded Group SOID Case

If MBR of N is embedded in R, Then put ID of N into L, and quit

//Base Case 2: Embedded Primary SOID Case If N is a leaf node,

Then check each entry E in N,

and put ID’s of all embedded E’s into L, and quit

//Recursive Step: Intersected Group SOID Case If N is not a leaf node,

Then find a child node C that each entry E of N points to

For each C, Do recursively calling Overlap Search (C, R, L)

E

ENNDD FFUNUNCCTTIIOONN

Figure 10: Overlap Search Algorithm in an R-tree

5.4 Fan-out and the efficiency of the system

The fan-out of an R-tree is the major design consideration for a spatial system in general. A smaller fan-out makes a taller R-tree, which entails a longer time to build, and a longer retrieval time to travel from the root to the nodes at the bottom. However, an R-tree is after all an indexing system. A fine-grained

indexing system will reduce the intra-node processing. In the absence of an R-tree, all spatial objects are viewed as being included in one huge node, and they must be processed for any sizes of query windows. By pruning sub-trees, a

(44)

smaller fan-out will narrow the search area, and improve the query processing time. The following example (Figure 11) illustrates how it works.

Figure 11: Fan-outs & Number of points to be processed

Consider the Figure 11, which shows two R-trees, on the same map, with different fan-outs, i.e., with different node sizes. The size of the fan-out affects the total number of points that must be processed in order to derive the spatial objects embedded in the query window, which are used to form the batch of point queries in the query formatter. The total number of points that must be processed is the sum of all spatial objects in the intersected nodes, because, in our overlap search algorithm (Figure 10), all objects in the intersected nodes should be

examined one by one to test whether they are embedded in the query window. Only the ones inside the query window are eligible to be members of the answer list. Here, considerable processing is required to filter these ones if there are a huge number of objects included in the intersected nodes. By the design of an

(45)

R-tree, a smaller fan-out means a smaller total area covered by all intersected nodes (the dotted rectangles in Figure 11) for the same query rectangle. That is because the smaller fan-out, in (ii), results in more embedded nodes (the shaded rectangles in Figure 11) with smaller areas than the larger fan-out does, which, in turn, saves the filtering process for all points inside those embedded nodes.

In a spatial OLAP system, the choice for the fan-out of an R-Tree has an even greater impact on the system performance. A taller R-tree will be translated into a higher dimension hierarchy for the spatial dimension in the spatial OLAP system. Thus, the fan-out of the R-tree has a direct consequence on the cube building time and the query processing time, since the size of an OLAP database grows linearly as the height of a single dimension hierarchy grows. On the other hand, a smaller fan-out tends to reduce the number of queries that must be sent to the query engine, by making the ratio of the number of embedded group spatial objects to the number of embedded primary spatial objects higher, which usually reduces the total number of point queries.

In section 7.2, some experiments about the R-tree fan-out and the system performance will be presented.

(46)

6: QUERY ENGINE

6.1 Overview

The query engine (Figure 3) is in charge of refining the flow of point queries from the query formatter, then, feeding queries into the storage

engine, and forming the answer based on data returned from the storage engine. It can be dubbed the traffic controller of the spatial OLAP query processing

system.

The query engine first converts the point query from the query formatter into the form of numeric cell coordinates, using the rule that members of each dimension are mapped, on one-to-one basis, into a range of integers within the cardinality of members of it. Then, the query engine translates the cell

coordinates into the B-key of the target cell that the storage engine can understand, using the rule that the B-key of a cell, with the numeric cell coordinates <v0, v1, …, vk-1>, is given by the formula v0*|D1|*…*|Dk-1| + v1*|D2|*…*|Dk-1| + … + vk-1 (Refer to section 4.2).

(47)

A point query from the query formatter (Figure 7): Point Query # 003 =

{

Measure = tax_value

Aggregation Function = SUM()

Query Coordinates = <RSFR, 2003, 3, 48000, 91901-2400, G8>

} Î

Query translation to the form of numeric cell coordinates by the query engine:

Cell Address Query # 003 = {

Measure = 1 // numeric id of tax_value

Aggregation Function = 0 // numeric id of SUM()

// numeric id’s of each dimension in vector from: <v0, v1, …, vk-1> Cell Coordinates = <23, 104, 2, 1231, 4632, 23587>

} Î

Query translation to B-key by the query engine: B-key Query # 003 =

{

Measure = 1 // numeric id of tax_value

Aggregation Function = 0 // numeric id of SUM()

B-key = 73563219 // B-key }

Figure 12: Query Translation to B-key

The heart of this thesis is to develop query engines (refer to [1]) that show the efficiency improvement over the framework [11]. Such improvement lies, at the query optimization level, in the query engine. We observe that the number of

(48)

queries from the query formatter may contain millions of point queries. The

processing these point queries all in series may be intolerable. This can be called the query explosion problem.

This thesis introduces three query-processing strategies (query engines) to solve the query explosion problem. The first one is the implementation of the

framework in [11] on our experimental MOLAP system. That will prove the effectiveness of the framework especially under MOLAP context. More

practically, it will function as a benchmark to other strategies developed in this thesis. The second and the third ones solve the query explosion problem with very different approaches from each other.

The following sections show the detail of each strategy one by one.

6.2 Generic

This query engine is the simplest of all, and shows our framework of aggregating on the spatial hierarchy, by incorporating the spatial index into a system, is feasible and efficient on our MOLAP system.

It also serves as the baseline, to which the performance of all other versions of the query engine is compared. It processes the input queries in a batch, one at a time. Each point query from the query formatter is sent to the storage engine after the B-key translation process is done in the way illustrated in the previous section. It works on the same framework that the paper [11]

proposed. However, our MOALP based implementation is simpler than the ROLAP based implementation of the paper [11]. Instead of building two different

(49)

databases (one summary table for group SOID’s, and the other one for primary SOID’s), and performing unions of answers from queries against them as in [11], our system maintains only one unified OLAP database (cube), a B+-tree, that contains both primary and group spatial objects. The answer returned by the storage engine is the required measure associated with the query. The aggregation function, e.g., summation, is applied to the retrieved measure, to yield the query result.

(50)

A

Allggoorriitthhmm66..22::GGeenneerriicc

I

INNPPUUTT:: 1. Select Clause: Measure M & Distributive Aggregation Function F

2. Query Set Q-batch: Cartesian product of members of Non-spatial

predicate NS and Result of Spatial Predicate (SOID’s) S O

OUUTTPPUUTT:: Aggregate Answer A F

FUUNNCCTTIIOONN GGEENNEERRIICC ((MM,,FF,,Q-batch,,AA))

1. Do B-key translation for queries in a Q-batch

2. Sort them in an ascending order of their B-keys 3. Query Loop

While (as long as there exists any query Q in a Q-batch)

// B-key query to the Storage Engine

3.1 Point Result RÅB-key Point-Query (B-key, M, F) to the Storage

Engine

// Update A based on F and R 3.2 F (A, R)

End While

E

ENNDD FFUNUNCCTTIIOONN

Figure 13: Generic Algorithm

Although the generic version is doing far better, by aggregating on the spatial hierarchy, than the traditional scheme without it, the framework [11] does not mention about any optimization device to gain efficiency under the

framework. The following strategies are such attempts that can make further improvements, at the query performance aspect, over the existing efficiency gain by the aggregation on the spatial hierarchy.

(51)

6.3 SIQUE (Sequential-Index Query Engine)

Let the set of queries from the query formatter be called the Q-Batch,

which will be sorted in ascending order of their B-keys after the B-key translation

process by the query engine. The rationale behind the SIQUE is the recognition

that a significant portion of queries in a Q-batch will fail to satisfy the non-spatial predicate. More importantly, these queries will return empty cells. For example, if a property P has four bedrooms, the query asking the system the tax value of the

property with the same SOID as that of P, with three bedrooms, will return

nothing, i.e., an empty cell. Since the B+-tree contains only the non-empty cell, the cell corresponding to the query will not even exist in the B+-tree.

Is there a way of screening out those queries in the Q-Batch that are destined to return empty cells, without resorting to the storage engine? At present, we don’t have any method that guarantees the absolute success (i.e., each query, the query engine sends to storage engine, will always return the non-empty cell), but the SIQUE we propose here will raise the success rate, by a significant margin, over that of the Generic query engine, which attempts every query in the Q-Batch.

Consider a simple example to show how SIQUE works. Let us suppose the queries in the Q-Batch have the following B-keys: 10, 13, 14, 16, and 20. The B+-tree contains, in its leaf nodes, cells with B-keys as 15, 16, and 23. The job of the SIQUE is to locate the intersection of two sets with as few as queries

(52)

storage engine is modified such that it always returns the B-key of the cell that is next to 10, i.e., 15. With this information, SIQUE skips next two queries (with B-keys as 13 and 14), and tries the query with the B-key next to 15, which is 16. This time around there is a matching cell in the B+-tree, and the storage engine returns the next B-key, which is 23. Since 23 is larger than any of the B-keys in the Q-batch, the merge process is over. There are two queries attempted (with B-keys 10 and 16), and one successful match. That is 60% reduction in number of attempted queries in comparison with the Generic query engine.

Figure 14 illustrates an idea how this merge process works.

Figure 14: Merge Process in SIQUE

The query range is defined as the interval between the smallest and the largest B-keys in the Q-batch. Let CR-batch be cell records, in sorted order, in the B-tree whose B-keys fall into the query range. This solution is feasible because the random access is available in both Q-Batch and CR-batch. The random access capability, implemented in our query engine and storage engine, makes GetNextCR() or GetNextQuery() interface very efficient, because

(53)

that the number of point queries actually processed by the storage engine is fewer than either the number of queries in the Q-batch, or the number of cell records in the CR-batch. The actual amount of reduction is apparently very dependent on the contents of the Q-batch and CR-batch. However, as we will see in chapter 7, the query reduction ratio is very high on our experiment.

(54)

A

Allggoorriitthhmm66..33::SSIIQQUUEE

I

INNPPUUTT:: 1. Select Clause: Measure M & Distributive Aggregation Function F

2. Query Set Q-batch: Cartesian product of members of Non-spatial

predicate NS and Result of Spatial Predicate (SOIDs) S O

OUUTTPPUUTT:: Aggregate Answer A

F

FUUNNCCTTIIOONN SSIIQQUUEE((MM,,FF,,Q-batch,,AA))

1. Do B-key translation for queries in a Q-batch

2. Sort them in ascending order of their B-keys

3. Merge of Q-Batch (with B-keys in ascending order) and CR-Batch (with

B-keys in ascending order) that is accessible in the storage engine: Let Q be the first query of the Q-batch

While (Q is inside the query range)

3.1 Processing on the Query Engine

(a) B-key Point-Query (B-key, M, F) to the Storage Engine

(b) If a matching CR, whose B-key is same as that of Q, exists

Then update A with the measure of CR returned from the

storage engine

(c) If not, Then receive the CR ( = NextCR ) from the storage engine

(d) Call GetNextQuery (NextQ, Q-batch): NextQ is the Q with the

smallest B-key among all Q’s with B-keys > CR

(e) QÅNextQ

(f) If Q is outside the query range, Then quit

3.2 Processing on the Storage Engine

(a) If a cell record CR, whose B-key is same as that of Q, exists

Then retrieve the measure of CR from a cube and pass it to the

query engine

(b) Call GetNextCR (NextCR, CR-batch): NextCR is the CR with the

smallest B-key among all CR’s with B-keys > Q

(c) CRÅNextCR

(d) If CR is outside the query range, Then quit

(e) Return CR to the query engine

End While

E

ENNDD FFUNUNCCTTIIOONN

(55)

6.4 PICTQUE (Primary Cell Table Query Engine)

The design of this query engine is motivated by the method of processing queries with primary SOID’s described in [11]. Two types of OLAP databases are built in the form of summary tables under ROALP model scheme [11]: One for processing queries with group SOID’s and the other for processing queries with primary SOID’s. As for the size of the set of the all primary cell records in an OLAP database, we can observe that it typically occupies only a tiny fraction of the total space of the OLAP database. Hence, computing the aggregates on the fly only from the primary cells might have a competitive edge over pre-computing all the aggregates for the OLAP database. There will be less I/O time, at the expense of increase in CPU time due to aggregate computation and more memory usage. Considering faster evolution of CPU capability and larger price drop rate of memory unit compared to the rate of I/O time reduction, this

approach can be very feasible solution for the efficiency enhancement.

The architecture as shown in Figure 3 is modified, with the introduction of three new components (Filter Engine, Primary Cell Table, and Filter) inside the

query engine. The module architecture of PICTQUE engine is shown in Figure 16.

(56)

Figure 16: Architecture of PICTQUE(a)

The component named primary cell table (PICT) is a flat file containing

all primary records, each of which has a B-key and associated measures. The filter engine receives primary SOID’s from the query processor module in the query engine, and performs a search on the PICT to locate the associated primary cell record. The following (Figure 17) is the algorithm for locating a primary cell record in PICT. The algorithm is very simple and efficient because, under the spatial index and OLAP database schemes we use, any input data record (in the form of B-key with measures) can be easily found using the

(57)

primary SOID, as a search key. The information about dimensions and measures associated with that data record can be easily decoded from a B-key, using the B-key translation scheme in our system.

A

Allggoorriitthhmm66..44..11::LLooccaatteePPrriimmaarryyCCRRiinnPPIICCTT

I

INNPPUUTT: 1. Primary SOID : pSOID

2. Primary Cell Table PICT O

OUUTTPPUUTT: Primary Cell Record : pCR

F

FUUNNCCTTIIOONN LLOOCCAATTEE--PPCCRR((PPSSOOIIDD,,PPIICCTT,,PPCCRR)) 1. Calculate the offset using the pSOID

2. Find the primary cell record in PICT using the offset

3. Decode B-key in a cell into primary member ID’s of dimensions 4. pCRÅ Decoded primary member ID’s & Measures in a primary cell E

ENNDD FFUNUNCCTTIIOONN

Figure 17: Algorithm of Locate-pCR in PICT

Once retrieved, this record will be subject to a filtering process in the filter engine to determine whether this record will satisfy the non-spatial predicate. To speed up this process, a data structure called filter has been pre-computed. It contains the ancestor/descendant relationships of all pairs of members in each dimension. Filter may be built as i-dimensional array (where i is the number of

dimensions for a data record) or a hash table. The following is the algorithm for the filtering process:

References

Related documents