• No results found

Implementing Real-Time OLAP with Multidimensional Dynamic Clustering

N/A
N/A
Protected

Academic year: 2020

Share "Implementing Real-Time OLAP with Multidimensional Dynamic Clustering"

Copied!
80
0
0

Loading.... (view fulltext now)

Full text

(1)

MARTIN, MICHAEL W. Implementing Real-Time OLAP with Multidimensional Dynamic Clustering. (Under the direction of Dr. Rada Y. Chirkova).

This dissertation details multidimensional dynamic clustering (MDDC), a new

multidimen-sional data clustering method that supports efficient On-Line Analytical Processing (OLAP)

queries [24]. OLAP queries are generally multidimensional business intelligence queries that

require large amounts of data, complex table joins, and aggregate data calculations to

pro-duce useful results [19]. MDDC allows OLAP queries to execute very efficiently on base

tables while requiring fewer materialized views, aggregates, and secondary indexes than

many previous solutions that support OLAP queries. MDDC also allows dynamic inserts,

updates, and deletes on stored data without reorganization of the underlying data

struc-tures. This is especially important for real-time data warehouses that must provide efficient

OLAP query access while continually adding data to and maintaining data in tables. Such

data warehouses cannot stop query processing to reorganize data structures when data

ex-ceeds preset thresholds such as a fixed number of key values. This dissertation provides

a precise definition of MDDC, including a comparison with multidimensional hierarchical

clustering (MHC), and presents experimental results to substantiate the claims that MDDC

is more dynamic than MHC and provides symmetry that is better than or equal to that of

(2)

by

Michael W. Martin

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Computer Science

Raleigh

2006

Approved By:

Dr. Xiasong Ma Dr. Ting Yu

(3)

I would like to dedicate this dissertation to the three people who would not accept

anything less than my having the highest level of education possible.

I dedicate it to my mother, Alta Irene Simmons Martin. In her own education, she never

advanced past the seventh grade. She performed hard manual labor all her short life.

Understandably, she insisted that I have as much education as possible so that I would

have a better life.

I also dedicate it to my father-in-law, Malton Ruffin Tripp Sr., and my wife, Shelia Ann

Tripp Martin. Through many years they insisted that I obtain a Doctor of Philosophy

degree in Computer Science. They always provided positive encouragement and they

(4)

Biography

My educational background consists of a Bachelor of Science in Computer Science from the

University of North Carolina at Wilmington in 1986, a Master of Science in Management

from North Carolina State University in 1990, and a Master of Computer Science from

North Carolina State University in 1997.

My professional experience consists of 19 years in information technology with

positions of increasing responsibility including programmer analyst, database administrator,

systems analyst, project manager, development manager, and most recently director of

architecture and planning.

In addition, I won the GlaxoWellcome Inc. CEO’s award for a national sales and

marketing data warehouse and hold U.S. utility patent 6,003,036 relating to the storage,

maintenance, and access of multidimensional data warehouses. Finally, the intellectual

(5)

Acknowledgements

I would like to acknowledge my advisory committee for their help and guidance. I especially

would like to acknowledge and thank Dr. Rada Y. Chirkova whose patience, assistance, and

(6)

Contents

List of Figures vii

List of Tables viii

1 Introduction 1

2 The Real-Time OLAP Problem 4

2.1 OLAP Data Model . . . 6

2.2 OLAP Dimensions and Hierarchies . . . 6

2.3 OLAP Facts . . . 7

2.4 OLAP in Real Time . . . 7

2.5 No B-tree Equivalent for OLAP . . . 8

3 Related Work 10 3.1 Non-Clustering Indexes . . . 10

3.2 Clustering Indexes . . . 11

3.3 MHC . . . 11

3.3.1 UB-trees . . . 12

3.3.2 Z-ordering . . . 12

3.3.3 Z-ordering Example . . . 13

3.3.4 Hierarchical Clustering . . . 16

3.3.5 Z-ordering and Hierarchical Clustering . . . 17

4 MDDC Theory 21 4.1 Variable Length Bit Strings . . . 22

4.2 Dynamic Bit Interleaving . . . 23

4.3 MDDC Operations . . . 24

4.3.1 Queries in MDDC . . . 24

4.3.2 Point Queries . . . 25

4.3.3 Slice or Partial Match Queries . . . 25

4.3.4 Maintenance . . . 25

4.3.5 Key Recycling . . . 26

4.3.6 Key Participation . . . 26

(7)

4.5 Summary of MDDC . . . 27

5 Implementation 28 5.1 Simulation Platform . . . 28

5.2 Simulation Software . . . 28

5.3 Simulation Parameters . . . 29

5.4 Simulated SQL Statements . . . 29

5.5 Simulated Database Tables . . . 30

5.6 Special Simulation Challenges . . . 30

6 Experimental Results 31 6.1 National Grocery Chain . . . 31

6.1.1 Data Model . . . 31

6.1.2 Test Criteria . . . 31

6.1.3 Data Loading . . . 32

6.1.4 Queries . . . 32

6.1.5 Results . . . 33

6.2 TPC-H . . . 34

6.2.1 Data Model . . . 38

6.2.2 Test Criteria . . . 38

6.2.3 Data Loading . . . 39

6.2.4 Queries . . . 39

6.2.5 Results . . . 39

6.3 Final Evaluation of MDDC and MHC . . . 41

7 Conclusions and Future Work 44

Bibliography 46

A Introduction to Appendices 50

B Tables definitions for experiments 51

C Queries for experiments 55

D Detailed experimental results 58

(8)

List of Figures

1.1 Hierarchical SQL Query . . . 2

1.2 Slice SQL Query . . . 2

2.1 OLAP SQL query . . . 5

2.2 OLAP SQL query . . . 5

3.1 Z-order Product SQL Query . . . 15

3.2 Z-order Month SQL Query . . . 15

3.3 Typical OLAP SQL Query . . . 17

4.1 Uniform Depth Z-address Space . . . 24

4.2 Variable Depth Dynamic Bit Interleaving Space . . . 25

5.1 Simulated SQL Query . . . 30

6.1 Grocery Sales Product Dimension Query 1 . . . 34

6.2 Grocery Sales Product Dimension Query 2 . . . 35

6.3 Grocery Sales Promotion Dimension Query 1 . . . 35

6.4 Grocery Sales Promotion Dimension Query 2 . . . 36

6.5 Grocery Sales Store Dimension Query 1 . . . 36

6.6 Grocery Sales Store Dimension Query 2 . . . 37

6.7 Grocery Sales Month Dimension Query 1 . . . 37

6.8 Grocery Sales Month Dimension Query 2 . . . 38

6.9 TPC-H Part Dimension Queries . . . 41

(9)

List of Tables

3.1 Product Dimension . . . 14

3.2 Month Dimension . . . 14

3.3 Sales . . . 14

3.4 Customer Dimension . . . 18

3.5 MHC HSE . . . 19

4.1 MDDC HSE . . . 22

B.1 Grocery store dimension . . . 52

B.2 Grocery product dimension . . . 52

B.3 Grocery promotion dimension . . . 52

B.4 Grocery month dimension . . . 52

B.5 Grocery sales fact . . . 52

B.6 TPC-H estimated database size . . . 53

B.7 TPC-H PART dimension . . . 53

B.8 TPC-H SUPPLIER dimension . . . 53

B.9 TPC-H LINEITEM fact . . . 54

C.1 Grocery sales experiment - queries . . . 56

C.2 TPC-H experiment - queries . . . 57

D.1 Grocery sales results - part 1 . . . 59

D.2 Grocery sales results - part 2 . . . 60

D.3 Grocery sales results - part 3 . . . 61

D.4 Grocery sales results - part 4 . . . 62

D.5 Grocery sales results - part 5 . . . 63

D.6 TPC-H Lineitems results - part 1 . . . 64

D.7 TPC-H Lineitems results - part 2 . . . 65

D.8 TPC-H Lineitems results - part 3 . . . 66

D.9 TPC-H Lineitems results - part 4 . . . 67

D.10 TPC-H Lineitems results - part 5 . . . 68

E.1 Product dimension . . . 70

(10)
(11)

Chapter 1

Introduction

On-Line Analytic Processing (OLAP) applications typically provide business

ana-lysts and managers with multiple views and organizations of critical business data. OLAP

applications require underlying data structures to efficiently provide access to data from

any combination of the component keys or dimensions in the data. OLAP queries

typi-cally restrict the data by some combination of dimensions and then aggregate or compute

statistics on some subset of the attributes of the data [19].

OLAP data structures and methods must overcome several difficult challenges

such as the requirement for performance symmetry given that a query might restrict the

data on any combination of dimensions. OLAP data structures and methods must also

deal with the “curse of dimensionality” or the decreasing selectivity in any one dimension

as the total number of dimensions in the data structure increases. Finally, OLAP data

structures must also deal with the common problem of maintaining a balanced and efficient

data structure as data content changes. Increasingly, OLAP data structures and methods

must also maintain data in real time [20, 7] while simultaneously executing efficient OLAP

queries. All the while, the size and corresponding performance challenges of these data

warehouses are increasing at a staggering rate.

OLAP slice searches typically restrict data by dimension key values or

hierarchi-cal key values within the dimensions, as the first SQL query in Figure 1.1 that follows

demonstrates. OLAP applications rarely restrict data by key order slice searches such as

the second query in Figure 1.2 that follows demonstrates [19, 21, 23].

Practitioners and researchers have created many techniques and methods in an

(12)

aggrega-SELECT DOLLARS FROM SALES, CUSTOMER

WHERE SALES.CUSTOMER ID=CUSTOMER.CUSTOMER ID AND

CUSTOMER.STATE=’NC’

Figure 1.1: Hierarchical SQL Query

SELECT DOLLARS FROM SALES

WHERE CUSTOMER>4 AND CUSTOMER <50

Figure 1.2: Slice SQL Query

tion. This involves the replication of pre-computed and pre-joined data into one or more

tables. While this technique is very helpful, it is expensive in terms of space and time and

inhibits real-time updates since the database management systems (DBMS) must update

and synchronize multiple tables when an update occurs. The use of multiple indexes,

es-pecially bit mapped indexes, is another common technique. The indexes provide multiple

views and increase query efficiency. But, they also delay update operations in proportion to

their numbers and do nothing to cluster the data. Since they do not cluster the data they

are more suited for small result sets [21, 23]. Much work has been done in pursuit of

multi-dimensional clustered indexes. While clustering is a very effective technique for increasing

query efficiency, clustering data in more than one dimension is very difficult. Most such

in-dexes are overly complicated, inefficient to manage, and are not effective in maintaining the

proper data organization for which they were originally designed [16]. Multidimensional

hierarchical clustering (MHC), a method involving the B-tree [21, 23], which has proven

itself so well in on-line transaction processing (OLTP) applications, has emerged. The

tech-nique uses Z-ordering which interleaves bits from multiple keys to enforce symmetry and

then stores the interleaved key in B-trees [2]. It also encodes dimensional hierarchies into its

keys to aid clustering. This technique provides efficient multidimensional clustering with all

(13)

however. MHC uses fixed length keys for each level in each hierarchy. This limitation,

impedes symmetry and prevents the technique from being completely dynamic and thereby

not providing a real-time capability.

Multidimensional dynamic clustering (MDDC) [24] like MHC capitalizes on all the

advantages of MHC but unlike MHC uses variable length keys and dynamic bit interleaving

in lieu of fixed length keys and Z-ordering. As a result MDDC is completely dynamic and

provides a real-time capability for data warehouses since it never requires reorganizations.

In addition, MDDC demonstrates symmetry that is superior to that of MHC. The only

dis-advantages of MDDC when compared to MHC is the additional space that MDDC requires

to store the variable length keys and an inability to handle conventional range queries.

Depending on the number of dimensions, hierarchies in each dimension, and metric fields

this additional space requirement could be very high but based on the experiments in this

dissertation is less than 10%.

This dissertation fully details MDDC and then provides empirical evidence to

(14)

Chapter 2

The Real-Time OLAP Problem

The real-time OLAP problem involves dynamic storage, maintenance, and access

of multidimensional data for OLAP queries [7].

Consider a database consisting of sales data in dollars, with a primary key

com-posed of product, month, and store. Further, suppose that all products each fall into a

specific product category, months are related to a specific quarter and year, and stores each

fall into one and only one zip code, state, and region. Tables E.1, E.2, E.3, and E.4 provide

an example of such a database.

The real-time OLAP problem involves real-time query, insert, update, and delete

operations against such tables. Applications or users are able to efficiently and dynamically

insert, update, or delete records from any dimension table or fact table in the database

without reorganizing any data structures. There are also not any preset upper limits on

the values of keys in any of these tables. Queries involve restrictions or criteria on any

combination of dimensions. The following two SQL queries in Figures 2.1 and 2.2 represent

typical queries involving combinations of dimensions. Query 2.1 returns the total sales by

store from all store sales where the year is 1998 and the state is ’NC’. Query 2.2 returns

the total sales by product for all products that are categorized as ’DRY GOODS’.

To provide an effective multidimensional data clustering, a multidimensional data

structure that provides ordered and direct access to data must support slice and point

queries on any combination of participating dimensions with the same efficiency as if the

combination of dimensions make up a prefix of a one-dimensional data structure supporting

(15)

SELECT STR NM, SUM(SLS DLR) FROM SLS FCT,

STR DIM, PROD DIM, MO DIM

WHERE STR DIM.STR ID=SLS FCT.STR ID

AND PROD DIM.PROD ID=PROD DIM.PROD ID

AND MO DIM.MO ID=SLS FCT.MO ID

AND STR DIM.ST NM=’NC’

AND MO DIM.YR NM=1998

GROUP BY STR DIM.STR NM

Figure 2.1: OLAP SQL query

SELECT PROD NM, SUM(SLS DLR) FROM SLS FCT,

STR DIM, PROD DIM, MO DIM

WHERE STR DIM.STR ID=SLS FCT.STR ID

AND PROD DIM.PROD ID=PROD DIM.PROD ID

AND MO DIM.MO ID=SLS FCT.MO ID

AND PROD DIM.PROD CAT NM=’DRY GOODS’

GROUP BY PROD DIM.PROD NM

(16)

2.1

OLAP Data Model

While both dimensional and normalized data models accommodate OLAP data,

the problem specification and theory in this dissertation assumes that the data model is

dimensional for simplicity, query efficiency, ease of use, and increased probability of correct

results. Tables in OLAP dimensional databases are divided into dimension tables and fact

tables [19].

Dimension tables such as E.1 contain all the primary key sources for all foreign

keys in a dimensional model. Each dimension table has one primary key, typically a

tu-ple identifier. Dimension tables contain most of the textual or descriptive attributes in a

dimensional model. Dimension tables also contain one or more hierarchies related to each

primary key for a dimension. For example, a customer dimension table might contain a city

and state attribute for each customer. In a flat dimensional model, the customer dimension

would contain all the attributes for city and state as well as customer. A flat dimensional

model might also contain separate dimension tables for city and state in addition to the

attributes for city and state in the customer dimension. A snow-flaked dimensional model

would only contain foreign keys to city and state in the customer dimension and would

house all other attributes for city and state in separate city and state dimension tables.

Fact tables such as E.4 contain foreign keys that reference one dimension. They

also contain attributes that are usually but not necessarily numeric and are dependent on

the primary key of the fact table. These attributes are known as metrics.

Fact tables are typically orders of magnitude larger than dimension tables. This

is expected since fact tables contain rows that could include all the possible combinations

of key values from all dimensions that the fact table includes in its definition. Albeit, fact

tables rarely contain even a small fraction of all possible dimension combinations. The ratio

of actual rows in a fact table compared to all rows and combinations of dimension values is

known as the sparseness of the fact table [19].

2.2

OLAP Dimensions and Hierarchies

Each dimension table contains one or more hierarchies. There is always at least

one trivial hierarchy for each dimension made up of only one level or the primary key from

(17)

that hierarchy. Each level of a hierarchy assigns one and only one key value to each primary

key value in the dimension table. Therefore, for each dimension,D, there arehhierarchies, each with some number of levelsl for each hierarchyh.

2.3

OLAP Facts

If a fact table contains ndimensionsD1, D2, D3, D4, ..., Dn, then a query can

con-strain the fact table on 2npossible combinations of dimensions since there arendimensions

and each one can either be included or not be included as a constraint.

The most optimal way to store the fact data is to cluster or sort it by the dimensions

that queries most often use to constrain the data [17, 21, 23]. But, this is a problem for

conventional indexes such as B-trees because they require too much replicated data to

support all the possible combinations.

A multidimensional data structure need not use all n dimensions for a given fact table to organize the data. The multidimensional data structure can use any subset of then dimensions. When a fact table contains a large number of dimensions, including all of them

in the data structure may dilute the organization of the data too much and cause “curse of

dimensionality” problems. Depending on query patterns it is often more practical to only

use the dimensions that queries most commonly constrain. The fact table can still use the

remainder of the dimensions to fulfill any requirements for constraints such as primary keys.

2.4

OLAP in Real Time

OLAP applications conventionally load data in batch. As a result there is time to

completely reorganize all data if necessary. Real-time OLAP applications are more similar

to OLTP applications. They must handle new data incrementally as it arrives without

major data structure reorganizations. Simultaneously, the OLAP application must provide

query performance good as or better than conventional OLAP applications. This requires

everything regarding the OLAP data structure to be completely dynamic. There can be no

preset limits on key values or any preset number of records of any kind in the data structure.

Each new record insert, update, or delete must result in only a limited amount of work as

(18)

2.5

No B-tree Equivalent for OLAP

B-trees are ubiquitous to On-Line Transaction Processing (OLTP) database

sys-tems since their performance is so efficient, versatile, and predictable. The B-tree solves

most index requirements in OLTP systems and therefore there is little demand for

alter-natives at this time [42]. OLAP database systems to date have not had such a widely

successful solution to implement.

A B-tree is also called a multi-way tree, it is a fast data-indexing method that

organizes the index into a multi-level set of nodes. Each node contains a sorted array of

key values. Two important properties of a B-tree are that all nodes are at least half-full

and that the tree is always balanced. That is, in access operation must read an identical

number of nodes in order to locate all keys at any given level in the tree. A well-organized

B-tree, with well sized nodes, will have only three to four levels [40].

Most OLTP applications depend on the B-tree to limit the amount of work

re-quired by each transaction since a B-tree has a limited number of levels for any database

operation to access or update. The conventional B-tree, which collectively here refers all

derivatives of the B-tree, has evolved into a very robust and efficient solution to implement

conventional operational or OLTP databases. The B-tree data structure provides good

con-currency, recoverability, maintenance, predictable query performance, predictable update

performance, and does not require periodic reorganizations. B-tree structures in their

na-tive forms, however, do not provide a good solution to implement OLAP databases. They

only work well when queries specify a complete key or a prefix of the key [42, 13, 40].

As the next chapter illustrates, conventional B-trees are not the only data

struc-tures that fall short in this area. There is not a robust and dynamic data structure to

support OLAP databases. The crowded field of candidate data structures for OLAP

ev-idences this fact. Each of the current solutions work well for some applications and not

others or work well in one aspect of their functionality but not in other aspects of their

functionality [11, 39]. The lack of a solution as versatile and robust for OLAP applications

as B-trees are for OLTP applications remains.

As this dissertation details, MDDC provides a B-tree solution to the real-time

OLAP problem. Like standard B-trees, MDDC enables real-time or dynamic updates for

OLAP in much the same way that standard B-trees provide such functionality for OLTP.

(19)
(20)

Chapter 3

Related Work

There are two major sectors of research regarding data structures for OLAP

ap-plications. The first sector focuses on data structures that only provide indexes and do

not cluster the underlying data [39]. The second sector focuses on data structures that, in

addition to any indexes they provide, do cluster the underlying data [14, 9, 4].

3.1

Non-Clustering Indexes

Solutions from the first sector of research typically utilize secondary indexes,

in-verted lists, or more recently bit mapped indexes to reduce the search space of

multidi-mensional queries without regard to the physical organization or clustering of the data.

Relational databases typically employ solutions from this sector since these solutions work

well with heap table organizations. Relational databases depend on heap structures for

OLTP [13]. But, research has shown that database management system (DBMS) software

can achieve large gains by clustering data according to the keys that queries use to filter the

data. This is true even for applications that are not multidimensional in nature [3]. When

databases store records in a random fashion and do not cluster them by key values, they

require virtually one I/O per record on average to fetch the data from secondary storage

[17]. Not surprisingly then, such solutions only work well for queries that access a very small

number of records relative to the overall search space. This is true even when the database

is capable of combining multiple indexes in one query. More formally, queries must have

very small conjunctive selectivity or combined dimensions query constraints to benefit from

(21)

better than full table scans for queries where the conjunctive selectivity exceeds 3.33% of

the total table size. This study also demonstrates that the tendency of OLAP queries to

constrain data based on hierarchically organized dimensions further amplifies this problem

[21, 23].

3.2

Clustering Indexes

Solutions from the second sector of research attempt to efficiently solve the very

difficult problem of multidimensional clustering. Unlike B-trees for OLTP, most of these

conventional OLAP data structures such as grid files, hB-trees, and R-trees are not efficient

and practical in one or more of the areas of space utilization, update predictability, query

efficiency, recovery, concurrency control, and complexity [25, 9, 38, 13, 14, 15, 18, 3].

For instance, grid files are reasonably easy to maintain and provide good

currency but do not provide predictable query performance and have a tendency to

con-sume large amounts of disk space and require much unnecessary I/O. An inability to

effi-ciently handle sparse data is the primary underlying cause of these problems in grid files

[27, 34, 15, 40]. While a large number of OLAP software applications implement Hypercubes

as their primary solution, they exhibit sparseness, disk space, and I/O query performance

problems. hB-trees and R-trees attempt to furnish OLAP applications with the robustness

that B-trees furnish OLTP applications but exhibit problems with concurrency,

mainte-nance, complexity, and predictable query performance [14, 9]. Most other known OLAP

data structures have one or more major weaknesses and are not able to provide the same

level solution for OLAP databases that B-trees are able to provide for OLTP databases.

Of course the exception to this is the UB-tree [2] since it is in fact a B-tree with

multidimensional ordering. But, the UB-tree alone does not cluster the data optimally for

hierarchical OLAP data. Multidimensional hierarchical clustering (MHC) clusters data in

UB-trees with dimensional hierarchies but requires a tradeoff between dynamic or real-time

updates and symmetry.

3.3

MHC

MHC combines UB-trees and hierarchical clustering. MHC overcomes the difficult

(22)

key values and simply storing the key values in one dimensional B-trees. Therefore, the

problems associated with clustering data by more than one dimension are absent. The only

problems that remain are to assign key values so that symmetric clustering is enforced and

so that the overall data structure is dynamic [21, 23].

3.3.1 UB-trees

The UB-tree or Universal B-tree [2], offers promise since it supports

multidimen-sional clustering and is able to utilize the B-tree as an underlying data structure without

altering the properties of the B-tree. This is true since the UB-tree employs Z-ordering or

an interleaving of bits from each key in a composite key [28] to map multidimensional spaces

to a linear sequence of points suitable for storage in linear data structures that preserve

order such as B-trees. If a multidimensional composite key contains 3 component keys or

dimensions of 16 bits each, the resulting interleaved key in the UB-tree has 48 bits with the

priority of the bits alternating between the dimensions on each bit position. This technique

merely alters the sorting or collating sequence of B-trees and does not affect its behavior

in any other way. The UB-tree provides a solid host data structure for multidimensional

clustering but does not address the multidimensional clustering problem itself.

3.3.2 Z-ordering

UB-trees and therefore MHC are based on the concept of bit interleaving from

Z-ordering [28, 29]. This technique maps multidimensional spaces to one-dimensional spaces.

For this reason, Z-ordering is known as a space filling curve. When a Z-ordering curve

maps two dimensions into one dimension it has a distinctive Z shape, hence the name.

Other research efforts have since made advancements in the area of space filling curves

such as Gray and Hilliard Ordering [10]. MHC could very well make use of the other more

advanced space filling curves that provide better proximity for spatial applications but,

bit-interleaving via Z-ordering is sufficient for MHC since MHC targets OLAP applications

and only partially orders the data as illustrated later.

The Z-ordering technique is very straightforward. It shuffles or interleaves the

bits from multiple keys together to form one contiguous key. The technique can order the

interleaved key as if it were a single key. Consider a composite multidimensional key made

(23)

of values for A, B, and C of 101, 110, 001, a new interleaved key or Z-address takes the first

bit from each of the three keys, then the second bit, and finally the third bit from each key

so that the resulting single key in shuffled bit format is 110010101.

More formally, a Z-address Z(k) is the ordinal number of a multidimensional

com-posite key k from a tuple or record on the Z-curve. Z-ordering calculates a Z-address fors bits in each of the ddimensions in the keyk as follows:

Z(k) =

s−1

X

j=0

d X

i=1

ki,j·2j·d+i−1

Of course, this represents the case where each key has the same number of bits. In

the case where one key exhausts its bits before the other keys, it simply fills in the missing

bits with zeroes so as not to affect the collating. Note that with Z-ordering and other

space filling curves, the technique predetermines the multidimensional space including the

number of bit partitions or division points for each dimension and places data points into

the multidimensional space as they become available.

Maintenance and queries for Z-ordering are relatively simple. Inserts, updates,

and deletes simply compute the Z-address for the multidimensional composite key and then

follow the rules of the underlying data structure.

Queries are slightly more complex. Queries performing partial matches or

multidi-mensional slice searches utilize a wildcard or “don’t care” bit for each bit in any unspecified

keys. Consequently, queries must search several paths in the underlying data structure

prun-ing sections of the data structure where possible. Otherwise, queries invoke the underlyprun-ing

rules of the data structure as with inserts, updates, and deletes.

3.3.3 Z-ordering Example

Consider an OLAP example involving Sales dimensioned by Product and Month

as Tables 3.1, 3.2, and 3.3 depict.

In this example, Z-ordering interleaves the bits from the Product and Month keys

in the Sales fact table together to form one key per record as specified by Z-ordering. Notice

that for each Z-address in the Z-ordering above the changes or breaks in bits from Product

occur much earlier in the Z ADDRESS and therefore dominate the collating sequence in

(24)

Table 3.1: Product Dimension

KEY DESCRIPTION

DEC BIN

1 0001 Nylon Wind Breaker 2 0010 Rain Coat

3 0011 Full Length Overcoat 4 0100 Heavy Waist Coat 5 0101 Light Jacket 6 0110 Hiking Boots 7 0111 Rain Pants 8 1000 Leather Gloves

Table 3.2: Month Dimension

KEY DESCRIPTION

DEC BIN

200301 110000111001101101 January, 2003 200302 110000111001101110 February, 2003 200303 110000111001101111 March, 2003 200304 110000111001110000 April, 2003 200305 110000111001110001 May, 2003 200306 110000111001110010 June, 2003 200307 110000111001110011 July, 2003 200308 110000111001110100 August, 2003 200309 110000111001110101 September, 2003 200310 110000111001110110 October, 2003 200311 110000111001110111 November, 2003 200312 110000111001111000 December, 2003

Table 3.3: Sales

Z ADDRESS PRODUCT MONTH DOLLARS

(25)

SELECT MONTH, DOLLARS

FROM SALES WHERE PRODUCT=3

Figure 3.1: Z-order Product SQL Query

SELECT PRODUCT, DOLLARS

FROM SALES WHERE MONTH=200301

Figure 3.2: Z-order Month SQL Query

Therefore, when a UB-tree stores these fact records, the product key dominates

the placement of records. If the block size is two and the B-tree fully packs the blocks,

then the first two records will be in block one, the second two in block two, and so forth.

Consider the SQL query in 3.1 that restricts data based on Product.

Recall that Z-ordering executes partial match queries against the underlying data

structure by specifying all available explicit bits and using the “don’t care” or wildcard bits

for unspecified bits. When queries search the data in the UB-tree, it uses these explicit bits

along with the wildcard bits to prune blocks that the queries do not need in order to satisfy

their requirements.

This particular query is able to prune all but one leaf block in the UB-tree. This

is true because even when a query designates a wildcard for every Month bit, the UB-tree

clusters all the records where Product is equal to 3 into one block. Since the UB-tree is a

non-dense index and the index portion of the UB-tree can easily reside in random access

memory (RAM), it is quite possible that this specific query will only generate one I/O for

the one block that it needs. A query from the example selecting any value for Product

should be able to utilize the index with similar efficiency.

In contrast, consider the query 3.2 that restricts data based on Month. This query

is not able to effectively leverage the UB-tree and needs to scan every leaf block in the

UB-tree or essentially perform a full scan of the data. When a query designates wildcards

(26)

example, this is true no matter which value of the Month the query uses to restrict the

data.

As demonstrated, the content of the data in this example prevents balanced query

performance and renders the UB-tree ineffective in producing symmetric access to OLAP

data. These same deficiencies are present in Gray and Hilliard Ordering [10] data structures

as well since they do nothing to control the relative bit placement in participating keys.

Research for the Variable UB-tree [41] also recognizes this problem and in fact

is an attempt to resolve this problem. This research terms the problem the “puff pasty

effect” since attempting to extract a hyperplane from a hypercube based on a partial match

query in a non-symmetric UB-tree is analogous to cutting through a “puff pastry” the long

way or splitting the layers in lieu of cutting through the short way or across the layers.

The extraction or query must access too many blocks rendering it like a full scan of the

hypercube in the worst case. Unfortunately, while the Variable UB-tree improves symmetry,

it is a static solution and forces a reorganization anytime data changes.

Ideally, an OLAP multidimensional data structure should provide access to the

data that is symmetric regardless which participating key or combination of participating

keys the query uses to restrict the data. Moreover, this is the impetus for multidimensional

data structures in the first place.

Clearly, while UB-trees can provide symmetric access to multidimensional data,

the distribution of key values that are present in the data has the potential to destroy

symmetry in UB-trees and render them ineffective.

3.3.4 Hierarchical Clustering

Thus, UB-trees in isolation are not enough. In their native form, UB-trees order

data according to foreign key values from dimensions. For instance, an OLAP fact table with

the dimensions customer, product, and time would have a multidimensional or Z-ordering

based on the primary keys for customer, product, and time. Typically, OLAP queries do

not base queries on such key values. This is especially true for conventional range queries.

An OLAP query selecting all products with a primary key value less than or greater than

a particular foreign key value corresponding to a dimension would not be common. This

is because primary keys are often meaningless, tuple identifiers. It is not uncommon for

(27)

SELECT CUSTOMER.NAME, TIME.WEEK,

SUM(SALES.DOLLARS)

FROM SALES,

CUSTOMER,

TIME

WHERE SALES.CUSTOMER ID=

CUSTOMER.CUSTOMER ID AND

CUSTOMER.CITY=’NEW YORK’ AND

CUSTOMER.STATE=’NEW YORK’ AND

TIME.MONTH=’MARCH’ AND TIME.YEAR=’2004’

GROUP BY CUSTOMER.NAME, TIME.WEEK

Figure 3.3: Typical OLAP SQL Query

single key value of customer and a single key value of product but, more typically OLAP

queries select slices of data that correspond to hierarchical values in dimensions. As the

SQL example in Figure 3.3 illustrates, a typical OLAP query might select all customers in a

city and all days for a month [19]. Therefore, UB-trees must incorporate such dimensional

hierarchies in their clustering of the data to be efficient for OLAP queries.

Hierarchical clustering is a method that controls key values and structures these

key values specifically for query access patterns. Hierarchical clustering capitalizes on the

fact that many primary key to foreign key relationships are composed of hierarchies. As

an example, a customer dimension might contain customers each of which the dimension

assigns to one city. The customer dimension might further assign each city to a state and

each state to a region. Hierarchical clustering incorporates these hierarchies into the primary

keys for the dimensions. The Customer Dimension in Table 3.4 shows such a hierarchy.

3.3.5 Z-ordering and Hierarchical Clustering

MHC combines hierarchical clustering with UB-trees. If customers in an OLAP

(28)

Table 3.4: Customer Dimension

KEY DESC CITY STATE REGION

1 Johnson San Diego CA West

2 Smith Atlanta GA South

3 Davis Boston MA Northeast

4 Watts New York NY Northeast

5 Duncan Wilmington NC South

6 Reaves Columbia SC South

7 Bradford Chicago IL Midwest

these customer levels, including customer, to cluster the data by the customer dimension

in the UB-tree. MHC utilizes this same hierarchical clustering technique for all dimensions

participating in the UB-tree. This technique allows MHC to outperform bit map indexes

and other non-clustered indexes when supporting OLAP queries. Non-clustered indexes

such as bit mapped indexes tend to only work well for very small result sets since

non-clustered indexes require an average of one Input/Output (I/O) for each record that they

include in result sets [17, 6, 26, 35, 31, 30, 8, 37, 36]. MHC does not have this limitation. It

takes advantage of locality in OLAP queries and concentrates records with the same

dimen-sional hierarchy values into much smaller numbers of database blocks thereby increasing

query efficiency by reducing I/O and ultimately speeding up queries. Simultaneously, MHC

provides symmetrical multidimensional access for any combination of participating

dimen-sions through the use of UB-trees with Z-ordering. This allows MHC to be useful for queries

that do not restrict dimensions or keys in the prefix of the underlying B-tree [23].

To combine Z-ordering and hierarchical clustering, MHC uses compound surrogate

keys or hierarchical surrogate encoding (HSE) keys in conjunction with Z-ordering. These

keys reserve a fixed number of bits for each level in a dimension hierarchy. The fixed number

of bits at each level depends on the number of unique key values for all parent keys at that

level in the dimension hierarchy. For the Customer dimension if there are 6 regions overall,

the maximum number of states in any of the 6 regions is 20, the maximum number of cities

in any state is 150, and the maximum number of customers in any city is 17, then the HSE

would require 3 bits for the region level, 5 bits for the state level, 8 bits for the city level,

and 5 bits for the customer level. Therefore the Customer dimension would require a total

of 21 bits for each of its primary keys. Note that each level requires only the number of

(29)

Table 3.5: MHC HSE

REGION STATE CITY CUSTOMER

001 10101 00100001 00011

the context of its parent at the next higher level. This greatly reduces the overall length of

the compound surrogate in contrast to storing independent primary keys for each hierarchy

level. MHC also makes provision for variable length compound surrogate keys in the event

that different keys have different numbers of hierarchical levels or unbalanced hierarchies

but, MHC does not make provision for variable length bits strings for each parent in each

hierarchical level individually. If MHC requires 8 bits for the city level in the customer

dimension, then any primary key from the customer dimension that includes the city level

must also include all 8 bits [23]. Table 3.5 shows the primary key structure for this Customer

Dimension.

Other than dimension key content dictated by HSE, MHC does nothing to alter

the properties of host UB-trees and the Z-ordering that it employs. Queries are the same

as in UB-trees. MHC implements slice queries by explicitly specifying supplied values in

the prefix of compound surrogates and using wildcards for the remaining bits just as it uses

wildcard bits for completely unspecified dimensions in multidimensional queries.

Ironically, while the underlying UB-tree is completely dynamic in that it does not

have any inherit limitations or preset thresholds, MHC is not. MHC incorporates variable

length overall bit strings as dimension keys but uses fixed length bit strings for individual

hierarchy levels within the dimension keys [23]. In MHC, the length L of each bit string that represents a hierarchy level is:

L = log2F

For keys not at the highest level of a dimensional hierarchy, F represents the maximum fan-out or largest number of children that any parent key, not just the parent

key of the current child, in the next higher level in the dimensional hierarchy could have. If

the key is at the highest level in the hierarchy,F simply represents the maximum number of possible keys at that level. MHC incorporates these fan-out values into the physical

database design when it defines the dimensional hierarchies [23, 22, 33, 32]. If the actual

size of the data that a data warehouse application inserts or loads into an MHC dimensional

(30)

will require full reorganization. If MHC dictates maximum fan-out values that are much

larger than actual data sizes, the physical database design then is likely to adversely affect

the symmetry of one or more dimensions. Thus, MHC forces a tradeoff between dynamic

or real-time updates and symmetry.

Because MHC uses fixed length bit strings to represent individual levels in HSE,

there is a tradeoff between symmetry and dynamic or real-time updates in MHC. If MHC

designates too few bits for a given hierarchy level, then the data might exceed this threshold

and require full reorganization of all data in all UB-trees that include the affected dimension

before MHC can process any further updates. If MHC designates too many bits for a given

hierarchy level, then the leading bits of the hierarchy level might remain empty and cause

(31)

Chapter 4

MDDC Theory

MDDC removes the tradeoff between symmetry and dynamic updates in MHC

with a simple but powerful modification to the MHC key structure. Both MHC and MDDC

benefit from their more narrow focus on OLAP queries by encoding compound surrogates

into the primary keys of dimensions. With this encoding, MHC imposes a partial ordering

and not a full ordering of the data. Specifically, MHC sorts key values within the order of

parent keys. In the Customer Dimension example, MHC sorts each CUTOMER within the

context of a REGION, STATE, and CITY. MHC might group CUSTOMER 1, 3, and 5 in

one REGION, STATE, and CITY and CUSTOMER 2, 4, and 6 in other REGION, STATE,

and CITY values depending on data values and relationships. Since hierarchical clustering

already forces a partial ordering that orders data according to the dimension hierarchies,

MDDC further takes advantage of this tradeoff with no additional cost. Research related

to MHC has shown that mirroring the bits of each level in compound surrogates increases

entropy and therefore symmetry [22]. MDDC introduces a mechanism for variable length

bit strings at each level in HSE keys and uses this in conjunction with bit mirroring. MDDC

also employs a third derived bit value that collates before 0 or 1 to mark the end of bit strings

in each HSE level so that HSE keys with the same prefix but different HSE level values do

not intermingle in the collating sequence. These techniques allow MDDC to be completely

dynamic while maintaining symmetry equal to or better than MHC. Specifically, MDDC

has superior symmetry when it uses fewer bits in an HSE level than the fixed number of bits

that MHC predefines. This is a special problem in all HSE levels other than the highest level

since MHC requires the number of bits at an HSE level account for the maximum number

(32)

Table 4.1: MDDC HSE

REGION STATE CITY CUSTOMER

10000000 10101000 10000100 11000000

of MDDC can never be worse than MHC by definition because MHC must allocate enough

bits to cover the maximum distinct values in the level. MDDC will never use more than

this maximum number of bits unless there are more distinct values in which case MHC

would need to be reorganized with a higher value for maximum number bits at that level.

Therefore, MDDC introduces dynamic capabilities not found in MHC, improves upon the

symmetry of MHC, and does not introduce any significant tradeoffs.

4.1

Variable Length Bit Strings

Introducing variable length bit strings for each level in the HSE key requires an

efficient storage and processing technique. Typically, variable length fields use a length byte

or end of string marker such as a NULL byte to indicate their size. This would not make

sense for HSE levels since the actual bits that compose the level itself are likely to fit in less

than one byte. A better approach is to allocate the last bit of each byte as a continuation

bit so that 7 bits of each byte are usable by MDDC. This approach would have to exceed

8 bytes or 56 usable bits in a level of an HSE key before the continuation bit approach

becomes less efficient than the length byte or NULL byte approaches. This is not likely to

occur in the vast majority of HSE key values since each level needs only be unique in the

context of its ancestors. Note also, that each level in a variable length HSE key requires at

least one byte and also uses a whole number of bytes for each level. This affects the length

of HSE keys that MDDC must store but does not affect symmetry since MDDC does not

use the continuation bits or any trailing zeros when interleaving bits. MDDC can discard

all trailing zeroes since by definition all reverse bit strings end in 1 with the exception of the

single value of 0. It is also important to point out that MDDC might also have one overall

variable length byte for the entire HSE key just as MHC does to accommodate unbalanced

hierarchies or hierarchies that allow a different number of levels for each key value.

In MDDC, the key from the customer example in the MHC section has an HSE

(33)

4.2

Dynamic Bit Interleaving

In addition to varying the length of bit strings in HSE keys, MDDC combines HSE

keys with dynamic bit interleaving in lieu of Z-ordering. If MDDC were to use Z-ordering

and bit strings that preserve an integer ordering as MHC does, the relative position of

each bit in these bit strings would need to shift as HSE key lengths change. This would

alter the overall Z-ordering and require MDDC to completely reorganize all the data in its

entirety. Using mirrored bits in conjunction with variable length bit strings and dynamic

bit interleaving overcomes this problem.

No matter how long the HSE bit strings grow, the relative position of bits in the

bit interleaving does not change. MDDC simply adds the additional bits for larger level

values to the end of the bit string for that level in the HSE. With Z-ordering the depth of

all HSE key values for each dimension are the same, for example 3 bits.

With dynamic bit interleaving in contrast, different keys in each dimension are

likely to have different depths. Consequently, MDDC does not have a predetermined address

space like MHC. MDDC subdivides the address space based on the density of local keys

on an as needed basis. Figure 4.1 depicts the uniform depth of Z-ordering while Figure 4.2

depicts the variable depth of dynamic bit interleaving. Dynamic bit interleaving furnishes

MDDC with some of the capabilities found in grid files, quad trees, and other similar data

structures [42, 13, 12].

Using a third derived bit value that collates before 0 and 1 ensures that MDDC

completely segregates distinct HSE key values. This does not alter the relative position of

previous interleaved bit addresses in the overall data ordering and therefore does not require

data reorganization.

Mirrored bits have one more important advantage. Except for the value of 0, each

mirrored bit string ends with a 1 by definition. This allows MDDC to trim the remaining

filler bits, which are all zeros, when interleaving bits. With this technique, MDDC always

exhibits symmetry as good as MHC and better in some cases. If MHC specifies 8 bits for a

given level in a compound surrogate but only uses one, then MHC wastes up to 7 bits before

accessing a bit that makes a difference in the Z-ordering. In such cases, some dimensions

that are using all bits in their compound surrogates might dominate the dimensions that

have unused bits. For the previously listed reasons, MDDC does not include these bits when

(34)

Figure 4.1: Uniform Depth Z-address Space

The following demonstrates how dynamic bit interleaving works. First, dynamic

bit interleaving maps all relevant zero and one bits from the mirrored bit string to 10

for bit 0 and 11 for bit 1. Dynamic bit interleaving maps all separators or the third

derived bit value to 01 and maps the end of string or NULL to 00. For example, in

MDDC the following two bytes, 10100110000010000, with 8 bits for each level maps to

111011101011110110101010101100. MDDC only uses these double length bit strings to

collate keys. MDDC does not store the keys in this format. MDDC stores the keys in the

mirrored bit format already described. Once the double bit format is calculated, MDDC

can interleave the bits just as Z-order does. This allows MDDC to implement dynamic bit

interleaving with minimal additional costs.

4.3

MDDC Operations

In general operations in MDDC are very similar to those in MHC. The following

sections provide details for MDDC operations.

4.3.1 Queries in MDDC

MDDC is like MHC in that it only alters the content of keys in B-trees and does

(35)

Figure 4.2: Variable Depth Dynamic Bit Interleaving Space

for B-trees. As with MHC, slice queries or partial matches are more complicated in MDDC.

4.3.2 Point Queries

Point queries are very simple in MDDC. When the query specifies a full key for

each dimension, MDDC simply computes the single bit interleaved address and uses the

B-tree to locate the record or records with the specified key values.

4.3.3 Slice or Partial Match Queries

MDDC uses “wild card” or “don’t care” bits in interleaved bit addresses for

di-mension key values or parts of didi-mension key values that the query does not specify. This

applies to situations where the query completely omits dimensions or where queries only

specify dimension prefixes corresponding to levels in compound HSE dimension keys. In

this regard, MDDC functions in the same manner as MHC.

4.3.4 Maintenance

Like MHC, MDDC does not alter any properties of B-trees during maintenance

operations. MDDC inserts, deletes, and updates records with standard B-tree methods

(36)

dimension keys before comparison with other keys in B-trees. MDDC also pre-computes bit

interleaved addresses for all comparison keys in the B-trees as they are needed. The

pre-computed, bit-interleaved addresses determine the placement and organization of records

within the B-tree. In doing so, MDDC does not in any way alter the properties of the

host B-tree. Therefore, B-trees containing keys that MDDC algorithm organizes retain all

maintenance advantages of B-trees including the perfect balance, shallow depth, granular

concurrency control, and recoverability [40, 13]. Since MDDC does not require complete

data reorganization, it is completely dynamic and can readily accommodate real-time

up-dates.

4.3.5 Key Recycling

MDDC like MHC can take advantage of key recycling at all levels of any hierarchy

[23]. MDDC works the same as MHC in this area. Realize that HSE keys need not assume

the role of primary keys. HSE keys can play the role of internal row addresses in dimension

and fact tables. Finally, the dimensions and optionally fact tables can store the primary

key values in addition to the HSE keys.

4.3.6 Key Participation

As is the case with MHC, MDDC optionally includes any dimensions and any

hierarchy levels that the dimensions contain into the HSE. MDDC can also include any

proper subset of these dimensions and levels. When MDDC configures a data structure

with a proper subset of dimensions and levels, it appends the HSE key with any remaining

dimensions or other fields in the primary key to enforce uniqueness. MDDC can also

combine more than one independent dimension into one virtual dimension via concatenation

to establish priority.

4.4

Incremental Costs

As stated before, the only relevant tradeoff that MDDC makes so that it can

provide dynamic symmetry is the forfeiture of integer ordering. But since MHC forces a

partial integer ordering anyway and the most common OLAP queries work well with such

(37)

from MDDC works just as well for OLAP queries [19] and there is no real tradeoff in this

regard. Hence, MDDC supports OLAP point and OLAP slice queries just as well or better

than MHC.

The only other cost that MDDC suffers in comparison to MHC is the requirement

for more disk space. The total amount of additional disk space per table depends on the

ratio of key fields to other fields in the table. However, since both MHC and MDDC, provide

efficient query access without additional indexes, they both are very space efficient.

4.5

Summary of MDDC

By utilizing variable length HSE keys and dynamic bit interleaving, MDDC enables

completely dynamic insert, update, and delete operations without full reorganization of any

dimension, fact table, or underlying B-tree. In addition, this allows MDDC to provide

symmetrical query efficiency that is good as or better than that of MHC. The only tradeoff

is the requirement for additional disk space. The experimental results that follow will

(38)

Chapter 5

Implementation

The implementation here contains working versions of both MDDC and MHC so

that the experiments that follow can compare the two algorithms directly. The purpose

of the implementation is to demonstrate that MDDC as opposed to MHC is completely

dynamic, has symmetry that is good as or better than MHC, at minimum additional costs.

The following sections describe the simulation software, simulation parameters, simulated

SQL tables, and simulated database tables.

5.1

Simulation Platform

This implementation is built on a Microsoft Windows laptop. The laptop is a

Hewlett-Packard Pavilion dv5000. Its RAM is 1 Gigabyte, it has a single 84.4 Gigabyte

hard disk drive, and its single AMD 64 bit processor has a clock speed of 1.79 Gigahertz.

Its operating system is Windows XP Professional.

5.2

Simulation Software

In the experiments that follow, the implementation of MDDC and MHC uses the

B-tree access method in BDB (Berkeley DB) software from Sleepy Cat Inc. BDB is open

source software. The implementation here specifically implements MDDC and MHC with

the C language API from BDB. All code for the implementation of MDDC in this study is

in C language. The simulation uses Microsoft Visual Studio.net Professional to compile the

(39)

In order to implement MDDC and MHC with the standard B-tree access methods

in BDB, this implementation alters the comparison routine of the B-tree access method

in BDB to compute bit interleaved addresses as part of the comparison process between

composite primary keys. This further evidences the fact that MDDC does not require

alterations to the host B-tree data structure. The implementation additionally contains

programs to search the MDDC and MHC data structures in a skip sequential manner. To

simplify the code for the implementation, the simulation only permits queries that specify

at most one value for each dimension level. If the queries do not specify a value for a

given dimension level, the implementation allows all values and uses wild card bits for these

dimension levels. Finally, if a query does not specify a value for a dimension level then the

query cannot specify any specific values for the children of this dimension level.

The query output of the simulation on MDDC and MHC consists of record counts

for each query and access statistics including cache size, records scanned, total I/O problems,

I/O probes to disk, processor time, and elapsed time. The simulation also includes a full

scan program for MDDC and MHC to ensure query results are correct.

5.3

Simulation Parameters

This implementation only uses a small amount of RAM for MDDC and MHC

so that it can demonstrate the I/O efficiency of MDDC. This implementation limits the

BDB cache size parameters to 4 database blocks. It also uses large database blocks for I/O

efficiency. It uses the maximum 64 kilobyte block size in BDB. Therefore the size of the

cache for MDDC and MHC is 256 kilobytes.

5.4

Simulated SQL Statements

Since this implementation uses C API for the BDB embedded database

man-agement system, no SQL interpreter is available. Therefore, C programs simulate SQL

statements in this implementation. These SQL statements constrain dimensions directly.

The simulation does not perform join operations since join operations would be the same

for both MHC and MDDC. Similarly, the simulation does not perform range queries since

they are not typical in OLAP applications and MHC or MDDC are not designed for them.

(40)

SELECT COUNT(*)

FROM GROCERY SALES FACT WHERE PROD CAT ID=4 AND

PROD ID=8

Figure 5.1: Simulated SQL Query

5.5

Simulated Database Tables

Typically, relational database systems use SQL to define tables. As with all other

SQL statements, this simulation uses C programs to define tables.

5.6

Special Simulation Challenges

In order to take full advantage of MHC and MDDC and to compare them fairly,

a query algorithm is required that visits each leaf database block only one time. This is

fairly simple in a one dimensional B-tree. Once the query accesses a given leaf block, the

algorithm access all the records on that block and ensures that the next key value accessed

is greater than the last key in the current block [40]. It is more complicated to effect this

type of skip sequential processing in MDDC and MHC where the algorithms must deal

with multidimensional key encodings and bit interleavings. The algorithm must calculate

the next key in the multidimensional space using wildcard bits for partial match and slice

queries. Essentially, the lowest priority bit for each dimension that has an unset value less

than 1 has to be identified and then the unset bit with the lowest overall priority from all

the dimensions not in the query filter has to be set to 1. This algorithm proved to be quite

(41)

Chapter 6

Experimental Results

6.1

National Grocery Chain

The “National Grocery Chain” database contains sales data for a national chain

of grocery stores broken out by month, and product. The store, month, and product

data were fabricated and manually entered into text files. The sales data was created

with the assistance of a random number generator. Given a sparseness factor such as

10% the simulator uses the random number generator to create one randomly selected

record out of every 10 possible combinations of store, month, and product. A sparseness

factor of approximately 0.28% was selected for this experiment to generate the approximate

12,000,000 sales records in the simulation.

6.1.1 Data Model

The “National Grocery Chain” database is a flat star-schema or one that contains

the complete dimension hierarchy in the base dimension [19]. The dimensions are Store

in table B.1, Product in table B.2, Promotion in table B.3, and Month in table B.4. This

database combines these four dimension tables to establish the grain of the Sales Facts in

table B.5.

6.1.2 Test Criteria

The experiment focuses on two dimensions. It tests the behavior of the product

(42)

The product dimension contains a hierarchy made up of two levels, product category and

product. The MHC key as tables B.2 and B.4 depict in this simulation uses 8 bits for the

product category level and 8 bits for the product level. The month dimension contains

a hierarchy made up of three levels, year, quarter, and month. The MHC key in this

simulation uses 6 bits for year, 4 bits for quarter, and 4 bits for month. MDDC has no

preset number of bits for any of the dimensions.

6.1.3 Data Loading

A set of C programs make calls to the BDB database in order to load data in the

“National Grocery Chain” database. The C programs load the data into the MHC database

and the MDDC database. The C programs load the “Grocery Sales Fact” table in B.5 in

stages to illustrate the differences in symmetry between MDDC and MHC and the need to

reorganize MHC data structures when the MHC algorithm reaches preset size thresholds.

The simulation loads all simulated records with product category keys 1 and 2

in stage 1. In stages 2 through 9, the simulation loads all simulated records with product

category keys 3 through 10 respectively as results tables D.1, D.2, D.3, D.4, and D.5 depict.

6.1.4 Queries

For simplicity in the “National Grocery Chain” database, the C programs that

im-plement queries, only allow point queries and slice queries with no more than one dimension

value from any hierarchical level of any dimension. The programs do not allow specification

of multiple values from any level in any dimensional hierarchy in one query. This allows

for simple and efficient query implementation. The implementation could include multiple

dimension values at a higher implementation cost and complexity in order to maintain the

same level of efficiency. The theory and principles of MDDC and its comparison to MHC

are the same for either level of implementation complexity. This is true since skip sequential

processing handles multiple points in key order very efficiently [40].

To further focus on the MDDC versus MHC data structures, this simulation

con-centrates on queries that specify dimension and hierarchy values that MDDC and MHC use

to organize the fact data or those that are in the hierarchical surrogate encoding (HSE) keys.

The simulation executes all the queries represented by the SQL in Table C.1 against the

(43)

after each stage in the loading process represented by the corresponding product category

values.

6.1.5 Results

The results demonstrate the advantage that MDDC has over MHC in symmetry

and real-time update capability. The first 8 queries from Table C.1 represent the differences

in MHC and MDDC most starkly since each one of these queries only constrains one of the

dimensions. In this experiment, the product dimension is the experimental test whereas the

other three dimensions are controlled tests. Therefore query 1 and 5 from Table C.1 should

illustrate the results that are sought while queries 2 through 4 and 6 through 8 should

perform as expected.

All the results demonstrated that the most persistent and reliable indicator of

performance is the number of I/O operations for each query, whether they are cached or

not. In addition, when a ratio of the number of I/O operations per index query versus

the number of I/O operations per full scan query is used as the primary metric, then the

results are stable as the underlying data increases in size. This ratio also accounts for the

difference in file sizes between MHC and MDDC. As a result, the ratio of query I/O to full

scan I/O is the primary metric on which this experiment focuses.

Charts 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, and 6.8 compare the MHC results to the

MDDC results for all 9 product category data runs in this experiment. Note that charts 6.1

and 6.5 represent the greatest difference between MHC and MDDC queries. Of course these

correspond exactly to test queries 1 and 5 from Table C.1. In these queries, the MHC ratio

of query I/O to full scan I/O or percentage of full scan actually indicates that query I/O’s

in MHC are sometimes larger than the number of I/O operations in a full table scan. Note

also that the ratio improves as the simulation loads more of the product category records.

As charts 6.2, 6.3, 6.4, 6.6, 6.7, and 6.8 depict, MDDC queries have slightly higher I/O

ratios than MHC queries in the control queries but are much lower than the 100 percent

ratio required to scan the full table.

These results are to be expected since the HSE key for product category in the

product dimension for the MHC database allocates 5 leading bits that this simulation does

not use. A new database design for MHC could alleviate this problem. If the simulation

(44)

Full Scan Percentage for Product Dimension Query 1

0 20 40 60 80 100 120 140 160

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.1: Grocery Sales Product Dimension Query 1

category in the MHC database then the above problem with the product dimension will not

occur. But, this presents another problem. The MHC database would not be able to house

more than 16 product categories without a complete MHC database reorganization. the

slightly higher I/O ratios in the control queries are also to be expected since the product

dimension in MDDC is sharing in some of the selectivity in the MDDC data structure at

the expense of the other 3 dimensions while the product dimension is not doing this in the

MHC data structure.

Hence, MDDC offers better symmetry than MHC and unlike MHC offers real-time

updates without major data structure reorganizations. Unlike MHC, MDDC does not force

a tradeoff between symmetry and intrinsic growth limits.

6.2

TPC-H

The data in this experiment originates from the TPC-H benchmark [1]. The

DBGEN from TPC-H generates the database with a scale factor of 2 so that the LINEITEM

fact table has approximately 12,000,000 records. See TPC-H estimated database size for a

scale factor of 2 in B.6. This simulation uses the PART dimension as shown in table B.7

(45)

Full Scan Percentage for Product Dimension Query 2

0 20 40 60 80 100 120 140 160

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.2: Grocery Sales Product Dimension Query 2

Full Scan Percentage for Promotion Dimension Query 1

0 2 4 6 8 10 12 14 16

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

(46)

Full Scan Percentage for Promotion Dimension Query 2

0 5 10 15 20 25 30

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.4: Grocery Sales Promotion Dimension Query 2

Full Scan Percentage for Store Dimension Query 1

0 1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

(47)

Full Scan Percentage for Store Dimension Query 2

0 2 4 6 8 10 12

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.6: Grocery Sales Store Dimension Query 2

Full Scan Percentage for Month Dimension Query 1

0 5 10 15 20 25 30

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

(48)

Full Scan Percentage for Month Dimension Query 2

0 5 10 15 20 25 30

1 2 3 4 5 6 7 8 9

Data Run

MHC MDDC

Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.8: Grocery Sales Month Dimension Query 2

table as shown in table B.9 for the fact table. The MHC and MDDC data structures cluster

the LINEITEM data with only the PART and SUPPLIER dimensions.

6.2.1 Data Model

The TPC-H database is a normalized schema or a schema with snow-flaked

di-mensions and fact tables that are normalized into more than one table [1, 19, 5]. This

simulation uses PART in B.7 and SUPPLIER in B.8 as the dimensions. It combines these

two dimensions to establish the granularity of the LINEITEM table in B.9.

6.2.2 Test Criteria

The experiment focuses on two dimensions. It tests the behavior of the SUPPLIER

dimension as data is loaded. It also uses the PART dimension as an experimental control.

The SUPPLIER and PART dimensions each contain a hierarchy made up of one level, the

primary key. The MHC key as tables B.7 and B.8 depict uses 19 bits for the PARTHSE

References

Outline

Related documents