MARTIN, MICHAEL W. Implementing Real-Time OLAP with Multidimensional Dynamic Clustering. (Under the direction of Dr. Rada Y. Chirkova).
This dissertation details multidimensional dynamic clustering (MDDC), a new
multidimen-sional data clustering method that supports efficient On-Line Analytical Processing (OLAP)
queries [24]. OLAP queries are generally multidimensional business intelligence queries that
require large amounts of data, complex table joins, and aggregate data calculations to
pro-duce useful results [19]. MDDC allows OLAP queries to execute very efficiently on base
tables while requiring fewer materialized views, aggregates, and secondary indexes than
many previous solutions that support OLAP queries. MDDC also allows dynamic inserts,
updates, and deletes on stored data without reorganization of the underlying data
struc-tures. This is especially important for real-time data warehouses that must provide efficient
OLAP query access while continually adding data to and maintaining data in tables. Such
data warehouses cannot stop query processing to reorganize data structures when data
ex-ceeds preset thresholds such as a fixed number of key values. This dissertation provides
a precise definition of MDDC, including a comparison with multidimensional hierarchical
clustering (MHC), and presents experimental results to substantiate the claims that MDDC
is more dynamic than MHC and provides symmetry that is better than or equal to that of
by
Michael W. Martin
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Computer Science
Raleigh
2006
Approved By:
Dr. Xiasong Ma Dr. Ting Yu
I would like to dedicate this dissertation to the three people who would not accept
anything less than my having the highest level of education possible.
I dedicate it to my mother, Alta Irene Simmons Martin. In her own education, she never
advanced past the seventh grade. She performed hard manual labor all her short life.
Understandably, she insisted that I have as much education as possible so that I would
have a better life.
I also dedicate it to my father-in-law, Malton Ruffin Tripp Sr., and my wife, Shelia Ann
Tripp Martin. Through many years they insisted that I obtain a Doctor of Philosophy
degree in Computer Science. They always provided positive encouragement and they
Biography
My educational background consists of a Bachelor of Science in Computer Science from the
University of North Carolina at Wilmington in 1986, a Master of Science in Management
from North Carolina State University in 1990, and a Master of Computer Science from
North Carolina State University in 1997.
My professional experience consists of 19 years in information technology with
positions of increasing responsibility including programmer analyst, database administrator,
systems analyst, project manager, development manager, and most recently director of
architecture and planning.
In addition, I won the GlaxoWellcome Inc. CEO’s award for a national sales and
marketing data warehouse and hold U.S. utility patent 6,003,036 relating to the storage,
maintenance, and access of multidimensional data warehouses. Finally, the intellectual
Acknowledgements
I would like to acknowledge my advisory committee for their help and guidance. I especially
would like to acknowledge and thank Dr. Rada Y. Chirkova whose patience, assistance, and
Contents
List of Figures vii
List of Tables viii
1 Introduction 1
2 The Real-Time OLAP Problem 4
2.1 OLAP Data Model . . . 6
2.2 OLAP Dimensions and Hierarchies . . . 6
2.3 OLAP Facts . . . 7
2.4 OLAP in Real Time . . . 7
2.5 No B-tree Equivalent for OLAP . . . 8
3 Related Work 10 3.1 Non-Clustering Indexes . . . 10
3.2 Clustering Indexes . . . 11
3.3 MHC . . . 11
3.3.1 UB-trees . . . 12
3.3.2 Z-ordering . . . 12
3.3.3 Z-ordering Example . . . 13
3.3.4 Hierarchical Clustering . . . 16
3.3.5 Z-ordering and Hierarchical Clustering . . . 17
4 MDDC Theory 21 4.1 Variable Length Bit Strings . . . 22
4.2 Dynamic Bit Interleaving . . . 23
4.3 MDDC Operations . . . 24
4.3.1 Queries in MDDC . . . 24
4.3.2 Point Queries . . . 25
4.3.3 Slice or Partial Match Queries . . . 25
4.3.4 Maintenance . . . 25
4.3.5 Key Recycling . . . 26
4.3.6 Key Participation . . . 26
4.5 Summary of MDDC . . . 27
5 Implementation 28 5.1 Simulation Platform . . . 28
5.2 Simulation Software . . . 28
5.3 Simulation Parameters . . . 29
5.4 Simulated SQL Statements . . . 29
5.5 Simulated Database Tables . . . 30
5.6 Special Simulation Challenges . . . 30
6 Experimental Results 31 6.1 National Grocery Chain . . . 31
6.1.1 Data Model . . . 31
6.1.2 Test Criteria . . . 31
6.1.3 Data Loading . . . 32
6.1.4 Queries . . . 32
6.1.5 Results . . . 33
6.2 TPC-H . . . 34
6.2.1 Data Model . . . 38
6.2.2 Test Criteria . . . 38
6.2.3 Data Loading . . . 39
6.2.4 Queries . . . 39
6.2.5 Results . . . 39
6.3 Final Evaluation of MDDC and MHC . . . 41
7 Conclusions and Future Work 44
Bibliography 46
A Introduction to Appendices 50
B Tables definitions for experiments 51
C Queries for experiments 55
D Detailed experimental results 58
List of Figures
1.1 Hierarchical SQL Query . . . 2
1.2 Slice SQL Query . . . 2
2.1 OLAP SQL query . . . 5
2.2 OLAP SQL query . . . 5
3.1 Z-order Product SQL Query . . . 15
3.2 Z-order Month SQL Query . . . 15
3.3 Typical OLAP SQL Query . . . 17
4.1 Uniform Depth Z-address Space . . . 24
4.2 Variable Depth Dynamic Bit Interleaving Space . . . 25
5.1 Simulated SQL Query . . . 30
6.1 Grocery Sales Product Dimension Query 1 . . . 34
6.2 Grocery Sales Product Dimension Query 2 . . . 35
6.3 Grocery Sales Promotion Dimension Query 1 . . . 35
6.4 Grocery Sales Promotion Dimension Query 2 . . . 36
6.5 Grocery Sales Store Dimension Query 1 . . . 36
6.6 Grocery Sales Store Dimension Query 2 . . . 37
6.7 Grocery Sales Month Dimension Query 1 . . . 37
6.8 Grocery Sales Month Dimension Query 2 . . . 38
6.9 TPC-H Part Dimension Queries . . . 41
List of Tables
3.1 Product Dimension . . . 14
3.2 Month Dimension . . . 14
3.3 Sales . . . 14
3.4 Customer Dimension . . . 18
3.5 MHC HSE . . . 19
4.1 MDDC HSE . . . 22
B.1 Grocery store dimension . . . 52
B.2 Grocery product dimension . . . 52
B.3 Grocery promotion dimension . . . 52
B.4 Grocery month dimension . . . 52
B.5 Grocery sales fact . . . 52
B.6 TPC-H estimated database size . . . 53
B.7 TPC-H PART dimension . . . 53
B.8 TPC-H SUPPLIER dimension . . . 53
B.9 TPC-H LINEITEM fact . . . 54
C.1 Grocery sales experiment - queries . . . 56
C.2 TPC-H experiment - queries . . . 57
D.1 Grocery sales results - part 1 . . . 59
D.2 Grocery sales results - part 2 . . . 60
D.3 Grocery sales results - part 3 . . . 61
D.4 Grocery sales results - part 4 . . . 62
D.5 Grocery sales results - part 5 . . . 63
D.6 TPC-H Lineitems results - part 1 . . . 64
D.7 TPC-H Lineitems results - part 2 . . . 65
D.8 TPC-H Lineitems results - part 3 . . . 66
D.9 TPC-H Lineitems results - part 4 . . . 67
D.10 TPC-H Lineitems results - part 5 . . . 68
E.1 Product dimension . . . 70
Chapter 1
Introduction
On-Line Analytic Processing (OLAP) applications typically provide business
ana-lysts and managers with multiple views and organizations of critical business data. OLAP
applications require underlying data structures to efficiently provide access to data from
any combination of the component keys or dimensions in the data. OLAP queries
typi-cally restrict the data by some combination of dimensions and then aggregate or compute
statistics on some subset of the attributes of the data [19].
OLAP data structures and methods must overcome several difficult challenges
such as the requirement for performance symmetry given that a query might restrict the
data on any combination of dimensions. OLAP data structures and methods must also
deal with the “curse of dimensionality” or the decreasing selectivity in any one dimension
as the total number of dimensions in the data structure increases. Finally, OLAP data
structures must also deal with the common problem of maintaining a balanced and efficient
data structure as data content changes. Increasingly, OLAP data structures and methods
must also maintain data in real time [20, 7] while simultaneously executing efficient OLAP
queries. All the while, the size and corresponding performance challenges of these data
warehouses are increasing at a staggering rate.
OLAP slice searches typically restrict data by dimension key values or
hierarchi-cal key values within the dimensions, as the first SQL query in Figure 1.1 that follows
demonstrates. OLAP applications rarely restrict data by key order slice searches such as
the second query in Figure 1.2 that follows demonstrates [19, 21, 23].
Practitioners and researchers have created many techniques and methods in an
aggrega-SELECT DOLLARS FROM SALES, CUSTOMER
WHERE SALES.CUSTOMER ID=CUSTOMER.CUSTOMER ID AND
CUSTOMER.STATE=’NC’
Figure 1.1: Hierarchical SQL Query
SELECT DOLLARS FROM SALES
WHERE CUSTOMER>4 AND CUSTOMER <50
Figure 1.2: Slice SQL Query
tion. This involves the replication of pre-computed and pre-joined data into one or more
tables. While this technique is very helpful, it is expensive in terms of space and time and
inhibits real-time updates since the database management systems (DBMS) must update
and synchronize multiple tables when an update occurs. The use of multiple indexes,
es-pecially bit mapped indexes, is another common technique. The indexes provide multiple
views and increase query efficiency. But, they also delay update operations in proportion to
their numbers and do nothing to cluster the data. Since they do not cluster the data they
are more suited for small result sets [21, 23]. Much work has been done in pursuit of
multi-dimensional clustered indexes. While clustering is a very effective technique for increasing
query efficiency, clustering data in more than one dimension is very difficult. Most such
in-dexes are overly complicated, inefficient to manage, and are not effective in maintaining the
proper data organization for which they were originally designed [16]. Multidimensional
hierarchical clustering (MHC), a method involving the B-tree [21, 23], which has proven
itself so well in on-line transaction processing (OLTP) applications, has emerged. The
tech-nique uses Z-ordering which interleaves bits from multiple keys to enforce symmetry and
then stores the interleaved key in B-trees [2]. It also encodes dimensional hierarchies into its
keys to aid clustering. This technique provides efficient multidimensional clustering with all
however. MHC uses fixed length keys for each level in each hierarchy. This limitation,
impedes symmetry and prevents the technique from being completely dynamic and thereby
not providing a real-time capability.
Multidimensional dynamic clustering (MDDC) [24] like MHC capitalizes on all the
advantages of MHC but unlike MHC uses variable length keys and dynamic bit interleaving
in lieu of fixed length keys and Z-ordering. As a result MDDC is completely dynamic and
provides a real-time capability for data warehouses since it never requires reorganizations.
In addition, MDDC demonstrates symmetry that is superior to that of MHC. The only
dis-advantages of MDDC when compared to MHC is the additional space that MDDC requires
to store the variable length keys and an inability to handle conventional range queries.
Depending on the number of dimensions, hierarchies in each dimension, and metric fields
this additional space requirement could be very high but based on the experiments in this
dissertation is less than 10%.
This dissertation fully details MDDC and then provides empirical evidence to
Chapter 2
The Real-Time OLAP Problem
The real-time OLAP problem involves dynamic storage, maintenance, and access
of multidimensional data for OLAP queries [7].
Consider a database consisting of sales data in dollars, with a primary key
com-posed of product, month, and store. Further, suppose that all products each fall into a
specific product category, months are related to a specific quarter and year, and stores each
fall into one and only one zip code, state, and region. Tables E.1, E.2, E.3, and E.4 provide
an example of such a database.
The real-time OLAP problem involves real-time query, insert, update, and delete
operations against such tables. Applications or users are able to efficiently and dynamically
insert, update, or delete records from any dimension table or fact table in the database
without reorganizing any data structures. There are also not any preset upper limits on
the values of keys in any of these tables. Queries involve restrictions or criteria on any
combination of dimensions. The following two SQL queries in Figures 2.1 and 2.2 represent
typical queries involving combinations of dimensions. Query 2.1 returns the total sales by
store from all store sales where the year is 1998 and the state is ’NC’. Query 2.2 returns
the total sales by product for all products that are categorized as ’DRY GOODS’.
To provide an effective multidimensional data clustering, a multidimensional data
structure that provides ordered and direct access to data must support slice and point
queries on any combination of participating dimensions with the same efficiency as if the
combination of dimensions make up a prefix of a one-dimensional data structure supporting
SELECT STR NM, SUM(SLS DLR) FROM SLS FCT,
STR DIM, PROD DIM, MO DIM
WHERE STR DIM.STR ID=SLS FCT.STR ID
AND PROD DIM.PROD ID=PROD DIM.PROD ID
AND MO DIM.MO ID=SLS FCT.MO ID
AND STR DIM.ST NM=’NC’
AND MO DIM.YR NM=1998
GROUP BY STR DIM.STR NM
Figure 2.1: OLAP SQL query
SELECT PROD NM, SUM(SLS DLR) FROM SLS FCT,
STR DIM, PROD DIM, MO DIM
WHERE STR DIM.STR ID=SLS FCT.STR ID
AND PROD DIM.PROD ID=PROD DIM.PROD ID
AND MO DIM.MO ID=SLS FCT.MO ID
AND PROD DIM.PROD CAT NM=’DRY GOODS’
GROUP BY PROD DIM.PROD NM
2.1
OLAP Data Model
While both dimensional and normalized data models accommodate OLAP data,
the problem specification and theory in this dissertation assumes that the data model is
dimensional for simplicity, query efficiency, ease of use, and increased probability of correct
results. Tables in OLAP dimensional databases are divided into dimension tables and fact
tables [19].
Dimension tables such as E.1 contain all the primary key sources for all foreign
keys in a dimensional model. Each dimension table has one primary key, typically a
tu-ple identifier. Dimension tables contain most of the textual or descriptive attributes in a
dimensional model. Dimension tables also contain one or more hierarchies related to each
primary key for a dimension. For example, a customer dimension table might contain a city
and state attribute for each customer. In a flat dimensional model, the customer dimension
would contain all the attributes for city and state as well as customer. A flat dimensional
model might also contain separate dimension tables for city and state in addition to the
attributes for city and state in the customer dimension. A snow-flaked dimensional model
would only contain foreign keys to city and state in the customer dimension and would
house all other attributes for city and state in separate city and state dimension tables.
Fact tables such as E.4 contain foreign keys that reference one dimension. They
also contain attributes that are usually but not necessarily numeric and are dependent on
the primary key of the fact table. These attributes are known as metrics.
Fact tables are typically orders of magnitude larger than dimension tables. This
is expected since fact tables contain rows that could include all the possible combinations
of key values from all dimensions that the fact table includes in its definition. Albeit, fact
tables rarely contain even a small fraction of all possible dimension combinations. The ratio
of actual rows in a fact table compared to all rows and combinations of dimension values is
known as the sparseness of the fact table [19].
2.2
OLAP Dimensions and Hierarchies
Each dimension table contains one or more hierarchies. There is always at least
one trivial hierarchy for each dimension made up of only one level or the primary key from
that hierarchy. Each level of a hierarchy assigns one and only one key value to each primary
key value in the dimension table. Therefore, for each dimension,D, there arehhierarchies, each with some number of levelsl for each hierarchyh.
2.3
OLAP Facts
If a fact table contains ndimensionsD1, D2, D3, D4, ..., Dn, then a query can
con-strain the fact table on 2npossible combinations of dimensions since there arendimensions
and each one can either be included or not be included as a constraint.
The most optimal way to store the fact data is to cluster or sort it by the dimensions
that queries most often use to constrain the data [17, 21, 23]. But, this is a problem for
conventional indexes such as B-trees because they require too much replicated data to
support all the possible combinations.
A multidimensional data structure need not use all n dimensions for a given fact table to organize the data. The multidimensional data structure can use any subset of then dimensions. When a fact table contains a large number of dimensions, including all of them
in the data structure may dilute the organization of the data too much and cause “curse of
dimensionality” problems. Depending on query patterns it is often more practical to only
use the dimensions that queries most commonly constrain. The fact table can still use the
remainder of the dimensions to fulfill any requirements for constraints such as primary keys.
2.4
OLAP in Real Time
OLAP applications conventionally load data in batch. As a result there is time to
completely reorganize all data if necessary. Real-time OLAP applications are more similar
to OLTP applications. They must handle new data incrementally as it arrives without
major data structure reorganizations. Simultaneously, the OLAP application must provide
query performance good as or better than conventional OLAP applications. This requires
everything regarding the OLAP data structure to be completely dynamic. There can be no
preset limits on key values or any preset number of records of any kind in the data structure.
Each new record insert, update, or delete must result in only a limited amount of work as
2.5
No B-tree Equivalent for OLAP
B-trees are ubiquitous to On-Line Transaction Processing (OLTP) database
sys-tems since their performance is so efficient, versatile, and predictable. The B-tree solves
most index requirements in OLTP systems and therefore there is little demand for
alter-natives at this time [42]. OLAP database systems to date have not had such a widely
successful solution to implement.
A B-tree is also called a multi-way tree, it is a fast data-indexing method that
organizes the index into a multi-level set of nodes. Each node contains a sorted array of
key values. Two important properties of a B-tree are that all nodes are at least half-full
and that the tree is always balanced. That is, in access operation must read an identical
number of nodes in order to locate all keys at any given level in the tree. A well-organized
B-tree, with well sized nodes, will have only three to four levels [40].
Most OLTP applications depend on the B-tree to limit the amount of work
re-quired by each transaction since a B-tree has a limited number of levels for any database
operation to access or update. The conventional B-tree, which collectively here refers all
derivatives of the B-tree, has evolved into a very robust and efficient solution to implement
conventional operational or OLTP databases. The B-tree data structure provides good
con-currency, recoverability, maintenance, predictable query performance, predictable update
performance, and does not require periodic reorganizations. B-tree structures in their
na-tive forms, however, do not provide a good solution to implement OLAP databases. They
only work well when queries specify a complete key or a prefix of the key [42, 13, 40].
As the next chapter illustrates, conventional B-trees are not the only data
struc-tures that fall short in this area. There is not a robust and dynamic data structure to
support OLAP databases. The crowded field of candidate data structures for OLAP
ev-idences this fact. Each of the current solutions work well for some applications and not
others or work well in one aspect of their functionality but not in other aspects of their
functionality [11, 39]. The lack of a solution as versatile and robust for OLAP applications
as B-trees are for OLTP applications remains.
As this dissertation details, MDDC provides a B-tree solution to the real-time
OLAP problem. Like standard B-trees, MDDC enables real-time or dynamic updates for
OLAP in much the same way that standard B-trees provide such functionality for OLTP.
Chapter 3
Related Work
There are two major sectors of research regarding data structures for OLAP
ap-plications. The first sector focuses on data structures that only provide indexes and do
not cluster the underlying data [39]. The second sector focuses on data structures that, in
addition to any indexes they provide, do cluster the underlying data [14, 9, 4].
3.1
Non-Clustering Indexes
Solutions from the first sector of research typically utilize secondary indexes,
in-verted lists, or more recently bit mapped indexes to reduce the search space of
multidi-mensional queries without regard to the physical organization or clustering of the data.
Relational databases typically employ solutions from this sector since these solutions work
well with heap table organizations. Relational databases depend on heap structures for
OLTP [13]. But, research has shown that database management system (DBMS) software
can achieve large gains by clustering data according to the keys that queries use to filter the
data. This is true even for applications that are not multidimensional in nature [3]. When
databases store records in a random fashion and do not cluster them by key values, they
require virtually one I/O per record on average to fetch the data from secondary storage
[17]. Not surprisingly then, such solutions only work well for queries that access a very small
number of records relative to the overall search space. This is true even when the database
is capable of combining multiple indexes in one query. More formally, queries must have
very small conjunctive selectivity or combined dimensions query constraints to benefit from
better than full table scans for queries where the conjunctive selectivity exceeds 3.33% of
the total table size. This study also demonstrates that the tendency of OLAP queries to
constrain data based on hierarchically organized dimensions further amplifies this problem
[21, 23].
3.2
Clustering Indexes
Solutions from the second sector of research attempt to efficiently solve the very
difficult problem of multidimensional clustering. Unlike B-trees for OLTP, most of these
conventional OLAP data structures such as grid files, hB-trees, and R-trees are not efficient
and practical in one or more of the areas of space utilization, update predictability, query
efficiency, recovery, concurrency control, and complexity [25, 9, 38, 13, 14, 15, 18, 3].
For instance, grid files are reasonably easy to maintain and provide good
currency but do not provide predictable query performance and have a tendency to
con-sume large amounts of disk space and require much unnecessary I/O. An inability to
effi-ciently handle sparse data is the primary underlying cause of these problems in grid files
[27, 34, 15, 40]. While a large number of OLAP software applications implement Hypercubes
as their primary solution, they exhibit sparseness, disk space, and I/O query performance
problems. hB-trees and R-trees attempt to furnish OLAP applications with the robustness
that B-trees furnish OLTP applications but exhibit problems with concurrency,
mainte-nance, complexity, and predictable query performance [14, 9]. Most other known OLAP
data structures have one or more major weaknesses and are not able to provide the same
level solution for OLAP databases that B-trees are able to provide for OLTP databases.
Of course the exception to this is the UB-tree [2] since it is in fact a B-tree with
multidimensional ordering. But, the UB-tree alone does not cluster the data optimally for
hierarchical OLAP data. Multidimensional hierarchical clustering (MHC) clusters data in
UB-trees with dimensional hierarchies but requires a tradeoff between dynamic or real-time
updates and symmetry.
3.3
MHC
MHC combines UB-trees and hierarchical clustering. MHC overcomes the difficult
key values and simply storing the key values in one dimensional B-trees. Therefore, the
problems associated with clustering data by more than one dimension are absent. The only
problems that remain are to assign key values so that symmetric clustering is enforced and
so that the overall data structure is dynamic [21, 23].
3.3.1 UB-trees
The UB-tree or Universal B-tree [2], offers promise since it supports
multidimen-sional clustering and is able to utilize the B-tree as an underlying data structure without
altering the properties of the B-tree. This is true since the UB-tree employs Z-ordering or
an interleaving of bits from each key in a composite key [28] to map multidimensional spaces
to a linear sequence of points suitable for storage in linear data structures that preserve
order such as B-trees. If a multidimensional composite key contains 3 component keys or
dimensions of 16 bits each, the resulting interleaved key in the UB-tree has 48 bits with the
priority of the bits alternating between the dimensions on each bit position. This technique
merely alters the sorting or collating sequence of B-trees and does not affect its behavior
in any other way. The UB-tree provides a solid host data structure for multidimensional
clustering but does not address the multidimensional clustering problem itself.
3.3.2 Z-ordering
UB-trees and therefore MHC are based on the concept of bit interleaving from
Z-ordering [28, 29]. This technique maps multidimensional spaces to one-dimensional spaces.
For this reason, Z-ordering is known as a space filling curve. When a Z-ordering curve
maps two dimensions into one dimension it has a distinctive Z shape, hence the name.
Other research efforts have since made advancements in the area of space filling curves
such as Gray and Hilliard Ordering [10]. MHC could very well make use of the other more
advanced space filling curves that provide better proximity for spatial applications but,
bit-interleaving via Z-ordering is sufficient for MHC since MHC targets OLAP applications
and only partially orders the data as illustrated later.
The Z-ordering technique is very straightforward. It shuffles or interleaves the
bits from multiple keys together to form one contiguous key. The technique can order the
interleaved key as if it were a single key. Consider a composite multidimensional key made
of values for A, B, and C of 101, 110, 001, a new interleaved key or Z-address takes the first
bit from each of the three keys, then the second bit, and finally the third bit from each key
so that the resulting single key in shuffled bit format is 110010101.
More formally, a Z-address Z(k) is the ordinal number of a multidimensional
com-posite key k from a tuple or record on the Z-curve. Z-ordering calculates a Z-address fors bits in each of the ddimensions in the keyk as follows:
Z(k) =
s−1
X
j=0
d X
i=1
ki,j·2j·d+i−1
Of course, this represents the case where each key has the same number of bits. In
the case where one key exhausts its bits before the other keys, it simply fills in the missing
bits with zeroes so as not to affect the collating. Note that with Z-ordering and other
space filling curves, the technique predetermines the multidimensional space including the
number of bit partitions or division points for each dimension and places data points into
the multidimensional space as they become available.
Maintenance and queries for Z-ordering are relatively simple. Inserts, updates,
and deletes simply compute the Z-address for the multidimensional composite key and then
follow the rules of the underlying data structure.
Queries are slightly more complex. Queries performing partial matches or
multidi-mensional slice searches utilize a wildcard or “don’t care” bit for each bit in any unspecified
keys. Consequently, queries must search several paths in the underlying data structure
prun-ing sections of the data structure where possible. Otherwise, queries invoke the underlyprun-ing
rules of the data structure as with inserts, updates, and deletes.
3.3.3 Z-ordering Example
Consider an OLAP example involving Sales dimensioned by Product and Month
as Tables 3.1, 3.2, and 3.3 depict.
In this example, Z-ordering interleaves the bits from the Product and Month keys
in the Sales fact table together to form one key per record as specified by Z-ordering. Notice
that for each Z-address in the Z-ordering above the changes or breaks in bits from Product
occur much earlier in the Z ADDRESS and therefore dominate the collating sequence in
Table 3.1: Product Dimension
KEY DESCRIPTION
DEC BIN
1 0001 Nylon Wind Breaker 2 0010 Rain Coat
3 0011 Full Length Overcoat 4 0100 Heavy Waist Coat 5 0101 Light Jacket 6 0110 Hiking Boots 7 0111 Rain Pants 8 1000 Leather Gloves
Table 3.2: Month Dimension
KEY DESCRIPTION
DEC BIN
200301 110000111001101101 January, 2003 200302 110000111001101110 February, 2003 200303 110000111001101111 March, 2003 200304 110000111001110000 April, 2003 200305 110000111001110001 May, 2003 200306 110000111001110010 June, 2003 200307 110000111001110011 July, 2003 200308 110000111001110100 August, 2003 200309 110000111001110101 September, 2003 200310 110000111001110110 October, 2003 200311 110000111001110111 November, 2003 200312 110000111001111000 December, 2003
Table 3.3: Sales
Z ADDRESS PRODUCT MONTH DOLLARS
SELECT MONTH, DOLLARS
FROM SALES WHERE PRODUCT=3
Figure 3.1: Z-order Product SQL Query
SELECT PRODUCT, DOLLARS
FROM SALES WHERE MONTH=200301
Figure 3.2: Z-order Month SQL Query
Therefore, when a UB-tree stores these fact records, the product key dominates
the placement of records. If the block size is two and the B-tree fully packs the blocks,
then the first two records will be in block one, the second two in block two, and so forth.
Consider the SQL query in 3.1 that restricts data based on Product.
Recall that Z-ordering executes partial match queries against the underlying data
structure by specifying all available explicit bits and using the “don’t care” or wildcard bits
for unspecified bits. When queries search the data in the UB-tree, it uses these explicit bits
along with the wildcard bits to prune blocks that the queries do not need in order to satisfy
their requirements.
This particular query is able to prune all but one leaf block in the UB-tree. This
is true because even when a query designates a wildcard for every Month bit, the UB-tree
clusters all the records where Product is equal to 3 into one block. Since the UB-tree is a
non-dense index and the index portion of the UB-tree can easily reside in random access
memory (RAM), it is quite possible that this specific query will only generate one I/O for
the one block that it needs. A query from the example selecting any value for Product
should be able to utilize the index with similar efficiency.
In contrast, consider the query 3.2 that restricts data based on Month. This query
is not able to effectively leverage the UB-tree and needs to scan every leaf block in the
UB-tree or essentially perform a full scan of the data. When a query designates wildcards
example, this is true no matter which value of the Month the query uses to restrict the
data.
As demonstrated, the content of the data in this example prevents balanced query
performance and renders the UB-tree ineffective in producing symmetric access to OLAP
data. These same deficiencies are present in Gray and Hilliard Ordering [10] data structures
as well since they do nothing to control the relative bit placement in participating keys.
Research for the Variable UB-tree [41] also recognizes this problem and in fact
is an attempt to resolve this problem. This research terms the problem the “puff pasty
effect” since attempting to extract a hyperplane from a hypercube based on a partial match
query in a non-symmetric UB-tree is analogous to cutting through a “puff pastry” the long
way or splitting the layers in lieu of cutting through the short way or across the layers.
The extraction or query must access too many blocks rendering it like a full scan of the
hypercube in the worst case. Unfortunately, while the Variable UB-tree improves symmetry,
it is a static solution and forces a reorganization anytime data changes.
Ideally, an OLAP multidimensional data structure should provide access to the
data that is symmetric regardless which participating key or combination of participating
keys the query uses to restrict the data. Moreover, this is the impetus for multidimensional
data structures in the first place.
Clearly, while UB-trees can provide symmetric access to multidimensional data,
the distribution of key values that are present in the data has the potential to destroy
symmetry in UB-trees and render them ineffective.
3.3.4 Hierarchical Clustering
Thus, UB-trees in isolation are not enough. In their native form, UB-trees order
data according to foreign key values from dimensions. For instance, an OLAP fact table with
the dimensions customer, product, and time would have a multidimensional or Z-ordering
based on the primary keys for customer, product, and time. Typically, OLAP queries do
not base queries on such key values. This is especially true for conventional range queries.
An OLAP query selecting all products with a primary key value less than or greater than
a particular foreign key value corresponding to a dimension would not be common. This
is because primary keys are often meaningless, tuple identifiers. It is not uncommon for
SELECT CUSTOMER.NAME, TIME.WEEK,
SUM(SALES.DOLLARS)
FROM SALES,
CUSTOMER,
TIME
WHERE SALES.CUSTOMER ID=
CUSTOMER.CUSTOMER ID AND
CUSTOMER.CITY=’NEW YORK’ AND
CUSTOMER.STATE=’NEW YORK’ AND
TIME.MONTH=’MARCH’ AND TIME.YEAR=’2004’
GROUP BY CUSTOMER.NAME, TIME.WEEK
Figure 3.3: Typical OLAP SQL Query
single key value of customer and a single key value of product but, more typically OLAP
queries select slices of data that correspond to hierarchical values in dimensions. As the
SQL example in Figure 3.3 illustrates, a typical OLAP query might select all customers in a
city and all days for a month [19]. Therefore, UB-trees must incorporate such dimensional
hierarchies in their clustering of the data to be efficient for OLAP queries.
Hierarchical clustering is a method that controls key values and structures these
key values specifically for query access patterns. Hierarchical clustering capitalizes on the
fact that many primary key to foreign key relationships are composed of hierarchies. As
an example, a customer dimension might contain customers each of which the dimension
assigns to one city. The customer dimension might further assign each city to a state and
each state to a region. Hierarchical clustering incorporates these hierarchies into the primary
keys for the dimensions. The Customer Dimension in Table 3.4 shows such a hierarchy.
3.3.5 Z-ordering and Hierarchical Clustering
MHC combines hierarchical clustering with UB-trees. If customers in an OLAP
Table 3.4: Customer Dimension
KEY DESC CITY STATE REGION
1 Johnson San Diego CA West
2 Smith Atlanta GA South
3 Davis Boston MA Northeast
4 Watts New York NY Northeast
5 Duncan Wilmington NC South
6 Reaves Columbia SC South
7 Bradford Chicago IL Midwest
these customer levels, including customer, to cluster the data by the customer dimension
in the UB-tree. MHC utilizes this same hierarchical clustering technique for all dimensions
participating in the UB-tree. This technique allows MHC to outperform bit map indexes
and other non-clustered indexes when supporting OLAP queries. Non-clustered indexes
such as bit mapped indexes tend to only work well for very small result sets since
non-clustered indexes require an average of one Input/Output (I/O) for each record that they
include in result sets [17, 6, 26, 35, 31, 30, 8, 37, 36]. MHC does not have this limitation. It
takes advantage of locality in OLAP queries and concentrates records with the same
dimen-sional hierarchy values into much smaller numbers of database blocks thereby increasing
query efficiency by reducing I/O and ultimately speeding up queries. Simultaneously, MHC
provides symmetrical multidimensional access for any combination of participating
dimen-sions through the use of UB-trees with Z-ordering. This allows MHC to be useful for queries
that do not restrict dimensions or keys in the prefix of the underlying B-tree [23].
To combine Z-ordering and hierarchical clustering, MHC uses compound surrogate
keys or hierarchical surrogate encoding (HSE) keys in conjunction with Z-ordering. These
keys reserve a fixed number of bits for each level in a dimension hierarchy. The fixed number
of bits at each level depends on the number of unique key values for all parent keys at that
level in the dimension hierarchy. For the Customer dimension if there are 6 regions overall,
the maximum number of states in any of the 6 regions is 20, the maximum number of cities
in any state is 150, and the maximum number of customers in any city is 17, then the HSE
would require 3 bits for the region level, 5 bits for the state level, 8 bits for the city level,
and 5 bits for the customer level. Therefore the Customer dimension would require a total
of 21 bits for each of its primary keys. Note that each level requires only the number of
Table 3.5: MHC HSE
REGION STATE CITY CUSTOMER
001 10101 00100001 00011
the context of its parent at the next higher level. This greatly reduces the overall length of
the compound surrogate in contrast to storing independent primary keys for each hierarchy
level. MHC also makes provision for variable length compound surrogate keys in the event
that different keys have different numbers of hierarchical levels or unbalanced hierarchies
but, MHC does not make provision for variable length bits strings for each parent in each
hierarchical level individually. If MHC requires 8 bits for the city level in the customer
dimension, then any primary key from the customer dimension that includes the city level
must also include all 8 bits [23]. Table 3.5 shows the primary key structure for this Customer
Dimension.
Other than dimension key content dictated by HSE, MHC does nothing to alter
the properties of host UB-trees and the Z-ordering that it employs. Queries are the same
as in UB-trees. MHC implements slice queries by explicitly specifying supplied values in
the prefix of compound surrogates and using wildcards for the remaining bits just as it uses
wildcard bits for completely unspecified dimensions in multidimensional queries.
Ironically, while the underlying UB-tree is completely dynamic in that it does not
have any inherit limitations or preset thresholds, MHC is not. MHC incorporates variable
length overall bit strings as dimension keys but uses fixed length bit strings for individual
hierarchy levels within the dimension keys [23]. In MHC, the length L of each bit string that represents a hierarchy level is:
L = log2F
For keys not at the highest level of a dimensional hierarchy, F represents the maximum fan-out or largest number of children that any parent key, not just the parent
key of the current child, in the next higher level in the dimensional hierarchy could have. If
the key is at the highest level in the hierarchy,F simply represents the maximum number of possible keys at that level. MHC incorporates these fan-out values into the physical
database design when it defines the dimensional hierarchies [23, 22, 33, 32]. If the actual
size of the data that a data warehouse application inserts or loads into an MHC dimensional
will require full reorganization. If MHC dictates maximum fan-out values that are much
larger than actual data sizes, the physical database design then is likely to adversely affect
the symmetry of one or more dimensions. Thus, MHC forces a tradeoff between dynamic
or real-time updates and symmetry.
Because MHC uses fixed length bit strings to represent individual levels in HSE,
there is a tradeoff between symmetry and dynamic or real-time updates in MHC. If MHC
designates too few bits for a given hierarchy level, then the data might exceed this threshold
and require full reorganization of all data in all UB-trees that include the affected dimension
before MHC can process any further updates. If MHC designates too many bits for a given
hierarchy level, then the leading bits of the hierarchy level might remain empty and cause
Chapter 4
MDDC Theory
MDDC removes the tradeoff between symmetry and dynamic updates in MHC
with a simple but powerful modification to the MHC key structure. Both MHC and MDDC
benefit from their more narrow focus on OLAP queries by encoding compound surrogates
into the primary keys of dimensions. With this encoding, MHC imposes a partial ordering
and not a full ordering of the data. Specifically, MHC sorts key values within the order of
parent keys. In the Customer Dimension example, MHC sorts each CUTOMER within the
context of a REGION, STATE, and CITY. MHC might group CUSTOMER 1, 3, and 5 in
one REGION, STATE, and CITY and CUSTOMER 2, 4, and 6 in other REGION, STATE,
and CITY values depending on data values and relationships. Since hierarchical clustering
already forces a partial ordering that orders data according to the dimension hierarchies,
MDDC further takes advantage of this tradeoff with no additional cost. Research related
to MHC has shown that mirroring the bits of each level in compound surrogates increases
entropy and therefore symmetry [22]. MDDC introduces a mechanism for variable length
bit strings at each level in HSE keys and uses this in conjunction with bit mirroring. MDDC
also employs a third derived bit value that collates before 0 or 1 to mark the end of bit strings
in each HSE level so that HSE keys with the same prefix but different HSE level values do
not intermingle in the collating sequence. These techniques allow MDDC to be completely
dynamic while maintaining symmetry equal to or better than MHC. Specifically, MDDC
has superior symmetry when it uses fewer bits in an HSE level than the fixed number of bits
that MHC predefines. This is a special problem in all HSE levels other than the highest level
since MHC requires the number of bits at an HSE level account for the maximum number
Table 4.1: MDDC HSE
REGION STATE CITY CUSTOMER
10000000 10101000 10000100 11000000
of MDDC can never be worse than MHC by definition because MHC must allocate enough
bits to cover the maximum distinct values in the level. MDDC will never use more than
this maximum number of bits unless there are more distinct values in which case MHC
would need to be reorganized with a higher value for maximum number bits at that level.
Therefore, MDDC introduces dynamic capabilities not found in MHC, improves upon the
symmetry of MHC, and does not introduce any significant tradeoffs.
4.1
Variable Length Bit Strings
Introducing variable length bit strings for each level in the HSE key requires an
efficient storage and processing technique. Typically, variable length fields use a length byte
or end of string marker such as a NULL byte to indicate their size. This would not make
sense for HSE levels since the actual bits that compose the level itself are likely to fit in less
than one byte. A better approach is to allocate the last bit of each byte as a continuation
bit so that 7 bits of each byte are usable by MDDC. This approach would have to exceed
8 bytes or 56 usable bits in a level of an HSE key before the continuation bit approach
becomes less efficient than the length byte or NULL byte approaches. This is not likely to
occur in the vast majority of HSE key values since each level needs only be unique in the
context of its ancestors. Note also, that each level in a variable length HSE key requires at
least one byte and also uses a whole number of bytes for each level. This affects the length
of HSE keys that MDDC must store but does not affect symmetry since MDDC does not
use the continuation bits or any trailing zeros when interleaving bits. MDDC can discard
all trailing zeroes since by definition all reverse bit strings end in 1 with the exception of the
single value of 0. It is also important to point out that MDDC might also have one overall
variable length byte for the entire HSE key just as MHC does to accommodate unbalanced
hierarchies or hierarchies that allow a different number of levels for each key value.
In MDDC, the key from the customer example in the MHC section has an HSE
4.2
Dynamic Bit Interleaving
In addition to varying the length of bit strings in HSE keys, MDDC combines HSE
keys with dynamic bit interleaving in lieu of Z-ordering. If MDDC were to use Z-ordering
and bit strings that preserve an integer ordering as MHC does, the relative position of
each bit in these bit strings would need to shift as HSE key lengths change. This would
alter the overall Z-ordering and require MDDC to completely reorganize all the data in its
entirety. Using mirrored bits in conjunction with variable length bit strings and dynamic
bit interleaving overcomes this problem.
No matter how long the HSE bit strings grow, the relative position of bits in the
bit interleaving does not change. MDDC simply adds the additional bits for larger level
values to the end of the bit string for that level in the HSE. With Z-ordering the depth of
all HSE key values for each dimension are the same, for example 3 bits.
With dynamic bit interleaving in contrast, different keys in each dimension are
likely to have different depths. Consequently, MDDC does not have a predetermined address
space like MHC. MDDC subdivides the address space based on the density of local keys
on an as needed basis. Figure 4.1 depicts the uniform depth of Z-ordering while Figure 4.2
depicts the variable depth of dynamic bit interleaving. Dynamic bit interleaving furnishes
MDDC with some of the capabilities found in grid files, quad trees, and other similar data
structures [42, 13, 12].
Using a third derived bit value that collates before 0 and 1 ensures that MDDC
completely segregates distinct HSE key values. This does not alter the relative position of
previous interleaved bit addresses in the overall data ordering and therefore does not require
data reorganization.
Mirrored bits have one more important advantage. Except for the value of 0, each
mirrored bit string ends with a 1 by definition. This allows MDDC to trim the remaining
filler bits, which are all zeros, when interleaving bits. With this technique, MDDC always
exhibits symmetry as good as MHC and better in some cases. If MHC specifies 8 bits for a
given level in a compound surrogate but only uses one, then MHC wastes up to 7 bits before
accessing a bit that makes a difference in the Z-ordering. In such cases, some dimensions
that are using all bits in their compound surrogates might dominate the dimensions that
have unused bits. For the previously listed reasons, MDDC does not include these bits when
Figure 4.1: Uniform Depth Z-address Space
The following demonstrates how dynamic bit interleaving works. First, dynamic
bit interleaving maps all relevant zero and one bits from the mirrored bit string to 10
for bit 0 and 11 for bit 1. Dynamic bit interleaving maps all separators or the third
derived bit value to 01 and maps the end of string or NULL to 00. For example, in
MDDC the following two bytes, 10100110000010000, with 8 bits for each level maps to
111011101011110110101010101100. MDDC only uses these double length bit strings to
collate keys. MDDC does not store the keys in this format. MDDC stores the keys in the
mirrored bit format already described. Once the double bit format is calculated, MDDC
can interleave the bits just as Z-order does. This allows MDDC to implement dynamic bit
interleaving with minimal additional costs.
4.3
MDDC Operations
In general operations in MDDC are very similar to those in MHC. The following
sections provide details for MDDC operations.
4.3.1 Queries in MDDC
MDDC is like MHC in that it only alters the content of keys in B-trees and does
Figure 4.2: Variable Depth Dynamic Bit Interleaving Space
for B-trees. As with MHC, slice queries or partial matches are more complicated in MDDC.
4.3.2 Point Queries
Point queries are very simple in MDDC. When the query specifies a full key for
each dimension, MDDC simply computes the single bit interleaved address and uses the
B-tree to locate the record or records with the specified key values.
4.3.3 Slice or Partial Match Queries
MDDC uses “wild card” or “don’t care” bits in interleaved bit addresses for
di-mension key values or parts of didi-mension key values that the query does not specify. This
applies to situations where the query completely omits dimensions or where queries only
specify dimension prefixes corresponding to levels in compound HSE dimension keys. In
this regard, MDDC functions in the same manner as MHC.
4.3.4 Maintenance
Like MHC, MDDC does not alter any properties of B-trees during maintenance
operations. MDDC inserts, deletes, and updates records with standard B-tree methods
dimension keys before comparison with other keys in B-trees. MDDC also pre-computes bit
interleaved addresses for all comparison keys in the B-trees as they are needed. The
pre-computed, bit-interleaved addresses determine the placement and organization of records
within the B-tree. In doing so, MDDC does not in any way alter the properties of the
host B-tree. Therefore, B-trees containing keys that MDDC algorithm organizes retain all
maintenance advantages of B-trees including the perfect balance, shallow depth, granular
concurrency control, and recoverability [40, 13]. Since MDDC does not require complete
data reorganization, it is completely dynamic and can readily accommodate real-time
up-dates.
4.3.5 Key Recycling
MDDC like MHC can take advantage of key recycling at all levels of any hierarchy
[23]. MDDC works the same as MHC in this area. Realize that HSE keys need not assume
the role of primary keys. HSE keys can play the role of internal row addresses in dimension
and fact tables. Finally, the dimensions and optionally fact tables can store the primary
key values in addition to the HSE keys.
4.3.6 Key Participation
As is the case with MHC, MDDC optionally includes any dimensions and any
hierarchy levels that the dimensions contain into the HSE. MDDC can also include any
proper subset of these dimensions and levels. When MDDC configures a data structure
with a proper subset of dimensions and levels, it appends the HSE key with any remaining
dimensions or other fields in the primary key to enforce uniqueness. MDDC can also
combine more than one independent dimension into one virtual dimension via concatenation
to establish priority.
4.4
Incremental Costs
As stated before, the only relevant tradeoff that MDDC makes so that it can
provide dynamic symmetry is the forfeiture of integer ordering. But since MHC forces a
partial integer ordering anyway and the most common OLAP queries work well with such
from MDDC works just as well for OLAP queries [19] and there is no real tradeoff in this
regard. Hence, MDDC supports OLAP point and OLAP slice queries just as well or better
than MHC.
The only other cost that MDDC suffers in comparison to MHC is the requirement
for more disk space. The total amount of additional disk space per table depends on the
ratio of key fields to other fields in the table. However, since both MHC and MDDC, provide
efficient query access without additional indexes, they both are very space efficient.
4.5
Summary of MDDC
By utilizing variable length HSE keys and dynamic bit interleaving, MDDC enables
completely dynamic insert, update, and delete operations without full reorganization of any
dimension, fact table, or underlying B-tree. In addition, this allows MDDC to provide
symmetrical query efficiency that is good as or better than that of MHC. The only tradeoff
is the requirement for additional disk space. The experimental results that follow will
Chapter 5
Implementation
The implementation here contains working versions of both MDDC and MHC so
that the experiments that follow can compare the two algorithms directly. The purpose
of the implementation is to demonstrate that MDDC as opposed to MHC is completely
dynamic, has symmetry that is good as or better than MHC, at minimum additional costs.
The following sections describe the simulation software, simulation parameters, simulated
SQL tables, and simulated database tables.
5.1
Simulation Platform
This implementation is built on a Microsoft Windows laptop. The laptop is a
Hewlett-Packard Pavilion dv5000. Its RAM is 1 Gigabyte, it has a single 84.4 Gigabyte
hard disk drive, and its single AMD 64 bit processor has a clock speed of 1.79 Gigahertz.
Its operating system is Windows XP Professional.
5.2
Simulation Software
In the experiments that follow, the implementation of MDDC and MHC uses the
B-tree access method in BDB (Berkeley DB) software from Sleepy Cat Inc. BDB is open
source software. The implementation here specifically implements MDDC and MHC with
the C language API from BDB. All code for the implementation of MDDC in this study is
in C language. The simulation uses Microsoft Visual Studio.net Professional to compile the
In order to implement MDDC and MHC with the standard B-tree access methods
in BDB, this implementation alters the comparison routine of the B-tree access method
in BDB to compute bit interleaved addresses as part of the comparison process between
composite primary keys. This further evidences the fact that MDDC does not require
alterations to the host B-tree data structure. The implementation additionally contains
programs to search the MDDC and MHC data structures in a skip sequential manner. To
simplify the code for the implementation, the simulation only permits queries that specify
at most one value for each dimension level. If the queries do not specify a value for a
given dimension level, the implementation allows all values and uses wild card bits for these
dimension levels. Finally, if a query does not specify a value for a dimension level then the
query cannot specify any specific values for the children of this dimension level.
The query output of the simulation on MDDC and MHC consists of record counts
for each query and access statistics including cache size, records scanned, total I/O problems,
I/O probes to disk, processor time, and elapsed time. The simulation also includes a full
scan program for MDDC and MHC to ensure query results are correct.
5.3
Simulation Parameters
This implementation only uses a small amount of RAM for MDDC and MHC
so that it can demonstrate the I/O efficiency of MDDC. This implementation limits the
BDB cache size parameters to 4 database blocks. It also uses large database blocks for I/O
efficiency. It uses the maximum 64 kilobyte block size in BDB. Therefore the size of the
cache for MDDC and MHC is 256 kilobytes.
5.4
Simulated SQL Statements
Since this implementation uses C API for the BDB embedded database
man-agement system, no SQL interpreter is available. Therefore, C programs simulate SQL
statements in this implementation. These SQL statements constrain dimensions directly.
The simulation does not perform join operations since join operations would be the same
for both MHC and MDDC. Similarly, the simulation does not perform range queries since
they are not typical in OLAP applications and MHC or MDDC are not designed for them.
SELECT COUNT(*)
FROM GROCERY SALES FACT WHERE PROD CAT ID=4 AND
PROD ID=8
Figure 5.1: Simulated SQL Query
5.5
Simulated Database Tables
Typically, relational database systems use SQL to define tables. As with all other
SQL statements, this simulation uses C programs to define tables.
5.6
Special Simulation Challenges
In order to take full advantage of MHC and MDDC and to compare them fairly,
a query algorithm is required that visits each leaf database block only one time. This is
fairly simple in a one dimensional B-tree. Once the query accesses a given leaf block, the
algorithm access all the records on that block and ensures that the next key value accessed
is greater than the last key in the current block [40]. It is more complicated to effect this
type of skip sequential processing in MDDC and MHC where the algorithms must deal
with multidimensional key encodings and bit interleavings. The algorithm must calculate
the next key in the multidimensional space using wildcard bits for partial match and slice
queries. Essentially, the lowest priority bit for each dimension that has an unset value less
than 1 has to be identified and then the unset bit with the lowest overall priority from all
the dimensions not in the query filter has to be set to 1. This algorithm proved to be quite
Chapter 6
Experimental Results
6.1
National Grocery Chain
The “National Grocery Chain” database contains sales data for a national chain
of grocery stores broken out by month, and product. The store, month, and product
data were fabricated and manually entered into text files. The sales data was created
with the assistance of a random number generator. Given a sparseness factor such as
10% the simulator uses the random number generator to create one randomly selected
record out of every 10 possible combinations of store, month, and product. A sparseness
factor of approximately 0.28% was selected for this experiment to generate the approximate
12,000,000 sales records in the simulation.
6.1.1 Data Model
The “National Grocery Chain” database is a flat star-schema or one that contains
the complete dimension hierarchy in the base dimension [19]. The dimensions are Store
in table B.1, Product in table B.2, Promotion in table B.3, and Month in table B.4. This
database combines these four dimension tables to establish the grain of the Sales Facts in
table B.5.
6.1.2 Test Criteria
The experiment focuses on two dimensions. It tests the behavior of the product
The product dimension contains a hierarchy made up of two levels, product category and
product. The MHC key as tables B.2 and B.4 depict in this simulation uses 8 bits for the
product category level and 8 bits for the product level. The month dimension contains
a hierarchy made up of three levels, year, quarter, and month. The MHC key in this
simulation uses 6 bits for year, 4 bits for quarter, and 4 bits for month. MDDC has no
preset number of bits for any of the dimensions.
6.1.3 Data Loading
A set of C programs make calls to the BDB database in order to load data in the
“National Grocery Chain” database. The C programs load the data into the MHC database
and the MDDC database. The C programs load the “Grocery Sales Fact” table in B.5 in
stages to illustrate the differences in symmetry between MDDC and MHC and the need to
reorganize MHC data structures when the MHC algorithm reaches preset size thresholds.
The simulation loads all simulated records with product category keys 1 and 2
in stage 1. In stages 2 through 9, the simulation loads all simulated records with product
category keys 3 through 10 respectively as results tables D.1, D.2, D.3, D.4, and D.5 depict.
6.1.4 Queries
For simplicity in the “National Grocery Chain” database, the C programs that
im-plement queries, only allow point queries and slice queries with no more than one dimension
value from any hierarchical level of any dimension. The programs do not allow specification
of multiple values from any level in any dimensional hierarchy in one query. This allows
for simple and efficient query implementation. The implementation could include multiple
dimension values at a higher implementation cost and complexity in order to maintain the
same level of efficiency. The theory and principles of MDDC and its comparison to MHC
are the same for either level of implementation complexity. This is true since skip sequential
processing handles multiple points in key order very efficiently [40].
To further focus on the MDDC versus MHC data structures, this simulation
con-centrates on queries that specify dimension and hierarchy values that MDDC and MHC use
to organize the fact data or those that are in the hierarchical surrogate encoding (HSE) keys.
The simulation executes all the queries represented by the SQL in Table C.1 against the
after each stage in the loading process represented by the corresponding product category
values.
6.1.5 Results
The results demonstrate the advantage that MDDC has over MHC in symmetry
and real-time update capability. The first 8 queries from Table C.1 represent the differences
in MHC and MDDC most starkly since each one of these queries only constrains one of the
dimensions. In this experiment, the product dimension is the experimental test whereas the
other three dimensions are controlled tests. Therefore query 1 and 5 from Table C.1 should
illustrate the results that are sought while queries 2 through 4 and 6 through 8 should
perform as expected.
All the results demonstrated that the most persistent and reliable indicator of
performance is the number of I/O operations for each query, whether they are cached or
not. In addition, when a ratio of the number of I/O operations per index query versus
the number of I/O operations per full scan query is used as the primary metric, then the
results are stable as the underlying data increases in size. This ratio also accounts for the
difference in file sizes between MHC and MDDC. As a result, the ratio of query I/O to full
scan I/O is the primary metric on which this experiment focuses.
Charts 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, and 6.8 compare the MHC results to the
MDDC results for all 9 product category data runs in this experiment. Note that charts 6.1
and 6.5 represent the greatest difference between MHC and MDDC queries. Of course these
correspond exactly to test queries 1 and 5 from Table C.1. In these queries, the MHC ratio
of query I/O to full scan I/O or percentage of full scan actually indicates that query I/O’s
in MHC are sometimes larger than the number of I/O operations in a full table scan. Note
also that the ratio improves as the simulation loads more of the product category records.
As charts 6.2, 6.3, 6.4, 6.6, 6.7, and 6.8 depict, MDDC queries have slightly higher I/O
ratios than MHC queries in the control queries but are much lower than the 100 percent
ratio required to scan the full table.
These results are to be expected since the HSE key for product category in the
product dimension for the MHC database allocates 5 leading bits that this simulation does
not use. A new database design for MHC could alleviate this problem. If the simulation
Full Scan Percentage for Product Dimension Query 1
0 20 40 60 80 100 120 140 160
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.1: Grocery Sales Product Dimension Query 1
category in the MHC database then the above problem with the product dimension will not
occur. But, this presents another problem. The MHC database would not be able to house
more than 16 product categories without a complete MHC database reorganization. the
slightly higher I/O ratios in the control queries are also to be expected since the product
dimension in MDDC is sharing in some of the selectivity in the MDDC data structure at
the expense of the other 3 dimensions while the product dimension is not doing this in the
MHC data structure.
Hence, MDDC offers better symmetry than MHC and unlike MHC offers real-time
updates without major data structure reorganizations. Unlike MHC, MDDC does not force
a tradeoff between symmetry and intrinsic growth limits.
6.2
TPC-H
The data in this experiment originates from the TPC-H benchmark [1]. The
DBGEN from TPC-H generates the database with a scale factor of 2 so that the LINEITEM
fact table has approximately 12,000,000 records. See TPC-H estimated database size for a
scale factor of 2 in B.6. This simulation uses the PART dimension as shown in table B.7
Full Scan Percentage for Product Dimension Query 2
0 20 40 60 80 100 120 140 160
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.2: Grocery Sales Product Dimension Query 2
Full Scan Percentage for Promotion Dimension Query 1
0 2 4 6 8 10 12 14 16
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Full Scan Percentage for Promotion Dimension Query 2
0 5 10 15 20 25 30
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.4: Grocery Sales Promotion Dimension Query 2
Full Scan Percentage for Store Dimension Query 1
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Full Scan Percentage for Store Dimension Query 2
0 2 4 6 8 10 12
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.6: Grocery Sales Store Dimension Query 2
Full Scan Percentage for Month Dimension Query 1
0 5 10 15 20 25 30
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Full Scan Percentage for Month Dimension Query 2
0 5 10 15 20 25 30
1 2 3 4 5 6 7 8 9
Data Run
MHC MDDC
Please purchase 'e-PDF Converter and Creator' on http://www.e-pdfconverter.com to remove this message.Figure 6.8: Grocery Sales Month Dimension Query 2
table as shown in table B.9 for the fact table. The MHC and MDDC data structures cluster
the LINEITEM data with only the PART and SUPPLIER dimensions.
6.2.1 Data Model
The TPC-H database is a normalized schema or a schema with snow-flaked
di-mensions and fact tables that are normalized into more than one table [1, 19, 5]. This
simulation uses PART in B.7 and SUPPLIER in B.8 as the dimensions. It combines these
two dimensions to establish the granularity of the LINEITEM table in B.9.
6.2.2 Test Criteria
The experiment focuses on two dimensions. It tests the behavior of the SUPPLIER
dimension as data is loaded. It also uses the PART dimension as an experimental control.
The SUPPLIER and PART dimensions each contain a hierarchy made up of one level, the
primary key. The MHC key as tables B.7 and B.8 depict uses 19 bits for the PARTHSE