In this section, we will describe the major processing steps entailed when we want to answer star queries over a hierarchically clustered fact table.
• Step 1. Identifying relevant fact table data: The processing begins with the evaluation of the restrictions on the individual dimension tables, that is, the evaluation of the local predicates (operations Create_Range in Figure 5(a)).
This step performed on a hierarchically encoded dimension table will result in a set of h-surrogate values that will be used in order to access the correspond-ing fact table data. Due to the hierarchical nature of the h-surrogate, this set can be represented by a number of h-surrogate intervals called the h-surrogate specification. Using the notation of Karayannidis et al. (2004), an interval will have the form 1999.*.* for the restriction on the DATE dimension in our running example. This denotes that we need to access all values under 1999 at the Month level and all values of each such month at the Day level.
Once the h-surrogate specifications are determined for all dimensions, the evaluation of the star join follows. In hierarchically clustered fact tables, this translates to one or more simple multidimensional range queries on the underly-ing multidimensional structure that is used to store the fact table data (operator MD_Range_Access in Figure 5(a)). Moreover, since data are physically clus-tered according to the hierarchies and the ranges originate from hierarchical restrictions, this will result in a low-I/O evaluation of the range selection.
• Step 2. Computing necessary joins: The tuples resulting from the fact table contain the h-surrogate values and the measure values. At this stage, there might be a need for joining these tuples with a number of dimension tables in order to retrieve certain hierarchical or feature attributes that the user wants to have in the final result and might also be needed for the grouping operation.
We call these joins residual joins. Note that all these join operations (the Re-sidual_Join nodes in Figure 5(a)) are equi-joins on key-foreign key attributes and therefore each fact table tuple is joined with exactly one dimension table tuple.
• Step 3. Performing grouping, filtering, and ordering: Finally, the result-ing tuples may be grouped and aggregated and the groups further filtered and ordered for delivering the result to the user. The Group_Select operator in Figure 5(a) performs these actions.
46 Karayannidis, Tsois, & Sellis
The abstract processing plan comprising of the phases is illustrated in Figure 5(a) and can be used to answer the star queries that belong to the query template of Figure 3. This plan is abstract in the sense that it does not determine specific algorithms for each processing step; it just defines the processing that needs to be done. That is why it is expressed in terms of abstract operators (or logical operators), which in turn can be mapped to a number of alternative physical operators that correspond to specific implementations.
An example abstract processing plan is shown in Figure 5(b) and it corresponds to the query of Figure 4.
Having described the framework for query processing of OLAP queries, we move next to discuss how this can be materialized on a hierarchical clustering-preserving data structure, namely the CUBE File.
Figure 5. (a) The abstract processing plan; (b) the abstract processing plan for the example query
MD_Range_Access (area, month) Main execution phase
DATE PRODUCT LOCATION
Create_Range (Year=)
Create_Range (Category=”ar condton”)
Create_Range (Populaton >
000000) SALES_FACT
Residual_Jon (Store_id)
h-surrogate processing MD_Range_Access
Residual_Jon Group_Select Main execution phase
D Dj
Create_Range FT
h-surrogate processing Create_Range
Advanced Ad Hoc Star Query Processing 147
Processing Star Queries over CUBE File Organized Fact Tables
The CUBE File (Karayannidis et al., 2004) is a multidimensional data structure for storing the most detailed data of a fact table. Thus it could be exploited as an alter-native primary organization to heap files for fact tables. It provides fast indexing on data, when these are accessed via restrictions on the hierarchies. Moreover, it physically clusters data w.r.t. dimension hierarchies (i.e., hierarchical clustering), which reduces significantly the I/O cost for star query evaluation.
The CUBE File partitions the multilevel-multidimensional data space of an OLAP cube in disjoint subspaces, called chunks, which are formed by all hierarchy value combinations per hierarchy-level. This process is called hierarchical chunking (Fig-ure 6(a)) and results in a chunk-tree representation of the cube (Fig(Fig-ure 6(b)). Note that prior to applying hierarchical chunking all hierarchies have to be normalized to the same length with the insertion of pseudo-levels to the shorter ones. The main advantage of hierarchical chunking is that it results in a structure that is highly adap-tive to the cube’s inherent extreme sparseness. The intuition is that the underlying data clusters are located naturally during the chunking process, exactly because hierarchy value combinations form the dense and sparse data areas. For example, a Figure 6. (a) A cube hierarchically chunked; (b) the whole subtree up to the data chunks under chunk 0|0 (corresponding to the grayed cells on the left figure)
48 Karayannidis, Tsois, & Sellis
sparse area would be formed in a 3-dimenisonal cube, along the subspace (Sept00-Dec00, Books, Italy) if we did not sell any books during Sept00-Dec00 in Italy.
A subtree at chunking-depth D corresponds to a “family” (i.e., a subspace) of hier-archy-related data points. In fact the taller this subtree is (i.e., the smaller D is), the larger is the subspace of hierarchy-related data points that it “covers.” Based on this observation, the CUBE File construction algorithm tries to “pack” into buckets (i.e., disk pages) whole subtrees of the smaller possible depth. This is the basic heuristic exploited by the CUBE File for achieving hierarchical clustering of the data. Note that the packing of chunks into buckets, so as to preserve hierarchical clustering, is an NP-Hard problem (Karayannidis, 2003).
In Figure 7 we depict such an allocation of chunks into buckets. In this figure we depict an arbitrary chunk-tree, where subtrees appear as triangles and specific nodes (i.e., chunks) as squares. The number within a triangle denotes the size of the cor-responding subtree. The number within a square denotes the size of all subtrees under this node, plus the size of the node itself. In the figure, we have assumed a bucket size of 30 storage units. The lowest depth subtree that has been stored in a bucket corresponds to depth D = 1 (see bucket B1). This bucket has the maximum hierarchical clustering degree among all buckets of the specific chunk-to-bucket allocation. Essentially this means that HPP queries that need to access B1 will have reduced I/O cost, since this bucket has stored a larger subspace.
Figure 7. The chunk-to-bucket allocation for a chunk-tree where the size of a bucket is SB = 30 units of storage
6
40
0
20
D= 0
D MAX = D= SB = 30
B
B
B
Advanced Ad Hoc Star Query Processing 149
Note that the upper nodes (including the root node) that “fail” to be allocated to some bucket comprise the root-directory of the CUBE File. The root-directory is usually cached in main memory or it is further allocated into buckets as if it is a chunk-tree on its own.
The CUBE File requires the assignment of h-surrogates to the dimension values.
Moreover, for each cell of a chunk, the interleaving of the corresponding h-surro-gates yields a path-based key called the chunk-id, which is a unique identifier of a data point in the multilevel multidimensional data space of a cube. For example, 0|0.0|0 is the chunk-id of the low-left cell of the chunk at depth D = 1 in Figure 6(b) (in the figure, it is depicted as a label on the corresponding chunk). “P” in a chunk-id denotes a pseudo-level.
In Figure 8(a), we depict the abstract processing plan of Figure 5(b) as a physical execution plan over a CUBE File organized fact table. We can see the evaluation of the local predicate on the DATE dimension consisting of an HPP restriction solely
Figure 8. (a) The abstract processing plan of our running example expressed as a physical execution plan over a CUBE File organized fact table; (b) the optimized abstract plan
0 Karayannidis, Tsois, & Sellis
on the HPP-Index (a B-tree index defined on the hierarchy attributes of a dimen-sion) without the need to access the base table (i.e., the DATE dimension table). The restriction Year = 1999 has been translated to the h-surrogate specification 0.*.*
(assuming that the order-code of 1999 is 0). Note that only the first matching index-tuple (corresponding to the Day level) suffices for generating the corresponding h-surrogate specification, since all days of 1999 have the same h-surrogate prefix:
0.. This single index-tuple retrieval is an extremely fast implementation of the Cre-ate_Range abstract operation of Figure 5(a).
The same holds for the PRODUCT dimension also, since another HPP restriction is imposed there. The corresponding h-surrogate specification for the restriction Category = ‘Air Condition’ is 3.*.* (assuming that the order-code of the ‘Air Con-dition’ value is 3). For dimension LOCATION things are a bit different since we have to perform a full table scan directly on the base table and then select the tuples that match the restriction on the feature attribute Population, which is functionally dependent on the hierarchical attribute Area. The corresponding h-surrogate speci-fication is 2.6.* assuming that only one area (with order-code 6) in a single region (with order-code 2) satisfies the restriction on the Population feature attribute.
As soon as the h-surrogate specifications are extracted from each dimension, they are combined into a single chunk expression (CX) that is passed as input to an MDRangeSelect operator. A chunk expression is essentially an access pattern de-scribing the cells that must be accessed in each depth of the chunk-tree and exploits the chunk-id notation. The chunk expression is created from the interleaving of the h-surrogate specifications. The depicted chunk expression 0|3|2|*.*|*|6|P.*|*|*|*
is built from the interleaving of the aforementioned h-surrogate specifications, plus the h-surrogate specification for the CUSTOMER dimension, which is *.P.*, since this dimension is left unrestricted in the query of Figure 4 (“P” denotes a pseudo-level). The interleaving order is DATE, PRODUCT, LOCATION, and CUSTOMER and was chosen arbitrarily.
The MDRangeSelect will access the CUBE File in order to efficiently retrieve the relevant detailed data (described in the CX). Each sales value retrieved will be aug-mented with two h-surrogates, one corresponding to the DATE dimension and the other to LOCATION, which are dynamically computed from the corresponding data cell chunk-id (which is not stored along with the measure values, since the chunks are essentially multidimensional arrays, but retrieved from the current position/data point in the cube’s data space). This provides the “impression” of tuples coming out of the MDRangeSelect operator.
Furthermore, these tuples will need to be joined with the DATE dimension in order to retrieve the Month values required in the final result. This join is implemented by a physical operator named IndexResJoin in the figure. Essentially, this is an index-based join that utilizes the primary organization of the dimension tables to efficiently retrieve the single join tuple from the dimension side. A dimension table
Advanced Ad Hoc Star Query Processing
is organized as a B+ tree with the h-surrogate attribute as the search key. Each tuple coming from the CUBE File side contains an hsk attribute (i.e., an h-surrogate) corresponding to the DATE dimension. We use this value as a key for accessing directly the DATE dimension and retrieving the single tuple that matches. Indeed, since hsk is a primary key of the dimension table, there will be only a single tuple match. Therefore, the number of tuples in the output of the IndexResJoin operator is the same as the one in the input.
Similarly, for each hsk value corresponding to dimension LOCATION we access the corresponding tuple and retrieve the appropriate Area value. Finally, the grouping and aggregation has to take place. We depict a hash-group operator that groups the incoming tuples by Area and Month.