Star Query Optimization - Data Warehouses and OLAP

An interesting feature of the database schema is the hierarchical structure of the h-surrogate attributes and the fact that they encode all the hierarchical attributes of the dimensions. Therefore, a number of functional dependencies exist in the schema, along with inclusion dependencies defined through the key-foreign key relationships. These functional and inclusion dependencies can be combined and used for the optimization of the grouping and join operations. The complex optimi- zation technique that exploits these existing integrity constraints is the hierarchi-cal pregrouping and it is presented next. Experimental results have shown that this technique can drastically reduce the execution time of the examined OLAP queries (Pieringer et al., 2003).

Other optimization opportunities exist and pertain to the Create_Range operation of the abstract processing plan (Figure 5(a)), or the exploitation of the sort-order of the tuples coming from the fact table (MD_Range_Access operation). Due to lack of space, these techniques will not be described here. The interested reader can find details in Karayannidis et al. (2002), Tsois and Sellis (2003), Pieringer et al.

(2003), and Tsois (2005) for the former and in Theodoratos and Tsois (2003) and Tsois (2005) for the latter.

The Hierarchical Pregrouping technique is based on the properties of the join and grouping operations. The grouping operation uses the values of the grouping attri-butes only to group tuples that have the same value. However, the actual value of a grouping attribute is not important. Therefore, an attribute X that is used only in a grouping operation, like Group_Select, can be replaced with any other attribute Y when there is a bijective (1-1 and onto) mapping among the values of X and Y.

Therefore, one can use the functional (and inclusion) dependencies in order to re-place grouping attributes. In a similar manner functional and inclusion dependencies

2 Karayannidis, Tsois, & Sellis

allow the modification of join conditions, as explained in detail by Tsois (2005) and Tsois and Sellis (2003). By replacing grouping attributes and join conditions, the hierarchical pregrouping technique can group the fact table tuples very efficiently before the residual join operations, it can push join operations above the grouping operations and it can even remove completely some of the join operations.

The key idea used in the hierarchical pregrouping transformation is the follow-ing: if a hierarchical attribute h_k of a dimension D_i is used in the query evaluation plan just for grouping or just for an equi-join operation, then this attribute can be replaced by its encoded form which exists within (as a prefix) the corresponding h-surrogate. If hsk_i is the corresponding h-surrogate, the encoded value of h_k is denoted as hsk_i:h_k.

For example, for the LOCATION dimension in our running example of Figure 4, we can use the prefix part of the h-surrogate to group on the Area attribute, instead of using the actual Area attribute. This is because all tuples of LOCATION with the same values in the Region and Area attributes will have the same prefix in the h-surrogate value. Furthermore, since the foreign key loc_hsk exists in the fact table we can group the fact table tuples according to the Area attribute without having to join them with the LOCATION dimension table.

Lets see how this transformation affects the abstract plan shown in Figure 5(b) for our example query of Figure 4. The hierarchical pregrouping transformation modi-fies this initial plan by changing the grouping attributes and pushing both residual join operations after the Group_Select operation. This is possible because group-ing is done on hierarchical attributes only (attributes Area and Month) that have a corresponding encoded form in the h-surrogates found in the fact table (loc_hsk:

area, date_hsk:month). The residual join operations are modified so that each tuple in the output of the Group_Select operation is joined with only one tuple from the DATE dimension and one tuple from the LOCATION dimension. The resulting plan is shown in Figure 8(b).

By exploiting the above properties of the h-surrogates the hierarchical pregrouping transformation can achieve three different types of modifications to the proposed abstract processing plan for star queries (Figure 5(a)):

1. It can eliminate a residual join operation completely. This case happens when the removed dimension table was joined only to obtain access to hierarchical attributes that where then used only for the grouping operation.

2. It can split the grouping operation of the initial plan into two stages: in the first stage, a pregrouping of the fact table tuples is performed before invoking them in residual join operations. The grouping is performed using the h-surro-gates contained in the fact table (the foreign keys). In the second stage, which is executed after performing the required residual joins, the tuples are once again grouped to obtain the required result. This second grouping operation

Advanced Ad Hoc Star Query Processing

uses some of the attributes that where acquired with the residual join opera-tions performed. This transformation reduces drastically the number of tuples involved in the residual join operations and therefore it reduces significantly the cost of the query evaluation plan.

3. It can push join operations above the grouping operation by carefully modifying the grouping attributes and the join condition. The result of this transformation is similar to the previous one: it reduces the number of tuples involved in the affected residual join operations and therefore it improves the overall cost of the evaluation plan.

Note that in order to push a residual join operation above a grouping operation the join must be carefully modified. This can be done either by grouping the dimension table and modifying the join condition or by using special join algorithms that join each fact table tuple with only one tuple (the first matching tuple) from the dimension table. This is because all initial residual joins are equi-joins on the key attribute.

The details of the hierarchical pregrouping transformation and its definition as an algorithm appear in Karayannidis et al. (2002) and Tsois (2005). A theoretical analysis of the transformation, its generalization as well as a proof of correctness can be found in Tsois (2005) and Tsois and Sellis (2003).

Future. Trends

Speculating about the future trends in data warehouse query processing in general, we believe that there are two main factors that will drive the processing require-ments in the near future:

1. Continuously increasing data volumes that one needs to analyze.

2. Continuously increasing rates by which data for analysis are generated on the one hand and increasing need for up to date information on the other.

The first factor calls for extremely scalable storage organizations that exploit a plethora of successful techniques such as semantic based physical data clustering, precomputation of aggregates, and fast indexing. It also requires even more elaborate semantic-based optimization transformations that will reduce the amount of data processed at each step. The second calls for storage structures that are extremely adaptive to updates, and for processing techniques “borrowed” from the field of data stream processing.

4 Karayannidis, Tsois, & Sellis

Conclusion.

In this chapter, we discussed the processing of ad hoc star queries over hierarchically clustered fact tables. We presented a complete abstract processing plan that covers all the necessary steps for answering such queries. This plan directly exploits the benefits of hierarchically clustered fact tables and opens the road for new optimiza-tion challenges. We showed how this abstract plan can be “materialized” for the case of a multidimensional storage structure that achieves hierarchical clustering, namely the CUBE File. Finally, we presented the hierarchical pregrouping transformation as a powerful optimization technique for this type of query processing.

Clearly, star query processing over hierarchically clustered fact tables is signifi-cantly different from other approaches. The most remarkable difference is that the fact table access is transformed to a multidimensional range query through the use of h-surrogates (i.e., surrogate keys with hierarchy semantics). Moreover, fact table physical organizations such as the CUBE File exploit h-surrogates to provide physi-cal data clustering w.r.t the dimension hierarchies, resulting in a reduced I/O fact table access. Finally, the exploitation of the hierarchy semantics that h-surrogates convey can lead to efficient query optimization techniques such as the hierarchical pregrouping transformation.

The abstract processing plan can be easily incorporated in a DBMS provided that a hierarchical clustering-preserving fact table organization is supported. For example the methods introduced in this chapter have been fully implemented in the com-mercial relational DBMS TransBase HyperCube® (TransBase HyperCube®, 2005), which utilizes the UB-tree (Bayer, 1997) as a fact table primary organization.

References

Bayer, R. (1997). The universal B-tree for multi-dimensional indexing: General concepts. Proceedings of the Worldwide Computing and Its Applications, International Conference, Tsukuba, Japan (pp. 198-209).

Chan, C. Y., & Ioannidis, Y. (1998). Bitmap index design and evaluation. Proceed-ings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA (pp. 355-366).

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1), 65-74.

Karayannidis, N. (2003). Storage structures, query processing, and implementation of on-line analytical processing systems. PhD doctoral thesis, National

Techni-Advanced Ad Hoc Star Query Processing

cal University of Athens. (2003). Retrieved May 29, 2006, from http://www.

dblab.ece.ntua.gr/~nikos/thesis/PhD_thesis_en.pdf

Karayannidis, N., Sellis, T., & Kouvaras, Y. (2004, March 14-18). CUBE file: A file structure for hierarchically clustered OLAP cubes. In Proceedings of the 9^th International Conference on Extending Database Technology (pp. 621-638), Heraklion, Crete, Greece. EDBT.

Karayannidis, N., Tsois, A., Sellis, T. Pieringer, R., Markl, V. Ramsak, F., et al. (2002).

Processing star-queries on hierarchically-clustered fact-tables. Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong (pp. 730-741).

Markl, V., Ramsak, F., & Bayern, R. (1999). Improving OLAP performance by multidimensional hierarchical clustering. Proceedings of the International Database Engineering and Applications Symposium, Montreal, Canada (pp.

165-177).

O’Neil, P. E., & Graefe, G. (1995). Multi-table joins through bitmapped join indices.

SIGMOD Record, 24(3), 8-11.

O’Neil, P. E., & Quass, D. (1997). Improved query performance with variant indexes.

Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ (pp. 38-49).

Oracle® 10g. (2005). Documentation.

Pieringer, R., Elhardt, K. Ramsak, F., Markl, V., Fenk, R., Bayer, R., et al. (2003).

Combining hierarchy encoding and pre-grouping: Intelligent grouping in star join processing. Proceedings of the 19th International Conference on Data Engineering (ICDE), Bangalore, India (pp. 329-340).

Roussopoulos, N. (1998). Materialized views and data warehouses. SIGMOD Re-cord, 27(1), 21-26.

Sarawagi, S. (1997). Indexing OLAP data. Data Engineering Bulletin, 20(1), 36-43.

Sarawagi, S., & Stonebraker, M. (1994, February 14-18). Efficient organization of large multidimensional arrays. In Proceedings of the 11^th International Con-ference on Data Engineering, Houston, Texas (pp. 326-336).

Srivastava, D., Dar, S., Jagadish, H. V., & Levy, A. Y. (1996). Answering queries with aggregation using views. Paper presented at the VLDB Conference 1996 (pp. 318-329).

Theodoratos, D., & Tsois, A. (2003, May). Processing OLAP queries in hierarchi-cally clustered databases. Data & Knowledge Engineering, 45(2), 205-224.

TransBase HyperCube® Relational Database System. (2005). Retrieved May 29, 2006, from http://www.transaction.de

6 Karayannidis, Tsois, & Sellis

Tsois, A. (2005). Optimization of on-line analytical processing systems: Concep-tual data modeling and query processing techniques. Unpublished doctoral dissertation, National Technical University Of Athens.

Tsois, A., & Sellis, T. (2003). The generalized pre-grouping transformation: Ag-gregate-query optimization in the presence of dependencies (Tech. Rep. No.

TR-2003-4). Retrieved May 29, 2006, from http://www.dbnet.ece.ntua.gr/

pubs/uploads/TR-2003-4.pdf

Weber, R., Schek, H.-J., & Blott, S. (1998). A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. VLDB, 194-205.

Wu, M. C., & Buchmann, A. P. (1998). Encoded bitmap indexing for data ware-houses. ICDE, 220-230.

Yan, W. P., & Larson, P.-A. (1995). Eager aggregation and lazy aggregation. VLDB, 345-357.

Endnote

1 Naturally, the only way to favor more than one hierarchy (per dimension) in clustering is to maintain redundant copies of the cube (Sarawagi & Stonebraker, 1994), or to treat different hierarchy paths as separate dimensions (Markl, 1999). The latter results in an increase of the cube dimensionality, rendering clustering even more difficult (Weber, 1998).

Bitmap Indices for Data Warehouses

Chapter.VII

Bitmap.Indices.for.

In document Data Warehouses and OLAP (Page 178-184)