Query.Processing - Data Warehouses and OLAP

Nikos Karayannidis

National Technical University of Athens, Greece Aris Tsois

National Technical University of Athens, Greece Timos Sellis

National Technical University of Athens, Greece

Abstract

Star queries are the most prevalent kind of queries in data warehousing, online analytical processing (OLAP), and business intelligence applications. Thus, there is an imperative need for efficiently processing star queries. To this end, a new class of fact table organizations has emerged that exploits path-based surrogate keys in order to hierarchically cluster the fact table data of a star schema. In the context of these new organizations, star query processing changes radically. In this chap-ter, we present a complete abstract processing plan that captures all the necessary steps in evaluating such queries over hierarchically clustered fact tables. Further-more, we realize the abstract operations in terms of physical operations over the CUBE File data structure. Finally we discuss star query optimization issues over the presented abstract plan.

Advanced Ad Hoc Star Query Processing

Introduction.

Star queries are the most prevalent kind of queries in data warehousing, online analytical processing (OLAP), and business intelligence applications. Star queries impose restrictions on the dimension tables that are used for selecting specific facts from the fact table; these facts are further grouped and aggregated accord-ing to the user demands. Furthermore, advanced decision support calls for ad hoc analysis, in contrast to using predefined reports that are constructed periodically, or have already been precomputed. The foundation for this kind of analysis is the support of ad hoc star queries, which comprise the real essence of OLAP. Efficient processing of ad hoc star queries is a very difficult task considering, on one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and on the other, the fact that no a priori knowledge for the query exists and thus no precomputation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access directly the base data in an efficient way.

Traditionally, the major bottleneck in evaluating star queries has been the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star join). To cope with this problem various indexing schemes have been developed (Chan & Ioannidis, 1998; O’Neil & Grafe, 1995; O’Neil & the tuples. A typical star join is transformed then into a multidimensional range query, which is very efficiently computed using the underlying multidimensional data structures. The combination of the two: hierarchical clustering of data and a multidimensional structure for accessing the fact table tuples results in a very ef-ficient method for ad hoc star query processing.

8 Karayannidis, Tsois, & Sellis

In this chapter, we discuss the processing of ad hoc star queries over hierarchically clustered fact tables. In particular, we present a complete abstract processing plan that covers all the necessary steps for answering such queries. This plan directly exploits the benefits of hierarchically clustered fact tables and opens the road for new optimization challenges. Then we proceed in realizing this abstract plan for the case of a multidimensional storage structure that achieves hierarchical clustering, namely the CUBE File. We continue with a discussion on star query optimization for the presented abstract plan and present the hierarchical pregrouping transfor-mation. This is a very elegant transformation that exploits dimension hierarchy semantics to speed up query processing significantly. Finally, we conclude with a discussion on main conclusions of the presented methods, and future trends in star query processing.

Background.

Preliminary.Concepts..

In a relational OLAP (ROLAP) implementation, a dimension is stored into one or more dimension tables, each having a set of attributes. Dimension attributes usually form one or more classification hierarchies. For example, the h₁ attribute is classified by the h₂ attribute, which is further classified by the h₃ attribute, and so forth. We call the attributes h₁, h₂, h₃, … hierarchical attributes because they participate in the definition of the hierarchy. For example, day, month, and year can be a hierarchical classification in the DATE dimension. For the purposes of this chapter we will as-sume a single hierarchy for each dimension.¹ A dimension table may also contain one or more feature attributes f. Feature attributes contain additional information about a number of hierarchical attributes and are always functionally dependent on one (or more) hierarchical attribute. For example, population could be a feature at-tribute dependent on the region attribute of dimension LOCATION.

Measures (or facts) are stored in fact tables. A fact table may contain one or more measure attributes and is always linked (by foreign key attributes) to some dimen-sion tables. This logical organization consisting of a central table (the fact table) and surrounding tables (the dimension tables) that link to it through 1:N relationships is known as the star schema (Chaudhuri & Dayal, 1997). In a typical scenario, the hierarchical attribute representing the most detailed level will be the primary key of the respective dimension. Each such attribute will have a corresponding foreign key in the fact table.

In Figure 1(a) we depict an example schema of a simplified data warehouse. The data warehouse stores sales transactions recorded per item, store, customer, and

Advanced Ad Hoc Star Query Processing

date. It contains one fact table SALES_FACT, which is defined over the dimensions:

PRODUCT, CUSTOMER, DATE, and LOCATION with the obvious meanings. The single measure of SALES_FACT is sales representing the sales value for an item bought by a customer at a store on a specific day. The dimension hierarchies are depicted in Figure 1(b).

The dimension DATE is organized in three levels: Day-Month-Year. Hence, it has three hierarchical attributes (Day, Month, Year). The PRODUCT dimension is or-ganized into three levels (Item-Class-Category) with three hierarchical attributes and one feature attribute (Brand). The dimension CUSTOMER is organized in only two levels (Customer-Profession) with two hierarchical attributes and two feature attributes (Name, Address). The LOCATION dimension is organized into three lev-els: store-area-region, meaning that stores are grouped into geographical areas and the areas are grouped into regions. For each area, the population is stored as feature attribute. Therefore, the dimension has three hierarchical attributes (Store_id, Area, Region) and one feature attribute (Population) that is assigned to the Area level.

Note that the key attributes of the dimension tables are Customer_id, Item_id, Store_id, and Day and the corresponding foreign keys (Customer_id, Product_id, Store_id, Day) define the fact table’s primary key.

In order to create a fact table that is clustered according to the dimension hierar-chies we first need to apply a hierarchical encoding (HE) on each dimension table.

To achieve this we use the hierarchical surrogate key (or h-surrogate) attribute, a special attribute which is defined for each dimension table. The value of this attri-bute is computed based on the value of the hierarchical attributes of the dimension.

The h-surrogate encodes not only the values of the hierarchical attributes but also the hierarchical relationships defined among the levels of the dimension. Although there are several equivalent ways to define such an encoding, it is sufficient to pres-ent only one such technique for the reader to understand how h-surrogates are used.

All the query processing and optimization techniques presented in this chapter work regardless of the particular encoding technique used.

Customer_d (FK) Product_d (FK) Store_id (FK) Day (FK) NameAddress CUSTOMER

DayMonth year AreaRegion Populaton

Figure 1. (a) The schema of the data warehouse; (b) the dimension hierarchies of the example

40 Karayannidis, Tsois, & Sellis

Definition 1 (Hierarchical Surrogate Key).

Assume a dimension table D containing the hierarchical attributes h_m, h_m-1, …, h₁ (h_m the most aggregated level and h₁ the most detailed one). Each tuple t of D assigns the values t.v_m, t.v_m-1, …, t.v₁ to the corresponding hierarchical attributes. Let {oc_i} be a set of m bijective functions so that oc_i: V_i  {0,…,|Vi|-1}, where 1 ≤ i ≤ m and V_i is the set of values of the hierarchical attribute h_i and |V_i| is the total number of values in this set. The hierarchical surrogate key (or h-surrogate) of D is a computed attribute of D so that for each tuple t of D the value of the h-surrogate is hsk=oc_m(t.

v_m). oc_m-1(t.v_m-1). …. oc₁(t. v₁). 

The value assigned by a oc_i function to a hierarchical attribute value is called an order-code and since these functions are 1-1 the order-codes uniquely identify a hierarchical attribute value. In Figure 2(b) we depict the hierarchy-tree formed from the values of the hierarchy attributes (equivalently levels) of dimension LOCATION.

Below each value appears in parentheses its assigned order-code. In the same figure we depict the h-surrogate value for the leaf-value “storeC.” Note that an h-surro-gate conveys all the hierarchical semantics (i.e., the genealogy) of a specific value.

Moreover it is indeed an alternate key, since it determines all hierarchical attributes, which in turn functionally determine all feature attributes. Note also that leaf-values under the same parent have a common h-surrogate prefix. For example the prefix

“0.1” in Figure 2(b) is the same for the two stores in “AreaB” and the prefix “0.” is common to all stores in “RegionA.” We use the notation hsk:L to refer to the prefix of the h-surrogate that corresponds to the level L of the hierarchy.

Figure 2. (a) The schema of the data warehouse enhanced with h-surrogates; (b) each value of a hierarchical attribute is assigned an order-code, which preserves hierarchical proximity.

Note: An h-surrogate is essentially a path of order-codes in the hierarchy tree.

Customer_d (FK) Product_d (FK) Store_id (FK) Day (FK) ______________

Cust_hsk (FK) Prod_hsk (FK) Loc_hsk (FK) Date_hsk (FK) ______________

sales SALES_FACT Customer_d

Profession NameAddress ______________

Cust_hsk CUSTOMER

DayMonth Year AreaRegion Populaton

(0) AreaB

()

storeA (0) storeB

() storeC () storeD

() hsk(storeC)= 0..2

Advanced Ad Hoc Star Query Processing 4

Each h-surrogate can trigger the creation of a foreign key in the fact table. The concatenation of all these foreign keys produces the fact table’s primary key. This is depicted in Figure 2(a). Note that for the fact table we have two alternative com-posite keys: (a) (Customer_id, Product_id, Store_id, Day) that links to the corre-sponding lowest hierarchical attribute of each dimension and (b) (Cust_hsk, Prod_

hsk, Loc_hsk , Date_ hsk) that links to the corresponding h-surrogate attribute. Note that the former is not necessary to achieve hierarchical clustering of the data, and thus could be omitted, in order to reduce storage overhead.

The h-surrogates play a central role in processing ad hoc star queries, first because they enable the clustering of the fact table according to the dimension hierarchies, and second because they can be exploited to optimize the query evaluation plans.

Experiments in Karayannidis et al. (2002) have shown speed-ups up to a factor of 20, over the state of the art bitmap-based star join processing (see section on other methods for star query processing). Even more interestingly, this factor doubled when query optimization techniques that are discussed later in this chapter were exploited.

The h-surrogates should be system assigned and maintained attributes, and typically they should be made transparent to the user. The actual implementation of the h-sur-rogates depends heavily on the underlying physical organization of the fact table.

Proposals for physical organizations (Karayannidis et al., 2004; Markl et al., 1999) exploit such path-based surrogate keys, in order to achieve hierarchical clustering of the fact table data.

In this chapter, we adopt a denormalized approach for the design of a dimension;

that is, we represent each dimension with only one table. The hierarchical attributes (h₁, h₂, …,h_m), the feature attributes (f₁, f₂, …, f_k ), as well as the hierarchical sur-rogate key hsk of the dimension are stored in a unique dimension table. However, the presented methods are fully applicable to normalized schemata (i.e., snowflaked schemata) as well, with the only difference that extra joins between the several di-mension tables (corresponding to separate hierarchy levels) must be included in the plan. In addition, we assume a special physical organization for the fact table. The fact table is stored hierarchically clustered in a multidimensional data structure such as the CUBE File (Karayannidis et al., 2004) or the UB-tree (Markl et al., 1999).

The index attributes of these structures are the h-surrogates.

Star.Queries

OLAP queries typically include restrictions on multiple dimension tables that trig-ger restrictions (via the foreign key relationships) on the (usually very large) fact table. This is known as a star join (O’Neil & Grafe, 1995). We use the term star query to refer to flat SQL queries, defined over a single star schema, that include a star join. Star queries represent the majority of OLAP queries. In particular, we are

42 Karayannidis, Tsois, & Sellis

interested in ad hoc OLAP star queries. With the term “ad hoc” we refer to queries that are not known in advance and therefore the administrator cannot optimize the DBMS specifically for these.

In Figure 3 we depict the SQL query template for the ad hoc star queries considered.

The template defines the most complex query structure supported and uses abstract terms that act as placeholders. Note that queries conforming to this template have a structure that is a subset of the template in Figure 3 and instantiate all abstract terms.

The terms D₁, D₂, …, D_k are the dimension tables of the star join and LP₁, LP₂, .., LP_k are the corresponding local predicates. Thus, the term LP_i is a local predicate on the dimension table D_i. The characterization “local” is because this predicate includes restrictions only on D_i and not on other dimension tables or the fact table.

This predicate is very important for the h-surrogate processing phase explained later, and is used to produce the necessary h-surrogate specification accessing the fact table discussed later.

The vast majority of OLAP queries contains an equality restriction on a number of hierarchical attributes and more commonly on hierarchical attributes that form a complete path in the hierarchy (i.e., starting from the most aggregated level to some lower level without “gaps” in between). For example, the query “show me sales for area A in region B for each month of 1999” contains two whole-path restrictions, one for a dimension LOCATION and one for a DATE: (a) LOCATION.Region = ‘A’

AND LOCATION.Area = ‘B’ and (b) DATE.Year = 1999. This is reasonable since the core of analysis is conducted along the hierarchies. We call this kind of restric-tion hierarchical prefix path (HPP) restrictions. Note also that even if we impose a restriction solely on an intermediate level hierarchical attribute, we can still have an HPP restriction, as long as hierarchical attributes functionally determine higher level ones. For example, the restriction Month = ‘AUG-99’ implies also that Year

= 1999.

Let us now define an example query on the schema of Figure 1: We want to see the sum of sales by area and month for areas with population more than 1 million, for

SELECT <grouping attributes and/or aggregation functions>

FROM <fact table>, D, D, …, D_k

WHERE <star join conditions: equalities on key-f.key> AND LP AND LP AND … AND LP_k AND

GROUP BY <grouping attributes>

HAVING <group selection predicate>

ORDER BY <sorting attributes>

Figure 3. The ad hoc star query template

Advanced Ad Hoc Star Query Processing 4

the months of the year 1999 and for products that belong to the category “air con-dition.” Figure 4 shows the corresponding SQL expression of this query. One can easily see that the query is an instance of the query template of Figure 3.

Star. Query. Processing.

In document Data Warehouses and OLAP (Page 163-170)