3.5 A Practical Case: The TPC-H
3.5.2 Data Source Expressiveness
The TPC-H relational schema is well-formed and captures a fair picture of the business domain. For example, foreign keys are used to identify semantic relationships between attributes in dif- ferent tables, and the schema is in 3NF.
Importantly, note that MDBE is an interleaved hybrid approach, which analyzes the end-user and the data sources simultaneously, but the requirement analysis leads the process. Conse- quently, the quality of the sources required in our approach is considerably lower than in pre- vious approaches working from relational sources. Indeed, as discussed in Section 2.1, current approaches demand that the source relational schemas capture the functional dependencies (i.e., to-one relationships) existing in the domain. This kind of relationships, typically represented at the logical level by means of foreign and candidate key constraints, are crucial to identify the multidimensional concepts and specially, the dimensional concepts. For this reason, the quality of the output obtained by current approaches decreases drastically for relational sources between a denormalized schema and a logical schema in 3NF. On the contrary, MDBE is able to produce high-quality results, even from denormalized sources, by means of two key features:
• MDBE exploits the candidate - foreign key knowledge captured in the data sources. Nev- ertheless, if this information is missing, we are able to extract it from the requirements (in case it is relevant for the final result). For example, consider the TPC-H relational schema introduced in Figure 3.2, and the TCP-H #5 query used as example all over the chapter. In Q5, the c custkey = o custkey logic clause involves two concepts re- lated by means of a primary - foreign key relationship and thus, it can be exploited by most of the current methods. However, if we discard the candidate - foreign key relation- ships in the schema, none of these approaches would be able to exploit it anymore. Rel- evantly, this does not affect MDBE, since we consider requirements. Indeed, SQL query joins represent concept associations explicitly stated by the user and thus, we are able to exploit them even if the attributes joined are not explicitly related in the sources. Con- sequently, the multidimensional knowledge inferred for attribute involved in joins in the WHERE clause is automatically propagated to its counterpart (see Section 3.1.2 for further details). Specifically, the orders node is initially labeled as a level in Step 4 (see Section 3.4.1) and therefore, its attributes involved in the query are identified as dimensional con- cepts (i.e., o orderdate, o custkey and o orderkey). Thus, by means of the c custkey = o custkey join, we propagate the knowledge inferred for o custkey to c custkey: i.e., it is also identified as a dimensional concept. Interestingly, later, Step 6 proposed orders to play a Cell role as well. This alternative is not prune in the third stage of the algorithm (see Section 3.4.3) and eventually, MDBE produces two re- sults for the Q5 query (see Section 3.4.2). However, even in this case, o custkey and c custkey will still play a dimensional role: in this scenario, o custkey would have not been identified as a dimensional concept, but customer is labeled as level and ac- cordingly, c custkey as dimensional concept. Consequently, o custkey is identified
as a dimensional concept by means of the association in the WHERE clause. This is sound, since MDBE is identifying o custkey as orders link attribute (see Section 3.3.2) and thus, it is part of the dimensional concepts forming the multidimensional space.
• Furthermore, MDBE smooths the impact of denormalization on the output produced. Con- sider now a unique relation capturing the whole TPC-H relational schema (i.e., the TPC-H universal relation); i.e.:
TPC-H(lineitem attrs, orders attrs, partsupp attrs, part attrs, supplier attrs, customer attrs, nation attrs, region attrs)
where lineitem attrs refers to the whole set of attributes in the lineitem relation, and similarly for the rest. Functional dependencies would not be extracted from such a relation, and current approaches (i) would not be able to identify any dimensional concept or (ii) they would produce loads of meaningless results. On the contrary, MDBE is able to identify dimensional concepts from such a relation by means of the end-user requirements (see Section 3.1.2 for further details). For example, consider the TPC-H Q5 business query over the universal relation introduced above:
Select nation name, sum(lineitem extendedprice * (1 - lineitem discount) as revenue FROM TPC-H
WHERE region name = ’[REGION]’ and orders orderdate >= ’[DATE]’ and orders orderdate < ’[DATE]’ + ’1’ year
GROUP BY nation name ORDER BY revenue desc;
In this case, the multidimensional graph would be compound of just one node (i.e., TPC-H), which would be labeled as CDM, since nation name, region name and orders-
orderdate would be identified as dimensional concepts (see Steps 2 and 4 in Section 3.4.1), and lineitem extendedprice and lineitem discount as measures (see Step 3 in Section 3.4.2). However, regarding dimensional data identified from this kind of relations, MDBE cannot automatically generate the dimension hierarchies, since require- ments provide additional knowledge about the role played by each attribute but, under no circumstances, knowledge about the missing to-one relationships (i.e., functional depen- dencies) is provided. For example, considering just the dimensional concepts identified for Q5, and according to the requirements stated in the TPC-H benchmark, we should manually form the place dimension (in which nation name can be aggregated into region name) and the order date dimension. We can only overcome this draw- back by mining the instances, but mining the instances is computationally expensive (see [JHP04], which already proposes to mine the instances to identify functional dependen- cies) and can be unfeasible for large databases. Finally, note that, although MDBE does not generate the dimension hierarchies from denormalized sources, it does identify the di- mensional concepts and therefore, they do not have to be derived from scratch, but from the set of dimensional concepts identified for the fact (i.e., we do not shape dimensions by exploiting all the to-one relationships in the schema, but only those between concepts identified as dimensional concepts).