2.4 Federated Query Processing
2.4.1 Query Optimization
The objective of the query optimizer is to find a good QEP. This involves two basic steps. First, enumerating alternative plans, typically only a subset of all possible QEPs is considered due to the very large number of possible plans. Second, es- timating the cost of the enumerated plans and choosing the plan with the lowest cost [38]. The query optimization can be done at two different times relative to the
query execution: (i) statically before the execution of the query, and (ii) dynamically during the execution [35]. Since static query optimization is done before the query execution, the sizes of intermediate results are unknown and have to be estimated. Errors in these estimations can lead to a suboptimal QEP. Following this approach, it is possible to cache the QEP for a query in order to save the optimization time on subsequent requests for the same query. Dynamic query optimization is progressed at query execution time. Therefore, at any point of the execution the decision on which operator to use next can be based on the actual size of the result previously produced. The main advantage of dynamic query optimization is that the actual sizes of intermediate results are available and, therefore, minimizing the probabil- ity of creating a bad plan. However, the shortcoming of this approach is that the expensive task of query optimization has to be repeated for each execution of the query.
The most popular algorithm to enumerate the plans is dynamic programming (DP). [42] contains an overview of alternative enumeration algorithms. DP creates very good plans if the cost model is accurate enough. The downside of this approach is the exponential time and space complexity and, therefore, it is not feasible for complex queries. An adaptive extension called iterative dynamic programming is proposed in [25] to overcome this disadvantage. It produces plans as good as DP for simple queries and good plans for queries too complex for DP. However, itera- tive dynamic programming is not suitable for very complex queries with dozens of relations. There are also greedy algorithms, but following the nature of the greedy paradigm they might get caught in a local optimum. [31] proposes a hypergraph- based approach to query optimization. Each sub-query will be handled as a node in the hypergraph. Nodes that share a join variable are merged into a new hypernode. Those hypernodes represent joins of sub-queries.
DP works in a bottom-up manner, that is starting with sub-plans that only involve one table. The best access plan(s) are considered for the next step. Then all two-way join plans using the access plans are evaluated. This includes evaluating different implementations of the join based on the query. Since using an implementation that returns an ordered result may be the better alternative if the data is needed to be sorted for a later operation. Afterwards, all three-way join plans are computed using the plans from the previous steps. If the query involves n tables, the algorithm continues in this fashion until all n-way join plans are enumerated. Inferior plans are pruned as early as possible. A plan can be pruned if there is an alternative plan that does at least the same amount of work with lower costs. When an inferior plan is pruned it is no longer be considered by the following steps of the plan generation. Thus, the complexity is significantly reduced.
(a) Left-Deep Tree (b) Bushy Tree
Figure 2.7: Example of Left-Deep and Bushy Trees
The search space can be further reduced if bushy trees are not considered but left-deep trees only. Figure 2.7 shows examples of left-deep and bushy trees. Bushy trees allow for more parallelism in query execution, e.g., the join of A and B can be executed in parallel with the join of C and D. For left-deep trees the inner operand of a join is always a base relation, ensuring indexes can be used. Relational optimizers consider left-deep plans only [38].
The classical cost model estimates the cost of every operator of the plan and then sums up the individual costs. The cost of a plan is the total resource consumption. In a centralized system this will include CPU costs and disk I/O. In a distributed system also communication cost must be considered. The response time model which does not estimate the resource consumption but the response time. It takes intraquery parallelism into account and finds the plan with the lowest response time.
One of the challenges of query optimization when dealing with heterogeneity in a federated database system is that the capabilities of the sources may differ from each other [24]. Several approaches to this problem have been proposed like describing the capabilities as views or context-free grammars. DP accesses this problem by using planning functions provided by the wrappers to enumerate the plans. Another challenge is the cost estimation since it is not necessarily known how the wrapper executes the plan [24]. The calibration approach uses a generic cost model for all wrappers and adjusts the parameters of this model based on the results of test queries. An alternative would be individual wrapper cost models. Using this approach the developers of the wrapper also provide cost formulas to calculate the cost of the generated plans. The advantage of this approach is that the cost estimation can be as accurate as possible. However, the downside of this approach is that wrapper developers are left with the complicated task of creating cost formulas. The learning curve approach keeps statistics of the execution costs of queries and estimates the cost of a plan based on these statistics. This approach releases the wrapper developers from the burden of cost estimation, but it can be very inaccurate.