2 1 Query Optimization
2.1. QUERY OPTIMIZATION
cost-efficient plans to trade off the efficiency of query optimization itself. These op- timizers utilize a rule-based query rewrite transformation module before carrying out cost-based optimization. Query rewrite translates the original query graph into
a semantically equivalent query graph using expansion rules [68], and rewrite rules
include, for example, logically equivalent algebraic expressions subject to commu- tativity or associativity of algebraic operators. Next, logical operators are extended into physical operators to enlarge the search space in order for the system to find the most efficient implementation in an extensible manner. For example, a join operator can use either a nested loop or a sort-merge implementation, and a scan operator can use either an index scan or a sequential table scan. There are also physical operators that may not correspond to any logical operator, e.g., a sort operator is
a physical operator. These are often referred to as physical properties. Indeed, one
of the most important design feature of System R is the consideration ofinteresting
orders, that is, the combination of logical operator and physical properties form the characteristics of a candidate plan in the search space.
• Cost estimation: The cost model used by a query optimizer assigns an estimated
cost to each partial or complete plan in the search space. Usually a cost model is a linear combination of intermediate result sizes (cardinalities) weighed by factors such as CPU cost, I/O cost, and so forth. Cost models for distributed plans also take into account the communication costs for transferring data. The cost model usually
relies on: 1) a set of statistics maintained on relations and indexes, e.g., number of
data pages in a relation, number of distinct values in a column, possibly in data
structures like histograms; 2) formulas to estimate selectivity of predicates and to
project the size of the output data stream for every operator node, for example, the size of the output of a join is estimated by taking the product of the sizes of the two relations and then applying the joint selectivity of all applicable predicates;
and 3) formulas to estimate the CPU and I/O costs of query execution for every
operator. These formulas take into account the statistical properties of its input data streams, existing access methods over the input data streams, and any available interesting orders on the data stream (e.g., if a data stream is ordered, then the cost
22 CHAPTER2. BACKGROUND
of a sort-merge join on that stream may be significantly reduced). The cost model then uses these statistics and formulas to compute the following information in a
bottom-up fashion for operators in a plan:1) the size of the data stream (cardinality)
represented by the output of the operator node; 2) any ordering of tuples created
by the output data stream of the operator node; and3) estimated execution cost for
the operator (and the cumulative costs of the partial plan).
• Enumeration/Pruning algorithm: An enumeration algorithm in a bottom-up style
query optimizer is analogous to backward-chaining enumeration in the AI litera-
tures. Plans are enumeratedbottom-up in a stratified way. This stratification starts
with enumerating leaf-level access plans, and then enumerates all intermediate plans for combining two of these table access plans, and then three plans, and so forth, until all the plans have been enumerated that produce the overall query
results for the original query. These optimizers usedynamic programming to prune
the search space, based on the principle of optimality. That is, in order to obtain an
optimal plan for the original query, it suffices to consider only the optimal plans for subexpressions. As the enumeration algorithm is performed in a bottom-up order, we can safely discard suboptimal plans for the subexpressions. The dynamic pro-
gramming approach reduces the search space fromO(n!)toO(n2n−1)wherenis the
number of binary joins in the plan, yet the search problem still has an exponential upper bound.
Top-down style optimizers: top-down search with branch-and-bound pruning [38,39].
We then review top-down style optimization frameworks (e.g., Volcano, Cascades, Ex-
odus) [39] that feature top-down goal-directed search with branch-and-bound pruning.
Note that here we only review high-level designs and their specific differences to the pre- vious bottom-up style optimizers. We discuss them in three angles: search space, cost estimation and enumeration/pruning algorithms.
• Search space: These top-down optimizers universally use rules to represent the
knowledge of the search space. Two kinds of rules are used [38, 39]: the trans-
2.1. QUERY OPTIMIZATION 23
rules map an algebraic expression into an operator tree. In comparison to rewriting
rules in bottom-up-style optimizers, 1) these systems do not use two distinct op-
timization phases because all transformations are algebraic and cost-based; 2) the
mapping from algebraic to physical operators occurs in a single step;3) most impor-
tantly, instead of applying rules in aforward chaining fashion (e.g., in the Starburst
query rewrite phase [47]), Volcano-style optimizers performgoal-driven application
of rules. In regards to linear plans or bushy plans, Volcano-style optimizers consider both flexibly. They consider physical properties as well.
• Cost estimation: Similar to bottom-up style optimizers, cost estimations of top-
down style optimizers have to be performed in a bottom-up fashion, because the cumulative cost of the plan depends on the costs of sub-plans and their statistics too.
• Enumeration/Pruning algorithm: Top-down style optimizers achieve dynamic pro-
gramming in a top-down fashion through memoization. When presented with an
optimization task, they check whether the query expression has already been ac- complished by looking up its logical and physical properties in the table of plans (memoization table) that have been optimized in the past. Otherwise, it applies a logical transformation rule, an implementation rule, or uses an enforcer to modify
properties of the data stream. In addition, it usesbranch-and-boundingto prune the
search space, that is, if a sub-plan exceeds the (loose) bound for its representing subexpression (which is normally derived from parents and sibling expressions), it can be safely pruned early and its subtrees do not have to be enumerated. The effectiveness of pruning depends on the top-down search order, and on how fast it reaches the best plan (and hence can use this to prune others effectively). However, pruning is embedded in enumeration and hence has to be in a specific top-down order as well.
Both bottom-up and top-down optimizers achieve the goal of guaranteeing the dis- covery of the optimal plan with respect to a cost model. The quality of the plan does not depend on which of these search methods is used, but rather on the transformation
24 CHAPTER2. BACKGROUND
rules available for generating plans, as well as the correctness of cost models. Top-down optimizers have an advantage, in that they can prune the search space early with branch- and-bounding, but the effectiveness of pruning depends on the search order, and on how fast it reaches the best plan for a given workload. Hence, a certain sub-plan may be visited many times, although the cost will only be computed once. Bottom-up optimizers prune
the search space through dynamic programming (which is pioneered by System R [86]),
hence they can discard suboptimal plans, yet they lose the flexibility of branch-and-bound pruning because of backward-chaining application of rules.