• No results found

2 1 Query Optimization

2.1. QUERY OPTIMIZATION

cost-efficient plans to trade off the efficiency of query optimization itself. These op- timizers utilize a rule-based query rewrite transformation module before carrying out cost-based optimization. Query rewrite translates the original query graph into

a semantically equivalent query graph using expansion rules [68], and rewrite rules

include, for example, logically equivalent algebraic expressions subject to commu- tativity or associativity of algebraic operators. Next, logical operators are extended into physical operators to enlarge the search space in order for the system to find the most efficient implementation in an extensible manner. For example, a join operator can use either a nested loop or a sort-merge implementation, and a scan operator can use either an index scan or a sequential table scan. There are also physical operators that may not correspond to any logical operator, e.g., a sort operator is

a physical operator. These are often referred to as physical properties. Indeed, one

of the most important design feature of System R is the consideration ofinteresting

orders, that is, the combination of logical operator and physical properties form the characteristics of a candidate plan in the search space.

• Cost estimation: The cost model used by a query optimizer assigns an estimated

cost to each partial or complete plan in the search space. Usually a cost model is a linear combination of intermediate result sizes (cardinalities) weighed by factors such as CPU cost, I/O cost, and so forth. Cost models for distributed plans also take into account the communication costs for transferring data. The cost model usually

relies on: 1) a set of statistics maintained on relations and indexes, e.g., number of

data pages in a relation, number of distinct values in a column, possibly in data

structures like histograms; 2) formulas to estimate selectivity of predicates and to

project the size of the output data stream for every operator node, for example, the size of the output of a join is estimated by taking the product of the sizes of the two relations and then applying the joint selectivity of all applicable predicates;

and 3) formulas to estimate the CPU and I/O costs of query execution for every

operator. These formulas take into account the statistical properties of its input data streams, existing access methods over the input data streams, and any available interesting orders on the data stream (e.g., if a data stream is ordered, then the cost

22 CHAPTER2. BACKGROUND

of a sort-merge join on that stream may be significantly reduced). The cost model then uses these statistics and formulas to compute the following information in a

bottom-up fashion for operators in a plan:1) the size of the data stream (cardinality)

represented by the output of the operator node; 2) any ordering of tuples created

by the output data stream of the operator node; and3) estimated execution cost for

the operator (and the cumulative costs of the partial plan).

• Enumeration/Pruning algorithm: An enumeration algorithm in a bottom-up style

query optimizer is analogous to backward-chaining enumeration in the AI litera-

tures. Plans are enumeratedbottom-up in a stratified way. This stratification starts

with enumerating leaf-level access plans, and then enumerates all intermediate plans for combining two of these table access plans, and then three plans, and so forth, until all the plans have been enumerated that produce the overall query

results for the original query. These optimizers usedynamic programming to prune

the search space, based on the principle of optimality. That is, in order to obtain an

optimal plan for the original query, it suffices to consider only the optimal plans for subexpressions. As the enumeration algorithm is performed in a bottom-up order, we can safely discard suboptimal plans for the subexpressions. The dynamic pro-

gramming approach reduces the search space fromO(n!)toO(n2n−1)wherenis the

number of binary joins in the plan, yet the search problem still has an exponential upper bound.

Top-down style optimizers: top-down search with branch-and-bound pruning [38,39].

We then review top-down style optimization frameworks (e.g., Volcano, Cascades, Ex-

odus) [39] that feature top-down goal-directed search with branch-and-bound pruning.

Note that here we only review high-level designs and their specific differences to the pre- vious bottom-up style optimizers. We discuss them in three angles: search space, cost estimation and enumeration/pruning algorithms.

• Search space: These top-down optimizers universally use rules to represent the

knowledge of the search space. Two kinds of rules are used [38, 39]: the trans-

2.1. QUERY OPTIMIZATION 23

rules map an algebraic expression into an operator tree. In comparison to rewriting

rules in bottom-up-style optimizers, 1) these systems do not use two distinct op-

timization phases because all transformations are algebraic and cost-based; 2) the

mapping from algebraic to physical operators occurs in a single step;3) most impor-

tantly, instead of applying rules in aforward chaining fashion (e.g., in the Starburst

query rewrite phase [47]), Volcano-style optimizers performgoal-driven application

of rules. In regards to linear plans or bushy plans, Volcano-style optimizers consider both flexibly. They consider physical properties as well.

• Cost estimation: Similar to bottom-up style optimizers, cost estimations of top-

down style optimizers have to be performed in a bottom-up fashion, because the cumulative cost of the plan depends on the costs of sub-plans and their statistics too.

• Enumeration/Pruning algorithm: Top-down style optimizers achieve dynamic pro-

gramming in a top-down fashion through memoization. When presented with an

optimization task, they check whether the query expression has already been ac- complished by looking up its logical and physical properties in the table of plans (memoization table) that have been optimized in the past. Otherwise, it applies a logical transformation rule, an implementation rule, or uses an enforcer to modify

properties of the data stream. In addition, it usesbranch-and-boundingto prune the

search space, that is, if a sub-plan exceeds the (loose) bound for its representing subexpression (which is normally derived from parents and sibling expressions), it can be safely pruned early and its subtrees do not have to be enumerated. The effectiveness of pruning depends on the top-down search order, and on how fast it reaches the best plan (and hence can use this to prune others effectively). However, pruning is embedded in enumeration and hence has to be in a specific top-down order as well.

Both bottom-up and top-down optimizers achieve the goal of guaranteeing the dis- covery of the optimal plan with respect to a cost model. The quality of the plan does not depend on which of these search methods is used, but rather on the transformation

24 CHAPTER2. BACKGROUND

rules available for generating plans, as well as the correctness of cost models. Top-down optimizers have an advantage, in that they can prune the search space early with branch- and-bounding, but the effectiveness of pruning depends on the search order, and on how fast it reaches the best plan for a given workload. Hence, a certain sub-plan may be visited many times, although the cost will only be computed once. Bottom-up optimizers prune

the search space through dynamic programming (which is pioneered by System R [86]),

hence they can discard suboptimal plans, yet they lose the flexibility of branch-and-bound pruning because of backward-chaining application of rules.