The literature has previously established that the Classic algorithm has a time complexity of $O(3^n)$ and a space complexity of $O(2^n)$ for queries over $n$ base relations in a centralized setting. It has further been shown that moving this algorithm to a distributed setting (with $s$ possible evaluation sites) requires a time complexity of $O(s^3 \cdot 3^n)$ and a space complexity of $O(s \cdot 2^n + s^3)$ [45]. In this section, we formally establish the time and space complexities of the Optimal algorithm.

For each of the proofs in this section, we assume that all sites in the system host copies of all tables and are capable of evaluating any operation in a query plan. To avoid presenting spurious constants, we further assume that only one physical scan operator and only one physical join operator are available to the optimizer, and that no interesting sort orders can be exploited during optimization. All of these assumptions are also made in [45]. We add only the assumption that no query plan violates any user-specified required constraint. Pruning requirement-violating plans can reduce the time and space that our algorithms need in practice, so assuming that no such pruning occurs lets us define the worst-case time and space complexities of the Optimal algorithm.

Theorem 8. The time complexity of the Optimal algorithm for queries over $n$ base relations with $r$ required constraints, $p$ preferred constraints, and $c$ SQL conditions (e.g., selection, projection, join conditions) that can be evaluated by $s$ potential servers is $O(2^{c^2+3c} \cdot s^{c+3} \cdot c \cdot 3^n \cdot (r + p))$.

Proof. The domination metric used in PRUNEPLANS_opt has been rewritten such that a plan cannot dominate another if the two are to be evaluated at different sites. Because, in the worst case, all $s$ sites host a replica of all tables involved in the query, each entry in optPlan must correspond to at least $s$ plans in the worst case: $s$ different scan plans for each relation and $s$ different join plans for each set of base relations. (Due to our assumptions, there cannot be multiple plans in the same entry that differ only by sort order or physical operator.) The domination metric was also changed such that no plan can dominate another if they evaluate different sets of conditions. This, combined with accounting for evaluation at different sites, requires each entry in optPlan to correspond to, in the worst case, $s \cdot 2^c$ different plans. Each of the $s$ different sites must be considered, and later joins must be able to draw upon a plan upholding any combination of SQL conditions at any possible site.

The increased size of optPlan further entails that, for each call to JOINPLANS_PAQO, all $s \cdot 2^c$ plans in optPlan[O] must be combined with all $s \cdot 2^c$ plans from optPlan[P/O], with all $s$ sites considered for evaluation of the new root join and all $2^c$ possible combinations of SQL conditions applied to the join. This increases the runtime complexity by a factor of $(s \cdot 2^c)^3$, or $2^{3c} \cdot s^3$.
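The counting argument for a single join call can be checked by brute force for small inputs. The following is a minimal sketch, not the paper's implementation: the function name `count_join_combinations` and the representation of a plan as a `(site, condition_bitmask)` pair are assumptions made purely for illustration.

```python
from itertools import product


def count_join_combinations(s: int, c: int) -> int:
    """Count (left plan, right plan, join site, join condition set) tuples.

    Hypothetical model: a stored plan is identified only by its evaluation
    site and by the subset of the c conditions it applies (as a bitmask).
    """
    sites = range(s)
    condition_sets = range(2 ** c)  # every subset of c conditions
    # Worst case: s * 2^c plans stored in each optPlan entry.
    child_plans = list(product(sites, condition_sets))

    total = 0
    for _left in child_plans:
        for _right in child_plans:
            for _join_site in sites:
                for _join_conditions in condition_sets:
                    total += 1
    return total
```

For any small `s` and `c`, the count equals `(s * 2**c) ** 3`, matching the factor derived above.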

The only other additional runtime cost incurred by the Optimal algorithm is that required to call the EXTEND function. In the worst case, EXTEND is called to extend a node with no SQL conditions applied, so all $c$ conditions are passed to EXTEND. Hence, we can upper-bound the cost of every call to EXTEND by the cost of a call that takes in $c$ conditions. Inside the EXTEND function, two nested loops are run: one over all possible evaluation sites ($O(s)$) and one over every set in the powerset of all conditions passed to EXTEND ($O(2^c)$). Inside the innermost loop, a new query plan is created that applies the current set from the powerset of conditions. All leftover conditions are then passed to a recursive call to EXTEND. At this point, at least one condition must have been applied to the plan created in EXTEND, and hence we can upper-bound the number of conditions passed to the recursive call to EXTEND by $c - 1$. We can therefore represent the runtime of EXTEND by the following recurrence relation: $T(c) = s \cdot 2^c \cdot T(c - 1) + \Theta(1)$.
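The loop structure described above can be sketched as follows. This is a hypothetical reconstruction from the prose only, not the authors' EXTEND: the function names, the representation of a plan as a tuple of `(site, condition_subset)` steps, and the choice to enumerate only non-empty subsets (so that each recursive call receives strictly fewer conditions) are all assumptions.

```python
from itertools import combinations


def nonempty_subsets(conditions):
    """Yield every non-empty subset of the given conditions as a frozenset."""
    items = sorted(conditions)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)


def extend(plan, remaining, sites):
    """Enumerate plans that apply the remaining conditions on top of `plan`.

    plan      -- tuple of (site, condition_subset) steps applied so far
    remaining -- frozenset of conditions not yet applied
    sites     -- iterable of candidate evaluation sites
    """
    new_plans = []
    for site in sites:                               # outer loop: O(s)
        for subset in nonempty_subsets(remaining):   # inner loop: O(2^c)
            extended = plan + ((site, subset),)      # constant work per plan
            new_plans.append(extended)
            # Recurse on the leftover conditions (at most c - 1 of them).
            new_plans.extend(extend(extended, remaining - subset, sites))
    return new_plans
```

Under these assumptions, `extend((), frozenset({"a", "b"}), [0, 1])` enumerates 14 plans for two sites and two conditions. The recurrence in the text over-counts this on purpose, since it charges every recursive call as if it received $c - 1$ conditions.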

To solve this recurrence relation, we consider the recursion tree that can be used to represent it. This tree has height $c$, as each recursive call decrements $c$ by 1. A recursive call at level $i$ of the tree makes $s \cdot 2^{c-i}$ recursive calls itself, in addition to doing a constant amount of work (creating a new query plan and adding it to the list of plans to be pruned). Hence, we can determine the runtime of the EXTEND function as $T(c) = \sum_{i=0}^{c} (s \cdot 2^{c-i})^i$, which we can bound as $T(c) < c \cdot s^c \cdot 2^{c^2}$.
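As a sanity check rather than a proof, the sum and its bound can be evaluated numerically for small parameters. The helper names `extend_sum` and `extend_bound` below are hypothetical and simply transcribe the two expressions from the text.

```python
def extend_sum(s: int, c: int) -> int:
    """Evaluate sum_{i=0}^{c} (s * 2^(c - i))^i, the sum stated in the text."""
    return sum((s * 2 ** (c - i)) ** i for i in range(c + 1))


def extend_bound(s: int, c: int) -> int:
    """Evaluate the stated upper bound c * s^c * 2^(c^2)."""
    return c * s ** c * 2 ** (c * c)


if __name__ == "__main__":
    # Print both quantities for a few small configurations with s >= 2.
    for s in range(2, 5):
        for c in range(1, 6):
            print(s, c, extend_sum(s, c), extend_bound(s, c))
```

For example, with $s = 2$ and $c = 2$ the sum evaluates to 9 while the bound evaluates to 128, which illustrates how loose the bound is even for small inputs.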

As per our assumptions, no plans can be pruned due to requirement violation, and hence such pruning has no effect on the time complexity. Our use of relative plan preference likewise has no effect on the time complexity of our algorithm: using relative preference as part of the domination metric only changes which plans are stored in optPlan; it cannot affect the number of plans stored for each entry in optPlan. Constraint adherence must be checked for each plan that is realized, however, and that does have a slight effect on the runtime complexity. For each plan realized, it must be determined whether it violates any of the $r$ required constraints or any of the $p$ preferred constraints. This leads to the addition of the $(r + p)$ term in our runtime.

With this, we can bound the runtime of the Optimal algorithm as $O(c \cdot s^c \cdot 2^{c^2} \cdot 2^{3c} \cdot s^3 \cdot 3^n \cdot (r + p))$ or, equivalently, $O(2^{c^2+3c} \cdot s^{c+3} \cdot c \cdot 3^n \cdot (r + p))$.

The space complexity for our optimal algorithms follows rather simply from the previous proofs.

Theorem 9. The space complexity of the Optimal algorithm for queries over $n$ base relations with $r$ required constraints, $p$ preferred constraints, and $c$ SQL conditions (e.g., selection, projection, join conditions) that can be evaluated on $s$ potential servers is $O((2^c \cdot s \cdot 2^n + 2^{c^2+3c} \cdot s^{c+3} \cdot c) \cdot (r + p))$.

Proof. The Optimal algorithm must, for each entry in optPlan, store $2^c \cdot s$ times as many plans as the classic dynamic programming algorithm. Further, additional overhead is incurred in the JOINPLANS_opt function to iterate through all $2^c \cdot s$ left child plans, $2^c \cdot s$ right child plans, $2^c$ sets of conditions, $c \cdot s^c \cdot 2^{c^2}$ extended plans, and $s$ sites. Finally, the Optimal algorithm must save, for each plan, a record of which constraints it upholds or violates. This state grows in size according to the number of constraints specified for the query being optimized, specifically $(r + p)$.
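The way these three contributions combine into the stated bound can be written out term by term. The following is a worked sketch that uses only the quantities named in the proof; the grouping labels are added for readability.

```latex
\begin{align*}
\text{stored plans} &= \underbrace{2^{c} \cdot s}_{\text{per-entry factor}} \cdot \underbrace{2^{n}}_{\text{classic space}} \\
\text{join overhead} &= \underbrace{(2^{c} \cdot s)}_{\text{left plans}}
  \cdot \underbrace{(2^{c} \cdot s)}_{\text{right plans}}
  \cdot \underbrace{2^{c}}_{\text{condition sets}}
  \cdot \underbrace{(c \cdot s^{c} \cdot 2^{c^{2}})}_{\text{extended plans}}
  \cdot \underbrace{s}_{\text{sites}} \\
  &= 2^{c^{2}+3c} \cdot s^{c+3} \cdot c \\
\text{total} &= \bigl(2^{c} \cdot s \cdot 2^{n} + 2^{c^{2}+3c} \cdot s^{c+3} \cdot c\bigr) \cdot (r + p)
\end{align*}
```

The final factor $(r + p)$ accounts for the per-plan record of upheld and violated constraints.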