• No results found

Matching with Re-aggregation

3.4 View Matching using Workarea Functions

3.4.3 Matching with Re-aggregation

The second category is that the matching between the GROUPBY boxes of the query and the view requires re-aggregation compensation. This may happen when either the group-by columns of the query and the view do not match or the query has some rejoin with some additional tables that do not exist in the view. The row multiplicity may be affected due to such rejoins and hence re-aggregation is necessary.

Re-aggregation Compensation without Rejoin

We first consider the case when the query does not contain join with the tables that do not exist in the view, i.e., no rejoin in the query. More specif-ically, the general matching rules of this category are that:

• The sub-graphs of the query and view match with SELECT-only com-pensation (which does not contains join with other tables, i.e., no re-join).

• The group-by columns in the query and view GROUPBY boxes do not match exactly. In particular the view GROUPBY box must contain more group-by columns than that in the query box.

In this case, we have to re-aggregate over the view to the same granular-ity of the query. We note that direct re-aggregation over derived functions, such as variance, is not defined in Table 3.2. The reason is that the computa-tion of the derived funccomputa-tion is completely dependent on its corresponding workarea functions. Hence we propose to instead re-aggregate over the corresponding workarea functions.

Such re-aggregation, equivalently speaking, can be viewed as comput-ing the workarea function under multiple insertions. More specifically, as-sume one workarea function F and a set of its input values (w1, w2, ...wn).

Re-aggregating over wi can be considered as multiple insertions of the in-put dataset from which each wi computes. Hence, we can naturally define the re-aggregate function fFagg as an iterative application of fF+.

We follow a similar three-step for aggregate function definition as in [GBLP96, WZ00], namely, (init, iter, f inal), to define fFagg. Here the init step computes the result when there is only one data input. The iter step defines how to compute the new result based on the previous result and the current data input. The final step defines how to compute the final result once all the input data are consumed.

Based on this formulation, our re-aggregate function fFagg can be eas-ily defined as (F, fF+, ). That is, initially, the result is simply the input workarea function value. After that, we can apply fF+ to incrementally compute the new workarea function value. We do not need any final step, since after each iterative step we get the new workarea function value.

We can extend the workarea function table (3.1) to also contain an entry for fFaggas shown in Table 3.3. Note that user need not populate this entry at all since such re-aggregate function can be automatically derived.

Workarea Func F Aux. Func fF+ fF fFagg(init,iter,final) Count(x) - fCount+ ... fCount ... (n, fCount+ , ) Sum(x) - fSum+ ... fSum ... (s, fSum+ , )

NVar(x) Count,

Sum

fN V ar+ ... fN V ar ... (nv, fN V ar+ , )

NCov(x,y) Count,

Sumx, Sumy

fN Cov+ ... fN Cov ... (nc, fN Cov+ , )

... ... ... ... ...

Table 3.3: Extending Workarea Function Table with Re-aggregate Functions

Figure 3.5 depicts an example for how to use such re-aggregate func-tions to answer queries. The group-by columns in the view GROUPBY box are o custkey and year, which are more than that in the query GROUPBY box. Hence we need re-aggregation compensation. Here Wagg consists of total six re-aggregate functions, one for each workarea function. To answer regr icpt in the query, we first re-aggregate over the workarea in the view.

After that, we compute the regr icpt from the resulting workarea using the scalar function f in Table 3.2. Those two compensation operations are

GROUP BY

SELECT

GROUP BY

SELECT

Base orders Base lineitem

match

o_custkey, f(WA) as slp SELECT

Rewritten Query:

SELECT o_custkey, f(Wagg(WA)) as slp FROM MV

GROUP BY o_custkey

Query

MV o_custkey,Wagg(WA) as WA GROUP BY

o_custkey, l_year regr_intercept(l_extendedprice,

l_quantity) as icpt, WIcpt(l_extendedprice,

l_quantity) as WA

group by o_custkey, l_year

o_custkey, l_year l_extendedprice, l_quantity

l_orderkey = o_orderkey l_orderkey = o_orderkey

o_custkey, l_extendedprice, l_quantity

group by o_custkey o_custkey, regr_slope(l_extendedprice,

l_quantity) as slp

Base orders Base lineitem

with re-agg compensation

Figure 3.5: Re-aggregation Compensation without Rejoin

shown in the two dotted boxes in Figure 3.5.

Re-aggregation Compensation with Rejoin

Now let us consider the case when the matching of the GROUPBY boxes includes rejoin compensation from sub-graphs. In that case, we need a re-aggregation compensation unless the rejoin is an equi-join on key and for-eign key and the group-by columns include the join key.

The matching compensation for such GROUPBY boxes would be first to pull up the compensation from the lower boxes, i.e., do a join between the

view and the extra base table (as well as other residual predicates). Then we re-aggregate over the join results.

l_orderkey = o_orderkey and c_custkey = o_custkey o_custkey, o_nationkey l_extendedprice, l_quantity group by c_custkey, c_nationkey c_custkey, c_nationkey W(l_extendedprice, l_quantity) as WA

group by o_custkey f(Wagg(WA)) as slope_qtyonprice FROM MV, Customer

WHERE c_custkey = o_custkey GROUP BY c_custkey, c_nationkey

Query Base customer

Base lineitem Base orders Base customer

Base customer

c_custkey, c_nationkey c_custkey, c_nationkey,

Wagg(WA) as WA SELECT

SELECT

Figure 3.6: Re-aggregation Compensation with Rejoin

Figure 3.6 depicts such an example. Here the two SELECT boxes of the query and view do not match. The compensation includes a rejoin between the view and the table Customer. When the algorithm goes up to match the two GROUPBY boxes, it will first pull up the compensation from lower box, namely, evaluate a join between the view and Customer. It then re-aggregates over the workarea function and computes the final regr slope.

Note that in this particular example, if the rejoin with Customer is on key and foreign key (i.e., the cardinality is not changed), we can avoid such re-aggregation altogether.