Principles of magic sets - Magic ETL optimization

4.4 Magic ETL optimization

4.4.1 Principles of magic sets

The Magic Sets algorithm has been proposed for the eﬃcient evaluation of recursive queries in the context of deductive databases [BMSU86, Ull89]. Deduc- tive databases typically use a query language called Datalog, which is inspired by the logic programming language Prolog. As compared to SQL, recursive queries2 can more naturally be expressed in Datalog as illustrated by the following sample query.

sl (X, X).

sl (X, Y ) :−mngr(X, X1), sl(X1, Y 1), mngr(Y, Y 1).

The sample Datalog query consists of two rules. Datalog rules can be divided into a head and a body placed on the left-hand side and the right-hand side of the “:−” symbol, respectively. Intuitively, a rule is interpreted as follows: If the tuples mentioned in the body exist in the database, the tuples mentioned in the head must also be in the database, i.e. the body of a rule implies its head. If the body is empty, as is the case in the ﬁrst sample rule, the head

X Y Bob Anna Guy Anna Carol Bob Dana Bob Helen Guy Ian Guy Eric Dana Fran Dana Anna Bob Carol Dana Eric Fran Guy Helen Ian

mngr

Figure 4.8: Sample management hierarchy

always evaluates to true, i.e. the tuples mentioned in the head must be in the database.

The intention of the sample Datalog query is to ﬁnd all employees in a management hierarchy that are at the same level as a given employee. The query refers to two relations, namely mngr and sl . It is assumed that the former relation is stored in the database and that mngr (X, Y ) means that X is managed by Y as shown in Figure 4.8. The latter relation sl is inferred from

mngr using the above rules.

The ﬁrst rule says that any employee is on the same level with himself. The second rule says that any two employees X and Y are at the same level, if their respective managers X1 and Y 1 are at the same level. In Datalog, a query to ﬁnd all employees at the same level as “Carol” is formulated as sl (Carol , X ). There are two obvious ways to evaluate such a query.

First, the query may be evaluated top-down. This strategy is referred to as backward-chaining and implemented in Prolog. Starting from “Carol”, her manager is visited first. Then, recursively, all managers at the same level are considered to find all employees managed by them. During the recursive processes, the same employee may be visited more than once through different paths in the hierarchy. The flaw with the backward-chaining approach is that it discovers “all derivation paths” rather than “all answers”.

Second, the query may be evaluated bottom-up. This strategy is referred to as forward-chaining. Initially, only the tuples in the database (i.e. in the mngr relation) are considered and any query parameters (i.e. “Carol”) are ignored. The idea of forward-chaining is to fully build the sl relation and select the tuples

matching the query hereafter. The sl relation is initially empty and build step- by-step in several passes over the query rules. On the ﬁrst pass, each employee is found to be at the same level with himself. On the second pass, any two employees with a common manager are found to be at the same level. On the third pass, any two employees having managers with a common manager are found to be at the same level, and so on. After the process terminates, the sl relation is restricted to those tuples having “Carol” as their ﬁrst component to answer the query. Note that many tuples generated by forward-chaining may not contribute to the query result.

Whether backward-chaining or forward-chaining is more eﬃcient depends entirely on the given base data. However, neither method is very good in general, because backward-chaining may repeatedly compute the same answer tuple and forward-chaining may compute tuples that do not contribute to the answer in any way. The Magic Sets method has been proposed as a more eﬃcient alternative [BMSU86, Ull89].

The Magic Sets method evaluates queries bottom-up. In contrast to forward- chaining, however, the computation of irrelevant tuples that do not contribute to the answer to the query is generally avoided. The Magic Sets method can be thought of as a generalized selection pushing technique. It allows for pushing the selection, which is the ﬁnal step of forward-chaining, into recursive queries. For this purpose, the query rules are rewritten and so-called Magic Sets are added that act as ﬁlters on the set of tuples generated by each rule.

Reconsider the sample Datalog query. We are interested in finding all tu- ples in sl with the first field equal to “Carol”. However, to find all such tu- ples, other tuples in sl have to be computed as well. For example, the tuple

sl (Carol , Ian), which is part of the query answer, cannot be computed unless

the tuple sl (Bob, Guy ) has been computed, which is not part of the query an- swer. Thus, we cannot simply restrict the query evaluation to those tuples that directly contribute to the answer but need to consider those tuples that indirectly contribute to the answer, too. Intuitively, these are tuples whose ﬁrst ﬁeld contains a value on the path from “Carol” to the top of the hierarchy such as sl (Bob, Guy ), for instance. These tuples are computed by the following Datalog rules and referred to as a Magic Set.

magic(Carol).

magic(U ) :−magic(V ), mngr(V, U).

Any tuple outside the Magic Set, such as sl (Dana, Ian) for instance, cannot be used to compute tuples that directly or indirectly contribute to the query answer. Computing such tuples is thus wasteful and should be avoided. To do so, the Magic Set can be used in the original rules to prevent the computation

of irrelevant tuples. The Magic Set-rewritten rules look like this.

sl (X, X) :−magic(X).

sl (X, Y ) :−magic(X), mngr(X, X1), sl(X1, Y 1), mngr(Y, Y 1).

It can be shown that the rewritten query yields the same result as the original one, but may work more eﬃciently. The Magic Sets rewriting can be done automatically; the rewrite algorithm is presented in [BMSU86, Ull89].

In document Incremental Recomputations in Materialized Data Integration (Page 100-103)