Research challenges in incremental ETL processing

The incremental ETL job obtained by algebraic diﬀerencing is correct, i.e. the customer dimension table reaches the exact same state after incremental maintenance that a full recomputation would have produced. A proof of correctness for the set of delta rules is given in [GLT97]. However, as suggested at the beginning of this section there are some ﬂaws in the incremental ETL job that will be addressed in the following chapters of this thesis.

Unlike traditional database views, ETL solutions typically have a strong emphasis on data integration and thus, standardizing heterogeneous data and improving data quality are key tasks. For this purpose, ETL tools provide rich sets of data cleansing operators. These operators perform complex data transformations that are usually computationally expensive.

In Chapter 4 we will show that traditional relational optimization techniques often struggle in the presence of data cleansing operators. Sur- prisingly, the incremental variant derived from the sample ETL job in this section, does not achieve an efficiency improvement compared to a full recomputation. The reason for the sample incremental job to perform badly is that it performs excessive data cleansing. We will propose optimization techniques that effectively reduce the data cleansing effort during incremental recomputations.

In the algebraic view maintenance approach, deltas are modeled as two sets, i.e. the set of inserted tuples and the set of deleted tuples. Updates are modeled implicitly as delete-insert pairs. In particular, it is assumed that the deltas are complete, e.g. it is assumed that both, the new and the old state of updated tuples are known. However, in a warehousing environment, so-called Change Data Capture (CDC) techniques are used to gather deltas at the source systems and an important difference to traditional view maintenance is that these deltas are often incomplete (or partial). This is for two reasons. First, a number of different CDC approaches is used in practice – a survey will be presented in the beginning of Chapter 5 – and some CDC techniques are simply incapable of capturing complete deltas. Second, it is often more efficient to capture deltas only partially instead of completely. In this way, the load on the operational source systems may be decreased.

Standard delta rules proposed for algebraic view maintenance are built on the assumption that deltas are completely available. We will show that these rules fail for partial deltas in Chapter 5 and propose a generalized delta model and adapted delta rules hereafter.

The incremental expressions obtained by the standard delta rules rely on transactional guarantees. Several delta rules, such as the rule for relational joins, require access to both, the deltas and the base data itself. It is assumed that these datasets are accessed in the scope of a transaction and thus synchronized with concurrent updates. Unsynchronized access may cause so-called maintenance anomalies that have the potential to corrupt the materialized view irrecoverably.

In the ETL environment, the operational source systems and the warehouse are distributed. In a distributed environment, maintenance anomalies can be avoided using distributed transactions. However, distributed transactions are most often prohibitively expensive and holding locks on base data for the period of warehouse maintenance is usually unaccept- able. Moreover, the source systems in an ETL environment may be un- sophisticated and unable to participate in distributed transactions. We

discuss maintenance anomalies in the context of data warehousing in Chapter 6 and propose approaches to avoid such anomalies that do not rely on distributed transactions.

We proposed an approach for generating incremental ETL jobs in Chapter 3 that is inspired by algebraic diﬀerencing for incremental view maintenance. As suggested earlier, incremental ETL jobs obtained in the way described in this chapter are not without issues. The optimization of incremental ETL jobs will be discussed in this chapter, which is based on our work published in [BJ10].

As we will show in Section 4.1, incremental jobs obtained by algebraic differencing are often inefficient and fail to outperform a naive full recomputation approach. To fix this issue, we will propose a refined set of delta rules for algebraic differencing tailored to the specifics of ETL in Section 4.2. After a discussion of some remaining shortcomings in Section 4.3, we will show in Section 4.4 that the Magic Sets method, which was originally proposed in the context of recursive logical programs, is effective for the optimization of incremental ETL jobs that include data cleansing. We will conclude the chapter in Section 4.5.

4.1 Incremental recomputation issues in ETL

Throughout this chapter, we will use a running example to illustrate the ETL- speciﬁc optimization techniques proposed in the following. Consider the sample ETL job depicted in Figure 4.1. It is a slight variation of the sample job presented in Chapter 3. In particular, the underlying base tables are taken from the TPC-W benchmark schema used in our experiments.1_.

The sample job extracts data from the TPC-W customer, address, and coun- try tables, which are denoted as C, A, and N (as in nation), respectively, in the sequel. A is joined with N along a foreign-key relationship using a so-called Lookup stage, which is a hash join implementation provided by DataStage. The result data is cleansed hereafter, using an address standardization stage. Another data cleansing stage is used to standardize the formatting of tele- phone numbers in the customer data extracted from C. The cleansed datasets are joined together and ﬁnally loaded into the warehouse customer dimension

1_{The TPC-W benchmark simulates a web-based e-commerce system. Since operational} systems of this kind often serve as data sources for warehouse systems, we chose the TPC-W benchmark schema for our experiments. Another reason for our decision was the availability of a data generation utility.

Anew Standardize Lookup ∆D Cnew Nnew MNS Join

Figure 4.1: ETL job computing a customer dimension table from TPC-W benchmark tables

table. The sample ETL job is represented by the following algebra expression.

D := ΠAddr(A ◃▹ N ) ◃▹ ΠT el(C)

As discussed earlier in Section 2.2.3, there are two fundamental approaches to maintain materialized integrated data such as the customer dimension in the example. It may either be fully recomputed or maintained incrementally. Recall that warehouse tables cannot be simply dropped and rebuilt, because any historical data would be lost, since it is usually no longer available at the sources. To correctly keep the data history, the recomputed data rather needs to be compared to the current warehouse content to determine the set of changes. For this reason, ETL jobs for the initial computation (such as the sample job depicted in Figure 4.1) cannot be reused for full recomputations as they are. In fact, an additional ﬁnal step is required, in which the recomputed data is compared to the warehouse content and the data history is updated. Commercial ETL tools usually provide dedicated operators for this purpose. IBM DataStage, for instance, oﬀers the so-called Slowly Changing Dimension (SCD) stage.

A variant of the initial ETL job that uses a Slowly Changing Dimension stage to perform a full recomputation is shown in Figure 4.2. The SCD stage is added as a final processing stage right before the database adapter. It consumes the output of the original ETL job and, additionally, the warehouse customer dimension table in its current state. These datasets are compared in a process similar to snapshot differentials (see Section 5.1). From the differences found in these datasets, the SCD stage generates a set of SQL insert and update statements to update the dimension table according to the SCD method (see 2.5.1).

Anew Standardize Lookup ∆D Cnew Nnew MNS Join Dold Slowly Changing Dimensions

Figure 4.2: ETL job fully recomputing the customer dimension table

The alternative approach to a full recomputations is an incremental recomputation. An incremental ETL job for this purpose can be obtained by algebraic diﬀerencing as shown in Chapter 3. The implementation of the incremental ETL job is depicted in Figure 4.3.

∆D = [ΠAddr(Anew◃▹ Nnew) ◃▹ (ΠT el(∆C)−ΠT el(Cold)

| {z }

eﬀectiveness test

)]∪

[(ΠAddr(∆A ◃▹ Nnew∪ Anew◃▹ ∆N )−ΠAddr(Aold◃▹ Nold)

| {z }

eﬀectiveness test

) ◃▹

(ΠT el(Cnew))]

Note that the incremental expression above contains two so-called effectiveness tests. Effectiveness tests are generated by the delta rules for relational pro- jection, union, and difference to prevent the propagation of redundant deltas. Redundant means that the exact same delta tuple has been computed in an earlier maintenance cycle and thus, already exists in the warehouse. The effec- tiveness test ensures that each delta tuple in ∆D is truly an insertion, i.e. no such tuple exists in D yet, and that each delta tuple in∇D is truly a deletion, i.e. the exact same tuple does exist in D. Technically, effectiveness test for insertions (deletions) are realized through relational difference with the min- uend being the insert delta tuples (delete delta tuples) and the subtrahend being source tuples from the old database state (the new database state). The difference ensures that those delta tuples that were already derivable from the old database state (are still derivable in the new database state) are eliminated from the set of insertions (deletions).

We did experiments to compare the performance of the naive full recomputation approach and the incremental recomputation approach. We executed

Standardize Cold ∆C Anew Nnew ∆A ∆N Aold Nold Cnew Copy Copy Change Capture Lookup Lookup Lookup Lookup MNS Remove Duplicates Standardize Standardize MNS Change Capture Join Join Funnel Remove Duplicates ∆D MNS Funnel

Figure 4.3: ETL job incrementally recomputing the customer dimension table

either variant of the sample ETL job using IBM DataStage as ETL runtime platform. We performed the experiments on an SMP system equipped with two quad-core processor at 2.0 GHz and 8GB RAM. The test data was generated in a way such that the customer relation and the address relation join losslessly and that addresses are uniformly distributed on customers.

The measurements taken are shown in Figure 4.4. The runtime of the diﬀer- ent ETL job variants is shown on the y-axis. Note that the measurements are normalized as percent of the initial loading time, i.e. the runtime of the ETL job depicted in Figure 4.1. During the experiments, we stepwise increased the percentage of updates at the source tables, which is denoted on the x-axis.

The time required for the initial computation is of course independent of the number of updates at the sources. The full recomputation works much like the initial computation but additionally compares the recomputed dimension to the current dimension to determine the changes and update the dimension

0% 50% 100% 150% 200% 250% 300% 350% 400% 0% 25% 50% 75% 100% No rm al ize d ru nt im e [% ] Base updates [%] Full Recomputation Incr. Recomputation

Figure 4.4: Runtime comparison of full recomputation and incremental recomputation ETL jobs

according to the SCD method. Hence, a full recomputation is more expensive than the initial one. The runtime increases slightly for larger numbers of updates at the sources. This is because more and more updates at the warehouse dimension are required, i.e. the size of the output deltas increases.

Surprisingly, the incremental ETL job, which we obtained using the standard delta rules proposed by Griffin et al., performs worse than the full recomputation approach. It takes almost twice as much time for very small numbers of updates at the sources. Unlike the full recomputation that has an almost constant runtime, the runtime of the incremental recomputation increases no- ticeably with the number of source updates. If the update percentage gets close to a hundred percent, the incremental recomputation takes almost three times as long as the full recomputation. Considering that the whole purpose of incremental recomputation techniques is to improve efficiency, this result seems unexpected. It clearly indicates that the delta rules proposed for the maintenance of materialized views in relational database systems may struggle in the ETL environment. We will explore the reasons in the next section and propose a modified set of delta rules.

In document Incremental Recomputations in Materialized Data Integration (Page 82-89)