• No results found

2.3 Integration systems

2.3.1 Database replication

Database replication means that several copies of relations or relation frag- ments are stored at multiple physically distributed locations [RG03]. The aims of database replication are increased availability of data in case of node or communication link failures and faster query evaluation due to higher data locality.

Replication schemes can be classified along two dimensions, namely the prop- agation strategy and the ownership strategy [GHOS96]. The propagation strat- egy can either be eager or lazy, also referred to as synchronous or asynchronous replication, respectively. Eager update propagation means that any update is applied to each associated replica as part of the modifying transaction. Hence, the update propagation involves a distributed transaction across multiple repli- cas. The overhead and limited scalability of distributed transaction processing makes eager or synchronous replication undesirable or even unachievable in many situations. Gray et al. show that the time to wait for locks and the probability of deadlocks grow cubically with the number of nodes in an eager replication system [GHOS96].

Asynchronous or lazy replication abandons distributed transactions and ap- plies updates to each replica in local transaction instead. Lazy replication achieves better scalability and is most often used in practice. However, the drawback is that different replicas of the same object may have different states at the same point in time.

There are two flavors of lazy replication depending on the ownership strat- egy used. In master or primary site replication, one replica of an object is designated the master replica. Only the master replica can be updated while all the others are read-only. Once the master replica has been updated the changes are propagated to other replicas. The alternative ownership strategy is referred to as group or peer-to-peer replication and allows all replicas to be updated concurrently. It is thus possible for two transactions to update the

Figure 2.5: IBM DB2 SQL Replication overview [DB2a]

same object at different replicas and race each other to install their updates at the others. Such conflicts must be detected by the replication mechanism and reconciled so that updates are not lost.

Many commercial database replication offerings are available. As an ex- ample, we describe a replication solution provided by IBM DB2 called SQL replication [DB2a]. The overall architecture is depicted in Figure 2.5. DB2 SQL replication follows a lazy master replication scheme. It uses two programs referred to as Capture and Apply running on the source system and one or more target systems, respectively. The Capture program parses DB2 recovery logs and extracts committed changes to relevant source tables. These changes are written to a so-called change data (CD) table. It is possible to register just a subset of table columns for replication. In this case the Capture program ignores changes that affect unregistered columns only. The Apply program reads out the changes from the CD table and applies them to the replicated relations.

A subscription set associates source and target tables for replication and defines a mapping from source column names to target column names. A single source table can be associated with multiple target tables. For each source- target pair in a subscription set a predicate can be specified to select tuples for replication based on attribute values. Furthermore a scheduling strategy for data replication can be specified as part of a subscription set. DB2 SQL replication supports three alternative strategies. First, data replication may be performed at regular intervals. Second, it may be performed continuously,

i.e. as frequently as permitted by the current workload and the available system resource. Third, replication can be triggered by certain events signaled by an application or an user.

To sum up, we characterize database replication based on the desirable prop- erties of data integration systems introduced in Section 2.2.3. Database repli- cation techniques allow for physical distribution of source systems, however, technical distribution is often not supported. Replication solutions used to be vendor-specific and assume a homogeneous system environment. This has changed with the advent of a new class of systems we refer to as advanced

replication systems, which will be discussed in the next section.

Database replication requires the source systems to be cooperative. Taking IBM SQL replication as an example, the source systems are required to switch to archive logging, run a capture program instance, and provide change data tables and further control tables. The source systems are thus forced to give up some of their autonomy.

The data transformation capabilities of traditional database replication are very limited. Typically replicated tables are one-to-one copies of their respec- tive source tables. However, IBM SQL replication allows to restrict replicated tables to a subset of the source columns. Furthermore, replicated tables may be restricted to a subset of source tuples based on a given filter predicate. These subset operations correspond to relational projection and selection, re- spectively, which characterizes the transformation capabilities.

The last desirable property for data integration is the ability to incrementally maintain an integrated dataset. Obviously, database replication supports the incremental maintenance of replicated tables.