• No results found

Properties of data integration systems

2.2 Materialized data integration

2.2.3 Properties of data integration systems

In this section, we introduce desirable properties of data integration systems. The set of properties will be used in the subsequent sections to classify existing integration systems and point out their strength and weaknesses.

In the beginning of this chapter, we discussed the challenges of data inte- gration. Obviously, an integration system should provide the infrastructure to overcome these challenges, at least to some extent. In our opinion, integration systems can well be characterized by four key properties, namely the degree of tolerable source distribution, the degree of tolerable source autonomy, the data transformation capabilities provided, and the ability to incrementally main- tain integrated data. These properties are described in greater detail in the following.

Degree of source distribution An integration system should at least be able to resolve physical data distribution, because data to be integrated is virtually always hosted by (geographically) distributed source systems. It is furthermore desirable to resolve technical distribution, i.e. integrate data that is hosted by systems of different kinds. Integration systems may allow for varying degrees of technical distribution. Some integration systems require their sources to be relational database systems, possibly from different vendors. Others are more flexible and allow for other kinds of systems, such as packaged applications or flat files for instance, to act as data sources too. Obviously, the stronger the assumptions about sources are, the more restricted the applicability of the integration system will be.

Degree of source autonomy It is a desirable property of an integration sys- tem to tolerate a large degree of source autonomy. Some integration systems tamper with the interface autonomy of source systems by imposing strong re- quirements. An integration system may require the sources to provide a specific

query interface or language such as SQL and tolerate some additional work- load, it may require the sources to participate in dedicated communication protocols, or it may demand for specific features such as change data capture mechanisms. Furthermore, design autonomy may be tampered with by enforc- ing schema changes at the source systems or additional constraints to improve local data quality, for instance.

To meet the integration system’s requirements, source systems may need to be adapted. However, such adaptations are often painful in practice. The reasons are twofold. First, system owners are usually eager to keep their au- tonomy and reluctant to change their systems. Changing a mature and proven system always entails the risk of introducing bugs and other problems. Second, system changes may be prohibitively complex. Legacy system, for instance, require significant re-engineering to be adapted to new requirements. Even worse, closed source packaged applications cannot be changed at all.

To sum up, an integration system should impose as few requirements as possible on the source systems and, at the same time, be able to exploit any interfaces and features currently provided by the system. That is, an ideal integration system should “make the most” of whatever the source systems are able and willing to offer.

Data transformation capabilities The transformation phase of materialized data integration has been discussed in Section 2.2.2. Recall that the trans- formation phase can roughly be divided into schema integration and instance integration. Schema integration is mainly about resolving structural hetero- geneity, i.e. restructuring source data to match the target schema of the in- tegration system. For this purpose, transformation capabilities comparable to the relational algebra (with aggregation) or SQL are sufficient. In fact, there are integration systems that use SQL to specify data transformations [Rau05]. Other systems use proprietary transformation languages. However, as recent studies suggest, these languages have a relational core [DHW+08, WCP09].

SQL-like transformation capabilities fall short of instance integration. Re- call, that instance integration can be divided into data standardizing, cleans- ing, and duplication elimination. Data standardization can be performed using simple string manipulation and format conversion functions. However, special- purpose transformation engines have been built for data cleansing and dupli- cation elimination that feature large domain-specific rule sets and ontologies. Only recently, integration system vendors started to integrate such engines into their products.

As said, the data transformation phase involves multiple tasks. However, integration systems may support only subsets of these tasks or may not fully support certain tasks. Details will be provided in Section 2.3. An ideal in-

tegration system should obviously provide the transformation capabilities to perform an “exhaustive” transformation.

Incremental maintenance In materialized data integration, data is extracted from the sources, integrated, and physically stored in the integration system. We will refer to the data content of the integration system as integrated dataset. Two phases of materialized data integration can be distinguished. A newly created integration system is initially empty; in the first phase, data is extracted exhaustively from the sources and loaded into the integration system. This phase is referred to as initial load.

When source data changes over time, the integrated dataset gets stale and needs to be maintained to regain consistency. Thus, maintaining the integrated dataset is the second phase of materialized data integration. While the first phase is performed only once, the second phase is an ongoing process. There are two fundamental approaches to maintain the integrated dataset such that it reflects the current state of the sources.

ˆ The first approach is referred to as full recomputation. The idea is to exhaustively extract and integrate the source data in much the same way as it was done at the initial loading phase. We will refer to the resulting dataset as re-integrated dataset. To perform a full recomputation, the integrated dataset can be dropped and replaced with the re-integrated dataset. However, any historical data, which is no longer available at the sources, will be lost. To keep historical data in the integration system, which is a common requirement in data warehousing, the re-integrated dataset needs to be compared to the current integrated dataset to deter- mine any data changes. These changes are then stored in the integration system, without deleting or overwriting any previous contents.

ˆ Assuming that only a small fraction of source data is changed during loading cycles, a full recomputation is rather inefficient. To improve the efficiency, the repeated extraction and integration of unchanged source data should be avoided. To this end, changes are captured at the sources and used to directly determine the induced changes in the integrated dataset. This approach is referred to as incremental recomputation or incremental maintenance. A full recomputation may be done in much the same way as the initial computation, i.e. the transformation logic can be re-used. However, this cannot be done for recomputing incrementally. In fact, the integration logic needs to be adapted for change propagation, making it more complex in general.

In summary, support for incremental maintenance is a desirable property of an integration system. However, this property is the harder to achieve, the

bigger the system’s data transformation capabilities are.