State of the Art and Open Issues - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

The need for schema matching in numerous applications and the inherent difficulty of the task have led to the development of many techniques and prototypes to semi-automatically solve the match problem. These approaches either address the problem for specific applications [7, 8, 9, 16, 20, 23, 34, 35, 40, 42, 49, 53, 54, 62, 71, 84, 97, 107, 112, 134, 143] or solve it in a more generic way for different applications and schema languages [15, 88, 93, 96]. Recent surveys of match and related approaches are given in [119, 70, 31, 36]. The large body of research indicates the high potential of techniques and algorithms that can be exploited for schema matching. However, we observe a number of issues that previous work has not addressed, or has addressed only insufficiently, and that thus require further investigation:

Expressiveness of Modern Schema Languages

Modern schema languages, e.g., W3C XSD and the new object-relational SQL versions (SQL:1999, SQL:2003), support many advanced modeling capabilities, such as user-defined types and classes, aggregation and generalization, component reuse, distributed schemas, and namespaces, which significantly complicate schema matching [120]. In most cases, these design styles can be used interchangeably to model the same or similar real-world concepts, leading to different, yet semantically similar, structures in the schemas. Thus, matching schemas that take advantage of such powerful languages becomes a challenging task that essentially depends on detecting and unifying the alternative modeling styles.

On the other hand, current match systems focus only on structurally simple schemas w.r.t. nesting levels, data types, constraints, and shared schema components. In particular, early schema languages such as DTD and SQL:1992 are mostly required as the input format. The traditional database notion of a schema is typically assumed, where all instances can be described by a single monolithic schema. Likewise, most approaches assume tree-like schemas and ignore shared elements, such as the same complex types or sub-structures used at multiple places to capture the same kind of information (e.g., address data). Such shared elements may appear many times in a schema with context-dependent semantics, and these occurrences need to be differentiated for a correct matching. The treatment of shared elements still requires further work in order to avoid an explosion of the search space.
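To make the notion of context-dependent shared elements concrete, the following minimal sketch enumerates all root-to-element paths of a small schema graph. The schema, its element names, and the function `enumerate_paths` are hypothetical illustrations and are not taken from any of the cited systems; the sketch also assumes an acyclic schema graph.

```python
from typing import Dict, List

# Hypothetical schema graph: each element maps to its child elements.
# "Address" is a shared component reused under two different parents.
SCHEMA: Dict[str, List[str]] = {
    "PurchaseOrder": ["ShipTo", "BillTo"],
    "ShipTo": ["Address"],
    "BillTo": ["Address"],
    "Address": ["Street", "City"],
    "Street": [],
    "City": [],
}


def enumerate_paths(schema: Dict[str, List[str]], root: str) -> List[str]:
    """Return every root-to-element path of an acyclic schema graph.

    A shared element yields one path per context in which it occurs.
    """
    paths: List[str] = []
    stack: List[List[str]] = [[root]]
    while stack:
        path = stack.pop()
        paths.append(".".join(path))
        # Push children in reverse so they are visited in declared order.
        for child in reversed(schema[path[-1]]):
            stack.append(path + [child])
    return paths


paths = enumerate_paths(SCHEMA, "PurchaseOrder")
city_contexts = [p for p in paths if p.endswith(".City")]
print(len(SCHEMA), "elements,", len(paths), "paths")
print(city_contexts)
```

In this example the six declared elements unfold into nine paths, and `City` occurs once as part of a shipping address and once as part of a billing address. Differentiating the occurrences by path is one way to capture their context, at the price of a search space that grows with every reuse of a shared component.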

Dealing with Large Schemas

Real-world schemas are constantly growing in both size and complexity in order to cope with the requirements for representing and managing data in the corresponding applications. For example, the standard schemas for E-business messages developed by OpenTrans and xCBL contain several independent parts, or subschemas, for individual transaction/message types, each of which in turn consists of up to thousands of elements in order to capture every detail of the messages. Furthermore, the schemas often use shared elements to avoid unnecessarily diverse specifications and to keep schema complexity low for easier maintenance. In the end, the match operation needs to examine a huge search space to find plausible correspondences, a major challenge that requires very efficient approaches.

On the other hand, we observe that current match approaches are typically applied to test schemas for which they can automatically determine most correspondences. As surveyed in [31], most test schemas were small, with 50-100 elements. Unfortunately, the effectiveness of the automatic match techniques studied so far typically decreases for larger schemas [29]. In particular, it is likely that large portions of one or both input schemas have no matching counterparts. Thus, matching complete input schemas may lead not only to long execution times, but also to poor quality due to the large search space. Moreover, it is difficult to present the match result to a human engineer in a way that she can easily validate and correct it. A more piecemeal approach, e.g., based on the divide-and-conquer philosophy, may be preferable in such cases, with the promise of both better user control and better match performance.
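The effect of a piecemeal approach on the search space can be illustrated with a small sketch. It assumes that both schemas are already divided into fragments (subschemas) and that pairs of related fragments are given, e.g., by the user or by a coarse fragment-level match. All schema contents and function names below are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical schemas, each divided into fragments (subschemas).
S1: Dict[str, List[str]] = {
    "Order": ["orderNo", "orderDate"],
    "Party": ["name", "street", "city"],
}
S2: Dict[str, List[str]] = {
    "PurchaseOrder": ["poNumber", "date"],
    "Supplier": ["companyName", "address"],
    "Invoice": ["invNo", "total"],
}

# Assumption: related fragment pairs are already known.
FRAGMENT_PAIRS: List[Tuple[str, str]] = [
    ("Order", "PurchaseOrder"),
    ("Party", "Supplier"),
]


def full_search_space(s1, s2) -> List[Tuple[str, str]]:
    """All element pairs when the complete schemas are matched."""
    e1 = [e for frag in s1.values() for e in frag]
    e2 = [e for frag in s2.values() for e in frag]
    return list(product(e1, e2))


def fragment_search_space(s1, s2, pairs) -> List[Tuple[str, str]]:
    """Element pairs when only related fragments are matched."""
    return [
        (a, b)
        for f1, f2 in pairs
        for a, b in product(s1[f1], s2[f2])
    ]


print(len(full_search_space(S1, S2)))
print(len(fragment_search_space(S1, S2, FRAGMENT_PAIRS)))
```

Matching the complete schemas compares 5 x 6 = 30 element pairs, whereas matching only the two related fragment pairs compares 2 x 2 + 3 x 2 = 10. The `Invoice` fragment, which has no counterpart, is never examined, and each fragment-level result is small enough to be validated separately.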

Combination of Match Algorithms

To achieve high match accuracy for a large variety of schemas, considering a single criterion (e.g., name matching) is unlikely to be successful. As a consequence, it is necessary to combine and utilize multiple techniques at the same time. For this purpose, previous prototypes have followed either a so-called hybrid or a composite combination of match approaches. So far, the hybrid approach is the most common one, where multiple criteria or properties (e.g., name and data type) are considered within a single algorithm. Typically, these criteria are fixed and utilized in a specific way, for example, concerning the order in which they are evaluated, making it difficult to extend and improve the overall algorithm. By contrast, a composite match approach combines the results of several independently executed match algorithms, which can in turn be hybrid or composite. This allows for high flexibility, as the match algorithms to be executed can be selected based on the match task at hand. Moreover, there are different possibilities for combining the individual match results. We know of only a few systems following such a composite approach, in particular [34, 35, 42]. They are limited to match techniques based on machine learning and employ a specific combination of match results. While promising to improve flexibility and quality over the hybrid approach, composite combination still requires further work in order to fully exploit its potential and to examine how best to combine different matchers.
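The composite idea can be sketched as follows: independently executed matchers each return a similarity in [0, 1], and a separate combination step aggregates their results. The two matchers, the weighted-average aggregation, and the threshold-based selection are assumptions chosen for illustration; they do not reproduce the combination strategy of any of the cited systems.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Sequence, Tuple

Element = Dict[str, str]
Matcher = Callable[[Element, Element], float]


def name_matcher(a: Element, b: Element) -> float:
    """String similarity of the element names (case-insensitive)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a: Element, b: Element) -> float:
    """1.0 if the declared data types are identical, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def composite_match(
    source: Sequence[Element],
    target: Sequence[Element],
    matchers: Sequence[Matcher],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Run every matcher independently, then combine by weighted average."""
    result = []
    for s in source:
        for t in target:
            sims = [m(s, t) for m in matchers]
            combined = sum(w * x for w, x in zip(weights, sims)) / sum(weights)
            if combined >= threshold:
                result.append((s["name"], t["name"], combined))
    return result


source = [{"name": "CustName", "type": "string"}]
target = [
    {"name": "CustomerName", "type": "string"},
    {"name": "Qty", "type": "int"},
]
matches = composite_match(source, target, [name_matcher, type_matcher], [1.0, 1.0])
print([(s, t) for s, t, _ in matches])
```

Because the matchers are passed in as a list, a matcher can be added, removed, or reweighted without touching the others, which is the flexibility that a hybrid algorithm with fixed, interwoven criteria lacks.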

Schema Matching Evaluation

For identifying a solution for a particular match problem, it is important to understand which of the available techniques performs best, i.e., can reduce the manual work required for the match task at hand most effectively. The only way to approach this goal is to demonstrate the quality and practicability of the developed match algorithms in real-world scenarios, or better, to conduct a systematic study using a range of schema matching tasks. Evaluation thus represents an important task in developing a match solution and has also been seriously considered in most previous work.

Unfortunately, the system evaluations reported in the literature so far were done using diverse methodologies, metrics, and data, making it difficult to assess the effectiveness of each single system, let alone to compare the systems with each other. Furthermore, the systems are usually not publicly available, making it virtually impossible to apply them to a common test problem or benchmark in order to obtain a direct quantitative comparison. Hence, it is necessary to establish a common framework for future evaluations, so that they can be documented better, their results become more reproducible, and comparisons between different systems and approaches become easier. This requires a systematic analysis of the factors influencing the quality and performance of a match approach.
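As an illustration of what a common quality metric could look like, the sketch below compares an automatically derived match result against a manually determined reference result. Precision, recall, and F-measure are borrowed from information retrieval; their use here is an assumption of this sketch and is not prescribed by the text above. The example correspondences are hypothetical.

```python
from typing import Iterable, Tuple

Correspondence = Tuple[str, str]


def evaluate(
    found: Iterable[Correspondence],
    reference: Iterable[Correspondence],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) of a match result."""
    found_set, reference_set = set(found), set(reference)
    true_positives = len(found_set & reference_set)
    precision = true_positives / len(found_set) if found_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical reference result with six correspondences.
reference = [
    ("orderNo", "poNumber"), ("orderDate", "date"), ("name", "companyName"),
    ("street", "address"), ("city", "address"), ("total", "amount"),
]
# Hypothetical automatic result: three correct and one false correspondence.
found = [
    ("orderNo", "poNumber"), ("orderDate", "date"),
    ("name", "companyName"), ("city", "date"),
]
precision, recall, f_measure = evaluate(found, reference)
print(precision, recall, round(f_measure, 2))
```

Here three of the four proposed correspondences are correct (precision 0.75), while three of the six reference correspondences were found (recall 0.5). Reporting such values on shared test problems would make results of different systems directly comparable.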

Reuse of Match Results and Data Integration

Reuse aims at exploiting different kinds of auxiliary information, such as (domain-specific or general-purpose) dictionaries, thesauri, ontologies, etc., to solve a match task. This can especially help in cases where schema elements cannot be compared merely using metadata in the schemas or available instance data. Current match prototypes mostly utilize a simple form of reuse at the level of single schema elements, by looking up element correspondences in synonym tables [88, 112, 42] or by using user-specified correspondences to train machine-learning algorithms on instances of a schema element [34, 35]. A further generic approach is to reuse entire previously identified match results [119]. In fact, we observe that new schemas to be matched are often very similar to previously matched schemas. Reusing the existing match results can thus result in significant savings of manual effort. However, the potential of this approach has not yet been studied in current schema matching work.
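One conceivable way to reuse entire match results is to compose them: given a previously confirmed result between schemas S1 and S2 and another between S2 and S3, correspondences between S1 and S3 can be derived by joining on the shared schema S2. The sketch below assumes that match results are plain lists of element pairs; the function `compose` and all element names are hypothetical, and derived correspondences would still need to be validated by the user.

```python
from typing import Dict, List, Tuple

Correspondence = Tuple[str, str]


def compose(
    m12: List[Correspondence],
    m23: List[Correspondence],
) -> List[Correspondence]:
    """Derive S1-S3 correspondences by joining two results on S2."""
    # Index the second match result by its S2 element.
    by_middle: Dict[str, List[str]] = {}
    for middle, right in m23:
        by_middle.setdefault(middle, []).append(right)
    derived = {
        (left, right)
        for left, middle in m12
        for right in by_middle.get(middle, [])
    }
    return sorted(derived)


# Hypothetical, previously confirmed match results.
m12 = [("custName", "customerName"), ("custCity", "city"), ("fax", "faxNo")]
m23 = [("customerName", "buyer"), ("city", "town")]

print(compose(m12, m23))
```

The element `fax` has a counterpart in S2 but none in S3, so no correspondence is derived for it. Composition can therefore only propose candidates, which may be combined with the results of other matchers.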

The idea of reusing previous match results can be further generalized to cover mappings between different kinds of objects, which may be at both the metadata and the instance level. This is motivated, on the one hand, by the fact that applications often employ generic schemas and store heterogeneous information, possibly mixing both metadata and instance data, in a few generic tables [11, 1]. On the other hand, we observe that semantic correspondences between objects of different types are available in many domains, such as bioinformatics [51, 45, 50] and peer-to-peer data management [72]. Such correspondences represent valuable domain knowledge and can be reused to inter-relate objects of interest and to integrate object information from different sources.

Graphical User Interface for Match

Given the fact that no fully automatic solution is possible, a user-friendly interface is essential for the practicability and effectiveness of a match system. On the one hand, the match process should be performed interactively, so that the domain knowledge of the user can be actively incorporated in identifying corresponding schema elements. On the other hand, the effort required for interactions, such as configuration of the match operation and verification and correction of automatically derived match results, should be reduced to a minimum so that it is still affordable compared to manually solving the match task from scratch.

Unfortunately, most prototypes developed so far focus on selected research aspects and offer no user interface or only a rudimentary one. The only system that we know of providing a comprehensive graphical user interface is CLIO [66, 105, 114, 61], a commercial tool developed at IBM. However, CLIO focuses on the mapping discovery task to obtain queries for transforming instances between two schemas. Hence, many GUI capabilities have not yet been studied, such as customizing the match operation, visualizing and handling large schemas and match results, and manipulating and evaluating match results.
