Contributions - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

Focusing on the open problems discussed above, the dissertation makes a number of contributions, which can be grouped into the following four areas:

Surveys of Match Approaches and Evaluations

To obtain a better overview of the current state of the art in schema matching, we survey the existing approaches and their evaluations.

• Survey of schema matching algorithms and prototypes: A large body of previous work on schema matching exists in different fields, such as schema translation and integration, knowledge representation, machine learning, and information retrieval. We adopt the taxonomy proposed in [119] and perform a new survey of existing schema matching approaches. In particular, we differentiate and discuss schema- and instance-level, element- and structure-level, language- and constraint-based, and reuse-oriented match approaches. As for the combination of multiple matchers, hybrid and composite approaches are possible. Match approaches may further be distinguished according to the cardinalities of their results. Following the taxonomy, we review various match prototypes published in the literature. We characterize them in some detail and compare them with our own development.

• Survey of schema matching evaluations: Evaluation aims at demonstrating the practicability of a match system under real-world circumstances. We identify and discuss the major criteria that influence the effectiveness of a schema matching approach, such as the chosen test problems, the design of the experiments, the representation of match results, the metrics used to quantify the match quality and the amount of saved manual effort, and the overall execution performance. We use these criteria to review the evaluation of various state-of-the-art systems and motivate the importance of a common framework to make the comparison between different systems and approaches easier.
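The quality metrics mentioned in the second survey item can be illustrated with a short sketch. The text above does not name specific metrics. Precision, recall, and F-measure are the standard information-retrieval measures, and the `overall` value computed here is an assumed formulation of a combined measure that penalizes false positives as a proxy for the manual effort remaining after matching. The function name and the sample correspondences are hypothetical.

```python
def match_quality(found, reference):
    """Compare automatically found correspondences with a manually built reference mapping.

    Both arguments are collections of (source_element, target_element) pairs.
    """
    found, reference = set(found), set(reference)
    true_pos = len(found & reference)        # correctly identified correspondences
    false_pos = len(found - reference)       # wrong proposals the user must remove

    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # Assumed combined measure: equals recall * (2 - 1/precision) when precision > 0.
    # It turns negative once more than half of the proposals are wrong.
    overall = (true_pos - false_pos) / len(reference) if reference else 0.0

    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "overall": overall}


# Hypothetical match result and reference mapping for two small schemas.
found = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
reference = [("a", "x"), ("b", "y"), ("c", "q")]
quality = match_quality(found, reference)
print(quality)
```

In this example half of the proposed correspondences are wrong, so the assumed combined measure drops to zero: removing the wrong proposals and adding the missed one costs about as much as matching by hand.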

Besides helping us to implement and test our own system, our insights into the match approaches and their evaluations can be valuable for the development of future schema matching systems.

Generic, Customizable, and Scalable Schema Matching Systems

The main contribution of the thesis consists in the development of two new generic and customizable schema matching systems, COMA (Combining Matchers) and its successor, COMA++ (see footnote 2). In particular, COMA++ extends COMA by a number of significant improvements and offers a comprehensive infrastructure to solve large real-world match problems.

• Architecture design: COMA++ is the first to explicitly articulate an open multi-component architecture for schema matching, offering high flexibility for extension and adaptation. It consists of five components: the Repository to persistently store match-related data, the Schema and Mapping Pools to manage schemas and mappings in memory, the Match Customizer to construct and configure matchers and match strategies, and the Execution Engine to execute match operations. Each component in turn provides an extensible library of methods for processing its data. COMA++ comes with a comprehensive graphical user interface, which supports interactive and iterative match processing with many ways for the user to provide feedback.

• Composite matcher combination: Combining individual matchers has so far only been studied in the context of machine learning approaches focusing on instance-level matchers and using a specific combination of match results. With COMA++, by contrast, we support a wide spectrum of matchers not confined to a particular technique like machine learning, as well as the customizable combination and refinement of their results. New match algorithms can easily be added or constructed by combining existing ones. The implementation of matchers has been highly optimized in order to achieve fast execution times for large match problems. Match processing is supported as workflows of multiple match steps, which can be configured individually. While supporting a default configuration set to the best strategies identified in our evaluations, COMA++ also allows users to tailor match strategies to the match problem at hand by manually selecting the matchers and the strategies for combining them.
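The idea of combining the results of independent matchers can be sketched as follows. This is a minimal illustration and not the COMA++ implementation: the two matchers, the averaging step, the threshold value, and the schema elements are all assumptions chosen for brevity.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """Hypothetical element-level matcher: string similarity of element names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a, b):
    """Hypothetical constraint-based matcher: 1.0 if the data types agree."""
    return 1.0 if a["type"] == b["type"] else 0.0


def combine(matchers, source, target, threshold=0.6):
    """Run every matcher on every element pair, average the similarities,
    and keep the pairs whose combined similarity reaches the threshold."""
    result = []
    for s in source:
        for t in target:
            sims = [matcher(s, t) for matcher in matchers]
            score = sum(sims) / len(sims)   # assumed aggregation: plain average
            if score >= threshold:          # assumed selection: fixed threshold
                result.append((s["name"], t["name"], score))
    return result


# Hypothetical schema elements.
source = [{"name": "custName", "type": "string"}]
target = [{"name": "customerName", "type": "string"},
          {"name": "orderDate", "type": "date"}]

candidates = combine([name_matcher, type_matcher], source, target)
for s_name, t_name, score in candidates:
    print(f"{s_name} <-> {t_name}: {score:.2f}")
```

Because the matchers share one signature, adding a new matcher or swapping the aggregation and selection steps only changes the arguments passed to `combine`, which is the kind of customization the paragraph above describes.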

• Novel match approaches: COMA++ includes several new approaches, in particular for context-dependent, fragment-based, and reuse-oriented matching. Shared elements have become a popular modeling mechanism to reduce schema complexity and impose standard specifications. However, they are largely ignored in previous schema matching work. Context-dependent matching differentiates between different contexts of a shared element and tries to find match candidates for individual contexts. Fragment-based matching aims at an efficient approach for dealing with very large schemas. In particular, we decompose a large match problem into smaller problems by matching at the level of schema fragments. Finally, we propose a further match approach based on the reuse of previously obtained match results. It is motivated by the observation that many schemas to be matched are very similar to previously matched schemas. Reusing the previous match results may thus result in significant savings of manual effort.
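One way to picture reuse-oriented matching is to derive a new mapping by composing two previously obtained match results that share a schema. The sketch below assumes that a mapping is a dictionary from element pairs to similarity values and that composed similarities are averaged. Both the representation and the averaging are assumptions made for illustration, and the schema and element names are hypothetical.

```python
def compose(map_ab, map_bc):
    """Derive correspondences between schemas A and C from existing
    mappings A<->B and B<->C that share the intermediate schema B."""
    composed = {}
    for (a, b), sim_ab in map_ab.items():
        for (b2, c), sim_bc in map_bc.items():
            if b == b2:                              # join on the shared element of B
                sim = (sim_ab + sim_bc) / 2          # assumed: average the similarities
                key = (a, c)
                # Keep the best similarity if several paths lead to the same pair.
                composed[key] = max(sim, composed.get(key, 0.0))
    return composed


# Hypothetical, previously confirmed match results.
map_po1_po2 = {("PO1.contactName", "PO2.name"): 1.0,
               ("PO1.phone", "PO2.tel"): 0.8}
map_po2_po3 = {("PO2.name", "PO3.custName"): 0.6,
               ("PO2.fax", "PO3.fax"): 1.0}

reused = compose(map_po1_po2, map_po2_po3)
print(reused)
```

Elements without a counterpart in the intermediate schema, such as `PO1.phone` here, yield no correspondence, so a reused mapping would typically be completed by other matchers or by the user.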

Comprehensive Evaluations of Different Match Approaches

Due to the flexibility to configure matchers and match strategies, COMA and COMA++ can also be used to comparatively evaluate the effectiveness of different match approaches. In fact, we conduct several comprehensive evaluations on COMA and COMA++, which show high quality and acceptable execution time for complex real-world match problems in different domains. In particular, the involved E-business message standards represent the largest test schemas compared to previous evaluations. Using the test cases from a published ontology alignment contest, we achieve on average very high quality, which is comparable to that of the best performing participants in the contest.

2. COMA++ [2, 33, 120] extends the COMA prototype first published in [29] and also subsumes all func-

By systematically testing the supported match strategies with all relevant configurations, we are able to identify the best strategies with high and stable quality across different match tasks for the default match operation, thereby limiting tuning effort for later match operations. Our evaluations yield many insights into the quality and performance of different match approaches and the impact of many factors, such as schema size and the choice of matchers and combination strategies. We believe that these insights can be valuable for the development and evaluation of further match algorithms.

Mapping-based Approaches for Data Integration

Traditional data integration approaches, such as mediation or data warehousing, rely on the notion of a domain-dependent global schema to provide a unified and consistent view of the underlying data sources. Unfortunately, the manual effort to create such a schema and to keep it up-to-date is substantial. Furthermore, adding new data sources is a time- and effort-intensive task, making it difficult to scale to many sources or to use such systems for ad-hoc (explorative) integration needs. Further extending the idea of reusing match results, we have developed GENMAPPER (Generic Mapper), a new mapping-based approach for data integration, and applied it to the challenging field of bioinformatics with hundreds of highly cross-referenced web data sources managing annotation data for various types of molecular-biological objects, such as genes and proteins.

GENMAPPER uses a generic data model to uniformly represent different kinds of annotations physically integrated from different data sources. Existing correspondences between objects, i.e., cross-references, represent valuable domain knowledge. Therefore, they are explicitly captured and utilized to drive data integration and combine annotation knowledge from different sources. To serve specific analysis needs, powerful operators are provided to derive tailored annotation views from the generic data representation. In an extended version, GENMAPPER is coupled with a mediator to combine the advantages of both materialized and virtual integration. While frequent and intensive join processing to inter-relate objects from different sources is performed on the data materialized in GENMAPPER, the mediator retrieves up-to-date annotations on demand from relevant sources for objects of interest. GENMAPPER is fully functional and has been successfully used for large-scale functional profiling of genes and proteins.
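The following sketch illustrates the general idea of storing cross-references in one generic form and deriving a tailored annotation view from them. It is not GENMAPPER's data model or operator set: the tuple representation, the function `annotation_view`, and the source and object names are hypothetical and serve only to make the idea concrete.

```python
from collections import defaultdict

# Assumed generic representation: each cross-reference links an object in one
# source to an object in another source, regardless of the kind of annotation.
CROSS_REFS = [
    (("GeneDB", "g1"), ("ProteinDB", "p7")),
    (("GeneDB", "g1"), ("FunctionDB", "f3")),
    (("GeneDB", "g2"), ("ProteinDB", "p9")),
]


def annotation_view(cross_refs, source, targets):
    """Build one row per object of `source`, with one column per target
    source listing the annotations reachable through cross-references."""
    view = defaultdict(lambda: {t: [] for t in targets})
    for (src, obj), (tgt, annotation) in cross_refs:
        if src == source and tgt in targets:
            view[obj][tgt].append(annotation)
    return dict(view)


view = annotation_view(CROSS_REFS, "GeneDB", ["ProteinDB", "FunctionDB"])
for obj, row in view.items():
    print(obj, row)
```

Because every source is stored in the same form, a new source only adds rows to the cross-reference collection, and a different analysis need only changes the arguments of the view function.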
