In this first part of the thesis, Part I, we introduced the problem of schema matching and motivated the need for semi-automatic support to solve the task. We then discussed the open issues in the current state of the art and gave an overview of the main contributions of the dissertation. The rest of the thesis is organized into the following four parts:
Part II focuses on the design and implementation of our schema matching systems COMA/COMA++. We first present the taxonomy of schema matching algorithms, which are relevant candidates for inclusion in a composite match system like ours. We then elaborate on the techniques concerning architecture design and match processing of COMA/COMA++, and on the novel match approaches that we have developed and employed to make our system generic, customizable, and scalable. Finally, we review various schema matching prototypes published in the literature and compare them with ours.
Part III deals with the issues in evaluating a schema matching system. We first introduce the major criteria influencing the effectiveness of a match approach. We then describe the evaluation of COMA/COMA++ and discuss the obtained results and insights. Finally, we survey the state of the art in schema matching evaluation and compare our evaluation with several other studies.
Part IV presents our mapping-based approach for data integration and its application in the bioinformatics domain. First we summarize the challenges and current solutions in integrating molecular-biological data. We then describe in detail our GENMAPPER system and its extension to a hybrid data integration system.
Part V concludes the thesis by summarizing the main contributions made and discussing relevant directions for future research.
Parts of the thesis have been published in refereed conferences and journals. In particular, the survey of schema matching prototypes and evaluations is presented in [31]. The COMA and COMA++ systems and their evaluations are described in [29, 120, 2]. The GENMAPPER system and its extension are described in [32, 77, 76].
PART II
SCHEMA MATCHING APPROACHES
The need for schema matching in numerous applications and the inherent difficulty of the task have led to the development of many algorithms to semi-automatically solve the match problem. While the reuse of existing algorithms promises to reduce development cost for new match systems, it is still largely unclear how best to combine different approaches. Therefore, we first conduct a survey of the available approaches to identify their strengths and applicability. We then develop the COMA schema matching system (Combining Matchers) as a platform to combine different match algorithms in a flexible way [29]. We further extend COMA to a more powerful system, COMA++ [2, 33, 120]. While taking over the composite approach of COMA to combine different match algorithms, COMA++ implements significant improvements, in particular a graphical user interface, flexible construction and configuration of matchers and match strategies, uniform management and manipulation of schemas and mappings, new approaches for context-dependent, fragment-based, and reuse-oriented matching, and various performance optimizations to deal with large schemas.
This part, spanning from Chapter 4 to Chapter 9, describes the techniques and algorithms implemented in COMA/COMA++ in detail. Chapter 4 gives an overview of existing schema matching approaches, which can be employed to build a composite match system like COMA/COMA++. Chapter 5 describes the overall match processing in COMA/COMA++ and the function of single system components. Chapter 6 describes our novel match approach based on the reuse of existing match results to solve a new and similar match task. Chapter 7 presents our generic framework to combine the results of individually executed matchers. Chapter 8 extends the combination framework to cover the iterative refinement of match results and discusses the construction of the new context-dependent and fragment-based match strategies. Finally, Chapter 9 reviews relevant prototypes published in the literature and provides a comparison of the most representative ones and our own prototype COMA++.
CHAPTER 4
APPROACH CLASSIFICATION
A large body of work in different fields, such as schema translation and integration, knowledge representation, machine learning, and information retrieval, aims at automating the schema matching task as much as possible. The main goal of this chapter is to survey these approaches and to explain their common features and applicability. Beyond supporting the development of our own system, we expect the survey to be generally helpful for designers of new approaches as well as for users who need to select from a library of approaches.
In the next section, we briefly describe the criteria adopted from the taxonomy presented in [119] to classify the approaches for automatic schema matching. The classification is then used in the subsequent sections to discuss previously proposed techniques. In particular, we summarize the approaches exploiting schema-level information, instance data, and auxiliary information in Sections 4.2, 4.3, and 4.4, respectively. The approaches to combine multiple match algorithms are presented in Section 4.5. Section 4.6 focuses on the cardinality of the result produced by a match approach. Section 4.7 concludes the chapter.