Contest-based Comparison - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

The EON Ontology Alignment Contest [44, 48] represents the only effort so far in which several match systems were evaluated on a common test base and the results were compared to identify the strengths and weaknesses of each system. As already discussed in Section 12.1, the contest defined a set of match tasks, involving both synthetic and real-world scenarios, and at the same time provided the real results as the gold standard. Although not all of our proposed evaluation criteria were considered (e.g., specification of allowable auxiliary information, report of additional manual effort, and execution performance; see Chapter 10), the contest already represents a remarkable step towards a benchmark for schema matching.

Four ontology matching prototypes, OLA, the PROMPTDIFF subsystem of PROMPT, QOM, and SCM, took part in the contest. The evaluation results for all systems were published in the workshop proceedings and are also available in a unified and cleaned form for download from the contest website [44]. Unlike the individual evaluations, whose heterogeneity only allows us to compare the way the evaluations were conducted, the uniform test environment of the contest makes it possible to compare the quality of the systems directly. Therefore, we downloaded the published results of the four systems from the contest website and compared them with the results of COMA++ as presented in Chapter 12. For better readability, we show in the following only the values of the combined measure Fmeasure.
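Since the comparison below relies entirely on Fmeasure, it may help to recall how it is derived from a match result and the gold standard. The following Python sketch assumes that both are represented as sets of (source element, target element) pairs and that Fmeasure is the harmonic mean of precision and recall; the function name `match_quality` and the sample correspondences are hypothetical.

```python
def match_quality(predicted, gold):
    """Compute precision, recall, and Fmeasure of a match result.

    predicted, gold: sets of (source_element, target_element) correspondences.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Fmeasure is the harmonic mean of precision and recall.
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


gold = {("Article", "Article"), ("title", "title"), ("author", "creator")}
predicted = {("Article", "Article"), ("title", "title"), ("author", "title")}
p, r, f = match_quality(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} Fmeasure={f:.2f}")
```

Here two of the three proposed correspondences are correct and two of the three expected ones are found, so precision, recall, and Fmeasure all equal 2/3.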

Figure 13.4 Quality of single prototypes for contest tasks (Fmeasure, 0 to 1, for QOM, OLA, SCM, PROMPT, and COMA++ on tasks 101, 103, 104, 201, 202, 204, 205, 206, 221, 222, 223, 224, 225, 228, 230, 301, 302, 303, and 304)

Figure 13.4 shows the best Fmeasure values reported by the participants for their systems, as well as those we obtained with COMA++, for the single match tasks. Note that for QOM, no results were available for tasks 101, 103, 104, 202, 221, 222, 225, and 228. The figure allows us to distinguish between simple and hard tasks for the match systems. In particular, tasks 101-104, 204, and 221-230 are rather simple, as most systems achieve high or perfect quality for them. The hard tasks include 201, 202, 205, and 206, in which class and property names in the target ontology are replaced by random strings (201, 202), by synonyms (205), and by names in a foreign language (206), as well as 301, 302, 303, and 304, which match between real-world ontologies.
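The synthetic 2xx tasks are derived from a single ontology, so their gold standard comes for free. A minimal Python sketch of the idea behind tasks 201 and 202 might look as follows, assuming a schema is reduced to a flat list of unique element names (the real contest ontologies also carry classes, properties, comments, and instances); `perturb_names` is a hypothetical helper, not part of the contest tooling.

```python
import random
import string


def perturb_names(elements, seed=0, length=8):
    """Derive a synthetic target by replacing each element name with a random
    string; return the new names and the gold-standard mapping to the original."""
    rng = random.Random(seed)  # fixed seed makes the generated test case reproducible
    renamed = {}
    used = set()
    for name in elements:
        new_name = "".join(rng.choices(string.ascii_lowercase, k=length))
        while new_name in used:  # keep the generated names unique
            new_name = "".join(rng.choices(string.ascii_lowercase, k=length))
        used.add(new_name)
        renamed[name] = new_name
    target = [renamed[name] for name in elements]
    gold = {(name, renamed[name]) for name in elements}
    return target, gold


source = ["Reference", "Article", "title", "author", "year"]
target, gold = perturb_names(source, seed=42)
print(target)
print(sorted(gold))
```

Because every renamed element keeps its link to the original, the returned `gold` set can directly serve as the expected result of the generated task.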

In general, COMA++, PROMPTDIFF, and SCM outperform QOM and OLA. In the simple tasks, COMA++ and PROMPTDIFF mostly yield equal quality, while SCM exhibits slightly worse quality in several of these tasks. For the hard tasks, we observe varying behavior among the top candidates PROMPTDIFF, SCM, and COMA++. In particular, SCM performs very well in tasks 201 and 202, showing better quality than both COMA++ and PROMPTDIFF. This is because SCM exploits instances (which are identical in the source and target ontologies in these tasks), while COMA++ and PROMPTDIFF do not. On the other hand, COMA++ outperforms PROMPTDIFF in both tasks, apparently due to its use of comments and ontology structure. In tasks 205 and 206, COMA++ outperforms both SCM and PROMPTDIFF, while exploiting instances in turn helps SCM to outperform PROMPTDIFF. In the 3xx series, which matches real-world ontologies, COMA++ generally outperforms SCM. This reveals a weakness of SCM, which depends heavily on common terms between the vocabularies of the input ontologies. COMA++ significantly outperforms PROMPTDIFF in task 301, while showing slightly worse quality in the remaining tasks.
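Why shared instances help so much in tasks 201 and 202 can be illustrated with a simple instance-based matcher. The sketch below is not SCM's actual algorithm; it assumes each element is represented by the set of its instance values and scores candidate pairs by Jaccard overlap, so that identical instances yield a perfect score even when the element names are random strings. All names and values are illustrative.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets of instance values."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def instance_match(source, target, threshold=0.5):
    """Map each source element to the target element with the most similar instances.

    source, target: dicts from element name to the set of its instance values.
    Pairs below the threshold are not reported.
    """
    result = {}
    for s_name, s_values in source.items():
        best_name, best_sim = None, 0.0
        for t_name, t_values in target.items():
            sim = jaccard(s_values, t_values)
            if sim > best_sim:
                best_name, best_sim = t_name, sim
        if best_name is not None and best_sim >= threshold:
            result[s_name] = (best_name, best_sim)
    return result


# Target names are random strings (as in task 201), but the instances are shared.
source = {"title": {"Foundations of Databases", "Readings in Database Systems"},
          "year": {"1995", "1998"}}
target = {"xqzvbnmk": {"1995", "1998"},
          "plkjhgfd": {"Foundations of Databases", "Readings in Database Systems"}}
print(instance_match(source, target))
```

Once the instances differ between the ontologies, as in independently constructed real-world ontologies, such overlap scores drop accordingly.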

Figure 13.5 Average quality of single prototypes for test series (average Fmeasure, 0 to 1, for QOM, OLA, SCM, PROMPT, and COMA++ on series 1xx, 2xx, 3xx, and 123)

Figure 13.5 shows the average quality achieved by the systems for the single series. For QOM, no average quality could be computed for the 1xx series, while in the remaining series the average only considers the tasks for which a quality result is available. SCM achieves high quality in the 2xx series due to the high overlap between the vocabularies of the input ontologies, which are all derived from the same initial ontology, BibTeX. However, such an overlap is difficult to achieve for independently constructed ontologies, as indicated by the significantly lower quality of SCM in the 3xx series. PROMPTDIFF, although originally developed for comparing ontology versions, surprisingly has some difficulty with the synthetic tests of the 2xx series, while performing well for the more diverse real-world ontologies in the 3xx series. COMA++ exhibits high robustness for both the synthetic and real-world tests. In particular, its average quality is comparable to that of SCM in the 2xx series and to that of PROMPTDIFF in the 3xx series.
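The averaging rule described above, which skips tasks without a reported result and leaves a series undefined if none is available, can be expressed in a few lines of Python. The task identifiers follow the contest numbering, but the Fmeasure values below are made up for illustration and are not the published results; `series_averages` is a hypothetical helper.

```python
from statistics import mean


def series_averages(results):
    """Average Fmeasure per test series (1xx, 2xx, 3xx).

    results: task id (e.g. "201") -> Fmeasure, or None if no result was reported.
    Tasks without a result are skipped; a series with no result at all yields None.
    """
    by_series = {}
    for task, fmeasure in results.items():
        values = by_series.setdefault(task[0] + "xx", [])
        if fmeasure is not None:
            values.append(fmeasure)
    return {series: (mean(values) if values else None)
            for series, values in sorted(by_series.items())}


# Illustrative values only, not the published contest results.
partial_results = {"101": None, "103": None, "201": 0.5, "202": None,
                   "301": 0.7, "302": 0.5}
print(series_averages(partial_results))
```

Note that averages computed this way are only comparable across systems if they cover the same tasks, which is why missing results deserve explicit mention.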

13.5 Summary

So far, most evaluations have been conducted individually for single prototypes, typically by the authors of the prototype themselves. A few authors also tried to compare their own prototype with others. However, these evaluations generally depend heavily on the authors' subjective choices in selecting the match tasks and designing the test methodology. As a result, the evaluations differ from each other in so many ways that it is impossible to compare their results directly. The EON Ontology Alignment Contest represents the first effort so far to provide a uniform test base for a comparative evaluation of different match systems. In particular, the participants were asked to perform the same tests on their own systems, allowing them to tune and optimize each system to the best of their knowledge.

Although the considered match problems were mostly simple, we observe that many techniques have proved to be quite powerful, such as exploiting element and structure properties (CUPID, SF, COMA/COMA++, PROMPT) and utilizing instance data, e.g., by Bayesian and WHIRL learners (LSD/GLUE) or neural networks (SEMINT). Moreover, the combined use of several approaches within composite match systems proved to be very successful (COMA/COMA++, LSD/GLUE/IMAP, PROMPT). As demonstrated by the high quality of COMA++ for both real-world schemas and ontologies, generic schema matching is feasible for different schema languages and domains. On the other hand, there are still unexploited opportunities, e.g., in the use of large-scale dictionaries and standard taxonomies, and in increased reuse of previous match results (COMA/COMA++). Future systems should integrate these techniques within a composite framework to achieve maximal flexibility.
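The benefit of composite matching can be shown with a toy example. The Python sketch below combines two hypothetical matchers, a name matcher based on `difflib.SequenceMatcher` and a simple structural matcher comparing parent element names, by a weighted average and then selects the best candidate above a threshold. The weights, threshold, and paths are assumptions for illustration; this is not the COMA/COMA++ implementation, which offers a larger library of matchers and combination strategies.

```python
from difflib import SequenceMatcher


def name_sim(a, b):
    """Name matcher: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def composite_sim(source_path, target_path, w_name=0.7, w_parent=0.3):
    """Composite matcher: weighted average of the leaf-name similarity and
    the similarity of the parent element names (a simple structural hint)."""
    leaf = name_sim(source_path[-1], target_path[-1])
    parent = name_sim(source_path[-2], target_path[-2])
    return w_name * leaf + w_parent * parent


def match(source, target, threshold=0.6):
    """Select, for every source path, the most similar target path above the threshold."""
    result = {}
    for s in source:
        best = max(target, key=lambda t: composite_sim(s, t))
        if composite_sim(s, best) >= threshold:
            result[s] = best
    return result


# Paths are (parent, element) pairs; the leaf names alone cannot tell the two cities apart.
source = [("ShipTo", "city"), ("BillTo", "city")]
target = [("BillAddress", "City"), ("ShipAddress", "City")]
print(match(source, target))
```

Both source elements are named `city`, so the name matcher alone would rate both target candidates equally; the parent-name matcher breaks the tie, mapping `ShipTo.city` to `ShipAddress.City` and `BillTo.city` to `BillAddress.City`.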


To allow an objective interpretation and easy comparison of match quality between different systems and approaches, future evaluations should be designed and documented more carefully, ideally including the evaluation criteria identified in Chapter 10. Furthermore, the following issues concerning input and output factors should be considered:

• Input factors - test schemas and system parameters: All evaluations have shown that match quality degrades and execution time grows with bigger schemas. Hence, future systems should be evaluated with schemas of more realistic size, e.g., several hundred elements. As done in the EON Ontology Alignment Contest and in the work of [126], systematically varying a schema to automatically obtain synthetic schemas, together with the real mappings between them and the original schema, can help design test cases with desired characteristics for focused evaluation.

Besides the characteristics of the test schemas, the various input parameters of each system can also influence match quality in different ways. However, their impact has rarely been investigated comprehensively, potentially missing opportunities for improvement and tuning. Consequently, previous evaluations typically reported only peak values w.r.t. some quality measure, so that the overall match quality for a wider range of configurations remained unknown. A systematic evaluation of all relevant parameters can benefit from an automatic approach such as the recent ETUNER system [126], which systematically tests different configurations of a match system to identify the best one on a synthetic workload of schemas and mappings obtained by systematically perturbing an initial schema.

• Output factors - match results and quality measures: Instead of determining only one match candidate per schema element, future systems could suggest multiple, i.e., top-k, match candidates for each schema element. This can make it easier for the user to determine the final match result in cases where the first candidate is not correct. Accordingly, a top-k match prediction may already be counted as correct if the required match candidate is among the proposed choices.

Previous studies used a variety of different quality measures with limited expressiveness, thus preventing a qualitative comparison between systems. To improve the situation and to consider precision, recall, and the degree of post-match effort, we recommend the use of combined measures such as Fmeasure in future evaluations. However, further user studies are required to quantify the different effort needed for finding missing matches, removing false positives, and verifying the correct results. As this depends very much on the convenience offered by a tool, the capabilities of the user interface should also be considered. Another limitation of current quality measures is that they do not consider the pre-match effort or the hardness of match problems.
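The top-k criterion suggested under output factors can be made concrete as follows. This Python sketch assumes a match system returns, for each source element, a list of scored target candidates, and counts a source element as correct if its expected target is among the k highest-ranked candidates; the function name `topk_accuracy` and the sample data are hypothetical.

```python
def topk_accuracy(candidates, gold, k=3):
    """Fraction of source elements whose expected target is among the top-k candidates.

    candidates: source element -> list of (target_element, similarity) pairs
    gold: source element -> expected target element
    """
    if not gold:
        return 0.0
    hits = 0
    for source, expected in gold.items():
        # Rank this element's candidates by decreasing similarity.
        ranked = sorted(candidates.get(source, []), key=lambda c: c[1], reverse=True)
        if expected in [target for target, _ in ranked[:k]]:
            hits += 1
    return hits / len(gold)


candidates = {
    "author": [("title", 0.8), ("creator", 0.7), ("editor", 0.4)],
    "year": [("date", 0.9), ("issued", 0.6)],
}
gold = {"author": "creator", "year": "issued", "publisher": "publisher"}
print(topk_accuracy(candidates, gold, k=1))
print(topk_accuracy(candidates, gold, k=2))
```

With k = 1 none of the three elements is matched correctly, whereas with k = 2 the expected targets of `author` and `year` appear among the proposed choices, giving 2/3.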

Ultimately, a benchmark such as the EON Ontology Alignment Contest seems very helpful for better comparing the effectiveness of different match systems, as it clearly defines all input and output factors for a uniform evaluation. We would like to see a similar effort for schema matching as well. Because of the extreme heterogeneity of real-world applications, such a benchmark should not strive for general applicability but focus on a specific application domain, e.g., a certain type of E-business. Alternatively, a benchmark can focus on determining the effectiveness of match systems with respect to schema types, such as SQL and XSD, or to specific match capabilities, such as name, structural, instance-based, and reuse-oriented matching. In addition to the test schemas and the expected results, a benchmark should also specify the use of all auxiliary information precisely, since otherwise any hard-to-detect correspondences could be built into a synonym table to facilitate matching.
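As a rough illustration of what such a precise task specification could record, the Python sketch below models a benchmark task whose permitted auxiliary information is declared explicitly, so that the use of any undeclared resource, such as a task-specific synonym table, can be detected. The class, field names, and example values are assumptions, not part of any existing benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    """Hypothetical specification of one match task in a schema matching benchmark."""
    task_id: str
    domain: str                 # application domain, e.g. a certain type of E-business
    schema_type: str            # e.g. "SQL" or "XSD"
    source_schema: str          # location of the source schema (illustrative)
    target_schema: str          # location of the target schema (illustrative)
    expected_result: frozenset  # gold-standard correspondences
    capability: str = "name"    # name, structural, instance-based, or reuse-oriented
    allowed_auxiliary: frozenset = frozenset()  # auxiliary information a system may use


def undeclared_auxiliary(task, used):
    """Return the auxiliary resources a participant used but the task does not permit."""
    return set(used) - task.allowed_auxiliary


task = BenchmarkTask(
    task_id="po-01",
    domain="purchase orders",
    schema_type="XSD",
    source_schema="po_source.xsd",
    target_schema="po_target.xsd",
    expected_result=frozenset({("ShipTo/city", "DeliverTo/City")}),
    allowed_auxiliary=frozenset({"generic-synonyms"}),
)
print(undeclared_auxiliary(task, {"generic-synonyms", "task-specific-synonyms"}))
```

Declaring the permitted resources per task makes a violation such as the task-specific synonym table in the example visible instead of silently inflating the reported quality.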

PART IV

MAPPING-BASED DATA INTEGRATION

Traditional data integration approaches rely on the notion of a global schema to provide a unified and consistent view of the underlying data sources. This approach has been especially successful for data warehouses, but is also used in mediators for virtual integration of data sources. Unfortunately, the manual effort to create such a schema and to keep it up-to-date is substantial. Furthermore, adding new data sources is a time- and effort-intensive task, making it difficult to scale to many sources or to use such systems for ad-hoc (explorative) integration needs.

We present a new approach for integrating heterogeneous web data sources. It is based on mappings between sources and utilizes correspondences between their objects, i.e., at the instance level. In a first step, we focus on the bioinformatics domain, where hundreds of web sources are publicly accessible, providing large amounts of data on various molecular-biological objects, such as genes, proteins, and metabolic pathways. The sources are highly cross-referenced by means of web links, capturing different kinds of relationships between the objects. Such semantic correspondences can help navigate between the sources to retrieve and combine information from multiple sources for objects of interest.
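Navigating along such cross-references amounts to composing instance-level correspondences. The Python sketch below assumes each mapping is a set of (source object, target object) identifier pairs and derives a new mapping by composition, e.g., from genes via proteins to pathways. The identifiers are placeholders rather than real accession numbers, and `compose` is a hypothetical illustration, not necessarily how GENMAPPER implements it.

```python
def compose(m1, m2):
    """Compose two instance-level mappings given as sets of (x, y) pairs:
    the result contains (a, c) whenever (a, b) is in m1 and (b, c) is in m2."""
    index = {}
    for b, c in m2:
        index.setdefault(b, set()).add(c)  # group the targets of m2 by their source object
    return {(a, c) for a, b in m1 for c in index.get(b, ())}


# Placeholder identifiers, not real database accessions.
gene_to_protein = {("gene:A", "prot:P1"), ("gene:A", "prot:P2"), ("gene:B", "prot:P3")}
protein_to_pathway = {("prot:P1", "path:X"), ("prot:P3", "path:X"), ("prot:P3", "path:Y")}
print(sorted(compose(gene_to_protein, protein_to_pathway)))
```

Protein P2 has no pathway reference, so gene A is linked only to pathway X, while gene B reaches both pathways through protein P3.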

This part consists of three chapters, from Chapter 14 to Chapter 16. Based on the characteristics of the domain, Chapter 14 discusses the major requirements for data integration and reviews the state of the art of current approaches. Chapter 15 describes the implementation of our integration approach, the GENMAPPER system (Generic Mapper), which physically integrates heterogeneous annotation data in a flexible way and supports large-scale analysis on the integrated data. Chapter 16 presents an extension of GENMAPPER, which is coupled with a mediator to combine the advantages of both the materialized and virtual integration techniques.

CHAPTER 14

DATA INTEGRATION IN BIOINFORMATICS

New advances in life sciences, e.g., molecular biology, biodiversity, drug discovery, and medical research, increasingly depend on bioinformatics methods to manage and analyze vast amounts of highly diverse data. The volume of data is increasing at an unprecedented pace, fueled by world-wide research activities producing publicly available data and by new technologies, e.g., high-throughput devices such as microarrays. Thus, data mining and analysis require comprehensive integration of heterogeneous data, which is typically distributed across many data sources on the web and often structured only to a limited extent. Despite new interoperability technologies, such as XML and web services, data integration is a highly difficult and still largely manual task, especially due to the high degree of semantic heterogeneity and varying data quality as well as specific application requirements. This chapter introduces the data integration problem in the bioinformatics field. In the next section, we summarize the characteristics of molecular-biological data. We then discuss the major requirements for data integration in Section 14.2 and give an overview of the existing solutions in Section 14.3.
