

Open Access

Open Journal of Databases (OJDB)

Volume 1, Issue 1, 2014

http://www.ronpub.com/journals/ojdb

ISSN 2199-3459

Designing a Benchmark for the Assessment of Schema Matching Tools

Fabien Duchateau (A), Zohra Bellahsene (B)

(A) LIRIS, UMR 5205, Université Claude Bernard Lyon 1, Lyon, France, fduchate@liris.cnrs.fr
(B) LIRMM, Université Montpellier 2, Montpellier, France, bella@lirmm.fr

ABSTRACT

Over the years, many schema matching approaches have been developed to discover correspondences between schemas. Although this task is crucial in data integration, its evaluation, both in terms of matching quality and time performance, is still performed manually. Indeed, there is no common platform which gathers a collection of schema matching datasets for this goal. Another problem is the measurement of the post-match effort, a human cost that schema matching approaches aim to reduce. Consequently, we propose XBenchMatch, a schema matching benchmark with available datasets and new measures to evaluate this manual post-match effort and the quality of integrated schemas. We finally report the results obtained by different approaches, namely COMA++, Similarity Flooding and YAM. We show that such a benchmark is required to understand the advantages and failures of schema matching approaches, and that it can help an end-user select a schema matching tool which covers his/her needs.

TYPE OF PAPER AND KEYWORDS

Regular research paper: schema matching, benchmark, data integration, heterogeneous databases, evaluation, integrated schema

1 INTRODUCTION

Data integration is a crucial task which requires that the models representing the data (e.g., schemas, ontologies) be matched [5]. Due to the growing availability of information in companies, agencies, or on the Internet, decision makers may need to quickly understand some concepts before acting, for instance during international emergency events. In such a context, schema matching is a crucial process since its effectiveness (matching quality) and its efficiency (performance) directly impact the decisions [46]. In a different scenario, the costs of integrating multiple data sources can reach astronomic amounts. Thus, evaluating these costs enables project planners to assess the feasibility of a project by analysing incomplete integrated schemas or detecting the overlap between the data sources. In addition, XML and XSD have become standard exchange formats for many web sites and applications.

As a consequence, the schema matching community has been very prolific in producing tools during the last decades. Many surveys [43, 48, 24, 45, 41] reflect this interest, and their authors propose various classifications of matching approaches according to their features. Each schema matching tool has been designed to satisfy one or more schema matching tasks. For instance, some tools are dedicated to large-scale scenarios, i.e., to match a large number of input schemas. Others may be devoted to discovering complex correspondences, in which more than two elements from the input schemas are involved. Papers describing a schema matching approach often include an experiment section which demonstrates the benefit or gain, mainly in terms of matching quality or time performance, of the proposed approach. Sometimes, authors have also compared their work with existing tools, thus showing how their approach may (or may not) perform better than others. Yet, this is not a sufficient evaluation for several reasons. First, the schemas against which approaches are evaluated are not clearly defined. In other words, authors do not always sufficiently detail the properties of the schemas used in the experiments and the tasks they may fulfil. Besides, experiments cannot be reproduced with ease. As a result, it is difficult for an end-user to choose a matching tool which covers his/her needs.

For these reasons, we believe that a benchmark for the schema matching community is necessary to evaluate and compare tools in the same environment. In the information retrieval community, the authors of the Lowell report [28] clearly call for the design of a test collection. Thalia [29], INEX [26] and TREC [4] were therefore proposed. Another closely-related domain is ontology alignment. Schemas differ from ontologies mainly because they include less semantics, for instance on the relationships. Ontology alignment researchers have designed OAEI (Ontology Alignment Evaluation Initiative) [23, 27]. Every year since 2004, an evaluation campaign of ontology alignment tools has been performed. The entity matching task, which consists of discovering correspondences at the instance level, also benefits from various benchmarks [30, 32]. Conversely, there is currently no common evaluation platform for schema matching, although some attempts at evaluation have been proposed [13, 48, 2]. In [13], the authors present an evaluation of schema matching tools. The main drawback is that the matching tools are evaluated with the scenarios provided in their respective papers, so one cannot objectively judge the capabilities of each matching tool. Secondly, some matching tools generate an integrated schema along with a set of correspondences, and the common measures used to assess the quality of a set of correspondences are not appropriate to evaluate the quality of an integrated schema. Another proposal for evaluating schema matching tools was made in [48]. It extends [13] by adding time measures and relies on real-world schemas to evaluate the matching tools. However, the evaluation system has not been implemented and it does not automatically compute quality and time performance results. Finally, STBenchmark [2] was proposed to evaluate mappings (i.e., the transformations of source instances into target instances), but it does not deal with schema matching tools. Our work extends the criteria provided in [13] by adding scoring functions to evaluate the quality of integrated schemas. It goes further on the evaluation aspect: all the matching tools are evaluated against the same scenarios.

In this paper, we present the foundation of a benchmark for XML schema matching tools. Our evaluation system, named XBenchMatch, involves a set of criteria for testing and evaluating schema matching tools. It provides uniform conditions and the same testbed for all schema matching prototypes. Our approach focuses on the evaluation of matching tools in terms of matching quality and performance. Next, we also aim at giving an overview of a matching tool by analysing its features and deducing some tasks it might fulfil. This should help an end-user choose among the available matching tools depending on the criteria required to perform his/her task. Finally, we provide a testbed involving a large schema corpus that can be used by everyone to quickly benchmark schema matching tools. Here we outline the main contributions of our work:

• Testbed. We have proposed a classification of schema matching datasets according to schema properties and schema matching task perspectives. This allows users to choose the features that they need to test.

• Evaluation metrics. To improve the evaluation of matching quality, we propose new metrics to measure the post-match effort, the rate of automation achieved by using a matching tool, and the quality of integrated schemas.

• Experimental validation. We have extended our demonstration tool, XBenchMatch [20], to take into account the classification and the new quality measures. Experiments with three matching tools and over ten datasets demonstrate the benefit of our benchmark for the matching community.

The rest of the paper is organized as follows. First, we give some definitions and preliminaries in Section 2. Related work is presented in Section 3. In Section 4, we present an overview of XBenchMatch. Section 5 describes the datasets and their classification. Section 6 covers the new measures that we have designed for evaluating a set of discovered correspondences. We report in Section 7 the results of three schema matching tools obtained by using XBenchMatch. Finally, we conclude and outline future work in Section 8.

2 PRELIMINARIES

Here we introduce the notions used in the paper. Schema matching is the task which consists of discovering semantic correspondences between schema elements. We consider schemas as edge-labelled trees (a simple abstraction that can be used for XML schemas, web interfaces, or other semi-structured or structured data models). A correspondence is a semantic link between several schema elements which represent the same real-world concept. Contrary to [2], evaluating the quality of the mapping (i.e., the transformation function between instances of a first schema element and those of another schema element) is out of the scope of this paper, since we focus on evaluating the quality of correspondences. We also limit correspondences to 1:1 (i.e., one schema element is matched to only one schema element) or 1:n (i.e., one schema element is matched to several schema elements). Currently, only a few schema matching tools such as IMAP [12] are able to produce n:m correspondences.

A schema matching dataset is composed of a schema matching scenario (the set of schemas to be matched, without schema instances), a set of expert correspondences (between the schemas of the scenario) and/or the integrated expert schema along with expert correspondences (between the integrated schema and each schema of the scenario) [8]. Such datasets [23], also called ground truth or test collections [4, 29], are used by most evaluation tools as an oracle, against which they can evaluate and compare different approaches or tools.

The quality in schema matching is traditionally computed using well-known metrics [7]. Precision calculates the proportion of correct correspondences among the discovered ones. Another typical measure is recall, which computes the proportion of correct discovered correspondences among all correct ones according to the ground truth. Finally, F-measure is a trade-off between precision and recall.
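To make these definitions concrete, here is a minimal sketch of the three metrics, with correspondences represented as pairs of element names (the pairs shown are illustrative, not taken from a specific dataset):

    # Correspondences as pairs of element names; 'discovered' comes from a
    # matching tool, 'expert' is the ground truth.
    def quality(discovered, expert):
        correct = discovered & expert  # correct discovered correspondences
        precision = len(correct) / len(discovered) if discovered else 0.0
        recall = len(correct) / len(expert) if expert else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure

    discovered = {("Hotel Name", "Name"), ("Hotel Location", "Name")}
    expert = {("Hotel Name", "Name"), ("Hotel Brand:", "Chain")}
    print(quality(discovered, expert))  # (0.5, 0.5, 0.5)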

3 RELATED WORK

This section presents related work which covers the main topics of this paper. First, we focus on benchmarks for matching tools in ontology and schema domains. Then, we describe schema matching tools that are publicly available for evaluation with our benchmark.

3.1 Ontology Benchmarks

In the ontology domain, a major effort for evaluating ontology matching is OAEI, the Ontology Alignment Evaluation Initiative [23, 27]. Since 2004, this campaign yearly invites researchers to test their ontology matching tools against different scenarios (datasets). They run and test their tools during several months and send both the system and the produced alignments to the OAEI organizers. Results are then published and the OM workshop enables researchers to share feedback. The OAEI datasets fulfil various criteria. For instance, the benchmark dataset gathers many schemas in which a specific type of information has been altered (modifications, deletions, etc.). Consequently, it aims at detecting the weaknesses of a tool according to the available information. Other datasets might be very specific, like the FAO ontologies. In such cases, dictionaries are available as external resources for the matching tools. However, only the benchmark and anatomy datasets are provided with the complete set of expert correspondences. For the remaining datasets, the OAEI organizers are in charge of evaluating the results w.r.t. expert correspondences. A tool, AlignAPI [11], can be used to automatically compute precision, recall, F-measure and fall-out values for a given dataset. However, this tool is mainly useful with datasets for which expert correspondences are provided by users. A recent work finally proposes a method for automatically generating new datasets [25].

Last but not least, the OAEI organizers have pointed out that the recent campaigns do not show a significant impact on quality. They conclude that the tools may have reached an upper limit of automation, and in 2013 they called for a new interactive track in which users are involved [27]. In such a context, new metrics are needed for evaluating the number of user interactions. In the 2013 contest, the OAEI interactive track counted the number of both positive and negative interactions. This is crucial since the tools are traditionally in charge of suggesting the next interactions, and counting the negative ones helps to detect whether the tool is able to suggest relevant interactions. In our case, only positive interactions are taken into account, since XBenchMatch aims at estimating the worst-case effort for an expert to reach a 100% F-measure from a given set of correspondences. Other metrics have been refined to consider various types of interactions, such as providing the definition of the semantic link of a correspondence [42]. The AUL metric (Area Under Learning curve) is proposed to evaluate interactive matching tools and compare their F-measure given a cost of interaction. The AUL metric is an estimation because it is calculated during the interactive matching. On the contrary, our metrics are computed during evaluation, with knowledge of the correct set of correspondences.

The maturity achieved by the OAEI campaign is an interesting starting point for designing a schema matching benchmark. Only a few schema matching tools which are able to parse ontologies have participated in OAEI, for instance COMA++ in the 2006 campaign [35] and YAM++ since 2011 [22, 38, 40]. Indeed, most schema matching tools do not parse ontologies and are not designed to match them, since ontologies are richer than schemas in terms of semantics. On the contrary, our benchmarking tool is specifically designed for the schema matching community. Furthermore, it offers additional features:

• New metrics for evaluating the matching quality of correspondences (e.g., post-match effort [19]) and the quality of an integrated schema (e.g., schema proximity [18]). Such metrics may be useful to the recent OAEI interactive track (http://web.informatik.uni-mannheim.de/oaei-interactive/2013/), in which user interaction is allowed, to determine which system needs the smallest amount of interactions.

• Dedicated schema matching datasets. Contrary to ontologies, which include semantics (e.g., concepts are linked using specific types of relationships such as "subClassOf"), schemas do not have such semantics. Thus, ontology matching tools and schema matching tools use different similarity measures and do not exploit the same type of information. For those reasons, dedicated schema matching datasets are required.

• Generation of plots of the evaluation results both for a comparison between tools and for the assessment of a single tool.

3.2 Schema Matching Benchmarks

To the best of our knowledge, there is no complete benchmark for schema matching.

In [13], the authors mainly discuss the criteria required to evaluate schema matching tools. More precisely, they provide a report about the evaluation results of the matching tools COMA [14], Cupid [33], LSD [17], and Similarity Flooding [36]. However, they do not propose any evaluation tool. As the authors explain, it is quite difficult to evaluate matching tools for several reasons, especially when they are not available as demos; it is then not possible to test them against specific sets of schemas. Moreover, schema matching is not a standardized task, thus its inputs and outputs can be totally different from one schema matching tool to another. For instance, a given tool might require specific resources to be effective, like an ontology, a thesaurus, or training data, which are not necessarily provided with the tool or with the matching scenario. Consequently, users do not have sufficient information to choose the schema matching tool which suits their needs. Secondly, some matching tools also generate an integrated schema along with a set of correspondences, and the measures provided to evaluate a set of correspondences are not appropriate in that case.

Another proposal for evaluating schema matching tools was made in [48]. It extends [13] by adding time measures and it relies on real-world schemas to compare matching tools. Three schema matching tools (COMA, Similarity Flooding and Cupid) were evaluated against four scenarios. Two of these scenarios are small (fewer than 20 elements) and the two others have an average size (fewer than 100 elements). Most of these scenarios have labels with low heterogeneity. Besides, their structure is not very nested (1 to 3 depth levels). The results obtained by the matching tools on four scenarios are probably not sufficient to judge their performance. No evaluation system has been implemented, and results (in terms of quality and time values) are not automatically computed, thus making it difficult to extend the work. Finally, the quality of the integrated schemas produced by schema matching tools is not evaluated.

In 2008, STBenchmark [2] was proposed to evaluate mapping systems, namely the transformation function from source instances into target instances. This benchmark aims at evaluating both the quality of these transformations and their execution time. However, the discovery of mappings is performed manually using visual interfaces which generate XSLT scripts. Benchmark scenarios, against which mapping systems are evaluated, are gathered according to common transformations (e.g., copying, flattening). To enrich this corpus, STBenchmark also includes scenario and instance generators. Both generators can be tuned thanks to configuration parameters (e.g., kind of joins, nesting levels). Finally, a simple usability model enables users to quantify the number of actions (in terms of keyboard inputs and mouse clicks) that they have to perform to design the produced mappings. A cost value is finally returned by computing a weighted sum of all actions. Four mapping systems (whose names are not provided in the paper) have been evaluated both in terms of quality (based on the usability model) and time performance. Contrary to STBenchmark, our solution is dedicated to the discovery of correspondences rather than mappings.

3.3 Schema Matching Tools

Many approaches have been devoted to schema matching [6]. In various surveys, researchers have proposed a classification for matching tools [43, 24], which has been later refined in [45]. However, this section only focuses on schema matching tools which are publicly available for evaluation with our benchmark.

3.3.1 Similarity Flooding/Rondo

Similarity Flooding (SF) and its successor Rondo [36, 37] can be used with relational, RDF and XML schemas. These input data sources are initially converted into labelled graphs, and the SF approach uses fixpoint computation to determine correspondences between graph nodes. The algorithm has been implemented as a hybrid matcher, in combination with a terminological similarity measure. First, the prototype performs an initial element-level terminological matching, and then feeds the computed correspondences to the structural similarity measure for the propagation process. This structural measure states that two nodes from different schemas are considered similar if their adjacent neighbours are similar. When similar elements are discovered, their similarity increases and it impacts adjacent elements by propagation. This process runs until the similarity values no longer increase. Like most schema matchers, SF generates correspondences for pairs of elements having a similarity value above a certain threshold. The generation of an integrated schema is performed using Rondo's merge operator. Given two schemas and their correspondences, SF converts the schemas into graphs and renames elements involved in a correspondence according to the priorities provided by the users.
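As an illustration, the following is a simplified sketch of the propagation idea. It is not the exact SF algorithm (which builds a pairwise connectivity graph, uses propagation coefficients, and iterates until a convergence criterion is met); the uniform neighbour weighting and the data layout are assumptions:

    # Simplified similarity propagation over pairs of schema elements.
    # 'init_sim' holds terminological similarities per pair (a, b);
    # 'neighbours' maps a pair to the pairs of its adjacent elements.
    def propagate(init_sim, neighbours, iterations=10):
        sim = dict(init_sim)
        for _ in range(iterations):
            new_sim = {}
            for pair, score in sim.items():
                # a pair keeps its score and absorbs its neighbours' scores
                new_sim[pair] = score + sum(sim[n] for n in neighbours.get(pair, ()))
            top = max(new_sim.values()) or 1.0  # normalize back into [0, 1]
            sim = {p: s / top for p, s in new_sim.items()}
        return sim

Pairs whose final score exceeds the chosen threshold are then returned as correspondences.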

3.3.2 COMA/COMA++

As described in [14, 3], COMA++ is a hybrid matching tool that incorporates many independent similarity measures. It can process relational, XML and RDF schemas as well as ontologies. Different strategies, e.g., reuse-oriented matching or fragment-based matching, can be included, offering different results. When loading a schema, COMA++ transforms it into a rooted directed acyclic graph. Specifically, the two schemas are loaded from the repository and the user selects the required similarity measures from a library. For linguistic matching, it also utilizes user-defined synonym and abbreviation tables. For each measure, each element from the source schema is attributed a similarity value between 0 (no similarity) and 1 (total similarity) with each element of the target schema, resulting in a cube of similarity values. The final step combines the similarity values given by each similarity measure by means of aggregation operators like max, min, average, etc. Finally, COMA++ displays all candidate correspondences whose similarity value is above a given threshold, and the user checks and validates their accuracy. COMA++ supports a number of other features like merging the schemas into an integrated schema, which can be saved in an ASCII tree format.
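For illustration, here is a minimal sketch of the cube-and-aggregation step; the array layout and the threshold value are assumptions, not COMA++'s internals:

    import numpy as np

    # Hypothetical similarity cube: one matrix per similarity measure,
    # shape (measures, source elements, target elements), values in [0, 1].
    cube = np.random.rand(3, 4, 5)

    combined = cube.mean(axis=0)               # 'average' aggregation
    candidates = np.argwhere(combined >= 0.7)  # pairs shown for validation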

3.3.3 YAM

YAM (Yet Another Matcher) [21, 22] is not (yet) another schema matching system, as it enables the generation of à la carte schema matchers. It considers the schema matching task as a classification problem. To produce a schema matcher, it automatically combines and tunes similarity measures based on various machine learning classifiers. Then it selects the most appropriate schema matcher for a given schema matching scenario. In addition, optional requirements can be provided, such as a preference for recall or precision, similar schemas that have already been matched, and expert correspondences between the schemas to be matched. YAM uses a knowledge base that includes a (possibly large) set of similarity measures and classifiers. YAM++ is an extension for matching ontologies [39].

4 OVERVIEW OF XBENCHMATCH

In this section, we first describe the desiderata for a schema matching benchmark. Then, we present the architecture of our XBenchMatch tool.

4.1 Desiderata

A schema matching benchmark needs to have the following properties in order to be complete and efficient:

• Extensible, i.e., the benchmark is able to evolve according to research progress. This extensibility gathers three points: (i) future schema matching tools could be benchmarked, hence XBenchMatch deals with well-formed XML schemas; (ii) new evaluation metrics could be added to measure matching quality or time performance; and (iii) users should easily be able to add new schema matching datasets.

• Portable. The benchmark should be OS-independent, since the matching tools might run on different operating systems. This requirement is fulfilled by using the Java programming language.

• Ease of use, since end-users and schema matching experts are both targeted by this benchmark. Besides, from the experiment results computed by the benchmark, users should be able to decide which of several matching tools is the most suitable for a given scenario. Indeed, one of the challenges is to provide sufficient information to arbitrate between conflicting constraints, such as time performance and F-measure.

• Generic, i.e., it should work with most of the available matching tools. Thus, we have divided the benchmark into datasets, each of them reflecting one or several specific schema matching issues. For instance, tools which are able to match a large number of schemas can be tested against a large-scale scenario. Dividing the benchmark into datasets facilitates the understanding of the results. Besides, it does not constrain the benchmark to the common capabilities of the tools. Indeed, if some matching tools can only match two schemas at a time, this does not prevent other tools from being tested against a dataset with a large number of schemas.

All these requirements should be met to provide an acceptable schema matching benchmark. From these desiderata, we have designed the XBenchMatch architecture.

4.2 XBenchMatch Architecture

Figure 1: Architecture of XBenchMatch

To evaluate and compare schema matching tools, we have implemented XBenchMatch. Its architecture is depicted in Figure 1 and it relies on two main components: the extensibility process and the evaluation process.

The former deals with the extensibility of the tool. It takes a dataset as input, and the extension process applies some checks (e.g., the schemas are well-formed, the expert correspondences only involve elements which exist in the schemas, etc.). If the dataset is validated by the extension process, then it is added into the knowledge base (KB). This KB stores all information about datasets and evaluation metrics. Consequently, it interacts with the two main components.

The latter process, namely evaluation, takes as input the results of a schema matching tool. Indeed, we assume that the schema matching tool to be evaluated has performed matching against a schema matching scenario from our benchmark. Thus, the evaluated matching tool produces a set of correspondences between the schemas of the scenario, and/or an integrated schema along with its associated correspondences. This input is used by the evaluation process, which compares it to the expert set of correspondences and/or the expert integrated schema included in the scenario. XBenchMatch outputs quality and time performance results of the matching tool for the given scenario.

4.3 Methodology

Before using XBenchMatch, a user has to generate an integrated schema and/or a set of correspondences for each dataset with the matching tool(s) she would like to evaluate. Recall that each dataset contains a schema matching scenario, the expert integrated schema and/or the expert sets of correspondences. Thus, the idea is to compare the output produced by a matching tool against those expert ones.

Let us describe a scenario. A user would like to know if her matching tool performs well, in terms of quality, when matching large schemas. She chooses in our benchmark a dataset with large schemas (see Section 5). Then, she runs her matching tool against the schemas of the chosen dataset, which produces a set of correspondences. As she wants to evaluate the quality of this set of correspondences, she uses it as input for XBenchMatch. Our benchmark compares the set of correspondences produced by the matching tool against the expert set of correspondences provided with the dataset. Quality metrics such as post-match effort are computed to assess the quality of the set of correspondences (see Section 6). The user can finally interpret these results visually and discover if her matching tool is suitable for matching large schemas (see Section 7).

5 DATASETS CLASSIFICATION

To evaluate schema matching tools, our benchmark includes various schema matching datasets. Indeed, the discovery of correspondences is an initial step before integration or mediation, and the nature of the datasets clearly impacts the matching quality. First, we describe our datasets; for all of them, the expert set of correspondences has been manually built and validated. Then, we propose a classification of these datasets based on the dataset properties and the schema matching features. Here are the datasets available in our benchmark, which can all be downloaded online (http://liris.cnrs.fr/~fduchate/research/tools/xbenchmatch/):


• Betting and finance datasets. Each of them contains tens of web forms, extracted from various websites [34]. As explained by the authors of [46], schema matching is often a process which evaluates the costs (in terms of resources and money) of a project, thus indicating its feasibility. Our betting and finance datasets can be a basis for project planning, i.e., to help users decide whether integrating their data sources is worthwhile or not.

• Biology dataset. The two large schemas originate from different biology collections which describe proteins, namely UniProt (http://www.uniprot.org/docs/uniprot.xsd) and GeneCards (http://www.geneontology.org). The UniProt schema contains 896 elements and the GeneCards schema has more than 2530 elements. This is an interesting dataset for deriving a common specific vocabulary from different data sources which have been designed by human experts. A third source such as ProSite (http://prosite.expasy.org/) could be used as an external resource for facilitating the matching.

• Currency and sms datasets are popular web services (http://free-web-services.com/). Matching the schemas extracted from web services is a recent challenge for building new applications such as mashups or for automatically composing web services.

• Order dataset deals with business. The first schema is drawn from the XCBL collection (http://www.xcbl.org) and includes about 850 elements. The second schema (from the OAGI collection, http://www.oagi.org) also describes an order, but it is smaller with only 20 elements. This dataset reflects a real-case scenario in which a repository of schemas exists (similar to our large schema) and users would like to know whether a new schema (the small one in our case) is still necessary. It can also be used in a context where users need to determine the overlap between schemas, i.e., whether a schema or a subset of a schema can be reused from the repository.

• Person dataset contains two small-sized schemas describing a person, with low heterogeneity. On the Web, many applications only need small schemas to store the data of specific concepts (e.g., customers, products).

• Travel dataset includes 20 schemas that have been extracted from airfare web forms [1]. In data sharing systems, partners have to choose a schema or a subset of a schema that will be used as a basis for exchanging information. This travel dataset clearly reflects this need, since schema matching enables data sharing partners to identify similar concepts that they are willing to share.

• University courses dataset. These 40 schemas have been taken from the Thalia collection presented in [29]. Each schema has about 20 elements and they describe the courses offered by some worldwide universities. As explained in [46], this dataset could refer to a scenario where users need to generate an exchange schema between various data sources.

• University department dataset describes university departments and it has been widely used in the literature [16]. These two small schemas have very heterogeneous labels.

According to their descriptions, it is clear that these datasets either have different properties or fulfil various schema matching tasks. These features are summed up in Table 1 and have been classified into five categories represented by the gray columns. The first four categories deal with the properties of the datasets. The column Label heterogeneity is computed by applying terminological similarity measures to the expert set of correspondences. If these measures are able to discover most of the correspondences, the labels have a very low heterogeneity. Conversely, if the terminological measures only discover a few correspondences, then the labels are strongly heterogeneous. The second column, Domain specific, means that the vocabulary is uncommon and cannot be found in general dictionaries such as Wordnet [47]. The size column indicates the average number of schema elements in a dataset, while the structure column checks how deeply the schema elements are nested. Unfortunately, we were not able to obtain very nested schemas (depth > 7). The last category provides the number of schemas in the dataset. This is an important feature because of the growing information stored on the Internet and in distributed organizations [44]. Other schema matching features could have been added, for instance use of external resources (e.g., domain ontology), complex correspondences, use of instances, evolution of schemas, etc. But they would require the corresponding schema matching datasets, which involves a demanding effort to create the ground truth. However, we intend to add more datasets that reflect such interesting features by semi-automatically discovering the correspondences with a schema matching tool. Using this classification of the datasets in our benchmark enables a better understanding of the matching tools' successes and failures.


Table 1: Datasets classification according to their properties. Columns: Label heterogeneity (Low or normalized / Average / High), Domain (Specific), Average size (Small, <10 / Average, 10-100 / Large, >100), Structure (Flat / Nested, 3<depth<7), Number of schemas. An 'x' marks the properties of each dataset.

    Betting        x x x    12
    Biology        x x x x   2
    Currency       x x x     2
    Finance        x x x x   8
    Order          x x x     2
    Person         x x x     2
    Sms            x x x     2
    Travel         x x x    20
    Univ. courses  x x x    40
    Univ. dept     x x x     2

6 QUALITY OF MATCHING

The schema matching community evaluates the results produced by its tools using common metrics, namely precision, recall and F-measure [6]. However, the aim of schema matching is to avoid a manual, fastidious and error-prone process by automating the discovery of correspondences. Therefore, the post-match effort, which consists of checking these discovered correspondences, should be reduced as much as possible. Yet, overall is the only metric which computes this effort, and it is not sufficiently realistic [36]. Indeed, this metric implies that validating a discovered correspondence consumes as many resources as searching for a missing correspondence. Furthermore, schema matching tools may produce an integrated schema (using the set of correspondences). To the best of our knowledge, there are only a few metrics in charge of assessing the quality of an integrated schema, and they do not take into account the structure of the produced schemas [10]. Consequently, we present new metrics to complete the evaluation of the quality of a set of correspondences and of an integrated schema.

6.1 Quality of Matching

As the matching process mainly aims at helping users save both time and resources, it is interesting to measure the gain of using a matching tool. Thus, we present an alternative to the overall metric for computing the post-match effort, i.e., a maximum estimation of the amount of work that the user must provide to check the correspondences discovered by the tool and to complete them [19]. Furthermore, inverting the post-match effort enables us to measure the human spared resources, i.e., the rate of automation gained by using a matching tool.

When a set of correspondences is generated by a tool, users first have to check each correspondence in the set, either to validate it or to remove it. Then, they have to browse the schemas and discover the missing correspondences. Thus, we propose to evaluate this user post-match effort by estimating the number of user interactions needed to reach a 100% F-measure, i.e., to correct the two previously mentioned issues. A user interaction is an (in)validation of one pair of schema elements (either from the set of discovered correspondences or between the schemas).

Here are the assumptions which underlie our metric:

• All pairs of schema elements which have not already been matched must be (in)validated. Besides, the last pair to be validated is assumed to be a correspondence (worst-case scenario).

• Missing correspondences are discovered with the same frequency, to enable a comparison of the quality at different times (uniformity).

• Only 1:1 correspondences are taken into account. The metric can be applied to 1:n correspondences (represented by several 1:1 correspondences), but we do not consider more complex correspondences (namely n:m).

Now, let us introduce an example. Figure 2 depicts a set of correspondences discovered by the matching tool Rondo between two hotel booking schemas. The expert set of correspondences is shown in Figure 3. By comparing the correspondences in Figures 2 and 3, we notice that one discovered correspondence is incorrect (<Hotel Location, Hotel Name>). Consequently, it has to be invalidated. Besides, the matching tool missed two correspondences, namely <Hotel Brand:, Chain> and <Rooms Needed:, Number of Rooms>. These two correspondences have to be searched for among the 23 pairs that have not been validated (8 × 3 possible remaining pairs minus the 1 incorrect pair discovered by the tool).

Figure 2: Correspondences discovered by a schema matcher (Rondo)

Figure 3: Expert correspondences between the two hotel booking schemas

6.2 Computing the Number of User Interactions (NUI)

In the following, we present the different parameters involved in the post-match effort. Given two schemas Sℓ and SL of respective sizes |Sℓ| and |SL|, with |Sℓ| ≤ |SL| (i.e., SL is a larger schema than Sℓ), their expert set of correspondences E contains |E| correspondences. A matching tool applied to these schemas has discovered a set of correspondences M, which contains |M| correspondences. Among these discovered correspondences, |R| of them are correct, with 0 ≤ |R| ≤ |M|.

|Sℓ|, |SL|, |E|, |M| and |R| are the five inputs required to compute the number of user interactions. In our hotel booking example, we have the following values:

• |Sℓ| = 14, the number of elements in the smallest schema (the root element tagged with <a:schema> is not counted).

• |SL| = 19, the number of elements in the largest schema (root element not counted either).

• |E| = 13, the number of given expert correspondences, shown in Figure 3.

• |M| = 12, the number of correspondences discovered by the matching tool, shown in Figure 2.

• |R| = 11, the number of correct correspondences discovered by the matching tool.

The post-match evaluation process can be divided into three phases.

Phase 1: checking all discovered correspondences. This step is very easy to compute. A user has to check each correspondence from the set of discovered correspondences and (in)validate it. Thus, this requires a number of interactions equal to the number of discovered correspondences in the set, |M| in our case. We call this metric effort_prec since it is directly impacted by precision. Indeed, a high precision reduces the number of user interactions since fewer incorrect correspondences have been discovered. Note that at the end of this step, precision is equal to 100%.

    effort_prec = |M|    (1)

In our example, there are 12 discovered correspondences, thus effort_prec = 12. It means that the number of user interactions during this step is equal to 12, among which 11 are validations and 1 is the invalidation of the incorrect correspondence.

Phase 2: manual discovery of missed correspondences. The second step deals with the manual discovery of all missing correspondences. At the end of this step, recall reaches 100%, and as a consequence F-measure does too. We assume that all pairs which have not been invalidated yet must be analysed by the user. As we consider only 1:1 and 1:n correspondences, elements that have already been matched (during phase 1) are not checked anymore. The main idea is to check every unmatched element from the smallest schema against all unmatched elements from the largest schema.

Due to the uniformity assumption, we manually discover a missing correspondence with the same frequency. This frequency is computed by dividing the number of unmatched elements in the smallest schema by the number of missing correspondences, as shown by Formula 2. Due to the 1:1 correspondences assumption, the number of correct correspondences |R| is equal to the number of correctly matched elements in each schema.

    freq = (|Sℓ| − |R|) / (|E| − |R|)    (2)

Back to our example, freq = (14 − 11) / (13 − 11) = 3/2 means that the user will manually find a missing correspondence for every three-halves unmatched elements from the smallest schema.

Since we now know the frequency, we can compute the number of interactions using a sum function. We call this metric effort_rec since it is affected by recall: the higher the recall, the fewer interactions are required during this step. |SL| − |R| denotes the number of unmatched elements from the largest schema. With i standing for the analysis of the i-th unmatched element from Sℓ, i/freq represents the discovery of a missing correspondence (when it reaches 1). Finally, we also uniformly remove the pairs which may have already been invalidated during step 1, by subtracting (|M| − |R|) / (|Sℓ| − |R|). Thus, we obtain Formula 3:

    effort_rec = Σ_{i=1}^{|Sℓ|−|R|} ( |SL| − |R| − i/freq − (|M| − |R|) / (|Sℓ| − |R|) )    (3)

We now detail for our example the successive iterations of this sum function, in which i varies from 1 to 3:

• effort_rec(i = 1): 19 − 11 − 1/1.5 − 1/3 = 7
• effort_rec(i = 2): 19 − 11 − 2/1.5 − 1/3 = 6 1/3
• effort_rec(i = 3): 19 − 11 − 3/1.5 − 1/3 = 5 2/3

Thus, the second step to discover all missing correspondences requires effort_rec = 7 + 6 1/3 + 5 2/3 = 19 user interactions.

Phase 3: computing NUI. Finally, to compute the number of user interactions between two schemas Sℓ and SL, noted nui, we sum the values of the two previous steps by applying Formula 4. Note that if the set of correspondences M is empty, then using a matching tool was useless and the number of user interactions is equal to the total number of pairs between the schemas.

    nui(Sℓ, SL) = |Sℓ| × |SL|                 if |M| = 0
                  effort_prec + effort_rec    otherwise    (4)
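The three phases can be summarized in a short sketch, a direct transcription of Formulas 1-4; it assumes, as in the running example, that the tool discovered something (|M| > 0 handled explicitly) and missed at least one correspondence (|E| > |R|):

    # Sketch of the NUI computation (Formulas 1-4).
    def nui(s_small, s_large, e, m, r):
        """s_small, s_large: schema sizes; e: expert correspondences |E|;
        m: discovered correspondences |M|; r: correct discovered ones |R|."""
        if m == 0:                       # no tool output: fully manual matching
            return s_small * s_large
        effort_prec = m                  # phase 1: (in)validate each discovered pair
        freq = (s_small - r) / (e - r)   # Formula 2: discovery frequency
        effort_rec = sum(                # phase 2: Formula 3
            s_large - r - i / freq - (m - r) / (s_small - r)
            for i in range(1, s_small - r + 1))
        return effort_prec + effort_rec  # Formula 4

    print(nui(14, 19, 13, 12, 11))       # 31.0 (= 12 + 19) for the example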

6.3 Computing the Post-match Effort

The number of user interactions is not sufficient to measure the benefit of using a matching tool, because the size of the schemas should be taken into account. Thus, the post-match effort is a normalization of the number of user interactions. In addition, we derive from the number of user interactions a human spared resources measure that calculates the rate of automation achieved by a schema matching tool.

Post-match effort (PME). From the number of user interactions, we can normalize the post-match effort value into [0, 1]. It is given by Formula 5. Indeed, we know the number of possible pairs (|Sℓ| × |SL|). Checking all these pairs means that the user performs a manual matching, i.e., nui = |Sℓ| × |SL| and pme = 100%.

    pme(Sℓ, SL) = nui(Sℓ, SL) / (|Sℓ| × |SL|)    (5)

We notice that the post-match effort cannot be equal to 0%, even when F-measure is 100%. Indeed, the user must at least (in)validate all discovered correspondences, thus requiring a number of interactions. Then, (s)he also has to check that no correspondence has been forgotten by analysing every unmatched element from the input schemas. This is realistic since we never know in advance the number of correct correspondences.

Human spared resources (HSR). We can also compute the percentage of automation of the matching process achieved thanks to a matching tool. This measure, noted hsr, for human spared resources, is given by Formula 6. If a matching tool achieves a 20% post-match effort, this means that the user has to perform 20% of a manual matching for removing and adding correspondences, w.r.t. a complete (100%) manual matching. Consequently, we can deduce that the matching tool managed to automate 80% of the matching process.

    hsr(Sℓ, SL) = 1 − nui(Sℓ, SL) / (|Sℓ| × |SL|) = 1 − pme(Sℓ, SL)    (6)

In our hotel booking example, the post-match effort is equal to pme = 31 / (14 × 19) ≈ 12%, and the human spared resources are equal to hsr = 1 − 0.12 ≈ 88%. The matching tool has spared 88% of the user's resources; the user still has to manually perform 12% of the matching process.
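Continuing the sketch above, the normalization into PME and HSR (Formulas 5 and 6) is direct:

    # PME and HSR for the hotel booking example, reusing nui() from above.
    interactions = nui(14, 19, 13, 12, 11)  # 31.0
    pme = interactions / (14 * 19)          # ~0.117, i.e., about 12%
    hsr = 1 - pme                           # ~0.883, i.e., about 88%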

6.4 Discussion

The overall measure [36] was specifically designed to compute the post-match effort using the formula overall = recall × (2 − 1/precision). However, it entails a major drawback since it assumes that validating discovered correspondences requires as much effort as searching for missed ones. Another drawback, explained by the authors, appears when precision is below 50%: it then implies that removing extra correspondences and adding missing ones costs the user more effort than manually performing the matching, resulting in a negative overall value which is often disregarded. On the contrary, our measure returns values in the range [0, 1] and, thanks to the estimation of the number of user interactions, it does not assume that a low precision involves much effort during post-match. Finally, the overall measure does not consider the size of the schemas. Yet, even with the same number of expert correspondences, the manual task of checking the discovered correspondences and finding the missed ones in two large schemas requires a larger effort than in smaller ones. An extensive comparison between overall and post-match effort can be found in [19].

We now discuss several points about the post-match metric. Our metric does not take into account the fact that some schema matching tools [36, 3] return the top-K correspondences for a given element. By proposing several correspondences, the discovery of missing correspondences is made easier when the correct correspondence appears in the top-K. Note that without the 1:1 correspondence assumption (i.e., in case of n:m correspondences), the formula for the number of user interactions reduces to nui(S1, S2) = |S1| × |S2|. Indeed, as complex correspondences can be found, all elements from one schema have to be compared with all elements from the other schema. However, many schema matching tools do not produce such complex correspondences [43]. At best, they represent 1:n complex correspondences using two or more 1:1 correspondences. During the second step of the post-match effort, F-measure has the same value distribution for all matching tools (since precision equals 100% and only recall can be improved). This facilitates a comparison between matching tools for a given dataset. We claim that recall is more important than precision. Indeed, it enables a better reduction of the post-match effort, since (in)validating discovered correspondences requires fewer user interactions than searching for missed correspondences.

6.5 Quality of Integrated Schemas

As matching tools can also generate an integrated schema, we describe several metrics to assess its quality. In our context, we have an integrated schema produced by a matching tool and a reference integrated schema. This reference integrated schema can be: (i) provided by an expert, thus indicating that the tool integrated schema should be as similar as possible to this reference integrated schema; (ii) a repository of schemas, which many organizations maintain, in which case we can compare the overlap between the tool integrated schema and this repository; or (iii) one of the input schemas, which means that we would like to know how similar the tool integrated schema is to one of the input schemas. The reference integrated schema does not necessarily take into account the application domain, user requirements and other possible constraints, so several reference integrated schemas could be considered valid [8].

In [10], the authors define two measures for an integrated schema w.r.t. the data sources. First, completeness represents the proportion of elements in the tool integrated schema which are common with the reference integrated schema. Secondly, minimality is the percentage of extra elements in the tool integrated schema w.r.t. the reference integrated schema. Both metrics are in the range [0, 1], with a value of 1 meaning that the tool integrated schema is totally complete (respectively minimal) w.r.t. the reference integrated schema.

As stated by Kesh [31], these metrics are crucial to produce a more efficient schema, i.e., one that reduces query execution time. However, they do not measure the quality of the structure of the produced integrated schema. We believe that the structure of an integrated schema produced by a schema matching tool may also decrease schema efficiency if it is badly built. Besides, an integrated schema that mostly keeps the semantics of the source schemas is easier to understand for an end-user. Thus, we have complemented the two previous metrics with another one which evaluates the structurality of an integrated schema [18]. We mainly check that each element of the tool integrated schema shares the same structure as its corresponding element in the reference integrated schema. By same structure, we mean that an element in both integrated schemas shares the maximum number of common ancestors, and that no extra ancestor has been added in the tool integrated schema.

These three metrics, namely completeness, minimality and structurality, are finally averaged to evaluate the schema proximity between two integrated schemas, which returns values in the range [0, 1] [18].
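As a rough sketch of how these metrics can be computed (elements are compared by name here, and the exact definitions in [10] and [18] may differ in detail):

    # 'tool' and 'ref' are sets of element names of the two integrated
    # schemas; 'tool_anc'/'ref_anc' map an element to its set of ancestors.
    def completeness(tool, ref):
        return len(tool & ref) / len(ref)       # shared concepts

    def minimality(tool, ref):
        return 1 - len(tool - ref) / len(tool)  # penalize extra elements

    def structurality(tool_anc, ref_anc):
        scores = []
        for elem in set(tool_anc) & set(ref_anc):
            shared = tool_anc[elem] & ref_anc[elem]  # common ancestors kept
            extra = tool_anc[elem] - ref_anc[elem]   # extra ancestors added
            denom = len(ref_anc[elem] | extra)
            scores.append(len(shared) / denom if denom else 1.0)
        return sum(scores) / len(scores) if scores else 0.0

    def schema_proximity(c, m, s):
        return (c + m + s) / 3                  # average of the three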

As all these metrics have been implemented in our benchmark, we have run different experiments with three schema matching tools to understand their results.

7 EXPERIMENTS REPORT

In this section, we present the evaluation results of the following matching tools: COMA++ [14, 3], Similarity Flooding (SF) [36, 37] and YAM [21, 22]. These tools are described in Section 3.3. The default configuration of SF was used in the experiments. We tested the three pre-configured strategies of COMA++ (AllContext, FilteredContext and NoContext in version 2005b) and kept the best score among the three. YAM includes a machine learning process in charge of tuning its parameters, but each score is an average over 200 runs to reduce the impact of the randomly-chosen training data. When dealing with a dataset with more than two schemas, all schemas are matched two by two. Consequently, the quality value is global, i.e., equal to the average of all individual quality scores obtained for all possible pairs of schemas. The three schema matching tools were evaluated using the datasets presented in Section 5 (available at http://liris.cnrs.fr/~fduchate/research/tools/xbenchmatch/). All experiments were run on a 3.0 GHz laptop with 2 GB RAM. We first discuss the matching quality of these tools for each dataset from our benchmark. Then, we evaluate their time performance.

7.1 Matching Quality Evaluation

We report the results achieved by each schema matching tool over the 10 datasets, with three plots per dataset. The first plot depicts the common metrics: precision, recall and F-measure (see Section 2). The second plot enables a comparison of our human spared resources measure (HSR, the inverse of the post-match effort, see Section 6) with F-measure and overall (negative overall values have been floored to 0 and should be considered not significant, as explained in [36]). Finally, the last plot shows the quality of the integrated schema (only for COMA++ and Similarity Flooding, since YAM does not build an integrated schema), with computed values for completeness, minimality, structurality and schema proximity. We conclude this section with a general discussion of the experimental results.

7.1.1 Betting dataset

Figures 4(a) and 4(b) depict the matching quality for the betting dataset, which features flat schemas from the web. COMA++ obtains the highest precision but discovers fewer correct correspondences than SF. According to their HSR values, both tools manage to spare around 40% of user resources. We also note that SF's overall is negative because its precision is below 50%. Yet, it was able to discover half of the correct correspondences, meaning that the human post-match effort is reduced. YAM achieves an 87% F-measure. Besides, YAM's overall is quite high (78%) while its HSR is lower (54%). The reason is that half of the elements in the schemas do not have a correspondence. The overall metric does not take these unmatched elements into account, while our HSR metric considers that a human expert has to check them, thus increasing the post-match effort and reducing HSR.

For the quality of integrated schemas, shown in Figure 4(c), COMA++ successfully encompasses all concepts (100% completeness) while SF produces the same structure as the expert (100% structurality). Neither tool achieves a minimal integrated schema, i.e., one without redundancies. SF generates the most similar integrated schema w.r.t. the expert one (schema proximity equal to 92%).

[Bar charts comparing COMA++, SF and YAM; values in %]
(a) Correspondence (precision, recall, F-measure)
(b) Correspondence (F-measure, overall, HSR)
(c) Integrated schema (completeness, minimality, structurality, schema proximity; COMA++ and SF only)

Figure 4: Quality obtained for the betting dataset

grated schema w.r.t. the expert one (schema proximity equal to 92%).

7.1.2 Biology dataset

With this large-scale and domain-specific dataset, we notice in Figures 5(a) and 5(b) that the three schema matching tools performed poorly at discovering correspondences (less than 10% F-measure). Thus, their overall values are all negative and therefore considered insignificant. Still, using these schema matching tools enables users to spare a few resources (HSR around 5%). These poor results might be explained by the fact that no external resource such as a domain ontology was provided. An external resource contains knowledge which can be extracted and exploited to facilitate schema matching.
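
To see why all the overall values turn negative here, it helps to recall the usual definition of overall (also known as accuracy) used in schema matching evaluations; the worked numbers below are illustrative, not measured values:

\[ \mathit{Overall} = \mathit{Recall} \times \left( 2 - \frac{1}{\mathit{Precision}} \right) \]

As soon as precision drops below 50%, the factor (2 - 1/Precision) is negative, so overall is negative whatever the recall. For instance, precision = recall = 10% (consistent with an F-measure around 10%) gives Overall = 0.1 x (2 - 10) = -0.8: correcting the tool's output would cost more effort than matching the schemas entirely by hand.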

However, as shown by Figure 5(c), the tools were able to build integrated schemas with acceptable completeness (above 80%) but many redundancies (minimality below 40%) and diverging structures (structurality values of 58% and 41%). These scores can be explained by the failure to discover correct correspondences: since few elements were matched, many schema elements have been added into the integrated schemas, including redundant ones. As for structurality, we believe that the schema matching tools have copied, for unmatched elements, the same structure as that of the input schemas.
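
The interplay between these two measures can be illustrated with a minimal, set-based sketch. The definitions below are deliberately simplified (the actual measures, defined earlier in this paper, also take structure into account), and all schema elements and numbers are hypothetical:

```python
# Minimal, set-based sketch of two integrated-schema quality measures.
# Simplified assumptions (illustrative only, not the paper's exact formulas):
#   completeness: share of the expert schema's elements that the
#                 integrated schema covers
#   minimality:   share of the integrated schema's elements that are
#                 not redundant w.r.t. the expert schema

def completeness(integrated: set, expert: set) -> float:
    return len(integrated & expert) / len(expert)

def minimality(integrated: set, expert: set) -> float:
    return len(integrated & expert) / len(integrated)

# Hypothetical biology-like case: few correspondences were discovered,
# so merging keeps nearly every source element, duplicating concepts.
expert = {"gene", "protein", "organism", "sequence"}
integrated = expert | {"gene_name", "protein_id", "organism_ref",
                       "seq", "locus", "taxon"}

print(f"completeness = {completeness(integrated, expert):.0%}")  # 100%
print(f"minimality   = {minimality(integrated, expert):.0%}")    # 40%
```

Running the sketch reproduces the pattern observed here: completeness stays maximal because every expert concept is covered, while minimality collapses as redundant elements pile up.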

7.1.3 Currency dataset

Figures 6(a) and 6(b) depict the quality obtained for currency, a nested, average-sized dataset. COMA++ discovered more correct correspondences than SF and YAM (respective F-measures of 60%, 33% and 52%). However, YAM achieves the highest recall. Regarding the post-match effort, SF and YAM have a negative overall value, which is not realistic. Indeed, our HSR measure shows that the matching tools spared between 16% and 34% of human resources.

Figure 6(c) shows the quality of the integrated schemas built by COMA++ and SF. The latter manages to build an integrated schema closer to the expert one (83% schema proximity against 62% for COMA++). Although both tools reach 100% completeness, COMA++ avoids more redundancies (due to a better recall) while SF better respects the schema structure.

7.1.4 Finance dataset

Based on Figures 7(a) and 7(b), we analyse the quality of the correspondences for the finance domain. We notice that COMA++ and SF only discover a few correct correspondences (respectively 18% and 45%), probably because of the specific vocabulary of finance. Still, COMA++ achieves a 100% precision. YAM obtains slightly better results, with an F-measure equal to 56%. The interesting point with this dataset deals with the results of SF and YAM: YAM obtains a far better precision than SF, while the latter manages to discover one more correct correspondence than YAM. Although YAM's F-measure is 15% better than SF's, their HSR is the same (41%). With SF, the user has to process around 40% more invalidations, but these extra invalidations are compensated by the discovery of one more correct correspondence (which the YAM user has to find manually through (in)validations).

[Bar charts comparing COMA++, SF and YAM; values in %]
(a) Correspondence (precision, recall, F-measure)
(b) Correspondence (F-measure, overall, HSR)
(c) Integrated schema (completeness, minimality, structurality, schema proximity; COMA++ and SF only)

Figure 5: Quality obtained for the biology dataset

In other words, the 40% extra precision obtained by YAM over SF is equivalent, for this dataset and in terms of user (in)validations, to the 4% extra recall obtained by SF over YAM. This clearly shows that favouring recall over precision better reduces the post-match effort in most cases.
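
A back-of-the-envelope counting model makes this trade-off concrete. Assuming, as a simplification of HSR, that the user performs one action per false positive (invalidation) and one per false negative (manual addition), the hypothetical figures below mimic the SF/YAM situation described above:

```python
# Back-of-the-envelope model of post-match effort: one user action per
# false positive (invalidation) and one per false negative (manual
# addition). All figures are hypothetical; HSR's exact weighting is
# defined earlier in the paper.

def post_match_actions(true_pos: int, false_pos: int, expert_total: int) -> int:
    false_neg = expert_total - true_pos
    return false_pos + false_neg

EXPERT_TOTAL = 25  # hypothetical size of the expert correspondence set

# Tool A (YAM-like): better precision, one fewer correct correspondence.
# Tool B (SF-like):  more invalidations, one more correct correspondence.
actions_a = post_match_actions(true_pos=14, false_pos=6, expert_total=EXPERT_TOTAL)
actions_b = post_match_actions(true_pos=15, false_pos=7, expert_total=EXPERT_TOTAL)

print(actions_a, actions_b)  # 17 17: the extra invalidation is exactly
                             # offset by one fewer manual addition
```

Under this simplified model, each additional correct correspondence cancels out one additional invalidation, which is why two tools with noticeably different F-measures can end up with the same HSR.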

Figure 7(c) depicts the quality of the integrated schemas for the finance dataset. The experimental results indicate that both tools produced an integrated schema which is 80% similar to the expert one, mainly thanks to a high completeness.


[Bar charts comparing COMA++, SF and YAM; values in %]
(a) Correspondence (precision, recall, F-measure)
(b) Correspondence (F-measure, overall, HSR)
(c) Integrated schema (completeness, minimality, structurality, schema proximity; COMA++ and SF only)

Figure 6: Quality obtained for the currency dataset

7.1.5 Order dataset

This experiment deals with large schemas whose labels are normalized. As in the other large-scale scenario (biology), the schema matching tools do not perform well on this order dataset (F-measures below 40%), as depicted by Figures 8(a) and 8(b). Contrary to most datasets, the tools obtain higher recall values than precision values; the normalized labels of the schema elements might explain why their precisions are so low. Although they discover 30% to 40% of the correct correspondences, their overall values are negative. However, our HSR measure indicates that using COMA++, SF or YAM spares respectively 11%, 16% and 21% of human resources.

[Bar charts comparing COMA++, SF and YAM; values in %]
(a) Correspondence (precision, recall, F-measure)
(b) Correspondence (F-measure, overall, HSR)
(c) Integrated schema (completeness, minimality, structurality, schema proximity; COMA++ and SF only)

Figure 7: Quality obtained for the finance dataset

As for the quality of the integrated schema, given by Figure 8(c), both tools achieve a schema proximity above 70%, mainly due to a high completeness. This high completeness is again a consequence of the poor correspondence discovery: unmatched elements are simply added to the integrated schema. In that case, users may be more interested in the structurality and minimality results.

