The Reuse Matcher - SCHEMA MATCHING AND MAPPING-BASED DATA INTEGRATION

Figure 6.4 illustrates the schema-level reuse approach implemented in the Reuse matcher. All schemas and previous match results are persistently stored in the repository and can be exploited for reuse. In memory, a graph of existing schemas and mappings is maintained, as described in Section 6.1, for easy identification of reuse opportunities. Given two schemas S₁ and S₂ to match, Reuse identifies all reuse opportunities in each of three categories: direct mappings, complete mapping paths, and incomplete mapping paths. If a pivot schema is specified, only those mapping paths involving the pivot schema are considered. The example in Figure 6.4 shows one direct mapping, two complete mapping paths, and two incomplete mapping paths for the match problem S₁↔S₂.
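The classification into the three categories can be sketched as follows. This is a minimal illustration, not the Reuse implementation: the function name `find_reuse_opportunities`, the representation of stored mappings as unordered schema pairs, and the restriction to paths with a single intermediary schema are all assumptions made for brevity. A path is treated as incomplete when exactly one of its two mappings is already stored, and the pivot restricts only the intermediary schema.

```python
def find_reuse_opportunities(schemas, mappings, s1, s2, pivot=None):
    """Classify reuse opportunities for the match problem s1 <-> s2.

    schemas:  iterable of schema names held in the repository
    mappings: set of frozenset({a, b}) pairs with a stored match result
    pivot:    optional schema every mapping path must pass through (assumption:
              the direct mapping is not affected by the pivot)
    """
    def has(a, b):
        return frozenset((a, b)) in mappings

    direct = [(s1, s2)] if has(s1, s2) else []
    complete, incomplete = [], []
    for mid in sorted(schemas):
        if mid in (s1, s2):
            continue
        if pivot is not None and mid != pivot:
            continue
        left, right = has(s1, mid), has(mid, s2)
        if left and right:
            complete.append((s1, mid, s2))      # both mappings stored
        elif left or right:
            incomplete.append((s1, mid, s2))    # one mapping still to be matched
    return {"direct": direct, "complete": complete, "incomplete": incomplete}


repo_schemas = ["S1", "S2", "Sa", "Sb", "Sc"]
repo_mappings = {
    frozenset(("S1", "S2")),
    frozenset(("S1", "Sa")), frozenset(("Sa", "S2")),
    frozenset(("S1", "Sb")),
}
print(find_reuse_opportunities(repo_schemas, repo_mappings, "S1", "S2"))
```

Schemas such as `Sc`, which share no stored mapping with either input schema, yield no reuse opportunity in this sketch.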

Figure 6.4 Schema-level reuse in the Reuse matcher

[Figure: for the match problem S₁↔S₂, a search of the repository yields direct mappings (S₁↔S₂), complete mapping paths (S₁↔S_j, S_j↔S₂ and S_k↔S₁, S₂↔S_k), and incomplete mapping paths. Match computes the missing mappings, MatchCompose derives one S₁↔S₂ mapping per path, the results form a similarity cube, and aggregation produces a similarity matrix. Legend: mapping available; not yet matched.]

In interactive mode, the user may choose one of the proposed reuse possibilities, such as the direct mapping in the example above, or combine several of them to derive a single match result. In automatic mode, Reuse ranks the identified possibilities according to several criteria, such as the average schema similarity in the path and the expected computational effort (expressed by the path length and the size of the involved schemas and mappings), and uses a predefined number of the best mapping paths. If incomplete mapping paths are to be evaluated, the default match operation is first applied to solve the open match tasks.
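The automatic ranking can be illustrated with a small sketch. The text does not state how the criteria are weighted, so the ordering used here (higher average schema similarity first, shorter paths as a tie-breaker) is an assumption, as are the function name `rank_mapping_paths` and the `schema_sim` lookup table. The size of the schemas and mappings is left out for brevity.

```python
def rank_mapping_paths(paths, schema_sim, k):
    """Return the k best mapping paths.

    paths:      tuples of schema names, e.g. ("S1", "Sa", "S2")
    schema_sim: dict mapping frozenset({a, b}) to a schema similarity in [0, 1]
    k:          predefined number of best paths to keep
    """
    def avg_sim(path):
        steps = list(zip(path, path[1:]))   # consecutive schema pairs in the path
        return sum(schema_sim[frozenset(step)] for step in steps) / len(steps)

    # Assumed ordering: average similarity descending, then fewer schemas first.
    return sorted(paths, key=lambda p: (-avg_sim(p), len(p)))[:k]


sims = {
    frozenset(("S1", "S2")): 0.625,
    frozenset(("S1", "Sa")): 0.5, frozenset(("Sa", "S2")): 0.75,
    frozenset(("S1", "Sb")): 0.75, frozenset(("Sb", "S2")): 0.75,
}
candidates = [("S1", "Sa", "S2"), ("S1", "Sb", "S2"), ("S1", "S2")]
print(rank_mapping_paths(candidates, sims, k=2))
```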

With all match results available, MatchCompose is applied to each mapping path to produce an S₁↔S₂ mapping, which represents a possible result for the given match task. Similar to the match results returned by multiple matchers, these mappings constitute a similarity cube with three dimensions: S₁ elements, S₂ elements, and the mapping paths used to derive the single mappings, as shown in Figure 6.4. The similarity cube is then aggregated along the mapping path dimension using an aggregation strategy (e.g., Average; see Section 7.3) to obtain a similarity matrix, which is in turn stored together with the results of other matchers in the similarity cube of a match iteration for aggregation and selection as described in Section 5.2.
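The composition and aggregation steps can be sketched in a few lines. This is an illustrative reading, not the COMA++ code: mappings are modelled as dictionaries from element pairs to similarity values, the names `match_compose` and `aggregate_paths` are hypothetical, and two details are assumptions: when several intermediary elements connect the same pair, the highest composed value is kept, and a pair that a path does not produce contributes a similarity of 0 to the cube.

```python
from statistics import mean

STRATEGIES = {"Min": min, "Max": max, "Average": mean}


def match_compose(m1, m2, strategy="Average"):
    """Join mapping m1 (S1 <-> Sj) with m2 (Sj <-> S2) on shared Sj elements."""
    combine = STRATEGIES[strategy]
    result = {}
    for (a, b), sim_ab in m1.items():
        for (b2, c), sim_bc in m2.items():
            if b == b2:  # transitive correspondence a <-> b <-> c
                sim = combine([sim_ab, sim_bc])
                # Assumption: keep the best value if (a, c) is reached twice.
                result[(a, c)] = max(sim, result.get((a, c), 0.0))
    return result


def aggregate_paths(path_results, strategy="Average"):
    """Collapse the mapping-path dimension of the similarity cube."""
    combine = STRATEGIES[strategy]
    pairs = set().union(*path_results)
    # Assumption: a pair missing from a path result counts as similarity 0.
    return {p: combine([r.get(p, 0.0) for r in path_results]) for p in pairs}


s1_sj = {("s11", "j1"): 1.0, ("s12", "j2"): 0.5}
sj_s2 = {("j1", "s21"): 0.5, ("j2", "s22"): 0.5}
via_sj = match_compose(s1_sj, sj_s2)        # one layer of the cube
via_sk = {("s11", "s21"): 0.25}             # result of a second mapping path
print(aggregate_paths([via_sj, via_sk]))    # similarity matrix as a dict
```

Using one strategy table for both functions mirrors the Composition/Aggregation parameter of the Reuse matcher, which applies the same strategy to both steps.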

For a flexible specification of reuse strategies, the Reuse matcher supports four configuration parameters: Mapping path length, Top-k mapping paths, Pivot schema, and Composition/Aggregation strategy. While the first three are combined using the logical AND operator to filter the mapping paths to be considered, the last one influences how similarity values are composed and aggregated:

• Mapping path length: This parameter can be used to specify the exact or maximal length of the mapping paths to be identified. If not specified, all mapping paths are taken into consideration.

• Top-k mapping paths: If specified, this parameter determines the number of the best mapping paths to be selected from the ranking of all mapping paths. Otherwise, all mapping paths are used.

• Pivot schema: If a pivot schema is given, only those mapping paths that involve it as an intermediary schema are retained. Otherwise, all mapping paths are taken into consideration.

• Composition/Aggregation strategy: This parameter, either Min, Max, or Average, sets the strategy to compose transitive similarities in the MatchCompose operation and to aggregate the MatchCompose results stored in the temporary similarity cube. That is, for simplicity, we use the same strategy for both composition and aggregation of similarity values.
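How the first three parameters might act together as an AND filter is sketched below. The class `ReuseConfig`, its field names, and the function `select_paths` are hypothetical. The sketch models only the maximal (not the exact) path length, measures length as the number of mappings in a path, and interprets Top-k literally as "among the first k of the full ranking"; applying Top-k after the other filters would be an equally plausible reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReuseConfig:
    max_path_length: Optional[int] = None   # None: no length restriction
    top_k: Optional[int] = None             # None: use all mapping paths
    pivot: Optional[str] = None             # None: no pivot schema required
    strategy: str = "Average"               # Min, Max, or Average


def select_paths(ranked_paths, config):
    """Filter mapping paths (tuples of schema names, best first)."""
    selected = []
    for rank, path in enumerate(ranked_paths):
        length = len(path) - 1               # number of mappings in the path
        length_ok = config.max_path_length is None or length <= config.max_path_length
        top_k_ok = config.top_k is None or rank < config.top_k
        pivot_ok = config.pivot is None or config.pivot in path[1:-1]
        if length_ok and top_k_ok and pivot_ok:   # logical AND of the three filters
            selected.append(path)
    return selected


ranked = [("S1", "Sb", "S2"), ("S1", "S2"), ("S1", "Sa", "S2"), ("S1", "Sa", "Sb", "S2")]
print(select_paths(ranked, ReuseConfig(max_path_length=2, pivot="Sa")))
```

The `strategy` field is carried along only so that a single configuration object can also be handed to the composition and aggregation steps.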

Although the Reuse matcher operates at a coarse level of reuse (match results for entire schemas), we believe there is a high probability of finding the match result pairs required by MatchCompose in an environment where many schemas are managed and matched to each other. In particular, with a central repository, all computed match results can be saved, increasing the potential for reuse over time. Moreover, schemas from the same application domain usually contain many similar elements that are typical of this domain, so that their mappings can provide good candidates for reuse.

6.5 Summary

In this chapter, we presented a new match approach based on the reuse of previously obtained match results. It assumes the transitivity of similarity relationships between previously identified matching elements. In particular, our approach determines relevant mapping paths consisting of two or more mappings that successively share a common schema, and performs a join-like operation on each such mapping path to derive a new mapping. The results of multiple mapping paths can be aggregated and combined using the same combination scheme as the results obtained from multiple matchers. The determination of mapping paths may also consider a selected pivot schema and "lightweight" match tasks, which can be quickly computed for combination with the existing mappings. The entire reuse approach is implemented as a standalone matcher and can be invoked like, and combined with, other matchers in the Matcher Library.

CHAPTER 7

MATCHER CONSTRUCTION

Starting with a set of hybrid matchers based on similarity functions such as string matching, dictionary lookups, and the reuse of previous match results, we support the construction of more powerful combined matchers from existing matchers, i.e., from both hybrid and combined matchers. While COMA used separate implementations of combined matchers, COMA++ defines them uniformly using a generic and customizable implementation, the so-called CombineMatcher. In the following, we first present CombineMatcher (Section 7.1) and possible configuration strategies (Sections 7.2 and 7.3). We then discuss the default combination of the combined matchers currently defined in the Matcher Library (Section 7.4). Section 7.5 briefly summarizes the chapter.
