We discuss the evaluations of several recent schema matching prototypes, in particular AUTOPLEX [7], AUTOMATCH [8], LSD [34], GLUE [35], IMAP [28], SEMINT [83, 84, 85], and SIMILARITYFLOODING (SF) [93]. We also encountered a number of systems that either have not been evaluated, such as CLIO [66, 105, 114, 61], DIKE [112], MOMIS [9, 20], ONION [98, 99], and TRANSCM [97], or whose evaluations have not been described in sufficient detail, such as DELTA [23] and the work of [42]. Those systems are not considered in our study.
AUTOPLEX and AUTOMATCH
Both prototypes depend on a domain global schema, against which source schemas are matched. In both evaluations, the global schemas were rather small, containing 15 and 9 attributes, respectively [7, 8]. No information about the characteristics of the source schemas involved was given. First, the source schemas were matched manually to the global schema, resulting in 21 and 22 mappings in the AUTOPLEX and AUTOMATCH evaluations, respectively. These mappings were divided into three portions of approximately equal size. The test was then carried out in three runs, each using two portions for learning and the remaining portion for matching. Neither evaluation documented the test machine or reported the execution times required for the match tasks.
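The three-run protocol described above can be sketched as follows. This is only an illustration: the evaluations do not say how the portions were formed, so dealing the mappings round-robin is an assumption, and the function name and placeholder mapping labels are hypothetical.

```python
def three_run_splits(mappings, portions=3):
    """Yield (training, test) pairs; each run holds out one portion for matching."""
    # Assumption: deal mappings round-robin so portion sizes differ by at most one.
    parts = [mappings[i::portions] for i in range(portions)]
    for held_out in range(portions):
        # The two remaining portions are used for learning.
        training = [m for i, part in enumerate(parts) if i != held_out for m in part]
        yield training, parts[held_out]


# 21 placeholder mappings, the number reported for the AUTOPLEX evaluation.
mappings = [f"mapping_{i}" for i in range(21)]
for run, (training, test) in enumerate(three_run_splits(mappings), start=1):
    print(f"run {run}: {len(training)} mappings for learning, {len(test)} for matching")
```

With 21 mappings, each run learns from 14 mappings and matches the remaining 7, and every mapping is used for matching exactly once.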
The AUTOPLEX evaluation used the quality measures Precision and Recall, while for AUTOMATCH, Fmeasure was employed. However, the measures were determined not for single experiments but for the entire evaluation: the true/false positives and negatives were counted over all match tasks. For AUTOPLEX, they were reported separately for table and column matches. We recompute the measures to consider all matches and obtain a Precision of 0.84 and Recall of 0.82, corresponding to an Fmeasure of 0.82 and Overall of 0.66. Furthermore, the numbers of true/false positives and negatives were rather small despite being counted over multiple tasks, which suggests that the source schemas must have been very small. For AUTOMATCH, the impact of different methods for sampling training data on match quality was studied. The highest Fmeasure reported was 0.72, so the corresponding Overall must be lower.
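Counting over the entire evaluation, rather than per experiment, can be sketched as below. The sketch assumes the usual definition Overall = Recall * (2 - 1/Precision), which is not restated in this section. The per-run counts are hypothetical and are not taken from the AUTOPLEX or AUTOMATCH evaluations, and the function name is illustrative.

```python
def quality_from_pooled_counts(task_counts):
    """Pool (TP, FP, FN) over all match tasks, then compute each measure once.

    Assumes at least one true positive, so no denominator is zero.
    """
    tp = sum(c[0] for c in task_counts)
    fp = sum(c[1] for c in task_counts)
    fn = sum(c[2] for c in task_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fmeasure = 2 * precision * recall / (precision + recall)
    # Assumed definition: Overall = Recall * (2 - 1/Precision) = (TP - FP) / (TP + FN).
    overall = recall * (2 - 1 / precision)
    return precision, recall, fmeasure, overall


# Hypothetical (TP, FP, FN) counts for three runs.
counts = [(6, 1, 1), (5, 1, 2), (7, 2, 0)]
precision, recall, fmeasure, overall = quality_from_pooled_counts(counts)
print(f"P={precision:.2f} R={recall:.2f} F={fmeasure:.2f} Overall={overall:.2f}")
```

Because wrong correspondences are subtracted in Overall, it never exceeds Fmeasure for the same counts.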
LSD, GLUE, and IMAP
LSD [34] was tested for 4 domains, in each of which 5 data sources were matched to a manually constructed global schema, resulting in 20 match tasks altogether. To match a particular source, 3 other sources from the same domain were used for training. The source schemas were rather small (14-48 elements), while the largest global schema had 66 attributes. GLUE was evaluated for 3 domains, in each of which two website taxonomies were matched in two different directions, i.e., A→B and B→A [35]. The taxonomies were relatively large, containing up to 300 elements. IMAP was tested with 4 match tasks from 4 domains. The schema sizes range from 19 to 44 elements. All three systems rely on pre-match effort, on the one hand to train the learners, and on the other hand to specify domain constraints and synonyms. None of the evaluations reported the execution times required by the systems.
For all systems, the quality obtained with different learner combinations was determined in order to find the best configuration. For LSD, the impact of the amount of available instance data on match quality was studied. IMAP was evaluated for identifying 1:1 and m:n correspondences, respectively, and using different methods for match candidate selection (Top-1 and Top-3). Match quality was estimated using a single measure, called match accuracy, defined as the percentage of the matchable source attributes that are matched correctly. It corresponds to Recall in our definition because one single correspondence is returned for each source element. Furthermore, we observe that for single match tasks the Precision can at most equal the presented Recall, namely when all source elements are matchable. Based on this conclusion, it is possible to estimate the highest possible Fmeasure (=Recall) and Overall (=2*Recall-1) for the single evaluations.
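The two bounds can be checked by substituting the best case, Precision = Recall, into the measures. The derivation assumes the usual definitions Fmeasure = 2PR/(P+R) and Overall = R(2 - 1/P), which are not restated in this section; P, R, and r are shorthand introduced here for Precision, Recall, and their common value.

```latex
\begin{align*}
\text{Fmeasure} &= \frac{2PR}{P+R} \;\overset{P=R=r}{=}\; \frac{2r^{2}}{2r} = r,\\
\text{Overall}  &= R\left(2-\frac{1}{P}\right) \;\overset{P=R=r}{=}\; r\left(2-\frac{1}{r}\right) = 2r-1.
\end{align*}
```

For a fixed Recall, both measures grow with Precision, so these values are upper bounds whenever Precision cannot exceed Recall. A match accuracy of 0.8, for example, allows an Overall of at most 0.6.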
Figure 13.1 Match quality of LSD [34]
Figure 13.1 shows the quality of different learner combinations in LSD. Usually, the best quality was achieved when all learners were involved. Interestingly, we observe that the three systems exhibited quite similar quality. On average (over all domains), LSD and GLUE each achieved a Recall of ~0.8, which is similar to that of IMAP when identifying 1:1 correspondences. This corresponds to an Overall of about 0.6. As for m:n matches, IMAP showed some degradation in quality, with an average Recall of only around 0.6, which was, however, improved to ~0.8 with the Top-3 strategy. This improvement was due to the optimistic way of counting correct correspondences to compute the quality measure. In particular, the 3 best matching candidates are returned for each schema element and count as one correct correspondence if the correct candidate is among the three.
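The effect of Top-3 counting can be illustrated with a small sketch. The function and the element names below are hypothetical and only mimic the counting rule described above; they are not taken from the IMAP evaluation.

```python
def match_accuracy_top_k(ranked_candidates, reference, k):
    """Fraction of matchable source elements whose correct target is in the top k."""
    hits = sum(
        1
        for source, target in reference.items()
        if target in ranked_candidates.get(source, [])[:k]
    )
    return hits / len(reference)


# Hypothetical ranked match candidates per source element (best candidate first).
ranked = {
    "price": ["cost", "amount", "fee"],
    "name": ["title", "label", "fullname"],
    "addr": ["city", "street", "address"],
}
# Hypothetical correct target for each matchable source element.
reference = {"price": "cost", "name": "fullname", "addr": "location"}

print(match_accuracy_top_k(ranked, reference, k=1))  # only "price" counts: 1 of 3
print(match_accuracy_top_k(ranked, reference, k=3))  # "price" and "name" count: 2 of 3
```

Under Top-3 counting a user still has to pick the right candidate among the three, so the higher accuracy does not translate directly into less manual effort.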
SIMILARITYFLOODING
The SIMILARITYFLOODING (SF) evaluation [93] used 9 match tasks defined from 18 schemas (XML and relational) taken from different application domains. The schemas were small, with the number of elements ranging from 5 to 22, and showed a relatively high similarity to each other (0.75 on average). Seven users were asked to perform the manual match process in order to obtain subjective match results. For each match task, the results returned by the system were compared against all subjective results to estimate the automatic match quality, for which the Overall measure was used. Other experiments were also conducted to compare the effectiveness of different filters and formulas for fix-point computation, and to measure the impact of randomizing the similarities in the initial mapping on match accuracy. The best configuration was identified and used in SF. Figure 13.2 shows the Overall values achieved in the single match tasks according to the match results suggested by the single users. The average Overall quality over all match tasks and all users is around 0.6. As with the previous systems, no execution time was reported.
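Comparing one automatic result against several subjective results can be sketched as follows. The sketch assumes Overall = (correct - wrong) / |reference|, which is equivalent to Recall * (2 - 1/Precision); the correspondences and the two user results are hypothetical, not data from [93].

```python
def overall(proposed, reference):
    """Overall of a proposed set of correspondences against one reference set."""
    correct = len(proposed & reference)  # true positives
    wrong = len(proposed - reference)    # false positives
    return (correct - wrong) / len(reference)


def average_overall(proposed, user_references):
    """Average Overall of one automatic result over all users' reference results."""
    return sum(overall(proposed, ref) for ref in user_references) / len(user_references)


# Hypothetical automatic result and two users' manual results for one match task.
proposed = {("a", "x"), ("b", "y"), ("c", "z")}
users = [
    {("a", "x"), ("b", "y")},
    {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")},
]
print([overall(proposed, ref) for ref in users])
print(average_overall(proposed, users))
```

The same automatic result scores differently against each user, which is why the evaluation reports values per user and per task before averaging.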
SEMINT
A preliminary test consisting of 3 experiments was described in [83]. The test schemas were small, with mostly fewer than 10 attributes. However, the quality achieved in these experiments was only presented later in [84, 85]. In these small tasks, SEMINT performed very well and achieved very high Precision (0.9, 1.0, 1.0) and Recall (1.0). A second evaluation was described in [84, 85], involving two other match tasks. In the bigger match task, with schemas of up to 260 attributes, SEMINT surprisingly performed very well (Precision ~0.8, Recall ~0.9). But in the smaller task, with schemas containing only around 40 elements, the quality dropped drastically (Precision 0.20, Recall 0.38).
Figure 13.2 Match quality of SIMILARITYFLOODING [93]
On average over the 5 experiments, SEMINT achieved a Precision of 0.78 and Recall of 0.86. Using the Precision and Recall values presented for each experiment, we can also compute the average Fmeasure, 0.81, and Overall, 0.48. On the other hand, it is necessary to take into consideration that this match quality was determined from match results of attribute clusters, each of which possibly contains multiple 1:1 correspondences. In addition to the match tasks, further tests were performed to measure the sensitivity of the single match criteria employed by SEMINT [84]. The results made it possible to identify a minimal subset of match criteria that could still retain the overall effectiveness.
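The averages above can be reproduced approximately from the per-experiment values quoted in the text. The sketch assumes that Recall was 1.0 in all three preliminary experiments, takes the approximate values ~0.8 and ~0.9 at face value, and uses the usual definitions Fmeasure = 2PR/(P+R) and Overall = R(2 - 1/P); the helper names are illustrative.

```python
# (Precision, Recall) per experiment, as quoted in the text.
experiments = [(0.9, 1.0), (1.0, 1.0), (1.0, 1.0), (0.8, 0.9), (0.20, 0.38)]


def fmeasure(p, r):
    return 2 * p * r / (p + r)


def overall(p, r):
    # Assumed definition: Overall = Recall * (2 - 1/Precision).
    return r * (2 - 1 / p)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


avg_precision = mean(p for p, _ in experiments)
avg_recall = mean(r for _, r in experiments)
# Fmeasure and Overall are computed per experiment first, then averaged.
avg_fmeasure = mean(fmeasure(p, r) for p, r in experiments)
avg_overall = mean(overall(p, r) for p, r in experiments)

print(f"P={avg_precision:.2f} R={avg_recall:.2f} "
      f"F={avg_fmeasure:.2f} Overall={avg_overall:.2f}")
```

The fifth experiment has a Precision below 0.5, so its Overall is negative, which pulls the average Overall far below the average Fmeasure.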
Besides quality, the runtime performance of the system was also reported. In particular, execution time was negligible for the small match tasks in the first evaluation [83]. However, in the larger match tasks of the second evaluation [84, 85], a large amount of time (several hours) was required for training the neural network alone. This indicates a scalability problem of instance-based approaches with respect to the schema size and the amount of instance data to be processed.