5.2 KSMS Overview
5.2.2 Element-level Matching
For element-level matching, the Hybrid-RDR approach is used. An example of the detailed matching process is described below:
At the beginning of the matching process, one dataset is selected randomly for training and another dataset for testing. All the cases of the test dataset, each with 73 features (described in Chapter 4), are shown in the Case Browser in Figure 5.4.
Figure 5.4: The Case Browser showing cases with 73 features (not all features are visible)
In Figure 5.4, all the cases imported from a dataset are shown in the Case Browser by the All Cases button. The UnClassified Cases button is used to show NULL classified cases. The Hybrid-RDR approach works in two phases: a training phase and a classification/matching phase. In the training phase, the Training by ML button is used at the beginning to build a training model for one dataset with the J48 decision tree. This button is used only once. The purpose of the model is to classify whether a given element pair is matched or not, based on its feature similarity measures. For J48, 10-fold cross validation is used. In the classification phase, the Classify button is used to classify all the cases in the Case Browser with the trained model. Each result is a true positive (if the match reported by an expert manual mapping is TRUE and the match predicted by the algorithm is TRUE), a false positive (if the reported match is FALSE and the predicted match is TRUE), a true negative (if the reported match is FALSE and the predicted match is FALSE) or a false negative (if the reported match is TRUE and the predicted match is FALSE). These are displayed by the TRUE Positives, FALSE Positives, TRUE Negatives and FALSE Negatives buttons respectively.
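The four outcome categories can be sketched in a few lines. This is a minimal illustration only: the function name `outcome`, the short labels and the sample pairs are assumptions made for the example and are not taken from KSMS.

```python
from collections import Counter


def outcome(reported_match: bool, predicted_match: bool) -> str:
    """Label one case by comparing the expert's reported match
    with the match predicted by the algorithm."""
    if predicted_match:
        return "TP" if reported_match else "FP"
    return "FN" if reported_match else "TN"


# Hypothetical (reported, predicted) pairs, invented for illustration.
pairs = [(True, True), (False, True), (False, False), (True, False), (True, True)]
counts = Counter(outcome(reported, predicted) for reported, predicted in pairs)
print(counts)
```

The cases labelled FP and FN in this sketch correspond to the ones that the knowledge acquisition process later has to correct.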
The false positive cases are shown in Figure 5.4: the attribute Class provided by an expert is FALSE, but the Classification provided by the algorithm is TRUE. The FALSE Negatives button shows the cases where the attribute Class provided manually by an expert is TRUE, but the Classification provided by the algorithm is FALSE. These false positive and false negative schema matching problems are solved using the knowledge acquisition process. The Edit Classification button is used to refine the incorrectly classified cases, either by adding new conditions until all incorrect cases are removed or by creating another new rule using the knowledge acquisition GUI. The classification for a censor rule is always NULL. Editing the classification requires the knowledge acquisition GUI, which is displayed in Figure 5.5.
Figure 5.5: Knowledge acquisition
In Figure 5.5, the parent condition is Decision Tree, which gives the incorrect classification for the current case. In order to edit the parent rule, it is not necessary to select a classification, as the classification for censor rules is always NULL. First, the rule conditions are added. For each condition in the rule, an attribute, an operator and a value are selected from the drop-down boxes that list all the attributes, operators and values respectively. After a condition is selected, the Add Condition button is used to add it. More than one condition can be added, and a condition can be deleted using the Delete Selected button if users think that it is not appropriate. The Satisfy Condition button checks whether the rule is satisfied by the selected case. In this figure, the rule is satisfied by the current case, so the Validate New Rule button becomes active; it validates the rule against all the incorrectly classified cases. The cases that satisfy the rule are shown in Figure 5.6.
Figure 5.6: Cases that satisfy the rule
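The checks behind the Satisfy Condition and Validate New Rule buttons can be sketched as follows. This is a minimal illustration under stated assumptions, not the KSMS implementation: the dictionary-based case representation, the names `satisfies` and `validate`, and the feature values are all hypothetical. Only the (attribute, operator, value) structure of a condition comes from the description above.

```python
import operator

# Operators offered for rule conditions (assumed set, based on Table 5.1).
OPERATORS = {
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}


def satisfies(case, conditions):
    """True if the case meets every (attribute, operator, value) condition."""
    return all(OPERATORS[op](case[attribute], value)
               for attribute, op, value in conditions)


def validate(conditions, cases):
    """Return the cases, among those given, that satisfy the rule."""
    return [case for case in cases if satisfies(case, conditions)]


# A rule with two conditions, in the style of rule 12 in Table 5.1.
conditions = [("Lev_ST", "<=", 0.2), ("Smith_TokS", "<", 0.3)]

# Hypothetical incorrectly classified cases with invented feature values.
misclassified = [
    {"id": "a", "Lev_ST": 0.1, "Smith_TokS": 0.2},
    {"id": "b", "Lev_ST": 0.1, "Smith_TokS": 0.5},
    {"id": "c", "Lev_ST": 0.4, "Smith_TokS": 0.1},
]

covered = validate(conditions, misclassified)
print([case["id"] for case in covered])  # prints ['a']
```

Conditions are combined with a logical "and", so a case is covered only when every condition holds.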
In Figure 5.6, Reported Match shows the manual matching results, and Algorithmic Match shows the results produced using rules. The knowledge acquisition process makes the incorrectly classified cases NULL classified. The Save Rule button saves the rule in the rule database (KB) as a censor rule and the cases in the case database. The unclassified cases are shown by the UnClassified Cases button of Figure 5.4. For these cases, the attribute Class provided by the expert is TRUE or FALSE, but the attribute Classification provided by the rule is NULL.
In Figure 5.4, the Add Classification button is used to add alternative rules to correct the classification (TRUE or FALSE). These rules are added by the knowledge acquisition process, as in Figure 5.5. The difference is that there is no parent rule in this process, because the rule is added to classify the NULL classified cases. To add a classification using the knowledge acquisition GUI, as in Figure 5.5, the classification of the rule is selected first. This is done using the drop-down box at the top, which lists the TRUE and FALSE classifications for this domain. Having selected the classification, the conditions for creating a rule are added. It is then checked whether the rule is satisfied by the current case. If the rule is satisfied, it is then validated to determine whether the conclusion provided by the rule matches the reported match. As the rule is satisfied here, the Validate New Rule button becomes active. The cases that validate the rule are shown in Figure 5.7.
Figure 5.7: Save correct cases and edit classifications
In Figure 5.7, the Save Rule button is used to save all the cases, the alternative rule and the correct classifications. After the rule is saved, the Save Rule button becomes inactive to avoid duplicate saving. In the rule editor, some cases incorrectly satisfy the rule, so knowledge acquisition is again required to delete the cases. In this way, rules are added incrementally to the KB to solve incorrect classifications. An example of the KB of the Hybrid-RDR approach for classifying schema elements/cases is shown in Table 5.1.
In Table 5.1, the rule types GB and R represent a ground breaking rule and a refine rule respectively. A ground breaking rule is used as an alternative rule, and the conclusion of this rule is either TRUE or FALSE. A refine rule is used as a censor rule, an exception rule or a stopping rule, and the conclusion of this rule is NULL. The abbreviations used in the conditions are: Lev (Levenshtein function), S (source schema), T (target schema), SynT (synonym of target), AbbSynT (abbreviation and synonym of target), AbbTokS (abbreviation and tokenization of source), TokT (tokenization of target), AbbT (abbreviation of target), TokSynT (tokenization and synonym of target), AbbTokSynT (abbreviation, tokenization and synonym of target), AbbSynS (abbreviation and synonym of source), TokS (tokenization of source), Mon (MongeElkan function), Smith (SmithWaterman function), JaroM (JaroMeasure function), Needle (NeedlemanWunsch function) and JaroW (JaroWinkler function). The values 1.0, 0.9, 0.2, 0.3, 0.8, 0.6 are thresholds. For example, the rule JaroW_ST==0.9 means that if the value of the JaroWinkler function applied to source and target is equal to the threshold value 0.9, the conclusion is then TRUE.
Table 5.1: An example of KB for classifying cases using Hybrid-RDR
RID  PID  RType  Condition                                                     Conclusion  Classified Cases
1    0    0      0                                                             0           0
2    1    GB     Decision Tree                                                 TRUE/FALSE  ALL
3    2    R      Lev_SynT == 1.0                                               NULL        964
4    1    GB     Lev_TokSynT == 1.0 and JaroW_AbbTokT == 0.9                   TRUE        964
5    2    R      JaroW_ST == 0.9                                               NULL        289
6    1    GB     JaroW_ST == 0.9 and Mon_AbbSynS == 0.8                        TRUE        289
7    2    R      Lev_AbbSynT == 0.8                                            NULL        785
8    1    GB     JaroW_TokSynT == 0.8                                          TRUE        785
9    2    R      JaroM_AbbTokS == 1.0 and JaroW_AbbT == 0.9                    NULL        1049
10   1    GB     Lev_AbbSynS == 1.0 and JaroW_AbbT == 0.9                      TRUE        1049
11   2    R      Lev_ST <= 0.2                                                 NULL        567
12   1    GB     Lev_ST <= 0.2 and Smith_TokS < 0.3                            FALSE       567
13   2    R      Needle_TokSynT == 0.1                                         NULL        234
14   1    GB     Needle_AbbTokSynT == 0.1                                      FALSE       234
15   12   R      Lev_TokT == 1.0                                               NULL        975
16   1    GB     Lev_TokT == 1.0 and Mon_SynT == 1.0 and Smith_TokSynT >= 0.9  TRUE        975
17   6    R      Mon_TokT == 0.3                                               NULL        640
18   1    GB     Lev_TokT == 0.3                                               FALSE       640
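The naming scheme of the feature names in Table 5.1 (similarity function, underscore, operand) can be decoded with a simple lookup. The sketch below is illustrative only: the helper `describe` and the two dictionaries are assumptions built from the abbreviation legend, plus the ST operand taken from the JaroW_ST example. Operands that the legend does not list, such as AbbTokT, are deliberately left out rather than guessed.

```python
# Similarity functions, as named in the legend of Table 5.1.
FUNCTIONS = {
    "Lev": "Levenshtein",
    "Mon": "MongeElkan",
    "Smith": "SmithWaterman",
    "JaroM": "JaroMeasure",
    "Needle": "NeedlemanWunsch",
    "JaroW": "JaroWinkler",
}

# Operands listed in the legend; "ST" comes from the JaroW_ST example.
OPERANDS = {
    "ST": "source and target",
    "SynT": "synonym of target",
    "AbbSynT": "abbreviation and synonym of target",
    "AbbTokS": "abbreviation and tokenization of source",
    "TokT": "tokenization of target",
    "AbbT": "abbreviation of target",
    "TokSynT": "tokenization and synonym of target",
    "AbbTokSynT": "abbreviation, tokenization and synonym of target",
    "AbbSynS": "abbreviation and synonym of source",
    "TokS": "tokenization of source",
}


def describe(feature: str) -> str:
    """Expand a feature name such as 'Lev_TokSynT' into words.
    Raises KeyError for abbreviations missing from the legend."""
    function, operand = feature.split("_", 1)
    return f"{FUNCTIONS[function]} on {OPERANDS[operand]}"


print(describe("JaroW_ST"))
print(describe("Lev_TokSynT"))
```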
In Table 5.1, rule 1 (RID=1) is an entry rule in the KB, and it is always TRUE. Rules 2 to 18 are used to classify the cases of the datasets. First, rule 2 is applied to classify one dataset using a decision tree. This rule classifies the case 964 as FALSE, whereas the classification provided by expert manual matching is originally TRUE. To solve this incorrect classification, the knowledge acquisition process is used to make the classification NULL using rule 3. The same process is then used to create an alternative rule 4 to classify the case as TRUE. In this way, rules up to rule 14 are added to the KB to solve the incorrect classifications of one dataset. Later, rules 15 to 18 are added to the KB to solve the incorrect classifications of another dataset. Adding censor rules and alternative rules incrementally builds the KB.
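The interplay of the decision tree rule, a censor rule and an alternative rule can be sketched with rules 2 to 4 of Table 5.1. This is a rough illustration under several assumptions and should not be read as the actual Hybrid-RDR inference procedure: the evaluation order (ground breaking rules tried in RID order, the first uncensored one deciding), the `Rule` class, the `classify` function, the stand-in `stub_tree` and all feature values are hypothetical.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Case = Dict[str, float]
Condition = Tuple[str, str, float]  # (feature, operator, threshold)

OPS = {"==": operator.eq, "<=": operator.le, "<": operator.lt,
       ">=": operator.ge, ">": operator.gt}


@dataclass
class Rule:
    rid: int
    pid: int
    rtype: str                                  # "GB" or "R"
    conditions: List[Condition] = field(default_factory=list)
    conclusion: Optional[bool] = None           # None stands for NULL
    uses_tree: bool = False                     # True for the decision tree rule


def fires(rule: Rule, case: Case) -> bool:
    """A rule fires when all its conditions hold (no conditions: always)."""
    return all(OPS[op](case[feat], thr) for feat, op, thr in rule.conditions)


def classify(case: Case, rules: List[Rule],
             tree: Callable[[Case], bool]) -> Optional[bool]:
    """Assumed inference: try GB rules in order, skip any whose censor
    (an R rule with pid equal to its rid) fires, else return NULL (None)."""
    for rule in rules:
        if rule.rtype != "GB" or not fires(rule, case):
            continue
        censored = any(r.rtype == "R" and r.pid == rule.rid and fires(r, case)
                       for r in rules)
        if not censored:
            return tree(case) if rule.uses_tree else rule.conclusion
    return None


# Rules 2 to 4 of Table 5.1.
kb = [
    Rule(2, 1, "GB", uses_tree=True),
    Rule(3, 2, "R", [("Lev_SynT", "==", 1.0)]),
    Rule(4, 1, "GB", [("Lev_TokSynT", "==", 1.0),
                      ("JaroW_AbbTokT", "==", 0.9)], True),
]


def stub_tree(case: Case) -> bool:
    """Hypothetical stand-in for the trained J48 model."""
    return False


case_a = {"Lev_SynT": 1.0, "Lev_TokSynT": 1.0, "JaroW_AbbTokT": 0.9}
case_b = {"Lev_SynT": 0.5, "Lev_TokSynT": 0.4, "JaroW_AbbTokT": 0.2}
case_c = {"Lev_SynT": 1.0, "Lev_TokSynT": 0.4, "JaroW_AbbTokT": 0.9}

print(classify(case_a, kb, stub_tree))  # censored by rule 3, then rule 4
print(classify(case_b, kb, stub_tree))  # decision tree result stands
print(classify(case_c, kb, stub_tree))  # censored, no alternative rule fires
```

In this sketch, a case whose decision tree result is censored and that no alternative rule covers stays NULL classified, which corresponds to the cases listed by the UnClassified Cases button.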