7.4 Evaluations Based on Perfect Input Mappings
7.4.2 Evaluating Selected Strategies
Table 7.6 shows the F-measure achieved by each single strategy, with the best result in each experiment marked in bold. In this test, exactly one strategy was enabled at a time: Compound, Background Knowledge, Structure, Multiple Linkage or Word Frequency.
Note that the Itemization Strategy must always be enabled to achieve any results for itemizations. To focus more on the strengths and weaknesses of the different strategies, the configuration undecided-as-false is used.
The Compound Strategy achieves F-measures between 18.9 and 65.9 %, with an average of 44.9 %. It is usually a very convenient technique, but has a restricted scope, as it can only be used in correspondences where a compound word matches its head.
Background Knowledge leads to consistently good results between 49.1 and 73.5 %, with an average of 64.0 %. This is the best result of all strategies and is due to the multiple resources imported into the SemRep system, which allow a correct relation type determination by using background knowledge (D, F, L). The Structure Strategy can lead to good results, as in scenarios W, D and L, but also to poor results, as in scenario T. With an average of 37.8 %, it is rather a special strategy that is useful for specific cases where complex concept paths exist. The heuristic strategy Multiple Linkage has an even larger range. In experiment F, which apparently does not contain many (1:n) or (n:1) correspondences, an F-measure of only 4.4 % was obtained. By contrast, an F-measure of 84.5 % was achieved in experiment C, which is very close to the F-measure achieved by using all strategies (see Table 7.5). The average F-measure is 58.4 %, which is the second-best result. In 3 of the 7 data sets (T, G, C) this strategy even achieved the best result, which is impressive for a strategy mostly based on heuristics. Finally, the Word Frequency Strategy achieves results between 38.4 and 75.4 %. The average F-measure of 57.7 % is very similar to that of Multiple Linkage, but the smaller range indicates that Word Frequency works more steadily. In one experiment (W) this strategy returned the best result.
Table 7.6: Evaluation results (F-measure) for single strategies.
It can be observed that background knowledge and heuristic strategies complement each other. In W, T, G and C, heuristic strategies led to the best results, while the Background Knowledge Strategy led to somewhat lower results. In D, F and L, the opposite was the case. Thus, combining the more profound linguistic strategies with heuristic strategies seems to be an ideal configuration.
The Compound Strategy achieves only the fourth-best result, as it is exceeded by the generally good Background Knowledge Strategy and the relatively good heuristic strategies Multiple Linkage and Word Frequency. Still, this strategy is very important, because it can find is-a and inverse is-a relations that cannot be found by other strategies. Used alone, it does not achieve the best results, mostly because of its limited recall (there are many is-a relations which are not expressed by compounds), but within the overall STROMA system it allows considerable improvements.
Table 7.7 shows the F-measures for an inverted experiment, where all strategies but one are selected. Thus, one specific strategy was disabled in each test, and each column shows the results achieved in such a configuration. The worst results are achieved if Multiple Linkage is disabled (73.9 %), which shows that this strategy has the largest impact on the overall result. Word Frequency and Background Knowledge also have a considerable impact (76.4 % and 76.2 %, respectively), while disabling the Structure Strategy (77.5 %) or the Compound Strategy (77.9 %) reduced the overall F-measure of 78.6 % only slightly. There are two important insights. First, the impact of a single strategy is very low: the F-measure loss ranges only between 0.7 percentage points (Compound) and 4.7 percentage points (Multiple Linkage).
Table 7.7: Evaluation results (F-measure) of STROMA if a specific strategy is disabled.
Table 7.8: Evaluation results for specific combinations of strategies.
This substantiates that the generally good quality of STROMA (78.6 % in undecided-as-false mode) is based on the combination of several strategies and not on one specific strategy alone. Second, the deactivation of any strategy leads to a loss in F-measure, which shows that every strategy has a positive impact on the mapping quality.
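The loss figures quoted above follow directly from the average F-measures reported for Table 7.7. The following minimal Python sketch recomputes them; the variable names are purely illustrative and not part of STROMA, and the differences are expressed in percentage points.

```python
# Average F-measures (in %) as quoted in the text for Table 7.7,
# undecided-as-false mode. Names are illustrative, not STROMA identifiers.
ALL_STRATEGIES = 78.6
without = {
    "Compound": 77.9,
    "Background Knowledge": 76.2,
    "Structure": 77.5,
    "Multiple Linkage": 73.9,
    "Word Frequency": 76.4,
}

# Loss in percentage points when exactly one strategy is disabled.
loss = {name: round(ALL_STRATEGIES - f, 1) for name, f in without.items()}

# List the strategies from largest to smallest impact.
for name, points in sorted(loss.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: -{points} points")
```

Every loss is positive, and the largest one (Multiple Linkage) is still below 5 percentage points, which matches both insights stated above.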
If no strategies were used at all, the F-measure would be 0 % in all benchmarks in the mode undecided-as-false. In the mode undecided-as-equal, the F-measure would be identical to the percentage of equal relations in the benchmark.
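This baseline can be made concrete with a small Python sketch. It rests on a simplifying assumption that is not spelled out in this section: with a perfect input mapping, every correspondence receives exactly one predicted type, so precision, recall and F-measure all reduce to the share of correctly typed correspondences. The function `baseline_score` and the toy benchmark are hypothetical.

```python
from collections import Counter

def baseline_score(gold_types, mode):
    """Share of correctly typed correspondences when every strategy abstains.

    Assumption: each correspondence gets exactly one predicted type, so
    precision, recall and F-measure coincide with this share.
    """
    if mode == "undecided-as-false":
        # An undecided correspondence is counted as wrong.
        correct = 0
    elif mode == "undecided-as-equal":
        # An undecided correspondence is typed as "equal".
        correct = Counter(gold_types)["equal"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return correct / len(gold_types)

# Hypothetical toy benchmark, not taken from the evaluation data.
gold = ["equal", "is-a", "equal", "inverse is-a", "related"]
print(baseline_score(gold, "undecided-as-false"))  # 0.0
print(baseline_score(gold, "undecided-as-equal"))  # 0.4
```

In the toy benchmark, two of five correspondences are of type equal, so the undecided-as-equal baseline is 40 %, while the undecided-as-false baseline is 0 %.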
Finally, Table 7.8 shows the F-measures for combinations of selected strategies. The first column shows how STROMA performs if no heuristic strategies (Multiple Linkage and Word Frequency) are used. The second column shows the results if all strategies except Background Knowledge are used. For comparison, the third column shows the result achieved by all strategies, which was already shown in Table 7.5.
If no heuristic strategies are used, the F-measure ranges between 52.7 and 83.1 %, with an average of 69.5 %. Thus, without heuristics the results are about 9 percentage points worse than the results achieved with all strategies (78.6 %). This shows that the heuristic strategies are very useful. If no background knowledge is used, STROMA achieves results between 68.6 and 87.4 %, with an average of 76.2 %. This shows that STROMA can achieve very good results even without background knowledge.
On average, the best results (78.6 % F-measure in undecided-as-false mode) are obtained if all strategies are used. However, in experiments G and W the best results were obtained if no background knowledge was used. In these cases, some relation types obtained from SemRep were evidently incorrect. Since background knowledge has a rather high strategy weight, the incorrect result sometimes outweighed the results of other (e.g., heuristic) strategies and thus led to a worse result in these experiments. One example is the rather simple correspondence (Vinegar, Sauce). Vinegar seems to be best organized under the concept Sauce, so vinegar is some kind of sauce (is-a). Multiple Linkage correctly detects this relation, but Background Knowledge returns related, because in SemRep vinegar and sauce are both sub-concepts of the concept flavoring, which leads to the imprecise relation type related. In this case, Multiple Linkage provides the better result than Background Knowledge, but its result is overruled because Multiple Linkage has a lower weight.
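The (Vinegar, Sauce) case can be illustrated with a simple weighted vote. This is only a sketch: the text states that Background Knowledge has a higher weight than Multiple Linkage, but the concrete weights (0.6 and 0.4) and the function `combine` are hypothetical and are not meant to reproduce STROMA's actual scoring.

```python
from collections import defaultdict

def combine(votes):
    """Return the relation type with the highest summed strategy weight.

    `votes` maps a strategy name to a (relation_type, weight) pair.
    """
    score = defaultdict(float)
    for relation_type, weight in votes.values():
        score[relation_type] += weight
    return max(score, key=score.get)

# Hypothetical weights; only their order (Background Knowledge above
# Multiple Linkage) is taken from the text.
votes = {
    "Background Knowledge": ("related", 0.6),
    "Multiple Linkage": ("is-a", 0.4),
}
print(combine(votes))  # related

# With Background Knowledge disabled, the correct type prevails.
heuristics_only = {"Multiple Linkage": ("is-a", 0.4)}
print(combine(heuristics_only))  # is-a
```

Under these assumed weights, the imprecise type related wins as long as Background Knowledge votes, which mirrors why experiments G and W improved when background knowledge was disabled.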