
3.4 Optimization Strategies for ESA

3.4.1 Evaluation of Article Filter Strategy based on Link Type Selection

Gabrilovich and Markovitch [79] filter out all articles that have fewer than five incoming links and fewer than five outgoing links, on the assumption that the remaining articles are general enough and rich enough in content to represent a semantic concept. However, alternative strategies based on link type selection can be applied. For this evaluation, the following strategies have been applied and evaluated:

• Filtering articles with fewer than a certain number of incoming links (inlinks).

• Filtering articles with fewer than a certain number of outgoing links (outlinks).

• Filtering articles with fewer than a certain number of incoming and outgoing links. This strategy, used by Gabrilovich and Markovitch, merely yields the intersection of the article sets produced by the first two strategies.

• Filtering articles with fewer than a certain number of mutual links. A mutual link exists if an article a1 links to an article a2 and vice versa.
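The four strategies listed above can be sketched as follows. This is a minimal illustration, not the implementation used in the evaluation: the function names, the representation of the link graph as (source, target) pairs, and the decision to ignore duplicate links and self-links are all assumptions made for the example.

```python
from collections import defaultdict


def link_counts(links):
    """Count inlinks, outlinks and mutual links per article.

    links: iterable of (source, target) article pairs (hypothetical format).
    Duplicate links and self-links are ignored (an assumption of this sketch).
    """
    edges = {(s, t) for s, t in links if s != t}
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    mutual = defaultdict(int)
    for s, t in edges:
        outlinks[s] += 1
        inlinks[t] += 1
        if (t, s) in edges:
            # s links to t and t links back to s
            mutual[s] += 1
    return inlinks, outlinks, mutual


def filter_articles(articles, links, threshold, strategy):
    """Return the articles that survive the given filter strategy."""
    inlinks, outlinks, mutual = link_counts(links)
    keep = {
        "inlink": lambda a: inlinks[a] >= threshold,
        "outlink": lambda a: outlinks[a] >= threshold,
        # intersection of the article sets kept by the first two strategies
        "in_and_out": lambda a: inlinks[a] >= threshold and outlinks[a] >= threshold,
        "mutual": lambda a: mutual[a] >= threshold,
    }[strategy]
    return {a for a in articles if keep(a)}
```

With this formulation the combined strategy always returns the intersection of the inlink and outlink results for the same threshold, which matches the description of the third strategy.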

Figure 3.6 displays the effect of these filtering strategies on the number of articles in the semantic interpreter. Note that disambiguation pages are not included in the articles used to generate the semantic interpreters.

[Plot: Contained Articles (0 to 1,000,000) over Link Threshold (1 to 1000); series: Outgoing Links, Incoming Links, Incoming and Outgoing Links, Mutual Links]

Figure 3.6: Number of articles depending on filtering strategies based on article linkage

This figure shows that filtering the articles by the number of links of a given type significantly decreases the number of remaining articles. Furthermore, the strategies differ in how strongly they reduce the article count. Articles featuring many outgoing links are more numerous than articles having many incoming links. Curiously, this trend is reversed at a threshold of approximately 60 links. This is probably due to a core of very generic articles that are linked to very frequently (e.g. the article England, which is linked to 16,172 times and links to only 92 other articles). The conjunction of inlinks and outlinks with the same threshold, as applied in ESA’s original research, closely approaches the curve of the inlink strategy. It is not explored in more detail in the following, because the semantic interpreters of the inlink strategy and the combined strategy are very similar. Further, articles containing many mutual links are rarer than articles satisfying the other criteria, and accordingly the respective curve drops quickly. Even with a filter threshold of three, only about 50% of the articles remain.

In the following, an evaluation is presented that analyses the accuracy benefits in relation to the size of the respective semantic interpreter. The conjunction of inlinks and outlinks is not considered here, as it is very similar to the inlink strategy.

| Filter Threshold | Inlink Filter: Articles | Inlink Filter: SI non–zeros | Outlink Filter: Articles | Outlink Filter: SI non–zeros | Mutual Link Filter: Articles | Mutual Link Filter: SI non–zeros |
|---|---|---|---|---|---|---|
| 0 | 973,227 | 197,640,359 (100.00%) | 973,227 | 197,640,359 (100.00%) | 973,227 | 197,640,359 (100.00%) |
| 5 | - | - | - | - | 291,831 | 66,122,694 (33.46%) |
| 10 | 471,504 | 119,482,379 (60.45%) | 834,394 | 134,374,296 (67.99%) | 113,018 | 35,174,062 (17.80%) |
| 25 | 261,569 | 75,559,590 (38.23%) | 401,434 | 122,292,448 (61.88%) | 25,158 | 12,273,530 (6.21%) |
| 50 | 149,370 | 48,703,205 (24.64%) | 164,039 | 66,224,493 (33.51%) | - | - |
| 75 | 101,082 | 35,996,128 (18.21%) | 88,048 | 42,704,594 (21.61%) | - | - |
| 100 | 74,270 | 28,248,574 (14.29%) | 52,391 | 29,965,737 (15.16%) | - | - |
| 200 | 39,406 | 15,966,763 (8.08%) | 12,042 | 11,341,855 (5.74%) | - | - |
| 300 | 31,452 | 11,983,460 (6.06%) | - | - | - | - |
| 400 | 28,078 | 9,831,884 (4.95%) | - | - | - | - |
| 500 | 26,436 | 8,654,881 (4.38%) | - | - | - | - |

Table 3.6: Impact of filtering by inlinks, outlinks and mutual links on article count and semantic interpreter (SI) size

Table 3.6 shows how the article count and the semantic interpreter size (in non–zero matrix entries) decrease when the inlink, outlink and mutual link filter strategies are applied. Further, it indicates that the decrease in semantic interpreter size for both the inlink and outlink strategies correlates with the decrease in the number of articles remaining in the corpus, although it is not proportional (cf. figure A.1 in appendix A.3). Mutual links are rare in Wikipedia, so the number of articles drops quickly even at a low filter threshold. The results of Zesch and Gurevych [203] indicate that the optimal number of articles for a semantic interpreter is about 200,000. However, this number is based on a complete Wikipedia dump containing all articles. It is expected that this number can be reduced by an appropriate article filtering strategy.
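The non-proportional relation between article count and semantic interpreter size can be checked directly from the numbers in Table 3.6. The following sketch uses a subset of the inlink rows; the constant and function names are chosen for this example only.

```python
# Totals of the unfiltered semantic interpreter (threshold 0 in Table 3.6)
TOTAL_ARTICLES = 973_227
TOTAL_NONZEROS = 197_640_359

# (threshold, articles, SI non-zeros) for the inlink filter, copied from Table 3.6
INLINK_ROWS = [
    (10, 471_504, 119_482_379),
    (25, 261_569, 75_559_590),
    (50, 149_370, 48_703_205),
    (100, 74_270, 28_248_574),
    (500, 26_436, 8_654_881),
]


def retention(rows):
    """Return (threshold, % articles kept, % non-zeros kept, non-zeros per article)."""
    result = []
    for threshold, articles, nonzeros in rows:
        article_share = 100.0 * articles / TOTAL_ARTICLES
        nonzero_share = 100.0 * nonzeros / TOTAL_NONZEROS
        result.append((threshold, article_share, nonzero_share, nonzeros / articles))
    return result


for threshold, article_share, nonzero_share, density in retention(INLINK_ROWS):
    print(f"{threshold:>4}  articles {article_share:6.2f}%  "
          f"non-zeros {nonzero_share:6.2f}%  per article {density:7.1f}")
```

At a threshold of 10, for example, roughly 48% of the articles but roughly 60% of the non–zero entries remain. In each of the listed rows the share of retained non–zero entries exceeds the share of retained articles, which is consistent with the observation that the two quantities are correlated but not proportional.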

In order to get a conclusive picture of the quality of the link filtering strategies, the resulting semantic interpreters are compared with regard to the following performance indicators (cf. section 3.3.2):

• Coverage, global accuracy and local accuracy using the Reader’s Digest Word Puzzle Corpus RDWP984.

• Correlation with human judgement based on Spearman’s rank correlation coefficient ρ using the two term–pair datasets Gur65 and Gur350.

• MAP and BEP based on experiments performed with the semantic corpus Gr282.
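As a reminder of how the second indicator works, the following sketch computes Spearman's ρ as the Pearson correlation of the rank-transformed scores, with tied values receiving their average rank. It is a generic textbook formulation; the function names are chosen for this example, and the handling of word pairs that a reduced semantic interpreter cannot cover is not shown.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman_rho(human_scores, computed_scores):
    """Spearman's rank correlation between human and computed relatedness scores."""
    return pearson(average_ranks(human_scores), average_ranks(computed_scores))
```

Because only ranks enter the computation, ρ is insensitive to the scale of the computed relatedness values and only rewards agreement in the ordering of the word pairs.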

Coverage and Accuracy

A reduction of articles is accompanied by the removal of specific terms that are relevant to the filtered articles. Thus, coverage decreases as the level of filtering increases. Figure 3.7 shows how coverage increases for the inlink, outlink and mutual link filter strategies as the semantic interpreters contain more articles. The points of measurement represent the different filtering thresholds, but they are converted to the number of articles contained in a semantic interpreter to make the results comparable. The outlink filter strategy approximately subsumes the mutual link filter strategy. Both achieve better coverage than the inlink filter strategy for semantic interpreters in which fewer than approximately 250,000 articles are retained. For semantic interpreters containing more than this number of articles, the coverage of the outlink and inlink strategies is similar. A reason for the better performance of the outlink filter strategy could be that the number of outlinks in an article correlates with the length of the article in terms. Consequently, long articles are more likely to contain a larger diversity of terminology than short articles. Thus, the coverage of the outlink and mutual link filter strategies is better, especially for smaller semantic interpreters.
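The relation between coverage and the two accuracy measures can be illustrated with a small sketch. The exact definitions are given in section 3.3.2 and are not reproduced here; the code assumes the common convention that coverage is the fraction of puzzles that can be attempted, local accuracy is the fraction of attempted puzzles answered correctly, and global accuracy is the fraction of all puzzles answered correctly. The function name and the encoding of outcomes are hypothetical.

```python
def puzzle_metrics(outcomes):
    """Compute (coverage, local accuracy, global accuracy).

    outcomes: one entry per puzzle (hypothetical encoding):
      True  -> attempted and answered correctly
      False -> attempted and answered incorrectly
      None  -> not attempted, e.g. a required term is not covered
    """
    total = len(outcomes)
    attempted = [o for o in outcomes if o is not None]
    correct = sum(1 for o in attempted if o)
    coverage = len(attempted) / total if total else 0.0
    local_accuracy = correct / len(attempted) if attempted else 0.0
    global_accuracy = correct / total if total else 0.0
    return coverage, local_accuracy, global_accuracy
```

Under these assumed definitions, global accuracy equals coverage multiplied by local accuracy, so a loss of coverage through stricter filtering lowers global accuracy even when local accuracy stays constant.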

[Plot: Coverage (0.55 to 0.9) over Number of Articles in Semantic Interpreter (10^4 to 10^6); series: Inlink Filter Strategy, Outlink Filter Strategy, Mutual Link Filter Strategy]

Figure 3.7: Coverage for inlink, outlink and mutual link filter strategies

According to [203], accuracy should be stable for a semantic interpreter consisting of about 200,000 articles. Figures 3.8a and 3.8b show that this applies not only to semantic interpreters containing all articles but also to the inlink filter strategy. A plateau is reached at approximately 220,000 articles for both local and global accuracy. The outlink filter strategy again subsumes the mutual link strategy. With both of these strategies, however, local accuracy increases steadily until all articles are contained. For all semantic interpreters containing more than 30,000 articles, the inlink filter strategy dominates the outlink and mutual link filter strategies regarding local accuracy. The same holds for global accuracy, as can also be seen in tables A.8 and A.9 in appendix A.

These results show that the inlink filter strategy achieves better accuracy than the outlink and mutual link filter strategies for semantic interpreters that contain more than about 30,000 articles. This supports the assumption of [79] that articles with many inlinks are more generic and thus better suited to be used as concepts in ESA.

[Plots: (a) Local Accuracy (0.5 to 0.75) and (b) Global Accuracy (0.25 to 0.65), each over Number of Articles in Semantic Interpreter (10^4 to 10^6); series: Inlink Filter Strategy, Outlink Filter Strategy, Mutual Link Filter Strategy]

Figure 3.8: Local and global accuracies for the Reader’s Digest Word Choice Puzzle corpus RDWP984 using inlink, outlink and mutual link filter strategies

So far, these experiments show that an increasing number of articles (and therefore concepts) in the semantic interpreter primarily increases coverage, while local accuracy is affected only to a small extent. Thus, with more articles, rarely used words can be mapped to concepts and compared to other terms contained in the semantic interpreter. This is crucial for the evaluation of semantic relatedness and semantic similarity due to the very limited number of available terms for the analysis.

Correlation with Human Judgements

In order to get a better understanding of the effect of filtering articles on ESA’s ability to judge the semantic relatedness between terms, a second evaluation was executed. The German datasets Gur65 and Gur350 (cf. section 3.3.2) are used to measure the correlation between human judgements and computed judgements on semantically related word pairs. The results for both corpora can be seen in figures 3.9a and 3.9b.

These two evaluations show that the correlation with the semantic relatedness judgements of human raters basically follows a similar tendency as accuracy in the previous experiments. However, the inlink filter strategy visibly outperforms the outlink strategy above about 30,000 contained articles. This is no surprise, as a similar characteristic could already be seen with accuracy. The mutual link filter strategy is clearly outperformed by both the inlink and outlink filter strategies. Although it yields good results regarding coverage, it seems to eliminate important articles from the semantic interpreter M. Therefore, the mutual link filter strategy is not applied in the following experiments.

The large “jumps” in correlation between points of measurement in figure 3.9a for the inlink and outlink filter strategies are due to the increased coverage of the terminology. This affects the Gur65 corpus more than the Gur350 corpus, as it contains fewer word pairs. Interestingly, for the Gur350 dataset the inlink filter strategy shows a local level of saturation at about 200,000 articles. Here, a further increase in the number of articles does not increase the correlation with the human judgements to the same degree. The results for all points of measurement can also be found in tables A.8 and A.9 of appendix A. The findings up to now support the hypothesis that the inlink filter strategy is superior to the other strategies as long as the semantic interpreter contains at least about 30,000 articles. Thus, in the following evaluation only the inlink filter strategy is applied.

Detecting Semantically Related Documents

In most cases, documents of the above–mentioned corpora consist of single terms only. The Gr282 corpus, however, is assembled from documents that contain 95 terms on average. As there are multiple terms, the effect of uncovered terminology on the accuracy of a semantic interpreter is less severe: even if some terms are not covered by the reduced semantic interpreter, the remaining terms can still be analysed. Therefore, reduced semantic interpreters could yield accuracy comparable to that of a semantic interpreter built from all available articles. In order to measure the impact of using reduced semantic interpreters in IR tasks, the following experiments are performed on the Gr282 dataset.
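The argument that multi-term documents tolerate uncovered terms can be made concrete with a small sketch. It assumes that the semantic interpreter is available as a mapping from each covered term to a sparse vector of concept weights and that a document vector is the sum of the vectors of its covered terms; the data structure and function names are hypothetical and any term weighting is omitted.

```python
from math import sqrt


def document_vector(terms, interpreter):
    """Sum the concept vectors of all covered terms of a document.

    interpreter: dict mapping term -> {concept: weight} (hypothetical format).
    Returns the document's concept vector and the number of covered terms.
    """
    vector = {}
    covered = 0
    for term in terms:
        concept_weights = interpreter.get(term)
        if concept_weights is None:
            continue  # term was lost through filtering; skip it
        covered += 1
        for concept, weight in concept_weights.items():
            vector[concept] = vector.get(concept, 0.0) + weight
    return vector, covered


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A single-term document whose term is not covered yields an empty vector and cannot be compared at all, whereas a document with many terms still produces a usable vector as long as some of its terms are covered.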

The inlink filter strategy has proven to be better suited for reasonably reducing the semantic interpreter. Therefore, only this strategy is applied in the following experiment. The effects of different link filtering thresholds are shown in figure 3.10 and table A.10 in appendix A.2.

[Plots: Spearman’s Correlation ρ for (a) Corpus Gur65 (0.4 to 0.7) and (b) Corpus Gur350 (0.15 to 0.5), each over Number of Articles in Semantic Interpreter (10^4 to 10^6); series: Inlink Filter Strategy, Outlink Filter Strategy, Mutual Link Filter Strategy]

Figure 3.9: Comparison of the effect of inlink, outlink and mutual link filter strategies and different sized semantic interpreters on the correlation between human and computed relatedness judgements.

[Plot: Accuracy (0.57 to 0.66) over Inlink Filter Threshold (0 to 500); series: Break Even Point, Averaged Average Precision]

Figure 3.10: The Mean Average Precision and Break Even Point of dataset Gr282 using semantic interpreters reduced by the inlink filter strategy.

The BEP and MAP values in figure 3.10 show that the performance of ESA increases up to the point of measurement with a minimum of 100 incoming links. Subsequent measurements with a larger number of articles do not show a gain in terms of accuracy. In fact, accuracy already begins to decrease slightly at small link filtering thresholds. One explanation for this observation is that noise is introduced by articles that do not cover the most relevant terms but still have a certain term overlap. That is, the terminology coverage of the used corpora is already near–complete, and additional articles do not significantly contribute to the information contained in the semantic interpreter but rather distort the semantic analysis. Another explanation is that short articles do not contain sufficient terms that appropriately describe the article’s concept.
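For reference, the two retrieval measures discussed above can be sketched as follows. The code uses standard textbook formulations and is not taken from the evaluation itself: the break-even point is computed as the precision at rank R, where R is the number of relevant documents (the cutoff at which precision and recall coincide), and the function names and input format are chosen for this example.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents.

    ranking: list of document ids ordered by decreasing relatedness.
    relevant: set of document ids judged relevant for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def break_even_point(ranking, relevant):
    """Precision at rank R = |relevant|, where precision equals recall."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranking, relevant) pairs, e.g. to obtain MAP."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)
```

Both measures depend only on the order of the retrieved documents, so additional concepts in the semantic interpreter change the result only if they alter that order.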