
Chapter 4 Conducted experiments

4.7 Multi-lingual evaluation for ranking algorithms

In the research described so far we used English Infobox ranking data exclusively as the baseline against which we evaluated the ranking algorithms. This was in line with our research goal to review ranking algorithms within the scope of the English language (as stated in paragraph 1.4). The process by which we obtained the English ranking data baseline was explained fully in Chapter 3.

As our research progressed we felt that a review of ranking algorithms would be stronger if we also incorporated ranking baselines from non-English Infobox templates. Infobox templates vary per language, and the ranking order that can be extracted from these templates differs to some degree. Ranking algorithms that consistently produce strong ranking results regardless of the ranking baseline in use are obviously far more useful than those that do not.

Before we continue our discussion of the multi-lingual evaluation we want to highlight that our ranking algorithms internally operate only in “English mode”. The multi-lingual experiments take the ranking output from these algorithms and compare these rank orders with non-English ranking baselines. To make this more concrete, take for example the heuristic algorithm that we described in section 4.6. At some point this algorithm searches for specific English phrases to improve the rank order of properties. It was not rewritten for this experiment to search for other (non-English) phrases. Hence, we used the exact same algorithms for our multi-lingual experiments. The word frequency lists that we used, the WordNet database and the NLP API interaction all remained ‘as is’ (English). The input fed into the ranking algorithms was also entirely English.
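The comparison step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this research: the property names are hypothetical, and the normalization (τ + 1) / 2 is only one common way to map Kendall τ onto the interval [0, 1]; the normalization actually used for our KPI metrics may differ.

```python
from itertools import combinations


def kendall_tau(ranking, baseline):
    """Kendall tau between two orderings of the same items (no ties)."""
    if len(ranking) < 2:
        raise ValueError("need at least two items to compare rank orders")
    if len(set(ranking)) != len(ranking) or len(baseline) != len(ranking):
        raise ValueError("rankings must not contain duplicates")
    if set(ranking) != set(baseline):
        raise ValueError("both rankings must contain the same items")

    pos_r = {item: i for i, item in enumerate(ranking)}
    pos_b = {item: i for i, item in enumerate(baseline)}

    concordant = discordant = 0
    for a, b in combinations(ranking, 2):
        # A positive product means both orderings agree on the order of a and b.
        if (pos_r[a] - pos_r[b]) * (pos_b[a] - pos_b[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def normalized_kendall_tau(ranking, baseline):
    """Assumed normalization: map tau from [-1, 1] onto [0, 1]."""
    return (kendall_tau(ranking, baseline) + 1) / 2


# Hypothetical example: the rank order produced by an English-only algorithm
# is compared with the property order taken from a non-English Infobox template.
algorithm_output = ["name", "birthDate", "birthPlace", "occupation"]
baseline_nl = ["name", "birthPlace", "birthDate", "occupation"]
score = normalized_kendall_tau(algorithm_output, baseline_nl)
```

The algorithm itself never sees the non-English baseline; only the evaluation step does.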

There were some data set factors that influenced the design of our cross-language ranking baseline experiments. First, one has to consider that two given Wikipedia languages L1 and L2 will typically not share the same set of Infobox templates. As an example, consider that the English DBpedia at the time of writing defines 417 mappings, whereas the German set maps 353 classes. Only 85 classes are mapped in both German and English. The number of usable classes for which we can actually compute rankings is even smaller, as we exclude for example classes that have fewer than 8 properties. The number of shared classes diminishes very quickly with each additional language included, to the point that the statistical validity becomes questionable.
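The shrinking overlap can be illustrated with a small sketch. The class names and property counts below are made up for illustration, and the function assumes that the 8-property threshold is applied per language before intersecting; neither is taken from the actual DBpedia mappings.

```python
MIN_PROPERTIES = 8  # threshold mentioned in the text


def usable_shared_classes(mappings_by_language, min_properties=MIN_PROPERTIES):
    """Classes mapped in every language with enough properties in each of them.

    mappings_by_language: {language: {class_name: [property, ...]}}
    """
    usable_per_language = [
        {cls for cls, props in mappings.items() if len(props) >= min_properties}
        for mappings in mappings_by_language.values()
    ]
    if not usable_per_language:
        return set()
    return set.intersection(*usable_per_language)


def props(n):
    """Helper that generates n placeholder property names."""
    return [f"p{i}" for i in range(n)]


# Toy mappings (hypothetical): each extra language removes shared classes.
toy = {
    "en": {"Person": props(12), "City": props(10), "Film": props(9), "Lake": props(3)},
    "de": {"Person": props(9), "City": props(8), "Lake": props(3)},
    "nl": {"Person": props(8), "Film": props(10)},
}

shared_en_de = usable_shared_classes({lang: toy[lang] for lang in ("en", "de")})
shared_all = usable_shared_classes(toy)
```

In this toy data two classes survive for English and German, and only one remains once Dutch is added.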

To circumvent this issue we designed the experiment in the following manner. We created a data set that consisted of 244 non-English mapped Infobox templates. The set comprised Infobox templates that were mapped by the French, German, Spanish, Dutch, Polish, Portuguese and Turkish DBpedia mappings. Each included mapping maps at least 8 class properties. Please note that the construction of the test data was to a large extent already discussed in sections 3.3, 3.4 and 3.5.

We ran the different ranking algorithms repeatedly over all the classes and computed average KPI metric values. The computed average KPI values could then be compared and evaluated for significance using paired sample t-tests. The experiment design is shown schematically in Table 14.
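The experiment loop can be sketched as follows. This is only an illustration under stated assumptions: `position_agreement` is a simple stand-in KPI, "reverse alphabetic" is a stand-in algorithm, and the class data are invented; none of these are the metrics, algorithms or templates used in this research.

```python
from statistics import mean


def position_agreement(ranking, baseline):
    """Stand-in KPI: share of positions at which both rankings hold the same item."""
    return sum(r == b for r, b in zip(ranking, baseline)) / len(baseline)


def average_kpi(algorithms, classes, kpi):
    """Run every algorithm over every class and average the KPI per algorithm.

    algorithms: {name: function(set_of_properties) -> ranked list}
    classes:    [(class_label, baseline_ranking), ...]
    Returns (averages, per_class_scores). The per-class scores stay aligned
    by class, which is what makes a paired test possible afterwards.
    """
    per_class = {name: [] for name in algorithms}
    for _label, baseline in classes:
        properties = set(baseline)  # the algorithm only sees an unordered set
        for name, rank in algorithms.items():
            per_class[name].append(kpi(rank(properties), baseline))
    averages = {name: mean(scores) for name, scores in per_class.items()}
    return averages, per_class


# Hypothetical stand-ins for the ranking algorithms.
algorithms = {
    "alphabetic": lambda properties: sorted(properties),
    "reverse alphabetic": lambda properties: sorted(properties, reverse=True),
}

# Hypothetical classes from different languages, blended into one data set.
classes = [
    ("Class A (NL)", ["name", "area", "population"]),
    ("Class P (DE)", ["birthDate", "name", "occupation"]),
]

averages, per_class = average_kpi(algorithms, classes, position_agreement)
```

Because every algorithm is scored on exactly the same classes, the score lists are paired observations.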

Alphabetic        Word frequency    N-gram            Heuristics

Class A (NL)      Class A (NL)      Class A (NL)      Class A (NL)
Class B (NL)      Class B (NL)      Class B (NL)      Class B (NL)
Class P (DE)      Class P (DE)      Class P (DE)      Class P (DE)
Class Q (DE)      Class Q (DE)      Class Q (DE)      Class Q (DE)
Class B (FR)      Class B (FR)      Class B (FR)      Class B (FR)
…                 …                 …                 …
Class A (TR)      Class A (TR)      Class A (TR)      Class A (TR)

Compute average   Compute average   Compute average   Compute average

Table 14: multi-lingual experiment approach

By including various languages in our test data set we obtain a blended average that covers a wide array of different ranking baselines. The experiment output is used to evaluate whether the ranking algorithms reviewed in our research also outperform alphabetic ranking in a non-English context. The rationale is shown in Figure 46.

[Figure 46, flowchart nodes: "Ranking algorithm outperforms alphabetical ranking in English baseline?"; "Ranking algorithm outperforms alphabetical ranking in multi-lingual experiment?"; "Ranking algorithm is consistently outperforming alphabetical"]

We started with a run to review the alphabetic ranking algorithm in a multi-lingual context. The results of this run are shown in Figure 47. Please note, however, that the group stability metrics have been excluded from the multi-lingual metrics set as these are meaningless in this context.

We have included the English metrics again as a reference point in Figure 48, although we cannot compare the above figures directly due to the different input data set. Note that this graph is the same as the one already presented in section 3.10.6. The two metric diagrams are very close in terms of the comparative KPI metrics.

With the above reference in mind as a starting point we can discuss the performance of the three ranking algorithms that we researched in this paper. We only considered the ranking algorithms for which we had already shown that they outperformed alphabetical ranking when evaluated with respect to an English ranking baseline.

Figure 47: metrics for alphabetic ranking in a multi-lingual ranking baseline

First, we discuss the word frequency ranking algorithm that we described fully in section 4.2. In Figure 49 the deltas for the KPI metrics with respect to alphabetical ranking are presented (please also refer to Figure 47). As all these metrics have been computed from the same multi-lingual ranking baseline data set we have paired data, which allowed us to compute a paired samples t-test. We do not present the t-test data here, but the outperformance is statistically significant (with a 95% confidence level).
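The paired samples t-test on the deltas can be sketched as follows. The scores are invented for illustration and are not results from this research; the function only computes the t statistic and the degrees of freedom, so the significance decision itself still requires the Student t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(algorithm_scores, alphabetic_scores):
    """Paired samples t statistic for per-class KPI scores of two algorithms.

    Returns (mean_delta, t, degrees_of_freedom).
    """
    if len(algorithm_scores) != len(alphabetic_scores):
        raise ValueError("paired data requires one score per class for both algorithms")
    deltas = [a - b for a, b in zip(algorithm_scores, alphabetic_scores)]
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = stdev(deltas)  # sample standard deviation of the deltas
    if sd == 0:
        raise ValueError("all deltas are identical; t is undefined")
    mean_delta = mean(deltas)
    t = mean_delta / (sd / sqrt(n))
    return mean_delta, t, n - 1


# Invented per-class KPI values, aligned class by class.
frequency_scores = [0.70, 0.65, 0.80, 0.75, 0.60]
alphabetic_scores = [0.60, 0.60, 0.70, 0.70, 0.55]

mean_delta, t, df = paired_t_statistic(frequency_scores, alphabetic_scores)
```

The resulting t is then compared with the critical value of the t distribution for `df` degrees of freedom at the chosen confidence level. A statistics library can return the p-value directly, for example `scipy.stats.ttest_rel`.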

The exact same procedure was executed to review the N-gram ranking algorithm with non-English ranking baselines. The N-gram ranking algorithm was fully described in section 4.3. Again, we only present the delta metrics, and note that the outperformance is statistically significant in a paired sample t-test for differences.

Figure 49: multi-lingual frequency ranking (Δ compared to alpha ranking)

Finally, we present the performance metrics for the heuristic ranking algorithm in a multi-lingual ranking baseline context in Figure 51. The paired samples t-tests for both the normalized Kendall τ and normalized Spearman ρ metrics show that the results are statistically significant. We have included the supporting data for the t-test (for the normalized Kendall τ) in Figure 52 and Figure 53. The t-test data for the normalized Spearman ρ metrics are not shown.

Figure 51: multi-lingual heuristic ranking (Δ compared to alpha ranking)

Figure 52: heuristic ranking t-test details (multi-lingual context)