

IV.5 Experiments

IV.5.4 Evaluation of ESA Model Variants

As described in Section IV.3, the ESA model allows different design choices. To find the best parameters for CLIR, we performed experiments testing the influence of different variants on retrieval performance. Our experiments were carried out in an iterative and greedy fashion: we start from the original ESA model as a baseline, then vary one parameter at a time and always fix the best configuration before studying the next parameter. The sequence of optimizations is given by the nesting of functions in Equation IV.3. We first optimized the parameters of the inner functions, which do not depend on the output of other functions, and then proceeded to the outer functions. At the end of our experiments, we are thus able to assess the combined impact of the best choices on the performance of the ESA model.
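The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `greedy_search`, the parameter names, and the scoring function `toy_score` (with made-up numbers, not measured results) are all assumptions introduced for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Config = Dict[str, object]


def greedy_search(
    baseline: Config,
    search_space: Sequence[Tuple[str, List[object]]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Vary one parameter at a time and fix its best value before moving on."""
    best = dict(baseline)
    best_score = evaluate(best)
    # Parameters are visited in the given order (inner functions first).
    for name, candidates in search_space:
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_score(cfg: Config) -> float:
    # Hypothetical additive score standing in for a full retrieval run.
    points = {
        "m": {1000: 0.1, 10000: 0.3},
        "association": {"tf": 0.0, "tf.icf": 0.4},
    }
    return sum(points[key][cfg[key]] for key in points)


baseline = {"m": 1000, "association": "tf"}
space = [("association", ["tf", "tf.icf"]), ("m", [1000, 10000])]
best, score = greedy_search(baseline, space, toy_score)
```

A greedy search of this kind evaluates each candidate value once per parameter instead of every combination, at the price of possibly missing interactions between parameters.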

In summary, the contributions of the experiments on ESA model variants are the following:

1. We identify best choices for the parameters and design variants of the CL-ESA model on a CLIR scenario.

2. We show that the CL-ESA model is sensitive to parameter settings, heavily influencing the retrieval outcome.

3. We show that the settings chosen in the original ESA model are reasonable, but can still be optimized for CLIR and MLIR settings.

Experimental Settings. Results of the evaluation of variants of the CL-ESA model are presented in Figures IV.9 to IV.12. As the observed effects were constant across measures, we only present recall at cutoff rank 1 (R@1). For experiments on the Multext corpus, we used all documents (2,783) as queries to search in all documents in the other languages. The results for language pairs were averaged over both retrieval directions (for example, using English documents as queries to search in the German documents and vice versa). For the JRC-Acquis dataset, we randomly chose 3,000 parallel documents as queries (to yield settings similar to the Multext scenario), and the results were again averaged for language pairs. This task is harder than the experiments on the Multext corpus, as the search space consists of 15,464 documents and is thus larger by a factor of approximately 5. This explains the generally lower results on the JRC-Acquis dataset.
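The evaluation protocol above can be illustrated with a small sketch. It assumes that documents are already mapped to sparse concept vectors, that parallel documents share the same list index, and that ranking uses cosine similarity; the names `cosine`, `top1_accuracy`, and `averaged_top1` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # sparse concept vector: concept id -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top1_accuracy(queries: List[Vector], targets: List[Vector]) -> float:
    """Share of queries whose top-ranked target is the parallel document."""
    hits = 0
    for i, query in enumerate(queries):
        ranked_first = max(range(len(targets)), key=lambda j: cosine(query, targets[j]))
        if ranked_first == i:  # parallel documents share the same index
            hits += 1
    return hits / len(queries)


def averaged_top1(docs_a: List[Vector], docs_b: List[Vector]) -> float:
    """Average over both retrieval directions of a language pair."""
    return (top1_accuracy(docs_a, docs_b) + top1_accuracy(docs_b, docs_a)) / 2


# Toy vectors, not taken from Multext or JRC-Acquis.
en = [{"c1": 1.0}, {"c2": 1.0}]
de = [{"c1": 1.0}, {"c1": 1.0, "c2": 0.5}]
pair_score = averaged_top1(en, de)
```

In this toy example both English queries find their German counterpart, while only one of the two German queries finds its English counterpart, so the averaged value is 0.75.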

To test the significance of the improvement of our best settings (projection function $\Pi^{10,000}_{\mathrm{abs}}$, association strength function tf.icf*, cosine retrieval model), we carried out paired t-tests (significance level 0.01) comparing the best settings pairwise with all other results for all language pairs on both datasets. Results whose difference from the best setting is not significant at the 0.01 level are marked with X in Figures IV.9 to IV.12.
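As a reminder of how a paired t-test compares two settings, the following sketch computes the t statistic and the degrees of freedom from paired scores. The pairing unit (for instance, one score per query under each setting) is an assumption, since the text does not spell it out, and the sample values are toy numbers. The p-value would then be read from a t-distribution with the returned degrees of freedom, for example via a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy paired scores, not experimental data.
best_setting = [3.0, 5.0, 7.0, 9.0]
other_setting = [1.0, 4.0, 4.0, 7.0]
t_value, dof = paired_t_statistic(best_setting, other_setting)
```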

In the following we discuss the results of the different variations of the CL-ESA model:

Projection Function. We first used different values for the parameter m in the projection function $\Pi^{m}_{\mathrm{abs}}$. The results in Figure IV.9 showed that m = 10,000 is a good choice for both datasets.

On the basis of this result, we investigated different projection functions. In order to be able to compare them, we set the different threshold values t such that the projected ESA vectors had approximately 10,000 non-zero dimensions on average. An exception is the function sliding window (orig.), where we used the parameters described in [Gabrilovich, 2006]: threshold t = 0.05 and window size l = 100. Using an absolute number of non-zero dimensions yielded the best results (see Figure IV.10), and the difference was significant with respect to all other variants. Thus, we conclude that neither the settings of the original ESA approach (sliding window) nor those of the model of Gurevych et al. [2007] (fixed threshold) are ideal in our experimental settings. For the remaining experiments, we therefore use the absolute dimension projection function that selects 10,000 articles ($\Pi^{10,000}_{\mathrm{abs}}$).

Figure IV.9: Variation of m in the projection function $\Pi^{m}_{\mathrm{abs}}$ using the tf.icf association function and cosine retrieval model. Results that have no significant difference to the results of our best setting are marked with X. [Panels: Multext, JRC-Acquis; series: en-de, en-fr, de-fr; values of m: 1k, 2k, 5k, 10k, 15k, 20k; TOP-1 accuracy axis from 0% to 100%.]

Figure IV.10: Variation of the projection function $\Pi$ using the tf.icf association function and cosine retrieval model. [Panels: Multext, JRC-Acquis; series: en-de, en-fr, de-fr; variants: absolute, absolute threshold, relative threshold, sliding window (orig.), sliding window; TOP-1 accuracy axis from 0% to 100%.]

Figure IV.11: Variation of the association strength function $\phi_l$ using the projection function $\Pi^{10,000}_{\mathrm{abs}}$ and cosine retrieval model. [Panels: Multext, JRC-Acquis; series: en-de, en-fr, de-fr; variants: TFICF, TFICF*, TF, BM25, Cosine; TOP-1 accuracy axis from 0% to 100%.]
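To make the compared projection functions more concrete, the following sketch shows three of them on a sparse concept vector. The exact definitions are given in Section IV.3 and are not repeated here, so the function names and in particular the definition of the relative threshold (a fraction of the largest value in the vector) are assumptions; the sliding window variants are omitted.

```python
from typing import Dict

Vector = Dict[str, float]  # sparse concept vector: concept id -> weight


def project_absolute(vec: Vector, m: int) -> Vector:
    """Keep the m dimensions with the highest values."""
    top = sorted(vec.items(), key=lambda item: item[1], reverse=True)[:m]
    return dict(top)


def project_absolute_threshold(vec: Vector, t: float) -> Vector:
    """Keep all dimensions whose value is at least t."""
    return {c: w for c, w in vec.items() if w >= t}


def project_relative_threshold(vec: Vector, t: float) -> Vector:
    """Keep all dimensions whose value is at least t times the maximum value.

    Assumed definition for illustration only.
    """
    if not vec:
        return {}
    cutoff = t * max(vec.values())
    return {c: w for c, w in vec.items() if w >= cutoff}


# Toy vector with three concepts.
example = {"a": 0.9, "b": 0.5, "c": 0.1}
kept_top2 = project_absolute(example, 2)
kept_abs = project_absolute_threshold(example, 0.5)
kept_rel = project_relative_threshold(example, 0.2)
```

With an absolute number m, every projected vector has the same number of non-zero dimensions (as long as the input has at least m), whereas with thresholds this number varies from document to document.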

Association Strength. The results in Figure IV.11 show that the functions tf.icf (used in the original ESA model) and tf.icf* perform much better than the other functions. The better performance of tf.icf*, which ignores the term frequencies in the queries, was significant w.r.t. all other alternatives for all language pairs considered on both datasets. We thus conclude that the settings in the original ESA model are reasonable but, surprisingly, can be improved by ignoring the term frequency of the terms in the document to be indexed. The low results using the tf function show that icf is an important factor in the association strength function. By contrast, the normalization of the tf.icf values (= cosine function) reduces the retrieval performance substantially.
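The difference between the standard tf.icf measure and the variant that ignores query term frequencies (TFICF* in Figure IV.11) can be sketched as follows. The formulas are assumptions made for illustration: the term frequency in a concept article is length normalized, icf is taken as log(N / cf), and the scores of all query terms are summed per concept. The actual definitions are those of Section IV.3, and the function name and toy concepts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def concept_vector(
    doc_terms: List[str],
    concepts: Dict[str, List[str]],
    ignore_query_tf: bool,
) -> Dict[str, float]:
    """Map a document to concept scores with an assumed tf.icf weighting."""
    n_concepts = len(concepts)
    counts = {c: Counter(terms) for c, terms in concepts.items()}
    # Concept frequency: number of concept articles containing each term.
    cf = Counter(t for cnt in counts.values() for t in cnt)
    query_tf = Counter(doc_terms)
    vector: Dict[str, float] = {}
    for concept, cnt in counts.items():
        length = sum(cnt.values())
        score = 0.0
        for term, q_tf in query_tf.items():
            if cnt[term] == 0:
                continue
            rtf = cnt[term] / length  # length-normalized term frequency
            icf = math.log(n_concepts / cf[term])
            weight = 1 if ignore_query_tf else q_tf
            score += weight * rtf * icf
        if score > 0:
            vector[concept] = score
    return vector


# Toy concept articles and query document.
concepts = {
    "Bank": ["bank", "money", "bank"],
    "River": ["river", "bank", "water"],
    "Music": ["song", "guitar"],
}
doc = ["money", "money", "money", "river"]
with_query_tf = concept_vector(doc, concepts, ignore_query_tf=False)
without_query_tf = concept_vector(doc, concepts, ignore_query_tf=True)
```

In this toy case the repeated query term "money" triples the score of the concept "Bank" under the standard weighting, while the variant without query term frequencies scores "Bank" and "River" equally.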

Retrieval Model. Experiments with different retrieval models showed that the cosine function, which is used by all ESA implementations known to us, is indeed a reasonable choice. All other models perform worse (the difference again being significant for all language pairs on both datasets), as can be seen in the charts in Figure IV.12, especially on the JRC-Acquis dataset.

Figure IV.12: Variation of the retrieval model using $\Pi^{10,000}_{\mathrm{abs}}$ and tf.icf. [Panels: Multext, JRC-Acquis; series: en-de, en-fr, de-fr; variants: Cosine, TF.IDF, KL-Divergence, LM; TOP-1 accuracy axis from 0% to 100%.]

Discussion. Our results show on the one hand that ESA is indeed quite sensitive to certain parameters (in particular the association strength function and the retrieval model). The design choices have a large impact on the performance of the ESA-based retrieval approach. For example, using tfc values instead of rtfc values (which are length normalized) in the association strength function decreases performance by about 75%. Unexpectedly, abstracting from the number of times that a term appears in the query document (using tf.icf*) improves upon the standard tf.icf measure (which takes them into account) by 17% to 117%. In particular, we have shown that all settings that are ideal in our experiments are so in a statistically significant way. The only exception is the number of non-zero dimensions taken into account, which has several optimal values.

On the other hand, the results of our experiments confirm that the settings in the original ESA model ($\Pi^{0.05,100}_{\mathrm{window}}$, tf.icf, cosine) [Gabrilovich and Markovitch, 2007; Gabrilovich, 2006] are reasonable. Still, when using the settings that are ideal on both datasets according to our experiments ($\Pi^{10,000}_{\mathrm{abs}}$, tf.icf*, cosine), we achieve a relative improvement in TOP-1 accuracy between 62% (from 51.1% to 82.7%, Multext dataset, English/French) and 237% (from 9.3% to 31.3%, JRC-Acquis dataset, English/German). This shows again that the settings can have a substantial effect on the ESA model and that ESA has the potential to be further optimized and to yield even better results on the various tasks it has been applied to.
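The relative improvements quoted above follow directly from the reported accuracies, reading relative improvement as the gain divided by the baseline value:

```latex
\[
\frac{82.7 - 51.1}{51.1} \approx 0.618 \approx 62\%,
\qquad
\frac{31.3 - 9.3}{9.3} \approx 2.366 \approx 237\%.
\]
```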

Finally, all experiments involving the German datasets yield worse results than the English/French experiments. This is likely due to the frequency of specific German compounds in the datasets, which leads to a vocabulary mismatch between documents and Wikipedia articles. However, an examination of this remains for future work.