

6.5.1 Correlation between automatic methods

The results discussed in Section 6.4 show that the anchor + word method correlates highly with the translation method (ρ=0.717). Considering that the translation method represents state-of-the-art translation resources, this finding is very promising. On the other hand, translation quality for under-resourced languages often varies widely because of the lack of bilingual resources for training machine translation systems (Skadiņa et al., 2012). If this were the case in this study, the high correlation scores could not be used to determine the performance of the anchor + word method in general.

To investigate this in more detail, the author examined the translation quality of a set of documents in the evaluation set. The quality of Google Translate output for the under-resourced language pairs used at the time of the study varied widely.⁶ In general, the translation quality for the under-resourced language pairs was poorer than for the highly-resourced language pair, which confirms the findings of previous literature (Skadiņa et al., 2012). As an example, an excerpt of a translated Estonian article in the evaluation corpus about the "Estonian Auxiliary Police" is shown in Figure 6.8. This example shows that many words (often domain-specific terms) were left untranslated when using the translation method. Further work is needed to analyse the translation quality in these language pairs in more detail. This task was not carried out in this study because it required linguistic knowledge of each language pair, which could not be acquired in the limited time of the study.

... In Estonia, the Estonian Selbstschutz members was also politseipataljonid, välipolitseinikest, Punaarmeest and ületulnutest and mobiliseeritutest aastavahetusel 1943 - 1944, formed politsepataljonideks renamed items, original name was vahipataljon protection. ... Politseipataljonid formeeriti various tasks (also known as valvepolitsei (Schutzpolizei) watchkeeping duty, rannakaitse, fighting fronts partisanidega, etc.), hence the differences in relvaüksustel: Politseirügement set up named, Schutzmannschaft, protection, Police, Schutzmannschaft Vahipataljon infantry battalion. ...

Fig. 6.8 An excerpt of the English translation (using Google Translate) of an Estonian (ET) article about the "Estonian Auxiliary Police"

⁶ The evaluation corpus was translated using Google Translate in 2010. The qualities of both Google

Although the quality of Google Translate varied widely across languages, it was, at the time of the study, a state-of-the-art translation resource and a valid baseline against which to compare the proposed method. Furthermore, it specifically highlighted the challenges for under-resourced languages. The high correlation between the anchor + word method and the translation method shows that the use of Wikipedia as a bilingual resource for under-resourced languages is very promising. This finding also confirms that the anchor + word method can be used without a significant decrease in quality compared to the state-of-the-art translation method in these language pairs.
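As an illustration of how a correlation score such as the ones reported here can be obtained, the following minimal sketch computes a rank correlation between the similarity scores that two automatic methods assign to the same document pairs. It assumes that ρ denotes Spearman's rank correlation, which this section does not state explicitly. The function names and the five scores are hypothetical and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical similarity scores for five document pairs (illustration only).
anchor_word_scores = [0.10, 0.35, 0.40, 0.80, 0.55]
translation_scores = [0.20, 0.30, 0.60, 0.90, 0.50]

rho = spearman_rho(anchor_word_scores, translation_scores)
print(round(rho, 3))  # 0.9
```

Because only the ranks matter, the two methods can score on different scales and still correlate highly, as long as they order the document pairs similarly.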

Variations between language pairs

Although the overall correlation between the anchor + word method and the translation method is high, the correlation scores between these two automatic methods vary widely across the different language pairs. The highest correlation (ρ=0.897), achieved in German-English (DE-EN), is more than double the lowest correlation, ρ=0.441 in Greek-English (EL-EN). These varying degrees of correlation across languages can be explained by the following factors.

Firstly, the anchor + word method relies fully on the size of the translation resources extracted from Wikipedia to identify overlapping information in different languages. The size of these resources for each language pair is the number of interlanguage-linked articles in the language pair (see Table 6.2).⁷ The more interlanguage-linked articles are available in a language pair, the larger its translation resource, and the more likely it is that a word in the source language can be translated into English.

In the study, the results indicate that the higher the number of interlanguage-linked articles ("Total ILL articles") in a language pair, the higher the correlation between the automatic methods (i.e., the correlation between the anchor + word method and the translation method). LV-EN and EL-EN, which have the two lowest numbers of interlanguage-linked articles, also have lower correlation scores than the other language pairs. DE-EN, on the other hand, has a very high number of interlanguage-linked articles and a high correlation score between the two automatic methods.

Secondly, the performance of the anchor + word method may also be influenced by the similarity of the languages. Because the anchor + word method computes the overlap of words in the article content, it is likely to perform better on source languages that are similar to English. In other words, languages that share more words with English can be measured more accurately using the anchor + word method than those that share fewer words.
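The effect of shared vocabulary can be illustrated with a small sketch of raw word overlap. This is not the anchor + word method itself: the lexicon-based translation step is omitted, and the Jaccard measure, the tokenisation and the example sentences are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Set of lower-cased word tokens (the regex is Unicode-aware in Python 3)."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(source_text, english_text):
    """Jaccard overlap of the two token sets (an assumed overlap measure)."""
    a, b = tokens(source_text), tokens(english_text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up example sentences, not taken from the evaluation corpus.
english = "The Schutzmannschaft was an auxiliary police unit"
german = "Die Schutzmannschaft war eine Polizei Einheit"
greek = "Η βοηθητική αστυνομία"

overlap_de = word_overlap(german, english)  # one shared token: "schutzmannschaft"
overlap_el = word_overlap(greek, english)   # different script, no shared tokens
print(overlap_de > overlap_el)  # True
```

Without a translation resource, a source language written in a different script contributes no overlapping words at all, whereas a language that shares proper nouns or loanwords with English still produces some overlap.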

Identifying similarity across languages requires specific linguistic knowledge of all language pairs explored in this study. Since this information was not available, the number of duplicate titles in both languages is used to indicate the similarity between the languages. The proportion of same titles for each language pair is shown in Table 6.2. These proportions were calculated using the list of ILL articles only and therefore represent only the proportion of duplicate titles in a small subset of the vocabulary of the language pair. However, because the dataset is large (over 20,000 titles for each language pair), these findings can still be used to indicate the degree of similarity of the language pairs.

⁷ As described in Section 3.6.1, duplicate titles of interlanguage-linked articles were filtered out prior to creating the bilingual lexicon. However, these duplicate titles are also valuable in identifying overlapping words in both languages. Therefore, this analysis relates the number of interlanguage-linked articles (instead of the bilingual lexicon size) in each language pair to the correlation scores between the automatic methods.

Using these data, German-English was shown to have the highest proportion of same titles in its interlanguage-linked articles (72%). German is therefore more likely to share other words with English in general than Latvian or Lithuanian, which have 27% and 28% same titles, respectively. Unsurprisingly, Greek has the lowest proportion of same titles (23%), undoubtedly because it uses a different script from English. The results indicate that higher correlations tend to be achieved in language pairs that are indicated to be similar (i.e., that have a higher proportion of same titles), such as German-English. Greek-English, meanwhile, was shown to be the least similar language pair and achieved the lowest correlation score. This trend is, however, not strongly supported by the rest of the language pairs, suggesting that other factors may also affect these results.
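The proportion of same titles used as a similarity indicator above can be sketched as follows. The function name and the title pairs are hypothetical, and exact string equality is an assumption, since this section does not say whether titles were normalised (for example by case) before comparison.

```python
def same_title_proportion(title_pairs):
    """Fraction of interlanguage-linked title pairs whose two titles are identical.

    title_pairs: list of (source_title, english_title) tuples.
    Exact string equality is an assumption made for this sketch.
    """
    if not title_pairs:
        raise ValueError("no title pairs given")
    same = sum(1 for source, english in title_pairs if source == english)
    return same / len(title_pairs)


# Hypothetical interlanguage-linked title pairs (illustration only).
german_pairs = [
    ("Berlin", "Berlin"),
    ("Google", "Google"),
    ("Deutschland", "Germany"),
    ("Linux", "Linux"),
]
greek_pairs = [
    ("Αθήνα", "Athens"),
    ("Google", "Google"),
    ("Ελλάδα", "Greece"),
    ("Γλώσσα", "Language"),
]

print(same_title_proportion(german_pairs))  # 0.75
print(same_title_proportion(greek_pairs))   # 0.25
```

In this toy data, the Greek pairs match only where the title is a Latin-script proper noun, which mirrors the explanation given for the low proportion of same titles in Greek-English.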

In document Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia (Page 180-183)