5.3 Experimental Setup and Data Analysis

We apply our approach to the task of translating Wikipedia entries. Being able to automatically translate Wikipedia entries would allow the proliferation of knowledge across more languages by providing a starting point for editors if an article already exists in another language. It could also be used to augment existing articles with missing parts translated from another language. We choose Wikipedia because it fits our envisaged scenario perfectly: it is a large multilingual collection containing document-level cross-lingual links, and there are no parallel data available for this task.

Wikipedia is internally structured by interlanguage links and inter-article links. Interlanguage links connect articles on the same topic across languages. While entries connected by these links often have the same subject, they are not necessarily parallel. Nonetheless, these connections provide a good starting point for automatic parallel data extraction. Inter-article links connect articles in the same language. These articles may be closely or only distantly related, but they do not have the same subject. It is not clear whether these articles contain any parallel data. In this work, we focus on interlanguage and inter-article links, but additional link structure could also be extracted from images embedded in text or from the categorization of articles.
We use the German-English WikiCLIR collection by Schamoni et al. (2014)³, along with their definition of cross-lingual relevance levels: a target-language document is highly relevant to a source document if there exists an interlanguage link between the source and target document. WikiCLIR calls this the mate relation. A target document is weakly relevant to a source document if there exists a bidirectional link between the source document’s cross-lingual mate and the target document. A bidirectional link exists between two documents if both reference each other via an inter-article link. WikiCLIR calls this the link relation. The corpus contains a total of 225,294 mate relations, with on average one mate per English document, and over 1.7 million link relations, with on average 8.5 links per English document. The search queries provided in WikiCLIR are designed for a cross-lingual retrieval task. They are truncated to the first 200 words of a document, and words occurring in the article title have been removed. As our task is article translation, we use the full Wikipedia documents rather than the truncated queries.
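The two relevance levels can be read as a small graph computation. The following is a minimal sketch, assuming the links are available as an in-memory dictionary and a set of directed pairs; the function name, the data layout, and the toy links are illustrative assumptions, not the WikiCLIR file format or its actual link data.

```python
def relevance_levels(interlanguage, inter_article):
    """Derive the mate and link relations for each source document.

    interlanguage: dict mapping a source-language article to its
                   target-language mate (hypothetical layout).
    inter_article: set of directed (a, b) pairs, meaning that
                   target-language article a links to article b.
    """
    relations = {}
    for source, mate in interlanguage.items():
        # Articles the mate links to.
        outgoing = {b for (a, b) in inter_article if a == mate}
        # Keep only those that also link back to the mate (bidirectional).
        links = {b for b in outgoing if b != mate and (b, mate) in inter_article}
        relations[source] = {"mate": mate, "link": links}
    return relations


# Toy data, invented for illustration only.
interlanguage = {"de:Ulm Hauptbahnhof": "en:Ulm Hauptbahnhof"}
inter_article = {
    ("en:Ulm Hauptbahnhof", "en:Ulm"),
    ("en:Ulm", "en:Ulm Hauptbahnhof"),            # bidirectional: weakly relevant
    ("en:Ulm Hauptbahnhof", "en:Deutsche Bahn"),  # one direction only: ignored
}
levels = relevance_levels(interlanguage, inter_article)
print(levels["de:Ulm Hauptbahnhof"])
```

In this sketch only the article that both links to and is linked from the mate ends up in the link relation; the one-directional link is dropped.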
We use the linked Wikipedia data to run an automated pseudo-parallel data extractor. We do this for three purposes: first, to identify nearly parallel document pairs for the construction of an in-domain evaluation set without having to rely on manual translation; second, to examine whether the mate and link relations provide a strong enough signal for extracting pseudo-parallel training data; third, to compare our method to automatic parallel data extraction based on cross-lingual document-level links. We use the modified yalign method described by Wołk and Marasek (2015) for pseudo-parallel data extraction.⁴ We adapt their software to handle the WikiCLIR format. yalign requires a bilingual dictionary with translation probabilities. Following Wołk and Marasek (2015), we use a lexical translation table created from the IWSLT parallel training data⁵ as our bilingual dictionary. We remove punctuation and numerals from the dictionary and discard all entries whose lexical translation probability is smaller than 0.3. We run the sentence aligner twice, once using the mate relation and once using the link relation to align documents.
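The dictionary filtering step can be sketched as follows. This is a minimal illustration under two assumptions: entries are held as (source word, target word, probability) tuples, and "punctuation and numerals" means that either side consists only of punctuation, symbol, or digit characters. The function names are hypothetical and the dictionary file format expected by yalign is not modeled here.

```python
import unicodedata


def is_punct_or_numeral(token):
    """True if the token is empty or made up only of punctuation, symbols or digits."""
    # Unicode general categories: P* = punctuation, S* = symbols, N* = numbers.
    return all(unicodedata.category(ch)[0] in {"P", "S", "N"} for ch in token)


def filter_dictionary(entries, min_prob=0.3):
    """Keep entries with probability >= min_prob whose words are real tokens."""
    kept = []
    for src, tgt, prob in entries:
        if prob < min_prob:
            continue  # translation probability too low
        if is_punct_or_numeral(src) or is_punct_or_numeral(tgt):
            continue  # punctuation or numeral entry
        kept.append((src, tgt, prob))
    return kept


# Toy entries, invented for illustration only.
entries = [
    ("Haus", "house", 0.82),
    ("Haus", "home", 0.12),         # below the 0.3 threshold
    (",", ",", 0.95),               # punctuation
    ("2014", "2014", 0.90),         # numeral
    ("Bahnhof", "station", 0.30),   # exactly 0.3 is kept
]
filtered = filter_dictionary(entries)
print(filtered)
```

Entries with a probability of exactly 0.3 are kept, since the text only discards probabilities smaller than 0.3.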
³ cl.uni-heidelberg.de/wikiclir/
⁴ github.com/krzwolk/yalign
⁵ wit3.fbk.eu/
Figure 5.1 Sentence aligner precision for mates and links. (a) Parallelism in mates: parallel 23.5%, almost parallel 21%, similar 21%, not parallel 34.5%. (b) Parallelism in links: similar 3.5%, not parallel 96.5%.
Figure 5.1 shows an analysis of yalign’s precision for the mate and link relations. For each case, a sample of 200 automatically aligned sentence pairs was manually evaluated. The sentence pairs were annotated using four categories: “parallel”; “almost parallel”, for sentence pairs that have strictly parallel segments, with other segments missing from the aligned part; “similar”, for sentence pairs that have similar content or wording but are not strictly parallel; and “not parallel”. While 65.5% of sentence pairs from the mate relation are at least similar (parallel, almost parallel, or similar), the link relation yields only 3.5% similar sentence pairs and no parallel ones. We conclude that the bidirectional link relation is too weak to extract useful pseudo-parallel data.
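The percentages above follow directly from the annotation counts. The sketch below recomputes them; the per-category counts are not given in the text and are reconstructed here from the reported percentages and the sample size of 200, so they should be read as an assumption.

```python
from collections import Counter


def category_shares(labels):
    """Return the percentage of sentence pairs per annotation category."""
    counts = Counter(labels)
    total = len(labels)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Counts reconstructed from the reported percentages of 200 pairs (assumption).
mate_labels = (["parallel"] * 47 + ["almost parallel"] * 42
               + ["similar"] * 42 + ["not parallel"] * 69)
link_labels = ["similar"] * 7 + ["not parallel"] * 193

mate_shares = category_shares(mate_labels)
link_shares = category_shares(link_labels)

# Share of pairs that are at least similar, i.e. everything but "not parallel".
mate_useful = sum(v for k, v in mate_shares.items() if k != "not parallel")
link_useful = sum(v for k, v in link_shares.items() if k != "not parallel")
print(mate_useful, link_useful)
```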
To gain an idea of the yield of the aligner, we also look at the number of sentence pairs that were extracted from the paired documents. Figure 5.2 shows the frequency histogram of the number of extracted lines per document pair for document pairs with a mate relation. For most document pairs, only a single sentence pair was extracted. However, there were a few pairs that yielded several hundred pseudo-parallel sentence pairs. In total, 533,516 sentence pairs were extracted.
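A frequency histogram of this kind can be computed with a simple counter. The following is a minimal sketch with invented toy numbers; the function name and the input list are assumptions, and the real per-document yields are not reproduced here.

```python
from collections import Counter


def yield_histogram(pairs_per_document):
    """Map n to the number of document pairs that yielded n sentence pairs."""
    return Counter(pairs_per_document)


# Toy yields for seven document pairs, invented for illustration only.
pairs_per_document = [1, 1, 1, 2, 1, 350, 2]
histogram = yield_histogram(pairs_per_document)
total_pairs = sum(pairs_per_document)

for n in sorted(histogram):
    print(n, histogram[n])
print("total sentence pairs:", total_pairs)
```

Because a few documents yield far more pairs than the rest, the document counts are best plotted on a logarithmic axis, as in Figure 5.2.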
To construct our in-domain evaluation data, we sorted all automatically aligned documents by the number of aligned sentences up to a limit of 10,000 sentences. We then selected eight document pairs, discarding other document pairs that appeared to have been machine-translated, contained only a few parallel sentences, or consisted of lists of proper names. We manually corrected sentence splitting errors in the selected documents and removed image captions and references. We split the documents into two groups of four, making sure to keep the sets topically diverse. Table 5.1 shows the two sets of extracted documents. They are similar in size (1,712 parallel sentences for Wiki1, 1,526 for Wiki2) and contain a considerable percentage of parallel sentences. We used Wiki1 as held-out validation set and Wiki2 for testing.

Figure 5.2 Number of documents (y-axis, on log-scale) from which n lines were extracted (x-axis).

Table 5.1 The two sets of extracted documents.

set    total sentences  parallel sentences  article title
Wiki1  323              285                 “Polish culture during World War II”
       710              677                 “Black-figure pottery”
       457              375                 “Ulm Hauptbahnhof”
       587              375                 “Characters of Carnivàle”
Wiki2  360              268                 “J-pop”
       501              388                 “Schüttorf”
       549              438                 “Military history of Australia during World War II”
       676              432                 “Arab citizens of Israel”
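The set sizes quoted in the text can be checked against the per-document counts in Table 5.1. The short sketch below sums both columns per set; the variable and function names are illustrative. Summing the parallel-sentence column gives the 1,712 and 1,526 sentences reported for Wiki1 and Wiki2.

```python
# Rows of Table 5.1: (total sentences, parallel sentences, article title).
TABLE_5_1 = {
    "Wiki1": [
        (323, 285, "Polish culture during World War II"),
        (710, 677, "Black-figure pottery"),
        (457, 375, "Ulm Hauptbahnhof"),
        (587, 375, "Characters of Carnivàle"),
    ],
    "Wiki2": [
        (360, 268, "J-pop"),
        (501, 388, "Schüttorf"),
        (549, 438, "Military history of Australia during World War II"),
        (676, 432, "Arab citizens of Israel"),
    ],
}


def set_totals(rows):
    """Return (total sentences, parallel sentences) summed over all rows."""
    total = sum(row[0] for row in rows)
    parallel = sum(row[1] for row in rows)
    return total, parallel


totals = {name: set_totals(rows) for name, rows in TABLE_5_1.items()}
for name, (total, parallel) in totals.items():
    # Share of sentences in each set that have a parallel counterpart.
    print(name, total, parallel, round(100.0 * parallel / total, 1))
```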