
Pivot-based multilingual dictionary building using Wiktionary


Academic year: 2020


[Figure 1 graph: nodes hu:céh, en:guild, ro:breaslă]

Figure 1: Straight edges represent translation pairs extracted directly from the Wiktionaries. The pair guild–breaslă was found via triangulating.

Figure 1, using the Hungarian word céh as a pivot for joining its English and Romanian translations, thus creating the previously non-existent translation pair guild–breaslă. As pointed out by Saralegi et al. (2012), the initial results obtained via triangulation are quite noisy. We distinguish four classes of translation pair candidates:

1. Correct candidates

2. Wrong candidates due to polysemy

3. Wrong candidates due to errors in the original dictionary

4. Wrong candidates due to parsing errors in the extracted dictionary

[Figure 2 graph: nodes en:book, fr:réserver, de:Buch]

Figure 2: Error due to polysemy

The main source of errors is the polysemous nature of words. An example of this would be to join the German word Buch with the French word réserver through the polysemous English word book (see Figure 2).
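The triangulation step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function name `triangulate`, the `lang:word` string encoding, and the toy input are assumptions made for the example. It joins every two translations of a shared pivot word and records which pivots produced each new candidate pair.

```python
from collections import defaultdict
from itertools import combinations


def triangulate(pairs):
    """Return new candidate pairs mapped to the set of pivots that link them.

    `pairs` holds direct translation pairs as ('lang:word', 'lang:word')
    tuples (a hypothetical encoding chosen for this sketch).
    """
    pairs = list(pairs)
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    known = {frozenset(p) for p in pairs}
    candidates = defaultdict(set)
    for pivot, words in neighbours.items():
        # Join every two translations of the same pivot word.
        for a, b in combinations(sorted(words), 2):
            if a.split(":", 1)[0] == b.split(":", 1)[0]:
                continue  # same language: not a translation pair
            if frozenset((a, b)) in known:
                continue  # already extracted directly
            candidates[(a, b)].add(pivot)
    return dict(candidates)


# Toy input combining the edges of Figure 1 and Figure 2.
direct_pairs = [
    ("hu:céh", "en:guild"),
    ("hu:céh", "ro:breaslă"),
    ("en:book", "de:Buch"),
    ("en:book", "fr:réserver"),
]
for pair, pivots in triangulate(direct_pairs).items():
    print(pair, sorted(pivots))
```

On this toy input the sketch yields both the correct candidate guild–breaslă and the polysemy error Buch–réserver, each supported by a single pivot, which is why triangulated candidates need further filtering.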

The simplest filtering method, IC, amounts to accepting only pairs found via at least two pivots (see Figure 3). Unfortunately, this aggressive filtering greatly reduces the number of triangulated pairs. It also does not solve the issue of parallel noise in the original data. Let’s assume that we extract the English-Greek pair dog–XXX, where XXX is used as a placeholder for future translations (this is actually used in the Greek Wiktionary). If the placeholder is widely used, it is possible that we have an entirely different pair with the same Greek side, such as the German-Greek pair

[Figure 3 graph: nodes en:book, fr:réserver, hu:lefoglal, de:buchen]

Figure 3: Translation graph with two pivots

[Figure 4 graph: nodes en:dog, de:Buch, and two el:XXX placeholder nodes]

Figure 4: Error due to parallel noise

Buch–XXX. It is easy to imagine the same case for many words, which results in erroneous translation pairs found via several XXX pivots (see Figure 4). Although we tried to filter these placeholders, there is a high chance that we overlooked some of them in the 43 Wiktionary editions. To solve this issue, we examine the source Wiktionary edition of the pairs (i.e. the Wiktionary they were extracted from). All pairs are considered symmetrical, but we order them alphabetically by the Wiktionary codes, thus creating a left and a right side of a triangle. In Figure 1 the pair céh–guild is the left pair and the pair céh–breaslă the right pair. We consider a candidate pair to be more reliable based on the following:

1. its left and right side were extracted from different Wiktionaries,

2. either side was found in more than one Wiktionary,

3. the pair was found via more than one pivot.

We call this group of measures edge diversity.
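The three edge diversity checks, together with the IC criterion of at least two pivots, could be computed roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the function `edge_diversity`, the `(word, word, source_wiktionary)` edge format, the source codes in the toy data, and the choice to count a check as satisfied when any single pivot satisfies it are all assumptions made for illustration.

```python
from collections import defaultdict


def edge_diversity(edges, word_a, word_b):
    """Compute edge diversity flags for the candidate pair word_a / word_b.

    `edges` holds direct pairs as (word, word, source_wiktionary) tuples,
    with words written as 'lang:word' (hypothetical encoding).
    """
    sources = defaultdict(set)     # unordered pair -> Wiktionaries it came from
    neighbours = defaultdict(set)  # word -> its direct translations
    for a, b, src in edges:
        sources[frozenset((a, b))].add(src)
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Sorting 'lang:word' strings orders the two sides by language code first,
    # giving the left and the right side of each triangle.
    left, right = sorted((word_a, word_b))
    pivots = neighbours[left] & neighbours[right]

    different_sources = False
    multiple_sources = False
    for pivot in pivots:
        left_src = sources[frozenset((pivot, left))]
        right_src = sources[frozenset((pivot, right))]
        # 1. the two sides can be attributed to different Wiktionaries
        different_sources = different_sources or len(left_src | right_src) > 1
        # 2. either side was found in more than one Wiktionary
        multiple_sources = (
            multiple_sources or len(left_src) > 1 or len(right_src) > 1
        )

    return {
        "different_sources": different_sources,
        "multiple_sources": multiple_sources,
        "multiple_pivots": len(pivots) > 1,  # 3. also the IC criterion
        "pivots": sorted(pivots),
    }


# Figure 3 graph with invented source codes.
figure3_edges = [
    ("en:book", "fr:réserver", "en"),
    ("en:book", "hu:lefoglal", "en"),
    ("de:buchen", "fr:réserver", "de"),
    ("de:buchen", "hu:lefoglal", "de"),
    ("de:buchen", "hu:lefoglal", "hu"),
]
print(edge_diversity(figure3_edges, "fr:réserver", "hu:lefoglal"))

# Figure 4 parallel noise: both sides come from the Greek Wiktionary only.
noise_edges = [
    ("el:XXX", "en:dog", "el"),
    ("el:XXX", "de:Buch", "el"),
]
print(edge_diversity(noise_edges, "en:dog", "de:Buch"))
```

In the toy data the réserver–lefoglal candidate satisfies all three checks, while the dog–Buch candidate produced by the XXX placeholder satisfies none, because both of its sides come from the same Wiktionary.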


Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al. (2013). DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal (under review, 2013).

Saralegi, X., Manterola, I., and Vicente, I. S. (2011). Analyzing methods for improving precision of pivot based bilingual dictionaries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 846–856. Association for Computational Linguistics.

Saralegi, X., Manterola, I., and Vicente, I. S. (2012). Building a Basque-Chinese dictionary by using English as pivot. In Calzolari, N. (Conference Chair), Choukri, K., Declerck, T., Doğan, M. U., Maegaard, B., Mariani, J., Odijk, J., and Piperidis, S., editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).

Soderland, S., Etzioni, O., Weld, D. S., Skinner, M., Bilmes, J., et al. (2009). Compiling a massive, multilingual dictionary via probabilistic inference. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 262–270. Association for Computational Linguistics.

Tanaka, K. and Umemura, K. (1994). Construction of a bilingual dictionary intermediated by a third language. In Proceedings of the 15th Conference on Computational Linguistics, Volume 1, pages 297–303. Association for Computational Linguistics.

Figure

Figure 5: Translation graph with many pivots. The edge labels denote the source Wiktionary and article of the translation pair.
Table 4: Summary of dictionaries built

