5.3 Online MT-based Techniques

5.3.1 Methods Employed

Depending on the online MT systems used to discover the translations of the elements of an ontology, we have three methods: statistical-based, rule-based and hybrid-based systems. The hybrid-based systems leverage the strengths of statistical and rule-based translation methodologies. In all cases the translation process is as follows:

• Translation candidate extraction: The basic approach to using an online MT system in ontology localization is simple: one just has to submit the ontology element to an MT system to obtain a translated version. Some works in the literature show that better output quality can be achieved when these systems process the texts they are required to translate in smaller chunks [Wu et al., 2008a]. Our own experience with these systems suggests a high efficiency in the translation of compound labels and short phrases. However, if a single word is submitted, there is a high chance that it will be rendered by its default translation.

To solve the translation of simple labels we have investigated the use of the term verbalization context as translation input. Recall that the term verbalization context produces a short natural language phrase for the term in the ontology. However, this solution involves the use of different word/phrase alignment tools and algorithms for identifying translation relationships among the words in a bitext (see footnote 26), as used in statistical machine translation (see Section 5.5.1). Of course, further investigation is necessary to evaluate this approach as a plausible solution to mitigate the lack of context in simple labels.

• Translation selection: Each online translation service uses its own disambiguation method, which cannot be customized. In other words, these methods do not use the context of the term to be localized to rank their translations. Although we are interested in translation methods that use source context information to improve translation quality, we include these systems in the classification because of the quality improvements reported in similar domains (e.g., translation of metadata records [Chen et al., 2012] or query translation [Wu et al., 2008b]).

23 http://www.microsofttranslator.com/
24 http://translate.google.com/
25 http://babelfish.yahoo.com/
26 In the field of translation studies, a bitext is a merged document composed of both source and target language versions of a given text.

5.3.2 Advantages

Machine translation has long proven to be an elusive goal in localization, but today a number of online systems are available whose output, though not perfect, is of sufficient quality to be useful in a number of specific domains. These services also save time when translating large texts and often allow customization through domain or user-specific settings, e.g., choosing between American English and British English.

5.3.3 Disadvantages

Normally the output of these systems is limited to one translation per word, even though the target language may offer several expressions for it. For example, both “drogue” and “stupéfiant” are correct French translations of “drug” in the sense of illegal substance, but both Babelfish and Google choose only “stupéfiant” in their translations of “drug traffic”. Nor do these translation services suggest strongly related words that are not translations. Such words can nevertheless be very useful in ontology localization. For example, it may be useful to “translate” the word “computer” by the French word “programme” even if the latter is not a literal translation of the former, since it may help retrieve other related terms that could be relevant.

Another problem with these systems is the difficulty of translating unknown words, or out-of-vocabulary (OOV) words [Qu et al., 2012]. A typical case is the translation of ontology entities representing names of persons or organizations. Also, these systems cannot be extended, e.g., by adding very specific domain terms or proper names. Finally, the use of these translation services is often limited to a certain number of characters per day.

5.4 Knowledge-based Techniques

Knowledge-based techniques rely on dictionaries, terminology databases, glossaries, encyclopedias, thesauri or lexical knowledge bases, without any corpus evidence, to generate the target translations. These resources provide information such as examples, definitions, or semantic hierarchies and associations that can be used to help select more appropriate translations in context. Basically, these methods use a direct translation approach and rely on the computation of similarity measures to disambiguate the candidate translations (e.g., Pedersen et al. [Pedersen et al., 2005], Resnik [Resnik, 1999]).

5.4.1 Methods Employed

In the literature, knowledge-based techniques are normally classified into dictionary-based and thesauri-based approaches. This distinction takes into account the degree of structured information contained in the resource. For our purposes we use the same categorization:

Dictionary-based

Dictionary-based methods take advantage of the multilingual linguistic information available in machine readable dictionaries, glossaries, encyclopedias or terminological databases to discover the translations. The main difficulty of the dictionary-based techniques is to select the correct translation of a term among all the translations provided by these resources.

• Translation candidate extraction: Some of these resources require a normalization process before each word is submitted as a query to the translation source. The normalization process involves, for example, transforming nouns to their singular form, verbs to the infinitive, and adjectives to their positive form. After the normalization, the resource returns a set of translations any time the label exactly matches a word in the source entries.

All these resources offer an exact or fuzzy search mechanism to extract the candidate translations. To prevent an explosion of nuisance matches, the term POS tagging context can be used. POS information has been shown to resolve 87% of all word ambiguities [Wilks and Stevenson, 1997]. This is useful, since dictionaries in general have separate hierarchies for words of different POS, and contemporary POS taggers are highly accurate [Brill, 1995]. With this context information we can retain only those translations whose POS exactly matches the POS of the search term. This assumption holds for the majority of languages.

• Translation selection: In spite of the selection performed in the previous step, it is common for a single word to have several translations, some with very different meanings. To disambiguate the senses of a source term, we can mainly employ the example sentences, definitions or related terms listed for each sense division of a source word. The Lesk algorithm [Lesk, 1986], in which the most likely meanings for the words in a given context are identified based on a measure of contextual overlap among dictionary definitions pertaining to the various senses of the ambiguous words, provides reasonable disambiguation precision for these types of resources. If additional resources are available (e.g., a set of semantic relations from a semantic network or a minimal set of annotated data), other methods can be applied (see the Translation selection step of the Thesauri-based techniques).
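The two dictionary-based steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy English-Spanish entries, the stopword list and the function names are hypothetical, the normalization is reduced to lowercasing and stripping a plural "s", and the selection step uses a simplified gloss/context overlap in the spirit of Lesk rather than the original algorithm.

```python
import re

# Hypothetical toy dictionary: every sense carries a POS, a gloss and a translation.
TOY_DICTIONARY = {
    "bank": [
        {"pos": "noun", "translation": "banco",
         "gloss": "financial institution that accepts deposits and lends money"},
        {"pos": "noun", "translation": "orilla",
         "gloss": "sloping land beside a river or lake"},
        {"pos": "verb", "translation": "ladear",
         "gloss": "tilt an aircraft sideways when turning"},
    ],
}

# Small illustrative stopword list (an assumption, not taken from any resource).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "in", "to", "when", "is"}


def normalize(label):
    """Rough normalization: lowercase and strip a plural 's' if that yields an entry."""
    word = label.strip().lower()
    if word.endswith("s") and word[:-1] in TOY_DICTIONARY:
        return word[:-1]
    return word


def content_words(text):
    """Lowercased alphabetic tokens of `text` without stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def candidate_senses(label, pos):
    """Candidate extraction: exact match on the entry, then keep matching POS only."""
    return [s for s in TOY_DICTIONARY.get(normalize(label), []) if s["pos"] == pos]


def select_translation(label, pos, context):
    """Translation selection: sense whose gloss shares most words with the context."""
    senses = candidate_senses(label, pos)
    if not senses:
        return None
    context_words = content_words(context)
    best = max(senses, key=lambda s: len(content_words(s["gloss"]) & context_words))
    return best["translation"]
```

For instance, `select_translation("Banks", "noun", "institution that lends money to customers")` picks the financial sense, while a context mentioning a river picks the other noun sense; ties are resolved in favour of the first listed sense.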

Thesauri-based

Thesauri-based methods take advantage of resources with semantic hierarchies and associations (such as a lexicon or thesaurus) to generate the target ontology translations.

• Translation candidate extraction: The process used to discover candidate translations is similar to the method introduced for the Dictionary-based techniques.

• Translation selection: In addition to the Lesk algorithm, we can disambiguate the candidate translations using the implicit relations (synonym, hypernym, hyponym, etc.) found in this type of resource. In effect, the senses of the surrounding words in the context can be expanded to include the senses of the words to which they are semantically related through these extended relations. Different measures of semantic similarity can be used for ranking the candidate translations. Below we list the measures that are most relevant for the purposes of this thesis:

– Variations of the Lesk Algorithm: Among all variations of this algorithm, the simplified Lesk method [Kilgarriff and Rosenzweig, 2000] is the one that improves most on the original algorithm, both in terms of efficiency (it overcomes the combinatorial sense explosion problem) and precision (comparative evaluations have shown that this alternative leads to better disambiguation results). In this simplified algorithm, the correct meaning of each word in a text is determined individually by finding the sense that leads to the highest overlap between its dictionary definition and the current context. Another variation of the Lesk algorithm, called the adapted Lesk algorithm, was introduced by Banerjee and Pedersen [Pedersen et al., 2005]; it extends gloss overlaps through the rich network of word sense relations in WordNet rather than simply considering the glosses.

– Measures of semantic similarity computed over semantic networks: These measures quantify the degree to which two words are semantically related. Most of them rely on semantic networks and follow the original methodology proposed by Rada et al. [Rada et al., 1989] for computing metrics on semantic nets. These measures include methods for finding the semantic density/distance between concepts. A comprehensive survey of semantic similarity measures is reported by Budanitsky and Hirst [Budanitsky, 2001].
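As an illustration of a distance computed over a semantic network, the sketch below counts the edges on the shortest path between two concepts and ranks candidates by that distance. It is a simple edge-counting measure in the spirit of the metrics mentioned above, not a reproduction of any particular published formula; the toy is-a taxonomy and the function names are hypothetical.

```python
from collections import deque

# Hypothetical toy is-a taxonomy, stored as child -> parent.
IS_A = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "trout": "fish", "fish": "animal",
}


def adjacency(taxonomy):
    """Undirected adjacency sets built from the child -> parent links."""
    adj = {}
    for child, parent in taxonomy.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    return adj


def path_distance(a, b, taxonomy=IS_A):
    """Number of edges on the shortest path between a and b (None if unreachable)."""
    adj = adjacency(taxonomy)
    if a not in adj or b not in adj:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # breadth-first search guarantees the shortest path
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def rank_by_distance(context_concept, candidates, taxonomy=IS_A):
    """Order candidate senses from closest to farthest from a context concept."""
    scored = [(path_distance(context_concept, c, taxonomy), c) for c in candidates]
    return [c for d, c in sorted(p for p in scored if p[0] is not None)]
```

A smaller distance is read as a stronger semantic relation, so the candidate sense closest to the senses of the surrounding context would be ranked first.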

More detailed reviews of different knowledge-based disambiguation techniques can be found in [Agirre and Stevenson, 2006].

5.4.2 Advantages

The resources underlying these techniques are widely available, dictionary-based approaches are easy to implement, and these resources can produce consistent, high-quality translations (conditional on the quality of the original bilingual resource).

5.4.3 Disadvantages

One of the main problems associated with dictionary-based techniques is untranslatable words due to the limitations of general resources. The category of untranslatable words involves new compound words, special terms, and cross-lingual spelling variants, i.e., equivalent words in different languages which differ slightly in spelling, particularly proper names and loanwords. The problem of missing translations can be addressed by automatically mining additional translation relations. We leave this problem to Section 5.5.1.

5.5 Corpus-based Techniques

These methods use a parallel or comparable corpus of aligned documents to discover translations. The criteria used for alignment combine linguistic and statistical information.

5.5.1 Methods Employed

Many approaches have been proposed to extract translation relations from parallel or comparable corpora. In the following, we will describe some representative approaches:

Example-based MT

The philosophy of example-based machine translation (EBMT) [Nagao, 1984] combines the features of rule-based and statistical approaches in a manner that seems favorable for the task at hand. The main idea behind EBMT is that a given input phrase in the source language is compared with the example translations in the given bilingual parallel text to find the closest matching examples that can be used in the translation of that input phrase. One of the main approaches in the EBMT paradigm is to use pattern matching techniques. First, these approaches collect word sequences from each corpus using translation patterns to acquire candidates for bilingual expressions. Second, a search for pairs of words that satisfy the correspondences of the sequences is performed. Therefore, a pre-processing step such as part-of-speech tagging and syntactic category identification is necessary to apply this method.

• Resource pre-processing: Before discovering candidate translations, a bilingual template acquisition from a simple monolingual corpus or parallel corpora has to be completed. In general, this process involves three phases: retrieving local patterns, assigning their syntactic categories with part-of-speech (POS) templates, and making translation patterns.

– Retrieving local patterns. In order to retrieve local patterns, any method for retrieving word sequences may be used [Kansai et al., 1996, Sato and Saito, 2002]. These methods generate all n-character (or n-word) strings appearing in a text and filter out fragmental strings using the distribution of the words adjacent to the strings. This is based on the idea that adjacent words are widely distributed if the string is meaningful, and are localized if the string is a substring of a meaningful string.

– Identifying syntactic categories. Since the strings are just word sequences, this task gives them syntactic categories. It involves assigning part-of-speech tags to each component word discovered in the previous step. A syntactic category can be used to group similarly tagged words. For example, the syntactic category NN can be used to group the following sample POS templates: (word) (word) or (word) (preposition) (word). In this example NN represents a noun phrase.

– Making translation patterns. The final process is to generate the bilingual translation patterns. When a monolingual corpus is used as the basis for discovering the patterns, each word identified in the first phase must be translated before its syntactic categories are identified.

The output of this process is a repository of lexical templates for MT.

• Translation candidate extraction: The term POS tagging context could

To retrieve candidate translations, we can collect the n-grams of POSs appearing in a translation pattern (e.g., NN, JN, etc.) from each corpus. As this method simply extracts word sequences according to POS tags, it also collects noisy sequences. However, most meaningless sequences can be eliminated by estimating different types of word similarity correspondences.

• Translation selection: After generating a ranked list of translation candidates for each source term, ranking techniques must be used to estimate the coherence of the translated label and decide the best translation. The ranking factor can be estimated using one of the techniques described below:

– Ranking through Web. The Web can be considered as an exemplar linguistic resource for decision-making [Grefenstette, 1999, Li et al., 2003]. In this approach, each candidate translation is sent to a Web search engine (e.g., Google) to discover how often the combination of translation alternatives appears. The number of retrieved Web pages in which the translated sequence occurred is used to rank the translation candidates.

– Ranking through a test collection. Large-scale test collections could be used to rank the translation alternatives and complete a final translation. We can follow the same steps as the previous technique, replacing the Web by a test collection and a retrieval system to index documents of the test collection.

– Ranking through an interactive mode. An interactive mode [Ogden and Davis, 2000] could help solve the problem of identifying final translations. The interactive environment setting should optimize the label translation, select the best translation alternatives and facilitate information access across languages. For instance, the user can access a list of all possible candidates ranked in the form of a hierarchy on the basis of the word ranks associated with each translation alternative.
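Two of the steps described above lend themselves to a short sketch: retrieving local patterns by looking at how varied the neighbours of an n-gram are, and ranking translation candidates by how many documents of a test collection contain them. The code below is a minimal illustration under stated assumptions: the thresholds, the toy corpus, the toy French collection and all function names are hypothetical, and "widely distributed" is approximated simply by the number of distinct left and right neighbours (sentence boundaries count as neighbours).

```python
import re
from collections import defaultdict


def words(text):
    """Lowercased word tokens (Unicode-aware, punctuation dropped)."""
    return re.findall(r"\w+", text.lower())


def local_patterns(sentences, n=2, min_count=2, min_distinct=2):
    """Keep word n-grams that are frequent and whose neighbours vary on both sides."""
    counts = defaultdict(int)
    left, right = defaultdict(set), defaultdict(set)
    for sentence in sentences:
        toks = ["<s>"] + words(sentence) + ["</s>"]
        for i in range(1, len(toks) - n):
            gram = tuple(toks[i:i + n])
            counts[gram] += 1
            left[gram].add(toks[i - 1])    # word just before the n-gram
            right[gram].add(toks[i + n])   # word just after the n-gram
    return sorted(
        g for g, c in counts.items()
        if c >= min_count and len(left[g]) >= min_distinct and len(right[g]) >= min_distinct
    )


def rank_by_collection(candidates, collection):
    """Rank candidates by the number of documents containing them as a phrase."""
    def hits(phrase):
        needle = " " + " ".join(words(phrase)) + " "
        return sum(needle in " " + " ".join(words(doc)) + " " for doc in collection)
    return sorted(((c, hits(c)) for c in candidates), key=lambda p: (-p[1], p[0]))


# Hypothetical toy data, for illustration only.
TOY_CORPUS = [
    "statistical machine translation needs parallel corpora",
    "we evaluate statistical machine translation on ontology labels",
    "rule based machine translation is older",
    "online machine translation services are popular",
]
TOY_COLLECTION = [
    "La police lutte contre le trafic de stupéfiants.",
    "Le trafic de stupéfiants augmente dans la région.",
    "Un réseau de trafic de drogue a été démantelé.",
    "Le trafic routier est dense ce matin.",
]
```

On the toy corpus, "machine translation" is kept as a bigram pattern, whereas "statistical machine" is discarded because it is always followed by the same word, which marks it as a fragment of a longer string. Replacing the document counts with hit counts returned by a search engine would give the Web-based variant of the ranking.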

Statistical-based MT

These approaches analyze large collections of texts on a statistical basis and automatically extract the most probable translations in the target language [Peters and Sheridan, 2000]. The recent progress in SMT suggests interesting future development for ontology localization [Stroppa et al., 2007, Gimpel and Smith, 2008]. In particular, phrase-based translation approaches have become the state of the art in SMT, while these approaches have not yet been widely investigated in localization. A recent work [McCrae et al., 2011a] analyzes different translation strategies using statistical machine translation approaches that also utilize the semantic information beyond the label or term describing the concept, that is, relations among the concepts in the ontology, as well as the attributes or properties that describe concepts:

• Resource pre-processing: Bilingual word/phrase alignment is the first step of most current approaches to SMT. Alignment is a vital issue in the construction and exploitation of parallel corpora. The alignment methodology tries to identify translation equivalence between sentences, and between words and phrases within sentences. In most of the literature, alignment methods are categorized as either association approaches (heuristic models) or estimation approaches (statistical models). Association approaches use string similarity measures, word order heuristics, or co-occurrence measures (e.g., mutual information scores). The major distinction between statistical and heuristic approaches is that statistical approaches are based on well-substantiated probabilistic models while heuristic ones are not. Most current SMT systems use a generative model for word alignment such as the one implemented in the freely available tool GIZA++ [Och and Ney, 2003]. GIZA++ is an implementation of the IBM alignment models [Brown et al., 1993]. These models treat word alignment as a hidden process, and maximize the probability of the observed (e, f) sentence pairs using the Expectation Maximization (EM) algorithm, where e and f are the source and the target sentences.

• Translation candidate extraction: To discover candidate translations, SMT-based methods generally use the occurrence frequencies of substrings of the sentence in target-language corpora. The score assigned to each candidate translation depends on both: i) the extent to which the source sentence meaning is also expressed in the candidate translation, and ii) the extent to which the candidate translation is likely to be a valid sentence in the target language regardless of whether or not its meaning bears any relationship to the source sentence. The details of how this score is computed are out of the scope of this thesis; they can be consulted in [Hearne and Way, 2011].

• Translation selection: In order to discover the final translations, the approach introduced in [McCrae et al., 2011a] uses word sense disambiguation by comparing the structure of the input ontology to that of an already translated reference ontology. We found this method to be very effective in choosing the best translations. However, it is dependent on the existence of a multilingual resource that already has such terms. As such, we view the topic of taxonomy and ontology translation as an interesting sub-problem of machine translation and believe there is still much fruitful work to be done to obtain a system that can correctly leverage the semantics present in these data structures in a way that improves translation quality.
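To make the EM-based word alignment mentioned in the pre-processing step more concrete, the following sketch trains lexical translation probabilities in the style of IBM Model 1 on a toy corpus. It is a teaching sketch under stated assumptions, not the GIZA++ implementation: it omits the NULL word and the higher IBM models, and the toy German-English sentence pairs and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (source tokens, target tokens).
TOY_PAIRS = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]


def ibm_model1(pairs, iterations=10):
    """Estimate t[(tgt, src)] = P(tgt | src) with EM, treating alignments as hidden."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over all co-occurring word pairs.
    t = {(f, e): uniform for src, tgt in pairs for f in tgt for e in src}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts under the current model.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate the translation probabilities from those counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, source_word):
    """Most probable target word for a source word, or None if it was never seen."""
    options = {f: p for (f, e), p in t.items() if e == source_word}
    return max(options, key=options.get) if options else None
```

Although "das" co-occurs with "house" and "book" as well as with "the", the repeated re-estimation concentrates its probability mass on "the", which is the behaviour that makes co-occurrence statistics usable for extracting translation relations.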

Translation Memory tools

The essential idea behind these techniques is the use of a linguistic database (also called translation memory) in order to reuse previously translated words. These techniques are often used in order to compare segments in the source text with the translated segments in the translation memory. For our purposes, a segment can consist of simple ontology labels, compound labels, or term annotation paragraphs.

• Translation candidate extraction: Linguistic databases provide a number of efficient search options to extract candidate translations:

– Fuzzy matching. This is the dominant approach for the retrieval of similar segments from translation memories, because the probability of exactly repeated segments is small, except in the context of re-translating the labels of a modified resource (in our case an ontology). The method can be based on orthographic similarities, which can be efficiently computed by comparing the number of corresponding substrings (e.g., bi- or trigrams) of two segments [Willett and Angell, 1983, Rapp, 1997]. Another option to measure the distance between two fuzzy matching content segments is to use the Levenshtein algorithm [Levenshtein, 1965]. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string

In document Ontology Localization (Page 107-120)