Incorporating Morphological Information at Decoding Time

A factored translation model (Koehn and Hoang, 2007), an extension of PBSMT, is the most suitable example of incorporating additional annotation, including morphological information, at decoding time. The main problem with PBSMT is that it translates phrases without any explicit use of linguistic information, even though such information seems beneficial for fluent translation. In factored models each word is extended by a set of annotations, so a word in this framework is not just a token but a vector of factors. For example, a simple word in PBSMT can be represented by the vector {word (surface form), lemma, POS tag, word class, morphological information}. This representation is richer than the surface form alone. Because factored models focus on word-level enrichment, they directly address the problem of morphology, which fits our case.
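The factor vector described above can be sketched as a small data structure. This is only an illustration: the class name `FactoredWord`, its field names, and the tag values (`VBZ`, `C17`, `3sg-present`) are hypothetical and not taken from Koehn and Hoang (2007); the pipe-separated rendering simply mirrors the `house|NN|plural` notation used later in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactoredWord:
    """A word as a vector of factors rather than a bare token (illustrative)."""
    surface: str
    lemma: str
    pos: str
    word_class: str
    morphology: str

    def as_factor_string(self, sep="|"):
        # Join all factors into one annotated token, e.g. "studies|study|...".
        return sep.join(
            (self.surface, self.lemma, self.pos, self.word_class, self.morphology)
        )


# Hypothetical annotation values, chosen only to show the shape of the vector.
word = FactoredWord("studies", "study", "VBZ", "C17", "3sg-present")
factor_string = word.as_factor_string()
print(factor_string)
```

Keeping the factors as named fields makes it explicit that each one can be translated or generated separately, which is the property the rest of this section relies on.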

Let us have a closer look at the model. In word-based or phrase-based approaches each word is treated independently, i.e. ‘studies’ has no relation to ‘studied’. If only one of them was seen during training, translating the other would be hard (or even impossible) for any MT engine, even though they come from the same root. Translation knowledge of their shared stem, along with extra morphological information, could help us translate both of them (and even all derivative forms of the stem). This property not only provides a solution for this sort of morphological issue but also addresses the data sparsity problem at the same time. A factored translation model follows a similar approach and performs better than other word-based models for MRLs (see Section 5.2.3).
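The sparsity argument can be made concrete with a toy coverage check. The sketch below assumes a tiny hand-made set of (surface form, lemma) pairs; the data and the helper `coverage` are invented for illustration and do not come from any experiment in this work.

```python
# Toy (surface, lemma) pairs; 'studied' never occurs in the training side.
train = [("studies", "study"), ("houses", "house"), ("house", "house")]
test = [("studied", "study"), ("houses", "house")]


def coverage(train_pairs, test_pairs, key):
    """Fraction of test items whose key (surface or lemma) was seen in training."""
    seen = {key(pair) for pair in train_pairs}
    return sum(key(pair) in seen for pair in test_pairs) / len(test_pairs)


surface_cov = coverage(train, test, key=lambda pair: pair[0])
lemma_cov = coverage(train, test, key=lambda pair: pair[1])
print(surface_cov, lemma_cov)
```

On this toy data the surface form ‘studied’ is unseen, but its lemma ‘study’ is covered, so lemma-level coverage is higher than surface-level coverage.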

Translation in factored models is generally broken up into two translation steps and one generation step. A source lemma is translated into a target lemma, the morphological and POS factors are translated into their target forms, and the final surface form is generated from the lemma and the other factors. Factored models follow the same implementation framework as the phrase-based model. In these models the translation steps operate at the phrase level, whereas generation steps are word-level operators. The pipeline is illustrated step by step by translating the German word Häuser into English. We use the same example reported in Koehn and Hoang (2007):

• Factored representation: (surface form: Häuser), (lemma: Haus), (POS: NN), (count: plural), (case: nominative)

• Translation (mapping lemmas): Haus → house|home|building|shell

• Translation (mapping morphology): NN|plural-nominative-neutral → NN|plural, NN|singular

• Generation (generating surface forms):

– house|NN|plural → houses

– house|NN|singular → house

– home|NN|plural → homes
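The three mapping steps above can be sketched as table lookups followed by a cross product. This is a minimal illustration, not the decoder's actual implementation: the table names and the function `expand` are hypothetical, the tables contain only the entries listed in the example, and no probabilities are attached to any option.

```python
from itertools import product

# Step 1: lemma translation options (from the example above).
LEMMA_TABLE = {"Haus": ["house", "home", "building", "shell"]}

# Step 2: translation of the POS and morphology factors.
MORPH_TABLE = {
    ("NN", "plural-nominative-neutral"): [("NN", "plural"), ("NN", "singular")],
}

# Step 3: generation of surface forms; only the listed entries are included.
GENERATION_TABLE = {
    ("house", "NN", "plural"): ["houses"],
    ("house", "NN", "singular"): ["house"],
    ("home", "NN", "plural"): ["homes"],
}


def expand(lemma, pos, morph):
    """Enumerate every target option reachable through the three mapping steps."""
    options = []
    lemmas = LEMMA_TABLE.get(lemma, [])
    morphs = MORPH_TABLE.get((pos, morph), [])
    for tgt_lemma, (tgt_pos, tgt_morph) in product(lemmas, morphs):
        # Combinations with no generation entry are silently dropped here.
        for surface in GENERATION_TABLE.get((tgt_lemma, tgt_pos, tgt_morph), []):
            options.append((surface, tgt_lemma, tgt_pos, tgt_morph))
    return options


options = expand("Haus", "NN", "plural-nominative-neutral")
for option in options:
    print("|".join(option))
```

Because every lemma choice is combined with every morphology choice, the number of candidate expansions grows multiplicatively, which is the expansion effect discussed next.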

Multiple choices at each step can generate multiple surface forms, which results in phrase expansions. Training is performed similarly to the basic phrase-based model. Word phrases are extracted with standard models. Factors are also treated as words, and their phrases are extracted in the same way as those of surface forms. Generation distributions are estimated on the output side only, i.e. word alignments play no role here. The generation model is learned on a word-for-word basis. A factored model is thus a combination of several components, which can be easily integrated into the log-linear translation model. A simple form of the entire pipeline is illustrated in Figure 2.6.
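Since the generation model is learned word for word from the output side alone, it can be sketched as counting over an annotated target corpus. The code below assumes relative-frequency estimation, which is our simplification for illustration; the function name `estimate_generation_table` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_generation_table(tagged_target_tokens):
    """Estimate p(surface | lemma, POS, morphology) by relative frequency.

    Only target-side tokens are used, so no word alignment is required.
    """
    counts = defaultdict(Counter)
    for surface, lemma, pos, morph in tagged_target_tokens:
        counts[(lemma, pos, morph)][surface] += 1

    table = {}
    for factors, surface_counts in counts.items():
        total = sum(surface_counts.values())
        table[factors] = {s: c / total for s, c in surface_counts.items()}
    return table


# Toy annotated target-side corpus (invented for illustration).
corpus = [
    ("houses", "house", "NN", "plural"),
    ("houses", "house", "NN", "plural"),
    ("house", "house", "NN", "singular"),
    ("homes", "home", "NN", "plural"),
]
table = estimate_generation_table(corpus)
print(table[("house", "NN", "plural")])
```

Each resulting distribution could then serve as one feature function among the components combined in the log-linear model.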

The factored translation model is the best-known model that explicitly addresses the morphology problem in SMT. We are able to boost this approach with our morphology-aware word embeddings. We provide more detailed information on our model in Chapter 5. Apart from the factored model, we wish to review two other models that study the same problem with different approaches.

[Figure: input factors (word, lemma, POS, morphology) mapped to output factors (word, lemma, POS, morphology)]

Figure 2.6: The high-level architecture of the factored translation model (Koehn and Hoang, 2007).

Dyer (2007) proposed a model for translating from MRLs. The goal is to capture source-side complexities. The system is based on a hierarchical phrase-based model (Chiang, 2007) and was evaluated on Czech→English. The main intuition behind the model is to extend the noisy channel metaphor; the new model is referred to as the noisier channel. It suggests that an English source signal is first distorted into a morphologically neutral French signal. The standard noisy channel model treats the observed French signal as the direct output of this channel, whereas the noisier channel assumes that the observed French signal is itself noisy, as it results from a further distortion applied by a morphological process. This part of the distortion can be modeled separately from the main noisy channel.

In order to implement the noisier channel, the lemma forms of Czech words are first extracted. Corpora consisting of truncated forms are also generated, using a length limit of 6: for every word, only the first 6 characters are taken into account and the rest are discarded. Hierarchical grammar rules are induced based on the surface, lemmatized, and truncated forms. These three grammars are combined for use by a hierarchical phrase-based decoder, which improved the model’s performance by 10%.
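The truncation step is simple enough to sketch directly. The function name `truncate_tokens` is hypothetical, whitespace tokenization is assumed, and "characters" are taken to be Python code points; the example words are chosen only to show two inflected forms collapsing to one truncated form.

```python
def truncate_tokens(sentence, limit=6):
    """Keep only the first `limit` characters of every whitespace-separated token."""
    return " ".join(token[:limit] for token in sentence.split())


# Two different surface forms of related words share one truncated form.
first = truncate_tokens("studenti univerzity")
second = truncate_tokens("studentka univerzita")
print(first)
print(second)
```

Truncation acts as a crude, language-independent stand-in for lemmatization: it merges related forms without needing a morphological analyzer, at the cost of occasionally merging unrelated words.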

Williams and Koehn (2011) proposed another model that manipulates the decoder in order to translate into MRLs. The model is an extension of a string-to-tree model in which unification-based constraints are added to the target side. The main idea is to penalize implausible hypotheses during search. They applied the model to English→German and were able to improve performance over the baseline model. The three aforementioned models are examples of incorporating morphological information into the decoding phase; see Table 2.2 on page 44 for a summary of similar models.
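To illustrate what a unification-based constraint checks, the sketch below unifies flat feature structures and returns a penalty when they clash. It is a strong simplification under our own assumptions, not the formulation of Williams and Koehn (2011): features are atomic key-value pairs, each word has a single unambiguous analysis, and the names `unify` and `agreement_penalty` are hypothetical.

```python
def unify(a, b):
    """Merge two flat feature structures; return None if a shared feature clashes."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None
        merged[feature] = value
    return merged


def agreement_penalty(feature_structures, penalty=1.0):
    """Return `penalty` if the structures cannot all be unified, else 0.0."""
    current = {}
    for structure in feature_structures:
        current = unify(current, structure)
        if current is None:
            return penalty
    return 0.0


# Simplified analyses for a German determiner-noun pair (illustrative only).
das = {"gender": "neut", "number": "sg"}
die_fem = {"gender": "fem", "number": "sg"}
haus = {"gender": "neut", "number": "sg"}

ok_penalty = agreement_penalty([das, haus])
clash_penalty = agreement_penalty([die_fem, haus])
print(ok_penalty, clash_penalty)
```

A hypothesis whose target-side words fail to unify would receive the penalty, so the search is steered away from outputs with agreement errors.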
