4.5.2 Neural Language Modeling for SMT

All issues discussed so far relate purely to language modeling. Language models are frequently used in other fields, such as MT. In this section we show the application of language modeling in SMT and explain how the proposed NLM enables us to provide better translations. To this end we designed a simple experiment in which we rescore the n-gram language model with our NLM, that is, we change nothing but the n-gram scores. An n-gram-based language model consists of n-grams and their associated scores; we recompute those scores with our NLMs. Results for this experiment are shown in Table 4.7.
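The rescoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes the n-gram model is stored as ARPA-style, tab-separated lines (log10 probability, n-gram, optional back-off weight), and `nlm_log10_prob` is a hypothetical stand-in for the NLM that returns log10 P(word | history). Back-off weights are left untouched here; a real pipeline would have to decide how to treat them.

```python
import math
from typing import Callable, List, Sequence


def _is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def rescore_arpa_lines(
    lines: Sequence[str],
    nlm_log10_prob: Callable[[List[str], str], float],
) -> List[str]:
    """Replace the score of every n-gram entry; keep all other lines as they are."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # N-gram entries look like: logprob <TAB> w1 ... wn [<TAB> backoff]
        is_entry = len(fields) >= 2 and _is_float(fields[0]) and fields[1].split()
        if not is_entry:
            out.append(line.rstrip("\n"))  # headers, section markers, blank lines
            continue
        words = fields[1].split()
        new_score = nlm_log10_prob(words[:-1], words[-1])
        out.append("\t".join([f"{new_score:.6f}"] + fields[1:]))
    return out


def toy_nlm(history: List[str], word: str) -> float:
    # Hypothetical stand-in for the NLM: uniform over a 10-word vocabulary.
    return math.log10(1.0 / 10)
```

Only the first field of each entry changes, which mirrors the constraint that nothing but the n-gram scores is modified.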

In this experiment we trained different SMT engines to translate from English (En) into German (De) and vice versa. To train the En–De engines we used the WMT-15 datasets.⁹ For the training set we randomly selected 2M sentences. Our models were evaluated on newstest-2015 and tuned using newstest-2013. We trained them using Moses (Koehn et al., 2007) with the default configuration, tuned via MERT (Och, 2003) and evaluated using BLEU (Papineni et al., 2002).

NLM          En→De   Imp.    De→En   Imp.
Baseline     15.25   0       20.13   0
WordCLM      15.78   +0.53   20.44   +0.31
WordCLMdp    16.27   +1.02   20.72   +0.59
WordCLMbp    16.24   +0.99   20.72   +0.59
MorphCLM     15.85   +0.60   20.83   +0.70
CharCLM      15.89   +0.64   20.85   +0.72
CLMA         16.18   +0.93   20.96   +0.83
CLMBdiv      16.15   +0.90   20.84   +0.71
CLMBroot     16.11   +0.86   20.92   +0.79
CLMC         16.20   +0.95   20.90   +0.77

Table 4.7: Boosting n-gram-based LMs with NLMs. Improvements (Imp.) are statistically significant according to the results of paired bootstrap re-sampling with p = 0.05 for 1000 samples (Koehn, 2004b).
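The significance test named in the caption of Table 4.7 can be sketched roughly as below. This is a simplified illustration, not the evaluation script used here: BLEU is a corpus-level metric computed from n-gram statistics, whereas this sketch assumes one numeric statistic per sentence and takes the corpus metric as a parameter (`metric`, with `mean` as a hypothetical stand-in). Under that assumption, system B would be called significantly better at p = 0.05 if it wins on at least 95% of the resampled test sets.

```python
import random
from typing import Callable, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def paired_bootstrap(
    stats_a: Sequence[float],
    stats_b: Sequence[float],
    metric: Callable[[Sequence[float]], float] = mean,
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which system B beats system A."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(num_samples):
        # Draw sentence indices with replacement; use the SAME indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([stats_b[i] for i in idx]) > metric([stats_a[i] for i in idx]):
            wins += 1
    return wins / num_samples
```

Using the same resampled indices for both systems is what makes the test paired: each comparison is made on an identical virtual test set.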

In Table 4.7, Baseline denotes the baseline systems, which are phrase-based SMT models. For our baseline models we trained 5-gram language models on the monolingual parts of the bilingual corpora using SRILM (Stolcke, 2002). The other rows are enhanced versions of the baseline systems in which the n-gram scores are recomputed using different NLMs. Imp. is the difference between the baseline score and the score obtained after embedding the NLM, and thus shows the impact of the NLM. As the table shows, using the NLM instead of, or along with, the n-gram-based LM improves translation quality in both directions, by up to +1.02 BLEU for En→De and +0.83 BLEU for De→En.
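As a small worked check of how the Imp. column is derived, the snippet below recomputes it for two rows of Table 4.7. The BLEU values are copied from the table; the dictionary layout and the function name are only illustrative.

```python
# BLEU scores copied from Table 4.7 (two of the ten rows, for brevity).
baseline = {"En-De": 15.25, "De-En": 20.13}
systems = {
    "WordCLMdp": {"En-De": 16.27, "De-En": 20.72},
    "CLMA": {"En-De": 16.18, "De-En": 20.96},
}


def improvement(score: float, base: float) -> float:
    """Imp. = enhanced BLEU minus baseline BLEU, rounded to two decimals."""
    return round(score - base, 2)


for name, scores in systems.items():
    gains = {d: improvement(s, baseline[d]) for d, s in scores.items()}
    print(name, gains)
```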

4.6 Summary

In this chapter we proposed an extension to the state-of-the-art character-level NLM. We did not drastically change the neural architecture but developed new segmentation models to decompose morphologically complex words. Our proposed models are simple and unsupervised, and they learn the segmentation scheme from a training corpus. The granularity provided by the models falls in between character-level and morpheme-level models. They define a new set of basic units (an alphabet) for the given corpus. During the training phase, they learn to connect sets of related, consecutive characters to one another to construct new blocks. The proposed neural language-modeling pipeline outperforms existing models in all of our experiments. We studied our models from different perspectives and discussed the impact of the external parameters. Following this, we used our NLM in the SMT pipeline. In future work we plan to focus more on the generation aspect of our neural language model.

Chapter 5 Boosting SMT via NN-Generated Features

In this chapter we use neural features generated by our NNs (reported in Chapters 3 and 4) to boost SMT models. First we introduce a general pipeline to incorporate word and phrase embeddings into SMT. Through embeddings, the SMT model is not only informed with syntactic and semantic (similarity) information; we can also define different word-, phrase-, and sentence-level features to provide better translations. We show how to use monolingual and bilingual embeddings. Accordingly, training embeddings in the SMT context, especially bilingual embeddings, is one of the key contributions of the chapter. Our pipeline is investigated from different perspectives through different experiments. We evaluated our models by translating between English (En) and Czech (Cz), Farsi (Fa), French (Fr), and German (De) and observed significant improvements for all language pairs. The main goal of this chapter is to introduce a pipeline by which neural features can be incorporated into SMT, so the models in this chapter are not limited to MRLs and can be applied to any language. However, as we are interested in this set of languages, we designed our experiments around MRLs and improved our models via morphological information, as reported in Section 5.2.3.

5.1 Incorporating Embeddings into Phrase Tables

The process of PBSMT can be interpreted as a search problem for the best target-side match for a given input sentence, where the score at each step of exploration is formulated as a log-linear model (Koehn, 2009). For each candidate phrase, the set of features is combined with a set of learned weights to find the best target counterpart of the provided source sentence. Because an exhaustive search of the entire candidate space is not computationally feasible, the space is typically pruned via some heuristics, such as using beam search (see Chapter 2). The discriminative log-linear model allows the incorporation of arbitrary context-dependent and context-independent features. Thus, features such as those in Och and Ney (2002) or Chiang et al. (2009) can be combined to improve translation performance. The standard baseline bilingual features (in the phrase table) included in Moses (Koehn et al., 2007) by default are: the phrase translation probability ϕ(e|f), inverse phrase translation probability ϕ(f|e), direct lexical weighting lex(e|f), and inverse lexical weighting lex(f|e).¹ The structure of the phrase table is illustrated in Figure 5.1.

Figure 5.1: The structure of a German-to-English phrase table. The first constituent on each line is a German phrase, which is separated by ||| from its English translation. The four scores after the English phrase are the default bilingual scores extracted from the training corpora. These scores show how phrases are semantically related to each other. The decoder selects the best phrase pair at each step based on these scores.

¹ Although the features contributed by the language model component are as important as the bilingual features, we do not address them in Chapter 5, since they traditionally only make use of the monolingual target language context, and we are concerned with incorporating bilingual semantic knowledge.
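To make the phrase-table structure and the log-linear combination more concrete, here is a minimal sketch. It assumes the `source ||| target ||| scores` layout described for Figure 5.1; the example line, its score values, and the weights are invented for illustration, and the order of the four scores within a line is not asserted here. The score is computed as the weighted sum of log feature values, which is the usual form of a log-linear model.

```python
import math
from typing import List, NamedTuple, Sequence


class PhrasePair(NamedTuple):
    source: str
    target: str
    scores: List[float]


def parse_phrase_table_line(line: str) -> PhrasePair:
    """Split 'source ||| target ||| s1 s2 s3 s4 [||| ...]' into its parts."""
    fields = [f.strip() for f in line.split("|||")]
    scores = [float(s) for s in fields[2].split()]
    return PhrasePair(fields[0], fields[1], scores)


def log_linear_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of log feature values: sum_i lambda_i * log h_i."""
    return sum(w * math.log(h) for w, h in zip(weights, features))


# Hypothetical entry and weights, for illustration only.
pair = parse_phrase_table_line("das haus ||| the house ||| 0.6 0.5 0.7 0.4")
weights = [0.2, 0.2, 0.2, 0.2]  # in practice these would be tuned, e.g. with MERT
print(pair.target, log_linear_score(pair.scores, weights))
```

Because the model is a plain weighted sum, adding a new feature amounts to appending one more score to each entry and learning one more weight.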

The scores in the phrase table (see Figure 5.1) are computed directly from the co-occurrence of aligned phrases in the training corpora. A large body of recent work evaluates the hypothesis that co-occurrence information alone cannot capture contextual information or the semantic relations among phrases (see Section 5.1.1). Therefore, many techniques have been proposed to enrich the feature list with semantic information. In our model, we define six new features for this purpose. All of our features indicate the semantic relatedness (similarity) of source and target phrases. Our features leverage contextual information that is lost by the traditional phrase extraction operations. Specifically, on both sides (source and target) we look for any type of constituent, including phrases, sentences, or even words, that can fortify the semantic information about phrase pairs.
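One way such a similarity feature could be attached to a phrase-table entry is sketched below. The six features themselves are not specified at this point in the text, so this shows only a single generic feature: the cosine similarity between a source-phrase vector and a target-phrase vector in a shared space. The embedding lookup, the example vectors, and the rescaling of the cosine to [0, 1] are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def add_similarity_feature(line: str, embeddings: Dict[str, Sequence[float]]) -> str:
    """Append one similarity score to the score field of a phrase-table line."""
    fields = [f.strip() for f in line.split("|||")]
    sim = cosine(embeddings[fields[0]], embeddings[fields[1]])
    # Rescale from [-1, 1] to [0, 1]; an assumption, chosen so the value
    # resembles the probability-like scores already present in the table.
    fields[2] = f"{fields[2]} {(sim + 1.0) / 2.0:.4f}"
    return " ||| ".join(fields)


# Hypothetical bilingual embeddings living in one shared space.
toy_embeddings = {"das haus": [1.0, 0.0], "the house": [1.0, 0.0]}
print(add_similarity_feature("das haus ||| the house ||| 0.6 0.5 0.7 0.4", toy_embeddings))
```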

Our main contributions in this model are threefold: i) we define new similarity features and embed them into PBSMT to enhance translation quality; ii) in order to define the new features, we train bilingual phrase and sentence embeddings using an NN. The embeddings are trained in a joint distributed feature space which not only preserves monolingual similarity and syntactic information but also represents cross-lingual relations; and iii) we indirectly incorporate external contextual information using the neural features. We search in the source and target spaces and retrieve the closest constituent to the phrase pair in our bilingual embedding space.
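The retrieval step in the last sentence can be illustrated with a brute-force nearest-neighbour search. This is only a sketch under assumptions: how a phrase pair is represented as a single query vector is not stated here, so the code simply averages the source and target phrase vectors, and the candidate constituents and their vectors are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_vector(src: Sequence[float], tgt: Sequence[float]) -> List[float]:
    # Assumption: represent the phrase pair by the average of its two vectors.
    return [(a + b) / 2.0 for a, b in zip(src, tgt)]


def closest_constituent(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> Tuple[str, float]:
    """Return the candidate (word, phrase, or sentence) most similar to the query."""
    name = max(candidates, key=lambda k: cosine(query, candidates[k]))
    return name, cosine(query, candidates[name])


# Hypothetical constituents in the shared bilingual space.
candidates = {"house": [1.0, 0.0], "tree": [0.0, 1.0]}
query = pair_vector([0.9, 0.1], [1.0, 0.0])
print(closest_constituent(query, candidates))
```

The same function can be run once over source-side candidates and once over target-side candidates, since all vectors are assumed to live in one joint space.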

In document Machine translation of morphologically rich languages using deep neural networks (Page 125-130)