This chapter reviews the literature relevant to the research presented in this thesis. In the following chapters, we present our own research and show (i) how parallel text fragments can be extracted from comparable corpora and added to the bilingual training corpus as additional training material to improve the performance of SMT systems for low-resource language pairs, (ii) how existing parallel resources can be used optimally, together with an improved hybridization method for MT, (iii) different approaches to APE over a first-stage MT system, and finally (iv) how human interaction with CAT tools can be optimized in existing MT workflows.
Mining Parallel Resources from Comparable Corpora
Statistical Machine Translation (SMT) is based on a probabilistic model that is learned from sentence-aligned parallel corpora, in which each source sentence is paired with its translation in the target language. Because parallel corpora remain a scarce resource for many language pairs (e.g., English–Indian languages) and are often restricted to certain domains, comparable corpora can to some extent alleviate this data scarcity problem for corpus-based approaches to MT. Many studies and applications in both the linguistics and language engineering communities use comparable corpora as resources, and these can play an important role in improving the quality of MT (Smith et al., 2010). Extracting parallel text fragments, paraphrases or sentences from comparable corpora is particularly useful for SMT (Gupta et al., 2013).
In general, comparable documents are not strictly parallel: a comparable corpus consists of documents in two languages that are not sentence-by-sentence translations of each other but are about the same topic. Although the sentences of comparable corpora are usually not (exact) translations, comparable documents convey information on the same topic or event, and hence some parallelism should exist at the sentential or sub-sentential level.
Previous studies on comparable corpora have mainly focused on: (i) parallel data extraction in the form of bilingual lexicon extraction (BLE) (Fung and McKeown, 1997; Pirkola et al., 2001; Rapp, 1995), parallel fragment extraction (Quirk et al., 2007) and parallel sentence extraction (Munteanu and Marcu, 2005), (ii) translation model improvement (Daumé and Jagarlamudi, 2011; Klementiev et al., 2012) and (iii) language model adaptation (Zhao et al., 2004). The main focus of this chapter is to exploit comparable corpora to address the scarcity of parallel data for less-resourced languages. We propose novel approaches to extracting parallel fragments from comparable corpora by applying a textual entailment (TE) method and a template-based approach.
This chapter addresses RQ1: How can MT for low-resource languages be improved? We extract parallel segments from comparable corpora. The extracted parallel segments are added to the training corpus as additional training material, which is expected to improve the performance of SMT systems, specifically for low-resource language pairs.
The core part of the research presented in this chapter has previously been published in Pal et al. (2014b, 2015b).
Figure 3.1 schematically represents the research presented and the research questions addressed in this chapter.
3.1 Introduction
In this chapter, we describe a methodology for extracting English–Bengali parallel resources from comparable corpora using TE and template-based phrase extraction. We collected a document-aligned comparable corpus of English–Bengali document pairs from Wikipedia, a large collection of documents in many different languages. For each pair, we first collect an English document from Wikipedia and then follow the inter-language link to find the corresponding document in the Bengali Wikipedia. To extract parallel fragments, we perform three steps. In the first step, we cluster the source side of the bilingual comparable corpus into several small groups using TE and a distributional semantic textual similarity method (Mitchell and Lapata, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012; Agirre et al., 2014; Bentivogli et al., 2016). In the second step, we produce cross-lingual linked clusters of comparable segments for each comparable document using a probabilistic bilingual lexicon. The bilingual lexicon is prepared from an English–Bengali parallel corpus in the tourism domain using a statistical word alignment tool, GIZA++ (Och and Ney, 2003a). In the final step, we use a template-based phrase extraction method (Cicekli and Güvenir, 2001) between each of the aligned groups of comparable segments. The template-based extracted phrases are finally aligned using a baseline phrase-based SMT (PB-SMT) system, which was trained on the English–Bengali tourism parallel corpus.

Figure 3.1: Schematic design of the research and the research questions presented in this chapter. [Components shown in the figure: comparable corpora from Wikipedia (web crawling through inter-wiki links); English (EN) and Bengali (BN) comparable corpus; TE system; monolingual (EN) clusters of the comparable corpus; cross-lingual (EN-BN) clusters of the comparable corpus; template-based phrase extraction; parallel corpus; PB-SMT; MT output; comparator; parallel segments.]
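To make the second step more concrete, the following is a minimal sketch of how source and target segments might be linked through a probabilistic bilingual lexicon. It is an illustration under stated assumptions and not the procedure used in this chapter: the scoring rule, the threshold, the function names `link_score` and `link_segments`, the romanized placeholder tokens standing in for Bengali words, and the probabilities are all invented for the example.

```python
def link_score(src_tokens, tgt_tokens, lexicon):
    """Average, over source tokens, of the best lexicon probability
    against any target token (0.0 when no lexicon entry exists)."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    total = 0.0
    for s in src_tokens:
        total += max(lexicon.get((s, t), 0.0) for t in tgt_tokens)
    return total / len(src_tokens)


def link_segments(src_segments, tgt_segments, lexicon, threshold=0.3):
    """Pair each source segment with its best-scoring target segment,
    keeping the pair only if the score reaches the threshold."""
    links = []
    if not tgt_segments:
        return links
    for i, src in enumerate(src_segments):
        scored = [(link_score(src, tgt, lexicon), j)
                  for j, tgt in enumerate(tgt_segments)]
        best_score, best_j = max(scored)
        if best_score >= threshold:
            links.append((i, best_j, best_score))
    return links


# Hypothetical lexicon: (source word, target word) -> translation probability.
# Target tokens are romanized placeholders, not real GIZA++ output.
lexicon = {
    ("hotel", "hotel"): 0.9,
    ("beach", "saikat"): 0.8,
    ("train", "tren"): 0.85,
    ("station", "steshan"): 0.9,
}
en_segments = [["hotel", "near", "beach"], ["train", "station"]]
bn_segments = [["tren", "steshan"], ["saikat", "hotel"]]

links = link_segments(en_segments, bn_segments, lexicon)
linked_pairs = [(i, j) for i, j, _ in links]
```

In this toy setting the first English segment is linked to the second target segment and the second English segment to the first. A real lexicon would be far larger and noisier, so the threshold would need tuning on held-out data.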
Typically, two approaches are applied for grouping documents according to their (text) similarity: TE and semantic textual similarity (STS) (Agirre et al., 2014). Given two pieces of text, a text (T) and a hypothesis (H), T is said to entail H if H can be inferred from T (Dagan and Glickman, 2004). The task of TE is to decide whether the meaning of H can be inferred from the meaning of T. For example, let T be: “Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.”, and H: “Yahoo bought Overture”. For this particular T–H pair, T entails H. STS measures the degree of semantic equivalence between two sentences. It can be applied in many areas, such as Information Extraction, Question Answering, Summarization, and Information Retrieval, for indexing semantically similar phrases or sentences. STS is related to TE but differs from it in that TE is unidirectional while STS is bidirectional. For example, the two sentences “Yahoo took over search company Overture last year.” and “Yahoo acquired Overture.” are highly semantically similar; however, while the first sentence entails the second, the reverse does not hold, since the first sentence carries additional (here temporal) information that is not contained in the second sentence.
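The difference in directionality can be illustrated with two toy lexical scores. This is only a sketch, not a TE or STS system: token coverage is a crude directional proxy for entailment, Jaccard overlap is a crude symmetric proxy for similarity, and the function names are our own. Because pure word overlap cannot relate "acquired" to "took over", the second sentence is changed here to "Yahoo took over Overture."

```python
def tokens(text):
    """Lowercase a sentence and return its set of alphanumeric word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def coverage(text, hypothesis):
    """Directional score: fraction of hypothesis tokens also found in text."""
    t, h = tokens(text), tokens(hypothesis)
    return len(t & h) / len(h) if h else 0.0


def jaccard(a, b):
    """Symmetric score: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


first = "Yahoo took over search company Overture last year."
second = "Yahoo took over Overture."  # modified hypothesis, see lead-in

forward = coverage(first, second)    # 1.0: every token of `second` is in `first`
backward = coverage(second, first)   # 0.5: `first` carries extra information
symmetric = jaccard(first, second)   # 0.5, identical in both directions
```

Swapping the arguments changes the coverage score but not the Jaccard score, which mirrors the unidirectional nature of TE and the bidirectional nature of STS.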
Textual similarity between T and H can be calculated with various techniques at the lexical, syntactic, and semantic levels (Šarić et al., 2012; Osman et al., 2012). Lexical techniques are based on word overlap metrics, n-gram matching, or comparing the dependency relations of the two texts. Moreover, some important lexical relationships (e.g., synonyms, hypernyms) can also be used to measure textual similarity. Syntactic techniques are based on matching syntactic or dependency trees. At the semantic level, in addition to STS, similarity can be measured by comparing relations (e.g., through logical inference and Semantic Role Labeling).
In distributional semantics approaches (Blei et al., 2003a), similarities between T and H can be computed by measuring their collocation and distributional properties on large amounts of data in an unsupervised way (Chaney and Blei, 2012) or by using the Gensim framework (Řehůřek and Sojka, 2010), in which semantic relationships of words and phrases are computed using the word2vec (Mikolov et al., 2013a) model. Our approach uses Gensim (Řehůřek and Sojka, 2010) to measure distributional semantic similarity. Gensim is a free, open-source Python library designed to extract semantic topics from documents automatically and efficiently. It is designed to process raw plain text data (e.g., a corpus). Several popular algorithms, such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Latent Dirichlet Allocation (LDA) (Blei et al., 2003b) and Random Projections, are implemented in Gensim. These algorithms discover the semantic structure of documents by examining statistical co-occurrence patterns of the words within a training corpus. LSA and LDA are both topic modeling techniques; LDA, however, is a fully generative model. LSA is also regarded as a statistical, corpus-based text comparison method that uses a weighted term-document matrix created from a large collection of documents. LSA consists of four steps: (i) preparing a term-document matrix, (ii) a transformation (e.g., tf-idf, log-entropy), (iii) dimensionality reduction using Singular Value Decomposition (SVD) and (iv) retrieval using cosine similarity. LDA assumes that a document is a mixture of latent topics. In contrast to LSA, LDA relies on a probabilistic model rather than on SVD. In our work, we use a pre-trained Gensim model (cf. Section 3.3) for measuring semantic text similarity.
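The four LSA steps listed above can be sketched with NumPy on a toy corpus. This is a minimal illustration under stated assumptions, not the Gensim implementation or the pre-trained model used in our work: the documents, the function name `lsa_similarity`, the whitespace tokenization, the particular tf-idf variant (raw count times log(N/df)) and the choice of k = 2 latent dimensions are all our own.

```python
import numpy as np

# Toy corpus (invented for illustration): two beach sentences, two station sentences.
docs = [
    "the hotel is near the beach",
    "the resort is close to the sea beach",
    "the train leaves from the station",
    "the bus leaves from the station at noon",
]


def lsa_similarity(docs, k=2):
    """Return the document-by-document cosine similarity matrix in a
    k-dimensional latent space obtained by truncated SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # (i) Term-document matrix of raw counts (rows: terms, columns: documents).
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1

    # (ii) tf-idf transformation; terms occurring in every document get weight 0.
    df = np.count_nonzero(counts, axis=1)
    idf = np.log(len(docs) / df)
    weighted = counts * idf[:, None]

    # (iii) Dimensionality reduction: keep the k largest singular values.
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    doc_vecs = (s[:k, None] * vt[:k, :]).T  # one k-dimensional row per document

    # (iv) Retrieval by cosine similarity between document vectors.
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


sim = lsa_similarity(docs)
```

On this corpus the two beach sentences end up close to each other in the latent space and far from the two station sentences, even though each pair shares only a few content words. With realistic data, k would be set in the hundreds and the corpus would be far larger.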
The rest of the chapter is structured as follows: Section 3.2 discusses previous work relevant to this chapter. Section 3.3 describes the TE system used for our research. Section 3.5 describes comparable text extraction from comparable corpora and Section 3.6 shows how to identify parallel segments from these comparable segments. Section 3.7 and Section 3.8 present the dataset used for our experiments and the baseline experimental setup, respectively. Section 3.9 describes our experiments and presents the evaluation results. Section 3.10 summarizes the outcomes of this research.