2.4 Different learning tasks

2.4.2 Learning from monolingual corpora

Parallel corpora evidently support robust methods. However, the lack of well-aligned parallel corpora, and the noise in those that are available, limits the implementation of these methods. This point is reinforced by Koehn and Knight (2002), who emphasize that parallel corpora will always be a limited resource, especially across different domains. Monolingual texts are, though with some reservations, the most easily available linguistic resources.

Fortunately, monolingual corpora can be an alternative to parallel corpora, provided that the required extra knowledge is supplied, namely an initial bilingual dictionary of sufficient size (Koehn and Knight, 2002).

Using monolingual corpora alone means purely unsupervised learning. Like parallel texts, monolingual texts provide vast lexical and statistical data. However, because the texts are non-parallel (even when they are large), two monolingual corpora in different languages are unlikely to provide a perfect set of context words for both the source word and the target word. The problem grows as the amount of comparable corpora decreases. Another problem with using comparable corpora to find translation equivalents is that there is no obvious bridge between the two languages (Sharoff et al., 2006).

To make unsupervised learning feasible, most studies have relied on assumptions that relate a word to its translation equivalent (Koehn and Knight, 2002; Diab and Finch, 2000; Rapp, 1995). The most obvious approach is to find word pairs that are spelled identically or similarly across the languages (Koehn and Knight, 2002). For example, Figure 2.5 shows a series of word pairs collected by Koehn and Knight (2002) in their experiments. However, an approach based on spelling similarity does little to extend a bilingual lexicon, because two languages share few similarly spelled words, such as loanwords, unless they are historically and culturally related.
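The spelling-based idea can be illustrated with a minimal sketch. It assumes a generic string-similarity score (Python's `difflib.SequenceMatcher` ratio) and a hypothetical threshold of 0.8; neither is claimed to be the measure or setting used by Koehn and Knight (2002), and the toy word lists are invented for illustration rather than taken from Figure 2.5.

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means identical spelling (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_by_spelling(source_vocab, target_vocab, threshold=0.8):
    """Pair each source word with its most similarly spelled target word.

    The threshold is an assumed, illustrative value: pairs scoring below it
    are discarded as unlikely translation equivalents.
    """
    pairs = []
    for s in source_vocab:
        best = max(target_vocab, key=lambda t: spelling_similarity(s, t))
        score = spelling_similarity(s, best)
        if score >= threshold:
            pairs.append((s, best, score))
    return pairs


# Toy vocabularies (hypothetical data, for illustration only).
german = ["Organisation", "Elektron", "Haus", "Hund"]
english = ["organization", "electron", "house", "dog"]

pairs = match_by_spelling(german, english)
for src, tgt, score in pairs:
    print(f"{src} -> {tgt} ({score:.2f})")
```

In this toy run only the near-identical spellings are paired, while "Haus" and "Hund" are left unmatched, which mirrors the limitation noted above: spelling similarity covers only a small part of the vocabulary.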

Figure 2.5: An example of word pairs learnt from monolingual corpora using the spelling-based approach

Source: Koehn and Knight (2002)

Earlier work that used an initial bilingual lexicon took a context-based approach. In this regard, Rapp (1999) insists that an initial bilingual lexicon is required to improve accuracy (see Subsection 2.5 for details). To address this requirement, Rapp developed a model that bridges two monolingual texts using seed words. Seed words are known translation pairs in an initial bilingual lexicon: one side of each pair is used to represent the context of the source word, and the other side is used to represent the context of the target word, each in its own language. Together, the two sides bridge the two monolingual texts and allow word pairs to be mapped. Essentially, Rapp’s work is based on the notion that “words that co-occur frequently in one language have translations that also co-occur frequently in another language” (Rapp, 1995; 1999). He used this property to map bilingual word pairs and to fill gaps in an existing lexicon. Likewise, Fung (1995) and Fung and Yee (1998) relied on the same principle, which allowed them to add novel word pairs to a lexicon.
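The seed-word bridging described above can be sketched as follows. This is a simplified illustration under stated assumptions: raw co-occurrence counts within a fixed window and cosine similarity stand in for whatever association and similarity measures the cited studies actually used, and the function names, the window size, and the toy corpora are all hypothetical.

```python
import math
from collections import Counter


def context_vector(word, sentences, seeds, window=3):
    """Count how often each seed word occurs within `window` tokens of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in seeds:
                    counts[tokens[j]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translate(word, src_sents, tgt_sents, seed_lexicon, candidates, window=3):
    """Rank target candidates by how well their seed-word contexts match."""
    src_vec = context_vector(word, src_sents, set(seed_lexicon), window)
    # Project the source context into the target language via the seed lexicon.
    projected = {seed_lexicon[s]: c for s, c in src_vec.items()}
    tgt_seeds = set(seed_lexicon.values())
    scores = {
        cand: cosine(projected, context_vector(cand, tgt_sents, tgt_seeds, window))
        for cand in candidates
    }
    return max(scores, key=scores.get), scores


# Two small NON-parallel corpora (hypothetical toy data).
german = [
    ["der", "hund", "bellt", "laut"],
    ["ein", "hund", "jagt", "die", "katze"],
    ["die", "katze", "trinkt", "milch"],
]
english = [
    ["the", "dog", "barks", "at", "night"],
    ["a", "dog", "chases", "the", "cat"],
    ["the", "cat", "drinks", "milk"],
    ["milk", "is", "what", "the", "cat", "drinks"],
]
# Seed words: known translation pairs that bridge the two corpora.
seed_lexicon = {"bellt": "barks", "jagt": "chases", "katze": "cat", "trinkt": "drinks"}

best, scores = translate("hund", german, english, seed_lexicon, ["dog", "milk"])
print(best, scores)
```

The source word and each target candidate are described only in terms of seed words, so the two context vectors become comparable even though the corpora share no aligned sentences.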

In related work, Koehn and Knight (2000) also proposed the use of a lexicon, together with a corpus in the target language and a comparable corpus in the source language. Their approach, however, views the corpus in the source language as a distorted version of the target corpus, corrupted by a noisy channel. Based on word-level translation probabilities and a language model, the most likely target word can be determined for each source word. Given parallel corpora, the word-level translation probabilities could easily be estimated. Without them, the approach is not straightforward: the word-level translation probabilities are needed to estimate the best target word matches when no bilingual word pairs are available and, at the same time, the bilingual word pairs would have to be established without the word-level translation probabilities. Hence, Koehn and Knight (2000) used the Expectation Maximization (EM) algorithm to deal with this circular problem. The algorithm alternates between two steps until convergence: in the expectation step, it scores the possible target words for each source word, and in the maximization step, it re-estimates the translation probabilities from those scores.
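The noisy-channel view and the two EM steps can be written out as a generic sketch. The notation is an assumption introduced here for illustration (a source sentence $s_1^n$, a hidden target sequence $t_1^n$, a language model $P(t_1^n)$, and word-level translation probabilities $p(s \mid t)$ applied independently per position); it is not claimed to reproduce the exact model of Koehn and Knight (2000).

```latex
% Decoding: choose the target sequence that best explains the observed source.
\hat{t}_1^n = \arg\max_{t_1^n} \; P(t_1^n) \prod_{i=1}^{n} p(s_i \mid t_i)

% Expectation step: score each candidate target sequence under the current p.
P(t_1^n \mid s_1^n) =
  \frac{P(t_1^n) \prod_{i=1}^{n} p(s_i \mid t_i)}
       {\sum_{\tilde{t}_1^n} P(\tilde{t}_1^n) \prod_{i=1}^{n} p(s_i \mid \tilde{t}_i)}

% Expected co-occurrence counts, accumulated over all source sentences.
c(s, t) = \sum_{s_1^n} \sum_{t_1^n} P(t_1^n \mid s_1^n)
          \sum_{i=1}^{n} [s_i = s]\,[t_i = t]

% Maximization step: re-estimate the translation probabilities from the counts.
p(s \mid t) = \frac{c(s, t)}{\sum_{s'} c(s', t)}
```

Under this formulation the language model supplies the only signal that distinguishes candidate target words at the start, and each iteration sharpens $p(s \mid t)$ from the resulting soft assignments.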

Later, Koehn and Knight (2001) conducted several experiments based on the same models for monolingual corpora. The models assume the availability of several linguistic tools, including POS taggers and morphological analysers. In these experiments, they compared models that used an initial bilingual lexicon with models that did not use any lexicon. They took only nouns into account and found that the models using an initial bilingual lexicon registered higher accuracy than those without one: the experimental results ranged from 75% to 79% for the former and from 11% to 39% for the latter. In summary, Koehn and Knight (2001) contend that a parallel corpus can be replaced with monolingual corpora and a bilingual lexicon. (A survey of previous work that proposed models for monolingual corpora, with or without a bilingual lexicon, is presented in Subsection 2.6.)

2.4.3 Learning from parallel corpora and monolingual corpora

In document Minimally Supervised Techniques for Bilingual Lexicon Extraction (Page 48-51)