

2.5 Basic concepts of bilingual lexicon extraction

2.5.1 Extraction clues

Understanding the clues that help the extraction process first requires understanding the underlying mechanism of the process itself. Bilingual lexicon extraction involves mapping a word in the source language to its translation equivalent in the target language, known as the source word and the target word, respectively. Since the source word and the target word are equivalent, they are expected to share certain characteristics. In a more concrete sense, the extraction process has been defined by Hwa et al. (2006) as the mapping between two disjoint sets of symbols.

More importantly, certain clues can help define the characteristics or properties of the words, both the source word and the target word, that are used during learning. Nonetheless, certain clues may be applicable to some types of corpora only. Table 2.2 presents examples of the clues that are generally used in bilingual lexicon extraction, especially when parallel corpora are involved: correspondence of word and sentence order, correlation between word frequencies, and similar-spelling word pairs. The table also compares the usefulness of each clue across types of corpora, including monolingual corpora.

For parallel corpora, the correspondence of word and sentence order is usually the strongest clue, but it is not applicable to monolingual corpora (Rapp, 1999). Correlation between word frequencies is weaker than the first clue because many words are ambiguous in natural languages, even in parallel texts. For comparable corpora, this second clue is still applicable, though with low reliability; it is not useful for unrelated texts.

The third clue, similar-spelling word pairs, is generally limited to pairs that can be clearly identified. For the majority of other pairs, this clue needs to be combined with the first clue to be useful. Consequently, for monolingual texts, the third clue is not useful for identifying the majority of pairs because the first clue does not work. Hence, the task of extracting a bilingual lexicon from monolingual corpora is more difficult because “most statistical clues useful in the processing of parallel texts cannot be applied to non-parallel texts” (Rapp, 1999).

Table 2.2: Extraction clues: their usefulness vs. type of corpora
[Table body not recoverable from the source. Column headings: Statistical clue, example; For parallel corpora; For monolingual corpora (comparable and unrelated texts).]

To overcome the above shortcoming, Koehn and Knight (2002) identified five clues that can be used for extraction when monolingual texts are involved: identical words, similar spelling, contexts, similar words and word frequency. For the purpose of this thesis, the clues have been divided into three major properties, i.e., word spelling, word frequency and word context, because the remaining elements are likely to be derived from these three major properties. The descriptions of the three clues are as follows:

• Word frequency

Word frequency is one of the clues shown in Table 2.2. The clue is applicable as long as the texts are comparable. However, accuracy may decline greatly because comparable texts are less reliable than parallel texts.


Word frequency can help extract bilingual word pairs from ideal parallel texts. The assumption held is: “the frequencies of word pairs of parallel corpora, especially the most frequent ones, are parallel”. For comparable corpora, frequent words in one corpus should have translation equivalents that are also frequent in the other corpus. For example, in English-Malay news corpora, the English word government is more frequent than flower. Correspondingly, the Malay word kerajaan is more frequent than bunga. While the most frequent word in the target corpus is not necessarily the translation of the most frequent word in the English corpus, the former should be frequent, as the latter is. Inevitably, some translations occur less frequently in the target-language corpus. Hence, the m-th most frequent target word cannot simply be aligned with the n-th most frequent source word. For most word pairs, however, there is a considerable correlation between the frequency of a word and that of its translation. The frequency clue is usually redefined as a ratio of the word frequencies normalized by the corpus sizes (Koehn and Knight, 2002).
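The normalized frequency ratio described above can be sketched as follows. This is only a minimal illustration: the function names, the toy token lists and the choice to divide the smaller relative frequency by the larger one (so that the score falls between 0 and 1) are assumptions made here, not details given in the text.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the corpus size in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def frequency_score(source_word, target_word, source_freq, target_freq):
    """Ratio of the smaller to the larger relative frequency (assumed form).

    Returns 0.0 when either word is missing from its corpus.
    """
    f_src = source_freq.get(source_word, 0.0)
    f_tgt = target_freq.get(target_word, 0.0)
    if f_src == 0.0 or f_tgt == 0.0:
        return 0.0
    return min(f_src, f_tgt) / max(f_src, f_tgt)


# Toy corpora, invented for illustration only.
english_tokens = "government government government flower report government".split()
malay_tokens = ("kerajaan kerajaan bunga laporan "
                "kerajaan kerajaan kerajaan kerajaan").split()

english_freq = relative_frequencies(english_tokens)
malay_freq = relative_frequencies(malay_tokens)

score_good = frequency_score("government", "kerajaan", english_freq, malay_freq)
score_bad = frequency_score("government", "bunga", english_freq, malay_freq)
print(score_good, score_bad)
```

On its own such a score can only rank candidates loosely, which is consistent with the low reliability of the frequency clue noted above; in practice it would be combined with the other clues.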

• Word spelling

Two different languages may contain a number of identical words, especially when the languages are related. Both words may originate from the same root, or a word may have originated in one language and later been adopted by the other. Such words may be adopted exactly, or they may be changed slightly, with or without regular rules. Nevertheless, this technique may not be able to build a large repository of word pairs unless the languages to be paired are closely related, such as English and Spanish. The same technique is also applicable if one of the languages has a reasonable number of loanwords taken from the other language. Detailed descriptions of the characteristics are as follows:


1. Identical words

A certain number of identical words with the same meaning can be found in two or more languages. Usually, the word is adopted completely (with no translation or modification) into another language; for instance, the English words hospital and pen are used in Malay in their entirety without any change in spelling. Another example of a word that is adopted exactly is internet. Thus, this clue is based on the assumption that identically spelled words are translations of one another.

2. Similar spelling, or cognates

Some words have very similar written forms in two languages because of their common language roots (e.g., freund and friend). These words are known as cognates. Others are adopted words (e.g., bajet and budget), which are taken into one language from another with little or minor modification.

These words may differ in spelling (even if only by a few letters), but they still maintain a similar meaning. As an example, Koehn and Knight (2002) provide a computation that works as follows:

For a given word pair (friend, freund), the words share five letters (fr-e-nd), and each of them has a word length of 6. Thus, the spelling similarity between them is 5/6, or 0.83. This measurement is called the longest common subsequence ratio (LCSR), which was proposed by Melamed (1995) as follows:

\[
\mathrm{LCSR}(A, B) = \frac{\mathrm{length}(\mathrm{LCS}(A, B))}{\max(\mathrm{length}(A), \mathrm{length}(B))}
\]

where A and B are the words to be measured, and LCS is the longest common subsequence, not necessarily contiguous, in A and B.


In the example, the LCS consists of the five letters (f, r, e, n, d).

Another measure that can be used to find similarity in spelling is the string edit distance, or Levenshtein distance. Compared to LCSR, which only allows insertion and deletion operations, Levenshtein distance also allows substitution. However, Haghighi et al. (2008) caution that one disadvantage of using edit distance is that precision quickly degrades with higher recall. Instead, they recommend assigning a feature to each substring of length three or less for each word and using the set of features as the elements of a word vector, which can then be matched with other word vectors in a vector space.

3. Transliteration

Some English words inevitably appear in foreign-language text, especially in scientific reports or journals. Word pairs may be derived simply by looking for collections of documents in the foreign language that contain English words. The most frequent words in the foreign text corpora are likely to be the translations of the corresponding English words. Such an approach is language-independent and domain-independent.

The spelling approach may not be suitable if the majority of the word pairs to be processed have spellings with little resemblance (see the example of the output in Figure 2.5).
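The spelling measures discussed under similar spelling can be sketched as below. This is a minimal illustration rather than the implementation used by any of the cited authors: the function names are chosen here, both distances use standard dynamic programming, and substring_features is only a loose reading of the substring features attributed to Haghighi et al. (2008).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (not necessarily contiguous)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio: LCS length over the longer word length."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty strings
    return lcs_length(a, b) / max(len(a), len(b))


def levenshtein(a, b):
    """Edit distance with insertion, deletion and substitution, each costing 1."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def substring_features(word, max_len=3):
    """Set of all substrings of length 1..max_len (assumed feature set)."""
    return {word[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(word) - n + 1)}


print(round(lcsr("friend", "freund"), 2))   # 0.83, matching the worked example
print(levenshtein("friend", "freund"))      # 2
print(sorted(substring_features("pen")))
```

Identical words are the limiting case of these measures: for a pair such as hospital and hospital, LCSR is 1.0 and the edit distance is 0.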

• Word context

Context is defined by the frequencies of context words in the surrounding positions. Words that co-occur in a certain context should have translations that co-occur in a similar context in the target corpus. Hence, the clue is based on the co-occurrence patterns of words within a certain window of words. Rapp (1995) indicates that the co-occurrence clue is based on the assumption that there is a correlation between co-occurrence patterns in different languages.

The context of occurrence of each word j is approximated by the bag of words that occur within a window of n-word length or n-word distance. If n = 2, the window size is five: a neighbourhood of +/- 2 words around the current word, plus the word itself, sums to five words in the window. A whole sentence can also be used as the context window. Some related examples are discussed in Chapter 3 (see Section 3.3.6 for details).

A context vector of a word j is initially the vector of all words in the bag-of-words. Each word i in this vector is assigned a weight that represents its number of occurrences in that bag-of-words, which is also the number of co-occurrences of word i and j in the same context windows.

The following sentence examples are taken from Rapp (1999):

“Economy nearer recession after weak growth data.”

“Economy growth is the increase in value of the goods and services produced by an economy.”

“Report shows US economy growth weak if not in recession.”

“How can we increase economy growth in the future?”

Words that tend to co-occur frequently in the context of the word economy, such as growth, appear repeatedly in these sentences. Using the above example by Rapp (1999), the English word economy co-occurs frequently with growth, as the German word Wirtschaft does with Wachstum. Rapp (1999) also found that the English words teacher and school co-occur more often than expected by chance in the English corpus, and that the same holds for their German translations, Lehrer (teacher) and Schule (school).


Interestingly, the clue holds not only for parallel texts but also for monolingual texts. The hypothesis is that a pair of words in two separate corpora is more likely to be translations of each other when the distributions of their context words are similar. An initial bilingual lexicon is required to provide translations for the context words. For each word in the corpus, a context vector is built from the co-occurrence statistics between that word and all words in the initial bilingual lexicon, or within a certain specified context.

To determine which context words correspond strongly to a source word or a target word, a measure of association can be used. To compute the similarity between two distributions of context words, a similarity measure is needed. The most popular framework used in bilingual lexicon extraction is the vector space model.
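The context clue described above can be sketched as follows: build a bag-of-words context vector with a window of +/- n words, project it into the target vocabulary through an initial lexicon, and compare it with target-side context vectors. Everything specific here is an assumption for illustration: the function names, the toy token lists and seed lexicon, raw co-occurrence counts as weights (no association measure), and cosine as the similarity measure.

```python
import math
from collections import Counter


def context_vector(tokens, word, n=2):
    """Count the words within +/- n positions of each occurrence of `word`."""
    vector = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        lo, hi = max(0, i - n), min(len(tokens), i + n + 1)
        vector.update(tokens[lo:i] + tokens[i + 1:hi])
    return vector


def translate_vector(vector, seed_lexicon):
    """Project a source context vector into the target vocabulary.

    Context words missing from the seed lexicon are dropped.
    """
    projected = Counter()
    for context_word, weight in vector.items():
        if context_word in seed_lexicon:
            projected[seed_lexicon[context_word]] += weight
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data, invented for illustration only.
english = "report shows economy growth weak".split()
german = "bericht zeigt wirtschaft wachstum schwach und die blume im garten".split()
seed_lexicon = {"growth": "wachstum", "weak": "schwach"}  # hypothetical seed entries

source = translate_vector(context_vector(english, "economy"), seed_lexicon)
sim_wirtschaft = cosine(source, context_vector(german, "wirtschaft"))
sim_blume = cosine(source, context_vector(german, "blume"))
print(sim_wirtschaft, sim_blume)
```

With these toy inputs the candidate wirtschaft scores higher than blume because it shares the translated context words wachstum and schwach; a fuller treatment would replace the raw counts with an association measure before comparing vectors.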

In document Minimally Supervised Techniques for Bilingual Lexicon Extraction (Page 54-61)