
6.4 CATaLog_TS: Beyond Translation Memories

6.4.3 Finding Positions to Insert Translations

6.4.3.2 Finding Position Using Parse Tree

When multiple POS n-gram matches are found, the system resolves the ambiguity using the parse tree of the input sentence. For each of the higher-order POS n-gram matches, we determine the lowest common ancestor (LCA) node of the n-gram's words in the parse tree. The POS n-gram whose LCA node has the greatest depth is considered the most appropriate candidate. If there is a tie, the system chooses one of the tied candidates at random. The idea behind choosing the deepest LCA is that the lower the common ancestor in the parse tree, the more syntactically coherent the constituent words are. If the LCA is located at an upper level of the tree, the words in the n-gram sequence are syntactically unrelated and hence the ‘corresponding n-gram’ should be ignored. This motivates the use of the LCA. The process is illustrated using Example 6.1. For the sake of simplicity, we use the unigram dictionary to obtain the translations of the unmatched words in the example. However, the system uses a trigram back-off model for this purpose.

Algorithm 7: Finding W_c for the i-th unmatched word W_u^i

Data: input W_u^i and POS_context
    /* POS_context is a set containing all possible unigram, bigram and trigram contexts of each W_u^i and their corresponding positional information j in the TM source suggestions */
Result: Return {W_c, j}
    /* returns W_c and its position j in the TM source suggestion */
begin
    foreach W_u^i do
        POS_tri := {(POS_{i-2}, POS_{i-1}, POS_i), (POS_{i-1}, POS_i, POS_{i+1}), (POS_i, POS_{i+1}, POS_{i+2})}    /* possible trigrams */
        POS_bi := {(POS_{i-1}, POS_i), (POS_i, POS_{i+1})}    /* possible bigrams */
        POS_uni := (POS_i)    /* possible unigram */
        if any POS_tri is found in POS_context then
            Return {W_c, j}
        else if any POS_bi is found in POS_context then
            Return {W_c, j}
        else if POS_uni is found in POS_context then
            Return {W_c, j}
        else
            Return W_u^i
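The back-off search of Algorithm 7 can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the helper names (tag, ngram_windows, find_candidates), the (word, POS) list representation and the `used` set of already-consumed TM positions are assumptions introduced here. The sketch returns every candidate found at the highest matching n-gram order and leaves the choice among them to the parse-tree criterion.

```python
from typing import List, Optional, Set, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS) pairs


def tag(sentence: str) -> Tagged:
    """Split 'word/POS word/POS ...' into (word, POS) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in sentence.split()]


def ngram_windows(i: int, n: int, length: int) -> List[int]:
    """Start indices of every length-n window that covers position i."""
    return [s for s in range(i - n + 1, i + 1) if s >= 0 and s + n <= length]


def find_candidates(inp: Tagged, tm: Tagged, i: int,
                    used: Optional[Set[int]] = None) -> List[Tuple[str, int]]:
    """Back off from trigram to bigram to unigram POS contexts.

    Returns (W_c, j) pairs found at the highest n-gram order that matches,
    where j is the 1-based position of W_c in the TM source suggestion.
    An empty list means that no context matched.
    """
    used = used or set()  # hypothetical bookkeeping of consumed TM positions
    in_pos = [p for _, p in inp]
    tm_pos = [p for _, p in tm]
    for n in (3, 2, 1):
        found = []
        for s in ngram_windows(i, n, len(inp)):
            pattern = in_pos[s:s + n]
            offset = i - s  # where the unmatched word sits inside the window
            for t in range(len(tm) - n + 1):
                j = t + offset + 1  # 1-based TM position of W_c
                if tm_pos[t:t + n] == pattern and j not in used:
                    found.append((tm[t + offset][0], j))
        if found:
            return found
    return []


# Tagged sentences from Example 6.1.
INPUT = tag("i/FW would/MD prefer/VB something/NN in/IN a/DT "
            "middle/JJ price/NN range/NN ./.")
TM = tag("i/FW would/MD prefer/VB to/TO sit/VB in/IN the/DT "
         "back/JJ part/NN of/IN the/DT plane/NN ./.")

print(find_candidates(INPUT, TM, 3))                        # [('part', 9)]
print(find_candidates(INPUT, TM, 7, used={7, 8, 9}))        # [('plane', 12)]
print(find_candidates(INPUT, TM, 8, used={7, 8, 9, 12}))    # []
```

For ‘something/NN’ only the trigram context matches, so a single candidate is returned. For ‘price/NN’ the sketch falls through to the unigram level, and for ‘range/NN’ nothing is left to match. When several candidates are returned at the same order, an additional criterion is needed to pick one.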

Example 6.1.

Input sentence: i would prefer something in a middle price range .

TM suggestion: i would prefer to sit in the back part of the plane .

TM suggestion translation: আমি বিমানের পিছনের অংশে বসতে পছন্দ করব . (Gloss: ami

Table 6.1 shows the TER alignment between the TM source suggestion and the input sentence, along with the edit operations required to turn the TM source suggestion into the input sentence. It also shows the word alignment between the source and target of the TM suggestion.

The unmatched words (W_u) in the input sentence in this case are ‘something’, ‘a’, ‘middle’, ‘price’ and ‘range’.

Unigram dictionary entries for the unmatched words are:

something: NN|একটা কিছু; NN|কিছু; NN|কোন কিছু; NN|কিছু একটা
a: DT|একটা; DT|কোন; DT|এক
middle: JJ|মাঝারি আকারের; JJ|মাঝের
price: NN|দাম; NN|দামটা; NN|মূল্য
range: VBP|দেড়'শ এর মধ্যে বদলাতে থাকে

TM Target Suggestion | TM Source Suggestion | TM Source (POS) | Input Sentence | Input (POS) | Edit Operation
আমি | i | FW | i | FW | M
- | would | MD | would | MD | M
পছন্দ করব | prefer | VB | prefer | VB | M
- | to | TO | - | - | D
বসতে | sit | VB | something | NN | S
- | in | IN | in | IN | M
- | the | DT | - | - | D
পিছনের | back | JJ | - | - | D
অংশে | part | NN | a | DT | S
- | of | IN | middle | JJ | S
- | the | DT | price | NN | S
বিমানের | plane | NN | range | NN | S
. | . | . | . | . | M

Table 6.1: TM source–target alignment and TM source–input alignment (M = match, S = substitution, D = deletion)

• For every unmatched word (W_u) in the input sentence, the system searches the TM source suggestion for words that appear in the same or similar POS contexts as W_u in the input sentence. A corresponding word, W_c, found in this way for an unmatched word W_u is a potential candidate to be replaced by W_u.

• Among the unmatched words in the input sentence in Example 6.1, the system first considers ‘something/NN’ and the three trigrams that involve it: (i) ‘would/MD prefer/VB something/NN’; (ii) ‘prefer/VB something/NN in/IN’; and (iii) ‘something/NN in/IN a/DT’.

• Applying Algorithm 7, the third POS trigram matches the POS trigram ‘part/NN of/IN the/DT’ in the TM source suggestion. Therefore, ‘part’ is considered the corresponding word (i.e., W_c) for the unmatched word (i.e., W_u) ‘something/NN’ in the input sentence.

• After obtaining W_c (i.e., ‘part’), the system searches for the position of the corresponding target word W_ct in the TM target suggestion using the GIZA++ alignment. The GIZA++ alignments between the TM source and TM target suggestions are given below.

1-1, 3-6, 3-7, 5-5, 8-3, 9-4, 12-2, 13-8

Here the index before the hyphen (-) is the word position in the TM source suggestion and the index after the hyphen is the word position in the TM target suggestion. W_c = ‘part’ is the ninth word in the TM source suggestion and, according to the GIZA++ alignment, W_ct is the fourth word in the TM target suggestion, i.e., ‘অংশে (angshe)’. Therefore, W_ct = ‘অংশে’ is replaced by W_t = ‘একটা কিছু (ekta kichu)’, the translation of W_u = ‘something’ (cf. Algorithm 6). Hence, the TM target suggestion is modified as:

আমি বিমানের পিছনের একটা কিছু বসতে পছন্দ করব .

• Next, the system tries to find a match for the W_u ‘a/DT’. The corresponding POS trigrams are ‘something/NN in/IN a/DT’, ‘in/IN a/DT middle/JJ’ and ‘a/DT middle/JJ price/NN’. Since the POS sequence ‘part/NN of/IN the/DT’, starting with ‘part/NN’, has already been matched for ‘something’, this match is not considered again. However, the system finds matches for the other two trigrams, namely ‘in/IN the/DT back/JJ’ and ‘the/DT back/JJ part/NN’, in which ‘the/DT’ has not yet been matched with any W_u of the input sentence.

• To resolve the ambiguity, we consider the parse tree of the input sentence, which is shown in Figure 6.5. The numeric values in parentheses in Figure 6.5 represent the depths of the corresponding nodes.

ROOT(0)
  S(1)
    NP(2)
      PRP I
    VP(2)
      MD would
      VP(3)
        VB prefer
        NP(4)
          NN something
        PP(4)
          IN in
          NP(5)
            DT a
            JJ middle
            NN price
            NN range
  .(1) .

Figure 6.5: Parse tree

The trigram ‘in/IN a/DT middle/JJ’ has its LCA at depth 4, whereas ‘a/DT middle/JJ price/NN’ has its LCA at depth 5. We choose the trigram whose LCA lies at the greater depth (i.e., at a lower level of the tree). Therefore, in this case, the trigram ‘a/DT middle/JJ price/NN’ is chosen; the corresponding matched sequence in the TM source suggestion is ‘the/DT back/JJ part/NN’, and the word ‘the’ is the W_c for the W_u ‘a’. Subsequently, the system looks for the translation W_ct of this W_c (the seventh word in the TM source suggestion). However, since the GIZA++ alignment contains no entry for the seventh source word, the translation of ‘a’ is not placed in the TM suggestion translation.

• Afterwards, the system searches for a match for the W_u ‘middle/JJ’. The corresponding POS trigrams are (i) ‘in/IN a/DT middle/JJ’, (ii) ‘a/DT middle/JJ price/NN’ and (iii) ‘middle/JJ price/NN range/NN’. The first two trigrams match ‘in/IN the/DT back/JJ’ and ‘the/DT back/JJ part/NN’ in the TM source suggestion. To resolve this ambiguity, the system checks the parse tree again. The POS trigram ‘in/IN a/DT middle/JJ’ has its LCA at depth 4, while the POS trigram ‘a/DT middle/JJ price/NN’ has its LCA at depth 5. Therefore, the second trigram is chosen and the W_c for ‘middle/JJ’ is ‘back/JJ’. Note that ‘back/JJ’ is located at position 8 of the TM source suggestion and its translation, ‘পিছনের’, is located at position 3 of the TM target suggestion. Therefore, the third word ‘পিছনের’ in the TM target suggestion is replaced by ‘মাঝারি আকারের’, the W_t of ‘middle/JJ’. Thus the modified translation becomes:

আমি বিমানের মাঝারি আকারের একটা কিছু বসতে পছন্দ করব .

• The system next searches for a match for ‘price/NN’, which is translated as ‘দাম’. The three POS trigrams to be considered are ‘a/DT middle/JJ price/NN’, ‘middle/JJ price/NN range/NN’ and ‘price/NN range/NN ./.’. Here the POS sequence ‘a/DT middle/JJ price/NN’ matches ‘the/DT back/JJ part/NN’, where ‘part/NN’ is the corresponding word for ‘price/NN’. However, ‘part/NN’ has already been used earlier; therefore, the system ignores this match. The other two trigrams do not match any POS trigram in the TM suggestion. The two POS bigrams considered for ‘price/NN’ are ‘middle/JJ price/NN’ and ‘price/NN range/NN’. Here ‘middle/JJ price/NN’ matches ‘back/JJ part/NN’; however, it is ignored since the word at the translation position of ‘part/NN’ has already been replaced. The other bigram does not match either. Therefore, the system falls back to the unigram match for ‘price/NN’, which matches ‘part/NN’ and ‘plane/NN’. Since ‘part/NN’ has already been used, the system considers ‘plane/NN’, which is at position 12 of the TM suggestion; its translation, ‘বিমানের’, is at position 2 of the suggestion translation. Therefore, ‘বিমানের’ is replaced by ‘দাম’ and the suggested translation is modified as given below.

আমি দাম মাঝারি আকারের একটা কিছু বসতে পছন্দ করব .

• Finally, the system tries to find a match for ‘range/NN’. However, its trigram, bigram and unigram POS sequences have either been used already or do not match. Therefore, its translation is not placed in the suggested translation. The word ‘বসতে’, which is the translation of ‘sit’, is then deleted since ‘sit’ does not match any word of the input sentence. Thus, the final translation suggestion is produced as given below.

আমি দাম মাঝারি আকারের একটা কিছু পছন্দ করব .
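The LCA depths quoted for the competing trigrams in the walkthrough can be recomputed from Figure 6.5 with a short Python sketch. The nested-tuple encoding of the tree and the helper names leaf_paths and lca_depth are assumptions made for illustration; nothing here is taken from the system's code. Following the depth labels in the figure, the final ‘.’ node is attached directly under ROOT.

```python
from typing import List, Sequence, Tuple, Union

Tree = Union[str, Tuple]  # a leaf word, or (label, child, child, ...)

# Parse tree of the input sentence, transcribed from Figure 6.5.
TREE: Tree = (
    "ROOT",
    ("S",
     ("NP", ("PRP", "I")),
     ("VP", ("MD", "would"),
      ("VP", ("VB", "prefer"),
       ("NP", ("NN", "something")),
       ("PP", ("IN", "in"),
        ("NP", ("DT", "a"), ("JJ", "middle"),
         ("NN", "price"), ("NN", "range")))))),
    (".", "."),
)


def leaf_paths(tree: Tree, path: Tuple[int, ...] = ()) -> List[Tuple[int, ...]]:
    """For each word, left to right, return the child-index path from the
    root down to the word's POS node."""
    if isinstance(tree, str):
        return [path[:-1]]  # drop the last step (POS node -> word)
    paths: List[Tuple[int, ...]] = []
    for k, child in enumerate(tree[1:]):
        paths.extend(leaf_paths(child, path + (k,)))
    return paths


def lca_depth(paths: Sequence[Tuple[int, ...]], start: int, n: int) -> int:
    """Depth (ROOT = 0) of the lowest common ancestor of the n words
    beginning at 0-based word index `start`."""
    depth = 0
    for steps in zip(*paths[start:start + n]):
        if len(set(steps)) > 1:  # the paths diverge below this level
            break
        depth += 1
    return depth


PATHS = leaf_paths(TREE)
print(lca_depth(PATHS, 4, 3))  # 'in a middle'     -> 4 (the PP node)
print(lca_depth(PATHS, 5, 3))  # 'a middle price'  -> 5 (the inner NP node)

# Keep the trigram with the deepest LCA. max() keeps the first candidate on
# a tie, whereas the text describes a random choice among tied candidates.
best_start = max((4, 5), key=lambda s: lca_depth(PATHS, s, 3))
print(best_start)  # 5
```

The depth of the LCA equals the length of the common prefix of the words' root-to-node paths, which is why no explicit parent pointers are needed in this encoding.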

Since the translations of ‘a/DT’ and ‘range/NN’ are not placed in the translation suggestion, their translations, ‘একটা’ and ‘দেড়'শ এর মধ্যে বদলাতে থাকে’ respectively, are added to a list and shown to the post-editor as suggestions. The post-editor can use those translations directly, without typing them, and put them in the proper place. In this way the system modifies the TM translation suggestion to generate more appropriate translation candidates, which can then be post-edited with much less effort.
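The target-side steps of Example 6.1 can be replayed with the following Python sketch, which looks up each W_c in the GIZA++ alignment and overwrites the aligned target word. It is an illustration under stated assumptions: the names parse_alignment, apply_edits, PLACEMENTS and DELETED_SOURCE are hypothetical, the list of placements is copied by hand from the walkthrough rather than computed, and the Bengali strings are written in standard Unicode order.

```python
from typing import Dict, List, Optional, Tuple


def parse_alignment(pairs: str) -> Dict[int, List[int]]:
    """Map each 1-based TM source position to its aligned target positions."""
    table: Dict[int, List[int]] = {}
    for pair in pairs.split(","):
        src, tgt = pair.strip().split("-")
        table.setdefault(int(src), []).append(int(tgt))
    return table


ALIGN = parse_alignment("1-1, 3-6, 3-7, 5-5, 8-3, 9-4, 12-2, 13-8")
TARGET = ["আমি", "বিমানের", "পিছনের", "অংশে", "বসতে", "পছন্দ", "করব", "."]

# (W_u, TM source position of W_c or None, translation W_t), in the order
# of the walkthrough; None means that no W_c was found.
PLACEMENTS: List[Tuple[str, Optional[int], str]] = [
    ("something", 9, "একটা কিছু"),
    ("a", 7, "একটা"),
    ("middle", 8, "মাঝারি আকারের"),
    ("price", 12, "দাম"),
    ("range", None, "দেড়'শ এর মধ্যে বদলাতে থাকে"),
]
DELETED_SOURCE = [5]  # 'sit' matches no word of the input sentence


def apply_edits(target: List[str], align: Dict[int, List[int]],
                placements: List[Tuple[str, Optional[int], str]],
                deleted: List[int]) -> Tuple[str, List[str]]:
    """Return the edited target sentence and the W_u left for the post-editor."""
    out: List[Optional[str]] = list(target)
    unplaced: List[str] = []
    for w_u, j, w_t in placements:
        slots = align.get(j) if j is not None else None
        if not slots:  # no W_c, or W_c has no aligned target word
            unplaced.append(w_u)
            continue
        out[slots[0] - 1] = w_t
    for j in deleted:
        for k in align.get(j, []):
            out[k - 1] = None
    return " ".join(tok for tok in out if tok is not None), unplaced


sentence, unplaced = apply_edits(TARGET, ALIGN, PLACEMENTS, DELETED_SOURCE)
print(sentence)
print(unplaced)  # ['a', 'range']
```

Source position 7 has no entry in the alignment, so ‘a’ ends up in the unplaced list together with ‘range’, matching the suggestions that are handed to the post-editor.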

In document A Hybrid Machine Translation Framework for an Improved Translation Workflow (Page 192-198)