6.3 CATaLog Online: System Description
6.3.2 Color Coding
Like most existing TM-based CAT tools, CATaLog, the back-end TM engine of CATaLog Online, presents the user with the five most relevant translation suggestions from the TM database. CATaLog Online, in contrast, presents only the top-ranking TM suggestion from CATaLog, along with the translations from the MT and APE engines.
In CATaLog, the post-editor selects the most suitable of the top five TM suggestions as the reference translation for the post-editing task. To make this decision easier, CATaLog color codes the matched and unmatched parts in both the source and the target of the TM suggestions. Green marks matched fragments and red marks mismatches.
Matched and unmatched fragments in the source of the TM suggestions are easily identified through the TER alignments. To identify the corresponding matched and unmatched fragments on the target side of the TM suggestions, the tool establishes word alignments between the TM source sentences and their corresponding translations using GIZA++ (Och and Ney, 2003b). However, any other word aligner, e.g., Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The TER alignment between the input sentence and the relevant TM source segments, together with the alignment between the source and target of the relevant TM suggestions, is used to generate the color coding of the TM suggestions. The GIZA++ alignment file is directly integrated into the TM tool. The following TM sentence pair is shown along with the corresponding word alignment produced by GIZA++.
• English: we want to have a table near the window .
• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।
  Word positions: 1 2 3 4 5 6 7
• Alignment: NULL ({}) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
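To make the alignment notation concrete, the following is a minimal sketch of how such a line could be read into a per-token mapping. The function name `parse_giza_alignment` and the regular expression are illustrative assumptions and are not taken from the CATaLog implementation; the sketch only handles the bracket notation shown in the example above.

```python
import re

# One "token ({ positions })" group as it appears in the example line above.
_GROUP = re.compile(r"(\S+) \(\{([\d\s]*)\}\)")

def parse_giza_alignment(line):
    """Return (source_token, [1-based target positions]) pairs, skipping NULL."""
    pairs = []
    for token, positions in _GROUP.findall(line):
        if token == "NULL":
            continue  # the NULL entry collects unaligned target words
        pairs.append((token, [int(p) for p in positions.split()]))
    return pairs

line = ("NULL ({}) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) "
        "table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
alignment = parse_giza_alignment(line)
```

Here `alignment` pairs "window" with Bengali position 2 and leaves "to", "have" and "the" with empty position lists, mirroring the example.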
The word alignment between the TM source sentences and their corresponding translations is computed offline using GIZA++, only once, on the TM database for a specific language pair. TER provides the alignments between an input sentence and the corresponding top five TM source suggestions. Using these two sets of alignments, we color the matched fragments of the TM suggestions in green and the unmatched fragments in red. Trados, a popular CAT tool, does not provide color coding at the word level: it highlights the entire segment according to the match percentage, whereas CATaLog highlights individual parts of the segment at the word level.
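The combination of the two alignments can be sketched as follows. This is an illustration under stated assumptions rather than the actual CATaLog code: the function `color_target` is hypothetical, the per-token match flags stand in for the output of the TER alignment, and target words that are not aligned to any matched source word are assumed to default to red. The hypothetical input is assumed to differ from the TM source sentence only in the word "window".

```python
def color_target(source_matched, alignment, target_len):
    """Project source-side match flags onto the target side.

    source_matched: one bool per TM source token (True = matched by TER).
    alignment: one list of 1-based target positions per TM source token.
    Returns one color name per target token.
    """
    colors = ["red"] * target_len  # assumption: unaligned words stay red
    for matched, positions in zip(source_matched, alignment):
        if matched:
            for p in positions:
                colors[p - 1] = "green"
    return colors

tm_source = "we want to have a table near the window .".split()
# Target positions per source token, as in the GIZA++ example above.
alignment = [[1], [6], [], [], [4], [5], [3], [], [2], [7]]
# Hypothetical TER result: every source token matches except "window".
source_matched = [tok != "window" for tok in tm_source]
target_colors = color_target(source_matched, alignment, target_len=7)
```

With these flags, only the second Bengali word (aligned to "window") is colored red; the remaining six target words are green.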
Color-coding the TM source segments makes explicit which portions of the matching TM source sentences match the input sentence and which do not. Color-coding the TM target segments, in turn, serves two purposes. Firstly, it makes it easier for translators to decide which TM suggestion to choose and work on. Secondly, it guides translators as to which fragments to post-edit in the chosen TM translation. The reason for color-coding both the TM source and target segments is that a longer (matched or unmatched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation that has more green fragments than red fragments will be a good candidate for post-editing. However, shorter TM translations with high green coverage may not be ideal candidates for post-editing, since post-editors might have to insert translations for many unmatched words in the input sentence.
In this context, it should be noted that insertion and substitution are the most costly operations in post-editing. Accordingly, the TM disprefers sentences involving insertions and substitutions: it assigns a higher cost to insertion than to deletion, and hence sentences involving many insertions are typically not shown as the top candidates by our TM.
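The effect of asymmetric operation costs can be illustrated with a simplified word-level edit distance. This is only a sketch: it is not the TER-based similarity metric used by the TM (it has no shift operation and no normalization), and the cost values `ins=2`, `dele=1` and `sub=2` are assumptions chosen solely to make insertion more expensive than deletion. The two TM segments in the example are hypothetical.

```python
def weighted_edit_cost(tm_tokens, input_tokens, ins=2, dele=1, sub=2):
    """Cost of turning a TM source segment into the input sentence.

    Deleting a TM word costs `dele`, inserting an input word costs `ins`,
    substituting costs `sub`; the weights are illustrative assumptions.
    """
    m, n = len(tm_tokens), len(input_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = tm_tokens[i - 1] == input_tokens[j - 1]
            d[i][j] = min(
                d[i - 1][j] + dele,                      # drop a TM word
                d[i][j - 1] + ins,                       # add an input word
                d[i - 1][j - 1] + (0 if same else sub),  # keep or substitute
            )
    return d[m][n]

input_sentence = "you gave me wrong number .".split()
longer_tm = "you gave me the wrong number .".split()   # hypothetical segment
shorter_tm = "you gave wrong number .".split()          # hypothetical segment

cost_longer = weighted_edit_cost(longer_tm, input_sentence)    # one deletion
cost_shorter = weighted_edit_cost(shorter_tm, input_sentence)  # one insertion
```

Under these assumed weights, the segment that needs one deletion receives a lower cost than the segment that needs one insertion, so it would be ranked higher.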
The color coding scheme is illustrated with the following example from an English–Bengali translation task. The corresponding TM database consists of English sentences taken from the BTEC (Basic Travel Expression Corpus)¹² and their Bengali translations¹³. For the convenience of non-native speakers, Latin transliteration glosses are provided in parentheses for the Bengali sentences.
Input: you gave me wrong number .
Source Matches:
1. you gave me the wrong change . i paid eighty dollars .
2. i think you ’ve got the wrong number .
3. you are wrong .
4. you pay me .
5. you ’re overcharging me .
Target Matches:
1. আপনি আমাকে ভুল খুচরো দিয়েছেন . আমি আশি ডলার দিয়েছি . (Gloss: apni amake vul khuchro diyechen . ami ashi dollar diyechi .) (English Gloss: you me wrong change gave . I eighty dollar paid .)
2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন . (Gloss: amar dharona apni vul nombore phon korechen .) (English Gloss: I think you wrong number ’ve got .)
3. আপনি ভুল . (Gloss: apni vul .) (English Gloss: you wrong .)
4. আপনি আমাকে টাকা দিন . (Gloss: apni amake taka din .) (English Gloss: you me pay .)
5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন . (Gloss: apni amar kache theke beshi nichchen .) (English Gloss: you me are overcharging .)
¹² The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
For the input sentence shown above, the TM system shows the color-coded top five TM matches listed above in order of their relevance with respect to the post-editing effort (as deemed by the TM similarity metric) required to produce the translation of the input sentence.
It should be noted that when the post-editor selects a TM segment for post-editing, the input sentence is also color coded accordingly to reflect the corresponding matched and unmatched fragments in the input sentence. This also gives the post-editor an indication of how much post-editing is involved for the chosen TM segment. Red fragments in the input sentence correspond to insertions, while red fragments in the TM segments correspond to deletions. Recalling the above example, if the translator chooses the translation of TM segment 1, “you gave me the wrong change . i paid eighty dollars .”, the corresponding input source sentence will automatically be color coded as “you gave me wrong number .”.
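The coloring of the input sentence can be sketched in the same spirit. The sketch below uses Python's `difflib.SequenceMatcher` purely as a stand-in for the TER alignment, which is an assumption and not the method used by the tool; the function name `color_input` is likewise hypothetical. Input words that have a counterpart in the chosen TM segment are marked green and the rest red.

```python
from difflib import SequenceMatcher

def color_input(input_tokens, tm_tokens):
    """Color each input token green if it is matched in the TM segment.

    SequenceMatcher is only a stand-in for the TER alignment here.
    """
    colors = ["red"] * len(input_tokens)
    matcher = SequenceMatcher(a=input_tokens, b=tm_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            colors[i] = "green"
    return colors

input_tokens = "you gave me wrong number .".split()
tm_tokens = "you gave me the wrong change . i paid eighty dollars .".split()
input_colors = list(zip(input_tokens, color_input(input_tokens, tm_tokens)))
```

Under this stand-in alignment, "number" is the only input word without a match in TM segment 1, so it is the only red fragment in the input and corresponds to the word the post-editor would have to insert.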