Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS
EAMT 2009 - Barcelona
TS3: an Improved Version of the Bilingual Concordancer TransSearch
Computer assisted translation
• Preferred by professional translators
• Exploits a translation memory
• One of these tools: bilingual concordancer
– Retrieves from a bitext parts associated with a query
– Currently operates at the sentence level
– TransSearch: a web-based concordancer with 177,000 queries/month
Current version: www.tsrali.com
[Screenshot; callouts: query, sentence highlighted]
Prototype version: TS3
[Screenshots; callouts: query, translation of the query, sentence alignment highlighted, several translations of the query, context of use]
Outline
• Spotting of the query translation
• Refinement of translation spotting
• Translation variants merging
• Corpora
• Experimental results
Translation spotting (or Transpotting)
• Identification in a sentence of the translation of a query
– Query: in keeping with
– Transpot: conforme à
This is in keeping with that strategy .
La présente mesure est conforme à cette stratégie .
Word alignment
• Use of an IBM2 model
– Discontinuous transpots
– Not the best method to transpot
This is in keeping with that strategy .
La présente mesure est conforme à cette stratégie .
Transpotting algorithm
• Algorithm of [Simard 03]
– Contiguous transpots
– Best performance among several tested methods
This is in keeping with that strategy .
La présente mesure est conforme à cette stratégie .
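A minimal sketch of contiguous transpotting, not the exact algorithm of [Simard 03]: every contiguous target span is tried, query words are forced to align inside the span and the other source words outside it, and the best-scoring span is kept. The lexical table `T`, the smoothing value `FLOOR` and the function names are hypothetical; a real system would use probabilities from a trained alignment model.

```python
import math

# Hypothetical toy lexical probabilities p(target | source); a real system
# would take them from a trained IBM model.
T = {
    ("in", "conforme"): 0.3, ("keeping", "conforme"): 0.5,
    ("with", "à"): 0.6, ("this", "la"): 0.2, ("this", "présente"): 0.3,
    ("is", "est"): 0.8, ("that", "cette"): 0.7,
    ("strategy", "stratégie"): 0.9, (".", "."): 0.9,
}
FLOOR = 1e-4  # assumed smoothing value for unseen word pairs


def best_link(src_word, tgt_words):
    """Log-probability of the best target word for one source word."""
    return math.log(
        max((T.get((src_word, t), FLOOR) for t in tgt_words), default=FLOOR)
    )


def transpot(source, target, q_start, q_end):
    """Return the contiguous target span (start, end) with the best score.

    Query words source[q_start:q_end] must align inside the span,
    all other source words outside it.
    """
    best, best_score = None, -math.inf
    for i in range(len(target)):
        for j in range(i + 1, len(target) + 1):
            inside, outside = target[i:j], target[:i] + target[j:]
            score = 0.0
            for k, word in enumerate(source):
                pool = inside if q_start <= k < q_end else outside
                score += best_link(word, pool)
            if score > best_score:
                best, best_score = (i, j), score
    return best


source = "this is in keeping with that strategy .".split()
target = "la présente mesure est conforme à cette stratégie .".split()
start, end = transpot(source, target, 2, 5)  # query: "in keeping with"
spotted = " ".join(target[start:end])
print(spotted)
```

Forcing the transpot to be a single span is what rules out the discontinuous results that a plain word alignment can produce.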
The need for post-processing
• Query: in keeping with
• Proposed transpots
conforme à (45), conformément à (29), à (21), dans (20), conforme aux (18), de (14), conforme (13), conformément aux (13), conforme au (12), conformes à (11), …, d’actualité (1), gestes en (1), correspond à (1), respectent (1)
• Proposed transpots after filtering and merging
conforme à (45), conformément à (29), conforme aux (18), conforme (13), conformément aux (13), conforme au (12), conformes à (11), …, correspond à (1), respectent (1)
Filtering bad transpots
• At the level of a pair of sentences
• Computation of 3 sets of features
– Size of the transpot, size of the query
– Statistical word alignment features: min and max likelihood, Viterbi scores...
– Linguistic features: grammatical word ratio, article counts, preposition counts...
• Training of various classifiers
– Voted-perceptron, SVM, decision tree, voting
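A minimal sketch of the size and linguistic features, together with the simple "grammatical ratio > 0.75" baseline that appears in the classification results. The word lists are hypothetical and deliberately tiny, the function names are illustrative, and the statistical word alignment features are left out because they need a trained model.

```python
# Hypothetical, deliberately small lists of French grammatical words.
GRAMMATICAL = {"à", "au", "aux", "de", "des", "du", "dans", "en", "le", "la",
               "les", "l’", "un", "une", "et", "que"}
ARTICLES = {"le", "la", "les", "l’", "un", "une", "des"}
PREPOSITIONS = {"à", "de", "dans", "en"}


def features(query, transpot):
    """Size and linguistic features for one (query, transpot) pair."""
    q, t = query.split(), transpot.split()
    n_grammatical = sum(w in GRAMMATICAL for w in t)
    return {
        "query_size": len(q),
        "transpot_size": len(t),
        # max(..., 1) guards against an empty transpot
        "grammatical_ratio": n_grammatical / max(len(t), 1),
        "article_count": sum(w in ARTICLES for w in t),
        "preposition_count": sum(w in PREPOSITIONS for w in t),
    }


def is_bad(query, transpot, threshold=0.75):
    """Baseline filter: reject transpots made mostly of grammatical words."""
    return features(query, transpot)["grammatical_ratio"] > threshold


for candidate in ["conforme à", "à", "dans", "de", "gestes en", "respectent"]:
    print(candidate, is_bad("in keeping with", candidate))
```

On these candidates the baseline rejects à, dans and de but keeps gestes en, which is the kind of case a trained classifier with more features is meant to catch.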
Merging translation variants
• At the level of the transpot list found for a query
• High complexity when building all possible clusters
• Neighbor-joining method of [Saitou and Nei 87]
– Builds a distance matrix Q between all pairs
– Is a greedy algorithm that at each step
• Merges the two closest transpots
• Updates Q
– Uses a word-based distance
• Minimal cost between 2 inflected forms of a lemma
• Edit costs smaller for grammatical words
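A minimal sketch of the greedy merging idea with a word-based distance. This is a simplification, not the neighbor-joining method itself: it uses average-linkage merging and stops at a distance threshold instead of building a full tree from the Q matrix. The lemma table, the grammatical word list, the cost values and the threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy resources; a real system would use a full lemmatizer.
LEMMA = {"au": "à", "aux": "à", "des": "de", "les": "le", "l’": "le",
         "conformes": "conforme", "correspondant": "correspondre"}
GRAMMATICAL = {"à", "au", "aux", "de", "des", "le", "les", "l’", "dans"}


def word_cost(w):
    """Insertion/deletion cost: smaller for grammatical words."""
    return 0.5 if w in GRAMMATICAL else 1.0


def sub_cost(a, b):
    if a == b:
        return 0.0
    if LEMMA.get(a, a) == LEMMA.get(b, b):
        return 0.1  # minimal cost between two inflected forms of a lemma
    return max(word_cost(a), word_cost(b))


def distance(s, t):
    """Word-level edit distance between two transpots."""
    s, t = s.split(), t.split()
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = d[i - 1][0] + word_cost(s[i - 1])
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + word_cost(t[j - 1])
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + word_cost(s[i - 1]),
                          d[i][j - 1] + word_cost(t[j - 1]),
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]


def linkage(a, b):
    """Average distance between the members of two clusters."""
    return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))


def merge(transpots, threshold=0.7):
    """Greedily merge the two closest clusters until none is close enough."""
    clusters = [[t] for t in transpots]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(a, b) > threshold:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


clusters = merge(["conforme au", "conforme aux", "correspondant au",
                  "dans le sens de l’", "dans le sens des"])
print(clusters)
```

Building every possible clustering is too expensive, so a greedy pass that only ever merges the current closest pair keeps the cost manageable.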
Example for the merging process
• Initial transpots: conforme au, conforme aux, correspondant au, dans le sens de l’, dans le sens des
• Step 1: {conforme au, conforme aux}
• Step 2: {dans le sens des, dans le sens de l’}
• Step 3: {correspondant au, conforme au, conforme aux}
Detection of similar variants
• {correspondant au, conforme au, conforme aux}
• {dans le sens des, dans le sens de l’}
Corpus used in the experiments
• 5,000 most frequent queries
• Canadian Hansard: 8.3 M pairs of sentences
• Retrieved pairs of sentences
Reference corpus for filtering
• Annotation of 530 queries (23 translations per query)
Results for classification of transpots
• Trained on the annotated queries
• Tested by 10-fold cross-validation
                          Correct classification   F-measure for bad transpots
All good                  62                       0
Grammatical ratio >0.75   78                       63
Best classifier           84                       77
– Similar results for the 4 tested classifiers: voted-perceptron, SVM, decision stump, AdaBoost
– Most informative features: grammatical and statistical word
Reference corpus for transpotting
• Retrieved pairs of sentences
• Transpotted pairs of sentences
• Bilingual lexicon
• Reference = 1.4 M pairs of sentences
Metrics for transpotting
• Precision: suggested transpot vs. reference, 2/4
Je crois qu’il est tout à fait conforme à l’esprit du projet de loi.
• Recall: suggested transpot vs. reference, 1/2
Cela n’est pas conforme aux normes des Nations Unies.
• Averaged for each query, then averaged on
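A minimal sketch of word-level precision and recall for a single sentence. The highlighted spans of the two examples are not preserved in the text, so the suggested and reference spans below are assumptions chosen only to reproduce the 2/4 and 1/2 ratios; `positions` is a hypothetical helper.

```python
def positions(sentence, span):
    """Hypothetical helper: word positions of a contiguous span in a sentence."""
    words, target = sentence.split(), span.split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return set(range(i, i + len(target)))
    raise ValueError("span not found in sentence")


def precision_recall(suggested, reference):
    """Word-level precision and recall; both arguments are sets of positions."""
    common = len(suggested & reference)
    precision = common / len(suggested) if suggested else 0.0
    recall = common / len(reference) if reference else 0.0
    return precision, recall


# Assumed spans: the suggested transpot is too long (precision 2/4).
s1 = "Je crois qu’il est tout à fait conforme à l’esprit du projet de loi ."
p1, r1 = precision_recall(positions(s1, "conforme à l’esprit du"),
                          positions(s1, "conforme à"))

# Assumed spans: the suggested transpot is too short (recall 1/2).
s2 = "Cela n’est pas conforme aux normes des Nations Unies ."
p2, r2 = precision_recall(positions(s2, "conforme"),
                          positions(s2, "conforme aux"))

print(p1, r1)
print(p2, r2)
```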
Results for transpotting and filtering
• Filtering of 7.9% of pairs of sentences
• Improvement of F-measure, in particular of recall
                           precision   recall   F-measure
Transpotting               79          84       81
Transpotting + filtering   82          90       86
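A short check of how the F-measure column relates to the other two, assuming the balanced F-measure (harmonic mean of precision and recall); the slides do not state which variant is used, but this one reproduces the reported values after rounding.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f_transpotting = f_measure(79, 84)
f_filtering = f_measure(82, 90)
print(round(f_transpotting), round(f_filtering))
```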
Evaluation of variant merging
• Significant reduction of the number of translations proposed for a query: 164 → 86
• Higher diversity among the top translations
– Example: query as described
rank     1         2         3           4                  5
before   décrits   décrite   décrit      tel que décrit     comme l’a
after    décrits   prévu     comme l’a   tel que prescrit   comme le propose
Evaluation of variant merging
• Task: retrieving the 5 best transpots of a query
• Example
{décrits, décrite, décrit, tel que décrit, comme l’a}
– bag of unigrams: {décrits, décrite, décrit, tel, que, comme, l’, a}
– grammatical words removed: {décrits, décrite, décrit}
– lemmatization: {décrit}
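A minimal sketch of the three normalization steps on the example, assuming a simple regular-expression tokenizer that splits elided forms such as l’a. The grammatical word list and the lemma table are hypothetical and cover only this example.

```python
import re

# Hypothetical toy resources covering only the example.
GRAMMATICAL = {"tel", "que", "comme", "l’", "a"}
LEMMA = {"décrits": "décrit", "décrite": "décrit"}


def normalize(transpots):
    """Return the bag of unigrams, its content words, and their lemmas."""
    unigrams = set()
    for transpot in transpots:
        # "l’a" is split into "l’" and "a"; other words are kept whole
        unigrams.update(re.findall(r"\w+’|\w+", transpot))
    content = {w for w in unigrams if w not in GRAMMATICAL}
    lemmas = {LEMMA.get(w, w) for w in content}
    return unigrams, content, lemmas


top5 = ["décrits", "décrite", "décrit", "tel que décrit", "comme l’a"]
unigrams, content, lemmas = normalize(top5)
print(sorted(unigrams))
print(sorted(content))
print(sorted(lemmas))
```

Comparing sets of lemmas rather than raw strings means that a list made of inflected variants of one word counts as a single retrieved translation.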
Results for variant merging
• Task: retrieving the 5 best transpots of a query
• Experiments done on the manually annotated corpus
                 precision   recall   F-measure
Before merging   90          43       58
After merging    86          54       66
• Slight decrease of precision and significant improvement of recall => higher diversity
Conclusion
• Use of word alignment in a bilingual concordancer
• Quantitative evaluation of a transpotting algorithm
• Two new issues
– Filtering erroneous transpots
– Merging similar variants of translations
Future work
• Improvement of word alignment
– Higher level IBM models
– Phrase-based models