(1)

Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS

EAMT 2009 - Barcelona

TS3: an Improved Version of the Bilingual Concordancer TransSearch

(2)

Computer assisted translation

• Preferred by professional translators

• Exploits a translation memory

• One of these tools: bilingual concordancer

– Retrieves from a bitext the parts associated with a query

– Currently operates at the sentence level

• TransSearch: a web-based concordancer with 177,000 queries/month
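As a rough illustration of sentence-level retrieval, the sketch below returns every sentence pair whose source side contains the query. The `BITEXT` list and the `concordance` function are hypothetical names; the second sentence pair is an invented filler, and this is not the TransSearch implementation.

```python
from typing import List, Tuple

# Toy bitext of (English, French) sentence pairs. The first pair is taken from
# the slides; the second is a made-up filler used only for illustration.
BITEXT: List[Tuple[str, str]] = [
    ("This is in keeping with that strategy .",
     "La présente mesure est conforme à cette stratégie ."),
    ("We will keep our promise .",
     "Nous tiendrons notre promesse ."),
]


def concordance(query: str, bitext: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the sentence pairs whose source side contains the query
    as a contiguous token sequence (no translation spotting yet)."""
    q = query.lower().split()
    hits = []
    for src, tgt in bitext:
        tokens = src.lower().split()
        if any(tokens[i:i + len(q)] == q
               for i in range(len(tokens) - len(q) + 1)):
            hits.append((src, tgt))
    return hits


hits = concordance("in keeping with", BITEXT)
for src, tgt in hits:
    print(src, "|||", tgt)
```

At this level the user still has to locate the translation inside the target sentence by eye, which is the gap that translation spotting addresses.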

(3)

Current version: www.tsrali.com

[Screenshot of the interface; callouts: query, sentence highlighted]

(4)

Prototype version: TS3

[Screenshot of the interface; callouts: query, translation of the query, sentence alignment highlighted]

(5)

Prototype version: TS3

[Screenshot of the interface; callouts: several translations of the query, context of use]

(6)

Outline

• Spotting of the query translation

• Refinement of translation spotting

• Translation variants merging

• Corpora

• Experimental results

(7)

Translation spotting (or Transpotting)

• Identification in a sentence of the translation of a query

Query: in keeping with

This is in keeping with that strategy .

La présente mesure est conforme à cette stratégie .

(8)

Translation spotting (or Transpotting)

• Identification in a sentence of the translation of a query

Query: in keeping with

Transpot: conforme à

This is in keeping with that strategy .

La présente mesure est conforme à cette stratégie .

(9)

Word alignment

• Use of an IBM2 model

– Discontinuous transpots

– Not the best method to transpot

This is in keeping with that strategy .

La présente mesure est conforme à cette stratégie .
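A minimal sketch of why a plain word alignment can yield discontinuous transpots. Each target word is linked to the source word maximizing a lexical probability times a position prior, in the spirit of IBM model 2. The table `T`, its values, the deliberately noisy ("mesure", "in") entry and the exponential position prior are all assumptions made for illustration; a real IBM2 model learns both distributions with EM.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical lexical probabilities t(f | e), keyed by (French word, English word).
# The ("mesure", "in") entry is an intentionally noisy value.
T: Dict[Tuple[str, str], float] = {
    ("La", "This"): 0.3, ("présente", "This"): 0.4, ("mesure", "This"): 0.3,
    ("mesure", "in"): 0.35, ("est", "is"): 0.8, ("conforme", "keeping"): 0.6,
    ("à", "with"): 0.4, ("cette", "that"): 0.7,
    ("stratégie", "strategy"): 0.9, (".", "."): 0.9,
}


def align(src: List[str], tgt: List[str], t: Dict[Tuple[str, str], float],
          floor: float = 1e-6) -> List[int]:
    """For each target position j, return the source position i that maximizes
    t(f_j | e_i) * a(i | j), where a is a hand-written diagonal prior."""
    l, m = len(src), len(tgt)
    links = []
    for j, f in enumerate(tgt):
        scores = [t.get((f, e), floor) * math.exp(-abs(i / l - j / m))
                  for i, e in enumerate(src)]
        links.append(scores.index(max(scores)))
    return links


def transpot_positions(src: List[str], tgt: List[str],
                       q_start: int, q_end: int) -> List[int]:
    """Target positions linked to any source word in src[q_start:q_end]."""
    links = align(src, tgt, T)
    return [j for j, i in enumerate(links) if q_start <= i < q_end]


src = "This is in keeping with that strategy .".split()
tgt = "La présente mesure est conforme à cette stratégie .".split()
positions = transpot_positions(src, tgt, 2, 5)  # query: "in keeping with"
print([tgt[j] for j in positions])
```

With these toy values the selected words are "mesure", "conforme" and "à": a single spurious link is enough to break the contiguity of the transpot.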

(10)

Transpotting algorithm

• Algorithm of [Simard 03]

– Contiguous transpots

– Best performance among several tested methods

This is in keeping with that strategy .

La présente mesure est conforme à cette stratégie .

(11)

Transpotting algorithm

• Algorithm of [Simard 03]

– Contiguous transpots

– Best performance among several tested methods

This is in keeping with that strategy .

La présente mesure est conforme à cette stratégie .
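One simple way to enforce contiguity is to score every contiguous target span and keep the best one: words inside the span must be explained by the query words, words outside it by the rest of the source sentence. The sketch below follows that idea under stated assumptions; it is not claimed to be the exact algorithm of [Simard 03], and the table `T` and its values (including the noisy ("mesure", "in") entry) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical lexical probabilities t(f | e), keyed by (French word, English word).
T: Dict[Tuple[str, str], float] = {
    ("La", "This"): 0.3, ("présente", "This"): 0.4, ("mesure", "This"): 0.3,
    ("mesure", "in"): 0.35, ("est", "is"): 0.8, ("conforme", "keeping"): 0.6,
    ("à", "with"): 0.4, ("cette", "that"): 0.7,
    ("stratégie", "strategy"): 0.9, (".", "."): 0.9,
}


def best_contiguous_transpot(src: List[str], tgt: List[str], q_start: int,
                             q_end: int, t: Dict[Tuple[str, str], float],
                             floor: float = 1e-6) -> Tuple[int, int]:
    """Return the target span (a, b), end exclusive, with the highest score."""
    inside_src = src[q_start:q_end]
    outside_src = src[:q_start] + src[q_end:]

    def best_logp(f: str, candidates: List[str]) -> float:
        # Log of the best lexical probability of f given any candidate word.
        return math.log(max((t.get((f, e), floor) for e in candidates),
                            default=floor))

    in_lp = [best_logp(f, inside_src) for f in tgt]
    out_lp = [best_logp(f, outside_src) for f in tgt]

    best_span, best_score = (0, 1), -math.inf
    for a in range(len(tgt)):
        for b in range(a + 1, len(tgt) + 1):
            score = sum(in_lp[a:b]) + sum(out_lp[:a]) + sum(out_lp[b:])
            if score > best_score:
                best_span, best_score = (a, b), score
    return best_span


src = "This is in keeping with that strategy .".split()
tgt = "La présente mesure est conforme à cette stratégie .".split()
span = best_contiguous_transpot(src, tgt, 2, 5, T)  # query: "in keeping with"
print(" ".join(tgt[span[0]:span[1]]))
```

Because a span is scored as a whole, the isolated noisy link to "mesure" no longer wins: including it would force "est" into the transpot as well, which the table cannot explain.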

(12)

The need for post-processing

Query: in keeping with

Proposed transpots

conforme à (45) conformément à (29)

à (21) dans (20)

conforme aux (18) de (14)

conforme (13) conformément aux (13) conforme au (12) conformes à (11)

d’actualité (1) gestes en (1) correspond à (1) respectent (1)

(13)

The need for post-processing

Query: in keeping with

Proposed transpots after filtering

conforme à (45) conformément à (29)

à (21) dans (20)

conforme aux (18) de (14)

conforme (13) conformément aux (13) conforme au (12) conformes à (11)

d’actualité (1) gestes en (1) correspond à (1) respectent (1)

(14)

The need for post-processing

Query: in keeping with

Proposed transpots after filtering and merging

conforme à (45) conformément à (29)

conforme aux (18)

conforme (13) conformément aux (13) conforme au (12) conformes à (11)

correspond à (1) respectent (1)

(15)

Filtering bad transpots

• At the level of a pair of sentences

• Computation of 3 sets of features

– Size of the transpot, size of the query

– Statistical word alignment features: min and max likelihood, Viterbi scores...

– Linguistic features: grammatical word ratio, article counts, preposition counts...

• Training of various classifiers

– Voted-perceptron, SVM, decision tree, voting
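The size and linguistic features above can be sketched as follows. The word lists, the `TranspotFeatures` class and the function names are hypothetical and deliberately tiny; the statistical alignment features are omitted because they depend on a trained alignment model. The `looks_bad` rule mirrors a simple grammatical-ratio threshold rather than a trained classifier.

```python
from dataclasses import dataclass

# Hypothetical, deliberately small French word lists (assumptions).
ARTICLES = {"le", "la", "les", "l’", "un", "une", "des", "au", "aux", "du"}
PREPOSITIONS = {"à", "de", "dans", "en", "pour", "sur", "avec"}
GRAMMATICAL_WORDS = ARTICLES | PREPOSITIONS | {"et", "que", "qui"}


@dataclass
class TranspotFeatures:
    query_size: int           # number of words in the query
    transpot_size: int        # number of words in the transpot
    grammatical_ratio: float  # share of grammatical words in the transpot
    article_count: int
    preposition_count: int


def extract_features(query: str, transpot: str) -> TranspotFeatures:
    """Compute size and linguistic features for one (query, transpot) pair."""
    tokens = transpot.lower().split()
    grammatical = sum(tok in GRAMMATICAL_WORDS for tok in tokens)
    return TranspotFeatures(
        query_size=len(query.split()),
        transpot_size=len(tokens),
        # An empty transpot has no content word, so it is treated as fully grammatical.
        grammatical_ratio=grammatical / len(tokens) if tokens else 1.0,
        article_count=sum(tok in ARTICLES for tok in tokens),
        preposition_count=sum(tok in PREPOSITIONS for tok in tokens),
    )


def looks_bad(features: TranspotFeatures, threshold: float = 0.75) -> bool:
    """Threshold rule: flag transpots made up almost only of grammatical words."""
    return features.grammatical_ratio > threshold


for candidate in ["conforme à", "à", "dans", "de"]:
    print(candidate, looks_bad(extract_features("in keeping with", candidate)))
```

A trained classifier would take the full feature vector as input; the threshold rule is only the simplest decision one can build from a single feature.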

(16)

Merging translation variants

• At the level of the transpot list found for a query

• High complexity when building all possible clusters

• Neighbor-joining method of [Saitou and Nei 87]

– Builds a distance matrix Q between all pairs

– Is a greedy algorithm that, at each step, merges the two closest transpots and updates Q

– Uses a word-based distance: minimal cost between 2 inflected forms of a lemma; edit costs are smaller for grammatical words
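The merging idea can be sketched with a weighted word-level edit distance and a greedy closest-pair merge. This is a simplified agglomerative procedure, not the full neighbor-joining criterion; the lemma table, the grammatical word list, the cost values (0.1, 0.3, 1.0) and the stopping threshold are all assumptions.

```python
from itertools import combinations
from typing import List

# Hypothetical resources: a tiny lemma table and a grammatical word list.
LEMMA = {"au": "à", "aux": "à", "des": "de", "les": "le", "l’": "le"}
GRAMMATICAL = {"à", "au", "aux", "de", "des", "le", "les", "l’", "dans"}


def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if LEMMA.get(a, a) == LEMMA.get(b, b):
        return 0.1  # minimal cost between two inflected forms of a lemma
    return 0.3 if a in GRAMMATICAL and b in GRAMMATICAL else 1.0


def indel_cost(w: str) -> float:
    return 0.3 if w in GRAMMATICAL else 1.0  # cheaper for grammatical words


def word_distance(x: str, y: str) -> float:
    """Weighted edit distance computed over words instead of characters."""
    a, b = x.split(), y.split()
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + indel_cost(a[i - 1]),
                          d[i][j - 1] + indel_cost(b[j - 1]),
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[-1][-1]


def cluster_distance(c1: List[str], c2: List[str]) -> float:
    """Average pairwise distance between the members of two clusters."""
    return sum(word_distance(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def greedy_merge(transpots: List[str], max_distance: float = 0.5) -> List[List[str]]:
    """Repeatedly merge the two closest clusters until none is close enough."""
    clusters = [[t] for t in transpots]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        if cluster_distance(clusters[i], clusters[j]) > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


clusters = greedy_merge(["conforme au", "correspondant au", "conforme aux",
                         "dans le sens de l’", "dans le sens des"])
print(clusters)
```

With these toy costs, "conforme au" and "conforme aux" merge first, then the two "dans le sens ..." variants, while "correspondant au" stays apart because replacing a content word costs more than the threshold.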

(17)

Example for the merging process

conforme au

correspondant au

conforme aux

dans le sens de l’

dans le sens des

(18)

Example for the merging process

correspondant au

dans le sens de l’

dans le sens des

conforme au, conforme aux

(19)

Example for the merging process

correspondant au

conforme au, conforme aux

dans le sens des, dans le sens de l’

(20)

Example for the merging process

correspondant au, conforme au, conforme aux

dans le sens des, dans le sens de l’


(22)

Detection of similar variants

correspondant au, conforme au, conforme aux

dans le sens des, dans le sens de l’

(23)

Corpus used in the experiments

[Diagram labels: 5,000 most frequent queries; Canadian Hansard, 8.3 M pairs of sentences; retrieved pairs of sentences]

(24)

Reference corpus for filtering

Annotation of 530 queries (23 translations per query)

(25)

Results for classification of transpots

Trained on the annotated queries

Tested by 10-fold cross-validation

                           Correct classification (%)   F-measure for bad transpots (%)
All good                   62                           0
Grammatical ratio > 0.75   78                           63
Best classifier            84                           77

Similar results for the 4 tested classifiers: voted-perceptron, SVM, decision stump, AdaBoost

Most informative features: grammatical and statistical word alignment features
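The "All good" baseline can be checked with a few lines of arithmetic: if every transpot is labelled good, the accuracy equals the share of good transpots, and the F-measure for the bad class is 0 because no bad transpot is ever predicted. The 100-item sample below is hypothetical and only reproduces the 62% proportion reported in the table.

```python
from typing import List, Tuple


def accuracy_and_f_bad(gold: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (accuracy, F-measure of the 'bad' class)."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    tp = sum(g == "bad" and p == "bad" for g, p in zip(gold, predicted))
    fp = sum(g == "good" and p == "bad" for g, p in zip(gold, predicted))
    fn = sum(g == "bad" and p == "good" for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(gold), f


# Hypothetical sample with the same proportion of good transpots as the table.
gold = ["good"] * 62 + ["bad"] * 38
all_good = ["good"] * len(gold)

accuracy, f_bad = accuracy_and_f_bad(gold, all_good)
print(f"accuracy = {accuracy:.2f}, F(bad) = {f_bad:.2f}")
```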

(26)

Reference corpus for transpotting

[Diagram labels: retrieved pairs of sentences; transpotted pairs of sentences; bilingual lexicon; reference = 1.4 M pairs of sentences]

(27)

Metrics for transpotting

• Precision: 2/4

Je crois qu’il est tout à fait conforme à l’esprit du projet de loi.

[Callouts on the sentence: suggested transpot, reference]

(28)

Metrics for transpotting

• Precision: 2/4

Je crois qu’il est tout à fait conforme à l’esprit du projet de loi.

• Recall: 1/2

Cela n’est pas conforme aux normes des Nations Unies.

[Callouts on each sentence: suggested transpot, reference]

• Averaged for each query, then averaged on
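A minimal sketch of word-level precision and recall for a suggested transpot against a reference. The exact spans are not recoverable from the slides, so the spans below are assumptions chosen only to reproduce the 2/4 and 1/2 ratios; counting overlap as a bag of words (rather than by position) is also an assumption, as is the two-level averaging in `macro_average`.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple


def precision_recall(suggested: List[str], reference: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall of a suggested transpot."""
    overlap = sum((Counter(suggested) & Counter(reference)).values())
    precision = overlap / len(suggested) if suggested else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def macro_average(scores: Dict[str, List[Tuple[float, float]]]) -> Tuple[float, float]:
    """Average (precision, recall) within each query, then across queries."""
    per_query = [(mean(p for p, _ in s), mean(r for _, r in s))
                 for s in scores.values()]
    return mean(p for p, _ in per_query), mean(r for _, r in per_query)


# Assumed spans: 4 suggested words, 2 of which belong to the reference.
p1, r1 = precision_recall("fait conforme à l’".split(), "conforme à".split())
# Assumed spans: the suggestion covers 1 of the 2 reference words.
p2, r2 = precision_recall("conforme".split(), "conforme aux".split())

print(p1, r1)  # precision 2/4
print(p2, r2)  # recall 1/2
avg_p, avg_r = macro_average({"in keeping with": [(p1, r1), (p2, r2)]})
```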

(29)

Results for transpotting and filtering

• Filtering of 7.9% of pairs of sentences

• Improvement of F-measure, in particular of recall

                           precision   recall   F-measure
Transpotting               79          84       81
Transpotting + filtering   82          90       86
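The reported F-measures are consistent with the balanced F-measure (harmonic mean of precision and recall) computed from the rounded table values. The check below assumes that definition; the slides do not state which variant was used or whether F was averaged per query.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rows = {
    "Transpotting": (79, 84),
    "Transpotting + filtering": (82, 90),
}
for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.1f}")
```

Rounded to the nearest integer, the two values match the 81 and 86 given in the table.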

(30)

Evaluation of variant merging

• Significant reduction of the number of translations proposed for a query: 164 → 86

• Higher diversity among the top translations

– Example: query as described

rank     1         2         3           4                  5
before   décrits   décrite   décrit      tel que décrit     comme l’a
after    décrits   prévu     comme l’a   tel que prescrit   comme le propose

(31)

Evaluation of variant merging

• Task: retrieving the 5 best transpots of a query

• Example

{ décrits, décrite, décrit, tel que décrit, comme l’a }

(32)

Evaluation of variant merging

• Task: retrieving the 5 best transpots of a query

• Example

{ décrits, décrite, décrit, tel que décrit, comme l’a }

bag of unigrams

{ décrits, décrite, décrit, tel, que, comme, l’, a }

(33)

Evaluation of variant merging

• Task: retrieving the 5 best transpots of a query

• Example

{ décrits, décrite, décrit, tel que décrit, comme l’a }

bag of unigrams

grammatical words removed

{ décrits, décrite, décrit }

(34)

Evaluation of variant merging

• Task: retrieving the 5 best transpots of a query

• Example

{ décrits, décrite, décrit, tel que décrit, comme l’a }

bag of unigrams

grammatical words removed

lemmatization

{ décrit }
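The three normalization steps of the example can be sketched as a small pipeline. The grammatical word list and the lemma table are hypothetical and cover only this example; a real system would rely on a lexicon or a lemmatizer.

```python
import re
from typing import Iterable, Set

# Hypothetical resources, limited to the words of this example.
GRAMMATICAL = {"tel", "que", "comme", "l’", "a"}
LEMMA = {"décrits": "décrit", "décrite": "décrit"}


def bag_of_unigrams(transpots: Iterable[str]) -> Set[str]:
    """Split every transpot into words; an elided form such as "l’" is one word."""
    bag: Set[str] = set()
    for transpot in transpots:
        text = transpot.lower().replace("'", "’")  # unify apostrophes
        bag.update(re.findall(r"\w+’|\w+", text))
    return bag


def normalize(transpots: Iterable[str]) -> Set[str]:
    """Bag of unigrams, minus grammatical words, mapped to lemmas."""
    content_words = bag_of_unigrams(transpots) - GRAMMATICAL
    return {LEMMA.get(word, word) for word in content_words}


top5 = ["décrits", "décrite", "décrit", "tel que décrit", "comme l’a"]
print(sorted(bag_of_unigrams(top5)))
print(normalize(top5))
```

Reducing the five transpots to a single lemma makes the lack of diversity in the unmerged list explicit.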

(35)

Results for variant merging

• Task: retrieving the 5 best transpots of a query

• Experiments done on the manually annotated corpus

                 precision   recall   F-measure
Before merging   90          43       58
After merging    86          54       66

Slight decrease of precision and significant improvement of recall => higher diversity

(36)

Conclusion

• Use of word alignment in a bilingual concordancer

• Quantitative evaluation of a transpotting algorithm

• Two new issues

– Filtering erroneous transpots

– Merging similar variants of translations

(37)

Future work

• Improvement of word alignment

– Higher level IBM models

– Phrase-based models

• Use of pseudo-relevance feedback to improve transpotting

• Evaluation with end users

(38)

Thank you for your attention
