Measuring Machine Translation Errors in New Domains
MT metrics compute a score based on the output of an MT system, here called the “candidate”, and a provided “reference” sentence. The reference is a valid translation of
The most commonly used automatic evaluation metrics, BLEU (Papineni et al., 2002) and NIST (Doddington, 2002), are based on the assumption that “The closer a machine translation is to
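The n-gram overlap idea behind BLEU can be made concrete with a small sketch. This is an illustrative, smoothed sentence-level variant written for this excerpt, not the exact corpus-level formulation of Papineni et al. (2002); the add-one smoothing and the `max_n` default are assumptions made so the toy function behaves on short inputs.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_sketch(candidate, reference, max_n=4):
    """Toy sentence-level BLEU-style score: geometric mean of clipped
    n-gram precisions times a brevity penalty. Add-one smoothing keeps
    short sentences from zeroing out (an assumption of this sketch)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, and a candidate that drops words is penalized both by lower high-order n-gram precision and by the brevity penalty.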
This is essentially equivalent to the standard domain-adaptation problem in machine learning, and in the context of MT there have been methods proposed to perform Bayesian adaptation
Domain Adaptation for Statistical Machine Translation with Domain Dictionary and Monolingual Corpora Proceedings of the 22nd International Conference on Computational Linguistics
We employ 8 different MT metrics for identifying paraphrases across two different datasets: the well-known Microsoft Research paraphrase corpus (MSRP) (Dolan et al., 2004) and
Traditional machine translation evaluation metrics such as BLEU and WER have been widely used, but these metrics have poor correlations with human judgements because they badly
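For reference, WER is just Levenshtein edit distance over words, normalized by reference length. A minimal dynamic-programming sketch (the function name and tokenization by whitespace are choices made for this illustration):

```python
def wer(reference, hypothesis):
    """Word error rate: minimum number of word substitutions, insertions,
    and deletions turning the reference into the hypothesis, divided by
    the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[-1][-1] / len(ref)
```

Because WER insists on a single surface form, a perfectly valid paraphrase of the reference can receive a high error rate, which is one source of the poor correlation with human judgements noted above.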
After introducing the gist consistency score into traditional MT metrics, the Kendall correlation between the hybrid BLEU (HBLEU(s_topic)) and human judgements rises from 42.56% to
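The Kendall correlation used in such comparisons counts concordant versus discordant pairs of (metric score, human score). A self-contained sketch of the basic tau-a statistic (tie handling is simplified here; published results typically use a tie-corrected variant):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists:
    (concordant pairs - discordant pairs) / total pairs.
    Tied pairs count as neither (a simplification of this sketch)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # both lists rank the pair the same way
        elif s < 0:
            discordant += 1   # the lists disagree on the pair's order
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs
```

A value of 1.0 means the metric ranks every pair of outputs exactly as humans do; -1.0 means it reverses every pair.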
We report translation results for two metrics, Bleu (Papineni et al., 2002) and NIST (Doddington, 2002), and significance testing is performed using approximate randomization
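Paired approximate randomization tests whether an observed score difference between two systems could arise by chance: under the null hypothesis the systems are interchangeable, so randomly swapping each sentence's pair of scores should often yield a difference at least as large as the observed one. A sketch under that setup (sentence-level scores and the trial count are assumptions of this illustration):

```python
import random

def approx_randomization_p(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test. Returns an estimated
    two-sided p-value for the mean score difference between systems."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sentence's scores
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

Unlike a parametric test, this makes no normality assumption about the score distribution, which is why it is a common choice for MT metric differences.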