[PDF] Top 20 Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

... Learning weights for scoring functions. Given a large number of scoring functions, simply averaging their resulting scores may be inadequate. Learning weights to optimize machine translation system quality is ...

14

STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering

... the WMT 2018 Shared Task on parallel corpus ...the task, which can efficiently process large volumes of data and can be easily deployed for new datasets in different ...

7

Findings of the WMT 2018 Shared Task on Quality Estimation

... QE brain uses a conditional target language model as a robust feature extractor with a novel bidirectional transformer which is pre-trained on a large parallel corpus filtered to contain “in-domain like” ...

21

Findings of the WMT 2018 Shared Task on Automatic Post Editing

... of parallel attention layers (4 and 8 ...the WMT’17 Translation task (Huck et ...the task, training is performed by taking advantage of both the artificial data provided by ...

16

Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low Resource Conditions

... high-quality parallel corpora, while low-quality sentence pairs are either synthesized by scrambling high-quality sentence pairs or by using the raw crawled data (Sánchez-Cartagena et ...

19

Tilde’s Parallel Corpus Filtering Methods for WMT 2018

... describes parallel corpus filtering methods that allow reducing noise of noisy “parallel” corpora from a level where the corpora are not usable for neural machine translation training ...

7

Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

... the shared task organizers, we did an additional de-duplication step in which email addresses and URLs were replaced with a placeholder token and numbers were removed, before deciding which sentences were ...

8

Webinterpret Submission to the WMT2019 Shared Task on Parallel Corpus Filtering

... consideration. The initial filtering partially alleviates this cost by drastically reducing the amount of sentences to rank. However, it is still a slow process that took about one second per iteration with our ...

6

MAJE Submission to the WMT2018 Shared Task on Parallel Corpus Filtering

... We also conducted some initial experiments using the Common Crawl corpus, under the rationale that it would be closer to the domain of the noisy data from the Paracrawl corpus. However, Common Crawl ...

5

The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task

... In this paper, we presented our rescoring system for the WMT 2019 Shared Task on Parallel Corpus Filtering. Our system is based on contrastive scoring models using features extracted ...

7

Alibaba Submission to the WMT18 Parallel Corpus Filtering Task

... The parallel corpus is an essential resource for machine translation and multilingual natural language ...of parallel corpus is also very important in MT system training (Koehn and Knowles, ...

6

Findings of the WMT 2016 Bilingual Document Alignment Shared Task

... As Rarrick et al. (2011) point out, a key problem for parallel corpora extracted from the web is filtering out translations that have been created by machine translation. Venugopal et al. (2011) propose a ...

10

NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

... al., 2018) provides a very noisy 1 billion words (English word count) German-English (De-En) corpus crawled from the web as a part of the Paracrawl ...

5

Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets

... BioRo corpus for Romanian (Mitrofan and Tufis, 2018), can boost performance of MT systems for these ...more parallel corpora are certainly necessary not only for those languages that scored ...

16

NRC Parallel Corpus Filtering System for WMT 2019

... WMT19 shared task on parallel corpus filtering was essentially the same as last year’s edition (Koehn et ...noisy corpus crawled from the web using ParaCrawl (Koehn et ...of ...

9

Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

... We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine ...

11

JU Saarland Submission to the WMT2019 English–Gujarati Translation Shared Task

... in WMT 2019. We initially used monoses (Artetxe et al., 2018), which is based on unsupervised statistical phrase based machine translation, to translate the monolingual sentences from English to ...

6

SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering

... the sentence pairs on the 1 billion word German-English Paracrawl corpus. Scores do not have to be meaningful, except that higher scores indicate better quality. The performance of the submissions is evaluated ...

5

The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task

... ParaCrawl corpus down to an amount that can be handled by stronger, computationally more complex, ...tered corpus. Although a big part of the corpus is removed (58M sentences or 60% of the original ...

9

The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task

... 10M corpus is lower than that of the SMT ...10M corpus comprises 221K long sentence pairs, a relatively small number of sentences for NMT systems, which evaluate fluency over entire ...

6