
Dutch Parallel Corpus

Lieve Macken

LT3, Language and Translation Technology Team

Faculty of Applied Language Studies, University College Ghent


Characteristics

Parallel Corpus of 10 million words

2 language pairs, 4 language directions

• Dutch–French / Dutch–English

• Bidirectional

• Also comparable corpus (translated vs. non-translated language)

Balanced

• 5 text types: administrative texts, instructive texts, literature, journalistic texts, texts for external communication

• Text types further divided into subtypes

• ± 500,000 words per translation direction/text type

• Metadata files

• info on translation direction and text types

• other relevant info such as intended audience, text provider, . . .
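The balance figures above can be checked with a little arithmetic: 4 translation directions times 5 text types at roughly 500,000 words each gives the 10 million word total. The sketch below only illustrates that design target; the direction codes such as `nl->fr` and the names `design_grid` and `TARGET_WORDS_PER_CELL` are assumptions made for this example, not identifiers from the corpus.

```python
from itertools import product

# Four translation directions implied by two bidirectional language pairs.
# The "nl->fr" style codes are illustrative notation, not DPC identifiers.
DIRECTIONS = ["nl->fr", "fr->nl", "nl->en", "en->nl"]

# The five text types listed on the slide.
TEXT_TYPES = [
    "administrative texts",
    "instructive texts",
    "literature",
    "journalistic texts",
    "texts for external communication",
]

# Approximate design target per direction/text-type cell.
TARGET_WORDS_PER_CELL = 500_000


def design_grid():
    """Return the target word count for every (direction, text type) cell."""
    return {
        (direction, text_type): TARGET_WORDS_PER_CELL
        for direction, text_type in product(DIRECTIONS, TEXT_TYPES)
    }


grid = design_grid()
total_words = sum(grid.values())
print(len(grid), "cells,", total_words, "words in total")
```

This is the design target only; actual cell sizes are approximate (± 500,000 words).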


Characteristics (ctd.)

Balanced (ctd.)

• Condition on translation direction relaxed for instructive texts

• difficult to find information on translation direction

• Literary texts balanced according to language pairs

• difficult to obtain copyright clearance

Copyright-cleared

• Difficult and time-consuming task

• Commercial publishers (journalistic texts and literature) → longer and more difficult negotiation cycles


Characteristics (ctd.)

Automatically aligned at sentence level

• Combined three alignment tools

• reduce manual verification effort

• No word alignment
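The slides do not say how the three aligners were combined. One plausible way to reduce manual verification, shown here purely as an assumption, is a voting scheme: sentence links proposed by at least two tools are accepted, and the rest are queued for manual checking. The function name `combine_alignments`, the `min_votes` parameter and the link representation are hypothetical.

```python
from collections import Counter


def combine_alignments(tool_outputs, min_votes=2):
    """Split sentence links into accepted links and links needing manual review.

    tool_outputs: one list of links per alignment tool. A link is a pair
    (source_sentence_ids, target_sentence_ids), each a tuple of integers.
    This voting rule is an assumption, not the documented DPC procedure.
    """
    votes = Counter()
    for links in tool_outputs:
        # Count each link at most once per tool.
        votes.update(set(links))

    accepted = sorted(link for link, n in votes.items() if n >= min_votes)
    to_verify = sorted(link for link, n in votes.items() if n < min_votes)
    return accepted, to_verify


# Toy output of three hypothetical aligners for a three-sentence text.
tool_a = [((1,), (1,)), ((2,), (2, 3)), ((3,), (4,))]
tool_b = [((1,), (1,)), ((2,), (2,)), ((3,), (4,))]
tool_c = [((1,), (1,)), ((2,), (2, 3)), ((3,), (3, 4))]

accepted, to_verify = combine_alignments([tool_a, tool_b, tool_c])
print("accepted:", accepted)
print("manual check:", to_verify)
```

Under this rule only the links on which the tools disagree reach a human annotator.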

Linguistically enriched

• Preprocessing

• Character conversion (UTF-8)

• Text split in sentences and tokens

• Part-of-speech tagging

• Info on tag sets: see user doc

• Lemmatization


Characteristics (ctd.)

Obtained accuracy scores on a manually validated DPC sample

Language   Sample size (tokens)   Lemmata   PoS (full tag)   PoS (main category)
Dutch      211,000                96.5%     94.8%            97.4%
English    300,000                98.1%     96.2%            N/A


Distribution

Distributed and maintained by HLT-agency (TST-centrale)

http://www.inl.nl/tst-centrale/nl/producten/corpora/dpc-corpus-dutch-parallel-corpus/6-65

XML data

4 files per text

2 monolingual files

• Indications of paragraphs and sentences

• Linguistic annotations

File containing metadata

• Author, date, publication, ...

• Source and target language

• Text type and subtype

• Domain, keywords

• Type of IPR-agreement
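As a rough illustration of how a monolingual file with paragraph, sentence and token annotations could be read, the sketch below parses a small invented XML fragment. The element and attribute names (`text`, `p`, `s`, `w`, `lemma`, `pos`) and the tag values are assumptions made for this example; the actual DPC schema and tag sets are described in the user documentation.

```python
import xml.etree.ElementTree as ET

# Invented sample: NOT the real DPC format, only a stand-in for illustration.
SAMPLE = """<text lang="nl">
  <p n="1">
    <s n="1">
      <w lemma="de" pos="LID">De</w>
      <w lemma="kat" pos="N">kat</w>
      <w lemma="slapen" pos="WW">slaapt</w>
    </s>
  </p>
</text>"""


def read_sentences(xml_string):
    """Return one list of (word form, lemma, PoS tag) triples per sentence."""
    root = ET.fromstring(xml_string)
    sentences = []
    for sentence in root.iter("s"):
        tokens = [
            (word.text, word.get("lemma"), word.get("pos"))
            for word in sentence.iter("w")
        ]
        sentences.append(tokens)
    return sentences


for tokens in read_sentences(SAMPLE):
    print(tokens)
```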

Monolingual XML file

Alignment XML file

Web interface

Demo version

Web interface (ctd.)

Three steps

1. Monolingual vs. bilingual search

2. Full corpus vs. subcorpus

3. Search query

Web interface (ctd.)

Putting together a subcorpus

Multifunctional resource for diverse group of users

• Researchers in translation studies vs. contrastive linguistics

• Language teachers and language learners

• Human translators

Create a subcorpus according to your needs!

• E.g. filter on translation direction

• E.g. filter on text type
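Outside the web interface, the same kind of selection could be made directly on the metadata. The following is a minimal sketch, assuming each text's metadata has already been loaded into a simple record; the class `TextMeta`, its field names and the function `select_subcorpus` are hypothetical and do not correspond to the actual metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical, simplified view of one text's metadata."""
    text_id: str
    source_lang: str
    target_lang: str
    text_type: str


def select_subcorpus(records, source_lang=None, target_lang=None, text_type=None):
    """Keep the records that match every filter that was actually given."""
    selected = []
    for record in records:
        if source_lang is not None and record.source_lang != source_lang:
            continue
        if target_lang is not None and record.target_lang != target_lang:
            continue
        if text_type is not None and record.text_type != text_type:
            continue
        selected.append(record)
    return selected


# Invented example records.
records = [
    TextMeta("t1", "nl", "fr", "journalistic texts"),
    TextMeta("t2", "fr", "nl", "journalistic texts"),
    TextMeta("t3", "nl", "en", "literature"),
]

# Filter on translation direction and text type.
subcorpus = select_subcorpus(
    records, source_lang="nl", target_lang="fr", text_type="journalistic texts"
)
print([record.text_id for record in subcorpus])
```

Leaving a filter at `None` means "do not restrict on this field", so the same function covers filtering on direction only or on text type only.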

Web interface (ctd.)

Web interface (ctd.)

Search query

Define search words

• max. 2 per language

Define search type

• Word form or lemma

• Part-of-speech
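To make the query options concrete, here is a minimal sketch of matching one or two search terms against an annotated sentence by word form, lemma or part-of-speech. It assumes that two terms must match consecutive tokens, which is only one possible reading of the interface; the function `find_matches`, the token dictionary keys and the tag values are invented for this example.

```python
SEARCH_TYPES = ("form", "lemma", "pos")


def find_matches(tokens, terms, search_type="lemma"):
    """Return start positions where the terms match consecutive tokens.

    tokens: list of dicts with the keys "form", "lemma" and "pos".
    terms: one or two search terms (the interface allows max. 2 per language).
    """
    if not 1 <= len(terms) <= 2:
        raise ValueError("a query takes one or two search terms")
    if search_type not in SEARCH_TYPES:
        raise ValueError("search_type must be 'form', 'lemma' or 'pos'")

    wanted = [term.lower() for term in terms]
    hits = []
    # Slide a window of len(wanted) tokens over the sentence.
    for start in range(len(tokens) - len(wanted) + 1):
        window = tokens[start:start + len(wanted)]
        if all(tok[search_type].lower() == w for tok, w in zip(window, wanted)):
            hits.append(start)
    return hits


# Invented annotated sentence; the PoS tags are placeholders.
sentence = [
    {"form": "Hij", "lemma": "hij", "pos": "PRON"},
    {"form": "heeft", "lemma": "hebben", "pos": "VERB"},
    {"form": "gewerkt", "lemma": "werken", "pos": "VERB"},
]

print(find_matches(sentence, ["hebben", "werken"], search_type="lemma"))
print(find_matches(sentence, ["VERB"], search_type="pos"))
```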

Example: 2 consecutive words

Web interface (ctd.)

Metadata

Context

Web interface (ctd.)

Export to Excel

Source and target sentences and accompanying

Puisque – aangezien

Dutch compounds

Passé composé (French) – Simple past (Dutch)

Results passé composé (French) – Simple past (Dutch)

(26)

Final verbal group in Dutch

Acknowledgements

Funded by the STEVIN programme

DPC Team

• K.U. Leuven campus Kortrijk

• Faculty of Applied Language Studies – University College Ghent

• Contributors

• Piet Desmet, Willy Vandeweghe, Hans Paulussen, Lieve Macken, Maribel Montero Perez, Orphée De Clercq, Lidia Rura, Julia Trushkina and Antoine Besnehard

http://www.kuleuven-kortrijk.be/dpc

• User Manual

• Publications
