External Evaluations - Sub-word based language modeling of morphologically rich languages for L

3. Reduction in deletion/substitution rate of INV words

Here, it is worth noting that any negative value of reduction means an increase in the underlying rate.

Table 3.14 gives some examples of words for which recognition is improved by reducing the number of deletions or substitutions using the best sub-word based systems.

Table 3.13. Analysis of improvements in the best sub-word based system compared to the best full-word based system for Arabic, German, and Polish corpora. Amount of reduction in WER is divided into (ins: reduction in insertion rate; OOV del/sub: reduction in deletion/substitution rate of OOV words; INV del/sub: reduction in deletion/substitution rate of INV words). Note: a negative reduction means an increase.

                                absolute reduction in
                        OOV     WER     ins     OOV del/sub   INV del/sub
language   corpus       [%]     [%]     [%]     [%]           [%]
Arabic     ar-dev07     0.0     0.5      0.3    0.0           0.2
           ar-eval07    0.0     0.2      0.2    0.0           0.0
German     gr-dev09     1.4     0.3      0.2    0.1           0.0
           gr-eval09    1.4     0.4     -0.1    0.3           0.2
Polish     pl-dev10     0.4     0.1     -0.2    0.2           0.1
           pl-eval10    0.6     0.1     -0.1    0.1           0.1
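The decomposition in Table 3.13 can be checked arithmetically: for each corpus, the absolute WER reduction should equal the sum of the three component reductions (ins, OOV del/sub, INV del/sub). The following Python sketch is only an illustrative consistency check on values copied from the table; the names `rows` and `components_sum` are hypothetical and not part of the original work.

```python
# Values copied from Table 3.13 (absolute reductions in percent).
# Tuple layout: (WER, ins, OOV del/sub, INV del/sub).
rows = {
    "ar-dev07": (0.5, 0.3, 0.0, 0.2),
    "ar-eval07": (0.2, 0.2, 0.0, 0.0),
    "gr-dev09": (0.3, 0.2, 0.1, 0.0),
    "gr-eval09": (0.4, -0.1, 0.3, 0.2),
    "pl-dev10": (0.1, -0.2, 0.2, 0.1),
    "pl-eval10": (0.1, -0.1, 0.1, 0.1),
}


def components_sum(ins, oov, inv):
    """Sum the three component reductions, rounded to table precision."""
    return round(ins + oov + inv, 1)


for corpus, (wer, ins, oov, inv) in rows.items():
    total = components_sum(ins, oov, inv)
    status = "consistent" if total == wer else "mismatch"
    print(f"{corpus}: WER {wer:+.1f} vs. components {total:+.1f} ({status})")
```

A negative component, such as the insertion term for gr-eval09, is an increase in that rate which the other two components more than offset.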

Table 3.14. Examples of words for which recognition is improved using the best sub-word based systems.

Arabic      German                    Polish
Aljmyl      eigentlich                dwudziestotrzylatkiem
Almst$Ar    heimtrainer               prezydentura
ystlhmh     dreiecksungleichung       wypracowana
yktnfh      rückführungsrichtlinie    terminowe
yHddhA      justizkomitee             zapunktować

3.6 External Evaluations

The approaches discussed in this chapter have been employed in the RWTH evaluation systems used in many evaluation campaigns from 2010 to 2013. In those evaluations, RWTH ranked first or second among the participants. This section summarizes the recognition results achieved in those evaluations. Table 3.15 lists the participating sites.

Chapter 3 Sub-Word Based Language Models

Table 3.15. List of participants in different evaluation campaigns.

RWTH        Rheinisch-Westfälische Technische Hochschule Aachen, Germany
KIT         Karlsruhe Institute of Technology, Germany
CITLAB      Computational Intelligence Technology Laboratory, University of Rostock, Germany
LIMSI       Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, France
VR          Vocapia Research, France
A2IA        Artificial Intelligence and Image Analysis, France
LITIS       Laboratory for Computer Science, Information Technology and Systems, University of Rouen, France
UOB-TELE    Telecom ParisTech research lab, France
UPV         Universitat Politècnica de València, Spain
FBK         Fondazione Bruno Kessler, Italy
UEDIN       University of Edinburgh, UK

3.6.1 Quaero German ASR Evaluation 2010

Table 3.16 shows the results of the Quaero⁸ evaluation on German ASR held in 2010. Two types of evaluation data have been used, broadcast news (BN) and broadcast conversations, in a roughly 50/50 ratio. The RWTH system uses morpheme-based LMs. RWTH achieved the first position among three participants. A detailed description of the system is given in [Sundermeyer & Nußbaum-Thom⁺ 2011].

⁸ http://www.quaero.org

Table 3.16. Quaero German ASR evaluation 2010.

participant   WER [%]
RWTH          16.94
KIT           24.14
LIMSI + VR    21.05

3.6.2 Quaero German ASR Evaluation 2011

Table 3.17 shows the results of the Quaero evaluation on German ASR held in 2011. As in 2010, broadcast news (BN) and broadcast conversations data have been used in a roughly 50/50 ratio. The RWTH system uses morpheme-based LMs. RWTH achieved the second position among four participants. The WER of the RWTH system is only 0.09% (absolute) higher than the best WER, achieved by KIT in that year.

Table 3.17. Quaero German ASR evaluation 2011.

participant   WER [%]
RWTH          17.49
KIT           17.40
LIMSI + VR    18.04
VR            20.17
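The ranking and the absolute gap to the best system stated for the 2011 evaluation follow directly from the WERs in Table 3.17. The short Python sketch below reproduces that arithmetic; the names `wer_2011` and `rank_by_wer` are hypothetical helpers chosen for illustration and not part of the evaluation tooling.

```python
# WERs in percent, copied from Table 3.17.
wer_2011 = {"RWTH": 17.49, "KIT": 17.40, "LIMSI + VR": 18.04, "VR": 20.17}


def rank_by_wer(results):
    """Return (participant, WER) pairs sorted from best (lowest) to worst."""
    return sorted(results.items(), key=lambda item: item[1])


ranking = rank_by_wer(wer_2011)
best_name, best_wer = ranking[0]

# Position of RWTH in the ranking (1 = best).
position = [name for name, _ in ranking].index("RWTH") + 1

# Absolute WER difference to the best system, rounded to table precision.
gap = round(wer_2011["RWTH"] - best_wer, 2)

print(f"best: {best_name}, RWTH position: {position}, absolute gap: {gap:.2f}%")
```

The gap is an absolute difference in percentage points, not a relative one, which matches how the difference is reported in the text.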

3.6.3 Quaero German and Polish ASR Evaluation 2012

Tables 3.18 and 3.19 show the results of the Quaero evaluation on German and Polish ASR held in 2012. The evaluation data was a mix of broadcast news and broadcast conversations/podcasts, with an emphasis on the conversations. The RWTH German system uses morpheme-based LMs, whereas the Polish system uses syllable-based LMs. RWTH achieved the first position for German and the second position for Polish, each among three participants.

Table 3.18. Quaero German ASR evaluation 2012.

participant   WER [%]
RWTH          18.71
KIT           19.38
LIMSI + VR    19.63


Table 3.19. Quaero Polish ASR evaluation 2012.

participant   WER [%]
RWTH          13.57
LIMSI + VR    12.67
VR            14.79

3.6.4 Quaero German ASR Evaluation 2013

Table 3.20 shows the results of the Quaero evaluation on German ASR held in 2013. Two different evaluation tracks have been considered. The first is based on lecture speech data, whereas the second is based on a mix of broadcast news and broadcast conversations/podcasts data. The RWTH systems use morpheme-based LMs. RWTH achieved the first position among two participants in each domain.

Table 3.20. Quaero German ASR evaluation 2013.

              lecture data   BN + BC data
participant   WER [%]        WER [%]
RWTH          25.23          14.38
VR            31.86          -
KIT           -              15.95

3.6.5 IWSLT German ASR Evaluation 2013

Table 3.21 shows the results of the IWSLT evaluation on German ASR held in 2013⁹. This is the tenth evaluation campaign organized by the IWSLT workshop. The 2013 evaluation has offered a track on lecture transcription based on the TED Talks¹⁰ corpus. The RWTH system uses morpheme-based LMs. RWTH achieved the first position among four participants. A detailed description of the recognition system is given in [Shaik & Tüske⁺ 2013].

Table 3.21. IWSLT German ASR evaluation 2013.

participant   WER [%]
RWTH          16.94
KIT           24.14
LIMSI + VR    21.05

3.6.6 OpenHaRT Arabic Handwriting Recognition Evaluation 2013

Table 3.22 shows the results of the 2013 NIST open handwriting recognition and translation evaluation (OpenHaRT 2013¹¹) performed on Arabic handwritten text. The RWTH system uses morpheme-based Arabic LMs. The Arabic word decomposition is performed using the MADA toolkit [Habash & Rambow 2005, 2007]. Two different evaluation tasks have been offered. The first is a constrained task, where only the official OpenHaRT data have been used. The second is an unconstrained task, where all the available data have been used. RWTH achieved the second position among six participants in the constrained task and the first position among two participants in the unconstrained task. A detailed description of the recognition system is given in [Hamdani & Doetsch⁺ 2014].

⁹ http://www.iwslt2013.org/59.php

¹⁰ A collection of public speeches covering many different topics (http://www.ted.com).

¹¹ http://www.nist.gov/itl/iad/mig/hart2013.cfm

Table 3.22. OpenHaRT Arabic handwriting recognition evaluation 2013.

              constrained task   unconstrained task
participant   WER [%]            WER [%]
RWTH          23.91              16.15
A2IA          20.32              18.50
CITLAB        26.81              -
LITIS         78.40              -
UOB-TELE      48.66              -
UPV           30.01              -
