3.4 Particular results of the Czech ASR (LVCSR)

3.4.3 ASR framework for Casual Czech Recognition

The purpose of this section is to extend the previous baseline results for NCCCz and to describe the design of a more sophisticated ASR system for the Czech casual speech recognition task. The focus was on the contributions of the acoustic and language models as well as on the optimization of the pronunciation lexicon. The AM was trained on a large training set consisting of several Czech speech corpora available at our department. Special attention was also paid to the impact of publicly available corpora suitable for LM creation.

The section starts with a discussion of the state of the art in Czech casual speech recognition and continues with a description of the solutions implemented to improve its accuracy. It is divided into three subsections: robust acoustic modelling, improvements to language modelling, and extensions of pronunciation lexicons. Results of particular experiments are discussed in the context of results obtained for other speaking styles. The section also presents a comparison between the GMM-HMM system and the hybrid DNN-HMM approach.

The recognition of spontaneous speech still represents a very challenging task. The commonly achieved accuracy remains rather low in comparison with the generally high accuracy of standard LVCSR systems. This conclusion is supported by many works for other languages [70, 6, 95, 97, 116, 20, 129]. Spontaneous and colloquial speech recognition face similar problems, i.e. strong variability in pronunciation (mainly strong pronunciation reduction), changes in word morphology, free word order in the sentence, sentence breaks, and others [75, 94].

Many authors have presented solutions to the above-mentioned problems and achieved varying results for various languages, speaking styles, and recording conditions. The authors of [13] worked with transcriptions of oral interviews with survivors and witnesses of the Holocaust and reported 39.60% WER for English and 39.40% for Czech. However, when the level of speech spontaneity is higher, as is typical for a very informal speaking style, the accuracy of speech recognition falls. The authors of [70] worked with recordings of telephone conversations and reported 48% WER for the Czech language. Similarly, the authors of [97] presented results of around 31-56% WER for a very informal speech recognition task. These results were also confirmed by our evaluation of casual speech recognition based on data from NCCCz, described previously. The recognition accuracy obtained with a standard LVCSR setup decreased significantly, see Table 3.11 and Table 3.12. The possible improvement of these results is discussed in the following parts of this section.

Impact of front-end processing & Acoustic modelling

The front-end processing and AM training for NCCCz followed the setup previously described in 3.4.1. This was possible mainly because the conversations in NCCCz were recorded in a quiet environment similar to that of the headset recordings from the quiet SPEECON environment. Other speech corpora with acoustic conditions similar to NCCCz were also included in order to create a larger and more generic training set, which was especially important for the DNN-HMM system. To summarize, the office subpart of SPEECON (SPEECON CS0 OFFICE), the clean subpart of the car database (CZKCCC headset), and the training part of NCCCz (NCCCz train) were used as the AM training set in all further experiments.

Impact of language models for casual Czech

The standard n-gram-based statistical LMs described in section 3.4.1 were used for the NCCCz corpus. The significant problem to be solved for NCCCz was the choice of suitable resources that would appropriately cover casual speech. The suitability of five general LMs built from three publicly available resources, namely CNC, WEB1T, and the ORAL family (ORAL 2006, ORAL 2008, and ORAL 2013), was analyzed. The CNC and WEB1T corpora contain text that is rather general in nature; the LMs built from them used vocabularies of various sizes up to 340k word forms and should cover general Czech. The corpora of the ORAL family contain spontaneous conversations, and it was thus expected that the resulting LMs would be a better fit for the NCCCz domain. The number of word forms obtained was 162k for the ORAL corpora and 29k for NCCCz. These differences amounted to approximately 73k additional words from ORAL and 9k words from NCCCz. Finally, in order to cover the maximum vocabulary for our task, we also created LMs from NCCCz. The first LM was trained on a defined training part of NCCCz containing the transcriptions of 60% of the utterances from each recorded session, which were not used in the later evaluations. It represented a slightly more realistic scenario, as the content of the recognized utterances had not been seen before. The second LM was created for comparison purposes as an optimal LM for casual speech, since it was built from all available NCCCz transcriptions.

Impact of pronunciation variation modelling

The modelling of pronunciation variation in casual speech (mainly pronunciation reductions) was the last point of interest. The applied rules were partly known from other works, e.g. [94] or [127], and partly obtained from the results of a psycholinguistic study of pronunciation reduction in NCCCz [69]. In the end, we used approximately 6700 additional pronunciation variants. Illustrative examples of several rules are:

• “v[sSzZ] → [sSzZ]”, e.g. “vždyť, vstát” (“but, to stand up”),

• “[td]J → [cJ\J]”, e.g. “letní” (“adj. summer”),

• “cons_1-t-cons_2 → cons_1-cons_2”, e.g. “jestli” (“if”),

• “js → s”, e.g. “jsem” (“I am”),

• “j[eai] → [eai]”, e.g. “jestli, jinam” (“if, elsewhere”),

• “zj → z”, e.g. “zjistíš” (“you will find”),

• “t-S → t_S”, e.g. “většina” (“majority”),

• “nsk → nt_sk”, e.g. “čínský” (“Chinese”),

• “vZd → vd”, e.g. “vždycky” (“always”).
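The way such rewrite rules expand a canonical pronunciation into reduced variants can be sketched as follows. This is only a minimal illustration, not the tool used in this work: it assumes pronunciations are written as strings of single-character SAMPA-like symbols, the consonant inventory `CONS` and the function name `reduced_variants` are hypothetical, and only a subset of the rules listed above is included.

```python
import re

# Assumed consonant inventory in single-character SAMPA-like symbols (illustrative only).
CONS = "bcdfghjklmnprsStvzZ"

# (pattern, replacement) pairs mirroring a few of the rules listed above.
RULES = [
    (r"v([sSzZ])", r"\1"),                 # v[sSzZ] -> [sSzZ]
    (rf"([{CONS}])t([{CONS}])", r"\1\2"),  # cons_1-t-cons_2 -> cons_1-cons_2
    (r"js", "s"),                          # js -> s
    (r"j([eai])", r"\1"),                  # j[eai] -> [eai]
    (r"zj", "z"),                          # zj -> z
    (r"vZd", "vd"),                        # vZd -> vd
]


def reduced_variants(canonical: str) -> set[str]:
    """Return every form reachable by applying the rules in any order."""
    seen = {canonical}
    queue = [canonical]
    while queue:
        form = queue.pop()
        for pattern, repl in RULES:
            new_form = re.sub(pattern, repl, form)
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    # The canonical form itself is not a reduction variant.
    return seen - {canonical}


print(sorted(reduced_variants("jestli")))
```

Under these assumptions, a single canonical entry can yield several variants because the rules combine; in a real lexicon the rule contexts (e.g. word-initial position) would probably need to be restricted further.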

Results of experiments & discussion

The results achieved for the previously established recognition tasks are evaluated from the following points of view: the optimization of acoustic modelling, the impact of language modelling, and the impact of pronunciation variation. Experiments were performed on utterances from the following Czech databases: SPEECON, CtuTest, CzLecDSP, and NCCCz, which cover different levels of spontaneity, i.e.

• T1 - read speech recognition

a) read sentences, phonetically rich (SPEECON database),

b) read journal sentences, phonetically unbalanced (CtuTest database),

• T2 - spontaneous speech recognition

recordings of technical lectures (CzLecDSP database),

• T3 - casual speech recognition

recordings of highly informal conversations (NCCCz database).

The principal results of these experiments are those for spontaneous speech data from CzLecDSP and NCCCz (test sets T2 and T3). Experiments on the test subsets from SPEECON and CtuTest (test sets T1a and T1b), which contain read speech, were carried out for comparison purposes, to analyze the overall recognizer setup in a more standard task.

I. The impact of AM type

The first results describe the quality of the AM used, starting from a basic GMM-HMM approach and ending with the best AM based on a DNN-HMM architecture. A general 340k-word bigram LM based on CNC was used for all of these experiments. The results in Table 3.14 demonstrate that our DNN-HMM LVCSR system achieved accuracy comparable to current state-of-the-art systems, i.e. a WER of 15.2% for standard read speech. For spontaneous speech we obtained a WER of 37.4% for the task of lecture transcription (i.e. with a slightly more formal speaking style) and 72.0% for very informal (casual) speech from NCCCz.

tasks   tri2   tri3   SGMM   bMMI   DNN
T1a     29.8   23.4   22.2   21.8   21.1
T1b     24.0   17.0   15.9   15.3   15.2
T2      49.9   41.3   39.9   38.0   37.4
T3      82.5   76.1   74.9   74.2   72.0

Table 3.14: WERs of LVCSR in the phase of AM optimization
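For reference, the WER values reported in these tables follow the usual definition based on the word-level edit distance between the reference transcription and the recognizer output. The sketch below is a generic illustration of that definition, not the scoring tool used in the experiments; the function name `word_error_rate` is an assumption.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER in percent: (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(reference)


print(word_error_rate("a b c d".split(), "a x c".split()))
```

Because insertions are counted as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.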

II. The impact of LM

The results in Table 3.15 present the analysis of various LMs. The first part summarizes the WERs achieved for all speaking styles using the general CNC- and WEB1T-based LMs, where the strong degradation for casual speech is clearly visible. The second part of Table 3.15 presents the results for the T3 task (casual speech) using LMs trained on ORAL and NCCCz (i.e. on transcriptions of recorded casual speech). The reduction of the out-of-vocabulary (OOV) rate as well as of the perplexity (PPL) confirmed an improved match for casual speech and resulted in WERs of around 60-70%. The results also showed that trigram-based LMs brought only a very small improvement in WER, while the complexity of the HCLG graph increased significantly. For this reason, bigram LMs were used in further experiments. The last line of Table 3.15 represents a rather exceptional case in which the LM NCCCzAll was created from all available transcriptions in NCCCz (i.e. including the test set). As expected, this model had an OOV rate of 0% and a very low PPL. This result is presented purely as a limit case to demonstrate the theoretical limits of the modelling approaches used.
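The OOV rate discussed above can be illustrated with a short sketch. It assumes the rate is computed over running words (tokens) of the test transcriptions, which is one common convention and is not stated explicitly here; the function name `oov_rate` is hypothetical.

```python
def oov_rate(test_words: list[str], vocabulary: set[str]) -> float:
    """Percentage of running test words that are missing from the LM vocabulary."""
    if not test_words:
        raise ValueError("test_words must not be empty")
    missing = sum(1 for word in test_words if word not in vocabulary)
    return 100.0 * missing / len(test_words)


print(oov_rate(["jo", "to", "je", "fakt"], {"to", "je", "jo"}))
```

An LM whose vocabulary is built from transcriptions that include the test set, as in the NCCCzAll case, gives a rate of 0 by construction.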

Tasks   LM         OOV [%]   PPL    2-gram   3-gram
T1a     CNC        1.6       3572   21.1     21.8
T1b     CNC        1.8       2034   15.2     14.7
T2      CNC        4.8       2937   37.4     37.2
T3      CNC        4.6       2065   72.0     72.2
        WEB1T      4.5       4427   68.9     -
T3      ORAL06     6.5       389    67.1     66.4
        ORAL08     6.7       445    66.8     66.3
        ORAL13     4.7       475    66.1     65.4
        ORALall    4.0       426    63.6     62.5
        NCCCz60    7.2       248    61.4     61.2
        NCCCzAll   0         69     41.3     28.4

Table 3.15: WERs [%] of LVCSR with various 2-gram and 3-gram LMs on particular tasks.

The next experiments focused on minimizing OOV and WER in the T3 task by merging various bigram LMs. The results for merged LMs with uniform interpolation weights are summarized in Table 3.16. The use of merged LMs reduced the OOV rate significantly, but the WER decreased only marginally because the interpolation weights (λ) were not optimal. Therefore, we also optimized the value of λ for the particular LMs. The best result was obtained with the following weights λ: 0.2 for the ORAL LM, 0.15 for CNC, 0.15 for WEB1T, and 0.5 for NCCCz. The corresponding WER reached about 59.7%.

bigram LMs                    OOV [%]   WER [%]
CNC+WEB1T                     4.3       69.8
CNC+WEB1T+ORALall             2.8       64.7
CNC+WEB1T+ORALall+NCCCz60     1.5       61.2

Table 3.16: DNN-HMM casual speech recognition (T3) with merged bigram LMs.

The final investigation focused on merging various LMs with the NCCCz-based LM. The contributions of various interpolation weights λ to the final WER are summarized in Table 3.17. The best results were achieved with λ = 0.75.

                              WER [%] for NCCCz weight λ
LMs               OOV [%]     0.0    0.25   0.50   0.75   1
CNK340+NCCCz60    2.2         72.0   62.8   60.8   59.4   61.4
ORALall+NCCCz60   2.5         63.6   60.9   59.8   58.9   61.4
WEB1T+NCCCz60     2.1         68.9   62.3   60.6   60.0   61.4

Table 3.17: DNN-HMM with various weights of NCCCz in merged LMs on T3 task.

In the end, the combination of all LMs improved the OOV rate, but the decrease in WER was smaller. The results showed that the general LMs (CNC and WEB1T) did not contain the information needed to describe the casual speech in NCCCz. However, the LMs created from the ORAL corpora modelled casual speech very similarly to an LM created directly from NCCCz, with the exception of the NCCCzAll language model, which also used the test data.
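The merging of LMs with interpolation weights λ described above corresponds to linear interpolation of the model probabilities. The following sketch shows the idea on toy unigram distributions only; it is an assumption-laden illustration (real n-gram toolkits interpolate conditional n-gram probabilities and handle back-off), and the function name `interpolate` and the toy probabilities are hypothetical.

```python
def interpolate(models: list[dict[str, float]], weights: list[float]) -> dict[str, float]:
    """Linear interpolation of word distributions: P(w) = sum_i lambda_i * P_i(w)."""
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    # The merged vocabulary is the union of the component vocabularies,
    # which is why merging lowers the OOV rate.
    vocab = set().union(*models)
    return {word: sum(lam * model.get(word, 0.0)
                      for lam, model in zip(weights, models))
            for word in vocab}


# Toy example: a general LM merged with a casual-speech LM, lambda = 0.75 for the latter.
general = {"je": 0.6, "to": 0.4}
casual = {"je": 0.2, "to": 0.3, "jo": 0.5}
mixed = interpolate([general, casual], [0.25, 0.75])
print(mixed)
```

The four-way weights quoted above (0.2, 0.15, 0.15, and 0.5) sum to one, so they would pass the same consistency check.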

III. The impact of pronunciation reduction

The final results presented in this chapter describe the WER achieved for three approaches to pronunciation modelling. First, automatically generated pronunciations were used for all words in the analyzed LMs (this is always the fallback when a word is not present in the available dictionary). Second, an approved canonical pronunciation of all words from NCCCz was created manually by two independent experts. Third, a dictionary with additional pronunciation variants containing phone reductions, generated using the rules described above, was used. All results are summarized in Table 3.18. In line with preliminary assumptions, the recognition accuracy improved, but only by about 1.4% absolute (from 59.8% to 58.4% WER).

LM                            Lexicon              WER [%]
0.25 ORALall + 0.75 NCCCz60   automatic            59.8
                              canonic checked      58.9
                              reduction variants   58.4

Table 3.18: Impact of pronunciation variation in DNN-HMM system

Conclusions

This section described the optimization of DNN-HMM and GMM-HMM based LVCSR for Czech casual speech recognition and its performance on data from the Nijmegen Corpus of Casual Czech (NCCCz). The results confirmed that it is possible to use these systems for casual speech recognition, but the results are significantly worse than those for more formal speech. It was also shown that the publicly available ORAL corpora, which contain transcriptions of spontaneous conversations, and corpora of formal Czech can be used for the creation of basic LMs for the task of casual speech recognition.

Chapter 4

Estimation of AF for Czech and other languages

This chapter summarizes the research on the estimation of AF from an acoustic speech signal. The term AF is introduced and the definition of AF classes for Czech, English, and several other languages is discussed. Further, widely used approaches to AF estimation are summarized. The chapter closes with a description of the analyses of AF estimation performed for particular languages as well as acoustic conditions.

4.1 Articulatory features for analyzed languages

The term articulatory features (AF) generally denotes a set of features that describe how human speech is generated. Articulatory information can be obtained by direct measurement of the motion of particular articulators (e.g. lips, tongue, jaw) or by various statistical methods that estimate this information from the acoustic speech signal. Since many approaches to obtaining articulatory information have been suggested, there are various ways to represent AF.

With regard to the statistical methods, the representations of AF are usually based on articulatory phonetics or on different theories of phonology [60], [78]. The following three representations of AF are the most important ones:

• multi-valued features which are based on articulatory phonetic categories,

• phonological distinctive features proposed by Chomsky and Halle,

• articulatory gestures used in articulatory phonology and proposed by Browman and Goldstein.

AF based on multi-valued features or articulatory gestures are widely applied in the speech applications mentioned previously. They are commonly used for observation modelling in ASR, robust speech recognition, and nowadays also in the important areas of multilingual/cross-lingual ASR and low-resource speech recognition. In contrast, AF based on articulatory gestures are typically used in the task of pronunciation modelling. These different representations of AF were analyzed separately in several experiments during the Johns Hopkins University (JHU) Summer Workshop [78], and Hasegawa-Johnson [45] also proposed how these AF sets could be combined in an ASR system.

AF class       Cardinality   Feature values (English)
place          10            alveolar, dental, labial, postalveolar, rhotic, velar, labiodental, lateral, none
degree         6             approximant, closure, flap, fricative, vowel
nasality       3             front, central, back
rounding       3             stop, nasals, affricates, fricatives
glottal sta.   4             aspirated, voiceless, voiced
vowel          23            aa, ae, ah, ao, aw1, aw2, ax, ay1, ay2, eh, er, ey1, ey2, ih, iy, ow1, ow2, oy1, oy2, uh, uw, nil
height         8             high, low, mid, mid-high, midlow, very-high, nil
frontness      7             back, front, mid, mid-back, mid-front, nil
voicing        3             voiced, unvoiced

Table 4.1: AF classes for English

This work analyzes the AF-based TANDEM approach presented at the JHU workshop [33], [14], [72], with a focus on the Czech language. Therefore, it uses AF based on multi-valued features (further referred to only as AF) for observation modelling, with the aim of improving general ASR accuracy, phone recognition, and the precision of phonetic segmentation for the analysis of Czech spontaneous speech.

AF with a multi-valued feature representation of speech production knowledge were used for observation modelling with the purpose of making these features acoustically distinguishable, which is discussed further in the following sections.
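To make the multi-valued representation more concrete, the sketch below encodes a phone as a concatenation of one-hot vectors, one per AF class. It is purely illustrative: only three AF classes with the value lists from Table 4.1 are used, and the phone-to-AF mapping `PHONE_TO_AF` and the function `encode_phone` are hypothetical examples based on general phonetic knowledge, not the mapping used in this work.

```python
# Subset of AF classes with the value lists taken from Table 4.1.
AF_CLASSES = {
    "place": ["alveolar", "dental", "labial", "postalveolar", "rhotic",
              "velar", "labiodental", "lateral", "none"],
    "degree": ["approximant", "closure", "flap", "fricative", "vowel"],
    "voicing": ["voiced", "unvoiced"],
}

# Hypothetical mapping of a few phones to AF values (illustration only).
PHONE_TO_AF = {
    "p": {"place": "labial", "degree": "closure", "voicing": "unvoiced"},
    "z": {"place": "alveolar", "degree": "fricative", "voicing": "voiced"},
    "a": {"place": "none", "degree": "vowel", "voicing": "voiced"},
}


def encode_phone(phone: str) -> list[int]:
    """Concatenate one one-hot vector per AF class for the given phone."""
    features = PHONE_TO_AF[phone]
    vector: list[int] = []
    for af_class, values in AF_CLASSES.items():
        one_hot = [0] * len(values)
        one_hot[values.index(features[af_class])] = 1
        vector.extend(one_hot)
    return vector


print(encode_phone("p"))
```

In a TANDEM-style setup, targets of this kind would typically be derived from phone-level alignments and used to train one classifier per AF class; that training step is outside the scope of this sketch.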