Automatic Pronunciation Variants Extraction

6.8 Pronunciation Variants Extraction

6.8.2 Automatic Pronunciation Variants Extraction

We also performed recognition studies by automatically selecting new pronunciation variants. The automatic selection of new pronunciation variants was done in the following manner:

1. For each utterance of each word in the H-set, run the baseform pronunciation evaluation procedure. Sort the resulting comb scores (6.7) in ascending order and select the pronunciation variant with the lowest comb score and LD > 0. If the baseform pronunciation model is stable, it is quite possible that LD = 0 always, in which case no pronunciation variants are selected.

2. For each word, rank these selected pronunciation variants in ascending order and select the top two. This results in at most two possible pronunciation variants for each word.

3. The pronunciation variants selected in Step 2 are added to the lexicon if avgp is greater than 0.5. If avgp lies between 0.45 and 0.5, the pronunciation variant is selected only if the frame average posterior probability of the best path (obtained by summing the posterior probabilities of the states in the best path and dividing the sum by the number of frames) is greater than 0.5. Otherwise, the pronunciation variants are rejected. This way, only pronunciation variants that are reliable and close to the baseform pronunciation are selected.
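The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the experiments: the `Candidate` container, its field names, and the function names are hypothetical, the candidates are assumed to come from the baseform pronunciation evaluation procedure with their comb score, LD and avgp already computed, Step 2 is assumed to rank by comb score, and the boundaries of the 0.45 to 0.5 band are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """One inferred pronunciation for one utterance (hypothetical container)."""
    phones: Tuple[str, ...]             # inferred phone sequence
    comb: float                         # comb score of (6.7); lower ranks first
    ld: int                             # Levenshtein distance to the baseform
    avgp: float                         # avgp confidence measure
    path_posteriors: Tuple[float, ...]  # per-frame posterior of the best-path state


def frame_average_posterior(c: Candidate) -> float:
    """Sum of best-path state posteriors divided by the number of frames."""
    return sum(c.path_posteriors) / len(c.path_posteriors)


def best_variant(cands: List[Candidate]) -> Optional[Candidate]:
    """Step 1: lowest comb score among candidates with LD > 0, if any."""
    variants = [c for c in cands if c.ld > 0]
    return min(variants, key=lambda c: c.comb) if variants else None


def is_reliable(c: Candidate) -> bool:
    """Step 3: confidence thresholds (band boundaries assumed inclusive)."""
    if c.avgp > 0.5:
        return True
    if 0.45 <= c.avgp <= 0.5:
        return frame_average_posterior(c) > 0.5
    return False


def select_variants(
    utterances: Dict[str, List[List[Candidate]]]
) -> Dict[str, List[Candidate]]:
    """Map each word to the variants that would be added to the lexicon."""
    additions = {}
    for word, per_utterance in utterances.items():
        picked = [v for v in map(best_variant, per_utterance) if v is not None]
        top_two = sorted(picked, key=lambda c: c.comb)[:2]  # Step 2 (assumed key)
        additions[word] = [c for c in top_two if is_reliable(c)]
    return additions
```

The sketch does not merge identical phone sequences selected from different utterances of the same word; whether and how such duplicates are collapsed is not specified in the procedure above.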

The statistics of the test lexicon after adding the automatically extracted pronunciation variants are given in Table 6.4. Compared to manual selection, there are more pronunciation variants. This is mainly due to the condition LD > 0, as opposed to LD > 1 in the case of manual selection.

# of resulting pronunciation models    Number of words
1                                      183
2                                      292
3                                      127

Table 6.4. Statistics of test lexicon: The pronunciation selection was done automatically. The first column mentions the number of pronunciations and the second column gives the number of words with that number of pronunciations.
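As a quick consistency check on Table 6.4, the short Python sketch below totals the table. The counts are copied from the table; the variable names are illustrative only.

```python
# Counts copied from Table 6.4: {pronunciations per word: number of words}
table_6_4 = {1: 183, 2: 292, 3: 127}

total_words = sum(table_6_4.values())
total_pronunciations = sum(k * n for k, n in table_6_4.items())
mean_per_word = total_pronunciations / total_words

print(total_words, total_pronunciations, round(mean_per_word, 2))
# prints: 602 1148 1.91
```

The word total matches the 602-word lexicon, and 419 of the 602 words (292 + 127) received at least one automatically selected variant.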

The recognition studies were performed on the updated lexicon(s). The results are given in Table 6.5. We observe that the automatically selected pronunciation variants lead to improvements similar to those obtained with manually selected pronunciation variants.

Systems          Aux. feature    Performance (75 words)    Performance (602 words)
system-base                      3.0†                      10.3†
system-app-p     O               1.8†                      6.4†
system-cond-p    O               2.7†                      8.9†
                 H               3.3                       10.1†
system-app-e     O               4.4†                      12.0†
system-cond-e    O               2.4†                      7.6
                 H               3.0                       9.3†

Table 6.5. Recognition studies performed on 8 different sets of 75 words lexicon and one set of 602 words lexicon with multiple pronunciations. The pronunciation variants selection was done automatically. Performance is measured in terms of WER (expressed in %). Notations: O: Auxiliary feature observed, H: Auxiliary feature hidden. †: Improvement in the performance is significant compared to the results in Table 6.1 (with 95% confidence or above).

6.9 Summary and Conclusion

In this chapter, we proposed an approach based on HMM inference to evaluate the adequacy of pronunciation models by:

1. Relaxing the lexical constraints of the baseform pronunciation model.

2. Inferring new pronunciation variants for each relaxation.

3. Measuring the “stability” of the pronunciation model through a combination of acoustic confidence level measure and Levenshtein distance.
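For reference, the Levenshtein distance used in the stability measure is the minimum number of insertions, deletions and substitutions needed to turn one symbol sequence into another. A minimal Python sketch of the standard dynamic-programming computation is given below; the function name and the unit cost per edit operation are assumptions made for illustration, and the phone labels in the usage note are invented.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `levenshtein(("z", "ih", "r", "ow"), ("z", "iy", "r", "ow"))` evaluates to 1 (one substitution), and a variant identical to the baseform gives 0, which corresponds to the LD = 0 case of a stable baseform.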

The proposed approach was used to:

• Compare the quality of different acoustic models, namely, acoustic models trained with only standard features and acoustic models trained with both standard features and auxiliary features.

• Extract new pronunciation variants that are reliable (high confidence level) and are “close” enough to the baseform pronunciation (low Levenshtein distance).

Experimental studies conducted on an isolated word recognition task show that:

• Integrating auxiliary features in standard ASR improves the “stability” of the baseform pronunciation model, i.e., the matching and discriminating properties of the single baseform pronunciation model are improved.

• The ASR performance can be significantly improved by incorporating the selected pronunciation variants.

In this work, we have studied the proposed approach for pronunciation variant selection on an isolated word recognition task, which contains only within-word pronunciation variation. Given the good results achieved on the isolated word recognition task, we believe it would be interesting in future to further study the proposed approach in the context of large vocabulary continuous speech recognition systems, where both within-word and cross-word pronunciation variation are present.

Chapter 7

Using Graphemes as Subword Units in ASR

7.1 Introduction

A grapheme is a written symbol that is used to represent words, e.g., the letters of the alphabet in the English language. In this chapter, we study the use of graphemes as subword units for ASR, particularly for the English language, where there is a weak correspondence between the written form and the spoken form compared to other languages such as Finnish or Spanish.

In Chapter 4, we studied how to model the joint distribution over hidden state space Q, observed feature space X, and some auxiliary source of knowledge A to improve the ASR performance. In this case, the auxiliary knowledge sources were particular acoustic features, such as pitch frequency, short-term energy and rate-of-speech. In the present chapter, we extend this strategy to jointly model phonemes, graphemes and standard features, where graphemes are now treated as the auxiliary source of knowledge. We initially studied this system for context-independent graphemes. The results from this study motivated us to further look into using context-dependent graphemes.

In Section 7.2, we motivate the use of graphemes in state-of-the-art ASR systems. We present an overview of research in this direction and motivate joint phoneme-grapheme based ASR. Section 7.3 presents the modelling process in phoneme-grapheme based ASR, and Section 7.4 presents the studies conducted on the phoneme-grapheme system using context-independent phonemes and graphemes. Section 7.5 presents our studies using context-dependent graphemes. Finally, Section 7.6 summarizes and concludes with our findings.
