Word duration modeling for slot fillers - Speech decoding on multiple devices

4.7 Speech decoding on multiple devices

4.7.4 Word duration modeling for slot fillers

The duration of filler models may have an impact on the recognition performance. In this thesis a word penalty is used, but in general more advanced technique are imagin- able. This is motivated by Jennequin et al.[205] where a duration model is used during lattice re-scoring. A similar approach was also proposed by Dino et al. for large vocab- ulary speech recognition[206]. Kao et al. proposes a discriminative duration models for speech recognition with segmental conditional random fields[207].

The filler models in this thesis uses a transition penalty similar to a word start penalty in the language transducer. However, not only the filler model can use a duration model also the introduced domain specific content can use dedicated recognizers. A word duration model, e.g. for large address books, might improve speech recognition. In Figure 4.16 is a word duration statistic presented. It was derived from a recognition on the Wall Street Journal corpus[39]. The measure indicates a Poisson distribution of the word length. However, most words have a duration between 400ms and 700ms. This is often approximated with an uniform distribution.

This section introduces a word duration modeling for speech recognition. It was observed that the word duration differs significantly. This may have an impact on the recognition accuracy. A word transition penalty is therefor introduced and transferred to the acoustic filler models. In addition, the word transition penalty was also proposed to improve the recognition of domain specific content.

4.8 Summary

This chapter introduced a novel method to model dynamic content for language and speech processing. Dynamic content refers to word sequences which are sparsely rep- resented in common training data. This could be user-dependent word sequences or words which are changing frequently. Examples are address book entries, medi- cal records or stock prices. It is also possible to model word sequences with long range dependencies using a dynamic language model. In addition, a technique is described that enables the use of dynamic language models in low resource speech recognition. This could be speech recognition on mobile phones, car head-units but it could also be speech recognition on a highly scalable cloud infrastructure. The technique is based on weighted finite state transducers and uses a transducer nesting technique.

Principles of transducer based speech recognition are introduced in Section 4.1 The transducer for automatic speech recognition is composed of several automatons and transducers. A lexicon transducer defines the relation between grapheme and phoneme sequences given words. It is described in Section 4.2 as a tree-like repre- sentation for fast look-up of phonetic transcriptions. The language automaton repre- sents the knowledge of word sequences, e.g. to determine whether a word sequence is probable, or not. The transducer is derived form statistical n -gram language models or grammars. Section 4.3 describes the developing process in detail.

The embedding of grammars in statistical models to model dynamic content is described in Section 4.4. Speech decoding with transducers is introduced in Section 4.5 together with an algorithm for efficient decoding. It is based on a token passing time synchronous Viterbi beam decoder which is implemented for low resource speech recognition, e.g. on embedded devices. A tree structured word history is proposed with a continuous garbage collector. Section 4.6 extends the decoding schema by a transducer nesting technique. This enables speech recognition using several transducers where the nesting happens during run-time as proposed by Georges et al. [9]. A transducer can be compiled on the device and nested into the deployed transducer which models carrier phrases. The recognition with nested transducers allows speech decoding on multiple devices as proposed by Georges et al.[7], [6], [8]. For this, acous- tical filler models are used which enable to evaluate some portion of the utterance on a different device. This technique is described in Section 4.7.

5

Applications

There are many applications using speech recognizers. Today, short message dictation and voice search is available on nearly all smart phones. All high class navigation systems are offering voice destination entry. The methods and techniques described in this thesis are contributing to the developing of humanoid voice interfaces. In fact, these allow a voice command to be queried in a natural language. This chapter gives an overview of evaluation metrics and evaluates the transducer nesting technique as well as the technique for recognizing speech on multiple devices.

The chapter is organized as follows; Section 5.1 introduces evaluation methods for language processing. The perplexity measure is described for language model evaluation. Speech recognition accuracy is measured by the word and sentence error rate. In addition, some evaluation metrics for information retrieval tasks are defined.

Section 5.2 describes an application which embeds grammars in n -gram language models. For this the transducer nesting technique was used as described in Section 4.6. The evaluation was proposed by Georges et. al[9] for the use on embedded devices.

The decoding on multiple devices is evaluated in Section 5.3.5. This was proposed by Georges et al. [10]. It is an application to keep personal data on the client. These data is also recognized on the client whereas the rest of the utterance is recognized on the cloud. Finally, the chapter is summarized.

5.1 Evaluation of speech and language techniques

There are two aspects in evaluating speech and language techniques. First, it should measure how good a technique suites the assumption. Second, it should assess the human experience using the technique. Both is not always achieved by one measure. This section outlines evaluation methods used for developing speech applications.

Evaluating a language mode covers two questions; First, how well does a language model suits the assumption, e.g. evaluate the model on training data. Second, how good generalizes the model, e.g. test the model on test data. A language model is evaluated by computing its perplexity. This is described in Section 5.1.1. Whereas the language model performance is not visible to the user, the accuracy of the speech recognition is perceptible by the user.

The evaluation of a speech recognizer should measure an user’s experience. The word or sentence error rate is a good measure for many speech applications, such as short message dictation. It is the ratio of wrongly recognized words or sentences overall words or sentences, respectively. The measure is described in Section 5.1.2. Evaluation methods known from the information retrieval research are used, too. It is used to evaluate applications such as voice destination entry or command and control tasks. In addition, note that a web search engine on a tablet computer may cause a different user experience than similar engine on a car head-unit. A set of standard metrics is described in Section 5.1.3.

In document Dynamic language models for hybrid speech recognition (Page 109-112)