Summary - Methods for Addressing Data Diversity in Automatic Speech Recognition

This chapter provided a definition of the domain and used it to formulate the problem of mismatch between training and test conditions in machine learning. Techniques for compensating for this mismatch are studied under various names in different fields. In the speech recognition community they are called adaptation techniques, and speaker adaptation is the most widely studied. However, most speaker adaptation techniques can be generalised to other sources of variation, such as background, device and the more generic notion of domain.

The majority of this chapter was an overview of speaker adaptation techniques for both GMM-HMM and DNN-HMM acoustic models and, where possible, their generalisation to other sources of variability such as the domain. Techniques developed for GMM-HMM models usually cannot be used directly to adapt DNN-HMM models. However, this chapter provided a unified categorisation of techniques for both models: transformation-based approaches, retraining and sub-space methods. The relevant techniques for both acoustic model types were studied, with references for more details. The choice of one method over another is usually task dependent, e.g. if the amount of adaptation data is very limited, then not all approaches are applicable. The use cases of these approaches were also given in the corresponding sections. The remaining part of the chapter was devoted to normalisation techniques, where either the features are transformed to better fit the model or the models are transformed to better match the features. Finally, another family of mismatch compensation techniques called multi-style training was introduced and the relevant studies were briefly reviewed.

All of the approaches discussed in this chapter share the ultimate goal of reducing mismatch and improving performance. The remainder of this thesis focuses on further improving some of the existing techniques by addressing their shortcomings, and on introducing some novel techniques. The next chapter covers mismatch compensation using data selection techniques, followed by a new approach for modelling the latent domains in speech and its applications in named domain identification and acoustic model adaptation.

CHAPTER 3

Data selection and augmentation techniques

3.1 Introduction

For many machine learning problems, and in almost all practical ones, the underlying distributions of the training and test data are different. This causes a mismatch between the training and test conditions, which usually degrades performance (Pan and Yang, 2010). The same problem exists in speech recognition: differences in the underlying distributions from which the training and test data are sampled cause a mismatch between the training and testing conditions, and this increases the WER (Yu and Deng, 2015).

Training acoustic models on utterances that match the target speaker population, speaking style or acoustic environment is generally considered to be the easiest way to optimise ASR performance. However, there are many scenarios in which speech corpora of sufficient size that characterise the sources of variability in a particular target domain are not available. In practical situations, even if matched training data of sufficient size is available and is used to train the ASR models, the test data encountered after deployment will, over time, differ from the initial test data, and this will again cause a mismatch between the training and test data (Yu and Deng, 2015). This motivates the study conducted in this chapter, which explores various techniques for minimising the mismatch between training and test data.

There are several approaches to addressing the mismatch between training and test data, including the adaptation techniques introduced in chapter 2 and the data selection and augmentation techniques introduced in this chapter. The aim of data selection, augmentation and generation techniques is to create training data that is matched to a target test set. The matched training corpus is created by selecting data from an existing pool, augmenting some existing data, or generating new data.

To assess the quality of the selected, augmented or generated training data, distance measures are usually defined and used as a proxy for the WER, such that reducing the distance between the training and test data usually decreases the WER on the target test data. Proxy values are used instead of the actual WER mostly for practical reasons: computing the WER for each candidate subset is impractical because of the time required to train and evaluate the ASR models. The proxy function should therefore be fast and easy to compute.
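As a rough illustration of what a fast proxy might look like, the sketch below fits a diagonal-covariance Gaussian to the acoustic features of each data set and uses the symmetric Kullback-Leibler divergence between the two as the distance. This is an assumption made purely for illustration and is not one of the measures proposed in this chapter; the function names and the synthetic 13-dimensional features are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(features, var_floor=1e-6):
    """Return per-dimension mean and variance of a (frames, dims) array."""
    mean = features.mean(axis=0)
    var = features.var(axis=0) + var_floor  # floor avoids division by zero
    return mean, var

def kl_diag(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0
    )

def proxy_distance(train_feats, test_feats):
    """Symmetric KL divergence between Gaussians fitted to each set."""
    p = fit_diag_gaussian(train_feats)
    q = fit_diag_gaussian(test_feats)
    return kl_diag(*p, *q) + kl_diag(*q, *p)

# Synthetic stand-ins for acoustic features (hypothetical data).
rng = np.random.default_rng(0)
test = rng.normal(0.0, 1.0, size=(2000, 13))
matched = rng.normal(0.0, 1.0, size=(2000, 13))
mismatched = rng.normal(1.5, 2.0, size=(2000, 13))

d_matched = proxy_distance(matched, test)
d_mismatched = proxy_distance(mismatched, test)
```

Under these assumptions the matched set yields a much smaller distance than the mismatched one, and the computation needs only one pass over the features, with no model training involved.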

If the amount of training data is fixed and known beforehand and the task is to select a subset of that data, then the mismatch minimisation problem becomes a data selection problem. Given a target test set, the aim is to select the subset of the training data for which a model trained on it achieves the lowest WER on the target test set, compared to a model trained on any other subset of the available training data.
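A minimal sketch of data selection driven by a proxy distance is given below. It scores every utterance in the pool independently by the variance-normalised Euclidean distance between its mean feature vector and the mean of the test set, and keeps the k closest utterances. Both the scoring function and the per-utterance ranking are simplifying assumptions for illustration: an exact search over all subsets is combinatorial, and the names used here (`select_subset`, the utterance ids) are hypothetical.

```python
import numpy as np

def select_subset(utt_feats, test_feats, k):
    """Return the ids of the k utterances closest to the test set.

    utt_feats maps an utterance id to a (frames, dims) feature array.
    """
    test_mean = test_feats.mean(axis=0)
    test_std = test_feats.std(axis=0) + 1e-8
    scores = {}
    for utt_id, feats in utt_feats.items():
        # Distance between the utterance mean and the test mean,
        # normalised by the per-dimension spread of the test data.
        diff = (feats.mean(axis=0) - test_mean) / test_std
        scores[utt_id] = float(np.sqrt(np.sum(diff ** 2)))
    ranked = sorted(scores, key=scores.get)  # smallest distance first
    return ranked[:k]

# Synthetic pool: three utterances near the test condition, three far from it.
rng = np.random.default_rng(0)
test = rng.normal(0.0, 1.0, size=(1000, 13))
pool = {}
for i in range(3):
    pool[f"near{i}"] = rng.normal(0.0, 1.0, size=(200, 13))
    pool[f"far{i}"] = rng.normal(3.0, 1.0, size=(200, 13))

chosen = select_subset(pool, test, k=3)
```

Ranking utterances one at a time ignores interactions between the selected items, so it only approximates the subset that would minimise the WER.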

If the amount of training data is not fixed and the training data can be augmented, e.g. by generating artificial data or perturbing the existing data, and the task is to generate a training set, then the mismatch minimisation problem becomes a data augmentation problem. The aim of data augmentation techniques is to create a training set that better matches a target test set. Data augmentation is also used in low-resource scenarios, where the amount of training data is usually not enough to train models with reasonable performance, and one way to address this is to augment the existing training data (Ragni et al., 2014). It is further used in multi-style training (MTR) scenarios, where the aim is to have diverse conditions (background noise, speaking style, speaker characteristics, etc.) present in the data so that the model generalises better to different conditions in the test set (Lippmann et al., 1987).
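One common way of perturbing existing data is to add background noise to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below illustrates this under simple assumptions: both signals are one-dimensional NumPy arrays at the same sampling rate, and the noise is tiled or trimmed to the length of the utterance. It is an illustrative example, not necessarily the perturbation used in this thesis, and the function name `add_noise_at_snr` is hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so that the result has the requested SNR in dB."""
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain g so that speech_power / (g^2 * noise_power)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Synthetic stand-ins: a 220 Hz tone as "speech" and white noise (hypothetical data).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
noise = rng.normal(0.0, 1.0, size=4000)
noisy = add_noise_at_snr(speech, noise, snr_db=10.0)
```

Generating several copies of each utterance with different noise types and SNR values is one way to obtain the diverse conditions that MTR aims for.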

The main research question of this chapter is how to create a training set (by either selecting or augmenting data) that best matches a target test set. To address this question, a unified view of the mismatch minimisation problem is first provided, based on the notation introduced in chapter 2, and an overview of data selection and augmentation techniques is then given in section 3.2. Two similarity measures, for data selection and data augmentation, are presented in sections 3.3 and 3.4 respectively, followed by the conclusion of the chapter in section 3.5.
