
3.3 Likelihood ratio based distance

3.3.1 Data selection and transfer learning experiments with a di-

This chapter presents an experimental study that first examines the effect of mismatched training and test data on the performance of ASR systems, and then evaluates how effectively the proposed approach reduces the mismatch and improves performance. The experiments required a very diverse dataset on which mismatched conditions could easily be tested. For this purpose, an artificially diverse dataset was created by combining six different datasets; details are provided in section 3.3.1.1. The mismatch between the components of this dataset makes it a good choice for the experimental work that follows. Section 3.3.2 studies the effects of using mismatched training data, section 3.3.3 investigates the positive and negative transfer effects of using cross-domain data, and section 3.3.4 presents a new approach for data selection based on similarity to a target test set, using the likelihood ratio function defined in this section.

3.3.1.1 Dataset definition

For the data selection experiments, a highly diverse simulated dataset was created by combining 6 different types of data widely used in ASR experiments:

• Radio (RD): BBC Radio4 broadcasts from February 2009 (Bell et al., 2015b)

• Television (TV): BBC broadcasts from May 2008 (Bell et al., 2015b)

• Telephone speech (CT): from the Fisher corpus¹ (Cieri et al., 2004)

• Meetings (MT): from the AMI (Carletta et al., 2006) and ICSI (Janin et al., 2003) corpora

• Lectures (TK): from TedTalks (Ng et al., 2014)

• Read speech (RS): from the WSJCAM0 corpus (Robinson et al., 1995)

¹ All of the telephone speech data was up-sampled to 16 kHz to match the sampling rate of the rest of the data.

Table 3.1: Amount of data used from each component dataset for the training set of the diverse dataset and their related statistics (durations are in hh:mm:ss format)

Dataset   Duration   #Segments   #Words    #Unique Words   #Speakers
RD        10:00:05   3,685       116,015   9,827           518
TV        10:00:07   6,774       118,190   10,928          1,745
CT        10:00:01   10,200      114,188   6,029           100
MT        10:00:34   4,088       104,368   5,484           80
TK        10:00:00   5,143       108,927   10,088          100
RS        10:00:04   3,963       84,299    8,902           89
Total     60:00:52   35,279      645,987   25,374          2,632

A subset of 10h from each component dataset was selected to form the training set (60h in total), and 1h from each component dataset was used for the test set (6h in total). The component datasets were chosen to cover the most common and distinctive types of audio recordings used in ASR tasks. Tables 3.1 and 3.2 summarise the statistics of the datasets. Each component dataset has its own particular attributes, some of which are visible in these statistics. For example, the MT dataset has only 80 speakers in its 10 hour training set, while TV, with a similar amount of data, has more than 1,700 speakers. In terms of the number of unique words, CT and MT have around 6,000, whereas TV and TK have more than 10,000 for the same duration of data. This shows the greater diversity of words used in TV and TK compared to CT and MT. RS has the lowest total number of words, which shows that its average speaking rate is lower than that of the other components. In terms of type of speech, all of the datasets can be considered spontaneous speech except RS, which is read speech; however, parts of RD and TV also contain read speech (e.g. news programmes). These differences, together with other variabilities such as speaking style and background conditions, characterise each of the components and show the diversity of this dataset.
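The speaking-rate observation can be checked with a small calculation on the figures in Table 3.1. The following is only a rough sketch: it divides the word count by the total segment duration, so pauses inside segments are counted as speech time, and the names `TRAIN_STATS`, `to_minutes` and `words_per_minute` are illustrative and not taken from the original work.

```python
# (duration in hh:mm:ss, number of words), copied from Table 3.1
TRAIN_STATS = {
    "RD": ("10:00:05", 116015),
    "TV": ("10:00:07", 118190),
    "CT": ("10:00:01", 114188),
    "MT": ("10:00:34", 104368),
    "TK": ("10:00:00", 108927),
    "RS": ("10:00:04", 84299),
}


def to_minutes(hms):
    """Convert an hh:mm:ss string to a number of minutes."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


def words_per_minute(stats):
    """Approximate speaking rate per dataset: words / duration in minutes."""
    return {name: words / to_minutes(duration)
            for name, (duration, words) in stats.items()}


rates = words_per_minute(TRAIN_STATS)
# Print the datasets from slowest to fastest approximate speaking rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name}: {rate:.0f} words/min")
```

Under this approximation RS comes out at roughly 140 words per minute, while the other five components lie between about 174 and 197 words per minute.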

3.3.1.2 Baseline models

Because the dataset consists of several different component datasets, baseline models were trained to evaluate the difficulty of each component. One baseline model was trained for each component separately, and another baseline model was trained using all of the pooled data. These models were then evaluated on the test set. Details of the baseline models are provided in this section.

Table 3.2: Amount of data used from each component dataset for the test set of the diverse dataset and their related statistics (durations are in hh:mm:ss format)

Dataset   Duration   #Segments   #Words   #Unique Words   #Speakers
RD        1:00:00    282         10,872   2,596           68
TV        1:00:01    802         11,379   2,871           90
CT        1:00:01    721         12,727   1,696           71
MT        1:00:02    397         10,026   1,618           53
TK        1:00:04    359         10,321   2,399           19
RS        1:00:01    410         8,743    2,378           20
Total     6:00:12    2,971       64,068   7,869           321

Two types of systems were used for the experiments: a GMM-HMM system and a bottleneck DNN-GMM-HMM system. For the GMM-HMM system, 13 dimensional PLP (Hermansky, 1990) features plus their first and second derivatives were used (39 dimensions in total). For the DNN-GMM-HMM system, a 65 dimensional feature vector was used, formed by concatenating the 39 dimensional PLP features with 26 dimensional bottleneck (BN) features. The BN features were extracted from a feed-forward DNN with 4 hidden layers, trained on the 60 hours of training data. For the DNN input, 31 adjacent frames (the current frame plus 15 frames to the left and 15 frames to the right) of 23 dimensional Mel-scale log filter-bank energy features were concatenated to form a 713 dimensional super vector. A discrete cosine transform was applied to this super vector to de-correlate it and compress it to 368 dimensions, and the result was fed into the neural network. The network had 4 hidden layers of size 1,745, followed by a bottleneck layer of size 26 and a softmax output layer of 4,000 context-dependent triphone states. The objective function used for training was frame-level cross-entropy (CE), and optimisation was performed using the stochastic gradient descent (SGD) algorithm. For both types of features, MLE-based GMM-HMM models were trained with 5-state cross-word triphones and 16 Gaussian components per state. For the bottleneck system, the frame-level alignments were obtained from the initial GMM-HMM system. The language model was based on a 50,000 word vocabulary and was trained by interpolating component language models for each of the 6 domains. The interpolation weights were tuned using an independent development set.
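The dimension arithmetic of the DNN input can be illustrated with a minimal sketch. The text does not state how the DCT maps 713 to 368 dimensions; one reading that is consistent with the numbers is a DCT along the time axis of each of the 23 filter-bank channels, keeping 16 coefficients per channel (23 × 16 = 368). That reading, the edge-frame padding, and all function names below are assumptions made for illustration, not details confirmed by the source.

```python
import math

NUM_BANDS = 23   # Mel log filter-bank energies per frame (from the text)
CONTEXT = 15     # frames of context on each side (from the text)
NUM_DCT = 16     # ASSUMPTION: coefficients kept per band, since 23 * 16 = 368


def splice(frames, t, context=CONTEXT):
    """Return the 2*context+1 frames centred on index t.

    ASSUMPTION: at utterance boundaries the first/last frame is repeated.
    """
    last = len(frames) - 1
    return [frames[min(max(t + k, 0), last)]
            for k in range(-context, context + 1)]


def dct_along_time(window, num_coeffs=NUM_DCT):
    """Unnormalised DCT-II over time for each band, truncated to num_coeffs."""
    n = len(window)
    output = []
    for band in range(len(window[0])):
        trajectory = [frame[band] for frame in window]
        for k in range(num_coeffs):
            output.append(sum(
                x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(trajectory)))
    return output


def dnn_input(frames, t):
    """Splice 31 frames (31 * 23 = 713 values) and compress them to 368."""
    return dct_along_time(splice(frames, t))


# Toy utterance: 40 identical frames of 23 filter-bank energies.
utterance = [[1.0] * NUM_BANDS for _ in range(40)]
window = splice(utterance, 0)
print(len(window) * NUM_BANDS)       # size of the spliced super vector
print(len(dnn_input(utterance, 0)))  # size after the truncated DCT
```

With a constant input trajectory, only the zeroth DCT coefficient of each band is non-zero, which is a quick sanity check that the transform behaves as a DCT-II.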

Table 3.3: WER (%) of the baseline models on the test set of the diverse dataset, ordered in terms of difficulty

Features   Model          RS     RD     TK     CT     MT     TV     Overall
PLP        ML             17.3   18.4   34.1   46.6   44.0   51.1   36.0
PLP        ML in-domain   16.9   19.1   35.1   44.4   44.0   52.9   36.3
PLP        MAP            14.6   16.8   31.8   43.5   40.4   49.6   33.6
PLP+BN     ML             13.0   13.3   23.5   33.5   32.2   42.0   26.8
PLP+BN     ML in-domain   12.6   14.0   25.0   34.3   33.2   44.0   27.9
PLP+BN     MAP            12.1   12.8   23.1   32.5   30.6   41.5   26.2

3.3.1.3 Baseline results

Table 3.3 presents results using both types of acoustic features with three different types of models: ML, ML in-domain and MAP. The ML models were trained with the ML criterion using all of the pooled training data. The ML in-domain models were 6 individual models, each trained on the 10h of in-domain data and then used to decode the corresponding test set. Finally, the initial ML model was MAP-adapted to each of the 6 domains, and the adapted models were used to decode the corresponding test sets.
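To make the MAP step concrete, the following is a minimal sketch of the standard MAP update for a single Gaussian mean, in which the pooled ML mean acts as a prior that is interpolated with the in-domain statistics. The source does not state which parameters were adapted or the value of the prior weight, so the mean-only update, the function name `map_adapt_mean` and the default `tau=10.0` are assumptions used purely for illustration.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of one Gaussian mean.

    prior_mean: mean vector from the pooled ML model (the prior)
    frames:     in-domain feature vectors
    posteriors: occupation probability of this Gaussian for each frame
    tau:        prior weight (hypothetical value, not from the source)
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    # Posterior-weighted sum of the in-domain frames, per dimension.
    weighted_sum = [sum(g * x[d] for g, x in zip(posteriors, frames))
                    for d in range(dim)]
    # Interpolate between the prior mean and the in-domain data mean.
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]


# With no in-domain data the prior mean is returned unchanged.
print(map_adapt_mean([0.0, 1.0], [], []))
# With occupancy equal to tau, the result lies halfway between the
# prior mean and the mean of the in-domain frames.
print(map_adapt_mean([0.0], [[2.0]] * 10, [1.0] * 10))
```

The more in-domain data a Gaussian sees, the further its mean moves from the pooled estimate, while components with little in-domain data stay close to the pooled model.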

These results show a large variation in performance among domains, from 17% and 18% WER for the read speech and radio broadcasts to 51% for the television broadcasts. The use of PLP+BN features provides a 20–25% relative improvement in performance over the PLP features for all three types of model; this is consistent across domains and follows results previously reported in the literature (Hinton et al., 2012; Yu and Deng, 2015). The in-domain models are overall worse than the pooled-data models (e.g. 27.9% vs. 26.8% with PLP+BN features), which suggests that more data is helpful for this task. For both types of features the MAP-adapted models yielded the best performance, which makes MAP adaptation the preferred setup for domain adaptation in the context of GMM-HMM models. Among other adaptation techniques, MLLR adaptation did not consistently improve on MAP adaptation and was therefore not considered for the domain adaptation task with this amount of data.
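The relative improvement from adding bottleneck features can be recomputed from the Overall column of Table 3.3. The short sketch below does this; the dictionary layout and the function name `relative_reduction` are illustrative choices rather than anything from the original experiments.

```python
# Overall WER (%) from Table 3.3
OVERALL_WER = {
    "PLP":    {"ML": 36.0, "ML in-domain": 36.3, "MAP": 33.6},
    "PLP+BN": {"ML": 26.8, "ML in-domain": 27.9, "MAP": 26.2},
}


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent when moving from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


for model in ("ML", "ML in-domain", "MAP"):
    gain = relative_reduction(OVERALL_WER["PLP"][model],
                              OVERALL_WER["PLP+BN"][model])
    print(f"{model}: {gain:.1f}% relative")
# ML: 25.6% relative
# ML in-domain: 23.1% relative
# MAP: 22.0% relative
```

On the overall figures the relative gains therefore lie between about 22% and 26%, and the same function can be applied to any per-domain column of the table.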
