Datasets and Baselines for Comparisons - Key Components in the CD-DNN-HMM

Part III Deep Neural Network-Hidden Markov Model

6.2 Key Components in the CD-DNN-HMM

6.2.1 Datasets and Baselines for Comparisons

6.2.1.1 Bing Mobile Voice Search Dataset

The Bing mobile voice search application allows users to perform USA-wide business and web searches from their mobile phones by voice. The business search dataset used in the experiments was collected under real usage scenarios in 2008, at which time the application was restricted to location and business lookup [30]. All audio files collected were sampled at 8 kHz and encoded with the GSM codec. This is a challenging task because the dataset contains many kinds of variation: noise, music, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruption, and different audio channels.

The dataset was split into a training set, a development set, and a test set. To simulate the real data collection and training procedure, and to avoid having overlap between training, development, and test sets, the dataset was split based on the time stamp of the queries. All queries in the training set were collected before those in the development set, which were in turn collected before those in the test set. The public lexicon from Carnegie Mellon University was used. The normalized nationwide language model (LM) used in the evaluation contains 65 K word unigrams, 3.2 million word bi-grams, and 1.5 million word tri-grams, and was trained using the data feed and collected query logs; the perplexity is 117.

Table 6.1 Bing mobile voice search dataset

                   Hours   Number of utterances
Training set       24      32,057
Development set    6.5     8,777
Test set           9.5     12,758

Table 6.2 The CD-GMM-HMM baseline sentence error rate (SER) on the voice search dataset (Summarized from Dahl et al. [7])

Criterion   Dev SER (%)   Test SER (%)
ML          37.1          39.6
MMI         34.9          37.2
MPE         34.5          36.2

Table 6.1 summarizes the number of utterances and the total duration of the audio files (in hours) in the training, development, and test sets. All 24 h of data in the training set are manually transcribed.
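The time-stamp-based split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset: the function name `chronological_split`, the cut-off dates, and the utterance IDs are all hypothetical.

```python
from datetime import datetime


def chronological_split(queries, dev_start, test_start):
    """Split (timestamp, utterance_id) pairs into train/dev/test by time.

    dev_start and test_start are hypothetical cut-off times: queries before
    dev_start go to training, queries from test_start onward go to test,
    and everything in between goes to development.
    """
    train, dev, test = [], [], []
    for timestamp, utt_id in sorted(queries):
        if timestamp < dev_start:
            train.append(utt_id)
        elif timestamp < test_start:
            dev.append(utt_id)
        else:
            test.append(utt_id)
    return train, dev, test


# Toy data: the dates and IDs are made up for illustration.
queries = [
    (datetime(2008, 3, 1), "utt-a"),
    (datetime(2008, 6, 15), "utt-c"),
    (datetime(2008, 5, 2), "utt-b"),
    (datetime(2008, 7, 20), "utt-d"),
]
train, dev, test = chronological_split(
    queries,
    dev_start=datetime(2008, 6, 1),
    test_start=datetime(2008, 7, 1),
)
print(train, dev, test)
```

Because every training query precedes every development query, which in turn precedes every test query, no later data can leak into an earlier set.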

Performance on this task was evaluated using sentence error rate (SER) instead of word error rate (WER), for three reasons. First, the average sentence length is 2.1 tokens, so sentences are typically quite short. Second, users care most about whether they can find the business or location they seek with the fewest attempts, and they typically repeat what they have said if any one of the words is misrecognized. Third, there is significant inconsistency in spelling that makes sentence accuracy more convenient to use. For example, “Mc-Donalds” is sometimes spelled “McDonalds,” “Walmart” is sometimes spelled “Wal-mart,” and “7-eleven” is sometimes spelled “7 eleven” or “seven-eleven.” The sentence out-of-vocabulary (OOV) rate using the 65 K vocabulary LM is 6% on both the development and test sets. In other words, the best possible SER achievable with this setup is 6%.
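As a rough illustration of the metric, SER is the percentage of sentences whose recognized text differs from the reference in any way. The sketch below is an assumption-laden example, not the scoring tool used in the experiments: the function names and the toy transcripts are hypothetical, and the `strip_hyphens_and_spaces` normalizer is only one possible way to absorb some of the spelling variants mentioned above (it would not map “seven-eleven” to “7-eleven”).

```python
def sentence_error_rate(references, hypotheses, normalize=lambda s: s):
    """Return the percentage of sentences that do not match exactly."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    errors = sum(
        normalize(ref) != normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return 100.0 * errors / len(references)


def strip_hyphens_and_spaces(sentence):
    """Hypothetical normalizer: lowercase, then drop hyphens and spaces."""
    return sentence.lower().replace("-", "").replace(" ", "")


# Toy transcripts, made up for illustration.
refs = ["McDonalds", "Walmart", "7 eleven", "pizza hut seattle"]
hyps = ["Mc-Donalds", "Wal-mart", "7-eleven", "pizza hut bellevue"]

print(sentence_error_rate(refs, hyps))                            # 100.0
print(sentence_error_rate(refs, hyps, strip_hyphens_and_spaces))  # 25.0
```

With exact matching every toy sentence counts as an error; after normalization only the sentence with a genuinely wrong word does.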

The clustered cross-word triphone GMM-HMMs were trained with the maximum likelihood (ML), maximum mutual information (MMI) [3, 14, 20], and minimum phone error (MPE) [20, 23] criteria. The 39-dim features used in the experiments include the 13-dim static Mel-frequency cepstral coefficient (MFCC) (with C0 replaced with energy) and its first and second derivatives. The features were pre-processed with the cepstral mean normalization (CMN) algorithm.
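The construction of the 39-dim vector from 13 static coefficients can be sketched as below. The text does not give the derivative formula or the order of operations, so two assumptions are made here: derivatives use a common regression formula over a ±2-frame window with edge frames repeated, and CMN is applied per utterance to the static coefficients before the derivatives are appended. All function names are hypothetical.

```python
import math


def cmn(frames):
    """Cepstral mean normalization: subtract each coefficient's utterance mean."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames, window=2):
    """Regression-based derivatives; edge frames are repeated (assumption)."""
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_derivatives(static):
    """Append first and second derivatives to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy 13-dim "static" frames standing in for real MFCCs.
static = [[math.sin(0.1 * t * (d + 1)) for d in range(13)] for t in range(20)]
features = add_derivatives(cmn(static))
print(len(features), len(features[0]))  # 20 39
```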

The baseline systems were optimized by tuning the tying structures, number of senones, and Gaussian splitting strategies on the development set. All systems have 53 K logical and 2 K physical triphones with 761 shared states (senones), each of which is a GMM with 24 mixture components. The GMM-HMM baseline results are summarized in Table 6.2.

For all CD-DNN-HMM experiments cited for the VS dataset, 11 frames (5-1-5) of MFCCs were used as the input features of the DNNs. During DNN pretraining, a learning rate of 1.5e−4 per sample was used for all layers. For fine-tuning, a learning rate of 3e−3 per sample was used for the first 6 epochs and a learning rate of 8e−5 per sample was used for the last 6 epochs. In all the experiments, a minibatch size of 256 and a momentum of 0.9 were used. The hyperparameters were selected by hand, based on preliminary single-hidden-layer experiments, so it may be possible to obtain even better performance with the deeper models using a more exhaustive hyperparameter search strategy.
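The 5-1-5 context window can be sketched as follows: each DNN input is the current frame concatenated with its 5 left and 5 right neighbors. Assuming each frame is the 39-dim MFCC vector described earlier, this gives 11 × 39 = 429 inputs per frame. How utterance boundaries were handled is not stated; repeating the first and last frames is an assumption, and the function name `splice` is hypothetical.

```python
def splice(frames, left=5, right=5):
    """Concatenate each frame with its left/right context frames."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            # Clamp the index so edge frames are repeated (assumption).
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 8 frames of 39 identical values each.
frames = [[float(t)] * 39 for t in range(8)]
inputs = splice(frames)
print(len(inputs), len(inputs[0]))  # 8 429
```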

6.2.1.2 Switchboard Dataset

The Switchboard (SWB) dataset [8, 9] is a corpus of conversational telephone speech. It has three setups whose training set sizes are 30 h (a random subset of the Switchboard-I training set), 309 h (the full Switchboard-I training set), and 2,000 h (adding the Fisher training sets), respectively. For all configurations, the 1,831-segment SWB part of the NIST 2000 Hub5 eval set and the FSH half of the 6.3 h Spring 2003 NIST rich transcription set (RT03S) are used as the evaluation sets. The system uses 13-dimensional PLP features with rolling-window mean-variance normalization and up to third-order derivatives (13 × 4 = 52 dimensions), which are reduced to 39 dimensions by heteroscedastic linear discriminant analysis (HLDA) [16]. The speaker-independent 3-state cross-word triphones share 1,504 (40 mixture), 9,304 (40 mixture), and 18,004 (72 mixture) CART-tied states, respectively, on the 30, 309, and 2,000 h setups, optimized for the GMM-HMM systems. The trigram language model was trained on the 2,000 h Fisher transcripts and interpolated with a written-text trigram. Test-set perplexity with the 58 k lexicon is 84.
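Rolling-window mean-variance normalization can be sketched as below: each frame is normalized by the mean and variance of the frames around it rather than by whole-utterance statistics. The text does not give the window length or its alignment, so the centered window, its default length of 301 frames, and the function name `rolling_mvn` are all assumptions.

```python
import math


def rolling_mvn(frames, window=301, eps=1e-8):
    """Normalize each frame using statistics from a centered window.

    The window is truncated at utterance boundaries. eps guards against
    division by zero when a coefficient is constant inside the window.
    """
    n, dim = len(frames), len(frames[0])
    half = window // 2
    out = []
    for t in range(n):
        segment = frames[max(0, t - half):min(n, t + half + 1)]
        row = []
        for d in range(dim):
            values = [f[d] for f in segment]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            row.append((frames[t][d] - mean) / math.sqrt(var + eps))
        out.append(row)
    return out


# Toy 1-dim utterance; the window covers all frames, so this reduces to
# ordinary per-utterance mean-variance normalization.
toy = [[float(t)] for t in range(5)]
normalized = rolling_mvn(toy, window=11)
print([round(f[0], 3) for f in normalized])
```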

The DNN system was trained using stochastic gradient descent in mini-batches. The mini-batch size was 1,024 frames, except for the first mini-batches, for which 256 samples were used. For DBN-pretraining the mini-batch size was 256.

For pretraining, the learning rate was 1.5e−4 per sample. For the first 24 h of training data the learning rate was 3e−3 per sample; it was reduced to 8e−5 per sample after 3 epochs. A momentum of 0.9 was used. These learning rates are the same as those used for the VS dataset.
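The two-stage fine-tuning schedule shared by both datasets can be sketched as a simple step function. The switch point is left as a parameter, since the text gives 6 epochs for the VS dataset and 3 epochs for SWB; treating the switch as a whole-epoch boundary, and the function name `fine_tune_lr`, are assumptions made for illustration.

```python
def fine_tune_lr(epoch, switch_epoch, initial=3e-3, final=8e-5):
    """Per-sample learning rate for a 0-based epoch index."""
    return initial if epoch < switch_epoch else final


# VS setup: 6 epochs at the higher rate, then 6 at the lower rate.
vs_schedule = [fine_tune_lr(e, switch_epoch=6) for e in range(12)]
print(vs_schedule)

# SWB setup: the rate is reduced after 3 epochs.
swb_schedule = [fine_tune_lr(e, switch_epoch=3) for e in range(5)]
print(swb_schedule)
```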

In document Automatic Speech Recognition (Page 122-124)