CI Initialisation Experiments - Joint Training Methods for Tandem and Hybrid Speech Recognition

A problem arising from CI initialisation is that the commonly used indicator of DNN acoustic model performance, the frame classification accuracy, is not available throughout the complete DNN training process, since different target sets are involved. To solve this problem, each CD state is associated with a CI state by stripping its contexts, and the CD state outputs and labels are converted to the CI state level. Let $C_l$ be a set of CD states, where $C_k \in C_l$ means that the center phone and state index of $C_k$ constitute the CI state that $C_l$ represents. The posterior probability of the CI state $C_l$ is therefore found by summing over all of its CD instances $C_k$:

$$P(C_l \mid x^{\mathrm{in}}(t)) = \sum_{C_k \in C_l} P(C_k \mid x^{\mathrm{in}}(t)). \tag{4.8}$$

Then in classification, the CI state that gives the maximum accumulated posterior probability is chosen and compared to its reference label.
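The conversion and the resulting CI frame accuracy can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the `cd_to_ci` mapping array and the toy posteriors are hypothetical and do not come from the original system.

```python
import numpy as np

def ci_posteriors(cd_post, cd_to_ci, num_ci):
    """Sum CD state posteriors into CI state posteriors, following Eqn. (4.8).

    cd_post:  (frames, num_cd) array of CD state posteriors for each frame.
    cd_to_ci: (num_cd,) integer array giving the CI state of each CD state
              (hypothetical representation of the context-stripping map).
    """
    num_cd = cd_post.shape[1]
    # One-hot membership matrix: entry (k, l) is 1 when CD state k belongs to CI state l.
    membership = np.zeros((num_cd, num_ci))
    membership[np.arange(num_cd), cd_to_ci] = 1.0
    return cd_post @ membership

def ci_frame_accuracy(cd_post, cd_to_ci, num_ci, ref_ci):
    """Fraction of frames whose highest accumulated CI posterior matches the reference."""
    predicted = ci_posteriors(cd_post, cd_to_ci, num_ci).argmax(axis=1)
    return float(np.mean(predicted == ref_ci))

# Toy example (made-up numbers): 4 CD states tied to 2 CI states, 3 frames.
cd_to_ci = np.array([0, 0, 1, 1])
cd_post = np.array([
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.10, 0.45, 0.35],
    [0.30, 0.30, 0.40, 0.00],  # best single CD state is in CI state 1, but CI state 0 wins after summing
])
ref_ci = np.array([0, 1, 1])

toy_ci_post = ci_posteriors(cd_post, cd_to_ci, 2)
toy_accuracy = ci_frame_accuracy(cd_post, cd_to_ci, 2, ref_ci)
```

The third toy frame illustrates why the summation matters: the decision is made on the accumulated CI posterior, not on the CI state of the single most probable CD state.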

Applying Bayes’ rule to Eqn. (4.8) leads to,

$$p(x^{\mathrm{in}}(t) \mid C_l) = \sum_{C_k \in C_l} \frac{P(C_k)}{P(C_l)}\, p(x^{\mathrm{in}}(t) \mid C_k). \tag{4.9}$$

As in Section 4.2.1, both $p(x^{\mathrm{in}}(t) \mid C_l)$ and $p(x^{\mathrm{in}}(t) \mid C_k)$ are generated through softmax functions that have equivalent Gaussian distributions. Therefore, the conversion effectively regards the single Gaussian model of each CD state as a mixture component in the GMM of its corresponding CI state, and $P(C_k)/P(C_l)$ is the mixture weight, with

$$\sum_{C_k \in C_l} \frac{P(C_k)}{P(C_l)} = 1. \tag{4.10}$$
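For readers who want the intermediate steps, the passage from the posterior sum to the likelihood mixture and its normalisation can be written out as below. This is a worked sketch using only Bayes' rule; writing $x$ as shorthand for $x^{\mathrm{in}}(t)$ is a notational convenience introduced here.

```latex
% Bayes' rule for the CI state and for each of its CD states
P(C_l \mid x) = \frac{p(x \mid C_l)\, P(C_l)}{p(x)}, \qquad
P(C_k \mid x) = \frac{p(x \mid C_k)\, P(C_k)}{p(x)}.

% Substituting both into the posterior sum P(C_l | x) = \sum_k P(C_k | x)
\frac{p(x \mid C_l)\, P(C_l)}{p(x)}
  = \sum_{C_k \in C_l} \frac{p(x \mid C_k)\, P(C_k)}{p(x)}
\;\Longrightarrow\;
p(x \mid C_l) = \sum_{C_k \in C_l} \frac{P(C_k)}{P(C_l)}\, p(x \mid C_k).

% Integrating both sides over x (each density integrates to one)
1 = \sum_{C_k \in C_l} \frac{P(C_k)}{P(C_l)}
\;\Longleftrightarrow\;
P(C_l) = \sum_{C_k \in C_l} P(C_k).
```

The last line shows that the mixture weights sum to one exactly when the CI state prior is the sum of the priors of its CD states.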

4.5 CI Initialisation Experiments

In this section, experiments are first carried out on the WSJ data set to compare CI initialisation with the standard generative and discriminative PT methods. On the Aurora-4 data set, the regularisation effect of CI initialisation is studied by comparing it to weight decay.

Table 4.4 WSJ SI-84 DNN-HMM system recognition and classification results with a 65k word trigram LM.

ID   System        PT           %WER Dev   %WER Eval   CI State CV %Acc.
S1   CI-DNN-HMMs   CI           14.6       16.6        67.2
S2   CD-DNN-HMMs   RBM           9.4       10.9        68.9
S3   CD-DNN-HMMs   CD            9.6       11.3        68.7
S4   CD-DNN-HMMs   CI (no FT)    8.9       10.3        69.7
S5   CD-DNN-HMMs   CI            8.4       10.0        70.2

4.5.1 WSJ SI-84 DNN-HMM system performance

The first experiments were conducted with the WSJ SI-84 setup, which has a fairly limited number of training samples. The CI-DNN, S1, had 138 CI states, which were associated with 46 phones, while all CD-DNN-HMMs, S2-S5, had 3,007 tied states produced by the GMM-HMM based decision tree state tying approach. The DNNs had 5 hidden layers all with 1,000 artificial neurons. Baseline CD-DNN-HMM systems, S2 and S3, were initialised by generative and traditional CD discriminative PT, respectively. S5 was the model initialised by the proposed CI discriminative PT, whose starting point was actually S1. S4 differed from S5 by removing the CI-DNN FT step from CI discriminative PT (step 2 listed in Section 4.4.1). Performance of these systems is presented in Table 4.4.

From Table 4.4, S5, the CD-DNN-HMMs initialised by the proposed CI initialisation, had the lowest WER among all SI-84 systems. Compared to the baselines with generative and CD discriminative PT, S2 and S3, S5 gave on average a 9.4% and a 12.0% relative reduction in WER, respectively (on Dev and Eval combined). Furthermore, comparing S4 to S5 shows that removing the CI-DNN FT step degraded the performance, but S4 still outperformed S2 and S3, which had the same total number of training epochs, by an average of 5.7% and 8.1% relative reduction in WER. These results reveal that although the CI-DNN FT step helps to reduce WER, performing more epochs of training is not the only reason why CI initialisation outperforms CD discriminative PT. Moreover, the CI state accuracies of the CD-DNNs, computed by Eqn. (4.8) on the CV set, are also included. These accuracies are consistent with the WERs across all SI-84 systems.
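The averaged relative reductions for S5 can be checked with a short calculation. The sketch below assumes that "on average" means the mean of the per-set relative reductions on Dev and Eval; the text does not state the exact averaging procedure, so this reading is an assumption, and the function name is hypothetical.

```python
def avg_relative_wer_reduction(baseline, system):
    """Mean over test sets of (baseline - system) / baseline, in percent."""
    reductions = [(b - s) / b for b, s in zip(baseline, system)]
    return 100.0 * sum(reductions) / len(reductions)

# (Dev, Eval) %WER values copied from Table 4.4.
wer = {"S2": (9.4, 10.9), "S3": (9.6, 11.3), "S5": (8.4, 10.0)}

s5_vs_s2 = avg_relative_wer_reduction(wer["S2"], wer["S5"])
s5_vs_s3 = avg_relative_wer_reduction(wer["S3"], wer["S5"])
print(f"S5 vs S2: {s5_vs_s2:.1f}%  S5 vs S3: {s5_vs_s3:.1f}%")
```

Under this assumption the two values round to the 9.4% and 12.0% quoted for S5.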


Table 4.5 Standard deviations of DNN layer output values on the WSJ SI-84 CV set.

      Averaged Standard Deviation
ID    1st Hidden Layer   Output Layer
S2    4.19×10^-1         9.71×10^-3
S3    4.23×10^-1         9.46×10^-3
S4    4.09×10^-1         1.08×10^-2

4.5.2 Investigation of DNN layer output values

When classifying CI rather than CD targets, only CI errors are generated and backpropagated to the first layer. Therefore, only the CI low level characteristics are modelled. Specifically, if the first layer handles the task of, for instance, low-level feature normalisation, these features can be learned from CI speaker characteristics. After swapping the CI targets for CD targets for FT, the CD low level characteristics, e.g., CD pronunciation changes caused by the accent of a particular speaker, will not be modelled by the first layer and can be used by higher layers for better CD target discrimination. Intuitively, if this assumption is true, then for the final CD-DNN with CI discriminative PT, the first layer should produce more general features with smaller variances, while the output layer posterior probabilities should be more discriminative. This assumption is tested using the SI-84 trained systems and the corresponding CV set, as shown in Table 4.5.

In Table 4.5, for each system, the standard deviation of the output of each node in the first hidden layer and in the output layer was calculated, and the values were averaged over the nodes of each layer. First, the baseline with generative PT, S2, had smaller first hidden layer and larger output layer standard deviations than S3 with CD discriminative PT. This matches the fact that generative PT has more general first hidden layers than CD discriminative PT (Hinton et al., 2012; Mohamed et al., 2012). S5 had the smallest first layer and the largest output layer standard deviations, which indicates that CI discriminative PT produced the most generic first hidden layer and the most discriminative output layer among the three methods.
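The statistic in Table 4.5 can be sketched as follows. The function name and the toy array are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def averaged_std(layer_outputs):
    """Average over nodes of each node's standard deviation across frames.

    layer_outputs: (frames, nodes) array holding one layer's outputs on the CV set.
    """
    per_node_std = layer_outputs.std(axis=0)  # one value per node, ddof=0 (assumed)
    return float(per_node_std.mean())

# Toy data (made up): 4 frames, 2 nodes. Node 0 varies, node 1 is constant.
toy_outputs = np.array([
    [0.0, 1.0],
    [2.0, 1.0],
    [0.0, 1.0],
    [2.0, 1.0],
])
toy_avg_std = averaged_std(toy_outputs)  # node stds are 1.0 and 0.0
```

Applied to the first hidden layer and to the output layer of a trained DNN, this gives one number per layer that can be compared across systems.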

4.5.3 WSJ SI-284 DNN-HMM system performance

Finally, CI initialisation was applied to the larger data set with the WSJ SI-284 configuration, which contains about 4.5 times more data than SI-84. The CD output layer size was increased from 3,007 to 5,981, so that the DNNs had 1.5 times more parameters. As a result, every DNN parameter has three times more training samples on average. The SI-284 DNN-HMM systems and their performance are presented in Table 4.6.

Table 4.6 WSJ SI-284 DNN-HMM system recognition and classification results using a 65k word trigram LM. The time cost in terms of seconds of different PT methods is also included.

ID    System        PT Method    n Epoch   Time (s)   %WER Dev   %WER Eval   CI State CV %Acc.
S11   CI-DNN-HMMs   CI           -         -          11.6       12.6        70.5
S12   CD-DNN-HMMs   RBM          8         4562.6      6.9        8.1        73.4
S13   CD-DNN-HMMs   CD           4         8617.1      6.7        8.1        72.5
S14   CD-DNN-HMMs   CI (no FT)   4         1703.1      6.3        7.4        73.4
S15   CD-DNN-HMMs   CI           16        9794.5      6.3        7.4        72.9

From Table 4.6, the CD-DNN-HMMs with CI initialisation, S14 and S15, still outperformed the baseline systems S12 and S13, by a margin of 8.7% and 7.4% relative reduction in WER, respectively. To make the CI state accuracies comparable between Tables 4.4 and 4.6, the CI state accuracies in Table 4.6 were still computed on the SI-84 CV set, since it was included in the SI-284 CV set. The CI state CV accuracies of S12-S15 are reasonable when compared to the accuracy of S11, but are not all consistent with their WERs. Furthermore, S14 and S15 had identical WERs, which reveals that CI-DNN FT is not as important with sufficient training samples as it is with less data, since the first few hidden layers may already have been updated a sufficient number of times during the CI-DNN PT phase to model the low level characteristics well. In this case, CI initialisation not only improves the CD-DNN performance, but also considerably reduces the required amount of training time, as shown in Table 4.6. The experiments in Table 4.6 used a single NVIDIA K20c GPU card, and the RBM time cost is not strictly comparable to the others, since the RBM was trained with a different software tool, TNet (BUT, 2013).

4.5.4 Aurora-4 DNN-HMM system performance

Next, the regularisation effect of CI initialisation is investigated on the Aurora-4 task. The data set description and system configuration are presented in Appendix A.4. All CD-DNN-HMM systems have a structure of 720×2000×2000×2000×2000×

In document Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks (Page 123-127)