Training and Decoding Speedup
7.2 Decoding Speedup
7.2.5 Multiframe DNN
In CD-DNN-HMMs, we estimate the senone posterior probability for each 10 ms frame from an input window of 9–13 frames. However, speech is a rather stationary process when analyzed at a 10 ms frame rate, so it is natural to expect the predictions for adjacent frames to be very similar. A simple and computationally efficient way to exploit this temporal correlation between feature frames is to copy the predictions from the previous frame and thus cut the computation in half. This approach, discussed in [22] and referred to as the frame-asynchronous DNN, performs surprisingly well.
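The frame-asynchronous scheme can be sketched in a few lines of Python. This is a minimal illustration, not the implementation in [22]; the function name `frame_asynchronous_posteriors`, the `dnn` callable, and the `skip` parameter are assumptions introduced here.

```python
from typing import Callable, List, Sequence

Posterior = List[float]


def frame_asynchronous_posteriors(
    frames: Sequence[Sequence[float]],
    dnn: Callable[[Sequence[float]], Posterior],
    skip: int = 2,
) -> List[Posterior]:
    """Run `dnn` on every `skip`-th frame and copy its output to the frames in between.

    `frames[t]` is assumed to be the (already stacked) input window for frame t.
    """
    posteriors: List[Posterior] = []
    last: Posterior = []
    for t, window in enumerate(frames):
        if t % skip == 0:
            last = dnn(window)          # full forward pass
        posteriors.append(list(last))   # otherwise reuse the most recent prediction
    return posteriors
```

With `skip=2`, the network is evaluated on every other frame, so the number of forward passes is roughly halved.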
An improved approach, referred to as the multiframe DNN (MFDNN), is proposed in [22]. Instead of copying the state predictions from the previous frame as in the frame-asynchronous DNN, the MFDNN predicts both the frame label at time t and the labels at adjacent frames from the same input window at frame t. This is done by replacing the single softmax layer in the conventional DNN with several softmax layers, one for each frame label. Since all the softmax layers share the same hidden layers, the hidden layers are evaluated only once for all the jointly predicted frames, which cuts the hidden layer computation time.
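A rough cost model makes the source of the saving concrete. The sketch below counts multiply-accumulate operations per frame and ignores biases and nonlinearities; the function `macs_per_frame` and the layer sizes are hypothetical values chosen for illustration and are not taken from [22].

```python
from typing import Sequence


def macs_per_frame(
    layer_sizes: Sequence[int],
    num_senones: int,
    frames_per_pass: int = 1,
) -> float:
    """Approximate multiply-accumulates needed per 10 ms frame.

    layer_sizes: input dimension followed by the hidden layer sizes.
    frames_per_pass: number of frame labels predicted from one forward pass.
    """
    # Shared hidden layers: computed once per forward pass.
    hidden = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    # Each predicted frame still needs its own softmax layer.
    softmax = layer_sizes[-1] * num_senones
    return hidden / frames_per_pass + softmax


sizes = [440, 2048, 2048, 2048, 2048, 2048]  # hypothetical network shape
baseline = macs_per_frame(sizes, num_senones=6000)
two_frame = macs_per_frame(sizes, num_senones=6000, frames_per_pass=2)
```

Only the hidden layer term is divided by the number of jointly predicted frames. Each softmax layer is still a full matrix multiplication, so the achievable speedup is bounded by the share of the computation spent in the hidden layers.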
For example, in [22], the MFDNN jointly predicts labels for frames t to t − K, where K is the number of lookback frames. This is a special case of multitask learning.
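The forward pass of such a network can be sketched as follows, assuming sigmoid hidden units and plain Python lists for weights. The names `mfdnn_forward`, `hidden_layers`, and `heads` are illustrative, and the toy weights are arbitrary; head k is taken to predict the label of frame t − k.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, bias)


def affine(W: Matrix, b: Vector, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def softmax(v: Vector) -> Vector:
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(z - m) for z in v]
    s = sum(e)
    return [x / s for x in e]


def mfdnn_forward(
    x: Vector, hidden_layers: Sequence[Layer], heads: Sequence[Layer]
) -> List[Vector]:
    """Return one posterior vector per head; heads[k] predicts frame t - k."""
    h = x
    for W, b in hidden_layers:      # shared hidden layers, evaluated once
        h = sigmoid(affine(W, b, h))
    return [softmax(affine(W, b, h)) for W, b in heads]


# Toy example with K = 1: two heads sharing one hidden layer.
hidden_layers = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.0])]
heads = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),    # head 0: frame t
    ([[0.5, 0.5], [-0.5, -0.5]], [0.1, -0.1]),   # head 1: frame t - 1
]
posteriors = mfdnn_forward([1.0, 2.0], hidden_layers, heads)
```

Each call returns K + 1 posterior distributions from a single evaluation of the hidden layers.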
Note that here the MFDNN jointly predicts past (t − 1, …, t − K) instead of future (t + 1, …, t + K) frames. This is because in [22] a much longer context window in the past (20 frames) than in the future (5 frames) was used. As a result, the input of the DNN provides more balanced context for predicting past frame labels than future frame labels. In most implementations of DNNs [6, 21], the input window has balanced context of past and future frames. For these DNNs, the jointly predicted frame labels can come from both past and future frames. Note that if K is large, the overall latency of the system will increase. Since latency affects users' experience, K is typically set to a value less than 4 so that the additional latency introduced by the MFDNN is less than 30 ms.
Training such MFDNNs can be performed by backpropagating the errors from all softmax layers jointly through the network. When doing so, the gradient magnitudes will increase since the error signal is multiplied by the number of jointly predicted frames. To maintain the convergence property, the learning rate might have to be reduced.
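The growth of the gradient can be illustrated with a small sketch, assuming a summed cross-entropy loss over the heads. The function `shared_gradient` and the toy values are hypothetical. Dividing the learning rate by the number of heads is one simple heuristic assumed here, not a rule given in [22].

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Head = Tuple[Matrix, Vector]  # (weights, bias) of one softmax layer


def softmax(v: Vector) -> Vector:
    m = max(v)
    e = [math.exp(z - m) for z in v]
    s = sum(e)
    return [x / s for x in e]


def shared_gradient(h: Vector, heads: Sequence[Head], targets: Sequence[int]) -> Vector:
    """Gradient of the summed cross-entropy losses w.r.t. the shared top hidden activation h."""
    grad = [0.0] * len(h)
    for (W, b), target in zip(heads, targets):
        logits = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
        p = softmax(logits)
        # Error signal at the softmax input: posterior minus one-hot target.
        delta = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
        for i, row in enumerate(W):
            for j, w in enumerate(row):
                grad[j] += w * delta[i]  # contributions from all heads add up
    return grad


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


h = [0.3, 0.7]
head = ([[1.0, -1.0], [-0.5, 0.5]], [0.0, 0.0])
single = shared_gradient(h, [head], [0])
joint = shared_gradient(h, [head] * 3, [0] * 3)
ratio = l2_norm(joint) / l2_norm(single)  # close to 3 for identical heads and targets

base_lr = 0.1                 # hypothetical single-head learning rate
scaled_lr = base_lr / 3       # assumed heuristic: divide by the number of heads
```

In a real MFDNN the heads and targets differ, so the factor is only approximately the number of jointly predicted frames, and the appropriate learning rate is best determined empirically.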
The MFDNN performs better than the frame-asynchronous DNN and performs as well as the baseline system. It was reported in [22] that a system that jointly predicts 2 frames at a time achieved a 10 % improvement in the query processing rate at no cost in accuracy or median latency, compared to an equivalent baseline system. A system that jointly predicts 4 frames achieved a further 10 % improvement in the query processing rate at a cost of a 0.4 % absolute increase in word error rate.
References
1. Ba, L.J., Caruana, R.: Do deep nets really need to be deep? arXiv preprint arXiv:1312.6184 (2013)
2. Bertsekas, D.P.: Constrained optimization and Lagrange multiplier methods. Computer Science and Applied Mathematics, vol. 1982, p. 1. Academic Press, Boston (1982)
3. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2011)
4. Bucilua, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 535–541. ACM (2006)
5. Chen, X., Eversole, A., Li, G., Yu, D., Seide, F.: Pipelined back-propagation for context-dependent deep neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2012)
6. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech Lang. Process. 20(1), 30–42 (2012)
7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
8. Hassibi, B., Stork, D.G., et al.: Second order derivatives for network pruning: optimal brain surgeon. Proc. Neural Inf. Process. Syst. (NIPS) 5, 164–164 (1993)
9. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)
10. Kingsbury, B., Sainath, T.N., Soltau, H.: Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2012)
11. Langford, J., Li, L., Zhang, T.: Sparse online learning via truncated gradient. J. Mach. Learn. Res. (JMLR) 10, 777–801 (2009)
12. Le, Q.V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209 (2011)
13. LeCun, Y., Denker, J.S., Solla, S.A., Howard, R.E., Jackel, L.D.: Optimal brain damage. Proc. Neural Inf. Process. Syst. (NIPS) 2, 598–605 (1989)
14. Li, J., Zhao, R., Huang, J.T., Gong, Y.: Learning small-size DNN with output-distribution-based criteria. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
15. Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 735–742 (2010)
16. Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 1033–1040 (2011)
17. Niu, F., Recht, B., Ré, C., Wright, S.J.: Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730 (2011)
18. Petrowski, A., Dreyfus, G., Girault, C.: Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Trans. Neural Netw. 4(6), 970–981 (1993)
19. Powell, M.J.: A method for non-linear constraints in minimization problems. UKAEA (1967)
20. Sainath, T.N., Kingsbury, B., Sindhwani, V., Arisoy, E., Ramabhadran, B.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6655–6659 (2013)
21. Seide, F., Li, G., Yu, D.: Conversational speech transcription using context-dependent deep neural networks. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 437–440 (2011)
22. Vanhoucke, V., Devin, M., Heigold, G.: Multiframe deep neural networks for acoustic modeling. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7582–7585 (2013)
23. Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on CPUs. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
24. Xue, J., Li, J., Gong, Y.: Restructuring of deep neural network acoustic models with singular value decomposition. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2013)
25. Yu, D., Seide, F., Li, G., Deng, L.: Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4409–4412 (2012)
26. Zhang, S., Zhang, C., You, Z., Zheng, R., Xu, B.: Asynchronous stochastic gradient descent for DNN training. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6660–6663 (2013)
27. Zhou, P., Liu, C., Liu, Q., Dai, L., Jiang, H.: A cluster-based multiple deep neural networks method for large vocabulary continuous speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6650–6654 (2013)