Part III Deep Neural Network-Hidden Markov Model
7.2 Decoding Speedup
7.2.1 Parallel Computation
The obvious way to reduce the decoding time is to parallelize the DNN computation. This is trivial on GPUs. In many tasks, however, it is more economical to use commodity CPU hardware. Fortunately, modern CPUs often provide single instruction multiple data (SIMD) instruction sets for low-level parallelization. These instructions perform the same operation in parallel on multiple contiguous data elements. On Intel and AMD CPUs of the x86 family, they typically operate on 16 bytes of data at a time (e.g., 2 doubles, 4 floats, 8 shorts, or 16 single-byte values). By taking advantage of these instruction sets, we can greatly improve the decoding speed.
Table 7.3, extracted from [23], summarizes the techniques applicable to the CPU decoder and the real-time factors (RTFs, defined as the processing time divided by the playback time) achievable with a DNN configured as 440:2000×5:7969. This is a typical DNN with 11 frames of features as input. Each frame consists of 40 log-energies extracted from filterbanks on a Mel frequency scale, which gives 11 × 40 = 440 inputs. Sigmoid nonlinearity is used in all 5 hidden layers, each of which has 2000 neurons. The output layer has 7,969 senones. The results were obtained on an Intel Xeon DP Quad Core E5640 machine running Ubuntu. CPU scaling was disabled, and each run was performed at least 5 times and the results averaged.
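As a rough illustration of what the 440:2000×5:7969 configuration implies, the following sketch counts the weights in each fully connected layer and expresses the RTF definition as a function. The helper names are hypothetical and biases are left out of the count.

```python
def layer_sizes(n_inputs, hidden_width, hidden_count, n_outputs):
    """Layer widths for an n_inputs:hidden_width x hidden_count:n_outputs DNN."""
    return [n_inputs] + [hidden_width] * hidden_count + [n_outputs]


def weight_counts(sizes):
    """Number of weights in each fully connected layer (biases excluded)."""
    return [n_in * n_out for n_in, n_out in zip(sizes, sizes[1:])]


def real_time_factor(processing_seconds, playback_seconds):
    """RTF as defined in the text: processing time divided by playback time."""
    return processing_seconds / playback_seconds


sizes = layer_sizes(11 * 40, 2000, 5, 7969)  # 11 frames x 40 log-energies
counts = weight_counts(sizes)
total = sum(counts)
output_share = counts[-1] / total

print(sizes)                   # [440, 2000, 2000, 2000, 2000, 2000, 7969]
print(total)                   # 32818000
print(round(output_share, 3))  # 0.486
```

Each frame therefore costs roughly 33 million multiply-accumulate operations, and just under half of them fall in the output layer.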
Table 7.3 Real-time factor (RTF) on a typical DNN (440:2000×5:7969) used in speech recognition with different engineering optimization techniques (summarized from Vanhoucke et al. [23])

| Technique | Real-time factor | Note |
|---|---|---|
| Floating-point baseline | 3.89 | Baseline |
| Floating-point SSE2 | 1.36 | 4-way parallelization (16 bytes) |
| 8-bit quantization | 1.52 | Activation: unsigned char; Weight: signed char |
| Integer SSSE3 | 0.51 | 16-way parallelization |
| Integer SSE4 | 0.47 | Faster 16-bit to 32-bit conversion |
| Batching | 0.36 | Batches over tens of milliseconds |
| Lazy evaluation | 0.26 | Assume 30 % active senones |

From the table, it is clear that the naive implementation requires 3.89 real time just to compute the posterior probabilities from the DNN. Using the floating-point SSE2 instruction set, which operates on 4 floats at a time, the decoding time can be reduced significantly to 1.36 real time. However, this is still very expensive and slower than real time. Alternatively, we may linearly quantize the 4-byte floating-point values of the hidden activations (constrained within (0, 1) if the sigmoid activation function is used) to unsigned char (1 byte) and the weight values to signed char (1 byte). The biases can be encoded as 4-byte integers, and the input remains in floating point. This quantization technique can reduce the time to 1.52 real time even without using SIMD instructions. Quantization also reduces the model size to 1/3–1/4 of the original.
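The linear quantization described above can be sketched as follows for a single hidden layer. This is a minimal illustration rather than the scheme used in [23]: the scale factors (127 divided by the largest weight magnitude, and 255 for activations), the rounding rule, and all function names are assumptions made for the example.

```python
import math


def quantize_weights(weights):
    """Linearly map float weights to signed 8-bit values in [-127, 127]."""
    max_abs = max(abs(w) for row in weights for w in row) or 1.0
    scale = 127.0 / max_abs  # assumed scaling rule, not taken from [23]
    return [[int(round(w * scale)) for w in row] for row in weights], scale


def quantize_activations(acts):
    """Map sigmoid outputs in (0, 1) to unsigned 8-bit values in [0, 255]."""
    return [min(255, max(0, int(round(a * 255.0)))) for a in acts]


def quantized_layer(q_weights, w_scale, biases, q_acts):
    """Integer multiply-accumulate, then a floating-point sigmoid."""
    total_scale = w_scale * 255.0
    outputs = []
    for row, b in zip(q_weights, biases):
        acc = int(round(b * total_scale))  # bias kept as a wide integer
        acc += sum(w * a for w, a in zip(row, q_acts))
        outputs.append(1.0 / (1.0 + math.exp(-acc / total_scale)))
    return outputs


def float_layer(weights, biases, acts):
    """Floating-point reference for comparison."""
    return [1.0 / (1.0 + math.exp(-(b + sum(w * a for w, a in zip(row, acts)))))
            for row, b in zip(weights, biases)]


weights = [[0.5, -0.25, 0.1], [-0.8, 0.3, 0.05]]
biases = [0.1, -0.2]
acts = [0.2, 0.7, 0.9]

q_w, w_scale = quantize_weights(weights)
approx = quantized_layer(q_w, w_scale, biases, quantize_activations(acts))
exact = float_layer(weights, biases, acts)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

In this toy case the quantized outputs stay within about 0.01 of the floating-point reference, while each weight and activation occupies one byte instead of four.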
When the integer SSSE3 instruction set, which allows 16-way parallel computation, is applied to the 8-bit quantized values, the computation time drops by a further two-thirds, to 0.51 real time. The integer SSE4 instruction set adds one small optimization, a single instruction for 16-bit to 32-bit conversion, which yields a slight additional gain and reduces the time to 0.47 real time.
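To see why 8-bit values permit 16-way parallelism, the sketch below emulates one 128-bit integer multiply-add step in plain Python. It is modeled on the SSSE3 PMADDUBSW instruction, which multiplies 16 unsigned bytes by 16 signed bytes and sums adjacent products into eight saturated 16-bit lanes. The text does not state which instructions were used, so this choice and the function names are assumptions.

```python
def maddubs_emulated(unsigned_bytes, signed_bytes):
    """Emulate one PMADDUBSW step on 16 byte lanes.

    Adjacent products are summed in pairs and saturated to signed 16 bits,
    giving 8 results per 128-bit register.
    """
    assert len(unsigned_bytes) == len(signed_bytes) == 16
    lanes = []
    for i in range(0, 16, 2):
        pair_sum = (unsigned_bytes[i] * signed_bytes[i]
                    + unsigned_bytes[i + 1] * signed_bytes[i + 1])
        lanes.append(max(-32768, min(32767, pair_sum)))
    return lanes


def dot_u8_s8(acts, weights):
    """Dot product of unsigned 8-bit activations and signed 8-bit weights."""
    assert len(acts) == len(weights) and len(acts) % 16 == 0
    total = 0  # stands in for a 32-bit accumulator
    for i in range(0, len(acts), 16):
        lanes16 = maddubs_emulated(acts[i:i + 16], weights[i:i + 16])
        total += sum(lanes16)  # 16-bit lanes widened before accumulation
    return total


acts = [i % 7 for i in range(32)]
weights = [((i * 3) % 11) - 5 for i in range(32)]
result = dot_u8_s8(acts, weights)
reference = sum(a * w for a, w in zip(acts, weights))
```

The widening of the 16-bit lanes before accumulation corresponds to the 16-bit to 32-bit conversion that SSE4 performs in a single instruction. Note that the 16-bit saturation can clip large pair sums, so a real implementation has to keep the value ranges in check.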
In speech recognition, even in the online recognition mode, it is common to incorporate a lookahead of a few hundred milliseconds, especially at the beginning of an utterance, to help improve runtime estimates of speech and noise statistics. This means that processing frames in small batches spanning tens of milliseconds will not affect latency much. To take full advantage of batching, the batches have to be propagated through the neural network layers in bulk, so that every linear computation becomes a matrix–matrix multiplication, which can take advantage of CPU caching of both weights and activations. Batching further reduces the computation time to 0.36 real time.
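The difference between frame-by-frame and batched propagation can be sketched as follows. In the batched version each weight row is fetched once and reused for every frame in the batch, which is the access pattern that benefits from caching. The function names are hypothetical, and pure Python does not reproduce the cache effect itself; the sketch only shows the loop structure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_single(weights, biases, frame):
    """Matrix-vector product: one frame at a time."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, frame)))
            for row, b in zip(weights, biases)]


def layer_batched(weights, biases, frames):
    """Matrix-matrix product: each weight row is reused across the batch."""
    outputs = [[0.0] * len(weights) for _ in frames]
    for j, (row, b) in enumerate(zip(weights, biases)):  # row fetched once
        for t, frame in enumerate(frames):               # reused per frame
            outputs[t][j] = sigmoid(b + sum(w * x for w, x in zip(row, frame)))
    return outputs


weights = [[0.5, -1.0], [0.25, 0.75], [-0.5, 0.5]]
biases = [0.0, 0.1, -0.1]
frames = [[1.0, 2.0], [0.5, -0.5], [0.0, 1.0]]  # a small batch of 3 frames

batched = layer_batched(weights, biases, frames)
one_by_one = [layer_single(weights, biases, f) for f in frames]
```

Both versions produce the same activations; only the order in which weights and frames are visited changes.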
One last trick to further improve the decoding speed is to compute the senone posteriors only if needed. It is well known that during decoding, at every frame, only a fraction (25–35 %) of the state scores ever need to be computed. In the GMM–HMM system, this can be easily exploited since every state has its own small set of Gaussians. In the DNN, however, all the hidden layers are shared and need to be computed even if only one state is active. The exception is the last layer, in which only the neurons corresponding to the necessary state posteriors need to be computed. This means we can lazily evaluate the output layer. Evaluating the output layer in a lazy manner, however, makes the matrix computation less efficient and hence introduces a small fixed relative cost of about 22 %. Overall, using lazy evaluation (without batching) can reduce the time to 0.26 real time, since the output layer dominates the computation (it typically accounts for around 50 % of it).
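A minimal sketch of lazy output-layer evaluation is given below. The class name and interface are hypothetical. For simplicity it returns the pre-softmax activation of each requested senone; how the scores are normalized into posteriors without evaluating every output is not covered here and is left as an assumption.

```python
class LazyOutputLayer:
    """Computes an output neuron only when the decoder asks for its score."""

    def __init__(self, weights, biases):
        self.weights = weights  # one row of weights per senone
        self.biases = biases
        self._hidden = None
        self._cache = {}
        self.evaluations = 0    # number of neurons actually computed

    def start_frame(self, hidden):
        """Store the last hidden layer's activations for the new frame."""
        self._hidden = hidden
        self._cache = {}

    def score(self, state):
        """Return the pre-softmax activation for one senone, on demand."""
        if state not in self._cache:
            row = self.weights[state]
            self._cache[state] = self.biases[state] + sum(
                w * h for w, h in zip(row, self._hidden))
            self.evaluations += 1
        return self._cache[state]


layer = LazyOutputLayer(
    weights=[[1.0, 2.0], [0.0, -1.0], [3.0, 0.0], [-2.0, 4.0]],
    biases=[0.0, 0.5, -1.0, 0.25])
layer.start_frame([0.5, 0.25])
requested = [layer.score(2), layer.score(0), layer.score(2)]
print(requested)          # [0.5, 1.0, 0.5]
print(layer.evaluations)  # 2
```

Here only two of the four output neurons are evaluated for the frame, and the repeated request for state 2 is served from the cache.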
With lazy evaluation, however, one can no longer compute batches of output scores for all outputs across multiple frames, although we can continue to batch the computation of all hidden layers. Furthermore, since a state is very likely to be needed at frame t + 1 if it is needed by the decoder at frame t, it is still possible to compute a batch of these posteriors for consecutive frames at the same time while the weights are in cache. Combining lazy evaluation with batching further reduces the DNN computation time to 0.21 real time.
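The combination of lazy evaluation and batching can be sketched as follows: for each senone active at frame t, the scores for a short run of consecutive frames are computed in one pass over its weight row. The function name is hypothetical, and the scores are again pre-softmax activations. Scores computed speculatively for later frames are wasted if the state turns out not to be needed there.

```python
def lazy_batched_scores(out_weights, out_biases, hidden_frames, active_states):
    """Score only the active senones, for several consecutive frames at once.

    hidden_frames holds the last hidden layer's activations for frames
    t, t + 1, ..., which are available because hidden layers are batched.
    """
    scores = {}
    for state in active_states:
        row, bias = out_weights[state], out_biases[state]  # fetched once
        scores[state] = [bias + sum(w * h for w, h in zip(row, hidden))
                         for hidden in hidden_frames]      # reused per frame
    return scores


out_weights = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.0], [-2.0, 4.0]]
out_biases = [0.0, 0.5, -1.0, 0.25]
hidden_frames = [[0.5, 0.25], [1.0, 0.0]]  # frames t and t + 1

block = lazy_batched_scores(out_weights, out_biases, hidden_frames, [0, 2])
print(block)  # {0: [1.0, 1.0], 2: [0.5, 2.0]}
```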
Overall, the engineering optimization techniques achieve a nearly 20-fold speedup over the naive implementation (from 3.89 real time to 0.21 real time). Note that this 0.21 real time covers only the computation of the DNN posterior probabilities. The decoder also needs to search over all possible state sequences, which typically adds another 0.2–0.3 real time on average and 0.6–0.7 real time in extreme cases, depending on the language model perplexity and the beam used in the search. In total, the complete decoding time is within real time without loss of accuracy.
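The figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that the DNN and search costs simply add, that is, the two stages are not overlapped.

```python
baseline_rtf = 3.89   # naive floating-point implementation
dnn_rtf = 0.21        # lazy evaluation combined with batching

speedup = baseline_rtf / dnn_rtf
print(round(speedup, 1))  # 18.5

search_typical = (0.2, 0.3)  # search cost on average
search_extreme = (0.6, 0.7)  # search cost in extreme cases

total_typical = tuple(round(dnn_rtf + s, 2) for s in search_typical)
total_extreme = tuple(round(dnn_rtf + s, 2) for s in search_extreme)
print(total_typical)  # (0.41, 0.51)
print(total_extreme)  # (0.81, 0.91)
```

Even in the extreme case the total stays below 1.0, which is consistent with the claim that complete decoding runs within real time.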