Part III Deep Neural Network-Hidden Markov Model

7.2 Decoding Speedup

7.2.2 Sparse Network

On some devices, such as smartphones, SIMD instructions may not be available. Under those conditions, we can still use 8-bit quantization to improve the decoding speed. However, we cannot use many of the other parallel computation techniques discussed in Sect. 7.2.1.

Fortunately, by inspecting fully connected DNNs after training, we notice that a large portion of all connections have very small weights. For example, in a typical DNN used in speech recognition, the magnitude of 70 % of the weights is below 0.1 [25]. This means we may reduce the model size and increase the decoding speed by removing connections with small weight magnitudes. Note that we do not observe similar patterns in the bias parameters. This is expected, since nonzero bias terms indicate the shift of hyperplanes from the origin. However, given that the number of bias parameters is very small compared to that of weight parameters, keeping the bias parameters intact does not affect the final model size and decoding speed in a noticeable way.
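The inspection described above can be sketched in a few lines. This is only an illustration: the function name `fraction_below` is hypothetical, and the weight matrix is a synthetic stand-in drawn from a Gaussian (an assumption, not data from [25]), so the printed fraction says nothing about a real acoustic model.

```python
import random


def fraction_below(weights, threshold=0.1):
    """Return the fraction of weights whose magnitude is below `threshold`."""
    flat = [w for row in weights for w in row]
    small = sum(1 for w in flat if abs(w) < threshold)
    return small / len(flat)


# Synthetic stand-in for one trained weight matrix (assumption: zero-mean
# Gaussian entries). A real check would load the trained DNN weights instead.
rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(64)] for _ in range(64)]

print(f"{fraction_below(W):.0%} of the weights have magnitude below 0.1")
```

Bias vectors would be left out of such a count, since the text keeps them intact.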

There are many ways to generate sparse models. For example, the task of enforcing sparseness can be formulated as a multiobjective optimization problem, since we want to minimize the cross entropy and the number of nonzero weights at the same time. This two-objective optimization problem can be converted into a single-objective optimization problem with L1 regularization. Unfortunately, this formulation does not work well with the stochastic gradient descent (SGD) training algorithm often employed in DNN training [25], because the subgradient update does not lead to precisely sparse solutions. To enforce a sparse solution, one often truncates the solution after every T steps by forcing parameters with magnitude smaller than a threshold θ to zero [11]. This truncation step, however, is somewhat arbitrary, and T is difficult to select correctly. In general, it is not desirable to use a small T (e.g., 1), especially when the minibatch size is small, since in that case each SGD update step modifies the weights only slightly. When a parameter is close to zero, it remains so after several SGD updates and will be rounded back to zero if T is not sufficiently large. Consequently, truncation can be done only after a reasonably large number of steps T, in the hope that nonzero coefficients have sufficient time to rise above θ. On the other hand, a large T means that every time the parameters are truncated, the training criterion degrades, and a similar number of steps is needed to compensate for the loss.
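The periodic truncation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of [11] or [25]: the function names, the toy quadratic loss, and all hyperparameter values (`lr`, `l1`, `theta`, `T`) are hypothetical choices made only to show the mechanism.

```python
def sign(w):
    """Subgradient of |w|; 0 is used at w == 0 (one valid choice)."""
    return (w > 0) - (w < 0)


def truncate(weights, theta):
    """Force every weight with magnitude below theta to exactly zero."""
    return [0.0 if abs(w) < theta else w for w in weights]


def sgd_with_truncation(weights, grad_fn, lr, l1, theta, T, steps):
    """Subgradient descent on loss + l1 * sum(|w|), truncating every T steps."""
    for step in range(1, steps + 1):
        grads = grad_fn(weights)
        weights = [w - lr * (g + l1 * sign(w)) for w, g in zip(weights, grads)]
        if step % T == 0:
            weights = truncate(weights, theta)
    return weights


# Toy loss (assumption): 0.5 * sum((w - target)^2), whose gradient is w - target.
target = [1.0, 0.0, -0.5, 0.01]


def toy_grad(weights):
    return [w - t for w, t in zip(weights, target)]


result = sgd_with_truncation(
    [0.2, 0.2, 0.2, 0.2], toy_grad, lr=0.1, l1=0.01, theta=0.05, T=10, steps=100
)
print(result)
```

Without the `truncate` call, the two small coordinates would hover near zero without becoming exactly zero, which is the weakness of the plain subgradient update noted in the text.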

Another well-known approach [8, 13] prunes the weights after training converges, based on second-order derivatives. Unfortunately, these algorithms are difficult to scale up to the large training sets typically used in speech recognition, and their advantages vanish if additional training iterations are carried out on the pruned weights.

A third approach, which performs well and generates good models, is to formulate the problem as an optimization problem over the DNN weights W with a convex constraint

$$\|\mathbf{W}\|_0 \le q,$$

where q is a threshold value for the maximal number of nonzero weights allowed. This constrained optimization problem is hard to solve. However, an approximate solution can be found by following two observations. First, after sweeping through the full training set several times, the weights become relatively stable: their magnitudes tend to remain either large or small. Second, in a stabilized model, the importance of a connection is approximated well by the magnitude of its weight.¹ This leads to a very simple yet efficient and effective algorithm.

We first train a fully connected DNN by sweeping through the full training set several times. We then keep only the connections whose weight magnitudes are among the q largest. Finally, we continue training the DNN while keeping the same sparse connections unchanged. This can be achieved either by masking the pruned connections or by rounding weights with magnitude below min{0.02, θ/2} to zero, where θ is the minimal weight magnitude that survived the pruning and 0.02 is determined by examining the patterns of weights in the fully connected network. The masking approach is cleaner but requires storage of a huge masking matrix. The rounding alternative is cheaper but trickier, since it is important to round to zero only weights smaller than min{0.02, θ/2}, instead of θ. Otherwise, surviving weights that shrink slightly during training could suddenly be removed. In addition, after the pruning it is very important to continue training to remedy the accuracy degradation caused by the sudden removal of the small weights.
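The prune-then-continue procedure above can be sketched as follows. This is an illustrative sketch only: a flat list stands in for all the weights of the network, the function names and the example values are hypothetical, and the gradients are made-up placeholders rather than real backpropagation output.

```python
def prune_to_top_q(weights, q):
    """Keep the q largest-magnitude weights and zero the rest.

    Returns the pruned weights, a boolean keep-mask, and theta, the
    smallest magnitude that survived the pruning.
    """
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:q])
    mask = [i in kept for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    theta = abs(weights[order[q - 1]])
    return pruned, mask, theta


def masked_step(weights, grads, mask, lr):
    """Masking variant: update only the surviving connections."""
    return [w - lr * g if keep else 0.0 for w, g, keep in zip(weights, grads, mask)]


def rounded_step(weights, grads, theta, lr):
    """Rounding variant: update everything, then zero weights below the cutoff."""
    cutoff = min(0.02, theta / 2)  # deliberately smaller than theta itself
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return [0.0 if abs(w) < cutoff else w for w in updated]


# Hypothetical weights after the initial dense training sweeps.
weights = [0.9, -0.03, 0.4, 0.01, -0.6, 0.05]
pruned, mask, theta = prune_to_top_q(weights, q=3)

# One continued-training step with placeholder gradients.
grads = [0.1] * len(weights)
after_mask = masked_step(pruned, grads, mask, lr=0.1)
after_round = rounded_step(pruned, grads, theta, lr=0.1)
print(pruned, theta)
print(after_mask)
print(after_round)
```

The masking variant has to carry `mask` alongside the weights, while the rounding variant needs only the scalar `theta`, which mirrors the storage trade-off described in the text.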

¹ More precisely, it can be approximated by the magnitudes of the product of the weights and the input values. However, the magnitudes of the input values are relatively uniform within each layer since, on the input layer, features are normalized to zero mean and unit variance, and hidden layer values are probabilities.

Tables 7.4 and 7.5, provided in [25], summarize the experimental results on the voice search (VS) and Switchboard (SWB) datasets described in Sect. 6.2.1. By exploiting the sparseness property in the model, we can obtain a 0.2–0.3 % absolute error reduction and simultaneously reduce the number of connections to only about 30 % on both the VS and SWB datasets. Alternatively, we can reduce the number of weights to 12 % and 19 %, respectively, on the VS and SWB datasets without sacrificing recognition accuracy. In that case, the CD-DNN-HMM is only 1.5 and 0.3 times as large as the CD-GMM-HMM on the VS and SWB datasets, respectively, and takes only 18 % and 29 % of the model size of the fully connected models. This translates to reducing the DNN computation to only 14 % and 23 % of that needed by the fully connected models on the VS and SWB datasets, respectively, if SIMD instructions are not available.

Table 7.4 Model size, computation time, and sentence error rate (SER) with and without sparseness constraints on the VS dataset

Acoustic model   # nonzero params   % nonzero params   Dev SER (%)   Test SER (%)
GMM, MPE         1.5M               –                  34.5          36.2
DNN, CE          19.2M              Fully connected    28.0          30.4
                 12.8M              67 %               27.9          30.3
                 8.8M               46 %               27.7          30.1
                 6.0M               31 %               27.7          30.1
                 4.0M               21 %               27.8          30.2
                 2.3M               12 %               27.9          30.4
                 1.0M               5 %                29.7          31.7

The fully connected DNN contains 5 hidden layers, each with 2,048 neurons. The OOV rate for both the dev and test sets is about 6 % (summarized from Yu et al. [25]).

Table 7.5 Model size, computation time, and word error rate (WER) with and without sparseness constraints on the SWB dataset

Acoustic model   # nonzero params   % nonzero params   Hub5’00 FSH (%)   RT03S SWB (%)
GMM, BMMI        29.4M              –                  23.6              27.4
DNN, CE          45.1M              Fully connected    16.4              18.6
                 31.1M              69 %               16.2              18.5
                 23.6M              52 %               16.1              18.5
                 15.2M              34 %               16.1              18.4
                 11.0M              24 %               16.2              18.5
                 8.6M               19 %               16.4              18.7
                 6.6M               15 %               16.5              18.7

The fully connected DNN contains 7 hidden layers, each with 2,048 neurons (summarized from Yu et al. [25]).
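The parameter-count ratios quoted in the text can be checked directly against the counts in Tables 7.4 and 7.5. The short script below is only an arithmetic check; the dictionary layout and variable names are hypothetical, while the numbers (in millions of parameters) are copied from the two tables.

```python
# Parameter counts in millions, taken from Tables 7.4 (VS) and 7.5 (SWB).
models = {
    "VS": {"gmm": 1.5, "dense_dnn": 19.2, "sparse_dnn": 2.3},
    "SWB": {"gmm": 29.4, "dense_dnn": 45.1, "sparse_dnn": 8.6},
}

ratios = {}
for name, m in models.items():
    kept = m["sparse_dnn"] / m["dense_dnn"]  # fraction of weights kept
    vs_gmm = m["sparse_dnn"] / m["gmm"]      # size relative to the GMM system
    ratios[name] = (kept, vs_gmm)
    print(f"{name}: {kept:.0%} of the weights kept, {vs_gmm:.1f} times the GMM size")
```

The 18 % and 29 % model-size figures and the 14 % and 23 % computation figures cannot be derived from the parameter counts alone, so they are not reproduced here.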

The sparse weights learned generally have random patterns. This prevents the sparse model from being very efficient in both storage and computation, even if a high degree of sparseness can be achieved, especially when SIMD parallelization is used.
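One way to see where the overhead comes from is to store a randomly patterned sparse matrix in compressed sparse row (CSR) form. CSR is used here only as a common example layout; the source does not say which storage format was used, and the function names and the small matrix below are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)   # the nonzero weight itself
                col_idx.append(j)  # extra storage: which column it came from
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W x while touching only the stored nonzero weights."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, data-dependent load
        y.append(acc)
    return y


# Hypothetical pruned weight matrix with an irregular nonzero pattern.
W = [
    [0.9, 0.0, 0.0, -0.6],
    [0.0, 0.0, 0.4, 0.0],
    [0.0, 0.7, 0.0, 0.0],
]
values, col_idx, row_ptr = to_csr(W)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
print(values, col_idx, row_ptr)
print(y)
```

Every stored weight carries a column index, so the model shrinks by less than the fraction of weights removed, and the indexed reads of `x` follow no regular stride, which makes them hard to vectorize.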

In document Automatic Speech Recognition (Page 146-148)