Pipelined Backpropagation Using Multiple GPUs

Training and Decoding Speedup

7.1 Training Speedup

7.1.1 Pipelined Backpropagation Using Multiple GPUs

It is well known that the minibatch-based stochastic gradient descent (SGD) training algorithm scales easily to large datasets on a single computing device. However, the need to use minibatches of only a few hundred samples makes parallelization difficult. This is because the model is updated after each minibatch, so naive parallelization across data requires prohibitive bandwidth. For example, a typical 2k × 7 (7 hidden layers, each with 2k neurons) CD-DNN-HMM has around 50–100 million floating-point parameters, or 200–400 million bytes. Each minibatch would require the distribution of 400 MB worth of gradients and the gathering of another 400 MB of model parameters, per server. If each minibatch takes about 500 ms to compute, which is easy to achieve on GPGPUs, we get close to the PCIe-2 limit (about 6 GB/s).
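As a rough check of the arithmetic above, the following Python sketch computes the traffic generated per server and in aggregate. The 4 bytes per parameter follow from the 200–400 MB figure; the count of four servers is a hypothetical value chosen here for illustration and is not given in the text.

```python
# Back-of-the-envelope bandwidth estimate for naive data parallelism.
BYTES_PER_PARAM = 4                 # 32-bit floats (implied by 100M params = 400 MB)
params = 100e6                      # upper end of the 50-100 million range
model_bytes = params * BYTES_PER_PARAM

minibatch_seconds = 0.5             # 500 ms per minibatch, from the text
bytes_per_minibatch = 2 * model_bytes   # 400 MB of gradients + 400 MB of parameters

per_server_rate = bytes_per_minibatch / minibatch_seconds   # bytes per second
num_servers = 4                     # hypothetical server count (assumption)
aggregate_rate = per_server_rate * num_servers

print(f"model size: {model_bytes / 1e6:.0f} MB")
print(f"per-server traffic: {per_server_rate / 1e9:.1f} GB/s")
print(f"aggregate for {num_servers} servers: {aggregate_rate / 1e9:.1f} GB/s")
```

Under these assumptions each server generates 1.6 GB/s, so a handful of servers sharing one bus is enough to reach the quoted limit of about 6 GB/s.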

The minibatch size, however, is typically determined by two competing factors. A smaller minibatch means more frequent model updates but less effective use of the GPU’s computing power. A larger minibatch can be computed more efficiently, but the whole training process then requires more passes over the training set. An optimal minibatch size is obtained by balancing these two factors.

Figure 7.1 shows runtime (right y-axis) and early frame accuracies (left y-axis) for different minibatch sizes (x-axis) after seeing 12 h of data. In these experiments, whenever the minibatch size is larger than 256, it was set to 256 for the first 2.4 h of data and then increased to the actual size. It can be seen that the optimal minibatch size is in the 256–1,024 range.

If the NVidia S2090 (hosted on a T620), the best GPU listed in Fig. 7.1, is used, training a well-performing DNN with the cross-entropy criterion on 300 h of training data takes about 15 days. If 2,000 h of data are used, the total time increases to 45 days. Note that the total time increases by about 3 times instead of 2,000/300 ≈ 6.7 times. This is because, although each pass over the data costs 6.7 times as much, fewer passes are needed when training with SGD. The training time can be reduced to about 20 days if a newer GPU model such as the K20X is used. Even with this new GPU, however, training with 20,000 h of data still takes over two months. Parallel training algorithms are thus very important for supporting training with large datasets.

Fig. 7.1 Relative runtime for different minibatch sizes and GPU/CPU model types, and corresponding frame accuracy measured after seeing 12 h of data. Left y-axis: frame accuracy; right y-axis: runtime relative to C1060 with minibatch size of 2,048 (Figure from Chen et al. [5], permitted to use by ISCA)
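The scaling argument can be made concrete with a short calculation. This sketch assumes that total training time is proportional to the amount of data times the number of passes, which is a simplification introduced here; the input figures are the ones quoted for the S2090.

```python
# Figures quoted in the text for the S2090 GPU.
hours_small, days_small = 300, 15
hours_large, days_large = 2000, 45

data_ratio = hours_large / hours_small   # growth in the cost of one pass
time_ratio = days_large / days_small     # observed growth in total training time
pass_ratio = time_ratio / data_ratio     # implied change in the number of passes

print(f"data ratio: {data_ratio:.1f}")
print(f"time ratio: {time_ratio:.1f}")
print(f"implied pass ratio: {pass_ratio:.2f}")
```

Under that assumption, the larger dataset needs a little under half as many passes as the smaller one.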

Let’s denote by K the number of GPUs (for example, 4), T the size of a minibatch (e.g., 1,024), L the number of hidden layers, N the dimension of each hidden layer (e.g., 2,048), and J the output dimension (number of senones, e.g., 9,304). If we use the simple classic map-reduce [7] approach, which achieves parallelization by splitting the training data, it would require accumulation/redistribution of gradients/models of the dimension of the entire model to/from a master server to the other K − 1 GPUs for each minibatch. On the shared bus between GPGPUs, the bandwidth per minibatch is O(N · (T + 2(L · N + J)(K − 1))). A tree-structured communication architecture could reduce it to O(N · (T + 2(L · N + J)⌈log₂ K⌉)), where ⌈x⌉ is the smallest integer greater than or equal to x.
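To get a feel for the magnitudes involved, the two expressions can be evaluated for the example values. The sketch below treats the big-O expressions as exact counts of transferred values, which ignores constant factors, and uses L = 7 hidden layers as in the experiments reported later in this section; the function names are illustrative.

```python
import math

def data_parallel_cost(N, T, L, J, K):
    """Values moved per minibatch with a flat master/worker layout."""
    return N * (T + 2 * (L * N + J) * (K - 1))

def tree_data_parallel_cost(N, T, L, J, K):
    """Same quantity with tree-structured communication."""
    return N * (T + 2 * (L * N + J) * math.ceil(math.log2(K)))

K, T, N, L, J = 4, 1024, 2048, 7, 9304
flat = data_parallel_cost(N, T, L, J, K)
tree = tree_data_parallel_cost(N, T, L, J, K)
print(f"flat: {flat:,} values, tree: {tree:,} values, ratio {flat / tree:.2f}")
```

For K = 4 the tree layout replaces the factor K − 1 = 3 with ⌈log₂ 4⌉ = 2, so the saving is modest; it grows with the number of GPUs.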

Alternatively, we can partition each layer’s model parameters into stripes and distribute the stripes across different computing nodes. In this node parallelization approach, each GPU holds one out of K vertical stripes of each layer’s parameters and gradients. Model updates happen only locally within each GPU. In forward computation, each layer’s input v^(ℓ−1) gets distributed to all GPUs, each of which computes a slice of the output vector v^ℓ. The slices are then distributed to all other GPUs for computing the next layer. In backpropagation, error vectors are parallelized as slices, but the resulting matrix products from each slice are partial sums that need to be further summed up. As a result, in both forward computation and backpropagation, each vector is transferred K − 1 times. The bandwidth is O(N · (K − 1) · T · (2L + 1)).
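The difference between the forward and backward directions under striping can be seen on a toy example. The following pure-Python sketch covers only the linear part of one layer (no bias or nonlinearity) and splits a small weight matrix into K = 2 vertical stripes; the helper names and the toy values are hypothetical.

```python
def forward_slice(W, v, cols):
    """Row vector v times the given columns of W: one slice of the output."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in cols]

def backward_partial(W, e, cols):
    """Contribution of one stripe to W @ e: a partial sum over its columns."""
    return [sum(W[i][j] * e[j] for j in cols) for i in range(len(W))]

# Toy layer: 3 inputs, 4 outputs, striped over K = 2 hypothetical GPUs.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
stripes = [range(0, 2), range(2, 4)]   # one vertical stripe per GPU

v_in = [1, 0, -1]
# Forward: every GPU receives the full input and produces a slice of the output.
slices = [forward_slice(W, v_in, cols) for cols in stripes]
v_out = [x for s in slices for x in s]

e_out = [1, 1, 1, 1]
# Backward: each GPU yields a full-length partial sum; the partials are added.
partials = [backward_partial(W, e_out, cols) for cols in stripes]
e_in = [sum(p[i] for p in partials) for i in range(len(W))]

print(v_out)   # slices concatenated
print(e_in)    # partial sums accumulated
```

In the forward direction the slices are simply concatenated, whereas in the backward direction every GPU produces a full-length vector that must be summed with the others, which is why each vector ends up being transferred K − 1 times.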

The pipelined backpropagation [5, 18] avoids the multiple copying of data vectors of the striping method by distributing the layers across GPUs to form a pipeline. Data, instead of the model, flows from GPU to GPU. All GPUs work simultaneously on the data they have. As an example, in Fig. 7.2, the two-hidden-layer DNN is split and stored in three GPUs. When the first batch of training data comes in, it is processed in GPU1. The activations (outputs) of hidden layer 1 are then passed to GPU2 for processing. At the same time, a new batch comes in and is processed in GPU1. After three batches, all GPUs are occupied. This suggests a speedup of three if all layers are balanced. The backpropagation pass is processed in a similar way. After six batches, all GPUs process both a forward batch and a backward batch. Since GPUs use the single instruction multiple data (SIMD) architecture, we can update the model first and then do the forward pass at each layer. This guarantees that the most recently updated weights are used when the forward computation is conducted, which reduces the delayed-update problem discussed below.

Fig. 7.2 Illustration of a pipelined parallelization paradigm (forward and backward passes shown for minibatches n − 5 through n)
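The way the pipeline fills up can be traced with a small scheduling sketch. It assumes an idealized timing model in which every stage takes exactly one step and the top GPU starts the backward pass of a batch one step after its forward pass; this model and the function name are assumptions made here, chosen so that the steady state matches the batch labels in Fig. 7.2.

```python
def pipeline_schedule(num_gpus, num_steps):
    """Return, per step, a list of (forward_batch, backward_batch) per GPU.

    Batches are numbered from 0; None means the GPU has no such work yet.
    """
    K = num_gpus
    schedule = []
    for t in range(num_steps):
        row = []
        for k in range(K):
            fwd = t - k                  # forward data reaches GPU k after k steps
            bwd = t - (2 * K - 1 - k)    # errors travel back down from the top GPU
            row.append((fwd if fwd >= 0 else None,
                        bwd if bwd >= 0 else None))
        schedule.append(row)
    return schedule

schedule = pipeline_schedule(num_gpus=3, num_steps=7)
for t, row in enumerate(schedule):
    print(f"step {t}: " + "  ".join(
        f"GPU{k + 1} fwd={f} bwd={b}" for k, (f, b) in enumerate(row)))
```

With three GPUs, all of them have forward work from the third batch onward (step 2), and from the sixth batch onward (step 5) every GPU handles both a forward and a backward batch.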

In the pipeline architecture, each vector travels twice per GPU, once for forward computation and once for backpropagation. The bandwidth is O(N · T · (2K − 1)), which is cheaper than both data parallelization and striping. If the number of layers L is larger than the number of GPUs K, several layers can be grouped on the same GPU.
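The three bandwidth expressions can be compared side by side for the example configuration. As before, this sketch reads the big-O expressions as exact counts, which drops constant factors, and takes L = 7 from the experiments reported later in this section; the function names are illustrative.

```python
import math

def data_parallel(N, T, L, J, K):
    """Tree-structured data parallelism."""
    return N * (T + 2 * (L * N + J) * math.ceil(math.log2(K)))

def striping(N, T, L, K):
    """Layer striping (node parallelization)."""
    return N * (K - 1) * T * (2 * L + 1)

def pipeline(N, T, K):
    """Pipelined backpropagation."""
    return N * T * (2 * K - 1)

K, T, N, L, J = 4, 1024, 2048, 7, 9304
costs = {
    "data parallel (tree)": data_parallel(N, T, L, J, K),
    "striping": striping(N, T, L, K),
    "pipeline": pipeline(N, T, K),
}
for name, cost in costs.items():
    print(f"{name:22s} {cost:>12,}  ({cost / costs['pipeline']:.1f}x pipeline)")
```

For these values the pipeline moves roughly 6 times less data than striping and roughly 13 times less than tree-structured data parallelism.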

Lastly, asynchronous data transfer and an appropriate order of execution allow most data transfers to happen in parallel with computation, which can reduce the effective communication time to close to zero.

Note that this efficiency comes at a cost: there is a mismatch between the weights used for forward computation and those used for backpropagation. For example, the weights used for forward computation of batch n on GPUs 1, 2, and 3 are updated after batches n − 5, n − 3, and n − 1, respectively. However, when computing the gradients, these weights have already been further updated on batches n − 1 and n − 2, respectively, on GPUs 2 and 1, although on GPU3 they are the same. This means that in the lower layers, due to the delay in the pipeline, the calculated gradients are not accurate. Based on this analysis, we can consider the delayed update as a special, complicated momentum technique in which the update (smoothed gradient) is a function of previous models and gradients. For this reason, when the pipeline is too long, performance degradation can be observed if the same minibatch size is used [5]. To alleviate the side effect of the delayed update, we need to reduce the minibatch size.
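The forward-pass lags in the three-GPU example (5, 3, and 1 batches) follow a simple pattern. The closed form below is inferred here from that example and from an idealized one-step-per-stage pipeline; it is not stated in the source, and the function name is hypothetical.

```python
def forward_weight_lag(gpu, num_gpus):
    """Batches between the last applied update and the forward batch.

    gpu is 1-indexed, with GPU1 holding the lowest layers. The formula
    2 * (K - gpu) + 1 is an inference from the three-GPU example.
    """
    return 2 * (num_gpus - gpu) + 1

K = 3
lags = {gpu: forward_weight_lag(gpu, K) for gpu in range(1, K + 1)}
for gpu, lag in lags.items():
    print(f"GPU{gpu}: forward pass of batch n uses weights updated after batch n-{lag}")
```

Under this reading the lag on the lowest GPU grows linearly with the pipeline length, which is consistent with the observation that long pipelines degrade performance unless the minibatch size is reduced.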

The key to achieving a large speedup is to balance the computation on each GPU. If the number of layers is a multiple of the number of GPUs and all layers have the same dimension, balancing is trivial. However, in CD-DNN-HMMs the softmax layer typically dominates the number of parameters, because the number of senones is often around 10 K while the hidden layer size is typically around 2 K. To balance the computation, we need to use striping for the softmax layer and pipelining for the rest.

Table 7.1, quoted from Chen et al. [5], shows training runtime using up to 4 GPUs (NVidia Tesla S2090) in a single server (Dell PowerEdge T620), measured for an input feature dimension of 429, L = 7 hidden layers, N = 2,048 hidden dimensions, and J = 9,304 senones. From this table, we can observe that speedups of 1.7–1.9 can be achieved on dual GPUs (e.g., reducing runtime from 61 to 33 min for a minibatch size of 512), at no accuracy loss despite the delayed-update approximation. To achieve this speedup, GPU1 contains six weight matrices (the 429 × 2,048 input matrix and five 2,048 × 2,048 hidden matrices) and GPU2 has only two due to the unbalanced softmax layer. The computation time ratio on these two GPUs is (429 + 5 × 2,048) × 2,048 : (2,048 + 9,304) × 2,048 = 0.94 : 1, and the two are thus very well balanced.

Table 7.1 Training runtime in minutes per 24 h of data for different parallelization configurations using pipelined backpropagation

Method                                              # GPUs   Minibatch size
                                                             256     512     1,024
Baseline (single GPU)                               1        68      61      59
Pipeline (0..5; 6..7)                               2        36      33      31
Pipeline (0..2; 3..4; 5..6; 7)                      4        32      29      [27]
Pipeline + striped top layer (0..3; 4..6; 7L; 7R)   4        20      18      [[18]]

[[·]] denotes divergence, and [·] denotes a greater than 0.1% word error rate (WER) loss on the test set (Quoted from Chen et al. [5])

Going to 4 GPUs using the pipeline alone barely helps. The overall speedup remains below 2.2 (e.g., ~61 vs. ~29 min). This is because the softmax layer is 4.5 times larger (9,304 × 2,048 parameters) than a hidden layer (2,048² parameters), and is thus the limiting bottleneck. The computation time ratio on the four GPUs is (429 + 2 × 2,048) × 2,048 : (2 × 2,048) × 2,048 : (2 × 2,048) × 2,048 : 9,304 × 2,048 = 1.1 : 1 : 1 : 2.27. In other words, GPU4 takes more than twice as long as the other GPUs. However, if pipelined BP is combined with the striping method, applied only to the softmax layer, a significant speedup can be achieved. In this configuration, the four GPUs are assigned layers (0..3; 4..6; 7L; 7R), where L and R denote the left and right stripe, respectively. In other words, two GPUs jointly form the top stage of the pipeline, while the lower 7 layers are pipelined on the other two GPUs. Under this condition, a similar calculation indicates that the computation cost ratio on the four GPUs is (429 + 3 × 2,048) × 2,048 : (3 × 2,048) × 2,048 : 4,652 × 2,048 : 4,652 × 2,048 = 1.07 : 1 : 0.76 : 0.76. At no word error rate (WER) loss, the fastest pipelined system (18 min to process 24 h of data with a minibatch size of 512) runs 3.3 times faster on four GPUs than the fastest single-GPU baseline (59 min to process 24 h of data with a minibatch size of 1,024).
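The balance ratios and the headline speedup quoted above can be reproduced with a few lines of Python. Like the text, this sketch takes the computation time of a GPU to be proportional to the number of weight parameters it holds; the variable and function names are illustrative.

```python
D_IN, N, J = 429, 2048, 9304   # input, hidden, and senone dimensions from the text

def stage_cost(rows):
    """Parameter count of a stage; every matrix shares one dimension of size N."""
    return rows * N

# Summed non-shared dimension of the weight matrices held by each GPU.
configs = {
    "2 GPUs (0..5; 6..7)": [D_IN + 5 * N, N + J],
    "4 GPUs (0..2; 3..4; 5..6; 7)": [D_IN + 2 * N, 2 * N, 2 * N, J],
    "4 GPUs (0..3; 4..6; 7L; 7R)": [D_IN + 3 * N, 3 * N, J // 2, J // 2],
}

ratios = {}
for name, rows in configs.items():
    costs = [stage_cost(r) for r in rows]
    reference = costs[1]                       # normalize to the second GPU
    ratios[name] = [round(c / reference, 2) for c in costs]
    print(name, ratios[name])

# Fastest pipelined system without WER loss vs. fastest single-GPU baseline.
best_speedup = 59 / 18
print(f"best speedup: {best_speedup:.1f}x")
```

The unstriped four-GPU split leaves the last GPU with more than twice the load of its neighbors, while striping the softmax layer brings all four stages to within about 30% of one another.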

The drawback of pipelined backpropagation is obvious. The overall speedup depends heavily on whether you can find a way to balance the computation across GPUs. In addition, due to the delayed-update effect, it is not easy to extend the same speedup to more GPUs.
