
A network utilising dropout takes longer to train in two senses: each training epoch takes several times longer, and the number of training epochs needed increases too. The remainder of this chapter presents a technique for speeding up training with dropout, resulting in a substantial reduction in the time needed per epoch.

Some machine learning libraries, such as Pylearn2, allow dropout to be applied in a batchwise manner instead of independently.[1] This is done in a naive fashion by replacing $d_k$ with a $1 \times n_k$ row matrix of independent Bernoulli($1 - p_k$) random variables, and then copying it vertically $b$ times to get the right shape. I now present an efficient version of batchwise dropout that takes advantage of the redundancy inherent in the dropout mask once the constraint of independence is relaxed.

[1] Pylearn2 source: apply_dropout() in mlp.py - https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/models/mlp.py

Chapter 5. Efficient Batchwise Dropout

To be practical, it is important that each training minibatch can be processed quickly. A crude way of estimating the processing time involved is to count the number of floating point multiplication operations needed (naively) to evaluate the matrix multiplications ($\times$) specified in Eqns. 5.1, 5.2 and 5.3 above, which results in the following expression:

$$\sum_{k=0}^{\ell-1} \Bigl( \underbrace{b \times n_k \times n_{k+1}}_{\text{forwards}} + \underbrace{n_k \times b \times n_{k+1}}_{\partial\text{cost}/\partial W} + \underbrace{b \times n_{k+1} \times n_k}_{\text{backwards}} \Bigr). \tag{5.4}$$

However, when taking into account the effect of the dropout mask, it can be seen that many of these multiplications are unnecessary. The $(i, j)$-th element of the $W_k$ weight matrix effectively 'drops out' of the calculations if unit $i$ is dropped in level $k$, or if unit $j$ is dropped in level $k+1$. Applying 50% dropout in levels $k$ and $k+1$ renders 75% of the multiplications unnecessary, because on average only a quarter of the entries of $W_k$ connect two surviving units.
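The 75% figure can be checked with a short calculation. The sketch below applies the per-layer count from Eq. 5.4 to a single weight matrix; the minibatch size and layer widths are illustrative assumptions, and the number of surviving units is taken to be its expected value.

```python
def multiplications(b, n_in, n_out):
    """Multiplications for one weight matrix, following Eq. 5.4:
    forwards + d(cost)/dW + backwards, each costing b * n_in * n_out."""
    return 3 * b * n_in * n_out


# Illustrative sizes (assumptions, not taken from the experiments).
b, n_k, n_k1 = 100, 1000, 1000
p = 0.5  # dropout probability in levels k and k+1

full = multiplications(b, n_k, n_k1)

# With batchwise dropout only the surviving units take part.
# The expected number of survivors is used here for simplicity.
kept_k = int(n_k * (1 - p))
kept_k1 = int(n_k1 * (1 - p))
reduced = multiplications(b, kept_k, kept_k1)

saving = 1 - reduced / full
print(f"full: {full}, reduced: {reduced}, saving: {saving:.0%}")
```

Halving both dimensions of the weight matrix quarters every term in the count, which is where the saving comes from.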

If independent dropout is applied to a network, then the parts of $W_k$ that disappear are different for each sample. This makes it effectively impossible to take advantage of the redundancy, as it is typically slower to check if a multiplication is necessary than to just do the multiplication. However, if batchwise dropout is used, then it becomes easy to take advantage of the redundancy by simply removing the redundant parts of the calculations.
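The difference between the two kinds of mask can be illustrated with NumPy. This is a minimal sketch under assumed sizes; the variable names are hypothetical and the tiling step mirrors the naive approach of copying one row of Bernoulli draws down the minibatch.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n_k, p_k = 8, 6, 0.5  # hypothetical minibatch size, layer width, dropout probability

# Independent dropout: one Bernoulli(1 - p_k) draw per sample and per unit.
independent_mask = rng.random((b, n_k)) < (1 - p_k)

# Batchwise dropout: a single 1 x n_k row of draws shared by every sample,
# copied vertically b times to get the right shape.
row = rng.random((1, n_k)) < (1 - p_k)
batchwise_mask = np.tile(row, (b, 1))

# A unit can be removed from the calculation only if it is dropped
# for every sample in the minibatch.
removable_independent = np.flatnonzero(~independent_mask.any(axis=0))
removable_batchwise = np.flatnonzero(~batchwise_mask.any(axis=0))

print("removable with independent dropout:", removable_independent)
print("removable with batchwise dropout:  ", removable_batchwise)
```

With a batchwise mask, every dropped unit is dropped for the whole minibatch, so the corresponding rows and columns can be cut out before any multiplication is done. With an independent mask, a unit is removable only in the unlikely event that it is dropped for all samples at once.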

The binary $1 \times n_k$ batchwise dropout matrices $d_k$ naturally define submatrices of the weight and hidden-unit matrices. Let $x_k^{\text{dropout}} := x_k[\,:, d_k]$ denote the submatrix of $x_k$ consisting of the level-$k$ hidden units that survive dropout. Let $W_k^{\text{dropout}} := W_k[d_k, d_{k+1}]$ denote the submatrix of $W_k$ consisting of weights that connect active units in level $k$ to active units in level $k+1$. The network can then be trained using the following equations:

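$$x_{k+1}^{\text{dropout}} = x_k^{\text{dropout}} \times W_k^{\text{dropout}} \tag{5.5}$$

$$\frac{\partial\text{cost}}{\partial W_k^{\text{dropout}}} = (x_k^{\text{dropout}})^{T} \times \frac{\partial\text{cost}}{\partial x_{k+1}^{\text{dropout}}} \tag{5.6}$$

$$\frac{\partial\text{cost}}{\partial x_k^{\text{dropout}}} = \frac{\partial\text{cost}}{\partial x_{k+1}^{\text{dropout}}} \times (W_k^{\text{dropout}})^{T} \tag{5.7}$$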
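As a sanity check, the three submatrix equations can be compared against the same computation carried out on the full matrices with explicit 0/1 masks. The NumPy sketch below is an illustration under assumed sizes and a fixed, hand-picked set of surviving units; like the equations, it leaves out the activation function, and the incoming gradient is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n_k, n_k1 = 4, 6, 5  # hypothetical sizes

x_k = rng.standard_normal((b, n_k))
W_k = rng.standard_normal((n_k, n_k1))
grad_x_k1 = rng.standard_normal((b, n_k1))  # stand-in for d(cost)/d(x_{k+1})

# Surviving units in levels k and k+1, stored as index arrays (chosen by hand).
d_k = np.array([0, 2, 5])
d_k1 = np.array([1, 3])

# Submatrices defined by the batchwise masks.
x_drop = x_k[:, d_k]                # x_k[:, d_k]
W_drop = W_k[np.ix_(d_k, d_k1)]     # W_k[d_k, d_{k+1}]
g_drop = grad_x_k1[:, d_k1]

x_k1_drop = x_drop @ W_drop         # Eq. 5.5
grad_W_drop = x_drop.T @ g_drop     # Eq. 5.6
grad_x_drop = g_drop @ W_drop.T     # Eq. 5.7

# Reference: the same quantities computed on full matrices with 0/1 masks.
m_k = np.zeros(n_k)
m_k[d_k] = 1.0
m_k1 = np.zeros(n_k1)
m_k1[d_k1] = 1.0

x_masked = x_k * m_k
g_masked = grad_x_k1 * m_k1
x_k1_full = (x_masked @ W_k) * m_k1
grad_W_full = x_masked.T @ g_masked
grad_x_full = (g_masked @ W_k.T) * m_k

print(np.allclose(x_k1_full[:, d_k1], x_k1_drop))
print(np.allclose(grad_W_full[np.ix_(d_k, d_k1)], grad_W_drop))
print(np.allclose(grad_x_full[:, d_k], grad_x_drop))
```

The masked reference multiplies full-size matrices and then discards most of the result, whereas the submatrix version never forms the discarded entries.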

The redundant multiplications have been eliminated. There is an additional benefit in terms of memory needed to store the hidden units: $x_k^{\text{dropout}}$ needs less space than $x_k$.

It should be emphasized that batchwise dropout only speeds up training; during testing the full $W_k$ matrix is used as normal, scaled by a factor of $1 - p_k$. However, machine learning research is often constrained by long training times and high costs of equipment. In Section 5.5 the results will show that, all other things being equal, batchwise dropout is similar to independent dropout, but faster. Moreover, with the increase in speed, all other things do not have to be equal. For instance, with the same resources, batchwise dropout can be used to

• increase the number of training epochs,
• increase the number of hidden units,
• increase the number of validation runs used to optimize "hyper-parameters", or
• train a number of independent copies of the network to form a committee.

These possibilities will often be useful as ways of improving generalization/reducing test error. The rest of this chapter will attempt to characterise the implications of replacing independent dropout with batchwise dropout.

In Section 5.6, batchwise dropout will be extended for use with convolutional networks. Dropout for convolutional networks is somewhat more complicated, as weights are shared across spatial locations. A minibatch passing up through a convolutional network might be represented at an intermediate hidden layer by an array of size $100 \times 32 \times 12 \times 12$: 100 samples, the output of 32 convolutional filters, at each of $12 \times 12$ spatial locations. It is conventional to use a dropout mask with shape $100 \times 32 \times 12 \times 12$; again, this will be called independent dropout. In contrast, if the goal is to apply batchwise dropout efficiently by adapting the submatrix trick, then the dropout mask will take on the shape $1 \times 32 \times 1 \times 1$. This looks like a significant change: the ensemble over which the average cost is optimised is different. During training, the error rates are higher. However, testing the networks gives very similar error rates.
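The two mask shapes for the convolutional case can be illustrated with NumPy broadcasting. This is a sketch only: the random activations stand in for a real hidden layer, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # dropout probability

# Stand-in hidden layer: 100 samples, 32 filters, 12 x 12 spatial locations.
hidden = rng.standard_normal((100, 32, 12, 12))

# Independent dropout: one draw per sample, filter and spatial location.
independent_mask = rng.random((100, 32, 12, 12)) < (1 - p)

# Batchwise dropout: one draw per filter, shared across samples and locations.
batchwise_mask = rng.random((1, 32, 1, 1)) < (1 - p)

# Broadcasting stretches the 1 x 32 x 1 x 1 mask over the whole array.
dropped_independent = hidden * independent_mask
dropped_batchwise = hidden * batchwise_mask

# Submatrix view: keep only the filters that survive.
kept_filters = np.flatnonzero(batchwise_mask.reshape(-1))
hidden_dropout = hidden[:, kept_filters]

print(hidden.shape, "->", hidden_dropout.shape)
```

Because a dropped filter is dropped everywhere, the surviving filters form a smaller dense array, which is what allows the submatrix trick to carry over to convolutional layers.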

Implementation

To see how much of a performance gain should be expected from removing 50% of the rows and columns of $W_k$ (assuming $p_k = 0.5$), the time complexity of the operations involved must be considered. In theory, for $n \times n$ matrices, addition is an $O(n^2)$ operation, and multiplication is $O(n^{2.37\ldots})$ by the Coppersmith–Winograd algorithm. This suggests that the bulk of the processing time should be spent doing matrix multiplication, and that a performance improvement of about 60% should be possible compared to networks using independent dropout, or no dropout at all. In practice, SGEMM functions use Strassen's algorithm or naive matrix multiplication, so performance improvements of up to 75% should be possible.[2]
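The 60% and 75% estimates follow from comparing one $n \times n$ product with two $(n/2) \times (n/2)$ products under a given cost exponent. The short sketch below reproduces that arithmetic; the function name is hypothetical and constant factors are ignored, as in the estimate itself.

```python
def relative_cost(exponent):
    """Cost of two (n/2) x (n/2) products relative to one n x n product,
    for a multiplication algorithm whose cost grows as n**exponent."""
    return 2 * 0.5 ** exponent


for name, exponent in [("Coppersmith-Winograd", 2.37), ("naive", 3.0)]:
    ratio = relative_cost(exponent)
    print(f"{name}: cost ratio {ratio:.2f}, saving {1 - ratio:.0%}")
```

The lower the exponent of the multiplication algorithm, the smaller the benefit of shrinking the matrices, which is why the naive algorithm gives the larger saving.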

Batchwise dropout was implemented for fully-connected and convolutional neural networks using CUDA/CUBLAS. It was clear that using the highly optimized cublasSgemm function to do the bulk of the work, with CUDA kernels used to form the submatrices $W_k^{\text{dropout}}$ and to update $W_k$ using $\partial\text{cost}/\partial W_k^{\text{dropout}}$, worked well. Better performance may well be obtained by writing an SGEMM-like matrix multiplication function that natively understands submatrices, though this was considered beyond the scope of the current work. For large networks and minibatches, batchwise dropout is substantially faster; see Fig. 5.2. The approximate overlap of some of the lines on the left indicates that 50% batchwise dropout reduces the training time in a similar manner to halving the number of hidden units.
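The gather and update steps described above can be mimicked on the CPU with NumPy indexing. The sketch below is not the CUDA implementation: the function names are hypothetical, and a plain gradient-descent update with learning rate `lr` is assumed purely for illustration.

```python
import numpy as np


def gather_submatrix(W, d_in, d_out):
    """Copy out the block of W linking surviving input and output units."""
    return W[np.ix_(d_in, d_out)]


def scatter_update(W, grad_W_drop, d_in, d_out, lr):
    """Apply a gradient step to the surviving block of W, in place.
    Entries belonging to dropped units are left untouched."""
    W[np.ix_(d_in, d_out)] -= lr * grad_W_drop


# Usage example with small hand-picked values.
W = np.arange(12.0).reshape(3, 4)
W_before = W.copy()
d_in, d_out = [0, 2], [1, 3]

W_drop = gather_submatrix(W, d_in, d_out)
grad_W_drop = np.ones_like(W_drop)  # stand-in for d(cost)/d(W_dropout)
scatter_update(W, grad_W_drop, d_in, d_out, lr=0.1)

changed = W != W_before
print(changed)
```

Only the entries of the full weight matrix that connect two surviving units change; the dense product between the gather and the update operates on ordinary contiguous matrices, which is the part a standard SGEMM routine can handle.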

The graph on the right shows the time saving obtained by using submatrices to implement dropout. Note that, for consistency with the left-hand side, the graph compares batchwise dropout with dropout-free networks, not with networks using independent dropout. The need to implement dropout masks for independent dropout means that Fig. 5.2 slightly undersells the performance benefits of batchwise dropout as an alternative to independent dropout.

[2] Coppersmith–Winograd would give $2(n/2)^{2.37} \approx 0.4n^{2.37}$, while the naive algorithm would give $2(n/2)^3 = 0.25n^3$.
