Having discussed objectives to optimize in section 2.2, it is now necessary to discuss how to carry out this optimization with respect to the model parameters $\theta$. Unfortunately, for deep neural networks there are no closed-form solutions to the optimization problems defined in equations 2.20 and 2.27. Instead, neural networks are optimized using iterative gradient descent methods [107]. A wide range of gradient descent optimizers for neural networks exists, including stochastic gradient descent with momentum [11, 40], RMSProp [40], Adagrad [29], AdaDelta [138] and Adam [62]. This section describes gradient descent optimization, details the Adam optimizer [62] used to train all models throughout this thesis, and discusses learning rate scheduling and initialization.
2.3.1 Gradient Descent Optimization
Gradient descent methods can be broadly split into Batch gradient descent and Stochastic gradient descent. Batch Gradient Descent minimizes the loss over the entire training dataset at each iteration $t$. A full pass over the entire training dataset is called an epoch, so for batch gradient descent each iteration is one epoch. At each iteration the parameters are updated by subtracting the gradient of the loss function $\mathcal{L}(y^{(i)}, x^{(i)}, \theta)$ with respect to the parameters $\theta$, averaged over the training data and multiplied by the learning rate $\eta$, as given in Algorithm 1. The gradient of the loss with respect to each parameter is obtained via the Error Back-Propagation (BP) [107] and Error Back-Propagation Through Time (BPTT) [87, 88, 86] algorithms for feed-forward and recurrent networks, respectively. Batch Gradient Descent yields the exact gradient of the loss on the training data, but each iteration may take a long time on datasets of millions or billions of data points.
Algorithm 1 Batch Gradient Descent
while $t \le T$ do
    $\theta^{(t+1)} \leftarrow \theta^{(t)} - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \mathcal{L}_{NLL}(y^{(i)}, x^{(i)}, \theta)$
end while
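As an illustration of Algorithm 1, the following is a minimal sketch in plain Python. It assumes a hypothetical one-dimensional linear model $w x + b$ with a squared-error loss standing in for $\mathcal{L}_{NLL}$, and toy data generated from $y = 2x + 1$; the model, the data, and the values of `eta` and `T` are illustrative assumptions and are not taken from the thesis.

```python
# Toy data from y = 2x + 1 (illustrative assumption, not from the thesis).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def batch_gradient_descent(xs, ys, eta=0.05, T=2000):
    """Fit y ~ w*x + b by averaging the gradient over the whole dataset."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(T):  # each iteration is one full pass, i.e. one epoch
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = w * x + b - y  # derivative of 0.5 * err**2 w.r.t. the prediction
            gw += err * x
            gb += err
        # theta <- theta - eta * (1/N) * sum of per-example gradients
        w -= eta * gw / n
        b -= eta * gb / n
    return w, b


w, b = batch_gradient_descent(xs, ys)
```

On this noiseless toy problem the parameters approach the generating values $w = 2$ and $b = 1$, at the cost of touching every data point for each single update.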
Instead of doing a full pass over the dataset before updating the parameters, it is possible to update the parameters after processing each datapoint, doing $N$ parameter updates per epoch. This is called Stochastic Gradient Descent [105]. This approach yields significantly faster convergence than Batch Gradient Descent, but at the cost of noisy gradient updates. In Stochastic Gradient Descent the gradient of the loss at each datapoint is a single-sample approximation to the gradient of the loss over the entire dataset. As a consequence, it may be necessary to use a far lower learning rate to avoid instabilities during training.
Algorithm 2 Stochastic Gradient Descent
while $t \le T \cdot N$ do
    $\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \nabla_\theta \mathcal{L}_{NLL}(y^{(t)}, x^{(t)}, \theta)$
end while
A compromise between Batch and Stochastic Gradient Descent is Stochastic Minibatch Gradient Descent, which minimizes the loss on stochastic minibatches of size $N_b \ll N$.
The size of the minibatch is typically between 16 and 256 datapoints. Sizes are usually chosen as powers of 2 because of the way data is best partitioned on modern GPUs. Seeing the same training examples in the same order every epoch may introduce unnecessary biases into the model. Thus, minibatches are reshuffled every epoch by sampling uniformly from the training data without replacement. In practice, Stochastic Gradient Descent and minibatch shuffling can be seen as forms of regularization, as the noise added to the gradients may prevent the model from over-fitting to the training dataset and help it avoid getting stuck in a local minimum.
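Algorithm 3 Stochastic Minibatch Gradient Descent
while $t \le T \cdot \frac{N}{N_b}$ do
    $\theta^{(t+1)} = \theta^{(t)} - \eta \cdot \frac{1}{N_b} \sum_{i=1}^{N_b} \nabla_\theta \mathcal{L}_{NLL}(y^{(i)}, x^{(i)}, \theta), \quad \{x^{(i)}, y^{(i)}\}_{i=1}^{N_b} \sim \mathcal{D}_{train}$
end while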
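Algorithm 3, together with the per-epoch shuffling described above, might be sketched as follows. The linear model, the squared-error loss used in place of $\mathcal{L}_{NLL}$, the toy data, and the names `minibatch_sgd`, `batch_size` and `seed` are all assumptions made for illustration.

```python
import random

# Toy data from y = 2x + 1 (illustrative assumption, not from the thesis).
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2.0 * x + 1.0 for x in xs]


def minibatch_sgd(xs, ys, eta=0.1, epochs=1000, batch_size=2, seed=0):
    """Fit y ~ w*x + b with minibatches that are reshuffled every epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # uniform sampling without replacement
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = w * xs[i] + b - ys[i]
                gw += err * xs[i]
                gb += err
            # average gradient over the minibatch, then take one step
            w -= eta * gw / len(batch)
            b -= eta * gb / len(batch)
    return w, b


w, b = minibatch_sgd(xs, ys)
```

Setting `batch_size=1` in this sketch gives Stochastic Gradient Descent as in Algorithm 2, while `batch_size=len(xs)` recovers Batch Gradient Descent.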
If stochastic gradient descent yields particularly noisy gradients, or the curvature of the loss surface is strong in some directions and weak in others, gradient descent may exhibit 'zig-zagging' behaviour along the loss surface, which slows convergence. Momentum methods counteract this by adding the previous parameter update, multiplied by the momentum rate $\alpha$ where $0 \le \alpha \le 1$, to the current parameter update. This increases the effective learning rate in the direction of consistent gradients and accelerates convergence.
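Algorithm 4 Stochastic Minibatch Gradient Descent with Momentum
while $t \le T \cdot \frac{N}{N_b}$ do
    $m^{(t)} = (1 - \alpha) \cdot \frac{1}{N_b} \sum_{i=1}^{N_b} \nabla_\theta \mathcal{L}_{NLL}(y^{(i)}, x^{(i)}, \theta) + \alpha \cdot m^{(t-1)}, \quad \{x^{(i)}, y^{(i)}\}_{i=1}^{N_b} \sim \mathcal{D}_{train}$
    $\theta^{(t+1)} = \theta^{(t)} - \eta \cdot m^{(t)}$
end while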
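The update in Algorithm 4 can be sketched as a single step function. To keep the example short it is applied to a hypothetical two-dimensional quadratic loss with very different curvature in each direction, using exact gradients rather than minibatch gradients; the loss, the function names, and the values of `eta` and `alpha` are assumptions for illustration only.

```python
def momentum_step(theta, m, g, eta=0.1, alpha=0.9):
    """One update of Algorithm 4, given the current (minibatch) gradient g."""
    # m(t) = (1 - alpha) * g(t) + alpha * m(t-1)
    m = [(1.0 - alpha) * gi + alpha * mi for gi, mi in zip(g, m)]
    # theta(t+1) = theta(t) - eta * m(t)
    theta = [ti - eta * mi for ti, mi in zip(theta, m)]
    return theta, m


# Hypothetical loss 0.5 * (1 * theta_0**2 + 50 * theta_1**2): weak curvature
# along the first axis, strong curvature along the second.
curvature = [1.0, 50.0]


def quadratic_grad(theta):
    return [a * t for a, t in zip(curvature, theta)]


theta, m = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    theta, m = momentum_step(theta, m, quadratic_grad(theta))
```

With `alpha=0.0` the step reduces to plain gradient descent, which makes it easy to compare the two behaviours on the same loss.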
Currently, Adam [62] is a popular state-of-the-art stochastic gradient-based optimization algorithm for neural networks. Adam, as well as other adaptive learning rate methods like Adagrad [29] and RMSprop [40], are pseudo-second-order methods which attempt to account for the curvature of the loss surface along each dimension. The advantage of Adam is that it combines momentum with automatic adjustment of the learning rate for each parameter.
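Algorithm 5 Adam
$s^{(0)} = r^{(0)} = 0$
while $t \le T \cdot \frac{N}{N_b}$ do
    $g^{(t)} = \frac{1}{N_b} \sum_{i=1}^{N_b} \nabla_\theta \mathcal{L}_{NLL}(y^{(i)}, x^{(i)}, \theta), \quad \{x^{(i)}, y^{(i)}\}_{i=1}^{N_b} \sim \mathcal{D}_{train}$
    $s^{(t)} = \beta_1 \cdot s^{(t-1)} + (1 - \beta_1) \cdot g^{(t)}$
    $r^{(t)} = \beta_2 \cdot r^{(t-1)} + (1 - \beta_2) \cdot g^{(t)} \odot g^{(t)}$
    $\hat{s}^{(t)} = \frac{s^{(t)}}{1 - \beta_1^t}$
    $\hat{r}^{(t)} = \frac{r^{(t)}}{1 - \beta_2^t}$
    $\theta^{(t+1)} = \theta^{(t)} - \eta \frac{\hat{s}^{(t)}}{\sqrt{\hat{r}^{(t)}} + \epsilon}$
end while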
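A single iteration of Algorithm 5 might look as follows in plain Python. This is a sketch rather than a reference implementation: the function name `adam_step`, the default values of `eta` and `eps`, the placement of `eps` outside the square root, and the example gradient are assumptions made for illustration.

```python
import math


def adam_step(theta, s, r, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for iteration t >= 1, given the minibatch gradient g."""
    # Biased first and second moment estimates.
    s = [beta1 * si + (1.0 - beta1) * gi for si, gi in zip(s, g)]
    r = [beta2 * ri + (1.0 - beta2) * gi * gi for ri, gi in zip(r, g)]
    # De-biased estimates; the correction matters most for small t.
    s_hat = [si / (1.0 - beta1 ** t) for si in s]
    r_hat = [ri / (1.0 - beta2 ** t) for ri in r]
    # Per-parameter step: eta * s_hat / (sqrt(r_hat) + eps).
    theta = [ti - eta * sh / (math.sqrt(rh) + eps)
             for ti, sh, rh in zip(theta, s_hat, r_hat)]
    return theta, s, r


theta0 = [1.0, -2.0]
g = [3.0, -0.5]  # hypothetical gradient with very different magnitudes
theta1, s1, r1 = adam_step(theta0, [0.0, 0.0], [0.0, 0.0], g, t=1)
steps = [a - b for a, b in zip(theta0, theta1)]
```

At $t = 1$ the de-biased estimates equal $g^{(1)}$ and $g^{(1)} \odot g^{(1)}$, so each parameter moves by approximately $\eta$ in the direction opposite to the sign of its gradient, regardless of the gradient's magnitude. This is one way to see the per-parameter learning rate adjustment at work.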
In Adam, the biased estimates of the first and second moments of the gradients, $s^{(t)}$ and $r^{(t)}$, are computed at each iteration. The biased estimate of the first moment $s^{(t)}$ directly incorporates momentum into the algorithm. The first and second moment estimates are then de-biased, yielding the unbiased estimates $\hat{s}^{(t)}$ and $\hat{r}^{(t)}$. This de-biasing matters only at the beginning of training: as training progresses, the correction factors $1 - \beta_1^t$ and $1 - \beta_2^t$ tend to one and the biased estimates approach the unbiased ones. The parameters are then updated by subtracting the ratio of the unbiased estimate of the first moment $\hat{s}^{(t)}$ to the square root of the unbiased estimate of the second moment $\hat{r}^{(t)}$, multiplied by the global learning rate $\eta$. This process is controlled by the hyperparameters $\beta_1$, $\beta_2$ and $\epsilon$, whose typical values are 0.9, 0.999 and
The potential drawbacks of Adam are that the estimates of the second-order moments may become stale near a local minimum and that Adam does not interact properly with weight decay regularization. Several extensions to Adam have been proposed to address these issues, such as Adamax [62], AMSGrad [104], and AdamW [75], but the performance improvements are not consistent. Thus, throughout this thesis the standard version of Adam is used to train all neural network models.
2.3.2 Learning Rate Schedules
The learning rate $\eta$ plays an important role in gradient descent optimization. If the learning rate is too high, then training may become unstable, and if the learning rate is too low, then the optimization process may take a long time to converge to a local minimum. Furthermore, due to the noisy nature of the gradients, stochastic gradient descent and minibatch stochastic gradient descent will not converge to a local minimum without a decaying learning rate. Thus, it is necessary to decay the learning rate over the course of training, so that the optimization can settle in some local (or global) minimum. Typically, an exponentially decaying learning rate is used:
$$\eta^{(t)} = \eta^{(0)} e^{-\lambda t} \tag{2.36}$$
where $\lambda$ is the decay constant and $\eta^{(0)}$ is the initial learning rate.
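Equation 2.36 translates directly into code. The short sketch below also shows one possible way to pick the decay constant, namely from a desired half-life of the learning rate; the half-life parameterisation, the function name and the numerical values are assumptions for illustration and do not come from the thesis.

```python
import math


def exponential_decay(eta0, decay, t):
    """Learning rate at iteration t: eta0 * exp(-decay * t)."""
    return eta0 * math.exp(-decay * t)


# Hypothetical choice: halve the learning rate every 100 iterations.
half_life = 100
decay = math.log(2.0) / half_life

eta_start = exponential_decay(0.1, decay, 0)
eta_half = exponential_decay(0.1, decay, half_life)
eta_quarter = exponential_decay(0.1, decay, 2 * half_life)
```

Setting $\eta^{(t)} / \eta^{(0)} = 1/2$ in equation 2.36 gives $\lambda = \ln 2 / t_{1/2}$, which is the relation used above.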
It is possible to consider alternative learning rate schedules. Cyclical learning rates are currently popular, as they allow for faster optimization. In particular, in this work the 1-Cycle Policy [115, 114] is sometimes used instead of the exponentially decaying learning rate. Here, a cycle length is defined in terms of a number of epochs. The learning rate is linearly increased from $\eta^{(0)}$ to $10\eta^{(0)}$ for the first half of the cycle and then linearly decayed back down to $\eta^{(0)}$ for the second half of the cycle. The learning rate is then linearly decayed to $\eta^{(0)}/100$ over the remaining epochs. fast.ai [45] report that this approach allows for significantly faster training of neural networks, which is why it was adopted in this work.
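The schedule described above can be written as a piecewise-linear function of the training step. The sketch below follows the description in the text; the function name, its arguments, and the choice to measure lengths in steps rather than epochs are assumptions made for illustration.

```python
def one_cycle_lr(t, eta0, cycle_len, total_len):
    """Learning rate at step t, for 0 <= t <= total_len and cycle_len <= total_len."""
    half = cycle_len / 2.0
    peak = 10.0 * eta0
    if t <= half:
        # First half of the cycle: linear increase from eta0 to 10 * eta0.
        return eta0 + (peak - eta0) * t / half
    if t <= cycle_len:
        # Second half of the cycle: linear decrease back down to eta0.
        return peak - (peak - eta0) * (t - half) / half
    # Remaining steps: linear decay from eta0 to eta0 / 100.
    frac = (t - cycle_len) / (total_len - cycle_len)
    return eta0 - (eta0 - eta0 / 100.0) * frac


# Hypothetical run: a cycle of 80 steps followed by 20 steps of final decay.
schedule = [one_cycle_lr(t, 0.01, 80, 100) for t in range(101)]
```

In this example the learning rate starts at 0.01, peaks at 0.1 at step 40, returns to 0.01 at step 80 and ends at 0.0001 at step 100.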
2.3.3 Initialization
An important factor in the success or failure of training a neural network is the choice of initialization scheme for the parameters $\theta$. Initialization is known to have a strong effect on the optimization process and the generalization of the resulting model, though how exactly initialization affects the latter property is not fully understood [40]. Furthermore, the effects of initialization on the network's capacity to generalize are more pronounced when there is less training data. As the quantity of available training data increases, initialization effects play a smaller role in generalization.
In general, initialization must accomplish two tasks: it must induce symmetry breaking and avoid saturating the non-linear activations, so that gradients can propagate effectively throughout the entire network. Symmetry breaking is necessary to force each neuron in each hidden layer to learn a different function, otherwise the effective capacity of the network is diminished. This can be accomplished by randomly initializing the parameters. However, the choice of distribution from which the parameters are initialized can have a strong effect on the optimization process. If the variance of the distribution is too large, then the non-linear activations may saturate or explode and impede gradient flow. On the other hand, if the variance is too small, then information may flow poorly through the network. Work by Glorot and Bengio [38] suggests initializing from either a uniform distribution or a normal distribution, where the bounds of the uniform distribution or the variance of the normal distribution are functions of the number of neurons in the previous layer ($\text{fan}_{in}$) and the number of neurons in the following layer ($\text{fan}_{out}$):
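$$\theta \sim \mathcal{U}\left(-\sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}},\ \sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}}\right), \qquad \theta \sim \mathcal{N}\left(0,\ \sqrt{\frac{2}{\text{fan}_{in} + \text{fan}_{out}}}\right) \tag{2.37}$$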
The motivation is to ensure that the scale of the activation of every neuron is such that the activation function operates within its linear region and does not saturate, which enables good gradient flow throughout the entire network. This is especially important for very deep networks. Empirically, the particular choice of uniform or normal distribution does not seem to matter greatly.
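The initialization scheme of equation 2.37 can be sketched with the Python standard library as follows. The second argument of the normal distribution is read here as a standard deviation, which is an interpretation of the equation; the function names, the layer sizes and the list-of-lists weight layout are likewise assumptions made for illustration.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def glorot_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from N(0, std**2)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


def variance(W):
    """Empirical variance of all entries of a weight matrix."""
    flat = [v for row in W for v in row]
    mean = sum(flat) / len(flat)
    return sum((v - mean) ** 2 for v in flat) / len(flat)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
W_u = glorot_uniform(256, 128, rng)
W_n = glorot_normal(256, 128, rng)
```

Under this reading both variants have the same variance, $2 / (\text{fan}_{in} + \text{fan}_{out})$, since a uniform distribution on $(-a, a)$ has variance $a^2 / 3$. This is consistent with the observation that the choice between them matters little in practice.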