
How hyperparameters are tuned is perhaps the most obscure area of deep learning. On the one hand, deep neural networks have many hyperparameters and are very sensitive to the settings of some of them. On the other hand, there are few known good strategies for setting them. When there are three or fewer hyperparameters, as in the case of Support Vector Machines, grid search or random search is the most desirable strategy. However, with dozens of hyperparameters, as is the case with complex neural architectures, this becomes infeasible. A recent paper by Zoph and Le (2016) of the Google Brain team reports using 800 GPUs to find the best settings. There are also a few studies that suggest using Bayesian optimisation for hyperparameter tuning (Snoek et al., 2012; Eggensperger et al., 2013).
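To illustrate why exhaustive search stops being practical as the number of hyperparameters grows, the following is a minimal sketch of grid search and random search over a discrete search space. The search space, the function names and the candidate values are hypothetical and chosen only for illustration; they are not the settings used in our experiments.

```python
import itertools
import random


def grid_size(values_per_hyperparameter, num_hyperparameters):
    """Number of configurations a full grid search has to evaluate."""
    return values_per_hyperparameter ** num_hyperparameters


def grid_search_configs(space):
    """Yield every combination of the candidate values in `space`."""
    names = list(space)
    for combo in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


def random_search_configs(space, num_samples, seed=0):
    """Draw `num_samples` configurations uniformly at random from `space`."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(num_samples)
    ]


# Hypothetical search space (illustrative values only).
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [64, 128, 256],
    "dropout": [0.3, 0.5],
}

all_configs = list(grid_search_configs(space))  # 3 * 3 * 2 = 18 configurations
sampled = random_search_configs(space, num_samples=5)

# With 5 candidate values each, 3 hyperparameters need 125 training runs,
# whereas 20 hyperparameters would need 5 ** 20 runs.
few = grid_size(5, 3)
many = grid_size(5, 20)
```

Random search keeps the number of training runs fixed regardless of the number of hyperparameters, at the cost of covering the space only sparsely.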

The most feasible approach is still manual search.12 However, in this case, a good understanding of the role of each hyperparameter is needed. In this section we will review the main hyperparameters of the architectures we use.

12 There are jokes suggesting an alternative name to intelligent manual search: GSD a.k.a. Grad-

Figure 2.18: Learning with too high a learning rate.

Figure 2.19: Learning with too low a learning rate.

Learning rate: When we train a model with a gradient-based method, such as stochastic gradient descent, we update the trainable parameters w as follows:

$$
w_i = w_i - \lambda \frac{\partial L(\mathbf{w})}{\partial w_i} \tag{2.26}
$$

where L(w) is the loss function, and λ is the learning rate. The learning rate determines how far we move the weights in the direction opposite to the gradient. This is perhaps the most important hyperparameter to set correctly. When it is set to too large a value, we risk overshooting the optimal point (see Figure 2.18). If the learning rate is too small, training becomes slower and we risk getting stuck at a local minimum (see Figure 2.19). That is why it is important to try a range of values when setting the learning rate.
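The effect of the learning rate can be sketched on a toy problem. The example below applies the update rule of Equation 2.26 to the one-dimensional loss L(w) = w², whose gradient is 2w. The loss, the starting point and the three learning rates are assumptions made for illustration. Note that this loss has a single minimum, so the sketch shows only the slow progress of a small learning rate, not the risk of getting stuck at a local minimum.

```python
def gradient_descent(grad, w0, learning_rate, num_steps):
    """Repeatedly apply w = w - learning_rate * grad(w) and record each value."""
    w = w0
    trajectory = [w]
    for _ in range(num_steps):
        w = w - learning_rate * grad(w)
        trajectory.append(w)
    return trajectory


def grad_quadratic(w):
    """Gradient of the toy loss L(w) = w ** 2 (minimum at w = 0)."""
    return 2.0 * w


# Illustrative learning rates, all starting from w = 1.0 for 50 steps.
too_high = gradient_descent(grad_quadratic, 1.0, learning_rate=1.1, num_steps=50)
too_low = gradient_descent(grad_quadratic, 1.0, learning_rate=0.001, num_steps=50)
moderate = gradient_descent(grad_quadratic, 1.0, learning_rate=0.1, num_steps=50)

# too_high: each step multiplies w by -1.2, so w oscillates and diverges.
# too_low:  each step multiplies w by 0.998, so w has barely moved.
# moderate: each step multiplies w by 0.8, so w approaches the minimum.
```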

Hyperparameters that increase the representation capacity: most parameters directly influence the model’s representation capacity. For example, increasing the number of hidden layers makes the model able to learn more complicated functions. The same happens when we increase the number of hidden units or the dimensionality of the word embeddings. The size of the convolution kernel falls into the same group. There are also binary hyperparameters, such as the use of a bidirectional RNN instead of a unidirectional RNN, and the use of LSTM instead of GRU.

Increasing the values of these hyperparameters increases the number of trainable parameters of the model, so that it can represent the training set better. Ideally, the values of these hyperparameters should be just as large as needed. In practice, it is very hard to find settings that match the problem’s complexity exactly. Thus, our general approach to tuning these hyperparameters is to set them slightly higher than needed, and then add strong regularisation. Setting them higher than actually needed makes the model prone to overfitting the training set. In other words, we let the model be able to overfit the training set and then regularise it with weight decay and dropout.
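As a rough illustration of how these hyperparameters drive the number of trainable parameters, the sketch below counts the weights and biases of a fully connected feedforward network. The function name and the layer sizes are hypothetical, and the count assumes one bias per unit; recurrent and convolutional layers follow different formulas.

```python
def count_mlp_parameters(input_dim, hidden_dims, output_dim):
    """Count weights and biases of a fully connected feedforward network."""
    dims = [input_dim] + list(hidden_dims) + [output_dim]
    total = 0
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        total += fan_in * fan_out  # weight matrix of the layer
        total += fan_out           # one bias per output unit
    return total


# Illustrative sizes only.
small = count_mlp_parameters(100, [64], 10)       # one hidden layer of 64 units
wider = count_mlp_parameters(100, [128], 10)      # more hidden units
deeper = count_mlp_parameters(100, [64, 64], 10)  # more hidden layers
```

Both widening and deepening the network raise the parameter count relative to the small configuration, which is why these settings are paired with regularisation.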

Regularisation hyperparameters: these hyperparameters increase the model’s representation capacity when decreased, i.e. setting the dropout probability and the weight decay to very low values allows the model to fit the training set better, but also decreases its ability to generalise. Usually for a complex model we set the dropout probability to 0.3 or 0.5 and the weight decay rate, e.g. the L2 regularisation rate, to a small value in the range $10^{-7}$ to $10^{-5}$. We increase the weight decay rate if we use a very deep network, and decrease it for smaller networks.
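The two regularisers mentioned above can be sketched in a few lines. The code below is a simplified illustration, not the implementation used in our experiments: the L2 penalty is assumed to be the weight decay rate times the sum of squared weights (some formulations include an extra factor of 1/2), and dropout is written in the common "inverted" form, which rescales the kept activations during training. All names are hypothetical.

```python
import random


def l2_penalty(weights, weight_decay):
    """L2 term added to the loss: weight_decay * sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)


def dropout(activations, drop_probability, rng, training=True):
    """Inverted dropout: zero each unit with `drop_probability` during
    training and rescale the kept units so the expected value is unchanged."""
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    if not training or drop_probability == 0.0:
        return list(activations)
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
weights = [1.0, -2.0, 0.5]
penalty = l2_penalty(weights, weight_decay=1e-5)  # 1e-5 * (1 + 4 + 0.25)

hidden = [0.2, 0.4, 0.6, 0.8]
train_output = dropout(hidden, drop_probability=0.5, rng=rng, training=True)
test_output = dropout(hidden, drop_probability=0.5, rng=rng, training=False)
```

At test time the activations pass through unchanged, because the rescaling has already been applied during training.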

Other hyperparameters: other hyperparameters include the choice of the activation functions and the choice of the optimiser (SGD versus its variations). In our experiments we usually use ReLU activation for the feedforward networks, hyperbolic tangent for the recurrent networks, and the plain SGD optimiser.

Training parameters: Apart from the model hyperparameters, there are a few other parameters that affect the training process, such as the size of the minibatch and the number of training steps. We usually determine the number of training steps with early stopping. For this we need to set two parameters: after how many iterations we evaluate the model on the development set, and how long we wait for an improvement. In our experiments, we evaluate the model on the development set after roughly the number of iterations needed to pass over the whole training set once. We call this number of iterations an epoch. We usually wait for ten consecutive epochs for an improvement on the development set, and then stop the training.
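The early stopping procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: `train_one_epoch` and `evaluate_on_dev` are hypothetical callables standing in for the real training and evaluation code, a higher development score is assumed to be better, and `max_epochs` is an added safety limit that is not part of the procedure described in the text.

```python
def train_with_early_stopping(train_one_epoch, evaluate_on_dev,
                              patience=10, max_epochs=1000):
    """Train until the development score has not improved for `patience`
    consecutive epochs. Returns (best_epoch, best_score, epochs_run)."""
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()          # one pass over the training set
        score = evaluate_on_dev()  # evaluate once per epoch
        epochs_run = epoch
        if score > best_score:
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_score, epochs_run


# Simulated development scores: improvement for three epochs, then a plateau.
simulated_dev_scores = iter([0.50, 0.60, 0.70] + [0.65] * 20)
result = train_with_early_stopping(
    train_one_epoch=lambda: None,  # placeholder for real training
    evaluate_on_dev=lambda: next(simulated_dev_scores),
    patience=10,
)
```

In this simulated run the best score is reached at epoch 3, and training stops at epoch 13 after ten consecutive epochs without improvement. In practice one would also save the model parameters at the best epoch.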