6.2 Training of a Neural Network
6.2.2 Overfitting and Regularization
Once training of a NN has started, the data flow and calculations inside the NN are somewhat black-box-like due to the vast number of parameters, i.e. weighted connections between neurons. Intermediate outputs connected to hidden layers help to control and visualize the different representations, but the complexity and multi-dimensionality of the data do not allow strict tracing. This lack of traceability is an important drawback of NNs, especially in DL, where the number of trainable parameters is often as large as $10^5$ to $10^7$. Against this background, a remark by Nobel Prize winner Enrico Fermi puts DL into a dilemma: “I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” [Dys04] Overfitting and its control are therefore the subject of ongoing research in computer science. A model suffering from overfitting loses generality with respect to the problem and performs poorly on new, unseen data. This section motivates and describes some regularization strategies to prevent overfitting: data splitting into training, validation and test sets, L1/L2 regularization, dropout, batch normalization and training data expansion.
Data Splitting: The data fed into a model form the basis of all learning. A frequently applied rule for data selection states that the data should be independent and identically distributed in the defined problem space [Vap99]. This means that no sample is favored with respect to the problem and that all samples are equally representative, which makes it possible to find a model that generalizes well for the defined problem. In medical imaging, where data availability is limited, the selection of suitable data is crucial. Despite the potentially limited amount of data, not all data are used explicitly for training. Instead, the data are split into three categories [Nie15]:
• Training data contain the majority of the data and are presented to the model for learning. For each input, the known output is compared to the output calculated by the NN, and the weights are adjusted iteratively during training to find an optimal set.
• Validation data are a subset of the data used to evaluate model performance during training and to optimize hyper-parameters, e.g. learning rate or batch size, in order to improve training performance. Validation data do not directly influence weight adjustments, but rather validate the performance of the model with its tuned parameters.
• Test data are separated from the training processes and only presented to the model after training. The predicted output allows an evaluation of the quality, generality and potential of the trained NN.
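The three-way split described in the list above can be sketched in a few lines. This is a minimal illustration only: the function name `split_dataset`, the fractions of 15 % each for validation and test data, and the fixed random seed are assumptions made for the example and are not prescribed by the text.

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets.

    The default fractions are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= val_fraction + test_fraction < 1.0:
        raise ValueError("validation and test fractions must sum to less than 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                   # only used after training
    val = shuffled[n_test:n_test + n_val]      # used to tune hyper-parameters
    train = shuffled[n_test + n_val:]          # majority of the data, used for learning
    return train, val, test

# Example: 100 hypothetical sample indices
train, val, test = split_dataset(range(100))
```

Shuffling before slicing ensures that each subset is drawn from the same pool of samples, and the three subsets are disjoint by construction.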
Besides counteracting direct overfitting of the weights through evaluation on data that are not themselves involved in the training process, the split into separate validation and test data additionally prevents overfitting of the hyper-parameters [Nie15].
L1/L2 Regularization: Another way to combat overfitting is L1 or L2 regularization of the loss function. Both limit the growth of individual weights by adding an extra term to the loss function. The L1 approach regularizes the loss function to
$$E_{L1} = E_0 + \frac{\lambda}{n} \sum_w |w| , \tag{6.18}$$
where $E_0$ is the unregularized, previously defined loss function. The influence of the regularization term is set by the parameter $\lambda$, normalized by the size of the training set $n$. Analogously, the L2 regularization, sometimes also named weight decay, is formulated as
$$E_{L2} = E_0 + \frac{\lambda}{2n} \sum_w w^2 . \tag{6.19}$$
Both regularizations, L1 and L2, act similarly and counter rapid weight growth, but their regulatory regimes differ. L2 penalizes large weights more strongly than L1, but optimization might slow down during training. The partial derivatives for both scenarios are
$$\frac{\partial E}{\partial w} = \frac{\partial E_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w) \tag{6.20}$$
with
$$\mathrm{sgn}(w) = \begin{cases} -1 , & \text{for } w < 0 \\ 0 , & \text{for } w = 0 \\ +1 , & \text{for } w > 0 \end{cases}$$
for the L1 regularized loss function and
$$\frac{\partial E}{\partial w} = \frac{\partial E_0}{\partial w} + \frac{\lambda}{n}\,w \tag{6.21}$$
for L2 regularization of the loss function. With these derivatives, and as addressed in Section 6.2.1, backpropagation is applied to define the update rule for the weights in the NN. The weights in an L1 regularized network with a learning rate $\eta$ are updated by
$$\text{L1:}\quad w \to w' = w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\,\frac{\partial E_0}{\partial w} , \tag{6.22}$$
which shrinks the weights by a constant amount towards zero. By comparison, in the case of L2 regularization the weights are updated by an amount proportional to the current weight:
$$\text{L2:}\quad w \to w' = w \left( 1 - \frac{\eta\lambda}{n} \right) - \eta\,\frac{\partial E_0}{\partial w} . \tag{6.23}$$
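The two update rules follow from a plain gradient descent step, w' = w − η ∂E/∂w, once the derivatives of Equations 6.20 and 6.21 are inserted. The sketch below illustrates this for a single scalar weight. The function names and the numerical values of `eta`, `lam`, `n`, `w` and `grad_e0` are hypothetical and chosen only for illustration.

```python
def sgn(w):
    """Sign function as defined alongside Equation 6.20."""
    return (w > 0) - (w < 0)

def l1_penalty_grad(w, lam, n):
    """Derivative of the L1 penalty term (second term of Equation 6.20)."""
    return lam / n * sgn(w)

def l2_penalty_grad(w, lam, n):
    """Derivative of the L2 penalty term (second term of Equation 6.21)."""
    return lam / n * w

def gradient_step(w, grad_e0, penalty_grad, eta):
    """Plain gradient descent step: w' = w - eta * dE/dw."""
    return w - eta * (grad_e0 + penalty_grad)

# Hypothetical values, chosen only for illustration.
eta, lam, n = 0.1, 1.0, 10
w, grad_e0 = 2.0, 0.3   # current weight and gradient of the unregularized loss E0

w_l1 = gradient_step(w, grad_e0, l1_penalty_grad(w, lam, n), eta)  # Equation 6.22
w_l2 = gradient_step(w, grad_e0, l2_penalty_grad(w, lam, n), eta)  # Equation 6.23
```

Expanding `gradient_step` with the L1 penalty gives the constant shift of Equation 6.22, while the L2 penalty reproduces the multiplicative factor (1 − ηλ/n) of Equation 6.23.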
NNs which are regularized with L1 tend to concentrate their weights in a relatively small number of high-importance connections [Nie15]. This is because the update rule in Equation 6.22 shrinks weights with large |w| much less than Equation 6.23 does. When the weight values are small, i.e. close to zero, L1 regularization continues to shrink the weights without slowing down and thus drives them to zero, i.e. disconnects neurons. In summary, regularization prevents overfitting by slowing down weight growth. Small weights, in turn, make the response of the NN robust against changes in only a few inputs, e.g. activation by noise [Nie15].
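A short worked example may make the different shrinkage behavior concrete. The values are assumptions chosen for easy arithmetic: ηλ/n = 0.01 and a vanishing gradient of the unregularized loss, so that only the regularization term of Equations 6.22 and 6.23 acts on the weight.

```latex
% Assumed for illustration: \eta\lambda/n = 0.01 and \partial E_0 / \partial w = 0
\begin{align*}
w = 2:    \quad & w'_{L1} = 2 - 0.01 = 1.99 ,    & w'_{L2} &= 2\,(1 - 0.01) = 1.98 \\
w = 0.02: \quad & w'_{L1} = 0.02 - 0.01 = 0.01 , & w'_{L2} &= 0.02\,(1 - 0.01) = 0.0198
\end{align*}
```

Under these assumptions L1 removes 0.5 % of the large weight but 50 % of the small one, whereas L2 removes 1 % of the weight in both cases.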
Dropout: A different approach to reinforce generalization during training is dropout. Dropout can be considered a binary selector for neurons in one or more layers. As depicted in Figure 6.4, dropout randomly and independently drops a defined percentage of the neurons in a layer and thus randomly disconnects parts of adjacent layers [Hin+12]. A typical choice is the temporary and random omission of 50 % of the neurons in the layer affected by dropout. The randomized disconnection of neurons reduces overfitting by preventing complex co-adaptation of weights [Sri+14]. Adapting the weights on the incoming connections is the essential part of the learning process and enables the prediction of an output for a given input [Hin+12]. In a suitable NN, different settings of the weights can show equal accuracy during training but result in worse performance on unseen data, so the co-adaptation of the weights to one of these specific settings reduces the general predictive capabilities of a NN. Another point of view is to consider dropout, under certain circumstances, as a form of data augmentation [Sch15].
Figure 6.4: a) Neurons in a layer with dropout are randomly disconnected from the next layer and are present only with a specified probability p. Training of weights w is only possible during training steps in which the corresponding neurons are present. b) At test time the randomly omitted neurons are always present and their corresponding weights are multiplied by the probability p, giving the same output as expected during training.
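The training and test behavior of dropout can be sketched as follows. The sketch assumes that `p_keep` denotes the probability that a neuron stays present during training, so that a fraction 1 − `p_keep` is dropped; for the typical choice of 50 % both probabilities coincide. The function names and the toy activations are hypothetical, and scaling the activations at test time is used here as an equivalent stand-in for scaling the outgoing weights.

```python
import random

def dropout_train(activations, p_keep, rng):
    """Training step: keep each activation with probability p_keep, else set it to zero."""
    return [a if rng.random() < p_keep else 0.0 for a in activations]

def dropout_test(activations, p_keep):
    """Test time: all neurons are present and their output is scaled by p_keep.

    Scaling an activation a by p_keep is equivalent to scaling its outgoing
    weight w, since w * (p_keep * a) == (p_keep * w) * a.
    """
    return [p_keep * a for a in activations]

rng = random.Random(0)          # fixed seed for a reproducible illustration
activations = [1.0] * 10000     # toy layer output, assumed for the example
p_keep = 0.5

dropped = dropout_train(activations, p_keep, rng)
train_mean = sum(dropped) / len(dropped)
test_mean = sum(dropout_test(activations, p_keep)) / len(activations)
```

Averaged over many neurons or training steps, `train_mean` approaches `test_mean`, which is the sense in which the scaled test-time network gives the output expected during training.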
Batch Normalization: Batch Normalization (BN) was introduced to increase training speed. Moreover, its regularizing effect helps to avoid overfitting and is beneficial for generalization. A training sample is seen in conjunction with the other samples of the same batch, which are all trained simultaneously. The variance among the samples of one batch keeps the DNN from producing deterministic parameters for a single sample [Sze15]. The principle behind BN is to reduce the variance and inhomogeneity of a layer’s inputs. The variance of the training data leads to a range of differently distributed values that are not directly relevant for the features or information to be learned, but are difficult for the DNN to capture and therefore have a negative impact on learning rate and training speed [Sze15]. In addition, varying input distributions propagate from layer to layer and hamper optimal parameter adjustment. Normalizing a layer’s input is a tool to accelerate training and, furthermore, it stabilizes parameter growth through its scaling effect on the gradients during backpropagation [Sze15]. The BN algorithm transforms the inputs of a batch $B = \{x_1, ..., x_m\}$ of size $m$, trained at a time, according to Equation 6.24 [Sze15]:
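$$\begin{aligned}
\mu_B &\leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i , && \text{batch mean} \\
\sigma_B^2 &\leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 , && \text{batch variance} \\
\hat{x}_i &\leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} , && \text{normalization} \\
\tilde{x}_i &\leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}^{i}_{\gamma,\beta}(x_1, x_2, ..., x_m) , && \text{scaling and shifting}
\end{aligned} \tag{6.24}$$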
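The four steps of the BN transform can be sketched for a batch of scalar inputs as follows. This is a minimal illustration, not a full layer implementation: the function name `batch_norm`, the default values γ = 1 and β = 0, and the value of ϵ are assumptions made for the example. In a trained network γ and β would be learned parameters.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply the batch normalization transform to a batch of scalar inputs.

    gamma, beta and eps are illustrative defaults (assumptions).
    """
    m = len(batch)
    mu = sum(batch) / m                                        # batch mean
    var = sum((x - mu) ** 2 for x in batch) / m                # batch variance (factor 1/m)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # normalization
    return [gamma * xh + beta for xh in x_hat]                 # scaling and shifting

# Toy batch of four inputs, assumed for the example
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With the default parameters the output has a mean of approximately zero and a variance slightly below one because of ϵ; with γ = 2 and β = 3 the mean moves to approximately 3 and the spread doubles.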