
Up to this point, we have seen various deep neural network architectures and the data transformations they perform layer by layer (each layer operating on the output of the previous one) to learn interesting features. The primary question that remains is how these networks can be trained efficiently. In principle, we want to learn the weight matrix $w$ and bias vector $b$ that give the minimum error on a specified measure. The loss (also called objective or cost) function measures the discrepancy between the model's output and the expected true value. The most widely used objective function for multi-class classification problems is the Negative Log-Likelihood¹ (Goodfellow et al., 2016), given by Equation 4.8.

$$L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{m} y_{i,k} \log(\hat{y}_{i,k}) \tag{4.8}$$

where $n$ is the number of training examples, $m$ is the number of classes, $y$ is the true label and $\hat{y}$ is the model's output.
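As a minimal sketch of Equation 4.8, the loss can be computed with NumPy as follows. The function name, the clipping constant `eps` and the toy batch are assumptions introduced for illustration; they do not come from the original text.

```python
import numpy as np

def negative_log_likelihood(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood (categorical cross-entropy), Equation 4.8.

    y_true: (n, m) one-hot true labels.
    y_pred: (n, m) predicted class probabilities.
    eps:    assumed clipping constant that avoids log(0).
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    n = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred)) / n

# Hypothetical toy batch: n = 2 examples, m = 3 classes.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = negative_log_likelihood(y_true, y_pred)
```

Because the labels are one-hot, only the probability assigned to the true class of each example contributes to the sum.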

The backpropagation algorithm is widely used in supervised learning to compute how the error depends on each weight. During training, after each forward pass through the network (i.e. feeding an input and calculating the output at the last layer), the error is computed and passed backwards to update the model weights. More formally, the partial derivatives of the loss function with respect to each layer's weights are computed by repeated application of the chain rule. To understand this better, consider the case of a single neuron. Let $x_j$ be the $j$th input value to the neuron, $w_j$ the weight given to $x_j$, and $n$ the total number of inputs to the neuron. Then the weighted sum $u$ and the output $z$ of the neuron with non-linear activation $\sigma$ can be written as follows:

$$u = \sum_{j=1}^{n} w_j x_j, \qquad z = \sigma(u) \tag{4.9}$$

The error computation for weight $w_i$ is given by Equation 4.10.

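$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial u} \frac{\partial u}{\partial w_i} \tag{4.10}$$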
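The chain rule of Equation 4.10 can be checked numerically for a single neuron. The sketch below assumes a sigmoid activation for $\sigma$ and a squared-error loss $L = \frac{1}{2}(z - y)^2$ for a single target $y$; both choices are illustrative assumptions rather than part of the text, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neuron_gradient(w, x, y):
    """Gradient of L = 0.5 * (z - y)**2 w.r.t. w, for z = sigmoid(w . x)."""
    u = np.dot(w, x)              # weighted sum (Equation 4.9)
    z = sigmoid(u)                # activation (Equation 4.9)
    dL_dz = z - y                 # derivative of the assumed squared-error loss
    dz_du = z * (1.0 - z)         # derivative of the assumed sigmoid
    du_dw = x                     # du/dw_i = x_i
    return dL_dz * dz_du * du_dw  # chain rule (Equation 4.10)

def numerical_gradient(w, x, y, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        loss_plus = 0.5 * (sigmoid(np.dot(w + step, x)) - y) ** 2
        loss_minus = 0.5 * (sigmoid(np.dot(w - step, x)) - y) ** 2
        grad[i] = (loss_plus - loss_minus) / (2.0 * h)
    return grad

# Hypothetical weights, inputs and target.
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
y = 1.0
analytic = neuron_gradient(w, x, y)
numeric = numerical_gradient(w, x, y)
```

The two gradients should agree to within the finite-difference error, which is a common sanity check when implementing backpropagation by hand.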

The error is computed in a similar way for all the model weights. The vector containing the partial derivatives of the loss with respect to the weights is called the gradient. The negative of the gradient gives the direction in which to move in order to reduce the cost function. However, for deep models, the error surface is highly non-convex, with many local minima besides the global minimum, owing to the large parameter space. Therefore, iterative optimization algorithms such as Stochastic Gradient Descent (SGD) are employed to minimize the loss. Let $w_t$ represent the current weight matrix, $\nabla_w L(w_t)$ the gradient vector and $\eta$ the learning rate, which controls how large a step to take in the direction provided by the gradient. Then one step of gradient descent is given by Equation 4.11 and is repeated until convergence.

¹ Also known as Categorical Cross Entropy.

Chapter 4. Deep Neural Networks for Arousal Classification

$$w_{t+1} = w_t - \eta \nabla_w L(w_t) \tag{4.11}$$
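The update rule of Equation 4.11 can be sketched as a short loop. The example below applies it to a hypothetical convex loss $L(w) = \lVert w - w^* \rVert^2$, chosen only because its minimum is known; the function name, stopping tolerance and step limit are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, tol=1e-8, max_steps=10_000):
    """Repeat w <- w - lr * grad(w) (Equation 4.11) until the gradient is small."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:   # assumed convergence criterion
            break
        w = w - lr * g
    return w

# Hypothetical loss L(w) = ||w - target||^2, whose gradient is 2 * (w - target).
target = np.array([3.0, -1.0])
w_final = gradient_descent(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

On this convex toy problem the iterates approach `target`; on the non-convex error surfaces of deep networks the same loop only guarantees a decrease of the loss for sufficiently small learning rates, not arrival at the global minimum.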

In SGD, the weights are updated after the error is calculated on a small batch of data, e.g. 10 training examples. Compared with (full) gradient descent, it typically converges faster and often to better solutions, owing to the diversity introduced by the different batches of data. To accelerate training, several variants of SGD have been proposed that adaptively tune the learning rate for each parameter. Popular choices include RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014).
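A minimal sketch of mini-batch SGD is given below. For brevity it fits a linear model with a mean squared-error loss instead of a neural network; this substitution, the function name, the hyperparameter values and the synthetic noise-free data are all assumptions made for illustration.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=10, epochs=50, seed=0):
    """Mini-batch SGD on the mean squared error of a linear model y ~ X @ w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient on this batch only
            w -= lr * grad                      # one update per mini-batch
    return w

# Hypothetical synthetic data generated from known weights.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w_sgd = sgd_linear_regression(X, y)
```

With 200 examples and a batch size of 10, each epoch performs 20 weight updates, whereas full gradient descent would perform one.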

Another important issue in training deep neural networks is weight initialization. If the initial weights are randomly assigned very large or very small values, convergence becomes difficult. To overcome this issue, Xavier initialization, proposed by Glorot and Bengio (2010), is commonly used. Here, the weights are drawn from a uniform distribution whose range is given by Equation 4.12.

$$r = \sqrt{\frac{6}{n_{in} + n_{out}}}; \qquad [-r, r] \tag{4.12}$$

where $n_{in}$ and $n_{out}$ are the number of incoming and outgoing connections of a neuron $n$, respectively.
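Equation 4.12 translates directly into code. The following sketch draws a weight matrix for a hypothetical layer with 256 inputs and 128 outputs; the function name and layer sizes are assumptions chosen for illustration.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Sample an (n_in, n_out) weight matrix uniformly from [-r, r] (Equation 4.12)."""
    rng = np.random.default_rng() if rng is None else rng
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))

# Hypothetical layer sizes.
n_in, n_out = 256, 128
W = xavier_uniform(n_in, n_out, rng=np.random.default_rng(0))
r = np.sqrt(6.0 / (n_in + n_out))
```

A uniform distribution on $[-r, r]$ has variance $r^2/3$, so this choice of $r$ gives the weights a variance of $2/(n_{in} + n_{out})$.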

Normalization is important when the data dimensions have different ranges or units. In such a case, rescaling is performed to bring all dimensions onto a similar scale. In gradient-based optimization algorithms, this pre-processing becomes even more important in order to avoid slow convergence and to give equal importance to each dimension. If the features are on different scales, certain weights in the model may update faster than others, because each feature value enters directly into the weight updates. More specifically, in neural networks, the orientation of each hyperplane is determined by the weights from the input to the hidden layer units, and its distance from the origin is determined by the bias term. If the bias terms are initialized to small random numbers, the hyperplanes will lie close to the origin. If the data points are not centred near the origin and the inputs have a small coefficient of variation, the hyperplanes may fail to pass through the data, and training is likely to get stuck in a local minimum. Therefore, to benefit from small random initialization, it is important to standardize the dataset. The most widely used method is z-normalization, where the mean of each dimension is subtracted from every value and the result is divided by the standard deviation (see Figure 4.6). This centres the data points around the origin with unit variance.
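A minimal sketch of z-normalization with NumPy is shown below. The function name and the synthetic two-feature data (with deliberately different scales) are assumptions introduced for illustration.

```python
import numpy as np

def z_normalize(X, mean=None, std=None):
    """Subtract the per-feature mean and divide by the per-feature standard deviation.

    If mean and std are not supplied, they are estimated from X itself.
    """
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0) if std is None else std
    return (X - mean) / std, mean, std

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(90, 2))
X_norm, mu, sigma = z_normalize(X)
```

The statistics are returned alongside the normalized data so that the same `mu` and `sigma` can be reused on other data, for example `z_normalize(X_new, mu, sigma)`.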

To further understand the importance of feature normalization for gradient-based optimization algorithms, let's consider a randomly generated dataset for a binary classification task (with equal class representation) having two features. Ninety values are randomly sampled with means of 5±1.4 and 2±1.4 for the first and second feature, respectively. Furthermore, suppose we are using logistic regression as the classifier with cross-entropy loss. The error surface traversed by gradient descent can then be plotted as contours, as shown in Figure 4.7. The plot on the left depicts the elongated error surface without feature normalization. In this case, the weight updates are extreme and follow a zig-zag path. There is a high possibility that gradient descent will overshoot the minimum and hence fail to converge. The right plot, on the other hand, illustrates the error surface when the features are z-normalized. The direction of steepest descent now points more directly towards the minimum, so with each weight update the model consistently moves towards the global minimum. It is important to note that the error surface of deep neural networks is very high-dimensional and has many local minima. Therefore, to avoid getting stuck in such minima, feature normalization must be performed on the training set.
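One way to quantify how elongated the error surface is, without plotting it, is the condition number of the Hessian of the loss: values near 1 correspond to nearly circular contours, large values to stretched ones. The sketch below makes several assumptions that are not in the text: the ±1.4 spread is treated as a standard deviation, the features are drawn from a normal distribution, the logistic regression has no bias term (so the surface is over two weights), and the Hessian is evaluated at $w = 0$, where every prediction equals 0.5 and the Hessian reduces to $X^\top X / (4n)$ regardless of the labels.

```python
import numpy as np

# Assumed data generation: 90 points, features centred at 5 and 2, spread 1.4.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[5.0, 2.0], scale=1.4, size=(90, 2))
X_norm = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def hessian_at_zero(X):
    """Hessian of the mean cross-entropy loss of logistic regression at w = 0.

    With no bias term, each predicted probability at w = 0 is 0.5, so the
    per-example curvature factor p * (1 - p) equals 0.25 for every example.
    """
    return X.T @ X / (4.0 * X.shape[0])

cond_raw = np.linalg.cond(hessian_at_zero(X_raw))    # elongated contours
cond_norm = np.linalg.cond(hessian_at_zero(X_norm))  # closer to circular
```

Under these assumptions the condition number for the raw features should be markedly larger than for the z-normalized features, which is consistent with the zig-zag behaviour described for the unnormalized case.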

In this section, we briefly highlighted some of the prerequisites for training deep nets. The interested reader is referred to Bengio (2012) for a detailed discussion and practical guidance on training deep architectures.

[Figure: two scatter-plot panels titled "Original" (left) and "Z-Normalized" (right)]

Figure 4.6: Illustration of z-normalization on a random dataset. The left panel depicts a randomly generated two-dimensional dataset that is spread along both axes. The right panel shows the normalized version, in which the mean is subtracted from each value and the result is divided by the standard deviation. The resulting dimensions have unit variance and are centred around the origin.
