Neural Network Training - Experiment 4 – Improving Framework Accuracy

4 Theory and Design and Implementation

5.6 Experiment 4 – Improving Framework Accuracy

5.5.1. Neural Network Training

A key concept underlying deep learning methods is the use of distributed representations of the data, in which a large number of configurations of the abstract features of the input data are possible. The training set therefore has to be selected carefully so that each sample can be represented compactly, which leads to richer generalization. Because deep learning calls for several input features, the environmental and business triggers described earlier were combined as the input features for the neural network. These features are memory capacity, processor speed, device type, and geo-location.

A multi-layer ANN architecture is made up of an input layer, one or more intermediate (hidden) layers, and an output layer. Each layer contains several neurons, each with several inputs (xn) and weights (wn). Selecting the correct weights is the process that produces the adaptation rules in the case study. The supervised learning approach is used, in which the desired output for each set of inputs is supplied to train the network. Several learning rules exist for supervised algorithms, such as the error-correction, Boltzmann, Hebbian, and competitive approaches. The error-correction approach is used here; it rests on the fact that the output produced during training does not always correspond to the desired output. The resulting error is measured as the total mean square error (MSE), obtained as shown in equation 5.1. It is computed over all training patterns from the calculated and target outputs (Islam et al., 2010).

(5.1)

Where m is the number of examples in the training set, k is the number of output units, Tij is the target output value of the ith output unit for the jth training example, and Oij is the actual real-valued output of the ith output unit for the jth training example.
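The error measure defined by these symbols can be sketched in a few lines of Python. This is a minimal illustration, assuming the squared differences between Tij and Oij are averaged over all m examples and k output units; the normalization constant is an assumption of this sketch (a factor of 1/2 is also common), and the function name and sample values are hypothetical.

```python
def total_mse(targets, outputs):
    """Average squared difference over all examples and output units.

    targets[j][i] plays the role of Tij and outputs[j][i] the role of Oij
    (example j, output unit i). The 1/(m*k) normalization is an assumption.
    """
    m = len(targets)       # number of training examples
    k = len(targets[0])    # number of output units
    squared_error = 0.0
    for target_row, output_row in zip(targets, outputs):
        for t, o in zip(target_row, output_row):
            squared_error += (t - o) ** 2
    return squared_error / (m * k)


# Hypothetical targets and network outputs for two examples, two output units.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
print(total_mse(targets, outputs))  # about 0.075
```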

The back-propagation algorithm uses an iterative gradient technique to minimize the MSE between the calculated output and the target output. The main idea is first to move forward through the network to compute the error, and then to propagate the error backward, updating the weights from the output layer toward the input layer. Initially the weights are set to small random numbers; they are then updated progressively during training, based on the calculated error, so that the error is gradually reduced. This is the basis of the back-propagation learning algorithm. Figure 5.12 gives an example of the error rate values obtained during training of the neural network using the training dataset.
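The forward and backward passes described above can be sketched as follows. This is an illustrative toy, not the Neuroph Studio configuration used in the study: it assumes a single hidden layer, sigmoid activations, per-pattern weight updates, and a hypothetical logical-OR training set; all function names are made up for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # One weight row per neuron; the last entry of each row is the bias.
    # Weights start as small random numbers.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layer, inputs):
    # Append a constant 1.0 so the bias is handled like any other weight.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs + [1.0]))) for row in layer]


def train(samples, n_hidden=3, learning_rate=0.2, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    hidden = init_layer(n_in, n_hidden, rng)
    output = init_layer(n_hidden, n_out, rng)
    history = []  # mean square error per epoch
    for _ in range(epochs):
        epoch_error = 0.0
        for x, target in samples:
            # Forward pass: compute the outputs and accumulate the error.
            h = forward(hidden, x)
            o = forward(output, h)
            epoch_error += sum((t - y) ** 2 for t, y in zip(target, o))
            # Backward pass: error terms for the output layer, then hidden layer.
            delta_o = [(y - t) * y * (1.0 - y) for t, y in zip(target, o)]
            delta_h = [
                h[j] * (1.0 - h[j]) * sum(delta_o[i] * output[i][j] for i in range(n_out))
                for j in range(n_hidden)
            ]
            # Weight updates, from the output layer back toward the input layer.
            for i in range(n_out):
                for j, value in enumerate(h + [1.0]):
                    output[i][j] -= learning_rate * delta_o[i] * value
            for j in range(n_hidden):
                for q, value in enumerate(x + [1.0]):
                    hidden[j][q] -= learning_rate * delta_h[j] * value
        history.append(epoch_error / (len(samples) * n_out))
    return history


# Hypothetical training set: logical OR of two inputs.
or_samples = [
    ([0.0, 0.0], [0.0]),
    ([0.0, 1.0], [1.0]),
    ([1.0, 0.0], [1.0]),
    ([1.0, 1.0], [1.0]),
]
history = train(or_samples)
print(history[0], history[-1])  # error in the first and last epoch
```

Plotting `history` against the epoch number gives an error curve of the same kind as the one the text attributes to Figure 5.12.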

Because a neural network is made up of layers and nodes that describe its architecture, it is a good idea to train several networks to ensure that a network with good generalization is found. The objective of the training is to find a network with the smallest error and regularization terms. The error term evaluates how well a neural network fits the data set; it depends on adaptive parameters such as biases and weights. The regularization term, on the other hand, is used to prevent overfitting (a large difference between training and test error) by controlling the effective complexity of the neural network. To avoid overfitting, Piotrowski et al. (2013) recommend keeping the ANN architecture relatively simple, as complex models are much more prone to overfitting. In an overfitted network the error on the training set is driven to a very small value, but when new data is presented to the network the error is large. The network has memorized the training examples, but it has not learned to generalize to new situations. The training set was kept relatively small, and tests were conducted to show the neural network's ability to generalize.

Training on the dataset called for modifying several parameters, such as the learning rate and momentum, randomizing the weights, and increasing the size of the training dataset, to arrive at an acceptable error rate. When training an ANN, initial weights are chosen and then updated as the input values move from one layer to another. The final output may not always be the expected output, which results in an error. Since the nature of the error space cannot be known a priori, neural network analysis often requires a large number of individual runs to determine the best solution. Most learning rules have built-in mathematical terms to assist in this process, which control the 'speed' (learning rate) and the 'ability' (momentum) of the learning. The learning rate is the size of the steps by which the weights are changed, and it determines how quickly a solution is reached. That solution may not always be the best one, i.e. it may still have an error. Momentum helps the network arrive at better solutions by varying the step sizes so that they are not always a constant value.

Momentum simply adds a fraction α of the previous weight update to the current one. When using back propagation with momentum in a network with n different weights w1, w2, ..., wn, the ith correction for weight wk is given by:

$$\Delta w_k(i) = -\gamma \frac{\partial E}{\partial w_k} + \alpha \, \Delta w_k(i-1)$$

Where γ and α are the learning rate and momentum rate respectively. The error function E is determined at the output. The momentum rate is a value between 0 and 1; it multiplies the previous weight update, and the result is added to the current update to form the new weight correction.
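The momentum rule described above can be sketched as a single function. This is a minimal illustration, assuming a toy one-weight error function E(w) = w² (not the network's actual error surface); the function name is hypothetical, and the default rates of 0.2 and 0.5 are the example values reported later in this section.

```python
def momentum_update(gradient, previous_update, learning_rate=0.2, momentum=0.5):
    """Weight correction: a gradient step plus a fraction of the previous update."""
    return -learning_rate * gradient + momentum * previous_update


# Toy example (an assumption of this sketch): minimize E(w) = w**2,
# whose gradient is dE/dw = 2*w and whose minimum is at w = 0.
w = 1.0
update = 0.0
for _ in range(50):
    gradient = 2.0 * w
    update = momentum_update(gradient, update)
    w += update

print(w)  # very close to the minimum at 0
```

Setting `momentum=0.0` reduces the rule to plain gradient descent, which makes it easy to compare the two behaviours on the same toy function.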

If both the momentum and the learning rate are kept at large values, a single large step can overshoot the best solution (the minimum). A small momentum value, on the other hand, can lead to a solution that is not the best (a local minimum) because the step size stays almost constant; it also slows down training. The ideal setting combines a larger momentum term with a smaller learning rate. Once a neural network is trained to a satisfactory level it may be used as an analytical tool on other data.

For this experiment the learning rate and momentum were adjusted during training. When the learning rate was increased, the total mean square error also increased, as shown in Figure 5.13 (a). This is because increasing the learning rate also increases network instability, with weight values oscillating erratically as they converge on a solution. The momentum rate helps prevent settling into a local minimum by skipping through it. As the momentum rate was varied, the error increased when the rate approached the extreme values, i.e. 0.0 or 1.0, as shown in Figure 5.13 (b). As the momentum rate approaches the maximum of 1.0, training becomes unstable and may not reach even a local minimum, or if it does, it takes an inordinate amount of training time. As it approaches 0.0, on the other hand, the momentum term is effectively ignored and the network is more likely to settle into a local minimum. Therefore a relatively low learning rate and a moderate momentum rate were ideal for the study.
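The instability caused by a large learning rate can be reproduced on a toy problem. The sketch below assumes a one-weight error function E(w) = w² and plain gradient descent without momentum; the rates 0.2 and 1.1 and the function name are illustrative choices, not values taken from the experiment's figures.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Plain gradient descent on the toy error E(w) = w**2 (gradient 2*w)."""
    w = start
    trajectory = [w]
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
        trajectory.append(w)
    return trajectory


low = descend(0.2)    # each step shrinks w by a factor of 0.6
high = descend(1.1)   # each step multiplies w by -1.2: sign flips, size grows

print(abs(low[-1]))   # close to the minimum at 0
print(high[:4])       # oscillates around 0 with growing amplitude
```

With the low rate the weight settles smoothly toward the minimum, while with the high rate it jumps from one side of the minimum to the other and moves further away at each step.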

Figure 5.13 (a) ANN training by varying learning rate

(b) ANN training by varying momentum rate

Typically, each back-propagation training session starts with different initial weights and biases, and with different divisions of the data into training, validation, and test sets. By varying these parameters it was observed that different starting conditions can lead to very different solutions for the same problem. Weights that resulted in small error values were identified and used for the architecture. Reed and Marks recommend using small initial weights to avoid immediate saturation of the activation function (Reed & Marks, 1999).

Further, an investigation was conducted into how varying the number of hidden layers and nodes affected the total mean square error. As mentioned earlier, only a few hidden layers were used in order to avoid overfitting. Training was conducted using fewer than three hidden layers, and it was observed that the error grew as the number of layers increased, as shown in Figure 5.14. For the data set used there was no significant difference in the error between one and two hidden layers, or when the number of nodes in each layer was increased. However, when the nodes are too few the network can underfit, that is, the model cannot reach a sufficiently low error value on the training set. On the other hand, using too many neurons in the hidden layers can result in overfitting and an increase in the time taken to train the network. The number of nodes per hidden layer was therefore chosen to be close to the number of input and output parameters.
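The recommendation above to use small initial weights can be illustrated numerically. The sketch assumes a sigmoid activation function and a single neuron with four inputs of value 1.0; the weight values 0.1 and 5.0 and the helper names are hypothetical choices made only to contrast a small and a large initialization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    # Derivative of the sigmoid; back-propagation scales each weight
    # update by this factor, so a slope near zero stalls learning.
    s = sigmoid(z)
    return s * (1.0 - s)


def weighted_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


inputs = [1.0, 1.0, 1.0, 1.0]
small_weights = [0.1, 0.1, 0.1, 0.1]
large_weights = [5.0, 5.0, 5.0, 5.0]

slope_small = sigmoid_slope(weighted_sum(small_weights, inputs))
slope_large = sigmoid_slope(weighted_sum(large_weights, inputs))

print(slope_small)  # a usable gradient, close to the maximum of 0.25
print(slope_large)  # nearly zero: the neuron is saturated
```

With large weights the neuron's output is pinned near 1 and its slope is almost zero, so the error signal passed backward through it is almost lost from the very first update.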

Figure 5.14 Training by varying the hidden layers and nodes

It can be deduced from this result that, for this dataset, a prediction system can be optimized with a low learning rate (e.g. 0.2) and a moderate momentum (e.g. 0.5). Further, the minimum number of hidden layers for a deep neural network, two, was sufficient for the experiment, with 5-6 nodes per layer. A screenshot of the ANN architecture arrived at after the training process, produced using Neuroph Studio, is depicted in Figure 5.15.

Figure 5.15 ANN Architecture

When the neural network architecture was tested it produced comparatively more accurate results than the previous experiment. These are discussed next.

In document A self learning framework for validation of runtime adaptation in service oriented systems (Page 110-116)