Chapter 2 Analytical methodologies
2.2.2. Neural Network Training
Neural networks attempt to simulate the learning ability of the human brain. Unlike the human brain, however, the structure of a neural network is fixed: it consists of a fixed number of neurons and of connections between them, each of which carries a value (weight). What changes during the learning process are the values of the weights, which increase if the information is to be transmitted and decrease otherwise. Nothing indicates what the weight values should be at the start of training, so they are initialized randomly. These values are then adjusted either after each individual (training example) is processed or after all individuals have been processed.
Training a neural network essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. Numerous algorithms are available for training neural network models; most of them can be viewed as a straightforward application of optimization theory (adjusting the weights in order to minimize the error) and statistical estimation. The following picture illustrates the purpose of network training: to minimize the error produced by the network by adjusting the weights.
Most of the algorithms used to train artificial neural networks employ some form of gradient descent. This is done by taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. With this method, the adjustment that reduces the network error function can be calculated at each point. That is, given a training set comprising a set of input vectors $\{x_n\}$, where $n = 1, \dots, N$, together with a corresponding set of target vectors $\{t_n\}$, the error function $E(w)$ must be minimized:
$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \lVert y(x_n, w) - t_n \rVert^2 \qquad (21)$$
The value of w found by minimizing this function corresponds to the maximum likelihood solution. Because the error function E(w) is a smooth, continuous function of w, its smallest value occurs at a point in weight space where the gradient of the error function is equal to 0. However, an analytical solution to this equation is difficult to find, so iterative numerical procedures are used instead. Moreover, the use of gradient information can lead to significant improvements in the speed with which the minima of the error function can be located. The simplest approach to using gradient information is to choose the weight update to comprise a small step in the direction of the negative gradient, so that:
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right) \qquad (22)$$
where the parameter η is the learning rate. After each update the gradient is re-evaluated for the new weight vector and the process is repeated. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this technique is known as gradient descent.
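The update rule can be illustrated with a minimal sketch. It assumes a linear model y(x, w) = w[0] + w[1] * x with scalar inputs and targets, a fixed starting point and a hand-picked learning rate; none of these choices come from the text, and the names `predict`, `error`, `gradient` and `gradient_descent` are hypothetical.

```python
# Batch gradient descent on the sum-of-squares error (21) using update (22).
# Assumption: a linear model y(x, w) = w[0] + w[1] * x stands in for the network.

def predict(x, w):
    return w[0] + w[1] * x

def error(w, xs, ts):
    # E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
    return 0.5 * sum((predict(x, w) - t) ** 2 for x, t in zip(xs, ts))

def gradient(w, xs, ts):
    # Partial derivatives of E(w) with respect to w[0] and w[1].
    g0 = sum(predict(x, w) - t for x, t in zip(xs, ts))
    g1 = sum((predict(x, w) - t) * x for x, t in zip(xs, ts))
    return [g0, g1]

def gradient_descent(xs, ts, eta=0.05, steps=2000):
    w = [0.0, 0.0]  # fixed start for reproducibility (the text uses random starts)
    for _ in range(steps):
        g = gradient(w, xs, ts)
        # Small step in the direction of the negative gradient.
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]  # toy targets generated by t = 1 + 2x
w_fit = gradient_descent(xs, ts)
print(w_fit, error(w_fit, xs, ts))
```

On this toy data the weights should approach w = [1, 2]. A learning rate that is too large for the curvature of E(w) makes the iteration diverge, which is why η is kept small.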
In order to find a good minimum it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point and comparing the resulting performance on an independent validation data set.
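A restart procedure of this kind might look like the sketch below. It assumes a one-parameter model y(x, w) = sin(w x), chosen only because its error surface has several local minima; the data, the learning rate, the number of restarts and all function names are illustrative assumptions.

```python
import math
import random

# Assumption: a one-parameter model with a non-convex error surface.
def model(x, w):
    return math.sin(w * x)

def sse(w, data):
    # Sum-of-squares error over a list of (x, t) pairs.
    return 0.5 * sum((model(x, w) - t) ** 2 for x, t in data)

def grad(w, data):
    # dE/dw for the model above.
    return sum((model(x, w) - t) * x * math.cos(w * x) for x, t in data)

def train(w0, data, eta=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= eta * grad(w, data)
    return w

rng = random.Random(0)  # seeded so that the run is reproducible
# Toy data generated with w = 2; even points train, odd points validate.
train_set = [(x / 10.0, math.sin(2.0 * x / 10.0)) for x in range(0, 30, 2)]
valid_set = [(x / 10.0, math.sin(2.0 * x / 10.0)) for x in range(1, 30, 2)]

runs = []
for _ in range(10):
    w0 = rng.uniform(-5.0, 5.0)          # different random starting point
    w = train(w0, train_set)             # may end in a different local minimum
    runs.append((sse(w, valid_set), w0, w))

# Keep the run that performs best on the independent validation set.
best_val_error, best_start, best_w = min(runs)
print(best_start, best_w, best_val_error)
```

The selection is made on the validation set rather than the training set, so that the comparison between runs is not biased towards solutions that merely fit the training data.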
If gradient descent is used to train a multi-layer network, a difficulty arises: no target values are available for the hidden units. An efficient technique for evaluating the gradient of an error function for a feed-forward neural network was therefore developed, namely the backpropagation algorithm. The name of this algorithm is based on the way the error is propagated backwards through the network, as the following steps show:
1. The example cases are applied to the network producing some output based on the current state of its synaptic weights (initially, the output will be random).
2. The output is compared to the desired output, and a mean-squared error signal is calculated.
3. The error value is then propagated backwards through the network, and small changes are made to the weights in each layer. The weight changes are calculated to reduce the error signal for the case under study.
4. The whole process is repeated for each example in the training set, and then starts again from the first case.
5. The cycle is repeated until the overall error value drops below some pre-defined threshold.
At this point we say that the network has learned the problem. It is important to note that the network will never learn the ideal function exactly; rather, it will approach it asymptotically.
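The five steps can be sketched for a small network as follows. The sketch assumes one hidden layer of tanh units, a single linear output unit, the error 1/2 (y - t)^2 per example and one weight update per example; the architecture, the XOR data, the hyperparameters and all function names are illustrative assumptions rather than the configuration used in this work.

```python
import math
import random

def init_network(n_in, n_hidden, rng):
    # Each weight list ends with a bias term; values start small and random.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    # Step 1: apply the example and compute the output.
    z = [math.tanh(sum(wi * xi for wi, xi in zip(w[:-1], x)) + w[-1])
         for w in w_hidden]
    y = sum(wo * zj for wo, zj in zip(w_out[:-1], z)) + w_out[-1]
    return z, y

def backprop(x, t, w_hidden, w_out):
    z, y = forward(x, w_hidden, w_out)
    delta_out = y - t  # Step 2: error signal at the (linear) output
    # Step 3: propagate the error backwards; tanh'(a) = 1 - tanh(a)^2.
    delta_hidden = [(1.0 - zj ** 2) * w_out[j] * delta_out
                    for j, zj in enumerate(z)]
    g_out = [delta_out * zj for zj in z] + [delta_out]
    g_hidden = [[dj * xi for xi in x] + [dj] for dj in delta_hidden]
    return g_hidden, g_out

def train(data, n_hidden=3, eta=0.1, max_epochs=2000, threshold=1e-3, seed=0):
    rng = random.Random(seed)
    w_hidden, w_out = init_network(len(data[0][0]), n_hidden, rng)
    total = float("inf")
    for _ in range(max_epochs):
        for x, t in data:  # Step 4: repeat for each example in turn
            g_hidden, g_out = backprop(x, t, w_hidden, w_out)
            w_out = [w - eta * g for w, g in zip(w_out, g_out)]
            w_hidden = [[w - eta * g for w, g in zip(ws, gs)]
                        for ws, gs in zip(w_hidden, g_hidden)]
        total = sum(0.5 * (forward(x, w_hidden, w_out)[1] - t) ** 2
                    for x, t in data)
        if total < threshold:  # Step 5: stop once the error is small enough
            break
    return w_hidden, w_out, total

# Hypothetical usage on the XOR problem.
xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
w_hidden, w_out, final_error = train(xor)
print(final_error)
```

Whether the threshold is reached within the epoch limit depends on the random initial weights, which is one reason the restarts described earlier are useful.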
For backpropagation learning, the activation function must be differentiable, and it helps if the function is bounded; the sigmoidal functions (such as logistic and tanh) and the Gaussian function are the most common choices. Functions such as tanh or arctan that produce both positive and negative values tend to yield faster training than functions that produce only positive values such as logistic, because of better numerical conditioning.
For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant; hence the gradient either does not exist or is zero, making it impossible to use backpropagation or more efficient gradient-based training methods. With sigmoid units, a small change in the weights usually produces a change in the outputs, making it possible to tell whether the change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs.
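The contrast between sigmoid and threshold units can be checked numerically. The short sketch below compares finite-difference slopes of the logistic, tanh and threshold functions at one arbitrary pre-activation value; the helper `finite_difference` and the chosen value are illustrative assumptions.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def threshold(a):
    # Step function: outputs only 0 or 1.
    return 1.0 if a >= 0.0 else 0.0

def finite_difference(f, a, eps=1e-4):
    # Central-difference estimate of the slope of f at a.
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

a = 0.8  # arbitrary pre-activation value, away from the step at 0

# Sigmoid units: a small change in the input changes the output.
print(finite_difference(logistic, a))   # close to logistic(a) * (1 - logistic(a))
print(finite_difference(math.tanh, a))  # close to 1 - tanh(a)^2

# Threshold unit: the output does not move, so there is no gradient to follow.
print(finite_difference(threshold, a))

# tanh takes both signs, whereas logistic outputs are always positive.
print(math.tanh(-1.0), logistic(-1.0))
```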
For the output units, the activation function should be chosen to suit the distribution of the target values:
• For binary (0/1) targets, the logistic function is an excellent choice (Jordan, 1995).
• For categorical targets using 1-of-C coding, the softmax activation function is the logical extension of the logistic function.
• For continuous-valued targets with a bounded range, bounded sigmoidal functions (such as logistic or tanh) can be used, provided either the outputs are scaled to the range of the targets or the targets are scaled to the range of the output activation function ("scaling" means multiplying by and adding appropriate constants).
• If the target values are positive but have no known upper bound, the exponential output activation function can be used.
• For continuous-valued targets with no known bounds, the identity or "linear" activation function can be used.
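The output activation functions listed above can be written down in a few lines. This is a sketch only: the function names are hypothetical, and `scaled_tanh` shows just one possible way of scaling a bounded output to a target range [lo, hi].

```python
import math

def logistic(a):
    # Binary (0/1) targets: output lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

def softmax(activations):
    # 1-of-C targets: outputs are positive and sum to 1.
    # Subtracting the maximum avoids overflow without changing the result.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_tanh(a, lo, hi):
    # Bounded targets: tanh lies in (-1, 1); multiply by and add constants
    # so that the output covers the target range (lo, hi).
    return lo + (hi - lo) * (math.tanh(a) + 1.0) / 2.0

def exponential(a):
    # Positive targets with no known upper bound.
    return math.exp(a)

def identity(a):
    # Unbounded continuous targets ("linear" output).
    return a

print(logistic(0.0))
print(softmax([1.0, 2.0, 3.0]))
print(scaled_tanh(0.0, 10.0, 20.0))
print(exponential(-2.0), identity(-2.0))
```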
Multilayer networks can approximate any smooth function as long as there are enough hidden nodes. However, this great flexibility can cause the network to learn the noise in the data, that is, to become over-trained or over-fitted. There are several ways to control the complexity of the network in order to avoid over-fitting. One way is to add a regularization term to the error function; the simplest such term, known as weight decay, gives a regularized error of the form:
$$\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^{T} w \qquad (23)$$
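Differentiating the regularized error shows where the name weight decay comes from. The derivation below is a sketch that assumes the plain gradient descent update with learning rate η is applied to the regularized error, and that λ is a non-negative regularization coefficient (the text does not define it explicitly).

```latex
\nabla \tilde{E}(w) = \nabla E(w) + \lambda w
\quad\Longrightarrow\quad
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla \tilde{E}\left(w^{(\tau)}\right)
             = (1 - \eta\lambda)\, w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right)
```

For 0 < ηλ < 1, each update shrinks the weights by the factor (1 − ηλ) in addition to the usual gradient step, so weights that the data do not support decay towards zero.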
An alternative to regularization as a way of controlling the effective complexity of a network is the early stopping procedure. The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. However, the error measured with respect to independent data often decreases at first and then increases when the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set.
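Early stopping can be sketched as a training loop that remembers the weights with the smallest validation error. The example assumes a one-parameter model y = w x and a contrived data set in which the validation targets follow a slightly different relation from the training targets; this is only a stand-in for noise so that the validation error falls and then rises. The `patience` parameter (how many non-improving steps to tolerate) is also an assumption, not part of the text.

```python
def sse(w, data):
    # Sum-of-squares error of the model y = w * x on (x, t) pairs.
    return 0.5 * sum((w * x - t) ** 2 for x, t in data)

def grad(w, data):
    return sum((w * x - t) * x for x, t in data)

def train_early_stopping(train_set, valid_set, eta=0.01,
                         max_steps=1000, patience=5):
    w = 0.0
    best_w, best_err = w, sse(w, valid_set)
    bad_steps = 0
    for _ in range(max_steps):
        w -= eta * grad(w, train_set)      # reduce the training error
        err = sse(w, valid_set)            # monitor the validation error
        if err < best_err:
            best_w, best_err = w, err      # new best point: remember it
            bad_steps = 0
        else:
            bad_steps += 1
            if bad_steps >= patience:      # validation error keeps rising
                break
    return best_w, best_err

train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # t = 2x
valid_set = [(1.0, 1.5), (2.0, 3.0), (3.0, 4.5)]  # t = 1.5x (toy stand-in for noise)

best_w, best_err = train_early_stopping(train_set, valid_set)
print(best_w, best_err, sse(2.0, valid_set))
```

In this toy setting, training to convergence would drive w to 2, whereas the returned weight stays close to the value that minimizes the validation error.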