3.2 Designing Feature Extraction Model using GAs And ANNs
3.2.2 ANN: A universal computational method
3.2.2.2 The training of the network
In the iterative ANN training process, as shown in Figure 3.9 on page 77, the network is shown a sample pattern and uses the pattern to calculate its output. The output is then compared with the target output, i.e. an ideal output for the sample. The difference between the target output and the network output is the measure of how well the network is performing. Unless the output is perfectly matched with the target output, an adjustment is usually made to the weights to improve network performance.
The adjustment of the connection weights is driven by the error δ, i.e. the discrepancy between the target output and the network output. If the network contains only one output node, with output y and target output t, then δ can be expressed as:
δ = t − y. (3.2)
If the network output and the target output are identical, no further learning is required for this sample pattern; another sample pattern is fed into the network and the training process continues. If the match is not perfect, the network needs to improve by adjusting the connection weights so that it performs better when the same sample pattern is presented in the future. The weight adjustment ∆w is proportional to both the input to the node, x, and the size of the error δ:
∆w = ηδx, (3.3)
where η is the learning rate, which determines how large a change is made to the weight. The connection weight is then updated with:
w(new) = w(old) + ∆w. (3.4)
Once the weights on all connections have been adjusted, another sample pattern is taken and the process continues until all sample patterns have been learned and the error δ in the network's prediction becomes negligible. The update rule described here is the delta rule; its extension to networks with hidden layers, in which the output error is propagated backwards to the earlier layers, is commonly known as backpropagation learning.
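The iterative procedure described above can be illustrated with a minimal sketch for a single linear output node. The function name `train_delta_rule`, the stopping tolerance `tol`, and the toy sample patterns are assumptions made for illustration only; they are not taken from the model used in this research.

```python
def train_delta_rule(samples, eta=0.05, epochs=200, tol=1e-6):
    """Train one linear output node with the error-driven weight update.

    samples: list of (inputs, target) pairs; eta: learning rate.
    """
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs                      # connection weights
    for _ in range(epochs):
        worst = 0.0
        for x, t in samples:
            # Network output for this sample pattern.
            y = sum(wi * xi for wi, xi in zip(w, x))
            delta = t - y                     # error, Eq. (3.2)
            for i in range(n_inputs):
                w[i] += eta * delta * x[i]    # weight change, Eq. (3.3)
            worst = max(worst, abs(delta))
        if worst < tol:                       # error has become negligible
            break
    return w


# Hypothetical toy patterns generated by t = 2*x1 - x2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights = train_delta_rule(samples, eta=0.1, epochs=2000)
print(weights)  # approaches [2.0, -1.0]
```

In this sketch the weights are adjusted after every sample pattern, and training stops once the largest error seen in a full pass over the patterns falls below `tol`.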
The training process aims to diminish the difference between target output and network output over a large number of sample patterns. The error can be reduced by making a suitable change to the connection weights, as described earlier, and the process can be tuned through the choice of learning rate and/or the incorporation of momentum.
In an ANN model, the weights store information about the sample patterns, and this information builds up over time as training proceeds. If a large adjustment is made to the weights, knowledge learned previously may be lost. However, if only a small adjustment is made, the weights move only slightly in the direction of their optimum values and learning takes too long to converge. Thus, a typical learning rate η for an ANN is below 0.1 (Cartwright, 2008a).
Cartwright (2008a) suggested a sensible compromise: gradually diminishing the learning rate as training proceeds, so that in the early stages coarse adjustments are made to the weights, while in the later stages only gentle adjustments are made, allowing the weights to be fine-tuned. For this approach, Cartwright suggested that the learning rate should lie in the range 0.0 to 1.0.
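A diminishing learning rate of this kind could be sketched as follows. The linear form of the decay, the function name `decayed_learning_rate`, and the start and end values are assumptions chosen for illustration; the text above does not state which decay form was intended.

```python
def decayed_learning_rate(epoch, total_epochs, eta_start=0.5, eta_end=0.01):
    """Shrink the learning rate linearly from eta_start to eta_end.

    A linear schedule is an assumption; other decay forms are possible.
    """
    if total_epochs <= 1:
        return eta_end
    fraction = epoch / (total_epochs - 1)     # 0.0 at the start, 1.0 at the end
    return eta_start + fraction * (eta_end - eta_start)


# Coarse adjustments early on, gentle adjustments towards the end.
schedule = [decayed_learning_rate(n, 5) for n in range(5)]
print(schedule)
```

The value returned for a given epoch would replace the fixed η in the weight adjustment of Equation (3.3).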
ANNs with a large number of interconnected nodes are able to model any continuous function. Consequently, the error surface can also be highly corrugated, displaying numerous minima and maxima. Once the adjustments to the weights become small, the network can easily become trapped in one of these minima, and learning becomes cyclic. In the context of machine learning, this is known as the local minimum problem. To reduce the chance of the network being trapped in endless oscillation, a momentum term α is generally applied when updating the connection weights. Momentum carries part of the previous weight change forward into the current one, giving the network enough impetus to pass over local peaks in the error surface. It does so by adding a proportion of the weight update from the previous epoch, n − 1, to the weight update in the current epoch, n:
w_ij(n + 1) = w_ij(n) + η δ_j(n) x_ij + α[w_ij(n) − w_ij(n − 1)], (3.5)
where 0 ≤ α < 1.0.
The effect is to let momentum carry the weight updates forward as the network travels across the error surface. Consequently, the network is more likely to escape from a local minimum on the error surface rather than being trapped within it.
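As a minimal sketch, Equation (3.5) can be written for a single connection weight as follows. The function name `momentum_step` and the numerical values are hypothetical and serve only to illustrate how the momentum term changes the update.

```python
def momentum_step(w_now, w_prev, delta, x, eta=0.05, alpha=0.9):
    """Return w(n + 1) for one connection weight, following Eq. (3.5)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    previous_change = w_now - w_prev          # weight update made in epoch n - 1
    return w_now + eta * delta * x + alpha * previous_change


# Hypothetical values: the weight moved from 0.0 to 0.1 in the previous epoch.
with_momentum = momentum_step(0.1, 0.0, delta=0.5, x=1.0, eta=0.05, alpha=0.9)
without_momentum = momentum_step(0.1, 0.0, delta=0.5, x=1.0, eta=0.05, alpha=0.0)
print(with_momentum, without_momentum)
```

With α = 0 the expression reduces to the plain update without momentum; with α close to 1, most of the previous change is carried into the current step.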
In addition to being trapped in local minima, ANNs can easily be over-fitted to the data. This arises when the network is trained for too long or when the network is over-parameterised. As the network learns, connection weights are adjusted so that the network can model the general rules that underlie the data. If these general rules apply to a large proportion of the sample patterns, the network will repeatedly see examples of these rules and learn them first. Subsequently, the network starts to learn more specialised rules, which occur in only a few examples. Once it has learned these rules well, if training is allowed to continue, the network may start to learn specific sample patterns within the data, and the network will then be overtrained (over-fitted). This happens because the network keeps fitting the connection weights ever closer to the target outputs in order to reduce its error rate. Over-fitting in ANNs can generally be tackled either by monitoring the quality of the training process using appropriate validation mechanisms (see Section 2.2.2 on page 33), or by ensuring that the network is small yet sufficient for the data. Validation mechanisms are commonly used to assess network performance in studies of pattern recognition problems, sample classification in particular.
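One common way of monitoring training with a validation set is early stopping, sketched below. This is an illustrative assumption rather than the validation mechanism used in this research; the function name `early_stopping_epoch`, the `patience` parameter, and the validation errors are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training is considered stopped once the validation error has failed to
    improve for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                          # further training risks over-fitting
    return best_epoch


# Hypothetical validation errors recorded after each training epoch.
val_errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.24, 0.26, 0.21]
best = early_stopping_epoch(val_errors, patience=3)
print(best)
```

The weights saved at the returned epoch would be kept. A larger `patience` tolerates longer plateaus in the validation error at the cost of more training time.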
Taking the over-fitting and trapping problems into consideration, we decided to use simple feedforward learning rather than backpropagation learning, with no additional learning acceleration techniques, in our model. The primary objective of this research was to find a feature set that correctly discriminates between the classes. The presumption, and this is a major assumption of our model, is that the selected feature set will, in some sense, be the feature set that truly discriminates between the classes. That is to say, by deliberately not focusing on the quality of the ANN classifier, the selected feature set should be closer to the true discriminating feature set for the given classes. This led us to select the feedforward learning method for our model.