
PART IV MODELS FOR ANALYSING SECURITY RISKS AND POLICY IMPLICATIONS

3.2 Model Training

The back-propagation algorithm is used in training the ANN. See Murray (1995), Mehrotra et al. (1997) and Haykin (1999) for an explanation of the algorithm. With an initial set of weights and a given record $(I_{11}, I_{12}, \ldots, I_{1n}, O_1)$ of the training data set, the learning mechanism of the ANN first calculates the input and output of each node in the hidden layer using equations (1) and (2), and then finds the value at the output node $O$ using equation (3). The learning mechanism compares the predicted QC rate with the given one in the training set. The error between the two is propagated back to the hidden layer and then to the input layer to adjust the weights. With the updated weights, the second record of the training data is fed into the ANN, and the learning mechanism repeats the process, calculating the input and output values for each node in the hidden layer and in the output layer, together with the error. If the error is acceptable, the training process stops. Otherwise, the error is back-propagated again to adjust (update) the weights, and the next record of the training data is fed into the ANN to continue the training.

The back-propagation algorithm is a generalization of the least mean square algorithm that modifies network weights to minimize the mean squared error between the desired and actual outputs of the network. Mean squared error (MSE) is defined as follows.

$$\mathrm{MSE} = \frac{1}{P} \sum_{p=1}^{P} (O_p - O_{op})^2, \tag{4}$$

where $P$ is the number of input patterns; $O_p$ is the QC rate for input pattern $p$ predicted by the ANN model; and $O_{op}$ is the observed value of the QC rate for this input pattern.
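As a small illustration of equation (4), the sketch below computes the MSE for a handful of patterns. The function name and the QC rates are hypothetical and serve only to show the arithmetic.

```python
def mean_squared_error(predicted, observed):
    """Equation (4): mean of squared gaps between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((o_p - o_op) ** 2 for o_p, o_op in zip(predicted, observed))
    return total / len(predicted)


# Hypothetical QC rates for P = 3 input patterns (illustration only).
predicted = [0.5, 0.25, 0.75]
observed = [0.25, 0.25, 1.0]
mse_value = mean_squared_error(predicted, observed)  # (0.0625 + 0 + 0.0625) / 3
```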

Back-propagation uses supervised learning, in which the network is trained using data for which the inputs as well as the desired outputs are known. Once trained, the network weights are fixed and can be used to compute output values for new input samples. One way to minimize the MSE is based on the gradient descent method. To do so, the change in a weight $w_{ij}$ is made proportional to $-(\partial \mathrm{MSE} / \partial w_{ij})$.

The feed-forward process involves presenting an input pattern to the input layer nodes, which pass the input values on to the hidden layer. Each of the hidden layer nodes computes a weighted sum of its inputs using equation (1), passes the sum through its activation function and presents the result to the output layer.

In this study, the sigmoid function (equation (2)) is used as the activation function from the input to the hidden layer, and a linear function (equation (3)) from the hidden layer to the output layer.

For each input pattern $[I_{p1}, I_{p2}, \ldots, I_{pn}]$ (where $p = 1, 2, \ldots, P$), the net

The output of the node in the output layer is

$$O_p = \sum_{i=1}^{m} w^{2}_{io}\, VO_{pi}. \tag{7}$$
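The feed-forward pass described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: it assumes a single hidden layer, no bias terms, and the standard logistic sigmoid $1/(1+e^{-x})$ for the hidden-layer activation. The names `forward`, `w1` and `w2` are hypothetical.

```python
import math


def sigmoid(x):
    # Assumed form of the sigmoid activation: the standard logistic function.
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w1, w2):
    """Feed one input pattern through the network.

    w1[j][i] is the weight from input node j to hidden node i;
    w2[i] is the weight from hidden node i to the single output node.
    Returns the hidden-layer outputs and the network output.
    """
    m = len(w2)
    hidden_out = []
    for i in range(m):
        net = sum(inputs[j] * w1[j][i] for j in range(len(inputs)))
        hidden_out.append(sigmoid(net))
    # Linear output node: weighted sum of hidden outputs, as in equation (7).
    output = sum(w2[i] * hidden_out[i] for i in range(m))
    return hidden_out, output


# Hypothetical weights: zero input-to-hidden weights give hidden outputs of 0.5.
hidden, out = forward([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [1.0, 3.0])
```

With all input-to-hidden weights at zero, each hidden node outputs 0.5, so the output is 1.0 × 0.5 + 3.0 × 0.5 = 2.0.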

According to the gradient descent method, the weight changes are given by the following two equations:

can be derived from equations (4)–(7). Here $\eta$ is the learning rate. The training algorithm is outlined below.

Start with an initial set of weights.

While MSE > preset value (which is calibrated to be 0.8 in pilot runs for this study):

For each input pattern $p$, $p = 1, 2, \ldots, P$:

Compute the output $O_p$ using equations (5)–(7);

Modify the weights between hidden and output nodes by $\Delta w^{2}_{io}$;

Modify the weights between input and hidden nodes by $\Delta w^{1}_{ji}$;

End for;

End while.
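The outline above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's code: it assumes no bias terms, the logistic sigmoid in the hidden layer, and the per-pattern squared error $(O_p - O_{op})^2$ as the quantity differentiated at each step. The `max_epochs` guard, the toy data and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, w2):
    hidden = [sigmoid(sum(x[j] * w1[j][i] for j in range(len(x))))
              for i in range(len(w2))]
    return hidden, sum(w2[i] * hidden[i] for i in range(len(w2)))


def mse(data, w1, w2):
    return sum((forward(x, w1, w2)[1] - t) ** 2 for x, t in data) / len(data)


def train(data, n_hidden, eta=0.005, target_mse=0.8, max_epochs=1000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initial weights drawn from U(-0.1, 0.1).
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    epochs = 0
    # max_epochs is a safety guard added for this sketch.
    while mse(data, w1, w2) > target_mse and epochs < max_epochs:
        for x, target in data:  # weights change after every pattern
            hidden, out = forward(x, w1, w2)
            err = out - target
            for i in range(n_hidden):
                # Gradient w.r.t. hidden node i's net input, using the
                # hidden-to-output weight as it was before this update.
                grad_net = 2.0 * err * w2[i] * hidden[i] * (1.0 - hidden[i])
                w2[i] -= eta * 2.0 * err * hidden[i]
                for j in range(n_in):
                    w1[j][i] -= eta * grad_net * x[j]
        epochs += 1
    return w1, w2, epochs


# Hypothetical toy data: (input pattern, observed output).
toy_data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 2.0), ([1.0, 1.0], 3.0)]
w1, w2, epochs = train(toy_data, n_hidden=3)
final_mse = mse(toy_data, w1, w2)
```

The loop checks the MSE over all patterns once per pass, which is one reading of the "While" condition in the outline; checking it after every pattern would be an equally plausible reading.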

To avoid bias from the initial parameter values, training is generally commenced with randomly chosen initial weight values. In this study, the initial weights are randomly generated from $U(-0.1, 0.1)$.

There are two approaches to learning: “per-pattern” learning, in which the weights are changed after every sample presentation, and “per-epoch” learning, in which the weights are updated only after all samples have been presented to the network. We use “per-pattern” training in this study because it is simple to implement and because its stochastic search of the weight space reduces the risk of becoming trapped in a local minimum.

As stated before, the changes in weights are proportional to the negative gradient of the error. This guideline determines the relative changes that must occur in different weights when a training sample is presented, but does not fix the exact magnitudes of the desired weight changes. The magnitude of change depends on the choice of the learning rate $\eta$. A large value of $\eta$ leads to rapid learning, but the weights may then oscillate, while a low value implies slow learning. The right value of $\eta$ depends on the application. In this study, based on computational experience, $\eta$ is initially set at 0.005 and, as computation proceeds, it is adjusted by the following heuristic: increase $\eta$ by a fixed amount of 0.003 at every iteration that improves performance by some significant amount (8%); decrease $\eta$ by a fixed amount of 0.003 at every iteration that worsens performance by some significant amount (8%).
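One possible reading of this heuristic is sketched below. The text does not say how "performance" is measured, so the sketch assumes it is the relative change in error between consecutive iterations. The floor `eta_min` is also an assumption, added only so that repeated decreases of 0.003 from a starting value of 0.005 cannot drive the rate to zero or below. All names are hypothetical.

```python
def adjust_learning_rate(eta, prev_error, curr_error,
                         step=0.003, threshold=0.08, eta_min=1e-4):
    """Raise or lower eta by a fixed step when the error changes by >= 8%.

    Assumption: "performance" is the relative change in error between two
    consecutive iterations. eta_min is a floor added for this sketch.
    """
    if prev_error <= 0:
        return eta
    rel_improvement = (prev_error - curr_error) / prev_error
    if rel_improvement >= threshold:
        eta += step          # error fell by at least 8%: speed up
    elif rel_improvement <= -threshold:
        eta -= step          # error rose by at least 8%: slow down
    return max(eta, eta_min)


eta_up = adjust_learning_rate(0.005, prev_error=1.0, curr_error=0.9)
eta_same = adjust_learning_rate(0.005, prev_error=1.0, curr_error=0.95)
eta_down = adjust_learning_rate(0.005, prev_error=1.0, curr_error=1.1)
eta_floor = adjust_learning_rate(0.002, prev_error=1.0, curr_error=1.1)
```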

Back-propagation may lead the weights in an ANN to a local minimum of the MSE that is substantially different from its global minimum, the best choice of weights. To prevent the network from getting stuck in a local minimum, we make the weight changes depend on the average gradient of the MSE in a small region rather than on the precise gradient at a point. However, calculating averages can be expensive. A shortcut, suggested by Rumelhart et al. (1986), is to make the weight changes in the $(t+1)$th iteration of the back-propagation algorithm depend on the immediately preceding weight changes, made in the $t$th iteration. This has an averaging effect and reduces drastic fluctuations in weight changes over consecutive iterations.
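This shortcut is commonly known as a momentum term. A minimal sketch, assuming the usual form $\Delta w(t+1) = -\eta\, \partial E / \partial w + \alpha\, \Delta w(t)$, is given below. The momentum coefficient `alpha` and its value are hypothetical, since the text does not report the one used in the study.

```python
def momentum_step(weights, grads, prev_deltas, eta=0.005, alpha=0.9):
    """Apply one gradient step in which each weight change also carries
    a fraction alpha of the previous change for that weight."""
    deltas = [-eta * g + alpha * d for g, d in zip(grads, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Two consecutive steps with the same gradient (hypothetical values).
w_a, d_a = momentum_step([1.0], [2.0], [0.0], eta=0.5, alpha=0.5)
w_b, d_b = momentum_step(w_a, [2.0], d_a, eta=0.5, alpha=0.5)
```

In the second step the change is -1.0 from the gradient plus 0.5 × (-1.0) carried over from the first step, which shows how consecutive changes in the same direction reinforce each other while alternating ones partly cancel.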

Given a large network, it is possible that repeated training iterations successively improve the performance of the network on the training data, e.g. by “memorizing” training samples, but that the resulting network performs poorly on other data. This phenomenon is called over-training. There are various techniques for avoiding over-training. See Prechelt (1998) for a discussion and empirical study of these methods. One solution is to constantly monitor the performance of the network on test data on which the system has not been trained. Neural learning is considered successful only if the system can perform well on the test data. We emphasize the capability of a network to generalize “rules” from input training samples, not merely to memorize data.

Therefore, in this study, each time all the training samples have been presented to the network, a test set is presented to it as well. The weights are adjusted on the basis of the training set only, but the error is monitored on the test set.

The training continues as long as the error on the test set is decreasing, and is terminated if the error on the test set increases or reaches a preset value.

Actually, with this stopping criterion, final weights are validated with the test data in an indirect manner. Since the weights are not obtained from the current test data, it is expected that the network will continue to perform well on further test data.
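The stopping rule described above can be sketched as follows. The two callables stand in for one pass over the training set and one evaluation on the test set. They are hypothetical placeholders, as are `max_epochs` and the scripted error sequence used in the demonstration.

```python
def train_until_test_error_rises(train_one_epoch, test_error,
                                 target_error, max_epochs=1000):
    """Train while the test-set error keeps falling.

    Stops when the test error increases relative to the previous epoch
    or reaches target_error. max_epochs is a guard added for this sketch.
    """
    history = [test_error()]
    for _ in range(max_epochs):
        train_one_epoch()        # weights are adjusted on training data only
        err = test_error()       # the error is monitored on the test set
        history.append(err)
        if err > history[-2]:
            return "test error increased", history
        if err <= target_error:
            return "target reached", history
    return "max epochs", history


# Demonstration with a scripted sequence of test errors (illustration only).
scripted = iter([5.0, 3.0, 2.0, 2.5, 1.0])
reason, history = train_until_test_error_rises(
    lambda: None, lambda: next(scripted), target_error=0.8)
```

In the demonstration the test error falls from 5.0 to 2.0 and then rises to 2.5, so training stops at that point and the final scripted value is never evaluated.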