

7.3 Missing values and neural networks

7.3.1 Techniques for handling missing values

One of the simplest and most commonly used techniques for handling missing values in fields other than neural networks is value substitution (Nie et al. 1970, Quinlan 1987). This involves substituting another value in place of the missing input and then evaluating the system's output as normal. The replacement value may be chosen because it is felt to be innocuous and unlikely to affect the overall result, or it may be an estimate of the most likely value of the missing input.

Value substitution can easily be applied to a neural network as the substitution can be done externally before the values are presented to the net, and so the net itself performs normally. The only task is to determine the most effective value to use as the replacement. Four different approaches to determining this value were implemented and tested.

The simplest method is zero substitution, in which the missing value is replaced by zero. Because a zero input contributes nothing to the weighted sums of the nodes it feeds, this cancels out the numerical effect of that input node. In addition, all of the data sets tested had inputs scaled so that zero was also the midpoint of the range of possible values for the scaled input data. Zero should therefore be a relatively innocuous value with little impact on the output of the network. Clearly this will not hold for every data set, as values around the midpoint may themselves be significant. However, the simplicity of this solution makes it worthy of consideration.

Mean substitution is slightly more complex. For each input, the mean value over the entire training set is calculated and used as the replacement value for that input. This is a very crude approximation to the most likely value of that input. For some distributions of input values it will clearly be a very poor estimate, but it is extremely easy to calculate.
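Zero and mean substitution can be sketched in a few lines. The sketch below is illustrative only and is not the code used in this research: it assumes that a missing input is represented by `None`, and the function names and the toy training set are hypothetical.

```python
def column_means(training_set):
    """Mean of each input over the training set (None entries are skipped)."""
    n_inputs = len(training_set[0])
    means = []
    for i in range(n_inputs):
        known = [row[i] for row in training_set if row[i] is not None]
        means.append(sum(known) / len(known))
    return means


def zero_substitute(example):
    """Replace every missing input with zero."""
    return [0.0 if value is None else value for value in example]


def mean_substitute(example, means):
    """Replace every missing input with the training-set mean of that input."""
    return [mean if value is None else value
            for value, mean in zip(example, means)]


# Toy training set with two inputs (assumed already scaled around zero).
training_set = [[-1.0, 0.5], [1.0, 0.0], [0.75, 1.0]]
means = column_means(training_set)

incomplete = [None, 0.25]          # the first input is missing
print(zero_substitute(incomplete))
print(mean_substitute(incomplete, means))
```

In both cases the substitution happens before the example reaches the network, so the network itself needs no modification.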

A more sophisticated method of estimating the value for the unknown input is network estimate substitution, which takes into account the values of the known inputs in producing a substitution value. An additional network is trained, using all of the input values except the missing value. This network is trained to produce as its output an estimate of the value of the input which it does not use. Once trained, this reduced network can be used to produce an estimate of the most likely value for the missing input. This estimate is then substituted for the missing input, and the full network using all inputs is used to perform the actual classification. This method will not work if the inputs are uncorrelated, but in most real-world situations at least some of the inputs are correlated to a certain extent. The estimation network can either be trained when it is needed, or multiple networks can be trained in advance, with each providing an estimate for a single input. Like the reduced network classification method, this approach requires a large amount of additional training over that needed to create the basic classification network. However, each of the estimate networks has only a single output, and so this method will still require less training than the reduced network classifier (see footnote 22).
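The idea of estimating a missing input from the known inputs can be illustrated with a deliberately simplified stand-in. The sketch below is not the estimation network used in this research: it replaces the trained network with a single linear output unit fitted by the delta (LMS) rule, and the function names, learning rate, epoch count and toy data are all assumptions made for the example.

```python
def train_linear_estimator(examples, target_index, epochs=2000, rate=0.05):
    """Fit one linear unit that predicts input `target_index` from the
    remaining inputs, using the delta (LMS) rule."""
    weights = [0.0] * (len(examples[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for row in examples:
            known = [v for i, v in enumerate(row) if i != target_index]
            output = bias + sum(w * x for w, x in zip(weights, known))
            error = row[target_index] - output
            weights = [w + rate * error * x for w, x in zip(weights, known)]
            bias += rate * error
    return weights, bias


def estimate_missing(example, target_index, weights, bias):
    """Return a copy of `example` with the missing input filled in."""
    known = [v for i, v in enumerate(example) if i != target_index]
    filled = list(example)
    filled[target_index] = bias + sum(w * x for w, x in zip(weights, known))
    return filled


# Toy data in which the third input is correlated with the first two
# (here exactly 0.5 * x0 - 0.5 * x1, purely for illustration).
complete_examples = [
    [-1.0, -1.0, 0.0], [-1.0, 1.0, -1.0], [1.0, -1.0, 1.0],
    [1.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.5, -0.5, 0.5],
]
weights, bias = train_linear_estimator(complete_examples, target_index=2)
filled = estimate_missing([1.0, -1.0, None], 2, weights, bias)
print(filled)  # the filled example would then go to the full classifier
```

If the inputs were uncorrelated, an estimator of this kind could do no better than predict a constant, which is why the method depends on some correlation between inputs.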

The final substitution method tested was multiple substitution. In this approach the classification network is evaluated several times, substituting a range of different values in place of the missing input. The outputs generated in this manner are then combined to produce the final classification. The major factors to be determined with this approach are the number and range of values to be substituted for the missing input, and the manner in which the multiple outputs are combined to produce the overall classification. For this research ten values equally spaced over the range of the input were used for the substitution values, and the ten output vectors produced were combined using the voting technique developed during the research into committee systems discussed in Section 7.1.2.
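A minimal sketch of multiple substitution follows. It uses ten equally spaced values as described above, but several details are assumptions: the input range of -1 to 1, the toy classifier, and in particular the simple majority vote, which stands in for the committee voting technique of Section 7.1.2 and is not a reproduction of it.

```python
def substitution_values(low, high, count=10):
    """`count` values equally spaced over the range [low, high]."""
    step = (high - low) / (count - 1)
    return [low + k * step for k in range(count)]


def classify_with_multiple_substitution(classify, example, missing_index,
                                        low=-1.0, high=1.0, count=10):
    """Evaluate `classify` once per substituted value and return the class
    that receives the most votes (ties go to the class seen first)."""
    votes = {}
    for value in substitution_values(low, high, count):
        filled = list(example)
        filled[missing_index] = value
        label = classify(filled)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def toy_classifier(inputs):
    """Hypothetical stand-in for the trained classification network."""
    return 1 if inputs[0] + inputs[1] > 0.5 else 0


print(classify_with_multiple_substitution(toy_classifier, [None, 0.0], 0))
print(classify_with_multiple_substitution(toy_classifier, [None, 1.0], 0))
```

The cost of this approach is one evaluation of the classification network per substituted value, so ten evaluations per example in the configuration described above.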

System reduction

A second general approach to missing values is system reduction, in which an attempt is made to classify the example without using those parts of the classification system which are dependent on the missing input. As described earlier, this approach is difficult to implement using a fully-connected feed-forward network because of the distributed nature of the network. Every input affects every node in the network, and so there are no particular sections of the network dealing with the missing value which could be ignored or treated separately. The methods developed to adapt this approach to neural networks rely on initially creating a system which is more complex than the basic network, by adding nodes to the standard network structure. This enhanced network structure is trained so that missing input values can be explicitly identified when they are presented to the network. The rationale is that the network itself will develop techniques to perform the classification without using the missing input.

Footnote 22: As the estimation networks were being trained to produce a real-valued output rather than a binary classification, the output nodes in these networks use a linear activation function rather than the sigmoid used in the other networks described in this thesis.

The first of these methods tested was the flagged network, in which each input value is represented by a pair of input nodes rather than a single node as in a conventional network. One of these nodes is a binary flag node, which indicates whether this input value is known or not. The second node is the value node. If the input value is known then it is placed into the value node. In the case of a missing value, if the net is correctly interpreting the flag node then the actual input given to the value node should not matter. In practice, however, it was found useful to perform substitution on these value nodes, using mean substitution rather than a random value. In order for the flagged network to learn the function of the flag nodes, it is necessary to train on examples involving missing values. These can be generated from a training example by randomly selecting an input to be treated as missing and setting its flag and value nodes accordingly.
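The input encoding for a flagged network, and the generation of training examples with missing values, might look as follows. This is a sketch under stated assumptions: the text does not specify the polarity of the flag, so a flag of 1.0 is taken here to mean "missing", missing inputs are represented by `None`, and the function names are hypothetical.

```python
import random


def encode_flagged(example, means):
    """Flatten an example into [flag_0, value_0, flag_1, value_1, ...].

    Assumed convention: flag 1.0 marks a missing input, flag 0.0 a known one.
    The value node of a missing input receives the training-set mean.
    """
    encoded = []
    for value, mean in zip(example, means):
        if value is None:
            encoded += [1.0, mean]
        else:
            encoded += [0.0, value]
    return encoded


def make_missing_example(example, rng):
    """Copy a complete training example and mark one random input as missing."""
    damaged = list(example)
    damaged[rng.randrange(len(damaged))] = None
    return damaged


rng = random.Random(0)                 # seeded only to make the run repeatable
means = [0.1, 0.5, -0.2]               # hypothetical training-set means
complete = [0.3, -0.7, 0.9]
damaged = make_missing_example(complete, rng)
print(encode_flagged(damaged, means))  # six input nodes for three inputs
```

The network therefore has twice as many input nodes as the conventional network, and the flag nodes give it an explicit signal from which to learn how to discount a missing value.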

The second approach tested was the shadow weight network. Like the flagged network, this method modifies the basic network so that missing values can be explicitly identified when they are input to the network. Here this is accomplished by substituting a fixed value for the missing input value (a value of 1 was used for these experiments) and temporarily replacing the weights on the connections to this input node by an alternative set of weights, which were labelled 'shadow weights'.

Several training regimes were trialled, with the best results being obtained using a two-stage training process. The first stage consists of training the basic network on the training data with no missing values. Once this training is complete, all of the weights in this network are frozen and are not altered by any subsequent training. Figure 7.1 summarises this architecture: the connections shown as solid lines are those used in the basic network, whilst the dashed lines indicate shadow weights, which are only used if input values are missing.

Figure 7.1 A shadow weights network with three input nodes and two outputs. Solid lines and white circles indicate the standard weights and nodes; dashed lines and the shaded circle indicate shadow weights and nodes used only when an input value is missing. In this case the value of the lowest of the three input nodes is unknown.

The network is then trained on examples containing missing values. As outlined earlier, a value of 1 is substituted for the missing input and the weights on the connections to this input are temporarily replaced by shadow weights (which are initially randomised). In addition, it was found that performance was improved by adding a number of extra nodes to the hidden layer and training them on these examples. During this training phase the only weights modified were those connected to the new hidden nodes and the shadow weights connected to the missing input value.

Once training is completed the basic network can be used whenever all input data is available. If data with missing values is encountered the basic network augmented by the shadow nodes and shadow weights is used instead. In this way the system adapts to the situation of missing values without degrading its performance when all input values are available.
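The weight-swapping step of the shadow weight network can be sketched for a single hidden layer as below. This is an illustration rather than the implementation used in this research: the weight layout, the bias terms, the function names and the numerical values are assumptions, missing inputs are represented by `None`, and the extra hidden nodes added in the second training stage are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activations(example, weights, shadow_weights, biases):
    """Hidden-layer outputs for one example.

    weights[j][i] is the standard weight from input i to hidden node j and
    shadow_weights[j][i] is its shadow counterpart. A missing input (None)
    contributes the fixed value 1 through its shadow weight; a known input
    contributes its own value through the standard weight.
    """
    activations = []
    for w_row, s_row, bias in zip(weights, shadow_weights, biases):
        total = bias
        for value, w, s in zip(example, w_row, s_row):
            if value is None:
                total += s * 1.0
            else:
                total += w * value
        activations.append(sigmoid(total))
    return activations


weights = [[0.5, -0.5]]          # one hidden node, two inputs (toy values)
shadow_weights = [[0.2, 0.3]]
biases = [0.0]

print(hidden_activations([1.0, 1.0], weights, shadow_weights, biases))
print(hidden_activations([1.0, None], weights, shadow_weights, biases))
```

When no input is missing the shadow weights are never touched, so the output is identical to that of the basic network, which is the property described in the paragraph above.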