
Variable Selection and Identification of Relevant MRS Features by Weight Variance

It is hardly possible for errors to enter into geometrical reasoning,... In this convenient way, the person who knows geometry acquires intelligence.

Ibn Khaldun (1332-1406), Al-Muqaddimah (An Introduction to History).

Purpose

In this chapter, a new neural network algorithm for feature selection is proposed. The algorithm, WVAR (for weight variance), is a modification of the Hebbian learning rule applied to the inputs such that the normalised variance of the weight vector from an input node is taken as a comparative measure of the relevance of this node in learning a classification task. We assess the effectiveness of the algorithm in identifying relevant inputs by testing it on the parity 4 function (see section 6.2), adding 6 extra (irrelevant) noisy features. We then apply the algorithm to the 180-input data set (section 6.3) of spectra from six normal and tumour animal tissues. We compare the performance of WVAR against two conceptually different approaches for assessing 'relevance' (Mozer and Smolensky, 1989) and 'sensitivity' (Lisboa et al, 1994) of the network's input units. We compare the three methods in terms of the number of remaining inputs necessary for correct classification and the consistency of the resulting subset under different combinations of training/testing sets and initial weights.

10.1 Introduction

In chapter 5 we showed that adapting the size of a neural network prevents overfitting and improves the generalisation performance. Using MRS data risks overfitting because the high dimensionality of the spectra implies a large number of network parameters, and the data may not contain enough patterns to fully resolve these parameters. In this chapter we use the concept of network reduction to automatically identify the components of the spectrum most significant in differentiating between tumour types. We review two existing techniques for assessing 'relevance' and 'sensitivity' of the network's input units and introduce a new technique that automatically selects the more dominant inputs. In all three techniques, during training, the network recursively removes the input unit least affecting the error function until the performance drops significantly; the remaining input units are then retained as the most relevant. We test the three procedures on a data set with 180 inputs representing reduced-resolution spectra of six normal and tumour animal tissues. We compare the number of remaining variables necessary for correct classification and the consistency of each method, using different combinations of train/test sets and initial weights.
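The recursive removal procedure shared by the three techniques can be sketched roughly as follows. This is only an illustrative outline: the function name `backward_eliminate`, the callbacks `score_inputs` and `evaluate`, and the threshold `max_error` are hypothetical names introduced here, and the stopping rule (error above a fixed threshold) is an assumed stand-in for "performance drops significantly".

```python
from typing import Callable, List, Sequence


def backward_eliminate(
    n_inputs: int,
    score_inputs: Callable[[List[int]], Sequence[float]],
    evaluate: Callable[[List[int]], float],
    max_error: float,
) -> List[int]:
    """Drop the lowest-scoring input repeatedly until the error exceeds max_error."""
    active = list(range(n_inputs))
    while len(active) > 1:
        scores = score_inputs(active)  # one score per currently active input
        worst = min(range(len(active)), key=lambda p: scores[p])
        candidate = active[:worst] + active[worst + 1:]
        if evaluate(candidate) > max_error:  # performance dropped: keep previous set
            break
        active = candidate
    return active


# Toy stand-ins (assumptions): a fixed score table and an error that jumps
# to 1.0 as soon as input 0 or input 1 is missing.
relevance = [0.9, 0.8, 0.1, 0.05, 0.2]
kept = backward_eliminate(
    5,
    score_inputs=lambda active: [relevance[i] for i in active],
    evaluate=lambda active: 0.0 if {0, 1} <= set(active) else 1.0,
    max_error=0.5,
)
print(kept)  # [0, 1]
```

In a real setting `score_inputs` would be one of the three measures discussed in this chapter and `evaluate` would train or test the reduced network; here both are replaced by toy functions so that the control flow can be followed in isolation.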


10.2 Relevance of the input units (INREL)

In general, the relevance $R_i$ of input unit $i$ may be defined such that

$O_j = f\left(\sum_i w_{ji} R_i x_i\right)$ (10.1)

where $O_j$ is the output of unit $j$, $w_{ji}$ is the weight from $i$ to $j$, $f$ the sigmoid function and $x_i$ is the input feature value. When $R_i = 0$, unit $i$ has no effect on the network, while when $R_i = 1$, the unit behaves normally. Each unit in the input space has a relevance parameter associated with it (Fig. 10.1). The vector R is optimised by steepest descent (with respect to the error function at the output) using backpropagation, in a way similar to, but separate from, the weights, adding to $R_i$ increments proportional to the derivative of the error function with respect to $R_i$:

$\Delta R_i = -\eta \, \dfrac{\partial E}{\partial R_i}$ (10.2)

where $E$ is the sum of squares over all patterns of the difference between the target values at the outputs of the network and the actual outputs (MSE) and $\eta$ is a constant. To avoid fluctuations in the adaptation over time and to ensure stability, a momentum term is added

$\Delta R_i(t) = -\eta \, \dfrac{\partial E}{\partial R_i} + \alpha \, \Delta R_i(t-1)$ (10.3)

where $\alpha$ is a constant. Thus the new values of R become

$R(t+1) = R(t) + \Delta R(t)$ (10.4)
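As a rough illustration of the relevance update, the sketch below adapts a relevance vector by steepest descent with momentum on a small single-layer sigmoid network. Several things are assumptions made for the example only: the weights are held fixed (in the text they are trained alongside, but separately from, R), the names `eta` (step size) and `alpha` (momentum constant) are our own labels for the two constants, and the toy weights and patterns are invented.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def forward(W: Matrix, R: Vector, x: Vector) -> Vector:
    """Outputs O_j = f(sum_i w_ji * R_i * x_i), i.e. inputs gated by R."""
    return [sigmoid(sum(w * r * xi for w, r, xi in zip(row, R, x))) for row in W]


def error(W: Matrix, R: Vector, X: Matrix, T: Matrix) -> float:
    """Sum of squared target-output differences over all patterns."""
    return sum((t - o) ** 2
               for x, target in zip(X, T)
               for o, t in zip(forward(W, R, x), target))


def grad_R(W: Matrix, R: Vector, X: Matrix, T: Matrix) -> Vector:
    """Derivative of the error with respect to each R_i (chain rule)."""
    g = [0.0] * len(R)
    for x, target in zip(X, T):
        for row, o, t in zip(W, forward(W, R, x), target):
            delta = 2.0 * (o - t) * o * (1.0 - o)  # dE/d(net input of unit j)
            for i, (w, xi) in enumerate(zip(row, x)):
                g[i] += delta * w * xi
    return g


def relevance_step(R: Vector, dR_prev: Vector, grad: Vector,
                   eta: float, alpha: float) -> Tuple[Vector, Vector]:
    """One update: gradient step plus momentum, then R <- R + dR."""
    dR = [-eta * g + alpha * d for g, d in zip(grad, dR_prev)]
    return [r + d for r, d in zip(R, dR)], dR


# Invented toy problem: one output unit, three inputs, three patterns.
W = [[2.0, -1.0, 0.5]]
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
T = [[1.0], [0.0], [1.0]]
R, dR = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
for _ in range(20):
    R, dR = relevance_step(R, dR, grad_R(W, R, X, T), eta=0.1, alpha=0.5)
print(R)
```

The momentum term reuses the previous increment, so contributions that point in the same direction over successive sweeps accumulate while those that alternate in sign tend to cancel.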

Figure 10.1 Relevance according to Mozer and Smolensky (1989). $R_i$ acts as a switch that enables or disables the variable $i$. By adapting $R_i$ with respect to the error function it is possible to approximate the effect on the error function exerted by the variable $i$.

Mozer and Smolensky (1989) used the principle of (10.1) to define relevance of a unit as the difference in performance between a network with the unit removed and one with that unit retained, where performance is defined in terms of the error at the output

$\rho_i = E_{\text{without unit } i} - E_{\text{with unit } i}$

Since direct calculation of $\rho_i$ is computationally prohibitive, especially for large networks, Mozer and Smolensky (1989) computed an approximate value $\hat{\rho}_i$ whose formula 'in practice' is similar to R in (10.4). However, the direct interpretation suggested by (10.1) to (10.4) illustrates the concept of a unit's relevance better: as training progresses, elements of R corresponding to more relevant units will tend to increase by accumulating the effects of different patterns, while those of units with less or no importance will fluctuate from one training sweep to another and cancel out.
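To make the cost of the direct definition concrete, the following sketch computes the relevance of every input by brute force: the error is evaluated once with all inputs present and once more for each input switched off. It assumes a single-layer sigmoid network with fixed weights and a sum-of-squares error; the function names and the toy data are hypothetical and chosen only so that one input is clearly useful and the other is ignored by the network.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def total_error(W: Matrix, gate: Vector, X: Matrix, T: Matrix) -> float:
    """Sum-of-squares error with each input i multiplied by gate[i] (1 = on, 0 = off)."""
    err = 0.0
    for x, target in zip(X, T):
        for row, t in zip(W, target):
            o = sigmoid(sum(w * g * xi for w, g, xi in zip(row, gate, x)))
            err += (t - o) ** 2
    return err


def direct_relevance(W: Matrix, X: Matrix, T: Matrix) -> Vector:
    """Error without unit i minus error with unit i, for every input i."""
    n = len(W[0])
    ones = [1.0] * n
    e_with = total_error(W, ones, X, T)
    rho = []
    for i in range(n):
        gate = ones[:i] + [0.0] + ones[i + 1:]  # switch off input i only
        rho.append(total_error(W, gate, X, T) - e_with)
    return rho


# Invented toy case: the target depends on input 0; input 1 has zero weight.
W = [[4.0, 0.0]]
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
T = [[1.0], [0.0], [1.0], [0.0]]
rho = direct_relevance(W, X, T)
print(rho)
```

Each call needs one full sweep over the patterns per input unit, which is what makes the direct calculation expensive for a network with many inputs and motivates the gradient-based approximation.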

10.3 Logarithmic Sensitivity Matrix (LGSEN and LGSEN_RT)

Lisboa et al (1994) used sigmoidal output units and one hidden layer to define the sensitivity of output node k with respect to input node i as the element of an input by output sensitivity matrix given by

à InXi

Here $y_k$ and $T_k$ are the actual and target outputs on node $k$ respectively and $x_i$ is the input to node $i$. If the target value of the correct class is set to unity (as is usually the case), (10.5) reduces to

Sit, ) (1 0 .6)

j

for a network with one hidden layer, where $w_{ji}$ are the weights between the input and the hidden layer, $h_j$ is the output of hidden node $j$ and $w_{kj}$ are the weights between the hidden and the output layer (Fig. 10.2). The elements $S_{ik}$ in (10.6) become components of a vector measuring the sensitivity of the output unit $k_c$ with respect to all inputs, where $k_c$ corresponds to the correct class. Equations (10.5) and (10.6) have the advantage over the straightforward Jacobian that they involve the product of the output of the correct class by the inputs producing it and avoid the saturation term $y_k(1 - y_k)$ (chapter 9). They require, however, that the network has fully converged and that all patterns are correctly classified. After the network converges to the required level, Lisboa et al (1994) compute the sensitivity elements $S_{ik}$ for each class, form a vector of absolute maximum sensitivity across all classes, $\max_k |S_{ik}|$, and gradually set elements of the vector with low sensitivity to zero (while removing the corresponding input units) until the network fails to classify correctly. Originally, Lisboa et al (1994) did not include re-training of the network afterwards. In a minor modification to the algorithm, we retrained the network after each unit removal to allow it to recover to at least its previous level (or better). We refer to this modified version as LGSEN_RT.
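The pruning stage described above might be sketched as follows, taking the sensitivity matrix as already computed (the sketch does not attempt to evaluate the sensitivity formula itself). The names `lgsen_prune`, `classifies_all` and `retrain` are hypothetical; passing a `retrain` callback corresponds to the retraining variant, and keeping the ranking fixed after retraining is an assumption made here for simplicity.

```python
from typing import Callable, List, Optional


def lgsen_prune(
    S: List[List[float]],
    classifies_all: Callable[[List[int]], bool],
    retrain: Optional[Callable[[List[int]], None]] = None,
) -> List[int]:
    """Remove inputs in order of increasing max |S_ik| until classification fails.

    S[i][k] is the sensitivity of output k with respect to input i.
    """
    # Absolute maximum sensitivity of each input across all classes.
    saliency = [max(abs(s) for s in row) for row in S]
    order = sorted(range(len(S)), key=lambda i: saliency[i])  # least sensitive first
    active = list(range(len(S)))
    for i in order:
        candidate = [a for a in active if a != i]
        if not candidate:
            break
        if retrain is not None:
            retrain(candidate)  # optional recovery step after each removal
        if not classifies_all(candidate):
            break  # removal hurt classification: keep the previous subset
        active = candidate
    return active


# Invented sensitivities for 4 inputs and 2 classes, with a toy check that
# succeeds only while inputs 0 and 2 are both still present.
S = [[0.9, -0.1], [0.02, 0.01], [-0.6, 0.3], [0.05, -0.04]]
kept = lgsen_prune(S, classifies_all=lambda active: {0, 2} <= set(active))
print(kept)  # [0, 2]
```

Note that the sign of a sensitivity is discarded when ranking, so an input with a strongly negative effect on one class is retained just as an input with a strongly positive effect would be.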
