Output $y_i = \frac{1}{1 + e^{-c S_i}}$;

A simple application of propagation is shown in Example 4.10 for the height data. Here the classification performed is the same as that seen with the decision tree in Figure 4.11(d).

EXAMPLE 4.10

Figure 4.14 shows a very simple NN used to classify university students as short, medium, or tall. There are two input nodes, one for the gender data and one for the height data. There are three output nodes, each associated with one class and using a simple threshold activation function. Activation function $f_3$ is associated with the short class, $f_4$ is associated with the medium class, and $f_5$ is associated with the tall class. In this case, the weight of each arc from the height node is 1. The weight on each gender arc is 0. This implies that in this case the gender values are ignored. The plots for the graphs of the three activation functions are shown.
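The propagation in this example can be sketched in a few lines of Python. This is only an illustration: the cut-off heights (1.7 m and 1.95 m), the numeric gender encoding, and the names `threshold` and `classify` are assumptions, since the text shows the activation functions only as plots.

```python
def threshold(low, high):
    """Return a step activation that outputs 1 when low <= s < high, else 0."""
    def f(s):
        return 1 if low <= s < high else 0
    return f

# Hypothetical cut-offs in metres (not given in the text).
SHORT_MAX, TALL_MIN = 1.7, 1.95

# One threshold activation per output node (short, medium, tall).
ACTIVATIONS = {
    "short": threshold(float("-inf"), SHORT_MAX),
    "medium": threshold(SHORT_MAX, TALL_MIN),
    "tall": threshold(TALL_MIN, float("inf")),
}

def classify(gender, height, w_gender=0.0, w_height=1.0):
    """Propagate one tuple: weighted sum of the inputs, then each activation."""
    s = w_gender * gender + w_height * height  # gender weight 0 ignores gender
    fired = [label for label, f in ACTIVATIONS.items() if f(s) == 1]
    return fired[0]
```

Because the gender weight is 0, `classify(0, 1.8)` and `classify(1, 1.8)` return the same class.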

4.5.2 NN Supervised Learning

The NN starting state is modified based on feedback of its performance with the data in the training set. This type of learning is referred to as supervised because it is known a priori what the desired output should be. Unsupervised learning can also be performed if the output is not known. With unsupervised approaches, no external teacher set is used. A training set may be provided, but no labeling of the desired outcome is included. In this case, similarities and differences between different tuples in the training set are uncovered. In this chapter, we examine supervised learning. We briefly explore unsupervised learning in Chapter 5.

Section 4.5 Neural Network-Based Algorithms 107

FIGURE 4.14: Example propagation for tall data.

Supervised learning in an NN is the process of adjusting the arc weights based on its performance with a tuple from the training set. The behavior of the training data is known a priori and thus can be used to fine-tune the network for better behavior in future similar situations. Thus, the training set can be used as a "teacher" during the training process. The output from the network is compared to this known desired behavior. Algorithm 4.5 outlines the steps required. One potential problem with supervised learning is that the error may not be continually reduced. It would, of course, be hoped that each iteration in the learning process reduces the error so that it is ultimately below an acceptable level. However, this is not always the case. This may be due to the error calculation technique or to the approach used for modifying the weights. This is actually the general problem of NNs. They do not guarantee convergence or optimality.

ALGORITHM 4.5

Input:
   N    // Starting neural network
   X    // Input tuple from training set
   D    // Output tuple desired
Output:
   N    // Improved neural network
SupLearn algorithm:
   // Simplistic algorithm to illustrate approach to NN learning
   Propagate X through N producing output Y;
   Calculate error by comparing D to Y;
   Update weights on arcs in N to reduce error;

108 Chapter 4 Classification

Notice that this algorithm must be associated with a means to calculate the error as well as some technique to adjust the weights. Many techniques have been proposed to calculate the error. Assuming that the output from node $i$ is $y_i$ but should be $d_i$, the error produced from a node in any layer can be found by

$$|y_i - d_i| \tag{4.36}$$

The mean squared error (MSE) is found by

$$\frac{(y_i - d_i)^2}{2} \tag{4.37}$$

This MSE can then be used to find a total error over all nodes in the network or over only the output nodes. In the following discussion, the assumption is made that only the final output of the NN is known for a tuple in the training data. Thus, the total MSE error over all $m$ output nodes in the NN is

$$\sum_{i=1}^{m} \frac{(y_i - d_i)^2}{m} \tag{4.38}$$

This formula could be expanded over all tuples in the training set to see the total error over all of them. Thus, an error can be calculated for a specific test tuple or for the total set of all entries.
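As a rough illustration of these error measures, the sketch below computes a per-node error, the per-node squared error halved, and the total over $m$ output nodes. The function names are hypothetical, and the per-node error is assumed to be the absolute difference between actual and desired output.

```python
def node_error(y, d):
    """Error at one node: absolute difference between actual and desired output."""
    return abs(y - d)

def node_squared_error(y, d):
    """Squared error at one node, halved as in the per-node MSE formula."""
    return (y - d) ** 2 / 2

def total_mse(ys, ds):
    """Total MSE over the m output nodes: sum of squared differences divided by m."""
    m = len(ys)
    return sum((y - d) ** 2 for y, d in zip(ys, ds)) / m

def dataset_error(outputs, targets):
    """Assumed extension: sum the total MSE over every tuple in a training set."""
    return sum(total_mse(ys, ds) for ys, ds in zip(outputs, targets))
```

For instance, `total_mse([1, 0, 0.5], [1, 1, 0.5])` has a single wrong output node out of three, so the squared difference of 1 is divided by 3.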

The Hebb and delta rules are approaches to change the weight on an input arc to a node based on the knowledge that the output value from that node is incorrect. With both techniques, a learning rule is used to modify the input weights. Suppose for a given node, $j$, the input weights are represented as a tuple $(w_{1j}, \ldots, w_{kj})$, while the input and output values are $(x_{1j}, \ldots, x_{kj})$ and $y_j$, respectively. The objective of a learning technique is to change the weights based on the output obtained for a specific input tuple. The change in weights using the Hebb rule is represented by the following rule

$$\Delta w_{ij} = c \, x_{ij} \, y_j \tag{4.39}$$

Here $c$ is a constant often called the learning rate. A rule of thumb is that $c = \frac{1}{|\#\text{ entries in training set}|}$.

A variation of this approach, called the delta rule, examines not only the output value $y_j$ but also the desired value $d_j$ for output. In this case the change in weight is found by the rule

$$\Delta w_{ij} = c \, x_{ij} \, (d_j - y_j) \tag{4.40}$$
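The two learning rules can be compared side by side in a short sketch. It assumes the standard forms of the rules (Hebb: change proportional to input times output; delta: change proportional to input times the difference between desired and actual output). The function names are hypothetical.

```python
def hebb_update(weights, inputs, y, c):
    """Hebb rule: each weight changes by c * x_ij * y_j."""
    return [w + c * x * y for w, x in zip(weights, inputs)]

def delta_update(weights, inputs, y, d, c):
    """Delta rule: each weight changes by c * x_ij * (d_j - y_j)."""
    return [w + c * x * (d - y) for w, x in zip(weights, inputs)]
```

One visible difference: when the node output already equals the desired value, `delta_update` leaves the weights unchanged, whereas `hebb_update` keeps changing them whenever the input and output are nonzero.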

The nice feature of the delta rule is that it minimizes the error $d_j - y_j$ at each node. Backpropagation is a learning technique that adjusts weights in the NN by propagating weight changes backward from the sink to the source nodes. Backpropagation is the most well known form of learning because it is easy to understand and generally applicable. Backpropagation can be thought of as a generalized delta rule approach.

Figure 4.15 shows the structure and use of one node, $j$, in a neural network graph. The basic node structure is shown in part (a). Here the representative input arc has a weight of $w_{?j}$, where ? is used to show that the input to node $j$ is coming from another node shown here as ?. Of course, there probably are multiple input arcs to a node. The output weight is similarly labeled $w_{j?}$. During propagation, data values input at the input layer flow through the network, with final values coming out of the network at the output layer. The propagation technique is shown in part (b) of Figure 4.15.


FIGURE 4.15: Neural network usage. (a) Node j in NN; (b) propagation at node j; (c) backpropagation at node j.

Here the smaller dashed arrow underneath the regular graph arc shows the input value $x_{?j}$ flowing into node $j$. The activation function $f_j$ is applied to all the input values and weights, with output values resulting. There is an associated input function that is applied to the input values and weights before applying the activation function. This input function is typically a weighted sum of the input values. Here $y_{j?}$ shows the output value flowing (propagating) to the next node from node $j$. Thus, propagation occurs by applying the activation function at each node, which then places the output value on the arc to be sent as input to the next nodes. In most cases, the activation function produces only one output value that is propagated to the set of connected nodes.

The NN can be used for classification and/or learning. During the classification process, only propagation occurs. However, when learning is used after the output of the classification occurs, a comparison to the known classification is used to determine how to change the weights in the graph. In the simplest types of learning, learning progresses from the output layer backward to the input layer. Weights are changed based on the changes that were made in weights in subsequent arcs. This backward learning process is called backpropagation and is illustrated in Figure 4.15(c). Weight $w_{j?}$ is modified to become $w_{j?} + \Delta w_{j?}$. A learning rule is applied to this $\Delta w_{j?}$ to determine the change at the next higher level $\Delta w_{?j}$.

ALGORITHM 4.6

Input:
   N                      // Starting neural network
   X = (x_1, ..., x_h)    // Input tuple from training set
   D = (d_1, ..., d_m)    // Output tuple desired
Output:
   N                      // Improved neural network
Backpropagation algorithm:
   // Illustrate backpropagation
   Propagation(N, X);
   E = 1/2 Σ_{i=1}^{m} (d_i - y_i)^2;
   Gradient(N, E);

A simple version of the backpropagation algorithm is shown in Algorithm 4.6. The MSE is used to calculate the error. Each tuple in the training set is input to this algorithm. The last step of the algorithm uses gradient descent as the technique to modify the weights in the graph. The basic idea of gradient descent is to find the set of weights that minimizes the MSE. The partial derivative $\partial E / \partial w_{ji}$ gives the slope (or gradient) of the error function for one weight. We thus wish to find the weight where this slope is zero. Figure 4.16 and Algorithm 4.7 illustrate the concept. The stated algorithm assumes only one hidden layer. More hidden layers would be handled in the same manner with the error propagated backward.
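To make the idea of "finding the weight where the slope is zero" concrete, here is a minimal one-weight sketch. The quadratic error function, its minimum at 2.0, and the names `slope` and `descend` are all hypothetical choices for illustration. A real network's error surface is generally not this simple.

```python
def slope(w, desired=2.0):
    """Derivative of the toy error E(w) = (w - desired)^2 with respect to w."""
    return 2.0 * (w - desired)

def descend(w, eta=0.1, steps=100):
    """Repeatedly move the weight against the slope of the error function."""
    for _ in range(steps):
        w = w - eta * slope(w)  # a step downhill; the slope shrinks near the minimum
    return w
```

Starting from `w = 0.0`, each step shrinks the distance to the desired weight by a constant factor, so `descend(0.0)` ends very close to 2.0, where the slope is approximately zero.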

FIGURE 4.16: Gradient descent. [Plot of error E against weight w, with the desired weight marked.]

ALGORITHM 4.7

Input:
   N    // Starting neural network
   E    // Error found from back algorithm
Output:
   N    // Improved neural network
Gradient algorithm:
   // Illustrates incremental gradient descent
   for each node i in output layer do
      for each node j input to i do
         Δw_ji = η (d_i - y_i) y_j (1 - y_i) y_i;
         w_ji = w_ji + Δw_ji;
   layer = previous layer;
   for each node j in this layer do
      for each node k input to j do
         Δw_kj = η y_k ((1 - (y_j)^2) / 2) Σ_m (d_m - y_m) w_jm y_m (1 - y_m);
         w_kj = w_kj + Δw_kj;
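The update steps of Algorithm 4.7 can be sketched in plain Python for a network with one hidden layer. Several things here are assumptions rather than statements from the text: output nodes use a sigmoid (derivative $y(1-y)$), hidden nodes use a bipolar sigmoid (derivative $(1-y^2)/2$), the hidden-layer update includes the factor $\eta y_k$, and all function names are hypothetical. The sketch also computes the hidden-layer changes from the output weights as they were before the update, which is the usual convention.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def bipolar(s):
    # (1 - e^-s) / (1 + e^-s); its derivative is (1 - y^2) / 2
    return (1.0 - math.exp(-s)) / (1.0 + math.exp(-s))

def forward(x, w_hidden, w_out):
    """w_hidden[k][j]: input k -> hidden j; w_out[j][i]: hidden j -> output i."""
    y_hidden = [bipolar(sum(x[k] * w_hidden[k][j] for k in range(len(x))))
                for j in range(len(w_hidden[0]))]
    y_out = [sigmoid(sum(y_hidden[j] * w_out[j][i] for j in range(len(y_hidden))))
             for i in range(len(w_out[0]))]
    return y_hidden, y_out

def error(x, d, w_hidden, w_out):
    """E = 1/2 * sum over output nodes of (d_i - y_i)^2."""
    _, y_out = forward(x, w_hidden, w_out)
    return 0.5 * sum((di - yi) ** 2 for di, yi in zip(d, y_out))

def gradient_step(x, d, w_hidden, w_out, eta=0.5):
    """One incremental update of both weight layers; returns new weight lists."""
    y_hidden, y_out = forward(x, w_hidden, w_out)
    n_in, n_hid, n_out = len(x), len(y_hidden), len(y_out)
    # Output-layer factor (d_i - y_i) * y_i * (1 - y_i)
    delta_out = [(d[i] - y_out[i]) * y_out[i] * (1.0 - y_out[i]) for i in range(n_out)]
    new_out = [[w_out[j][i] + eta * delta_out[i] * y_hidden[j] for i in range(n_out)]
               for j in range(n_hid)]
    # Hidden-layer change: eta * y_k * (1 - y_j^2)/2 * sum_m delta_out[m] * w_jm
    new_hidden = [[w_hidden[k][j]
                   + eta * x[k] * (1.0 - y_hidden[j] ** 2) / 2.0
                   * sum(delta_out[m] * w_out[j][m] for m in range(n_out))
                   for j in range(n_hid)]
                  for k in range(n_in)]
    return new_hidden, new_out
```

Calling `gradient_step` once per training tuple corresponds to the incremental form of the algorithm. Comparing `error` before and after a step is a simple way to check that the update moved the weights downhill.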

This algorithm changes weights by working backward from the output layer to the input layer. There are two basic versions of this algorithm. With the batch or offline approach, the weights are changed once after all tuples in the training set are applied and a total MSE is found. With the incremental or online approach, the weights are changed after each tuple in the training set is applied. The incremental technique is usually preferred because it requires less space and may actually examine more potential solutions (weights), thus leading to a better solution. In this equation, $\eta$ is referred to as the learning parameter. It typically is found in the range (0, 1), although it may be larger. This value determines how fast the algorithm learns.
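The difference between the batch and incremental approaches can be seen on a deliberately tiny case. The sketch below is not Algorithm 4.7 itself: it assumes a single linear node trained with a delta-style change (learning rate times input times the difference between desired and actual output), and the function names are hypothetical.

```python
def delta_changes(weights, x, d, c):
    """Weight changes for one tuple on a single linear node (an assumption)."""
    y = sum(w * xi for w, xi in zip(weights, x))
    return [c * xi * (d - y) for xi in x]

def train_online(weights, data, c):
    """Incremental: apply the change after every tuple."""
    w = list(weights)
    for x, d in data:
        change = delta_changes(w, x, d, c)
        w = [wi + ch for wi, ch in zip(w, change)]
    return w

def train_batch(weights, data, c):
    """Batch: accumulate changes over all tuples, then apply them once."""
    total = [0.0] * len(weights)
    for x, d in data:
        change = delta_changes(weights, x, d, c)  # always uses the old weights
        total = [t + ch for t, ch in zip(total, change)]
    return [wi + t for wi, t in zip(weights, total)]
```

With `data = [([1.0], 1.0), ([1.0], 1.0)]`, a starting weight of `[0.0]`, and `c = 0.5`, the online pass visits two intermediate weight values, while the batch pass makes one larger jump. The two approaches therefore end one pass at different weights.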

Applying a learning rule back through multiple layers in the network may be difficult. Doing this for the hidden layers is not as easy as doing it with the output layer. Overall, however, we are trying to minimize the error at the output nodes, not at each node in the network. Thus, the approach that is used is to propagate the output errors backward through the network.

FIGURE 4.17: Nodes for gradient descent. [Arcs weighted $w_{kj}$ and $w_{ji}$ lead toward the output; node outputs $y_k$ and $y_i$ are labeled.]

Figure 4.17 shows the structure we use to discuss the gradient descent algorithm. Here node $i$ is at the output layer and node $j$ is at the hidden layer just before it; $y_i$ is the output of $i$ and $y_j$ is the output of $j$.

The learning function in the gradient descent technique is based on using the following value for delta at the output layer:


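$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial S_i} \frac{\partial S_i}{\partial w_{ji}} \tag{4.41}$$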
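As a hedged worked example, the chain-rule factors in Equation (4.41) can be expanded term by term. The expansion assumes that the error is $E = \frac{1}{2}\sum_{i}(d_i - y_i)^2$, that the output node uses a sigmoid activation $y_i = 1/(1 + e^{-S_i})$, and that the input function is the weighted sum $S_i = \sum_j w_{ji} y_j$. Under those assumptions the result agrees with the output-layer update in Algorithm 4.7.

```latex
\begin{align*}
\frac{\partial E}{\partial y_i} &= -(d_i - y_i), \\
\frac{\partial y_i}{\partial S_i} &= y_i (1 - y_i), \\
\frac{\partial S_i}{\partial w_{ji}} &= y_j, \\
\Delta w_{ji} &= -\eta \,\bigl(-(d_i - y_i)\bigr)\, y_i (1 - y_i)\, y_j
              = \eta \,(d_i - y_i)\, y_j \,(1 - y_i)\, y_i .
\end{align*}
```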