Multilayer nets and backpropagation
6.1 Training rules for multilayer nets
A network that is typical of the kind we are trying to train is shown in Figure 6.1. It consists of a set of input distribution points (shown in black) and two layers of semilinear nodes shown as circles. The second, or output, layer signals the network's response to any input, and may have target vectors $\mathbf{t}_i$ applied to it in a supervised training regime. The other half, $\mathbf{v}_i$, of each training pattern is applied to the inputs of an intermediate layer of hidden nodes, so called because we do not have direct access to their outputs for the purposes of training and they must develop their own representation of the input vectors.
Figure 6.1 Two-layer net example.
The idea is still to perform a gradient descent on the error E considered as a function of the weights, but this time the weights for two layers of nodes have to be taken into account. The error for nets with semilinear nodes has already been established as the sum of pattern errors $e^p$ defined via (5.15). Further, we assume the serial, pattern mode of training so that gradient estimates based on information available at the presentation of each pattern are used, rather than true gradients available only in batch mode. It is straightforward to calculate the error gradients for the output layer; we have direct access to the mismatch between target and output and so they have the same form as those given in (5.16) for a single-layer net in the delta rule. This is repeated here for convenience:
$$\Delta w_{ji} = \alpha \sigma'(a_j)(t_j^p - y_j^p) x_{ji}^p \qquad (6.1)$$
where the unit index j is supposed to refer to one of the output layer nodes. The problem is now to determine the gradients for nodes (index k) in the hidden layer. It is sometimes called the credit assignment problem since one way of thinking about this is to ask the question—how much "blame" or responsibility should be assigned to the hidden nodes in producing the error at the output? Clearly, if the hidden layer is feeding the output layer poor information then the output nodes can't sensibly aspire to match the targets.
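As a minimal sketch of the output-layer rule, the following computes the weight changes for a single output node on one pattern. It assumes the delta-rule form in which each change is the product of the learning rate, the sigmoid slope, the target-output mismatch and the input on that weight. The logistic sigmoid and the names `sigmoid`, `sigmoid_slope` and `output_weight_changes` are illustrative assumptions, not taken from the text.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, assumed here as the semilinear output function."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_slope(a):
    """Slope of the logistic sigmoid at activation a."""
    y = sigmoid(a)
    return y * (1.0 - y)

def output_weight_changes(weights, inputs, target, alpha):
    """Delta-rule weight changes for one output node on one pattern."""
    a = sum(w * x for w, x in zip(weights, inputs))   # activation a_j
    y = sigmoid(a)                                    # node output y_j
    # Factor shared by every weight of this node: rate * slope * mismatch.
    common = alpha * sigmoid_slope(a) * (target - y)
    return [common * x for x in inputs]

# Hypothetical values: two inputs, one of which is zero on this pattern.
weights = [0.5, -0.3]
inputs = [1.0, 0.0]
changes = output_weight_changes(weights, inputs, target=1.0, alpha=0.25)
```

Since the output lies below the target here, the weight on the active input is increased, while the weight on the zero input is left unchanged because that input contributed nothing to the activation.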
Our approach is to examine (6.1) and see how this might generalize to a similar expression for a hidden node. The reasoning behind the appearance of each term in (6.1) was given in Section 5.3 and amounted to observing that the sensitivity of the error to any weight change is governed by three factors. Two of these, the input and sigmoid slope, are determined solely by the structure of the node and, since the hidden nodes are also semilinear, these terms will also appear in the expression for the hidden unit error-weight gradients. The remaining term, $(t^p - y^p)$, which has been designated the "$\delta$", is specific to the output nodes and our task is therefore to find the hidden-node equivalent of this quantity.
It is now possible to write the hidden-node learning rule for the kth hidden unit as
$$\Delta w_{ki} = \alpha \sigma'(a_k) \delta^k x_{ki}^p \qquad (6.2)$$
where it remains to find the form of $\delta^k$. The way to do this is to think in terms of the credit assignment problem. Consider the link between the kth hidden node and the jth output node as shown in Figure 6.2. The effect this node has on the error depends on two things: first, how much it can influence the output of node j and, second, how much the output of node j in turn affects the error. The more k can affect j, the greater the effect we expect there to be on the error.
Figure 6.2 Single hidden-output link.
However, this will only be significant if j is having some effect on the error at its output. The contribution that node j makes towards the error is, of course, expressed in the "$\delta$" for that node, $\delta^j$. The influence that k has on j is given by the weight $w_{jk}$. The required interaction between these two factors is captured by combining them multiplicatively as $\delta^j w_{jk}$. However, the kth node is almost certainly providing input to (and therefore influencing) many output nodes, so we must sum these products over all j, giving the following expression for $\delta^k$:
$$\delta^k = \sum_{j \in I_k} \delta^j w_{jk} \qquad (6.3)$$
Here, $I_k$ is the set of nodes that take an input from the hidden node k. For example, in a network that is fully connected (from layer to layer) $I_k$ is the whole of the output layer. Using (6.3) in (6.2) now gives the desired training rule for calculating the weight changes for the hidden nodes since the $\delta^j$ refer to the output nodes and are known quantities.
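The sum over output nodes can be sketched in a few lines. The example assumes a fully connected net, so that the set of nodes fed by each hidden node is the whole output layer, and an assumed storage layout in which `w_out[j][k]` holds the weight from hidden node k to output node j. The function names and all numerical values are hypothetical.

```python
def hidden_deltas(output_deltas, w_out):
    """delta for each hidden node k: sum over outputs j of delta_j * w_jk.

    w_out[j][k] is the weight from hidden node k to output node j
    (an assumed layout for a fully connected net).
    """
    n_hidden = len(w_out[0])
    return [sum(output_deltas[j] * w_out[j][k] for j in range(len(w_out)))
            for k in range(n_hidden)]

def hidden_weight_changes(delta_k, slope_k, inputs, alpha):
    """Hidden-node rule: alpha * slope * delta_k * x_i for each input x_i."""
    return [alpha * slope_k * delta_k * x for x in inputs]

# Illustrative deltas for two output nodes and weights from three hidden nodes.
output_deltas = [0.2, -0.1]
w_out = [[1.0, 0.5, 0.0],
         [2.0, -1.0, 0.0]]
deltas = hidden_deltas(output_deltas, w_out)

# Weight changes for hidden node 1, with an illustrative sigmoid slope.
changes = hidden_weight_changes(deltas[1], slope_k=0.25,
                                inputs=[1.0, 2.0], alpha=0.5)
```

With these values the two contributions to hidden node 0 cancel, so it is assigned no net blame even though it feeds both outputs, and hidden node 2 receives none because its outgoing weights are zero. Only hidden node 1 ends up with a non-zero delta.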
The reader should be careful about the definition of the "$\delta$" in the literature. We found it convenient to split off the slope of the sigmoid $\sigma'(a_k)$, but it is usually absorbed into the corresponding $\delta$ term. In this scheme then, for any node k (hidden or output) we may write
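$$\Delta w_{ki} = \alpha \delta^k x_{ki}^p \qquad (6.4)$$
where for output nodes
$$\delta^k = \sigma'(a_k)(t_k^p - y_k^p) \qquad (6.5)$$
while for hidden nodes
$$\delta^k = \sigma'(a_k) \sum_{j \in I_k} \delta^j w_{jk} \qquad (6.6)$$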
In line with convention, this is the usage that will be adopted subsequently. It remains to develop a training algorithm around the rules we have developed.
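To make the conventional usage concrete, here is a hedged sketch of the weight changes for one pattern in a small fully connected two-layer net, with the sigmoid slope absorbed into each delta. Several things are assumptions made for illustration only: the logistic sigmoid (whose slope is y(1 - y)), a pattern error of half the summed squared mismatch between target and output, the absence of bias terms, and all function names and numerical values.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, an assumed choice of semilinear output function."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(w_hid, w_out, x):
    """Return hidden outputs h and network outputs y for input vector x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y = [sigmoid(sum(w * hk for w, hk in zip(row, h))) for row in w_out]
    return h, y

def pattern_error(w_hid, w_out, x, t):
    """Assumed pattern error: half the summed squared output mismatch."""
    _, y = forward(w_hid, w_out, x)
    return 0.5 * sum((tj - yj) ** 2 for tj, yj in zip(t, y))

def weight_changes(w_hid, w_out, x, t, alpha):
    """Weight changes for one pattern, slope absorbed into each delta."""
    h, y = forward(w_hid, w_out, x)
    # Output nodes: delta_j = slope * (t_j - y_j), slope = y_j (1 - y_j).
    d_out = [yj * (1.0 - yj) * (tj - yj) for yj, tj in zip(y, t)]
    # Hidden nodes: delta_k = slope * sum over outputs j of delta_j * w_jk.
    d_hid = [hk * (1.0 - hk)
             * sum(d_out[j] * w_out[j][k] for j in range(len(w_out)))
             for k, hk in enumerate(h)]
    # Every change is alpha * delta * (the input arriving on that weight).
    dw_out = [[alpha * dj * hk for hk in h] for dj in d_out]
    dw_hid = [[alpha * dk * xi for xi in x] for dk in d_hid]
    return dw_hid, dw_out

# Hypothetical net: two inputs, two hidden nodes, one output node.
w_hid = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.5, -0.6]]
x, t = [1.0, 0.5], [1.0]
dw_hid, dw_out = weight_changes(w_hid, w_out, x, t, alpha=1.0)
```

Under these assumptions, each change with a learning rate of 1 should equal the negative slope of `pattern_error` with respect to the corresponding weight. Perturbing one weight by a small amount and measuring the resulting change in error is a useful way to check an implementation.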