Chapter Five The delta rule
6.12 Final remarks
Backpropagation is probably the most well-researched training algorithm in neural nets and forms the starting point for most people looking for a network-based solution to a problem. One of its drawbacks is that it often takes many hours to train real-world problems and consequently there has been much effort directed to developing improvements in training time. For example, in the delta-bar-delta algorithm of Jacobs (1988) adaptive learning rates are assigned to each weight to help optimize the speed of learning. More recently Yu et al. (1995) have developed ways of adapting a single, global learning rate to speed up learning.
Historically, backpropagation was discovered by Werbos (1974) who reported it in his PhD thesis. It was later rediscovered by Parker (1982) but this version languished in a technical report that was not widely circulated. It was discovered again and made popular by Rumelhart et al. (1986b, c, d) in their well-known book Parallel distributed processing, which caught the wave of resurgent interest in neural nets after a comparatively lean time when it was largely overshadowed by work in conventional AI.
6.13 Summary
We set out to use gradient descent to train multilayer nets and obtain a generalization of the delta rule. This was made possible by thinking in terms of the credit assignment problem, which suggested a way of assigning “blame” for the error to the hidden units. This process involved passing back error information from the output layer, giving rise to the term “backpropagation” for the ensuing training algorithm. The basic algorithm may be augmented with a momentum term that effectively increases the learning rate over large uniform regions of the error-weight surface.
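The effect of momentum on a uniform slope can be checked numerically. The following is a minimal sketch rather than a definitive implementation: it assumes the common form of the update, in which each weight change is the usual gradient step plus a fraction of the previous change, and it holds the gradient constant to mimic a uniform region of the error-weight surface. The names `alpha` (learning rate), `lam` (momentum constant) and `grad` are illustrative assumptions.

```python
def momentum_steps(grad, alpha=0.1, lam=0.9, n_steps=200):
    """Return the successive weight changes for a constant gradient.

    Assumed update: delta(n) = -alpha * grad + lam * delta(n - 1).
    """
    delta = 0.0
    history = []
    for _ in range(n_steps):
        delta = -alpha * grad + lam * delta
        history.append(delta)
    return history


steps = momentum_steps(grad=1.0)
first_step = steps[0]    # plain gradient step, no momentum built up yet
final_step = steps[-1]   # step size after momentum has accumulated

# On a uniform slope the step tends towards -alpha * grad / (1 - lam),
# i.e. the effective learning rate is alpha / (1 - lam).
limit = -0.1 * 1.0 / (1 - 0.9)
print(first_step, final_step, limit)
```

Under these assumptions the step grows from the plain gradient step towards a value larger by a factor of 1/(1 - lam), which is the sense in which momentum increases the learning rate over uniform regions.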
One of the main problems encountered concerned the existence of local minima in the error function, which could lead to suboptimal solutions. These may be avoided by injecting noise into the gradient descent via serial update or momentum.
Backpropagation is a quite general supervised algorithm that may be applied to incompletely connected nets and nets with more than one hidden layer. The operation of a feedforward net may be thought of in several ways. The original setting was in pattern space and it may be shown that a two-layer net (one hidden layer) is sufficient to achieve any arbitrary partition in this space.
Another viewpoint is to consider a network as implementing a mapping or function of its inputs. Once again, any function may be approximated to an arbitrary degree using only one hidden layer. Finally, we may think of nets as discovering features in the training set that represent information essential for describing or classifying these patterns.
Well-trained networks are able to classify correctly patterns unseen during training. This process of generalization relies on the network developing a decision surface that is not overly complex but captures the underlying relationships in the data. If this does not occur the net is said to have overfitted the decision surface and does not generalize well. Overfitting can occur if there are too many hidden units and may be prevented by limiting the time to train and establishing this limit using a validation set. Alternatively, by making the training set sufficiently large we may minimize the ambiguities in the decision surface, thereby helping to prevent it from becoming too convoluted. A more radical approach is to incorporate the construction of the hidden layer as part of the training process.
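One way of using a validation set to limit training time can be sketched as follows. This is an illustrative assumption about how the limit might be chosen, not a procedure taken from the text: we record the validation error after each epoch and stop once it has failed to improve for a fixed number of epochs. The function name `early_stopping_epoch`, the `patience` parameter and the error values are all hypothetical.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training is halted once `patience` epochs have passed without any
    improvement on the best validation error so far.
    """
    best_err = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch


# Hypothetical validation errors: they fall, then rise as overfitting sets in.
val_errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5, 0.3]
stop_at = early_stopping_epoch(val_errors)
print(stop_at)
```

In this sketch the weights saved at the returned epoch would be the ones kept; later epochs are never examined once the stopping condition has been met.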
Example applications were provided that highlighted some of the aspects of porting a problem to a neural network setting. These were typical of the kind of problems solved using backpropagation and helped expand the notion of how training vectors originate in real situations (first introduced via the visual examples in Fig. 4.10).
6.14 Notes
1. Angled brackets usually imply the average or mean of the quantity inside.
2. Training and test data are referred to as in-sample and out-of-sample data respectively in the paper.
3. DynIM (Dynamic multi-factor model of stock returns) is a trademark of County NatWest Investment Management Ltd.
7.1 The nature of associative memory
In common parlance, “remembering” something consists of associating an idea or thought with a sensory cue. For example, someone may mention the name of a celebrity, and we immediately recall a TV series or newspaper article about the celebrity. Or, we may be shown a picture of a place we have visited and the image recalls memories of people we met and experiences we enjoyed at the time. The sense of smell (olfaction) can also elicit memories and is known to be especially effective in this way.
It is difficult to describe and formalize these very high-level examples and so we shall consider a more mundane instance that, nevertheless, contains all the aspects of those above. Consider the image shown on the left of Figure 7.1. This is supposed to represent a binarized version of the letter “T” where open and filled circular symbols represent 0s and 1s respectively (Sect. 4.6.1). The pattern in the centre of the figure is the same “T” but with the bottom half replaced by noise: pixels have been assigned a value 1 with probability 0.5. We might imagine that the upper half of the letter is provided as a cue and the bottom half has to be recalled from memory. The pattern on the right hand side is obtained from the original “T” by adding 20 per cent noise, that is, each pixel is inverted with probability 0.2. In this case we suppose that the whole memory is available but in an imperfectly recalled form, so that the task is to “remember” the original letter in its uncorrupted state. This might be likened to our having a “hazy” or inaccurate memory of some scene, name or sequence of events in which the whole may be pieced together after some effort of recall.
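The two kinds of corrupted pattern just described can be generated with a few lines of code. The sketch below is only illustrative: the 7 by 7 grid, the particular shape of the “T” and the function names are assumptions made for the example and are not taken from Figure 7.1.

```python
import random


def make_t(size=7):
    """Binarized 'T': the top row and the central column are set to 1."""
    mid = size // 2
    return [[1 if r == 0 or c == mid else 0 for c in range(size)]
            for r in range(size)]


def replace_bottom_half_with_noise(img, rng):
    """Keep the upper rows as the cue; set each lower pixel to 1 with probability 0.5."""
    half = len(img) // 2
    top = [row[:] for row in img[:half]]
    bottom = [[1 if rng.random() < 0.5 else 0 for _ in row] for row in img[half:]]
    return top + bottom


def flip_pixels(img, p, rng):
    """Invert each pixel independently with probability p."""
    return [[1 - px if rng.random() < p else px for px in row] for row in img]


def show(img):
    """Render 1s as '#' and 0s as '.' for a quick visual check."""
    return "\n".join("".join("#" if px else "." for px in row) for row in img)


rng = random.Random(0)                       # fixed seed for repeatability
letter = make_t()
cue = replace_bottom_half_with_noise(letter, rng)
noisy = flip_pixels(letter, 0.2, rng)        # 20 per cent noise
print(show(letter), show(cue), show(noisy), sep="\n\n")
```

The first corrupted pattern leaves part of the memory intact and destroys the rest, while the second degrades the whole pattern slightly; both are cues from which the original `letter` is to be recalled.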
The common paradigm here may be described as follows. There is some underlying collection of stored data which is ordered and interrelated in some way; that is, the data constitute a stored pattern or memory. In the human recollection examples above, it was the cluster of items associated with the celebrity or the place we visited. In the case of character recognition, it was the parts (pixels) of some letter whose arrangement was determined by a stereotypical version of that letter. When part of the pattern of data is presented in the form of a sensory cue, the rest of the pattern (memory) is recalled or associated with it. Alternatively, we may be offered an imperfect version of the stored memory that has to be associated with the true, uncorrupted pattern. Notice that it doesn’t matter which part of the pattern is used as the cue; the whole pattern is always restored.
Conventional computers (von Neumann machines) can perform this function in a very limited way using software usually referred to as a database. Here, the “sensory cue” is called the key or index term to be searched on. For example, a library catalogue is a database that stores the authors, titles, classmarks and data on publication of books and journals. Typically we may search on any one of these discrete items for a catalogue entry by typing the complete item after selecting the correct option from a menu. Suppose now we have only the fragment “ion, Mar” from the encoded record “Vision, Marr D.” of the book Vision by D. Marr. There is no way that the database can use this fragment of information even to start searching. We don’t know if it pertains to the author or the title, and, even if we did, we might get titles or authors that start with “ion”. The input to a conventional database has to be very specific and complete if it is to work.
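The limitation can be illustrated with a toy stand-in for such a database. The sketch below assumes, purely for illustration, that the catalogue is a dictionary keyed on the complete encoded record; the structure and the name `lookup` are hypothetical and do not describe any real library system.

```python
# A toy catalogue keyed on the complete encoded record (an assumed layout).
catalogue = {
    "Vision, Marr D.": {"title": "Vision", "author": "Marr D."},
}


def lookup(key):
    """Exact-key retrieval: returns the record, or None if the key is not complete."""
    return catalogue.get(key)


complete = lookup("Vision, Marr D.")   # the full key retrieves the record
fragment = lookup("ion, Mar")          # a fragment of the key retrieves nothing
print(complete, fragment)
```

The fragment is contained in the stored key, yet exact-key retrieval has no way of exploiting that partial information.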
Figure 7.1 Associative recall with binarized letter images.
7.2 Neural networks and associative memory
Consider a feedforward net that has the same number of inputs and outputs and that has been trained with vector pairs in which the output target is the same as the input. This net can now be thought of as an associative memory since an imperfect or incomplete copy of one of the training set should (under generalization) elicit the true vector at the output from which it was obtained. This kind of network was the first to be used for storing memories (Willshaw et al. 1969) and its mathematical analysis may be found in Kohonen (1982). However, there is a potentially more powerful network type for associative memory which was made popular by John Hopfield (1982), and which differs from that described above in that the net has feedback loops in its connection pathways. The relation between the two types of associative network is discussed in Section 7.9. The Hopfield nets are, in fact, examples of a wider class of dynamical physical systems that may be thought of as instantiating “memories” as stable states associated with minima of a suitably defined system energy. It is therefore to a description of these systems that we now turn.
7.3 A physical analogy with memory
To illustrate this point of view, consider a bowl in which a ball bearing is allowed to roll freely as shown in Figure 7.2.
Suppose we let the ball go from a point somewhere up the side of the bowl. The ball will roll back and forth and around the bowl until it comes to rest at the bottom. The physical description of what has happened may be couched in terms of the energy of the system. The energy of the system is just the potential energy of the ball and is directly related to the height of the ball above the bowl’s centre; the higher the ball the greater its energy. This follows because we have to do work to push the ball up the side of the bowl and, the higher the point of release, the faster the ball moves when it initially reaches the bottom. Eventually the ball comes to rest at the bottom of the bowl where its potential energy has been dissipated as heat and sound that are lost from the system. The energy is now at a minimum since any other (necessarily higher) location of the ball is associated with some potential energy, which may be lost on allowing the system to reach equilibrium. To summarize: the ball-bowl system settles in an energy minimum at equilibrium when it is allowed to operate under its own dynamics. Further, this equilibrium state is the same, regardless of the initial position of the ball on the side of the bowl. The resting state is said to be stable because the system remains there after it has been reached.
There is another way of thinking about this process that ties in with our ideas about memory. It may appear a little fanciful at first but the reader should understand that we are using it as a metaphor at this stage. Thus, we suppose that the ball comes to rest in the same place each time because it “remembers” where the bottom of the bowl is. We may push the analogy further by giving the ball a co-ordinate description. Thus, its position or state at any time t is given by the three co-ordinates x(t), y(t), z(t) with respect to some cartesian reference frame that is fixed with respect to the bowl. This is written more succinctly in terms of its position vector, x(t)=(x(t), y(t), z(t)) (see Fig. 7.3). The location of the bottom of the bowl, xp, represents the pattern that is stored. By writing the ball’s vector as the sum of xp and a displacement Δx, x=xp+Δx, we may think of the ball’s initial position as representing the partial knowledge or cue for recall, since it approximates to the memory xp.
If we now use a corrugated surface instead of a single depression (as in the bowl) we may store many “memories” as shown in Figure 7.4. If the ball is now started somewhere on this surface, it will eventually come to rest at the local depression that is closest to its initial starting point. That is, it evokes the stored pattern which is closest to its initial partial pattern or cue. Once again, this corresponds to an energy minimum of the system. The memories shown correspond to states x1, x2, x3 where each of these is a vector.
Figure 7.2 Bowl and ball bearing: a system with a stable energy state.
Figure 7.3 Bowl and ball bearing with state description.
There are therefore two complementary ways of looking at what is happening. One is to say that the system falls into an energy minimum; the other is that it stores a set of patterns and recalls that which is closest to its initial state. The key, then, to building networks that behave like this is the use of the state vector formalism. In the case of the corrugated surface this is provided by the position vector x(t) of the ball and those of the stored memories x1, x2,…, xn. We may abstract this, however, to any system (including neural networks) that is to store memories.
(a) It must be completely described by a state vector v(t)=(v1(t), v2(t),…, vn(t)) which is a function of time.
(b) There is a set of stable states v1, v2,…, vn, which correspond to the stored patterns or memories.
(c) The system evolves in time from any arbitrary starting state v(0) to one of the stable states, which corresponds to the process of memory recall.
As discussed above, the other formalism that will prove to be useful makes use of the concept of a system energy. Abstracting this from the case of the corrugated surface we obtain the following scheme, which runs in parallel to that just described.
(a) The system must be associated with a scalar variable E(t), which we shall call the “energy” by analogy with real, physical systems, and which is a function of time.
(b) The stable states vi are associated with energy minima Ei. That is, for each i, there is a neighbourhood or basin of attraction around vi for which Ei is the smallest energy in that neighbourhood (in the case of the corrugated surface, the basins of attraction are the indentations in the surface).
(c) The system evolves in time from any arbitrary initial energy E(0) to one of the stable states Ei with E(0)>Ei. This process corresponds to that of memory recall.
Notice that the energy of each of the equilibria Ei may differ, but each one is the lowest available locally within its basin of attraction. It is important to distinguish between this use of local energy minima to store memories, in which each minimum is as valid as any other, and the unwanted local error minima occurring during gradient descent in previous chapters. This point is discussed further in Section 7.5.3.
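The two parallel descriptions, state evolution and energy decrease, can be seen side by side in a small numerical experiment. The following is a hedged sketch under several assumptions that are not in the text: the corrugated surface is reduced to one dimension, the energy function is an arbitrary cosine with a slight tilt so that the minima have different energies, and the ball is heavily damped so that it simply slides downhill rather than rolling back and forth. All names and constants are illustrative.

```python
import math


def energy(x):
    """Assumed 1-D corrugated surface; the tilt makes the minima differ in depth."""
    return -math.cos(2 * math.pi * x) + 0.2 * x


def energy_gradient(x):
    """Derivative of the assumed energy function with respect to x."""
    return 2 * math.pi * math.sin(2 * math.pi * x) + 0.2


def recall(x0, step=0.01, n_steps=2000):
    """Let the state slide downhill from x0; return the final state and the energy history."""
    x = x0
    energies = [energy(x)]
    for _ in range(n_steps):
        x -= step * energy_gradient(x)   # move a small distance down the slope
        energies.append(energy(x))
    return x, energies


for cue in (1.3, 1.7):
    state, energies = recall(cue)
    print(f"cue {cue} -> stable state {state:.3f}, "
          f"E(0) = {energies[0]:.3f}, final E = {energies[-1]:.3f}")
```

In this sketch each cue settles into the minimum of its own basin of attraction, the energy never increases along the way, and the two minima reached have different energies although each is the lowest available locally.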
7.4