Chapter Five The delta rule

5.1 Finding the minimum of a function: gradient descent

Consider a quantity y that depends on a single variable x; we say that y is a function of x and write y=y(x). Suppose now that we wish to find the value x₀ for which y is a minimum (so that y(x₀)≤y(x) for all x) as shown in Figure 5.1. Let x* be our current best estimate for x₀; then one sensible thing to do in order to obtain a better estimate is to change x so as to follow the function "downhill" as it were. Thus, if increasing x (starting at x*) implies a decrease in y then we make a small positive change, Δx>0, to our estimate x*. On the other hand, if decreasing x results in decreasing y then we must make a negative change, Δx<0. The knowledge used to make these decisions is contained in the slope of the function at x*; if increasing x increases y, the slope is positive, otherwise it is negative.

Figure 5.1 Function minimization.

We met the concept of slope in Section 3.1.2 in connection with straight lines. The extension to general functions of a single variable is straightforward, as shown in Figure 5.2. The slope at any point x is just the slope of a straight line, the tangent, which just grazes the curve at that point. There are two ways to find the slope. First, we may draw the function on graph paper, draw the tangent at the required point, complete the triangle as shown in the figure and measure the sides Δx and Δy; the slope is then the ratio Δy/Δx. Second, we may calculate the slope from y(x) using a branch of mathematics known as the differential calculus. It is not part of our brief to demonstrate or use any of the techniques of the calculus but it is possible to understand what is being computed, and where some of its notation comes from.

Figure 5.2 Slope of y(x).

Figure 5.3 shows a closeup of the region around point P in Figure 5.2. The slope at P has been constructed in the usual way but, this time, the change Δx used to construct the base of the triangle is supposed to be very small. If δy is the change in the value of the function y due to Δx then, if the changes are small enough, δy is approximately equal to Δy, the vertical side of the triangle. We write this symbolically as δy≈Δy.

Figure 5.3 Small changes used in computing the slope of y(x).

Now, dividing Δy by Δx and then multiplying by Δx leaves Δy unchanged. Thus we may write

$$\Delta y = \frac{\Delta y}{\Delta x}\,\Delta x \tag{5.1}$$

This apparently pointless manipulation is, in fact, rather useful, for the fraction on the right hand side is just the slope. Further, since δy≈Δy we can now write

$$\delta y \approx \frac{\Delta y}{\Delta x}\,\Delta x \tag{5.2}$$

We now introduce a rather more compact and suggestive notation for the slope and write

$$\delta y \approx \frac{dy}{dx}\,\Delta x \tag{5.3}$$
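The approximation in (5.3), that the change in y is roughly the slope multiplied by Δx, can be checked numerically. The following is only a minimal sketch: the function y(x)=x² and its slope 2x are assumptions chosen for illustration and do not come from the text, and the helper name `approximation_error` is hypothetical.

```python
# Hypothetical example function and its slope (an assumption, not from the text).
def y(x):
    return x ** 2


def dy_dx(x):
    return 2.0 * x


def approximation_error(x, delta_x):
    """Gap between the true change in y and the estimate slope * delta_x."""
    true_change = y(x + delta_x) - y(x)  # the actual change in the function
    estimate = dy_dx(x) * delta_x        # right hand side of (5.3)
    return abs(true_change - estimate)


# The gap shrinks rapidly as the change in x is made smaller.
for delta_x in (0.1, 0.01, 0.001):
    print(delta_x, approximation_error(1.0, delta_x))
```

For this particular function the gap is exactly the square of Δx (up to rounding), which is why the approximation improves so quickly as Δx shrinks.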

We have already come across this kind of symbol used to denote the "rate of change" of a quantity. Informally, the ideas of "rate of change" and "slope" have a similar meaning since if a function is rapidly changing it has a large slope, while if it is slowly varying its slope is small. This equivalence in ordinary language is mirrored in the use of the same mathematical object to mean both things. It should once again be emphasized that dy/dx should be read as a single symbol, although its form should not now be so obscure since it stands for something that may be expressed as a ratio. The key point here is that there are techniques for calculating dy/dx, given the form of y(x), so that we no longer have to resort to graphical methods. By way of terminology dy/dx is also known as the differential or derivative of y with respect to x.
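To see that the graphical construction and the calculus agree, the triangle method can be imitated numerically with a very small base. This is a sketch under stated assumptions: the helper `slope_from_triangle` is a hypothetical name, and the example y(x)=sin(x), whose derivative is cos(x), is chosen purely for illustration.

```python
import math


def slope_from_triangle(f, x, delta_x=1e-6):
    """Estimate the slope at x as rise / run over a small base delta_x."""
    rise = f(x + delta_x) - f(x)
    return rise / delta_x


# Hypothetical example: y(x) = sin(x), for which calculus gives dy/dx = cos(x).
x = 0.5
print(slope_from_triangle(math.sin, x), math.cos(x))
```

The two printed values should agree to several decimal places; the calculus result is exact, whereas the triangle estimate depends on the size of the base chosen.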

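Suppose we can evaluate the slope or derivative of y and put

$$\Delta x = -\alpha \frac{dy}{dx} \tag{5.4}$$

where α>0 and is small enough to ensure δy≈Δy; then, substituting this in (5.3),

$$\delta y \approx -\alpha \left(\frac{dy}{dx}\right)^{2} \tag{5.5}$$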

Since the square of any non-zero quantity is positive, the factor -α on the right hand side of (5.5) ensures that it is negative wherever the slope is non-zero, and so δy<0; that is, we have "travelled down" the curve towards the minimal point as required. If we keep repeating steps like (5.4) iteratively, each giving a decrease like (5.5), then we should approach the value x₀ associated with the function minimum. This technique is called, not surprisingly, gradient descent and its effectiveness hinges, of course, on the ability to calculate, or make estimates of, quantities like dy/dx.
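The iterative procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: the function name `gradient_descent_1d`, the step size alpha=0.1, the number of steps, and the example y(x)=(x-3)² with minimum at x₀=3 are all hypothetical choices not taken from the text.

```python
def gradient_descent_1d(dy_dx, x_start, alpha=0.1, n_steps=100):
    """Repeatedly apply the update delta_x = -alpha * dy/dx, as in (5.4)."""
    x = x_start
    for _ in range(n_steps):
        x = x - alpha * dy_dx(x)  # move "downhill" by a small amount
    return x


# Hypothetical example: y(x) = (x - 3)^2, so dy/dx = 2(x - 3) and x0 = 3.
def dy_dx(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent_1d(dy_dx, x_start=0.0)
print(x_min)
```

Starting from x*=0, each step here shrinks the distance to x₀ by a constant factor, so the estimate settles very close to 3. A much larger alpha would violate the requirement that the changes be small and could overshoot the minimum.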

We have only spoken so far of functions of one variable. If, however, y is a function of more than one variable, say y=y(x₁, x₂, …, x_n), it makes sense to talk about the slope of the function, or its rate of change, with respect to each of these variables independently. A simple example in 2D is provided by considering a valley in mountainous terrain in which the height above sea level is a function that depends on two map grid co-ordinates x₁ and x₂. If x₁, say, happens to be parallel to a contour line at some point then the slope in this direction is zero; by walking in this direction we just follow the side of the valley. However, the slope in the other direction (specified by x₂) may be quite steep as it points to the valley floor (or the top of the valley face). The slope or derivative of a function y with respect to the variable x_i is written ∂y/∂x_i and is known as the partial derivative. Just as for the ordinary derivatives like dy/dx, these should be read as a single symbolic entity standing for something like "slope of y when x_i alone is varied". The equivalent of (5.4) is then

$$\Delta x_i = -\alpha \frac{\partial y}{\partial x_i} \tag{5.6}$$

There is an equation like this for each variable and all of them must be used to ensure that δy<0 and there is gradient descent. We now apply gradient descent to the minimization of a network error function.
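As a brief illustration of the multi-variable update before moving on, the sketch below changes every variable x_i by -α times its own partial derivative at each step. The function name `gradient_descent`, the step size, and the example "valley" y=(x₁-1)²+4(x₂+2)², with its minimum at x₁=1, x₂=-2, are assumptions made for illustration only.

```python
def gradient_descent(partials, x_start, alpha=0.1, n_steps=200):
    """Update every variable x_i by -alpha * (partial of y w.r.t. x_i)."""
    x = list(x_start)
    for _ in range(n_steps):
        # Evaluate all partial derivatives at the current point first,
        # then change every variable together.
        slopes = [partial(x) for partial in partials]
        x = [x_i - alpha * slope for x_i, slope in zip(x, slopes)]
    return x


# Hypothetical valley: y = (x1 - 1)^2 + 4 * (x2 + 2)^2.
partials = [
    lambda x: 2.0 * (x[0] - 1.0),  # partial derivative with respect to x1
    lambda x: 8.0 * (x[1] + 2.0),  # partial derivative with respect to x2
]

x_min = gradient_descent(partials, [0.0, 0.0])
print(x_min)
```

Note that the slope is steeper along x₂ than along x₁ in this example, so the same α produces larger steps in that direction, much as in the valley picture above.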