Chapter 4 Methods 1: Conventional Methods
5.3 Optimising the Network
5.3.1 Minimising the error function
Non-linear activation functions are often used in artificial neural networks, and since hidden and output units perform different roles, the choice of activation function for the output units may differ from that for the hidden units. In a general feed-forward network, each unit computes a weighted sum of its inputs in the form
$$a_j = \sum_i w_{ji} z_i$$
where $z_i$ is as defined in equation (5.3). Suppose that the error function can be written as a sum over all patterns in the training set of an error defined for each pattern separately,
$$E = \sum_n E_n$$
and that $E_n$ is a differentiable function of the network variables so that $E_n = E_n(y_1, \ldots, y_c)$.
The derivatives of the error function $E$ with respect to the weights and biases in the network can be expressed as sums over the training set of the derivatives for each pattern separately. $E_n$ depends on the weight $w_{ji}$ only via the summed input $a_j$ to unit $j$. The derivative of $E_n$ with respect to a weight $w_{ji}$ can be obtained using the chain rule for partial derivatives
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}. \tag{5.17}$$
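The required derivative is obtained by multiplying the value of $\partial E_n / \partial a_j$ for the unit at the output end of the weight by the value of $\partial a_j / \partial w_{ji} = z_i$ for the unit at the input end of the weight. For the output units
$$\frac{\partial E_n}{\partial a_k} = g'(a_k) \frac{\partial E_n}{\partial y_k} \tag{5.18}$$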
by the chain rule on equation (5.3). For the hidden units we have
$$\frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \tag{5.19}$$
where the sum runs over all units $k$ to which unit $j$ sends connections, using the fact that variations in $a_j$ give rise to variations in the error function only through variations in the variables $a_k$. This leads to the back propagation formula
$$\frac{\partial E_n}{\partial a_j} = g'(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k} \tag{5.20}$$
which allows the evaluation of errors recursively. The use of the logistic sigmoid as an activation function is computationally efficient here, since its derivative can be expressed in the form
$$g'(a) = g(a)\,(1 - g(a)). \tag{5.21}$$
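The recursion in equations (5.17) to (5.21) can be illustrated with a minimal NumPy sketch. It is not taken from the text: it assumes a single hidden layer of sigmoid units, linear output units (so that $g'(a_k) = 1$), a sum-of-squares error for one pattern, and no bias terms. The names `forward`, `pattern_error`, `backprop` and `finite_difference` are hypothetical helpers introduced for illustration only.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward pass: hidden activations z and network outputs y."""
    z = sigmoid(W1 @ x)   # z_j = g(a_j) with a_j = sum_i w_ji x_i
    y = W2 @ z            # linear outputs, y_k = a_k (assumption)
    return z, y

def pattern_error(x, t, W1, W2):
    """Sum-of-squares error E_n for one pattern (assumption)."""
    _, y = forward(x, W1, W2)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(x, t, W1, W2):
    """Gradients of E_n with respect to W1 and W2 by back propagation."""
    z, y = forward(x, W1, W2)
    delta_out = y - t                               # (5.18) with g'(a_k) = 1
    delta_hid = z * (1.0 - z) * (W2.T @ delta_out)  # (5.20) using (5.21)
    grad_W1 = np.outer(delta_hid, x)                # (5.17): delta_j * z_i
    grad_W2 = np.outer(delta_out, z)
    return grad_W1, grad_W2

def finite_difference(x, t, W1, W2, eps=1e-6):
    """Central-difference gradients, used only as an independent check."""
    grads = []
    for W in (W1, W2):
        G = np.zeros_like(W)
        for idx in np.ndindex(*W.shape):
            old = W[idx]
            W[idx] = old + eps
            e_plus = pattern_error(x, t, W1, W2)
            W[idx] = old - eps
            e_minus = pattern_error(x, t, W1, W2)
            W[idx] = old                            # restore the weight
            G[idx] = (e_plus - e_minus) / (2.0 * eps)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
g1, g2 = backprop(x, t, W1, W2)
n1, n2 = finite_difference(x, t, W1, W2)
print(np.max(np.abs(g1 - n1)), np.max(np.abs(g2 - n2)))
```

The two printed values are the largest discrepancies between the back-propagated and numerically estimated gradients; they should be close to zero if the recursion is implemented consistently with (5.20).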
The derivatives of the error function with respect to the weights obtained in this way form the Jacobian matrix of partial derivatives. The derivatives of the outputs with respect to the inputs can also be calculated in a similar manner to form a Jacobian matrix which estimates the contribution of the errors associated with the input variables to the error of the output variables (Bishop [1996]). Back propagation can also be used to obtain the second derivatives of the error with respect to the weights to form the Hessian
$$\frac{\partial^2 E}{\partial w_{ji}\, \partial w_{lk}}. \tag{5.22}$$
The Hessian and its inverse play an important role in neural computing, as detailed in Bishop [1996]. The inverse of the Hessian $H$ of the error with respect to the weights can be approximated using the outer product approximation. If $R \equiv \nabla_w E$ is the gradient of the error function, with $R^n$ its value for pattern $n$, and $N$ is the number of patterns in the data set, then the outer product approximation can be written
$$H_N = \sum_{n=1}^{N} R^n (R^n)^T \tag{5.23}$$
and the Hessian can be built up sequentially using
$$H_{N+1} = H_N + R^{N+1} (R^{N+1})^T. \tag{5.24}$$
The following matrix identity (Kailath [1980]) can then be used to provide the inversion:
$$(A + BC)^{-1} = A^{-1} - A^{-1} B\, (I + C A^{-1} B)^{-1} C A^{-1} \tag{5.25}$$
where $I$ is the identity matrix. Identifying $H_N = A$, $R^{N+1} = B$ and $(R^{N+1})^T = C$, we have
$$H_{N+1}^{-1} = H_N^{-1} - \frac{H_N^{-1} R^{N+1} (R^{N+1})^T H_N^{-1}}{1 + (R^{N+1})^T H_N^{-1} R^{N+1}} \tag{5.26}$$
This represents a procedure for evaluating the inverse of the Hessian using a single pass through the data set. The initial matrix $H_0$ is chosen to be $\alpha I$, where $\alpha$ is a small quantity. It is important to state that the outer product approximation for use with the sum of squares error function is only likely to be valid for a network trained on the same data set, or one with the same statistical properties, as the one used to evaluate the Hessian. For a general network mapping, the second derivative terms will typically not be negligible. The Hessian of the error with respect to the weights can be evaluated exactly for a network of arbitrary feed-forward topology and with a differentiable error function using an algorithm based on back-propagation for the evaluation of first derivatives, detailed on page 157 of Bishop [1996].
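The single-pass update in equation (5.26) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the procedure actually used here: the per-pattern gradient vectors $R^n$ are replaced by random vectors, the initial matrix is assumed to be a small multiple $\alpha$ of the identity, and the function name `sequential_inverse_hessian` is hypothetical.

```python
import numpy as np

def sequential_inverse_hessian(grads, alpha=1e-3):
    """Build the inverse of alpha*I + sum_n R^n (R^n)^T one pattern at a time.

    grads: array of shape (N, W); row n plays the role of the gradient R^n.
    alpha: small positive constant for the assumed starting matrix alpha*I.
    """
    n_weights = grads.shape[1]
    H_inv = np.eye(n_weights) / alpha      # inverse of the starting matrix
    for r in grads:
        Hr = H_inv @ r                     # H_N^{-1} R^{N+1}
        # H_inv is symmetric, so (R^T H_inv) is the transpose of Hr and the
        # numerator of (5.26) is the outer product of Hr with itself.
        H_inv = H_inv - np.outer(Hr, Hr) / (1.0 + r @ Hr)
    return H_inv

# Random stand-ins for per-pattern gradients (assumption, not real data).
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 5))
alpha = 1e-3

H_inv = sequential_inverse_hessian(R, alpha)
H_direct = alpha * np.eye(5) + R.T @ R     # outer product sum plus alpha*I
print(np.max(np.abs(H_inv @ H_direct - np.eye(5))))
```

Note that, with this starting point, the quantity obtained is the inverse of the outer product sum plus $\alpha I$, so $\alpha$ should be small relative to the eigenvalues of the accumulated matrix.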
For regression problems and for classification problems, the purpose of network training is to model the underlying generator of the data so that the best possible predictions of the target $t$ are made when the trained network is presented with a new input vector $x$. For associative prediction problems of this kind, it is convenient to decompose the joint probability density into the product of the conditional density of the target data, given the input data, and the unconditional density of the input data, thus:
$$P(x, t) = P(t \mid x)\, P(x). \tag{5.27}$$
Many error functions can be motivated from the principle of maximum likelihood. For training data $\{x_n, t_n\}$, the likelihood can be written as
$$L = \prod_n P(x_n, t_n) \tag{5.28}$$
$$L = \prod_n P(t_n \mid x_n)\, P(x_n) \tag{5.29}$$
under the assumption that each data point $(x_n, t_n)$ is drawn independently from the same distribution. It is generally more convenient to minimise the negative log likelihood than to maximise the likelihood, and these are equivalent since the negative logarithm is a monotonic function. Fitting an ANN by maximum likelihood is known as `entropy' fitting and is not common (Ripley and Ripley [2001], Joshi et al. [2005]).
$$E = -\ln L = -\sum_n \ln P(t_n \mid x_n) - \sum_n \ln P(x_n). \tag{5.30}$$
The second term in equation (5.30) does not depend on the network parameters and is therefore an additive constant which may be omitted from the error function.
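The split of the error in equation (5.30) can be checked numerically with a small sketch. The model below is purely illustrative and not from the text: it assumes binary targets, a hypothetical one-weight network $y = g(wx)$ for $P(t = 1 \mid x)$, and a standard normal density for the inputs. The function `nll_terms` is a name introduced only for this example.

```python
import numpy as np

def nll_terms(w, x, t):
    """Return the two sums in (5.30) for an assumed toy model.

    conditional: -sum_n ln P(t_n | x_n), with P(t=1 | x) = sigmoid(w * x)
    inputs:      -sum_n ln P(x_n), with x assumed standard normal
    """
    y = 1.0 / (1.0 + np.exp(-w * x))
    conditional = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    inputs = np.sum(0.5 * x ** 2 + 0.5 * np.log(2.0 * np.pi))
    return conditional, inputs

# Synthetic data (assumption): random inputs and random binary targets.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
t = (rng.random(50) < 0.5).astype(float)

cond_a, inp_a = nll_terms(0.5, x, t)   # one setting of the weight
cond_b, inp_b = nll_terms(2.0, x, t)   # a different setting
print(cond_a, cond_b)                  # changes with the weight
print(inp_a, inp_b)                    # identical for both weights
```

Because the input term is the same for every value of the weight, minimising the full error and minimising only the conditional term select the same weight in this toy setting.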
These procedures have the advantage that they constrain the weights assigned to the variables during learning to have values falling within the same interval.