2.5.3 Training Feedforward and Radial Basis Function Networks

Suppose you have chosen an FF or RBF network and you have already decided on the exact structure, the number of layers, and the number of neurons in the different layers. Denote this network with $\hat{y} = g(\theta, x)$, where $\theta$ is a parameter vector containing all the parametric weights of the network and $x$ is the input. Then it is time to train the network. This means that $\theta$ will be tuned so that the network approximates the unknown function producing your data. The training is done with the command NeuralFit, described in Chapter 7, Training Feedforward and Radial Basis Function Networks. Here is a tutorial on the available training algorithms.

Given a fully specified network, it can now be trained using a set of data containing $N$ input-output pairs, $\{x_i, y_i\}_{i=1}^{N}$. With this data the mean square error (MSE) is defined by

$$V_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - g(\theta, x_i)\bigr)^2 \tag{11}$$

Then, a good estimate for the parameter $\theta$ is one that minimizes the MSE; that is,

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}}\; V_N(\theta) \tag{12}$$

Often it is more convenient to use the root-mean-square error (RMSE)

$$\mathrm{RMSE}(\theta) = \sqrt{V_N(\theta)} \tag{13}$$

when evaluating the quality of a model during and after training, because it can be compared with the output signal directly. It is the RMSE value that is logged and written out during the training and plotted when the training terminates.
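As a small illustration of Equations 2.11 and 2.13, the following Python sketch evaluates the MSE and the RMSE for a toy model. It is not part of the Neural Networks package: the two-parameter function `g` is a hypothetical stand-in for the network, and the data are made up.

```python
import math

def g(theta, x):
    """Hypothetical stand-in for the network: amplitude * tanh(weight * x)."""
    amplitude, weight = theta
    return amplitude * math.tanh(weight * x)

def mse(theta, data):
    """V_N(theta) of Equation 2.11 for a list of (x, y) pairs."""
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def rmse(theta, data):
    """Equation 2.13: the square root of the MSE, in the units of y."""
    return math.sqrt(mse(theta, data))

# Made-up data, for illustration only.
data = [(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0)]
theta = (0.0, 1.0)  # zero amplitude, so the model output is 0 everywhere

print(mse(theta, data))   # (4 + 0 + 4) / 3
print(rmse(theta, data))  # sqrt(8 / 3)
```

Because the RMSE has the same units as the output, a value of about 1.63 here can be compared directly with outputs that range between -2 and 2.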

The various training algorithms that apply to FF and RBF networks have one thing in common: they are iterative. They all start with an initial parameter vector $\theta_0$, which you set with the command InitializeFeedForwardNet or InitializeRBFNet. Starting at $\theta_0$, the training algorithm iteratively decreases the MSE in Equation 2.11 by incrementally updating $\theta$ along the negative gradient of the MSE, as follows

$$\theta_{i+1} = \theta_i - \mu R\, \nabla_\theta V_N(\theta) \tag{14}$$

Here, the matrix $R$ may change the search direction from the negative gradient direction to a more favorable one. The purpose of parameter $\mu$ is to control the size of the update increment in $\theta$ with each iteration $i$, while decreasing the value of the MSE. It is in the choice of $R$ and $\mu$ that the various training algorithms differ in the Neural Networks package.

If $R$ is chosen to be the inverse of the Hessian of the MSE function, that is, the inverse of

$$\frac{d^2 V_N(\theta)}{d\theta^2} = \frac{2}{N} \sum_{i=1}^{N} \nabla_\theta g(\theta, x_i)\, \nabla_\theta g(\theta, x_i)^T - \frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - g(\theta, x_i)\bigr)\, \nabla_\theta^2 g(\theta, x_i) \tag{15}$$

then Equation 2.14 assumes the form of the Newton algorithm. This search scheme can be motivated by a second-order Taylor expansion of the MSE function at the current parameter estimate $\theta_i$. There are several drawbacks to using Newton's algorithm. For example, if the Hessian is not positive definite, the $\theta$ updates may point in an uphill direction, which will increase the MSE value. This possibility may be avoided with a commonly used alternative for $R$, the first part of the Hessian in Equation 2.15:

$$H = \frac{2}{N} \sum_{i=1}^{N} \nabla_\theta g(\theta, x_i)\, \nabla_\theta g(\theta, x_i)^T \tag{16}$$

With H defined, the option Method may be used to choose from the following algorithms:

• Levenberg-Marquardt
• Gauss-Newton
• Steepest descent
• Backpropagation
• FindMinimum

Levenberg-Marquardt

Neural network minimization problems are often very ill-conditioned; that is, the Hessian in Equation 2.15 is often ill-conditioned. This makes the minimization problem harder to solve, and for such problems, the Levenberg-Marquardt algorithm is often a good choice. For this reason, the Levenberg-Marquardt algorithm is the default training algorithm of the package.

Instead of adapting the step length $\mu$ to guarantee a downhill step in each iteration of Equation 2.14, a diagonal matrix is added to $H$ in Equation 2.16; in other words, $R$ is chosen to be

$$R = \bigl(H + e^{\lambda} I\bigr)^{-1} \tag{17}$$

and $\mu = 1$.

The value of $\lambda$ is chosen automatically so that a downhill step is produced. At each iteration, the algorithm tries to decrease the value of $\lambda$ by some increment $\Delta\lambda$. If the update in Equation 2.14 with the current value of $\lambda$ does not decrease the MSE, then $\lambda$ is increased in steps of $\Delta\lambda$ until it does produce a decrease.

The training is terminated prior to the specified number of iterations if any of the following conditions are satisfied:

• $\lambda > 10\,\Delta\lambda + \mathrm{Max}[s]$

• $\dfrac{V_N(\theta_i) - V_N(\theta_{i+1})}{V_N(\theta_i)} < 10^{-\mathrm{PrecisionGoal}}$

Here PrecisionGoal is an option of NeuralFit and s is the largest eigenvalue of H.

Large values of $\lambda$ produce parameter update increments primarily along the negative gradient direction, while small values result in updates governed by the Gauss-Newton method. Accordingly, the Levenberg-Marquardt algorithm is a hybrid of the two relaxation methods, which are explained next.
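The Levenberg-Marquardt iteration described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used by NeuralFit: the two-parameter model `g`, the noise-free data, the starting values `lam=0.0` and `dlam=1.0`, and all function names are hypothetical. The 2x2 system is solved with Cramer's rule so that no external library is needed.

```python
import math

def g(theta, x):
    """Hypothetical stand-in for the network: a * tanh(w * x)."""
    a, w = theta
    return a * math.tanh(w * x)

def grad_g(theta, x):
    """Gradient of g with respect to theta = (a, w)."""
    a, w = theta
    t = math.tanh(w * x)
    return (t, a * x * (1.0 - t * t))

def mse(theta, data):
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def gradient_and_h(theta, data):
    """Gradient of the MSE and the matrix H of Equation 2.16."""
    n = len(data)
    grad = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        r = y - g(theta, x)
        d = grad_g(theta, x)
        for j in range(2):
            grad[j] -= 2.0 * r * d[j] / n
            for k in range(2):
                h[j][k] += 2.0 * d[j] * d[k] / n
    return grad, h

def solve_2x2(m, b):
    """Solve m z = b for a 2x2 matrix m by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return ((b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det)

def largest_eigenvalue(h):
    """Largest eigenvalue of a symmetric 2x2 matrix."""
    half_trace = 0.5 * (h[0][0] + h[1][1])
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return half_trace + math.sqrt(max(half_trace ** 2 - det, 0.0))

def levenberg_marquardt(theta, data, lam=0.0, dlam=1.0,
                        iterations=30, precision_goal=6):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad, h = gradient_and_h(theta, data)
        lam_max = 10.0 * dlam + largest_eigenvalue(h)
        lam -= dlam  # first try a smaller lambda
        while True:
            damping = math.exp(lam)
            damped = [[h[0][0] + damping, h[0][1]],
                      [h[1][0], h[1][1] + damping]]
            step = solve_2x2(damped, grad)  # R times the gradient, mu = 1
            trial = (theta[0] - step[0], theta[1] - step[1])
            v = mse(trial, data)
            if v < history[-1]:
                break  # downhill step found
            lam += dlam
            if lam > lam_max:
                return theta, history  # first termination criterion
        improvement = (history[-1] - v) / history[-1]
        theta = trial
        history.append(v)
        if improvement < 10.0 ** (-precision_goal):
            break  # second termination criterion
    return theta, history

# Noise-free data generated from a = 2, w = 0.5 (made up for illustration).
data = [(x, g((2.0, 0.5), x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
theta, history = levenberg_marquardt((1.0, 1.0), data)
print(theta, math.sqrt(history[-1]))  # fitted parameters and final RMSE
```

Every accepted step lowers the MSE by construction, so `history` is strictly decreasing. Adding $e^{\lambda}$ to the diagonal also keeps the matrix invertible even where $H$ itself is singular.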

Gauss-Newton

The Gauss-Newton method is a fast and reliable algorithm that may be used for a large variety of minimization problems. However, this algorithm may not be a good choice for neural network problems if the Hessian is ill-conditioned; that is, if its eigenvalues span a large numerical range. If so, the algorithm will converge poorly, slowing down the training process.

The training algorithm uses the Gauss-Newton method when matrix $R$ is chosen to be the inverse of $H$ in Equation 2.16; that is,

$$R = H^{-1} \tag{18}$$

At each iteration, the step length parameter is set to unity, $\mu = 1$. This allows the full Gauss-Newton step, which is accepted only if the MSE in Equation 2.11 decreases in value. Otherwise $\mu$ is halved again and again until a downhill step is obtained. Then, the algorithm continues with a new iteration.

The training terminates prior to the specified number of iterations if any of the following conditions are satisfied:

• $\dfrac{V_N(\theta_i) - V_N(\theta_{i+1})}{V_N(\theta_i)} < 10^{-\mathrm{PrecisionGoal}}$

• $\mu < 10^{-15}$

Here PrecisionGoal is an option of NeuralFit.
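A minimal Python sketch of the Gauss-Newton iteration with step halving follows. It is an illustration only, not the package implementation: the model `g`, the data, and the function names are hypothetical, and the sketch assumes that the 2x2 matrix H stays invertible along the way.

```python
import math

def g(theta, x):
    """Hypothetical stand-in for the network: a * tanh(w * x)."""
    a, w = theta
    return a * math.tanh(w * x)

def grad_g(theta, x):
    a, w = theta
    t = math.tanh(w * x)
    return (t, a * x * (1.0 - t * t))

def mse(theta, data):
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def gradient_and_h(theta, data):
    """Gradient of the MSE and the matrix H of Equation 2.16."""
    n = len(data)
    grad = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in data:
        r = y - g(theta, x)
        d = grad_g(theta, x)
        for j in range(2):
            grad[j] -= 2.0 * r * d[j] / n
            for k in range(2):
                h[j][k] += 2.0 * d[j] * d[k] / n
    return grad, h

def solve_2x2(m, b):
    """Solve m z = b by Cramer's rule (assumes m is invertible)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return ((b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det)

def gauss_newton(theta, data, iterations=30, precision_goal=6):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad, h = gradient_and_h(theta, data)
        direction = solve_2x2(h, grad)  # H^{-1} times the gradient
        mu = 1.0  # always start with the full Gauss-Newton step
        while True:
            trial = (theta[0] - mu * direction[0],
                     theta[1] - mu * direction[1])
            v = mse(trial, data)
            if v < history[-1]:
                break  # downhill step found
            mu /= 2.0
            if mu < 1e-15:
                return theta, history  # step length criterion
        improvement = (history[-1] - v) / history[-1]
        theta = trial
        history.append(v)
        if improvement < 10.0 ** (-precision_goal):
            break  # relative decrease criterion
    return theta, history

# Noise-free data generated from a = 2, w = 0.5 (made up for illustration).
data = [(x, g((2.0, 0.5), x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
theta, history = gauss_newton((1.0, 1.0), data)
print(theta, math.sqrt(history[-1]))  # fitted parameters and final RMSE
```

The only difference from a damped scheme is that nothing is added to H before solving, which is why an ill-conditioned H can produce very long steps that must then be halved many times.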

Steepest Descent

The training algorithm in Equation 2.14 reduces to the steepest descent form when

$$R = I \tag{19}$$

This means that the parameter vector $\theta$ is updated along the negative gradient direction of the MSE in Equation 2.11 with respect to $\theta$.

The step length parameter $\mu$ in Equation 2.14 is adaptable. At each iteration the value of $\mu$ is doubled. This gives a preliminary parameter update. If the criterion is not decreased by the preliminary parameter update, $\mu$ is halved until a decrease is obtained. The default initial value of the step length is $\mu = 20$, but you can choose another value with the StepLength option.

The training with the steepest descent method will stop prior to the given number of iterations under the same conditions as the Gauss-Newton method.

Compared to the Levenberg-Marquardt and the Gauss-Newton algorithms, the steepest descent algorithm needs fewer computations in each iteration, because there is no matrix to be inverted. However, the steepest descent method is typically much less efficient than the other two methods, so that it is often worth the extra computational load to use the Levenberg-Marquardt or the Gauss-Newton algorithm.
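The adaptive step length rule for steepest descent (double the step, then halve until the MSE decreases) can be sketched as follows. This is a hedged Python illustration rather than the package code; the model `g`, the data, and the function names are hypothetical, while the initial step length of 20 follows the default stated above.

```python
import math

def g(theta, x):
    """Hypothetical stand-in for the network: a * tanh(w * x)."""
    a, w = theta
    return a * math.tanh(w * x)

def grad_g(theta, x):
    a, w = theta
    t = math.tanh(w * x)
    return (t, a * x * (1.0 - t * t))

def mse(theta, data):
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    """Gradient of the MSE: -(2/N) * sum of residual times grad_g."""
    n = len(data)
    grad = [0.0, 0.0]
    for x, y in data:
        r = y - g(theta, x)
        d = grad_g(theta, x)
        grad[0] -= 2.0 * r * d[0] / n
        grad[1] -= 2.0 * r * d[1] / n
    return grad

def steepest_descent(theta, data, mu=20.0, iterations=200, precision_goal=6):
    history = [mse(theta, data)]
    for _ in range(iterations):
        grad = grad_mse(theta, data)
        mu *= 2.0  # preliminary update uses a doubled step length
        while True:
            trial = (theta[0] - mu * grad[0], theta[1] - mu * grad[1])
            v = mse(trial, data)
            if v < history[-1]:
                break  # downhill step found
            mu /= 2.0
            if mu < 1e-15:
                return theta, history  # step length criterion
        improvement = (history[-1] - v) / history[-1]
        theta = trial
        history.append(v)
        if improvement < 10.0 ** (-precision_goal):
            break  # relative decrease criterion
    return theta, history

# Noise-free data generated from a = 2, w = 0.5 (made up for illustration).
data = [(x, g((2.0, 0.5), x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
theta, history = steepest_descent((1.0, 1.0), data)
print(len(history) - 1, math.sqrt(history[-1]))  # accepted steps, final RMSE
```

Each iteration costs one gradient evaluation plus a few MSE evaluations and needs no linear solve, which is the saving referred to above.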

Backpropagation

The backpropagation algorithm is similar to the steepest descent algorithm, with the difference that the step length $\mu$ is kept fixed during the training. Therefore the backpropagation algorithm is obtained by choosing $R = I$ in the parameter update in Equation 2.14. The step length $\mu$ is set with the option StepLength, which has the default value $\mu = 0.1$.

The training algorithm in Equation 2.14 may be augmented by using a momentum parameter $\alpha$, which may be set with the Momentum option. The new algorithm is

$$\Delta\theta_{i+1} = -\mu \frac{dV_N(\theta)}{d\theta} + \alpha\, \Delta\theta_i \tag{20}$$

$$\theta_{i+1} = \theta_i + \Delta\theta_{i+1} \tag{21}$$

Note that the default value of $\alpha$ is 0.

The idea of using momentum is motivated by the need to escape from local minima, which may be effective in certain problems. In general, however, the recommendation is to use one of the other, better training algorithms and to repeat the training a couple of times from different initial parameter values.
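The fixed-step update with momentum in Equations 2.20 and 2.21 can be written out in a few lines of Python. This is an illustrative sketch, not the package implementation: the model `g`, the data, and the function names are hypothetical, and the defaults mirror the values quoted above (step length 0.1, momentum 0).

```python
import math

def g(theta, x):
    """Hypothetical stand-in for the network: a * tanh(w * x)."""
    a, w = theta
    return a * math.tanh(w * x)

def grad_g(theta, x):
    a, w = theta
    t = math.tanh(w * x)
    return (t, a * x * (1.0 - t * t))

def mse(theta, data):
    return sum((y - g(theta, x)) ** 2 for x, y in data) / len(data)

def grad_mse(theta, data):
    """Gradient of the MSE: -(2/N) * sum of residual times grad_g."""
    n = len(data)
    grad = [0.0, 0.0]
    for x, y in data:
        r = y - g(theta, x)
        d = grad_g(theta, x)
        grad[0] -= 2.0 * r * d[0] / n
        grad[1] -= 2.0 * r * d[1] / n
    return grad

def backpropagation(theta, data, mu=0.1, alpha=0.0, iterations=100):
    delta = [0.0, 0.0]  # no previous update before the first iteration
    for _ in range(iterations):
        grad = grad_mse(theta, data)
        # Equation 2.20: gradient step plus a fraction of the last update.
        delta = [-mu * grad[j] + alpha * delta[j] for j in range(2)]
        # Equation 2.21: apply the update.
        theta = (theta[0] + delta[0], theta[1] + delta[1])
    return theta

# Noise-free data generated from a = 2, w = 0.5 (made up for illustration).
data = [(x, g((2.0, 0.5), x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
start = (1.0, 1.0)
plain = backpropagation(start, data)
with_momentum = backpropagation(start, data, alpha=0.9)
print(math.sqrt(mse(plain, data)), math.sqrt(mse(with_momentum, data)))
```

Unlike the other sketches in this section, nothing here checks that a step actually lowers the MSE, so a step length that is too large can make the training diverge.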

FindMinimum

If you prefer, you can use the built-in Mathematica minimization command FindMinimum to train FF and RBF networks. This is done by setting the option Method→FindMinimum in NeuralFit. All other choices for Method are algorithms specially written for neural network minimization, which should be superior to FindMinimum in most neural network problems. See the documentation on FindMinimum for further details.

Examples comparing the performance of the various algorithms discussed here may be found in Chapter 7, Training Feedforward and Radial Basis Function Networks.