CHAPTER 2 Training neural networks
2.5. Assessing training speed
2.5.2 Assessing the average learning speed
As the result of a particular trial can depend on factors such as the initial weight state, more reliable information is obtained by calculating mean values over a number of trials, together with some measure of the variation between individual trials. Fahlman [Fahlman, 1988] discusses some issues regarding the assessment of speed when more than one trial is considered.
If a number of trials are used, the problem of how to treat failures must be analysed. Some algorithms can become stuck in particular states or can have anomalously long training times (which are effectively infinite). How can one calculate the mean of a series of values which can include infinite values? For the purpose of this discussion, the units used for the measurement of individual trials are not relevant. The individual trials can be expressed in number of epochs, pattern presentations, connection crossings or operations, and the average will be expressed in the same units. In the following, 'units' can be substituted with any of the above.
One possibility is to report the successes and the failures separately for each set of trials, as in [Moeller, 1993], [Weir, 1991] and others. As Fahlman points out, if we do this, how do we choose between a technique with a better average and one with fewer failures?
An approach proposed by Tesauro and Janssens is to define the training rate to be the inverse of the training time. The average training rate over the trials is then calculated and the average training time is defined as the inverse of this average training rate. Fahlman criticises this approach as i) penalising more consistent algorithms, ii) favouring algorithms which take risky steps and so combine very short training times with many failures and iii) emphasising short trials with respect to long ones. Let us analyse this.
According to this method, the average training time (which can be seen as a training cost measured in units) is calculated as:
$$ c = \frac{n}{\sum_{i=1}^{n} \frac{1}{c_i}} \qquad (2) $$
where $c_i$ is the cost of one trial, $n$ is the number of trials and $c$ is the average cost of training. The terms in equation (2) can be split into terms corresponding to successful trials and terms corresponding to unsuccessful ones:
$$ c = \frac{n_s + n_f}{\sum_{i=1}^{n_s} \frac{1}{cs_i} + \sum_{i=1}^{n_f} \frac{1}{cf_i}} \qquad (3) $$
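where $cs_i$ are the costs of successful trials, $cf_i$ the costs of unsuccessful ones, $n_s$ the number of successful trials and $n_f$ the number of unsuccessful ones. Even if the number of failures is 0, the average value calculated according to this method generally differs from the arithmetical mean: it is the harmonic mean of the trial costs, which never exceeds the arithmetical mean and equals it only when all trials have the same cost. This is why the more consistent algorithms are penalised, i.e. they appear to train more slowly relative to algorithms whose training times vary widely.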
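The effect of equation (2) can be illustrated with a minimal sketch. The function names and the trial costs below are hypothetical, chosen only for illustration, and a failed trial is assumed to be represented by an infinite cost (so that its rate is 0).

```python
import math


def average_training_time_by_rates(costs):
    """Equation (2): n divided by the sum of the per-trial rates 1/c_i.

    A failed trial is represented by math.inf, so its rate is 0.
    """
    total_rate = sum(1.0 / c for c in costs)  # 1.0 / math.inf == 0.0
    if total_rate == 0.0:
        return math.inf  # every trial failed
    return len(costs) / total_rate


def arithmetic_mean(costs):
    return sum(costs) / len(costs)


# Hypothetical trial costs, in arbitrary units.
consistent = [64, 64, 64, 64]         # always converges in 64 units
uneven = [16, 112, 16, 112]           # same arithmetical mean (64), no failures
risky = [8, 8, math.inf, math.inf]    # very fast or never converges

print(average_training_time_by_rates(consistent))  # 64.0
print(arithmetic_mean(uneven))                     # 64.0
print(average_training_time_by_rates(uneven))      # about 28
print(average_training_time_by_rates(risky))       # 16.0
```

Under these assumed numbers, the consistent algorithm is reported as the slowest of the three, although it never fails and has the same arithmetical mean as the uneven one; the short trials dominate the sum of rates, and the two failures of the risky algorithm only double its reported time.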
Fahlman's idea was to restart the training, with random weights, whenever the network has failed to converge after a certain number of epochs. The duration of a trial is then the total number of units spent since the previous successful trial, including the units spent on the abandoned runs. This approach offers the advantage of giving the arithmetical mean if there are no failures, thus eliminating the bias of the previous method. However, it has a drawback: an implicit dependence on the termination limit, i.e. the number of units after which a training run is restarted. The same algorithm can give different values for the average convergence time if a different termination limit is chosen (and the algorithm fails from time to time). Furthermore, an appropriate termination limit depends on the algorithm itself. It is not reasonable to use the same limit for two algorithms which usually converge in 20 and 2000 units respectively.
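The dependence on the termination limit can be seen in a small sketch of the restart accounting. The function name `restart_average` and the list of raw convergence times are assumptions made for illustration; each raw time is the number of units a run would need if it were never interrupted, with `math.inf` standing for a run that never converges.

```python
import math


def restart_average(raw_times, limit):
    """Mean trial duration when runs are abandoned at `limit` and restarted.

    The duration of a trial is the number of units spent since the previous
    success, so it includes `limit` units for every abandoned run.
    Units wasted after the last success are not counted in this sketch.
    """
    durations = []
    wasted = 0
    for t in raw_times:
        if t <= limit:
            durations.append(wasted + t)  # success: close the trial
            wasted = 0
        else:
            wasted += limit               # abandoned at the termination limit
    if not durations:
        raise ValueError("no run converged within the limit")
    return sum(durations) / len(durations)


# Hypothetical raw convergence times for one algorithm.
raw = [20, 30, 400, 25, math.inf, 25]

for limit in (50, 500, 1000):
    print(limit, restart_average(raw, limit))
# 50 50.0
# 500 200.0
# 1000 300.0
```

With these assumed data, the same set of runs yields an average of 50, 200 or 300 units depending only on where the termination limit is placed.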
A solution is to calculate a cost of training in which the cost of a normal training session and the cost of a failure are evaluated taking into consideration the particular requirements of the problem. The cost of a training in units is defined as:
$$ \mathrm{cost}_{total} = \mathrm{cost}_{success} + \mathrm{cost}_{failure} $$

$$ \mathrm{cost}_{total} = \sum_{i=1}^{n_s} cs_i + cs_{failure} \cdot n_f $$

$$ c = \mathrm{cost}_{average} = \frac{\mathrm{cost}_{total}}{n_s} = \frac{\sum_{i=1}^{n_s} cs_i + cs_{failure} \cdot n_f}{n_s} = \frac{\sum_{i=1}^{n_s} cs_i}{n_s} + cs_{failure} \cdot \frac{n_f}{n_s} = \mathrm{cost}_{succ} + cs_{failure} \cdot \frac{n_f}{n_s} \qquad (4) $$

In this formula, $cs_i$ are the costs of successful trials, $cf_i$ the costs of unsuccessful ones, $n_s$ the number of successful trials, $n_f$ the number of unsuccessful ones, $\mathrm{cost}_{succ}$ is the average cost of the successful trials and $cs_{failure}$ is the cost of a single failure. If the number of failures is 0, this formula yields the value of the arithmetical mean (the average cost of successful trials).
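Formula (4) translates directly into a few lines of code. The sketch below is illustrative only: the function name, its parameters and the numbers used are assumptions, and the cost of a failure is passed in explicitly because it is meant to be chosen according to the requirements of the problem.

```python
def average_training_cost(success_costs, n_failures, failure_cost):
    """Formula (4): mean cost of the successes plus failure_cost * n_f / n_s."""
    n_s = len(success_costs)
    if n_s == 0:
        raise ValueError("formula (4) needs at least one successful trial")
    cost_succ = sum(success_costs) / n_s
    return cost_succ + failure_cost * n_failures / n_s


# Hypothetical costs of four successful trials, in arbitrary units.
successes = [20, 30, 25, 25]

print(average_training_cost(successes, 0, 500))  # 25.0 (arithmetical mean)
print(average_training_cost(successes, 2, 50))   # 50.0 (cheap failures)
print(average_training_cost(successes, 2, 500))  # 275.0 (expensive failures)
```

The same trial record gives very different averages depending on how heavily a failure is charged, which is the degree of freedom the formula is intended to expose.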
If the cost of a failure $cs_{failure}$ is taken to be the termination limit $N_{max}$, formula (4) models the strategy adopted by Fahlman, which can be shown by rewriting (4) as:
$$ c = \frac{\sum_{i=1}^{n_s} cs_i + N_{max} \cdot n_f}{n_s} = \frac{\sum_{i=1}^{n_f} (cs_i + N_{max}) + \sum_{i=n_f+1}^{n_s} cs_i}{n_s} \qquad (5) $$
The last form of expression (5) uses two assumptions: i) that the number of failures is not greater than the number of successes, so that each failure can be coupled with a success, and ii) that exactly one failure occurred before each of the first $n_f$ successful trials. These assumptions are not essential and do not modify the result (due to the commutativity of the sum).
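The equality of the two forms in (5) can be checked numerically. The following is a minimal sketch with hypothetical function names and made-up costs; it also reorders the successes to illustrate that the pairing of failures with particular successes does not affect the result.

```python
import math


def cost_first_form(success_costs, n_failures, n_max):
    """First form of (5): every failure is charged n_max units."""
    return (sum(success_costs) + n_max * n_failures) / len(success_costs)


def cost_paired_form(success_costs, n_failures, n_max):
    """Last form of (5): the first n_f successes each carry one failure."""
    if n_failures > len(success_costs):
        raise ValueError("pairing needs n_f <= n_s")
    paired = [c + n_max for c in success_costs[:n_failures]]
    unpaired = success_costs[n_failures:]
    return (sum(paired) + sum(unpaired)) / len(success_costs)


# Hypothetical data: three successes, two failures, termination limit 100.
successes = [10, 20, 30]
reordered = [30, 10, 20]

a = cost_first_form(successes, 2, 100)
b = cost_paired_form(successes, 2, 100)
c = cost_paired_form(reordered, 2, 100)
print(math.isclose(a, b), math.isclose(b, c))  # True True
```

Both forms evaluate to 260/3 units here, whichever successes the two failures are attached to.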
If these costs