Improving three layer neural net convergence

(1)

MURDOCH RESEARCH REPOSITORY

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=140340&contentType=C

onference+Publications

Alder, M., Sok, G.L., Hadingham, P. and Attikiouzel, Y. (1991)

Improving three layer neural net convergence. In: Second

International Conference on Artificial Neural Networks, 18 - 20

November, Bournemouth, UK, pp. 318 - 322.

http://researchrepository.murdoch.edu.au/20266/

Copyright © 1991 IEEE

Personal use of this material is permitted. However, permission to reprint/republish

this material for advertising or promotional purposes or for creating new collective

(2)

IMPROVING

THREE LAYER NEURAL NET CONVERGENCE

Michael A l d e r , Sok

Gek

Lim.Pau1 Hadingham and Yianni Attikiouzel

T h e University of Western Australia, Australia

I N T R O D U C T I O N

T h e m u l t i - l a y e r Perceptron [l], t h e M A D A L I N E [2],

t h e c o m m i t t e e n e t [3] and t h e feed forward n e t w o r k s t r a i n a b l e b y t h e back-propagation algorithm, 141, are

r e l a t e d in ways w h i c h a r e n o t always e n t i r e l y clear. I t is o n e a i m of t h i s p a p e r t o m a k e s o m e of t h e m a little clearer.

T h e main application of neural n e t w o r k s h a s been as trainable p a t t e r n classifiers. and that is how t h e y will b e viewed in this paper. In t h e first part of this paper we i n v e s t i g a t e t h e r e l a t i o n s h i p b e t w e e n t h r e e l a y e r f e e d f o r w a r d b a c k - p r o p a g a t i o n n e t s ( u s i n g t h e terminology of [4]) and t h e committee net of [3], and s h o w t h a t a s i m p l e m o d i f i c a t i o n t o t h e a l g o r i t h m of t h e l a t t e r m a k e s t h e m , in r e s p e c t of t h e i r p o w e r t o classify d a t a s e t s , e q u i v a l e n t . T w o a l g o r i t h m s may, however, b e e q u i v a l e n t in power h u t differ greatly in their practicality. In t h e second part of t h e p a p e r we c o n d u c t s o m e e x p e r i m e n t s in o r d e r t o d e t e r m i n e w h e t h e r t h e m o d i f i e d c o m m i t t e e a l g o r i t h m c a n c o m p e t e w i t h b a c k - p r o p a g a t i o n i n a v a r i e t y of

applications. I t is found that t h e committee algorithm (a) is a b o u t 10 t i m e s as fast in s o m e a p p l i c a t i o n s and ( b ) i s m u c h less p r o n e t o g e t t i n g t r a p p e d in l o c a l minima.

T h e t h e o r e t i c a l interest in t h e p a p e r s t e m s from t h e e a s e of a n a l y s i n g t h e c o m m i t t e e a l g o r i t h m t o g e t h e r w i t h t h e e q u i v a l e n c e . T h q e x p e r i m e n t a l i n t e r e s t is t h a t t h i s m e t h o d of s p e e d i n g u p b a c k - p r o p a g a t i o n m a y b e u s e d w i t h o t h e r i m p r o v e m e n t s t o r e d u c e training times in sdme applications.

I n what f o l l o w s we restrict attention, without loss of g e n e r a l i t y , t o t h e c a s e of a b i n a r y c l a s s i f i c a t i o n problem.

T H E O R E T I C A L C O N S I D E R A T I O N S

T e r m s not defined in this paper are t o b e found in t h e r e f e r e n c e s , s p e c i f i c a l l y see [ 3 ] f o r d e f i n i t i o n s of terms relating t o committee nets. Briefly, a committee n e t is a ( t r a i n a b l e ) s e t of k o r i e n t e d h y p e r p l a n e s in t h e s p a c e e a c h r e t u r n i n g a ‘vote’ + I or - 1 a c c o r d i n g to which side of t h e hyperplane a datum is. T h e v o t e s are summed by t h e ‘ o u t p u t unit’and compared with a t h r e s h o l d of z e r o if t h e number of units is o d d . By

a d d i n g , if n e c e s s a r y , a unit or units r e m o t e from t h e d a t a or b y g i v i n g a t h r e s h o l d of 1/2, t h e c a s e of an e v e n n u m b e r o f u n i t s c a n b e t r e a t e d , a s c a n t h r e s h o l d s o t h e r than zero.

T h e first proposition is t h e trivial observation, made for completeness:

Proposition. Any b i n a r y d a t a s e t can b e c o r r e c t l y classified by a c o m m i r t e e net.

Proof: T h e argument is by induction on t h e n u m b e r of points of t h e d a t a set. We first observe that if there are only two points, one of each category, then t h e set is linearly separable and a single unit net will serve. N o w s u p p o s e t h a t t h e r e i s a l w a y s a c o r r e c t c l a s s i f i c a t i o n o f a d a t a s e t h a v i n g a t o t a l of r d a t a points, and c o n s i d e r a dat set having r + l points. One of t h e p o i n t s at l e a s t m u s t b e e x t r e m a l , i s . may b e s e p a r a t e d f r o m a l l t h e o t h e r s by a h y p e r p l a n e . T h e o t h e r p o i n t s m a y b e c o r r e c t l y c l a s s i f i e d b y a c o m m i t t e e . W i t h o u t loss of g e n e r a l i t y , s u p p o s e t h e n e w p o i n t i s p o s i t i v e a n d t h e c o m m i t t e e w h i c h c l a s s i f i e s t h e r e s t of t h e d a t a s e t w o u l d a s s i g n it a negative value. T h e n we can insert between t h e old r

p o i n t s a n d t h e e x t r e m a l p o i n t s u f f i t i e n t l y m a n y hyperplanes to make t h e new point positive. This will also add n e g a t i v e v a l u e s t o t h e r p o i n t s which may a l t e r t h e c l a s s i f i c a t i o n . W e t h e r e f o r e t a k e a s many h y p e r p l a n e s again which h a v e a l l t h e p o i n t s on t h e p o s i t i v e s i d e . T h i s r e t u r n s t h e r p o i n t s t o t h e s a m e v a l u e s t h e y had b e f o r e , a n d p r e s e r v e s t h e p o s i t i v e sign of t h e extremal point.

0

R e m a r k . T h e r e a r e b e t t e r w a y s . I n p a r t i c u l a r we n e e d a d d o n l y half t h e n u m b e r of new h y p e r p l a n e s s u g g e s t e d i n t h e a b o v e a r g u m e n t . E v e n s o , t h e a r g u m e n t i s h a r d l y c h e e r i n g , s u g g e s t i n g as it d o e s that t h e n u m b e r of hyperplanes, and hence of hidden units, may increase with the size of t h e data set.

(3)

x = ( X , . X ~ . . . . X ~ ) ~ t h e 3-layer FFBP n e t i m p l e m e n t s a

map:

f(x) = sgn ( bjsgn(Faijxi))

J

w h e r e t h e a.. are t h e weights in t h e first hidden layer from t h e ithlkomponent of t h e input to t h e jth hidden unit and t h e b, are t h e weights to t h e output unit, and t h e sgn f u n c t i o n t a k e s t h e v a l u e 1 if i t s a r g u m e n t i s positive, - 1 if i t is negative and is not defined at zero.

Suppose that s o m e d a t a set is correctly classified b y some such 3 - l a y e r FFBP net. We o b s e r v e t h a t s i n c e t h e d a t a set is f i n i t e and f - ' { l ) is open, a s is f - ' ( - I ) , t h e r e is stability under perturbations of t h e weights. Specifically we may take t h e bj to he rational, which

i s just as well given t h e need for using t h e s e n e t s via computer simulations.

M o r e o v e r , i f a n y of t h e c o e f f i c i e n t s s h o u l d b e negative, then reversing t h e sign of t h e corresponding ai, w e i g h t s will a l l o w u s t o r e v e r s e i t s sign t o s t i l l give a s o l u t i o n . And if a weight s h o u l d b e z e r o , we can simply remove t h e unit altogether. Hence we may take i t that t h e weights are all positive rationals, and Fince multiplying throughout by any positive number w i l l p r e s e r v e t h e s i g n of t h e s o l u t i o n , w e m a y m u l t i p l y b y t h e L e a s t C o m m o n M u l t i p l e of t h e d e n o m i n a t o r s t o m a k e t h e w e i g h t s a l l p o s i t i v e integers, which we can then regard as multiplicities. T h i s is now a c o m m i t t e e net with some of t h e u n i t s replicated by their multiplicities.

T h e a b o v e argument l e a v e s one with t h e impression that for any data set that can be classified by a 3-layer FFRP net there is a committee net which can classify i t , h u t t h e n u m b e r o f u n i t s r e q u i r e d f o r t h i s e q u i v a l e n t c o m m i t t e e c a n h e a r b i t r a r i l y l a r g e . Happily. such is not t h e c a s e . We proceed t o show t h i s by a s e r i e s of p r o p o s i t i o n s m o t i v a t e d by s o m e examples.

E x a m p l e . S u p p o s e w e h a v e a c o m m i t t e e w i t h multiplicities attached to t h e units: I shall write:

( 2 , 11, 19, 2 6 ) t o i n d i c a t e t h a t we h a v e f o u r u n i t s

which we may take i t correctly classify some d a t a set in R" , p r o v i d e d w e h a v e t h e given m u l t i p l i c i t i e s . A l t e r n a t i v e l y , we h a v e 2+11+19+26 u n i t s b u t o n l y four of them are distinct hyperplanes.

I f n is big enough, then t h e r e will b e some region of t h e s p a c e h a v i n g a c o u n t of 2+11+19+26, a n d a s w e m o v e a c r o s s t h e l a s t h y p e r p l a n e w e m o v e i n t o a region having a count of 2+11+19 -26, and so on. Now consider t h e second hyperplane. T h e regions on t h e

t h e negative side. Now t h e s e numbers when added t o + I 1 or -11 a r e from t h e set (-47, -43, -9, - 5 , 5 , 9, 4 3 , 4 7 ) . W e observe that t h e sign of t h e result of a d d i n g

11 or s u b t r a c t i n g 11 f r o m t h e s e n u m b e r s will n o t b e a l t e r e d b y c h a n g i n g t h e n u m b e r 1 1 t o s o m e s u f f i c i e n t l y c l o s e o t h e r n u m b e r . I n p a r t i c u l a r , c h o o s i n g any number t o r e p l a c e 11 which is s t r i c t l y between 9 and 43 will not a l t e r any signs; t h i s i s , of course, that interval of t h e values obtained by taking

s u m s a n d d i f f e r e n c e s of t h e m u l t i p l i c i t i e s w h i c h c o n t a i n s 11. W e m a y t h e r e f o r e r e p l a c e t h e g i v e n q u a r t e t ( 2 , 1 1 , 1 9 , 2 6 ) b y t h e e q u i v a l e n t q u a r t e t (2,19,19,26) which m u s t a l s o c o r r e c t l y classify t h e s a m e d a t a s e t . Similarly, t h e 2 6 may b e changed t o a n y v a l u e b e t w e e n 2 a n d 36 so may a l s o b e m a d e equal t o 19. Finally t h e 2 may take any value between

- 19 a n d 19 i n p a r t i c u l a r may t a k e t h e v a l u e 0. W e therefore h a v e that t h e given quartet is equivalent to the quartet (0,19,19,19). T h i s is of course equivalent to t h e q u a r t e t (0.1.1,l) or t h e t r i p l e (1.1.1). I n o t h e r words, t h e original total of 58 u n i t s can b e reduced to 3 and stilI classify t h e same data set.

E x a m p l e . S u p p o s e w e h a v e t h r e e u n i t s w i t h m u l t i p l i c i t i e s p o s i t i v e i n t e g e r s (a,b,c). W e m a y , without loss of generality, suppose t h e m arranged in non-decreasing order and not all t h e same. Now if c >

a + b . t h e first t w o u n i t s c a n n o t c h a n g e t h e sign of a region by enough t o c o m p e n s a t e for t h e v o t e of t h e t h i r d u n i t a n d m a y b e r e m o v e d ; t h e t r i p l e i s e q u i v a l e n t t o a s i n g l e u n i t in t h e t h i r d l o c a t i o n . A l t e r n a t i v e l y w e may argue t h a t b l i e s b e t w e e n a - c and c - a and can b e put to 0, whereupon a lies between - c and c and can also b e put t o 0, finally divide ( 0 . 0 , ~ )

by c. If c < a + b t h e n b l i e s between c - a and c + a and can b e p u t equal t o c, and finally a lies between 0 and 2 c a n d h e n c e may a l s o b e p u t t o c, g i v i n g (c,c,c)

-

(1.1.1). Finally,the case where c = a+b is easily seen to b e r e d u c i b l e t o a t r i p l e (l,l,2) and requires a total of f o u r u n i t s . S u c h a c a s e c a n c e r t a i n l y a r i s e f r o m a 3-layer FFBP net, and we observe that in this case we need one extra unit in an equivalent committee.

D e f i n i t i o n . L e t a = ( a , , a 2 , . . . a k ) b e a s e t o f non-negative integers. T h e unitary span of t h e set I S

t h e s e t of 2k i n t e g e r s o b t a i n e d by t a k i n g all l i n e a r combinations with coefficients in [-I,+l]. a is said to

be unitarily degenerate iff 0 is in t h e unitary span. We w r i t e u s p a n ( a ) f o r t h e set of 2k i n t e g e r s c o m p r i s i n g t h e unitary span.

D e f i n i t i o n . Let a = ( a l

,...

ak) and b = ( b l

....

b k ) b e t w o k - t u p l e s of n o n - n e g a t i v e integers in n o n - d e c r e a s i n g order. Let w = (w,.

...

w k ) b e a unitary vector i.e. each w' is e i t h e r - 1 or + l . T h e n we write a-b iff for every unitary w, sgn(w*a) J = sgn(w*b) w h e r e

*

d e n o t e s t h e usual inner product

(4)

R e m a r k . C l e a r l y

-

i s an e q u i v a l e n c e r e l a t i o n a n d e q u a l l y c l e a r l y i f t h e k - t u p l e s r e p r e s e n t t h e multiplicities of units in t h e hidden layer, replacing o n e s e t of m u l t i p l i c i t i e s w i t h t h e o t h e r w i l l n o t c h a n g e t h e c l a s i f i c a t i o n of t h e n e t . If a k - t u p l e is d e g e n e r a t e t h e n a n y e q u i v a l e n t k - t u p l e i s a l s o degenerate.

D e f i n i t i o n . A k - t u p l e of n o n - n e g a t i v e i n t e g e r s in n o n - d e c r e a s i n g o r d e r w i l l b e r e f e r r e d t o a s a multiplicity-tuple.

D e f i n i t i o n . If a = (ai.a?, ..., ak) is a n o n - d e g e n e r a t e multiplicity-tuple, we write 8, for t h e sequence with

aj r e m o v e d . T h e n i f a ’ i s t h e s e q u e n c e a w i t h a, changed t o b j we say that the transformation from a t o a ’ i s allowable provided both aj and b j are in t h e same interval of consecutive e l e m e n t s of uspan(;iJ), i s . iff there is n o element of uspan(ij) between aj and bj.

P r o p o s i t i o n . If a ’ i s an allowable transform of a then a ’ a n d a are equivalent.

P r o o f : S u p p o s e n o t . T h e n t h e r e i s s o m e u n i t a r y combination of t h e elements on which t h e sign of t h e result is different. Let t h e unitary vector involved be w, so sgn(w*a) is d i f f e r e n t f r o m sgn(w*a?. Now t h e a b s o l u t e v a l u e of t h e d i f f e r e n c e b e t w e e n w*a a n d w * a ’ i s la.-b.l S u p p o s e , w i t h o u t loss of generality. that wj = 1 and let t h e residue r b e given by r = w*a

- a. = w * a ’ - b . S i n c e t h e sign of r

+

aj d i f f e r s from t h e sign of r

+

b . there must b e some number between a. and b . which i s e q u a l t o r , which is an e l e m e n t of u s p a n ( i j ) . B u t t h i s c o n t r a d i c t s t h e d e f i n i t i o n of allowable transformations.

J J ’

J J

0

R e m a r k . Obviously if a ’ i s an allowable transform of a , t h e n a i s a n a l l o w a b l e t r a n s f o r m o f a’. T h e c o m p u t a t i o n s of our f i r s t e x a m p l e d e m o n s t r a t e t h e ease with which one can reduce k - t u p l e s by means of sequences of allowable transforms, and h e n c e reduce t h e n u m b e r o f u n i t s r e q u i r e d o f a c o m m i t t e e n e t w h i l e still p r e s e r v i n g t h e c l a s s i f y i n g capacity. W e n o w e n q u i r e , how m a n y e q u i v a l e n c e c l a s s e s a r e there? T h i s will allow us, with t h e above proposition, t o find b o u n d s on t h e n u m b e r of u n i t s in a r e d u c e d committee equivalent to a given committee. When k =

3 for i n s t a n c e , we h a v e seen that t h e r e a r e precisely t w o non-degenerate equivalence classes, which h a v e representatives (O,O,!) and (l,l,l).

D e f i n i t i o n . A n y unitary v e c t o r w is positive if f o r e v e r y m u l t i p l i c i t y t u p l e a, w*a > 0, i t is negative if f o r e v e r y u n i t a r y v e c t o r a w * a < 0 , a n d i t i s

nmbiguous if t h e r e a r e some a for which w*a > 0 and others for which w*a < 0.

R e m a r k . It is clear that, say (1.1 ,...I) is positive, and

( - 1,- 1

,...,

- 1) i s n e g a t i v e , a n d t h a t (- 1.1.- 1.1

...

- 1.1) i s positive. while (1,- 1.1,- 1

,...,

- 1,l) is ambiguous. (- 1,- 1.1)

is ambiguous, while (- l,l,l) is positive.

P r o p o s i t i o n . F o r k o d d , t h e n u m b e r of equivalence c l a s s e s of multiplicity- t u p l e s is at least o n e greater than half t h e number of ambiguous unitary vectors of length k.

P r o o f : T h e equivalence classes are. by definition, t h e s e t o f d i f f e r e n t c o n s i s t e n t a s s i g n m e n t s t o t h e 2 k d i f f e r e n t unitary v e c t o r s of t h e v a l u e s + I or - 1. T h e v a l u e s assigned to t h e positive and negative unitary v e c t o r s a r e f o r c e d t o b e + 1 a n d - 1 r e s p e c t i v e l y , by d e f i n i t i o n . C o n s i d e r f i r s t t h e c a s e w h e r e t h e a m b i g u o u s v e c t o r s may b e o r d e r e d by u < v iff for e v e r y m u l t i p l i c i t y - t u p l e a, a*u < a*v. T h e n we may t a k e h a l f t h e a m b i g u o u s v e c t o r s in t h i s o r d e r and o b s e r v e t h a t an e q u i v a l e n c e c l a s s is d e t e r m i n e d by c h o o s i n g w h e r e t o p u t a zero in t h e list. Since t h e r e a r e k / 2 e l e m e n t s i n t h e l i s t t h e r e a r e l + k / 2 equivalence classes. If t h e vectors cannot b e ordered, t h e u s u a l s t a t e of a f f a i r s , t h e r e c a n o n l y b e m o r e equivalence classes than this.

R e m a r k . F o r small ( o d d ) k , t h e comparisons are not unfavourable to t h e committee. For example when k =

3, t h e unitary vectors are (l,l,l),(-l,l,l) and (1,- 1.1) for t h e p o s i t i v e v e c t o r s , t h r e e c o r r e s p o n d i n g n e g a t i v e v e c t o r s , and t h e a m b i g u o u s v e c t o r s a r e (- 1,- 1,l) and (1.1.- 1) w h i c h a r e p a i r e d . W e may i n s e r t t h e z e r o b e f o r e o r a f t e r t h e ( - 1,- 1.1) w h i c h g i v e s t w o equivalence classes, (0.0.1) and (1.1,l).

F o r k = 5 , t e n of t h e unitary vectors are positive, ten n e g a t i v e a n d t h e r e m a i n i n g t w e l v e v e c t o r s fall i n t o six pairs. Among t h e six we find a chain of f i v e and t h e l a s t v e c t o r m a y o c c u p y a n y o f t h e f i r s t t h r e e places. T h e r e are t h u s 8 equivalence classes. It is easy to confirm that t h e following 8 multiplicity-tuples are inequivalent: (O,O,O,O,l), (O,O,l,l,l), (l,l,l,l,l), (0,1,1,1,2), (1,l.I,2,2), (0.1.1.2.3). (1,1,1,1,3) a n d (1,1,2,2,3). We o b s e r v e t h a t t h e m a x i m u m s i z e of t h e e q u i v a l e n t committeee is therefore 9, and t h e average size is 5.5.

(5)

f

T a b l e 2 N e t

G e o m e t r y : 3-3-1 3-5- 1 3-7- 1

M o v e s

S.B.P. B*p*

]

195 132 105

0.21 1.2* 0.01 0.013

T i m e B.P.

2842* 442 1632*

E X P E R I M E N T S

R e m a r k . T h e t r a i n i n g p r o c e d u r e f o r c o m m i t t e e s r e q u i r e s s o m e a n a l y s i s o f t h e F F B P t r a i n i n g procedure; an informal discussion of pertintent issues may b e found in Fahlman [4]. Nilsson describes some p r o c e d u r e s w h i c h h a v e s o m e d e f e c t s . W e b r i e f l y outline t h e crucial issues.

I f t h e c o m m i t t e e correctly classifies a point t h e r e is no need to d o anything. If it is in error, we may take it t h a t s o m e of t h e u n i t s h a v e to b e c o r r e c t e d by t h e u s u a l p r o c e d u r e f o r a s i n g l e u n i t , t h e p e r c e p t r o n convergence procedure. T h e question is, which units is o n e t o modify? Nilsson gives a heuristic argument for modifying t h e ’ l e a s t wrong’, that is t o say, if 0 <

a*x < b*x f o r t h e d a t u m x a n d b o t h a a n d b a r e in e r r o r and if only o n e unit n e e d s t o be a d a p t e d , t h e n o n e a d a p t s a. A r a t i o n a l e f o r t h i s is t h a t minimum c h a n g e s a r e t o b e p r e f e r r e d s i n c e if t h e d a t a s e t i s m o s t l y k n o w n , t h i s i s l e a s t l i k e l y t o d e s t r o y ‘ k n o w l e d g e a b o u t t h e d a t a s e t a l r e a d y a c q u i r e d by t h e net’. A d r a w b a c k of t h i s is t h a t w e may h a v e an initial configuration in which all but o n e of t h e units a r e r e m o t e from all t h e d a t a , and so o n l y o n e unit is r e p e a t e d l y a d a p t e d . t h e n e t is ‘starved’. S t r a t e g i e s s u c h a s m a k i n g r a n d o m c h o i c e s of w h i c h u n i t t o correct, d e r i v e d from annealing algorithm ideas, a r e alternatives.

We employ t h e tactic of correcting all units which are in e r r o r , b u t of adapting t h e m by different a m o u n t s , t h e ‘stepsize’being smaller for t h e more remote data.

I f we t a k e a f u n c t i o n such a s 1/(1+x2) for e x a m p l e , w h i c h d e c r e a s e s f r o m 1 t o z e r o , a n d u s e t h i s a s a s t e p s i z e for t h e case when t h e ‘ d i s t a n c e ’ x i s la*yl ( f o r y t h e a u g m e n t e d i n p u t v e c t o r ) , t h e n t h i s i s precisely equivalent to weighting by the derivative of a s i g m o i d f u n c t i o n a r c t a n ( x ) . T h i s , of c o u r s e , i s p r e c i s e l y w h a t i s a c c o m p l i s h e d i n t h e B a c k Propagation algorithm; t h e actual function arctan(x) a p p e a r s e s s e n t i a l l y o n l y v i a i t s d e r i v a t i v e . W e observe that t h e important thing about t h e derivative of s i g m o i d f u n c t i o n s , i s t h a t t h e i r d e r i v a t i v e f o r p o s i t i v e d i s t a n c e s i s a l w a y s n e g a t i v e , t h e c l o s e n e u r o n s a r e m o v e d m o r e t h a n t h e r e m o t e n e u r o n s , a n d h e n c e t h e d y n a m i c i s n o t a t a n y s t a g e a c o n t r a c t i o n m a p p i n g . T h e r e p e a t e d o p e r a t i o n of d i f f e r e n t d a t a on t h e s y s t e m g i v e s us. of c o u r s e , an iterated function system.

T h e a b o v e o b s e r v a t i o n s s u f f i c e t o s p e c i f y a n e w c o m m i t t e e n e t a l g o r i t h m which we c o m p a r e d w i t h standard back-propagation in a series of experiments c o n d u c t e d o n a S U N S P A R C I - s t a t i o n . F i r s t w e treated t h e parity problem in dimensions 2.3.4 and 5 ,

i s . w e t o o k a u n i t h y p e r c u b e i n Rn a n d a s s i g n e d alternate

+

and - categories t o t h e vertices, for t h e

cases of n = 2 t o 5. In dimension 2 t h i s i s s i m p l y t h e XOR function. We took two 3-layer nets with various n u m b e r s of u n i t s in t h e h i d d e n l a y e r ; f i r s t w a s t h e standard back-propagation algorithm and second was t h e case w h e r e t h e output units were h e l d at weight one and only t h e hidden units corrected according to t h e a b o v e r u l e ; w e r e f e r t o t h i s a s ‘ s i m p l i f i e d back-propagation’ (SBP). Initial assignment of t h e w e i g h t s was at r a n d o m in t h e interval - 3 to 3, and a n u m b e r of r e p e t i t i o n s of t h e training were r u n . T h e

s o c a l l e d ‘ g a i n p a r a m e t e r ’ w a s v a r i e d b u t t h e d e p e n d e n c y w a s n o t s t r o n g o v e r a w i d e r a n g e of values. T h e averaged results a r e shown i n table I.

T a b l e 1.

3 - L a y e r Nets: T h e P a r i t y P r o b l e m

D i m e n s i o n 2 3 4 S

N e t

G e o m e t r y : 2-3-1 3-3-1 4-5-1 5 - 7 - 1

M o v e s

B.P. 215 2842* Y01* 3896* S.B.P 46 195 760 1720

T i m e B.P. 0.03 0.Y* 0.8’ 13.0*

( s e c s ) S.B.P. 0.002 0.016 0.2 1.3

*

indicates that some runs failed to converge within 100,000 moves,

Different n u m b e r s of h i d d e n u n i t s p r o d u c e d l a r g e r variations in n u m b e r s of moves and t i m e s f o r Back- P r o p a g a t i o n t h a n f o r S i m p l i f i e d B.P. a s s h o w n in

table 2, w h e r e t h e d a t a set was t h e parity problem in

dimension 3:

(6)

m a d e on m i n e r a l s a m p l e s was o b t a i n e d . T h e v e c t o r s of l e n g t h e i g h t w e r e o b t a i n e d f r o m a n a p p a r a t u s w h i c h r e f l e c t e d l i g h t f r o m m i n e r a l s a m p l e s a n d b i n n e d i t i n t o i n t e r v a l s of t h e s p e c t r u m . C o l o u r gradings by e y e were assigned to t h e two categories of data. Dividing t h e data into test data and training data h a s e s t a b l i s h e d t h a t t h e a c c u r a c y o f a back-propagation neural net is of t h e order of 78%. A three layer back propagation net was therefore trained u n t i l t h i s l e v e l of e r r o r w a s r e a c h e d . S i m i l a r l y , a c o m m i t t e e n e t was t r a i n e d t o t h e s a m e l e v e l . T h e r e s u l t s for d i f f e r e n t n u m b e r s of units i n t h e h i d d e n layer are shown in t a b l e 3. Here we give run times on

SLN SPARC stations required to obtain different per c e n t a g e s of t h e d a t a c o r r e c t l y c l a s s i f i e d . 2 0 repetitions:

T a b l e 3 .

R e a l Data: T h r e e L a y e r n e t s .

BP

T i m e s . s e c o n d s :

%of d a t a s e t c o r r e c t

7 5 % 7'1% 7 9 %

N e t

G e o m e t r y

8-1-1 4 27 94 7

*

8-3-1 253 1085

*

8-5- 1 160 1037

*

S i m p l i f i e d

BP

T i m e s , s e c o n d s :

% o f d a t a s e t c o r r e c t

7 5 % 77% 7 9 %

Net (;eo m e t ry

8- 1- 1 23 6 5 94

8 - 3 - 1 36 4 3 88

8-5- 1

81%

*

81%

*

1 7 74 8

- denotes n o data collected,

*

denotes failure to converge.

CONCLUSIONS.

I t m i g h t b e a r g u e d t h a t B a c k P r o p a g a t i o n i s s o inefficient an algorithm that to attempt to improve it m a r g i n a l l y i s t o w a s t e t i m e a n d e n e r g y . Improvements using, for example conjugate gradient and o t h e r types of net a r e known [ 5 ] , [ 6 ] . We feel it is worth investigating t h e algorithm however for four reasons: first i t constitutes something of a standard,

if a l o w o n e ; s e c o n d s o m e of t h e ' i m p r o v e m e n t s ' abandon all claim t o being neural models, while s o m e k i n d o f a c l a i m c a n s t i l l b e m a d e f o r t h e c o m m i t t e e n e t . T h i r d , a s m e n t i o n e d i n t h e i n t r o d u c t i o n , i t i s p o s s i b l e t h a t s o m e of t h e o t h e r i m p r o v e m e n t s c a n b e m a d e i n a d d i t i o n t o o u r modification. I n particular, n e t s with m o r e layers can b e h a n d l e d u s i n g s i m i l a r i d e a s , a n d in a s u b s e q u e n t p a p e r w e s h a l l make s i m i l a r m o d i f i c a t i o n s t o f o u r - layer nets. Finally, t h e committee net is much simpler t o a n a l y s e : B a c k - P r o p a g a t i o n c o n c e a l s e s s e n t i a l features of t h e dynamic.

For nets with a relatively small number of units in t h e h i d d e n i a y e r , i m p r o v e m e n t s in t i m e by a f a c t o r o f

a b o u t 2 0 a r e r e a d i l y a t t a i n a b l e . T h i s i s d u e t o t w o f a c t o r s : f i r s t t h e s i m p l i f i c a t i o n of t h e a l g o r i t h m e n t a i l s less computation and second, t h e reduction in t h e s i z e of t h e search s p a c e m a k e s f o r f e w e r p a s s e s t h r o u g h t h e t r a i n i n g set and r e d u c e s t h e risk of t h e net becoming trapped in a local minimum.

REFERENCES

1.

2.

3.

4 .

5 .

6 .

Rosenblatt, F: 1968,. " T h e Perceptron" Spartan Books

Widrow B, Lehr M: 1990. 30 Years of Adaptive Neural Networks:Perceptron, Madaline,and Back-Propagation Proc. IEEE S e p f e m b e r

14 15- 1442

Nilsson, N. : 1965 "Learning Machines "

Rumelhart D, Hinton G and Williams R., August 1986 Learning Representation by Back-Propagation of Errors Nature, Vol 323 533

Fahlman S. and Lebiere C:1990, T h e Cascade- Correlation Learning A rch itect ure

in: D.S.Touretzky (Ed) "Advances in Neural McGraw-Hill

Information Processing Systems 2 " Morgan

Brent,R, 1990, " Fast Algorithms for Multi-Layer

Neural Nets" T h e Australian National University Computer Sciences Laboratory Technical Report.