6 Learning Algorithms Based on Gradient Descent

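Reformulated RBF neural networks can be trained to map $\mathbf{x}_k \in \mathbb{R}^{n_i}$ into $\mathbf{y}_k = [y_{1,k}\; y_{2,k}\; \ldots\; y_{n_o,k}]^T \in \mathbb{R}^{n_o}$, where the vector pairs $(\mathbf{x}_k, \mathbf{y}_k)$, $1 \le k \le M$, form the training set. If $\mathbf{x}_k \in \mathbb{R}^{n_i}$ is the input to a reformulated RBF network, its response is $\hat{\mathbf{y}}_k = [\hat{y}_{1,k}\; \hat{y}_{2,k}\; \ldots\; \hat{y}_{n_o,k}]^T$, where $\hat{y}_{i,k}$ is the actual response of the $i$th output unit to $\mathbf{x}_k$, given by:

$$\hat{y}_{i,k} = f(\bar{y}_{i,k}) = f(\mathbf{w}_i^T \mathbf{h}_k) = f\Big(\sum_{j=0}^{c} w_{ij} h_{j,k}\Big), \tag{55}$$

with $h_{0,k} = 1$, $1 \le k \le M$, $h_{j,k} = g(\|\mathbf{x}_k - \mathbf{v}_j\|^2)$, $1 \le j \le c$, $\mathbf{h}_k = [h_{0,k}\; h_{1,k}\; \ldots\; h_{c,k}]^T$, and $\mathbf{w}_i = [w_{i,0}\; w_{i,1}\; \ldots\; w_{i,c}]^T$. Training is typically based on the minimization of the error between the actual outputs of the network $\hat{\mathbf{y}}_k$, $1 \le k \le M$, and the desired responses $\mathbf{y}_k$, $1 \le k \le M$.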
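The response in (55) can be evaluated for all training inputs at once. The following is a minimal sketch, not the author's implementation. The function name `rbf_response`, the linear output function $f$, and the radial basis function $g(x) = (x + 1)^{-1/2}$ are assumptions chosen only for illustration.

```python
import numpy as np


def rbf_response(X, V, W, g, f):
    """Evaluate (55) for every input vector.

    X: (M, n_i) inputs, V: (c, n_i) prototypes, W: (n_o, c + 1) weights.
    Returns H with rows h_k, shape (M, c + 1), and the outputs, shape (M, n_o).
    """
    # Squared distances ||x_k - v_j||^2, shape (M, c)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    # Prepend h_{0,k} = 1 so that w_{i,0} acts as a bias
    H = np.hstack([np.ones((X.shape[0], 1)), g(d2)])
    return H, f(H @ W.T)


# Assumed choices for illustration only
g = lambda d2: (d2 + 1.0) ** -0.5   # radial basis function (assumption)
f = lambda y_bar: y_bar             # linear output units (assumption)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
V = np.array([[0.0, 0.0]])          # a single prototype, c = 1
W = np.array([[0.5, 2.0]])          # one output unit: [w_{1,0}, w_{1,1}]
H, Y_hat = rbf_response(X, V, W, g, f)
```

For the first input, which coincides with the prototype, $h_{1,1} = 1$ and the output is $0.5 + 2.0 = 2.5$.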

6.1 Batch Learning Algorithms

A reformulated RBF neural network can be trained by minimizing the error:

$$E = \frac{1}{2} \sum_{k=1}^{M} \sum_{i=1}^{n_o} (y_{i,k} - \hat{y}_{i,k})^2. \tag{56}$$

Minimization of (56) using gradient descent implies that all training examples are presented to the RBF network simultaneously. Such a training strategy leads to batch learning algorithms. The update equation for the weight vectors of the upper associative network can be obtained using gradient descent as [21]:

$$\Delta \mathbf{w}_p = -\eta \nabla_{\mathbf{w}_p} E = \eta \sum_{k=1}^{M} \varepsilon^o_{p,k} \mathbf{h}_k, \tag{57}$$

where $\eta$ is the learning rate and $\varepsilon^o_{p,k}$ is the output error, given as:

$$\varepsilon^o_{p,k} = f'(\bar{y}_{p,k})(y_{p,k} - \hat{y}_{p,k}), \tag{58}$$

with $\bar{y}_{p,k} = \mathbf{w}_p^T \mathbf{h}_k$ denoting the argument of $f$ in (55).

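Similarly, the update equation for the prototypes can be obtained using gradient descent as [21]:

$$\Delta \mathbf{v}_q = -\eta \nabla_{\mathbf{v}_q} E = \eta \sum_{k=1}^{M} \varepsilon^h_{q,k} (\mathbf{x}_k - \mathbf{v}_q), \tag{59}$$

where $\eta$ is the learning rate and $\varepsilon^h_{q,k}$ is the hidden error, defined as:

$$\varepsilon^h_{q,k} = \alpha_{q,k} \sum_{i=1}^{n_o} \varepsilon^o_{i,k} w_{iq}, \tag{60}$$

with $\alpha_{q,k} = -2 g'(\|\mathbf{x}_k - \mathbf{v}_q\|^2)$. The selection of a specific function $g(\cdot)$ influences the update of the prototypes through $\alpha_{q,k}$, which is involved in the calculation of the corresponding hidden error $\varepsilon^h_{q,k}$. Since $h_{q,k} = g(\|\mathbf{x}_k - \mathbf{v}_q\|^2)$ and $g(x) = (g_0(x))^{\frac{1}{1-m}}$, $\alpha_{q,k}$ is given by (48) and the hidden error (60) becomes:

$$\varepsilon^h_{q,k} = \frac{2}{m-1} (h_{q,k})^m \, g_0'(\|\mathbf{x}_k - \mathbf{v}_q\|^2) \sum_{i=1}^{n_o} \varepsilon^o_{i,k} w_{iq}. \tag{61}$$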

An RBF neural network can be trained according to the algorithm presented above in a sequence of adaptation cycles, where an adaptation cycle involves the update of all adjustable parameters of the network. An adaptation cycle begins by replacing the current estimate of each weight vector $\mathbf{w}_p$, $1 \le p \le n_o$, by its updated version:

$$\mathbf{w}_p + \Delta \mathbf{w}_p = \mathbf{w}_p + \eta \sum_{k=1}^{M} \varepsilon^o_{p,k} \mathbf{h}_k. \tag{62}$$

Given the learning rate $\eta$ and the responses $\mathbf{h}_k$ of the radial basis functions, these weight vectors are updated according to the output errors $\varepsilon^o_{p,k}$, $1 \le p \le n_o$. Following the update of these weight vectors, the current estimate of each prototype $\mathbf{v}_q$, $1 \le q \le c$, is replaced by:

$$\mathbf{v}_q + \Delta \mathbf{v}_q = \mathbf{v}_q + \eta \sum_{k=1}^{M} \varepsilon^h_{q,k} (\mathbf{x}_k - \mathbf{v}_q). \tag{63}$$

For a given value of the learning rate $\eta$, the update of $\mathbf{v}_q$ depends on the hidden errors $\varepsilon^h_{q,k}$, $1 \le k \le M$. The hidden error $\varepsilon^h_{q,k}$ is influenced by the output errors $\varepsilon^o_{i,k}$, $1 \le i \le n_o$, and the weights $w_{iq}$, $1 \le i \le n_o$, through the term $\sum_{i=1}^{n_o} \varepsilon^o_{i,k} w_{iq}$. Thus, the RBF neural network is trained according to this scheme by propagating back the output error.
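The factor that multiplies the back-propagated term in the hidden error can be checked with the chain rule. The short derivation below is a worked check under the definitions $h_{q,k} = g(\|\mathbf{x}_k - \mathbf{v}_q\|^2)$ and $g(x) = (g_0(x))^{\frac{1}{1-m}}$; the linear generator function mentioned afterwards is only an illustrative choice.

```latex
\begin{align*}
\frac{\partial h_{q,k}}{\partial \mathbf{v}_q}
  &= -2\, g'(\|\mathbf{x}_k - \mathbf{v}_q\|^2)\,(\mathbf{x}_k - \mathbf{v}_q), \\
g'(x)
  &= \frac{1}{1-m}\,(g_0(x))^{\frac{1}{1-m}-1}\, g_0'(x)
   = \frac{1}{1-m}\,(g_0(x))^{\frac{m}{1-m}}\, g_0'(x)
   = \frac{1}{1-m}\,(g(x))^{m}\, g_0'(x), \\
-2\, g'(\|\mathbf{x}_k - \mathbf{v}_q\|^2)
  &= \frac{2}{m-1}\,(h_{q,k})^{m}\, g_0'(\|\mathbf{x}_k - \mathbf{v}_q\|^2).
\end{align*}
```

For example, if $g_0(x) = x + \gamma^2$ is assumed, then $g_0'(x) = 1$ and the factor reduces to $\frac{2}{m-1}(h_{q,k})^m$.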

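This algorithm can be summarized as follows:

1. Select $\eta$ and $\epsilon$; initialize the weights $\{w_{ij}\}$ and the prototypes $\{\mathbf{v}_j\}$.
2. Compute the initial response:
   $h_{j,k} = (g_0(\|\mathbf{x}_k - \mathbf{v}_j\|^2))^{\frac{1}{1-m}}$, $\forall j, k$.
   $\mathbf{h}_k = [h_{0,k}\; h_{1,k}\; \ldots\; h_{c,k}]^T$, $\forall k$.
   $\hat{y}_{i,k} = f(\mathbf{w}_i^T \mathbf{h}_k)$, $\forall i, k$.
3. Compute $E = \frac{1}{2} \sum_{k=1}^{M} \sum_{i=1}^{n_o} (y_{i,k} - \hat{y}_{i,k})^2$.
4. Set $E_{old} = E$.
5. Update the adjustable parameters:
   $\varepsilon^o_{i,k} = f'(\bar{y}_{i,k})(y_{i,k} - \hat{y}_{i,k})$, $\forall i, k$.
   $\mathbf{w}_i \leftarrow \mathbf{w}_i + \eta \sum_{k=1}^{M} \varepsilon^o_{i,k} \mathbf{h}_k$, $\forall i$.
   $\varepsilon^h_{j,k} = \frac{2}{m-1}\, g_0'(\|\mathbf{x}_k - \mathbf{v}_j\|^2)\,(h_{j,k})^m \sum_{i=1}^{n_o} \varepsilon^o_{i,k} w_{ij}$, $\forall j, k$.
   $\mathbf{v}_j \leftarrow \mathbf{v}_j + \eta \sum_{k=1}^{M} \varepsilon^h_{j,k} (\mathbf{x}_k - \mathbf{v}_j)$, $\forall j$.
6. Compute the current response:
   $h_{j,k} = (g_0(\|\mathbf{x}_k - \mathbf{v}_j\|^2))^{\frac{1}{1-m}}$, $\forall j, k$.
   $\mathbf{h}_k = [h_{0,k}\; h_{1,k}\; \ldots\; h_{c,k}]^T$, $\forall k$.
   $\hat{y}_{i,k} = f(\mathbf{w}_i^T \mathbf{h}_k)$, $\forall i, k$.
7. Compute $E = \frac{1}{2} \sum_{k=1}^{M} \sum_{i=1}^{n_o} (y_{i,k} - \hat{y}_{i,k})^2$.
8. If $(E_{old} - E)/E_{old} > \epsilon$, then go to 4.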
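The batch procedure summarized above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes linear output units ($f' = 1$), the linear generator function $g_0(x) = x + \gamma^2$ (so $g_0' = 1$), and $m = 3$. The names `response`, `error`, `train_batch`, and the `max_cycles` safeguard are hypothetical and do not appear in the text.

```python
import numpy as np

M_EXP = 3.0    # assumed value of m (m > 1)
GAMMA2 = 1.0   # assumed constant in g0(x) = x + GAMMA2


def response(X, V, W):
    """Steps 2 and 6: return H (rows h_k) and the outputs for linear f."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (M, c)
    Hc = (d2 + GAMMA2) ** (1.0 / (1.0 - M_EXP))               # h_{j,k}, j >= 1
    H = np.hstack([np.ones((X.shape[0], 1)), Hc])             # h_{0,k} = 1
    return H, H @ W.T


def error(Y, Y_hat):
    """Steps 3 and 7: the batch error (56)."""
    return 0.5 * np.sum((Y - Y_hat) ** 2)


def train_batch(X, Y, V, W, eta=0.005, eps=1e-4, max_cycles=500):
    V, W = np.array(V, dtype=float), np.array(W, dtype=float)
    H, Y_hat = response(X, V, W)
    E = error(Y, Y_hat)
    for _ in range(max_cycles):              # safeguard, not part of the summary
        E_old = E                            # step 4
        eps_o = Y - Y_hat                    # (58) with f' = 1, shape (M, n_o)
        W = W + eta * eps_o.T @ H            # (62)
        # (61) with g0' = 1; uses the updated weights, bias column excluded
        eps_h = 2.0 / (M_EXP - 1.0) * H[:, 1:] ** M_EXP * (eps_o @ W[:, 1:])
        # (63): sum_k eps_h[k, q] * (x_k - v_q) for every prototype q
        V = V + eta * (eps_h.T @ X - eps_h.sum(axis=0)[:, None] * V)
        H, Y_hat = response(X, V, W)         # step 6
        E = error(Y, Y_hat)                  # step 7
        if E_old == 0.0 or (E_old - E) / E_old <= eps:   # step 8
            break
    return V, W, E


# Toy usage: fit y = x^2 on [-1, 1] with c = 2 prototypes
X = np.linspace(-1.0, 1.0, 21)[:, None]
Y = X ** 2
V0 = np.array([[-0.5], [0.5]])
W0 = np.zeros((1, 3))
E0 = error(Y, response(X, V0, W0)[1])
V, W, E = train_batch(X, Y, V0, W0)
```

The learning rate in the toy usage is deliberately small so that each batch step reduces the error; larger values may cause the relative-decrease test in step 8 to stop training early.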

6.2 Sequential Learning Algorithms

Reformulated RBF neural networks can also be trained "on-line" by sequential learning algorithms. Such algorithms can be developed by using gradient descent to minimize the errors:

$$E_k = \frac{1}{2} \sum_{i=1}^{n_o} (y_{i,k} - \hat{y}_{i,k})^2, \tag{64}$$

for $k = 1, 2, \ldots, M$. The update equation for the weight vectors of the upper associative network can be obtained using gradient descent as [21]:

$$\Delta \mathbf{w}_{p,k} = \mathbf{w}_{p,k} - \mathbf{w}_{p,k-1} = -\eta \nabla_{\mathbf{w}_p} E_k = \eta\, \varepsilon^o_{p,k} \mathbf{h}_k, \tag{65}$$

where $\mathbf{w}_{p,k-1}$ and $\mathbf{w}_{p,k}$ are the estimates of the weight vector $\mathbf{w}_p$ before and after the presentation of the training example $(\mathbf{x}_k, \mathbf{y}_k)$, $\eta$ is the learning rate, and $\varepsilon^o_{p,k}$ is the output error defined in (58). Similarly, the update equation for the prototypes can be obtained using gradient descent as [21]:

$$\Delta \mathbf{v}_{q,k} = \mathbf{v}_{q,k} - \mathbf{v}_{q,k-1} = -\eta \nabla_{\mathbf{v}_q} E_k = \eta\, \varepsilon^h_{q,k} (\mathbf{x}_k - \mathbf{v}_q), \tag{66}$$

where $\mathbf{v}_{q,k-1}$ and $\mathbf{v}_{q,k}$ are the estimates of the prototype $\mathbf{v}_q$ before and after the presentation of the training example $(\mathbf{x}_k, \mathbf{y}_k)$, $\eta$ is the learning rate, and $\varepsilon^h_{q,k}$ is the hidden error defined in (61).

When an adaptation cycle begins, the current estimates of the weight vectors $\mathbf{w}_p$ and the prototypes $\mathbf{v}_q$ are stored in $\mathbf{w}_{p,0}$ and $\mathbf{v}_{q,0}$, respectively. After an example $(\mathbf{x}_k, \mathbf{y}_k)$, $1 \le k \le M$, is presented to the network, each weight vector $\mathbf{w}_p$, $1 \le p \le n_o$, is updated as:

$$\mathbf{w}_{p,k} = \mathbf{w}_{p,k-1} + \Delta \mathbf{w}_{p,k} = \mathbf{w}_{p,k-1} + \eta\, \varepsilon^o_{p,k} \mathbf{h}_k. \tag{67}$$

Following the update of all the weight vectors $\mathbf{w}_p$, $1 \le p \le n_o$, each prototype $\mathbf{v}_q$, $1 \le q \le c$, is updated as:

$$\mathbf{v}_{q,k} = \mathbf{v}_{q,k-1} + \Delta \mathbf{v}_{q,k} = \mathbf{v}_{q,k-1} + \eta\, \varepsilon^h_{q,k} (\mathbf{x}_k - \mathbf{v}_{q,k-1}). \tag{68}$$

An adaptation cycle is completed in this case after the sequential presentation to the network of all the examples included in the training set. Once again, the RBF neural network is trained according to this scheme by propagating back the output error.

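This algorithm can be summarized as follows:

1. Select $\eta$ and $\epsilon$; initialize the weights $\{w_{ij}\}$ and the prototypes $\{\mathbf{v}_j\}$.
5. Update the adjustable parameters for all $k = 1, 2, \ldots, M$:
   $\varepsilon^o_{i,k} = f'(\bar{y}_{i,k})(y_{i,k} - \hat{y}_{i,k})$, $\forall i$.
   $\mathbf{w}_i \leftarrow \mathbf{w}_i + \eta\, \varepsilon^o_{i,k} \mathbf{h}_k$, $\forall i$.
   $\varepsilon^h_{j,k} = \frac{2}{m-1}\, g_0'(\|\mathbf{x}_k - \mathbf{v}_j\|^2)\,(h_{j,k})^m \sum_{i=1}^{n_o} \varepsilon^o_{i,k} w_{ij}$, $\forall j$.
   $\mathbf{v}_j \leftarrow \mathbf{v}_j + \eta\, \varepsilon^h_{j,k} (\mathbf{x}_k - \mathbf{v}_j)$, $\forall j$.
6. Compute the current response.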
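One adaptation cycle of the sequential updates (67) and (68) can be sketched as follows. The sketch covers only the per-example update step and rests on the same assumptions as before: linear output units ($f' = 1$), $g_0(x) = x + \gamma^2$ (so $g_0' = 1$), and $m = 3$. The names `sequential_cycle` and `total_error` are hypothetical.

```python
import numpy as np


def sequential_cycle(X, Y, V, W, eta=0.05, m=3.0, gamma2=1.0):
    """Present every training example once, updating W and V after each one."""
    V, W = np.array(V, dtype=float), np.array(W, dtype=float)   # work on copies
    for x, y in zip(X, Y):
        d2 = ((x - V) ** 2).sum(axis=1)                          # ||x_k - v_j||^2
        h = np.concatenate(([1.0], (d2 + gamma2) ** (1.0 / (1.0 - m))))
        eps_o = y - W @ h                                        # (58), f' = 1
        W += eta * np.outer(eps_o, h)                            # (67)
        # (61) with g0' = 1, using the weights just updated
        eps_h = 2.0 / (m - 1.0) * h[1:] ** m * (eps_o @ W[:, 1:])
        V += eta * eps_h[:, None] * (x - V)                      # (68)
    return V, W


def total_error(X, Y, V, W, m=3.0, gamma2=1.0):
    """Sum of the errors (64) over the whole training set."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    H = np.hstack([np.ones((X.shape[0], 1)),
                   (d2 + gamma2) ** (1.0 / (1.0 - m))])
    return 0.5 * np.sum((Y - H @ W.T) ** 2)


# Toy usage: 50 adaptation cycles on y = x^2
X = np.linspace(-1.0, 1.0, 21)[:, None]
Y = X ** 2
V = np.array([[-0.5], [0.5]])
W = np.zeros((1, 3))
E_start = total_error(X, Y, V, W)
for _ in range(50):
    V, W = sequential_cycle(X, Y, V, W)
E_end = total_error(X, Y, V, W)
```

Unlike the batch version, the parameters change after every example, so the result depends on the order in which the examples are presented.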

6.3 Initialization of Supervised Learning

The training of reformulated RBF neural networks using gradient descent can be initialized by randomly generating the set of prototypes that determine the locations of the radial basis function centers in the input space. Such an approach relies on the supervised learning algorithm to determine appropriate locations for the radial basis function centers by updating the prototypes during learning. Nevertheless, the training of reformulated RBF neural networks by gradient descent algorithms can be facilitated by initializing the supervised learning process using a set of prototypes specifically determined to represent the input vectors included in the training set. This can be accomplished by computing the initial set of prototypes using unsupervised clustering or learning vector quantization (LVQ) algorithms.

According to the learning scheme often used for training conventional RBF neural networks [34], the locations of the radial basis function centers are determined from the input vectors included in the training set using the c-means (or k-means) algorithm. The c-means algorithm begins from an initial set of $c$ prototypes, which implies the partition of the input vectors into $c$ clusters. Each cluster is represented by a prototype, which is evaluated at subsequent iterations as the centroid of the input vectors belonging to that cluster. Each input vector is assigned to the cluster whose prototype is its closest neighbor. In mathematical terms, the indicator function $u_{ij} = u_j(\mathbf{x}_i)$ that assigns the input vector $\mathbf{x}_i$ to the $j$th cluster is computed as [9]:

$$u_{ij} = \begin{cases} 1, & \text{if } \|\mathbf{x}_i - \mathbf{v}_j\|^2 < \|\mathbf{x}_i - \mathbf{v}_\ell\|^2,\ \forall \ell \neq j, \\ 0, & \text{otherwise.} \end{cases} \tag{69}$$

For a given set of indicator functions, the new set of prototypes is calculated as [9]:

$$\mathbf{v}_j = \frac{\sum_{i=1}^{M} u_{ij}\, \mathbf{x}_i}{\sum_{i=1}^{M} u_{ij}}, \quad 1 \le j \le c. \tag{70}$$
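The alternation between the assignment rule (69) and the centroid update (70) can be sketched as below. This is a minimal sketch with two assumptions that the equations do not settle: ties are broken in favor of the lowest cluster index, and an empty cluster keeps its previous prototype. The function name `c_means` and the `max_iter` limit are hypothetical.

```python
import numpy as np


def c_means(X, V, max_iter=100):
    """Hard c-means: X is (M, n_i), V is the initial (c, n_i) prototype set."""
    V = np.array(V, dtype=float)
    for _ in range(max_iter):
        # (69): assign each input vector to its closest prototype
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)          # ties go to the lowest index (assumption)
        # (70): recompute each prototype as the centroid of its cluster
        V_new = V.copy()
        for j in range(len(V)):
            members = X[labels == j]
            if len(members) > 0:            # empty clusters are left unchanged (assumption)
                V_new[j] = members.mean(axis=0)
        converged = np.allclose(V_new, V)
        V = V_new
        if converged:
            break
    return V, labels


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
V, labels = c_means(X, [[0.0, 0.0], [10.0, 10.0]])
```

In this toy example the two prototypes settle on the centroids of the two well-separated pairs of points, $(0, 0.5)$ and $(10, 10.5)$.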

The c-means algorithm partitions the input vectors into clusters represented by a set of prototypes based on hard or crisp decisions. In other words, each input vector is assigned to the cluster represented by its closest prototype. Since this strategy fails to quantify the uncertainty typically associated with partitioning a set of input vectors, the performance of the c-means algorithm depends rather strongly on its initialization [8], [26]. When this algorithm is initialized randomly, it often converges to shallow local minima and produces empty clusters.

Most of the disadvantages of the c-means algorithm can be overcome by employing a prototype splitting procedure to produce the initial set of prototypes. Such a procedure is employed by a variation of the c-means algorithm often referred to in the literature as the LBG (Linde-Buzo-Gray) algorithm [31], which is often used for codebook design in image and video compression approaches based on vector quantization.

The LBG algorithm employs an initialization scheme to compensate for the dependence of the c-means algorithm on its initialization [8]. More specifically, this algorithm generates the desired number of clusters by successively splitting the prototypes and subsequently employing the c-means algorithm. The algorithm begins with a single prototype that is calculated as the centroid of the available input vectors. This prototype is split into two vectors, which provide the initial estimate for the c-means algorithm that is used with $c = 2$. Each of the resulting vectors is then split into two vectors and the above procedure is repeated until the desired number of prototypes is obtained. Splitting is performed by adding the perturbation vectors $\pm\mathbf{e}_i$ to each vector $\mathbf{v}_i$, producing two vectors: $\mathbf{v}_i + \mathbf{e}_i$ and $\mathbf{v}_i - \mathbf{e}_i$. The perturbation vector $\mathbf{e}_i$ can be calculated from the variance between the input vectors and the prototypes [8].
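The splitting procedure can be sketched as follows. This is a sketch under stated assumptions rather than the LBG algorithm as specified in [31]: the perturbation is taken to be a single vector shared by all prototypes, equal to a small fraction `delta` of the per-dimension standard deviation of the input vectors, and the number of prototypes is restricted to powers of two because every prototype is split at each stage. The names `lbg`, `c_means`, and `delta` are hypothetical.

```python
import numpy as np


def c_means(X, V, max_iter=100):
    """Hard c-means refinement; an empty cluster keeps its prototype (assumption)."""
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        V_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else V[j]
                          for j in range(len(V))])
        if np.allclose(V_new, V):
            return V_new
        V = V_new
    return V


def lbg(X, c, delta=0.05):
    """Grow a set of c prototypes by repeated splitting and c-means refinement."""
    if c < 1 or c & (c - 1):
        raise ValueError("this sketch only doubles, so c must be a power of two")
    V = X.mean(axis=0, keepdims=True)       # single prototype: centroid of all inputs
    e = delta * X.std(axis=0)               # assumed perturbation vector
    while len(V) < c:
        # split every prototype v into v + e and v - e, then refine with c-means
        V = c_means(X, np.vstack([V + e, V - e]))
    return V


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
V2 = lbg(X, 2)
V2 = V2[np.argsort(V2[:, 0])]               # sort rows for a stable comparison
V4 = lbg(X, 4)
```

Because the first prototype is the global centroid and each split is refined before the next one, the procedure does not depend on a random initial prototype set.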

7 Generator Functions and Gradient

In document RECENT ADVANCES IN ARTIFICIAL NEURAL NETWORKS Design and Applications Lakhmi Jain pdf (Page 81-88)