• No results found

1.2 Problem Statement

2.2.3 Learning Process of a Neural Network

As discussed in Section 2.1, when we use a neural network as a machine learning model, we need to use a learning algorithm to find a set of parameters, i.e., the weights and biases of all neurons in the neural network, that make this neural network be able to map inputXto labelyin the training dataset. However, the learning algorithm is not able to calculate the perfect weights and biases for a neural network. Instead, the learning process of a neural network is regarded as an optimization problem, where the learning algorithm is used to explore the space of possible sets of weights and biases for the neural network. We use a function to evaluate a candidate solution (i.e. a set of weights and biases for the neural network). This function is called a cost function or a loss function. For example, we can define a cost function as given in Equation (2.9). In this equation,X denotes the input vector in the training dataset andX(i)is theith input feature vector in the training dataset. y denotes the desired output of the network andy(i)is the desired output of the network when the input to the network isX(i).fN N(·)denotes the neural network. W denotes all the weights in the neural network.Bdenotes all the biases in the network. Fdenotes the total number of input feature vectors. This loss function is called the Mean Squared Error (MSE) cost function. From this cost function, we can see thatC(W, B)is non-negative. When the costC(W, B)becomes very small, i.e.,C(W, B)is close to 0,fN N(X, W, B)is approximately equal toy. This means that the learning algorithm has found very good weightsW and biasesB such that the neural network with these weights and biases approximately maps the input of this networkXto the desired output of the network

y. In contrast, when the costC(W, B)is large, this means thatfN N(X, W, B)is not equal toy, showing that our neural network with the weights W and the biasesB cannot map well the input of this networkXto the output of the networky. Now, the cost function that is commonly-used for a neural network is called the cross-entropy cost function. The cross-entropy cost function is given in Equation (2.10). In this equation,Edenotes that there areEneurons in the output layer of a neural network.

y(ji)denotes the desired output of thejth neuron in the output layer of a neural network when the input to the network isX(i). fN Nj(X

(i), W, B)

denotes the actual output of thejth neuron in the output layer of a neural network when the input to the network isX(i). C(W, B) = 1 F F X i=1 ky(i)−fN N(X(i), W, B)k2 (2.9) C(W, B) =−1 F F X i=1 E X j=1 [yj(i)log(fN Nj(X (i), W, B)) + (1y(i) j )log(1−fN Nj(X (i), W, B))] (2.10)

From the aforementioned description, we can see that the objective of the learning algorithm is to minimize the costC(W, B). More specifically, we seek to find a set of weightsW and biasesB that minimize this cost as much as possible. For better readability, we useC(v) to denote a cost function. C(v)can be a function of many parameters such asv =v1, v2, ...,vh. SupposeC is a function of just two parame- tersv1,v2. Figure 2.6 shows an example of functionC withv1 andv2. Our object is to findv1, v2 whereC achieves its minimum. For the simple function shown in Figure 2.6, we can use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where C is an extremum. However, the cost function of a neural network can have many more parameters and be much more complex because a neural network contains much more parameters, i.e., the weights and biases of all neurons in the network. For example, very large neural networks have cost functions that depend on billions of weights and biases. It is impossible to use calculus to minimize the cost function.

Figure 2.6: An example of a cost functionCwith two parametersv1 andv2. To solve the aforementioned minimization problem, we can use a method, called gradient descent. We use an analogy to explain how the gradient descent method works to solve this minimization problem. This analogy is shown in Figure 2.7. As shown in this figure, we can think of our cost function as a kind of a valley and imagine a ball rolling down the slope of the valley. When the ball reaches the bottom of this valley, this means we find the minimum of the cost function. Then, the problem comes to how we make a rule that makes the ball roll down to the bottom of the valley. We can use the calculus to describe the move of a ball with a small amount4v1in thev1

direction and a small amount4v2 in thev2 direction by using Equation (2.11). We need to find4v1and4v2that make4Cnegative (negative4Cmeans that the ball is rolling down into the valley). We define that4vis equal to(4v1,4v2)T as shown in Equation (2.12), whereTis the transpose operation that turns row vectors into column vectors in a matrix. We define that the gradient ofC, denoted by5C, is equal to the vector of partial derivatives (∂v∂C1,

∂C ∂v2)

T

, as shown in Equation (2.13). With these definitions, Equation (2.11) can be rewritten to be Equation (2.14). In order to make

4Cnegative, we can make4vto be equal to−η5Cas shown in Equation (2.15), whereη is a small, positive parameter, called the learning rate. Then, by combing Equation (2.14) and Equation (2.15),4C=−η5C· 5C=−ηk 5Ck2. Because

k 5Ck2 0

andη >0,4C ≤0. This means thatCwill always decrease and never increase. Thus, we use Equation (2.15) to define the rule of how to move the ball in the gradient descent algorithm. This means that we use Equation (2.15) to compute a value4vand then move the position of the ballvto a new positionv0by the amount of4v, as shown in Equation (2.16). Then we will use updated rule again to make another movement of the ball. By keeping doing this, we can decreaseC until we reach the (approximate) minimum of the cost functionC.

Figure 2.7: The analogy of using gradient descent to minimize a cost function.

4C≈ ∂C ∂v1 4v1+ ∂C ∂v2 4v2 (2.11)

4v= (4v1,4v2)T (2.12) 5C = (∂C ∂v1 , ∂C ∂v2 )T (2.13) 4C ≈ 5C· 4v (2.14) 4v =−η5C (2.15) v→v0=v−η5C (2.16)

The aforementioned discussion describes how the gradient descent method works when the cost functionChas two parameters. WhenCis a function ofhparameters, i.e.,v1,v2, ...,vh,5Cis calculated using Equation (2.17). We repeatedly apply the rule shown in Equation (2.18) until we reach the (approximate) minimum of the cost functionC. 5C= (∂C ∂v1 ,∂C ∂v2 , ..., ∂C ∂vh )T (2.17) v→v0=v−η5C (2.18)

In fact, for a neural network,vis constituted by all weightsW=w1,w2, ...,wdand all biasesB=b1,b2, ...,bg. Therefore, to update the weights of a neural network, we re- peatedly apply the rule shown in Equation (2.20) to reach the (approximate) minimum of the cost functionC. In Equation (2.20),5CW is calculated using Equation (2.19). Also, to update the biases of a neural network, we repeatedly apply the rule shown in Equation (2.22) to reach the (approximate) minimum of the cost functionC. In Equation (2.22),5CBis calculated using Equation (2.21).

5CW = ( ∂C ∂w1 , ∂C ∂w2 , ..., ∂C ∂wd )T (2.19) W →W0 =W −η5CW (2.20) 5CB= ( ∂C ∂b1 ,∂C ∂b2 , ...,∂C ∂bg )T (2.21) B →B0=B−η5CB (2.22)

Now, researchers often use the gradient descent method with momentum, which is called the momentum-based gradient descent method. The momentum technique modifies the gradient descent method in two ways. Firstly, the momentum technique introduces a notion of “velocity” for the parameters we optimize. The gradient descent method changes the “velocity” of the parameters, not (directly) the “position” of the parameters, and only indirectly affects the “position” of the parameters. Secondly, the momentum technique introduces a friction term, which can gradually reduce the “velocity” of the parameters. In mathematical terms, the momentum-based gradient descent method replaces the updating rule forW (given in Equation (2.20) used in the gradient descent method without momentum) with a new updating rule given in Equation (2.23), and (2.24), whereVwdenotes the aforementioned “velocity” forW, andµdenotes the aforementioned friction term and is called the momentum param- eter. Also, the momentum-based gradient descent method replaces the updating rule forB(given in Equation (2.22) used in the gradient descent method without momen- tum) with a new updating rule given in Equation (2.25), and (2.26), whereVbdenotes the aforementioned “velocity” forB.

Vw →Vw0 =µVw−η5CW (2.23)

W →W0=W +Vw0 (2.24)

B →B0 =B+Vb0 (2.26) One problem of the learning process of a neural network is called overfitting. Overfitting happens when a neural network learns the details of the training data to the extent that it negatively impacts the performance of this neural network on new data. This means that random fluctuations in the training data is learned by the neural network. Unfortunately, the learned random fluctuations cannot apply on new data, thereby negatively impacting the generalizing ability of the network.

In order to reduce the overfitting, one commonly-used technique, called weight decay, is utilized. The weigh decay technique modifies the updating rule for weights

W and does not change the updating rule for biasesBin the gradient descent method. The gradient descent method with weight decay replace the updating rule forW (given in Equation (2.20) used in the gradient descent method without weight decay) with a new updating rule given in Equation (2.27). In Equation (2.27), λ is called the weight decay parameter;F denotes the total number of input feature vectors andηis the learning rate. The updating rule forB used in the gradient descent method with weight decay is the same as the updating rule forB (given in Equation (2.22)) used in the gradient descent method without weight decay.

W →W0 = (1− ηλ

F )W −η5CW (2.27)