Gradient descent and backpropagation - TensorFlow for Machine Intelligence

We cannot close the chapter about basic machine learning without explaining how the learning algorithm we have been using works.

Gradient descent is an algorithm to find the points where a function achieves its minimum value. Remember that we defined learning as improving the model parameters in order to minimize the loss through a number of training steps. With that concept, applying gradient descent to find the minimum of the loss function will result in our model learning from our input data.

Let’s define what a gradient is, in case you don’t know. The gradient is a mathematical operation, generally represented with the symbol $\nabla$ (nabla). It is analogous to a derivative, but applied to functions that take a vector as input and output a single value, as our loss functions do.

The output of the gradient is a vector of partial derivatives, one per position of the input vector of the function.

You should think about a partial derivative as if your function received only one variable, with all of the others replaced by constants, and then apply the usual single-variable differentiation procedure.
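As a small worked example (the function $f$ below is chosen only for illustration and is not one of the chapter's loss functions), differentiating with respect to one variable at a time while holding the other constant gives:

```latex
\begin{aligned}
f(x, y) &= x^2 y + 3y \\
\frac{\partial f}{\partial x} &= 2xy && \text{($y$ treated as a constant)} \\
\frac{\partial f}{\partial y} &= x^2 + 3 && \text{($x$ treated as a constant)} \\
\nabla f(x, y) &= \left( 2xy,\; x^2 + 3 \right)
\end{aligned}
```

The gradient simply collects the two partial derivatives into a vector, one entry per input variable.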

The partial derivatives measure the rate of change of the function output with respect to a particular input variable. In other words, how much the output value will increase if we increase the value of that input variable by a small amount.

Here is a caveat before going on. When we talk about the input variables of the loss function, we are referring to the model weights, not the actual dataset feature inputs. Those are fixed by our dataset and cannot be optimized. The partial derivatives we calculate are with respect to each individual weight in the inference model.
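The distinction can be sketched in plain Python. This is only an illustration under stated assumptions: the linear model, the mean squared error loss, the toy dataset, and all function names are hypothetical, and the partial derivatives are approximated with finite differences rather than computed symbolically.

```python
def loss(weights, features, targets):
    """Mean squared error of a linear model; the dataset is fixed, only the weights vary."""
    total = 0.0
    for x, y in zip(features, targets):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - y) ** 2
    return total / len(targets)


def numerical_gradient(weights, features, targets, eps=1e-6):
    """Approximate one partial derivative per weight with a central finite difference."""
    gradient = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps      # nudge only weight i, keep the others constant
        down[i] -= eps
        slope = (loss(up, features, targets) - loss(down, features, targets)) / (2 * eps)
        gradient.append(slope)
    return gradient


# Hypothetical toy dataset: two features per example. It is never modified below.
features = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
targets = [5.0, 2.0, 5.0]
weights = [0.0, 0.0]

gradient = numerical_gradient(weights, features, targets)
print(gradient)  # one partial derivative per weight, none per dataset feature
```

Both entries come out negative for this starting point, which means that increasing either weight would lower the loss.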

We care about the gradient because its output vector indicates the direction of maximum growth for the loss function. You could think of it as a little arrow that indicates, at every point of the function, where you should move to increase its value:

Suppose the chart above shows the loss function. The red dot represents the current weight values, where you are currently standing. The gradient represents the arrow, indicating that you should go right to increase the loss. Moreover, the length of the arrow indicates conceptually how much you would gain if you moved in that direction.

Now, if we go in the opposite direction of the gradient, the loss will also do the opposite: decrease.

In the chart, if we go in the opposite direction of the gradient (blue arrow) we will go in the direction of decreasing loss.

If we move in that direction and calculate the gradient again, and then repeat the process until the gradient length is 0, we will arrive at the loss minimum. That is our goal, and graphically it should look like:

That’s it. We can simply define the gradient descent algorithm as:

$$weights_{step\ i+1} = weights_{step\ i} - \eta \cdot \nabla loss(weights_{step\ i})$$

Notice how we added the value $\eta$ to scale the gradient. We call it the learning rate. We need to add that because the length of the gradient vector is actually an amount measured in “loss function units,” not in “weight units,” so we need to scale the gradient to be able to apply it to our weights.
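The update rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the TensorFlow implementation: the one-weight loss function, its hand-written derivative, and the chosen learning rate and step count are all assumptions made for the example.

```python
def loss(w):
    """Hypothetical one-weight loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_gradient(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w


w_final = gradient_descent(w=0.0, learning_rate=0.1, steps=100)
print(w_final, loss(w_final))  # w_final ends up very close to 3, the loss close to 0
```

Each step moves the weight a fraction of the way toward the minimum, so the gradient shrinks as the weight approaches it and the updates become smaller and smaller.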

The learning rate is not a value that the model will infer. It is a hyperparameter, or a manually configurable setting for our model. We need to figure out the right value for it. If it is too small, it will take many learning cycles to find the loss minimum. If it is too large, the algorithm may simply “skip over” the minimum and never find it, jumping cyclically. That’s known as overshooting. In our example loss function chart, it would look like:

In practice, we can’t really plot the loss function because it has many variables. So to know whether we are overshooting, we have to look at the plot of the computed total loss over time, which we can get in TensorBoard by using a tf.scalar_summary on the loss.

This is how a well-behaved loss should diminish over time, indicating a good learning rate:

The blue line is the TensorBoard chart, and the red one represents the trend line of the loss.

This is what it looks like when it is overshooting:

You should experiment with the learning rate so that it is small enough not to overshoot, but large enough for the loss to decay quickly, so you can achieve learning faster using fewer cycles.
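The effect of the learning rate on the loss curve can be reproduced with a toy experiment. The sketch below assumes a hypothetical one-weight loss, $w^2$, and records the loss after every step for four hand-picked learning rates; the specific values are illustrative only and would differ for a real model.

```python
def loss(w):
    """Hypothetical one-weight loss with its minimum at w = 0."""
    return w ** 2


def loss_gradient(w):
    return 2.0 * w


def loss_history(learning_rate, w=1.0, steps=20):
    """Return the loss recorded after every gradient descent step."""
    history = []
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
        history.append(loss(w))
    return history


too_small = loss_history(learning_rate=0.001)   # loss barely moves in 20 steps
reasonable = loss_history(learning_rate=0.3)    # loss decays quickly toward 0
cyclic = loss_history(learning_rate=1.0)        # jumps between w = -1 and w = 1
diverging = loss_history(learning_rate=1.1)     # every jump lands farther away

for name, history in [("too small", too_small), ("reasonable", reasonable),
                      ("cyclic", cyclic), ("diverging", diverging)]:
    print(name, history[0], history[-1])
```

Plotting each history against the step number gives the same kind of curve you would inspect for the real loss: a smooth decay for a good learning rate, a nearly flat line for one that is too small, and a flat or growing zigzag when the steps overshoot.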

Besides the learning rate, another issue affects the gradient descent algorithm: the presence of local optima in the loss function. Going back to the toy example loss function plot, this is how the algorithm would work if we had our initial weights close to the right side “valley” of the loss function:

The algorithm will find the valley and then stop because it will think that it is where the best possible value is located. The gradient is 0 at every minimum, so the algorithm can’t distinguish whether it stopped at the absolute minimum of the function, the global minimum, or at a local minimum that is the best value only in its close neighborhood.

We try to fight against it by initializing the weights with random values. Remember that the first value for the weights is set manually. By using random values, we improve the chance of starting the descent closer to the global minimum.
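The following sketch illustrates both the problem and the random initialization remedy. It assumes a hypothetical one-weight loss with two valleys (a local minimum near $w = 1.13$ and a lower, global one near $w = -1.30$); the function, the learning rate, and the number of random starts are choices made only for this example.

```python
import random


def loss(w):
    """Hypothetical loss with a local minimum on the right and a global one on the left."""
    return w ** 4 - 3.0 * w ** 2 + w


def loss_gradient(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting weight w."""
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w


from_right = descend(2.0)    # starts near the right valley and stays in it
from_left = descend(-2.0)    # starts near the left valley, which is the deeper one

# Several random starting points; keep the result with the lowest loss.
random.seed(0)
starts = [random.uniform(-2.0, 2.0) for _ in range(10)]
best = min((descend(w0) for w0 in starts), key=loss)

print(from_right, loss(from_right))
print(from_left, loss(from_left))
print(best, loss(best))
```

Starting on the right, the algorithm stops in the shallower valley even though a better one exists. Trying several random starting points makes it more likely that at least one descent ends in the deeper valley.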

In deep networks like the ones we will see in later chapters, local minima are very frequent. A simple way to explain this is to think about how the same input can travel many different paths to the output, thus generating the same outcome. Luckily, there are papers showing that all of those minima are closely equivalent in terms of loss, and they are not really much worse than the global minimum.

So far we haven’t been explicitly calculating any derivatives here, because we didn’t have to. TensorFlow includes the method tf.gradients to symbolically compute the gradients of the specified graph steps and output them as tensors. We don’t even need to call it manually, because TensorFlow also includes implementations of the gradient descent algorithm, among others. That is why we can present high-level formulas for how things should work without going in depth into implementation details and the math.

We are going to present backpropagation. It is a technique used for efficiently computing the gradient in a computational graph.

Let’s assume a really simple network, with one input, one output, and two hidden layers, each with a single neuron. Both hidden and output neurons will be sigmoids and the loss will be calculated using cross entropy. Such a network should look like:

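Let’s define $L_1$ as the output of the first hidden layer, $L_2$ the output of the second, and $L_3$ the final output of the network:

$$L_1 = \mathrm{sigmoid}(w_1 \cdot x)$$

$$L_2 = \mathrm{sigmoid}(w_2 \cdot L_1)$$

$$L_3 = \mathrm{sigmoid}(w_3 \cdot L_2)$$

Finally, the loss of the network will be:

$$loss = \mathrm{cross\_entropy}(L_3, y_{expected})$$

To run one step of gradient descent, we need to calculate the partial derivatives of the loss function with respect to the three weights in the network. We will start from the output layer weights, applying the chain rule:

$$\frac{\partial loss}{\partial w_3} = \mathrm{cross\_entropy}'(L_3, y_{expected}) \cdot \mathrm{sigmoid}'(w_3 \cdot L_2) \cdot L_2$$

$L_2$ is just a constant for this case as it doesn’t depend on $w_3$. To simplify the expression we could define:

$$loss' = \mathrm{cross\_entropy}'(L_3, y_{expected}) \qquad L_3' = \mathrm{sigmoid}'(w_3 \cdot L_2)$$

The resulting expression for the partial derivative would be:

$$\frac{\partial loss}{\partial w_3} = loss' \cdot L_3' \cdot L_2$$

Now let’s calculate the derivative for the second hidden layer weight, $w_2$, defining $L_2' = \mathrm{sigmoid}'(w_2 \cdot L_1)$:

$$\frac{\partial loss}{\partial w_2} = loss' \cdot L_3' \cdot w_3 \cdot L_2' \cdot L_1$$

And finally the derivative for $w_1$, defining $L_1' = \mathrm{sigmoid}'(w_1 \cdot x)$:

$$\frac{\partial loss}{\partial w_1} = loss' \cdot L_3' \cdot w_3 \cdot L_2' \cdot w_2 \cdot L_1' \cdot x$$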

You should notice a pattern. The derivative on each layer is the product of the derivatives of the layers after it by the output of the layer before. That’s the magic of the chain rule and what the algorithm takes advantage of.

We go forward from the inputs, calculating the outputs of each hidden layer up to the output layer. Then we start calculating derivatives going backwards through the hidden layers, propagating the results so that we do fewer calculations by reusing all of the elements already calculated. That’s the origin of the name backpropagation.
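The forward and backward passes for the small network described above (one input, two single-neuron hidden layers, one output, all sigmoids, no biases) can be sketched in plain Python. Several things here are assumptions made for the example: the names w1, w2, w3 for the three weights and l1, l2, l3 for the layer outputs, the binary form of the cross entropy, and the sample weight and input values. A finite-difference gradient is included only as a sanity check of the chain-rule result.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def cross_entropy(output, expected):
    """Binary cross entropy (an assumed form of the loss)."""
    return -(expected * math.log(output) + (1.0 - expected) * math.log(1.0 - output))


def cross_entropy_prime(output, expected):
    """Derivative of the binary cross entropy with respect to the network output."""
    return -expected / output + (1.0 - expected) / (1.0 - output)


def forward(w1, w2, w3, x):
    """Forward pass: compute the output of each layer in order."""
    l1 = sigmoid(w1 * x)
    l2 = sigmoid(w2 * l1)
    l3 = sigmoid(w3 * l2)
    return l1, l2, l3


def backprop(w1, w2, w3, x, expected):
    """Backward pass: return the loss derivatives for (w1, w2, w3)."""
    l1, l2, l3 = forward(w1, w2, w3, x)
    # Factor shared by all three derivatives, computed once at the output layer.
    delta3 = cross_entropy_prime(l3, expected) * sigmoid_prime(w3 * l2)
    grad_w3 = delta3 * l2
    # Reuse delta3 and move one layer back, through w3 and the second sigmoid.
    delta2 = delta3 * w3 * sigmoid_prime(w2 * l1)
    grad_w2 = delta2 * l1
    # Reuse delta2 and move back again, through w2 and the first sigmoid.
    delta1 = delta2 * w2 * sigmoid_prime(w1 * x)
    grad_w1 = delta1 * x
    return grad_w1, grad_w2, grad_w3


def numerical_gradient(w1, w2, w3, x, expected, eps=1e-6):
    """Finite-difference check that does not use the chain rule at all."""
    def loss_at(a, b, c):
        return cross_entropy(forward(a, b, c, x)[2], expected)
    return (
        (loss_at(w1 + eps, w2, w3) - loss_at(w1 - eps, w2, w3)) / (2 * eps),
        (loss_at(w1, w2 + eps, w3) - loss_at(w1, w2 - eps, w3)) / (2 * eps),
        (loss_at(w1, w2, w3 + eps) - loss_at(w1, w2, w3 - eps)) / (2 * eps),
    )


analytic = backprop(0.5, -0.3, 0.8, x=1.5, expected=1.0)
numeric = numerical_gradient(0.5, -0.3, 0.8, x=1.5, expected=1.0)
print(analytic)
print(numeric)
```

The backward pass evaluates each shared factor once and passes it to the previous layer, whereas the finite-difference check needs two full forward passes per weight. That difference is what grows into a large saving for networks with many weights.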

Conclusion

Notice how we have not used the definition of the sigmoid or cross entropy derivatives. We could have used a network with different activation functions or a different loss, and the structure of the result would be the same.

This is a very simple example, but in a network with thousands of weights whose derivatives must be calculated, using this algorithm can save orders of magnitude in training time.

To close, there are a few different optimization algorithms included in TensorFlow, though all of them are based on this method of computing gradients. Which one works better depends on the shape of your input data and the problem you are trying to solve.

Sigmoid hidden layers, softmax output layers, and gradient descent with backpropagation are the most fundamental building blocks that we are going to use for the more complex models that we will see in the next chapters.

