
Implementing Back Propagation

One of the benefits of using TensorFlow is that it can keep track of operations and automatically update model variables based on back propagation. In this recipe, we will show how to use this to our advantage when training machine learning models.

Getting ready

Now we will introduce how to change the variables in our model so that a loss function is minimized. We have learned how to use objects and operations, and how to create loss functions that measure the distance between our predictions and targets. Now we just have to tell TensorFlow how to back propagate errors through our computational graph to update the variables and minimize the loss function. This is done by declaring an optimization function. Once we have an optimization function declared, TensorFlow will go through and figure out the back propagation terms for all of our computations in the graph.

When we feed data in and minimize the loss function, TensorFlow will modify our variables in the graph accordingly.
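To make the idea of back propagation terms concrete, here is a minimal sketch that works through the chain rule by hand for a two-operation graph (a multiplication followed by a squared error). It uses plain Python rather than TensorFlow, and the helper name `forward_backward` and the sample values are assumptions made for illustration only.

```python
def forward_backward(A, x, y):
    """Forward and backward pass for loss = (A * x - y) ** 2.

    Hypothetical helper for illustration; TensorFlow derives the
    equivalent terms automatically from the graph.
    """
    # Forward pass: evaluate each operation in order.
    output = A * x
    error = output - y
    loss = error ** 2

    # Backward pass: apply the chain rule from the loss back to A.
    dloss_derror = 2.0 * error      # d(error^2)/d(error)
    derror_doutput = 1.0            # d(output - y)/d(output)
    doutput_dA = x                  # d(A * x)/dA
    dloss_dA = dloss_derror * derror_doutput * doutput_dA
    return loss, dloss_dA


loss, grad = forward_backward(A=2.0, x=1.5, y=10.0)

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (forward_backward(2.0 + eps, 1.5, 10.0)[0]
           - forward_backward(2.0 - eps, 1.5, 10.0)[0]) / (2 * eps)
print(loss, grad, numeric)
```

The hand-derived gradient and the finite-difference estimate should agree closely, which is the kind of bookkeeping TensorFlow takes over once an optimizer is declared.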

For this recipe, we will start with a very simple regression algorithm. We will sample random numbers from a normal distribution with mean 1 and standard deviation 0.1. Then we will run the numbers through one operation, which will be to multiply them by a variable, A. From this, the loss function will be the L2 norm between the output and the target, which will always be the value 10. Theoretically, the best value for A will be approximately 10, since our data will have mean 1.

The second example is a very simple binary classification algorithm. Here we will generate 100 numbers from two normal distributions, N(-1, 1) and N(3, 1), with 50 numbers from each. All the numbers from N(-1, 1) will be in target class 0, and all the numbers from N(3, 1) will be in target class 1. The model to differentiate these numbers will be a sigmoid function of a translation. In other words, the model will be sigmoid(x + A), where A is a variable we will fit. Theoretically, A will be equal to -1. We arrive at this number because if m1 and m2 are the means of the two normal distributions, the value added to them to translate them equidistant to zero will be -(m1 + m2)/2.

We will see how TensorFlow can arrive at that number in the second example.
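As a quick check of the two theoretical values, the following sketch computes them directly with the Python standard library. The closed-form least squares expression for A is a standard result that this recipe does not state, and the names `best_A` and `best_shift` are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Regression: x ~ N(1, 0.1), target is always 10.
x_vals = [random.gauss(1.0, 0.1) for _ in range(100)]
y_vals = [10.0] * 100

# Minimizing sum((A * x - y)^2) over A gives A = sum(x * y) / sum(x^2).
best_A = sum(x * y for x, y in zip(x_vals, y_vals)) / sum(x * x for x in x_vals)

# Classification: shift the class means so they straddle zero evenly.
m1, m2 = -1.0, 3.0
best_shift = -(m1 + m2) / 2.0

print(best_A)      # close to 10 because the data is centered near 1
print(best_shift)  # -1.0
```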

While specifying a good learning rate helps the convergence of algorithms, we must also specify a type of optimization. For the two examples in this recipe, we use standard gradient descent. This is implemented with the TensorFlow function GradientDescentOptimizer().

How to do it…

Here is how the regression example works:

1. We start by loading the numerical Python package, numpy, and tensorflow:

import numpy as np
import tensorflow as tf

2. Now we start a graph session:

sess = tf.Session()

3. Next we create the data, placeholders, and the A variable:

x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
x_data = tf.placeholder(shape=[1], dtype=tf.float32)
y_target = tf.placeholder(shape=[1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1]))

4. We add the multiplication operation to our graph:

my_output = tf.mul(x_data, A)

5. Next we add our L2 loss function between the multiplication output and the target data:

loss = tf.square(my_output - y_target)

6. Before we can run anything, we have to initialize the variables:

init = tf.initialize_all_variables()
sess.run(init)

7. Now we have to declare a way to optimize the variables in our graph, so we declare an optimizer algorithm. Most optimization algorithms need to know how far to step in each iteration. This distance is controlled by the learning rate. If our learning rate is too big, our algorithm might overshoot the minimum, but if our learning rate is too small, our algorithm might take too long to converge; this is related to the vanishing and exploding gradient problem. The learning rate has a big influence on convergence and we will discuss this at the end of the section. While here we use the standard gradient descent algorithm, there are many different optimization algorithms that operate differently and can do better or worse depending on the problem. For a great overview of different optimization algorithms, see the paper by Sebastian Ruder in the See also section at the end of this recipe:

my_opt = tf.train.GradientDescentOptimizer(learning_rate=0.02)
train_step = my_opt.minimize(loss)

There is much theory on which learning rates are best. This is one of the harder things to figure out in machine learning algorithms. Good papers to read about how learning rates are related to specific optimization algorithms are listed in the See also section at the end of this recipe.
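The update that a standard gradient descent optimizer applies can be sketched in a few lines of plain Python. This is an illustration of the rule (new value = old value minus learning rate times gradient) for the regression loss in this recipe, not TensorFlow's implementation; the function name `gradient_descent_step` is hypothetical.

```python
def gradient_descent_step(A, x, y, learning_rate):
    """One plain gradient descent update for loss = (A * x - y) ** 2."""
    gradient = 2.0 * x * (A * x - y)   # d(loss)/dA
    return A - learning_rate * gradient


A = 1.0
A = gradient_descent_step(A, x=1.0, y=10.0, learning_rate=0.02)
print(A)  # approximately 1.36: the gradient is -18, so A moves up by 0.02 * 18
```

A larger learning rate would move A further in the same direction on this step, which is why the choice matters so much for convergence.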

8. The final step is to loop through our training algorithm and tell TensorFlow to train many times. We will do this 100 times and print out results every 25th iteration. To train, we will select a random x and y entry and feed it through the graph. TensorFlow will automatically compute the loss, and slightly change A to minimize the loss:

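for i in range(100):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    # Report the current value of A and the loss every 25th iteration
    if (i + 1) % 25 == 0:
        print('Step #' + str(i + 1) + ' A = ' + str(sess.run(A)))
        print('Loss = ' + str(sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})))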
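For readers who want to see what the training loop does without a TensorFlow session, here is a rough equivalent using only the Python standard library. It is a sketch under the assumption that the optimizer applies the plain update A = A - learning_rate * gradient; the gradient is written out by hand instead of being derived from a graph.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

x_vals = [random.gauss(1.0, 0.1) for _ in range(100)]
y_vals = [10.0] * 100

A = random.gauss(0.0, 1.0)   # random start, similar in spirit to tf.random_normal
learning_rate = 0.02

for i in range(100):
    # Pick one random (x, y) pair, as the recipe does.
    rand_index = random.randrange(100)
    x, y = x_vals[rand_index], y_vals[rand_index]

    # Gradient of (A * x - y)^2 with respect to A, then one descent step.
    gradient = 2.0 * x * (A * x - y)
    A -= learning_rate * gradient

print(A)
```

After 100 single-sample updates, A should have moved most of the way from its random start toward 10.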

9. Now we will introduce the code for the simple classification example. We can use the same TensorFlow script if we reset the graph first. Remember, we will attempt to find an optimal translation, A, that will translate the two distributions to the origin, so that the sigmoid function will split them into two different classes.

10. First we reset the graph and reinitialize the graph session:

from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()

11. Next we will create the data from two different normal distributions, N(-1, 1) and N(3, 1). We will also generate the target labels, placeholders for the data, and the bias variable, A:

x_vals = np.concatenate((np.random.normal(-1, 1, 50), np.random.normal(3, 1, 50)))
y_vals = np.concatenate((np.repeat(0., 50), np.repeat(1., 50)))
x_data = tf.placeholder(shape=[1], dtype=tf.float32)
y_target = tf.placeholder(shape=[1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(mean=10, shape=[1]))

Note that we initialized A to around the value 10, far from the theoretical value of -1. We did this on purpose to show how the algorithm converges from the value 10 to the optimal value, -1.

12. Next we add the translation operation to the graph. Remember that we do not have to wrap this in a sigmoid function because the loss function will do that for us:

my_output = tf.add(x_data, A)

13. Because the specific loss function expects batches of data that have an extra dimension associated with them (an added dimension which is the batch number), we will add an extra dimension to the output with the function expand_dims(). In the next section we will discuss how to use variable sized batches in training. For now, we will again just use one random data point at a time:

my_output_expanded = tf.expand_dims(my_output, 0)
y_target_expanded = tf.expand_dims(y_target, 0)

14. Next we will initialize our one variable, A:

init = tf.initialize_all_variables()
sess.run(init)

15. Now we declare our loss function. We will use a cross entropy with unscaled logits that transforms them with a sigmoid function. TensorFlow has this all in one function for us in the neural network package called nn.sigmoid_cross_entropy_with_logits(). As stated before, it expects the arguments to have specific dimensions, so we have to use the expanded outputs and targets accordingly:

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(my_output_expanded, y_target_expanded)
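The reason the model output is not wrapped in a sigmoid before the loss is that the cross entropy can be computed directly from the logit in a numerically safer way. The sketch below shows one such rearrangement, max(x, 0) - x * z + log(1 + exp(-|x|)), in plain Python. It follows the formulation described in the TensorFlow documentation for this loss, but it is an illustrative scalar version, and both function names are assumptions.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Cross entropy between sigmoid(logit) and a 0/1 label.

    Uses the rearrangement max(x, 0) - x * z + log(1 + exp(-|x|)),
    which avoids overflow for large positive or negative logits.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_cross_entropy(logit, label):
    """Textbook form, shown only for comparison; unsafe for extreme logits."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


print(sigmoid_cross_entropy(2.0, 1.0), naive_cross_entropy(2.0, 1.0))
print(sigmoid_cross_entropy(800.0, 0.0))  # naive form would fail here
```

For moderate logits the two forms agree; for extreme logits the naive form ends up taking the logarithm of zero, while the rearranged form stays finite.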

16. Just like the regression example, we need to add an optimizer function to the graph so that TensorFlow knows how to update the bias variable in the graph:

my_opt = tf.train.GradientDescentOptimizer(0.05)
train_step = my_opt.minimize(xentropy)

17. Finally, we loop 1,400 times, each time selecting a random data point and updating the variable A accordingly. Every 200 iterations, we will print out the value of A and the loss:

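for i in range(1400):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    # Report the current value of A and the loss every 200 iterations
    if (i + 1) % 200 == 0:
        print('Step #' + str(i + 1) + ' A = ' + str(sess.run(A)))
        print('Loss = ' + str(sess.run(xentropy, feed_dict={x_data: rand_x, y_target: rand_y})))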

Step #600 A = [-0.50994617]
Loss = [[ 0.14271219]]
Step #800 A = [-0.76606178]
Loss = [[ 0.18807337]]
Step #1000 A = [-0.90859312]
Loss = [[ 0.02346182]]
Step #1200 A = [-0.86169094]
Loss = [[ 0.05427232]]
Step #1400 A = [-1.08486211]
Loss = [[ 0.04099189]]
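The same classification training can be sketched without TensorFlow to show where the updates to A come from. For the sigmoid cross entropy, the derivative of the loss with respect to the logit is sigmoid(logit) - label, a standard result that this recipe does not derive. The code below assumes that result and the plain gradient descent update, and uses only the standard library.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# 50 points from N(-1, 1) labeled 0 and 50 points from N(3, 1) labeled 1.
x_vals = ([random.gauss(-1.0, 1.0) for _ in range(50)]
          + [random.gauss(3.0, 1.0) for _ in range(50)])
y_vals = [0.0] * 50 + [1.0] * 50

A = 10.0              # start far from the expected value of about -1
learning_rate = 0.05

for i in range(1400):
    rand_index = random.randrange(100)
    x, y = x_vals[rand_index], y_vals[rand_index]

    # For sigmoid cross entropy, d(loss)/d(logit) = sigmoid(logit) - label,
    # and the logit is x + A, so d(loss)/dA is the same expression.
    gradient = sigmoid(x + A) - y
    A -= learning_rate * gradient

print(A)
```

Because each update uses a single random point, the final value fluctuates around -1 rather than settling exactly on it, much like the printed TensorFlow output.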

How it works…

As a recap, for both examples, we did the following:

1. Created the data.

2. Initialized placeholders and variables.

3. Created a loss function.

4. Defined an optimization algorithm.

5. And finally, iterated across random data samples to iteratively update our variables.

There's more…

We've mentioned before that the optimization algorithm is sensitive to the choice of the learning rate. It is important to summarize the effect of this choice in a concise manner:

| Learning rate size | Advantages/Disadvantages | Uses |
|---|---|---|
| Smaller learning rate | Converges slower but more accurate results. | If solution is unstable, try lowering the learning rate first. |
| Larger learning rate | Less accurate, but converges faster. | For some problems, helps prevent solutions from stagnating. |
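The effect of the learning rate can be seen on a toy problem. The sketch below minimizes (A - 10)^2 with plain gradient descent for three learning rates; the problem, the function name `run_gradient_descent`, and the specific rates are assumptions chosen for illustration, and the behavior on real models will differ in detail.

```python
def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize (A - 10)^2 with plain gradient descent; return final A."""
    A = start
    for _ in range(steps):
        gradient = 2.0 * (A - 10.0)
        A -= learning_rate * gradient
    return A


small = run_gradient_descent(0.02)   # stable but still far from 10
medium = run_gradient_descent(0.4)   # reaches 10 quickly
large = run_gradient_descent(1.1)    # each step overshoots further

for name, value in [("0.02", small), ("0.4", medium), ("1.1", large)]:
    print(name, value, abs(value - 10.0))
```

On this problem the smallest rate is still several units away from the minimum after 20 steps, while the largest rate diverges because every step overshoots by more than the previous error.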

Sometimes the standard gradient descent algorithm can get stuck or slow down significantly. This can happen when the optimization is stuck in the flat spot of a saddle. To combat this, there is another algorithm that takes into account a momentum term, which adds on a fraction of the prior step's gradient descent value. TensorFlow has this built in with the MomentumOptimizer() function.
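A momentum update can be sketched as follows. This is one common formulation (accumulate a decayed sum of gradients and step along it), written in plain Python on the toy loss (A - 10)^2; the function name and the hyperparameter values are illustrative assumptions, not values from this recipe.

```python
def momentum_descent(learning_rate=0.02, momentum=0.9, steps=200, start=0.0):
    """Minimize (A - 10)^2 using gradient descent with a momentum term."""
    A = start
    velocity = 0.0
    for _ in range(steps):
        gradient = 2.0 * (A - 10.0)
        # Keep a fraction of the previous step and add the new gradient.
        velocity = momentum * velocity + gradient
        A -= learning_rate * velocity
    return A


result = momentum_descent()
print(result)
```

Because the velocity carries over between iterations, the update keeps moving through regions where the current gradient alone is small.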

Another variant is to vary the optimizer step for each variable in our models. Ideally, we would like to take larger steps for slower-moving variables and shorter steps for faster-changing variables. We will not go into the mathematics of this approach, but a common implementation of this idea is called the Adagrad algorithm. This algorithm takes into account the whole history of the variable gradients. Again, the function in TensorFlow for this is called AdagradOptimizer().
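The core of the Adagrad idea fits in a few lines: divide each step by the square root of the running sum of squared gradients for that variable. The sketch below applies it to the toy loss (A - 10)^2 with a single variable; the function name, learning rate, and epsilon are illustrative assumptions.

```python
import math


def adagrad_descent(learning_rate=1.0, steps=200, start=0.0, epsilon=1e-8):
    """Minimize (A - 10)^2 with an Adagrad-style per-variable step size."""
    A = start
    squared_gradient_sum = 0.0
    for _ in range(steps):
        gradient = 2.0 * (A - 10.0)
        # The full gradient history only ever grows, so steps only shrink.
        squared_gradient_sum += gradient ** 2
        A -= learning_rate * gradient / (math.sqrt(squared_gradient_sum) + epsilon)
    return A


result = adagrad_descent()
print(result)
```

With several variables, each would keep its own running sum, so a variable with consistently large gradients takes smaller steps than one with small gradients.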

Sometimes, Adagrad forces the gradients to zero too soon because it takes into account the whole history. A solution to this is to limit how many past steps we take into account. Doing this is called the Adadelta algorithm. We can apply this by using the function AdadeltaOptimizer().
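An Adadelta-style update, following the description in the Zeiler paper listed in the See also section, replaces the ever-growing sum with decaying averages. The sketch below is a scalar illustration on the toy loss (A - 10)^2; the function name and the values of rho and epsilon are assumptions, and it is not TensorFlow's implementation.

```python
import math


def adadelta_descent(rho=0.95, epsilon=1e-6, steps=200, start=0.0):
    """Minimize (A - 10)^2 with an Adadelta-style update (after Zeiler, 2012)."""
    A = start
    avg_sq_gradient = 0.0   # decaying average of squared gradients
    avg_sq_update = 0.0     # decaying average of squared updates
    for _ in range(steps):
        gradient = 2.0 * (A - 10.0)
        # Old gradients fade out instead of accumulating forever.
        avg_sq_gradient = rho * avg_sq_gradient + (1.0 - rho) * gradient ** 2
        update = -(math.sqrt(avg_sq_update + epsilon)
                   / math.sqrt(avg_sq_gradient + epsilon)) * gradient
        avg_sq_update = rho * avg_sq_update + (1.0 - rho) * update ** 2
        A += update
    return A


result = adadelta_descent()
print(result)
```

Note that there is no learning rate in this sketch; the step size comes from the ratio of the two averages, and the first steps are small because the update average starts at zero.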

There are a few other implementations of different gradient descent algorithms. For these, we would refer the reader to the TensorFlow documentation at: https://www.tensorflow.org/api_docs/python/train/optimizers.

See also

For some references on optimization algorithms and learning rates, see the following papers and articles:

• Kingma, D., Ba, J. Adam: A Method for Stochastic Optimization. ICLR 2015. https://arxiv.org/pdf/1412.6980.pdf

• Ruder, S. An Overview of Gradient Descent Optimization Algorithms. 2016. https://arxiv.org/pdf/1609.04747v1.pdf

• Zeiler, M. ADADelta: An Adaptive Learning Rate Method. 2012. http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf