6.2 Reinforcement Learning Implementation
6.2.2 Unique Implementations Regarding TensorFlow
TensorFlow Starting Seeds
When changing hyperparameters or the reward function, it is useful to fix the initial seed so that every run sees the same random sequence. Otherwise, running the same code twice can produce wildly different convergence behavior. Computers use pseudo-random number generators, which are often seeded from a changing source such as the current timestamp. Randomness enters through the track spawning, the exploration noise, the batch randomization, and, most importantly, the initial weights and biases. The "stochastic" in Stochastic Gradient Descent refers to the random sampling of mini-batches; random initial weights add further run-to-run variation. To make the sequence deterministic, one must set the initial seed for the numpy and random modules of Python and, in addition, for TensorFlow.
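A minimal sketch of seeding the Python-level generators follows. The helper names `set_global_seeds` and `sample_noise` and the seed value are illustrative assumptions, not taken from the original implementation. The TensorFlow call is left as a comment so that the sketch runs without TensorFlow installed.

```python
import random

import numpy as np


def set_global_seeds(seed):
    """Seed Python's and NumPy's global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow 1.x one would also call tf.set_random_seed(seed)
    # before building the graph.


def sample_noise(n):
    """Draw values from both generators, standing in for exploration noise."""
    return [random.random() for _ in range(n)], np.random.randn(n)


set_global_seeds(42)
first_py, first_np = sample_noise(3)

set_global_seeds(42)
second_py, second_np = sample_noise(3)

# Re-seeding reproduces exactly the same sequences.
print(first_py == second_py, np.array_equal(first_np, second_np))
```

Calling the helper once at the start of a run, before any environment or network is created, keeps every later draw on the same sequence.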
It was discovered that TensorFlow continues to produce stochastic results despite offering a method for setting a global seed, tf.set_random_seed(). The global seed is sufficient if the weights are already set and one only wants to reproduce results. The problem arises with lower-level TensorFlow operations, some of which require a seed to be set per operation; this occurs especially with weight initialization. When an operation-level seed is not set, TensorFlow uses the operation id number as the starting seed, and this id did not appear to be constant between runs. Furthermore, even with the operation-level seeds set, the results still did not appear to be deterministic.
Machine learning is ultimately just mathematics, so it is perplexing when training results remain non-deterministic; one could almost begin to question whether machine learning really is a sentient Artificial Intelligence. A solution, however, was found using tf.contrib.stateless.stateless_random_uniform(), which allows weight and bias initialization without any dependency on the operation id numbers. The drawback is that all the layers must be implemented manually instead of using tf.layers.dense().
with tf.variable_scope("hidden1"):
    # Weights: uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in))
    w1 = tf.Variable(self.fan_init(self.n_states) *
                     (2.0 * tf.contrib.stateless.stateless_random_uniform(
                         [self.n_states, self.n_neurons1],
                         seed=[self.seed, 0]) - 1.0),
                     trainable=trainable)
    # Biases: same range, drawn with a different seed
    b1 = tf.Variable(self.fan_init(self.n_states) *
                     (2.0 * tf.contrib.stateless.stateless_random_uniform(
                         [self.n_neurons1],
                         seed=[self.seed+1, 0]) - 1.0),
                     trainable=trainable)
    hidden1 = tf.matmul(s, w1) + b1
    hidden1 = tf.nn.relu(hidden1)

def fan_init(self, n):
    return 1.0 / np.sqrt(n)
Manually implementing a layer in TensorFlow consists of initializing the weights and biases as TensorFlow variables using the tf.contrib.stateless.stateless_random_uniform() method. This contrib method returns random floats in the range [0, 1) that depend only on the given starting seed. Since the DDPG algorithm suggests initializing the weights and biases using the fan-in method, the random values are mapped to [-1, 1) and then scaled by 1/sqrt(n), where n is the number of inputs to the layer. The inputs, or the previous layer's outputs, are multiplied by the weights and then shifted by the biases. The result is then passed through the activation function.
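The scale-and-shift step can be checked outside TensorFlow. The following NumPy sketch is an analogue, not the original implementation: `np.random.RandomState(seed)` stands in for the stateless TensorFlow op (a fresh generator per tensor, so the draw depends only on the seed), and the function names, layer sizes, and seed value are assumptions made for illustration.

```python
import numpy as np


def fan_in_uniform(shape, fan_in, seed):
    """Uniform values in [-1/sqrt(fan_in), 1/sqrt(fan_in)) from a fixed seed."""
    # A fresh generator per call: the result depends only on `seed`,
    # not on how many other random draws happened before.
    u = np.random.RandomState(seed).uniform(0.0, 1.0, size=shape)
    return (2.0 * u - 1.0) / np.sqrt(fan_in)  # map [0, 1) to [-1, 1), then scale


def dense_relu(s, w, b):
    """Weights multiply the inputs, biases shift, ReLU activates."""
    return np.maximum(s @ w + b, 0.0)


n_states, n_neurons1, seed = 4, 8, 123  # illustrative sizes and seed
w1 = fan_in_uniform((n_states, n_neurons1), n_states, seed)
b1 = fan_in_uniform((n_neurons1,), n_states, seed + 1)

s = np.ones((2, n_states))  # a batch of two dummy states
hidden1 = dense_relu(s, w1, b1)

bound = 1.0 / np.sqrt(n_states)
print(hidden1.shape, bool(np.all(np.abs(w1) <= bound)))
```

Because each tensor gets its own seed, adding or reordering layers does not change the values drawn for the existing ones.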
Batch Normalization Details
The DDPG algorithm suggests that batch normalization helps with states whose components have different units. It was found to be helpful, but one aspect of the implementation was unintuitive. In theory, when evaluating and exploring, the target networks should use the population mean and variance. Reinforcement learning differs from supervised learning because the agent continually interacts with the environment, making predictions rather than fitting a fixed set of input and output mappings. One could think of this as assuming the network is good for the time being and seeing how well the predicted action performs. The code snippet below shows how batch normalization is applied to a layer if bn = True. The tf.contrib.layers.batch_norm() method has an is_training argument that can be set to True or False.
hidden1 = tf.matmul(s, w1) + b1
if bn:
    hidden1 = self.batch_norm_layer(hidden1,
                                    train_phase=self.train_phase_actor,
                                    scope_bn=n_scope+'1')
hidden1 = tf.nn.relu(hidden1)

def batch_norm_layer(self, x, train_phase, scope_bn):
    return tf.contrib.layers.batch_norm(x, scale=True, is_training=train_phase,
                                        updates_collections=None,
                                        decay=0.999, epsilon=0.001,
                                        scope=scope_bn)
It was found that setting is_training = False for the target networks produced worse convergence on the ContinuousMountainCar-v0 environment. It was initially thought that the population mean and variance of the target networks were not being updated correctly by the polyak averaging. However, manually copying the population mean and variance from the main network to the target network actually performed worse. So, despite the theoretical grounding, setting is_training = True during exploration was found to be necessary. This is further discussed in Appendix B.
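To make the role of the is_training flag concrete, the NumPy sketch below contrasts the two modes. It is a simplified stand-in for tf.contrib.layers.batch_norm() (no learned scale or offset), not the original implementation. The function name and the toy data are assumptions, while `decay` and `epsilon` mirror the values in the snippet above.

```python
import numpy as np


def batch_norm(x, moving_mean, moving_var, is_training,
               decay=0.999, epsilon=0.001):
    """Normalize x; return the output and the (possibly updated) statistics."""
    if is_training:
        # Training mode: normalize with the current batch statistics and
        # nudge the moving (population) estimates toward them.
        mean, var = x.mean(axis=0), x.var(axis=0)
        moving_mean = decay * moving_mean + (1.0 - decay) * mean
        moving_var = decay * moving_var + (1.0 - decay) * var
    else:
        # Inference mode: normalize with the stored population estimates.
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + epsilon), moving_mean, moving_var


x = np.array([[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]])  # toy batch
mm, mv = np.zeros(2), np.ones(2)  # initial population estimates

y_train, mm_new, mv_new = batch_norm(x, mm, mv, is_training=True)
y_eval, _, _ = batch_norm(x, mm, mv, is_training=False)

print(y_train.mean(axis=0))  # close to zero in training mode
print(y_eval.mean(axis=0))   # far from zero: estimates have not caught up
```

With a decay of 0.999 the population estimates move slowly, so early in training the two modes can give very different outputs for the same input. Whether this accounts for the behavior reported here is not established by this sketch.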
Saving Weights and Biases
The implementation of the DDPG algorithm uses TensorFlow 1.12, which builds a computational graph. Instead of looping through Python code, TensorFlow executes the computational graph, which lets it take advantage of the GPU. The computational graph is constructed and run inside a TensorFlow session. The graph can be visualized with TensorBoard, a tool for viewing training progress in a web browser, as shown in Figure 6.15.
Figure 6.15: The joint computational graph consisting of the actor and critic.
with tf.Session() as sess:
    # Construct neural networks here

    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    # Train neural networks until convergence

    saver.save(sess, checkpoint_path)
Inside the TensorFlow session, the weights and biases of all the neural networks should be constructed; this is where the starting seed matters. The variables do not hold their initial values until the session runs the global variables initializer. A Saver object is then created so that its save method can be used. Training then runs until convergence is reached, after which the save method takes the session and writes the model, consisting of the weights and biases, to the desired file location.
Figure 6.16: The saved model consists of four files: one is plain text and the other three are binary.
The saved models from TensorFlow are usually referred to as checkpoint files because the user can restore a checkpoint if training gets interrupted. Because the files are primarily binary, the weight values cannot be read by opening the files directly. The model can, however, be restored within another TensorFlow session for viewing.