6.2 Reinforcement Learning Implementation
6.2.8 Actor Class
The main purpose of the actor class is to maintain the actor neural network and provide its training method. The output of the actor network is the action; this is the reinforcement learning policy, or controller. The actor is constructed by initializing the number of states, the number of actions, the learning rate, the number of neurons, and the upper bound of the action. The construction of the computational graph continues with the actor tensors and operations inside the initialization of this class. Just like the critic, the placeholder variables are created inside a variable scope named ‘Actor.’ The data fed into the computational graph for the actor includes the state, the next state, the gradient of the Q values with respect to the actions, the train phase, and the batch size. Table 6.4 summarizes the variables.
Table 6.4: This table summarizes the placeholder values for the actor implementation.
Placeholder variable    Description
s                       state
s_                      next state
qa_grads                the gradient of the Q values with respect to the actions
train_phase_actor       boolean used for toggling batch normalization and mean variance
batch_size              minibatch size from replay buffer used for normalizing gradients
Nested inside the ‘Actor’ variable scope are additional operations located inside their own variable scopes. The ‘pi_online_network’ variable scope contains the creation of the online actor neural network using build_network(). Once the network is created on the computational graph, the DDPG class can use the predict() method to run a session with the TensorFlow operation and make a single prediction of the action.
    with tf.variable_scope('pi_online_network'):
        self.pi_online = self.build_network(self.s,
                                            trainable=True,
                                            bn=self.bn,
                                            n_scope='batch_norm')

    def predict(self, s, train_phase=None):
        # Reshape the single state into a batch of one before the forward pass.
        return self.sess.run(self.pi_online,
                             feed_dict={self.s: s.reshape(1, s.shape[0]),
                                        self.train_phase_actor: train_phase})
The build_network() method takes the states as an argument, along with a flag for batch normalization. If batch normalization is enabled, the state inputs are first batch normalized, which is standard practice for supervised learning neural networks. The first hidden layer of this implementation is built manually so that the initial seed is properly set for the weight and bias initialization. If batch normalization is enabled, the layer is batch normalized before the ReLU activation function.
Similarly, the second hidden layer weights are initialized, and the product of the inputs and the weights plus the bias is computed. The input to the second hidden layer is the output of the previous layer. The layer checks whether batch normalization is requested and then applies the activation function.
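The layer ordering described above (weights times inputs plus bias, optional batch normalization, then ReLU) can be sketched in plain Python. This is a minimal illustration and not the thesis implementation: the function names are hypothetical, and the batch normalization here is simplified to use only the statistics of the current batch, with no learned scale or shift and no moving averages.

```python
import math

def dense(batch, weights, bias):
    """Inputs times weights plus bias; weights has shape [n_in][n_out]."""
    return [[sum(x * w for x, w in zip(row, col)) + b
             for col, b in zip(zip(*weights), bias)]
            for row in batch]

def batch_norm(batch, eps=1e-5):
    """Simplified batch norm: zero mean, unit variance per feature (assumption)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(c) / n for c in columns]
    variances = [sum((v - m) ** 2 for v in c) / n
                 for c, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)]
            for row in batch]

def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]

def hidden_layer(batch, weights, bias, bn=False):
    """Dense layer, optionally batch normalized before the ReLU activation."""
    z = dense(batch, weights, bias)
    if bn:
        z = batch_norm(z)
    return relu(z)

# Two samples with two features each, passed through one hidden layer.
states = [[1.0, 2.0], [3.0, 4.0]]
w1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
plain = hidden_layer(states, w1, b1, bn=False)
normalized = hidden_layer(states, w1, b1, bn=True)
```

The second hidden layer would call hidden_layer() again with the output of the first layer as its input.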
What is unique to the actor is the output layer, because of its tanh activation function. First, the weights and biases are initialized in the small range [−0.003, 0.003] to ensure small initial outputs. The layer is constructed with the tanh activation for the output, but it is important to scale the result so that the action resembles the steering angles in radians. Since the output of the tanh activation function is bounded to [−1, 1], the result is scaled by 0.785 radians (approximately π/4, or 45 degrees).
    with tf.variable_scope("pi_hat"):
        # Seeded uniform initialization in [-0.003, 0.003].
        w3 = tf.Variable(0.003 * (2.0 *
            tf.contrib.stateless.stateless_random_uniform(
                [self.n_neurons2, self.n_actions],
                seed=[self.seed+4, 0]) - 1.0),
            trainable=trainable)
        b3 = tf.Variable(0.003 * (2.0 *
            tf.contrib.stateless.stateless_random_uniform(
                [self.n_actions],
                seed=[self.seed+5, 0]) - 1.0),
            trainable=trainable)
        pi_hat = tf.matmul(hidden2, w3) + b3
        pi_hat = tf.nn.tanh(pi_hat)
        # Scale the tanh output from [-1, 1] to the action bound.
        pi_hat_scaled = tf.multiply(pi_hat, self.action_bound)
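To make the effect of the small initialization and the tanh scaling concrete, the following plain Python sketch mirrors the output layer without TensorFlow. The helper names init_uniform() and output_layer() are hypothetical, and Python's random module stands in for the seeded TensorFlow initializer, so the individual weight values will differ from those in the thesis code.

```python
import math
import random

ACTION_BOUND = 0.785  # radians, the scale applied to the tanh output

def init_uniform(rows, cols, limit=0.003, seed=0):
    """Draw weights uniformly from [-limit, limit) with a fixed seed."""
    rng = random.Random(seed)
    return [[limit * (2.0 * rng.random() - 1.0) for _ in range(cols)]
            for _ in range(rows)]

def output_layer(hidden, w3, b3, action_bound=ACTION_BOUND):
    """tanh output layer scaled so every action lies within the bound."""
    actions = []
    for j in range(len(b3)):
        pre = sum(h * w3[i][j] for i, h in enumerate(hidden)) + b3[j]
        actions.append(action_bound * math.tanh(pre))
    return actions

# With small initial weights the first actions stay close to zero.
w3 = init_uniform(4, 1, seed=1)
b3 = init_uniform(1, 1, seed=2)[0]
initial_action = output_layer([1.0, 1.0, 1.0, 1.0], w3, b3)

# Even a saturated tanh cannot exceed the action bound.
saturated_action = output_layer([1e6] * 4, [[1.0]] * 4, [0.0])
```

Because tanh saturates at ±1, the scaled action can never leave the interval [−0.785, 0.785] regardless of how large the pre-activation becomes.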
The build_network() method is used identically for the actor target network, but under the ‘pi_target_network’ variable scope. Now that both the online and target networks have been created on the computational graph, operations are used to collect the trainable parameters. Finally, an operation that slowly updates the target network weights and biases from the online network is created on the computational graph. The DDPG class can request this operation using the slow_update_to_target() method of the actor class.
    self.vars_pi_online = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES,
        scope='Actor/pi_online_network')
    self.vars_pi_target = tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES,
        scope='Actor/pi_target_network')

    with tf.name_scope("slow_update"):
        slow_update_ops = []
        for i, var in enumerate(self.vars_pi_target):
            # target <- tau * online + (1 - tau) * target
            slow_update_ops.append(var.assign(
                tf.multiply(self.vars_pi_online[i], self.tau) +
                tf.multiply(self.vars_pi_target[i], 1.0-self.tau)))
        self.slow_update_2_target = tf.group(*slow_update_ops,
                                             name="slow_update_2_target")

    def slow_update_to_target(self):
        self.sess.run(self.slow_update_2_target)
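The slow update in the listing above is a weighted average of the online and target parameters. A minimal plain Python sketch of the same rule is given below; soft_update() is a hypothetical helper operating on flat lists of numbers, and the value of tau is illustrative only, since the value used in this work is not stated in this section.

```python
def soft_update(target, online, tau):
    """Return tau * online + (1 - tau) * target for each parameter."""
    return [tau * w_online + (1.0 - tau) * w_target
            for w_online, w_target in zip(online, target)]

online_weights = [1.0, -2.0, 0.5]
target_weights = [0.0, 0.0, 0.0]
tau = 0.1  # illustrative value (assumption)

# A single update moves the target a fraction tau toward the online weights.
one_step = soft_update(target_weights, online_weights, tau)

# Repeated updates make the target track the online weights slowly.
tracked = target_weights
for _ in range(200):
    tracked = soft_update(tracked, online_weights, tau)
```

A small tau keeps the target network changing slowly, while tau = 1 would copy the online parameters directly.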
The actor network can be updated using the training operation defined in the ‘Actor_Loss’ name scope. There is no explicit loss associated with updating the actor because it is a Deterministic Policy Gradient method. Instead, the weights and biases are updated in the direction of the policy gradient, performing gradient ascent.
The actor gradients are calculated with the TensorFlow gradients method, using the forward prediction of the online network, its variables, and the negative of the pre-calculated gradients of the Q values with respect to the actions. The term is negated because the optimizer performs minimization; negating the gradients turns the update into the maximization required for gradient ascent. The pre-calculated gradients of the Q values with respect to the actions are determined by the critic class method get_qa_grads(). The DDPG class takes that information and feeds it into the qa_grads placeholder variable.
The actor gradients are then normalized by the batch size using a lambda function, which is an anonymous function defined in place. The normalized gradients are passed to the Adam optimizer, which is configured with the learning rate. The training operation then applies these gradients to the variables of the online network. The training operation can be requested using the train() method of the actor class.
    with tf.name_scope("Actor_Loss"):
        # Backpropagate -dQ/da through the online network (chain rule).
        raw_actor_grads = tf.gradients(self.pi_online,
                                       self.vars_pi_online,
                                       -self.qa_grads)
        # Normalize the gradients by the minibatch size.
        self.actor_grads = list(map(lambda x: tf.div(
                                        x, self.batch_size),
                                    raw_actor_grads))

        optimizer = tf.train.AdamOptimizer(self.alpha)
        self.training_op = optimizer.apply_gradients(
            zip(self.actor_grads,
                self.vars_pi_online))

    def train(self, s_batch, qa_grads, batch_size, train_phase=None):
        self.sess.run(self.training_op, feed_dict={
            self.s: s_batch, self.qa_grads: qa_grads,
            self.batch_size: batch_size,
            self.train_phase_actor: train_phase})
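The sign convention in the actor update can be checked on a toy problem. The sketch below is an illustration under stated assumptions and not the thesis implementation: the policy is a hypothetical scalar linear policy pi(s) = theta * s, the critic is replaced by a known function Q(s, a) = −(a − 2s)², and a plain gradient descent step is used in place of the Adam optimizer. Feeding the negated Q gradients to a minimizing update should drive theta toward 2, the value that maximizes Q.

```python
def q_grad_wrt_action(s, a):
    """dQ/da for the stand-in critic Q(s, a) = -(a - 2 s)^2 (assumption)."""
    return -2.0 * (a - 2.0 * s)

def actor_gradient(states, qa_grads, batch_size):
    """Chain rule with negated Q gradients, normalized by the batch size.

    For pi(s) = theta * s, d(pi)/d(theta) = s, so each sample contributes
    (-dQ/da) * s to the gradient handed to the minimizing optimizer.
    """
    raw = sum(-dq_da * s for s, dq_da in zip(states, qa_grads))
    return raw / batch_size

theta = 0.0          # policy parameter
learning_rate = 0.1  # illustrative value (assumption)
states = [0.5, -1.0, 1.5]

for _ in range(200):
    actions = [theta * s for s in states]
    qa_grads = [q_grad_wrt_action(s, a) for s, a in zip(states, actions)]
    grad = actor_gradient(states, qa_grads, len(states))
    # A descent step on the negated gradient is an ascent step on Q.
    theta -= learning_rate * grad
```

Without the negation, the same descent step would move theta away from the maximizer of Q.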
Now that the construction of the actor network is complete for the computational graph, a visual of the tensors and operations is shown in Figure 6.19. The scopes have been expanded only for the actor.
Figure 6.19: The visualization of the actor’s portion of the computational graph shows how tensors are passed around for TensorFlow operations.
Finally, additional methods for the actor class were needed to run predictions on batches so the DDPG class can update both neural networks. The target actor network is needed to determine the target labels for the critic, which promotes stability in training the critic. The DDPG algorithm requires the prediction from the target actor network to use the next states as inputs.
    def predict_online_batch(self, s, train_phase=None):
        return self.sess.run(self.pi_online, feed_dict={self.s: s,
                             self.train_phase_actor: train_phase})

    def predict_target_batch(self, s_, train_phase=None):
        return self.sess.run(self.pi_target, feed_dict={self.s_: s_,
                             self.train_phase_actor: train_phase})
In addition, a batch prediction of the online network is needed to produce the values that are fed into the placeholder variables of the actor’s training operation. The inputs to the online network are the current states. The prediction operations, once again, grab all the necessary operations from the computational graph in order to compute a forward pass. This means that once the actor network has been saved, the user can use the output layer of the network on the computational graph to directly compute the action for control. The full implementation of the actor class is found in Appendix C.
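The role of the target actor in forming the target labels for the critic can be sketched as follows. This is a hedged illustration of the standard DDPG target, y = r + γ Q′(s′, π′(s′)), and not the DDPG class from this work: the function critic_targets(), the discount factor value, the terminal-state handling, and the stand-in networks are all assumptions introduced for the example.

```python
def critic_targets(rewards, next_states, dones,
                   target_actor, target_critic, gamma=0.99):
    """Compute y = r + gamma * Q'(s', pi'(s')) for each transition.

    target_actor and target_critic are stand-in callables (assumptions);
    in the thesis code the action would come from predict_target_batch().
    """
    targets = []
    for r, s_next, done in zip(rewards, next_states, dones):
        a_next = target_actor(s_next)  # action from the target policy
        # Assumed convention: no bootstrapping past a terminal state.
        bootstrap = 0.0 if done else gamma * target_critic(s_next, a_next)
        targets.append(r + bootstrap)
    return targets

# Toy stand-ins for the target networks, for illustration only.
toy_target_actor = lambda s: 0.5 * s
toy_target_critic = lambda s, a: s + a

labels = critic_targets(rewards=[1.0, 2.0],
                        next_states=[2.0, 4.0],
                        dones=[False, True],
                        target_actor=toy_target_actor,
                        target_critic=toy_target_critic,
                        gamma=0.5)
```

Because the labels depend only on the slowly updated target networks, they change gradually between training steps, which is the stabilizing effect described above.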