
2.7 Reinforcement Learning

2.7.2 Value Based Methods

One of the most widely known RL algorithms, Q-learning, is a value based method. These methods do not use Monte Carlo updates, but rather temporal difference updates. Temporal difference refers to making estimates from other estimates, which is also known as bootstrapping. This allows the weights to be updated at every timestep, so learning can improve during the episode. Think about driving a car to work. If traffic suddenly appeared, with Monte Carlo updates one would be forced to stubbornly keep driving the same route and could only adjust after arriving. With temporal difference updates, one could decide partway through the trip to take surface streets.

Instead of a neural network representing the policy directly, the policy is implicitly represented by a value function. This value function, usually an action-state value function, represents how good an action is to take in a given state. Continuous actions are difficult here because there would be too many actions to evaluate to find the one that maximizes the value function. With discrete actions, the optimal policy can still be estimated, since the agent simply needs to select the action that the estimate of Q∗ says is best.

Exploration is thus achieved through an ε-greedy approach: with some probability, ε, at each step of the episode, the action is selected at random from the discrete options. The remaining actions are selected greedily, that is, based on what the function approximator currently estimates is best.
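A minimal sketch of ε-greedy selection over a discrete action set is given here. The function name `epsilon_greedy` and the representation of the value estimates as a plain list are assumptions made for illustration, not part of any particular library.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of action-value estimates."""
    if rng.random() < epsilon:
        # Explore: uniform random choice over the discrete actions.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest estimate (the first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the selection is purely greedy, and with `epsilon=1.0` it is purely random. In practice ε is often decayed over training, which this sketch leaves to the caller.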

The Bellman optimality equation expresses the fact that the action-state value under an optimal policy must equal the expected immediate reward plus the discounted value of the best action available in the next state. This is important because it allows value based algorithms to express the value of one state in terms of the values of its successor states. The Bellman optimality equation enables the use of iterative approaches for calculating estimated optimal policies. Bellman’s equation for deterministic actions is shown below:

$$Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a') \tag{2.25}$$
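To illustrate how equation (2.25) supports an iterative solution, the following sketch repeatedly applies the deterministic Bellman backup on a made-up three-state chain. The environment, the constants, and the function names are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical 3-state chain: states 0, 1, 2, where state 2 is terminal.
# Action 0 stays in place, action 1 moves one state to the right.
# The reward is 1.0 only on the transition that enters the terminal state.
GAMMA = 0.9
N_STATES, N_ACTIONS, TERMINAL = 3, 2, 2

def step(s, a):
    """Deterministic transition: returns (reward, next_state)."""
    s_next = min(s + a, TERMINAL)
    reward = 1.0 if (s_next == TERMINAL and s != TERMINAL) else 0.0
    return reward, s_next

def q_value_iteration(sweeps=50):
    """Apply Q(s, a) <- r + gamma * max_a' Q(s', a') until the table settles."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(TERMINAL):          # the terminal state keeps Q = 0
            for a in range(N_ACTIONS):
                r, s_next = step(s, a)
                q[s][a] = r + GAMMA * max(q[s_next])
    return q
```

In this toy problem the table settles at Q(1, right) = 1, Q(1, stay) = Q(0, right) = 0.9 and Q(0, stay) = 0.81, so the greedy policy moves right in every state.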

Value based methods iteratively build a loss for updating the neural network from the temporal difference error. This is the self-labeling of reinforcement learning: the network’s own bootstrapped estimate serves as the label. Driving this error toward zero moves the estimates toward satisfying the Bellman equation, and acting greedily on those estimates maximizes the expected future reward. The temporal difference error, δ_t, measures the difference between the current estimate of the action-state value, Q(s, a), and the better estimate, r_{t+1} + γQ(s′, a′).

$$\delta_t = r_{t+1} + \gamma Q(s', a') - Q(s, a) \tag{2.26}$$
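As a small worked instance of equation (2.26), the sketch below computes the temporal difference error for a single transition. The function name `td_error` and the numbers used are illustrative assumptions. The `terminal` flag reflects the common convention that no bootstrapped value is added after the final step.

```python
def td_error(reward, q_next, q_current, gamma=0.9, terminal=False):
    """delta = r + gamma * Q(s', a') - Q(s, a), with no bootstrap at episode end."""
    target = reward if terminal else reward + gamma * q_next
    return target - q_current

# Example: r = 1.0, Q(s', a') = 2.0, Q(s, a) = 2.5 and gamma = 0.9
# give a target of 1.0 + 0.9 * 2.0 = 2.8, so delta is about 0.3.
delta = td_error(1.0, 2.0, 2.5)
```

A positive δ_t means the current estimate was too low and should be raised; a negative δ_t means it was too high.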

Sarsa is an on-policy algorithm like REINFORCE, which, remember, means learning on the “job.” Sarsa is very similar to Q-learning, but distinguishes itself by being on-policy. This means that Sarsa will tend to perform better while training; however, it is more likely to find only a locally optimal policy. Chris Gatti implemented Sarsa(λ) in [65], but this meant he only had three discrete actions for steering the tractor: full left, full right, and straight. The λ in Sarsa(λ) is the decay parameter of the eligibility trace, which allows an update to draw on multiple steps of discounted rewards as opposed to just one. Using eligibility traces can be thought of as being somewhere in between a Monte Carlo update (full episode) and a temporal difference update (one step).
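The role of the eligibility trace can be sketched with a tabular Sarsa(λ) step that uses accumulating traces. This is an assumption-laden illustration: the dictionary-based tables `q` and `e`, the function name, and the default constants are hypothetical, and it is not a reconstruction of the implementation in [65].

```python
def sarsa_lambda_step(q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8, terminal=False):
    """One Sarsa(lambda) update with accumulating traces on dict-based tables."""
    target = r if terminal else r + gamma * q.get((s_next, a_next), 0.0)
    delta = target - q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0   # mark the pair that was just visited
    for key in list(e):
        # Every recently visited pair shares in the current TD error.
        q[key] = q.get(key, 0.0) + alpha * delta * e[key]
        e[key] *= gamma * lam              # older pairs receive less credit
    return delta
```

With `lam=0` only the most recent pair is updated, which recovers the one-step update, while values of `lam` close to 1 spread credit far back along the episode, approaching a Monte Carlo update. The trace table `e` would be cleared at the start of each episode.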

Algorithm 3 Sarsa (on-policy) Temporal Difference Control

Randomly initialize state-action value network Q(s, a, ω) with weights ω
for n episodes do
    Observe initial s
    a ← action from ε-greedy policy for s
    for each step of episode do
        Take action a, observe reward r and next state s′
        if s′ is terminal then
            ω ← ω + α[r − Q(s, a, ω)]∇Q(s, a, ω)
            Go to next episode
        a′ ← action from ε-greedy policy for s′
        ω ← ω + α[r + γQ(s′, a′, ω) − Q(s, a, ω)]∇Q(s, a, ω)
        s ← s′
        a ← a′

One does not have to use a neural network for Sarsa; a lookup table can be used instead. If the states are just horizontal position and velocity and there are just three actions, the table is three-dimensional. The issue is that the states are continuous, so they must first be discretized, or binned, before they can index the table.
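A minimal sketch of the binning idea follows, together with one tabular Sarsa update on the resulting three-dimensional table. The value ranges, the number of bins, and all function names are hypothetical choices for illustration; a real problem would set them from the actual limits of position and velocity.

```python
import bisect

def make_bins(low, high, n_bins):
    """Interior bin edges that split [low, high] into n_bins equal-width bins."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]

def discretize(value, edges):
    """Index of the bin containing value; out-of-range values fall in the end bins."""
    return bisect.bisect_right(edges, value)

# Hypothetical ranges for the two continuous state variables.
N_BINS, N_ACTIONS = 10, 3
POS_EDGES = make_bins(-1.2, 0.6, N_BINS)
VEL_EDGES = make_bins(-0.07, 0.07, N_BINS)

# Three-dimensional table indexed as q_table[pos_bin][vel_bin][action].
q_table = [[[0.0] * N_ACTIONS for _ in range(N_BINS)] for _ in range(N_BINS)]

def sarsa_update(q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.99, terminal=False):
    """One tabular Sarsa step; s and s_next are (pos_bin, vel_bin) tuples."""
    q_sa = q[s[0]][s[1]][a]
    target = r if terminal else r + gamma * q[s_next[0]][s_next[1]][a_next]
    q[s[0]][s[1]][a] = q_sa + alpha * (target - q_sa)
```

A continuous observation would be mapped with `(discretize(pos, POS_EDGES), discretize(vel, VEL_EDGES))` before every lookup. Finer bins reduce the approximation error but enlarge the table, which is the trade-off that limits tabular methods.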

One of the problems with value based methods is that the label for updating the weights of the neural network keeps changing, because it is computed from the network’s own estimates; it is like a dog chasing its tail. This is unlike supervised learning, where the labels are static, for example when classifying an image as a bird or a plane. Tabular methods do not suffer from this learning problem, but they are limited to smaller toy problems because the table becomes impractically large as the number of states grows.

Q-learning is also a temporal difference algorithm; however, it is considered off-policy. This means the network is learning by “looking over someone’s shoulder.” A benefit of off-policy learning is the ability to learn from observing the policies of other agents, possibly arriving at a better policy. Because the update assumes the greedy action will be taken next, regardless of the exploratory action actually taken, Q-learning can be thought of as taking more risk during the episode; this is why Sarsa is likely to perform better in the episode initially. Sarsa, however, will likely learn a worse final policy than Q-learning [25]. Off-policy learning also allows experiences to be re-used in batch updates as opposed to one sample at a time. Learning from a batch is similar to how training is done in supervised learning, and it reduces the risk of un-learning a good policy due to the noise of a single sample.

Algorithm 4 Tabular Q-Learning (off-policy) Temporal Difference Control

Initialize Q(s, a) with zeros
for n episodes do
    Observe initial s
    repeat for each step of episode
        a ← action from ε-greedy policy for s
        Take action a, observe reward r and next state s′
        Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal
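The tabular procedure above can be sketched as runnable code on a toy problem. The corridor environment, the hyperparameter values, and the random tie-breaking rule are assumptions made for this illustration; only the update rule itself follows the listing.

```python
import random

def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical corridor.

    States are 0..n_states-1 and the last state is terminal. Action 0 moves
    left (blocked at state 0) and action 1 moves right. The reward is 1.0 on
    reaching the terminal state and 0.0 otherwise.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    terminal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # Behaviour policy: epsilon-greedy, with ties broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # Off-policy target: greedy value of the next state, 0 if terminal.
            best_next = 0.0 if s_next == terminal else max(q[s_next])
            q[s][a] += alpha * (r + gamma * best_next - q[s][a])
            s = s_next
    return q
```

The target uses the maximum over next actions even when the behaviour policy explores, which is what makes the update off-policy. On this corridor the learned table prefers moving right in every non-terminal state.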

Q-learning is technically not performing true gradient descent because the target, r + γ max_{a′} Q(s′, a′), is treated as a fixed label and no gradient is taken through it or its max operator. Training is also difficult because Q-learning tends to overestimate values: taking a maximum over noisy estimates is biased upward, and those maxima are then bootstrapped into further estimates. According to Sutton and Barto [25], there are certain conditions under which value based methods are not guaranteed to converge. In fact, sometimes they can diverge and grow without bound. The deadly triad consists of 1) function approximation, 2) bootstrapping, and 3) off-policy training.
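The upward bias from maximizing over noisy estimates can be demonstrated numerically. In the sketch below every action has a true value of exactly zero, yet the average of the maximum estimate is clearly positive. The function name, the Gaussian noise model, and the default sizes are assumptions chosen for illustration.

```python
import random

def max_of_noisy_estimates(n_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Average of max_a Q_hat(a) when every true action value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0) plus zero-mean Gaussian noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials
```

With a single action the average stays near zero, since there is nothing to maximize over. With ten actions it lands well above zero even though no action is actually better than another, and this is the kind of error that a bootstrapped target then propagates.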

Even though Q-learning showed the best performance in the early days with tables as the function approximators, this divergence phenomenon has been an ongoing research question. However, in 2015, Mnih from David Silver’s team at Google’s DeepMind [28] found a way to stabilize the algorithm. This was a major breakthrough for reinforcement learning because the same network architecture and hyperparameters were able to achieve performance comparable to humans across 49 Atari video games. The algorithm was called the Deep Q-Network (DQN).

Algorithm 5 DQN

Randomly initialize state-action value network Q(s, a, θ) with weights θ
Initialize target state-action value network Q′(s, a, θ′) with weights θ′ ← θ
Initialize replay buffer B
for n episodes do
    Observe initial s
    for each step of episode do
        a ← action from ε-greedy policy using Q(s, ·, θ)
        Take action a, observe reward r and next state s′
        Store transition (s, a, r, s′) in B
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from B
        if s_{i+1} is terminal then
            y_i = r_i
        else
            y_i = r_i + γ max_{a′} Q′(s_{i+1}, a′, θ′)
        Update Q(s, a, θ) by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i, θ))²
        Every C steps, copy θ′ ← θ
        s ← s′

David Silver said it wasn’t the cleanest approach, but what they did was add a second network called the target network. They froze the weights in the target network and computed the bootstrapped estimates from it, so that the labels stayed consistent between updates. Every so often they would copy the weights from the primary network over to the target network and freeze them again; this copy may be done, for example, every 10,000 steps. The target network helped with the correlation problem, i.e. the dog chasing its tail. In addition, they used a replay buffer to update the network in batches, avoiding the problem Sarsa has of potentially training on the noise of a single sample. So despite lacking theoretical convergence guarantees, the field of reinforcement learning has found ways to make the algorithms work in practice.
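The two stabilizing pieces described above, the replay buffer and the frozen target network, can be sketched without a deep learning library. Everything here is a simplified assumption: `ReplayBuffer`, `dqn_targets`, and `maybe_sync` are hypothetical names, the target network is stood in for by any callable that returns a list of action values, and the weights are a flat list. It is not the implementation from [28].

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

def dqn_targets(batch, target_q, gamma=0.99):
    """Labels y_i: r_i if terminal, else r_i + gamma * max_a' Q'(s_{i+1}, a')."""
    return [r if done else r + gamma * max(target_q(s_next))
            for (_, _, r, s_next, done) in batch]

def maybe_sync(step, online_params, target_params, every=10_000):
    """Copy the online weights into the target weights every `every` steps."""
    if step % every == 0:
        target_params[:] = online_params  # in-place copy of a flat weight list
    return target_params
```

Because the labels come from `target_q`, which changes only when `maybe_sync` copies the weights, the regression target stays fixed between copies. Uniform sampling from the buffer means each batch mixes transitions from many different points in training rather than one noisy sample.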