Temporal Difference Learning - Reinforcement Learning

2.3 Reinforcement Learning

2.3.1 Temporal Difference Learning

Temporal Difference (TD) learning methods (Sutton, 1988) are probably the most popular model-free RL approaches. Model-free means that TD learning neither requires a specified model of the environment (in contrast to Dynamic Programming methods, which are based on $P^a_{ss'}$ and $R^a_{ss'}$) nor learns an explicit model of the environment. Instead, TD learns a policy indirectly by first approximating $Q^*$ and then deriving $\pi^*$ from $Q^*$ based on Equation 2.4. Since TD is model-free, the only way for the agent to obtain information about the respective MDP is to interact with its environment and to observe the successor state $s'$ and reward $r$ when applying an action $a$ in state $s$. TD learning provides means for stochastically approximating $Q^*$ based on a set of quadruples $(s_t, a_t, r_{t+1}, s_{t+1})$. TD learning methods differ in their specific learning rules; the two most popular methods are Q-learning (Watkins, 1989) with the learning rule

$$Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right) \tag{2.5}$$

and SARSA (Rummery and Niranjan, 1994) with the learning rule

$$Q_{t+1}(s_t, a_t) \leftarrow Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right). \tag{2.6}$$

Algorithm 2.1 Q-Learning (Watkins, 1989).

 1: Input: Initial Q(s, a), learning rate α ∈ (0, 1], discount factor γ ∈ [0, 1]
 2: while True do
 3:     s ∼ So(s)                # Sample start state for the episode
 4:     repeat
 5:         a ∼ π(a|s)           # Sample action from policy, e.g., a policy ε-greedy in Q
 6:         s' ∼ P(s'|s, a)      # Stochastic state transition according to P^a_{ss'}
 7:         r ∼ R(r|s, a, s')    # Sample reward for state transition according to R^a_{ss'}
 8:         Q(s, a) = Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]    # Q-learning update rule
 9:         s = s'               # Continue with successor state
10:     until St(s)              # Stop episode when state is terminal
11:     Q(s, a) = 0 ∀a           # Terminal states have value 0 for all actions
12: end while

Both algorithms are iterative and on-line: they update $Q_t$ in every time step based on the current observation, each update requires constant time independent of the number of observations seen, and memory consumption is bounded. The parameter $\alpha_t$ is a learning rate that typically decreases over time and controls how strongly the current observation affects the action value function. Both algorithms bootstrap, i.e., they compute their new estimate $Q_{t+1}$ of action values based on their current estimate $Q_t$. For finite MDPs, the Q-learning update rule can be seen as a stochastic gradient descent on the approximate Bellman error, with $-\left(r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right)$ being the Bellman error's approximate stochastic gradient (Heidrich-Meisner et al., 2007). Pseudo-code for Q-learning is given in Algorithm 2.1; SARSA is obtained by replacing line 8 with the respective learning rule (Equation 2.6).
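As a hedged illustration of Equations 2.5 and 2.6, the following Python sketch performs one tabular update step for each rule. The function names, the dictionary-based table, and the toy states, actions, and numbers are assumptions made for this example and are not taken from the original text. Entries that have never been written default to 0, which matches the convention that terminal states have value 0 for all actions.

```python
from collections import defaultdict

ACTIONS = ("left", "right")  # hypothetical action set for this example


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step (Equation 2.5): bootstrap from the best successor action."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step (Equation 2.6): bootstrap from the action actually chosen next."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


def example_table():
    """Toy action value table; unseen entries default to 0."""
    Q = defaultdict(float)
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 3.0
    return Q


# Same observed transition (s0, right, r=1, s1), two different learning rules.
Q_ql = example_table()
q_learning_update(Q_ql, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)

Q_sarsa = example_table()
sarsa_update(Q_sarsa, "s0", "right", r=1.0, s_next="s1", a_next="left",
             alpha=0.5, gamma=0.9)

print(Q_ql[("s0", "right")], Q_sarsa[("s0", "right")])
```

Up to floating-point rounding, the Q-learning entry becomes 0.5 · (1 + 0.9 · 3) = 1.85, while the SARSA entry becomes 0.5 · (1 + 0.9 · 1) = 0.95. The two rules differ only in the bootstrap term: Q-learning uses the maximum over successor actions, whereas SARSA uses the value of the successor action that was actually selected.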

Model-free RL algorithms such as Q-learning and SARSA are faced with the exploration-exploitation dilemma. On the one hand, for convergence to the optimal policy, they are required to try every state-action pair infinitely often, i.e., they have to explore their environment. On the other hand, the ultimate reason for learning is to be able to act such that the obtained reward is maximized, i.e., to choose actions with maximal $Q^*(s, a)$. At any point in time, the agent's best guess for this is to choose the action with maximal $Q_t(s, a)$, which is called exploitation. Thus, the agent is faced with a trade-off between two different objectives, namely exploration and exploitation.


One common way to deal with this is ε-greedy action selection:

$$\pi_t(s, a) = \begin{cases} 1 - \varepsilon_t + \varepsilon_t / |A| & \text{if } a = \arg\max_{a'} Q_t(s, a') \\ \varepsilon_t / |A| & \text{else.} \end{cases} \tag{2.7}$$

This stochastic policy is easily implemented by executing the greedy action $a = \arg\max_{a'} Q_t(s, a')$ with probability $1 - \varepsilon_t$ (exploitation) and choosing an action uniformly at random with probability $\varepsilon_t$ (exploration).

The main difference between Q-learning and SARSA is that the former is off-policy and the latter is on-policy. Off-policy learning means that the agent can follow any policy, even one that chooses actions uniformly at random, but $Q_t$ always approximates $Q^*$, the action value function of the optimal policy. In contrast, for on-policy learning methods like SARSA, $Q_t$ will converge to $Q^\pi$, where $\pi$ is the behavior policy used for action selection during learning. Thus, SARSA will not learn the optimal action value function if the behavior policy does not converge to the optimal but unknown policy as $t \to \infty$.
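The ε-greedy rule of Equation 2.7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the list-based representation of the action values of one state, and the tie-breaking toward the lowest index are choices made for this example and are not part of the original text.

```python
import random


def greedy_index(q_values):
    """Index of the action with maximal value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of Equation 2.7 for a single state."""
    n = len(q_values)
    probs = [epsilon / n] * n                       # every action: eps / |A|
    probs[greedy_index(q_values)] += 1.0 - epsilon  # greedy action: extra 1 - eps
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index: explore with probability eps, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # uniformly random action
    return greedy_index(q_values)            # greedy action


print(epsilon_greedy_probs([0.1, 0.7, 0.2], epsilon=0.3))
```

Because the uniformly random draw can also pick the greedy action, the greedy action ends up with probability 1 − ε + ε/|A|, which is exactly the first case of Equation 2.7.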

For arbitrary initialization of $Q_0$, Q-learning converges asymptotically to the optimal action value function $Q^*$ for any finite MDP if all state-action pairs from $S \times A$ are executed infinitely often and the learning rate $\alpha_t$ converges to 0 with $\sum_{i=1}^{\infty} \alpha_{n_i(s,a)} = \infty$ and $\sum_{i=1}^{\infty} \alpha^2_{n_i(s,a)} < \infty$ for all $s, a$, where $n_i(s, a)$ is the index of the $i$-th time action $a$ is executed in state $s$ (Watkins and Dayan, 1992). Similar convergence proofs exist for SARSA (Singh et al., 2000); they additionally require that the behavior policy becomes greedy in the limit with infinite exploration. For instance, for ε-greedy action selection with $\varepsilon_t = 1/t$, convergence to $Q^*$ is guaranteed.
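One schedule that satisfies both step-size conditions is to use 1/i at the i-th visit of a state-action pair: the harmonic series diverges, while the sum of its squares converges (to π²/6). The Python sketch below illustrates this choice together with the exploration rate 1/t mentioned above. The class and function names are hypothetical, and the visit-count schedule is one possible choice among many rather than a prescription from the original text.

```python
from collections import defaultdict


class VisitCountSchedule:
    """Step size 1/n(s, a) per state-action pair and exploration rate 1/t."""

    def __init__(self):
        self.visits = defaultdict(int)  # n(s, a): visits of each pair so far
        self.t = 0                      # global time step

    def alpha(self, s, a):
        """Register one more visit of (s, a) and return its step size."""
        self.visits[(s, a)] += 1
        return 1.0 / self.visits[(s, a)]

    def epsilon(self):
        """Advance the time step and return the exploration rate 1/t."""
        self.t += 1
        return 1.0 / self.t


def partial_sums(n):
    """Partial sums of 1/i and 1/i^2 for i = 1..n."""
    plain = sum(1.0 / i for i in range(1, n + 1))
    squared = sum(1.0 / (i * i) for i in range(1, n + 1))
    return plain, squared


for n in (10, 1000, 100000):
    print(n, partial_sums(n))
```

The first partial sum keeps growing with n (roughly like ln n), while the second stays below π²/6 ≈ 1.645, which is the numerical counterpart of the two conditions.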

There also exist non-gradient-based, second-order TD learning methods like least-squares temporal difference (LSTD) learning (Bradtke et al., 1996; Boyan, 2002). LSTD does not require specifying a learning rate $\alpha$ and is stable for a broad range of conditions where function approximation (see below) is required. Unfortunately, LSTD can only learn state value functions for fixed policies. Lagoudakis and Parr (2003) have proposed an extension of LSTD called least-squares policy iteration (LSPI), which allows learning action value functions for control problems, i.e., problems where the policy is improved during learning. LSPI is an off-policy algorithm like Q-learning; however, it reuses samples and typically has a lower sample complexity, i.e., it requires fewer observations of the environment to learn an optimal policy. On the other hand, the cost of each update is quadratic in the number of features, as is memory consumption. Accordingly, LSPI is often used in an off-line fashion, i.e., the action value function and the corresponding policy are not updated after every time step but less frequently.
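To make the least-squares idea concrete, the following Python sketch shows a batch form of LSTD for a linear state value function V(s) = w · φ(s) under a fixed policy. It accumulates a k × k matrix A and a vector b from observed transitions and solves A w = b, so no learning rate is needed. All names, the small ridge term added for numerical stability, the hand-written linear solver, and the two-state toy problem are assumptions made for this illustration; a practical implementation would typically rely on a linear algebra library.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def lstd(transitions, features, k, gamma, ridge=1e-6):
    """Weights w of V(s) = w . features(s) for the policy that produced the data.

    transitions: iterable of (s, r, s_next, terminal) tuples.
    ridge: small diagonal term (an assumption here) that keeps A invertible.
    """
    A = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for s, r, s_next, terminal in transitions:
        phi = features(s)
        phi_next = [0.0] * k if terminal else features(s_next)
        for i in range(k):
            b[i] += phi[i] * r
            for j in range(k):  # the k x k accumulation is quadratic in k
                A[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear(A, b)


def one_hot(s):
    """Hypothetical feature map: one indicator feature per state (two states)."""
    return [1.0 if s == i else 0.0 for i in range(2)]


# Toy chain: state 0 -> state 1 with reward 0, state 1 -> terminal with reward 1.
data = [(0, 0.0, 1, False), (1, 1.0, None, True)]
weights = lstd(data, one_hot, k=2, gamma=0.9)
print(weights)
```

For this toy chain the exact values are V(1) = 1 and V(0) = 0.9, and the estimated weights agree with them up to the small bias introduced by the ridge term. The matrix A has k × k entries, which is where the quadratic memory consumption mentioned above comes from.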

In document Learning the Structure of Continuous Markov Decision Processes (Page 31-33)