2.2 Reinforcement Learning (RL)
2.2.4 Eligibility Traces
As discussed in Section 2.2.2, SARSA’s quick convergence is partly due to its TD-based Q-value update, which accelerates reward back-propagation. However, SARSA back-propagates only one step: in each learning step, a reward is propagated just one state backward. An ideal method would propagate along the whole trajectory: once the current state-action pair receives its immediate reward, every earlier state-action pair on the trajectory leading to the current state receives a share of that reward. The share is determined by the discount parameter γ and by the pair’s distance to the current state: the longer the distance, the smaller the share, because the reward is discounted by γ per learning step. The Eligibility Traces (ET) technique [SS96] was developed to achieve this ideal back-propagation.
An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the execution of an action. The trace marks the memory parameters associated with the event as eligible for undergoing learning changes.
When a TD error⁷ occurs, only the eligible state-action pairs are assigned credit or blame for the error. Thus, eligibility traces help bridge the gap between events and training information. Like TD methods themselves, ET is a basic mechanism for temporal credit assignment. Almost any TD-based RL algorithm can be combined with Eligibility Traces [SB98].
We now briefly describe how ET helps SARSA achieve ‘whole trajectory’ back-propagation. The ET-augmented version of SARSA is called SARSA(λ) [SS96], where λ ∈ [0, 1] is a parameter that controls how much credit is delivered back to the Q-values of previous state-action pairs. The pseudocode of SARSA(λ) is presented in Algorithm 5. Its basic structure is very similar to that of SARSA (Algorithm 1), so we only highlight the augmented part. Initially, the eligibility traces of all state-action pairs are initialised to 0 (line 3). Each time a state-action pair is visited, its eligibility trace is set to 1, meaning that this pair has just been visited (line 10). Note that δ in line 9 represents the difference between the new estimate of Q(s, a) and the existing value of Q(s, a). This value is the information (immediate reward plus TD error) that we want to propagate backwards to all previous state-action pairs on the trajectory. To this end, we update the Q-values of all eligible state-action pairs according to the rule in line 12. A state-action pair (s, a) is eligible iff its eligibility trace e(s, a) ≠ 0. In the update rule in line 12, the second addend on the right-hand side is a product of three values: the step size α, the information δ that we want to back-propagate, and e(s, a), which indicates to what extent the pair (s, a) is eligible to receive the latest information. After the update, the eligibility trace of (s, a) is discounted by γλ (line 13), meaning that (s, a) is less eligible to receive the latest reward in the next update, because the distance (i.e. the number of learning steps) from (s, a) to the latest pair has increased.
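To make the decay concrete: a pair that is visited once and not revisited has its trace multiplied by γλ in every later learning step (line 13), so after k further steps the trace is (γλ)^k. The following is a minimal Python sketch of that arithmetic only; the function name `trace_after` and the parameter values are illustrative assumptions, not part of Algorithm 5.

```python
def trace_after(k, gamma, lam):
    """Trace of a pair visited once, after k applications of line 13."""
    e = 1.0  # line 10: the trace is set to 1 when the pair is visited
    for _ in range(k):
        e = gamma * lam * e  # line 13: decay by gamma * lambda
    return e


if __name__ == "__main__":
    # Illustrative values: gamma = 0.9, lambda = 0.5
    for k in range(4):
        print(k, trace_after(k, gamma=0.9, lam=0.5))
```

With λ = 0 the trace drops to 0 after a single step, and with γ = λ = 1 it never decays; these are the two extremes between which λ interpolates.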
To further understand the relation between SARSA(λ) and standard SARSA, consider the special case SARSA(0). When λ = 0, once a new state-action pair (s′, a′) is obtained, only the previous state-action pair (s, a) is updated, because only its eligibility trace is non-zero; the Q-values of all earlier pairs are unaffected, because line 13 in Algorithm 5 resets their eligibility traces to 0 when λ = 0. SARSA(0) is therefore exactly the same as standard SARSA, and SARSA(λ) is essentially a generalisation of standard SARSA: by tuning λ between 0 and 1, we control to what extent the current information is back-propagated to previous state-action pairs.

⁷ Q(s′, a′) − Q(s, a), where (s, a) is the current state-action pair and (s′, a′) is the next state-action pair, is called a TD error. In SARSA, Q-values are updated by using this TD error (line 8 in Algorithm 1).

Algorithm 5 The SARSA(λ) algorithm with replacing eligibility traces (adjusted from [SB98]).
 1: Initialise Q(s, a) for all states s and actions a arbitrarily
 2: while the experiment does not terminate do
 3:     Initialise e(s, a) = 0 for all s and a
 4:     Initialise the current state s
 5:     Choose action a from s by using ε-greedy
 6:     while s is not a terminal state do
 7:         Execute action a, observe the next state s′ and immediate reward r
 8:         Choose action a′ from s′ by using ε-greedy
 9:         δ := r + γQ(s′, a′) − Q(s, a)
10:         e(s, a) := 1
11:         for all s and a do
12:             Q(s, a) := Q(s, a) + αδe(s, a)
13:             e(s, a) := γλe(s, a)
14:         end for
15:         s := s′
16:         a := a′
17:     end while
18: end while
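Algorithm 5 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: Q is initialised to 0 rather than arbitrarily, the environment is passed in as a `step(state, action)` function returning `(next_state, reward)`, and the 4-state corridor `chain_step` is a hypothetical toy environment used only to exercise the code. Comments map each statement to the line numbers of Algorithm 5.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # max() returns the first maximal element, so ties go to the earliest action.
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_lambda(step, start_state, is_terminal, actions, episodes,
                 alpha=0.1, gamma=0.9, lam=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                                   # line 1 (zeros here)
    for _ in range(episodes):                                # line 2
        e = defaultdict(float)                               # line 3
        s = start_state                                      # line 4
        a = epsilon_greedy(Q, s, actions, epsilon, rng)      # line 5
        while not is_terminal(s):                            # line 6
            s_next, r = step(s, a)                           # line 7
            a_next = epsilon_greedy(Q, s_next, actions, epsilon, rng)  # line 8
            delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]        # line 9
            e[(s, a)] = 1.0                                  # line 10
            # line 11: pairs without a stored trace have e = 0 and are skipped
            for key in list(e):
                Q[key] += alpha * delta * e[key]             # line 12
                e[key] *= gamma * lam                        # line 13
            s, a = s_next, a_next                            # lines 15-16
    return Q


def chain_step(state, action):
    """Hypothetical corridor with states 0..3; entering state 3 pays reward 1."""
    nxt = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == 3 else 0.0)


if __name__ == "__main__":
    for lam in (0.0, 1.0):
        Q = sarsa_lambda(chain_step, 0, lambda s: s == 3, ["right", "left"],
                         episodes=1, alpha=1.0, gamma=1.0, lam=lam, epsilon=0.0)
        print(lam, [Q[(s, "right")] for s in range(3)])
```

With ε = 0 and "right" listed first, the single episode walks straight down the corridor. For λ = 0 only the last pair (2, right) receives the terminal reward, whereas for λ = 1 all three visited pairs do, which is the difference between SARSA(0) and SARSA(1) described above.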
To illustrate the advantage of SARSA(λ) over standard SARSA (i.e. SARSA(0)), we consider again the illustrative Wumpus World example introduced in Section 2.2.2. This time we use SARSA(1) to learn, and suppose that, initially, the agent is in the state shown in Figure 2.3(a). All Q-values are initialised as shown in this figure. So the current state is s1 = ⟨0, 0, F, T⟩ (line 4 in Algorithm 5). Suppose the agent randomly selects go right at s1 (line 5), so a = a1 = go right. By performing a, the agent moves to s′ = s2 = ⟨1, 0, F, F⟩ and receives reward r = −1 (line 7). Suppose that in s′ the agent chooses go up, so a′ = a2 = go up (line 8). We then obtain δ = −1 (line 9) and update e(s, a) = e(s1, a1) = 1 (line 10). For simplicity, we let α = γ = λ = 1. Because all e values except e(s1, a1) are 0, only Q(s1, a1) is updated in line 12. The new value of Q(s1, a1) is −1. After updating s = s2 (line 15) and a = a2 (line 16), the algorithm moves to the next learning step. Until now, the Q-values are the same as the Q-values updated by using standard SARSA, as shown in Figure 2.3(b).
In the next learning step, the agent executes a = a2 = go up, moves to state s′ = s3 = ⟨1, 1, F, T⟩, and receives reward r = 500 (line 7). Suppose a′ = go up (line 8), so δ = 500 + 0 − 0 = 500 (line 9). Then we update the eligibility trace of the current state-action pair: e(s, a) = e(s2, a2) = 1 (line 10). Recall that, until now, only two state-action pairs’ eligibility traces are non-zero: e(s1, a1) = e(s2, a2) = 1. Given these two non-zero eligibility traces, we can update their corresponding Q-values (line 12): Q(s1, a1) = −1 + 1 × 500 × 1 = 499, and Q(s2, a2) = 0 + 1 × 500 × 1 = 500. Then this episode ends. In all subsequent episodes, the agent will follow the optimal policy, and the Q-values will no longer change. Compared with standard SARSA, SARSA(1) does not need all the learning steps illustrated in Figure 2.4.
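The arithmetic of this two-step example can be checked by replaying the trajectory through lines 9 to 13 of Algorithm 5. The sketch below is illustrative only: the helper `replay` and the string labels for states and actions are assumptions, and all Q-values are assumed to start at 0, which is what the computations δ = −1 and δ = 500 + 0 − 0 above imply for the pairs involved.

```python
from collections import defaultdict


def replay(transitions, alpha, gamma, lam):
    """Apply lines 9-13 of Algorithm 5 to a fixed list of
    (s, a, r, s_next, a_next) tuples and return the resulting Q-table."""
    Q = defaultdict(float)  # assumption: every Q-value starts at 0
    e = defaultdict(float)  # line 3: all traces start at 0
    for s, a, r, s_next, a_next in transitions:
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # line 9
        e[(s, a)] = 1.0                                      # line 10
        for key in list(e):                                  # line 11
            Q[key] += alpha * delta * e[key]                 # line 12
            e[key] *= gamma * lam                            # line 13
    return Q


# The two learning steps described in the text.
trajectory = [
    ("s1", "go right", -1, "s2", "go up"),
    ("s2", "go up", 500, "s3", "go up"),
]

if __name__ == "__main__":
    for lam in (1.0, 0.0):
        Q = replay(trajectory, alpha=1.0, gamma=1.0, lam=lam)
        print(lam, Q[("s1", "go right")], Q[("s2", "go up")])
```

With λ = 1 the replay gives Q(s1, a1) = 499 and Q(s2, a2) = 500, matching the values above; with λ = 0 (standard SARSA) Q(s1, a1) stays at −1 after the same two steps, so the reward of 500 has not yet reached the first pair.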