

2.2 Reinforcement Learning (RL)

2.2.2 SARSA Learning Algorithm

SARSA [RN94] is a simple yet powerful RL algorithm that has been used in many application domains, for example the RoboCup Keepaway and Takeaway games, which are introduced in detail in Chapter 6. Two particularly notable features of SARSA are its fast convergence and its model-free property [SB98]. Two factors contribute to the fast convergence. (1) SARSA is an on-policy learning method: the policy being evaluated is also the policy the agent follows, and the agent continuously updates this policy according to the latest trajectory. This allows SARSA to select the policy that is optimal with respect to the latest learning experience. (2) SARSA is a Temporal Difference (TD) based learning method: it updates the Q-value of each state-action pair by using the Q-value of the next state-action pair, which allows SARSA to back-propagate delayed stimulus information (i.e. rewards) quickly (this is illustrated later in this subsection). In addition, SARSA requires no prior knowledge of the transition function of the underlying MDP and does not explicitly reconstruct it (in the RL literature, such algorithms are referred to as model-free algorithms): all it needs to store are the Q-values (illustrated below). This model-free property allows SARSA to be used in applications where the transition function is unknown. Also note that SARSA can be used in both MDP and SMDP problems, i.e. the actions in SARSA can take one or more time slots to finish.

Pseudo code of SARSA is presented in Algorithm 1, and a walk-through of this algorithm is given below. We first describe the learning parameters of the algorithm. α ∈ [0, 1] is the learning rate, which controls how much the current Q-value changes in each update; γ is the discount factor. In lines 4 and 7, ε-greedy is the action selection policy, which chooses the action to perform in the current state: the action with the highest Q-value is chosen with probability 1 − ε, and a uniformly random action is chosen with probability ε. As learning proceeds, both ε and α should gradually decrease to 0, meaning that less exploration is needed and Q-values are updated in smaller steps. The purpose of choosing random actions with probability ε is to prevent the learning agent from becoming trapped in a locally sub-optimal policy; ε-greedy is thus a method to trade off between exploration (i.e. choosing an action that does not have the highest Q-value in the current state) and exploitation (i.e. choosing the action that has the highest Q-value in the current state). The ε-greedy action selection policy can be viewed as an approximation of the Greedy in the Limit with Infinite Exploration (GLIE) policy [Thr92], which requires that (1) each action is executed infinitely often in every state that is visited infinitely often, and (2) in the limit (i.e. after infinitely many episodes of learning), the policy is greedy with respect to the Q-value function with probability 1.

Figure 2.2: The first episode of the SARSA-based learning in the 2 × 2 Wumpus World in Figure 2.1. Each square has four numbers, each representing the Q-value of the corresponding action in that square. Panels: (a) Step 1; (b) Step 2.
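The ε-greedy rule described in this subsection can be sketched in a few lines of Python. This is only an illustration: the function name `epsilon_greedy`, the dictionary layout of the Q-values, and the uniformly random tie-breaking among equally valued actions are assumptions of this sketch and are not taken from the source.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a {action: Q-value} mapping.

    With probability epsilon a uniformly random action is returned
    (exploration); otherwise an action with the highest Q-value is
    returned (exploitation). Ties are broken uniformly at random,
    which is an assumption of this sketch.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    return rng.choice([a for a in actions if q_values[a] == best])

# Hypothetical Q-values for one state.
q = {"go up": -1000.0, "go down": 0.0, "go left": 0.0, "go right": -1.0}
print(epsilon_greedy(q, epsilon=0.0))  # either "go down" or "go left"
```

With `epsilon=0.0` the rule is purely greedy, which is the setting used in the walk-through of this subsection; a schedule that lets `epsilon` decay towards 0 would be needed to approximate a GLIE policy.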

Algorithm 1 The SARSA algorithm (adjusted from [SB98]).
 1: Initialise Q(s, a) for all states s and actions a arbitrarily
 2: while the experiment does not terminate do
 3:     Initialise the current state s
 4:     Choose action a in s by using ε-greedy
 5:     while s is not a terminal state do
 6:         Execute action a, observe the next state s′ and immediate reward r
 7:         Choose action a′ from s′ by using ε-greedy
 8:         Q(s, a) := (1 − α)Q(s, a) + α(r + γQ(s′, a′))
 9:         s := s′
10:         a := a′
11:     end while
12: end while
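For readers who prefer running code, the following is a minimal Python sketch of Algorithm 1. The environment is a hypothetical, simplified stand-in for the 2 × 2 Wumpus World: states are reduced to grid coordinates (the breeze and stench flags are omitted), there is no pit, and moving into a wall leaves the agent in place with reward −1. The names `step`, `sarsa` and `epsilon_greedy`, and the random tie-breaking, are assumptions of this sketch.

```python
import random
from collections import defaultdict

ACTIONS = ["go up", "go down", "go left", "go right"]
MOVES = {"go up": (0, 1), "go down": (0, -1), "go left": (-1, 0), "go right": (1, 0)}
WUMPUS, EXIT, START = (0, 1), (1, 1), (0, 0)

def step(state, action):
    """Hypothetical 2x2 grid dynamics: returns (next_state, reward)."""
    x, y = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if not (0 <= x <= 1 and 0 <= y <= 1):
        return state, -1.0            # bumping into a wall: stay in place
    if (x, y) == WUMPUS:
        return (x, y), -1000.0
    if (x, y) == EXIT:
        return (x, y), 500.0
    return (x, y), -1.0

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def sarsa(episodes, alpha, gamma, epsilon, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                                    # line 1: all Q-values start at 0
    for _ in range(episodes):                                 # line 2
        s = START                                             # line 3
        a = epsilon_greedy(Q, s, epsilon, rng)                # line 4
        while s not in (WUMPUS, EXIT):                        # line 5
            s_next, r = step(s, a)                            # line 6
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)  # line 7
            target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target  # line 8
            s, a = s_next, a_next                             # lines 9 and 10
    return Q

Q = sarsa(episodes=10, alpha=1.0, gamma=1.0, epsilon=0.0)
print(Q[((0, 0), "go right")], Q[((1, 0), "go up")])  # 499.0 500.0
```

Because ties are broken randomly and a′ is chosen before Q(s, a) is updated (as in line 7), the individual trajectories of this sketch need not match the hand-picked choices of the walk-through, although in this small grid the values of the two actions on the shortest path end up the same.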

We illustrate how this algorithm works by considering the Wumpus World in Figure 2.2. In this Wumpus World game, each episode ends when the agent is killed (either by falling into a pit or by being eaten by a Wumpus) or when it reaches the exit, and a new episode immediately starts with the environment reset. The Wumpus is in square (0, 1), the exit is in square (1, 1), and the agent is put in square (0, 0) at the beginning of each episode. An experiment consists of multiple episodes, and before the start of the first episode, without loss of generality, we initialise all Q-values to 0 (line 1 in Algorithm 1). Now we consider the first episode. According to line 3, we first initialise the starting state of the agent as <0, 0, F, T> (the first boolean value indicates whether the agent feels a breeze in the current state, while the second one indicates whether it feels a stench). Then we choose the action to be performed in s by using ε-greedy (line 4). For illustration purposes, we let ε = 0, so that the agent always chooses the action that maximises the Q-value in the current state. Since all actions' Q-values in this state are 0 now, the agent can choose any action according to ε-greedy. Suppose the agent chooses to perform go up; it goes to square (0, 1) and receives a reward of −1000, i.e. s′ = <0, 1, F, T> and r = −1000 (line 6). Then the agent chooses the action to be performed in s′ (line 7). Once again, according to ε-greedy, the agent can select any action because all actions' Q-values in s′ are 0. Suppose the agent also chooses go up in s′. Given the current state s, current action a, reward r, next state s′ and next action a′, we update the Q-value of the state-action pair (s, a) (line 8). For simplicity, we let α = 1. Since Q(s, a) = Q(s′, a′) = 0, we can easily see that the new Q(s, a) value is −1000. We then update the current state s and the current action a (lines 9 and 10), and re-enter the loop between line 5 and line 11. Now s = <0, 1, F, T>, and it is a terminal state because there is a Wumpus in this square. Therefore, the algorithm quits the loop and this episode ends. The Q-values at this point are given in Figure 2.2(b). We can see that each iteration of the loop between line 5 and line 11 is actually an interaction between the RL agent and the environment; thus, each iteration of this loop is a learning step (see Section 1.3). The first episode has only one learning step.
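The single update of the first episode can be reproduced with a small helper. This is a sketch only: the function name `sarsa_update` and its argument names are hypothetical, and γ = 1 is the value this section uses for the walk-through (it does not affect this particular result, because Q(s′, a′) = 0).

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """Line 8 of Algorithm 1: blend the old estimate with the new target."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * q_next)

# First episode: s = <0, 0, F, T>, a = go up, r = -1000, Q(s', a') = 0.
new_q = sarsa_update(q_sa=0.0, reward=-1000.0, q_next=0.0, alpha=1.0, gamma=1.0)
print(new_q)  # -1000.0
```

With α = 1 the old estimate is discarded entirely, which is why a single visit is enough to write the full penalty into Q(s, go up).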

After the first episode finishes, the second episode starts immediately (line 3). The initial state is the same as in the first episode. However, when selecting the action to be performed in s, go up will not be chosen because its Q-value is the lowest among all actions' Q-values in s. Because the Q-values of the other three actions are all 0, the agent can randomly select any of those actions. The current situation is illustrated in Figure 2.3(a). Suppose the agent chooses a = go right (line 4); then it will receive reward r = −1 and go to a new state s′ = <1, 0, F, F> (line 6). Since s′ has not been visited before, all actions' Q-values in s′ are 0 and, therefore, the agent will choose a random action a′ in s′. Suppose it chooses a′ = go up. According to line 8, we can easily obtain that the new Q(s, a) value is −1. The algorithm then updates s and a (lines 9 and 10) and re-enters the loop starting from line 5. Note that, in step 2, a = go up and s = <1, 0, F, F> (Figure 2.3(b)). By performing a in s, the agent receives reward +500 and moves into a new state s′ = <1, 1, F, T>. Once again, since all actions' Q-values in s′ are 0, a random action a′ is chosen: let us suppose it is go up. We can easily obtain that Q(s, a) = +500. Since the agent arrives at the exit, the second episode ends now (Figure 2.3(c)). Thus the second episode has two learning steps.

Figure 2.3: The second episode of the SARSA-based learning. Panels: (a) Step 1; (b) Step 2; (c) At the end of step 2.

Now the agent starts its third episode in this experiment. The initial state s is still <0, 0, F, T>. In state s, the agent will first choose action go left or go down, because the Q-values of these two actions at s are still 0, whereas the Q-values of the other two actions are negative (Figure 2.4(a)). Suppose the agent chooses a = go down; it will receive r = −1, and the new state is s′ = s. It is easy to see that Q(s, go down) is updated to −1. In step 2 (Figure 2.4(b)), the agent chooses go left because it has the highest Q-value in state s; it receives r = −1, remains in the same state (i.e. s′ = s), and updates Q(s, go left) to −1. In step 3 (Figure 2.4(c)), all actions except go up have the same Q-value, and we assume the agent chooses a = go right. A reward r = −1 is received, and the new state is s′ = <1, 0, F, F>. In s′, action go up is selected (i.e. a′ = go up), because its Q-value is +500 while all the other actions' Q-values are 0. So the value of Q(s, a) is updated as follows: Q(s, a) = −1 + 1 × [−1 + 1 × 500 − (−1)] = 499. In step 4 (Figure 2.4(d)), the agent performs go up in state <1, 0, F, F>, which leads the agent to the exit and terminates this episode (Figure 2.4(e)). This episode has four learning steps.

Figure 2.4: Panels: (a) Step 1; (b) Step 2; (c) Step 3; (d) Step 4; (e) At the end of step 4.
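The update in step 3 of the third episode is written in an incremental (TD-error) form, whereas line 8 of Algorithm 1 uses a weighted-average form. The short derivation below, which uses only the symbols already defined in this subsection, shows that the two forms are algebraically identical and then substitutes the values of step 3 (with α = γ = 1).

```latex
\begin{align*}
Q(s,a) &\leftarrow (1-\alpha)\,Q(s,a) + \alpha\bigl(r + \gamma Q(s',a')\bigr) \\
       &= Q(s,a) + \alpha\bigl[\,r + \gamma Q(s',a') - Q(s,a)\,\bigr].
\end{align*}
% Step 3 of the third episode: Q(s,a) = -1, r = -1, Q(s',a') = 500, alpha = gamma = 1
\[
Q(s,a) \leftarrow -1 + 1 \times \bigl[-1 + 1 \times 500 - (-1)\bigr] = 499 .
\]
```

The bracketed term is the difference between the new estimate r + γQ(s′, a′) and the old value, so α directly scales how far the old value moves towards the new estimate.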

In all subsequent episodes, the agent will first perform go right and then perform go up to reach the exit. So all subsequent episodes have two steps, and we can see that this is the best policy in this Wumpus World. Also note that, given our specific setting (α = 1, γ = 1, ε = 0), Q(s, a) for all state-action pairs (s, a) will not change in any of the following episodes. In other words, in this experiment, SARSA converges to the optimal policy after three episodes of learning. This illustration gives an intuitive sense of how SARSA converges.

After illustrating how SARSA learns, we briefly discuss a very important operation in SARSA: the Q-value update, as presented in line 8 of Algorithm 1. Since actions are chosen according to the Q-values, this update plays a key role in SARSA. Given the current state s, current action a, reward r, next state s′ and next action a′, the value of Q(s, a) is updated by using both the existing Q(s, a) value and the new estimate of Q(s, a): r + γQ(s′, a′).

The fact that r + γQ(s′, a′) is an estimate of Q(s, a) can be seen from Equation (2.2): if the transition function P is deterministic, i.e. performing a at s leads to one specific state s′, then Q(s, a) = r + γQ(s′, a′). However, in most real applications P is not deterministic; therefore, on receiving each r, the SARSA algorithm moves the old Q(s, a) value only a small step, controlled by α, towards the new estimate, such that after many rounds of updates the value of Q(s, a) asymptotically approaches its true value. It has been proved [RN94] that when α asymptotically approaches 0 at certain rates, and when the agent uses a GLIE action selection policy, Q(s, a) of any state-action pair (s, a) converges to Q∗(s, a) with probability 1 after a sufficiently long time of learning (see footnote 5).
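To make the role of a decaying learning rate concrete, the sketch below applies the update of line 8 to a stream of noisy targets with the schedule α_t = 1/t. This schedule is an assumption chosen for illustration (the source does not name a specific schedule); with it, the update reduces to a running average of the targets seen so far. The target distribution and the name `running_q` are hypothetical.

```python
import random

def running_q(targets):
    """Apply Q := (1 - a_t) * Q + a_t * target with a_t = 1/t, t = 1, 2, ..."""
    q = 0.0
    for t, target in enumerate(targets, start=1):
        alpha = 1.0 / t
        q = (1 - alpha) * q + alpha * target
    return q

rng = random.Random(0)
# Hypothetical noisy targets r + gamma * Q(s', a'): 499 with probability 0.8, else -2.
targets = [499.0 if rng.random() < 0.8 else -2.0 for _ in range(10_000)]
estimate = running_q(targets)
mean = sum(targets) / len(targets)
print(abs(estimate - mean) < 1e-6)  # True: the estimate equals the sample mean
```

A constant α would instead keep weighting recent targets most heavily, so the estimate would keep fluctuating when P is not deterministic.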

Footnote 5: To be more specific, [RN94] proved that when $\lim_{T\to\infty}\sum_{t=1}^{T}\alpha_t = \infty$, $\lim_{T\to\infty}\sum_{t=1}^{T}\alpha_t^2 <$

Also, as discussed in the first paragraph of this subsection, SARSA's quick convergence is partly due to its TD-based updating, which allows SARSA to 'back-propagate' delayed information more quickly. We now describe what information back-propagation is and why SARSA's TD nature helps to accelerate it. Let us revisit step 3 of the third episode in the illustration above (Figure 2.4(c)). In this step, s = <0, 0, F, T>, a = go right, r = −1, s′ = <1, 0, F, F> and a′ = go up. We know that performing a′ in s′ yields a big positive reward of +500, and this is an important piece of information that we want to 'propagate', because our goal is to maximise the long-term reward. Because in TD-based Q-value updating (line 8 in Algorithm 1) Q(s, a) is updated by using Q(s′, a′), the information contained in Q(s′, a′) is propagated backward to Q(s, a). The updated Q(s, a) value is 499, as shown in Figure 2.4(d), and we can see that the reward of reaching the exit (+500) has been successfully propagated to Q(s, a). In each learning step, the reward propagates one state back. Intuitively, a more effective back-propagation should be able to propagate a reward to all states on the trajectory that leads to the current state. This can be achieved by the Eligibility Traces technique [SS96], which is introduced in Section 2.2.4.
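The one-state-per-step propagation described above can be observed on a toy problem. The sketch below uses a hypothetical corridor with a single "forward" action, so action selection is trivial and only the update of line 8 remains; the step reward is set to 0 and the goal reward to 500 so that the propagated value is easy to read. The corridor, the function name `run_episode`, and these reward values are assumptions of this sketch, with α = γ = 1 as in the walk-through.

```python
def run_episode(q, goal_reward=500.0, alpha=1.0, gamma=1.0):
    """One pass through a hypothetical corridor with a single 'forward' action.

    q[i] is Q(state i, forward). The state after the last index is terminal,
    so its Q-value is taken to be 0.
    """
    n = len(q)
    for s in range(n):
        reward = goal_reward if s == n - 1 else 0.0
        q_next = q[s + 1] if s + 1 < n else 0.0   # Q(s', a') before it is updated
        q[s] = (1 - alpha) * q[s] + alpha * (reward + gamma * q_next)
    return q

q = [0.0] * 4
for episode in range(1, 5):
    run_episode(q)
    print(episode, q)
# 1 [0.0, 0.0, 0.0, 500.0]
# 2 [0.0, 0.0, 500.0, 500.0]
# 3 [0.0, 500.0, 500.0, 500.0]
# 4 [500.0, 500.0, 500.0, 500.0]
```

In this corridor the goal reward needs as many episodes as there are states to reach the start state, because each visit to a state can only copy information from its immediate successor.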

Although RL algorithms like SARSA are widely used and have proved effective in some application domains, they by no means completely circumvent the curse of dimensionality. To further combat this curse, more advanced RL algorithms have been developed, including the Hierarchical RL algorithms introduced next.