

2.2 Reinforcement Learning

2.2.3 Reinforcement Learning Solution Approaches

The key idea of reinforcement learning may be summarised as the use of value functions in order to structure and organise the search for high-quality policies [154]. This section is devoted to a review of a variety of basic reinforcement learning algorithms which may be implemented in order to find optimal policies.

Policy iteration

In the case where the environment's dynamics are known, the Bellman equation in (2.13) results in a system of $|S|$ linear equations in $|S|$ unknowns, namely the values $V^{\pi}(s)$ for all $s \in S$. Computing $V^{\pi}(s)$ directly from this system of equations is, however, often impractical, especially for problems which have large state spaces. As a result, Sutton and Barto [154] proposed estimating value functions by means of iterative methods. The value $V_{k+1}^{\pi}(s)$, which represents the estimate of $V^{\pi}(s)$ at the $(k+1)$th iteration, is given by

$$V_{k+1}^{\pi}(s) = E_{\pi}\{r_{t+1} + \gamma V_{k}^{\pi}(s_{t+1}) \mid s_t = s\} = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V_{k}^{\pi}(s')\right], \tag{2.17}$$

where the initial estimate $V_{0}^{\pi}$ is chosen arbitrarily. It has been shown that $V_{k}^{\pi}$ converges to $V^{\pi}$ as $k \to \infty$ under the condition that either $\gamma < 1$ or the events are episodic [22]. This method of estimating value functions through the repeated application of (2.17) until convergence is achieved is called policy evaluation.
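The repeated application of (2.17) may be illustrated by a minimal Python sketch. The encoding of the dynamics as a dictionary `P[s][a]` of `(probability, next state, reward)` triples, the function name `policy_evaluation` and the two-state example problem are assumptions made purely for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, delta=1e-8):
    """Apply the backup in (2.17) until the largest change falls below delta."""
    V = {s: 0.0 for s in P}  # arbitrary initial estimate V_0
    while True:
        # V_{k+1}(s) = sum_a pi(s,a) sum_s' P[R + gamma V_k(s')]
        V_next = {
            s: sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        change = max(abs(V_next[s] - V[s]) for s in P)
        V = V_next
        if change < delta:
            return V


# Hypothetical two-state problem: P[s][a] = [(probability, next state, reward)].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
uniform = {s: {a: 0.5 for a in P[s]} for s in P}  # pi(s, a) = 0.5 for both actions
V = policy_evaluation(P, uniform)
print(V)  # approximately {0: 7.25, 1: 7.75}
```

The sketch keeps $V_k$ and $V_{k+1}$ in separate dictionaries so that it mirrors (2.17) literally; updating the values in place during a sweep, as in Algorithm 2.1, converges to the same fixed point.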

If the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$ are known for all $s \in S$ and $a \in A$, one may easily determine an improved policy by simply choosing at each state the action which appears to be best according to $Q^{\pi}(s,a)$. The new, greedy policy $\pi'$ is given by

$$\pi'(s) = \arg\max_{a} Q^{\pi}(s,a) = \arg\max_{a} \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right].$$

This process of greedily creating a policy that improves the existing policy with respect to the value function is called policy improvement [154]. Note that the policy π(s) denotes the mapping from state s∈ S to the action a ∈ A(s) the agent chooses according to the current policy. This convention is employed throughout the remainder of this dissertation.
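The greedy improvement step may be sketched in Python as follows. The dictionary encoding `P[s][a]` of `(probability, next state, reward)` triples, the function name `greedy_policy` and the numerical values of `V` are hypothetical and serve only as an example.

```python
def greedy_policy(P, V, gamma=0.9):
    """Return the policy that is greedy with respect to the value function V."""

    def q(s, a):
        # One-step look-ahead: sum_s' P[R + gamma V(s')]
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}


# Hypothetical two-state problem: P[s][a] = [(probability, next state, reward)].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
V = {0: 7.25, 1: 7.75}  # assumed values of some existing policy
improved = greedy_policy(P, V)
print(improved)  # {0: 'go', 1: 'stay'}
```

Ties between actions are broken here in favour of the action listed first, which is one of several reasonable conventions.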

Once a policy $\pi$ has been improved based on $V^{\pi}$ in order to find a better policy $\pi'$, the value function $V^{\pi'}$ may be computed and used in turn to find an even better policy $\pi''$. As a result, a sequence of monotonically improving policies and value functions

$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$$

may be found, where $\xrightarrow{E}$ and $\xrightarrow{I}$ denote policy evaluation and policy improvement, respectively. A pseudo-code description of this algorithm, called policy iteration, is given in Algorithm 2.1.
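The alternation between evaluation and improvement may be sketched in Python as shown below. This is an illustrative sketch only: the dictionary encoding `P[s][a]` of `(probability, next state, reward)` triples, the helper `backup` and the two-state example problem are assumptions, and a structured loop is used in place of a jump back to the first line.

```python
def backup(P, V, s, a, gamma):
    """One-step look-ahead value of taking action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_iteration(P, gamma=0.9, delta=1e-8):
    V = {s: 0.0 for s in P}                    # arbitrary initial values
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        # Policy evaluation: sweep until the values stop changing.
        while True:
            change = 0.0
            for s in P:
                v_old = V[s]
                V[s] = backup(P, V, s, policy[s], gamma)
                change = max(change, abs(v_old - V[s]))
            if change < delta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in P:
            old_action = policy[s]
            policy[s] = max(P[s], key=lambda a: backup(P, V, s, a, gamma))
            if old_action != policy[s]:
                stable = False
        if stable:
            return policy, V


# Hypothetical two-state problem: P[s][a] = [(probability, next state, reward)].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy, V = policy_iteration(P)
print(policy)  # {0: 'go', 1: 'stay'}
```

For this example the values converge to approximately 19 and 20 for states 0 and 1, respectively, after a single change of policy.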

Algorithm 2.1: The policy iteration algorithm [154].

Input : An arbitrary initial value V(s) ∈ ℝ and policy π(s) ∈ A(s) for all s ∈ S.
Output: An optimal policy π*(s).

 1  Policy evaluation;
 2  Δ ← ∞;
 3  while Δ > δ (a small positive number) do
 4      Δ ← 0;
 5      for each s ∈ S do
 6          v ← V(s);
 7          V(s) ← Σ_{s'} P^{π(s)}_{ss'} [R^{π(s)}_{ss'} + γV(s')];
 8          Δ ← max(Δ, |v − V(s)|);
 9  Policy improvement;
10  policy stable ← True;
11  for each s ∈ S do
12      b ← π(s);
13      π(s) ← arg max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γV(s')];
14      if b ≠ π(s) then
15          policy stable ← False;
16  if policy stable = False then
17      go to line 1;
18  else
19      return π(s);

Value iteration

One drawback of policy iteration, pointed out by Sutton and Barto [154], is that each iteration requires policy evaluation, which may itself be a protracted iterative computation, often requiring multiple sweeps through the state set. The policy evaluation step may, however, be truncated without losing the convergence guarantee of policy iteration. In value iteration, policy evaluation does not continue until convergence, but is terminated after each state has been evaluated once, whereafter policy improvement is carried out immediately [154]. Thus, value iteration combines the policy improvement and truncated policy evaluation steps such that the estimated value is given by

$$V_{k+1}(s) = \max_{a} E\{r_{t+1} + \gamma V_{k}(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a} \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V_{k}(s')\right].$$

A pseudo-code description of the value iteration algorithm is given in Algorithm 2.2.

Algorithm 2.2: The value iteration algorithm [154].

Input : An arbitrary initial value V(s) ∈ ℝ for all s ∈ S.
Output: An optimal policy π*(s).

1  Δ ← ∞;
2  while Δ > δ (a small positive number) do
3      Δ ← 0;
4      for each s ∈ S do
5          v ← V(s);
6          V(s) ← max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γV(s')];
7          Δ ← max(Δ, |v − V(s)|);
8  return π(s) ← arg max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γV(s')];
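Value iteration may be sketched in Python along the same lines. The dictionary encoding `P[s][a]` of `(probability, next state, reward)` triples and the two-state example problem are hypothetical and are introduced only to make the sketch self-contained.

```python
def value_iteration(P, gamma=0.9, delta=1e-8):
    V = {s: 0.0 for s in P}  # arbitrary initial values

    def backup(s, a):
        # One-step look-ahead: sum_s' P[R + gamma V(s')]
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        change = 0.0
        for s in P:
            v_old = V[s]
            V[s] = max(backup(s, a) for a in P[s])  # evaluation and improvement in one step
            change = max(change, abs(v_old - V[s]))
        if change < delta:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(s, a)) for s in P}
    return policy, V


# Hypothetical two-state problem: P[s][a] = [(probability, next state, reward)].
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy, V = value_iteration(P)
print(policy)  # {0: 'go', 1: 'stay'}
```

No explicit policy is stored during the sweeps; the policy is only recovered from the value function once the sweeps have terminated.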

Q-learning

Q-learning is another value iteration-based reinforcement learning algorithm, first proposed by Watkins [170]. Unlike in value iteration, however, the goal in Q-learning is to estimate the optimal action-value function $Q^{*}(s,a)$ directly. This is achieved by comparing the current action-value estimate $Q(s_t,a_t)$ with a new estimate calculated from the reward $r_t$ received and the maximum action value of the next state, $\max_{a} Q(s_{t+1},a)$. The update rule for the action values is given by

$$Q_{k+1}(s_t,a_t) = Q_{k}(s_t,a_t) + \alpha\left[r_t + \gamma \max_{a} Q_{k}(s_{t+1},a) - Q_{k}(s_t,a_t)\right], \tag{2.19}$$

where $\gamma$ represents the discount factor as defined above, and $\alpha$ represents the learning rate, a small positive real number which determines how strongly the new estimate influences the stored value. For example, if the learning rate is 1, the old value is replaced entirely by the new estimate. Due to the stochastic nature of MDPs, however, it is necessary to average the values obtained over multiple time steps. As a result, the learning rate is employed to update the old values only partially [130]. The final policy may then be extracted greedily from the final approximation of the state-action values once the algorithm has terminated. A pseudo-code description of the Q-learning algorithm is given in Algorithm 2.3.

Watkins and Dayan [169] have shown that Q-learning converges to the optimal action-value function Q∗(s, a) as long as all state-action pairs are visited and updated infinitely many times,

Algorithm 2.3: The Q-learning algorithm [170].

Input : An arbitrary initial value Q(s, a) for all s ∈ S, a ∈ A(s).
Output: A near-optimal policy π*(s).

1  for all episodes do
2      Initialise s_t;
3      repeat for each step of each episode
4          Choose a_t from s_t using some predefined policy derived from Q;
5          Take action a_t, observe the reward r_t and the next state s_{t+1};
6          Update Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)];
7          s_t ← s_{t+1};
8      until s_t is terminal;
9  return π(s) = arg max_a Q(s, a);
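A tabular Q-learning loop based on the update rule (2.19) may be sketched in Python as follows. The ε-greedy behaviour policy, the parameter values and the five-state corridor environment `corridor_step` are assumptions chosen for illustration; the source only requires that actions be chosen by some policy derived from Q.

```python
import random


def q_learning(step, states, actions, start_states, episodes=2000,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}  # arbitrary initial values
    for _ in range(episodes):
        s = rng.choice(start_states)
        done = False
        while not done:
            # Assumed behaviour policy: epsilon-greedy with respect to Q.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = step(s, a)
            # Update (2.19); terminal states keep Q = 0, so the target reduces to r there.
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


def corridor_step(s, a):
    """Hypothetical corridor: states 0..4, both ends terminal, reward 1 at state 4."""
    s_next = s + a
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward, s_next in (0, 4)


states, actions = [0, 1, 2, 3, 4], [-1, 1]
Q = q_learning(corridor_step, states, actions, start_states=[1, 2, 3])
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in [1, 2, 3]}
print(policy)
```

In this deterministic corridor the greedy policy extracted from Q moves right (action `1`) in every non-terminal state, since only the right-hand end yields a reward.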

SARSA

The state-action-reward-state-action (SARSA) reinforcement learning algorithm is another notable algorithm derived directly from the Bellman equation (2.13) [154]. The algorithm's name is derived from the sequence of events that takes place during the Q-value updating process. The SARSA algorithm functions similarly to the Q-learning algorithm. Unlike Q-learning, however, SARSA is a so-called on-policy algorithm. The effect of this is that, when updating $Q(s_t,a_t)$, the next action $a_{t+1}$ is chosen according to the current policy instead of taking the maximum Q-value over all actions [130]. The update rule for SARSA is thus given by

$$Q_{k+1}(s_t,a_t) = Q_{k}(s_t,a_t) + \alpha\left[r_t + \gamma Q_{k}(s_{t+1},a_{t+1}) - Q_{k}(s_t,a_t)\right]. \tag{2.20}$$

The result is that, as is typical in on-policy methods, SARSA continually estimates $Q^{\pi}$ for the current policy $\pi$, while simultaneously adapting $\pi$ over time towards the optimal policy $\pi^{*}$ [154]. A pseudo-code description of the SARSA algorithm is provided in Algorithm 2.4.
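The on-policy nature of the update rule (2.20) may be illustrated by the following Python sketch, in which the action used in the update target is the action that is actually taken next. The ε-greedy policy, the parameter values and the corridor environment `corridor_step` are assumptions made for the sake of the example.

```python
import random


def sarsa(step, states, actions, start_states, episodes=5000,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}  # arbitrary initial values

    def choose(s):
        # Assumed current policy: epsilon-greedy with respect to Q.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    for _ in range(episodes):
        s = rng.choice(start_states)
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = choose(s_next)  # chosen by the current policy, not by a max
            # Update (2.20); terminal states keep Q = 0.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


def corridor_step(s, a):
    """Hypothetical corridor: states 0..4, both ends terminal, reward 1 at state 4."""
    s_next = s + a
    reward = 1.0 if s_next == 4 else 0.0
    return s_next, reward, s_next in (0, 4)


states, actions = [0, 1, 2, 3, 4], [-1, 1]
Q = sarsa(corridor_step, states, actions, start_states=[1, 2, 3])
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in [1, 2, 3]}
print(policy)
```

Because exploratory actions enter the update target, the learnt values describe the ε-greedy policy itself and are therefore somewhat lower than the optimal action values, although the greedy policy extracted from them still moves right in this example.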

R-Markov Average Reward Technique

The R-Markov Average Reward Technique (RMART) is, like Q-learning, an off-policy learning algorithm. In the RMART algorithm, however, the value function is defined not with respect to the discounted accumulated reward, but rather with respect to the average expected reward per time step,

$$\rho^{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E(r_t), \tag{2.21}$$

Algorithm 2.4: The SARSA reinforcement learning algorithm [154].

Input : An arbitrary initial value Q(s, a) for all s ∈ S, a ∈ A(s).
Output: A near-optimal policy π*(s).

 1  for all episodes do
 2      Initialise s_t;
 3      Choose a_t from s_t using some predefined policy derived from Q;
 4      repeat for each step of each episode
 5          Take action a_t, observe the reward r_t and the next state s_{t+1};
 6          Choose a_{t+1} from s_{t+1} using the same policy derived from Q;
 7          Update Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)];
 8          s_t ← s_{t+1}; a_t ← a_{t+1};
 9      until s_t is terminal;
10  return π(s) = arg max_a Q(s, a);

where the process is assumed to be ergodic and, as a result, $\rho^{\pi}$ does not depend on a specific starting state [154]. From any state, the long-term average reward is the same, but there is a transient reward: from some states, better-than-average rewards may be received for a while, while other states may yield lower-than-average rewards. It is this transient reward which defines the value of a state as

$$\bar{V}^{\pi}(s) = \sum_{k=1}^{\infty} E_{\pi}\{r_{t+k} - \rho^{\pi} \mid s_t = s\}. \tag{2.22}$$

Similarly, the action value of a state-action pair may then be defined as

$$\bar{Q}^{\pi}(s,a) = \sum_{k=1}^{\infty} E_{\pi}\{r_{t+k} - \rho^{\pi} \mid s_t = s, a_t = a\}. \tag{2.23}$$

These are called relative values, since they are computed relative to the average reward achievable under the current policy [179]. Unlike in Q-learning, however, two policies are maintained in the RMART algorithm, a so-called behaviour policy and an estimation policy, based on the action-value function and an estimated average reward, respectively. A pseudo-code description of the RMART algorithm is given in Algorithm 2.5.
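A single learning step for relative action values may be sketched in Python as shown below. The sketch assumes that RMART follows the R-learning form of the update commonly stated in the literature, in which the average-reward estimate is adjusted only after greedy actions; Algorithm 2.5 may differ in detail. The function name `r_learning_step` and the step sizes `alpha` and `beta` are hypothetical.

```python
def r_learning_step(Q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """Update the relative action value Q[(s, a)] and return the new average-reward estimate."""
    best_next = max(Q[(s_next, b)] for b in actions)
    # The reward is measured relative to the estimated average reward rho.
    Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
    best_here = max(Q[(s, b)] for b in actions)
    # Assumed rule: rho is only adjusted when the action taken is greedy in s.
    if Q[(s, a)] == best_here:
        rho += beta * (r - rho + best_next - best_here)
    return rho


actions = [0, 1]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
rho = r_learning_step(Q, 0.0, s=0, a=0, r=1.0, s_next=1, actions=actions,
                      alpha=0.5, beta=0.5)
print(Q[(0, 0)], rho)  # 0.5 0.25
```

No discount factor appears in this update, since the relative values in (2.22) and (2.23) are sums of deviations from the average reward rather than discounted sums.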