• No results found

Temporal Difference learning: Q-Learning algorithm

3 Reinforcement Learning methods

3.5 Temporal Difference learning: Q-Learning algorithm

Among the various Temporal-Difference learning techniques that can be used to control MDPs, of interest of this work is in particular the Q-Learning algorithm, implemented by the simulator and applied to the analysed network model. An general explanation of the Temporal-Difference methods main features is presented in Section 3.5.1, while a focus on Q-Learning algorithm is done in section 3.5.2.

33

3.5.1 Temporal Difference learning methods overview

Temporal-Difference (TD) methods for estimating the optimal value functions using a model-free approach are generally based on the so-called bootstrapping operation, which is the capability of the agent to use the estimate of the state values of other future states in order to update the estimate of the current state. More specifically, these methods are said to perform a sample backup, in the sense that the new estimate of the value function is obtained from a sample of the possible options for the next (rather than performing a full backup, relying on the estimates of all the possible successor states, like happens in Dynamic Programming).

TD methods can be classified into two main groups: on-policy and off-policy policy evaluation methods. On-policy means that the policy used for control of the MDP (and so to make decisions) is the same which is being improved and evaluated. In other words, the estimate of the value functions for a policy is done once is given an infinite number of episodes generated using that same policy. In on-policy control methods the policy is said

soft, which means that there is always an element of exploration to the policy (without being

strict to always choose the action that gives the most reward). In this case the non- deterministic policy has to be

๐œ‹(๐‘Ž|๐‘ ) = ๐‘ƒ(๐‘Ž๐‘ก = ๐‘Ž|๐‘ ๐‘ก = ๐‘ ) > 0, โˆ€๐‘  โˆˆ ๐’ฎ, โˆ€๐‘Ž โˆˆ ๐’œ(๐‘ ). (3.22) Off-policy means that the policy used for control, can have no correlation with the policy being evaluated and improved. In this case, the infinite set of episodes needed to estimate the value functions ๐‘ฃ๐œ‹ and ๐‘ž๐œ‹ is generated by a policy ๐œ‡ โ‰  ๐œ‹, where ๐œ‹ is the target (or

estimation) policy and ๐œ‡ is the behaviour policy (which controls the agent generating behaviour). In order to use episodes that are generated using policy ๐œ‡, to estimate values for policy ๐œ‹, itโ€™s required that every action taken under ๐œ‹ is also taken (at least occasionally) under policy ๐œ‡. This means that ๐œ‹(๐‘Ž|๐‘ ) > 0 implies that ๐œ‡(๐‘Ž|๐‘ ) > 0, where ๐œ‹ may be deterministic while ๐œ‡ is stochastic. In conclusion, off-policy algorithms can separate exploration from control differently from on-policy ones: an agent trained using an off- policy method may end up learning behaviours that it did not necessarily exhibit during the learning phase.

34 3.5.2 Q-Learning algorithm

Q-Learning algorithm was introduced by Christopher J. Watkins in 1989 [15] and is

classified as a Temporal-Difference off-policy learning algorithm, widely used in RL framework for finding approximate solutions to Bellman optimality equations and then optimal policies. In particular, this algorithm learns optimal Q-values ๐‘žโˆ—(๐‘Ž, ๐‘ ) (rather than state-values ๐‘ฃโˆ—(๐‘ )) and simultaneously determines an optimal policy for the MDP.

The procedural form of the algorithm is shown in Algorithm 3.1, while the backup diagram is shown in Figure 14:

Q-Learning algorithm

Initialize arbitrarily ๐‘ž(๐‘ , ๐‘Ž), โˆ€๐‘  โˆˆ ๐’ฎ, โˆ€๐‘Ž โˆˆ ๐’œ(๐‘ ) and ๐‘ž( ๐‘ก๐‘’๐‘Ÿ๐‘š๐‘–๐‘›๐‘Ž๐‘™ โˆ’ ๐‘ ๐‘ก๐‘Ž๐‘ก๐‘’, โˆ™ ) = 0 Repeat (each episode):

Initialize ๐‘ ๐‘ก

Repeat (each step of episode):

Choose ๐‘Ž๐‘ก from ๐‘ ๐‘ก using policy derived from ๐‘ž (e.g. ๐œ– โˆ’ ๐‘”๐‘Ÿ๐‘’๐‘’๐‘‘๐‘ฆ)

Take action ๐‘Ž๐‘ก, observe ๐‘Ÿ๐‘ก+1 and ๐‘ ๐‘ก+1= ๐‘ โ€ฒ ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก) โ† ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก) + ๐›ผ[๐‘Ÿ๐‘ก+1+ ๐›พ max๐‘Ž

๐‘ก+1

๐‘ž(๐‘ ๐‘ก+1, ๐‘Ž๐‘ก+1) โˆ’ ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก)]

๐‘ ๐‘กโ† ๐‘ ๐‘ก+1 = ๐‘ โ€ฒ

Until (๐‘ ๐‘ก is terminal)

The update rule of Q-values is

๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก) โ† ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก) + ๐›ผ[๐‘Ÿ๐‘ก+1+ ๐›พ max

๐‘Ž๐‘ก+1

๐‘ž(๐‘ ๐‘ก+1, ๐‘Ž๐‘ก+1) โˆ’ ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก)], (3.23)

where ๐›ผ is called step-size parameter or learning rate and must decay over time such that โˆ‘โˆž๐‘ก=0๐‘Ž๐‘ก = โˆž and โˆ‘โˆž ๐‘Ž๐‘ก2 < โˆž

๐‘ก=0 [15]. This formula is based on the following concept: the

quantity ๐‘Ÿ๐‘ก+1+ ๐›พ max

๐‘Ž๐‘ก+1

35

immediate reward (๐‘Ÿ๐‘ก+1), which includes some knowledge from the environment into the estimate itself. In fact, all the term weighed by ๐›ผ represents an estimate of the error between the current Q-value function and the real Q-value function, for the given policy.

In a more broad sense, this algorithm allows the agent to acquire optimal control strategies from delayed rewards, even when it has no prior knowledge of the effects of its actions on the environment.

Since Q-Learning is an off-policy TD method, while the improved policy is greedy and deterministic, the actions chosen for control are based on some other policy (behaviour policy), which could be unrelated with the first one (e.g. an ๐œ–-greedy policy, with a uniform distribution over the action space).

Figure 14 Q-Learning backup diagram: the top node (the root of the backup) is a state-action pair; then,

once in the next state ๐’”โ€ฒ, a maximization over all actions possible in the next state is done; the end nodes of the backup are all these next actions.

The learned action - value function ๐‘ž(๐‘ ๐‘ก, ๐‘Ž๐‘ก) directly approximates the optimal action - value

function ๐‘žโˆ— independently of the followed policy. In fact, under the assumption of exploring

starts, according to which all the state-action pairs must be visited an infinite number of

times (in the limit of an infinite number of episodes) at the condition of specifying that the first step of each episode starts at a state-action pair and that every such pair has a nonzero probability of being selected as the start (and, in addition, under some stochastic approximation conditions) has been proven that ๐‘ž converges to ๐‘žโˆ— with probability 1.

36