2.2 Reinforcement Learning
2.2.2 The Reinforcement Learning Problem
This section is devoted to a generic formulation of the reinforcement learning problem. Key elements of its mathematical structure are also introduced.
The Agent-Environment Interface
As stated in §2.2, the reinforcement learning problem is a framing of the problem of learning from interaction in order to achieve a goal. The learner and decision maker is called the agent, while everything outside the agent with which it interacts is called the environment. The actions chosen by the agent cause the state of the system to change and give rise to rewards. The agent-environment interaction is illustrated graphically in Figure 2.1.
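[Diagram: the agent receives state $s_t$ and reward $r_t$ from the environment and responds with action $a_t$; the environment then produces reward $r_{t+1}$ and state $s_{t+1}$.]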
Figure 2.1: The agent-environment interaction in reinforcement learning, adapted from [154].
As may be seen in the figure, the agent and the environment interact at a sequence of discrete time steps $t = 0, 1, 2, \ldots$ At each time step $t$, the agent receives a representation of the environment's state, $s_t \in S$, where $S$ represents the set of all possible states. Based on the current state, the agent then chooses an action $a_t \in A(s_t)$, where $A(s_t)$ represents the set of all possible actions available to the agent when the environment is in state $s_t$. One time step later, the agent receives a numerical reward $r_{t+1} \in R$, where $R$ represents the set of all possible rewards, and the environment finds itself in a new state, $s_{t+1}$. At each time step, the agent implements a mapping from state-action pairs to the unit interval $[0, 1]$ of real numbers, representing the probability of selecting each possible action in each state. This mapping is called the agent's policy and is denoted by $\pi(s_t, a_t)$. Reinforcement learning methods specify how the agent may change its policy as a result of learning experience. The agent's goal is to maximise the total reward gained in the long run.
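The interaction loop described above may be sketched in a few lines of Python. This is only an illustrative sketch: the two-state environment, the action names and the uniform random policy are hypothetical and do not come from the text.

```python
import random


class TwoStateEnvironment:
    """Hypothetical toy environment with states 0 and 1 (illustration only)."""

    def __init__(self):
        self.state = 0

    def actions(self, state):
        # A(s): the set of actions available in state s (the same in both states here).
        return ["stay", "switch"]

    def step(self, action):
        # The environment, not the agent, determines the next state and the reward.
        if action == "switch":
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward


def uniform_policy(state, actions):
    # pi(s, a): probability of selecting each action a in state s.
    return {a: 1.0 / len(actions) for a in actions}


def run(env, policy, num_steps, seed=0):
    """Run the agent-environment loop and return the total reward collected."""
    rng = random.Random(seed)
    state = env.state
    total_reward = 0.0
    for _ in range(num_steps):
        actions = env.actions(state)
        probs = policy(state, actions)
        # Sample a_t according to the policy probabilities pi(s_t, .).
        action = rng.choices(actions, weights=[probs[a] for a in actions])[0]
        # The environment responds with s_{t+1} and r_{t+1}.
        state, reward = env.step(action)
        total_reward += reward
    return total_reward


total = run(TwoStateEnvironment(), uniform_policy, num_steps=100)
```

A learning method would additionally adjust the policy between steps; here the policy is fixed so that only the interface itself is shown.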
Backup diagrams, as depicted in Figure 2.2, are often used to illustrate the relationships which form the basis of the update operations at the heart of reinforcement learning methods. In these diagrams, each open circle represents a state, and each solid circle represents a state-action pair. In Figure 2.2 (a), for example, an agent finds itself in state $s \in S$ and can take one of three possible actions $a \in A(s)$, each of which may then lead to one of several next states $s' \in S$, along with a corresponding reward $r \in R$. The state nodes in backup diagrams do not necessarily all represent distinct states, as a state may be its own successor.
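[Diagram: (a) backup from state $s$ via action $a$ and reward $r$ to next state $s'$; (b) backup from state-action pair $(s, a)$ via reward $r$ and next state $s'$ to next action $a'$.]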
Figure 2.2: Backup diagrams for a specific state s in (a) and a specific state-action pair (s, a) in (b), adapted from [154].
Goals, Rewards and Returns
In reinforcement learning, the purpose or goal of an agent is formalised in terms of a special reward signal passed from the environment to the agent. Typically, this reward $r_t \in R$ is simply a real number. The use of a reward signal to formalise the notion of a goal is one of the key features of reinforcement learning. The agent always attempts to maximise its reward and, as a result, the reward should communicate to the agent what has to be achieved, not how to achieve it [22]. Take a robot playing chess as an example. A reward should only be obtained for actually winning a game, not for gaining control of an area of the board, for example, or for taking its opponent's pieces, as these may not necessarily lead to a win. Furthermore, it is important that the reward be calculated in the environment, and not by the agent, so that the agent cannot simply award itself reward and has only imperfect control over attaining its goal.
If the sequence of rewards received after some time step $t$ is denoted by $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$, then generally the aim is to maximise the expected return, where the return, denoted by $R_t$, is defined as some function of the reward sequence [22]. In the simplest case, the return may simply be the sum of the rewards,
$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T, \tag{2.4}$$
where $T$ represents the final time step. This approach makes sense as long as the agent-environment interaction can naturally be partitioned into subsequences, called episodes, such as plays of a game. Critically, each episode must end in a terminal state, which may be followed by a reset to some standard starting state, or to a state drawn from a standard distribution of starting states. Tasks that may be partitioned into such episodes are called episodic tasks [154]. In episodic tasks, it should be possible to distinguish between the set of all non-terminal states, denoted by $S$, and the set of all states including the terminal states, denoted by $S^+$.
In many cases, however, tasks cannot be partitioned into identifiable episodes, but evolve continually. Such tasks are called continuing tasks [154]. For these tasks, the return formula (2.4) is problematic since neither the terminal time nor the accumulated return need be bounded. As a result, Sutton and Barto [154] suggested the concept of discounting. When adopting this approach, the agent attempts to maximise the expected discounted return
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \tag{2.5}$$
where $\gamma \in [0, 1]$ is a parameter called the discount rate. The discount rate determines the value of future rewards: a reward received $k$ time steps in the future is only worth $\gamma^{k-1}$ times the value it would be worth if it were to be received immediately. As long as $\gamma < 1$ and the reward sequence $\{r_k\}_{k=1,2,3,\ldots}$ is bounded, the sum in (2.5) has a finite value. If $\gamma = 0$, the agent is said to be myopic in the sense of being concerned only with maximising the immediate rewards achieved. As $\gamma$ approaches 1, however, future rewards gain more and more importance, and as a result, the agent becomes more far-sighted.
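The effect of the discount rate may be illustrated with a short Python sketch that computes the discounted return of a finite reward sequence. The function name and the reward values are hypothetical; the list entry `rewards[k]` plays the role of $r_{t+k+1}$.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        # Each earlier reward adds itself plus the discounted tail of the sequence.
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]                        # made-up reward sequence
myopic = discounted_return(rewards, 0.0)         # only the immediate reward counts
half = discounted_return(rewards, 0.5)           # 1 + 0.5 + 0.25
undiscounted = discounted_return(rewards, 1.0)   # plain sum of the rewards
# With gamma < 1 and bounded rewards the return stays bounded: for a constant
# reward of 1 it approaches 1 / (1 - gamma) as the sequence grows longer.
long_run = discounted_return([1.0] * 1000, 0.9)
```

Accumulating from the last reward backwards avoids computing explicit powers of $\gamma$.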
The return formulations in (2.4) and (2.5) may be combined into a single formula that applies to both episodic and continuing tasks. The return may in this case be written as
$$R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}, \tag{2.6}$$
which includes the possibilities that $T = \infty$ or $\gamma = 1$, but not both.
The Markov Property
As mentioned, the agent's decisions are made as a function of a signal received from the environment, known as a state. This state is usually determined by some preprocessing system, which forms part of the environment. Ideally, this state signal should summarise past sensations compactly, yet retain all the relevant information [154]. A state signal that succeeds in retaining all the relevant information is said to possess the Markov property. Take a game of chess as an example again: the current configuration of all the pieces on the board may be considered a Markov state, since it summarises everything important about the complete sequence of positions that led to it. Much of the information about the exact sequence of moves is lost, but everything important going forward is retained. In the same way, the current position and velocity of a cannonball may be considered a Markov state, since this contains all the information necessary to trace the future trajectory of the object. For the purpose of tracing out the future trajectory, it is not necessary to know how the cannonball achieved its current position and velocity.
Under the assumption that only a finite number of states and reward values exist, the Markov property of the reinforcement learning problem may be formalised as follows. Consider the response of a general environment at time t + 1 corresponding to an action taken at time t. In the most general case, this response may depend on everything that has happened, leading up to the current situation. In this case, the dynamics may be defined only by specifying the complete probability distribution
$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_1, a_1, r_1, s_0, a_0), \tag{2.7}$$
for all $s' \in S$, $r \in R$, and all possible values of the past events: $s_t \in S$, $a_t \in A(s_t)$, $r_t \in R$, $\ldots$, $s_1 \in S$, $a_1 \in A(s_1)$, $r_1 \in R$, $s_0 \in S$, $a_0 \in A(s_0)$. If, however, the state signal exhibits the Markov property, then the environment's response at time $t + 1$ depends only on the state and action representations at time $t$, in which case (2.7) reduces to
$$\Pr(s_{t+1} = s', r_{t+1} = r \mid s_t, a_t), \tag{2.8}$$
for all $s' \in S$, $r \in R$, $s_t \in S$ and $a_t \in A(s_t)$. In other words, a state signal exhibits the Markov property if and only if (2.8) is equal to (2.7) for all $s' \in S$, $r \in R$, and all histories $s_t \in S$, $a_t \in A(s_t)$, $r_t \in R$, $\ldots$, $s_1 \in S$, $a_1 \in A(s_1)$, $r_1 \in R$, $s_0 \in S$, $a_0 \in A(s_0)$.
As a result, if an environment exhibits the Markov property, then the one-step dynamics given in (2.8) allow for the prediction of the next state and associated reward, given only the current state and action. It follows that by iterating the expression in (2.8) one may predict all future states and rewards just as well as would be possible if the entire history up to the current time were known. This implies that the Markov states provide the best basis for choosing actions, which allows the action policy to be formulated as a function of the Markov states.
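The idea of iterating the one-step dynamics may be sketched as follows. The sketch assumes, purely for illustration, that the agent's action choice has already been folded into a single state-to-state transition table `P`; the table entries are made up.

```python
def propagate(dist, P, steps):
    """Push a distribution over states forward `steps` times through the
    one-step dynamics P, where P[s][s2] = Pr(next state is s2 | current state is s)."""
    n = len(P)
    for _ in range(steps):
        # One application of the one-step dynamics; no earlier history is needed.
        dist = [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return dist


# Hypothetical two-state dynamics (each row sums to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Start in state 0 with certainty and predict two steps ahead.
two_step = propagate([1.0, 0.0], P, steps=2)
```

Because the state is Markov, the prediction uses only the current state distribution at every step, never the path by which that distribution was reached.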
Markov Decision Processes
A reinforcement learning problem that satisfies the Markov property is called a Markov decision process (MDP) [155]. In the case where the state and action spaces are finite, the process is called a finite MDP. Any particular finite MDP is defined by its state and action sets, and by the one-step dynamics of the environment. Given any state $s \in S$ and action $a \in A(s)$, the probability of each possible next state $s'$ is given by
$$P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a). \tag{2.9}$$
These quantities are called transition probabilities. Similarly, given any current state $s \in S$ and action $a \in A(s)$, together with any next state $s' \in S$, the expected value of the next reward is given by
$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}. \tag{2.10}$$
The quantities $P^a_{ss'}$ and $R^a_{ss'}$ in (2.9)–(2.10) completely specify the most important aspects of the dynamics of a finite MDP (only information about the distribution of rewards around the expected value is lost).
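One possible way of storing the transition probabilities and expected rewards of a small finite MDP is sketched below. The state names, action names and all numerical values are hypothetical, and a dictionary keyed by state-action pairs is only one of several reasonable layouts.

```python
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]

# P[(s, a)][s2] = Pr(s_{t+1} = s2 | s_t = s, a_t = a), cf. (2.9); made-up values.
P = {
    ("low", "wait"): {"low": 1.0, "high": 0.0},
    ("low", "work"): {"low": 0.3, "high": 0.7},
    ("high", "wait"): {"low": 0.4, "high": 0.6},
    ("high", "work"): {"low": 0.1, "high": 0.9},
}

# R[(s, a)][s2] = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s2}, cf. (2.10); made-up values.
R = {
    ("low", "wait"): {"low": 0.0, "high": 0.0},
    ("low", "work"): {"low": -1.0, "high": 2.0},
    ("high", "wait"): {"low": 0.0, "high": 1.0},
    ("high", "work"): {"low": -1.0, "high": 3.0},
}


def is_valid(transitions):
    """Check that every row of transition probabilities is non-negative and sums to 1."""
    return all(
        min(row.values()) >= 0.0 and abs(sum(row.values()) - 1.0) < 1e-9
        for row in transitions.values()
    )


def expected_reward(s, a):
    """Expected next reward for (s, a), averaging R over the next state s2."""
    return sum(P[(s, a)][s2] * R[(s, a)][s2] for s2 in STATES)


valid = is_valid(P)
r_low_work = expected_reward("low", "work")   # 0.3 * (-1.0) + 0.7 * 2.0
```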
Value Functions
Almost all reinforcement learning algorithms are based on estimating value functions, that is, functions of states (or state-action pairs) that estimate how good it is for an agent to be in a certain state (or how good it is to perform a specific action in a given state) [155]. The notion of "how good" is typically defined in terms of the expected future rewards (i.e. in terms of the expected return). Naturally, the future rewards depend on the actions taken by the agent. Accordingly, the value functions are defined with respect to particular policies. The value of a state $s$ under some policy $\pi$ is the expected return when starting in state $s \in S$ and following $\pi$ thereafter. In MDPs, the state-value function for policy $\pi$, denoted by $V^\pi(s)$, is defined as
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s\right\}, \tag{2.11}$$
where $E_\pi\{\cdot\}$ denotes the expected value given that the agent follows policy $\pi$. Similarly, the value of taking an action $a \in A(s)$ in state $s \in S$ under policy $\pi$, denoted by $Q^\pi(s, a)$, is defined as the expected return, starting from state $s$, of taking action $a$, and thereafter following policy $\pi$. The function $Q^\pi$, called the action-value function for policy $\pi$, is given by
$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s, a_t = a\right\}. \tag{2.12}$$
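The value functions $V^\pi$ and $Q^\pi$ may be estimated from experience, for example by averaging the actual returns observed after visits to each state $s$, or after each state-action pair $(s, a)$, to obtain estimates of $V^\pi(s)$ and $Q^\pi(s, a)$. Estimation methods of this kind are called Monte Carlo methods [154] due to the fact that they involve taking the average of actual returns from random samples.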
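A minimal sketch of Monte Carlo estimation of a state value is given below. It assumes a hypothetical episodic task in which every step yields a reward of 1 and the episode then ends with probability 0.5; the policy is folded into the episode generator, and all names are illustrative.

```python
import random


def sample_return(rng, gamma=1.0, p_terminate=0.5):
    """Sample the return of one episode from the start state of the toy task."""
    g, discount = 0.0, 1.0
    while True:
        g += discount * 1.0          # reward of 1 on every step
        discount *= gamma
        if rng.random() < p_terminate:
            return g                 # a terminal state has been reached


def mc_state_value(num_episodes, seed=0):
    """Estimate the value of the start state as the average of sampled returns."""
    rng = random.Random(seed)
    returns = [sample_return(rng) for _ in range(num_episodes)]
    return sum(returns) / len(returns)


estimate = mc_state_value(20000)
```

For this toy task with $\gamma = 1$ the exact value satisfies $V = 1 + 0.5V$, so $V = 2$, and the sample average should approach this figure as the number of episodes grows.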
A fundamental property of value functions used in reinforcement learning is that they satisfy certain recursive relationships. For any policy $\pi$ and any state $s \in S$, the consistency condition
$$\begin{aligned}
V^\pi(s) &= E_\pi\{R_t \mid s_t = s\} \\
&= E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s\right\} \\
&= E_\pi\left\{r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\middle|\; s_t = s\right\} \\
&= \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \;\middle|\; s_{t+1} = s'\right\}\right] \\
&= \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^\pi(s')\right]
\end{aligned} \tag{2.13}$$
holds between the value of $s$ and the value of its possible successor states, where it is implicit that the actions are taken from the set $A(s)$, and the next states are taken from the set $S$. The expression in (2.13) is known as the Bellman equation for $V^\pi$ [22]. It expresses a relationship between the value of a state and the values of its successor states.
The Bellman equation (2.13) averages over all possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, together with the reward expected along the way. The value function $V^\pi$ is the unique solution to its Bellman equation. As a result, the Bellman equation forms the basis of a number of ways of computing, approximating and learning $V^\pi$.
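One standard way of computing $V^\pi$ from the Bellman equation is iterative policy evaluation, in which the right-hand side of (2.13) is applied repeatedly as an update until the values stop changing. The sketch below is an illustration under stated assumptions rather than a method prescribed by the text: the two-state MDP, the single action "go" and the tolerance are all hypothetical, and $\gamma < 1$ is assumed so that the sweeps converge.

```python
def policy_evaluation(states, actions, P, R, pi, gamma, tol=1e-10):
    """Repeatedly apply the Bellman equation (2.13) as an update rule.

    P[(s, a)][s2] and R[(s, a)][s2] hold the transition probabilities and
    expected rewards; pi[s][a] is the probability of choosing a in s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a]
                * sum(P[(s, a)][s2] * (R[(s, a)][s2] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new             # in-place update within the sweep
        if delta < tol:
            return V


# Hypothetical MDP: A always moves to B with reward 0; B loops on itself with reward 1.
states, actions = ["A", "B"], ["go"]
P = {("A", "go"): {"A": 0.0, "B": 1.0}, ("B", "go"): {"A": 0.0, "B": 1.0}}
R = {("A", "go"): {"A": 0.0, "B": 0.0}, ("B", "go"): {"A": 0.0, "B": 1.0}}
pi = {"A": {"go": 1.0}, "B": {"go": 1.0}}

V = policy_evaluation(states, actions, P, R, pi, gamma=0.9)
```

For this example the Bellman equation can be solved by hand: $V^\pi(B) = 1 + 0.9 V^\pi(B)$ gives $V^\pi(B) = 10$, and $V^\pi(A) = 0.9 V^\pi(B) = 9$, which the iteration reproduces up to the chosen tolerance.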
For finite MDPs, an optimal policy may be defined in the following way. A policy $\pi$ is said to be better than or equal to another policy $\pi'$, denoted by $\pi \geq \pi'$, if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all $s \in S$. A policy that is better than or equal to all other policies is called an optimal policy [154], denoted by $\pi^*$. There may be more than one optimal policy, but all optimal policies $\pi^*$ share the same optimal state-value function value
$$V^*(s) = \max_\pi V^\pi(s) \tag{2.14}$$
for all $s \in S$. Similarly, each optimal policy also has a corresponding optimal action-value function value
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \tag{2.15}$$
for all $s \in S$ and $a \in A(s)$. For each state-action pair, this function value represents the expected return associated with taking action $a$ in state $s$ and thereafter following an optimal policy. As a result, one may write
$$Q^*(s, a) = E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}. \tag{2.16}$$
Optimality and Approximation
Agents rarely learn a truly optimal policy for real-life problem instances [154]. Even with current processing technology, an optimal policy for such a problem typically cannot be computed by solving the Bellman equation within the time available per stage. Memory requirements present a further challenge. In tasks with small, finite sets of states, it is often possible to form approximations using tables or arrays containing an entry for each state or state-action pair. For large problems, which may have infinitely many states, this is not possible. In such cases, the functions must be approximated at the cost of optimality, using some sort of more compact parameterised function representation. This does, however, present unique opportunities for achieving useful approximations. There may, for example, be many states which are reached with such a low probability that computing optimal behaviour for those states will have only a minimal impact on the amount of reward received by the agent. The online nature of reinforcement learning makes it possible to approximate optimal policies in such a way that more attention is afforded to frequently occurring states, resulting in good decisions being made when those states occur, at the expense of less effort being made in learning good policies for less frequently encountered states. This is one key property which distinguishes reinforcement learning from other approximate solution approaches to MDPs.