Reinforcement learning addresses a class of learning problems where an agent interacts with a dynamic, potentially stochastic and partially unknown environment, aiming to learn policies that maximise performance on a given task [87]. The interactions between agent and environment are typically modelled using a Markov decision process (MDP) when states are observable, or a partially observable Markov decision process when only limited state information is accessible. Reinforcement learning (RL) has received much recent attention, and methods to learn policies from interaction with an environment fall into several families. This section gives an overview of the main classes of reinforcement learning methods and the basic theory behind them.
2.9.1 Model-based
The principle behind model-based methods is to learn the MDP's transition dynamics, and to use a planning algorithm to find the best action to execute next. This family of methods is called model-based to reflect the model used to learn transition dynamics. By contrast, the next two families of RL algorithms do not learn the transition dynamics explicitly and are often referred to as model-free RL.
Learning the transition dynamics of an MDP can be cast as a supervised learning problem using transition data collected from interactions between agent and environment. Transitions are tuples $(s_t, a_t, s_{t+1})$, and the state-action pair $(s_t, a_t)$ and next state $s_{t+1}$ can be used as the input and target of a supervised learning algorithm, respectively. Some model-based RL algorithms similarly learn the reward function from transition data.
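As a minimal sketch of this supervised view, assuming a small discrete MDP, a transition model can be estimated by counting observed next states for each state-action pair. The function name `fit_transition_model` and the toy data are hypothetical and only serve as illustration; continuous problems would typically use a regression model instead.

```python
from collections import Counter, defaultdict


def fit_transition_model(transitions):
    """Estimate p(s_next | s, a) from (s, a, s_next) tuples by normalised counts."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        # Input of the supervised problem: (s, a); target: s_next.
        counts[(s, a)][s_next] += 1
    model = {}
    for state_action, next_counts in counts.items():
        total = sum(next_counts.values())
        model[state_action] = {s_next: n / total for s_next, n in next_counts.items()}
    return model


# Hypothetical transition data collected from agent-environment interaction.
data = [(0, "right", 1), (0, "right", 1), (0, "right", 0), (1, "left", 0)]
model = fit_transition_model(data)
print(model[(0, "right")])
```

The same counting scheme could be applied to observed rewards to estimate a reward function.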
Once the transition function (and possibly reward function) is learned, action selection is reduced to a planning problem. Indeed, when components of the MDP are
known, planning algorithms such as full tree search or Monte Carlo tree search can find the optimal action at a given state. In this case, the policy is not a function but a planning algorithm. Examples of model-based RL algorithms include Dyna [86] and prioritised sweeping [53].
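The planning step can be illustrated with a depth-limited full tree search over a known tabular model. This is a sketch under stated assumptions: the three-state chain, its reward function and the function name `plan` are hypothetical, and state-action pairs missing from the model are treated as contributing zero value.

```python
def plan(state, model, reward, actions, depth, gamma=0.9):
    """Return (value, best_action) from a depth-limited full tree search."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = 0.0
        # Expectation over next states under the (learned or known) model.
        for s_next, p in model.get((state, a), {}).items():
            future, _ = plan(s_next, model, reward, actions, depth - 1, gamma)
            value += p * (reward(state, a, s_next) + gamma * future)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action


# Hypothetical deterministic chain 0 - 1 - 2, with reward 1 for arriving in state 2.
actions = ("left", "right")
chain_model = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {1: 1.0}, (2, "right"): {2: 1.0},
}


def chain_reward(s, a, s_next):
    return 1.0 if s_next == 2 else 0.0


value, action = plan(0, chain_model, chain_reward, actions, depth=3)
print(value, action)
```

Here the policy is the call to `plan` itself, evaluated anew at each state, rather than a stored function. The cost of the search grows exponentially with depth, which is one reason sampling-based planners such as Monte Carlo tree search are used on larger problems.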
Model-based RL methods are known for their high data efficiency, which makes them well suited to real-world problems where data is scarce or expensive. Furthermore, these methods can handle changing objectives and tasks (changes in rewards) without needing to learn the transition model anew. However, model-based algorithms generally do not scale well to high-dimensional problems, and both model learning and planning can be computationally expensive. Lastly, the resulting policies are sensitive to compounding transition model errors: a slightly inaccurate transition model can result in degenerate policies.
2.9.2 Value function
The core idea of value function RL [13] is to learn a value function V (or state-action value function Q), which encodes the expected long-term discounted value of a state (or state-action pair):
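$$V(s) = \mathbb{E}_{T, R, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s\right], \tag{2.40}$$
$$Q(s, a) = \mathbb{E}_{T, R, \pi}\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\middle|\, s_0 = s, a_0 = a\right], \tag{2.41}$$
where $r_i = R(s_i, a_i)$. These functions reflect the expected sum of discounted rewards gained from following policy $\pi$, starting from $s_0 = s$ (and $a_0 = a$ in the case of $Q$).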
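To make the definition concrete, the expectation can be approximated by averaging discounted returns over sampled episodes. The sketch below assumes finite reward sequences, so the infinite sum is truncated at the episode length. The function names and the reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma^i * r_i for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so each step is g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Average the discounted returns of episodes that all start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed from the same start state under a policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
v_estimate = monte_carlo_value(episodes, gamma=0.5)
print(v_estimate)
```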
These value functions are difficult to learn in practice, but have the advantage of making the policy definition simple. Because value functions encode the value of a state (or state-action pair) over an infinite horizon, a long-horizon action search is not necessary to achieve non-myopic action selection.
When the state-action value function is optimal, denoted $Q^*$, the optimal policy simply becomes a maximisation over the action space:
$$\pi^*(s) = \arg\max_{a \in A} Q^*(s, a). \tag{2.42}$$
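For a finite action space and a tabular value function, this maximisation is a simple lookup. The following sketch assumes the table is stored as a dictionary keyed by state-action pairs. The table values and the name `greedy_policy` are hypothetical.

```python
def greedy_policy(q_table, actions):
    """Build a policy that picks the action with the largest Q-value in each state."""
    def policy(state):
        return max(actions, key=lambda a: q_table[(state, a)])
    return policy


# Hypothetical Q-values for two states and two actions.
q_table = {
    (0, "left"): 0.1, (0, "right"): 0.7,
    (1, "left"): 0.5, (1, "right"): 0.2,
}
pi = greedy_policy(q_table, actions=("left", "right"))
print(pi(0), pi(1))
```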
Learning the state-action value function can be achieved by writing Equation 2.41 recursively, which yields the Bellman equation:
$$Q(s, a) = \mathbb{E}_{R}\left[R(s, a, s')\right] + \gamma\, \mathbb{E}_{s', a' \mid s, a}\left[Q(s', a')\right]. \tag{2.43}$$
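Here, $s' \sim p(s' \mid s, a)$ and $a' \sim \pi(s')$. This equation can be used to iteratively refine models of $Q$ based on transition data, and is the basis of most value function RL algorithms, such as Q-learning [90]. In the tabular case, learning $Q$ with Equation 2.43 converges to the optimal state-action value function $Q^*$ under standard conditions (for example, when $a'$ is selected greedily with respect to $Q$ and every state-action pair is visited sufficiently often). $Q^*$ is then used to recover the optimal policy $\pi^*$.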
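A minimal tabular Q-learning sketch illustrates how the Bellman equation turns into an iterative update. It assumes the greedy choice of $a'$ used by Q-learning, replacing the expectation over $a'$ by a maximum, and epsilon-greedy exploration. The chain environment, the function names and all hyperparameters are hypothetical choices for illustration.

```python
import random


def q_learning(step, states, actions, episodes=1000, horizon=20,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.choice(actions)  # explore
            else:
                a = max(actions, key=lambda b: q[(s, b)])  # exploit
            s_next, r = step(s, a)
            # Sampled Bellman target with a greedy next action.
            target = r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q


def chain_step(s, a):
    """Hypothetical deterministic chain 0 - 1 - 2; reward 1 for arriving in state 2."""
    s_next = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == 2 else 0.0)


states, actions = [0, 1, 2], ["left", "right"]
q = q_learning(chain_step, states, actions)
learned_policy = {s: max(actions, key=lambda b: q[(s, b)]) for s in states}
print(learned_policy)
```

Note that the update only uses sampled transitions $(s, a, r, s')$, so no transition model is learned or stored.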
Value function methods are generally simple, low-complexity algorithms. Learning is relatively data efficient, and does not incur the computational overhead of learning MDP transition dynamics. However, this family of methods does not scale well to very high-dimensional problems, and does not provide ways to inspect how much of the environment dynamics the agent has learned. Examples of value function RL include value iteration [6, 87], Q-learning [90], SARSA [75], and least-squares temporal difference [40].
2.9.3 Policy search
Policy search methods take a different approach and directly search over a space of policies to find the best-performing one. Policies are modelled as functions mapping states $s$ to actions $a$, fully defined by a vector of parameters $\theta$. Finding the optimal policy $\pi^*$ becomes an optimisation problem, equivalent to finding the optimal parameters $\theta^*$:
$$\theta^* = \arg\max_{\theta} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t^{\pi_\theta}\right], \tag{2.44}$$
where $r_t^{\pi_\theta}$ is the reward obtained for executing the action chosen by $\pi_\theta$ at time step $t$. Equation 2.44 describes a classic optimisation problem, which can be solved using common optimisation techniques. Methods using gradient-free optimisation are known as direct policy search. Examples of direct policy search include using Bayesian optimisation [11] or the cross-entropy method [46].
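Direct policy search can be sketched with a basic cross-entropy method over a single policy parameter. The example assumes a hypothetical one-dimensional task with linear policy $a = \theta s$, dynamics $s' = s + a$ and reward $-s'^2$, for which the best parameter is $\theta = -1$. All names and hyperparameters are illustrative, and the horizon is truncated to a finite number of steps.

```python
import random
import statistics


def episode_return(theta, horizon=10, gamma=0.9):
    """Discounted return of the linear policy a = theta * s on a toy 1-D task."""
    s, total = 1.0, 0.0
    for t in range(horizon):
        a = theta * s           # deterministic linear policy
        s = s + a               # hypothetical dynamics
        total += (gamma ** t) * (-(s ** 2))  # reward penalises distance from 0
    return total


def cross_entropy_search(objective, mean=0.0, std=2.0, population=50,
                         elite=10, iterations=30, seed=0):
    """Gradient-free search: refit a Gaussian to the best samples at each iteration."""
    rng = random.Random(seed)
    for _ in range(iterations):
        samples = [rng.gauss(mean, std) for _ in range(population)]
        samples.sort(key=objective, reverse=True)
        elites = samples[:elite]
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-6  # keep the distribution non-degenerate
    return mean


theta_best = cross_entropy_search(episode_return)
print(theta_best)
```

The optimiser only needs return evaluations of candidate parameters, which is what makes the approach black-box.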
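Methods using gradient-based optimisation are referred to as policy-gradient methods. However, obtaining gradients from Equation 2.44 is not trivial, and many methods estimate policy gradients using the likelihood ratio trick. Denoting by $J$ the expected discounted return, such that $J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t^{\pi_\theta}\right]$, the gradient used for policy parameter updates is given by
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\left(\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t'=0}^{\infty} \gamma^{t'} r_{t'}^{\pi_\theta}\right)\right], \tag{2.45}$$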
where the expectation is often approximated using finite samples. Note that this requires the policy to be stochastic and a differentiable function of its parameters. Gradient estimates for the policy parameters $\theta$ can be obtained from Equation 2.45, and any gradient ascent method can be used to find a (potentially local) maximum of $J$. One of the best-known policy gradient methods making use of the likelihood ratio trick is REINFORCE [91].
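The likelihood ratio estimator can be illustrated in its simplest setting, a bandit with one-step episodes, so that the return equals the immediate reward. This is a sketch in the spirit of REINFORCE rather than a faithful reproduction of [91]: the softmax policy, the Gaussian reward noise and all hyperparameters are assumptions. For a softmax policy, the gradient of $\log \pi_\theta(a)$ with respect to the preference $\theta_k$ is $\mathbb{1}[k = a] - \pi_\theta(k)$.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, noise=0.5, seed=0):
    """Single-sample likelihood-ratio gradient ascent on a softmax bandit policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        ret = rng.gauss(arm_means[a], noise)  # one-step episode: return = reward
        for k in range(len(theta)):
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * grad_log * ret   # ascent step on J
    return softmax(theta)


final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

Each update uses a single sampled action and return, which is the source of the high variance typically associated with this family of estimators.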
The main advantage of policy search methods is their ease of implementation and deployment, since they cast RL as a black-box optimisation problem. Their computational complexity is typically low, and these methods do not require explicitly learning transition dynamics. However, policy parameter updates generally suffer from high variance, and policy search methods can be very data-inefficient. As with value function based RL, there is no way to inspect how much of the environment dynamics the agent has learned. Lastly, the policy search methods described here are restricted to episodic regimes, because updates happen after episodes end.
More advanced methods combine the different types of RL methods. For example, value function or policy search methods can take advantage of transition models learned by model-based algorithms. Policy gradient methods can also simultaneously learn a value function, called a baseline, to reduce the variance of gradient estimates [91]. Lastly, actor-critic methods simultaneously learn a policy and a value function [33].
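The reason a baseline can reduce variance without biasing the gradient can be sketched as follows. The derivation assumes a discrete action space $A$ and a baseline $b(s)$ that depends only on the state. The symbol $b$ is introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
&= b(s) \sum_{a \in A} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_{a \in A} \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_{a \in A} \pi_\theta(a \mid s)
 = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

Subtracting $b(s)$ from the return therefore leaves the expected gradient unchanged, while a well-chosen baseline, such as a learned state value function, can shrink the magnitude of individual gradient samples.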