
2.4 Learning in optimal control systems

2.4.1 Reinforcement learning

Reinforcement learning provides a framework in which to study problems concerning the optimal behavior of agents acting in uncertain environments. In the classical (and probably still most common) setting, the environment consists of a discrete state space, and time evolves in discrete steps.

We therefore usually envision the problem as taking place on a finite graph, such as the one illustrated in Fig. 2.2, which shows a very simple Markov Decision Process (MDP). The dynamics of the problem are implied by the arrows, which push the agent towards a new state at each time step. Where two or more arrows emerge from a state, the agent (whose current state is illustrated by the stick figure) is able to influence, by exerting some control action, the probability that he will transition to one particular next state rather than another. Upon entering each state, the agent receives a reward, illustrated as a number of coins for each state. The agent’s goal is to find an optimal policy, i.e., a mapping of states to control actions, that maximizes his cumulative reward over time. The state is assumed to be known in an MDP. Relaxing this assumption by adding observations and latent states yields a construct known as a Partially Observable Markov Decision Process (POMDP). The HMM might be considered a special case of a POMDP where controls are absent and the state transition graph has a certain linear structure.
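As a minimal sketch of the MDP ingredients described above, a finite MDP can be stored as a table of transition probabilities together with a per-state reward. The three-state example, the names `P`, `R`, `step`, and `rollout`, and all numbers are hypothetical and are not taken from Fig. 2.2.

```python
import random

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[x][a] maps each next state x' to the probability P(x' | x, a).
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.7, 2: 0.3}, 1: {1: 0.1, 2: 0.9}},
    2: {0: {2: 1.0}, 1: {0: 1.0}},
}
# R[x] is the reward received upon entering state x.
R = {0: 0.0, 1: 1.0, 2: 5.0}


def step(x, a, rng=random):
    """Sample a next state x' ~ P(. | x, a) and return (x', reward)."""
    next_states = list(P[x][a])
    weights = [P[x][a][s] for s in next_states]
    x_next = rng.choices(next_states, weights=weights, k=1)[0]
    return x_next, R[x_next]


def rollout(policy, x0, horizon, seed=0):
    """Follow a policy (dict: state -> action) and return the total reward."""
    rng = random.Random(seed)
    x, total = x0, 0.0
    for _ in range(horizon):
        x, r = step(x, policy[x], rng)
        total += r
    return total
```

Here a policy is simply a dictionary from states to actions, so comparing two policies amounts to comparing the rewards their rollouts collect.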

Reinforcement learning typically concerns itself with two major problems: on the one hand, the planning problem of finding an optimal policy; and on the other, the learning problem of inferring the dynamics of the environment. As one might expect, these problems are closely intertwined.

Figure 2.2: A simple MDP. (Legend: reward, allowed transition, current state.)

Planning

Planning for MDPs may be approached in many ways, but the most common relies on dynamic programming (DP). Central to the concept of dynamic programming is the notion of a value function $V(\cdot)$ that gives, for each state, the maximum expected cumulative reward attainable starting in that state and following an optimal policy. The value function has a simple recursive definition known as Bellman’s equation [12]. Denoting by $P(x' \mid x, a)$ the probability of transitioning from state $x$ to state $x'$ after taking action $a$, and denoting by $R(x)$ the per-state reward, Bellman’s equation is given by

$$V(x) := \max_{a} \left( \sum_{x'} P(x' \mid x, a)\,\bigl(R(x') + \gamma V(x')\bigr) \right), \tag{2.4.1}$$

where $\gamma \in [0, 1)$ is a given scalar discount factor.

The optimal policy $\pi(x)$ evaluated for a given state $x$ is simply the action that maximizes the right-hand side of (2.4.1). Given the dynamics, knowing the value function is therefore equivalent to knowing the optimal policy.
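In the notation of (2.4.1), the optimal policy can be written explicitly as the action attaining the maximum:

```latex
\pi(x) = \operatorname*{arg\,max}_{a} \sum_{x'} P(x' \mid x, a)\,\bigl(R(x') + \gamma V(x')\bigr)
```

Evaluating this expression requires the transition probabilities as well as $V$, which is why the equivalence between value function and policy holds only when the dynamics are known.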

A classic algorithm for finding the value function consists of turning (2.4.1) into a fixed-point iteration that is guaranteed to converge to the true value function [12]; it is appropriately referred to as value iteration.
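The fixed-point iteration can be sketched as follows. This is a minimal tabular version under stated assumptions: the MDP is given as nested dictionaries, the function name `value_iteration`, the stopping tolerance, and the two-state example are all hypothetical, and the reward is collected on entering the next state, as in (2.4.1).

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate V <- max_a sum_x' P(x'|x,a) (R(x') + gamma V(x')) to a fixed point.

    P[x][a] is a dict {x': probability}; R[x'] is the reward for entering x'.
    Returns the value function V and a greedy policy, both as dicts over states.
    """
    def backup(V, x, a):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (R[x2] + gamma * V[x2]) for x2, p in P[x][a].items())

    V = {x: 0.0 for x in P}
    for _ in range(max_iters):
        V_new = {x: max(backup(V, x, a) for a in P[x]) for x in P}
        delta = max(abs(V_new[x] - V[x]) for x in P)
        V = V_new
        if delta < tol:  # stop once successive iterates agree
            break
    policy = {x: max(P[x], key=lambda a: backup(V, x, a)) for x in P}
    return V, policy


# Hypothetical two-state example: state 1 pays a reward of 1 on entry.
P = {
    0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {1: 1.0}, 1: {0: 1.0}},
}
R = {0: 0.0, 1: 1.0}
V, policy = value_iteration(P, R, gamma=0.5)
```

For this example the fixed point can be checked by hand: staying in state 1 forever is worth $1/(1 - 0.5) = 2$, and solving $V(0) = 0.25\,V(0) + 1$ gives $V(0) = 4/3$.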

Bellman himself was one of the first to recognize a critical deficiency of this approach: in large and/or high-dimensional state spaces, the curse of dimensionality makes it impossible even to store the value function as a simple table of values, much less perform the required iteration repeatedly over all states. He proposed as a potential solution to instead represent the value function approximately by a finite, weighted sum of smooth basis functions, attempting, in his words, “to trade additional computing time, which is expensive, for additional memory capacity, which does not exist” [13]. This approach is usually known today as approximate dynamic programming or value function approximation, and it remains a very active area of research [30, 31, 32].
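In symbols, a weighted sum of basis functions replaces the table of values by a parametric form such as the one below. The basis functions $\phi_i$, the weights $w_i$, the number of terms $k$, and the symbol $\mathcal{X}$ for the state space are notation assumed here for illustration rather than taken from [13].

```latex
V(x) \approx \hat{V}(x; w) = \sum_{i=1}^{k} w_i \, \phi_i(x), \qquad k \ll |\mathcal{X}|
```

Only the $k$ weights need to be stored and updated, rather than one value per state.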

Planning in MDPs is an extremely rich and active field that, regrettably, would take us too far afield if we were to discuss it thoroughly at this point. We therefore move on to a brief summary of the learning problem.

Learning

When we speak of learning in an MDP, we usually mean the task of planning in an MDP with unknown dynamics. Conceptually, this could be performed by first learning the dynamics (i.e., the state-transition probabilities) and subsequently using one of the planning methods described in the previous section to solve the planning problem; this is the so-called model-based approach. Many methods, however, are based on the observation that the actual state-transition probabilities need not be computed explicitly if all we are concerned with is that the agent act optimally in the world; these are referred to as model-free methods.
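For the model-based approach, perhaps the simplest way to learn the dynamics of a discrete MDP is to count observed transitions. The sketch below assumes that experience is available as a list of (state, action, next state) triples; the function name `estimate_transitions` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(transitions):
    """Maximum-likelihood estimate of P(x' | x, a) from (x, a, x') samples.

    Returns P_hat[(x, a)] = {x': relative frequency}. State-action pairs
    that were never observed are simply absent from the result.
    """
    counts = defaultdict(Counter)
    for x, a, x_next in transitions:
        counts[(x, a)][x_next] += 1
    P_hat = {}
    for key, c in counts.items():
        total = sum(c.values())
        P_hat[key] = {x_next: n / total for x_next, n in c.items()}
    return P_hat


# Hypothetical experience: four transitions from (state 0, action 1), one from (1, 0).
data = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1)]
P_hat = estimate_transitions(data)
```

The estimated table could then be handed to a planner in place of the true dynamics. Model-free methods skip this intermediate step entirely.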

One well-known example of such a method is Q-learning [104], which employs a value-iteration-like fixed-point algorithm to estimate a function $Q(x, a)$, defined as the optimal value conditioned on first taking a step with action $a$. Given a (potentially variable) learning rate $\alpha_t$, this results in an iteration without the expectation over next states that would require knowledge of the state-transition distribution:

$$Q(x, a) \leftarrow Q(x, a) + \alpha_t \left[ R(x) + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x, a) \right]. \tag{2.4.2}$$
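The update (2.4.2) can be sketched in a few lines of tabular code. Several choices below are assumptions made for illustration and are not prescribed by the text: actions are chosen uniformly at random for exploration, the learning rate decays as $n^{-0.6}$ in the number of visits $n$ to each state-action pair, and the names `q_learning`, `P`, and `R` as well as the two-state example are hypothetical. The reward $R(x)$ of the state being left is used, as written in (2.4.2).

```python
import random


def q_learning(P, R, gamma=0.5, steps=20_000, seed=0):
    """Tabular Q-learning using update (2.4.2) with uniformly random exploration.

    P[x][a] = {x': prob} is used ONLY to simulate the environment;
    the learner itself never reads the transition probabilities.
    """
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in P for a in P[x]}
    visits = {key: 0 for key in Q}
    x = rng.choice(list(P))
    for _ in range(steps):
        a = rng.choice(list(P[x]))  # explore: pick an action at random
        nxt = list(P[x][a])
        x_next = rng.choices(nxt, weights=[P[x][a][s] for s in nxt], k=1)[0]
        visits[(x, a)] += 1
        alpha = visits[(x, a)] ** -0.6  # decaying learning rate alpha_t
        target = R[x] + gamma * max(Q[(x_next, b)] for b in P[x_next])
        Q[(x, a)] += alpha * (target - Q[(x, a)])  # move Q toward the target
        x = x_next
    policy = {x: max(P[x], key=lambda a: Q[(x, a)]) for x in P}
    return Q, policy


# Hypothetical two-state example: state 1 carries a reward of 1.
P = {
    0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {1: 1.0}, 1: {0: 1.0}},
}
R = {0: 0.0, 1: 1.0}
Q, policy = q_learning(P, R)
```

Each update uses a single sampled next state in place of the sum over $x'$, which is what allows the learner to dispense with the transition probabilities.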

Q-learning can be considered a special case of the more general class of temporal difference learning methods, which perform incremental updates based on some sort of error signal such as that found in the right-hand side of (2.4.2) [97]. As with approximate dynamic programming in the known-model case, the standard approach to making such methods work in high-dimensional spaces is to use function approximation to represent value or $Q$ functions [38, 91]. Perhaps the most well-known success story achieved by such methods is TD-Gammon, a backgammon-playing program based on temporal difference learning with a neural network function approximator, which eventually learned to play at a world-class level by competing in countless trials against itself [92].
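To make the notion of an error signal concrete, the bracketed term in (2.4.2) can be given its own symbol. The name $\delta_t$ and the state-value variant on the second line (commonly called TD(0)) are standard in the literature, but they are notation assumed here rather than taken from the text above.

```latex
\begin{aligned}
\delta_t &= R(x) + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x, a), &\quad Q(x, a) &\leftarrow Q(x, a) + \alpha_t \, \delta_t, \\
\delta_t^{V} &= R(x_t) + \gamma V(x_{t+1}) - V(x_t), &\quad V(x_t) &\leftarrow V(x_t) + \alpha_t \, \delta_t^{V}.
\end{aligned}
```

In both cases the estimate is nudged toward a one-step target by an amount proportional to the error. The second form evaluates the value function of whatever policy is generating the data.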