Reinforcement Learning and Markov Decision

Processes

In this section the basic concepts of reinforcement learning are reviewed (a more complete review can be found in e.g.,Sutton and Barto(1998)).

3.3.1 Agent-environment interaction

The agent-environment interaction process as shown in Figure3.2 captures the basic elements of a reinforcement learning problem. In the figure, an agent interacts with the environment through three signals, state transitions, rewards, and actions. At each time step t the agent finds itself in a state, st. The agent then chooses an action atat time step

t. As a (partial) consequence of the action at chosen, the environment gives feedback to

the agent with a reward/penalty rt+1and transitions to a new state st+1.

By interacting with the environment, the agent aims to maximise the total amount of reward it receives over the long run. In order to do that, the agent needs to find the optimal policy; a policy is defined as the mappings from the states to the probabilities of choosing the actions available.

During this process, the states are served as the basis of the action selection. The state is defined as a signal that summaries the information that is available to the agent. The least compact way to represent the state is a list of everything happened so far, i.e., a history of states and actions up until time step t. The most compact state representation is

Figure 3.2: The agent-environment interaction in Reinforcement Learning (Figure from Sutton & Barto, 1998)

called a Markov state. A Markov state is a state that summarises all relevant information in the past. All that matters for the future is the current state. A common example used is ‘the current position and velocity of a cannonball’. ‘The current position and velocity of a cannonball’ is regarded as a Markov state, because it summarises everything that is required for future flight. It does not matter how that position and velocity was obtained (Sutton & Barto, 1998). Formally, if the environment has the Markov property, i.e., the states are Markov states, the environment dynamics can be described by

P r = {st+1 = s0, rt+1 = r | st, at} (3.1)

instead of

P r = {st+1= s0, rt+1 = r | st, at, rt−1, st−1, at−1, ...r1, s0, a0} (3.2)

As shown in Equation 3.1, the response of the environment at time t + 1 depends only on the previous state st, and the action chosen from that state at. In contrast, Equation

3.2 is for a non-Markov environment, where the response of the environment depends on everything that happened so far. The interaction process shown in Figure 3.2 can be modelled as a Markov Decision Process (MDP) if the states have the Markov property. A formal introduction to MDPs is given below.

3.3.2 Markov Decision Processes

The basic elements for a Markov Decision Process (MDP) are (S, T, A, R), where S represents the state space,T is a function that governs the state transitions (more details later), A represents the action space, and R is the reward function. If state space S and action space A are finite, then it is called finite a Markov Decision Process. Finite MDPs are particularly popular in reinforcement learning (Sutton & Barto,1998).

Due to the Markov property, a MDP could be described as a one-step dynamical sys- tem as follows.

s,s0 = P r{s_t+1 = s0 | s_t= s, a_t = a} (3.3)

The state transition function, T is a function that maps the state-action pairs to the new states. The state transition function is usually represented as transition probabilities, Ps,s0(a), where s ∈ S, s0 ∈ S, and a ∈ A. P_s,s0(a) is the probability of entering state

s0 after taking action a at state s at previous time step. The reward functionR is represented as follows.

s,s0 = E{r_t+1| s_t= s, a_t= a, s_t+1 = s0} (3.4)

Equation 3.3 and Equation 3.4 specify the dynamics of a MDP. The agent’s goal is to maximise the reward over the long run given the Markov Decision Process (MDP). One way to specify an agent’s behaviour is in terms of a control policy, which specifies what to do for each state in the state space. Formally, a policy π is a function from the state spaceS to the action space A. In reinforcement learning, the agent’s objective is to learn a control policy that maximises some measure of total reward accumulated overtime. The most prevalent measure is Rt(Sutton & Barto,1998), as shown in Equation3.5.

Rt= rt+1+ γrt+2+ γ2rt+3+ ... = ∞

n=0

γnrt+n+1 (3.5)

0 ≤ γ ≤ 1. The discount rate concerns how future rewards are considered for the value of current state. The closer γ approaches to 1, the stronger future rewards are considered. Rtis also called the return.

Now we are in a place to talk about the value of a state. The value of a state depends on how the agent behaves after that state, so the value of a state is associated with a control policy π. The value of a state s under a policy π, denoted as Vπ_(s)_{, is the expected return}

when starting in state s and following the policy thereafter as in Equation3.6

Vπ(s) = Eπ{Rt | st= s} = Eπ{ ∞

k=0

γkrt+k+1| st = s} (3.6)

where Eπ denotes the expected value given that the agent follows policy π. We call the

function Vπ _{the state-action value function of policy π.}

A policy π is defined to be better than or equal to a policy π0

if and only if Vπ_{(s) ≥}

Vπ

(s) for all s ∈ S. Therefore, there is always at least one policy that is better than or equal to all other policies. This is an optimal policy, π∗_{as defined in Equation}_3.7

π∗ = max

π V

π_(s) _(3.7)

for all s ∈S. Therefore, the goal of the agent is to find this optimal policy. This is the job of a reinforcement learning algorithm.

An important property of Markov Decision Processes (MDPs) is that the optimal policy is well defined and guaranteed to exist. In particular, the Optimality Theorem from dynamic programming guarantees that for a discrete state Markov decision process there always exists a deterministic policy that is optimal. Furthermore, a policy is optimal if and only if it satisfies the following relationship:

Qπ(x, π(x)) = max

a∈A(Qπ(x, a)) ∀x ∈ S (3.8)

the agent starts at state x, applies action a, and follows policy π thereafter. Equation3.8

states a policy is optimal if and only if in each state the policy specifies an action that maximises the local ‘action-value’.

If an MDP is completely known (including transition probabilities and reward dis- tributions), then the optimal policy can be computed directly using techniques from dynamic programming. However in many cases the dynamics of the environment are un- known. Under this circumstance, the optimal policy cannot be directly computed, but can be learnt by exploring the environment by trial-and-error.

In the next section, Q-learning, which can be used to find an optimal action-selection policy for (finite) MDPs, is introduced.

In document An optimal control approach to testing theories of human information processing constraints (Page 40-44)