(1)

Reinforcement Learning

LU 2 - Markov Decision Problems and Dynamic Programming

Dr. Joschka Bödecker

AG Maschinelles Lernen und Natürlichsprachliche Systeme, Albert-Ludwigs-Universität Freiburg

(2)

LU 2: Markov Decision Problems and DP

Goals:

- Definition of Markov Decision Problems (MDPs)
- Introduction to Dynamic Programming (DP)

Outline

- short review
- definition of MDPs
- DP: principle of optimality
- the DP algorithm (backward DP)

(3)

Review

- Process, can be influenced by actions
- Agent: sensory input, output of action
- Feedback
- RL: training information through evaluation only
- Delayed reinforcement learning: decision, decision, decision, ... evaluation
- Multi-stage decision process
- Optimization

(4)

The ’Agent Concept’

(5)

Multi-stage decision problems

(6)

Three ’components’

- System, process
- Rewards, costs
- Policy, strategy

(7)

Requirements for the model

Goal: describe the system's behaviour (the system is also called the process, world, or environment).

Requirements for a model:

- situations
- activities
- current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
- goal specification: definition of costs / rewards

(8)

System description

Discrete decision points (stages): $t \in T = \{0, 1, \dots, N\}$ or $T = \{0, 1, \dots\}$

System state (situation): $s_t \in S$ (here: $S$ finite)

Actions: $u_t \in U$ (here: $U$ finite)

Transition function ('reaction of the system'): $s_{t+1} = f(s_t, u_t)$

(12)

Goal formulation: Introducing costs

At every decision (that is, in every stage) direct costs arise.

Direct costs: $c : S \to \mathbb{R}$

Refinement: costs dependent on state and action, $c : S \times U \to \mathbb{R}$

Reward, cost, punishment?

(14)

Summary: Deterministic systems

Discrete decision points (stages): $t \in T = \{0, 1, \dots, N\}$ or $T = \{0, 1, \dots\}$

System state (situation): $s_t \in S$

Actions: $u_t \in U$

Transition function: $s_{t+1} = f(s_t, u_t)$

Direct costs: $c : S \times U \to \mathbb{R}$

⇒ 5-tuple $(T, S, U, f, c)$
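The 5-tuple can be turned into a few lines of Python. The following is a minimal sketch, assuming a hypothetical toy system (five states on a line, actions -1 and +1, unit cost per decision) that is not taken from the lecture. The helper name `rollout_cost` is likewise an illustrative choice; it sums the direct costs along the trajectory generated by the transition function.

```python
from typing import Callable, List, Tuple

State = int
Action = int


def rollout_cost(
    f: Callable[[State, Action], State],
    c: Callable[[State, Action], float],
    policy: Callable[[int, State], Action],
    s0: State,
    horizon: int,
) -> Tuple[float, List[State]]:
    """Apply the policy for `horizon` stages and sum the direct costs."""
    s, total, path = s0, 0.0, [s0]
    for t in range(horizon):
        u = policy(t, s)   # action chosen at stage t
        total += c(s, u)   # direct cost c(s_t, u_t)
        s = f(s, u)        # s_{t+1} = f(s_t, u_t)
        path.append(s)
    return total, path


# Hypothetical toy system (an assumption for illustration only).
def f(s: State, u: Action) -> State:
    return max(0, min(4, s + u))  # move on the line 0..4, clipped at both ends


def c(s: State, u: Action) -> float:
    return 1.0  # every decision costs one unit


total, path = rollout_cost(f, c, lambda t, s: +1, s0=0, horizon=3)
```

Here the policy always moves to the right, so three stages from state 0 visit the states 0, 1, 2, 3 and accumulate three units of cost.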

(15)

Example: Shortest path problems

Find the ’shortest’ path from start node to finish node. Every edge has a specific cost that can be interpreted as ’length’.
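As a hedged illustration, the sketch below solves a small shortest path problem by computing, for every node, the cheapest remaining cost to the finish node. The graph, the node names and the functions `cost_to_go` and `shortest_path` are made up for this example; the sketch assumes an acyclic graph in which the finish node can be reached from every node that is visited.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical graph: EDGES[a][b] is the cost ('length') of the edge a -> b.
EDGES: Dict[str, Dict[str, float]] = {
    "S": {"A": 1, "B": 4},
    "A": {"B": 2, "G": 6},
    "B": {"G": 1},
    "G": {},
}
GOAL = "G"


@lru_cache(maxsize=None)
def cost_to_go(node: str) -> float:
    """Cheapest total edge cost from `node` to GOAL (assumes no cycles)."""
    if node == GOAL:
        return 0.0
    return min(
        (w + cost_to_go(nxt) for nxt, w in EDGES[node].items()),
        default=float("inf"),  # dead end: GOAL cannot be reached
    )


def shortest_path(start: str) -> List[str]:
    """Follow, at each node, the edge that minimizes edge cost + cost-to-go."""
    path = [start]
    while path[-1] != GOAL:
        node = path[-1]
        nxt = min(EDGES[node], key=lambda n: EDGES[node][n] + cost_to_go(n))
        path.append(nxt)
    return path
```

In this graph the direct-looking route S, B, G costs 5, while the route S, A, B, G costs 4, which is what `cost_to_go("S")` returns.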

(16)

Stochastic systems

Again, the requirements for a model:

- situations
- activities
- current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
- goal specification: definition of costs / rewards

(17)

Markov Decision Processes

Deterministic system: 5-tuple $(T, S, U, f, c)$

Stochastic system: the deterministic transition function $f$ is replaced by a conditional probability distribution.

In the following, we consider a finite state set $S = \{1, 2, \dots, n\}$. Let $i, j \in S$ be states.

Notation:

$$P(s_{t+1} = j \mid s_t = i, u_t = u) = p_{ij}(u)$$

⇒ Markov Decision Process (MDP): 5-tuple $(T, S, U, p_{ij}(u), c(s, u))$
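One possible way to store the transition probabilities in Python is a table indexed by action, current state and next state. The sketch below is an assumption-laden illustration: the two-state MDP, the action names `"stay"` and `"go"`, and the helpers `is_valid` and `step` are hypothetical, and states are numbered from 0 rather than from 1.

```python
import random
from typing import Dict, List

# P[u][i][j] = p_ij(u); hypothetical two-state example with 0-based states.
P: Dict[str, List[List[float]]] = {
    "stay": [[0.9, 0.1], [0.0, 1.0]],
    "go":   [[0.2, 0.8], [0.0, 1.0]],
}


def is_valid(P: Dict[str, List[List[float]]], tol: float = 1e-9) -> bool:
    """Every row must be a probability distribution over next states."""
    return all(
        min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol
        for rows in P.values()
        for row in rows
    )


def step(i: int, u: str, rng: random.Random) -> int:
    """Sample the next state j with probability p_ij(u)."""
    row = P[u][i]
    return rng.choices(range(len(row)), weights=row, k=1)[0]
```

State 1 is absorbing in this example, so `step(1, u, rng)` always returns 1, whereas the successor of state 0 depends on the random draw.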

(18)

Markov property

It holds that:

$$P(s_{t+1} = j \mid s_t, u_t) = P(s_{t+1} = j \mid s_t, s_{t-1}, \dots, u_t, u_{t-1}, \dots)$$

The probability distribution of the next state $s_{t+1}$ is uniquely determined by the current state $s_t$ and the action $u_t$. In particular, it does not depend on the earlier 'history' of the system.

(19)

Remarks (1)

A deterministic system is a special case of an MDP:

$$P(s_{t+1} \mid s_t, u_t) = \begin{cases} 1, & s_{t+1} = f(s_t, u_t) \\ 0, & \text{otherwise} \end{cases}$$

(20)

Remarks (2)

Equivalent description with a deterministic transition function $f$. Approach: add a random variable $w_t$ (noise) as an additional argument:

$$s_{t+1} = f(s_t, u_t, w_t)$$

where $w_t$ is a random variable with given probability distribution $P(w_t \mid s_t, u_t)$.

Transformation into the previous form:

Let $W(i, u, j) = \{ w \mid j = f(i, u, w) \}$ be the set of all values of $w$ for which the system moves from state $i$ to state $j$ under input $u$.

Then it holds that:

$$p_{ij}(u) = P(w \in W(i, u, j))$$
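The transformation can be sketched in code for the case of a discrete noise variable. Everything named here is an assumption for illustration: the function `transition_probs`, the toy dynamics `f`, and the noise distribution `noise_dist`, which returns a dictionary mapping each noise value w to its probability given the state and action.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def transition_probs(
    f: Callable[[int, int, int], int],
    states: Iterable[int],
    actions: Iterable[int],
    noise_dist: Callable[[int, int], Dict[int, float]],
) -> Dict[Tuple[int, int], Dict[int, float]]:
    """Return p[(i, u)][j] = P(w in W(i, u, j))."""
    actions = list(actions)
    p = {}
    for i in states:
        for u in actions:
            row = defaultdict(float)
            for w, prob in noise_dist(i, u).items():
                # w belongs to W(i, u, j) for j = f(i, u, w)
                row[f(i, u, w)] += prob
            p[(i, u)] = dict(row)
    return p


# Hypothetical toy dynamics: stock level clipped to the range 0..2.
def f(i: int, u: int, w: int) -> int:
    return max(0, min(2, i + u - w))


def noise_dist(i: int, u: int) -> Dict[int, float]:
    return {0: 0.5, 1: 0.5}  # w is 0 or 1 with equal probability


p = transition_probs(f, range(3), (0, 1), noise_dist)
```

From state 0 with action 0 both noise values lead back to state 0, so the two probabilities are added up and `p[(0, 0)]` contains the single entry for state 0 with probability 1.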

(22)

Summary: MDPs

Discrete decision points ('stages'): $t \in T = \{0, 1, \dots, N\}$ or $T = \{0, 1, \dots\}$

System state (situation): $s_t \in S$

Actions: $u_t \in U$

Transition probabilities: $p_{ij}(u) = P(s_{t+1} = j \mid s_t = i, u_t = u)$; alternatively, a transition function $s_{t+1} = f(s_t, u_t, w_t)$ with random variable $w_t$

Direct costs: $c : S \times U \to \mathbb{R}$

⇒ 5-tuple $(T, S, U, p_{ij}(u), c(s, u))$

(23)

Summary: MDPs

- Model: state, action, next state
- Deterministic and stochastic transition functions
- Information about the 'history' is summarized in the state
- Very general description: operations research (OR), control engineering, games, ...
- Generalizations (not covered here):
  - Non-stationary transition function $p_{ij,t}(u)$
  - Non-stationary costs $c_t(i, u)$

(24)

Example stock keeping

Assume you own a toy shop at an exhibition. The exhibition lasts $N$ days.

State: number of toys in your shop, $s_t$

Action: number of toys ordered for delivery on the next day, $u_t$

'Disturbance': number of toys sold, $w_t$

⇒ System equation: $s_{t+1} = s_t + u_t - w_t$

Costs: storage costs for toys in stock, plus acquisition costs for each toy ordered, minus the gain from toys sold:

$$c(s, u) = c_1(s) + c_2(u) - \text{gain}$$

There are also terminal costs $g(s)$ if toys are still in stock after the $N$ days.
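A single day of this example could be coded roughly as follows. This is only a sketch: the linear cost terms, the three price constants and the rule that no more toys can be sold than are in stock are assumptions added for illustration and are not specified in the lecture. Note also that the gain depends on the number of toys sold, so the cost of a day in this sketch depends on the disturbance as well.

```python
from typing import Tuple

# Hypothetical cost parameters (assumptions, not from the lecture).
HOLDING_COST = 0.1  # c1(s) = HOLDING_COST * s
UNIT_PRICE = 1.0    # c2(u) = UNIT_PRICE * u
SALE_PRICE = 2.0    # gain  = SALE_PRICE * number of toys sold


def step(s: int, u: int, demand: int) -> Tuple[int, float]:
    """Simulate one day; returns (next stock level, cost of the day)."""
    w = min(demand, s)  # assumption: sales are limited by the current stock
    s_next = s + u - w  # system equation s_{t+1} = s_t + u_t - w_t
    cost = HOLDING_COST * s + UNIT_PRICE * u - SALE_PRICE * w
    return s_next, cost


def terminal_cost(s: int) -> float:
    """g(s): assumed here to be the purchase value of the leftover toys."""
    return UNIT_PRICE * s
```

For example, with 5 toys in stock, an order of 2 and a demand of 3, the next stock level is 4 and the cost of the day is negative, meaning a net profit.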

(27)

Policy and selection function

Policy:

The selection function $\pi_t : S \to U$, $\pi_t(s) = u$, chooses at time $t$ an action $u \in U$ as a function of the current state $s \in S$.

⇒ The selection function chooses an action depending on the situation (see the 'agent' graphic).

Refinement: $\pi_t : S \to U$, $\pi_t(s) = u$ with $u \in U(s)$, a situation-dependent action set (example: chess).

A policy $\hat{\pi}$ consists of $N$ selection functions ($N$ being the number of decision points):

$$\hat{\pi} = (\pi_0, \pi_1, \dots, \pi_t, \dots)$$

(30)

Non-stationary policies

The selection function $\pi_t$ can depend on the time of the decision.

Meaning: the same situation at different points in time can lead to different decisions by the agent.

$$\hat{\pi} = (\pi_0, \pi_1, \dots, \pi_t, \dots)$$

If the selection functions differ between time points, we call $\hat{\pi}$ a non-stationary policy.

Example (soccer): situation $s$: a midfield player has the ball.

- Reasonable action in the first minute: $\pi_1(s)$ = return pass
- Reasonable action in the last minute: $\pi_{90}(s)$ = shoot on goal

General rationale: a limited optimization time frame ('finite horizon', see below) usually requires a non-stationary policy!

(32)

Stationary policies

We will mostly look at stationary policies.

Then it holds that $\pi_0 = \pi_1 = \dots = \pi_t = \dots =: \pi$ and

$$\hat{\pi} = (\pi, \pi, \dots, \pi, \dots)$$

With stationary policies, the terms 'policy' and 'selection function' become interchangeable.

We will call the selection function $\pi$ our policy, as is common in the literature.

Bertsekas uses the symbol $\mu$ for the selection function, so the notation here differs slightly from his.

Remark: in the following, only deterministic selection functions are used.

(33)

Goal of the policy

Reach the optimization goal over multiple stages (sequence of decisions)

⇒Solving a dynamic optimization problem

(34)

Cumulated costs (costs-to-go)

Of interest: the cumulated costs for a given state $s$ under a given policy $\pi$:

$$J^{\pi}(s) = \sum_{t \in T} c(s_t, \pi(s_t)), \quad s_0 = s$$

Wanted: an optimal policy $\pi^*$ such that for all $s$ it holds that:

$$J^{\pi^*}(s) = \min_{\pi \in \Pi} \sum_{t \in T} c(s_t, \pi(s_t)), \quad s_0 = s$$

under the constraint that $s_{t+1} = f(s_t, u_t)$.

(36)

Cumulated costs in MDPs

Expected cumulated costs for a given state $s$ under a given policy $\pi$:

$$J^{\pi}(s) = E_w\Big[ \sum_{t \in T} c(s_t, \pi(s_t)) \Big], \quad s_0 = s$$

Wanted: an optimal policy $\pi^*$ such that for all $s$ it holds that:

$$J^{\pi^*}(s) = \min_{\pi \in \Pi} E_w\Big[ \sum_{t \in T} c(s_t, \pi(s_t)) \Big], \quad s_0 = s$$

under the constraint that $s_{t+1} = f(s_t, u_t, w_t)$, or with the given probability distribution $P(s_{t+1} = j \mid s_t = i, u_t = u) = p_{ij}(u)$.

(38)

Problem types

Definition (horizon): the horizon $N$ of a problem denotes the number of decision stages to be traversed.

- Finite horizon: problems with a given termination time
- Infinite horizon: approximation for very long processes or processes with an unknown end (e.g. control systems)

(39)

Finite horizon

- $N$-stage decision problem
- Each state $i$ has terminal costs $g(i)$, which are incurred if the system ends in $i$ after $N$ stages.
- Costs of a policy $\pi$:

$$J_N^{\pi}(s) = E\Big[ g(s_N) + \sum_{t=0}^{N-1} c(s_t, \pi_t(s_t)) \,\Big|\, s_0 = s \Big]$$

- Generally: non-stationary policy
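For a fixed policy, the expectation in the finite-horizon cost can be computed exactly by working backwards from the terminal costs. The sketch below is one way to do this under stated assumptions: the function `evaluate_policy`, the two-state transition table `P`, the cost function `c` and the terminal costs `g` are all hypothetical, and states are numbered from 0.

```python
from typing import Callable, Dict, List

# Hypothetical MDP: P[u][i][j] = p_ij(u), states 0 and 1.
P: Dict[str, List[List[float]]] = {
    "stay": [[0.9, 0.1], [0.0, 1.0]],
    "go":   [[0.2, 0.8], [0.0, 1.0]],
}


def evaluate_policy(
    P: Dict[str, List[List[float]]],
    c: Callable[[int, str], float],
    g: List[float],
    policy: Callable[[int, int], str],
    N: int,
) -> List[float]:
    """Return the expected N-stage cost of `policy` for every start state."""
    n = len(g)
    V = list(g)  # costs after the last stage: V_N(i) = g(i)
    for t in range(N - 1, -1, -1):
        # V_t(i) = c(i, u) + sum_j p_ij(u) * V_{t+1}(j) with u = pi_t(i)
        V = [
            c(i, policy(t, i))
            + sum(P[policy(t, i)][i][j] * V[j] for j in range(n))
            for i in range(n)
        ]
    return V


def c(i: int, u: str) -> float:
    return 1.0 if i == 0 else 0.0  # being in state 0 costs one unit per stage


J = evaluate_policy(P, c, g=[5.0, 0.0], policy=lambda t, i: "go", N=2)
```

With two stages and the policy that always chooses `"go"`, the expected cost from state 0 works out to 1.4, and state 1 incurs no cost at all.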

(40)

Infinite horizon

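- Costs of a policy $\pi$:

$$J^{\pi}(s) = \lim_{N \to \infty} E\Big[ \sum_{t=0}^{N} c(s_t, \pi_t(s_t)) \,\Big|\, s_0 = s \Big]$$

- Problem: are the costs finite?
- Solution: discount factor $\alpha < 1$:

$$J^{\pi}(s) = \lim_{N \to \infty} E\Big[ \sum_{t=0}^{N} \alpha^t c(s_t, \pi_t(s_t)) \,\Big|\, s_0 = s \Big]$$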
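Why discounting keeps the costs finite can be seen from a geometric series. The following is a short sketch, assuming in addition that $0 \le \alpha < 1$ and writing $c_{\max}$ (a symbol introduced only for this remark) for the largest absolute direct cost, which exists because $S$ and $U$ are finite:

```latex
\left| \sum_{t=0}^{N} \alpha^t c(s_t, \pi_t(s_t)) \right|
\;\le\; \sum_{t=0}^{N} \alpha^t c_{\max}
\;=\; c_{\max} \, \frac{1 - \alpha^{N+1}}{1 - \alpha}
\;\le\; \frac{c_{\max}}{1 - \alpha}
```

The bound holds for every trajectory and every $N$, so the expected discounted costs stay below $c_{\max} / (1 - \alpha)$ in absolute value as $N \to \infty$.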

(42)

Solution of dynamic optimization problems

Central question: How do we find the policy that leads (on average) to minimal costs?

Solution method: Dynamic Programming (Bellman, 1957)

- Backward Dynamic Programming (finite horizon)
- Value Iteration (LU 3 ff., infinite horizon)
- Policy Iteration (LU 3 ff., infinite horizon)

(43)

Backward Dynamic Programming - idea

Problem: stochastic multi-stage decision problems with finite horizon

Idea: calculate the costs backwards, starting from the last stage and proceeding to the first.

Example: Find the cheapest path in a graph

(44)

Backward Dynamic Programming - problem specification (1)

- finite horizon $N$
- MDP:
  - $N$ discrete decision points: $t \in T = \{0, 1, \dots, N\}$
  - finite state set: $s_t \in S = \{1, 2, \dots, n\}$
  - finite action set: $u_t \in U = \{u_1, \dots, u_m\}$
  - transition probabilities: $p_{ij}(u) = P(s_{t+1} = j \mid s_t = i, u_t = u)$
  - direct costs: $c : S \times U \to \mathbb{R}$
- in the last stage $N$, every state causes terminal costs $g(s_N) := c_N(s_N)$

(45)

Backward Dynamic Programming - objective

Wanted: $\pi^*$ with $J^{\pi^*} = \min_{\pi} J^{\pi}$, where

$$J_N^{\pi}(i) = E\Big[ g(s_N) + \sum_{t=0}^{N-1} c(s_t, \pi_t(s_t)) \,\Big|\, s_0 = i \Big]$$

The costs belonging to $\pi^*$ are called the optimal cumulated costs $J^* := J^{\pi^*}$.

Approach:

1. Calculate the optimal cumulated costs ('cost-to-go') $J_k^*(\cdot)$ for all states ($J_k^*(\cdot)$ is an $n$-dimensional vector). $k$ is the number of remaining steps.
2. From $J_k^*$ follows the optimal policy for the $k$-step problem ($k$ steps until the process terminates).

(47)

Backward Dynamic Programming - motivation

Thesis (Bellman's Principle of Optimality):

If there are $k$ more steps to go, the optimal costs for a state $i$ are given by the minimal expected value of the sum of

- the direct transition costs, and
- the optimal cumulated costs of the next state, with $k - 1$ more steps to go from there.

The minimization is over all possible actions.

(48)

Bellman’s Principle of Optimality

Formally: for the optimal cumulated costs $J_k^*(i)$ of the $k$-stage decision problem, it holds that:

\begin{align}
J_k^*(i) &= \min_{u \in U(i)} E_{w_k}\big\{ c(i, u) + J_{k-1}^*(f(i, u, w_k)) \big\} \notag \\
&= \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( c(i, u) + J_{k-1}^*(j) \big), \quad i = 1, \dots, n \tag{1}
\end{align}

Hence we can calculate the optimal cumulated costs of the $N$-stage optimization problem recursively, starting with $k = 0$.

⇒ Backward DP algorithm
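The sum in equation (1) can be simplified, because the direct costs $c(i, u)$ do not depend on the next state $j$ and the transition probabilities sum to one. A short worked step, using only symbols defined above and writing $J_{k-1}$ for the optimal cumulated costs with $k - 1$ steps to go:

```latex
\sum_{j=1}^{n} p_{ij}(u) \bigl( c(i,u) + J_{k-1}(j) \bigr)
= c(i,u) \sum_{j=1}^{n} p_{ij}(u) + \sum_{j=1}^{n} p_{ij}(u) \, J_{k-1}(j)
= c(i,u) + \sum_{j=1}^{n} p_{ij}(u) \, J_{k-1}(j)
```

This is the form in which the recursion appears in the last line of the proof.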

(49)

Bellman’s Principle of Optimality - proof (1)

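Policy $\hat{\pi}^{(k)}$ for $k$ stages: $\hat{\pi}^{(k)} = (\pi_k, \pi_{k-1}, \pi_{k-2}, \dots) = (\pi_k, \hat{\pi}^{(k-1)})$

Let $S^{(k)}(i) = (s_{N-k} = i, s_{(N-k)+1}, \dots, s_N)$ be a possible state sequence starting in state $i$ with $k$ transitions.

\begin{align}
J_k^*(i) &= \min_{\hat{\pi}^{(k)}} J_k^{\hat{\pi}^{(k)}}(i) \tag{2} \\
&= \min_{\hat{\pi}^{(k)}} \Big\{ \sum_{S^{(k)}(i)} P(S^{(k)}(i) \mid \hat{\pi}^{(k)}) \Big( \sum_{l=1}^{k} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \tag{3} \\
&= \min_{\hat{\pi}^{(k)}} \Big\{ c(i, \pi_k(i)) + \sum_{S^{(k)}(i)} P(S^{(k)}(i) \mid \hat{\pi}^{(k)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \tag{4}
\end{align}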

(54)

Bellman’s Principle of Optimality - proof (2)

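\begin{align}
&= \min_{\hat{\pi}^{(k)}} \Big\{ c(i, \pi_k(i)) + \sum_{S^{(k)}(i)} P(S^{(k)}(i) \mid \hat{\pi}^{(k)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \tag{5} \\
&= \min_{\hat{\pi}^{(k)}} \Big\{ c(i, \pi_k(i)) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, \pi_k) \notag \\
&\qquad \cdot \sum_{S^{(k-1)}(j)} P(S^{(k-1)}(j) \mid \hat{\pi}^{(k-1)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \tag{6}
\end{align}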

(57)

Bellman’s Principle of Optimality - proof (3)

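\begin{align}
&= \min_{\hat{\pi}^{(k)}} \Big\{ c(i, \pi_k(i)) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, \pi_k) \notag \\
&\qquad \cdot \sum_{S^{(k-1)}(j)} P(S^{(k-1)}(j) \mid \hat{\pi}^{(k-1)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \tag{8} \\
&= \min_{u \in U(i)} \Big\{ c(i, u) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, u) \notag \\
&\qquad \cdot \min_{\hat{\pi}^{(k-1)}} \Big\{ \sum_{S^{(k-1)}(j)} P(S^{(k-1)}(j) \mid \hat{\pi}^{(k-1)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \Big\} \tag{9}
\end{align}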

(60)

Bellman’s Principle of Optimality - proof (4)

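\begin{align}
&= \min_{u \in U(i)} \Big\{ c(i, u) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, u) \notag \\
&\qquad \cdot \min_{\hat{\pi}^{(k-1)}} \Big\{ \sum_{S^{(k-1)}(j)} P(S^{(k-1)}(j) \mid \hat{\pi}^{(k-1)}) \Big( \sum_{l=1}^{k-1} c(s_{N-l}, \pi_l(s_{N-l})) + g(s_N) \Big) \Big\} \Big\} \tag{11} \\
&= \min_{u \in U(i)} \Big\{ c(i, u) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, u) \cdot \min_{\hat{\pi}^{(k-1)}} \big\{ J_{k-1}^{\hat{\pi}^{(k-1)}}(j) \big\} \Big\} \tag{12} \\
&= \min_{u \in U(i)} \Big\{ c(i, u) + \sum_{j \in S} P(s_{(N-k)+1} = j \mid s_{N-k} = i, u) \cdot J_{k-1}^*(j) \Big\} \tag{13} \\
&= \min_{u \in U(i)} \Big\{ c(i, u) + \sum_{j \in S} p_{ij}(u) \cdot J_{k-1}^*(j) \Big\} \tag{14}
\end{align}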

(65)

Backward Dynamic Programming - algorithm

$k = 0$:

$$J_0^*(i) = g(i)$$

For $k = 1$ to $N$, for all $i \in S$:

$$J_k^*(i) = \min_{u \in U(i)} E_{w_k}\big\{ c(i, u) + J_{k-1}^*(f(i, u, w_k)) \big\}$$

or

$$J_k^*(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( c(i, u) + J_{k-1}^*(j) \big)$$
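The recursion above translates almost line by line into code. The following Python sketch is one possible implementation, not the lecture's own: the function name `backward_dp`, the two-state example MDP, its costs and its terminal costs are hypothetical, states are numbered from 0, and every action is assumed to be admissible in every state, so that U(i) = U.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical MDP: P[u][i][j] = p_ij(u), states 0 and 1.
P: Dict[str, List[List[float]]] = {
    "stay": [[0.9, 0.1], [0.0, 1.0]],
    "go":   [[0.2, 0.8], [0.0, 1.0]],
}


def backward_dp(
    P: Dict[str, List[List[float]]],
    c: Callable[[int, str], float],
    g: List[float],
    N: int,
) -> Tuple[List[List[float]], List[Optional[List[str]]]]:
    """Return (J, policy): J[k][i] is the optimal cost with k steps to go,
    policy[k][i] a minimizing action (policy[0] is None: nothing to decide)."""
    n = len(g)
    J: List[List[float]] = [list(g)]  # k = 0: J_0(i) = g(i)
    policy: List[Optional[List[str]]] = [None]
    for k in range(1, N + 1):
        J_k, pi_k = [], []
        for i in range(n):
            # expected cost of each action: c(i,u) + sum_j p_ij(u) * J_{k-1}(j)
            q = {
                u: c(i, u) + sum(P[u][i][j] * J[k - 1][j] for j in range(n))
                for u in P
            }
            best = min(q, key=q.get)
            J_k.append(q[best])
            pi_k.append(best)
        J.append(J_k)
        policy.append(pi_k)
    return J, policy


def c(i: int, u: str) -> float:
    if i == 1:
        return 0.0  # state 1 is free of charge
    return 1.0 if u == "stay" else 2.0


J, policy = backward_dp(P, c, g=[10.0, 0.0], N=2)
```

In this example ending in state 0 is expensive, so the more costly action `"go"` is still optimal in state 0 for both k = 1 and k = 2, giving optimal costs of 4.0 and 2.8 respectively.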

(66)

Choosing an action

Requirement: $J_k^*(i)$ is known for all $k \le N$.

Approach: for every possible action, calculate the expected costs and choose the best action (the one with minimal expected cumulated costs):

$$\pi_k(i) \in \arg\min_{u \in U(i)} E_{w_k}\big\{ c(i, u) + J_{k-1}^*(f(i, u, w_k)) \big\}$$

⇒ The chosen optimal action minimizes the sum of the expected transition costs and the expected cumulated costs of the remaining problem.

Remark:

- $J_k^*$ defines an optimal policy
- The policy is not unique, but $J_k^*$ is
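The remark that the policy need not be unique can be made concrete with a small sketch. It assumes a hypothetical two-state MDP and a given cost-to-go vector `J_prev` for the remaining k - 1 steps; the function name `greedy_actions` and the tolerance `tol` used to detect ties are illustrative choices.

```python
from typing import Callable, Dict, List

# Hypothetical MDP: P[u][i][j] = p_ij(u), states 0 and 1.
P: Dict[str, List[List[float]]] = {
    "stay": [[0.9, 0.1], [0.0, 1.0]],
    "go":   [[0.2, 0.8], [0.0, 1.0]],
}


def c(i: int, u: str) -> float:
    if i == 1:
        return 0.0
    return 1.0 if u == "stay" else 2.0


def greedy_actions(
    i: int,
    J_prev: List[float],
    P: Dict[str, List[List[float]]],
    c: Callable[[int, str], float],
    tol: float = 1e-12,
) -> List[str]:
    """All actions that minimize c(i,u) + sum_j p_ij(u) * J_prev[j]."""
    q = {
        u: c(i, u) + sum(p * J for p, J in zip(P[u][i], J_prev))
        for u in P
    }
    best = min(q.values())
    return [u for u in P if q[u] <= best + tol]


J_prev = [4.0, 0.0]  # assumed optimal costs with k - 1 steps to go
```

In state 0 only `"go"` is optimal, but in state 1 both actions have the same expected cost, so either may be chosen: the optimal costs are unique while the optimal policy is not.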

(68)

Remarks

- Complexity for deterministic systems: $O(N \cdot n \cdot m)$
- Complexity for stochastic systems: $O(N \cdot n^2 \cdot m)$
- Exact solution rarely computable ⇒ numeric solution; but: very complex!

($N$ = number of stages, $n$ = number of states, $m$ = number of actions)
