Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same sensors/effectors
One Example: TD-Gammon
Tesauro, 1995
Learn to play Backgammon
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: agent–environment interaction loop. The agent observes state s_t from the environment, chooses action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: learn to choose actions that maximize
    r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
Markov Decision Process
Assume
• finite set of states S
• set of actions A
• at each discrete time t, agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
– i.e., r_t and s_{t+1} depend only on current state and action
– functions δ and r may be nondeterministic
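For concreteness, a minimal Python sketch of such a deterministic MDP, using a hypothetical three-state world (the state names, actions, and reward values are invented for illustration, not taken from the slides):

```python
# Hypothetical deterministic MDP: finite states S, actions A,
# transition function delta (δ) and reward function r as lookup tables.
S = ["s1", "s2", "G"]                # G plays the role of an absorbing goal state
A = ["left", "right"]

delta = {                            # δ(s, a) -> next state
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "G",
    ("G",  "left"): "G",  ("G",  "right"): "G",
}
r = {(s, a): 0 for s in S for a in A}
r[("s2", "right")] = 100             # only the action entering G is rewarded

# One step under the Markov assumption: s_{t+1} and r_t depend only on (s_t, a_t).
s, a = "s1", "right"
s_next, reward = delta[(s, a)], r[(s, a)]
```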
Agent’s Learning Task
Execute actions in environment, observe results, and
• learn action policy π : S → A that maximizes

    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]

  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
• target function is π : S → A
• but we have no training examples of form <s,a>
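For concreteness (numbers invented for illustration): with γ = 0.9 and reward sequence r_t = 0, r_{t+1} = 0, r_{t+2} = 100, 0, 0, ..., the quantity being maximized is

    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] = 0 + 0.9·0 + 0.81·100 = 81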
Value Function
To begin, consider deterministic worlds …
For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s
Restated, the task is to learn the optimal policy π*
    π* ≡ argmax_π V^π(s), (∀s)
[Figure: grid world with absorbing goal state G. Panels show r(s,a) (immediate reward) values (100 for actions entering G, 0 elsewhere), Q(s,a) values (e.g., 72, 81, 90, 100), V*(s) values, and one optimal policy.]
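As a quick check on the figure's numbers (assuming the discount factor γ = 0.9 used in the update example later in these slides): moving toward G earns 0 until the final step's reward of 100, so V* falls off by a factor of γ per step from the goal:

    V*(one step from G)    = 0 + γ·100 = 90
    V*(two steps from G)   = 0 + γ·0 + γ²·100 = 81
    V*(three steps from G) = 0 + γ·0 + γ²·0 + γ³·100 = 72.9 ≈ 72 (rounded in the figure)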
What to Learn
We might try to have agent learn the evaluation function V^{π*} (which we write as V*)
We could then do a lookahead search to choose best action from any state s because
    π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))]
A problem:
• This works well if agent knows δ : S × A → S, and r : S × A → ℜ
• But when it doesn't, we can't choose actions this way
Q Function
Define new function very similar to V*
    Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))

If agent learns Q, it can choose optimal action even without knowing δ!
    π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))]
    π*(s) ≡ argmax_a Q(s,a)
Q is the evaluation function the agent will learn
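A minimal Python sketch of this point, with a hypothetical Q table stored as a dict keyed by (state, action): the greedy policy needs only the learned table, not δ or r.

```python
# Hypothetical learned Q values; keys are (state, action) pairs.
Q = {
    ("s1", "up"): 81, ("s1", "right"): 90,
    ("s2", "left"): 81, ("s2", "right"): 100,
}

def pi_star(state, actions):
    """pi*(s) = argmax_a Q(s, a) -- no model of delta or r required."""
    return max(actions, key=lambda a: Q[(state, a)])

print(pi_star("s1", ["up", "right"]))   # -> 'right'
```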
Training Rule to Learn Q
Note Q and V* closely related:

    V*(s) = max_{a'} Q(s,a')

Which allows us to write Q recursively as

    Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))
               = r(s_t,a_t) + γ max_{a'} Q(s_{t+1},a')
Let Q̂ denote the learner's current approximation to Q. Consider training rule

    Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

where s' is the state resulting from applying action a in state s
Q Learning for Deterministic Worlds
For each s,a initialize table entry Q̂(s,a) ← 0
Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s,a) as follows:

    Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

• s ← s'
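A short Python sketch of this loop, assuming the environment is exposed as hypothetical functions delta(s, a) and r(s, a) (names invented here; the slides only assume the agent can execute actions and observe the results), with the "do forever" loop truncated to a fixed number of steps:

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, r, gamma=0.9, steps=10000):
    """Tabular Q-learning for a deterministic world."""
    Q = defaultdict(float)                     # Q-hat(s,a) initialized to 0
    s = random.choice(states)                  # observe current state s
    for _ in range(steps):
        a = random.choice(actions)             # select an action and execute it (random exploration)
        reward, s_next = r(s, a), delta(s, a)  # immediate reward and new state s'
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next
    return Q
```

With the toy MDP sketched earlier, this could be invoked as q_learning(S, A, lambda s, a: delta[(s, a)], lambda s, a: r[(s, a)]).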
Updating Q̂

[Figure: one step in the grid world — taking action a_right moves the agent from initial state s1 to next state s2; the current Q̂ values for the actions available from s2 are 63, 81, and 100.]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    = 90

Notice if rewards non-negative, then

    ∀ s, a, n:  Q̂_{n+1}(s,a) ≥ Q̂_n(s,a)
and
    ∀ s, a, n:  0 ≤ Q̂_n(s,a) ≤ Q(s,a)
Convergence
Q̂ converges to Q. Consider case of deterministic world where each <s,a> is visited infinitely often.

Proof: define a full interval to be an interval during which each <s,a> is visited. During each full interval the largest error in the Q̂ table is reduced by factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

    Δ_n = max_{s,a} |Q̂_n(s,a) − Q(s,a)|
Convergence (cont)
For any table entry Q̂_n(s,a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s,a) is

    |Q̂_{n+1}(s,a) − Q(s,a)| = |(r + γ max_{a'} Q̂_n(s',a')) − (r + γ max_{a'} Q(s',a'))|
                             = γ |max_{a'} Q̂_n(s',a') − max_{a'} Q(s',a')|
                             ≤ γ max_{a'} |Q̂_n(s',a') − Q(s',a')|
                             ≤ γ max_{s'',a'} |Q̂_n(s'',a') − Q(s'',a')|
    |Q̂_{n+1}(s,a) − Q(s,a)| ≤ γ Δ_n

Note we used general fact that

    |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
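As a quick numeric check of that fact (values invented for illustration): for two actions with f1 = (3, 5) and f2 = (4, 1),

    |max_a f1(a) − max_a f2(a)| = |5 − 4| = 1  ≤  max_a |f1(a) − f2(a)| = max{1, 4} = 4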
Nondeterministic Case
What if reward and next state are non-deterministic?

We redefine V, Q by taking expected values

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
           ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

    Q(s,a) ≡ E[r(s,a) + γ V*(δ(s,a))]

Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to

    Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [r + γ max_{a'} Q̂_{n−1}(s',a')]

where

    α_n = 1 / (1 + visits_n(s,a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
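A Python sketch of this decaying-α update, keeping a visit counter per (s,a) pair (helper and variable names are invented for illustration):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-hat(s,a), initialized to 0
visits = defaultdict(int)     # visits_n(s,a)

def nondeterministic_update(s, a, reward, s_next, actions, gamma=0.9):
    """Move Q-hat(s,a) toward the sampled target instead of replacing it."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```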
Temporal Difference Learning
Q learning: reduce discrepancy between successive Q estimates

One step time difference:

    Q^(1)(s_t,a_t) ≡ r_t + γ max_a Q̂(s_{t+1},a)

Why not two steps?

    Q^(2)(s_t,a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)

Or n?

    Q^(n)(s_t,a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n},a)

Blend all of these:

    Q^λ(s_t,a_t) ≡ (1 − λ)[Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ...]

Temporal Difference Learning
    Q^λ(s_t,a_t) ≡ (1 − λ)[Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ...]

Equivalent expression:

    Q^λ(s_t,a_t) ≡ r_t + γ[(1 − λ) max_a Q̂(s_{t+1},a) + λ Q^λ(s_{t+1},a_{t+1})]

TD(λ) algorithm uses above training rule
• Sometimes converges faster than Q learning
• converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm
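A small Python sketch of the blended estimate, computed from a hypothetical recorded trajectory of (state, action, reward) triples and the current Q̂ table; the infinite sum over n is truncated at the end of the trajectory purely for illustration:

```python
def n_step_q(traj, Q, actions, t, n, gamma=0.9):
    """Q^(n)(s_t,a_t): n discounted rewards plus a gamma^n max-Q bootstrap at s_{t+n}."""
    ret = sum(gamma**i * traj[t + i][2] for i in range(n))        # r_t ... r_{t+n-1}
    s_boot = traj[t + n][0]                                       # s_{t+n}
    return ret + gamma**n * max(Q[(s_boot, a)] for a in actions)

def lambda_q(traj, Q, actions, t, lam=0.5, gamma=0.9):
    """Q^lambda(s_t,a_t): (1-lambda)-weighted blend of n-step estimates, truncated at trajectory end."""
    n_max = len(traj) - 1 - t                                     # steps remaining after t
    return (1 - lam) * sum(lam**(n - 1) * n_step_q(traj, Q, actions, t, n, gamma)
                           for n in range(1, n_max + 1))
```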
Subtleties and Ongoing Research
• Replace Q̂ table with neural network or other generalizer
• Handle case where state only partially observable
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use δ̂ : S × A → S, an approximation to δ
RL Summary
• Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states/actions is unknown
• Temporal Difference Learning
– learn discrepancies between successive estimates
– used in TD-Gammon
• V(s) - state value function
RL Summary
• Q(s,a) - state/action value function
– related to V
– does not need reward/state transition functions
– training rule
– related to dynamic programming
– measure actual reward received for action and future value using current Q function
– deterministic - replace existing estimate
– nondeterministic - move table estimate towards measured estimate