Reinforcement Learning


(1)

Reinforcement Learning

• Control learning

• Control policies that choose optimal actions

• Q learning

(2)

Control Learning

Consider learning to choose actions, e.g.,

• Robot learning to dock on battery charger

• Learning to choose actions to optimize factory output

• Learning to play Backgammon

Note several problem characteristics


• Delayed reward

• Opportunity for active exploration

• Possibility that state only partially observable

• Possible need to learn multiple tasks with same sensors/effectors

(3)

One Example: TD-Gammon

Tesauro, 1995

Learn to play Backgammon

Immediate reward:

• +100 if win

• -100 if lose

• 0 for all other states

Trained by playing 1.5 million games against itself

Now approximately equal to best human player

(4)

Reinforcement Learning Problem

[Figure: agent-environment interaction loop. At each step the agent observes the state, chooses an action, and receives a reward.]

Trajectory: s0, r0, a0, s1, r1, a1, s2, r2, a2, ...

Goal: learn to choose actions that maximize r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1

(5)

Markov Decision Process

Assume

• finite set of states S

• set of actions A

• at each discrete time, agent observes state st ∈ S and chooses action at ∈ A

• then receives immediate reward rt

• and state changes to st+1

• Markov assumption: st+1 = δ(st, at) and rt = r(st, at)

– i.e., rt and st+1 depend only on current state and action

– functions δ and r may be nondeterministic

(6)

Agent’s Learning Task

Execute actions in environment, observe results, and

• learn action policy π : S → A that maximizes

E[rt + γ rt+1 + γ² rt+2 + ...]

from any starting state in S

• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:

• target function is π : S → A

• but we have no training examples of form <s,a>; training examples are of form <<s,a>, r>
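As a concrete illustration of the objective, a minimal sketch (with made-up reward values) of computing the discounted return from a finite sequence of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: a delayed reward of +100 three steps ahead, 0 otherwise
print(discounted_return([0, 0, 0, 100]))  # 0.9**3 * 100 ≈ 72.9
```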

(7)

Value Function

To begin, consider deterministic worlds …

For each possible policy π the agent might adopt, we can define an evaluation function over states

Vπ(s) ≡ rt + γ rt+1 + γ² rt+2 + ... ≡ Σi=0..∞ γ^i rt+i

where rt, rt+1, ... are generated by following policy π starting at state s

Restated, the task is to learn the optimal policy π*

π* ≡ argmaxπ Vπ(s), (∀s)

(8)

[Figure: grid world with absorbing goal state G, showing the r(s,a) (immediate reward) values (100 for actions entering G, 0 otherwise), the corresponding Q(s,a) values (72, 81, 90, 100 with γ = 0.9), and one optimal policy.]

(9)

What to Learn

We might try to have the agent learn the evaluation function Vπ* (which we write as V*)

We could then do a lookahead search to choose best action from any state s because

π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]

A problem:

• This works well if the agent knows δ : S × A → S and r : S × A → ℜ

• But when it doesn’t, we can’t choose actions this way
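A minimal sketch of this one-step lookahead, assuming hypothetical dictionaries `delta` and `reward` that encode a known model (exactly what the slide says the agent usually lacks), plus a table `V` of V* values:

```python
GAMMA = 0.9

def greedy_lookahead(s, actions, delta, reward, V):
    """Choose argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ] given a known model."""
    return max(actions, key=lambda a: reward[(s, a)] + GAMMA * V[delta[(s, a)]])
```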

(10)

Q Function

Define new function very similar to V*

Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))

If agent learns Q, it can choose optimal action even without knowing δ!

π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]

π*(s) = argmaxa Q(s,a)

Q is the evaluation function the agent will learn
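A small sketch of acting greedily from a learned Q table (a hypothetical dict keyed by (state, action) pairs), with no need to know δ or r:

```python
def greedy_action(s, actions, Q):
    """pi*(s) = argmax_a Q(s,a): pick the action with the largest learned Q value."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```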

(11)

Training Rule to Learn Q

Note Q and V* closely related:

V*(s) = maxa' Q(s,a')

Which allows us to write Q recursively as

Q(st,at) = r(st,at) + γ V*(δ(st,at))
         = r(st,at) + γ maxa' Q(st+1,a')

Let Q̂ denote the learner's current approximation to Q. Consider the training rule

Q̂(s,a) ← r + γ maxa' Q̂(s',a')

where s' is the state resulting from applying action a in state s

(12)

Q Learning for Deterministic Worlds

For each s,a initialize table entry Q̂(s,a) ← 0

Observe current state s

Do forever:

• Select an action a and execute it

• Receive immediate reward r

• Observe the new state s'

• Update the table entry for Q̂(s,a) as follows:

Q̂(s,a) ← r + γ maxa' Q̂(s',a')

• s ← s'
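A minimal sketch of this loop in Python, assuming a hypothetical environment object `env` with `reset()` and `step(s, a)` methods that play the role of the slides' deterministic δ and r:

```python
import random
from collections import defaultdict

GAMMA = 0.9

def q_learning_deterministic(env, actions, n_steps=10_000):
    """Tabular Q-learning for a deterministic world, following the slide's algorithm."""
    Q = defaultdict(float)          # Q̂(s,a), initialized to 0 for every (s,a)
    s = env.reset()                 # observe current state s
    for _ in range(n_steps):
        a = random.choice(actions)  # select an action a and execute it (random exploration)
        r, s_next = env.step(s, a)  # receive immediate reward r, observe new state s'
        # update the table entry: Q̂(s,a) <- r + γ max_a' Q̂(s',a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next                  # s <- s'
    return Q
```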

(13)

Updating Q̂

[Figure: one Q̂ update in the grid world. Taking action aright from initial state s1 leads to next state s2, whose outgoing Q̂ values are 63, 81, and 100.]

Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a')
              = 0 + 0.9 max{63, 81, 100}
              = 90

Notice that if rewards are non-negative, then

(∀ s,a,n) Q̂n+1(s,a) ≥ Q̂n(s,a)

and

(∀ s,a,n) 0 ≤ Q̂n(s,a) ≤ Q(s,a)
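A one-line check of this update, reusing the constants from the figure:

```python
GAMMA = 0.9
q_s2 = [63, 81, 100]          # current Q̂(s2, a') values for the three actions out of s2
print(0 + GAMMA * max(q_s2))  # 90.0, the new Q̂(s1, aright)
```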

(14)

Convergence

Q̂ converges to Q. Consider the case of a deterministic world where each <s,a> is visited infinitely often.

Proof: Define a full interval to be an interval during which each <s,a> is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂n be the table after n updates, and Δn be the maximum error in Q̂n; that is

Δn = maxs,a |Q̂n(s,a) − Q(s,a)|

(15)

Convergence (cont)

For any table entry Q̂n(s,a) updated on iteration n+1, the error in the revised estimate Q̂n+1(s,a) is

|Q̂n+1(s,a) − Q(s,a)| = |(r + γ maxa' Q̂n(s',a')) − (r + γ maxa' Q(s',a'))|
                      = γ |maxa' Q̂n(s',a') − maxa' Q(s',a')|
                      ≤ γ maxa' |Q̂n(s',a') − Q(s',a')|
                      ≤ γ maxs'',a' |Q̂n(s'',a') − Q(s'',a')|

|Q̂n+1(s,a) − Q(s,a)| ≤ γ Δn

Note we used the general fact that

|maxa f1(a) − maxa f2(a)| ≤ maxa |f1(a) − f2(a)|

(16)

Nondeterministic Case

What if reward and next state are non-deterministic? We redefine V and Q by taking expected values

Vπ(s) ≡ E[rt + γ rt+1 + γ² rt+2 + ...] ≡ E[Σi=0..∞ γ^i rt+i]

Q(s,a) ≡ E[r(s,a) + γ V*(δ(s,a))]

(17)

Nondeterministic Case

Q learning generalizes to nondeterministic worlds

Alter training rule to

Q̂n(s,a) ← (1 − αn) Q̂n−1(s,a) + αn [r + γ maxa' Q̂n−1(s',a')]

where

αn = 1 / (1 + visitsn(s,a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
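A minimal sketch of this update with the decaying learning rate, assuming the same hypothetical tabular setup as before:

```python
GAMMA = 0.9

def nondeterministic_q_update(Q, visits, s, a, r, s_next, actions):
    """One update of Q̂ with alpha_n = 1 / (1 + visits_n(s,a)), per the slide's rule."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Q and visits would typically be collections.defaultdict(float) / defaultdict(int)
# tables shared across updates.
```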

(18)

Temporal Difference Learning

Q learning: reduce discrepancy between successive Q estimates

One step time difference:

Q(1)(st,at) ≡ rt + γ maxa Q̂(st+1,a)

Why not two steps?

Q(2)(st,at) ≡ rt + γ rt+1 + γ² maxa Q̂(st+2,a)

Or n?

Q(n)(st,at) ≡ rt + γ rt+1 + ... + γ^(n−1) rt+n−1 + γ^n maxa Q̂(st+n,a)

Blend all of these:

Qλ(st,at) ≡ (1 − λ)[Q(1)(st,at) + λ Q(2)(st,at) + λ² Q(3)(st,at) + ...]

(19)

Temporal Difference Learning

Qλ(st,at) ≡ (1 − λ)[Q(1)(st,at) + λ Q(2)(st,at) + λ² Q(3)(st,at) + ...]

Equivalent expression:

Qλ(st,at) = rt + γ [(1 − λ) maxa Q̂(st+1,a) + λ Qλ(st+1,at+1)]

TD(λ) algorithm uses above training rule

• Sometimes converges faster than Q learning

• converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)

• Tesauro's TD-Gammon uses this algorithm
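A small sketch of the forward-view blend, computing n-step returns from a recorded episode and mixing them with weights (1 − λ)λ^(n−1), truncated at the end of the episode. The episode format (`rewards` plus a `states` list with one extra final state) and the Q̂ table (a defaultdict keyed by (state, action)) are assumptions for illustration:

```python
GAMMA, LAM = 0.9, 0.7

def n_step_return(rewards, states, t, n, Q, actions):
    """Q^(n)(s_t,a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)."""
    ret = sum((GAMMA ** k) * rewards[t + k] for k in range(n))
    ret += (GAMMA ** n) * max(Q[(states[t + n], a)] for a in actions)
    return ret

def lambda_return(rewards, states, t, Q, actions):
    """Q^λ(s_t,a_t): weight the n-step returns by (1-λ) λ^(n-1), truncated at episode end."""
    # states must have one more entry than rewards (it includes the final state)
    N = len(rewards) - t  # number of n-step returns available before running off the episode
    return (1 - LAM) * sum((LAM ** (n - 1)) * n_step_return(rewards, states, t, n, Q, actions)
                           for n in range(1, N + 1))
```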

(20)

Subtleties and Ongoing Research

• Replace Q̂ table with neural network or other generalizer

• Handle case where state only partially observable

• Design optimal exploration strategies

• Extend to continuous action, state

• Learn and use d̂ : S × A → S, an approximation to δ

(21)

RL Summary

• Reinforcement learning (RL)

– control learning

– delayed reward

– possible that the state is only partially observable

– possible that the relationship between states/actions is unknown

• Temporal Difference Learning

– learn discrepancies between successive estimates

– used in TD-Gammon

• V(s) - state value function

(22)

RL Summary

• Q(s,a) - state/action value function

– related to V

– does not need reward/state transition functions

– training rule

– related to dynamic programming

– measure actual reward received for action and future value using current Q function

– deterministic - replace existing estimate

– nondeterministic - move table estimate towards measured estimate
