### Reinforcement Learning

• Control learning

• Control policies that choose optimal actions

• Q learning

### Control Learning

Consider learning to choose actions, e.g.,

• Robot learning to dock on battery charger

• Learning to choose actions to optimize factory output

• Learning to play Backgammon

Note several problem characteristics

• Delayed reward

• Opportunity for active exploration

• Possibility that state only partially observable

• Possible need to learn multiple tasks with same sensors/effectors

### One Example: TD-Gammon

*Tesauro, 1995*

Learn to play Backgammon

Immediate reward

• +100 if win

• -100 if lose

• 0 for all other states

Trained by playing 1.5 million games against itself

Now approximately equal to best human player

### Reinforcement Learning Problem

[Figure: agent–environment interaction loop. At each step the Agent observes *state* $s_t$ from the Environment, emits *action* $a_t$, and receives *reward* $r_t$, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots$]

Goal: learn to choose actions that maximize

$$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots, \ \ \text{where } 0 \le \gamma < 1$$

### Markov Decision Process

Assume

• finite set of states *S*

• set of actions *A*

• at each discrete time, agent observes state $s_t \in S$ and chooses action $a_t \in A$

• then receives immediate reward $r_t$

• and state changes to $s_{t+1}$

• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$

– i.e., $r_t$ and $s_{t+1}$ depend only on current state and action

– functions $\delta$ and $r$ may be nondeterministic

### Agent’s Learning Task

Execute actions in environment, observe results, and

• learn action policy $\pi : S \to A$ that maximizes

$$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$$

from any starting state in $S$

• here $0 \le \gamma < 1$ is the *discount factor* for future rewards

Note something new:

• target function is $\pi : S \to A$

• but we have no training examples of form $\langle s, a \rangle$

### Value Function

To begin, consider deterministic worlds …

For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states

$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $r_t, r_{t+1}, \dots$ are generated by following policy $\pi$ starting at state $s$

Restated, the task is to learn the optimal policy $\pi^*$

$$\pi^* \equiv \operatorname{argmax}_{\pi} V^{\pi}(s), \ (\forall s)$$
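To make these definitions concrete, here is a minimal Python sketch. The 3-state deterministic world (states 0, 1, and an absorbing goal "G", with reward 100 for the single action entering G) is a hypothetical example, not one from the slides; it evaluates $V^{\pi}$ by rollout and recovers $\pi^*$ by brute-force enumeration of all policies:

```python
import itertools

# Hypothetical deterministic world: states 0 and 1 plus an absorbing goal "G".
# r = 100 for the single action entering G, 0 otherwise; gamma = 0.9.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["left", "right"]

def delta(s, a):
    """Deterministic transition function delta(s, a)."""
    if a == "right":
        return 1 if s == 0 else "G"
    return 0                      # "left" moves back to (or stays at) state 0

def r(s, a):
    """Immediate reward r(s, a)."""
    return 100 if (s, a) == (1, "right") else 0

def V_pi(pi, s, horizon=50):
    """V^pi(s) = r_t + gamma*r_{t+1} + ..., following policy pi from state s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == "G":              # absorbing goal: no further reward
            break
        a = pi[s]
        total += discount * r(s, a)
        discount *= GAMMA
        s = delta(s, a)
    return total

# pi* = argmax_pi V^pi(s), found here by enumerating all |A|^|S| policies.
policies = [dict(zip(STATES, acts))
            for acts in itertools.product(ACTIONS, repeat=len(STATES))]
best = max(policies, key=lambda pi: sum(V_pi(pi, s) for s in STATES))
```

Brute-force enumeration is only viable for toy worlds ($|A|^{|S|}$ policies); the Q-learning material that follows is precisely how to avoid it.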

[Figure: a simple grid world with absorbing goal state **G** at top-right. Panels show the *r(s,a)* (immediate reward) values (100 for actions entering **G**, 0 elsewhere), the *Q(s,a)* values (100, 90, 81, 72, …, computed with $\gamma = 0.9$), the $V^*(s)$ values (100, 90, 81), and one optimal policy.]
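The figure's values can be reproduced with value iteration. The sketch below assumes the grid is the 2×3 layout shown (goal **G** at top-right, reward 100 for any action entering **G**, $\gamma = 0.9$):

```python
# Value iteration on the assumed 2x3 grid world: absorbing goal G at top-right,
# reward 100 for entering G, gamma = 0.9.  States are (row, col) pairs.
GAMMA = 0.9
ROWS, COLS = 2, 3
G = (0, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(s, a):
    """Deterministic transitions; moves off the grid leave the state unchanged."""
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s

def r(s, a):
    """Immediate reward: 100 for any action that enters the absorbing goal G."""
    return 100 if s != G and delta(s, a) == G else 0

states = [(i, j) for i in range(ROWS) for j in range(COLS)]
V = {s: 0.0 for s in states}
for _ in range(50):   # iterate V(s) <- max_a [ r(s,a) + gamma * V(delta(s,a)) ]
    V = {s: 0.0 if s == G else
            max(r(s, a) + GAMMA * V[delta(s, a)] for a in MOVES)
         for s in states}

# Q(s,a) = r(s,a) + gamma * V*(delta(s,a))
Q = {(s, a): r(s, a) + GAMMA * V[delta(s, a)]
     for s in states if s != G for a in MOVES}
```

Running this recovers the figure's $V^*$ values: 100 for the two cells adjacent to **G**, then 90 and 81 one and two steps further out.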

### What to Learn

We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)

We could then do a lookahead search to choose best action from any state $s$ because

$$\pi^*(s) \equiv \operatorname{argmax}_a \left[r(s,a) + \gamma V^*(\delta(s,a))\right]$$

A problem:

• This works well if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \Re$

• But when it doesn't, we can't choose actions this way

### *Q* Function

Define new function very similar to $V^*$:

$$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$

If agent learns $Q$, it can choose optimal action even without knowing $\delta$!

$$\pi^*(s) \equiv \operatorname{argmax}_a \left[r(s,a) + \gamma V^*(\delta(s,a))\right] \equiv \operatorname{argmax}_a Q(s,a)$$

$Q$ is the evaluation function the agent will learn

### Training Rule to Learn *Q*

Note $Q$ and $V^*$ closely related:

$$V^*(s) = \max_{a'} Q(s, a')$$

Which allows us to write $Q$ recursively as

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$

Let $\hat{Q}$ denote the learner's current approximation to $Q$. Consider training rule

$$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$

where $s'$ is the state resulting from applying action $a$ in state $s$

### *Q* Learning for Deterministic Worlds

For each $s, a$ initialize table entry $\hat{Q}(s,a) \leftarrow 0$

Observe current state $s$

Do forever:

• Select an action $a$ and execute it

• Receive immediate reward $r$

• Observe the new state $s'$

• Update the table entry for $\hat{Q}(s,a)$ as follows:

$$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$

• $s \leftarrow s'$
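This loop transcribes directly into Python. The environment here is an assumption for illustration (the 2×3 grid world with absorbing goal **G** at top-right, reward 100 for entering **G**, $\gamma = 0.9$; any small deterministic world works), and exploration is uniformly random:

```python
import random

# Tabular Q-learning for a deterministic world, transcribing the loop above.
# Assumed environment: 2x3 grid, absorbing goal G at top-right, reward 100
# for entering G, gamma = 0.9.  Exploration is uniformly random.
random.seed(0)
GAMMA = 0.9
ROWS, COLS = 2, 3
G = (0, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Execute action a in state s; return (next state s', immediate reward r)."""
    nr, nc = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    s2 = (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else s
    return s2, (100 if s != G and s2 == G else 0)

states = [(i, j) for i in range(ROWS) for j in range(COLS)]
Q = {(s, a): 0.0 for s in states for a in MOVES}  # initialize table entries to 0

for _ in range(500):                    # episodes; restart when G is reached so
    s = random.choice(states)           # every <s,a> keeps getting visited
    while s != G:
        a = random.choice(list(MOVES))  # select an action and execute it
        s2, rew = step(s, a)
        # training rule: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = rew + GAMMA * max(Q[(s2, a2)] for a2 in MOVES)
        s = s2
```

Because the world is deterministic, each update overwrites the table entry outright; after enough random episodes the entries settle to the values in the earlier figure (100, 90, 81, …).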

### Updating

[Figure: one step in the grid world, taking action $a_{right}$ from initial state $s_1$ to next state $s_2$, whose current $\hat{Q}$ entries are 63, 81, and 100.]

$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$

notice if rewards non-negative, then

$$(\forall s, a, n) \ \ \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$$

and

$$(\forall s, a, n) \ \ 0 \le \hat{Q}_n(s,a) \le Q(s,a)$$
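The arithmetic of that single update, checked in Python (values taken from the example above):

```python
# One application of the training rule: taking a_right from s1 gives r = 0 and
# lands in s2, whose current Q-hat entries are 63, 81, and 100.
GAMMA = 0.9
r = 0
q_s2 = [63, 81, 100]                # current Q-hat(s2, a') for each action a'
q_s1_right = r + GAMMA * max(q_s2)  # 0 + 0.9 * 100
```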

### Convergence

$\hat{Q}$ converges to $Q$. Consider case of deterministic world where each $\langle s,a \rangle$ visited infinitely often.

Proof: define a full interval to be an interval during which each $\langle s,a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by factor of $\gamma$.

Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is

$$\Delta_n = \max_{s,a} \left|\hat{Q}_n(s,a) - Q(s,a)\right|$$

### Convergence (cont)

For any table entry $\hat{Q}_n(s,a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s,a)$ is

$$\begin{aligned}
\left|\hat{Q}_{n+1}(s,a) - Q(s,a)\right| &= \left|\left(r + \gamma \max_{a'} \hat{Q}_n(s',a')\right) - \left(r + \gamma \max_{a'} Q(s',a')\right)\right| \\
&= \gamma \left|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\right| \\
&\le \gamma \max_{a'} \left|\hat{Q}_n(s',a') - Q(s',a')\right| \\
&\le \gamma \max_{s'',a'} \left|\hat{Q}_n(s'',a') - Q(s'',a')\right| = \gamma \Delta_n
\end{aligned}$$

Note we used general fact that

$$\left|\max_a f_1(a) - \max_a f_2(a)\right| \le \max_a \left|f_1(a) - f_2(a)\right|$$
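A quick numerical illustration of that fact, using two arbitrary made-up functions over three actions:

```python
# Spot-check: |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)|.
f1 = {"up": 3.0, "down": 7.0, "left": 1.0}   # arbitrary illustrative values
f2 = {"up": 6.0, "down": 5.0, "left": 2.0}
lhs = abs(max(f1.values()) - max(f2.values()))  # |7 - 6| = 1
rhs = max(abs(f1[a] - f2[a]) for a in f1)       # max{3, 2, 1} = 3
```

The maximizing actions may differ on the two sides (here "down" vs "up"), which is exactly why the inequality can be strict.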

### Nondeterministic Case

What if reward and next state are non-deterministic?

We redefine $V, Q$ by taking expected values

$$V^{\pi}(s) \equiv E\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots\right] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$$

$$Q(s,a) \equiv E\left[r(s,a) + \gamma V^*(\delta(s,a))\right]$$

### Nondeterministic Case

*Q* learning generalizes to nondeterministic worlds

Alter training rule to

$$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\hat{Q}_{n-1}(s,a) + \alpha_n \left[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')\right]$$

where

$$\alpha_n = \frac{1}{1 + visits_n(s,a)}$$

Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992]
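A sketch of the altered rule on a deliberately tiny nondeterministic example (hypothetical, not from the slides): a single state whose action pays reward 0 or 100 with equal probability, so the true expected value is 50, and whose successor is terminal, making the max-term zero. With $\alpha_n = 1/(1 + visits_n)$ the estimate becomes, in effect, a running average of the noisy rewards:

```python
import random

# Nondeterministic training rule with decaying learning rate alpha_n.
# Hypothetical setup: one state, one action, reward uniformly 0 or 100
# (expectation 50), terminal successor so max_a' Q-hat(s', a') = 0.
random.seed(1)
q_hat = 0.0
visits = 0
for _ in range(20000):
    rew = random.choice([0, 100])  # noisy reward
    visits += 1
    alpha = 1.0 / (1 + visits)
    # Q_n <- (1 - alpha_n) Q_{n-1} + alpha_n [r + gamma * max-term], max-term = 0
    q_hat = (1 - alpha) * q_hat + alpha * rew
```

Overwriting the entry outright, as in the deterministic rule, would make $\hat{Q}$ jump forever between 0 and 100; the decaying $\alpha_n$ is what averages the noise away.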

### Temporal Difference Learning

*Q* learning: reduce discrepancy between successive *Q* estimates

One step time difference:

$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$$

Why not two steps?

$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$$

Or *n*?

$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$$

Blend all of these:

$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots\right]$$

### Temporal Difference Learning

$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots\right]$$

Equivalent expression:

$$Q^{\lambda}(s_t, a_t) = r_t + \gamma\left[(1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1})\right]$$

TD(λ) algorithm uses above training rule

• Sometimes converges faster than *Q* learning

• converges for learning *V** for any 0 ≤ λ ≤ 1 (Dayan, 1992)

• Tesauro's TD-Gammon uses this algorithm
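The $n$-step estimates and their λ-blend can be computed from a fixed trajectory. The rewards and bootstrap values below are illustrative assumptions, chosen (with $\gamma = 0.9$) so that every $Q^{(n)}$ agrees, as it should once $\hat{Q}$ is already correct:

```python
# n-step lookahead estimates Q^(n) and the lambda-blend Q^lambda from an
# illustrative 3-step trajectory that ends the episode after r_{t+2}.
GAMMA, LAM = 0.9, 0.5
rewards = [0, 0, 100]      # r_t, r_{t+1}, r_{t+2}
bootstrap = [90, 100, 0]   # max_a Q-hat(s_{t+1},a), (s_{t+2},a), (s_{t+3},a)

def q_n(n):
    """Q^(n): n discounted rewards, then bootstrap from Q-hat after n steps."""
    return (sum(GAMMA**i * rewards[i] for i in range(n))
            + GAMMA**n * bootstrap[n - 1])

def q_lam(n_max):
    """(1-lam)[Q^(1) + lam Q^(2) + ...]; the geometric tail weight lam^(n_max-1)
    falls on Q^(n_max), since all later Q^(n) equal it once the episode ends."""
    head = (1 - LAM) * sum(LAM**(n - 1) * q_n(n) for n in range(1, n_max))
    return head + LAM**(n_max - 1) * q_n(n_max)
```

With these numbers $Q^{(1)} = Q^{(2)} = Q^{(3)} = 81$, so the blend is also 81; in general the $n$-step estimates disagree while $\hat{Q}$ is still wrong, and λ controls how much weight deeper lookaheads receive.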

### Subtleties and Ongoing Research

• Replace $\hat{Q}$ table with neural network or other generalizer

• Handle case where state only partially observable

• Design optimal exploration strategies

• Extend to continuous action, state

• Learn and use $\hat{\delta} : S \times A \to S$, approximation to $\delta$

### RL Summary

• Reinforcement learning (RL)

– control learning

– delayed reward

– possible that the state is only partially observable

– possible that the relationship between states/actions is unknown

• Temporal Difference Learning

– learn discrepancies between successive estimates

– used in TD-Gammon

• V(s) - state value function

### RL Summary

• Q(s,a) - state/action value function

– related to V

– does not need reward/state transition functions

– training rule

– related to dynamic programming

– measure actual reward received for action and future value using current Q function

– deterministic - replace existing estimate

– nondeterministic - move table estimate towards measured estimate