Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon
Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same sensors/effectors
One Example: TD-Gammon
Tesauro, 1995
Learn to play Backgammon
Immediate reward:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: agent–environment interaction loop. The agent observes state s_t from the environment, chooses action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: learn to choose actions that maximize
    r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
Markov Decision Process
Assume
• finite set of states S
• set of actions A
• at each discrete time t, agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
– i.e., r_t and s_{t+1} depend only on current state and action
– functions δ and r may be nondeterministic
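For concreteness, a minimal Python sketch of such a deterministic MDP, using a hypothetical three-state world (the state names, actions, and reward values are invented for illustration, not taken from the slides):

```python
# Hypothetical deterministic MDP: finite states S, actions A,
# transition function delta (δ) and reward function r as lookup tables.
S = ["s1", "s2", "G"]                # G plays the role of an absorbing goal state
A = ["left", "right"]

delta = {                            # δ(s, a) -> next state
    ("s1", "left"): "s1", ("s1", "right"): "s2",
    ("s2", "left"): "s1", ("s2", "right"): "G",
    ("G",  "left"): "G",  ("G",  "right"): "G",
}
r = {(s, a): 0 for s in S for a in A}
r[("s2", "right")] = 100             # only the action entering G is rewarded

# One step under the Markov assumption: s_{t+1} and r_t depend only on (s_t, a_t).
s, a = "s1", "right"
s_next, reward = delta[(s, a)], r[(s, a)]
```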
Agent’s Learning Task
Execute actions in environment, observe results, and
• learn action policy π : S → A that maximizes

    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]

  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
• target function is π : S → A
• but we have no training examples of form <s,a>
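For concreteness (numbers invented for illustration): with γ = 0.9 and reward sequence r_t = 0, r_{t+1} = 0, r_{t+2} = 100, 0, 0, ..., the quantity being maximized is

    E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] = 0 + 0.9·0 + 0.81·100 = 81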
Value Function
To begin, consider deterministic worlds …
For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s
Restated, the task is to learn the optimal policy π*
    π* ≡ argmax_π V^π(s), (∀s)
[Figure: grid world with absorbing goal state G. Panels show r(s,a) (immediate reward) values (100 for actions entering G, 0 elsewhere), Q(s,a) values (e.g., 72, 81, 90, 100), V*(s) values, and one optimal policy.]
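As a quick check on the figure's numbers (assuming the discount factor γ = 0.9 used in the update example later in these slides): moving toward G earns 0 until the final step's reward of 100, so V* falls off by a factor of γ per step from the goal:

    V*(one step from G)    = 0 + γ·100 = 90
    V*(two steps from G)   = 0 + γ·0 + γ²·100 = 81
    V*(three steps from G) = 0 + γ·0 + γ²·0 + γ³·100 = 72.9 ≈ 72 (rounded in the figure)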
What to Learn
We might try to have agent learn the evaluation function V^{π*} (which we write as V*)
We could then do a lookahead search to choose best action from any state s because
    π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))]
A problem:
• This works well if agent knows δ : S × A → S, and r : S × A → ℜ
• But when it doesn't, we can't choose actions this way
Q Function
Define new function very similar to V*
    Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))

If agent learns Q, it can choose optimal action even without knowing δ!
    π*(s) ≡ argmax_a [r(s,a) + γ V*(δ(s,a))]
    π*(s) ≡ argmax_a Q(s,a)
Q is the evaluation function the agent will learn
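A minimal Python sketch of this point, with a hypothetical Q table stored as a dict keyed by (state, action): the greedy policy needs only the learned table, not δ or r.

```python
# Hypothetical learned Q values; keys are (state, action) pairs.
Q = {
    ("s1", "up"): 81, ("s1", "right"): 90,
    ("s2", "left"): 81, ("s2", "right"): 100,
}

def pi_star(state, actions):
    """pi*(s) = argmax_a Q(s, a) -- no model of delta or r required."""
    return max(actions, key=lambda a: Q[(state, a)])

print(pi_star("s1", ["up", "right"]))   # -> 'right'
```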
Training Rule to Learn Q
Note Q and V* closely related:

    V*(s) = max_{a'} Q(s,a')

Which allows us to write Q recursively as

    Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))
               = r(s_t,a_t) + γ max_{a'} Q(s_{t+1},a')
Let Q̂ denote the learner's current approximation to Q. Consider training rule

    Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

where s' is the state resulting from applying action a in state s
Q Learning for Deterministic Worlds
For each s,a initialize table entry Q̂(s,a) ← 0
Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s,a) as follows:

    Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

• s ← s'
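A short Python sketch of this loop, assuming the environment is exposed as hypothetical functions delta(s, a) and r(s, a) (names invented here; the slides only assume the agent can execute actions and observe the results), with the "do forever" loop truncated to a fixed number of steps:

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, r, gamma=0.9, steps=10000):
    """Tabular Q-learning for a deterministic world."""
    Q = defaultdict(float)                     # Q-hat(s,a) initialized to 0
    s = random.choice(states)                  # observe current state s
    for _ in range(steps):
        a = random.choice(actions)             # select an action and execute it (random exploration)
        reward, s_next = r(s, a), delta(s, a)  # immediate reward and new state s'
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        s = s_next
    return Q
```

With the toy MDP sketched earlier, this could be invoked as q_learning(S, A, lambda s, a: delta[(s, a)], lambda s, a: r[(s, a)]).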
Updating Q̂

[Figure: one step in the grid world — taking action a_right moves the agent from initial state s1 to next state s2; the current Q̂ values for the actions available from s2 are 63, 81, and 100.]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                    ← 0 + 0.9 max{63, 81, 100}
                    = 90

Notice if rewards non-negative, then

    ∀ s, a, n:  Q̂_{n+1}(s,a) ≥ Q̂_n(s,a)
and
    ∀ s, a, n:  0 ≤ Q̂_n(s,a) ≤ Q(s,a)
Convergence
Q̂ converges to Q. Consider case of deterministic world where each <s,a> is visited infinitely often.

Proof: define a full interval to be an interval during which each <s,a> is visited. During each full interval the largest error in the Q̂ table is reduced by factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

    Δ_n = max_{s,a} |Q̂_n(s,a) − Q(s,a)|
Convergence (cont)
For any table entry Q̂_n(s,a) updated on iteration n+1, the error in the revised estimate Q̂_{n+1}(s,a) is

    |Q̂_{n+1}(s,a) − Q(s,a)| = |(r + γ max_{a'} Q̂_n(s',a')) − (r + γ max_{a'} Q(s',a'))|
                             = γ |max_{a'} Q̂_n(s',a') − max_{a'} Q(s',a')|
                             ≤ γ max_{a'} |Q̂_n(s',a') − Q(s',a')|
                             ≤ γ max_{s'',a'} |Q̂_n(s'',a') − Q(s'',a')|
    |Q̂_{n+1}(s,a) − Q(s,a)| ≤ γ Δ_n

Note we used general fact that

    |max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|
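As a quick numeric check of that fact (values invented for illustration): for two actions with f1 = (3, 5) and f2 = (4, 1),

    |max_a f1(a) − max_a f2(a)| = |5 − 4| = 1  ≤  max_a |f1(a) − f2(a)| = max{1, 4} = 4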
Nondeterministic Case
What if reward and next state are non-deterministic?

We redefine V, Q by taking expected values

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
           ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

    Q(s,a) ≡ E[r(s,a) + γ V*(δ(s,a))]

Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to

    Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [r + γ max_{a'} Q̂_{n−1}(s',a')]

where

    α_n = 1 / (1 + visits_n(s,a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
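A Python sketch of this decaying-α update, keeping a visit counter per (s,a) pair (helper and variable names are invented for illustration):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-hat(s,a), initialized to 0
visits = defaultdict(int)     # visits_n(s,a)

def nondeterministic_update(s, a, reward, s_next, actions, gamma=0.9):
    """Move Q-hat(s,a) toward the sampled target instead of replacing it."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```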
Temporal Difference Learning
Q learning: reduce discrepancy between successive Q estimates

One step time difference:

    Q^(1)(s_t,a_t) ≡ r_t + γ max_a Q̂(s_{t+1},a)

Why not two steps?

    Q^(2)(s_t,a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)

Or n?

    Q^(n)(s_t,a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n},a)

Blend all of these:

    Q^λ(s_t,a_t) ≡ (1 − λ)[Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ...]

Temporal Difference Learning
    Q^λ(s_t,a_t) ≡ (1 − λ)[Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ...]

Equivalent expression:

    Q^λ(s_t,a_t) ≡ r_t + γ[(1 − λ) max_a Q̂(s_{t+1},a) + λ Q^λ(s_{t+1},a_{t+1})]

TD(λ) algorithm uses above training rule
• Sometimes converges faster than Q learning
• converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm
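A small Python sketch of the blended estimate, computed from a hypothetical recorded trajectory of (state, action, reward) triples and the current Q̂ table; the infinite sum over n is truncated at the end of the trajectory purely for illustration:

```python
def n_step_q(traj, Q, actions, t, n, gamma=0.9):
    """Q^(n)(s_t,a_t): n discounted rewards plus a gamma^n max-Q bootstrap at s_{t+n}."""
    ret = sum(gamma**i * traj[t + i][2] for i in range(n))        # r_t ... r_{t+n-1}
    s_boot = traj[t + n][0]                                       # s_{t+n}
    return ret + gamma**n * max(Q[(s_boot, a)] for a in actions)

def lambda_q(traj, Q, actions, t, lam=0.5, gamma=0.9):
    """Q^lambda(s_t,a_t): (1-lambda)-weighted blend of n-step estimates, truncated at trajectory end."""
    n_max = len(traj) - 1 - t                                     # steps remaining after t
    return (1 - lam) * sum(lam**(n - 1) * n_step_q(traj, Q, actions, t, n, gamma)
                           for n in range(1, n_max + 1))
```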
Subtleties and Ongoing Research
• Replace Q̂ table with neural network or other generalizer
• Handle case where state only partially observable
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use δ̂ : S × A → S, an approximation to δ
RL Summary
• Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states/actions is unknown
• Temporal Difference Learning
– learn discrepancies between successive estimates
– used in TD-Gammon
• V(s) - state value function
RL Summary
• Q(s,a) - state/action value function
– related to V
– does not need reward/state transition functions
– training rule
– related to dynamic programming
– measure actual reward received for action and future value using current Q function
– deterministic - replace existing estimate
– nondeterministic - move table estimate towards measured estimate