(Active) Reinforcement Learning
CS540-002, Spring 2015 Lecture 14
Announcements
● P2 is due Feb 25.
A Final Warning.
Cheating policy: If you cheat on an assignment, you get a 0 in the class.
If you copied an online solution for P1, it is in your best interest to contact me today.
Today:
● Active RL
○ Greedy
○ Drunkard’s Walk
○ E v E
Recall: Learning
In solving an MDP, we are creating a reflex agent.
That pit sure is bad.
The enemy gate is down. If I try to go left here… maybe go up there...
Example: Grid World
Policy: Trials:
● (1, 1) - (1, 2) - (1, 3) - (1, 2) - (1, 3) - (2, 3) - (3, 3) - (4, 3)
○ R = 7 (-0.04) + 1
● (1, 1) - (1, 2) - (1, 3) - (2, 3) - (3, 3) - (3, 2) - (3, 3) - (4, 3)
○ R = 7 (-0.04) + 1
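The trial rewards above can be checked with a few lines of Python. This is a minimal sketch: the names `STEP_REWARD`, `GOAL_REWARD`, and `trial_return` are made up for illustration, and it assumes every non-terminal state costs -0.04 and the terminal state (4, 3) pays +1.

```python
# Hypothetical helper: total reward of one trial, assuming each
# non-terminal state costs -0.04 and the terminal state pays +1.
STEP_REWARD = -0.04
GOAL_REWARD = 1.0

def trial_return(path):
    """Sum the rewards along a path of (x, y) states that ends at the goal."""
    non_terminal_states = len(path) - 1  # every state except the terminal one
    return non_terminal_states * STEP_REWARD + GOAL_REWARD

trial_1 = [(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
trial_2 = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 2), (3, 3), (4, 3)]
print(round(trial_return(trial_1), 2))  # 0.72
print(round(trial_return(trial_2), 2))  # 0.72
```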
Recall: Reinforcement Learning
Roadmap:
1. Passive Reinforcement Learning
Given fixed policy, learn rewards and transition
2. Active Reinforcement Learning
Learn policy, rewards, transition, and maximize utility
3. Generalization
Recall: Passive RL
Task:
Given: a fixed policy
Learn: P(s’ | s, a), R(s)
Basic Idea:
Execute a series of trials, estimate values from
Direct Utility Estimation
Basic Idea:
The Utility of a state is the expected total reward from that state onward.
Each trial provides a sample for every state visited.
Can we do better? (yes)
Policy:
U(3, 2) = ½ [(-0.04 - 1) + (-0.04 + 1)] = -0.04
P((3, 3) | (3, 2), Up) = 0.5
U(3, 3) = really high
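A rough sketch of direct utility estimation: average the observed reward-to-go over every visit to a state. The function name and the trial format (a list of (state, reward) pairs) are assumptions for illustration, and "pit" and "goal" are placeholder names for the two terminal outcomes.

```python
from collections import defaultdict

def direct_utility_estimate(trials):
    """Average the observed reward-to-go for every state visit.

    Each trial is a list of (state, reward) pairs in visit order.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so reward_to_go accumulates from the end of the trial.
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two illustrative trials through (3, 2): one ends badly, one ends well.
trials = [
    [((3, 2), -0.04), ("pit", -1.0)],
    [((3, 2), -0.04), ("goal", 1.0)],
]
estimates = direct_utility_estimate(trials)
print(round(estimates[(3, 2)], 2))  # -0.04
```

Note that the estimate for (3, 2) ignores what we already know about its neighbors, which is the weakness the two solutions below address.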
Two Solutions
1. Adaptive Dynamic Programming
Keep updated model of MDP, solve it at each step
2. Temporal Difference Learning
Active Reinforcement Learning
Same setting as before, but now we also have to learn a policy.
Remember:
First Attempt: Greedy Agent
Estimate R(s) as before.
Estimate P(s’ | s, a) as before.
Maintain model of world as before.
When selecting an action, compute first.
Example: Grid World
Once we find a path, we may cling to it too heavily.
Key aspect:
Example: Life
States: (H)appy, (A)nxious, (S)ad
Actions: (W)ork, (D)ance
Terminal States: H, S
P(S | A, D) = 0.05
P(H | A, D) = 0.9
Example: Life
Initial Policy: Random
Agent Works, which leads to Happy. Now, the best policy is to Work.
So on trial 2, agent Works.
Key Idea: Exploration
Second Attempt: Drunkard’s Walk
At every step: Take random action.
Third Attempt: Explorer
At every step, take the action you have taken the least often in that state.
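The Explorer rule can be sketched in a few lines. The function name and the tie-breaking rule (earliest action in the list wins) are assumptions, not something the slide specifies.

```python
from collections import defaultdict

def explorer_action(state, actions, counts):
    """Pick the action taken least often in `state`.

    Ties go to the earliest action in `actions` (an arbitrary choice).
    """
    choice = min(actions, key=lambda a: counts[(state, a)])
    counts[(state, choice)] += 1
    return choice

counts = defaultdict(int)
picks = [explorer_action("A", ["Work", "Dance"], counts) for _ in range(4)]
print(picks)  # ['Work', 'Dance', 'Work', 'Dance']
```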
Exploitation vs. Exploration
Greed (exploitation):
● Learns a poor model of the world.
● Leads to bad policies.
Curiosity (exploration):
● Learns the world quickly.
● Fails to optimize reward over time.
Detour: Multi-Armed Bandit
N slot machines, each with an unknown payout.
(yes, it’s a tiger quoll)
Example: 4 Bandits
Bandits: A, B, C, D
Trials:
A: $1, B: $1, A: $3, C: $8, B: $1, B: $1
Greed: C > A > B (D?)
Curiosity: D > C > A > B
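Both rankings can be computed directly from the trial log. This sketch assumes greed ranks by average observed payout and curiosity ranks by how rarely an arm has been pulled. Counting pulls, A has been tried twice and B three times, so a pure count-based ranking puts A ahead of B.

```python
from collections import defaultdict

trials = [("A", 1), ("B", 1), ("A", 3), ("C", 8), ("B", 1), ("B", 1)]
arms = ["A", "B", "C", "D"]

payouts = defaultdict(list)
for arm, payout in trials:
    payouts[arm].append(payout)

# Greed: rank the arms we have tried by average observed payout (highest first).
tried = [arm for arm in arms if payouts[arm]]
greedy_order = sorted(tried, key=lambda arm: -sum(payouts[arm]) / len(payouts[arm]))

# Curiosity: rank all arms by how rarely they have been pulled (fewest first).
curious_order = sorted(arms, key=lambda arm: len(payouts[arm]))

print(greedy_order)   # ['C', 'A', 'B']
print(curious_order)  # ['D', 'C', 'A', 'B']
```

Greed has nothing to say about D, because D has never been pulled.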
Fourth Attempt: ε-Greedy
With probability 1 - ε, select the best action.
With probability ε, select a move at random.
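A minimal sketch of ε-greedy action selection. The function name, the `q_values` dictionary (action to current value estimate), and the optional `rng` parameter are assumptions for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Choose an action given a dict mapping action -> estimated value."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore: uniform random move
    return max(actions, key=q_values.get)   # exploit: best current estimate

estimates = {"Work": 0.3, "Dance": 0.8}
print(epsilon_greedy(estimates, epsilon=0.0))  # always 'Dance' when epsilon is 0
```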
GLIE (Greedy in the Limit of Infinite Exploration)
Recall that the agent aims to maximize its reward.
In the limit (as the number of trials goes to infinity), we learn the correct model, assuming reasonable exploration.
However, once we know the model, we should be as greedy as possible.
Fifth Attempt: 1/t - Greedy
On the t-th trial, at each step select a random move with probability ε = 1/t.
Now we’re GLIE!
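The decaying schedule is a one-liner. The function name is made up, and trials are assumed to be numbered from 1.

```python
def epsilon_for_trial(t):
    """Exploration rate on the t-th trial (t starts at 1)."""
    if t < 1:
        raise ValueError("trial index t must be at least 1")
    return 1.0 / t

print([epsilon_for_trial(t) for t in (1, 2, 4, 10)])  # [1.0, 0.5, 0.25, 0.1]
```

The first trial is fully random, and the agent grows greedier with every trial after that.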
Exploration Function
Let f(u, n) be our exploration function.
u: estimate of utility
Exploration Function
If we’ve seen this
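One simple exploration function, in the form given by Russell and Norvig, returns an optimistic value until an option has been tried enough times. This is a sketch of that textbook form and may differ from the definition on the original slide. The constants `R_PLUS` (an optimistic bound on reward) and `N_E` (the number of tries before we trust the estimate) are assumed values.

```python
R_PLUS = 1.0   # assumed optimistic reward bound (best reward in the grid world)
N_E = 5        # assumed number of tries before we trust the estimate

def exploration_f(u, n, r_plus=R_PLUS, n_e=N_E):
    """Return an optimistic value until the option has been tried n_e times."""
    return r_plus if n < n_e else u

print(exploration_f(0.2, 0))  # 1.0  (never tried: be optimistic)
print(exploration_f(0.2, 5))  # 0.2  (tried enough: trust the estimate)
```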
Sixth Attempt: Active ADP Agent
Basic Idea:
● Never-before-seen state: utility and reward are both initialized to the observed reward.
● Track the numerator and denominator of the transition probabilities.
● Construct an empirical estimate of the transition model.
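The empirical transition model can be sketched as two count tables. The class and attribute names are hypothetical; the numerator counts how often (s, a) led to s’, and the denominator counts how often a was taken in s.

```python
from collections import defaultdict

class TransitionModel:
    """Empirical estimate of P(s' | s, a) from observed transitions."""

    def __init__(self):
        self.n_sa = defaultdict(int)    # denominator: times a was taken in s
        self.n_sas = defaultdict(int)   # numerator: times that led to s'

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def prob(self, s, a, s_next):
        total = self.n_sa.get((s, a), 0)
        if total == 0:
            return 0.0  # no data yet for this (state, action) pair
        return self.n_sas.get((s, a, s_next), 0) / total

model = TransitionModel()
model.observe((3, 2), "Up", (3, 3))
model.observe((3, 2), "Up", (4, 2))
print(model.prob((3, 2), "Up", (3, 3)))  # 0.5
```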
Recall: Bellman Update
New utility of state s
Probability of ending up in that state
Immediate reward
Discount factor
Future reward
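For reference, here is a sketch of the standard Bellman update, U(s) ← R(s) + γ max_a Σ_s’ P(s’ | s, a) U(s’), written in the R(s) and P(s’ | s, a) notation used earlier. The function signature, the dictionary layout, and the toy numbers are assumptions for illustration.

```python
def bellman_update(s, U, R, P, actions, gamma=0.9):
    """U(s) <- R(s) + gamma * max_a sum_{s'} P(s' | s, a) * U(s').

    P[(s, a)] is a dict mapping each successor s' to its probability.
    """
    best = max(
        sum(p * U[s_next] for s_next, p in P[(s, a)].items())
        for a in actions
    )
    return R[s] + gamma * best

# Toy problem (hypothetical numbers): one state, two actions, two outcomes.
U = {"good": 1.0, "bad": -1.0}
R = {"s": -0.04}
P = {
    ("s", "safe"): {"good": 0.5, "bad": 0.5},
    ("s", "risky"): {"good": 0.9, "bad": 0.1},
}
new_u = bellman_update("s", U, R, P, ["safe", "risky"], gamma=1.0)
print(round(new_u, 2))  # 0.76
```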
Active ADP
Optimistic estimate of utility
Exploration function
Number of times we’ve taken action a in state s
Why U⁺ instead of U?
Even when a state has been observed N_e times, we want to draw the agent to explore
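A sketch of the optimistic update in its textbook form (Russell and Norvig): U⁺(s) ← R(s) + γ max_a f(Σ_s’ P(s’ | s, a) U⁺(s’), N(s, a)). The exact equation on the original slide is not recoverable, so treat this as an assumption. The simple `f`, the constants `R_PLUS` and `N_E`, and the toy data are illustrative.

```python
R_PLUS = 1.0  # assumed optimistic reward bound
N_E = 5       # assumed exploration threshold

def f(u, n):
    """Simple exploration function: optimistic until tried N_E times."""
    return R_PLUS if n < N_E else u

def optimistic_update(s, U_plus, R, P, N, actions, gamma=1.0):
    """U+(s) <- R(s) + gamma * max_a f(sum_s' P(s'|s,a) * U+(s'), N(s, a))."""
    def action_value(a):
        outcomes = P.get((s, a), {})  # empty if (s, a) was never tried
        expected = sum(p * U_plus[s2] for s2, p in outcomes.items())
        return f(expected, N.get((s, a), 0))
    return R[s] + gamma * max(action_value(a) for a in actions)

# Toy data (hypothetical): "Work" is well explored, "Dance" has never been tried.
U_plus = {"H": 0.2}
R = {"A": -0.04}
P = {("A", "Work"): {"H": 1.0}}
N = {("A", "Work"): 10}
result = optimistic_update("A", U_plus, R, P, N, ["Work", "Dance"])
print(round(result, 2))  # 0.96
```

The untried action gets the optimistic value R_PLUS, so it dominates the well-explored one. Because the update reads U⁺ of the successor states rather than U, that optimism carries back to states that lead toward unexplored regions.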
What about TD Learning?
Recall the TD Update Step:
New Utility
Old Utility
Immediate Reward
Expected future reward
How much we care about the future
Difference between new
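The standard TD update is U(s) ← U(s) + α (R(s) + γ U(s’) - U(s)). A minimal sketch follows; the function name, the learning rate `alpha`, and the default of 0.0 for unseen states are assumptions.

```python
def td_update(U, s, reward, s_next, alpha=0.1, gamma=0.9):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    old = U.get(s, 0.0)
    # Difference between the one-step sample and the old estimate.
    td_error = reward + gamma * U.get(s_next, 0.0) - old
    U[s] = old + alpha * td_error
    return U[s]

U = {"a": 0.0, "b": 1.0}
updated = td_update(U, "a", -0.04, "b", alpha=0.5, gamma=1.0)
print(round(updated, 2))  # 0.48
```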
What about TD Learning?
Update step just as before, but now we do not have a fixed policy.
Next Up: Q-Learning
Basic Idea:
Use TD learning in a model-free context.
Model-free: we do not need to maintain a model of the world.
Detour: Fancy Python Features
Q - Values
Q(s, a) = value of performing action a in state s.
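A Q-table can be kept in a dictionary keyed by (state, action). The sketch below pairs it with the standard textbook Q-learning update, Q(s, a) ← Q(s, a) + α (r + γ max_a’ Q(s’, a’) - Q(s, a)). The function names, the default value of 0.0, and the toy transition are assumptions.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)] defaults to 0.0 for unseen pairs

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def greedy_action(Q, s, actions):
    """Action with the highest Q-value in state s."""
    return max(actions, key=lambda a: Q[(s, a)])

actions = ["Work", "Dance"]
q_update(Q, "A", "Dance", 1.0, "H", actions, alpha=0.5, gamma=1.0)
print(Q[("A", "Dance")])               # 0.5
print(greedy_action(Q, "A", actions))  # Dance
```

No transition model appears anywhere in the update, which is what makes it model-free.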