(1)

(Active)

Reinforcement Learning

CS540-002, Spring 2015 Lecture 14

(2)

Announcements

● P2 is due Feb 25.

(3)

A Final Warning.

Cheating policy: If you cheat on an assignment, you get a 0 in the class.

If you copied an online solution for P1, it is in your best interest to contact me today.

(4)

Today:

● Active RL

○ Greedy

○ Drunkard’s Walk

○ E v E

(5)

Recall: Learning

In solving an MDP, we are creating a reflex agent.

That pit sure is bad.

The enemy gate is down.

If I try to go left here… maybe go up there...

(6)

Example: Grid World

Policy: Trials:

● (1, 1) - (1, 2) - (1, 3) - (1, 2) - (1, 3) - (2, 3) - (3, 3) - (4, 3)

○ R = 7 (-0.04) + 1

● (1, 1) - (1, 2) - (1, 3) - (2, 3) - (3, 3) - (3, 2) - (3, 3) - (4, 3)

○ R = 7 (-0.04) + 1
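
The return shown for each trial can be checked with a few lines of Python. This is only a sketch: it assumes undiscounted rewards, a reward of -0.04 in each of the 7 non-terminal states visited, and +1 at the terminal state (4, 3), as the formula on the slide suggests. The names `STEP_REWARD`, `GOAL_REWARD`, and `trial_return` are made up for illustration.

```python
STEP_REWARD = -0.04   # reward in each non-terminal state (from the slide)
GOAL_REWARD = 1.0     # reward at the terminal state (4, 3)

def trial_return(path):
    """Undiscounted return: one step reward per non-terminal state, plus the goal reward."""
    return STEP_REWARD * (len(path) - 1) + GOAL_REWARD

trial_1 = [(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
trial_2 = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 2), (3, 3), (4, 3)]

print(round(trial_return(trial_1), 2), round(trial_return(trial_2), 2))
```

Both trials visit 8 states, so both evaluate to 7 (-0.04) + 1 = 0.72.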

(7)

Recall: Reinforcement Learning

Roadmap:

1. Passive Reinforcement Learning

Given a fixed policy, learn rewards and transitions.

2. Active Reinforcement Learning

Learn policy, rewards, and transitions, and maximize utility.

3. Generalization

(8)

Recall: Passive RL

Task:

Given: a fixed policy

Learn: P(s’ | s, a), R(s)

Basic Idea:

Execute a series of trials, estimate values from

(9)

Direct Utility Estimation

Basic Idea:

The Utility of a state is the expected total reward from that state onward.

Each trial provides a sample for every state visited.
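
A minimal sketch of this idea in Python, assuming undiscounted rewards and that every visit to a state counts as one sample. The trial format (a list of `(state, reward)` pairs) and the state names `"s"`, `"pit"`, and `"goal"` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def direct_utility_estimate(trials):
    """Average the observed reward-to-go for every state visited in the trials."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so reward_to_go holds the total reward from each state onward.
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}

# Two hypothetical trials from the same start state: one ends badly, one ends well.
trials = [
    [("s", -0.04), ("pit", -1.0)],
    [("s", -0.04), ("goal", 1.0)],
]
U = direct_utility_estimate(trials)
```

Here `U["s"]` is the average of (-0.04 - 1) and (-0.04 + 1), which is -0.04.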

(10)

Can we do better? (yes)

Policy:

U(3, 2) = ½ [(-0.04 - 1) + (-0.04 + 1)] = -0.04

P((3, 3) | (3, 2), Up) = 0.5

U(3, 3) = really high

(11)

Two Solutions

1. Adaptive Dynamic Programming

Keep updated model of MDP, solve it at each step

2. Temporal Difference Learning

(12)

Active Reinforcement Learning

Same setting as before, but now, we have to also learn a policy.

Remember:

(13)

First Attempt: Greedy Agent

Estimate R(s) as before.

Estimate P(s’ | s, a) as before.

Maintain model of world as before.

When selecting an action, compute first.

(14)

Example: Grid World

Once we find a path, we may cling to it too heavily.

Key aspect:

(15)

Example: Life

States: (H)appy, (A)nxious, (S)ad

Actions: (W)ork, (D)ance

Terminal States: H, S

(16)

P(S | A, D) = 0.05

P(H | A, D) = 0.9

(17)

Example: Life

Initial Policy: Random

Agent Works, which leads to Happy. Now, the best policy is to Work.

So on trial 2, agent Works.

(18)

Key Idea: Exploration

(19)

Second Attempt: Drunkard’s Walk

At every step: Take random action.

(20)

Third Attempt: Explorer

At every step, take the action you have taken the least often in that state.
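
The Explorer rule can be sketched as follows. The count table and the tie-breaking rule (earliest action in the list wins) are assumptions made for the sketch; the state and action letters are borrowed from the Life example.

```python
from collections import defaultdict

def explorer_action(state, actions, counts):
    """Pick the action taken least often in `state`; ties go to the earliest in `actions`."""
    choice = min(actions, key=lambda a: counts[(state, a)])
    counts[(state, choice)] += 1
    return choice

counts = defaultdict(int)   # counts[(state, action)] = times that action was taken there
picks = [explorer_action("A", ["W", "D"], counts) for _ in range(4)]
```

With two actions the agent simply alternates, so every action is tried equally often regardless of how well it pays.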

(21)

Exploitation vs. Exploration

(22)

Too much greed:

Learns a poor model of the world.

Leads to bad policies.

(23)

Too much curiosity:

Learns the world quickly.

Fails to optimize reward over time.

(24)

Detour: Multi-Armed Bandit

N slot machines, each with an unknown payout.

(25)

Example: 4 Bandits (A, B, C, D)

(yes, it’s a tiger quoll)

Trials:

A: $1, B: $1, A: $3, C: $8, B: $1, B: $1

Greed: C > A > B (D?)

Curiosity: D > C > B > A
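
The greedy ranking in this example can be reproduced by averaging the observed payouts per machine. This is a sketch that assumes "greed" means ranking by mean observed payout; the variable names are invented for illustration.

```python
from collections import defaultdict

# Trials from the slide: (machine, payout in dollars)
trials = [("A", 1), ("B", 1), ("A", 3), ("C", 8), ("B", 1), ("B", 1)]

payouts = defaultdict(list)
for arm, payout in trials:
    payouts[arm].append(payout)

mean_payout = {arm: sum(p) / len(p) for arm, p in payouts.items()}
greedy_ranking = sorted(mean_payout, key=mean_payout.get, reverse=True)
untried = [arm for arm in "ABCD" if arm not in payouts]
```

The means are A = 2, B = 1, C = 8, so the greedy order is C, A, B. Machine D has no data at all, which is why a purely greedy agent has no basis for ranking it.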

(26)

Fourth Attempt: ε-Greedy

With probability 1 - ε, select the best action.

With probability ε, select a move at random.
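
A minimal ε-greedy sketch. It assumes the agent has some estimate of each action's value, passed in as a callable `value`; that interface and the `rng` parameter are choices made for this sketch, not something specified in the lecture.

```python
import random

def epsilon_greedy(actions, value, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=value)

# Hypothetical value estimates for the two actions in the Life example.
estimates = {"W": 1.0, "D": 0.2}
greedy_pick = epsilon_greedy(["W", "D"], estimates.get, epsilon=0.0)
random_pick = epsilon_greedy(["W", "D"], estimates.get, epsilon=1.0, rng=random.Random(0))
```

Setting ε = 0 recovers the greedy agent and ε = 1 recovers the Drunkard's Walk.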

(27)

GLIE

Recall that the agent aims to maximize its reward.

In the limit (# of trials goes to infinity), we learn the correct model assuming reasonable exploration.

However, once we know the model, we should be as greedy as possible.

(28)

Fifth Attempt: 1/t-Greedy

On the tth trial, at each step select a random move with probability ε = 1/t.

Now we’re GLIE!
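
The decaying schedule can be sketched by making ε a function of the trial number. As before, the `value` callable and the function names are assumptions for illustration; trials are numbered from t = 1.

```python
import random

def glie_epsilon(t):
    """Exploration rate on the t-th trial (t starts at 1)."""
    return 1.0 / t

def one_over_t_greedy(actions, value, t, rng=random):
    """Epsilon-greedy action selection with epsilon = 1/t."""
    if rng.random() < glie_epsilon(t):
        return rng.choice(actions)
    return max(actions, key=value)

estimates = {"W": 1.0, "D": 0.2}   # hypothetical value estimates
first_trial_pick = one_over_t_greedy(["W", "D"], estimates.get, t=1, rng=random.Random(0))
```

On the first trial every move is random; as t grows, ε shrinks toward 0 and the agent becomes greedy in the limit, while still never ruling out exploration entirely.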

(29)

Exploration Function

Let f(u, n) be our exploration function.

u: estimate of utility

(30)

Exploration Function

If we’ve seen this
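
The definition on these slides did not survive, so the following is an assumption rather than the lecture's own formula: one common textbook choice returns an optimistic value until an action has been tried N_e times, and the actual estimate afterwards. The constants `R_PLUS` and `N_E` are hypothetical values picked for the sketch.

```python
R_PLUS = 2.0   # assumed optimistic bound on the best possible reward (hypothetical value)
N_E = 5        # assumed number of tries before the estimate is trusted (hypothetical value)

def f(u, n):
    """Return the optimistic value until tried N_E times, then the utility estimate u."""
    return R_PLUS if n < N_E else u

rarely_tried = f(0.3, 0)   # tried 0 times
well_tried = f(0.3, 5)     # tried N_E times
```

Under this choice, anything the agent has rarely tried looks as good as the best possible outcome, which pulls the agent toward it.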

(31)

(32)

Sixth Attempt: Active ADP Agent

Basic Idea:

(33)

Never-before-seen state: utility and reward are both set to the observed reward.

Track the numerator and denominator of the transition probabilities.

Construct an empirical estimate of the transition model.
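
The counting scheme described above can be sketched as a small class. The class and attribute names are hypothetical; the estimate is simply the numerator count divided by the denominator count.

```python
from collections import defaultdict

class TransitionModel:
    """Empirical estimate of P(s' | s, a) built from observed transitions."""

    def __init__(self):
        self.n_sa = defaultdict(int)    # denominator: times action a was taken in s
        self.n_sas = defaultdict(int)   # numerator: times that led to s'

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def prob(self, s_next, s, a):
        """Empirical P(s' | s, a); 0.0 if (s, a) has never been tried."""
        if self.n_sa[(s, a)] == 0:
            return 0.0
        return self.n_sas[(s, a, s_next)] / self.n_sa[(s, a)]

# Hypothetical observations: Dance from Anxious, three times to Happy, once to Sad.
model = TransitionModel()
for outcome in ["H", "H", "S", "H"]:
    model.observe("A", "D", outcome)
```

After these four observations, `model.prob("H", "A", "D")` is 3/4.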

(34)

Recall: Bellman Update

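U(s) ← R(s) + γ max_a Σ_s’ P(s’ | s, a) U(s’)

U(s) on the left: new utility of state s

R(s): immediate reward

γ: discount factor

P(s’ | s, a): probability of ending up in that state

U(s’): future reward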
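
The Bellman update for a single state can be sketched in Python as below. The data layout is an assumption: `P[(s, a)]` is taken to be a dict from successor state to probability. In the usage example, the two Dance probabilities (0.9 and 0.05) come from the Life example; the remaining 0.05 of staying Anxious, the Work probabilities, and the utilities are hypothetical.

```python
def bellman_update(s, U, R, P, actions, gamma):
    """Return the new utility of s: R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions)
    return R[s] + gamma * best

U = {"A": 0.0, "H": 1.0, "S": -1.0}          # hypothetical current utilities
R = {"A": -0.04}                              # hypothetical reward for Anxious
P = {
    ("A", "W"): {"H": 0.5, "S": 0.5},         # hypothetical
    ("A", "D"): {"H": 0.9, "S": 0.05, "A": 0.05},
}
new_u = bellman_update("A", U, R, P, ["W", "D"], gamma=1.0)
```

Dance has expected future utility 0.9 - 0.05 = 0.85 and Work has 0, so the update gives -0.04 + 0.85 = 0.81.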

(35)

Active ADP

U⁺(s) ← R(s) + γ max_a f(Σ_s’ P(s’ | s, a) U⁺(s’), N(s, a))

U⁺: optimistic estimate of utility

f: exploration function

N(s, a): number of times we’ve taken action a in state s
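
A sketch of an optimistic update that routes each action's expected utility through an exploration function. Everything here is an assumption for illustration: the form of `f` (optimistic value `R_PLUS` until an action has been tried `N_E` times), the dict layouts for `P` and `N`, and all the numbers.

```python
R_PLUS = 2.0   # assumed optimistic reward bound (hypothetical value)
N_E = 5        # assumed minimum number of tries (hypothetical value)

def f(u, n):
    """Exploration function: optimistic until tried N_E times, then the estimate u."""
    return R_PLUS if n < N_E else u

def optimistic_update(s, U_plus, R, P, N, actions, gamma):
    """U+(s) <- R(s) + gamma * max_a f(sum_s' P(s'|s,a) U+(s'), N(s,a))."""
    def explored_value(a):
        expected = sum(p * U_plus[s2] for s2, p in P.get((s, a), {}).items())
        return f(expected, N.get((s, a), 0))
    return R[s] + gamma * max(explored_value(a) for a in actions)

U_plus = {"A": 0.0, "H": 1.0, "S": -1.0}
R = {"A": -0.04}
P = {("A", "W"): {"H": 0.5, "S": 0.5}}   # Work has been tried; Dance has not
N = {("A", "W"): 10}
new_u_plus = optimistic_update("A", U_plus, R, P, N, ["W", "D"], gamma=1.0)
```

Because Dance has never been tried, it is valued at `R_PLUS`, and the updated optimistic utility becomes -0.04 + 2.0 = 1.96 even though the well-tried action is worth 0.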

(36)

Why U⁺ instead of U?

Even when a state has been observed N_e times, we want to draw the agent to explore

(37)

What about TD Learning?

Recall the TD Update Step:

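U(s) ← U(s) + α [R(s) + γ U(s’) − U(s)]

U(s) on the left: new utility

U(s) on the right: old utility

R(s): immediate reward

U(s’): expected future reward

γ: how much we care about the future

α: learning rate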

Difference between new

(38)

What about TD Learning?

Update step just as before, but now we do not have a fixed policy.
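
For reference, a sketch of the TD update step in its usual textbook form, with learning rate `alpha` and discount `gamma`. The function name and the numbers in the example are hypothetical. Note that the update itself does not care how the action was chosen; in the active setting, the observed transition comes from whatever exploration scheme the agent is using.

```python
def td_update(U, s, reward, s_next, alpha, gamma):
    """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)) in place."""
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
    return U[s]

U = {"A": 0.0, "H": 1.0}   # hypothetical utility estimates
updated = td_update(U, "A", reward=-0.04, s_next="H", alpha=0.5, gamma=1.0)
```

Here the target is -0.04 + 1.0 = 0.96, and with α = 0.5 the estimate for A moves halfway there, to 0.48.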

(39)

Next Up: Q-Learning

Basic Idea:

Use TD learning in a model-free context.

Model-free: we do not need to maintain

(40)

Detour: Fancy Python Features

(41)

Q-Values

Q(s, a) = value of performing action a in state s.
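
A sketch of how Q-values might be stored and used in Python. The table layout (a `defaultdict` keyed by `(state, action)`) is one convenient choice, not necessarily the one used in the course. The update shown is the standard textbook Q-learning step and is included as an assumption, since the lecture's own update is not shown in this section; the states `"s"` and `"t"` and all numbers are hypothetical.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)] defaults to 0.0 for unseen pairs

def best_action(Q, s, actions):
    """Action with the highest Q-value in state s."""
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """Textbook Q-learning step (assumed form): move Q(s, a) toward
    reward + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

actions = ["W", "D"]
Q[("t", "W")] = 1.0      # hypothetical existing estimate
new_q = q_update(Q, "s", "D", reward=-0.04, s_next="t",
                 actions=actions, alpha=0.5, gamma=1.0)
choice = best_action(Q, "s", actions)
```

No transition model appears anywhere in this sketch: the agent only needs the observed reward and next state.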

(42)

(43)