Using linear function appro ximation

(1)

Lecture 20: Appr o ximation Methods in MDPs

Gener al pr inciple

Gr adient descent methods

Using linear function appro ximation

Control methods with linear function appro ximation

(2)

Wh y function appr o ximation?

In gener al, state spaces are contin uous or too large to represent

as a tab le

If e v er y state has a separ ate entr y in the tab le , and if w e use

lear ning, then e v er y state has to be visited at least a fe w times

bef ore ha ving a good appro ximation; in the limit e v er y state

should be visited infinitely often, which is not feasib le

Main idea: Use a function appro ximator to gener aliz e from the seen

states to unseen ones

(3)

Representation

Each state (or state-action pair) is represented as a feature

v ector φ 1

φ n

F eatures are usually chosen a pr ior i, and their range is typically

kno wn

T oda y w e assume no model regarding ho w features e v olv e

individually o v er time , b ut w e do assume the Mar k o v proper ty at

the state le v el

(4)

V alue-based methods

W e will use a function appro ximator to represent the v alue function

The input is the feature v ector of the state (or state-action pair)

The output is the predicted v alue of the state (or state-action

pair)

The target (desired) output comes from the MDP/RL algor ithm

E.g. for TD(0), the target w ould be r t 1

γ V

s t 1

(5)

What kind of function appr o ximator can we use?

In pr inciple , there are lots of options:

A tab le where se v er al states are mapped to the same location -

state agg regation

Gr adient-based methods:

– Linear appro ximators

– Ar tificial neur al netw or ks

– Radial Basis Functions

– ...

Memor y-based methods:

– Nearest-neighbor

– Locally w eighted reg ression

(6)

Special requirements

Allo w incremental updates

Ability to handle non-stationar y target functions

F ast adaptation

(7)

State a g gregation

Map the state space S into a finite n umber of par titions

p 1

p n .

Compute a v alue function pretending that the par titions are

states in an MDP

Note that if the policy is fix ed, w e ha v e indeed a Mar k o v process

o v er par titions , so all algor ithms for policy e v aluation w or k

But if w e change the policy , the par tition MDP changes! So

control is not so easy ... b ut still w or ks

The par tition function deter mines ho w good a v alue function w e

can get o v er par titions

(8)

Memor y-based methods

K e y idea: just store all e xamples s

V

s

Nearest neighbor : Giv en state s , first locate “nearest” state seen, ˆs ,

then estimate ˆV

s

V

ˆs

k -Nearest neighbor : T ak e mean of V v alues of k nearest neighbors:

ˆV

s

∑ k i

1 V

s i

k

Locally w eighted reg ression: for m an e xplicit appro ximation ˆV

s

for

region surrounding s

Fit linear function to k nearest neighbors

Fit quadr atic , ...

Produces “piece wise appro ximation” to V

(9)

Pr os and ons of memor y-based methods

Adv antages:

T raining is v er y fast

Lear n comple x target functions

Do not lose an y inf or mation

Local! Hence ha v e good con v ergence proper ties

Disadv antages:

Slo w at quer y time

Easily fooled b y irrele v ant attr ib utes

Need lots of data (b ut this is tr ue of RL in gener al)

(10)

Gradient Descent Methods

Consider the policy e v aluation prob lem: lear ning V π for a giv en

policy π

The appro ximate v alue function V

s t

f

θ

φ t

, where φ t are the attr ib utes (f eatures) descr ibing s t , and θ is a parameter vector

E.g. θ could be the connection w eights in a neur al netw or k

W e will update θ based on the errors computed b y the

reinf orcement lear ning algor ithm

(11)

P erf ormance measure

W e w ant to find a par ameter v ector θ that minimiz es the mean

squared error :

M S E

θ

1 2 ∑ s S P

s

V π

s

V

s

2 What should P be?

In our case P is the on-polic y distrib ution : distr ib ution of

states created when the agent acts according to π

(12)

Gradient descent update

W or ks lik e in the super vised lear ning case:

θ θ

α∇ θ M S E

θ

α∇ θ 1

2 ∑ s

S P

s

V π

s

V

s

2 θ

α ∑ s

S P

s

V π

s

V

s

∇ θ V

s

T o do this incrementally , w e use the sample gradient :

θ θ

α

V π

s

V

s

∇ θ V

s

The sample g radient is an unbiased estimate of the tr ue g radient.

The rule w ould con v erge to a local minim um of the error function, if α is decreased appropr iately o v er time

But where do w e get V π ?

(13)

Using TD tar g ets

Instead of V π , w e will use the targets that come from the T D

λ

algor ithm: θ θα

ν t

s

V

s

∇ θ V

s

If w e use Monte Car lo , then ν t R t is an unbiased estimate of the

tr ue v alue function, and the algor ithm still con v erges to a local

minim um, pro vided α is decreased appropr iately

If ν t R λ t with λ 1 , ν t is not an unbiased estimate , and w e

cannot sa y an ything about the con v ergence in gener al

But the algor ithm is w ell defined, and used in pr actice

(14)

On-line gradient descent T D λ

In addition to the w eight v ector θ , w e will ha v e an eligibility tr ace

v ector e , with one eligibility for e v er y w eight

1. Initializ e the w eight v ector θ arbitr ar ily , and e = 0.

2. Pic k a star t state s

3. Repeat for e v er y time step t :

(a) Choose action a based on policy π and the current state s

(b) T ak e action a , obser v e immediate re w ard r and ne w state s

(c) Compute the TD error : δ r

γ V

s

V

s

(d) Compute the eligibility of e v er y w eight v ector to be updated:

e γλ e

∇ θ V

s

(e) Update the w eight v ector : θ θ

αδ e

(f) s s

(15)

Linear methods

Each state represented b y feature v ector φ

s

φ 1

s

φ n

s

The v alue function is a linear combination of the features:

V

s

θ

φ

s

n ∑ i

1 θ i φ i

s

So the g radient is v er y simple: ∇ θ V

s

φ

s

The error surf ace is quadr atic with a single global minim um

Tsitsiklis and V an Ro y: Linear g radient-descent T D

λ

con v erges w .p .1 to a par ameter v ector θ ∞ in the “vicinity” of the best par ameter

v ector θ

:

M S E

θ ∞

1 γλ

1 γ M S E

θ

(16)

Coar se coding

Main idea: w e w ant linear function appro ximators , b ut with lots of features , so the y can represent comple x functions

a) Narrow generalization b) Broad generalization c) Asymmetric generalization

(17)

Speed of learning with coar se coding

10

40160

6402560

10240

Narrowfeatures desiredfunction

Mediumfeatures Broadfeatures #Examples approx-imation

featurewidth

The width of the cells aff ects the speed, not the precision of the lear ner

(18)

Discretizing the state space

Suppose w e ha v e a contin uous state space with tw o contin uous

v ar iab le (e .g. lik e in the Mountain-Car task)

The simplest tile coding appro ximator w ould be just a g rid

discretizing the state space:

The features are all 0 e xcept for the cell holding the current

state , which is 1 (lik e a 1-of-n encoding)

All states in the same cell ha v e the same v alue (giv en b y the

w eight of the cell)

(19)

Pr os and cons of discretizations

Pros:

Easy to compute the v alue function of a state

Easy to update as w ell (more lik e the tab le lookup case).

Cons:

T o get good precision, w e need a v er y fine g rid - going bac k to

the tab le lookup case?

States in the vicinity of a separ ation line could ha v e radically

diff erent v alues (appro ximation is discontin uous)

(20)

Tile coding (contin ued)

Main idea: Ov er lap se v er al tilings!

tiling #1

tiling #2

Shape of tiles ⇒ Generalization

#Tilings ⇒ Resolution of final approximation

2D statespace

(21)

Characteristics of tile coding

Each tile is a binar y feature

The n umber of features that are activ ated at an y time is

constant, equal to the n umber of tilings

It is easy to compute the indices of the features activ ated, and

easy to compute the w eighted sum

The o v er all discretization is v er y fine , and at the same time the

discontin uities are smoothed out

The shape of the tiles reflects pr ior domain kno wledge

Cf . CMA C (Alb us , 1971)

(22)

Contr ol with function appr o ximation

Input: a descr iption of the state-action pair

s t

a t

Output: an action-v alue function Q

s t

a t

The gener al g radient descent rule:

θ θ

α

ν t

Q

s t

a t

∇ θ Q

s t

a t

Example: Sarsa( λ ) θ θ

αδ t e t

where

δ t r t 1

γ Q

s t 1

a t 1

Q

s t

a t

and e t γλ e t

∇ θ Q

s t

a t

(23)

Illustration: Mountain-Car task

1−

.2

o P

itis

no

.60 Step 428 Goal

oP

itis

no

4

0

−.07

.07 Velocity Velocity

Velocity Velocity

oP

itis

no

oP

itis

no oP

itis

no

0 27

0 120

0 104

0 46 Episode 12

Episode 104Episode 1000Episode 9000

M

OUNTAIN

C

AR

(24)

Theor y of contr ol algorithms

Sarsa pro v en to con v erge to a region of policy space (Gordon,

2001)

Q-lear ning sho wn to div erge in e xtremely simple e xamples

A fe w off-policy e v aluation algor ithms that might shed light into

Q-lear ning beha vior (Precup et al, 2000, 2001)

One of the con v ergence prob lems is bootstr apping

(25)

Bair d’ s countere xample

Vk(s) = θ(7)+2θ(1)

terminalstate99% 1% 100% Vk(s) = θ(7)+2θ(2) Vk(s) = θ(7)+2θ(3) Vk(s) = θ(7)+2θ(4) Vk(s) = θ(7)+2θ(5)

Vk(s) = 2θ(7)+θ(6) 0 5 10

010002000300040005000 10 10/ -10

Iterations (k) 510

1010 010

-- Parametervalues, θk (i)

(log scale,

broken at ±1) θk(7) θk(1) – θk(5)

θk(6)

(26)

Should we bootstrap?

accumulatingtraces

0.2 0.3 0.4 0.5

00.20.40.60.81λ RANDOM WALK

50 100 150 200 250 300

Failures per100,000 steps

00.20.40.60.81λ CART AND POLE 400 450 500 550 600 650 700

Steps perepisode

00.20.40.60.81λ MOUNTAIN CAR

replacingtraces

150 160 170 180 190 200 210 220 230 240

Cost perepisode

00.20.40.60.81λ PUDDLE WORLD_replacing

traces accumulatingtraces

replacingtraces

accumulatingtraces RMS error

(27)