Lecture 20: Appr o ximation Methods in MDPs
Gener al pr inciple
Gr adient descent methods
Using linear function appro ximation
Control methods with linear function appro ximation
Wh y function appr o ximation?
In gener al, state spaces are contin uous or too large to represent
as a tab le
If e v er y state has a separ ate entr y in the tab le , and if w e use
lear ning, then e v er y state has to be visited at least a fe w times
bef ore ha ving a good appro ximation; in the limit e v er y state
should be visited infinitely often, which is not feasib le
Main idea: Use a function appro ximator to gener aliz e from the seen
states to unseen ones
Representation
Each state (or state-action pair) is represented as a feature
v ector φ 1
φ n
F eatures are usually chosen a pr ior i, and their range is typically
kno wn
T oda y w e assume no model regarding ho w features e v olv e
individually o v er time , b ut w e do assume the Mar k o v proper ty at
the state le v el
V alue-based methods
W e will use a function appro ximator to represent the v alue function
The input is the feature v ector of the state (or state-action pair)
The output is the predicted v alue of the state (or state-action
pair)
The target (desired) output comes from the MDP/RL algor ithm
E.g. for TD(0), the target w ould be r t 1
γ V
s t 1
What kind of function appr o ximator can we use?
In pr inciple , there are lots of options:
A tab le where se v er al states are mapped to the same location -
state agg regation
Gr adient-based methods:
– Linear appro ximators
– Ar tificial neur al netw or ks
– Radial Basis Functions
– ...
Memor y-based methods:
– Nearest-neighbor
– Locally w eighted reg ression
Special requirements
Allo w incremental updates
Ability to handle non-stationar y target functions
F ast adaptation
State a g gregation
Map the state space S into a finite n umber of par titions
p 1
p n .
Compute a v alue function pretending that the par titions are
states in an MDP
Note that if the policy is fix ed, w e ha v e indeed a Mar k o v process
o v er par titions , so all algor ithms for policy e v aluation w or k
But if w e change the policy , the par tition MDP changes! So
control is not so easy ... b ut still w or ks
The par tition function deter mines ho w good a v alue function w e
can get o v er par titions
Memor y-based methods
K e y idea: just store all e xamples s
V
s
Nearest neighbor : Giv en state s , first locate “nearest” state seen, ˆs ,
then estimate ˆV
s
V
ˆs
k -Nearest neighbor : T ak e mean of V v alues of k nearest neighbors:
ˆV
s
∑ k i
1 V
s i
k
Locally w eighted reg ression: for m an e xplicit appro ximation ˆV
s
for
region surrounding s
Fit linear function to k nearest neighbors
Fit quadr atic , ...
Produces “piece wise appro ximation” to V
Pr os and ons of memor y-based methods
Adv antages:
T raining is v er y fast
Lear n comple x target functions
Do not lose an y inf or mation
Local! Hence ha v e good con v ergence proper ties
Disadv antages:
Slo w at quer y time
Easily fooled b y irrele v ant attr ib utes
Need lots of data (b ut this is tr ue of RL in gener al)
Gradient Descent Methods
Consider the policy e v aluation prob lem: lear ning V π for a giv en
policy π
The appro ximate v alue function V
s t
f
θ
φ t
, where φ t are the attr ib utes (f eatures) descr ibing s t , and θ is a parameter vector
E.g. θ could be the connection w eights in a neur al netw or k
W e will update θ based on the errors computed b y the
reinf orcement lear ning algor ithm
P erf ormance measure
W e w ant to find a par ameter v ector θ that minimiz es the mean
squared error :
M S E
θ
1
2 ∑ s S P
s
V π
s
V
s
2
What should P be?
In our case P is the on-polic y distrib ution : distr ib ution of
states created when the agent acts according to π
Gradient descent update
W or ks lik e in the super vised lear ning case:
θ θ
α∇ θ M S E
θ
θ
α∇ θ 1
2 ∑ s
S P
s
V π
s
V
s
2
θ
α ∑ s
S P
s
V π
s
V
s
∇ θ V
s
T o do this incrementally , w e use the sample gradient :
θ θ
α
V π
s
V
s
∇ θ V
s
The sample g radient is an unbiased estimate of the tr ue g radient.
The rule w ould con v erge to a local minim um of the error function, if α is decreased appropr iately o v er time
But where do w e get V π ?
Using TD tar g ets
Instead of V π , w e will use the targets that come from the T D
λ
algor ithm: θ θα
ν t
s
V
s
∇ θ V
s
If w e use Monte Car lo , then ν t R t is an unbiased estimate of the
tr ue v alue function, and the algor ithm still con v erges to a local
minim um, pro vided α is decreased appropr iately
If ν t R λ t with λ 1 , ν t is not an unbiased estimate , and w e
cannot sa y an ything about the con v ergence in gener al
But the algor ithm is w ell defined, and used in pr actice
On-line gradient descent T D λ
In addition to the w eight v ector θ , w e will ha v e an eligibility tr ace
v ector e , with one eligibility for e v er y w eight
1. Initializ e the w eight v ector θ arbitr ar ily , and e = 0.
2. Pic k a star t state s
3. Repeat for e v er y time step t :
(a) Choose action a based on policy π and the current state s
(b) T ak e action a , obser v e immediate re w ard r and ne w state s
(c) Compute the TD error : δ r
γ V
s
V
s
(d) Compute the eligibility of e v er y w eight v ector to be updated:
e γλ e
∇ θ V
s
(e) Update the w eight v ector : θ θ
αδ e
(f) s s
Linear methods
Each state represented b y feature v ector φ
s
φ 1
s
φ n
s
The v alue function is a linear combination of the features:
V
s
θ
φ
s
n ∑ i
1 θ i φ i
s
So the g radient is v er y simple: ∇ θ V
s
φ
s
The error surf ace is quadr atic with a single global minim um
Tsitsiklis and V an Ro y: Linear g radient-descent T D
λ
con v erges w .p .1 to a par ameter v ector θ ∞ in the “vicinity” of the best par ameter
v ector θ
:
M S E
θ ∞
1
γλ
1
γ M S E
θ
Coar se coding
Main idea: w e w ant linear function appro ximators , b ut with lots of features , so the y can represent comple x functions
a) Narrow generalization b) Broad generalization c) Asymmetric generalization
Speed of learning with coar se coding
10
40160
6402560
10240
Narrowfeatures desiredfunction
Mediumfeatures Broadfeatures #Examples approx-imation
featurewidth
The width of the cells aff ects the speed, not the precision of the lear ner
Discretizing the state space
Suppose w e ha v e a contin uous state space with tw o contin uous
v ar iab le (e .g. lik e in the Mountain-Car task)
The simplest tile coding appro ximator w ould be just a g rid
discretizing the state space:
The features are all 0 e xcept for the cell holding the current
state , which is 1 (lik e a 1-of-n encoding)
All states in the same cell ha v e the same v alue (giv en b y the
w eight of the cell)
Pr os and cons of discretizations
Pros:
Easy to compute the v alue function of a state
Easy to update as w ell (more lik e the tab le lookup case).
Cons:
T o get good precision, w e need a v er y fine g rid - going bac k to
the tab le lookup case?
States in the vicinity of a separ ation line could ha v e radically
diff erent v alues (appro ximation is discontin uous)
Tile coding (contin ued)
Main idea: Ov er lap se v er al tilings!
tiling #1
tiling #2
Shape of tiles ⇒ Generalization
#Tilings ⇒ Resolution of final approximation
2D statespaceCharacteristics of tile coding
Each tile is a binar y feature
The n umber of features that are activ ated at an y time is
constant, equal to the n umber of tilings
It is easy to compute the indices of the features activ ated, and
easy to compute the w eighted sum
The o v er all discretization is v er y fine , and at the same time the
discontin uities are smoothed out
The shape of the tiles reflects pr ior domain kno wledge
Cf . CMA C (Alb us , 1971)
Contr ol with function appr o ximation
Input: a descr iption of the state-action pair
s t
a t
Output: an action-v alue function Q
s t
a t
The gener al g radient descent rule:
θ θ
α
ν t
Q
s t
a t
∇ θ Q
s t
a t
Example: Sarsa( λ ) θ θ
αδ t e t
where
δ t r t 1
γ Q
s t 1
a t 1
Q
s t
a t
and e t γλ e t
∇ θ Q
s t
a t
Illustration: Mountain-Car task
1−
.2
o P
itis
no
.60 Step 428 Goal
oP
itis
no
4
0
−.07
.07 Velocity Velocity
Velocity Velocity
Velocity Velocity
oP
itis
no
oP
itis
no oP
itis
no
0 27
0 120
0 104
0 46 Episode 12
Episode 104Episode 1000Episode 9000
M
OUNTAINC
ARTheor y of contr ol algorithms
Sarsa pro v en to con v erge to a region of policy space (Gordon,
2001)
Q-lear ning sho wn to div erge in e xtremely simple e xamples
A fe w off-policy e v aluation algor ithms that might shed light into
Q-lear ning beha vior (Precup et al, 2000, 2001)
One of the con v ergence prob lems is bootstr apping
Bair d’ s countere xample
Vk(s) = θ(7)+2θ(1)
terminalstate99% 1% 100% Vk(s) = θ(7)+2θ(2) Vk(s) = θ(7)+2θ(3) Vk(s) = θ(7)+2θ(4) Vk(s) = θ(7)+2θ(5)
Vk(s) = 2θ(7)+θ(6) 0 5 10
010002000300040005000 10 10/ -10
Iterations (k) 510
1010 010
-- Parametervalues, θk (i)
(log scale,
broken at ±1) θk(7) θk(1) – θk(5)
θk(6)
Should we bootstrap?
accumulatingtraces
0.2 0.3 0.4 0.5
00.20.40.60.81λ RANDOM WALK
50 100 150 200 250 300
Failures per100,000 steps
00.20.40.60.81λ CART AND POLE 400 450 500 550 600 650 700
Steps perepisode
00.20.40.60.81λ MOUNTAIN CAR
replacingtraces
150 160 170 180 190 200 210 220 230 240
Cost perepisode
00.20.40.60.81λ PUDDLE WORLDreplacing
traces accumulatingtraces
replacingtraces
accumulatingtraces RMS error