Approximate Dynamic Programming via Linear Programming

Daniela P. de Farias
Department of Management Science and Engineering
Stanford University
Stanford, CA 94305
pucci@stanford.edu

Benjamin Van Roy
Department of Management Science and Engineering
Stanford University
Stanford, CA 94305
bvr@stanford.edu

Abstract

The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop bounds on the approximation error and present experimental results in the domain of queueing network control, providing empirical support for the methodology.

1 Introduction

Dynamic programming offers a unified approach to solving problems of stochastic control. Central to the methodology is the cost-to-go function, which can be obtained via solving Bellman's equation. The domain of the cost-to-go function is the state space of the system to be controlled, and dynamic programming algorithms compute and store a table consisting of one cost-to-go value per state. Unfortunately, the size of a state space typically grows exponentially in the number of state variables. Known as the curse of dimensionality, this phenomenon renders dynamic programming intractable in the face of problems of practical scale.

One approach to dealing with this difficulty is to generate an approximation within a parameterized class of functions, in a spirit similar to that of statistical regression. The focus of this paper is on linearly parameterized functions: one tries to approximate the cost-to-go function $J^*$ by a linear combination of prespecified basis functions. Note that this scheme depends on two important preconditions for the [...] a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. "Regularities" associated with the function, for example, can guide the choice of representation. Second, we need an efficient algorithm that computes an appropriate linear combination.

The algorithm we study is based on a linear programming formulation, originally proposed by Schweitzer and Seidmann [5], that generalizes the linear programming approach to exact dynamic programming, originally introduced by Manne [4]. We present an error bound that characterizes the quality of approximations produced by the linear programming approach. The error is characterized in relative terms, compared against the "best possible" approximation of the optimal cost-to-go function given the selection of basis functions. This is the first such error bound for any algorithm that approximates cost-to-go functions of general stochastic control problems by computing weights for arbitrary collections of basis functions.

2 Stochastic control and linear programming

We consider discrete-time stochastic control problems involving a finite state space $S$ of cardinality $|S| = N$. For each state $x \in S$, there is a finite set of available actions $A_x$. Taking action $a \in A_x$ when the current state is $x$ incurs cost $g_a(x)$. State transition probabilities $p_a(x,y)$ represent, for each pair $(x,y)$ of states and each action $a \in A_x$, the probability that the next state will be $y$ given that the current state is $x$ and the current action is $a \in A_x$.

A policy $u$ is a mapping from states to actions. Given a policy $u$, the dynamics of the system follow a Markov chain with transition probabilities $p_{u(x)}(x,y)$. For each policy $u$, we define a transition matrix $P_u$ whose $(x,y)$th entry is $p_{u(x)}(x,y)$.

The problem of stochastic control amounts to selection of a policy that optimizes a given criterion. In this paper, we will employ as an optimality criterion infinite-horizon discounted cost of the form

$$J_u(x) = E\left[\sum_{t=0}^{\infty} \alpha^t g_u(x_t) \,\Big|\, x_0 = x\right],$$

where $g_u(x)$ is used as shorthand for $g_{u(x)}(x)$ and the discount factor $\alpha \in (0,1)$ reflects inter-temporal preferences. Optimality is attained by any policy that is greedy with respect to the optimal cost-to-go function $J^*(x) = \min_u J_u(x)$ (a policy $u$ is called greedy with respect to $J$ if $T_u J = TJ$).

Let us define operators $T_u$ and $T$ by $T_u J = g_u + \alpha P_u J$ and $TJ = \min_u (g_u + \alpha P_u J)$. The optimal cost-to-go function solves uniquely Bellman's equation $J = TJ$. Dynamic programming offers a number of approaches to solving this equation; one of particular relevance to our paper makes use of linear programming, as we will now discuss. Consider the problem

$$\max_J \; c'J \quad \text{s.t.} \quad TJ \ge J, \tag{1}$$

where $c$ is a vector with positive components, which we will refer to as state-relevance weights. It can be shown that any feasible $J$ satisfies $J \le J^*$. It follows that, for any set of positive weights $c$, $J^*$ is the unique solution to (1).
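The feasibility claim can be checked numerically on a small instance. The following is a minimal sketch, assuming a made-up two-state, two-action problem (the costs, transition probabilities and discount factor below are illustrative and do not come from the paper). It computes $J^*$ by value iteration, which is a different method from the LP, and then tests the constraint $TJ \ge J$ for three candidate vectors.

```python
# Toy problem: all numbers are illustrative assumptions, not taken from the paper.
ALPHA = 0.9                      # discount factor alpha
STATES = [0, 1]
ACTIONS = [0, 1]
COST = [[1.0, 2.0], [1.5, 0.5]]  # COST[a][x] = g_a(x)
PROB = [                         # PROB[a][x][y] = p_a(x, y)
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.1, 0.9], [0.6, 0.4]],
]


def bellman(J):
    """Apply T: (TJ)(x) = min_a [ g_a(x) + alpha * sum_y p_a(x, y) J(y) ]."""
    return [
        min(
            COST[a][x] + ALPHA * sum(PROB[a][x][y] * J[y] for y in STATES)
            for a in ACTIONS
        )
        for x in STATES
    ]


def value_iteration(tol=1e-12, max_iter=100_000):
    """Iterate J <- TJ from J = 0 until successive iterates differ by less than tol."""
    J = [0.0 for _ in STATES]
    for _ in range(max_iter):
        TJ = bellman(J)
        if max(abs(t - j) for t, j in zip(TJ, J)) < tol:
            return TJ
        J = TJ
    return J


def is_feasible(J, slack=1e-9):
    """Check the LP constraint TJ >= J componentwise, up to a small numerical slack."""
    return all(t >= j - slack for t, j in zip(bellman(J), J))


J_star = value_iteration()
J_zero = [0.0, 0.0]                 # feasible here because all costs are nonnegative
J_high = [j + 1.0 for j in J_star]  # T(J* + 1) = J* + alpha < J* + 1, so infeasible
```

In this sketch the zero vector is feasible and lies below $J^*$, while shifting $J^*$ upward by a constant breaks feasibility, which is consistent with $J^*$ being the largest feasible point.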

Note that each constraint $(TJ)(x) \ge J(x)$ is equivalent to a set of constraints $g_a(x) + \alpha \sum_{y \in S} p_a(x,y) J(y) \ge J(x)$, $\forall a \in A_x$, so that the optimization problem [...] due to the curse of dimensionality. Consequently, the linear program of interest involves prohibitively large numbers of variables and constraints. The approximation algorithm we study reduces dramatically the number of variables.

Let us now introduce the linear programming approach to approximate dynamic programming. Given pre-selected basis functions $\phi_1, \dots, \phi_K$, define a matrix $\Phi = [\phi_1 \cdots \phi_K]$. With an aim of computing a weight vector $\tilde r \in \Re^K$ such that $\Phi \tilde r$ is a close approximation to $J^*$, one might pose the following optimization problem:

$$\max_r \; c'\Phi r \quad \text{s.t.} \quad T\Phi r \ge \Phi r. \tag{2}$$

Given a solution $\tilde r$, one might then hope to generate near-optimal decisions by using a policy that is greedy with respect to $\Phi \tilde r$.

As with the case of exact dynamic programming, the optimization problem (2) can be recast as a linear program. We will refer to this problem as the approximate LP. Note that, though the number of variables is reduced to $K$, the number of constraints remains as large as in the exact LP. Fortunately, we expect that most of the constraints will become irrelevant, and solutions to the linear program can be approximated efficiently, as demonstrated in [3].
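To make the recasting concrete, here is a minimal sketch under strong simplifying assumptions: a made-up three-state problem and a single basis function ($K = 1$), so that the approximate LP collapses to one scalar variable and can be solved by intersecting intervals. All numbers and function names are hypothetical; for $K > 1$ a general-purpose LP solver would be needed. Each pair $(x, a)$ contributes one linear constraint $\big(\phi(x) - \alpha \sum_y p_a(x,y)\phi(y)\big) r \le g_a(x)$.

```python
import math

# Toy problem: all numbers are illustrative assumptions, not taken from the paper.
ALPHA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]
COST = [[1.0, 2.0, 4.0], [2.0, 1.5, 3.0]]   # COST[a][x] = g_a(x)
PROB = [                                     # PROB[a][x][y] = p_a(x, y)
    [[0.7, 0.3, 0.0], [0.4, 0.4, 0.2], [0.0, 0.5, 0.5]],
    [[0.2, 0.8, 0.0], [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]],
]
C_WEIGHTS = [0.5, 0.3, 0.2]                  # state-relevance weights c
PHI = [1.0, 2.0, 3.0]                        # single basis function phi(x) = x + 1


def constraint_rows():
    """Return one (coef, rhs) pair per (x, a), meaning coef * r <= rhs.

    Obtained by rearranging g_a(x) + alpha * sum_y p_a(x, y) phi(y) r >= phi(x) r.
    """
    rows = []
    for x in STATES:
        for a in ACTIONS:
            expected_phi = sum(PROB[a][x][y] * PHI[y] for y in STATES)
            rows.append((PHI[x] - ALPHA * expected_phi, COST[a][x]))
    return rows


def solve_one_dimensional_lp(rows, objective):
    """Maximize objective * r subject to coef * r <= rhs for every row."""
    lower, upper = -math.inf, math.inf
    for coef, rhs in rows:
        if coef > 0:
            upper = min(upper, rhs / coef)   # r <= rhs / coef
        elif coef < 0:
            lower = max(lower, rhs / coef)   # r >= rhs / coef
        elif rhs < 0:
            raise ValueError("infeasible: constraint reads 0 <= negative")
    if lower > upper:
        raise ValueError("infeasible LP")
    best = upper if objective > 0 else lower
    if math.isinf(best):
        raise ValueError("unbounded LP")
    return best


rows = constraint_rows()
objective = sum(c * phi for c, phi in zip(C_WEIGHTS, PHI))   # c' Phi
r_tilde = solve_one_dimensional_lp(rows, objective)
```

The sketch has one constraint per state-action pair even though there is only one variable, which mirrors the remark that the constraint count is not reduced by the approximation.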

3 Error Bounds for the Approximate LP

When the optimal cost-to-go function lies within the span of the basis functions, solution of the approximate LP yields the exact optimal cost-to-go function. Unfortunately, it is difficult in practice to select a set of basis functions that contains the optimal cost-to-go function within its span. Instead, basis functions must be based on heuristics and simplified analyses. One can only hope that the span comes close to the desired cost-to-go function.

For the approximate LP to be useful, it should deliver good approximations when the cost-to-go function is near the span of selected basis functions. In this section, we present a bound that ensures desirable results of this kind.

To set the stage for development of an error bound, let us establish some notation. First, we introduce the weighted norms, defined by

$$\|J\|_{1,\gamma} = \sum_{x \in S} \gamma(x)\,|J(x)|, \qquad \|J\|_{\infty,\gamma} = \max_{x \in S} \gamma(x)\,|J(x)|,$$

for any $\gamma : S \mapsto \Re^+$. Note that both norms allow for uneven weighting of errors across the state space.
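As a small illustration of how the weights change which errors matter, the following sketch evaluates both norms on a made-up error vector; the numbers and function names are assumptions for illustration only.

```python
def weighted_l1_norm(J, gamma):
    """Weighted 1-norm: sum over x of gamma(x) * |J(x)|."""
    return sum(g * abs(j) for j, g in zip(J, gamma))


def weighted_max_norm(J, gamma):
    """Weighted max-norm: max over x of gamma(x) * |J(x)|."""
    return max(g * abs(j) for j, g in zip(J, gamma))


errors = [1.0, -4.0, 0.5]   # hypothetical approximation errors at three states
gamma = [0.5, 0.1, 0.4]     # hypothetical positive weights

l1_value = weighted_l1_norm(errors, gamma)    # 0.5*1.0 + 0.1*4.0 + 0.4*0.5
max_value = weighted_max_norm(errors, gamma)  # largest weighted term
```

Here the largest raw error sits at the second state, but its small weight means the weighted max-norm is determined by the first state instead.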

We also introduce an operator $H$, defined by

$$(HV)(x) = \max_{a \in A_x} \sum_{y} P_a(x,y)\,V(y),$$

for all $V : S \mapsto \Re$. For any $V$, $(HV)(x)$ represents the maximum expected value of $V(y)$ if the current state is $x$ and $y$ is a random variable representing the next state. Based on this operator, we define a scalar

$$k_V = \max_x \frac{V(x)}{V(x) - \alpha (HV)(x)}, \tag{3}$$

[...] $k_V$ a "Lyapunov stability factor," in a sense that we will now explain. In the upcoming theorem, we will only be concerned with functions $V$ that are positive and that make $k_V$ nonnegative. Also, our error bound for the approximate LP will grow proportionately with $k_V$, and we therefore want $k_V$ to be small. At a minimum, $k_V$ should be finite, which translates to a condition

$$\alpha (HV)(x) < V(x), \quad \forall x \in S. \tag{4}$$

If $\alpha$ were equal to 1, this would look like a Lyapunov stability condition: the maximum expected value $(HV)(x)$ at the next time step must be less than the current value $V(x)$. In general, $\alpha$ is less than 1, and this introduces some slack in the condition. Note also that $k_V$ becomes smaller as the $(HV)(x)$'s become small relative to the $V(x)$'s. Hence, $k_V$ conveys a degree of "stability," with smaller values representing stronger stability.
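The definitions of $H$, $k_V$ and condition (4) can be evaluated directly on a small example. The sketch below assumes a made-up three-state, two-action problem (the transition probabilities and candidate functions $V$ are illustrative, not from the paper) and returns infinity whenever condition (4) fails.

```python
import math

# Toy problem: all numbers are illustrative assumptions, not taken from the paper.
ALPHA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]
PROB = [                                     # PROB[a][x][y] = P_a(x, y)
    [[0.7, 0.3, 0.0], [0.4, 0.4, 0.2], [0.0, 0.5, 0.5]],
    [[0.2, 0.8, 0.0], [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]],
]


def apply_H(V):
    """(HV)(x) = max_a sum_y P_a(x, y) V(y)."""
    return [
        max(sum(PROB[a][x][y] * V[y] for y in STATES) for a in ACTIONS)
        for x in STATES
    ]


def lyapunov_factor(V):
    """k_V = max_x V(x) / (V(x) - alpha (HV)(x)), or inf if condition (4) fails."""
    HV = apply_H(V)
    if any(ALPHA * h >= v for h, v in zip(HV, V)):
        return math.inf
    return max(v / (v - ALPHA * h) for h, v in zip(HV, V))


V_const = [1.0, 1.0, 1.0]   # HV = V, so k_V = 1 / (1 - alpha)
V_steep = [1.0, 2.0, 3.0]   # expected growth from state 0 violates condition (4)

k_const = lyapunov_factor(V_const)
k_steep = lyapunov_factor(V_steep)
```

A constant $V$ always satisfies condition (4) when $\alpha < 1$ and gives $k_V = 1/(1-\alpha)$, whereas the second candidate grows too quickly in expectation from state 0 and is rejected in this toy instance.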

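We are now ready to state our main result. For any given function $V$ mapping $S$ to positive reals, we use $1/V$ as shorthand for a function $x \mapsto 1/V(x)$.

Theorem 3.1 [2] Let $\tilde r$ be a solution of the approximate LP. Then, for any $v \in \Re^K$ such that $(\Phi v)(x) > 0$ for all $x \in S$ and $\alpha H \Phi v < \Phi v$,

$$\|J^* - \Phi \tilde r\|_{1,c} \le 2\, k_{\Phi v}\, (c'\Phi v)\, \min_r \|J^* - \Phi r\|_{\infty, 1/\Phi v}. \tag{5}$$

A proof of Theorem 3.1 can be found in the long version of this paper [2].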

We highlight some implications of Theorem 3.1. First, the error bound (5) tells us that the approximation error yielded by the approximate LP is proportional to the error associated with the best possible approximation relative to a certain norm $\|\cdot\|_{\infty, 1/\Phi v}$. Hence we expect that the approximate LP will have reasonable behavior: if the choice of basis functions is appropriate, the approximate LP should yield a relatively good approximation to the cost-to-go function, as long as the constants $k_{\Phi v}$ and $c'\Phi v$ remain small.

Note that on the left-hand side of (5), we measure the approximation error with the weighted norm $\|\cdot\|_{1,c}$. Recall that the weight vector $c$ appears in the objective function of the approximate LP (2) and must be chosen. In approximating the solution to a given stochastic control problem, it seems sensible to weight more heavily portions of the state space that are visited frequently, so that accuracy will be emphasized in such regions. As discussed in [2], it seems reasonable that the weight vector $c$ should be chosen to reflect the relative importance of each state.

Finally, note that the Lyapunov function $\Phi v$ plays a central role in the bound of Theorem 3.1. Its choice influences three terms on the right-hand side of the bound:

1. the error $\min_r \|J^* - \Phi r\|_{\infty, 1/\Phi v}$;

2. the Lyapunov stability factor $k_{\Phi v}$;

3. the inner product $c'\Phi v$ with the state-relevance weights.

An appropriately chosen Lyapunov function should make all three of these terms relatively small. Furthermore, for the bound to be useful in practical contexts, these terms should not grow much with problem size. We now illustrate with an application in queueing problems how a suitable Lyapunov function could be found [...]

Consider a single reentrant line with $d$ queues and finite buffers of size $B$. We assume that exogenous arrivals occur at queue 1 with probability $p < 1/2$. The state $x \in \Re^d$ indicates the number of jobs in each queue. The cost per stage incurred at state $x$ is given by

$$g(x) = \frac{|x|}{d} = \frac{1}{d} \sum_{i=1}^{d} x_i,$$

the average number of jobs per queue.

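As discussed in [2], under certain stability assumptions we expect that the optimal cost-to-go function should satisfy

$$0 \le J^*(x) \le \frac{\rho_2}{d}\, x'x + \frac{\rho_1}{d}\, e'x + \rho_0,$$

for some positive scalars $\rho_0$, $\rho_1$ and $\rho_2$ independent of $d$. We consider a Lyapunov function $V(x) = \frac{1}{d}\, x'x + C$ for some constant $C > 0$, which implies

$$\min_r \|J^* - \Phi r\|_{\infty, 1/V} \le \|J^*\|_{\infty, 1/V} \le \max_{x \ge 0} \frac{\rho_2\, x'x + \rho_1\, e'x + d \rho_0}{x'x + dC} \le \rho_2 + \rho_1 + \frac{\rho_0}{C},$$

and the above bound is independent of the number of queues in the system.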

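Now let us study $k_V$. We have

$$\alpha (HV)(x) \le \alpha \left[ p \left( \frac{1}{d}\, x'x + \frac{2x_1 + 1}{d} + C \right) + (1 - p) \left( \frac{1}{d}\, x'x + C \right) \right] \le V(x) \left( \alpha + p\, \frac{2x_1 + 1}{x_1^2 + dC} \right),$$

and it is clear that, for $C$ sufficiently large and independent of $d$, there is a $\beta < 1$ independent of $d$ such that $\alpha HV \le \beta V$, and therefore $k_V \le \frac{1}{1 - \beta}$.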

Finally, let us consider $c'V$. Discussion presented in [2] suggests that one might want to choose $c$ so as to reflect the stationary state distribution. We expect that under some stability assumptions, the tail of the stationary state distribution will have an upper bound with geometric decay [1]. Therefore we let $c(x) = \left( \frac{1 - \xi}{1 - \xi^{B+1}} \right)^{d} \xi^{|x|}$, for some $0 < \xi < 1$. In this case, $c$ is equivalent to the conditional joint distribution of $d$ independent and identically distributed geometric random variables conditioned on the event that they are less than $B + 1$, and we have

$$c'V = E\left[ \frac{1}{d} \sum_{i=1}^{d} X_i^2 + C \,\Big|\, X_i < B + 1,\ i = 1, \dots, d \right] < 2\, \frac{\xi^2}{(1 - \xi)^2} + \frac{\xi}{1 - \xi} + C,$$

where $X_i$, $i = 1, \dots, d$ are identically distributed geometric random variables with parameter $1 - \xi$. It follows that $c'V$ is uniformly bounded over the number of queues.
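The bound on $c'V$ can be sanity-checked numerically. The sketch below is an illustration under assumed parameter values (the choices of $\xi$, $B$, $C$ and $d$ are arbitrary, and the function names are hypothetical). It computes $c'V$ two ways: through the second moment of a single truncated geometric variable, and by brute-force summation over the state space for a small $d$.

```python
from itertools import product


def truncated_second_moment(xi, B):
    """E[X^2 | X <= B] when P(X = k) is proportional to xi**k for k = 0, 1, ..."""
    weights = [xi ** k for k in range(B + 1)]
    total = sum(weights)
    return sum(k * k * w for k, w in enumerate(weights)) / total


def c_prime_V(xi, B, C):
    """c'V via symmetry: (1/d) * sum_i E[X_i^2 | X_i <= B] + C equals E[X_1^2 | X_1 <= B] + C."""
    return truncated_second_moment(xi, B) + C


def upper_bound(xi, C):
    """Right-hand side of the bound: 2 xi^2 / (1 - xi)^2 + xi / (1 - xi) + C."""
    return 2 * xi ** 2 / (1 - xi) ** 2 + xi / (1 - xi) + C


def c_weight(x, xi, B):
    """c(x) = ((1 - xi) / (1 - xi^(B+1)))^d * xi^|x| for a state x with d coordinates."""
    d = len(x)
    return ((1 - xi) / (1 - xi ** (B + 1))) ** d * xi ** sum(x)


def brute_force_c_prime_V(xi, B, C, d):
    """Sum c(x) * V(x) over all states, with V(x) = x'x / d + C."""
    total = 0.0
    for x in product(range(B + 1), repeat=d):
        V = sum(k * k for k in x) / d + C
        total += c_weight(x, xi, B) * V
    return total


def total_weight(xi, B, d):
    """Sum of c(x) over the whole state space; should equal 1."""
    return sum(c_weight(x, xi, B) for x in product(range(B + 1), repeat=d))
```

Because the one-dimensional formula does not involve $d$, the computed value is the same for any number of queues, which is the uniform boundedness noted in the text.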

This example shows that the terms involved in the error bound (5) are uniformly bounded both in the number of states in the system and in the number of state variables, hence the behavior of the approximate LP does not deteriorate as the problem size increases.

We finally present a numerical experiment to further illustrate the performance of [...]

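[Figure 1: System for Example 3.2. Arrival probabilities: $\lambda_1 = \lambda_2 = 1/11.5$. Departure probabilities: $\mu_1 = 4/11.5$, $\mu_2 = 2/11.5$, $\mu_3 = 3/11.5$, $\mu_4 = 3.1/11.5$, $\mu_5 = 3/11.5$, $\mu_6 = 2.2/11.5$, $\mu_7 = 3/11.5$, $\mu_8 = 2.5/11.5$.]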

Policy          ALP ($\xi = 0.9$)   LBFS    FIFO    LONG
Average Cost    136.7               153.3   163.3   168.3

Table 1: Average number of jobs after 50,000,000 simulation steps.

3.2 An Eight-Dimensional Queueing Network

We consider a queueing network with eight queues. The system is depicted in Figure 1, with arrival ($\lambda_i$, $i = 1, 2$) and departure ($\mu_i$, $i = 1, \dots, 8$) probabilities indicated. The state $x \in \Re^8$ represents the number of jobs in each queue. The cost per stage is $g(x) = |x|$, and the discount factor $\alpha$ is 0.995. Actions $a \in \{0,1\}^8$ indicate which queues are being served; $a_i = 1$ iff a job from queue $i$ is being processed. We consider only non-idling policies and, at each time step, a server processes jobs from one of its queues exclusively.

We choose $c$ of the form $c(x) = (1 - \xi)^8\, \xi^{|x|}$. The basis functions are chosen to span all polynomials in $x$ of degree 2; therefore, the approximate LP has 47 variables. Constraints $(T\Phi r)(x) \ge (\Phi r)(x)$ for the approximate LP are generated by sampling 5000 states according to the distribution associated with $c$. Experiments were performed for $\xi = 0.85$, $0.9$ and $0.95$, and $\xi = 0.9$ yielded the policy with smallest average cost.
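The sampling step can be sketched as follows. Since $c(x) = (1-\xi)^8 \xi^{|x|}$ factors across coordinates, each $x_i$ can be drawn independently with $P(x_i = k) = (1-\xi)\xi^k$. The code below is an assumed implementation using inverse-transform sampling; the function names and the seed are illustrative, and the paper does not say how its samples were drawn.

```python
import math
import random


def sample_state(xi, dim, rng):
    """Draw x with independent coordinates, P(x_i = k) = (1 - xi) * xi**k, k = 0, 1, ..."""
    state = []
    for _ in range(dim):
        u = 1.0 - rng.random()   # uniform on (0, 1], so log(u) is finite
        # Inverse transform: P(floor(log u / log xi) >= n) = P(u <= xi**n) = xi**n.
        state.append(int(math.floor(math.log(u) / math.log(xi))))
    return tuple(state)


def sample_constraint_states(xi=0.9, dim=8, count=5000, seed=0):
    """Sample `count` states; each one yields the constraints for its available actions."""
    rng = random.Random(seed)
    return [sample_state(xi, dim, rng) for _ in range(count)]


states = sample_constraint_states()
distinct_states = set(states)   # repeated samples would only duplicate constraints
```

Each sampled state $x$ would then contribute the constraints $(T\Phi r)(x) \ge (\Phi r)(x)$, one linear inequality per available action, to the reduced LP.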

We compared the performance of the policy yielded by the approximate LP (ALP) with that of first-in-first-out (FIFO), last-buffer-first-serve (LBFS)¹ and a policy that serves the longest queue in each server (LONG). The average number of jobs in the system for each policy was estimated by simulation. Results are shown in Table 1. The policy generated by the approximate LP performs significantly better than each of the heuristics, yielding more than 10% improvement over LBFS, the second best policy. We expect that even better results could be obtained by refining the choice of basis functions and state-relevance weights.

¹ LBFS serves the job that is closest to leaving the system; for example, if there are jobs in queue 2 and in queue 6, a job from queue 2 is processed since it will leave the system after going through only one more queue, whereas the job from queue 6 will still have to go through two more queues. We also choose to assign higher priority to queue 8 than to [...]

4 Closing Remarks and Open Issues

In this paper we studied the linear programming approach to approximate dynamic programming for stochastic control problems as a means of alleviating the curse of dimensionality [...] basis functions. The bounds were shown to be uniformly bounded in the number of states and state variables in certain queueing problems.

Several questions remain open and are the object of future investigation: Can the state-relevance weights in the objective function be chosen in some adaptive way? Can we add robustness to the approximate LP algorithm to account for errors in the estimation of costs and transition probabilities, i.e., design an alternative LP with meaningful performance bounds when problem parameters are just known to be in a certain range? How do our results extend to the average cost case? How do our results extend to the infinite-state case? How does the quality of the approximate value function, measured by the weighted $L_1$ norm, translate into actual performance of the associated greedy policy?

Acknowledgements

This research was supported by NSF CAREER Grant ECS-9985229, by the ONR under Grant MURI N00014-00-1-0637, and by an IBM Research Fellowship.

References

[1] Bertsimas, D., Gamarnik, D. & Tsitsiklis, J., "Performance of Multiclass Markovian Queueing Networks via Piecewise Linear Lyapunov Functions," submitted to Annals of Applied Probability, 2000.

[2] de Farias, D.P. & Van Roy, B., "The Linear Programming Approach to Approximate Dynamic Programming," submitted for publication, 2001.

[3] de Farias, D.P. & Van Roy, B., "On Constraint Sampling for Approximate Linear Programming," submitted for publication, 2001.

[4] Manne, A.S., "Linear Programming and Sequential Decisions," Management Science 6, No. 3, pp. 259-267, 1960.

[5] Schweitzer, P. & Seidmann, A., "Generalized Polynomial Approximations in Markovian Decision Processes," Journal of Mathematical Analysis and Applications 110, pp. 568-582, 1985.
