There has been intensive interest in simulation-based methods for approximate DP since the early 90s, in view of their promise to address the dual curses of DP: the curse of dimensionality (the explosion of the computation needed to solve the problem as the number of states increases), and the curse of modeling (the need for an exact model of the system’s dynamics). We have used the name approximate dynamic programming to collectively refer to these methods. Two other popular names are reinforcement learning and neuro-dynamic programming. The latter name, adopted by Bertsekas and Tsitsiklis [BeT96], comes from the strong connections with DP as well as with methods traditionally developed in the field of neural networks, such as the training of approximation architectures using empirical or simulation data.

Two books were written on the subject in the mid-90s, one by Sutton and Barto [SuB98], which reflects an artificial intelligence viewpoint, and another by Bertsekas and Tsitsiklis [BeT96], which is more mathematical and reflects an optimal control/operations research viewpoint. We refer to the latter book for a broader discussion of some of the topics of this chapter [including rigorous convergence proofs of TD(λ) and Q-learning], for related material on approximation architectures, batch and incremental gradient methods, and neural network training, as well as for an extensive overview of the history and bibliography of the subject up to 1996. More recent books are Cao [Cao07], which emphasizes a sensitivity approach and policy gradient methods, Chang, Fu, Hu, and Marcus [CFH07], which emphasizes finite-horizon/limited lookahead schemes and adaptive sampling, Gosavi [Gos03], which emphasizes simulation-based optimization and reinforcement learning algorithms, Powell [Pow07], which emphasizes resource allocation and the difficulties associated with large control spaces, and Busoniu et al. [BBD10], which focuses on function approximation methods for continuous space systems. The book by Haykin [Hay08] discusses approximate DP within the broader context of neural networks and learning. The book by Borkar [Bor08] is an advanced monograph that rigorously addresses many of the convergence issues of iterative stochastic algorithms in approximate DP, mainly using the so-called ODE approach (see also Borkar and Meyn [BoM00]). The book by Meyn [Mey07] is broader in its coverage, but touches upon some of the approximate DP algorithms that we have discussed.

Several survey papers in the volume by Si, Barto, Powell, and Wunsch [SBP04], and the special issue by Lewis, Liu, and Lendaris [LLL08] describe recent work and approximation methodology that we have not covered in this chapter: linear programming-based approaches (De Farias and Van Roy [DFV03], [DFV04a], De Farias [DeF04]), large-scale resource allocation methods (Powell and Van Roy [PoV04]), and deterministic optimal control approaches (Ferrari and Stengel [FeS04], and Si, Yang, and Liu [SYL04]). An influential survey was written, from an artificial intelligence/machine learning viewpoint, by Barto, Bradtke, and Singh [BBS95]. Some recent surveys are Borkar [Bor09] (a methodological point of view that explores connections with other Monte Carlo schemes), Lewis and Vrabie [LeV09] (a control theory point of view), Szepesvari [Sze09] (a machine learning point of view), Bertsekas [Ber10a] (which focuses on rollout algorithms for discrete optimization), and Bertsekas [Ber10b] (which focuses on policy iteration and elaborates on some of the topics of this chapter). The reader is referred to these sources for a broader survey of the literature on approximate DP, which is very extensive and cannot be fully covered here.

Direct approximation methods and the fitted value iteration approach have been used for finite horizon problems since the early days of DP. They are conceptually simple and easily implementable, and they are still in wide use for approximation of either optimal cost functions or Q-factors (see e.g., Gordon [Gor99], Longstaff and Schwartz [LoS01], Ormoneit and Sen [OrS02], and Ernst, Geurts, and Wehenkel [EGW06]). The simplifications mentioned in Section 6.1.4 are part of the folklore of DP. In particular, post-decision states have sporadically appeared in the literature since the early days of DP. They were used in an approximate DP context by Van Roy, Bertsekas, Lee, and Tsitsiklis [VBL97] for inventory control problems. They have been recognized as an important simplification in the book by Powell [Pow07], which pays special attention to the difficulties associated with large control spaces. For a recent application, see Simao et al. [SDG09].

Temporal differences originated in reinforcement learning, where they are viewed as a means to encode the error in predicting future costs, which is associated with an approximation architecture. They were introduced in the works of Samuel [Sam59], [Sam67] on a checkers-playing program. The papers by Barto, Sutton, and Anderson [BSA83], and Sutton [Sut88] proposed the TD(λ) method, on a heuristic basis without a convergence analysis. The method motivated a lot of research in simulation-based DP, particularly following an early success with the backgammon-playing program of Tesauro [Tes92]. The original papers did not discuss mathematical convergence issues and did not make the connection of TD methods with the projected equation. Indeed, for quite a long time it was not clear which mathematical problem TD(λ) was aiming to solve! The convergence of TD(λ) and related methods was considered for discounted problems by several authors, including Dayan [Day92], Gurvits, Lin, and Hanson [GLH94], Jaakkola, Jordan, and Singh [JJS94], Pineda [Pin97], Tsitsiklis and Van Roy [TsV97], and Van Roy [Van98]. The proof of Tsitsiklis and Van Roy [TsV97] was based on the contraction property of ΠT (cf. Lemma 6.3.1 and Prop. 6.3.1), which is the starting point of our analysis of Section 6.3. The scaled version of TD(0) [cf. Eq. (6.80)] as well as a λ-counterpart were proposed by Choi and Van Roy [ChV06] under the name Fixed Point Kalman Filter. The books by Bertsekas and Tsitsiklis [BeT96], and Sutton and Barto [SuB98] contain a lot of material on TD(λ), its variations, and its use in approximate policy iteration.
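
As a rough illustration of how a temporal difference encodes the error in predicting future costs, the following minimal sketch computes the temporal difference of a single simulated transition and applies one TD(0)-style update. It assumes a linear approximation architecture with feature vectors and a discounted cost criterion; the function names, argument names, and the sample numbers are hypothetical and chosen only for illustration.

```python
def temporal_difference(phi_i, phi_j, cost, r, alpha):
    """Temporal difference for a transition i -> j with transition cost `cost`.

    phi_i, phi_j: feature vectors of states i and j (hypothetical inputs).
    r: current weight vector of the assumed linear architecture.
    alpha: discount factor.
    Returns cost + alpha * phi(j)'r - phi(i)'r, i.e., the gap between the
    one-step-lookahead cost estimate and the current estimate at i.
    """
    value_i = sum(p * w for p, w in zip(phi_i, r))
    value_j = sum(p * w for p, w in zip(phi_j, r))
    return cost + alpha * value_j - value_i


def td0_update(phi_i, phi_j, cost, r, alpha, stepsize):
    """One TD(0)-style update: move r along phi(i), scaled by the temporal difference."""
    d = temporal_difference(phi_i, phi_j, cost, r, alpha)
    return [w + stepsize * d * p for w, p in zip(r, phi_i)]


# Illustrative transition between two states with unit-vector features.
r = td0_update([1.0, 0.0], [0.0, 1.0], cost=1.0, r=[0.0, 0.0],
               alpha=0.9, stepsize=0.5)
print(r)
```

A zero temporal difference means that the current estimate at state i agrees with the sampled one-step lookahead, so the weights are left unchanged by that sample.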

Generally, projected equations are the basis for Galerkin methods, which are popular in scientific computation (see e.g., [Kra72], [Fle84]). These methods typically do not use Monte Carlo simulation, which is essential for the DP context. However, Galerkin methods apply to a broad range of problems, far beyond DP, which is in part the motivation for our discussion of projected equations in more generality in Section 6.8.

The LSTD(λ) algorithm was first proposed by Bradtke and Barto [BrB96] for λ = 0, and later extended by Boyan [Boy02] for λ > 0. For λ > 0, the convergence $C_k^{(\lambda)} \to C^{(\lambda)}$ and $d_k^{(\lambda)} \to d^{(\lambda)}$ is not as easy to demonstrate as in the case λ = 0. An analysis of the law-of-large-numbers convergence issues associated with LSTD for discounted problems was given by Nedić and Bertsekas [NeB03]. The more general two-Markov chain sampling context that can be used for exploration-related methods is analyzed by Bertsekas and Yu [BeY09], and by Yu [Yu10a,b], which shows convergence under the most general conditions. The analysis of [BeY09] and [Yu10a,b] also extends to simulation-based solution of general projected equations. The rate of convergence of LSTD was analyzed by Konda [Kon02], who showed that LSTD has optimal rate of convergence within a broad class of temporal difference methods. The regression/regularization variant of LSTD is due to Wang, Polydorides, and Bertsekas [WPB09]. This work addresses more generally the simulation-based approximate solution of linear systems and least squares problems, and it applies to LSTD as well as to the minimization of the Bellman equation error as special cases. The LSPE(λ) algorithm was first proposed for stochastic shortest path problems by Bertsekas and Ioffe [BeI96], and was applied to a challenging problem on which TD(λ) failed: learning an optimal strategy to play the game of tetris (see also Bertsekas and Tsitsiklis [BeT96], Section 8.3). The convergence of the method for discounted problems was given in [NeB03] (for a diminishing stepsize), and by Bertsekas, Borkar, and Nedić [BBN04] (for a unit stepsize). In the paper [BeI96] and the book [BeT96], the LSPE method was related to the λ-policy iteration of Section 6.3.9. The paper [BBN04] compared informally LSPE and LSTD for discounted problems, and suggested that they asymptotically coincide in the sense described in Section 6.3. Yu and Bertsekas [YuB06b] provided a mathematical proof of this for both discounted and average cost problems.
The scaled versions of LSPE and the associated convergence analysis were developed more recently, and within a more general context in Bertsekas [Ber09b], [Ber11a], which are based on a connection between general projected equations and variational inequalities. Some other iterative methods were given by Yao and Liu [YaL08]. The research on policy or Q-factor evaluation methods was of course motivated by their use in approximate policy iteration schemes. There has been considerable experimentation with such schemes, see e.g., [BeI96], [BeT96], [SuB98], [LaP03], [JuP07], [BED09]. However, the relative practical advantages of optimistic versus nonoptimistic schemes, in conjunction with LSTD, LSPE, and TD(λ), are not yet clear. The exploration-enhanced versions of LSPE(λ) and LSTD(λ) of Section 6.3.6 are new and were developed as alternative implementations