
2.2 Methodology

2.2.2 Approximate Dynamic Programming

Approximate dynamic programming (ADP) refers to a broad family of approaches and algorithms for efficiently solving approximations of large-scale dynamic programming models. The main idea of ADP is to step forward through time and use an approximation of the optimal value function to guide decision making, instead of performing backward computation. However, because the decisions depend heavily on the value function approximation, the policy search can easily be misled by a biased approximation. Therefore, a lot of effort has been devoted to searching for good policies while simultaneously improving the value function approximation, a process that is called “optimizing while learning” (Powell, 2007). Herein, we restrict our attention to two general ADP algorithms: approximate value iteration (AVI) and approximate policy iteration (API), where the policy is determined by a value function approximation.

Approximate Value Iteration

Approximate value iteration is a widely used approximation algorithm in the field of ADP because of its brevity and elegance (Powell, 2007). The basic idea of AVI is to iteratively update the value function approximation that estimates the value of being in each state.

A generic AVI algorithm is outlined in Algorithm 2. In each iteration, the algorithm computes the value estimate $\hat{v}_t^n$ and the associated “greedy” action $a_t^n$ by exploiting the approximate value function $V_{t+1}^{n-1}$ from the previous iteration. The estimated value $\hat{v}_t^n$ is used to update the value of being in a state according to equation (2.7). Meanwhile, the “greedy” action $a_t^n$ helps to determine the next state to visit.
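For orientation, and by analogy with equation (2.8), the optimization problem referred to as (2.6) can be expected to look roughly like the expression below. This is an assumed form written in the notation of the surrounding text, not a quotation of (2.6); in particular, the exact conditioning of the expectation may differ.

```latex
\hat{v}_t^n = \min_{a_t \in \mathcal{A}_t}
  \Big( C_t(s_t^n, a_t)
  + \gamma\, \mathbb{E}\big[ V_{t+1}^{n-1}\big(F_t(s_t^n, a_t, \tilde{\omega}_{t+1})\big) \big] \Big),
\qquad a_t^n \in \text{the corresponding set of minimizers.}
```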

The major drawback of Algorithm 2 is its lack of a performance guarantee. This can be attributed mainly to the fact that its policy update relies entirely on the previous value function approximation. As a consequence, the action $a_t^n$ can easily be misled by a poor earlier estimate, leading to unstable performance.

Besides, the AVI algorithm is inefficient in the sense that at each iteration, it only updates the value function approximation for those states that have been visited.

Remark: It is noteworthy that solving the optimization problem (2.6) necessitates sample average approximation with inner simulation. The concept of the post-decision state can be used to boost the efficiency of the AVI algorithm by avoiding inner simulation (Powell, 2007).
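To illustrate the remark, suppose the transition can be split into a deterministic step $s_t^a = F_t^a(s_t, a_t)$, giving the post-decision state, followed by the arrival of the random information. The symbols $F_t^a$ and $V_t^{a}$ are notation assumed here for illustration only, following the general idea in Powell (2007). With a value function defined around the post-decision state, the decision problem would contain no expectation:

```latex
\hat{v}_t^n = \min_{a_t \in \mathcal{A}_t}
  \Big( C_t(s_t^n, a_t) + \gamma\, V_t^{a,n-1}\big(F_t^a(s_t^n, a_t)\big) \Big)
```

Under this sketch, $\hat{v}_t^n$ would be used to update the approximation at the previous post-decision state, so each candidate action can be evaluated without inner simulation.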

Algorithm 2 Generic Approximate Value Iteration

– Step 2a: Solve the optimization problem (2.6). Let $a_t^n$ be the optimal solution of problem (2.6).

– Step 2b: Update next state $s_{t+1}^n = F_t(s_t^n, a_t^n, \omega_{t+1}^n)$

– Step 2c: Update the approximate value function

$$V_t^n(s_t) = \begin{cases} (1 - \alpha_{n-1})\, V_t^{n-1}(s_t) + \alpha_{n-1}\, \hat{v}_t^n, & \text{if } s_t = s_t^n; \\ V_t^{n-1}(s_t), & \text{otherwise.} \end{cases} \tag{2.7}$$

Step 3: If $n < N$, set $n = n + 1$ and go to Step 1; otherwise return the final value function approximation $V_t^N$ for $t = 1, 2, \dots, T$.
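The forward pass and the update (2.7) can be sketched as a small look-up-table implementation. This is a minimal sketch under stated assumptions, not the exact procedure of Algorithm 2: the noise is taken to be i.i.d. and uniform on a finite support, so the expectation is an exact average rather than an inner simulation; the step size is assumed to be $\alpha_{n-1} = 1/n$; the initial state is sampled uniformly; and the terminal value $V_{T+1}$ is taken to be zero. All function and parameter names are hypothetical.

```python
import random

def approximate_value_iteration(states, actions, cost, transition,
                                noise_support, T, N, gamma=1.0, seed=0):
    """Tabular AVI sketch. Returns V[t][s] for t = 1..T+1 (V[T+1] stays zero)."""
    rng = random.Random(seed)
    V = [{s: 0.0 for s in states} for _ in range(T + 2)]  # index 0 is unused
    for n in range(1, N + 1):
        alpha = 1.0 / n                      # assumed step size alpha_{n-1}
        s = rng.choice(states)               # assumed initial-state sampling
        for t in range(1, T + 1):
            # Step 2a: value of each action under the previous V_{t+1}.
            def q(a):
                nxt = [V[t + 1][transition(s, a, w)] for w in noise_support]
                return cost(s, a) + gamma * sum(nxt) / len(nxt)
            a_star = min(actions, key=q)     # "greedy" action
            v_hat = q(a_star)                # value estimate
            # Step 2c: smooth v_hat into the visited state only, as in (2.7).
            V[t][s] = (1 - alpha) * V[t][s] + alpha * v_hat
            # Step 2b: simulate the next state with a sampled noise value.
            s = transition(s, a_star, rng.choice(noise_support))
    return V

# Toy problem (illustrative only): stock level 0..3, order 0 or 1 units.
states = [0, 1, 2, 3]
actions = [0, 1]
noise = [0, 1]                               # demand
cost = lambda s, a: 2.0 * a + 1.0 * s        # ordering plus holding cost
transition = lambda s, a, w: max(0, min(3, s + a - w))
V = approximate_value_iteration(states, actions, cost, transition, noise,
                                T=3, N=200)
```

Because $V_{t+1}$ is read before it is updated within the same forward pass, the in-place table update is consistent with using $V_{t+1}^{n-1}$ in Step 2a. Only the visited state is updated at each time step, which is the inefficiency noted above.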

In practical applications, the state space is often tremendously large (or continuous), so representing the approximate value function as a look-up table is impractical. A simple adaptation, using a parametric (typically linear) model to approximate the value function, has received considerable interest in the literature. However, it has been shown that AVI with parametric approximations is not guaranteed to converge in a general setting (Powell, 2007), unless some special and powerful structure, such as convexity, can be recognized and exploited. For example, Nascimento and Powell (2009) developed a provably convergent approximate value iteration algorithm, named the SPAR-Storage algorithm, for a large-scale energy dispatch problem with a convex structure. Instead of directly approximating the value function, they proposed to update the slopes of the value function and to exploit the property that the slope of a convex function is monotonically nondecreasing to boost the efficiency of the algorithm.
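As a rough illustration of what a linear parametric approximation involves, the sketch below replaces the table entry for a state with $\phi(s)^\top \theta$ and moves $\theta$ toward an observed value $\hat{v}$ by a stochastic gradient step. The basis functions, step size, and noise-free targets are assumptions chosen for illustration; this is one common form of the update, not necessarily the one used in the cited works, and it shows only the regression step, not a convergent AVI scheme.

```python
def features(s):
    """Hypothetical basis functions phi(s) = (1, s) for a scalar state."""
    return [1.0, float(s)]

def linear_value(theta, s):
    """Approximate value of state s: inner product of theta and phi(s)."""
    return sum(th * f for th, f in zip(theta, features(s)))

def sgd_update(theta, s, v_hat, alpha):
    """One gradient step on 0.5 * (phi(s)^T theta - v_hat)^2."""
    phi = features(s)
    error = linear_value(theta, s) - v_hat
    return [th - alpha * error * f for th, f in zip(theta, phi)]

# Illustrative use: targets generated by an exactly linear function of s.
theta = [0.0, 0.0]
for k in range(20000):
    s = k % 5
    theta = sgd_update(theta, s, 2.0 + 3.0 * s, alpha=0.02)
```

Each observation now changes the approximate value of every state through the shared parameters, which is what makes the approach scale and also what makes convergence harder to guarantee.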

Approximate Policy Iteration

An alternative powerful tool for approximate dynamic programming is approximate policy iteration, which has attracted substantial research interest. The strength of this methodology lies in its provable convergence guarantee in the most general case (Powell, 2007). An outline of a generic version of API is presented in Algorithm 3.

Algorithm 3 Generic Approximate Policy Iteration

Inputs: Initial approximate value function $V_t^{\pi,0}$ for $t = 1, 2, \dots, T$, inner sample counter $M$ and maximum number of iterations $N$.

Step 0: Set iteration count $n = 1$ and sample initial state $s_1^n$;

Step 1: Do for $m = 1, 2, \dots, M$

Step 2: Generate a sample $\omega_t^m$

Step 3: Do for $t = 1, 2, \dots, T - 1$

– Step 3a: Solve the optimization problem

$$a_t^{n,m} = \arg\min_{a_t \in \mathcal{A}_t} C_t(s_t^{n,m}, a_t) + \gamma\, \mathbb{E}\big[ V_{t+1}^{\pi,n-1}\big(F_t(s_t^{n,m}, a_t, \tilde{\omega}_{t+1}^m)\big) \,\big|\, \omega_t^m \big] \tag{2.8}$$

– Step 3b: Update next state $s_{t+1}^{n,m} = F_t(s_t^{n,m}, a_t^{n,m}, \omega_{t+1}^m)$

Step 4: Initialize $\hat{v}_{T+1}^{n,m} = 0$.

Step 5: Do for $t = T, T - 1, \dots, 1$

– Step 5a: Accumulate $\hat{v}_t^{n,m} = C_t(s_t^{n,m}, a_t^{n,m}) + \gamma\, \hat{v}_{t+1}^{n,m}$

– Step 5b: Update approximate value of current policy

$$\bar{V}_t^{n,m}(s_t) = \begin{cases} \dfrac{m-1}{m}\, \bar{V}_t^{n,m-1}(s_t) + \dfrac{1}{m}\, \hat{v}_t^{n,m}, & \text{if } s_t = s_t^{n,m}; \\ \bar{V}_t^{n,m-1}(s_t), & \text{otherwise.} \end{cases} \tag{2.9}$$

Step 6: Update value function approximation

$$V_t^{\pi,n}(s_t) = (1 - \alpha_{n-1})\, V_t^{\pi,n-1}(s_t) + \alpha_{n-1}\, \bar{V}_t^{n,M}(s_t) \tag{2.10}$$

Step 7: If $n < N$, set $n = n + 1$ and go to Step 1; otherwise return the final value function approximation $V_t^{\pi,N}$ for $t = 1, 2, \dots, T$.
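The structure of Algorithm 3, with its inner policy evaluation loop and outer smoothing step, can be sketched with a look-up table as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the noise is assumed i.i.d. and uniform on a finite support, so the expectation in (2.8) is computed as an exact average; the forward pass is run for $t = 1, \dots, T$ with $V_{T+1} = 0$ so that an action is available at every step of the backward pass; the running average of (2.9) is initialized from the current approximation $V^{\pi,n-1}$; the initial state `s1` is fixed; and $\alpha_{n-1} = 1/n$. All names are hypothetical.

```python
import random

def approximate_policy_iteration(states, actions, cost, transition,
                                 noise_support, s1, T, N, M,
                                 gamma=1.0, seed=0):
    """Tabular API sketch. Returns V[t][s] for t = 1..T+1 (V[T+1] stays zero)."""
    rng = random.Random(seed)
    V = [{s: 0.0 for s in states} for _ in range(T + 2)]  # index 0 is unused
    for n in range(1, N + 1):
        alpha = 1.0 / n                          # assumed step size alpha_{n-1}
        V_bar = [dict(V_t) for V_t in V]         # assumed start of the average
        for m in range(1, M + 1):
            # Step 2: sample one noise path; omega[t] drives the move t -> t+1.
            omega = [rng.choice(noise_support) for _ in range(T + 1)]
            path, s = [], s1
            for t in range(1, T + 1):
                # Step 3a: greedy action under the fixed approximation V.
                def q(a):
                    nxt = [V[t + 1][transition(s, a, w)] for w in noise_support]
                    return cost(s, a) + gamma * sum(nxt) / len(nxt)
                a = min(actions, key=q)
                path.append((t, s, a))
                s = transition(s, a, omega[t])   # Step 3b
            v_hat = 0.0                          # Step 4
            for t, s_t, a_t in reversed(path):   # Step 5
                v_hat = cost(s_t, a_t) + gamma * v_hat               # Step 5a
                V_bar[t][s_t] = (m - 1) / m * V_bar[t][s_t] + v_hat / m  # (2.9)
        for t in range(1, T + 1):                # Step 6, equation (2.10)
            for s_t in states:
                V[t][s_t] = (1 - alpha) * V[t][s_t] + alpha * V_bar[t][s_t]
    return V

# Toy problem (illustrative only): stock level 0..3, order 0 or 1 units.
states = [0, 1, 2, 3]
cost = lambda s, a: 2.0 * a + 1.0 * s
transition = lambda s, a, w: max(0, min(3, s + a - w))
V = approximate_policy_iteration(states, [0, 1], cost, transition, [0, 1],
                                 s1=2, T=3, N=20, M=10)
```

The policy is held fixed throughout the inner loop because `V` is only read there; it changes only in the final smoothing step of each outer iteration.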

It is worth clarifying that value function approximation is also indispensable in approximate policy iteration, where the “policy” refers to the decisions determined by the approximate value function (see $V_t^{\pi,n-1}$ in equation (2.8)). Unlike AVI, the API algorithm attempts to obtain a statistically reliable estimate of the value of the current policy by repeating the performance evaluation process with $V_t^{\pi,n-1}$ held fixed. At the end of each iteration, the policy is updated according to equation (2.10).

Value function approximation using linear architectures has been widely adopted in the context of the API algorithm, mainly because of its ease of implementation. The resulting algorithm is termed “least-squares policy iteration” (LSPI). Several variants of LSPI have been investigated in the literature (see Bertsekas and Tsitsiklis, 1996; Lagoudakis and Parr, 2003; Nedić and Bertsekas, 2003; Xu et al., 2007). Currently, most existing convergence results for these algorithms are established for infinite-horizon MDPs (see Tsitsiklis, 2003; Ma and Powell, 2008, 2011). In the aforementioned works, convergence is established by exploiting the monotonicity property of the dynamic programming operator in the infinite-horizon setting. To the best of our knowledge, the literature on convergence guarantees for finite-horizon MDPs is scarce. A plausible explanation is the absence of the monotonicity property in the finite-horizon case.