9.2 Evolution of decision intervals and actions by solving a sequence of discrete problems
9.2.2 The method in detail
We can now consider these four phases in detail.
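First step and initialization: generating $\tilde{M}$. We build the discrete MDP problem $\tilde{M}$ with:
• the state space $\tilde{\Sigma}$,
• the action space A+,
• the transition function $\tilde{Q}(\tilde{\sigma}' \mid \tilde{\sigma}, a)$,
• the reward function $\tilde{r}$.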
Chapter 9. Perspectives: evolutive partitioning of time
The transition model $\tilde{Q}(\tilde{\sigma}' \mid \tilde{\sigma}, a)$ describes the probability that action $a$, undertaken in $s_\sigma$ during $T_\sigma$, takes the process to state $s_{\sigma'}$ at a date belonging to $T_{\sigma'}$. Similarly, $\tilde{r}$ represents the average reward obtained when applying $a$ and going from $(s_\sigma, T_\sigma)$ to $(s_{\sigma'}, T_{\sigma'})$.
More precisely, if $t_{low}$ and $t_{up}$ represent respectively the lower and upper bounds of interval $T_\sigma$, we can choose to calculate $\tilde{Q}$ as the average over the $T_\sigma$ interval of the probability of reaching the $T_{\sigma'}$ interval:
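$$\tilde{Q}(\sigma', a, \sigma) = \frac{1}{t_{up} - t_{low}} \int_{t_{low}}^{t_{up}} \Pr(t' \in T_{\sigma'}, s' = s_{\sigma'} \mid a, s_\sigma, t_\sigma)\, dt = \frac{1}{t_{up} - t_{low}} \int_{t_{low}}^{t_{up}} P(s' \mid s, t, a) \left( \int_{t'_{low}}^{t'_{up}} f(t' \mid s, t, a, s')\, dt' \right) dt$$
And if we write the cumulative distribution function $F$:
$$F(v \mid s, t, a, s') = \Pr(t' \le v \mid s, t, a, s') = \int_{-\infty}^{v} f(t' \mid s, t, a, s')\, dt'$$
Then we have:
$$\tilde{Q}(\sigma', a, \sigma) = \frac{1}{t_{up} - t_{low}} \int_{t_{low}}^{t_{up}} P(s' \mid s, t, a) \left[ F(t'_{up} \mid s, t, a, s') - F(t'_{low} \mid s, t, a, s') \right] dt \tag{9.4}$$
Similarly, we chose to write $\tilde{r}$ as the average over the $T_\sigma$ interval of the rewards obtained during the transitions $(\sigma, a, \sigma')$.
$$\tilde{r}((s_\sigma, T_\sigma), a) = \frac{1}{t_{up} - t_{low}} \int_{t_{low}}^{t_{up}} r((s_\sigma, t), a)\, dt \tag{9.5}$$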
The choice of taking the average over the $T_\sigma$ interval is arbitrary and questionable. One could choose, for example, to use the best reward obtained over the interval in order to build an optimistic reward model instead.
The transition model of wait takes the process to a new state described by the system’s dynamics $P(s' \mid s, t, \mathit{wait}) = W(s' \mid s, t)$ and to the first date of the next decision interval in $s'$.
Evaluating the discrete $\tilde{Q}$ and $\tilde{r}$ functions can be done analytically, as above, whenever closed forms are available. Otherwise, they can be approximated via Monte-Carlo sampling or by discretizing the continuous functions.
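As an illustration of the two numerical routes mentioned above, the following sketch estimates one entry of $\tilde{Q}$ for a single $(\sigma, a, \sigma')$ triple, first with a midpoint rule applied to equation (9.4) and then by Monte-Carlo sampling. Everything specific here is an assumption made for the example: the constant transition probability `p_next`, the arrival date $t' = t + d$ with $d$ uniform on $[1, 2]$, the interval bounds, and all function names.

```python
import random

def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def p_next(t):
    """P(s'|s,t,a): assumed constant for this toy example."""
    return 0.8

# Assumed duration model: the arrival date is t' = t + d with d ~ U(1, 2).
D_LO, D_HI = 1.0, 2.0

def arrival_cdf(v, t):
    """F(v|s,t,a,s') = Pr(t' <= v) under the assumed duration model."""
    return uniform_cdf(v - t, D_LO, D_HI)

def q_tilde_quadrature(t_low, t_up, tp_low, tp_up, n=1000):
    """Midpoint-rule approximation of equation (9.4)."""
    h = (t_up - t_low) / n
    total = 0.0
    for i in range(n):
        t = t_low + (i + 0.5) * h
        total += p_next(t) * (arrival_cdf(tp_up, t) - arrival_cdf(tp_low, t))
    return total * h / (t_up - t_low)

def q_tilde_monte_carlo(t_low, t_up, tp_low, tp_up, n=200_000, seed=0):
    """Fraction of sampled transitions that reach s' at a date in [tp_low, tp_up]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t = rng.uniform(t_low, t_up)        # decision date drawn uniformly in T_sigma
        if rng.random() < p_next(t):        # the transition reaches s'
            t_next = t + rng.uniform(D_LO, D_HI)
            if tp_low <= t_next <= tp_up:   # the arrival date falls in T_sigma'
                hits += 1
    return hits / n

# T_sigma = [0, 1] and T_sigma' = [1.5, 2.5] (hypothetical bounds).
q_quad = q_tilde_quadrature(0.0, 1.0, 1.5, 2.5)
q_mc = q_tilde_monte_carlo(0.0, 1.0, 1.5, 2.5)
print(q_quad, q_mc)
```

For these assumed distributions the integrand is piecewise linear, so the exact value is 0.6. The midpoint rule recovers it up to rounding, and the Monte-Carlo estimate should agree to about two decimal places.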
It is important to note that $\tilde{M}$ is an approximation and an abstraction of $M$. It is an approximation in the sense that it approximates the transition and reward models over the decision intervals by taking their average values. It is also an abstraction because it no longer respects the causality principle. In $\tilde{M}$, it is possible to reach a temporal interval beginning before the current date and, from this interval, to reach another prior interval which lies entirely before the initial current date. Therefore, $\tilde{M}$ can be seen as an approximate optimistic problem where causality can be violated and where reachability is considered from a very optimistic point of view.
We provide no theoretical justification of the soundness of such an approximation and abstraction. Instead, we rely on the idea that one does not need to evaluate exactly the transition dynamics and the rewards to build a rough plan of action. This $\tilde{M}$ problem can thus be seen as a (rather drastic) variation of the “optimism in the face of uncertainty” philosophy developed in [Kaelbling, 1990].
Second step: searching for the optimal action. The second step consists in solving the Bellman optimality equation corresponding to problem $\tilde{M}$. We suppose that a “black box” discrete MDP solver is available and that we can feed the $\tilde{M}$ problem to this solver. This optimization provides us with a policy $\tilde{\pi}$ defined on $\tilde{\Sigma}$.
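The text deliberately leaves the solver unspecified. Purely as one possible stand-in, the sketch below uses plain value iteration on a finite MDP. The discount factor `gamma`, the dictionary-based encoding of the transition function, and the two-state toy problem are assumptions made for the example, not part of the method.

```python
def backup(Q, r, V, s, a, gamma):
    """One-step lookahead value of action a in discrete state s."""
    return r[s][a] + gamma * sum(p * V[s2] for s2, p in Q[s][a].items())

def value_iteration(Q, r, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Solve a finite MDP by value iteration.

    Q[s][a] maps each successor state to its probability; r[s][a] is the reward.
    Returns the value function and a greedy policy (one action index per state).
    """
    n_states = len(Q)
    V = [0.0] * n_states
    for _ in range(max_iter):
        new_V = [max(backup(Q, r, V, s, a, gamma) for a in range(len(Q[s])))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    policy = [max(range(len(Q[s])), key=lambda a, s=s: backup(Q, r, V, s, a, gamma))
              for s in range(n_states)]
    return V, policy

# Toy two-state problem (hypothetical numbers).
Q = [
    [{0: 1.0}, {1: 1.0}],  # state 0: action 0 stays, action 1 moves to state 1
    [{1: 1.0}, {0: 1.0}],  # state 1: action 0 stays, action 1 moves to state 0
]
r = [
    [0.0, 1.0],
    [2.0, 0.0],
]
V, policy = value_iteration(Q, r)
print(policy, [round(v, 3) for v in V])
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there, with values close to 19 and 20. Any exact or approximate discrete solver exposing the same inputs and outputs could be substituted.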
It can happen that, after the optimization process, the policy defined on two consecutive decision intervals of the same state ends up pointing to the same action. In this case, we merge the two decision intervals into one in order to keep the number of bounds low and the representation as compact as possible. No new bound can be introduced at this step since we are only optimizing the discrete problem $\tilde{M}$.
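The merging rule described above can be sketched as follows for the interval list of a single state. The `(t_low, t_up, action)` tuple encoding and the function name are assumptions made for the example.

```python
def merge_intervals(intervals):
    """Merge consecutive decision intervals that carry the same action.

    intervals: list of (t_low, t_up, action) tuples for one discrete state,
    sorted by date and contiguous (each t_up equals the next t_low).
    """
    merged = []
    for t_low, t_up, action in intervals:
        if merged and merged[-1][2] == action and merged[-1][1] == t_low:
            # Same action on adjacent intervals: drop the shared bound.
            merged[-1] = (merged[-1][0], t_up, action)
        else:
            merged.append((t_low, t_up, action))
    return merged

# Hypothetical policy on one state after solving the discrete problem.
policy_on_state = [
    (0.0, 2.0, "wait"),
    (2.0, 5.0, "wait"),
    (5.0, 7.0, "move"),
    (7.0, 9.0, "wait"),
]
merged = merge_intervals(policy_on_state)
print(merged)  # [(0.0, 5.0, 'wait'), (5.0, 7.0, 'move'), (7.0, 9.0, 'wait')]
```

Only the first two intervals are merged. The last "wait" interval is kept separate because it is not adjacent to another interval with the same action.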
Third step: evaluating $\tilde{\pi}$ on the real system. One can see $\tilde{\pi}$ as an approximation of an optimal policy for $M$. It is not exactly a policy obtained through approximate dynamic programming, since it results from the “black box” solver used in step 2, which might be either an exact or an approximate solver. Its generation nevertheless relies on an approximation of the model: solving this approximate model yields an exact or approximate value function, which in turn provides us with $\tilde{\pi}$. Consequently, the $\tilde{\pi}$ policy leaves room for improvement with respect to the initial continuous problem, because the problem solved was a discrete approximation of this initial problem. The goal of step 3 is to let the $T$ discretization evolve so that the next step’s $\tilde{\pi}$ is better than the current one with respect to the continuous temporal problem.
This leaves us with two separate problems:
• Suppose we have found the optimal policy $\pi^*$ for $M$. Then we have a partitioning set $T^*$ used for this policy’s description and we can build the associated $\tilde{M}^*$ problem. To guarantee the soundness of our algorithm, we need to ensure that the optimal policy $\tilde{\pi}$ found after optimization on the $\tilde{M}^*$ problem is identical to $\pi^*$.
• Secondly, the evaluation method of $\tilde{\pi}$ with respect to the continuous problem must be good enough to eventually find the points in time where the policy can be improved.
The first problem corresponds to proving that the overall approximation and optimization scheme has a fixed point in $\pi^*$. Ideally, one should also prove that it is a contraction mapping in order to ensure convergence. As for many approximate dynamic programming algorithms, proving such a property is often very hard or impossible. For an example illustrating this difficulty, see the discussion on approximate value iteration in section 6.4. However, proving the stability (or bounding the variations) of $\pi^*$ through the model approximation and optimization steps provides a good criterion to evaluate the consistency of the approximation method for generating $\tilde{M}$.
Similarly, the evaluation of $V^{\tilde{\pi}}$ can be done via several different methods. If exact computation with the continuous functions of $M$ is feasible, one could try a TMDPpoly-like evaluation. Approaches such as Approximate Linear Programming (least-squares minimization of a vector of weights on feature functions) as in [Guestrin et al., 2004] or Monte-Carlo approaches are also possible. Depending on the nature of the continuous problem at hand, one could choose one option or another; the goal remains to obtain an evaluation of $\tilde{\pi}$’s quality on the real continuous problem, i.e. to solve equation 9.2 for $\tilde{\pi}$.
Fourth step: populating the decision interval sets. Once we have the evaluation $V^{\tilde{\pi}}$, we need to answer the question “where should I introduce a new bound in order to improve my policy’s quality?”. Answering this question actually means inferring that, by performing an action other than the one specified by $\tilde{\pi}$, one improves the expected gain of an execution. This idea is very close to the improvement step of Policy Iteration. Here, one could consider that the decision variables are the decision intervals’ bounds and that we search for new values of these bounds which will improve the efficiency of our policy. Hence, we need to find where we can potentially improve the policy’s quality.
Evaluating such an improvement can be done by trying to find the best action to undertake in the current state before applying $\tilde{\pi}$ for the rest of the execution. It corresponds to calculating the one-step lookahead best action by performing one Bellman backup. Therefore, we are looking, per state, for the greatest value of the Bellman error as a function of $t$. We recall the definition of the Bellman error as presented in [Bertsekas and Tsitsiklis, 1996]. Let $\pi$ be a policy defined on the state space of a discrete MDP. Let $V^\pi$ be $\pi$’s value function. The Bellman error in state $s$ is the value of the best improvement possible with a one-step dynamic programming optimization of the policy:
$$BE(V^\pi(s)) = \max_{a \in A} \left( r(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, V^\pi(s') \right) - V^\pi(s) \tag{9.6}$$
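A small numerical instance may help to read equation (9.6). The sketch below computes the Bellman error of a deliberately poor policy on a two-state, two-action MDP. The MDP, its numbers, and the nested-list encoding `P[s][a][s2]` are hypothetical.

```python
def bellman_error(s, V, r, P, gamma):
    """Equation (9.6): best one-step improvement over V in state s."""
    n_states = len(V)
    best = max(
        r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
        for a in range(len(r[s]))
    )
    return best - V[s]

# Hypothetical two-state, two-action MDP.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
r = [[0.0, 1.0], [2.0, 0.0]]
gamma = 0.9

# Value of the policy "always take action 0": state 0 earns 0 forever,
# state 1 earns 2 per step, i.e. 2 / (1 - gamma) = 20.
V_pi = [0.0, 2.0 / (1.0 - gamma)]

errors = [bellman_error(s, V_pi, r, P, gamma) for s in range(2)]
print(errors)
```

The error is about 19 in state 0, where switching to action 1 gives $1 + 0.9 \times 20$, and about 0 in state 1, where the policy is already greedy with respect to its own value function.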
We define the Bellman t-error, in discrete state $s$, as the function of time representing the gain obtained by optimizing the first action of an execution path, before applying the current policy (or before receiving the value specified by the value function of the policy). In a given discrete state $s$, the Bellman t-error with respect to value function $V$ is given by:
$$BE_s(t) = \max_{a \in A} \left( r(s, a, t) + \sum_{s' \in S} \int_{-\infty}^{\infty} \gamma^{t' - t}\, V(s', t')\, P(s' \mid s, a, t)\, f(t' \mid s, t, a, s')\, dt' \right) - V(s, t) \tag{9.7}$$
Finding and maximizing $BE_s(t)$ can make use of analytical calculation if it is possible (in the TMDPpoly case, finding the supremum of a piecewise polynomial function is an easy calculation). One can also make use of other optimization techniques, such as local optimization methods (gradient descent, Newton methods) or evolutionary algorithms, depending on how much information we can extract from $V^{\tilde{\pi}}$ (values, gradients, Hessian matrices, etc.).
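To illustrate why the piecewise polynomial case is easy, the sketch below finds the maximum of a piecewise quadratic function by comparing, on each piece, the two endpoints and the vertex when it lies inside the piece. The restriction to degree 2, the closed pieces, the `(lo, hi, (c0, c1, c2))` encoding and the example coefficients are all assumptions made for the illustration.

```python
def sup_piecewise_quadratic(pieces):
    """Maximum of a piecewise quadratic function.

    pieces: list of (lo, hi, (c0, c1, c2)), each describing
    c0 + c1*t + c2*t**2 on the closed interval [lo, hi].
    Returns (best_value, best_t).
    """
    best_val, best_t = float("-inf"), None
    for lo, hi, (c0, c1, c2) in pieces:
        candidates = [lo, hi]
        if c2 != 0:
            vertex = -c1 / (2 * c2)  # stationary point of the quadratic
            if lo < vertex < hi:
                candidates.append(vertex)
        for t in candidates:
            val = c0 + c1 * t + c2 * t * t
            if val > best_val:
                best_val, best_t = val, t
    return best_val, best_t

# Hypothetical Bellman t-error of one state, given as two polynomial pieces.
pieces = [
    (0.0, 2.0, (0.0, 2.0, -1.0)),   # 2t - t^2 on [0, 2]
    (2.0, 4.0, (-1.0, 0.25, 0.0)),  # -1 + 0.25t on [2, 4]
]
best_val, best_t = sup_piecewise_quadratic(pieces)
print(best_val, best_t)  # 1.0 1.0
```

Higher degrees follow the same pattern, with the stationary points obtained from the roots of each piece's derivative.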
Let us consider the question of finding the largest Bellman error more precisely. For notational convenience, we introduce the $L_a$ operator for standard MDPs:
$$L_a(V)(s) = r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \tag{9.8}$$
One can then write:
$$\forall s \in S, \quad LV(s) = \max_{a \in A} L_a V(s).$$
Similarly, for SMDP+, we write:
$$L^t_a(V)(s, t) = r(s, a, t) + \sum_{s' \in S} \int_{-\infty}^{\infty} \gamma^{(t' - t)}\, V(s', t')\, P(s' \mid s, t, a)\, f(t' \mid s, t, a, s')\, dt' \tag{9.9}$$
Consequently, we can write:
$$BE_s(t) = \max_{a \in A} \left\{ L^t_a(V^\pi)(s, t) \right\} - V^\pi(s, t) \tag{9.10}$$
We are looking for $\sup_{t \in \mathbb{R}} BE_s(t)$, and the supremum and the maximum can be exchanged:
$$\sup_{t \in \mathbb{R}} BE_s(t) = \sup_{t \in \mathbb{R}} \max_{a \in A} \left\{ L^t_a(V^\pi)(s, t) - V^\pi(s, t) \right\} = \max_{a \in A} \sup_{t \in \mathbb{R}} \left\{ L^t_a(V^\pi)(s, t) - V^\pi(s, t) \right\}$$
So we are left with $|S| \cdot |A|$ maximization problems where we want to solve:
$$\sup_{\substack{t \in \mathbb{R} \\ t \in [0, T]}} \left\{ L^t_a(V^\pi)(s, t) - V^\pi(s, t) \right\} \tag{9.11}$$
Then, depending on the shape of $M$’s functions and of $V^\pi$, we can try to apply different optimization techniques. Gradient descent might generally be sufficient to find the possible sup values.
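The decomposition into one scalar problem per action can be sketched as follows for a single state. Note that this example swaps in a derivative-free method (a coarse grid scan followed by golden-section refinement) rather than gradient descent, so that it needs no derivative information. The gain functions stand for $t \mapsto L^t_a(V^\pi)(s, t) - V^\pi(s, t)$ and are hypothetical closed forms chosen for the example, as are the action names and the horizon `T`.

```python
import math

def maximize_on_interval(g, t_min, t_max, n_grid=200, tol=1e-6):
    """Grid scan, then golden-section refinement around the best grid point.

    The refinement assumes g is unimodal between the two neighbours of
    the best grid point.
    """
    step = (t_max - t_min) / n_grid
    grid = [t_min + i * step for i in range(n_grid + 1)]
    i_best = max(range(n_grid + 1), key=lambda i: g(grid[i]))
    lo = grid[max(i_best - 1, 0)]
    hi = grid[min(i_best + 1, n_grid)]
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if g(c) > g(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    t_star = 0.5 * (lo + hi)
    return t_star, g(t_star)

T = 10.0
# Hypothetical gains of each action over the current policy in one state s.
gains = {
    "wait": lambda t: 0.5 - 0.1 * (t - 3.0) ** 2,
    "move": lambda t: 1.0 - 0.25 * (t - 6.5) ** 2,
}

# One maximization problem per action, then the max over actions.
results = {a: maximize_on_interval(g, 0.0, T) for a, g in gains.items()}
best_action = max(results, key=lambda a: results[a][1])
t_star, be_sup = results[best_action]
print(best_action, round(t_star, 3), round(be_sup, 3))
```

With these assumed gains, the largest Bellman t-error is obtained for action "move" near $t = 6.5$, which would be the natural place to consider introducing a new bound. Repeating the loop over all states gives the $|S| \cdot |A|$ problems mentioned above.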