This concludes this paragraph on the comparison between SMDP+ and TMDP. Its main purpose was to establish the link between the ad hoc definitions of the TMDP model and the theory of stochastic decision processes. This comparison highlighted the limits of the TMDP model and its properties. In conclusion:
2 wait(τ = 0) is the null-duration wait in TMDPs; it is actually a shortcut for the implicit wait(t' = t) in the TMDP action π(s, t) = (t, a).
4.6. Conclusion
A TMDP is a total reward criterion SMDP+ where the “wait” action is implicit. This is made possible because this same “idleness” action is static in terms of the state’s evolution (waiting leaves the process’ state unchanged) and deterministic with respect to time. Since the wait action is implicit, TMDP policies constantly alternate between standard actions and idleness phases. Such a policy’s genericity is only preserved because “wait” does not affect s and because its reward model yields reward 0 for zero-duration waiting. One could notice that this genericity would still be preserved under the weaker condition that wait(τ = 0) leaves the process’ state unchanged^a and induces no cost or reward.
a This remains true as long as wait is the only parametric action. The general case of parametric actions will be developed in chapter 8.
This analysis raises some questions concerning the nature of this “wait” action. Namely:
• What if there are other continuous parametric actions like “wait”?
• How do we model the exogenous evolution of the world in TMDPs, and how do we use the W function of SMDP+?
• Is there a more general framework, derived from MDPs, for planning with continuous time and parametric actions?
• More specifically, couldn’t we write a framework with sequences of extended actions, similar to the standard MDP case, which would avoid the permanent switching between wait and action (which is, somehow, a tweak in the TMDP resolution)?
We will answer these questions in the more general framework of XMDPs, in chapter 8. In conclusion, we have shown the expressivity and the limits of the TMDP framework. In particular, its implicit wait action is not generic: it corresponds to an action that deterministically “freezes” the process state and moves to another date in time. This important feature was highlighted by the more general wait operator of SMDP+. For now, we will use the TMDP framework and notations and will focus on studying and improving the resolution of TMDPs.
Chapter 4. Bridging the gap between SMDP and TMDP: the SMDP+ model
5. Solving TMDPs via Dynamic Programming
The previous chapter connected the TMDP framework to the general case of MDPs and SMDPs. It showed that the optimality equation of TMDPs indeed corresponds to a total reward criterion over the execution. In this chapter, we focus on this optimality equation. Our goal is to analyze why an exact resolution was possible in the case of [Boyan and Littman, 2001], how we can extend it, and what computational tools we need to perform Bellman backups on TMDPs.
5.1. Optimality equations and value function properties
The optimality equations established on the total reward criterion for TMDPs in the last chapter are the basis of the dynamic programming approach to solving TMDPs. Equations 4.10 to 4.13 provide a straightforward value iteration scheme in order to find the optimal value function as presented in equations 5.1 to 5.4.
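$$V_{n+1}(s,t) = \sup_{t' \geq t} \left( \int_t^{t'} K(s,\theta)\,d\theta + \overline{V}_n(s,t') \right) \tag{5.1}$$
$$\overline{V}_n(s,t) = \max_{a \in A} Q_n(s,t,a) \tag{5.2}$$
$$Q_n(s,t,a) = \sum_{\mu \in M} L(\mu \mid s,t,a) \cdot U_n(\mu,t) \tag{5.3}$$
$$U_n(\mu,t) = \begin{cases} \int_{-\infty}^{\infty} P_\mu(t') \left[ R(\mu,t,t') + V_n(s'_\mu,t') \right] dt' & \text{if } T_\mu = \text{ABS} \\ \int_{-\infty}^{\infty} P_\mu(t'-t) \left[ R(\mu,t,t') + V_n(s'_\mu,t') \right] dt' & \text{if } T_\mu = \text{REL} \end{cases} \tag{5.4}$$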
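As an illustration, one sweep of equations 5.1 to 5.4 can be sketched on a uniform time grid. This is only a discretized sketch under our own assumptions, not the exact resolution scheme studied in this chapter: the function name `bellman_backup`, the toy two-state problem, the zero value beyond the horizon and the treatment of K as constant on each grid cell are all hypothetical choices made for the example.

```python
def bellman_backup(V, grid, states, actions, outcomes, likelihood, K, R):
    """One sweep of equations 5.1-5.4 on a uniform time grid (illustrative only).

    V maps each state to a list of values, one per grid point.
    outcomes maps mu -> (next_state, "ABS" or "REL", {time_or_duration: prob}).
    """
    n, dt, horizon = len(grid), grid[1] - grid[0], grid[-1]

    def value_at(s, t):
        # Assumption: the value is 0 once the horizon is passed.
        if t > horizon + 1e-9:
            return 0.0
        return V[s][min(n - 1, round((t - grid[0]) / dt))]

    def U(mu, t):
        # Equation 5.4 with a discrete P_mu: the integral is a finite sum.
        s_next, kind, dist = outcomes[mu]
        total = 0.0
        for x, p in dist.items():
            t_next = x if kind == "ABS" else t + x
            total += p * (R(mu, t, t_next) + value_at(s_next, t_next))
        return total

    new_V = {}
    for s in states:
        # Equations 5.3 and 5.2: expectation over outcomes, then max over actions.
        V_bar = [max(sum(likelihood(mu, s, t, a) * U(mu, t) for mu in outcomes)
                     for a in actions)
                 for t in grid]
        # Equation 5.1: sup over t' >= t, computed by a backward sweep.
        # Waiting from grid[i] to grid[i+1] earns K(s, grid[i]) * dt.
        best = [0.0] * n
        best[-1] = V_bar[-1]
        for i in range(n - 2, -1, -1):
            best[i] = max(V_bar[i], K(s, grid[i]) * dt + best[i + 1])
        new_V[s] = best
    return new_V


# Toy problem (hypothetical): reach "goal" from "start", arriving in [2, 4].
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
states, actions = ["start", "goal"], ["go"]
outcomes = {"arrive": ("goal", "REL", {1.0: 0.5, 2.0: 0.5})}

def likelihood(mu, s, t, a):
    # "goal" is absorbing: no outcome is triggered there, so its value stays 0.
    return 1.0 if s == "start" else 0.0

def K(s, t):
    return -1.0 if s == "start" else 0.0  # waiting in "start" costs 1 per time unit

def R(mu, t, t_next):
    return 10.0 if 2.0 <= t_next <= 4.0 else 0.0

V0 = {s: [0.0] * len(grid) for s in states}
V1 = bellman_backup(V0, grid, states, actions, outcomes, likelihood, K, R)
```

In this toy run, acting immediately at t = 0 is worth 5 while waiting one time unit (at cost 1) and then acting is worth 9, so the backup keeps 9 at t = 0. The backward sweep works because the supremum over t' ≥ t on a grid is either "act now" or "wait one cell, then take the best from there".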
In their 2001 paper, Boyan and Littman show that under some conditions, TMDPs can be solved exactly using Value Iteration. These conditions are:
• L and K are piecewise constant functions with respect to t.
• R can be decoupled into a sum of piecewise linear reward functions: $R(\mu,t,t') = r_t(\mu,t) + r_{t'}(\mu,t') + r_\tau(\mu,t'-t)$ (5.5)
• Pµ is a discrete probability density function.
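To see what the last condition buys, it may help to write the integral of equation 5.4 explicitly. As an assumption made only for this illustration, let $P_\mu$ put probability mass $p_k$ on finitely many points $x_k$ (arrival dates in the ABS case, durations in the REL case), with $\sum_k p_k = 1$. The integral then collapses into a finite sum:

$$U_n(\mu,t) = \begin{cases} \sum_k p_k \left[ R(\mu,t,x_k) + V_n(s'_\mu,x_k) \right] & \text{if } T_\mu = \text{ABS} \\ \sum_k p_k \left[ R(\mu,t,t+x_k) + V_n(s'_\mu,t+x_k) \right] & \text{if } T_\mu = \text{REL} \end{cases}$$

In the REL case, $U_n(\mu,\cdot)$ is thus a weighted sum of shifted copies of $R$ and $V_n$, so no integration of the value function is needed.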
Even though the first two conditions are acceptable in order to model and approximate any kind of transition model and reward function, one might wish for more expressive function
shapes. On top of that, using discrete probability density functions for Pµ turns out to be a good approximation of the distributions, but it also takes the process back to the case of discretized transition durations. In this chapter, we analyse the optimality equations presented above (equations 5.1 to 5.4) and extend them to more general classes of functions. More specifically, we show how far exact resolution can be pushed with our approach and how we adapt it to the approximate case by generalizing piecewise constant and linear functions to general piecewise polynomial functions.
The core question of this chapter can be stated as follows: we are looking for a value function V(s, t) obeying the Bellman equation above. While in the discrete case the tabular representation is the simplest common basis for all representations of the value function, in the continuous case we deal with function spaces. These function spaces are generally hard to approximate and represent because of their infinite dimension. Hence, we are looking for a shape of V which provides an efficient approximation and representation framework and which is well suited to the operations of equations 5.1 to 5.4. The dynamic programming approach relies on the fact that the Bellman operator L is a contraction mapping over the value function space and admits a fixed point. In order to build an exact resolution scheme, it is useful to find a family of functions which is stable under application of L. In other words, we are looking for a class of functions C for which:
∀V ∈ C, LV ∈ C (5.6)
However, this search for an “L-stable” class of functions C should keep in mind the practical requirement that operations on V be easily computable. Also, we will need to compute operations between V and L1, Pµ, K and R, which suggests that they might need to belong to the same class C.
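To give a concrete feel for what stability of a class means, the sketch below represents piecewise polynomial functions and checks two of the operations appearing in equation 5.1: integrating K over time and adding the result to another function of the class. The class `PiecewisePoly` and its methods are hypothetical names introduced for this example, and the sketch assumes both operands are defined on the same interval. It does not cover the sup and max operations, which are the harder part.

```python
from bisect import bisect_right
from itertools import zip_longest


def _eval(coeffs, t):
    # Evaluate sum_k coeffs[k] * t**k.
    return sum(c * t**k for k, c in enumerate(coeffs))


class PiecewisePoly:
    """f(t) = sum_k coeffs[i][k] * t**k on [breaks[i], breaks[i+1])."""

    def __init__(self, breaks, coeffs):
        assert len(coeffs) == len(breaks) - 1
        self.breaks = list(breaks)
        self.coeffs = [list(c) for c in coeffs]

    def _piece(self, t):
        # Coefficients of the piece containing t (clamped to the domain).
        i = bisect_right(self.breaks, t) - 1
        return self.coeffs[min(max(i, 0), len(self.coeffs) - 1)]

    def __call__(self, t):
        return _eval(self._piece(t), t)

    def degree(self):
        return max(len(c) - 1 for c in self.coeffs)

    def __add__(self, other):
        # Merge breakpoints, then add coefficients piece by piece.
        # Assumption: both functions share the same domain.
        breaks = sorted(set(self.breaks) | set(other.breaks))
        coeffs = []
        for lo, hi in zip(breaks, breaks[1:]):
            mid = 0.5 * (lo + hi)
            pair = zip_longest(self._piece(mid), other._piece(mid), fillvalue=0.0)
            coeffs.append([a + b for a, b in pair])
        return PiecewisePoly(breaks, coeffs)

    def integral(self):
        # Continuous antiderivative F with F(breaks[0]) = 0.
        acc, coeffs = 0.0, []
        for lo, hi, c in zip(self.breaks, self.breaks[1:], self.coeffs):
            prim = [0.0] + [ck / (k + 1) for k, ck in enumerate(c)]
            prim[0] = acc - _eval(prim, lo)  # enforce continuity at lo
            coeffs.append(prim)
            acc = _eval(prim, hi)
        return PiecewisePoly(self.breaks, coeffs)


# K piecewise constant, V_bar piecewise linear (both made up for the example).
K = PiecewisePoly([0, 2, 5], [[-1.0], [0.5]])
V_bar = PiecewisePoly([0, 3, 5], [[1.0, 2.0], [7.0]])
IK = K.integral()   # piecewise linear
S = IK + V_bar      # still piecewise linear, with breakpoints 0, 2, 3, 5
```

Here integrating the piecewise constant K raises the degree from 0 to 1, and adding the piecewise linear `V_bar` keeps the result piecewise linear: the class of piecewise polynomials of degree at most 1 is stable under this particular combination, while integration alone would push a degree-1 function to degree 2.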