
9.2 Evolution of decision intervals and actions by solving a sequence of discrete

9.2.3 Related work and conclusion

As mentioned earlier, this method differs from the algorithms presented in [Feng et al., 2004], [Li and Littman, 2005] and [Benazera et al., 2005] (HAO*) because it does not search for a local refinement of the partitioning of a continuous variable, but for the smallest set of bounds needed to define the policy on this variable.

Earlier work on this problem and on the problem of incremental discretization of continuous variables was presented in [Munos and Moore, 2002] and [Munos and Moore, 2000]. The method proposed in the previous paragraph builds on the same idea of concentrating accuracy where it is needed. However, the main difference is that our method sacrifices two properties in order to obtain as few bounds as possible: optimality and causality. Optimality is lost because we pop some bounds out of the list of bounds when two consecutive actions are equal, which yields a worse approximation in the discretized model than if we had kept these bounds. Causality in the discretized problem is lost because of the approximation method, as explained at step 1 of the previous algorithm.
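The bound-popping step mentioned above can be illustrated by a minimal sketch. It assumes a hypothetical layout in which the bounds are a sorted list of interval limits and `actions[i]` is the action selected on the i-th interval; the name `prune_bounds` and this data layout are illustrative assumptions, not the implementation used in the thesis.

```python
from typing import List, Sequence, Tuple


def prune_bounds(
    bounds: Sequence[float], actions: Sequence[str]
) -> Tuple[List[float], List[str]]:
    """Drop every interior bound that separates two intervals sharing an action.

    bounds  -- sorted interval limits t_0 < t_1 < ... < t_n (assumed layout)
    actions -- actions[i] is the action chosen on [bounds[i], bounds[i + 1])
    """
    if len(bounds) != len(actions) + 1:
        raise ValueError("expected exactly one more bound than actions")
    if not actions:
        return list(bounds), []

    kept_bounds = [bounds[0]]
    kept_actions: List[str] = []
    for i, action in enumerate(actions):
        if kept_actions and kept_actions[-1] == action:
            # Same action on both sides: bounds[i] is not needed by the policy.
            continue
        if kept_actions:
            # The action changes here, so bounds[i] must be kept.
            kept_bounds.append(bounds[i])
        kept_actions.append(action)
    kept_bounds.append(bounds[-1])
    return kept_bounds, kept_actions


if __name__ == "__main__":
    print(prune_bounds([0.0, 2.0, 5.0, 7.0, 10.0], ["move", "move", "wait", "move"]))
```

In this sketch the bound at 2.0 disappears because the same action holds on both sides of it. The policy is unchanged, but the discretized model built on the remaining bounds is coarser, which is the loss of optimality discussed above.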

Lastly, one can push the comparison with policy iteration a little further. If one considers the (t̃, π(s, T̃)) decision variables, the previous algorithm can be seen as a policy iteration algorithm. Its evaluation phase is an approximate evaluation of the policy, using an optimistic model obtained by discretization of the continuous problem. Its optimization phase combines the discrete optimization of the actions with the continuous approximate optimization of the bounds' evolution.

Finally:

The method presented in this chapter separates the optimization of the decision intervals' bounds from the action selection procedure. It relies on an incremental method, similar in philosophy to policy iteration, to improve the number and values of the bounds, and on a discrete MDP resolution scheme to preserve the coupling between these bounds and the optimized actions. This method could be implemented using different tools for MDP optimization, model discretization and convex optimization, providing a family of variants based on the same principle of incrementally finding the right intervals for policy definition.

Chapter 9. Perspectives: evolutive partitioning of time

10 Conclusion

This chapter summarizes the results obtained in the previous chapters. We also discuss the possibility of adapting the TMDPpoly method and tools to the more general case of XMDPs with hybrid state and action spaces, highlighting where the advantages and difficulties lie. Finally, we conclude this first part of the thesis and explain how it leads to the second part.

10.1 “Take-away” messages

This first part of the thesis focused on the problem of introducing a continuous time variable in the MDP framework. This raised questions concerning the link with the discounted criterion, the resolution algorithm and the formal representation framework of temporal Markov decision problems. Here is a short summary of the conclusions drawn from the previous chapters:

• Considering a continuous observable time variable implies looking at a hybrid state space MDP. Furthermore, having an observable time directly affects the definition of the discounted criterion.

• Introducing continuous variables such as time often calls for the introduction of continuous actions such as wait. This yields a hybrid action space MDP with hybrid state space and observable time in the discounted criterion.

• The XMDP framework captures these characteristics and establishes optimality equations for the policies one could define on such problems. This XMDP framework includes standard MDPs, SMDP+ and TMDPs.

• In practice, when time is the only continuous variable and wait the only continuous action, some extra hypotheses can be made. Namely, wait is often deterministic with respect to the state variables, and the reward for a zero-duration wait is zero. This falls into the framework of SMDP+. Sometimes wait might even have no impact on the discrete part of the state space. This is the standard TMDP framework, which we slightly extended to allow a deterministic effect on the state variables through the use of a W function describing the deterministic evolution of the system while waiting.

• The optimality equations presented in [Boyan and Littman, 2001] for the TMDP framework correspond to a total reward criterion for the equivalent XMDP.


• Extending the exact resolution scheme of TMDPs to the case of piecewise polynomial functions is quickly limited by the properties of formal calculations on such representations. Namely, this exact resolution scheme could not be extended further than discrete probability density functions, piecewise constant transition probabilities and piecewise polynomial reward functions of degree lower than 5.

• The analysis of the TMDP optimality equations provided a more global approximate resolution method for the case of piecewise polynomial functions, based on:

– Exact and approximate formal calculations on piecewise polynomial functions.

– Prioritized sweeping adapted to TMDPs.

– Approximate value iteration.

These features spawned the TMDPpoly algorithm and planner.

• The main drawback of value iteration methods for temporal Markov decision problems comes from the difficulty of precisely defining the value functions. In the case of piecewise polynomial functions, this difficulty shows in the number of definition intervals needed to accurately describe the value functions. We provided a first attempt at simplifying this representation by taking a short look into evolutive partitioning of time. This resulted in a “policy iteration”-like method which contains the first ideas behind the model-free reinforcement learning methods of the next part of the thesis.
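To make the remark about the number of definition intervals concrete, here is a minimal sketch of a piecewise polynomial function of time. The class `PiecewisePoly`, its storage layout (sorted breakpoints plus one coefficient list per interval) and the example functions are illustrative assumptions; this is not the TMDPpoly data structure. It only shows that adding two such functions merges their breakpoints, so the interval count tends to grow with each combination of value and reward functions.

```python
from bisect import bisect_right
from itertools import zip_longest
from typing import List, Sequence


class PiecewisePoly:
    """Piecewise polynomial defined on [breaks[0], breaks[-1]).

    coeffs[i] lists the coefficients (lowest degree first) that apply on
    [breaks[i], breaks[i + 1]). This layout is an assumption of the sketch.
    """

    def __init__(self, breaks: Sequence[float], coeffs: Sequence[Sequence[float]]):
        if len(breaks) != len(coeffs) + 1 or list(breaks) != sorted(set(breaks)):
            raise ValueError("need strictly increasing breaks, one more than pieces")
        self.breaks = list(breaks)
        self.coeffs = [list(c) for c in coeffs]

    @property
    def n_intervals(self) -> int:
        return len(self.coeffs)

    def piece(self, t: float) -> List[float]:
        """Return the coefficients of the piece whose interval contains t."""
        if not self.breaks[0] <= t < self.breaks[-1]:
            raise ValueError("t is outside the definition domain")
        return self.coeffs[bisect_right(self.breaks, t) - 1]

    def __call__(self, t: float) -> float:
        # Horner evaluation of the active piece.
        result = 0.0
        for c in reversed(self.piece(t)):
            result = result * t + c
        return result

    def __add__(self, other: "PiecewisePoly") -> "PiecewisePoly":
        if (self.breaks[0], self.breaks[-1]) != (other.breaks[0], other.breaks[-1]):
            raise ValueError("both functions must share the same domain")
        # The sum is polynomial only between breakpoints of either operand.
        merged = sorted(set(self.breaks) | set(other.breaks))
        pieces = [
            [a + b for a, b in zip_longest(self.piece(left), other.piece(left), fillvalue=0.0)]
            for left in merged[:-1]
        ]
        return PiecewisePoly(merged, pieces)


if __name__ == "__main__":
    f = PiecewisePoly([0, 4, 10], [[1.0], [0.0, 0.5]])   # 1 on [0,4), 0.5 t on [4,10)
    g = PiecewisePoly([0, 6, 10], [[0.0, 1.0], [2.0]])   # t on [0,6), 2 on [6,10)
    h = f + g
    print(f.n_intervals, g.n_intervals, h.n_intervals)
    print(h.breaks, h(5.0))
```

Two functions with two intervals each produce a sum with three intervals here. Repeating such operations over many backups is one way to picture why the interval count becomes the bottleneck mentioned above.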