7.4 The UAV patrol problem
7.4.1 Problem definition
The second main example we present here highlights an interesting use of TMDPs. In all the previous examples, the wait action was mainly used to “freeze” the agent’s discrete state while letting the time variable grow, in order to catch any good future reward available in the current state. The UAV patrol problem is different in the sense that it does not define a wait action but a patrol action, which is strictly equivalent to wait in terms of TMDP description. patrol(τ) is both a continuous action and (contrary to other wait actions, which usually incur costs) the only action providing rewards. This example illustrates the fact that we can replace wait by another continuous action and optimize a strategy on a hybrid action space.
Let us now imagine an unmanned air vehicle (UAV) having a mission defined in terms of patrolling over certain areas of a map. More specifically, let us imagine a map with four areas of interest where the UAV has to observe a certain phenomenon. The human agent specifying the mission indicates during which time intervals the UAV should watch each zone and assigns different importances to zones in case of scheduling conflicts. For example, one could say:
“Set importance 2 on position p1 between t = 0 and t = 25,
then set importance 2 on the same position p1 between t = 60 and t = 70,
also set importance 5 on position p2 between t = 45 and t = 50,
assign importance 2 on position p3 between t = 20 and t = 50
and finally set importance 3 on position p4 between t = 45 and t = 70.”
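To make this mission specification concrete, the following minimal sketch encodes the five statements above as piecewise constant reward rates and computes the reward collected by a patrol(τ) action. It assumes that the reward of patrol(τ) started at time t is the integral of the reward rate over [t, t + τ]; the names MISSION and patrol_reward and the labels p1 to p4 are illustrative and are not part of the planner.

```python
# Hypothetical encoding of the example mission: for each position, a list of
# (start, end, rate) windows. Names and layout are illustrative assumptions.
MISSION = {
    "p1": [(0.0, 25.0, 2.0), (60.0, 70.0, 2.0)],
    "p2": [(45.0, 50.0, 5.0)],
    "p3": [(20.0, 50.0, 2.0)],
    "p4": [(45.0, 70.0, 3.0)],
}


def patrol_reward(position, t, tau):
    """Reward of patrol(tau) started at time t on `position`.

    Assumption: the reward is the integral of the piecewise constant
    reward rate over the interval [t, t + tau].
    """
    total = 0.0
    for start, end, rate in MISSION.get(position, []):
        # Length of the overlap between the patrol interval and the window.
        overlap = min(end, t + tau) - max(start, t)
        if overlap > 0:
            total += rate * overlap
    return total


# Patrolling p1 from t = 20 for 10 time units only earns reward until t = 25.
print(patrol_reward("p1", 20.0, 10.0))  # 10.0
```

Under this assumption, patrolling outside every time window of a position simply yields a reward of zero, which is what makes the timing of patrol actions part of the optimization.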
Now let us suppose that the UAV’s navigation map is described as a grid of positions p = (x, y) as in figure 7.19. This grid represents the navigation environment of the UAV and the reward rates associated to each of the patrol zones. The UAV is given a meteorological model indicating how the wind is supposed to blow during the mission and has some probabilistic knowledge about the results of its atomic movement actions depending on the wind. The planning problem corresponds to finding the optimal policy of movement between positions and local patrolling as a function of the current position and the current time. Thus, the action space can be written as in table 7.4 and the state space contains the variables presented in table 7.5.
patrol(τ)     continuous action: patrol the current position for τ time units
N, S, E, W    discrete movement actions taking the UAV to a nearby position
Table 7.4: Patrol problem — action space
Figure 7.19: UAV patrol problem - Reward rates (panels: state (3, 8), state (5, 2), state (9, 3), state (9, 10))
t    the current time, continuous variable taking its values in [0, 100]
x    discrete latitude of the UAV, taking its values in {1, ..., 10}
y    discrete longitude of the UAV, taking its values in {1, ..., 10}
Table 7.5: Patrol problem - state space
We briefly present the simple wind model we used. Between t = 8 and t = 30, the wind blows from East to West, and between t = 60 and t = 80, from North to South. At all other times, there is no wind. When the wind blows, it changes the probabilities of making a successful move and the transition durations. Without entering into the modeling details, the wind “pushes” the UAV in a specific direction, which shortens or lengthens the movement durations and can result in off-course final transition states.
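The time windows of the wind model can be written down as a small lookup, sketched below. This is only a qualitative illustration: the actual transition probabilities and durations are not given here, so the sketch only classifies a move as going with, against or across the wind. Treating the windows as closed intervals, and the names WIND_WINDOWS, wind_direction and wind_effect, are assumptions.

```python
from typing import Optional

# Wind windows from the text: (start, end, direction the wind blows towards).
# Closed intervals are an assumption; the text does not specify the endpoints.
WIND_WINDOWS = [
    (8.0, 30.0, "W"),   # wind blowing from East to West
    (60.0, 80.0, "S"),  # wind blowing from North to South
]

OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def wind_direction(t: float) -> Optional[str]:
    """Direction towards which the wind pushes at time t, or None if calm."""
    for start, end, direction in WIND_WINDOWS:
        if start <= t <= end:
            return direction
    return None


def wind_effect(action: str, t: float) -> str:
    """Qualitative effect of the wind on a movement action started at time t."""
    wind = wind_direction(t)
    if wind is None:
        return "calm"
    if action == wind:
        return "tailwind"   # expected to shorten the move
    if action == OPPOSITE[wind]:
        return "headwind"   # expected to lengthen the move
    return "crosswind"      # may push the UAV off course


print(wind_effect("W", 10.0))  # tailwind
print(wind_effect("N", 70.0))  # headwind
print(wind_effect("E", 45.0))  # calm
```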
Therefore, the UAV patrol problem is a grid world navigation problem with stochastic movement actions, stochastic continuous transition durations, hybrid state and action spaces with the TMDP hypothesis on the continuous action.
7.4.2 Optimization results
Because the discrete state space represents only the geographical position of the UAV, this problem is easy to represent graphically. As in the rover case, we designed several versions of the patrol problem. The first version uses only discrete probability density functions, the second one uses piecewise linear density functions.
Table 7.6 summarizes the optimization results for the two versions of the patrol problem. In both cases, the threshold on priorities was set to 0.1, the approximation L∞ bound was equal to 0.05 and the precision on t for the approximate polynomial calculations was 10^-3.
Problem      Iterations before convergence    Average running time
version 1    531                              13.90 seconds
version 2    824                              740.17 seconds
Table 7.6: Patrol problem — optimization time
As in the Mars rover case, the figures of table 7.6 illustrate the fact that piecewise polynomial operations (such as convolution) still need a lot of optimization. For an increase by a factor of 1.55 in the number of iterations, the calculation time was multiplied by 53.25. One can also compare the 531 state visits of the first version with the number of state updates performed by the Value Iteration-like algorithm of [Boyan and Littman, 2001]. With the latter algorithm, the value function converges to an ε-optimal value function after 330 passes through the state space, corresponding to 33000 state updates. Therefore, performing asynchronous dynamic programming with priorities reduced the number of state visits by a factor of 62.
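The ratios quoted in this paragraph can be rechecked directly from table 7.6 and from the 100 discrete states of the grid. The short computation below is only a worked verification; the variable names are illustrative.

```python
# Figures taken from table 7.6 and from the text.
iters_v1, time_v1 = 531, 13.90     # version 1: discrete distributions
iters_v2, time_v2 = 824, 740.17    # version 2: piecewise linear distributions
n_states = 100                     # 10 x 10 grid of positions
vi_passes = 330                    # passes of the Value Iteration-like algorithm

iteration_ratio = iters_v2 / iters_v1        # about 1.55
time_ratio = time_v2 / time_v1               # about 53.25
vi_updates = vi_passes * n_states            # 33000 state updates
visit_reduction = vi_updates / iters_v1      # about 62

# Average time per update, which isolates the cost of the polynomial operations.
per_update_v1 = time_v1 / iters_v1
per_update_v2 = time_v2 / iters_v2

print(f"iterations x{iteration_ratio:.2f}, running time x{time_ratio:.2f}")
print(f"state visits reduced by a factor {visit_reduction:.0f}")
print(f"time per update: {per_update_v1:.4f} s vs {per_update_v2:.4f} s")
```

The last line makes the point of the paragraph explicit: most of the gap between the two versions comes from the cost of each individual update rather than from the number of updates.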
Figures 7.20 to 7.23 present the evolution of priorities and calculation times for the two versions of the patrol problem.
The increase of priorities around iteration 120 is due to the same phenomenon as illustrated on the three states problem, in section 7.2.
Figure 7.20: UAV patrol problem - Priorities evolution, first version (max priority vs. iteration number)
Figure 7.21: UAV patrol problem - Update durations, first version (iteration duration in seconds vs. iteration number)
Figure 7.22: UAV patrol problem - Priorities evolution, second version (max priority vs. iteration number)
Figure 7.23: UAV patrol problem - Update durations, second version (iteration duration in seconds vs. iteration number)
In order to illustrate the evolution of V on a single state, we have selected state (7, 7) on the first version of the patrol problem. This state is only updated five times during the whole process. Since there are 531 updates and 100 states, this number of updates is representative of what happens on average over the whole state space. Figures 7.24 to 7.28 show the evolution of the value function and of the policy. One can tell the “update story” of this state:
• State (7, 7) is updated for the first time during the 40th iteration because it previously had a high priority of 74.98, directly inherited from the propagation of the reward for the patrol zone situated in (9, 10) (figure 7.24).
• At iteration 43, one of its neighbors is updated and it receives priority 14.99.
• At iteration 57, one of its neighbors is updated again and it receives a higher priority of 74.96, thus pushing it almost to the top of the priority list.
• It is then updated for a second time at iteration 66 (figure 7.25).
• Almost immediately after its update, at iteration 68, it receives priority 14.96. These quick priority changes come from the fact that TMDPpoly focuses on the states which have the largest variations to let them converge first. Since (7, 7) is one of the central states in the map, we can expect the policy to be a delicate compromise between directions, and TMDPpoly will focus on it in order to let it converge early in the optimization process.
• The priorities propagate the change information to the rest of the state space and nothing happens before iteration 225 when a neighbor is updated again, hence providing (7, 7) with priority 16.26.
• It is updated for the third time during update 237 (figure 7.26) and keeps its priority of zero until update number 275 where it receives priority 6.01.
• This priority lets it be updated for the fourth time at update 304 (figure 7.27).
• Its priority is finally set to 0.56 at update 333.
• The final update occurs at iteration 408 (figure 7.28).
• After this iteration no priority of more than 0.1 is assigned to state (7, 7) and the value function and policy do not change anymore.
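The update story above follows the general pattern of prioritized sweeping: the state with the highest priority is updated, the priorities of its predecessors are recomputed, and the process stops once no priority exceeds the threshold. The sketch below illustrates this pattern on a plain finite MDP, which is much simpler than the TMDP setting. It is not the TMDPpoly algorithm: value functions are scalars instead of piecewise polynomial functions of time, the discount factor gamma is an assumption, and the priority is taken to be the Bellman residual of a state. All names are hypothetical.

```python
def bellman_backup(mdp, V, s, gamma):
    """Best one-step lookahead value of state s under the current V."""
    return max(
        sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
        for outcomes in mdp[s].values()
    )


def prioritized_sweeping(mdp, gamma=0.9, threshold=0.1, max_updates=10_000):
    """Asynchronous value iteration driven by priorities (Bellman residuals)."""
    V = {s: 0.0 for s in mdp}

    # Predecessors of each state: updating s can only change their priorities.
    preds = {s: set() for s in mdp}
    for s, actions in mdp.items():
        for outcomes in actions.values():
            for p, s2, _ in outcomes:
                if p > 0:
                    preds[s2].add(s)

    priority = {s: abs(bellman_backup(mdp, V, s, gamma) - V[s]) for s in mdp}
    updates = 0
    while updates < max_updates:
        s = max(priority, key=priority.get)   # state on top of the queue
        if priority[s] <= threshold:          # nothing left above the threshold
            break
        V[s] = bellman_backup(mdp, V, s, gamma)
        priority[s] = 0.0
        updates += 1
        for sp in preds[s]:                   # propagate to the parent states
            priority[sp] = abs(bellman_backup(mdp, V, sp, gamma) - V[sp])
    return V, updates


# Toy chain a -> b -> c -> end, with a reward of 10 when leaving c.
CHAIN = {
    "a": {"go": [(1.0, "b", 0.0)]},
    "b": {"go": [(1.0, "c", 0.0)]},
    "c": {"go": [(1.0, "end", 10.0)]},
    "end": {"stay": [(1.0, "end", 0.0)]},
}

values, n_updates = prioritized_sweeping(CHAIN)
print(values, n_updates)
```

On this toy chain, the priorities spread backwards from the rewarding state one predecessor at a time, which mirrors the way the priority queue of the patrol problem spreads from the patrol zones.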
TMDPpoly uses the alphabetical static ordering on actions to break any ties. Since actions “West” and “North” appear to be equivalent several times during the updates, the chosen action is always “North” in case of a tie, leaving patches of “West” in the policy only where the latter is strictly dominant (at iteration 304 for instance).
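This tie-breaking rule can be sketched as follows for a single time instant. The function name greedy_action, the numerical tolerance tol and the example Q values are assumptions made for the illustration; only the alphabetical ordering itself comes from the text.

```python
def greedy_action(q_values, tol=1e-9):
    """Return the best action, breaking ties by alphabetical order.

    q_values maps action names to their Q value at a given time instant.
    Two actions are considered tied when their values differ by at most tol
    (the tolerance is an assumption of this sketch).
    """
    best = max(q_values.values())
    tied = [action for action, q in q_values.items() if best - q <= tol]
    return min(tied)  # alphabetical order: East < North < South < West


# "North" and "West" are equivalent: the static ordering selects "North".
print(greedy_action({"East": 0.2, "North": 1.0, "South": 0.1, "West": 1.0}))

# "West" is strictly dominant: it is kept in the policy.
print(greedy_action({"East": 0.2, "North": 1.0, "South": 0.1, "West": 1.5}))
```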
Based on the TMDPpoly planner, we built a graphical demonstration interface for the patrol problem. As illustrated in figure 7.29, this interface allows the user to change the optimization parameters, perform step-by-step prioritized sweeping, run and pause the optimization process and save the result to text files or images.
In the “grid” window of the interface, the red square indicates the first state in the current priority queue. For instance, on figure 7.29, one can see in window “TMDPpoly” that 124 states have been updated so far and that the current highest priority is 75.93. This priority is the one of state (4, 6) where the red cursor is positioned. The blue square in the “grid” window is positioned by the user. It is used to select a certain discrete state and to display its current V, V̄ and Q functions as well as its current policy in the windows in the middle.
The numbers displayed on the grid represent the current priority queue. This priority queue is initialized with the four patrol zones and quickly spreads by local propagation of the priorities.
Figure 7.24: UAV patrol problem - state (7, 7), iterations 40 and 41; (a) Before, (b) After
Figure 7.25: UAV patrol problem - state (7, 7), iterations 66 and 67; (a) Before, (b) After
Figure 7.26: UAV patrol problem - state (7, 7), iterations 237 and 238; (a) Before, (b) After
Figure 7.27: UAV patrol problem - state (7, 7), iterations 304 and 305; (a) Before, (b) After
Figure 7.28: UAV patrol problem - state (7, 7), iterations 408 and 409; (a) Before, (b) After
Often, when clicking in the grid on a certain discrete state, one can notice that some Q functions are actually higher than the current V or V̄ functions. This is normal since, as explained in algorithm 6.4 and in section 6.3, Q functions are updated after updating the V functions, in order to propagate the priorities to parent states. Therefore, some states can have Q functions higher than their V functions just because their neighbors have been updated. In these cases, the states in question necessarily have a non-zero priority⁶.
In the end, the UAV patrol problem illustrates an interesting alternative use of TMDPs by making the wait (patrol) action the only reward-providing action. It opens the door to the general specification of hybrid state and action problems as long as they verify the TMDP hypotheses.
7.5 Conclusion
Finally, this chapter illustrates how the TMDPpoly algorithm works and where its algorithmic advantages and drawbacks lie. It results in a formal method for computing the time-dependent optimal policy for temporal Markov decision problems, formulated as TMDPs. By pointing out the TMDP limitations, we were able to extend them, both in terms of representation capability (continuous distributions) and in terms of resolution method (the TMDPpoly algorithm itself). A next step in extending the TMDP resolution framework would be to integrate the use of the W function, specifying the system’s dynamics during waiting phases. Using this function might however bring the problem back to a more general setup: if the undisturbed system’s evolution is stochastic, then wait will have to be redefined and the difference with other possible continuous actions will be reduced. Another step would be to introduce the biases on priorities which we presented in section 6.3 in order to further exploit the causality property associated with the time variable. This indeed corresponds to extending the priorities definition to states (s, t) (instead of states s currently). Such an improvement is expected to further increase TMDPpoly’s efficiency since it will directly exploit the loop-free structure of temporal Markov decision problems.
This chapter also brings multiple perspectives. First, it introduces a fully implemented method for performing what we could name “formal Bellman backups” on a hybrid state space. This method is directly applied to the TMDP case and depends a lot on the TMDP hypotheses. It provides a practical, polynomial-based, formal calculus alternative to Monte-Carlo sampling methods which are the current common way of tackling hybrid state and action problems.
Chapter 8 will generalize the current TMDP framework to a more general class of hybrid problems, thus underlining how this current implementation can be reused for more general cases. Chapter 10 will try to highlight how this method of formal Bellman backups can be extended to these more general cases and will discuss where the difficulties lie.
Secondly, we used the TMDPpoly algorithm as defined in the previous chapters to solve hybrid state and action problems such as the Mars rover and the UAV patrol problems. While this implementation is not able to scale to very large state spaces yet, it already provides a reasonable basis for solving this class of continuous time problems and extends immediately to the case of a single continuous state variable and a single continuous action as in the patrol problem case. Improving this method with heuristic guidance, structured representations of the discrete part of the state space and better low-level function manipulation operators are some of the keys needed to scale up to larger domains. While these issues will be discussed in chapter 10, they are independent of the basis of the TMDPpoly method which already provides results on time-dependent problems such as the Mars rover problem.
⁶ Even though, at the end of the algorithm, these priorities can be considered null because they are below the priority threshold. In this case, the slight variation of Q (amplitude < 0.01) is not visible on the graph.
Then, one of the main practical conclusions from our experiments is that improving the efficiency of the POLYTOOLS implementation yields a dramatic improvement of the overall planner’s efficiency. This is quite natural since the whole architecture is built above the POLYTOOLS implementation. Therefore, it would be very interesting to:
• improve POLYTOOLS’s implementation and efficiency in the first place, but also to
• test the TMDPpoly planner with different degrees for interpolation, in particular cubic splines which are not functional today because of technical implementation reasons; this will allow us to
• evaluate the degree/pieces compromise⁷.
Therefore, improvement of the POLYTOOLS / TMDPpoly framework is still necessary to help understand the advantages and drawbacks of our method and to extend it to more general cases.
One important conclusion which does not appear visibly in the previous results is the huge impact of algorithm 6.6 on the optimization process. Without this algorithm, both the degree of polynomials and the number of definition intervals explode and the optimization gets stuck in very long, sometimes unpredictable, calculations for nothing. Even in the discrete distributions case, algorithm 6.6 dramatically decreases the computational time while preserving the global efficiency of the method and the L∞ bounds on the value function.
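To give an idea of why degrees and interval counts grow, consider convolution, one of the piecewise polynomial operations mentioned earlier. For two compactly supported piecewise polynomial functions, the breakpoints of the convolution lie among the pairwise sums of the operands’ breakpoints, and its degree is at most the sum of the two degrees plus one. The sketch below only tracks these upper bounds over repeated convolutions; it does not compute the convolution itself and it is not algorithm 6.6. The function name and the example breakpoints are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def convolution_shape(breaks_f, degree_f, breaks_g, degree_g):
    """Upper bounds on the structure of the convolution f * g.

    f and g are assumed to be compactly supported piecewise polynomials,
    described only by their sorted breakpoints and their maximal degree.
    Returns the candidate breakpoints of f * g and a bound on its degree.
    """
    breaks = sorted({a + b for a, b in product(breaks_f, breaks_g)})
    return breaks, degree_f + degree_g + 1


# Start from a box function on [0, 1] (degree 0) and repeatedly convolve it
# with a piecewise linear function with breakpoints 0, 1/3 and 1 (degree 1).
breaks, degree = [Fraction(0), Fraction(1)], 0
step_breaks, step_degree = [Fraction(0), Fraction(1, 3), Fraction(1)], 1

for i in range(1, 5):
    breaks, degree = convolution_shape(breaks, degree, step_breaks, step_degree)
    print(f"after {i} convolution(s): at most {len(breaks) - 1} pieces, "
          f"degree at most {degree}")
```

Exact fractions are used for the breakpoints so that equal sums are merged reliably. Both bounds increase at every step, which is the growth that has to be kept under control during the optimization.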
From the algorithmic point of view, the causality feature of temporal Markov problems has not been used to its full possibilities. Even though this is encouraging with respect to the adaptability of our method to another continuous variable which would not have such properties⁸, it is a point on which improvement of the TMDPpoly algorithm is possible. For example, focusing on the latest time intervals of the problem first might accelerate convergence since we work with backward propagation. Letting the latest times converge first can actually ensure that full parts of the time-dependent value functions have converged and need not be further revised. Thus it would better exploit the oriented nature of the time variable. As mentioned in section 6.3, this could be done by biasing the way we calculate the priorities. It could also take advantage of partial calculation of the V (s, t) functions: during the first state updates, only the “latest” part of the function is important, then, when it has converged, one can focus on “earlier” parts.
Finally, we can conclude that the main obstacle to the TMDPpoly implementation and experiments stemmed from the very nature of formal calculus on piecewise polynomial functions. While this obstacle has been at least partially overcome, there still remain many possible improvements for this work. These improvements can especially reduce the gap between the computational times associated with the piecewise linear versions of our problems and the discrete distribution versions. Since the number of iterations needed before convergence is comparable in both cases, using piecewise continuous distributions will become competitive with discrete ones once the related operations have been improved regarding calculation time. Still, by only looking at the number of iterations before convergence, we can deduce that there was very little additional complexity associated with using these piecewise continuous distributions at the planner’s level. Moreover, low-level numerical problems such as the ones mentioned earlier (related to the precision of L∞ bounds and the failures of root finding methods) illustrate the main obstacles associated with dealing with piecewise continuous functions in general and piecewise polynomial ones in our particular case. Some problems intrinsically have such piecewise continuous distributions and it might be rather cumbersome to approximate them with discrete ones. The TMDPpoly planner with improved piecewise polynomial function handling might open the door to directly dealing with such problems.
⁷ Compromise between the polynomials’ degree and the number of definition intervals in the piecewise polynomial functions.
⁸ Namely, causality implies no current event will have repercussions in the past and thus no change in the [...]
8 Generalizing MDPs to continuous observable time: the XMDP framework
Including time as a continuous observable variable in the MDP state space naturally leads to considering the continuous wait(τ) action on top of all other previous discrete actions. More generally, including continuous variables in the state space often calls for continuous or hybrid (continuous and discrete) actions. We have seen in chapter 2 that the time variable played a particular role with respect to the