
7.3 The Mars rover problem

7.3.1 Problem definition

Overview

The rover problem is inspired by and adapted from the International Planning Competition “rover” domain and by the original Mars rover problem statement of [Bresina et al., 2002].

This domain describes the problem of mission planning for a rover over a full day on Mars. The rover’s mission is to collect two rock samples from different sites and to take a photo of a distant object. Available actions deal with recharging the batteries, taking the photo, collecting the samples and moving from site to site. One can make the problem more complex by adding possible transmissions with a remote station, on-board analysis actions, memory management, etc. We will keep this first simple description of the problem for our experiments since it seems rich enough to describe an interesting problem.

Previous work on the problem of planning the operations of the Mars rover tackled different aspects of the problem stated in [Bresina et al., 2002]. The complete rover domain, as presented by [Bresina et al., 2002], involves dealing with contingencies, probabilities, continuous variables, continuous time, concurrent actions, etc. [Bresina et al., 2002] lists a number of algorithms, planners and approaches for this domain, highlighting their strengths and weaknesses. Later work by [Mausam and Weld, 2007] addresses the question of dealing with concurrent actions, synchronized on a discretized time, with duration uncertainties. [Feng et al., 2004; Li and Littman, 2005] attacked the problem from the fully continuous point of view, representing value functions as kd-trees. HAO* [Benazera et al., 2005] also attacked the rover problem by addressing the question of hybrid state spaces and heuristic search and pruning. While our algorithm is not designed to compete with the previous approaches in terms of performance, it provides an alternative which could be combined, for example, with the heuristic approach of HAO*, or with the action elimination scheme of [Mausam and Weld, 2007] for dealing with larger action spaces.

Figure 7.7 illustrates the mission planning problem. The rover can navigate between nodes labeled 1 to 6, which correspond to values of p, the position variable. Each movement action has a certain success or failure probability: these actions can end up in the destination or in the initial position. Similarly, movement durations and energy consumption are uncertain. The labels attached to the edges of the navigation graph correspond to the average travel duration for a successful move along the edge. The filled nodes correspond to sample sites: sample 1 is available at position 5 and sample 2 at position 2. The dark gray areas are obstacles to both navigation and vision while the light gray area is an obstacle to navigation only. Consequently, the photo can be taken from any of the nodes numbered 3 to 6. However, this picture has different probabilities of being successful depending on the shooting site. The rover has the on-board ability to roughly analyse the image in order to determine whether it is good or not. So whenever the picture is taken, it can result in either a good image or a bad one, but there is no notion of ranking among images. Consequently, whenever a good image has been shot, it is kept without further questioning. The preferred shooting site is position 6.


Figure 7.7: Mars rover problem — mission presentation

We consider a day of length 70 time units and we suppose the goal is to finish the mission before nightfall, but this constraint is flexible and the mission does not really have to be completed in one day. After 70 time units, night falls and the rover switches to energy saving. We consider e = 0 to be the lowest energy level corresponding to surviving during one night. Hence, the mission can be restarted every day from any state of the problem, which implies we are interested in the policy in every possible starting state.

Finally, lighting changes with the time of day, which affects both the recharge ability of the rover and the success probability of the photograph.

The state variables we consider are summarized in table 7.1. They yield a hybrid state space containing 1968 discrete states and one continuous variable. It can be interesting to compare with a discrete problem generated using a unit discretization of time (footnote 1): this fully discrete problem has 139728 states. Some current algorithms for MDPs can deal with such state space sizes, especially heuristic search algorithms and algorithms making use of factored representations, but simple algorithms such as Value Iteration over standard tabular representations of this size take a long time to converge.

One could object to this argument that, with a unit discretization of time, the resolution takes exactly 71 value iterations because the problem is a finite horizon MDP with uncertain durations. Indeed, the comparison with the 139728-state problem is more valid for the case of general continuous variables. Anyway, 71 iterations times 1968 states corresponds to 139728 state value updates, while we will see a little further that our prioritized sweeping method finishes in about 35000 state function updates, which might be an interesting trade-off between calculation complexity and having continuous dynamics representations.

Footnote 1: as presented in the next paragraphs, a unit discretization of time is the least necessary to roughly ap-
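The state counts quoted above can be rechecked from the domains of table 7.1. The following is a minimal sketch; the dictionary `DOMAINS` and the variable names are illustrative and not taken from the TMDPpoly implementation.

```python
from itertools import product

# Domains of the discrete state variables, as listed in table 7.1.
DOMAINS = {
    "e": range(0, 41),   # energy levels 0..40
    "p": range(1, 7),    # positions 1..6
    "im1": (0, 1),       # image 1 taken
    "sa1": (0, 1),       # sample 1 collected
    "sa2": (0, 1),       # sample 2 collected
}

# Number of discrete states in the hybrid model (time stays continuous).
n_discrete = sum(1 for _ in product(*DOMAINS.values()))

# A unit discretization of t in [0, 70] gives the 71 dates 0, 1, ..., 70.
n_dates = len(range(0, 71))
n_fully_discrete = n_discrete * n_dates

print(n_discrete, n_fully_discrete)  # 1968 139728
```

The product 41 × 6 × 2 × 2 × 2 gives the 1968 discrete states, and multiplying by the 71 dates gives the 139728 states of the fully discrete problem.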

Our implementation of the TMDPpoly algorithm is rather straightforward and leaves a lot of room for heuristic search improvements and better representations of the state space’s discrete part, so it makes sense to compare the performance of Value Iteration or standard Prioritized Sweeping on these large discrete problems with the performance of TMDPpoly on the hybrid one.

Variable   Description           Domain
t          time                  [0, 70]
e          energy                {0, 1, . . . , 39, 40}
p          position              {1, 2, 3, 4, 5, 6}
im1        image 1 taken         {0, 1}
sa1        sample 1 collected    {0, 1}
sa2        sample 2 collected    {0, 1}

Table 7.1: Mars rover problem — state variables

The action space of the rover is described in table 7.2. The number of actions defined in this table is 23 (not counting the continuous wait action). However, this number does not mean much since only a few of these actions are available in each state. Therefore, it is better to count the minimum and maximum number of actions available per state to get an idea of the problem’s difficulty.

• Example of states that have the most available actions: p = 3; 6 ≤ e < 40; im1 = 0
  ↔ {move(3, 1), move(3, 2), move(3, 4), move(3, 5), take picture(3), recharge, wait}

• Example of states that have the least available actions: e = 0
  ↔ {recharge, wait}

Action             Description
move(p1, p2)       movement from p1 to p2
take picture(p)    takes the photo from position p
sample rock(p)     collects a rock sample from position p
recharge           fully charges the rover’s battery
wait(τ)            waits for a future date t′ = t + τ

Table 7.2: Mars rover problem, action space

Movement actions

Each movement action can result in six different outcomes:

• µ1: movement success and short duration

• µ2: movement failure and short duration

• µ3: movement success and average duration

• µ4: movement failure and average duration

• µ5: movement success and long duration

• µ6: movement failure and long duration

One has, independently of the current state, time and destination state:

\[
L(\mu_1) = 0.6, \quad L(\mu_2) = 0.05, \quad L(\mu_3) = 0.15, \quad L(\mu_4) = 0.025, \quad L(\mu_5) = 0.15, \quad L(\mu_6) = 0.025
\]

The destination state of outcome µ1 corresponds to the target position with an energy decrease corresponding to a short duration movement. The destination states of the other outcomes can be described similarly.
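As a sanity check on the outcome probabilities above, the sketch below verifies that they sum to one, computes the overall probability that a move succeeds, and draws one outcome at random. The table `MOVE_OUTCOMES` and the helper `sample_outcome` are hypothetical names introduced for illustration only.

```python
import random

# Outcome probabilities L(mu_i) of a movement action, as given in the text.
# Each entry: (move succeeded?, duration class, probability).
MOVE_OUTCOMES = {
    "mu1": (True, "short", 0.6),
    "mu2": (False, "short", 0.05),
    "mu3": (True, "average", 0.15),
    "mu4": (False, "average", 0.025),
    "mu5": (True, "long", 0.15),
    "mu6": (False, "long", 0.025),
}

# The six probabilities should form a distribution.
total = sum(prob for _, _, prob in MOVE_OUTCOMES.values())

# Marginal probability of reaching the destination, whatever the duration.
p_success = sum(prob for ok, _, prob in MOVE_OUTCOMES.values() if ok)

def sample_outcome(rng):
    """Draw one outcome name according to L(mu_i)."""
    names = list(MOVE_OUTCOMES)
    weights = [MOVE_OUTCOMES[name][2] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

print(round(total, 10), round(p_success, 10))
print(sample_outcome(random.Random(0)))
```

Under these numbers a move reaches its destination with probability 0.9 and leaves the rover in its initial position with probability 0.1.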

The duration probability density functions have been implemented in five different versions, each bringing a different complexity to the problem. These distributions are chosen so as to match the average and standard deviation of a Gaussian distribution on movement durations.

1. The first one uses piecewise polynomial probability density functions, more specifically cubic splines used to interpolate Gaussian distributions. An example of such a distribution is plotted on figure 7.8. Some additional details on the calculation of the associated splines are given with the battery charge action description.

2. The second one only uses discrete distributions.

3. The third one uses quadratic splines yielding similar distributions to the ones of the first version.

4. The fourth one uses piecewise linear functions corresponding to applying algorithm 6.6 to the first version’s distributions.

5. The fifth one uses only piecewise linear distributions, mainly “triangular” distributions.

We give an example of the first two versions on outcome µ3 of action move(1, 4) applied in position p = 1. The piecewise polynomial version, plotted on figure 7.8, is

\[
P_{\mu_3}(\tau) = \mathbb{1}_{[11,12]}(\tau) \cdot \left(-2\tau^3 + 69\tau^2 - 792\tau + 3025\right) + \mathbb{1}_{[12,13]}(\tau) \cdot \left(2\tau^3 - 75\tau^2 + 936\tau - 3887\right)
\]

and the discrete version is

\[
P_{\mu_3}(\tau) = 0.25 \cdot \delta_{11.5}(\tau) + 0.5 \cdot \delta_{12}(\tau) + 0.25 \cdot \delta_{12.5}(\tau)
\]
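The two cubic pieces above can be checked with exact rational arithmetic: they should vanish at 11 and 13, reach 1 at 12, have zero slope at each knot, and integrate to 1. The sketch below performs these checks; the helper names (`evaluate`, `derivative`, `integral`) are illustrative and unrelated to POLYTOOLS.

```python
from fractions import Fraction

# Cubic pieces of P_mu3 from the text, coefficients from highest degree down.
LEFT = [-2, 69, -792, 3025]    # valid on [11, 12]
RIGHT = [2, -75, 936, -3887]   # valid on [12, 13]

def evaluate(coeffs, x):
    """Evaluate a polynomial exactly with Horner's scheme."""
    result = Fraction(0)
    for c in coeffs:
        result = result * x + c
    return result

def derivative(coeffs):
    """Coefficients of the derivative polynomial."""
    degree = len(coeffs) - 1
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]

def integral(coeffs, a, b):
    """Exact integral of the polynomial over [a, b]."""
    n = len(coeffs)
    antiderivative = [Fraction(c, n - i) for i, c in enumerate(coeffs)] + [Fraction(0)]
    return evaluate(antiderivative, b) - evaluate(antiderivative, a)

# Values and slopes at the knots 11, 12 (from both sides) and 13.
knot_values = [evaluate(LEFT, 11), evaluate(LEFT, 12),
               evaluate(RIGHT, 12), evaluate(RIGHT, 13)]
knot_slopes = [evaluate(derivative(LEFT), 11), evaluate(derivative(LEFT), 12),
               evaluate(derivative(RIGHT), 12), evaluate(derivative(RIGHT), 13)]

# Total probability mass and mean duration of the spline density.
mass = integral(LEFT, 11, 12) + integral(RIGHT, 12, 13)
mean = integral(LEFT + [0], 11, 12) + integral(RIGHT + [0], 12, 13)

# Mean duration of the discrete version of the same outcome.
DISCRETE = {Fraction(23, 2): Fraction(1, 4), Fraction(12): Fraction(1, 2),
            Fraction(25, 2): Fraction(1, 4)}
discrete_mean = sum(t * p for t, p in DISCRETE.items())

print(knot_values, knot_slopes, mass, mean, discrete_mean)
```

Both versions have a mean duration of 12, and the spline is a proper density with total mass 1.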

No reward is associated with movement actions.

There is an important caveat to mention here. POLYTOOLS is a rather complex set of operations trying to combine knowledge about formal calculus, algorithmic efficiency and numerical stability. For example, the sequence of polynomials built for Sturm’s method (see appendix A for details) implies performing an exact Euclidean division of polynomials, which is feasible in theory and easy to implement but can cause a lot of numerical instability for ill-conditioned polynomials (footnote 2). There are many examples of such technical difficulties, which are completely unrelated to the planning problem but constitute a major obstacle to testing the TMDPpoly planner for higher order polynomials. Because of these technical problems, only versions 2, 4 and 5 of the rover problem were actually solved using our implementation. The other versions are readily available but POLYTOOLS still needs some improvements and some fixing before they can be solved. This has another drawback: it was not possible to evaluate the trade-off between polynomial degree and number of pieces in the piecewise polynomial description because POLYTOOLS still has trouble with higher degree polynomials. However, the simple comparison between discrete density functions and piecewise linear ones already allows us to draw some conclusions regarding the complexity of the operations involved and the advantages and drawbacks of such modeling features.

Figure 7.8: Duration probability of µ3

Taking the picture

This action is only available in positions 3 to 6, when the energy resource is sufficient and if a successful photo has not already been stored in memory. It can result in two different outcomes: either the picture is good or it has to be re-shot. The probabilities of a successful picture depend on the shooting location and on the time of day. They are illustrated in figure 7.9.

In all cases, the energy decrease is 1. Similarly, the transition duration is deterministic and has duration 1.

Footnote 2: polynomials having a very small coefficient of high degree and a very large constant coefficient.

Figure 7.9: Probability of successful photo, L(µsuccess | s, t, take picture) (curves for p = 3, p = 4 and 5, and p = 6)

Finally, the reward for taking a good photo depends on the outcome’s end date and on the shooting site:

\[
r_{t'}(\mu_{success}, p, t) =
\begin{cases}
4 & \text{if } p = 3 \\
5 & \text{if } p = 4 \\
4 & \text{if } p = 5 \\
7 & \text{if } p = 6
\end{cases}
\]

Collecting the samples

Similarly to the picture action, this action is only available in positions 2 and 5, if the energy level is high enough and if the sample corresponding to the current position has not been collected yet. This action can result in a success or a failure outcome, failure meaning that the rover did not manage to grab the right sample and store it. The probability of successfully collecting the sample is 0.7, regardless of the sampling site, the current state or the time of day.

Sampling duration can vary according to several possibilities in the grabbing scenario. It results in the following duration distributions:

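\[
P_{\mu_{success}}(\tau) = 0.2 \cdot \delta_{3}(\tau) + 0.6 \cdot \delta_{4}(\tau) + 0.2 \cdot \delta_{5}(\tau)
\]

\[
P_{\mu_{failure}}(\tau) = 0.5 \cdot \delta_{2}(\tau) + 0.5 \cdot \delta_{3}(\tau)
\]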
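As a worked example, the expected duration of a single sampling attempt follows directly from the two distributions above and the success probability of 0.7. The expectation notation below is introduced here for illustration and does not appear in the original text.

```latex
\begin{align*}
\mathbb{E}[\tau \mid \mu_{success}] &= 0.2 \cdot 3 + 0.6 \cdot 4 + 0.2 \cdot 5 = 4 \\
\mathbb{E}[\tau \mid \mu_{failure}] &= 0.5 \cdot 2 + 0.5 \cdot 3 = 2.5 \\
\mathbb{E}[\tau] &= 0.7 \cdot 4 + 0.3 \cdot 2.5 = 3.55
\end{align*}
```

A failed attempt is therefore shorter on average than a successful one, and one attempt takes 3.55 time units on average.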

The reward for collecting sample 1 is 5, and the reward for sample 2 is 3.

Charging the batteries

Charging the batteries is an all-or-nothing action which performs a full battery charge, regardless of the initial energy level. However, the recharge duration depends on this initial level and on the lighting (directly linked with the time of day). There are two recharging speeds corresponding to two different outcomes: µ1 corresponds to slow charging and µ2 to fast charging. If the recharge action is undertaken between time 30 and 65, the µ2 outcome

The final discrete state of a recharge action corresponds to setting e to its maximum value. The average durations of outcomes µ1 and µ2 are given by the following equations:

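\[
dur(\mu_1) =
\begin{cases}
1 & \text{if } (e_{max} - e)/2.9 < 1 \\
(e_{max} - e)/2.9 & \text{else}
\end{cases}
\]

\[
dur(\mu_2) =
\begin{cases}
1 & \text{if } (e_{max} - e)/4.7 < 1 \\
(e_{max} - e)/4.7 & \text{else}
\end{cases}
\]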

And we use a “deviation” parameter w:

\[
w(\mu_1) = 1
\]

\[
w(\mu_2) =
\begin{cases}
1 & \text{if } dur(\mu_2) \leq 8 \\
2 & \text{else}
\end{cases}
\]
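The average durations and deviation parameters above translate into two short functions. This is a minimal sketch: `E_MAX = 40` is taken from the energy domain of table 7.1, while the names `CHARGE_RATE`, `dur` and `w` as Python functions are assumptions made for illustration.

```python
E_MAX = 40  # maximum energy level, from the domain of e in table 7.1

# Divisors of the dur(mu) equations: 2.9 for slow (mu1), 4.7 for fast (mu2).
CHARGE_RATE = {"mu1": 2.9, "mu2": 4.7}

def dur(outcome, e):
    """Average recharge duration from energy level e, never below 1."""
    return max(1.0, (E_MAX - e) / CHARGE_RATE[outcome])

def w(outcome, e):
    """Deviation parameter: 1 for mu1; for mu2, 1 if dur <= 8 and 2 otherwise."""
    if outcome == "mu1":
        return 1
    return 1 if dur("mu2", e) <= 8 else 2

# Energy levels for which fast charging uses the wider deviation w = 2.
wide_levels = [e for e in range(E_MAX + 1) if w("mu2", e) == 2]

print(dur("mu1", 0), dur("mu2", 0), wide_levels)
```

With these definitions, a recharge from an almost full battery still takes one time unit, and the wider deviation for µ2 only applies when the battery is nearly empty (e ≤ 2).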

Similarly to the case of movement actions, we implemented two versions of the recharge action: the first one uses piecewise polynomial distributions, the second one uses discrete distributions. In the first case, the duration distribution function is, similarly to figure 7.8, the cubic spline interpolation going through the points (dur(µ) − w(µ), 0), (dur(µ), 1/w(µ)), (dur(µ) + w(µ), 0) with slope zero at each interval’s end (footnote 3).

In the discrete distributions case, the distribution was given as:

\[
P_{\mu}(\tau) = 0.25 \cdot \delta_{dur(\mu) - w(\mu)}(\tau) + 0.5 \cdot \delta_{dur(\mu)}(\tau) + 0.25 \cdot \delta_{dur(\mu) + w(\mu)}(\tau)
\]
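Both versions of the recharge duration distribution can be sketched from an average duration d = dur(µ) and a deviation w = w(µ). The cubic spline through (d − w, 0), (d, 1/w), (d + w, 0) with zero slope at the knots is the standard cubic Hermite interpolant, which is symmetric around d. The function names below are hypothetical, and the closed form used is an assumption derived from the interpolation conditions stated in the text, not code from the original implementation.

```python
def spline_density(d, w):
    """Density of the cubic spline through (d - w, 0), (d, 1/w), (d + w, 0)
    with zero slope at each knot."""
    def density(tau):
        u = abs(tau - d) / w  # normalized distance to the peak, in [0, 1]
        if u >= 1.0:
            return 0.0
        # Cubic Hermite step going from 1 at u = 0 down to 0 at u = 1.
        return (1.0 - 3.0 * u**2 + 2.0 * u**3) / w
    return density

def discrete_distribution(d, w):
    """Discrete version: three Dirac masses at d - w, d and d + w."""
    return {d - w: 0.25, d: 0.5, d + w: 0.25}

def midpoint_mass(density, lo, hi, steps=2000):
    """Numerical integral of the density over [lo, hi] (midpoint rule)."""
    h = (hi - lo) / steps
    return h * sum(density(lo + (k + 0.5) * h) for k in range(steps))

# With d = 12 and w = 1 this matches the movement example P_mu3 given earlier.
density = spline_density(12.0, 1.0)
mass = midpoint_mass(density, 11.0, 13.0)
left_piece = -2 * 11.25**3 + 69 * 11.25**2 - 792 * 11.25 + 3025

print(mass, density(11.25), left_piece)
```

For d = 12 and w = 1 the sketch reproduces the two cubic pieces of the movement example, which suggests the same construction was used for both actions.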

There is no reward associated with the recharge action.