4.3 Speeding Up Planning

4.3.2 Caching, Pruning and an Any-time Algorithm

Caching of Intermediate Results

To improve the interpreter’s performance, one has to analyze where it spends most of its computation time. In decision-theoretic solving of MDPs, most of the time goes into exploring the state space and calculating the values of all visited states. Calculating the value of a state means applying the reward function to the state to obtain its reward and multiplying this reward by the probability of reaching the state. The computationally most expensive part is regressing the fluents to the initial state. This regression is necessary to compute the fluents’ current values and thus the reward of the current state.

Here, the idea of caching comes into play. Caching means saving intermediate results and reusing them to speed up the computation. It does so without losing expressiveness or accuracy in the numerical results.

Before we describe the implemented method, we again point out the subtle difference between the notions of state and situation. A situation is defined by the history of actions; a state is defined by the values of all fluents in it. Different situations can therefore describe the same state. For example, going right and then up leads to the same state as going up and then right, although the action histories differ.

The method we implemented saves the complete transition table for each state. The transition table is a way to represent the function T(s, a, s′): performing action a in state s leads to state s′. We extend the function by associating a value v_T(s, a, s′) with it. Every time we have computed a value for a transition, we store it along with that transition. The states are saved as the values of each fluent of s and s′. Keep in mind that each state is completely described by the values assigned to all fluents in the current situation. If the same transition T(s, a, s′) is expanded again at a later point in time, the value is already known; it is read from the cache and reused without repeating the calculation.

Why is this reasonable? Because of the Markov assumption, the value of a state depends only on the previous state and the previous action. The transition relation therefore carries everything the value depends on, and this is the well-grounded reason which allows us to cache the transition together with its value.
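The caching scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the interpreter's actual implementation: the fluents `x` and `y`, the helper `apply_action`, the class `TransitionCache` and the stand-in `expensive_value` are all hypothetical names introduced here. The sketch keys the cache on the fluent values of s and s′ (not on the action history), so two different situations that describe the same state share one cache entry.

```python
from typing import Callable, Dict, Tuple

# A state is the tuple of all (fluent, value) pairs; this makes it hashable.
State = Tuple[Tuple[str, int], ...]
Key = Tuple[State, str, State]

# Hypothetical grid-world actions acting on the assumed fluents "x" and "y".
MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}


def apply_action(state: State, action: str) -> State:
    """Return the successor state (deterministic here, for brevity)."""
    fluents = dict(state)
    dx, dy = MOVES[action]
    fluents["x"] += dx
    fluents["y"] += dy
    return tuple(sorted(fluents.items()))


class TransitionCache:
    """Stores one value per transition (s, a, s') and reuses it on later lookups."""

    def __init__(self, compute_value: Callable[[State, str, State], float]):
        self._compute = compute_value
        self._table: Dict[Key, float] = {}
        self.hits = 0
        self.misses = 0

    def value(self, s: State, a: str, s_next: State) -> float:
        key = (s, a, s_next)
        if key in self._table:
            self.hits += 1
        else:
            self.misses += 1
            self._table[key] = self._compute(s, a, s_next)
        return self._table[key]


def expensive_value(s: State, a: str, s_next: State) -> float:
    """Stand-in for regression plus reward evaluation: reward 1.0 at (1, 2)."""
    return 1.0 if dict(s_next) == {"x": 1, "y": 2} else 0.0


start: State = (("x", 0), ("y", 0))
# Two different situations (action histories) that describe the same state.
via_right_up = apply_action(apply_action(start, "right"), "up")
via_up_right = apply_action(apply_action(start, "up"), "right")

cache = TransitionCache(expensive_value)
first = cache.value(via_right_up, "up", apply_action(via_right_up, "up"))
second = cache.value(via_up_right, "up", apply_action(via_up_right, "up"))
```

The second lookup is answered from the table although it was reached through a different action history. The price is that every key stores the full fluent assignment of both states.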

The advantage of the caching idea is intuitive. In simple, discrete domains like the maze world, execution is much faster because the calculated values are often reused. The program trades higher memory consumption for shorter execution time. The resulting policy is equivalent to the one generated using the original approach.

The disadvantages are the following. Instead of recalculating the value of a state each time, the whole state description and its associated values are saved. This results in high memory usage because each fluent has to be saved with its value. In the domain of UNREAL TOURNAMENT 2004 (see Chapter 5.1), for example, this means saving about a hundred fluents, of which only a small subset changes from state to state. Worse, in continuous worlds the caching method generally fails completely: only a small finite subset of the state space is visited, and within this set caching succeeds only rarely.12 In Chapter 7 we report on a qualitative abstraction for the soccer domain. With this state space abstraction, caching also becomes available in the continuous domain of soccer.13

12 This depends, of course, on the modeling of the domain, but in general caching of arbitrary states fails in continuous domains.

Heuristic Pruning

Another way to reduce the computation time of the DT planning algorithm, especially in dynamic real-time domains, is to apply heuristics and pruning. Pruning removes branches from the computation; heuristics guide the search through the search space by simple formulas or hints on where the solution is expected. In general, a heuristic is not always correct in its decision, although good and efficient heuristics are known for some problems. Here, we describe a method of pruning the decision-theoretic search tree by applying a simple heuristic. We can no longer guarantee an optimal solution, but this is reasonable given the savings in computational complexity.

The main observation behind this approach is that expanding a branch with very low probability takes the same computational effort as expanding a branch with high probability. The underlying idea for saving computation time is simple: because low-probability branches seldom occur during real execution, we do not consider them during policy generation. It seems more reasonable to generate, in a more time-efficient fashion, a policy which does not handle all improbable cases. If one of these improbable events does occur on a rare occasion, it seems more promising to generate a new policy efficiently than to always generate the complete and correct policy, which is in general more expensive. Therefore we introduce a small bound p_min, which represents the minimal probability a branch must have to be reasoned with.

In the grid world we have to find an adequate p_min. For example, p_min = p_fail^2 corresponds to ignoring all branches in which actions fail two or more times. In a complete stochastic decision tree for the grid world, the failure cases of actions have a huge impact on the size of the tree. Recall that each action has one successful outcome and three outcomes in which the action fails.

In a small MDP induced by the program

solve( depth, forever do (up | down | left | right) endforever)

we generate all 16 outcomes at Depth = 1. At Depth = 2 the first branches are pruned and only 112 outcomes are generated.14 In the rare case where two actions fail while executing the computed policy, a re-planning step takes place and a new policy is generated.
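The outcome counts quoted above can be reproduced with a short counting sketch. It rests on assumptions that are not taken from the text: the outcome probabilities `P_SUCC` and `P_FAIL` are illustrative values only, and a branch is treated as pruned when its probability is less than or equal to p_min, which is the reading under which a branch with exactly two failures is dropped for p_min = p_fail^2. Exact fractions are used so that the comparison at the boundary is not disturbed by floating-point rounding.

```python
from fractions import Fraction

# Illustrative assumptions: one successful outcome and three failing outcomes
# per action, with probabilities summing to 1. The values are hypothetical.
P_SUCC = Fraction(7, 10)
P_FAIL = Fraction(1, 10)
OUTCOMES = [P_SUCC, P_FAIL, P_FAIL, P_FAIL]
N_ACTIONS = 4  # up, down, left, right


def count_outcomes(depth: int, p_min: Fraction, prob: Fraction = Fraction(1)) -> int:
    """Count the outcomes generated at the given depth of the decision tree.

    A branch is skipped (pruned) as soon as its accumulated probability
    is less than or equal to p_min.
    """
    if depth == 0:
        return 1
    total = 0
    for _ in range(N_ACTIONS):
        for p in OUTCOMES:
            branch = prob * p
            if branch <= p_min:
                continue  # pruned: neither counted nor expanded
            total += count_outcomes(depth - 1, p_min, branch)
    return total


p_min = P_FAIL ** 2
pruned = [count_outcomes(d, p_min) for d in (1, 2, 3)]
unpruned = [count_outcomes(d, Fraction(0)) for d in (1, 2, 3)]
```

Under these assumptions `pruned` is `[16, 112, 640]` and `unpruned` is `[16, 256, 4096]`. The counts depend only on how many failures a branch may contain, so any probabilities for which a single failure followed by successes stays above p_min give the same numbers.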

To visualize the increase in performance and the resulting consequences for the success of the computed policy, see Figure 4.9(a) and Figure 4.9(b). The scenario is based on the task for the agent to move four squares in the Maze66 domain. The different lines depict different values of the minimal probability p_min to prune with. p_min = 0.0 represents the normal policy generation without pruning. In contrast, p_min = 0.03 does not tolerate even one failing action. The generated policy assumes that no failing case occurs, and therefore only those states

13 The savings in computation time are around 1/3 in the maze domain.

14 On level one of the tree, four actions succeed and twelve fail. Each action that failed at depth one is followed at depth two by a succeeding action, which is not pruned; these are 12 · 4 outcomes. For each action that succeeded at depth one, all outcomes at depth two are more probable than p_min; these are 4 · 16 outcomes. The complete number is thus 12 · 4 + 4 · 16 = 112.

[Plot (a): Time [s] on a logarithmic axis from 0.01 to 1000, over Horizon from 0 to 12, with one curve each for p_min = 0.03, 0.005, 0.0004, 0.00001 and 0.0.]

Maze66 example: Time usage to generate a policy up to a specific horizon.

(a) Time usage for computing policies

[Plot (b): Probability from 0 to 1, over Horizon from 0 to 12, with one curve each for p_min = 0.03, 0.005, 0.0004, 0.00001 and 0.0.]

Maze66 example: Probability of success of generated policy for a specific horizon.

(b) Success probabilities resulting from pruning

Figure 4.9: This figure depicts the probabilities of the resulting policies with a given horizon. The task was to search for a policy which has to use at least four actions to be able to finish in the goal state. Lines which end are no longer computable with 1 GB of memory.

are handled which lie on the optimal path. One can see that the generation time is fast for small horizons, but the probability of success is small compared to all other test cases. With p_min = 0.005, the probability of reaching the target increases, as does the computation time to generate the policy. Figure 4.9(a) shows how much time it takes to compute a policy for a specific horizon and specific values of p_min. Note that the y-axis is scaled logarithmically. Figure 4.9(b) shows the success probability of the generated policy. The success probability is at the same time the value of the policy, because a reward of 1.0 was given only in the target state; all other states were rewarded with 0.0, and no action costs were associated.

Nevertheless, this approach has two drawbacks. The first, mentioned previously, is the loss of optimality of the MDP’s solution: because we prune branches of the tree, we lose part of the influence of the reward function on the value function. The other drawback lies in the bound p_min. As the horizon grows, even actions with a relatively high probability are in danger of being pruned, because executing them in sequence may result in probabilities smaller than p_min. The designer has to take care of the relation between the maximum planning horizon and the minimal probability at which to prune. Actions which succeed only with low probability are likewise in danger of being pruned.
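The relation between the planning horizon and p_min can be made concrete with a small calculation. If every step of a sequence has probability p, the sequence of n steps has probability p^n and survives pruning only while p^n > p_min. The following sketch computes the largest such n; the function name `max_unpruned_horizon` and the example probabilities are hypothetical and chosen purely for illustration.

```python
def max_unpruned_horizon(p_step: float, p_min: float) -> int:
    """Return the largest n with p_step**n > p_min.

    n is the number of consecutive steps of probability p_step that can be
    chained before the branch probability drops to p_min or below.
    """
    if not 0.0 < p_step < 1.0:
        raise ValueError("p_step must lie strictly between 0 and 1")
    if not 0.0 < p_min < 1.0:
        raise ValueError("p_min must lie strictly between 0 and 1")
    n = 0
    prob = 1.0
    while prob * p_step > p_min:
        prob *= p_step
        n += 1
    return n


# Illustrative values only: even a fairly likely outcome (0.7 per step)
# survives just a handful of steps when p_min = 0.03.
likely = max_unpruned_horizon(0.7, 0.03)
coin_flip = max_unpruned_horizon(0.5, 0.03)
unlikely = max_unpruned_horizon(0.02, 0.03)
```

With these example numbers a per-step probability of 0.7 allows nine steps, 0.5 allows five, and an outcome with probability 0.02 is pruned immediately. A designer could use such a check to see whether a chosen p_min is compatible with the intended maximum horizon.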

The benefits lie in the savings of computational effort. How much computation time is actually saved depends on the modeling of the domain on the one hand and on the choice of p_min on the other. In the example shown above, where p_min corresponds to ignoring branches which contain two or more failing actions, there is no gain at depth one. At depth two, only 112 of the 256 outcomes (about 44%) are generated. At depth three, only 640 of the original 4096 outcomes (about 16%) have to be generated and checked. If one is able to find a suitable p_min for the given domain, this seems to be a reasonable approach. It saves large amounts of computation time because it ignores improbable branches by pruning them. Nevertheless, the exponential growth in the size of the tree still remains.

Any-Time Algorithm

Real-time decision making is indispensable in dynamic real-time domains. Therefore we investigated possibilities to extend READYLOG to an any-time algorithm. Instead of a horizon, it takes as argument a maximum run-time up to which the algorithm may search for a solution. Afterwards, the results and the best policy found so far are returned.

We adopt the idea of any-time algorithms from Boddy (1991), who describes methods for planning problems which depend on time. Any-time algorithms are described there as algorithms which always present a solution when interrupted; the more time is invested in the algorithm, the better the resulting solution. In the DT planning context this means that the best solution found so far is returned.
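The any-time contract described above (a run-time budget instead of a horizon, with the best result so far returned) can be illustrated by a generic wrapper. The sketch below is an assumption made for illustration and not the READYLOG implementation: it repeatedly calls a planner with increasing horizons and stops when the budget is used up. The names `anytime_plan` and `plan_to_horizon` are hypothetical, and the budget is only checked between horizons, so a running planner call is not interrupted.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

Policy = TypeVar("Policy")


def anytime_plan(
    plan_to_horizon: Callable[[int], Policy],
    max_runtime_s: float,
    max_horizon: int = 50,
) -> Optional[Tuple[int, Policy]]:
    """Plan with growing horizons until the run-time budget is exhausted.

    Returns (horizon, policy) for the deepest horizon that was completed,
    or None if the budget ran out before any horizon was planned.
    """
    deadline = time.monotonic() + max_runtime_s
    best: Optional[Tuple[int, Policy]] = None
    for horizon in range(1, max_horizon + 1):
        if time.monotonic() >= deadline:
            break  # budget used up: return the best result found so far
        best = (horizon, plan_to_horizon(horizon))
    return best


# Usage with a trivial stand-in planner that just records its calls.
calls = []


def toy_planner(horizon: int) -> str:
    calls.append(horizon)
    return f"policy-{horizon}"


result = anytime_plan(toy_planner, max_runtime_s=60.0, max_horizon=3)
no_time = anytime_plan(toy_planner, max_runtime_s=0.0)
```

With the generous budget the wrapper completes all three horizons and returns the deepest result; with a budget of zero it returns `None` without calling the planner. Re-planning from scratch for each horizon repeats work, which a cache of intermediate results can partly offset.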

The search algorithm currently used is Prolog’s depth-first search, as implemented by its resolution strategy. A complete action sequence up to a given horizon is generated and saved so that its value can be compared with that of other generated action sequences. There are two possibilities to modify this search algorithm into an any-time algorithm. The first is to implement a breadth-first search (BFS). Its drawback quickly becomes apparent, however: BFS consumes much more memory than depth-first search. In general, the memory consumption of depth-first search is linear in the size of the solution, whereas that of breadth-first search is exponential in it. Even small tasks in the grid world with a horizon of five cause memory stack overflows on a machine with 1 GB of memory.