Overview of Results - Using Plan Decomposition for Continuing Plan Optimisation and Macro Generation

Figure 3.1 shows the headline result: the average plan quality achieved by BDPO2 and the other systems as the time per problem increases. The IPC quality score of a plan is calculated as c_ref/c, where c is the cost of the plan and c_ref is the cost of the best plan for the problem instance found over all runs of all systems used in our experiments. Thus, a higher score reflects a lower-cost plan. c_ref is kept constant, i.e., it is the cost of the best plan for the instance found by any system at the end of all experiments. This means that the conversion from plan cost to quality score is only a constant re-scaling (and inversion). The results in Figure 3.1 are from experiments 2 and 3, described in the previous section. The figure is the same as Figure 1.2 (on page 5), but includes results for all the compared anytime planning systems.

None of the planners starting from scratch finds a solution to all 182 problems: LAMA solves 155 problems, IBaCoP2 144, Arvand 134, AEES 87 and LPG 49. For these planners, the average quality score shown in Figure 3.1 is the average over only those problems for which they find at least one plan. (As previously mentioned, this is also the reason why the average quality sometimes falls: when a first plan, of low quality, is found for a previously unsolved problem, the average can decrease.) With this metric, a planner is not penalised for low coverage. For the plan improvement systems, there is no “coverage” difference, since we use the quality of the base plan for any problem that a system has not improved on. Like the planners starting from scratch, none of the post-processing or bounded cost search methods improves on all base plans: BDPO2 finds a plan of lower cost than the base plan for 147 problems, PNGS for 133, IBCS for 66 and Beam-Stack Search for 24. For these systems, the average quality shown in Figure 3.1 is taken over all 182 problems, using the base plan quality score for those problems that a system has not improved on.
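The two averaging rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the problem costs are hypothetical and are not taken from the experiments.

```python
from statistics import mean


def ipc_score(cost, ref_cost):
    """IPC quality score c_ref / c (1.0 = matches the best known plan)."""
    return ref_cost / cost


def average_over_solved(plan_costs, ref_costs):
    """From-scratch planner: average only over problems with at least one plan."""
    return mean(ipc_score(c, ref_costs[p]) for p, c in plan_costs.items())


def average_with_base_fallback(improved_costs, base_costs, ref_costs):
    """Plan-improvement system: average over all problems, using the base
    plan cost wherever the system has found no improvement."""
    return mean(
        ipc_score(improved_costs.get(p, base_costs[p]), ref_costs[p])
        for p in base_costs
    )


# Hypothetical costs, for illustration only.
ref = {"p1": 100, "p2": 50, "p3": 80}
base = {"p1": 200, "p2": 100, "p3": 80}

early = average_over_solved({"p1": 125}, ref)             # only p1 solved so far
later = average_over_solved({"p1": 125, "p2": 100}, ref)  # first, poor plan for p2
improver = average_with_base_fallback({"p1": 125}, base, ref)
print(early, later, improver)
```

In this toy data the from-scratch average drops from about 0.8 to about 0.65 once a first, low-quality plan for p2 is found, which is the effect noted in the parenthesis above. The improvement system is always averaged over all three problems.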

The majority of the compared systems show a trend similar to that of LAMA, i.e., improving quickly early on but then flattening out and ultimately stagnating. The reasons vary. Memory is a limiting factor for algorithms that perform exhaustive search, notably PNGS, which exhausts the 8 GB of available memory before reaching the 7-hour CPU time limit for 93.7% of problems. LAMA and AEES do the same for 67% and 50% of problems, respectively. On the other hand, planners that use limited-memory algorithms, namely Beam-Stack Search as well as LPG and Arvand (both of which use local search), never run out of memory and thus could conceivably run indefinitely. However, the rate at which they find plan quality improvements is low: from 4 to 7 hours, the average quality produced by LPG and Arvand increases by 0.0049 and 0.0094, respectively. (The latter excludes three problems that were solved by Arvand for the first time between 4 and 7 hours; including those brings the average down, making the increase less than 0.002.) Over the same time interval, the increase in average quality achieved by BDPO2, starting from the high-quality plans generated by PNGS from the base plans, is 0.0115.

[Figure 3.1 plot. x-axis: Time (hours), 0 to 7. y-axis: Average Quality Score (Relative IPC Quality Score / Coverage), 0.5 to 0.96. Legend: BDPO2 on PNGS on base plans; BDPO2 on base plans; PNGS on base plans; IBCS on base plans; BSS on base plans; LAMA from scratch; AEES from scratch; IBaCoP2 from scratch; Arvand from scratch; LPG from scratch.]

Figure 3.1: Average IPC quality score as a function of time per problem, on a set of 182 large-scale planning problems. The quality score of a plan is c_ref/c, where c is the cost of the plan and c_ref is the cost of the best plan for the problem found by any system at the end of all experiments. Hence a higher score represents better plan quality. The LAMA, AEES, LPG, Arvand and IBaCoP2 planners start from scratch, whereas the post-processing (PNGS, BDPO2) and bounded cost search (IBCS, Beam-Stack Search) methods start from a set of base plans; their curves are delayed by 1 hour, which is the maximum time allocated to generating each base plan.

We draw two main conclusions. First, BDPO2 achieves the aim of continuing quality improvement even as the time limit grows. In fact, it continues to find better plans, though at a decreasing rate, even beyond the 7-hour time limit used in this experiment. Second, the combination of PNGS and BDPO2 achieves a better result than either does alone. This is partly because they work well on different sets of problems and the figure shows an average, but that is not the only reason: BDPO2 sometimes produces a better result when started from the best plan found by PNGS even in domains where BDPO2 already outperforms PNGS when both start from the same base plans (e.g., Elevators and Transport). Conversely, in some domains (e.g., Floortile and Hiking), starting BDPO2 with a worse input plan often yields a better final plan. This phenomenon has also been observed by Xie et al. [2013].

Figure 3.2 provides a more detailed view of the above observations. It shows, for each problem, the cost of the best plan found by each system at the 7-hour total time limit, scaled to the interval between the base plan cost and the highest known lower bound (HLB) on any plan for the problem. (Lower bounds were obtained by a variety of methods, including several optimal planners; cf. [Haslum, 2012].) 18 of the 182 problems are excluded from Figure 3.2: in 3 cases, the base plan cost already matches the lower bound, so no improvement is possible; for another 15 problems, no method improves on the base plan within the stipulated time. (The Pegsol domain does not appear in the graph, because all base plans but one are optimal, and no method improves on the cost of the remaining one.)
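The scaling used for Figure 3.2 can be read as a linear normalisation between the lower bound and the base plan cost. The sketch below shows one such reading. The orientation (0 at the lower bound, 1 at the base plan cost), the function names and the example numbers are assumptions made for illustration; the text does not spell out the exact formula.

```python
def scaled_cost(cost, base_cost, lower_bound):
    """Map a plan cost onto [0, 1]: 0 = matches the highest known lower
    bound (HLB), 1 = no better than the base plan.
    Returns None when the base plan already matches the bound."""
    if base_cost == lower_bound:
        return None  # no improvement possible, so nothing to scale
    return (cost - lower_bound) / (base_cost - lower_bound)


def is_plotted(base_cost, lower_bound, best_costs):
    """A problem is shown only if improvement is possible and at least
    one system actually improved on the base plan.
    best_costs: best cost found by each system at the time limit."""
    return base_cost > lower_bound and min(best_costs) < base_cost


# Hypothetical problem: base plan costs 200, HLB is 100.
print(scaled_cost(150, 200, 100))          # halfway between bound and base plan
print(is_plotted(200, 100, [200, 150]))    # one system improved
print(is_plotted(200, 200, [200, 200]))    # base plan already matches the bound
```

Under this reading, the two exclusion rules in the text correspond to the two conditions in `is_plotted`.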

Table 3.1 provides a different summary of the information in Figure 3.2, showing for each domain and system the percentage of instances for which it found a plan with a cost (1) matching that of the best plan for the instance; (2) strictly lower than that achieved by any other method; and (3) matching the lower bound, i.e., known to be optimal. In aggregate, the combination of BDPO2 after PNGS over the base plans achieves the best result on all three measures. However, in 5 domains (GED, Hiking, Openstacks, Parking, and Tidybot), LAMA finds more plans that are strictly better than those of any other method. We have tried using LAMA as one of the subplanners in BDPO2, but this does not lead to better results overall. In some domains, such as Openstacks and GED, the smallest improvable subplan is often the whole, or almost the whole, plan, and LAMA finds an improvement of the plan only after searching for a longer time. Although BDPO2 increases the time limit given to subplanners with each retry, the average time limit, across all local optimisation attempts in this experiment, is only 18.48 seconds. Thus, our strategy of searching for quick improvements of plan parts does not work in these domains.
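The three measures reported in Table 3.1 can be computed from per-problem costs roughly as follows. This is a minimal sketch, not the procedure used to produce the table: the function name, the data layout and the toy costs are hypothetical, and a system with no plan for a problem is treated as having infinite cost.

```python
INF = float("inf")


def summarise(costs, lower_bounds):
    """costs: {system: {problem: best plan cost}};
    lower_bounds: {problem: highest known lower bound}.
    Returns {system: (% matching best, % strictly best, % optimal)}."""
    n = len(lower_bounds)
    result = {}
    for system in costs:
        best = strict = optimal = 0
        for problem, bound in lower_bounds.items():
            mine = costs[system].get(problem, INF)
            others = min(
                (costs[o].get(problem, INF) for o in costs if o != system),
                default=INF,
            )
            if mine < INF and mine <= others:
                best += 1          # (1) matches the best plan found
            if mine < others:
                strict += 1        # (2) strictly better than all others
            if mine == bound:
                optimal += 1       # (3) matches the lower bound
        result[system] = tuple(100 * k / n for k in (best, strict, optimal))
    return result


# Hypothetical two-system, two-problem example.
costs = {"A": {"p1": 10, "p2": 20}, "B": {"p1": 10, "p2": 25}}
bounds = {"p1": 10, "p2": 15}
print(summarise(costs, bounds))
```

To obtain per-domain figures as in the table, the same function would be applied to the problems of each domain separately.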
