
6. Experimental Study

6.1.2. Technical Setup

⁴ This contrasts the results we originally reported (Sievers et al., 2016), where we also used the re-implementation

We use the benchmarks of the (optimal, where applicable) sequential tracks of all IPCs up to 2014, a set comprised of 1667 planning tasks distributed across 57 domains, which are publicly available.⁵ This set contains duplicates because some of the domains and tasks of IPC 2008 have been reused in IPC 2011. We still stick to this set of benchmarks because it is an established test bed for comparing planners. All planning tasks are given as PDDL files which we translate into SAS+ representations using the translator that is part of Fast Downward (Helmert, 2009).

| Attribute | Explanation | Winning value |
|---|---|---|
| Coverage | number of solved tasks | max |
| Exp Xth perc | Xth percentile of the number of expansions, rounded down to the next multiple of 1000, to reach the last f-layer, computed over all tasks where all algorithms have a value for expansions | min |
| Search time | geometric mean of the runtime of search in seconds, computed over commonly solved tasks | min |
| Total time | geometric mean of the total runtime in seconds, computed over commonly solved tasks | min |
| # constr | number of tasks for which the merge-and-shrink computation finishes | max |
| Constr time | arithmetic mean of the runtime in seconds required by the merge-and-shrink computation, computed over commonly solved tasks | min |
| Constr oom | number of tasks for which the merge-and-shrink computation exhausts the memory limit | min |
| Constr oot | number of tasks for which the merge-and-shrink computation exhausts the time limit | min |
| Perfect h | number of tasks for which the resulting heuristic is perfect | max |
| Linear tree | percentage of # constr for which the merge tree is linear | min |
| MITSS | arithmetic mean of the maximum intermediate size of transition systems, computed over commonly solved tasks | min |

Table 6.1.: Attributes of planners and their abbreviation.
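Several attributes of Table 6.1 are computed over commonly solved tasks, and the expansion attribute is a percentile rounded down to a multiple of 1000. The following minimal sketch illustrates how such values could be computed. It is not the code used for the experiments: the function names, the data layout (a dictionary mapping each algorithm to its per-task values, with `None` for a missing value) and the nearest-rank percentile definition are assumptions made for illustration.

```python
import math
from statistics import geometric_mean


def common_tasks(results):
    """Return the tasks for which every algorithm has a value.

    `results` maps algorithm name -> {task name -> value or None}.
    Assumes at least one algorithm is given.
    """
    task_sets = [
        {task for task, value in per_task.items() if value is not None}
        for per_task in results.values()
    ]
    return sorted(set.intersection(*task_sets))


def geometric_mean_over_common(results):
    """Geometric mean per algorithm, restricted to commonly solved tasks.

    Values must be positive; an empty set of common tasks raises an error.
    """
    tasks = common_tasks(results)
    return {
        algorithm: geometric_mean([per_task[task] for task in tasks])
        for algorithm, per_task in results.items()
    }


def percentile_rounded_down(values, x, step=1000):
    """Xth percentile (nearest-rank, an assumption), floored to a multiple of `step`."""
    ordered = sorted(values)
    rank = max(1, math.ceil(x / 100 * len(ordered)))
    return (ordered[rank - 1] // step) * step


# Hypothetical search times in seconds; alg1 did not solve t3.
search_time = {
    "alg1": {"t1": 2.0, "t2": 8.0, "t3": None},
    "alg2": {"t1": 4.0, "t2": 4.0, "t3": 1.0},
}
print(common_tasks(search_time))
print(geometric_mean_over_common(search_time))
print(percentile_rounded_down([1500, 2999, 12000, 700], 50))
```

Restricting the means to the common tasks keeps an algorithm that solves additional, harder tasks from being penalized for their longer runtimes.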

We use downward-lab for conducting our experiments (Seipp, Pommerening, Sievers, & Helmert, 2017). Time is limited to 30 minutes and memory to 2 GiB per task. All experiments were run on machines with Intel Xeon E5-2660 CPUs running at 2.2 GHz.⁶ To keep the impact of the number of concurrently running tasks and other factors of the compute cluster at a minimum, downward-lab randomizes the order in which tasks are started.
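The idea of attaching the resource limits to every run and starting the runs in random order can be sketched as follows. This sketch deliberately does not use the downward-lab API; the function `build_runs`, the dictionary layout and the algorithm and task names are hypothetical and only illustrate the principle.

```python
import itertools
import random

TIME_LIMIT_SECONDS = 30 * 60        # 30 minutes per task
MEMORY_LIMIT_BYTES = 2 * 1024 ** 3  # 2 GiB per task


def build_runs(algorithms, tasks, seed=None):
    """Create one run per (algorithm, task) pair and shuffle the start order."""
    runs = [
        {
            "algorithm": algorithm,
            "task": task,
            "time_limit": TIME_LIMIT_SECONDS,
            "memory_limit": MEMORY_LIMIT_BYTES,
        }
        for algorithm, task in itertools.product(algorithms, tasks)
    ]
    # Shuffling spreads cluster-related effects (e.g. the load caused by
    # concurrently running tasks) evenly over algorithms and domains.
    random.Random(seed).shuffle(runs)
    return runs


# Hypothetical algorithm and task names.
runs = build_runs(["alg1", "alg2"], ["domain-a/p01", "domain-b/p01"], seed=0)
for run in runs:
    print(run["algorithm"], run["task"])
```

Without shuffling, all runs of one algorithm or domain would tend to execute at the same time, so a temporary disturbance on the cluster could bias the results for that algorithm or domain.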

We use the shrink strategies available in Fast Downward, i.e. B, F, and G. When using the shrink strategies based on bisimulation, i.e. B and G, we allow the shrink strategy to attempt to compute a perfect bisimulation even if no shrinking is necessary (by setting the size “threshold” parameter, which triggers shrinking even if not required, to 1), and we apply label reduction before shrinking, as bisimulation profits from label-reduced transition systems. For the shrink strategy F, experiments have shown that there is no benefit in computing abstractions if no shrinking is necessary, hence we leave the threshold parameter inactive (i.e. set it to the same value as N). For F, we furthermore perform label reduction after shrinking, since this is less expensive and f-based shrinking does not profit from label-reduced transition systems. For all experiments with shrink strategies B and F, we use a size limit of N = 50000 for all transition systems at any point during the computation of merge-and-shrink. While this is an arbitrary choice and somewhat small, the same value was used in most recent papers and hence allows for a better comparison; other reasonable values, which we do not test here, go up to N = 200000 states. If using G, we follow the common practice of not limiting the size of transition systems by setting N = ∞.
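The settings described above can be summarized in a small data structure. The following is a sketch only: the class `ShrinkSetup`, its field names and the function `may_invoke_shrinking` are hypothetical and are not Fast Downward option names. In particular, the rule that the shrink strategy is invoked once the number of states exceeds the smaller of the threshold and N is an assumption made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ShrinkSetup:
    strategy: str         # "B", "F" or "G"
    size_limit: float     # N, the maximum size of any transition system
    threshold: float      # size above which the shrink strategy is invoked
    label_reduction: str  # "before_shrinking" or "after_shrinking"


N = 50_000

SETUPS = {
    # Bisimulation-based strategies: threshold 1, label reduction first.
    "B": ShrinkSetup("B", N, 1, "before_shrinking"),
    "G": ShrinkSetup("G", math.inf, 1, "before_shrinking"),
    # f-based shrinking: threshold inactive (equal to N), label reduction last.
    "F": ShrinkSetup("F", N, N, "after_shrinking"),
}


def may_invoke_shrinking(setup, num_states):
    """Assumed trigger rule: shrink if the size exceeds min(threshold, N)."""
    return num_states > min(setup.threshold, setup.size_limit)


for name, setup in SETUPS.items():
    print(name, may_invoke_shrinking(setup, 10), may_invoke_shrinking(setup, 60_000))
```

Under this assumed rule, B and G already invoke the shrink strategy on a transition system with 10 states, which gives bisimulation the chance to find a perfect abstraction, whereas F only shrinks once the size limit is exceeded.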

⁵ From the collection at https://bitbucket.org/aibasel/downward-benchmarks, Mercurial revision 663D121BEC5B, we use the “optimal strips” benchmark suite.

⁶ The compute cluster used a different OS (CentOS 7.3 instead of CentOS 6.5) compared to results reported in all of our previous work, which led to results that may differ slightly from results reported in these works, presumably due to different system libraries.

Table 6.1 lists the attributes that we use to describe aggregated results. The first column gives the abbreviations as we use them in our tables. The second column explains the attribute and, where applicable, mentions the function we use to aggregate values. When reporting results aggregated for a single domain, we either use the given aggregation function, such as the geometric mean or a specific percentile, or the sum of the number of tasks as a default. To aggregate results across all domains, we use a two-level aggregation: we first compute the aggregated result for each domain individually and then use the same aggregation scheme to compute the overall result. This gives every domain an equal weight, thus negating effects that could otherwise occur due to unequally sized domains. Finally, the third column indicates whether higher or lower values of the attribute are considered better. In all tables or blocks within tables below, we highlight the best performance for each attribute. If the tables contain several blocks, e.g. grouping results of different variants of a single merge-and-shrink strategy, then we usually aggregate and highlight values separately for each block. When reporting domain-wise results, we provide the number of tasks of a domain in parentheses after its name.
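The two-level aggregation can be illustrated with a short sketch. The function name `two_level_aggregate`, the data layout and the example values are assumptions chosen for illustration; the sketch is not the evaluation code used for the experiments.

```python
import math
from statistics import geometric_mean


def two_level_aggregate(values_by_domain, aggregate):
    """Aggregate within each domain first, then across the domain results.

    `values_by_domain` maps domain name -> list of per-task values.
    `aggregate` is applied at both levels (e.g. geometric_mean or sum).
    """
    per_domain = {
        domain: aggregate(values)
        for domain, values in values_by_domain.items()
    }
    overall = aggregate(list(per_domain.values()))
    return overall, per_domain


# Hypothetical search times in seconds for two unequally sized domains.
search_times = {
    "domain-a": [1.0, 4.0],
    "domain-b": [8.0, 8.0, 8.0],
}

overall, per_domain = two_level_aggregate(search_times, geometric_mean)
flat = geometric_mean([v for values in search_times.values() for v in values])
print(per_domain)
print(overall, flat)

# With the default sum (e.g. for coverage), both levels add up task counts.
coverage = {"domain-a": [1, 1, 0], "domain-b": [1, 1, 1]}
total_coverage, _ = two_level_aggregate(coverage, sum)
print(total_coverage)
```

In this example the per-domain geometric means are 2 and 8, so the two-level result is 4, while a single geometric mean over all five tasks is larger because the three tasks of the bigger domain dominate. For the sum, the two-level result coincides with the plain total.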