4.5 Experiments and Evaluation

This section consists of two parts. In the first part, we briefly highlight the differences between our algorithms and the tools T(o)rmc, Faster, and Lever from a user’s perspective; that is, we discuss differences in the input formats as well as what knowledge the user needs. In the second part, we assess the performance of our algorithms (in terms of runtime) based on a prototype implementation and compare this implementation to T(o)rmc and Faster.

Differences to Existing Tools

We already discussed the tools T(o)rmc, Lever, and Faster in the section about related work. Here we highlight their differences to the techniques developed in this chapter.

Differences to T(o)rmc T(o)rmc [Leg] implements a white-box algorithm that iterates the given transducer on the set of initial configurations and applies extrapola- tion to approximate the limit of the iteration. The drawback of this method is that the bad configurations are not taken into account during the computation; if the result contains a bad configuration, T(o)rmc has to be restarted with additional user input, which requires expert knowledge about both T(o)rmc’s internals and the problem at hand. Another drawback of T(o)rmc is that it requires DFAs as input, whereas our algorithms also work with NFAs, which can be exponentially smaller than equivalent DFAs. Also, T(o)rmc does not search for a smallest invariant, whereas our algorithms do.

Differences to Lever. The Lever tool [VV] implements a learning-based black-box algorithm that builds upon Kearns and Vazirani’s learning algorithm. When used for Regular Model Checking (recall that Lever can also be used for the verification of liveness properties), Lever tries to learn a fixed point representing the exact set of reachable configurations. However, Lever does not learn this set directly but a set of configuration-witness pairs, which consist of a configuration augmented with “distance information”. Compared to the learning-based algorithms of this chapter, Lever’s approach has the advantage that the set of configuration-witness pairs is unique, whereas we aim for an arbitrary invariant, of which there might be many. The uniqueness of the target language makes answering membership and equivalence queries possible and permits a straightforward application of standard active learning algorithms.

In order to learn sets of configuration-witness pairs, Lever requires an encoding that translates such pairs into finite words and vice versa. However, finding a suitable encoding requires expert knowledge about both Lever and the given problem domain. Another limiting factor of Lever is that a minimal DFA accepting the set of configuration-witness pairs can be larger than a minimal IDFA because the former needs to represent more information. This, however, is a crucial aspect since the runtime of Lever (like most learning-based algorithms) depends on the size of the learned automaton.
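To illustrate why the set of reachable configurations is unique while invariants are not, the following Python sketch enumerates all invariants of a tiny explicit-state system. Everything in it is an assumption made for illustration: a token ring with three processes, configurations written as words over {T, N}, and bad configurations defined as those with two or more tokens. An invariant is taken to be a set that contains the initial configurations, is closed under the transition relation, and contains no bad configuration. Note that the sketch compares invariants as sets, whereas minimality in this chapter refers to the size of the automaton representing an invariant.

```python
from itertools import combinations, product

N = 3  # hypothetical ring size, chosen only to keep the enumeration small


def successors(config):
    """A process holding the token 'T' passes it to its right neighbour (cyclically)."""
    result = set()
    for i, symbol in enumerate(config):
        j = (i + 1) % len(config)
        if symbol == "T" and config[j] == "N":
            cells = list(config)
            cells[i], cells[j] = "N", "T"
            result.add("".join(cells))
    return result


def is_invariant(candidate, initial, bad):
    """Contains the initial set, is closed under successors, and avoids bad."""
    closed = all(successors(c) <= candidate for c in candidate)
    return initial <= candidate and closed and not (candidate & bad)


universe = ["".join(p) for p in product("TN", repeat=N)]
initial = {"T" + "N" * (N - 1)}
bad = {c for c in universe if c.count("T") >= 2}

# Brute force over all subsets of the universe, in order of increasing size.
invariants = [
    set(s)
    for r in range(len(universe) + 1)
    for s in combinations(universe, r)
    if is_invariant(set(s), initial, bad)
]
least_set = min(invariants, key=len)  # coincides with the reachable configurations

print(len(invariants), sorted(least_set))
```

In this toy system there are two invariants: the reachable configurations themselves, and the same set extended by the token-free configuration NNN. A learner that targets "some invariant" may return either one, which is why membership queries have no unique answer in our setting.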

Differences to Faster Faster [BLP] computes the exact set of reachable configurations using acceleration. This approach prevents Faster from terminating if the set of reachable configurations is not recognizable by a finite automaton. In contrast, our algorithms always find an IDFA if one exists. Moreover, Faster was originally designed for integer linear systems over Presburger formulas. That entails that one first has to translate a given Regular Model Checking instance into the Faster input

 4 Regular Model Checking

Table.: Feature summary of algorithms for Regular Model Checking. T(o)rmc Faster Lever White-box Semi-black-box Black-box Mode White-box White-box Black-box White-box Semi-black-box Black-box Input DFAs DFA and Teacher NFAs NFAs and Teacher

formulas teacher

Experience Expert Expert None None None None Target Invariant Reachable Conf.-witn. Invariant Invariant Invariant

concept config. pairs

Minimality no no no yes yes yes

format, which requires manual work as a translation is often not straightforward (if it is possible at all).

In conclusion, Table. summarizes the main features of the considered algorithms for Regular Model Checking. The rows “Mode” and “Input” are self-explanatory. The row “Experience” refers to the question of whether the user needs any (expert) knowledge either about the program at hand or the internals of the algorithm. The row “Target concept” refers to the kind of concept the respective algorithm computes. Finally, the row “Minimality” indicates whether an algorithm searches for a smallest representation of the target concept.

Experimental Results

To assess the effectiveness and performance of our algorithms, we implemented a prototype and benchmarked it against Faster and T(o)rmc. Because Lever is not publicly available, a comparison to this tool was not possible. The results of this section partly appeared in conference proceedings [Nei].

Methodology. We implemented our prototype in C++ using AMoRE++ [KMP+] as a backend for operations on automata and Libalf for learning automata. As underlying logic solvers, we used Glucoser (for solving SAT formulas) and Microsoft’s Z3 (for solving formulas with uninterpreted functions).

We considered two benchmark suites. The first benchmark suite contains integer linear systems (mostly protocols, such as the Berkeley cache coherence protocol, the Synapse cache coherence protocol, and the M.E.S.I. cache coherence protocol) and is available on the Faster and T(o)rmc websites; additionally, we added three Petri nets (trans, trans, and trans). The second benchmark suite contains instances of a 2n modulo-counter and the token ring protocol (introduced in an earlier example) over a fixed number of processes. In the second benchmark suite, we successively enlarged the input automata A_I and A_B in order to demonstrate the advantages of our semi-black-box and black-box algorithms when confronted with large input automata. Note, however, that we did not vary the size of the transducer. We comment on this decision shortly when discussing the results of our experiments.

The examples of the second benchmark suite were not natively expressible as Faster inputs. In order to avoid a biased benchmark, we decided not to run Faster on the second benchmark suite.

Compiling T(o)rmc for-bit systems did not work properly. A -bit executable partly worked but suffered from memory access violations. We experienced crashes, and it was often not possible to conduct experiments; for instance, we could not obtain any result for the modulo counter experiments because T(o)rmc crashed on all inputs. We contacted the tool’s developer, but the problem could not be resolved so far. Thus, we can report T(o)rmc’s results only for a part of the experiments.

We conducted all experiments on an Intel Q CPU at 2.83 GHz with 4 GiB of RAM running Ubuntu. LTS. We imposed a timeout limit of 300 s.

Results Tables. to . (on Pages  and ) present the results of our experiments. All runtimes in the tables are in seconds. A “—” indicates that the correspond- ing experiment either ran out of memory or did not finish within 300 s. An “x” means that the experiment crashed. The best result of each experiment is highlighted in bold font.

Table. shows the results on integer linear systems of the first benchmark suite. The white-box algorithm using Glucoser often performed best, closely followed by Faster. The only exception is the bakery protocol, which none of our algorithms could prove correct. However, the performance of all algorithms is relatively similar on this benchmark suite, and no algorithm excelled. The algorithms using Glucoser performed slightly better than those using Z3.

Tables. and . show the results on the second benchmark suite; Table . reports the results on the modulo counter experiments, and Table. reports the results on the token ring experiments.

In the case of the modulo counter experiments, the algorithms again performed similarly. The semi-black-box approach using the Angluin-style learner and Glucoser achieved the best results, closely followed by the semi-black-box approach using the CEGAR-style learner and Glucoser. Again, the algorithms using Glucoser performed slightly better than those using Z3. Unfortunately, T(o)rmc did not produce any results on these examples.

In the case of the token ring experiments, the white-box algorithm outperformed any other algorithm, regardless of the underlying logic solver. The Angluin-style learner and the Angluin-style ICE-learner performed worst and failed on all instances with |A_I| > 100. T(o)rmc succeeded in all cases, but was slowest among all successful algorithms on instances with |A_I| > 100. In contrast to the experiments above, we did not observe a difference in the performance between algorithms using Glucoser and algorithms using Z3.

Table: Results on integer linear systems of the first benchmark suite. All figures are in seconds. A “—” corresponds to a timeout after 300 s. An “x” indicates that the experiment crashed. The best result of each experiment is highlighted in bold font.

Experiment  White-box         Semi-black-box                      Black-box                           T(o)rmc  Faster
                              Angluin           CEGAR             Angluin           CEGAR
            Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3
petri net   0.01      0.05    0.12      0.15    0.11      0.11    0.70      0.96    0.07      0.23    0.02     1.13
berkeley    0.04      0.41    0.62      0.92    1.29      1.45    1.80      1.81    1.79      1.55    4.23     0.03
synapse     0.01      0.03    0.04      0.07    0.06      0.16    0.02      0.07    0.02      0.11    0.19     0.03
lift        0.01      0.14    0.01      0.01    0.01      0.02    0.12      0.13    0.12      0.12    5.54     0.15
mesi        0.45      1.78    0.58      2.64    1.55      6.24    26.42     52.13   27.93     47.48   5.52     0.04
bakery      —         —       —         —       —         —       —         —       —         —       32.18    0.04
trans       0.01      0.03    0.01      0.05    0.02      0.18    0.02      0.17    0.02      0.18    x        0.04
trans       0.01      0.03    0.01      0.06    0.02      0.05    0.03      0.14    0.02      0.16    x        0.03
trans       0.04      0.27    0.05      0.29    0.09      0.53    3.13      6.33    2.36      2.88    x        0.07

Table: Results on modulo counter experiments. All figures, except for those in the columns “|A_I|” and “|A_B|”, are in seconds. A “—” corresponds to a timeout after 300 s. The best result of each experiment is highlighted in bold font.

|A_I|  |A_B|  White-box         Semi-black-box                      Black-box
                                Angluin           CEGAR             Angluin           CEGAR
              Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3
14     125    0.24      0.46    0.29      0.41    0.75      1.03    0.17      0.37    1.59      2.73
14     156    0.29      1.33    0.58      0.99    1.75      2.09    0.34      0.65    3.33      6.87
34     187    1.29      8.29    1.13      3.52    4.04      6.48    1.17      6.12    9.11      29.30
34     218    27.49     64.29   2.49      20.42   6.45      47.84   5.95      33.13   35.16     80.14
82     249    —         —       21.27     100.48  45.23     178.59  —         177.92  —         —

Table: Results on token ring experiments. All figures, except for those in the columns “|A_I|” and “|A_B|”, are in seconds. A “—” corresponds to a timeout after 300 s. The best result of each experiment is highlighted in bold font.

|A_I|  |A_B|  White-box         Semi-black-box                      Black-box                           T(o)rmc
                                Angluin           CEGAR             Angluin           CEGAR
              Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3      Glucoser  Z3
10     3      0.01      0.01    0.02      0.07    0.02      0.07    0.04      0.04    0.01      0.10    0.02
25     3      0.03      0.02    0.12      0.21    0.03      0.10    0.18      0.18    0.02      0.10    0.06
50     3      0.02      0.02    1.23      1.52    0.07      0.14    1.22      1.45    0.05      0.14    0.31
100    3      0.04      0.02    21.60     23.39   0.31      0.47    20.89     22.58   0.33      0.48    2.08
200    3      0.04      0.04    —         —       2.38      1.84    —         —       2.15      2.50    16.13
300    3      0.04      0.05    —         —       7.10      5.82    —         —       8.53      8.70    55.13
400    3      0.04      0.07    —         —       18.75     15.55   —         —       18.66     20.70   137.47
500    3      0.05      0.09    —         —       31.26     27.97   —         —       38.26     38.54   290.41

Discussion. Considering the results of our experiments, we make two key observations. First, the results of the first benchmark suite show that all of our algorithms can handle problem instances specified for T(o)rmc and Faster with competitive runtimes. Second, we observe that there is no superior algorithm. The white-box algorithm often performs best, but the modulo counter examples show that learning-based algorithms can be advantageous in situations where the input automata are large (cf. the modulo counter results). Moreover, the second benchmark suite contains examples on which the Angluin-based algorithms outperformed the CEGAR-based algorithms (cf. the modulo counter results) and vice versa (cf. the token ring results).

For the benchmarks at hand, the algorithms using the Glucoser SAT solver were in most cases slightly faster than the ones using the Z3 SMT solver; the token ring experiments, where we observed no such difference, are the exception. Note, however, that this might be different for larger instances as the size of the generated SAT formulas grows faster than the size of the SMT formulas. A further observation is that Z3 seemed to incur an initialization overhead every time it was invoked, which constituted a large share of the overall runtime on small instances.

Finally, let us comment on our decision to only consider experiments in which we varied A_I and A_B but not the transducer. We observed that the black-box approach spent a large share of its time on checking whether a conjecture is inductive, and it turned out that AMoRE++ is not well-suited for this task. Since this is a problem of AMoRE++ but not of our black-box algorithm (an ICE-teacher completely abstracts from an actual implementation), benchmarking the present prototype on experiments with varying transducers makes it very hard to draw any conclusions on the performance of our black-box algorithms. However, we expect more meaningful results from using a different automata library.
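For readers unfamiliar with the inductiveness check mentioned above, the following Python sketch shows one standard way to implement it; it is not the AMoRE++-based code of our prototype. It assumes that a conjecture is a complete DFA, that the transducer is length-preserving and given as a list of transitions (source, input symbol, output symbol, target), and that "inductive" means that every transducer image of an accepted word is again accepted. The dictionary layout and the token-passing example are hypothetical. The check explores the product of the DFA reading the input, the transducer, and the DFA reading the output, and searches for a reachable violation.

```python
from collections import deque


def is_inductive(dfa, transducer):
    """Return True iff every transducer image of an accepted word is accepted."""
    start = (dfa["initial"], transducer["initial"], dfa["initial"])
    seen = {start}
    queue = deque([start])
    while queue:
        q_in, p, q_out = queue.popleft()
        # Violation: input accepted, pair accepted by the transducer, output rejected.
        if q_in in dfa["final"] and p in transducer["final"] and q_out not in dfa["final"]:
            return False
        for src, a, b, dst in transducer["transitions"]:
            if src != p:
                continue
            nxt = (dfa["delta"][(q_in, a)], dst, dfa["delta"][(q_out, b)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical transducer: copy a prefix, move one token 'T' one cell to the right, copy the rest.
token_step = {
    "initial": 0,
    "final": {2},
    "transitions": [
        (0, "T", "T", 0), (0, "N", "N", 0),
        (0, "T", "N", 1), (1, "N", "T", 2),
        (2, "T", "T", 2), (2, "N", "N", 2),
    ],
}

# Conjecture 1: words with exactly one token (state 2 is a rejecting sink).
one_token = {
    "initial": 0,
    "final": {1},
    "delta": {(0, "N"): 0, (0, "T"): 1, (1, "N"): 1, (1, "T"): 2, (2, "N"): 2, (2, "T"): 2},
}

# Conjecture 2: the token sits in the first cell, i.e. words of the form T N*.
token_first = {
    "initial": 0,
    "final": {1},
    "delta": {(0, "T"): 1, (0, "N"): 2, (1, "N"): 1, (1, "T"): 2, (2, "N"): 2, (2, "T"): 2},
}

print(is_inductive(one_token, token_step), is_inductive(token_first, token_step))
```

The number of product states is bounded by the square of the conjecture's size times the number of transducer states, which gives an intuition for why this check can dominate the runtime once conjectures or transducers grow.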

In document Applications of automata learning in verification and synthesis (Page 138-144)