Our discussion is divided into a quantitative analysis of our results in Table 5.2 and a qualitative examination of the sorts of bugs we found. We start with the quantitative analysis.
5.6.1 Quantitative Analysis
Unsatisfiable Formula Generation
As shown, the generators of guaranteed unsatisfiable formulas uniformly found no bugs. The primary cause is that unsatisfiable formula generation is over 6× slower than satisfiable generation, owing to the extra constraints it involves (described in Section 5.3). This slow generation rate means far fewer tests are generated, so there are fewer opportunities to hit a bug. Looking at the sort of programs produced, it appears that the added timeouts (discussed in Section 5.4.1) had the unintended consequence of selecting for low-complexity inputs that could be generated within the timeout. This biased the state space towards programs that were more easily handled, yielding test inputs that were relatively easy for solvers to get correct.
Guaranteed Satisfiable Formula Generation Often Finds Correctness Bugs
Our technique of generating guaranteed satisfiable formulas finds a number of correctness bugs, and nearly all the bugs it finds are correctness bugs. This shows that it is biased towards testing the semantics of the system under test, as expected.
Syntactic Generators Find More Bugs Overall
Generally, the syntactic test case generators found more bugs than the generators based on formulas with guaranteed results. The underlying reason is that the generators with guaranteed results fundamentally explore narrower state spaces than the traditional syntactic generators do, so intuitively there are fewer bugs available to find. This is further substantiated by the sort of bugs found: the syntactic approaches often find non-correctness bugs like memory leaks (e.g., [159, 161, 162, 163]). Such bugs have no direct connection to the semantics of SMT-LIB, so the semantics-based fuzzers are somewhat blind to them.
Of particular interest is that this result seems to contradict that of swarm testing [105], which found that focusing on different subsets of a language increases the number of bugs found. We believe this is because SMT solvers constitute a very different domain from those studied in the swarm testing work, one where correctness is absolutely critical. As such, it is somewhat expected that core correctness bugs would be less prevalent, as we observe. Additionally, the effectiveness of swarm testing has only been evaluated in the context of specializing syntax, as opposed to specializing semantics as in our approach, so the swarm testing result may be limited to syntax.
5.6.2 Qualitative Examination of Bugs Found
In this subsection, we go through general patterns we found among the bugs discovered, beyond the simple bug counts shown in Table 5.2.
For one, not only were all solvers found to be buggy, but all had at least one correctness bug. This was true both for the young theory of floating point and for the well-established theory of bitvectors. The fact that Z3’s implementation of the theory of bitvectors was found to be buggy is particularly surprising, given both its popularity and the fact that it has been fuzzed for more than a year [48] using the fuzzer developed in Brummayer et al. 2009 [17]. For Z3, this seems to be due in part to our particularly aggressive testing approach, wherein we strictly required solvers to return satisfiable or unsatisfiable, as opposed to unknown [48]; this was significant in one case [154].
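The strict oracle described above can be sketched as follows. This is only an illustration: the names `Verdict` and `judge` are hypothetical, and the sketch assumes the solver's verdict is the first line of its standard output, which is not something the text establishes about our actual harness.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WRONG_ANSWER = "wrong answer"          # sat reported for unsat, or vice versa
    UNKNOWN_REJECTED = "unknown rejected"  # strict policy: unknown is not accepted
    OTHER_FAILURE = "other failure"        # crash, empty output, error message, ...


def judge(expected: str, solver_output: str) -> Verdict:
    """Compare a solver's answer against the known result of a generated formula.

    `expected` is "sat" or "unsat". The assumption that the verdict is the
    first output line is hypothetical and made only for this sketch.
    """
    lines = solver_output.strip().splitlines()
    answer = lines[0].strip() if lines else ""
    if answer == expected:
        return Verdict.PASS
    if answer in ("sat", "unsat"):
        return Verdict.WRONG_ANSWER
    if answer == "unknown":
        return Verdict.UNKNOWN_REJECTED
    return Verdict.OTHER_FAILURE
```

A more lenient oracle would fold `UNKNOWN_REJECTED` into `PASS`; keeping it separate is what makes the policy strict.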
A common theme for implementations of the theory of bitvectors was that division and remainder by zero were problematic. SMT-LIB has a well-defined semantics for these cases, though implementing it correctly is challenging. Additionally, a correct implementation tends to carry negative performance consequences. As a result, all solvers tested are intentionally unsound in this case in their default configurations. Soundness can usually be guaranteed with additional options, though Boolector lacked such an option at the time this case study was performed [166]. We believe that these options are rarely enabled, and so we have uncovered a largely untested corner of these solvers.
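To make the division and remainder cases concrete, the following is a minimal sketch of the unsigned operators. It assumes the definitions in the SMT-LIB fixed-size bitvector theory as we understand them: `bvudiv` by zero yields the all-ones vector, and `bvurem` by zero yields the dividend. The Python function names mirror the SMT-LIB operator names, and the modeling of bitvectors as masked integers is a choice made for this sketch only.

```python
def bvudiv(s: int, t: int, width: int) -> int:
    """Unsigned bitvector division on `width`-bit values.

    Assumed SMT-LIB semantics: division by zero returns all ones.
    """
    mask = (1 << width) - 1
    s &= mask
    t &= mask
    if t == 0:
        return mask  # all-ones bitvector
    return s // t


def bvurem(s: int, t: int, width: int) -> int:
    """Unsigned bitvector remainder on `width`-bit values.

    Assumed SMT-LIB semantics: remainder by zero returns the dividend.
    """
    mask = (1 << width) - 1
    s &= mask
    t &= mask
    if t == 0:
        return s  # dividend is returned unchanged
    return s % t
```

A solver that treats the zero-divisor case as unconstrained can report satisfiable for a formula asserting, say, that 5 divided by 0 is 0 on 8-bit values, whereas the semantics sketched here make that formula unsatisfiable.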
A related, but more severe, problem exists in the theory of floating point, involving the semantics of min/max on positive and negative zero. In this case, the standard [132] is not entirely clear about what the correct behavior is, and resolving the cited bug [153] required clarification from the standards committee. Much like with the theory of bitvectors, MathSAT5 needs an additional option for soundness in this case, which we were initially unaware of. This ultimately led to inputs which had identical, but incorrect, behavior under both Z3 and MathSAT5.
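The signed-zero corner case can be illustrated with a small checker. This sketch assumes a reading under which either zero is an acceptable result when the two arguments are zeros of opposite sign; that reading is an assumption of the sketch, not something established by the text above. All function names here are hypothetical.

```python
import math


def same_float(a: float, b: float) -> bool:
    """Equality that distinguishes +0.0 from -0.0 and treats NaN as equal to NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)


def acceptable_min_results(x: float, y: float) -> list[float]:
    """Results a checker would accept for min(x, y) under the assumed reading."""
    if math.isnan(x):
        return [y]
    if math.isnan(y):
        return [x]
    opposite_zeros = (
        x == 0.0 and y == 0.0
        and math.copysign(1.0, x) != math.copysign(1.0, y)
    )
    if opposite_zeros:
        return [x, y]  # assumption: either zero may be returned
    return [x if x < y else y]


def min_result_ok(x: float, y: float, observed: float) -> bool:
    """True if `observed` is among the acceptable results for min(x, y)."""
    return any(same_float(observed, r) for r in acceptable_min_results(x, y))
```

Note that ordinary `==` comparison cannot serve as the check, since `0.0 == -0.0` holds in Python; this is why the sketch compares signs explicitly.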
5.7 Conclusions
This work has demonstrated that CLP can be applied to the problem of generating SMT formulas with known satisfiability results, and this generation is fast enough to practically test real SMT solvers. A total of 23 bugs in industry-quality SMT solvers were found thanks to this work, of which a remarkable 12 were correctness bugs.
In terms of the overall thesis, this case study helps bolster the argument that CLP is a valid solution to the structured black-box test case generation problem. Generating SMT-LIB formulas with known satisfiability results represents an unparalleled level of structure for a test input, requiring an expressive logic for any hope of accurate representation. CLP was expressive enough to encode the behavior of SMT-LIB, and it was fast enough to find a number of bugs in popular industrial SMT solvers. That said, this case study did manage to bring out a weak point in CLP, in that generation was still much slower than in the other case studies, and it could not practically generate guaranteed unsatisfiable floating point formulas. Considering the complexity of the problem used in this case study, I personally consider such performance degradation acceptable; if CLP had done better, it would supplant literally decades of work on SMT solvers, which seems a farfetched result.
Case Study: Intelligent Fuzzing of Student Tokenizers and Parsers
6.1 Introduction
This case study looks at how CLP can be utilized by instructors to help gather quantitative and qualitative feedback on how students are performing on assignments. With this application, CLP can assist in semi-automated grading, along with helping to identify when students misunderstand various concepts. Education represents a very different testing domain than that of the other case studies discussed, helping to demonstrate the generality and widespread applicability of CLP to testing. This helps promote the claim that CLP is a general solution to the structured black-box test case generation problem.

This case study looks at one particular student assignment wherein students needed to write a tokenizer, parser, and evaluator for an arithmetic expression language. CLP was used to generate focused tests for each of these components by encoding a reference solution directly in CLP itself, and then executing that reference solution “backwards” to produce valid test inputs. This encoding of the reference solution was only possible thanks to the expressiveness of CLP. Additionally, CLP was able to produce millions of tests within a matter of minutes, demonstrating both the performance of CLP and its capability to generate many tests without difficulty. This further backs my overall thesis.

While there is a significant amount of related work which applies automated testing to the education domain (e.g., [167, 168, 169, 170]), much of this is limited to one specific kind of assignment, and many tools do not produce a deterministic set of test cases. In contrast, CLP can deliver a fixed suite of tests, and it is applicable to more than just a single kind of assignment (further demonstrated in Chapter 7).
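The idea of running a reference tokenizer “backwards” to obtain a fixed suite of tests can be approximated outside of CLP. The following Python sketch is not the CLP encoding used in this case study: it enumerates token sequences over a small hypothetical vocabulary and renders each one to input text, so that every test carries its expected tokens by construction. The names `VOCAB`, `tokenizer_tests`, and `reference_tokenize` are all assumptions made for illustration.

```python
import re
from itertools import product
from typing import Iterator

# Hypothetical token vocabulary for an arithmetic expression language.
VOCAB = ["7", "42", "+", "*", "(", ")"]


def tokenizer_tests(length: int) -> Iterator[tuple[str, list[str]]]:
    """Yield (input_text, expected_tokens) pairs for sequences of `length` >= 1.

    Enumeration order is fixed, so the resulting suite is deterministic.
    """
    for tokens in product(VOCAB, repeat=length):
        # Whitespace is optional between tokens, except between two
        # adjacent numbers, where omitting it would merge them into one.
        gaps = []
        for left, right in zip(tokens, tokens[1:]):
            if left.isdigit() and right.isdigit():
                gaps.append([" "])
            else:
                gaps.append(["", " "])
        for seps in product(*gaps):
            text = tokens[0] + "".join(s + t for s, t in zip(seps, tokens[1:]))
            yield text, list(tokens)


def reference_tokenize(text: str) -> list[str]:
    """Stand-in forward tokenizer; a student's tokenizer would be tested here."""
    return re.findall(r"\d+|[-+*/()]", text)


if __name__ == "__main__":
    for text, expected in tokenizer_tests(2):
        assert reference_tokenize(text) == expected, (text, expected)
```

Unlike a CLP encoding, this sketch needs a separate generator written by hand alongside the forward tokenizer; the appeal of CLP in this setting is that a single relational encoding serves both directions.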
As an aside, this chapter is based on work we published in ITiCSE’17 (Citation: [73]; DOI: 10.1145/3059009.3059051; © 2017 ACM).