An SMT-based Learning Framework - Model Learning Algorithms and Contributions

1.5 Model Learning Algorithms and Contributions

1.5.5 An SMT-based Learning Framework

The last two approaches can, in theory, provide automated ways of generating models for many practical systems. However, adapting both approaches to a broader class of systems or learning scenarios is far from trivial. Take for example adaptations for learning systems without resets or learning systems only from a set of logs. Such adaptations would likely mean reconstruction of these approaches from the ground up.

Adding to that, both approaches require a significant number of tests, which grows rapidly with increasing system complexity. This was particularly evident in the TCP case study involving RALib, where the high number of tests meant we had to use small input alphabets and could not learn server implementations. Poor scalability is caused in part by inefficiencies in the classical learning framework which arise when processing counterexamples. Counterexamples driving the learning process often contain complicating information, such as unnecessary inputs or confusing data relations. Unnecessary inputs make counterexamples longer than needed. Confusing data relations make it difficult to identify which relations in a counterexample are relevant. To give a concrete example, consider a login system with register and login methods, both carrying a user ID and password as parameters. Also consider two counterexample traces exercising the same functionality on the login system: (c1) register(0,0) ok() login(0,0) ok() and (c2) register(0,1) ok() login(0,1) ok(). (c1) contains confusing data relations binding user IDs also to passwords, when in fact it is irrelevant that they are equal. By contrast, (c2) contains only the relevant relations and is not confusing.
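The difference between (c1) and (c2) can be made concrete by listing the equalities that hold between data values in each trace. The following is a minimal sketch; the trace representation (a list of method names with parameter tuples) and the helper `equalities` are assumptions made for illustration and are not taken from the thesis.

```python
from itertools import combinations

def equalities(trace):
    """Return all pairs of parameter positions in the trace that hold equal values.

    A position is (index of the input in the trace, index of the parameter).
    """
    positions = [
        ((i, j), value)
        for i, (_method, params) in enumerate(trace)
        for j, value in enumerate(params)
    ]
    return {(p, q) for (p, v), (q, w) in combinations(positions, 2) if v == w}

# Inputs only; the ok() outputs carry no data and are omitted.
c1 = [("register", (0, 0)), ("login", (0, 0))]
c2 = [("register", (0, 1)), ("login", (0, 1))]

# (c2) relates only ID to ID and password to password across the two inputs.
relevant = equalities(c2)
# (c1) additionally relates user IDs to passwords; these are the confusing relations.
confusing = equalities(c1) - relevant

print(sorted(relevant))
print(sorted(confusing))
```

Under this representation, (c2) exhibits two equalities, while (c1) exhibits those two plus four more that a learner would have to rule out as irrelevant.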

The presence of either unnecessary inputs or confusing data relations in counterexamples can adversely impact the performance of active learning algorithms, causing them to run many more tests (and inputs) than necessary. To give an intuition of the impact, imagine that in the learning run of Section 1.4 we had found the counterexample connect msg msg connect connect msg. Without further processing of this counterexample, we might very well have used the suffix msg connect connect msg as a distinguishing sequence. This suffix has twice as many inputs as the compact msg msg we used in the learning run, and thus leads to longer tests. The suffix is made longer by two unnecessary connect inputs. Confusing data relations, in turn, obscure the relevant relations. In the context of tree queries, we want to optimize suffix execution by considering only relations that are relevant and not those that are irrelevant (such as a user ID being equal to a password).

State-of-the-art algorithms such as TTT effectively tackle the problem of unnecessary inputs for DFAs and Mealy machines. Yet the problem still plagues learners for more advanced formalisms such as RAs. Chapter 4 provides a way of dealing with confusing data relations by a disambiguation step in which all relations are tested, but this procedure is very costly in terms of the number of tests required.

Contribution. Chapter 6 proposes a framework based on Satisfiability Modulo Theories (SMT [33]) which intrinsically avoids problems arising with counterexamples. The underlying idea is to separate concerns between the learner and the tester. The learner is no longer able to run tests; its task is reduced to that of generating a hypothesis consistent with a set of observations. The tester is the one performing tests. Counterexamples found by the tester are incorporated by the learner into more refined hypotheses. As it no longer needs to run tests, the learner can also operate in a passive setting, where it builds a hypothesis from a set of logged observations. By using what is effectively a passive learner in an active setting, we aim to answer a more general question: how does such an approach perform on practical benchmarks compared to the classical active setting using active learning algorithms? As the chapter shows, it is at least competitive.

The proposed framework uses SMT to implement the learner. More specifically, counterexamples found by the tester are encoded into SMT constraints over the functions comprising the formalism definition. The constraints are then supplied to an SMT solver. From the solution provided, the learner generates a hypothesis model which it sends to the tester. This approach benefits from the capacity of SMT solvers to handle advanced arithmetic, which opens the door to the rapid prototyping of learning for advanced formalisms. To that end, we formalize encodings for both conventional FSMs such as DFAs and Mealy machines, and for advanced formalisms such as RAs with equality and fresh values. Our framework is also highly adaptable, as shown in the provided adaptations to learning systems without resets. Additionally, by removing from the learner the ability to run tests, learning performance is no longer affected by complicating information in counterexamples.
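To give an impression of what constraints over the functions of a formalism might look like, the sketch below generates SMT-LIB 2 text for a Mealy machine from a set of observations. It is an illustrative encoding under stated assumptions and not necessarily the one formalized in Chapter 6: states, inputs and outputs are represented as integers, state 0 is taken as the initial state, and the names `encode_mealy`, `trans` and `out` are hypothetical.

```python
def encode_mealy(observations, inputs, outputs, n_states):
    """Return SMT-LIB 2 text asking for an n_states-state Mealy machine
    that is consistent with the given (input word, output word) pairs."""
    in_id = {symbol: k for k, symbol in enumerate(inputs)}
    out_id = {symbol: k for k, symbol in enumerate(outputs)}
    lines = [
        "(set-logic QF_UFLIA)",
        "(declare-fun trans (Int Int) Int)  ; transition function",
        "(declare-fun out (Int Int) Int)    ; output function",
    ]
    # Every transition must lead to one of the states 0 .. n_states - 1.
    for s in range(n_states):
        for i in in_id.values():
            t = f"(trans {s} {i})"
            lines.append(f"(assert (and (<= 0 {t}) (< {t} {n_states})))")
    # Each observed step fixes the output produced in the state reached so far.
    for word, observed in observations:
        state = "0"  # assumed initial state
        for symbol, output in zip(word, observed):
            lines.append(
                f"(assert (= (out {state} {in_id[symbol]}) {out_id[output]}))"
            )
            state = f"(trans {state} {in_id[symbol]})"
    lines += ["(check-sat)", "(get-model)"]
    return "\n".join(lines)

# Hypothetical observations over the inputs connect and msg.
observations = [
    (("msg",), ("err",)),
    (("connect", "msg"), ("ok", "ack")),
]
print(encode_mealy(observations, ["connect", "msg"], ["ok", "ack", "err"], 2))
```

The resulting text can be passed to any solver that accepts SMT-LIB 2 input. One way to look for a small model, assumed here rather than taken from the source, is to call `encode_mealy` with `n_states` set to 1, 2, and so on until the solver reports `sat`.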

We have implemented this framework in the open-source learning tool Z3GI, and have shown its effectiveness over a series of experiments, where we compare it to other learners following the classical learning framework. Our tool implements an all-purpose learner, in the sense that it can infer models for many formalisms, including DFAs, Mealy machines, accepting/rejecting RAs and regular input/output RAs (termed IORA in this chapter). It implements learning both actively and passively and can also learn Mealy machines that cannot be reset. Moreover, our tool's decoupled architecture allows encodings to be swapped while the rest of the framework stays the same, facilitating the probing of new encodings.

A setting similar to ours was previously introduced in [213], where the authors connect a passive learner to a model-based tester, though their realization is markedly different, provides no guarantees on the minimality of the learned model, and can only learn one specific formalism, namely Partial Labeled Transition Systems (PLTS). We additionally compare our approach to the classical one over a series of experiments.

Passive learning using SMT solvers is also not new. Neider et al. [161, 162] propose an SMT-based passive learning approach for FSMs using encodings similar to ours. The approach is shown to be effective even when compared to more involved SAT-based approaches. We improve upon this work by adapting the SMT-based approach to richer classes of automata. Moreover, we assess the effectiveness of such an approach when used in an active way, by drawing comparisons with classical active learning approaches.

In document Active Model Learning for the Analysis of Network Protocols (Page 34-36)