Active Learning Applications

1.3 Related Work

1.3.4 Active Learning Applications

Active model learning has been successfully applied for discovering the behaviour of software systems on numerous occasions. In this section, we give an overview of some of the most notable studies.

In 2010, Cho et al. were the first to demonstrate how active model learning can be used for analysing botnets [40]. They used an adaptation of the L∗ algorithm to learn the MegaD Command and Control (C&C) protocol. MegaD is a botnet that at its peak accounted for 32% of global spam [130]. In their analysis of the protocol, they show how to identify its weakest links and design flaws. In addition, they were able to prove the existence of unobservable back-channels between botnet servers without having access to these servers.

By leveraging properties specific to most network protocols, the heuristics introduced by Cho et al. allow for learning state machines in a realistic high-latency network setting. Compared to the original L∗ algorithm, the time to learn the MegaD C&C protocol was reduced from days to hours with these heuristics. Primarily, the authors observed that communication protocols typically only accept a subset of all inputs at most times. This allowed them to greatly reduce the number of queries asked in the (active) learning process. Also, they had great success using parallel processing and caching of queries.
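The caching idea can be illustrated with a minimal sketch. It is not the implementation of Cho et al.; the class name `CachedOracle`, the `query` callback and the assumption that the system behaves as a deterministic Mealy machine (so the answer for a word also answers all of its prefixes) are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Tuple

Word = Tuple[str, ...]


class CachedOracle:
    """Wraps a slow output query and remembers earlier answers.

    Assumption: the system under learning is deterministic, so the
    outputs for a word also determine the outputs for its prefixes.
    """

    def __init__(self, query: Callable[[Word], Tuple[str, ...]]) -> None:
        self._query = query  # hypothetical: resets the system and sends the inputs
        self._cache: Dict[Word, Tuple[str, ...]] = {}
        self.network_queries = 0  # number of queries that actually hit the system

    def ask(self, word: Word) -> Tuple[str, ...]:
        if word not in self._cache:
            outputs = tuple(self._query(word))
            self.network_queries += 1
            # Store the answer for the word and for every prefix of it.
            for k in range(len(word) + 1):
                self._cache[word[:k]] = outputs[:k]
        return self._cache[word]
```

In a high-latency setting, every query answered from the cache saves a full network round trip, which is where such a wrapper would pay off.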

Also in 2010, Aarts et al. were the first to apply model learning to the analysis of smart cards (i.e. cards that have a chip) [7]. They used the L∗ algorithm to analyse electronic passports.

Later, they used a similar approach for learning models of bank cards that support the EMV protocol [2]. Although they did not find any flaws in these cards, their analysis does reveal differences in the implementation between cards that are supposed to implement the same protocol.

To be able to analyse the e.dentifier2, a USB-connected bank card reader, Chalupar et al. use a Lego robot to perform physical interactions with the device in the learning process [36]. As the USB implementation in the original system does not always provide reliable results, they use majority voting to determine the output.
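Majority voting over repeated queries can be sketched as follows. The function name, the `run_query` callback and the default of five repetitions are assumptions for illustration; the source does not state how many repetitions Chalupar et al. used or how ties were resolved.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(run_query: Callable[[Sequence[str]], Hashable],
                  inputs: Sequence[str],
                  repetitions: int = 5) -> Hashable:
    """Repeat the same query and return the most frequent answer.

    `run_query` is a hypothetical callback that resets the device,
    applies `inputs` and returns the observed output.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    answers = Counter(run_query(inputs) for _ in range(repetitions))
    # On a tie, Counter.most_common keeps the answer that was seen first.
    winner, _count = answers.most_common(1)[0]
    return winner
```

The price of this robustness is that every query is multiplied by the number of repetitions, which matters when each interaction involves a physical button press.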

Fiterău-Broștean et al. apply model learning with mappers to the TCP network protocol [55, 54]. They show that different implementations of TCP in Windows 8 and Ubuntu induce different models, which allows for fingerprinting of these implementations. Inspection of the learned models reveals that both Windows 8 and Ubuntu violate RFC 793 – the standard that describes the TCP protocol.

De Ruiter et al. use model learning in their analysis of nine different TLS implementations [122]. They found security-related flaws in three of these implementations.

Fiterău-Broștean et al. apply model learning to three SSH implementations to infer automata models, and then use model checking to verify that these models satisfy basic security properties and conform to the RFCs [56]. Their analysis showed that all tested SSH server models satisfy the stated security properties. They did, however, uncover several violations of the standard, which may allow for fingerprinting of the different implementations.

Model learning has been applied to industrial control software on several occasions. Smeenk et al. use model learning to validate the correctness of the Engine Status Manager (ESM), a software component that is used in printers and copiers of Océ [127]. Their main challenge was that traditional conformance testing methods were unable to find counterexamples for some hypotheses. They therefore implemented an extension of the algorithm of Lee and Yannakakis for computing an adaptive distinguishing sequence [92]. Even when an adaptive distinguishing sequence does not exist, Lee and Yannakakis’ algorithm produces an adaptive sequence that ‘almost’ identifies states. In combination with a standard algorithm for computing separating sequences for pairs of states, the authors managed to verify states with on average 3 test queries. Altogether, they needed around 60 million queries to learn a model of the ESM with 77 inputs and 3,410 states.

Schuts et al. use model learning and model checking to compare a legacy implementation to a new implementation of a component at Philips Healthcare [123]. Instead of comparing the two implementations via their internal structure, they check the equivalence of their behaviour. First they use model learning to construct a model for both the legacy implementation and the new one. Then, they use model checking to see if the learned models are equivalent. This way, they found issues in both the legacy implementation and the new one. After solving these issues, model learning helped to increase confidence that the two implementations behave the same.
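Checking two learned models for behavioural equivalence can be sketched as a breadth-first search over pairs of states. This is a stand-in for the model checker used by Schuts et al., not their tooling. The dictionary encoding of a Mealy machine and the name `find_difference` are assumptions for illustration, and both machines are assumed to be deterministic and complete.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (state, input) -> (next state, output); hypothetical encoding of a learned model
Mealy = Dict[Tuple[str, str], Tuple[str, str]]


def find_difference(a: Mealy, a0: str, b: Mealy, b0: str,
                    inputs: List[str]) -> Optional[List[str]]:
    """Return a shortest input word on which the machines disagree, or None."""
    seen = {(a0, b0)}
    queue = deque([(a0, b0, [])])
    while queue:
        sa, sb, word = queue.popleft()
        for i in inputs:
            next_a, out_a = a[(sa, i)]
            next_b, out_b = b[(sb, i)]
            if out_a != out_b:
                return word + [i]  # the outputs differ on the last input
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b, word + [i]))
    return None  # no reachable pair of states disagrees
```

A returned word is a concrete scenario that can be replayed on both implementations to decide which of them, if either, is at fault.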

Minimal Separating Sequences for All Pairs of States

Rick Smetsers, Joshua Moerman, and David N. Jansen

Abstract

Finding minimal separating sequences for all pairs of inequivalent states in a finite state machine is a classic problem in automata theory. Sets of minimal separating sequences, for instance, play a central role in many conformance testing methods. Moore has already outlined a partition refinement algorithm that constructs such a set of sequences in O(mn) time, where m is the number of transitions and n is the number of states. In this chapter, we present an improved algorithm based on the minimization algorithm of Hopcroft that runs in O(m log n) time. The efficiency of our algorithm is empirically verified and compared to the traditional algorithm.

Chapter 2

2.1 Introduction

In diverse areas of computer science and engineering, systems can be modelled by finite state machines (FSMs). One of the cornerstones of automata theory is minimization of such machines (and many variations thereof). In this process one obtains an equivalent minimal FSM, where states are different if and only if they have different behaviour. The first to develop an algorithm for minimization was Moore [104]. His algorithm has a time complexity of O(mn), where m is the number of transitions, and n is the number of states of the FSM. Later, Hopcroft improved this bound to O(m log n) [72].

Minimization algorithms can be used as a framework for deriving a set of separating sequences that show why states are inequivalent. The separating sequences in Moore’s framework are of minimal length [62]. Obtaining minimal separating sequences in Hopcroft’s framework, however, is a non-trivial task. In this chapter, we present an algorithm for finding such minimal separating sequences for all pairs of inequivalent states of an FSM in O(m log n) time.
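To make the notion of a minimal separating sequence concrete, the following is a naive pairwise sketch. It is neither Moore's O(mn) partition refinement nor the O(m log n) algorithm of this chapter. It simply processes state pairs in rounds of increasing sequence length, which is enough to guarantee minimality. The dictionary encoding and the function name are assumptions for illustration, and the machine is assumed to be deterministic and complete.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (state, input) -> (next state, output); hypothetical encoding of an FSM
Mealy = Dict[Tuple[str, str], Tuple[str, str]]
Pair = Tuple[str, str]


def minimal_separating_sequences(machine: Mealy, states: List[str],
                                 inputs: List[str]) -> Dict[Pair, Tuple[str, ...]]:
    """Map each inequivalent pair of states to a shortest separating sequence."""
    pairs = list(combinations(states, 2))
    sep: Dict[Pair, Tuple[str, ...]] = {}

    # Round 1: pairs that already differ in the output for a single input.
    for s, t in pairs:
        for a in inputs:
            if machine[(s, a)][1] != machine[(t, a)][1]:
                sep[(s, t)] = (a,)
                break

    # Round k: pairs whose successors were separated in an earlier round.
    while True:
        new: Dict[Pair, Tuple[str, ...]] = {}
        for s, t in pairs:
            if (s, t) in sep:
                continue
            for a in inputs:
                ns, nt = machine[(s, a)][0], machine[(t, a)][0]
                w = sep.get((ns, nt), sep.get((nt, ns)))
                if w is not None:
                    new[(s, t)] = (a,) + w
                    break
        if not new:
            return sep  # pairs that are absent are equivalent
        sep.update(new)  # applied per round, so lengths grow by one each round
```

Because this sketch inspects every pair of states in every round, it is far slower than either algorithm discussed in the text; it only shows what the output looks like.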

Coincidentally, Bonchi and Pous recently introduced a new algorithm for the equally fundamental problem of proving equivalence of states in non-deterministic automata [22]. As both their and our work demonstrate, even classical problems in automata theory can still offer surprising research opportunities. Moreover, new ideas for well-studied problems may lead to algorithmic improvements that are of practical importance in a variety of applications.

One such application for our work is in conformance testing. Here, the goal is to test if a black-box implementation of a system is functioning as described by a given FSM. Conformance testing consists of applying sequences of inputs to the implementation, and comparing the output of the system to the output prescribed by the FSM. Minimal separating sequences are used in many test generation methods [47]. Therefore, our algorithm can be used to improve these methods.
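A single conformance test, as described above, can be sketched as follows. The dictionary encoding of the specification, the `implementation` callback (assumed to reset the black box, apply the inputs and return the observed outputs) and the function name are hypothetical. How the test words are generated is outside the scope of this sketch.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (state, input) -> (next state, output); hypothetical encoding of the specification
Mealy = Dict[Tuple[str, str], Tuple[str, str]]


def passes_test(spec: Mealy, initial: str,
                implementation: Callable[[Sequence[str]], Sequence[str]],
                word: Sequence[str]) -> bool:
    """Apply one input word and compare the observed and prescribed outputs."""
    state = initial
    expected: List[str] = []
    for a in word:
        state, output = spec[(state, a)]
        expected.append(output)
    observed = list(implementation(word))
    return observed == expected
```

A test suite would call this function for every generated word; shorter separating sequences translate directly into shorter words and therefore fewer inputs sent to the black box.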

In document Advances in Model Learning for Software Systems (Page 52-58)