Specification Languages for Protocol Grammar

2.3 Existing Specification Languages

2.3.2 Specification Languages for Protocol Grammar

Message Sequence Chart

The Message Sequence Chart (MSC) is an interaction diagram standardized by the ITU [73] in its Z series of recommendations, which covers languages and general software aspects for telecommunication systems. The diagram depicts the order in which communications and other events take place between the logical processes of a protocol, their system and their environment. As illustrated in Figure 2.13, processes, also called entities or instances, are represented by vertical lines, while message exchanges between them are depicted by arrows. Thus, an MSC models communication through message passing via reliable FIFOs. It is a high-level description of possible usage scenarios, but it only specifies message orders. The internal behavior of each process is not considered. Besides, an MSC exhibits a weak partial-order semantics that cannot express constraints between message exchanges. For instance, such a diagram cannot be used to model that “if P sends M to Q, Q must pass on this message to R” [59]. For this reason, such a description of a protocol grammar is often limited to capturing system requirements in the form of “good” scenarios that the implemented system should exhibit.

Figure 2.13 – Message Sequence Chart describing a sample FTP authentication process.
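The ordering imposed by an MSC can be illustrated with a minimal Python sketch. It assumes a chart is given as a top-to-bottom list of (sender, receiver, message) arrows. The FTP messages below are hypothetical stand-ins for the content of Figure 2.13, and the helper names `build_order` and `respects` are illustrative only.

```python
# Hypothetical chart: each arrow is (sender, receiver, message), top to bottom.
FTP_AUTH_MSC = [
    ("client", "server", "USER alice"),
    ("server", "client", "331 Password required"),
    ("client", "server", "PASS secret"),
    ("server", "client", "230 Logged in"),
]


def build_order(msc):
    """Return the set of (earlier, later) event pairs imposed by the chart."""
    order = set()
    last_on = {}  # instance -> most recent event on its vertical line
    for i, (src, dst, _msg) in enumerate(msc):
        send, recv = ("send", i), ("recv", i)
        order.add((send, recv))  # a message is sent before it is received
        for instance, event in ((src, send), (dst, recv)):
            if instance in last_on:
                # events on one vertical line are ordered top to bottom
                order.add((last_on[instance], event))
            last_on[instance] = event
    return order


def respects(order, linearization):
    """Check that a total sequence of events is compatible with the order."""
    position = {event: k for k, event in enumerate(linearization)}
    return all(position[a] < position[b] for a, b in order)


order = build_order(FTP_AUTH_MSC)
drawn = [event for i in range(len(FTP_AUTH_MSC))
         for event in (("send", i), ("recv", i))]
print(respects(order, drawn))
```

The sketch only relates events through "sent before received" and "earlier on the same vertical line". Nothing in it can state that receiving one message obliges an instance to forward it, which is the limitation discussed above.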

Language of Temporal Ordering Specification

Language of Temporal Ordering Specification (LOTOS) is another formal specification language [69], developed within the ISO between 1981 and 1984. The key idea behind the LOTOS specification of a protocol is to describe the temporal relations that exist between externally observable events (from a system point of view). Several key principles inspired its design, such as:

— A syntactic and semantic separation is ensured between the definition of processes and the definition of types.

— The operational semantics is defined using an algebraic approach, mostly inspired by CCS- and CSP-based languages [95, 64], so that a rich set of algebraic equivalence properties can be proved.

A LOTOS specification is an ASCII text that describes a set of processes and type definitions. A process is a black-box abstraction of an activity in an implementation, for which only the external behavior is considered. Processes are synchronized through a relative temporal ordering of events and share communication mechanisms called interaction points. LOTOS also supports the description of data and operations based on abstract data types, a mathematical model for similar data structures.
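The idea of describing a process only through the temporal ordering of its external events can be sketched in Python. The fragment below is an assumption-laden illustration, not LOTOS itself: it covers only three behaviour operators (`stop`, action prefix `a; B` and choice `B1 [] B2`), ignores parallel composition and abstract data types, and uses hypothetical event names (`user`, `pass`, `ok`, `fail`).

```python
# Behaviour expressions are nested tuples (a deliberately tiny subset).
STOP = ("stop",)


def prefix(action, behaviour):
    """Action prefix: written `action; behaviour` in LOTOS."""
    return ("prefix", action, behaviour)


def choice(left, right):
    """Choice: written `left [] right` in LOTOS."""
    return ("choice", left, right)


def transitions(behaviour):
    """List the (event, next_behaviour) pairs the expression can perform."""
    kind = behaviour[0]
    if kind == "prefix":
        _, action, rest = behaviour
        return [(action, rest)]
    if kind == "choice":
        _, left, right = behaviour
        return transitions(left) + transitions(right)
    return []  # stop offers no event


def traces(behaviour):
    """All finite sequences of external events the behaviour can exhibit."""
    result = {()}
    for action, rest in transitions(behaviour):
        result |= {(action,) + tail for tail in traces(rest)}
    return result


# Hypothetical login process: user; pass; (ok; stop [] fail; stop)
LOGIN = prefix("user", prefix("pass",
               choice(prefix("ok", STOP), prefix("fail", STOP))))

for trace in sorted(traces(LOGIN)):
    print(trace)
```

An observer of `LOGIN` sees only these event sequences. How the process decides between `ok` and `fail` stays hidden, which matches the black-box view described above.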

The interested reader can refer to the LOTOS introduction by L. Logrippo et al. [90], which gives a complete definition of all the concepts behind this protocol specification language. This Formal Description Technique (FDT) has been widely used to define common OSI protocols in academic work [93]. In practical protocol development, however, LOTOS has attained little relevance [80].

Estelle

Published in 1989 [103], Estelle is an ISO standard specification language, capable of defining concurrent and distributed communication protocols. Based on a formal definition, it aims at identifying and mitigating any possible ambiguities in protocol implementations.

To achieve this, an Estelle specification relies on two parts: 1) the architecture and 2) the behavior. The architecture defines a hierarchy of modules, the actors of a communication, while the behavior denotes how these actors handle messages based on a finite state machine with memory, i.e. an Extended Finite State Machine (EFSM). Estelle thus models a system as a hierarchy of structures that can run in parallel, exchange messages and share some variables. As illustrated in Figure 2.14, two modules interact through a channel connected to their interaction points.

Figure 2.14 – Sample Estelle architecture (Module 1 and Module 2 interact through a channel attached to their interaction points).

To model interactions between modules, exchanged messages are stored in FIFO queues, which enables the use of conditional transitions in the EFSM, i.e. a transition fires when all of its enabling conditions are fulfilled. Additional rules can also be used to specify synchronous and asynchronous transition properties.
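The combination of FIFO queues, memory and guarded transitions can be sketched in Python. This is not Estelle syntax, and it simplifies the Estelle execution model considerably. The class `Module`, the helper `connect`, the messages (`BAD_LOGIN`, `RETRY`, `LOCKED`) and the `attempts` variable are all hypothetical names chosen for illustration.

```python
from collections import deque


class Module:
    """A module whose behaviour is an EFSM reading from a FIFO queue."""

    def __init__(self, name, initial_state, transitions, variables):
        self.name = name
        self.state = initial_state
        self.vars = dict(variables)   # the "memory" that extends the FSM
        self.queue = deque()          # FIFO at the interaction point
        self.transitions = transitions
        self.peer = None

    def send(self, message):
        """Output a message on the channel toward the peer module."""
        self.peer.queue.append(message)

    def step(self):
        """Fire at most one enabled transition; return True if one fired."""
        if not self.queue:
            return False
        message = self.queue[0]
        for source, expected, guard, action, target in self.transitions:
            if source == self.state and expected == message and guard(self.vars):
                self.queue.popleft()
                action(self)
                self.state = target
                return True
        return False


def connect(a, b):
    """Model the channel linking the interaction points of two modules."""
    a.peer, b.peer = b, a


def reject(module):
    module.vars["attempts"] += 1
    module.send("RETRY")


def lock(module):
    module.send("LOCKED")


server = Module("server", "WAIT", [
    # (source state, input, guard on variables, action, target state)
    ("WAIT", "BAD_LOGIN", lambda v: v["attempts"] < 2, reject, "WAIT"),
    ("WAIT", "BAD_LOGIN", lambda v: v["attempts"] >= 2, lock, "LOCKED"),
], {"attempts": 0})
client = Module("client", "IDLE", [], {})
connect(server, client)

for _ in range(3):
    client.send("BAD_LOGIN")
while server.step():
    pass
print(server.state, list(client.queue))
```

The same input triggers different transitions depending on the value of `attempts`. A plain finite state machine could only express this by duplicating states.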

Specification and Description Language (SDL)

Defined by the International Telecommunication Union (ITU) in 1992, the SDL formal language is intended for the specification of reactive, real-time and distributed applications involving many concurrent activities. Most communication protocols can therefore be described with this language. For example, SDL specifications exist for the LTE and DSR protocols [125, 31]. SDL makes it possible to specify the functional properties of a system and its relationships with the environment.

A graphical representation (SDL/GR) and a textual representation (SDL/PR) are available to describe the structure, the behavior and the data of a protocol. The graphical form is preferred by most people, as shown by its use in most academic papers. The interested reader can refer to the reference book on SDL [47].

All these models that can be used to specify the grammar of a communication protocol are complex. They rely on mathematical tools such as EFSMs, extended with various controls to ensure broad coverage of protocol requirements. These specification languages can be used to model probabilistic and distributed protocols. We believe such models are far too complex to be inferred with existing grammar inference algorithms. We therefore focused our work on learning deterministic Mealy machines.
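For comparison with the richer models above, a deterministic Mealy machine can be sketched in a few lines of Python. Each (state, input) pair maps to exactly one (next state, output) pair. The states, the abstract input symbols and the reply codes below are a simplified, hypothetical FTP-like login, not a faithful model of the FTP protocol.

```python
class MealyMachine:
    """Deterministic Mealy machine: the output depends on state and input."""

    def __init__(self, initial, transitions):
        self.initial = initial
        # (state, input symbol) -> (next state, output symbol)
        self.transitions = transitions

    def run(self, inputs):
        """Return the output sequence produced for an input sequence."""
        state, outputs = self.initial, []
        for symbol in inputs:
            state, output = self.transitions[(state, symbol)]
            outputs.append(output)
        return outputs


# Hypothetical abstraction of an FTP-like authentication grammar.
LOGIN_GRAMMAR = MealyMachine("INIT", {
    ("INIT", "USER"): ("USER_OK", "331"),
    ("INIT", "PASS"): ("INIT", "503"),
    ("USER_OK", "USER"): ("USER_OK", "331"),
    ("USER_OK", "PASS"): ("AUTH", "230"),
    ("AUTH", "USER"): ("USER_OK", "331"),
    ("AUTH", "PASS"): ("AUTH", "503"),
})

print(LOGIN_GRAMMAR.run(["USER", "PASS"]))
print(LOGIN_GRAMMAR.run(["PASS", "USER", "PASS"]))
```

Such a machine has no variables, queues, guards or concurrency, which keeps its state space small enough for grammar inference algorithms to handle.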

Communication Protocol Inference

This chapter reviews previous work in the field of automated inference of communication protocols. Section 3.1 reviews the different approaches to vocabulary inference, while Section 3.2 covers previous work on grammatical inference applied to the reverse engineering (RE) of protocol grammars.

3.1 Automated Inference of the Vocabulary

As described in section 2.2, a protocol is made of a vocabulary, which defines the set of accepted messages and their definitions, and a grammar, which denotes the set of accepted sequences of messages. An inference process must therefore address both to properly reverse engineer an unknown protocol. However, the grammatical inference of a protocol requires some prior knowledge of the vocabulary. For this reason, the reverse engineering of a protocol traditionally starts with vocabulary inference.

Previous work in the field of automated vocabulary inference falls into two families, depending on whether it analyzes an implementation of the protocol [27, 29, 41] or communication samples [88, 14, 43, 83, 139, 138, 82].

Work in the first family analyzes the executable binary that implements the targeted protocol. It observes the parsing process for received messages and the buffer construction method for sent messages. The results reported by these works appear effective at recovering the decomposition of messages into fields. However, they require static analysis and intrusive dynamic techniques on binaries. We believe this approach cannot be easily automated, mostly because of its complexity but also because of existing countermeasures such as static and dynamic obfuscation, code compression, anti-debugging and anti-instrumentation solutions.

Therefore, we focused our work on the second family of vocabulary reverse engineering approaches. Contrary to the first, this family of trace-based approaches relies only on collected messages to infer the vocabulary of an unknown protocol. Messages can be extracted from a captured communication trace, for instance from a pcap file for network protocols. We believe this approach makes fewer assumptions about the targeted protocol and its implementation, and is for this reason more practical. Nonetheless, trace-based approaches are more sensitive to encryption than binary-based approaches, as they rely on pattern matching algorithms that are not effective on encrypted messages. However, solutions exist that could be used to tackle this encryption issue [28, 140, 4, 26]. Some of them involve a partial reverse engineering of the implementation to collect exchanged messages before their encryption [28, 140]. For example, specific probes can be used to extract unencrypted sent and received messages held in buffers of the program. Such an operation is easier than the complete reverse engineering of the protocol implementation. Besides, we do not consider these two families completely orthogonal, and future work could combine our methodology with results obtained through binary analysis.

Among all the issues encountered when inferring the vocabulary of a protocol with such a trace-based approach, we retained generic ones that are either clearly identified and addressed by state-of-the-art work or that we faced while building our own trace-based inference solution. We thus highlight three recurrent issues: 1) message extraction, 2) identification of equivalent messages and of their format, and 3) relationship inference. The first issue is the identification of message boundaries in the provided traces. We detail existing work addressing this issue in section 3.1.1. The second issue, detailed in section 3.1.2, comes from the difficulty of identifying equivalent messages and their format in a set of collected traces. Finally, works that identify and infer field relationships, such as size fields and sequence numbers, are detailed in section 3.1.3.
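To make the third issue concrete, the following Python sketch searches a set of already extracted messages for a candidate size field. It is a naive illustration under strong assumptions (a one-byte field that counts the bytes following it), not the method of any work cited here. The function name `find_size_field_offsets` and the sample messages are hypothetical.

```python
def find_size_field_offsets(messages):
    """Return offsets whose byte always equals the number of bytes after it.

    Assumes a one-byte size field located at the same offset in every message.
    """
    if not messages:
        return []
    shortest = min(len(message) for message in messages)
    candidates = []
    for offset in range(shortest):
        if all(message[offset] == len(message) - offset - 1
               for message in messages):
            candidates.append(offset)
    return candidates


# Hypothetical messages: 1 type byte, 1 size byte, then the payload.
SAMPLES = [
    b"\x01\x05hello",
    b"\x01\x02hi",
    b"\x02\x06sample",
]

print(find_size_field_offsets(SAMPLES))
```

With only a few samples, such a test can report coincidental matches, and it breaks down as soon as the payload is encrypted or the field uses another width or byte order. This is one reason why relationship inference is treated as an issue of its own.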
