7.4 Reverse Engineering Message Formats
7.4.1 Inference from Network Traces
Early approaches for reverse engineering message formats use machine learning techniques to find patterns in network traces, and use these patterns to infer a message format. They can therefore be considered as primarily black-box, passive learning techniques. In this section, we give an overview of the advancements in this area, and the tools that have been created over the years.
Using sequence alignment for finding similarities
The first preliminary technique for protocol message format reverse engineering was introduced by the Protocol Informatics project [17]. Inspired by bioinformatics, it uses sequence alignment to find similarities in two or more messages of the protocol. Sequence alignment is a way of arranging two sequences to identify regions of similarity [107]. In bioinformatics it is used to understand the relationship between two sequences of genetic information, such as DNA or amino acids. For protocol analysis, the concept is similar. The goal is to compare a message to a database of messages belonging to a specific protocol. This allows one to determine its type and the location and size of the fields in each individual message. Sequence alignment is particularly effective on (sets of) messages in which dynamic fields have variable lengths [24]. Beddoe presents preliminary results in learning message formats for HTTP [17].
The sequence alignment algorithm of Needleman and Wunsch [107] was later implemented in the Netzob tool for partitioning messages into simple (i.e. non-complex) fields [24, 25]. The tool uses this technique in conjunction with a clustering algorithm to group together similar messages. This is used as a preprocessing step for inferring the domain of individual message fields, and (subsequently) the message type.
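To make the idea of aligning two protocol messages concrete, the following is a minimal sketch of Needleman-Wunsch global alignment over byte strings. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration; they are not taken from the Protocol Informatics project or from Netzob.

```python
from typing import List, Optional, Tuple

Aligned = List[Optional[int]]  # byte values, with None marking a gap


def needleman_wunsch(a: bytes, b: bytes, match: int = 1,
                     mismatch: int = -1, gap: int = -1
                     ) -> Tuple[Aligned, Aligned, int]:
    """Globally align two messages and return both alignments and the score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per byte.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: Aligned = []
    out_b: Aligned = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[-1][-1]
```

For the two sample requests b"GET /a" and b"GET /abc", the alignment has length 8 and the shorter message receives two gaps. Such gap positions hint at a variable-length field, while the columns that match hint at constant parts of the format.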
Detecting fields with region analysis
Several tools extend the work done in the Protocol Informatics project by using previously seen messages to heuristically detect some specific fields (such as network addresses, lengths, and cookies).
One such tool is ScriptGen [95, 94], which uses sequence alignment as a building block for a more complex algorithm, called region analysis. Region analysis consists of two steps. By looking at aligned sequences of bytes, it first computes for each aligned byte:
– its most frequent type of data (binary, text or zero-value),
– its most frequent value,
– the variability of the values, and
– the presence of gaps in aligned sequences.
Then, fields are identified on the basis of sequences of bytes that have similar characteristics. By taking advantage of the statistical diversity of a large number of training messages, region analysis can be used to rebuild a partial notion of the semantics in a message format.
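The two steps of region analysis can be sketched as follows. This is an illustration under stated assumptions, not ScriptGen's implementation: the input is a list of already aligned messages (byte values, with None marking a gap), variability is measured as the share of values that differ from the most frequent one, and adjacent bytes are merged into a region when they agree on type, constancy and gap presence. The names ColumnStats, column_stats and regions are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import List, NamedTuple, Optional, Sequence, Tuple


class ColumnStats(NamedTuple):
    kind: str             # most frequent type: "zero", "text" or "binary"
    value: Optional[int]  # most frequent byte value (None if only gaps)
    variability: float    # assumed metric: 1 - share of the most frequent value
    has_gap: bool         # True if at least one message has a gap here


def byte_kind(b: int) -> str:
    if b == 0:
        return "zero"
    return "text" if 0x20 <= b <= 0x7E else "binary"


def column_stats(aligned: Sequence[Sequence[Optional[int]]]) -> List[ColumnStats]:
    """Step 1: compute the four per-byte characteristics for each aligned column."""
    stats = []
    for column in zip(*aligned):
        values = [b for b in column if b is not None]
        has_gap = len(values) < len(column)
        if not values:  # degenerate column consisting of gaps only
            stats.append(ColumnStats("gap", None, 0.0, True))
            continue
        value, freq = Counter(values).most_common(1)[0]
        kind = Counter(byte_kind(b) for b in values).most_common(1)[0][0]
        stats.append(ColumnStats(kind, value, 1 - freq / len(values), has_gap))
    return stats


def regions(stats: Sequence[ColumnStats]) -> List[Tuple[int, int, tuple]]:
    """Step 2: merge adjacent columns with the same signature into regions."""
    def signature(item):
        _, s = item
        return (s.kind, s.variability == 0.0, s.has_gap)

    result = []
    for sig, group in groupby(enumerate(stats), key=signature):
        indices = [i for i, _ in group]
        result.append((indices[0], indices[-1] + 1, sig))  # [start, end)
    return result
```

Each region is reported as a half-open column range together with its signature, so a constant text region followed by a variable binary region shows up as two separate candidate fields.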
The independently developed RolePlayer tool [44] uses a technique similar to that of ScriptGen. Instead of using a large number of training samples, however, RolePlayer uses a small set of cleverly constructed samples to train the sequence alignment algorithm.
Towards a more complete approach for analysing network traffic
In 2007, Cui et al. introduced Discoverer [43]. The goal of this tool is to automatically reverse engineer message formats by analysing sequences of network packets. The idea of Discoverer is to cluster messages with the same record pattern together and learn multiple message format specifications for a single protocol. This is achieved in three phases: tokenization, clustering and merging. In the following paragraphs we describe these phases in more detail. An overview of Discoverer’s system architecture can be found in Figure 7.2.
Chapter 7
Figure 7.2: Overview of Discoverer’s architecture [43, Figure 1]
First, consecutive network packets are reassembled into messages determined by the direction of communication. Then, each message is split into a sequence of tokens. A token is a sequence of consecutive bytes likely to belong to the same message field. Two types of tokens are distinguished: text and binary. Text segments are identified by comparing a sequence of bytes with the ASCII values of printable characters. A set of predefined delimiters is used to divide a text segment into tokens. The authors argue that identifying binary field boundaries is very hard. Therefore, they consider each binary byte to be a token in its own right.
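The tokenization phase described above can be sketched as follows. The delimiter set (space, tab, CR, LF), the minimum length of a text segment (min_text_len) and the choice to drop the delimiters from the output are assumptions made for this illustration; the description above does not specify them.

```python
import re
from typing import List, Tuple

Token = Tuple[str, bytes]  # (kind, content), where kind is "text" or "binary"

# Runs of printable ASCII characters, including common whitespace.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t\r\n]+")


def tokenize(message: bytes, delimiters: bytes = b" \t\r\n",
             min_text_len: int = 3) -> List[Token]:
    """Split a message into text tokens and single-byte binary tokens."""
    tokens: List[Token] = []
    pos = 0
    for run in PRINTABLE_RUN.finditer(message):
        if run.end() - run.start() < min_text_len:
            continue  # too short to count as text; handled as binary bytes
        # Every byte before the text segment is a binary token of its own.
        tokens.extend(("binary", bytes([b])) for b in message[pos:run.start()])
        # Divide the text segment into tokens at the delimiters.
        word = bytearray()
        for b in run.group():
            if b in delimiters:
                if word:
                    tokens.append(("text", bytes(word)))
                    word.clear()
            else:
                word.append(b)
        if word:
            tokens.append(("text", bytes(word)))
        pos = run.end()
    tokens.extend(("binary", bytes([b])) for b in message[pos:])
    return tokens
```

Applied to a message such as b"\x00\x05GET /index.html\r\n\xff", this yields two binary tokens, the text tokens b"GET" and b"/index.html", and a final binary token.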
Messages are clustered based on their token pattern. However, since messages with the same token pattern do not necessarily have the same format, clusters of messages are further divided so that each message in a cluster has the same format. Then, constant and variable length tokens are identified by comparing them against their counterparts in another message that has the same format. Three field semantics are inferred:
length the size of a field,
offset the byte offset of a field from a certain point (such as the start of the message),
cookie session-specific data that appears in messages from both sides of the application session (such as a session ID).
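The first part of the clustering phase can be sketched as follows, under simplifying assumptions: a message is a list of (kind, content) tokens, the token pattern is just the sequence of token kinds, and a token position counts as variable-length when the token lengths at that position differ within a cluster. The further subdivision of clusters and the inference of length, offset and cookie semantics are not covered. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, bytes]  # (kind, content), where kind is "text" or "binary"


def token_pattern(tokens: Sequence[Token]) -> Tuple[str, ...]:
    """The pattern of a message is the sequence of its token kinds."""
    return tuple(kind for kind, _ in tokens)


def cluster_by_pattern(messages: Sequence[Sequence[Token]]
                       ) -> Dict[Tuple[str, ...], List[Sequence[Token]]]:
    """Group messages that share the same token pattern."""
    clusters = defaultdict(list)
    for tokens in messages:
        clusters[token_pattern(tokens)].append(tokens)
    return dict(clusters)


def variable_length_positions(cluster: Sequence[Sequence[Token]]) -> List[int]:
    """Token positions whose length differs between messages of one cluster."""
    positions = []
    for index, column in enumerate(zip(*cluster)):
        lengths = {len(content) for _, content in column}
        if len(lengths) > 1:
            positions.append(index)
    return positions
```

Because all messages in a cluster share one pattern, they have the same number of tokens, so comparing tokens position by position is well defined.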
The key observation behind the merging phase is that sequence alignment can be used to identify similar message formats across different clusters. This is because we can leverage the token properties (text or binary, variable or fixed length) and semantics (length, offset and cookie) inferred in the previous phases. The authors have demonstrated that Discoverer can partially infer message formats for three application protocols: SMB, RPC, and HTTP.
Discoverer has three major limitations. First, it assumes the existence of a (set of) predefined delimiter(s) for dividing a text segment into tokens. However, protocols may not use delimiters and even if they do, these delimiters might not be available to the public. Second, it does not work for asynchronous application protocols, or (synchronous) protocols that are sampled. This is because Discoverer assembles raw packets into messages by grouping each sequence of consecutive packets that flow in one direction. This way of grouping packets is inappropriate, because two parties might send packets to each other at the same time. Moreover, a raw packet trace might be sampled, which severely reduces the effectiveness of Discoverer’s approach for the same reason. Third, the tool assumes that the first constant number of bytes of a session describe the complete message format. Whilst this is the case for the application protocols that were used in the experiments, this assumption does not hold for all application protocols. The SMTP protocol, for example, indicates the end of the mail data by sending a line containing only a “.”. This expression is part of the message format, while the content of the message itself can be of any (variable) length.
Using frequency distributions and n-grams
The limitations of Discoverer were addressed by Wang et al. in their Veritas and ProDecoder tools [148, 147]. These are fully automatic network-based tools for learning message formats that do not assume any prior knowledge of a protocol specification (such as delimiters). Similar to Discoverer, they are applicable to both text and binary protocols.
The key insight behind these tools is that n-grams in protocol messages exhibit a highly skewed frequency distribution that can be used for inferring the message format. An n-gram is a contiguous subsequence of n elements in a given sequence of at least n elements. In the case of Veritas and ProDecoder, an n-gram is a sequence of n bytes in a protocol message.
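A minimal sketch of byte-level n-gram extraction and counting is given below. The function names and the SMTP-style sample input are chosen for illustration and are not taken from Veritas or ProDecoder.

```python
from collections import Counter
from typing import Iterable, List


def ngrams(data: bytes, n: int) -> List[bytes]:
    """Return all contiguous n-byte subsequences of data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def ngram_counts(packets: Iterable[bytes], n: int) -> Counter:
    """Count how often each n-gram occurs across a set of raw packets."""
    counts: Counter = Counter()
    for packet in packets:
        counts.update(ngrams(packet, n))
    return counts


# Example: 4-grams of two SMTP-style commands.
counts = ngram_counts([b"MAIL FROM", b"MAIL FROM:<a>"], 4)
```

A packet shorter than n bytes simply contributes no n-grams. N-grams that belong to a protocol keyword, such as b"MAIL" here, accumulate higher counts than n-grams drawn from variable data, which is the skew the tools rely on.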
Veritas and ProDecoder consist of the following modules:
n-gram generation The input to this module is a set of raw network traces that are of the same protocol. These packets do not necessarily have to consist of (complete) protocol messages. Therefore, the tools are applicable to asynchronous protocols and sampled protocol data as well. In this module the raw packets are decomposed into subsequences of n contiguous bytes and the count for each such n-gram is stored. For example, if the parameter n = 4 then the n-grams from the message MAIL FROM are “MAIL”, “AIL ”, “IL F”, “L FR”, “ FRO” and “FROM”.
Keyword unit selection In Veritas, a Kolmogorov-Smirnov test filter is used to identify the frequent n-grams from the distribution created in the previous module. These frequent subsequences are called keyword units (called message units in [148]). The set of aforementioned n-grams, for example, can be discovered as keyword units, because they are encountered regularly. ProDecoder skips this module.
Keyword identification This module uses the keyword units collected in the previous module to infer keywords. Keywords are identified by searching for keyword units that often occur together. The aforementioned keyword units can be used to reconstruct the keyword MAIL FROM, because they occur together often. A message can have multiple keywords.
Message clustering This module clusters messages based on their keywords using standard machine learning techniques. Veritas uses the Jaccard index to calculate the similarity between messages. ProDecoder uses a standard hierarchical clustering method. The clusters are validated by using a metric from information theory known as the information bottleneck method [136]. This method captures the relevant information in a message with respect to the other messages by compressing the data. This enables ProDecoder to cluster messages based on their semantics, and distinguish among similar keywords belonging to different protocol messages.
Sequence alignment Similar to Discoverer, this module uses sequence alignment on the messages in each cluster to find the common byte sequences among them. These sequences represent the stable parts of the protocol messages, and can therefore be used to represent the message format. Veritas does not perform this final step. Instead, it uses the message format and the raw network traces to infer the protocol state machine.
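The Jaccard index mentioned in the message clustering module can be sketched as follows. The Jaccard computation itself is standard; the greedy threshold-based grouping around it, the threshold of 0.5 and the function names are assumptions made for illustration and do not describe the clustering procedure of Veritas.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty keyword sets
    return len(a & b) / len(a | b)


def cluster_by_keywords(keyword_sets: Sequence[Set[str]],
                        threshold: float = 0.5) -> List[List[int]]:
    """Greedy grouping (illustrative only): a message joins the first cluster
    whose first member is at least `threshold` similar, else starts a new one."""
    clusters: List[List[int]] = []
    for index, keywords in enumerate(keyword_sets):
        for cluster in clusters:
            representative = keyword_sets[cluster[0]]
            if jaccard(keywords, representative) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Each message is represented by the set of keywords found in it. Two messages that share most of their keywords obtain a similarity close to 1, whereas messages without common keywords obtain 0.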
Figure 7.3: Overview of Veritas’ and ProDecoder’s architectures (from [148, Figure 1] and [147, Figure 2]).
An overview of Veritas’ and ProDecoder’s architectures is shown in Figure 7.3. The authors have implemented and evaluated Veritas to infer message format specifications for SMTP and two binary peer-to-peer protocols. ProDecoder is evaluated on SMB and SMTP network traces. The experimental results show that both tools accurately parse the application protocols.
Using more advanced methods
The work of Wang et al. was extended by Krueger et al. in their Prisma tool [85]. Instead of using n-grams for finding similarities between different messages, the authors use a more elaborate method. To find common structures in the data, they first define a similarity measure between messages. This is done by embedding the messages in special vector spaces which are reduced via statistical tests to focus on discriminative features. In contrast to previous tools, the model constructed by Prisma can not only analyze but also simulate messages.