
6.3 Semantic-based Message Clustering

6.3.4 Format Clustering (Step 4)

Contextual clusters should be refined for two reasons: 1) to handle messages that carry no contextual information; and 2) to separate messages that include the same contextual information but have a different format. The format clustering step is the final stage of classification and is applied to each contextual cluster. Unlike the two previous steps, this clustering compares the alignment quality between messages to compute clusters.

We propose to extend both the Needleman & Wunsch (NW) [97] sequence alignment algorithm and the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [127] hierarchical clustering algorithm. Our modifications take semantics into account in both the alignment and the clustering phases. In the remainder, we detail these modifications.

Semantic Needleman & Wunsch

We first propose an extension of the NW algorithm to produce a semantic-aware common alignment between messages. NW can also be applied to a symbol, which represents the common alignment of a set of messages; in the following, we therefore use the term message to refer to both messages and symbols. As described in Section 3.1.2, the original version of NW aligns two messages in two steps: 1) it fills a matrix with the similarity score of each pair of message bytes and then 2) it executes a trace-back through this matrix. The matrix is filled according to the principle of optimality described by formula (6.1), which uses a gap penalty d and a similarity function S to align messages m1 and m2.

$$F_{i,j} = \max\big(F_{i-1,j-1} + S(m_1[i], m_2[j]),\; F_{i,j-1} + d,\; F_{i-1,j} + d\big) \qquad (6.1)$$

In previous works [14, 43, 88], the similarity function S is reduced to a simple function v(a, b) that either returns the value e if a == b or f if not.
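The recurrence of formula (6.1) with the purely syntactic function v(a, b) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names `v` and `fill_matrix` are hypothetical, indices are 0-based, and the border cells are initialised with accumulated gap penalties, which is the usual NW convention and an assumption here.

```python
def v(a, b, e=5, f=-5):
    """Syntactic similarity: e when the two values match, f otherwise."""
    return e if a == b else f


def fill_matrix(m1, m2, S=v, d=0):
    """Fill the NW matrix F following the recurrence of formula (6.1).

    Indices are 0-based: F[i][j] scores the prefixes m1[:i] and m2[:j].
    """
    rows, cols = len(m1) + 1, len(m2) + 1
    F = [[0] * cols for _ in range(rows)]
    # Border cells (assumption): a prefix aligned against nothing only costs gaps.
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + d
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + d
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + S(m1[i - 1], m2[j - 1]),  # align both elements
                F[i][j - 1] + d,                            # gap in m1
                F[i - 1][j] + d,                            # gap in m2
            )
    return F
```

The bottom-right cell `F[-1][-1]` holds the score of the best global alignment; any sequence type (strings, lists of half-bytes) can be passed as `m1` and `m2`.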

We propose to extend this syntactic comparison with the comparison of the semantic definition attached to each half-byte. Hence our function compares the value but also the semantic tags of each half-byte and preserves common semantic information if available. These semantic tags are computed and attached to half-bytes during the contextual clustering and every time an intra-symbol relationship is found.

We denote by ψ(a) = ⟨T, φa⟩ the multiset [132] of semantic tags attached to a half-byte a, with T the set of all semantic types and φa: T → N a function returning the multiplicity of a semantic tag in a. For example, ψ(a) = {{IP, IP, Username}} means that the IP and Username semantic tags are attached to half-byte a. In this example the multiplicity of IP is two, i.e. φa(IP) = 2. This situation may arise when the same semantic tag corresponds to different types of relationship. For example, a half-byte could correspond to both environmental and application information.

Now, let ψ(a) and ψ(b) be the multisets of semantic tags attached to half-bytes a and b, respectively. We denote that one is included in the other with the relation:

$$\psi(a) \subset \psi(b) \;\Leftrightarrow\; \forall e \in T,\ \varphi_a(e) \leq \varphi_b(e) \qquad (6.2)$$

and we define a size function as follows:

$$|\psi(a)| = \sum_{e \in T} \varphi_a(e) \qquad (6.3)$$
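The multiset of semantic tags, its inclusion relation and its size function map naturally onto Python's `collections.Counter`. The sketch below is illustrative only: the helper names `size` and `included` are hypothetical, and inclusion is assumed to be the usual multiset inclusion, where each multiplicity in ψ(a) is at most the corresponding multiplicity in ψ(b).

```python
from collections import Counter


def size(psi):
    """Size of a multiset: the sum of the multiplicities of its tags (6.3)."""
    return sum(psi.values())


def included(psi_a, psi_b):
    """Inclusion relation (6.2), assumed to be the usual multiset inclusion."""
    # A Counter returns 0 for a missing tag, so absent tags need no special case.
    return all(count <= psi_b[tag] for tag, count in psi_a.items())


# The example of the text: psi(a) = {{IP, IP, Username}}.
psi_a = Counter({"IP": 2, "Username": 1})
psi_b = Counter({"IP": 1})

print(psi_a["IP"])             # multiplicity of IP in psi(a)
print(size(psi_a))             # size of psi(a)
print(included(psi_b, psi_a))  # is psi(b) included in psi(a)?
```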

We compute the similarity between half-bytes a and b by comparing their values and their semantic tags. For the value comparison we keep the original definition of v(a, b), while for the semantic comparison we introduce two new parameters, g and h, which weight semantic matches and semantic mismatches respectively. Our experiments gave the best results with the following parameter values: d = 0, e = 5, f = −5, g = 6e and h = 6f.

Hence, as described in Table 6.1, our similarity function S returns a high score if the semantic tags match but the values differ and, conversely, a low score if the values match but the semantic tags do not.

Condition on semantic tags    Similarity score
ψ(a) ∩ ψ(b) = ∅               S(a, b) = v(a, b) + h × |ψ(a)| + h × |ψ(b)|
ψ(a) = ψ(b)                   S(a, b) = v(a, b) + g × |ψ(a)|
ψ(a) ⊂ ψ(b)                   S(a, b) = v(a, b) + g × |ψ(a)| + h × |ψ(b) \ ψ(a)|
ψ(a) ⊃ ψ(b)                   S(a, b) = v(a, b) + g × |ψ(b)| + h × |ψ(a) \ ψ(b)|

Table 6.1 – Similarity function S(a, b).
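The four cases of Table 6.1 can be sketched with a single expression: g weights the tags shared by both half-bytes and h weights the tags carried by only one of them. For disjoint, equal and included multisets this reduces exactly to the rows of the table. Applying the same expression to partially overlapping multisets, which the table does not list, is an assumption of this sketch, as are the function and constant names.

```python
from collections import Counter

E_MATCH, F_MISMATCH = 5, -5      # value match / mismatch (e and f)
G_SEM_MATCH = 6 * E_MATCH        # semantic match (g = 6e)
H_SEM_MISMATCH = 6 * F_MISMATCH  # semantic mismatch (h = 6f)


def size(psi):
    """Size of a multiset of semantic tags."""
    return sum(psi.values())


def v(a, b):
    """Syntactic comparison of two half-byte values."""
    return E_MATCH if a == b else F_MISMATCH


def similarity(value_a, tags_a, value_b, tags_b):
    """Semantic-aware similarity S(a, b) between two tagged half-bytes."""
    common = tags_a & tags_b   # multiset intersection
    only_a = tags_a - tags_b   # tags of a that b lacks
    only_b = tags_b - tags_a   # tags of b that a lacks
    return (v(value_a, value_b)
            + G_SEM_MATCH * size(common)
            + H_SEM_MISMATCH * (size(only_a) + size(only_b)))


# Same semantic tag, different values: the semantic match dominates.
print(similarity("a", Counter(["email"]), "b", Counter(["email"])))
# Same value, different semantic tags: the semantic mismatch dominates.
print(similarity("a", Counter(["email"]), "a", Counter(["firstname"])))
```

With the parameter values given above, a shared semantic tag outweighs a value mismatch, which is what lets the alignment keep tagged regions together.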

Once the matrix F is computed using our new similarity function and following formula (6.1), a trace-back step is performed. We rely on the original trace-back algorithm described in Section 3.1.2. We search for a path that starts at $F_{|m_1|+1,\,|m_2|+1}$ and that maximizes the alignment score back to the origin $F_{1,1}$. A diagonal move describes a perfect alignment between the two messages, while a vertical or a horizontal move implies the addition of a gap in one of the two messages. Such a trace-back produces two messages m1′ and m2′ containing the necessary gaps to align messages m1 and m2 under the constraints introduced by their inner syntactic and semantic similarities.
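The trace-back can be sketched as follows. The block is self-contained and therefore refills the matrix before walking it back. It uses 0-based indices, a plain value comparison as the default similarity function, and prefers diagonal moves when several moves give the same score; the tie-breaking rule, the `GAP` marker and the function names are assumptions of this sketch.

```python
GAP = "-"  # hypothetical gap marker


def v(a, b, e=5, f=-5):
    """Syntactic similarity, used here as the default S."""
    return e if a == b else f


def align(m1, m2, S=v, d=0):
    """Return the two aligned messages (m1', m2') as lists padded with GAP."""
    rows, cols = len(m1) + 1, len(m2) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + d
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + d
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + S(m1[i - 1], m2[j - 1]),
                          F[i][j - 1] + d,
                          F[i - 1][j] + d)

    # Walk back from the bottom-right cell to the origin.
    a1, a2 = [], []
    i, j = len(m1), len(m2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S(m1[i - 1], m2[j - 1]):
            a1.append(m1[i - 1]); a2.append(m2[j - 1])  # diagonal move
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            a1.append(m1[i - 1]); a2.append(GAP)        # vertical move: gap in m2
            i -= 1
        else:
            a1.append(GAP); a2.append(m2[j - 1])        # horizontal move: gap in m1
            j -= 1
    return a1[::-1], a2[::-1]
```

Both returned lists have the same length, and removing the gap markers from each gives back the original message.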

As illustrated in Figure 6.6, our semantic-based alignment preserves the semantic definitions when identifying token boundaries. In this example, without our solution, email addresses get split among multiple tokens and the first name definition is lost in a larger dynamic token.

[Figure 6.6 content: the messages "6thomasGAROOTQSthomas@gmail.fr" and "3lucCVROOTSDluc@hotmail.com", which carry the semantic tags firstname and email, split into dynamic and static tokens by Needleman & Wunsch and by the semantic N&W.]

Figure 6.6 – Alignments computed by Needleman & Wunsch and by our modified version.

We leverage these two aligned messages to produce a symbol that describes both. As illustrated in Figure 6.7, our semantic NW alignment produces two aligned messages that may contain gaps. We build a symbol out of these messages in three steps: 1) we create a single representation of the aligned messages as a succession of static and dynamic tokens; 2) we smooth token boundaries; and 3) we compute field definitions out of the smoothed tokens.

The objective of the first step is to find a succession of tokens that can describe the two aligned messages. To achieve this, we perform a pairwise comparison of the bytes of the two aligned messages. If both bytes are equal, we create a static token with their value; if not, we create a dynamic token to which we attach the two values. Once we have compared all the bytes of the two aligned messages, we obtain a sequence of one-byte static and dynamic tokens, as illustrated in Figure 6.7.
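This first step can be sketched as a position-by-position comparison of the two aligned messages. The `Token` class and the `build_tokens` function are hypothetical names introduced for illustration, and semantic tags are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    static: bool             # True for a static token, False for a dynamic one
    values: Tuple[str, ...]  # one value if static, the two observed values if dynamic


def build_tokens(aligned1, aligned2):
    """Create one token per position of two aligned messages of equal length."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned messages must have the same length")
    tokens = []
    for x, y in zip(aligned1, aligned2):
        if x == y:
            tokens.append(Token(True, (x,)))     # same byte in both messages
        else:
            tokens.append(Token(False, (x, y)))  # differing bytes (or a gap)
    return tokens


tokens = build_tokens("6ROOT", "3ROOT")
print([t.static for t in tokens])
```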

The second step smooths this sequence of one-byte tokens. To achieve this, we merge successive dynamic tokens, and likewise successive static tokens, that either share the same semantic definition or have no semantic definition. This step produces a sequence of smoothed tokens, as illustrated in Figure 6.7.
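One possible reading of this smoothing step is to merge every run of consecutive tokens that have the same kind (static or dynamic) and the same semantic tag, where "no tag" counts as a tag of its own. The sketch below follows that reading, which is an assumption; the `Token` structure, the use of an empty string for a gap and the sample data are hypothetical.

```python
from itertools import groupby
from typing import NamedTuple, Optional, Tuple


class Token(NamedTuple):
    static: bool
    semantic: Optional[str]  # semantic tag, or None when the token has none
    values: Tuple[str, str]  # value seen in each message ("" stands for a gap)


def smooth(tokens):
    """Merge consecutive tokens sharing the same kind and semantic tag."""
    smoothed = []
    for (static, semantic), run in groupby(tokens, key=lambda t: (t.static, t.semantic)):
        run = list(run)
        # Concatenate, message by message, the values of the merged tokens.
        merged = tuple("".join(t.values[k] for t in run) for k in range(2))
        smoothed.append(Token(static, semantic, merged))
    return smoothed


one_byte_tokens = [
    Token(False, None, ("6", "3")),
    Token(True, "firstname", ("l", "l")),
    Token(False, "firstname", ("o", "u")),
    Token(False, "firstname", ("u", "c")),
    Token(False, "firstname", ("i", "")),
    Token(False, "firstname", ("s", "")),
    Token(False, None, ("G", "C")),
    Token(False, None, ("A", "V")),
    Token(True, None, ("R", "R")),
    Token(True, None, ("O", "O")),
    Token(True, None, ("O", "O")),
    Token(True, None, ("T", "T")),
]
for token in smooth(one_byte_tokens):
    print(token)
```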

Finally, we create a symbol out of the sequence of smoothed tokens. In detail, if multiple successive tokens participate in the same semantic definition, we create a single field to represent them. A field is also created for each token that has no semantic definition. As described in Section 5.1.3, the values accepted by a field are represented as a token-tree. We therefore infer the token-tree of each field. If a field groups multiple tokens, we represent them with an aggregate node (denoted AGG in Figure 6.7). We also infer the type of the values accepted by each token. If a token accepts a single value (i.e. a static token), we insert it in the token-tree of the field. For each dynamic token, in contrast, we extract the type of the values it accepts. We rely on a heuristic that successively tests candidate types. We first test for strongly constrained types such as IPv4 addresses and then test whether the bytes are valid ASCII or decimals. If all the bytes are valid printable characters, we represent them as an ASCII sequence. If not, and if the token is one, two or four bytes long, we represent its values with a decimal type. Other values are represented as a sequence of raw bytes.
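The type heuristic for dynamic tokens can be sketched as a cascade of tests, from the most constrained type to the least. The text does not specify how an IPv4 address is recognised; this sketch assumes a textual dotted-quad form checked with Python's `ipaddress` module. The function names and the returned labels are hypothetical.

```python
import ipaddress


def is_ipv4_text(value: bytes) -> bool:
    """Assumption: an IPv4 address appears in textual dotted-quad form."""
    try:
        ipaddress.IPv4Address(value.decode("ascii"))
        return True
    except ValueError:  # also covers non-ASCII bytes (UnicodeDecodeError)
        return False


def is_printable_ascii(value: bytes) -> bool:
    """True when the value is non-empty and made only of printable ASCII bytes."""
    return len(value) > 0 and all(0x20 <= byte <= 0x7E for byte in value)


def infer_type(values):
    """Infer the type shared by all the values accepted by a dynamic token."""
    if not values:
        raise ValueError("a dynamic token accepts at least one value")
    if all(is_ipv4_text(v) for v in values):
        return "IPv4"      # strongly constrained type, tested first
    if all(is_printable_ascii(v) for v in values):
        return "ASCII"
    if all(len(v) in (1, 2, 4) for v in values):
        return "Decimal"   # one, two or four bytes long
    return "Raw"


print(infer_type([b"192.168.0.1", b"10.0.0.2"]))
print(infer_type([b"louis", b"luc"]))
print(infer_type([b"\x06", b"\x03"]))
print(infer_type([b"\x00\x01\x02"]))
```

Testing the constrained type first matters because a dotted-quad address is also a printable ASCII string and would otherwise be reported as plain ASCII.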

[Figure 6.7 content: the messages m1 = "6louisGAROOTQSlouis@gmail.fr" and m2 = "3lucCVROOTSDluc@hotmail.com" with their firstname and email semantic tags; the aligned messages m1' and m2' with gaps ("-"); the one-byte tokens (D for a dynamic token, a letter such as "T" for a static token holding that value); the smoothed tokens; and the resulting symbol with fields f0 to f5, whose token-trees combine AGG, ASCII, Decimal and static ("ROOT", "l") nodes.]

Figure 6.7 – The different steps engaged in the construction of a symbol out of two messages.