Combination of features

3.4 Connection of nodes in the CASG

3.4.4 Combination of features

The characterization of the arcs in a CASG depends on the relationship between the involved traces. Each CASG defines certain conditions of existence of an arc. We say that there is a correlation between two traces or that they are correlated if the conditions to build an arc between them are met.

Trace correlation. In Statistics, the term ‘correlation’ denotes the “linear relationship between pairs of variables for quantitative data”, and it is quantified by a correlation coefficient [Witte 2010]. In Security, the term is taken in the broader sense of a mutual relationship between traces [OED 2018a], and it is also applied to the process of finding a set of correlated traces to group them into an attack scenario [Kruegel 2004]. We can therefore say that “there is a correlation between these two traces” (equivalent to saying that “these two traces are correlated”), but we can also refer to a method for building attack scenarios from traces as a “correlation method”. As with the correlation coefficient used in Statistics, we can define a correlation coefficient for each pair of traces in a dataset, giving it a value between 0 and 1 as we did with the features presented in the previous sections. We refer to this coefficient as the correlation weight to differentiate it from the statistical coefficient, which can take a value between −1 and 1. Depending on the author, it also receives other names, such as ‘causal correlation’ [AmirHaeri 2009] or ‘correlation index’ [Colajanni 2010].

The correlation weight. This element can be used to quantify the conditions for linking two traces [Soleimani 2012, Bateni 2013a] by combining several partial conditions. These conditions are determined by the three kinds of feature previously defined: similarity, time-based and context-based. The correlation weight can be based on a single feature, such as equality of destination IP address ($f_1(\mathit{ipdst}_a, \mathit{ipdst}_b)$) [Du 2009] or an IP address in common ($f_2(N, M)$ with $N = M = \{\mathit{ipsrc}, \mathit{ipdst}\}$) [Xuewei 2014]. However, the most common approach is the combination of several features [Cipriano 2011, Saad 2014, Pei 2016], in which the relationship between the traces in a CASG is considered to depend on several aspects of the traces [Mathew 2009].
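As an illustration of a correlation weight based on a single feature, the two examples above could be sketched in Python as follows. This is only a minimal sketch: the dictionary representation of a trace, the function names `same_destination` and `common_ip`, and the reading of $f_2$ as a non-empty intersection of address sets are assumptions made for the example, not definitions taken from the cited works.

```python
from typing import Dict

# Hypothetical trace representation: attribute name -> value.
Trace = Dict[str, str]


def same_destination(a: Trace, b: Trace) -> float:
    """Equality of destination IP address, in the spirit of f1(ipdst_a, ipdst_b)."""
    return 1.0 if a["ipdst"] == b["ipdst"] else 0.0


def common_ip(a: Trace, b: Trace) -> float:
    """Assumed reading of f2(N, M) with N = M = {ipsrc, ipdst}:
    1.0 when the two traces share at least one IP address."""
    ips_a = {a["ipsrc"], a["ipdst"]}
    ips_b = {b["ipsrc"], b["ipdst"]}
    return 1.0 if ips_a & ips_b else 0.0


# Two made-up traces: the destination of the first is the source of the second.
scan = {"ipsrc": "10.0.0.1", "ipdst": "10.0.0.5"}
pivot = {"ipsrc": "10.0.0.5", "ipdst": "10.0.0.9"}

print(same_destination(scan, pivot))  # 0.0
print(common_ip(scan, pivot))         # 1.0
```

With these two made-up traces, the equality of destination addresses does not hold, whereas the shared address 10.0.0.5 is enough for the second feature to link them.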

A correlation weight can be easily built as a mathematical combination of similarity functions, because the properties of these functions are preserved under addition and multiplication [Chen 2007]. If a time-based or a context-based feature is expressed as a function with the same properties (see page 45), it can also be combined with similarity features [Kavousi 2014, Wang 2016]. We give the generic name of feature function to a function that numerically expresses a feature associated with a pair of traces and that follows the properties of similarity functions. Given that, we define a correlation function to compute the correlation weight as $C : F \times \Theta \times \Theta \mapsto \mathbb{R}$, with $F$ a set of feature functions. All correlation and feature functions throughout this thesis have a range $[0, 1]$.

Once a correlation function is defined, a threshold has to be established to determine above which value an arc in the CASG exists between two traces. If the result of the correlation function is preserved and attached to the arc, the model is a weighted CASG. Otherwise, the CASG is non-weighted. Preserving the correlation value between the traces gives an idea of the strength of each arc under the imposed conditions.
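The role of the threshold and the difference between a weighted and a non-weighted CASG can be sketched in Python as follows. This is an illustrative sketch under several assumptions: the function `build_casg`, the toy correlation `shared_attributes`, the representation of arcs as index pairs and the choice of keeping an arc when the weight is greater than or equal to the threshold are all hypothetical and are not taken from the text.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Trace = Dict[str, str]
Arcs = Dict[Tuple[int, int], Optional[float]]


def build_casg(traces: List[Trace],
               correlate: Callable[[Trace, Trace], float],
               threshold: float,
               weighted: bool = True) -> Arcs:
    """Return the arcs of a CASG as {(i, j): weight}, with i < j.

    Assumption: traces are sorted chronologically, so an arc goes from the
    earlier trace i to the later trace j.
    """
    arcs: Arcs = {}
    for (i, a), (j, b) in combinations(enumerate(traces), 2):
        weight = correlate(a, b)
        if weight >= threshold:  # assumed comparison: the arc exists
            # A weighted CASG keeps the correlation value on the arc.
            arcs[(i, j)] = weight if weighted else None
    return arcs


def shared_attributes(a: Trace, b: Trace) -> float:
    """Toy correlation function: share of IP attributes with equal values."""
    keys = ("ipsrc", "ipdst")
    return sum(a[k] == b[k] for k in keys) / len(keys)


traces = [
    {"ipsrc": "10.0.0.1", "ipdst": "10.0.0.5"},
    {"ipsrc": "10.0.0.2", "ipdst": "10.0.0.5"},
    {"ipsrc": "10.0.0.1", "ipdst": "10.0.0.5"},
]

print(build_casg(traces, shared_attributes, threshold=0.75))
# {(0, 2): 1.0}
print(build_casg(traces, shared_attributes, threshold=0.5, weighted=False))
# {(0, 1): None, (0, 2): None, (1, 2): None}
```

Lowering the threshold adds arcs to the graph, and only the weighted variant keeps the information that the arc between the first and the third trace is stronger than the other two.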

We review below the most used methods for combining the results from several features to derive a correlation weight:

• Maximum of a set. One option to combine features is to take the maximum value among the results from the feature functions. This method is often used to reduce the set of features involved in computing the correlation weight. As an example, consider the atomic similarity function based on prefix similarity ($f_3$), conceived to work with IP addresses. Imagine that each trace has two attributes based on IP addresses, ipsrc and ipdst, as is usually the case. We can then apply $f_3$ in four different ways: $f_3(\mathit{ipsrc}_a, \mathit{ipsrc}_b)$, $f_3(\mathit{ipdst}_a, \mathit{ipdst}_b)$, $f_3(\mathit{ipsrc}_a, \mathit{ipdst}_b)$ and $f_3(\mathit{ipdst}_a, \mathit{ipsrc}_b)$. Since a multi-step attack can involve stepping stones, the four functions are not equally appropriate to define all the arcs in the CASG (see page 47). A crossed comparison would signal the hop to the stepping stone, but not the relationship between actions performed in the same host. Choosing the maximum among the results returned by the four functions guarantees that we select the most relevant value for each pair of traces in the attack.

Some authors follow this logic. Shittu, for instance, chooses the highest similarity between the results obtained from $f_3(\mathit{ipsrc}_a, \mathit{ipsrc}_b)$ and $f_3(\mathit{ipsrc}_a, \mathit{ipdst}_b)$, and does the same with the ones from $f_3(\mathit{ipdst}_a, \mathit{ipdst}_b)$ and $f_3(\mathit{ipdst}_a, \mathit{ipsrc}_b)$ [Shittu 2016]. Li et al. [Li 2016], however, preserve only the maximum among $f_3(\mathit{ipsrc}_a, \mathit{ipsrc}_b)$, $f_3(\mathit{ipdst}_a, \mathit{ipdst}_b)$, $f_3(\mathit{ipsrc}_a, \mathit{ipdst}_b)$ and $f_3(\mathit{ipdst}_a, \mathit{ipsrc}_b)$. For their part, Qiao et al. use a formula to combine them [Qiao 2012]:

$$f_{Qiao}(\theta_a, \theta_b) = \frac{\max\{f_3(s_a, s_b) + f_3(d_a, d_b),\; f_3(s_a, d_b) + f_3(d_a, s_b)\}}{2} \tag{3.13}$$

with $s = \mathit{ipsrc}$ and $d = \mathit{ipdst}$.


$$C_1(F, \theta_a, \theta_b) = \max F \tag{3.14}$$

• Inclusive disjunction. Given a list of features, the resulting correlation weight based on inclusive disjunction can only take two values: 1 if at least one of the features in the set holds, or 0 if none of them holds. In other words, for the weight to be 1, at least one of the feature functions in the set $F$ representing the selected features has to return a value of at least a certain threshold $\tau$:

$$C_2(F, \theta_a, \theta_b) = \begin{cases} 1, & \text{if } \exists f(x, y) \in F \mid f(\theta_a, \theta_b) \geq \tau \\ 0, & \text{otherwise} \end{cases} \tag{3.15}$$

For instance, Chen et al. [Chen 2006] consider that two traces belong to the same scenario if they have the same source IP address, the same destination IP address or the same destination port. Cipriano et al. [Cipriano 2011] follow a similar logic, but take crossed equality of IP addresses instead of destination port equality. In these cases the features are binary, so $\tau = 1$.

• Arithmetic mean. The correlation weight is the average of several feature functions. Given that $N$ is the number of feature functions $f(x, y)$ contained in $F$:

$$C_3(F, \theta_a, \theta_b) = \frac{1}{N} \sum_{f(x,y) \in F} f(\theta_a, \theta_b) \tag{3.16}$$

This is the method used by Saad [Saad 2012, Saad 2014], who incorporates $f_1(\mathit{ipsrc}_a, \mathit{ipsrc}_b)$, $f_1(\mathit{ipdst}_a, \mathit{ipdst}_b)$ and $f_1(\mathit{type}_a, \mathit{type}_b)$.

• Weighted sum. A weight $0 \leq w_i \leq 1$ is assigned to each of the considered features. These weights should not be confused with the correlation weight: each $w_i$ represents the importance of one feature in the final result, which is the correlation weight. We choose the sum of all weights to be always equal to 1 [Li 2016, Haas 2018], to preserve the range of the correlation function. Some authors preserve it instead by normalizing the result afterwards [Kavousi 2014, Shittu 2016].

The weighted sum is given by the following expression, which is equivalent to the arithmetic mean if all the weights are equal:

$$C_4(F, \theta_a, \theta_b) = \sum_{f_i(x,y) \in F} w_i f_i(\theta_a, \theta_b) \tag{3.17}$$

There are many examples in the literature of correlation functions using weighted sums. Wang and Chiou [Wang 2016] propose a set of eight features. Four of them are similarity features applied to the source and destination IP addresses, two based on equality and two based on prefix similarity ($f_1(d_a, s_b)$, $f_1(s_a, d_b)$, $f_3(s_a, s_b)$ and $f_3(d_a, d_b)$, with $s = \mathit{ipsrc}$ and $d = \mathit{ipdst}$). They also use two more similarity features based on equality of port numbers ($f_1(\mathit{psrc}_a, \mathit{psrc}_b)$ and $f_1(\mathit{pdst}_a, \mathit{pdst}_b)$), a time-based feature based on a threshold limit ($f^t_1$), and a context-based feature quantifying the probability of appearance of traces with the same type. For their part, Li et al. [Li 2016] propose three features: a hierarchy-based similarity feature on the destination port number ($f_5(\mathit{pdst}_a, \mathit{pdst}_b)$); a similarity feature on the source and destination IP addresses, which combines the maximum of a set of atomic prefix similarity functions (see page 58); and a time-based feature based on a Gaussian formula ($f^t_3$). Finally, Qiao et al. [Qiao 2012] combine a similarity function based on IP addresses (see equation 3.13), a reciprocal time-based function ($f^t_5$) and an equality-based similarity function working with the type of the traces ($f_1(\mathit{type}_a, \mathit{type}_b)$).

• Sigmoid weighted function. Pei et al. [Pei 2016] combine the similarity features shown in Table 3.1 using a sigmoid weighted function:

$$C_5(F, \theta_a, \theta_b) = S\left(\sum_{f_i(x,y) \in F} w_i f_i(\theta_a, \theta_b)\right) = \frac{1}{1 + e^{-\sum_{f_i(x,y) \in F} w_i f_i(\theta_a, \theta_b)}} \tag{3.18}$$

They apply a sigmoid to the weighted sum because the learning algorithms they use to determine the values of the weights do not guarantee a non-negative result. The sigmoid function maps the weighted sum to a bounded range between 0 and 1.

• Machine learning techniques. The correlation weight of a pair of traces can also be obtained by applying machine learning techniques to a set of features. Zhu and Ghorbani [Zhu 2006] proposed in 2006 to do so with a multi-layer perceptron, the most widely used type of neural network [Kruse 2013]. They use six features, the same as Wang and Chiou [Wang 2016] with the exception of $f_1(\mathit{psrc}_a, \mathit{psrc}_b)$ and $f^t_1$. The results of the feature functions become the input of the neural network, which outputs a value between 0 and 1 corresponding to the correlation weight. The network is trained using a test dataset of alerts.
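The closed-form combination methods reviewed in the list above (equations 3.13 to 3.18) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of any of the cited authors: the prefix similarity $f_3$ is defined in a previous section, so `prefix_similarity` is only an assumed stand-in (fraction of leading bits shared by two IPv4 addresses), and the names `ip_feature`, `c_max`, `c_disjunction`, `c_mean`, `c_weighted`, `c_sigmoid` and `f_qiao`, as well as the threshold name `tau`, are hypothetical.

```python
import math
from ipaddress import IPv4Address
from typing import Callable, Dict, Sequence

Trace = Dict[str, str]
Feature = Callable[[Trace, Trace], float]


def prefix_similarity(ip_a: str, ip_b: str) -> float:
    """Assumed stand-in for f3: fraction of leading bits shared by two IPv4 addresses."""
    diff = int(IPv4Address(ip_a)) ^ int(IPv4Address(ip_b))
    return (32 - diff.bit_length()) / 32


def ip_feature(attr_a: str, attr_b: str) -> Feature:
    """Feature comparing attribute attr_a of the first trace with attr_b of the second."""
    return lambda a, b: prefix_similarity(a[attr_a], b[attr_b])


def c_max(features: Sequence[Feature], a: Trace, b: Trace) -> float:
    """Maximum of a set (eq. 3.14)."""
    return max(f(a, b) for f in features)


def c_disjunction(features: Sequence[Feature], a: Trace, b: Trace,
                  tau: float = 1.0) -> float:
    """Inclusive disjunction (eq. 3.15): 1 if any feature reaches the threshold."""
    return 1.0 if any(f(a, b) >= tau for f in features) else 0.0


def c_mean(features: Sequence[Feature], a: Trace, b: Trace) -> float:
    """Arithmetic mean (eq. 3.16)."""
    return sum(f(a, b) for f in features) / len(features)


def c_weighted(features: Sequence[Feature], weights: Sequence[float],
               a: Trace, b: Trace) -> float:
    """Weighted sum (eq. 3.17); the weights are expected to add up to 1."""
    return sum(w * f(a, b) for w, f in zip(weights, features))


def c_sigmoid(features: Sequence[Feature], weights: Sequence[float],
              a: Trace, b: Trace) -> float:
    """Sigmoid of the weighted sum (eq. 3.18); weights may be negative here."""
    return 1 / (1 + math.exp(-c_weighted(features, weights, a, b)))


def f_qiao(a: Trace, b: Trace) -> float:
    """Combination of direct and crossed IP comparisons (eq. 3.13)."""
    direct = (prefix_similarity(a["ipsrc"], b["ipsrc"])
              + prefix_similarity(a["ipdst"], b["ipdst"]))
    crossed = (prefix_similarity(a["ipsrc"], b["ipdst"])
               + prefix_similarity(a["ipdst"], b["ipsrc"]))
    return max(direct, crossed) / 2


# The four ways of applying f3 to the IP attributes of two traces.
features = [ip_feature("ipsrc", "ipsrc"), ip_feature("ipdst", "ipdst"),
            ip_feature("ipsrc", "ipdst"), ip_feature("ipdst", "ipsrc")]

# Made-up stepping-stone case: the destination of a is the source of b.
a = {"ipsrc": "192.168.1.10", "ipdst": "10.0.0.5"}
b = {"ipsrc": "10.0.0.5", "ipdst": "10.0.0.9"}

print(c_max(features, a, b))   # 1.0
print(c_mean(features, a, b))  # 0.46875
print(f_qiao(a, b))            # 0.5
```

In this made-up stepping-stone case the crossed comparison between the destination of the first trace and the source of the second returns 1, so the maximum keeps the hop, whereas the arithmetic mean dilutes it among the other three comparisons.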

3.5 Summary

An overview of what multi-step attacks are and how they are considered in the literature has been presented in this chapter. We have started by comparing them with single-step attacks (section 3.1.1), classifying them according to different criteria (3.1.2) and explaining the particularities of a specific type of multi-step attack, the APT (3.1.3). We have then given several examples of multi-step attacks: WannaCry (3.2.1); LLDoS 1.0 and 2.0.2 (3.2.2); HuMa (3.2.4), and UNB ISCX island-hopping (3.2.3). We have continued with an explanation of how they can be represented as a sequence of traces (3.3.1) and, after reviewing the languages to model attacks in section 3.3.2, how such a sequence can be modeled as a graph (3.3.3), called CASG. The arcs of this graph can be built according to several features, which have been defined next: similarity features (3.4.1), time-based features (3.4.2) and context-based features (3.4.3). We have finally explained how these features can be combined to build the arcs of the CASG (3.4.4). Now that multi-step attacks have been defined and we know how they can be modeled, we will see in the following chapter how the problem of detecting them has been addressed in the literature.


In document Modélisation et identification de cyberattaques multi-étapes dans des ensembles d'événements (Page 74-80)
