Global Reasoning: Graph Structure Analysis


3.5 Hierarchical Reasoning Framework

3.5.3 Global Reasoning: Graph Structure Analysis

The global reasoning process aims to identify the set of highly correlated hosts that belong to the coordinated attack scene from the structure of the evidence graph. Global reasoning is based on the assumption that, during an attack, there must be a strong correlation between members of the attack group, and that this correlation is exhibited through certain structural characteristics of the evidence graph. For example, a DDoS attack or an active scanner would appear as a star-like topology in the evidence graph, while an unusually long path could indicate a stepping stone chain. Together with the functional state estimates from local reasoning, these distinctive graph components provide a first approximation of the coordinated attack scenario.

We approach the global reasoning task as a group detection problem: discovering potential members of an attack group given the observed intrusion evidence. The attack group detection procedure works in two phases: (1) create new attack groups by generating a seed for each group and (2) expand existing groups by membership testing. Note that, as shown in Fig. 3.2, the primary attack clustering analysis is based on the aggregated evidence graph G_A.

3.5.3.1 Seed Generation

In the seed generation phase, we aim to discover important nodes in the evidence graph to serve as initial seeds of an attack group. In essence, we would like to select entities that are both functionally significant and structurally important. From the functional perspective, the investigator generally has two options for seed selection. In a forward search, it is straightforward to select external hosts whose Attacker state is highly activated in the local reasoning process as initial seeds. In a backward search, hosts in the trusted domain whose Victim or Stepping Stone state is highly activated are good candidates for initial seeds.

From the graph structure perspective, we use the eigenvector centrality metric to evaluate the importance of nodes in the evidence graph. Eigenvector centrality is a refined version of the simple degree metric. Instead of just counting the number of edges incident to a node, the eigenvector centrality score is based on both the number and the quality of edges. The intuition is that an edge to a node with a high eigenvector centrality score contributes more than an edge to a node with a low score. Let the centrality score of node i be denoted by x_i, which is proportional to the sum of the centrality scores of i’s neighbors:

$$x_i = \frac{1}{\lambda} \sum_{j=1}^{n} A_{i,j}\, x_j \tag{3.6}$$

where A is the adjacency matrix of the aggregated evidence graph G_A, n is the number of nodes, and λ is a constant. The equation can be rewritten in matrix form as Ax = λx, so that x is an eigenvector of the adjacency matrix with eigenvalue λ. In practice, we apply the power iteration method to obtain the dominant eigenvector of the evidence graph’s adjacency matrix, whose entries are the eigenvector centrality scores of the corresponding graph nodes. By integrating the eigenvector centrality rank with functional state analysis, we can easily formulate queries for efficient seed selection. For example, “Among all nodes that have Attacker state activated higher than T, which one has the highest structural significance?”
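The seed query above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the adjacency matrix is taken to be symmetric and non-negative, the function names `eigenvector_centrality` and `select_seed`, the host names, and the activation values are all hypothetical, and the iteration multiplies by A + I rather than A. That shift is our choice for the sketch; it leaves the eigenvectors unchanged but keeps plain power iteration from oscillating on bipartite structures such as stars.

```python
from math import sqrt


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Dominant eigenvector of a symmetric, non-negative adjacency matrix.

    `adj` is a list of lists. Power iteration is run on (A + I), which has
    the same eigenvectors as A (an assumption of this sketch, see lead-in).
    """
    n = len(adj)
    x = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        # y = (A + I) x
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in y))
        y = [v / norm for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


def select_seed(nodes, attacker_activation, centrality, threshold):
    """Among nodes with Attacker activation above `threshold`,
    return the one with the highest centrality (or None)."""
    eligible = [i for i in range(len(nodes)) if attacker_activation[i] > threshold]
    if not eligible:
        return None
    return nodes[max(eligible, key=lambda i: centrality[i])]


# Hypothetical example: ext1 is the hub of a star (h1, h2, h3); ext2 touches h3 only.
nodes = ["ext1", "h1", "h2", "h3", "ext2"]
adj = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
attacker_activation = [0.9, 0.1, 0.1, 0.1, 0.8]  # made-up local reasoning output

scores = eigenvector_centrality(adj)
seed = select_seed(nodes, attacker_activation, scores, threshold=0.5)
print(seed)  # ext1
```

Both external hosts pass the activation filter, but the hub of the star is preferred because its neighbors are themselves better connected.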

3.5.3.2 Group Expansion

In the group expansion process, we expect to discover nodes that have a strong correlation with the initial seeds and add them to the attack group. Here we define the distance between two neighbor nodes s and t as the reciprocal of the weight of the edge e connecting them in the aggregated evidence graph G_A, i.e. d(s, t) = d(t, s) = 1/w(e). Smaller distance values represent stronger correlation between two nodes. Group expansion is an iterative process which consists of three basic steps. First, we identify all external neighbors of the current seed members as the set of candidate nodes. Second, a ranked list is formed based on the distance between each candidate node and the current group members. Finally, the ranked list is cut at a predefined threshold, and candidate nodes within the distance threshold are added as new seed members of the group. If no candidate node is within the distance threshold, the group expansion procedure terminates. The procedure is listed in Algorithm 4.

input : Evidence graph G, initial seed node v_s, distance threshold D, step size n
output: The derived attack group group

begin
    group ← {v_s}; neighbors ← ∅; candidates ← ∅;
    repeat
        foreach node v in the set group do
            neighbors ← FindNeighbors(G, v);
            candidates ← candidates ∪ neighbors;
        end
        foreach node v in the set candidates do
            v.distance ← GetDistance(v, group);
        end
        new ← RankCandidates(candidates, D, n);
        group ← group ∪ new;
    until no new member is found;
end

Algorithm 4: Basic attack group expansion process

The FindNeighbors function returns the set of external neighbor nodes of the current seed members. The GetDistance function evaluates the distance between a candidate node and its nearest seed member. In the RankCandidates function, candidate nodes whose distance exceeds the threshold are discarded, and a ranked list of the remaining candidate nodes is formed in order of increasing distance. Given the step size n, we take a greedy approach: the top n candidate nodes in the ranked list are added to the attack group as new seed members.

In essence, the group expansion process can be regarded as a seed-based single-link clustering approach. It belongs to the class of hierarchical agglomerative clustering algorithms, where each node starts as its own cluster and clusters are merged iteratively. The conceptual difference is that we are only interested in clusters formed around “seed nodes” that possess both functional and structural significance. We apply the single linkage distance metric in group expansion, i.e. for membership testing we compare the minimum distance between a candidate node and the current seeds to the distance threshold. Other distance metrics such as complete and average linkage are less appropriate because, in scenarios like a stepping stone chain, the suspicious entity may have a strong correlation with only a single member of the current seed group.
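The expansion procedure can be illustrated with a short Python sketch. It is a hedged reading of Algorithm 4, not the original implementation: the graph is assumed to be an undirected dict-of-dicts mapping each node to its neighbors and positive edge weights, the snake_case function names and the example weights are hypothetical, and the candidate set is recomputed in each round instead of being accumulated, which keeps existing members out of the ranking.

```python
def find_neighbors(graph, group):
    """External neighbors: adjacent to some group member, not in the group."""
    return {u for v in group for u in graph[v] if u not in group}


def get_distance(graph, v, group):
    """Single linkage: reciprocal edge weight to the nearest group member."""
    return min(1.0 / w for u, w in graph[v].items() if u in group)


def rank_candidates(distances, max_distance, step):
    """Discard candidates beyond the threshold, rank the rest by
    increasing distance, and keep the top `step` entries."""
    within = sorted((d, v) for v, d in distances.items() if d <= max_distance)
    return [v for _, v in within[:step]]


def expand_group(graph, seed, max_distance, step):
    """Greedy seed-based expansion; stops when no candidate qualifies."""
    group = {seed}
    while True:
        candidates = find_neighbors(graph, group)
        distances = {v: get_distance(graph, v, group) for v in candidates}
        new = rank_candidates(distances, max_distance, step)
        if not new:
            return group
        group.update(new)


# Hypothetical aggregated evidence graph: a chain A-B-C-D plus a weak edge A-E.
graph = {
    "A": {"B": 5.0, "E": 0.5},
    "B": {"A": 5.0, "C": 4.0},
    "C": {"B": 4.0, "D": 1.0},
    "D": {"C": 1.0},
    "E": {"A": 0.5},
}

print(sorted(expand_group(graph, "A", max_distance=0.5, step=1)))  # ['A', 'B', 'C']
```

Node C is admitted even though it has no edge to the seed A, because it is close to B. This is the chain-following behavior that motivates single linkage for stepping stone scenarios.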

Because attack traces differ so widely, selecting appropriate thresholds is not straightforward. Relaxing a threshold (for the distance threshold D, raising it so that weaker correlations are accepted) generally leads to a higher rate of false positives, while tightening it could result in more false negatives. In practice the analyst usually needs to compare the results obtained under a set of incremental thresholds. Moreover, the ranked list itself is often more informative than any single best cut-off threshold.
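One way to compare incremental thresholds without rerunning the expansion is sketched below. It rests on an observation that is ours rather than the text's: if expansion always runs to termination, the step size only affects the order in which nodes join, so a node ends up in the group exactly when some path from the seed uses only edges of distance at most D. The smallest such D per node (a minimax path distance) can be computed with a Dijkstra-style search. The function names and the example graph are hypothetical.

```python
import heapq


def joining_thresholds(graph, seed):
    """For each node reachable from `seed`, the smallest distance threshold
    at which single-linkage expansion would admit it."""
    best = {seed: 0.0}
    heap = [(0.0, seed)]
    while heap:
        level, v = heapq.heappop(heap)
        if level > best[v]:
            continue  # stale heap entry
        for u, w in graph[v].items():
            # Bottleneck of the path through v: its largest edge distance.
            cand = max(level, 1.0 / w)
            if cand < best.get(u, float("inf")):
                best[u] = cand
                heapq.heappush(heap, (cand, u))
    return best


def groups_by_threshold(graph, seed, thresholds):
    """Group membership under each candidate threshold."""
    levels = joining_thresholds(graph, seed)
    return {d: sorted(v for v, lvl in levels.items() if lvl <= d) for d in thresholds}


# Same hypothetical graph shape as a stepping stone chain with one weak side edge.
graph = {
    "A": {"B": 5.0, "E": 0.5},
    "B": {"A": 5.0, "C": 4.0},
    "C": {"B": 4.0, "D": 1.0},
    "D": {"C": 1.0},
    "E": {"A": 0.5},
}

levels = joining_thresholds(graph, "A")
ranked = sorted(levels, key=lambda v: levels[v])
print(ranked)  # ['A', 'B', 'C', 'D', 'E']
print(groups_by_threshold(graph, "A", [0.21, 0.5, 1.5, 2.5]))
```

The ranked list shows where the large jumps in joining threshold occur (here between C and D), which is often a more useful guide for the analyst than a single cut-off chosen in advance.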
