4.4 Finding the Needle in a Haystack
4.4.3 Findings and Discussion
Let us take a closer look at why the use of structural information (especially, the ancestor-descendant relationship) is important in reducing false positives. We begin our analysis by focusing on the node-level similarity scores using the weighted and non-weighted Jaccard Index calculated between the HTTP flows in the archival logs (i.e., Dataset 1 in Table 4.2) and those in the malware index. The results are shown as a cumulative distribution function in Figure 4.6. Notice that over 98% of the flows in the dataset had a similarity score below the conservative lower bound thresholds (of 0.23/0.2) derived from Algorithm 2 while all nodes associated with malicious trees had a node similarity score of 0.22 or higher.
Figure 4.6: The CDF of node similarity scores between our test dataset and the malware index. “All” represents similarity scores between all nodes in the dataset and the malware index, while “Malicious” represents the node scores for trees in the dataset that were flagged as malicious(W = weighted).
The similarity scores for both node similarity metrics followed the same distribution; however, the non- weighted Jaccard Index generated on average lower similarity scores than the weighted approach (weighted mean = 0.10, non-weighted = 0.09) with similar standard deviations. As can be seen from Figure 4.6, the similarity gap between the malicious and benign nodes is smaller in the non-weighted case than in the weighted case. This leads to more node-level false positives, and, as a result, structural false positives as seen in the case of FlashPack. Because the weighted Jaccard Index is weighted according to the importance of the
feature, an unweighted version will be more likely to have false positives due to common features that are prevalent both in benign and malicious nodes. Even though the weighted version provides marginally better accuracy, the non-weighted Jaccard Index may be more desirable from an operational perspective because it does not require any feature training.
As shown in Table 4.6, there were a large number of false positives if only node-level similarity (like Snort signatures that focus only on individual flows) was considered for both weighted and non-weighted similarities. The false positive rate started to decrease when consideringmultiple nodesin a tree (without considering structure), as the probability of a benign website having two or more nodes in the same tree that match malicious patterns was an order of magnitude smaller. The false positive rate decreased further by another order of magnitude once astructural scorewas established using tree edit distance. After imposing theancestor-descendant requirementon the tree structure, the false positives were reduced to 68 for the total of over 800 million flows. The results show that tree structure is the primary determining factor in reducing false positives. Threshold (Alg 1) non-weighted Threshold (Alg 1) weighted Tight Threshold weighted Single Node 2,141,493 360,150 141,130
Multi-node (no structure) 79,321 32,130 5,878
Structural 5,967 3,800 420
Structural (w/ restriction) 109 68 68
Table 4.6: FPs from the approach when considering single node matching, multi-node matching without considering structure, structure similarity, and structural similarity with ancestor-descendant requirement for weighted and non-weighted similarity.
As shown in Table 4.6 there is a several orders of magnitude reduction in the number of nodes (flows) that are similar to nodes in the index, w.r.t the total number of nodes (flows) in a given dataset (Table 4.1). This result can be leveraged, by only building trees for flow clusters that have multiple similar nodes in common with a tree in the malware index, allowing the scalable application of much more computationally expensive (and correct) tree building techniques to the wire (i.e., (Neasbitt et al., 2014)) when full payloads are available. Table 4.6 also shows the detection rates under the optimal tight node-level similarity thresholds using weighted similarity. This bound is the maximum node similarity threshold allowed to still detect all true positives, and was calculated using the ground truth dataset. Even with the optimal bound, structural information was still needed to reduce the false positives.
Empirical analysis also showed that in the majority of cases, a relatively low minimum structural threshold (less than 0.05) for the tree-similarity score was sufficient because the flagged tree was malicious in almost every case. The structural similarity threshold is specific to the similarity metric chosen and was set conservatively low to maximize true positives with few false positives, creating a clear separation between benign and malicious cases. Figure 4.7 shows the cumulative distribution of the tree edit distance scores for the malicious subtrees analyzed. The scores ranged anywhere from 0.2 to a perfect 1.0 due to a few factors. In some cases there may be multiple nodes added or missing from the subtree as compared to the malware index, causing an imperfect score. The second reason was that, especially in the case of ClickJack, the exploit may lead into other exploits or websites causing the subtree in the dataset to look different from any of the ones in the malware index. Taken together, these findings underscore the power of using structural information and subtree mining, particularly when there may be subtrees that are incomplete or contain previously unseen nodes compared to those encoded in the index. The combination allows for maximum flexibility and reduced false negatives and false positives over contemporary approaches.
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Tree Similarity Scores
0.0
0.2
0.4
0.6
0.8
1.0
Cumulative Distribution Function
Figure 4.7: The CDF of tree similarity scores for malicious subtrees.