5. Proposed Model and Prototype
5.5 Detection Framework Implementation
Based on the heuristics analyzed and the fingerprinting model shown, this section presents a botnet detection framework. The framework comprises five steps:
1. Diversity Index calculation for the proposed heuristics, by Source and Destination IP, for each twenty-minute time window, using the Simpson’s Diversity Index equations presented earlier.
2. Anomaly Score and Fingerprint computation for each Source and Destination IP, using the diversity indexes of the different heuristics. Unidirectional flows are also detected in this step. All calculations are performed for each of the twenty-minute time windows mentioned in the previous step.
3. Identification of communications with high (above a predefined threshold) TCP or UDP Connection Anomaly Scores (CAS). Hosts showing a high number of anomalous connections or unidirectional flows are identified in this step.
4. Second-pass analysis performed on all data collected, gathered in sixty-minute (one-hour) time windows. In this step, clusters of similar communications (horizontal analysis) are created and recorded, filtering out hosts that show only scan behaviors.
5. Detection and identification of malicious hosts based on the second-pass analysis performed in the previous step. Groups of hosts revealing similar connection fingerprints, as well as inter-host relations, can be obtained at this phase.
To clarify the proposed framework, each step and the corresponding anomaly score calculations, communications clustering and detection control mechanisms are explained in the following pages.
5.5.1 Diversity Indexes Calculation
The Diversity Index calculation is one of the most important steps in the process, since it gathers the data needed for the rest of the proposed detection process. This phase of the detection is implemented in Perl [53] as an NfSen [52] plugin in order to allow fast processing. Diversity indexes are evaluated for the communications in twenty-minute time windows.
A manageable white list, implemented in the diversity index evaluation phase, allows specific hosts to be excluded from the detection process.
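The per-window calculation can be illustrated with a minimal sketch. It is written in Python for readability, whereas the plugin itself is implemented in Perl, and it rests on several assumptions: the diversity index takes the form 1 - Σ p_i² (the exact formula is the one presented earlier in the document), each flow record is reduced to a `(timestamp, host, value)` tuple, and the function names and the `whitelist` parameter are hypothetical.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 20 * 60  # twenty-minute time windows


def simpson_diversity(values):
    """Return 1 - sum(p_i^2) over the observed values (assumed form)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def diversity_by_host(flows, whitelist=frozenset()):
    """Group (timestamp, host, value) records by time window and host.

    Returns a dict mapping (window_start, host) to a diversity index.
    """
    buckets = defaultdict(list)
    for timestamp, host, value in flows:
        if host in whitelist:
            continue  # white-listed hosts are excluded from detection
        window_start = int(timestamp) // WINDOW_SECONDS * WINDOW_SECONDS
        buckets[(window_start, host)].append(value)
    return {key: simpson_diversity(vals) for key, vals in buckets.items()}
```

The same grouping would be repeated for each heuristic (with `value` holding, for example, the bytes per packet of a flow) and for both Source and Destination IP.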
5.5.2 Connection Anomaly Score (CAS) Calculation
After diversity indexes are evaluated, Anomaly Scores are calculated and a corresponding CAS is determined for each Source and Destination host. The total number of Bytes, Packets and Flows exchanged is also recorded in the database. CAS evaluation is based on a weighted average of the Anomaly Scores derived from the calculated diversity indexes, rounded to the thousandth and multiplied by 1000. This normalizes CAS values to an integer scale that ranges from 0 (not anomalous) to 1000 (maximum anomaly).
Each heuristic presented tries to capture specific characteristics of hosts in a network. A host may show high “maliciousness” scores in one heuristic but low scores in others. In order to capture the essence of normal network behavior and, at the same time, control false positive rates, an inter-heuristic correlation was implemented and tested. The proposed framework combines four Anomaly Scores ($AS_{BPP}$, $AS_{PPF}$, $AS_{TBF}$ and $AS_{FD}$), each corresponding to a specific heuristic analysis, in order to determine final CAS values. This way, inter-heuristic correlation is also achieved.
In the following equations, Anomaly Score is identified by the acronym AS and Inverse Simpson’s Diversity by $\tilde{D}$.
$AS_{BPP} = 1 - \tilde{D}_{BPP}$
Equation 3: Bytes per Packet Anomaly Score
$AS_{PPF} = 1 - \tilde{D}_{PPF}$
Equation 4: Packets per Flow Anomaly Score
$AS_{TBF} = 1 - \tilde{D}_{TBF}$
Equation 5: Time Between Flows Anomaly Score
$AS_{FD} = 1 - \tilde{D}_{FD}$
Equation 6: Flow Duration Anomaly Score
The Anomaly Scores presented invert the calculated diversity indexes in order to reflect the true meaning of diversity in the corresponding anomaly score evaluation. For example, high diversity in the Time Between Flows heuristic ($\tilde{D}_{TBF}$) should be considered normal, so the corresponding Anomaly Score ($AS_{TBF}$) should be low. Since diversity values always range from zero to one, the inversion is achieved by subtracting the respective diversity index from its maximum value (one).
Connection Anomaly Score (CAS) evaluation applies a weight to each base AS value, allowing an administrator to fine-tune the detection framework. Weights are defined as percentages that sum to 100%.
| Weights | Description |
|---|---|
| BPP Weight | Defines the relative importance of the BPP heuristic and the $AS_{BPP}$ in the Connection Anomaly Score calculation |
| PPF Weight | Defines the relative importance of the PPF heuristic and the $AS_{PPF}$ in the Connection Anomaly Score calculation |
| TBF Weight | Defines the relative importance of the TBF heuristic and the $AS_{TBF}$ in the Connection Anomaly Score calculation |
| FD Weight | Defines the relative importance of the FD heuristic and the $AS_{FD}$ in the Connection Anomaly Score calculation |

Table 5: Connection Anomaly Score Weights
Equation 7: Connection Anomaly Score
CAS values are determined separately for TCP and UDP connections, as is done in the communications fingerprint process. The final CAS value is the maximum of the TCP and UDP CAS values.
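The CAS calculation described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the prototype's code: the function names and the dictionary keys (`"BPP"`, `"PPF"`, `"TBF"`, `"FD"`) are hypothetical, the weights are taken as percentages summing to 100, and the weighted average is rounded to the thousandth before being scaled to the 0 to 1000 range.

```python
def anomaly_score(diversity):
    """Invert a diversity index in [0, 1]: high diversity gives a low score."""
    return 1.0 - diversity


def connection_anomaly_score(diversities, weights):
    """Weighted average of the four anomaly scores on a 0..1000 scale.

    diversities: {"BPP": d, "PPF": d, "TBF": d, "FD": d}, each in [0, 1]
    weights:     same keys, percentages that must sum to 100
    """
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("weights must sum to 100%")
    weighted = sum(
        weights[h] / 100.0 * anomaly_score(diversities[h]) for h in weights
    )
    # Round to the thousandth, then scale to an integer in 0..1000.
    return int(round(round(weighted, 3) * 1000))


def final_cas(tcp_cas, udp_cas):
    """The final CAS is the maximum of the TCP and UDP values."""
    return max(tcp_cas, udp_cas)
```

With equal weights of 25% and diversity indexes of 0.2, 0.4, 0.9 and 0.5, the four anomaly scores are 0.8, 0.6, 0.1 and 0.5, so the resulting CAS is 500.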
5.5.3 First-Pass Detection
The first-pass detection process is based on fingerprint analysis. The Connection Anomaly Score Threshold (CAST) is applied to TCP and UDP fingerprints in order to determine which communications should be considered anomalous when evaluating Connection Anomaly Scores (CAS). Based on the assumption that bot operation involves small data exchanges [39], this framework uses both the Bytes Per Packet Threshold and the Maximum Bytes Exchanged Threshold to bound its operating range, filtering out communications that are uninteresting for the detection process. At the end of this phase, hosts showing any kind of anomalous behavior (CAS values above the defined CAST) are registered.
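A minimal sketch of this first-pass filter is given below. The exact semantics of the two range thresholds are an assumption here: a communication is taken to be out of range when its average bytes per packet or its total bytes exceed the respective threshold. The `Communication` record, its field names and the `first_pass` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Communication:
    src: str
    dst: str
    cas: int            # Connection Anomaly Score, 0..1000
    total_bytes: int
    total_packets: int


def first_pass(comms, cast, bpp_threshold, max_bytes_threshold):
    """Return the communications considered anomalous in the first pass."""
    anomalous = []
    for c in comms:
        bpp = c.total_bytes / c.total_packets if c.total_packets else 0.0
        # Assumed reading: large exchanges fall outside the operating range.
        if bpp > bpp_threshold or c.total_bytes > max_bytes_threshold:
            continue
        if c.cas > cast:  # only CAS values above the CAST are registered
            anomalous.append(c)
    return anomalous
```

The hosts appearing in the returned communications are the ones carried forward to the second pass.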
5.5.4 Horizontal Fingerprint Clustering (HFC) and Second-Pass Detection
The final goal of the detection process is not merely to detect every malicious network operation but to find bot operations within all of the network traffic observed. To that end, each host’s communication fingerprints are used to cluster similar activities of interconnected malicious hosts. This way, potential botnets can be revealed and False Positive Rates hopefully reduced.
It is important to point out that HFC is performed only on communications involving the anomalous hosts detected in the first pass (steps 1, 2 and 3 of the five-step detection framework).
Several methods exist for clustering data. The goal of HFC is to cluster very similar fingerprints while leaving less similar ones outside any cluster. Centroid-based clustering methods such as K-Means are therefore not suitable in this context, since they assign every record to one cluster or another, which is not what is intended.
In order to determine similarity between fingerprints, centesimal steps were established in each heuristic’s diversity range, thus predefining the possible clusters. A communication belongs to a given cluster when all of its heuristics’ diversity indexes, rounded down to the hundredth, are equal to the cluster’s heuristic characteristics.
The clustering step can be changed in the framework and is defined as the Clustering Precision Threshold (CPT). A centesimal (0.01) Clustering Precision Threshold guarantees that, for each characteristic (BPP, TBF, etc.), the fingerprint values of all communications in a cluster differ by no more than 0.009(9), that is, by less than 0.01.
To control the sensitivity of the Horizontal Analysis (clustering), a further threshold was needed. The Horizontal Fingerprint Analysis Threshold (HFAT) defines the minimum number of hosts that a cluster must have in order to be considered in the detection process.
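The grid-based clustering controlled by CPT and HFAT can be sketched as follows. This is an illustration under assumptions: a fingerprint is represented as a tuple of diversity indexes, cluster membership is counted over distinct hosts, and the function names are hypothetical. Database storage and GUID assignment are left out.

```python
import math
from collections import defaultdict


def cluster_key(fingerprint, cpt=0.01):
    """Round each diversity index down to a multiple of the CPT.

    The key is a tuple of integer step counts, e.g. 0.297 -> 29 for cpt=0.01.
    """
    # round() before floor() guards against float artefacts such as
    # 0.29 / 0.01 evaluating to 28.999999999999996.
    return tuple(math.floor(round(v / cpt, 9)) for v in fingerprint)


def horizontal_clusters(comms, cpt=0.01, hfat=2):
    """Group (host, fingerprint) pairs; keep clusters with >= hfat hosts."""
    clusters = defaultdict(set)
    for host, fingerprint in comms:
        clusters[cluster_key(fingerprint, cpt)].add(host)
    return {k: hosts for k, hosts in clusters.items() if len(hosts) >= hfat}
```

Because the cells are fixed in advance, a fingerprint that matches no other simply ends up alone in its cell and is dropped by the HFAT filter, which is the behavior that centroid-based methods do not provide.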
All possible cluster values are first determined and inserted into a database table (BotnetClusters) for each one-hour time window. At this time, a GUID (Globally Unique Identifier) is assigned to each possible cluster as its identifier. After the clusters are predetermined for the hour, all communications involving anomalous hosts are clustered and inserted into the cluster details database table (BotnetClusters_Details).
Figure 13: Clustering Database Tables
The final phase of this step begins by calculating an Anomaly Score for each Source IP or Host. Host Anomaly Score (HAS) is calculated using the equation below.
Equation 8: Host's Anomaly Score
Similar to the Connection Anomaly Score Threshold used to control the First-Pass detection phase, the Second-Pass phase is controlled by the Host Anomaly Score Threshold (HAST) and the Minimum Number of Anomalous Contacts Threshold (MNACT). MNACT influences detection by imposing a lower limit on the number of anomalous connections a host must start in order to be considered for detection. This threshold allows system administrators to discard fortuitous anomalous contacts.
Beyond HAS, a Unidirectional Anomaly Score (UAS) is also evaluated to help identify scan behaviors.
Equation 9: Unidirectional Anomaly Score
At this point, two thresholds are used to control the detection of scan behaviors: the Unidirectional Anomaly Score Threshold (UAST) and the Minimum Number of Unidirectional Contacts Threshold (MNUCT). The percentage of unidirectional contacts performed by a single host is compared to the UAST. Any host with a number of unidirectional contacts above the defined MNUCT and a UAS above the defined UAST is then considered to be a Scanner.
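The scanner check can be sketched as follows, assuming that the UAS is the percentage of a host's contacts that are unidirectional, as the comparison with the UAST suggests. The function names are hypothetical, and both thresholds are treated as strict lower bounds ("above").

```python
def unidirectional_anomaly_score(unidirectional, total):
    """Percentage (0..100) of a host's contacts that are unidirectional."""
    if total == 0:
        return 0.0
    return 100.0 * unidirectional / total


def is_scanner(unidirectional, total, uast, mnuct):
    """A host is a scanner when both thresholds are exceeded."""
    uas = unidirectional_anomaly_score(unidirectional, total)
    return unidirectional > mnuct and uas > uast
```

Requiring both conditions means that a host with only a handful of contacts, all of them unidirectional, is not flagged, and neither is a busy host for which unidirectional contacts are a small share of the total.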
5.5.5 Alerts
The final step of the process is to record all hosts identified as anomalous so that an alert can be delivered to system administrators. The final detection step can be performed with or without clustering analysis. To use this framework for botnet detection, however, clustering analysis must be used.