4.3 A Drive-by Download Attack Classifier
4.3.3 Building a Machine Classifier
From the collection of annotated Tweets made in the previous phase, 2,000 Tweets containing unique URLs were sampled. The training data consisted of samples of machine activity taken at regular 30-second intervals throughout the Super Bowl data collection period. The training set contained 1,000 URLs identified by Capture HPC as malicious and 1,000 as benign. The unseen dataset used as the test set was collected around the time of the cricket World Cup; it consisted of 891 malicious URLs and 1,100 benign URLs (sampling 80% from the day of the final and 10% from each of the semi-finals).
Feature Selection and Preprocessing
To identify features that were predictive of malicious behaviour, machine log data was collected while revisiting the malicious and benign URLs in a sandboxed environment. The metrics that we measured were as follows:
1. CPU usage (number).
2. Connection established/listening (yes/no).
3. Port number (yes for port 80/no for other port).
4. Remote IP (established or not).
5. Network interface (type, e.g. Wifi, Eth0).
6. Bytes sent (number).
7. Bytes received (number).
8. Packets sent (number).
9. Packets received (number).
10. Time since interaction started.
From the log file created, only ten attributes of the recorded machine activity were considered when building the machine learning models. The attributes fell into two broad categories: one representing the load on the machine, such as the CPU usage during the visit to the website, and the other representing network data. In the log file generated to build the machine learning models, CPU was recorded as a numeric value representing CPU usage during visitation. The attributes representing network statistics, by contrast, had to undergo a pre-processing stage before they could be written to the log file. In the pre-processing stage, attributes such as connection, port, network interface, and remote IP were transformed into nominal values. For connection and remote IP, a value of one represented the presence of a remote IP and an established connection, and zero represented their absence. For ports, a binary value of 1 represented the use of port 80 and 0 represented the use of any other port. The rationale for focusing on port 80 was that it is the port most commonly used for the Hypertext Transfer Protocol (HTTP): it is the port through which a computer sends and receives Web client-based communication and messages from a Web server, and it is used to send and receive HTML pages or data. For the network interface attribute, each number represented the network interface that was used while visiting the website. Once the data had been transformed, log files recording each attribute and its value at every 30-second interval were generated for both datasets.
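The encoding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline actually used: the `RawSample` fields, the `"ESTABLISHED"` state string, and the first-seen numbering of network interfaces are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """One raw log sample. Field names are hypothetical, chosen for illustration."""
    cpu: float
    connection_state: str  # assumed to be "ESTABLISHED" when a connection exists
    port: int
    remote_ip: str         # assumed to be "" when no remote address is present
    interface: str         # e.g. "Wifi", "Eth0"
    bytes_sent: int
    bytes_recd: int
    packets_sent: int
    packets_recd: int
    elapsed_s: int         # seconds since the interaction started


def encode(sample, interface_codes):
    """Turn a raw sample into a numeric feature vector of ten attributes.

    interface_codes maps interface names to integers; unseen names get the
    next free integer (an assumption, the source does not give the mapping).
    """
    code = interface_codes.setdefault(sample.interface, len(interface_codes) + 1)
    return [
        sample.cpu,
        1 if sample.connection_state == "ESTABLISHED" else 0,  # connection present
        1 if sample.port == 80 else 0,                         # port 80 vs any other
        1 if sample.remote_ip else 0,                          # remote IP present
        code,
        sample.bytes_sent,
        sample.bytes_recd,
        sample.packets_sent,
        sample.packets_recd,
        sample.elapsed_s,
    ]
```

Calling `encode` once per 30-second sample, with a shared `interface_codes` dictionary, would yield one row per observation in the order of the ten metrics listed earlier.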
In total, 5.5 million observations were recorded from interacting with the 2,000 Tweets (1,000 malicious and 1,000 benign). Each observation represented a feature vector containing the metrics and an indication of whether the URL was annotated by Capture HPC as malicious or not.
Classifier Model Selection
The data contained logs of machine activity, which occurred even when the system was idle, so it was likely that any log would contain a great deal of 'noise' as well as malicious behaviour. Table 4.1 presents a comparison between the training and testing datasets with respect to the mean and standard deviation of recorded machine activity. It illustrates the large variation in the mean recorded values of CPU usage and of bytes and packets sent and received between the two datasets, which suggested that it would be challenging to identify similar measurements between datasets for prediction purposes. The standard deviation in both datasets was very similar, which suggested that the variance is common to both datasets, while the deviation itself is high, suggesting a great deal of 'noise' in the data.
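A per-attribute comparison of this kind could be produced along the following lines. This is only a sketch: the row format (a list of dictionaries keyed by attribute name) and the use of the sample standard deviation are assumptions, since the text does not state how the statistics were computed.

```python
from statistics import mean, stdev


def describe(rows, attributes):
    """Return {attribute: (mean, sample standard deviation)} for dict-shaped rows."""
    summary = {}
    for name in attributes:
        values = [row[name] for row in rows]
        summary[name] = (mean(values), stdev(values))
    return summary


def compare(train_rows, test_rows, attributes):
    """Pair up train and test statistics per attribute, as in a train/test table."""
    train = describe(train_rows, attributes)
    test = describe(test_rows, attributes)
    return {
        name: {
            "mean": (train[name][0], test[name][0]),
            "std": (train[name][1], test[name][1]),
        }
        for name in attributes
    }
```

Each entry of the result holds a (train, test) pair for the mean and for the standard deviation, which makes differences between the two datasets easy to inspect attribute by attribute.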
Table 4.1: Descriptive statistics for train and test datasets at T=60 for numeric attributes

Attribute      Mean (Train)   Mean (Test)   Std. Dev. (Train)   Std. Dev. (Test)
Cpu            1.255354       6.26          2.144828            2.31
Connection     0.86           0.88          0.34                0.32
Portnumber     0              0.37          0.01                0.19
Remoteip       0.86           0.88          0.34                0.32
Network        4              4             2                   2
Bytessent      1.01E+08       3.59E+08      2.06E+08            9.50E+08
Bytesrecd      2.87E+08       3.12E+08      8.47E+08            8.90E+08
Packetssent    470821.5       2442275       1472258             6659166
Packetsrecd    539358         2849133       1843365             7742467

In addition to the 'noise' in the data, there is a skew: although the training and testing datasets contain a well-balanced number of malicious and benign activity logs, the behaviours in both logs are largely benign, creating a large skew in log activity towards the benign type. The noise and skewness may affect the ability of a discriminative classifier to identify the decision boundaries in the space of inputs (i.e. the inputs may not be linearly separable, which could cause problems for a perceptron-type classifier), even after a great many iterations (for instance, if a multilayer perceptron were used that had been developed using multiple layers of logistic regression).
It could be argued that for more complex relationships, such as multiple sequential activities leading to a malicious machine exploit, a generative model would be more appropriate, as it generates a full probabilistic model of all the variables (possible behaviours) given a training dataset of machine logs. For example, a Bayesian approach could effectively capture the dependencies between variables over time [190]. Alternatively, a Naive approach to Bayesian modelling might be more suitable, in that it assumes that there are no dependencies but that the probabilistic value of individual variables will be enough to determine the likely behaviour [115]. The first phase of data modelling was therefore to conduct a number of baseline experiments to determine which of the two types of model would predict more accurately. We used the Weka toolkit to compare the predictive accuracy of:
1. Generative models that consider conditional dependencies in the dataset (BayesNet) or assume conditional independence (Naive Bayes).
2. Discriminative models that aim to maximise information gain (the J48 Decision Tree) or build multiple models to map input to output via a number of connected nodes, even if the feature space is hard to linearly separate (Multi-layer Perceptron).
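To make the split criterion mentioned for the decision tree concrete, the following is a minimal sketch of information gain for a single nominal feature. It is an illustration only and not Weka's implementation; J48 is based on C4.5, which normalises this quantity into a gain ratio. The function and variable names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a nominal feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

A feature that separates malicious from benign observations perfectly yields a gain equal to the entropy of the labels, whereas a feature that is independent of the label yields a gain of zero.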
While developing each machine learning model with the Weka toolkit, the default configuration for each algorithm was selected. The default settings for each are listed below:
1. NaiveBayes: batch size = 100, number of decimal places = 2.
2. BayesNet: batch size = 100, estimator algorithm = SimpleEstimator, search algorithm = K2 -P 1 -S BAYES.
3. J48: batch size = 100, confidence factor = 0.25, minimum number of objects = 2, number of folds = 3, unpruned = False.
4. MLP: batch size = 100, decay = False, hidden layers = (number of attributes + classes)/2, learning rate = 0.3, momentum = 0.2, training time = 500, validation threshold = 20, normalise attributes = True, normalise numeric class = True.
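As a worked example of the hidden-layer rule listed for the MLP, the sketch below evaluates it for the data described in this section. It assumes ten input attributes and two classes (malicious and benign), and it assumes the division is rounded down to a whole number of units; the text states only "(number of attributes + classes)/2". The function name is hypothetical.

```python
def default_hidden_units(n_attributes, n_classes):
    """Hidden-layer size under the '(attributes + classes) / 2' rule.

    Rounding down with integer division is an assumption of this sketch.
    """
    return (n_attributes + n_classes) // 2


# Assumed setting for this section: ten attributes, two classes.
hidden_units = default_hidden_units(10, 2)
```

Under these assumptions the network would have a single hidden layer of six units.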