This section summarises the conclusions drawn for each method developed, explained and experimentally evaluated in this thesis.
6.3.1 Android Botnet Detection
This thesis proposed an effective approach for the automatic detection of botnet apps, based on the analysis of their Java source code. The approach starts with Android apps as APK files, uses reverse engineering to obtain their Java source code and then analyses that source code. The source code was analysed and mined using two techniques. In the first, the Java source code is treated as if it were ordinary text and processed with Natural Language Processing (NLP) methods. In the second, several statistical measures (i.e. metrics) are extracted from the source code and used as attributes. The approach is static, as it does not require the Android app to be executed. The idea is that as soon as an Android app is downloaded, it is reverse engineered and its Java source code is used to predict whether the app is safe or malicious. The advantage is that the approach is proactive: it attempts to identify danger before it occurs.
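The metrics-based technique can be illustrated with a minimal sketch. The function name `source_metrics` and the four metrics below are assumptions chosen for illustration; the actual metrics used in this thesis are not listed in this section. The comment stripping is deliberately crude and would, for example, mangle string literals that contain `//`.

```python
import re

def source_metrics(java_source):
    """Compute a few simple, illustrative metrics from Java source text.

    The metric set is hypothetical; it only shows how source code can be
    turned into numeric attributes without running the app.
    """
    lines = [ln for ln in java_source.splitlines() if ln.strip()]
    # Crudely remove block and line comments so that keywords inside
    # comments are not counted.
    no_comments = re.sub(r"/\*.*?\*/|//[^\n]*", " ", java_source, flags=re.S)
    tokens = re.findall(r"[A-Za-z_]\w*", no_comments)
    branch_keywords = ("if", "for", "while", "case", "catch")
    return {
        "lines_of_code": len(lines),
        "num_tokens": len(tokens),
        "num_imports": sum(1 for ln in lines if ln.strip().startswith("import ")),
        "num_branches": sum(tokens.count(k) for k in branch_keywords),
    }

# Hypothetical decompiled source, used only to exercise the function.
sample = """
import java.net.Socket;

public class Bot {
    // connect to the command server
    void run() {
        if (ready) { for (;;) { } }
    }
}
"""
metrics = source_metrics(sample)
```

Each app then contributes one row of such attributes to the dataset, which is why this technique yields a single dataset.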
Extracting the source code metrics produced a single dataset, whereas the NLP techniques produced several. This is because the number of words to keep was varied when converting the text into word vectors, with each value yielding a separate dataset. In addition, feature selection was applied to these datasets, and tests were run on both the original and the feature-selected version of each dataset.
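The effect of the number of words to keep can be sketched as follows. This is a simplified illustration and not the tooling used in the thesis: the helper names (`tokenize`, `build_vocabulary`, `to_word_vector`) and the toy documents are assumptions, and the feature selection step is not shown.

```python
import re
from collections import Counter

def tokenize(source):
    """Split source text into lower-cased identifier-like tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z_]\w*", source)]

def build_vocabulary(documents, words_to_keep):
    """Keep the `words_to_keep` most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:words_to_keep]]

def to_word_vector(document, vocabulary):
    """Represent one document as word counts over the kept vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

# Toy stand-ins for decompiled source files (hypothetical).
documents = ["socket connect send", "button click view", "socket send sms"]

# Each value of words_to_keep produces a different dataset.
datasets = {}
for words_to_keep in (2, 3):
    vocabulary = build_vocabulary(documents, words_to_keep)
    datasets[words_to_keep] = [to_word_vector(d, vocabulary) for d in documents]
```

Varying `words_to_keep` changes the number of attributes per instance, so every setting has to be stored and evaluated as its own dataset.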
Several traditional classifiers were evaluated, and multiple metrics were calculated to examine their performance. Random Forest was, in general, the best classifier, and the preferred representation was to keep 5000 words and to apply feature selection.
6.3.2 Raw Network Traffic Data Preprocessing
Automatic detection of malicious network traffic is an important task that should be as accurate as possible. It involves capturing network traffic, preparing it for analysis and then performing the analysis. As part of this thesis, several steps that should be considered when analysing network traffic data were presented and explained, and their results were illustrated using real, freely available data. Some of these steps are optional, while others are required in order to transform the data into a format suitable for data mining tools and platforms. After these steps were applied to an existing open source PCAP dataset, the resulting data was used in extensive machine learning experiments to evaluate the transfer learning approaches proposed in this thesis.
6.3.3 Similarity Based Instance Transfer (SBIT)
This thesis has introduced a novel, fast and effective method for transfer learning, which was successfully used to classify botnet traffic. It is an instance transfer method based on the similarity between instances in the source data and instances in the target data. The method computes more than one similarity measure so that as much information as possible is captured. Experimental results show that it generally outperforms a classical instance transfer learning algorithm, namely TransferBoost. It is also much faster, which gives it a further advantage.
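The general idea of similarity-based instance transfer can be sketched as below. This is not the SBIT algorithm as specified in the thesis: the two similarity measures, the way they are averaged, the threshold value and the function names are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_similarity(a, b):
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + math.dist(a, b))

def select_similar_instances(source, target, threshold=0.8):
    """Return the source instances that are similar enough to the target data.

    source: list of (features, label) pairs
    target: list of feature vectors
    Averaging the two measures and the 0.8 threshold are assumptions.
    """
    if not target:
        return []
    selected = []
    for features, label in source:
        best = max(
            (cosine_similarity(features, t) + euclidean_similarity(features, t)) / 2
            for t in target
        )
        if best >= threshold:
            selected.append((features, label))
    return selected

# Hypothetical, already normalised instances.
target = [[1.0, 0.0]]
source = [([1.0, 0.0], "bot"), ([0.0, 5.0], "normal")]
transferred = select_similar_instances(source, target)
```

Because the selection needs only one pass over the source instances, with no iterative reweighting, a scheme of this kind is cheap to run, which is consistent with the speed advantage reported above.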
6.3.4 Class-Balanced Similarity Based Instance Transfer (CB-SBIT)
This thesis has also introduced an extension to the novel SBIT algorithm. The extended version is aware of the class proportions in the dataset that results from instance transfer, and makes sure that the classes are balanced. This helps to avoid several problems, such as overfitting and misinterpretation. The new version was called Class-Balanced SBIT, or CB-SBIT for short. The thesis also included an extensive experimental evaluation of CB-SBIT against the original SBIT algorithm, as well as against two commonly used open source algorithms: SMOTE and TransferBoost.
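One way to make instance transfer class-aware is sketched below. The balancing rule shown here (topping up each class to the size of the largest target class, most similar candidates first) is an assumption for illustration and is not necessarily the rule used by CB-SBIT; the function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def class_balanced_transfer(target, candidates):
    """Add source instances to the target data so that classes even out.

    target:     list of (features, label) pairs already in the target data
    candidates: list of (similarity, features, label) source instances
    Assumed rule: top up every class to the size of the largest target
    class, taking the most similar candidates first.
    """
    counts = Counter(label for _, label in target)
    pool = defaultdict(list)
    for similarity, features, label in candidates:
        pool[label].append((similarity, features))
    goal = max(counts.values(), default=0)
    result = list(target)
    for label, items in pool.items():
        needed = goal - counts[label]
        if needed <= 0:
            continue  # this class is already large enough
        items.sort(key=lambda item: item[0], reverse=True)
        result.extend((features, label) for _, features in items[:needed])
    return result

# Hypothetical target data with a single "bot" instance.
target = [([0.1], "normal"), ([0.2], "normal"), ([0.3], "normal"), ([0.9], "bot")]
candidates = [
    (0.90, [0.80], "bot"),
    (0.50, [0.70], "bot"),
    (0.95, [0.85], "bot"),
    (0.99, [0.15], "normal"),
]
balanced = class_balanced_transfer(target, candidates)
class_counts = Counter(label for _, label in balanced)
```

Because the added instances are real source instances and not synthetic ones, a scheme of this kind still works when a class has only one instance in the target data.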
Experimental results showed that CB-SBIT outperforms SBIT in the majority of the tests performed, which means that CB-SBIT is an improvement over SBIT. CB-SBIT was compared against SMOTE on several network traffic datasets of various sizes, and it was evident that CB-SBIT outperforms SMOTE on small datasets, with its advantage appearing to grow as the dataset gets smaller. An interesting case arises when the dataset contains only one instance of one or more classes: SMOTE, which synthesises new instances by interpolating between same-class neighbours, does not work in this case, whereas CB-SBIT functions normally. Text data from the publicly available 20 Newsgroups dataset was used to compare CB-SBIT against TransferBoost. Interestingly, although CB-SBIT (and likewise SBIT) outperforms TransferBoost on network traffic data, TransferBoost works much better than CB-SBIT on text data.
CB-SBIT performed poorly on the text data because the similarity values between instances from different topics were extremely low. On the network data, where the computed similarity values were higher, its performance was excellent. This difference in performance between the text and network datasets shows that the proposed 'similarity-based' methods work as expected in the appropriate transfer learning scenario.
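The effect of sparse text vectors on similarity can be seen in a toy example. All values below are made up for illustration and are not taken from the experiments in this thesis; cosine similarity is used here as an assumed representative measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical word counts for two documents on different topics:
# they share only one word, so most products in the dot product are zero.
doc_hockey = {"puck": 3, "goal": 2, "team": 1}
doc_crypto = {"key": 2, "cipher": 3, "team": 1}

# Hypothetical normalised flow features: every attribute is present
# in both instances, so the vectors point in a similar direction.
flow_a = {"duration": 0.40, "packets": 0.55, "bytes": 0.60}
flow_b = {"duration": 0.35, "packets": 0.50, "bytes": 0.70}

text_similarity = cosine(doc_hockey, doc_crypto)
flow_similarity = cosine(flow_a, flow_b)
```

Here the two documents score 1/14 (about 0.07), while the two flows score above 0.9. With similarities this low, a similarity-based method has few source instances worth transferring.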