• No results found

The Class Balanced SBIT Algorithm (CB-SBIT)

4.4 SBIT Limitations and Extension

4.4.4 The Class Balanced SBIT Algorithm (CB-SBIT)

To avoid class imbalance the SBIT (Alothman, 2018b) algorithm discussed in Section 4.2.2 can be modified to ensure the resulting dataset is class balanced. Details of this extension were published by the author of this thesis in (Alothman et al., 2018).

Recall SBIT assumes that the target dataset is class balanced, the modified version of SBIT makes sure that the new dataset (resulting after selecting instances from source datasets) remains class balanced by using a strict criteria as illustrated in Algorithm 2. This can be achieved in more than one way. For example, it can be done on the fly

by keeping track of the ratio of classes of instances transferred from the source datasets and ensuring that whenever an instance is added, the ratio remains almost the same. In other words, it guarantees that approximately the same number of instances from different classes is transferred to the target dataset. Another method is to perform a post-processing step and sub-sample the instances selected for transfer in such a way that the classes are balanced. In the current implementation, both methods are available although only the latter is included in Algorithm 2 (lines 10 and 11).

Algorithm 2:Class Balanced Similarity-Based Instance Transfer (CB-SBIT) Input :Source DatasetsS1, S2, . . . Sn

Input :Target DatasetT Input :Selected = [ ] Input :thr1, thr2, . . . thrk

Output:New Dataset that is the result of M erge(T, Selected)

1 forS∈[S1, S2. . . Sn]do 2 forIs∈S do 3 forIT ∈T do 4 Sim1 =ComputeSimilarity1(Is, IT); 5 Sim2 =ComputeSimilarity2(Is, IT); 6 . . . ; 7 Simk=ComputeSimilarityk(Is, IT);

8 if Sim1 > thr1&Sim2> thr2. . .&Simk> thrk then

9 Add Is to Selected ;

10 ClassBalancedSelected=SubSample(Selected); 11 TN EW =M erge(T, ClassBalancedSelected);

The SubSample function in Algorithm 2 counts the number of instances in each class in the input dataset and randomly removes instances from the majority class(s) until the dataset is class balanced. As will be shown in Section 5.4.4, CB-SBIT in general improves the accuracy of SBIT due to its ability to ensure class balance.

4.5

Summary

This chapter has introduced the algorithms developed as part of this thesis (i.e. SBIT and CB-SBIT) as well as two commonly used open source algorithms (i.e. TransferBoost and SMOTE). As for the developed algorithms, the chapter included a detailed explanation of the intuition behind them, their pseudo-code and discussion. The chapter also included a detailed explanation of what similarity is, the similarity measures used in this work and how they were used in the developed novel algorithms. On the other hand, the two open source algorithms were briefly explained because they were used for evaluation purposes. The next chapter contains a detailed experimental evaluation of all these algorithms.

Preprocessing of Raw Network

Traffic Data and Performance

Evaluation of the Proposed

Methods

One of the most important aspects in machine learning in general, and in botnet detection specifically, is data. Without data no practical experiments can be conducted and therefore no concrete conclusions can be drawn. This chapter provides a detailed overview of how data used in this work was obtained and preprocessed. It also presents an extensive evaluation of the novel techniques explained in Chapter 4. Each data preprocessing step is explained in such a way so that the reader can follow what was carried out and why it was carried out. After explaining the steps, the chapter shows the results of applying these steps to the downloaded dataset. After that, the last section of

this chapter demonstrates the experimental results of using this data with the proposed approaches.

5.1

Introduction

Data preprocessing is an essential step in data-driven approaches; this is because data in the real world is often incomplete, or unsuitable, for being used by software algorithms such as those used in data mining and machine learning. For instance, data can contain missing values or unnecessary and uninformative features. Also, noise can exist in the data in the form of outliers or errors and these can influence not only models created using the data but also the interpretation of these models. Because of these issues and others, it is important to ensure that data is in an acceptable condition to be used in tasks such as automatic prediction and analysis.

Over the last few years, the cost of malicious attacks has risen to tens of millions of Pounds from UK bank accounts (Agency, 2017). However, in order to test and evaluate malicious traffic detection techniques, network traffic data is required. Obtaining Network traffic is normally captured using tools such as Wireshark (Orebaugh et al., 2007). The data is usually in raw PCAP (Packet CAPture) format which is not processable by popular data mining platforms such as WEKA (Hall et al., 2009) or Scikit-learn (Pedregosa et al., 2011). Hence, it is necessary to prepare the data and transform it into a suitable format.

An existing approach that uses Wireshark to transform PCAP data into textual data is the work in (Fowler and Hammel, 2014). While Wireshark extracts some features, it does not extract statistical metrics such as the ones generated by the open source tool that has been used in this work (more in Section 5.2.2). Although the review in (Davis

and Clark, 2011) attempts to summarise several existing preprocessing techniques, it focuses more on features extracted from traffic data. This paper (to the best of this thesis’s author’s knowledge) is the first that provides a detailed step-by-step explanation of various preprocessing phases.

This chapter has three main contributions: 1) It provides several steps that should be considered when carrying out network traffic data transformation from raw to a textual format, 2) It demonstrates these steps by applying them to a real, rather than simulated, data, and 3) It presents detailed performance evaluation of our novel tranfer learning algorithms presented in Chapter 4 using the datasets resulting from the network traffic data transformation.