
2.3 Machine learning

2.3.4 Enhancing Intrusion Detection Approaches

A review of the literature reveals two main approaches to enhancing machine learning algorithms for intrusion detection: (1) implementing hybrid algorithms and (2) feature selection. Hybrid algorithms are developed by combining two or more algorithms. Feature selection is the process of choosing the relevant attributes within a dataset so that additional information can be extracted when the dataset is classified on the selected features.

2.3.4.1 Hybrid Algorithms

Several studies have sought to enhance intrusion detection by developing hybrid algorithms and approaches for detecting unauthorised activities. Kevric et al. (2017) developed a hybrid classification algorithm combining the random decision tree and the Naïve Bayes classifier. Experiments conducted on the NSL-KDD dataset showed that the proposed hybrid algorithm outperformed the standard machine learning algorithms. Aslahi-Shahri et al. (2016) proposed a hybrid approach combining a Support Vector Machine (SVM) and a Genetic Algorithm (GA). The GA was used for feature selection while the SVM classified the data. When applied to the KDD Cup 99 dataset, the hybrid algorithm classified network traffic more accurately than the standard machine learning algorithms tested. Soheily-Khah et al. (2018) developed a hybrid detection approach based on K-means and the Random Forest decision tree classifier. K-means was used to pre-process the ISCX intrusion detection datasets, and the results were then processed by the Random Forest classifier. The results showed that the proposed algorithm was more precise and efficient than the other machine learning algorithms tested.

In addition, studies have applied fuzzy logic to association rule mining algorithms (Aburrous, Hossain, Dahal, & Thabtah, 2010; Changguo, Nianzhong, Tailei, Qin, & Xiaorong, 2009). Fuzzy logic allows more flexible segment boundaries by giving the administrator control over the fuzzy set range associated with each boundary. The association rule mining algorithms to which fuzzy logic was applied produced greater precision and efficiency than the standard association rule mining algorithms examined.
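The idea of flexible segment boundaries can be illustrated with a minimal sketch. It assumes triangular membership functions, which are one common choice and not necessarily the form used in the cited studies. The attribute ("failed logins per minute"), the set labels, and the ranges are all hypothetical values of the kind an administrator might define.

```python
def triangular_membership(x, low, peak, high):
    """Degree (0.0 to 1.0) to which x belongs to a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - low) / (peak - low)   # rising edge
    return (high - x) / (high - peak)     # falling edge


# Hypothetical fuzzy sets over "failed logins per minute".
# Each tuple is (low, peak, high); the ranges deliberately overlap.
FUZZY_SETS = {
    "low": (-1, 0, 10),
    "medium": (5, 15, 25),
    "high": (20, 30, 61),
}


def fuzzify(value):
    """Return the membership degree of value in every fuzzy set."""
    return {
        label: triangular_membership(value, *bounds)
        for label, bounds in FUZZY_SETS.items()
    }


# A value of 8 is partly "low" and partly "medium" rather than being
# forced into a single crisp interval.
memberships = fuzzify(8)
print(memberships)
```

Because the sets overlap, a value near a boundary contributes to the rules of both neighbouring segments instead of flipping abruptly from one segment to the other.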

Hybrid algorithms have been shown to enhance the precision and efficiency of machine learning algorithms (Agrawal & Agrawal, 2015). However, the current study intended to use machine learning algorithms to extract patterns from sequential SSH adversarial commands and to compare the results obtained from the reduced dataset with those from the respective full dataset. As such, this study is not concerned with implementing enhanced versions of the chosen machine learning algorithms but rather with enhancing the performance of the existing algorithms.


2.3.4.2 Feature Selection

Studies have been conducted on improving the feature selection process to enhance intrusion detection approaches. Feature selection is the process of choosing the relevant attributes within a dataset so that additional information can be extracted when the dataset is classified on the selected features. Studies within the relevant literature have focused on improving feature selection as part of developing hybrid approaches to intrusion detection. De la Hoz, De La Hoz, Ortiz, Ortega, and Prieto (2015) suggested using probability-based Self-Organising Maps (SOMs) together with Principal Component Analysis (PCA) and the Fisher Discriminant Ratio (FDR) for selecting features. Their experiments suggested that the proposed feature selection approach, applied to the NSL-KDD dataset, was more precise than standard algorithms. Gauthama Raman, Kirthivasan, and Shankar Sriram (2017) focused on Rough Set Theory (RST) to extract additional data from a dataset and proposed Rough Set Hyper-graph (RSHGT) as a solution for extracting an optimised feature subset. Experiments were conducted on the KDD Cup 99 dataset. The proposed RSHGT was evaluated by applying the selected features to chosen classifiers and then comparing the results with those of other feature extraction techniques using the same classifiers.
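As an illustration of ratio-based feature ranking, the following minimal sketch scores each feature with the two-class Fisher Discriminant Ratio, (mean_a - mean_b)^2 / (var_a + var_b), and orders the features by score. It covers only the FDR step and is not the SOM/PCA pipeline of the cited study. The feature names and sample values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Two-class FDR for one feature: squared mean gap over summed variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    gap = mean(values_a) - mean(values_b)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if gap != 0 else 0.0
    return gap ** 2 / spread


def rank_features(rows, labels, feature_names):
    """Return (feature, score) pairs sorted from most to least discriminative."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for index, name in enumerate(feature_names):
        a = [row[index] for row, lab in zip(rows, labels) if lab == classes[0]]
        b = [row[index] for row, lab in zip(rows, labels) if lab == classes[1]]
        scores.append((name, fisher_ratio(a, b)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical records: [connection duration, packet size] per connection.
ROWS = [
    [1.0, 5.0], [2.0, 6.0], [3.0, 5.0],      # labelled "normal"
    [9.0, 5.0], [10.0, 6.0], [11.0, 6.0],    # labelled "attack"
]
LABELS = ["normal"] * 3 + ["attack"] * 3

ranked = rank_features(ROWS, LABELS, ["duration", "packet_size"])
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy data the class means of `duration` are far apart relative to the within-class spread, so it ranks above `packet_size`, whose values overlap almost entirely between the two classes.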

Aminanto, Choi, Tanuwidjaja, Yoo, and Kim (2018) identified the optimal features for detecting an impersonator on a Wi-Fi network using machine learning algorithms. Their results show that identifying the optimal features can enhance the performance of the machine learning algorithms. This suggests that selecting the optimal features for the current study can likewise enhance the performance of the machine learning algorithms. The feature selection process can take place in the pre-processing phase of this study. Soheily-Khah et al. (2018) presented the pre-processing procedure that they implemented on the ISCX dataset. The procedure was developed to convert raw traffic and separate the data into different traffic types (normal and abnormal). That study also converted the data from nominal to numeric form.
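The cited study does not state, in the passage above, which nominal-to-numeric conversion was used. As one possible reading, the sketch below assigns an integer code to each distinct nominal value in order of first appearance. The function name and the protocol values are hypothetical, and other encodings (for example one-hot encoding) would be equally plausible.

```python
def encode_nominal(values):
    """Map each distinct nominal value to an integer code.

    Codes are assigned in order of first appearance. Returns the encoded
    list together with the value-to-code mapping so it can be reused.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical nominal attribute: the protocol of each connection.
protocols = ["tcp", "udp", "tcp", "icmp", "udp"]
encoded, mapping = encode_nominal(protocols)

print(encoded)   # [0, 1, 0, 2, 1]
print(mapping)   # {'tcp': 0, 'udp': 1, 'icmp': 2}
```

Integer codes impose an artificial ordering on the categories, which matters for distance-based algorithms such as K-means, so the choice of encoding should follow from the algorithm that consumes the data.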

The pre-processing phase is critical when applying machine learning (Malley et al., 2016). In this phase, the datasets are cleaned and prepared so that the selected machine learning algorithms can be applied to them. The steps taken in the pre-processing phase can be combined in different ways, depending on the requirements of the study and the data collected, and the literature describes several variations of the phase (Bramer, 2013; García, 2015; Hackeling, 2014). Hence, after considering the possible combinations of steps and the requirements of this study, a pre-processing phase was developed specifically for it (presented in Section 4.2). The procedure developed processes the acquired SSH honeypot datasets to expose the sequence of adversary commands for each unique session.
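The actual procedure is presented in Section 4.2. Purely as an illustration of the goal stated above, the following sketch groups log records by session and orders each session's commands by time. The record layout (session identifier, timestamp, command) and the sample values are assumptions and do not reflect the format of the honeypot datasets used in this study.

```python
from collections import defaultdict

# Hypothetical log records: (session_id, timestamp, command).
# Records may arrive interleaved and out of order.
LOG = [
    ("s2", 3, "uname"),
    ("s1", 1, "wget"),
    ("s1", 4, "./run"),
    ("s2", 5, "exit"),
    ("s1", 2, "chmod"),
]


def sessions_to_sequences(records):
    """Return {session_id: [commands in chronological order]}."""
    by_session = defaultdict(list)
    for session_id, timestamp, command in records:
        by_session[session_id].append((timestamp, command))
    return {
        session_id: [cmd for _, cmd in sorted(events, key=lambda e: e[0])]
        for session_id, events in by_session.items()
    }


sequences = sessions_to_sequences(LOG)
print(sequences["s1"])   # ['wget', 'chmod', './run']
print(sequences["s2"])   # ['uname', 'exit']
```

Each resulting list is one ordered sequence per session, which is the shape of input that sequence-based classifiers and association rule miners typically expect.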


The classification and association problems solved by machine learning algorithms align with the aims of the current study. To identify which classification and association algorithms should be used, the four main interpretations of probability were examined, and the Bayesian and frequentist interpretations were selected for this study. The Naïve Bayes and Markov chain probabilistic classification algorithms, which are based on the Bayesian interpretation of probability, were chosen. The Apriori and Eclat association rule mining algorithms, which are based on the frequentist interpretation of probability, were also chosen.
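To make the Markov chain idea concrete, the sketch below estimates first-order transition probabilities from command sequences by simple counting (a maximum-likelihood estimate). It is an illustrative simplification, not the implementation used in this study, and the command sequences are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next command | current command) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair every command with the command that follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        current: {
            following: n / sum(followers.values())
            for following, n in followers.items()
        }
        for current, followers in counts.items()
    }


# Hypothetical per-session command sequences.
SEQUENCES = [
    ["uname", "wget", "chmod", "run"],
    ["uname", "wget", "chmod", "run"],
    ["uname", "cat", "exit"],
    ["wget", "chmod", "exit"],
]

probabilities = transition_probabilities(SEQUENCES)
for current, followers in probabilities.items():
    for following, p in followers.items():
        print(f"P({following} | {current}) = {p:.2f}")
```

In this toy data `wget` is always followed by `chmod`, so that transition has probability 1.0, whereas `uname` leads to `wget` in two of three cases. Transitions of this kind are the patterns a Markov chain model would extract from sequential adversary commands.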