As organisations place more information assets in the cyber domain, enhancing current cyber security defences is becoming increasingly imperative. This thesis investigates the domain of pattern-based intrusion detection approaches, with the purpose of precisely and efficiently detecting adversaries utilising the Secure Shell (SSH) service to access a network. A pattern-based intrusion detection approach is one which discovers and extracts patterns from network traffic to detect adversarial activities. This study examined whether precise patterns could be efficiently extracted from sequential adversarial commands by selected machine learning algorithms. Precision and efficiency are key attributes of intrusion detection approaches, as network traffic should be precisely classified to identify a possible unauthorised attempt to gain access to assets on a host or network. In addition to precisely classifying network traffic, intrusion detection approaches should efficiently process the data, allowing for timely detection of an adversary on the network before assets on a host or network are compromised. A pre-processing procedure was developed for this study to test whether a reduced dataset that evenly and coherently represents the associated full dataset can be utilised to extract more precise patterns efficiently.
7.1.1 Problem Space in Precise and Efficient Intrusion Detection
An intrusion detection approach should precisely and efficiently detect threat actors on the network in a timely manner to avoid assets being compromised. This study investigated whether patterns extracted from adversarial SSH commands can be utilised as a pattern-based intrusion detection approach. As SSH is one of the most widely used methods of accessing systems remotely, it is also a prime target for cyber-criminal activities. Existing studies examined in Chapter 2 describe research that utilised deep packet inspection (DPI) to extract adversarial activities for intrusion detection purposes. However, that chapter also demonstrated that few studies have been conducted on the use of sequential adversarial activities for intrusion detection purposes.
Studies have been conducted on the use of machine learning algorithms to precisely and efficiently detect adversarial activities. From examining the literature, two main approaches to enhancing machine learning algorithms for intrusion detection purposes were identified: improving the features selected and developing a hybrid algorithm. Feature selection is the process of choosing relevant attributes within a dataset that will allow for additional information to be extracted by classifying a dataset based on the selected features. Hybrid algorithms are developed by combining two or more algorithms with the intent of producing one which more closely matches the desired detection outcomes. This study focused on improving the feature selection process through appropriately pre-processing the data. At the time of this study, limited research had been conducted into pre-processing data appropriately to enhance the precision and efficiency of the patterns extracted by a machine learning algorithm. This study has contributed knowledge to this domain by providing evidence that precise patterns can be efficiently extracted from an appropriately reduced dataset of sequential adversarial commands.
7.1.2 Research Methodology and Procedure
The underpinning research paradigm for this study was a post-positivist quantitative approach, with a field experimental research design using quasi-experimentation in a non-equivalent control group pretest-posttest design. The research procedure developed for this study consisted of five phases.
The first phase was the project understanding phase, in which the objectives of the study were defined. The aim of this study was determined by identifying the gap in knowledge that the study intended to fill, by addressing the research questions and associated hypotheses.
The second phase was the data understanding phase, in which the three honeypot datasets acquired for this study were explored and analysed to determine whether they were suitable. The initial exploration consisted of identifying the tables or files within each dataset, in addition to identifying the relationships between the tables and features in the collected dataset. Upon completion of this step, the datasets were analysed to identify whether the required features were present.
The third phase was the pre-processing phase. This was a critical phase of the research procedure and was applied to the acquired datasets prior to applying the selected machine learning algorithms. The pre-processing procedure consists of five steps that follow an iterative process: data filtering, data integration, data transformation, data reduction and data wrangling. The data reduction step is where the reduced datasets, which evenly and coherently represent their full datasets, are produced. Upon completing this phase, a reduced dataset had been produced for each of the three full honeypot datasets.
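As a purely illustrative sketch of the data reduction idea, the following Python snippet collapses duplicate command sequences so that each distinct session pattern appears once. This summary does not state how the reduction was performed, so the deduplication rule, the function name `reduce_sessions` and the sample commands are all assumptions rather than the procedure used in the study.

```python
from typing import Iterable, List, Tuple


def reduce_sessions(sessions: Iterable[List[str]]) -> List[Tuple[str, ...]]:
    """Keep one copy of each distinct command sequence, in first-seen order.

    Hypothetical reduction rule for illustration only; it is not the
    study's data reduction step.
    """
    seen = set()
    reduced = []
    for commands in sessions:
        # Normalise whitespace so trivially different spellings compare equal.
        key = tuple(" ".join(command.split()) for command in commands)
        # Skip empty sessions and sequences that were already recorded.
        if key and key not in seen:
            seen.add(key)
            reduced.append(key)
    return reduced


# Made-up sessions standing in for honeypot records.
full = [
    ["uname -a", "wget http://example.invalid/x", "chmod +x x"],
    ["uname  -a", "wget http://example.invalid/x", "chmod +x x"],
    ["cat /proc/cpuinfo"],
    [],
]
reduced = reduce_sessions(full)
```

Under this assumed rule, the two sessions that differ only in spacing collapse into one and the empty session is dropped, so fewer sequences are passed to the pattern-extraction algorithms.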
The fourth phase was the experimental phase, in which four machine learning algorithms were applied to the three full datasets and their associated reduced datasets. The four machine learning algorithms were the Naïve Bayes, Markov Chain, Apriori and Equivalence Class Transformation (Eclat) algorithms. The experiments tested whether each machine learning algorithm could extract more precise patterns from the reduced datasets than from their respective full datasets. The experiments were developed to test the hypotheses of this study.
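To make the notion of a pattern extracted from sequential commands more concrete, the following Python sketch estimates first-order Markov chain transition probabilities from command sequences. It is a minimal sketch under stated assumptions: the function name `transition_probabilities`, the first-order model and the sample sessions are hypothetical and do not reproduce the implementation or data used in the study.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(
    sessions: List[List[str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(next command | current command) from observed sessions."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for commands in sessions:
        # Count each adjacent (current, following) pair within a session.
        for current, following in zip(commands, commands[1:]):
            counts[current][following] += 1
    # Normalise the counts for each current command into probabilities.
    return {
        current: {
            nxt: n / sum(followers.values()) for nxt, n in followers.items()
        }
        for current, followers in counts.items()
    }


# Made-up sessions for illustration; not taken from the honeypot datasets.
sessions = [
    ["uname", "wget", "chmod", "sh"],
    ["uname", "wget", "chmod", "rm"],
    ["uname", "cat"],
]
probs = transition_probabilities(sessions)
```

In this toy example, `probs["uname"]["wget"]` is 2/3 because two of the three observed transitions out of `uname` lead to `wget`. A transition with a high estimated probability is one simple form of sequential pattern that a detector could match against new sessions.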
The fifth phase was the evaluation and analysis phase. In this phase, the results from the tests conducted in the experimental phase were evaluated and analysed to verify the hypotheses, thereby addressing the research questions of this study.