IMPLEMENTATION OF SDN BASED FEATURE SELECTION APPROACHES ON NSL-KDD DATASET FOR ANOMALY DETECTION


International Journal of Advanced Research in Engineering and Technology (IJARET) Volume 12, Issue 3, March 2021, pp.252-262 Article ID: IJARET_12_03_025

Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=12&IType=3 ISSN Print: 0976-6480 and ISSN Online: 0976-6499

DOI: 10.34218/IJARET.12.3.2021.025 © IAEME Publication Scopus Indexed


Reenu Batra

Department of Computer Science and Engineering, SGT University, Gurugram, Haryana, India

Manish Mahajan

Department of Computer Science and Engineering, SGT University, Gurugram, Haryana, India

Dr. Amit Goel

School of Computing Science and Engineering, Galgotias University, Greater Noida, Haryana, India

ABSTRACT

Over the past decade, traditional networks were used to transfer data between nodes. Their main limitations were their static nature and their inability to meet the requirements of devices newly added to the network. Traditional networks are therefore being replaced by Software Defined Networks (SDN). Many networking applications rely on the network for data transfer, and SDN is dynamic in nature. SDN can be used to create a framework for data-intensive applications such as big data. Nowadays, the security of data over the network is crucial. Machine learning (ML) algorithms are used to classify network data in order to detect intrusion attacks.

In this paper, a comparative analysis of machine learning algorithms is carried out using different feature selection approaches. For this analysis, the NSL-KDD dataset for training and testing, with 41 features and 125000 samples, is used. The accuracy of each machine learning algorithm with a particular feature selection approach is estimated in order to detect anomalies over SDN.

Key words: Machine Learning, SDN, Open Flow, Intrusion attacks, Intrusion Detection System, Distributed Denial of Service, NSL-KDD Dataset, Random Forest, Naive Bayes, Decision Tree.


Cite this Article: Reenu Batra, Manish Mahajan and Amit Goel, Implementation of SDN Based Feature Selection Approaches on NSL-KDD Dataset for Anomaly Detection, International Journal of Advanced Research in Engineering and Technology (IJARET), 12(3), 2021, pp. 252-262.

http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=12&IType=3

1. INTRODUCTION

The architecture of traditional networks is not well suited to today’s applications, mainly because of the dynamic nature of current technologies. Over the past decade, traditional networks were used across applications. They provide a conventional distributed facility, but they are not well matched to newer applications. This limitation can be overcome by using software defined networking (SDN) [1]. The main advantage of SDN is its centralized control. It separates the switches, routers and other devices that forward packets from the controller; in other words, it separates the forwarding plane from the control plane. The forwarding plane is also known as the data plane or user plane. The implementation of SDN revolves around Open Flow (OF) [2]. Open Flow evolved as a communication protocol for SDN that communicates directly with the controllers residing on the control plane. A major benefit of using Open Flow over SDN is that it makes it possible to provide new services and features. The concept of Open Flow originated in 2008 at Stanford University, a private research university. Since its release it has been adopted by many large companies such as Cisco and Brocade. Cisco produced its own OF-based controller, named Cisco XNC. In SDN, a device that wants to communicate with the SDN controller must support OF. The strategic control point of an SDN is known as the SDN controller. It uses an application programming interface (API) for flow control management. The SDN controller runs on a server and instructs switches or routers where to forward data packets. Traffic management is performed by the SDN controller in accordance with the networking policies. The controller is the central part of SDN and can be treated as the operating system of the network.

The networking devices sit on one side of the controller and the applications reside on the other. In the SDN architecture, the northbound API is used for communication between the SDN controller and the high-level application components. Communication between the network devices (low-level components) and the SDN controller takes place through the southbound API. The SDN architecture mainly consists of three layers: the application layer, the control layer and the data layer. All business applications incorporated with SDN reside on the application layer. The SDN controller sits at the control layer. As many current dynamic applications depend fully on the SDN framework, the security of data is a major concern in SDN because of the involvement of controlling functions and APIs. Building on SDN, Google B4 [3], Huawei Carrier Network [4], NOX [5], RYU [6], Beacon [7], OpenDaylight [8] and Floodlight [9] have been implemented.

For security, an application known as an intrusion detection system (IDS) can be deployed in an SDN network. An IDS may be a device whose job is to monitor the network and raise an alarm when it observes abnormal activity. An IDS may monitor a single system or a whole network. If an IDS finds malicious traffic, it reports it directly to the administrator or to a centrally placed security information and event management system. A good IDS is one that precisely separates normal activity from malicious data. An IDS typically uses an alarm filtering method to minimize the false alarm rate. Depending on whether it protects a host or a whole network, an IDS is of two types: host intrusion detection system (HIDS) and network intrusion detection system (NIDS). A HIDS monitors a single host or device independently and finds malicious data by comparing it with normal data or activity. A NIDS, on the other hand, monitors a number of devices in parallel. A NIDS is placed at a planned point in the SDN. It monitors all the subnets that are part of the network and inspects all the data that moves from one subnet to another.

Besides HIDS and NIDS, there are three further types of IDS: protocol-based intrusion detection system (PIDS), application protocol-based intrusion detection system (APIDS) and hybrid intrusion detection system. A PIDS is mainly placed at the server to monitor the protocol used between the user and the server, typically the HTTPS protocol stream. An APIDS works on a group of servers to monitor communication over application-specific protocols. Detection methods in IDS fall into two categories. Signature-based intrusion detection uses patterns such as the number of ones, the number of zeros or the number of bytes, and finds malicious attacks by comparing traffic against these patterns. The known attack patterns are called signatures. The main limitation of this method is that it cannot find a new malicious attack, because its pattern will not match any stored signature. To overcome this limitation, anomaly-based intrusion detection is used, because it is capable of finding new malware, which tends to appear frequently. In this method, machine learning is used to build a model of trusted behaviour; incoming traffic is compared with this model and is declared malicious if it does not match. Many machine learning algorithms have been developed so far for anomaly flow detection, and they are discussed in the next section of this paper. Later sections describe the proposed ML model, a classification model that classifies anomalies based on selected features. The model uses 10-fold cross-validation for training.

A number of machine learning methods are used for classification, such as J48, Random Forest (RF), Projective Adaptive Resonance Theory (PART), Naive Bayes, Decision Tree, Radial Basis Function Network (RBFN) and Bayesian networks, and they are compared on accuracy in later sections of this paper. These methods are applied to the NSL-KDD data set so that their performance can be compared under different feature selection mechanisms such as info gain and correlation-based feature selection (CFS).
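The difference between the two detection categories can be illustrated with a small sketch. It is not the model used in this paper: the signature list, the packet-size profile and the three-standard-deviation tolerance are all assumptions chosen for illustration, and a simple statistical profile stands in for a trained ML model.

```python
import statistics

# Hypothetical byte patterns standing in for a signature database.
SIGNATURES = [b"/etc/passwd", b"' OR '1'='1"]


def signature_detect(payload):
    """Signature-based: flag traffic only if it contains a known pattern."""
    return any(sig in payload for sig in SIGNATURES)


class NormalProfile:
    """Anomaly-based: learn what normal looks like, flag large deviations."""

    def __init__(self, normal_sizes, tolerance=3.0):
        self.mean = statistics.mean(normal_sizes)
        self.stdev = statistics.stdev(normal_sizes)
        self.tolerance = tolerance  # assumed cut-off, in standard deviations

    def is_anomalous(self, size):
        return abs(size - self.mean) > self.tolerance * self.stdev


# Profile built from made-up payload sizes of normal traffic.
profile = NormalProfile([500, 520, 480, 510, 490])
new_attack = b"A" * 9000  # unseen attack: matches no stored signature

missed_by_signatures = not signature_detect(new_attack)
caught_by_anomaly = profile.is_anomalous(len(new_attack))
```

In this toy case the oversized payload matches no signature but deviates strongly from the learned profile, which is the situation anomaly-based detection is meant to cover.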

2. RELATED WORK

Flow-based anomaly detection has become a topic of interest. This section reviews the related work done so far. Many studies have been performed on SDN architecture and machine learning (ML) algorithms. A flow-based anomaly detection architecture based on a Multilayer Perceptron (MLP) and the Gravitational Search Algorithm (GSA) was proposed in [10]; the model was designed to separate normal and abnormal traffic over the network. Subsequently, a Support Vector Machine (SVM) implemented on a NIDS achieved better accuracy and a low false alarm rate [11]. In [12], a model was designed based on the NOX controller and Open Flow switches. Four algorithms were introduced for anomaly detection: Threshold Random Walk (TRW), maximum entropy detector, rate limiting and NETAD. All of these were effective in detecting anomalies over the network. In [13], a DDoS attack model based on flow features was introduced; it used a self-organizing map for anomaly detection. In [14], an SVM classifier for DDoS attacks was introduced, giving a very low false positive alarm rate. In [15], an SVM classifier was introduced based on an optimized protection mechanism. A Deep Neural Network (DNN) anomaly detection system based on six features was modeled in [16] and achieved high accuracy in detecting anomalies. To enhance detection accuracy and attain a high level of performance, a new model was proposed in [17] and compared with other models to measure its performance. A risk assessment mechanism was also proposed in [18] to assess the impact of multi-stage attacks in advance. Two-layer and three-layer intrusion detection models, mainly compatible with wireless sensor networks, were proposed in [19]. A model was designed in [20] to study the impact of application Distributed Denial of Service (DDoS) attacks and to check various server parameters; it was mainly designed to measure attack intensity and server performance under different attacks. A database intrusion detection system for preventing unauthorized access to data was designed in [21]. Various ranker algorithms, including Relief-F, Information Gain and Gini Index, were developed in [22], which introduced the concept of a ranker for feature selection in intrusion detection.

3. ML CLASSIFICATION MODEL

The main aim is to construct an intrusion detection scheme that is feasible for SDN and that gives efficient results in the form of a low false alarm rate and a high detection rate. An efficient intrusion detection scheme separates intrusion data from normal data. This classification task can be performed in two layers, as shown in Figure 1. Because classification is done on a set of features, a feature set containing only the relevant features must be formed first. This is done at the first layer by eliminating unnecessary or irrelevant features. After feature selection, a classifier is generated at the second layer using a suitable machine learning (ML) algorithm. The resulting classification model is then trained using the 10-fold cross-validation method.
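The two-layer scheme can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the retained feature indices are assumed to have been chosen already by a first-layer evaluator, the classifier is a small categorical Naive Bayes written for this example, and the toy records are invented.

```python
from collections import Counter, defaultdict
import math


def select_features(rows, keep):
    """Layer 1: keep only the columns listed in `keep` (chosen by any evaluator)."""
    return [tuple(row[i] for i in keep) for row in rows]


class CategoricalNaiveBayes:
    """Layer 2: a tiny Naive Bayes classifier for discrete features (illustrative)."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(feature index, class)][value] = number of occurrences
        self.value_counts = defaultdict(Counter)
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[(j, label)][value] += 1
                self.values[j].add(value)
        return self

    def predict_one(self, row):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)
            for j, value in enumerate(row):
                # Laplace smoothing so unseen values do not zero out the score
                num = self.value_counts[(j, label)][value] + 1
                den = count + len(self.values[j])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, rows):
        return [self.predict_one(row) for row in rows]


# Toy records: (protocol, flag, service); all values are made up for illustration.
train = [
    ("tcp", "S0", "http"), ("tcp", "S0", "ftp"), ("tcp", "SF", "http"),
    ("udp", "SF", "dns"), ("tcp", "S0", "smtp"), ("udp", "SF", "http"),
]
labels = ["attack", "attack", "normal", "normal", "attack", "normal"]

keep = [0, 1]  # assumed output of the feature selection layer
model = CategoricalNaiveBayes().fit(select_features(train, keep), labels)
test_rows = [("tcp", "S0", "dns"), ("udp", "SF", "ftp")]
predictions = model.predict(select_features(test_rows, keep))
```

Keeping the two layers as separate steps means a different evaluator or a different classifier can be swapped in without touching the other layer.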

Figure 1 Classification Model Based on ML

4. MACHINE LEARNING

Machine learning is increasingly used in many applications because of its ability to train a host or computer without explicit programming [22]. As an advantage, all computers and controllers working in a distributed environment can operate in a similar manner. Using machine learning algorithms, a computer can be trained to detect anomalies on the network, as shown in Figure 2.


Figure 2 Flowchart of ML anomaly detection algorithm

Many classification techniques are implemented using ML. Nowadays, machine learning algorithms are increasingly used for classification and regression.

5. DATASET USED

The model must be trained on a particular data set, and the accuracy of the data set used is important. Along with accuracy, other properties such as consistency, lack of redundancy and freedom from noise are also important. Many data sets that can be used for anomaly detection are readily available on the web, and some are self-made [19]. Before use, a data set needs to be preprocessed and normalized. Our experiments use the NSL-KDD data set, a refined form of KDDCup99 [20]. KDDCup99 is a very complex data set and takes a huge amount of time when used for intrusion detection. NSL-KDD also reduces the redundancy between the training and test sets. Its records fall into four classes of attacks, shown in Table 1, in addition to normal traffic.

Table 1 Classes of attacks

Type of Class    Attack Type    Description
Class 1          DoS            Denial of Service
Class 2          U2R            User to root
Class 3          Probe          Attack related to web structure
Class 4          R2L            Remote to local

The NSL-KDD data set consists of 41 features that may be used to detect attacks on the network. It contains more than 125000 data samples for training and more than 25000 data samples for testing.
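The preprocessing step mentioned in this section can be sketched as below. The sketch assumes min-max scaling as the normalization method, which the text does not specify, and maps only a handful of NSL-KDD attack labels to the four classes of Table 1; the function names and the toy values are illustrative.

```python
# Mapping from a few NSL-KDD attack labels to the four classes of Table 1.
# Only a handful of labels are listed; the dataset contains many more.
ATTACK_CLASS = {
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L",
}


def to_class(label):
    """Map a raw label to 'normal', one of the four attack classes, or 'unknown'."""
    if label == "normal":
        return "normal"
    return ATTACK_CLASS.get(label, "unknown")


def min_max_normalize(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to all zeros."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


durations = [0, 5, 10, 20]  # toy values for a numeric feature
scaled = min_max_normalize(durations)
classes = [to_class(name) for name in ["normal", "neptune", "nmap", "rootkit"]]
```

Scaling each numeric feature to the same range keeps features with large raw values, such as byte counts, from dominating distance-based or gradient-based classifiers.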

6. FEATURE SELECTION

Feature selection keeps only the relevant and necessary features in the data set and removes the unnecessary ones. This is done to increase detection accuracy, because accuracy degrades as the redundancy of features in the data set increases. Features can be re-selected easily with different evaluators and search techniques such as Ranker and Best First Search. Evaluators include Chi-Square, Gain Ratio, the CFS subset evaluator and Symmetric Uncertainty.
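Three of the entropy-based evaluators, combined with a Ranker-style search, can be sketched as follows. The sketch assumes discrete feature values (numeric NSL-KDD features would first need to be discretized) and uses the standard textbook definitions of information gain, gain ratio and symmetric uncertainty; the function names and toy columns are illustrative and not taken from the authors' implementation.

```python
from collections import Counter
import math


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def info_gain(feature, labels):
    """H(class) - H(class | feature) for one discrete feature column."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature itself."""
    h_feature = entropy(feature)
    return info_gain(feature, labels) / h_feature if h_feature > 0 else 0.0


def symmetric_uncertainty(feature, labels):
    """2 * IG / (H(feature) + H(class)), a value between 0 and 1."""
    denom = entropy(feature) + entropy(labels)
    return 2.0 * info_gain(feature, labels) / denom if denom > 0 else 0.0


def rank_features(columns, labels, score=info_gain):
    """Ranker-style search: sort feature names by score, best first."""
    scores = {name: score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


labels = ["attack", "attack", "normal", "normal"]
columns = {
    "flag": ["S0", "S0", "SF", "SF"],          # perfectly predicts the class
    "protocol": ["tcp", "udp", "tcp", "udp"],  # carries no class information
}
ranking = rank_features(columns, labels)
```

A Ranker scores each feature independently and keeps the top of the list, whereas a subset evaluator such as CFS with Best First Search scores groups of features together; only the former is shown here.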


7. EVALUATION METRICS

The main property of an anomaly detection algorithm is its detection rate. The quality of an anomaly detection algorithm can be measured by how accurately it detects attacks; this measure is known as accuracy. Other measures are recall, false alarm rate, Matthews correlation coefficient and F-measure. In IDS evaluation, a confusion matrix is used. As shown in Table 2, the confusion matrix consists of four values: T+, T-, F+ and F-. Many measures can be calculated from the confusion matrix.

Table 2 Confusion Matrix

Value    Value Name        Description
T+       True positive     Total number of attack packets which are correctly predicted as attack.
T-       True negative     Total number of normal packets which are correctly predicted as normal.
F+       False positive    Total number of packets which are actually normal but declared as attack.
F-       False negative    Total number of packets which are actually attack but declared as normal.

7.1 Accuracy

Accuracy is the ratio of the number of correct classifications to the total number of samples, that is, all true (positive or negative) and all false (positive or negative) classifications.

$$\text{Accuracy} = \frac{T^{+} + T^{-}}{T^{+} + T^{-} + F^{+} + F^{-}}$$

7.2 Precision

Precision is the ratio of true positive classifications to all true positive and false positive classifications. It decreases as the number of false alarms increases.

$$\text{Precision} = \frac{T^{+}}{T^{+} + F^{+}}$$

7.3 Recall

Recall is the ratio of true positive classifications to all true positive and false negative classifications.

$$\text{Recall} = \frac{T^{+}}{T^{+} + F^{-}}$$

7.4 F - Measure

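The F-measure summarizes detection quality as the harmonic mean of precision and recall.

$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$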

7.5 Matthews Correlation Coefficient

It is the correlation coefficient between the predicted and observed classifications. Its value lies between -1 and +1.

$$\text{MCC} = \frac{T^{+} T^{-} - F^{+} F^{-}}{\sqrt{(T^{+} + F^{+})(T^{+} + F^{-})(T^{-} + F^{+})(T^{-} + F^{-})}}$$
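The measures of this section can be computed together from the four confusion-matrix counts, as in the sketch below. The function name and the toy counts are invented for illustration, and the false alarm rate is taken here as F+ / (F+ + T-), which is an assumption because the text does not define it.

```python
import math


def ids_metrics(tp, tn, fp, fn):
    """Return common IDS metrics from confusion-matrix counts.

    tp, tn, fp and fn stand for T+, T-, F+ and F- in Table 2.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        # Assumed definition: share of negatives wrongly raised as alarms.
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Toy counts, invented for illustration only.
metrics = ids_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these toy counts the accuracy is 170/200 = 0.85, the recall is 80/100 = 0.8 and the false alarm rate is 10/100 = 0.1.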

8. FLOW BASED ANOMALY DETECTION ARCHITECTURE

The SDN controller is the central part of our architecture. All the Open Flow switches are connected to the SDN controller. The controller can request network data from an Open Flow switch whenever it needs to. To receive network data, the controller sends a request message to all Open Flow switches using an Open Flow stats request, as shown in Figure 3. On receiving the request message, each Open Flow switch sends its network data to the SDN controller. All incoming packets are handled by the Open Flow switches, which use flow tables managed through the Open Flow (OF) protocol. When an intrusion is detected, it can be eliminated by updating the flow table through OF.
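The request, reply and flow-table update cycle described above can be simulated in a few lines. Everything in this sketch is hypothetical: the `Switch` class, its methods and the threshold-based `is_anomalous` function are stand-ins written for illustration and are not the API of any real Open Flow controller or of the trained classifier.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical stand-in for an Open Flow switch (not a real controller API)."""
    name: str
    # flow_table maps a flow key (src, dst) to [packet_count, action]
    flow_table: dict = field(default_factory=dict)

    def stats_reply(self):
        """Answer a stats request with a snapshot of per-flow packet counts."""
        return {flow: entry[0] for flow, entry in self.flow_table.items()}

    def install_drop(self, flow):
        """Flow-table update sent by the controller to block a flow."""
        self.flow_table[flow][1] = "drop"


def is_anomalous(packet_count, threshold=1000):
    """Placeholder for the trained classifier; a fixed threshold is assumed."""
    return packet_count > threshold


def poll_and_mitigate(switches):
    """Controller step: request stats, classify flows, update flow tables."""
    blocked = []
    for switch in switches:
        for flow, packets in switch.stats_reply().items():
            if is_anomalous(packets):
                switch.install_drop(flow)
                blocked.append((switch.name, flow))
    return blocked


s1 = Switch("s1", {("10.0.0.1", "10.0.0.9"): [120, "forward"],
                   ("10.0.0.7", "10.0.0.9"): [50000, "forward"]})
blocked = poll_and_mitigate([s1])
```

In a real deployment the threshold function would be replaced by the classifier trained on the selected features, and the drop rule would be pushed to the switch as a flow modification message.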

Figure 3 Proposed anomaly detection architecture in SDN

9. RESULT AND ANALYSIS

The NSL-KDD data set is well suited to running ML algorithms in MATLAB. The experiments used a computer with an Intel Core i5 processor (2.50 GHz) and 4 gigabytes of memory, running MATLAB. With MATLAB, machine learning algorithms can be trained and compared on different parameters. The data set is first normalized and then the 10-fold cross-validation technique is applied [20]. The whole training set is divided into 10 subsets.


The model is trained on 9 subsets and tested on the remaining one, and this is repeated so that each subset is used for testing once. Using the NSL-KDD data set, we obtain the true positive, false negative and accuracy values for the different classifiers. NSL-KDD is well suited to comparing machine learning algorithms for classification. It is a newer version of the KDD (Knowledge Discovery and Data Mining) data set and consists of several files with different extensions, including files for the training set and the test set.

The training and test sets in NSL-KDD also contain non-redundant records. The comparison of the different machine learning algorithms is presented as bar graphs produced in MATLAB.
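The 10-fold procedure can be sketched as an index split. This is an illustrative stand-in for the MATLAB workflow, not the authors' code; the function name, the fixed shuffle seed and the sample count of 50 are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i, test in enumerate(folds):
        # Train on the other k - 1 folds, test on fold i.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=50, k=10))
```

Each sample appears in exactly one test fold, so averaging a classifier's accuracy over the 10 folds uses every record for testing once and for training nine times.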

Figure 5 Highest accuracy of PART with CFS


Figure 7 RF with Symmetric Uncertainty

This approach calculates the accuracy of different classifiers for each feature selection approach. The feature selection methods include correlation-based feature selection (CFS), info gain, chi-square and symmetric uncertainty. All of these methods reduce the number of input variables of a machine learning algorithm and help it train faster by reducing its complexity [21]. The results show that the accuracy of the Random Forest (RF) classifier with the Info-Gain feature selection method is higher than that of the other ML-based classifiers, as shown in Figure 4. The figures also show the effect of changing the search strategy between Best First Search (BFS) and Ranker. The CFS subset evaluator provides better results for detecting attacks with the PART classifier, as shown in Figure 5. So, with Best First Search, the CFS subset evaluator achieves high accuracy.

Random Forest (RF) also has higher accuracy than the other classifiers when it is used with Gain Ratio, Symmetric Uncertainty and Chi-Square, as shown in Figure 6, Figure 7 and Figure 8 respectively. Classification is first performed on two classes, and an Open Flow controller-switch architecture in SDN is then proposed for flow-based anomaly detection. Across all results, the Random Forest (RF) classifier has the best accuracy, 81%, using the Gain Ratio feature selection method, as shown in Figure 6. All approaches are compared using the complete NSL-KDD data set. Random Forest (RF) also has a very low false alarm rate of 0.28%. The performance of the NSL-KDD data set on an Artificial Neural Network (ANN) has been evaluated in [21]. Only 7 features are used for testing data in [22]. DDoS attacks have also been identified using ML techniques [22]. In [23], ML algorithms such as J48 and Naive Bayes were implemented to achieve high accuracy.

10. CONCLUSION

Many machine learning algorithms exist for detecting anomalies over a network. Algorithms such as Decision Tree, Random Forest (RF), J48, Naive Bayes, Radial Basis Function Network (RBFN) and Projective Adaptive Resonance Theory (PART) are mainly used for classification. Classification requires feature selection, on the basis of which the algorithm is applied. In this paper, classifiers based on machine learning are compared over Software Defined Network (SDN) in terms of accuracy and feature selection. Different feature selection methods using the NSL-KDD dataset are also introduced here. The results show that the Random Forest (RF) algorithm performs best among the ML algorithms compared, based on its high accuracy and low false alarm rate. The work can be extended to other data sets. In future, a model can be proposed that is implemented over real SDN traffic. For this, feature selection methods based on nature-inspired algorithms can be used.

REFERENCES

[1] McKeown N, Anderson T, Balakrishnan H et al., (2008). ‘OpenFlow: enabling innovation in campus networks’. ACM SIGCOMM Comput Commun Rev 38(2) pp: 69–74.

[2] Gude N, Koponen T, Pettit J, Pfaff B, Casado M, McKeown N, Shenker S (2008), ‘Nox: towards an operating system for networks’. ACM SIGCOMM Computer Communication Rev 38(3) pp.105-110.

[3] Tavallaee M, Bagheri E, Lu W, Ghorbani A-A (2009). ‘A detailed analysis of the kdd cup 99 dataset’. In: Proceedings of the second IEEE symposium on computational intelligence for security and defence applications.

[4] Braga R, Mota E, Passito A (2010). ‘Lightweight ddos flooding attack detection using nox/openflow’. In: 2010 IEEE 35th conference on local computer networks (LCN). IEEE, pp: 408–415.

[5] Winter P, Hermann E, Zeilinger M (2011). ‘Inductive intrusion detection in flow-based network data using one-class support vector machines’. In: 2011 4th IFIP international conference on new technologies, mobility and security (NTMS). IEEE, pp. 1–5.

[6] Mehdi SA, Khalid J, Khayam SA (2011). ‘Revisiting traffic anomaly detection using software defined networking’. In: International workshop on recent advances in intrusion detection. Springer, pp. 161–180.

[7] Meng YX (2011). ‘The practice on using machine learning for network anomaly intrusion detection’. In: International conference on machine learning and cybernetics (ICMLC), vol 2. IEEE, pp. 576–581.

[8] Jain S, Kumar A, Mandal S, Ong J, Poutievski L, Singh A, Venkata S, Wanderer J, Zhou J, Zhu M et al., (2013). ‘B4: experience with a globally-deployed software defined wan’. ACM SIGCOMM Computer Communication Rev 43(4):3–14.

[9] Erickson D (2013). ‘The Beacon OpenFlow controller’. In: Proceedings of the second ACM SIGCOMM workshop on hot topics in software defined networking. ACM, pp. 13–18.

[10] Jadidi Z, Muthukkumarasamy V, Sithirasenan E, Sheikhan M (2013). ‘Flow-based anomaly detection using neural network optimized with gsa algorithm’. In: 2013 IEEE 33rd international conference on distributed computing systems workshops, pp. 76–81.

[11] Kokila R, Selvi ST, Govindarajan K (2014). ‘Ddos detection and analysis in sdn-based environment using support vector machine classifier’. In: 2014 sixth international conference on advanced computing (ICoAC). IEEE, pp. 205–210.


[12] Ashraf J, Latif S (2014). ‘Handling intrusion and DDoS attacks in Software Defined Networks using machine learning techniques’. In: 2014 national software engineering conference, Rawalpindi, pp. 55–60.

[13] Dhanabal L, Shantharajah P (2015). ‘A study on NSL-KDD dataset for intrusion detection system based on classification algorithms’. Int J Adv Res Comput Commun Eng 446–452.

[14] Ingre B, Yadav A (2015). ‘Performance analysis of NSL-KDD dataset using ANN’. In: 2015 International conference on signal processing and communication engineering systems, Guntur, pp. 92–96.

[15] Phan TV, Van Toan T, Van Tuyen D, Huong TT, Thanh NH (2016). ‘Openflowsia: an optimized protection scheme for software-defined networks from flooding attacks’. In: 2016 IEEE sixth international conference on communications and electronics (ICCE). IEEE, pp. 13–18.

[16] Tang T, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M (2016). ‘Deep learning approach for network intrusion detection in software defined networking’. In: 2016 International conference on wireless networks and mobile communications (WINCOM) (WINCOM16), Fez, Morocco, Oct 2016.

[17] Louridas P, Ebert C (2016). ‘Machine learning’. IEEE Softw 33(5):110–115.

[18] Abubakar A, Pranggono B (2017). ‘Machine learning based intrusion detection system for software defined networks’. In: Proceedings of the 2017 eighth international conference on emerging security technologies (EST). IEEE.

[19] C. T. Huawei Press Centre and H. unveil world’s first commercial deployment of SDN in carrier networks (28 Feb 2018). Retrieved from http://pr.huawei.com/en/news/hw-332209-sdn.htm.

[20] ‘Open Networking Foundation, ONF SDN Evolution’ (25 Feb 2018). Retrieved from http://3vf60mmveq1g8vzn48q2o71awpengine.netdnassl.com/wpcontent/uploads/2013/05/TR-535-ONF-SDN-Evolution.pdf

[21] ‘OpenDaylight: a Linux foundation collaborative project’ (11 March 2018). Retrieved from http://www.opendaylight.org

[22] ‘Floodlight’ (15 March 2018). Retrieved from http://www.projectfloodlight.org.

[23] Gupta A., Didwania B., Singh G., Gupta H.P., Mishra R., Dutta T. (2020) ‘Impact of Network Load for Anomaly Detection in Software-Defined Networking’. In: Kolhe M., Tiwari S., Trivedi M., Mishra K. (eds) Advances in Data and Information Sciences, Springer Singapore, pp. 127-134.