Hybrid Feature Selection Algorithm for Intrusion Detection System in Mobile Ad hoc Network


V. Asaithambi*¹ and N. Rama²

*¹ Assistant Professor, Department of Computer Science, Govt. Arts College for Men, Nandanam, Chennai-600035, INDIA.

² Principal, Govt. Arts and Science College for Women, Villupuram-605602, INDIA.

email:

¹

(Received on: September 14, 2018)

ABSTRACT

A Mobile Ad hoc Network (MANET) is a network of dynamic nodes without any fixed infrastructure. The dynamic nature of the nodes and the absence of infrastructure give the nodes freedom of movement. From a security standpoint, however, there are many ways for such a network to become insecure. Although a MANET is highly cooperative and dynamic, it requires an efficient mechanism to protect both the nodes and the data flowing between them. Any network can be affected by unknown intruders, so intrusion detection and prevention are necessary to protect it. Intrusion detection can be achieved with detection algorithms.

These algorithms can be applied to networks in real time. For testing, however, recorded network traces can be used as network data. The KDD Cup 99 dataset is a widely used network dataset for intrusion detection algorithms. It contains 42 features, and not every algorithm uses all of them; a set of features can be selected depending on the requirements of the algorithm. This paper proposes a way of selecting a set of features for intrusion detection using two algorithms, namely the Attribute Influenced Feature Selection Algorithm (AIFSA) and the Efficient Random Subset Feature Selection Algorithm (ERSFSA).

Keywords: KDD, MANET, Information Gain, Gain Ratio.

1. INTRODUCTION

A Mobile Ad hoc Network (MANET) is a network that is dynamic in nature and configures its own topology. Each node in the network acts as a transmitter and a receiver simultaneously. The links between the nodes are wireless. Because the network is wireless, every node and the data communicated over the network are exposed to many kinds of attack. An intrusion detection mechanism is required to identify the type of attack. Most intrusion detection algorithms are tested on the KDD Cup 99 dataset. Since not all features in this dataset are necessary for testing the algorithms, specific features have to be selected.

This paper proposes a way to select specific features from the dataset by combining existing algorithms. A pair of algorithms is developed. The first is the Attribute Influenced Feature Selection Algorithm (AIFSA) and the second is the Efficient Random Subset Feature Selection Algorithm (ERSFSA). After these algorithms are run, a set of features is selected for the intrusion detection process. The selected features can be used as input to the detection algorithms.

2. PROPOSED METHODS

System Architecture: The system architecture of the feature selection algorithm is given in Figure-1. Two categories of algorithms are used. The first is based on the characteristics of the attributes: the information gain and gain ratio of each attribute are calculated. The second is the widely accepted random subset method. Random data is important in any research because of its dynamic nature, and the result set of this method is a random set. The combination of both results forms the final dataset.

Figure-1: System Architecture

Attribute Influence: There are 42 attributes in the KDD Cup 99 dataset. The attributes influence the type of attack. One of the attributes is called ‘label’, which gives the name of the attack. The influence of an attribute can be calculated from its information gain and gain ratio. The idea of best subset selection is also used to finalize the attributes for processing.


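Attribute Information Gain: The information gain of an attribute is the reduction in entropy obtained by knowing that attribute. It measures the quantity of information an attribute provides about the class ‘label’. The information gain can be mathematically defined as follows

$$IG(\text{label}, A) = H(\text{label}) - H(\text{label} \mid A)$$

Where

$H(\text{label})$ - entropy of the class ‘label’

$H(\text{label} \mid A)$ - entropy of the class ‘label’ that remains once the value of the attribute $A$ is known

With $m$ the total number of instances and $m_k$ instances belonging to class $k$, the entropy is calculated as follows

$$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$

Where

$k = 1, \dots, K$.

$p_k = m_k / m$ - probability that an instance belongs to the $k$-th class.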
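The two definitions above can be illustrated with a minimal Python sketch. It assumes a nominal attribute (a continuous attribute would have to be discretized first), and the function names `entropy` and `information_gain` as well as the toy records are hypothetical; they are not taken from the paper or from the KDD Cup 99 data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H = -sum_k p_k * log2(p_k), with p_k = m_k / m."""
    m = len(labels)
    return -sum((m_k / m) * log2(m_k / m) for m_k in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in the entropy of `labels` after splitting on `values`."""
    m = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class within each attribute value.
    conditional = sum(len(g) / m * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy records (hypothetical, for illustration only).
protocol = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "normal", "normal", "smurf", "smurf"]
ig = information_gain(protocol, label)
```

In this toy case every attribute value maps to a single class, so the conditional entropy is zero and the information gain equals the entropy of the class itself.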

Using the above equations, the information gain of every attribute of the KDD dataset with respect to the ‘label’ attribute is calculated and given in Table-1. All 41 attributes are listed with their information gain and their serial number in the KDD Cup 99 dataset. A total of 494020 instances are used for this calculation. An attribute with higher information gain has more impact on the ‘label’ class. The attributes lnum_outbound_cmds and is_host_login have no information gain for the class ‘label’.

Table-1. Information Gain for all 41 features.

| Information Gain | Attribute Sl. No. | Attribute Name | Information Gain | Attribute Sl. No. | Attribute Name |
|---|---|---|---|---|---|
| 1.4384029 | 5 | src_bytes | 0.1722753 | 40 | dst_host_rerror_rate |
| 1.3902034 | 23 | count | 0.1619854 | 41 | dst_host_srv_rerror_rate |
| 1.3552348 | 3 | service | 0.1223836 | 27 | rerror_rate |
| 1.0977182 | 24 | srv_count | 0.1111194 | 28 | srv_rerror_rate |
| 1.0961848 | 36 | dst_host_same_src_port_rate | 0.0631425 | 1 | duration |
| 1.0159265 | 2 | protocol_type | 0.0502661 | 10 | hot |
| 0.9054253 | 33 | dst_host_srv_count | 0.039662 | 13 | lnum_compromised |
| 0.8997895 | 35 | dst_host_diff_srv_rate | 0.0269304 | 8 | wrong_fragment |
| 0.8713826 | 34 | dst_host_same_srv_rate | 0.0061675 | 22 | is_guest_login |
| 0.7844647 | 30 | diff_srv_rate | 0.0029189 | 16 | lnum_root |
| 0.7773695 | 29 | same_srv_rate | 0.0022768 | 19 | lnum_access_files |
| 0.7645689 | 4 | flag | 0.00167 | 17 | lnum_file_creations |
| 0.5820167 | 6 | dst_bytes | 0.0015435 | 11 | num_failed_logins |
| 0.5662843 | 38 | dst_host_serror_rate | 0.0008922 | 14 | lroot_shell |
| 0.5449353 | 25 | serror_rate | 0.0006715 | 7 | land |
| 0.5438835 | 39 | dst_host_srv_serror_rate | 0.0004195 | 18 | lnum_shells |
| 0.521429 | 26 | srv_serror_rate | 0.0000961 | 9 | urgent |
| 0.4364079 | 12 | logged_in | 0.0000793 | 15 | lsu_attempted |
| 0.3470899 | 32 | dst_host_count | 0 | 20 | lnum_outbound_cmds |
| 0.3060166 | 37 | dst_host_srv_diff_host_rate | 0 | 21 | is_host_login |
| 0.183226 | 31 | srv_diff_host_rate | | | |


Attribute Gain Ratio: Gain ratio (GR) is an extension of information gain. Information gain is biased toward attributes with many distinct values, and gain ratio reduces this bias. Gain ratio is the ratio between the information gain and the intrinsic information of an attribute. It can be calculated using the following equation.

$$\text{Gain Ratio}(\text{Attribute}) = \frac{\text{Gain}(\text{Attribute})}{\text{Intrinsic\_info}(\text{Attribute})}$$
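As a hedged illustration of this ratio, the sketch below takes the intrinsic information of an attribute to be the entropy of its own value distribution (the usual split-information definition; the paper does not spell it out, so this is an assumption). The function names and the toy records are hypothetical, and returning 0.0 for a constant attribute is a convention chosen for this sketch.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of `items`."""
    m = len(items)
    return -sum((c / m) * log2(c / m) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of `values` divided by its intrinsic information."""
    m = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / m * entropy(g) for g in groups.values())
    intrinsic = entropy(values)  # assumed definition: entropy of the split itself
    # A constant attribute has zero intrinsic information; report 0.0 for it.
    return gain / intrinsic if intrinsic > 0 else 0.0


# Toy records (hypothetical): both attributes separate the classes perfectly,
# but record_id does so only because every value is unique.
label = ["normal", "normal", "attack", "attack"]
flag = ["SF", "SF", "S0", "S0"]
record_id = [1, 2, 3, 4]
gr_flag = gain_ratio(flag, label)
gr_id = gain_ratio(record_id, label)
```

Both toy attributes have the same information gain, yet the identifier-like attribute receives only half the gain ratio, which is the bias reduction the measure is meant to provide.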

The attributes lnum_outbound_cmds and is_host_login have negligible gain ratio. All the other attributes have some gain ratio. Even though the other 39 attributes all have a nonzero gain ratio, only those with the highest gain ratio need be considered.

Using the above equation, the gain ratio of every attribute of the KDD dataset with respect to the ‘label’ attribute is calculated and given in Table-2. All 41 attributes are listed with their gain ratio and their serial number in the KDD Cup 99 dataset. A total of 494020 instances are used for this calculation. An attribute with a higher gain ratio has more impact on the ‘label’ class. The attributes lnum_outbound_cmds and is_host_login have zero gain ratio for the class ‘label’.

When an attribute has a large number of distinct values, the information gain ratio biases the decision tree against considering it. This keeps attributes with many distinct values from dominating the selection made on the training set.

The information gain and gain ratio of the attributes are used to select features from the network features of the KDD dataset. The Attribute Influenced Feature Selection Algorithm (AIFSA) is implemented as shown below.

INPUT: KDD Cup’99 Dataset

OUTPUT: Selected Features

Step 1: Start

Step 2: Read KDD Data set

Step 3: Calculate Information Gain using Entropy of attributes

Step 4: Sort the attributes in descending order of Information Gain

Step 5: Save the result.

Step 6: Calculate Gain Ratio using Original Input data

Step 7: Sort the attributes in descending order of Gain Ratio

Step 8: Save the result.

Step 9: Choose the common attributes from the above results on first fit order.

Step 10: Save the result.

Step 11: Stop.
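The ranking and intersection steps listed above can be sketched in Python as follows. The steps do not define "first fit order" precisely, so this sketch assumes one possible reading: keep the attributes that fall within the top `k` of both rankings, in information gain order. The cutoff `k` and the function names are hypothetical; the scores are a handful of (serial number, value) pairs copied from Table-1 and Table-2.

```python
def rank_descending(scores):
    """Attribute serial numbers sorted by score, highest first (Steps 4 and 7)."""
    return sorted(scores, key=scores.get, reverse=True)


def common_top_k(info_gain, gain_ratio, k):
    """Attributes in the top-k of both rankings (an assumed reading of Step 9)."""
    top_ig = rank_descending(info_gain)[:k]
    top_gr = set(rank_descending(gain_ratio)[:k])
    return [attr for attr in top_ig if attr in top_gr]


# Serial number -> score, copied from Table-1 and Table-2 for seven attributes.
info_gain = {5: 1.4384029, 23: 1.3902034, 3: 1.3552348, 2: 1.0159265,
             4: 0.7645689, 8: 0.0269304, 7: 0.0006715}
gain_ratio = {8: 0.992, 7: 0.948, 2: 0.861, 4: 0.766,
              3: 0.719, 5: 0.579, 23: 0.476}

# k = 5 is an arbitrary cutoff chosen only for this illustration.
selected = common_top_k(info_gain, gain_ratio, k=5)
```

With only seven attributes and an arbitrary cutoff, the output is not meant to reproduce the selection reported in the paper; it only shows how the two saved rankings could be combined.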


Table-2. Gain Ratio for all 41 features.

| Gain Ratio | Attribute Sl. No. | Attribute Name | Gain Ratio | Attribute Sl. No. | Attribute Name |
|---|---|---|---|---|---|
| 0.992 | 8 | wrong_fragment | 0.494 | 34 | dst_host_same_srv_rate |
| 0.948 | 7 | land | 0.477 | 33 | dst_host_srv_count |
| 0.933 | 13 | lnum_compromised | 0.476 | 23 | count |
| 0.861 | 2 | protocol_type | 0.407 | 22 | is_guest_login |
| 0.813 | 11 | num_failed_logins | 0.406 | 37 | dst_host_srv_diff_host_rate |
| 0.766 | 4 | flag | 0.392 | 24 | srv_count |
| 0.763 | 10 | hot | 0.352 | 32 | dst_host_count |
| 0.745 | 26 | srv_serror_rate | 0.347 | 40 | dst_host_rerror_rate |
| 0.739 | 25 | serror_rate | 0.346 | 27 | rerror_rate |
| 0.721 | 12 | logged_in | 0.339 | 41 | dst_host_srv_rerror_rate |
| 0.719 | 3 | service | 0.337 | 31 | srv_diff_host_rate |
| 0.716 | 30 | diff_srv_rate | 0.315 | 28 | srv_rerror_rate |
| 0.7 | 39 | dst_host_srv_serror_rate | 0.271 | 18 | lnum_shells |
| 0.691 | 38 | dst_host_serror_rate | 0.262 | 1 | duration |
| 0.634 | 36 | dst_host_same_src_port_rate | 0.253 | 17 | lnum_file_creations |
| 0.619 | 9 | urgent | 0.221 | 16 | lnum_root |
| 0.579 | 5 | src_bytes | 0.215 | 19 | lnum_access_files |
| 0.572 | 29 | same_srv_rate | 0.195 | 15 | lsu_attempted |
| 0.55 | 14 | lroot_shell | 0 | 20 | lnum_outbound_cmds |
| 0.519 | 6 | dst_bytes | 0 | 21 | is_host_login |
| 0.497 | 35 | dst_host_diff_srv_rate | | | |

Random Subset: The predictive ability of each attribute is considered when selecting the set of attributes, along with the degree of redundancy between the attributes. Weka offers two options for selecting a set of features with the random subset technique. The first uses the evaluation function CfsSubsetEval with the search function BestFirst to select a subset of attributes, and the selected set is stored. The second uses the evaluation function CfsSubsetEval with the search function GreedyStepwise, and its selected set is also stored. The attributes common to both result sets are then taken as the more specific selected features.
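To make the trade-off between predictive ability and redundancy concrete, the sketch below uses the merit heuristic commonly given for correlation-based feature selection, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of a subset of k features. It is paired with a simple greedy forward search. This is an illustrative approximation only, not Weka's implementation: the function names and the correlation values are hypothetical, and Weka's own correlation measure and search options are not reproduced.

```python
from itertools import combinations
from math import sqrt


def cfs_merit(subset, r_class, r_pair):
    """Merit of a subset: k*mean(r_cf) / sqrt(k + k*(k-1)*mean(r_ff))."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(r_class[f] for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = sum(r_pair[p] for p in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


def greedy_forward(features, r_class, r_pair):
    """Repeatedly add the feature that raises the merit most; stop when none does."""
    chosen, best = [], 0.0
    while True:
        scored = [(cfs_merit(chosen + [f], r_class, r_pair), f)
                  for f in features if f not in chosen]
        if not scored:
            return chosen
        merit, feature = max(scored)
        if merit <= best:
            return chosen
        chosen.append(feature)
        best = merit


# Hypothetical correlations: "a" and "b" are both predictive but redundant.
r_class = {"a": 0.8, "b": 0.7, "c": 0.6}
r_pair = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.1}
subset = greedy_forward(["a", "b", "c"], r_class, r_pair)
```

In this toy case the search keeps "a" and "c" and leaves out "b": although "b" correlates well with the class, it is nearly redundant with "a", so adding it lowers the merit.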

A total of 494020 instances are used for this calculation. There are 42 attributes, of which ‘label’ is treated as the class attribute. Of the 42 attributes, only 11 are selected, and both search methods select the same 11. This reduces the work of intersecting the two result sets. The Efficient Random Subset Feature Selection Algorithm (ERSFSA) is implemented as shown below.

INPUT: KDD Cup’99 Dataset

OUTPUT: Selected Features

Step 1: Start Weka

Step 2: Select the item Explorer

Step 3: Click on Open file and load the KDD Data set

Step 4: Click on the Select attributes item

Step 5: Choose Attribute Evaluator CfsSubsetEval

Step 6: Choose Search Method BestFirst

Step 7: Click Start

Step 8: Save the result.

Step 9: Click on the Select attributes item

Step 10: Choose Attribute Evaluator CfsSubsetEval

Step 11: Choose Search Method GreedyStepwise

Step 12: Click Start

Step 13: Save the result.

Step 14: Select common attributes from the above two result sets

Step 15: Save the final result.

Step 16: Stop.

3. RESULTS AND REPORTS

This paper proposes a way to select a feature set from the 41 features of the KDD dataset. Four algorithms are used to select the features. The final feature set consists of the features common to the intermediate results. The following table and graph depict the final selection process.

In total, 9 features are selected by the above algorithms. These features can be used for any data mining operation in future. From Table-3, it is clear that the features selected for further data mining are those numbered 2, 3, 4, 5, 7, 8, 23, 30 and 36. These 9 features appear in the results of all four algorithms. This is shown pictorially in Figure-2, in which the numbers of selected features are shown separately.

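Table-3. Comparison of selected features

| GainRatio | InfoGain | RS/BF | RS/GS | Feature Selected | GainRatio | InfoGain | RS/BF | RS/GS | Feature Selected |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 5 | 2 | 2 | 2 | 34 | 40 | - | - | - |
| 7 | 23 | 3 | 3 | 3 | 33 | 41 | - | - | - |
| 13 | 3 | 4 | 4 | 4 | 23 | 27 | - | - | - |
| 2 | 24 | 5 | 5 | 5 | 22 | 28 | - | - | - |
| 11 | 36 | 6 | 6 | 7 | 37 | 1 | - | - | - |
| 4 | 2 | 7 | 7 | 8 | 24 | 10 | - | - | - |
| 10 | 33 | 8 | 8 | 23 | 32 | 13 | - | - | - |
| 26 | 35 | 14 | 14 | 30 | 40 | 8 | - | - | - |
| 25 | 34 | 23 | 23 | 36 | 27 | 22 | - | - | - |
| 12 | 30 | 30 | 30 | - | 41 | 16 | - | - | - |
| 3 | 29 | 36 | 36 | - | 31 | 19 | - | - | - |
| 30 | 4 | - | - | - | 28 | 17 | - | - | - |
| 39 | 6 | - | - | - | 18 | 11 | - | - | - |
| 38 | 38 | - | - | - | 1 | 14 | - | - | - |
| 36 | 25 | - | - | - | 17 | 7 | - | - | - |
| 9 | 39 | - | - | - | 16 | 18 | - | - | - |
| 5 | 26 | - | - | - | 19 | 9 | - | - | - |
| 29 | 12 | - | - | - | 15 | 15 | - | - | - |
| 14 | 32 | - | - | - | 20 | 20 | - | - | - |
| 6 | 37 | - | - | - | 21 | 21 | - | - | - |
| 35 | 31 | - | - | - | | | | | |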


Figure-2: Comparison of selected features

4. CONCLUSION

This paper proposes a way to identify the best features for classification and performance evaluation on network traces, using the KDD Cup’99 dataset, in order to develop intrusion detection algorithms that find intruders. Although the proposed algorithms can be implemented in any programming language, Weka is used to evaluate them because the algorithms are available in Weka. The feature selection algorithm can be further improved by tuning the arguments of the algorithms with appropriate values.

