S. Revathi
, IJRIT 435 International Journal of Research in Information Technology (IJRIT) International Journal of Research in Information Technology (IJRIT) International Journal of Research in Information Technology (IJRIT) International Journal of Research in Information Technology (IJRIT)
www.ijrit.comwww.ijrit.comwww.ijrit.comwww.ijrit.com
ISSN 2001-5569
Comparison between J48 And Random Forest Decision Tree Algorithms To Detect Probe Attack.
S. Revathi 1 Dr. A. Malathi2
1Ph.D. Research Scholar, PG and Research, Department of Computer Science, Government Arts College, Coimbatore-18
2Assistant Professor, PG and Research, Department of Computer Science Government Arts College, Coimbatore-18
Abstract
In this paper, analysis for anomaly detection is presented. The method used a comparison between two different decision tree algorithm such as J4.8 and Random forest to detect probe attacks using NSL-KDD dataset. This paper describes briefly how the probe attack dataset has been used in detecting intrusive behavior and it also explains about two decision tree algorithms. Experimental result shows comparison between the two methods, which obtain best splitting and detect attacks more accurately than other existing machine learning techniques. The result shows best method to classify probe attack effectively.
Keyword: Probe attack, J4.8, Random Forest, Intrusion detection.
1. Introduction
An Intrusion detection is defined as a set of action to compromise integrity, confidentiality or availability of a resource [1]. It deals with the malicious use of computer and network resources with exterior system intrusion and internal user’s non-authorized behavior. It can detect, prevent and react to the attack.it used to discover unauthorized user behavior or abnormal activity of the computer network security [1]. The two main categories of intrusion detection are misuse detection and anomaly detection.
Data Mining is a non-trivial extraction of implicit, previously unknown, and imaginable useful information from data. It used to finds information that hidden in large volumes of data. The data mining techniques are used to find pattern and consistency in a given dataset [4]. It is an interdisciplinary filed which involves association, classification, clustering and visualization to design pattern. Now- a- days data mining is mainly used to solve problem of network intrusion based security attack [2]. From the data-centric point view, intrusion detection is a data analysis process. It maps data item into one of several pre-defined categories. The algorithms normally output "classifiers", for example, in the form of decision trees or rules.
An ideal application in intrusion detection will be to gather sufficient "normal" and "abnormal" audit data for a user or a program, then apply a classification algorithm to learn a classifier that will determine audit data as belonging to the normal class or the abnormal class [3].
S. Revathi
, IJRIT 436
The rest of the paper is organized as in section II is explains the probe attack in NSL-KDD dataset. Section III compares two decision tree algorithms. Section IV shows experimental analysis and section V draws some conclusion and future works.
2. Dataset Description
Mostly Researcher used DARPA 98 [9] and KDDcup99 [8] dataset to analysis their proposed methods for intrusion detection, but there are exist various statistical degradation in these dataset which result in poor evaluation of anomaly detection was proposed by Mahbod Tavalleein [6]. The inherent problem of KDD dataset leads to new sort of NSL KDD dataset that are mentioned in [6, 7]. It is very difficult to suggest existing networks, but it still can be applied as an effective benchmark data set to compare different intrusion detection methods.
The statistical analysis shows that the degradation affects the system performance and result in very poor estimation in anomaly detection. To solve such issues, a new data set as, NSL-KDD is used, which consists of selected records from the complete KDD data set. The main advantage of NSL KDD dataset are [7]
• No redundant records in the train set, so no biased result.
• No duplicate record in the test set, produce better reduction rates.
• The selected records in each difficult level is inversely proportional to the percentage of records in the original KDD data set.
The training dataset is consist of 21 different attacks out of the 37 present in the test dataset. The known attack types are those present in the training dataset and in addition novel attacks are present in test dataset i.e. not available in the training datasets. The major attack types are grouped into four categories: DoS, Probe, U2R and R2L. Table 1 shows the major attacks in Probe in both training and testing dataset.
Table1: Attacks in Probe
Probe Attack Names Training Attacks Testing Attacks
Portsweep 2931 157
ipsweep 3599 141
nmap 1493 73
satan 3633 735
mscan 0 996
Saint 0 319
Total 11656 2421
In Probe the attacker tries to gather information about a network systems for the apparent purpose of avoiding its security controls. IPsweep and Portsweep, sweep through IP addresses and port numbers form a victim network and host respectively looking for open ports that could potentially be used later in an attack. The total number of training attack be 11656 and the testing attack be 2421, the attack such as mscan and saint are present only in testing phase detection.
3. Proposed System 3.1 Random Forest
It is an ensemble learning method for classification and regression introduced by Lepetit et.al. [11]. It work based on decision tree to make best spilt form individual trees and classify the output based on split. It generate many decision tree classifier algorithm to bootstrap the sample that use combination of L tree based structure such as {h(X, Ѳn), N=1, 2, 3…L}, where X denotes the input data and {Ѳn} is a family of
S. Revathi
, IJRIT 437
identical and dependent distributed random vectors. Every decision tree are selected randomly from the available data. The tree grows simultaneously based on number of split nodes to improve accuracy. The correlation between the trees are reduced due to its feature selection which improves the prediction power and increase efficiency. The main advantage of random forest are [12, 13].• It overcomes the problem of over fitting
• The training data are less sensitive to outlier data
• Parameters can be set easily and eliminates the need for pruning the trees which improves accuracy.
The main benefit of using random forest is use of bagging on samples which makes random subsets on variables to achieve better result. It can handle high dimensional data modelling such as missing values, continuous, categorical and binary data effectively [10]. The bootstrapping and ensemble pattern makes Random Forest robust to overcome the problems of over fitting and prune the trees.
3.2 J48 Algorithm
J48 algorithm may also be known as c4.5 which is a univariate decision tree method. This algorithm is an extension of ID3 algorithm and creates small trees based on divide and conquer approach to grow decision tree [14]. The basic features of J48 algorithm are
• All artibute should belong to same class to split tree nodes.
• Calculate information gain based on entropy method for each attribute.
• Entropy are calculated based on measurement of uncertainty in random variable.
• Find best splitting attribute based on current selection.
The information gain are calculated using the formula
)) / ( (
) ,
( p j Entropy p Entropy j p
Gain = −
(1)Where p can be calculated using
|
|
| log |
|
|
| ) |
(
1
p
pj p
p pj Entropy
n
j
∑
=
−
=
(2)The conditional entropy is calculated based on
|
|
| log |
|
|
| ) |
/
( p
pj p
p pj j
Entropy =
(3)It first needs to create a decision tree based on the attribute values of the available training data. It discriminate the various instances and identify the attribute for the same, if there is any value for which the data instances falling within its category have the same value form the target variable, then it terminate that branch and assign the target value that have obtained [15].
4. Experimental Result Analysis
The experimental analysis of the proposed work has been executed on Weka tool [5] using NSL-KDD dataset. The network has to distinguish the different kinds of anomaly based intrusions. In this paper 11656 training dataset and 2421 testing samples with 41 attributes are used for analysis. The classification accuracy has been calculated based on two different decision tree algorithm such as random forest and J48.
The experimental result shows the mean square error as 0.0145 for random forest and 0.0085 for J48. The accuracy for Random forest be 98.48 and 98.65% for J48 which shows that J48 has improved accuracy than random forest. The precision, recall and f-value of individual attack for J48 is shown in following table.
Table 2: Performance of J48 classifier for probe attack
Precision Recall F-Measure Attack
S. Revathi
, IJRIT 438
0.991 0.986 0.979 ipsweep
0.990 0.991 0.987 satan
0.988 0.988 0.983 portsweep
0.964 0.975 0.972 nmap
The error visualiziation of various probe attacks are shown in figure1. From the above table and figure its clar that the J48 is best than random forest to detect attacks more accurately than other existing methods.
Fig1. Visualization of various Probe attack.
5. Conclusion
This paper presents a comparative analysis between two different decision tree algorithm such as random forest and J48 to detect probe attack. It also describes the nature various attacks in Probe, used for detecting intrusions. Decision tree are mainly used to make best split from data to improve accuracy in classification.
From the above two algorithm it’s clear that J48 performs better than Random forest in accuracy and it also reduce mean square error which improves detection rate approximately to 98%. The paper also shows the visualization of various attacks using graphical representation. Future work includes testing other attacks as U2R and R2L and it also work on other real time environment.
References
1. Stephen Northcutt, Judy Novak “Network Intrusion Detection”, Third Edition, New Riders Publishing.
2. Jiawei Han and Micheline Kamber “Data mining concepts and techniques” Morgan Kaufmann publishers .an imprint of Elsevier .ISBN 978-1-55860-901-3. Indian reprint ISBN 978-81-312- 0535-8.
3. T. Lappas and K. P. “Data Mining Techniques for (Network) Intrusion Detection System,"
January 2007.
S. Revathi
, IJRIT 439
4. Term “INTRODUCTION OF DATA MINING”,”Data Mining: What is Data Mining “,sourcefromhttp://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamini ng.htm.
5. Weka – Data Mining Machine Learning Software. http://www.cs.waikato.ac.nz/ml/weka/
6. Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set”, In the Proc. Of the IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA 2009), pp. 1-6, 2009.
7. “Nsl-kdd data set for network-based intrusion detection systems.” Available on:
http://nsl.cs.unb.ca/KDD/NSLKDD.html, March 2009.
8. KDD Cup 1999. Available on http://kdd.ics.uci.edu/ Databases/kddcup 99/kddcup99.html, Ocotber 2007.
9. J. McHugh, “Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory,” ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262–294, 2000.
10. L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
11. Lepetit, V., Fu, P.: Key point recognition using randomized trees. IEEE Trans. Pattern Anal.
Mach. Intel. 28(9), 1465– 1479 (2006).
12. Bosh, A., Zisserman, A., Munoz, X.: Image classification using Random Forests and ferns. In:
IEEE ICCV (2007).
13. Introduction to Decision Trees and Random Forests, Ned Horning; American Museum of Natural History’s.
14. Korting, Thales S. "C4.5 algorithm and Multivariate Decision Trees." 5. Web. 2 Feb. 2013.
15. Quinlan, J. R. "Improved Use of Continuous Attributes in C4.5." 14. Web. 11 Jan. 2013.