An Efficient Intrusion Detection using J48 Decision Tree in KDDCUP99 Dataset


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 6, Issue 2, February 2016)


Abid Khan¹, Prof. Kavita Burse², Prof. Kavita Rawat³

¹,²,³ Oriental College of Technology, Bhopal, India

Abstract—This paper proposes an efficient scheme for intrusion detection. A novel and effective technique for the classification and detection of intrusions is implemented. The proposed methodology, applied to the classification of network intrusions, yields a better confusion matrix and a higher accuracy rate than the existing intrusion detection methodology based on a Genetic Algorithm. A J48-based classifier is used to generate the rules.

Keywords—Cloud Computing, Public Verifiability, Cloud Storage, Cloud Security, Virtualisation.

I. INTRODUCTION

Protecting systems against computer security attacks is a vital concern of computer security. Comprehensive collection and accurate interpretation of traffic information are core difficulties in network traffic anomaly detection, because malicious network traffic may lead to the alteration of information content and the exposure of sensitive data in transit. It is also well recognized that dependence on networks is growing quickly. As a result, the state of network security is critical today and will become more challenging in the future. Malicious traffic can cause enormous damage to a network system and its connected resources. The analysis of network behaviour falls under anomaly detection. Numerous host-based anomaly detection systems have been proposed to detect server compromises by monitoring the behaviour of a program to see whether it conforms to a model that describes its normal behaviour [1].

Given the inherent complexity of characterizing normal network behaviour, anomaly detection approaches may be categorized as model-based and non-model-based. Model-based anomaly detectors assume that a known model is available for the normal behaviour of certain specific aspects of the network, and any deviation from the norm is deemed an anomaly. For network behaviours that cannot be characterized by any model, non-model-based methods are used. Non-model-based methods can be further classified according to the specific implementation and accuracy constraints that have been imposed on the detector.

IDSs allow the detection of successful or unsuccessful attempts to compromise system security. An Intrusion Detection System is an important component of any security infrastructure and complements other security mechanisms. As illustrated in Figure 1, an IDS consists of four essential components: sensors, analysis engines, data repository, and management and reporting modules.

Figure 1: High level architecture of an intrusion detection system.

Support Vector Machine

Consider the training sample $\{(x_i, d_i)\}_{i=1}^{N}$, where $x_i$ is the input pattern and $d_i$ is the desired output.

The data points that lie closest to the separating hyperplane define the margin of separation.

The main goal of the SVM is to find the particular hyperplane for which the margin is maximized.

Optimal hyperplane

For example, if we are choosing our model from the set of hyperplanes in R^n, then we have:

f(x; {w; b}) = sign(w . x + b)

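We can try to learn f(x; α), where α = {w; b} denotes the parameters, by choosing a function that performs well on the training data, that is, one with a low empirical risk (training error):

$$R_{emp}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \alpha), d_i\big)$$

where N is the number of training samples, $d_i$ is the desired output for input $x_i$, and $\ell$ is a loss function such as the 0/1 misclassification loss.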
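As an illustration of the decision function f(x; {w; b}) = sign(w . x + b) and of what "performing well on training data" means, the following minimal Python sketch evaluates a fixed hyperplane on a handful of samples and reports the fraction it misclassifies. The toy samples, the weights and the 0/1 loss are assumptions made for this example; they do not come from the paper's experiments, and no SVM training is performed.

```python
from typing import List, Sequence, Tuple


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return sign(w . x + b) as +1 or -1 (a score of exactly 0 maps to +1)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def empirical_risk(w: Sequence[float], b: float,
                   samples: List[Tuple[Sequence[float], int]]) -> float:
    """Fraction of (x, label) samples that the hyperplane misclassifies."""
    errors = sum(1 for x, label in samples if decision(w, b, x) != label)
    return errors / len(samples)


# Hypothetical toy data: label +1 when x1 + x2 > 1, otherwise -1.
toy_samples = [((0.0, 0.0), -1), ((2.0, 2.0), 1), ((1.0, 1.5), 1), ((0.2, 0.1), -1)]

# Hypothetical hyperplane x1 + x2 - 1 = 0.
w_example, b_example = (1.0, 1.0), -1.0
risk = empirical_risk(w_example, b_example, toy_samples)
```

An SVM would go one step further and, among all hyperplanes with low training error, pick the one with the largest margin.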

II. LITERATURE SURVEY

Wenying Feng, Qinglei Zhang, Gongzhu Hu, and Jimmy Xiangji Huang proposed a novel and efficient technique for intrusion detection using a hybrid combination of support vector machines and ant colony networks [1]. Data mining is a technique for extracting meaningful information from data so that appropriate and rapid analysis can be done. Intrusion detection is one application of data mining: the packets sent from source to destination need to be clean, and if a packet contains an attack it should be detected with a high alarm rate. The paper presents a new technique for identifying these intrusions using a machine learning approach, the support vector machine. Classifying intrusions in packets with a support vector machine is an effective way of categorizing the attacks in a packet.

Chih-Fong Tsai, Yu-Feng Hsu, Chia-Ying Lin, and Wei-Yang Lin also proposed a new way of classifying intrusions by studying numerous intrusion detection techniques together with their advantages and limitations [2]. The paper summarizes the intrusion detection techniques that have been implemented and reviews their advantages and limitations. It discusses and compares 55 related studies from the period 2000 to 2007 in which numerous classification and clustering techniques for intrusion detection are applied and examined.

Latifur Khan, Mamoun Awad, and Bhavani Thuraisingham implemented an efficient technique for intrusion detection using a support vector machine together with hierarchical clustering [3]. The paper proposes hierarchical clustering followed by classification with a support vector machine, which delivers a higher true positive rate and accuracy than existing intrusion detection techniques. The input dataset is first clustered into 'N' groups according to the classification classes, and these clustered groups are then classified using a machine learning method such as the support vector machine.

This methodology classifies the dataset effectively and provides a high alarm rate for the detection of intrusions.

Figure 4. Representation of Layered Approach

Jianfeng Pu and Lizhi Xiao proposed a new technique for network intrusion detection based on the support vector machine and the ant colony algorithm [4]. A hybrid technique that applies both the support vector machine and the ant colony algorithm is used to detect network packets that exhibit anomalous behaviour. The support vector machine is used to select the significant features of the packets that flow through the network.

Shelly Xiaonan Wu and Wolfgang Banzhaf gave a brief overview of intrusion detection systems and their various limitations and advantages [5]. The paper summarizes the computational intelligence methods that may be used to detect intrusions in a network or in packets. Artificial intelligence methods such as fuzzy and swarm techniques for detecting network anomalies are discussed, and their advantages and issues are examined.

Qinglei Zhang and Wenying Feng also proposed the same technique for intrusion detection using the hybrid combination of the Ant Colony Algorithm and the Support Vector Machine [6]. In this paper two supervised techniques are combined to detect intrusions in packets. The experimental results on network packets show that the hybrid combination performs better than the existing Support Vector Machine alone.

The experimental results on the proposed system with the feature extraction algorithm show that it is effective in detecting unseen intrusion attacks with a high detection rate and in recognizing normal network traffic with a low false alarm rate.

Snehal A. Mulay, P.R. Devale, and G.V. Garje proposed a novel and efficient technique for intrusion detection using a combination of support vector machines and a decision tree [8]. The support vector machine is a supervised learning approach used for binary classification; a decision-tree-based support vector machine can be used to solve multi-class problems more efficiently. Combining the SVM with a decision tree reduces the training and testing time and also increases system efficiency.

Wenke Lee proposed a new and effective framework for building intrusion detection models on DARPA datasets [9]. The rules generated can be used for misuse detection of anomalies in the network. The experiments are performed on the DARPA dataset, from which a set of rules is generated, and intrusions are then detected on the basis of these rules.

III. PROPOSED METHODOLOGY

The proposed methodology implemented here consists of the following steps:

1. Take the KDDCup99 dataset as input (Para a).

2. Apply the J48-based classification algorithm to the input dataset to generate a decision tree (Para b).

3. Generate fuzzy rules from the decision tree using Fuzzy C-means (Para c).

4. Compare each packet with the generated fuzzy rules (Para d).

5. Filter and classify the packets using the rules, which cover the various classes (Para e).

Para a

The input is the KDDCup99 dataset, which consists of:

Table 1. Training Dataset

            Original Records   Distinct Records   Reduction Rate
Attacks            3,925,650            262,178           93.32%
Normal               972,781            812,814           16.44%
Total              4,898,431          1,074,992           78.05%

Table 2. Testing Dataset

            Original Records   Distinct Records   Reduction Rate
Attacks              250,436             29,378           88.26%
Normal                60,591             47,911           20.92%
Total                311,027             77,289           75.15%
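The reduction rate in Tables 1 and 2 is the share of records removed when duplicates are dropped, that is, (original - distinct) / original. The short Python sketch below recomputes it from the record counts listed in the tables; the function and variable names are illustrative choices for this example only.

```python
def reduction_rate(original: int, distinct: int) -> float:
    """Percentage of records removed when duplicate records are dropped."""
    return 100.0 * (original - distinct) / original


# (original records, distinct records) as listed in Tables 1 and 2.
training = {"Attacks": (3_925_650, 262_178), "Normal": (972_781, 812_814)}
testing = {"Attacks": (250_436, 29_378), "Normal": (60_591, 47_911)}


def summarize(counts: dict) -> dict:
    """Per-class reduction rates plus the rate over the summed totals."""
    rates = {name: reduction_rate(o, d) for name, (o, d) in counts.items()}
    total_original = sum(o for o, _ in counts.values())
    total_distinct = sum(d for _, d in counts.values())
    rates["Total"] = reduction_rate(total_original, total_distinct)
    return rates


training_rates = summarize(training)
testing_rates = summarize(testing)

for name, rate in training_rates.items():
    print(f"training {name}: {rate:.2f}%")
for name, rate in testing_rates.items():
    print(f"testing {name}: {rate:.2f}%")
```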

Para b

The input dataset is then passed to the J48 classification algorithm. J48 is a classification algorithm that generates a decision tree, from which the rules are derived. J48 is an implementation of the C4.5 classification algorithm; numeric attributes are split in two at a threshold, while a nominal attribute such as flag gets one branch per value.

INPUT:
    D    // Training data
OUTPUT:
    T    // Decision tree

DTBUILD(D)
{
    T = φ;
    T = Create root node and label it with the splitting attribute;
    T = Add an arc to the root node for each split predicate and label it;
    For each arc do
        D' = Dataset created by applying the splitting predicate to D;
        If a stopping point is reached for this path, then
            T' = Create a leaf node and label it with the appropriate class;
        Else
            T' = DTBUILD(D');
        T = Add T' to the arc;
}
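The recursive structure of DTBUILD can be sketched in a few lines of Python. This is a simplified illustration and not the J48 implementation: it handles only categorical attributes, picks the split by plain information gain (C4.5 uses gain ratio and also supports numeric thresholds and pruning), and the toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(rows):
    """Shannon entropy of the class labels (last element of each row)."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split(rows, attr):
    """Group rows by the value of attribute index `attr` (one arc per value)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return groups


def dt_build(rows, attrs):
    """Return a class label (leaf) or an (attr, {value: subtree}) node."""
    labels = Counter(row[-1] for row in rows)
    # Stopping point: pure node or no attributes left -> majority-class leaf.
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]

    def gain(attr):
        parts = split(rows, attr).values()
        remainder = sum(len(p) / len(rows) * entropy(p) for p in parts)
        return entropy(rows) - remainder

    best = max(attrs, key=gain)  # splitting attribute for this node
    remaining = [a for a in attrs if a != best]
    return (best, {v: dt_build(part, remaining)
                   for v, part in split(rows, best).items()})


def classify(tree, row):
    """Follow arcs until a leaf; unseen attribute values return None."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(row[attr])
    return tree


# Hypothetical toy records: (flag, high_count, label).
toy = [("SF", "no", "normal"), ("SF", "yes", "dos"), ("S0", "yes", "dos"),
       ("S0", "no", "normal"), ("REJ", "no", "normal"), ("REJ", "yes", "dos")]
tree = dt_build(toy, attrs=[0, 1])
```

On this toy data the second attribute separates the classes perfectly, so the tree consists of a single split on it. Each root-to-leaf path of such a tree reads as one if-then rule, which is how rules are obtained from the tree.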

While building a tree, J48 ignores missing values; that is, the value for that item can be predicted from what is known about the attribute values of the other records. The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample.


Figure 5. Decision Tree created using J48

Once the decision tree is constructed, fuzzy C-means is applied to the classified dataset to generate fuzzy rules.

Fuzzy c-means (FCM) is a clustering technique that allows one piece of data to belong to two or more clusters. It is frequently used in pattern recognition. Among the fuzzy clustering methods, the fuzzy c-means (FCM) algorithm is the most popular technique used in image segmentation because it has robust characteristics for ambiguity and can retain much more information. Although the conventional FCM algorithm works well on most noise-free images, it has a serious limitation: it does not incorporate any information about spatial context, which makes it sensitive to noise and imaging artifacts. To compensate for this shortcoming of FCM, the obvious approach is to smooth the image before segmentation. The algorithm is an iterative clustering technique that produces an optimal c-partition by minimizing the weighted within-group sum of squared error objective function J_FCM:

$$J_{FCM} = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^{q} \, d^{2}(x_k, v_i)$$

where $X = \{x_1, x_2, \ldots, x_n\} \subseteq R^p$ is the data set in the p-dimensional vector space, n is the number of data items, c is the number of clusters with $2 \le c < n$, $u_{ik}$ is the degree of membership of $x_k$ in the i-th cluster, q is a weighting exponent on each fuzzy membership, $v_i$ is the prototype of the centre of cluster i, and $d^2(x_k, v_i)$ is a distance measure between object $x_k$ and cluster centre $v_i$. Let $V_i$ be the set of vector values in the data points $P_i$.

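a. Initialize the membership matrix U = [u_ij] randomly for the set of data points P_i.

b. At step k, calculate the centroids C = [c_j], one per cluster, using

$$c_j = \frac{\sum_{i=1}^{n} (u_{ij})^{m} \, x_i}{\sum_{i=1}^{n} (u_{ij})^{m}}$$

where m is the fuzzy parameter and n is the number of data points. Here $u_{ij}$ is the membership of data point $x_i$ in cluster j; m and $c_j$ play the roles of the weighting exponent q and the cluster centre $v_i$ in the objective function.

c. After each iteration, update the fuzzy memberships using

$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_l \rVert} \right)^{\frac{2}{m-1}}}$$

d. Stop the fuzzy C-means algorithm when the change in membership between successive iterations falls below a threshold:

$$\lVert U^{(k)} - U^{(k-1)} \rVert < \epsilon$$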
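Steps a to d can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: it uses the standard fuzzy c-means centroid and membership updates, Euclidean distance, a maximum-absolute-change stopping test, a fixed random seed, and hypothetical toy data.

```python
import random


def fcm(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means on a list of equal-length tuples.

    Returns (centroids, memberships); memberships[i][j] is the degree to
    which point i belongs to cluster j.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step a: random memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    for _ in range(max_iter):
        # Step b: centroid j is the mean of the points weighted by u_ij^m.
        centroids = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            denom = sum(weights)
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / denom
                for d in range(dim)))
        # Step c: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, cj)) ** 0.5
                     for cj in centroids]
            if min(dists) == 0.0:
                # Point coincides with a centroid: full membership there.
                zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([z / sum(zeros) for z in zeros])
            else:
                exp = 2.0 / (m - 1.0)
                new_u.append([1.0 / sum((dj / dl) ** exp for dl in dists)
                              for dj in dists])
        # Step d: stop when the largest membership change is below eps.
        change = max(abs(a - b)
                     for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:
            break
    return centroids, u


# Hypothetical 2-D toy data with two well-separated groups.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, u = fcm(data, c=2)
```

Each row of the returned membership matrix sums to 1, so a point near a cluster boundary can belong partly to both clusters, which is the property that distinguishes FCM from hard clustering.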

Table 3. Rules Generated using J48

*********************************************
Rule-1
if (dst_host_srv_serror_rate) > '0.47' then
    R2L Attack
*********************************************
Rule-2
if (dst_host_srv_serror_rate) <= '0.47' then
    if (count) > '40' then
        DoS Attack
*********************************************
Rule-3
if (count) <= '40' then
    if (dst_host_diff_srv_rate) > '0.99' then
        U2R Attack
*********************************************
Rule-4
if (dst_host_diff_srv_rate) <= '0.99' then
    if (dst_host_srv_serror_rate) > '0.14' then
        if (src_bytes) > '7940' then
            Probe Attack
*********************************************
Rule-5
if (dst_host_srv_serror_rate) > '0.14' then
    if (src_bytes) <= '7940' then
        Normal Packet
*********************************************
Rule-6
if (dst_host_srv_serror_rate) <= '0.14' then
    if (flag) == 'RSTR' then
        Normal Packet
*********************************************
Rule-7
if (flag) == 'S3' then
    Normal Packet
*********************************************
Rule-8
if (flag) == 'SF' then
    Normal Packet
*********************************************
Rule-9
if (flag) == 'RSTO' then
    Normal Packet
*********************************************
Rule-10
if (flag) == 'SH' then
    Normal Packet
*********************************************
Rule-11
if (dst_host_diff_srv_rate) <= '0.99' then
    if (flag) == 'OTH' then
        Normal Packet
*********************************************
Rule-12
if (flag) == 'RSTOSO' then
    Normal Packet
*********************************************
Rule-13
if (flag) == 'S0' then
    Normal Packet
*********************************************
Rule-14
if (flag) == 'REJ' then
    U2R Attack
*********************************************
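To show how a packet record would be compared against the rules of Table 3, the sketch below encodes them as a single Python function. It rests on an assumption: the rules are read as successive branches of one decision path, so each later rule inherits the conditions that the earlier rules ruled out. The function name, the dictionary-based record and the "Unknown" fallback are illustrative and not part of the paper.

```python
# Flag values that Table 3 maps to "Normal Packet" (Rule-6 to Rule-13),
# spelled as they appear in the table.
NORMAL_FLAGS = {"RSTR", "S3", "SF", "RSTO", "SH", "OTH", "RSTOSO", "S0"}


def classify_packet(record: dict) -> str:
    """Apply the Table 3 rules, read as one nested decision path."""
    if record["dst_host_srv_serror_rate"] > 0.47:
        return "R2L Attack"                      # Rule-1
    if record["count"] > 40:
        return "DoS Attack"                      # Rule-2
    if record["dst_host_diff_srv_rate"] > 0.99:
        return "U2R Attack"                      # Rule-3
    if record["dst_host_srv_serror_rate"] > 0.14:
        # Rule-4 and Rule-5 split on src_bytes.
        return "Probe Attack" if record["src_bytes"] > 7940 else "Normal Packet"
    if record["flag"] in NORMAL_FLAGS:
        return "Normal Packet"                   # Rule-6 to Rule-13
    if record["flag"] == "REJ":
        return "U2R Attack"                      # Rule-14
    return "Unknown"                             # no rule matched (assumption)


# Hypothetical record using the KDDCup99 feature names from Table 3.
example = {
    "dst_host_srv_serror_rate": 0.0,
    "count": 5,
    "dst_host_diff_srv_rate": 0.1,
    "src_bytes": 200,
    "flag": "SF",
}
label = classify_packet(example)
```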

IV. RESULT ANALYSIS

Table 4. Analysis of Various Classes using Genetic Algorithm (by CSVAC Classifier)

Actual Class    Classified Class
                Normal    DoS    U2R    R2L    Probe
Normal             923     15      3     42       13
DoS                  0    493      0      0        0
U2R                  0      0     42      9        0
R2L                 38    312     12    628        2
Probe                2    257     16      9      213

Table 5. Analysis of Various Classes using Proposed Algorithm (by Proposed Classifier)

Actual Class    Classified Class
                Normal    DoS    U2R    R2L    Probe
Normal            1025     15      3     42       13
DoS                  0    583      0      0        7
U2R                  0      0     59     15        0
R2L                 45    312     12    739        3
Probe                3    259     20      9      271
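Summary measures can be derived directly from the two confusion matrices. The Python sketch below computes overall accuracy and per-class recall and precision from the counts in Tables 4 and 5. The text does not state which measures Figure 6 plots, so the choice of accuracy, recall and precision here is an assumption. Because the two tables contain different numbers of records, rates are more comparable than raw counts.

```python
CLASSES = ["Normal", "DoS", "U2R", "R2L", "Probe"]

# Rows are actual classes, columns are classified classes (Tables 4 and 5).
CSVAC = [
    [923, 15, 3, 42, 13],
    [0, 493, 0, 0, 0],
    [0, 0, 42, 9, 0],
    [38, 312, 12, 628, 2],
    [2, 257, 16, 9, 213],
]
PROPOSED = [
    [1025, 15, 3, 42, 13],
    [0, 583, 0, 0, 7],
    [0, 0, 59, 15, 0],
    [45, 312, 12, 739, 3],
    [3, 259, 20, 9, 271],
]


def accuracy(matrix):
    """Correctly classified records (the diagonal) over all records."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """For each actual class, the share of its records classified correctly."""
    return {name: matrix[i][i] / sum(matrix[i])
            for i, name in enumerate(CLASSES)}


def precision_per_class(matrix):
    """For each classified class, the share of its records that are correct."""
    result = {}
    for j, name in enumerate(CLASSES):
        column_total = sum(row[j] for row in matrix)
        result[name] = matrix[j][j] / column_total
    return result


for label, matrix in (("CSVAC", CSVAC), ("Proposed", PROPOSED)):
    print(label, f"accuracy = {accuracy(matrix):.4f}")
    for name, value in recall_per_class(matrix).items():
        print(f"  recall[{name}] = {value:.4f}")
```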

Figure 6. Comparison on Various Performance Parameters (chart title: "Comparison of Performance Parameters"; x-axis: Measures; y-axis: Values, 0 to 90; legend: CSVAC)

Figure 7. Comparison on Normal Class Classification

Figure 8. Comparison of DoS Class Classification

V. CONCLUSION

The proposed methodology is a novel technique for optimizing the detection of intrusions in web log data on the basis of a J48-based decision tree.

The proposed methodology is also feasible for large datasets and provides a high alarm rate and high accuracy in detecting intrusions. The result analysis shows the performance of the proposed methodology. The proposed methodology is compared with the existing Genetic Algorithm based intrusion detection; it provides more security, a higher alarm rate, and a higher detection ratio than the existing technique.

REFERENCES

[1] Wenying Feng, Qinglei Zhang, Gongzhu Hu, Jimmy Xiangji Huang, "Mining Network data for intrusion detection through combining SVMs with ant colony networks", Future Generation Computer Systems 37 (2014) 127-140, Elsevier 2014.

[2] Chih-Fong Tsai, Yu-Feng Hsu, Chia-Ying Lin, Wei-Yang Lin, "Intrusion Detection by machine learning: A Review", Expert Systems with Applications 36 (2009) 11994-12000, Elsevier 2009.

[3] Latifur Khan, Mamoun Awad, Bhavani Thuraisingham, "A new intrusion detection system using support vector machine and hierarchical clustering", The VLDB Journal (2007) 16:507-521, 2007.

[4] Jianfeng Pu, Lizhi Xiao, "A Detection of Network Intrusion Based on SVM and Ant Colony Algorithm", National Conference on Information Technology and Computer Science (CITCS), 2012.

[5] Shelly Xiaonan Wu, Wolfgang Banzhaf, "The use of Computational Intelligence in intrusion detection systems: A Review", Applied Soft Computing 10, Elsevier, 2010.

[6] Qinglei Zhang and Wenying Feng, "Network Intrusion Detection by Support Vectors and Ant Colony", Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009), Qingdao, China, November 21-22, 2009.

[7] S. Janakiraman, V. Vasudevan, "ACO based Distributed Intrusion Detection System", 2008.

[8] Snehal A. Mulay, P.R. Devale, G.V. Garje, "Intrusion Detection System using support vector machine and Decision Tree", International Journal of Computer Applications (0975-8887), Volume 3, No. 3, June 2010.

[9] Wenke Lee, "A Data Mining Framework for building Intrusion Detection Models", 1999.

[10] Wenke Lee, Salvatore J. Stolfo and Kui W. Mok, "Mining Audit Data to Build Intrusions Detection Models", 1999.

[11] Z. Muda, W. Yassin, M.N. Sulaiman, "Intrusion Detection based on K-Means Clustering and OneR Classification", IEEE 2011.

[12] Mohammadreza Ektefa, Sara Memar, "Intrusion Detection Using Data Mining Techniques", IEEE 2010.

[13] Anup Goyal, Chetan Kumar, "GA-NIDS: A Genetic Algorithm based Network Intrusion Detection System", 2005.

[14] R. Shanmugavadivu, Dr. N. Nagarajan, "Network Intrusion Detection System Using Fuzzy Logic", IJCSE 2011.

[15] Yogita B. Bhavsar, Kalyani C. Waghmare, "Intrusion Detection System Using Data Mining Technique: Support Vector Machine", IJETAE 2013.

[16] S. Devaraju and S. Ramakrishnan, "Performance Comparison for Intrusion Detection System Using Neural Network With KDD Dataset", ICTACT 2014.

[17] Jun Wang, Xu Hong, Rong-rong Ren, Tai-hang Li, "A Real-time Intrusion Detection System Based on PSO-SVM", IWISA 2009.

[18] Saroj Bala, S.I. Ahson, R.P. Agarwal, "A Pheromone Based Model for Ant Based Clustering", International Journal of Advanced Computer Science & Applications, Vol. 3, No. 11, 2012.

[Figure 7 chart: "Comparison of Normal Classification"; x-axis: Classified Classes; y-axis: Normal Class Values, 0 to 1200; legend: CSVAC, Proposed]

[Figure 8 chart: "Comparison of DoS Class"; x-axis: Classes; y-axis: Classified DoS Class, 0 to 700; legend: CSVAC]