International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 8, August 2014)
Survey Paper on Intrusion Detection using Data Mining
Techniques
Sonam Chourse¹, Prof. Vineet Richhariya²
¹M.Tech Scholar, ²Head of Computer Science Department, LNCT Bhopal, India
Abstract—In the era of network-based technology, security is an important issue. Recent research in network-based technology and the increased dependence on it make it necessary to assure the reliable operation of network-based systems. As resource and information sharing increase tremendously, the need for security also increases. An intrusion detection system is designed to monitor suspicious activity, misuse, unauthorized access, etc. This paper presents a survey of intrusion detection systems and describes data mining techniques used for intrusion detection.
Keywords— Anomaly detection, Data mining Techniques, Intruders, Intrusion detection, Security.
I. INTRODUCTION
Over the past decade, data mining has gained importance because of the large volumes of data now available and its wide range of applications. The large volume of data gathered from different sources of information may contain sensitive and personal data. Data mining can help counter different types of security attacks because it focuses on discovering significant patterns. For example, classification techniques are used to identify attacks, anomaly detection is used to detect unusual patterns, and prediction techniques are used to anticipate future attacks.
People are now more dependent on internet technology for needs such as online transactions, communication and email. As a result, the number of threats from hackers and intruders, and of attacks on the integrity and confidentiality of systems, is increasing. Thus the field of information security demands stronger protection. High security can be achieved using authentication systems, firewalls, encryption systems, intrusion detection systems, etc. This paper is organized as follows. Section 2 gives an overview of the intrusion detection system. Section 3 discusses the types of IDS. Section 4 presents a comparative study of various data mining techniques used in IDS. Section 5 discusses the types of attacks addressed by intrusion detection systems. Finally, the conclusion of this paper is given in Section 6.
II. INTRUSION DETECTION SYSTEM
An intrusion is a set of actions that violates the integrity, confidentiality or policies of a computer system, or that tries to seize data from the network. Intrusion detection is the act of detecting intrusions in the network, and an intrusion detection system (IDS) can be software, hardware or both. The two major types of detection are anomaly-based detection and signature-based detection [1]. The signature-based method matches gathered information against a large database of signatures of known attacks. It is incapable of identifying unknown attacks [2]. This method is also known as misuse detection.
Anomaly-based detection, on the other hand, detects deviations of observed patterns from a statistically built model of normal behavior. The goals of an intrusion detection system include detecting attacks, analyzing system configuration and vulnerabilities, assessing file and system integrity, identifying problems with security policies, etc. [1].
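The contrast between the two detection approaches can be illustrated with a minimal Python sketch. The signature names, byte patterns, baseline values and the three-standard-deviation rule are all assumptions chosen for illustration; they are not taken from [1] or [2].

```python
from statistics import mean, stdev

# Hypothetical signature database: byte patterns of known attacks.
SIGNATURES = {
    "phf-cgi": b"/cgi-bin/phf",
    "dir-traversal": b"../../",
}

def signature_match(payload: bytes) -> list:
    """Misuse detection: names of all known signatures found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]

def is_anomalous(value: float, baseline: list, k: float = 3.0) -> bool:
    """Anomaly detection: flag a value more than k standard deviations
    away from the mean of a baseline of normal observations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma

# Hypothetical baseline: connections per minute observed during normal operation.
baseline_conn_per_min = [20, 22, 19, 21, 23, 20, 18, 22]

print(signature_match(b"GET /cgi-bin/phf?Qalias=x HTTP/1.0"))  # known attack pattern
print(signature_match(b"GET /index.html HTTP/1.0"))            # no signature matches
print(is_anomalous(21, baseline_conn_per_min))                 # close to the baseline
print(is_anomalous(400, baseline_conn_per_min))                # far from the baseline
```

The signature matcher can only report patterns already in its database, whereas the statistical check can flag previously unseen behavior at the cost of depending on how well the baseline represents normal traffic.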
III. TYPES OF IDS
There are several types of IDS, characterized by their monitoring and analysis approaches. Another way of classifying IDSs is to group them by information source. Some IDSs analyze information generated by application software or the operating system for signs of intrusion. Others analyze network packets captured from a network link to find attackers [3]. By the system they protect, IDSs are either network based or host based. A host-based system monitors an individual host machine. A network-based system monitors the packets traversing a network link [4]. An IDS is needed to identify attacks in both host-based and network-based settings.
A. Network Based System
Listening on a LAN segment, a network-based intrusion detection system can monitor the network traffic affecting multiple hosts connected to that segment, and so protect those hosts. Network-based IDSs often consist of a set of single-purpose sensors or hosts placed at various points in a LAN. Most of these sensors are designed to run in "stealth" mode, to make it more difficult for an attacker to determine their presence and location [3]. A network-based IDS is most commonly deployed at a boundary between networks, such as in virtual private network servers, wireless networks and remote access servers [5]. The following are the advantages of using network-based IDS:
1) Network-based IDSs can be made invisible to many attackers, which protects them against attack.
2) A few network-based IDSs can monitor a large network.
3) Network-based IDSs are usually passive devices that listen on a network wire without interfering with the normal operation of the network. Thus, it is usually easy to retrofit an existing network to include network-based IDSs with minimal effort.
Disadvantages of using network-based IDS are:
1) Network-based IDSs are unable to analyze encrypted information, a problem that grows as more organizations use virtual private networks.
2) Most of the advantages of network-based IDSs do not apply to switch-based networks, which divide the network into many small segments. Switches do not provide universal monitoring ports, and this limits the monitoring range of a network-based IDS sensor to a single host.
3) Some network-based IDSs also have problems dealing with network-based attacks that involve packet fragmentation. These malformed packets cause the IDS to become unstable and crash [3].
B. Host based System
A host-based IDS monitors activities associated with a particular host [6] and aims at collecting information about activity on a host system or within an individual computer system. In a host-based IDS, separate sensors are needed for each individual computer system. A sensor monitors the events that take place on the system. Sensors collect data from system logs, logs generated by operating system processes, application activity, and file access and modification. These log files can be simple text files or records of operations on a system.
The following are the advantages of using Host based IDS:
1) Host-based IDSs can detect attacks that cannot be seen by a network-based IDS, because they monitor events local to a host.
2) Host-based IDSs operate on operating system audit trails, which can help detect attacks that involve software integrity breaches.
3) Host-based IDSs remain unaffected by switched networks.
Disadvantages of using Host based IDS are:
1) Host-based IDSs can be disabled by certain DoS attacks.
2) Host-based IDSs are not well suited for detecting attacks that target an entire network.
3) Host-based IDSs are difficult to manage, as information must be configured and managed for every individual system [3].
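As an illustration of how a host-based sensor might work on simple text logs, the following Python sketch counts failed login attempts per source address and reports sources at or above a threshold. The log line format, the regular expression and the threshold of three are hypothetical; real system logs differ between platforms.

```python
import re
from collections import Counter

# Hypothetical log format, loosely modeled on an SSH daemon log.
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

def failed_logins_by_source(log_lines):
    """Count failed login attempts per source address."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts

def suspicious_sources(log_lines, threshold=3):
    """Sources with at least `threshold` failed logins, in sorted order."""
    counts = failed_logins_by_source(log_lines)
    return sorted(src for src, n in counts.items() if n >= threshold)

sample_log = [
    "sshd[101]: Failed password for root from 10.0.0.5 port 4242",
    "sshd[102]: Failed password for invalid user admin from 10.0.0.5 port 4243",
    "sshd[103]: Accepted password for alice from 10.0.0.9 port 5100",
    "sshd[104]: Failed password for root from 10.0.0.5 port 4244",
    "sshd[105]: Failed password for bob from 10.0.0.7 port 6000",
]
print(suspicious_sources(sample_log))
```

Because the sensor reads events local to the host, it sees the failed logins even when the traffic that carried them was encrypted on the network.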
IV. COMPARATIVE STUDY OF INTRUSION SYSTEM
Some of the most important data mining techniques for IDS are explained in the following subsections.
A. Markov models
Markov models are subdivided into two types: the Hidden Markov Model (HMM) and the Markov chain. In a Hidden Markov Model the system is assumed to be a Markov process whose states are hidden, that is, not directly observable. Markov-based techniques are normally applied to system calls. An HMM has a finite set of states governed by a set of transition probabilities. In a particular state, an observation can be generated according to an associated probability distribution [7]. A Markov chain model consists of a set of states, S = {s1, s2, …, sm}. The process starts in one of these states and moves successively from one state to another.
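In symbols, a Markov chain over the states s1, …, sm can be written as follows. The notation is introduced here for illustration and is not taken from [7]: X_t denotes the state at step t, a_ij the probability of moving from s_i to s_j, and π the distribution of the initial state.

```latex
P(X_{t+1} = s_j \mid X_t = s_i, X_{t-1}, \dots, X_0)
  = P(X_{t+1} = s_j \mid X_t = s_i) = a_{ij},
\qquad \sum_{j=1}^{m} a_{ij} = 1

P(X_0 = x_0, X_1 = x_1, \dots, X_T = x_T)
  = \pi_{x_0} \prod_{t=1}^{T} a_{x_{t-1} x_t}
```

The first line states that the next state depends only on the current state and that the transition probabilities do not change over time; the second gives the probability of an entire state sequence as a product of transition probabilities.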
An algorithm for the Hidden Markov Model:
Step 1: By the Markov assumption, the probability of the transition from a state Q0 at time t0 to a state Q1 at time t0+1 depends only on the state at time t0.
Step 2: By the stationarity assumption, the transition probabilities between states are independent of time.
Step 3: Let 'm' be the number of packets sent at a particular instant of time during a particular transition.
Step 5: Calculate the general probability of the packet being transmitted after each step of the transition.
Step 6: Compute the average probability and check the condition: an intrusion is detected if the average probability is less than the threshold value [8].
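The thresholding idea of the final step can be sketched in Python. This is a simplified illustration, not the algorithm of [8]: it uses a plain first-order Markov chain over system calls rather than a full HMM, estimates transition probabilities by counting, and treats unseen transitions as having probability zero. The system call names and the threshold of 0.3 are hypothetical.

```python
from collections import defaultdict

def train_transitions(sequences):
    """Estimate P(next | current) from normal sequences by counting transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        probs[cur] = {nxt: c / total for nxt, c in nxts.items()}
    return probs

def average_transition_probability(seq, probs):
    """Mean probability of the observed transitions; unseen transitions count as 0."""
    steps = list(zip(seq, seq[1:]))
    if not steps:
        return 1.0  # a sequence without transitions gives no evidence of intrusion
    return sum(probs.get(cur, {}).get(nxt, 0.0) for cur, nxt in steps) / len(steps)

def is_intrusion(seq, probs, threshold=0.3):
    """Flag the sequence if its average transition probability is below the threshold."""
    return average_transition_probability(seq, probs) < threshold

# Hypothetical traces of system calls recorded during normal operation.
normal_traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
    ["open", "write", "close"],
]
model = train_transitions(normal_traces)
print(is_intrusion(["open", "read", "close"], model))           # resembles normal traces
print(is_intrusion(["open", "exec", "chmod", "close"], model))  # transitions never seen
```

In practice the threshold would be tuned on held-out normal data, since a value that is too high raises false alarms and one that is too low misses attacks.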
B. Fuzzy Logic
Dr. Lotfi Zadeh introduced fuzzy logic as a means to model the uncertainty of natural language [9]. An IDS is an approach to handling suspicious behavior inside a network. A wide range of approaches has been applied, specifically soft computing, data mining and artificial intelligence (AI) techniques. AI techniques such as neural networks, decision trees and fuzzy logic are applied to detect suspicious activities in a network, and fuzzy-based systems provide significant advantages over other AI techniques [10].
Fuzzy logic is well suited to IDS because there is no clear boundary between anomalous and normal events [9]. The fuzzy logic part of the system is responsible both for dealing with the imprecision of the input data and for handling the large number of input parameters. The fuzzy expert system consists of the following types of entities: fuzzy variables, fuzzy sets and fuzzy rules. The process of a fuzzy system has three steps: fuzzification, rule-based evaluation and defuzzification. Fuzzification means adding fuzziness to the data. In the fuzzification step, crisp input values are transformed into degrees of membership in the fuzzy sets. In the rule-based evaluation, a strength value is associated with each fuzzy rule. The strength value is determined by the degrees of membership of the crisp input values in the fuzzy sets of the antecedent part of the fuzzy rule (the antecedent variables are those assigned the input data of the fuzzy expert system). The defuzzification stage converts the fuzzy outputs into crisp values [11].
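The three steps can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the single input variable (number of failed logins), the membership function breakpoints, the two rules and their output values. Defuzzification is done as a weighted average of singleton rule outputs, a common simplification that is not necessarily the method used in [11].

```python
def falling(x, a, b):
    """Membership 1 up to a, decreasing linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def rising(x, a, b):
    """Membership 0 up to a, increasing linearly to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def fuzzify(failed_logins):
    """Fuzzification: crisp input -> degrees of membership in the fuzzy sets."""
    return {
        "low": falling(failed_logins, 0, 5),
        "high": rising(failed_logins, 2, 10),
    }

# Hypothetical rules: IF failed_logins is <set> THEN risk is <crisp output value>.
RULES = [("low", 0.1), ("high", 0.9)]

def risk_score(failed_logins):
    memberships = fuzzify(failed_logins)
    # Rule evaluation: the strength of each rule is the membership of its antecedent.
    strengths = [(memberships[label], output) for label, output in RULES]
    total = sum(strength for strength, _ in strengths)
    # Defuzzification: weighted average of the rule outputs gives one crisp value.
    # The two sets overlap on [2, 5], so at least one rule always fires and total > 0.
    return sum(strength * output for strength, output in strengths) / total

for n in (0, 3.5, 10):
    print(n, fuzzify(n), round(risk_score(n), 3))
```

An input of 3.5 failed logins belongs partly to both sets, so its risk score falls between the two rule outputs instead of jumping at a hard cut-off.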
C. Genetic algorithm
Genetic algorithms (GAs) simulate natural processes: selection, crossover, mutation and accepting. GAs are inspired by Darwin's theory of evolution: "survival of the fittest among individuals over consecutive generations for problem solving". Therefore, a solution obtained by applying a genetic algorithm to a problem consists only of those solutions that satisfy a predefined fitness value. The advantages associated with genetic algorithms are: 1) they explore a wide solution space; 2) they do not need prior information about the problem space; 3) they are easy to modify; 4) they possess tremendous capabilities for parallel processing [12].
An algorithm for the genetic algorithm:
Step 1: [start] Generate the initial population.
Step 2: [fitness] Evaluate the solutions in the population to find the most optimal ones.
Step 3: [selection] Select the most optimal solutions, as determined by the fitness function.
Step 4: [crossover] Cross over pairs of solutions until a completely new generation of solutions is obtained.
Step 5: [mutation] Change some bits in a solution.
Step 6: [replace] Place the new solutions in the new population.
Step 7: [test] If the end condition is reached, stop and return the best solution [12].
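The steps above can be sketched in Python on a toy problem. The sketch assumes that a solution is a bit string and that fitness is the number of bits agreeing with a hypothetical target pattern (for instance, an encoded detection rule); the population size, mutation rate and the choice to carry the best solution over unchanged (elitism) are illustrative additions, not taken from [12].

```python
import random

def fitness(chromosome, target):
    """Number of bit positions that agree with the target pattern."""
    return sum(1 for c, t in zip(chromosome, target) if c == t)

def evolve(target, pop_size=20, generations=50, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(target)
    # [start] random initial population of bit strings
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # [fitness] rank the solutions from fittest to least fit
        population.sort(key=lambda ch: fitness(ch, target), reverse=True)
        # [test] stop early once a solution matches the target completely
        if fitness(population[0], target) == n:
            break
        # [selection] the fitter half of the population become parents
        parents = population[: pop_size // 2]
        next_gen = [population[0][:]]  # elitism: keep the best solution unchanged
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)  # [crossover] single-point crossover
            child = a[:cut] + b[cut:]
            # [mutation] flip each bit with a small probability
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_gen.append(child)  # [replace] child joins the new population
        population = next_gen
    return max(population, key=lambda ch: fitness(ch, target))

# Hypothetical 16-bit target pattern.
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
best = evolve(target)
print(best, fitness(best, target), "of", len(target))
```

Because the best solution of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.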
D. Naïve Bayes
Bayesian classification is anomaly based. It works by recognizing that feature values have different probabilities of occurring in attacks and in normal TCP traffic [13]. In naïve Bayes classification, a set of attributes is assigned to one of a set of classes based on Bayes' theorem, as given in eq. 1:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} \qquad (1)$$
where
P(Ci) is the prior probability of the class.
P(X) is the prior probability of the predictor.
P(X|Ci) is the likelihood, which is the probability of the predictor given the class.
P(Ci|X) is the posterior probability of the class given the predictor.
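A short numeric example shows how Bayes' theorem (posterior = likelihood × prior / evidence) combines these quantities. All probabilities below are hypothetical values chosen for illustration: 1% of connections are attacks, and a given feature value X appears in 90% of attacks and 5% of normal connections.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)."""
    return likelihood * prior / evidence

p_attack = 0.01          # P(C): prior probability of the class "attack"
p_x_given_attack = 0.90  # P(X|C): likelihood of the feature value under attack
p_x_given_normal = 0.05  # likelihood of the same feature value under normal traffic

# P(X): total probability of observing the feature value in either class.
p_x = p_x_given_attack * p_attack + p_x_given_normal * (1 - p_attack)

p_attack_given_x = posterior(p_attack, p_x_given_attack, p_x)
print(round(p_x, 4))               # 0.0585
print(round(p_attack_given_x, 3))  # 0.154
```

Even with a strong likelihood, the posterior stays well below one half here because attacks are rare in the assumed prior.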
It requires a set of training data to estimate the means and variances of the attributes for classification. The naïve Bayesian classifier, or simple Bayesian classifier, works as follows [14].
Algorithm for naïve Bayesian classification
Step 1: Let D be a training set of tuples with their associated class labels. Each tuple is represented by an n-dimensional vector, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n attributes, respectively A1, A2, …, An.
Step 2: Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
$$P(C_i \mid X) > P(C_j \mid X) \quad \text{for } 1 \le j \le m,\ j \ne i.$$
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
where P(Ci) is the prior probability of Ci and P(X) is the prior probability of X.
Step 3: As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). The class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.
Step 4: Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i).$$
The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X [14].
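The procedure can be sketched as a minimal Gaussian naïve Bayes classifier in Python. The sketch assumes continuous attributes modeled by a per-class mean and variance, works with logarithms of probabilities to avoid numerical underflow, and uses two hypothetical features (connection duration and source bytes) with made-up training tuples; none of these choices is taken from [14].

```python
import math
from collections import defaultdict

def train(rows, labels):
    """Per class: prior P(Ci) = |Ci,D| / |D| and (mean, variance) of each attribute."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mu = sum(column) / len(column)
            # small constant avoids a zero variance for constant attributes
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def log_gaussian(x, mu, var):
    """Logarithm of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, row):
    """Class maximizing log P(Ci) + sum over k of log P(xk | Ci)."""
    def score(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, mu, var) for x, (mu, var) in zip(row, stats)
        )
    return max(model, key=score)

# Hypothetical training tuples: (duration in seconds, source bytes).
rows = [(1.0, 200.0), (1.2, 220.0), (0.9, 180.0),
        (0.1, 5000.0), (0.2, 5200.0), (0.15, 4800.0)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]

model = train(rows, labels)
print(predict(model, (1.1, 210.0)))    # close to the normal tuples
print(predict(model, (0.12, 5100.0)))  # close to the attack tuples
```

Summing log-probabilities is equivalent to multiplying the per-attribute probabilities under the class conditional independence assumption, and P(X) is omitted because it is the same for every class.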
V. NETWORK ATTACKS
There are four main types of network attack: Denial of Service (DoS), Probe, R2L (remote to user) and U2R (user to root).
1) Denial of Service: An attack in which the hacker makes a computing or memory resource too busy or too full to serve legitimate networking requests, thereby denying users access to a machine, e.g. SYN flood, smurf, apache, teardrop, neptune, etc. [15].
2) Probing: An attack in which the hacker scans a networking device in order to determine weaknesses and vulnerabilities that may later be exploited to compromise the system, e.g. port sweep, nmap. Hence basic connection-level features such as "source bytes" and "duration of connection" are significant, while the "number of files accessed" and "number of file creations" features are not expected to provide information for detecting probes [16].
3) Remote to user attacks: An attack in which an unauthorized user, who can send packets to a machine over the internet but has no account on it, tries to gain access to that remote machine, e.g. sendmail, dictionary, password guessing, guest, xlock, phf [15].
4) User to root attacks: These attacks are often content based and target an application. A user with ordinary access attempts to illegally gain local superuser rights, e.g. perl [15].
REFERENCES
[1] Richa Srivastava, Vineet Richhariya, "Survey of Current Network Intrusion Detection Techniques," Journal of Information Engineering and Applications, ISSN 2224-5782 (print), ISSN 2225-0506 (online), Vol. 3, No. 6, 2013.
[2] Sharmila Kishor Wagh, Vinod K. Pachghare, Satish R. Kolhe, "Survey on Intrusion Detection System using Machine Learning Techniques," International Journal of Computer Applications (0975 – 8887), Volume 78, No. 16, September 2013.
[3] Rebecca Bace, Peter Mell, "NIST Special Publication on Intrusion Detection Systems," Infidel, Inc., Scotts Valley, CA; National Institute of Standards and Technology.
[4] Douglas J. Brown, Bill Suckow, and Tianqiu Wang, "A Survey of Intrusion Detection Systems," San Diego, CA 92093, USA.
[5] Sheetal Thakare, Pankaj Ingle, Dr. B.B. Meshram, "IDS: Intrusion Detection System the Survey of Information Security," International Journal of Emerging Technology and Advanced Engineering, ISSN 2250-2459, Volume 2, Issue 8, August 2012.
[6] Swati Paliwal, Ravindra Gupta, "Denial-of-Service, Probing & Remote to User (R2L) Attack Detection using Genetic Algorithm," International Journal of Computer Applications (0975 – 8887), Volume 60, No. 19, December 2012.
[7] Avinash Ingole, Dr. R. C. Thool, "Credit Card Fraud Detection Using Hidden Markov Model and Its Performance," International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 6, June 2013.
[8] Hemlata Sukhwani, Vikas Sharma, Sanjay Sharma, "A Survey of Anomaly Detection Techniques and Hidden Markov Model," International Journal of Computer Applications (0975 – 8887), Volume 93, No. 18, May 2014.
[9] J.T. Yao, S.L. Zhao, L. V. Saxton, "A study on fuzzy intrusion detection," Regina, Saskatchewan, Canada S4S 0A2.
[11] Mostaque Md. Morshedur Hassan, "Current Studies On Intrusion Detection System, Genetic Algorithm And Fuzzy Logic," International Journal of Distributed and Parallel Systems (IJDPS), Vol. 4, No. 2, March 2013.
[12] Parry Gowher Majeed, Santosh Kumar, "Genetic Algorithms in Intrusion Detection Systems: A Survey," International Journal of Innovation and Applied Studies, ISSN 2028-9324, Vol. 5, No. 3, Mar. 2014, pp. 233-240, © 2014 Innovative Space of Scientific Research Journals.
[13] Hesham Altwaijry, Saeed Algarny, "Multi-Layer Bayesian Based Intrusion Detection System," Proceedings of the World Congress on Engineering and Computer Science 2011, Vol. II, WCECS 2011, October 19-21, 2011, San Francisco, USA.
[14] Manish Kumar Nagle, Dr. Setu Kumar Chaturvedi, "Feature Extraction Based Classification Technique for Intrusion Detection System," International Journal of Engineering Research and Development, e-ISSN: 2278-067X, p-ISSN: 2278-800X, Volume 8, Issue 2 (August 2013), pp. 23-38.
[15] Partha Sarathi Bhattacharjee, Dr. Shahin Ara Begum, "Fuzzy Approach for Intrusion Detection System: A Survey," International Journal of Advanced Research in Computer Science, ISSN No. 0976-5697, Volume 4, No. 2, Jan-Feb 2013.