Efficient Statistics Based Framework for Network Intrusion Detection




Abstract—Due to the growing threat of network attacks, detecting and measuring network abuse are increasingly important. Network intrusion detection is one of the most frequently deployed approaches. Most detection systems only rely on signature matching methods and, therefore, they suffer from novel attacks. This investigation presents a simple yet efficient data-mining framework (SID) that constructs a statistics based abusive traffic detection system based on network flows. We show that SID can accurately and automatically detect existing and new malicious network attempts. Experimental results validate the feasibility of using SID to detect network anomaly intrusions. In particular, we show that, simply employing four basic features of network flows, SID can yield an accuracy of over 97% with a false positive rate of 0.03% in the testing dataset.

Index Terms—Data Mining, Intrusion Detection, Network Security, Machine Learning.

I. INTRODUCTION

Intrusion detection (ID) techniques are fundamental components of security infrastructures, adopted to detect and then block intruders. ID techniques are conventionally classified into two categories: misuse detection and anomaly detection. Misuse detection (also called signature-based detection) strives to detect well-known attacks by matching incoming traffic against existing signatures or rules. Misuse detection generally has a low false alarm rate, but it suffers from one main drawback: it cannot identify new attacks for which no signatures or rules have been defined. In contrast, an anomaly detection system creates normal user profiles and regards any deviation from those profiles as an anomaly. This approach can detect new attacks, but it produces more false alarms than misuse detection.

Since network attacks and abuse are increasing rapidly, detecting and measuring these malicious activities becomes more and more important. As a result, a wide variety of detection systems have been proposed [1]-[7], [12], [22]-[24]. Because signature based systems, e.g. [24], cannot detect novel attacks, several anomaly based intrusion detection systems have been proposed to address this limitation. As mentioned earlier, because anomaly detection systems use normal user profiles to identify anomalies, creating correct user profiles is a key factor in their performance. Among various techniques, data mining is widely used to extract rules from large datasets and then use these rules to identify suspicious instances. The construction of such systems is simple and straightforward. It starts with a learning phase, which runs in a probabilistic manner on a given set of training data along with their class labels. Afterwards, in the prediction/detection phase, the system uses the derived a posteriori probabilities along with the designed classifiers to classify new instances into the corresponding classes [8]. Among all classifiers, Bayes based and threshold based ones, e.g. [4]-[7], provide simple yet efficient approaches to classifying instances into their categories. Further, notice that both types of classifiers can be trained either in a batch mode, i.e. all instances are given at once, or in an incremental mode, i.e. training instances are added sequentially. This flexibility makes them attractive, especially in today's fast-evolving Internet environment.

Manuscript received March 01, 2013.

Kuo-Chen Lee, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan (R.O.C.).

Zhi-Jun Hsu, R&D department, Digicode Tech. Ltd., Taipei, Taiwan (R.O.C.).

Li Liu, Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan (R.O.C.).

In this paper we propose a Statistical based abusive Intrusion Detection framework, called SID, which examines important and useful features of network flows, instead of inspecting their contents, to detect anomalous attacks. Our intention is to provide a simple yet efficient intrusion detection framework for large networks. The main objective of SID is to detect abusive/noisy network intrusions, such as DoS, rather than high level application attacks, such as buffer overflow exploits. In addition, we investigate two different classification approaches, the naive Bayes classifier and the threshold based classifier, to study how classification techniques affect system performance. The naive Bayes classifier uses the Bayes probability model to find the optimal decision for unknown incoming traffic. As indicated by its name, the threshold classifier compares a value computed from the constructed buckets with a threshold to classify unknown instances. In Section III.D, we provide an analytical model to study how the classifiers are designed.

Finally, to evaluate the performance of SID, we present a series of comprehensive experiments based on a large set of intrusion training and test data. Further, we use a brand new dataset collected from a real network.

The rest of this paper is organized as follows. Section II discusses related work in intrusion detection. Section III presents SID in detail, including the proposed data training algorithm and the analytical model of the decision classifiers. Section IV presents the experimental results, showing that SID is accurate and effective. Conclusions are finally drawn in Section V.


II. RELATED WORK

Network intrusion detection is typically signature based, i.e. misuse detection. SNORT [24] is one of the most representative works in this community. Rules to identify malicious connections can be easily written in SNORT. However, since one cannot write rules for future attacks and it is difficult to keep the rules/signatures updated, this approach is vulnerable to novel attacks. As a result, an alternative approach, anomaly intrusion detection, has been proposed to identify suspicious traffic/activities.

Various techniques have been proposed for modeling anomaly intrusion detection systems. These systems, such as NIDES [22], SPADE [23], PHAD [1], ALAD [2] and SVM [3], first create normal user profiles and then generate an alarm when a deviation from the normal profiles is detected. They differ from one another in how they select useful features, which algorithms they use to derive normal profiles, and how those algorithms make decisions. In prior work, most features are obtained from the packet headers. In particular, SPADE, ALAD, and NIDES use the distribution of the source and destination IP addresses, port numbers, and the TCP connection state. PHAD, a time-dependent model, adopts 34 attributes obtained from the packet headers of Ethernet, IP, TCP, UDP, and ICMP packets to detect intrusions. Lakhina et al. [9] proposed using entropy as a measure of the distributions of packet features to identify and classify anomalous network traffic volumes. Cardenas et al. [10] proposed a framework for IDS evaluation by viewing it as a multi-criteria optimization problem. Srinivas et al. [11] tried to identify important features for anomaly intrusion detection systems, and Stolfo et al. [12] proposed using micro models to sanitize training datasets for anomaly detection systems. In [13], the data of each category were assigned to k clusters through K-means clustering, and an SVM was then trained on a new dataset consisting only of the cluster centers, which can reduce the false alarm rate. Despite the large body of work, there exists no clear intrusion detection framework for large networks.

The Bayes network is one of the most widely adopted models for representing uncertain data [4]. It is a directed acyclic graph (DAG) in which vertices represent events and edges denote the relations between events. The numerical component quantifies the links in the DAG according to the conditional probability distribution of each node in the context of its parents. Bayes networks have been widely used to create models for anomaly intrusion detection. Puttini et al. [7] presented a behavior model using Bayes to obtain the model parameters. Goldman [14] proposed a model that simulates an intelligent attacker using Bayes networks to generate a plan of goal-directed actions. Kruegel et al. [15] described a service specific intrusion detection system, which combined the type, length, and payload distribution of the request as the features to calculate an anomaly score. Naive Bayes is a simplified Bayes network whose directed acyclic graph has only one root node. A naive Bayes network is a restricted network that has only two layers and assumes that all attributes are independent. These restrictions result in a tree-shaped network with a single root node, which facilitates classifier design. Valdes et al. [16] developed a system that applies a naive Bayes network to perform intrusion detection on network events. Sebyala et al. [5] presented a system that identifies malicious executable code in active networks.

Since large networks observe very large numbers of packets, flow based detection is favorable for anomaly detection systems in practice [9]. Therefore, in this paper, we propose a simple yet efficient statistics based anomaly intrusion detection framework based on observed network flows, in order to provide a feasible framework for large networks. Although SID is similar to the aforementioned anomaly detection systems, our work is distinguished from them in that we use both normal and anomalous data in the training procedure, while conventional anomaly detection systems only use normal data. Additionally, SID concentrates not only on correctness but also on simplicity, efficiency, and feasibility. In particular, we propose a heuristic learning/training algorithm for quick, flexible construction of normal user/traffic profiles. Further, based on the created profiles, we establish two fast classification methods, a naive Bayes classifier and a threshold based classifier, to detect anomalous activities and to study how the classification method affects performance. Finally, we use two well-known datasets, i.e. DARPA 98 and KDD CUP 99, and data collected from a real network, i.e. a university campus, to evaluate the performance of the proposed detection framework. We show that the proposed framework can achieve both a high detection rate and a low false positive rate by simply employing four basic features of network flows.

Fig. 1 Architecture of SID.

III. STATISTICS BASED INTRUSION DETECTION (SID)

P. Garcia-Teodoro et al. provide a thorough introduction to anomaly based network intrusion detection systems (A-NIDS) in [8]. SID has an architecture functionally similar to the generic A-NIDS described there. However, in contrast with generic A-NIDS, SID utilizes not only normal data but also malicious data in the training stage. Further, SID concentrates on providing simple yet efficient flow based training and detection algorithms for potential deployment in backbone networks. The detailed architecture of SID is shown in Fig. 1. The data preprocessor in both the training and testing procedures extracts feature information for further analysis. In the training procedure, the system uses the extracted data to train/create normal user profiles. In the testing procedure, features are extracted from the arriving data in the same way as in the training procedure, and the extracted information is then fed into the designed classifiers to determine the class of each instance.

A. Characterizing Network Flows

A network flow denotes a series of IP packets exchanged between a source address/port and a destination address/port. Flow states/features include, but are not limited to, the flow type (e.g. TCP, UDP or ICMP), the flow duration, the source/destination addresses and ports, the number of packets exchanged, the number of bytes transferred, and the number of packets with options set.
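As a rough illustration of this definition, the following Python sketch groups packet records into flows and accumulates a few of the features listed above. The record layout, the `Flow` class and the `aggregate_flows` function are hypothetical names introduced only for this example, and the sketch assumes a simplified one-directional flow key; it is not the preprocessor used in SID.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """Per-flow features; field names are illustrative assumptions."""
    first_seen: float
    last_seen: float
    packets: int = 0
    src_bytes: int = 0

    @property
    def duration(self) -> float:
        # Elapsed time between the first and last packet of the flow.
        return self.last_seen - self.first_seen


def aggregate_flows(packets):
    """Group (timestamp, proto, src, sport, dst, dport, length) records into flows.

    Assumption: a flow is keyed by the one-directional 5-tuple.
    """
    flows = {}
    for ts, proto, src, sport, dst, dport, length in packets:
        key = (proto, src, sport, dst, dport)
        flow = flows.get(key)
        if flow is None:
            flow = flows[key] = Flow(first_seen=ts, last_seen=ts)
        flow.last_seen = max(flow.last_seen, ts)
        flow.packets += 1
        flow.src_bytes += length
    return flows


# Hypothetical packet records for demonstration only.
packets = [
    (0.0, 6, "192.168.8.22", 1234, "10.23.4.5", 80, 100),
    (1.5, 6, "192.168.8.22", 1234, "10.23.4.5", 80, 200),
    (2.0, 17, "192.168.1.42", 5000, "10.2.4.53", 40000, 50),
]
flows = aggregate_flows(packets)
```

A bidirectional flow definition would need a key that treats both directions as the same flow and separate byte counters per direction.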

Although Srinivas et al. tried to identify important features for anomaly intrusion detection in [11], the choice of important features is still an open challenge in anomaly intrusion detection systems [8]. In this paper we concentrate on providing a simple, efficient framework for backbone networks rather than investigating optimal feature selection. Therefore, we propose a generic, flexible framework, which is capable of using various features, to detect network intrusions. Our experimental results show that using only four features in our framework is sufficient to yield outstanding performance. In particular, four of the following five features are used: duration, protocol, service, source port number, and source byte. Duration is the elapsed time of the flow under study, and protocol is the Internet protocol number of the flow, see [29]. Service is the destination port number, e.g. the HTTP port number is 80 and the telnet port number is 23. Source byte is the total number of bytes that the source sends to the destination.

Notice that each flow feature is a parameter whose values may range over a very large set. In [28], features are viewed as either continuous or discrete according to the range of values of the feature under study. For example, source byte is considered continuous because it is a 32-bit unsigned integer with a range from 0 to 4294967295, and service is also considered continuous because it is a 16-bit unsigned integer with a range from 0 to 65535. Protocol is considered discrete because it is an 8-bit unsigned integer with a range from 0 to 255. In SID, feature types are pre-defined as recommended in [28]. In particular, duration, service, source port, and source byte are continuous while protocol is discrete. Continuous and discrete data types are distinguished for the sake of practical implementation. For a discrete feature, a full-rank array, in which each array element represents a single value of the feature, is allocated. However, it is infeasible to allocate an array with 4294967296 elements for a 32-bit feature, e.g. source byte. As a result, an extra transformation procedure, described in Section III.B, is performed to handle this problem.

B. Data Training

In SID, after feature data is extracted, the system determines whether the feature is continuous. If the feature is discrete, the data is placed directly into the appropriate position of the pre-allocated array. Otherwise, a transformation procedure is performed: the system first attempts to place the value into the pre-arranged list, which is managed dynamically to accommodate training information. If no appropriate segment is available for a new value, the system creates an additional segment for this value. The feature extraction procedure is performed repeatedly until all training data are processed. We then construct the trained buckets for the detection classifiers.

The detailed training procedure is shown in Fig. 2. To keep track of training data for each feature considered in the framework, we create a data structure which contains two lists, malicious list (MList) and benign list (BList), and a maximum value, MaxValue, of the feature under study, as shown in lines 1-5 in the DataTrain() routine of Fig. 2. Both lists of a feature consist of disjoint segments that partition the entire range of the feature under study. Each segment records its value, boundaries, and the statistics of the training data stored in the segment. We process training data sequentially, which corresponds to lines 6-10 in DataTrain(). For each training entry, we extract corresponding feature information and then accommodate the information into corresponding segments. This task is described by the FeatureAdd() routine of Fig. 2. When we process an entry, we first identify whether the entry is malicious or benign in order to accumulate extracted feature information to the correct list, which is described in lines 1-5 in the FeatureAdd() routine. And then, we have to update the statistics of the corresponding segment in the chosen list, as shown in lines 6-28 in the FeatureAdd() routine.

As mentioned earlier, a feature can be considered as either continuous or discrete. For discrete data, each segment contains a single value and the two lists can be easily implemented by two arrays. Therefore, the extracted feature information of a training entry can be placed directly in the corresponding position by a simple mapping. Table 1 and Fig. 3 illustrate the details of the process. Assume the first six entries of a training dataset are as shown in Table 1. When we process the first entry, we find that this entry is a malicious TCP connection, and the protocol number of TCP connections is 6. Therefore, we increase the malicious count of the element in the Protocol Array indexed by TCP protocol number 6. We then increase the malicious count at position 17 because the second entry indicates that a UDP connection is malicious. After processing the first six entries, the content of the Protocol Array is as shown in Fig. 3.
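The direct mapping for a discrete feature can be sketched in a few lines of Python. The function name `train_discrete` and the tuple layout are assumptions made for this illustration; the entries are the six rows of Table 1.

```python
TCP, UDP = 6, 17  # IP protocol numbers

# (protocol, destination port, is_malicious) for the six entries of Table 1
TABLE_1 = [
    (TCP, 80, True),
    (UDP, 40000, True),
    (TCP, 22, False),
    (UDP, 80, False),
    (UDP, 22, False),
    (UDP, 30000, True),
]


def train_discrete(entries, size=256):
    """Count (value, is_malicious) pairs in two full-rank arrays."""
    malicious = [0] * size
    benign = [0] * size
    for value, is_malicious in entries:
        target = malicious if is_malicious else benign
        target[value] += 1  # the feature value is used directly as the index
    return malicious, benign


malicious, benign = train_discrete((proto, bad) for proto, _port, bad in TABLE_1)
```

After the six entries, index 6 holds one malicious and one benign count, and index 17 holds two of each.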

However, the values of continuous data could range from 0 to 4294967295, and thus creating a huge array for simple mapping is impractical. To overcome this problem, continuous values are transformed into a set of disjoint segments. For instance, a list for the destination port consisting of five segments could be {0-79, 80, 81-999, 1000-9999, 10000-65535}. We propose a heuristic algorithm to perform this transformation quickly. The fundamental idea is to divide a segment into two disjoint segments if a newly inserted value cannot be stored in the current segments. The pseudo code is shown in lines 12-25 in the FeatureAdd() routine of Fig. 2. We again use the entries in Table 1 to illustrate the operations. In the first training entry, the destination port number of the malicious connection is 80. Because the malicious list, MList, is empty, we append a segment to the list. The appended segment spans the entire range of the port number, with value 80. The second entry is also malicious and thus we choose the malicious list again. Now, the extracted port number is 40000, which is different from the value in the current segment of the malicious list.

Fig. 2 Training algorithm.

Table 1 The first six entries of a sample training dataset.

Protocol   Port    Source IP        Dest IP      Malicious
TCP(6)     80      192.168.8.22     10.23.4.5    Yes
UDP(17)    40000   192.168.1.42     10.2.4.53    Yes
TCP(6)     22      192.168.33.200   60.20.4.3    No
UDP(17)    80      10.33.213.33     10.20.4.3    No
UDP(17)    22      10.33.213.33     10.20.4.3    No
UDP(17)    30000   192.168.8.22     10.23.4.5    Yes

Fig. 3 The resulting segment list for the protocol feature.

Fig. 4 The resulting segment list for the destination port feature.

Therefore, we need to insert a new segment to keep track of this value. Because the new value, 40000, is larger than the old value, 80, we insert the new segment after the current segment. The boundary between the two segments is the midpoint between the old value and the new value, which is 20040. We shrink the end point of the current segment to this midpoint and start the new segment immediately after it, so that the two segments do not overlap. The above operations correspond to lines 16-20 in the FeatureAdd() routine. The resulting segments are segment M1, {0, 20040} with value 80, and segment M2, {20041, 65535} with value 40000. The third and fourth entries are benign, and these two entries break the benign list, BList, into two segments: segment B1, {0, 51} with value 22, and segment B2, {52, 65535} with value 80. The fifth entry is benign again and its port number is 22; thus we only increase the count of segment B1, i.e. line 15 in the pseudo code. Finally, in the sixth entry, we have to insert a new value, 30000, into the malicious list. Because the value is located in segment M2 and is less than the value of M2, i.e. 40000, we insert a new segment between M1 and M2. The operation is described in lines 22-25 in the FeatureAdd() routine. After processing the first six entries, the resulting lists are shown in Fig. 4.
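The splitting heuristic described above can be sketched in Python as follows. This is a reconstruction from the prose only, not a transcription of Fig. 2; the `Segment` class, the `feature_add` function and the use of integer midpoints are assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int      # inclusive lower boundary
    end: int        # inclusive upper boundary
    value: int      # representative value stored in the segment
    count: int = 1  # number of training entries with this value


def feature_add(segments, value, max_value=65535):
    """Insert one continuous feature value into an ordered segment list."""
    if not segments:
        # First value: a single segment spans the entire range.
        segments.append(Segment(0, max_value, value))
        return
    idx = next(i for i, s in enumerate(segments) if s.start <= value <= s.end)
    seg = segments[idx]
    if value == seg.value:
        seg.count += 1
        return
    mid = (seg.value + value) // 2  # midpoint between old and new value
    if value > seg.value:
        # New segment goes after the current one; shrink the current end.
        segments.insert(idx + 1, Segment(mid + 1, seg.end, value))
        seg.end = mid
    else:
        # New segment goes before the current one; move the current start.
        segments.insert(idx, Segment(seg.start, mid, value))
        seg.start = mid + 1


# Destination ports from Table 1, split by label.
mlist, blist = [], []
for port in (80, 40000, 30000):
    feature_add(mlist, port)
for port in (22, 80, 22):
    feature_add(blist, port)
```

With the Table 1 ports, the malicious list ends up with boundaries 20040 and 35000 and the benign list with boundary 51, with B1 holding a count of 2.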

C. Bucket Construction

After all entries in the dataset are processed, each feature is characterized by two arrays or lists which contain the training data statistics of the feature. All segments in the two arrays of a discrete feature, e.g. protocol, are aligned, as can be observed from Fig. 3. Therefore, the two arrays can be easily combined to construct buckets for the detection classifiers. However, since the segments in the two lists of a continuous feature may not be aligned, i.e. they might have different boundaries, a merge procedure is proposed to construct buckets for continuous features. The idea is to keep merging adjacent segments that have close statistics until a segment with different statistics is encountered. Note that the merging condition used in this paper is the malicious to benign ratio of a segment; however, the condition can be chosen flexibly. All segments which have close statistics are combined into a bucket, and the segment with different statistics becomes the start of the next bucket. We iterate over both lists until all segments are processed. The detailed pseudo code is shown in Fig. 5.

Fig. 5 Bucket construction algorithm.

For a feature of interest, given BList and MList, Fig. 6 demonstrates how buckets are constructed from the two generated lists of a continuous feature. As we can observe from the plot, there are four and two segments in the benign and malicious lists respectively. Let the four segments in the benign list be B1, {0, 10000} with value 80, B2, {10001, 20000} with value 16000, B3, {20001, 30000} with value 24000, and B4, {30001, 65535} with value 38000. Denote by M1, {0, 30000} with value 24000, and M2, {30001, 65535} with value 36000, the two segments in the malicious list. To merge the two lists, we start from the minimum segment, i.e. the segment with the minimum value, and then move toward the maximum segment. The SegMin() routine of Fig. 5 finds the minimum segment between the two lists.

Fig. 6 A bucket construction example.

It is easy to see that the minimum segment between the two lists in Fig. 6 is segment B1. Since the bucket list is empty at this point, we start a new bucket, say L1, and store the statistics of B1 in L1, which corresponds to lines 6-10 in the CreateBucket() routine of Fig. 5. The next minimum segment is B2. Since all the instances in B2 are benign, as in L1, this segment can be merged into bucket L1, i.e. the statistics of B2 are accumulated into L1. The next minimum segment consists of B3 and M1 because they have the same value. Since the statistics of this segment differ from those of bucket L1, we have to close bucket L1 and create a new bucket, say L2, for this segment. The end point of L1 and the starting point of L2 are determined by the midpoint between the value of B2 and the value of B3 (or M1), i.e. 20000. These operations are described by lines 16-20 in the CreateBucket() routine. Now, bucket L1 spans from 0 to 20000 with 5000 benign instances and no malicious instances, and the new bucket, L2, starts from 20001, recording 1000 benign instances and 3000 malicious instances. The next minimum segment is M2 because its value is smaller than the value of B4. Since all instances of this segment are malicious, it differs from L2, where only 75% of the instances are malicious. As a result, the operations described in lines 16-20 in the CreateBucket() routine are performed again. Now, L2 is closed at the midpoint between 24000 and 36000, i.e. 30000, and L3 starts from 30001 to accommodate M2. Note that L2 spans from 20001 to 30000 with 1000 benign counts and 3000 malicious counts. Finally, B4 is added into the buckets. Again, it cannot be merged with L3 and thus a new bucket L4 is created. The complete constructed buckets, L1 … L4, are shown in Fig. 6.
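The merge walk-through can be reproduced with a short Python sketch. It is a simplified reading of the prose rather than the routine of Fig. 5: segments are reduced to (value, count) pairs, "close statistics" is taken to mean that the malicious fractions differ by at most a tolerance `tol`, and the names `create_buckets` and `tol` are assumptions of this example. The input counts are those of the Fig. 6 example.

```python
def create_buckets(blist, mlist, max_value=65535, tol=0.05):
    """Merge benign and malicious (value, count) segments into buckets.

    Assumes every segment has a positive count.
    """
    # Combine segments that share the same value across the two lists.
    merged = {}
    for value, count in blist:
        merged.setdefault(value, [0, 0])[0] += count
    for value, count in mlist:
        merged.setdefault(value, [0, 0])[1] += count

    buckets = []
    prev_value = None
    for value in sorted(merged):  # walk from minimum to maximum segment
        b, m = merged[value]
        ratio = m / (b + m)
        if buckets:
            cur = buckets[-1]
            cur_ratio = cur["mcount"] / (cur["bcount"] + cur["mcount"])
            if abs(ratio - cur_ratio) <= tol:
                # Close statistics: accumulate into the current bucket.
                cur["bcount"] += b
                cur["mcount"] += m
                prev_value = value
                continue
            # Different statistics: close the bucket at the midpoint.
            boundary = (prev_value + value) // 2
            cur["end"] = boundary
            start = boundary + 1
        else:
            start = 0
        buckets.append({"start": start, "end": max_value, "bcount": b, "mcount": m})
        prev_value = value
    return buckets


blist = [(80, 3000), (16000, 2000), (24000, 1000), (38000, 3000)]
mlist = [(24000, 3000), (36000, 6000)]
buckets = create_buckets(blist, mlist)
```

For this input the sketch yields four buckets with boundaries 20000, 30000 and 37000, matching L1 to L4 in the walk-through.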

D. Classifiers and Detection

As shown in Fig. 1, the detection procedure is quite similar to the training procedure. After the feature information of an instance, i.e. an incoming flow or a test record, is extracted, the information is fed, along with the trained buckets, into the decision classifiers to determine whether the instance is malicious. In this section we construct and analyze two decision classifiers: the threshold classifier and the naive Bayes classifier. Before we proceed, we introduce the notation used in this section. Denote by $P_M$ and $P_B$ the proportions of malicious and benign instances in the training dataset, respectively. Let $F = \{F_1, \ldots, F_K\}$ be the set of features considered in the framework, where K is the number of features considered. Assume the trained buckets of feature i consist of $N_i$ buckets, denoted by $B^{i} = \{B^{i}_{1}, \ldots, B^{i}_{N_i}\}$. Let $n^{b}_{i,j}$ and $n^{m}_{i,j}$ be the numbers of benign and malicious instances in bucket j of feature i, respectively, and let $p^{b}_{i,j}$ and $p^{m}_{i,j}$ be the ratios of benign and malicious instances in bucket j of feature i, respectively. That is,

$$p^{b}_{i,j} = \frac{n^{b}_{i,j}}{n^{b}_{i,j} + n^{m}_{i,j}} \quad\text{and}\quad p^{m}_{i,j} = \frac{n^{m}_{i,j}}{n^{b}_{i,j} + n^{m}_{i,j}}.$$

Now, consider feature i, and denote by $\bar{p}^{b}_{i,j}$ and $\bar{p}^{m}_{i,j}$ the fractions of all benign and all malicious instances, respectively, that are located in bucket j. It is easy to see that

$$\bar{p}^{b}_{i,j} = \frac{n^{b}_{i,j}}{\sum_{k=1}^{N_i} n^{b}_{i,k}} \quad\text{and}\quad \bar{p}^{m}_{i,j} = \frac{n^{m}_{i,j}}{\sum_{k=1}^{N_i} n^{m}_{i,k}}.$$

Finally, let X be the instance under study and let $x_i$ be the information extracted from feature i of this instance.
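To make the two kinds of ratio concrete, the sketch below computes both for the bucket counts of the Fig. 6 example. The function name `bucket_ratios` and the (benign, malicious) tuple layout are assumptions of this illustration, and every bucket is assumed to be non-empty.

```python
def bucket_ratios(buckets):
    """buckets: list of (n_b, n_m) counts for one feature.

    Returns two lists of (benign, malicious) pairs:
    - within:      share of each class inside a single bucket
    - conditional: share of each class's total that falls in the bucket
    """
    total_b = sum(b for b, _ in buckets)
    total_m = sum(m for _, m in buckets)
    within = [(b / (b + m), m / (b + m)) for b, m in buckets]
    conditional = [(b / total_b, m / total_m) for b, m in buckets]
    return within, conditional


# (benign count, malicious count) for buckets L1..L4 of the Fig. 6 example
FIG6_BUCKETS = [(5000, 0), (1000, 3000), (0, 6000), (3000, 0)]
within, conditional = bucket_ratios(FIG6_BUCKETS)
```

For bucket L2 the within-bucket malicious ratio is 0.75, while only one third of all malicious instances fall in that bucket, which shows that the two quantities answer different questions.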

E. Threshold Classifier

As indicated by its name, a threshold classifier uses a threshold to classify instances. One of the most representative threshold classifiers is the 1-bit slicer, or analog to digital converter (ADC), which digitizes an analog signal into a digit in {0, 1}. In our framework, for example, if we consider only two features and the statistics of both features show that an instance is malicious with a high probability, it is very likely that the instance is malicious. To quantify “high probability”, defining a probability threshold, say $P_{th}$, is one of the simplest approaches. Let $p^{i}_{est}$ be the probability estimate that the instance is malicious from the viewpoint of feature i. That is, $p^{i}_{est} = p^{m}_{i,j}$ if $x_i \in B^{i}_{j}$, where $p^{m}_{i,j}$ is the ratio of malicious instances in bucket j of feature i, and feature i suggests that X is malicious if $p^{i}_{est} > P_{th}$. Based on the observations from all the features considered, to estimate how likely an instance is to be malicious, we compute the malicious estimate, $P_{est}$, of X as the sum of the probability estimates of all features. That is,

$$P_{est} = \sum_{i=1}^{K} p^{i}_{est}.$$

In this paper, we employ a variant of the majority rule to determine whether an instance is malicious. (Note that threshold classifiers can vary widely; other rules or thresholds can also be applied in SID.) In particular, we claim that an instance is malicious if the malicious estimate $P_{est}$ is larger than $K P_{th}$, the minimum value of $P_{est}$ for which all K features could claim that the instance is malicious. The threshold classifier can be expressed mathematically as follows:

$$X \text{ is } \begin{cases} \text{Malicious} & \text{if } P_{est} > K P_{th} \\ \text{Benign} & \text{otherwise} \end{cases} \tag{1}$$
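A minimal Python sketch of rule (1) is given below. The bucket layout (upper boundary, benign count, malicious count), the function names and the second feature's counts are assumptions made for illustration; the port buckets reuse the counts of the Fig. 6 example. Feature values are assumed to lie within the bucket range and buckets are assumed to be non-empty.

```python
from bisect import bisect_left


def find_bucket(buckets, value):
    """Return the (end, n_b, n_m) bucket whose range contains value."""
    ends = [end for end, _, _ in buckets]
    return buckets[bisect_left(ends, value)]  # first bucket with end >= value


def threshold_classify(x, feature_buckets, p_th):
    """Sum per-feature malicious ratios and compare against K * p_th."""
    p_est = 0.0
    for value, buckets in zip(x, feature_buckets):
        _, n_b, n_m = find_bucket(buckets, value)
        p_est += n_m / (n_b + n_m)  # malicious ratio of the matching bucket
    return "Malicious" if p_est > len(x) * p_th else "Benign"


# Port buckets from the Fig. 6 example: (end, benign count, malicious count)
PORT_BUCKETS = [(20000, 5000, 0), (30000, 1000, 3000), (37000, 0, 6000), (65535, 3000, 0)]
# Hypothetical buckets for a second feature, invented for this example
BYTE_BUCKETS = [(99, 900, 100), (4294967295, 100, 900)]

FEATURES = [PORT_BUCKETS, BYTE_BUCKETS]
```

With these buckets and a threshold of 0.8, the instance (36000, 500) accumulates 1.0 + 0.9 = 1.9, which exceeds 2 × 0.8, so it is flagged as malicious; raising the threshold makes the rule stricter.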

F. Naive Bayes Classifier

Naive Bayes assumes that all considered features of an instance are conditionally independent. The independence assumption implies that naive Bayes classifiers can be computed more efficiently than non-naive Bayes approaches, which have exponential complexity, since naive Bayes does not consider combinations of features.

Consider N disjoint possible classifier results (or decisions), $C_1, \ldots, C_N$, that partition the sample space. Given an instance X and its features $\{x_i\}$, we can compute the probability that X belongs to a particular result by the following equation.

$$P(X \in C_i \mid F_1 = x_1, \ldots, F_K = x_K) = \frac{P(C_i)\, P(F_1 = x_1, \ldots, F_K = x_K \mid C_i)}{P(F_1 = x_1, \ldots, F_K = x_K)}$$

Under the assumption that all features are independent, the above equation can be rewritten as follows:

$$P(X \in C_i \mid F_1 = x_1, \ldots, F_K = x_K) = \frac{P(C_i) \prod_{j=1}^{K} P(F_j = x_j \mid C_i)}{P(F_1 = x_1, \ldots, F_K = x_K)} \tag{2}$$

Now, it is easy to see that the optimal decision for X is given by

$$C_{opt} = \arg\max_{C_i} \; P(C_i) \prod_{j=1}^{K} P(F_j = x_j \mid C_i) \tag{3}$$
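As an illustration of the rule in (3), the Python sketch below evaluates the numerator for two classes, labelled malicious and benign here, and returns the larger one. It assumes that the likelihood of a feature value given a class is estimated as the fraction of that class's training instances falling in the matching bucket; the bucket layout, the function name and the second feature's counts are hypothetical, while the port buckets reuse the counts of the Fig. 6 example.

```python
from bisect import bisect_left


def naive_bayes_classify(x, feature_buckets, p_m, p_b):
    """Compare prior * product of per-feature likelihoods for two classes.

    feature_buckets[i] is a list of (end, n_b, n_m) sorted by end.
    Assumes each class has at least one training instance per feature.
    """
    score_m, score_b = p_m, p_b
    for value, buckets in zip(x, feature_buckets):
        total_b = sum(n_b for _, n_b, _ in buckets)
        total_m = sum(n_m for _, _, n_m in buckets)
        # Locate the first bucket whose upper boundary is >= value.
        j = bisect_left([end for end, _, _ in buckets], value)
        _, n_b, n_m = buckets[j]
        score_b *= n_b / total_b  # estimated P(F_i = x_i | Benign)
        score_m *= n_m / total_m  # estimated P(F_i = x_i | Malicious)
    return "Malicious" if score_m > score_b else "Benign"


# Port buckets from the Fig. 6 example: (end, benign count, malicious count)
PORT_BUCKETS = [(20000, 5000, 0), (30000, 1000, 3000), (37000, 0, 6000), (65535, 3000, 0)]
# Hypothetical buckets for a second feature, invented for this example
BYTE_BUCKETS = [(99, 900, 100), (4294967295, 100, 900)]
```

The shared denominator of the posterior is never computed, since it does not affect which class wins. For many features, summing logarithms instead of multiplying probabilities would avoid numerical underflow.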

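In SID, we consider two possible decisions: Malicious and Benign. We can compute the posterior probabilities that X is malicious and benign, denoted by $p^{m}_{est}$ and $p^{b}_{est}$ respectively, by (4) and (5).

$$p^{m}_{est} = \frac{P_M \prod_{i=1}^{K} \bar{p}^{m}_{i,j_i}}{P(x_1 \in B^{1}_{j_1}, \ldots, x_K \in B^{K}_{j_K})} \tag{4}$$

$$p^{b}_{est} = \frac{P_B \prod_{i=1}^{K} \bar{p}^{b}_{i,j_i}}{P(x_1 \in B^{1}_{j_1}, \ldots, x_K \in B^{K}_{j_K})} \tag{5}$$

Here $j_i$ denotes the bucket of feature i that contains $x_i$, and $\bar{p}^{m}_{i,j_i}$ and $\bar{p}^{b}_{i,j_i}$ are the fractions of all malicious and all benign training instances, respectively, that fall in that bucket. Finally, notice that the denominators of the above two equations are identical. As a result, to expedite computation, the naive Bayes decision classifier can be constructed as follows:

$$X \text{ is } \begin{cases} \text{Malicious} & \text{if } P_M \prod_{i=1}^{K} \bar{p}^{m}_{i,j_i} > P_B \prod_{i=1}^{K} \bar{p}^{b}_{i,j_i} \\ \text{Benign} & \text{otherwise} \end{cases} \tag{6}$$

IV. EVALUATION

A. Dataset Description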


a high detection rate with a low false alarm rate. The first two datasets are widely used in IDS performance evaluation, and the third was newly collected from a real network. To show that SID can identify unknown anomalous activities, the test dataset contains several new intrusions that are not present in the training dataset.

The first dataset used is the DARPA 98 [25] dataset, which originates from MIT Lincoln Laboratory and was developed for IDS evaluations by DARPA. It collects traffic from a real network that is subjected to multiple attacks. The dataset consists of tcpdump files, which record all network packets, and list files, which record the corresponding sessions/flows of the dumped files. Each line in a list file corresponds to an individual session/flow, as described in Section III, and consists of nine fields, which identify the flow, followed by an attack indicator and an optional attack type. The entire dataset covers five weeks of observation. We use the data collected in the first three weeks as the training dataset and the data collected in the last two weeks as the testing dataset. The training dataset consists of approximately one million data instances and the testing dataset contains about 616,000 instances.

The second dataset we used is the KDD CUP 1999 dataset [28], which extracts various quantitative and qualitative features from the trace of the DARPA 98 dataset. In particular, the training dataset is composed of approximately 4,900,000 data instances, and each instance consists of 41 discrete or continuous features along with an attack label and an attack type. In addition to the training dataset, KDD CUP 1999 also provides a testing dataset. In both the training and testing datasets, the attacks are classified into four categories: Denial of Service attacks (DoS), User-to-Root attacks (U2R, unauthorized access to root privilege), Remote-to-Local attacks (R2L, unauthorized access from remote machines) and Probing (Probe). In addition to the entire dataset, the training and testing data are split into four sub-datasets based on the aforementioned four attack categories (DoS, U2R, R2L and Probe). For instance, the training and test datasets for DoS include all DoS attacks and all normal cases in the original training and test data.
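The construction of the four sub-datasets can be sketched as follows. The `CATEGORY` mapping lists only a few of the KDD CUP 1999 attack labels and is deliberately partial; the function name, the record layout, and the assumption that labels have already been stripped of any trailing punctuation are all choices made for this illustration.

```python
# Partial mapping from KDD CUP 1999 attack labels to the four categories.
CATEGORY = {
    "smurf": "DoS", "neptune": "DoS", "back": "DoS",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe",
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L",
}


def split_by_category(records):
    """records: iterable of (features, label), label 'normal' or an attack name.

    Each sub-dataset receives the attacks of its category plus all normal
    records. Labels missing from CATEGORY raise a KeyError.
    """
    subsets = {c: [] for c in ("DoS", "U2R", "R2L", "Probe")}
    for features, label in records:
        if label == "normal":
            for subset in subsets.values():
                subset.append((features, label))
        else:
            subsets[CATEGORY[label]].append((features, label))
    return subsets


# Tiny hypothetical sample, for demonstration only.
records = [((0,), "normal"), ((1,), "smurf"), ((2,), "ipsweep")]
subsets = split_by_category(records)
```

In this sample the DoS subset contains the normal record and the smurf record, while the U2R and R2L subsets contain only the normal record.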

Although it is well known that the two aforementioned datasets are synthesized, they can be considered the baseline for NIDS-related research. The datasets are widely accepted as benchmarks and are referenced by many researchers [17]-[20]. As a result, we first use the two datasets to study the performance of SID and then compare the results with those of SID on a real network, which is discussed next.

The third dataset is collected from a real network, namely a university campus. More than 1,000 computers are connected to our gateway. We collect network traffic via tcpdump [30]. After the raw packets are captured, we group them into network flows in accordance with the flow classification methods used in the literature [12], [26], [27]. The collection spans 20 days, and the resulting dataset contains about 12 million instances. We perform several experiments based on the entire dataset. Each experiment samples 2 million instances as the training dataset and 1 million instances as the testing dataset.
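The grouping step can be illustrated with a minimal sketch. It assumes that a flow is identified by the usual 5-tuple and that no timeout splits a long-lived flow; the exact rules of [12], [26], [27] are not reproduced here, and the `Packet` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record (not the tcpdump format itself)."""
    ts: float    # capture timestamp in seconds
    src: str     # source address
    sport: int   # source port
    dst: str     # destination address
    dport: int   # destination port
    proto: str   # transport protocol name
    size: int    # packet length in bytes


def group_into_flows(packets):
    """Aggregate packets sharing a 5-tuple into one flow summary each."""
    flows = {}
    for p in packets:
        key = (p.src, p.sport, p.dst, p.dport, p.proto)
        if key not in flows:
            flows[key] = {"first": p.ts, "last": p.ts, "bytes": 0, "packets": 0}
        f = flows[key]
        f["first"] = min(f["first"], p.ts)
        f["last"] = max(f["last"], p.ts)
        f["bytes"] += p.size
        f["packets"] += 1
    # Derive the flow duration once all packets have been seen.
    for f in flows.values():
        f["duration"] = f["last"] - f["first"]
    return flows
```

Each resulting record carries the kind of simple per-flow quantities (duration, byte count) that the experiments below select as features.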

B. Experiments

In this section we present the experimental results, showing that SID achieves outstanding performance on all three datasets. To demonstrate that SID is flexible, we choose different features for different datasets and show that both choices result in excellent performance. In particular, we adopt “duration”, “protocol”, and “service” for all datasets. We choose “source port” for the DARPA 98 dataset, while we select “source byte” for the KDD and real network datasets. This combination is motivated by our intuition and experience as well as by the results of prior work. We also performed several experiments using different combinations of selected features. Since similar results were observed in these experiments, we do not present them in the paper. Discovering the optimal choice of features is beyond the scope of this paper; we refer interested readers to [11].

C. DARPA Dataset

We first use the DARPA dataset to study how the threshold, Pth, affects the behavior of a threshold classifier and thus changes system performance. The results are shown in Fig. 7. First, we can observe from the plot that the false positive rate, derived as the percentage of normal instances classified as malicious, and the detection rate, obtained as the percentage of malicious instances detected, are both high when Pth is small. This can be explained as follows: when Pth is small, we classify a suspicious instance that shows even slightly errant behavior as malicious. As a result, we detect nearly all malicious instances and thus achieve a high detection rate. At the same time, however, the classifier is more likely to classify benign instances as malicious. Further, we can observe that both the false positive rate and the detection rate decrease as Pth increases. This is because Pth represents how conservative a classifier is: as Pth increases, the classifier becomes more conservative and thus the false positive rate decreases. Meanwhile, a malicious instance is more likely to go undetected because the classifier is not confident enough to declare it malicious. As a consequence, the detection rate decreases. Similar results can also be observed for the KDD dataset.
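The two rates defined above can be computed as in the following minimal sketch. It assumes that each instance carries a hypothetical maliciousness score in [0, 1] and is flagged when the score reaches Pth; the evaluation metric actually used by SID is defined earlier in the paper and is not reproduced here.

```python
def rates(scores, labels, p_th):
    """Return (detection_rate, false_positive_rate) for threshold p_th.

    scores: hypothetical per-instance maliciousness scores in [0, 1]
    labels: True for malicious instances, False for normal ones
    An instance is flagged as malicious when its score >= p_th.
    Both classes must be present in labels.
    """
    tp = fp = malicious = normal = 0
    for score, is_malicious in zip(scores, labels):
        flagged = score >= p_th
        if is_malicious:
            malicious += 1
            tp += flagged
        else:
            normal += 1
            fp += flagged
    return tp / malicious, fp / normal


# Toy data (illustrative only, not taken from any of the datasets).
scores = [0.95, 0.80, 0.60, 0.30, 0.40, 0.10, 0.05, 0.02]
labels = [True, True, True, True, False, False, False, False]
for p_th in (0.2, 0.5, 0.9):
    print(p_th, rates(scores, labels, p_th))
```

On this toy data both rates fall as the threshold rises, which mirrors the trend described for Fig. 7.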

Fig. 8 ROC plot for the DARPA dataset.

Fig. 9 ROC plot for KDD DoS dataset.

Fig. 10 ROC plot for KDD network dataset.

The Receiver Operating Characteristic (ROC) plot is one of the most representative plots used to study the relationship between the false positive rate and the detection rate. In a ROC plot, the X-axis represents the false positive rate and the Y-axis denotes the detection rate. Notice that a data point in the upper left corner corresponds to better performance, namely a lower false positive rate with a higher detection rate, than a point in the lower right. The ROC plot for the DARPA dataset is shown in Fig. 8. As discussed for Fig. 7, the performance of the threshold classifier changes as Pth changes. Therefore, the results of the threshold classifier in the plot correspond to different values of Pth. We also present the result of the naive Bayes classifier in the plot. As we can observe from (6), the naive Bayes classifier does not depend on any configurable variable, so its result consists of only a single point. Finally, notice that using the proposed framework, both classifiers can achieve a detection rate higher than 97% while keeping the false positive rate as low as 0.03%.
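How the points of such a plot arise can be sketched as follows. This is an illustration under assumptions: the per-instance scores are hypothetical, and an instance is flagged when its score reaches the threshold. Sweeping the threshold yields one (false positive rate, detection rate) point per value, whereas a classifier with no tunable parameter contributes a single point.

```python
def roc_points(scores, labels, thresholds):
    """Return a list of (false_positive_rate, detection_rate) points,
    one per threshold, sorted by increasing false positive rate."""
    malicious = sum(1 for y in labels if y)
    normal = len(labels) - malicious
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((fp / normal, tp / malicious))
    return sorted(points)


# Toy data (illustrative only).
scores = [0.95, 0.80, 0.60, 0.30, 0.40, 0.10, 0.05, 0.02]
labels = [True, True, True, True, False, False, False, False]

curve = roc_points(scores, labels, [0.2, 0.5, 0.9])   # threshold classifier
single = roc_points(scores, labels, [0.5])            # fixed decision rule
```

Points closer to the upper left corner (small first coordinate, large second coordinate) correspond to better operating points.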

D. KDD CUP Dataset

The second experiment involved the KDD dataset. We consider the following three different sub datasets:

1) DoS dataset: including neptune, smurf, teardrop, land, pod, back and normal data.

2) Network dataset: in addition to the aforementioned DoS dataset, it adds the Probing dataset, including ipsweep, portsweep, saint, and normal data.

3) Entire dataset: all attack categories and normal data presented in the KDD dataset.
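The construction of these sub-datasets can be sketched as a simple filter over the attack labels listed above. The record layout (a dictionary with a "label" key) is a hypothetical representation, and labels are assumed to be already normalized to the plain names used in this section.

```python
# Attack names as listed in the sub-dataset definitions above.
DOS = {"neptune", "smurf", "teardrop", "land", "pod", "back"}
PROBE = {"ipsweep", "portsweep", "saint"}


def sub_dataset(records, attack_names):
    """Keep all normal records plus records whose label is in attack_names."""
    keep = set(attack_names) | {"normal"}
    return [r for r in records if r["label"] in keep]


# Toy records (illustrative only).
records = [{"label": name} for name in
           ["normal", "smurf", "ipsweep", "guess_passwd", "neptune"]]

dos_data = sub_dataset(records, DOS)              # DoS dataset
network_data = sub_dataset(records, DOS | PROBE)  # Network dataset
entire_data = records                             # Entire dataset
```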

The results for the DoS, network, and entire datasets are shown in Fig. 9, 10, and 11, respectively. We can make the following observations. First, SID again performs very well on all three datasets. In particular, the threshold classifier achieves detection rates of 99%, 94%, and 90% with false positive rates of 0.06%, 0.13%, and 0.48% for the DoS, network, and entire datasets, respectively. Further, as we can observe from the plots, the naive Bayes classifier achieves a slightly lower detection rate but also a lower false positive rate than the threshold classifier on all three datasets. This implies that the naive Bayes classifier is more conservative than the threshold classifier: it raises fewer false alarms but catches fewer anomalous instances. This observation differs somewhat from the observation in Fig. 8, where the naive Bayes classifier performs slightly better than the threshold classifier. A detailed comparison of the two classifiers is given in the Remarks below.

Fig. 12 ROC plot for the real network dataset.

E. Real Network Experiment

The final experiment is based on the real network dataset collected from a university campus. As mentioned earlier, several experiments are performed on the collected dataset: we sample the dataset into a sub-dataset and then divide the sub-dataset into training and testing datasets. Although the results of the experiments are not identical, all of them are promising. Fig. 12 shows a representative result. In Fig. 12, the training data is sampled from the first half of the entire dataset and the testing data is sampled from the second half. Again, we can observe from the plot that SID also performs well on the dataset collected from a real network. Finally, the performance of SID on the real network dataset is slightly worse than on the DARPA and KDD datasets. This can be explained as follows: recent advances in network attacks have complicated the behavior, patterns, and signatures of attacks. Since SID relies on prior knowledge and statistics to detect anomalies, the polymorphism and complicated behavior of advanced attacks degrade the performance of SID on real networks.
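The sampling procedure used for Fig. 12 can be sketched as below. Uniform sampling without replacement and the fixed random seed are assumptions; the text states only which half each set is drawn from and how many instances are sampled.

```python
import random


def split_and_sample(instances, n_train, n_test, seed=0):
    """Sample training data from the first half and testing data from the
    second half of `instances`, which must be in capture order.

    Sampling is uniform without replacement (an assumption of this sketch).
    """
    mid = len(instances) // 2
    rng = random.Random(seed)
    train = rng.sample(instances[:mid], n_train)
    test = rng.sample(instances[mid:], n_test)
    return train, test


# Toy usage: 12 instances stand in for the roughly 12 million collected,
# with 2 training and 1 testing instance mirroring the 2:1 ratio in the text.
train, test = split_and_sample(list(range(12)), n_train=2, n_test=1)
```

Keeping the testing data strictly later than the training data avoids evaluating on traffic that precedes what the profiles were trained on.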

F. Remarks

Before we conclude, we highlight some remarks. First, as we have observed from Fig. 8-11, the performance of the naive Bayes classifier is similar to the best performance of the threshold classifier. Further, the performance of a threshold classifier depends on how the evaluation metric is computed and how the threshold is set. More importantly, there is no explicit answer to these two questions. In contrast, the design of the naive Bayes classifier is simpler and more straightforward, and it is independent of such factors. The naive Bayes classifier therefore appears to be the preferred choice. However, threshold classifiers can be used to construct the ROC plot of a detection system in order to study system sensitivity. In addition, threshold classifiers are in general computationally simpler than naive Bayes classifiers. As a result, choosing a proper classifier is a tradeoff between design complexity and computational complexity.

The experimental results show that SID performs very well on the DARPA, KDD DoS, and KDD network datasets, but it suffers slight performance degradation on the entire KDD dataset. This can be explained as follows. Because the purpose of this work is to construct an anomaly-based network intrusion detection framework for large networks, we only consider simple features that can be easily extracted from the monitored network, e.g. protocol and service. Since the selected features can characterize and thus identify the attacks covered in the DoS and network datasets, the performance is as good as expected. However, the entire KDD dataset includes a wider variety of attacks, e.g. R2L and U2R, which cannot be fully characterized by the selected features. Thus the performance degrades. Nevertheless, as mentioned earlier, SID is flexible enough to be easily extended with more complicated features, e.g. the statistics over a two-second window in the KDD dataset, to detect these anomalous activities.

Finally, SID is easy to implement. In particular, as we can observe from the proposed algorithms, there is no complicated operation in either the training procedure or the detection procedure. More importantly, SID can run in an incremental mode, as we can conclude from Section III. The training buckets can be easily updated by continuously keeping track of $n^{b}_{i,j}$, $n^{m}_{i,j}$, and the corresponding probabilities, while the decision classifiers use the updated information to identify anomaly suspects. Compared with more complicated systems, e.g. SVM [3], this seamless online update makes SID favorable, in particular for backbone networks.
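The incremental mode can be illustrated with a minimal sketch. It rests on loudly stated assumptions: the two counters are taken to be per-feature, per-bucket counts of benign and malicious training instances, and the "corresponding probability" is taken to be the malicious fraction within a bucket. The actual definitions are those of Section III, which this sketch does not reproduce; the class and method names are hypothetical.

```python
from collections import defaultdict


class Buckets:
    """Hypothetical incremental profile: per-(feature, bucket) counters."""

    def __init__(self):
        self.n_b = defaultdict(int)  # (feature index i, bucket j) -> benign count
        self.n_m = defaultdict(int)  # (feature index i, bucket j) -> malicious count

    def update(self, features, is_malicious):
        """Fold one labeled instance into the counters in O(f) time."""
        counts = self.n_m if is_malicious else self.n_b
        for i, j in enumerate(features):
            counts[(i, j)] += 1

    def p_malicious(self, i, j):
        """Assumed estimate: malicious fraction in bucket j of feature i,
        or None when the bucket has never been observed."""
        b = self.n_b.get((i, j), 0)
        m = self.n_m.get((i, j), 0)
        total = b + m
        return m / total if total else None


profile = Buckets()
profile.update(("tcp", "http"), is_malicious=False)
profile.update(("tcp", "http"), is_malicious=True)
profile.update(("tcp", "smtp"), is_malicious=False)
```

Because each update touches only one counter per feature, new observations can be absorbed without retraining on the full dataset.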

Table 2 Performance comparison.

Category   Metric                    SID    KDD    SVM
DoS        Detection Rate (%)        99     97     99
           False Positive Rate (%)   0.06   0.1    N/A
R2L        Detection Rate (%)        91     8.4    99
           False Positive Rate (%)   0.47   0.5    N/A
UDP(17)    Detection Rate (%)        99     83.3   99
           False Positive Rate (%)   6.2    7.1    N/A

Table 3 Complexity comparison.

Category    SID       Generic SVM             SVM worst case
Training    $O(fn)$   $O(N_S^2 + N_S f n)$    $O(f n^2)$
Detection   $O(f)$    $O(f N_S)$              $O(fn)$

G. Comparison with Other Frameworks

In Section IV.B, we have shown that SID can efficiently detect our target attacks, i.e. abusive and noisy ones. To further demonstrate that SID is amenable to practical deployment, we compare SID with two representative prior works: the winning entry in KDD CUP 99 [31], denoted by KDD, and the SVM framework [3]. Tables 2 and 3 show the comparison results in terms of system performance and algorithm complexity, respectively. First, we can observe from Table 2 that SID outperforms the KDD CUP 99 winning entry in all aspects. Further, we can observe that SID matches the performance of SVM except for a lower detection rate in the "R2L" category. Notice that, as mentioned earlier, a high detection rate can be achieved at the cost of a high false positive rate. However, the false positive rate of SVM is not provided in [3], and thus the results shown in Table 2 do not necessarily mean that SVM performs better than SID in the "R2L" category.

More importantly, as we can observe from Table 3, SID has lower complexity than SVM in both the training and detection algorithms. In Table 3, we show the training and detection (for one instance) complexity of SID, generic SVM, and the SVM worst case, where f denotes the number of features selected, n denotes the number of entries in the training dataset, and Ns is the number of support vectors. The complexity of SID can be observed from Fig. 2 and 5 as well as (1) and (6). The complexity results for SVM are cited from [21], to which we refer interested readers for a detailed discussion. Notice that in the worst case Ns could be as large as n, and this is how the results in the SVM worst case column are derived. As we can observe from the table, in the worst case SVM is much more complex than SID. Further, the detection complexity of SVM depends on the number of entries in the training dataset. (To obtain well-trained profiles, the number of training instances will in general be too large to be ignored.) This drawback also prevents the SVM framework from exploiting hardware acceleration techniques such as pipelining and superscalar (parallel) execution. In contrast, SID does not suffer from this problem. Based on the above comparison, SID is more practical than SVM for implementing detection systems in large-scale networks.
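As a rough worked illustration of the per-instance detection costs discussed above, the substitution Ns = n can be written out as follows. This is a sketch that ignores constant factors; the values f = 4 and n ≈ 10^6 are taken from the four selected features and the approximate size of the DARPA training dataset reported earlier, and are used here only for scale.

```latex
\begin{align*}
\text{SID:} \quad & O(f) && f = 4 \\
\text{SVM:} \quad & O(f N_S), \quad N_S \le n \\
\text{SVM worst case } (N_S = n): \quad & O(fn) && fn \approx 4 \times 10^{6}
\end{align*}
```

Under these assumptions, the gap between the two detection costs grows linearly with the size of the training dataset.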

V. CONCLUSIONS

In this paper we present an efficient statistics-based framework for detecting abusive network intrusions. SID consists of a training subsystem and a testing/detection subsystem. The training subsystem creates user profiles, i.e. trained buckets, by transforming and aggregating the feature fields. The detection subsystem employs two classifiers, threshold and naive Bayes, to detect anomalous instances. A series of evaluations has been performed on SID. The experimental results show that SID can effectively detect network attacks. It achieves an accuracy of 97% with a false positive rate of 0.03% on the DARPA 98 dataset, a 99% detection rate with a 0.06% false positive rate on the KDD DoS dataset, and an 89% detection rate with a 5% false positive rate on the real network. Compared with conventional signature-based approaches and existing anomaly-based detection systems, it offers high accuracy and a low false positive rate with a simple implementation. Since SID inspects only a few flow features, it has a low detection cost for large networks, e.g. backbone networks.

REFERENCES

[1] M. Mahoney, P.K. Chan, Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks. Proceedings of ACM SIGKDD, 2002.

[2] M. Mahoney, Network Traffic Anomaly Detection Based on Packet Bytes. Proceedings of ACM SAC, 2003.

[3] S. Mukkamala, G. Janoski, A.H. Sung, Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE Int'l Joint Conf. on Neural Networks, 2002.

[4] F.V. Jensen, Introduction to Bayesian Networks. UCL Press, 1996.

[5] A.A. Sebyala, T. Olukemi, L. Sacks, Active Platform Security through Intrusion Detection Using Naive Bayesian Network for Anomaly Detection. London Communications Symposium, 2002.

[6] C. Kruegel, D. Mutz, W. Robertson, F. Valeur, Bayesian Event Classification for Intrusion Detection. Proceedings of the 19th Annual Computer Security Applications Conference, 2003.

[7] R. Puttini, Z. Marrakchi, L. Me, Bayesian Classification Model for Real-Time Intrusion Detection. Proceedings of 22nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 2002.

[8] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Macia-Fernandez, E. Vazquez, Anomaly-based network intrusion detection: Techniques, systems and challenges. Elsevier Computers and Security, 2009.

[9] A. Lakhina, M. Crovella, C. Diot, Mining anomalies using traffic feature distributions. ACM SIGCOMM, 2005.

[10] A. Cardenas, J.S. Baras, K. Seamon, A Framework for the Evaluation of Intrusion Detection Systems. IEEE Symposium on Security and Privacy, 2006.

[11] S. Mukkamala, A.H. Sung, Identifying Significant Features for Network Forensic Analysis Using Artificial Intelligent Techniques. Intl. Journal of Digital Evidence, 2003.

[12] G.F. Cretu, A. Stavrou, M.E. Locasto, S.J. Stolfo, Casting out Demons: Sanitizing Training Data for Anomaly Sensors. In Proceedings of IEEE Symposium on Security and Privacy, 2008.

[13] C.F. Tsai, C.Y. Lin, A triangle area based nearest neighbors approach to intrusion detection. Pattern Recognition, 2010.

[14] R. Goldman, A Stochastic Model for Intrusions. Proceedings of Symposium on Recent Advances in Intrusion Detection, 2002.

[15] C. Kruegel, T. Toth, E. Kirda, Service Specific Anomaly Detection for Network Intrusion Detection. Proceedings of Symposium on Applied Computing, 2002.

[16] A. Valdes, K. Skinner, Adaptive, Model-based Monitoring for Cyber Attack Detection. Proceedings of RAID, 2000.

[17] C. Thomas, N. Balakrishnan, Improvement in Intrusion Detection With Advances in Sensor Fusion. IEEE Transactions on Information Forensics and Security, 2009.

[18] C.M. Chen, Y.L. Chen, H.C. Lin, An Efficient Network Intrusion Detection. Computer Communications, 2010.

[19] R.S. Ritu, G. Neetesh, K. Shiv, To Reduce the False Alarm in Intrusion Detection System using self Organizing Map, 2011.

[20] Z. Muda, W. Yassin, M.N. Sulaiman, N.I. Udzir, A K-Means and Naive Bayes Learning Approach for Better Intrusion Detection, 2011.

[21] C.J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.

[22] H.S. Javits, A. Valdes, The NIDES statistical component: Description and justification. SRI International Computer Science Laboratory, 1993.

[23] J. Hoagland, SPADE. Silicon Defense, 2009.

[24] Snort - lightweight intrusion detection for networks, http://www.snort.org.

[25] Massachusetts Institute of Technology Lincoln Laboratory, 1998 DARPA Intrusion Detection Evaluation Dataset Overview, 2009.

[26] T. Karagiannis, K. Papagiannaki, M. Faloutsos, BLINC: Multilevel Traffic Classification in the Dark. Proceedings of ACM SIGCOMM, 2005.

[27] M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli, Traffic Classification through Simple Statistical Fingerprinting. Computer Communications Review, 2007.

[28] KDD99 CUP dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2009.

[29] Internet Assigned Numbers Authority (IANA), Assigned Internet Protocol Numbers, 2009.

[30] TCPDUMP. http://www.tcpdump.org

[31] KDD CUP 1999: Results. http://www.sigkdd.