2016 International Conference on Mathematical, Computational and Statistical Sciences and Engineering (MCSSE 2016) ISBN: 978-1-60595-396-0

**Attribute Reduction for Intrusion Detection Based on Semi-supervised Fuzzy Clustering and Sample Selection**

### Wen-jun YANG

Zhejiang Yuexiu University of Foreign Languages, Shaoxing, 312000, Zhejiang, China *Corresponding author

**Keywords:** Intrusion detection, Clustering, Attribute reduction, Sample selection.

**Abstract.** To raise the detection rate and lower the false detection rate of intrusion detection, this paper proposes an attribute reduction method based on semi-supervised fuzzy clustering and applies it to intrusion detection. The selected samples are first preprocessed, semi-supervised fuzzy clustering is then used to reduce the sample set, and finally a reduction algorithm based on attribute dependence is applied to the reduced sample set. Simulation experiments on the KDD99 data set show that the method detects intrusions more efficiently.

**Introduction**

With the continuous development of network technology, computer network security has attracted more and more attention. Network intrusion detection examines network traffic and related audit data to determine whether the security policy of a system has been violated or whether behavior that endangers the computer system has occurred [1]. As network services and applications grow, their negative effects also become more prominent, so finding new kinds of intrusion behavior quickly and efficiently is very important. Intrusion detection has therefore been one of the hot topics of theoretical research in network information security in recent years.

Current intrusion detection methods mainly include neural networks, mathematical statistics and so on. The focus of intrusion detection research is how to discover the patterns of intrusion behavior, extract detection rules effectively, and improve the accuracy of intrusion alarms.

Rough set theory [2] was proposed in 1982 by Professor Pawlak of the Warsaw University of Technology as a method for dealing with incomplete information. It needs no prior information and can effectively analyze incomplete, inconsistent and inaccurate data. By analyzing large amounts of data, it uses the dependency between two equivalence relations on the universe to weed out redundant information and to extract potentially valuable rules.

The basic process of network intrusion detection starts from capturing packets for testing. All the data and their attributes are mapped to a table, and attributes that have very little correlation with the decision attribute are removed, which finally yields a simplified attribute subset. This meets real-time requirements without reducing detection accuracy, and rough set theory is well suited to building an intrusion detection system that satisfies these requirements [3].

Attribute reduction is an important step in processing the data in the table. It simplifies the original information system by deleting redundant attributes without affecting the classification ability of the system, and thereby simplifies the knowledge representation.

**Intrusion Detection Model Based on Semi-supervised Fuzzy Clustering and Sample Selection Attribute Reduction**

The improved intrusion detection model based on semi-supervised fuzzy clustering and sample selection attribute reduction is shown in Figure 1. The model discretizes and preprocesses the intrusion data, applies clustering analysis and attribute reduction, and then extracts and filters decision rules in order to improve the detection rate for network intrusion data.

[Figure 1 flowchart: Intrusion data, Data collector, Select training sample set, Discretization of data preprocessing, Attribute reduction of decision table, Rule generation, Rule base, Intrusion detection, Alarm]

Figure 1. Intrusion detection model based on semi-supervised fuzzy clustering and sample selection attribute reduction.

The main body of the model is divided into three parts: intrusion data processing, attribute reduction and rule extraction.

Intrusion data discretization: the intrusion data sources are divided into user data and network data, and testing and analyzing these data is a very important step. To improve detection efficiency, the large amount of collected data needs to be discretized; the discretization method used here is based on error-free points.

Attribute reduction of intrusion data: the attributes in the intrusion detection knowledge base are not equally important, so the redundant intrusion attributes are removed. While keeping the classification ability of the knowledge unchanged, irrelevant data and redundant attributes are deleted so that concise knowledge rules can be produced.

Rule extraction: after attribute reduction, redundant attribute values are deleted to complete value reduction, a decision table is built, and rules are then exported from the decision table. The generated rules are tested and verified and stored in the rule base, and the security detector checks data and behavior for intrusions according to the rules in the rule base.

**Preliminary Knowledge**

Definition 1 [5]: An information system can be expressed as a quadruple $IS = (U, A, V, f)$, where $U$ is the set of all samples, called the universe; $A$ is the set of attributes describing the samples; $V = \bigcup_{a \in A} V_a$ is the set of all attribute values, where $V_a$ is the value range of attribute $a$; and $f: A \times U \to V$ is the information function, which maps each sample to its value on each attribute, so that $f(a, x) \in V_a$ for all $x \in U$ and $a \in A$. If $A = C \cup D$ and $C \cap D = \emptyset$, where $C$ is the condition attribute set and $D$ is the decision attribute set, the information system is also called a decision system, denoted $DT$.

Given a set of network connection samples, $U = \{X_1, X_2, \cdots, X_n\}$ is a finite non-empty set in which each $X_i$ is one network connection sample.

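Definition 2: Given a decision table $DT = (U, A, V, f)$, let $P \subseteq C$. If $P$ satisfies the following two conditions

1) $P$ is independent relative to $D$;
2) $POS_P(D) = POS_C(D)$;

then $P$ is called a reduct of $C$ relative to $D$. The set of all such reducts is denoted $Red_C(D)$, and the intersection $\bigcap Red_C(D)$ is called the core of $C$ relative to $D$, denoted $CORE_C(D)$.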
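The positive region and the core in Definition 2 can be illustrated with a minimal Python sketch. The toy decision table, its attribute names (`proto`, `flag`, `svc`, `label`) and the helper function names are assumptions made for illustration only; they are not taken from the paper's experiments.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(rows, cond, dec):
    """POS_cond(dec): rows whose cond-equivalence class has one decision value."""
    pos = set()
    for block in partition(rows, cond):
        if len({tuple(rows[i][d] for d in dec) for i in block}) == 1:
            pos.update(block)
    return pos

def dependency(rows, cond, dec):
    """Degree of dependency gamma_cond(dec) = |POS_cond(dec)| / |U|."""
    return len(positive_region(rows, cond, dec)) / len(rows)

def core(rows, cond, dec):
    """Attributes whose removal shrinks the positive region."""
    full = positive_region(rows, cond, dec)
    return [a for a in cond
            if positive_region(rows, [c for c in cond if c != a], dec) != full]

# Hypothetical toy decision table (not KDD99 data).
rows = [
    {"proto": "tcp",  "flag": "SF", "svc": "http", "label": "normal"},
    {"proto": "tcp",  "flag": "S0", "svc": "http", "label": "attack"},
    {"proto": "udp",  "flag": "SF", "svc": "dns",  "label": "normal"},
    {"proto": "tcp",  "flag": "S0", "svc": "smtp", "label": "attack"},
    {"proto": "icmp", "flag": "SF", "svc": "echo", "label": "attack"},
]
cond, dec = ["proto", "flag", "svc"], ["label"]

print(dependency(rows, cond, dec))  # 1.0: the table is consistent
print(core(rows, cond, dec))        # ['flag']
```

In this toy table only `flag` is indispensable: dropping it makes the first two rows indiscernible although their decisions differ.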

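Definition 3 [6] (heterogeneous distance function): for $x, y \in U$, the HVDM distance between $x$ and $y$ is

$$H(x, y) = \sqrt{\sum_{j=1}^{m} d_j^2(x_j, y_j)} \qquad (1)$$

where $m$ is the number of attributes and $d_j(x_j, y_j)$ is the distance between $x$ and $y$ on attribute $j$.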
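Eq. 1 leaves the per-attribute distance $d_j$ unspecified. The sketch below assumes the choices of Wilson and Martinez [6]: a value difference metric for nominal attributes and a difference scaled by four standard deviations for numeric ones. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def fit_hvdm(X, y, nominal):
    """Collect the statistics HVDM needs from labeled samples.

    X: list of feature lists, y: class labels,
    nominal: set of column indices holding nominal values.
    """
    m = len(X[0])
    classes = sorted(set(y))
    sigma, cond = {}, {}
    for j in range(m):
        col = [row[j] for row in X]
        if j in nominal:
            n_ax = Counter(col)            # N_{a,x}: count of value x
            n_axc = Counter(zip(col, y))   # N_{a,x,c}: count of value x in class c
            cond[j] = {v: [n_axc[(v, c)] / n_ax[v] for c in classes]
                       for v in n_ax}
        else:
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            sigma[j] = math.sqrt(var) or 1.0   # avoid division by zero
    return {"m": m, "nominal": set(nominal), "sigma": sigma, "cond": cond}

def hvdm(model, x, z):
    """H(x, z) of Eq. 1; nominal values must have been seen in fit_hvdm."""
    total = 0.0
    for j in range(model["m"]):
        if x[j] is None or z[j] is None:       # unknown value
            d = 1.0
        elif j in model["nominal"]:
            p, q = model["cond"][j][x[j]], model["cond"][j][z[j]]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        else:
            d = abs(x[j] - z[j]) / (4 * model["sigma"][j])
        total += d * d
    return math.sqrt(total)

# Hypothetical toy samples: (protocol, duration).
X = [["tcp", 10.0], ["tcp", 20.0], ["udp", 30.0], ["udp", 40.0]]
y = ["normal", "normal", "attack", "attack"]
model = fit_hvdm(X, y, nominal={0})
print(hvdm(model, X[0], X[1]), hvdm(model, X[0], X[2]))
```

Because every term is normalized, nominal and numeric attributes contribute on a comparable scale, which is the reason for using a heterogeneous distance on mixed connection records.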

Definition 4 [7]: The membership degree of a sample is determined by its importance to its class, that is, by how much it contributes to the class. The distance from a sample to the class center is one measure of this contribution, so a distance-based membership degree is a function of the distance between the sample and the class center in feature space. This paper defines the membership degree as follows:

$$d_{ij} = \frac{n_j}{N} \cdot \frac{1}{r} H(x_i, c_j) \qquad (2)$$

$$c_j = \frac{N_{a,x,i}}{N_{a,x}} + \frac{1}{m} \sum_{x_i \in c_j} x_i \qquad (3)$$

where $n_j$ is the number of samples in class $j$, $N$ is the total number of samples, $r = \max_i H(x_i, c_j)$ is the radius of class $j$, and $H(x_i, c_j)$ is the heterogeneous distance, computed by Eq. 1, from sample $x_i$ to the center of class $j$.
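A small numeric sketch may make the membership degree of Eq. 2 concrete. It makes several assumptions: only numeric attributes are used, so the class center reduces to the mean vector; Euclidean distance stands in for the heterogeneous distance $H$; and the radius is taken per class. All names and data are hypothetical.

```python
import math

def euclid(a, b):
    """Stand-in for the heterogeneous distance H of Eq. 1 (assumption)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def center(samples):
    """Mean vector of a cluster (the numeric part of the class center)."""
    m = len(samples)
    return [sum(col) / m for col in zip(*samples)]

def membership(x, clusters, dist=euclid):
    """Return d_ij of Eq. 2 for sample x against every cluster j."""
    N = sum(len(c) for c in clusters)
    out = []
    for samples in clusters:
        c = center(samples)
        r = max(dist(s, c) for s in samples) or 1.0  # class radius, guarded against 0
        out.append(len(samples) / N * dist(x, c) / r)
    return out

# Two hypothetical clusters on a line and one unlabeled sample.
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],    # center (1, 0), radius 1
    [[10.0, 0.0], [14.0, 0.0]],  # center (12, 0), radius 2
]
x = [3.0, 0.0]
d = membership(x, clusters)
print(d)                 # [1.0, 2.25]
print(d.index(min(d)))   # 0: x is assigned to the first cluster
```

The sample is assigned to the cluster with the smallest value, matching the rule of taking the minimum $d_{ij}$.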

**Attribute Reduction Algorithm Based on Semi-supervised Fuzzy Clustering and Sample Selection**

**The Reduction Algorithm Based on Attribute Dependence**[11]

The main idea of the algorithm is as follows. Starting from the core, one attribute at a time is added to the reduction. Each added attribute must make the dependence of the new attribute set greater than that of the set before the addition, and the process continues until the dependence of the reduction set equals that of the full attribute set of the original information table.

Algorithm 1:

Input: decision table $IS = (U, A, V, f)$.

Output: a relative reduction $Red_C(D)$.

Step 1: $Red_C(D) = CORE_C(D)$;

Step 2: $C' = C - Red_C(D)$;

Step 3: Find the attribute $a \in C'$ that maximizes $SGF(a, Red_C(D), D) = \gamma_{Red_C(D) \cup \{a\}}(D) - \gamma_{Red_C(D)}(D)$;

Step 4: If more than one attribute attains the maximum of $SGF(a, Red_C(D), D)$, select the attribute $a$ whose combination with $Red_C(D)$ has the smallest number of attribute value combinations;

Step 5: $Red_C(D) = Red_C(D) \cup \{a\}$, $C' = C' - \{a\}$;

Step 6: If $\gamma_{Red_C(D)}(D) = 1$, terminate; otherwise, go to Step 3.
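The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the author's implementation: the function names and the toy table are hypothetical, and the loop stops when the dependency of the reduction equals that of the full condition set, which coincides with the test $\gamma = 1$ of Step 6 on a consistent table and avoids an endless loop on an inconsistent one.

```python
from collections import defaultdict

def gamma(rows, attrs, dec):
    """Dependency of decision attribute dec on attrs: |POS_attrs(dec)| / |U|."""
    blocks = defaultdict(set)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].add(row[dec])
    consistent = {key for key, decisions in blocks.items() if len(decisions) == 1}
    return sum(tuple(row[a] for a in attrs) in consistent for row in rows) / len(rows)

def n_combinations(rows, attrs):
    """Number of distinct value combinations on attrs (tie-breaker of Step 4)."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def reduct(rows, cond, dec):
    full = gamma(rows, cond, dec)
    # Step 1: start from the core (attributes whose removal lowers the dependency).
    red = [a for a in cond if gamma(rows, [c for c in cond if c != a], dec) < full]
    rest = [a for a in cond if a not in red]          # Step 2
    while rest and gamma(rows, red, dec) < full:      # Step 6 (see lead-in)
        base = gamma(rows, red, dec)
        # Steps 3-4: largest significance SGF, ties broken by fewest combinations.
        best = max(rest, key=lambda a: (gamma(rows, red + [a], dec) - base,
                                        -n_combinations(rows, red + [a])))
        red.append(best)                              # Step 5
        rest.remove(best)
    return red

# Hypothetical toy decision table (not KDD99 data).
rows = [
    {"proto": "tcp",  "flag": "SF", "svc": "http", "label": "normal"},
    {"proto": "tcp",  "flag": "S0", "svc": "http", "label": "attack"},
    {"proto": "udp",  "flag": "SF", "svc": "dns",  "label": "normal"},
    {"proto": "tcp",  "flag": "S0", "svc": "smtp", "label": "attack"},
    {"proto": "icmp", "flag": "SF", "svc": "echo", "label": "attack"},
]
print(reduct(rows, ["proto", "flag", "svc"], "label"))  # ['flag', 'proto']
```

In the toy table both `proto` and `svc` raise the dependency to 1 when added to the core `flag`; `proto` wins the tie because it produces fewer value combinations.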

**Attribute Reduction Algorithm Based on Semi-supervised Fuzzy Clustering and Sample Selection**

Algorithm 2:

Input: $DT = (U, A, V, f)$.

Output: reduction $Red$.

Step 1: For the data set $U = \{X_1, X_2, \cdots, X_n\}$, standardize the data attributes with the standard deviation formula given in [8] to generate the standardized data set.

Step 2: Choose $k$ samples from $U$ as the centers of the $k$ clusters, where $k$ is determined by the number of categories (decision classes) in the data set.

Step 3: Scan the labeled data in the input data set. For discrete attributes, replace each value with $N_{a,x,i}$ and count $N_{a,x}$. Calculate each class center $c_i$, $i = 1, 2, \ldots, k$, by Eq. 3.

Step 4: For each unlabeled sample $X_i$, calculate its membership degree $d_{ij}$ for each cluster by Eq. 2, take the minimum $d_{ij}$, assign $X_i$ to the corresponding cluster, and recalculate the class center $c_i$ and the class radius $r$. In this paper the threshold is set to the maximum of the existing class radii; it is fixed and does not change while the algorithm runs.

Step 5: If the set of unlabeled samples is empty or the clustering result no longer changes, stop clustering and go to Step 6; otherwise, return to Step 4.

Step 6: In each cluster, add every sample whose distance to the cluster center is greater than the average distance from all samples in the cluster to the center to the new sample set $U'$, which gives the new decision system $DT' = (U', C \cup D, V, f)$.

Step 7: Use Algorithm 1 to calculate $Red_C(D)$ of the new decision table $DT'$.
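Steps 2 to 6 can be sketched as below under simplifying assumptions that should be read as such: the data are numeric and already standardized (Step 1 is omitted), plain nearest-center assignment with Euclidean distance replaces the membership degree of Eq. 2 and the heterogeneous distance, and the fixed radius threshold of Step 4 is not modeled. The function names and the data are hypothetical.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_and_select(labeled, unlabeled, max_iter=20):
    """labeled: list of (vector, label); unlabeled: list of vectors.

    Returns the reduced sample set U' as a list of (vector, label).
    """
    labels = sorted({lab for _, lab in labeled})
    # Steps 2-3: one cluster per decision class, centers from the labeled data.
    centers = {lab: mean([v for v, l in labeled if l == lab]) for lab in labels}
    assign = [None] * len(unlabeled)
    for _ in range(max_iter):
        # Step 4: assign each unlabeled sample to its closest center.
        new = [min(labels, key=lambda lab: euclid(v, centers[lab]))
               for v in unlabeled]
        if new == assign:              # Step 5: stop when nothing changes
            break
        assign = new
        for lab in labels:             # recompute the class centers
            members = ([v for v, l in labeled if l == lab] +
                       [v for v, a in zip(unlabeled, assign) if a == lab])
            centers[lab] = mean(members)
    # Step 6: keep samples farther from their center than the cluster average.
    everything = labeled + list(zip(unlabeled, assign))
    selected = []
    for lab in labels:
        members = [v for v, l in everything if l == lab]
        dists = [euclid(v, centers[lab]) for v in members]
        avg = sum(dists) / len(dists)
        selected += [(v, lab) for v, d in zip(members, dists) if d > avg]
    return selected

# Hypothetical one-dimensional layout embedded in the plane.
labeled = [([0.0, 0.0], "normal"), ([1.0, 0.0], "normal"),
           ([10.0, 0.0], "attack"), ([11.0, 0.0], "attack")]
unlabeled = [[0.5, 0.0], [4.0, 0.0], [10.5, 0.0], [7.0, 0.0]]
selected = cluster_and_select(labeled, unlabeled)
print(selected)
```

Only the samples near the cluster boundaries survive the selection, so the decision table passed to the reduction step in Step 7 is smaller than the original one.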

**Test Results and Analysis**

The approach is simple, fast to compute and efficient. Its key stage is the clustering analysis, which must obtain the best center vectors and radii.

The experiments use the KDDCUP99 data set. To preserve the distribution of the original data and to ensure that the selected packets contain complete classification information, a training set with a total of 2470 records and a test set with a total of 1551 records were selected. Following [9], different attributes were chosen in the experiment for the different network protocols (TCP, UDP and ICMP).

All the intrusion data were used in the experiment. After data preprocessing and discretization, the data were processed by the intrusion detection attribute reduction algorithm based on semi-supervised fuzzy clustering and sample selection. Because the number of cluster centers affects the clustering result, the experiment tested the data with different values of k. The test results of the algorithm for the different k values are presented in Figure 2 and Figure 3.

Figure 2. Detection rate under different k.

Figure 3. Error detection rate under different k.

The experimental results show that the false positive rate increases as k gradually increases, while the detection rate reaches its maximum at k=15. With k=15 the algorithm therefore achieves a good intrusion detection effect, with a detection rate of 91.04% and a false positive rate of 8.32%.
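The paper does not spell out how the two rates are computed. A common convention, assumed here, is that the detection rate is the share of attack records flagged as attacks and the false positive rate is the share of normal records flagged as attacks. The function name and the labels below are hypothetical.

```python
def detection_metrics(y_true, y_pred, normal="normal"):
    """Return (detection_rate, false_positive_rate) under the assumed definitions."""
    attacks = [p for t, p in zip(y_true, y_pred) if t != normal]
    normals = [p for t, p in zip(y_true, y_pred) if t == normal]
    detection_rate = sum(p != normal for p in attacks) / len(attacks)
    false_positive_rate = sum(p != normal for p in normals) / len(normals)
    return detection_rate, false_positive_rate

# Hypothetical predictions for eight connection records.
y_true = ["dos", "probe", "normal", "normal", "r2l", "normal", "dos", "normal"]
y_pred = ["dos", "normal", "normal", "probe", "r2l", "normal", "dos", "normal"]
print(detection_metrics(y_true, y_pred))  # (0.75, 0.25)
```

Under this convention an attack counts as detected even when it is assigned to the wrong attack class; a per-class comparison would need a stricter test.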

[Figures 2 and 3: charts over k=5, k=10, k=15 and k=20. The vertical axis of Figure 2 runs from 77 to 91 and that of Figure 3 from 6.5 to 11.5. Legend: Probe, U2R, U2L, Normal, DoS.]

**Summary**

The heterogeneous distance and the density of the samples are introduced into the clustering algorithm to form a semi-supervised fuzzy clustering method. This paper proposes an attribute reduction algorithm for intrusion detection based on sample selection. The algorithm greatly reduces running time while keeping classification accuracy basically unchanged or improving it. The experiments verify the feasibility and effectiveness of the proposed algorithm for intrusion detection, but its false positive rate is still on the high side and needs further research and improvement in the future.

**Acknowledgements**

This research was financially supported by the Scientific Research Project of Zhejiang Yuexiu University of Foreign Languages (2016QDA035).

**References**

[1] Botha M, Von S R. Utilizing fuzzy logic and trend analysis for effective intrusion detection [J]. Computers and Security. 2003, 22(5):423-434.

[2] Pawlak Z. Rough sets [J]. International Journal of Information and Computer Science, 1982, 11(6):341-356.

[3] Zheng-ming Zhang, Tian Jingfeng. On attribute reduction with intuitionistic fuzzy rough sets. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. Feb, 2012, Vol. 20:59-76.

[4] Yang Xiao-qiang. Algorithm for intrusion detection based on evolutionary semi-supervised fuzzy clustering. Computer Engineering and Applications. 2008, 44(4):33-35.

[5] Zhang Wen-xiu, Liang Yi, Wu Wei-zhi. Information system and knowledge discovery[M]. Beijing: Science Press, 2003: 42-48.

[6] Wilson D R, Martinez T R. Improved heterogeneous distance functions [J]. Journal of Artificial Intelligence Research. 1997, 6(1):1-34.

[7] Du Hong-Le, Fan Jing-bo. Semi-supervised fuzzy clustering algorithm for intrusion detection. Computer Engineering and Applications. 2016, 52(3):96-99.

[8] Ming-xiang Li. The research of data mining method based on rough set theory. The thesis of master's degree of Shandong University of Science and Technology. 2003.