• No results found

Integrating genetic algorithms and fuzzy c-means for anomaly detection

N/A
N/A
Protected

Academic year: 2020

Share "Integrating genetic algorithms and fuzzy c-means for anomaly detection"

Copied!
5
0
0

Loading.... (view fulltext now)

Full text

(1)

IEEEIndicon2005 Conference, Chennai, India, 11-13 Dec.2005

Integrating

Genetic

Algorithms

and

Fuzzy

c-Means

for

Anomaly

Detection

WitchaiChimphlee',

Abdul Hanan

Abdullah2, Mohd

Noor Md

Sap2

Siriporn

Chimphleel

and Surat

Srinoy'

Abstract - The goal of intrusion detection is to discover

unauthorized useof computer systems.Newintrusion types,of which detection systems areunaware,arethe mostdifficult to

detect. The amountof availablenetworkauditdatainstancesis

usually large; human labelingistedious,time-consuming, and

expensive.Traditional anomaly detection algorithms require a

setofpurelynormal data fromwhichtheytraintheir model.In

this paper we propose an intrusion detection method that

combines Fuzzy Clustering and Genetic Algorithms. Clustering-based intrusion detectionalgorithmwhich trainson

unlabeled data in order to detect new intrusions. Fuzzy

c-Means allow objects to belong to several clusters

simultaneously, with differentdegrees ofmembership.Genetic

Algorithms(GA)totheproblem oftelection ofoptimizedfeature subsets to reduce the error caused by using land-selected

features.Our method isabletodetect manydifferent types of

intrusions, whilemaintainingalow falsepositiverate.Weused data setfrom1999KDDintrusion-etection contest.

Keywords - Anomaly detection, Unsupervised clustering,

GeneticAlgorithms, Fuzzyc-means

I.INTRODUCTION

A

s defined in [1], intrusion detection is the

process

of monitoringtheeventsoccurringinacomputer systemor

networkandanalyzingthem forsignsofintrusions.Itis also defined as attempts to compromise the confidentiality, integrity,availability,ortobypassthesecurity mechanisms of

acomputerornetwork Thegoal for handle intrusion detec-tionproblem istoclassifypatternsof the system behavior in

two categories (normal andabnormal).The amountof avail-able network audit data instances isusually large;human la-belingistedious,time-consuming, andexpensive. However,

notall the datacollectedareusefulorinformative. There are

twomainintrusion detection systems. Anomaly intrusion de-tection system isbased on the profiles ofnormalbehaviors of

users orapplicationsandchecks whether the system is being

1FacultyofScienceand

T'echnology,

Suan DusitRajabhatUniversity,

295RajasrimaRoad,Dusit, Bangkok,Thailand. Tel:(662)-2445225,Fax:(662)-6687136

lwitcha_chi(@dusit.ac.th,4siripom [email protected],[email protected] 2Facultyof Computer Science and Information Systems

UniversityTechnologyofMalaysia, 81310Skudai, Johor,Malaysia Tel:(+607)-5532070,Fax:(+607) 5565044

[email protected], [email protected]

used inadifferentmanner

[2].

The secondoneiscalled

mis-use intrusion detection systemwhich collects attack

signa-tures, compares abehavior with these attack signatures, and signalsintrusion when there isamatch.Generally, thereare

four categories of attacks [3]. Theyare: DoS (denial-of-ser-vice) R2L (Remote to Local) U2R (User to Root) , and PROBING.

IDScanbeclassified basedonthefunctional characteristics ofdetection methodsasknowledge based intrusion detection andbehavior based intrusion detection [4]. Thetaskofan in-trusiondetection systemis toprotect acomputer systemby detecting and diagnosing attempted breaches of theintegrity ofthesystem.Anomaly detection still faces manychallenges, whereoneof themostimportant istherelatively highrateof false alarms (false positives). The problem of capturing a

complex normality makes the highrateoffalse positives in-trinsictoanomalydetection except for simple problems[5].

Thepaperis structuredasfollows.InsectionIIpresentsthe

related works.Section III describesabriefintroduction of ge-netic algorithms. Explains fuzzy clustering in section IV.

Section V explains about experimental design. Section VI

evaluatesourintrusion detection modelthrough experiments. Finally, sectionVIIpresents ourconclusion and some discus-sion.

II. RELATEDWORKS

Featureselection has been traditionally used in data mining applicationsaspartofthedata cleaning and/or pre-processing

stepwhere the actualextraction andlearning ofknowledgeor

patterns is done aftera suitable set of features is extracted, Manyapproaches have been proposed whichinclude

statisti-cal[6], machineleaming [7], data mining [8] and

immunolog-ical inspired techniques [9]. Alves et al [10] presents a classification-rule discovery algorithm integrating artificial

immunesystems(AIS) and fuzzy systems. III.THEGENETIC

ALGORITHMS

Genetic algorithms (GAs) wereformally introduced in the

1970sbyJohnHolland[11]. In particular, genetic algorithms

areproblem-solvingteclniques based on the principles ofbi-ological evolution, natural selection andgenetic

iecombina-0-7803-9503-4/05/$20.00©2005IEEE

(2)

- India.11-13

tion. Potential solutions to the problem to be solved are encoded as sequences ofbits, characters or numbers. The unit ofencoding is called a gene, and the encoded sequence is calledachromosome. Each chromosomerepresents one pos-sible solution to the problem or a rule in a classification. GAs is able to select subsets ofvarious sizes in order to determine the optimum combination and number of inputs to network. This allows reducing the computational expense on the train-ing system with near optimal results still reachable. Research [12]hasshowthat GA is one of the most efficient of all fea-tureselection methods for dealing with feature sets containing large (>100) numbers of features.

Generationofarandompolviationofchromosomes Evaluationofthefitnexsvaluesof

_indivilchromosome

r lectXion ofchromavomnswithgodifunems

No

Reproduction ofnextgeneration of

chromosomes/population Generation

limitreached

FYlS

Fig.1.A

typical

GA program flow.

The algorithm begins with a population ofchromosomes generatedeitherrandomlyorfromsomesetof known

speci-mens,andcycles

through

the three steps,namely

evaluation,

selection, andreproduction. A chromosomecontainsthe in-formation about the solution to aproblem, whichit repre-sents. Typically, itcanbe encodedusingabinary string as

follows[13]:

Chromosome1 1101100100110110

Chromosome2 1101111000011110

Fig. 1 showsageneraloutline of thealgorithm. Inthefirst step,that

is,

theevaluationstep, each

string

isevaluated

ac-cording to a

given

performance

criterion known as

fitness

function,

and

assigned afitness

score.Inthenextstep,the

se-lectionstep,adecision is made

according

tothe fitnessscore

assigned toeach individual to decidewhich individuals are

permittedtoproduce offspringand with whatprobability.

Fi-nally,the

reproduction

stepinvolves the creation of

offspring

chromosomesbytwogenerators,

namely

cross-overand

mu-tation. This is themost

important

partof the

genetic

algo-rithms as these genetic operators have an

impact

on the performanceofGAs.

Fig.

2illustratedusingapair ofchromo-somes encodedas two

binary strings,

where the cross-site is

denotedby

"I"

ineach chromosome.

Chromosome 1 Chromosome 2 Offspring I Offsnrini2

110111001001

10110

1lO1l111000011110 11011111000011110

11011100100110110 Fig.2. Pair of chromosomes encoded

For achromosome encoded as a binary string, genes are randomlyselected to undergo mutation operation, where '1' is changed into '0' or viceversa, in Fig.3.

Original

offspring1

11011.110000111 10

4

Mutatedoffspringl 1100111000011110 Originaloffspring2 1101100100110110

1

I

Mutated

offspring

2 1

1l0l

I

001O

l

10IO

Fig.3. Mutation operation

Inwhich a bit value of 1 in thechromosomerepresentation meansthat thecorresponding feature is included in the speci-fied subset, and a value of 0 indicates that the corresponding feature is notincluded in the subset.

The fitness of afeature subset is measured by the test accu-racy (orcross-validation accuracy ofthe classifierlearned us-ing thefeature subset) and any other criteria of interest [14].

IV. FuzzY

C-MEANS CLUSTERING

Fuzzy c-means (FCM) algorithm, also known as fuzzy ISODATA, was introduced by Bezdek [16] as extension to Dunn's [17] algorithmto generate fuzzy sets for every ob-served feature. FCM isan iterativealgorithm to find cluster centers (centroids) that minimize a dissimilarity function. Rather thatpartitioningthe datainto acollection of distinct setsby fuzzypartitioning,themembershipmatrix(U) is

ran-domlyinitializedaccordingtoEquation 1.

(1)

The dissimilarity fimctionwhich is used in FCM in given Equation

c c n

J(U,c1,c2,...cC)=

E

Ji

=

Xunmdi2

i=l i=l j=l

(2)

Uij is between0and 1; ciisthecentroid of clusteri;

dij

is theEuclidian distance between

ith

centroid(ci)and

jh

datapoint;

m E

[1 ,ox]

is aweightingexponent.

Toreachaminimum ofdissimilarityfunction therearetwo

conditions. Thesearegiven in(3)and(4).

576 IEEEIndicon 2005

Conference,;Chennai,

India,

II-13 Dec. 2005

c

uu =L vi =15... ,n.

(3)

IEEE---Inicn- 00 Cnfrece Chennai,----1-India,--- 111 Dec- 200

577--Z

L;1Ui.

m

X.

c.===1 Uij

5 zj=1IUU

uii 2

c (

d1

m-2:k=

1t dk;

4

(3)

1

(4)

Detailed algorithm offuzzyc-meansproposedbyBezdek in 1973 [18] (Fig.4).

L

ML M MH

H

.

Algorithm1. Fuzzyc-means

Step1: Randomly initializethemembership matrix (U)thathas

constraintsinEquation1.

Step 2: Calculate centroids(ci) byusingEquation3.

Step 3: Computedissimilarity betweencentroids and datapoints usingEquation 2.Stopifitsimprovementoverprevious

iterationis belowathreshold.

SteD 4: ComDuteanewUusinn Eauation4toto steo2.

Fig. 4. Fuzzyc-meansclustering.

Byiteratively updatingthe clustercentersand the

member-ship grades foreach datapoint, FCM iteratively movesthe

clustercenterstothe"right"location withinadataset. FCM

doesnotensurethatitconverges toanoptimal solution. Be-causeofclustercentersareinitializing usingU thatrandomly

initialized.(Equation 3)

Performance depends oninitial centroids.Forarobust

ap-proachtherearetwowayswhichisdescribed below[13].

1.) Usinganalgorithmtodetermineallof thecentroids.(for example: arithmeticmeansof alldatapoints)

2.)Run FCM several times eachstartingwithdifferent ini-tialcentroids.

A collection offuzzy sets, called fuzzy space, defines the

fuzzylinguisticvaluesorfuzzyclasses.Asamplefuzzyspace

of fivemembershipfunction isshown inFigure5.

V. EXPERIMENTAL DESIGN

Inourmethod have the threesteps(Figure 6). Firststep is

cleaningfor handlemissingandincompletedata.Secondstep

using genetic algorithmsforselect the best attribute and

fea-tureselection and the laststepforclusteringgroupof data us-ingfuzzyc-means.

Thepre-processormoduleperformsthefollowingtasks:

1.Identifiestheattributesandtheir value.

2.Convertcategoricaltonumerical data. 3. Data Normalization

4. Performsredundancycheck and handle aboutnullvalue. 5. Initializes all thenecessaryparameterssuchasthelength

ofachromosome, populationsize

Fig.5.Afuzzyspace of fivemembershipfunction

Fuzzy c-Means

Pre- Cleaned 00 0

Processor 0O

Selectedbest /

Incomplete data attribute Missing data

KDD-19 Normal Attacks

DataSet

Fig.6. Framework for detection

VI. EXPERIMENTAL SETUP AND RESULTS

Inthisexperiment,we usetheKDDCup 1999 intrusion de-tectioncontest[19].Thisdatabaseincludes awidevariety of intrusions simulated inamilitary network environment that is

a commonbenchmark for evaluation of intrusion detection techniques. The data set has41 attributes for each connection recordplus one class label. There are24attacktypes, but we

treatall of themas anattack group.Thenominal attributes are converted intolinear discrete values (integers). After elimi-natinglabels, the data set is described as a matrixX; which has

N rows andm-=41 columns (attributes). There are md=8 dis-crete-valueattributes and

m,

=33continuous-value attributes.

We ran our experiments on a system with a 1.5 GHz PentiumIVprocessorand512 MBDDRRAMrunning Win-dows XP. Allthepreprocessingwasdone usingMATLABn. Inpractice, the number ofclasses is not always known before-hand. There isno general theoretical solution to finding the optimalnumber of clusters for any given data set. We choose k= 5 forthe study. We will compare five classifiers which

havebeen also used indetectingthesefour types of attacks.

A. Data preprocessing

It wasnecessaryto ensure though, that the reduced dataset

was asrepresentative of the original set as possible. Table 1 *ast-"*0*0*4*8*Abooa&&*-i* *Rveooos *'Af***^bi

A 11 q 0 11

(4)

IEEEIndicon2005 Conference,Chennai,India, 11-13 Dec. 2005

shows the dataset after balanced among category for attack distribution over modified the normal and other attack cate-gories appear in [20]. Preprocessing consisted oftwo steps. The first step involved mapping symbolic-valued attributes to numeric-valued attributes and the second step imple-mented non-zero numerical features.

Table 1: Datasetforattack distribution'

Attack Category %Occurrence Number of records

Normal. 31.64 5,763

PROBE 11.88 2,164

DoS 19.38 3,530

U2R 0.38 70

R2L 36.72 6,689

100% 18,216

B.Featureextractionsystem

Afeature selection methodisproposed toselectasubset

ofvariables in data that preserves as much information

presentin thecompletedataaspossible.The feature subset

selectionproblemrefers the task ofidentifyingand

select-ingausefulsubset of attributestobe usedtorepresent

pat-terns from a larger set of often mutually redundant,

possibly irrelevant, attributes with different associated

measurement costsand/or risks[21] Thefeature selection

processisveryimportantwhichselectstheinformative

fea-tures for used classification process. The benefits and affects of feature subset selectioninclud

C. Intrusion DetectionwithClustering

Basedonthepractical assumption thatnormal instances

dominate attackinstances, oursimple

self-labeling

heuris-tic forunsupervised intrusion detection appear in [22]. Anomalydetectiorn amountstotrainingmodels for normal traffic behavior andthenclassifyingasintrusionsany

net-work behavior thatsignificantlydeviates from theknown. normalpatternsandto constructasetof clusters basedon

trainingdatatoclassifytestdatainstances.

The data setbeforeusing featureselection has 41

attrib-utesandafterusegeneticalgorithmsfor identifiessubset of

features; wegot 8 attributesas followsduration, service,

flag, srv_count, error_rate, dst_host_srv_count,

dst host diffsrvrateand dst host srv rerror rate.

VII. CONCLUSIONS

In this paper we apply genetic algorithm methods with

data reductionandtoidentifysubset offeatures for network

security andusing fuzzy c-meanstointrusion detection to

avoidaharddefinition betweennormal class and certain

in-trusionclass. Features selectionmethods aim atselectinga

small or prespecified number offeatures leading to the best pos-sible performance ofthe entire classifier. The task ofidentifying and selecting a useful subset offeatures to be used to represent patterns from a larger set ofoften mutually redundant or even ir-relevant features. Therefore, the main goal of feature subset se-lection is to reduce the number of features used in classification while maintaining acceptable classification accuracy.

Intrusion detection model is acompositive model that needs various theories and techniques. One or two models can hardly offer satisfying results. We plan to evaluation results and to ap-ply other theories and techniques in intrusion detection in our future work.

REFERENCES

[1] R. Bace and P.Mell,"Intrusion Detection Systems",NIST Special Publi-cations onIntrusionDetectionSystem.31November 2001.

[2] H. Jin, J. Sun, H. Chen and Z. Han, "A Fuzzy Data Mining Based Intrusion Detection System", Proc. 10"InternationalWorkshopon future Trends in

Distributed Computing Systems (FTDCS04) IEEE Computer Society, Suzhou, China, May 26-28, 2004, pp. 191-197.

[3] W. Lee, S.Stolfo and K.Mok,"Adata miningframeworkfor building

in-trusion detectionmodels", Proc 1999 IEEESymposiumon Security and Privacy, May 1999, pp. 120-132.

[4] B.Balajinath,S.V.Raghavan, "Intrusiondetection throughlearning

be-haviormodel", 2001.

[5] K. Burbeck and S.Nadjm-Tehrani,Adaptive Real-Time Anomaly Detec-tionwith ImprovedIndexand Ability to Forget, in Proc of the 25th IEEE

InternationalConference on Distributed Computing Systems Workshops, June 2005.

[6] D. Denning, "Anintrusion-detectionmodel," I IEEE Computer Society

Symposiumonresearchinsecurity and privacy, 1986, pp. 1 8-131.

[7] T.Lane,"Machine Learningtechniques forthecomputer Security", PhD thesis, PurdueUniversity, 2000.

[8] W. Leeand S.Stolfo,"Dataminingapproaches for intrusion detection,"

Proceedings of the7thUSENIXsecurity symposium, 1998.

[9] D. Daguptaand F.Gonzalez,"Animmunity-based Technique to charac-terizeintrusions in computer networks", IEEE Transn Evolutionary

Computation,Vol.6, June 2002, pp.28-291.

[10] R.T.Alves, M.R.B.S.Delgado,H.S.Lopes, A.A. Freitas, "An artificial immune system for fuzzy-rule induction in data mining", Lecture Notes in

Computer Science, Berlin: Springer-Verlag, v. 3242, 2004, pp.

1011-1020.

[11] J.Holland,"AdaptationinNatural andArtificial Systems", University of

MichiganPress,Michigan, 1975.

[12] M.Kudo,J.Sklansky, "Comparisonofalgorithmsthatselectfeaturesfor

pattern classifiers",PatternRecognition33(2000)25-41.

[13] L.Y.Zhai,L.P.Khoo,and S.C.Fok,"Featureextraction using roughset theory andgenetic algorithmsandapplicationfor the simplification of

product qualityevaluation", Computers&IndustrialEngineering,2002, pp. 661-676.

(5)

IEEEIndicon 2005

Conference,

Chennai India 11-13Dec. 2005

[14] G. Helmer, J.S.K. Wong, V. Honavar, and L. Miller, "Automated

dis-coveryof concise predictive rules for intrusion detection", Journal of

Systems and Software, 2002,pp.165-175.

[15] R.Duda and P. Hart, "Pattern Classification and SceneAnalysis", NY: WileyInterscience, 1973.

[16] J.Bezdek,"PatternRecognitionwithFuzzy Objective Function Algo-rithms", Plenum Press, USA, 1981.

[17] S. Albayrak,F."Amasyali", Fuzzyc-meansclusteringonMedical

Diagnostic Systems, International XII. Turkish Symposiumon

Artificial Intelligence and Neural Networks, TAINN 2003. [18] F. Godinez, D. Hutter, R. Monroy"AttributeReduction forEffective

IntrusionDetection". AWIC2004:74-83.

[19] KDD dataset,1999;

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

[20] W.Chimphlee,A.H.Abdullah, M.N. M. Sap and S. Chimphlee, "Unsu-pervised Clusteringmethods for Identifying Rare Events in Anomaly Detection", 6thInternation Enformatika Conference(IEC2005), Octo-ber 26-28, 2005, Budapest, Hungary.

[21] "Yang J. and Honavar V.," Feature Subset Selection UsingaGenetic

Algorithm. Invited chapter. In: Feature Extraction, Construction, and SubsetSelection: A Data Mining Perspective, Motoda, H. and Liu,H. (Ed.) New York: Kluwer, 1998.

[22] S. Zhong,T.Khoshgoftaar and N. Seliya, "Evaluating Clustering Tech-niques for NetworkIntrusion Detection", 10th ISSAT Int. Conf.on

Re-liability and Quality Design, 2004,pp. 149-155.

References

Related documents

The output characteristic (Fig. 6-1) for a water-gated PBTTT film is close to ideal, with very little hysteresis and a low threshold between 0V and 0.1V. The responses of PBTTT

Misuse intrusion detection works by comparing network traffic to known attack patterns called "signatures." Anomaly detection systems "learn," or are told,

As a solution, sound compensation programs help a company achieve the four main objectives of compensation: Attract, Retain, Focus, and Motivate. Designing

High speed change management is embedded in our organizational, process, and project management competence, allowing us to initiate change purposefully, logically and

The 7TMITETH driver is designed to use one IP address for each node or one IP address for multiple nodes and you must select one of the connection type options:.. 

On Fuzzy Intrusion Detection”, Proceedings of the Data Mining, Intrusion Detection, Information Assurance, And Data Networks Security”, SPIE, Vol. Abd-alhafez,

In this review, the research carried out using various ion-exchange resin-like adsorbents including modified clays, lignocellulosic biomasses, chitosan and its derivatives, microbial

THERMISTOR ( PIPE TEMP. ) FAN MOTOR CAPACITOR STEPPING MOTOR CONT RO L B OAR D PO W ER SUP PL Y BO ARD TRANS FO RME R BO AR D DI SP LA Y FAN MO TO R BLACK GRAY GRAY