IEEEIndicon2005 Conference, Chennai, India, 11-13 Dec.2005
Integrating
Genetic
Algorithms
and
Fuzzy
c-Means
for
Anomaly
Detection
WitchaiChimphlee',
Abdul HananAbdullah2, Mohd
Noor MdSap2
Siriporn
Chimphleel
and Surat
Srinoy'
Abstract - The goal of intrusion detection is to discover
unauthorized useof computer systems.Newintrusion types,of which detection systems areunaware,arethe mostdifficult to
detect. The amountof availablenetworkauditdatainstancesis
usually large; human labelingistedious,time-consuming, and
expensive.Traditional anomaly detection algorithms require a
setofpurelynormal data fromwhichtheytraintheir model.In
this paper we propose an intrusion detection method that
combines Fuzzy Clustering and Genetic Algorithms. Clustering-based intrusion detectionalgorithmwhich trainson
unlabeled data in order to detect new intrusions. Fuzzy
c-Means allow objects to belong to several clusters
simultaneously, with differentdegrees ofmembership.Genetic
Algorithms(GA)totheproblem oftelection ofoptimizedfeature subsets to reduce the error caused by using land-selected
features.Our method isabletodetect manydifferent types of
intrusions, whilemaintainingalow falsepositiverate.Weused data setfrom1999KDDintrusion-etection contest.
Keywords - Anomaly detection, Unsupervised clustering,
GeneticAlgorithms, Fuzzyc-means
I.INTRODUCTION
A
s defined in [1], intrusion detection is theprocess
of monitoringtheeventsoccurringinacomputer systemornetworkandanalyzingthem forsignsofintrusions.Itis also defined as attempts to compromise the confidentiality, integrity,availability,ortobypassthesecurity mechanisms of
acomputerornetwork Thegoal for handle intrusion detec-tionproblem istoclassifypatternsof the system behavior in
two categories (normal andabnormal).The amountof avail-able network audit data instances isusually large;human la-belingistedious,time-consuming, andexpensive. However,
notall the datacollectedareusefulorinformative. There are
twomainintrusion detection systems. Anomaly intrusion de-tection system isbased on the profiles ofnormalbehaviors of
users orapplicationsandchecks whether the system is being
1FacultyofScienceand
T'echnology,
Suan DusitRajabhatUniversity,295RajasrimaRoad,Dusit, Bangkok,Thailand. Tel:(662)-2445225,Fax:(662)-6687136
lwitcha_chi(@dusit.ac.th,4siripom [email protected],[email protected] 2Facultyof Computer Science and Information Systems
UniversityTechnologyofMalaysia, 81310Skudai, Johor,Malaysia Tel:(+607)-5532070,Fax:(+607) 5565044
[email protected], [email protected]
used inadifferentmanner
[2].
The secondoneiscalledmis-use intrusion detection systemwhich collects attack
signa-tures, compares abehavior with these attack signatures, and signalsintrusion when there isamatch.Generally, thereare
four categories of attacks [3]. Theyare: DoS (denial-of-ser-vice) R2L (Remote to Local) U2R (User to Root) , and PROBING.
IDScanbeclassified basedonthefunctional characteristics ofdetection methodsasknowledge based intrusion detection andbehavior based intrusion detection [4]. Thetaskofan in-trusiondetection systemis toprotect acomputer systemby detecting and diagnosing attempted breaches of theintegrity ofthesystem.Anomaly detection still faces manychallenges, whereoneof themostimportant istherelatively highrateof false alarms (false positives). The problem of capturing a
complex normality makes the highrateoffalse positives in-trinsictoanomalydetection except for simple problems[5].
Thepaperis structuredasfollows.InsectionIIpresentsthe
related works.Section III describesabriefintroduction of ge-netic algorithms. Explains fuzzy clustering in section IV.
Section V explains about experimental design. Section VI
evaluatesourintrusion detection modelthrough experiments. Finally, sectionVIIpresents ourconclusion and some discus-sion.
II. RELATEDWORKS
Featureselection has been traditionally used in data mining applicationsaspartofthedata cleaning and/or pre-processing
stepwhere the actualextraction andlearning ofknowledgeor
patterns is done aftera suitable set of features is extracted, Manyapproaches have been proposed whichinclude
statisti-cal[6], machineleaming [7], data mining [8] and
immunolog-ical inspired techniques [9]. Alves et al [10] presents a classification-rule discovery algorithm integrating artificial
immunesystems(AIS) and fuzzy systems. III.THEGENETIC
ALGORITHMS
Genetic algorithms (GAs) wereformally introduced in the
1970sbyJohnHolland[11]. In particular, genetic algorithms
areproblem-solvingteclniques based on the principles ofbi-ological evolution, natural selection andgenetic
iecombina-0-7803-9503-4/05/$20.00©2005IEEE
- India.11-13
tion. Potential solutions to the problem to be solved are encoded as sequences ofbits, characters or numbers. The unit ofencoding is called a gene, and the encoded sequence is calledachromosome. Each chromosomerepresents one pos-sible solution to the problem or a rule in a classification. GAs is able to select subsets ofvarious sizes in order to determine the optimum combination and number of inputs to network. This allows reducing the computational expense on the train-ing system with near optimal results still reachable. Research [12]hasshowthat GA is one of the most efficient of all fea-tureselection methods for dealing with feature sets containing large (>100) numbers of features.
Generationofarandompolviationofchromosomes Evaluationofthefitnexsvaluesof
_indivilchromosome
r lectXion ofchromavomnswithgodifunems
No
Reproduction ofnextgeneration of
chromosomes/population Generation
limitreached
FYlS
Fig.1.A
typical
GA program flow.The algorithm begins with a population ofchromosomes generatedeitherrandomlyorfromsomesetof known
speci-mens,andcycles
through
the three steps,namelyevaluation,
selection, andreproduction. A chromosomecontainsthe in-formation about the solution to aproblem, whichit repre-sents. Typically, itcanbe encodedusingabinary string as
follows[13]:
Chromosome1 1101100100110110
Chromosome2 1101111000011110
Fig. 1 showsageneraloutline of thealgorithm. Inthefirst step,that
is,
theevaluationstep, eachstring
isevaluatedac-cording to a
given
performance
criterion known asfitness
function,
andassigned afitness
score.Inthenextstep,these-lectionstep,adecision is made
according
tothe fitnessscoreassigned toeach individual to decidewhich individuals are
permittedtoproduce offspringand with whatprobability.
Fi-nally,the
reproduction
stepinvolves the creation ofoffspring
chromosomesbytwogenerators,
namely
cross-overandmu-tation. This is themost
important
partof thegenetic
algo-rithms as these genetic operators have an
impact
on the performanceofGAs.Fig.
2illustratedusingapair ofchromo-somes encodedas twobinary strings,
where the cross-site isdenotedby
"I"
ineach chromosome.Chromosome 1 Chromosome 2 Offspring I Offsnrini2
110111001001
101101lO1l111000011110 11011111000011110
11011100100110110 Fig.2. Pair of chromosomes encoded
For achromosome encoded as a binary string, genes are randomlyselected to undergo mutation operation, where '1' is changed into '0' or viceversa, in Fig.3.
Original
offspring1
11011.110000111 104
Mutatedoffspringl 1100111000011110 Originaloffspring2 1101100100110110
1
I
Mutated
offspring
2 11l0l
I001O
l10IO
Fig.3. Mutation operationInwhich a bit value of 1 in thechromosomerepresentation meansthat thecorresponding feature is included in the speci-fied subset, and a value of 0 indicates that the corresponding feature is notincluded in the subset.
The fitness of afeature subset is measured by the test accu-racy (orcross-validation accuracy ofthe classifierlearned us-ing thefeature subset) and any other criteria of interest [14].
IV. FuzzY
C-MEANS CLUSTERING
Fuzzy c-means (FCM) algorithm, also known as fuzzy ISODATA, was introduced by Bezdek [16] as extension to Dunn's [17] algorithmto generate fuzzy sets for every ob-served feature. FCM isan iterativealgorithm to find cluster centers (centroids) that minimize a dissimilarity function. Rather thatpartitioningthe datainto acollection of distinct setsby fuzzypartitioning,themembershipmatrix(U) is
ran-domlyinitializedaccordingtoEquation 1.
(1)
The dissimilarity fimctionwhich is used in FCM in given Equation
c c n
J(U,c1,c2,...cC)=
E
Ji
=Xunmdi2
i=l i=l j=l(2)
Uij is between0and 1; ciisthecentroid of clusteri;
dij
is theEuclidian distance betweenith
centroid(ci)andjh
datapoint;
m E
[1 ,ox]
is aweightingexponent.Toreachaminimum ofdissimilarityfunction therearetwo
conditions. Thesearegiven in(3)and(4).
576 IEEEIndicon 2005
Conference,;Chennai,
India,
II-13 Dec. 2005c
uu =L vi =15... ,n.
IEEE---Inicn- 00 Cnfrece Chennai,----1-India,--- 111 Dec- 200
577--Z
L;1Ui.
m
X.c.===1 Uij
5 zj=1IUU
uii 2
c (
d1
m-2:k=
1t dk;4
(3)
1
(4)
Detailed algorithm offuzzyc-meansproposedbyBezdek in 1973 [18] (Fig.4).
L
ML M MH
H.
Algorithm1. Fuzzyc-means
Step1: Randomly initializethemembership matrix (U)thathas
constraintsinEquation1.
Step 2: Calculate centroids(ci) byusingEquation3.
Step 3: Computedissimilarity betweencentroids and datapoints usingEquation 2.Stopifitsimprovementoverprevious
iterationis belowathreshold.
SteD 4: ComDuteanewUusinn Eauation4toto steo2.
Fig. 4. Fuzzyc-meansclustering.
Byiteratively updatingthe clustercentersand the
member-ship grades foreach datapoint, FCM iteratively movesthe
clustercenterstothe"right"location withinadataset. FCM
doesnotensurethatitconverges toanoptimal solution. Be-causeofclustercentersareinitializing usingU thatrandomly
initialized.(Equation 3)
Performance depends oninitial centroids.Forarobust
ap-proachtherearetwowayswhichisdescribed below[13].
1.) Usinganalgorithmtodetermineallof thecentroids.(for example: arithmeticmeansof alldatapoints)
2.)Run FCM several times eachstartingwithdifferent ini-tialcentroids.
A collection offuzzy sets, called fuzzy space, defines the
fuzzylinguisticvaluesorfuzzyclasses.Asamplefuzzyspace
of fivemembershipfunction isshown inFigure5.
V. EXPERIMENTAL DESIGN
Inourmethod have the threesteps(Figure 6). Firststep is
cleaningfor handlemissingandincompletedata.Secondstep
using genetic algorithmsforselect the best attribute and
fea-tureselection and the laststepforclusteringgroupof data us-ingfuzzyc-means.
Thepre-processormoduleperformsthefollowingtasks:
1.Identifiestheattributesandtheir value.
2.Convertcategoricaltonumerical data. 3. Data Normalization
4. Performsredundancycheck and handle aboutnullvalue. 5. Initializes all thenecessaryparameterssuchasthelength
ofachromosome, populationsize
Fig.5.Afuzzyspace of fivemembershipfunction
Fuzzy c-Means
Pre- Cleaned 00 0
Processor 0O
Selectedbest /
Incomplete data attribute Missing data
KDD-19 Normal Attacks
DataSet
Fig.6. Framework for detection
VI. EXPERIMENTAL SETUP AND RESULTS
Inthisexperiment,we usetheKDDCup 1999 intrusion de-tectioncontest[19].Thisdatabaseincludes awidevariety of intrusions simulated inamilitary network environment that is
a commonbenchmark for evaluation of intrusion detection techniques. The data set has41 attributes for each connection recordplus one class label. There are24attacktypes, but we
treatall of themas anattack group.Thenominal attributes are converted intolinear discrete values (integers). After elimi-natinglabels, the data set is described as a matrixX; which has
N rows andm-=41 columns (attributes). There are md=8 dis-crete-valueattributes and
m,
=33continuous-value attributes.We ran our experiments on a system with a 1.5 GHz PentiumIVprocessorand512 MBDDRRAMrunning Win-dows XP. Allthepreprocessingwasdone usingMATLABn. Inpractice, the number ofclasses is not always known before-hand. There isno general theoretical solution to finding the optimalnumber of clusters for any given data set. We choose k= 5 forthe study. We will compare five classifiers which
havebeen also used indetectingthesefour types of attacks.
A. Data preprocessing
It wasnecessaryto ensure though, that the reduced dataset
was asrepresentative of the original set as possible. Table 1 *ast-"*0*0*4*8*Abooa&&*-i* *Rveooos *'Af***^bi
A 11 q 0 11
IEEEIndicon2005 Conference,Chennai,India, 11-13 Dec. 2005
shows the dataset after balanced among category for attack distribution over modified the normal and other attack cate-gories appear in [20]. Preprocessing consisted oftwo steps. The first step involved mapping symbolic-valued attributes to numeric-valued attributes and the second step imple-mented non-zero numerical features.
Table 1: Datasetforattack distribution'
Attack Category %Occurrence Number of records
Normal. 31.64 5,763
PROBE 11.88 2,164
DoS 19.38 3,530
U2R 0.38 70
R2L 36.72 6,689
100% 18,216
B.Featureextractionsystem
Afeature selection methodisproposed toselectasubset
ofvariables in data that preserves as much information
presentin thecompletedataaspossible.The feature subset
selectionproblemrefers the task ofidentifyingand
select-ingausefulsubset of attributestobe usedtorepresent
pat-terns from a larger set of often mutually redundant,
possibly irrelevant, attributes with different associated
measurement costsand/or risks[21] Thefeature selection
processisveryimportantwhichselectstheinformative
fea-tures for used classification process. The benefits and affects of feature subset selectioninclud
C. Intrusion DetectionwithClustering
Basedonthepractical assumption thatnormal instances
dominate attackinstances, oursimple
self-labeling
heuris-tic forunsupervised intrusion detection appear in [22]. Anomalydetectiorn amountstotrainingmodels for normal traffic behavior andthenclassifyingasintrusionsany
net-work behavior thatsignificantlydeviates from theknown. normalpatternsandto constructasetof clusters basedon
trainingdatatoclassifytestdatainstances.
The data setbeforeusing featureselection has 41
attrib-utesandafterusegeneticalgorithmsfor identifiessubset of
features; wegot 8 attributesas followsduration, service,
flag, srv_count, error_rate, dst_host_srv_count,
dst host diffsrvrateand dst host srv rerror rate.
VII. CONCLUSIONS
In this paper we apply genetic algorithm methods with
data reductionandtoidentifysubset offeatures for network
security andusing fuzzy c-meanstointrusion detection to
avoidaharddefinition betweennormal class and certain
in-trusionclass. Features selectionmethods aim atselectinga
small or prespecified number offeatures leading to the best pos-sible performance ofthe entire classifier. The task ofidentifying and selecting a useful subset offeatures to be used to represent patterns from a larger set ofoften mutually redundant or even ir-relevant features. Therefore, the main goal of feature subset se-lection is to reduce the number of features used in classification while maintaining acceptable classification accuracy.
Intrusion detection model is acompositive model that needs various theories and techniques. One or two models can hardly offer satisfying results. We plan to evaluation results and to ap-ply other theories and techniques in intrusion detection in our future work.
REFERENCES
[1] R. Bace and P.Mell,"Intrusion Detection Systems",NIST Special Publi-cations onIntrusionDetectionSystem.31November 2001.
[2] H. Jin, J. Sun, H. Chen and Z. Han, "A Fuzzy Data Mining Based Intrusion Detection System", Proc. 10"InternationalWorkshopon future Trends in
Distributed Computing Systems (FTDCS04) IEEE Computer Society, Suzhou, China, May 26-28, 2004, pp. 191-197.
[3] W. Lee, S.Stolfo and K.Mok,"Adata miningframeworkfor building
in-trusion detectionmodels", Proc 1999 IEEESymposiumon Security and Privacy, May 1999, pp. 120-132.
[4] B.Balajinath,S.V.Raghavan, "Intrusiondetection throughlearning
be-haviormodel", 2001.
[5] K. Burbeck and S.Nadjm-Tehrani,Adaptive Real-Time Anomaly Detec-tionwith ImprovedIndexand Ability to Forget, in Proc of the 25th IEEE
InternationalConference on Distributed Computing Systems Workshops, June 2005.
[6] D. Denning, "Anintrusion-detectionmodel," I IEEE Computer Society
Symposiumonresearchinsecurity and privacy, 1986, pp. 1 8-131.
[7] T.Lane,"Machine Learningtechniques forthecomputer Security", PhD thesis, PurdueUniversity, 2000.
[8] W. Leeand S.Stolfo,"Dataminingapproaches for intrusion detection,"
Proceedings of the7thUSENIXsecurity symposium, 1998.
[9] D. Daguptaand F.Gonzalez,"Animmunity-based Technique to charac-terizeintrusions in computer networks", IEEE Transn Evolutionary
Computation,Vol.6, June 2002, pp.28-291.
[10] R.T.Alves, M.R.B.S.Delgado,H.S.Lopes, A.A. Freitas, "An artificial immune system for fuzzy-rule induction in data mining", Lecture Notes in
Computer Science, Berlin: Springer-Verlag, v. 3242, 2004, pp.
1011-1020.
[11] J.Holland,"AdaptationinNatural andArtificial Systems", University of
MichiganPress,Michigan, 1975.
[12] M.Kudo,J.Sklansky, "Comparisonofalgorithmsthatselectfeaturesfor
pattern classifiers",PatternRecognition33(2000)25-41.
[13] L.Y.Zhai,L.P.Khoo,and S.C.Fok,"Featureextraction using roughset theory andgenetic algorithmsandapplicationfor the simplification of
product qualityevaluation", Computers&IndustrialEngineering,2002, pp. 661-676.
IEEEIndicon 2005
Conference,
Chennai India 11-13Dec. 2005[14] G. Helmer, J.S.K. Wong, V. Honavar, and L. Miller, "Automated
dis-coveryof concise predictive rules for intrusion detection", Journal of
Systems and Software, 2002,pp.165-175.
[15] R.Duda and P. Hart, "Pattern Classification and SceneAnalysis", NY: WileyInterscience, 1973.
[16] J.Bezdek,"PatternRecognitionwithFuzzy Objective Function Algo-rithms", Plenum Press, USA, 1981.
[17] S. Albayrak,F."Amasyali", Fuzzyc-meansclusteringonMedical
Diagnostic Systems, International XII. Turkish Symposiumon
Artificial Intelligence and Neural Networks, TAINN 2003. [18] F. Godinez, D. Hutter, R. Monroy"AttributeReduction forEffective
IntrusionDetection". AWIC2004:74-83.
[19] KDD dataset,1999;
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[20] W.Chimphlee,A.H.Abdullah, M.N. M. Sap and S. Chimphlee, "Unsu-pervised Clusteringmethods for Identifying Rare Events in Anomaly Detection", 6thInternation Enformatika Conference(IEC2005), Octo-ber 26-28, 2005, Budapest, Hungary.
[21] "Yang J. and Honavar V.," Feature Subset Selection UsingaGenetic
Algorithm. Invited chapter. In: Feature Extraction, Construction, and SubsetSelection: A Data Mining Perspective, Motoda, H. and Liu,H. (Ed.) New York: Kluwer, 1998.
[22] S. Zhong,T.Khoshgoftaar and N. Seliya, "Evaluating Clustering Tech-niques for NetworkIntrusion Detection", 10th ISSAT Int. Conf.on
Re-liability and Quality Design, 2004,pp. 149-155.