IEEE Indicon 2005 Conference, Chennai, India, 11-13 Dec. 2005
Abdul Hanan Abdullah2, Mohd Noor Md Sap2 and Surat
Abstract - The goal of intrusion detection is to discover unauthorized use of computer systems. New intrusion types, of which detection systems are unaware, are the most difficult to detect. The amount of available network audit data instances is usually large; human labeling is tedious, time-consuming, and expensive. Traditional anomaly detection algorithms require a set of purely normal data from which they train their model. In this paper we propose an intrusion detection method that combines Fuzzy Clustering and Genetic Algorithms. The clustering-based intrusion detection algorithm trains on unlabeled data in order to detect new intrusions. Fuzzy c-Means allows objects to belong to several clusters simultaneously, with different degrees of membership. Genetic Algorithms (GA) are applied to the problem of selecting optimized feature subsets, to reduce the error caused by using hand-selected features. Our method is able to detect many different types of intrusions while maintaining a low false positive rate. We used the data set from the 1999 KDD intrusion detection contest.
Keywords - Anomaly detection, Unsupervised clustering, Genetic Algorithms, Fuzzy c-means
As defined in [1], intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions. Intrusions are also defined as attempts to compromise the confidentiality, integrity, or availability, or to bypass the security mechanisms, of a computer or network. The goal in handling the intrusion detection problem is to classify patterns of system behavior into two categories (normal and abnormal). The amount of available network audit data instances is usually large; human labeling is tedious, time-consuming, and expensive. However, not all the data collected are useful or informative. There are two main kinds of intrusion detection systems. An anomaly intrusion detection system is based on profiles of the normal behavior of users or applications and checks whether the system is being used in a different manner. The second kind is the misuse intrusion detection system, which collects attack signatures, compares a behavior with these attack signatures, and signals an intrusion when there is a match. Generally, there are four categories of attacks [3]: DoS (denial-of-service), R2L (Remote to Local), U2R (User to Root), and PROBING.

IDS can be classified, based on the functional characteristics of the detection methods, as knowledge-based intrusion detection and behavior-based intrusion detection [4]. The task of an intrusion detection system is to protect a computer system by detecting and diagnosing attempted breaches of the integrity of the system. Anomaly detection still faces many challenges, one of the most important being the relatively high rate of false alarms (false positives). The problem of capturing a complex normality makes the high rate of false positives intrinsic to anomaly detection, except for simple problems [5].
The paper is structured as follows. Section II presents the related work. Section III gives a brief introduction to genetic algorithms. Section IV explains fuzzy clustering. Section V describes the experimental design. Section VI evaluates our intrusion detection model through experiments. Finally, Section VII presents our conclusions and some discussion.
Feature selection has traditionally been used in data mining applications as part of the data cleaning and/or pre-processing step, where the actual extraction and learning of knowledge or patterns is done after a suitable set of features is extracted. Many approaches have been proposed, which include statistical [6], machine learning [7], data mining [8] and immunologically inspired techniques [9]. Alves et al. [10] present a classification-rule discovery algorithm integrating artificial immune systems (AIS) and fuzzy systems.

III. THE GENETIC
Genetic algorithms (GAs) were formally introduced in the 1970s by John Holland [11]. In particular, genetic algorithms are problem-solving techniques based on the principles of biological evolution, natural selection and genetic recombination. Potential solutions to the problem to be solved are encoded as sequences of bits, characters or numbers. The unit of encoding is called a gene, and the encoded sequence is called a chromosome. Each chromosome represents one possible solution to the problem or a rule in a classification. GAs are able to select subsets of various sizes in order to determine the optimum combination and number of inputs to the network. This reduces the computational expense of training the system while near-optimal results remain reachable. Research [12] has shown that GA is one of the most efficient of all feature selection methods for dealing with feature sets containing large (>100) numbers of features.
Fig. 1. GA program flow: generation of a random population of chromosomes; evaluation of fitness values; selection of chromosomes with good fitness; reproduction of the next generation of chromosomes.

The algorithm begins with a population of chromosomes generated either randomly or from some known set, and then proceeds through three steps, namely evaluation, selection, and reproduction. A chromosome contains the information about the solution to a problem, which it represents. Typically, it can be encoded using a binary string as follows:

Chromosome 1: 1101100100110110
Chromosome 2: 1101111000011110
Fig. 1 shows a general outline of the algorithm. In the first step, the evaluation step, each string is evaluated according to a criterion known as fitness and assigned a fitness score. In the next step, the selection step, a decision is made according to the fitness score assigned to each individual to decide which individuals are permitted to produce offspring and with what probability. The final step involves the creation of offspring through crossover and mutation. This is the most important part of genetic algorithms, as these genetic operators have an impact on the performance of GAs. Fig. 2 illustrates crossover using a pair of chromosomes encoded as two binary strings and exchanged at a cross-site.
Fig. 2. Pair of chromosomes encoded as binary strings (Chromosome 1, Chromosome 2, Offspring 1, Offspring 2).
For a chromosome encoded as a binary string, genes are randomly selected to undergo the mutation operation, where '1' is changed into '0' or vice versa, as shown in Fig. 3.

Fig. 3. Mutation operation (original and mutated offspring; for example, mutated offspring 1 is 1100111000011110 and original offspring 2 is 1101100100110110).

In the chromosome representation, a bit value of 1 means that the corresponding feature is included in the specified subset, and a value of 0 indicates that the corresponding feature is not included in the subset.

The fitness of a feature subset is measured by the test accuracy (or cross-validation accuracy) of the classifier learned using the feature subset, and by any other criteria of interest [14].
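The evaluation, selection, crossover, and mutation steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the GA parameters (population size, mutation rate, number of generations), and the fitness-proportional selection scheme are choices made here. In particular, `toy_fitness` is a hypothetical stand-in for the classifier accuracy used in the paper, so that the example runs on its own.

```python
import random

def crossover(parent1, parent2, site):
    """Single-point crossover: swap the tails of two bit strings at `site`."""
    return parent1[:site] + parent2[site:], parent2[:site] + parent1[site:]

def mutate(chromosome, rate, rng):
    """Flip each bit ('1' <-> '0') independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )

def selected_features(chromosome):
    """Indices of features whose bit is 1 (feature included in the subset)."""
    return [i for i, bit in enumerate(chromosome) if bit == "1"]

def toy_fitness(chromosome, useful=frozenset({0, 3, 5})):
    """Hypothetical fitness (assumption): reward a made-up set of useful
    features and lightly penalize subset size. The paper uses classifier
    accuracy instead."""
    chosen = set(selected_features(chromosome))
    return len(chosen & useful) - 0.1 * len(chosen)

def run_ga(fitness=toy_fitness, n_features=16, pop_size=20,
           generations=30, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Generation of a random population of chromosomes.
    population = ["".join(rng.choice("01") for _ in range(n_features))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Evaluation: assign a fitness score to each chromosome.
        scores = [fitness(c) for c in population]
        # Selection: fitness-proportional, after shifting scores to be positive.
        low = min(scores)
        weights = [s - low + 1e-6 for s in scores]
        # Reproduction: crossover followed by mutation.
        next_population = []
        while len(next_population) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            site = rng.randrange(1, n_features)
            for child in crossover(p1, p2, site):
                next_population.append(mutate(child, mutation_rate, rng))
        population = next_population[:pop_size]
        best = max(population + [best], key=fitness)
    return best
```

Calling `selected_features(run_ga())` returns the feature indices encoded by the best chromosome found. Replacing `toy_fitness` with a function that trains and scores a classifier on the selected columns would bring the sketch closer to the fitness measure described above.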
IV. FUZZY
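The fuzzy c-means (FCM) algorithm, also known as fuzzy ISODATA, was introduced by Bezdek [16] as an extension to Dunn's [17] algorithm to generate fuzzy sets for every observed feature. FCM is an iterative algorithm that finds cluster centers (centroids) that minimize a dissimilarity function. Rather than partitioning the data into a collection of distinct sets, FCM uses fuzzy partitioning: the membership matrix (U) is randomly initialized according to Equation 1.

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \forall j = 1, \ldots, n \tag{1}$$

The dissimilarity function used in FCM is given in Equation 2.

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2} \tag{2}$$

Here $u_{ij}$ is between 0 and 1; $c_i$ is the centroid of cluster $i$; $d_{ij}$ is the Euclidean distance between the $i$th centroid ($c_i$) and the $j$th data point ($x_j$); and $m \in [1, \infty)$ is a weighting exponent. To reach a minimum of the dissimilarity function there are two conditions. These are given in (3) and (4).

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \tag{3}$$

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}} \tag{4}$$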
The detailed fuzzy c-means algorithm proposed by Bezdek in 1973 [18] is given in Fig. 4.

Algorithm 1. Fuzzy c-means
Step 1: Randomly initialize the membership matrix (U) that satisfies the constraint in Equation 1.
Step 2: Calculate centroids (ci) by using Equation 3.
Step 3: Compute the dissimilarity between centroids and data points using Equation 2. Stop if its improvement over the previous iteration is below a threshold.
Step 4: Compute a new U using Equation 4. Go to Step 2.

Fig. 4. Fuzzy c-means clustering.
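The four steps of Algorithm 1 can be sketched in Python with NumPy as follows. This is an illustrative sketch rather than the authors' MATLAB code: the function name `fuzzy_c_means`, the default weighting exponent `m=2.0`, the stopping threshold `tol`, the iteration cap `max_iter`, and the small floor on distances (to avoid division by zero) are all assumptions made here. The sketch requires `m > 1`.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Cluster the rows of X (n x p) into c fuzzy clusters.

    Returns (centroids, U, J): centroids is c x p, U is the c x n
    membership matrix, and J is the last dissimilarity value computed.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: random U whose columns sum to 1 (Equation 1).
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    prev_J = np.inf
    for _ in range(max_iter):
        # Step 2: centroids as membership-weighted means (Equation 3).
        Um = U ** m
        centroids = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Step 3: dissimilarity between centroids and data points (Equation 2).
        d = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)  # assumption: floor distances to avoid dividing by zero
        J = float((Um * d ** 2).sum())
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        # Step 4: new memberships (Equation 4); ratio[i, k, j] = d_ij / d_kj.
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)
    return centroids, U, J
```

Because the result depends on the random initialization of U, calling the function with several `seed` values and keeping the run with the smallest `J` is one simple way to reduce that sensitivity.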
By iteratively updating the cluster centers and the membership grades for each data point, FCM moves the cluster centers to the "right" location within a data set. FCM does not ensure that it converges to an optimal solution, because the cluster centers are initialized using U, which is itself randomly initialized (Equation 3).

Performance depends on the initial centroids. For a robust approach there are two ways, which are described below [13].
1.) Use an algorithm to determine all of the centroids (for example, arithmetic means of all data points).
2.) Run FCM several times, each time starting with different initial centroids.
A collection of fuzzy sets is called a fuzzy space. A fuzzy space of five membership functions is shown in Figure 5.
Our method has three steps (Figure 6). The first step is cleaning, to handle missing and incomplete data. The second step uses genetic algorithms for feature selection, to select the best attributes. The last step clusters the data using fuzzy c-means.
1. Identify the attributes and their values.
2. Convert categorical data to numerical data.
3. Normalize the data.
4. Perform a redundancy check and handle null values.
5. Initialize all the necessary parameters, such as the length of a chromosome and the population size.
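Steps 2 to 4 of the list above can be sketched in plain Python as follows. The helper names (`clean`, `encode_categorical`, `min_max_normalize`), the choice of min-max scaling for normalization, and the sample records are assumptions made for illustration; the paper does not specify them. The sketch drops null and duplicate records first, because the other two helpers assume complete rows.

```python
def clean(records):
    """Drop records that contain a null value or exactly repeat an earlier one."""
    seen, out = set(), []
    for row in records:
        key = tuple(row)
        if None in row or key in seen:
            continue
        seen.add(key)
        out.append(list(row))
    return out

def encode_categorical(records, column):
    """Replace each distinct symbolic value in `column` with an integer code.

    Codes are assigned in order of first appearance; the mapping is returned.
    """
    codes = {}
    for row in records:
        row[column] = codes.setdefault(row[column], len(codes))
    return codes

def min_max_normalize(records, column):
    """Rescale a numeric column to [0, 1] in place (constant columns become 0)."""
    values = [row[column] for row in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    for row in records:
        row[column] = 0.0 if span == 0 else (row[column] - lo) / span

# Hypothetical connection records: duration, protocol, service, bytes.
raw = [
    [0, "tcp", "http", 181],
    [0, "tcp", "http", 181],    # exact duplicate, removed by clean()
    [2, "udp", None, 105],      # null value, removed by clean()
    [5, "icmp", "ecr_i", 1032],
]
records = clean(raw)
protocol_codes = encode_categorical(records, 1)
service_codes = encode_categorical(records, 2)
min_max_normalize(records, 0)
min_max_normalize(records, 3)
```

After these calls every column of `records` is numeric, which is the form the genetic algorithm and fuzzy c-means steps operate on.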
Fig. 5. A fuzzy space of five membership functions.

Fig. 6. Framework for detection (KDD data with incomplete and missing data, pre-processor, cleaned data, selected best attributes, fuzzy c-means, normal and attack outputs).
In this experiment, we use the data from the KDD Cup 1999 intrusion detection contest [19]. This database includes a wide variety of intrusions simulated in a military network environment and is a common benchmark for the evaluation of intrusion detection techniques. The data set has 41 attributes for each connection record plus one class label. There are 24 attack types, but we treat all of them as an attack group. The nominal attributes are converted into linear discrete values (integers). After eliminating labels, the data set is described as a matrix X, which has N rows and m = 41 columns (attributes). There are md = 8 discrete-value attributes and 33 continuous-value attributes. We ran our experiments on a system with a 1.5 GHz Pentium IV processor and 512 MB DDR RAM running Windows XP. All the preprocessing was done using MATLAB. In practice, the number of classes is not always known beforehand. There is no general theoretical solution to finding the optimal number of clusters for any given data set. We choose k = 5 for the study. We will compare five classifiers which have also been used in detecting these four types of attacks.
A. Data preprocessing
It was necessary to ensure, though, that the reduced dataset was balanced among the attack categories. Table 1 shows the attack distribution of the balanced dataset; the modified distribution over the normal and other attack categories appears in [20]. Preprocessing consisted of two steps. The first step involved mapping symbolic-valued attributes to numeric-valued attributes, and the second step implemented non-zero numerical features.
Table 1: Dataset for attack distribution

Attack Category    % Occurrence    Number of records
Normal             31.64           5,763
PROBE              11.88           2,164
DoS                19.38           3,530
U2R                0.38            70
R2L                36.72           6,689
Total              100             18,216
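The percentages in Table 1 follow directly from the record counts, which the short Python check below recomputes. The variable names are introduced here for illustration only.

```python
# Record counts per category, copied from Table 1.
counts = {"Normal": 5763, "PROBE": 2164, "DoS": 3530, "U2R": 70, "R2L": 6689}

total = sum(counts.values())
# Share of each category as a percentage, rounded to two decimals as in the table.
shares = {name: round(100.0 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<8}{share:>7.2f}%{counts[name]:>8,}")
```

The counts sum to the 18,216 records reported in the table, and each recomputed share agrees with the listed percentage to two decimal places.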
A feature selection method is proposed to select a subset of variables that preserves as much of the information present in the complete data as possible. The feature subset selection problem refers to the task of identifying and selecting a useful subset of attributes to be used to represent patterns from a larger set of often mutually redundant, possibly irrelevant, attributes with different associated measurement costs and/or risks [21]. Feature selection provides the features used in the classification process. The benefits and effects of feature subset selection include
C. Intrusion Detection with Clustering
Based on the practical assumption that normal instances dominate attack instances, we use the simple heuristic for unsupervised intrusion detection that appears in [22]. Anomaly detection amounts to training models for normal traffic behavior and then classifying as intrusions any network behavior that significantly deviates from the known normal patterns. To this end, a set of clusters is constructed from the data.
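One way to turn fuzzy memberships into a normal/attack decision under the assumption that normal instances dominate is sketched below. The rule used here (assign each record to its highest-membership cluster and treat the most populated cluster as normal) is an assumption made for illustration; the exact heuristic is the one given in [22], which this sketch does not claim to reproduce. The function name `label_by_cluster_size` is likewise hypothetical.

```python
import numpy as np

def label_by_cluster_size(U):
    """Flag records outside the most populated cluster as attacks.

    U is a c x n fuzzy membership matrix. Returns (hard, is_attack), where
    hard[j] is the cluster with the highest membership for record j and
    is_attack[j] is True when that cluster is not the largest one.
    Ties in cluster size are resolved in favor of the lowest cluster index.
    """
    hard = np.argmax(U, axis=0)
    sizes = np.bincount(hard, minlength=U.shape[0])
    normal_cluster = int(np.argmax(sizes))
    return hard, hard != normal_cluster

# Hypothetical memberships for four records and two clusters.
U_example = np.array([[0.9, 0.8, 0.7, 0.2],
                      [0.1, 0.2, 0.3, 0.8]])
hard_example, attack_example = label_by_cluster_size(U_example)
```

With more than two clusters, a threshold on cluster size rather than a single "largest" cluster may be a more natural choice; that refinement is left out of the sketch.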
The data set has 41 attributes before feature selection. After using genetic algorithms to identify a subset of features, we obtained 8 attributes: duration, service, flag, srv_count, error_rate, dst_host_srv_count, dst_host_diff_srv_rate and dst_host_srv_rerror_rate.
In this paper we apply genetic algorithm methods for data reduction and to identify a subset of features for network security, and we use fuzzy c-means for intrusion detection to avoid a hard boundary between the normal class and a given intrusion class. Feature selection methods aim at selecting a small or prespecified number of features leading to the best possible performance of the entire classifier. The task is to identify and select a useful subset of features to be used to represent patterns from a larger set of often mutually redundant or even irrelevant features. Therefore, the main goal of feature subset selection is to reduce the number of features used in classification while maintaining acceptable classification accuracy.
An intrusion detection model is a composite model that needs various theories and techniques. One or two models can hardly offer satisfying results. We plan to evaluate the results and to apply other theories and techniques to intrusion detection in our future work.
