4.3 Genetic Algorithm
4.3.4 Genetic Algorithm for Multi-Objective Optimization
The main idea in MO problems is to find the global Pareto-optimal front.
Of course, this cannot be guaranteed, but the algorithms designed for this
problem must have two properties: generating solutions along other
Pareto-optimal fronts and finding new fronts. In the reproduction phase, it is
common to generate solutions that are dominated by other individuals in the
population. These have to be discarded since they are of no interest. The
algorithm used for multi-objective optimization in the fixed scaling problem is
the Elitist Non-Dominated Sorting Genetic Algorithm (NSGA-II) proposed
in [34].
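The notion of a dominated solution can be made concrete with a minimal sketch. It assumes that all objectives are minimized; the function names and the objective values below are hypothetical and chosen only for illustration, not taken from the experiments.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical two-objective values for five individuals.
population = [(0.30, 5), (0.25, 7), (0.40, 3), (0.35, 6), (0.25, 9)]
front = non_dominated(population)
print(front)  # [0, 1, 2]
```

Individuals 3 and 4 are dominated by individuals 0 and 1 respectively, so they would be discarded, while the remaining three form the current non-dominated front.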
The NSGA-II algorithm works in two basic steps: sorting of solutions based on
dominance, and elitist selection to keep the best fronts encountered. Since
the population size does not change, the last front for inclusion has to be
divided in two parts. The division is done to keep the most diverse solutions
for the next generations. This diversity in NSGA-II is calculated using the
crowding distance, which measures the distance between solutions in the
objective function space. With this approach, the algorithm excludes the
sharing parameter, which is responsible for calculating the proximity between
population members and has to be defined by the user. Sorting of solutions is
done by careful book-keeping in order to speed up the execution time. For
details refer to [34]. The overall complexity of the algorithm is $O(mp^2)$,
where $m$ is the number of objectives (in our case $m = 2$) and $p$ is the
size of the population. One thing worth mentioning is the constant factor in
the mentioned asymptotic running time. The algorithm works by sorting on the
set of both the current population and the offspring, doubling the size of the
set. With this taken into account, the more precise complexity is
$O(m(2p)^2)$.
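The two steps can be sketched as follows. This is a simplified illustration, not the implementation of [34]: the fronts are found by repeatedly peeling off the non-dominated points instead of the book-keeping scheme, all objectives are assumed to be minimized, and the function names and the data are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sort_into_fronts(points: List[Point]) -> List[List[int]]:
    """Peel off successive non-dominated fronts (no book-keeping)."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


def crowding_distance(points: List[Point], front: List[int]) -> Dict[int, float]:
    """Sum over objectives of the normalized gap between each point's
    two neighbours; boundary points get an infinite distance."""
    dist = {i: 0.0 for i in front}
    for k in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][k])
        lo, hi = points[ordered[0]][k], points[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = math.inf
        if hi == lo:
            continue  # all values equal: no spread in this objective
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][k] - points[prev][k]) / (hi - lo)
    return dist


def select_next(points: List[Point], p: int) -> List[int]:
    """Fill the next population front by front; split the last front
    that does not fit by keeping its most isolated members."""
    chosen: List[int] = []
    for front in sort_into_fronts(points):
        if len(chosen) + len(front) <= p:
            chosen.extend(front)
        else:
            dist = crowding_distance(points, front)
            front.sort(key=lambda i: dist[i], reverse=True)
            chosen.extend(front[:p - len(chosen)])
            break
    return chosen


# Hypothetical combined set of parents and offspring (2p = 6, p = 3).
combined = [(1, 5), (2, 3), (4, 1), (2.1, 2.9), (3, 4), (5, 5)]
print(sort_into_fronts(combined))        # [[0, 1, 2, 3], [4], [5]]
print(sorted(select_next(combined, 3)))  # [0, 2, 3]
```

Here the first front has four members but only three slots are available, so it is the front that gets divided: the two boundary points are kept, and of the two close interior points the one with the larger crowding distance survives. The selection operates on all $2p$ individuals, which is where the factor $(2p)^2$ in the complexity comes from.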
Experiments: Search Algorithms
The experiments were carried out on a variety of computer architectures and
different setups, but MATLAB was used as the main environment to run
the experiments. In this chapter we compare the performance of the search
algorithms explained in Chapter 4 on several regression data sets.
A number of data sets with varying numbers of samples and dimensionality
were used to test the quality of the composition of Delta Test with the three
mentioned search algorithms. The following data sets were used for comparing
the performance of FBS, TS and GA, with Table 5.1 summarizing the
sizes of all data sets.
1. Housing data set [67]: The housing data set is related to the estimation
of housing values in suburbs of Boston. The value to predict is the
median value of owner-occupied homes in $1000's. The data set contains
506 instances, with 13 input variables and one output.
2. Tecator data set [68]: The Tecator data set aims at performing the
task of predicting the fat content of a meat sample on the basis of its
near infrared absorbance spectrum. The data set contains 215 useful
instances for interpolation problems, with 100 input channels, 22
principal components (which remain unused) and 3 outputs, although only
one is going to be used (fat content).
3. Anthrokids data set [69]: This data set represents the results of a
three-year study on 3900 infants and children representative of the U.S.
population of year 1977, ranging in age from newborn to 12 years of age.
The data set comprises 121 variables with the weight of a child being
[...] prior sample and variable discrimination had to be performed to build
a robust and reliable data set. The final set without missing values
contains 1019 instances, 53 input variables and one output (weight).
More information on this data set reduction methodology can be found
in [63].
4. The Santa Fe time series competition data set [70]: The Santa Fe data
set is a time series recorded from laboratory measurements of a
Far-Infrared-Laser in a chaotic state, and proposed for a time series
competition in 1994. The set contains 1000 samples, and it was reshaped for
its application to time series prediction using regressors of 12 samples.
Thus, the set used in this work contains 987 instances, 12 inputs and
one output.
5. ESTSP 2007 competition data set [71]: This time series was proposed
for the European Symposium on Time Series Prediction 2007. It is a
univariate set containing 875 samples, while the regressor size for this
series varied for different sets of experiments as explained in this chapter
and the next one.
Dataset          Samples   Input variables
Boston Housing   506       13
Anthrokids       1019      53
Tecator          215       100
Santa Fe         987       12
ESTSP 2007       819       55
Table 5.1: Data sets used for testing the performance of search algorithms.
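The reshaping of a univariate time series into a regression problem, as done for the Santa Fe and ESTSP 2007 series, can be illustrated with a minimal sketch. It assumes the common one-step-ahead convention, in which each row holds d consecutive samples and the output is the sample that follows them; the function name and the toy series are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_regressor(series: Sequence[float],
                   d: int) -> Tuple[List[List[float]], List[float]]:
    """Build input rows of d consecutive samples, each paired with
    the sample that immediately follows it (one-step-ahead target)."""
    n_rows = len(series) - d
    inputs = [list(series[t:t + d]) for t in range(n_rows)]
    outputs = [series[t + d] for t in range(n_rows)]
    return inputs, outputs


# Toy series of six samples with a regressor size of three.
series = [0.1, 0.4, 0.9, 1.6, 2.5, 3.6]
X, y = make_regressor(series, d=3)
for row, target in zip(X, y):
    print(row, "->", target)
```

Under this convention a series of N samples yields N - d instances; the exact instance counts in Table 5.1 depend on the convention actually used for the experiments, which may differ slightly from this sketch.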
5.1 Approximate Nearest Neighbor Influence
First we show the importance of using faster nearest neighbor search when
optimizing the DT. Table 5.2 shows the average running times for the Genetic
Algorithm for the Santa Fe, ESTSP 2007 and Anthrokids data sets. As can be
seen, the computational savings from using the underlying data structure in ANN
are substantial, with an improvement of 80% for Santa Fe and roughly 90% for
Data set      Naive search   Approximate k-NN
Santa Fe      620            124
ESTSP 2007    2573           283
Anthrokids    2938           314
Table 5.2: Average running time in seconds for DT optimization using naive
NN approach and approximate