to test the influence of the pbest or gbest updating mechanism in PSO for feature selection.
3.2.6
Combination of New Initialisation and Updating Mechanisms
To further investigate and improve the performance of PSO for feature selection, it is also necessary to test the performance of PSO using both a new initialisation strategy and a new pbest and gbest updating mechanism. Therefore, a new algorithm named PSOIniPG is formed by combining the mixed initialisation with the pbest and gbest updating mechanism that treats the classification performance as the first priority. The reason is that the mixed initialisation is proposed to utilise the advantages and avoid the disadvantages of both forward selection and backward selection. Considering the classification performance as the first priority will reduce the number of features without reducing the classification performance, which may even increase the classification performance on the unseen test set because of the removal of redundancy. Therefore, PSOIniPG is expected to simultaneously increase the classification performance and reduce the number of features.
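The idea behind the mixed initialisation can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the share of particles that start small (`small_share`), the size of the small subsets (`small_fraction`), the lower bound of the large subsets (more than half of the features) and the use of 0/1 masks instead of continuous positions are all assumptions made only for illustration.

```python
import random

def mixed_initialisation(n_particles, n_features, small_share=2/3,
                         small_fraction=0.1, seed=0):
    """Return one 0/1 feature mask per particle.

    All default values are illustrative assumptions, not the thesis settings.
    """
    rng = random.Random(seed)
    n_small = round(n_particles * small_share)
    swarm = []
    for i in range(n_particles):
        if i < n_small:
            # Forward-selection flavour: start from a small feature subset.
            size = max(1, round(small_fraction * n_features))
        else:
            # Backward-selection flavour: start from a relatively large subset
            # (assumed here to contain more than half of the features).
            size = rng.randint(n_features // 2 + 1, n_features)
        chosen = set(rng.sample(range(n_features), size))
        swarm.append([1 if j in chosen else 0 for j in range(n_features)])
    return swarm

swarm = mixed_initialisation(n_particles=30, n_features=20)
print([sum(mask) for mask in swarm])  # subset size of each particle
```

Most particles therefore begin the search near the empty subset, as in forward selection, while the remaining particles begin near the full subset, as in backward selection.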
The pseudo-code of PSOIniPG can be seen in Algorithm 3. The pseudo-code of the other new algorithms (i.e. PSOIni1, PSOIni2, PSOIni3, PSOPG1, PSOPG2 and PSOPG3) is similar to that of PSOIniPG except for the procedures in the initialisation and the pbest and gbest updating mechanism.
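The updating rule that treats the classification performance as the first priority can be sketched in Python as follows. The function names `is_better` and `update_bests` and the representation of a feature subset as a 0/1 list are hypothetical choices made for this sketch; error rates are assumed to have been computed beforehand.

```python
def is_better(err_new, size_new, err_old, size_old):
    """Lower error rate wins; on an exact tie, the smaller subset wins."""
    if err_new < err_old:
        return True
    return err_new == err_old and size_new < size_old

def update_bests(swarm, errors, pbests, pbest_errors, gbest, gbest_error):
    """Update every pbest in place and return the new (gbest, gbest_error)."""
    for i, (x, err) in enumerate(zip(swarm, errors)):
        # pbest of particle i: error rate first, number of features second.
        if is_better(err, sum(x), pbest_errors[i], sum(pbests[i])):
            pbests[i], pbest_errors[i] = list(x), err
        # gbest of the swarm follows the same two-level comparison.
        if is_better(pbest_errors[i], sum(pbests[i]), gbest_error, sum(gbest)):
            gbest, gbest_error = list(pbests[i]), pbest_errors[i]
    return gbest, gbest_error
```

The tie-break on subset size can only fire when two error rates are exactly equal, so the number of features is reduced without accepting a worse error rate on the training set.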
3.3
Design of Experiments
3.3.1
Benchmark Techniques
Input : a Training set and a Test set;
Output : gbest (the selected feature subset);
         training and test classification accuracies.
begin
    initialise most of the particles using small feature subsets and the other particles using relatively large feature subsets;
    initialise the velocity of each particle;
    while Maximum Iterations is not reached do
        evaluate the fitness (classification performance, i.e. error rate) of each particle on the Training set;
        for i = 1 to Population Size do
            // Fitness1(xi) measures the error rate of xi
            if Fitness1(xi) < Fitness1(pbest) then
                pbest = xi; // Update the pbest of particle i
            else if Fitness1(xi) = Fitness1(pbest) and |xi| < |pbest| then
                pbest = xi; // Update the pbest of particle i
            if any Fitness1(pbest) < Fitness1(gbest) then
                gbest = pbest; // Update the gbest of the swarm
            else if any Fitness1(pbest) = Fitness1(gbest) and |pbest| < |gbest| then
                gbest = pbest; // Update the gbest of the swarm
        for i = 1 to Population Size do
            update the velocity and the position of particle i;
    calculate the classification accuracy of the selected feature subset on the Test set;
    return the position of gbest (the selected feature subset), the training and test classification accuracies;
Algorithm 3: The pseudo-code of PSOIniPG.

Two conventional wrapper feature selection methods, linear forward selection (LFS) [114] and greedy stepwise backward selection (GSBS) [168], are used as benchmark techniques in the experiments to examine the performance of the proposed feature selection algorithms.
LFS and GSBS were derived from SFS and SBS, respectively. LFS [114] restricts the number of features that are considered in each step of the forward selection, which can reduce the number of evaluations. Therefore, LFS is computationally less expensive than SFS and can obtain good results. More details can be seen in the literature [114].
The greedy stepwise based feature selection algorithm can move either forward or backward in the search space [168]. Given that LFS performs a forward selection, a backward search is chosen in the greedy stepwise search to form a greedy stepwise backward selection (GSBS). GSBS starts with all available features and stops when the deletion of any remaining feature results in a decrease in the evaluation measure, i.e. the classification accuracy.
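The stopping rule of the backward search described above can be sketched in Python. This is a generic greedy backward elimination written for illustration, not the Weka implementation used in the experiments; `evaluate` is a hypothetical callback that returns the classification accuracy of a feature subset.

```python
def greedy_backward_selection(n_features, evaluate):
    """Greedy backward elimination.

    evaluate(subset) is a user-supplied (hypothetical) function that maps a
    set of feature indices to a score where higher is better.
    """
    selected = set(range(n_features))  # start with all available features
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score every subset obtained by deleting one remaining feature.
        candidates = [(evaluate(selected - {f}), f) for f in sorted(selected)]
        score, feature = max(candidates, key=lambda c: c[0])
        if score < best_score:
            break  # every possible deletion decreases the evaluation
        selected.remove(feature)
        best_score = score
    return selected, best_score

# Toy evaluation: only features 0 and 2 carry information.
subset, score = greedy_backward_selection(5, lambda s: len(s & {0, 2}))
print(sorted(subset), score)
```

In this sketch a deletion that leaves the score unchanged is accepted, so redundant features are removed until any further deletion would lower the score.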
3.3.2
Datasets and Parameter Settings
In order to examine the performance of the proposed feature selection algorithms, a set of experiments has been conducted on 14 datasets. The details of the datasets can be seen in Table 1.1 on Page 16.
In the experiments, the instances in each dataset are divided into two sets: a training set and a test set. A common splitting strategy places 2/3 (around 66%) of the instances in the training set and 1/3 (around 33%) in the test set [169]. For simplicity, we use 70% of the instances in each dataset as the training set and the other 30% as the test set. The instances are selected so that the proportion of instances from different classes remains the same in both the training set and the test set. Note that n-fold cross-validation is not used here. The main reason is that a feature selection process is different from classification. n-fold cross-validation for classification produces n accuracies and their average value is the desired result. However, an n-fold cross-validation feature selection process produces n feature subsets, which cannot be averaged, since an averaged feature subset is not a meaningful or valid solution for users. Another reason is that the majority of the datasets have a sufficiently large number of instances for the 70/30 split to cope well, so n-fold cross-validation is not strictly necessary.
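A class-preserving 70/30 split of this kind can be sketched in Python as follows. The function name `stratified_split`, the random seed and the rounding of the per-class cut point are assumptions for illustration; the exact sampling procedure used in the experiments is not specified here.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the cut point of each class is rounded to the nearest instance.
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

labels = ["a"] * 10 + ["b"] * 20
train, test = stratified_split(labels)
print(len(train), len(test))
```

Splitting each class separately keeps the class proportions approximately equal in the two sets, up to the rounding of each class's cut point.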
As wrapper approaches, the proposed algorithms require a learning/classification algorithm to evaluate the fitness of the selected feature subsets. Any classification algorithm can be used here. KNN, a simple and commonly used classification algorithm [93], is used in the experiments with K = 5 (5NN). During the evolutionary training process, the classification performance of a selected feature subset is evaluated by 10-fold cross-validation on the training set. Note that 10-fold cross-validation is performed as an inner loop on the training set to evaluate the classification performance of a single feature subset and it does not generate 10 feature subsets. After the evolutionary training process, the selected feature subset is evaluated on the test set to obtain the testing classification performance. A detailed discussion of why and how 10-fold cross-validation is applied in this way is given by [5].
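The inner-loop evaluation of a single feature subset can be sketched in pure Python. The sketch makes several simplifying assumptions that are not taken from the experiments: Euclidean distance, folds formed by interleaving instance indices without shuffling or stratification, an error rate pooled over all folds, and the worst fitness for an empty subset.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training instances (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cv_error_rate(X, y, mask, n_folds=10, k=5):
    """Error rate of k-NN restricted to the features with mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0  # assumption: an empty subset receives the worst fitness
    Xs = [[row[j] for j in cols] for row in X]
    errors = 0
    for fold in range(n_folds):
        # Assumption: instance i belongs to fold i mod n_folds.
        train_idx = [i for i in range(len(Xs)) if i % n_folds != fold]
        test_idx = [i for i in range(len(Xs)) if i % n_folds == fold]
        tr_X = [Xs[i] for i in train_idx]
        tr_y = [y[i] for i in train_idx]
        errors += sum(knn_predict(tr_X, tr_y, Xs[i], k) != y[i] for i in test_idx)
    return errors / len(Xs)  # one error rate for one feature subset
```

The function returns a single value per feature subset, which matches the point made above: the cross-validation only scores a candidate subset and does not itself produce 10 feature subsets.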
The experiments of LFS and GSBS are conducted using the Waikato Environment for Knowledge Analysis (Weka) [170]. All the settings in LFS and GSBS are kept to the defaults because they can achieve good performance. 5NN is also used in LFS and GSBS. Both LFS and GSBS are deterministic methods, which produce a unique solution (feature subset) for each dataset.
The parameters in all the PSO based feature selection algorithms are selected according to common settings proposed by Clerc and Kennedy [89]. The common settings are used here because using them can clearly test whether the improvement of the performance is caused by the newly proposed mechanisms rather than other factors. The detailed settings are as follows: w = 0.7298, c1 = c2 = 1.49618, the population size is 30, and the maximum number of iterations is 100. The fully connected topology is used. According to our preliminary experiments, the threshold θ is set as 0.6 to determine whether a feature is selected or not. In the two-stage approach, i.e. PSO2S, as the maximum number of iterations is 100, the first 50 iterations are set as the first stage and the last 50 iterations are set as the second