4.4 Parametric approaches
4.4.2 Newton-like method algorithm
The second approach that we employ to find the root of problem (4.15) is based on a Newton-like method [17, 31, 60], described as follows. Suppose that at the beginning of iteration i a lower bound t_i on λ⋆ is known; such a bound can be obtained, e.g., by evaluating the fractional objective function at any feasible solution. If v(t_i) = 0, then t_i = λ⋆; otherwise, the algorithm sets t_{i+1} = h(x_i), where x_i is an optimal solution of problem (4.15) for t_i, and proceeds to the next iteration. The formal pseudo-code is given in Algorithm 2.
Note that at each iteration of Algorithm 2 we can stop the optimization of problem (4.15) in line 6 as soon as a feasible solution with an objective function value greater than both rel ⋅ ∣t_i∣ and abs is found. Based on the discussion in Section 4.4.1, this can result in more iterations but better overall performance of the algorithm.
Algorithm 2 Newton-like method algorithm
 1: Input: rel, relative gap parameter; abs, absolute gap parameter
 2: Output: x; if x_j = 1, then feature j is selected
 3: i ← 0
 4: Compute t_i    ▷ e.g., t_i = h(1′)
 5: while time limit not exceeded do
 6:     Solve problem (4.15) for t_i and obtain v(t_i) and its optimal solution x_i
 7:     if v(t_i) > rel ⋅ ∣t_i∣ and v(t_i) > abs then
 8:         t_{i+1} ← h(x_i)
 9:     else
10:         return x_i    ▷ Solution found within either the relative or the absolute optimality gap
11:     end if
12:     i ← i + 1
13: end while
14: return x_i    ▷ Best solution found within the time limit

Recall the relative and absolute gaps defined in (4.17). Following the proofs of similar results in [79] and [37, Proposition 4], if the time limit is not reached, then Algorithm 2 terminates with a feasible solution satisfying either gap_rel ⩽ rel or gap_abs ⩽ abs. If the time limit is reached after the i-th iteration of Algorithm 2, then we approximate the relative and absolute gaps by
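\[
\mathrm{gap}_{\mathrm{rel}} \simeq \frac{v(t_i)}{|t_i| \cdot g(x_i)}, \quad \text{and} \quad \mathrm{gap}_{\mathrm{abs}} \simeq \frac{v(t_i)}{g(x_i)}. \tag{4.19}
\]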
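The iteration of Algorithm 2 and the gap approximations in (4.19) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we assume that problem (4.15) has the form v(t) = max_x {f(x) − t ⋅ g(x)} with h(x) = f(x)/g(x) and g(x) > 0, we replace the solver call in line 6 by brute-force enumeration over nonzero binary vectors, and the time limit is checked after each subproblem solve rather than at the top of the loop. The function names and the toy data are hypothetical.

```python
import time
from itertools import product


def newton_like(f, g, n, rel=0.01, abs_tol=0.001, time_limit=3600.0):
    """Newton-like iteration for max f(x)/g(x) over nonzero binary x.

    Assumes g(x) > 0 for every feasible x. The parametric subproblem
    v(t) = max_x f(x) - t*g(x) is solved by enumeration, which stands in
    for the solver call and is only practical for very small n.
    Returns the last solution x_i, the parameter t_i it was computed for,
    and the subproblem value v(t_i).
    """
    feasible = [x for x in product((0, 1), repeat=n) if any(x)]
    x_i = (1,) * n                      # initial feasible solution (all ones)
    t_i = f(x_i) / g(x_i)               # initial lower bound on the optimal ratio
    start = time.monotonic()
    while True:
        # Line 6: solve the parametric subproblem for the current t_i.
        x_i = max(feasible, key=lambda x: f(x) - t_i * g(x))
        v = f(x_i) - t_i * g(x_i)
        # Line 7: continue only while both tolerances are exceeded.
        converged = not (v > rel * abs(t_i) and v > abs_tol)
        if converged or time.monotonic() - start >= time_limit:
            return x_i, t_i, v
        t_i = f(x_i) / g(x_i)           # Line 8: Newton-like update


def approx_gaps(v, t, g_x):
    """Approximate (relative, absolute) gaps as in (4.19); needs t != 0, g_x > 0."""
    return v / (abs(t) * g_x), v / g_x


# Toy instance (hypothetical data): maximize (3x1 + x2 + 2x3) / (1 + x1 + x2 + 3x3).
a, b = [3, 1, 2], [1, 1, 3]
f = lambda x: sum(aj * xj for aj, xj in zip(a, x))
g = lambda x: 1 + sum(bj * xj for bj, xj in zip(b, x))
x_best, t_best, v_best = newton_like(f, g, n=3)
gap_rel, gap_abs = approx_gaps(v_best, t_best, g(x_best))
```

On this toy instance the first update moves t from h(1′) = 1 to 1.5, after which v(t) = 0 and the iteration stops with x = (1, 0, 0).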
4.5 Computational results
The aim of our computational study is to evaluate the performance of the MILP reformulations provided in Section 4.3 against the parametric approaches of Section 4.4. In Section 4.5.1, we outline the real-life test instances and the settings used for the computational experiments. We then present our results in Section 4.5.2.
4.5.1 Computational environment and test instances
In all of the computational test instances, we solve MILPs and BQPs (in each iteration of the parametric Algorithms 1 and 2) using CPLEX 12.7.1 [47]. Each individual experiment is allocated 4 threads (2.90 GHz CPU) and 16 GB of RAM, with a time limit of one hour (3600 seconds). To avoid running out of memory, we use the node-file storage feature of CPLEX to store parts of the branch-and-cut tree on disk when the size of the tree exceeds the allocated memory. Furthermore, to compute the mutual information and the correlation between a feature and the target class or between two features, as well as the classification accuracy score, we use the scikit-learn package [72] and Python 3.7.3 [78].
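As an illustration of the quantity involved, the mutual information between two discrete variables can be estimated from empirical frequencies. The following is a minimal standard-library sketch of that estimate (in nats), not the scikit-learn routine used in the experiments; the function name and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (natural log) of two discrete sequences.

    Uses I(X; Y) = sum_{x,y} p(x, y) * log(p(x, y) / (p(x) * p(y))),
    with all probabilities replaced by relative frequencies.
    """
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Toy data (hypothetical): a feature identical to the class, and an unrelated one.
target = [0, 0, 1, 1]
mi_identical = mutual_information([0, 0, 1, 1], target)    # equals log(2)
mi_independent = mutual_information([0, 1, 0, 1], target)  # equals 0
```

The same function applies both to a feature and the target class, I(f_j, C), and to a pair of features, I(f_j, f_k), provided the values are discrete.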
Test instances. We consider various real-world instances obtained from the UCI machine learning repository [5] and the ASU feature selection repository [55], available at https://archive.ics.uci.edu and http://featureselection.asu.edu, respectively. Table 13 provides the list of instances together with their sizes and key characteristics.
Linearization bounds. In both MILP1 and MILP2, we let y^ℓ = 0 and y^u = 1. Moreover, for the MILP2 reformulation of mRMR we let M^b_j = ∑_{k∈J} ∣I(f_j, C) − I(f_j, f_k)∣ and M^d_j = n, for all j ∈ J. For the MILP2 reformulation of CFS we set M^b_j = ∑_{k∈J} ρ(f_j, C) ⋅ ρ(f_k, C) and M^d_j = ∑_{k∈J, k≠j} 2ρ(f_j, f_k), for all j ∈ J. Finally, we use M = ∑_{j∈J} ∑_{k∈J} ∣I(f_j, C) − I(f_j, f_k)∣ in MILP4.
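The linearization constants above follow directly from the mutual information and correlation values. The sketch below shows one way to compute them, assuming the values are stored as plain Python lists (a vector for feature-to-class values and a square matrix for feature-to-feature values) and that the correlations are nonnegative, as the formulas contain no absolute values for CFS. All function and variable names, and the toy numbers, are hypothetical.

```python
def mrmr_big_m(mi_class, mi_pair):
    """Big-M constants for mRMR from I(f_j, C) and I(f_j, f_k).

    Returns (M^b_j for all j, M^d_j for all j, M), where M is the double
    sum used in MILP4 and therefore equals the sum of all M^b_j.
    """
    n = len(mi_class)
    m_b = [
        sum(abs(mi_class[j] - mi_pair[j][k]) for k in range(n))
        for j in range(n)
    ]
    m_d = [n] * n
    return m_b, m_d, sum(m_b)


def cfs_big_m(rho_class, rho_pair):
    """Big-M constants for CFS from rho(f_j, C) and rho(f_j, f_k)."""
    n = len(rho_class)
    total = sum(rho_class)
    # M^b_j = sum_k rho(f_j, C) * rho(f_k, C) = rho(f_j, C) * sum_k rho(f_k, C)
    m_b = [rho_class[j] * total for j in range(n)]
    # M^d_j = sum over k != j of 2 * rho(f_j, f_k)
    m_d = [sum(2 * rho_pair[j][k] for k in range(n) if k != j) for j in range(n)]
    return m_b, m_d


# Toy values for two features (hypothetical data).
mrmr_b, mrmr_d, big_m = mrmr_big_m([0.5, 0.25], [[1.0, 0.125], [0.125, 0.75]])
cfs_b, cfs_d = cfs_big_m([0.5, 0.25], [[1.0, 0.5], [0.5, 1.0]])
```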
Gaps. We set rel = 0.01 and abs = 0.001 in both Algorithms 1 and 2. If the time limit is reached, then gap_rel and gap_abs are computed using the formulas given in (4.18) and (4.19) for Algorithms 1 and 2, respectively. Similarly, when solving the MILPs we set the relative and absolute optimality gaps of the solver to 0.01 and 0.001, respectively; these gaps are computed as gap_rel = ∣(UB − LB)/LB∣ and gap_abs = ∣UB − LB∣, where UB and LB are the upper and lower bounds on the optimal objective function value at the termination of the solver.
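For concreteness, the solver-side gaps can be computed from the final bounds as in the short sketch below. It assumes that the relative gap is ∣UB − LB∣ divided by ∣LB∣ with LB ≠ 0, and that a run counts as converged when either tolerance is met; the function names and the numeric bounds are hypothetical.

```python
def solver_gaps(ub, lb):
    """Return (gap_rel, gap_abs) from upper and lower bounds; assumes lb != 0."""
    gap_abs = abs(ub - lb)
    gap_rel = abs((ub - lb) / lb)
    return gap_rel, gap_abs


def within_tolerance(ub, lb, rel=0.01, abs_tol=0.001):
    """True if either the relative or the absolute gap tolerance is met."""
    gap_rel, gap_abs = solver_gaps(ub, lb)
    return gap_rel <= rel or gap_abs <= abs_tol


# Hypothetical bounds: the first pair meets the relative tolerance only.
tight = within_tolerance(2.015625, 2.0)   # gap_rel = 0.0078125, gap_abs = 0.015625
loose = within_tolerance(2.5, 2.0)        # gap_rel = 0.25, gap_abs = 0.5
```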
Table 13: The sizes of the considered instances, including the number of features, n, and the number of samples, m. Additionally, we provide some characteristics of the data instances, such as the type of feature values and the type of the target class variable; if ∣𝒞∣ = 2, then the target class is binary, otherwise it is multi-class.
| Instance | n | m | Data type | Class type |
|---|---|---|---|---|
| banknote authentication¹ | 4 | 1,372 | continuous | binary |
| Breast cancer¹ | 9 | 286 | discrete | binary |
| Letter Recognition¹ | 16 | 20,000 | discrete | multi |
| Zoo¹ | 17 | 101 | discrete | multi |
| Breast Cancer Wisconsin (Diagnostic)¹ | 31 | 569 | continuous | binary |
| SPECTF Heart Data¹ | 44 | 267 | continuous | binary |
| Lung Cancer¹ | 56 | 32 | discrete | binary |
| Sports articles for objectivity analysis¹ | 59 | 1,000 | discrete | binary |
| Connectionist¹ | 60 | 208 | continuous | binary |
| Optical Recognition¹ | 62 | 3,823 | discrete | multi |
| Hill-Valley¹ | 100 | 606 | continuous | binary |
| Urban Land Cover¹ | 147 | 168 | continuous | multi |
| Epileptic Seizure Recognition¹ | 178 | 11,500 | discrete | multi |
| SCADI¹ | 205 | 70 | discrete | multi |
| Semeion Handwritten Digit¹ | 256 | 1,593 | discrete | multi |
| USPS² | 256 | 9,298 | continuous | multi |
| lung discrete² | 325 | 73 | discrete | multi |
| Madelon¹,² | 500 | 2,000 | continuous | binary |
| ISOLET¹,² | 617 | 7,797 | continuous | multi |
| Parkinson’s Disease¹ | 754 | 756 | continuous | binary |
| CNAE-9¹ | 856 | 1,080 | discrete | multi |
| Yale 32x32² | 1,024 | 165 | continuous | multi |
| ORL 32x32² | 1,024 | 400 | continuous | multi |
| colon² | 2,000 | 62 | discrete | binary |
| PCMAC² | 3,289 | 1,943 | discrete | binary |

¹ UCI machine learning repository. ² ASU feature selection repository.
Classification accuracy score. Given a sample, the accuracy of a subset of features in predicting the true class of the sample can be evaluated by the classification accuracy. To evaluate the accuracy of a subset of features, we use the well-known Naive Bayes classifier (commonly used in the related literature, see, e.g., [67, 68, 73]), described below, with 5-fold cross-validation.
Recall that 𝒞 denotes the set of possible values of the target class variable C, i.e., C ∈ 𝒞. Let S be a subset of features and A be a vector of size ∣S∣, where A_j is the value of feature f_j ∈ S in the sample. Then, to classify sample A using the features in S, under the assumption that the features are conditionally independent given the class, the Naive Bayes classifier uses the following equation to find the class C_A of sample A:
\[
C_A = \operatorname*{arg\,max}_{c_k \in \mathcal{C}} \; P(c_k) \prod_{A_j \in A} P(A_j \mid c_k), \tag{4.20}
\]
where the probabilities P(c_k) and P(A_j ∣ c_k) are estimated from the training data set. Equation (4.20) implies that the most probable class is assigned as the class of sample A.
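The decision rule (4.20) and its evaluation by k-fold cross-validation can be sketched as follows. This is a minimal standard-library illustration for discrete feature values, not the scikit-learn classifier used in the experiments: probabilities are plain relative frequencies without smoothing, folds are formed by interleaving sample indices without shuffling or stratification, and samples are assumed to be already restricted to the selected subset S. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(samples, labels):
    """Estimate P(c_k) and the counts behind P(A_j | c_k) from training data."""
    class_count = Counter(labels)
    # value_count[(j, c)][a]: training samples of class c with feature j equal to a
    value_count = defaultdict(Counter)
    for sample, c in zip(samples, labels):
        for j, a in enumerate(sample):
            value_count[(j, c)][a] += 1
    prior = {c: class_count[c] / len(labels) for c in class_count}
    return prior, value_count, class_count


def predict(model, sample):
    """Return the class maximizing P(c_k) * prod_j P(A_j | c_k), as in (4.20)."""
    prior, value_count, class_count = model

    def score(c):
        s = prior[c]
        for j, a in enumerate(sample):
            s *= value_count[(j, c)][a] / class_count[c]
        return s

    return max(prior, key=score)


def cross_val_accuracy(samples, labels, k=5):
    """Fraction of samples classified correctly over k interleaved folds."""
    n = len(labels)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in test_idx]
        model = fit_naive_bayes(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict(model, samples[i]) == labels[i] for i in test_idx)
    return correct / n


# Toy data (hypothetical): the first feature determines the class.
model = fit_naive_bayes([(1, 0), (1, 1), (0, 1), (0, 0)], ["a", "a", "b", "b"])
cv_samples = [(i % 2, 0) for i in range(10)]
cv_labels = [i % 2 for i in range(10)]
accuracy = cross_val_accuracy(cv_samples, cv_labels, k=5)
```

Without smoothing, a feature value never observed with a class in the training folds gives that class a score of zero, which is one reason practical implementations usually add a smoothing term.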