where $P$ is a distribution on $\mathbb{R}^d \times \{-1,1\}$ that has margin-noise exponent $\beta_d \in (0,\infty)$ and whose marginal distribution $P_X$ has tail exponent $\tau_d \in (0,\infty]$, $c_{d,\tau_d}, \tilde{c}_{d,\beta_d} > 0$ are constants, and $c_d$ is the constant occurring in equation (8.10) in SC08. So for a given pair $(\lambda, d)$, if we choose $\gamma(\lambda,d) = \lambda^{\tau_d/(d\beta_d + d\tau_d + \beta_d\tau_d)}$, then it can be seen that $A_2(\lambda, d, \gamma(\lambda,d)) \lesssim \lambda^{\beta_d\tau_d/(d\beta_d + d\tau_d + \beta_d\tau_d)}$ (where $\lesssim$ denotes ‘less than or equal to’ up to constants). Hence the bound on the approximation error is satisfied for any $J$.
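As a quick check of the exponent, suppose the approximation-error bound preceding this passage has the two-term form of Theorem 8.18 in SC08, namely a constant times $\lambda^{\tau_d/(d+\tau_d)}\gamma^{-d\tau_d/(d+\tau_d)} + \gamma^{\beta_d}$. That form is not restated in this section, so it is an assumption here, and the case $\tau_d = \infty$ is not covered. With the shorthand $D = d\beta_d + d\tau_d + \beta_d\tau_d$ (introduced only for this check), the choice $\gamma(\lambda,d) = \lambda^{\tau_d/D}$ balances the two terms:

```latex
\begin{align*}
\gamma(\lambda,d)^{\beta_d}
  &= \lambda^{\beta_d\tau_d/D},\\
\lambda^{\frac{\tau_d}{d+\tau_d}}\,\gamma(\lambda,d)^{-\frac{d\tau_d}{d+\tau_d}}
  &= \lambda^{\frac{\tau_d}{d+\tau_d}\left(1-\frac{d\tau_d}{D}\right)}
   = \lambda^{\frac{\tau_d}{d+\tau_d}\cdot\frac{\beta_d(d+\tau_d)}{D}}
   = \lambda^{\beta_d\tau_d/D}.
\end{align*}
```

Both terms then carry the same power of $\lambda$, which is the rate stated above.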
So for a sequence of SVM objective functions $\lambda_n \|f\|^2_{H_{\gamma(\lambda_n)}} + \frac{1}{n}\sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\}$ defined for a sequence $\lambda_n^{-1} = o(n)$ with $\lambda_n \to 0$, the assumptions for the theoretical results on consistency of RFE are met, and thus Lemma 12 is proved.
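For concreteness, the objective above can be evaluated for a function of the form $f = \sum_j \alpha_j k_\gamma(x_j, \cdot)$, for which $\|f\|^2_{H_\gamma} = \alpha^\top K \alpha$. The following is a minimal sketch, not the authors' implementation. It assumes the Gaussian kernel convention $k_\gamma(x,z) = \exp(-\|x-z\|^2/\gamma^2)$, and the helper names and the schedule $\lambda_n = n^{-1/2}$ are hypothetical choices made for illustration.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Gaussian RBF kernel with width gamma (assumed convention):
    exp(-||x - z||^2 / gamma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / gamma ** 2)

def svm_objective(alpha, xs, ys, lam, gamma):
    """lam * ||f||_H^2 + (1/n) * sum_i max(0, 1 - y_i f(x_i))
    for f = sum_j alpha_j k(x_j, .)."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], gamma) for j in range(n)]
         for i in range(n)]
    # f evaluated at each training point
    f_vals = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    # squared RKHS norm: alpha^T K alpha
    rkhs_norm_sq = sum(alpha[i] * f_vals[i] for i in range(n))
    # empirical hinge risk
    hinge = sum(max(0.0, 1.0 - ys[i] * f_vals[i]) for i in range(n)) / n
    return lam * rkhs_norm_sq + hinge

def lambda_schedule(n):
    """Hypothetical schedule lam_n = n^(-1/2): lam_n -> 0 and 1/lam_n = o(n)."""
    return n ** -0.5

xs = [(0.0,), (1.0,)]
ys = [1, -1]
print(svm_objective([0.0, 0.0], xs, ys, lambda_schedule(len(xs)), 1.0))
```

With all coefficients zero, the penalty vanishes and every hinge term equals 1, so the printed value is the hinge risk of the zero function.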
3.7 Assumptions for RFE in general function spaces
In this section we discuss assumptions that are inherently needed for consistency of our algorithm under more general settings. We also discuss the necessity of these assumptions for our recursive search through appropriate examples.
3.7.1 Assumptions
Consider the setting of risk minimization (regularized or non-regularized) with respect to a given functional space $\mathcal{F}$ (typically an RKHS in the case of SVM). Our aim in this section is to provide a framework in which the modified recursive feature elimination method is consistent in finding the correct lower dimensional subspace of the input space. First we note the following assumptions:
(A1). Let $J$ be a subset of $\{1, \dots, d\}$. Let $f_{P,\mathcal{F}^{J}}$ be the function that minimizes risk within the space $\mathcal{F}^{J}$ with respect to the measure $P$ on $\mathcal{X} \times \mathcal{Y}$. Define $\mathcal{F}^{\emptyset} = \mathcal{F}$. We assume that there exists a $J_*$ with $|J_*| = d - d_0$ (where $d_0 \ge 0$ is the number of significant signals in the model) such that, for any pair $(d_1, d_2)$ satisfying $d_1 \le d_2 \le d - d_0$, there exist $J_{d_1}$ and $J_{d_2}$ with $J_{d_1} \subseteq J_{d_2} \subseteq J_*$, $|J_{d_1}| = d_1$ and $|J_{d_2}| = d_2$, for which $\mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_{d_1}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_{d_2}}}$.
Remark 13.
1. In other words, Assumption (A1) says that there exists a ‘path’ from the original input space $\mathcal{X}$ to the correct lower dimensional space $\mathcal{X}^{J_*}$ in the sense of equality of the minimized risk within the spaces $\mathcal{F}^{J}$ along this ‘path’. So there exists a sequence of index sets $\mathcal{J}$ from $J_{start} = \emptyset$ to $J_{end} = J_*$, where $\mathcal{J} := \{J_{start} \equiv J_1, J_2, \dots, J_{end}\}$ with $J_1 \subseteq J_2 \subseteq \cdots \subseteq J_{end}$ and $|J_i| = |J_{i-1}| + 1$, such that $\mathcal{R}^*_{L,P,\mathcal{F}^{J}}$ is the same for all $J \in \mathcal{J}$.
2. Note that $\mathcal{J}$ may not be unique and there might be more than one path leading to $\mathcal{X}^{J_*}$.
3. Also note that $J_*$ may not be unique in general, but any one of them would work for our purpose. So we will assume it to be unique in this paper.
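(A2). Let $\mathcal{J}_1, \mathcal{J}_2, \dots, \mathcal{J}_N$ be the exhaustive list of such paths from $\mathcal{X}$ to $\mathcal{X}^{J_*}$, and let $\tilde{\mathcal{J}} := \bigcup_{i=1}^{N} \mathcal{J}_i$. There exists $\epsilon_0 > 0$ such that whenever $J \notin \tilde{\mathcal{J}}$, $\mathcal{R}^*_{L,P,\mathcal{F}^{J}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} + \epsilon_0$.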
Note that, trivially from the discussion in Section 3.4.3, Assumptions (A1) and (A2) are satisfied for nested or dense models. At first glance these assumptions might look restrictive, but they help define the premise for consistency of the recursive algorithm in any general setting. In Section 3.5 we will show how Assumptions (A1) and (A2) are sufficient for a recursive feature elimination algorithm like RFE to work (in terms of consistency). The following examples, however, show the necessity of these assumptions in order for a well-defined recursive feature elimination algorithm to work.
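To make the two assumptions concrete, the sketch below takes a table of minimized risks indexed by the removed set $J$, lists every equal-risk path from $\emptyset$ to $J_*$ in the sense of Remark 13, and computes the smallest excess risk among sets that lie on no path, which (A2) requires to be positive. The function names and the three-feature risk table are hypothetical and chosen only for illustration. This is a check on a finite table, not part of the RFE algorithm itself.

```python
from itertools import combinations

def all_subsets(d):
    """All subsets of {1, ..., d} as frozensets."""
    idx = range(1, d + 1)
    return [frozenset(c) for k in range(d + 1) for c in combinations(idx, k)]

def equal_risk_paths(risk, j_star, tol=1e-12):
    """All paths from the empty set to j_star that add one index per step
    and keep the minimized risk equal to risk[empty set]."""
    base = risk[frozenset()]
    paths = []

    def extend(path):
        current = path[-1]
        if current == j_star:
            paths.append(path)
            return
        for j in sorted(j_star - current):
            nxt = current | {j}
            if abs(risk[nxt] - base) <= tol:
                extend(path + [nxt])

    extend([frozenset()])
    return paths

def a2_gap(risk, paths):
    """Smallest excess risk over subsets lying on no path; (A2) needs it > 0."""
    on_path = {J for p in paths for J in p}
    base = risk[frozenset()]
    off = [r - base for J, r in risk.items() if J not in on_path]
    return min(off) if off else float("inf")

# Hypothetical table with d = 3 and J_* = {1, 2}: removing any subset of
# {1, 2} leaves the risk at 0.2, removing feature 3 raises it to 0.5.
j_star = frozenset({1, 2})
risk = {J: (0.2 if J <= j_star else 0.5) for J in all_subsets(3)}
paths = equal_risk_paths(risk, j_star)
print(len(paths), a2_gap(risk, paths))
```

In this toy table there are two paths, one through $\{1\}$ and one through $\{2\}$, and every set containing index 3 is separated from them by a positive gap, so both (A1) and (A2) hold.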
3.7.2 Necessity of existence of a path in (A1)
Example 14. Consider the empirical risk minimization framework. Let $\mathcal{X} = [-1,1]^2$ and let $Y = 0$. Let $X_1 \sim U$, where $U$ is some distribution on $[-1,1]$, and $X_2 \equiv -X_1$. Let the functional space $\mathcal{F}$ be $\{c(X_1 + X_2),\ c > 0\}$. Let the loss function be the squared error loss, i.e., $L(x, y, f(x)) = (y - f(x))^2$. By Definition 1, $\mathcal{F}^{\{1\}} = \{cX_2,\ c > 0\}$, $\mathcal{F}^{\{2\}} = \{cX_1,\ c > 0\}$ and $\mathcal{F}^{\{1,2\}} = \{0\}$. We see that $\mathcal{R}_{L,P}(f_{P,\mathcal{F}}) = \mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{1,2\}}}) = 0$, but both $\mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{1\}}})$ and $\mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{2\}}})$ are $\neq 0$. Hence even if the correct low-dimensional functional space has the same minimized risk as the original functional space, the recursive algorithm will not work if there does not exist a path going down to that space. Note that the minimizer of the risk belongs to $\mathcal{F}^{\{1,2\}}$, but there is no path from $\mathcal{F}$ to $\mathcal{F}^{\{1,2\}}$ in the sense of (A1).
3.7.3 Necessity of Equality in (A1)
It might appear that, for the algorithm to work, we do not necessarily need equalities along the path and that (A1) can be relaxed to include inequalities as well. Suppose we redefine (A1), calling the modified version (A1*), such that the equality of minimized risk along the path is replaced by the inequality ‘≤’. So now we assume that the minimized risk is not necessarily constant along the path, but that it does not increase. We show below that under this modified assumption, our recursive search algorithm might fail to find the correct lower dimensional subspace of the input space.
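Example 15. Consider the empirical risk minimization framework again. Let $Y \sim U(-1,1)$ and $\mathcal{X} \subset \mathbb{R}^3$ such that $Y = X_3 = X_2 + 1 = X_1 - 1$. Let $\mathcal{F} = \{c_1X_1 + c_2X_2 + c_3X_3,\ c_1, c_2, c_3 \ge 1\}$, and let the loss function be squared error loss. Now by definition, $\mathcal{F}^{\{1\}} = \{c_2X_2 + c_3X_3,\ c_2, c_3 \ge 1\}$, $\mathcal{F}^{\{2\}} = \{c_1X_1 + c_3X_3,\ c_1, c_3 \ge 1\}$, $\mathcal{F}^{\{3\}} = \{c_1X_1 + c_2X_2,\ c_1, c_2 \ge 1\}$, $\mathcal{F}^{\{1,2\}} = \{c_3X_3,\ c_3 \ge 1\}$, $\mathcal{F}^{\{1,3\}} = \{c_2X_2,\ c_2 \ge 1\}$, $\mathcal{F}^{\{2,3\}} = \{c_1X_1,\ c_1 \ge 1\}$, and $\mathcal{F}^{\{1,2,3\}} = \{0\}$.
By simple calculations, we see that $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{1\}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2\}}} = 4/3$, $\mathcal{R}^*_{L,P,\mathcal{F}^{\{3\}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2,3\}}} = 1/3$, $\mathcal{R}^*_{L,P,\mathcal{F}^{\{1,3\}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2,3\}}} = 1$ and $\mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}} = 0$. Note that the correct lower dimensional subspace of the input space is $\mathcal{X}^{\{1,2\}}$ and there exist paths leading to this space, via $\mathcal{X} \to \mathcal{X}^{\{1\}} \to \mathcal{X}^{\{1,2\}}$ since $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{1\}}} > \mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}}$, or via $\mathcal{X} \to \mathcal{X}^{\{2\}} \to \mathcal{X}^{\{1,2\}}$ since $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2\}}} > \mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}}$, in the sense of Assumption (A1*). But there also exists the blind path $\mathcal{X} \to \mathcal{X}^{\{3\}}$, since $\mathcal{R}^*_{L,P,\mathcal{F}} > \mathcal{R}^*_{L,P,\mathcal{F}^{\{3\}}}$, which does not lead to the correct subspace. Hence the recursive search in this case may not be guaranteed to lead to the correct subspace.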
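The risk values in Example 15 can be reproduced with a short calculation. Since $X_1 = Y + 1$, $X_2 = Y - 1$ and $X_3 = Y$, we have $Y - f = (1 - c_1 - c_2 - c_3)Y - (c_1 - c_2)$, and with $E[Y] = 0$, $E[Y^2] = 1/3$ the risk is $(1 - c_1 - c_2 - c_3)^2/3 + (c_1 - c_2)^2$. The sketch below minimizes this over a finite grid of coefficients; the grid is an assumption made here for illustration (the minimum happens to sit at the boundary value 1, which the grid contains), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

E_Y2 = Fraction(1, 3)  # E[Y^2] for Y ~ U(-1, 1); E[Y] = 0

def risk(c1, c2, c3):
    """E[(Y - f)^2] for f = c1*X1 + c2*X2 + c3*X3 with
    X1 = Y + 1, X2 = Y - 1, X3 = Y."""
    return (1 - c1 - c2 - c3) ** 2 * E_Y2 + (c1 - c2) ** 2

# Assumed search grid for each active coefficient: 1, 1.25, ..., 3.
GRID = [Fraction(4 + k, 4) for k in range(9)]

def min_risk(removed):
    """Minimized risk over F^J: coefficients of removed features are 0,
    the remaining ones range over GRID (all values >= 1)."""
    choices = [[Fraction(0)] if j in removed else GRID for j in (1, 2, 3)]
    return min(risk(*c) for c in product(*choices))

table = {frozenset(J): min_risk(J)
         for k in range(4) for J in combinations((1, 2, 3), k)}

# First step of a search that removes the feature giving the smallest risk.
first = min((frozenset({j}) for j in (1, 2, 3)), key=lambda J: table[J])

for J in sorted(table, key=lambda s: (len(s), sorted(s))):
    print(sorted(J), table[J])
print("first removal:", sorted(first))
```

Under the relaxed assumption, a search that prefers the largest risk decrease removes feature 3 first. From $\mathcal{X}^{\{3\}}$ every further removal increases the risk, so the search never reaches $\mathcal{X}^{\{1,2\}}$.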
Hence equality in (A1) guarantees that the recursive search will never select an important dimension $j \notin J_*$ for removal as redundant, because then Assumption (A2) would be violated. Hence the equality in (A1) ensures that we follow a path recursively to the correct input space $\mathcal{X}^{J_*}$.