Assumptions for RFE in general function spaces

where $P$ is a distribution on $\mathbb{R}^d \times \{-1,1\}$ that has margin-noise exponent $\beta_d \in (0,\infty)$ and whose marginal distribution $P_X$ has tail exponent $\tau_d \in (0,\infty]$, $c_{d,\tau_d}, \tilde{c}_{d,\beta_d} > 0$ are constants, and $c_d$ is the constant occurring in equation (8.10) in SC08. So for a given pair $(\lambda, d)$, if we choose $\gamma(\lambda, d) = \lambda^{\tau_d/(d\beta_d + d\tau_d + \beta_d\tau_d)}$, then it can be seen that $A_2(\lambda, d, \gamma(\lambda, d)) \lesssim \lambda^{\beta_d\tau_d/(d\beta_d + d\tau_d + \beta_d\tau_d)}$ (where $\lesssim$ denotes 'less than or equal to' up to constants). Hence the bound on the approximation error is satisfied for any $J$.

So for a sequence of SVM objective functions $\lambda_n \|f\|^2_{H_{\gamma(\lambda_n)}} + \frac{1}{n}\sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\}$, defined for a sequence with $\lambda_n^{-1} = o(n)$ and $\lambda_n \to 0$, the assumptions for the theoretical results on consistency of RFE are met, and thus Lemma 12 is proved.
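As a rough illustration of the objective in the proof above, the following sketch evaluates a regularized empirical hinge risk for a function of the form $f = \sum_j \alpha_j k_\gamma(x_j, \cdot)$. Several things here are assumptions and not taken from the text: the kernel parametrization $k_\gamma(x, x') = \exp(-\|x - x'\|^2/\gamma^2)$, the averaging of the hinge loss over the $n$ observations, and the function names `gaussian_kernel` and `svm_objective`, which are hypothetical.

```python
import math

def gaussian_kernel(x, z, gamma):
    # Assumed parametrization: k(x, z) = exp(-||x - z||^2 / gamma^2).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / gamma ** 2)

def svm_objective(alpha, xs, ys, lam, gamma):
    """lam * ||f||_H^2 + (1/n) * sum_i max(0, 1 - y_i f(x_i)) for f = sum_j alpha_j k(x_j, .)."""
    n = len(xs)
    K = [[gaussian_kernel(xi, xj, gamma) for xj in xs] for xi in xs]
    # f(x_i) = sum_j alpha_j k(x_i, x_j)
    f_vals = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    # RKHS norm of a kernel expansion: ||f||_H^2 = alpha^T K alpha
    norm_sq = sum(alpha[i] * f_vals[i] for i in range(n))
    hinge = sum(max(0.0, 1.0 - y * fv) for y, fv in zip(ys, f_vals)) / n
    return lam * norm_sq + hinge
```

With `alpha` set to zero the function is identically zero, so the penalty vanishes and the objective reduces to the average hinge loss at zero, which is 1.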

3.7 Assumptions for RFE in general function spaces

In this section we discuss assumptions that are inherently needed for consistency of our algorithm in more general settings. We also show, through appropriate examples, why these assumptions are necessary for our recursive search.

3.7.1 Assumptions

Consider the setting of risk minimization (regularized or non-regularized) with respect to a given functional space F (typically an RKHS in the case of SVM). Our aim in this section is to provide a framework in which the modified recursive feature elimination method is consistent in finding the correct lower-dimensional subspace of the input space. First we note the following assumptions:

(A1). Let $J$ be a subset of $\{1, \ldots, d\}$. Let $f_{P,F^J}$ be the function that minimizes risk within the space $F^J$ with respect to the measure $P$ on $X \times Y$. Define $F^{\emptyset} = F$. We assume that there exists a $J_*$ with $|J_*| = d - d_0$ (where $d_0 \ge 0$ is the number of significant signals in the model) such that, for any pair $(d_1, d_2)$ satisfying $d_1 \le d_2 \le d - d_0$, there exist $J_{d_1}$ and $J_{d_2}$ with $J_{d_1} \subseteq J_{d_2} \subseteq J_*$, $|J_{d_1}| = d_1$ and $|J_{d_2}| = d_2$, for which $R^*_{L,P,F^{J_*}} = R^*_{L,P,F^{J_{d_1}}} = R^*_{L,P,F^{J_{d_2}}}$.

Remark 13.

1. In other words, Assumption (A1) says that there exists a 'path' from the original input space $X$ to the correct lower-dimensional space $X^{J_*}$, in the sense of equality of the minimized risk within the spaces $F^J$ along this path. So there exists a sequence of index sets $\mathcal{J}$ from $J_{start} = \emptyset$ to $J_{end} = J_*$, where $\mathcal{J} := \{J_{start} \equiv J_1, J_2, \ldots, J_{end}\}$ with $J_1 \subseteq J_2 \subseteq \cdots \subseteq J_{end}$ and $|J_i| = |J_{i-1}| + 1$, such that $R^*_{L,P,F^J}$ is the same for all $J \in \mathcal{J}$.

2. Note that $\mathcal{J}$ may not be unique, and there might be more than one path leading to $X^{J_*}$.

3. Also note that $J_*$ may not be unique in general, but any one of them would work for our purpose. So we will assume it to be unique in this paper.

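(A2). Let $\mathcal{J}_1, \mathcal{J}_2, \ldots, \mathcal{J}_N$ be the exhaustive list of such paths from $X$ to $X^{J_*}$, and let $\tilde{\mathcal{J}} := \bigcup_{i=1}^{N} \mathcal{J}_i$. There exists $\epsilon_0 > 0$ such that whenever $J \notin \tilde{\mathcal{J}}$, $R^*_{L,P,F^J} \ge R^*_{L,P,F^{J_*}} + \epsilon_0$.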
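To make the notions of 'path' and 'gap' in (A1) and (A2) concrete, the following sketch enumerates all equal-risk paths from the empty index set to a given $J_*$ and computes the smallest excess risk over the index sets that lie on no path. The risk table is hypothetical (three features, of which only feature 3 matters), and the helper names `subsets`, `equal_risk_paths` and `a2_gap` are illustrative only, not part of the original work.

```python
from itertools import combinations

def subsets(d):
    """All index sets J contained in {1, ..., d}, as frozensets."""
    items = range(1, d + 1)
    return [frozenset(c) for k in range(d + 1) for c in combinations(items, k)]

def equal_risk_paths(risk, j_star, tol=1e-12):
    """All chains from the empty set to j_star that add one index per step
    while keeping the minimized risk equal to the risk of the full space."""
    target = risk[frozenset()]
    paths = []

    def extend(path):
        current = path[-1]
        if current == j_star:
            paths.append(path)
            return
        for j in sorted(j_star - current):
            nxt = current | {j}
            if abs(risk[nxt] - target) <= tol:
                extend(path + [nxt])

    extend([frozenset()])
    return paths

def a2_gap(risk, paths, j_star):
    """Smallest excess risk over index sets that are on none of the paths."""
    on_path = {J for p in paths for J in p}
    off_path = [r for J, r in risk.items() if J not in on_path]
    return min(off_path) - risk[j_star] if off_path else None

# Hypothetical risk table: removing feature 3 raises the risk from 0.25 to 1.0.
RISK = {J: (1.0 if 3 in J else 0.25) for J in subsets(3)}
J_STAR = frozenset({1, 2})
PATHS = equal_risk_paths(RISK, J_STAR)
```

In this toy table there are two paths, one through {1} and one through {2}, and every index set off the paths has risk at least 0.75 above the minimum, so (A2) would hold with any $\epsilon_0 \le 0.75$.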

Note trivially from the discussion in Section 3.4.3 that assumptions (A1) and (A2) are satisfied for nested or dense models. At first glance these assumptions might look restrictive, but they help define the premise for consistency of the recursive algorithm in any general setting. In Section 3.5 we will show how Assumptions (A1) and (A2) are sufficient for a recursive feature elimination algorithm like RFE to work (in terms of consistency). The following examples, however, show the necessity of these assumptions for a well-defined recursive feature elimination algorithm to work.

3.7.2 Necessity of existence of a path in (A1)

Example 14. Consider the empirical risk minimization framework. Let $X = [-1,1]^2$ and let $Y = 0$. Let $X_1 \sim U$, where $U$ is some distribution on $[-1,1]$, and $X_2 \equiv -X_1$. Let the functional space $F$ be $\{c(X_1 + X_2),\ c > 0\}$. Let the loss function be the squared error loss, i.e., $L(x, y, f(x)) = (y - f(x))^2$. By Definition 1, $F^{\{1\}} = \{cX_2,\ c > 0\}$, $F^{\{2\}} = \{cX_1,\ c > 0\}$ and $F^{\{1,2\}} = \{0\}$. We see that $R_{L,P}(f_{P,F}) = R_{L,P}(f_{P,F^{\{1,2\}}}) = 0$, but both $R_{L,P}(f_{P,F^{\{1\}}})$ and $R_{L,P}(f_{P,F^{\{2\}}})$ are $\neq 0$. Hence even if the correct low-dimensional functional space has the same minimized risk as the original functional space, the recursive algorithm will not work if there does not exist a path going down to that space. Note that the minimizer of the risk belongs to $F^{\{1,2\}}$, but there is no path from $F$ to $F^{\{1,2\}}$ in the sense of (A1).
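The risks in Example 14 can be written out explicitly. The following is a short worked computation, under the added assumption that $E[X_1^2] > 0$ (that is, $U$ is not a point mass at 0):

```latex
\begin{align*}
R_{L,P}\bigl(c(X_1 + X_2)\bigr) &= E\bigl[(0 - c(X_1 - X_1))^2\bigr] = 0, \\
R_{L,P}(cX_2) &= E\bigl[(0 - cX_2)^2\bigr] = c^2\,E[X_1^2] > 0 \quad \text{for every } c > 0, \\
R_{L,P}(cX_1) &= c^2\,E[X_1^2] > 0 \quad \text{for every } c > 0.
\end{align*}
```

So every function in the full space and the zero function have risk 0, while every function in either intermediate space has strictly positive risk.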

3.7.3 Necessity of Equality in (A1)

It would appear that for the algorithm to work, we do not necessarily have to work with equalities along the path, and that we can relax (A1) to include inequalities as well. Suppose we redefine (A1) such that the equality of minimized risk along the path is replaced by the inequality '≤'. So now we assume that the minimized risk is not necessarily constant along the path, but that it does not increase. We show below that under this modified assumption, our recursive search algorithm might fail to find the correct lower-dimensional subspace of the input space.

Example 15. Consider the empirical risk minimization framework again. Let $Y \sim U(-1,1)$ and $X \subset \mathbb{R}^3$ such that $Y = X_3 = X_2 + 1 = X_1 - 1$. Let $F = \{c_1X_1 + c_2X_2 + c_3X_3,\ c_1, c_2, c_3 \ge 1\}$, and let the loss function be the squared error loss. Now by definition, $F^{\{1\}} = \{c_2X_2 + c_3X_3,\ c_2, c_3 \ge 1\}$, $F^{\{2\}} = \{c_1X_1 + c_3X_3,\ c_1, c_3 \ge 1\}$, $F^{\{3\}} = \{c_2X_2 + c_1X_1,\ c_1, c_2 \ge 1\}$, $F^{\{1,2\}} = \{c_3X_3,\ c_3 \ge 1\}$, $F^{\{1,3\}} = \{c_2X_2,\ c_2 \ge 1\}$, $F^{\{2,3\}} = \{c_1X_1,\ c_1 \ge 1\}$, and $F^{\{1,2,3\}} = \{0\}$.

By simple calculations, we see that $R^*_{L,P,F} = R^*_{L,P,F^{\{1\}}} = R^*_{L,P,F^{\{2\}}} = 4/3$, $R^*_{L,P,F^{\{3\}}} = R^*_{L,P,F^{\{1,2,3\}}} = 1/3$, $R^*_{L,P,F^{\{1,3\}}} = R^*_{L,P,F^{\{2,3\}}} = 1$ and $R^*_{L,P,F^{\{1,2\}}} = 0$. Note that the correct lower-dimensional subspace of the input space is $X^{\{1,2\}}$, and there exist paths leading to this space in the sense of Assumption (A1*): via $X \to X^{\{1\}} \to X^{\{1,2\}}$, since $R^*_{L,P,F} = R^*_{L,P,F^{\{1\}}} > R^*_{L,P,F^{\{1,2\}}}$, or via $X \to X^{\{2\}} \to X^{\{1,2\}}$, since $R^*_{L,P,F} = R^*_{L,P,F^{\{2\}}} > R^*_{L,P,F^{\{1,2\}}}$. But there also exists the blind path $X \to X^{\{3\}}$, since $R^*_{L,P,F} > R^*_{L,P,F^{\{3\}}}$, which does not lead to the correct subspace. Hence the recursive search in this case is not guaranteed to lead to the correct subspace.
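The minimized risks quoted in Example 15 can be checked numerically. The sketch below uses $E[Y] = 0$ and $E[Y^2] = 1/3$ for $Y \sim U(-1,1)$, so that the risk of $\sum_j c_j X_j$ is $a^2/3 + b^2$, where $a$ is the coefficient of $Y$ and $b$ the constant term of the residual. The coefficient grid on $[1, 3]$ and the helper `greedy_step`, which simply picks the single removal with the smallest minimized risk, are assumptions made for illustration and are not the algorithm as defined in the text.

```python
from itertools import product

# X_j = Y + OFFSET[j] with Y ~ U(-1, 1), so E[Y] = 0 and E[Y^2] = 1/3.
OFFSET = {1: 1.0, 2: -1.0, 3: 0.0}

def risk(coefs):
    """E[(Y - sum_j c_j X_j)^2] for coefs = {j: c_j}; residual is a*Y + b."""
    a = 1.0 - sum(coefs.values())
    b = -sum(c * OFFSET[j] for j, c in coefs.items())
    return a * a / 3.0 + b * b

def min_risk(removed, grid):
    """Minimized risk over the space with the features in `removed` set to zero."""
    kept = [j for j in (1, 2, 3) if j not in removed]
    return min(risk(dict(zip(kept, cs))) for cs in product(grid, repeat=len(kept)))

# Assumed search grid for each coefficient c_j >= 1 (includes the boundary c_j = 1).
GRID = [1.0 + 0.05 * k for k in range(41)]

def greedy_step(removed, grid=GRID):
    """Hypothetical greedy rule: remove the feature giving the smallest minimized risk."""
    candidates = {j: min_risk(removed | {j}, grid) for j in (1, 2, 3) if j not in removed}
    best = min(candidates, key=candidates.get)
    return best, candidates
```

Starting from the full space, `greedy_step(frozenset())` prefers removing feature 3, whose minimized risk of 1/3 is below the 4/3 obtained by removing feature 1 or 2. From $X^{\{3\}}$ every further removal raises the risk to 1, so a search that only requires the risk not to increase can end on the blind path.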

Hence equality in (A1) guarantees that the recursive search will never select an important dimension $j \notin J_*$ for elimination, because then Assumption (A2) would be violated. The equality in (A1) therefore ensures that we follow a path recursively to the correct input space $X^{J_*}$.

In document Dasgupta_unc_0153D_14888.pdf (Page 68-71)