Most of the results in the body of this draft assume a fixed design, that is, the dimension $d$ of the input data $X$ remains fixed. We derived our asymptotic consistency results for the feature selection algorithm under this premise. High-dimensional settings (where $d$ grows with $n$) are increasingly common in supervised learning problems, so a natural question is how our algorithm behaves when both $n, d \to \infty$ (we still assume that the number of significant signals in the design remains fixed, that is, $d_0$ is fixed and finite). In this section, we discuss our algorithm under this new premise and modify the arguments to achieve consistency as in the fixed design setting.
Let us assume that $X \in \mathbb{R}^d$, and we observe data $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \sim$ i.i.d. $P^d_{\mathcal{X} \times \mathcal{Y}}$, where the probability distribution of the design now depends on the dimension $d$ of the input space $\mathcal{X}$. Note that $P^d$ denotes the measure for the initial input-output space $\mathcal{X} \times \mathcal{Y}$, and as we traverse down in the feature space for our algorithm, we will assume that the probability measures on the reduced input spaces are just restrictions of $P^d$ to these spaces (as we do for a fixed design). Henceforth, we will denote the problem by $P^d$. The modified feature selection algorithm is given below.

Algorithm 21. Start off with $J \equiv [\cdot]$ empty and let $Z \equiv [1, 2, \ldots, d]$.
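STEP 1: In the $k$th cycle of the algorithm choose dimension $i_k$ for which
$$i_k = \arg\min_{i \in Z \setminus J} \Big[ \lambda \|f_{D,\lambda,H^{J \cup \{i\}}}\|^2_{H^{J \cup \{i\}}} + R_{L,D}(f_{D,\lambda,H^{J \cup \{i\}}}) - \lambda \|f_{D,\lambda,H^{J}}\|^2_{H^{J}} - R_{L,D}(f_{D,\lambda,H^{J}}) \Big].$$
Continue this until the difference
$$\min_{i \in Z \setminus J} \Big[ \lambda \|f_{D,\lambda,H^{J \cup \{i\}}}\|^2_{H^{J \cup \{i\}}} + R_{L,D}(f_{D,\lambda,H^{J \cup \{i\}}}) - \lambda \|f_{D,\lambda,H^{J}}\|^2_{H^{J}} - R_{L,D}(f_{D,\lambda,H^{J}}) \Big] > \delta_n^{P^d}(d - |J|),$$
where $\delta_n^{P^d}(\cdot)$ is a known positive function intrinsic to the design, and output $J$ as the set of indices for the features to be removed from the model.

Figure 3.4: Stopping rule for the modified algorithm in the limiting design size setting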
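The loop described by the algorithm above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the results in this draft. The callables `objective` and `delta` are hypothetical stand-ins: `objective(J)` is assumed to return the regularized empirical risk $\lambda \|f\|^2 + R_{L,D}(f)$ of the model fitted with the features in $J$ removed, and `delta(m)` plays the role of the stopping envelope evaluated at $m = d - |J|$. The sketch also assumes that the chosen index $i_k$ is added to $J$ at the end of each cycle, which the algorithm statement leaves implicit.

```python
from typing import Callable, FrozenSet, List


def modified_rfe(
    d: int,
    objective: Callable[[FrozenSet[int]], float],
    delta: Callable[[int], float],
) -> List[int]:
    """Greedy backward elimination with a dimension-dependent stopping rule.

    objective(J): hypothetical callable giving the regularized empirical risk
        of the model fitted after removing the features indexed by J.
    delta(m): hypothetical stopping envelope, evaluated at m = d - |J|.
    Returns the indices removed, in order of removal.
    """
    removed: List[int] = []
    while len(removed) < d:
        J = frozenset(removed)
        base = objective(J)
        # Increase in the objective caused by additionally removing feature i.
        increases = {
            i: objective(J | {i}) - base
            for i in range(1, d + 1)
            if i not in J
        }
        i_k = min(increases, key=increases.get)
        # Stop as soon as the smallest increase jumps above the envelope.
        if increases[i_k] > delta(d - len(removed)):
            break
        # Assumption: J is updated with i_k before the next cycle.
        removed.append(i_k)
    return removed


# Toy example (all values are illustrative assumptions): features 1 and 2
# are significant, and removing either one raises the objective by 1.
SIGNIFICANT = frozenset({1, 2})


def toy_objective(J: FrozenSet[int]) -> float:
    return float(len(J & SIGNIFICANT))


removed = modified_rfe(5, toy_objective, lambda m: 0.5)
print(sorted(removed))  # prints [3, 4, 5]
```

In the toy run the three noise features are eliminated first, and the loop stops once only the two significant features remain, because removing either of them raises the objective above the (here constant) envelope.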
The main modification of the algorithm therefore lies in the stopping rule. In the fixed design problem, the stopping rule was a fixed constant $\delta_n$, while in this modified version it is a function $\delta_n^{P^d}(\cdot) : \{1, \ldots, d\} \mapsto \mathbb{R}$. Figure 3.4 shows a visual representation of the stopping condition in this case. $\delta_n^{P^d}(\cdot)$ acts as an envelope function, and the algorithm stops if and when the difference function jumps above $\delta_n^{P^d}(\cdot)$.
To achieve consistency for this algorithm, we now have to modify our assumptions, and we briefly discuss these modifications here. Let us consider the most general framework (Condition 2). We keep assumption (A1) unchanged, that is, while moving down between spaces that always contain all the significant features, we still assume the existence of a path of equality of risk as before. Assumption (A2) needs to be modified, however, since a fixed gap $\epsilon_0$ between the risks of models that contain all significant features and those of all other sub-optimal models makes sense only in a fixed design problem. In a varying design problem, this gap should heuristically diminish as well and shrink to 0 as $d$ tends to $\infty$. Hence assumption (A2) is modified to (A2*), given below:
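(A2*). Let $\mathcal{J}_1, \mathcal{J}_2, \ldots, \mathcal{J}_N$ be the exhaustive list of such paths from $\mathcal{X}$ to $\mathcal{X}^{J_*}$, and let $\tilde{\mathcal{J}} := \bigcup_{i=1}^{N} \mathcal{J}_i$. There exists a monotonically decreasing discrete function $\epsilon_0^{P^d}(\cdot) > 0$ intrinsic to the problem and reaching 0 in the limit, such that for $J_1 \in \tilde{\mathcal{J}}$, $J_2 \notin \tilde{\mathcal{J}}$ with $|J_2| = |J_1| + 1$, we have
$$R^*_{L,P^d,\mathcal{F}^{J_2}} \ge R^*_{L,P^d,\mathcal{F}^{J_1}} + \epsilon_0^{P^d}(d - |J_1|). \qquad (3.13)$$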
We thus modify our assumption to reflect that the gap size varies with the size of the design. Heuristically, this gap-size assumption says the following: for a problem $P^d$ with starting design size $d$, $\epsilon_0^{P^d}(\cdot)$ is a strictly positive, monotonically decreasing function from $\{1, \ldots, d\}$ to $\mathbb{R}$, such that $\epsilon_0^{P^d}(\tilde{d}) \to 0$ in the limit when both $d, \tilde{d} \to \infty$. Hence there are two different asymptotic conditions acting on $\delta_n^{P^d}(\cdot)$ here: $\delta_n^{P^d}(\cdot) \to \epsilon_0^{P^d}(\cdot)$ as $n \to \infty$, and additionally $\delta_n^{P^d}(\tilde{d}) \to 0$ as $d, \tilde{d}, n \to \infty$.

3.11.1 Under universal bounds for entropy and approximation error
We still have some work left before we can argue consistency for this algorithm. For now, we assume that the regularity conditions given in Theorem 10 hold for any given design size $d$, that is, there are universal constants $a, c$ such that the entropy bound and the approximation error bound continue to hold universally. Then, in light of the discussion in this section, a simple observation shows that the results stated in Lemma 16 through Corollary 19 continue to hold in slightly restated versions ($P^n$ is replaced with $P^{d,n}$ to denote the appropriate probability measure for the starting design). Statements (i) and (iii) in Lemma 20 continue to hold, while (ii) can be changed to the following:

ii*. For $J_1 \in \tilde{\mathcal{J}}$ and $J_2 \notin \tilde{\mathcal{J}}$ with $|J_2| = |J_1| + 1$, there exists a positive sequence $\{\epsilon_n\}$ with $\epsilon_n \to 0$, such that with $P^{d,n}$ probability greater than $1 - 2e^{-\tau}$,
$$\lambda_n \|f_{D,\lambda_n,H^{J_2}}\|^2_{H^{J_2}} + R_{L,D}(f_{D,\lambda_n,H^{J_2}}) \ge \lambda_n \|f_{D,\lambda_n,H^{J_1}}\|^2_{H^{J_1}} + R_{L,D}(f_{D,\lambda_n,H^{J_1}}) + \epsilon_0^{P^d}(d - |J_1|) - \epsilon_n.$$
Under this modified statement, we can move on to establish the consistency arguments. The initial steps in the proof of Theorem 10 in section 3.8.2 (which was presented for a fixed design size) continue to hold by taking $\delta_n^{P^d}(d - |J|) = \epsilon_0^{P^d}(d - |J|) - \epsilon_n$ for design $\mathcal{X}^J$, and we now further assume that
$$\sup_{d \in \mathbb{N},\, \tilde{d} \le d} \liminf_{n \to \infty} \frac{\epsilon_0^{P^d}(\tilde{d})}{\epsilon_n} > 2.$$
This allows us to define a sequence $\{N_1, \ldots, N_d, \ldots\}$ such that $2\epsilon_n \le \epsilon_0^{P^d}(\tilde{d})$ whenever $n > N_d$ and for all $\tilde{d} \le d$. The subsequent steps follow as before, and we arrive at
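$$P(\text{`RFE finds the correct dimensions'}) \ge \prod_{i=0}^{d-d_0} \left(1 - 2(d-i)e^{-\tau}\right) \gtrsim \left(1 - 2de^{-\tau}\right)^{d},$$
where the last approximate inequality follows assuming $2de^{-\tau} < 1$ for sufficiently large $n$, and $\tau = o(n^{\frac{2\beta}{2\beta+1}})$ with $\tau \to \infty$. Now for the limiting infinite product to converge to 1 when $n, d \to \infty$, see that
$$\left(1 - 2de^{-\tau}\right)^{d} = \left[\left(1 - \frac{2d}{e^{\tau}}\right)^{-\frac{e^{\tau}}{2d}}\right]^{-\frac{2d^2}{e^{\tau}}}.$$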
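Hence if we assume $d^2 e^{-\tau} \to 0$, the bracketed term tends to $e$ and the exponent tends to 0, so the above quantity converges to 1 in the limit. For the consistency results to hold, $d$ therefore needs to grow more slowly than a certain rate in terms of the sample size $n$. Restricting the growth of $\tau$ to be $o(n^{\frac{2\beta}{2\beta+1}})$ implies that we can choose $\tau \approx n^{\frac{2\beta k}{2\beta+1}}$ for some $k < 1$. This implies that $de^{-\tau/2} \approx d e^{-0.5 n^{\frac{2\beta k}{2\beta+1}}}$, and hence $d = o\big(e^{0.5 n^{\frac{2\beta k}{2\beta+1}}}\big)$ suffices.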
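As a quick numerical sanity check of the limit argument above, the following Python sketch evaluates the lower bound $(1 - 2de^{-\tau})^d$ alongside $\exp(-2d^2 e^{-\tau})$, the value suggested by the identity. The choices $\beta = 1$, $k = 0.9$ and $d = \lceil \log n \rceil$ are illustrative assumptions only and are not taken from the results of this draft.

```python
import math


def success_lower_bound(d: int, tau: float) -> float:
    """Evaluate (1 - 2 d e^{-tau})^d, assuming 2 d e^{-tau} < 1."""
    x = 2.0 * d * math.exp(-tau)
    if x >= 1.0:
        raise ValueError("the bound requires 2 d e^{-tau} < 1")
    # log1p keeps precision when x is very small.
    return math.exp(d * math.log1p(-x))


def limiting_form(d: int, tau: float) -> float:
    """Evaluate exp(-2 d^2 e^{-tau}), the limit suggested by the identity."""
    return math.exp(-2.0 * d * d * math.exp(-tau))


# Illustrative assumptions: beta = 1, k = 0.9, d growing like log n.
beta, k = 1.0, 0.9
for n in (10**2, 10**4, 10**6):
    tau = n ** (2 * beta * k / (2 * beta + 1))
    d = max(1, math.ceil(math.log(n)))
    print(n, d, success_lower_bound(d, tau), limiting_form(d, tau))
```

Under these assumptions both quantities are numerically indistinguishable from 1 even for moderate $n$. By contrast, a choice such as $d = 1000$ with $\tau = 15$ keeps $2de^{-\tau} < 1$ but gives a bound well below 1, which illustrates why $d^2 e^{-\tau} \to 0$ is the relevant condition.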
3.11.2 Under relaxed bounds for entropy and approximation error
It can be well reasoned that the entropy bounds (and the approximation error bounds) should depend on the size of the design d. A look at the bounds derived for the Gaussian RBF kernel in section 3.6.2 makes it clear. It is however difficult to obtain explicit bounds in terms of the design size and is currently beyond the scope of this discussion. We will then assume very relaxed rates for these bounds in terms of the design size, and try to establish our consistency arguments under that premise. Let us restate our main theorem now.
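Theorem 22. Let $P^d$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$, where the input space $\mathcal{X}$ is a valid metric space. Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$ be a convex locally Lipschitz continuous loss function satisfying $L(x, y, 0) \le 1$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Let $H$ be the separable RKHS of a measurable kernel $k$ on $\mathcal{X}$ with $\|k\|_\infty \le 1$. For fixed $n \ge 1$, suppose there exist constants $\tilde{a} \ge 1$, $\alpha \ge 0$ and $p \in (0, 1)$ such that
$$E_{D_X \sim P_X^{d,n}}\, e_i(\mathrm{id} : H \mapsto L_\infty(D_X)) \le \tilde{a} e^{\alpha d}\, i^{-\frac{1}{2p}}, \quad i \ge 1.$$
For a given sample size $n$, let $\{\lambda_n\} \in [0, 1]$ be such that $\lambda_n \to 0$ and $\lim_{n \to \infty} \lambda_n n = \infty$. We also assume that there exist a $\tilde{c} > 0$, an $\tilde{\alpha}$ and a $\beta \in (0, 1]$ such that $A_2^J(\lambda) \le \tilde{c} e^{\tilde{\alpha} d} \lambda^{\beta}$ for any $J$ and for all $\lambda \ge 0$ (where $A_2^J(\lambda) \equiv A_2^{H^J}(\lambda)$).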
For $d = O(\log n)$, there exists $\delta_n^{P^d}(\cdot) = \epsilon_0^{P^d} - O(n^{-\gamma})$, where $\gamma \in \left(0, \frac{\beta}{2\beta+1}\right)$, for which the following statements hold:
1. The Recursive Feature Elimination Algorithm for support vector machines, defined for $\delta_n^{P^d}(\cdot)$ given above, will find the correct lower dimensional subspace of the input space ($\mathcal{X}^{J_*}$) with probability tending to 1.
2. The function chosen by the algorithm achieves the best risk within the original RKHS H asymptotically.
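The modification needed to reflect these changes is to revisit our bounds in Lemma 16 through Corollary 19, replacing $a$ by $\tilde{a} e^{\alpha d}$ and $c$ by $\tilde{c} e^{\tilde{\alpha} d}$. Lemma 20 can now be restated by replacing $\epsilon_n$ by
$$\epsilon_{n,d} = \left(2\tilde{c} e^{\tilde{\alpha} d} + 24\sqrt{2\tau} + 48 K_2 \tilde{a}^{2p} e^{2\alpha p d}\right) n^{-\frac{\beta}{2\beta+1}} + 40 \tau n^{-\frac{4\beta+1}{2(2\beta+1)}}.$$
We now need to ensure that $\epsilon_{n,d}$ goes to 0 asymptotically. Observe first that for $\tau = o(n^{\frac{2\beta}{2\beta+1}})$, this reduces to
$$\epsilon_{n,d} = \tilde{K} e^{\alpha d} n^{-\frac{\beta}{2\beta+1}} + o(1),$$
for a constant $\tilde{K}$ and $\alpha = \max(\tilde{\alpha}, 2\alpha p)$. Now if we fix a constant $\gamma \in \left(0, \frac{\beta}{2\beta+1}\right)$ such that $\epsilon_{n,d_n} = O(n^{-\gamma})$, we must have $e^{\alpha d} \le C_1 n^{\frac{\beta}{2\beta+1} - \gamma}$, or equivalently $d = O(\log n)$. All subsequent steps follow as discussed in the previous section, where we continue to assume
$$\sup_{d \in \mathbb{N},\, \tilde{d} \le d} \liminf_{n \to \infty} \frac{\epsilon_0^{P^d}(\tilde{d})}{\epsilon_{n,d}} > 2.$$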
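The step from the rate requirement to $d = O(\log n)$ can be spelled out as follows. This is a sketch that treats only the leading term of the restated error term of Lemma 20 (written $\epsilon_{n,d}$ here) and assumes $\alpha > 0$. The constant $C$ is a hypothetical bounding constant introduced for illustration, with $C_1 = C / \tilde{K}$.

```latex
\begin{align*}
\tilde{K} e^{\alpha d}\, n^{-\frac{\beta}{2\beta+1}} \le C n^{-\gamma}
  &\iff e^{\alpha d} \le C_1\, n^{\frac{\beta}{2\beta+1} - \gamma} \\
  &\iff d \le \frac{1}{\alpha}
     \left[ \left( \frac{\beta}{2\beta+1} - \gamma \right) \log n + \log C_1 \right].
\end{align*}
```

Since $\gamma < \frac{\beta}{2\beta+1}$, the coefficient of $\log n$ is strictly positive, so the admissible design size grows at most logarithmically in $n$.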
Since $\log n$ grows more slowly than $e^{0.5 n^{\frac{2\beta k}{2\beta+1}}}$, we have $d e^{-\tau/2} \to 0$ (equivalently, $d^2 e^{-\tau} \to 0$) automatically for $d = O(\log n)$, and hence we arrive at our consistency results.