In this section we show the validity of our results in many practical cases of risk minimization by discussing the results in some known settings.
3.6.1 CASE STUDY 1: Feature Elimination in Linear Regression
In this case study we present our results for the simple setting of linear regression. This example shows that the consistency results achieved in this paper can be applied to many different situations ranging from simple to complex risk minimization problems
and in some cases can substantiate known techniques that are in practice in such contexts for feature elimination. Linear regression is one of the most frequently used statistical techniques for data analysis. It is also a simple example of an empirical risk minimization problem.
In a linear regression model, we assume that the functional relationship can be expressed asy=hα, xi+b0, wherehα, xidenotes the Euclidean inner product of vectors α and x and b0 is the bias. The prediction quality of this model can be measured by
the squared-error loss function LLS given as LLS(x, y, f(x)) = (f(x)−y)2 and our goal is to find linear weights αb and bb0 for the observed data D that minimize the
empirical risk. We assume that the input space X ⊆ B ⊂ Rd. We further assume that Y ⊂ R is a closed set. The functional space Flin is given by Flin = {fα,b0 :
fα,b0(x) = hα, xi+b0, (α, b0)∈R
d+1, k(α, b
0)k∞≤M, for someM < ∞}. We can now
observe that the regularity conditions2 required for the consistency for the recursive
algorithm in this setting hold for this problem. The Least Squares Loss function LLS is convex, and as observed in SC08, LLS is locally Lipschitz continuous when Y is compact. Existence of M and B follows from the observations that X ⊆ B ⊂ Rd,
Y ⊂ R is a closed set, and that for some M < ∞, k(α, b0)k∞ ≤ M for any function
fα,b0 within Flin. Compactness follows trivially sinceFlin is non-empty. We assume an
exponential bound on the average entropy number. Many analyses have been done on covering numbers for linear function classes (see Zhang and Bartlett 2002, Williamson 2000) and under quite general assumptions it was proved that exponential bounds can be imposed on the-entropies of such functional classes, which is actually stronger than our bound (Refer to Theorems 4 and 5 in Zhang and Bartlett (2002)).
Thus the RFE procedure presented in this paper translates in the linear regression case as a non-parametric backward selection method based on the value of the ‘average
sum of squares of error’ or R2/n. Indeed, the average empirical risk of the estimator
b
f(x) for the sample is exactly R2/n. In a non-parametric setup, under restrictive distributional assumptions on the output vectorY, the idea of using penalized versions of logR2 like AIC, AICc or BIC are well accepted ad-hoc methodologies for model
selection (and hence feature elimination), although it is not always trivial to know which penalty should be used in a given situation, or which is best in that regard. This paper produces a theoretical basis for using the non-penalized criterionR2/n as a tool for feature elimination in linear regression. Suppose we start with a set of covariates
X = {X1, . . . , Xd} and let’s assume without loss of generality that the covariates are pre-ordered on the basis of their importance. Then null model assumption can be interpreted as claiming the existence of anr∈ {1,2, . . . , d}such that the following null hypothesis is trueH0 :{αd =· · ·=αr+1 = 0, αr, . . . , α1 6= 0}. So this paper establishes
consistency for RFE based on the criterion R2/n and a pre-determined stopping rule
in finding the correct feature spaceX0 ={X1, . . . , Xr} under this null hypothesis H0.
3.6.2 CASE STUDY 2: Support Vector Machines with a Gaussian RBF Kernel
Here we provide a brief review of the application of RFE in the classic SVM premise for classification using a Gaussian RBF kernel. Assume that Y = {1,−1}. We want to find a function f : X 7→ {1,−1} such that for almost every x ∈ X, P(f(x) =
YX = x) ≥ 1/2. In this case, the desired function is the Bayes decision function
fL,P∗ with respect to the loss function LBC(x, y, f(x)) = 1{y · sign(f(x)) 6= 1}. In practice, since LBC is non convex, it is usually replaced by the hinge loss function
LHL(x, y, f(x)) = max{0,1 −yf(x)}. For SVMs with a Gaussian RBF kernel, we minimize the regularized empirical criterionλkfk2+1
n
Pn
i=1max{0,1−yif(xi)}for the
kγ defined as kγ(x, y) =e
−kx−yk22
γ2 .
Lemma 12. For classification using support vector machines with a Gaussian RBF kernel, the RFE defined forδ=0−O(n−
β
2β+1)whereβ = βdτd
dβd+dτd+βdτd, with βd∈(0,∞) being the margin-noise exponent of the distributionP on Rd× {−1,1}and τ
d∈( 0,∞] being the tail exponent of the marginal distribution PX, is consistent in finding the
correct feature space3.
In order to prove Lemma 12, we need to verify the regularity conditions given before Theorem 10 in this setup. First note thatLHL is Lipschitz continuous and bounded for all 3-tuples of the form (x, y,0) (see Example 2.27 in SC08). Separability ofHγ holds since an RKHS over a separable metric space having a continuous kernel is separable (Lemma 4.33 of SC08) and since X ∈ Rd is separable. It is also easy to see that
|kγ(x, y)| ≤1 is true for all x, y ∈ X and all γ >0 and hence kkγk∞ ≤1.
From the proof of Proposition 17 (also see results in chapter 7 of SC08) we can see that the assumption on the bound on the average entropy of the RKHS space given before Theorem 10, can be replaced by the following:
• We assume that for fixed n ≥ 1, ∃ constants a ≥ 1 and p ∈ (0,1) such that for any J ⊆ {1,2, . . . , d}, EDX∼PXnei id:H
J 7→L
2(DX)
≤ai−21p, i≥1.
It is easily seen from the steps in (5.8) in Appendix B.3.5 that results will hold if we replace the earlier assumption with the latter. Then we see that Theorem 7.34 with Corollary 7.31 of SC08 along with the fact thatd/(d+τ) is an increasing function ind, yields a bound as given here with a:= maxd1≤dc,pγ
−(1−p)(1+)d1
2p for γ ≤1, for all >0,
d/(d+τ)< p <1 and a constant c,p depending only on p and a given. We however preferred to use the former in our theoretical derivations because it can be potentially weaker in many situations.
3For a discussion on margin-noise exponents and tail exponents of a distribution refer to Chapter
The bound on the approximation error follows from results obtained in Theorem 8.18 of SC08 (see also Theorem 2.7 in Steinwart and Scovel 2007). Note that this bound is not required for consistency results, as we already have thatA2(λ)→0 when λ →0
from Lemma 5.15 of SC08. It however helps us to obtain explicit rates for the RFE and we show that it holds here which will help us to derive rates in this framework. Without going into explicit details, we can see from Theorem 8.18 of SC08 that the approximation error for a SVM using Gaussian RBF kernel of width γ on Rd can be bounded by a function given as
A2(λ, d, γ)≤max{cd,τd,ecd,βdcd}
λd+τdτdγ−ddτd+τd +γβd