Model selection and estimator selection for statistical learning

(1)

1/62

Model selection and estimator selection for statistical learning

Sylvain Arlot

1Cnrs

2Ecole Normale Sup´´ erieure (Paris), Liens, ´Equipe Sierra

Scuola Normale Superiore di Pisa, 14–23 February 2011

Model selection and estimator selection for statistical learning Sylvain Arlot

(2)

2/62

Outline of the 5 lectures

1 Monday 14, 14:00–16:00: Statistical learning

2 Tuesday 15, 9:00–11:00: Model selection for least-squares regression

3 Thursday 17, 14:00–16:00: Linear estimator selection for least-squares regression

4 Tuesday 22, 14:00–16:00: Resampling and model selection

5 Wednesday 23, 9:00–11:00: Cross-validation and model/estimator selection

(3)

3/62

Learning Estimators Estimator selection Interactions Conclusion

Part I

Statistical learning

(4)

4/62

Outline

1 The statistical learning problem

2 Which estimators?

3 Estimator selection

4 Interactions within mathematics

5 Conclusion

(5)

5/62

Outline

2 Which estimators?

5 Conclusion

(6)

6/62

General framework

Data: ξ₁, . . . , ξ_n∈ Ξ i.i.d. ∼ P Goal: estimate a feature s^? ∈ Sof P Quality measure: loss function

∀t ∈ S , L_P( t ) = Eξ∼P[ γ(t; ξ) ] = Pγ(t) minimal at t = s^?

Contrast function: γ : S × Ξ 7→ [0, +∞) Excess loss

` ( s^?, t ) = Pγ(t) − Pγ(s^?)

(7)

7/62

Example: prediction

Data: (X₁, Y₁), . . . , (X_n, Y_n) ∈ Ξ = X × Y Goal: predict Y given X with (X , Y ) = ξ ∼ P

s^?(X ) is the “best predictor” of Y given X , i.e., s^? minimizes the loss function

Pγ(t) with γ(t; (x , y )) = d (t(x ), y )

measuring some “distance” between y and the prediction t(x ).

(8)

8/62

Example: regression: data (X

1

, Y

1

), . . . , (X

n

, Y

n

)

(9)

9/62

Goal: find the signal (denoising)

(10)

10/62

Example: regression

prediction with Y = R

Data: (X1, Y1), . . . , (Xn, Yn) i.i.d.

Yi = η(Xi) + εi with E [ εi| X_i] = 0

least-squares contrast: γ(t; (x , y )) = (t(x ) − y )

⇒ s^?= η and ` ( s^?, t ) = kt − ηk²₂= E h

( t(X ) − η(X ) )² i

(11)

10/62

Example: regression

prediction with Y = R

Data: (X1, Y1), . . . , (Xn, Yn) i.i.d.

Yi = η(Xi) + εi with E [ εi| X_i] = 0

least-squares contrast: γ(t; (x , y )) = (t(x ) − y )²

⇒ s^?= η and ` ( s^?, t ) = kt − ηk²₂= E h

( t(X ) − η(X ) )² i

(12)

11/62

Example: regression on a fixed design

(X₁, . . . , X_n) = (x₁, . . . , x_n) deterministic

Y = F + ε ∈ Rⁿ with F = (η(x₁), . . . , η(x_n)) ∈ Rⁿ and ε1, . . . , εn centered and independent.

Quadratic loss of t ∈ S = Rⁿ: L_P( t ) = EY

1

nkY − tk²

= EY

" 1 n

n

X

i =1

( Yi− t_i)²

#

⇒ s^?= F and ` ( s^?, t ) = 1

nkF − tk² = 1 n

n

X

i =1

( η(xi) − ti)²

(13)

11/62

Example: regression on a fixed design

Homoscedastic case: ε1, . . . , εn i.i.d.

1

nkY − tk²

= EY

" 1 n

n

X

i =1

( Yi− t_i)²

#

⇒ s^?= F and ` ( s^?, t ) = 1

nkF − tk² = 1 n

n

X

i =1

( η(xi) − ti)²

(14)

11/62

Example: regression on a fixed design

Homoscedastic case: ε1, . . . , εn i.i.d.

1

nkY − tk²

= EY

"

1 n

n

X

i =1

( Yi− t_i)²

#

⇒ s^?= F and ` ( s^?, t ) = 1

nkF − tk² = 1 n

n

X

i =1

( η(xi) − ti)²

(15)

12/62

Example: regression: fixed vs. random design

Random design Fixed design

D_n ( X_i, Y_i)_{1≤i ≤n} i.i.d. ∼ P Y = F + ε ∈ Rⁿ (Xn+1, Yn+1) ∼ P Xn+1 ∼ U (x₁, . . . , xn)

S t : X → R t ∈ Rⁿ

Pγ(t) E(X ,Y )∼P

h

( Y − t(X ) )²i

E_Y h

1

nkY − tk²i s^? η : x → E [ Y | X = x ] F = (η(x₁), . . . , η(x_n))

` ( s^?, t ) E(X ,Y )∼P

h

( t(X ) − η(X ) )²

i 1

nkF − tk²

with ∀x ∈ Rⁿ , kxk² =

n

X

i =1

x_i²

(16)

12/62

Example: regression: fixed vs. random design

S t : X → R t ∈ Rⁿ

Pγ(t) E(X ,Y )∼P

h

( Y − t(X ) )²i

E_Y h

1

` ( s^?, t ) E(X ,Y )∼P

h

( t(X ) − η(X ) )²

i 1

nkF − tk²

n

X

i =1

x_i²

(17)

12/62

Example: regression: fixed vs. random design

S t : X → R t ∈ Rⁿ

Pγ(t) E(X ,Y )∼P

h

( Y − t(X ) )²i

E_Y h

1

` ( s^?, t ) E(X ,Y )∼P

h

( t(X ) − η(X ) )²

i 1

nkF − tk²

n

X

i =1

x_i²

(18)

13/62

Example: density estimation (Ξ = R): data

−2 0 2 4 6 8 10

(19)

14/62

Example: density estimation (Ξ = R): data and target

−2 0 2 4 6 8 10

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

(20)

15/62

Density estimation

µ reference measure on Ξ f density of P w.r.t. µ

⇒ s^? = f and ` ( s^?, t )Kullback-Leibler distancefrom s^? to t

γ(t; ξ) = ktk²_L2(µ)− 2t(ξ)

⇒ s^? = f and` ( s^?, t ) = kt − s^?k²_L2(µ)

(21)

15/62

Density estimation

γ(t; ξ) = − ln(t(ξ))

γ(t; ξ) = ktk²_L2(µ)− 2t(ξ)

⇒ s^? = f and` ( s^?, t ) = kt − s^?k²_L2(µ)

(22)

15/62

Density estimation

γ(t; ξ) = − ln(t(ξ))

γ(t; ξ) = ktk²_L2(µ)− 2t(ξ)

⇒ s^? = f and` ( s^?, t ) = kt − s^?k²_L2(µ)

(23)

16/62

Example: classification (prediction, X = R, Y = {0, 1})

−2 0 2 4 6 8 10

classe 0 classe 1

(24)

17/62

Example: classification (prediction, X = R, Y = {0, 1})

−2 0 2 4 6 8 10

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

classe 0 classe 1

(25)

18/62

Example: classification (prediction, X = R, Y = {0, 1})

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(26)

19/62

Example: classification (prediction, X = R, Y = {0, 1})

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(27)

20/62

Example: binary supervised classification

Prediction, X = R and Y = { 0, 1 } If S = { measurable mappings X 7→ Y } 0–1 loss: γ(t; (x , y )) = 1t(x )6=y

If t ∈ S = { measurable mappings X 7→ [ 0, 1 ] },

Convex losses: γ(t; (x , y )) = ϕ(t(x )(1 − 2y )) with ϕ : R 7→ R convex, non-negative, non-increasing.

(28)

20/62

Example: binary supervised classification

Prediction, X = R and Y = { 0, 1 } If S = { measurable mappings X 7→ Y } 0–1 loss: γ(t; (x , y )) = 1t(x )6=y

If t ∈ S = { measurable mappings X 7→ [ 0, 1 ] },

Convex losses: γ(t; (x , y )) = ϕ(t(x )(1 − 2y )) with ϕ : R 7→ R convex, non-negative, non-increasing.

(29)

21/62

Outline

2 Which estimators?

5 Conclusion

(30)

22/62

What is an estimator?

Statistical algorithm orLearning rule:

A: S

n∈NΞⁿ7→ S sample D_n= ( ξ₁, . . . , ξ_n) 7→ A(D_n)

A(D_n) =bs^A(D_n) =bs(D_n) ∈ S is anestimator of s^?

Remark: Pγ bs (D_n) and ` s ,bs (D_n) are random Risk ofbs^A:

EDn∼P^⊗n Pγ bs^A(Dn) = R(A, n) Excess risk ofbs^A:

EDn∼P^⊗n ` s^?,bs^A(D_n) = R(A, n) − Pγ ( s^?)

(31)

22/62

What is an estimator?

Statistical algorithm or Learning rule:

A: S

A(D_n) =bsÂ(D_n) =bs(D_n) ∈ S is an estimator of s^? Remark: Pγ bsÂ(Dn) and ` s^?,bsÂ(Dn) arerandom

Risk ofbs^A:

EDn∼P^⊗n Pγ bs^A(Dn) = R(A, n) Excess risk ofbs^A:

(32)

22/62

What is an estimator?

Statistical algorithm or Learning rule:

A: S

A(D_n) =bsÂ(D_n) =bs(D_n) ∈ S is an estimator of s^? Remark: Pγ bsÂ(Dn) and ` s^?,bsÂ(Dn) are random RiskofbsÂ:

EDn∼P^⊗n Pγ bs^A(Dn) = R(A, n) Excess riskofbs^A:

(33)

23/62

(Universal) consistency, learning rates

Consistency (P fixed): ` s^?,bs^A(D_n) → 0 as n → +∞

Universal consistency:

sup_P lim_n→∞EDn∼P^⊗n ` s^?,bs^A(D_n) = 0 Uniform universal consistency:

lim_n→∞sup_P

EDn∼P^⊗n ` s^?,bs^A(D_n) = 0 (uniform learning rate over all distributions).

“No Free Lunch” (cf. Devroye, Gy¨orfi & Lugosi, 1996): In binary classification with X infinite, ∀A, ∀n ≥ 1,

sup

P

EDn∼P^⊗n ` s^?,bs^A(Dn) = 1 2

⇒ assumptions on P are necessary for havinguniform learning rates

(34)

23/62

(Universal) consistency, learning rates

sup_P lim_n→∞EDn∼P^⊗n ` s^?,bs^A(D_n) = 0

lim_n→∞sup_P

sup

P

EDn∼P^⊗n ` s^?,bs^A(Dn) = 1 2

(35)

23/62

(Universal) consistency, learning rates

lim_n→∞sup_P

sup

P

EDn∼P^⊗n ` s^?,bs^A(Dn) = 1 2

(36)

23/62

(Universal) consistency, learning rates

lim_n→∞sup_P

“No Free Lunch”(cf. Devroye, Gy¨orfi & Lugosi, 1996):

sup

P

EDn∼P^⊗n ` s^?,bs^A(Dn) = 1 2

(37)

23/62

(Universal) consistency, learning rates

lim_n→∞sup_P

“No Free Lunch” (cf. Devroye, Gy¨orfi & Lugosi, 1996):

In binary classification with X infinite, ∀A, ∀n ≥ 1, sup

P

EDn∼P^⊗n ` s^?,bs^A(Dn) = 1 2

(38)

24/62

Least-squares estimator: regressogram

(39)

25/62

Least-squares estimator

Framework: Regression, least-squares contrast γ(t; (x , y )) = (t(x ) − y )²

Natural idea: minimize an estimator of Pγ(t) = E

h

( t(X ) − Y )² i

Least-squares criterion: P_nγ(t) = 1

n

X

i =1

( t(X_i) − Y_i)² with P_n= 1 n

n

X

i =1

δ_ξ_i

∀t ∈ S , E [ Pnγ(t) ] = Pγ(t)

Model: S ⊂ S ⇒ Least-squares estimator on S:

bs_S ∈ arg min

t∈S{ P_nγ(t) } = arg min

t∈S

(1 n

n

X

i =1

( t(X_i) − Y_i)² )

(40)

25/62

Least-squares estimator

h

( t(X ) − Y )² i Least-squares criterion:

P_nγ(t) = 1 n

n

X

i =1

n

X

i =1

δ_ξ_i

∀t ∈ S , E [ Pnγ(t) ] = Pγ(t)

Model: S ⊂ S ⇒ Least-squares estimator on S:

bs_S ∈ arg min

t∈S

(1 n

n

X

i =1

( t(X_i) − Y_i)² )

(41)

25/62

Least-squares estimator

h

( t(X ) − Y )² i Least-squares criterion:

P_nγ(t) = 1 n

n

X

i =1

n

X

i =1

δ_ξ_i

∀t ∈ S , E [ Pnγ(t) ] = Pγ(t)

Model: S ⊂ S ⇒Least-squares estimator on S :

bs_S ∈ arg min

t∈S

(1 n

n

X

i =1

( t(X_i) − Y_i)² )

(42)

26/62

Model examples in regression

histogramson some partition Λ of X

⇒ the least-squares estimator (regressogram) can be written

bs_m=X

λ∈Λ

βb_λ1λ βb_λ = 1 Card { Xi ∈ λ }

X

Xi∈λ

Y_i

subspace generated by a subset of an orthogonal basis of L²(µ) (Fourier, wavelets, and so on)

variable selection: X_i =

X_i⁽¹⁾, . . . , X_i^(p)

∈ R^p gathers p variables that can (linearly) explain Y

∀m ⊂ { 1, . . . , p } , S_m =







t : x ∈ X 7→X

j ∈m

β_jx^{(j )} s.t. β ∈ R^m





 .

(43)

27/62

Regression: fixed vs. random design

S t : X → R t ∈ Rⁿ

Pγ(t) E(X ,Y )∼P

h

( Y − t(X ) )²i

E_Y h

1

` ( s^?, t ) E(X ,Y )∼P

h

( t(X ) − η(X ) )²

i 1

nkF − tk² P_nγ(t) = ¹_nPn

i =1( Y_i − t(X_i) )² ¹_nkY − tk²

n

X

i =1

x_i²

(44)

28/62

Minimum contrast estimators

Empirical risk (or empirical contrast) P_nγ(t) = 1

n

X

i =1

γ ( t; ξ_i)

∀t ∈ S, E [Pnγ(t) ] = Pγ(t)

Minimum contrast estimator (empirical risk minimizer) on some model S ⊂ S:

bs_S ∈ arg min

t∈SP_nγ(t) with P_n= 1 n

n

X

i =1

δ_ξ_i

Another example: maximum-likelihoodin density estimation:

γ(t; ξ) = − ln(t(ξ))

(45)

29/62

Regularized estimator: kernel ridge regression

Idea: control theestimator norm in some functional space F

F ⊂ S is the Reproducing Kernel Hilbert Space (RKHS) associated with a positive definite kernel k : X × X 7→ R

bf ∈ arg min

f ∈F

( 1 n

n

X

i =1

( Yi − f (X_i) )²+ λ kf k²_F )

Representer theorem ⇒ bf =Pn

i =1αb_ik(X_i, ·) Fixed design: (bf (xi))1≤i ≤n=F = K (K + nλIb n)⁻¹Y An example of linear estimator bF = AY

Other examples: least-squares, k-nearest-neighbours (in regression), Nadaraya-Watson, and so on

(46)

29/62

Regularized estimator: kernel ridge regression

Idea: control the estimator norm in some functional space F F ⊂ S is the Reproducing Kernel Hilbert Space(RKHS) associated with a positive definite kernel k : X × X 7→ R

bf ∈ arg min

f ∈F

( 1 n

n

X

i =1

( Yi − f (X_i) )²+ λ kf k²_F )

Representer theorem ⇒ bf = _{i =1}αb_ik(X_i, ·) Fixed design: (bf (xi))1≤i ≤n=F = K (K + nλIb n)⁻¹Y An example of linear estimator bF = AY

(47)

29/62

Regularized estimator: kernel ridge regression

Idea: control the estimator norm in some functional space F F ⊂ S is the Reproducing Kernel Hilbert Space (RKHS) associated with a positive definite kernel k : X × X 7→ R

bf ∈ arg min

f ∈F

( 1 n

n

X

i =1

( Yi − f (X_i) )²+ λ kf k²_F )

i =1αb_ik(X_i, ·) Fixed design: (bf (x_i))_{1≤i ≤n}=F = K (K + nλIb n)⁻¹Y

An example of linear estimator bF = AY

(48)

29/62

Regularized estimator: kernel ridge regression

Idea: control the estimator norm in some functional space F F ⊂ S is the Reproducing Kernel Hilbert Space (RKHS) associated with a positive definite kernel k : X × X 7→ R

bf ∈ arg min

f ∈F

( 1 n

n

X

i =1

( Yi − f (X_i) )²+ λ kf k²_F )

i =1αb_ik(X_i, ·) Fixed design: (bf (x_i))_{1≤i ≤n}=F = K (K + nλIb n)⁻¹Y An example oflinear estimator F = AYb

(49)

30/62

Other regularized estimators

Support Vector Machines (SVM) in classification:

arg min

f ∈F

n

Pnγhinge(f ) + λ kf k²_F o

Lasso (Tibshirani 1996): regression, X = R^p

arg min

w ∈R^p

(1 2

n

X

i =1

Y_i − w^>X_i

2

+ λ kw k₁ )

Structured Lasso

arg min

w ∈R^p

(1 2

n

X

i =1

Y_i − w^>X_i2

+ λΩ(w ) )

e.g., group Lasso (Yuan & Lin 2006): Ω(w ) =P

g ∈Gkw_gk₂

(50)

31/62

Classification (X = R)

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(51)

32/62

Nearest neighbour rule

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(52)

33/62

k-nearest neighbours rule (k = 20)

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(53)

34/62

20-nearest neighbours rule: regression

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(54)

35/62

Outline

2 Which estimators?

5 Conclusion

(55)

36/62

How to choose the dimension D?

D = 1

D = 9

D = 3

D = 36

(56)

37/62

How to choose the number k of neighbours?

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

k = 3

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

k = 100

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

k = 20

−2 0 2 4 6 8 10

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

k = 200

(57)

38/62

Estimator selection problem

Collection of statistical algorithms given: (A_m)_m∈M Problem: choosing among (A_m(D_n))_m∈M= (bs_m(D_n))_m∈M

Examples:

model selection

calibration (choice of k or of the distance for k-NN, choice of the regularization parameter, choice of some kernel, and so on) choosing among algorithms of different nature, e.g.,

k-NN and SVM

(58)

38/62

Estimator selection problem

Collection of statistical algorithms given: (A_m)_m∈M Problem: choosing among (A_m(D_n))_m∈M= (bs_m(D_n))_m∈M Examples:

model selection

calibration(choice of k or of the distance for k-NN, choice of the regularization parameter, choice of some kernel, and so on) choosing among algorithms of different nature, e.g.,

k-NN and SVM

(59)

39/62

Goal: estimation or prediction

Main goal: findm minimizing ` sb ^?,bs

m(Db n)(D_n) Oracle: m^? ∈ arg min_m∈M_n{ ` ( s^?,bsm(Dn) ) }

Oracle inequality(in expectation or with high probability):

` ( s^?,bs

mb) ≤ C inf

m∈Mn

{ ` ( s^?,bs_m(D_n) ) } + R_n

Non-asymptotic: all parameters can vary with n, in particular the collection M = Mn

Adaptation (e.g., in the minimax sense) to the regularity of s^?, to variations of E ε²

X, and so on (if (A_m)_m∈M_n is well chosen)

(60)

39/62

Goal: estimation or prediction

` ( s^?,bs

mb) ≤ C inf

m∈Mn

{ ` ( s^?,bs_m(D_n) ) } + R_n

s^?, to variations of E ε²

(61)

39/62

Goal: estimation or prediction

` ( s^?,bs

mb) ≤ C inf

m∈Mn

{ ` ( s^?,bs_m(D_n) ) } + R_n

Adaptation (e.g., in the minimax sense) to the regularity of s^?, to variations of E ε²

(62)

40/62

Goal: identification

Additional assumption (model selection case): s^? ∈ S_m₀ for some m₀ ∈ M_n

Additional goal: select m = mb 0 with a maximal probability Consistency:

P (m = mb 0) −−−→

n→∞ 1

Contradictory goals in general (Yang, 2005)

Sometimes possible to share the strengths of both approaches (e.g., Yang, 2005; van Erven et al., 2008)

(63)

40/62

Goal: identification

Additional assumption (model selection case): s^? ∈ S_m₀ for some m₀ ∈ M_n

Additional goal: select m = mb 0 with a maximal probability Consistency:

P (m = mb 0) −−−→

n→∞ 1

Estimation and identification (AIC-BIC dilemma)?

Contradictory goals in general (Yang, 2005)

Sometimes possible to share the strengths of both approaches (e.g., Yang, 2005; van Erven et al., 2008)

(64)

41/62

Model selection: bias and variance

E [ ` ( s^?,bs_m(D_n) ) ] = Bias + Variance Bias or Approximation error

` ( s^?, s_m^? ) := inf

t∈Sm

{ ` ( s^?, t ) } Variance or Estimation error

E [ P γ (bsm(Dn) ) ] − Pγ ( s_m^? )

⇒ avoid over-fittingandunder-fitting

(65)

41/62

Model selection: bias and variance

E [ ` ( s^?,bs_m(D_n) ) ] = Bias + Variance Bias or Approximation error

` ( s^?, s_m^? ) := inf

t∈Sm

{ ` ( s^?, t ) } Variance or Estimation error

E [ P γ (bsm(Dn) ) ] − Pγ ( s_m^? )

Bias-variance trade-off

⇒ avoid over-fittingandunder-fitting

(66)

42/62

Bias-variance trade-off

0 5 10 15 20 25

0 0.05 0.1

Dimension

Biais Exces de risque

(67)

43/62

Example: homoscedastic regression on a fixed design

Y = F + ε with E ε²_i = σ²

Fb_m = A_mY with A_m = A^>_m= A²_m and tr(A_m) = dim(S_m)

⇒Bias-variance decomposition of the risk

F_m = arg min

t∈Sm

nkt − F k²o

= A_mF E 1

n

Fbm− F

2

= 1

nk(A_m− I )F k²+σ²dim(Sm) n

= Bias + Variance

(68)

43/62

Example: homoscedastic regression on a fixed design

⇒ Bias-variance decomposition of the risk

F_m = arg min

t∈Sm

nkt − F k²o

= A_mF E 1

n

Fb_m− F

2

= 1

nk(A_m− I )F k²+σ²dim(Sm) n

= Bias + Variance

(69)

44/62

Unbiased risk estimation principle

m ∈ arg minb

m∈Mn

{ crit(m) }

crit_id(m) = ` ( s^?,bsm(Dn) ) Heuristics:

crit(m) ≈ E [ ` ( s^?,bs_m(D_n) ) ]

⇒ valid if Card(M_n) is not too large (+ concentration inequalities)

(70)

45/62

Why should the empirical risk be penalized?

0 5 10 15 20 25

−0.1

−0.05 0 0.05 0.1 0.15

Dimension

Biais Exces de risque E[Risque empirique]

(71)

46/62

Penalization

Penalization: crit(m) = Pnγ (bsm) + pen(m)

m ∈ arg minb

m∈Mn

{ P_nγ (bs_m) + pen(m) }

Ideal penalty:

pen_id(m) = (P − Pn)γ (bsm)

Mallows’ heuristics:

pen(m) ≈ E [ penid(m) ] ⇒ oracle inequality

(72)

46/62

Penalization

Penalization: crit(m) = Pnγ (bsm) + pen(m)

m ∈ arg minb

m∈Mn

{ P_nγ (bs_m) + pen(m) }

Ideal penalty:

pen_id(m) = (P − Pn)γ (bsm)

Mallows’ heuristics:

pen(m) ≈ E [ penid(m) ] ⇒ oracle inequality

(73)

47/62

Example: homoscedastic regression on a fixed design

Recall that

E 1 n

Fb_m− F

2

= 1

nk(A_m− I )F k²+σ²dim(S_m) n

⇒ Empirical risk? Ideal penalty? Expectations?

pen_id(m) = 2

nhA_mε, εi + 2

nh(A_m− I_n)F , εi E [ penid(m) ] = 2σ²D_m

n ⇒ C_p (Mallows, 1973)

(74)

47/62

Example: homoscedastic regression on a fixed design

Recall that

E 1 n

Fb_m− F

2

= 1

nk(A_m− I )F k²+σ²dim(S_m) n

⇒ Empirical risk? Ideal penalty? Expectations?

pen_id(m) = 2

nhA_mε, εi + 2

n h(A_m− I_n)F , εi E [ penid(m) ] = 2σ²D_m

n ⇒ C_p (Mallows, 1973)

(75)

48/62

Classical penalties

C_p (Mallows, 1973; regression, least-squares estimator):

2σ²D_m/n

CL (Mallows, 1973; regression, linear estimator bFm = AmY ):

2σ²tr(Am)/n

AIC (Akaike, 1973; log-likelihood, p degrees of freedom):

2p/n

BIC (Schwarz, 1978; log-likelihood, identification goal):

ln(n)p/n

(76)

49/62

Hold-out

(77)

50/62

Hold-out: training sample

(78)

51/62

Hold-out: training sample

(79)

52/62

Hold-out: validation sample

(80)

53/62

Hold-out: validation sample

(81)

54/62

Unbiased risk estimation principle

Heuristics:

E [ crit(m) ] ≈ E [ P γ (bsm) ] ⇔ E [ pen(m) ] ≈ E [ penid(m) ] Examples:

FPE (Akaike, 1970), SURE (Stein, 1981)

some kinds of cross-validation(e.g., leave-p-out, p n) log-likelihood: AIC (Akaike, 1973), AICc (Sugiura, 1978;

Hurvich & Tsai, 1989)

least-squares: C_p, C_L (Mallows, 1973), GCV (Craven &

Wahba, 1979)

covariance penalties (Efron, 2004)

bootstrap penalty (Efron, 1983), resampling(A., 2009) ...

(82)

55/62

Outline

2 Which estimators?

5 Conclusion

(83)

56/62

Probability theory: measure concentration

Empirical processes:

(P_n− P)γ(t) or sup

t∈S

{ (P_n− P)γ(t) }

Concentration ofquadratic terms, kMεk², χ²-type statistics (writting them as a sup, or through the general problem of concentration of U-statistics)

More complex quantities, such as the “ideal penalty”

(P − Pn)γ (bsm(Dn) )

(84)

57/62

Probability theory

Exact computation or upper bounds onexpectations:

E

sup

t∈S

{ (P_n− P)γ(t) }

E [ (P − Pn)γ (bs_m(D_n) ) ] Understanding therisk as a function of n

E [ P γ (bs_m(D_n) ) ] Resampling process

Control of remainder terms (variance, deviations, ...) compared to expectations

...