
Model selection and estimator selection for statistical learning

Sylvain Arlot

1 CNRS

2 École Normale Supérieure (Paris), LIENS, Équipe Sierra

Scuola Normale Superiore di Pisa, 14–23 February 2011

Model selection and estimator selection for statistical learning Sylvain Arlot


Outline of the 5 lectures

1 Monday 14, 14:00–16:00: Statistical learning

2 Tuesday 15, 9:00–11:00: Model selection for least-squares regression

3 Thursday 17, 14:00–16:00: Linear estimator selection for least-squares regression

4 Tuesday 22, 14:00–16:00: Resampling and model selection

5 Wednesday 23, 9:00–11:00: Cross-validation and model/estimator selection


Learning Estimators Estimator selection Interactions Conclusion

Part I

Statistical learning


Outline

1 The statistical learning problem

2 Which estimators?

3 Estimator selection

4 Interactions within mathematics

5 Conclusion


General framework

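Data: $\xi_1, \ldots, \xi_n \in \Xi$ i.i.d. $\sim P$

Goal: estimate a feature $s^\star \in \mathbb{S}$ of $P$

Quality measure: loss function

$$\forall t \in \mathbb{S}, \quad \mathcal{L}_P(t) = \mathbb{E}_{\xi \sim P}\left[ \gamma(t; \xi) \right] = P\gamma(t) \quad \text{minimal at } t = s^\star$$

Contrast function: $\gamma : \mathbb{S} \times \Xi \to [0, +\infty)$

Excess loss:

$$\ell(s^\star, t) = P\gamma(t) - P\gamma(s^\star)$$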


Example: prediction

Data: $(X_1, Y_1), \ldots, (X_n, Y_n) \in \Xi = \mathcal{X} \times \mathcal{Y}$

Goal: predict $Y$ given $X$ with $(X, Y) = \xi \sim P$

$s^\star(X)$ is the "best predictor" of $Y$ given $X$, i.e., $s^\star$ minimizes the loss function

$$P\gamma(t) \quad \text{with} \quad \gamma(t; (x, y)) = d(t(x), y)$$

measuring some "distance" between $y$ and the prediction $t(x)$.

[Figure: Example: regression: data $(X_1, Y_1), \ldots, (X_n, Y_n)$]

[Figure: Goal: find the signal (denoising)]

Example: regression

Prediction with $\mathcal{Y} = \mathbb{R}$

Data: $(X_1, Y_1), \ldots, (X_n, Y_n)$ i.i.d.

$$Y_i = \eta(X_i) + \varepsilon_i \quad \text{with} \quad \mathbb{E}\left[ \varepsilon_i \mid X_i \right] = 0$$

Least-squares contrast: $\gamma(t; (x, y)) = (t(x) - y)^2$

$$\Rightarrow \quad s^\star = \eta \quad \text{and} \quad \ell(s^\star, t) = \lVert t - \eta \rVert_2^2 = \mathbb{E}\left[ (t(X) - \eta(X))^2 \right]$$
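The identity $\ell(s^\star, t) = \mathbb{E}[(t(X) - \eta(X))^2]$ for the least-squares contrast can be checked numerically. The following is a minimal sketch under assumptions that are not in the slides: a hypothetical regression function $\eta(x) = \sin(x)$, $X$ uniform on $[0, 2\pi]$, Gaussian noise, and a hypothetical candidate predictor $t(x) = 0.5x - 1$. It estimates $P\gamma(t) - P\gamma(s^\star)$ and $\mathbb{E}[(t(X) - \eta(X))^2]$ by Monte Carlo; the two values should agree up to sampling error.

```python
import math
import random


def eta(x):
    # Hypothetical regression function (an assumption, not from the slides).
    return math.sin(x)


def t(x):
    # Hypothetical candidate predictor whose excess loss is evaluated.
    return 0.5 * x - 1.0


def excess_loss_two_ways(n_samples=100_000, sigma=0.5, seed=0):
    """Monte Carlo estimates of P gamma(t) - P gamma(s*) and of E[(t(X) - eta(X))^2]."""
    rng = random.Random(seed)
    risk_t = risk_star = sq_dist = 0.0
    for _ in range(n_samples):
        x = rng.uniform(0.0, 2.0 * math.pi)
        y = eta(x) + rng.gauss(0.0, sigma)  # Y = eta(X) + eps with E[eps | X] = 0
        risk_t += (t(x) - y) ** 2           # least-squares contrast at t
        risk_star += (eta(x) - y) ** 2      # least-squares contrast at s* = eta
        sq_dist += (t(x) - eta(x)) ** 2
    return (risk_t - risk_star) / n_samples, sq_dist / n_samples


if __name__ == "__main__":
    by_risks, by_distance = excess_loss_two_ways()
    print(f"P gamma(t) - P gamma(s*): {by_risks:.4f}")
    print(f"E[(t(X) - eta(X))^2]:     {by_distance:.4f}")
```

The noise variance cancels in the difference of the two risks, which is why only the squared distance between $t$ and $\eta$ remains.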

Example: regression on a fixed design

$(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$ deterministic

$Y = F + \varepsilon \in \mathbb{R}^n$ with $F = (\eta(x_1), \ldots, \eta(x_n)) \in \mathbb{R}^n$ and $\varepsilon_1, \ldots, \varepsilon_n$ centered and independent.

Homoscedastic case: $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d.

Quadratic loss of $t \in \mathbb{S} = \mathbb{R}^n$:

$$\mathcal{L}_P(t) = \mathbb{E}_Y\left[ \frac{1}{n} \lVert Y - t \rVert^2 \right] = \mathbb{E}_Y\left[ \frac{1}{n} \sum_{i=1}^n (Y_i - t_i)^2 \right]$$

$$\Rightarrow \quad s^\star = F \quad \text{and} \quad \ell(s^\star, t) = \frac{1}{n} \lVert F - t \rVert^2 = \frac{1}{n} \sum_{i=1}^n (\eta(x_i) - t_i)^2$$

Example: regression: fixed vs. random design

| | Random design | Fixed design |
|---|---|---|
| $D_n$ | $(X_i, Y_i)_{1 \le i \le n}$ i.i.d. $\sim P$ | $Y = F + \varepsilon \in \mathbb{R}^n$ |
| | $(X_{n+1}, Y_{n+1}) \sim P$ | $X_{n+1} \sim \mathcal{U}(x_1, \ldots, x_n)$ |
| $\mathbb{S}$ | $t : \mathcal{X} \to \mathbb{R}$ | $t \in \mathbb{R}^n$ |
| $P\gamma(t)$ | $\mathbb{E}_{(X,Y) \sim P}\left[ (Y - t(X))^2 \right]$ | $\mathbb{E}_Y\left[ \frac{1}{n} \lVert Y - t \rVert^2 \right]$ |
| $s^\star$ | $\eta : x \mapsto \mathbb{E}[Y \mid X = x]$ | $F = (\eta(x_1), \ldots, \eta(x_n))$ |
| $\ell(s^\star, t)$ | $\mathbb{E}_{(X,Y) \sim P}\left[ (t(X) - \eta(X))^2 \right]$ | $\frac{1}{n} \lVert F - t \rVert^2$ |

with $\forall x \in \mathbb{R}^n$, $\lVert x \rVert^2 = \sum_{i=1}^n x_i^2$

[Figure: Example: density estimation ($\Xi = \mathbb{R}$): data]

[Figure: Example: density estimation ($\Xi = \mathbb{R}$): data and target]

Density estimation

$\mu$ reference measure on $\Xi$, $f$ density of $P$ w.r.t. $\mu$

$$\gamma(t; \xi) = -\ln(t(\xi))$$

$\Rightarrow$ $s^\star = f$ and $\ell(s^\star, t)$ is the Kullback-Leibler divergence from $s^\star$ to $t$

$$\gamma(t; \xi) = \lVert t \rVert_{L^2(\mu)}^2 - 2t(\xi)$$

$\Rightarrow$ $s^\star = f$ and $\ell(s^\star, t) = \lVert t - s^\star \rVert_{L^2(\mu)}^2$
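Both excess losses stated for density estimation follow from a short computation. The derivation below is a sketch assuming that $t$ and $f$ belong to $L^2(\mu)$ for the least-squares contrast, and that $t$ is a density with $\ln(f/t)$ integrable against $f\,d\mu$ for the log-likelihood contrast; $\mathrm{KL}$ is notation introduced here for the Kullback-Leibler divergence.

```latex
\begin{align*}
\text{Least squares:}\quad
P\gamma(t) &= \|t\|_{L^2(\mu)}^2 - 2\int t\, f \, d\mu
            = \|t - f\|_{L^2(\mu)}^2 - \|f\|_{L^2(\mu)}^2 , \\
\ell(s^\star, t) &= P\gamma(t) - P\gamma(f) = \|t - f\|_{L^2(\mu)}^2 . \\[1ex]
\text{Log-likelihood:}\quad
\ell(s^\star, t) &= -\int f \ln(t)\, d\mu + \int f \ln(f)\, d\mu
                  = \int f \ln\!\left(\frac{f}{t}\right) d\mu
                  = \mathrm{KL}(f, t) \ge 0 .
\end{align*}
```

In both cases the excess loss vanishes at $t = f$, which confirms $s^\star = f$.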

[Figures: Example: classification (prediction, $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \{0, 1\}$); legend: class 0, class 1]

Example: binary supervised classification

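Prediction, $\mathcal{X} = \mathbb{R}$ and $\mathcal{Y} = \{0, 1\}$

If $\mathbb{S} = \{ \text{measurable mappings } \mathcal{X} \to \mathcal{Y} \}$:

0–1 loss: $\gamma(t; (x, y)) = \mathbb{1}_{t(x) \neq y}$

If $t \in \mathbb{S} = \{ \text{measurable mappings } \mathcal{X} \to [0, 1] \}$:

Convex losses: $\gamma(t; (x, y)) = \varphi(t(x)(1 - 2y))$ with $\varphi : \mathbb{R} \to \mathbb{R}$ convex, non-negative, non-increasing.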


What is an estimator?

Statistical algorithm or learning rule:

$$\mathcal{A} : \bigcup_{n \in \mathbb{N}} \Xi^n \to \mathbb{S}, \qquad \text{sample } D_n = (\xi_1, \ldots, \xi_n) \mapsto \mathcal{A}(D_n)$$

$\mathcal{A}(D_n) = \hat{s}_{\mathcal{A}}(D_n) = \hat{s}(D_n) \in \mathbb{S}$ is an estimator of $s^\star$

Remark: $P\gamma(\hat{s}_{\mathcal{A}}(D_n))$ and $\ell(s^\star, \hat{s}_{\mathcal{A}}(D_n))$ are random

Risk of $\hat{s}_{\mathcal{A}}$:

$$\mathbb{E}_{D_n \sim P^{\otimes n}}\left[ P\gamma(\hat{s}_{\mathcal{A}}(D_n)) \right] = \mathcal{R}(\mathcal{A}, n)$$

Excess risk of $\hat{s}_{\mathcal{A}}$:

$$\mathbb{E}_{D_n \sim P^{\otimes n}}\left[ \ell(s^\star, \hat{s}_{\mathcal{A}}(D_n)) \right] = \mathcal{R}(\mathcal{A}, n) - P\gamma(s^\star)$$


(Universal) consistency, learning rates

Consistency ($P$ fixed): $\ell(s^\star, \hat{s}_{\mathcal{A}}(D_n)) \to 0$ as $n \to +\infty$

Universal consistency:

$$\sup_P \lim_{n \to \infty} \mathbb{E}_{D_n \sim P^{\otimes n}}\left[ \ell(s^\star, \hat{s}_{\mathcal{A}}(D_n)) \right] = 0$$

Uniform universal consistency:

$$\lim_{n \to \infty} \sup_P \mathbb{E}_{D_n \sim P^{\otimes n}}\left[ \ell(s^\star, \hat{s}_{\mathcal{A}}(D_n)) \right] = 0$$

(uniform learning rate over all distributions).

"No Free Lunch" (cf. Devroye, Györfi & Lugosi, 1996): in binary classification with $\mathcal{X}$ infinite, $\forall \mathcal{A}$, $\forall n \ge 1$,

$$\sup_P \mathbb{E}_{D_n \sim P^{\otimes n}}\left[ \ell(s^\star, \hat{s}_{\mathcal{A}}(D_n)) \right] \ge \frac{1}{2}$$

$\Rightarrow$ assumptions on $P$ are necessary for having uniform learning rates

[Figure: Least-squares estimator: regressogram]

Least-squares estimator

Framework: regression, least-squares contrast $\gamma(t; (x, y)) = (t(x) - y)^2$

Natural idea: minimize an estimator of $P\gamma(t) = \mathbb{E}\left[ (t(X) - Y)^2 \right]$

Least-squares criterion:

$$P_n\gamma(t) = \frac{1}{n} \sum_{i=1}^n (t(X_i) - Y_i)^2 \quad \text{with} \quad P_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$$

$$\forall t \in \mathbb{S}, \quad \mathbb{E}\left[ P_n\gamma(t) \right] = P\gamma(t)$$

Model: $S \subset \mathbb{S}$ $\Rightarrow$ least-squares estimator on $S$:

$$\hat{s}_S \in \arg\min_{t \in S} \left\{ P_n\gamma(t) \right\} = \arg\min_{t \in S} \left\{ \frac{1}{n} \sum_{i=1}^n (t(X_i) - Y_i)^2 \right\}$$


Model examples in regression

histograms on some partition $\Lambda$ of $\mathcal{X}$

$\Rightarrow$ the least-squares estimator (regressogram) can be written

$$\hat{s}_m = \sum_{\lambda \in \Lambda} \hat{\beta}_\lambda \mathbb{1}_\lambda, \qquad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{ X_i \in \lambda \}} \sum_{X_i \in \lambda} Y_i$$

subspace generated by a subset of an orthogonal basis of $L^2(\mu)$ (Fourier, wavelets, and so on)

variable selection: $X_i = (X_i^{(1)}, \ldots, X_i^{(p)}) \in \mathbb{R}^p$ gathers $p$ variables that can (linearly) explain $Y$

$$\forall m \subset \{1, \ldots, p\}, \quad S_m = \left\{ t : x \in \mathcal{X} \mapsto \sum_{j \in m} \beta_j x^{(j)} \ \text{s.t.}\ \beta \in \mathbb{R}^m \right\}$$
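The regressogram formula (one empirical mean of the $Y_i$ per cell of the partition) can be sketched in a few lines. This is a minimal illustration assuming $\mathcal{X} \subset \mathbb{R}$ and a partition into consecutive intervals given by a sorted list `edges`; the function names `regressogram`, `cell_index` and `predict` are hypothetical, and empty cells are reported as `None` since the formula is undefined there.

```python
def cell_index(x, edges):
    """Index j of the cell [edges[j], edges[j+1]) containing x (the last cell is closed)."""
    if x < edges[0] or x > edges[-1]:
        return None
    for j in range(len(edges) - 1):
        if x < edges[j + 1]:
            return j
    return len(edges) - 2  # x == edges[-1]


def regressogram(xs, ys, edges):
    """Least-squares estimator on the histogram model: mean of the Y_i in each cell.

    Returns the list of coefficients beta_hat (None for cells containing no X_i).
    """
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for x, y in zip(xs, ys):
        j = cell_index(x, edges)
        if j is not None:
            sums[j] += y
            counts[j] += 1
    return [s / c if c > 0 else None for s, c in zip(sums, counts)]


def predict(x, edges, beta_hat):
    """Value of the piecewise-constant estimator at x (None outside the partition)."""
    j = cell_index(x, edges)
    return None if j is None else beta_hat[j]


if __name__ == "__main__":
    xs = [0.1, 0.2, 0.6, 0.9]
    ys = [1.0, 3.0, 5.0, 7.0]
    edges = [0.0, 0.5, 1.0]
    beta_hat = regressogram(xs, ys, edges)
    print(beta_hat, predict(0.3, edges, beta_hat))
```

Each coefficient minimizes the empirical least-squares criterion separately on its cell, which is why the estimator reduces to cell-wise averages.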

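Regression: fixed vs. random design

| | Random design | Fixed design |
|---|---|---|
| $D_n$ | $(X_i, Y_i)_{1 \le i \le n}$ i.i.d. $\sim P$ | $Y = F + \varepsilon \in \mathbb{R}^n$ |
| | $(X_{n+1}, Y_{n+1}) \sim P$ | $X_{n+1} \sim \mathcal{U}(x_1, \ldots, x_n)$ |
| $\mathbb{S}$ | $t : \mathcal{X} \to \mathbb{R}$ | $t \in \mathbb{R}^n$ |
| $P\gamma(t)$ | $\mathbb{E}_{(X,Y) \sim P}\left[ (Y - t(X))^2 \right]$ | $\mathbb{E}_Y\left[ \frac{1}{n} \lVert Y - t \rVert^2 \right]$ |
| $s^\star$ | $\eta : x \mapsto \mathbb{E}[Y \mid X = x]$ | $F = (\eta(x_1), \ldots, \eta(x_n))$ |
| $\ell(s^\star, t)$ | $\mathbb{E}_{(X,Y) \sim P}\left[ (t(X) - \eta(X))^2 \right]$ | $\frac{1}{n} \lVert F - t \rVert^2$ |
| $P_n\gamma(t)$ | $\frac{1}{n} \sum_{i=1}^n (Y_i - t(X_i))^2$ | $\frac{1}{n} \lVert Y - t \rVert^2$ |

with $\forall x \in \mathbb{R}^n$, $\lVert x \rVert^2 = \sum_{i=1}^n x_i^2$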

Minimum contrast estimators

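Empirical risk (or empirical contrast):

$$P_n\gamma(t) = \frac{1}{n} \sum_{i=1}^n \gamma(t; \xi_i)$$

$$\forall t \in \mathbb{S}, \quad \mathbb{E}\left[ P_n\gamma(t) \right] = P\gamma(t)$$

Minimum contrast estimator (empirical risk minimizer) on some model $S \subset \mathbb{S}$:

$$\hat{s}_S \in \arg\min_{t \in S} P_n\gamma(t) \quad \text{with} \quad P_n = \frac{1}{n} \sum_{i=1}^n \delta_{\xi_i}$$

Another example: maximum likelihood in density estimation:

$$\gamma(t; \xi) = -\ln(t(\xi))$$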

Regularized estimator: kernel ridge regression

Idea: control the estimator norm in some functional space $\mathcal{F}$

$\mathcal{F} \subset \mathbb{S}$ is the Reproducing Kernel Hilbert Space (RKHS) associated with a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$

$$\hat{f} \in \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \lVert f \rVert_{\mathcal{F}}^2 \right\}$$

Representer theorem $\Rightarrow$ $\hat{f} = \sum_{i=1}^n \hat{\alpha}_i k(X_i, \cdot)$

Fixed design: $(\hat{f}(x_i))_{1 \le i \le n} = \hat{F} = K(K + n\lambda I_n)^{-1} Y$

An example of linear estimator: $\hat{F} = AY$

Other examples: least-squares, k-nearest-neighbours (in regression), Nadaraya-Watson, and so on
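The fixed-design formula $\hat{F} = K(K + n\lambda I_n)^{-1}Y$ for kernel ridge regression can be illustrated with a small standard-library sketch. The Gaussian kernel, its bandwidth, and the helper names (`gaussian_kernel`, `solve`, `kernel_ridge_fit`, `kernel_ridge_predict`) are assumptions made for this example; the linear system is solved by plain Gaussian elimination, which is only reasonable for small $n$.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    # One possible positive definite kernel on the real line (an assumption).
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def kernel_ridge_fit(xs, ys, lam, kernel=gaussian_kernel):
    """Return alpha_hat = (K + n lam I)^{-1} Y and the fitted values F_hat = K alpha_hat."""
    n = len(xs)
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    shifted = [[gram[i][j] + (n * lam if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    alpha = solve(shifted, ys)
    fitted = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    return alpha, fitted


def kernel_ridge_predict(x, xs, alpha, kernel=gaussian_kernel):
    """Evaluate f_hat(x) = sum_i alpha_i k(X_i, x) (representer theorem)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 1.0, 0.5, 2.0]
    alpha, fitted = kernel_ridge_fit(xs, ys, lam=0.1)
    print(fitted, kernel_ridge_predict(1.5, xs, alpha))
```

The fitted values depend linearly on $Y$ through the matrix $A = K(K + n\lambda I_n)^{-1}$, which is what makes this a linear estimator.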

Other regularized estimators

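Support Vector Machines (SVM) in classification:

$$\arg\min_{f \in \mathcal{F}} \left\{ P_n\gamma_{\mathrm{hinge}}(f) + \lambda \lVert f \rVert_{\mathcal{F}}^2 \right\}$$

Lasso (Tibshirani 1996): regression, $\mathcal{X} = \mathbb{R}^p$

$$\arg\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{2} \sum_{i=1}^n \left( Y_i - w^\top X_i \right)^2 + \lambda \lVert w \rVert_1 \right\}$$

Structured Lasso

$$\arg\min_{w \in \mathbb{R}^p} \left\{ \frac{1}{2} \sum_{i=1}^n \left( Y_i - w^\top X_i \right)^2 + \lambda \Omega(w) \right\}$$

e.g., group Lasso (Yuan & Lin 2006): $\Omega(w) = \sum_{g \in G} \lVert w_g \rVert_2$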
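The slides state the Lasso objective but not how to minimize it. One standard option is cyclic coordinate descent with soft-thresholding, sketched below for the objective $\frac{1}{2} \sum_i (Y_i - w^\top X_i)^2 + \lambda \lVert w \rVert_1$ exactly as written above. The function names and the fixed number of sweeps are assumptions of this example, not part of the original material.

```python
def soft_threshold(z, threshold):
    """Proximal operator of threshold * |.|: shrink z towards 0 by threshold."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def lasso_coordinate_descent(xs, ys, lam, n_sweeps=200):
    """Minimize 0.5 * sum_i (y_i - w^T x_i)^2 + lam * ||w||_1 by cyclic coordinate descent.

    xs is a list of n rows of length p, ys a list of n responses.
    """
    n, p = len(xs), len(xs[0])
    w = [0.0] * p
    residual = list(ys)  # residual_i = y_i - w^T x_i, with w = 0 initially
    col_sq = [sum(xs[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column never enters the model
            # Inner product of column j with the partial residual (coordinate j removed).
            rho = sum(xs[i][j] * (residual[i] + w[j] * xs[i][j]) for i in range(n))
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * xs[i][j]
                w[j] = new_wj
    return w


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    ys = [2.0, 0.1, 2.0, 0.1]
    print(lasso_coordinate_descent(xs, ys, lam=1.0))
```

In this toy design the second coefficient is set exactly to zero, which illustrates how the $\ell^1$ penalty performs variable selection.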

[Figure: Classification ($\mathcal{X} = \mathbb{R}$)]

[Figure: Nearest neighbour rule]

[Figure: k-nearest neighbours rule (k = 20)]

[Figure: 20-nearest neighbours rule: regression]

How to choose the dimension D?

[Figure panels: D = 1, D = 3, D = 9, D = 36]

How to choose the number k of neighbours?

[Figure panels: k = 3, k = 20, k = 100, k = 200]

Estimator selection problem

Collection of statistical algorithms given: $(\mathcal{A}_m)_{m \in \mathcal{M}}$

Problem: choosing among $(\mathcal{A}_m(D_n))_{m \in \mathcal{M}} = (\hat{s}_m(D_n))_{m \in \mathcal{M}}$

Examples:

model selection

calibration (choice of k or of the distance for k-NN, choice of the regularization parameter, choice of some kernel, and so on)

choosing among algorithms of different nature, e.g., k-NN and SVM

Goal: estimation or prediction

Main goal: find $\hat{m}$ minimizing $\ell(s^\star, \hat{s}_{\hat{m}(D_n)}(D_n))$

Oracle: $m^\star \in \arg\min_{m \in \mathcal{M}_n} \left\{ \ell(s^\star, \hat{s}_m(D_n)) \right\}$

Oracle inequality (in expectation or with high probability):

$$\ell(s^\star, \hat{s}_{\hat{m}}) \le C \inf_{m \in \mathcal{M}_n} \left\{ \ell(s^\star, \hat{s}_m(D_n)) \right\} + R_n$$

Non-asymptotic: all parameters can vary with $n$, in particular the collection $\mathcal{M} = \mathcal{M}_n$

Adaptation (e.g., in the minimax sense) to the regularity of $s^\star$, to variations of $\mathbb{E}\left[ \varepsilon^2 \mid X \right]$, and so on (if $(\mathcal{A}_m)_{m \in \mathcal{M}_n}$ is well chosen)

Goal: identification

Additional assumption (model selection case): $s^\star \in S_{m_0}$ for some $m_0 \in \mathcal{M}_n$

Additional goal: select $\hat{m} = m_0$ with a maximal probability

Consistency:

$$\mathbb{P}(\hat{m} = m_0) \xrightarrow[n \to \infty]{} 1$$

Estimation and identification (AIC-BIC dilemma)?

Contradictory goals in general (Yang, 2005)

Sometimes possible to share the strengths of both approaches (e.g., Yang, 2005; van Erven et al., 2008)

Model selection: bias and variance

$$\mathbb{E}\left[ \ell(s^\star, \hat{s}_m(D_n)) \right] = \text{Bias} + \text{Variance}$$

Bias or approximation error:

$$\ell(s^\star, s_m^\star) := \inf_{t \in S_m} \left\{ \ell(s^\star, t) \right\}$$

Variance or estimation error:

$$\mathbb{E}\left[ P\gamma(\hat{s}_m(D_n)) \right] - P\gamma(s_m^\star)$$

Bias-variance trade-off

$\Rightarrow$ avoid over-fitting and under-fitting

[Figure: Bias-variance trade-off; x-axis: dimension; curves: bias, excess risk]

Example: homoscedastic regression on a fixed design

$Y = F + \varepsilon$ with $\mathbb{E}\left[ \varepsilon_i^2 \right] = \sigma^2$

$\hat{F}_m = A_m Y$ with $A_m = A_m^\top = A_m^2$ and $\operatorname{tr}(A_m) = \dim(S_m)$

$\Rightarrow$ Bias-variance decomposition of the risk:

$$F_m = \arg\min_{t \in S_m} \left\{ \lVert t - F \rVert^2 \right\} = A_m F$$

$$\mathbb{E}\left[ \frac{1}{n} \lVert \hat{F}_m - F \rVert^2 \right] = \frac{1}{n} \lVert (A_m - I)F \rVert^2 + \frac{\sigma^2 \dim(S_m)}{n} = \text{Bias} + \text{Variance}$$
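The decomposition of the risk into $\frac{1}{n} \lVert (A_m - I)F \rVert^2$ plus $\sigma^2 \dim(S_m)/n$ can be checked by simulation. The sketch below makes several assumptions that are not in the slides: the model $S_m$ is the set of vectors that are constant on consecutive blocks of equal size (a histogram model on a regular design, so $A_m$ is block averaging), the signal is a hypothetical $F_i = \sin(2\pi (i + 0.5)/n)$, and the noise is Gaussian.

```python
import math
import random


def project_on_bins(v, n_bins):
    """Apply A_m for the histogram model: replace each coordinate by its bin mean."""
    size = len(v) // n_bins  # assumes n_bins divides len(v)
    out = []
    for b in range(n_bins):
        block = v[b * size:(b + 1) * size]
        out.extend([sum(block) / size] * size)
    return out


def sq_norm_over_n(u, v):
    """Normalized squared Euclidean distance (1/n) ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def bias_variance_check(n=60, n_bins=6, sigma=0.5, n_rep=2000, seed=0):
    """Return (bias, variance, Monte Carlo risk) for the histogram model with n_bins cells."""
    rng = random.Random(seed)
    signal = [math.sin(2.0 * math.pi * (i + 0.5) / n) for i in range(n)]  # hypothetical F
    bias = sq_norm_over_n(project_on_bins(signal, n_bins), signal)  # (1/n) ||(A_m - I) F||^2
    variance = sigma ** 2 * n_bins / n                               # sigma^2 dim(S_m) / n
    total = 0.0
    for _ in range(n_rep):
        y = [f + rng.gauss(0.0, sigma) for f in signal]
        total += sq_norm_over_n(project_on_bins(y, n_bins), signal)
    return bias, variance, total / n_rep


if __name__ == "__main__":
    bias, variance, risk = bias_variance_check()
    print(f"bias + variance: {bias + variance:.5f}")
    print(f"Monte Carlo risk: {risk:.5f}")
```

Increasing `n_bins` lowers the bias term and raises the variance term, which is the trade-off that model selection has to resolve.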

Unbiased risk estimation principle

$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ \operatorname{crit}(m) \right\}$$

$$\operatorname{crit}_{\mathrm{id}}(m) = \ell(s^\star, \hat{s}_m(D_n))$$

Heuristics:

$$\operatorname{crit}(m) \approx \mathbb{E}\left[ \ell(s^\star, \hat{s}_m(D_n)) \right]$$

$\Rightarrow$ valid if $\operatorname{Card}(\mathcal{M}_n)$ is not too large (+ concentration inequalities)

Why should the empirical risk be penalized?

[Figure: x-axis: dimension; curves: bias, excess risk, E[empirical risk]]

Penalization

Penalization: $\operatorname{crit}(m) = P_n\gamma(\hat{s}_m) + \operatorname{pen}(m)$

$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\hat{s}_m) + \operatorname{pen}(m) \right\}$$

Ideal penalty:

$$\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(\hat{s}_m)$$

Mallows' heuristics:

$$\operatorname{pen}(m) \approx \mathbb{E}\left[ \operatorname{pen}_{\mathrm{id}}(m) \right] \quad \Rightarrow \quad \text{oracle inequality}$$
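The name "ideal penalty" can be justified in one line: adding it to the empirical risk recovers the excess loss up to a term that does not depend on $m$. A short derivation, using only the definitions of the excess loss and of $\operatorname{pen}_{\mathrm{id}}$:

```latex
\begin{align*}
\ell(s^\star, \hat{s}_m)
  &= P\gamma(\hat{s}_m) - P\gamma(s^\star) \\
  &= P_n\gamma(\hat{s}_m) + (P - P_n)\gamma(\hat{s}_m) - P\gamma(s^\star) \\
  &= P_n\gamma(\hat{s}_m) + \operatorname{pen}_{\mathrm{id}}(m) - P\gamma(s^\star) .
\end{align*}
```

Since $P\gamma(s^\star)$ is the same for every $m$, minimizing $P_n\gamma(\hat{s}_m) + \operatorname{pen}_{\mathrm{id}}(m)$ over $m$ is equivalent to minimizing the excess loss, that is, to selecting the oracle.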

Example: homoscedastic regression on a fixed design

Recall that

$Y = F + \varepsilon$ with $\mathbb{E}\left[ \varepsilon_i^2 \right] = \sigma^2$

$\hat{F}_m = A_m Y$ with $A_m = A_m^\top = A_m^2$ and $\operatorname{tr}(A_m) = \dim(S_m)$

$$\mathbb{E}\left[ \frac{1}{n} \lVert \hat{F}_m - F \rVert^2 \right] = \frac{1}{n} \lVert (A_m - I)F \rVert^2 + \frac{\sigma^2 \dim(S_m)}{n}$$

$\Rightarrow$ Empirical risk? Ideal penalty? Expectations?

$$\operatorname{pen}_{\mathrm{id}}(m) = \frac{2}{n} \langle A_m \varepsilon, \varepsilon \rangle + \frac{2}{n} \langle (A_m - I_n)F, \varepsilon \rangle$$

$$\mathbb{E}\left[ \operatorname{pen}_{\mathrm{id}}(m) \right] = \frac{2\sigma^2 D_m}{n} \quad \Rightarrow \quad C_p \text{ (Mallows, 1973)}$$
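Mallows' $C_p$ then amounts to minimizing the empirical risk plus $2\sigma^2 D_m / n$ over the collection of models. The sketch below illustrates this under assumptions made only for the example: histogram models with $D$ equal-size cells on a regular design (so the least-squares estimator is block averaging), a known noise variance `sigma2`, and a hypothetical sinusoidal signal in the demonstration. In practice $\sigma^2$ is usually unknown and has to be estimated.

```python
import math
import random


def project_on_bins(v, n_bins):
    """Least-squares fit on the histogram model: replace each coordinate by its bin mean."""
    size = len(v) // n_bins  # assumes n_bins divides len(v)
    out = []
    for b in range(n_bins):
        block = v[b * size:(b + 1) * size]
        out.extend([sum(block) / size] * size)
    return out


def mallows_cp(y, n_bins, sigma2):
    """Empirical risk plus the penalty 2 sigma^2 D_m / n for the model with n_bins cells."""
    n = len(y)
    fitted = project_on_bins(y, n_bins)
    empirical_risk = sum((a - b) ** 2 for a, b in zip(y, fitted)) / n
    return empirical_risk + 2.0 * sigma2 * n_bins / n


def select_by_cp(y, candidates, sigma2):
    """Return the number of cells minimizing the penalized criterion."""
    return min(candidates, key=lambda d: mallows_cp(y, d, sigma2))


if __name__ == "__main__":
    rng = random.Random(0)
    n, sigma = 60, 0.3
    signal = [math.sin(2.0 * math.pi * (i + 0.5) / n) for i in range(n)]  # hypothetical F
    y = [f + rng.gauss(0.0, sigma) for f in signal]
    candidates = [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]  # divisors of n
    print("selected number of cells:", select_by_cp(y, candidates, sigma ** 2))
```

Without the penalty, the criterion would always favour the largest model (here $D = n$, which interpolates the data), since the empirical risk decreases as the model grows.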

Classical penalties

Cp (Mallows, 1973; regression, least-squares estimator): $2\sigma^2 D_m / n$

CL (Mallows, 1973; regression, linear estimator $\hat{F}_m = A_m Y$): $2\sigma^2 \operatorname{tr}(A_m) / n$

AIC (Akaike, 1973; log-likelihood, $p$ degrees of freedom): $2p / n$

BIC (Schwarz, 1978; log-likelihood, identification goal): $\ln(n)\, p / n$

[Figure: Hold-out]

[Figure: Hold-out: training sample]

[Figure: Hold-out: validation sample]
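The hold-out principle (train each candidate on a training sample, compare them on a separate validation sample) can be sketched for the choice of $k$ in k-nearest-neighbours regression. This is a minimal illustration with assumptions of its own: one-dimensional inputs, the least-squares contrast, a random split with a fixed training size, and hypothetical function names.

```python
import math
import random


def knn_predict(x, train, k):
    """k-nearest-neighbours regression in dimension 1: average the Y of the k closest X."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in neighbours) / len(neighbours)


def hold_out_risk(train, validation, k):
    """Empirical least-squares risk, on the validation sample, of k-NN fitted on the training sample."""
    return sum((knn_predict(x, train, k) - y) ** 2 for x, y in validation) / len(validation)


def select_k_by_hold_out(data, candidates, n_train, seed=0):
    """Split data at random, then return the k with smallest validation risk and all risks."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, validation = shuffled[:n_train], shuffled[n_train:]
    risks = {k: hold_out_risk(train, validation, k) for k in candidates}
    return min(risks, key=risks.get), risks


if __name__ == "__main__":
    rng = random.Random(1)
    data = []
    for _ in range(200):
        x = rng.uniform(0.0, 10.0)
        data.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))  # hypothetical signal and noise
    best_k, risks = select_k_by_hold_out(data, [1, 5, 20, 100], n_train=100)
    print("selected k:", best_k)
    for k, risk in risks.items():
        print(f"k = {k:3d}: validation risk = {risk:.4f}")
```

Because the validation points are not used for training, the validation risk of each candidate is an unbiased estimate of its risk for a training sample of size `n_train`, at the price of fitting on fewer points than the full sample.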

Unbiased risk estimation principle

Heuristics:

$$\mathbb{E}\left[ \operatorname{crit}(m) \right] \approx \mathbb{E}\left[ P\gamma(\hat{s}_m) \right] \quad \Leftrightarrow \quad \mathbb{E}\left[ \operatorname{pen}(m) \right] \approx \mathbb{E}\left[ \operatorname{pen}_{\mathrm{id}}(m) \right]$$

Examples:

FPE (Akaike, 1970), SURE (Stein, 1981)

some kinds of cross-validation (e.g., leave-p-out, $p \ll n$)

log-likelihood: AIC (Akaike, 1973), AICc (Sugiura, 1978; Hurvich & Tsai, 1989)

least-squares: Cp, CL (Mallows, 1973), GCV (Craven & Wahba, 1979)

covariance penalties (Efron, 2004)

bootstrap penalty (Efron, 1983), resampling (A., 2009)

...


Probability theory: measure concentration

Empirical processes:

$$(P_n - P)\gamma(t) \quad \text{or} \quad \sup_{t \in S} \left\{ (P_n - P)\gamma(t) \right\}$$

Concentration of quadratic terms, $\lVert M\varepsilon \rVert^2$, $\chi^2$-type statistics (writing them as a sup, or through the general problem of concentration of U-statistics)

More complex quantities, such as the "ideal penalty"

$$(P - P_n)\gamma(\hat{s}_m(D_n))$$

Probability theory

Exact computation or upper bounds on expectations:

$$\mathbb{E}\left[ \sup_{t \in S} \left\{ (P_n - P)\gamma(t) \right\} \right], \qquad \mathbb{E}\left[ (P - P_n)\gamma(\hat{s}_m(D_n)) \right]$$

Understanding the risk as a function of $n$:

$$\mathbb{E}\left[ P\gamma(\hat{s}_m(D_n)) \right]$$

Resampling process

Control of remainder terms (variance, deviations, ...) compared to expectations

...
