Metamodel construction for sensitivity analysis


Jean-François Coeurjolly & Adeline Leclercq-Samson


Sylvie Huet$^{1}$ and Marie-Luce Taupin$^{2,1}$

Abstract. We propose to estimate a metamodel and the sensitivity indices of a complex model $m$ in the Gaussian regression framework. Our approach combines methods for sensitivity analysis of complex models with statistical tools for sparse non-parametric estimation in the multivariate Gaussian regression model. It rests on the construction of a metamodel that approximates the Hoeffding-Sobol decomposition of $m$. This metamodel belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to a functional ANOVA decomposition. The metamodel is estimated by penalized least-squares minimization, which selects the subsets of variables that contribute to predicting the output and yields estimates of the sensitivity indices of $m$. We establish an oracle-type inequality for the risk of the estimator, describe the procedure for estimating the metamodel and the sensitivity indices, and assess the performance of the procedure in a simulation study.

1. Introduction

We consider a Gaussian regression model

$$Y = m(X) + \sigma\varepsilon, \tag{1}$$

where $X$ is a $d$-dimensional random vector with a known distribution $P_X = P_1 \times \cdots \times P_d$ on $\mathcal{X}$, a compact subset of $\mathbb{R}^d$, and $\varepsilon$ is independent of $X$ and distributed as a $\mathcal{N}(0,1)$ variable. The variance $\sigma^2$ is unknown and the number of variables $d$ may be large. The function $m$ is a complex and unknown function from $\mathbb{R}^d$ to $\mathbb{R}$. It may present strong non-linearities and high-order interaction effects between its coordinates. On the basis of an $n$-sample $(Y_i, X_i)$, $i = 1, \ldots, n$, we aim to construct metamodels and perform sensitivity analysis in order to determine the influence of each variable or group of variables on the output.

1 UR 1404 MaIAGE, INRA Jouy-en-Josas, France; e-mail: [email protected]

2 Laboratoire LaMME, UMR CNRS 8071 - USC INRA, Université d’Evry Val d’Essonne, France; e-mail: [email protected]

© EDP Sciences, SMAI 2017

Our approach combines methods for sensitivity analysis of a complex model with statistical tools for sparse non-parametric estimation in the multivariate Gaussian regression model. It rests on the construction of a metamodel that approximates the Hoeffding-Sobol decomposition of the function $m$. This metamodel belongs to a reproducing kernel Hilbert space (RKHS) constructed as a direct sum of Hilbert spaces, leading to a functional ANOVA decomposition involving variables and interactions between them. The metamodel is estimated by penalized least-squares minimization, which selects the subsets of the variables $X$ that contribute to predicting the output $Y$. Finally, the estimated metamodel yields estimates of the sensitivity indices of $m$.

A lot of work has been done on meta-modelling and on the estimation of sensitivity indices.

For a complete account of global sensitivity analysis, see for example the book by Saltelli et al. [35]. Let us briefly present the context of usual global sensitivity analysis. Suppose that we are able to calculate the outputs $z$ of a model $m$ for $n$ realizations of the input vector $X$, such that $z_i = m(X_i)$ for $i = 1, \ldots, n$. Starting from the values $(z_i, X_i)$, $i = 1, \ldots, n$, the objectives of meta-modelling and global sensitivity analysis are to approximate the function $m$ by what is called a metamodel, or to quantify the influence of some subsets of the variables $X$ on the output $z$. The metamodel helps to understand the behavior of the model, or speeds up future calculations when it is used in place of the original model $m$.

In particular, when the input variables $X$ are independent and $m$ is square integrable, one may consider the classical Hoeffding-Sobol decomposition [36, 41], which leads to writing $m$ according to its ANOVA functional expansion:

$$m(x) = m_0 + \sum_{v \in \mathcal{P}} m_v(x_v), \tag{2}$$

where $\mathcal{P}$ denotes the set of parts of $\{1, \ldots, d\}$ with dimension 1 to $d$ and where, for all $x \in \mathbb{R}^d$, $x_v$ denotes the vector with components $x_j$ for $j \in v$. The functions $m_v$ are centered and orthogonal in $L^2(P_X)$, leading to the following decomposition of the variance of $m$: $\mathrm{Var}(m(x)) = \sum_{v} \mathrm{Var}(m_v(x_v))$. The Sobol sensitivity indices, introduced by Sobol [36], are defined for any group of variables $x_v$, $v \in \mathcal{P}$, by

$$S_v = \frac{\mathrm{Var}(m_v(x_v))}{\mathrm{Var}(m(x))}.$$

They quantify the contribution of a subset of the variables $x$ to the output $m(x)$. Several approaches are available for estimating these sensitivity indices; see for example Iooss and Lemaître [17] for a recent review. Among them, let us consider the one based on metamodel construction, which gives the sensitivity indices directly. Generally one considers metamodels that correspond to an ANOVA-type decomposition and that are candidates for approximating the Hoeffding decomposition of $m$. The ANOVA-type decomposition leads to considering functions defined as follows:

$$f : \mathcal{X} \to \mathbb{R}, \quad f(x) = f_0 + \sum_{v \in \mathcal{P}} f_v(x_v), \quad E_{P_X} f_v(x_v) = E_{P_X} f_v(x_v) f_{v'}(x_{v'}) = 0 \quad \forall v, v' \in \mathcal{P},\ v' \neq v, \tag{3}$$

for functions $f_v$ chosen in some functional spaces. The polynomial chaos construction (see for example Ghanem and Spanos [14], Soize and Ghanem [38]) can be used to approximate the Hoeffding decomposition of $m$. This approach was considered by Blatman and Sudret [5], who propose a method for truncating the polynomial chaos expansion (keeping only polynomials with degree less than some integer) and then an algorithm based on least angle regression for selecting the terms in the expansion. For the same purpose, Gu and Wu [15] propose an algorithm based on the hierarchy principle (lower-order effects are more likely to be important than higher-order effects) and on the heredity principle (an interaction can be active only if one or all of its parent effects are also active). This approach is close to the one proposed by Bach [2] for variable selection based on hierarchical kernel learning.

a functional ANOVA decomposition (see Equation (3)), such that the projection of $m$ onto the RKHS is an approximation of the Hoeffding decomposition of $m$.

Following Lin and Zhang [25], Touzani and Busby [40] propose an algorithm to calculate the penalized least-squares estimator of $m$ on the RKHS, where the least-squares criterion is penalized by the sum of the norms of $m$ on each Hilbert subspace. This group-lasso type procedure both selects and calculates the non-zero terms in the functional ANOVA decomposition.

Our objective is to propose an estimator of a metamodel that approximates the Hoeffding decomposition of $m$ in the Gaussian regression model defined at Equation (1), and to deduce from this estimated metamodel estimators of the sensitivity indices of $m$. Contrary to the usual setting of sensitivity analysis, where $m(X_i)$ is available, only the observations $Y$ are available, which leads us to consider the nonparametric multivariate regression setting.

Let us briefly describe the methods related to this regression setting and review their theoretical properties, starting with papers assuming a univariate additive decomposition for the function $m$ in the context of high-dimensional sparse additive models. Precisely, denote by $\mathcal{F}_{1\text{-add}}$ the set of functions $f$ defined on $\mathcal{X}$ such that $f(x) = f_0 + \sum_{a=1}^{d} f_a(x_a)$, where $f_0$ is a constant and where, for $a = 1, \ldots, d$, the functions $f_a$ are centered and square-integrable with respect to $P_a$. For each function $f$, the set $S_f$ of indices $a \in \{1, \ldots, d\}$ such that $f_a$ is not identically zero is called the active set of $f$.

Ravikumar et al. [32] propose a group-lasso procedure where each function $f_a$ is approximated by its truncated decomposition on a basis of functions. They provide an algorithm and, assuming that the function $m$ belongs to the set $\mathcal{F}_{1\text{-add}}$ and that $S_m$ is sparse, prove the consistency of the active set and of the risk of the estimator of $m$.

Meier et al. [27] propose to combine a sparsity penalty (group-lasso) and a smoothness penalty (ridge) for estimating $m(x)$. Considering the fixed design framework, they establish some oracle properties of the empirical risk for estimating the projection of $m$ onto the set of univariate additive functions $\mathcal{F}_{1\text{-add}}$.

Raskutti et al. [31] consider the case where each univariate function $f_a$ belongs to a RKHS and, like Meier et al. [27], combine a sparsity and a smoothness penalty. Assuming that the $d$ variables $X$ are independent, they derive upper bounds for the integrated and the empirical risks, as well as a lower bound for the integrated risk over spaces of sparse additive models in which each component is bounded with respect to the RKHS norm.

Additive sparse modelling is too restrictive in practical settings because it does not take into account interactions between variables that may affect the relationship between $Y$ and $X$. The generalization of additive smoothing splines to interaction smoothing splines, leading to an ANOVA-type decomposition (see Equation (3)), was proposed by several authors (see for example Wahba [42], Friedman [13], Wahba et al. [43]).

To control smoothness and to enforce sparsity in the ANOVA-type decomposition, Gunn and Kandola [16] propose to consider the ANOVA decomposition as a weighted linear sum of kernels and to use a lasso penalty on the weights to select the terms in the decomposition, as well as a ridge penalty to ensure smoothness of the kernel expansion. The COSSO proposed by Lin and Zhang [25] is based on a smoothness penalty defined as the sum of the RKHS norms of the functions $f_v$. The authors study the existence and the rate of convergence of the estimator. In a more general framework, where the function $m$ is written as a linear span of a large number of kernels, Koltchinskii and Yuan [21] established oracle inequalities on the excess risk assuming that the function $m$ has a sparse representation (the set of $v \in \mathcal{P}$ such that $f_v^*$ is non-zero is sparse). The authors generalized their results to a penalty function that combines sparsity and smoothness (see Koltchinskii et al. [22]), as proposed by Meier et al. [27] and Raskutti et al. [31].

Recently Kandasamy and Wu [19] proposed an estimator called SALSA, based on a ridge penalty, where the ANOVA-type decomposition is restricted to sets $v \in \mathcal{P}$ such that $|v| \le D_{\max}$. The authors propose to choose $D_{\max}$ via a cross-validation procedure.

Our contributions


works in the framework of nonparametric estimation of sparse additive models, we propose a penalized least-square estimator where the penalty function enforces both the sparsity and the smoothness of the terms in the decomposition. We show that our estimator satisfies an oracle inequality with respect to the empirical and integrated risks.

Our procedure both selects and estimates the terms in the ANOVA decomposition and, therefore, selects the non-zero sensitivity indices and estimates them. In particular, it makes it possible to estimate Sobol indices of high order, a point known to be difficult in practice.

Finally, using convex optimization tools, we develop an algorithm in R [30] (available on request) for calculating the estimator. A simulation study shows the good performance of our estimator in practice.

The paper is organised as follows. The RKHS construction based on ANOVA kernels and the procedure for estimating a metamodel are presented in Section 2. The estimators of the Sobol indices are given in Section 3. The theoretical properties of the metamodel estimator are stated in Theorem 4.1 and Corollaries 4.1 and 4.2, whose proofs are postponed to Sections 7 and 8. Section 5 is devoted to the calculation of the estimator and Section 6 to the simulation study.

2. Meta-modelling

We start from the Hoeffding decomposition (see Sobol [37] and Van der Vaart [41], p. 157) of the function $m$, which consists in writing $m$ as in Equation (2):

$$m(x) = m_0 + \sum_{v \in \mathcal{P}} m_v(x_v),$$

where $\mathcal{P}$ denotes the set of parts of $\{1, \ldots, d\}$ with dimension 1 to $d$ and where, for all $x \in \mathbb{R}^d$, $x_v$ denotes the vector with components $x_j$ for $j \in v$. For all $v \neq v'$ in $\mathcal{P}$,

$$E_X(m_v(X_v)) = E_X(m_v(X_v)\, m_{v'}(X_{v'})) = 0.$$

We propose to consider a functional space based on the tensor product of reproducing kernel Hilbert spaces (RKHS), and to approximate the unknown function $m$ by its projection, denoted $f^*$, on such a RKHS. One of the key points is to construct the space $\mathcal{H}$ such that the terms of the decomposition of a function $f$ in $\mathcal{H}$ correspond to its Hoeffding-Sobol decomposition.

2.1. RKHS construction

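Let us describe the construction of the spaces $\mathcal{H}$ based on ANOVA kernels, a construction given by Durrande et al. [12].

Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ be a compact subset of $\mathbb{R}^d$. For each coordinate $a \in \{1, \ldots, d\}$, we choose a RKHS $\mathcal{H}_a$ and its associated kernel $k_a$ defined on the set $\mathcal{X}_a \subset \mathbb{R}$ such that the two following properties are satisfied:

(1) $k_a : \mathcal{X}_a \times \mathcal{X}_a \to \mathbb{R}$ is $P_a \times P_a$ measurable,

(2) $E_{P_a} \sqrt{k_a(X_a, X_a)} < \infty$.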

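The RKHS $\mathcal{H}_a$ may be decomposed as $\mathcal{H}_a = \mathcal{H}_{0a} \overset{\perp}{\oplus} \mathcal{H}_{1a}$, where

$$\mathcal{H}_{0a} = \{f_a \in \mathcal{H}_a,\ E_{P_a}(f_a(X_a)) = 0\}, \qquad \mathcal{H}_{1a} = \{f_a \in \mathcal{H}_a,\ f_a(X_a) = C\},$$

the kernel $k_{0a}$ associated with the RKHS $\mathcal{H}_{0a}$ being defined as follows (see Berlinet and Thomas-Agnan [4]):

$$k_{0a}(x_a, x'_a) = k_a(x_a, x'_a) - \frac{E_{U \sim P_a}(k_a(x_a, U))\, E_{U \sim P_a}(k_a(x'_a, U))}{E_{(U,V) \sim P_a \times P_a}(k_a(U, V))}.$$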
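As an illustration, the centered kernel can be approximated numerically. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the Gaussian base kernel `gaussian_kernel`, its `length_scale`, and the choice of $P_a$ as the uniform distribution on $[0, 1]$ are hypothetical; expectations are replaced by a midpoint quadrature; and the product of the two single expectations is assumed to be normalized by the double expectation $E\,k_a(U, V)$, as in the construction of Durrande et al. [12].

```python
import math


def gaussian_kernel(x, y, length_scale=0.5):
    """Hypothetical base kernel k_a (any kernel meeting the two conditions would do)."""
    return math.exp(-((x - y) ** 2) / (2.0 * length_scale ** 2))


def make_centered_kernel(k, nodes):
    """Build k0 from k, replacing expectations under P_a by averages over `nodes`."""
    n = len(nodes)

    def mean_k(x):
        # Approximates E_U k(x, U)
        return sum(k(x, u) for u in nodes) / n

    # Approximates E_{U,V} k(U, V)
    double_mean = sum(mean_k(u) for u in nodes) / n

    def k0(x, y):
        return k(x, y) - mean_k(x) * mean_k(y) / double_mean

    return k0


# Midpoint rule on [0, 1], standing in for P_a = Uniform(0, 1) (an assumption).
nodes = [(i + 0.5) / 200 for i in range(200)]
k0 = make_centered_kernel(gaussian_kernel, nodes)

# Under the same quadrature, the function u -> k0(x, u) has mean zero for every x.
residual = sum(k0(0.3, u) for u in nodes) / len(nodes)
```

With this normalization the mean of $k_{0a}(x_a, \cdot)$ vanishes, which is the property that makes the functions of $\mathcal{H}_{0a}$ centered.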

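The ANOVA kernel is defined as

$$k(x, x') = \prod_{a=1}^{d} \big(1 + k_{0a}(x_a, x'_a)\big) = 1 + \sum_{v \in \mathcal{P}} k_v(x_v, x'_v), \quad \text{where } k_v(x_v, x'_v) = \prod_{a \in v} k_{0a}(x_a, x'_a),$$

and the corresponding RKHS is

$$\mathcal{H} = \bigotimes_{a=1}^{d} \Big(1 \overset{\perp}{\oplus} \mathcal{H}_{0a}\Big) = 1 + \sum_{v \in \mathcal{P}} \mathcal{H}_v,$$

where the RKHS $\mathcal{H}_v$ is associated with the kernel $k_v$. According to this construction, any function $f \in \mathcal{H}$ satisfies

$$f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f_0 + \sum_{v \in \mathcal{P}} f_v(x),$$

where $f_v(x) = \langle f, k_v(x, \cdot) \rangle_{\mathcal{H}}$ depends on $x_v$ only. For all $v \in \mathcal{P}$, $f_v(x_v)$ is centered, and for all $v' \neq v$, $f_v(x_v)$ and $f_{v'}(x_{v'})$ are uncorrelated. We thus get the Hoeffding decomposition of $f$.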
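The identity between the product form and the sum form of the ANOVA kernel can be checked numerically. The sketch below is illustrative only: the function name `anova_kernel_terms` and the numerical values standing for $k_{0a}(x_a, x'_a)$ are hypothetical. It enumerates the $2^d - 1$ non-empty subsets $v$ and compares both sides.

```python
from itertools import combinations
from math import prod


def anova_kernel_terms(k0_values):
    """Map each non-empty subset v of {0, ..., d-1} to k_v = prod_{a in v} k0_a."""
    d = len(k0_values)
    terms = {}
    for size in range(1, d + 1):
        for v in combinations(range(d), size):
            terms[v] = prod(k0_values[a] for a in v)
    return terms


# Hypothetical values of k0_a(x_a, x'_a) for d = 3 coordinates.
k0_values = [0.4, -0.1, 0.25]
terms = anova_kernel_terms(k0_values)

product_form = prod(1.0 + value for value in k0_values)  # prod_a (1 + k0_a)
sum_form = 1.0 + sum(terms.values())                     # 1 + sum_v k_v
```

The number of terms grows as $2^d - 1$, which is why the enumeration is only practical for small $d$ or for interactions of limited order.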

2.2. Approximating the Hoeffding decomposition of m

Let $f^* = f_0^* + \sum_{v \in \mathcal{P}} f_v^*$ be the function that minimizes

$$\|m - f\|^2_{L^2(P_X)} = E_X\big(m(X) - f(X)\big)^2$$

over functions $f \in \mathcal{H}$. This $f^*$ can be viewed as an approximation of $m$, and more specifically its Hoeffding decomposition is an approximation of the Hoeffding decomposition of $m$. Therefore, if the Hoeffding decomposition of $m$ is written as in Equation (2), each function $f_v^*$ approximates the function $m_v$.

The idea is to propose an estimator of $f^*$ as an estimator of $m$.

2.3. Selection step

Since $\mathcal{P}$ is the set of parts of $\{1, \ldots, d\}$, the number of functions $f_v^*$ equals the cardinality of $\mathcal{P}$, namely $2^d - 1$, which may be huge. Our construction is thus associated with a selection strategy.

The selection of the $f_v^*$ in $f^*$ is based on a ridge-group-sparse type procedure, which minimizes a penalized least-squares criterion over functions $f \in \mathcal{H}$. The least-squares criterion is penalized in order both to select few terms in the additive decomposition of $f$ over sets $v \in \mathcal{P}$ and to favour smoothness of the estimated $f_v$. The ridge regularization is ensured by controlling the norm of $f_v$ in the Hilbert space $\mathcal{H}_v$ for all $v$, and the group-sparse regularization is strengthened by controlling the empirical norm of $f_v$, defined as

$$\|f_v\|_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} f_v^2(X_{v,i})}.$$

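For any $f \in \mathcal{H}$ such that $f = f_0 + \sum_{v \in \mathcal{P}} f_v$, and for some tuning parameters $(\mu_v, \gamma_v, v \in \mathcal{P})$, let $L(f)$ be defined as

$$L(f) = \frac{1}{n} \sum_{i=1}^{n} \Big(Y_i - f_0 - \sum_{v \in \mathcal{P}} f_v(X_{v,i})\Big)^2 + \sum_{v \in \mathcal{P}} \mu_v \|f_v\|_{\mathcal{H}_v} + \sum_{v \in \mathcal{P}} \gamma_v \|f_v\|_n. \tag{4}$$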

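Let us define the set of functions $\mathcal{F}$:

$$\mathcal{F} = \Big\{ f \text{ such that } f = f_0 + \sum_{v \in \mathcal{P}} f_v,\ f_v \in \mathcal{H}_v,\ \|f_v\|_{\mathcal{H}_v} \le 1 \Big\}.$$

Then the estimator $\hat{f}$ is defined as

$$\hat{f} = \mathrm{argmin}\{L(f),\ f \in \mathcal{F}\}. \tag{6}$$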

Remark 2.1. The construction of the RKHS described above makes it possible to consider functional spaces that are well suited to the smoothness of the function $m$, irrespective of the distribution of $X$. Indeed, the kernels $k_{0a}$ depend on the distribution of $X$ only through the projection onto the space of constant functions. In comparison, the decomposition based on the truncated polynomial chaos expansion, used for sensitivity analysis (see Blatman and Sudret [5]), is based on the distribution of $X$, and only the choice of the truncation handles the smoothness of the approximation.

3. Sensitivity analysis

3.1. Sobol indices

Let us go back to the Hoeffding decomposition in Equation (2). The orthogonality between any two terms in this decomposition leads to the additive decomposition of the variance of $m(x)$:

$$\mathrm{Var}(m(x)) = \sum_{v \in \mathcal{P}} \mathrm{Var}(m_v(x_v)).$$

Each of these variance terms is related to a Sobol index [36]. For example, the Sobol index linked with the interaction between the variables $x_v$ is defined as

$$S_v = \frac{\mathrm{Var}(m_v(x_v))}{\mathrm{Var}(m(x))},$$

and the global Sobol index for the variable $x_a$, $a \in \{1, \ldots, d\}$, is

$$G_a = \frac{\sum_{v \supseteq \{a\}} \mathrm{Var}(m_v(x_v))}{\mathrm{Var}(m(x))}.$$

These Sobol indices and global Sobol indices quantify the contribution of a subset of the variables $x$ to the output $m(x)$. As said in the introduction, direct estimation of these Sobol indices may require a lot of calculations. We consider here methods based on metamodels, which give the sensitivity indices directly.
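To make the two definitions concrete, here is a small sketch that computes $S_v$ and $G_a$ from a table of component variances. The function names and the toy model are hypothetical: we take $m(x) = x_1 + x_1 x_2$ with independent centered inputs of unit variance, for which $m_{\{1\}} = x_1$, $m_{\{2\}} = 0$ and $m_{\{1,2\}} = x_1 x_2$.

```python
def sobol_indices(component_variances):
    """S_v = Var(m_v) / Var(m), where Var(m) is the sum of the component variances."""
    total = sum(component_variances.values())
    return {v: var / total for v, var in component_variances.items()}


def global_sobol_index(component_variances, a):
    """G_a = sum of Var(m_v) over all subsets v containing a, divided by Var(m)."""
    total = sum(component_variances.values())
    return sum(var for v, var in component_variances.items() if a in v) / total


# Toy model (an assumption): m(x) = x1 + x1 * x2, inputs independent,
# centered, with unit variance, so Var(x1) = 1 and Var(x1 * x2) = 1.
variances = {(1,): 1.0, (2,): 0.0, (1, 2): 1.0}

S = sobol_indices(variances)
G1 = global_sobol_index(variances, 1)
G2 = global_sobol_index(variances, 2)
```

In this toy case the variable $x_2$ has no main effect ($S_{\{2\}} = 0$) but still matters through the interaction, which is what the global index $G_2$ captures.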

3.2. Estimation of Sobol indices

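Thanks to the orthogonality property of functions in $\mathcal{H}$, the variance of $m(x)$ is estimated by

$$\widehat{\mathrm{Var}}(m(x)) = \sum_{v \in \mathcal{P}} \widehat{\mathrm{Var}}(m_v(x_v)), \quad \text{where } \widehat{\mathrm{Var}}(m_v(x_v)) = E_X \hat{f}_v^2(X_v) = \|\hat{f}_v\|^2_{L^2(P_X)}. \tag{7}$$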

In practice, in order to avoid calculating the variance of $\hat{f}_v(X_v)$, one may use an estimator based on the empirical variances of the functions $\hat{f}_v$. Precisely, if $\hat{f}_{v,\cdot}$ is the mean of the $\hat{f}_v(X_{v,i})$, for $i = 1, \ldots, n$, then

$$\widehat{\mathrm{Var}}_{\mathrm{emp}}(m_v(x_v)) = \frac{1}{n-1} \sum_{i=1}^{n} \big(\hat{f}_v(X_{v,i}) - \hat{f}_{v,\cdot}\big)^2. \tag{8}$$
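The empirical estimator (8) is the usual unbiased sample variance of the fitted component values. The following sketch shows one way to turn it into estimated Sobol indices, assuming (as in Equation (7)) that the total variance is estimated by the sum of the component variances. The function name `empirical_sobol` and the fitted values are hypothetical.

```python
from statistics import variance


def empirical_sobol(fitted_components):
    """Estimate S_v from the values f_v_hat(X_{v,i}), i = 1..n, of each component.

    `fitted_components` maps a group v to the list of its n fitted values.
    statistics.variance uses the sample mean and the 1/(n-1) normalization,
    as in Equation (8). At least one component must be non-constant.
    """
    emp_var = {v: variance(values) for v, values in fitted_components.items()}
    total = sum(emp_var.values())
    return {v: var / total for v, var in emp_var.items()}


# Hypothetical fitted values for two selected groups and n = 4 observations.
fitted = {
    (1,): [1.0, -1.0, 1.0, -1.0],
    (1, 2): [0.5, 0.5, -0.5, -0.5],
}
S_hat = empirical_sobol(fitted)
```

Groups excluded by the selection step have fitted values identically zero and can simply be omitted from the input.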

4. Theoretical result: oracle inequality for metamodel

In this section we state the oracle inequality for the estimated metamodel $\hat{f}$, which approximates the Hoeffding decomposition of the unknown function $m$.

4.1. Notations and Assumptions

For a function $f \in \mathcal{H}$, $f = f_0 + \sum_{v \in \mathcal{P}} f_v$, we denote by $S_f$ its support and by $|S_f|$ its cardinality. More precisely,

$$S_f = \{v \in \mathcal{P},\ f_v \neq 0\}. \tag{9}$$

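We consider RKHS $\mathcal{H}_v$, $v \in \mathcal{P}$, satisfying the following assumptions:

• for all $f_v \in \mathcal{H}_v$, $E_X f_v^2(X) < \infty$,

• for all $f_v \in \mathcal{H}_v$, $f_{v'} \in \mathcal{H}_{v'}$ with $v' \neq v$, $E_X f_v(X) = 0$ and $E_X f_v(X) f_{v'}(X) = 0$,

• there exists $R_0 > 0$ such that

$$\forall f_v \in \mathcal{H}_v, \quad \|f_v\|_\infty = \sup\{|f_v(X)|,\ X \in \mathcal{X}\} \le R_0. \tag{10}$$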

For each $v \in \mathcal{P}$, let $\omega_{v,k}$, for $k \ge 1$, be the eigenvalues of the operator associated with the reproducing kernel $k_v$, arranged in decreasing order. Let us define the function $Q_{n,v}(t)$, for positive $t$, as follows:

$$Q_{n,v}(t) = \sqrt{\frac{5}{n} \sum_{k \ge 1} \min(t^2, \omega_{v,k})}, \tag{11}$$

and for some $\Delta > 0$ let $\nu_{n,v}$ be defined as follows:

$$\nu_{n,v} = \inf\{t \text{ such that } Q_{n,v}(t) \le \Delta t^2\}. \tag{12}$$

For each $v \in \mathcal{P}$, $\nu_{n,v}$ refers to the so-called critical univariate rate, the minimax-optimal rate for $L^2(P_X)$-estimation of a single univariate function in the Hilbert space $\mathcal{H}_v$ (e.g. Mendelson [28]).

Our choices of regularization parameters and rates are specified in terms of the quantities

$$\lambda_{n,v} = \max\big\{\nu_{n,v},\ \sqrt{d/n}\big\}. \tag{13}$$
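The quantities (11) to (13) are easy to evaluate numerically once a finite list of eigenvalues is available. The sketch below is an illustration under assumptions, not part of the paper's procedure: the eigenvalue list is truncated and hypothetical, $\Delta$ is supplied by the user, and the infimum in (12) is located by bisection, which is valid because $Q_{n,v}(t)/t^2$ is non-increasing in $t$.

```python
import math


def q_function(t, eigenvalues, n):
    """Q_{n,v}(t) of Equation (11) for a finite (truncated) list of eigenvalues."""
    return math.sqrt(5.0 / n * sum(min(t * t, w) for w in eigenvalues))


def critical_rate(eigenvalues, n, delta, iterations=200):
    """nu_{n,v} of Equation (12): smallest t > 0 with Q(t) <= delta * t^2."""

    def feasible(t):
        return q_function(t, eigenvalues, n) <= delta * t * t

    hi = 1.0
    while not feasible(hi):  # Q is bounded, so a feasible upper bound exists
        hi *= 2.0
    lo = 0.0
    for _ in range(iterations):  # bisection on the threshold
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


def regularization_level(nu, d, n):
    """lambda_{n,v} of Equation (13)."""
    return max(nu, math.sqrt(d / n))


# Hypothetical single-eigenvalue examples with n = 5 and Delta = 1.
nu_a = critical_rate([1.0], n=5, delta=1.0)
nu_b = critical_rate([1e-6], n=5, delta=1.0)
lam_b = regularization_level(nu_b, d=2, n=5)
```

In the second example the critical rate is smaller than $\sqrt{d/n}$, so the regularization level is driven by the dimension term.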

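Theorem 4.1. Let us consider the regression model defined at Equation (1). Let $(Y_i, X_i)$, $i = 1, \ldots, n$, be an $n$-sample with the same law as $(Y, X)$. Let $\hat{f}$ be defined by (6).

If there exist constants $C_l$, $l = 1, 2, 3$, with $C_1 \ge 1$, and $0 < \eta < 1$ such that the following conditions are satisfied:

$$\text{for all } v \in \mathcal{P}, \quad \lambda_{n,v} \le \min\left\{\frac{\gamma_v}{C_1},\ \sqrt{\frac{\mu_v}{C_1}}\right\}, \quad n\lambda_{n,v}^2 \ge -C_2 \log \lambda_{n,v}, \tag{14}$$

and

$$\text{for all } f \in \mathcal{F}, \quad \max\Big\{\sum_{v \in S_f} \gamma_v^2,\ \sum_{v \in S_f} \mu_v\Big\} \le 1, \tag{15}$$

then, with probability greater than $1 - \eta$,

$$\|m - \hat{f}\|_n^2 \le C_3 \inf_{f \in \mathcal{F}} \Big\{ \|m - f\|_n^2 + \sigma^2 \sum_{v \in S_f} \big(\mu_v + \gamma_v^2\big) \Big\}.$$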

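The result can easily be generalized to the minimization of $L(f)$ over the space $\mathcal{F}_r$ defined as follows: for some positive constants $r_v$, $v \in \mathcal{P}$,

$$\mathcal{F}_r = \Big\{ f \text{ such that } f = f_0 + \sum_{v \in \mathcal{P}} f_v,\ f_v \in \mathcal{H}_v,\ \|f_v\|_{\mathcal{H}_v} \le r_v \Big\}. \tag{16}$$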

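Indeed, we just have to consider the RKHS $\mathcal{H}'_v$ associated with the kernel $k'_v(x_v, x'_v) = r_v^2 k_v(x_v, x'_v)$ in place of the RKHS $\mathcal{H}_v$. Then minimizing

$$\frac{1}{n} \sum_{i} \Big(Y_i - f_0 - \sum_{v} f_v(X_{v,i})\Big)^2 + \sum_{v} \mu_v r_v \|f_v\|_{\mathcal{H}'_v} + \sum_{v} \gamma_v \|f_v\|_n$$

over the set

$$\mathcal{F}' = \Big\{ f \text{ such that } f = f_0 + \sum_{v} f_v,\ f_v \in \mathcal{H}'_v,\ \|f_v\|_{\mathcal{H}'_v} \le 1 \Big\}$$

is equivalent to minimizing $L(f)$ over the set $\mathcal{F}_r$.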

Let us now comment on the theorem.

• The term $\|m - f\|_n^2$ is the usual bias term, quantifying the approximation properties of the Hilbert space $\mathcal{H}$ as a distance between the true $m$ and $f$, its approximation in $\mathcal{H}$.

• Koltchinskii and Yuan [22] considered what is called the multiple kernel learning problem, where the functions in $\mathcal{H}$ have an additive representation over kernel spaces. They do not assume that the variables $X$ are independent, nor that the kernel spaces satisfy an orthogonality condition. In return, they assume that some decomposability type properties are satisfied, and they introduce some characteristics related to the degree of “dependence” of the kernel spaces.

• In the particular case when the decomposition is limited to the main effects of the variables, the problem reduces to the classical nonparametric additive model. The theoretical properties of the estimator based on a ridge-group-sparse type procedure have already been established (see for example Meier et al. [27], and Raskutti et al. [31]).

• Weights in the penalty terms may be of interest for applications. The theoretical result highlights that the tuning parameters $(\mu_v, \gamma_v)$ should depend on the decay of the eigenvalues of the kernel defining $\mathcal{H}_v$. Besides, we may be interested in introducing weights that favor low-order interaction terms.

• Because we aim to approximate the Hoeffding decomposition of $m$, we need orthogonality between the spaces $\mathcal{H}_v$, $v \in \mathcal{P}$. This condition, required by our objective, is also a key point in the proof of the theorem, when the problem is to compare the Euclidean norm of functions in $\mathcal{H}$ with the norm in $L^2(P_X)$. At this step we need to assume that Assumption (15) holds to conclude.

• We do not assume that the functions in $\mathcal{F}$ are uniformly globally bounded, that is, that $\sup\{|f(x)|,\ x \in \mathcal{X}\}$ is bounded by a constant that does not depend on $f$. Instead we assume that each function within the unit ball of the Hilbert space $\mathcal{H}_v$ is uniformly bounded by a constant multiple of its Hilbert norm. In fact, the functions $f$ in the space $\mathcal{F}$, written as $f_0 + \sum_v f_v$, are such that, for each group $v$, $f_v$ is uniformly bounded. This assumption is easily satisfied as soon as the kernel $k_v$ is bounded on the compact set $\mathcal{X}$. Indeed, $\|f_v\|_\infty \le \sup_{X \in \mathcal{X}} \sqrt{k_v(X_v, X_v)}\, \|f_v\|_{\mathcal{H}_v}$. We refer to Raskutti et al. [31] for a discussion on that subject and a comparison with the work of Koltchinskii and Yuan [22].

• Assuming that $n\lambda_{n,v}^2 \ge -C_2 \log \lambda_{n,v}$ makes it possible to control the probability of the union of $|\mathcal{P}|$ events. This is a mild condition, satisfied for $\lambda_{n,v} = K_n / \sqrt{n}$ with $K_n$ of the order $\sqrt{\log n}$.

The following corollary gives an upper bound of the risk with respect to the $L^2(P_X)$ norm. It is mainly a

Corollary 4.1. Under the assumptions of Theorem 4.1, we have that, with probability greater than $1 - \eta$, for some constant $C_4$,

$$\|m - \hat{f}\|^2_{L^2(P_X)} \le C_4 \inf_{f \in \mathcal{F}} \Big\{ \|m - f\|_n^2 + \|m - f\|^2_{L^2(P_X)} + \sigma^2 \sum_{v \in S_f} \big(\mu_v + \gamma_v^2\big) \Big\}.$$

From this corollary we can compare $\widehat{\mathrm{Var}}(m_v(x_v))$ (see Equation (7)) with the variance of $m_v(x_v)$. Thanks to the inequality

$$\Big| \|\hat{f}_v\|_{L^2(P_X)} - \|m_v\|_{L^2(P_X)} \Big| \le \|\hat{f}_v - m_v\|_{L^2(P_X)} \le \|\hat{f} - m\|_{L^2(P_X)},$$

and to Corollary 4.1, we get the following result:

if $m_v \equiv 0$ then $\widehat{\mathrm{Var}}(m_v(x_v)) \le \|\hat{f} - m\|^2_{L^2(P_X)}$,

ifkmv(xv)kL2(PX)≥c >0 then

d

Var (mv(xv))

−1

≤ kfb−mk2

L2(PX)/c.

4.2. Rate of convergence

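Corollary 4.2. Under the same conditions as Theorem 4.1, if $\gamma_v = c\lambda_{n,v}$ and $\mu_v = c\lambda_{n,v}^2$ for $c \ge C_1$, then

$$\|m - \hat{f}\|_n^2 \le C_3 \inf_{f \in \mathcal{F}} \Big\{ \|m - f\|_n^2 + \Big(\sum_{v \in S_f} \nu_{n,v}^2 + \frac{d|S_f|}{n}\Big)\sigma^2 \Big\}.$$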

The result is non-asymptotic in the sense that it holds for any $(n, d)$. Nevertheless, the upper bound is relevant when the infimum is reached for functions $f$ whose decomposition in $\mathcal{H}$ is sparse, and when $d$ is small compared to $n$. In fact, the coefficient $d$ occurring in the rate $d|S_f|/n$ comes from the logarithm of the cardinality of $\mathcal{P}$, equal to $\log(2^d - 1)$. When $d$ is large, it may be judicious to limit the decomposition of functions in $\mathcal{H}$ to interactions of limited order, so that the logarithm of the number of terms in the decomposition stays of the order $\log(d)$.

Let us discuss the rate of convergence given by $\sum_{v \in S_f} \nu_{n,v}^2$. For the sake of simplicity, let us consider the case where the variables $X_1, \ldots, X_d$ have the same distribution $P_1$ on $\mathcal{X}_1 \subset \mathbb{R}$, and where the unidimensional kernels $k_{0a}$ are all identical, such that $k_v(x_v, x'_v) = \prod_{a \in v} k_0(x_a, x'_a)$. The kernel $k_0$ admits an eigen-expansion given by

$$k_0(x, x') = \sum_{\ell \ge 1} \omega_{0,\ell}\, \zeta_\ell(x)\, \zeta_\ell(x'),$$

where the eigenvalues $\omega_{0,\ell}$ are non-negative and arranged in decreasing order, and where the $\zeta_\ell$ are the associated eigenfunctions, orthonormal with respect to $L^2(P_1)$. Therefore the kernel $k_v$ admits the following expansion:

$$k_v(x_v, x'_v) = \sum_{\ell = (\ell_1, \ldots, \ell_{|v|})} \underbrace{\prod_{a=1}^{|v|} \omega_{0,\ell_a}}_{\omega_{v,\ell}}\ \underbrace{\prod_{a=1}^{|v|} \zeta_{\ell_a}(x_a)}_{\zeta_{v,\ell}(x_v)}\ \underbrace{\prod_{a=1}^{|v|} \zeta_{\ell_a}(x'_a)}_{\zeta_{v,\ell}(x'_v)}.$$

Consider now the case where the eigenvalues $\omega_{0,\ell}$ decrease at a rate $\ell^{-2\alpha}$ for some $\alpha > 1/2$. It can be shown (see Section 8.3) that the rate $\nu_{n,v}$ defined at Equation (12) is bounded above by a term of order $n^{-\alpha/(2\alpha+1)} (\log n)^{\gamma}$, where $\gamma \ge (|v| - 1)\alpha/(2\alpha - 1)$. Note that in this particular case, the rate of convergence

space H should be chosen such that the unknown function m is well approximated by sparse functions in H

with low order of interactions.

5. Calculation of the estimator

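The functional minimization problem described at Equation (6) is equivalent to a parametric minimization problem. Indeed, we know that if $\mathcal{H}$ is a RKHS associated with a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, then for all $(x_1, \ldots, x_n) \in \mathcal{X}^n$ and for all $(\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, the function $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ is in $\mathcal{H}$ and $\|f\|^2_{\mathcal{H}} = \sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} k(x_i, x_{i'})$.

In particular, it can be shown that the solution to our minimization problem is written as $f = f_0 + \sum_{v \in \mathcal{P}} f_v$ where, according to the representer theorem (see Kimeldorf and Wahba [20]), $f_v(\cdot) = \sum_{i=1}^{n} \theta_{v,i} k_v(X_{v,i}, \cdot)$ for some parameter $\theta \in \mathbb{R}^{n|\mathcal{P}|}$ with components $(\theta_{v,i},\ i = 1, \ldots, n,\ v = 1, \ldots, |\mathcal{P}|)$.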

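Let $\|\cdot\|$ denote the usual Euclidean norm in $\mathbb{R}^n$. For each $v \in \mathcal{P}$, let $K_v$ be the $n \times n$ matrix with components $(K_v)_{i,i'} = k_v(X_{v,i}, X_{v,i'})$, and let $K_v^{1/2}$ be a matrix that satisfies ${}^t(K_v^{1/2}) K_v^{1/2} = K_v$. Let $\hat{f}_0$ and $\hat{\theta}$ be the minimizers of the following penalized least-squares criterion:

$$C(f_0, \theta) = \frac{1}{n} \Big\|Y - f_0 1_n - \sum_{v \in \mathcal{P}} K_v \theta_v\Big\|^2 + \frac{1}{\sqrt{n}} \sum_{v \in \mathcal{P}} \gamma_v \|K_v \theta_v\| + \sum_{v \in \mathcal{P}} \mu_v \|K_v^{1/2} \theta_v\|. \tag{17}$$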
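For small problems the criterion (17) can be evaluated directly from the Gram matrices. The sketch below is a plain-Python illustration, not the authors' R code; the function names and the data are hypothetical. It uses the identity $\|K_v^{1/2}\theta_v\|^2 = {}^t\theta_v K_v \theta_v$, so that no matrix square root has to be computed.

```python
import math


def matvec(K, theta):
    """Product of the n x n matrix K (list of rows) with the vector theta."""
    return [sum(k_ij * t_j for k_ij, t_j in zip(row, theta)) for row in K]


def criterion(y, f0, gram, theta, gamma, mu):
    """Evaluate C(f0, theta) of Equation (17).

    gram, theta, gamma and mu are dictionaries indexed by the groups v.
    """
    n = len(y)
    fitted = [f0] * n
    penalty = 0.0
    for v, K in gram.items():
        k_theta = matvec(K, theta[v])
        fitted = [f + u for f, u in zip(fitted, k_theta)]
        empirical_term = math.sqrt(sum(u * u for u in k_theta)) / math.sqrt(n)
        # ||K^{1/2} theta||^2 = theta^T K theta
        quad = sum(t * u for t, u in zip(theta[v], k_theta))
        rkhs_term = math.sqrt(max(quad, 0.0))
        penalty += gamma[v] * empirical_term + mu[v] * rkhs_term
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / n + penalty


# Hypothetical toy data: n = 2 observations, a single group with K = identity.
y = [1.0, 3.0]
gram = {(1,): [[1.0, 0.0], [0.0, 1.0]]}
gamma = {(1,): 0.5}
mu = {(1,): 0.25}

value_at_zero = criterion(y, 2.0, gram, {(1,): [0.0, 0.0]}, gamma, mu)
value_at_fit = criterion(y, 2.0, gram, {(1,): [-1.0, 1.0]}, gamma, mu)
```

In this toy case the perfect fit still pays a penalty of $\gamma_v + \mu_v\sqrt{2}$, which is smaller than the residual cost of the null solution.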

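Then the estimator $\hat{f}$ defined at Equation (6) satisfies

$$\hat{f}(x) = \hat{f}_0 + \sum_{v \in \mathcal{P}} \hat{f}_v(x_v) \quad \text{with} \quad \hat{f}_v(x_v) = \sum_{i=1}^{n} \hat{\theta}_{v,i}\, k_v(X_{v,i}, x_v).$$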

Because $C(f_0, \theta)$ is a convex and separable criterion, we propose to calculate $\hat{\theta}$ using a block coordinate descent algorithm described in the following section.

Note that the estimator $\hat{f}$ defined at Equation (6) should satisfy $\|f_v\|_{\mathcal{H}_v} = \|K_v^{1/2} \theta_v\| \le 1$, or more generally $\|f_v\|_{\mathcal{H}_v} \le r_v$ for some positive $r_v$, see (16). Usually we have no idea of the value of this upper bound in practice, and we propose to remove this constraint in the optimization procedure. Nevertheless, if one wants to consider such an additional constraint, the problem can be solved at the price of some additional complication, for example with a Lagrangian method; see Section 8.7 for more details.

5.1. Algorithm

We will assume that for all $v \in \mathcal{P}$ the matrices $K_v$ are strictly positive definite. If this is not the case, one replaces $K_v$ by $K_v + \xi I_n$, where $\xi$ is a small positive value, in order to ensure positive definiteness.

Using a coordinate descent procedure, we minimize the criterion $C(f_0, \theta)$ along one group $v$ at a time. At each step of the algorithm, the criterion is minimized as a function of the current block's parameters, while the parameter values for the other blocks are fixed to their current values. The procedure is repeated until convergence, considering for example that convergence is reached when the norm of the difference between two consecutive solutions is small enough. See for example Boyd et al. [8] for optimization in such a context.

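For the sake of simplicity, we consider the minimization of the following criterion:

$$C'(f_0, \theta) = \Big\|Y - f_0 - \sum_{v \in \mathcal{P}} K_v \theta_v\Big\|^2 + \sum_{v \in \mathcal{P}} \gamma'_v \|K_v \theta_v\| + \sum_{v \in \mathcal{P}} \mu'_v \|K_v^{1/2} \theta_v\|. \tag{18}$$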

Let us begin with the constant term $f_0$. Because the penalty function does not depend on $f_0$, minimizing $C'(f_0, \theta)$ with respect to $f_0$ for fixed values of $\theta$ leads to

$$f_0 = Y_\cdot - \sum_{v} \sum_{i=1}^{n} (K_v \theta_v)_i / n, \tag{19}$$

where $Y_\cdot$ denotes the mean of $Y$ and $(K_v \theta_v)_i$ denotes the $i$-th component of $K_v \theta_v$. In what follows, we consider a group $v$ and fix the values of the parameters for all the other groups. We describe the algorithm and postpone the proofs to Section 8.7.

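Let us first consider the case where both $\mu'_v$ and $\gamma'_v$ are non-zero. If $\partial C'_v$ denotes the subdifferential of $C'(f_0, \theta)$ with respect to $\theta_v$, we need to solve $0 \in \partial C'_v$, which is equivalent to

$$-2K_v R_v + 2K_v^2 \theta_v + \gamma'_v s_v + \mu'_v t_v = 0, \tag{20}$$

where

$$R_v = Y - f_0 - \sum_{w \neq v} K_w \theta_w$$

and where $s_v$ and $t_v$ satisfy:

if $\theta_v = 0$: $\|K_v^{-1} s_v\| \le 1$ and $\|K_v^{-1/2} t_v\| \le 1$,

if $\theta_v \neq 0$: $s_v = \dfrac{K_v^2 \theta_v}{\|K_v \theta_v\|}$ and $t_v = \dfrac{K_v \theta_v}{\|K_v^{1/2} \theta_v\|}$.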

The first task is to obtain necessary and sufficient conditions under which $\theta_v = 0$ is the optimal solution. Let

$$J(t) = \|2R_v - \mu'_v K_v^{-1} t\|^2, \quad \text{and} \quad J^* = \min\big\{J(t),\ \text{for } t \in \mathbb{R}^n \text{ such that } \|K_v^{-1/2} t\| \le 1\big\}. \tag{21}$$

Then the solution to Equation (20) is zero if and only if $J^* \le \gamma_v'^2$. Calculating $J^*$ is a ridge regression problem that can be easily solved (see Propositions 8.2 and 8.3 in Section 8.7).

If the solution to Equation (20) is not $\theta_v = 0$, the problem is to solve the subgradient equation:

$$\theta_v = \left( \frac{\mu'_v}{2\|K_v^{1/2} \theta_v\|} I_n + K_v + \frac{\gamma'_v}{2\|K_v \theta_v\|} K_v \right)^{-1} R_v. \tag{22}$$

Because $\theta_v$ appears on both sides of the equation, a numerical procedure is needed (see Proposition 8.4).

Other cases.

• If all the $\mu'_v$ are equal to 0, the parameters $\theta_v$ are not identifiable.

• If all the $\gamma'_v$ are equal to 0, then we have to solve a classical group-lasso problem with respect to the parameters $\theta'_v$ defined as $\theta'_v = K_v^{1/2} \theta_v$ for all $v \in \mathcal{P}$.

• Let $v$ be such that $\mu'_v = 0$ and $\gamma'_v \neq 0$, and assume that at least one of the $\mu'_w$ is non-zero for $w \in \mathcal{P}$, $w \neq v$. Then it is shown in Proposition 8.1 that $\theta_v = 0$ if and only if $2\|R_v\| \le \gamma'_v$.

• In the same way, if $\gamma'_v = 0$ and $\mu'_v \neq 0$, then $\theta_v = 0$ if and only if $2\|K_v^{1/2} R_v\| \le \mu'_v$.
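The two closed-form exclusion tests listed above are simple to code. The following sketch is an illustration with hypothetical names and data; it covers only the cases where exactly one of the two tuning parameters of the group is zero, and it computes $\|K_v^{1/2} R_v\|$ as $\sqrt{{}^tR_v K_v R_v}$. When both parameters are non-zero, the test requires $J^*$ from Equation (21), which is not implemented here.

```python
import math


def euclidean_norm(u):
    return math.sqrt(sum(x * x for x in u))


def sqrt_quadratic_form(K, r):
    """||K^{1/2} r|| computed as sqrt(r^T K r), valid when t(K^{1/2}) K^{1/2} = K."""
    n = len(r)
    quad = sum(r[i] * K[i][j] * r[j] for i in range(n) for j in range(n))
    return math.sqrt(max(quad, 0.0))


def group_is_excluded(residual, K, gamma_v, mu_v):
    """Return True when theta_v = 0 is optimal for the current partial residual R_v."""
    if mu_v == 0 and gamma_v != 0:
        return 2.0 * euclidean_norm(residual) <= gamma_v
    if gamma_v == 0 and mu_v != 0:
        return 2.0 * sqrt_quadratic_form(K, residual) <= mu_v
    raise ValueError("this sketch only handles the case where exactly one parameter is zero")


# Hypothetical partial residual and Gram matrix for one group.
R_v = [3.0, 4.0]
K_v = [[4.0, 0.0], [0.0, 1.0]]

excluded_gamma = group_is_excluded(R_v, K_v, gamma_v=10.0, mu_v=0.0)  # 2 * 5 <= 10
kept_gamma = group_is_excluded(R_v, K_v, gamma_v=9.9, mu_v=0.0)
excluded_mu = group_is_excluded(R_v, K_v, gamma_v=0.0, mu_v=15.0)     # 2 * sqrt(52) <= 15
kept_mu = group_is_excluded(R_v, K_v, gamma_v=0.0, mu_v=14.0)
```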

Finally the algorithm is the following:

(1) Start with an initial value $\theta = \theta^0$.

(2) Calculate $f_0$ using Equation (19). For the group $v$, calculate $R_v$ and determine whether the group $v$ should be excluded, either by solving the problem defined at Equation (21) if $\mu'_v \neq 0$ and $\gamma'_v \neq 0$, or directly if one of them equals 0. If the group is excluded, set $\theta_v = 0$. If not, solve Equation (22) to obtain $\theta_v$.

5.2. Choice of the tuning parameters

For each value of the tuning parameters $(\mu'_v, \gamma'_v)$, $v \in \mathcal{P}$, the algorithm provides an estimate of $m$ and of the Sobol indices. The choice of these parameter values is crucial. We propose to restrict this choice by considering tuning parameters proportional to known weights: for all $v \in \mathcal{P}$, $\mu'_v = \mu\omega_v$ and $\gamma'_v = \gamma\zeta_v$, where the weights $\omega_v$ and $\zeta_v$ are fixed. For example, one can take weights that increase with the cardinality of $v$ in order to favour effects with small interaction order between variables. Or, according to the theoretical result given in Theorem 4.1, we can choose $\omega_v = \hat{\nu}_{n,v}^2$ and $\zeta_v = \hat{\nu}_{n,v}$, where $\hat{\nu}_{n,v}$ is an estimate of $\nu_{n,v}$ based on the eigenvalues of the matrix $K_v$. Any other choice, depending on the problem of interest, may be relevant.

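Once the weights are chosen, we estimate $m$, on the basis of a learning data set $(Y_i, \mathbf{X}_i)$, $i = 1, \dots, n$, for a grid of values of $(\mu, \gamma)$. We first set $\gamma = 0$ and calculate $\mu_{\max}$, the smallest value of $\mu$ such that the solution to the minimization of

$$\Big\| Y - f_0 - \sum_{v \in \mathcal{P}} K_v \theta_v \Big\|^2 + \mu \sum_{v \in \mathcal{P}} \omega_v \|K_v^{1/2}\theta_v\|$$

is $\theta_v = 0$ for all $v \in \mathcal{P}$. Then we can consider $\mu_\ell = \mu_{\max} 2^{-\ell}$ for $\ell \in \{1, \dots, \ell_{\max}\}$ as a grid of values for $\mu$. The grid of values for $\gamma$ may be chosen after a few attempts.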
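The grid construction can be sketched as follows. The sketch uses the zero condition stated in Section 5.1 for the case $\gamma'_v = 0$, namely $\theta_v = 0$ if and only if $2\|K_v^{1/2}R_v\| \le \mu'_v$, applied to every group at $\theta = 0$. Taking $f_0$ equal to the mean of $Y$ at $\theta = 0$ is an assumption (Equation (19) is not reproduced here), and the function name `mu_grid` is hypothetical.

```python
import numpy as np

def mu_grid(Y, K_list, weights, l_max):
    """Return mu_max and the dyadic grid mu_max * 2**(-l), l = 1..l_max (gamma = 0).

    Y       : (n,) response vector
    K_list  : list of (n, n) matrices K_v
    weights : list of positive weights omega_v, in the same order
    """
    Y = np.asarray(Y, dtype=float)
    r = Y - Y.mean()  # residual at theta = 0; f0 = mean(Y) is an assumption
    # Group v stays at zero iff 2 * ||K_v^{1/2} r|| <= mu * omega_v,
    # and ||K_v^{1/2} r||^2 = r' K_v r.
    mu_max = max(2.0 * np.sqrt(r @ K @ r) / w for K, w in zip(K_list, weights))
    grid = [mu_max * 2.0 ** (-l) for l in range(1, l_max + 1)]
    return mu_max, grid
```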

For choosing the final estimator, say $\hat f$, we suppose that we have at our disposal a testing data set $(Y_i^T, \mathbf{X}_i^T)$, $i = 1, \dots, n_T$, and we propose two procedures.

Proc. GS: The first one uses the testing data set for estimating the prediction error. Precisely, for each value of $(\mu, \gamma)$ in the grid, let $\hat f_{(\mu,\gamma)}(\cdot)$ be the estimate of $m$ obtained with the learning data set. Then

$$\mathrm{PE}(\mu, \gamma) = \frac{1}{n_T} \sum_{i=1}^{n_T} \Big( Y_i^T - \hat f_{(\mu,\gamma)}(\mathbf{X}_i^T) \Big)^2$$

estimates the prediction error, and we propose to choose the pair $(\mu, \gamma)$ that minimizes $\mathrm{PE}(\mu, \gamma)$, say $(\hat\mu, \hat\gamma)$. Finally the estimator, denoted $\hat f_{\mathrm{GS}}$, is defined as $\hat f_{\mathrm{GS}} = \hat f_{(\hat\mu,\hat\gamma)}$. In the following, we will refer to this procedure as the Group-Sparse procedure.
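The selection step of Proc. GS amounts to a grid search on the testing set, which the following sketch illustrates. The callable `fit(mu, gamma)` is hypothetical: it stands in for the group-sparse solver run on the learning set and is assumed to return a prediction function.

```python
import itertools
import numpy as np

def select_gs(fit, mu_grid, gamma_grid, X_test, Y_test):
    """Return (PE, mu, gamma, f_hat) for the pair minimizing the test error PE.

    fit(mu, gamma) is a hypothetical callable returning a predictor
    trained on the learning data set.
    """
    best = None
    for mu, gamma in itertools.product(mu_grid, gamma_grid):
        f_hat = fit(mu, gamma)
        pe = float(np.mean((Y_test - f_hat(X_test)) ** 2))  # PE(mu, gamma)
        if best is None or pe < best[0]:
            best = (pe, mu, gamma, f_hat)
    return best
```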

Proc. rdg: Drawing a parallel with the inconsistency of the lasso for estimating the support of the parameters in the classical regression problem, we propose to choose the tuning parameter that minimizes the risk of the ridge estimator over the support estimated by the ridge-group-sparse procedure. Indeed, if the tuning parameter is chosen to minimize the prediction error, the lasso is not consistent for support estimation (see [24] for example). One idea to overcome this problem is to choose the tuning parameter that minimizes the risk of the Gauss-lasso estimator, which is calculated in two steps: for a given value of the tuning parameter, the support of the parameter is estimated using a lasso procedure, then the least-squares estimator over this support is calculated. When the objective is support estimation, some numerical simulations [33] and theoretical results [18] suggest that it may be more advisable not to apply the selection schemes based on prediction risk to the lasso estimators, but rather to the Gauss-lasso estimators. Our procedure, called the ridge procedure, applies the same idea in the framework of sparse nonparametric estimation. Precisely, it considers the collection of supports composed of the different $S_{\hat f_{(\mu,\gamma)}}$ when $(\mu, \gamma)$ belongs to the grid. For each support $S$ in this collection, we estimate $f$ using a ridge procedure assuming that the support of $f$ is $S$: for a given $\lambda$, $f^{\mathrm{rdg}}_{\lambda,S}$ is defined as follows:

$$f^{\mathrm{rdg}}_{\lambda,S} = \operatorname*{argmin} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - f_0 - \sum_{v \in S} f_v(X_{v,i}) \Big)^2 + \lambda \sum_{v \in S} \|f_v\|^2_{\mathcal{H}_v},\ \ f = f_0 + \sum_{v \in S} f_v,\ f_v \in \mathcal{H}_v \right\}.$$

We choose a grid of values for λ, and for each S in the collection and λ in the grid, we estimate the prediction error PE(λ, S) defined as follows:

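$$\mathrm{PE}(\lambda, S) = \frac{1}{n_T} \sum_{i=1}^{n_T} \Big( Y_i^T - f^{\mathrm{rdg}}_{\lambda,S}(\mathbf{X}_i^T) \Big)^2.$$

For each $S$ in the collection, let $\hat\lambda(S)$ be the minimizer of $\mathrm{PE}(\lambda, S)$ when $\lambda$ varies in the grid, and let $\hat S$ be the minimizer of $\mathrm{PE}(\hat\lambda(S), S)$. Then the estimator, denoted $\hat f_{\mathrm{rdg}}$, is defined as $\hat f_{\mathrm{rdg}} = f^{\mathrm{rdg}}_{\hat\lambda(\hat S),\hat S}$.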
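A minimal sketch of the two ingredients of Proc. rdg is given below. With $f_0$ held fixed, the ridge criterion over a support $S$ is a kernel ridge regression with kernel matrix $\sum_{v \in S} K_v$, so that all components share one coefficient vector $\alpha = (\sum_{v\in S} K_v + n\lambda I_n)^{-1}(Y - f_0)$ and $f_v$ has fitted values $K_v\alpha$. Fixing $f_0$ at the mean of $Y$ is a simplifying assumption, and the names `ridge_on_support`, `select_rdg` and the callable `pe` are hypothetical.

```python
import numpy as np

def ridge_on_support(K_list, Y, lam):
    """Ridge fit over a fixed support S; K_list holds K_v for v in S.

    Returns (f0, alpha); the component f_v has fitted values K_v @ alpha.
    f0 = mean(Y) is a simplifying assumption, not taken from the paper.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    f0 = Y.mean()
    K_S = sum(K_list)  # kernel matrix of the direct sum over S
    alpha = np.linalg.solve(K_S + n * lam * np.eye(n), Y - f0)
    return f0, alpha

def select_rdg(supports, lam_grid, pe):
    """Two-stage choice: best lambda for each support, then best support.

    pe(lam, S) is a hypothetical callable returning the test prediction
    error PE(lam, S); supports must be hashable (e.g. tuples of groups).
    """
    best_lam = {S: min(lam_grid, key=lambda lam: pe(lam, S)) for S in supports}
    S_hat = min(supports, key=lambda S: pe(best_lam[S], S))
    return best_lam[S_hat], S_hat
```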

If a testing data set is not available, we can use the classical V-fold cross validation (see [1] for example) either to estimate PE(µ, γ) or to estimate PE(λ, S).

6. Simulation study

In order to evaluate the performance of our method for estimating a metamodel and the sensitivity indices of a function $m$, we carried out a simulation study. We consider the g-function of Sobol defined on $[0,1]^d$ as

$$m(x) = \prod_{a=1}^{d} \frac{|4x_a - 2| + c_a}{1 + c_a}, \quad c_a > 0,$$

whose Sobol indices can be expressed analytically (see Saltelli et al. [34]). Following the simulation experiment proposed by Durrande et al. [12], we take $d = 5$ and $(c_1, c_2, c_3, c_4, c_5) = (0.2, 0.6, 0.8, 100, 100)$. The lower the value of $c_a$, the more significant the variable $x_a$. The variables $X_a$, $a = 1, \dots, d$, are independent and uniformly distributed on $[0,1]$. We consider the regression model $Y_i = m(\mathbf{X}_i) + \sigma\varepsilon_i$, for $i = 1, \dots, n$, with independent $\mathcal{N}(0,1)$ error terms $\varepsilon_i$.
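For reference, the g-function and its analytic Sobol indices can be computed with the short sketch below. It relies on the standard closed form for independent uniform inputs: each factor has mean 1 and variance $V_a = 1/(3(1+c_a)^2)$, so that $\mathrm{Var}(m_v) = \prod_{a \in v} V_a$ and $\mathrm{Var}(m) = \prod_a (1+V_a) - 1$. The function names and the 0-based indexing of variables are choices made for the example.

```python
import itertools
import numpy as np

def g_function(X, c):
    """Sobol g-function evaluated row-wise on X of shape (n, d)."""
    X, c = np.asarray(X, dtype=float), np.asarray(c, dtype=float)
    return np.prod((np.abs(4.0 * X - 2.0) + c) / (1.0 + c), axis=1)

def sobol_indices(c, max_order=None):
    """Analytic Sobol indices S_v for inputs uniform on [0, 1]^d.

    Keys are tuples of 0-based variable indices, e.g. (0, 1) for v = {1, 2}.
    """
    c = np.asarray(c, dtype=float)
    d = len(c)
    V = 1.0 / (3.0 * (1.0 + c) ** 2)   # variance of each one-dimensional factor
    total = np.prod(1.0 + V) - 1.0     # Var(m(X))
    order = d if max_order is None else max_order
    return {v: float(np.prod(V[list(v)]) / total)
            for k in range(1, order + 1)
            for v in itertools.combinations(range(d), k)}
```

With $(c_1, \dots, c_5) = (0.2, 0.6, 0.8, 100, 100)$ this gives values that agree, after rounding, with the true indices reported in the first line of Tables 4 and 5.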

Simulation design. We present the results for $n \in \{50, 100, 200\}$ and $\sigma \in \{0, 0.2\}$. For all $a = 1, \dots, d$, the kernels $k_a$ are the same: we considered the Brownian kernel, $k_b(x, x') = 1 + \min\{x, x'\}$, the Matérn kernel, $k_m(x, x') = (1 + 2|x - x'|)\exp(-2|x - x'|)$, and the Gaussian kernel, $k_g(x, x') = \exp\big(-(x - x')^2\big)$.
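The three one-dimensional kernels can be written down directly, as in the sketch below. These are the raw kernels $k_a$, before any ANOVA-type modification used to build the spaces $\mathcal{H}_v$. The Gaussian kernel is taken as $\exp(-(x-x')^2)$, which is our reading of the formula above, and the helper `gram` is a hypothetical name.

```python
import numpy as np

def brownian(x, y):
    """Brownian kernel 1 + min(x, y)."""
    return 1.0 + np.minimum(x, y)

def matern(x, y):
    """Matern kernel (1 + 2|x - y|) exp(-2|x - y|)."""
    r = np.abs(x - y)
    return (1.0 + 2.0 * r) * np.exp(-2.0 * r)

def gaussian(x, y):
    """Gaussian kernel exp(-(x - y)^2) (assumed form)."""
    return np.exp(-(x - y) ** 2)

def gram(kernel, x):
    """Gram matrix (k(x_i, x_j))_{i,j} for a one-dimensional sample x."""
    x = np.asarray(x, dtype=float)
    return kernel(x[:, None], x[None, :])
```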

For each simulation, we generate three independent data sets as follows: a Latin Hypercube Sample of the inputs is simulated to give the matrix $\mathbf{X}$ with $n$ rows and $d = 5$ columns, and an $n$-sample of independent centered Gaussian variables with variance 1 is simulated. This operation is repeated three times in order to obtain the learning and testing data sets and a third data set for evaluating the performance of the estimators. As explained in Section 5.2, we choose optimal values of the tuning parameters $(\mu, \gamma)$ by minimizing a prediction error PE, and get an estimator $\hat f$ of $m$, as well as estimates of the Sobol indices:

$$\hat S_v = \frac{\widehat{\mathrm{Var}}(m_v(x_v))}{\sum_{w \in \mathcal{P}} \widehat{\mathrm{Var}}(m_w(x_w))}.$$

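Let $\Omega_v$ be the matrix whose components satisfy

$$(\Omega_v)_{i,i'} = \prod_{a \in v} E_{U \sim P_a}\big( k_{0a}(U, X_{a,i})\, k_{0a}(U, X_{a,i'}) \big) \quad \text{for } i, i' = 1, \dots, n.$$

The estimator $\widehat{\mathrm{Var}}(m_v(x_v))$ is calculated as follows:

$$\widehat{\mathrm{Var}}(m_v(x_v)) = \hat\alpha_v^T\, \Omega_v\, \hat\alpha_v.$$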
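Once the coefficient vectors $\hat\alpha_v$ and the matrices $\Omega_v$ are available, the estimated indices follow from two lines of linear algebra, as in this sketch. The dictionary-based interface and the name `sobol_estimates` are assumptions, and the computation of $\Omega_v$ itself is not covered. At least one estimated variance is assumed to be positive.

```python
import numpy as np

def sobol_estimates(alpha, Omega):
    """Estimated Sobol indices from quadratic forms alpha_v' Omega_v alpha_v.

    alpha : dict mapping each group v to its (n,) coefficient vector
    Omega : dict mapping each group v to its (n, n) matrix Omega_v
    """
    var = {v: float(np.asarray(alpha[v]) @ np.asarray(Omega[v]) @ np.asarray(alpha[v]))
           for v in alpha}
    total = sum(var.values())  # assumed > 0
    return {v: var[v] / total for v in var}
```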

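| | σ = 0, n = 50 | σ = 0, n = 100 | σ = 0, n = 200 | σ = 0.2, n = 50 | σ = 0.2, n = 100 | σ = 0.2, n = 200 |
|---|---|---|---|---|---|---|
| Proc. GS | 0.814 | 0.920 | 0.959 | 0.737 | 0.835 | 0.889 |
| Proc. rdg | 0.874 | 0.976 | 0.989 | 0.763 | 0.854 | 0.892 |

Table 1. Estimated coefficient of determination $R^2$ for different values of $n$ and $\sigma$, with the Matérn kernel.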

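| | σ = 0, n = 50 | σ = 0, n = 100 | σ = 0, n = 200 | σ = 0.2, n = 50 | σ = 0.2, n = 100 | σ = 0.2, n = 200 |
|---|---|---|---|---|---|---|
| Proc. GS | 0.033 | 0.0137 | 0.0139 | 0.051 | 0.028 | 0.020 |
| Proc. rdg | 0.011 | 0.0009 | 0.0007 | 0.042 | 0.022 | 0.013 |

Table 2. Estimated empirical risk ER for different values of $n$ and $\sigma$, with the Matérn kernel.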

Performance indicators. To evaluate the performance of our method for estimating a metamodel, we use the classical coefficient of determination $R^2$ estimated using the third data set $(Y_i^P, \mathbf{X}_i^P)$, $i = 1, \dots, n$:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \big( Y_i^P - \hat f(\mathbf{X}_i^P) \big)^2}{\sum_{i=1}^{n} \big( Y_i^P - Y_{\cdot}^P \big)^2}.$$

Moreover we calculate the empirical risk $\mathrm{ER} = \|m - \hat f\|_n^2$. For each simulation $s$, we get $R_s^2$ and $\mathrm{ER}_s$, and we report the means of these quantities over all simulations.

Similarly, for each $v$ and each simulation $s$, we get $\hat S_{v,s}$ and we report its mean, $\hat S_{v,\cdot}$, and its estimated standard error. To sum up the behaviour of our procedure for estimating the sensitivity indices, we estimate the global error, denoted GE, defined as follows:

$$\mathrm{GE} = \sum_{v} (\hat S_{v,s} - S_v)^2.$$
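The three indicators are straightforward to compute for one simulation, as the sketch below shows. Two conventions are assumptions of the example: $\|\cdot\|_n^2$ is taken to be the mean of squares over the design points, and a group absent from the estimated support is given an estimated index of 0. The function names are hypothetical.

```python
import numpy as np

def r_squared(Y, pred):
    """Coefficient of determination on an independent data set."""
    Y, pred = np.asarray(Y, dtype=float), np.asarray(pred, dtype=float)
    return 1.0 - np.sum((Y - pred) ** 2) / np.sum((Y - Y.mean()) ** 2)

def empirical_risk(m_values, pred):
    """ER = ||m - f_hat||_n^2, read here as the mean squared gap at the design points."""
    m_values, pred = np.asarray(m_values, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean((m_values - pred) ** 2))

def global_error(S_hat, S_true):
    """GE for one simulation; groups missing from S_hat count as 0 (assumption)."""
    return sum((S_hat.get(v, 0.0) - S_true[v]) ** 2 for v in S_true)
```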

In order to assess the performance of our procedure for selecting the greatest Sobol indices, precisely those that are greater than some small quantity such as $\rho = 10^{-4}$, we calculate for each group $v \in \mathcal{P}$ the percentage of simulations for which $v$ is in the support of the estimator $\hat f$. Then we average these quantities, on the one hand over groups $v$ such that $S_v > \rho$, and on the other hand over groups $v$ such that $S_v \le \rho$. Let us denote these quantities $\mathrm{pSel}_{S_v > \rho}$ and $\mathrm{pSel}_{S_v \le \rho}$ respectively.
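The selection percentages can be tabulated as in the following sketch, where each simulation contributes the set of groups found in the support of $\hat f$. The function name `selection_rates` and its interface are hypothetical, and both classes of groups (above and below $\rho$) are assumed to be non-empty.

```python
def selection_rates(supports, S_true, rho=1e-4):
    """Per-group selection percentages and their averages above / below rho.

    supports : list with one set of selected groups per simulation
    S_true   : dict mapping each group v to its true Sobol index S_v
    Returns (pSel_v for each v, pSel_{S_v > rho}, pSel_{S_v <= rho}).
    """
    n_sim = len(supports)
    pct = {v: 100.0 * sum(v in s for s in supports) / n_sim for v in S_true}
    large = [pct[v] for v in S_true if S_true[v] > rho]
    small = [pct[v] for v in S_true if S_true[v] <= rho]
    return pct, sum(large) / len(large), sum(small) / len(small)
```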

For each $(\mu, \gamma)$ in a grid of values, the estimator $\hat f_{\mu,\gamma}$ is defined as the minimizer of the criterion given at Equation (18), taking $\omega_v = \zeta_v = 1$ for all $v \in \mathcal{P}$. In order to save computation time, we restrict the optimisation to sets $v$ such that $|v| \le 3$. Some preliminary simulations showed that the terms corresponding to $|v| \ge 4$ are nearly always equal to 0.

Choosing the tuning parameters. Let us begin with the comparison of the two methods proposed for choosing the final estimator, see Section 5.2. The results are given in Tables 1 and 2. It appears that the procedure based on the ridge estimator after selection of the groups outperforms the method based on the group-sparse estimator. As expected, both methods perform better when $n$ increases and when $\sigma = 0$.

Similarly, the Sobol indices are better estimated, in the sense of the global error, with the procedure based on the ridge estimate of the metamodel, see Table 3. The means of the estimators for the Sobol indices greater than $\rho = 10^{-4}$ are given in Tables 4 and 5. It appears that $S_{\{1\}}$ is over-estimated using the procedure based

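| | σ = 0, n = 50 | σ = 0, n = 100 | σ = 0, n = 200 | σ = 0.2, n = 50 | σ = 0.2, n = 100 | σ = 0.2, n = 200 |
|---|---|---|---|---|---|---|
| Proc. GS | 1.91 | 0.79 | 0.45 | 2.41 | 1.16 | 0.54 |
| Proc. rdg | 0.80 | 0.10 | 0.03 | 1.50 | 0.47 | 0.15 |

Table 3. Estimated global error GE × 100 for different values of $n$ and $\sigma$, with the Matérn kernel.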

| | v = {1} | v = {2} | v = {3} | v = {1,2} | v = {1,3} | v = {2,3} | v = {1,2,3} | sum |
|---|---|---|---|---|---|---|---|---|
| S.I. | 43.3 | 24.3 | 19.2 | 5.63 | 4.45 | 2.50 | 0.579 | 99.98 |
| Proc. GS | 50.1 (6.2) | 26.5 (5.4) | 20.9 (4.9) | 0.69 (0.8) | 0.63 (1.1) | 0.51 (0.9) | 0.02 (0.07) | 99.29 |
| Proc. rdg | 45.4 (4.3) | 25.3 (3.4) | 20.3 (3.5) | 3.08 (2.5) | 2.18 (2.1) | 1.44 (1.8) | 0.09 (0.5) | 98.66 |

Table 4. The first line of the table gives the true values of the Sobol indices × 100 greater than $10^{-2}$, as well as their sum in the last column. The following lines give the mean of the estimators as well as their standard error (in parentheses), calculated over 100 simulations, for $n = 50$ and $\sigma = 0$ with the Matérn kernel.

| | v = {1} | v = {2} | v = {3} | v = {1,2} | v = {1,3} | v = {2,3} | v = {1,2,3} | sum |
|---|---|---|---|---|---|---|---|---|
| S.I. | 43.3 | 24.3 | 19.2 | 5.63 | 4.45 | 2.50 | 0.579 | 99.98 |
| Proc. GS | 47.5 (5.4) | 26.2 (4.9) | 19.8 (3.7) | 2.35 (1.5) | 1.45 (1.2) | 0.84 (0.8) | 0.03 (0.1) | 98.95 |
| Proc. rdg | 43.5 (3.9) | 25.0 (3.6) | 19.7 (2.8) | 4.85 (1.7) | 3.39 (1.5) | 2.02 (1.3) | 0.05 (0.3) | 99.02 |

Table 5. The first line of the table gives the true values of the Sobol indices × 100 greater than $10^{-2}$, as well as their sum in the last column. The following lines give the mean of the estimators as well as their standard error (in parentheses), calculated over 100 simulations, for $n = 100$ and $\sigma = 0.2$ with the Matérn kernel.

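| | SI < ρ | SI ≥ ρ | v = {1} | v = {2} | v = {3} | v = {1,2} | v = {1,3} | v = {2,3} | v = {1,2,3} |
|---|---|---|---|---|---|---|---|---|---|
| Proc. GS | 17.9 | 68 | 100 | 100 | 100 | 72 | 66 | 67 | 9 |
| Proc. rdg | 6.3 | 51 | 100 | 100 | 100 | 72 | 62 | 47 | 4 |

Table 6. The first two columns give respectively $\mathrm{pSel}_{S_v \le \rho}$ and $\mathrm{pSel}_{S_v > \rho}$. The last columns give the values of $\mathrm{pSel}_v$ for each group $v$ such that $S_v > \rho$. Results for $n = 50$ and $\sigma = 0$ with the Matérn kernel.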

Let us now consider the performance of the procedure for selecting the nonzero Sobol indices. In Tables 6 and 7 we report the percentages of simulations for which the Sobol indices smaller (respectively greater) than $\rho$ are selected, and for which each Sobol index greater than $\rho$ is selected. From these results, we conclude that the procedure based on the ridge estimator is stricter in selecting nonzero Sobol indices.

Comparing different kernels. Finally we compare the performance of the procedures for different kernels, see Tables 8 and 9. The means of the estimated empirical risk, ER, and of the global error for estimating the sensitivity indices, GE, are calculated for each kernel. It appears that the Matérn kernel gives the best results, except for the case $n = 50$ and $\sigma = 0$, where the empirical risk of $\hat f_{\mathrm{GS}}$ is smaller for the Brownian kernel.

In practice, one may want to choose the kernel according to the smallest prediction error. For that purpose, we propose to calculate the estimators $\hat f_{\mathrm{GS}}$ and/or $\hat f_{\mathrm{rdg}}$ for each kernel, as described in Section 5.2, as well as

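| | SI < ρ | SI ≥ ρ | v = {1} | v = {2} | v = {3} | v = {1,2} | v = {1,3} | v = {2,3} | v = {1,2,3} |
|---|---|---|---|---|---|---|---|---|---|
| Proc. GS | 38 | 85 | 100 | 100 | 100 | 100 | 99 | 98 | 18 |
| Proc. rdg | 6 | 58 | 100 | 100 | 100 | 98 | 92 | 77 | 3 |

Table 7. The first two columns give respectively $\mathrm{pSel}_{S_v \le \rho}$ and $\mathrm{pSel}_{S_v > \rho}$. The last columns give the values of $\mathrm{pSel}_v$ for each group $v$ such that $S_v > \rho$. Results for $n = 100$ and $\sigma = 0.2$ with the Matérn kernel.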

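| | Matérn (n = 50, σ = 0) | Gaussian (n = 50, σ = 0) | Brownian (n = 50, σ = 0) | mixed (n = 50, σ = 0) | Matérn (n = 100, σ = 0.2) | Gaussian (n = 100, σ = 0.2) | Brownian (n = 100, σ = 0.2) | mixed (n = 100, σ = 0.2) |
|---|---|---|---|---|---|---|---|---|
| Proc. GS | 0.033 | 0.054 | 0.027 | 0.028 | 0.033 | 0.054 | 0.027 | 0.028 |
| Proc. rdg | 0.011 | 0.024 | 0.025 | 0.011 | 0.011 | 0.024 | 0.025 | 0.023 |

Table 8. Estimated empirical risk ER: performance of the procedures according to the kernel choice.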

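| | Matérn (n = 50, σ = 0) | Gaussian (n = 50, σ = 0) | Brownian (n = 50, σ = 0) | mixed (n = 50, σ = 0) | Matérn (n = 100, σ = 0.2) | Gaussian (n = 100, σ = 0.2) | Brownian (n = 100, σ = 0.2) | mixed (n = 100, σ = 0.2) |
|---|---|---|---|---|---|---|---|---|
| Proc. GS | 1.91 | 2.49 | 2.19 | 1.88 | 1.16 | 1.90 | 1.69 | 1.16 |
| Proc. rdg | 0.80 | 1.39 | 1.40 | 0.85 | 0.48 | 0.83 | 0.73 | 0.51 |

Table 9. Estimated global error GE × 100: performance of the procedures according to the kernel choice.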

7. Sketch of proof of Theorem 4.1

We give here a sketch of the proof and refer to Section 8 for complete statements. In particular, we denote by $C$ constants that vary from one equation to another, and we assume that $\sigma = 1$.

The proof of Theorem 4.1 starts in the same way as the proof of Theorem 1 in Raskutti et al. [31]. Nevertheless it differs in several points, in particular because the terms occurring in the decomposition of functions in $\mathcal{H}$ depend on several variables and thus are not independent. Indeed, $f_v(X_v)$ and $f_{v'}(X_{v'})$ are not independent as soon as the groups $v$ and $v'$ share some of the variables $X_a$, $a = 1, \dots, d$. Moreover, we do not assume that the function $m$ is in $\mathcal{F}$.

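Starting from the definition of $\hat f$, some simple calculations (see Equation (28)) give that for all $f \in \mathcal{F}$

$$C\|m - \hat f\|_n^2 \le \|m - f\|_n^2 + \Big| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \big( \hat f(\mathbf{X}_i) - f(\mathbf{X}_i) \big) \Big| + \sum_{v \in S_f} \Big( \gamma_v \|\hat f_v - f_v\|_n + \mu_v \|\hat f_v - f_v\|_{\mathcal{H}_v} \Big).$$

If we set $g = \hat f - f$, then $g \in \mathcal{H}$, $g = g_0 + \sum_v g_v$ with $g_v = \hat f_v - f_v$, and for each $v$, $\|g_v\|_{\mathcal{H}_v} \le 2$.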

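The main problem is now to control the empirical process. For each $v$, letting $\lambda_{n,v}$ be as in (13), we state (see Lemma 8.1, page 44) that, with high probability,

$$\text{if } \|g_v\|_n \le \lambda_{n,v}\|g_v\|_{\mathcal{H}_v} \text{ then } \Big| \sum_{i=1}^{n} \varepsilon_i g_v(X_{v,i}) \Big| \le C n \lambda_{n,v}^2 \|g_v\|_{\mathcal{H}_v},$$

$$\text{if } \|g_v\|_n > \lambda_{n,v}\|g_v\|_{\mathcal{H}_v} \text{ then } \Big| \sum_{i=1}^{n} \varepsilon_i g_v(X_{v,i}) \Big| \le C n \lambda_{n,v} \|g_v\|_n.$$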

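Therefore, if for all $v$, $\mu_v$ and $\gamma_v$ satisfy Equation (14), we deduce that with high probability (setting $g = \hat f - f$)

$$C\|m - \hat f\|_n^2 \le \|m - f\|_n^2 + \sum_{v \in S_f} \big( \gamma_v \|g_v\|_n + \mu_v \|g_v\|_{\mathcal{H}_v} \big) + \sum_{v \notin S_f} \big( \gamma_v \|\hat f_v\|_n + \mu_v \|\hat f_v\|_{\mathcal{H}_v} \big).$$

Besides, we can express the decomposability property of the penalty as follows (see Lemma 8.2, page 45): with high probability (on the set where the empirical process is controlled as stated above),

$$\sum_{v \notin S_f} \big( \gamma_v \|\hat f_v\|_n + \mu_v \|\hat f_v\|_{\mathcal{H}_v} \big) \le C \sum_{v \in S_f} \big( \gamma_v \|g_v\|_n + \mu_v \|g_v\|_{\mathcal{H}_v} \big).$$

Putting things together, and noting again that $\|g_v\|_{\mathcal{H}_v} \le 2$, we obtain the following upper bound:

$$C\|m - \hat f\|_n^2 \le \|m - f\|_n^2 + \sum_{v \in S_f} \big( \mu_v + \gamma_v \|g_v\|_n \big).$$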

The last important step consists in comparing $\sum_{v \in S_f} \|g_v\|_n$ to $\| \sum_{v \in S_f} g_v \|_n$. More precisely, it can be shown (see Lemma 8.3, page 45) that for all $v \in \mathcal{P}$, with high probability, we have

$$\|g_v\|_n \le 2\|g_v\|_{L^2(P_X)} + \gamma_v.$$

Using the orthogonality assumption between the spaces $\mathcal{H}_v$, we have $\sum_{v \in S_f} \|g_v\|^2_{L^2(P_X)} = \| \sum_{v \in S_f} g_v \|^2_{L^2(P_X)}$, and thus we get

$$C\|m - \hat f\|_n^2 \le \|m - f\|_n^2 + \sum_{v \in S_f} \mu_v + \sum_{v \in S_f} \gamma_v^2 + \Big\| \sum_{v \in S_f} (\hat f_v - f_v) \Big\|^2_{L^2(P_X)}.$$

Finally it remains to consider different cases according to the rankings of $\|\hat f - f\|^2_{L^2(P_X)}$, $\|\hat f - f\|_n^2$ and $\sum_{v \in S_f} (\mu_v + \gamma_v^2)$ to get the result of Theorem 4.1.

8. Proofs

Recall that we consider the regression model defined at Equation (1), where $\mathbf{X}$ has distribution $P_X = P_1 \times \dots \times P_d$ defined on $\mathcal{X}$, a compact subset of $\mathbb{R}^d$, and $\varepsilon$ is distributed as $\mathcal{N}(0, \sigma^2)$. We denote by $P_{X,\varepsilon}$ the distribution of $(\mathbf{X}, \varepsilon)$. We observe an $n$-sample $(Y_i, \mathbf{X}_i)$, $i = 1, \dots, n$, with law $P_{X,\varepsilon}$.

The notation and the procedure are given in Sections 2 and 4. Let us add a few notations that will be used throughout the proofs.

For $v \in \mathcal{P}$ we denote by $|v|$ the cardinality of $v$. For a function $\varphi: \mathbb{R}^{|v|} \to \mathbb{R}$, we denote by $V_{n,\varepsilon}$ the empirical process defined by

$$V_{n,\varepsilon}(\varphi) = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \varphi(X_{v,i}). \tag{23}$$

For the sake of simplicity we assume $\sigma = 1$. Moreover, we set R0 = 1, see (10). Consequently, for any function $f \in \mathcal{H}$, $\|f_v\|_{\mathcal{H}_v} \le 1$ and $\|f_v\|_\infty \le \|f\|_{\mathcal{H}_v}$. The proofs can be done in exactly the same way in the general case. In the proofs, the $\|\cdot\|_{L^2(P_X)}$ norm will be denoted by $\|\cdot\|_2$.

8.1. Proof of Theorem 4.1

The proof is based on four main lemmas proved in Section 8.5. Other lemmas used throughout the proof are stated in Section 8.4. Their proofs are postponed to Section 8.6.

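Using that for any $v \in S_f$ and any norm $\|\cdot\|$ on $\mathcal{H}_v$, $\|f_v\| - \|\hat f_v\| \le \|f_v - \hat f_v\|$, and that for any $v \notin S_f$, $\|f_v\| = 0$, we get that

$$\sum_{v \in \mathcal{P}} \mu_v \|f_v\|_{\mathcal{H}_v} - \sum_{v \in \mathcal{P}} \mu_v \|\hat f_v\|_{\mathcal{H}_v} \le \sum_{v \in S_f} \mu_v \|f_v - \hat f_v\|_{\mathcal{H}_v} - \sum_{v \in S_f^c} \mu_v \|\hat f_v\|_{\mathcal{H}_v}, \tag{24}$$

$$\sum_{v \in \mathcal{P}} \gamma_v \|f_v\|_n - \sum_{v \in \mathcal{P}} \gamma_v \|\hat f_v\|_n \le \sum_{v \in S_f} \gamma_v \|f_v - \hat f_v\|_n - \sum_{v \in S_f^c} \gamma_v \|\hat f_v\|_n. \tag{25}$$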

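Combining (24) and (25) with the fact that for any function $f \in \mathcal{H}$, $\mathcal{L}(\hat f) \le \mathcal{L}(f)$, we obtain that

$$\|m - \hat f\|_n^2 \le \|m - f\|_n^2 + B,$$

with

$$B = 2 V_{n,\varepsilon}\big(\hat f - f\big) + \sum_{v \in S_f} \Big[ \mu_v \|\hat f_v - f_v\|_{\mathcal{H}_v} + \gamma_v \|\hat f_v - f_v\|_n \Big] - \sum_{v \in S_f^c} \Big[ \mu_v \|\hat f_v\|_{\mathcal{H}_v} + \gamma_v \|\hat f_v\|_n \Big]. \tag{26}$$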

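If $\|m - f\|_n^2 \ge B$, we immediately get the result since in that case

$$\|m - \hat f\|_n^2 \le 2\|m - f\|_n^2 \le 2\|m - f\|_n^2 + \sum_{v \in S_f} \mu_v + \sum_{v \in S_f} \gamma_v^2.$$

If $\|m - f\|_n^2 < B$, we get that

$$\|\hat f - m\|_n^2 \le 2B \tag{27}$$

$$\le 4 \big| V_{n,\varepsilon}\big(\hat f - f\big) \big| + 2 \sum_{v \in S_f} \Big[ \mu_v \|\hat f_v - f_v\|_{\mathcal{H}_v} + \gamma_v \|\hat f_v - f_v\|_n \Big]. \tag{28}$$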

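The control of the empirical process $|V_{n,\varepsilon}(\hat f - f)|$ is given by the following lemma (proved in Section 8.5.1, page 52).

Lemma 8.1. Let $V_{n,\varepsilon}$ be defined in (23). For any $f$ in $\mathcal{F}$, we consider the event $\mathcal{T}$ defined as

$$\mathcal{T} = \Big\{ \forall f \in \mathcal{F},\ \forall v \in \mathcal{P},\ \big| V_{n,\varepsilon}\big(\hat f_v - f_v\big) \big| \le \kappa \lambda_{n,v}^2 \|\hat f_v - f_v\|_{\mathcal{H}_v} + \kappa \lambda_{n,v} \|\hat f_v - f_v\|_n \Big\}, \tag{29}$$

where the quantities $\lambda_{n,v}$ are defined by Equation (13) and where $\kappa = 10 + 4\Delta$. Then, for some positive constants $c_1, c_2$,

$$P_{X,\varepsilon}(\mathcal{T}) \ge 1 - c_1 \sum_{v \in \mathcal{P}} \exp(-n c_2 \lambda_{n,v}^2).$$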

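Conditioning on $\mathcal{T}$, Inequality (28) becomes

$$\|\hat f - m\|_n^2 \le 4\kappa \sum_{v \in \mathcal{P}} \Big[ \lambda_{n,v}^2 \|\hat f_v - f_v\|_{\mathcal{H}_v} + \lambda_{n,v} \|\hat f_v - f_v\|_n \Big] + 2 \sum_{v \in S_f} \Big[ \mu_v \|\hat f_v - f_v\|_{\mathcal{H}_v} + \gamma_v \|\hat f_v - f_v\|_n \Big],$$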

which may be decomposed as follows

kfb−mk2_n ≤ X

v∈Sf

4κλ2_n,v+ 2µv

kfbv−fvkHv+ X

v∈Sf

[4κλn,v+ 2γv]kfbv−fvkn+

4 X

v∈Sc f

κλ2_n,vkfbv−fvkHv+ 4 X

v∈Sc f