Parametric estimation and tests through divergences and duality technique


MICHEL BRONIATOWSKI∗ AND AMOR KEZIOU∗∗

Abstract. We introduce estimation and test procedures through divergence optimization for discrete or continuous parametric models. This approach is based on a new dual representation for divergences. We treat point estimation and tests for simple and composite hypotheses, extending the maximum likelihood technique. Another view of the maximum likelihood approach, for estimation and test, is given. We prove existence and consistency of the proposed estimates. The limit laws of the estimates and test statistics (including the generalized likelihood ratio one) are given both under the null and the alternative hypotheses, and an approximation of the power functions is deduced. A solution to the irregularity problem of the generalized likelihood ratio test of the number of components in a mixture is given, and a new test, based on the χ²-divergence on signed finite measures and the duality technique, is proposed.

Key words: Parametric estimation; Parametric test; Maximum likelihood; Mixture; Power function; Duality; φ-divergence.

MSC (2000) Classification: 62F03; 62F10; 62F30.

1. Introduction and notation

Let (X, B) be a measurable space and P be a given probability measure (p.m.) on (X, B). Denote M the real vector space of all signed finite measures on (X, B) and M(P) the vector subspace of all signed finite measures absolutely continuous (a.c.) with respect to (w.r.t.) P. Denote also $M^1$ the set of all p.m.’s on (X, B) and $M^1(P)$ the subset of all p.m.’s a.c. w.r.t. P. Let ϕ be a proper (1) closed (2) convex function from ]−∞, +∞[ to [0, +∞] with ϕ(1) = 0 and such that its domain dom ϕ := {x ∈ R such that ϕ(x) < ∞} is an interval with endpoints $a_ϕ < 1 < b_ϕ$ (which may be finite or infinite).

Date: Jun 2007.

(1) We say a function is proper if its domain is non void.

(2) The closedness of ϕ means that if $a_ϕ$ or $b_ϕ$ are finite numbers then ϕ(x) tends to $ϕ(a_ϕ)$ or $ϕ(b_ϕ)$ when $x ↓ a_ϕ$ or $x ↑ b_ϕ$, respectively.

For any signed finite measure Q in M(P), the φ-divergence between Q and P is defined by

\[ φ(Q, P) := \int_{X} ϕ\left( \frac{dQ}{dP}(x) \right) dP(x). \tag{1.1} \]

When Q is not a.c. w.r.t. P, we set φ(Q, P) = +∞. The φ-divergences were introduced by Csiszár (1963) as “f-divergences”. For all p.m. P, the mappings Q ∈ M ↦ φ(Q, P) are convex and take nonnegative values. When Q = P then φ(Q, P) = 0. Furthermore, if the function x ↦ ϕ(x) is strictly convex on a neighborhood of x = 1, then the following fundamental property holds

φ(Q, P ) = 0 if and only if Q = P. (1.2)

All these properties are presented in Csiszár (1963), Csiszár (1967a), Csiszár (1967b) and Liese and Vajda (1987) chapter 1, for φ-divergences defined on the set of all p.m.’s $M^1$. When the φ-divergences are defined on M, the same properties hold. Let us conclude these few remarks by noting that in general φ(Q, P) and φ(P, Q) are not equal. Hence, φ-divergences usually are not distances, but they merely measure some difference between two measures. Of course, a main feature of divergences between distributions of random variables X and Y is the invariance property with respect to a common smooth change of variables.

1.1. Examples of φ-divergences. When defined on $M^1$, the Kullback-Leibler (KL), modified Kullback-Leibler ($KL_m$), χ², modified χ² ($χ^2_m$), Hellinger (H), and $L^1$ divergences are respectively associated to the convex functions ϕ(x) = x log x − x + 1, ϕ(x) = − log x + x − 1, ϕ(x) = ½(x − 1)², ϕ(x) = ½(x − 1)²/x, ϕ(x) = 2(√x − 1)² and ϕ(x) = |x − 1|. All these divergences, except the $L^1$ one, belong to the class of the so-called “power divergences” introduced in Cressie and Read (1984) (see also Liese and Vajda (1987) chapter 2). They are defined through the class of convex functions

\[ x \in ]0, +\infty[ \;\mapsto\; ϕ_γ(x) := \frac{x^{γ} - γ x + γ - 1}{γ(γ - 1)} \tag{1.3} \]

if γ ∈ R \ {0, 1}, $ϕ_0(x) := − \log x + x − 1$ and $ϕ_1(x) := x \log x − x + 1$. (For all γ ∈ R, we define $ϕ_γ(0) := \lim_{x↓0} ϕ_γ(x)$.) So, the KL-divergence is associated to $ϕ_1$, the $KL_m$ to $ϕ_0$, the χ² to $ϕ_2$, the $χ^2_m$ to $ϕ_{−1}$ and the Hellinger distance to $ϕ_{1/2}$.
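As a quick sanity check of the correspondence just listed, the family (1.3) can be coded directly and compared with the named generators. This is only an illustrative sketch: the function name `phi_gamma` and the handling of the boundary point x = 0 by its limit are choices made here, not notation from the paper.

```python
import math

def phi_gamma(x: float, gamma: float) -> float:
    """Power-divergence generator of (1.3), with the gamma = 0 and gamma = 1 cases."""
    if x < 0:
        raise ValueError("this sketch only covers x >= 0")
    if gamma == 0:                      # modified Kullback-Leibler (KL_m)
        return math.inf if x == 0 else -math.log(x) + x - 1
    if gamma == 1:                      # Kullback-Leibler (KL); x log x -> 0 as x -> 0
        return 1.0 if x == 0 else x * math.log(x) - x + 1
    if x == 0 and gamma < 0:
        return math.inf                 # x**gamma blows up as x decreases to 0
    return (x**gamma - gamma * x + gamma - 1) / (gamma * (gamma - 1))

# gamma = 2 gives chi-square, gamma = 1/2 Hellinger, gamma = -1 modified chi-square.
chi2_value = phi_gamma(3.0, 2)          # (1/2) * (3 - 1)**2
hellinger_value = phi_gamma(4.0, 0.5)   # 2 * (sqrt(4) - 1)**2
```

Every member of the family vanishes at x = 1, which is the normalisation ϕ(1) = 0 required in the introduction.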

We extend the definition of the power divergence functions $Q ∈ M^1 ↦ φ_γ(Q, P)$ onto the whole vector space M of all signed finite measures via the extension of the definition of the convex functions $ϕ_γ$: for all γ ∈ R such that the function $x ↦ ϕ_γ(x)$ is not defined on ]−∞, 0[, or is defined but not convex on the whole of R, set

\[ x \in ]-\infty, +\infty[ \;\mapsto\; \begin{cases} ϕ_γ(x) & \text{if } x \in [0, +\infty[, \\ +\infty & \text{if } x \in ]-\infty, 0[. \end{cases} \tag{1.4} \]

Note that for the χ²-divergence, the corresponding ϕ function $ϕ_2(x) := \frac{1}{2}(x − 1)^2$ is defined and convex on the whole of R.

In this paper, we are interested in estimation and tests using φ-divergences. An i.i.d. sample $X_1, \ldots, X_n$ with common unknown distribution P is observed and some p.m. Q is given. We aim to estimate φ(Q, P) and, more generally, $\inf_{Q∈Ω} φ(Q, P)$, where Ω is some set of measures, as well as the measure $Q^*$ achieving the infimum on Ω. In the parametric context, these problems can be well defined and lead to new results in estimation and tests, extending classical notions.

1.2. Statistical examples and motivations.

1.2.1. Tests of fit. Let $Q_0$ and P be two p.m.’s with the same finite discrete support S. It holds that

\[ φ(Q_0, P) = \sum_{j \in S} ϕ\left( \frac{Q_0(j)}{P(j)} \right) P(j), \]

which can then be estimated via “plug-in”, setting

\[ \tilde{φ}(Q_0, P) := φ(Q_0, P_n) = \sum_{j \in S} ϕ\left( \frac{Q_0(j)}{P_n(j)} \right) P_n(j), \]

where $P_n$ is the empirical measure pertaining to the sample $X_1, \ldots, X_n$ with distribution P, namely,

\[ P_n := \frac{1}{n} \sum_{i=1}^{n} δ_{X_i}, \]

in which $δ_x$ denotes the Dirac measure at point x, for all x.
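The plug-in estimate on a finite support is simple enough to compute by hand; the following sketch does so for the χ² generator. The function names and the toy data are assumptions made for illustration. Returning +∞ when a support point is unobserved follows the convention that φ(Q, P) = +∞ when Q is not a.c. w.r.t. P.

```python
from collections import Counter

def chi2_generator(x: float) -> float:
    """phi(x) = (1/2)(x - 1)^2, the chi-square generator."""
    return 0.5 * (x - 1.0) ** 2

def plug_in_divergence(phi, q0: dict, sample: list) -> float:
    """Sum over j in S of phi(Q0(j) / Pn(j)) * Pn(j), with Pn the empirical measure."""
    n = len(sample)
    counts = Counter(sample)
    total = 0.0
    for j, q in q0.items():
        if counts[j] == 0:
            # Q0 charges a point that Pn does not: Q0 is not a.c. w.r.t. Pn.
            return float("inf")
        pn = counts[j] / n
        total += phi(q / pn) * pn
    return total

q0 = {"a": 0.5, "b": 0.3, "c": 0.2}                 # hypothetical null distribution
sample = ["a"] * 4 + ["b"] * 4 + ["c"] * 2          # hypothetical observations
estimate = plug_in_divergence(chi2_generator, q0, sample)
```

The infinite value returned for an unobserved support point is the practical obstacle that arises for infinite or continuous supports.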

More generally, when $Q_0$ and P have continuous common support S, consider a partition $A_1, \ldots, A_k$ of S. The divergence $φ(Q_0, P)$ can be approximated by

\[ φ(Q_0, P) \simeq \sum_{j=1}^{k} ϕ\left( \frac{Q_0(A_j)}{P(A_j)} \right) P(A_j), \tag{1.5} \]

which, in turn, can be estimated by

\[ \tilde{φ}(Q_0, P) = \sum_{j=1}^{k} ϕ\left( \frac{Q_0(A_j)}{P_n(A_j)} \right) P_n(A_j). \]

In this vein, goodness of fit tests have been proposed by Cressie and Read (1984), Landaburu and Pardo (2000) for a fixed number of classes, and by Györfi and Vajda (2002) when the number of classes depends on the sample size.

1.2.2. Parametric estimation and tests. Let $\{P_θ; θ ∈ Θ\}$ be some parametric model with Θ an open set in $R^d$. On the basis of an i.i.d. sample $X_1, \ldots, X_n$ with distribution $P_{θ_T}$, we want to estimate $θ_T$, the unknown true value of the parameter, and perform statistical tests on the parameter using φ-divergences. When all p.m.’s $P_θ$ share the same finite support S, we have

\[ φ(P_θ, P_{θ_T}) = \sum_{j \in S} ϕ\left( \frac{P_θ(j)}{P_{θ_T}(j)} \right) P_{θ_T}(j). \]

For such models, Liese and Vajda (1987), Lindsay (1994) and Morales et al. (1995) introduced the so-called “minimum φ-divergence estimates” (MφDE’s) (minimum disparity estimators in Lindsay (1994)) of the parameter $θ_T$, defined by

\[ \tilde{θ}_φ := \arg\inf_{θ \in Θ} φ(P_θ, P_n), \tag{1.6} \]

where $φ(P_θ, P_n)$ is the “plug-in” estimate of $φ(P_θ, P_{θ_T})$, i.e.,

\[ φ(P_θ, P_n) = \sum_{j \in S} ϕ\left( \frac{P_θ(j)}{P_n(j)} \right) P_n(j). \]

Various parametric tests can be performed based on the previous estimates of φ-divergences; see Lindsay (1994) and Morales et al. (1995). Also in Information theory, estimation of the KL-divergence leads to accurate bounds for the average length of a Shannon code based on a source estimate; see e.g. Cover and Thomas (1991) theorem 5.4.3.

The class of estimates (1.6) contains the maximum likelihood estimate (MLE). Indeed, when $ϕ(x) = ϕ_0(x) = − \log x + x − 1$, we obtain

\[ \tilde{θ}_{KL_m} := \arg\inf_{θ \in Θ} KL_m(P_θ, P_n) = \arg\inf_{θ \in Θ} \sum_{j \in S} -\log(P_θ(j))\, P_n(j) = \mathrm{MLE}. \tag{1.7} \]

The MφDE’s (1.6) are motivated by the fact that a suitable choice of the divergence may lead to an estimate more robust than the ML one (see e.g. Lindsay (1994), Basu and Lindsay (1994) and Jiménez and Shao (2001)).
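The identity behind (1.7) can be checked numerically on the smallest possible model. The sketch below assumes a Bernoulli(θ) model on S = {0, 1} and hypothetical counts (7 ones out of 20 observations). It minimises the plug-in $KL_m$ divergence over a grid and compares the minimiser with the sample proportion, which is the MLE in this model. Expanding $ϕ_0$ shows that the plug-in criterion equals $−\sum_j P_n(j) \log P_θ(j)$ up to terms free of θ, so the two minimisers should agree.

```python
import math

def klm_plug_in(theta: float, ones: int, n: int) -> float:
    """Plug-in KL_m(P_theta, P_n) for a Bernoulli(theta) model on S = {0, 1}."""
    phi0 = lambda x: -math.log(x) + x - 1
    p_model = {1: theta, 0: 1.0 - theta}
    p_emp = {1: ones / n, 0: (n - ones) / n}
    return sum(phi0(p_model[j] / p_emp[j]) * p_emp[j] for j in (0, 1))

ones, n = 7, 20                                      # hypothetical data summary
grid = [k / 1000 for k in range(1, 1000)]            # theta in ]0, 1[
theta_hat = min(grid, key=lambda t: klm_plug_in(t, ones, n))
mle = ones / n                                       # sample proportion
```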

When interested in testing hypotheses $H_0 : θ_T = θ_0$ against alternatives $H_1 : θ_T ≠ θ_0$, where $θ_0$ is a given value, we can use the statistic $φ(P_{θ_0}, P_n)$, the plug-in estimate of the φ-divergence between $P_{θ_0}$ and $P_{θ_T}$, rejecting $H_0$ for large values of the statistic; see e.g.

based on $KL_m(P_{θ_0}, P_n)$ does not coincide with the generalized likelihood ratio one, which is defined through the generalized likelihood ratio statistic

\[ λ_n := 2 \log \frac{\sup_{θ \in Θ} \prod_{i=1}^{n} p_θ(X_i)}{\prod_{i=1}^{n} p_{θ_0}(X_i)}. \]

The new estimate $\widehat{KL}_m(P_{θ_0}, P_{θ_T})$ of $KL_m(P_{θ_0}, P_{θ_T})$, which is proposed in this paper, leads to the generalized likelihood ratio test (GLRT); see remark 3.4 below.

When the support S is continuous, the plug-in estimates (1.6) are not well defined; Basu and Lindsay (1994) investigate the so-called “minimum disparity estimators” (MDE’s) for continuous models. They consider

\[ p_{n,h}(x) := \frac{1}{h} \int K\left( \frac{x - t}{h} \right) dP_n(t), \]

the kernel estimate of the density $p_{θ_T}$ of the $X_i$’s, and the modified version of the densities $p_θ$ defined by

\[ p^*_{θ,h}(x) := \frac{1}{h} \int K\left( \frac{x - t}{h} \right) dP_θ(t). \]

The MDE is then defined as

\[ \tilde{θ}_{φ,h} := \arg\min_{θ \in Θ} φ(p^*_{θ,h}, p_{n,h}) := \arg\min_{θ \in Θ} \int ϕ\left( \frac{p^*_{θ,h}(t)}{p_{n,h}(t)} \right) p_{n,h}(t)\, dt. \tag{1.8} \]

When ϕ(x) = − log x + x − 1, this estimate clearly, due to smoothing, does not generally coincide with the ML one. Also, the test based on $KL_m(p^*_{θ_0,h}, p_{n,h})$ is different from the generalized likelihood ratio test. So, this methodology does not extend the maximum likelihood method. Further, the estimates (1.8) pose the problem of the choice of the smoothing parameters K and h.

No direct plug-in estimate of φ(Q, P) can be performed by substitution of P by $P_n$ when Q belongs to some class of p.m.’s a.c. w.r.t. the Lebesgue measure λ. In order to build tests pertaining to the density p := dP/dλ, Beran (1977) and Berlinet et al. (1998) proposed to use the smoothed kernel estimate $p_n$ of p. Beran (1977) handles the Hellinger distance, while Berlinet (1999) obtains the limit distribution of the estimate for the Kullback-Leibler divergence. The extension of their results to other divergences remains an open problem; see Berlinet (1999), Györfi et al. (1998), and Berlinet et al. (1998). Also, it seems difficult to use such methods to obtain the limit distribution of an estimate of $\inf_{Q∈Ω} φ(Q, P)$ when Ω is some class of p.m.’s; this problem will be treated in the present paper when Ω is a parametric class, avoiding the smoothing method. In any case, an exhaustive study of MφDE’s seems necessary, in a way that would include both the discrete and the continuous support cases. This is precisely the scope of this paper.

When the support S is discrete finite, the estimates in (1.6) are well defined for large n, since then the empirical measure gives positive mass to any point of S with probability one. However, when S is discrete infinite or continuous, the plug-in estimate $φ(P_θ, P_n)$ usually takes an infinite value when no use is made of some partition-based approximation, as done in (1.5). In Broniatowski (2003), a new estimation procedure is proposed in order to estimate the KL-divergence between some set of p.m.’s Ω and some p.m. P, without making use of any partitioning or smoothing, but merely making use of the well known “dual” representation of the KL-divergence as the Fenchel-Legendre transform of the moment generating function. Extending the paper by Broniatowski (2003), we will use the new dual representations of φ-divergences (see Broniatowski and Keziou (2006) theorem 4.4 and Keziou (2003) theorem 2.1) to define the minimum φ-divergence estimates in both discrete and continuous parametric models. These representations are the starting point for the definition of estimates of the parameter $θ_T$, which we will call “minimum dual φ-divergence estimates” (MDφDE’s). They are defined in parametric models $\{P_θ; θ ∈ Θ\}$, where the p.m.’s $P_θ$ do not necessarily have finite support; it can be discrete or continuous, bounded or not. Also, the same representations will be applied in order to estimate $φ(P_{θ_0}, P_{θ_T})$ and $\inf_{θ∈Θ_0} φ(P_θ, P_{θ_T})$, where $θ_0$ is a given value in Θ and $Θ_0$ is a given subset of Θ, which leads to various simple and composite tests pertaining to $θ_T$, the true unknown value of the parameter. When ϕ(x) = − log x + x − 1, the MDφD estimate coincides with the maximum likelihood one (see remark 3.2 below); since our approach also includes test procedures, it will be seen that with this particular choice for the function ϕ, we recover the classical likelihood ratio test for simple hypotheses and for composite hypotheses (see remark 3.4 and remark 3.6 below).

The rest of this paper is organized as follows. In section 2, we recall the dual representations of φ-divergences obtained by Broniatowski and Keziou (2006) theorem 4.4. Section 3 presents, through the dual representation of φ-divergences, various estimates and tests in the parametric framework and deals with their asymptotic properties both under the null and the alternative hypotheses. The existence and consistency of the proposed estimates are proved using similar arguments as developed in Qin and Lawless (1994) lemma 1. We use the limit laws of the proposed test statistics, in a similar way to Morales and Pardo (2001), to give an approximation to the power functions of the tests (including the GLR one). Observe that the power functions of the likelihood ratio type tests are not generally known. One of our contributions is to provide explicit power functions in the general case for simple or composite hypotheses. As a by-product, we obtain the minimal sample size which ensures a given power, for quite general simple or composite hypotheses. In section 4, we give a solution to the irregularity problem of the GLRT of the number of components in a mixture; we propose a new test based on the χ²-divergence on signed finite measures, and a new procedure of construction of confidence regions for the parameter in the case where $θ_T$ may be a boundary value of the parameter space Θ. All proofs are in the Appendix. We sometimes write P f for ∫ f dP for any measure P and any function f, when defined.

2. Fenchel Duality for φ-divergences

In this section, we recall a version of the dual representations of φ-divergences obtained in Broniatowski and Keziou (2006), using the Fenchel duality technique. First, we give some notation and some results about the conjugate (or Fenchel-Legendre transform) of real convex functions; see e.g. Rockafellar (1970) for proofs. The Fenchel-Legendre transform of ϕ will be denoted $ϕ^*$, i.e.,

\[ t \in R \;\mapsto\; ϕ^*(t) := \sup_{x \in R} \{ t x - ϕ(x) \}, \tag{2.1} \]

and the endpoints of dom $ϕ^*$ (the domain of $ϕ^*$) will be denoted $a_{ϕ^*}$ and $b_{ϕ^*}$ with $a_{ϕ^*} ≤ b_{ϕ^*}$. Note that $ϕ^*$ is a proper closed convex function. In particular, $a_{ϕ^*} < 0 < b_{ϕ^*}$, $ϕ^*(0) = 0$ and

\[ a_{ϕ^*} = \lim_{y \to -\infty} \frac{ϕ(y)}{y}, \qquad b_{ϕ^*} = \lim_{y \to +\infty} \frac{ϕ(y)}{y}. \tag{2.2} \]

By the closedness of ϕ, applying the duality principle, the conjugate $ϕ^{**}$ of $ϕ^*$ coincides with ϕ, i.e.,

\[ ϕ^{**}(t) := \sup_{x \in R} \{ t x - ϕ^*(x) \} = ϕ(t), \quad \text{for all } t \in R. \tag{2.3} \]

For the proper convex functions defined on R (endowed with the usual topology), the lower semi-continuity (3) and the closedness properties are equivalent.
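For a concrete feel of (2.1) and (2.3), the conjugate can be approximated by a maximum over a grid and compared with a closed form. The sketch below uses the χ² generator ϕ(x) = ½(x − 1)², whose conjugate works out to ϕ*(t) = t + t²/2 (a short calculus exercise carried out here, not a formula quoted from the paper). The helper names and the grid bounds are assumptions of this example.

```python
def conjugate_on_grid(phi, t: float, lo: float = -10.0, hi: float = 10.0,
                      steps: int = 20000) -> float:
    """Approximate sup_x {t*x - phi(x)} by a maximum over an even grid on [lo, hi]."""
    best = float("-inf")
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        best = max(best, t * x - phi(x))
    return best

def phi_chi2(x: float) -> float:
    return 0.5 * (x - 1.0) ** 2

def phi_chi2_conj(t: float) -> float:
    # The supremum in (2.1) is attained at x = 1 + t, which gives t + t^2 / 2.
    return t + 0.5 * t * t

approx = conjugate_on_grid(phi_chi2, 0.7)            # compare with phi_chi2_conj(0.7)
biconj = conjugate_on_grid(phi_chi2_conj, 2.5)       # compare with phi_chi2(2.5), cf. (2.3)
```

The grid maximum is only reliable when the maximiser lies inside [lo, hi], so the bounds would need widening for larger |t|.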

The function ϕ (resp. $ϕ^*$) is differentiable if it is differentiable on $]a_ϕ, b_ϕ[$ (resp. $]a_{ϕ^*}, b_{ϕ^*}[$), the interior of its domain. Also ϕ (resp. $ϕ^*$) is strictly convex if it is strictly convex on $]a_ϕ, b_ϕ[$ (resp. $]a_{ϕ^*}, b_{ϕ^*}[$).

(3) We say a function ϕ is lower semi-continuous if the level sets {x ∈ R such that ϕ(x) ≤ α}, α ∈ R, are closed.

The strict convexity of ϕ is equivalent to the condition that its conjugate $ϕ^*$ is essentially smooth, i.e., differentiable with

\[ \lim_{t \downarrow a_{ϕ^*}} (ϕ^*)'(t) = -\infty \ \text{ if } a_{ϕ^*} > -\infty, \qquad \lim_{t \uparrow b_{ϕ^*}} (ϕ^*)'(t) = +\infty \ \text{ if } b_{ϕ^*} < +\infty. \tag{2.4} \]

Conversely, ϕ is essentially smooth if and only if $ϕ^*$ is strictly convex; see e.g. Rockafellar (1970) section 26 for the proofs of these properties.

If ϕ is differentiable, we denote ϕ′ the derivative function of ϕ, and we define $ϕ'(a_ϕ)$ and $ϕ'(b_ϕ)$ to be the limits (which may be finite or infinite) $\lim_{x↓a_ϕ} ϕ'(x)$ and $\lim_{x↑b_ϕ} ϕ'(x)$, respectively. We denote Im ϕ′ the set of all values of the function ϕ′, i.e., Im ϕ′ := {ϕ′(x) such that $x ∈ [a_ϕ, b_ϕ]$}. If additionally the function ϕ is strictly convex, then ϕ′ is increasing on $[a_ϕ, b_ϕ]$. Hence, it is a one-to-one function from $[a_ϕ, b_ϕ]$ to Im ϕ′. In this case, $ϕ'^{-1}$ denotes the inverse function of ϕ′ from Im ϕ′ to $[a_ϕ, b_ϕ]$.

If ϕ is differentiable, then for all $x ∈ ]a_ϕ, b_ϕ[$,

\[ ϕ^*\left( ϕ'(x) \right) = x ϕ'(x) - ϕ(x). \tag{2.5} \]

If additionally ϕ is strictly convex, then for all t ∈ Im ϕ′ we have

\[ ϕ^*(t) = t\, ϕ'^{-1}(t) - ϕ\left( ϕ'^{-1}(t) \right) \quad \text{and} \quad (ϕ^*)'(t) = ϕ'^{-1}(t). \tag{2.6} \]

On the other hand, if ϕ is essentially smooth, then the interior of the domain of $ϕ^*$ coincides with that of Im ϕ′, i.e., $(a_{ϕ^*}, b_{ϕ^*}) = (ϕ'(a_ϕ), ϕ'(b_ϕ))$.

Let F be some class of B-measurable real valued functions f defined on X, and denote $M_F$ the real vector subspace of M defined by

\[ M_F := \left\{ Q \in M \text{ such that } \int |f| \, d|Q| < \infty, \text{ for all } f \in F \right\}. \]

In the following theorem, we recall the dual representation of φ-divergences (for the proof, see Broniatowski and Keziou (2006) theorem 4.4).

Theorem 2.1. Assume that ϕ is differentiable. Then, for all $Q ∈ M_F$ such that φ(Q, P) is finite and ϕ′(dQ/dP) belongs to F, the φ-divergence φ(Q, P) admits the dual representation

\[ φ(Q, P) = \sup_{f \in F} \left\{ \int f \, dQ - \int ϕ^*(f) \, dP \right\}, \tag{2.7} \]

and the function f := ϕ′(dQ/dP) is a dual optimal solution. Furthermore, if ϕ is essentially smooth, then f := ϕ′(dQ/dP) is the unique dual optimal solution (P-a.e.).
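On a finite support, the dual representation (2.7) can be checked by direct computation. The sketch below uses the χ² generator, for which ϕ′(x) = x − 1 and ϕ*(t) = t + t²/2, and two hypothetical three-point distributions P and Q. It evaluates the dual objective at f = ϕ′(dQ/dP), where it should match the divergence, and at a few perturbed functions, where it should not exceed it. All names and numbers here are illustrative assumptions.

```python
def phi(x: float) -> float:            # chi-square generator
    return 0.5 * (x - 1.0) ** 2

def phi_prime(x: float) -> float:
    return x - 1.0

def phi_conj(t: float) -> float:       # conjugate of the chi-square generator
    return t + 0.5 * t * t

P = [0.2, 0.5, 0.3]                    # hypothetical reference distribution
Q = [0.4, 0.4, 0.2]                    # hypothetical second distribution

def divergence(Q, P) -> float:
    """phi-divergence on a finite support: sum of phi(Q(j)/P(j)) P(j)."""
    return sum(phi(q / p) * p for q, p in zip(Q, P))

def dual_objective(f, Q, P) -> float:
    """Integral of f dQ minus integral of phi*(f) dP, with f given by its values."""
    return sum(fj * q - phi_conj(fj) * p for fj, q, p in zip(f, Q, P))

f_opt = [phi_prime(q / p) for q, p in zip(Q, P)]     # candidate dual optimal solution
gap = divergence(Q, P) - dual_objective(f_opt, Q, P)
```

Here the class F is implicitly all functions on the three support points. The theorem allows smaller classes as long as ϕ′(dQ/dP) belongs to them, which is what makes the parametric class (3.3) usable.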

3. Parametric estimation and tests through minimum φ-divergence approach and duality technique

We consider an identifiable parametric model $\{P_θ; θ ∈ Θ\}$ defined on some measurable space (X, B), where Θ is some open set in $R^d$. For notational clarity, we write φ(θ, α) instead of $φ(P_θ, P_α)$. We assume that for any θ in Θ, $P_θ$ has density $p_θ$ with respect to some dominating σ-finite measure λ, which can be either with countable support or not. Assume further that the support S of the p.m. $P_θ$ does not depend upon θ. On the basis of an i.i.d. sample $X_1, \ldots, X_n$ with distribution $P_{θ_T}$, we intend to estimate $θ_T$, the true unknown value of the parameter. We will consider only strictly convex functions ϕ which are finite on ]0, +∞[ and essentially smooth. We will use the following assumption

\[ \int \left| ϕ'\left( \frac{p_θ(x)}{p_α(x)} \right) \right| dP_θ(x) < \infty. \tag{3.1} \]

Note that if the function ϕ satisfies the condition

there exists 0 < δ < 1 such that for all c in [1 − δ, 1 + δ], we can find numbers $c_1, c_2, c_3$ such that

\[ ϕ(cx) \le c_1 ϕ(x) + c_2 |x| + c_3, \quad \text{for all real } x, \tag{3.2} \]

then the assumption (3.1) is satisfied whenever φ(θ, α) < ∞; see e.g. Broniatowski and Keziou (2006) lemma 3.2. Note also that the real convex functions $ϕ_γ$ (1.4), associated to the class of power divergences, all satisfy the condition (3.2), including all standard divergences.

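For a given θ ∈ Θ, consider the class of functions

\[ F := F_θ := \left\{ x \mapsto ϕ'\left( \frac{p_θ(x)}{p_α(x)} \right) ; \ α \in Θ \right\}. \tag{3.3} \]

So, by application of theorem 2.1 above, when assumption (3.1) holds for any α ∈ Θ, we obtain

\[ φ(θ, θ_T) = \sup_{f \in F_θ} \left\{ \int f \, dP_θ - \int ϕ^*(f) \, dP_{θ_T} \right\}, \]

which, by (2.5), can be written as

\[ φ(θ, θ_T) = \sup_{α \in Θ} \left\{ \int ϕ'\left( \frac{p_θ}{p_α} \right) dP_θ - \int \left[ \frac{p_θ}{p_α} ϕ'\left( \frac{p_θ}{p_α} \right) - ϕ\left( \frac{p_θ}{p_α} \right) \right] dP_{θ_T} \right\}. \tag{3.4} \]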

Furthermore, the supremum in this display is unique and it is achieved at $α = θ_T$ independently of the value of θ. Hence, it is reasonable to estimate $φ(θ, θ_T) = \int ϕ(p_θ/p_{θ_T}) \, dP_{θ_T}$, the φ-divergence between $P_θ$ and $P_{θ_T}$, by

\[ \hat{φ}(θ, θ_T) := \sup_{α \in Θ} \left\{ \int ϕ'\left( \frac{p_θ}{p_α} \right) dP_θ - \int \left[ \frac{p_θ}{p_α} ϕ'\left( \frac{p_θ}{p_α} \right) - ϕ\left( \frac{p_θ}{p_α} \right) \right] dP_n \right\}, \tag{3.5} \]

in which we have replaced $P_{θ_T}$ by its estimate $P_n$, the empirical measure associated to the data.

For a given θ ∈ Θ, since the supremum in (3.4) is unique and it is achieved at $α = θ_T$, define the following class of estimates of $θ_T$

\[ \hat{α}_φ(θ) := \arg\sup_{α \in Θ} \left\{ \int ϕ'\left( \frac{p_θ}{p_α} \right) dP_θ - \int \left[ \frac{p_θ}{p_α} ϕ'\left( \frac{p_θ}{p_α} \right) - ϕ\left( \frac{p_θ}{p_α} \right) \right] dP_n \right\} \tag{3.6} \]

which we call “dual φ-divergence estimates” (DφDE’s); (in the sequel, we sometimes write $\hat{α}$ instead of $\hat{α}_φ(θ)$). Further, we have

\[ \inf_{θ \in Θ} φ(θ, θ_T) = φ(θ_T, θ_T) = 0. \]

The infimum in this display is unique and it is achieved at $θ = θ_T$. It follows that a natural definition of minimum φ-divergence estimates of $θ_T$, which we will call “minimum dual φ-divergence estimates” (MDφDE’s), is

\[ \hat{θ}_φ := \arg\inf_{θ \in Θ} \sup_{α \in Θ} \left\{ \int ϕ'\left( \frac{p_θ}{p_α} \right) dP_θ - \int \left[ \frac{p_θ}{p_α} ϕ'\left( \frac{p_θ}{p_α} \right) - ϕ\left( \frac{p_θ}{p_α} \right) \right] dP_n \right\}. \tag{3.7} \]

In order to simplify formulas (3.5), (3.6) and (3.7), define the functions

\[ g(θ, α) : x \mapsto g(θ, α, x) := \frac{p_θ(x)}{p_α(x)} ϕ'\left( \frac{p_θ(x)}{p_α(x)} \right) - ϕ\left( \frac{p_θ(x)}{p_α(x)} \right), \tag{3.8} \]

\[ f(θ, α) : x \mapsto f(θ, α, x) := ϕ'\left( \frac{p_θ(x)}{p_α(x)} \right) \tag{3.9} \]

and

\[ h(θ, α) : x \mapsto h(θ, α, x) := P_θ f(θ, α) - g(θ, α, x). \tag{3.10} \]

Hence, (3.5), (3.6) and (3.7) can be written as follows

\[ \hat{φ}(θ, θ_T) := \sup_{α \in Θ} P_n h(θ, α), \tag{3.11} \]

\[ \hat{α}_φ(θ) := \arg\sup_{α \in Θ} P_n h(θ, α) \tag{3.12} \]

and

\[ \hat{θ}_φ := \arg\inf_{θ \in Θ} \sup_{α \in Θ} P_n h(θ, α). \tag{3.13} \]

Formula (3.4) can be written as

\[ φ(θ, θ_T) = \sup_{α \in Θ} P_{θ_T} h(θ, α). \tag{3.14} \]
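To see the dual estimates (3.11) and (3.12) at work on a continuous model, the following sketch considers the Gaussian location model $p_θ = N(θ, 1)$ with the Hellinger generator ϕ(x) = 2(√x − 1)². For this pair the function h of (3.10) has a closed form, derived here for illustration and not taken from the paper: $P_θ f(θ, α) = 2 − 2\exp(−(θ − α)^2/8)$ and $g(θ, α, x) = 2\sqrt{p_θ(x)/p_α(x)} − 2$. The supremum over Θ is replaced by a maximum over a bounded grid, and the simulated sample, seed and grid are arbitrary choices.

```python
import math
import random

def h_hellinger(theta: float, alpha: float, x: float) -> float:
    """h(theta, alpha, x) of (3.10) for the Hellinger generator and N(theta, 1) densities."""
    # sqrt(p_theta(x) / p_alpha(x)) for unit-variance Gaussian densities
    sqrt_ratio = math.exp(0.5 * (theta - alpha) * x - 0.25 * (theta**2 - alpha**2))
    return 4.0 - 2.0 * math.exp(-((theta - alpha) ** 2) / 8.0) - 2.0 * sqrt_ratio

def dual_estimates(theta: float, sample: list, grid: list):
    """Return (phi_hat, alpha_hat): max and arg max over the grid of P_n h(theta, alpha)."""
    def pn_h(alpha: float) -> float:
        return sum(h_hellinger(theta, alpha, x) for x in sample) / len(sample)
    alpha_hat = max(grid, key=pn_h)
    return pn_h(alpha_hat), alpha_hat

random.seed(0)
theta_true, theta = 0.0, 1.0                         # simulated truth and a fixed theta
sample = [random.gauss(theta_true, 1.0) for _ in range(2000)]
grid = [k / 100 for k in range(-100, 201)]           # alpha restricted to [-1, 2]
phi_hat, alpha_hat = dual_estimates(theta, sample, grid)
# Hellinger divergence between N(theta, 1) and N(theta_true, 1), for comparison
phi_true = 4.0 * (1.0 - math.exp(-((theta - theta_true) ** 2) / 8.0))
```

No kernel smoothing or partition of the sample space is involved: the empirical measure enters only through the average of h. Minimising `phi_hat` over θ on a second grid would give a grid version of (3.13).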

Remark 3.1. For the $L^1$ distance, i.e., when ϕ(x) = |x − 1|, formula (3.4) does not apply since the corresponding ϕ function is not differentiable. However, using the general dual representation of divergences given in Broniatowski and Keziou (2006) theorem 4.1, we can obtain an explicit formula for the $L^1$ distance avoiding the differentiability assumption. A methodology on estimation and testing in $L^1$ distance has been proposed by Devroye and Lugosi (2001), and its consequences for composite hypothesis testing and for model selection based density estimates for nested classes of densities are presented in Devroye et al. (2002) and Biau and Devroye (2005).

Remark 3.2. (Another view of the ML estimate). The maximum likelihood estimate belongs to both classes of estimates (3.12) and (3.13). Indeed, it is obtained when ϕ(x) = − log x + x − 1, that is, as the dual modified KL-divergence estimate or as the minimum dual modified KL-divergence estimate, i.e., MLE = $DKL_mDE$ = $MDKL_mDE$. Indeed, we then have $P_θ f(θ, α) = 0$ and $P_n h(θ, α) = −\int \log(p_θ/p_α) \, dP_n$. Hence, by definitions (3.6) and (3.7), we get

\[ \hat{α}_{KL_m}(θ) = \arg\sup_{α \in Θ} - \int \log\left( \frac{p_θ}{p_α} \right) dP_n = \arg\sup_{α \in Θ} \int \log(p_α) \, dP_n = \mathrm{MLE} \]

independently of θ, and

\[ \hat{θ}_{KL_m} = \arg\inf_{θ \in Θ} \sup_{α \in Θ} - \int \log\left( \frac{p_θ}{p_α} \right) dP_n = \arg\sup_{θ \in Θ} \int \log(p_θ) \, dP_n = \mathrm{MLE}. \]

So, the MLE is the estimate of $θ_T$ that minimizes the estimate of the $KL_m$-divergence between the parametric model $\{P_θ; θ ∈ Θ\}$ and the p.m. $P_{θ_T}$.

3.1. The asymptotic properties of the DφDE’s $\hat{α}_φ(θ)$ and $\hat{φ}(θ, θ_T)$ for a given θ in Θ. This section deals with the asymptotic properties of the estimates (3.11) and (3.12). In the sequel, we assume that condition (3.1) holds for any α ∈ Θ. Also, we assume that the convex function ϕ has continuous derivatives up to 4th order, and the density $p_α(x)$ has continuous partial derivatives up to 3rd order (for all x λ-a.e.). We use |.| to denote the Euclidean norm and $I_{θ_T}$ the Fisher information matrix, i.e., the matrix defined by

\[ I_{θ_T} := \int \frac{p'_{θ_T} \, {p'_{θ_T}}^{T}}{p_{θ_T}} \, dλ. \]

In the following theorem, we state the existence and the consistency of the estimates $\hat{α}_φ(θ)$ and $\hat{φ}(θ, θ_T)$. We also give their limit laws. We will use the following assumptions.

(A.1) There exists a neighborhood $N(θ_T)$ of $θ_T$ such that the first and second order partial derivatives (w.r.t. α) of $f(θ, α, x) p_θ(x)$ are dominated on $N(θ_T)$ by some λ-integrable functions. The third order partial derivatives (w.r.t. α) of h(θ, α, x) are dominated on $N(θ_T)$ by some $P_{θ_T}$-integrable functions;

(A.2) The integrals $P_{θ_T} |(∂/∂α) h(θ, θ_T)|^2$ and $P_{θ_T} |(∂^2/∂α^2) h(θ, θ_T)|$ are finite, and the matrix $P_{θ_T} (∂^2/∂α^2) h(θ, θ_T)$ is non singular;

(A.3) The integral $P_{θ_T} h(θ, θ_T)^2$ is finite.

Theorem 3.1. Assume that assumptions (A.1) and (A.2) hold.

(a) Let $B(θ_T, n^{-1/3}) := \{α ∈ Θ; |α − θ_T| ≤ n^{-1/3}\}$. Then, as n → ∞, with probability one, the function $α ↦ P_n h(θ, α)$ attains its maximum value at some point $\hat{α}_φ(θ)$ in the interior of the ball B, which implies that the estimate $\hat{α}_φ(θ)$ is $n^{1/3}$-consistent and satisfies $P_n (∂/∂α) h(θ, \hat{α}_φ(θ)) = 0$.

(b) $\sqrt{n}\,(\hat{α}_φ(θ) − θ_T)$ converges in distribution to a centered multivariate normal random variable with covariance matrix

\[ V_φ(θ, θ_T) = S^{-1} M S^{-1} \tag{3.15} \]

with $S := −P_{θ_T} (∂^2/∂α^2) h(θ, θ_T)$ and $M := P_{θ_T} (∂/∂α) h(θ, θ_T) (∂/∂α)^T h(θ, θ_T)$. If $θ_T = θ$, then $V_φ(θ, θ_T) = V(θ_T) = I_{θ_T}^{-1}$.

(c) If $θ_T = θ$, then the statistic $\frac{2n}{ϕ''(1)} \hat{φ}(θ, θ_T)$ converges in distribution to a χ² random variable with d degrees of freedom.

(d) If additionally assumption (A.3) holds, then when $θ_T ≠ θ$, we have that $\sqrt{n}\,(\hat{φ}(θ, θ_T) − φ(θ, θ_T))$ converges in distribution to a centered normal random variable with variance

\[ σ^2_φ(θ, θ_T) = P_{θ_T} h(θ, θ_T)^2 - \left( P_{θ_T} h(θ, θ_T) \right)^2. \tag{3.16} \]

Remark 3.3. Using theorem 3.1 part (c), the estimate $\hat{φ}(θ_0, θ_T)$ can be used to perform statistical tests (asymptotically of level ε) of the null hypothesis $H_0 : θ_T = θ_0$ against the alternative $H_1 : θ_T ≠ θ_0$ for a given value $θ_0$. Since $φ(θ_0, θ_T)$ is nonnegative and takes value zero only when $θ_T = θ_0$, the tests are defined through the critical region

\[ C_φ(θ_0, θ_T) := \left\{ \frac{2n}{ϕ''(1)} \hat{φ}(θ_0, θ_T) > q_{d,ε} \right\} \tag{3.17} \]

where $q_{d,ε}$ is the (1 − ε)-quantile of the χ² distribution with d degrees of freedom. Note that these tests are all consistent, since $\hat{φ}(θ_0, θ_T)$ are n-consistent estimates of $φ(θ_0, θ_T) = 0$ under $H_0$, and $\sqrt{n}$-consistent estimates of $φ(θ_0, θ_T) > 0$ under $H_1$; see parts (c) and (d) in theorem 3.1 above. Further, the asymptotic result (d) in theorem 3.1 above can be used to give an approximation of the power function $θ_T ↦ β(θ_T) := P_{θ_T}(C_φ(θ_0, θ_T))$. We then obtain the following approximation

\[ β(θ_T) \approx 1 - F_N\left( \frac{\sqrt{n}}{σ_φ(θ_0, θ_T)} \left[ \frac{ϕ''(1)}{2n} q_{d,ε} - φ(θ_0, θ_T) \right] \right) \tag{3.18} \]

where $F_N$ is the cumulative distribution function of a normal random variable with mean zero and variance one. An important application of this approximation is the approximate sample size (3.19) below that ensures a power β for a given alternative $θ_T ≠ θ_0$. Let $n_0$ be the positive root of the equation

\[ β = 1 - F_N\left( \frac{\sqrt{n}}{σ_φ(θ_0, θ_T)} \left[ \frac{ϕ''(1)}{2n} q_{d,ε} - φ(θ_0, θ_T) \right] \right), \]

i.e.,

\[ n_0 = \frac{(a + b) - \sqrt{a(a + 2b)}}{2 φ(θ_0, θ_T)^2}, \]

where $a = σ^2_φ(θ_0, θ_T) \left[ F_N^{-1}(1 − β) \right]^2$ and $b = ϕ''(1)\, q_{d,ε}\, φ(θ_0, θ_T)$. The required sample size is then

\[ n^* = [n_0] + 1 \tag{3.19} \]

where [.] is used here to denote “integer part of”.
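The power approximation (3.18) and the sample size rule (3.19) translate into a few lines of code once the divergence φ(θ₀, θ_T), the standard deviation σ_φ(θ₀, θ_T), the constant ϕ″(1) and the χ² quantile are available. In the sketch below these are plain numeric inputs with hypothetical values, and the function names are assumptions. Rather than coding the closed form for n₀, the sketch solves the defining equation as a quadratic in √n. This yields the same n₀ when $F_N^{-1}(1 − β) ≥ 0$ and makes the square-root term enter with a plus sign when $F_N^{-1}(1 − β) < 0$, that is, for a target power above one half.

```python
import math
from statistics import NormalDist

def approx_power(n: float, divergence: float, sigma: float,
                 phi2_at_1: float, q: float) -> float:
    """Approximation (3.18) of the power at a fixed alternative."""
    z = math.sqrt(n) / sigma * (phi2_at_1 * q / (2.0 * n) - divergence)
    return 1.0 - NormalDist().cdf(z)

def required_sample_size(beta: float, divergence: float, sigma: float,
                         phi2_at_1: float, q: float):
    """Return (n_star, n0) with n_star = [n0] + 1 as in (3.19)."""
    z = NormalDist().inv_cdf(1.0 - beta)
    # With s = sqrt(n): divergence * s^2 + z * sigma * s - phi2_at_1 * q / 2 = 0.
    disc = (z * sigma) ** 2 + 2.0 * divergence * phi2_at_1 * q
    s = (-z * sigma + math.sqrt(disc)) / (2.0 * divergence)   # positive root
    n0 = s * s
    return math.floor(n0) + 1, n0

Q_CHI2_1_95 = 3.841458820694124      # 0.95-quantile of the chi-square law, 1 d.f.
# Hypothetical inputs: divergence 0.05, sigma 0.4, phi''(1) = 1, level 5%.
n_star, n0 = required_sample_size(0.9, 0.05, 0.4, 1.0, Q_CHI2_1_95)
```

In practice the divergence and σ_φ at the alternative of interest would themselves have to be computed or estimated, so the returned n* is only as good as those inputs.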

Remark 3.4. (Another view of the generalized likelihood ratio test and approximation of the power function through the $KL_m$-divergence). In the particular case of the $KL_m$-divergence, i.e., when $ϕ(x) = ϕ_0(x) := − \log x + x − 1$, we obtain from (3.17) the critical area

\[ C_{KL_m}(θ_0, θ_T) := \left\{ 2n \sup_{α \in Θ} P_n \log\left( \frac{p_α}{p_{θ_0}} \right) > q_{d,ε} \right\} = \left\{ 2 \log \frac{\sup_{α \in Θ} \prod_{i=1}^{n} p_α(X_i)}{\prod_{i=1}^{n} p_{θ_0}(X_i)} > q_{d,ε} \right\}, \]

which is to say that the test obtained in this case is precisely the generalized likelihood ratio one. The power approximation and the approximate sample size guaranteeing a power β for a given alternative (for the GLRT) are given by (3.18) and (3.19), respectively, where ϕ is replaced by $ϕ_0$ and φ by $KL_m$.
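As a small worked instance of this critical region, consider the Gaussian location model N(α, 1) with d = 1, chosen here purely for illustration. The supremum over α is attained at the sample mean, and the statistic reduces to n(x̄ − θ₀)². The data values, the function names and the hard-coded χ² quantile below are assumptions of the sketch.

```python
import math

def log_density(x: float, mean: float) -> float:
    """Log-density of N(mean, 1) at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2

def glr_statistic(sample: list, theta0: float) -> float:
    """2n sup_alpha P_n log(p_alpha / p_theta0), the sup being taken at the sample mean."""
    alpha_hat = sum(sample) / len(sample)            # MLE in the N(alpha, 1) model
    return 2.0 * sum(log_density(x, alpha_hat) - log_density(x, theta0) for x in sample)

sample = [0.3, -0.1, 1.2, 0.8, 0.5, 1.9, -0.4, 0.7]  # hypothetical observations
theta0 = 0.0
stat = glr_statistic(sample, theta0)
Q_CHI2_1_95 = 3.841458820694124                      # 0.95-quantile, chi-square, 1 d.f.
reject = stat > Q_CHI2_1_95                          # critical region at level 5%
```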

3.2. The asymptotic behavior of the MDφDE’s. We now explore the asymptotic properties of the estimates $\hat{θ}_φ$ and $\hat{α}_φ(\hat{θ}_φ)$ defined in (3.13) and (3.12). We assume that condition (3.1) holds for any α, θ ∈ Θ. Also, we assume that the convex function ϕ has continuous derivatives up to 4th order, and the density $p_θ(x)$ has continuous partial derivatives up to 3rd order (for all x λ-a.e.). In the following theorem, we state the existence and the consistency of the estimates $\hat{θ}_φ$ and $\hat{α}_φ(\hat{θ}_φ)$, and we give their limit laws. We will use the following conditions.

(C.1) There exists a neighborhood $N(θ_T)$ of $θ_T$ such that the first and second order

$N(θ_T)$ by λ-integrable functions. The third partial derivatives (w.r.t. α and θ) of h(θ, α, x) are dominated on $N(θ_T) × N(θ_T)$ by some $P_{θ_T}$-integrable functions;

(C.2) The integrals $P_{θ_T} |(∂/∂α) h(θ_T, θ_T)|^2$, $P_{θ_T} |(∂/∂θ) h(θ_T, θ_T)|^2$, $P_{θ_T} |(∂^2/∂α^2) h(θ_T, θ_T)|$, $P_{θ_T} |(∂^2/∂θ^2) h(θ_T, θ_T)|$ and $P_{θ_T} |(∂^2/∂θ∂α) h(θ_T, θ_T)|$ are finite, and the matrix $I_{θ_T}$ is non singular.

Theorem 3.2. Assume that conditions (C.1) and (C.2) hold.

(a) Let $B := \{θ ∈ Θ; |θ − θ_T| ≤ n^{-1/3}\}$. Then, as n → ∞, with probability one, the function $(θ, α) ↦ P_n h(θ, α)$ attains its min-max value at some point $(\hat{θ}_φ, \hat{α}_φ(\hat{θ}_φ))$ in the interior of B × B, which implies that the estimates $\hat{θ}_φ$ and $\hat{α}_φ(\hat{θ}_φ)$ are $n^{1/3}$-consistent and satisfy $P_n (∂/∂α) h(\hat{θ}_φ, \hat{α}_φ(\hat{θ}_φ)) = 0$ and $P_n (∂/∂θ) h(\hat{θ}_φ, \hat{α}_φ(\hat{θ}_φ)) = 0$.

(b) Both $\sqrt{n}\,(\hat{θ}_φ − θ_T)$ and $\sqrt{n}\,(\hat{α}_φ(\hat{θ}_φ) − θ_T)$ converge in distribution to a centered multivariate normal random variable with covariance matrix $V = I_{θ_T}^{-1}$.

3.3. Composite tests by minimum φ-divergence. Let $Θ_0$ be a subset of Θ. We assume that there exists an open set $B_0 ⊂ R^{d−l}$ and mappings $r : Θ → R^l$ and $s : B_0 → R^d$ such that the matrices $R(θ) := \left[ \frac{∂}{∂θ_i} r(θ) \right]$ and $S(β) := \left[ \frac{∂}{∂β_i} s(β) \right]$ exist, with continuous elements, and are of rank l and (d − l), respectively, $Θ_0 = \{s(β); β ∈ B_0\}$ and r(θ) = 0 for all $θ ∈ Θ_0$. Consider the composite null hypothesis

\[ H_0 : θ_T \in Θ_0 \quad \text{versus} \quad H_1 : θ_T \in Θ \setminus Θ_0. \tag{3.20} \]

This is equivalent to

\[ H_0 : θ_T \in s(B_0) \quad \text{versus} \quad H_1 : θ_T \in Θ \setminus s(B_0). \]

Using (3.14), the φ-divergence $φ(Θ_0, θ_T)$ between the set of distributions $\{P_θ \text{ such that } θ ∈ Θ_0\}$ and the p.m. $P_{θ_T}$ can be written as $φ(Θ_0, θ_T) = \inf_{θ∈Θ_0} \sup_{α∈Θ} P_{θ_T} h(θ, α)$. Hence, it can be estimated by

\[ \hat{φ}(Θ_0, θ_T) := \inf_{θ \in Θ_0} \hat{φ}(θ, θ_T) := \inf_{θ \in Θ_0} \sup_{α \in Θ} P_n h(θ, α). \]

We use $\hat{φ}(Θ_0, θ_T)$ to perform statistical tests pertaining to (3.20). Since $φ(Θ_0, θ_T) := \inf_{θ∈Θ_0} φ(θ, θ_T)$ is positive under $H_1$ and takes value 0 only under $H_0$ (provided that the infimum is attained on $Θ_0$), we reject $H_0$ whenever $\hat{φ}(Θ_0, θ_T)$ takes large values. The following theorem provides the limit distribution of $\hat{φ}(Θ_0, θ_T)$ under the null hypothesis.

Theorem 3.3. Let us assume that the conditions in theorem 3.2 are satisfied. Under $H_0$, the statistics $\frac{2n}{ϕ''(1)} \hat{φ}(Θ_0, θ_T)$ converge in distribution to a χ² random variable with l degrees of freedom.

The following theorem gives the limit laws of the test statistics $\frac{2n}{ϕ''(1)} \hat{φ}(Θ_0, θ_T)$ under the alternative hypothesis $H_1 : θ_T ∈ Θ \setminus Θ_0$. We will use the following assumptions.

(C.3) The minimum of $θ ↦ φ(θ, θ_T)$ on $Θ_0$ is attained at some point, say $θ^* := s(β^*)$ with $β^* ∈ B_0$; uniqueness then follows by strict convexity of ϕ and the model identifiability assumption;

(C.4) There exists a neighborhood $N(β^*)$ of $β^*$ and a neighborhood $N(θ_T)$ of $θ_T$ such that the first and second order partial derivatives (w.r.t. α and β) of $f(s(β), α, x)\, p_{s(β)}(x)$ are dominated on $N(β^*) × N(θ_T)$ by λ-integrable functions. The third partial derivatives (w.r.t. β and α) of h(s(β), α, x) are dominated on $N(β^*) × N(θ_T)$ by some $P_{θ_T}$-integrable functions;

(C.5) The integrals $P_{θ_T} |(∂/∂α) h(s(β^*), θ_T)|^2$, $P_{θ_T} |(∂/∂β) h(s(β^*), θ_T)|^2$, $P_{θ_T} |(∂^2/∂α^2) h(s(β^*), θ_T)|$, $P_{θ_T} |(∂^2/∂β^2) h(s(β^*), θ_T)|$ and $P_{θ_T} |(∂^2/∂β∂α) h(s(β^*), θ_T)|$ are finite, and the matrix

\[ A := \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \]

is non singular, where $A_{11} := P_{θ_T} (∂^2/∂β^2) h(s(β^*), θ_T)$, $A_{22} := P_{θ_T} (∂^2/∂α^2) h(s(β^*), θ_T)$ and $A_{12} = A_{21}^T := P_{θ_T} (∂^2/∂β∂α) h(s(β^*), θ_T)$;

(C.6) The integral $P_{θ_T} |h(s(β^*), θ_T)|^2$ is finite.

Denote $\hat{β}_φ$ and $\hat{α}_φ(\hat{β}_φ)$ the min-max optimal solution of $\hat{φ}(Θ_0, θ_T) := \inf_{β∈B_0} \sup_{α∈Θ} P_n h(s(β), α)$, and let $B(β^*, n^{-1/3}) := \{β ∈ B_0; |β − β^*| ≤ n^{-1/3}\}$, $c_n := (\hat{β}_φ^T, \hat{α}_φ(\hat{β}_φ)^T)^T$, $c^* := (β^{*T}, θ_T^T)^T$ and F the matrix defined by

\[ F := P_{θ_T} \begin{bmatrix} (∂/∂β) h(s(β^*), θ_T) \\ (∂/∂α) h(s(β^*), θ_T) \end{bmatrix} \begin{bmatrix} (∂/∂β) h(s(β^*), θ_T) \\ (∂/∂α) h(s(β^*), θ_T) \end{bmatrix}^{T}. \]

Theorem 3.4. Assume that conditions (C.3)-(C.5) hold. Then, under the alternative hypothesis $H_1$, we have

(a) As $n$ tends to $\infty$, with probability one, the function $(\beta, \alpha) \mapsto P_n h(s(\beta), \alpha)$ attains its min-max value at some point $c_n$ in the interior of $B(\beta^*, n^{-1/3}) \times B(\theta_T, n^{-1/3})$, which implies that $\hat\beta_\phi$ and $\hat\alpha_\phi(\hat\beta_\phi)$ are $n^{1/3}$-consistent estimates of $\beta^*$ and $\theta_T$ (respectively), and satisfy

$$P_n(\partial/\partial\beta)h\left(s(\hat\beta_\phi), \hat\alpha_\phi(\hat\beta_\phi)\right) = 0 \quad\text{and}\quad P_n(\partial/\partial\alpha)h\left(s(\hat\beta_\phi), \hat\alpha_\phi(\hat\beta_\phi)\right) = 0.$$

(b) $\sqrt{n}\,(c_n - c^*)$ converges in distribution to a centered multivariate normal random variable with covariance matrix $V = A^{-1} F A^{-1}$.

(c) If additionally condition (C.6) holds, then $\sqrt{n}\left(\hat\phi(\Theta_0, \theta_T) - \phi(\Theta_0, \theta_T)\right)$ converges in distribution to a centered normal random variable with variance

$$\sigma^2_\phi(\beta^*, \theta_T) = P_{\theta_T} h(s(\beta^*), \theta_T)^2 - \left(P_{\theta_T} h(s(\beta^*), \theta_T)\right)^2. \qquad (3.21)$$
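As a purely numerical illustration of parts (b) and (c), the following sketch computes a sandwich covariance $A^{-1} F A^{-1}$ and a plug-in version of the variance (3.21). It assumes scalar $\beta$ and $\alpha$ (so that $A$ and $F$ are $2 \times 2$), and the function names and input matrices are hypothetical, chosen only for this example.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix; (C.5) requires A to be non singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(x, y):
    """Plain product of two small matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def sandwich(a, f):
    """Covariance V = A^{-1} F A^{-1} of Theorem 3.4 (b), 2x2 case only."""
    a_inv = inv2(a)
    return matmul(matmul(a_inv, f), a_inv)


def plug_in_variance(h_values):
    """Empirical analogue of (3.21): mean(h^2) - mean(h)^2."""
    n = len(h_values)
    m1 = sum(h_values) / n
    m2 = sum(v * v for v in h_values) / n
    return m2 - m1 * m1


# Hypothetical inputs: A indefinite (saddle point of the min-max problem).
V = sandwich([[2.0, 0.0], [0.0, -4.0]], [[4.0, 0.0], [0.0, 16.0]])
```

In practice $A$, $F$ and the values of $h$ would be evaluated at the estimates $(\hat\beta_\phi, \hat\alpha_\phi(\hat\beta_\phi))$; that substitution is an assumption of this sketch and is not spelled out in the theorem.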

Remark 3.5. Using Theorem 3.3, the estimate $\hat\phi(\Theta_0, \theta_T)$ can be used to perform statistical tests (asymptotically of level $\epsilon$) of the null hypothesis $H_0 : \theta_T \in \Theta_0$ against the alternative $H_1 : \theta_T \in \Theta \setminus \Theta_0$. Since $\phi(\Theta_0, \theta_T)$ is nonnegative and takes the value zero only when $\theta_T \in \Theta_0$, the tests are defined through the critical region

$$C_\phi(\Theta_0, \theta_T) := \left\{ \frac{2n}{\varphi''(1)}\,\hat\phi(\Theta_0, \theta_T) > q_{l,\epsilon} \right\}, \qquad (3.22)$$

where $q_{l,\epsilon}$ is the $(1 - \epsilon)$-quantile of the $\chi^2$ distribution with $l$ degrees of freedom. Note that these tests are all consistent, since $\hat\phi(\Theta_0, \theta_T)$ is an $n$-consistent estimate of $\phi(\Theta_0, \theta_T) = 0$ under $H_0$, and a $\sqrt{n}$-consistent estimate of $\phi(\Theta_0, \theta_T) > 0$ under $H_1$; see Theorem 3.3 and Theorem 3.4 part (c). Further, the asymptotic result (c) in Theorem 3.4 above can be used to give an approximation to the power function $\theta_T \mapsto \beta(\theta_T) := P_{\theta_T}\left(C_\phi(\Theta_0, \theta_T)\right)$. We then obtain the following approximation

$$\beta(\theta_T) \approx 1 - F_N\left( \frac{\sqrt{n}}{\sigma_\phi(\beta^*, \theta_T)} \left[ \frac{\varphi''(1)}{2n}\, q_{l,\epsilon} - \phi(\Theta_0, \theta_T) \right] \right) \qquad (3.23)$$

where $F_N$ is the cumulative distribution function of a normal variable with mean zero and variance one. An important application of this approximation is the approximate sample size (3.24) below that ensures a power $\beta$ for a given alternative $\theta_T \in \Theta \setminus \Theta_0$. Let $n_0$ be the positive root of the equation

$$\beta = 1 - F_N\left( \frac{\sqrt{n}}{\sigma_\phi(\beta^*, \theta_T)} \left[ \frac{\varphi''(1)}{2n}\, q_{l,\epsilon} - \phi(\Theta_0, \theta_T) \right] \right),$$

i.e.,

$$n_0 = \frac{(a + b) - \sqrt{a(a + 2b)}}{2\,\phi(\Theta_0, \theta_T)^2},$$

where $a = \sigma^2_\phi(\beta^*, \theta_T)\left[F_N^{-1}(1 - \beta)\right]^2$ and $b = \varphi''(1)\, q_{l,\epsilon}\, \phi(\Theta_0, \theta_T)$.

The required sample size is then

$$n^* = [n_0] + 1 \qquad (3.24)$$

where $[\cdot]$ denotes "integer part of".
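The power approximation (3.23) and the sample size rule (3.24) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the names `power_approx` and `sample_size` and the argument `phi2` (standing for $\varphi''(1)$) are hypothetical, the quantile $q_{l,\epsilon}$ is supplied by the user, and $\sigma_\phi$ and $\phi(\Theta_0, \theta_T)$ are treated as known. Instead of the closed form, the sketch solves the defining equation as a quadratic in $\sqrt{n}$; writing $z = F_N^{-1}(1 - \beta)$, this gives $n_0 = \left[(a + b) - \mathrm{sign}(z)\sqrt{a(a + 2b)}\right] / \left(2\phi(\Theta_0, \theta_T)^2\right)$, which agrees with the displayed closed form when $z \ge 0$ and, as far as we can tell, takes the plus sign when $\beta > 1/2$.

```python
import math
from statistics import NormalDist


def power_approx(n, sigma, phi, q, phi2=1.0):
    """Right-hand side of (3.23); phi2 stands for varphi''(1)."""
    arg = math.sqrt(n) / sigma * (phi2 * q / (2 * n) - phi)
    return 1.0 - NormalDist().cdf(arg)


def sample_size(beta, sigma, phi, q, phi2=1.0):
    """Solve beta = power_approx(n) for n; return (n0, [n0] + 1) as in (3.24)."""
    z = NormalDist().inv_cdf(1.0 - beta)
    # With s = sqrt(n) the equation reads 2*phi*s^2 + 2*z*sigma*s - phi2*q = 0;
    # its unique positive root is taken below.
    s = (-z * sigma + math.sqrt((z * sigma) ** 2 + 2 * phi * phi2 * q)) / (2 * phi)
    n0 = s * s
    return n0, math.floor(n0) + 1


# Hypothetical values: sigma = 1, phi(Theta_0, theta_T) = 0.05, q_{1,0.05} = 3.841.
n0, n_star = sample_size(0.9, 1.0, 0.05, 3.841)
```

Plugging `n0` back into `power_approx` returns the requested power, which is a convenient check on whichever root is used.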

Remark 3.6. (Another view of the generalized likelihood ratio test for composite hypotheses, and approximation of the power function through the $KL_m$-divergence). In the particular case of the $KL_m$-divergence, i.e., when $\varphi(x) = \varphi_0(x) := -\log x + x - 1$, we obtain from (3.22) the critical area

$$C_{KL_m}(\Theta_0, \theta_T) = \left\{ 2 \log \frac{\sup_{\alpha \in \Theta} \prod_{i=1}^n p_\alpha(X_i)}{\sup_{\theta \in \Theta_0} \prod_{i=1}^n p_\theta(X_i)} > q_{l,\epsilon} \right\},$$

which is to say that the test obtained in this case is precisely the generalized likelihood ratio test associated to (3.20). The power approximation and the approximate sample size guaranteeing a power $\beta$ for a given alternative (for the GLRT) are given by (3.23) and (3.24), respectively, where $\varphi$ is replaced by $\varphi_0$ and $\phi$ by $KL_m$.
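To make the statistic in this critical area concrete, here is a small sketch for a hypothetical model that is not taken from the text: $p_\theta$ is the $N(\theta, 1)$ density, $\Theta = \mathbb{R}$ and $\Theta_0 = \{0\}$. In that case the supremum in the numerator is attained at the sample mean and the statistic reduces to $n \bar{X}^2$.

```python
import math


def loglik(theta, xs):
    """Log-likelihood of an i.i.d. N(theta, 1) sample."""
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)


def glr_statistic(xs):
    """2 log of the likelihood ratio for H0: theta = 0 in the N(theta, 1) model."""
    mle = sum(xs) / len(xs)  # unrestricted maximizer over Theta = R
    return 2.0 * (loglik(mle, xs) - loglik(0.0, xs))


sample = [0.3, -0.1, 0.8, 1.2]  # made-up data for illustration
stat = glr_statistic(sample)
```

The value of `stat` is then compared with the quantile $q_{1,\epsilon}$, since $\Theta_0$ removes one free parameter in this toy example.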

4. A new test of the number of components in a mixture through $\chi^2$-divergence on signed finite measures

In this section, we present an application of the $\phi$-divergence technique. We give a solution to some irregular problems for which the maximum likelihood method poses some difficulties. We consider the problem of testing the number of components in a mixture (see e.g. Titterington et al. (1985) for this problem). Let

$$\left\{ P^{(1)}_{a_1};\ a_1 \in A_1 \right\},\ \ldots,\ \left\{ P^{(k)}_{a_k};\ a_k \in A_k \right\}$$

be $k$ parametric models, where $A_1, \ldots, A_k$ are $k$ ($k \ge 2$) open sets in $\mathbb{R}^{d_1}, \ldots, \mathbb{R}^{d_k}$ and $d_1, \ldots, d_k \in \mathbb{N}^*$. Denote by $P_\theta$ the mixture model

$$P_\theta := \sum_{i=1}^k w_i P^{(i)}_{a_i} \qquad (4.1)$$

where $0 \le w_i \le 1$, $\sum w_i = 1$ and

$$\theta \in \Theta := \left\{ (w_1, \ldots, w_k, a_1, \ldots, a_k)^T \in [0, 1]^k \times A_1 \times \cdots \times A_k \ \text{such that}\ \sum_{i=1}^k w_i = 1 \right\}, \qquad (4.2)$$

and assume that the model is identifiable. Let $k_0 \in \{1, \ldots, k - 1\}$. We test whether $(k - k_0)$ components in (4.1) have null coefficients. By convention, we take these to be the components $P^{(k_0+1)}_{a_{k_0+1}}, \ldots, P^{(k)}_{a_k}$. So, denote by $\Theta_0$ the subset of $\Theta$ defined by

$$\Theta_0 := \left\{ \theta \in \Theta \ \text{such that}\ w_{k_0+1} = \cdots = w_k = 0 \right\}. \qquad (4.3)$$

On the basis of an i.i.d. sample $X_1, \ldots, X_n$ with distribution $P_{\theta_T}$, $\theta_T \in \Theta$, we intend to perform tests of the hypothesis

$$H_0 : \theta_T \in \Theta_0 \ \text{against the alternative}\ H_1 : \theta_T \in \Theta \setminus \Theta_0. \qquad (4.4)$$

It is known that the generalized likelihood ratio test, based on the Wilks likelihood ratio statistic

$$2 \log \lambda := 2 \log \frac{\sup_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i)}{\sup_{\theta \in \Theta_0} \prod_{i=1}^n p_\theta(X_i)}, \qquad (4.5)$$

is not valid for this problem, since the asymptotic approximation by the $\chi^2$ distribution does not hold in this case; the problem is due to the fact that the null value of $\theta$ is not in the interior of the parameter space $\Theta$. We now clarify this problem. For simplicity, consider a mixture of two known densities $p_0$ and $p_1$ with $p_0 \ne p_1$:

$$p_\theta = (1 - \theta) p_0 + \theta p_1 \ \text{where}\ \theta \in \Theta := [0, 1]. \qquad (4.6)$$

Given data $X_1, \ldots, X_n$ with distribution $P_{\theta_T}$, $\theta_T \in [0, 1]$, consider the test problem

$$H_0 : \theta_T = 0 \ \text{against the alternative}\ H_1 : \theta_T > 0. \qquad (4.7)$$

The generalized likelihood ratio statistic for this test problem is

$$W_n(0) := 2 \log \frac{L(\hat\theta)}{L(0)}, \qquad (4.8)$$

where $L(\theta) := \prod_{i=1}^n \left[(1 - \theta) p_0(X_i) + \theta p_1(X_i)\right]$ for all $\theta \in [0, 1]$, and $\hat\theta$ is the MLE of $\theta$. Using the strict concavity of the function $\theta \in [0, 1] \mapsto l(\theta) := \log L(\theta)$, it is clear that $\hat\theta = 0$ whenever $l'_+(0)$, the right derivative at $\theta = 0$ of $\theta \mapsto l(\theta)$, is nonpositive.

Hence, we can write

$$P_0\{W_n = 0\} \ge P_0\left\{\hat\theta = 0\right\} = P_0\left\{l'_+(0) \le 0\right\} = P_0\left\{ \sum_{i=1}^n \frac{p_1(X_i)}{p_0(X_i)} - n \le 0 \right\} = P_0\left\{ \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n \frac{p_1(X_i)}{p_0(X_i)} - 1 \right) \le 0 \right\} \qquad (4.9)$$

which, by the CLT, tends to $1/2$ (if $1 \ne E(Y_i^2) < \infty$ where $Y_i := p_1(X_i)/p_0(X_i)$) since the random variables $Y_i$ are i.i.d. with $E(Y_i) = 1$ under $H_0$. This proves that the convergence

in distribution of the generalized likelihood ratio statistic $W_n(0)$ to a $\chi^2$ random variable (under $H_0$) does not hold. Under suitable regularity conditions we can prove that the limit distribution of the statistic $W_n$ in (4.8) is $0.5\,\delta_0 + 0.5\,\chi^2_1$, a mixture of the $\chi^2$-distribution with one degree of freedom and the Dirac measure at zero; see Self and Liang (1987). Moreover, in the case of more than two components and $k - k_0 \ge 2$, the limit distribution of the GLR statistic (4.5) under $H_0$ is complicated and not standard (not a $\chi^2$ distribution), which makes it difficult to determine the critical value that gives the correct asymptotic size; see Self and Liang (1987). On the other hand, the likelihood ratio statistic

$$W_n(\theta) := 2 \log \frac{L(\hat\theta)}{L(\theta)} \qquad (4.10)$$

cannot be used to construct an asymptotic confidence region for the parameter $\theta_T$ since its limit law is not the same when $\theta_T = 0$ and when $\theta_T > 0$.
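The boundary effect described above can be checked by simulation. The sketch below makes assumptions that are not in the text: $p_0$ is the $N(0, 1)$ density and $p_1$ the $N(1, 1)$ density, so that $p_1(x)/p_0(x) = \exp(x - 1/2)$, with $E_0(Y_i) = 1$ and $E_0(Y_i^2) = e < \infty$. Differentiating $l$ at $0$ gives $l'_+(0) = \sum_i p_1(X_i)/p_0(X_i) - n$, and the code estimates how often this is nonpositive under $H_0$, i.e. how often $\hat\theta = 0$ and $W_n(0) = 0$.

```python
import math
import random


def right_derivative_at_zero(xs):
    """l'_+(0) = sum_i p1(X_i)/p0(X_i) - n for p0 = N(0,1), p1 = N(1,1)."""
    return sum(math.exp(x - 0.5) for x in xs) - len(xs)


def frac_mle_at_boundary(n, reps, seed=0):
    """Monte Carlo estimate of P_0{theta_hat = 0} under H0 (data drawn from p0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if right_derivative_at_zero(xs) <= 0:
            hits += 1
    return hits / reps


estimate = frac_mle_at_boundary(n=200, reps=2000)
```

For moderate $n$ the estimated frequency stays close to one half, so $W_n(0)$ keeps a sizeable point mass at zero, which is incompatible with a $\chi^2$ limit.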

We propose the following simple solution to the two problems above. Consider the following set of signed finite measures

$$p_\theta := (1 - \theta) p_0 + \theta p_1 \ \text{where}\ \theta \in \mathbb{R}. \qquad (4.11)$$

This set (of signed finite measures with mass one) obviously contains the mixture model (4.6). In particular, the null value of $\theta$ (i.e., $\theta = 0$) is an interior point of the parameter space $\mathbb{R}$. The likelihood ratio test (for a model of signed measures) cannot be used since the log-likelihood $l(\theta)$ may be infinite (when $\theta < 0$ or $\theta > 1$). In the context of divergences, this means that the estimate $\widehat{KL}_m(P_0, P_{\theta_T})$ may be infinite if we consider the model (4.11), which is due to the fact that the corresponding convex function $\varphi(x) = -\log x + x - 1$ is infinite on $\mathbb{R}_-$. This suggests using a divergence associated to a convex function $\varphi$ which is finite on all of $\mathbb{R}$, for instance the $\chi^2$-divergence (which is associated to the convex function $\varphi(x) = \frac{1}{2}(x - 1)^2$). So, in order to perform a test asymptotically of level $\epsilon$ for (4.7), we propose to use the following estimate of the $\chi^2$-divergence between $P_0$ and $P_{\theta_T}$

$$\widetilde{\chi^2}(0, \theta_T) = \sup_{\alpha \in \Theta_e} \left\{ P_0 f(0, \alpha) - P_n g(0, \alpha) \right\}, \qquad (4.12)$$

where $f(0, \alpha) = p_0/p_\alpha - 1$ and $g(0, \alpha) = \frac{1}{2}(p_0/p_\alpha + 1)(p_0/p_\alpha - 1)$ as a consequence of definitions (3.9) and (3.8), and $\Theta_e$ is the new parameter space, which we define as follows

$$\Theta_e := \left\{ \theta \in \mathbb{R} \ \text{such that}\ \int |f(0, \theta)|\, dP_0 \ \text{is finite} \right\}.$$

The value of the parameter $\theta_T$ under the null hypothesis $H_0$, i.e., $\theta_T = 0$, is in the interior of the new parameter space $\Theta_e$, which is generally non void. Hence, under the conditions of Theorem 3.1 where $\Theta$ is replaced by $\Theta_e$ and $\theta$ by zero, under $H_0$ the statistic $2n\,\widetilde{\chi^2}(0, \theta_T)$ converges in distribution to a $\chi^2$ random variable with one degree of freedom; the critical region then takes the form

$$CR := \left\{ 2n\,\widetilde{\chi^2}(0, \theta_T) > q_{1,\epsilon} \right\}, \qquad (4.13)$$

where $q_{1,\epsilon}$ is the $(1 - \epsilon)$-quantile of the $\chi^2$ distribution with one degree of freedom. Obviously, other divergences associated to convex functions that are finite on all of $\mathbb{R}$ can be used. The use of the $\chi^2$-divergence is recommended. Indeed, in regular cases (for example, for multinomial goodness-of-fit tests) the $\chi^2$-test is equivalent (in the Pitman sense) to the generalized likelihood ratio one; see also Cressie and Read (1984), sections 3.1 and 3.2, for other motivations in favor of the $\chi^2$ approach. On the other hand, the estimate

$$\widetilde{\chi^2}(\theta, \theta_T) = \sup_{\alpha \in \Theta_e} \left\{ P_\theta f(\theta, \alpha) - P_n g(\theta, \alpha) \right\},$$

where

$$\Theta_e := \left\{ \alpha \in \mathbb{R} \ \text{such that}\ \int |f(\theta, \alpha)|\, dP_\theta \ \text{is finite} \right\},$$

can also be used to construct an asymptotic confidence region for the parameter $\theta_T$ with level $(1 - \epsilon)$ as follows

$$C := \left\{ \theta \in \Theta \cap \Theta_e \ \text{such that}\ 2n\,\widetilde{\chi^2}(\theta, \theta_T) \le q_{1,\epsilon} \right\}.$$

In fact, $\lim_{n \to \infty} P_{\theta_T}(\theta_T \in C) = 1 - \epsilon$ both when $\theta_T = 0$ and when $\theta_T > 0$, since the statistic $2n\,\widetilde{\chi^2}(\theta_T, \theta_T)$ converges in distribution to a $\chi^2$ random variable with one degree of freedom in both cases.
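The dual $\chi^2$ estimate, the test of $H_0 : \theta_T = 0$ and the confidence region of this section can be sketched as follows for a toy case. Everything specific here is an assumption made for illustration: $p_0$ and $p_1$ are hypothetical pmfs on a three-point support (so $P_\theta f$ is a finite sum), the supremum over $\Theta_e$ is replaced by a search over a finite grid of $\alpha$ on which $p_\alpha$ stays positive, and the data are summarized by cell counts. The quantile $q_{1,\epsilon}$ is obtained from the identity $q_{1,\epsilon} = \left[F_N^{-1}(1 - \epsilon/2)\right]^2$.

```python
from statistics import NormalDist

P0 = (0.5, 0.3, 0.2)  # hypothetical known pmf p_0 on a three-point support
P1 = (0.2, 0.3, 0.5)  # hypothetical known pmf p_1 on the same support
GRID = [(k - 500) / 1000 for k in range(1501)]  # candidate alpha in [-0.5, 1.0]


def mix(t):
    """Signed mixture (1 - t) p_0 + t p_1; t may lie outside [0, 1]."""
    return tuple((1 - t) * a + t * b for a, b in zip(P0, P1))


def chi2_dual(theta, counts, grid=GRID):
    """Grid version of sup_alpha {P_theta f(theta, alpha) - P_n g(theta, alpha)}."""
    n = sum(counts)
    p_theta = mix(theta)
    best = float("-inf")
    for alpha in grid:
        p_alpha = mix(alpha)
        if min(p_alpha) <= 0:  # assumption: keep p_alpha positive on the support
            continue
        r = [pt / pa for pt, pa in zip(p_theta, p_alpha)]
        p_f = sum(pt * (ri - 1) for pt, ri in zip(p_theta, r))
        pn_g = sum(c / n * 0.5 * (ri + 1) * (ri - 1) for c, ri in zip(counts, r))
        best = max(best, p_f - pn_g)
    return best


def chi2_quantile_1df(eps):
    """(1 - eps)-quantile of the chi-square law with one degree of freedom."""
    return NormalDist().inv_cdf(1 - eps / 2) ** 2


def reject_h0(counts, eps=0.05):
    """Critical region of the form (4.13)."""
    return 2 * sum(counts) * chi2_dual(0.0, counts) > chi2_quantile_1df(eps)


def confidence_region(counts, eps=0.05, step=0.05):
    """Grid points theta in [0, 1] with 2n * chi2_dual(theta) <= q_{1,eps}."""
    n = sum(counts)
    q = chi2_quantile_1df(eps)
    thetas = [k * step for k in range(int(round(1 / step)) + 1)]
    return [t for t in thetas if 2 * n * chi2_dual(t, counts) <= q]


counts = (350, 300, 350)  # made-up cell counts, proportions equal to mix(0.5)
region = confidence_region(counts)
```

With these counts the empirical proportions coincide with $p_{0.5}$, so the grid supremum in `chi2_dual(0.0, counts)` is attained at $\alpha = 0.5$ and equals the $\chi^2$-divergence $\frac{1}{2}\sum_x (p_0(x) - p_{0.5}(x))^2 / p_{0.5}(x)$. A grid search is only one possible optimizer; any maximization routine over $\alpha$ could be substituted.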

5. Concluding remarks and possible developments

We have addressed parametric estimation and test problems. We have introduced new estimation and test procedures using divergence minimization and a duality technique for discrete or continuous parametric models, avoiding the smoothing method. The procedure leads to optimal estimates of the model parameter and of the divergences. It covers both the discrete (finite or infinite) and the continuous support cases. It extends the maximum likelihood method for both estimation and test problems. Moreover, the procedure and the divergence framework make it possible to obtain the limit laws of the proposed estimates and test statistics, including the generalized likelihood ratio statistic, under both the null and the alternative (simple or composite) hypotheses. As a by-product, we obtain explicit power functions in a general case for simple or composite parametric test problems, and approximations of the minimal sample size guaranteeing a desired power for a given alternative. A new test and new asymptotic confidence regions are proposed in the case where the parameter may be a boundary value of the parameter space. Many problems remain to be studied, such as the choice of the divergence leading to an "optimal" (in some sense) estimate or test in terms of efficiency and robustness, the construction of convergent estimates and test statistics by divergences in cases where the maximum likelihood estimate is not consistent (for example, for location families whose expectation does not exist), and the Bartlett correctability and the large deviation properties of the proposed statistics $\hat\phi$.

6. Appendix

Proof of Theorem 3.1. (a) Using (A.1), simple calculus gives

$$P_{\theta_T}(\partial/\partial\alpha)h(\theta, \theta_T) = 0 \qquad (6.1)$$

and

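$$P_{\theta_T}(\partial^2/\partial\alpha^2)h(\theta, \theta_T) = -\int \varphi''(p_\theta/p_{\theta_T})\,(p_\theta^2/p_{\theta_T}^3)\,p'_{\theta_T}\,p'^{\,T}_{\theta_T}\, d\lambda =: -S. \qquad (6.2)$$

Observe that the matrix $S$ is symmetric and positive semidefinite since the second derivative $\varphi''$ is nonnegative by the convexity of $\varphi$. Let $U_n(\theta_T) := P_n(\partial/\partial\alpha)h(\theta, \theta_T)$, and use (6.1) and (A.2) in connection with the Central Limit Theorem (CLT) to see that

$$\sqrt{n}\,U_n(\theta_T) \to N(0, M). \qquad (6.3)$$

Also, let $V_n(\theta_T) := P_n(\partial^2/\partial\alpha^2)h(\theta, \theta_T)$, and use (6.2) and (A.2) in connection with the Law of Large Numbers (LLN) to conclude that

$$V_n(\theta_T) \to -S \quad \text{(a.s.)}. \qquad (6.4)$$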

Now, for any $\alpha = \theta_T + u n^{-1/3}$ with $|u| \le 1$, consider a Taylor expansion of $P_n h(\theta, \alpha)$ in $\alpha$ around $\theta_T$, and use (A.1) to see that

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) = n^{2/3} u^T U_n + 2^{-1} n^{1/3} u^T V_n u + O(1) \quad \text{(a.s.)} \qquad (6.5)$$

uniformly in $u$ with $|u| \le 1$. Now, using (6.4) and the fact that $U_n = O\left(n^{-1/2} (\log\log n)^{1/2}\right)$ (a.s.), we conclude that

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) = O\left(n^{1/6} (\log\log n)^{1/2}\right) - 2^{-1} u^T S u\, n^{1/3} + O(1) \quad \text{(a.s.)} \qquad (6.6)$$

uniformly in $u$ with $|u| \le 1$. Hence, uniformly on the surface of the ball $B$ (i.e., uniformly in $u$ with $|u| = 1$), we have

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) \le O\left(n^{1/6} (\log\log n)^{1/2}\right) - 2^{-1} c\, n^{1/3} + O(1) \quad \text{(a.s.)} \qquad (6.7)$$

where $c$ is the smallest eigenvalue of the matrix $S$. Note that $c$ is positive since $S$ is positive definite (it is symmetric, positive semidefinite and non singular by assumption (A.2)). In view of (6.7), by the continuity of $\alpha \mapsto P_n h(\theta, \alpha)$, it holds that as $n \to \infty$, with probability one, $\alpha \mapsto P_n h(\theta, \alpha)$ attains its maximum value at some point $\hat\alpha_\phi(\theta)$ in the interior of the ball $B$, and therefore the estimate $\hat\alpha := \hat\alpha_\phi(\theta)$ satisfies $P_n(\partial/\partial\alpha)h(\theta, \hat\alpha) = 0$ and $\hat\alpha - \theta_T = O(n^{-1/3})$.
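The argument in part (a) rests on the stochastic term of order $n^{1/6}(\log\log n)^{1/2}$ being negligible against the quadratic term of order $n^{1/3}$ in (6.7). The short sketch below only illustrates this comparison of orders numerically; the constants hidden in the $O(\cdot)$ terms are ignored, which is a simplifying assumption of the example.

```python
import math


def ratio(n):
    """n^{1/6} (log log n)^{1/2} divided by n^{1/3}."""
    return n ** (1 / 6) * math.sqrt(math.log(math.log(n))) / n ** (1 / 3)


# The ratio decreases towards zero as n grows.
for n in (10**2, 10**4, 10**6, 10**8):
    print(n, round(ratio(n), 4))
```

The decay is slow (of order $n^{-1/6}$ up to the iterated logarithm), which is consistent with the statement being asymptotic only.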

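(b) Using the fact that $P_n(\partial/\partial\alpha)h(\theta, \hat\alpha) = 0$ and a Taylor expansion of $P_n(\partial/\partial\alpha)h(\theta, \hat\alpha)$ in $\hat\alpha$ around $\theta_T$, we obtain

$$0 = P_n(\partial/\partial\alpha)h(\theta, \hat\alpha) = P_n(\partial/\partial\alpha)h(\theta, \theta_T) + (\hat\alpha - \theta_T)^T P_n(\partial^2/\partial\alpha^2)h(\theta, \theta_T) + o_p(n^{-1/2}).$$

Hence,

$$\sqrt{n}\,(\hat\alpha - \theta_T) = \left[-V_n(\theta_T)\right]^{-1} \sqrt{n}\,U_n(\theta_T) + o_p(1). \qquad (6.8)$$

Using (6.3), (6.4) and Slutsky's theorem, we then conclude that

$$\sqrt{n}\,(\hat\alpha - \theta_T) \to N(0, V_\phi(\theta, \theta_T)) \qquad (6.9)$$

where $V_\phi(\theta, \theta_T)$ is given in part (b) of Theorem 3.1. When $\theta_T = \theta$, a direct calculation shows that $V_\phi(\theta, \theta_T) = I_{\theta_T}^{-1}$.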

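(c) Assume that $\theta_T = \theta$. From (6.8), using the convergence (6.4), we get

$$\sqrt{n}\,(\hat\alpha - \theta_T) = S^{-1} \sqrt{n}\,U_n(\theta_T) + o_p(1). \qquad (6.10)$$

On the other hand, a Taylor expansion of $[2n/\varphi''(1)]\,\hat\phi(\theta, \theta_T) = [2n/\varphi''(1)]\,P_n h(\theta, \hat\alpha)$ in $\hat\alpha$ around $\theta_T$, using the fact that $P_n h(\theta, \theta_T) = 0$ when $\theta_T = \theta$, gives

$$\frac{2n}{\varphi''(1)}\,\hat\phi(\theta, \theta_T) = \frac{2n}{\varphi''(1)}\,U_n^T(\hat\alpha - \theta_T) + \frac{n}{\varphi''(1)}\,(\hat\alpha - \theta_T)^T V_n (\hat\alpha - \theta_T) + o_p(1).$$

Use (6.4), (6.10) and the fact that $S = \varphi''(1) I_{\theta_T}$ when $\theta_T = \theta$ to conclude that

$$\frac{2n}{\varphi''(1)}\,\hat\phi(\theta, \theta_T) = \varphi''(1)^{-2}\, \sqrt{n}\,U_n^T\, I_{\theta_T}^{-1}\, \sqrt{n}\,U_n + o_p(1).$$

Finally, use the convergence (6.3) and the fact that $M = \varphi''(1)^2 I_{\theta_T}$ when $\theta_T = \theta$, to conclude that $[2n/\varphi''(1)]\,\hat\phi(\theta, \theta_T)$ converges in distribution to a $\chi^2$ variable with $d$ degrees of freedom when $\theta_T = \theta$.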
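The substitution step in part (c) can be written out as follows. This is a worked sketch rather than part of the original proof; it uses only (6.4), (6.10), the definition of $S$ in (6.2) (so that $S = \varphi''(1) I_{\theta_T}$ when $\theta_T = \theta$) and $M = \varphi''(1)^2 I_{\theta_T}$, with $U_n = U_n(\theta_T)$, and the auxiliary vector $Z_n$ is introduced here only for the illustration.

```latex
\begin{aligned}
\frac{2n}{\varphi''(1)}\,\hat\phi(\theta,\theta_T)
 &= \frac{2n}{\varphi''(1)}\,U_n^T S^{-1} U_n
  - \frac{n}{\varphi''(1)}\,U_n^T S^{-1} S\, S^{-1} U_n + o_p(1) \\
 &= \frac{n}{\varphi''(1)}\,U_n^T S^{-1} U_n + o_p(1) \\
 &= \varphi''(1)^{-2}\,\sqrt{n}\,U_n^T\, I_{\theta_T}^{-1}\,\sqrt{n}\,U_n + o_p(1)
  = Z_n^T I_{\theta_T}^{-1} Z_n + o_p(1),
\qquad Z_n := \frac{\sqrt{n}\,U_n}{\varphi''(1)} .
\end{aligned}
```

Since $\sqrt{n}\,U_n \to N(0, \varphi''(1)^2 I_{\theta_T})$, the vector $Z_n$ converges in distribution to $N(0, I_{\theta_T})$, and the quadratic form $Z_n^T I_{\theta_T}^{-1} Z_n$ then has a $\chi^2$ limit with $d$ degrees of freedom.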

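(d) Assume that $\theta_T \ne \theta$. A Taylor expansion of $\hat\phi(\theta, \theta_T) = P_n h(\theta, \hat\alpha)$ in $\hat\alpha$ around $\theta_T$, using the fact that $P_{\theta_T}(\partial/\partial\alpha)h(\theta, \theta_T) = 0$, gives $\hat\phi(\theta, \theta_T) = P_n h(\theta, \theta_T) + o_p(n^{-1/2})$. Hence,

$$\sqrt{n}\left(\hat\phi(\theta, \theta_T) - \phi(\theta, \theta_T)\right) = \sqrt{n}\left[P_n h(\theta, \theta_T) - P_{\theta_T} h(\theta, \theta_T)\right] + o_p(1),$$

which under assumption (A.3), by the CLT, converges in distribution to a centred normal variable with variance $\sigma^2_\phi(\theta, \theta_T) = P_{\theta_T} h(\theta, \theta_T)^2 - \left(P_{\theta_T} h(\theta, \theta_T)\right)^2$.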

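Proof of Theorem 3.2. (a) Under condition (C.1), simple calculus gives

$$P_{\theta_T} \frac{\partial}{\partial\alpha} h(\theta_T, \theta_T) = P_{\theta_T} \frac{\partial}{\partial\theta} h(\theta_T, \theta_T) = P_{\theta_T} \frac{\partial^2}{\partial\alpha\partial\theta} h(\theta_T, \theta_T) = P_{\theta_T} \frac{\partial^2}{\partial\theta\partial\alpha} h(\theta_T, \theta_T) = 0, \qquad (6.11)$$

$$-P_{\theta_T} \frac{\partial^2}{\partial\alpha^2} h(\theta_T, \theta_T) = P_{\theta_T} \frac{\partial^2}{\partial\theta^2} h(\theta_T, \theta_T) = \varphi''(1) I_{\theta_T}, \qquad (6.12)$$

and

$$P_{\theta_T} \left[\frac{\partial}{\partial\theta} h(\theta_T, \theta_T)\right] \left[\frac{\partial}{\partial\theta} h(\theta_T, \theta_T)\right]^T = P_{\theta_T} \left[\frac{\partial}{\partial\alpha} h(\theta_T, \theta_T)\right] \left[\frac{\partial}{\partial\alpha} h(\theta_T, \theta_T)\right]^T = -P_{\theta_T} \left[\frac{\partial}{\partial\alpha} h(\theta_T, \theta_T)\right] \left[\frac{\partial}{\partial\theta} h(\theta_T, \theta_T)\right]^T = \varphi''(1)^2 I_{\theta_T}. \qquad (6.13)$$

Denote $U_n(\theta, \theta_T) := P_n(\partial/\partial\alpha)h(\theta, \theta_T)$, $V_n(\theta, \theta_T) := P_n(\partial^2/\partial\alpha^2)h(\theta, \theta_T)$ and $S(\theta, \theta_T) := -P_{\theta_T}(\partial^2/\partial\alpha^2)h(\theta, \theta_T)$. We prove that $\hat\alpha(\theta) - \theta_T = O(n^{-1/3})$ (a.s.) uniformly in $\theta \in B(\theta_T, n^{-1/3})$. First, using condition (C.1), we can write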

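$$U_n(\theta, \theta_T) = U_n(\theta_T, \theta_T) + O(n^{-1/3}) \quad \text{(a.s.)} \qquad (6.14)$$

and

$$V_n(\theta, \theta_T) = V_n(\theta_T, \theta_T) + O(n^{-1/3}) \quad \text{(a.s.)}, \qquad (6.15)$$

uniformly in $\theta \in B(\theta_T, n^{-1/3})$. On the other hand, for any $\alpha = \theta_T + u n^{-1/3}$ with $|u| \le 1$, by a Taylor expansion using condition (C.1), we obtain

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) = n^{2/3} u^T U_n(\theta, \theta_T) + 2^{-1} n^{1/3} u^T V_n(\theta, \theta_T) u + O(1) \quad \text{(a.s.)}$$

uniformly in $\theta \in B(\theta_T, n^{-1/3})$ and $u$ with $|u| \le 1$. Combining this with (6.14) and (6.15), we see that

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) = n^{2/3} u^T U_n(\theta_T, \theta_T) + 2^{-1} n^{1/3} u^T V_n(\theta_T, \theta_T) u + O(n^{1/3}) \quad \text{(a.s.)}$$

uniformly in $\theta \in B(\theta_T, n^{-1/3})$ and $u$ with $|u| \le 1$. From this, using the fact that $U_n(\theta_T, \theta_T) = O\left(n^{-1/2} (\log\log n)^{1/2}\right)$ (a.s.) and $V_n(\theta_T, \theta_T) = -S(\theta_T, \theta_T) + o(1)$ (a.s.), we obtain

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) = O\left(n^{1/6} (\log\log n)^{1/2}\right) - 2^{-1} n^{1/3} u^T S(\theta_T, \theta_T) u + O(n^{1/3}) \quad \text{(a.s.)} \qquad (6.16)$$

uniformly in $\theta \in B(\theta_T, n^{-1/3})$ and $u$ with $|u| \le 1$. Hence, uniformly in $\alpha$ on the surface of the ball $B(\theta_T, n^{-1/3})$ (i.e., uniformly in $u$ with $|u| = 1$), we have

$$n P_n h(\theta, \alpha) - n P_n h(\theta, \theta_T) \le O\left(n^{1/6} (\log\log n)^{1/2}\right) - 2^{-1} \varphi''(1)\, c\, n^{1/3} + O(n^{1/3}) \quad \text{(a.s.)} \qquad (6.17)$$

(uniformly in $\theta \in B(\theta_T, n^{-1/3})$) where $c > 0$ is the smallest eigenvalue of the matrix $I_{\theta_T} = \varphi''(1)^{-1} S(\theta_T, \theta_T)$. This implies that, as $n \to \infty$, with probability one, the function $\alpha \mapsto P_n h(\theta, \alpha)$ attains its maximum value at some point $\hat\alpha_\phi(\theta)$ in the interior of the ball $B(\theta_T, n^{-1/3})$, and this holds for all $\theta \in B(\theta_T, n^{-1/3})$. Further, since (6.17) holds uniformly in $\theta \in B(\theta_T, n^{-1/3})$, we conclude that

$$\hat\alpha_\phi(\theta) - \theta_T = O(n^{-1/3}) \quad \text{(a.s.) uniformly in } \theta \in B(\theta_T, n^{-1/3}). \qquad (6.18)$$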

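We now prove that, as $n \to \infty$, with probability one, the function $\theta \mapsto P_n h(\theta, \hat\alpha_\phi(\theta))$ attains its minimum value at some point $\hat\theta_\phi$ in the interior of the ball $B(\theta_T, n^{-1/3})$. Here, $\hat\alpha_\phi(\theta)$ is any value in the interior of $B(\theta_T, n^{-1/3})$ which maximizes $\alpha \mapsto P_n h(\theta, \alpha)$; it exists by the above arguments. For any $\theta = \theta_T + v n^{-1/3}$ with $|v| \le 1$, by a Taylor expansion of $n P_n h(\theta, \hat\alpha_\phi(\theta))$ in $\theta$ and $\hat\alpha_\phi(\theta)$ around $\theta_T$, and a Taylor expansion of $n P_n h(\theta_T, \hat\alpha_\phi(\theta_T))$ in $\hat\alpha_\phi(\theta_T)$ around $\theta_T$, using (6.18) and (6.11), we obtain

$$n P_n h(\theta, \hat\alpha_\phi(\theta)) - n P_n h(\theta_T, \hat\alpha_\phi(\theta_T)) = n^{2/3} v^T P_n(\partial/\partial\theta)h(\theta_T, \theta_T) + 2^{-1} n^{1/3} v^T \left[P_n(\partial^2/\partial\theta^2)h(\theta_T, \theta_T)\right] v + o(n^{1/3}) \quad \text{(a.s.)}$$

uniformly in $v$ with $|v| \le 1$. Hence, using the fact that $P_n(\partial/\partial\theta)h(\theta_T, \theta_T) = O\left(n^{-1/2} (\log\log n)^{1/2}\right)$ (a.s.) and $P_n(\partial^2/\partial\theta^2)h(\theta_T, \theta_T) = \varphi''(1) I_{\theta_T} + o(1)$ (a.s.), we conclude that

$$n P_n h(\theta, \hat\alpha_\phi(\theta)) - n P_n h(\theta_T, \hat\alpha_\phi(\theta_T)) = O\left(n^{1/6} (\log\log n)^{1/2}\right) + 2^{-1} \varphi''(1)\, v^T I_{\theta_T} v\, n^{1/3} + o(n^{1/3}) \quad \text{(a.s.)}$$

uniformly in $v$ with $|v| \le 1$. Hence, uniformly in $\theta$ on the surface of the ball $B(\theta_T, n^{-1/3})$ (i.e., uniformly in $v$ with $|v| = 1$), we obtain

$$n P_n h(\theta, \hat\alpha_\phi(\theta)) - n P_n h(\theta_T, \hat\alpha_\phi(\theta_T)) \ge O\left(n^{1/6} (\log\log n)^{1/2}\right) + 2^{-1} \varphi''(1)\, c\, n^{1/3} + o(n^{1/3}) \quad \text{(a.s.)}$$

where $c > 0$ is the smallest eigenvalue of $I_{\theta_T}$. This implies that

$$n^{2/3} P_n h(\theta, \hat\alpha_\phi(\theta)) - n^{2/3} P_n h(\theta_T, \hat\alpha_\phi(\theta_T)) \ge O\left(n^{-1/6} (\log\log n)^{1/2}\right) + 2^{-1} \varphi''(1)\, c + o(1) \quad \text{(a.s.)}$$

uniformly in $\theta$ on the surface of the ball $B(\theta_T, n^{-1/3})$. The left-hand side of the above display equals zero when $\theta = \theta_T$ and is positive when $\theta$ is on the surface of the ball $B(\theta_T, n^{-1/3})$ (for $n$ sufficiently large). This implies that, as $n \to \infty$, with probability one, the function $\theta \mapsto P_n h(\theta, \hat\alpha_\phi(\theta))$ attains its minimum value at some point $\hat\theta_\phi$ in the interior of the ball $B$. This concludes the proof of part (a).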

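(b) Denote $a_n := \left( (\hat\theta_\phi - \theta_T)^T, (\hat\alpha_\phi(\hat\theta_\phi) - \theta_T)^T \right)^T$. Under conditions (C.1) and (C.2), by a Taylor expansion using part (a), we obtain

$$\sqrt{n}\,a_n = \sqrt{n} \begin{bmatrix} \frac{1}{\varphi''(1)} I_{\theta_T}^{-1} & 0 \\ 0 & -\frac{1}{\varphi''(1)} I_{\theta_T}^{-1} \end{bmatrix} \begin{bmatrix} -P_n \frac{\partial}{\partial\theta} h(\theta_T, \theta_T) \\ -P_n \frac{\partial}{\partial\alpha} h(\theta_T, \theta_T) \end{bmatrix} + o_p(1). \qquad (6.19)$$

We therefore deduce, by the CLT, that $\sqrt{n}\,a_n$ converges in distribution to a centred normal variable with covariance matrix

$$V = \begin{bmatrix} I_{\theta_T}^{-1} & I_{\theta_T}^{-1} \\ I_{\theta_T}^{-1} & I_{\theta_T}^{-1} \end{bmatrix},$$

which completes the proof of Theorem 3.2.

Proof of Theorem 3.3. We have

$$\hat\phi(\Theta_0, \theta_T) := \inf_{\beta \in B_0} \sup_{\alpha \in \Theta} P_n h(s(\beta), \alpha) = P_n h\left(s(\hat\beta), \hat\alpha\right),$$

in which, as in the proof of Theorem 3.2, $s(\hat\beta)$ and $\hat\alpha$ are solutions of the system of equations

$$\begin{cases} P_n \frac{\partial}{\partial\beta} h\left(s(\hat\beta), \hat\alpha\right) = 0 \\ P_n \frac{\partial}{\partial\alpha} h\left(s(\hat\beta), \hat\alpha\right) = 0. \end{cases}$$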

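In the first equation the partial derivative is taken w.r.t. the first variable $\beta$ in $s(\beta)$, and in the second one w.r.t. the second variable $\alpha$. A Taylor expansion of $P_n \frac{\partial}{\partial\beta} h\left(s(\hat\beta), \hat\alpha\right)$ and $P_n \frac{\partial}{\partial\alpha} h\left(s(\hat\beta), \hat\alpha\right)$ in a neighborhood of $(\beta_T, \theta_T)$ gives

$$\begin{bmatrix} -P_n \frac{\partial}{\partial\beta} h(s(\beta_T), \theta_T) \\ -P_n \frac{\partial}{\partial\alpha} h(s(\beta_T), \theta_T) \end{bmatrix} = \begin{bmatrix} P_{\theta_T} \frac{\partial^2}{\partial\beta^2} h(s(\beta_T), \theta_T) & P_{\theta_T} \frac{\partial^2}{\partial\beta\partial\alpha} h(s(\beta_T), \theta_T) \\ P_{\theta_T} \frac{\partial^2}{\partial\alpha\partial\beta} h(s(\beta_T), \theta_T) & P_{\theta_T} \frac{\partial^2}{\partial\alpha^2} h(s(\beta_T), \theta_T) \end{bmatrix} b_n + o_p(1), \qquad (6.20)$$

where $b_n := \left( (\hat\beta - \beta_T)^T, (\hat\alpha - \theta_T)^T \right)^T$. This implies that $b_n = O_p(n^{-1/2})$. So, by a Taylor expansion of $\hat\phi(\Theta_0, \theta_T)$ around $(\beta_T, \theta_T)$, we obtain

$$\frac{2n}{\varphi''(1)}\, T_n^{\phi} = U_n^T A^{-1} U_n - V_n^T B^{-1} V_n + o_p(1), \qquad (6.21)$$

where

$$U_n := \frac{\sqrt{n}}{\varphi''(1)} P_n \frac{\partial}{\partial\alpha} h(s(\beta_T), \theta_T), \qquad V_n := \frac{\sqrt{n}}{\varphi''(1)} P_n \frac{\partial}{\partial\beta} h(s(\beta_T), \theta_T),$$

$$A := -\frac{1}{\varphi''(1)} P_{\theta_T} \frac{\partial^2}{\partial\alpha^2} h(s(\beta_T), \theta_T), \qquad B := \frac{1}{\varphi''(1)} P_{\theta_T} \frac{\partial^2}{\partial\beta^2} h(s(\beta_T), \theta_T).$$

By (6.12), it holds that $A = I_{\theta_T}$. On the other hand,

$$\frac{\partial}{\partial\beta} h(s(\beta_T), \theta_T) = \left[\frac{\partial}{\partial\beta} s(\beta_T)\right]^T \frac{\partial}{\partial s(\beta)} h(s(\beta_T), \theta_T) = [S(\beta_T)]^T \frac{\partial}{\partial s(\beta)} h(s(\beta_T), \theta_T).$$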