Asymptotic behavior of some Bayesian nonparametric and semi-parametric procedures


parametric Procedures. (Under the direction of Professor Subhashis Ghosal).

This dissertation extends some established results about the asymptotic behavior of some Bayesian nonparametric and semi-parametric procedures in three aspects.

First, positivity of the prior probability of Kullback-Leibler neighborhoods around the true density, commonly known as the Kullback-Leibler property, plays a fundamental role in posterior consistency. A popular prior for Bayesian estimation is given by a Dirichlet mixture, where the kernels are chosen depending on the sample space and the class of densities to be estimated. The Kullback-Leibler property of the Dirichlet mixture prior has been shown for some special kernels like the normal density or Bernstein polynomial, under appropriate conditions. We obtain easily verifiable sufficient conditions, under which a prior obtained by mixing a general kernel possesses the Kullback-Leibler property. We study a wide variety of kernels used in practice, including the normal, t, histogram, gamma and Weibull densities, and show that the Kullback-Leibler property holds if some easily verifiable conditions are satisfied at the true density. This gives a catalog of conditions required for the Kullback-Leibler property, which can be readily used in applications.


ard function is unknown, accelerated failure time models and partial linear regression model. We give sufficient conditions under which the posterior distribution of the parametric part is consistent in the Euclidean distance while the non-parametric part is consistent with respect to some topology such as the weak topology. Our results are obtained by verifying the conditions of an appropriate modification of a celebrated result of Schwartz. Our general consistency result applies also to the case of independent, non-identically distributed observations. Application of our theorem requires showing the existence of exponentially consistent tests for the complement of the neighborhoods of the “true” value of the parameter and the prior positivity of a Kullback-Leibler type of neighborhood of the true distribution of the observations. We construct the required tests and give sufficient conditions for positivity of prior probabilities of Kullback-Leibler neighborhoods in all the examples we consider in the corresponding chapter of this dissertation.


by Yuefeng Wu

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina 2009

APPROVED BY:

Sujit K. Ghosh Huixia Wang

Subhashis Ghosal Dennis Boos


DEDICATION


BIOGRAPHY


ACKNOWLEDGMENTS

First, I would like to thank my academic advisor, Dr. Subhashis Ghosal, for his continual guidance on not only the research itself but also the research methods, the teaching skills and so on.

Then, I would like to thank Dr. Dennis Boos, Dr. Sujit K. Ghosh and Dr. Huixia Wang for what I have learned from the great courses they taught and for serving on my advisory committee.

Also, I would like to thank Dr. Tumulesh Solanky, my academic advisor at the University of New Orleans, and Dr. Linxiong Li. They introduced me to statistics and enhanced my interest in academic work.

I would like to thank my parents, my sister, and my friends, who have always been supportive and helpful.


LIST OF TABLES

Table 3.1 Simulation results for showing the convergence


LIST OF FIGURES

Figure 2.1 Density estimated by Dirichlet mixture of normal density prior

Figure 2.2 Density estimated by mixture of Polya tree prior

Figure 2.3 Density estimated by Dirichlet mixture of triangular density prior


Chapter 1

Introduction

1.1

Overview

Bayesian nonparametric and semi-parametric models are getting more attention from both academic and practical fields. This is especially due to the flexibility of nonparametric and semi-parametric models and the feasibility of computing for the Bayesian approach. The advantages of the Bayesian approach, such as reflecting one’s prior beliefs in the analysis and the straightforward inference coming from the posterior distribution, also encourage people to use these models.


Dirichlet process and the Polya tree process.

As for any other model, the validity and the performance of Bayesian nonparametric and semi-parametric models should be studied. Among many different criteria, asymptotic consistency, used for checking the validity of a Bayesian model, and rate of convergence, used for evaluating the performance of a model, are widely accepted. Some results from the literature about asymptotic consistency for Bayesian nonparametric and semi-parametric models are given in Section 1.3.

The last section of this chapter describes how this thesis is organized.

1.2

Priors on Infinite-dimensional Spaces

In this section, we describe some priors on function spaces. These priors are fundamental and often used to construct more complicated priors for some specific problems in the later chapters.

We first consider Dirichlet processes. A Dirichlet process on a given measurable space $(X, \mathcal{B})$ with parameter α, where α is a finite measure on X, is a random probability measure P such that

(i) for every $B \in \mathcal{B}$, P(B) is a measurable random variable taking values in [0,1];

(ii) each realization of P is a probability measure on $(X, \mathcal{B})$;

(iii) for each measurable finite partition {B1, . . . , Bk} of X, the joint distribution of the vector (P(B1), . . . , P(Bk)) is the Dirichlet distribution on the k-dimensional unit simplex with parameters (k; α(B1), . . . , α(Bk)).
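Property (iii) can be illustrated numerically. The following is a minimal sketch, not part of the original development: it assumes a hypothetical three-set partition with base-measure masses α(B1) = 1, α(B2) = 2.5, α(B3) = 0.5, draws the vector (P(B1), P(B2), P(B3)) from the corresponding Dirichlet distribution by normalizing independent gamma variables, and compares the empirical means with α(Bi)/α(X).

```python
import random


def dirichlet_sample(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)

# Hypothetical masses alpha(B1), alpha(B2), alpha(B3) of a finite partition of X.
alpha_masses = [1.0, 2.5, 0.5]
total_mass = sum(alpha_masses)

# Each draw is one realization of (P(B1), P(B2), P(B3)).
draws = [dirichlet_sample(alpha_masses, rng) for _ in range(20000)]

# The prior mean of P(Bi) is alpha(Bi) / alpha(X).
mean_p = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
expected = [a / total_mass for a in alpha_masses]
```

Each draw lies on the unit simplex, and the empirical means in `mean_p` should be close to `expected`.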


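The Polya tree processes are a large class of priors, which contains the Dirichlet processes. For any $j$, let $E_j$ be the set of all sequences of 0s and 1s of length $j$, and $E^* = \cup_{j \in \mathbb{N}} E_j$. Let $T_m = \{B_\epsilon, \epsilon \in E_m\}$, with $B_{\epsilon 0}$ and $B_{\epsilon 1}$ being a partition of $B_\epsilon$, and let $\{T_m\}$ be a sequence of binary partitions. Let $\mathcal{A} = \{\alpha_\epsilon : \epsilon \in E^*\}$ be a collection of nonnegative numbers. A random probability measure $P$ on $\mathbb{R}$ follows a Polya tree process with parameters $(\{T_m\}, \mathcal{A})$ if there exists a collection of random variables $Y = \{Y_\epsilon : \epsilon \in E^*\}$ such that

(i) the collection $Y$ consists of mutually independent random variables;

(ii) for each $\epsilon \in E^*$, $Y_\epsilon$ has a beta distribution with parameters $\alpha_{\epsilon 0}$ and $\alpha_{\epsilon 1}$;

(iii) the random probability measure $P$ is related to $Y$ through the relations

\[
P(B_{\epsilon_1 \cdots \epsilon_m}) = \Bigg( \prod_{j=1;\, \epsilon_j = 0}^{m} Y_{\epsilon_1 \cdots \epsilon_{j-1}} \Bigg) \Bigg( \prod_{j=1;\, \epsilon_j = 1}^{m} \big(1 - Y_{\epsilon_1 \cdots \epsilon_{j-1}}\big) \Bigg), \qquad m = 1, 2, \dots,
\]

where the factors are $Y_\emptyset$ or $1 - Y_\emptyset$ if $j = 1$.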
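The product relation in (iii) can be sketched in code. The example below is an illustration under stated assumptions rather than a construction from the text: binary strings stand for the indices ε, the empty string stands for ∅, and the parameter choice α_ε = (length of ε)² is hypothetical. It draws the independent beta variables Y_ε and multiplies the factors along each path to obtain the masses of all sets at level m.

```python
import itertools
import random


def polya_tree_level_probs(m, alpha, rng):
    """Return {eps: P(B_eps)} for every binary string eps of length m."""
    # Draw Y_eps ~ Beta(alpha(eps0), alpha(eps1)) independently for all eps of length < m.
    y = {}
    for j in range(m):
        for bits in itertools.product("01", repeat=j):
            eps = "".join(bits)  # the empty string plays the role of the empty sequence
            y[eps] = rng.betavariate(alpha(eps + "0"), alpha(eps + "1"))

    # P(B_eps) multiplies Y_prefix when the next digit is 0 and (1 - Y_prefix) when it is 1.
    probs = {}
    for bits in itertools.product("01", repeat=m):
        eps = "".join(bits)
        p = 1.0
        for j in range(m):
            prefix = eps[:j]
            p *= y[prefix] if eps[j] == "0" else 1.0 - y[prefix]
        probs[eps] = p
    return probs


def alpha(eps):
    # Hypothetical parameter choice, used here only for illustration.
    return float(len(eps) ** 2)


rng = random.Random(1)
probs = polya_tree_level_probs(3, alpha, rng)
```

Because the two children of every set split its mass as Y and 1 − Y, the masses in `probs` add up to one at every level.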

The Polya tree process was originally introduced as a prior distribution on the space of probability measures by Ferguson (1974) and Blackwell and MacQueen (1973). Mauldin et al. (1992) and Lavine (1992, 1994) study Polya tree processes more thoroughly.

If the base measure of the Dirichlet process is $\alpha = M\alpha_0$, where $\alpha_0$ is a probability measure belonging to some parametric family and $M$ is the total mass of $\alpha$, and the actual values of $\alpha_0$ and $M$ are unknown, then it is natural to assign priors on these parameters. This leads to a mixture of Dirichlet processes, which was introduced by Antoniak (1974). Such a prior may be described in the following way,

\[
P \sim D_{\alpha_\theta}, \qquad \theta \sim \pi,
\]


Though the Dirichlet process can spread mass well over the space of probability measures by choosing an appropriate base measure, the random probability measure P ∼ Dα is almost surely (a.s.) discrete. To construct a prior sitting on the space of probability densities, we convolve the random P with a kernel function. The resulting prior is called a Dirichlet mixture, which was introduced by Ferguson (1983) and Lo (1984). The choice of a proper kernel and more details about priors of this type are given in the next chapter.
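To make the construction concrete, the following sketch draws an approximate realization of a Dirichlet mixture density. It rests on several assumptions that are not in the text: the discrete P is approximated by a truncated stick-breaking series, the base measure is taken to be M times a N(0, 3²) distribution with M = 2, the kernel is a normal density with a fixed bandwidth of 0.5, and the function names are hypothetical.

```python
import math
import random


def stick_breaking(total_mass, base_sampler, n_atoms, rng):
    """Truncated stick-breaking approximation to a draw P from a Dirichlet process."""
    weights, atoms, remaining = [], [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, total_mass)
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    weights[-1] += remaining  # fold the leftover mass into the last atom
    return weights, atoms


def normal_kernel(x, theta, h):
    """Normal density with location theta and scale h, evaluated at x."""
    z = (x - theta) / h
    return math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, atoms, h):
    """f_P(x) = integral of K(x; theta) dP(theta) for the discrete P above."""
    return sum(w * normal_kernel(x, t, h) for w, t in zip(weights, atoms))


rng = random.Random(2)
weights, atoms = stick_breaking(2.0, lambda r: r.gauss(0.0, 3.0), 200, rng)

# Riemann-sum check that the random mixture is a probability density.
step = 0.05
grid = [-30.0 + step * i for i in range(1201)]
area = sum(mixture_density(x, weights, atoms, 0.5) for x in grid) * step
```

Although P itself is discrete, the resulting `mixture_density` is a smooth function of x, and `area` should be close to one.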

Other processes derived from the Dirichlet process include the invariant Dirichlet process and the pinned-down Dirichlet process. See Dalal (1979), Diaconis and Freedman (1986a, 1986b), and Doss (1985a, 1985b) for more details.

There are a variety of processes which could be thought of as generalizations of the Dirichlet process. They are the tail-free process (see Freedman 1963, Ferguson 1974, Ghosh and Ramamoorthi 2003 and Kraft 1964), the neutral to the right process (see Doksum 1974), the Polya tree process, the d-dimensional Dirichlet process (see Hjort 1996) and priors obtained from random series representations (see Sethuraman 1994, Hjort 2000 and Ishwaran and Zarepour 2002).

There are many other prior distributions in the literature. For example, Leonard (1978) and then Lenk (1988, 1991) constructed Gaussian process priors (see also Wahba 1978 and Choudhuri et al. 2004b); Lévy processes, which are independent increment processes, were discussed by Hjort (1990) and Kim (1999).

1.3

Consistency

In this section we give some basic results about posterior consistency in the context of Bayesian nonparametric and semi-parametric models. Roughly speaking, posterior consistency means that the posterior distribution converges to the degenerate probability measure that assigns mass 1 to the true value of the parameter as the sample size goes to infinity.


al 1995). However, for infinite-dimensional problems, such a simple assertion does not hold (Freedman (1963), Diaconis and Freedman (1986a, 1986b), Doss (1985a, 1985b) and Kim and Lee (2001)). Thus, verifying the posterior consistency for nonparametric and semi-parametric models is not trivial.

Let $\Pi$ be the prior on the parameter space $\Theta$ and let $\theta_0 \in \Theta$ be the true value. Let $P_\theta$ denote a probability measure with parameter $\theta$ on a measurable space $(X, \mathcal{A})$. Let $\Omega = (X^\infty, \mathcal{A}^\infty)$ and let $P_\theta^\infty$ be the independent and identically distributed (i.i.d.) product measure defined on $\Omega$. We give the definition of consistency of the posterior distribution below.

Definition 1. For each $n$, let $\Pi(\cdot \mid \mathbf{X}_n)$ be a posterior given $X_1, \dots, X_n$. The sequence $\{\Pi(\cdot \mid \mathbf{X}_n)\}$ is said to be consistent at $\theta_0$ if there is an $\Omega_0 \subset \Omega$ with $P_{\theta_0}^\infty(\Omega_0) = 1$ such that if $\omega \in \Omega_0$, then for every neighborhood $U$ of $\theta_0$,

\[
\Pi(U \mid \mathbf{X}_n(\omega)) \to 1. \tag{1.1}
\]

Remark 1. When $\Theta$ is a metric space, $\{\{\theta : \rho(\theta_0, \theta) < 1/n\} : n \ge 1\}$ forms a base for the neighborhoods of $\theta_0$, and hence expression (1.1) is equivalent to

\[
\Pi(U \mid \mathbf{X}_n(\omega)) \to 1 \quad \text{almost surely (a.s.) } P_{\theta_0}^\infty. \tag{1.2}
\]

If almost sure convergence does not make sense or does not hold, convergence in $P_{\theta_0}^\infty$-probability may be considered. Also note that, for density estimation, if $U$ is a neighborhood in L1-distance, we refer to such consistency as strong consistency, while if $U$ is a weak neighborhood of $\theta_0$, the consistency is called weak consistency.


Schwartz (1965) obtained a general result on consistency, which lays the foundation of Bayesian asymptotic theory for general parameter spaces. Before we recite the theorem, define a Kullback-Leibler neighborhood of $\theta_0$ of size $\epsilon$ by $K_\epsilon(\theta_0) = \{\theta : K(\theta_0; \theta) < \epsilon\}$, where

\[
K(\theta_0; \theta) = \int p(x, \theta_0) \log\big(p(x, \theta_0)/p(x, \theta)\big)\, d\mu(x)
\]

is the Kullback-Leibler divergence between $\theta_0$ and $\theta$. Note that $p(x, \theta)$ is a density indexed by $\theta \in \Theta$ with respect to some sigma-finite measure $\mu$. We say that the KL property holds at $\theta_0$, or $\theta_0$ is in the Kullback-Leibler support (KL support) of $\Pi$, and write $\theta_0 \in KL(\Pi)$, if $\Pi(K_\epsilon(\theta_0)) > 0$ for every $\epsilon > 0$.
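As a small numerical illustration of the divergence just defined, the sketch below evaluates K(θ0; θ) by a midpoint rule for two normal densities and compares it with the well-known closed form for normal distributions. The specific parameter values, the integration range and the function names are assumptions chosen only for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kl_numeric(p, q, lo, hi, n):
    """Midpoint-rule approximation of the integral of p log(p/q) over [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        px = p(x)
        if px > 0.0:
            total += px * math.log(px / q(x)) * step
    return total


def kl_normal_closed_form(mu0, s0, mu1, s1):
    """KL divergence from N(mu0, s0^2) to N(mu1, s1^2)."""
    return math.log(s1 / s0) + (s0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * s1 ** 2) - 0.5


# Hypothetical "true" density N(0, 1) and alternative N(1, 1.5^2).
kl_num = kl_numeric(
    lambda x: normal_pdf(x, 0.0, 1.0),
    lambda x: normal_pdf(x, 1.0, 1.5),
    -12.0, 12.0, 20000,
)
kl_exact = kl_normal_closed_form(0.0, 1.0, 1.0, 1.5)
```

A Kullback-Leibler neighborhood of size ε around the N(0, 1) density then consists of all densities whose divergence computed in this way is below ε.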

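Theorem 1. Let $\theta_0 \in U \subset \Theta$. If there exist $m \ge 1$ and a test function $\phi(\mathbf{X}_m)$ for testing $H_0 : \theta = \theta_0$ against $H_a : \theta \in U^c$ with the property that $\inf\{E_\theta \phi(\mathbf{X}_m) : \theta \in U^c\} > E_{\theta_0} \phi(\mathbf{X}_m)$, and if $\theta_0 \in KL(\Pi)$, then $\Pi\{\theta \in U^c \mid \mathbf{X}_n\} \to 0$ a.s. $[P_{\theta_0}^\infty]$.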

This theorem requires two conditions to be satisfied. The first is the existence of a strictly unbiased test for testing the null hypothesis θ = θ0 against the complement of a neighborhood U. The second is the KL property. The testing condition is usually more difficult to satisfy. Ghosh and Ramamoorthi (2003), Theorem 4.4.2, gave conditions for the weak topology, where the existence of the test is not difficult to show. However, in more complicated problems or for stronger topologies on densities, existence becomes much harder to show and sometimes needs additional conditions.

1.4

Outline


Chapter 2

Kullback Leibler Property of

Kernel Mixture Priors

2.1

Introduction

Density estimation, which is also relevant in various applications such as cluster analysis and robust estimation, is a fundamental nonparametric inference problem. In the Bayesian approach to density estimation, a prior such as a Gaussian process, a Polya tree process, or a Dirichlet mixture is constructed on the space of probability densities. Dirichlet mixtures were introduced by Ferguson (1983) and Lo (1984), who also obtained expressions for the resulting posterior and predictive distributions. West (1992), West, Müller and Escobar (1994) and Escobar and West (1995, 1998) developed powerful Markov chain Monte Carlo methods to calculate Bayes estimates and other posterior quantities for Dirichlet mixtures.


location-scale kernel is appropriate. If X is the unit interval, a uniform or triangular density kernel, or Bernstein polynomial may be considered. If X is the positive half line (0,∞), mixtures of gamma, Weibull, lognormal, exponential or inverse gamma may be used. Petrone and Veronese (2007) discussed the issue of the choice of a kernel in view of a constructive approximation known as the Feller sampling scheme. Let $P$, the mixing distribution on $\Theta$, be given a prior $\Pi$ on $M(\Theta)$, the space of probability measures on $\Theta$. Let supp($\Pi$) denote the weak support of $\Pi$. The prior on $P$ and the chosen kernel then give rise to a prior on $D(X)$, the space of densities on X, via the map $P \mapsto f_P(x) := \int K(x; \theta)\, dP(\theta)$. We shall call such a prior a Type I mixture prior or Prior 1 in short.

To enrich the family of the kernels, let the kernel function contain another parameter $\phi$, referred to as the hyper parameter. In this case, we shall denote the kernel by $K(x; \theta, \phi)$. The hyper parameter $\phi$ might be elicited a priori or be given a prior. In the former case, such a prior essentially reduces to Prior 1. For the latter case, assume that $\phi$ is independent of $P$ and denote the prior for $\phi$ by $\mu$. Let $\Phi$ be the space of $\phi$ and supp($\mu$) denote the support of $\mu$. With such a random hyper parameter in the chosen kernel, the prior on densities is induced by $\mu \times \Pi$ via the map $(\phi, P) \mapsto f_{P,\phi}(x) := \int K(x; \theta, \phi)\, dP(\theta)$. We shall call this prior a Type II mixture prior or simply Prior 2. Clearly, Prior 2 contains Prior 1 as a special case where $\phi$ is treated as a vacuous parameter.

In some situations, the prior $\Pi$ may contain an additional indexing parameter $\xi$. For instance, when $\Pi$ is the Dirichlet process with base measure $\alpha_\xi$ (written as DP($\alpha_\xi$)) depending on an indexing parameter $\xi$, which is also given a prior, we obtain a mixture of Dirichlet processes (MDP) (Antoniak (1974)) prior for the mixing distribution $P$. Addition of this hierarchical structure to Prior 1 or Prior 2 gives somewhat more flexibility. In this paper, we do not make any specific assumption on $\Pi$ like DP or MDP other than requiring that it has large weak support. The prior induced on the space of densities by a mixing distribution $P \sim \Pi$ (and $\phi \sim \mu$ and $\xi \sim \pi$) will be denoted by $\Pi^*$ and we shall refer to it as a kernel mixture prior. Note that the variable $x$ and the parameters $\theta$, $\phi$ and $\xi$ mentioned above are not necessarily one-dimensional.


posterior distribution based on kernel mixture priors were established by Ghosal, Ghosh and Ramamoorthi (1999), Tokdar (2006), and Ghosal and van der Vaart (2001, 2007), when the kernel is chosen to be a normal probability density (and the prior distribution of the mixing distribution is DP). Similar results for Dirichlet mixture of Bernstein polynomials were shown by Petrone and Wasserman (2002), Ghosal (2001) and Kruijer and van der Vaart (2005). However, in the literature, there is a lack of such results for mixtures of other kernels, which are also widely used in practice. We are only aware of the article by Petrone and Veronese (2007) who considered general kernels. However, they derived consistency only under the strong and unrealistic condition that the true density is exactly of the mixture type for some compactly supported mixing distribution, or the true density itself is compactly supported and is approximated in terms of Kullback-Leibler divergence by its convolution with the chosen kernel.

Schwartz (1965) showed that consistency at a true density $f_0$ holds if the prior assigns positive probabilities to a specific type of neighborhoods of $f_0$ defined by the Kullback-Leibler divergence measure and the size of the model is restricted in some appropriate sense. Thus the prior positivity condition, known as the Kullback-Leibler property (KL property), is fundamental in posterior consistency studies. More formally, let a density function $f$ be given a prior $\Pi^*$. Define a Kullback-Leibler neighborhood of $f$ of size $\epsilon$ by $K_\epsilon(f) = \{g : K(f; g) < \epsilon\}$, where $K(f; g) = \int f \log(f/g)$ is the Kullback-Leibler divergence between $f$ and $g$. We say that the KL property holds at $f_0 \in D(X)$, or $f_0$ is in the Kullback-Leibler support (KL support) of $\Pi^*$, and write $f_0 \in KL(\Pi^*)$, if $\Pi^*(K_\epsilon(f_0)) > 0$ for every $\epsilon > 0$. For the weak topology, the size condition in Schwartz’s theorem holds automatically (Ghosh and Ramamoorthi (2003) [Theorem 4.4.2]). Further, Ghosal, Ghosh and Ramamoorthi (1999) argued that this property drives consistency of the parametric part in some semiparametric models.


particular type of kernel or by a prior distribution for mixing distribution. The distinguished feature of our results is that we allow the true density to be not of the chosen mixture type, and impose only simple moment conditions and qualitative conditions like continuity or positivity.

Ghosal, Ghosh and Ramamoorthi (1999) presented results on consistency for Dirichlet location mixture of a normal kernel with an additional scale parameter in terms of both weak and L1-topologies. Tokdar (2006) [Theorem 3.2] considered a location-scale mixture of the normal kernel and established consistency in weak topology (weak consistency) under more relaxed conditions. If the prior Π is chosen to be DP(α), Tokdar (2006) also weakened a moment condition on the true density in his Theorem 3.3. His Theorem 3.2 will be implied by Theorem 5 in this paper (with the choice λ = 0 there). In fact, we establish the KL property for a general location-scale kernel mixture and show that such a result applies to various kernels including the skew-normal, t, double-exponential and logistic. This is a substantial generalization of results known for only the normal kernel thus far. Moreover, we obtain results about the KL property for priors with kernels not belonging to location-scale families, e.g., the Weibull, gamma, uniform, and exponential kernels. The examples studied here provide a ready catalog of conditions required for the KL property to hold for virtually all kernel mixture priors that are of practical interest.

With the help of our results on the KL property, consistency in L1- (equivalently, Hellinger) distance can be obtained by constructing appropriate sieves approximating the class of mixtures and establishing entropy bounds for them. Since the techniques used for sieve construction and bounding entropy vary widely depending on the chosen kernel, we do not address L1-consistency in this paper.


2.2

General Kernel Mixture Priors

First we observe that the Kullback Leibler property is preserved under taking mixtures.

Lemma 1. Let $f \mid \xi \sim \Pi^*_\xi$, where $\xi$ is an indexing parameter following a prior $\pi$, and let $f_0$ be the true density. Suppose that there exists a set $B$ with $\pi(B) > 0$ and $B \subset \{\xi : f_0 \in KL(\Pi^*_\xi)\}$. Then $f_0 \in KL(\Pi^*)$, where $\Pi^* = \int \Pi^*_\xi\, d\pi(\xi)$.

The proof is an almost trivial application of Fubini’s theorem, since

\[
\Pi^*(f : K(f_0; f) < \epsilon) \ge \int_B \Pi^*_\xi(f : K(f_0; f) < \epsilon)\, d\pi(\xi) > 0.
\]

In view of this result, henceforth we shall discard the indexing parameter ξ from our prior.

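Theorem 2. Let $f_0$ be the true density, $\mu$ and $\Pi$ be priors for the hyper parameter and the mixing distribution in Prior 2, and $\Pi^*$ be the prior induced by $\mu$ and $\Pi$ on $D(X)$. If for any $\epsilon > 0$, there exist $P_\epsilon$, $\phi_\epsilon$, $A \subset \Phi$ with $\mu(A) > 0$ and $W \subset M(\Theta)$ with $\Pi(W) > 0$, such that

A1. $\int f_0 \log \dfrac{f_0}{f_{P_\epsilon, \phi_\epsilon}} < \epsilon$,

A2. $\int f_0 \log \dfrac{f_{P_\epsilon, \phi_\epsilon}}{f_{P_\epsilon, \phi}} < \epsilon$ for every $\phi \in A$, and

A3. $\int f_0 \log \dfrac{f_{P_\epsilon, \phi}}{f_{P, \phi}} < \epsilon$ for every $P \in W$, $\phi \in A$,

then $f_0 \in KL(\Pi^*)$.

Proof. For any $\epsilon > 0$, $\phi \in A$ and $P \in W$,

\[
\int_X f_0(x) \log \frac{f_0(x)}{f_{P,\phi}(x)}\, dx
= \int_X f_0(x) \log \frac{f_0(x)}{f_{P_\epsilon,\phi_\epsilon}(x)}\, dx
+ \int_X f_0(x) \log \frac{f_{P_\epsilon,\phi_\epsilon}(x)}{f_{P_\epsilon,\phi}(x)}\, dx
+ \int_X f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx < 3\epsilon.
\]

Hence,

\[
\Pi^*\{f : f \in K_{3\epsilon}(f_0)\} \ge \Pi^*\{f_{P,\phi} : P \in W, \phi \in A\} = (\Pi \times \mu)(W \times A) > 0.
\]

□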

Remark 2. If $\Pi = DP(\alpha)$ and supp($P_\epsilon$) ⊂ supp($\alpha$), then $P_\epsilon \in$ supp($\Pi$); see, for instance, Theorem 3.2.4 of Ghosh and Ramamoorthi (2003). In particular, the condition holds for any chosen $P_\epsilon$ if $\alpha$ is fully supported on $\Theta$. A similar assertion holds when $\Pi$ is the Polya tree prior $PT(\{T_m\}, \mathcal{A})$ (see Lavine (1992)). Let $T_m$ be a collection of gradually refining binary partitions and $\mathcal{A} = \{\alpha_{\epsilon_1, \dots, \epsilon_m} : \epsilon_1, \dots, \epsilon_m = 0 \text{ or } 1,\ m \ge 1\}$. If the end points of $T_m$ form a dense subset of some set $S$ where $S \supset$ supp($P_\epsilon$) and the elements of $\mathcal{A}$, which control the beta distributions regulating the mass allocation to the sets in $\Pi_m$, are positive, then also $P_\epsilon \in$ supp($\Pi$). This is implicit in Theorem 5 of Lavine (1992) or Theorem 3.3.6 of Ghosh and Ramamoorthi (2003); for an explicit statement and proof, see Theorem 2.20 of Ghosal and van der Vaart (2009). Now, if $W$ is an open neighborhood of $P_\epsilon$, then $\Pi(W) > 0$ holds.

Remark 3. Assume that $\phi_\epsilon \in$ supp($\mu$). Condition A2 clearly holds with $A$ an open neighborhood of $\phi_\epsilon$, assuming that $\phi \mapsto \int f_0 \log(f_{P_\epsilon,\phi_\epsilon}/f_{P_\epsilon,\phi})$ is continuous.

In most applications, we can choose $P_\epsilon$ to be compactly supported. Compactness of supp($P_\epsilon$) often helps satisfy Conditions A4–A9 in Lemmas 2 and 3, which are useful in verifying the conditions of Theorem 2.

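Lemma 2. Let $f_0$, $\Pi$, $\mu$ and $\Pi^*$ be the same as in Theorem 2. If for any $\epsilon > 0$, there exist $P_\epsilon$, a set $D$ containing supp($P_\epsilon$), and $\phi_\epsilon \in$ supp($\mu$) such that A1 holds and the kernel function $K$ satisfies

A4. for any given $x$ and $\theta$, the map $\phi \mapsto K(x; \theta, \phi)$ is continuous on the interior of the support of $\mu$;

A5. $\displaystyle \int_X \left\{ \left| \log \frac{\sup_{\theta \in D} K(x; \theta, \phi_\epsilon)}{\inf_{\theta \in D} K(x; \theta, \phi)} \right| + \left| \log \frac{\sup_{\theta \in D} K(x; \theta, \phi)}{\inf_{\theta \in D} K(x; \theta, \phi_\epsilon)} \right| \right\} f_0(x)\, dx < \infty$ for every $\phi \in N(\phi_\epsilon)$,

A6. for any given $x \in X$, $\theta \in D$ and $\phi \in N(\phi_\epsilon)$, there exists $g(x, \theta)$ such that $g(x, \theta) \ge K(x; \theta, \phi)$, and $\int g(x, \theta)\, dP_\epsilon(\theta) < \infty$;

then there exists a set $A \subset \Phi$ such that A2 holds.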

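Proof. By Condition A4, we have that $K(x; \theta, \phi) \to K(x; \theta, \phi_\epsilon)$ as $\phi \to \phi_\epsilon$, for any given $x$ and $\theta$. By Condition A6 and the dominated convergence theorem (DCT), $f_{P_\epsilon,\phi}(x) \to f_{P_\epsilon,\phi_\epsilon}(x)$ as $\phi \to \phi_\epsilon$, for any given $x$. Equivalently, this can be written as

\[
\log \frac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} \to 0 \quad \text{pointwise, as } \phi \to \phi_\epsilon. \tag{2.2}
\]

Note that

\[
\left| \log \frac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} \right| \le
\begin{cases}
\log \dfrac{\sup_{\theta \in D} K(x; \theta, \phi_\epsilon)}{\inf_{\theta \in D} K(x; \theta, \phi)}, & \text{if } \dfrac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} \ge 1, \\[2ex]
\log \dfrac{\sup_{\theta \in D} K(x; \theta, \phi)}{\inf_{\theta \in D} K(x; \theta, \phi_\epsilon)}, & \text{if } \dfrac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} < 1.
\end{cases}
\]

By Condition A5 and the DCT, $\int f_0 \log \frac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} \to 0$ as $\phi \to \phi_\epsilon$. Hence, for given $\epsilon > 0$, there exists $\delta > 0$ such that $\int f_0 \log \frac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} < \epsilon$ if $|\phi - \phi_\epsilon| < \delta$. If $A = \{\phi : |\phi - \phi_\epsilon| < \delta\} \cap N(\phi_\epsilon)$, then $\int f_0 \log \frac{f_{P_\epsilon,\phi_\epsilon}}{f_{P_\epsilon,\phi}} < \epsilon$ for all $\phi \in A$. The proof is completed by noticing that $\mu(A) > 0$, since $A$ is an open neighborhood of $\phi_\epsilon \in$ supp($\mu$). □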

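Lemma 3. Let $f_0$, $\Pi$, $\mu$ and $\Pi^*$ be the same as in Theorem 2. If for any $\epsilon > 0$, there exist $P_\epsilon \in$ supp($\Pi$), $\phi_\epsilon \in$ supp($\mu$), and $A \subset \Phi$ with $\mu(A) > 0$ such that Conditions A1 and A2 hold and for some closed $D \supset$ supp($P_\epsilon$), the kernel function $K$ and prior $\Pi$ satisfy

A7. for any $\phi \in A$, $\displaystyle \int \log \frac{f_{P_\epsilon,\phi}(x)}{\inf_{\theta \in D} K(x; \theta, \phi)}\, f_0(x)\, dx < \infty$;

A8. $c := \inf_{x \in C} \inf_{\theta \in D} K(x; \theta, \phi) > 0$, for any compact $C \subset X$;

A9. for any given $\phi \in A$ and compact $C \subset X$, there exists $E$ containing $D$ in its interior such that the family of maps $\{\theta \mapsto K(x; \theta, \phi), x \in C\}$ is uniformly equicontinuous on $E \subset \Theta$, and $\sup\{K(x; \theta, \phi) : x \in C, \theta \in E^c\} < c\epsilon/4$;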


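Proof. For any $\phi \in A$, write

\[
\int_X f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx
= \int_{C^c} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx
+ \int_{C} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx. \tag{2.3}
\]

Now, since $P_\epsilon(D) = 1 > \frac{1}{2}$, $V = \{P : P(D) > \frac{1}{2}\}$ is an open neighborhood of $P_\epsilon$ by the Portmanteau Theorem. For any $P \in V$ and $\phi \in A$,

\[
\begin{aligned}
\int_{C^c} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx
&\le \int_{C^c} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{\int_{\theta \in D} \inf_{\theta \in D} K(x; \theta, \phi)\, dP(\theta)}\, dx \\
&\le \int_{C^c} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{\inf_{\theta \in D} K(x; \theta, \phi)}\, dx + (\log 2)\, P_{f_0}(C^c);
\end{aligned}
\]

here $P_{f_0}$ is the probability measure corresponding to $f_0$. By Condition A7, there exists a compact $C \subset X$ such that

\[
\int_{C^c} f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{\inf_{\theta \in D} K(x; \theta, \phi)}\, dx < \epsilon/4. \tag{2.4}
\]

We can further ensure that $P_{f_0}(C^c) < \epsilon/4$, so the bound for $\int_{C^c} f_0 \log \frac{f_{P_\epsilon,\phi}}{f_{P,\phi}}$ is less than $\epsilon/2$. Now, if we can show that for the given $\epsilon > 0$, there exists a weak neighborhood $U$ of $P_\epsilon$ such that $\int_C f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx < \epsilon/2$ for any $P \in U$ and $\phi \in A$, then Lemma 3 is proved by letting $W = U \cap V$.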

Observing that for any given $\phi \in A$, the family of maps $\{\theta \mapsto K(x; \theta, \phi) : x \in C\}$ is uniformly equicontinuous on $E \subset \Theta$, by the Arzela-Ascoli theorem (see Royden (1988) [p. 169]), for any $\delta > 0$, there exist $x_1, x_2, \dots, x_m$ such that, for any $x \in C$,

\[
\sup_{\theta \in E} |K(x; \theta, \phi) - K(x_i; \theta, \phi)| < c\delta \tag{2.5}
\]

for some i= 1,2, . . . , m. LetU ={P :|R

EK(xi;θ, φ)dP(θ)−

R


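For any $x \in C$, choosing $x_i$ to satisfy (2.5), we have that

\[
\begin{aligned}
\left| \int_\Theta K(x; \theta, \phi)\, dP(\theta) - \int_\Theta K(x; \theta, \phi)\, dP_\epsilon(\theta) \right|
&\le \sup\{K(x; \theta, \phi) : \theta \in E^c, x \in C\} \\
&\quad + \left| \int_E K(x; \theta, \phi)\, dP(\theta) - \int_E K(x_i; \theta, \phi)\, dP(\theta) \right| \\
&\quad + \left| \int_E K(x_i; \theta, \phi)\, dP(\theta) - \int_E K(x_i; \theta, \phi)\, dP_\epsilon(\theta) \right| \\
&\quad + \left| \int_E K(x_i; \theta, \phi)\, dP_\epsilon(\theta) - \int_E K(x; \theta, \phi)\, dP_\epsilon(\theta) \right| \\
&< \frac{c\epsilon}{4} + 2c\delta + \left| \int_E K(x_i; \theta, \phi)\, dP(\theta) - \int_E K(x_i; \theta, \phi)\, dP_\epsilon(\theta) \right| \\
&< c\left(\frac{\epsilon}{4} + 3\delta\right)
\end{aligned}
\tag{2.6}
\]

if $P \in U$. Also $\int_\Theta K(x; \theta, \phi)\, dP_\epsilon(\theta) > c$ for any $x \in C$, since $P_\epsilon$ has support in $D$. Hence, given $\phi \in A$, for any $P \in U$ and $x \in C$,

\[
\left| \frac{\int_\Theta K(x; \theta, \phi)\, dP(\theta)}{\int_\Theta K(x; \theta, \phi)\, dP_\epsilon(\theta)} - 1 \right| \le 3\delta + \frac{\epsilon}{4}.
\]

Then, for $3\delta + \epsilon/4 < 1$,

\[
\left| \frac{\int_\Theta K(x; \theta, \phi)\, dP_\epsilon(\theta)}{\int_\Theta K(x; \theta, \phi)\, dP(\theta)} - 1 \right| < \frac{3\delta + \epsilon/4}{1 - 3\delta - \epsilon/4}.
\]

By choosing $\delta$ small enough, we can ensure that the right hand side (RHS) of the last display is less than $\epsilon/2$. Hence, for any given $\phi \in A$,

\[
\int_C f_0(x) \log \frac{f_{P_\epsilon,\phi}(x)}{f_{P,\phi}(x)}\, dx
\le \sup_{x \in C} \left| \frac{\int_\Theta K(x; \theta, \phi)\, dP_\epsilon(\theta)}{\int_\Theta K(x; \theta, \phi)\, dP(\theta)} - 1 \right| < \epsilon/2
\]

for any $P \in U$. □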

2.3

Location scale kernel

In this section we discuss priors with kernel functions belonging to location-scale families. We write the kernels as $K(x; \theta, h) = \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)$, where $\chi(\cdot)$ is a probability density function defined on $\mathbb{R}^d$, $x = (x_1, \dots, x_d)$ and $\theta = (\theta_1, \dots, \theta_d)$ are d-dimensional vectors and $h \in (0, \infty)$. Let $\|x\|$ denote $\sqrt{x_1^2 + x_2^2 + \dots + x_d^2}$, and $\chi'_i(x)$ denote $\partial \chi(x)/\partial x_i$. Obviously, when $d = 1$, this reduces to the ordinary derivative and $\|\cdot\|$ denotes absolute value. We have the following theorems, whose proofs use some ideas from the proof of Theorem 3.2 of Tokdar (2006).

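Theorem 3. Let $f_0(x)$ be the true density and $\Pi^*$ be a type I prior on $D(X)$ with kernel function $h^{-d} \chi\left(\frac{x - \theta}{h}\right)$, i.e. $P \sim \Pi$, and given $P$, $(\theta, h) \sim P$. If $\chi(\cdot)$ and $f_0(x)$ satisfy:

B1. $\chi(\cdot)$ is bounded, continuous and positive everywhere;

B2. there exists $l_1 > 0$ such that $\chi(x)$ decreases as $x$ moves away from 0 outside the ball $\{x : \|x\| < l_1\}$;

B3. there exists $l_2 > 0$ such that $\sum_{i=1}^d z_i \frac{\chi'_i(z)}{\chi(z)} < -1$ for $\|z\| \ge l_2$ and $i = 1, \dots, d$;

B4. for some $0 < M < \infty$, $0 < f_0(x) \le M$ for all $x$;

B5. $\left| \int f_0(x) \log f_0(x)\, dx \right| < \infty$;

B6. for some $\delta > 0$, $\int f_0(x) \log \frac{f_0(x)}{\phi_\delta(x)}\, dx < \infty$, where $\phi_\delta(x) = \inf_{\|t - x\| < \delta} f_0(t)$;

B7. there exists $\eta > 0$ such that $\left| \int f_0(x) \log \chi(2x\|x\|^\eta)\, dx \right| < \infty$ and $\int f_0(x) \left| \log \chi\left(\frac{x - a}{b}\right) \right| dx < \infty$ for any $a \in \mathbb{R}^d$, $b \in (0, \infty)$;

B8. the weak support of $\Pi$ is $M(\mathbb{R}^d \times \mathbb{R}^+)$;

B9. when $d \ge 2$, $\chi(y) = o(\|y\|^{-d})$ as $\|y\| \to \infty$.

Then $f_0 \in KL(\Pi^*)$.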
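Condition B3 is easy to probe numerically for a given kernel when d = 1, where it reads z χ'(z)/χ(z) < −1 for |z| ≥ l2. The sketch below is only an illustration: the two kernels (standard normal and standard logistic), the finite-difference derivative, the grid, and the candidate values of l2 are all assumptions made for the example, and a grid check is of course not a proof.

```python
import math


def normal_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logistic_kernel(z):
    # Standard logistic density, written with |z| to avoid overflow (it is symmetric).
    e = math.exp(-abs(z))
    return e / (1.0 + e) ** 2


def b3_ratio(chi, z, eps=1e-5):
    """z * chi'(z) / chi(z), with chi' approximated by a central difference."""
    derivative = (chi(z + eps) - chi(z - eps)) / (2.0 * eps)
    return z * derivative / chi(z)


def b3_holds_on_grid(chi, l2, z_max, n):
    """Check z * chi'(z) / chi(z) < -1 at grid points with l2 <= |z| <= z_max."""
    for i in range(n + 1):
        z = l2 + (z_max - l2) * i / n
        if b3_ratio(chi, z) >= -1.0 or b3_ratio(chi, -z) >= -1.0:
            return False
    return True


# For the normal kernel the ratio equals -z^2, so any l2 > 1 works.
normal_ok = b3_holds_on_grid(normal_kernel, 1.1, 10.0, 200)
# For the logistic kernel the ratio equals -z * tanh(z / 2); l2 = 2 works, l2 = 1 does not.
logistic_ok = b3_holds_on_grid(logistic_kernel, 2.0, 10.0, 200)
logistic_too_small = b3_holds_on_grid(logistic_kernel, 1.0, 10.0, 200)
```

The same helper can be pointed at other one-dimensional kernels to get a quick indication of a workable l2 before attempting an analytic verification.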

Remark 3. Tokdar (2006) assumed that the weak support of $\Pi$ includes all compactly supported probabilities in $\mathbb{R}^d \times \mathbb{R}^+$. Then automatically the weak support of $\Pi$ is $M(\mathbb{R}^d \times \mathbb{R}^+)$. This is because any arbitrary probability measure can be weakly approximated by a sequence of compactly supported probability measures.


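To show that Condition A1 is met, we define

\[
f_m(x) =
\begin{cases}
t_m f_0(x), & \|x\| < m, \\
0, & \text{otherwise},
\end{cases}
\qquad m \ge 1,
\]

where $t_m^{-1} = \int_{\|x\| < m} f_0(x)\, dx$, $h_m = m^{-\eta}$, $F_m$ is the probability measure corresponding to $f_m$, and $P_m = F_m \times \delta(h_m)$, where $\delta(\cdot)$ is the degenerate distribution. Obviously, $P_m$ is compactly supported. Then, using the transformation $a = (x - \theta)/h_m$,

\[
f_{P_m}(x) = \int \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) dF_m(\theta)
= t_m \int_{\|\theta\| < m} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) f_0(\theta)\, d\theta
= t_m \int_{\|x - a h_m\| < m} \chi(a) f_0(x - a h_m)\, da.
\]

Since for any given $a$, $\chi(a) f_0(x - a h_m) \to \chi(a) f_0(x)$ as $h_m \to 0$, $t_m \to 1$, and $f_0$ is bounded, by the DCT we obtain $f_{P_m}(x) \to f_0(x)$.

Now, to satisfy Condition A1, we show that

\[
\int f_0(x) \log \frac{f_0(x)}{f_{P_m}(x)}\, dx \to 0 \quad \text{as } m \to \infty.
\]

To this end, observe that

\[
\begin{aligned}
f_{P_m}(x) &= t_m \int_{\|\theta\| < m} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) f_0(\theta)\, d\theta \\
&\le M t_m \int_{\|\theta\| < m} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) d\theta \\
&\le M t_m \le M t_1.
\end{aligned}
\]

Hence, as $\log \frac{f_0(x)}{M t_1} < 0$,

\[
\log \frac{f_0(x)}{f_{P_m}(x)} \ge \log \frac{f_0(x)}{M t_1}. \tag{2.7}
\]

Also

\[
\int f_0(x) \log \frac{f_0(x)}{f_{P_m}(x)}\, dx
= \int_{\|x\| \le m} f_0(x) \log \frac{f_0(x)}{f_{P_m}(x)}\, dx
+ \int_{\|x\| > m} f_0(x) \log \frac{f_0(x)}{f_{P_m}(x)}\, dx.
\]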


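Let $m > l_1$. Now, for $\|x\| > m$, using assumption B2,

\[
\begin{aligned}
f_{P_m}(x) &= t_m \int_{\|\theta\| < m} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) f_0(\theta)\, d\theta \\
&\ge t_m \int_{\|\theta\| < m} \frac{1}{h_m^d} \chi\left(\frac{x + m \frac{x}{\|x\|}}{h_m}\right) f_0(\theta)\, d\theta \\
&= \frac{1}{h_m^d} \chi\left(\frac{x + m \frac{x}{\|x\|}}{h_m}\right) t_m \int_{\|\theta\| < m} f_0(\theta)\, d\theta \\
&= \frac{1}{h_m^d} \chi\left(\frac{x + m \frac{x}{\|x\|}}{h_m}\right) \\
&= m^{\eta d} \chi\left(m^\eta x + \frac{x}{\|x\|} m^{1+\eta}\right) \\
&\ge m^{\eta} \chi\left(m^\eta x + \frac{x}{\|x\|} m^{1+\eta}\right) \\
&\ge \|x\|^\eta \chi(2\|x\|^\eta x).
\end{aligned}
\tag{2.8}
\]

The last inequality holds when $T \mapsto T^\eta \chi\big(T^\eta (x + T \frac{x}{\|x\|})\big)$ is decreasing for $T > T_0$. This follows because, with $z = T^\eta x + T^{\eta+1} x/\|x\|$, a positive multiple of $x$,

\[
\begin{aligned}
\frac{d}{dT} \left\{ \eta \log T + \log \chi\left(T^\eta x + T^{\eta+1} \frac{x}{\|x\|}\right) \right\}
&= \frac{\eta}{T} + \sum_{i=1}^d \frac{\chi'_i(z)}{\chi(z)} \left( \frac{\eta}{T} z_i + T^\eta \frac{z_i}{\|z\|} \right) \\
&= \frac{\eta}{T} \left\{ 1 + \sum_{i=1}^d \frac{\chi'_i(z)}{\chi(z)} z_i \left( 1 + \frac{T^{1+\eta}}{\eta \|z\|} \right) \right\} \\
&\le 0
\end{aligned}
\]

by Condition B3.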

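For $\|x\| \le m$, let $\delta > 0$ be fixed, and $\phi^*_m(x) = \inf_{\|t - x\| < \delta h_m} f_0(t)$. Then

\[
\begin{aligned}
f_{P_m}(x) &= t_m \int_{\|\theta\| \le m} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) f_0(\theta)\, d\theta \\
&\ge t_m \int_{\{\|\theta\| < m\} \cap \{\|\theta - x\| < \delta h_m\}} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) f_0(\theta)\, d\theta \\
&\ge t_m \phi^*_m(x) \int_{\{\|\theta\| < m\} \cap \{\|\theta - x\| < \delta h_m\}} \frac{1}{h_m^d} \chi\left(\frac{x - \theta}{h_m}\right) d\theta \\
&= t_m \phi^*_m(x) \int_{\{\|x - u h_m\| \le m\} \cap \{\|u\| \le \delta\}} \chi(u)\, du \\
&\ge t_m \phi^*_m(x) \int_{\prod_{i=1}^d [0,\, \mathrm{sign}(x_i)\delta/\sqrt{d}]} \chi(u)\, du,
\end{aligned}
\]

with the convention that $[a, b] = [b, a]$ if $b < a$. The last inequality holds because when $\|x\| \le m$,

\[
\Big\{ u : u \in \prod_{i=1}^d [0, \mathrm{sign}(x_i)\delta/\sqrt{d}] \Big\} \subset \Big\{ u : \|x/h_m - u\| \le m/h_m \text{ and } \|u\| \le \delta \Big\}.
\]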

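We have $t_m \ge 1$ and $\phi^*_m(x) \ge \phi_1(x)$. Let

\[
c = \min_{x \in \{\delta/\sqrt{d},\, -\delta/\sqrt{d}\}^d} \int_{\prod_{i=1}^d [0, x_i]} \chi(u)\, du.
\]

Then $f_{P_m}(x) \ge c\phi_1(x)$ for all $\|x\| < m$. For $0 < R < m$,

\[
f_{P_m}(x) \ge
\begin{cases}
c\phi_1(x), & \|x\| < R, \\
\min\left\{ \|x\|^\eta \chi\left(2\|x\|^{1+\eta} \frac{x}{\|x\|}\right),\ c\phi_1(x) \right\}, & \|x\| \ge R.
\end{cases}
\]

Hence

\[
\log \frac{f_0(x)}{f_{P_m}(x)} \le \xi(x) :=
\begin{cases}
\log \dfrac{f_0(x)}{c\phi_1(x)}, & \|x\| < R, \\[2ex]
\max\left\{ \log \dfrac{f_0(x)}{\|x\|^\eta \chi\left(2\|x\|^{1+\eta} \frac{x}{\|x\|}\right)},\ \log \dfrac{f_0(x)}{c\phi_1(x)} \right\}, & \|x\| \ge R.
\end{cases}
\tag{2.9}
\]

Combining (2.7) and (2.9), we obtain

\[
\left| \log \frac{f_0(x)}{f_{P_m}(x)} \right| \le \max\left\{ \xi(x),\ \left| \log \frac{f_0(x)}{M t_1} \right| \right\}.
\]

From Condition B5,

\[
\int \left| \log \frac{f_0(x)}{M t_1} \right| f_0(x)\, dx = \log M t_1 - \int f_0(x) \log f_0(x)\, dx < \infty.
\]

Now

\[
\int \xi(x) f_0(x)\, dx = \int_{\|x\| < R} f_0(x) \log \frac{f_0(x)}{c\phi_1(x)}\, dx
+ \int_{\|x\| \ge R} f_0(x) \max\left\{ \log \frac{f_0(x)}{\|x\|^\eta \chi(2\|x\|^\eta x)},\ \log \frac{f_0(x)}{c\phi_1(x)} \right\} dx.
\]

Hence,

\[
\int \xi(x) f_0(x)\, dx \le \int f_0(x) \log \frac{f_0(x)}{c\phi_1(x)}\, dx
+ \int_{\|x\| \ge R,\ f_0(x) > \|x\|^\eta \chi(2\|x\|^\eta x)} f_0(x) \log \frac{f_0(x)}{\|x\|^\eta \chi(2\|x\|^\eta x)}\, dx,
\]

since $\max(x_1, x_2) \le x_1 + x_2^+$ if $x_1 \ge 0$. The first term on the RHS of the above inequality is finite, by Condition B6. By Conditions B5 and B7, the second term is also finite. Thus $\int f_0(x) \log \frac{f_0(x)}{f_{P_m}(x)}\, dx \to 0$ as $m \to \infty$, i.e., Condition A1 is satisfied.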

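We show that Condition A3 is met by verifying the conditions of Lemma 3. First, from the proof above, we see that for any $\epsilon > 0$, there exists $m_\epsilon$ such that $\int f_0(x) \log \frac{f_0(x)}{f_{P_{m_\epsilon}}(x)}\, dx < \epsilon$. Let $P_\epsilon$ in Theorem 2 be chosen to be $P_{m_\epsilon}$, which is compactly supported. By Condition B8, $P_\epsilon \in$ supp($\Pi$). Second, Condition A7 is satisfied. To show that

\[
\log \frac{f_{P_\epsilon}(x)}{\inf_{(\theta, h) \in D} \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)}
\]

is $f_0$-integrable, it suffices to show that $\log f_{P_\epsilon}(x)$ and $\log \inf_{(\theta, h) \in D} \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)$ are both $f_0$-integrable. Without loss of generality, let $D = \{\|\theta\| \le a^*\} \times [\underline{h}, \overline{h}]$, where $a^* \ge m_\epsilon$ and $0 < \underline{h} \le m_\epsilon^{-\eta} \le \overline{h} < \frac{1}{2}$. For $\|x\| < a^*$, $\log \inf_{(\theta, h) \in D} \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)$ is bounded. For $\|x\| > a^*$,

\[
\log \inf_{(\theta, h) \in D} \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)
= \log \left\{ \frac{1}{\overline{h}^d} \chi\left(\frac{x + a^* \frac{x}{\|x\|}}{\underline{h}}\right) \right\}. \tag{2.10}
\]

By Condition B7 and expression (2.10), $\log \inf_{(\theta, h) \in D} \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right)$ is $f_0$-integrable. Consider $f_{P_\epsilon}(x) = \int_D \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right) dP_\epsilon$. Let $D = \{\|\theta\| < a^*\} \times [\underline{h}, \overline{h}]$, then

\[
\left| \log \int_D \frac{1}{h^d} \chi\left(\frac{x - \theta}{h}\right) dP_\epsilon \right|
\le \left| \log \left\{ \frac{1}{\overline{h}^d} \chi\left(\frac{x + a^* \frac{x}{\|x\|}}{\underline{h}}\right) P_\epsilon(D) \right\} \right|,
\]

for $\|x\| > a^*$. Hence, $\log f_{P_\epsilon}(x)$ is also $f_0$-integrable by a similar argument.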

Condition A8 is satisfied by Condition B1.

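We show that Condition A9 is also satisfied. Let $C\subset X$ be a given compact set. First we show that $\{h^{-d}\chi\left(\frac{x-\theta}{h}\right):x\in C\}$ is uniformly equicontinuous as a family of functions of $(\theta,h)$ on $E=[-a,a]^d\times[\frac12\underline{h},2\overline{h}]$, where $a>a^*$. Such an $E$ contains $D$ in its interior, and is compact. By the definition of uniform equicontinuity, we need to show that for any $\epsilon>0$, there exists $\delta>0$ such that for all $x\in C$ and all $(\theta,h),(\theta',h')\in E$ with $\|(\theta,h)-(\theta',h')\|<\delta$, we have $\left|h^{-d}\chi\left(\frac{x-\theta}{h}\right)-h'^{-d}\chi\left(\frac{x-\theta'}{h'}\right)\right|<\epsilon$. Observe that

\[
\left|\frac{1}{h^{d}}\chi\left(\frac{x-\theta}{h}\right)-\frac{1}{h'^{d}}\chi\left(\frac{x-\theta'}{h'}\right)\right|
=\frac{\left|\chi\left(\frac{x-\theta}{h}\right)-\frac{h^{d}}{h'^{d}}\chi\left(\frac{x-\theta'}{h'}\right)\right|}{h^{d}}
\le\frac{\left|\chi\left(\frac{x-\theta}{h}\right)-\chi\left(\frac{x-\theta'}{h'}\right)\right|}{h^{d}}
+\frac{|h'^{d}-h^{d}|}{h^{d}h'^{d}}\,\chi\left(\frac{x-\theta'}{h'}\right). \tag{2.11}
\]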

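Since $E$ and $C$ are compact and $h$ is bounded away from 0 within $E$, $\{\frac{x-\theta}{h}:x\in C,(\theta,h)\in E\}$ is also a compact set. Hence $c_1=\sup_{x\in C,(\theta',h')\in E}\chi\left(\frac{x-\theta'}{h'}\right)$ is finite, by the continuity of $\chi(\cdot)$. Let $\delta^*=\frac{\epsilon\,\underline{h}^{2d}}{2^{2d+1}c_1}$; then for $|h'-h|<\frac{\delta^*}{d(2\overline{h})^{d-1}}$, we have $|h'^{d}-h^{d}|<\delta^*$ and hence the last term in (2.11) is less than $\epsilon/2$. Since $\{\frac{x-\theta}{h}:x\in C,(\theta,h)\in E\}$ is compact, $\chi(\cdot)$ is uniformly continuous on it. For any given $\epsilon>0$, there exists $\delta^{**}>0$ such that whenever $x\in C$ and $(\theta,h),(\theta',h')\in E$ with $\left\|\frac{x-\theta}{h}-\frac{x-\theta'}{h'}\right\|<\delta^{**}$, we have $\left|\chi\left(\frac{x-\theta}{h}\right)-\chi\left(\frac{x-\theta'}{h'}\right)\right|<\epsilon\,\underline{h}^{d}/2^{d+1}$, which ensures that the first term on the RHS of (2.11) is less than $\epsilon/2$. Notice that $\left\|\frac{x-\theta}{h}-\frac{x-\theta'}{h'}\right\|<\delta^{**}$ is equivalent to

\[
\|(h-h')\theta+(\theta'-\theta)h+(h'-h)x\|<hh'\delta^{**}. \tag{2.12}
\]

When $\|\theta-\theta'\|<\frac{\underline{h}^{2d}\delta^{**}}{4\overline{h}}$ and $|h-h'|<\min\left\{\frac{\underline{h}^{2d}\delta^{**}}{4\sqrt{d}\,a},\ \frac{\underline{h}^{2d}\delta^{**}}{2\sup_{x\in C}\|x\|}\right\}$, relation (2.12) holds. Hence, for any $\epsilon>0$, if

\[
\delta=\min\left\{\frac{\delta^*}{d(2\overline{h})^{d-1}},\ \frac{\underline{h}^{2d}\delta^{**}}{24\overline{h}},\ \frac{\underline{h}^{2d}\delta^{**}}{12\sqrt{d}\,a},\ \frac{\underline{h}^{2d}\delta^{**}}{12\sup_{x\in C}\|x\|}\right\},
\]

then for all $x\in C$ and all $(\theta,h),(\theta',h')\in E$ with $\|(\theta,h)-(\theta',h')\|<\delta$, we have $\left|h^{-d}\chi\left(\frac{x-\theta}{h}\right)-h'^{-d}\chi\left(\frac{x-\theta'}{h'}\right)\right|<\epsilon$. Thus the uniform equicontinuity required in Condition A9 is satisfied.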

We can enlarge $E$ to ensure that $h^{-d}\chi\left(\frac{x-\theta}{h}\right)$ is less than any preassigned number for $x\in C$ and $(\theta,h)\in E^{c}$. This holds for large values of $h$, since $\chi(\cdot)$ is bounded. For small values of $h$, notice that $h^{-d}\chi\left(\frac{x-\theta}{h}\right)\le h^{-d}\,o\left(\frac{h^{d}}{\|x-\theta\|^{d}}\right)=o(\|x-\theta\|^{-d})$. This follows from Assumption B9 when $d\ge2$. For $d=1$, the condition automatically holds since $\int\chi(y)\,dy=1$ implies $\chi(y)=o(\|y\|^{-1})$ with the help of the monotonicity Condition B2. For given $C$, choosing $a$ and $\overline{h}$ large enough to construct the set $E$, we have $\sup\{h^{-d}\chi\left(\frac{x-\theta}{h}\right):x\in C,(\theta,h)\in E^{c}\}<c\epsilon/4$, for any given $\epsilon$. □


location-parameter for the density be mixed according to $P$ following a prior $\Pi$. Let the scale-parameter $h$ be a hyper-parameter, which is also given a prior distribution $\mu$. Assume that $h$ and $P$ are a priori independently distributed. We let $\Pi^*$ denote the prior for the density functions on $X$, induced by $\Pi\times\mu$ via the mapping $(P,h)\mapsto f_{P,h}=\int h^{-d}\chi\left(\frac{x-\theta}{h}\right)dP(\theta)$. We then have the following theorem.

Theorem 4. For the prior described above, let $\chi(x)$ and $f_0(x)$ be densities on $X$ satisfying Conditions B1–B9. Then $f_0\in KL(\Pi^*)$.

Proof. The proof uses Theorem 2 and Lemmas 2 and 3. Verification of Conditions A7–A9 is similar to (but easier than) that in Theorem 3. The second inequality in Condition B7 implies that Condition A5 is satisfied. Conditions A4 and A6 are satisfied since $\chi(\cdot)$ is a continuous probability density function and the kernel we consider here is a location family of $\chi(\cdot)$ with a fixed scale. Condition A1 is proved in the same way as in the proof of Theorem 3. □

2.4 Examples

In this section, we discuss the KL property for some kernel mixture priors with concretely specified kernels. More precisely, we prove that the property holds under some conditions on the true density when the kernel is chosen to be skew-normal (normal also, as it is a special case), multivariate normal, logistic, double exponential, t (Cauchy also as it is a special case), histogram, triangular, uniform, scaled uniform, exponential, log-normal, gamma, inverse gamma and Weibull densities.

2.4.1 Location-scale kernels


$\mu$ on $(0,\infty)$. The KL property may be verified by checking Conditions B1–B9 for the kernel and applying Theorem 3 or Theorem 4, respectively.

In this subsection, we consider several examples of location-scale kernels. Conditions B1 and B2 can be easily verified. Conditions B4–B6 are also the conditions assumed in all the following theorems for each of the location-scale density kernels. By choosing the prior on $P$ as described in Remark 1, Condition B8 can be satisfied. In this subsection, only the multivariate normal density has a mixing parameter $\theta$ with dimension $d\ge2$. For this kernel Condition B9 is obviously satisfied. Hence, in the rest of this subsection, for each kernel function and corresponding prior, we only show that Conditions B3 and B7 are satisfied.

1. Skew-normal density kernel

Consider the skew-normal kernel

\[
\chi_\lambda(x)=2\,\frac{1}{\sqrt{2\pi}}e^{-x^{2}/2}\int_{-\infty}^{\lambda x}\frac{1}{\sqrt{2\pi}}e^{-t^{2}/2}\,dt,
\]

where the skewness parameter $\lambda$ is given. We have the following result.

Theorem 5. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}$ satisfying Conditions B4, B5, B6, and suppose there exists $\eta>0$ such that $\int_{\mathbb{R}}|x|^{2(1+\eta)}f_0(x)\,dx<\infty$. Then $f_0\in KL(\Pi^*)$.

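Proof. For Condition B3, we have $\frac{\chi'(z)}{\chi(z)}=-z+\frac{\Phi'(\lambda z)}{\Phi(\lambda z)}$. Here $\frac{\Phi'(\lambda z)}{\Phi(\lambda z)}\to\infty$ when $z\to-\infty$ by L'Hospital's rule, since

\[
\frac{\left(e^{-(\lambda z)^{2}/2}\lambda\right)'}{\left(\int_{-\infty}^{\lambda z}e^{-(\lambda t)^{2}/2}\,dt\right)'}=-\lambda z;
\]

and $\frac{\Phi'(\lambda z)}{\Phi(\lambda z)}\to0$ when $z\to\infty$. Hence Condition B3 is satisfied.

Condition B7 is satisfied, since

\[
\int f_0(x)\log\chi(2|x|^{\eta}x)\,dx=\int f_0(x)\left[c_1(x)-\frac{(2|x|^{1+\eta})^{2}}{2}\right]dx<\infty
\]

and similarly

\[
\int f_0(x)\log\chi\left(\frac{x-a}{b}\right)dx=\int f_0(x)\left[c_2(x)-\frac{(x-a)^{2}}{2b^{2}}\right]dx<\infty
\]

for any $a$ and $b$, where $c_1(x)$ and $c_2(x)$ are bounded functions here. □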

Remark 4. With $\lambda=0$, Theorem 5 implies Theorem 3.2 of Tokdar (2006), since the normal density is a special case of the skew-normal.
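The skew-normal kernel and its reduction to the normal density at λ = 0 can be checked numerically. The sketch below is illustrative only: it assumes Φ denotes the standard normal distribution function, and the helper names (`normal_pdf`, `normal_cdf`, `skew_normal_pdf`, `trapezoid`) and the choice λ = 3 are hypothetical, not part of the text.

```python
import math

def normal_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    # Standard normal distribution function, expressed through erf.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def skew_normal_pdf(x, lam):
    # chi_lambda(x) = 2 * phi(x) * Phi(lambda * x), as in the kernel definition.
    return 2.0 * normal_pdf(x) * normal_cdf(lam * x)

def trapezoid(f, a, b, n=20000):
    # Composite trapezoid rule on [a, b] with n sub-intervals.
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# The kernel should integrate to one for any fixed skewness (lambda = 3 here).
total_mass = trapezoid(lambda x: skew_normal_pdf(x, 3.0), -12.0, 12.0)

# With lambda = 0 the kernel should coincide with the normal density.
gap_at_zero_skew = max(
    abs(skew_normal_pdf(k / 10.0, 0.0) - normal_pdf(k / 10.0))
    for k in range(-50, 51)
)
```

Under these assumptions `total_mass` should be numerically close to one and `gap_at_zero_skew` close to zero, which matches the special-case statement in Remark 4.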

2. Multivariate normal density kernel

Let $\chi(x)=(2\pi)^{-d/2}\prod_{i=1}^{d}e^{-x_i^{2}/2}$, where $x=(x_1,\ldots,x_d)$. We have the following result.

Theorem 6. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}^{d}$ satisfying Conditions B4, B5, B6 and such that $\int\|x\|^{2(1+\eta)}f_0(x)\,dx<\infty$ for some $\eta>0$. Then $f_0\in KL(\Pi^*)$.

Proof. The proof of this theorem is very similar to the proof of Theorem 5, with $\lambda=0$ and some other minor modifications in all the steps except in verifying Condition B7. Note that for some bounded functions $c_1(x)$ and $c_2(x)$, we have that

\[
\int f_0(x)\log\chi(2\|x\|^{\eta}x)\,dx=\int c_1(x)f_0(x)\,dx-\int 2f_0(x)\|x\|^{2(1+\eta)}\,dx<\infty,
\]

and similarly

\[
\int f_0(x)\log\chi\left(\frac{x-a}{b}\right)dx=\int f_0(x)\left[c_2(x)-\frac{\sum_{i=1}^{d}(x_i-a_i)^{2}}{2b^{2}}\right]dx<\infty
\]

for any $a$ and $b$. □

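3. Double-exponential density kernel

Let $\chi(x)=\frac12e^{-|x|}$. We have the following result.

Theorem 7. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}$ satisfying B4, B5, B6 and $\int_{\mathbb{R}}|x|^{1+\eta}f_0(x)\,dx<\infty$ for some $\eta>0$. Then $f_0\in KL(\Pi^*)$.

Proof. Condition B3 is satisfied, since $\frac{\chi'(z)}{\chi(z)}=-1$ when $z>0$, and $\frac{\chi'(z)}{\chi(z)}=1$ when $z\le0$. Condition B7 follows easily from the fact that $|\log\chi(x)|$ is a linear function of $|x|$. □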

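4. Logistic density kernel

Let the kernel be $\chi(x)=e^{-x}/(1+e^{-x})^{2}$. We have the following result.

Theorem 8. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}$ satisfying B4, B5, B6 and $\int_{\mathbb{R}}|x|^{1+\eta}f_0(x)\,dx<\infty$ for some $\eta>0$. Then $f_0\in KL(\Pi^*)$.

Proof. Condition B3 is satisfied, since $\frac{\chi'(z)}{\chi(z)}\to-1$ as $z\to\infty$ and $\frac{\chi'(z)}{\chi(z)}\to1$ as $z\to-\infty$. Condition B7 is easily verified since the tails of $\log\chi(x)$ behave like $|x|$. □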

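5. $t_\nu$-density kernel

Let the kernel be given by

\[
\chi_\nu(x)=\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\,\frac{1}{\left(1+\frac{(x-\theta)^{2}}{\phi^{2}\nu}\right)^{(\nu+1)/2}},
\]

where the degrees of freedom $\nu$ is given. Let $\log_+u=\max(\log u,0)$. We have the following result.

Theorem 9. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}$ satisfying B4, B5, B6 and $\int\log_+|x|\,f_0(x)\,dx<\infty$. Then $f_0\in KL(\Pi^*)$.

Proof. Condition B3 is satisfied, since $\frac{\chi'(z)}{\chi(z)}=-cz\left(1+\frac{z^{2}}{\nu}\right)^{-1}$, where $c$ is a positive constant.

Condition B7 can be verified by observing that the tail of $|\log\chi_\nu(x)|$ grows like $\log|x|$ as $|x|\to\infty$. □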


2.4.2 Kernels with bounded support

The priors with kernels supported on [0,1] are preferred for estimating densities supported on [0,1]. We study the KL property of such priors using Theorem 2.

The following lemma will be used repeatedly in the proofs below.

Lemma 4. For any density $f_0$ on $[0,1]$ and $\epsilon>0$, there exist $m>0$ and $f_1(x)\ge m>0$, such that $\Pi^*(K_\epsilon(f_1))>0$ implies that $\Pi^*(K_{2\epsilon+\sqrt{\epsilon}}(f_0))>0$.

Proof. If $f_0$ is not bounded away from zero, then define

\[
f_1(x)=\frac{\max(f_0(x),m)}{\int\max(f_0(u),m)\,du}.
\]

By Lemma 5.1 in Ghosal et al. (1999), we have $K(f_0;f)\le(c+1)\log c+\left[K(f_1;f)+\sqrt{K(f_1;f)}\right]$, where $c=\int\max(f_0(x),m)\,dx$. Hence, $c\to1$ as $m\to0$. For any given $\epsilon>0$, there exists $m>0$ such that $(c+1)\log c<\epsilon$. Therefore $\Pi^*(K_{2\epsilon+\sqrt{\epsilon}}(f_0))\ge\Pi^*(K_\epsilon(f_1))$. □

6. Histogram density kernel

Let the kernel function be

\[
K(x;\theta,m)=\begin{cases}
m, & \text{both } x \text{ and } \theta\in((i-1)/m,\ i/m] \text{ for some } 1\le i\le m<\infty,\\
0, & \text{otherwise}.
\end{cases}
\]

Consider a kernel mixture prior obtained by mixing both $\theta$ and $m$. We have the following result. An analogous result holds when only $\theta$ is mixed and $m$ is given a prior with infinite support.

Theorem 10. If $f_0(x)$ is a continuous density on $[0,1]$, and the weak support of $\Pi$ contains $M([0,1]\times\mathbb{N})$, then $f_0\in KL(\Pi^*)$.

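Proof. By Lemma 4, we only need to show that Conditions A1 and A3 are satisfied for a density $f_0$ that is bounded away from zero. For any $\epsilon>0$, there exist an integer $m>0$ and $\{w_1,w_2,\ldots,w_m\}$, such that $\sum_{i=1}^{m}w_i=1$ and

\[
\sup_{x\in[0,1]}\left|f_0(x)-\sum_{i=1}^{m}w_iK\left(x;\frac{i-\frac12}{m},m\right)\right|<\epsilon. \tag{2.13}
\]

To see this, define

\[
w_i=\frac{f_0\left(\frac{i-1}{m}\right)+f_0\left(\frac{i}{m}\right)}{\sum_{j=1}^{m}\left[f_0\left(\frac{j-1}{m}\right)+f_0\left(\frac{j}{m}\right)\right]}.
\]

By Riemann integrability of a continuous function, for any $\epsilon_1>0$, there exists $M_1>0$ such that for $m>M_1$,

\[
\left|\sum_{i=1}^{m}\frac{f_0\left(\frac{i-1}{m}\right)+f_0\left(\frac{i}{m}\right)}{2m}-1\right|<\epsilon_1.
\]

Since $f_0$ is continuous on a compact set, it is uniformly continuous. Hence, for any given $\epsilon_2>0$, there exists $M_2>0$ such that for $m>M_2$,

\[
\sup_{x}\left|f_0(x)-\sum_{i=1}^{m}\frac{f_0\left(\frac{i-1}{m}\right)+f_0\left(\frac{i}{m}\right)}{2m}K\left(x;\frac{i-\frac12}{m},m\right)\right|<\epsilon_2.
\]

Let $\Delta=\sum_{i=1}^{m}\frac{f_0\left(\frac{i-1}{m}\right)+f_0\left(\frac{i}{m}\right)}{2m}$; we have

\[
\left|f_0(x)-\sum_{i=1}^{m}w_iK\left(x;\frac{i-\frac12}{m},m\right)\right|\le\left|(\Delta-1)f_0(x)+\epsilon_2\right|\frac{1}{\Delta}\le2M\epsilon_1+2\epsilon_2,
\]

where $M$ is an upper bound for $f_0$ on $[0,1]$. Hence, by choosing $\epsilon_1$ and $\epsilon_2$ small enough, there exists $M_3=\max(M_1,M_2)$ such that for $m>M_3$, (2.13) holds. Since we consider $f_0$ bounded away from 0 here, Condition A1 will be satisfied by choosing $m$ large enough and appropriate weights $\{w_1,\ldots,w_m\}$.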
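The approximation step behind (2.13) can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical true density $f_0(x)=1+0.5\cos(2\pi x)$ on $[0,1]$ (continuous and bounded away from zero); the function names and the evaluation grid are illustrative choices, not part of the proof.

```python
import math

def f0(x):
    # Hypothetical true density on [0, 1]; continuous and bounded away from zero.
    return 1.0 + 0.5 * math.cos(2.0 * math.pi * x)

def trapezoid_weights(m):
    # w_i proportional to f0((i-1)/m) + f0(i/m), i = 1..m, normalised to sum to one.
    raw = [f0((i - 1) / m) + f0(i / m) for i in range(1, m + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def histogram_mixture(x, weights):
    # sum_i w_i K(x; (i - 1/2)/m, m): only the bin ((i-1)/m, i/m] containing x
    # contributes, and its contribution is w_i * m.
    m = len(weights)
    i = min(max(math.ceil(x * m), 1), m)
    return weights[i - 1] * m

def sup_error(m, grid=2000):
    # Maximum absolute gap between f0 and the histogram mixture over a finite grid.
    weights = trapezoid_weights(m)
    return max(
        abs(f0(k / grid) - histogram_mixture(k / grid, weights))
        for k in range(1, grid + 1)
    )

errors = {m: sup_error(m) for m in (10, 50, 200)}
```

For this choice of $f_0$ the grid-based sup error should shrink as $m$ grows, in line with the uniform-continuity argument; the grid maximum is only a proxy for the true supremum.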

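Let

\[
W=\left\{P:\ P\left(\left(\frac{i-\frac12-\delta_1}{m},\ \frac{i-\frac12+\delta_1}{m}\right)\times\{m\}\right)>w_ie^{-\epsilon},\ \text{for } i=1,\ldots,m\right\},
\]

where $0<\delta_1<1/4$ and $\epsilon>0$. Since $W$ is not empty and it is an open neighborhood of some distribution that belongs to the support of $\Pi$, we have $\Pi(W)>0$. For $P\in W$, we have, with the index $i$ corresponding to the given $x$, $\frac{f_{P_m}}{f_P}<e^{\epsilon}$, and hence $\int f_0\log\frac{f_{P_m}}{f_P}<\epsilon$ for all $P\in W$. □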

7. Triangular density kernel

Let the kernel function be

\[
K(x;m,n)=\begin{cases}
\begin{cases}2n-2n^{2}x, & x\in(0,\frac1n),\\ 0, & \text{otherwise},\end{cases} & m=0,\\[2ex]
\begin{cases}n^{2}(x-\frac{m}{n})+n, & x\in(\frac{m-1}{n},\frac{m}{n}),\\ -n^{2}(x-\frac{m}{n})+n, & x\in(\frac{m}{n},\frac{m+1}{n}),\\ 0, & \text{otherwise},\end{cases} & m=1,2,\ldots,n-1,\\[2ex]
\begin{cases}2n+2n^{2}(x-1), & x\in(1-\frac1n,1),\\ 0, & \text{otherwise},\end{cases} & m=n.
\end{cases}
\]

Construct a kernel mixture prior by mixing both $m$ and $n$. We have the following result.

Theorem 11. Let $f_0(x)$ be a continuous density on $[0,1]$, and suppose the weak support of $\Pi$ contains $M([0,1]\times\mathbb{N})$. Then $f_0\in KL(\Pi^*)$.

Proof. Since the mixing parameters are discrete, defining $w_i=\frac{f_0(i/n)}{\sum_{j=0}^{n}f_0(j/n)}$ and letting $W=\{P:P(i/n)>w_ie^{-\epsilon},\ \text{for } i=1,2,\ldots,n\}$, we can complete the proof as in Theorem 10. □

8. Bernstein polynomial kernel

In the literature, Bernstein polynomials have been used to estimate densities under both frequentist and Bayesian frameworks. The motivation for the prior comes from the fact that any bounded function on $[0,1]$ can be approximated by a Bernstein polynomial at each point of continuity of the function; see Lorentz (1953).

As in Petrone (1999a, 1999b), consider a prior $\Pi^*$ induced on $D(X)$ by the map

\[
(k,(w_0,\ldots,w_k))\mapsto\sum_{j=0}^{k}w_j\binom{k}{j}x^{j}(1-x)^{k-j}
\]

and priors $(w_0,\ldots,w_k)\,|\,k\sim\Pi_k$ and $k\sim\mu$, where $\mu$ is a discrete distribution supported on the set of all positive integers, and $\Pi_k$ is a distribution supported on the $(k+1)$-dimensional simplex $P_k=\{(w_0,\ldots,w_k):\ 0\le w_j\le1,\ j=0,\ldots,k,\ \sum_{j=0}^{k}w_j=1\}$. We can then rederive Theorem 2 of Petrone and Wasserman (2002) from Theorem 2.

Theorem 12. If $f_0(x)$ is a continuous density on $[0,1]$, $\mu(k)>0$ for infinitely many $k=1,2,\ldots$, and $\Pi_k$ is fully supported on $P_k$, then $f_0\in KL(\Pi^*)$.


and the assumed positivity condition of its prior. The rest of the proof proceeds as before by considering all possible weights $w'_j>w_je^{-\epsilon}$. □

2.4.3 Kernels supported on $[0,\infty)$

9. Lognormal density kernel

Let the kernel function be $K(x;\theta,\phi)=\frac{1}{\sqrt{2\pi}\,x\phi}e^{-(\log x-\theta)^{2}/(2\phi^{2})}$. Consider a type I or type II mixture prior based on this kernel.

Transform $x\mapsto e^{y}$ in the kernel function and in $f_0$. If the model using $e^{y}K(e^{y};\theta,\phi)$ as kernel function possesses the KL property at $e^{y}f_0(e^{y})$, then the corresponding model using $K(x;\theta,\phi)$ as kernel function possesses the KL property at $f_0(x)$. This is because

\[
\int_0^{\infty}f_0(x)\log\frac{f_0(x)}{\int K(x;\theta,\phi)\,dP(\theta)}\,dx
=\int_{-\infty}^{\infty}e^{y}f_0(e^{y})\log\frac{e^{y}f_0(e^{y})}{e^{y}\int K(e^{y};\theta,\phi)\,dP(\theta)}\,dy<\epsilon.
\]

For the lognormal kernel, we have the following result.

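Theorem 13. Assume that the prior $\Pi$ satisfies B8. Let $f_0(x)$ be a continuous density on $\mathbb{R}^{+}$ satisfying

1. $f_0$ is nowhere zero except at $x=0$ and bounded above by $M<\infty$;

2. $\left|\int_{\mathbb{R}^{+}}f_0(x)\log(xf_0(x))\,dx\right|<\infty$;

3. $\int_{\mathbb{R}^{+}}f_0(x)\log\frac{f_0(x)}{\phi_\delta(x)}\,dx<\infty$ for some $\delta>0$, where $\phi_\delta(x)=\inf_{|t-x|<\delta}f_0(t)$;

4. there exists $\eta>0$ such that $\left|\int_{\mathbb{R}^{+}}f_0(x)|\log x|^{2(1+\eta)}\,dx\right|<\infty$.

Then $f_0\in KL(\Pi^*)$.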

Proof. Considering the kernel function $\phi^{-1}\chi((y-\theta)/\phi)=\frac{1}{\sqrt{2\pi}\,\phi}e^{-(y-\theta)^{2}/(2\phi^{2})}$, we can apply Theorem 5 with $\lambda=0$ or Theorem 6 with $d=1$. It follows from a change of variable that $g_0(y):=e^{y}f_0(e^{y})$ satisfies B4, B5, B6 and

$\int|y|^{2(1+\eta)}g$

10. Weibull density kernel

The Weibull density is a widely used kernel function. Ghosh and Ghosal (2006) discussed a model, useful in survival analysis, that uses this density as the kernel function and showed posterior consistency. However, the assumption on the true density $f_0$ made there was quite strong. Here we establish the KL property with this kernel under very general assumptions.

The Weibull kernel is given by $K(x;\theta,\phi)=\theta\phi^{-1}x^{\theta-1}e^{-x^{\theta}/\phi}$. We can transform this kernel using the map $x=e^{y}$ to

\[
\theta\,W\left(\frac{y-\theta^{-1}\log\phi}{\theta^{-1}}\right)
=\theta\,\exp\left(\frac{y-\theta^{-1}\log\phi}{\theta^{-1}}\right)\exp\left(-\exp\left(\frac{y-\theta^{-1}\log\phi}{\theta^{-1}}\right)\right),
\]

where $W(z)=\exp[z-e^{z}]$, the location parameter is $\theta^{-1}\log\phi$ and the scale parameter is $\theta^{-1}$. We have the following result.

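Theorem 14. Let $f_0(x)$ be a continuous density on $\mathbb{R}^{+}$ satisfying

1. $f_0$ is nowhere zero except at $x=0$ and bounded above by $M<\infty$;

2. $\left|\int_{\mathbb{R}^{+}}f_0(x)\log(f_0(x))\,dx\right|<\infty$;

3. $\int_{\mathbb{R}^{+}}f_0(x)\log\frac{f_0(x)}{\phi_\delta(x)}\,dx<\infty$ for some $\delta>0$, where $\phi_\delta(x)=\inf_{|t-x|<\delta}f_0(t)$;

4. there exists $\eta>0$ such that $e^{2|\log x|^{1+\eta}}$ is $f_0$-integrable;

5. the weak support of $\Pi$ contains $M(\mathbb{R}^{+}\times\mathbb{R}^{+})$.

Then $f_0\in KL(\Pi^*)$.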
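The change of variables $x=e^{y}$ that turns the Weibull kernel into a location-scale family generated by $W(z)=\exp[z-e^{z}]$ can be checked numerically. The sketch below is illustrative: the parameter values $\theta=1.7$, $\phi=2.3$ and all function names are hypothetical choices, and the check only compares $e^{y}K(e^{y};\theta,\phi)$ with $\theta W((y-\theta^{-1}\log\phi)/\theta^{-1})$ on a finite grid.

```python
import math

def weibull_kernel(x, theta, phi):
    # K(x; theta, phi) = theta * phi^{-1} * x^{theta - 1} * exp(-x^theta / phi).
    return (theta / phi) * x ** (theta - 1.0) * math.exp(-(x ** theta) / phi)

def W(z):
    # Generator of the transformed location-scale family: W(z) = exp(z - e^z).
    return math.exp(z - math.exp(z))

def transformed_kernel(y, theta, phi):
    # theta * W((y - location) / scale) with location log(phi)/theta and scale 1/theta.
    location = math.log(phi) / theta
    scale = 1.0 / theta
    return theta * W((y - location) / scale)

def max_relative_gap(theta, phi):
    # Compare e^y * K(e^y; theta, phi) with the transformed kernel for y in [-3, 3].
    gaps = []
    for k in range(-30, 31):
        y = k / 10.0
        lhs = math.exp(y) * weibull_kernel(math.exp(y), theta, phi)
        rhs = transformed_kernel(y, theta, phi)
        gaps.append(abs(lhs - rhs) / max(rhs, 1e-300))
    return max(gaps)

gap = max_relative_gap(1.7, 2.3)
```

If the transformation is as stated, `gap` should be at the level of floating-point rounding, which is why the conditions of the location-scale theory can be checked for $W(\cdot)$ and the transformed density $e^{y}f_0(e^{y})$.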

Proof. We need to verify Conditions B3–B7 for the kernel $W(\cdot)$ and the true density $e^{y}f_0(e^{y})$. Condition B3 is satisfied, since we have $\frac{W'(z)}{W(z)}=1-e^{z}$. To verify Condition B7, observe that Condition 4 of this theorem implies

Z

R

eyf0(ey) loge2|y| 1+η

W(e2|y|1+η)dy

<∞

and

Z

R

eyf0(ey)|logW(e

y−a


11. Gamma density kernel

The gamma density is one of the most widely used kernel functions for density estimation on $[0,\infty)$. Hason (2006) discussed a model using the gamma density as kernel, whose hierarchical structure has as many stages as the most general one we discussed in Section 1. Chen (2002) and Bouezmarni and Scaillet (2003) discussed a mixture of gamma model with a different parametrization.

Let $K(x;\alpha,\beta)=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}$ be the kernel function. Set

\[
\phi_\delta(x)=\begin{cases}
\inf_{[x,x+\delta)}f_0(t), & 0<x<1,\\
\inf_{(x-\delta,x]}f_0(t), & x\ge1.
\end{cases} \tag{2.14}
\]

Theorem 15. Assume that the weak support of the prior $\Pi$ is $M(\mathbb{R}^{+}\times\mathbb{R}^{+})$. Let $f_0(x)$ be a continuous and bounded density on $[0,\infty)$ satisfying B4, B5 and

B6*. $\int f_0(x)\log\frac{f_0(x)}{\phi_\delta(x)}\,dx<\infty$ for some $\delta>0$;

B7*. there exists $\eta>0$ such that $\int\max(x^{-\eta-2},x^{\eta+2})f_0(x)\,dx<\infty$.

Then $f_0\in KL(\Pi^*)$.

Proof. We use $K_m(x;\alpha)$ to denote $K(x;\alpha,m^{-1})$. Let

\[
f_m(x)=t_m\int_{2}^{1+m^{2}}K_m(x;\alpha)\,m^{-1}f_0((\alpha-1)/m)\,d\alpha, \tag{2.15}
\]

where $t_m=\left(\int_{m^{-1}}^{m}f_0(s)\,ds\right)^{-1}$. Let $P_m$ denote $F_m^{*}\times\delta(m^{-1})$, where $F_m^{*}$ is the probability measure corresponding to $t_mm^{-1}f_0((\alpha-1)/m)\,\mathbf{1}(\alpha\in[2,1+m^{2}])$ as a density function for $\alpha$, and $\mathbf{1}(\cdot)$ is the indicator function. Obviously, $P_m$ is compactly supported and $f_m(x)=f_{P_m}(x)$. Let $F_m$ be the probability measure corresponding to $f_m$. By Lemma 5, which is stated and proved below, $\int f_0(x)\log\frac{f_0(x)}{f_m(x)}\,dx\to0$ as $m\to\infty$, which implies that Condition A1 is satisfied.

To complete the proof, we show that Condition A3 is satisfied by verifying the conditions of Lemma 3. For any given $\epsilon>0$, let $D=[2,1+m^{2}]\times\{m^{-1}\}$, where $m$ is such that $\int f_0(x)\log\frac{f_0(x)}{f_m(x)}\,dx<\epsilon$. It remains to show that $\int f_0(x)|\log f_m(x)|\,dx<\infty$ and $\int f_0(x)\left|\log\inf_{(\alpha,\beta)\in D}K(x;\alpha,\beta)\right|dx<\infty$. Based on expressions (2.17), (2.18) and (2.23) in the appendix, we have

\[
\log\inf_{(\alpha,\beta)\in D}K(x;\alpha,\beta)=\log\left(\min\{K(x;1+m^{2},m^{-1}),\ K(x;2,m^{-1})\}\right),
\]

for any $0<x<\infty$. Hence

\[
\left|\log\inf_{(\alpha,\beta)\in D}K(x;\alpha,\beta)\right|<xm+m^{2}|\log x|+\left|\log\left(\Gamma(m^{2}+1)\,m^{-(m^{2}+1)}\right)\right|+|\log(m^{-2})|.
\]

By Condition B7*, we have that $\int\left|\log\inf_{(\alpha,\beta)\in D}K(x;\alpha,\beta)\right|f_0(x)\,dx<\infty$. Further, $\log f_m(x)$ is also $f_0$-integrable by a similar argument. Condition A8 is obviously satisfied. Condition A9 is satisfied by letting $E$ be a large enough compact set containing $D$. This proves the theorem. □

Lemma 5. Let $f_m(x)$ be defined as in (2.15). If the conditions of Theorem 15 are satisfied, then $\int f_0(x)\log\frac{f_0(x)}{f_m(x)}\,dx\to0$ as $m\to\infty$.

Proof. First, we derive the lower bound of $f_m(x)$ for $x$ in different intervals. Observe that

\[
\frac{d}{d\alpha}\log(K_m(x;\alpha))=\log m+\log x-\Psi_0(\alpha), \tag{2.16}
\]

where $\Psi_0(z)=\frac{d}{dz}\log(\Gamma(z))$ is the digamma function. Also, $\Psi_0(z)$ is continuous and monotone increasing for $z\in(0,\infty)$, $\Psi_0(z+1)=\Psi_0(z)+\frac1z$, and $\Psi_0(z)-\log(z-1)\to0$; see Arfken (1985, pp. 549–555) for details.
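The digamma facts used here, and the sign of the derivative in (2.16) in the two tail regions considered next, can be checked with a short numerical sketch. It is an illustration under stated assumptions: `digamma` below is a hypothetical central-difference approximation built on `math.lgamma`, not a library routine, and the values $m=5$, $x=0.1$, $x=5.3$ are arbitrary test points.

```python
import math

def digamma(z, h=1e-5):
    # Central-difference approximation to d/dz log Gamma(z); illustrative stand-in.
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2.0 * h)

def dlog_kernel_dalpha(x, alpha, m):
    # Right-hand side of (2.16): log m + log x - digamma(alpha).
    return math.log(m) + math.log(x) - digamma(alpha)

# Value at 2 (about 0.42), the recurrence, and the large-z behaviour.
psi_at_2 = digamma(2.0)
recurrence_gap = abs(digamma(3.5) - digamma(2.5) - 1.0 / 2.5)
asymptotic_gap = abs(digamma(1000.0) - math.log(999.0))

# Sign of the derivative at the endpoints alpha = 2 and alpha = 1 + m^2 for m = 5.
m = 5
left_tail = [dlog_kernel_dalpha(0.1, a, m) for a in (2.0, 1.0 + m * m)]   # x < 1/m
right_tail = [dlog_kernel_dalpha(5.3, a, m) for a in (2.0, 1.0 + m * m)]  # x > m + 1/m
```

In this sketch the entries of `left_tail` should be negative and those of `right_tail` positive, which is the monotonicity in $\alpha$ that the lower bounds on $f_m$ rely on.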

For $x<m^{-1}$, $\log(mx)<0$, and $\Psi_0(\alpha)\ge\Psi_0(2)=0.42$ for $\alpha\in[2,1+m^{2}]$, and hence $\frac{d}{d\alpha}\log(K_m(x;\alpha))<0$. For $x>m+m^{-1}$ and $\alpha\in[2,1+m^{2}]$, $\log(mx)\ge\log(m^{2})\ge\Psi_0(1+m^{2})\ge\Psi_0(\alpha)$, and hence $\frac{d}{d\alpha}\log(K_m(x;\alpha))>0$. Thus, replacing $\alpha$ by $1+m^{2}$ in the integrand, we obtain a lower bound for $f_m(x)$, $x<m^{-1}$, as

\[
f_m(x)\ge t_m\int_{2}^{1+m^{2}}\frac{x^{m^{2}}e^{-xm}m^{m^{2}+1}}{\Gamma(m^{2}+1)}\,m^{-1}f_0((\alpha-1)/m)\,d\alpha=\frac{x^{m^{2}}e^{-xm}m^{m^{2}+1}}{\Gamma(m^{2}+1)}. \tag{2.17}
\]

Similarly, replacing $\alpha$ by 2 in the integrand, we obtain that for $x>m+m^{-1}$,

\[
f_m(x)\ge xe^{-xm}m^{2}. \tag{2.18}
\]

Consider the RHS of equation (2.17). For $x<m^{-1}$, we have

\[
\frac{d}{dm}\log\left(\frac{x^{m^{2}}e^{-xm}m^{m^{2}+1}}{\Gamma(m^{2}+1)}\right)=2m\left[\log(xm)-\Psi_0(m^{2}+1)\right]+\frac{m^{2}+1}{m}-x<0,
\]

for all $m$ sufficiently large, where $c_1>0$ is some constant. Consider the RHS of equation (2.18); for $x>m+m^{-1}$, we have $\frac{d}{dm}(xe^{-xm}m^{2})=xme^{-xm}(2-xm)<0$. Hence, replacing $m$ by $x^{-1}$ on the RHS of (2.17), we obtain a lower bound of $f_m(x)$ for $x<m^{-1}$ as below,

\[
f_m(x)\ge\frac{x^{m^{2}}e^{-xm}m^{m^{2}+1}}{\Gamma(m^{2}+1)}\ge\frac{x^{x^{-2}}e^{-1}x^{-x^{-2}-1}}{\Gamma(x^{-2}+1)}=\frac{1}{ex\,\Gamma(x^{-2}+1)}; \tag{2.19}
\]

and similarly, replacing $m$ by $x$ on the RHS of (2.18), we obtain that for $x>m+m^{-1}$,

\[
f_m(x)\ge xe^{-xm}m^{2}\ge e^{-x^{2}}x^{3}. \tag{2.20}
\]

Now, we consider $f_m(x)$ for $m^{-1}\le x\le m+m^{-1}$. Let $\delta>0$ be fixed and $v=(\alpha-1)/m$. For $m$ large,

\[
f_m(x)\ge\int_{x-\delta}^{x+\delta}K_m(x;mv+1)\,t_mf_0(v)\,dv
\ge\begin{cases}
\phi_\delta(x)\,t_m\int_{m^{-1}\vee x}^{x+\delta}K_m(x;mv+1)\,dv, & x<1,\\
\phi_\delta(x)\,t_m\int_{x-\delta}^{m\wedge x}K_m(x;mv+1)\,dv, & x\ge1,
\end{cases}
\ \ge C(x)\phi_\delta(x),
\]

where $C(x)$ is given in Lemma 8.

Now we have the lower bound of the function $f_m(x)$,

\[
f_m(x)\ge\begin{cases}
C(x)\phi_\delta(x), & R^{-1}\le x\le R,\\
\min\left(C(x)\phi_\delta(x),\ \frac{1}{ex\,\Gamma(x^{-2}+1)}\right), & 0<x<R^{-1},\\
\min\left(C(x)\phi_\delta(x),\ e^{-x^{2}}x^{3}\right), & R<x,
\end{cases}
\]

where $0<R<m$. Hence, we have that

\[
\log\frac{f_0(x)}{f_m(x)}\le\xi(x):=\begin{cases}
\log\frac{f_0(x)}{C(x)\phi_\delta(x)}, & R^{-1}\le x\le R,\\
\max\left\{\log\frac{f_0(x)}{C(x)\phi_\delta(x)},\ \log\left(ex\,\Gamma(x^{-2}+1)f_0(x)\right)\right\}, & 0<x<R^{-1},\\
\max\left\{\log\frac{f_0(x)}{C(x)\phi_\delta(x)},\ \log\frac{f_0(x)}{e^{-x^{2}}x^{3}}\right\}, & R<x.
\end{cases}
\]

Since $f_0(x)<M<\infty$, we also have that $\log\frac{f_0(x)}{f_m(x)}\ge\log\frac{f_0(x)}{Mt_2}$. Further, as $\log\frac{f_0(x)}{Mt_2}<0$, we have $\left|\log\frac{f_0(x)}{f_m(x)}\right|\le\max\left\{\xi(x),\left|\log\frac{f_0(x)}{Mt_2}\right|\right\}$.

By Condition B5, $\int\left|\log\frac{f_0(x)}{Mt_2}\right|f_0(x)\,dx=\log(Mt_2)-\int f_0\log(f_0)\,dx<\infty$. Now, consider $\int\xi(x)f_0(x)\,dx$, which equals

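\[
\begin{aligned}
&\int_{R^{-1}}^{R}f_0(x)\log\frac{f_0(x)}{C(x)\phi_\delta(x)}\,dx
+\int_{0}^{R^{-1}}f_0(x)\max\left\{\log\frac{f_0(x)}{C(x)\phi_\delta(x)},\ \log\left(ex\,\Gamma(x^{-2}+1)f_0(x)\right)\right\}dx\\
&\qquad+\int_{R}^{\infty}f_0(x)\max\left\{\log\frac{f_0(x)}{C(x)\phi_\delta(x)},\ \log\frac{f_0(x)}{e^{-x^{2}}x^{3}}\right\}dx\\
&\le\int_{0}^{\infty}f_0(x)\log\frac{f_0(x)}{\phi_\delta(x)}\,dx+\int_{0}^{\infty}f_0(x)\log\frac{1}{C(x)}\,dx\\
&\qquad+\int_{(0,R^{-1}]\cap A}f_0(x)\left[\log\left(ex\,\Gamma(x^{-2}+1)f_0(x)\right)\right]dx
+\int_{(R,\infty)\cap B}f_0(x)\left[\log\frac{f_0(x)}{e^{-x^{2}}x^{3}}\right]dx,
\end{aligned} \tag{2.22}
\]

where $A=\{x:f_0(x)\ge[ex\,\Gamma(x^{-2}+1)]^{-1}\}$ and $B=\{x:f_0(x)\ge e^{-x^{2}}x^{3}\}$.

By Stirling's inequality (see Feller (1957), vol. I, pp. 50–53),

\[
\log\frac{1}{ex\,\Gamma(x^{-2}+1)}\le|\log x|+1+\log(2\pi)+(x^{-2}+1)\log(x^{-2}+1)+\frac{(x^{-2}+1)^{2}+1}{12(x^{-2}+1)},
\]

for $0<x<1$. Hence, the third term on the RHS of (2.22) is finite by Condition B7*. Similarly, so is the fourth term. By Lemma 6, we have that $f_m\to f_0$ pointwise. Thus, by the dominated convergence theorem (DCT), $\int f_0(x)\log\frac{f_0(x)}{f_m(x)}\,dx\to0$ as $m\to\infty$. □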

Lemma 6. Let $f_m(x)$ be defined as in (2.15). Then $f_m(x)\to f_0(x)$ as $m\to\infty$ for each $x>0$.

To prove this lemma, we need the lemma below, which generalizes Theorem 2.1 of DeVore and Lorentz (1993) in two respects: the functions $K_m$ and $f$ are considered on a possibly non-compact $X$, and the intervals $A_m$ can vary with $m$.

Lemma 7. Let $A_m=[a_m,b_m]\subset X$, and let $K_m(x;t)$ be a sequence of continuous functions for $x\in X$ and $t\in A_m$. Define $f_m(x)=\int_{A_m}K_m(x,t)f(t)\,dt$, $m=1,2,\ldots$, where $f$ is bounded, uniformly continuous and integrable on $X$. If $K_m$ satisfies

C1. $\int_{A_m}K_m(x,t)\,dt\to1$ as $m\to\infty$,

C2. for each $\delta>0$, $\int_{|x-t|\ge\delta,\,t\in A_m}|K_m(x,t)|\,dt\to0$ as $m\to\infty$,

C3. $\int_{A_m}|K_m(x,t)|\,dt\le M(x)<\infty$ for each $x\in X$, $m=1,2,\ldots$, where the bound $M(x)$ may depend on $x$,

then $f_m(x)\to f(x)$ for each $x\in X$.

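Proof. Let $\epsilon>0$ be given and let $\delta>0$ be so small that $|f(t)-f(x)|<\epsilon$ for $|x-t|\le\delta$. Because of Condition C1,

\[
f_m(x)-f(x)=\int_{A_m}[f(t)-f(x)]K_m(x,t)\,dt+f(x)\left[\int_{A_m}K_m(x,t)\,dt-1\right],
\]

where the last term goes to 0 for $m\to\infty$, for each $x\in X$. We have

\[
\left|\int_{|x-t|\le\delta,\,t\in A_m}[f(t)-f(x)]K_m(x,t)\,dt\right|\le\epsilon\int_{|x-t|\le\delta,\,t\in A_m}|K_m(x,t)|\,dt\le\epsilon M(x).
\]

It follows from Condition C2 that for each $\delta>0$, and any bounded continuous function $f^{*}$ on $X$, $\int_{|x-t|\ge\delta}f^{*}(t)K_m(x,t)\,dt\to0$ as $m\to\infty$. Hence,

\[
\int_{|x-t|>\delta,\,t\in A_m}[f(t)-f(x)]K_m(x,t)\,dt\to0\quad\text{as } m\to\infty.
\]

By Condition C3 it now follows that $|f_m(x)-f(x)|\le\epsilon M(x)+o(1)$, and hence the result. □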

Proof of Lemma 6. Let $v=(\alpha-1)m^{-1}$ and $u=m^{-1}$. Let

\[
K(x;v,u)=\frac{x^{v/u}e^{-x/u}}{\Gamma(v/u+1)\,u^{v/u+1}},
\]

and $K_m(x;v)=K(x;v,m^{-1})$, where $v\in A_m$, $A_m=[m^{-1},m]$. Now $f_m(x)=\int_{m^{-1}}^{m}K_m(x,v)f_0(v)\,dv$; we show that this $K_m(x,v)$ satisfies Conditions C1–C3 in Lemma 7.

Given $x>0$, consider expression (2.16). For $m$ sufficiently large, such that $m^{-1}<x<m+m^{-1}$, we have

\[
\frac{d}{dv}K_m(x;v)\ \begin{cases}
>0, & m^{-1}\le v<x-m^{-1},\\
<0, & m\ge v>x-m^{-1}+\rho,
\end{cases} \tag{2.23}
\]

decreasing when $v>m_0$. For sufficiently large $m$,

\[
e^{-xm}\left[\sum_{t=0}^{[m^{2}]}\frac{(xm)^{t}}{t!}-1-\frac{(xm)^{[m_0]+1}}{([m_0]+1)!}\right]
\le e^{-xm}\left[\int_{1}^{m^{2}}\frac{(xm)^{vm}}{\Gamma(vm+1)}\,d(vm)\right]
\le e^{-xm}\left[\sum_{t=0}^{[m^{2}]}\frac{(xm)^{t}}{t!}-1+\frac{(xm)^{m_0}}{([m_0]-1)!}\right], \tag{2.24}
\]

where $[z]$ stands for the largest integer less than or equal to $z$. Using the expression for the remainder of Taylor's series, the LHS of (2.24) is at least

\[
1-\frac{\frac{(xm)^{[m^{2}]+1}}{([m^{2}]+1)!}e^{x^{*}m}}{e^{xm}}-\frac{1}{e^{xm}}-\frac{\frac{(xm)^{[m_0]+1}}{([m_0]+1)!}}{e^{xm}}, \tag{2.25}
\]

where $x^{*}\in(0,x)$. It is obvious that the expression in (2.25) tends to 1 as $m\to\infty$. Similarly, the RHS of (2.24) tends to 1 as $m\to\infty$. Hence,

\[
\int_{m^{-1}}^{m}K(x;v,u)\,dv=e^{-xm}\int_{1}^{m^{2}}\frac{(xm)^{vm}}{\Gamma(vm+1)}\,d(vm)\to1\quad\text{as } m\to\infty,
\]

that is, Condition C1 is satisfied.

From the above, we also know that Condition C3 is satisfied, since $K_m(x;v)>0$ for all $v\in A_m$ and $x\in X$.

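To verify Condition C2, for any $\delta>0$ and $x\in X$, we want

\[
\int_{|x-v|>\delta,\,v\in A_m}K_m(x,v)\,dv=\int_{|x-v|>\delta,\,v\in A_m}\frac{e^{-xm}(xm)^{vm}}{\Gamma(vm+1)}\,dv\to0,\quad\text{as } m\to\infty.
\]

We show that for any $\delta>0$,

\[
m\sup_{|x-v|>\delta,\,v\in A_m}\frac{e^{-xm}(xm)^{vm}}{\Gamma(vm+1)}\to0\quad\text{as } m\to\infty,
\]

which is equivalent to showing that

\[
\log m+\log\frac{e^{-xm}(xm)^{vm}}{\Gamma(vm+1)}\to-\infty\quad\text{as } m\to\infty.
\]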

For any $v$ such that $v\in A_m$, $|x-v|>\delta$, we have by Stirling's inequality for factorials,

\[
\begin{aligned}
\log m+\log\frac{e^{-xm}(xm)^{vm}}{\Gamma(vm+1)}
&\le\log m+\log\frac{e^{-xm}(xm)^{vm}}{[vm]!}\\
&\le\log m+vm\log(xm)-xm-vm\log vm+vm\\
&=\log m+\{1+\log(x/v)-x/v\}vm\to-\infty,
\end{aligned}
\]

as $m\to\infty$, since for any given $x$ and $\delta$, there exists $q<0$ such that $1+\log(x/v)-x/v<q$ for all $v\in A_m$ with $|x-v|>\delta$.

Thus Conditions C1–C3 in Lemma 7 are all satisfied and we have that $f_m(x)\to f_0(x)$ as $m\to\infty$ for each $x>0$. □
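The pointwise convergence $f_m(x)\to f_0(x)$ can be illustrated numerically. The sketch below is a rough check under stated assumptions: the true density is taken to be the hypothetical $f_0(v)=ve^{-v}$, the integral over $[m^{-1},m]$ is approximated by a trapezoid rule with an arbitrary number of nodes, and the normalising constant $t_m$ is computed on the same grid.

```python
import math

def f0(v):
    # Hypothetical true density on (0, inf): v * exp(-v).
    return v * math.exp(-v)

def log_kernel(x, v, m):
    # log K_m(x; m*v + 1): gamma kernel with shape m*v + 1 and scale 1/m.
    a = m * v
    return (a + 1.0) * math.log(m) + a * math.log(x) - m * x - math.lgamma(a + 1.0)

def f_m(x, m, n=40000):
    # Trapezoid approximation of t_m * int_{1/m}^{m} K_m(x; m*v + 1) f0(v) dv,
    # where t_m is the reciprocal of int_{1/m}^{m} f0(s) ds.
    lo, hi = 1.0 / m, float(m)
    h = (hi - lo) / n
    numerator = 0.0
    mass = 0.0
    for i in range(n + 1):
        v = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        numerator += weight * math.exp(log_kernel(x, v, m)) * f0(v)
        mass += weight * f0(v)
    return (numerator * h) / (mass * h)

# Absolute error at x = 1 for a small and a larger value of m.
errors = {m: abs(f_m(1.0, m) - f0(1.0)) for m in (10, 200)}
```

For this choice of $f_0$ the error at $x=1$ should be noticeably smaller for $m=200$ than for $m=10$, consistent with Lemma 6; the sketch says nothing about uniformity in $x$.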

Lemma 8. Let $K_m(x;\alpha)$ be defined as in Section 12. If Condition B7* is satisfied, then there exists a function $0<C(x)<1$ such that

\[
C(x)\le\begin{cases}
\int_{m^{-1}\vee x}^{x+\delta}K_m(x;mv+1)\,dv, & m^{-1}<x<1,\\
\int_{x-\delta}^{m\wedge x}K_m(x;mv+1)\,dv, & 1\le x\le m+m^{-1},
\end{cases} \tag{2.26}
\]

and $\int\log\frac{1}{C(x)}f_0(x)\,dx<\infty$.

Proof. For $m^{-1}<x<1$, applying Stirling's inequality and noting that $v<x+\delta<1+\delta$ in the following integral, it follows that

\[
\int_{m^{-1}\vee x}^{x+\delta}K_m(x;mv+1)\,dv=\int_{m^{-1}\vee x}^{x+\delta}\frac{m^{mv+1}x^{mv}e^{-mx}}{\Gamma(mv+1)}\,dv
\]

≥

Z x+δ

m−1_∨_x

mmv+1xmve−mx

√

2π(mv+ 1)mv+1/2_exp_{−_(mv<
