Semi-parametric Bayesian Partially Identified Models based on Support Function


Munich Personal RePEc Archive


Liao, Yuan and Simoni, Anna

University of Maryland, CNRS (Paris, France), THEMA Université de Cergy-Pontoise

December 2012

Online at https://mpra.ub.uni-muenchen.de/43262/


Semi-parametric Bayesian Partially Identified Models based on Support Function∗

Yuan Liao†
University of Maryland

Anna Simoni‡
CNRS and THEMA

This Version: December 2012

Abstract

Bayesian partially identified models have received growing attention in recent years in the econometric literature, due to their broad applications in empirical studies. The classical Bayesian approach in this literature assumes a parametric model by specifying an ad-hoc parametric likelihood function. However, econometric models usually only identify a set of moment inequalities, and therefore assuming a known likelihood function carries the risk of misspecification and may result in inconsistent estimation of the identified set. On the other hand, moment-condition-based likelihoods such as the limited information and exponential tilted empirical likelihood, though they guarantee consistency, lack a probabilistic interpretation. We propose a semi-parametric Bayesian partially identified model, by placing a nonparametric prior on the unknown likelihood function. Our approach thus only requires a set of moment conditions but still possesses a pure Bayesian interpretation. We study the posterior of the support function, which is essential when the object of interest is the identified set. The support function also enables us to construct two-sided Bayesian credible sets (BCS) for the identified set. It is found that, while the BCS of the partially identified parameter is too narrow from the frequentist point of view, that of the identified set has asymptotically correct coverage probability in the frequentist sense. Moreover, we establish posterior consistency for both the structural parameter and its identified set. We also develop the posterior concentration theory for the support function, and prove the semi-parametric Bernstein-von Mises theorem. Finally, the proposed method is applied to analyze a financial asset pricing problem.

∗ The authors are grateful to the seminar and conference participants at CREST (LS and LMI), University of Luxembourg, University of Mannheim, THEMA and the 4th French Econometrics Conference for useful comments. Anna Simoni gratefully acknowledges financial support from the University of Mannheim through the DFG-SNF Research Group FOR916 and hospitality from CREST. Yuan Liao gratefully acknowledges financial support from the University of Maryland at College Park.

† Department of Mathematics, University of Maryland at College Park, College Park, MD 20742 (USA). Email: [email protected]

‡ CNRS and THEMA, Université de Cergy-Pontoise - 33, boulevard du Port, 95011 Cergy-Pontoise


Key words: partial identification, posterior consistency, concentration rate, support function, two-sided Bayesian credible sets, identified set, coverage probability, moment inequality models


1 Introduction

1.1 Bayesian inference for partially identified models

Partially identified models have received extensive attention in recent years, due to their broad applications in econometrics. Partial identification of a structural parameter arises when the available data and the constraints coming from economic theory only allow one to place the parameter inside a proper subset of the parameter space. Due to the limitations of the data generating process, the data cannot provide any information within the set where the structural parameter is partially identified (called the identified set).

This paper aims at developing Bayesian inference for partially identified models. A Bayesian approach may be appealing for several reasons. First, while frequentist approaches cannot say anything about the parameter inside the identified set, the Bayesian approach can. When informative (subjective) priors are available for the structural parameter, the posterior density may not be flat even inside the identified set, providing information about the parameter beyond what the data can tell. When no a priori information is available, using a uniform prior helps us estimate the true identified set. The Bayesian analysis for partially identified models produces a posterior distribution whose support asymptotically concentrates around the true identified set. Therefore, the asymptotic behavior of the posterior distribution is different from that of the traditional point identified case, the latter being usually (asymptotically) normally distributed due to the Bernstein-von Mises theorem. Hence, when the structural parameter is identifiable, the information from the prior is washed away by the data. A second appealing feature of Bayesian methods arises in situations where we are interested only in a projection of the identified region, that is, a subset of the structural parameter. Projecting a high dimensional identified region to a low dimensional space is easier with a Bayesian approach than with frequentist approaches because it simply requires the marginalization of a joint distribution. Third, when the model incorporates a multidimensional parameter with some components that are identified and some others that are not, we can also learn something from the data about the non-identified parameters through the information brought by the identified parameters. Moreover, when (asymptotic) equivalence between Bayesian credible sets (BCS) and frequentist confidence sets (FCS) is established, we can take advantage of the fact that BCS are often easier to construct than FCS thanks to the use of Markov Chain Monte Carlo (MCMC) methods. Finally, frequentist inference sometimes relies heavily on point identification, and in some cases achieving point identification requires stringent assumptions that are hard to verify. In contrast, a Bayesian procedure makes inference based on the posterior, whose construction does not require point identification. We illustrate this point further in the following two examples.


deduced posterior of θ nevertheless enables statistical inference. In particular, if point identification is indeed guaranteed, it can be inferred from the shape of the posterior distribution.

Example 1.2 (Quantile regression with endogenous censoring). In the model $y = X^T\theta + \epsilon$ and $\mathrm{Med}(\epsilon|X) = 0.5$, only $(I(y < c), \min(y, c), X)$ is observed for some censoring variable c. In particular, the censoring may arbitrarily depend on y and thus is endogenous. Though a sufficient condition for the point identification of θ has been given in the literature (e.g., Khan and Tamer 2009), it is stringent and also hard to verify. In contrast, a Bayesian procedure nevertheless imposes a prior on θ and makes inference via the posterior distribution. On the other hand, the posterior can help to check if point identification is indeed guaranteed.

There are in general two Bayesian approaches for partially identified models currently developed in the literature. The first one is based on a parametric likelihood function, which is assumed to be known by econometricians up to a finite dimensional parameter. This approach has been used frequently in the literature, see e.g., Moon and Schorfheide (2012), Poirier (1998), Gustafson (2012), Bollinger and Hasselt (2009), Norets and Tang (2012) among many others. However, econometric models usually only identify a set of moment inequalities instead of the full likelihood function. Examples are: interval-censored data, interval instrumental regression, asset pricing (Chernozhukov et al. 2008), incomplete structural models (Menzel 2011) etc. Assuming a parametric form of the likelihood function is therefore ad-hoc. Once the likelihood is mis-specified, the posterior can be misleading. The second approach starts from a set of moment inequalities, and uses a moment-condition-based likelihood such as the limited information likelihood (Kim 2002) and the exponential tilted empirical likelihood (Schennach 2005). Further references may be found in Liao and Jiang (2010), Kitagawa (2012) and Wan (2011). This approach avoids assuming the knowledge of the true likelihood function. However, it does not have a probabilistic interpretation. The use of moment-condition-based likelihoods makes this approach quasi-Bayesian, which uses the Bayesian machinery for inference, see Chernozhukov and Hong (2003).

We propose a pure Bayesian procedure without assuming a parametric form of the true likelihood function. Instead, we place nonparametric priors on the likelihood and obtain the marginal posterior distribution for the partially identified parameter as well as the posterior for the identified set. A similar Bayesian procedure was recently used in Florens and Simoni (2011). As a result, our procedure is semi-parametric Bayesian that involves both finite and infinite dimensional parameters. Our approach thus only requires a set of moment conditions but still possesses a probabilistic interpretation.


incorporate the partial identification structure by assuming it is supported only on the identified set, the latter being parametrized by ϕ. We further illustrate this difference in a simple example of interval censoring.

Example 1.3 (Interval censored data). Suppose $Y \in \mathbb{R}$ is censored between $Y_1$ and $Y_2$. We are interested in the structural parameter θ = EY, but only $Y_1$ and $Y_2$ are observable. Let $\phi = (\phi_1, \phi_2)' = (E(Y_1), E(Y_2))'$; then θ is partially identified on $\Theta(\phi) = [\phi_1, \phi_2]$. The moment-condition-based approach starts from a moment inequality model $E(Y_1 - \theta) \le 0$ and $E(\theta - Y_2) \le 0$, and places a prior π(θ) that is supported on the entire parameter space. In contrast, the likelihood-based approach places a prior π(θ|ϕ) that is only supported on $[\phi_1, \phi_2]$. Therefore, the latter's prior specification takes into account the fact that Y is censored in $[Y_1, Y_2]$, while the former's does not.

In this paper we specify a conditional prior π(θ|ϕ) for θ given ϕ which incorporates the partial identification structure as in the likelihood-based approach. Examples of such priors include the uniform prior, truncated normal prior, and many priors that have bounded support. Then, the unknown likelihood function is defined for ϕ only, where ϕ is a point-identified nuisance parameter.

We provide a frequentist validation of our procedure. This means that we admit the existence of a true value of the structural parameter and the identified set, and prove that the posterior distribution concentrates asymptotically in a neighborhood of this true value. This property is known as posterior consistency and is important because it guarantees that, with a sufficiently large amount of data, we can almost surely recover the truth accurately. Lack of consistency is particularly undesirable and a Bayesian procedure should not be used if the corresponding posterior is inconsistent.

1.2 Highlights of our contributions

We highlight three distinguishing features of our approach, which also illustrate our main contributions.

Semi-parametric Bayesian partial identification

We endow the point identified nuisance parameter ϕ with a prior π(ϕ). The true likelihood function $l_n(\phi)$ is defined on the support of ϕ. Without assuming any parametric form for $l_n(\cdot)$, we place a nonparametric prior $\pi(l_n)$ on the space of probability density (or distribution) functions $l_n$. The prior specification is completed by a conditional prior π(θ|ϕ) which takes into account the partial identification structure. Therefore, the model contains finite dimensional parameters (θ, ϕ) and an infinite dimensional parameter $l_n$, where $(\phi, l_n)$ are point-identified nuisance parameters. The marginal posterior of θ is then given by

$$p(\theta|\text{Data}) \propto \int \pi(\theta|\phi)\,\pi(\phi)\,l_n(\phi)\,\pi(l_n)\,d\phi\,dl_n.$$


In partially identified models, inference may be carried out both for the structural parameter θ and for the identified set. The prior specification π(θ|ϕ) on θ plays a role only for inference on θ.

We propose two types of priors on the point-identified parameter $(\phi, l_n)$. The first consists of a nonparametric prior on the distribution function of the data generating process, with the Dirichlet process prior as an important example. Using this prior, the prior π(ϕ) of the parameter ϕ can be recovered by viewing ϕ as a function of the distribution function. This prior is appealing when we have no prior information for ϕ. On the contrary, if there is informative prior information for ϕ, it is more convenient to place an alternative semi-parametric prior specified as the product of a prior on ϕ and a prior on the underlying likelihood function $l_n$. This type of prior on $l_n$ is usually specified on the space of probability density functions, and includes the finite mixture of normals and Dirichlet mixture of normals as examples.

For these prior schemes, we show that asymptotically p(θ|Data) will be supported within an arbitrarily small neighborhood of the true identified set, which is the notion of posterior consistency under partial identification. Moreover, we construct the posterior for the identified set, and show that asymptotically it concentrates within a $\sqrt{\log n / n}$ Hausdorff-neighborhood around the true identified set.

Support function

Our setup is similar to that of Moon and Schorfheide (2012) in that the identified set is completely determined by the identified nuisance parameter ϕ, and hence can be written as Θ(ϕ). Once the posterior of ϕ is determined, so is that of Θ(ϕ). For a definition of the prior and posterior of Θ(ϕ) we refer to Florens and Simoni (2011) who define them in terms of capacity functionals. To make inference on Θ(ϕ) we can take advantage of the fact that when Θ(ϕ) is closed and convex it is completely characterized by its support function $S_\phi(\cdot)$ defined as:

$$S_\phi(p) = \sup_{\theta \in \Theta(\phi)} \theta^T p$$

where $p \in \mathbb{S}^{\dim(\theta)}$, the unit sphere. Therefore, inference on Θ(ϕ) may be conveniently carried out through inference on its support function. The posterior distribution of $S_\phi(\cdot)$ is also determined by that of ϕ. We show that in a general moment inequality model, the support function has an asymptotic linear representation in a neighborhood of the true value for ϕ. The posterior of $S_\phi(\cdot)$ is shown to asymptotically concentrate within a $\sqrt{\log n / n}$


Two-sided Bayesian credible sets for the identified set

We construct two types of Bayesian credible sets (BCS), one for the partially identified parameter θ and the other for the identified set Θ(ϕ). In particular, the BCS for the identified set is constructed based on the support function and is two-sided, that is, we find sets $\Theta(\hat\phi_M)^{-q_\tau/\sqrt{n}}$ and $\Theta(\hat\phi_M)^{q_\tau/\sqrt{n}}$, where $\hat\phi_M$ is any consistent estimator of ϕ (e.g. the posterior mode of ϕ, see Section 6 for definitions), such that with probability one,

$$P\left(\Theta(\hat\phi_M)^{-q_\tau/\sqrt{n}} \subset \Theta(\phi) \subset \Theta(\hat\phi_M)^{q_\tau/\sqrt{n}} \,\middle|\, \text{Data}\right) = 1 - \tau$$

for credible level 1 − τ. It is found that the two-sided BCS for the identified set have asymptotically correct coverage probability, in the sense that

$$P_{D_n}\left(\Theta(\hat\phi_M)^{-q_\tau/\sqrt{n}} \subset \Theta(\phi_0) \subset \Theta(\hat\phi_M)^{q_\tau/\sqrt{n}}\right) \ge 1 - \tau + o_p(1)$$

where $P_{D_n}$ denotes the sampling probability. Therefore, $\Theta(\hat\phi_M)^{-q_\tau/\sqrt{n}}$ and $\Theta(\hat\phi_M)^{q_\tau/\sqrt{n}}$ can also be used as frequentist confidence sets for the identified set. The notation $\Theta(\hat\phi_M)^{-q_\tau/\sqrt{n}}$, $\Theta(\hat\phi_M)^{q_\tau/\sqrt{n}}$ and $q_\tau$ will be formally defined in Section 6. On the other hand, we find that the conclusion of Moon and Schorfheide (2012) about the BCS for the partially identified parameter θ still holds in the semi-parametric Bayesian model. Indeed, the BCS for the partially identified parameter tends to be smaller than frequentist confidence sets in large samples.

Note that we consider a fixed data generating process (DGP). The constructed BCS has asymptotically correct coverage probability for any specific DGP, and the uniformity issue as in Andrews and Soares (2010) is not considered. In addition, all the results on the identified set, support function and posterior consistency for θ are valid even when point identification is actually achieved, that is, when Θ(ϕ) is a singleton.
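To make the two-sided construction concrete, here is a minimal Python sketch for the scalar case Θ(ϕ) = [ϕ1, ϕ2]. It rests on assumptions that are not stated in this section: the enlarged set is taken to be the ε-enlargement of the interval, the shrunken set its ε-contraction, and q is taken as the posterior (1 − τ)-quantile of √n times the largest endpoint deviation. The function name `two_sided_bcs` and the stand-in posterior draws are hypothetical; the formal definitions are those of Section 6.

```python
import numpy as np

def two_sided_bcs(phi_draws, phi_hat, n, tau=0.05):
    """Return (q, inner, outer) for the interval case Theta(phi) = [phi1, phi2]."""
    # sqrt(n) * max endpoint deviation, i.e. sqrt(n) * sup over p in {-1, +1}
    # of |S_phi(p) - S_phi_hat(p)| for an interval.
    dev = np.sqrt(n) * np.max(np.abs(phi_draws - phi_hat), axis=1)
    q = float(np.quantile(dev, 1.0 - tau))
    eps = q / np.sqrt(n)
    inner = (phi_hat[0] + eps, phi_hat[1] - eps)   # assumed contraction of the interval
    outer = (phi_hat[0] - eps, phi_hat[1] + eps)   # assumed enlargement of the interval
    return q, inner, outer

# Stand-in posterior draws of phi (hypothetical; in practice from p(phi | D_n)).
rng = np.random.default_rng(4)
n = 400
phi_hat = np.array([0.0, 1.0])
phi_draws = phi_hat + rng.normal(0.0, 0.05, size=(4000, 2))

q, inner, outer = two_sided_bcs(phi_draws, phi_hat, n, tau=0.05)
```

Under these assumptions, the share of posterior draws whose interval lies between `inner` and `outer` is approximately 1 − τ by construction of the quantile.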

1.3 Literature review

There is a growing literature on Bayesian partially identified models. Besides those mentioned above, the list also includes Gelfand and Sahu (1999), Neath and Samaniego (1997), Epstein and Seo (2011), Stoye (2012), Kline (2011), etc. There is also an extensive literature from a frequentist point of view. A partial list includes Andrews and Guggenberger (2009), Andrews and Soares (2010), Beresteanu, Molchanov and Molinari (2010), Bugni (2010), Canay (2010), Chernozhukov, Hong and Tamer (2007), Chiburis (2009), Imbens and Manski (2004), Romano and Shaikh (2010), Rosen (2008), Stoye (2010), among others. See Tamer (2010) for a review.

When the identified set is closed and convex, the support function becomes one of the useful tools to characterize its properties. Therefore the support function has been recently introduced to study partially identified models, and the literature on this perspective has been growing rapidly, see for example, Bontemps, Magnac and Maurin (2012), Beresteanu and Molinari (2008), Beresteanu et al. (2012), Kaido and Santos (2012), Kaido (2012) and Chandrasekhar et al. (2012).

1.4 Organization


consistency for the (marginal) posterior distribution of the structural parameter. Section 4 derives the posterior consistency for the identified set and provides the concentration rate. Section 5 studies the posterior of the support function in moment inequality models. In particular, the Bernstein-von Mises theorem for the support function is proved. Section 6 constructs the Bayesian credible sets for both the structural parameter and its identified set and looks in detail at the missing data example. Section 7 discusses the case when point identification is actually achieved. In this case, all the derived results on the identified set and the support function are still valid. Section 8 applies the support function approach to a financial asset pricing study. Finally, Section 9 concludes with further discussions. All the proofs are given in the appendix.

2 General Setup of Bayesian Partially Identified Model

2.1 The Model

Econometric models often involve a finite dimensional structural parameter θ. In many cases such a structural parameter is only partially identified by the data generating process on a non-singleton set, which we call the identified set. The goal of an econometrician is to make inference on the partially identified parameter as well as the identified set based on the data. Along with θ, the model also includes a finite dimensional nuisance parameter ϕ ∈ Φ that is point identified by the data generating process. Here Φ denotes the parameter space for ϕ. The point identified parameter often arises naturally as it characterizes the data distribution. In most partially identified models, the identified set is also characterized by ϕ, hence we denote it by Θ(ϕ) to indicate that once ϕ is determined, so is the identified set. Let Θ denote the parameter space for θ; we assume Θ(ϕ) ⊆ Θ.

We put a prior on (θ, ϕ), which induces a prior on the identified set Θ(ϕ) via ϕ. Due to the identification feature, for any given ϕ ∈ Φ, the conditional prior π(θ|ϕ) is specified such that

$$\pi(\theta \in \Theta(\phi)\,|\,\phi) = 1.$$

Our analysis focuses on the situation where Θ(ϕ) is a closed and convex set for each ϕ. Therefore Θ(ϕ) can be uniquely characterized by its support function. Let d = dim(θ). For any fixed ϕ, the support function for Θ(ϕ) is a function $S_\phi(\cdot): \mathbb{S}^d \to \mathbb{R}$ such that

$$S_\phi(p) = \sup_{\theta \in \Theta(\phi)} \theta^T p,$$

where $\mathbb{S}^d$ denotes the unit sphere in $\mathbb{R}^d$. The support function plays a central role in convex analysis since it determines all the characteristics of a convex set. For example, if θ ∈ Θ(ϕ), then its kth component has bounds $\theta_k \in [-S_\phi(-e_k), S_\phi(e_k)]$, where $e_k$ is the kth standard basis vector (a vector of all zeros, except for a one in the kth position). Also, θ ∈ Θ(ϕ) if and only if $p^T\theta \le S_\phi(p)$ for all $p \in \mathbb{S}^d$.
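As a small numerical illustration of these two properties, the following Python sketch evaluates the support function of a convex polytope by maximizing over its vertices. The triangle, the helper names `support_function` and `in_set`, and the finite grid of directions (which only approximates "for all p" and is written for d = 2) are illustrative assumptions, not objects from the paper.

```python
import numpy as np

def support_function(vertices, p):
    """S(p) = sup_{theta in Theta} theta^T p for Theta = conv(vertices)."""
    return float(np.max(vertices @ p))

def in_set(vertices, theta, n_dir=720, tol=1e-9):
    """Check p^T theta <= S(p) on a grid of unit directions (d = 2 only)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    return all(p @ theta <= support_function(vertices, p) + tol for p in dirs)

# Hypothetical identified set: a triangle in R^2.
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
e1, e2 = np.eye(2)

# Coordinate bounds [-S(-e_k), S(e_k)] for k = 1, 2.
bounds = [(-support_function(V, -e), support_function(V, e)) for e in (e1, e2)]

inside = in_set(V, np.array([0.5, 0.5]))    # a point in the triangle
outside = in_set(V, np.array([1.5, 0.5]))   # a point beyond the hypotenuse
```

The maximum over vertices is exact for a polytope because a linear function attains its supremum over a convex hull at an extreme point.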


objects for our Bayesian inference. To the best of our knowledge, Bayesian estimation of the support function of the identified set has not been proposed in the literature so far.

Similar to Θ(ϕ), we put a prior on $S_\phi(\cdot)$ via the prior on ϕ. In this paper we investigate the asymptotic frequentist properties of the posterior distribution of the support function, as well as those of θ and of Θ(ϕ), including the posterior concentration rates and the Bernstein-von Mises theorem as in Bickel and Kleijn (2012). In addition, we carry out Bayesian inference by constructing two-sided Bayesian credible sets for the identified set Θ(ϕ) based on the support function.

Before formalizing our Bayesian setup, let us present a few examples that have received much attention in the literature on partially identified econometric models.

Example 2.1 (Interval censored data - continued). Let $(Y, Y_1, Y_2)$ be a 3-dimensional random vector such that $Y \in [Y_1, Y_2]$ with probability 1. The random variables $Y_1$ and $Y_2$ are observed while Y is unobservable (see, e.g., Moon and Schorfheide 2012). We denote: θ = E(Y) and $\phi = (\phi_1, \phi_2)' := (E(Y_1), E(Y_2))'$. Therefore, we have the following identified set for θ: $\Theta(\phi) = [\phi_1, \phi_2]$. The support function for Θ(ϕ) is easy to derive:

$$S_\phi(1) = \phi_2, \quad S_\phi(-1) = -\phi_1.$$

Example 2.2 (Interval regression model). The regression model with interval censoring has been studied by, for example, Haile and Tamer (2003), etc. Let $(Y, Y_1, Y_2)$ be a 3-dimensional random vector such that $Y \in [Y_1, Y_2]$ with probability 1. The random variables $Y_1$ and $Y_2$ are observed while Y is unobservable. Assume that

$$Y = X^T\theta + \epsilon$$

where X is a vector of observable regressors. In addition, assume there is a d-dimensional vector of nonnegative exogenous variables Z such that E(Zϵ) = 0. Here Z can be either a vector of instrumental variables when X is endogenous, or a nonnegative transformation of X when X is exogenous. It follows that

$$E(ZY_1) \le E(ZY) = E(ZX^T)\theta \le E(ZY_2). \qquad (2.1)$$

We denote $\phi = (\phi_1, \phi_2, \phi_3)$ with $(\phi_1^T, \phi_3^T) = (E(ZY_1)^T, E(ZY_2)^T)$ and $\phi_2 = E(ZX^T)$. Then the identified set for θ is given by $\Theta(\phi) = \{\theta \in \Theta : \phi_1 \le \phi_2\theta \le \phi_3\}$. Suppose $\phi_2^{-1}$ exists. The support function for Θ(ϕ) is given by (for sgn(x) = I(x > 0) − I(x < 0))¹:

$$S_\phi(p) = p^T\phi_2^{-1}\left(\frac{\phi_1 + \phi_3}{2}\right) + \alpha_p^T\left(\frac{\phi_3 - \phi_1}{2}\right), \quad p \in \mathbb{S}^d$$

where $\alpha_p = \left((p^T\phi_2^{-1})_1\,\mathrm{sgn}(p^T\phi_2^{-1})_1, \ldots, (p^T\phi_2^{-1})_d\,\mathrm{sgn}(p^T\phi_2^{-1})_d\right)^T$.
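The closed form can be checked numerically. The sketch below, with hypothetical values for $\phi_1$, $\phi_2$, $\phi_3$ and hypothetical function names, compares the formula with a brute-force maximization of $p^T\theta$ over the $2^d$ vertices of Θ(ϕ), which is a parallelepiped when $\phi_2$ is invertible. It is only an illustration under those assumptions, not the derivation of Appendix A.

```python
import itertools
import numpy as np

def support_closed_form(p, phi1, phi2, phi3):
    """Closed form of Example 2.2 with a^T = p^T phi2^{-1} and alpha_p = |a|."""
    a = np.linalg.solve(phi2.T, p)           # solves phi2^T a = p
    alpha = a * np.sign(a)                   # componentwise a_k * sgn(a_k)
    return float(a @ (phi1 + phi3) / 2 + alpha @ (phi3 - phi1) / 2)

def support_by_vertices(p, phi1, phi2, phi3):
    """Brute force: maximize p^T theta over the 2^d vertices of Theta(phi)."""
    best = -np.inf
    for corner in itertools.product(*zip(phi1, phi3)):
        theta = np.linalg.solve(phi2, np.array(corner))   # phi2 theta = corner
        best = max(best, float(p @ theta))
    return best

# Hypothetical values with phi1 <= phi3 componentwise and phi2 invertible.
phi1 = np.array([-1.0, 0.0])
phi3 = np.array([1.0, 2.0])
phi2 = np.array([[2.0, 1.0], [0.5, 1.5]])

angles = np.linspace(0.0, 2.0 * np.pi, 13)
pairs = [(support_closed_form(p, phi1, phi2, phi3),
          support_by_vertices(p, phi1, phi2, phi3))
         for p in (np.array([np.cos(t), np.sin(t)]) for t in angles)]
```

Each pair in `pairs` holds the closed-form and brute-force values for one direction, so the two entries should agree up to floating-point error.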

1 See Appendix A for detailed derivations of the support function in this example. Similar results but in


Example 2.3 (Missing data). Consider a bivariate random vector (Y, M) where M is a binary random variable which takes the value M = 1 when Y is missing and 0 otherwise. The parameter of interest is the marginal distribution $F_Y$ of Y: $\theta = F_Y(y)$. This problem without the missing-at-random assumption has been extensively studied in the literature, see for example, Manski and Tamer (2002), Manski (2003), etc. Let F and $F_M$ denote respectively the joint distribution of (Y, M) and the marginal distribution of M. Moreover, $F_{Y|M}$ denotes the conditional distribution of Y given M. By the law of total probability, $\theta = F_{Y|M}(y|M = 0)F_M(M = 0) + F_{Y|M}(y|M = 1)F_M(M = 1)$. Since $F_{Y|M}(y|M = 1)$ cannot be recovered from the data, the empirical evidence only partially identifies θ, and θ is characterized by the following moment restrictions:

$$F(y, M = 0) \le \theta \le F(y, M = 0) + F_M(M = 1).$$

Here $\phi = (F(y, M = 0), F_M(M = 1)) = (\phi_1, \phi_2)$. The identified set is $\Theta(\phi) = [\phi_1, \phi_1 + \phi_2]$, and its support function is $S_\phi(1) = \phi_1 + \phi_2$, $S_\phi(-1) = -\phi_1$.
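A minimal simulation sketch of these bounds follows. The data generating process (standard normal outcomes, 30% missingness drawn independently), the evaluation point `y0` and the helper `support` are all assumptions made for illustration; sample frequencies stand in for $\phi_1$ and $\phi_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.normal(size=n)             # latent outcomes (hypothetical DGP)
m = rng.random(n) < 0.3            # m[i] is True when y[i] is missing
y0 = 0.0                           # evaluation point for F_Y(y0)

phi1 = np.mean((y <= y0) & ~m)     # sample analogue of F(y0, M = 0)
phi2 = np.mean(m)                  # sample analogue of F_M(M = 1)
lower, upper = phi1, phi1 + phi2   # Theta(phi) = [phi1, phi1 + phi2]

def support(p):
    """S_phi(1) = phi1 + phi2 and S_phi(-1) = -phi1 for p in {-1, +1}."""
    return upper if p > 0 else -lower

# Infeasible in practice: the full-data frequency, available only in a simulation.
full_data_value = np.mean(y <= y0)
```

In the simulation the full-data frequency falls inside `[lower, upper]` by construction, since every missing observation contributes either 0 or 1 to the unobserved part.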

2.2 Semi-parametric Bayesian Setup

Let F denote the distribution function for the observed data, which is point identified by the data generating process. In a parametric Bayesian partially identified model as in Poirier (1998), Gustafson (2005, 2012) and Moon and Schorfheide (2012), F is linked with a known likelihood function for ϕ. Therefore the model is parametric and one does not put priors on F. For example, in the interval censored data Example 2.1, suppose we know that $Y_1$ and $Y_2$ are jointly normal with mean $(\phi_1, \phi_2)$ and a known covariance matrix Σ. Then the likelihood function is given by

$$l(\phi) = \det(2\pi\Sigma)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n (Y_{1i} - \phi_1, Y_{2i} - \phi_2)\,\Sigma^{-1}\,(Y_{1i} - \phi_1, Y_{2i} - \phi_2)^T\right)$$

where $\{(Y_{1i}, Y_{2i})\}_{i=1}^n$ is a set of i.i.d. realizations of $(Y_1, Y_2)$. Then F is the cdf of a bivariate normal distribution with mean vector ϕ and covariance Σ. The standard Bayesian approach for a (parametric) partially identified model proceeds by specifying a joint prior distribution π(θ, ϕ) and obtains the marginal posterior for θ:

$$p(\theta|\{(Y_{1i}, Y_{2i})\}_{i=1}^n) \propto \int_\Phi \pi(\theta, \phi)\,l(\phi)\,d\phi.$$
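The parametric benchmark can be sketched by Monte Carlo. The snippet below assumes, purely for illustration, a flat prior on ϕ restricted to $\phi_1 \le \phi_2$ (so that ϕ given the data is normal with mean equal to the sample mean and covariance Σ/n, truncated to that region) and a uniform conditional prior for θ on $[\phi_1, \phi_2]$. The simulated data and the value of Σ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])                # known covariance (assumed)
Y = rng.multivariate_normal([0.0, 2.0], Sigma, size=n)    # rows are (Y1i, Y2i)

# Flat prior on phi + normal likelihood => phi | data ~ N(sample mean, Sigma / n).
phi_draws = rng.multivariate_normal(Y.mean(axis=0), Sigma / n, size=5000)
phi_draws = phi_draws[phi_draws[:, 0] <= phi_draws[:, 1]]  # keep draws with phi1 <= phi2

# Conditional prior theta | phi ~ Uniform[phi1, phi2], supported on Theta(phi).
theta_draws = rng.uniform(phi_draws[:, 0], phi_draws[:, 1])
```

The array `theta_draws` is then a sample from the marginal posterior of θ under these assumed priors; a histogram of it is roughly flat over the estimated interval with smooth edges of width of order $1/\sqrt{n}$.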


A much more robust approach is to proceed without assuming a parametric form for the likelihood function, and to put a prior on F instead. This leads to the semi-parametric Bayesian setup. The statistical model therefore contains three parameters: the structural parameter of interest θ, a finite dimensional nuisance parameter ϕ which can be point identified by the DGP, and an infinite dimensional nuisance parameter F which characterizes the distribution of the data.

A further distinction among the parameters may be drawn on the basis of identification: the identified parameters (ϕ, F), which characterize the sampling distribution of the observable random variables, and the partially identified parameter θ, which is linked to the sampling distribution through ϕ. We have to take this difference into account when we construct the prior distribution for the model parameters. Therefore, the prior distribution is naturally decomposed into a marginal prior for the identified parameter and a conditional prior for θ given the identified parameter such that

$$\pi(\theta \in \Theta(\phi)\,|\,\phi) = 1.$$

We specify a conditional prior distribution for θ given ϕ taking the form

$$\pi(\theta|\phi) \propto I_{\theta \in \Theta(\phi)}\,g(\theta)$$

where g(·) is some probability density function with respect to the Lebesgue measure and $I_{\theta \in \Theta(\phi)}$ is the indicator function of Θ(ϕ) which takes the value 1 if θ ∈ Θ(ϕ). By construction this prior puts all its mass on Θ(ϕ), ∀ϕ ∈ Φ.

Below we describe two possible ways to specify the prior on (ϕ, F): a fully nonparametric prior and a semi-parametric prior. The first prior scheme consists in placing a fully nonparametric prior on F which induces a prior on ϕ through a transformation ϕ = ϕ(F). When there is more informative prior information for ϕ directly, it is more convenient to place a prior on (ϕ, η) where η is an infinite-dimensional nuisance parameter (often a density function) that is independent of ϕ a priori and that characterizes F. The prior on (ϕ, F) is then deduced from the prior on (ϕ, η).

Below we formally define these two prior specifications. An illustrative example is given in Section 2.3. Let X denote the observable random variable for which we have n i.i.d. observations $D_n = \{X_i\}_{i=1}^n$. Let $(\mathcal{X}, \mathcal{B}_x, F)$ denote a probability space in which X takes values. Let $\mathcal{F}$ denote the set of probability measures on $(\mathcal{X}, \mathcal{B}_x)$, which is also the parameter space of F.

2.2.1 Nonparametric prior

Since ϕ is point identified, we assume it can be rewritten as a measurable function of F as ϕ = ϕ(F), for instance $\phi = E(X) = \int x\,F(dx)$. A possible way to construct the prior distribution consists of specifying a nonparametric prior distribution for F and then deducing from it the prior distribution for ϕ via ϕ(F). The Bayesian experiment is

$$X|F \sim F, \quad F \sim \pi(F), \quad \theta\,|\,\phi = \phi(F) \sim \pi(\theta|\phi(F))$$


Dirichlet process priors (Ferguson (1973)) and Polya tree (Lavine (1992)). The case where π(F) is a Dirichlet process prior in partially identified models has been proposed by Florens and Simoni (2011).

Conditionally on ϕ, the data are completely uninformative about θ: the prior distribution of θ is revised by the data only through the information brought by the identified parameter ϕ(F). Indeed, since ϕ(F) is identified, it is straightforward to show that the posterior of θ conditional on ϕ(F) satisfies

$$p(\theta|\phi(F), D_n) = \pi(\theta|\phi(F))$$

(see Poirier (1998), who called the data conditionally uninformative for θ given ϕ). Let $p(F|D_n)$ denote the marginal posterior of F which, by abuse of notation, can be written $p(F|D_n) \propto \pi(F)\prod_{i=1}^n F(X_i)$. The posterior distributions of ϕ, Θ(ϕ) and $S_\phi(\cdot)$ are deduced from the posterior of F. Then, for any measurable set B ⊂ Θ, the marginal posterior probability of θ is given by, averaging over F:

$$\begin{aligned} P(\theta \in B|D_n) &= \int_{\mathcal{F}} P(\theta \in B|\phi(F), D_n)\,p(F|D_n)\,dF \\ &= \int_{\mathcal{F}} \pi(\theta \in B|\phi(F))\,p(F|D_n)\,dF = E[\pi(\theta \in B|\phi(F))\,|\,D_n] \qquad (2.2) \end{aligned}$$

where the conditional expectation is taken with respect to the posterior of F. The corresponding marginal posterior density function of θ will be denoted by $p(\theta|D_n)$.
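Equation (2.2) suggests a simple Monte Carlo scheme: draw F from its posterior, map each draw to ϕ(F), and average π(θ ∈ B|ϕ(F)). The sketch below uses the Bayesian bootstrap (Dirichlet(1, ..., 1) weights on the observed points), which corresponds to the Dirichlet process posterior in the limit where the total mass of the prior base measure goes to zero; this substitution is a simplification rather than the paper's prior. The interval-censoring design, the uniform conditional prior and the set B = [0.5, 1.0] are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
Y = rng.normal(1.0, 1.0, size=n)              # latent outcome (hypothetical DGP)
Y1, Y2 = np.floor(Y), np.floor(Y) + 1.0       # observed bracket with Y in [Y1, Y2]
X = np.column_stack([Y1, Y2])

def prior_prob_in_B(phi, a, b):
    """pi(theta in [a, b] | phi) when theta | phi ~ Uniform[phi1, phi2]."""
    overlap = max(0.0, min(b, phi[1]) - max(a, phi[0]))
    return overlap / (phi[1] - phi[0])

# Posterior draws of F: Dirichlet(1, ..., 1) weights on the observed points.
W = rng.dirichlet(np.ones(n), size=2000)      # each row is one draw of F
phi_draws = W @ X                             # phi(F) = (E_F Y1, E_F Y2)

# Equation (2.2): P(theta in B | D_n) = E[ pi(theta in B | phi(F)) | D_n ].
post_prob = np.mean([prior_prob_in_B(phi, 0.5, 1.0) for phi in phi_draws])
```

Because θ enters only through π(θ|ϕ), no sampling of θ is needed here: the inner probability is available in closed form for the uniform conditional prior.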

2.2.2 Semi-parametric prior

Alternatively, instead of modeling F nonparametrically, we could reformulate the model and parameterize the sampling distribution F in terms of a finite-dimensional parameter ϕ ∈ Φ and a nuisance parameter $\eta \in \mathcal{P}$, where $\mathcal{P}$ is an infinite-dimensional measurable space. Therefore, $\mathcal{F} = \{F_{\phi,\eta};\ \phi \in \Phi,\ \eta \in \mathcal{P}\}$. When we consider the frequentist properties of the posterior distribution, we assume there is a fixed true value for F, denoted by $F_0$. Since both ϕ and F are identified, there exist unique $\phi_0 \in \Phi$ and $\eta_0 \in \mathcal{P}$ such that $F_0 = F_{\phi_0,\eta_0}$, where $\phi_0$ and $\eta_0$ denote the true values of ϕ and η. Denote by $l_n(\phi, \eta)$ the model's likelihood function.

One of the appealing features of this semi-parametric approach is that it allows us to impose a prior π(ϕ) directly on the identified parameter ϕ, which is convenient whenever we have good prior information regarding ϕ. In contrast, a nonparametric prior specification may be inconvenient for incorporating subjective prior information.

For instance, in the interval censored data example, we can write

$$Y_1 = \phi_1 + u, \quad Y_2 = \phi_2 + v, \quad u \sim f_1, \quad v \sim f_2,$$


$Y_1, Y_2\,|\,\phi_1, \phi_2, f_1, f_2$ are disjoint². Then $\eta = (f_1, f_2)$, and the likelihood function is

$$l_n(\phi, \eta) = \prod_{i=1}^n f_1(Y_{1i} - \phi_1)\,f_2(Y_{2i} - \phi_2).$$

We put priors on $(\phi, f_1, f_2)$. This is a location model studied for instance by Ghosal et al. (1999) and Amewou-Atisso et al. (2003). Examples of priors on the density functions $f_1$ and $f_2$ include mixture of Dirichlet process priors, Gaussian process priors, etc.

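The joint prior distribution π(θ, ϕ, η) is naturally decomposed as

$$\pi(\theta, \phi, \eta) = \pi(\theta|\phi) \times \pi(\phi, \eta). \qquad (2.3)$$

We place an independent prior on (ϕ, η) as π(ϕ, η) = π(ϕ)π(η). Therefore, the Bayesian experiment is

$$X|\phi, \eta \sim F_{\phi,\eta}, \quad (\phi, \eta) \sim \pi(\phi, \eta) = \pi(\phi) \times \pi(\eta), \quad \theta|\phi, \eta \sim \pi(\theta|\phi).$$

The posterior distribution of ϕ has a density function given by

$$p(\phi|D_n) \propto \int_{\mathcal{P}} \pi(\phi, \eta)\,l_n(\phi, \eta)\,d\eta. \qquad (2.4)$$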

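Then the marginal posterior of θ is, for any measurable set B ⊂ Θ:

$$P(\theta \in B|D_n) \propto \int_\Phi \int_{\mathcal{P}} \pi(\theta \in B|\phi)\,\pi(\phi, \eta)\,l_n(\phi, \eta)\,d\eta\,d\phi. \qquad (2.5)$$

Moreover, the corresponding posterior density function is $p(\theta|D_n) = \int_\Phi \pi(\theta|\phi)\,p(\phi|D_n)\,d\phi$, where $p(\phi|D_n)$ has been defined in (2.4). The posteriors of Θ(ϕ) and $S_\phi(\cdot)$ are deduced from that of ϕ. Suppose for example we are interested in whether Θ(ϕ) ∩ A is an empty set for some A ⊂ Θ; we then look at the posterior probability

$$P(\Theta(\phi) \cap A = \emptyset\,|\,D_n) = \frac{\int_{\{\phi:\,\Theta(\phi) \cap A = \emptyset\}} \int_{\mathcal{P}} \pi(\phi)\,\pi(\eta)\,l_n(\phi, \eta)\,d\eta\,d\phi}{\int_\Phi \int_{\mathcal{P}} \pi(\phi)\,\pi(\eta)\,l_n(\phi, \eta)\,d\eta\,d\phi}.$$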
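Given draws from the posterior of ϕ, this ratio of integrals can be approximated by the share of draws for which Θ(ϕ) and A are disjoint. The following sketch does this for the interval case Θ(ϕ) = [ϕ1, ϕ2] with A = [a, b]; the function name and the stand-in normal draws are hypothetical, and in practice the draws would come from $p(\phi|D_n)$, for example via MCMC.

```python
import numpy as np

def prob_empty_intersection(phi_draws, a, b):
    """Share of posterior draws with [phi1, phi2] and A = [a, b] disjoint."""
    phi1, phi2 = phi_draws[:, 0], phi_draws[:, 1]
    return float(np.mean((phi2 < a) | (phi1 > b)))

# Stand-in posterior draws of phi = (phi1, phi2), concentrated near (0, 1).
rng = np.random.default_rng(3)
phi_draws = np.column_stack([rng.normal(0.0, 0.1, 4000),
                             rng.normal(1.0, 0.1, 4000)])

p_far = prob_empty_intersection(phi_draws, 3.0, 4.0)      # A far to the right
p_inside = prob_empty_intersection(phi_draws, 0.4, 0.6)   # A well inside the interval
p_edge = prob_empty_intersection(phi_draws, 1.0, 1.2)     # A starting at the upper endpoint
```

The third case is the informative one: with A starting at the posterior center of ϕ2, roughly half of the draws give an empty intersection.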

The finite-dimensional posterior distribution of the support function $S_\phi(\cdot)$ is the distribution $P(S_\phi(p_i) \in A_i, \text{ for } 1 \le i \le k)$, $k \in \mathbb{N}$, for every $(p_1, \ldots, p_k)$ such that $p_i \in \mathbb{S}^d$, i = 1, ..., k, and for every product of measurable sets $A_i$ in $\mathbb{R}$.

² In order to implement this we have two possibilities. Let $[\underline{u}, \overline{u}]$ and $[\underline{v}, \overline{v}]$ denote the supports of f1 and f2, and let $[\underline{\phi}_1, \overline{\phi}_1]$ and $[\underline{\phi}_2, \overline{\phi}_2]$ denote the supports of ϕ1 and ϕ2, respectively. First, we can specify a conditional prior π(f1, f2|ϕ1, ϕ2) such that $\overline{u} + \phi_1 \le \underline{v} + \phi_2$. A second possibility is to specify an independent prior on

Example 2.4 (Interval regression model - continued). Consider Example 2.2, where ϕ = (ϕ1, ϕ2, ϕ3) = (E(ZY1), E(ZX^T), E(ZY2)). Write

$$ZY_1 = \phi_1 + u_1, \qquad ZY_2 = \phi_3 + u_3, \qquad \mathrm{vec}(ZX^T) = \mathrm{vec}(\phi_2) + u_2,$$

where u1, u2 and u3 are correlated and their joint unknown probability density function is η(u1, u2, u3). The likelihood function is then

$$l_n(\phi, \eta) = \prod_{i=1}^{n} \eta\big(Z_i Y_{1i} - \phi_1,\ \mathrm{vec}(Z_i X_i^T) - \mathrm{vec}(\phi_2),\ Z_i Y_{2i} - \phi_3\big).$$

Many nonparametric priors can be used for π(η) in the location-model of the type of Example 2.4, where $l_n(\phi, \eta) = \prod_{i=1}^{n} \eta(X_i - \phi)$, or of the type of the interval data example.

The next examples show three possible ways for constructing priors π(η) on probability density functions.

Example 2.5. The finite mixture of normals (e.g., Lindsay and Basak (1993), Ray and Lindsay (2005)) assumes η to take the form

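$$\eta(x) = \sum_{i=1}^{k} w_i\, h(x; \mu_i, \Sigma_i),$$

where h(x; µi, Σi) is the density of a multivariate normal distribution with mean µi and variance Σi, and $\{w_i\}_{i=1}^{k}$ are unknown weights such that $\sum_{i=1}^{k} w_i \mu_i = 0$. Then

$$\int \eta(x)\, x\, dx = \sum_{i=1}^{k} w_i \int h(x; \mu_i, \Sigma_i)\, x\, dx = 0.$$

We impose the prior $\pi(\eta) = \pi(\{\mu_l, \Sigma_l, w_l\}_{l=1}^{k})$; then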

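$$p(\phi|D_n) \propto \int_{\mathcal{P}} \pi(\phi)\, \pi(\eta) \prod_{i=1}^{n} \eta(X_i - \phi)\, d\eta = \int \pi(\phi) \prod_{i=1}^{n} \sum_{j=1}^{k} w_j\, h(X_i - \phi; \mu_j, \Sigma_j)\, \pi(\{\mu_l, \Sigma_l, w_l\}_{l=1}^{k}) \prod_{l=1}^{k} dw_l\, d\mu_l\, d\Sigma_l.$$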
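The likelihood factor inside this posterior can be sketched in a few lines. The following is a minimal univariate illustration, not the authors' implementation: the function names are hypothetical, the scalar standard deviations stand in for Σj, and the requirement that the weights sum to one is an added assumption needed for η to be a density.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, sigmas):
    """eta(x) = sum_j w_j h(x; mu_j, sigma_j) for a finite normal mixture."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

def log_likelihood(phi, data, weights, means, sigmas, tol=1e-9):
    """log l_n(phi, eta) = sum_i log eta(X_i - phi) in the location model."""
    # Assumption: weights sum to one so that eta integrates to one.
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    # Constraint from the text: sum_j w_j mu_j = 0, so eta has mean zero.
    if abs(sum(w * m for w, m in zip(weights, means))) > tol:
        raise ValueError("mixture must have mean zero")
    return sum(math.log(mixture_density(x - phi, weights, means, sigmas)) for x in data)

# Symmetric two-component mixture centred at zero (illustrative values only).
example_value = log_likelihood(0.5, [0.2, 1.1, -0.4], [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Multiplying the exponential of this quantity by π(ϕ) and by the prior on the mixture parameters, and integrating the latter out (for instance by MCMC), would give the posterior of ϕ displayed above.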

Example 2.6. The Dirichlet mixture of normals (e.g., Ghosal, Ghosh and Ramamoorthi (1999), Ghosal and van der Vaart (2001), Amewou-Atisso et al. (2003)) assumes

$$\eta(x) = \int h(x - z; 0, \Sigma)\, dH(z),$$

where h(x; 0, Σ) is the density of a multivariate normal distribution with mean zero and variance Σ, and H is a probability distribution such that $\int z\, dH(z) = 0$. Then $\int x\, \eta(x)\, dx = 0$. To place a prior on η, we let H have the Dirichlet process prior distribution Dα ≡ D(ν0, Q0), where α is a finite positive measure, ν0 = α(X) ∈ R+ and Q0 = α/α(X) is a base probability measure. The prior on Σ is independent of the prior on H. Then

$$p(\phi|D_n) \propto \int \pi(\phi)\, \pi(\Sigma)\, D_\alpha(H) \prod_{i=1}^{n} \int h(X_i - \phi - z; 0, \Sigma)\, dH(z)\, d\Sigma\, dH.$$

Example 2.7. Random Bernstein polynomials (e.g., Walker et al. (2007) and Ghosal (2001)) admit a density function

$$\eta(x) = \sum_{j=1}^{k} \big[H(j/k) - H((j-1)/k)\big]\, \mathrm{Be}(x; j, k - j + 1),$$

where Be(x; a, b) stands for the beta density with parameters a, b > 0 and H is a random distribution function with prior distribution assumed to be a Dirichlet process. Moreover, the parameter k is also random with a prior distribution independent of the prior on H. Then

$$p(\phi|D_n) \propto \int \pi(\phi) \prod_{i=1}^{n} \eta(X_i - \phi)\, \pi(H)\, \pi(k)\, dH\, dk.$$

Besides, other commonly used priors are wavelet expansions (Rivoirard and Rousseau (2012)), Polya tree priors (Lavine (1992)), Gaussian process priors (van der Vaart and van Zanten (2008), Castillo (2008)), etc.

2.3 Interval censored data: an example

For illustration purposes, we consider a simple version of the interval censored data example 2.1 where Y2 = Y1 + 1 and only Y1 is observable, i.e. Y1 ≡ X in our general

notation. Let ϕ = EY1 and θ = EY, then the identified set is Θ(ϕ) = [ϕ, ϕ+ 1]. Let

F denote the marginal distribution of Y1. Then a more formal way to write ϕ should be

ϕ=ϕ(F) =E(Y1|F).

Let us specify a Dirichlet process prior for F: π(F) = Dir(ν0, Q0), where ν0 ∈ R+ and Q0 is a base probability on (X, Bx) such that Q0({x}) = 0 for all x ∈ X. By using the stick-breaking representation (see Sethuraman (1994)), the deduced prior distribution of the transformation ϕ(F) is

$$\pi(\phi \in A) = P\Big(\sum_{j=1}^{\infty} \alpha_j \xi_j \in A\Big), \qquad \forall A \subset \Phi,$$

where $\{\xi_j\}_{j \ge 1}$ denote independent draws from Q0, $\alpha_j = v_j \prod_{l=1}^{j-1} (1 - v_l)$ with $\{v_l\}_{l \ge 1}$ independent draws from a Beta distribution Be(1, ν0), and $\{v_l\}_{l \ge 1}$ are independent of $\{\xi_j\}_{j \ge 1}$. The posterior distribution of F is still a Dirichlet process: p(F|Dn) = Dir(νn, Qn), with νn = ν0 + n and

$$Q_n = \frac{\nu_0}{\nu_0 + n}\, Q_0 + \frac{n}{\nu_0 + n}\, \hat{F},$$

where $\hat{F}$ is the empirical distribution of the sample (Y11, . . . , Y1n). The posterior distribution of the transformation ϕ(F) is

$$P(\phi \in A|D_n) = P\Big(\rho \sum_{j=1}^{n} \beta_j Y_{1j} + (1 - \rho) \sum_{j=1}^{\infty} \alpha_j \xi_j \in A \,\Big|\, D_n\Big),$$

where ρ is drawn from a Beta distribution Be(n, ν0) independently of the other quantities and (β1, . . . , βn) are drawn from a Dirichlet distribution with parameters (1, . . . , 1) on the simplex S_{n−1} of dimension (n − 1). With the prior π(θ|ϕ) ∝ I_{θ∈Θ(ϕ)} g(θ), the marginal posterior density function of θ evaluated at some fixed θ̃ is

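$$\begin{aligned}
p(\tilde{\theta}|D_n) &\propto \int_{\mathcal{F}} I(\tilde{\theta} \in [\phi, \phi + 1])\, g(\tilde{\theta})\, p(F|D_n)\, dF \\
&= g(\tilde{\theta})\, P\big(\phi(F) \le \tilde{\theta} \le \phi(F) + 1 \,\big|\, D_n\big) \\
&= g(\tilde{\theta})\, P\Big(\tilde{\theta} - 1 \le \rho \sum_{j=1}^{n} \beta_j Y_{1j} + (1 - \rho) \sum_{j=1}^{\infty} \alpha_j \xi_j \le \tilde{\theta} \,\Big|\, D_n\Big).
\end{aligned}$$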
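The probability in the last line has no closed form in general, but it can be approximated by simulation. The sketch below is one hedged way to do so: it draws ρ, (β1, . . . , βn) and a truncated stick-breaking sequence, then counts how often the resulting ϕ falls in [θ̃ − 1, θ̃]. The function names, the truncation level `trunc` and the callable `base_draw` (a sampler from Q0) are assumptions introduced for illustration; the truncation is an approximation that is accurate only when ν0 is small relative to `trunc`.

```python
import random

def draw_phi_posterior(y, nu0, base_draw, rng, trunc=200):
    """One approximate draw of phi = rho * sum_j beta_j Y_1j + (1 - rho) * sum_j alpha_j xi_j."""
    n = len(y)
    rho = rng.betavariate(n, nu0)                      # rho ~ Be(n, nu0)
    # (beta_1, ..., beta_n) ~ Dirichlet(1, ..., 1) via normalised Exp(1) variables.
    e = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(e)
    data_part = sum(ei / total * yi for ei, yi in zip(e, y))
    # Stick-breaking weights alpha_j = v_j * prod_{l<j} (1 - v_l), truncated at `trunc` terms.
    remaining, prior_part = 1.0, 0.0
    for _ in range(trunc):
        v = rng.betavariate(1.0, nu0)                  # v_l ~ Be(1, nu0)
        prior_part += remaining * v * base_draw(rng)   # xi_j drawn from Q0
        remaining *= 1.0 - v
    return rho * data_part + (1.0 - rho) * prior_part

def posterior_prob_interval(theta, y, nu0, base_draw, draws=2000, seed=0):
    """Monte Carlo estimate of P(theta - 1 <= phi <= theta | D_n)."""
    rng = random.Random(seed)
    hits = sum(
        theta - 1.0 <= draw_phi_posterior(y, nu0, base_draw, rng) <= theta
        for _ in range(draws)
    )
    return hits / draws
```

Multiplying the estimate by g(θ̃) gives the unnormalised posterior density of θ at θ̃; evaluating it on a grid of θ̃ values traces out the shape of the marginal posterior.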

The alternative semi-parametric prior can be formulated as follows. Define u = Y1 − ϕ, and assume u has a continuous density f. The likelihood is thus given by $l_n(\phi, f) = \prod_{i=1}^{n} f(Y_i - \phi)$. We place any parametric prior π(ϕ) on ϕ and a Dirichlet mixture of normals prior on f, which assumes $f(u) = \int h(u - z; 0, \sigma^2)\, dH(z)$, where H is a probability measure that has a Dirichlet process prior Dα and σ² is a variance parameter for the normal mixtures that has an inverse Gamma prior (see Example 2.6 for details). We then obtain the posterior

$$p(\phi|D_n) \propto \int \pi(\phi)\, \pi(\sigma^2)\, D_\alpha(H) \prod_{i=1}^{n} \int h(Y_i - \phi - z; 0, \sigma^2)\, dH(z)\, d\sigma^2\, dH.$$

The marginal posterior density function of θ evaluated at some fixed θ̃ is

$$p(\tilde{\theta}|D_n) \propto g(\tilde{\theta})\, P(\tilde{\theta} - 1 < \phi < \tilde{\theta} \,|\, D_n).$$

3 Posterior Consistency for θ

In Bayesian analysis, one starts with prior knowledge (sometimes uninformative) about the parameter and updates it according to the marginal posterior given the data. In classical point identified parametric and semi-parametric models, under mild assumptions the posterior is asymptotically normal due to the Bernstein-von Mises theorem, and hence its shape is asymptotically no longer affected by the prior specification. In contrast, the shape of the posterior of a partially identified parameter still relies upon its prior distribution (see Poirier (1998)), even asymptotically. Only the support of the prior distribution of θ (given ϕ) is revised after data are observed, and it eventually converges towards the true identified set. This corresponds to frequentist consistency of the posterior for partially identified parameters and is due to the fact that the point-identified parameter ϕ completely characterizes the support.

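We assume there is a true value of ϕ, denoted by ϕ0, which induces a true identified set Θ(ϕ0), and a true F, denoted by F0. Our goal is to achieve frequentist posterior consistency: for any ϵ > 0 there is τ ∈ (0, 1] such that

$$P(\theta \in \Theta(\phi_0)^{\epsilon}|D_n) \to^{p} 1 \quad \text{and} \quad P(\theta \in \Theta(\phi_0)^{-\epsilon}|D_n) \to^{p} (1 - \tau),$$

where

$$\Theta(\phi)^{\epsilon} = \{\theta \in \Theta : d(\theta, \Theta(\phi)) \le \epsilon\} \qquad (3.1)$$

is the ϵ-envelope of Θ(ϕ) and

$$\Theta(\phi)^{-\epsilon} = \{\theta \in \Theta(\phi) : d(\theta, \Theta \backslash \Theta(\phi)) \ge \epsilon\} \qquad (3.2)$$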

is the ϵ-contraction of Θ(ϕ), with Θ\Θ(ϕ) = {θ ∈ Θ; θ ∉ Θ(ϕ)} and d(θ, Θ(ϕ)) = inf_{x∈Θ(ϕ)} ∥θ − x∥; see e.g. Molchanov (2005) and Chernozhukov, Hong and Tamer (2007). Thus, posterior (or frequentist) consistency for a partially identified parameter means that the posterior distribution of θ puts all its mass on a set whose boundary belongs to the set {θ ∈ Θ; d(θ, ∂Θ(ϕ0)) ≤ ϵ}, where ∂Θ(ϕ0) denotes the boundary of Θ(ϕ0). Posterior consistency is one of the benchmarks of a Bayesian procedure: it ensures that, with a sufficiently large amount of data, it is nearly possible to discover the true identified set. Lack of consistency is therefore extremely undesirable. Liao and Jiang (2010, 2011) study posterior consistency for partially identified models, however with a pseudo-likelihood function whose probabilistic interpretation is still in question.³ More recently, Kitagawa (2011) considered the posterior consistency for Θ(ϕ) in terms of the posterior lower probability when the parametric form of the likelihood is known.
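For a scalar θ with interval identified set Θ(ϕ) = [a, b], the ϵ-envelope (3.1) and the ϵ-contraction (3.2) have simple closed forms. The sketch below is an illustration under the added assumption that [a − ϵ, b + ϵ] lies inside the parameter space Θ; the function names are hypothetical.

```python
def envelope(a, b, eps):
    """epsilon-envelope of [a, b]: all points within distance eps of the interval."""
    return (a - eps, b + eps)

def contraction(a, b, eps):
    """epsilon-contraction of [a, b]: points at least eps away from the complement.

    Returns None when the contraction is empty, which happens when the interval
    is shorter than 2 * eps (in particular when theta is point identified, a == b).
    """
    lo, hi = a + eps, b - eps
    return (lo, hi) if lo <= hi else None
```

With a = 0, b = 1 and ϵ = 0.1 the envelope is [−0.1, 1.1] and the contraction is [0.1, 0.9]; consistency requires posterior mass tending to one on the former and staying strictly below one on the latter.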

We recall that the conditional prior on θ (given ϕ) is specified as

π(θ|ϕ)∝g(θ)Iθ∈Θ(ϕ) (3.3)

for some g(θ). In the special case where θ is point identified, {θ} = Θ(ϕ) becomes a function of ϕ, whose prior is completely determined by that of ϕ instead of by (3.3).

In this section we focus on the frequentist consistency of the marginal posterior of θ (marginalized with respect to the posterior of ϕ). We will investigate the posterior concentration rate of Θ(ϕ) and Sϕ(·) in subsequent sections. For a measurable set B ⊂ Θ, the marginal posterior probability is given by (2.2):

$$P(\theta \in B|D_n) = \int_{\mathcal{F}} \pi(\theta \in B|\phi(F))\, p(F|D_n)\, dF$$

when the prior on ϕ is induced by the nonparametric prior specified on F, and by (2.5):

$$P(\theta \in B|D_n) = \int_{\Phi} \pi(\theta \in B|\phi)\, p(\phi|D_n)\, d\phi$$

when the prior on ϕ is specified through a semi-parametric prior as described in section 2.2.2. Recall that F and ϕ are point-identified and frequentist asymptotic properties of the marginal posterior of θ rely on frequentist asymptotic properties of the posterior of F and ϕ. Therefore, we assume that the priors π(F) and π(ϕ) specified for F and ϕ are such that the corresponding posterior distributions are consistent:

Assumption 3.1. At least one of the following holds:

(i) The measurable function ϕ : F → Φ is continuous and the prior π(F) is such that the posterior p(F|Dn) satisfies:

$$\int_{\mathcal{F}} m(F)\, p(F|D_n)\, dF \to^{p} \int_{\mathcal{F}} m(F)\, \delta_{F_0}(dF)$$

for any bounded and continuous function m(·) on F, where δ is the Dirac function and F0 is the true distribution function of X;

(ii) the prior π(ϕ) is such that the posterior p(ϕ|Dn) satisfies:

$$\int_{\Phi} m(\phi)\, p(\phi|D_n)\, d\phi \to^{p} \int_{\Phi} m(\phi)\, \delta_{\phi_0}(d\phi)$$

for any bounded and continuous function m(·) on Φ.

Assumptions 3.1 (i) and (ii) correspond to the nonparametric and semi-parametric prior, respectively, and are verified by many nonparametric and semi-parametric priors. Examples are: Dirichlet process priors, Polya tree process priors, Gaussian process priors, etc. We refer to Ghosh and Ramamoorthi (2003) for examples and sufficient conditions for posterior consistency. For instance, when π(F) is the Dirichlet process prior, the second part of Assumption 3.1 (i) was proved in Ghosh and Ramamoorthi (2003, Theorem 3.2.7). The condition that ϕ(F) is continuous in F is verified in many examples relevant for applications. For instance, in Example 2.1, ϕ(F) = E(Y|F), and in Example 2.2, ϕ(F) = (E(ZY1|F), E(ZX^T|F), E(ZY2|F)), which are all linear functionals of F.

Assumption 3.2. For any ϵ > 0 there are measurable sets A1, A2 ⊂ Φ such that 0 < π(ϕ ∈ Ai) ≤ 1, i = 1, 2, and

(i) for all ϕ ∈ A1, Θ(ϕ0)^ϵ ∩ Θ(ϕ) ≠ ∅; for all ϕ ∉ A1, Θ(ϕ0)^ϵ ∩ Θ(ϕ) = ∅;

(ii) for all ϕ ∈ A2, Θ(ϕ0)^{−ϵ} ∩ Θ(ϕ) ≠ ∅; for all ϕ ∉ A2, Θ(ϕ0)^{−ϵ} ∩ Θ(ϕ) = ∅.

This assumption allows us to prove the posterior consistency without assuming the prior π(θ|ϕ) to be a continuous function of ϕ, and therefore priors like I_{ϕ1<θ<ϕ2} in the interval censoring data example are allowed. Under this assumption, and if the conditional prior π(θ|ϕ) is a regular conditional distribution, the conditional prior probability of the ϵ-envelope (and of the ϵ-contraction) of the identified set can be approximated by a continuous function, i.e., there is a sequence of bounded and continuous functions hm(ϕ) such that (see Lemma C.1 in the appendix) almost surely in ϕ:

$$\pi(\theta \in \Theta(\phi_0)^{\epsilon}|\phi) = \lim_{m \to \infty} h_m(\phi).$$

A similar approximation holds for the conditional prior of the ϵ-contraction, π(θ ∈ Θ(ϕ0)^{−ϵ}|ϕ).

Assumption 3.3. For any ϵ > 0 and ϕ ∈ Φ, π(θ ∈ Θ(ϕ)^{−ϵ}|ϕ) < 1.

This is an assumption on the prior for θ, which means the identified set should be sharp with respect to the prior information. Roughly speaking, the support of the prior should not be a proper subset of any ϵ-contraction of the identified set Θ(ϕ). If, on the contrary, the prior information restricts θ to lie inside a strict subset of Θ(ϕ), so that Assumption 3.3 is violated, then that prior information should be taken into account and we should shrink Θ(ϕ) to a sharper set. In the special case when θ is point identified (Θ(ϕ) is a singleton), the ϵ-contraction is empty and thus π(θ ∈ Θ(ϕ)^{−ϵ}|ϕ) = 0.

The following theorem gives the posterior consistency for partially identified parameters.

Theorem 3.1. Let π(θ|ϕ) be a regular conditional distribution. Under Assumptions 3.1-3.3, for any ϵ > 0, there is τ ∈ (0, 1] such that

$$P(\theta \in \Theta(\phi_0)^{\epsilon}|D_n) \to^{p} 1 \quad \text{and} \quad P(\theta \in \Theta(\phi_0)^{-\epsilon}|D_n) \to^{p} (1 - \tau).$$

4 Posterior consistency for Θ(ϕ)

Let ϕ0 be the true value of ϕ, which corresponds to the true identified set Θ(ϕ0). The

estimation accuracy of the identified set is often measured, in the literature, by the Hausdorff distance. Specifically, for a point a and a set A, let d(a, A) = infx∈A∥a−x∥, where ∥ · ∥

denotes the Euclidean norm. The Hausdorff distance between sets A and B is defined as

$$d_H(A, B) = \max\Big\{\sup_{a \in A} d(a, B),\ \sup_{b \in B} d(b, A)\Big\} = \max\Big\{\sup_{a \in A} \inf_{b \in B} \|a - b\|,\ \sup_{b \in B} \inf_{a \in A} \|b - a\|\Big\}.$$

It follows immediately that dH(A, B) = dH(B, A) and when both A and B are compact,

dH(A, B) = 0 if and only if A = B. This section aims at deriving the rate rn = o(1) such

that for some constant C >0,

P(dH(Θ(ϕ),Θ(ϕ0))< Crn|Dn)→p 1.

The above result is based upon the posterior concentration rate for ϕ – in the sense that rn is the same as the concentration rate for ϕ – as well as some continuity condition on

dH(Θ(ϕ),Θ(ϕ0)) with respect to ϕ.
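To make the Hausdorff distance defined above concrete, the following sketch evaluates it for finite point sets in R^d, where the suprema and infima reduce to maxima and minima. It is an illustration only (identified sets are typically continua, which would first have to be discretised), and the function names are hypothetical.

```python
import math

def dist_point_set(a, B):
    """d(a, B) = min over b in B of the Euclidean distance ||a - b||."""
    return min(math.dist(a, b) for b in B)

def hausdorff(A, B):
    """Hausdorff distance between two non-empty finite point sets."""
    return max(
        max(dist_point_set(a, B) for a in A),
        max(dist_point_set(b, A) for b in B),
    )

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
d_AB = hausdorff(A, B)   # driven by the point (1, 2), which is far from A
```

Both one-sided terms are needed: here every point of A is also in B, so the first term is zero, yet the sets differ and the second term is positive.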

In a semi-parametric Bayesian model where ϕ is point identified and either a nonparametric or a semi-parametric prior is placed, the posterior of ϕ achieves a near-parametric concentration rate under proper conditions on the prior. Since our goal is to study the posterior of Θ(ϕ) and θ instead of ϕ, we state a high-level assumption on the posterior of ϕ as follows instead of deriving it from more general conditions. More formal derivations of this assumption will be presented in Appendix B.

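Assumption 4.1. The marginal posterior of ϕ is such that, for some C > 0,

$$P\big(\|\phi - \phi_0\| \le C n^{-1/2} (\log n)^{1/2} \,\big|\, D_n\big) \to^{p} 1.$$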

This assumption is imposed for both kinds of priors described in Section 2, and is a standard result in semi-nonparametric Bayesian literature. If we place a nonparametric prior on F as described in Section 2.2.1, this assumption becomes

$$P\big(\|\phi(F) - \phi(F_0)\| \le C n^{-1/2} (\log n)^{1/2} \,\big|\, D_n\big) \to^{p} 1.$$

Primitive conditions for the validity of this case can be found in a recent work by Rivoirard and Rousseau (2012). On the other hand, if we parametrize the model as F = {F_{ϕ,η} : ϕ ∈ Φ, η ∈ P} as described in Section 2.2.2, with η being an infinite-dimensional nuisance parameter, a sufficient condition for Assumption 4.1 is found in both Bickel and Kleijn (2012) and the appendix of this paper.

Instead of assuming continuity of dH(Θ(ϕ), Θ(ϕ0)) with respect to ϕ, which is sufficient to get the concentration rate of Θ(ϕ), we place a less demanding assumption that still allows us to get the concentration rate. With this aim, we consider a more specific partially identified model: the moment inequality model, which assumes that θ satisfies k moment restrictions:

Ψ(θ, ϕ)≤0, Ψ(θ, ϕ) = (Ψ1(θ, ϕ), ...,Ψk(θ, ϕ))T (4.1)

where Ψ : Θ × Φ → R^k is a known function of (θ, ϕ). The model depends on the data X via the point identified parameter ϕ. In the moment inequality model, the identified set can be characterized as:

$$\Theta(\phi) = \{\theta \in \Theta : \Psi(\theta, \phi) \le 0\}. \qquad (4.2)$$

Since most partially identified models can be characterized as moment inequality models, model (4.1)-(4.2) has received extensive attention in the partial identification literature.

Assumption 4.2. The parameter space Θ × Φ is compact.

Assumption 4.3. {Ψ(θ, ·) : θ ∈ Θ} is Lipschitz equi-continuous on Φ, that is, for some K > 0, ∀ϕ1, ϕ2 ∈ Φ,

$$\sup_{\theta \in \Theta} \|\Psi(\theta, \phi_1) - \Psi(\theta, \phi_2)\| \le K \|\phi_1 - \phi_2\|.$$

Given the compactness of Θ, this assumption is satisfied by many interesting examples of moment inequality models.

Assumption 4.4. There exists a closed neighborhood U(ϕ0) of ϕ0 such that for any a_n = O(1) and any ϕ ∈ U(ϕ0), there exists C_ϕ > 0, which might depend on ϕ, such that

$$\inf_{\theta:\, d(\theta, \Theta(\phi)) \ge C_\phi a_n}\ \max_{i \le k} \Psi_i(\theta, \phi) > a_n.$$

Intuitively, when θ is bounded away from Θ(ϕ) (up to a rate a_n), at least one of the moment inequalities is violated, which means max_{i≤k} Ψi(θ, ϕ) > 0. This assumption quantifies how much max_{i≤k} Ψi(θ, ϕ) departs from zero. This is a sufficient condition for the partial identification condition in Chernozhukov, Hong and Tamer (2007). If we define

$$Q(\theta, \phi) = \|\max(\Psi(\theta, \phi), 0)\| = \Big[\sum_{i=1}^{k} \big(\max(\Psi_i(\theta, \phi), 0)\big)^2\Big]^{1/2},$$

then Q(θ, ϕ) = 0 if and only if θ ∈ Θ(ϕ). The partial identification condition in Chernozhukov et al. (2007, Condition (4.5)) assumes that there exists K > 0 so that for all θ,

Q(θ, ϕ)≥Kd(θ,Θ(ϕ)), (4.3)

which says Q should be bounded below by a number proportional to the distance from the identified set if θ is bounded away from the identified set. Assumption 4.4 is a sufficient condition for (4.3).

Example 4.1 (Interval censored data - continued). In the interval censoring data example, Ψ(θ, ϕ) = (θ − ϕ2, ϕ1 − θ)^T, and for any ϕ = (ϕ1, ϕ2) and ϕ̃ = (ϕ̃1, ϕ̃2), ∥Ψ(θ, ϕ) − Ψ(θ, ϕ̃)∥ = ∥ϕ − ϕ̃∥. This verifies Assumption 4.3. Moreover, for any θ such that d(θ, Θ(ϕ)) ≥ a_n, either θ ≤ ϕ1 − a_n or θ ≥ ϕ2 + a_n. If θ ≤ ϕ1 − a_n, then Ψ2(θ, ϕ) = ϕ1 − θ ≥ a_n; if θ ≥ ϕ2 + a_n, then Ψ1(θ, ϕ) = θ − ϕ2 ≥ a_n. This verifies Assumption 4.4.
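The two verifications in Example 4.1 can be checked numerically. The sketch below, with hypothetical function names, computes the ratio ∥Ψ(θ, ϕ) − Ψ(θ, ϕ̃)∥ / ∥ϕ − ϕ̃∥, which should equal one for every θ, and the largest moment function, which should coincide with the distance to [ϕ1, ϕ2] whenever θ lies outside the interval.

```python
import math

def psi(theta, phi):
    """Moment functions of the interval censoring example: (theta - phi2, phi1 - theta)."""
    phi1, phi2 = phi
    return (theta - phi2, phi1 - theta)

def dist_to_identified_set(theta, phi):
    """Distance from theta to the interval [phi1, phi2]."""
    phi1, phi2 = phi
    return max(phi1 - theta, theta - phi2, 0.0)

def lipschitz_ratio(theta, phi, phi_tilde):
    """||Psi(theta, phi) - Psi(theta, phi_tilde)|| / ||phi - phi_tilde|| (requires phi != phi_tilde)."""
    return math.dist(psi(theta, phi), psi(theta, phi_tilde)) / math.dist(phi, phi_tilde)

ratio = lipschitz_ratio(0.3, (0.0, 1.0), (0.5, 2.0))        # Assumption 4.3 with K = 1
violation = max(psi(1.7, (0.0, 1.0)))                       # largest moment function at theta = 1.7
distance = dist_to_identified_set(1.7, (0.0, 1.0))          # distance of theta = 1.7 from [0, 1]
```

In this example the violation equals the distance exactly, so Assumption 4.4 holds with any constant C_ϕ larger than one.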

The following theorem shows the concentration rate for the identified set.

Theorem 4.1. Under Assumptions 4.1-4.4, for some C > 0,

$$P\big(d_H(\Theta(\phi), \Theta(\phi_0)) > C n^{-1/2} (\log n)^{1/2} \,\big|\, D_n\big) \to^{p} 0.$$

Remark 4.1. The above result holds for both the nonparametric prior on ϕ(F) and the semi-parametric prior on (ϕ, η) as described in Section 2. The concentration rate is nearly parametric: n^{−1/2}(log n)^{1/2}. The term √(log n) arises commonly in the posterior concentration rate literature. The posterior probability in the theorem converges to zero, instead of only being smaller than an arbitrarily small constant. The same rate of convergence from the frequentist perspective has been achieved by Chernozhukov et al. (2007), Beresteanu and Molinari (2008), Kaido and Santos (2011), among others.

Remark 4.2. Recently Kitagawa (2012) obtained the posterior consistency for Θ(ϕ): for any ϵ > 0,

$$\lim_{n \to \infty} P\big(d_H(\Theta(\phi), \Theta(\phi_0)) > \epsilon \,\big|\, D_n\big) = 0$$

for almost every sampling sequence of Dn. This result was obtained for the case where θ is a scalar whose identified set Θ(ϕ) is a connected interval and dH(Θ(ϕ), Θ(ϕ0)) is assumed to be a continuous map of ϕ. In multi-dimensional cases where Θ(ϕ) is a more general convex set, however, verifying the continuity of dH(Θ(ϕ), Θ(ϕ0)) is much more technically involved, due to the challenge of computing the Hausdorff distance in multi-dimensional manifolds. In contrast, our Lipschitz equi-continuity condition in Assumption 4.3 is much easier to verify in specific examples, as it depends on the moment conditions directly.

5 Bayesian Inference of Support Function

This section develops Bayesian inference for the support function Sϕ(p) of the identified set; the results are used to construct credible sets in the next section. In this section, we first develop an asymptotically valid linearization in ϕ of the support function. Based on this result we show that posterior consistency can be achieved and prove the Bernstein-von Mises theorem for the support function.

5.1 Moment Inequality Model

Our analysis focuses on identified sets which are closed and convex. These sets are completely determined by their support functions, and efficient estimation of support functions may lead to optimality of estimation and inference of the identified set. As a result, much of the new development in the partial identification literature focuses on the support function, e.g., Kaido and Santos (2011), Kaido (2012), Beresteanu and Molinari (2008), Bontemps, Magnac and Maurin (2012).

In the moment inequality model, Θ(ϕ) := {θ ∈ Θ; Ψ(θ, ϕ) ≤ 0}, where Ψ(θ, ϕ) is given in (4.1) and each component of Ψ(θ, ϕ) is a convex function of θ for every ϕ ∈ Φ, as stated in the next assumption.

Assumption 5.1. Ψ(θ, ϕ) is continuous in (θ, ϕ) and convex in θ for every ϕ ∈ Φ.

Let us consider the support function Sϕ(·) : S^d → R of the identified set Θ(ϕ). We restrict its domain to the unit sphere S^d in R^d since Sϕ(p) is positively homogeneous in p. Under Assumption 5.1 the support function is the optimal value of an ordinary convex program:

$$S_\phi(p) = \sup_{\theta \in \Theta} \{p^T \theta;\ \Psi(\theta, \phi) \le 0\}$$

and therefore it also admits a Lagrangian representation (see Rockafellar, chapter 28):

$$S_\phi(p) = \sup_{\theta \in \Theta} \{p^T \theta - \lambda(p, \phi)^T \Psi(\theta, \phi)\}, \qquad (5.1)$$

where $\lambda(p, \phi) : S^d \times R^{d_\phi} \to R^k_+$ is a k-vector of Lagrange multipliers. Note that d_ϕ is the dimension of ϕ.

We denote by Ψ_S(θ, ϕ0) the k_S-subvector containing the constraints that are strictly convex functions of θ and by Ψ_L(θ, ϕ0) the k_L constraints that are linear in θ. Obviously, k_S + k_L = k. The corresponding Lagrange multipliers are denoted by λ_S(p, ϕ0) and λ_L(p, ϕ0), respectively, for p ∈ S^d. Moreover, define Ξ(p, ϕ) = arg max_{θ∈Θ} {p^T θ; Ψ(θ, ϕ) ≤ 0} as the support set of Θ(ϕ). Then, by definition,

$$p^T \theta = S_\phi(p), \qquad \forall \theta \in \Xi(p, \phi).$$

We also denote by ∇ϕΨ(θ, ϕ) the k × d_ϕ matrix of partial derivatives of Ψ with respect to ϕ. Let B(ϕ0, δ) = {ϕ ∈ Φ; ∥ϕ − ϕ0∥ ≤ δ} denote a closed ball centered at ϕ0 with radius δ. For every ϕ ∈ B(ϕ0, r_n) and θ ∈ Θ(ϕ), we denote by Act(θ, ϕ) := {i; Ψi(θ, ϕ) = 0} the set of active inequality constraint indices and by d_A(θ, ϕ) the number of its elements. For every i ∈ Act(θ, ϕ), ∇θΨi(θ, ϕ) denotes the d-vector of partial derivatives of Ψi with respect to θ. We assume the following:

Assumption 5.3. There is some δ > 0 such that for all ϕ ∈ B(ϕ0, δ), we have:

(i) the k × d_ϕ matrix ∇ϕΨ(θ, ϕ) exists and is continuous in (θ, ϕ);

(ii) the set Θ(ϕ) is non-empty;

(iii) there exists a θ ∈ Θ such that Ψ(θ, ϕ) < 0;

(iv) Θ(ϕ) ⊂ int(Θ), where int(Θ) denotes the interior of Θ;

(v) for every i ∈ Act(θ, ϕ0), with θ ∈ Θ(ϕ0), the vector ∇θΨi(θ, ϕ) exists and is continuous in (θ, ϕ) ∈ Θ × B(ϕ0, δ).

Assumption 5.3 (iii) implies Assumption 5.3 (ii). However, we prefer to keep both conditions since, in order to establish some technical results, we only need condition (ii), which is weaker.

The next assumption concerns the active inequality constraints. In particular, Assumption 5.4 (i) may be restrictive in the one-dimensional case (i.e. d = 1) but is easily verified in cases with d > 1. For instance, in Example 2.1 this assumption is not verified in the degenerate case where ϕ1 = ϕ2. Assumption 5.4 (ii) says that the active inequality constraint gradients ∇θΨi(θ, ϕ0) are linearly independent. This assumption guarantees that a θ which solves the optimization problem (5.1) with ϕ = ϕ0 satisfies the Kuhn-Tucker conditions. Alternative assumptions that are weaker than Assumption 5.4 (ii) could be used, but the advantage of Assumption 5.4 (ii) is that it is easy to check.

Assumption 5.4. (i) d_A(θ, ϕ0) ≤ d for every θ ∈ Θ(ϕ0), where d_A(θ, ϕ0) is the number of active constraints;

(ii) the gradient vectors {∇θΨi(θ, ϕ)}_{i∈Act(θ,ϕ0)} are linearly independent ∀θ ∈ Θ(ϕ0).

The following assumption is sufficient for the differentiability of the support function at ϕ0:

Assumption 5.5. At least one of the following holds:

(i) For the ball B(ϕ0, δ) in Assumption 5.3, for every (p, ϕ)∈ Sd×B(ϕ0, δ), Ξ(p, ϕ) is a

singleton;

(ii) There are linear constraints in Ψ(θ, ϕ0), which are also separable in θ, that is, Ψ_L(θ, ϕ0) = A1θ + A2(ϕ0) for some function A2 : Φ → R^{k_L} (not necessarily linear) and some (k_L × d)-matrix A1.

Assumption 5.5 is particularly important for the linearization of the support function that we develop in section 5.2. In fact, if one of the two parts of Assumption 5.5 holds, then the support function is differentiable for every (p, ϕ) ∈ S^d × B(ϕ0, δ) and we have a closed form for its derivative.

The last set of assumptions that we introduce will be used to prove the Bernstein-von Mises theorem for Sϕ(·) and allows us to strengthen the result of Theorem 5.1 below. The first three assumptions are (local) Lipschitz equi-continuity assumptions.

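Assumption 5.6. For the ball B(ϕ0, δ) in Assumption 5.3, for some K1, K2, K3 > 0 and every ϕ1, ϕ2 ∈ B(ϕ0, δ):

(i) $\sup_{p \in S^d} \|\lambda(p, \phi_1) - \lambda(p, \phi_2)\| \le K_1 \|\phi_1 - \phi_2\|$;

(ii) $\sup_{\theta \in \Theta} \|\nabla_\phi \Psi(\theta, \phi_1) - \nabla_\phi \Psi(\theta, \phi_2)\| \le K_2 \|\phi_1 - \phi_2\|$;

(iii) $\|\nabla_\phi \Psi(\theta_1, \phi_0) - \nabla_\phi \Psi(\theta_2, \phi_0)\| \le K_3 \|\theta_1 - \theta_2\|$, for every θ1, θ2 ∈ Θ;

(iv) If Ξ(p, ϕ0) is a singleton ∀p ∈ W for some compact subset W ⊆ S^d, then there exists an ε = O(δ) such that Ξ(p, ϕ1) ⊆ Ξ^ε(p, ϕ0).

We show in the following example that Assumptions 5.1-5.6 are easily satisfied.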

Example 5.1 (Interval censored data - continued). The setup is the same as in Example 2.1. Assumption 5.2 is verified if Y1 and Y2 are two random variables with finite first moments ϕ0,1 and ϕ0,2, respectively. Moreover, Ψ(θ, ϕ) = (ϕ1 − θ, θ − ϕ2)^T, ϕ = (ϕ1, ϕ2)^T,

$$\nabla_\phi \Psi(\theta, \phi) = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},$$

so that Assumptions 5.1, 5.2 and 5.3 (i)-(ii) are trivially satisfied. Assumption 5.3 (iii) holds for every θ inside (ϕ1, ϕ2); Assumption 5.3 (iv) is satisfied if ϕ1 and ϕ2 are bounded. To see that Assumptions 5.3 (v) and 5.4 are satisfied, note that ∀θ < ϕ0,1 we have Act(θ, ϕ0) = {1}, ∀θ > ϕ0,2 we have Act(θ, ϕ0) = {2}, while ∀θ ∈ [ϕ0,1, ϕ0,2] we have Act(θ, ϕ0) = ∅. Assumptions 5.5 (i) and (ii) are both satisfied since the support set takes the values Ξ(1, ϕ) = ϕ2 and Ξ(−1, ϕ) = ϕ1, and the constraints in Ψ(θ, ϕ0) are both linear with A1 = (−1, 1)^T and A2(ϕ0) = ∇ϕΨ(θ, ϕ0)ϕ0.

In order to verify Assumption 5.6, we use the largest eigenvalue as the matrix norm. The eigenvalues of ∇ϕΨ(θ, ϕ) are {1, −1} for every θ and ϕ. Hence, Assumptions 5.6 (ii)-(iii) are verified. The Lagrange multiplier is λ(p, ϕ) = (−p I(p < 0), p I(p ≥ 0))^T, so that Assumption 5.6 (i) is satisfied since the norm is equal to 0. Finally, the support set Ξ(p, ϕ) = ϕ1 I(p < 0) + ϕ2 I(p ≥ 0) is a singleton for every ϕ ∈ B(ϕ0, δ) and Ξ^ε(p, ϕ0) = {θ ∈ Θ; ∥θ − θ∗∥ ≤ ε}, where θ∗ = Ξ(p, ϕ0) = ϕ0,1 I(p < 0) + ϕ0,2 I(p ≥ 0). Therefore, ∥Ξ(p, ϕ) − θ∗∥ ≤ δ and Assumption 5.6 (iv) holds with ε = δ.
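The quantities in Example 5.1 are simple enough to evaluate directly. The sketch below, with hypothetical function names, codes the support function of [ϕ1, ϕ2] on the one-dimensional unit sphere {−1, 1}, the Lagrange multipliers stated in the example, and the Lagrangian objective of (5.1). With these multipliers the objective does not depend on θ, so its supremum is read off at any θ and can be compared with the support function.

```python
def support_function(p, phi):
    """S_phi(p) = sup{p * theta : phi1 <= theta <= phi2} for p in {-1, 1}."""
    phi1, phi2 = phi
    return p * phi2 if p >= 0 else p * phi1

def multipliers(p):
    """lambda(p, phi) = (-p * I(p < 0), p * I(p >= 0)); it does not depend on phi."""
    return (-p if p < 0 else 0.0, p if p >= 0 else 0.0)

def lagrangian(theta, p, phi):
    """p * theta - lambda(p)^T Psi(theta, phi), with Psi = (phi1 - theta, theta - phi2)."""
    phi1, phi2 = phi
    lam1, lam2 = multipliers(p)
    return p * theta - lam1 * (phi1 - theta) - lam2 * (theta - phi2)

phi = (1.0, 3.0)
upper = support_function(1.0, phi)     # equals phi2
lower = support_function(-1.0, phi)    # equals -phi1
```

Evaluating `lagrangian` at several values of θ for p = 1 and p = −1 returns `upper` and `lower` respectively, which is the Lagrangian representation (5.1) in this example.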

5.2 Asymptotic Analysis

The support function of a closed and convex set is in general non-differentiable in p but it admits directional derivatives, see e.g. Milgrom and Segal (2002). Luckily, when Assumption 5.5 holds the derivative of the support function exists. We exploit this fact to derive an expansion in ϕ of the support function. This allows us to establish a Bernstein-von Mises type result for the posterior distribution of the support function.

The next theorem states that the support function can be locally approximated (asymptotically) by a linear function of ϕ1, ϕ2 ∈ B(ϕ0, r_n) for r_n = o(1) a bounded sequence depending on the sample size n. The expansion is stochastic when ϕ is interpreted as a random variable associated with the posterior distribution P(·|Dn).

is an N such that for every n ≥ N there exist: (i) a real function f(ϕ1, ϕ2) defined for every ϕ1, ϕ2 ∈ B(ϕ0, r_n) and (ii) a function $\lambda(p, \phi_0) : S^d \times R^{d_\phi} \to R^k_+$ such that for every ϕ1, ϕ2 ∈ B(ϕ0, r_n):

$$\sup_{p \in S^d} \Big( \big(S_{\phi_1}(p) - S_{\phi_2}(p)\big) - \lambda(p, \phi_0)^T \nabla_\phi \Psi(\theta_*(p), \phi_0) [\phi_1 - \phi_2] \Big) = f(\phi_1, \phi_2)$$

and $f(\phi_1, \phi_2) / \|\phi_1 - \phi_2\| \to 0$ uniformly in ϕ1, ϕ2 ∈ B(ϕ0, r_n) as n → ∞.

We remark that the functions λ and θ∗ do not depend on the specific choice of ϕ1 and ϕ2 inside B(ϕ0, r_n), but only on p and the true value ϕ0. With the approximation given in the theorem we are now ready to state posterior consistency (with concentration rate) and asymptotic normality of the posterior distribution of Sϕ(p). In the next theorems we will set r_n = (log n)^{1/2} n^{−1/2} when n is sufficiently large.

Theorem 5.2. Under Assumption 4.1 and the assumptions of Theorem 5.1 with $r_n = \sqrt{(\log n)/n}$, for some C > 0,

$$P\Big(\sup_{p \in S^d} |S_\phi(p) - S_{\phi_0}(p)| < C (\log n)^{1/2} n^{-1/2} \,\Big|\, D_n\Big) \to^{p} 1. \qquad (5.2)$$

Remark 5.1. Notice that $d_H(\Theta(\phi), \Theta(\phi_0)) = \sup_{p \in S^d} |S_\phi(p) - S_{\phi_0}(p)|$. Therefore, (5.2) is another statement of Theorem 4.1. However, they are obtained by different proof strategies. In particular, Theorem 5.2 is obtained as a byproduct of the Bernstein-von Mises theorem stated in Theorem 5.3 below and is based on the asymptotic local expansion of the support function as in Theorem 5.1. As will be shown below, this expansion also yields the Bernstein-von Mises theorem for the support function, that is, the posterior of the support function is asymptotically normal.
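The identity at the start of Remark 5.1 is easy to check for compact intervals on the real line, where the unit sphere is {−1, 1}. The sketch below, with hypothetical function names, computes the Hausdorff distance from its definition (the one-sided suprema are attained at interval endpoints because the distance to an interval is a convex function) and compares it with the largest gap between support functions.

```python
def dist_to_interval(x, I):
    """Distance from the point x to the closed interval I = (lo, hi)."""
    return max(I[0] - x, x - I[1], 0.0)

def hausdorff_intervals(A, B):
    """Hausdorff distance between closed intervals, evaluated at the endpoints."""
    return max(
        max(dist_to_interval(a, B) for a in A),
        max(dist_to_interval(b, A) for b in B),
    )

def support(p, I):
    """Support function of the interval I in direction p."""
    return p * I[1] if p >= 0 else p * I[0]

def sup_support_gap(A, B):
    """sup over p in {-1, 1} of |S_A(p) - S_B(p)|."""
    return max(abs(support(p, A) - support(p, B)) for p in (-1.0, 1.0))

A, B = (0.0, 1.0), (0.25, 1.5)
d_H = hausdorff_intervals(A, B)
gap = sup_support_gap(A, B)
```

For these two intervals both quantities are driven by the upper endpoints, which differ by 0.5.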

We now state a Bernstein-von Mises (BvM) theorem for the support function. This theorem is valid under the assumption that a BvM theorem holds for the posterior distribution of the finite-dimensional identified parameter ϕ. We denote by ∥ · ∥_TV the total variation distance, that is, for two probability measures P and Q,

$$\|P - Q\|_{TV} := \sup_{B} |P(B) - Q(B)|,$$

where B is an element of the σ-algebra on which P and Q are defined.
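For distributions on a finite set the supremum over events can be enumerated, which gives a quick check of the definition against the familiar half-L1 formula. The sketch below is illustrative only; the function names are hypothetical and the brute-force search is exponential in the number of atoms.

```python
from itertools import combinations

def tv_by_definition(p, q):
    """sup over all events B of |P(B) - Q(B)| for probability vectors p and q."""
    idx = range(len(p))
    best = 0.0
    for r in range(len(p) + 1):
        for B in combinations(idx, r):
            best = max(best, abs(sum(p[i] for i in B) - sum(q[i] for i in B)))
    return best

def tv_half_l1(p, q):
    """Equivalent closed form: half the L1 distance between the probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
tv = tv_by_definition(p, q)
```

The maximising event here is the set of atoms where p exceeds q, which is the general pattern behind the half-L1 formula.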

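Assumption 5.7. Let $P_{\sqrt{n}(\phi - \phi_0)|D_n}$ denote the posterior distribution of $\sqrt{n}(\phi - \phi_0)$. We assume

$$\big\|P_{\sqrt{n}(\phi - \phi_0)|D_n} - N_{d_\phi}\big(\tilde{\Delta}_{n,\phi_0}, \tilde{I}_{\phi_0}^{-1}\big)\big\|_{TV} \to^{p} 0,$$

where $N_{d_\phi}$ denotes the d_ϕ-dimensional normal distribution, $\tilde{\Delta}_{n,\phi_0} := n^{-1/2} \sum_{i=1}^{n} \tilde{I}_{\phi_0}^{-1} \tilde{l}_{\phi_0}(X_i)$, $\tilde{l}_{\phi_0}$ is the semi-parametric efficient score function of the model and $\tilde{I}_{\phi_0}$ denotes the semi-parametric efficient information matrix.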

notation, remark that $\tilde{l}_{\phi_0}$ and $\tilde{I}_{\phi_0}$ also depend either on the true η0 or on the true F0, depending on whether the model has been re-parameterized or not. The semi-parametric efficient score function and the semi-parametric efficient information contribute to the stochastic local asymptotic normality (LAN) expansion of the integrated likelihood, which is necessary in order to get the BvM result in Assumption 5.7. A precise definition of $\tilde{l}_{\phi_0}$ and $\tilde{I}_{\phi_0}$ may be found in van der Vaart (2002) (Definition 2.15).

Theorem 5.3. If the assumptions of Theorem 5.1 and Assumption 5.6 hold with $\delta = r_n = \sqrt{(\log n)/n}$, then under Assumption 5.7:

$$\big\|P_{\sqrt{n} \sup_{p \in S^d} (S_\phi(p) - S_{\phi_0}(p))|D_n} - N\big(\bar{\Delta}_{n,\phi_0}, \bar{I}_{\phi_0}^{-1}\big)\big\|_{TV} \to^{p} 0,$$

where $\bar{\Delta}_{n,\phi_0} = \lambda(p, \phi_0)^T \nabla_\phi \Psi(\theta_*(p), \phi_0)\, \tilde{\Delta}_{n,\phi_0}$ and

$$\bar{I}_{\phi_0}^{-1} = \lambda(p, \phi_0)^T \nabla_\phi \Psi(\theta_*(p), \phi_0)\, \tilde{I}_{\phi_0}^{-1}\, \nabla_\phi \Psi(\theta_*(p), \phi_0)^T \lambda(p, \phi_0).$$

The asymptotic mean and covariance matrix may be easily estimated by replacing ϕ0 by any consistent estimator ϕ̂ of ϕ0. Thus θ∗(p) is replaced by an element θ̂∗(p) ∈ Ξ(p, ϕ̂), and an estimate of λ(p, ϕ0) is obtained by solving, possibly numerically, the ordinary convex program in (5.1) with ϕ0 replaced by ϕ̂.
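As an illustration of this plug-in step, the sketch below evaluates the quadratic form for the asymptotic variance in the interval censoring example of Example 5.1, where λ(p, ϕ) and ∇ϕΨ are known in closed form. Using the sample covariance of (Y1, Y2) in place of the inverse efficient information is an assumption made here for the case where ϕ is an unrestricted mean vector; the function names are hypothetical.

```python
def sample_cov(y1, y2):
    """2 x 2 sample covariance matrix of (Y1, Y2), with divisor n."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    c11 = sum((a - m1) ** 2 for a in y1) / n
    c22 = sum((b - m2) ** 2 for b in y2) / n
    c12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return [[c11, c12], [c12, c22]]

def support_variance(p, cov):
    """lambda(p)^T G cov G^T lambda(p) with G = diag(1, -1), for p in {-1, 1}."""
    lam = (-p if p < 0 else 0.0, p if p >= 0 else 0.0)   # multipliers from Example 5.1
    grad = ((1.0, 0.0), (0.0, -1.0))                     # gradient of Psi with respect to phi
    g = [sum(lam[i] * grad[i][j] for i in range(2)) for j in range(2)]
    return sum(g[i] * cov[i][j] * g[j] for i in range(2) for j in range(2))

cov = [[2.0, 0.5], [0.5, 3.0]]
var_upper = support_variance(1.0, cov)    # variance attached to the upper endpoint phi2
var_lower = support_variance(-1.0, cov)   # variance attached to the lower endpoint phi1
```

In this example the quadratic form simply selects the variance of Y2 for p = 1 and the variance of Y1 for p = −1, as one would expect for the endpoints of the interval.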

Remark 5.2. The posterior asymptotic variance of the support function, $\bar{I}_{\phi_0}^{-1}$, is the same as that of the frequentist estimator obtained by Kaido and Santos (2012, Theorem 3.2), and both are derived from a linear expansion of the support function. On the one hand, the linear expansion of Theorem 5.1 is obtained from expanding Ψ(θ∗(p), ϕ) − Ψ(θ∗(p), ϕ0) in a neighborhood of ϕ0. This gives the asymptotic variance $\nabla_\phi \Psi(\theta_*(p), \phi_0)\, \tilde{I}_{\phi_0}^{-1}\, \nabla_\phi \Psi(\theta_*(p), \phi_0)^T$, which is semi-parametric efficient for estimating Ψ(θ∗(p), ϕ0), as guaranteed by the Bernstein-von Mises theorem proved by Bickel and Kleijn (2012). On the other hand, the frequentist estimator of the support function in Kaido and Santos (2012) has a linear representation in terms of $\hat{\Psi}(\theta_*(p)) - \Psi(\theta_*(p), \phi_0)$, where $\hat{\Psi}(\theta_*(p))$ is a sample analog of Ψ(θ∗(p), ϕ0) and is therefore semi-parametric efficient. This implies that the asymptotic variances of the support function from the Bayesian and frequentist approaches are the same.

The asymptotic normality of the posterior of Sϕ(p) also implies that the posterior coverage and the coverage under the limiting normal of our two-sided BCS for the identified set, which we construct in the next section, are the same.

6 Bayesian Credible Sets

6.1 Credible set for θ

Bayesian inference on θ can be carried out through finite-sample Bayesian credible sets (BCS). A BCS is a set BCS(τ) such that

$$P(\theta \in BCS(\tau)|D_n) = 1 - \tau \qquad (6.1)$$

at level 1 − τ, for τ ∈ (0, 1). Clearly, such a set is not unique. One popular choice of credible set is the highest-probability-density (HPD) set, which has been widely used in empirical studies and also in the Bayesian partial identification literature, e.g., Moon and Schorfheide (2012) and Norets and Tang (2012).

The BCS can then be compared with the frequentist confidence set (FCS). Let P_{Dn}(·) denote the probability measure based on the sampling distribution, where (θ, ϕ, η) = (θ0, ϕ0, η0) or (θ, F) = (θ0, F0). A frequentist confidence set FCS(τ) for θ0 satisfies

$$\lim_{n \to \infty} \inf_{\phi \in \Phi} \inf_{\theta_0 \in \Theta(\phi)} P_{D_n}(\theta_0 \in FCS(\tau)) \ge 1 - \tau.$$

Various procedures have been proposed in the frequentist literature to construct an FCS(τ) that satisfies the above inequality. One of the key properties of these proposed FCS is that they are based on some consistent estimator ϕ̂ of ϕ0, and Θ(ϕ̂) ⊂ FCS(τ). Moon and Schorfheide (2012) compared HPD with FCS and showed that in a parametric Bayesian model with known likelihood, for any τ > 0, P(θ ∈ HPD(τ), θ ∉ FCS(τ)|Dn) = op(1), that is, the FCS is too large for Bayesian inference. Under the more robust semi-parametric Bayesian setup, the frequentist confidence set is also “too big” from the Bayesian point of view, as shown by Theorem 6.1 below.

The following assumption is needed.

Assumption 6.1. (i) The frequentist FCS(τ) is such that there is ϕ̂ with ∥ϕ̂ − ϕ0∥ = op(1) satisfying Θ(ϕ̂) ⊂ FCS(τ). (ii) sup_{(θ,ϕ)∈Θ×Φ} π(θ|ϕ) < ∞.

Many frequentist FCS's satisfy condition (i), see, e.g., Imbens and Manski (2004), Chernozhukov, Hong and Tamer (2007), Rosen (2008), Andrews and Soares (2010), etc. Condition (ii) is easy to verify since Θ × Φ is compact. Examples of π(θ|ϕ) include: the uniform prior with density

$$\pi(\theta|\phi) = \mu(\Theta(\phi))^{-1} I_{\theta \in \Theta(\phi)},$$

where µ(·) denotes the Lebesgue measure; and the truncated normal prior with density

$$\pi(\theta|\phi) = \Big[\int_{\Theta(\phi)} h(x; \lambda, \Sigma)\, dx\Big]^{-1} h(\theta; \lambda, \Sigma)\, I_{\theta \in \Theta(\phi)},$$

where h(x; λ, Σ) is the density function of the multivariate normal distribution N(λ, Σ).
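Both example priors are straightforward to code when θ is scalar and Θ(ϕ) = [lo, hi] is a non-degenerate interval. The sketch below makes exactly that simplifying assumption; the function names and the parametrisation by a mean and a standard deviation are illustrative choices, not part of the original text.

```python
import math

def uniform_prior(theta, lo, hi):
    """Uniform conditional prior density on the identified set [lo, hi], lo < hi."""
    return 1.0 / (hi - lo) if lo <= theta <= hi else 0.0

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def truncated_normal_prior(theta, lo, hi, mean, sd):
    """N(mean, sd^2) density restricted to [lo, hi] and renormalised to integrate to one."""
    if not lo <= theta <= hi:
        return 0.0
    mass = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
    return normal_pdf(theta, mean, sd) / mass
```

Both densities are bounded on the interval, in line with condition (ii) of Assumption 6.1, as long as the interval length (for the uniform prior) and the normal mass on the interval (for the truncated normal prior) stay away from zero.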

Theorem 6.1. Under Assumption 4.1, the assumptions of Theorem 5.1 with $r_n = \sqrt{(\log n)/n}$, and Assumption 6.1, for any τ > 0,

(i)