Univariate Bayesian models - Bayesian structural inference with applications in social science

Chapter 2 Background

2.3 Univariate Bayesian models

The basic building blocks of the models that we consider in Sections 2.4 and 2.5 are simple univariate models. In this section, we first describe conjugacy, a property that characterises a class of analytically tractable models. We then review the simplest form of the two conjugate models that are considered throughout this thesis. We assume that $Y$ is an $n$-dimensional random vector consisting of independent, identically distributed components. We suppose observations $y$ of $Y$ are available.

2.3.1 Conjugate priors

The integration required to evaluate Equation 2.2 is analytically intractable for many choices of prior for a given model. If our understanding is such that our prior needs to take a form for which the integration is intractable, numerical methods of integration are required. If, however, we are able to adopt a prior for which the integration is straightforward, this difficulty can be avoided. For the models that we consider in this thesis, priors of the required form are well known, and are called conjugate priors.

Conjugate priors (Raiffa and Schlaifer, 1961) are families $P$ of distributions that are closed under sampling from a distribution in a family $F$ of distributions. A family $P$ of prior distributions is said to be closed under sampling from a distribution $p(y \mid \theta)$ in a parametric family $F$ if, for every prior distribution $\pi(\theta) \in P$, the posterior distribution $p(\theta \mid y) \propto \pi(\theta)\,p(y \mid \theta)$ is also in $P$. A catalogue of many conjugate priors is given in Gelman et al. (2004).

Raiffa and Schlaifer (1961) list three properties that they view as desirable in a family of priors: tractability, interpretability and richness. Conjugate families are tractable, and this is the main reason for their adoption. Conjugate priors sometimes also have a simple interpretation. In exponential families we can consider the prior as constituting “virtual samples” (see, e.g. Robert, 2007), and so the relative weight implied on the prior and data can be ascertained. It is in richness, however, that conjugate families can be lacking. Ideally, a prior should exactly match a Bayesian modeller’s prior beliefs, but conjugate priors are often not flexible enough to allow this to be fully achieved. Sometimes, a close approximation to prior beliefs can be constructed within the conjugate family, but often a poor approximation is accepted because of the computational advantages of conjugate priors.

In many standard Bayesian models, using non-conjugate priors has become feasible with the emergence of easily available, computationally intensive approximation methods. However, in the setting considered here, non-conjugate priors are not viable for the following reasons.

First, there are formidable computational challenges even when conjugate priors are used. These challenges are considerably compounded by the use of non-conjugate priors. Additionally, we will be exclusively considering settings in which the sample size of the data is large. The large sample size means that the prior will exert only a minimal effect on the posterior distribution, thus making its exact specification less important.

For these reasons, we use conjugate priors throughout.

2.3.2 Multinomial-Dirichlet

The standard Bayesian model for univariate multinomial data (e.g. O’Hagan and Forster, 2004) will form the basis of the models we consider in this thesis. Consider a random vector $Y$, each component of which takes one of $r$ discrete categories. Suppose that $Y$ is distributed according to a multinomial distribution, with parameter vector $\theta = (\theta_1, \ldots, \theta_r)$, with $\theta_k > 0$ for each $k$ and $\theta_1 + \cdots + \theta_r = 1$.

$$Y \sim \mathrm{Mult}(\theta_1, \ldots, \theta_r)$$

The conjugate prior for the vector $\theta$ is Dirichlet, with hyperparameters $\alpha = (\alpha_1, \ldots, \alpha_r)$, where $\alpha_k > 0$, $k = 1, \ldots, r$.

$$\theta_1, \ldots, \theta_r \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_r), \quad \text{with } \theta_1, \ldots, \theta_r \ge 0 \text{ and } \sum_{k=1}^{r} \theta_k = 1.$$

The normalising factor in the Dirichlet density is a ratio of gamma functions, where $\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$ and, in particular, $\Gamma(\alpha) = (\alpha-1)!$ for $\alpha \in \mathbb{N}$.

$$p(\theta_1, \ldots, \theta_r) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_r)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_r)} \prod_{k=1}^{r} \theta_k^{\alpha_k - 1}$$

The mean of each $\theta_k$ is $\alpha_k \left( \sum_{j=1}^{r} \alpha_j \right)^{-1}$.
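The density and the gamma-function identity above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `dirichlet_log_density` and the example values are illustrative assumptions, not part of the original text. It works on the log scale with `math.lgamma` and, as a simplification, requires every component of theta to be strictly positive.

```python
import math
from typing import Sequence


def dirichlet_log_density(theta: Sequence[float], alpha: Sequence[float]) -> float:
    """Log of the Dirichlet density at theta (hypothetical helper).

    Simplifying assumption: every theta_k is strictly positive, so that
    log(theta_k) is always defined.
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("every alpha_k must be strictly positive")
    if any(t <= 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be strictly positive and sum to one")
    # Log of the normalising factor: Gamma(sum alpha) / prod Gamma(alpha_k).
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # Log of the kernel: sum over k of (alpha_k - 1) * log(theta_k).
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel


# Gamma(alpha) = (alpha - 1)! for positive integers alpha.
assert math.isclose(math.gamma(5), math.factorial(4))

# Under Dir(1, 1, 1) the density is constant over the simplex and equals
# Gamma(3) / Gamma(1)^3 = 2.
density = math.exp(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
```

Working with log-gamma rather than the gamma function itself avoids overflow when the hyperparameters are large.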

The posterior distribution of $\theta$ is parameterised in terms of a contingency table constructed from the observations $y$, such that $n_k$ is the number of observations in the $k$th category, $k = 1, \ldots, r$.

$$\theta_1, \ldots, \theta_r \mid y \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_r + n_r)$$
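The update above amounts to adding the category counts to the prior hyperparameters. The following is a minimal sketch in standard-library Python; the function names, category labels and data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def dirichlet_posterior(
    alpha: Sequence[float],
    observations: Sequence[Hashable],
    categories: Sequence[Hashable],
) -> List[float]:
    """Return the posterior hyperparameters alpha_k + n_k (hypothetical helper)."""
    if len(alpha) != len(categories):
        raise ValueError("alpha and categories must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("every alpha_k must be strictly positive")
    counts = Counter(observations)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"observations outside the category set: {unknown}")
    # n_k is the number of observations in the k-th category.
    return [a + counts[c] for a, c in zip(alpha, categories)]


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    """Mean of each theta_k under Dir(alpha): alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Illustrative data: r = 3 categories with a uniform Dir(1, 1, 1) prior.
categories = ["a", "b", "c"]
prior = [1.0, 1.0, 1.0]
y = ["a", "b", "a", "c", "a", "a", "b"]

posterior = dirichlet_posterior(prior, y, categories)
posterior_mean = dirichlet_mean(posterior)
print(posterior)       # [5.0, 3.0, 2.0]
print(posterior_mean)  # [0.5, 0.3, 0.2]
```

In this example the prior contributes three "virtual" observations against seven real ones, which is one way to read the relative weight of prior and data.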

The formulation simplifies in the natural manner for binomial data with beta priors. In using this formulation, we are assuming that the data are independent, identically distributed draws from a multinomial distribution. It will often be the case that some heterogeneity exists and so it is more appropriate to use a model that is conditional on some collection of covariates; we consider this possibility in Section 2.4.1.
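For concreteness, the binomial case mentioned above corresponds to $r = 2$. As a sketch, writing $\theta = \theta_1$ (so that $\theta_2 = 1 - \theta$) and letting $n_1$ and $n_2$ denote the numbers of observations in the two categories, the prior, its density and the posterior reduce to:

```latex
\begin{align*}
\theta &\sim \mathrm{Beta}(\alpha_1, \alpha_2), \\
p(\theta) &= \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}\,
             \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}, \\
\theta \mid y &\sim \mathrm{Beta}(\alpha_1 + n_1, \alpha_2 + n_2).
\end{align*}
```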

2.3.3 Normal inverse-gamma

The models for normally distributed data that we consider will similarly build upon standard univariate models (e.g. Gelman et al., 2004). Suppose we have a random vector $Y$, components of which are independent random variables distributed according to a normal distribution, with mean $\mu$ and variance $\sigma^2$.

$$Y \sim N(\mu, \sigma^2)$$

When both $\mu$ and $\sigma^2$ are unknown, the conjugate priors for $\mu$ and $\sigma^2$ are normal and inverse-gamma respectively.

$$\mu \mid \sigma^2 \sim N(m, v^{-1}\sigma^2), \qquad \sigma^2 \sim \mathrm{IG}(a, b)$$

The hyperparameters $a$ and $b$ are respectively the shape and scale parameters of the inverse-gamma distribution. The hyperparameter $m$ can be interpreted as the prior mean, and $v$ is inversely proportional to the prior variance of $\mu$. The inverse-gamma distribution has density

$$\pi(\sigma^2) = \frac{b^a}{\Gamma(a)} (\sigma^2)^{-(a+1)} \exp(-b/\sigma^2).$$

The joint prior for $(\mu, \sigma^2)$ is thus normal inverse-gamma, $\mathrm{NIG}(m, v, a, b)$.

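$$\pi(\mu, \sigma^2) = \frac{\sqrt{v}}{\sigma\sqrt{2\pi}} \, \frac{b^a}{\Gamma(a)} (\sigma^2)^{-(a+1)} \exp\left( -\frac{2b + v(\mu - m)^2}{2\sigma^2} \right)$$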

By conjugacy, the joint posterior distribution for $(\mu, \sigma^2)$ is also normal inverse-gamma.

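$$\mu, \sigma^2 \mid y \sim \mathrm{NIG}(m^*, v^*, a^*, b^*)$$

where, with $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$, the parameters are

$$m^* = \frac{mv + n\bar{y}}{n + v}, \qquad v^* = n + v, \qquad a^* = a + \frac{n}{2}, \qquad b^* = b + \frac{1}{2}\left[ s^2 + \frac{nv(\bar{y} - m)^2}{n + v} \right].$$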
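The normal inverse-gamma update can be sketched in a few lines of standard-library Python. The class name `NIG`, the function name `nig_posterior` and the example values are hypothetical. The sketch assumes that $v$ scales the precision of $\mu$, as in the prior $\mu \mid \sigma^2 \sim N(m, v^{-1}\sigma^2)$, so that the updated value is $n + v$ and the conditional posterior variance of $\mu$ is $\sigma^2/(n + v)$. It also takes $s^2$ to be the sum of squared deviations about the sample mean.

```python
from typing import NamedTuple, Sequence


class NIG(NamedTuple):
    """Hyperparameters of a normal inverse-gamma distribution (hypothetical container)."""
    m: float  # location of mu
    v: float  # precision multiplier: mu | sigma^2 ~ N(m, sigma^2 / v)
    a: float  # inverse-gamma shape
    b: float  # inverse-gamma scale


def nig_posterior(prior: NIG, y: Sequence[float]) -> NIG:
    """Conjugate update of NIG hyperparameters for i.i.d. normal data."""
    n = len(y)
    if n == 0:
        return prior  # no data: the posterior equals the prior
    y_bar = sum(y) / n
    # Sum of squared deviations about the sample mean (not divided by n).
    s2 = sum((yi - y_bar) ** 2 for yi in y)
    m, v, a, b = prior
    m_star = (m * v + n * y_bar) / (n + v)
    v_star = n + v  # assumption stated above: v multiplies the precision
    a_star = a + n / 2
    b_star = b + 0.5 * (s2 + n * v * (y_bar - m) ** 2 / (n + v))
    return NIG(m_star, v_star, a_star, b_star)


# Illustrative values only.
prior = NIG(m=0.0, v=1.0, a=2.0, b=2.0)
y = [1.0, 2.0, 3.0]
posterior = nig_posterior(prior, y)
print(posterior)  # NIG(m=1.5, v=4.0, a=3.5, b=4.5)
```

The posterior location is a precision-weighted average of the prior mean and the sample mean, so $v$ plays the role of a prior sample size.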

In document Bayesian structural inference with applications in social science (Page 44-48)