1.4 Outline of the rest of the thesis
2.1.1 Probabilistic Bayesian inference
A probabilistic Bayesian analysis has, at its core, a specification of the probabilities of all of the different possible outcomes of a given experiment. Even for small problems, it may be extremely challenging to make such a specification, but the reward for doing so is access to a provably coherent set of rules for updating beliefs upon learning the values of certain quantities within the model.
Motivation Jeffrey [2002] considers fair prices for betting slips which pay out upon the occurrence of certain events, both as a means of proving the basic properties which probability specifications must obey and as a means of deriving the rules by which those specifications must be updated. For a finite or countably infinite set of incompatible events $A_1, A_2, \dots$, define the event $H = A_1 \vee A_2 \vee \dots$. Jeffrey argues that if we are presented with tickets which pay rewards in probability currency ($P(H)$ if $H$ occurs, $P(A_1)$ if $A_1$ occurs, etc.), then in order to avoid inconsistently valuing the same proposition presented to us in different ways, we must have
$$P(H) = P(A_1) + P(A_2) + \dots .$$
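The coherence argument above can be checked numerically. The sketch below uses the common unit-payoff form of the betting argument (each ticket pays 1 if its event occurs), which may differ in detail from Jeffrey's own presentation; the prices and the names `price`, `payoff` and `net` are hypothetical and chosen only for illustration. The prices deliberately violate additivity, and the code shows that an agent who trades at those prices loses the same amount whatever happens.

```python
# Hypothetical, incoherent prices for unit-payoff tickets (assumed values).
# Additivity would require price["H"] == price["A1"] + price["A2"].
price = {"A1": 0.3, "A2": 0.4, "H": 0.8}

# A1 and A2 are incompatible, so the possible outcomes are:
# A1 occurs, A2 occurs, or neither occurs (in which case H fails).
outcomes = ["A1", "A2", "neither"]


def payoff(ticket, outcome):
    """A ticket pays 1 if its event occurs and 0 otherwise."""
    if ticket == "H":  # H = A1 or A2
        return 1.0 if outcome in ("A1", "A2") else 0.0
    return 1.0 if outcome == ticket else 0.0


# The agent buys the ticket on H and sells tickets on A1 and A2,
# each at the agent's own stated price.
net = {}
for outcome in outcomes:
    bought = payoff("H", outcome) - price["H"]
    sold = sum(price[t] - payoff(t, outcome) for t in ("A1", "A2"))
    net[outcome] = bought + sold

for outcome, value in net.items():
    print(f"{outcome}: net gain {value:+.2f}")
```

The net gain is the same negative number in every outcome, because the ticket on $H$ and the pair of tickets on $A_1$ and $A_2$ always pay out identically; only prices satisfying the additivity rule remove the guaranteed loss.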
We may use a similar ‘Dutch book’ argument to handle relationships between conditional (for example, the probability $P(H \mid D)$ that $H$ will occur given that $D$ occurs) and joint (for example, the probability $P(H, D) = P(H \wedge D)$ that both $H$ and $D$ will occur) probabilities: in a situation where we have a ticket which pays 1 if $H \wedge D$ occurs and $P(H \mid D)$ if $\neg D$ occurs, it can also be shown that
$$P(H \wedge D) = P(H \mid D)\, P(D) . \tag{2.1.2}$$
If we have a set of events $\{D_i\}$ which form a partition (i.e. exactly one of them must occur), then we can combine these two rules to obtain
$$P(H) = \sum_i P(H \mid D_i)\, P(D_i) .$$
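As a small worked instance of this combination of the two rules, the sketch below uses hypothetical probabilities for a three-event partition; the numbers and the variable names are assumptions made only for illustration.

```python
# Hypothetical probabilities for a partition {D_1, D_2, D_3} (assumed values).
p_D = [0.5, 0.3, 0.2]          # P(D_i); exactly one D_i occurs, so these sum to 1
p_H_given_D = [0.9, 0.5, 0.1]  # P(H | D_i)

# Product rule (2.1.2): P(H and D_i) = P(H | D_i) P(D_i).
p_H_and_D = [h * d for h, d in zip(p_H_given_D, p_D)]

# The events (H and D_i) are incompatible and their disjunction is H,
# so additivity gives P(H) as their sum.
p_H = sum(p_H_and_D)

print(f"P(H) = {p_H:.2f}")
```

With these values the three joint probabilities are 0.45, 0.15 and 0.02, so $P(H) = 0.62$.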
Analysis The results which Jeffrey gives can be generalised to the case of continuous parameters by making appropriate additional assumptions. For a given problem, then, we should perform a Bayesian analysis by introducing a set of assumptions $M$ (referred to as model assumptions) which specify the functional relationships between a set of parameters $\theta = \{\theta_1, \dots, \theta_{n_p}\}$; these parameters are assumed to take values in some space $\Theta$. Our model then consists of a probability density function (pdf) $p(\theta \mid M)$ which generates the probabilities of individual events $\theta \in \chi$ (for some set $\chi \subseteq \Theta$) as follows
$$P(\theta \in \chi \mid M) = \int_{\chi} p(\theta \mid M)\, d\theta ,$$
where integration should be replaced by summation in the case of discrete parameters, and our pdfs are normalised so that
$$P(\theta \in \Theta \mid M) = \int_{\Theta} p(\theta \mid M)\, d\theta = 1 . \tag{2.1.3}$$
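As a minimal numerical sketch of these two integrals, assume a hypothetical one-dimensional density $p(\theta \mid M) = 2\theta$ on $\Theta = [0, 1]$ and the event $\chi = [0.2, 0.5]$. Neither is taken from the thesis, and `pdf` and `integrate` are illustrative helper names. The integrals are approximated with a simple midpoint rule.

```python
def pdf(theta):
    """Hypothetical density p(theta | M) = 2 * theta on Theta = [0, 1]."""
    return 2.0 * theta if 0.0 <= theta <= 1.0 else 0.0


def integrate(f, lower, upper, n=10_000):
    """Midpoint-rule approximation to the integral of f over [lower, upper]."""
    width = (upper - lower) / n
    return sum(f(lower + (k + 0.5) * width) for k in range(n)) * width


# Normalisation condition (2.1.3): the density integrates to 1 over Theta.
total = integrate(pdf, 0.0, 1.0)

# Probability of the event theta in chi, with chi = [0.2, 0.5].
p_chi = integrate(pdf, 0.2, 0.5)

print(f"integral over Theta = {total:.6f}")
print(f"P(theta in chi | M) = {p_chi:.6f}")
```

For this density the exact answers are 1 and $0.5^2 - 0.2^2 = 0.21$, which gives a simple check on the numerical approximation.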
Using a probabilistic model, we recover the expectation of an individual parameter $\theta_i$ as
$$E[\theta_i] = \int_{\Theta_i} \theta_i\, p(\theta_i)\, d\theta_i \tag{2.1.4}$$
with similar relations for the variance and the higher-order moments of the distribution. Under any given probability distribution, we can compute the marginal (unconditional) distribution of a set of parameters (indexed by $I$) by ‘integrating out’ all others
$$p(\theta_I) = \int_{\Theta_{\neg I}} p(\theta)\, d\theta_{\neg I} \tag{2.1.5}$$
and we can compute the conditional distribution of a set of parameters given the remainder by ‘dividing out’
$$p(\theta_I \mid \theta_{\neg I}) = \frac{p(\theta)}{p(\theta_{\neg I})} .$$
Using this final relation, we can obtain perhaps the most useful probabilistic relationship; if $\theta = \{\alpha, \beta\}$, then as a trivial consequence of the continuous version of the product rule (2.1.2), we have that
$$p(\beta \mid \alpha) = \frac{p(\alpha \mid \beta)\, p(\beta)}{p(\alpha)} . \tag{2.1.6}$$
This relation tells us how we should use data to learn about the world. If, before observing the data, our beliefs about $\beta$ are summarised through $p(\beta)$ (known as the prior distribution), and we make a specification $p(\alpha \mid \beta)$ for the distribution of $\alpha$ conditional on each possible value of $\beta$ (often referred to as the likelihood), then if we learn the value of $\alpha$, we know that our beliefs about $\beta$ should be updated according to (2.1.6) (where the denominator $p(\alpha)$ is computed using (2.1.5) and (2.1.2)). This is a powerful and widely applicable result; if $\alpha$ is the outcome of an experiment that we will perform in order to learn about $\beta$, and we can specify $p(\alpha \mid \beta)$ for all values of $\alpha$ that we might obtain, then (2.1.6) automatically converts our prior specification into our updated belief state.
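The update can be traced step by step in a small discrete case, where the integral in (2.1.5) becomes a sum. The sketch below assumes a hypothetical binary parameter $\beta$ and binary observation $\alpha$; the prior and likelihood values, and the names `prior`, `likelihood` and `posterior`, are invented for illustration.

```python
# Hypothetical prior p(beta) over a binary parameter (assumed values).
prior = {0: 0.8, 1: 0.2}

# Hypothetical likelihood p(alpha | beta): likelihood[beta][alpha].
likelihood = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.3, 1: 0.7},
}


def posterior(alpha_obs):
    """Return p(beta | alpha = alpha_obs) using (2.1.6)."""
    # Product rule: p(alpha, beta) = p(alpha | beta) p(beta).
    joint = {b: likelihood[b][alpha_obs] * prior[b] for b in prior}
    # Discrete form of (2.1.5): p(alpha) is the joint summed over beta.
    evidence = sum(joint.values())
    # 'Dividing out' the marginal gives the conditional distribution.
    return {b: joint[b] / evidence for b in joint}


post = posterior(alpha_obs=1)
print(post)
```

With these numbers, observing $\alpha = 1$ raises the probability of $\beta = 1$ from the prior value 0.2 to $0.14 / 0.22 = 7/11$.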
Implementation For a continuous parameter $\theta$, the direct specification of a probability density $p(\theta)$ would require us to specify relative density values at an infinite number of parameter settings (and then integrate to enforce the condition (2.1.3)); doing this through explicit consideration of the individual elements of the parameter space $\Theta$ is only really possible for discrete problems, and even then, it may present a significant challenge. In practice, therefore, models are built using a small handful of well-known distributional forms.
While a probability distribution should be chosen so that it represents our beliefs about the distribution of a given parameter as faithfully as possible, in practice, the choice of distributions on the basis of their nice computational properties is far more common. Members of the exponential family of distributions are particularly common choices, since they have the very useful property that for certain choices of such distributions for the prior and the likelihood, the posterior distribution will be of the same type as the prior [Diaconis and Ylvisaker, 1979].
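A standard instance of this conjugacy property is the Beta prior paired with a binomial likelihood, where the posterior is again a Beta distribution with updated parameters. The sketch below is an illustration rather than anything specified in the thesis: the prior parameters, the data and the function names are assumptions, and the closed-form posterior mean is compared against a brute-force evaluation of (2.1.6) on a grid.

```python
def beta_binomial_update(a, b, successes, trials):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + successes, b + (trials - successes)


def grid_posterior_mean(a, b, successes, trials, grid_size=20_000):
    """Posterior mean from (2.1.6) evaluated on a midpoint grid over [0, 1]."""
    width = 1.0 / grid_size
    thetas = [(k + 0.5) * width for k in range(grid_size)]
    # Unnormalised posterior: Beta prior kernel times binomial likelihood kernel.
    weights = [
        t ** (a - 1 + successes) * (1 - t) ** (b - 1 + trials - successes)
        for t in thetas
    ]
    # The grid width cancels in the ratio, so it is omitted.
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)


# Hypothetical prior Beta(2, 2) and data: 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(2.0, 2.0, successes=7, trials=10)
closed_form_mean = a_post / (a_post + b_post)  # mean of a Beta(a, b) is a / (a + b)
numerical_mean = grid_posterior_mean(2.0, 2.0, successes=7, trials=10)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"closed-form mean = {closed_form_mean:.6f}")
print(f"grid mean        = {numerical_mean:.6f}")
```

The two means agree closely, which is the practical appeal of conjugacy: the update reduces to parameter arithmetic and no integral has to be evaluated.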
Even when using such conjugate distributions to build models, in problems with more than a handful of parameters, we can quickly lose our ability to directly perform the integrals necessary to compute marginal distributions for subsets of the parameters (equation 2.1.5) or to compute moments of functions of parameters (e.g. expectations, variances, covariances; see equation 2.1.4). In recent years, much work has been done towards handling such problems through numerical integration techniques; perhaps the most commonly encountered are Markov chain Monte Carlo (MCMC) methods. These work by generating a Markov chain (a stochastic sequence in which each state is sampled conditional on the state which precedes it) in such a way that, under suitable conditions, the distribution of the generated states converges in the limit to the required distribution. The Metropolis-Hastings algorithm is the simplest and perhaps the most widely used MCMC method; Robert [2015] gives an introduction, and provides references to more detailed works. A description of a wider range of Monte Carlo sampling methods is provided in Robert and Casella [1999], and some more advanced methods which exploit gradient information to give better exploration of the distribution are presented in Girolami and Calderhead [2011].
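To make the idea concrete, the following is a minimal sketch of a random-walk sampler with a symmetric Gaussian proposal, which is the Metropolis special case of the Metropolis-Hastings algorithm. The target (a standard normal, known only up to its normalising constant), the step size, the burn-in length and all function names are assumptions chosen for illustration, not choices made in the thesis.

```python
import math
import random


def log_target(theta):
    """Unnormalised log density; a standard normal is assumed for illustration."""
    return -0.5 * theta * theta


def metropolis_hastings(log_p, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_log_p = log_p(current)
    samples = []
    for _ in range(n_samples):
        # Propose a move that depends only on the current state.
        proposal = current + rng.gauss(0.0, step)
        proposal_log_p = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current)); the
        # proposal densities cancel because the proposal is symmetric.
        log_ratio = proposal_log_p - current_log_p
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            current, current_log_p = proposal, proposal_log_p
        # A rejected proposal repeats the current state in the chain.
        samples.append(current)
    return samples


samples = metropolis_hastings(log_target, start=3.0, n_samples=50_000)
kept = samples[1_000:]  # discard an (assumed) burn-in period

# Monte Carlo estimates of the moments in (2.1.4).
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"estimated mean = {mean:.3f}, estimated variance = {variance:.3f}")
```

Only ratios of the target density are needed, so the normalising constant in (2.1.3) never has to be computed; the estimates should approach the target's mean of 0 and variance of 1 as the chain lengthens.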
A very large literature now also exists investigating situations in which larger numbers of parameters and a more complex dependency structure are required; graphical models (see, for example, Bishop [2003], Lauritzen and Wermuth [1989], Rue and Held [2005]) provide a useful framework for specifying the dependence structure between parameters, and also for performing the calculations necessary to extract information from the model upon updating using data.