Methods for the thesis
5.1 Bayesian hierarchical models
In this section I give a brief introduction to the Bayesian view of probability and model estimation, to give both a theoretical and practical basis for the methods used in this thesis. More detail about the standard techniques discussed here can be found in Gelman et al. (2004). Bayesian methods provide a different framework for statistical inference from that provided by frequentist methods. In a frequentist setting we assume that the parameter θ we are trying to estimate using observed data y takes an unknown true value. This value is fixed, so we cannot make probability statements about it such as “there is a 95% probability that θ is greater than 0”; it either is or it is not. The probability statements we can make under the frequentist paradigm refer to the experiment we are conducting to estimate the value, not the value itself. For example, we can calculate the p-value as the probability of observing data y (or something more extreme) under some assumed null hypothesis for θ. Correspondingly, we can construct a confidence interval for θ using the data y such that the probability that the interval contains θ is at least a certain percentage. As these are probabilities about what we observe, we can think of them as long-run frequencies were we to repeat our experiment many times.
In a Bayesian setting our uncertainty can be expressed directly as a probability in terms of θ, so we are allowed to make statements such as “there is a 95% probability that θ is greater than 0”. Our uncertainty about the parameters after the experiment is quantified by the posterior distribution (the probability of the parameter θ given the data y), which is calculated using Bayes’ theorem (Bayes and Price, 1763):
p(θ|y) = p(y|θ)p(θ)/p(y) ∝ p(y|θ)p(θ) (5.1)
The likelihood, p(y|θ), is the same as that required by likelihood-based frequentist methods, but additionally we now require a prior, p(θ). The prior quantifies our beliefs about the distribution of θ prior to seeing the data y. The posterior distribution is what we use to draw inferences. The extent to which the prior influences the posterior depends on how much information the prior contributes relative to the data. This makes sense if we consider the probability distributions of the parameters of interest in terms of our belief: if we have some idea about the parameters prior to seeing the data, but the data do not tell us much about the parameters, then the posterior distribution will be similar to the prior. Similarly, if we have no idea about the values the parameters might take and we have a reasonable amount of data from our experiment, the posterior will mainly be informed by the data (and as the sample size tends to infinity we arrive at the same numerical estimates as those obtained by maximum likelihood, though the interpretation is different).
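The balance between prior and data can be illustrated with a conjugate example. The sketch below is not one of the models in this thesis: it assumes a Normal likelihood with known standard deviation and a Normal prior on the mean θ, for which the posterior is available in closed form. The function name and the data values are hypothetical.

```python
def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Posterior mean and sd of theta when y_i ~ N(theta, sigma^2)
    and theta ~ N(prior_mean, prior_sd^2), with sigma assumed known."""
    n = len(data)
    ybar = sum(data) / n
    prior_prec = 1.0 / prior_sd ** 2      # precision = 1 / variance
    data_prec = n / sigma ** 2
    post_prec = prior_prec + data_prec
    # The posterior mean is a precision-weighted average of prior mean and sample mean
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    return post_mean, post_prec ** -0.5

data = [1.8, 2.4, 2.1, 1.9, 2.3]          # made-up observations with mean 2.1

# A tight prior centred on 0 dominates five observations
informative = normal_posterior(0.0, 0.1, data, sigma=1.0)
# A vague prior leaves the posterior mean close to the sample mean
vague = normal_posterior(0.0, 1000.0, data, sigma=1.0)
```

Under the tight prior the posterior mean stays near the prior mean of 0, whereas under the vague prior it is essentially the sample mean, which is also the maximum likelihood estimate in this setting.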
The prior distributions are included in the specification of the model. For example in model 4.1 we might assign the priors
α ∼ N(0, 100000), β ∼ N(0, 100000)
For model 4.2 we might assign
α ∼ N(0, 100000), β ∼ N(0, 100000), uj ∼ N(0, σu²), σu ∼ U(0, 100)
This demonstrates the continuity between single level and hierarchical models in a Bayesian framework which arises because there are no fixed parameters: all parameters are considered random since they are assigned a distribution. The only difference in the specification is that in the hierarchical model, the distribution of the uj is governed by a higher level parameter σu, known as a hyperparameter, which is also assigned a prior distribution (a hyperprior).
As mentioned previously, it is reasonable to assume that the provider effects are not identical, nor independent, but similar. This notion is formalised in Bayesian statistics as exchangeability. A set of parameters θ1, ..., θn is defined as finitely exchangeable if the joint distribution p(θ1, ..., θn) remains unchanged for any permutation of the labels 1, ..., n. That is, if I know the set of providers 1, ..., n, knowing which provider is number 1, which is number 2 and so forth does not change my beliefs about θ1, ..., θn. We do not require independence, so knowledge of some of the parameters can give us information about the others. It is this borrowing of information that results in the provider effects being shrunk towards the mean, with more shrinkage where there is less information.
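The dependence of shrinkage on the amount of information can be sketched numerically. The example below is a simplification: it assumes a Normal-Normal model in which the overall mean, the within-provider standard deviation (sigma) and the between-provider standard deviation (sigma_u) are all known, whereas in the models fitted here they are estimated. All names and values are hypothetical.

```python
def shrunken_mean(group_mean, n, overall_mean, sigma, sigma_u):
    """Posterior mean of a provider effect under a Normal-Normal model
    with known variances, and the weight given to the provider's own data."""
    data_prec = n / sigma ** 2            # information from the provider's data
    prior_prec = 1.0 / sigma_u ** 2       # information borrowed from other providers
    weight = data_prec / (data_prec + prior_prec)
    estimate = weight * group_mean + (1.0 - weight) * overall_mean
    return estimate, weight

# Two providers with the same observed mean but very different sample sizes
small = shrunken_mean(2.0, n=4, overall_mean=0.0, sigma=2.0, sigma_u=1.0)
large = shrunken_mean(2.0, n=400, overall_mean=0.0, sigma=2.0, sigma_u=1.0)
```

With these values the small provider's estimate is pulled halfway towards the overall mean, while the large provider's estimate barely moves from its observed mean.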
Exchangeability is also a crucial part of Bayesian estimation itself, as it allows the joint prior for all parameters to be decomposed into a product of conditional distributions; further details can be found in Gelman et al. (2004).
5.1.1 Choice of priors
It is recommended to check the sensitivity to the choice of prior by implementing a community of priors (Spiegelhalter et al., 1994) which reflect different positions of belief about the parameters. In practice we do not always have much information on which to base the priors, so they can be set to convey minimal information (e.g. a flat distribution over all possible values). These are usually called vague or reference priors. For some parameters we may be interested in a transformation, in which case to be truly non-informative a prior has to be so under transformation. Priors can also be weakly informative, a term applied to those which are restricted to plausible values but do not express any particular prior belief (Gelman, 2006).
For location parameters such as regression coefficients, a Normal distribution with zero mean and a very large variance is a standard option (Lunn et al., 2012). Such parameters are usually insensitive to the choice of vague prior when there is a reasonable amount of data available (Gelman, 2006; Lambert et al., 2005). Scale parameters can be more sensitive, in particular the variance parameters in hierarchical models. Gelman (2006) recommends a uniform prior on the standard deviation with a suitably large upper limit if a proper prior is required (as is the case for the OpenBUGS software used here), provided the number of groups is not less than five.
5.1.2 Estimation of Bayesian models
A closed form of the posterior distribution is only available in a limited number of cases. Beyond this, there are various ways of evaluating the required integrals, but the most widely used is Markov Chain Monte Carlo (MCMC). Monte Carlo integration is used to approximate any summary of a probability distribution by calculating that summary across repeated independent draws from the distribution. The accuracy of the approximation increases as the number of samples increases. A Markov chain is a sequence of random variables such that, given the present, the future is independent of the past, so all information about the history of the process is contained within the present state. The key property of interest to Bayesian inference is that Markov chains with particular properties converge to a stationary distribution (Gilks et al., 1996). Say we have our parameter vector of interest θ and a set of initial values θ(0). If we can construct a transition distribution (the set of probabilities governing movement from one set of samples to the next), which gives the distribution of θ(t+1) given θ(t), such that the resulting Markov chain converges on the posterior distribution p(θ|y), we will obtain a series of samples from a distribution sufficiently close to the target posterior distribution which are independent of the starting values. The starting values can be chosen using the data (as they are not part of our prior belief) but should be suitably different for each Markov chain to allow convergence to be checked (Lunn et al., 2012). This brings us to the required samples we need to carry out Monte Carlo integration as previously described. Gibbs sampling is one way of proceeding from one set of samples to the next using conditional distributions (Gelfand and Smith, 1990; Geman and Geman, 1984) and is the method implemented in the OpenBUGS software (OpenBUGS Project Management Group, 2014) used in this thesis.
Practically speaking, if we have a set of parameters it means we sample the next value for one of these parameters (or a subset of several) at a time conditional on the current value of all the other parameters. This approach can be applied to transformations of parameters, allowing us to obtain measures of uncertainty for quantities whose distribution cannot be described parametrically.
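The idea of sampling each parameter in turn from its conditional distribution can be sketched with a toy target. The example below is not how OpenBUGS is implemented and is not a model from this thesis: it assumes a standard bivariate Normal target with correlation rho, for which each full conditional is a univariate Normal. The function name, starting values and seed are hypothetical.

```python
import random
import statistics

def gibbs_bivariate_normal(rho, n_iter, start, seed):
    """Gibbs sampler for a standard bivariate Normal with correlation rho.
    Each full conditional is N(rho * other, 1 - rho^2)."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5
    x, y = start
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)   # update x given the current y
        y = rng.gauss(rho * x, cond_sd)   # update y given the new x
        samples.append((x, y))
    return samples

# Deliberately poor starting values, far from the bulk of the target
draws = gibbs_bivariate_normal(rho=0.5, n_iter=20000, start=(5.0, -5.0), seed=1)
kept = draws[1000:]                       # discard early iterations

xs = [d[0] for d in kept]
# Monte Carlo integration: summaries are computed across the retained draws
mean_x = statistics.fmean(xs)
prob_x_positive = sum(x > 0 for x in xs) / len(xs)
# A transformation of the parameters is summarised in exactly the same way
mean_product = statistics.fmean(x * y for x, y in kept)
```

For this target the true values are a mean of 0 for x, a probability of 0.5 that x is positive, and an expected product of rho, so the three summaries can be checked against known answers.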
5.1.3 Model convergence and estimation of the joint posterior distribution
We take successive samples until we believe that the Markov chain has converged to a stationary distribution. There are no tests which prove convergence, but there are a number of statistics to help assess whether convergence is plausible. It is recommended to use multiple ways of assessing convergence (Lunn et al., 2012); I will describe a number of methods here. Multiple chains can be run from different starting points and the sampled values plotted for each parameter over time (trace plots). If the paths of the chains are indistinguishable from one another and look like a random trace then this is consistent with convergence. The Brooks-Gelman-Rubin (BGR) statistic compares the posterior variability pooling samples from both chains with the average of the posterior variability measured in each chain separately; this ratio should tend to 1 as convergence is reached (Brooks and Gelman, 1998; Gelman and Rubin, 1992). The Geweke statistic (Geweke, 1991) assumes that the chain has converged by the halfway point, and compares the last 50% of the chain with the first 10% (these are typical values used). If these sections of the chain are different then this is taken as evidence that the chain has not converged by 10% of the way through. The Heidelberger-Welch stationarity diagnostic (Heidelberger and Welch, 1983) tests a null hypothesis of convergence, first using the whole chain, then discarding an increasing percentage of the initial part of the chain up to 50%.
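The comparison of pooled and within-chain variability can be sketched as follows. This is the variance-based form of the statistic in the spirit of Gelman and Rubin (1992); the version reported by OpenBUGS may differ in detail, so the function below is an illustration rather than a reproduction of the software's output. The simulated chains are hypothetical.

```python
import random
import statistics

def gelman_rubin(chains):
    """Variance-based potential scale reduction factor for one parameter.
    `chains` is a list of equal-length lists of sampled values."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n   # pooled variance estimate
    return (pooled / within) ** 0.5

rng = random.Random(42)
# Two chains sampling the same distribution: ratio close to 1
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(2)]
# Two chains stuck around different values: ratio well above 1
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

A value close to 1 is consistent with convergence but does not prove it, which is why the statistic is used alongside trace plots and the other diagnostics.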
Once we believe convergence has been reached we can then discard the MCMC samples up to this point (called burn in) and run the MCMC until we have enough samples with which to summarise the posterior distribution with sufficient precision.
Even once the MCMC simulation has run long enough to reach stationarity, we will still have correlation from one iteration to the next because of the nature of Markov chains. Independence is not an assumption of Monte Carlo integration, so our inferences are still valid, but they are less efficient (Gilks et al., 1996). If there is a high level of correlation from one sample to the next, it is more difficult for the sampler to explore the parameter space. Autocorrelation can be reduced by only keeping some of the samples (known as thinning), though using the whole chain results in more accurate posterior inferences (Lunn et al., 2012).
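The effect of thinning on autocorrelation can be seen in a simulated chain. The sketch below uses a first-order autoregressive series as a stand-in for correlated MCMC output (an assumption made purely for illustration); the coefficient phi, the chain length and the thinning interval are hypothetical choices.

```python
import random
import statistics

def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(chain)
    mean = statistics.fmean(chain)
    var = sum((v - mean) ** 2 for v in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

rng = random.Random(7)
phi = 0.9                                  # strength of dependence between iterations
chain = [0.0]
for _ in range(19999):
    chain.append(phi * chain[-1] + rng.gauss(0.0, 1.0))

thinned = chain[::10]                      # keep every 10th sample
acf_full = autocorrelation(chain, 1)
acf_thinned = autocorrelation(thinned, 1)
```

The thinned chain shows much lower lag-1 autocorrelation, but it contains only a tenth of the samples, which is why summaries based on the full chain are more precise.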
5.1.4 Model checking and model comparison
Posterior predictive checks are a way of checking whether predictions produced by the model (using the parameters estimated by the posterior distribution) give results that are similar to the original data (Gelman et al., 1996). If the model is correct we would expect that the observed data would be likely to occur under the posterior predictive distribution.
Note that, as with most checks of model fit, the converse does not hold, so the lack of apparent problems does not mean that the model is true. Rather, posterior predictive checks are used to investigate ways in which the model might be deficient. We can make predictions for the same individuals and new clusters, or for new individuals and clusters (or combinations thereof) depending on what we are interested in. To conduct posterior predictive checks we simulate data based on sets of parameters sampled from the joint posterior distribution (which is straightforward when we already have such samples from MCMC). We then choose test statistics with which to compare the real and replicated data. A range of test statistics should be chosen to reflect the various ways in which the model may be inadequate, with a focus on the inferences we are interested in (Gelman et al., 2000).
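The procedure can be sketched for a deliberately simple case. The example below assumes a Normal model with known standard deviation and a flat prior on the mean, so that posterior draws can be generated directly rather than by MCMC; the data, the choice of the maximum as test statistic, and all names are hypothetical.

```python
import random
import statistics

def posterior_predictive_pvalue(data, sigma, stat, n_rep, seed):
    """Proportion of replicated data sets whose test statistic is at least
    as large as the observed one, for y_i ~ N(theta, sigma^2), flat prior."""
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    observed = stat(data)
    exceed = 0
    for _ in range(n_rep):
        theta = rng.gauss(ybar, sigma / n ** 0.5)             # draw from the posterior
        replicate = [rng.gauss(theta, sigma) for _ in range(n)]  # simulate new data
        if stat(replicate) >= observed:
            exceed += 1
    return exceed / n_rep

# Made-up data in which the last value is an outlier under the assumed model
data = [-0.3, 0.4, 1.1, -0.8, 0.2, 0.6, -0.1, 0.9, -0.5, 6.0]
p_max = posterior_predictive_pvalue(data, sigma=1.0, stat=max, n_rep=5000, seed=3)
```

A proportion close to 0 or 1 indicates that the model rarely reproduces that feature of the data; here the replicated maxima almost never reach the observed maximum, flagging the outlier as something the assumed model cannot accommodate.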
Quantile-quantile (Q-Q) plots can be used to check the assumed distribution of the random effects. These plot the sample quantiles (of the posterior means of the random effects) against the theoretical quantiles of the desired distribution. Even if the model is true and the underlying provider effects follow the specified distribution, the distribution of posterior means will be underdispersed relative to the estimated random effects variance, as the estimates are shrunk towards the overall mean. It is for this reason that the posterior means themselves do not provide a good estimate of the overall variability (Shen and Louis, 1998).
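The underdispersion of the posterior means can be demonstrated by simulation. The sketch below assumes a Normal-Normal model with an overall mean of zero and known, equal standard deviations for the provider effects and for the sampling error of each provider's raw estimate; these simplifications and all values are hypothetical.

```python
import random
import statistics

rng = random.Random(5)
n_groups = 2000
sigma_u = 1.0   # true sd of the provider effects (assumed known)
se = 1.0        # sampling sd of each provider's raw estimate (assumed equal)

true_effects = [rng.gauss(0.0, sigma_u) for _ in range(n_groups)]
raw_estimates = [rng.gauss(u, se) for u in true_effects]

# Posterior mean of each effect: raw estimate shrunk towards the overall mean of 0
weight = sigma_u ** 2 / (sigma_u ** 2 + se ** 2)
posterior_means = [weight * est for est in raw_estimates]

var_true = statistics.variance(true_effects)
var_posterior_means = statistics.variance(posterior_means)
```

With these settings the variance of the posterior means is about half the true random effects variance, even though the model used for shrinkage is exactly the one that generated the data.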
There are a number of statistics which describe how well the model fits the data.
The deviance, D(θ), is a measure of how likely the data are under the parameters θ and is given by:
D(θ) = −2 log p(y|θ)
The better the fit of the model, the lower the deviance. This measure does not take into account the complexity of the model; usually we want to quantify the trade-off between improvement in fit and added complexity from extra parameters. The Akaike Information Criterion (AIC), which was used in section 3.4.1, does this by adding twice the number of parameters to the estimated deviance. For Bayesian models the number of model parameters is not clear because the parameters are constrained by the priors (Lunn et al., 2012). We consider instead the effective number of parameters, pD, which has a number of alternative definitions (Gelman et al., 2004); I will use that proposed by Spiegelhalter et al. (2002):
pD = D̄ − D(θ̄)
where D̄ is the mean of the posterior distribution of the deviance and D(θ̄) is the deviance evaluated at the posterior means of the parameters θ. The Deviance Information Criterion (DIC) is analogous to the AIC and is defined as:
DIC = D̄ + pD = D(θ̄) + 2pD
I presented DIC, D̄ and pD for all models.
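These quantities can be computed directly from posterior samples. The sketch below is an illustration only and does not reproduce the OpenBUGS calculation: it assumes a Normal model with known standard deviation and a flat prior on a single mean, so that posterior draws can be simulated directly and the effective number of parameters should be close to 1. All names and data are hypothetical.

```python
import math
import random
import statistics

def deviance(theta, data, sigma):
    """D(theta) = -2 log p(y | theta) for y_i ~ N(theta, sigma^2)."""
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (y - theta) ** 2 / (2 * sigma ** 2)
        for y in data
    )
    return -2.0 * log_lik

def dic_summary(theta_samples, data, sigma):
    """Mean deviance, deviance at the posterior mean, pD and DIC."""
    d_bar = statistics.fmean(deviance(t, data, sigma) for t in theta_samples)
    d_hat = deviance(statistics.fmean(theta_samples), data, sigma)
    p_d = d_bar - d_hat
    return {"Dbar": d_bar, "Dhat": d_hat, "pD": p_d, "DIC": d_bar + p_d}

rng = random.Random(11)
data = [rng.gauss(1.0, 1.0) for _ in range(50)]   # simulated observations
ybar = statistics.fmean(data)
# Posterior for theta under a flat prior is N(ybar, sigma^2 / n)
theta_samples = [rng.gauss(ybar, 1.0 / len(data) ** 0.5) for _ in range(20000)]
summary = dic_summary(theta_samples, data, sigma=1.0)
```

In this one-parameter model with an uninformative prior, pD is approximately 1, matching the nominal parameter count; with informative priors or hierarchical structure it would generally be smaller than the nominal count.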