Model choice and Bayesian model averaging (BMA)


Various methods of posterior simulation have been developed, the most widely used being Markov chain Monte Carlo (MCMC). Among the many MCMC algorithms, the Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis et al., 1953) are general and popular computing approaches.
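As an illustration of the kind of posterior simulation referred to above, the following is a minimal sketch of a random-walk Metropolis sampler, a special case of Metropolis-Hastings with a symmetric proposal. The target (a standard normal), the step size and the function names are assumptions chosen for illustration only; they are not taken from the cited papers.

```python
import math
import random


def log_target(x):
    # Unnormalised log-density of a standard normal (illustrative target only).
    return -0.5 * x * x


def random_walk_metropolis(log_density, n_samples, x0=0.0, step=1.0, seed=0):
    """Draw n_samples from the density whose log is log_density (up to a constant)."""
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


samples = random_walk_metropolis(log_target, n_samples=20000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

With a standard normal target, `sample_mean` and `sample_var` should be close to 0 and 1 respectively. In practice, an initial burn-in portion of the chain is usually discarded.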

As discussed earlier in this chapter, there are numerous applications of Bayesian approaches in time series analysis (Choudhuri et al., 2004; Guo et al., 1999; Simon et al., 2004; West et al., 1999) and in spatial and spatiotemporal analyses (Banerjee et al., 2008; Sahu et al., 2005; Besag et al., 1991). In these situations, a hierarchical statistical model is often proposed to represent the uncertainties of the temporal, spatial or spatiotemporal process using conditional probabilities at different levels. A Bayesian hierarchical model is commonly decomposed into three (or more) levels of probability models: the data model $[Z \mid Y, \theta]$, the process model $[Y \mid \theta]$ and the parameter model $[\theta]$, where $Z$ denotes the data, $Y$ the underlying process and $\theta$ the unknown parameters (Cressie and Wikle, 2011). The parameter model is at the lowest level and can be expressed by the joint probability distribution of all the unknown parameters.

More details are given in the growing number of books on the topic; see, for example, Banerjee et al. (2004), Cressie and Wikle (2011), Diggle and Ribeiro (2007), and Lawson et al. (2003).

2.4.2 Model choice and Bayesian model averaging (BMA)

Spiegelhalter et al. (2002) developed the deviance information criterion (DIC) to compare complex hierarchical models in which the number of parameters is not clearly defined. The DIC is based on the posterior expectation of the deviance and the effective number of parameters in the model, and is expressed as:

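$$\mathrm{DIC} = \bar{D} + p_D, \qquad \bar{D} = E[D(\theta)]$$

$$p_D = E[D(\theta)] - D(E[\theta]) = \bar{D} - D(\bar{\theta})$$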

where $D(\theta)$ is the deviance, with expectations taken over the posterior distribution of $\theta$, and $p_D$ is the difference between the posterior mean deviance and the deviance evaluated at the posterior means of the parameters; $p_D$ is used to assess model complexity. $\bar{D}$ is the posterior mean deviance and can be used to compare discrepancies between models; that is, it measures how well the model fits the data, with larger values indicating a worse fit. The DIC is easily computed from the samples generated through MCMC (Banerjee et al., 2004). A smaller DIC value indicates a better model fit, accounting for model parsimony. The Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978) are also widely used in model choice. However, both require the number of parameters to be counted. Hence, the AIC and BIC are not appropriate for many model selection problems, such as hierarchical models with random effects, where the number of parameters is uncertain (Spiegelhalter et al., 2002).
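To make the DIC calculation from posterior samples concrete, the following is a minimal sketch under strong simplifying assumptions: the data are i.i.d. normal with known unit variance, the only unknown is the mean, and a flat prior is used so that the posterior can be sampled directly. The direct draws stand in for MCMC output; the data, seed and function names are hypothetical.

```python
import math
import random


def deviance(y, mu, sigma=1.0):
    # D(theta) = -2 * log-likelihood for i.i.d. normal data with known sigma.
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((yi - mu) / sigma) ** 2
        for yi in y
    )
    return -2.0 * log_lik


def dic_from_draws(y, mu_draws):
    """Compute the posterior mean deviance, p_D and DIC from posterior draws of mu."""
    d_bar = sum(deviance(y, mu) for mu in mu_draws) / len(mu_draws)
    mu_bar = sum(mu_draws) / len(mu_draws)
    p_d = d_bar - deviance(y, mu_bar)  # effective number of parameters
    return {"d_bar": d_bar, "p_d": p_d, "dic": d_bar + p_d}


rng = random.Random(1)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]  # simulated data (assumption)
y_mean = sum(y) / len(y)

# Assumption: flat prior on mu, so the posterior is N(y_mean, 1/n).
mu_draws = [rng.gauss(y_mean, 1.0 / math.sqrt(len(y))) for _ in range(5000)]

result = dic_from_draws(y, mu_draws)
```

In this one-parameter model, `result["p_d"]` should be close to 1, which matches the interpretation of $p_D$ as an effective number of parameters.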

Bayesian model averaging is an alternative approach to model selection. BMA can take account of model uncertainty and make inference by taking a weighted average of models over the model space (Hoeting, 2002). Let $M$ be a model space comprising a number of possible model structures $M_i$, $i = 1, \ldots, L$, each with parameter vector $\theta_i$ and fitted to data $D$. Let $\Delta$ be the quantity of interest; this could represent, for example, the spatial correlation structure in this study. The posterior distribution of $\Delta$ given data $D$ is then (Hoeting et al., 1999):

$$p(\Delta \mid D) = \sum_{i=1}^{L} p(\Delta \mid M_i, D)\, p(M_i \mid D)$$

The posterior probability for Mi is given by:

$$p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_{j=1}^{L} p(D \mid M_j)\, p(M_j)}$$

where

$$p(D \mid M_i) = \int p(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$$

Here, $p(D \mid M_i)$ is the marginal likelihood of the data $D$ given model $M_i$; $\theta_i$ denotes the vector of parameters of model $M_i$; $p(D \mid \theta_i, M_i)$ is the likelihood; $p(\theta_i \mid M_i)$ is the prior density of $\theta_i$ given model $M_i$; $L$ is the number of models; and $p(M_i)$ is the prior probability for model $M_i$ (Hoeting et al., 1999).

A Laplace approximation, typically in the form of the Bayesian information criterion (BIC) (Schwarz, 1978), can be used to approximate $p(D \mid M_i)$ (Clyde, 2000; Hoeting et al., 1999; Jackson et al., 2009):

$$-2 \log p(D \mid M_i) \approx \mathrm{BIC}_i$$

$$\mathrm{BIC}_i = -2 \log\{p(D \mid \hat{\theta}_i)\} + d_i \log(n)$$

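Here $\log\{p(D \mid \hat{\theta}_i)\}$ is the maximized log-likelihood of model $i$, which measures goodness of fit; $d_i$ is the number of parameters in model $i$; and $n$ is the sample size. In the absence of other information, it is common to assume equal prior model probabilities $p(M_i)$ for the candidate models (Boone and Bullock, 2008; Jackson et al., 2009). Hence the BMA weights are approximately proportional to

$$\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_i\right)$$

The posterior probability for $M_i$ is calculated as

$$p(M_i \mid D) \approx \frac{\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_i\right)}{\sum_{j=1}^{L} \exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_j\right)}$$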
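The BIC-based weights can be computed in a few lines. The sketch below assumes equal prior model probabilities, as in the text; the BIC values and the model-specific estimates of the quantity of interest are hypothetical numbers chosen purely for illustration.

```python
import math


def bma_weights_from_bic(bic_values):
    """Approximate posterior model probabilities under equal prior probabilities."""
    # Subtracting the smallest BIC before exponentiating avoids numerical
    # underflow and leaves the normalised weights unchanged.
    best = min(bic_values)
    unnormalised = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


bic_values = [100.0, 102.0, 110.0]  # hypothetical BIC values for three models
weights = bma_weights_from_bic(bic_values)

# Hypothetical posterior means of the quantity of interest under each model.
delta_estimates = [0.80, 0.65, 0.40]

# Model-averaged posterior mean: weighted average of the model-specific means.
delta_bma = sum(w * d for w, d in zip(weights, delta_estimates))
```

Because only differences in BIC enter the weights, a model whose BIC exceeds the smallest BIC by 10 contributes less than 1% of the weight of the best model.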

Other information criteria can be used instead of the BIC. For example, Akaike’s information criterion ($\mathrm{AIC}_i = -2 \log\{p(D \mid \hat{\theta}_i)\} + 2 d_i$) (Akaike, 1973) was suggested by Jackson et al. (2009), who also suggested that it may be worth investigating the use of the DIC as a basis for model averaging, given the increasing popularity of Bayesian hierarchical models.

Boone and Bullock (2008) employed Bayesian model averaging to evaluate the ‘best’ spatial correlation structure and to average across these structures to develop a non-parametric alternative structure for a loblolly pine database. They developed four spatial models incorporating different spatial correlation structures: independent, Matérn, conditional autoregressive (CAR) and simultaneous autoregressive (SAR). The prior $p(M_i)$ was set to 1/4 for all models because there was no information to indicate which correlation structure was preferred. The marginal likelihood $p(D \mid M_i)$ was approximated by

$$p(D \mid M_i) \approx \frac{1}{T} \sum_{t=1}^{T} \frac{p(D \mid \theta^{(t)}, M_i)\, p(\theta^{(t)} \mid M_i)}{g(\theta^{(t)})}$$

where $g(\theta)$ was the candidate density and $\theta^{(1)}, \ldots, \theta^{(T)}$ denote $T$ draws from it. This numerical approximation was calculated by importance sampling and Monte Carlo integration. However, the integration was inefficient if $g(\theta)$ was not close to the density to be integrated (Boone and Bullock, 2008). The authors concluded that these spatial correlation structures could be combined using BMA to form a hybrid structure that includes the class of the original structures.
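The importance sampling estimate of a marginal likelihood can be illustrated on a toy problem where the exact answer is known. The sketch below is not Boone and Bullock's implementation: it assumes a single observation $y \sim N(\theta, 1)$ with prior $\theta \sim N(0, 1)$, so that the exact marginal density of $y$ is $N(0, 2)$, and it uses a normal candidate density centred near the posterior. All names and numerical values are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def marginal_likelihood_is(y, n_draws=20000, seed=0):
    """Importance sampling estimate of p(y) = integral of p(y | theta) p(theta)."""
    rng = random.Random(seed)
    # Candidate density g: centred at the posterior mean y/2, but with a
    # larger spread than the posterior so that the weights stay bounded.
    g_mean, g_sd = 0.5 * y, 1.0
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(g_mean, g_sd)
        likelihood = normal_pdf(y, theta, 1.0)  # p(y | theta)
        prior = normal_pdf(theta, 0.0, 1.0)     # p(theta)
        total += likelihood * prior / normal_pdf(theta, g_mean, g_sd)
    return total / n_draws


y_obs = 1.3  # hypothetical observation
estimate = marginal_likelihood_is(y_obs)
exact = normal_pdf(y_obs, 0.0, math.sqrt(2.0))  # analytic marginal, N(0, 2)
```

Here `estimate` should agree closely with `exact`. Moving the candidate density far from the posterior (for example, setting `g_mean` to a distant value) makes the estimate much noisier, which is the inefficiency noted above.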

2.4.3 Variable selection

Although one can use a statistical criterion (e.g. BIC or AIC) to select an optimal model from a set of plausible candidate models, concerns remain about uncertainty over the type of model or over which subset of variables should be included in the analysis. Several variable selection approaches have been developed in a Bayesian framework, for example adaptive shrinkage, Gibbs variable selection, stochastic search variable selection and reversible jump MCMC (RJMCMC). O'Hara and Sillanpää (2009) describe and compare these methods in detail. Reversible jump MCMC is the most effective but is often quite computationally demanding (O'Hara and Sillanpää, 2009; Sisson, 2005).

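Reversible jump MCMC was first suggested by Green (1995). Let $Y$ be a response variable with $n$ observations, and let $X$ contain a total of $m$ covariates. Let $\theta_k = (\theta_1, \ldots, \theta_k)$ denote the column indices of $X$ for the selected covariates, and let $k$ be the current total number of selected variables in the model. The posterior probability of each possible model can be obtained by modelling the joint distribution of $(k, \theta_k, Y)$:

$$p(k, \theta_k \mid Y) = \frac{p(k)\, p(\theta_k \mid k)\, p(Y \mid k, \theta_k)}{\sum_{k'} \sum_{\theta_{k'}} p(k')\, p(\theta_{k'} \mid k')\, p(Y \mid k', \theta_{k'})}$$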

The main idea is that each possible model structure has its own fixed number of parameters, so the dimension $k$ of the parameter space varies from one model to another. The MCMC scheme is constructed to accommodate such ‘jumps’ between parameter sets of different dimension. There is a wide variety of RJMCMC applications in the literature (Brooks et al., 2003; Denison et al., 1998; Lunn et al., 2006).
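The idea of jumping between parameter spaces of different dimension can be sketched with a deliberately small example. This is a toy illustration and not the general algorithm of Green (1995) or any sampler from the cited applications: it assumes only two candidate models for i.i.d. data with unit variance, a null model with mean fixed at zero ($k = 0$) and a model with a free mean $\mu \sim N(0, 1)$ ($k = 1$), with equal prior model probabilities. The data, proposal density and function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def log_lik(y, mu):
    return sum(log_normal_pdf(yi, mu, 1.0) for yi in y)


def rjmcmc_two_models(y, n_iter=20000, seed=0):
    """Estimate the posterior probability of the model with a free mean (k = 1)."""
    rng = random.Random(seed)
    n = len(y)
    q_mean, q_sd = sum(y) / n, 1.0 / math.sqrt(n)  # proposal for mu in a 'birth' move
    post_mean, post_sd = sum(y) / (n + 1), 1.0 / math.sqrt(n + 1)
    log_lik_null = log_lik(y, 0.0)
    k, mu = 0, None
    visits_k1 = 0
    for _ in range(n_iter):
        if k == 0:
            # Birth move: propose mu and jump to the larger model (Jacobian is 1).
            mu_new = rng.gauss(q_mean, q_sd)
            log_a = (log_lik(y, mu_new) + log_normal_pdf(mu_new, 0.0, 1.0)
                     - log_lik_null - log_normal_pdf(mu_new, q_mean, q_sd))
            if rng.random() < math.exp(min(0.0, log_a)):
                k, mu = 1, mu_new
        else:
            # Death move: drop mu; the acceptance ratio is the inverse of the birth ratio.
            log_a = (log_lik_null + log_normal_pdf(mu, q_mean, q_sd)
                     - log_lik(y, mu) - log_normal_pdf(mu, 0.0, 1.0))
            if rng.random() < math.exp(min(0.0, log_a)):
                k, mu = 0, None
        if k == 1:
            # Within-model update: Gibbs draw from the conjugate posterior of mu.
            mu = rng.gauss(post_mean, post_sd)
        visits_k1 += k
    return visits_k1 / n_iter


y = [0.9, -0.3, 1.2, 0.4, 0.1, 0.8, -0.5, 1.1, 0.6, 0.2]  # hypothetical data
estimate = rjmcmc_two_models(y)

# Exact posterior probability from the analytic Bayes factor, for comparison.
n, s = len(y), sum(y)
bayes_factor = math.exp(s * s / (2.0 * (n + 1))) / math.sqrt(n + 1)
exact = bayes_factor / (1.0 + bayes_factor)
```

The proportion of iterations spent in each model estimates its posterior probability, so `estimate` should be close to `exact`. In realistic variable selection problems the model space is far larger and the proposals need much more care, which is the source of the computational burden mentioned above.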

In document Spatiotemporal modelling in estimation of nitrous oxide emissions from soil (Page 64-69)