Finally, we compare our Bayesian model selection procedure to that of Lanne and Saikkonen (2008). They strongly recommend using diagnostic checks to confirm the adequacy of the model suggested by the maximized likelihood criterion, but we ignore this step as it is difficult to incorporate into the simulation experiment. For simplicity, we consider the case where the order of the autoregressive polynomial operators is assumed to be known. In particular, we set r + s = 2 and calculate the marginal likelihoods and the maximum values of the approximate log-likelihood function for the causal, purely noncausal and mixed models. We assume the same three parameter combinations (φ₁, ϕ₁) ∈ {(0.1, 0.7), (0.7, 0.1), (0.7, 0.7)} as in Section 3. Again, the results (not reported in detail) are based on 1000 realizations of a series of 150 observations where the error terms ε_t are assumed to have the standardized Student's t-distribution with 3 degrees of freedom and
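For reference, a minimal statement of the mixed causal and noncausal autoregression assumed in this comparison (the assignment of φ to the causal lag polynomial and ϕ to the noncausal lead polynomial is a notational assumption on our part) is:

```latex
% Mixed causal-noncausal AR(r,s) with standardized Student-t(3) errors,
% written with an assumed notation: phi for the causal lag polynomial,
% varphi for the noncausal lead polynomial.
\begin{aligned}
  \phi(B)\,\varphi(B^{-1})\, y_t &= \varepsilon_t, \qquad \varepsilon_t \sim t_{3} \ \text{(standardized)},\\
  \phi(B)         &= 1 - \phi_1 B - \cdots - \phi_r B^{r},\\
  \varphi(B^{-1}) &= 1 - \varphi_1 B^{-1} - \cdots - \varphi_s B^{-s},
\end{aligned}
```

with B the backshift operator. With r + s = 2, the candidate models are the purely causal AR(2,0), the purely noncausal AR(0,2) and the mixed AR(1,1); the three parameter pairs above are naturally read as the causal and noncausal coefficients of a mixed AR(1,1) data-generating process.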
Previous authors have also attempted model selection experiments for the GIG cycle, but with various limitations compared to our approach. Roe and Allen (1999) compared deterministic models plus autoregressive process noise using an F-test and found no support for any one model over any other. Feng and Bailer-Jones (2015) used Bayesian model selection to select between competing forcing functions over the Pleistocene, concluding that obliquity influences the termination times over the entire Pleistocene, and that precession also has explanatory power following the mid-Pleistocene transition. Their approach requires a tractable likelihood function, which heavily restricts the class of models that can be compared, in particular ruling out the use of SDE models. As in the previously mentioned hypothesis tests, they also begin by discarding most of the data and using a summary consisting of just the termination times (∼12 over the past 1 Myr), which is necessary as the low-order deterministic models used do not fit well to the complete dataset. They also only sample parameter values from the prior, leading to poor numerical efficiency. Finally, Kwasniok (2013) compares conceptual models over the last glacial period using the Bayesian information criterion. The likelihood of each model is estimated using an unscented Kalman filter (UKF) (Wan et al., 2000). Whilst this approach focussed on a smaller time horizon than our application, it can be applied using the data and models in this paper. However, the Gaussian approximation used by the UKF, whilst working well for filtering, is unproven for parameter estimation and model selection, and the particle filter offers a more natural approach for non-linear dynamical systems.
It is quite common in statistical modeling to select a model and make inference as if the model had been known in advance, i.e. ignoring model selection uncertainty. The resulting estimator is called the post-model-selection estimator (PMSE), whose properties are hard to derive. Conditionally on the data at hand (as is usually the case), Bayesian model selection is free of this phenomenon. This paper is concerned with the properties of the Bayesian estimator obtained after model selection when the frequentist (long-run) performance of the resulting Bayesian estimator is of interest. The proposed method, based on Bayesian decision theory, builds on the well-known Bayesian model averaging (BMA) machinery and outperforms both PMSE and BMA. It is shown that if the unconditional model selection probability is equal to the model prior, then the proposed approach reduces to BMA. The method is illustrated using Bernoulli trials.
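For context, the BMA machinery referred to above combines model-specific posteriors using posterior model probabilities (a standard construction, written here in generic notation rather than the paper's own):

```latex
% Generic Bayesian model averaging over candidate models M_1, ..., M_K
\begin{aligned}
  p(\Delta \mid y) &= \sum_{k=1}^{K} p(\Delta \mid M_k, y)\, p(M_k \mid y),\\
  p(M_k \mid y)    &= \frac{p(y \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(y \mid M_j)\, p(M_j)},
\end{aligned}
```

where Δ is the quantity of interest and p(y | M_k) is the marginal likelihood of model M_k.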
This dissertation explores Bayesian model selection and estimation in settings where the model space is too vast to rely on Markov Chain Monte Carlo for posterior calculation. First, we consider the problem of sparse multivariate linear regression, in which several correlated outcomes are simultaneously regressed onto a large set of covariates, where the goal is to estimate a sparse matrix of covariate effects and the sparse inverse covariance matrix of the residuals. We propose an Expectation-Conditional Maximization algorithm to target a single posterior mode. In simulation studies, we find that our algorithm outperforms other regularization competitors thanks to its adaptive Bayesian penalty mixing. In order to better quantify the posterior model uncertainty, we then describe a particle optimization procedure that targets several high-posterior probability models simultaneously. This procedure can be thought of as running several "mutually aware" mode-hunting trajectories that repel one another whenever they approach the same model. We demonstrate the utility of this method for fitting Gaussian mixture models and for identifying several promising partitions of spatially referenced data. Using these identified partitions, we construct an approximation for posterior functionals that averages out the uncertainty about the underlying partition. We find that our approximation has favorable estimation risk properties, which we study in greater detail in the context of partially exchangeable normal means. We conclude with several proposed refinements of our particle optimization strategy that encourage a wider exploration of the model space while still targeting high-posterior probability models.
Such an approach not only avoids problems associated with improper priors when calculating the Bayes factor, but also has the potential to allow more general loss functions (for example, replacing the posterior predictive density with a more general scoring rule; see Section 2.5) to be incorporated in the model assessment. However, it leaves open two important questions: first, the extent to which overlapping subsets used for model training and validation introduce bias into the assessment, and second, the extent to which the power of the assessment is reduced by assessing performance on models conditioned on an incomplete sample of data. This approach and the associated issues are closely linked to the cross-validatory approaches we now consider.
Parameter estimation for complex models using Bayesian inference is usually a very costly process, as it requires a large number of solves of the forward problem. We show here how the construction of adaptive surrogate models using a posteriori error estimates for quantities of interest can significantly reduce the computational cost in problems of statistical inference. As surrogate models provide only approximations of the true solutions of the forward problem, it is nevertheless necessary to control these errors in order to construct an accurate reduced model with respect to the observables utilized in the identification of the model parameters. Effectiveness of the proposed approach is demonstrated on a numerical example dealing with the Spalart–Allmaras model for the simulation of turbulent channel flows. In particular, we illustrate how Bayesian model selection using the adapted surrogate model in place of solving the coupled nonlinear equations leads to the same quality of results while requiring fewer nonlinear PDE solves.
Bayesian model selection offers an alternative approach (detailed in the next section). Rather than directly using penalised likelihoods, a posterior distribution across a set of models is constructed, and comparison between pairs of models can be made using Bayes factors. Bayesian model selection and penalised likelihood methods are closely related: the log posterior distribution is given, up to a constant, by the sum of the log likelihood and the log prior. The (negative) log prior can therefore be viewed as a penalty term. This view makes clear the relationship between various penalised likelihood estimators and related Bayesian formulations. Approaches that draw ideas from the Bayesian approach in a frequentist context are also available. For example, Buckland et al. (1997) propose a method for assigning weights to models, but the weights arise from functions of information criteria, rather than from a posterior distribution. The BIC also straddles both frameworks: although it takes the form of a penalised likelihood, it is also an asymptotic approximation to the Bayes factor.
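The relationship described above can be summarized by two standard identities (generic notation; θ denotes a model's parameters, d_M its dimension, θ̂_M the maximum likelihood estimate and n the sample size):

```latex
% Log posterior as a penalised log likelihood (up to an additive constant)
\log p(\theta \mid y) \;=\; \log p(y \mid \theta) \;+\; \log p(\theta) \;+\; \text{const},
\qquad \text{penalty} \;=\; -\log p(\theta).

% BIC as an asymptotic approximation to the log marginal likelihood
\log p(y \mid M) \;\approx\; \log p(y \mid \hat{\theta}_M) \;-\; \frac{d_M}{2}\,\log n
\;=\; -\tfrac{1}{2}\,\mathrm{BIC}(M)
```

so that the difference in BIC between two models approximates minus twice the log Bayes factor.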
In this paper we suppose that we are in a context similar to that of Example 1, where, for any possible model, the sample space of the problem must be consistent with a single event tree, but where on the basis of a sample of students' records we want to select one of a number of different possible CEG models, i.e. we want to find the "best" partitioning of the situations into stages. We take a Bayesian approach to this problem and choose the model with the highest posterior probability: the Maximum A Posteriori (MAP) model. This is the simplest and possibly most common Bayesian model selection method, advocated by, for example, Dennison et al. [6], Castelo [7], and Heckerman [8], the latter two specifically for Bayesian network selection.
Previous attempts to control for model quality in GLMs for fMRI include statistical tests for goodness of fit (Razavi et al., 2003) and the application of Akaike's or the Bayesian information criterion for activation detection (Seghouane and Ong, 2010) or theory selection (Gläscher and O'Doherty, 2010). Additionally, voxel-wise Bayesian model assessment (Penny et al., 2003, 2005, 2007) and random-effects Bayesian model selection (Rosa et al., 2010) have been included in the popular software package Statistical Parametric Mapping (SPM), but are only rarely used due to low visibility, high analytical complexity and interpretational difficulty. Finally, a toolbox for frequentist model diagnosis and exploratory data analysis called SPMd ("d" for "diagnostic") has been released for SPM (Luo and Nichols, 2003), but was discontinued several years ago (Nichols, 2013).
A number of Bayesian formulations of PCA have followed from the probabilistic formulation of Tipping and Bishop (1999a), with the necessary marginalization being approximated through both Laplace approximations (Bishop, 1999a; Minka, 2000, 2001a) and variational bounds (Bishop, 1999b). More recently, work within the statistics research community has used a Bayesian variational approach to derive an explicit conditional probability distribution for the signal dimension given the data (Šmídl and Quinn, 2007). However, these results have only been tested on low-dimensional data with relatively large sample sizes. A somewhat more tractable expression for the signal dimension posterior was also obtained by Minka (2000, 2001a), and it is that Bayesian formulation of PCA that we draw upon. By performing a Laplace approximation (Wong, 1989), that is, expanding about the maximum posterior solution, Minka derived an elegant approximation to the probability, the model evidence p(D | k), of observing a data set D given the number of principal components k (Minka, 2000, 2001a). The signal dimensionality of the given data set is then estimated by the value of k that maximizes p(D | k). As with any Bayesian model selection procedure, if the data have truly been generated by a model of the form proposed, then one is guaranteed to select the correct model dimensionality as the sample increases to an infinite size. Minka's dimensionality selection method performs well when tested on data sets of moderate size and dimensionality. Indeed, the Laplace approximation incorporates the leading order term in an asymptotic expansion of the Bayesian evidence, with the sample size N playing the role of the 'large' parameter, and so we would expect the Laplace approximation to be increasingly accurate as N → ∞. In real-world data sets, such as those emanating from molecular biology experiments, the number of variables d is often very much greater than the sample size N, with d ∼ 10⁴ yet N ∼ 10 or N ∼ 10² not uncommon.
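As an illustration of evidence-based dimensionality selection (not the code used by the authors), scikit-learn's PCA exposes an implementation of Minka's estimate via n_components='mle'; the sketch below, on synthetic data of our own choosing, recovers a three-dimensional signal embedded in ten-dimensional noise:

```python
# Minimal sketch: choose the PCA signal dimension k by (approximately)
# maximizing Minka's Laplace-approximated evidence p(D | k).
# scikit-learn exposes this via PCA(n_components='mle'); data are synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, d, k_true = 500, 10, 3          # sample size, ambient dim, true signal dim

# Low-rank signal plus isotropic noise (the model class probabilistic PCA assumes).
latent = rng.normal(size=(N, k_true))
loadings = rng.normal(size=(k_true, d))
X = latent @ loadings + 0.1 * rng.normal(size=(N, d))

pca = PCA(n_components="mle", svd_solver="full").fit(X)
print("estimated signal dimension:", pca.n_components_)   # typically 3 for this setup
```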
to the ones made by Schwarz (1978) and Haughton (1988). In this sense, our paper generalizes the mentioned works, providing valid asymptotic formulas for a new type of marginal likelihood integrals. The resulting asymptotic approximations, presented in Theorem 4, deviate from the standard BIC score. Hence the standard BIC score is not justified for Bayesian model selection among Bayesian networks with hidden variables. Moreover, no uniform score formula exists for such models; our adjusted BIC score changes depending on the different types of singularities of the sufficient statistics, namely, the coefficient of the ln N term (Eq. 2) is no longer −d/2 but rather a function of the sufficient statistics. An additional result, presented in Theorem 5, describes the asymptotic marginal likelihood given a degenerate (missing links) naive Bayesian model; it complements the main result presented by Theorem 4.
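For context, the regular-model expansion that the adjusted score deviates from is, in generic notation,

```latex
% Regular-model asymptotics underlying the standard BIC score
\log p(D \mid M) \;=\; \log p(D \mid \hat{\theta}) \;-\; \frac{d}{2}\,\ln N \;+\; O(1),
```

whereas for singular models such as Bayesian networks with hidden variables the coefficient of ln N is no longer −d/2 but depends on the type of singularity of the sufficient statistics, as in Theorem 4.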
applications (i.e., models with many parameters). If prior knowledge about the parameters is not available or vague, a further simplification leads to the Bayesian information criterion or Schwarz's information criterion (BIC) [Schwarz, 1978; Raftery, 1995]. The Akaike information criterion (AIC) [Akaike, 1973] originates from information theory and is frequently applied in the context of BMA in social research [Burnham and Anderson, 2003] for its ease of implementation. Previous studies have revealed that these information criteria (IC) differ in the resulting posterior model weights or even in the ranking of the models [Poeter and Anderson, 2005; Ye et al., 2008, 2010a, 2010b; Tsai and Li, 2010; Singh et al., 2010; Morales-Casique et al., 2010; Foglia et al., 2013]. This implies that they do not reflect the true Bayesian trade-off between performance and complexity, but might produce an arbitrary trade-off which is not supported by Bayesian theory and cannot provide a reliable basis for Bayesian model selection. Burnham and Anderson [2004] conclude that "... many reported studies are not appropriate as a basis for inference about which criterion should be used for model selection with real data." The work of Lu et al. [2011] has been a first step toward clarifying the so far contradictory results by comparing the KIC and the BIC against a Markov chain Monte Carlo (MCMC) reference solution for a synthetic geostatistical application.
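To make concrete how model weights are derived from such information criteria (a standard construction in the spirit of Akaike weights, not code from the cited studies), a minimal sketch:

```python
# Minimal sketch: turn information-criterion values (AIC or BIC) into
# model weights w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j),
# where delta_i = IC_i - min_j IC_j.  The IC values below are made up.
import numpy as np

def ic_weights(ic_values):
    ic = np.asarray(ic_values, dtype=float)
    delta = ic - ic.min()            # differences relative to the best model
    w = np.exp(-0.5 * delta)
    return w / w.sum()

bic = [1023.4, 1025.1, 1031.0]       # hypothetical BIC values for three models
print(ic_weights(bic))               # weights sum to 1; lower IC -> larger weight
```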
Chapter 3 introduces an individual-based SIS model for the spread dynamics of an infectious disease among a population of individuals partitioned into households. The proposed hidden Markov model, which naturally accounts for partially observed data and imperfect test sensitivity, is used as the basic model for the methods developed throughout the thesis. Special attention is given to the data augmentation MCMC algorithm that is used to facilitate inference for this model. In Chapter 4 we consider the problem of Bayesian model selection in the presence of high-dimensional missing data, focusing on epidemiological applications where observations are gathered longitudinally and the population under investigation is organised in small groups. In particular, we outline an algorithm that combines ideas from MCMC, importance sampling and filtering to provide estimates of the marginal likelihood, and is well suited for small-scale epidemics. Even though several alternative approaches exist, there are currently only a few studies assessing the performance of model selection methods in such settings. Hence, one of the main contributions of this chapter is the comparison of the proposed method with existing approaches, achieved through an extended simulation study on synthetic data generated to resemble real-life epidemiological problems. The importance of model selection procedures is further demonstrated in Chapter 5, where we successfully apply these methods to uncover new insights into the transmission dynamics of E. coli O157:H7 in cattle.
Horizontal gene transfer (HGT) plays a critical role in evolution across all domains of life with important biological and medical implications. I propose a simple class of stochastic models to examine HGT using multiple orthologous gene alignments. The models function in a hierarchical phylogenetic framework. The top level of the hierarchy is based on a random walk process in "tree space" that allows for the development of a joint probabilistic distribution over multiple gene trees and an unknown, but estimable species tree. I consider two general forms of random walks. The first form is derived from the subtree prune and regraft (SPR) operator that mirrors the observed effects that HGT has on inferred trees. The second form is based on walks over complete graphs and offers numerically tractable solutions for an increasing number of taxa. The bottom level of the hierarchy utilizes standard phylogenetic models to reconstruct gene trees given multiple gene alignments conditional on the random walk process. I develop a well-mixing Markov chain Monte Carlo algorithm to fit the models in a Bayesian framework. I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis. Using 144 orthologous gene alignments from six prokaryotes previously collected and analyzed, Bayesian model selection finds support for (1) the SPR model over the alternative form, (2) the 16S rRNA reconstruction as the most likely species tree, and (3) increased HGT of operational genes compared to informational genes.
The Naive Bayes method is based on the work of Thomas Bayes (1702-1761). In Bayesian classification, we have a hypothesis that the given data belong to a particular class, and we calculate the probability of this hypothesis being true. This is among the most practical approaches for certain types of problems. The approach requires only one scan of the whole data. Also, if at some stage there are additional training data, then each training example can incrementally increase or decrease the probability that a hypothesis is correct. Thus, a Bayesian network is used to model a domain containing uncertainty [12, 13], and evolutionary optimization of RBF network architectures (feature and model selection) is applicable to a wide range of data mining problems (in particular, classification problems). Therefore, the overall runtime of the EA had to be reduced substantially. We decided to optimize the most important architecture parameters only and to use standard techniques for representation, selection, and reproduction.
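To illustrate the incremental-update property mentioned above (a generic sketch using scikit-learn's GaussianNB on synthetic data, not tied to the RBF/EA system being described):

```python
# Minimal sketch: a Naive Bayes classifier whose class probabilities are
# updated incrementally as additional training batches arrive.
# Data are synthetic; GaussianNB.partial_fit performs the incremental update.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
classes = np.array([0, 1])

def make_batch(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))  # class 1 is shifted
    return X, y

clf = GaussianNB()
X0, y0 = make_batch(50)
clf.partial_fit(X0, y0, classes=classes)          # first pass over the data

x_query = np.array([[1.0, 1.0]])
print("P(class | x) after 50 examples: ", clf.predict_proba(x_query))

X1, y1 = make_batch(500)
clf.partial_fit(X1, y1)                            # later batch refines the estimate
print("P(class | x) after 550 examples:", clf.predict_proba(x_query))
```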
Since the network-based penalised-likelihood approach [21] does not incorporate interaction terms, we performed a third simulation to investigate its performance under a data-generating model without interaction terms. In particular, we used the same true underlying predictor subset as in Simulation 2 (i.e. γ₃∗ = γ₂∗), which contains predictors that are neighbours in the network, but generated data using a linear model without interaction terms; Y = A + 2B + 3C + ε, where A, B, C are the three influential variables. We note that each predictor in the data-generating model has a different magnitude of influence on the response (i.e. different regression coefficients). Average ROC curves are shown in Figure 3c. Comparisons are made to other approaches as described above, but all methods now use linear models without interaction terms. As in Simulations 1 and 2, the Bayesian variable selection approach with empirical Bayes and pathway-based priors outperforms a flat prior and an incorrect prior, with empirical Bayes selecting the correct prior in 99% of iterations (correct and incorrect priors are the same as for Simulation 2). The Bayesian approach with a Markov random field prior showed a similar performance to the proposed pathway-based priors (a correct value of λ > 0 was selected in 90% of iterations). However, the approach of Li and Li [21], whilst now more competitive compared with Simulation 2, is still outperformed by the empirical Bayes approach with pathway-based priors. Moreover, it does not display a clear improvement over Lasso regression.
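A minimal sketch of the data-generating step for this third simulation, under assumed scales for the noise and predictors (the network structure and the Bayesian variable-selection machinery themselves are not reproduced here):

```python
# Minimal sketch: generate responses from the linear model without interactions,
#   Y = A + 2B + 3C + eps,
# where A, B, C are three influential predictors among many irrelevant ones.
# The noise scale, predictor distribution and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 50                          # observations, total candidate predictors
X = rng.normal(size=(n, p))             # columns 0, 1, 2 play the roles of A, B, C
A, B, C = X[:, 0], X[:, 1], X[:, 2]
eps = rng.normal(scale=1.0, size=n)
Y = A + 2 * B + 3 * C + eps             # different coefficient magnitudes, as noted

# Any variable-selection method (lasso, Bayesian variable selection with
# pathway-based priors, etc.) would then be run on (X, Y) and scored by how
# well it recovers columns {0, 1, 2}.
```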
The promise of augmenting accurate predictions provided by modern neural networks with well-calibrated predictive uncertainties has reinvigorated interest in Bayesian neural networks. However, model selection—even choosing the number of nodes—remains an open question. Poor choices can severely affect the quality of the produced uncertainties. In this paper, we explore continuous shrinkage priors, the horseshoe and the regularized horseshoe distributions, for model selection in Bayesian neural networks. When placed over node pre-activations and coupled with appropriate variational approximations, we find that the strong shrinkage provided by the horseshoe is effective at turning off nodes that do not help explain the data. We demonstrate that our approach finds compact network structures even when the number of nodes required is grossly over-estimated. Moreover, the model selection over the number of nodes does not come at the expense of predictive or computational performance; in fact, we learn smaller networks with comparable predictive performance to current approaches. These effects are particularly apparent in sample-limited settings, such as small data sets and reinforcement learning.
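For reference, the horseshoe construction alluded to above takes the following standard hierarchical form (written here generically for the weight vector w_j feeding node j's pre-activation; the per-node scale and the regularized variant's slab scale c follow the usual presentation and are stated as assumptions):

```latex
% Horseshoe prior over the weights feeding node j (generic statement)
\begin{aligned}
  w_{j} \mid \lambda_j, \tau &\sim \mathcal{N}\!\left(0,\; \tau^{2}\lambda_j^{2} I\right),\\
  \lambda_j &\sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, \tau_0),
\end{aligned}
\qquad
\text{regularized variant:}\quad
\tilde{\lambda}_j^{2} \;=\; \frac{c^{2}\lambda_j^{2}}{c^{2} + \tau^{2}\lambda_j^{2}},
```

so that a small local scale λ_j shrinks an entire node's pre-activation toward zero (switching the node off), while the slab scale c in the regularized variant bounds how far the weights of active nodes can escape the shrinkage.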
The departure from normality can also be seen from the Bayesian residual test. We first fit the MSGM model with one regime and one state, which is effectively a model of multivariate normal returns. We conduct a series of residual tests by normalizing the historical returns using the posterior draws of the mean and covariance matrix. If the returns are indeed normally distributed, then the classical Kolmogorov-Smirnov test should not reject the null hypothesis. The histograms of the test statistics are reported in Figure 1; the six panels correspond to the six assets in sequence. Since we have a fairly large sample size of more than 2000 observations, the 1% significance critical value of the test statistic can be approximated by 1.63/√T, which is about 0.03. Figure 1 shows that the test statistics are larger than the critical value in every case, so that normality can be decisively rejected.
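A minimal sketch of this residual check for a single asset and a single posterior draw (the MSGM machinery and the actual return data are not reproduced; the series below is synthetic and fat-tailed by construction):

```python
# Minimal sketch: Kolmogorov-Smirnov check of normality for standardized returns.
# Returns are standardized using a stand-in for one posterior draw of the mean
# and variance; the statistic is compared to the 1% critical value 1.63/sqrt(T).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T = 2000
returns = 0.01 * rng.standard_t(df=3, size=T)        # synthetic fat-tailed returns

mu_draw, sigma_draw = returns.mean(), returns.std(ddof=1)   # stand-in posterior draw
z = (returns - mu_draw) / sigma_draw

res = stats.kstest(z, "norm")
critical_1pct = 1.63 / np.sqrt(T)                    # asymptotic 1% critical value
print(f"KS statistic = {res.statistic:.3f}, 1% critical value = {critical_1pct:.3f}")
print("reject normality" if res.statistic > critical_1pct else "fail to reject")
```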
Inflation's volatility has attracted a good deal of attention recently; the interest has been sparked by the debate on the Great Moderation that has been documented for real economic aggregates. Inflation stabilization is indeed a possible source of the reduction in the volatility of macroeconomic aggregates. The issue is also closely bound up with inflation persistence and predictability. In an influential paper, Stock and Watson (2007), using a local level model with stochastic volatility, document that inflation is less volatile now than it was in the 1970s and early 1980s; moreover, persistence, which measures the long-run effect of a shock, has declined, and predictability has increased.
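For completeness, the local level (unobserved components) model with stochastic volatility used by Stock and Watson (2007) can be written, in standard notation, as:

```latex
% Unobserved-components model with stochastic volatility (Stock and Watson, 2007)
\begin{aligned}
  \pi_t  &= \tau_t + \eta_t,            & \eta_t        &\sim \mathcal{N}\!\left(0, \sigma^{2}_{\eta,t}\right),\\
  \tau_t &= \tau_{t-1} + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}\!\left(0, \sigma^{2}_{\varepsilon,t}\right),\\
  \ln \sigma^{2}_{\eta,t} &= \ln \sigma^{2}_{\eta,t-1} + \nu_{\eta,t}, &
  \ln \sigma^{2}_{\varepsilon,t} &= \ln \sigma^{2}_{\varepsilon,t-1} + \nu_{\varepsilon,t},
\end{aligned}
```

where π_t is inflation, τ_t is its trend (permanent) component, and the log volatilities evolve as random walks; the relative sizes of the two time-varying volatilities govern persistence and predictability.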