Bayesian Inference - Statistical Methods in Metabolomics

issue of reproducibility and robustness of results.

1.3 Bayesian Inference

Both NMR and LC-MS require sophisticated statistical techniques in order to perform inference on the data obtained. Within statistics there are two main inferential frameworks: ‘frequentist’ and ‘Bayesian’. Frequentist inference is based upon the assumption that the observed data can be considered as one instance of a series of infinitely repeatable experiments [54]; standard frequentist methodologies include statistical hypothesis testing and the calculation of confidence intervals. Whilst frequentists tend to consider the probability of the observed data arising given a hypothesis, Bayesians are more concerned with the probability of a hypothesis given the observed data. Bayesian inference is derived from Bayes’ Theorem (Equation 1.3):

$$p(\theta \mid y, \phi) = \frac{p(y \mid \theta)\, p(\theta \mid \phi)}{p(y \mid \phi)} \qquad (1.3)$$

where p(y|θ) is the likelihood of the data given the model parameters, θ is a parameter of the likelihood distribution with prior θ ∼ p(θ|φ), φ is a hyper-parameter of the distribution of θ, p(y|φ) is the marginal likelihood:

$$p(y \mid \phi) = \int p(y \mid \theta)\, p(\theta \mid \phi)\, d\theta \qquad (1.4)$$

and p(θ|y, φ) is the posterior probability of parameter θ.
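As a small illustration of Equations 1.3 and 1.4, the following sketch approximates a posterior on a grid. The model is an assumption chosen purely for illustration (a binomial likelihood with a Beta(a, b) prior, so that φ = (a, b)); the function name `grid_posterior` and the data values are hypothetical and do not come from the text.

```python
import numpy as np

def grid_posterior(successes, trials, a=2.0, b=2.0, n_grid=2001):
    """Grid approximation of p(theta | y, phi) for a binomial likelihood
    with a Beta(a, b) prior (illustrative model, not from the text)."""
    theta = np.linspace(0.0, 1.0, n_grid)
    d_theta = theta[1] - theta[0]
    # Prior p(theta | phi) and likelihood p(y | theta), both up to constant
    # factors; those constants cancel when the posterior is normalised.
    prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
    likelihood = theta ** successes * (1 - theta) ** (trials - successes)
    unnormalised = likelihood * prior
    # Marginal likelihood (Equation 1.4, up to the same constants),
    # approximated by a Riemann sum over the grid.
    evidence = unnormalised.sum() * d_theta
    return theta, unnormalised / evidence

# Hypothetical data: 7 successes in 10 trials.
theta, post = grid_posterior(successes=7, trials=10)
d_theta = theta[1] - theta[0]
post_mean = (theta * post).sum() * d_theta

# For this conjugate model the posterior is Beta(a + 7, b + 3),
# whose mean is (a + 7) / (a + b + 10) = 9 / 14.
analytic_mean = 9 / 14
print(post_mean, analytic_mean)
```

The grid approach only works for one or two parameters, which is part of the motivation for the simulation-based methods discussed later in this section.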

Practically speaking this means setting up a full probability model, conditioning on the observed data (y) in order to calculate the posterior distribution (the conditional probability distribution of the unobserved quantities of interest, given the observed data) and then evaluating the model fit [55]. Frequentist ‘results’ are usually true or false conclusions drawn from significance tests, whereas Bayesian results more often take the form of probability distributions for parameters that attempt to describe the data. Bayesian methodology is employed widely within metabolomics for a variety of purposes, for example variable selection/dimension reduction, latent variable analysis, network/pathway analysis and spectral deconvolution, among many others.

When practising Bayesian inference, it is often useful to be able to calculate posterior estimates of characteristics of the model parameters. In the case of multi-parameter models, where θ = (θ1, ..., θk), this requires averaging over ‘nuisance’ parameters (parameters on which one is not concerned with performing inference). Supposing the parameter of interest is θ1, the marginal posterior distribution p(θ1|y) must be derived from the joint posterior distribution p(θ|y) = p(θ1, ..., θk|y). Averaging over the nuisance parameters gives:

$$p(\theta_1 \mid y) = \int \cdots \int p(\theta_1, \ldots, \theta_k \mid y)\, d\theta_2 \cdots d\theta_k$$

However, these posterior distributions are often high dimensional and very difficult to calculate either analytically or numerically. The problem of making inferences on this type of distribution is addressed by using Markov Chain Monte Carlo (MCMC) methods. MCMC enables simulation of random draws from a complex probability distribution, say f(x). MCMC methods are based on Markov Chains, which are stochastic processes X_t, t = 0, 1, 2, ... such that:

$$P(X_n = x_n \mid X_0 = x_0, \ldots, X_{n-1} = x_{n-1}) = P(X_n = x_n \mid X_{n-1} = x_{n-1}) \qquad (1.7)$$

i.e. the current state depends only on the previous one and not on the entire history of the chain [54]. MCMC methods essentially involve constructing a Markov Chain whose stationary distribution is the distribution of interest f(x). Once a large enough sample has been simulated, the functionals of interest can be estimated to any desired degree of accuracy.
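The Markov property in Equation 1.7 can be illustrated with a toy chain. The sketch below assumes a hypothetical two-state transition matrix `P` (not taken from the text) and checks that the long-run fraction of time spent in each state approaches the chain's stationary distribution, which is the idea that MCMC exploits.

```python
import numpy as np

# Hypothetical two-state transition matrix (an assumption for illustration);
# row i gives P(X_{n+1} = j | X_n = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def simulate_chain(P, n_steps, x0=0, seed=0):
    """Simulate a discrete Markov chain with transition matrix P."""
    rng = np.random.default_rng(seed)
    states = np.empty(n_steps, dtype=int)
    x = x0
    for t in range(n_steps):
        # The next state is drawn using only the current state x.
        x = rng.choice(len(P), p=P[x])
        states[t] = x
    return states

states = simulate_chain(P, n_steps=20_000)
empirical = np.bincount(states, minlength=2) / len(states)

# The stationary distribution satisfies pi = pi P; for this P it is (0.75, 0.25).
stationary = np.array([0.75, 0.25])
print(empirical, stationary)
```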

There are numerous methods for building the required Markov Chain. For example, Metropolis-Hastings uses a ‘proposal distribution’ q(Y|X_t) to propose a candidate value Y for X_{t+1}, possibly depending on X_t, and accepts it with probability α(X_t, Y) where

$$\alpha(X_t, Y) = \min\left(1, \frac{f(Y)\, q(X_t \mid Y)}{f(X_t)\, q(Y \mid X_t)}\right) \qquad (1.8)$$

and f(x) is the distribution of interest. A special case of Metropolis-Hastings is the Gibbs sampler. Suppose draws of θ = (θ1, ..., θk) must be obtained from the joint distribution f(θ1, ..., θk). Then for each draw t = 1, 2, ..., each θ_i^(t) is sampled from the conditional distribution given by

$$p\left(\theta_i^{(t)} \mid \theta_1^{(t)}, \ldots, \theta_{i-1}^{(t)}, \theta_{i+1}^{(t-1)}, \ldots, \theta_k^{(t-1)}\right)$$

(proportional to the joint distribution) so that each variable is sampled using the most recently updated values of the other variables. There are many variations on these samplers, including so-called ‘adaptive’ modifications that aim to increase the sampling efficiency of the algorithms.
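A minimal sketch of the Metropolis-Hastings acceptance rule in Equation 1.8 is given below. It is not a production sampler: the target `log_f` (an unnormalised N(3, 2²) density), the Gaussian random-walk proposal and the step size are all assumptions chosen for illustration. Because this proposal is symmetric the q terms cancel, but they are kept in the code so that it mirrors the general formula.

```python
import numpy as np

def log_f(x):
    """Log of an unnormalised target density; a N(3, 2^2) example (assumed)."""
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def log_q(x_to, x_from, step):
    """Log density, up to a constant, of the Gaussian proposal q(x_to | x_from)."""
    return -0.5 * ((x_to - x_from) / step) ** 2

def metropolis_hastings(n_iter, x0=0.0, step=2.5, seed=1):
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    x = x0
    n_accept = 0
    for t in range(n_iter):
        # Propose a candidate Y given the current state X_t.
        y = x + step * rng.normal()
        # Log of the ratio f(Y) q(X_t | Y) / (f(X_t) q(Y | X_t)), capped at log(1).
        log_alpha = min(0.0, log_f(y) + log_q(x, y, step)
                             - log_f(x) - log_q(y, x, step))
        # Accept with probability alpha; otherwise the chain stays at X_t.
        if np.log(rng.random()) < log_alpha:
            x = y
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_iter

chain, acceptance_rate = metropolis_hastings(20_000)
print(chain[2_000:].mean(), chain[2_000:].std(), acceptance_rate)
```

Working on the log scale avoids numerical underflow when f takes very small values, which is a common design choice rather than a requirement of the algorithm.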

A ‘burn-in’ period is usually utilised (i.e. the first M iterations are discarded), since good starting values are unlikely to be chosen and the initial estimates are therefore likely to be poor. Thinning is the practice of saving only every n-th iteration, sometimes with the aim of speeding up post-processing or reducing required memory, but also as an attempt to remove auto-correlation. For example, to obtain 10,000 saved iterations one would run n × 10,000 simulations and save only every n-th one. However, it has been argued that if the entire chain is long enough, the auto-correlation has likely averaged out anyway and thinning provides little additional benefit to this end [56].
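Burn-in and thinning can be sketched on the output of a simple Gibbs sampler. The example below assumes a bivariate normal target with correlation `rho`, for which both full conditionals are known normal distributions; the target, the deliberately poor starting values and the choices of M and n are hypothetical and serve only to show the bookkeeping.

```python
import numpy as np

def gibbs_bivariate_normal(n_iter, rho=0.8, start=(10.0, -10.0), seed=2):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_iter, 2))
    theta1, theta2 = start
    for t in range(n_iter):
        # theta1 is drawn given theta2 from the previous iteration ...
        theta1 = rng.normal(rho * theta2, cond_sd)
        # ... and theta2 is drawn given the theta1 just updated.
        theta2 = rng.normal(rho * theta1, cond_sd)
        draws[t] = theta1, theta2
    return draws

M = 500        # burn-in: number of initial iterations to discard
n = 5          # thinning interval: keep every n-th iteration
n_keep = 2_000 # number of saved iterations wanted

draws = gibbs_bivariate_normal(M + n * n_keep)
sample = draws[M::n]   # drop the burn-in, then keep every n-th draw

print(sample.shape, sample.mean(axis=0), np.corrcoef(sample.T)[0, 1])
```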

Convergence of MCMC is a major concern for Bayesian statisticians: the estimated parameters are often correlated with themselves over the iterations, or with each other, making convergence slow and difficult to achieve. Despite much theoretical research into convergence, it has so far been of limited benefit for practical applications. Although it remains impossible to be certain that an MCMC sample is truly representative of the target distribution, there are many diagnostic measures and techniques that help to evaluate success. Cowles and Carlin provide a thorough review of these [57].

One of the simplest checks is visualisation of how well your chains are ‘mixing’, or moving around the sample space. Plotting traces (parameter value vs. iteration number) can show you if your parameter gets stuck or moves poorly. ‘Running mean plots’ (plotting the sample mean up to each iteration vs. iteration number) can also be useful to this end. Investigating the auto-correlation can also provide clues as to the convergence of your chain. A ‘k-th lag auto-correlation’ can be computed, and we would expect the correlation to decrease as the lag increases. Persistently high auto-correlation for high k is again indicative of poor mixing.
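The running mean and the k-th lag auto-correlation described above can be computed in a few lines. In the sketch below the helper names `running_mean` and `autocorrelation` are hypothetical, and the chain is a simulated AR(1) series (an assumption standing in for real MCMC output) whose auto-correlation is known to decay with the lag.

```python
import numpy as np

def running_mean(chain):
    """Sample mean of the chain up to each iteration."""
    chain = np.asarray(chain, dtype=float)
    return np.cumsum(chain) / np.arange(1, len(chain) + 1)

def autocorrelation(chain, lag):
    """Sample auto-correlation of the chain at the given positive lag."""
    chain = np.asarray(chain, dtype=float)
    centred = chain - chain.mean()
    return float(np.dot(centred[:-lag], centred[lag:]) / np.dot(centred, centred))

# Stand-in for MCMC output: an AR(1) series with coefficient 0.9 (assumed).
rng = np.random.default_rng(3)
ar_coef = 0.9
chain = np.empty(20_000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = ar_coef * chain[t - 1] + rng.normal()

means = running_mean(chain)
acf = [autocorrelation(chain, k) for k in (1, 5, 20)]
print(means[-1], acf)
```

Plotting `chain` and `means` against the iteration number gives the trace and running mean plots; a chain that mixes well shows a running mean that settles down and an auto-correlation that falls towards zero as the lag grows.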

One commonly used convergence diagnostic is the ‘Gelman and Rubin Multiple Sequence Diagnostic’, which is calculated using multiple chains per parameter. Consider the ‘potential scale reduction factor’:

$$\hat{R} = \sqrt{\frac{\widehat{\mathrm{Var}}(\theta)}{W}} \qquad (1.9)$$

where \widehat{Var}(θ) is the estimated variance of the target distribution as a weighted average of the within-chain, W, and between-chain, B, variances, with n the number of iterations in each chain:

$$\widehat{\mathrm{Var}}(\theta) = \left(1 - \frac{1}{n}\right) W + \frac{1}{n} B \qquad (1.10)$$

When R̂ is high (greater than about 1.1) this indicates the chains should be run longer to improve convergence. A ‘Gelman plot’ shows how the potential scale reduction factor changes through the iterations and is another useful diagnostic.
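Equations 1.9 and 1.10 translate directly into code. The sketch below assumes the usual definitions of W (the mean of the within-chain sample variances) and B (n times the sample variance of the chain means), which the text does not spell out; the function name and the simulated chains are hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """R-hat for one parameter; chains has shape (m chains, n iterations)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    # W: average of the within-chain sample variances (assumed definition).
    W = chains.var(axis=1, ddof=1).mean()
    # B: n times the sample variance of the chain means (assumed definition).
    B = n * chains.mean(axis=1).var(ddof=1)
    # Equation 1.10: weighted average of W and B.
    var_hat = (1.0 - 1.0 / n) * W + B / n
    # Equation 1.9.
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(4)
# Four chains sampling the same distribution: R-hat should be close to 1.
mixed = rng.normal(size=(4, 1_000))
# Two of the chains shifted to a different region: R-hat should exceed 1.1.
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

r_mixed = potential_scale_reduction(mixed)
r_stuck = potential_scale_reduction(stuck)
print(r_mixed, r_stuck)
```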

There are several software packages available for automating the Bayesian analysis of models via MCMC methods. WinBUGS (arising from the BUGS - ‘Bayesian inference Using Gibbs Sampling’ project based in the MRC Biostatistics Unit at Cambridge, England) is one of the most popular [58]. It provides a framework for defining Bayesian hierarchical models and a library of sampling routines to perform inference. Several extensions have been developed allowing construction and analysis of ever more complicated models. One such alternative, JAGS (Just Another Gibbs Sampler), was developed with the objective of being an open-source engine for the BUGS language [59]. JAGS is highly extensible and allows users to develop new libraries and add-ons. Both WinBUGS and JAGS can interface to R, making them useful tools
