Markov Chain Monte Carlo Bayesian inference

1.5 Inference

1.4.2.3 Markov Chain Monte Carlo Bayesian inference

One of the main benefits of a sparse Bayesian method such as RVM is its speed of computation relative to a full Bayesian approach. However, although the weights associated with the relevance vectors provide an indication of the importance of each dimension (in this case our voxels), the algorithm is constructed to minimize the final number of dimensions, resulting in the vast majority being pruned to zero. Although the RVM model has a relatively good level of prediction, the trade-off for improved speed is a reduction in the localization of the critical regions.

Full Bayesian techniques maintain all dimensions, allowing for a more detailed picture of the relative importance of the dimensions. As with most, if not all, Bayesian approaches, the difficulty lies in obtaining the posterior distribution. This generally requires the integration of high-dimensional functions that cannot be solved analytically, and the alternative computational solutions present a considerable hurdle.

In the following section I will focus solely on Markov Chain Monte Carlo (MCMC) methods. Once again this is not meant to be an exhaustive description, and a more detailed explanation can be found in Neal (1993).

MCMC methods are computational techniques that rely on random sampling to simulate direct draws from some complex distribution of interest. As the name suggests, there are two main parts to the technique: Markov chains and Monte Carlo integration.

1.4.2.3.1 Monte Carlo integration

Originally the Monte Carlo approach was devised by physicists to use random number generation to compute integrals. First consider a complex integral:

$$\int_a^b h(x)\,dx$$

This can be decomposed into the product of a function $f(x)$ and a probability density function $p(x)$ defined over the interval $(a, b)$:

$$\int_a^b h(x)\,dx = \int_a^b f(x)\,p(x)\,dx$$

As a result the integral can be expressed as an expectation of $f(x)$ over the density $p(x)$.

Similarly, with relevance vector machines the optimal solution is derived by iteratively maximizing the likelihood margin for the posterior distribution.

Although the generation of the classification function with RVM is slower than with SVM, the parameters are derived automatically through the iterative method (although a threshold value will need to be specified for the minimum relevance magnitude), whilst a grid search for the optimal parameters is needed with the SVM approach.


The Monte Carlo simulation draws an independent and identically distributed (iid) set of samples from a target density $p(x)$ defined by the set of possible configurations of a system. By drawing a large number of random variables $x_1, \dots, x_n$ from the density $p(x)$, the integral can be approximated as:

$$\int_a^b h(x)\,dx = E_{p(x)}[f(x)] \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$

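This is referred to as Monte Carlo integration. This arrangement is particularly useful as it provides a means to derive the posterior distribution required for Bayesian analysis. For example, to derive the posterior distribution with a given prior, the normalizing factor in Bayes’ theorem (the integral in the denominator) needs to be computed. Writing $\theta$ for the model parameters and $y$ for the observed data:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{\int p(y \mid \theta)\,p(\theta)\,d\theta}$$

The normalization factor integral can therefore be approximated by the Monte Carlo method as follows:

$$\int p(y \mid \theta)\,p(\theta)\,d\theta \approx \frac{1}{n}\sum_{i=1}^{n} p(y \mid \theta_i)$$

where the samples $\theta_1, \dots, \theta_n$ are drawn from the prior $p(\theta)$.

Similarly the marginal posterior of a single parameter $\theta_j$, $p(\theta_j \mid y) = \int p(\theta \mid y)\,d\theta_{-j}$ (integrating over all other parameters $\theta_{-j}$), can be solved using the same method. Therefore the crux of the method relies on drawing (pseudo-) random samples from a specified probability distribution to estimate the intractable integral. To achieve this there are a number of prerequisites: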

1. Probability density functions (pdfs). The target distribution must be specified by a set of pdfs.

2. A random number generator.

3. A sampling rule. This is a prescription for sampling from the specified pdfs.
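The three prerequisites can be made concrete with a minimal sketch that estimates the normalizing factor for a toy problem. Everything specific here is an assumption for illustration and not taken from the text: the coin-flip model, the uniform prior, the data (7 heads in 10 flips) and the function names. This model was chosen only because its exact normalizing factor is known in closed form, so the estimate can be checked.

```python
import math
import random


def likelihood(theta, heads, flips):
    """p(y | theta) for one specific sequence of coin flips (toy model)."""
    return theta ** heads * (1.0 - theta) ** (flips - heads)


def estimate_evidence(heads, flips, n_samples, seed=0):
    """Monte Carlo estimate of the integral of p(y | theta) p(theta)."""
    rng = random.Random(seed)        # prerequisite 2: a random number generator
    total = 0.0
    for _ in range(n_samples):
        # prerequisite 1: the pdf is the prior, assumed Uniform(0, 1) here
        # prerequisite 3: the sampling rule is a direct draw from that prior
        theta = rng.random()
        total += likelihood(theta, heads, flips)
    return total / n_samples


heads, flips = 7, 10                 # hypothetical data
# Closed-form value of the same integral for a uniform prior (a Beta function).
exact = (math.factorial(heads) * math.factorial(flips - heads)
         / math.factorial(flips + 1))
approx = estimate_evidence(heads, flips, n_samples=200_000)
print(f"exact = {exact:.6e}, Monte Carlo = {approx:.6e}")
```

In realistic models the prior and the posterior overlap far less than in this toy case, so drawing directly from the prior becomes very inefficient.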

The law of large numbers (the sample average tends towards the expected value as the number of samples tends towards infinity) applies directly only if the samples are independent. Drawing independent samples from a complex target distribution may be difficult with Monte Carlo methods; one solution is to use a Markov chain.
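The behaviour described by the law of large numbers can be seen in a small sketch of Monte Carlo integration. The integrand $h(x) = x^2$ on $(0, 2)$, the uniform choice of $p(x)$ and the name `mc_integrate` are assumptions made for illustration; with a uniform density $p(x) = 1/(b - a)$, the function being averaged is $f(x) = h(x)(b - a)$.

```python
import random


def mc_integrate(h, a, b, n_samples, seed=0):
    """Estimate the integral of h over (a, b) using independent uniform draws."""
    rng = random.Random(seed)
    width = b - a                    # p(x) = 1 / width, so f(x) = h(x) * width
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(a, b)        # independent draw from p(x)
        total += h(x) * width
    return total / n_samples


exact = 8.0 / 3.0                    # integral of x^2 from 0 to 2
for n in (100, 10_000, 100_000):
    estimate = mc_integrate(lambda x: x * x, 0.0, 2.0, n)
    print(f"n = {n:>7}: estimate = {estimate:.4f}, "
          f"error = {abs(estimate - exact):.4f}")
```

The error typically shrinks in proportion to one over the square root of the number of samples, although for any single seed it need not decrease at every step.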

1.5.2.4.2 Markov Chains

A Markov chain is a mathematical system that undergoes transitions from one state to another in a chain-like manner (Neal, 1993). The key feature is that it is a random process, with the next state depending only on the current state. Although time is usually treated as a continuous variable, in this case time is considered to exist as discrete steps with the system occupying a specific state at each step and changing randomly between them. The changes in state are called transitions, each with an associated transition probability. The set of all states and transition probabilities completely characterizes the Markov chain.
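As a minimal sketch of this definition, the following simulates a discrete-time chain. The three states and their transition probabilities are hypothetical values chosen for illustration; the point is only that `step` looks at nothing but the current state.

```python
import random

# Hypothetical three-state chain: each row gives P(next state | current state).
TRANSITIONS = {
    "A": {"A": 0.5, "B": 0.4, "C": 0.1},
    "B": {"A": 0.2, "B": 0.5, "C": 0.3},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}


def step(state, rng):
    """Draw the next state using only the current state (the Markov property)."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, n_steps, seed=0):
    """Return the sequence of states visited over n_steps transitions."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        chain.append(step(chain[-1], rng))
    return chain


chain = simulate("A", 20)
print(" -> ".join(chain))
```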

Critical for its application to MCMC methods are three further features. First, the chain needs to be aperiodic, so that the chain of transitions does not get trapped in a cycle. Second, it needs to be irreducible, so that from any state of the Markov chain there is a positive probability of eventually visiting every other state. This ensures that the transition matrix, which defines the transition probabilities between states, cannot be reduced to separate smaller matrices; this property is sometimes described as the transition graph being connected. Lastly, the Markov chain must have a stationary distribution. Thus, irrespective of the initial distribution used, the chain will eventually stabilize to this stationary (equilibrium) distribution.
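The convergence to a stationary distribution can be checked numerically. The sketch below assumes a hypothetical 3×3 transition matrix whose entries are all positive, which makes the chain both irreducible and aperiodic; it propagates two different initial distributions and compares where they end up.

```python
# Hypothetical transition matrix: P[i][j] = probability of moving from i to j.
P = [
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]


def evolve(dist, matrix):
    """One step of the chain: new_j = sum over i of dist_i * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def run(dist, matrix, n_steps):
    """Propagate an initial distribution through n_steps transitions."""
    for _ in range(n_steps):
        dist = evolve(dist, matrix)
    return dist


from_first = run([1.0, 0.0, 0.0], P, 50)   # start with certainty in state 0
from_last = run([0.0, 0.0, 1.0], P, 50)    # start with certainty in state 2
print("started in state 0:", [round(p, 4) for p in from_first])
print("started in state 2:", [round(p, 4) for p in from_last])
```

Both runs should agree to within floating-point precision, and applying `evolve` once more should leave the result unchanged, which is the defining property of a stationary distribution.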

1.5.2.4.3 Markov Chain Monte Carlo methods

Markov chain Monte Carlo methods therefore sample from a probability distribution based on the available data by constructing a Markov chain that has the desired distribution as its stationary distribution. They combine the random sampling of the Monte Carlo method with a Markov chain whose stationary distribution is reached irrespective of the starting state. Although the Markov chain is constructed so that it will eventually converge towards its stationary distribution, the number of steps required can be excessive. It is therefore important to design samplers that converge quickly and not to use the samples until convergence has been achieved; otherwise we will not have been sampling from our desired distribution.
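One common way of constructing such a chain is a random-walk Metropolis sampler. The text does not name a specific algorithm, so this choice is an assumption made for illustration, as are the standard normal target, the step size, the deliberately poor starting value and the number of discarded samples. The sketch only requires the target density up to a constant, which is why the normalizing factor never has to be computed.

```python
import math
import random


def unnormalised_target(x):
    """Target density known only up to a constant (a standard normal here)."""
    return math.exp(-0.5 * x * x)


def metropolis(n_steps, start=0.0, step_size=1.0, seed=0):
    """Random-walk Metropolis: a Markov chain whose stationary
    distribution is the (normalised) target."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)   # symmetric proposal
        ratio = unnormalised_target(proposal) / unnormalised_target(x)
        if rng.random() < min(1.0, ratio):         # accept or stay put
            x = proposal
        samples.append(x)
    return samples


samples = metropolis(50_000, start=10.0)   # start far from the bulk of the target
kept = samples[1_000:]                     # drop early, unconverged samples
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

The retained samples should have a mean close to 0 and a variance close to 1. Successive samples are correlated, so they carry less information than the same number of independent draws.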

1.5.2.4.4 Auto correlation functions and burn in periods

With all MCMC methods, we need to ensure that the chain is close to, if not at, the stationary distribution of the Markov chain. Unfortunately the number of steps required to reach this point is unknown: “well-behaved” Markov chains need only a few tens of steps, whilst others demand tens of thousands. The steps taken during the approach towards the stationary distribution are ideally discarded and are generally referred to as the “burn in” period. Although predicting the burn in period analytically is not possible, there are techniques available to help guide its selection. This is of particular importance, not only to ensure the stationary distribution is reached,

In document The foundations of lesion-function inference in the human brain (Page 82-87)