Relevance Vector Machines - The foundations of lesion-function inference in the human brain

1.5 Inference


Although the ability to quantify the contribution of each voxel is not essential to the performance of the classification model, this detail may provide a window into which features/dimensions are more important for the task of classification.

For example, consider each voxel in a brain volume as a separate dimension. Next, using a set of lesioned brain volumes and associated outcome data, such as the ability to walk, we can attempt to differentiate between those individuals who can walk and those who cannot, based on their lesioned brain image. The weight of each voxel's contribution to the classification model describes how important that location is to determining the presence or absence of the function in relation to the other dimensions, thus providing an insight into which areas are believed to be critical. It should be noted that the support vectors identified define the classification function by modelling the boundary between the different classes, rather than the areas typical of each class. Further information regarding support vector machines can be found in earlier sections of this thesis.
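As a rough illustration of reading voxel importance off a linear classifier, the sketch below ranks voxels by the magnitude of their weight. The weight values and the helper name `rank_voxels` are hypothetical and invented for this example; in practice the weights would come from a trained model.

```python
def rank_voxels(weights, top_k=3):
    """Return (voxel index, weight) pairs sorted by decreasing |weight|."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i]) for i in order[:top_k]]


# Hypothetical weights for a six-voxel "volume"; the sign indicates which
# class a lesion at that voxel pushes the prediction towards.
example_weights = [0.02, -1.30, 0.00, 0.75, -0.10, 0.40]
top = rank_voxels(example_weights, top_k=3)
```

Ranking by absolute value treats strongly negative and strongly positive weights as equally informative, which matches the idea that a location matters whenever it moves the decision in either direction.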

1.4.2.2 Relevance Vector Machines

Relevance vector machines (RVM) are similar to SVM, but instead use a sparse Bayesian approach, which enables them to offer a probabilistic solution to the problem. I will first provide a brief introduction to relevance vector machines before comparing them with support vector machines. This introduction is not meant to be an exhaustive account of RVM, but should convey the concept well enough to assist the understanding of its application to our situation. A more comprehensive description of RVM can be found in Tipping (2001).

Relevance vector machines are based on a Bayesian formulation of a linear model with an appropriate prior that results in a sparse representation. It is this sparseness that facilitates their speed of computation, by ignoring those dimensions whose relevance is deemed insignificant. The linear model can then be represented as:

$$y = wx + c$$

where $w$ is the parameter vector, $c$ is the offset and $x$ is the vector of input values used to predict the outcome $y$. Generally the offset $c$ is incorporated into the vector $w$. If the relationship between $x$ and $y$ is non-linear then a kernel function can be used:

$$y = w\phi(x)$$

In this case $x \mapsto \phi(x)$ is a non-linear mapping, i.e. a basis function.

In our arrangement, we are trying to derive the weights $w$ from our training data. The assumption here is that the training targets $t_i$ are representative of our true model $y_i$, albeit with some additional noise $\varepsilon_i$. Thus our function can now be written as:

$$t_i = y_i + \varepsilon_i = w\phi(x_i) + \varepsilon_i$$
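The model with a basis function and additive noise can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian-bump basis, the centres, the weights and the noise level are all hypothetical choices, not values taken from the thesis.

```python
import math
import random


def phi(x, centres, width=1.0):
    """Map a scalar input x to a basis vector: a constant 1 (absorbing the
    offset c) followed by one Gaussian bump per centre."""
    return [1.0] + [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centres]


def model(w, x, centres):
    """Noise-free output y = w . phi(x)."""
    return sum(wj * pj for wj, pj in zip(w, phi(x, centres)))


def noisy_targets(w, xs, centres, sigma, rng):
    """Targets t_i = y_i + eps_i, with eps_i drawn from N(0, sigma^2)."""
    return [model(w, x, centres) + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
centres = [-1.0, 0.0, 1.0]      # hypothetical basis-function centres
w_true = [0.5, 1.0, 0.0, -2.0]  # offset weight plus one weight per centre
xs = [i / 4.0 - 1.0 for i in range(9)]
ts = noisy_targets(w_true, xs, centres, sigma=0.1, rng=rng)
```

Note that the third weight is zero, so the corresponding basis function contributes nothing to the output; this is the kind of redundancy a sparse prior is meant to discover.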


The noise terms $\varepsilon_i$ are assumed to be independent samples from a Gaussian noise process with zero mean and variance $\sigma^2$, such that $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\ \forall i$. Therefore the probability of an outcome $t_i$ given an input $x_i$ with our model should be:

$$p(t_i \mid x_i) = \mathcal{N}(t_i \mid y_i, \sigma^2)$$

Ideally we wish to incorporate all our training data. To do so we can represent each training data point $t_i$ (the outcome values) in a vector $t$, with an associated design matrix $\Phi$ such that the last row in the matrix represents the basis vector $\phi(x_N)$ of the last training point. The design matrix merely contains the different basis functions, $\phi_j$, at all the training points, $x_i$, for each of the weights in the vector $w$.

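$$p(t \mid w, \sigma^2) = \prod_{i=1}^{N} p(t_i \mid x_i) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid y_i, \sigma^2) = \mathcal{N}(t \mid \Phi w, \sigma^2 I)$$

where $y_i$ is the $i$-th element of $\Phi w$ and $N$ is the number of training points.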
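To make the role of the design matrix concrete, the following sketch builds $\Phi$ row by row and evaluates the Gaussian log-likelihood of the targets for a given weight vector. The basis set (a constant and a linear term), the data and the helper names are hypothetical, chosen only to keep the example small.

```python
import math


def design_matrix(xs, basis_functions):
    """Row i holds every basis function evaluated at training point x_i."""
    return [[f(x) for f in basis_functions] for x in xs]


def log_likelihood(t, Phi, w, sigma2):
    """log p(t | w, sigma^2) as a sum of independent Gaussian log densities."""
    total = 0.0
    for t_i, row in zip(t, Phi):
        y_i = sum(w_j * p_ij for w_j, p_ij in zip(w, row))
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - (t_i - y_i) ** 2 / (2.0 * sigma2)
    return total


# Hypothetical basis set: a constant and a linear term.
basis = [lambda x: 1.0, lambda x: x]
xs = [0.0, 1.0, 2.0]
Phi = design_matrix(xs, basis)
t = [1.0, 3.1, 4.9]
good = log_likelihood(t, Phi, w=[1.0, 2.0], sigma2=0.01)
bad = log_likelihood(t, Phi, w=[0.0, 0.0], sigma2=0.01)
```

Weights that reproduce the targets closely (`good`) receive a much higher log-likelihood than weights that ignore the data (`bad`), which is what the product of Gaussians expresses.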

There are $M$ weights associated with the algorithm at initialization. As alluded to earlier, smoother and thus less complex functions are generally more resilient to over-fitting and result in better generalization. By applying constraints on the number of weights, we are in essence applying a smoothing term, thereby reducing the risk of over-fitting. This is achieved by placing a zero-mean Gaussian prior on the weights:

$$p(w \mid \alpha) = \prod_{j=1}^{M} \mathcal{N}(w_j \mid 0, \alpha_j^{-1})$$

Here $\alpha_j$ describes the inverse variance (the precision) of each $w_j$. Therefore there is a separate $\alpha_j$ associated with each weight, modifying the strength of the prior.
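The effect of a per-weight precision can be seen by evaluating the log of the prior directly. In this minimal sketch, with made-up weights and precisions, a very large precision makes any non-zero value of the corresponding weight extremely improbable, which is the mechanism that produces sparsity.

```python
import math


def log_prior(w, alpha):
    """log p(w | alpha): independent zero-mean Gaussians, with alpha[j] the
    precision (inverse variance) of weight w[j]."""
    return sum(
        0.5 * math.log(a_j / (2.0 * math.pi)) - 0.5 * a_j * w_j ** 2
        for w_j, a_j in zip(w, alpha)
    )


w = [0.8, 0.8]
loose = log_prior(w, alpha=[1.0, 1.0])            # both weights weakly constrained
tight = log_prior(w, alpha=[1.0, 1e6])            # second weight pinned close to zero
pruned = log_prior([0.8, 0.0], alpha=[1.0, 1e6])  # second weight actually set to zero
```

Under the tight prior the original weights are heavily penalised, whereas setting the second weight to zero is rewarded, so the prior favours the sparser solution.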

To make predictions using the Bayesian model, the posterior probability over all the unknown parameters, given the data, needs to be computed. This probability cannot be computed analytically because of its complexity, so approximations need to be made. First, decompose the posterior probability as:

$$p(w, \alpha, \sigma^2 \mid t) = p(w \mid t, \alpha, \sigma^2)\, p(\alpha, \sigma^2 \mid t)$$

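Rearranging with Bayes' rule, the first factor (the posterior over the weights) can be written as:

$$p(w \mid t, \alpha, \sigma^2) = \frac{p(t \mid w, \sigma^2)\, p(w \mid \alpha)}{p(t \mid \alpha, \sigma^2)}$$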
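For reference, because both the likelihood and the prior are Gaussian in $w$, this posterior over the weights has a closed form when $\alpha$ and $\sigma^2$ are held fixed. The expression below follows the standard result in Tipping (2001); the symbols $A$, $\Sigma$ and $\mu$ are introduced here for convenience and are not part of the text above.

```latex
p(w \mid t, \alpha, \sigma^2) = \mathcal{N}(w \mid \mu, \Sigma), \qquad
\Sigma = \left(\sigma^{-2}\,\Phi^{\top}\Phi + A\right)^{-1}, \qquad
\mu = \sigma^{-2}\,\Sigma\,\Phi^{\top} t, \qquad
A = \operatorname{diag}(\alpha_1, \dots, \alpha_M)
```

A very large $\alpha_j$ drives the corresponding entry of $\mu$ towards zero, which is how a weight is effectively removed from the model.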


In document The foundations of lesion-function inference in the human brain (Page 76-80)