1.5 Inference
1.5.2.2 Relevance Vector Machines
Although the ability to quantify the contribution of each voxel is not essential to the performance of the classification model, it may indicate which features (dimensions) matter most for the task of classification.
For example, consider each voxel in a brain volume as a separate dimension. Using a set of lesioned brain volumes and associated outcome data, such as the ability to walk, we can attempt to differentiate between individuals who can walk and those who cannot on the basis of their lesioned brain images. The weight of each voxel’s contribution to the classification model describes how important that location is, relative to the other dimensions, in determining the presence or absence of the function, thus providing an insight into which areas are believed to be critical. It should be noted that the support vectors define the classification function by modelling the boundary between the different classes, rather than the areas typical of each class. Further information regarding support vector machines can be found in earlier sections of this thesis.
1.4.2.2 Relevance Vector Machines
Relevance vector machines (RVM) are similar to support vector machines (SVM), but use a sparse Bayesian approach, which enables them to offer a probabilistic solution to the problem. I will first provide a brief introduction to relevance vector machines before comparing them with support vector machines. This introduction is not meant to be exhaustive, but should convey the concept of RVM well enough to assist the understanding of its application to our situation. A more comprehensive description of RVM can be found in Tipping (2001).
Relevance vector machines are based on a Bayesian formulation of a linear model with an appropriate prior that results in a sparse representation. It is this sparseness that facilitates speed of computation, by ignoring those dimensions whose relevance is deemed insignificant. The linear model can be represented as:
\[ y = \mathbf{w}^{\top}\mathbf{x} + c \]
where w is the parameter vector, c is the offset and x is the vector of input values used to predict the outcome y. Generally the offset c is incorporated into the vector w. If the relationship between x and y is non-linear then a kernel function can be used:
\[ y = \mathbf{w}^{\top}\phi(\mathbf{x}) \]
In this case $\mathbf{x} \mapsto \phi(\mathbf{x})$ is a non-linear mapping, a basis function.
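In our arrangement, we are trying to derive w, the weights, from our training data. The assumption here is that our training data are representative of our true model $y_i$, albeit with some additional noise. Thus our function can now be written as:
\[ t_i = y_i + \varepsilon_i = \mathbf{w}^{\top}\phi(\mathbf{x}_i) + \varepsilon_i \]
where $t_i$ is the observed outcome for input $\mathbf{x}_i$ and $\varepsilon_i$ is the noise term.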
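As a rough illustration of the model above, the following Python sketch builds a basis expansion and generates noisy targets from it. The Gaussian (RBF) basis, its width, the choice of centring one basis function on each training input, and all names (`rbf_basis`, `w_true`, `sigma`) are assumptions made for illustration only and are not taken from the text.

```python
import numpy as np

def rbf_basis(x, centres, width=1.0):
    """Map N scalar inputs to an N x M matrix of Gaussian basis activations."""
    x = np.asarray(x, dtype=float)[:, None]              # shape (N, 1)
    centres = np.asarray(centres, dtype=float)[None, :]  # shape (1, M)
    return np.exp(-((x - centres) ** 2) / (2.0 * width ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)      # training inputs
centres = x.copy()                  # assumption: one basis function per input

# A sparse "true" weight vector: only two basis functions are relevant.
w_true = np.zeros(centres.size)
w_true[[4, 12]] = [1.5, -2.0]

Phi = rbf_basis(x, centres)         # phi(x_i) for every training point
y = Phi @ w_true                    # noise-free model output y_i
sigma = 0.1                         # assumed noise standard deviation
t = y + rng.normal(0.0, sigma, size=y.shape)  # observed targets t_i = y_i + eps_i
```

Here most entries of `w_true` are zero, which is the kind of sparse solution the RVM prior is intended to favour.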
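The noise terms $\varepsilon_i$ are assumed to be independent samples from a Gaussian noise process with zero mean and variance $\sigma^2$, such that $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\ \forall i$. Therefore the probability of an outcome $t_i$ given an input $\mathbf{x}_i$ with our model should be:
\[ p(t_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2) = \mathcal{N}\big(t_i \mid \mathbf{w}^{\top}\phi(\mathbf{x}_i), \sigma^2\big) \]
\[ p(t_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2}\big(t_i - \mathbf{w}^{\top}\phi(\mathbf{x}_i)\big)^2 \right\} \]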
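Ideally we wish to incorporate all our training data. To do so we can collect the training outcomes $t_i$ in a vector $\mathbf{t}$, with an associated design matrix $\Phi$ such that the last row in the matrix represents the vector $\phi(\mathbf{x}_N)$. The design matrix merely contains the different basis functions, $\phi_j(\mathbf{x}_i)$, evaluated at all the training points, $\mathbf{x}_i$, for each of the weights in the vector $\mathbf{w}$.
\[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2}\big(t_i - \mathbf{w}^{\top}\phi(\mathbf{x}_i)\big)^2 \right\} \]
\[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2}\lVert \mathbf{t} - \Phi\mathbf{w} \rVert^2 \right\} \]
where $\mathbf{t} = (t_1, \ldots, t_N)^{\top}$, $\mathbf{w} = (w_1, \ldots, w_M)^{\top}$ and $\mathbf{x}$ denotes the full set of training inputs.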
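The step from a product over individual training points to a single expression in the design matrix can be checked numerically. The sketch below is illustrative only: the function names and the random test data are assumptions, and the computation is done in the log domain (a common choice to avoid numerical underflow) rather than on the probabilities themselves.

```python
import numpy as np

def log_likelihood_pointwise(t, Phi, w, sigma2):
    """Sum of the log Gaussian densities of each t_i about phi(x_i) . w."""
    total = 0.0
    for t_i, phi_i in zip(t, Phi):
        resid = t_i - phi_i @ w
        total += -0.5 * np.log(2.0 * np.pi * sigma2) - resid ** 2 / (2.0 * sigma2)
    return total

def log_likelihood_vectorised(t, Phi, w, sigma2):
    """Same quantity written with the design matrix: uses ||t - Phi w||^2."""
    n = t.shape[0]
    resid = t - Phi @ w
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - (resid @ resid) / (2.0 * sigma2)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 3))            # hypothetical design matrix: N=5, M=3
w = np.array([0.5, -1.0, 2.0])
t = Phi @ w + rng.normal(0.0, 0.2, size=5)

ll_pointwise = log_likelihood_pointwise(t, Phi, w, sigma2=0.04)
ll_vectorised = log_likelihood_vectorised(t, Phi, w, sigma2=0.04)
```

The two values agree to floating-point precision, which is the independence assumption on the noise expressed in code.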
There are M weights associated with the algorithm at initialization. As alluded to earlier, smoother, and thus less complex, functions are generally more resilient to over-fitting and result in better generalization. By applying constraints on the number of weights, we are in essence applying a smoothing term, thereby reducing the risk of over-fitting. This is achieved in the form of a prior on the weights, with a zero-mean Gaussian distribution:
\[ p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}\big(w_i \mid 0, \alpha_i^{-1}\big) \]
Here $\alpha_i$ describes the inverse variance, the precision, of each $w_i$. Therefore there is a separate $\alpha_i$ associated with each weight, modifying the strength of the prior.
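To make the role of the per-weight precision concrete, the following sketch evaluates the log of a zero-mean Gaussian prior with one precision per weight. The function name `log_prior` and the example values are assumptions chosen for illustration; they are not part of the original derivation.

```python
import numpy as np

def log_prior(w, alpha):
    """Log of prod_i N(w_i | 0, 1/alpha_i), with one precision per weight."""
    w = np.asarray(w, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(0.5 * np.log(alpha) - 0.5 * np.log(2.0 * np.pi)
                  - 0.5 * alpha * w ** 2)

# First weight has a broad prior, second weight a very tight one.
alpha = np.array([1.0, 1e6])
prior_std = 1.0 / np.sqrt(alpha)   # prior standard deviation of each weight

# The same magnitude (0.5) is plausible for the first weight but
# heavily penalised for the second.
lp_first = log_prior([0.5, 0.0], alpha)
lp_second = log_prior([0.0, 0.5], alpha)
```

A weight whose precision grows very large is effectively pinned to zero by its prior, which is how irrelevant dimensions come to be ignored.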
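To make predictions using the Bayesian model, the posterior probability over all the unknown parameters, given the data, needs to be computed. This probability cannot be computed analytically because of its complexity, and approximations need to be made. First decompose the posterior probability to:
\[ p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) = p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)\, p(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{t}) \]
Rearranging the first factor using Bayes' rule and substituting $\beta^{-1}$ for $\sigma^2$:
\[ p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \beta) = \frac{p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \beta)} \]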
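Because both the likelihood and the prior are Gaussian in the weights, the posterior over the weights for fixed hyperparameters is itself Gaussian. The sketch below uses the standard closed form given in Tipping (2001), with covariance $(A + \beta\Phi^{\top}\Phi)^{-1}$ and mean $\beta\Sigma\Phi^{\top}\mathbf{t}$, where $A$ is the diagonal matrix of the per-weight precisions and $\beta$ is the noise precision $1/\sigma^2$. This result is not derived in the text above, and the function name, the fixed hyperparameter values and the synthetic data are assumptions for illustration; a full RVM would also re-estimate the hyperparameters, which is not shown.

```python
import numpy as np

def weight_posterior(Phi, t, alpha, beta):
    """Mean and covariance of the Gaussian posterior over w for fixed alpha, beta."""
    A = np.diag(alpha)                              # prior precisions on the diagonal
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # posterior covariance
    mu = beta * Sigma @ Phi.T @ t                   # posterior mean
    return mu, Sigma

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 3))                      # hypothetical design matrix
w_true = np.array([1.0, 0.0, -2.0])                 # second weight is irrelevant
beta = 100.0                                        # noise precision (sigma = 0.1)
t = Phi @ w_true + rng.normal(0.0, beta ** -0.5, size=50)

# A very large precision on the second weight mimics a pruned dimension.
alpha = np.array([1.0, 1e8, 1.0])
mu, Sigma = weight_posterior(Phi, t, alpha, beta)
```

With these settings the posterior mean of the second weight is driven close to zero while the other two weights remain close to their generating values.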