Gaussian Process Regression - Background on Machine Learning

2.4 Background on Machine Learning

2.4.16 Gaussian Process Regression

Gaussian Process Regression (GPR) is a common method of learning non-parametric, kernel-based, probabilistic models [108]. The goal of GPR is to probabilistically estimate the expected output $Y$ given an input $X$. In Bayesian Linear Regression (BLR) it is assumed that the state space function is linear, $f(X) = Xw$, yielding a prediction of the form $p(y \mid X, w) = \mathcal{N}(Xw, \sigma_n^2 I)$.

Gaussian Process Regression makes no inherent assumption about the form of the state space model. Instead, GPR uses kernel functions to represent the model directly from the training data.

By definition, a Gaussian process is a set of random variables such that any finite sample of them has a joint Gaussian distribution. If $\{f(x), x \in \mathbb{R}^d\}$ is a Gaussian process, then given the observations $x_1, x_2, \dots, x_n$, the joint distribution of the random variables $f(x_1), f(x_2), \dots, f(x_n)$ is a multivariate Gaussian. Therefore a Gaussian process can be fully represented by a mean function $\mu(x)$ and a covariance function $K(x, x')$ (Eq. 2.93).

$$ f(x) \sim \mathcal{GP}(\mu(x), K(x, x')) \tag{2.93} $$

The most common derivation of a GPR model assumes a data set of the form $D = \{(x_i, y_i)\}_{i=1}^{n} = (X, y)$. It is assumed that the output follows the form of Equation 2.94.

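$$ y_i = f(x_i) + \varepsilon_i \tag{2.94} $$

where $f$ is a latent variable and $\varepsilon$ is zero-mean Gaussian noise:

$$ f \sim \mathcal{GP}(0, K) \tag{2.95} $$

$$ \varepsilon \sim \mathcal{N}(0, \sigma_n^2) \tag{2.96} $$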

Since the prior on $f$ is a Gaussian process and the noise is Gaussian, the posterior on $f$, $p(f \mid D)$, is also a Gaussian process. This model is used to make predictions for output estimates $y_*$ given new samples $x_*$, as in Equation 2.97.

$$ p(y_* \mid x_*, D) = \int p(y_* \mid x_*, f, D)\, p(f \mid D)\, df \tag{2.97} $$

Given this distribution, a predictor of the form $p(y_* \mid x_*, X, y)$ is desired; in other words, the estimate of the output depends only on the new sample and the training data, with the latent variables integrated out. The definition of a Gaussian process yields Equation 2.98.

$$ \begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K_N & K_{*N}^T \\ K_{*N} & K_{**} \end{bmatrix}\right) \tag{2.98} $$

where $K_N$ is the covariance matrix of the training data, $K_{*N}$ is the covariance between the new (test) data and the training data, and $K_{**}$ is the covariance of the test data. This permits a predictive distribution of the form in Equation 2.99.

$$ p(y_* \mid x_*, X, y) = \mathcal{N}(\mu_*, \sigma_*^2) \tag{2.99} $$
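The step from the joint distribution in Equation 2.98 to the predictive distribution rests on the standard conditioning identity for jointly Gaussian vectors. It is sketched here in generic notation; the symbols $a$, $b$, $A$, $B$ and $C$ are placeholders introduced for illustration and do not appear in the original derivation.

```latex
\begin{bmatrix} a \\ b \end{bmatrix}
\sim \mathcal{N}\left(0,
\begin{bmatrix} A & C^T \\ C & B \end{bmatrix}\right)
\quad\Longrightarrow\quad
b \mid a \sim \mathcal{N}\left(C A^{-1} a,\; B - C A^{-1} C^T\right)
```

Reading $a = y$, $b = y_*$, $C = K_{*N}$ and $B = K_{**}$, and taking $A$ to be the training covariance with the observation noise $\sigma_n^2 I$ added, recovers the predictive distribution of Equation 2.99.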

In this case the mean and covariance are given by Equations 2.100 and 2.101.

$$ \mu_* = K_{*N}(K_N + \sigma_n^2 I)^{-1} y \tag{2.100} $$

$$ \sigma_*^2 = K_{**} - K_{*N}(K_N + \sigma_n^2 I)^{-1} K_{*N}^T \tag{2.101} $$

where $\sigma_n$ is a tunable parameter relating to inherent observation noise in the system. This then permits an output estimate $y_*$ for any sample $x_*$ which is based solely on the training data and the covariance kernel $K$. The choice of covariance kernel is a key component in developing a GPR model. The most common kernel function is the squared exponential kernel (Eq. 2.102), which is very similar to a Radial Basis Function.

$$ k(x, x') = \sigma_f^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right) \tag{2.102} $$

where $\sigma_f$ is a tunable parameter related to the process noise, and $\ell$ is related to the characteristic length scale or bandwidth of the data. The squared exponential kernel can be thought of as summing a Gaussian at one data point given all other data points. It should be noted that any valid (positive semi-definite) kernel function can be used for a GPR model. Using this kernel results in a covariance matrix of the form in Equation 2.103.

$$ K(X, X') = \begin{bmatrix} k(x_1, x'_1) & k(x_1, x'_2) & \cdots & k(x_1, x'_n) \\ k(x_2, x'_1) & k(x_2, x'_2) & \cdots & k(x_2, x'_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x'_1) & k(x_n, x'_2) & \cdots & k(x_n, x'_n) \end{bmatrix} \tag{2.103} $$
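As a minimal sketch of Equations 2.102 and 2.103, the covariance matrix can be built with NumPy broadcasting. The function name `sq_exp_kernel`, the default parameter values and the sample inputs are assumptions made for illustration; inputs are assumed to be 2-D arrays with one sample per row.

```python
import numpy as np

def sq_exp_kernel(XA, XB, sigma_f=1.0, length=1.0):
    """Squared exponential covariance (Eq. 2.102) between the rows of XA and XB.

    XA has shape (n, p) and XB has shape (m, p); the result has shape (n, m).
    """
    XA = np.asarray(XA, dtype=float)
    XB = np.asarray(XB, dtype=float)
    # Pairwise squared Euclidean distances ||x - x'||^2 via broadcasting.
    diff = XA[:, None, :] - XB[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length ** 2))

# Illustrative inputs: three one-dimensional samples.
X = np.array([[0.0], [1.0], [3.0]])
K = sq_exp_kernel(X, X, sigma_f=2.0, length=1.0)
print(K)
```

The diagonal of `K` equals $\sigma_f^2$ because each point has zero distance to itself, and the off-diagonal entries decay as the points move apart relative to $\ell$.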

In the standard implementation of the GPR model, a training data set of the form $D = \{(x_i, y_i)\}_{i=1}^{n}$ is assumed. The first step in training this model is to find a representative subset of the training data. For large data sets, it is intractable to use the full data covariance matrix $K_N$. Therefore, from $D$, a subset of data points $D_{train} = \{X_{train}, y_{train}\}$ of length $d < n$ is identified. For analysis purposes, a subset of test data with which to check our model, $D_{test} = \{X_{test}, y_{test}\}$ of length $k < n$, is also identified.

Given both our training and testing data sets, three separate covariance matrices are computed (Eq. 2.103): $K_{train,train}$, $K_{test,train}$, and $K_{test,test}$, corresponding to the three covariances found in Equation 2.98. These kernels also require explicit parameters, which can be estimated from the training data as $\sigma_f = \mathrm{std}(X_{train})$ and $\ell = \sqrt{\mathrm{range}(y_{train})}$.
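The sub-sampling and parameter estimates described above might look as follows. This is only a sketch: the text does not say how the subsets are chosen, so uniform random sampling without overlap is an assumption, as are the helper names `subsample` and `heuristic_hyperparameters`. Note also that `np.std` uses the population normalization by default, which may differ from the `std` used by the author.

```python
import numpy as np

def subsample(X, y, d, k, seed=0):
    """Draw disjoint random train (size d) and test (size k) subsets (assumed scheme)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train_idx, test_idx = idx[:d], idx[d:d + k]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def heuristic_hyperparameters(X_train, y_train):
    """sigma_f = std(X_train), length = sqrt(range(y_train)), as stated in the text."""
    sigma_f = float(np.std(X_train))
    length = float(np.sqrt(np.max(y_train) - np.min(y_train)))
    return sigma_f, length

# Illustrative data: 200 samples of x*sin(x) on [0, 10].
X = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = X[:, 0] * np.sin(X[:, 0])
X_train, y_train, X_test, y_test = subsample(X, y, d=30, k=20)
sigma_f, length = heuristic_hyperparameters(X_train, y_train)
print(sigma_f, length)
```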

Given our covariance matrices, the inverse matrix from Equation 2.100 is computed. While one could use a brute-force matrix inversion, a more elegant solution exists. Since $K_N$ is a covariance matrix, it is Hermitian and positive semi-definite, and adding the noise term $\sigma_n^2 I$ makes it positive definite. We can therefore use a Cholesky decomposition to compute a lower triangular matrix $L$ which satisfies the condition $L L^T = K$. Given $L$, it is more computationally efficient to compute the matrix inverse as $(L L^T)^{-1}$, which requires inverting only the triangular factor. This results in Equation 2.104.

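$$ K_{\mathrm{train,train}}^{-1} = (L^{-1})^T L^{-1} \tag{2.104} $$

where

$$ L = \mathrm{chol}(K_{\mathrm{train,train}} + \sigma_n^2 I) \tag{2.105} $$

so that $K_{\mathrm{train,train}}^{-1}$ here is understood as the inverse of the noise-regularized matrix $K_{\mathrm{train,train}} + \sigma_n^2 I$, matching Equations 2.100 and 2.101.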
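A minimal NumPy sketch of Equations 2.104 and 2.105 follows. The function name `inverse_via_cholesky` and the 2x2 example matrix are assumptions for illustration. `np.linalg.cholesky` returns the lower triangular factor; the general-purpose `np.linalg.solve` is used here to invert it, whereas a dedicated triangular solver would exploit the structure better.

```python
import numpy as np

def inverse_via_cholesky(K, sigma_n):
    """Return (K + sigma_n^2 I)^{-1} via the Cholesky factor (Eqs. 2.104-2.105)."""
    n = K.shape[0]
    # Lower triangular L with L @ L.T == K + sigma_n^2 I.
    L = np.linalg.cholesky(K + sigma_n ** 2 * np.eye(n))
    # Invert the triangular factor by solving L @ L_inv = I.
    L_inv = np.linalg.solve(L, np.eye(n))
    # (L L^T)^{-1} = (L^{-1})^T L^{-1}
    return L_inv.T @ L_inv

# Illustrative symmetric positive-definite matrix.
K = np.array([[2.0, 1.0], [1.0, 2.0]])
K_inv = inverse_via_cholesky(K, sigma_n=0.1)
print(K_inv @ (K + 0.1 ** 2 * np.eye(2)))  # close to the identity matrix
```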

Using the inverse of the covariance $K_{\mathrm{train,train}}$, the mean and covariance of the predictive distribution for our test data $X_{test}$ are computed. As in Equations 2.100 and 2.101, the Gaussian process estimates are computed in Equations 2.106 and 2.107.

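$$ \mu_{\mathrm{test}} = K_{\mathrm{test,train}} K_{\mathrm{train,train}}^{-1} y_{\mathrm{train}} \tag{2.106} $$

$$ \sigma_{\mathrm{test}}^2 = K_{\mathrm{test,test}} - K_{\mathrm{test,train}} K_{\mathrm{train,train}}^{-1} K_{\mathrm{test,train}}^T \tag{2.107} $$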
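Equations 2.106 and 2.107 translate almost directly into matrix products. The sketch below assumes the covariance matrices and the inverse have already been computed; the function name `gp_predict` and the one-point example values are hypothetical and chosen so the result can be checked by hand.

```python
import numpy as np

def gp_predict(K_test_train, K_test_test, K_train_inv, y_train):
    """Predictive mean (Eq. 2.106) and covariance (Eq. 2.107)."""
    mu_test = K_test_train @ K_train_inv @ y_train
    cov_test = K_test_test - K_test_train @ K_train_inv @ K_test_train.T
    return mu_test, cov_test

# Hand-checkable case: one training point and one test point.
K_test_train = np.array([[0.5]])
K_test_test = np.array([[1.0]])
K_train_inv = np.array([[1.0]])
y_train = np.array([2.0])

mu_test, cov_test = gp_predict(K_test_train, K_test_test, K_train_inv, y_train)
print(mu_test, cov_test)  # mean 0.5 * 1 * 2 = 1.0, variance 1 - 0.5 * 1 * 0.5 = 0.75
```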

This results in a predictive distribution for our test data which is a function of only our training data $D_{train}$. As an illustrative example, a function of the form $f(x) = x \sin(x)$ is used with additive noise applied. A subsample is taken from this data for training and testing. The results of this test can be found in Figures 2.10a and 2.10b.

(a) Sample Train and Test Data. (b) Estimated Output for Test Data.

Figure 2.10: Gaussian Process Regression example.
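An experiment of this kind can be approximated with the short script below. It is a sketch under stated assumptions and not a reproduction of the figure: the input range, sample counts, random seed and the hyperparameter values `sigma_f`, `length` and `sigma_n` are all chosen here for illustration.

```python
import numpy as np

def sq_exp_kernel(XA, XB, sigma_f, length):
    """Squared exponential covariance (Eq. 2.102) between the rows of XA and XB."""
    sq_dist = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length ** 2))

rng = np.random.default_rng(0)
sigma_f, length, sigma_n = 3.0, 1.0, 0.5  # assumed values, not from the text

# Noisy samples of f(x) = x*sin(x) on an assumed range [0, 10].
X = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = X[:, 0] * np.sin(X[:, 0]) + rng.normal(0.0, sigma_n, size=200)

# Random, disjoint train and test subsets (assumed sampling scheme).
idx = rng.permutation(200)
train, test = idx[:40], idx[40:80]
X_train, y_train, X_test = X[train], y[train], X[test]

# The three covariance matrices of Eq. 2.98.
K_tt = sq_exp_kernel(X_train, X_train, sigma_f, length)
K_st = sq_exp_kernel(X_test, X_train, sigma_f, length)
K_ss = sq_exp_kernel(X_test, X_test, sigma_f, length)

# Inverse through the Cholesky factor (Eqs. 2.104-2.105).
L = np.linalg.cholesky(K_tt + sigma_n ** 2 * np.eye(len(X_train)))
L_inv = np.linalg.solve(L, np.eye(len(X_train)))
K_inv = L_inv.T @ L_inv

# Predictive mean and variance (Eqs. 2.106-2.107).
mu_test = K_st @ K_inv @ y_train
var_test = np.diag(K_ss - K_st @ K_inv @ K_st.T)

# Compare against the noise-free function at the test inputs.
f_test = X_test[:, 0] * np.sin(X_test[:, 0])
rmse = np.sqrt(np.mean((mu_test - f_test) ** 2))
print(f"RMSE against the noise-free function: {rmse:.3f}")
```

Plotting `mu_test` with a band of two standard deviations, `2 * np.sqrt(var_test)`, against `X_test` would give a picture comparable in spirit to panel (b).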

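Given this approach, to estimate outputs for online data one needs to save only the inverse covariance $K_{\mathrm{train,train}}^{-1}$ and the training output data $y_{train}$, along with the training inputs $X_{train}$ needed to evaluate the kernel against new samples. Then for online estimation, $K_{\mathrm{test,train}}$ and $K_{\mathrm{test,test}}$ are computed in order to arrive at the predictive distribution.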
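The split between offline and online work could be organized as below. The class name `OnlineGPR` and its interface are hypothetical, and the sketch stores the training inputs as well, since they are needed to evaluate the kernel against each new sample.

```python
import numpy as np

def sq_exp_kernel(XA, XB, sigma_f, length):
    """Squared exponential covariance (Eq. 2.102) between the rows of XA and XB."""
    sq_dist = np.sum((XA[:, None, :] - XB[None, :, :]) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length ** 2))

class OnlineGPR:
    """Hypothetical helper: invert once offline, then predict one sample at a time."""

    def __init__(self, X_train, y_train, sigma_f, length, sigma_n):
        self.X_train, self.y_train = X_train, y_train
        self.sigma_f, self.length = sigma_f, length
        n = len(X_train)
        K = sq_exp_kernel(X_train, X_train, sigma_f, length)
        L = np.linalg.cholesky(K + sigma_n ** 2 * np.eye(n))
        L_inv = np.linalg.solve(L, np.eye(n))
        self.K_inv = L_inv.T @ L_inv  # saved for online use

    def predict(self, x_new):
        """Return the predictive mean and variance for a single new sample."""
        x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
        k_star = sq_exp_kernel(x_new, self.X_train, self.sigma_f, self.length)  # 1 x d
        k_ss = sq_exp_kernel(x_new, x_new, self.sigma_f, self.length)           # 1 x 1
        mu = k_star @ self.K_inv @ self.y_train
        var = k_ss - k_star @ self.K_inv @ k_star.T
        return float(mu[0]), float(var[0, 0])

# Illustrative use with three training points and very small observation noise.
model = OnlineGPR(np.array([[0.0], [1.0], [2.0]]), np.array([0.0, 1.0, 4.0]),
                  sigma_f=1.0, length=1.0, sigma_n=1e-3)
mu_near, var_near = model.predict([1.0])    # at a training input
mu_far, var_far = model.predict([100.0])    # far from all training inputs
print(mu_near, var_near, mu_far, var_far)
```

At a training input the prediction stays close to the observed value with small variance, while far from the data it falls back to the zero prior mean with variance $\sigma_f^2$.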

The Gaussian Process Regression model provides an elegant solution for the probabilistic modeling of non-linear and non-parametric data. The core problem with GPR models is that for large data sets the covariance matrix $K_N$ becomes very large, which makes the matrix inverse computationally expensive to compute. Additionally, for online estimation, this algorithm requires at a minimum $d + d^2$ multiplications ($O(d^2)$), where $d$ is the size of our training data. Furthermore, the naive GPR model provides no inherent logic for sub-sampling the data so that $D_{train}$ is sufficiently representative of the true function shape. Therefore the training data in our model can be sparse in certain regions. While some work has investigated the use of GPR for classification problems, the standard formulation is designed for regression problems.

In document Dynamic Discriminant Analysis with Applications in Computational Surgery (Page 65-69)