2.4 Background on Machine Learning
2.4.16 Gaussian Process Regression
Gaussian Process Regression (GPR) is a common method for learning non-parametric, kernel-based, probabilistic models [108]. The goal of GPR is to probabilistically estimate the expected output $Y$ given an input $X$. In Bayesian Linear Regression (BLR) it is assumed that the state space function is linear, $f(x) = Xw$, yielding a prediction of the form $P(y|X, w) \sim \mathcal{N}(Xw, \sigma_n^2 I)$.
Gaussian Process Regression makes no inherent assumption about the form of the state space model. Instead, GPR uses kernel functions to represent the model directly from the training data.
By definition, a Gaussian process is a set of random variables such that any finite sample of them has a joint Gaussian distribution. If $\{f(x), x \in \mathbb{R}^d\}$ is a Gaussian process, then given the observations $x_1, x_2, \cdots, x_n$, the random variables $f(x_1), f(x_2), \cdots, f(x_n)$ are jointly Gaussian. Therefore a Gaussian process can be fully represented by a mean function $\mu(x)$ and a covariance function $K(x, x')$ (Eq. 2.93).
$$f(x) \sim \mathcal{GP}(\mu(x), K(x, x')) \tag{2.93}$$
The most common derivation of a GPR model assumes a data set of the form $D = \{(x_i, y_i)\}_{i=1}^{n} = (X, y)$. It is assumed that the output follows the form of Equation 2.94.
$$y_i = f(x_i) + \varepsilon_i \tag{2.94}$$
Where f is a latent variable and ε is zero mean Gaussian noise:
$$f \sim \mathcal{GP}(0, K) \tag{2.95}$$
$$\varepsilon \sim \mathcal{N}(0, \sigma_n^2) \tag{2.96}$$
Since the prior on $f$ is a Gaussian process and the noise is Gaussian, the posterior on $f$, $p(f|D)$, is also a Gaussian process. This model is used to make predictions for output estimates $y_*$ given new samples $x_*$, as in Equation 2.97.
$$p(y_*|x_*, D) = \int p(y_*|x_*, f, D)\, p(f|D)\, df \tag{2.97}$$
Given this distribution, a predictor of the form $p(y_*|x_*, X, y)$ is desired. In other words, the estimate of the output depends only on the new sample and the training data, with the latent function $f$ integrated out. The definition of a Gaussian process yields Equation 2.98.
$$\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K_N & K_{*N}^T \\ K_{*N} & K_{**} \end{bmatrix}\right) \tag{2.98}$$
Where $K_N$ is the covariance matrix of the training data, $K_{*N}$ is the covariance between the training data and the test (online) data, and $K_{**}$ is the covariance of the test data. This permits a predictive distribution of the form in Equation 2.99.
$$p(y_*|x_*, X, y) = \mathcal{N}(\mu_*, \sigma_*^2) \tag{2.99}$$
In this case the mean and covariance are given by Equations 2.100 and 2.101.
$$\mu_* = K_{*N}(K_N + \sigma_n^2 I)^{-1} y \tag{2.100}$$
$$\sigma_*^2 = K_{**} - K_{*N}(K_N + \sigma_n^2 I)^{-1} K_{*N}^T \tag{2.101}$$
Where $\sigma_n$ is a tunable parameter relating to the inherent observation noise in the system. This then permits an output estimate $y_*$ for any sample $x_*$ which is based solely on the training data and the covariance kernel $K$. The choice of covariance kernel is a key component in developing a GPR model. The most common kernel function is the squared exponential kernel (Eq. 2.102), which is very similar to a Radial Basis Function.
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right) \tag{2.102}$$
Where $\sigma_f$ is a tunable parameter related to the process noise, and $\ell$ is related to the characteristic length scale or bandwidth of the data. The squared exponential kernel can be thought of as centering a Gaussian at one data point and evaluating it at every other data point, so that nearby points are strongly correlated and distant points are nearly uncorrelated. It should be noted that any valid (positive semi-definite) kernel function can be used for a GPR model. Using this kernel results in a covariance matrix of the form in Equation 2.103.
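$$K(X, X') = \begin{bmatrix} k(x_1, x'_1) & k(x_1, x'_2) & \cdots & k(x_1, x'_n) \\ k(x_2, x'_1) & k(x_2, x'_2) & \cdots & k(x_2, x'_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x'_1) & k(x_n, x'_2) & \cdots & k(x_n, x'_n) \end{bmatrix} \tag{2.103}$$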
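As a minimal sketch of Equations 2.102 and 2.103, the covariance matrix can be assembled with NumPy as follows. The function name `sq_exp_kernel`, its argument names, and the default parameter values are illustrative assumptions and do not come from the original implementation.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential covariance (Eq. 2.102) between the rows of A and B.

    A has shape (n, dim) and B has shape (m, dim); the result has shape (n, m).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise squared Euclidean distances ||a_i - b_j||^2.
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-sq_dist / (2.0 * length**2))

# Three one-dimensional inputs, one per row (made-up values).
X = np.array([[0.0], [1.0], [2.0]])
K = sq_exp_kernel(X, X)   # 3 x 3 covariance matrix in the form of Eq. 2.103
```

When both arguments are the same set of points, the result is symmetric with $\sigma_f^2$ on the diagonal, since every point has zero distance to itself.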
In the standard implementation of the GPR model, a training data set of the form $D = \{(x_i, y_i)\}_{i=1}^{n}$ is assumed. The first step in training this model is to find a representative subset of the training data. For large data sets, it is intractable to use the covariance matrix $K_N$ of the full data set. Therefore, from $D$, a subset of data points $D_{train} = \{X_{train}, y_{train}\}$ of length $d < n$ is identified. For analysis purposes, a subset of test data with which to check our model, $D_{test} = \{X_{test}, y_{test}\}$ of length $k < n$, is also identified.
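The text does not specify how the subsets are chosen. One simple possibility, shown here purely as an assumption, is to draw disjoint subsets uniformly at random. The helper name `split_subsets` and the seed argument are hypothetical.

```python
import numpy as np

def split_subsets(X, y, d, k, seed=0):
    """Draw disjoint random subsets of size d (training) and k (testing).

    Uniform random, disjoint sampling is an assumption of this sketch;
    the source only requires d < n and k < n.
    """
    n = len(X)
    if d + k > n:
        raise ValueError("d + k must not exceed the number of samples n")
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                     # shuffled sample indices
    train_idx, test_idx = idx[:d], idx[d:d + k]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])

# Made-up data set with n = 100 samples.
X = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = np.sin(X[:, 0])
(X_train, y_train), (X_test, y_test) = split_subsets(X, y, d=20, k=10)
```

Uniform sampling does not guarantee that every region of the input space is covered, which is the limitation of naive sub-sampling discussed at the end of this section.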
Given both our training and testing data sets, three separate covariance matrices are computed (Eq. 2.103): $K_{train,train}$, $K_{test,train}$, and $K_{test,test}$, corresponding to the three covariances found in Equation 2.98. These kernels also require explicit parameters, which can be estimated from the training data as $\sigma_f = \mathrm{std}(X_{train})$ and $\ell = \sqrt{\mathrm{range}(Y_{train})}$.
Given our covariance matrices, the inverse matrix from Eq. 2.100 is computed. While one could utilize a brute force matrix inversion, a more elegant solution exists. Since $K_N$ is a covariance matrix, it is inherently Hermitian and positive semi-definite, and adding the noise term $\sigma_n^2 I$ makes it positive definite. We can therefore utilize a Cholesky decomposition to compute a lower triangular matrix $L$ which satisfies the condition $LL^T = K$. Given $L$, it is more computationally efficient to compute the matrix inverse as $(LL^T)^{-1}$, since only the triangular factor needs to be inverted. This results in Equation 2.104.
$$K_{train,train}^{-1} = (L^{-1})^T L^{-1} \tag{2.104}$$
Where
$$L = \mathrm{chol}(K_{train,train} + \sigma_n^2 I) \tag{2.105}$$
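Equations 2.104 and 2.105 can be sketched in NumPy as below. `np.linalg.cholesky` returns the lower triangular factor, so the product `L @ L.T` reproduces the noise-regularized covariance. The function name `cholesky_inverse` and the example matrix are illustrative assumptions.

```python
import numpy as np

def cholesky_inverse(K, sigma_n):
    """Invert K + sigma_n^2 I through its Cholesky factor (Eqs. 2.104-2.105)."""
    d = K.shape[0]
    # Lower triangular L with L @ L.T == K + sigma_n^2 I (Eq. 2.105).
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(d))
    # Inverse of the triangular factor, obtained by solving L @ L_inv = I.
    L_inv = np.linalg.solve(L, np.eye(d))
    # (L L^T)^{-1} = (L^{-1})^T L^{-1} (Eq. 2.104).
    return L, L_inv.T @ L_inv

# Made-up 2 x 2 covariance matrix and noise level.
K = np.array([[1.0, 0.5], [0.5, 1.0]])
L, K_inv = cholesky_inverse(K, sigma_n=0.1)
```

Note that `np.linalg.solve` is a general-purpose solver. A routine that exploits the triangular structure, such as `scipy.linalg.solve_triangular`, would be the more efficient choice if SciPy is available.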
Using the inverse of the covariance $K_{train,train}$, the mean and covariance of the predictive distribution for our test data $X_{test}$ are computed. As in Equations 2.100 and 2.101, the Gaussian process estimates are computed in Equations 2.106 and 2.107.
$$\mu_{test} = K_{test,train} K_{train,train}^{-1} y_{train} \tag{2.106}$$
$$\sigma_{test}^2 = K_{test,test} - K_{test,train} K_{train,train}^{-1} K_{test,train}^T \tag{2.107}$$
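A minimal sketch of Equations 2.106 and 2.107 follows, assuming the three covariance matrices have already been computed and that the inverse is the noise-regularized one from Equation 2.105. Rather than forming the inverse explicitly, this version solves against the Cholesky factor, which gives the same result. The function name `gp_predict` and the one-point example are assumptions for illustration.

```python
import numpy as np

def gp_predict(K_train, K_test_train, K_test, y_train, sigma_n):
    """Predictive mean and covariance for the test inputs (Eqs. 2.106-2.107)."""
    d = K_train.shape[0]
    L = np.linalg.cholesky(K_train + sigma_n**2 * np.eye(d))
    # alpha = (K_train + sigma_n^2 I)^{-1} y_train, via two solves with L.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_test_train @ alpha                      # Eq. 2.106
    # V = L^{-1} K_test_train^T, so V.T @ V = K_test_train (L L^T)^{-1} K_test_train^T.
    V = np.linalg.solve(L, K_test_train.T)
    cov = K_test - V.T @ V                         # Eq. 2.107
    return mu, cov

# One training point and one test point at the same location (made-up values).
mu, cov = gp_predict(
    K_train=np.array([[1.0]]),
    K_test_train=np.array([[1.0]]),
    K_test=np.array([[1.0]]),
    y_train=np.array([2.0]),
    sigma_n=1.0,
)
```

In this one-point case the regularized covariance is 2, so the mean shrinks the observed value 2.0 to 1.0 and the prior variance of 1.0 is reduced to 0.5.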
This results in a predictive distribution for our test data which is a function of only our training data $D_{train}$. As an illustrative example, a function of the form $f(x) = x \sin(x)$ is used with additive noise applied. A subsample is taken from this data for training and testing. The results of this test can be found in Figures 2.10a and 2.10b.
(a) Sample Train and Test Data. (b) Estimated Output for Test Data.
Figure 2.10: Gaussian Process Regression example.
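An experiment of this kind can be sketched end to end as follows. This is not the code used to produce Figure 2.10: the sample counts, noise level, random seed, and the hand-picked hyperparameters `sigma_f`, `length`, and `sigma_n` are all assumptions chosen only to make the sketch run.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f, length):
    """Squared exponential covariance (Eq. 2.102) between the rows of A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-sq_dist / (2.0 * length**2))

rng = np.random.default_rng(0)

# Noisy samples of f(x) = x sin(x) on [0, 10].
X = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
y = X[:, 0] * np.sin(X[:, 0]) + 0.5 * rng.standard_normal(200)

# Disjoint random training and testing subsets (assumed sampling scheme).
idx = rng.permutation(200)
X_tr, y_tr = X[idx[:40]], y[idx[:40]]
X_te, y_te = X[idx[40:80]], y[idx[40:80]]

# Hand-picked hyperparameters (assumptions, not estimated from the data).
sigma_f, length, sigma_n = 3.0, 1.0, 0.5

K_tt = sq_exp_kernel(X_tr, X_tr, sigma_f, length)   # K_train,train
K_st = sq_exp_kernel(X_te, X_tr, sigma_f, length)   # K_test,train
K_ss = sq_exp_kernel(X_te, X_te, sigma_f, length)   # K_test,test

L = np.linalg.cholesky(K_tt + sigma_n**2 * np.eye(len(X_tr)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
mu = K_st @ alpha                                   # Eq. 2.106
V = np.linalg.solve(L, K_st.T)
var = np.diag(K_ss) - np.sum(V**2, axis=0)          # diagonal of Eq. 2.107

rmse = np.sqrt(np.mean((mu - y_te) ** 2))           # error against held-out outputs
```

Plotting `mu` with a band of a few multiples of `np.sqrt(var)` against `X_te` would give a picture analogous to panel (b) of the figure.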
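Given this approach, to estimate outputs for online data one needs to save only the inverse covariance $K_{train,train}^{-1}$ and the training output data $y_{train}$, together with the training inputs $X_{train}$ needed to evaluate the kernel against new samples. Then, for online estimation, $K_{test,train}$ and $K_{test,test}$ are computed in order to arrive at the predictive distribution.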
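The online step can be sketched as a small class that caches the quantities computed during training. The class name `OnlineGP` and its interface are hypothetical. The sketch also keeps the training inputs, because evaluating $K_{test,train}$ for a new sample requires them, and it caches the product of the inverse with $y_{train}$ as an optional optimization.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma_f, length):
    """Squared exponential covariance (Eq. 2.102) between the rows of A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-sq_dist / (2.0 * length**2))

class OnlineGP:
    """Hypothetical helper that caches the trained quantities for online use."""

    def __init__(self, X_train, y_train, sigma_f, length, sigma_n):
        self.X_train = np.asarray(X_train, dtype=float)
        self.sigma_f, self.length = sigma_f, length
        K = sq_exp_kernel(self.X_train, self.X_train, sigma_f, length)
        # Noise-regularized inverse, computed once during training.
        self.K_inv = np.linalg.inv(K + sigma_n**2 * np.eye(len(self.X_train)))
        # Cached (K + sigma_n^2 I)^{-1} y_train, reused for every prediction.
        self.alpha = self.K_inv @ np.asarray(y_train, dtype=float)

    def predict(self, x_new):
        """Return the predictive mean and variance for a single new sample."""
        x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
        # Covariance between the new sample and each training point, length d.
        k_star = sq_exp_kernel(x_new, self.X_train, self.sigma_f, self.length)[0]
        k_ss = self.sigma_f**2                 # k(x, x) for this kernel
        mu = k_star @ self.alpha               # on the order of d multiplications
        var = k_ss - k_star @ self.K_inv @ k_star   # on the order of d^2 multiplications
        return mu, var

# One training point and a query at the same location (made-up values).
gp = OnlineGP(np.array([[0.0]]), np.array([2.0]), sigma_f=1.0, length=1.0, sigma_n=1.0)
mu, var = gp.predict([0.0])
```

The variance term dominates the per-sample cost, since it multiplies a length-$d$ vector through the $d \times d$ inverse.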
The Gaussian Process Regression model provides an elegant solution for the probabilistic modeling of non-linear and non-parametric data. The core problem with GPR models is that for large data sets the covariance matrix $K_N$ becomes very large, which makes the matrix inverse computationally expensive to compute. Additionally, for online estimation, this algorithm requires at a minimum $d + d^2$ multiplications ($O(d^2)$), where $d$ is the size of our training data. Furthermore, the naive GPR model provides no inherent logic for sub-sampling the data so that $D_{train}$ is sufficiently representative of the true function shape. Therefore the training data in our model can be scarce in certain regions. While some work has investigated the use of GPR for classification problems, the standard formulation is designed for regression problems.