2.2 Gaussian Processes
2.2.1 Priors on random linear functions
This section presents a simple construction of random functions that builds on the linear model (2.5). Diaconis [62] traces this construction back to Poincaré [181]. The strategy is to treat the coefficients β in the linear model as Gaussian random variables, $\beta \sim \mathcal{N}(\beta_0, \Sigma_0)$. Without loss of generality, it is assumed that $\beta_0 = 0$, so β is a zero-mean random vector with possibly correlated components.
The model is constructed as follows. Under the assumption of independent observations, and given the vectors of coefficients and predictors, the target variable is assumed to be distributed as
\[ p(y \mid X, \beta) = \mathcal{N}(X\beta, \sigma^2 I_n), \tag{2.7} \]
where, as before, the data come in pairs $(x_i, y_i)$. The vector $y \in \mathbb{R}^n$ contains all recorded target values, $X \in \mathbb{R}^{n \times d}$ is the matrix whose rows are the observed predictor vectors, and $I_n \in \mathbb{R}^{n \times n}$ denotes the identity matrix. Note that every realisation of β from the prior defines a different linear model, as depicted in Figure 2.2. In this sense, the prior distribution over β reflects our belief about which linear models are deemed possible.
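The correspondence between prior draws of β and linear models can be illustrated with a short sketch. The identity prior covariance, the number of samples and the grid of x values below are assumptions chosen for illustration only; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed prior for illustration: zero mean, identity covariance
# over the coefficient vector (intercept, slope).
Sigma0 = np.eye(2)
beta_samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma0, size=5)

# Evaluate each induced line on a grid of x values.
# The leading column of ones carries the intercept.
x_grid = np.linspace(-6.0, 6.0, 50)
X_grid = np.column_stack([np.ones_like(x_grid), x_grid])

# Column j holds the linear model induced by the j-th draw of beta.
prior_lines = X_grid @ beta_samples.T  # shape (50, 5)
```

Each column of `prior_lines` plays the role of one coloured line in the right panel of a plot such as Figure 2.2.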
The posterior distribution is computed as
\[ p(\beta \mid X, y) = \frac{p(y \mid X, \beta)\, p(\beta)}{p(y \mid X)} \propto p(y \mid X, \beta)\, p(\beta), \tag{2.8} \]
where the normalising constant is discarded for computational convenience. This is justified because the evidence $p(y \mid X)$ depends only on the available data and not on β, and the focus is on the distribution for β.
[Figure 2.2 panels. Left: "Parameter space: β", axes β0 and β1. Right: "Prior model: y = β0 + β1x", axes x and y.]
Figure 2.2: Prior assumptions on a linear model. The left panel shows the contour levels of the prior distribution assumed for the coefficients. Random samples taken from the joint distribution of β = (β0, β1) are depicted as dots. Each realisation of the prior induces a linear model, shown on the right with matching colours. Note that the prior does not show preference for a particular linear model.
The Gaussian assumption on both the target and the coefficients leads to the Gaussian posterior
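\[ p(\beta \mid X, y) = \mathcal{N}(\beta_1, \Sigma_1), \tag{2.9} \]
where $\beta_1 = \sigma^{-2} \Sigma_1 X^\top y$ and $\Sigma_1^{-1} = \sigma^{-2} X^\top X + \Sigma_0^{-1}$. In order to make predictions for an unseen data point $x_*$, the posterior predictive distribution for $y_*$ is calculated as
\[ p(y_* \mid x_*, X, y) = \int p(y_* \mid x_*, \beta)\, p(\beta \mid X, y)\, d\beta, \tag{2.10} \]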
where, as before, the parameter β is marginalised. The difference is that the marginalisation is performed under the posterior distribution. This effectively takes into account all remaining uncertainty after assimilating the available data.
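Before marginalising under the posterior, the moments in (2.9) have to be computed. A minimal sketch follows, assuming a zero prior mean as in the text; the synthetic data, the true coefficients and the noise variance are hypothetical values chosen only to exercise the formulas.

```python
import numpy as np

def posterior(X, y, Sigma0, sigma2):
    """Posterior mean and covariance of beta, following (2.9) with zero prior mean."""
    precision = X.T @ X / sigma2 + np.linalg.inv(Sigma0)  # Sigma1^{-1}
    Sigma1 = np.linalg.inv(precision)
    beta1 = Sigma1 @ X.T @ y / sigma2
    return beta1, Sigma1

# Hypothetical synthetic data: intercept 1, slope 2, noise variance 0.25.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-3.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])  # column of ones carries the intercept
beta_true = np.array([1.0, 2.0])
sigma2 = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

beta1, Sigma1 = posterior(X, y, Sigma0=np.eye(2), sigma2=sigma2)
```

The explicit matrix inverses mirror the notation of (2.9); for larger problems a linear solve or a Cholesky factorisation would usually be preferred for numerical stability.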
As before, every realisation of β defines a posterior linear model, as shown in Figure 2.3. However, at this stage, the random nature of β is governed by the posterior, that is, by our prior beliefs updated in the light of the data. The resulting distribution (2.10) is the predictive posterior distribution.
[Figure 2.3 panels. Left: "Parameter space: β", axes β0 and β1. Right: "Posterior model: y = β0 + β1x", axes x and y.]
Figure 2.3: Posterior inference after observing data. The crosses shown in the right panel correspond to the data used for updating the prior. As before, the left panel shows the contour levels of the posterior distribution. Each sample drawn from the posterior induces a linear model on the right panel. Colours are used to identify each induced linear model with the corresponding sample. Note that the posterior reflects what is learned from the data: a linear model with positive slope and positive intercept.
Because the posterior of β is Gaussian and the response of the linear model is assumed Gaussian, the predictive posterior has the analytic expression
\[ p(y_* \mid x_*, X, y) = \mathcal{N}(\beta_1^\top x_*, \; x_*^\top \Sigma_1 x_*). \tag{2.11} \]
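As a small worked example of (2.11), the sketch below evaluates the predictive mean and variance at a single point. The values of `beta1`, `Sigma1` and `x_star` are hypothetical. Note that the variance is computed exactly as written in (2.11), which is the variance of the linear predictor itself; the variance of a new noisy observation would additionally include the noise variance σ².

```python
import numpy as np

def predict(x_star, beta1, Sigma1):
    """Predictive mean and variance at one point, as written in (2.11)."""
    mean = beta1 @ x_star
    var = x_star @ Sigma1 @ x_star  # excludes the observation noise sigma^2
    return mean, var

# Hypothetical posterior moments, chosen only for illustration.
beta1 = np.array([1.0, 2.0])
Sigma1 = np.array([[0.04, 0.01],
                   [0.01, 0.09]])

# The first entry is 1 so that the first coefficient acts as an intercept.
x_star = np.array([1.0, 0.5])

mean, var = predict(x_star, beta1, Sigma1)
# mean = 1*1 + 2*0.5 = 2.0
# var  = 0.04 + 2*(0.01*0.5) + 0.09*0.25 = 0.0725
```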
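Moreover, (2.11) can be readily extended to make predictions at more than one point at a time. For example, assume there is interest in predicting the response for unseen points $x_*$ and $x_{**}$. The joint predictive distribution can be written as
\[
\begin{bmatrix} y_* \\ y_{**} \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \beta_1^\top x_* \\ \beta_1^\top x_{**} \end{bmatrix},
\begin{bmatrix}
x_*^\top \Sigma_1 x_* & x_*^\top \Sigma_1 x_{**} \\
x_{**}^\top \Sigma_1 x_* & x_{**}^\top \Sigma_1 x_{**}
\end{bmatrix}
\right). \tag{2.12}
\]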
Note that the predictive posterior induces a covariance between the two predictions that depends on the predictors $x_*$ and $x_{**}$. This covariance can be written as $k(x_*, x_{**}) = x_*^\top \Sigma_1 x_{**}$. This observation is important in the discussion of the Gaussian process model.
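The joint distribution (2.12) and the function k can be sketched as follows. The posterior moments and the two test points are hypothetical values, and stacking the test points as rows of a matrix `X_star` is a convenience assumed here rather than notation from the text.

```python
import numpy as np

def k(x_a, x_b, Sigma1):
    """Covariance between predictions at x_a and x_b: x_a^T Sigma1 x_b."""
    return x_a @ Sigma1 @ x_b

def joint_predictive(X_star, beta1, Sigma1):
    """Mean vector and covariance matrix of (2.12) for the rows of X_star."""
    mean = X_star @ beta1
    cov = X_star @ Sigma1 @ X_star.T  # entry (i, j) equals k(x_i, x_j)
    return mean, cov

# Hypothetical posterior moments and test points, for illustration only.
beta1 = np.array([1.0, 2.0])
Sigma1 = np.array([[0.04, 0.01],
                   [0.01, 0.09]])
X_star = np.array([[1.0, 0.5],    # x_*
                   [1.0, -2.0]])  # x_**

mean, cov = joint_predictive(X_star, beta1, Sigma1)
```

The off-diagonal entry `cov[0, 1]` coincides with `k(X_star[0], X_star[1], Sigma1)`, so the whole covariance matrix is obtained by evaluating k on every pair of test points.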
The preceding discussion can be extended to a more flexible linear model. The linear expression in (2.5) can be written in terms of p known basis functions $h_j(x)$, $j = 1, \ldots, p$,
\[ p(y \mid X, \beta) = \mathcal{N}(H\beta, \sigma^2 I_n), \tag{2.13} \]
where $H \in \mathbb{R}^{n \times p}$ is known as the design matrix, with row $i$ equal to the vector $h_i = h(x_i) = [h_1(x_i), \ldots, h_p(x_i)]$. The posterior distribution of β is
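\[ p(\beta \mid X, y) = \mathcal{N}(\beta_1, \Sigma_1), \tag{2.14} \]
where $\beta_1 = \sigma^{-2} \Sigma_1 H^\top y$ and $\Sigma_1^{-1} = \sigma^{-2} H^\top H + \Sigma_0^{-1}$. Finally, the posterior predictive distribution for an unseen point $x_*$ can be shown to be
\[ p(y_* \mid x_*, X, y) = \mathcal{N}(\beta_1^\top h_*, \; h_*^\top \Sigma_1 h_*), \tag{2.15} \]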
where $h_* = h(x_*)$. This allows the extension of the linear regression model with linear predictors to models with more flexible known basis functions.
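To make the role of the basis functions concrete, the sketch below applies (2.14) and (2.15) with a quadratic polynomial basis h(x) = [1, x, x²]. The choice of basis, the synthetic data and the identity prior covariance are all assumptions made for illustration; the text does not prescribe a particular basis.

```python
import numpy as np

def design_matrix(x, degree=2):
    """Rows h(x_i) = [1, x_i, ..., x_i^degree] (an assumed polynomial basis)."""
    return np.vander(x, N=degree + 1, increasing=True)

# Hypothetical data from a quadratic trend with noise standard deviation 0.1.
rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
sigma2 = 0.01
y = 0.5 - x + 2.0 * x**2 + rng.normal(scale=np.sqrt(sigma2), size=n)

H = design_matrix(x)  # shape (n, p) with p = 3
Sigma0 = np.eye(3)    # assumed prior covariance

# Posterior moments following (2.14).
Sigma1 = np.linalg.inv(H.T @ H / sigma2 + np.linalg.inv(Sigma0))
beta1 = Sigma1 @ H.T @ y / sigma2

# Predictive moments following (2.15) at the unseen point x_* = 1.5.
h_star = design_matrix(np.array([1.5]))[0]
pred_mean = beta1 @ h_star
pred_var = h_star @ Sigma1 @ h_star
```

The model remains linear in β even though the fitted curve is quadratic in x, which is why the posterior formulas carry over with H in place of X.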
In summary, a linear model with unknown coefficients β has been specified to relate predictors x and target variable y. The prior distribution on β reflects our belief about which linear models are deemed possible. Combining the likelihood, the prior beliefs and the data yields the posterior distribution on β, which reflects our updated beliefs about the linear model after observing some data.