
Chapter 3 Single and multiple response Gaussian processes

3.4 Estimating hyperparameters

Having established the prior functions of the Gaussian processes for the single and multiple response cases, it is now possible to address the question of estimating their parameters, here denoted as hyperparameters, from available data. Loosely speaking, the Gaussian process is being ‘trained’ to match the supplied data, so that its functions resemble the training data as reliably as possible.

The hyperparameters described in the previous sections, $\phi = \{\beta, \Sigma^2, \omega\}$, can be estimated through maximum likelihood estimation (MLE) between the observed data $D = \{Y, X\}$ and the mrGp statistical model. In the subsequent text only the multiple response case will be detailed (from which the single response case can be readily deduced).

Using MLE is similar to Bayesian estimation, in the sense that a likelihood also has to be built to estimate a hyperparameter of a statistical model. However, the main difference is that MLE does not require prior beliefs to be defined in order to estimate the hyperparameter. Additionally, the hyperparameter attains a fixed deterministic value at the maximum of the likelihood function, whereas Bayesian estimation treats it as a random variable. MLE is also known as evidence approximation or empirical Bayes [194].

The first step requires constructing the multivariate normal likelihood function of the mrGp, which is analogous to the multivariate normal density function found throughout the statistical literature,

$$ f_X(x_1, \ldots, x_N) = \frac{\exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)}{\sqrt{(2\pi)^N |\Sigma|}} \tag{3.8} $$

differing only in the mean vector $\mu$ and covariance matrix $\Sigma$, which are replaced by the priors presented in the previous sections.

It should be made clear that in the above expression $x$ is usually the random variable and $\mu$ and $\Sigma$ are fixed statistics. The likelihood, by contrast, treats the data as fixed (observed, as mentioned before) and the statistics as variables whose values need to be determined.
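The distinction between a density and a likelihood can be illustrated with a minimal Python sketch. The function name `mvn_logpdf` and all numerical values are illustrative assumptions and are not part of the mrGp model: a single observation is held fixed and the logarithm of Eq. (3.8) is evaluated for several candidate mean vectors.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Logarithm of the multivariate normal density of Eq. (3.8)."""
    N = x.size
    r = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

# Fixed (observed) data and a fixed covariance; both are made-up values.
x_obs = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

# Viewed as a likelihood, the same expression is a function of the mean vector.
candidates = [np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 1.0])]
loglik = [mvn_logpdf(x_obs, mu, Sigma) for mu in candidates]
best = int(np.argmax(loglik))
print("log-likelihoods:", loglik)
print("best candidate index:", best)
```

With a single observation and a fixed covariance, the quadratic term vanishes only when the candidate mean equals the observation, so that candidate attains the largest likelihood.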

Note also that a determinant property of the Kronecker product,

$$ |A \otimes B| = |A|^m |B|^n, \tag{3.9} $$

has to be used to expand the denominator of Eq. (3.8). The exponent of $|A|$ is the order of $B$ and the exponent of $|B|$ is the order of $A$, i.e. $A$ is $n \times n$ and $B$ is $m \times m$.
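The property in Eq. (3.9) can be checked numerically. The following Python sketch uses randomly generated symmetric positive-definite matrices (the helper `random_spd` and the sizes are assumptions made only for this illustration) and compares log-determinants, which is the form that enters the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(k, rng):
    """Return a random symmetric positive-definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

n, m = 3, 4                      # A is n x n, B is m x m
A = random_spd(n, rng)
B = random_spd(m, rng)

# Log-determinants are used to avoid overflow for larger matrices.
_, logdet_kron = np.linalg.slogdet(np.kron(A, B))
_, logdet_A = np.linalg.slogdet(A)
_, logdet_B = np.linalg.slogdet(B)

lhs = logdet_kron                        # log |A (x) B|
rhs = m * logdet_A + n * logdet_B        # log(|A|^m |B|^n)
print(np.isclose(lhs, rhs))
```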

Hence, the likelihood function of a mrGp is given by

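$$ p(Y \mid \beta, \Sigma^2, \omega) = (2\pi)^{-Nq/2} |\Sigma^2|^{-N/2} |R|^{-q/2} \exp\left\{-\tfrac{1}{2}\,\mathrm{vec}(Y - H\beta)^T (\Sigma^2 \otimes R)^{-1} \mathrm{vec}(Y - H\beta)\right\}, \tag{3.10} $$

and its log-likelihood is

$$ \ell(Y \mid \beta, \Sigma^2, \omega) \propto -\tfrac{1}{2}\left\{ N \log|\Sigma^2| + q \log|R| + \mathrm{vec}(Y - H\beta)^T (\Sigma^2 \otimes R)^{-1} \mathrm{vec}(Y - H\beta) \right\}. \tag{3.11} $$

By differentiating Eq. (3.11) with respect to $\beta$ and $\Sigma^2$ and equating these derivatives to zero, it is possible to obtain the analytic MLE of these hyperparameters. The following two subsections detail the derivation and estimation of the regression coefficients $\beta$ and the process variance matrix $\Sigma^2$.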
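As an illustration of how the logarithm of Eq. (3.10) might be evaluated in practice, the sketch below assumes that $Y$ is stored as an $N \times q$ array, that $\mathrm{vec}(\cdot)$ stacks columns, and that $\Sigma^2$ is symmetric. Under those assumptions the quadratic form equals $\mathrm{tr}\{(\Sigma^2)^{-1} E^T R^{-1} E\}$ with $E = Y - H\beta$, so the $Nq \times Nq$ Kronecker product never has to be formed. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def mrgp_log_likelihood(Y, H, beta, Sigma2, R):
    """Log of Eq. (3.10) for Y (N x q), H (N x p), beta (p x q),
    Sigma2 (q x q) and R (N x N); vec() is assumed to stack columns."""
    N, q = Y.shape
    E = Y - H @ beta                               # residual matrix, N x q
    _, logdet_S = np.linalg.slogdet(Sigma2)
    _, logdet_R = np.linalg.slogdet(R)
    # Quadratic form via the trace identity, avoiding the Nq x Nq matrix.
    quad = np.trace(np.linalg.solve(Sigma2, E.T @ np.linalg.solve(R, E)))
    return -0.5 * (N * q * np.log(2.0 * np.pi)
                   + N * logdet_S + q * logdet_R + quad)

# Small synthetic example; every value here is made up for illustration.
rng = np.random.default_rng(0)
N, p, q = 6, 2, 2
x = np.linspace(0.0, 1.0, N)
H = np.column_stack([np.ones(N), x])               # constant + linear regressors
# Assumed squared-exponential correlation with a small stabilising nugget.
R = np.exp(-5.0 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(N)
Sigma2 = np.array([[1.0, 0.3],
                   [0.3, 0.5]])
beta = rng.standard_normal((p, q))
Y = rng.standard_normal((N, q))

ll = mrgp_log_likelihood(Y, H, beta, Sigma2, R)
print(ll)
```

Dropping the constant $Nq\log(2\pi)$ term gives the proportional form of Eq. (3.11).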

3.4.1 Estimation of regression coefficients

As described above, the likelihood function of a mrGp is multivariate normal, and its mean function is made of regression functions. Fortunately, the MLE for this particular case has a well-known solution, reproduced below.

Consider the problem of determining a weighted least squares solution to the system

$$ Y = H\beta + \epsilon, \tag{3.12} $$

where $\epsilon$ is a “residual” term with covariance as specified above. Firstly, the generalised measure of the squared distance from $Y$ to $H\beta$ in standard deviation units can be written as

$$ (Y - H\beta)^T R^{-1} (Y - H\beta), \tag{3.13} $$

which is analogous to the univariate normal case $\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)$. Secondly, by differentiating this expression with respect to $\beta$ and setting the derivative equal to zero, it can be seen that the minimum of the squares (and its norm) occurs at the $\hat{\beta}$ that satisfies

$$ H^T R^{-1} H \hat{\beta} = H^T R^{-1} Y, \tag{3.14} $$

which are also called the normal equations. The circumflex above $\beta$ denotes that it is an estimated value. Finally, premultiplying both sides by the inverse of $H^T R^{-1} H$, the estimate can be isolated as

$$ \hat{\beta} = (H^T R^{-1} H)^{-1} H^T R^{-1} Y. \tag{3.15} $$

The procedure shown above is equivalent to solving $\partial\ell/\partial\beta = 0$, since for the Gaussian case the MLE and the least squares solution are identical. An alternative form of the above estimate can be written as

$$ \hat{\beta} = W H^T R^{-1} Y, \tag{3.16} $$

with the Gram-Schmidt matrix inverse, which dictates the numerical stability of the solution, defined as

$$ W = (H^T R^{-1} H)^{-1}. \tag{3.17} $$

Its condition number determines the expected accuracy of a solution to the least squares problem. This solution corresponds to the classical generalised least squares solution. If the correlation matrix $R$ had instead been assumed to be diagonal, i.e. an uncorrelated process (for a correlation matrix this means the identity), the result would be the ordinary least squares solution.
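A minimal Python sketch of the generalised least squares estimate follows. Rather than forming the inverse in Eq. (3.15) explicitly, it whitens the system with a Cholesky factor of $R$ and then solves an ordinary least squares problem, which is one common way to limit the influence of poor conditioning; this is a choice made for the example and not necessarily the procedure used in this work. The function name `gls_beta`, the squared-exponential correlation and the synthetic data are assumptions.

```python
import numpy as np

def gls_beta(H, R, Y):
    """Generalised least squares estimate, equivalent to Eq. (3.15)."""
    L = np.linalg.cholesky(R)            # R = L L^T, with L lower triangular
    # np.linalg.solve is a general solver; a triangular solver would be cheaper.
    Hw = np.linalg.solve(L, H)           # whitened regressors  L^{-1} H
    Yw = np.linalg.solve(L, Y)           # whitened responses   L^{-1} Y
    # Ordinary least squares on the whitened system equals GLS on the original.
    beta_hat, *_ = np.linalg.lstsq(Hw, Yw, rcond=None)
    return beta_hat

# Synthetic example; all values are made up for illustration.
rng = np.random.default_rng(0)
N, q = 8, 2
x = np.linspace(0.0, 1.0, N)
H = np.column_stack([np.ones(N), x])     # constant + linear regression functions
# Assumed squared-exponential correlation with a small stabilising nugget.
R = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(N)
Y = rng.standard_normal((N, q))

beta_hat = gls_beta(H, R, Y)             # generalised least squares
beta_ols = gls_beta(H, np.eye(N), Y)     # identity R reduces to ordinary least squares
print(beta_hat)
print(beta_ols)
```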

3.4.2 Estimation of process variance

On the other hand, the MLE of the process variance can be obtained by solving $\partial\ell/\partial\Sigma^2 = 0$, and results in the estimate

$$ \hat{\Sigma}^2 = \frac{1}{N-p} (Y - H\hat{\beta})^T R^{-1} (Y - H\hat{\beta}), \tag{3.18} $$

known as the generalised sample variance or average of the squared deviations.
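The estimate of Eq. (3.18) can be sketched in Python as below. The section does not define $p$ explicitly; the code assumes it is the number of regression functions, i.e. the number of columns of $H$. The function name, the correlation function and the data are likewise illustrative assumptions.

```python
import numpy as np

def process_variance(H, R, Y, beta_hat):
    """Process variance matrix of Eq. (3.18); returns a q x q array."""
    N, p = H.shape                       # p assumed to be the number of columns of H
    E = Y - H @ beta_hat                 # residual matrix, N x q
    return E.T @ np.linalg.solve(R, E) / (N - p)

# Synthetic example; all values are made up for illustration.
rng = np.random.default_rng(0)
N, q = 8, 2
x = np.linspace(0.0, 1.0, N)
H = np.column_stack([np.ones(N), x])
# Assumed squared-exponential correlation with a small stabilising nugget.
R = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(N)
Y = rng.standard_normal((N, q))

# Regression coefficients from Eq. (3.15), solved without an explicit inverse.
beta_hat = np.linalg.solve(H.T @ np.linalg.solve(R, H),
                           H.T @ np.linalg.solve(R, Y))
Sigma2_hat = process_variance(H, R, Y, beta_hat)
print(Sigma2_hat)
```

The result is a symmetric $q \times q$ matrix whose diagonal holds the variance estimate of each response and whose off-diagonal entries hold the covariances between responses.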

3.4.3 Numerical optimisation of the log-likelihood function

Substituting Eqs. (3.16) and (3.18) into Eq. (3.11) simplifies it into

$$ \ell(Y \mid \omega) \propto -\tfrac{1}{2}\left\{ N \log|\hat{\Sigma}^2| + q \log|R| \right\}, \tag{3.19} $$

where both terms $\hat{\Sigma}^2$ and $R$ depend on the roughness coefficients $\omega$. This function has to be maximised numerically in order to estimate the values of $\omega$.
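The quadratic term of Eq. (3.11) disappears from Eq. (3.19) because, once $\hat{\beta}$ and $\hat{\Sigma}^2$ are substituted, it reduces to the constant $(N-p)q$, which does not depend on $\omega$. A minimal sketch of the numerical maximisation is given below. It makes several assumptions that are not taken from this section: a squared-exponential correlation $R_{ij} = \exp\{-\sum_k \omega_k (x_{ik} - x_{jk})^2\}$ with a small nugget, optimisation over $\log\omega$ to keep the coefficients positive, the L-BFGS-B method of `scipy.optimize.minimize` with arbitrary bounds, and synthetic data.

```python
import numpy as np
from scipy.optimize import minimize

def corr_matrix(X, omega, nugget=1e-6):
    """Assumed squared-exponential correlation matrix with a stabilising nugget."""
    diff = X[:, None, :] - X[None, :, :]            # pairwise differences, (N, N, d)
    R = np.exp(-np.sum(omega * diff ** 2, axis=2))
    return R + nugget * np.eye(X.shape[0])

def neg_concentrated_loglik(log_omega, X, H, Y):
    """Negative of Eq. (3.19), parameterised by log(omega)."""
    N, q = Y.shape
    p = H.shape[1]                                   # p assumed = columns of H
    R = corr_matrix(X, np.exp(log_omega))
    beta_hat = np.linalg.solve(H.T @ np.linalg.solve(R, H),
                               H.T @ np.linalg.solve(R, Y))      # Eq. (3.15)
    E = Y - H @ beta_hat
    Sigma2_hat = E.T @ np.linalg.solve(R, E) / (N - p)           # Eq. (3.18)
    _, logdet_S = np.linalg.slogdet(Sigma2_hat)
    _, logdet_R = np.linalg.slogdet(R)
    return 0.5 * (N * logdet_S + q * logdet_R)

# Synthetic two-response data on a one-dimensional input (made-up values).
rng = np.random.default_rng(1)
N, d, q = 20, 1, 2
X = rng.uniform(0.0, 1.0, size=(N, d))
H = np.hstack([np.ones((N, 1)), X])
Y = np.column_stack([np.sin(6.0 * X[:, 0]), np.cos(6.0 * X[:, 0])])
Y = Y + 0.01 * rng.standard_normal((N, q))

res = minimize(neg_concentrated_loglik, x0=np.zeros(d), args=(X, H, Y),
               method="L-BFGS-B", bounds=[(-3.0, 6.0)] * d)
omega_hat = np.exp(res.x)
print("estimated roughness coefficients:", omega_hat)
```

Because the likelihood surface can have several local maxima, restarting the optimiser from different initial values of $\omega$ is a common precaution.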
