4.6 Posterior consistency
4.6.4 General consistency theory
In general terms, the consistency of f_d(·), θ and X needs to be considered simultaneously. Although this is an interesting problem, it is considerably more demanding and will require further development.
4.7 Chapter summary
Current non-linear models dealing with latent variables tend to focus primarily on prediction while sidestepping model interpretability. While this may be appropriate in applications where the latent embedding of the data is not of interest, in process control applications interpretability is of particular importance. With physical interpretability in mind, in this chapter we have introduced and defined a new class of nonparametric models, the Gaussian process functional factor analysis model. Its main characteristic is that it allows maps to be built between subsets of the latent variables and the dependent observations. We have further proposed a method of estimation for the unknown parameters and also discussed the model's asymptotic properties.
The next natural step is towards model selection. In relation to the right panel of Figure 4.1, model selection is related to establishing the links (represented pictorially by arrows) between latent variables and what is observed. In (linear) factor analysis a model is normally hypothesized based on theoretical knowledge; it is then left to the data to support (or not) the initial theory. In an engineering setting, while that approach is still possible, it is generally harder to pursue and a different methodology will be needed. These and other aspects will be discussed further in the next chapter.
Model Selection
In so far as model selection is concerned, there are two main questions that need to be taken into account:
(i) How many different Gaussian process priors are needed to model the data appropriately? This is related to the way the output dimensions are grouped together as briefly discussed in Section 4.2.3. One potential way of doing this is to use any knowledge that we may have about the system. For instance, if we had temperatures in a distillation tower or other related equipment, there are explicit relationships amongst them all arising from physical/chemical laws and therefore they should all probably be modelled together.
(ii) Once a decision has been made as to how the observations should be grouped, the second question we face is related to the way the latent variables and their indicators y(d) are linked together.
This chapter assumes that a decision has been made about (i). Then, an automated way of letting the data decide about (ii) is sought.
The parameter vector θ is key to any proposal for model selection. In this respect, two approaches are considered. Firstly, a profile log likelihood can be written by treating θ as the vector of parameters of interest and X as a matrix of nuisance parameters. Secondly, the latent variables X can be integrated out of the joint density of Y and X. Under either scenario, the resulting profile/marginal density allows the associated likelihood function to be written as a function of the kernel hyperparameters only, ℓ(θ). If that is feasible, θ can then be penalized using an appropriate penalty function, so that both variable selection and model parameter estimation could be carried out simultaneously; in this regard, the theory developed by Yi et al. [2011] for penalized Gaussian processes can be extended and adapted for the problem at hand.
Before providing any more details, it is worth highlighting what is required in order to integrate the latent variables out of the joint density. The starting point is the marginal distribution of Y given by
$$
p(Y \mid \theta) = \int p(Y \mid X, \theta)\, p(X \mid \theta)\, dX = \int p(Y \mid X, \theta)\, p(X)\, dX. \tag{5.1}
$$
Unfortunately, the N × Q dimension of X is very large and the calculation of this integral is not tractable. This problem has similarities to that arising in binary Gaussian process classification, where the latent function needs to be integrated out.
In that specific case several approximations have been provided in the literature. Kuss and Rasmussen [2005] review and compare Laplace's approximation (LA) with the Expectation-Propagation (EP) algorithm. Their work has subsequently been extended by Nickisch and Rasmussen [2008], who provide a very comprehensive review including additional approximations such as Kullback-Leibler (KL) divergence minimization and Variational Bayes (VB) approaches; all those results are compared against a gold standard based on a Markov chain Monte Carlo (MCMC) sampling procedure.
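To make the idea of a Laplace approximation concrete, the following minimal sketch applies it to a one-dimensional toy integral, Z = ∫ σ(x) N(x | 0, 1) dx, which equals 0.5 by symmetry. This toy integrand, the function names and the trapezoid reference are assumptions made for illustration only; they are not the GPFFA model nor the procedures of the cited papers. The integrand is replaced by a Gaussian centred at its mode, so that Z ≈ exp(h(x̂)) · sqrt(2π / (−h''(x̂))), where h is the log integrand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def h(x):
    # Log integrand: log sigmoid(x) + log N(x | 0, 1).
    return (-math.log1p(math.exp(-x))
            - 0.5 * x * x
            - 0.5 * math.log(2.0 * math.pi))


def laplace_integral(x0=0.0, iters=50, tol=1e-12):
    # Newton iterations to find the mode of h (h is concave in x).
    x = x0
    for _ in range(iters):
        s = sigmoid(x)
        grad = (1.0 - s) - x            # h'(x)
        hess = -s * (1.0 - s) - 1.0     # h''(x), always negative
        step = grad / hess
        x -= step
        if abs(step) < tol:
            break
    s = sigmoid(x)
    hess = -s * (1.0 - s) - 1.0
    # Gaussian (second-order) approximation around the mode.
    return x, math.exp(h(x)) * math.sqrt(2.0 * math.pi / -hess)


def trapezoid_integral(lo=-10.0, hi=10.0, n=20000):
    # Brute-force reference, feasible only because the toy is 1-D.
    width = (hi - lo) / n
    total = 0.5 * (math.exp(h(lo)) + math.exp(h(hi)))
    for i in range(1, n):
        total += math.exp(h(lo + i * width))
    return total * width


x_hat, z_laplace = laplace_integral()
z_reference = trapezoid_integral()
print("mode:", x_hat)
print("Laplace approximation:", z_laplace)
print("trapezoid reference:", z_reference)
```

In one dimension the brute-force reference is cheap; for an integral over all N × Q entries of X such a grid is out of reach, which is why only the mode-based approximation remains practical.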
In this chapter, the following three ideas will be developed:
(a) Can a profile log likelihood approach be used to estimate the model parameters and carry out model selection?
(b) How feasible is it to use a Laplace approximation to solve the numerical integration problem posed in Equation (5.1)?
(c) Can the resulting profile/marginal likelihood be penalized in order to automate the variable selection problem?
5.1 Profile log likelihood
As already noted, from a model selection perspective the parameter vector θ is of central interest, whereas the matrix X plays a secondary role as a nuisance term. In this respect, the profile log likelihood [Davison, 2003, chapter 4] for the GPFFA model can be expressed as
$$
\ell_{\mathrm{prof}}(\theta) = \max_{X} \ell_{\mathrm{MAP}}(X, \theta) = g(\hat{X}_\theta, \theta), \tag{5.2}
$$
where X̂_θ is the maximizer of ℓ_MAP(X, θ) over X for a known θ and ℓ_MAP(X, θ) is given by Equation (4.9), that is¹
$$
g(X, \theta) = \log p(Y \mid X, \theta) + \log p(X \mid \theta). \tag{5.3}
$$
Let us now define x = vec(X) and n* = N · Q. The function ℓ_prof(θ) can now be optimized with respect to the hyperparameters θ. In order to do that, the derivatives of ℓ_prof(θ) are needed. On the one hand, the derivatives can be obtained numerically; this is quick to implement but computationally intensive, since for every hyperparameter θ_j a numerical optimization must be carried out in order to find the maximum of g(X, θ).
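The cost of the numerical route can be seen in a small sketch. The toy model below is an assumption made purely for illustration: a scalar latent x with a standard normal prior, binary observations with logit θ_0 x + θ_1, and made-up data. It is not the GPFFA model. Each evaluation of the profile log likelihood requires an inner optimization over x, so a central finite difference needs two inner optimizations per hyperparameter.

```python
import math

Y = [1, 1, 0, 1, 1]       # made-up binary observations (illustrative only)
calls = {"inner": 0}      # counts how many inner optimizations are run


def g(x, theta):
    # Toy analogue of g(X, theta) = log p(Y | X, theta) + log p(X).
    eta = theta[0] * x + theta[1]
    log_lik = sum(Y) * eta - len(Y) * math.log1p(math.exp(eta))
    log_prior = -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
    return log_lik + log_prior


def inner_maximizer(theta, x0=0.0, iters=100, tol=1e-12):
    # Newton iterations over the latent x for fixed theta (g is concave in x).
    calls["inner"] += 1
    x = x0
    n = len(Y)
    for _ in range(iters):
        s = 1.0 / (1.0 + math.exp(-(theta[0] * x + theta[1])))
        grad = theta[0] * (sum(Y) - n * s) - x
        hess = -n * theta[0] ** 2 * s * (1.0 - s) - 1.0
        step = grad / hess
        x -= step
        if abs(step) < tol:
            break
    return x


def profile_loglik(theta):
    # Profile log likelihood: g evaluated at the inner maximizer.
    return g(inner_maximizer(theta), theta)


def numerical_gradient(theta, eps=1e-5):
    # Central differences: two inner optimizations per hyperparameter.
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((profile_loglik(up) - profile_loglik(down)) / (2.0 * eps))
    return grad


theta = [1.0, 0.0]
grad = numerical_gradient(theta)
n_inner_runs = calls["inner"]
print("numerical gradient:", grad)
print("inner optimizations used:", n_inner_runs)
```

In this toy the inner problem is one-dimensional and converges in a few Newton steps; in the setting of this chapter each inner optimization is over all n* = N · Q latent coordinates, which is what makes the numerical route expensive.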
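Alternatively, the derivatives can be worked out analytically. The covariance matrix K is an explicit function of the hyperparameters but, implicitly, X̂_θ is also a function of θ: when the hyperparameters change, the optimum of g(X, θ) also changes (see also Rasmussen and Williams [2006, p. 125] for a similar problem). Hence
$$
\frac{\partial \ell_{\mathrm{prof}}(\theta)}{\partial \theta_j}
= \left.\frac{\partial g(X, \theta)}{\partial \theta_j}\right|_{\mathrm{explicit}}
+ \sum_{i=1}^{n^*} \frac{\partial g(X, \theta)}{\partial x_i}\, \frac{\partial \hat{x}_{\theta,i}}{\partial \theta_j},
$$
where both terms are evaluated at x = x̂_θ. The first term is given in Appendix C.1, whereas the second term in the previous expression vanishes as ∂g(X, θ)/∂x = 0 at x = x̂_θ.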
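The vanishing of the implicit contribution can be checked by hand on a scalar toy problem. The function g below (one observation y, one latent x, one hyperparameter θ) is an assumption chosen only because everything is available in closed form; it is not the GPFFA objective.

```latex
\begin{align*}
g(x, \theta) &= -\tfrac{1}{2}(y - \theta x)^2 - \tfrac{1}{2}x^2, \\
\frac{\partial g}{\partial x} = \theta(y - \theta x) - x = 0
  \quad &\Longrightarrow \quad \hat{x}_\theta = \frac{\theta y}{1 + \theta^2}, \\
\ell_{\mathrm{prof}}(\theta) = g(\hat{x}_\theta, \theta)
  &= -\frac{y^2}{2(1 + \theta^2)}, \\
\frac{d \ell_{\mathrm{prof}}(\theta)}{d \theta}
  &= \frac{\theta y^2}{(1 + \theta^2)^2}, \\
\left.\frac{\partial g}{\partial \theta}\right|_{x = \hat{x}_\theta}
  = (y - \theta \hat{x}_\theta)\, \hat{x}_\theta
  &= \frac{y}{1 + \theta^2} \cdot \frac{\theta y}{1 + \theta^2}
  = \frac{\theta y^2}{(1 + \theta^2)^2}.
\end{align*}
```

The total derivative of the profile function and the explicit partial derivative evaluated at the optimum coincide, so the dependence of x̂_θ on θ contributes nothing to the gradient in this example.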
Finally, by further penalizing ℓ_prof(θ), a model selection approach could subsequently be implemented.
¹ The change from ℓ_MAP to g is only for notational convenience.