Chapter 3. Structured PCA: Theory
3.1.1 Exact model
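The joint probability over all random variables in the exact model has the factorisation

$$p(X, V, W, \sigma^2, \theta) = p(X \mid V, W, \sigma^2)\, p(V)\, p(W \mid \theta)\, p(\sigma^2)\, p(\theta) \tag{3.1}$$

$$= \Big[ \prod_{i=1}^{n} p(x_i \mid v_i, \sigma^2, W)\, p(v_i) \Big] \Big[ \prod_{j=1}^{k} p(w_j \mid \theta) \Big]\, p(\sigma^2)\, p(\theta) \tag{3.2}$$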
which corresponds to the plate diagram in Figure 3.1. Although $p(W \mid \theta)$ also depends on the location variables $t_1, \ldots, t_d$, these are omitted from the notation, as are $d$ and $k$, since they are fixed, non-random quantities. Equation 3.2 contains the sub-product
$$\prod_{i=1}^{n} p(x_i \mid v_i, \sigma^2, W)\, p(v_i) = \prod_{i=1}^{n} p(x_i, v_i \mid \sigma^2, W) \tag{3.3}$$

$$= p(X, V \mid W, \sigma^2) \tag{3.4}$$

which we call the complete data likelihood, as this is the joint likelihood of the data and the latent variables given all parameters. The distribution of a single $x_i$ depends only on the paired $v_i$, and does not change when any other $v_j$, $j \neq i$, changes. $v_i$ is thus
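The model is fully specified by defining each of the terms in the factorisation:

$$p(x_i \mid v_i, W, \sigma^2) = \mathcal{N}(x_i \mid W v_i, \sigma^2 I), \quad i \in 1, \ldots, n \tag{3.5}$$

$$p(v_i) = \mathcal{N}(v_i \mid 0, I), \quad i \in 1, \ldots, n \tag{3.6}$$

$$p(w_i \mid \theta) = \mathcal{N}(w_i \mid 0, K_\theta), \quad i \in 1, \ldots, k \tag{3.7}$$

$$p(\theta) \propto 1 \tag{3.8}$$

$$p(\sigma^2) \propto \sigma^{-2} \tag{3.9}$$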
From Equation 3.5, the role of the parameters $W$ and $\sigma^2$ can be seen. If we know the model parameters and the latent representation of a sample, we obtain a distribution over what the observation could have been. The mean of this distribution is obtained by applying the loadings $W$ to project $v_i$ up to dimension $d$; $W$ is thus in $\mathbb{R}^{d \times k}$. Around this mean there is isotropic variance $\sigma^2$, which is the same for every sample.
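The generative reading of Equations 3.5 and 3.6 can be sketched in a few lines of NumPy. This is only an illustration: the sizes, the value of $\sigma^2$ and the fixed random loadings `W` are assumptions made for the example, whereas in the model $W$ has its own prior (Equation 3.7).

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 500, 6, 2   # illustrative sizes (assumed, not taken from the text)
sigma2 = 0.1          # noise variance sigma^2 (assumed value)

# Hypothetical fixed loadings W in R^{d x k}; the model places a prior on W.
W = rng.standard_normal((d, k))

# Eq. 3.6: v_i ~ N(0, I), stacked as the rows of V in R^{n x k}.
V = rng.standard_normal((n, k))

# Eq. 3.5: x_i ~ N(W v_i, sigma^2 I), stacked as the rows of X in R^{n x d}.
noise = np.sqrt(sigma2) * rng.standard_normal((n, d))
X = V @ W.T + noise

print(X.shape)
```

Row $i$ of `V @ W.T` is $(W v_i)^\top$, so each observation is its projected latent vector plus isotropic noise, matching the row conventions for $X$ and $V$.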
The term $K_\theta$ in Equation 3.7 is the prior covariance matrix and is a function of the hyper-parameters $\theta$. We assume that each of the $d$ dimensions of the observed data space is associated with a location parameter $t_i$, $i \in 1, \ldots, d$. $K_\theta$ is then constructed by applying a covariance function $C(\cdot, \cdot\,; \theta)$, parametrised by $\theta$, to obtain the prior covariance between every pair of dimensions. Symbolically, the $(i, j)$th element of $K_\theta$ is constructed as:

$$(K_\theta)_{ij} = C(t_i, t_j; \theta) \quad \forall\, i, j \in 1, \ldots, d \tag{3.10}$$
As a simple example, say we are working with images such that each element of an observation is the grey-scale intensity of a pixel, and $t_i$ is the location of pixel $i$. Choosing a covariance function such as the Squared Exponential then encodes the prior knowledge that nearby pixels have strong prior covariance, so are expected to take similar values. When performing inference, this translates into encouraging elements of each $w_i$ associated with nearby pixels to take similar values. A single learned $w_i$ will then capture a smooth image, and a sample reconstruction $Wv = \sum_{i=1}^{k} w_i v_i$ (where $v_i$ here denotes the $i$th element of the latent vector $v$) is built from a linear combination of these underlying images.
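The construction in Equation 3.10 and the image example can be sketched as follows. The particular Squared Exponential parametrisation (a signal variance and a length-scale), the helper names `squared_exponential` and `build_K`, the 4 x 4 grid and the jitter term are all assumptions made for illustration, not the thesis's implementation.

```python
import numpy as np

def squared_exponential(t_a, t_b, theta):
    """One common form of C(t_a, t_b; theta), theta = (variance, length-scale)."""
    variance, lengthscale = theta
    sq_dist = np.sum((t_a - t_b) ** 2)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def build_K(T, theta, cov_fn=squared_exponential):
    """Fill (K)_ij = C(t_i, t_j; theta) for the d locations in the rows of T."""
    d = T.shape[0]
    K = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            K[i, j] = cov_fn(T[i], T[j], theta)
    return K

# Locations t_i: (row, column) coordinates of the pixels of a 4 x 4 image.
side = 4
rows, cols = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
T = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # shape (16, 2)

theta = (1.0, 1.5)   # assumed hyper-parameter values
K = build_K(T, theta)

# Draw one loading vector w ~ N(0, K); a small jitter keeps the Cholesky stable.
rng = np.random.default_rng(0)
chol = np.linalg.cholesky(K + 1e-6 * np.eye(side * side))
w = chol @ rng.standard_normal(side * side)
smooth_image = w.reshape(side, side)
```

Under this prior, adjacent pixels receive a larger covariance than distant ones, so a draw such as `smooth_image` tends to vary slowly across the grid.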
The marginal priors $p(\theta)$ and $p(\sigma^2)$ are improper. We assume the domain of $\theta$ is over the reals, so an uninformative uniform prior is appropriate. For $\sigma^2$, we use the Jeffreys prior for a scale parameter [e.g., Jaynes, 2003, Chapter 12]. The uniform prior over $\theta$ does not present any problems since we do not integrate over $\theta$ at any point (we approximate the evidence with the marginal likelihood). We do integrate over $\sigma^2$ in computing the marginal likelihood, which is improper; however, we do not deal with the true marginal likelihood and consider only the Laplace approximation, which by construction always produces a normalised distribution.

| Variable | Domain | Meaning |
|---|---|---|
| $n$ | $\mathbb{Z}^+$ | Number of observations. |
| $d$ | $\mathbb{Z}^+$ | Dimensionality of observations. |
| $k$ | $\mathbb{Z}^+$, $k \le d$ | Dimensionality of latent variables. |
| $x_i$ | $\mathbb{R}^d$ | A single observation indexed by $i \in 1, \ldots, n$. |
| $X$ | $\mathbb{R}^{n \times d}$ | Design matrix of observations; the $i$th row is $x_i^\top$. |
| $v_i$ | $\mathbb{R}^k$ | Latent representation of $x_i$, $i \in 1, \ldots, n$. |
| $V$ | $\mathbb{R}^{n \times k}$ | Matrix of latent variables. Row $i$ is $v_i^\top$. |
| $W$ | $\mathbb{R}^{d \times k}$ | The loadings matrix, which maps the $k$-dimensional latent space up to the space of $d$-dimensional observations. |
| $\sigma^2$ | $\mathbb{R}^+$ | The variance of the predicted distribution of an observation given its latent representation. |
| $K_\theta$ | $\mathbb{R}^{d \times d}$ | Prior covariance matrix which encodes our prior knowledge of the covariance structure of our observations. Depends on hyper-parameters $\theta$. |
| $\theta$ | $\mathbb{R}^{n_\theta}$ | Hyper-parameters controlling $K_\theta$. |
| $t_i$ | $\mathbb{R}^{n_t}$ | Location parameter associated with each of the $d$ dimensions of the observation space, $i \in 1, \ldots, d$. |
| $C(\cdot, \cdot\,; \theta)$ | $\mathbb{R}^{n_t} \times \mathbb{R}^{n_t} \mapsto \mathbb{R}$ | Covariance function, parametrised by $\theta$, used to construct $K_\theta$: $(K_\theta)_{ij} = C(t_i, t_j; \theta)\ \forall\, i, j \in 1, \ldots, d$. |

Table 3.1: Variables used in StPCA
StPCA is related to the GP-LVM (Section 1.1.1), which also introduces a Gaussian prior on $W$. The GP-LVM uses the prior $p(W) = \prod_{i=1}^{k} \mathcal{N}(w_i \mid 0, \alpha I)$, which can also be constructed in StPCA using the independent covariance function. Using this prior and integrating out the loadings leads to a linear Gaussian Process mapping from latent to observation space [Lawrence, 2004]. The difference between the two models is that the GP-LVM may extend to a non-linear latent-to-observation mapping by introducing a non-linear kernel in the Gaussian Process mapping. With a linear GP-LVM, however, the models are quite similar. The inference procedures still differ: StPCA marginalises out the latent variables and optimises the loadings, whilst the GP-LVM marginalises out the loadings and optimises the latent variables.
Calculating the likelihood
The likelihood of StPCA can be computed in closed form. Using Equations 3.5 and 3.6, we marginalise out the latent variables
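$$p(x_i \mid W, \sigma^2) = \int_{\mathbb{R}^k} p(x_i \mid v_i, W, \sigma^2)\, p(v_i)\, dv_i \tag{3.11}$$

$$= \int_{\mathbb{R}^k} \mathcal{N}(x_i \mid W v_i, \sigma^2 I)\, \mathcal{N}(v_i \mid 0, I)\, dv_i \tag{3.12}$$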
The same marginalisation is performed in PPCA [Tipping and Bishop, 1999, Equation 3]. Equation 3.12 has the form of a marginalisation over the mean of a Gaussian with a conjugate Gaussian prior [Gelman et al., 2013, Section 2.4]. The marginal distribution is thus Gaussian, so we only need to compute the mean and covariance:
$$\mathbb{E}\big[ x_i \mid W, \sigma^2 \big] = 0 \tag{3.13}$$

$$\operatorname{cov}(x_i \mid W, \sigma^2) = W W^\top + \sigma^2 I \tag{3.14}$$
The derivation of these is bulky and has been moved to Appendix D.1.1. Together, they give the closed-form likelihood
$$p(x_i \mid W, \sigma^2) = \mathcal{N}(x_i \mid 0, W W^\top + \sigma^2 I) \tag{3.15}$$
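As a rough numerical check of Equations 3.13 to 3.15, the sketch below evaluates the closed-form log-likelihood and compares the empirical second moment of samples drawn via Equations 3.5 and 3.6 with $WW^\top + \sigma^2 I$. The function name `stpca_log_likelihood`, the sizes and the parameter values are assumptions for illustration; the sum over rows relies on the observations being independent given $W$ and $\sigma^2$, as in the factorisation of Equation 3.2.

```python
import numpy as np

def stpca_log_likelihood(X, W, sigma2):
    """Sum over rows x_i of log N(x_i | 0, W W^T + sigma2 * I)  (Eq. 3.15)."""
    n, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)                 # marginal covariance, Eq. 3.14
    _, logdet = np.linalg.slogdet(C)                 # numerically stable log|C|
    quad = np.sum(X.T * np.linalg.solve(C, X.T))     # sum_i x_i^T C^{-1} x_i
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + quad)

rng = np.random.default_rng(1)
n, d, k, sigma2 = 200_000, 5, 2, 0.3                 # assumed illustrative values
W = rng.standard_normal((d, k))                      # hypothetical loadings

# Sample from the model: v_i ~ N(0, I), x_i ~ N(W v_i, sigma2 * I).
V = rng.standard_normal((n, k))
X = V @ W.T + np.sqrt(sigma2) * rng.standard_normal((n, d))

C = W @ W.T + sigma2 * np.eye(d)
empirical = X.T @ X / n                              # the mean is zero (Eq. 3.13)
max_abs_dev = np.max(np.abs(empirical - C))          # shrinks as n grows

eigenvalues = np.linalg.eigvalsh(C)                  # ascending order
print(max_abs_dev)
print(eigenvalues)
print(stpca_log_likelihood(X[:1000], W, sigma2))
```

Provided `W` has full column rank, the smallest $d - k$ entries of `eigenvalues` equal $\sigma^2$ and the remaining $k$ exceed it, since $WW^\top$ contributes variance in only $k$ directions.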
Considering the covariance of this Gaussian, we see that $k$ directions of variance are modelled by the rank-$k$ matrix $WW^\top$, and each of these directions has variance of at least $\sigma^2$ due to the addition of the $\sigma^2 I$ term. The remaining $d - k$ directions all have variance of exactly $\sigma^2$. If $\sigma^2$ is very small, the data are modelled as lying on a degenerate Gaussian spanning only a $k$-dimensional subspace.

Latent Space
We have seen from Equation 3.5 that if we know the latent representation of a sample, we can obtain a distribution over possible observations (repeated here for convenience):

$$p(x_i \mid v_i, W, \sigma^2) = \mathcal{N}(x_i \mid W v_i, \sigma^2 I)$$
However, if we have some observations and are interested in their unobserved latent representations, it is more natural to consider the posterior over latent variables:

$$p(v_i \mid x_i, W, \sigma^2) = \mathcal{N}\big(v_i \mid M^{-1} W^\top x_i,\ \sigma^2 M^{-1}\big)$$

which uses the definition $M = W^\top W + \sigma^2 I$. Note the difference from $C = WW^\top + \sigma^2 I$: $M$ is a $k \times k$ matrix whereas $C$ is $d \times d$, with $k \le d$. The posterior may be computed by noting that both $p(x_i \mid v_i)$ and $p(v_i)$ are Gaussian and using the rules of Gaussian conditioning [e.g., Bishop, 2006, Section 2.3.3].
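A minimal sketch of this computation follows. It assumes the standard linear-Gaussian result that the posterior is $\mathcal{N}(v_i \mid M^{-1} W^\top x_i, \sigma^2 M^{-1})$ with $M = W^\top W + \sigma^2 I$, and cross-checks it against direct conditioning of the joint Gaussian over $(v_i, x_i)$. The function name `latent_posterior` and all numerical values are illustrative assumptions.

```python
import numpy as np

def latent_posterior(x, W, sigma2):
    """Posterior mean and covariance of v given x, via the k x k matrix M."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    mean = np.linalg.solve(M, W.T @ x)     # M^{-1} W^T x
    cov = sigma2 * np.linalg.inv(M)        # sigma^2 M^{-1}
    return mean, cov

rng = np.random.default_rng(2)
d, k, sigma2 = 7, 3, 0.5                   # assumed illustrative values
W = rng.standard_normal((d, k))            # hypothetical loadings
x = rng.standard_normal(d)                 # hypothetical observation

mean, cov = latent_posterior(x, W, sigma2)

# Cross-check by conditioning the joint Gaussian of (v, x), which has
# cov(v) = I, cov(v, x) = W^T and cov(x) = C = W W^T + sigma^2 I.
C = W @ W.T + sigma2 * np.eye(d)
mean_check = W.T @ np.linalg.solve(C, x)
cov_check = np.eye(k) - W.T @ np.linalg.solve(C, W)

print(np.allclose(mean, mean_check), np.allclose(cov, cov_check))
```

Working with $M$ requires solving only a $k \times k$ system, whereas the cross-check through $C$ solves a $d \times d$ one; this is the practical significance of the difference in order between the two matrices.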
Non-identifiability of W and V
StPCA has a non-identifiability in $W$ and $V$. For any orthogonal matrix $R \in \mathbb{R}^{k \times k}$, making the substitution $W \to WR$ changes neither the likelihood nor the prior (and thus not the posterior). This is the case because $R$, by definition, satisfies $RR^\top = R^\top R = I$, and $W$ only appears in the likelihood and prior as $WW^\top$, which would be substituted as $WW^\top \to (WR)(R^\top W^\top) = WW^\top$. The interpretation of this is that $W$ defines a subspace, and one can always change basis within this subspace without changing the model.
Furthermore, in the complete likelihood, in which $v_i$ also occurs, $W$ and $v_i$ appear only as the product $W v_i$, which is invariant under the joint substitution $W v_i \to (WR)(R^\top v_i) = W v_i$. $R$ captures rotations around the origin in the latent space, as well as sign flips (reflections around the origin) and permutations of the columns of $V$. This tells us that there is no preferred orientation of the latent space; one may always rotate and reflect the points in the latent space around the origin, as long as the projection $W$ from the latent to the observed data space is modified accordingly.
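The invariance can be checked numerically. In the sketch below, a random orthogonal matrix is obtained from a QR decomposition (one convenient construction, assumed here for illustration), and both $WW^\top$ and $Wv$ are confirmed to be unchanged even though $W$ itself changes. All sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 3                                # assumed illustrative sizes
W = rng.standard_normal((d, k))            # hypothetical loadings
v = rng.standard_normal(k)                 # hypothetical latent vector

# The Q factor of a QR decomposition of a square matrix is orthogonal.
R, _ = np.linalg.qr(rng.standard_normal((k, k)))

W_rot = W @ R                              # W -> W R
v_rot = R.T @ v                            # v -> R^T v

print(np.allclose(R @ R.T, np.eye(k)))             # R is orthogonal
print(np.allclose(W_rot @ W_rot.T, W @ W.T))       # marginal covariance term unchanged
print(np.allclose(W_rot @ v_rot, W @ v))           # reconstruction W v unchanged
print(np.allclose(W_rot, W))                       # the loadings themselves differ
```

A permutation matrix or a diagonal matrix of $\pm 1$ entries could be used for `R` in the same way, since both are orthogonal.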