• No results found

1.1 Latent Variable Models

1.1.2 Inference

Expectation Maximisation

Expectation Maximisation (EM) is an algorithm for finding the maximum likelihood or maximum-a-posteriori parameters in a latent variable model. EM is typically used when directly maximising the likelihood is difficult, but maximising the joint likeli- hood p(X, V|✓) is easier. This is the case for Factor Analysis where arg max p(X|✓)

PCA Linear mapping justified as maximum variance preserving, or minimum reconstruction error.

PPCA Probabilistic model: MAP parameters recover principal sub- space.

FA Generalises PPCA with a separate parameter for noise on each observed variable.

Categorical PCA Categorical likelihood considers discrete-valued data. ICA Learn loadings such that latent variables are independent. SPCA PCA variant with sparsity in loadings.

DL PCA variant with sparsity in latent variables. NMF Loadings and latent variables both non-negative.

RPCA Decompose matrix of data into sparse + low rank matrices. GP-LVM Nonlinear; data are modelled as Gaussian Process mapping

from latent space.

Autoencoder Generalises many of the above. Usually deep (highly non- linear) architecture.

Table 1.1: List of latent variable models presented in this section, as well as their most salient feature when compared to classical PCA.

does not have a solution in closed form, but arg max p(X, V|✓) does.

EM describes an alternating scheme with an expectation step (E-step) and a maximisation step (M-step), beginning with some guess at the parameters ✓old.

In the E-step, we compute the posterior over latent variables p(V|X, ✓old), and con- struct a function giving the expected complete log likelihoodE[log p(X, V|✓)]p(V|X,✓old).

This often simplifies, especially in the case where p(X, V|✓) is in the exponential family, since the log cancels the exponentiation. In the case of a Gaussian complete log likelihood, this reduces to computing moments, which we will see. In the M-step, the expected complete log likelihood is maximised with respect to ✓. This EM cycle is repeated until convergence.

This procedure converges to the maximum likelihood parameters, and we walk through a proof of this below. The proof is well known, and involves decom- posing log p(X|✓) into a lower bound plus a non-negative residual. The E-step and M-step are shown to be a consequence of maximising this lower bound.

We first introduce q(V) which is any arbitrary distribution over the latent variables with the same support as p(V|X, ✓). Since q(V) integrates to 1, it is trivially true that

log p(X|✓) = Z

q(V) log p(X|✓) dV (1.46) By the product rule, p(X|✓) = p(X, V|✓)/p(V|X, ✓). We use this below, introduce

q(V)/q(V) and split the equation into two terms. log p(X|✓) = Z q(V) logp(X, V|✓) p(V|X, ✓)dV (1.47) = Z q(V) logq(V)p(X, V|✓) q(V)p(V|X, ✓)dV (1.48) = Z q(V) logp(X, V|✓) q(V) dV Z q(V) logp(V|X, ✓) q(V) dV (1.49) =L(q, ✓) + KL(q||p) (1.50) KL(q||p) is the Kullback-Leibler divergence of the distribution q from the distribu- tion p(V|X, ✓). This is a non-negative quantity, with KL(q||p) = 0 holding if and only if q(V) = p(V|X, ✓) everywhere. Since KL(q||p) is non-negative, L(q, ✓) is a lower bound on log p(X|✓). We can thus maximise log p(X|✓) by maximising L(q, ✓), as long as KL(q||p) = 0 at this maximum, which we will see to be true. Iterating between maximisation with respect to q and ✓ turns out to correspond to the E-step and M-step respectively.

If we have some current value of the parameters ✓old, we can maximiseL(q, ✓)

with respect to q by considering that this maximum must occur when KL(q||p) = 0, meaning q = p. This gives us the solution that q(V) = p(V|X, ✓old). By substituting

this into the lower bound we arrive at L(p, ✓) = Z p(V|X, ✓old) log p(X, V|✓) p(V|X, ✓old) dV (1.51) = Z

p(V|X, ✓old) log p(X, V|✓) dV + const (1.52)

=E [log p(X, V|✓)]p(V|X,✓

old)+ const (1.53)

=: Q(✓, ✓old) + const (1.54)

(1.55) This is the E-step of EM. The expectation can be computed in closed form in a number of models, and frequently involves computing a set of sufficient statistics. Note thatL(p, ✓old) is equal to log p(X|✓old) since the KL term is zero.

We have maximised L(q, ✓) with respect to q, and we can now maximise with respect to ✓. For any ✓ such that Q(✓, ✓old) > Q(✓old, ✓old), it will hold that

p(X|✓) > p(X|✓old). However, this will also cause the (typically unknown) KL term

MaximisingL(q, ✓) with respect to ✓ is the M-step in EM. It is sufficient to maximise Q(✓, ✓old), which is the expected complete-data log likelihood. In many

models, including PPCA and FA, p(X|V, ✓) and p(V) both belong to the exponential family, producing a closed-form maximisation. Whilst EM specifies full maximisa- tion in the M-step, any increase in the lower bound during the M-step will also increase the likelihood and thus converge to a maxima, and is known as Generalised EM.

Example: Expectation Maximisation for PPCA

EM is useful if maximising Q(✓, ✓old) is easier than maximising p(X|✓). This is often

the case for the exponential family, where the log cancels the exponentiation. EM for PPCA illustrates this; the E-step is computed as:

E [log p(X, V|✓)]p(V|X,✓old)/ 1 2 n X i=1 ⇢ d log 2+ 2||xi||2 2 2E[vi]>W>xi + 2TrhE[viv>i ]W>W i + TrhE[viv>i ] i (1.56)

The full expectation has reduced to computing just the sufficient statistics of the Gaussian. This requires p(vi|xi, ✓) (Equation 1.20) and uses the identityE[viv>i ] =

cov[vi] +E[vi]E[vi]>, giving:

E[vi] = Mold1W>oldxi (1.57)

E[vivi>] = old2 (Wold> Wold+ 2oldI ) 1+E[vi]E[vi]> (1.58)

To find ✓new, the M-step may then be performed by taking derivatives of Equa-

tion 1.56 with respect to ✓, setting to zero and solving analytically. This gives the update step: Wnew = " n X i=1 xiE[vi]> # " n X i=1 Ehvivi> i# 1 (1.59) 2 new = 1 nd n X i=1 ⇢ xix>i 2E[vi]>W>newxi + TrhEhviv>i i W>newWnew i (1.60)

The fact that Wnew does not depend on 2new means that calculating Wnew and

then new2 gives us the joint maxima of E⇥log p(X, V|W, 2).

Gradient Ascent

One of the simplest techniques to find the maximum likelihood parameters is gradi- ent ascent [see e.g., Goodfellow et al., 2016, Section 4.3]. The idea is that one can think of the log likelihood as a high dimensional surface on which we try and find the highest point by repeatedly taking small steps in the direction pointing most steeply uphill. This requires some initial value of the parameters ✓(0) and the ability

to compute the gradient @ log p(X@✓ |✓). Taking a step in the uphill direction corresponds to computing the gradient ascent step:

✓(t+1)= ✓(t)+ ⌘@ log p(X|✓)

@✓ ✓=✓(t) (1.61)

This is iterated until some convergence criteria is met, such as the gradient becoming sufficiently small. The learning rate ⌘ is a hyper-parameter and controls the size of the step taken. If ⌘ is too large then the gradient steps may decrease the log likelihood, resulting in poor performance. Using a small value of ⌘ will reduce the risk of this, at the cost of slower convergence.

Gradient ascent can fail in very flat regions of the likelihood landscape. If ✓(0) is chosen poorly, such as a value with extremely low likelihood, the gradient may not point towards the maxima, or may become numerically indistinguishable from zero. Saddle points may also be problematic; since the gradient nearby will be very close to zero, very small steps may be taken leading to slow or premature convergence.

Basic gradient ascent can be modified in a number of ways. Instead of fixing ⌘, each gradient step may be evaluated with multiple values of ⌘ and the step is made with the value causing the biggest increase in the log likelihood. This is known as line search.

Gradient ascent may perform poorly if the log likelihood changes rapidly in some directions and slowly in others, producing a surface looking like a long ridge. Ideally each gradient step would move more in the direction of low variance. If we can compute the Hessian, once can instead use the step:

✓(t+1)= ✓(t)+ H(t)1

@ log p(X|✓)

which uses the Hessian

H(t) =

@2log p(X|✓)

@✓@✓> ✓=✓(t) (1.63)

This is Newton’s method [see e.g., Goodfellow et al., 2016, Section 4.3.1], which con- sists of fitting a quadratic to the function being minimised, and jumping straight to the maxima. This can produce faster convergence as long as the log likelihood is locally concave, however the Hessian may be very large if there are many paramet- ers.