1.2 Models and Methods
2.1.2 Inference for the Poisson Likelihood Model
Before discussing inference methods for the Poisson likelihood model, we briefly revisit the relevant priors for the latent variable f. We consider two classes of prior distributions that are commonly encountered in practice: Gaussian process (GP) priors [Rasmussen and Williams, 2006] and sparse linear models (SLM) [Seeger, 2008, Seeger and Nickisch, 2011a].
1 We took this example from [Vanhatalo et al., 2013], using the same setup, i.e. an isotropic squared exponential kernel function and a constant mean. Inference is done using Expectation Propagation. Hyperparameters for the mean and covariance functions are learned. More information on the data can be found in Section 2.1.4.2.
[Figure: "Coal Mining Disasters: Posterior Mean Intensity". x-axis: Year (1860 to 1960); y-axis: Posterior Mean Intensity (0 to 4); curves: Linear, Exponential, Logistic.]
Figure 2.4 – Coal mining disaster data: posterior means of latent functions $E[g(f) \mid y]$. We recognize the stronger peaking behavior of the exponential non-linearity in high-density regions, while the other non-linearities are more sensitive in low-density regions.
Gaussian Processes. GP priors prominently feature in applications of spatio-temporal statistics to social or ecological questions [Vanhatalo et al., 2010, Diggle et al., 2013] or to the analysis of neural spike counts [Pillow, 2007, Park and Pillow, 2013, Park et al., 2014], as the multivariate normal is well suited to represent dependencies and dynamics in the input domain. We use the following notation: Let $f : \mathcal{X} \to \mathbb{R}$ be a latent function distributed as $f \sim \mathcal{GP}(\mu(x), k(x, x'))$, where $\mu(x)$ and $k(x, x')$ denote the mean and covariance functions. For $M$ inputs $\{x_j \in \mathcal{X}\}_{j \in [M]}$ corresponding to the observations $y$, the prior over $f$ can be written as a multivariate normal distribution:
$$P(f) = \mathcal{N}(\mu, K) \tag{2.4}$$
with mean vector $\mu_j = \mu(x_j)$ and covariance (or kernel) matrix $K_{i,j} = k(x_i, x_j)$ for all $i, j \in [M]$.
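As a minimal sketch of Eq. (2.4), the following NumPy snippet builds the mean vector and kernel matrix for one-dimensional inputs and draws one sample from the prior. The isotropic squared exponential kernel with constant mean mirrors the setup mentioned for the coal mining example. The function names, the hyperparameter values and the diagonal jitter are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sq_exp_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Isotropic squared exponential kernel k(x, x') for 1-D inputs."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_prior(x, mean_const=0.0, lengthscale=1.0, variance=1.0, jitter=1e-6):
    """Return the mean vector mu and kernel matrix K of Eq. (2.4)."""
    mu = np.full(x.shape[0], mean_const)        # constant mean function
    K = sq_exp_kernel(x, x, lengthscale, variance)
    K = K + jitter * np.eye(x.shape[0])         # assumption: jitter for numerical stability only
    return mu, K

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)                  # M = 50 inputs (hypothetical grid)
mu, K = gp_prior(x, mean_const=0.5, lengthscale=2.0)

# One draw f ~ N(mu, K) using the Cholesky factor of K.
L = np.linalg.cholesky(K)
f = mu + L @ rng.standard_normal(x.shape[0])
```

The Cholesky factor is used instead of an eigendecomposition because it is the cheaper standard choice for sampling from a multivariate normal with a positive definite covariance.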
g(f)               Prior   Laplace   VB        EP
exp(f)             GP      Tract.    Tract.    Approx.
                   SLM     N/A
log(1 + exp(f))    GP      Tract.    Approx.   Approx.
                   SLM     N/A
max(0, f)          GP      N/A       N/A       Tract. (New)
                   SLM     N/A
Table 2.2 – Variational inference methods for different non-linearities. We use the following abbreviations: Tract.: computations are analytically tractable (i.e. gradients/updates are available in closed form). Approx.: computations require additional approximations, such as bounding techniques or numerical integration.
Sparse Linear Models. In SLMs, $f$ itself is defined as a linear function $f = Xu$ of a latent vector $u$, where $u$ exhibits non-Gaussian, heavy-tailed statistics in an appropriately chosen linear transform domain $s = Bu$. SLMs are often encountered in the context of inverse problems, e.g. in image processing, where exploiting the prior belief that image gradients or wavelet coefficients of natural images are sparse [Simoncelli, 1999] has become a popular strategy to regularize ill-posed reconstruction problems. For example, for the deconvolution problem we define the linear operator $X$ such that multiplying it with a vectorized image $u$ amounts to convolving the image with a blur kernel $k$, i.e. $f = k * u = Xu$. Assuming that $u$ is well described by piecewise constant functions, one could be interested in penalizing the total variation of $u$, such that $B = [\nabla_x; \nabla_y]$ consists of the horizontal and vertical gradient operators. We model sparsity for $s$ independently for each transform coefficient:
$$P(u) \propto \prod_{j}^{M} t_s(s_j) \tag{2.5}$$
For simplicity we consider the Laplace potential $t_s(s_j) = e^{-\tau |s_j|}$ [Gerwinn et al., 2008, Seeger, 2008, Seeger and Nickisch, 2011a].
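To make the total variation example concrete, the sketch below assembles $B = [\nabla_x; \nabla_y]$ as dense forward-difference operators for a small vectorized image and evaluates the unnormalized log-prior $\sum_j \log t_s(s_j) = -\tau \|Bu\|_1$ with the Laplace potential. The forward-difference discretization, the row-major vectorization and all function names are assumptions made for illustration. A practical implementation would use sparse matrices.

```python
import numpy as np

def gradient_operators(h, w):
    """Dense forward-difference operator B = [grad_x; grad_y] for a
    row-major vectorized h x w image (assumed discretization)."""
    n = h * w
    idx = np.arange(n).reshape(h, w)

    def diff(src, dst):
        # Each row encodes u[dst] - u[src] for one pair of neighbouring pixels.
        D = np.zeros((src.size, n))
        r = np.arange(src.size)
        D[r, src.ravel()] = -1.0
        D[r, dst.ravel()] = 1.0
        return D

    Dx = diff(idx[:, :-1], idx[:, 1:])   # horizontal differences
    Dy = diff(idx[:-1, :], idx[1:, :])   # vertical differences
    return np.vstack([Dx, Dy])

def log_laplace_prior(u, B, tau):
    """Unnormalized log P(u) = sum_j log t_s(s_j) = -tau * ||B u||_1."""
    s = B @ u
    return -tau * np.abs(s).sum()

h, w = 4, 5
B = gradient_operators(h, w)

flat = np.ones(h * w)                 # constant image: all gradients vanish
step = np.zeros((h, w))
step[:, 3:] = 1.0                     # piecewise constant image with one vertical edge

lp_flat = log_laplace_prior(flat, B, tau=2.0)
lp_step = log_laplace_prior(step.ravel(), B, tau=2.0)
```

The piecewise constant image is penalized only along its single edge, which is the behaviour the total variation prior is meant to encourage.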
Before we begin the discussion of methods for approximate inference, we unify our notation. We would like to approximate an intractable distribution of the following form:
$$P(f) = Z^{-1} \prod_{j=1}^{M} t_j(f_j)\, t_0(f) \tag{2.6}$$
For GPs, the optional coupled potential $t_0(f)$ is the prior defined in (2.4), and we have a product of the $M = N$ likelihood potentials $t_j(f_j) = P(y_j \mid \lambda_j)$. For SLMs, $t_0(f) = 1$, and we redefine $f = [X; B]u$. The potentials are $t_j(f_j) = P(y_j \mid f_j) = t_{y_j}(f_j)$ for $j \le M$ and $t_j(f_j) = t_s(f_j)$ for $j > M$. We denote the approximation to $P(f)$ by $Q(f)$ and choose it to be a multivariate normal distribution $Q(f) = \mathcal{N}(f \mid \xi, \Xi)$. This is justified by the fact that the likelihoods for all non-linearities in Table 2.1, as well as the priors considered here, are log-concave in $f$. Therefore, the posterior is unimodal [Paninski, 2004].
In Table 2.2, we list common approximate inference techniques to find the parameters of the approximation. While Laplace's method is often used in the GP setting [Diggle et al., 2013, Park and Pillow, 2013, Park et al., 2014], it cannot be applied to SLMs, because by design we expect many transform coefficients to be zero, where the Laplace potential is not differentiable. Another popular variational Bayesian (VB) technique is referred to as the Variational Gaussian approximation [Opper and Archambeau, 2009] or the KL method [Nickisch and Rasmussen, 2008, Challis and Barber, 2011]. It is analytically tractable for the exponential function [Ko and Khan, 2014], whereas the softplus function requires approximations, e.g. quadrature or bounding techniques, as shown in [Seeger and Bouchard, 2012]. For the RL function, however, this method is not even defined. This can be seen by examining the VB objective, which is the following Kullback-Leibler divergence:
$$\min_{\xi, \Xi} \; D_{\mathrm{KL}}\left[ Q(f) \,\|\, P(f) \right] \tag{2.7}$$
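Expanding it, here in the GP case, gives, up to the additive constant $\log Z$ that does not depend on $\xi$ or $\Xi$:
$$D_{\mathrm{KL}}\left[ Q(f) \,\|\, P(f) \right] = \mathbb{E}_Q\left[ \log \frac{Q(f)}{t_0(f)} \right] - \sum_{j=1}^{M} \mathbb{E}_Q\left[ \log t_{y_j}(f_j) \right] \tag{2.8}$$
This reveals that the logarithm of Eq. (2.1) needs to be integrated over the real line. For the RL function this integral diverges, because the Gaussian $Q(f)$ places positive mass on $f \le 0$, where $\log(\max(0, f)) = -\infty$:
$$\mathbb{E}_Q\left[ \log P(y \mid f) \right] = y \cdot \int Q(f) \log(\max(0, f))\, df = -\infty \tag{2.9}$$
A simple fix would be to slightly modify the RL function so that it is strictly positive, i.e. $\max(\epsilon, f)$ for $\epsilon > 0$. As we will see next, Expectation Propagation approximate inference does not require such modifications and deals much more gracefully with non-differentiability.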
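The behaviour of the expected log-likelihood term in the VB objective can be checked numerically. The sketch below assumes the Poisson likelihood $P(y \mid f) = g(f)^y e^{-g(f)} / y!$ and approximates $\mathbb{E}_Q[\log P(y \mid f)]$ for a single Gaussian marginal $Q(f) = \mathcal{N}(m, v)$ by Gauss-Hermite quadrature. For the exponential function it is compared with the closed form $y m - e^{m + v/2} - \log y!$. For the RL function the estimate is $-\infty$, and it becomes finite again with the $\max(\epsilon, f)$ modification. The function name, the number of quadrature nodes and the test values are illustrative assumptions.

```python
import numpy as np
from math import lgamma

def expected_loglik(y, m, v, g, n_quad=20):
    """Gauss-Hermite estimate of E_Q[log Poisson(y | g(f))] with Q = N(m, v)."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    f = m + np.sqrt(2.0 * v) * t                 # change of variables for N(m, v)
    lam = g(f)
    with np.errstate(divide="ignore"):           # log(0) = -inf is intended here
        loglik = y * np.log(lam) - lam - lgamma(y + 1.0)
    return np.sum(w * loglik) / np.sqrt(np.pi)

y, m, v = 3, 1.0, 0.5                            # hypothetical count and marginal

# Exponential function: quadrature versus the closed-form expectation.
e_exp = expected_loglik(y, m, v, np.exp)
e_closed = y * m - np.exp(m + v / 2.0) - lgamma(y + 1.0)

# RL function: some quadrature nodes have f <= 0, so the estimate diverges.
e_rl = expected_loglik(y, m, v, lambda f: np.maximum(0.0, f))

# Modified RL function max(eps, f) with eps > 0 yields a finite value.
e_rl_eps = expected_loglik(y, m, v, lambda f: np.maximum(1e-3, f))
```

The example uses $y > 0$. For $y = 0$ the divergent term is multiplied by zero, so the check would need separate handling.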
2.1.2.1 Expectation Propagation
We briefly recapitulate our exposition of EP from Section 1.2.3.3. EP [Minka, 2001a, Opper and Winther, 2005] approximates $P(f)$ in Eq. (2.6) by approximating each non-Gaussian potential $t_j(f_j)$ using unnormalized Gaussians $\tilde{t}_j(f_j) = \tilde{Z}_j\, \mathcal{N}(f_j \mid \tilde{\mu}_j, \tilde{\sigma}_j^2)$ to form a Gaussian approximation $Q(f)$ following the same factorization:
$$Q(f) = Z_{\mathrm{EP}}^{-1} \prod_{j=1}^{M} \tilde{t}_j(f_j)\, t_0(f) \tag{2.10}$$
The EP-approximation to the marginal likelihood is given by:
$$Z_{\mathrm{EP}} = \prod_{j=1}^{M} \tilde{Z}_j \int \prod_{j=1}^{M} \mathcal{N}(f_j \mid \tilde{\mu}_j, \tilde{\sigma}_j^2)\, t_0(f)\, df \tag{2.11}$$
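EP employs the following strategy to determine the variational parameters $\tilde{\mu}_j, \tilde{\sigma}_j^2$. The cavity marginal is obtained by removing the approximate potential $\tilde{t}_i(f_i)$ from $Q(f)$ and marginalizing over $f_{\setminus i} := \{f_j : j \neq i\}$, denoted as $Q_{-i}(f_i)$:
$$Q_{-i}(f_i) = \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2) \propto \int \prod_{j \neq i} \tilde{t}_j(f_j)\, t_0(f)\, df_{\setminus i} \tag{2.12}$$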
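Because each approximate potential depends on a single $f_i$, the marginal of $Q$ factorizes as $Q(f_i) \propto \tilde{t}_i(f_i)\, Q_{-i}(f_i)$, so the cavity in Eq. (2.12) can be obtained by dividing one Gaussian by another, i.e. by subtracting natural parameters. The sketch below shows this step. The function name and the example values are assumptions for illustration.

```python
def cavity(mu_i, var_i, site_mu, site_var):
    """Cavity marginal N(mu_cav, var_cav), proportional to
    N(f_i | mu_i, var_i) / N(f_i | site_mu, site_var)."""
    prec_cav = 1.0 / var_i - 1.0 / site_var      # precisions subtract
    if prec_cav <= 0.0:
        raise ValueError("cavity precision must be positive")
    var_cav = 1.0 / prec_cav
    mu_cav = var_cav * (mu_i / var_i - site_mu / site_var)
    return mu_cav, var_cav

# Hypothetical check: a cavity N(0, 2) multiplied by a site N(1, 1) has
# marginal N(2/3, 2/3); removing the site must recover the cavity.
mu_cav, var_cav = cavity(2.0 / 3.0, 2.0 / 3.0, 1.0, 1.0)
```

Working with precisions rather than variances keeps the update a pair of subtractions and makes an invalid (non-positive) cavity precision easy to detect.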
The so-called tilted distribution replaces the approximate potential $\tilde{t}_i(f_i)$ in $Q(f)$ by the true non-Gaussian potential $t_i(f_i)$ by multiplying it with the cavity marginal:
$$\hat{P}(f_i) = \hat{Z}_i^{-1}\, t_i(f_i)\, Q_{-i}(f_i), \quad \text{where} \quad \hat{Z}_i = \int t_i(f_i)\, Q_{-i}(f_i)\, df_i \tag{2.13}$$
The criterion to minimize in order to update the parameters of $\tilde{t}_i$ is the KL divergence between the tilted and the variational distribution, $D_{\mathrm{KL}}[\hat{P}(f_i) \,\|\, Q(f_i)]$, which is minimized by moment matching:
$$\mathbb{E}_Q[f_i] = \mathbb{E}_{\hat{P}}[f_i], \qquad \mathrm{Var}_Q[f_i] = \mathrm{Var}_{\hat{P}}[f_i] \tag{2.14}$$
The constant $\tilde{Z}_i$ is chosen such that the normalization constants of $\hat{P}(f_i)$ and $Q(f_i)$ match, i.e. we solve:
$$\tilde{Z}_i \int \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)\, Q_{-i}(f_i)\, df_i = \hat{Z}_i \tag{2.15}$$
The EP update therefore consists of determining the first two moments and the normalization constant of the tilted distribution.
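A single EP update can be sketched as follows, assuming a Poisson likelihood with the exponential function, $t_i(f_i) = e^{y f_i - e^{f_i}} / y!$. The tilted quantities of Eq. (2.13) are approximated by plain Gauss-Hermite quadrature, and the site parameters then follow from the moment matching conditions (2.14) and the normalization condition (2.15). Function names, the number of quadrature nodes and the test values are illustrative assumptions. Direct quadrature of this kind is only meant to show the structure of the update and is not numerically robust in general.

```python
import numpy as np
from math import lgamma

def tilted_moments(y, mu_cav, var_cav, n_quad=64):
    """Quadrature estimates of Z_hat, mean and variance of the tilted
    distribution for a Poisson likelihood with exponential function (assumed)."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    w = w / np.sqrt(np.pi)                          # weights for a N(0, 1/2) base measure
    f = mu_cav + np.sqrt(2.0 * var_cav) * t         # nodes under the cavity N(mu_cav, var_cav)
    lik = np.exp(y * f - np.exp(f) - lgamma(y + 1.0))
    Z_hat = np.sum(w * lik)
    mean = np.sum(w * lik * f) / Z_hat
    var = np.sum(w * lik * f ** 2) / Z_hat - mean ** 2
    return Z_hat, mean, var

def site_update(mu_cav, var_cav, Z_hat, mean_hat, var_hat):
    """Site parameters such that site * cavity matches the tilted moments."""
    prec_site = 1.0 / var_hat - 1.0 / var_cav       # Eq. (2.14) in natural parameters
    var_site = 1.0 / prec_site
    mu_site = var_site * (mean_hat / var_hat - mu_cav / var_cav)
    # Eq. (2.15): int N(f | mu_site, var_site) N(f | mu_cav, var_cav) df
    #           = N(mu_site | mu_cav, var_site + var_cav)
    s = var_site + var_cav
    overlap = np.exp(-0.5 * (mu_site - mu_cav) ** 2 / s) / np.sqrt(2.0 * np.pi * s)
    Z_site = Z_hat / overlap
    return mu_site, var_site, Z_site

y, mu_cav, var_cav = 3, 0.5, 1.0                    # hypothetical count and cavity
Z_hat, mean_hat, var_hat = tilted_moments(y, mu_cav, var_cav)
mu_site, var_site, Z_site = site_update(mu_cav, var_cav, Z_hat, mean_hat, var_hat)

# Sanity check: the updated marginal site * cavity reproduces the tilted moments.
post_var = 1.0 / (1.0 / var_site + 1.0 / var_cav)
post_mean = post_var * (mu_site / var_site + mu_cav / var_cav)
```

Since the likelihood is log-concave in $f$, the tilted variance is smaller than the cavity variance, so the site precision computed above is positive.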
Once the parameters of a single $\tilde{t}_i$ are changed, we can update the representation of the full approximation $Q(f)$, which consists of recomputing $\xi$ and $\mathrm{Var}_Q[f] = \mathrm{diag}(\Xi)$. This process is repeated until convergence, i.e. until a fixed point in terms of tilted and approximate moments is reached.
To the best of our knowledge, tilted moments for the exponential- and softplus functions are not available in closed form. Implementations based on quadrature are commonly found in the context of Gaussian processes [Rasmussen and Nickisch, 2010, Vanhatalo et al., 2013].
Computing tilted marginals is not a trivial task. For example, plugging (3.1) into (2.13) shows that this quantity depends exponentially on both $y$ and $f$. In Section 2.1.4 we illustrate that evaluating this expression directly using quadrature can lead to numerical problems. So far, we have seen that popular methods, such as the Laplace and VB approximations, are not particularly suitable for the RL function, in contrast to EP, which in turn depends on the tractability of tilted moments. Next, we show that for the RL function these computations are indeed analytically tractable.