Inference for the Poisson Likelihood Model

1.2 Models and Methods

2.1.2 Inference for the Poisson Likelihood Model

Before discussing inference methods for the Poisson likelihood model, we briefly revisit the relevant priors for the latent variable $f$. We consider two classes of prior distributions that are commonly encountered in practice: Gaussian process (GP) priors [Rasmussen and Williams, 2006] and sparse linear models (SLM) [Seeger, 2008, Seeger and Nickisch, 2011a].

[Figure 2.4: line plot titled "Coal Mining Disasters: Posterior Mean Intensity"; x-axis: Year (1860 to 1960); y-axis: Posterior Mean Intensity (0 to 4); curves: Linear, Exponential, Logistic.]

Figure 2.4 – Coal mining disaster data: posterior means of latent functions $E[g(f) \mid y]$. We recognize the stronger peaking behavior of the exponential non-linearity in high-density regions, while the other non-linearities are more sensitive in low-density regions.

Footnote 1: We took this example from [Vanhatalo et al., 2013], using the same setup, i.e. an isotropic squared exponential kernel function and a constant mean. Inference is done using Expectation Propagation. Hyperparameters for the mean and covariance functions are learned. More information on the data can be found in Section 2.1.4.2.

Gaussian Processes. GP priors feature prominently in applications of spatio-temporal statistics to social or ecological questions [Vanhatalo et al., 2010, Diggle et al., 2013] or to the analysis of neural spike counts [Pillow, 2007, Park and Pillow, 2013, Park et al., 2014], as the multivariate normal is well suited to represent dependencies and dynamics in the input domain. We use the following notation: let $f : \mathcal{X} \to \mathbb{R}$ be a latent function distributed as $f \sim \mathcal{GP}(\mu(x), k(x, x'))$, where $\mu(x)$ and $k(x, x')$ denote the mean and covariance functions. For $M$ inputs $\{x_j \in \mathcal{X}\}_{j \in [M]}$ corresponding to the observations $y$, the prior over $f$ can be written as a multivariate normal distribution:

$$P(f) = \mathcal{N}(f \mid \mu, K) \tag{2.4}$$

with mean vector $\mu_j = \mu(x_j)$ and covariance (or kernel) matrix $K_{i,j} = k(x_i, x_j)$ for all $i, j \in [M]$.
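As a minimal sketch of how the mean vector and kernel matrix in (2.4) are assembled, the following Python snippet evaluates an isotropic squared-exponential kernel on a few scalar inputs. The function names, the length scale, the constant mean value and the example inputs are illustrative assumptions and are not taken from the text.

```python
import math

def sq_exp_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Isotropic squared-exponential covariance k(x, x') for scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def gp_prior(inputs, mean_fn, kernel_fn):
    """Return the mean vector mu_j = mu(x_j) and the matrix K_ij = k(x_i, x_j)."""
    mu = [mean_fn(x) for x in inputs]
    K = [[kernel_fn(xi, xj) for xj in inputs] for xi in inputs]
    return mu, K

# Hypothetical inputs (e.g. years) and hyperparameters, chosen for illustration.
inputs = [1860.0, 1880.0, 1900.0]
mu, K = gp_prior(
    inputs,
    mean_fn=lambda x: 0.5,  # constant mean function
    kernel_fn=lambda a, b: sq_exp_kernel(a, b, lengthscale=20.0),
)
```

Each entry of K decays with the squared distance between inputs, so nearby inputs receive strongly correlated latent values under the prior.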

| g(f) | Prior | Laplace | VB | EP |
|---|---|---|---|---|
| exp(f) | GP | Tract. | Tract. | Approx. |
| | SLM | N/A | | |
| log(1 + exp(f)) | GP | Tract. | Approx. | Approx. |
| | SLM | N/A | | |
| max(0, f) | GP | N/A | N/A | Tract. (New) |
| | SLM | N/A | | |

Table 2.2 – Variational inference methods for different non-linearities. We use the following abbreviations. Tract.: computations are analytically tractable (i.e. gradients/updates are available in closed form). Approx.: computations require additional approximations, such as bounding techniques or numerical integration.

Sparse Linear Models. In SLMs, $f$ itself is defined as a linear function $f = Xu$ of a latent vector $u$, where $u$ exhibits non-Gaussian, heavy-tailed statistics in an appropriately chosen linear transform domain $s = Bu$. SLMs are often encountered in the context of inverse problems, e.g. in image processing, where exploiting the prior belief that image gradients or wavelet coefficients of natural images are sparse [Simoncelli, 1999] has become a popular strategy to regularize ill-posed reconstruction problems. For example, for the deconvolution problem we define the linear operator $X$ such that multiplying it with a vectorized image $u$ amounts to convolving the image with a blur kernel $k$, i.e. $f = k * u = Xu$. Assuming that $u$ is well described by piecewise constant functions, one could be interested in penalizing the total variation of $u$, such that $B = [\nabla_x; \nabla_y]$ consists of the horizontal and vertical gradient operators. We model sparsity for $s$ independently for each transform coefficient:

$$P(u) \propto \prod_{j} t_s(s_j) \tag{2.5}$$

For simplicity we consider the Laplace potential $t_s(s_j) = e^{-\tau |s_j|}$ [Gerwinn et al., 2008, Seeger, 2008, Seeger and Nickisch, 2011a].
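To illustrate how (2.5) with the Laplace potential favors piecewise constant signals, the sketch below evaluates the unnormalized log prior $-\tau \sum_j |s_j|$ for a one-dimensional signal. It assumes a 1D simplification in which $B$ is a forward-difference operator (the text uses horizontal and vertical image gradients); the function names and the example signals are hypothetical.

```python
def finite_differences(u):
    """Apply a 1D forward-difference operator B: s_j = u_{j+1} - u_j."""
    return [u[j + 1] - u[j] for j in range(len(u) - 1)]

def laplace_log_prior(u, tau):
    """Unnormalized log prior: sum_j log t_s(s_j) = -tau * sum_j |s_j|."""
    s = finite_differences(u)
    return -tau * sum(abs(sj) for sj in s)

# Illustrative signals with the same value range.
piecewise_constant = [1, 1, 1, 3, 3]   # a single jump
oscillating = [1, 3, 1, 3, 1]          # a jump at every position

log_p_const = laplace_log_prior(piecewise_constant, tau=0.5)
log_p_osc = laplace_log_prior(oscillating, tau=0.5)
```

The signal with a single jump has the smaller total variation and therefore the larger log prior, which is the regularizing effect described above.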

Before we begin the discussion of methods for approximate inference, we unify our notation. We would like to approximate an intractable distribution of the following form:

$$P(f) = Z^{-1} \prod_{j=1}^{M} t_j(f_j) \, t_0(f) \tag{2.6}$$

For GPs, the optional coupled potential $t_0(f)$ is the prior defined in (2.4), and we have a product of the $M = N$ likelihood potentials $t_j(f_j) = P(y_j \mid \lambda_j)$. For SLMs, $t_0(f) = 1$ and we redefine $f = [X; B]u$. The potentials are $t_j(f_j) = P(y_j \mid f_j) = t_{y_j}(f_j)$ for $j \le M$ and $t_j(f_j) = t_s(f_j)$ for $j > M$. We denote the approximation to $P(f)$ by $Q(f)$ and choose it to be a multivariate normal distribution $Q(f) = \mathcal{N}(f \mid \xi, \Xi)$. This is justified by the fact that the likelihoods for all non-linearities in Table 2.1, as well as the priors considered here, are log-concave in $f$. Therefore, the posterior is unimodal [Paninski, 2004].

In Table 2.2, we list common approximate inference techniques to find the parameters of the approximation. While Laplace's method is often used in the GP setting [Diggle et al., 2013, Park and Pillow, 2013, Park et al., 2014], it cannot be applied to SLMs, because by design we expect many transform coefficients to be zero, where the Laplace potential is not differentiable. Another popular variational Bayesian (VB) technique is referred to as the variational Gaussian approximation [Opper and Archambeau, 2009] or the KL method [Nickisch and Rasmussen, 2008, Challis and Barber, 2011]. It is analytically tractable for the exponential function [Ko and Khan, 2014], whereas the softplus function requires approximations, e.g. quadrature or bounding techniques, as shown in [Seeger and Bouchard, 2012]. For the rectified linear (RL) function $\max(0, f)$, however, this method is not even defined. This can be seen by examining the VB objective, which is the following Kullback-Leibler divergence:

$$\min_{\xi, \Xi} \; D_{\mathrm{KL}}[Q(f) \,\|\, P(f)] \tag{2.7}$$

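Expanding it, here in the GP case, gives, up to the additive constant $\log Z$, which does not depend on $\xi$ and $\Xi$:

$$D_{\mathrm{KL}}[Q(f) \,\|\, P(f)] = E_Q\left[\log \frac{Q(f)}{t_0(f)}\right] - \sum_{j=1}^{M} E_Q\left[\log t_{y_j}(f_j)\right] \tag{2.8}$$

This reveals that the logarithm of Eq. (2.1) needs to be integrated over the real line, which is infinite in the case of the RL function: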

$$E_Q[\log P(y \mid f)] = y \int Q(f) \log(\max(0, f)) \, df = -\infty \tag{2.9}$$

A simple fix would be to slightly modify the RL function to be non-zero, i.e. to use $\max(\epsilon, f)$ for some $\epsilon > 0$. As we will see next, Expectation Propagation approximate inference does not require such modifications and deals much more gracefully with non-differentiability.
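The divergence in (2.9) and the effect of the $\max(\epsilon, f)$ fix can be checked numerically. The sketch below computes $E_Q[\log \max(\epsilon, f)]$ for a univariate Gaussian $Q$ and shrinking $\epsilon$. The function names, the midpoint rule, the integration range and the choice $Q = \mathcal{N}(1, 1)$ are illustrative assumptions.

```python
import math

def normal_pdf(f, mean, std):
    z = (f - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(f, mean, std):
    return 0.5 * (1.0 + math.erf((f - mean) / (std * math.sqrt(2.0))))

def expected_log_clipped_rl(mean, std, eps, n=20000):
    """Approximate E_Q[log max(eps, f)] for Q = N(mean, std^2) and eps > 0."""
    # All mass below eps is mapped to log(eps).
    below = math.log(eps) * normal_cdf(eps, mean, std)
    # Mass above eps: midpoint rule on [eps, upper]; the tail beyond is negligible.
    upper = max(eps, mean) + 10.0 * std
    h = (upper - eps) / n
    above = 0.0
    for k in range(n):
        f = eps + (k + 0.5) * h
        above += normal_pdf(f, mean, std) * math.log(f)
    return below + above * h

# The expectation keeps decreasing as eps shrinks towards zero.
values = [expected_log_clipped_rl(1.0, 1.0, eps) for eps in (1e-2, 1e-4, 1e-8)]
```

The dominant term is $\log(\epsilon) \, Q(f \le \epsilon)$, so the expectation decreases without bound as $\epsilon \to 0$ whenever $Q$ places mass on the negative half-line, which any Gaussian does.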

2.1.2.1 Expectation Propagation

We briefly recapitulate our exposition of EP from Section 1.2.3.3. EP [Minka, 2001a, Opper and Winther, 2005] approximates $P(f)$ in Eq. (2.6) by replacing each non-Gaussian potential $t_j(f_j)$ with an unnormalized Gaussian $\tilde{t}_j(f_j) = \tilde{Z}_j \, \mathcal{N}(f_j \mid \tilde{\mu}_j, \tilde{\sigma}_j^2)$ to form a Gaussian approximation $Q(f)$ following the same factorization:

$$Q(f) = Z_{\mathrm{EP}}^{-1} \prod_{j=1}^{M} \tilde{t}_j(f_j) \, t_0(f) \tag{2.10}$$

The EP-approximation to the marginal likelihood is given by:

$$Z_{\mathrm{EP}} = \prod_{j=1}^{M} \tilde{Z}_j \int \prod_{j=1}^{M} \mathcal{N}(f_j \mid \tilde{\mu}_j, \tilde{\sigma}_j^2) \, t_0(f) \, df \tag{2.11}$$

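EP employs the following strategy to determine the variational parameters $\tilde{\mu}_j, \tilde{\sigma}_j^2$. The cavity marginal is obtained by removing the approximate potential $\tilde{t}_i(f_i)$ from $Q(f)$ and marginalizing over $f_{\setminus i} := \{f_j : j \neq i\}$, denoted as $Q_{-i}(f_i)$:

$$Q_{-i}(f_i) = \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2) \propto \int \prod_{j \neq i} \tilde{t}_j(f_j) \, t_0(f) \, df_{\setminus i} \tag{2.12}$$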

The so-called tilted distribution replaces the approximate potential $\tilde{t}_i(f_i)$ in $Q(f)$ by the true non-Gaussian potential $t_i(f_i)$ by multiplying it with the cavity marginal:

$$\hat{P}(f_i) = \hat{Z}_i^{-1} \, t_i(f_i) \, Q_{-i}(f_i), \quad \text{where} \quad \hat{Z}_i = \int t_i(f_i) \, Q_{-i}(f_i) \, df_i \tag{2.13}$$

The criterion to minimize in order to update the parameters of $\tilde{t}_i$ is the KL divergence between the tilted and the variational distribution, $D_{\mathrm{KL}}[\hat{P}(f_i) \,\|\, Q(f_i)]$, which can be solved using moment matching:

$$E_Q[f_i] = E_{\hat{P}}[f_i], \qquad \mathrm{Var}_Q[f_i] = \mathrm{Var}_{\hat{P}}[f_i] \tag{2.14}$$

The constant $\tilde{Z}_i$ is chosen such that the normalization constants of $\hat{P}(f_i)$ and $Q(f_i)$ match, i.e. we solve:

$$\tilde{Z}_i \int \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2) \, Q_{-i}(f_i) \, df_i = \hat{Z}_i \tag{2.15}$$

The EP update therefore consists of determining the first two moments and the normalization constant of the tilted distribution.
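A single site update following (2.13) and (2.14) can be sketched as below. This is only an illustration under stated assumptions: the tilted moments are computed by a simple grid quadrature in log space rather than in closed form, the example potential is a Poisson likelihood with the exponential non-linearity (assuming Eq. (2.1) is the standard Poisson probability mass function with rate $\lambda = g(f)$), and all function names and parameter values are hypothetical.

```python
import math

def tilted_moments(log_potential, cav_mean, cav_var, n=2001, width=10.0):
    """Return (log Z_hat, mean, variance) of t(f) * N(f | cav_mean, cav_var)
    using an equally spaced grid over cav_mean +/- width * std."""
    std = math.sqrt(cav_var)
    h = 2.0 * width * std / (n - 1)
    grid = [cav_mean - width * std + k * h for k in range(n)]
    log_w = [
        log_potential(f)
        - 0.5 * (f - cav_mean) ** 2 / cav_var
        - 0.5 * math.log(2.0 * math.pi * cav_var)
        for f in grid
    ]
    shift = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    mean = sum(wi * f for wi, f in zip(w, grid)) / total
    var = sum(wi * (f - mean) ** 2 for wi, f in zip(w, grid)) / total
    log_z_hat = shift + math.log(total * h)
    return log_z_hat, mean, var

def site_update(cav_mean, cav_var, tilted_mean, tilted_var):
    """Site mean and variance such that site * cavity matches the tilted moments."""
    site_prec = 1.0 / tilted_var - 1.0 / cav_var
    site_mean = (tilted_mean / tilted_var - cav_mean / cav_var) / site_prec
    return site_mean, 1.0 / site_prec

# Example: Poisson count y with rate exp(f) and cavity N(0, 1) (illustrative values).
y = 3
poisson_exp = lambda f: y * f - math.exp(f) - math.lgamma(y + 1)
log_z_hat, m_hat, v_hat = tilted_moments(poisson_exp, 0.0, 1.0)
site_mean, site_var = site_update(0.0, 1.0, m_hat, v_hat)
```

The site parameters follow from subtracting the cavity's natural parameters from those of the moment-matched Gaussian. Working with log weights and subtracting their maximum avoids overflow when the potential varies over many orders of magnitude.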

Once the parameters of a single $\tilde{t}_i$ are changed, we can update the representation of the full approximation $Q(f)$, which consists of recomputing $\xi$ and $\mathrm{Var}_Q[f] = \mathrm{diag}(\Xi)$. This process is repeated until convergence, i.e. until a fixed point in terms of tilted and approximate moments is reached.

To the best of our knowledge, tilted moments for the exponential and softplus functions are not available in closed form. Implementations based on quadrature are commonly found in the context of Gaussian processes [Rasmussen and Nickisch, 2010, Vanhatalo et al., 2013].

Computing tilted marginals is not a trivial task. For example, plugging (2.1) into (2.13) shows that this quantity depends exponentially on both $y$ and $f$. In Section 2.1.4 we illustrate that evaluating this expression directly using quadrature can lead to numerical problems. So far, we have seen that popular methods, such as Laplace and VB approximations, are not particularly suitable for the RL function, in contrast to EP, which in turn depends on the tractability of tilted moments. Next, we show that for the RL function these computations are indeed analytically tractable.

In document Applications of Approximate Learning and Inference for Probabilistic Models (Page 39-44)