Challenge II: Intractable Models - Statistical computation with kernels

We have now concluded our initial discussion of numerical integration, the main challenge that will be tackled in this thesis. A second challenge is that of statistical models for which the density is not available. It should be clear from previous sections that both Bayesian inference (see Equation 1.3) and maximum likelihood inference (see Equation 1.7) are forms of likelihood-based inference, meaning that they require us to be able to evaluate the likelihood at different data points and parameter values. However, for complex statistical models this may be impossible or computationally infeasible. We now highlight two such scenarios.

1.2.1 Intractability in Unnormalised Models

A first scenario which is common in applications of statistics is when the likelihood can only be accessed in an unnormalised form:

\[
p(x|\theta) = \frac{\tilde{p}(x|\theta)}{Z(\theta)}, \tag{1.15}
\]

where $\tilde{p}(x|\theta)$ is an unnormalised density which can be evaluated and $Z(\theta) \in \mathbb{R}_+$ is an unknown normalisation constant which depends on the parameter vector $\theta$. Usually this scenario arises due to the high computational cost of evaluating the normalisation constant, or because this constant is itself defined as some intractable integral of the form $Z(\theta) = \int_D \tilde{p}(x|\theta) \, dx$ (when $D$ is a continuous domain) or $Z(\theta) = \sum_{x \in D} \tilde{p}(x|\theta)$ (when $D$ is a discrete, but very large, domain). Examples include Gibbs distributions, which are popular in statistical physics and the study of social networks [Caimo and Mira, 2015], as well as Markov random fields, which are popular in image modelling and spatial statistics [Hyvärinen, 2006, 2007; Moores et al., 2015].
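To make the cost of the discrete sum concrete, the following is a minimal sketch for a toy Gibbs distribution: a chain of n spins with nearest-neighbour interactions. The model, the function names and the parameter values are illustrative assumptions and are not taken from the references above. The sum has 2^n terms, so it is only feasible for very small n; this particular chain happens to admit a closed form, which is used purely as a check.

```python
import itertools
import math

def unnormalised_density(x, theta):
    """p~(x | theta) = exp(theta * sum_i x_i x_{i+1}) for spins x_i in {-1, +1}."""
    return math.exp(theta * sum(a * b for a, b in zip(x, x[1:])))

def normalisation_constant(n, theta):
    """Z(theta) by exhaustive summation over all 2**n spin configurations."""
    return sum(unnormalised_density(x, theta)
               for x in itertools.product((-1, 1), repeat=n))

n, theta = 10, 0.5  # illustrative values; the sum already has 1024 terms
Z = normalisation_constant(n, theta)

# Closed form for this open chain, used only to check the brute-force sum.
Z_closed_form = 2.0 * (2.0 * math.cosh(theta)) ** (n - 1)
print(Z, Z_closed_form)
```

Each additional spin doubles the number of terms, which is why exhaustive summation is not an option for the lattice sizes met in applications.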

Of course, this can be a particular challenge for maximum likelihood estimation since we need to know the normalisation constant $Z(\theta)$ in order to solve the optimisation problem in Equation 1.7. Such problems have also received a lot of attention in the Bayesian literature, where they are known as “doubly intractable” problems due to the fact that both the normalisation constant of the likelihood and the normalisation constant of the posterior (i.e. the model evidence) are unknown. In these cases, combining Equations 1.3 and 1.15, we get that the posterior distribution takes the form:

\[
p(\theta|X) = \frac{\frac{\tilde{p}(X|\theta)}{Z(\theta)} \, p_0(\theta)}{p(X)}, \tag{1.16}
\]

where both $Z(\theta)$ and $p(X)$ are unknown. To resolve the issue of the unknown normalisation constant of the likelihood, several authors have proposed to use plug-in MC and MCMC estimates of the intractable integrals [Geyer, 1991; Lyne et al., 2015] (this highlights another area where numerical integration is important). Other popular approaches have focused on approximations to the likelihood which can be computed at much lower computational cost; see for example the pseudo-likelihood method of Besag [1974] and related composite likelihood methods (see Varin et al. [2011] for an overview). These are however not asymptotically exact and it is not always easy to assess the bias created by the approximations.
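As an illustration of what a plug-in Monte Carlo estimate of the normalisation constant can look like, the sketch below uses plain importance sampling on a one-dimensional toy problem. This is an assumption made for exposition and is not the specific estimator of the works cited above: the unnormalised density, the Gaussian proposal q and all function names are hypothetical. The estimator rests on the identity Z(θ) = E_q[p̃(x|θ)/q(x)].

```python
import math
import random

def unnormalised_density(x, theta):
    """Toy unnormalised density p~(x | theta) = exp(-theta * x**2 / 2)."""
    return math.exp(-0.5 * theta * x * x)

def proposal_pdf(x, scale):
    """Density of the N(0, scale**2) proposal q."""
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))

def estimate_Z(theta, n, rng, scale=2.0):
    """Importance-sampling estimate of Z(theta) = E_q[p~(x | theta) / q(x)]."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, scale)
        total += unnormalised_density(x, theta) / proposal_pdf(x, scale)
    return total / n

rng = random.Random(0)
theta = 1.0
Z_hat = estimate_Z(theta, 100_000, rng)
Z_exact = math.sqrt(2.0 * math.pi / theta)  # known here only because the toy is Gaussian
print(Z_hat, Z_exact)
```

The proposal is deliberately wider than the target so that the importance weights stay bounded; in realistic models, choosing such a proposal is itself a difficult problem.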

In a frequentist setting, issues with these approaches have led to the development of alternative methods to maximum likelihood, most notably score-based inference methods such as score-matching [Hyvärinen, 2006, 2007; Karakida et al., 2016] or proper scoring rules [Gneiting and Raftery, 2007; Dawid, 2007; Parry et al., 2012] (see Chapter 4 for more details). These methods only require access to the gradient of the log-density. Advantages include the fact that we can bypass the computation of expensive normalisation constants whilst still obtaining an asymptotically exact solution since:

\[
\nabla_x \log p(x|\theta) = \nabla_x \log \tilde{p}(x|\theta) - \underbrace{\nabla_x \log Z(\theta)}_{=0} = \nabla_x \log \tilde{p}(x|\theta), \tag{1.17}
\]

where $\nabla_x$ is the vector of partial derivatives with respect to each of the coordinates of $x$, and the term involving $Z(\theta)$ vanishes because $Z(\theta)$ does not depend on $x$. In a Bayesian setting, pseudo-marginal approaches, including the exchange algorithm [Murray et al., 2006; Møller et al., 2006], have been proposed to sample from posterior distributions efficiently. These usually provide good approximations, but at a high computational cost.
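The practical consequence of Equation 1.17 can be seen in a small score-matching sketch. It assumes the standard one-dimensional form of the Hyvärinen objective, J(θ) = E[½ (∂_x log p̃)² + ∂²_x log p̃], and a toy unnormalised density p̃(x|θ) = exp(−θx²/2); both the toy model and the function names are illustrative assumptions. For this model J(θ) = ½ θ² E[x²] − θ, which is minimised at θ = 1/E[x²], and the normalisation constant is never evaluated.

```python
import math
import random

def score_matching_objective(theta, xs):
    """Empirical Hyvarinen objective for p~(x | theta) = exp(-theta * x**2 / 2).

    d/dx log p~ = -theta * x and d^2/dx^2 log p~ = -theta; neither involves Z(theta).
    """
    return sum(0.5 * (theta * x) ** 2 - theta for x in xs) / len(xs)

def score_matching_estimate(xs):
    """Closed-form minimiser of the objective above: theta = 1 / mean(x**2)."""
    return len(xs) / sum(x * x for x in xs)

rng = random.Random(0)
theta_true = 4.0  # illustrative value; the data are N(0, 1 / theta_true)
xs = [rng.gauss(0.0, 1.0 / math.sqrt(theta_true)) for _ in range(50_000)]
theta_hat = score_matching_estimate(xs)
print(theta_hat)
```

In this Gaussian toy the score-matching estimate coincides with the maximum likelihood estimate; in general the two estimators differ.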

1.2.2 Intractability in Generative Models

A second scenario which has recently received renewed interest is that of generative models [Mohamed and Lakshminarayanan, 2016], sometimes also called implicit models or likelihood-free models, for which the likelihood is not available in any form. Instead, we assume that it is possible to obtain IID samples from the model for any value of the parameter vector $\theta \in \Theta$. Let $(\mathcal{U}, \Sigma_{\mathcal{U}}, \mathbb{U})$ be a probability space. Formally, we regard generative models as a family of probability measures such that, for any value of the parameter $\theta \in \Theta$, we can obtain some IID data $\{x_i\}_{i=1}^n$ from the corresponding probability measure $P_\theta$. This data is obtained in two steps: first, IID random variables $\{u_i\}_{i=1}^n$ are obtained from $\mathbb{U}$; then some map $G_\theta : \mathcal{U} \to \mathcal{X}$ is applied to each of these random variables to obtain $P_\theta$-distributed random variables: $x_i = G_\theta(u_i)$ for $i = 1, \ldots, n$.
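The two-step sampling procedure can be sketched as follows. The base measure (uniform on [0, 1)) and the map G_θ (the inverse CDF of an exponential distribution with rate θ) are toy assumptions chosen so that the output is easy to check. This toy does have a tractable likelihood, unlike the models of interest here; only the sampling mechanism is being illustrated.

```python
import math
import random

def sample_base(n, rng):
    """Step 1: draw IID u_1, ..., u_n from the base measure (here Uniform[0, 1))."""
    return [rng.random() for _ in range(n)]

def generator(theta, u):
    """Step 2: hypothetical map G_theta, sending a uniform draw to Exponential(rate=theta)."""
    return -math.log(1.0 - u) / theta

def sample_model(theta, n, rng):
    """Return x_i = G_theta(u_i) for i = 1, ..., n."""
    return [generator(theta, u) for u in sample_base(n, rng)]

rng = random.Random(0)
xs = sample_model(theta=2.0, n=20_000, rng=rng)
sample_mean = sum(xs) / len(xs)
print(sample_mean)  # should be close to 1 / theta = 0.5
```

Inference methods for generative models only ever interact with the model through a function such as `sample_model`; the density of the x_i is never evaluated.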

Generative models are used throughout the sciences, including in the fields of ecology [Wood, 2010; Beaumont, 2010; Hartig et al., 2011], population genetics [Beaumont et al., 2002] or astronomy [Cameron and Pettitt, 2012]. They also appear in machine learning as black-box models; see for example generative adversarial networks (GANs) [Goodfellow et al., 2014; Dziugaite et al., 2015; Li et al., 2015] and variational autoencoders (VAEs) [Kingma and Welling, 2014].

The problem of inference within generative models is of course very closely related to the classical problem of density estimation [Diggle and Gratton, 1984]. To tackle it, a common approach is the method of simulated moments and its special case of indirect inference [Hall, 2005]. Here, the idea is to simulate data from $P_\theta$ for a wide range of parameter values $\theta \in \Theta$ and keep the parameter value for which a weighted combination of moments (such as the mean or variance) of the samples agrees most closely with the corresponding moments of the data from the true data-generating process.
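A minimal sketch of this idea, under several assumptions that are not in the text: the generative model is a toy exponential sampler, the moments are the mean and variance, the discrepancy is a weighted sum of squared moment differences, and the search is a plain grid search. All names and parameter values are hypothetical.

```python
import math
import random

def simulate(theta, us):
    """Hypothetical generator G_theta applied to base draws: Exponential(rate=theta)."""
    return [-math.log(1.0 - u) / theta for u in us]

def moments(xs):
    """Return the sample mean and the (biased) sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

def msm_objective(theta, us, target, weights):
    """Weighted squared discrepancy between simulated and target moments."""
    sim = moments(simulate(theta, us))
    return sum(w * (s - t) ** 2 for w, s, t in zip(weights, sim, target))

rng = random.Random(1)
theta_true = 2.0  # used only to create synthetic "observed" data
observed = simulate(theta_true, [rng.random() for _ in range(5000)])
target = moments(observed)

# Common random numbers: the same base draws are reused for every candidate
# theta, so the objective varies smoothly with theta.
us = [rng.random() for _ in range(5000)]
grid = [0.5 + 0.05 * k for k in range(71)]  # candidate values in [0.5, 4.0]
theta_hat = min(grid, key=lambda t: msm_objective(t, us, target, (1.0, 1.0)))
print(theta_hat)
```

Equal weights are used here for simplicity; the choice of weights affects the efficiency of the resulting estimator.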

Another recent approach to this problem, based on optimal transport, was discussed in Bassetti et al. [2006], Bernton et al. [2017] and Genevay et al. [2018], where the authors proposed to minimise the Wasserstein distance, or an approximation thereof, between the empirical probability measure induced by the samples from the true data-generating process and the statistical model under consideration.
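In one dimension this estimator is easy to sketch, because the Wasserstein distance between two empirical measures with the same number of atoms reduces to matching sorted samples. The example below makes several illustrative assumptions: a location model G_θ(u) = θ + u with standard Gaussian u, the 1-Wasserstein distance, and a grid search. It is not the procedure of the cited papers, which address the general multivariate case.

```python
import random

def wasserstein_1d(xs, ys):
    """W1 distance between two empirical measures with equally many atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def simulate(theta, us):
    """Hypothetical location model: G_theta(u) = theta + u."""
    return [theta + u for u in us]

rng = random.Random(2)
# Synthetic "observed" data generated with theta = 1.5.
observed = simulate(1.5, [rng.gauss(0.0, 1.0) for _ in range(2000)])
us = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # base draws reused for each theta
grid = [-1.0 + 0.05 * k for k in range(101)]     # candidate values in [-1.0, 4.0]
theta_hat = min(grid, key=lambda t: wasserstein_1d(simulate(t, us), observed))
print(theta_hat)
```

In higher dimensions no such sorting shortcut exists, which is one reason approximations of the Wasserstein distance are used in practice.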

In a Bayesian context, a common approach to obtain an approximate posterior is approximate Bayesian computation. Here, a parameter value $\theta$ is accepted as a sample from the approximate posterior if data generated for this value is close enough (in the sense of some summary statistics) to the data from the true generating process. See Marin et al. [2012]; Lintusaari et al. [2017] for an overview.
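The accept/reject rule described above can be sketched as a basic rejection sampler. Everything specific in this example is an assumption made for illustration: the toy exponential generative model, the uniform prior on [0.5, 4.0], the sample mean as the summary statistic, and the tolerance `epsilon`.

```python
import math
import random

def simulate(theta, n, rng):
    """Hypothetical generative model: n IID draws from Exponential(rate=theta)."""
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

def summary(xs):
    """Summary statistic: the sample mean."""
    return sum(xs) / len(xs)

def abc_rejection(observed, n_proposals, epsilon, rng, prior=(0.5, 4.0)):
    """Keep prior draws whose simulated summary is within epsilon of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(*prior)  # draw a candidate from the uniform prior
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) < epsilon:
            accepted.append(theta)
    return accepted

rng = random.Random(3)
observed = simulate(2.0, 500, rng)  # synthetic "observed" data with theta = 2
accepted = abc_rejection(observed, n_proposals=2000, epsilon=0.05, rng=rng)
print(len(accepted), sum(accepted) / len(accepted))
```

A smaller `epsilon` gives a closer approximation to the posterior but a lower acceptance rate, so more simulations are needed for the same number of accepted samples.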
