Practical Considerations - Kernel-based Estimators for Intractable Models

Chapter 5 Statistical Inference and Computation with Intractable

5.2 Kernel-based Estimators for Intractable Models

5.2.4 Practical Considerations

As will now be obvious, the choice of loss function (or equivalently of kernel and kernel hyperparameters), and of numerical optimisation routine will be of great importance for practical implementation. These tuning choices will influence the performance of our estimators in three main ways: the asymptotic efficiency of the estimator, the robustness of the estimator and the difficulty of optimising the loss function.

Numerical Optimisation

Recall our goal of inferring the parameter θ ∈ Θ by minimising the loss function

L(θ) = KSD(Q||Pθ)2 or L(θ) = MMD(Q,Pθ)2. These loss functions are usually

non-convex, and potentially high-dimensional when the parameter space is large, which might lead to computational challenges for practical implementation.

Several optimisation algorithms could be used, but we propose to focus on gradient-based methods. A common approach is gradient descent, which consists of initialising atθ0 ∈Θ, then iterating over descent steps. For the tth iteration, we

have the following update:

θ(t) = θ(t−1)−ηt∇θL(θ(t−1)),

where {ηt}t∈_N is a step size sequence chosen to guarantee convergence to a local

minimum. To use gradient descent for our minimum distance estimators based on KSDs and MMDs, we will need the gradient of these statistical divergences with respect to the parameters of the model:

Proposition 17(Gradients of the Kernel Stein Discrepancy and Maximum Mean Discrepancy Loss Functions). The gradient of the KSD loss function

LKSD(θ) :=KSD(Q||Pθ)2 is given by: ∇_θLKSD(θ) = Z X Z X h k(x,y)∇_θ∇_xlogp(x|θ)∇_ylogp(y|θ)

+k(x,y)∇_θ∇_ylogp(y|θ)∇_xlogp(x|θ) + ∇_θ∇_xlogp(x|θ)

∇₂k(x,y)

+∇_θ∇_ylogp(y|θ)∇₁k(x,y)iQ(dx)Q(dy).

The gradient of the MMD loss function LMMD₍_θ_{) :=}_MMD₍

Q,Pθ)2, when Pθ is the

pushforward of some base measureU through the map Gθ, is given by:

∇_θLMMD(θ) = Z U Z U ∇_θGθ(u) (∇1k(Gθ(u), Gθ(v)) +∇2k(Gθ(v), Gθ(u)))>U(du)U(dv) −2 Z U Z X ∇_θGθ(u)∇1k(Gθ(u),y)U(du)>Q(dy).

The usual way to motivate gradient descent methods is to say that they se- quentially decrease the objective function in the optimal direction. However, since we are optimising within a specific statistical manifold, notions of distance are different from Euclidean spaces and the classic gradient descent algorithm does not decrease the objective in an optimal direction anymore. The algorithm is still a valid optimisation algorithm, but it will require a large number of iterations to at- tain the minimum of the loss function. On a manifold, the gradient vector of the functionL(i.e. the optimal descent direction) is given by the vector field∇g_L_{, which}

in a local coordinate system is: ∇g_L₍_θ_{) =}_g−1₍_θ₎_∇

θL(θ), whereg−1(θ) is the inverse

of the matrixg(θ). The natural generalisation of the gradient descent is then to follow the geodesics of the manifold: pt = exppt−1(−ηt∇

g_L₍_p

the tangent vector vp at some p ∈ M to the point expp(vp) := γ(1) ∈ M, where

γ is the unique geodesic s.t. γ(0) = p, ˙γ = vp. However it is often hard to follow

geodesics exactly in practice. Instead, the Euclidean space formula for the exponential, exp_θv=θ+v, suggests the following iterations: θ(t)=θ(t−1)−ηt∇gL(θ(t−1)).

This corresponds to natural gradient algorithms [Amari, 1998, 2016]. These algorithms were previously introduced in the context of the KL divergence and SM divergence, but we can straightforwardly derive similar algorithms for the KSD and MMD using some of our previously-derived formulae for information metrics in Propositions 13 and 15. Raskutti and Mukherjee [2015] also reinterpreted this algorithm as a mirror descent algorithm, whilst Pascanu and Bengio [2014]; Martens [2014] demonstrated its connections to many popular optimisation algorithms for training large machine learning models.

For minimum distance estimation based on KSD, we can use the information metric derived in Proposition 13 and the gradient derived in Proposition 17. Since these will not be available in closed form, we can approximate the trajectories of the natural gradient algorithm by using U-statistic approximations of these quantities. In this case, we target the minimiser of ˆLKSD_U (θ) = KSD2_U(Qm||Pθ). The gradient

descent algorithm follows the iterationsθ(t)=θ(t−1)−ηt∇θLˆKSDU (θ(t−1)) whilst the

natural gradient descent algorithm follows the iterations

θ(t) = θ(t−1)−ηt(gUKSD(θ(t

−1)₎₎−1_∇

θLˆKSDU (θ(t −1)₎_.

On the other hand, for minimum distance estimation based on MMD, we propose to target the minimiser of ˆLMMD_U,U (θ) = MMD_U,U2 (Qm,Pn_θ). To do so, we can use

U-statistic approximations of the information metric in Proposition 15 and gradient in Proposition 17. The gradient descent algorithm follows the iterations θ(t) =

θ(t−1)−ηt∇θLˆMMD_U,U (θ(t−1)) whilst the natural gradient descent algorithm follows:

θ(t) = θ(t−1)−ηt(gU,UMMD(θ(t−1)))−1∇θLˆMMDU,U (θ(t−1)).

Note that we cannot expect to find a unique minimum to this problem since realisations from the generative model are obtained at each iteration. This is necessary in order to compute ˆLMMD_U,U at a new parameter value, but implies that a different objective function is optimised at each iteration. Consistency properties of MMD estimators and of U-statistics approximations do however guarantee that the algorithm above will move towards a minimum of the idealised loss function MMD(Q,Pθ)2 as

nand m grow.

domain Θ, so we also introduce a projection operator Proj :Rp →Θ, applied after

each step, and which maps parameters to their closest value in Θ in terms of some norm to be defined. The approximation may also mean that the metric tensor is not invertible, and regularisation might be required to resolve this issue. Finally, using the same samples for the U-statistic of the gradient and of the information metric may lead to strong biases.

In any case, finding the minimiser may still be challenging for several reasons: 1. The gradient descent procedures will be expensive in certain settings. Let

X ⊆ Rd and Θ ⊂ Rp. For KSD estimation, the cost of each iteration of

gradient descent isO(mpd), whilst the cost of each iteration of natural gradient descent is O(m2p2d+p3). On the other hand, for MMD estimation the cost of each iteration of gradient descent is O (n2+nm)pd2 whilst the cost of each natural gradient descent iteration is O (n2+nm)p2d2+p3

. The cost for MMD estimation is linear in the number of data pointsm, but quadratic in the number of simulated samplesn. We note that taking n=m is optimal in terms of computational cost.

This computational cost could be made linear innby considering approximations of the MMD or KSD as found in Chwialkowski et al. [2015]; Jitkrittum et al. [2017]. In large data settings, subsamplingbelements uniformly at ran- dom from{yj}mj=1 may lead to significant speed-ups. The additional term in

the natural gradient descent algorithm incurs aO(p3) cost due to the need to invert a matrix, which for large-psettings may be prohibitive. In these cases, approximate linear solvers could also be used to reduce this cost.

2. The gradient of the generator ∇_θGθ may not be available, precluding exact

gradient descent inference. In this case, the method of finite difference stochas- tic approximation [Kushner and Yin, 2003] can be used to approximate the descent direction. Finally, it is important to point out that the loss function may be non-convex and that we might converge to a local minimum.

Kernel Selection

Kernel selection for KSD and MMD estimators is delicate, since it will significantly influence the geometry of the statistical manifolds, and will hence have a significant impact on a range of issues including the efficiency and robustness of the estimators. This was clearly demonstrated by the Gaussian distribution example used throughout the section. This example highlighted the impact of the choice of kernel hyperparameters on the robustness of the method. First, the KSD estimator was

shown to be non-bias-robust, but the choice of lengthscale could help reduce the growth of the influence function. Later on, the MMD estimator was shown to be bias-robust, but the choice of lengthscale could impact the gross-error sensitivity. The kernel can therefore be seen as an additional free parameters to adapt these estimators to the problem at hand; something which is not possible with estimators based on the KL or SM divergence.

Since the choice of kernel will also impact the loss function itself, practical considerations should also prevail and it might be of interest to select a kernel which makes the loss function easy to minimise with gradient descent. The natural gradient algorithm will however be able to alleviate some of these issues by adapting directly to the geometry induced by the choice of kernel.

Previous work on the use of maximum mean discrepancy for hypothesis test- ing could also guide our choice of kernel. A common approach in that case was to study the asymptotic distribution of the tests, and choose kernel parameters so as to minimise the power of the test. Extensions where a linear combination of kernels whose weights are optimised was proposed in Gretton et al. [2012b], and Sutherland et al. [2017] used this approach for inference with generative adversarial networks. Note that this will require using held out data, which might therefore decrease the accuracy of our estimators.

Simulation Study

To highlight in detail all of the important theoretical and practical details discussed earlier in the section, our simulation study will focus solely on Gaussian models. Although this is of course not the intended class of models for KSD and MMD estimators, working with Gaussian models will be convenient because they are simple enough to be analysed in detail.

Gaussian Models as Unnormalised Models

We begin by discussing the estimators which make use of the log-likelihood of the model in a normalised or unnormalised form: KL, SM and KSD estimators. We will focus on a problem where our target is a multi-dimensional isotropic Gaussian distributionN(µ,Σ) with meanµ= (µ1, . . . , µd) andd×dscaled diagonal covariance

matrix Σ with diagonal entries (σ₁2, . . . , σ_d2). To begin with, we will assume that both the mean vector and the diagonal values of the covariance function are unknown such thatθ= (µ1, . . . , µd, σ1, . . . , σd) and Θ =Rd×Rd+.

Figure 5.2: Descent trajectories of gradient descent and natural gradient descent algorithms on a one-dimensional Gaussian model for estimators based on the KL,

SM and KSD divergences. We compare 20 iterations of gradient descent (pink)

and natural gradient descent (orange) with constant step sizes. The loss function optimised are the KL, SM and KSD divergence with inverse-multiquadric kernel and Gaussian RBF kernel, each computed using m = 100 realisations. The minimum of the empirical loss functions are represented with an orange star. All of the optimisation algorithms were initialised at theθ0 = (2.5,1.6) and data was obtained

from a model withθ∗ = (1.5,2).

the mean and standard deviation in Figure 5.2. Clearly, the loss function varies in geometry according to the choice of divergence, but is always convex since the Gaussian distribution is an example of the exponential family of distributions. For illustration, we still used numerical optimisation routines to tackle this problem (although a closed form solution was provided in Proposition 14). As seen in Figure 5.2, natural gradient algorithms are able to leverage knowledge of the geometry of the statistical manifold to provide more efficient updates towards the minimum. Indeed, the algorithm approximates a geodesic between the starting position and the minimum, resulting in a direct line when seen as a function of the parameters, which are the coordinates of the manifold. On the other hand, the gradient descent follows descent directions which are perpendicular to the contour lines of the heat

Figure 5.3: Performance of natural gradient descent algorithms on a20-dimensional

Gaussian model for estimators based on the KL, SM and KSD divergences. Con-

sider an inference problem in an M-closed setting with m = 300 samples of a 20 dimensional Gaussian model with 40 parameters (mean vector and diagonal entries of the covariance function). The top plots compares the speed at which the gradient descent (GD) algorithms (full line) and the natural gradient descent (NGD) algorithms (dashed line) minimise each loss function. The KSD was computed with an inverse-multiquadric kernel with lengthscale parameter l= 1. The bottom plots compute l1 and l2 errors between the estimated parameter at iteration t and the

true parameterθ∗.

map, which results in slower convergence towards the minimum.

We now extend this experiment to a Gaussian distribution on X = R20, in

which case the parameter space Θ⊂_R40_{. This is significantly more challenging since}

in a high-dimensional data space X, empirical approximations of the divergences will not be as accurate. Furthermore, the optimisation problem also becomes more challenging in higher dimensions. Figure 5.3 compares the use of the KL, SM and KSD for estimation of this problem. Once again, the natural gradient algorithms are able to minimise the loss functions in fewer iterations. This time, the speed up obtained by the natural gradient algorithm is much more significant than in lower dimensions. This clearly highlights the advantage of making use of the geometry of the statistical inference problem.

Figure 5.4: Maximum mean discrepancy estimator based on a Gaussian RBF kernel

for a Gaussian location model. Top: Comparison of the loss landscape for various

lengthscale values. Bottom Left: Robustness problem with varying location z for the Dirac but threshold fixed to= 0.2. Bottom Right: Robustness problem with varying threshold but fixed location for the Dirac atz= 10.

Gaussian Models as Generative Models

We now consider inference for generative models using the MMD. We use a synthetic generative model Pθ which is a multi-dimensional isotropic Gaussian distribution N(µ, σ2) with mean vectorµand covarianceσ2 whereσ2 >0 so thatθ= (µ, σ). In this case Uis a standard Gaussian distribution N(0,1) distribution on U =Rand

Gµ,σ(u) =µ+σu. We have that∇µGµ,σ(u) = 1 and∇σGµ,σ(u) =u. For simplicity,

and to understand the performance of the model for location and scale parameters separately, we first study the case whereµ∗ is unknown butσ∗known, which we call location model, then later move on to the case whereµ∗ is known butσ∗ unknown, which we call scale model.

Starting with the location model, we first generate realisations from the Gaussian model with known scale parameterσ= 1 and unknown location parameter

µ∗ =θ∗= 0. The landscape of the loss function of an MMD estimator with Gaussian RBF kernel is presented in Figure 5.4 (top). We notice that the choice of lengthscale has a significant influence on this landscape. When the lengthscale is smaller than 5 or larger than 25, large parts of the loss function are flat. In those case, optimising this loss function will be challenging with gradient-based methods. We will hence need to repeatedly reinitialise the algorithm to be able to minimise the function.

Figure 5.5: Maximum mean discrepancy estimator based on a Gaussian RBF kernel

for a Gaussian scale model. Top: Comparison of the loss landscape for various

When the lengthscale is in the interval [5,25], the loss function will be amenable to gradient-based methods.

To highlight some of the issues surrounding robustness of kernel estimators, we also propose two additional experiments in an M-open setting. For these experiments, the data is generated from a measureQ corresponding to a mixture of

the modelP∗ with unknown parameter µ∗ = 0, attributed a weight 1− for some

∈ [0,1], and a Dirac measure at some point z, considered to be an outlier and attributed a weight of . Figure 5.4 (bottom left) shows the l1 error between the

true valueθ∗ and the MMD estimatorθMMD_n,m forn=m= 500 as a function of the lo- cationz of the Dirac when= 0.2. The MMD estimator has the desirable property that as z→ ∞, the estimator ignores the corruption from the outlier. This clearly illustrates one of the advantages of bias-robust estimators. In Figure 5.4 (bottom right),z= 10 but we vary . In this case, the corruption only affects the estimator in a significant manner when approaches a value around 0.5, in which case we realistically cannot considerzas an outlier anymore. In both of these plots, we see that the lengthscale has a significant impact on the robustness of the estimator.

Moving on to the scale model, we repeat the three experiments previously performed on the location model in Figure 5.5. The same conclusions can be ob-

tained in this case: once again a value of the lengthscale which is either too large or too small will make the loss function impossible to minimise using gradient-based method, and the lengthscale has a significant impact on the robustness.

Summary

Kernel-based estimator can be useful for a variety of challenging statistical inference problems involving complex intractable models such as unnormalised or generative models. In this section, we framed the study of these estimators in the context of minimum distance estimators and strictly proper scoring rules, and discussed their bias-robustness.

Clearly, much more work remains from a theoretical viewpoint. First, ongo- ing work is focusing on proving consistency of these estimators, as well central limit theorems for the M-closed setting. These results do not follow directly from classical consistency and central limit theorem results from the scoring rule literature [Dawid and Musio, 2014] due to the necessity of using (multiple) U-statistic approximations of quantities of interest. It will also be interesting to study other types of robustness [Huber and Ronchetti, 2009].

From the point of view of applications, it will also be interesting to test some of the methodology described in this section on a wide range of models, including unnormalised graphical models from the imaging literature, or complex generative models from the ABC literature.

In document Statistical computation with kernels (Page 173-183)