Stein Reproducing Kernels for Numerical Integration

Chapter 5 Statistical Inference and Computation with Intractable

5.1.4 Stein Reproducing Kernels for Numerical Integration

Recall our main challenge of numerically approximating integrals

$$\Pi[f] = \int_{\mathcal{X}} f(x)\,\Pi(\mathrm{d}x),$$

and assume the measure $\Pi$ admits a continuously differentiable density $\pi$ with respect to the Lebesgue measure. We will show in this section that Stein’s method can be extremely useful in creating efficient quadrature rules which can be used as control variates for MC and MCMC integration.

Assume that we have access to a set of points $\{x_i\}_{i=1}^n \subset \mathcal{X}$ such that the empirical measure $\frac{1}{n}\sum_{i=1}^n \delta(x_i)$ is a good approximation of the target $\Pi$. These points might be $\Pi$-distributed, realisations of a Markov chain with invariant distribution $\Pi$, or even obtained with deterministic methods, such as the Stein points algorithms from the previous subsection. For the sake of simplicity, we will limit ourselves to MC and MCMC methods.

We have already seen how these point sets lead to quadrature rules, and have studied their performance through several error criteria. For example, in the MC or MCMC case, recall that the central limit theorem states that

$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(x_i) - \Pi[f]\right) \xrightarrow{D} \mathcal{N}(0, \sigma^2).$$

In the MC case, the variance in the central limit theorem is $\sigma^2_{\mathrm{MC}} = \mathrm{Var}_\pi[f]$, which corresponds to the variance of $f$ under $\Pi$. On the other hand, in the MCMC case, the central limit theorem has variance $\sigma^2_{\mathrm{MCMC}} = \mathrm{Var}_\pi[f] + 2\sum_{k=1}^{\infty} \mathrm{Cov}_\pi[f(X_0), f(X_k)]$ [Jones, 2004]. Direct MC or MCMC estimation of $\Pi[f]$ hence requires a prohibitively large number of samples whenever $f$ has high variance with respect to the target $\Pi$. To reduce the error of these schemes, it is common to use control variates, which are functions $\tilde{f}_{\mathrm{CV}} : \mathcal{X} \to \mathbb{R}$ such that the integral $\Pi[\tilde{f}_{\mathrm{CV}}]$ is known analytically. In this case, we can rewrite the integral of interest as

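$$\Pi[f] = \Pi[f] - \Pi[\tilde{f}_{\mathrm{CV}}] + \Pi[\tilde{f}_{\mathrm{CV}}] = \Pi[f - \tilde{f}_{\mathrm{CV}}] + \Pi[\tilde{f}_{\mathrm{CV}}],$$

where now the second term is known in closed form and the first term needs to be estimated using some quadrature rule:

$$\Pi[f] \approx \hat{\Pi}[f - \tilde{f}_{\mathrm{CV}}] + \Pi[\tilde{f}_{\mathrm{CV}}]. \tag{5.15}$$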

If $\tilde{f}_{\mathrm{CV}}$ is chosen such that $\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}]$ is much smaller than $\mathrm{Var}_\pi[f]$, the error in approximating $\Pi[f]$ via Equation 5.15 will be lower than when using direct MC or MCMC integration.
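As a minimal illustration of Equation 5.15, the following Python sketch compares direct MC with a control variate estimator. Everything specific here is an assumption made for illustration and is not taken from the text: the target $\Pi = \mathcal{N}(0, 1)$, the integrand $f(x) = e^x$ (so that $\Pi[f] = e^{1/2}$), and the control variate $\tilde{f}_{\mathrm{CV}}(x) = 1 + x + x^2/2$, whose integral $\Pi[\tilde{f}_{\mathrm{CV}}] = 3/2$ follows from the Gaussian moments.

```python
import math
import random


def f(x):
    """Integrand (illustrative assumption): f(x) = exp(x)."""
    return math.exp(x)


def f_cv(x):
    """Control variate (assumption): second-order Taylor expansion of exp at 0."""
    return 1.0 + x + 0.5 * x * x


# Known integral under N(0, 1): E[1] + E[x] + E[x^2] / 2 = 1 + 0 + 1/2.
CV_INTEGRAL = 1.5


def mean(values):
    return sum(values) / len(values)


def variance(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


rng = random.Random(1)  # fixed seed so the run is reproducible
xs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

f_values = [f(x) for x in xs]
residuals = [f(x) - f_cv(x) for x in xs]

mc_estimate = mean(f_values)                 # direct MC estimate of Pi[f]
cv_estimate = mean(residuals) + CV_INTEGRAL  # Equation 5.15

print("exact value     :", math.exp(0.5))
print("direct MC       :", mc_estimate)
print("control variate :", cv_estimate)
print("Var[f]          :", variance(f_values))
print("Var[f - f_cv]   :", variance(residuals))
```

In this toy example the control variate reduces the variance by roughly a factor of four, so the same number of samples yields a correspondingly smaller error.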

Such a function $\tilde{f}_{\mathrm{CV}}$ may sometimes be directly available through domain-specific knowledge [Newton, 1994; Henderson and Glynn, 2002], but this is rarely the case. Alternatively, control variates can sometimes be built using known properties of the method used for obtaining samples. See Andradóttir et al. [1993]; Hammer and Tjelmeland [2008]; Dellaportas and Kontoyiannis [2012] for control variates based on the proposal densities of MCMC samplers, and Hickernell et al. [2005] for control variates specialised to QMC. An obvious drawback is that these approaches cannot be used in general settings where properties of $\{x_i\}_{i=1}^n$ are unknown. A more generally applicable approach is the following. First, separate $X = \{x_i\}_{i=1}^n$ into two sets $X_1 = \{x_i\}_{i=1}^m$ and $X_2 = \{x_i\}_{i=m+1}^n$. Then:

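1. Use $X_1$ to build an approximation $\tilde{f}_{\mathrm{CV}}$ of $f$ in some space $\mathcal{H}$ such that, for all $h \in \mathcal{H}$, the integral $\Pi[h] = c$ for some $c \in \mathbb{R}$ is known in closed form.

2. Use $X_2$ to compute a quadrature approximation of $\Pi[f - \tilde{f}_{\mathrm{CV}}]$, as in Equation 5.15.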

In this case, if the integrand can be approximated at a fast rate in $m$, then $\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}]$ will also decrease at a fast rate, which may reduce the integration error faster than the Monte Carlo rate.

Clearly the first step will be the most challenging, as finding a function space $\mathcal{H}$ with the property that the integral of every function is known in closed form is non-trivial. An example is given in Paisley et al. [2012]; Wang et al. [2013], who use a Taylor expansion of the integrand. Unfortunately, this is only feasible when integrating against simple probability measures, such as a Gaussian or uniform, whereas we would like a general methodology which can be applied to any measure with density known up to a normalisation constant.

However, the first step is clearly amenable to the use of Stein’s method. Any function of the form $\tilde{f}_{\mathrm{CV}} = \mathcal{T}_\Pi[g] + c$ for $g \in \mathcal{G}$ and $c \in \mathbb{R}$, where $\mathcal{T}_\Pi$ and $\mathcal{G}$ are a pair of Stein operator and Stein class, is a possible choice of control variate, since $\Pi[\mathcal{T}_\Pi[g]] = 0$ for all $g \in \mathcal{G}$ and hence $\Pi[\tilde{f}_{\mathrm{CV}}] = c$ is known in closed form. In this case, step 1 reduces to finding a function of this form leading to the greatest reduction in numerical integration error. It is common to select $g$ from a parametric family of functions $\{g_\theta\}_{\theta \in \Theta}$, in which case the search in $\mathcal{G}$ is replaced by a search over the parameter space $\Theta$. This problem can be solved by considering a loss function which, given a value $\theta \in \Theta$, returns a value describing the suitability of $g_\theta$. We now highlight two examples.

The first example is to choose $\theta$ by interpolation, which can be done by numerically solving $\tilde{f}_{\mathrm{CV}}(x) = \mathcal{T}_\Pi[g_\theta](x) + c = f(x)$ in terms of $(c, \theta)$. Of course, selecting $\tilde{f}_{\mathrm{CV}}$ by interpolation will indirectly minimise the variance $\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}]$, and the variance will take value zero if we interpolate the function exactly. A second option would be to select $g_\theta$ to minimise the asymptotic variance $\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}] = \mathrm{Var}_\pi[f - \mathcal{T}_\Pi[g_\theta] - c]$ directly. In this case, the term $c$ is not needed and can be set to zero by default. This is because the variance is not affected by constants: $\mathrm{Var}_\pi[f - \mathcal{T}_\Pi[g_\theta]] = \mathrm{Var}_\pi[f - \mathcal{T}_\Pi[g_\theta] - c]$.
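As a worked example of the interpolation approach, consider the following one-dimensional case. The choices below are assumptions made purely for illustration: the target is $\Pi = \mathcal{N}(0, 1)$, so that $\frac{\mathrm{d}}{\mathrm{d}x}\log\pi(x) = -x$; the Stein operator is the Langevin operator $\mathcal{T}_\Pi[g](x) = g'(x) + g(x)\frac{\mathrm{d}}{\mathrm{d}x}\log\pi(x)$; the Stein class contains the affine functions $g_\theta(x) = \theta_0 + \theta_1 x$; and the integrand is $f(x) = x + x^2$.

```latex
\begin{align*}
\mathcal{T}_\Pi[g_\theta](x)
  &= g_\theta'(x) + g_\theta(x)\,\frac{\mathrm{d}}{\mathrm{d}x}\log\pi(x)
   = \theta_1 - \theta_0 x - \theta_1 x^2, \\
\tilde{f}_{\mathrm{CV}}(x)
  &= \mathcal{T}_\Pi[g_\theta](x) + c
   = (c + \theta_1) - \theta_0 x - \theta_1 x^2 .
\end{align*}
% Matching coefficients with f(x) = x + x^2:
\begin{align*}
\theta_0 = -1, \qquad \theta_1 = -1, \qquad c = 1,
\qquad\text{so that}\qquad
\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}] = 0
\quad\text{and}\quad
\Pi[f] = \Pi[\tilde{f}_{\mathrm{CV}}] = c = 1 .
\end{align*}
```

Here the interpolation is exact, so the variance vanishes and the constant $c$ recovers $\Pi[f] = \mathbb{E}[x] + \mathbb{E}[x^2] = 1$ without any sampling. For integrands outside the span of $\mathcal{T}_\Pi[g_\theta] + c$, the interpolation can only be approximate and a residual variance remains.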

Our proposed strategy for building control variates is therefore the following. First, separate $X = \{x_i\}_{i=1}^n$ into two sets $X_1 = \{x_i\}_{i=1}^m$ and $X_2 = \{x_i\}_{i=m+1}^n$ and fix a Stein operator $\mathcal{T}_\Pi$ and parametric Stein class $\mathcal{G}$. Then:

1. Use $X_1$ to select a control variate of the form $\tilde{f}_{\mathrm{CV}} = \mathcal{T}_\Pi[g_\theta] + c$ by minimising a loss function in $\theta$.

2. Compute a quadrature approximation of $\Pi[f - \tilde{f}_{\mathrm{CV}}]$ using $X_2$.
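The two steps above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, none of which come from the text: a one-dimensional target $\mathcal{N}(0, 1)$ accessed only through its score, i.i.d. samples, the Langevin Stein operator $\mathcal{T}_\Pi[g](x) = g'(x) + g(x)\frac{\mathrm{d}}{\mathrm{d}x}\log\pi(x)$, the affine class $g_\theta(x) = \theta_0 + \theta_1 x$, the integrand $f(x) = \sin(x) + x^2$ (so that $\Pi[f] = 1$), and the empirical variance as the loss, with $c$ set to zero.

```python
import math
import random


def score(x):
    """d/dx log pi(x) for the assumed target N(0, 1)."""
    return -x


def stein_features(x):
    """Langevin Stein operator applied to the basis functions 1 and x."""
    return (score(x), 1.0 + x * score(x))


def f(x):
    """Integrand (illustrative assumption); its integral under N(0, 1) is 1."""
    return math.sin(x) + x * x


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)


def fit_theta(points):
    """Step 1: minimise the empirical variance of f - T[g_theta] over theta."""
    phi0 = [stein_features(x)[0] for x in points]
    phi1 = [stein_features(x)[1] for x in points]
    fx = [f(x) for x in points]
    # 2x2 normal equations Cov(phi) theta = Cov(phi, f), solved by Cramer's rule.
    a, b, d = covariance(phi0, phi0), covariance(phi0, phi1), covariance(phi1, phi1)
    r0, r1 = covariance(phi0, fx), covariance(phi1, fx)
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)


def residual(x, theta):
    """f(x) - T[g_theta](x); the control variate has known integral c = 0."""
    phi = stein_features(x)
    return f(x) - theta[0] * phi[0] - theta[1] * phi[1]


rng = random.Random(0)  # fixed seed so the run is reproducible
points = [rng.gauss(0.0, 1.0) for _ in range(2000)]
m = 200
x1, x2 = points[:m], points[m:]

theta = fit_theta(x1)                         # step 1 uses X1 only
residuals = [residual(x, theta) for x in x2]
cv_estimate = mean(residuals)                 # step 2 uses X2 only
f_values = [f(x) for x in x2]
mc_estimate = mean(f_values)                  # direct MC on the same points

print("theta           :", theta)
print("direct MC       :", mc_estimate)
print("Stein CV        :", cv_estimate)
print("Var[f]          :", covariance(f_values, f_values))
print("Var[f - T[g]]   :", covariance(residuals, residuals))
```

Because $\theta$ is fitted on $X_1$ and the average is taken over $X_2$, the fitted control variate is independent of the points used for quadrature in this i.i.d. setting.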

It turns out that many existing control variate methodologies available in the literature can be recovered as special cases of this approach. We now highlight a few examples.

1. Motivated by a specific Hamiltonian differential operator from statistical physics, Assaraf and Caffarel [1999]; Mira et al. [2013] proposed to perform step 1 using functions

$$\mathcal{T}_\Pi[g_\theta](x) = -\frac{\Delta\big[P(x|\theta)\sqrt{\pi(x)}\big]}{2\sqrt{\pi(x)}} + \frac{P(x|\theta)\,\Delta\big[\sqrt{\pi(x)}\big]}{2\sqrt{\pi(x)}},$$

where $P(x|\theta)$ is a class of polynomials of order $p \in \mathbb{N}$ with coefficients summarised in the vector $\theta$. They estimate the coefficients $\theta$ by minimising $\mathrm{Var}_\pi[f - \tilde{f}_{\mathrm{CV}}]$. Full implementation details can be found in Papamarkou et al. [2014]. This can be shown to be equivalent to using the Itô Stein operator in Equation 5.5 together with a Stein class consisting of polynomials of order $p$.

2. Another example, called control functionals, was recently proposed in Oates et al. [2017a,c, 2018]; Oates and Girolami [2016]. These control variates are based on interpolants in an RKHS of the form

$$\mathcal{L}_\Pi[g_\theta](x) = \sum_{i=1}^m \theta_i k_\Pi(x, x_i)$$

with $x_i \in \mathcal{X}$ and $\theta_i \in \mathbb{R}$ for all $i = 1, \ldots, m$, and where the kernel $k_\Pi$ is the Langevin Stein reproducing kernel previously defined in Equation 5.10. Finding the optimal $\theta$ for interpolation can be solved in closed form as a least-squares problem. Control functionals were shown to be effective for variance reduction and can lead to faster convergence rates than direct MCMC integration [Oates et al., 2018].

3. Zhu et al. [2018] also approached this problem using neural networks and used functions of the form

$$\mathcal{L}_\Pi[g_\theta](x) = \langle \nabla_x, g(x|\theta) \rangle + \langle g(x|\theta), \nabla_x \log \pi(x) \rangle,$$

where $g(x|\theta)$ are vector-valued neural networks with weights $\theta$. Zhu et al. [2018] then propose to use an estimate of the mean-squared error to optimise the parameters. Neural networks have been shown to be particularly effective at approximating high-dimensional functions which can be written as a composition of low-dimensional functions [Poggio et al., 2017].
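For the control functionals of item 2, the key property is that every function $k_\Pi(\cdot, x_i)$ integrates to zero under $\Pi$, so that a control variate $c + \sum_{i=1}^m \theta_i k_\Pi(x, x_i)$ has integral $c$. The sketch below checks this numerically. It does not reproduce Equation 5.10; it assumes the usual construction of a Langevin Stein kernel from a base kernel $k$ and score $s = \nabla \log \pi$, namely $k_\Pi(x, y) = \partial_x \partial_y k + s(x)\,\partial_y k + s(y)\,\partial_x k + s(x) s(y)\, k$. The Gaussian base kernel, the one-dimensional target $\mathcal{N}(0, 1)$, and the nodes, weights and constant are all hypothetical choices.

```python
import math
import random

LENGTHSCALE = 1.0  # assumption: length-scale of the Gaussian base kernel


def score(x):
    """d/dx log pi(x) for the assumed target N(0, 1)."""
    return -x


def stein_kernel(x, y, ell=LENGTHSCALE):
    """Assumed Langevin Stein kernel built from a Gaussian base kernel (1D)."""
    d = x - y
    k = math.exp(-d * d / (2.0 * ell ** 2))
    dk_dx = -d / ell ** 2 * k
    dk_dy = d / ell ** 2 * k
    d2k_dxdy = (1.0 / ell ** 2 - d * d / ell ** 4) * k
    return d2k_dxdy + score(x) * dk_dy + score(y) * dk_dx + score(x) * score(y) * k


def control_functional(x, nodes, theta, c):
    """Control variate c + sum_i theta_i * k_Pi(x, x_i)."""
    return c + sum(t * stein_kernel(x, xi) for t, xi in zip(theta, nodes))


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2)  # fixed seed so the run is reproducible
samples = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Hypothetical nodes, weights and constant, chosen only for the check.
nodes = [-1.0, 0.5, 2.0]
theta = [0.3, -0.2, 0.1]
c = 0.7

# Each kernel section k_Pi(., x_i) should average to roughly zero under pi ...
section_means = [mean([stein_kernel(x, xi) for x in samples]) for xi in nodes]
# ... so the control variate should average to roughly c.
cv_mean = mean([control_functional(x, nodes, theta, c) for x in samples])

print("means of k_Pi(., x_i):", section_means)
print("mean of control variate:", cv_mean, "(expected close to c =", c, ")")
```

In practice the weights $\theta$ and the constant $c$ would be obtained from the interpolation problem described in item 2 rather than fixed by hand as they are here.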

Before concluding this section, we note that the integral of a control variate can itself be used as a stand-alone quadrature estimator. Indeed, we can simply disregard the estimator of $\Pi[f - \tilde{f}_{\mathrm{CV}}]$ and use $\Pi[\tilde{f}_{\mathrm{CV}}]$ as an estimate of $\Pi[f]$ in Equation 5.15. For example, when considering the control functionals approach of Oates et al. [2018, 2017c], we notice that the integral $\Pi[\tilde{f}_{\mathrm{CV}}]$ can be obtained in closed form, and actually corresponds to the BQ estimator of $\Pi[f]$ obtained when using the kernel $k_+$. This is in fact what was done in Chapter 4 for the differential equation example. Stein’s method therefore provides us with an alternative to the methodologies developed in Chapter 3 for BQ with intractable kernel means.

Unfortunately, one drawback of this approach is that the estimators will be biased.

In document Statistical computation with kernels (Page 157-161)