Stein Reproducing Kernels for Approximating Measures

Chapter 5 Statistical Inference and Computation with Intractable

5.1.3 Stein Reproducing Kernels for Approximating Measures

We have already seen in previous chapters how quadrature estimators need efficient point selection methods for enhanced performance. KSDs can be useful for this task, especially in cases where the integrals of interest are taken against measures with densities known only up to normalisation constants (as is usually the case in Bayesian statistics).

This subsection briefly discusses one approach, called Stein points [Chen et al., 2018, 2019]. The philosophy behind Stein points is to see the problem of approximating a target measure $\Pi$ (against which we would like to integrate) as an optimisation problem. More precisely, we propose to select points $\{x_i\}_{i=1}^n$ to form an empirical measure $\hat{\Pi}_n = \frac{1}{n}\sum_{i=1}^n \delta(x_i)$ which approximates $\Pi$ well. This is done by minimising the KSD between these two measures:

$$\arg\min_{\{x_i\}_{i=1}^n \subset \mathcal{X}} D_{k_\Pi}(\{x_i\}_{i=1}^n). \tag{5.12}$$

This can equivalently be seen as selecting the optimal states with respect to the WCE in $\mathcal{H}_{k_\Pi}$ for an equally-weighted quadrature rule. Here and for the remainder of Section 5.1, we use the notation $D_{k_\Pi}(\{x_i\}_{i=1}^n)$ to denote the kernel Stein discrepancy with Langevin Stein operator. This choice is made to make the dependence on the point set explicit.
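As a rough illustration of the objective, the sketch below evaluates the KSD of a point set. It rests on several assumptions that are not stated in this section: that the Langevin Stein kernel of Equation 5.10 takes the standard form $k_\Pi(x,y) = \nabla_x\cdot\nabla_y k + \nabla_x k\cdot u(y) + \nabla_y k\cdot u(x) + k\,u(x)\cdot u(y)$ with $u = \nabla\log\pi$, that the squared KSD is the average of the Gram matrix of $k_\Pi$, and that the base kernel is $(1+\|x-y\|_2^2)^{-l}$. The function names and the value $l = 0.5$ are hypothetical.

```python
import numpy as np

def imq_langevin_stein_kernel(X, Y, score, l=0.5):
    """Gram matrix k_Pi(x_i, y_j), assuming base kernel (1 + ||x - y||^2)^(-l)."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    d = X.shape[1]
    R = X[:, None, :] - Y[None, :, :]      # pairwise differences, shape (n, m, d)
    sq = np.sum(R**2, axis=-1)             # squared distances ||x - y||^2
    S = 1.0 + sq
    UX, UY = score(X), score(Y)            # rows hold grad log pi at each point
    # Divergence term: sum_i d^2 k / (dx_i dy_i)
    div = 2*l*d*S**(-l - 1) - 4*l*(l + 1)*sq*S**(-l - 2)
    # Cross terms: grad_x k . u(y) + grad_y k . u(x)
    diff_scores = UX[:, None, :] - UY[None, :, :]
    cross = 2*l*S**(-l - 1) * np.einsum("ijk,ijk->ij", R, diff_scores)
    # Final term: k(x, y) u(x) . u(y)
    return div + cross + S**(-l) * (UX @ UY.T)

def ksd(X, score, l=0.5):
    """KSD of the equally-weighted point set X (assumed V-statistic form)."""
    K = imq_langevin_stein_kernel(X, X, score, l)
    return np.sqrt(max(K.mean(), 0.0))     # guard against tiny negative round-off

def gaussian_score(X):
    """Score of a standard Gaussian target; only the score is needed, not the normaliser."""
    return -X

rng = np.random.default_rng(0)
good = rng.standard_normal((200, 2))       # sample from the target
bad = good + 3.0                           # same sample, shifted away from the target
ksd_good = ksd(good, gaussian_score)
ksd_bad = ksd(bad, gaussian_score)
print(ksd_good, ksd_bad)
```

Only the score function enters the computation, which is why the normalisation constant of $\pi$ is never required.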

Note that point-selection algorithms based on optimisation of statistical divergences already exist in the literature. These include the minimum energy designs of Joseph et al. [2015, 2017], which minimise the energy distance, the Stein variational gradient descent algorithm of Liu and Wang [2016] and Liu [2017], which minimises the KL divergence, and the kernel herding and FW algorithms of Chen et al. [2010] and Bach et al. [2012], which minimise MMD.

The problem in Equation 5.12 is a highly non-convex optimisation problem, and it is high-dimensional whenever a large number of points $n$ is required. To reduce the complexity of this problem, we propose two different point sequences.

The first and simplest algorithm that we consider follows a greedy strategy and is hence called Stein greedy points. The initial point $x_1$ is taken to be a global maximum of the density $\pi$ of $\Pi$, then each subsequent point $x_n$ is taken to be a global minimum of $D_{k_\Pi}(\{x_i\}_{i=1}^n)$, with the objective function being viewed as a function of $x_n$ with $\{x_i\}_{i=1}^{n-1}$ being fixed. This is equivalent to selecting:

$$x_n \in \arg\min_{x\in\mathcal{X}} \sum_{i=1}^{n-1} k_\Pi(x_i,x) + \frac{k_\Pi(x,x)}{2}. \tag{5.13}$$
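The equivalence can be checked by expanding the squared discrepancy. The short derivation below assumes the usual closed form $D_{k_\Pi}(\{x_i\}_{i=1}^n)^2 = \frac{1}{n^2}\sum_{i,j} k_\Pi(x_i,x_j)$ for a Stein reproducing kernel, which is taken as given here rather than re-derived, together with symmetry of $k_\Pi$.

```latex
\begin{align*}
n^2 \, D_{k_\Pi}(\{x_i\}_{i=1}^n)^2
  &= \sum_{i=1}^{n}\sum_{j=1}^{n} k_\Pi(x_i, x_j) \\
  &= \underbrace{\sum_{i=1}^{n-1}\sum_{j=1}^{n-1} k_\Pi(x_i, x_j)}_{\text{does not depend on } x_n}
     \; + \; 2\sum_{i=1}^{n-1} k_\Pi(x_i, x_n) \; + \; k_\Pi(x_n, x_n).
\end{align*}
```

Dropping the term that does not depend on $x_n$ and dividing by two leaves the objective minimised in the greedy update.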

As seen in Chapter 4, another approach is to use an FW algorithm, which boils down to solving the problem $\arg\min_{g\in\mathcal{M}} \frac{1}{2}\|g - \Pi[k_\Pi(\cdot,x)]\|^2_{\mathcal{H}_{k_\Pi}}$, where $\mathcal{M}$ is the marginal polytope of the RKHS $\mathcal{H}_{k_\Pi}$ (see Equation 4.3 in Chapter 4). As might be expected, the objective function is closely related to KSD; for $g(x) = \frac{1}{n}\sum_{i=1}^n k_\Pi(x_i,x)$:

$$D_{k_\Pi}(\{x_i\}_{i=1}^n) = \|g - \Pi[k_\Pi(\cdot,x)]\|_{\mathcal{H}_{k_\Pi}}.$$

This leads us to our second algorithm, where the initial point $x_1$ is once again taken to be a global maximum of the density $\pi$, which in the context of this algorithm corresponds to an element $g_1(x) = k_\Pi(x_1,x)$. Then, at iteration $n > 1$, the convex combination $g_n = \frac{n-1}{n} g_{n-1} + \frac{1}{n}\bar{g}_n$ is constructed, where the element $\bar{g}_n$ encodes a direction of steepest descent. Given that minimisation of a linear objective over a convex set can be restricted to the boundary of that set, it follows that $\bar{g}_n(x) = k_\Pi(x_n,x)$ for some $x_n \in \mathcal{X}$ (see step 1 of the algorithm in 4.2.1).

The second algorithm, called Stein herding, can hence be concisely summarised as follows. First select $x_1 \in \arg\max_{x\in\mathcal{X}} \pi(x)$, then at iteration $n > 1$:

$$x_n \in \arg\min_{x\in\mathcal{X}} \sum_{i=1}^{n-1} k_\Pi(x_i,x). \tag{5.14}$$

The Stein greedy and Stein herding updates (Equations 5.13 and 5.14 respectively) are very similar to one another. In particular, the Stein greedy update can be seen as a regularised version of the Stein herding update, with regulariser $\frac{1}{2}k_\Pi(x,x)$. The two updates coincide if $k_\Pi(x,x)$ is constant in $x$. This is true for most reproducing kernels used in practice, as these tend to be isotropic; however, it is typically not true for a Stein reproducing kernel such as the Langevin Stein kernel in Equation 5.10.
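The two updates can be compared side by side in a one-dimensional sketch. The code below assumes a standard Gaussian target, the base kernel $(1+(x-y)^2)^{-l}$ with $l = 0.5$, the standard closed form of the Langevin Stein kernel, and a grid search in place of the global optimiser; all function names are hypothetical. Under these assumptions $k_\Pi(x,x) = 2l + x^2$, which is not constant, so the greedy and herding sequences need not agree.

```python
import numpy as np

def stein_kernel_1d(x, y, score, l=0.5):
    """1D Langevin Stein kernel for base kernel (1 + (x - y)^2)^(-l) (assumed form)."""
    r = x - y
    s = 1.0 + r**2
    ux, uy = score(x), score(y)
    return (2*l*s**(-l - 1) - 4*l*(l + 1)*r**2*s**(-l - 2)   # d^2 k / (dx dy)
            + 2*l*s**(-l - 1)*r*(ux - uy)                    # cross terms
            + s**(-l)*ux*uy)                                 # k(x, y) u(x) u(y)

def stein_points_1d(n, grid, score, log_density, regularise, l=0.5):
    """regularise=True follows update (5.13); regularise=False follows update (5.14)."""
    diag = stein_kernel_1d(grid, grid, score, l)             # k_Pi(x, x) on the grid
    points = [grid[np.argmax(log_density(grid))]]            # x_1: mode, found on the grid
    running = stein_kernel_1d(points[0], grid, score, l)     # sum_i k_Pi(x_i, x)
    for _ in range(1, n):
        objective = running + 0.5*diag if regularise else running
        x_new = grid[np.argmin(objective)]                   # grid search replaces arg min
        points.append(x_new)
        running = running + stein_kernel_1d(x_new, grid, score, l)
    return np.array(points)

def ksd_1d(points, score, l=0.5):
    """KSD of an equally-weighted point set (assumed V-statistic form)."""
    K = stein_kernel_1d(points[:, None], points[None, :], score, l)
    return np.sqrt(max(K.mean(), 0.0))

score = lambda x: -x                   # grad log pi for a standard Gaussian
log_density = lambda x: -0.5 * x**2    # log pi up to an additive constant
grid = np.linspace(-5.0, 5.0, 1001)

greedy = stein_points_1d(20, grid, score, log_density, regularise=True)
herding = stein_points_1d(20, grid, score, log_density, regularise=False)
diag = stein_kernel_1d(grid, grid, score)
print("greedy KSD :", ksd_1d(greedy, score))
print("herding KSD:", ksd_1d(herding, score))
```

Keeping a running sum of $k_\Pi(x_i,\cdot)$ on the fixed grid means each iteration adds one kernel evaluation per grid point; with a fresh search region at every iteration this saving is not available.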

The Stein greedy and Stein herding algorithms both require solving a global (non-convex) optimisation problem over $\mathcal{X}$ at each iteration. In practice, solving this problem exactly will be infeasible, and the use of numerical methods such as a grid search, MC search or Nelder-Mead search will be required. Both algorithms will also have roughly the same computational cost, which will be $O(n^2)$ in addition to any computational cost of the global optimisation routine. We thus anticipate applications in which the evaluation of $\pi$ (or its gradient) constitutes the principal computational bottleneck.

We now highlight the performance of Stein points on a synthetic example popular in the sampling literature: the Rosenbrock density. The Rosenbrock target has a density satisfying, up to an additive constant, $\log\pi(x) = -100(x_2 - x_1^2)^2 - (1 - x_1)^2$, which tends to be challenging since the region of high density is narrow and has high curvature (see Figure 5.1). We demonstrate the performance of the Stein greedy algorithm on this target, where a Monte Carlo search is performed at each iteration, using a large number of IID uniform points on $[-4,4]\times[-1,10]$. The KSD in this example used an inverse-multiquadric base kernel $k(x,x') = (\|x - x'\|_2^2 + 1)^{-l}$ with parameter $l = 0.7$. As seen in Figure 5.1, the Stein greedy algorithm is able to select representative points from this target. This required a large number of Monte Carlo points because the region of high density is very narrow.
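A minimal sketch of this experiment is given below. It is not the code used to produce Figure 5.1: the number of points (20), the number of Monte Carlo candidates per iteration (2000), the random seed and all function names are illustrative assumptions, and the Langevin Stein kernel is assumed to take its standard closed form. For the inverse-multiquadric base kernel this gives $k_\Pi(x,x) = 2ld + \|\nabla\log\pi(x)\|_2^2$, which is used for the regulariser.

```python
import numpy as np

L_EXP = 0.7  # exponent l of the inverse-multiquadric base kernel

def rosenbrock_score(X):
    """Rows of grad log pi for log pi(x) = -100 (x2 - x1^2)^2 - (1 - x1)^2."""
    x1, x2 = X[:, 0], X[:, 1]
    gap = x2 - x1**2
    return np.stack([400.0*x1*gap + 2.0*(1.0 - x1), -200.0*gap], axis=1)

def stein_kernel(X, Y, score, l=L_EXP):
    """Gram matrix of the Langevin Stein kernel (assumed standard closed form)."""
    d = X.shape[1]
    R = X[:, None, :] - Y[None, :, :]          # pairwise differences, shape (n, m, d)
    sq = np.sum(R**2, axis=-1)
    S = 1.0 + sq
    UX, UY = score(X), score(Y)
    div = 2*l*d*S**(-l - 1) - 4*l*(l + 1)*sq*S**(-l - 2)
    diff_scores = UX[:, None, :] - UY[None, :, :]
    cross = 2*l*S**(-l - 1) * np.einsum("ijk,ijk->ij", R, diff_scores)
    return div + cross + S**(-l) * (UX @ UY.T)

def stein_kernel_diag(X, score, l=L_EXP):
    """k_Pi(x, x) = 2 l d + ||score(x)||^2 for this base kernel."""
    return 2*l*X.shape[1] + np.sum(score(X)**2, axis=1)

def stein_greedy_mc(n, n_candidates, score, x_init, lower, upper, rng, l=L_EXP):
    """Stein greedy points with a Monte Carlo search replacing the arg min in (5.13)."""
    points = [np.asarray(x_init, dtype=float)]
    for _ in range(1, n):
        # Fresh IID uniform candidates on the search box at every iteration.
        cands = rng.uniform(lower, upper, size=(n_candidates, len(lower)))
        current = np.array(points)
        objective = (0.5*stein_kernel_diag(cands, score, l)
                     + stein_kernel(current, cands, score, l).sum(axis=0))
        points.append(cands[np.argmin(objective)])
    return np.array(points)

rng = np.random.default_rng(1)
lower, upper = np.array([-4.0, -1.0]), np.array([4.0, 10.0])
# The mode (1, 1) of the Rosenbrock density is used as the initial point.
pts = stein_greedy_mc(20, 2000, rosenbrock_score, [1.0, 1.0], lower, upper, rng)
print(pts)
```

Because candidates far from the curve $x_2 = x_1^2$ have very large scores, and hence a very large regulariser, only the few candidates that land close to the curve are competitive. This is one way to see why a large number of Monte Carlo points is needed.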

Further applications to problems in Bayesian computation were also presented in Chen et al. [2018, 2019], including approximating the posterior distribution over parameters of a GP model, and the posterior distribution over parameters of an integrated generalised autoregressive conditional heteroskedasticity model.

On the theoretical side, under regularity conditions, it is in fact possible to show that both the Stein greedy and Stein herding algorithms will minimise KSD asymptotically. One such condition is that kΠ is Π-sub-exponential, which means

Figure 5.1: Stein greedy points for the Rosenbrock density. The algorithm starts at the global maximum $x_0 = (1,1)$ of the density, then greedily adds points to minimise the Langevin KSD. In this case, the inner optimisation loops were performed using a Monte Carlo search with IID uniform random variables on $[-4,4]\times[-1,10]$.

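Theorem 16 (Consistency of Stein greedy points). Suppose that the Stein reproducing kernel $k_\Pi$ is a $\Pi$-sub-exponential reproducing kernel. Then $\exists\, c_1, c_2 > 0$ such that for all $\{x_i\}_{i=1}^n$ satisfying

$$\frac{k_\Pi(x_j,x_j)}{2} + \sum_{i=1}^{j-1} k_\Pi(x_i,x_j) \leq \frac{\delta}{2} + \min_{x\in\mathcal{X}:\, k_\Pi(x,x)\leq R_j^2} \left[ \frac{k_\Pi(x,x)}{2} + \sum_{i=1}^{j-1} k_\Pi(x_i,x) \right]$$

with $\sqrt{2\log(j)/c_2} \leq R_j \leq \infty$ for each $j = 1,\ldots,n$, we have

$$e(\hat{\Pi}_n; \Pi, \mathcal{H}_{k_\Pi}) = D_{k_\Pi}(\{x_i\}_{i=1}^n) \leq e^{\pi/2} \sqrt{\frac{2\log(n)}{c_2 n} + \frac{c_1}{n} + \frac{\delta}{n}}.$$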
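To get a feel for how the bound behaves, the sketch below evaluates its right-hand side for a few values of $n$. The constants $c_1 = c_2 = 1$ and the tolerance $\delta = 0.1$ are purely hypothetical: the theorem only asserts that suitable constants exist, and the function name is illustrative.

```python
import math

def stein_greedy_bound(n, c1, c2, delta):
    """Right-hand side of the KSD bound in Theorem 16 for given (hypothetical) constants."""
    inner = 2.0*math.log(n)/(c2*n) + c1/n + delta/n
    return math.exp(math.pi/2) * math.sqrt(inner)

for n in (10, 100, 1000, 10000):
    # The per-iteration optimisation error delta only enters through delta / n.
    print(n, stein_greedy_bound(n, c1=1.0, c2=1.0, delta=0.1))
```

For large $n$ the $\log(n)/n$ term dominates, so the bound decays slightly more slowly than $n^{-1/2}$ whatever the fixed value of $\delta$.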

The proof of this result can be found in the supplementary material of Chen et al. [2018], and a similar theorem for the herding case can also be found in that paper. We note a particular strength of this theorem: the rate holds even when the global optimisation routine at each iteration has not converged. Indeed, the $\delta/2$ term allows for error at each iteration. Another advantage is that we do not require the kernel to be bounded, but weaken this condition to $\Pi$-sub-exponential. We note that the theorem gives a convergence rate of $O_P(n^{-\frac{1}{2}+\epsilon})$ for functions in

quadrature rules in Chapter 3. However, Stein points have the significant advantage that they can be used without access to a kernel mean. The result in this theorem also does not seem to match the impressive approximation properties highlighted in Figure 5.1 or Chen et al. [2018], indicating that there is most likely a gap between empirical results and the theory available for these algorithms.

To summarise, we have now proposed two algorithms, called Stein greedy and Stein herding, for the approximation of measures whose densities are only known up to normalisation constants. This is particularly useful in the case of Bayesian statistics, where the posterior often includes an intractable integral which is hard to approximate. In this section, we illustrated how these algorithms can be particularly efficient at this task, and gave theoretical backing for this performance.

In terms of theory, one question remains: is minimising a KSD a sensible objective for obtaining a point set? Or in other words, is the RKHS $\mathcal{H}_{k_\Pi}$ large enough to differentiate two measures? The answer to this question can be shown to be affirmative under several conditions on the base kernel and target measure. Gorham and Mackey [2017] (Sections 3.2 and 3.3) and Chen et al. [2018] (Section 5.2) provided sufficient conditions to guarantee convergence in distribution of $\hat{\Pi}_n$ to the target measure $\Pi$ for several heavy-tailed kernels (such as the inverse-multiquadric). This was later extended to pre-conditioned kernels in Chen et al. [2019].

In document Statistical computation with kernels (Page 153-157)