Kernel Stein Discrepancies - Statistical Inference and Computation with Intractable

Chapter 5 Statistical Inference and Computation with Intractable

5.1.2 Kernel Stein Discrepancies

In this section, we introduce a divergence based on the MMD in which the underlying RKHS has a kernel with properties that allow us to avoid intractability issues in the case of unnormalised densities. Our method is based on Stein’s method [Stein, 1972], which was first used as a tool for constructing a central limit theorem for dependent variables.

¹ Note that we changed the notation from Chapter 3 to emphasise that we now see the MMD as ...

Stein Discrepancies

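Stein’s method is based on three components: a probability measure $Q$, a function space $\mathcal{G}$ (called the Stein space), and an operator $\mathcal{T}_Q$ (called the Stein operator), which together satisfy the following equation, called Stein’s identity:

$$\int_{\mathcal{X}} \mathcal{T}_Q[g](x) \, P(dx) = 0 \quad \forall g \in \mathcal{G} \quad \Longleftrightarrow \quad P = Q. \tag{5.3}$$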

In this case, it is said that the Stein operator characterises the measure $Q$. Stein’s method has mostly been developed for analytic convergence results in probability theory; see the reviews by Barbour and Chen [2005], Chen et al. [2011], Barbour and Chen [2014] and Ross [2011]. More recently, it has also been used for several tasks in statistics: the analysis of maximum likelihood estimators [Anastasiou and Reinert, 2017, 2018; Anastasiou, 2017], the comparison of prior distributions in Bayesian inference [Ley et al., 2017; Ghaderinezhad and Ley, 2018] and goodness-of-fit testing [Gaunt et al., 2017]. Later in this section, we will also discuss applications to numerical integration [Oates et al., 2018, 2017c] and approximation of posterior measures [Chen et al., 2018, 2019].

Of course, finding triplets of probability measures, operators and function spaces which satisfy Stein’s identity (Equation 5.3) can be challenging. Under regularity conditions on $q$ (the density of $Q$), a common choice of operator when $\mathcal{X} = \mathbb{R}^d$ is linked to the generator of an overdamped Langevin equation [Barbour and Chen, 2005; Gorham et al., 2016] and is hence referred to as the Langevin Stein operator:

$$\mathcal{L}_Q[g](x) = \frac{\langle \nabla, q(x) g(x) \rangle}{q(x)} = \langle g(x), \nabla \log q(x) \rangle + \langle \nabla, g(x) \rangle, \tag{5.4}$$

where $\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_d)^\top$ and $\langle \nabla, g(x) \rangle = \sum_{j=1}^d \partial g_j(x)/\partial x_j$. This operator must operate on a Stein class $\mathcal{G}$ of vector-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. We can also choose operators based on infinitesimal generators of other diffusions; see for example the following generator of an Itô diffusion process:

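$$\mathcal{I}_Q[g](x) = \frac{\langle \nabla, q(x) \nabla g(x) \rangle}{q(x)} = \langle \nabla g(x), \nabla \log q(x) \rangle + \Delta g(x), \tag{5.5}$$

which can be used with Stein classes of scalar-valued functions on the domain $\mathcal{X}$, and where $\Delta g(x) = \langle \nabla, \nabla g(x) \rangle = \sum_{j=1}^d \partial^2 g(x)/\partial x_j^2$. There are many other such operators; for example, a generalised version of the above two is studied by Gorham et al. [2016]: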

$$\mathcal{S}_Q[g](x) = \frac{\langle \nabla, q(x) (a(x) + c(x)) g(x) \rangle}{q(x)}, \tag{5.6}$$

where $g : \mathcal{X} \to \mathbb{R}^d$ is a vector-valued function, $a : \mathcal{X} \to \mathbb{R}^{d \times d}$ is a positive semi-definite matrix-valued function and $c : \mathcal{X} \to \mathbb{R}^{d \times d}$ is a skew-symmetric matrix-valued function. Note that the three Stein operators above can be evaluated without knowledge of the normalisation constant of $q$. They are also all based on the generator of a diffusion process, and can be derived using the generator approach to Stein’s method, which was introduced in Barbour [1988]. The importance of the particular choice of Stein operator is unclear for the applications of interest in this thesis. The main property of interest here comes from the Stein identity, which allows us to construct zero-mean functions.
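The zero-mean property can be checked numerically. The following is a minimal sketch, not taken from the original text, for the Langevin Stein operator of Equation 5.4 in one dimension. It assumes $Q = N(0,1)$ and the test function $g(x) = \sin(x)$; the helper names `langevin_stein` and `gaussian_expectation` are hypothetical. Expectations are computed with Gauss-Hermite quadrature, and the operator only ever sees the score $\nabla \log q$, which does not depend on the normalisation constant.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def score_std_normal(x):
    # grad log q(x) for q(x) proportional to exp(-x^2 / 2);
    # the normalisation constant of q never appears.
    return -x


def langevin_stein(g, dg, score, x):
    # One-dimensional Langevin Stein operator (Equation 5.4):
    # L_Q[g](x) = g(x) * grad log q(x) + g'(x)
    return g(x) * score(x) + dg(x)


def gaussian_expectation(f, mean=0.0, n_nodes=60):
    # E[f(X)] for X ~ N(mean, 1) via probabilists' Gauss-Hermite quadrature.
    nodes, weights = hermegauss(n_nodes)
    weights = weights / weights.sum()  # normalise to a probability measure
    return float(np.sum(weights * f(nodes + mean)))


def stein_fn(x):
    # Image of g = sin under the Stein operator adapted to Q = N(0, 1).
    return langevin_stein(np.sin, np.cos, score_std_normal, x)


mean_under_q = gaussian_expectation(stein_fn, mean=0.0)  # P = Q
mean_under_p = gaussian_expectation(stein_fn, mean=0.5)  # P = N(0.5, 1)
print("P = Q        :", mean_under_q)
print("P = N(0.5, 1):", mean_under_p)
```

The first expectation should vanish up to quadrature error, while the second is non-zero because the operator is adapted to $Q$ rather than to the measure being integrated against.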

Kernel Stein Discrepancies

It turns out that the Stein identity (Equation 5.3) can be extremely useful for simplifying the expression of integral probability metrics (IPMs). In particular, it allows us to remove the problem of integration against one of the measures (which may have an unnormalised density). Taking the function class of the IPM to be the image of the Stein class under the corresponding Stein operator leads to a general class of divergences, called Stein discrepancies, first proposed by Gorham and Mackey [2015]:

$$\begin{aligned}
D_{\text{Stein}}(P_1 \| P_2) &= \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{T}_{P_2}[g](x) \, P_1(dx) - \int_{\mathcal{X}} \mathcal{T}_{P_2}[g](x) \, P_2(dx) \right| \\
&= \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{T}_{P_2}[g](x) \, P_1(dx) \right|,
\end{aligned} \tag{5.7}$$

where $\mathcal{T}_{P_2}$ is a Stein operator adapted to $P_2$, so that Equation 5.3 gives the second equality. Note that this expression will only be a divergence under regularity conditions on the function class $\mathcal{G}$. Intuitively, we want the function class to be large enough to differentiate the two measures well. When this is the case, we clearly have the property that whenever $P_1$ is equal to $P_2$, then $\int_{\mathcal{X}} \mathcal{T}_{P_2}[g](x) \, P_1(dx) = 0$, so the Stein discrepancy will have value zero. Restricting the supremum to the unit ball of an RKHS $\mathcal{H}_k$ gives the general kernel Stein discrepancy (KSD):

$$\mathrm{KSD}(P_1 \| P_2) := \sup_{\|g\|_{\mathcal{H}_k} \leq 1} \left| \int_{\mathcal{X}} \mathcal{T}_{P_2}[g](x) \, P_1(dx) \right|. \tag{5.8}$$

Note that the choice of base RKHS could also be optimised, as proposed in Jitkrittum et al. [2017]. Alternative choices of Stein classes are also possible; see for example the complete graph Stein discrepancies and spanner Stein graph discrepancies of Gorham and Mackey [2015] or the random feature Stein discrepancies of Huggins and Mackey [2018]. Larger function classes could also be used, but they will tend to make the Stein discrepancy intractable.

If the Stein operator maps scalar-valued functions to other scalar-valued functions, we will take the function class $\mathcal{G}$ to be an RKHS $\mathcal{H}_k$ with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Alternatively, if the Stein operator maps vector-valued functions to scalar-valued functions, we will take the function class $\mathcal{G}$ to be the unit ball of a vector-valued RKHS which takes the form of the tensor product space $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ (also sometimes written as $\mathcal{H}_k^d$, where $d \in \mathbb{N}$ is the number of elements in the tensor). In either case, under regularity conditions, the image of $\mathcal{G}$ under a Stein operator $\mathcal{T}_P$ is a scalar-valued RKHS, denoted $\mathcal{H}_{k_P}$. When this is the case, the kernel $k_P : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ of $\mathcal{H}_{k_P}$ is called a Stein reproducing kernel and takes the form $k_P(x, x') = \mathcal{T}_P \bar{\mathcal{T}}_P k(x, x')$, where $k$ is called the base kernel. Here, $\bar{\mathcal{T}}_P$ corresponds to the operator $\mathcal{T}_P$ acting on the second argument of the function. Note that the notation $k_P$ emphasises the distribution $P$ to which the Stein kernel is adapted. The KSD can alternatively be obtained from the MMD with a Stein kernel adapted to the second argument of the discrepancy, and can hence be expressed as:

$$\begin{aligned}
\mathrm{KSD}(P_1 \| P_2)^2 &= \int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_1(dx) P_1(dx') - 2 \int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_1(dx) P_2(dx') \\
&\quad + \int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_2(dx) P_2(dx') \\
&= \int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_1(dx) P_1(dx').
\end{aligned} \tag{5.9}$$

The expression above was simplified using the fact that Stein reproducing kernels are elements of a Stein class corresponding to a Stein operator $\mathcal{T}_{P_2}$, and hence possess the useful property that the kernel mean satisfies $\int_{\mathcal{X}} k_{P_2}(x, x') \, P_2(dx) = 0$, and hence $\int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_2(dx) P_2(dx') = 0$ and $\int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_1(dx) P_2(dx') = 0$. This is the main property of interest from the point of view of computational statistics.

Clearly, the expression above may no longer be a metric, since it might not be symmetric: the kernel depends on one of the arguments. However, under regularity assumptions on the base kernel, the expression above will be a statistical divergence. Recall that a kernel is called characteristic if and only if the corresponding MMD is a probability metric. To parallel this notion, we will call a Stein kernel a characteristic Stein reproducing kernel if and only if the corresponding KSD is a statistical divergence. This is a strong assumption on the Stein kernel, which will need to be checked on a case-by-case basis.
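The zero-mean property used to simplify Equation 5.9 follows in one step from Stein’s identity. The following is a sketch for the scalar-valued case, assuming that $h_{x'} := \bar{\mathcal{T}}_{P_2} k(\cdot, x')$ (a symbol introduced here purely for illustration) belongs to the Stein class $\mathcal{G}$ for every $x'$, that $k_{P_2}$ is symmetric, and that Fubini’s theorem applies:

$$\begin{aligned}
\int_{\mathcal{X}} k_{P_2}(x, x') \, P_2(dx) &= \int_{\mathcal{X}} \mathcal{T}_{P_2}[h_{x'}](x) \, P_2(dx) = 0 && \text{(Equation 5.3 with } P = Q = P_2\text{)}, \\
\int_{\mathcal{X} \times \mathcal{X}} k_{P_2}(x, x') \, P_1(dx) P_2(dx') &= \int_{\mathcal{X}} \left[ \int_{\mathcal{X}} k_{P_2}(x, x') \, P_2(dx') \right] P_1(dx) = 0 && \text{(symmetry of } k_{P_2}\text{)}.
\end{aligned}$$

The term integrating against $P_2$ twice vanishes by the same argument, which leaves only the double integral against $P_1$.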

When the first argument is an empirical measure $Q_n = \sum_{i=1}^n w_i \delta(x_i)$ approximating some measure $Q$, the expression further simplifies to:

$$\mathrm{KSD}(Q_n \| P) = \sqrt{\sum_{i,j=1}^n w_i w_j k_P(x_i, x_j)}.$$

The equation above can be seen as an exact expression for the KSD between $Q_n$ and $P$, or as an approximation of the KSD between $Q$ and $P$.

Langevin Kernel Stein Discrepancies

We will now focus on the case where $\mathcal{G}$ is a vector-valued RKHS $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ and where the operator is the Langevin Stein operator in Equation 5.4 adapted to some measure $P$. In this case, we have a Stein reproducing kernel of the form [Oates and Girolami, 2016; Oates et al., 2017c, 2018]:

$$\begin{aligned}
k_P(x, x') &= \langle \nabla_1, \nabla_2 k(x, x') \rangle + \langle \nabla_1 k(x, x'), \nabla \log p(x') \rangle \\
&\quad + \langle \nabla_2 k(x, x'), \nabla \log p(x) \rangle + k(x, x') \langle \nabla \log p(x), \nabla \log p(x') \rangle,
\end{aligned} \tag{5.10}$$

where $\nabla_1 k(x, y) = (\partial k(x, y)/\partial x_1, \ldots, \partial k(x, y)/\partial x_d)^\top$ and $\nabla_2 k(x, y) = (\partial k(x, y)/\partial y_1, \ldots, \partial k(x, y)/\partial y_d)^\top$. We now have a kernel which depends on the measure $P$, but notice that it only depends on it through $\nabla \log p$, which itself can be evaluated without access to the normalisation constant of $p$. The KSD between two measures $P_1$ and $P_2$ with continuously differentiable densities is hence:
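To make the Langevin Stein kernel concrete, the sketch below evaluates Equation 5.10 on a set of points and plugs the resulting Gram matrix into the weighted-sum formula for $\mathrm{KSD}(Q_n \| P)$. It is an illustration under assumptions that are not in the original text: the base kernel is taken to be Gaussian, $k(x, x') = \exp(-\|x - x'\|^2 / (2\ell^2))$, the target is $N(0, 1)$, and the function names `langevin_stein_kernel` and `ksd` and the parameter `lengthscale` are hypothetical.

```python
import numpy as np


def langevin_stein_kernel(x, score, lengthscale=1.0):
    """Gram matrix of the Langevin Stein kernel k_P (Equation 5.10),
    assuming a Gaussian base kernel.

    x:     (n, d) array of points x_1, ..., x_n
    score: (n, d) array with rows grad log p(x_i)
    """
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]            # (n, n, d): x_i - x_j
    sqdist = np.sum(diff**2, axis=-1)               # squared distances
    k = np.exp(-sqdist / (2.0 * lengthscale**2))    # base kernel k(x_i, x_j)
    grad1 = -diff / lengthscale**2 * k[..., None]   # gradient in first argument
    grad2 = -grad1                                  # gradient in second argument
    # <grad_1, grad_2 k>: sum over coordinates of mixed second derivatives
    term1 = (d / lengthscale**2 - sqdist / lengthscale**4) * k
    term2 = np.einsum("ijk,jk->ij", grad1, score)   # <grad_1 k, score(x_j)>
    term3 = np.einsum("ijk,ik->ij", grad2, score)   # <grad_2 k, score(x_i)>
    term4 = k * (score @ score.T)                   # k <score(x_i), score(x_j)>
    return term1 + term2 + term3 + term4


def ksd(x, weights, score, lengthscale=1.0):
    """KSD between the weighted empirical measure sum_i w_i delta(x_i) and P."""
    gram = langevin_stein_kernel(x, score, lengthscale)
    value = weights @ gram @ weights
    return float(np.sqrt(max(value, 0.0)))          # guard against round-off


# Usage: equally weighted draws from N(1, 1) compared with the target N(0, 1).
# Only the score of the target, grad log p(x) = -x, is needed.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=1.0, size=(200, 1))
uniform_weights = np.full(200, 1.0 / 200)
print(ksd(samples, uniform_weights, score=-samples))
```

The cost is quadratic in the number of points because the full Gram matrix is formed, which is acceptable for a sketch but may need to be revisited for large samples.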

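$$\begin{aligned}
\mathrm{KSD}(P_1 \| P_2) &= \left\| \int_{\mathcal{X}} \left[ k(x, \cdot) \nabla \log p_2(x) + \nabla_x k(x, \cdot) \right] P_1(dx) \right\|_{\mathcal{H}_k^d} \\
&= \left( \int_{\mathcal{X} \times \mathcal{X}} \langle \nabla \log p_2(x) - \nabla \log p_1(x), \nabla \log p_2(x') - \nabla \log p_1(x') \rangle \, k(x, x') \, P_1(dx) P_1(dx') \right)^{1/2},
\end{aligned}$$

which can be seen either as a Stein discrepancy with Stein space $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ or as the MMD with the underlying Langevin Stein kernel given in Equation 5.10, but adapted to $P_2$. The KSD with the Langevin Stein operator is a statistical divergence whenever it is based on a characteristic Stein kernel, which imposes certain regularity conditions on the base reproducing kernel $k$ and the densities of the two measures. We now present several sufficient conditions for the property to hold (in all cases $\mathcal{X} \subseteq \mathbb{R}^d$):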

• Theorem 2.2 in Chwialkowski et al. [2016] shows that the Langevin KSD is a divergence if the kernel $k$ is $C_0$-universal, $P_1$ and $P_2$ both admit continuously differentiable densities $p_1$ and $p_2$, $\int_{\mathcal{X}} \|\nabla \log p_2(x) - \nabla \log p_1(x)\|_2^2 \, P_1(dx) < \infty$ and $\int_{\mathcal{X}} k_{P_2}(x, x) \, P_1(dx) < \infty$.

• Proposition 3.3 in Liu et al. [2016] shows that the Langevin KSD is a divergence if the kernel $k$ is integrally strictly positive definite, $P_1$ and $P_2$ admit continuous densities $p_1$ and $p_2$, and $\int_{\mathcal{X}} \|\nabla \log p_2(x) - \nabla \log p_1(x)\|_2^2 \, P_1(dx) < \infty$.
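As a worked illustration of these conditions, consider a one-dimensional example that is not part of the original text: $P_1 = N(m, 1)$, $P_2 = N(0, 1)$ and a Gaussian base kernel $k(x, x') = \exp(-(x - x')^2 / (2\ell^2))$, where the shift $m$ and lengthscale $\ell$ are symbols assumed here. Using the score-difference representation of the Langevin KSD, with the base kernel inside the double integral against $P_1$, everything can be evaluated in closed form:

$$\begin{aligned}
\nabla \log p_2(x) - \nabla \log p_1(x) &= -x + (x - m) = -m, \\
\int_{\mathcal{X}} \|\nabla \log p_2(x) - \nabla \log p_1(x)\|_2^2 \, P_1(dx) &= m^2 < \infty, \\
\int_{\mathcal{X}} k_{P_2}(x, x) \, P_1(dx) &= \int_{\mathcal{X}} \left( \frac{1}{\ell^2} + x^2 \right) P_1(dx) = \frac{1}{\ell^2} + 1 + m^2 < \infty, \\
\mathrm{KSD}(P_1 \| P_2)^2 &= m^2 \int_{\mathcal{X} \times \mathcal{X}} k(x, x') \, P_1(dx) P_1(dx') = \frac{m^2 \ell}{\sqrt{\ell^2 + 2}}.
\end{aligned}$$

The last step uses the fact that $x - x' \sim N(0, 2)$ when $x$ and $x'$ are independent draws from $P_1$. Both integrability conditions hold, and the discrepancy is zero only when $m = 0$, that is, when $P_1 = P_2$, which is the behaviour expected of a divergence.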

This Langevin kernel Stein discrepancy has recently been used for several tasks across statistics, including hypothesis testing [Chwialkowski et al., 2016; Liu et al., 2016], sampling [Liu and Wang, 2016; Liu and Lee, 2017; Liu, 2017] and assessing the convergence of sampling methods [Gorham and Mackey, 2017]. For the remainder of this section, we will highlight two more applications: the approximation of posterior measures, using a method called Stein points [Chen et al., 2018, 2019], and the construction of control variates in MC and MCMC integration [Oates et al., 2017c, 2018].
