Chapter 5 Statistical Inference and Computation with Intractable
5.1.2 Kernel Stein Discrepancies
In this section, we introduce a divergence based on MMD where the underlying RKHS has a kernel with certain properties which allow us to avoid intractability issues in the case of unnormalised densities. Our method is based on Stein’s method [Stein, 1972], which was first used as a tool for constructing a central limit theorem for dependent variables.
1 Note that we changed the notation from Chapter 3 to emphasise that we now see the MMD as
Stein Discrepancies
Stein’s method is based on three components: a probability measure $\mathbb{Q}$, a function space $\mathcal{G}$ (called Stein space), and an operator $\mathcal{T}_{\mathbb{Q}}$ (called Stein operator), which together satisfy the following equation called Stein’s identity:
\[ \int_{\mathcal{X}} \mathcal{T}_{\mathbb{Q}}[g](x) \, \mathbb{P}(dx) = 0 \quad \forall g \in \mathcal{G} \quad \Leftrightarrow \quad \mathbb{P} = \mathbb{Q}. \tag{5.3} \]
In this case, it is said that the Stein operator characterises the measure $\mathbb{Q}$. Stein’s method has mostly been developed for analytic convergence results in probability theory; see the reviews by Barbour and Chen [2005]; Chen et al. [2011]; Barbour and Chen [2014]; Ross [2011]. More recently, it has also been used for several tasks in statistics: the analysis of maximum likelihood estimators [Anastasiou and Reinert, 2017, 2018; Anastasiou, 2017], the comparison of prior distributions in Bayesian inference [Ley et al., 2017; Ghaderinezhad and Ley, 2018] and goodness-of-fit testing [Gaunt et al., 2017]. Later in this section, we will also discuss applications to numerical integration [Oates et al., 2018, 2017c] and approximation of posterior measures [Chen et al., 2018, 2019].
Of course, finding triplets of probability measures, operators and function spaces which satisfy Stein’s identity (Equation 5.3) can be challenging. Under regularity conditions on $q$ (the density of $\mathbb{Q}$), a common choice of operator when $\mathcal{X} = \mathbb{R}^d$ is linked to the generator of an overdamped Langevin equation [Barbour and Chen, 2005; Gorham et al., 2016] and is hence referred to as the Langevin Stein operator:
\[ \mathcal{L}_{\mathbb{Q}}[g](x) = \frac{\langle \nabla, q(x) g(x) \rangle}{q(x)} = \langle g(x), \nabla \log q(x) \rangle + \langle \nabla, g(x) \rangle, \tag{5.4} \]
where $\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_d)^\top$ and $\langle \nabla, g(x) \rangle = \sum_{j=1}^d \partial g_j(x)/\partial x_j$. This operator must operate on a Stein class $\mathcal{G}$ of vector-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}^d$. We can also choose operators based on infinitesimal generators of other diffusions; see for example the following generator of an Itô diffusion process:
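\[ \mathcal{I}_{\mathbb{Q}}[g](x) = \frac{\langle \nabla, q(x) \nabla g(x) \rangle}{q(x)} = \langle \nabla g(x), \nabla \log q(x) \rangle + \Delta g(x), \tag{5.5} \]
which can be used with Stein classes of scalar-valued functions on the domain $\mathcal{X}$, and where $\Delta g(x) = \langle \nabla, \nabla g(x) \rangle = \sum_{j=1}^d \partial^2 g(x)/\partial x_j^2$. There are many other such operators; for example, a generalised version of the above two is studied by Gorham et al. [2016]: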
\[ \mathcal{S}_{\mathbb{Q}}[g](x) = \frac{\langle \nabla, q(x) (a(x) + c(x)) g(x) \rangle}{q(x)}, \tag{5.6} \]
where $g : \mathcal{X} \to \mathbb{R}^d$ is a vector-valued function, $a : \mathcal{X} \to \mathbb{R}^{d \times d}$ is a positive semi-definite matrix-valued function and $c : \mathcal{X} \to \mathbb{R}^{d \times d}$ is a skew-symmetric matrix-valued function. Note that the three Stein operators above can be evaluated without knowledge of the normalisation constant of $q$. They are also all based on the generator of a diffusion process, and can be derived using the generator approach to Stein’s method, which was introduced in Barbour [1988]. The importance of the particular choice of Stein operator is unclear for the applications of interest in this thesis. The main property of interest here comes from the Stein identity which allows us to construct zero-mean functions.
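As a rough numerical illustration of the last two points, the sketch below evaluates the Langevin Stein operator of Equation 5.4 in one dimension. It assumes a standard Gaussian target with unnormalised density $q(x) \propto \exp(-x^2/2)$ and the test function $g = \tanh$; both are chosen purely for illustration and are not taken from the text. Only $\nabla \log q$ is needed, so the normalisation constant never appears.

```python
import numpy as np


def grad_log_q(x):
    # Score of the assumed unnormalised density q(x) ∝ exp(-x^2 / 2).
    # The normalising constant drops out when taking the log-gradient.
    return -x


def langevin_stein(g, dg, x):
    # Equation 5.4 with d = 1: L_Q[g](x) = g(x) * (log q)'(x) + g'(x).
    return g(x) * grad_log_q(x) + dg(x)


def g(x):
    # Illustrative test function (an assumption, not from the text).
    return np.tanh(x)


def dg(x):
    # Derivative of tanh.
    return 1.0 - np.tanh(x) ** 2


rng = np.random.default_rng(0)
x_q = rng.standard_normal(200_000)        # samples from Q = N(0, 1)
x_p = rng.standard_normal(200_000) + 1.0  # samples from a shifted P = N(1, 1)

# Monte Carlo estimates of the integral of L_Q[g] against Q and against P.
mean_under_q = langevin_stein(g, dg, x_q).mean()
mean_under_p = langevin_stein(g, dg, x_p).mean()
print(mean_under_q, mean_under_p)
```

Under these assumptions, the average under samples from $\mathbb{Q}$ should be close to zero up to Monte Carlo error, while the average under the shifted measure should be clearly non-zero, in line with Stein’s identity.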
Kernel Stein Discrepancies
It turns out that the Stein identity (Equation 5.3) can be extremely useful to simplify the expression of integral probability metrics. In particular, it allows us to remove the problem of integration against one of the measures (which may have had an unnormalised density). Taking the function class of the IPM to be the image of functions in the Stein class through the corresponding Stein operator leads to a general class of divergences, called Stein discrepancy, and first proposed by Gorham and Mackey [2015]:
\[ D_{\mathrm{Stein}}(\mathbb{P}_1 \| \mathbb{P}_2) = \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{T}_{\mathbb{P}_2}[g](x) \, \mathbb{P}_1(dx) - \int_{\mathcal{X}} \mathcal{T}_{\mathbb{P}_2}[g](x) \, \mathbb{P}_2(dx) \right| = \sup_{g \in \mathcal{G}} \left| \int_{\mathcal{X}} \mathcal{T}_{\mathbb{P}_2}[g](x) \, \mathbb{P}_1(dx) \right|, \tag{5.7} \]
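where $\mathcal{T}_{\mathbb{P}_2}$ is a Stein operator adapted to $\mathbb{P}_2$ and we can hence use Equation 5.3 to obtain the second equality. Note that this expression will only be a divergence under regularity conditions on the function class $\mathcal{G}$. Intuitively, we want the function class to be large enough to differentiate the two measures well. When this is the case, we clearly will have the property that whenever $\mathbb{P}_1$ is equal to $\mathbb{P}_2$, then $\int_{\mathcal{X}} \mathcal{T}_{\mathbb{P}_2}[g](x) \, \mathbb{P}_1(dx) = 0$, so the Stein divergence will have value zero. Taking the supremum over the unit ball of an RKHS $\mathcal{H}_k$ gives the kernel Stein discrepancy (KSD):
\[ \mathrm{KSD}(\mathbb{P}_1 \| \mathbb{P}_2) := \sup_{\|g\|_{\mathcal{H}_k} \leq 1} \left| \int_{\mathcal{X}} \mathcal{T}_{\mathbb{P}_2}[g](x) \, \mathbb{P}_1(dx) \right|. \tag{5.8} \]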
Note that the choice of base RKHS could also be optimised, as proposed in Jitkrittum et al. [2017]. Alternative choices of Stein classes are also possible; see for example the complete graph Stein discrepancies and spanner Stein graph discrepancies of Gorham and Mackey [2015] or the random feature Stein discrepancies of Huggins and Mackey [2018]. Larger function classes could also be used, but they will tend to make the Stein discrepancy intractable.
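If the Stein operator maps scalar-valued functions to other scalar-valued functions, we will take the function class $\mathcal{G}$ to be an RKHS $\mathcal{H}_k$ with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Alternatively, if the Stein operator maps vector-valued functions to scalar-valued functions, we will take the function class $\mathcal{G}$ to be the unit ball of some vector-valued RKHS which takes the form of the tensor product space $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ (also sometimes written as $\mathcal{H}_k^d$, where $d \in \mathbb{N}$ is the number of elements in the tensor). In either case, under regularity conditions, the image of $\mathcal{G}$ under a Stein operator $\mathcal{T}_{\mathbb{P}}$ is a scalar-valued RKHS, denoted $\mathcal{H}_{k_{\mathbb{P}}}$. When this is the case, the kernel $k_{\mathbb{P}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ of $\mathcal{H}_{k_{\mathbb{P}}}$ is called a Stein reproducing kernel and takes the form $k_{\mathbb{P}}(x, x') = \mathcal{T}_{\mathbb{P}} \bar{\mathcal{T}}_{\mathbb{P}} k(x, x')$, where $k$ is called a base kernel. Here, $\bar{\mathcal{T}}_{\mathbb{P}}$ corresponds to the operator $\mathcal{T}_{\mathbb{P}}$ but acting on the second argument of the function. Note that we emphasise the distribution $\mathbb{P}$ to which the Stein kernel is adapted in the notation $k_{\mathbb{P}}$. The KSD can alternatively be obtained from the MMD with a Stein kernel adapted to the second argument of the discrepancy, and can hence be expressed as:
\[ \begin{aligned} \mathrm{KSD}(\mathbb{P}_1 \| \mathbb{P}_2)^2 &= \int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_1(dx) \mathbb{P}_1(dx') - 2 \int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_1(dx) \mathbb{P}_2(dx') + \int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_2(dx) \mathbb{P}_2(dx') \\ &= \int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_1(dx) \mathbb{P}_1(dx'). \end{aligned} \tag{5.9} \]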
The expression above was simplified using the fact that Stein reproducing kernels are elements of a Stein class corresponding to a Stein operator $\mathcal{T}_{\mathbb{P}_2}$, and hence possess the useful property that the kernel mean satisfies $\int_{\mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_2(dx) = 0$, and hence $\int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_2(dx) \mathbb{P}_2(dx') = 0$ and $\int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_1(dx) \mathbb{P}_2(dx') = 0$. This is the main property of interest from the point of view of computational statistics.
Clearly, the expression above may not be a metric anymore since it might not be symmetric as the kernel depends on one of the arguments. However, under regularity assumptions on the base kernel, the expression above will be a statistical divergence. Recall that a kernel is called characteristic if and only if the corresponding MMD is a probability metric. To parallel this notion, we will call a Stein kernel a characteristic Stein reproducing kernel if and only if the corresponding KSD is a statistical divergence. This will be a strong assumption on the Stein kernel which will need to be checked on a case-by-case basis.
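The passage from the supremum in Equation 5.8 to the double integral in Equation 5.9 can be sketched as follows. This is a standard kernel mean embedding argument, written under two assumptions that are not verified here: that the image of the unit ball of $\mathcal{G}$ under $\mathcal{T}_{\mathbb{P}_2}$ is the unit ball of $\mathcal{H}_{k_{\mathbb{P}_2}}$, and that $\int_{\mathcal{X}} \sqrt{k_{\mathbb{P}_2}(x, x)} \, \mathbb{P}_1(dx) < \infty$ so that the embedding $\mu_{\mathbb{P}_1}$ (a symbol introduced only for this sketch) is well defined.

```latex
\begin{align*}
\mathrm{KSD}(\mathbb{P}_1 \| \mathbb{P}_2)
  &= \sup_{\|h\|_{\mathcal{H}_{k_{\mathbb{P}_2}}} \leq 1}
     \left| \int_{\mathcal{X}} h(x) \, \mathbb{P}_1(dx) \right| \\
  &= \sup_{\|h\|_{\mathcal{H}_{k_{\mathbb{P}_2}}} \leq 1}
     \left| \langle h, \mu_{\mathbb{P}_1} \rangle_{\mathcal{H}_{k_{\mathbb{P}_2}}} \right|,
     \qquad \mu_{\mathbb{P}_1} := \int_{\mathcal{X}} k_{\mathbb{P}_2}(\cdot, x) \, \mathbb{P}_1(dx)
     && \text{(reproducing property)} \\
  &= \| \mu_{\mathbb{P}_1} \|_{\mathcal{H}_{k_{\mathbb{P}_2}}}
     && \text{(Cauchy--Schwarz)} \\
\mathrm{KSD}(\mathbb{P}_1 \| \mathbb{P}_2)^2
  &= \langle \mu_{\mathbb{P}_1}, \mu_{\mathbb{P}_1} \rangle_{\mathcal{H}_{k_{\mathbb{P}_2}}}
   = \int_{\mathcal{X} \times \mathcal{X}} k_{\mathbb{P}_2}(x, x') \, \mathbb{P}_1(dx) \mathbb{P}_1(dx').
\end{align*}
```

The supremum in the third line is attained at $h = \mu_{\mathbb{P}_1} / \| \mu_{\mathbb{P}_1} \|$ whenever $\mu_{\mathbb{P}_1} \neq 0$.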
When the first argument is an empirical measure $\mathbb{Q}_n = \sum_{i=1}^n w_i \delta(x_i)$ approximating some measure $\mathbb{Q}$, the expression further simplifies to:
\[ \mathrm{KSD}(\mathbb{Q}_n \| \mathbb{P}) = \sqrt{\sum_{i,j=1}^n w_i w_j k_{\mathbb{P}}(x_i, x_j)}. \]
The equation above can be seen as an exact expression for the KSD between $\mathbb{Q}_n$ and $\mathbb{P}$, or an approximation of the KSD between $\mathbb{Q}$ and $\mathbb{P}$.
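The finite-sample expression above is a quadratic form in the weights, which can be sketched in Python as follows. The function name `ksd_from_gram`, and the assumption that the Stein kernel has already been evaluated into a Gram matrix with entries $k_{\mathbb{P}}(x_i, x_j)$, are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np


def ksd_from_gram(K, w):
    """Return sqrt(sum_ij w_i w_j K_ij) for a Stein-kernel Gram matrix K.

    K is assumed to hold k_P(x_i, x_j); w holds the weights of the
    empirical measure. Both names are hypothetical.
    """
    K = np.asarray(K, dtype=float)
    w = np.asarray(w, dtype=float)
    value = float(w @ K @ w)
    # Clip tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(value, 0.0)))


# Toy usage with a hand-written positive semi-definite matrix.
K_toy = np.array([[2.0, 0.0], [0.0, 2.0]])
w_toy = np.array([0.5, 0.5])
print(ksd_from_gram(K_toy, w_toy))
```

The cost is quadratic in the number of points $n$, since every pair $(x_i, x_j)$ contributes one kernel evaluation.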
Langevin Kernel Stein Discrepancies
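We will now focus on the case where $\mathcal{G}$ is a vector-valued RKHS $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ and where the operator is the Langevin Stein operator in Equation 5.4 adapted to some measure $\mathbb{P}$. In this case, we have a Stein reproducing kernel of the form [Oates and Girolami, 2016; Oates et al., 2017c, 2018]:
\[ \begin{aligned} k_{\mathbb{P}}(x, x') = {} & \langle \nabla_1, \nabla_2 k(x, x') \rangle + \langle \nabla_1 k(x, x'), \nabla \log p(x') \rangle \\ & + \langle \nabla_2 k(x, x'), \nabla \log p(x) \rangle + k(x, x') \langle \nabla \log p(x), \nabla \log p(x') \rangle, \end{aligned} \tag{5.10} \]
where $\nabla_1 k(x, y) = (\partial k(x, y)/\partial x_1, \ldots, \partial k(x, y)/\partial x_d)^\top$ and $\nabla_2 k(x, y) = (\partial k(x, y)/\partial y_1, \ldots, \partial k(x, y)/\partial y_d)^\top$. We now have a kernel which depends on the measure $\mathbb{P}$, but notice that it only depends on it through $\nabla \log p$, which itself can be evaluated without access to the normalisation constant of $p$. The KSD between two measures $\mathbb{P}_1$ and $\mathbb{P}_2$ with continuously differentiable densities is hence: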
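\[ \begin{aligned} \mathrm{KSD}(\mathbb{P}_1 \| \mathbb{P}_2) &= \left\| \int_{\mathcal{X}} \left[ \langle k(x, \cdot), \nabla \log p_2(x) \rangle + \langle \nabla_x, k(x, \cdot) \rangle \right] \mathbb{P}_1(dx) \right\|_{\mathcal{H}_k} \\ &= \left( \int_{\mathcal{X} \times \mathcal{X}} k(x, x') \langle \nabla \log p_2(x) - \nabla \log p_1(x), \nabla \log p_2(x') - \nabla \log p_1(x') \rangle \, \mathbb{P}_1(dx) \mathbb{P}_1(dx') \right)^{1/2}, \end{aligned} \]
which can be seen either as a Stein discrepancy with Stein space $\mathcal{H}_k \otimes \ldots \otimes \mathcal{H}_k$ or as the MMD with underlying Langevin Stein kernel as given in Equation 5.10, but adapted to $\mathbb{P}_2$. The KSD with Langevin Stein operator is a statistical divergence whenever it is based on a characteristic Stein kernel, which will impose certain regularity conditions on the base reproducing kernel $k$ and the densities of the two measures. We now present several sufficient conditions for the property to hold (in all cases $\mathcal{X} \subseteq \mathbb{R}^d$):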
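• Theorem 2.2 in Chwialkowski et al. [2016] shows that the Langevin KSD is a divergence if the kernel $k$ is $C_0$-universal, $\mathbb{P}_1$ and $\mathbb{P}_2$ both admit continuously differentiable densities $p_1$ and $p_2$, and $\int_{\mathcal{X}} \|\nabla \log p_2(x) - \nabla \log p_1(x)\|_2^2 \, \mathbb{P}_1(dx) < \infty$ and $\int_{\mathcal{X}} k_{\mathbb{P}_2}(x, x) \, \mathbb{P}_1(dx) < \infty$.
• Proposition 3.3 in Liu et al. [2016] shows that the Langevin KSD is a divergence if the kernel $k$ is integrally strictly positive definite, $\mathbb{P}_1$ and $\mathbb{P}_2$ admit continuous densities $p_1$ and $p_2$, and $\int_{\mathcal{X}} \|\nabla \log p_2(x) - \nabla \log p_1(x)\|_2^2 \, \mathbb{P}_1(dx) < \infty$.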
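To make the Langevin Stein kernel of Equation 5.10 concrete, the following sketch assembles its Gram matrix for a Gaussian base kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / (2\ell^2))$ and evaluates the KSD of an equally weighted sample. The base kernel, the standard Gaussian target (with score $\nabla \log p(x) = -x$), the lengthscale $\ell$ and all function names are assumptions made for illustration. The sufficient conditions listed above are not checked by the code.

```python
import numpy as np


def langevin_stein_gram(x, score, lengthscale=1.0):
    """Gram matrix of Equation 5.10 for an assumed Gaussian base kernel.

    x: array of shape (n, d); score: callable returning grad log p at each
    row of x (shape (n, d)). Only the score is needed, so p may be
    unnormalised.
    """
    n, d = x.shape
    s = score(x)                              # s[i] = grad log p(x_i)
    r = x[:, None, :] - x[None, :, :]         # r[i, j] = x_i - x_j
    sq = np.sum(r ** 2, axis=-1)              # squared distances
    l2 = lengthscale ** 2
    k = np.exp(-sq / (2.0 * l2))              # base kernel values
    # <grad_1, grad_2 k> = k * (d / l^2 - ||r||^2 / l^4)
    div_term = d / l2 - sq / l2 ** 2
    # <grad_1 k, s(x_j)> + <grad_2 k, s(x_i)> = k * <r, s(x_i) - s(x_j)> / l^2
    ds = s[:, None, :] - s[None, :, :]
    cross = np.einsum("ijd,ijd->ij", r, ds) / l2
    # k * <s(x_i), s(x_j)>
    ss = s @ s.T
    return k * (div_term + cross + ss)


def ksd(x, score, lengthscale=1.0):
    # Equal weights w_i = 1/n, so sum_ij w_i w_j K_ij is the mean of K.
    K = langevin_stein_gram(x, score, lengthscale)
    return float(np.sqrt(max(float(K.mean()), 0.0)))


def gaussian_score(x):
    # Score of the assumed target N(0, I).
    return -x


rng = np.random.default_rng(0)
good = rng.standard_normal((500, 2))  # sample from the target
bad = good + 1.5                      # sample from a shifted Gaussian
ksd_good = ksd(good, gaussian_score)
ksd_bad = ksd(bad, gaussian_score)
print(ksd_good, ksd_bad)
```

Under these assumptions, the sample drawn from the target should give a noticeably smaller value than the shifted sample. The value for the target sample is not exactly zero because the double sum includes the diagonal terms $k_{\mathbb{P}}(x_i, x_i)$.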
This Langevin kernel Stein discrepancy was recently used for several tasks across statistics, including hypothesis testing [Chwialkowski et al., 2016; Liu et al., 2016], sampling [Liu and Wang, 2016; Liu and Lee, 2017; Liu, 2017] and convergence of sampling methods [Gorham and Mackey, 2017]. For the remainder of this section, we will highlight two more applications: the approximation of posterior measures, using a method called Stein points [Chen et al., 2018, 2019], and the construction of control variates in MC and MCMC integration [Oates et al., 2017c, 2018].