Chapter 5 Statistical Inference and Computation with Intractable
5.1.1 Distances on Probability Measures
As discussed, we would like to have an easily computable notion of distance between two complex probability measures, such as a statistical divergence. Let $\mathcal{X}$ be a metric space, and denote by $\mathcal{P}(\mathcal{X})$ the set of Borel probability measures on this space. Statistical divergences are functions of the form $D : \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}_+$ that satisfy $D(P_1 \| P_2) = 0$ if and only if $P_1 = P_2$, for all $P_1, P_2 \in \mathcal{P}(\mathcal{X})$. Divergences are usually not symmetric and do not satisfy the triangle inequality. Divergences have many uses in statistical computation including, amongst other examples, inference in statistical models [Kass and Vos, 1997], the construction of novel variational inference schemes [Jordan et al., 1999; Blei et al., 2017], numerical optimisation algorithms [Amari, 1998; Karakida et al., 2016] and robust inference [Knoblauch et al., 2018]. As highlighted below, there exist many divergences with useful “principled” properties, but a common drawback is that they are hard or impossible to compute for most complex models.
The most commonly used divergence is the Kullback-Leibler (KL) divergence:
\[
D_{\mathrm{KL}}(P_1 \| P_2) := \int_{\mathcal{X}} \log\left(\frac{\mathrm{d}P_1}{\mathrm{d}P_2}\right) \mathrm{d}P_1, \tag{5.1}
\]
where $\mathrm{d}P_1/\mathrm{d}P_2$ is the Radon-Nikodym derivative of $P_1$ with respect to $P_2$. The KL divergence is closely linked to the field of information complexity (where it is often called the information gain or relative entropy), and is also popular due to its invariance to transformations of the coordinates of $\mathcal{X}$ and its convexity in the first argument. In fact, the KL divergence is a special case of two important classes of divergences: the f-divergences and the Bregman divergences [Amari, 2016]. The former is a class of divergences of the form $D_f(P_1 \| P_2) = \int_{\mathcal{X}} f(\mathrm{d}P_1/\mathrm{d}P_2) \, \mathrm{d}P_2$ for some convex function $f$ satisfying $f(1) = 0$, which includes the squared Hellinger distance ($f(x) = (\sqrt{x} - 1)^2$) and the total-variation distance ($f(x) = \frac{1}{2}|x - 1|$).
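As a small illustration of the f-divergence template, the following sketch evaluates $D_f$ on a finite sample space, where the Radon-Nikodym derivative reduces to a ratio of probability mass functions. The generator $f(x) = x \log x$ used for the KL divergence is the standard choice but is not stated explicitly in the text; the function names and the two example distributions are ours and purely illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P1 || P2) = sum_x q(x) f(p(x) / q(x)) on a finite space.

    Assumes q(x) > 0 everywhere, so that the ratio p/q is well defined.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

def f_kl(r):
    # f(r) = r log r, with the convention 0 log 0 = 0
    safe = np.where(r > 0, r, 1.0)
    return r * np.log(safe)

def f_hellinger(r):
    # f(r) = (sqrt(r) - 1)^2
    return (np.sqrt(r) - 1.0) ** 2

def f_tv(r):
    # f(r) = |r - 1| / 2
    return 0.5 * np.abs(r - 1.0)

# Two illustrative probability mass functions on a three-point space.
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.1, 0.6, 0.3])

kl_12 = f_divergence(p1, p2, f_kl)   # D_KL(P1 || P2)
kl_21 = f_divergence(p2, p1, f_kl)   # D_KL(P2 || P1), differs from kl_12
hellinger = f_divergence(p1, p2, f_hellinger)
tv = f_divergence(p1, p2, f_tv)

print(f"KL(P1||P2) = {kl_12:.4f}, KL(P2||P1) = {kl_21:.4f}")
print(f"Hellinger-type divergence = {hellinger:.4f}, TV = {tv:.4f}")
```

In this example the two KL values differ, which illustrates the lack of symmetry mentioned above, whereas the Hellinger and total-variation generators happen to yield symmetric quantities.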
Instead of using statistical divergences, it is also common to work directly with metrics or pseudo-metrics on probability measures. Pseudo-probability metrics are functions $d_{\mathcal{H}} : \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}_+$ which satisfy (i) $d_{\mathcal{H}}(P_1, P_1) = 0$, (ii) symmetry: $d_{\mathcal{H}}(P_1, P_2) = d_{\mathcal{H}}(P_2, P_1)$, and (iii) the triangle inequality: $d_{\mathcal{H}}(P_1, P_3) \leq d_{\mathcal{H}}(P_1, P_2) + d_{\mathcal{H}}(P_2, P_3)$, for all probability measures $P_1, P_2, P_3 \in \mathcal{P}(\mathcal{X})$. Furthermore, probability metrics are pseudo-probability metrics which satisfy (iv) $d_{\mathcal{H}}(P_1, P_2) = 0$ if and only if $P_1 = P_2$. Clearly, all probability metrics are divergences, but the converse does not necessarily hold. The most common pseudo-probability metrics are the integral (pseudo-)probability metrics [Müller, 1997; Sriperumbudur et al., 2010b, 2012; Sriperumbudur, 2016]:
\[
d_{\mathcal{H}}(P_1, P_2) := \sup_{f \in \mathcal{H}} \left| \int_{\mathcal{X}} f(x) \, P_1(\mathrm{d}x) - \int_{\mathcal{X}} f(x) \, P_2(\mathrm{d}x) \right|. \tag{5.2}
\]
Equation 5.2 should of course be familiar, since it corresponds to the definition of WCE for integration in H. Familiar examples of integral (pseudo-)probability metrics include the following:
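(i) The total variation distance, obtained using the unit ball of the set of bounded functions: $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R} : \sup_{x \in \mathcal{X}} |f(x)| \leq 1\}$,
(ii) The 1-Wasserstein distance (or Kantorovich metric or earth mover’s distance), obtained using the set of 1-Lipschitz functions: $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R} : \sup_{x \neq y \in \mathcal{X}} |f(x) - f(y)| / \|x - y\| \leq 1\}$,
(iii) The Dudley probability metric, obtained by considering the set of bounded Lipschitz functions: $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R} : \sup_{x \neq y \in \mathcal{X}} |f(x) - f(y)| / \|x - y\| + \sup_{x \in \mathcal{X}} |f(x)| \leq 1\}$,
(iv) The maximum mean discrepancy, for which $\mathcal{H}$ is taken to be the unit ball of some RKHS $\mathcal{H}_k$: $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R} : \|f\|_{\mathcal{H}_k} \leq 1\}$.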
Under rather weak conditions on $\mathcal{X}$, examples (i), (ii) and (iii) are all probability metrics, but (iv) is only a probability metric under certain conditions on the kernel (and otherwise is a pseudo-probability metric). Any kernel which makes (iv) a probability metric is called a characteristic kernel [Sriperumbudur et al., 2010b]. Other examples of integral probability metrics can also be found in [Müller, 1997; Sriperumbudur et al., 2010b, 2012; Sriperumbudur, 2016].
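To make the variational definition in Equation 5.2 concrete, consider example (i) on a finite sample space. There the supremum over functions bounded by one is attained at $f^*(x) = \mathrm{sign}(p_1(x) - p_2(x))$ and equals $\sum_x |p_1(x) - p_2(x)|$. The sketch below checks this numerically; the example distributions, the random search and all names are our own illustrative assumptions.

```python
import numpy as np

def ipm_objective(f, p, q):
    """|sum_x f(x) p(x) - sum_x f(x) q(x)| for a function f on a finite space."""
    return abs(float(np.dot(f, p - q)))

# Two illustrative probability mass functions on a three-point space.
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.1, 0.6, 0.3])

# Candidate maximiser over the unit ball {f : sup_x |f(x)| <= 1}.
f_star = np.sign(p1 - p2)
ipm_tv = ipm_objective(f_star, p1, p2)

# Sanity check: random functions in the unit ball never exceed this value.
rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(10_000, p1.size))
best_random = float(np.max(np.abs(candidates @ (p1 - p2))))

print(f"IPM at f_star = {ipm_tv:.4f}, best random candidate = {best_random:.4f}")
```

With this choice of $\mathcal{H}$ the integral probability metric equals twice the total-variation distance defined through $f(x) = \frac{1}{2}|x - 1|$; the two only differ by this normalisation convention. For richer function classes, such as those in (ii) and (iii), no such explicit maximiser is available in general.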
Taking a step back to our objective of finding a statistical distance which can be computed for intractable models, it should be obvious that all of the divergences and metrics highlighted above are somewhat inadequate for our purpose. The KL divergence requires access to densities in normalised form, whilst the integral probability metrics require computation of a supremum over $\mathcal{H}$. Computing these notions of distance will hence usually be impossible whenever the model is in an unnormalised or generative form.
In the next section, we will derive a distance between probability measures called the kernel Stein discrepancy (KSD) [Chwialkowski et al., 2016; Liu et al., 2016], which bypasses these issues for unnormalised models. KSDs can be recovered from maximum mean discrepancies (MMDs) by a specific choice of kernel, and under several assumptions can be shown to be statistical divergences. MMDs were extensively discussed in previous chapters and correspond to the WCEs in some RKHSs. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be the reproducing kernel of an RKHS $\mathcal{H}_k$ of functions $\mathcal{X} \to \mathbb{R}$. From Proposition 2 in Chapter ?? we have that the MMD has a straightforward expression in terms of integrals of the kernel $k$. Furthermore, recall from Equation 3.7 in Chapter 3 that given an empirical measure $Q_n = \sum_{i=1}^n w_i \delta(x_i)$, where $\{x_i\}_{i=1}^n \subset \mathcal{X}$ and $w = (w_1, \ldots, w_n) \in \mathbb{R}^n$, and a target measure $P$, the MMD is given by:
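\[
\mathrm{MMD}(Q_n, P)^2 := \int_{\mathcal{X} \times \mathcal{X}} k(x, x') \, P(\mathrm{d}x) P(\mathrm{d}x') - 2 \sum_{i=1}^n w_i \int_{\mathcal{X}} k(x_i, x) \, P(\mathrm{d}x) + \sum_{i,j=1}^n w_i w_j k(x_i, x_j).
\]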
As we have already clearly highlighted in Chapter 4, there are very few cases where we can actually compute this expression in closed form. Certainly, this will in general not be possible whenever the density $p$ of $P$ is unnormalised.
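For intuition, one of the few tractable cases is a Gaussian kernel $k(x, y) = \exp(-(x - y)^2 / (2\ell^2))$ paired with a one-dimensional Gaussian target $P = N(\mu, \sigma^2)$, for which both integrals in the MMD expression are available analytically: $\int k(x_i, x) P(\mathrm{d}x) = \ell (\ell^2 + \sigma^2)^{-1/2} \exp(-(x_i - \mu)^2 / (2(\ell^2 + \sigma^2)))$ and $\int\int k(x, x') P(\mathrm{d}x) P(\mathrm{d}x') = \ell (\ell^2 + 2\sigma^2)^{-1/2}$. The sketch below evaluates the three terms under these assumptions; the kernel, the target, the parameter values and the function names are our own illustrative choices and are not taken from the text.

```python
import numpy as np

def gauss_kernel(x, y, ell):
    """Gaussian kernel matrix with entries k(x_i, y_j) = exp(-(x_i - y_j)^2 / (2 ell^2))."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2.0 * ell**2))

def mmd_squared_gaussian(x, w, mu, sigma, ell):
    """Squared MMD between Q_n = sum_i w_i delta(x_i) and P = N(mu, sigma^2).

    Assumes a one-dimensional Gaussian kernel with lengthscale ell, for which
    the kernel integrals against P have closed forms.
    """
    # Double integral of k against P x P.
    term_pp = ell / np.sqrt(ell**2 + 2.0 * sigma**2)
    # Integral of k(x_i, .) against P, evaluated at every point x_i.
    kernel_mean = (ell / np.sqrt(ell**2 + sigma**2)) * np.exp(
        -((x - mu) ** 2) / (2.0 * (ell**2 + sigma**2))
    )
    term_qp = float(w @ kernel_mean)
    # Double sum over the weighted point set.
    term_qq = float(w @ gauss_kernel(x, x, ell) @ w)
    return term_pp - 2.0 * term_qp + term_qq

rng = np.random.default_rng(0)
n = 200
w = np.full(n, 1.0 / n)                  # uniform weights
x_good = rng.normal(0.0, 1.0, size=n)    # points drawn from the target N(0, 1)
x_bad = rng.normal(2.0, 1.0, size=n)     # points drawn from a shifted Gaussian

mmd2_good = mmd_squared_gaussian(x_good, w, mu=0.0, sigma=1.0, ell=1.0)
mmd2_bad = mmd_squared_gaussian(x_bad, w, mu=0.0, sigma=1.0, ell=1.0)
print(f"MMD^2 (matched points) = {mmd2_good:.5f}")
print(f"MMD^2 (shifted points) = {mmd2_bad:.5f}")
```

The point set drawn from the target yields a much smaller value than the shifted one. Both analytic integrals rely on the normalised Gaussian density of $P$, which is exactly the ingredient that is missing for unnormalised models.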