Chapter 5 Statistical Inference and Computation with Intractable
5.2 Kernel-based Estimators for Intractable Models
5.2.1 Minimum Distance Estimators
Our kernel-based estimators for intractable models fall within the class of minimum distance estimators, which are introduced below together with the related field of information geometry. Information geometry [Amari, 1987, 2016; Barndorff-Nielsen, 1978] is concerned with the geometry of statistical manifolds. These are manifolds for which each point corresponds to a Borel probability measure $\mathbb{P} \in \mathcal{P}(\mathcal{X})$. Commonly, these manifolds correspond to parametric families $\mathcal{P}_\Theta(\mathcal{X}) \subset \mathcal{P}(\mathcal{X})$, which are classes of probability measures $\mathbb{P}_\theta$ indexed by a parameter $\theta = (\theta_1, \ldots, \theta_p) \in \Theta$. An obvious choice of coordinates on a statistical manifold is given by the parameter $\theta$. The parameter space $\Theta$ will be assumed to be a subset of $\mathbb{R}^p$ for some $p \in \mathbb{N}$ for the remainder of this chapter, but could itself be a space of functions.
A common example of a statistical manifold is the exponential family, which is a class of probability measures with probability density function of the form:
$$p(x|\theta) = h(x) \exp\left(\langle \theta, T(x) \rangle - c(\theta)\right), \tag{5.16}$$
for some function $h : \mathcal{X} \to \mathbb{R}$ of the form $h(x) \propto \exp(b(x))$, which is the density of some base measure, some summary statistic $T : \mathcal{X} \to \mathbb{R}^p$ and some log-normalisation constant $c : \Theta \to \mathbb{R}$ which guarantees that $p(x|\theta)$ is a probability density function (i.e. is normalised). In this case, the parameter space is given by $\Theta = \{\theta \in \mathbb{R}^p : c(\theta) = \log \int_{\mathcal{X}} h(x) \exp(\langle \theta, T(x) \rangle) dx < \infty\}$. The formulation above is in terms of a parameterisation called the natural parameterisation. The exponential family is a large family which includes some classical distributions such as the Gaussian, Poisson, Dirichlet and Gamma distributions. It also includes many more complex models such as graphical models, including pairwise interaction models [Lin et al., 2016], or certain neural networks [Gutmann and Hyvärinen, 2012].
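As a concrete illustration of Equation 5.16, the following sketch writes the Poisson distribution in natural form and checks it numerically. This example is an assumption made for illustration and is not taken from the chapter: the density is with respect to the counting measure on the non-negative integers rather than $dx$, with $h(x) = 1/x!$, $T(x) = x$, $\theta = \log \lambda$ and $c(\theta) = \exp(\theta)$. All function names are hypothetical.

```python
import math


def poisson_natural_density(x, theta):
    """Poisson pmf written as h(x) * exp(theta * T(x) - c(theta))."""
    h = 1.0 / math.factorial(x)   # base density h(x) = 1 / x!
    t = x                         # statistic T(x) = x
    c = math.exp(theta)           # term c(theta) = exp(theta) in Equation 5.16
    return h * math.exp(theta * t - c)


def poisson_standard_pmf(x, rate):
    """Usual parameterisation of the Poisson pmf, for comparison."""
    return rate ** x * math.exp(-rate) / math.factorial(x)


rate = 3.5
theta = math.log(rate)            # natural parameter

support = range(60)               # truncation of the support; the tail is negligible here
total = sum(poisson_natural_density(x, theta) for x in support)
max_gap = max(
    abs(poisson_natural_density(x, theta) - poisson_standard_pmf(x, rate))
    for x in support
)

print("sum over truncated support:", total)
print("largest gap between the two parameterisations:", max_gap)
```

The two parameterisations should agree up to floating-point error, and the truncated sum should be close to one, which is what the subtraction of $c(\theta)$ in the exponent is meant to guarantee.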
Going back to the concept of statistical manifold, we need to construct a notion of distance on a parametric class of probability models. This will usually be derived from a statistical divergence. Although a divergence does not define a metric on $\mathcal{P}(\mathcal{X})$, it induces a symmetric tensor $g$ whose matrix $(g_{ij})$ is positive semi-definite:
$$g_{ij}(\theta) := -\frac{\partial^2}{\partial \alpha_i \partial \beta_j} D(\mathbb{P}_\alpha || \mathbb{P}_\beta) \Big|_{\theta = \alpha = \beta}.$$
When $g_{ij}(\theta)$ is positive definite for all $\theta \in \Theta$, it defines a function $g$ which maps $\theta$ to the matrix $g_{ij}(\theta)$. This is called the metric tensor or information metric, and can be used to define a Riemannian geodesic distance [Amari, 2016].
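The construction of the metric tensor from a divergence can be worked through by hand in a one-parameter case. The derivation below is an illustrative sketch under assumed choices that do not come from the chapter: the divergence is the KL divergence and the family is the Poisson family in natural form, $p(x|\theta) = \exp(\theta x - e^{\theta})/x!$, for which $\mathbb{E}_{x \sim \mathbb{P}_\alpha}[x] = e^{\alpha}$.

```latex
\begin{align*}
D_{\mathrm{KL}}(\mathbb{P}_\alpha \| \mathbb{P}_\beta)
  &= \mathbb{E}_{x \sim \mathbb{P}_\alpha}\left[\log p(x|\alpha) - \log p(x|\beta)\right]
   = (\alpha - \beta)\, e^{\alpha} - e^{\alpha} + e^{\beta}, \\
\frac{\partial}{\partial \beta} D_{\mathrm{KL}}(\mathbb{P}_\alpha \| \mathbb{P}_\beta)
  &= -e^{\alpha} + e^{\beta}, \\
\frac{\partial^2}{\partial \alpha \, \partial \beta} D_{\mathrm{KL}}(\mathbb{P}_\alpha \| \mathbb{P}_\beta)
  &= -e^{\alpha}, \\
g(\theta)
  &= -\left(-e^{\alpha}\right)\Big|_{\alpha = \beta = \theta} = e^{\theta} > 0 .
\end{align*}
```

In this example $g(\theta) = e^{\theta}$ equals the variance of $T(x) = x$ under $\mathbb{P}_\theta$, so the induced tensor is positive definite for every $\theta \in \mathbb{R}$.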
Minimum Distance Estimators and Scoring Rules
Consider now the problem of statistical inference for a given statistical model. A common approach is to consider some loss function $L : \Theta \to \mathbb{R}$ based on a divergence between an element of the parametric family $\mathcal{P}_\Theta(\mathcal{X})$ and an empirical probability measure $\mathbb{Q}_m = \frac{1}{m} \sum_{j=1}^m \delta(y_j)$ obtained from the IID realisations $\{y_j\}_{j=1}^m$ available to us from the correct model $\mathbb{Q}$. These estimators are called minimum distance estimators and are given by the solution of the following (usually non-convex) optimisation problem:
$$\hat{\theta}_m = \arg\min_{\theta \in \Theta} L(\theta) = \arg\min_{\theta \in \Theta} D(\mathbb{Q}_m || \mathbb{P}_\theta). \tag{5.17}$$
See the books of Pardo [2005] and Basu et al. [2011] for more details, or the recent paper by Jewson et al. [2018] for a Bayesian alternative. In special cases, this optimisation problem can be solved in closed form, but it will generally be necessary to employ numerical optimisation routines. Clearly, this pair of parametric family and statistical divergence directly leads to the notion of statistical manifold, and we will be able to use information geometry to study this problem.
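Minimum distance estimators are closely connected to the concept of scoring rules [Gneiting and Raftery, 2007; Dawid, 2007; Parry et al., 2012], although not all scoring rules lead to minimum distance estimators². A scoring rule is a function $S : \mathcal{X} \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ such that $S(x, \mathbb{P})$ quantifies the accuracy of a model $\mathbb{P}$ upon observing the realisation $x$. A scoring rule is said to be strictly proper if $\int_{\mathcal{X}} S(x, \mathbb{P}_2) \mathbb{P}_1(dx)$ is uniquely minimised over $\mathbb{P}_2$ when $\mathbb{P}_1 = \mathbb{P}_2$. Any strictly proper scoring rule induces a divergence of the form $D_S(\mathbb{P}_1 || \mathbb{P}_2) = \int_{\mathcal{X}} S(x, \mathbb{P}_2) \mathbb{P}_1(dx) - \int_{\mathcal{X}} S(x, \mathbb{P}_1) \mathbb{P}_1(dx)$, which by construction will be minimised when $\mathbb{P}_1 = \mathbb{P}_2$. These divergences can then be used as loss functions to get minimum distance estimators of the form:
$$\begin{aligned} \hat{\theta}^S_m &= \arg\min_{\theta \in \Theta} D_S(\mathbb{Q}_m || \mathbb{P}_\theta) = \arg\min_{\theta \in \Theta} \int_{\mathcal{X}} S(y, \mathbb{P}_\theta) \mathbb{Q}_m(dy) \\ &= \arg\min_{\theta \in \Theta} \frac{1}{m} \sum_{j=1}^m S(y_j, \mathbb{P}_\theta). \end{aligned} \tag{5.18}$$
² We note that the name "scoring rule" is in no way related to the score function $\nabla \log p$, although some scoring rules might depend on $\nabla \log p$.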
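The estimator in Equation 5.18 can be sketched numerically. The example below is a minimal illustration under assumptions that are not part of the chapter: the model is a Gaussian location family $N(\theta, 1)$, the scoring rule is the continuous ranked probability score (CRPS) in its standard closed form for a Gaussian (see e.g. Gneiting and Raftery [2007]), the data are simulated, and the optimiser is a plain golden-section search. All function names are hypothetical.

```python
import math
import random


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def crps_gaussian(y, mu, sigma=1.0):
    """Closed-form CRPS of the model N(mu, sigma^2) at the observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * std_normal_cdf(z) - 1.0)
                    + 2.0 * std_normal_pdf(z)
                    - 1.0 / math.sqrt(math.pi))


def mean_score(theta, data):
    """Empirical loss (1/m) * sum_j S(y_j, P_theta) from Equation 5.18."""
    return sum(crps_gaussian(y, theta) for y in data) / len(data)


def golden_section_minimise(f, lo, hi, iterations=80):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d   # the minimiser lies in [a, d]
        else:
            a = c   # the minimiser lies in [c, b]
    return 0.5 * (a + b)


rng = random.Random(0)                      # fixed seed for reproducibility
true_theta = 1.5                            # assumed data-generating parameter
data = [rng.gauss(true_theta, 1.0) for _ in range(500)]

theta_hat = golden_section_minimise(
    lambda t: mean_score(t, data), min(data), max(data)
)

# Average gradient of the score in theta; it should vanish at the minimiser.
score_gradient = sum(
    -(2.0 * std_normal_cdf(y - theta_hat) - 1.0) for y in data
) / len(data)

print("estimate:", theta_hat)
print("average score gradient at the estimate:", score_gradient)
```

The search interval is the range of the data because the averaged CRPS is convex in the location parameter for this model. For models where the loss is non-convex, a single one-dimensional search of this kind would not be adequate.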
See the work by Mameli and Ventura [2015] and Dawid et al. [2016] for asymptotic properties of such estimators, and Merkle and Steyvers [2013] for advice on choosing a scoring rule. A popular class of strictly proper scoring rules is that of local strictly proper scoring rules, which only depend on the log-likelihood and its derivatives [Parry et al., 2012; Ehm and Gneiting, 2012; Parry, 2016]. Note that scoring rules can also be defined for discrete domains [Dawid et al., 2012]. Estimators based on scoring rules require finding the solution to the following equations in $\theta$: $\sum_{j=1}^m \nabla_\theta S(y_j, \mathbb{P}_\theta) = 0$, which are called estimating equations and where $0 = (0, \ldots, 0)^\top \in \mathbb{R}^p$. For strictly proper scoring rules, one can easily show that these estimating equations are unbiased (i.e. $\int_{\mathcal{X}} \nabla_\theta S(x, \mathbb{P}_\theta) \mathbb{P}_\theta(dx) = 0$), and as a consequence the associated estimators are consistent (see for example Theorem 1 and Corollary 2 of Dawid [2007]).
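The unbiasedness of the estimating equations can be sketched in two steps. The argument below is a hedged outline rather than a full proof: it assumes that $\theta$ lies in the interior of $\Theta$, that $\theta' \mapsto S(x, \mathbb{P}_{\theta'})$ is differentiable, and that differentiation and integration can be exchanged.

```latex
% Strict propriety: the expected score under P_theta is minimised at theta' = theta.
\begin{align*}
\theta &\in \arg\min_{\theta' \in \Theta}
   \int_{\mathcal{X}} S(x, \mathbb{P}_{\theta'}) \, \mathbb{P}_\theta(dx)
\\
\Longrightarrow \quad
0 &= \nabla_{\theta'} \int_{\mathcal{X}} S(x, \mathbb{P}_{\theta'}) \, \mathbb{P}_\theta(dx)
     \Big|_{\theta' = \theta}
   = \int_{\mathcal{X}} \nabla_\theta S(x, \mathbb{P}_\theta) \, \mathbb{P}_\theta(dx) .
\end{align*}
```

The first line is the definition of a strictly proper scoring rule applied within the parametric family, and the second is the first-order optimality condition at an interior minimiser.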
There are two scenarios of interest in the context of minimum distance estimators: the M-closed and M-open cases. First, in the M-closed case, we assume that $\mathbb{Q}$ is an instance of the parametric family $\mathcal{P}_\Theta(\mathcal{X})$. The statistical inference problem therefore boils down to finding the value $\theta^* \in \Theta$ such that $\mathbb{P}_{\theta^*}$ corresponds to $\mathbb{Q}$. Alternatively, in the M-open case, $\mathbb{Q}$ can be any probability measure in $\mathcal{P}(\mathcal{X})$, and is not necessarily in the parametric family $\mathcal{P}_\Theta(\mathcal{X})$. In this case, we look for the value $\theta^*$ such that $\mathbb{P}_{\theta^*}$ is the closest possible to $\mathbb{Q}$ in terms of some statistical divergence. Obviously, the M-closed case is much more restrictive, but can be more easily understood from a theoretical viewpoint. The M-open case, on the other hand, reflects the practical realities illustrated by George E. P. Box's now famous phrase: "all models are wrong, but some are useful". The M-open case is, however, much harder to analyse from a theoretical viewpoint.
The M-open setting requires us to study the robustness of an estimator, which is concerned with corruptions in the data generating process. For example, in applied statistics, data might be assumed to correspond to IID realisations of some model but might in fact consist of correlated observations. Alternatively, we might be in an M-open setting where our data consists of realisations from a mixture distribution consisting of a model from the parametric family, and of some distribution of outliers. The reader is referred to Huber and Ronchetti [2009] or Chapter 10 in Steinwart and Christmann [2008] for extensive introductions. Here, the choice of divergence will significantly influence the robustness of the associated estimator. There is usually a trade-off between robustness and efficiency of estimators, and the choice of scoring rule should hence be made with this in mind.
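An important concept in robust statistics is that of the influence function $\mathrm{IF}_S : \mathcal{X} \times \mathcal{P}_\Theta(\mathcal{X}) \to \mathbb{R}^p$, where $\mathrm{IF}_S(z, \mathbb{P}_\theta)$ measures the impact of an infinitesimal contamination of the data generating model $\mathbb{P}_\theta$ in the direction of a Dirac measure located at some point $z$. The influence function of a minimum distance estimator based on a scoring rule $S$ is given by [Dawid and Musio, 2014]:
$$\mathrm{IF}_S(z, \mathbb{P}_\theta) = \left( \int_{\mathcal{X}} \nabla_\theta \nabla_\theta S(x, \mathbb{P}_\theta) \mathbb{P}_\theta(dx) \right)^{-1} \nabla_\theta S(z, \mathbb{P}_\theta), \tag{5.19}$$
where $(\nabla_\theta \nabla_\theta S(x, \mathbb{P}_\theta))_{jk} = \partial^2 S(x, \mathbb{P}_\theta) / \partial \theta_j \partial \theta_k$. The supremum over $z \in \mathcal{X}$ of the norm of the influence function is called the gross-error sensitivity, and if it is finite, we say that an estimator is bias-robust (also called B-robust, or robust in the sense of Hampel) [Hampel, 1971].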
Maximum Likelihood Estimation
To illustrate the definitions above, we now consider the most widely studied example of a minimum distance estimator. When using the KL divergence, the minimum distance estimator in Equation 5.17 becomes equivalent to the maximum likelihood estimator [Fisher, 1922]:
$$\arg\min_{\theta \in \Theta} L(\theta) = \arg\min_{\theta \in \Theta} D_{\mathrm{KL}}(\mathbb{Q}_m || \mathbb{P}_\theta) = \arg\max_{\theta \in \Theta} \frac{1}{m} \sum_{j=1}^m \log p(y_j|\theta). \tag{5.20}$$
This estimator can also be derived from a strictly proper scoring rule, the log-score: $S_{\mathrm{KL}}(x, \mathbb{P}) = -\log p(x)$. Since the log-score is strictly proper, it follows that maximum likelihood estimation is consistent in the M-closed case.
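The step from the KL divergence to the log-likelihood in Equation 5.20 can be spelled out as follows. This is a sketch under the assumption that $\mathbb{Q}$ has a density $q$; the empirical measure $\mathbb{Q}_m$ has no density, so the substitution is made only in the $\theta$-dependent term.

```latex
\begin{align*}
D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}_\theta)
  &= \int_{\mathcal{X}} \log \frac{q(y)}{p(y|\theta)} \, \mathbb{Q}(dy)
   = \underbrace{\int_{\mathcal{X}} \log q(y) \, \mathbb{Q}(dy)}_{\text{does not depend on } \theta}
     \; - \int_{\mathcal{X}} \log p(y|\theta) \, \mathbb{Q}(dy), \\
\int_{\mathcal{X}} \log p(y|\theta) \, \mathbb{Q}_m(dy)
  &= \frac{1}{m} \sum_{j=1}^m \log p(y_j|\theta) .
\end{align*}
```

Minimising the divergence over $\theta$ therefore amounts to maximising the average log-likelihood, which is also the negative of the averaged log-score $S_{\mathrm{KL}}(y_j, \mathbb{P}_\theta) = -\log p(y_j|\theta)$.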
In the case of exponential family models, the problem of maximum likelihood estimation can be simplified significantly. In this case, $\nabla_\theta S(x, \mathbb{P}_\theta) = -\nabla_\theta \log p(x|\theta) = -T(x) + \nabla_\theta c(\theta)$, and so maximum likelihood estimation is equivalent to solving the following estimating equations: $\frac{1}{m} \sum_{j=1}^m T(y_j) = \nabla_\theta c(\theta)$. Clearly this requires knowledge of the normalisation constant of the model or, more precisely, of the derivative of the log normalisation constant. Maximum likelihood estimation will hence not be feasible in cases where this constant is not available in closed form.
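To make the role of $\nabla_\theta c(\theta)$ concrete, the sketch below solves the estimating equations obtained by averaging $\nabla_\theta S(y_j, \mathbb{P}_\theta) = -T(y_j) + \nabla_\theta c(\theta)$ over the data and setting the result to zero. The setting is an assumption chosen for illustration and does not come from the chapter: a Poisson model in natural form with $T(x) = x$ and $c(\theta) = \exp(\theta)$, a small made-up data set, and Newton's method as the solver.

```python
import math

# Hypothetical count data, used only for illustration.
counts = [2, 4, 3, 5, 1, 3, 6, 2, 4, 3]
mean_statistic = sum(counts) / len(counts)   # (1/m) * sum_j T(y_j)


def average_score_gradient(theta):
    """Average over the data of -T(y_j) + c'(theta), with c(theta) = exp(theta)."""
    return -mean_statistic + math.exp(theta)


def solve_estimating_equation(theta0=0.0, iterations=50):
    """Newton's method on the averaged estimating equation."""
    theta = theta0
    for _ in range(iterations):
        gradient = average_score_gradient(theta)
        hessian = math.exp(theta)            # c''(theta) for the Poisson model
        theta -= gradient / hessian
    return theta


theta_hat = solve_estimating_equation()
closed_form = math.log(mean_statistic)       # solves exp(theta) = mean of T(y_j)

print("Newton solution:", theta_hat)
print("closed-form solution:", closed_form)
```

Every Newton step evaluates the derivative of $c(\theta)$, which is available here only because the Poisson normalisation is known in closed form.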
divergences, the performance of these estimators will be closely interlinked with the geometry of the corresponding statistical manifold. The metric tensor obtained from the KL-divergence is called the Fisher information metric. It corresponds to the covariance of the score vectors of the distribution:
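$$g^{\mathrm{KL}}_{jk}(\theta) = \int_{\mathcal{X}} \frac{\partial \log p(x|\theta)}{\partial \theta_j} \frac{\partial \log p(x|\theta)}{\partial \theta_k} p(x|\theta) dx.$$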
Geometric quantities can be useful to understand asymptotic properties of the estimator. The most common example of this is the Cramér-Rao theorem (see for example Amari [2016], Theorem 7.7), which states that for any asymptotically unbiased estimator $\hat{\theta}$ of $\theta$, we have $\mathbb{E}[(\hat{\theta}_j - \theta_j)(\hat{\theta}_k - \theta_k)] \geq \frac{1}{m} (g^{\mathrm{KL}}(\theta)^{-1})_{jk}$ in the sense of the positive semi-definite ordering of matrices, where the expectation is taken with respect to the distribution of the data-generating process. Since maximum likelihood estimation attains this lower bound asymptotically, we say that it is efficient. Unfortunately, as previously mentioned, efficiency often has to be traded off against robustness, and maximum likelihood estimation is not robust. This can be seen by looking at the influence function (obtained by plugging $S_{\mathrm{KL}}$ into Equation 5.19):
$$\mathrm{IF}_{\mathrm{KL}}(z, \mathbb{P}_\theta) = \left( -\int_{\mathcal{X}} \nabla_\theta \nabla_\theta \log p(x|\theta) p(x|\theta) dx \right)^{-1} \left( -\nabla_\theta \log p(z|\theta) \right) = -g^{\mathrm{KL}}(\theta)^{-1} \nabla_\theta \log p(z|\theta).$$
Even for simple models such as a Gaussian distribution with unknown standard deviation, the influence function grows as $O(z^2)$ and is hence unbounded, clearly demonstrating the lack of bias-robustness of maximum likelihood estimation.
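The unboundedness of the influence function can be checked directly by evaluating Equation 5.19 with the log-score for a Gaussian model. The sketch below rests on assumptions made for illustration: it treats the mean and the standard deviation of $N(\mu, \sigma^2)$ as two separate one-parameter problems, uses closed-form expressions derived by hand for the expected Hessian and the score gradient, and all function names are hypothetical.

```python
def influence_mean(z, mu, sigma):
    """Equation 5.19 with S = -log p, for the mean of N(mu, sigma^2), sigma known."""
    expected_hessian = 1.0 / sigma ** 2          # Fisher information for mu
    grad_score = -(z - mu) / sigma ** 2          # d/dmu of -log p(z | mu, sigma)
    return grad_score / expected_hessian


def influence_std(z, mu, sigma):
    """Equation 5.19 with S = -log p, for the standard deviation, mu known."""
    expected_hessian = 2.0 / sigma ** 2          # Fisher information for sigma
    grad_score = 1.0 / sigma - (z - mu) ** 2 / sigma ** 3   # d/dsigma of -log p
    return grad_score / expected_hessian


contamination_points = [1.0, 10.0, 100.0, 1000.0]
mean_if = [abs(influence_mean(z, 0.0, 1.0)) for z in contamination_points]
std_if = [abs(influence_std(z, 0.0, 1.0)) for z in contamination_points]

for z, a, b in zip(contamination_points, mean_if, std_if):
    print(f"z = {z:8.1f}   |IF| for mean = {a:10.1f}   |IF| for std = {b:12.1f}")
```

In both cases the magnitude keeps increasing as the contamination point $z$ moves away from the bulk of the model, so the gross-error sensitivity is infinite for these two problems.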
Maximum likelihood methods have nonetheless been widely popular in the past due to the likelihood principle [Young and Smith, 2005], which states that, given a model, all of the evidence in a data set which is relevant to parameter inference is contained in the likelihood function. There are however several limitations to this approach, the most obvious being the requirement to have access to the likelihood (or equivalently the log-likelihood). We will now highlight alternative loss functions for use when the likelihood is not available.