In general, inference about the posterior distribution is challenging because for a complex model p(x|y) no closed-form simplifications can be made. This is especially true in the case that we consider, where p(x|y) corresponds to a graphics engine rendering images. Despite this apparent complexity, we observe the following: for many computer vision applications there exist well-performing discriminative approaches which, given the image, predict some target variables y or distributions thereof. These do not correspond to the posterior distribution that we are interested in but, intuitively, the availability of discriminative inference methods should make the task of inferring p(y|x) easier. Furthermore, a physically accurate generative model can be used in an offline stage prior to inference to generate as many samples as we would like or can afford computationally. Again, intuitively, this should allow us to prepare and summarize useful information about the distribution in order to accelerate the test-time inference.
Concretely, in our case we will use a discriminative method to provide a global density TG(y|x), which we then use in a valid MCMC inference method. The standard Metropolis-Hastings (MH) Markov chain Monte Carlo (MCMC) algorithm was described in Section 2.2.1 of the previous chapter. In each time step, a candidate is drawn from a proposal distribution and is then either accepted or rejected based on the acceptance probability:
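1. Propose a transition using a proposal distribution T and the current state yt:
\[ \bar{y} \sim T(\cdot \mid y_t) \]
2. Accept or reject the transition based on the Metropolis-Hastings (MH) acceptance rule:
\[ y_{t+1} = \begin{cases} \bar{y}, & \text{rand}(0,1) < \min\left(1, \dfrac{\pi(\bar{y})\, T(\bar{y} \to y_t)}{\pi(y_t)\, T(y_t \to \bar{y})}\right), \\ y_t, & \text{otherwise.} \end{cases} \]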
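A minimal Python sketch of the two steps above may help. The names `mh_step`, `run_chain`, `log_pi`, `propose` and `log_T` are illustrative assumptions, not part of the original method description, and working with log densities is a choice made here for numerical stability.

```python
import math
import random


def mh_step(y_t, log_pi, propose, log_T, rng=random):
    """One Metropolis-Hastings transition from the current state y_t.

    log_pi(y)           -- log of the target density, up to an additive constant
    propose(y_t)        -- draws a candidate y_bar from T(. | y_t)
    log_T(y_to, y_from) -- log density of proposing y_to when the chain is at y_from
    """
    y_bar = propose(y_t)
    # log of  pi(y_bar) T(y_bar -> y_t) / (pi(y_t) T(y_t -> y_bar))
    log_ratio = (log_pi(y_bar) + log_T(y_t, y_bar)) - (log_pi(y_t) + log_T(y_bar, y_t))
    accept_prob = math.exp(min(0.0, log_ratio))
    if rng.random() < accept_prob:
        return y_bar, True
    return y_t, False


def run_chain(y_1, n_steps, log_pi, propose, log_T, rng=random):
    """Iterate mh_step and return the list of visited states (length n_steps)."""
    chain = [y_1]
    for _ in range(n_steps - 1):
        y_next, _accepted = mh_step(chain[-1], log_pi, propose, log_T, rng)
        chain.append(y_next)
    return chain
```

The target only needs to be known up to a normalising constant, since the constant cancels in the ratio. For a symmetric proposal the two `log_T` terms cancel as well, and the rule reduces to the plain Metropolis test.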
Refer to Section 2.2.1 for more details about MCMC sampling. MCMC techniques differ mainly in the choice of the proposal distribution T. Next, we describe our informed proposal distribution, which we use in standard Metropolis-Hastings sampling, resulting in our proposed ‘Informed Sampler’ technique.
3.3.1 Informed Proposal Distribution
We use a common mixture kernel for Metropolis-Hastings (MH) sampling. Given the present target sample yt, the informed proposal distribution for MH sampling is given as:
Tα(·|x,yt) = α TL(·|yt) + (1− α) TG(·|x). (3.1)
Here TL is an ordinary local proposal distribution, for example a multivariate Normal distribution centered around the current sample yt, and TG is a global proposal distribution independent of the current state. We inject knowledge by conditioning the global proposal distribution TG on the image observation x. We learn the informed proposal TG(·|x) discriminatively in an offline training stage using a non-parametric density estimator described below.
The mixture parameter α ∈ [0,1] controls the contribution of each proposal. For α = 1 we recover standard MH with the local proposal TL. For α = 0 the proposal Tα is identical to TG(·|x), and the resulting sampler is a valid Metropolized independence sampler [174]. With α = 0, we call this baseline method ‘Informed Independent MH’ (INF-INDMH). For intermediate values, α ∈ (0,1), we combine local with global moves in a valid Markov chain. We call this method ‘Informed Metropolis Hastings’ (INF-MH).
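The mixture proposal of Eq. (3.1) can be sketched in a few lines of Python for a scalar target variable. This is only an illustration under stated assumptions: the helper names (`gauss_logpdf`, `make_mixture_proposal`) are hypothetical, the local proposal is taken to be a one-dimensional Gaussian, and the global proposal is passed in as a pair of user-supplied functions standing in for TG(·|x).

```python
import math
import random


def gauss_logpdf(y, mean, std):
    """Log density of a 1-D Normal distribution."""
    return -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def make_mixture_proposal(alpha, local_std, global_sample, global_logpdf, rng=random):
    """Return (propose, log_T) for T_alpha = alpha * T_L + (1 - alpha) * T_G.

    global_sample()  -- draws from T_G(. | x); it ignores the current state
    global_logpdf(y) -- log density of T_G(y | x)
    """
    def propose(y_t):
        if rng.random() < alpha:
            return rng.gauss(y_t, local_std)   # local move around y_t
        return global_sample()                 # global, state-independent move

    def log_T(y_to, y_from):
        # Density of the full mixture, not only of the component that was drawn.
        parts = []
        if alpha > 0.0:
            parts.append(math.log(alpha) + gauss_logpdf(y_to, y_from, local_std))
        if alpha < 1.0:
            parts.append(math.log(1.0 - alpha) + global_logpdf(y_to))
        m = max(parts)
        if m == float("-inf"):
            return m
        return m + math.log(sum(math.exp(p - m) for p in parts))

    return propose, log_T
```

Note that `log_T` evaluates the whole mixture density for both the forward and the reverse move. Setting `alpha=1.0` leaves only the local Gaussian term, and `alpha=0.0` leaves only the global term, which does not depend on `y_from`.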
3.3.2 Discriminatively Learning TG
The key step in the construction of TG is to include some discriminative information about the sample x. Ideally we would hope to have TG propose global moves which improve mixing and even allow mixing between multiple modes, whereas the local proposal TL is responsible for exploring the density locally. To see that this is possible in principle, consider the case of a perfect global proposal where TG matches the true posterior distribution, that is, TG(y|x) = P(y|x). In this case, we would get independent samples with α = 0 because every proposal is accepted. In practice TG is only an approximation to the true posterior P(y|x). If the approximation is good enough then the mixture of local and global proposals will have a high acceptance rate and explore the density rapidly.
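To make this argument explicit, consider the independence case α = 0, with π(y|x) denoting the posterior P(y|x). The importance weight w(y) below is a symbol introduced here only for illustration. The MH acceptance probability then reduces as follows:

```latex
\gamma
  = \min\left(1, \frac{\pi(\bar{y} \mid x)\, T_G(y_t \mid x)}{\pi(y_t \mid x)\, T_G(\bar{y} \mid x)}\right)
  = \min\left(1, \frac{w(\bar{y})}{w(y_t)}\right),
\qquad
w(y) = \frac{\pi(y \mid x)}{T_G(y \mid x)} .

% Perfect global proposal:
T_G(y \mid x) = \pi(y \mid x)
  \;\Longrightarrow\; w(y) \equiv 1
  \;\Longrightarrow\; \gamma = 1 .
```

The acceptance rate therefore depends on TG only through the ratio w. The flatter w is over the high-probability region of the posterior, the closer γ stays to one.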
In principle, we can use any conditional density estimation technique for learning a proposal TG from samples. Typically high-dimensional density estimation is difficult, and even more so in the conditional case; however, in our case we do have the true generating process available to provide example pairs (y, x). Therefore we use a simple but scalable non-parametric density estimation method based on clustering a feature representation of the observed image, v(x) ∈ R^d. For each cluster we then estimate an unconditional density over y using kernel density estimation (KDE). We chose this simple setup since it can easily be reused in many different scenarios; in the experiments we solve diverse problems using the same method. This method yields a valid transition kernel for which detailed balance holds. In addition to the KDE estimate for the global transition kernel, we also experimented with a random forest approach that maps the observations to transition kernels TG. More details will be given in Section 3.5.3.
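The clustering-plus-KDE construction described above can be sketched in pure Python as follows. Everything specific in this sketch is an assumption made for illustration: the toy generative model `simulate_toy` (a scalar y with x = y² plus noise) stands in for the prior and the renderer, the feature map is the identity, and k-means is a plain Lloyd iteration rather than any particular library routine.

```python
import math
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(v, centers):
    """Index of the centre closest to feature vector v."""
    return min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))


def kmeans(features, k, iters=20, rng=random):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    centers = rng.sample(features, k)
    for _ in range(iters):
        assign = [nearest(v, centers) for v in features]
        for j in range(k):
            members = [v for v, a in zip(features, assign) if a == j]
            if members:  # keep the old centre if a cluster runs empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, [nearest(v, centers) for v in features]


def learn_global_proposal(simulate, featurize, n, k, rng=random):
    """Simulate (y, x) pairs, cluster v(x), and store the y values per cluster."""
    pairs = [simulate(rng) for _ in range(n)]      # (y, x) ~ p(x|y) p(y)
    feats = [featurize(x) for _, x in pairs]
    centers, assign = kmeans(feats, k, rng=rng)
    clusters = [[] for _ in range(k)]
    for (y, _), j in zip(pairs, assign):
        clusters[j].append(y)
    return centers, clusters


def kde_logpdf(y, ys, bandwidth):
    """Log density at scalar y of a Gaussian KDE centred on the samples ys."""
    logs = [-0.5 * ((y - yi) / bandwidth) ** 2 for yi in ys]
    m = max(logs)
    norm = math.log(len(ys) * bandwidth * math.sqrt(2.0 * math.pi))
    return m + math.log(sum(math.exp(l - m) for l in logs)) - norm


def simulate_toy(rng):
    """Hypothetical stand-in for prior and renderer: y ~ U(-3, 3), x = y^2 + noise."""
    y = rng.uniform(-3.0, 3.0)
    return y, (y * y + rng.gauss(0.0, 0.1),)
```

In this toy model the sign of y cannot be recovered from x, so each cluster stores y values of both signs and the resulting KDE is bimodal. A call such as `learn_global_proposal(simulate_toy, lambda x: x, n=2000, k=5)` returns the cluster centres and the per-cluster samples on which `kde_logpdf` is evaluated.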
For the feature representation, we leverage successful discriminative features and heuristics developed in the computer vision community. Different task-specific feature representations can be used in order to provide invariance to small changes in y and to nuisance parameters. The main inference method remains the same across all problems.
We construct the KDE for each cluster and we use a relatively small kernel bandwidth in order to accurately represent the high probability regions in the posterior. This is similar in spirit to using only high probability regions as “darts” in the Darting Monte Carlo sampling technique of [233]. We summarize the offline training in Algorithm 1.

Algorithm 1 Learning a global proposal TG(y|x)
1. Simulate {(y(i), x(i))}, i = 1,...,n, from p(x|y) p(y)
2. Compute a feature representation v(x(i))
3. Perform k-means clustering of {v(x(i))}i
4. For each cluster Cj ⊂ {1,...,n}, fit a kernel density estimate KDE(Cj) to the vectors y{Cj}

Algorithm 2 INF-MH (Informed Metropolis-Hastings)
Input: observed image x
TL ← Local proposal distribution (Gaussian)
C ← cluster for v(x)
TG ← KDE(C) (as obtained by Alg. 1)
T = α TL + (1 − α) TG
π(y|x) ← Posterior distribution P(y|x)
Initialize y1
for t = 1 to N − 1 do
    1. Sample ȳ ∼ T(·)
    2. γ = min(1, π(ȳ|x) T(ȳ→yt) / (π(yt|x) T(yt→ȳ)))
    if rand(0, 1) < γ then
        yt+1 = ȳ
    else
        yt+1 = yt
    end if
end for
At test time, this method has the advantage that given an image x we only need to identify the corresponding cluster once using v(x) in order to sample efficiently from the kernel density TG. We show the full procedure in Algorithm 2.
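The test-time lookup can be sketched as below, assuming a scalar y and a Gaussian kernel. The function names and the data layout (a list of cluster centres plus a list of per-cluster y samples) are hypothetical choices for this illustration. Drawing from a Gaussian KDE amounts to picking one stored sample uniformly at random and perturbing it with kernel noise.

```python
import random


def nearest_cluster(v, centers):
    """Index of the cluster centre closest to the feature vector v."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(v, centers[j]))
    return min(range(len(centers)), key=sq_dist)


def kde_sample(ys, bandwidth, rng=random):
    """Draw from a 1-D Gaussian KDE: pick a stored sample, then add kernel noise."""
    return rng.gauss(rng.choice(ys), bandwidth)


def make_global_sampler(x, featurize, centers, clusters, bandwidth, rng=random):
    """Look up the cluster for v(x) once and return a sampler for T_G(. | x)."""
    ys = clusters[nearest_cluster(featurize(x), centers)]
    return lambda: kde_sample(ys, bandwidth, rng)
```

The cluster assignment happens once in `make_global_sampler`. Every later global proposal costs only one random index and one Gaussian draw, independent of the number of clusters.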
This method yields a transition kernel that is a mixture of a reversible symmetric Metropolis-Hastings kernel and a Metropolized independence sampler. The combined transition kernel T is hence also reversible. Because the measure of each kernel dominates the support of the posterior, the kernel is ergodic and has the correct stationary distribution [38]. This ensures correctness of the inference; in the experiments we investigate the efficiency of the different methods in terms of convergence statistics.
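Reversibility can be checked numerically on a small discrete state space. The sketch below is an illustration under assumptions rather than a proof: it uses a hypothetical four-state target, a random-walk local proposal and a fixed global proposal `g`, and it follows the variant written in Algorithm 2, where the mixture proposal enters a single MH acceptance test.

```python
def local_proposal(n):
    """Symmetric random walk on states 0..n-1; moves off either end stay put."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                L[i][j] += 0.5
            else:
                L[i][i] += 0.5
    return L


def mixture_proposal(alpha, L, g):
    """Q = alpha * L + (1 - alpha) * G, where every row of G equals g."""
    n = len(g)
    return [[alpha * L[i][j] + (1.0 - alpha) * g[j] for j in range(n)] for i in range(n)]


def mh_kernel(pi, Q):
    """Transition matrix of MH with target pi and proposal matrix Q."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0.0:
                acc = min(1.0, (pi[j] * Q[j][i]) / (pi[i] * Q[i][j]))
                P[i][j] = Q[i][j] * acc
        P[i][i] = 1.0 - sum(P[i])  # rejected mass stays at the current state
    return P


def detailed_balance_gap(pi, P):
    """Largest violation of pi_i P_ij = pi_j P_ji over all state pairs."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) for i in range(n) for j in range(n))
```

For example, with `pi = [0.4, 0.05, 0.05, 0.5]`, `g = [0.3, 0.1, 0.1, 0.5]` and any α in [0, 1], the value of `detailed_balance_gap(pi, mh_kernel(pi, mixture_proposal(alpha, local_proposal(4), g)))` is zero up to floating-point rounding, even though `g` is only a rough approximation of `pi`.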