Importance sampling - Maximum likelihood parameter estimation in time series models using seque

We saw that rejection sampling can be wasteful, as it uses only about $1/M$ of the generated random samples to construct an approximation to $\pi$. In contrast, importance sampling uses every sample but weights each one according to the degree of similarity between the target and instrumental distributions. The idea of importance sampling follows from the importance sampling fundamental identity [Robert and Casella, 2004]: if there is a probability measure $\mu$ such that $\pi \ll \mu$ with the Radon-Nikodým derivative $w = \frac{d\pi}{d\mu}$, then we have

$$\pi(\phi) = \mu(\phi w).$$

This identity can be used with a $\mu$ which is easy to sample from. Sampling $X^{(1)}, \ldots, X^{(N)}$ from $\mu$, the integral $\pi(\phi) = \mu(\phi w)$ can be approximated by using perfect Monte Carlo as

$$\pi^N_{IS}(\phi) := \frac{1}{N} \sum_{i=1}^{N} \phi(X^{(i)})\, w(X^{(i)}). \tag{2.4}$$

Algorithm 2.2. Importance sampling:

• For $i = 1, \ldots, N$: generate $X^{(i)} \sim \mu$ and calculate $w(X^{(i)}) = \frac{d\pi}{d\mu}(X^{(i)})$.

• Set $\pi^N_{IS}(\phi) = \frac{1}{N} \sum_{i=1}^{N} w(X^{(i)})\, \phi(X^{(i)})$.
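The two steps of Algorithm 2.2 can be sketched in Python as follows. This is only a minimal illustration under assumed choices that do not come from the text: the target $\pi = N(0, 1)$, the instrumental distribution $\mu = N(0, 2^2)$, the test function $\phi(x) = x^2$, and the helper names `normal_pdf` and `importance_sampling` are all hypothetical. Both densities are fully normalised here, so the ratio of densities is the weight $w = d\pi/d\mu$.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_sampling(phi, target_pdf, proposal_pdf, proposal_sampler, n, rng):
    """Unnormalised importance sampling estimate of pi(phi), as in (2.4)."""
    x = proposal_sampler(n, rng)          # X^(i) ~ mu
    w = target_pdf(x) / proposal_pdf(x)   # w(X^(i)) = dpi/dmu (X^(i))
    return np.mean(w * phi(x))            # (1/N) * sum_i w(X^(i)) phi(X^(i))

rng = np.random.default_rng(0)
estimate = importance_sampling(
    phi=lambda x: x ** 2,                                   # assumed test function
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),           # assumed target pi
    proposal_pdf=lambda x: normal_pdf(x, 0.0, 2.0),         # assumed proposal mu
    proposal_sampler=lambda n, rng: rng.normal(0.0, 2.0, size=n),
    n=100_000,
    rng=rng,
)
print(estimate)  # should be close to pi(phi) = 1 for this choice of pi and phi
```

The instrumental distribution is deliberately wider than the target here, which keeps the weights bounded in this example.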

Importance sampling is summarised in Algorithm 2.2. The Radon-Nikodým derivatives $w(X^{(i)})$ are known as the importance sampling weights. Noting its equivalence to perfect Monte Carlo for $\mu(\phi w)$, the estimator in (2.4) is unbiased and is justified by the strong law of large numbers and the central limit theorem, provided that $\pi(\phi)$ and $\mathrm{var}_\mu[w(X)\phi(X)]$ are finite. Moreover, as we have freedom to choose $\mu$, we can control the variance of importance sampling [Robert and Casella, 2004]:

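$$\mathrm{var}\left[\pi^N_{IS}(\phi)\right] = \frac{1}{N}\,\mathrm{var}_\mu[w(X)\phi(X)] = \frac{1}{N}\left\{\mu(w^2\phi^2) - [\mu(w\phi)]^2\right\} = \frac{1}{N}\left\{\mu(w^2\phi^2) - [\pi(\phi)]^2\right\}.$$

Therefore, minimising $\mathrm{var}\left[\pi^N_{IS}(\phi)\right]$ is equivalent to minimising $\mu(w^2\phi^2)$, which can be lower bounded (by Jensen's inequality) as

$$\mu(w^2\phi^2) \geq [\mu(w|\phi|)]^2 = [\pi(|\phi|)]^2.$$

Equality holds if we choose $\mu$ such that it satisfies

$$w(x) = \frac{d\pi}{d\mu}(x) = \frac{\pi(|\phi|)}{|\phi(x)|}, \qquad x \in \mathcal{X},\ \phi(x) \neq 0.$$

This results in the optimum choice of $\mu$ to be

$$\mu(dx) = \pi(dx)\,\frac{|\phi(x)|}{\pi(|\phi|)}$$

for points $x \in \mathcal{X}$ such that $\phi(x) \neq 0$, and the resulting minimum variance is given by

$$\min_{\mu}\, \mathrm{var}\left[\pi^N_{IS}(\phi)\right] = \frac{1}{N}\left\{[\pi(|\phi|)]^2 - [\pi(\phi)]^2\right\}.$$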

Note that this minimum value is 0 if $\phi$ is nonnegative $\pi$-almost everywhere. Therefore, importance sampling can in principle achieve a lower variance than perfect Monte Carlo. Of course, if we cannot already compute $\pi(\phi)$, it is unlikely that we can compute $\pi(|\phi|)$. Also, it will be rare that we can easily simulate from the optimal $\mu$ even if we can construct it. Instead, we are guided to seek a $\mu$ close to the optimal one, but from which it is easy to sample.
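The zero-variance case can be checked numerically on a toy problem where the optimal instrumental distribution happens to be available. The example is an assumption made purely for illustration and is not taken from the text: $\pi$ is the Exp(1) distribution and $\phi(x) = x$, so that $\pi(\phi) = \pi(|\phi|) = 1$, the optimal $\mu(dx) \propto x e^{-x}\,dx$ is a Gamma(2, 1) distribution, and the optimal weight is $w(x) = 1/x$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Assumed toy problem: pi = Exp(1), phi(x) = x, hence pi(phi) = pi(|phi|) = 1.
# Optimal proposal: mu(dx) = pi(dx) |phi(x)| / pi(|phi|), proportional to x exp(-x),
# which is the Gamma distribution with shape 2 and scale 1.
x = rng.gamma(shape=2.0, scale=1.0, size=n)
w = 1.0 / x                  # w(x) = pi(|phi|) / |phi(x)|
terms = w * x                # each summand w(X) phi(X) equals pi(phi), up to rounding
optimal_estimate = terms.mean()

# Perfect Monte Carlo from pi itself, for comparison.
mc_terms = rng.exponential(scale=1.0, size=n)
mc_estimate = mc_terms.mean()

print(optimal_estimate, terms.var())   # estimate equals 1 up to rounding, variance ~ 0
print(mc_estimate, mc_terms.var())     # estimate fluctuates around 1, variance near 1
```

Every summand of the importance sampling estimator is constant under the optimal choice, which is exactly why the variance vanishes. The example also shows the catch noted above: writing down $w$ required knowing $\pi(|\phi|)$ in advance.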

2.3.1 Self-normalised importance sampling

Like rejection sampling, the importance sampling method is also available when $\pi = \hat{\pi}/Z_\pi$, $\mu = \hat{\mu}/Z_\mu$ and we only have $\hat{\pi}$ and $\hat{\mu}$. This time, letting $w = \frac{d\hat{\pi}}{d\hat{\mu}}$, we write the importance sampling fundamental identity in terms of $\hat{\pi}$ and $\hat{\mu}$ as

$$\pi(\phi) = \frac{\mu(\phi w)}{Z_\pi / Z_\mu} = \frac{\mu(\phi w)}{\mu(w)}.$$
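The two equalities follow directly from the definitions $\hat{\pi} = Z_\pi \pi$ and $\hat{\mu} = Z_\mu \mu$. A short derivation, using only those definitions and the fundamental identity for normalised measures, is sketched below.

```latex
\begin{align*}
w &= \frac{d\hat{\pi}}{d\hat{\mu}} = \frac{Z_\pi}{Z_\mu}\,\frac{d\pi}{d\mu}, \\
\mu(\phi w) &= \frac{Z_\pi}{Z_\mu}\,\mu\!\left(\phi\,\frac{d\pi}{d\mu}\right) = \frac{Z_\pi}{Z_\mu}\,\pi(\phi), \\
\mu(w) &= \frac{Z_\pi}{Z_\mu}\,\mu\!\left(\frac{d\pi}{d\mu}\right) = \frac{Z_\pi}{Z_\mu}\,\pi(1) = \frac{Z_\pi}{Z_\mu}.
\end{align*}
```

Dividing the second line by the third cancels the unknown ratio $Z_\pi / Z_\mu$ and leaves $\pi(\phi)$.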

The importance sampling method can be modified to approximate both the numerator (the unnormalised estimate) and the denominator (the normalisation constant) by using perfect Monte Carlo. Sampling $X^{(1)}, \ldots, X^{(N)}$ from $\mu$, we have the approximation

$$\pi^N_{IS}(\phi) = \frac{\frac{1}{N}\sum_{i=1}^{N} \phi(X^{(i)})\, w(X^{(i)})}{\frac{1}{N}\sum_{i=1}^{N} w(X^{(i)})} = \sum_{i=1}^{N} W^{(i)} \phi(X^{(i)}),$$

where $W^{(i)} = \frac{w(X^{(i)})}{\sum_{j=1}^{N} w(X^{(j)})}$ are called the normalised importance weights, as they sum to 1. Being the ratio of two unbiased estimators, the self-normalised importance sampling estimator is biased for finite $N$. However, its consistency and stability are provided by a strong law of large numbers and a central limit theorem in Geweke [1989]. In the same work, the variance of the self-normalised importance sampling estimator is analysed and an approximation is provided, which reveals that the method can provide lower-variance estimates than the unnormalised importance sampling method. Therefore, this method can be preferable to its unnormalised version even when the normalising constants of $\pi$ and $\mu$ are known.
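A small simulation can illustrate how the self-normalised estimator may have the lower variance even when both densities are fully known. The setting is an assumption chosen for illustration only: $\pi = N(0, 1)$, $\mu = N(0, 2^2)$ and $\phi(x) = x + 10$, a test function with a large constant offset. The outcome applies to this particular example and is not a general guarantee.

```python
import numpy as np

def normal_pdf(x, std):
    """Density of N(0, std^2) evaluated at x."""
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
n, repeats = 1_000, 500
phi = lambda x: x + 10.0   # assumed test function; pi(phi) = 10 under pi = N(0, 1)

# Each row is one independent run of N samples from the proposal mu = N(0, 2^2).
x = rng.normal(0.0, 2.0, size=(repeats, n))
w = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)   # normalised densities, so w = dpi/dmu

is_estimates = np.mean(w * phi(x), axis=1)                        # estimator (2.4)
snis_estimates = np.sum(w * phi(x), axis=1) / np.sum(w, axis=1)   # self-normalised

print(is_estimates.mean(), is_estimates.var())
print(snis_estimates.mean(), snis_estimates.var())
```

In this example the fluctuations of the average weight are multiplied by the constant offset in the unnormalised estimator, whereas dividing by the sum of the weights cancels them.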

Algorithm 2.3. Self-normalised importance sampling:

• For $i = 1, \ldots, N$: generate $X^{(i)} \sim \mu$ and calculate $w(X^{(i)}) = \frac{d\hat{\pi}}{d\hat{\mu}}(X^{(i)})$.

• For $i = 1, \ldots, N$: set $W^{(i)} = \frac{w(X^{(i)})}{\sum_{j=1}^{N} w(X^{(j)})}$.

• Set $\pi^N_{IS}(\phi) = \sum_{i=1}^{N} W^{(i)} \phi(X^{(i)})$.
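Algorithm 2.3 can be sketched in Python along the following lines. The function name `self_normalised_is`, the unnormalised Gaussian target and proposal, and the choice to work with log-weights are all assumptions of this sketch, not part of the text. Working in the log domain and subtracting the maximum log-weight is a common numerical safeguard; the shift cancels when the weights are normalised.

```python
import numpy as np

def self_normalised_is(phi, log_target_unnorm, log_proposal_unnorm,
                       proposal_sampler, n, rng):
    """Self-normalised IS estimate of pi(phi) from unnormalised log-densities."""
    x = proposal_sampler(n, rng)                            # X^(i) ~ mu
    log_w = log_target_unnorm(x) - log_proposal_unnorm(x)   # log of d pi_hat / d mu_hat
    log_w = log_w - np.max(log_w)    # constant shift; cancels after normalisation
    w = np.exp(log_w)
    W = w / np.sum(w)                # normalised importance weights W^(i)
    return np.sum(W * phi(x)), W

rng = np.random.default_rng(3)
estimate, W = self_normalised_is(
    phi=lambda x: x,
    # Assumed target: N(1, 1) known only up to its normalising constant.
    log_target_unnorm=lambda x: -0.5 * (x - 1.0) ** 2,
    # Assumed proposal: N(0, 3^2), also used without its normalising constant.
    log_proposal_unnorm=lambda x: -0.5 * (x / 3.0) ** 2,
    proposal_sampler=lambda n, rng: rng.normal(0.0, 3.0, size=n),
    n=100_000,
    rng=rng,
)
print(estimate, W.sum())  # estimate should be close to the target mean, 1
```

Neither normalising constant is ever evaluated: any constant factor in the weights cancels in the ratio defining $W^{(i)}$.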

Self-normalised importance sampling is also called Bayesian importance sampling in Geweke [1989], since in most Bayesian inference problems the normalising constant of the posterior distribution is unknown.

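One approximation to the variance of the self-normalised importance sampling estimator is proposed in Kong et al. [1994] to be

$$\mathrm{var}\left[\pi^N_{IS}(\phi)\right] \approx \frac{1}{N}\,\mathrm{var}_\pi[\phi(X)]\left\{1 + \mathrm{var}_\mu[w(X)]\right\} = \mathrm{var}\left[\pi^N_{MC}(\phi)\right]\left\{1 + \mathrm{var}_\mu[w(X)]\right\}.$$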

This approximation might be confusing at first, since it suggests that the variance of self-normalised importance sampling is always greater than that of perfect Monte Carlo, which we have just seen is not the case. However, it is useful as it provides an easy way of monitoring the efficiency of the method. Consider the ratio of the variances of the self-normalised importance sampling method with $N$ particles and perfect Monte Carlo with $N'$ particles, which is given according to this approximation by

$$\frac{\mathrm{var}\left[\pi^N_{IS}(\phi)\right]}{\mathrm{var}\left[\pi^{N'}_{MC}(\phi)\right]} \approx \frac{N'}{N}\left\{1 + \mathrm{var}_\mu[w(X)]\right\}.$$

The number $N'$ for which this ratio is 1 would suggest how many samples for perfect Monte Carlo would be equivalent to $N$ samples for self-normalised importance sampling. For this reason, this number is defined as the effective sample size [Kong et al., 1994; Liu, 1996] and it is given by

$$N_{\mathrm{eff}} = \frac{N}{1 + \mathrm{var}_\mu[w(X)]}.$$

Obviously, the term $\mathrm{var}_\mu[w(X)]$ itself is usually estimated using the samples $X^{(1)}, \ldots, X^{(N)}$.
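One widely used plug-in estimate, given here as an assumption since the text does not specify the estimator, replaces $\mathrm{var}_\mu[w(X)]$ by the squared empirical coefficient of variation of the weights. This yields $N_{\mathrm{eff}} \approx \left(\sum_i w(X^{(i)})\right)^2 / \sum_i w(X^{(i)})^2 = 1 / \sum_i (W^{(i)})^2$, which is unchanged if the weights are rescaled by a constant. The function names below are hypothetical.

```python
import numpy as np

def effective_sample_size(w):
    """Plug-in estimate of N_eff = N / (1 + var_mu[w(X)]) from importance weights."""
    w = np.asarray(w, dtype=float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

def effective_sample_size_cv(w):
    """Same estimate written as N / (1 + cv^2), cv = empirical coefficient of variation."""
    w = np.asarray(w, dtype=float)
    cv2 = np.var(w) / np.mean(w) ** 2   # population variance over squared mean
    return w.size / (1.0 + cv2)

print(effective_sample_size(np.ones(100)))          # equal weights: 100.0
print(effective_sample_size([1.0, 0.0, 0.0, 0.0]))  # one dominant weight: 1.0
```

The two functions agree algebraically; the first form is the one usually implemented because it works directly on unnormalised weights.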

In document Maximum likelihood parameter estimation in time series models using sequential Monte Carlo (Page 36-39)