The predictive distribution - Variational Bayesian Learning and its Applications

The posterior predictive distribution provides a distribution for a new data point given the observed data, and it makes use of the entire posterior distribution. Suppose $y^* = (y^*_1, \dots, y^*_{n^*})$ is a new observation; then the posterior predictive distribution of $y^*$ given $y$ is defined as

$$p(y^* \mid y) = \int p(y^* \mid \Theta)\, p(\Theta \mid y)\, d\Theta, \tag{5.7}$$

where Θ refers to the model parameters. For the one-way random-effects model with a DP prior, this quantity is intractable; however, MCMC methods provide a straightforward approximation. Given a sample of T points from the posterior, we can estimate it by

$$p(y^* \mid y) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid \Theta^{(t)}), \tag{5.8}$$

where $\Theta^{(t)}$ is the sample drawn from the posterior distribution after the chain reaches its stationary distribution. For Algorithm 6, $p(y^* \mid \Theta^{(t)})$ is given as follows:

$$p(y^* \mid \Theta^{(t)}) = \sum_{k=1}^{|c^{*(t)}|} P(c^{*(t)} = k)\, f(y^* \mid \zeta_k^{(t)}, \sigma^{2(t)}),$$

where again $|c^{*(t)}|$ denotes the number of values which $c^{*(t)}$ takes. For Algorithm 8, it is given as follows:

$$p(y^* \mid \Theta^{(t)}) = \sum_{b=1}^{B} v_b^{(t)} f(y^* \mid \zeta_b^{(t)}, \sigma^{2(t)}).$$
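The Monte Carlo estimate (5.8), combined with the mixture form used for Algorithm 8, can be sketched as follows. This is a minimal illustration rather than the author's implementation: the function name `log_predictive`, the array layout of the posterior draws, and the assumption that $f(y^* \mid \zeta_b, \sigma^2)$ is a product of independent normal densities with common mean $\zeta_b$ are all choices made here. The sum is evaluated on the log scale to avoid underflow when $n^*$ is large.

```python
import numpy as np

def log_predictive(y_new, weights, zetas, sigma2):
    """Log of the Monte Carlo estimate (5.8) for one new group y_new.

    y_new   : (n_star,) observations in the new group
    weights : (T, B) sampled mixture weights v_b^(t)
    zetas   : (T, B) sampled atoms zeta_b^(t)
    sigma2  : (T,)   sampled within-group variances
    """
    y_new = np.asarray(y_new, dtype=float)
    n_star = y_new.size
    # Sum of squared distances from every atom to the new observations: (T, B).
    sq = ((y_new[None, None, :] - zetas[:, :, None]) ** 2).sum(axis=2)
    # log f(y*|zeta_b, sigma2), assumed to be a product of normal densities.
    log_f = -0.5 * n_star * np.log(2 * np.pi * sigma2)[:, None] - 0.5 * sq / sigma2[:, None]
    log_terms = np.log(weights) + log_f
    # log-sum-exp over draws and components, then divide by T.
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum()) - np.log(weights.shape[0])

# Made-up posterior draws (T = 3, B = 2), only to show the call signature.
weights = np.array([[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]])
zetas = np.array([[-2.0, 4.0], [-2.1, 4.2], [-1.9, 3.9]])
sigma2 = np.array([0.6, 0.7, 0.65])
print(log_predictive([3.8, 4.1, 4.4], weights, zetas, sigma2))
```

For the Polya-urn form of Algorithm 6, the same function could be reused by passing $P(c^{*(t)} = k)$ as `weights`, provided the draws are padded to a common number of components with zero weights.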

For the VB method, it is natural to use the VB approximations to replace the unknown posterior distributions in (5.7). Thus, we can have the following approximation for the posterior predictive distribution:

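$$p(y^* \mid y) \approx \int \left( \sum_{b=1}^{B} v_b\, f(y^* \mid \zeta_b, \sigma^2) \right) dQ(v, \zeta, \sigma^2) = \sum_{b=1}^{B} E_{q(v_b)}[v_b] \int f(y^* \mid \zeta_b, \sigma^2)\, dQ(\zeta_b)\, dQ(\sigma^2), \tag{5.9}$$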

where $Q$ is the VB approximation. Unfortunately, although we have obtained simple and well-recognised distributions for $Q(\zeta_b)$ and $Q(\sigma^2)$, the integrals in (5.9) are still not available in closed form. However, we can apply the variational principle again to obtain a lower bound on this quantity, and we propose using this lower bound as an approximation for the posterior predictive distribution.

We denote $L_b = \int f(y^* \mid \zeta_b, \sigma^2)\, dQ(\zeta_b)\, dQ(\sigma^2)$. If we regard $Q(\zeta_b)$ and $Q(\sigma^2)$ as prior distributions, then $L_b$ can be regarded as a marginal likelihood, which can be approximated by the variational method. We denote by $v(\zeta_b)$ and $v(\sigma^2)$ the variational approximations which result from treating $Q(\zeta_b)$ and $Q(\sigma^2)$ as priors. Again, Theorem 2.2 can be used to obtain the distributional forms for $v(\zeta_b)$ and $v(\sigma^2)$, and gives the following results:

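$$v(\zeta_b) = N(A_b, B_b^2); \qquad A_b = \frac{\frac{G}{H} \sum_{i=1}^{n^*} y_i^* + \frac{a_b}{b_b^2}}{\frac{G}{H} n^* + \frac{1}{b_b^2}}, \qquad B_b^2 = \frac{1}{\frac{G}{H} n^* + \frac{1}{b_b^2}},$$

$$v(\sigma^2) = IG(G, H); \qquad G = g + \frac{n^*}{2}, \qquad H = h + \frac{1}{2} S^* + \frac{n^*}{2} \left( (A_b - \bar{y}^*)^2 + B_b^2 \right),$$

where $n^*$ is the number of observations in $y^*$, $\bar{y}^*$ is the mean of $y^*$, $S^*$ is the total sum of squares of $y^*$, and a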

Once the variational parameters $A_b$, $B_b^2$, $G$, and $H$ converge, we can obtain a lower bound of the logarithm of $L_b$, denoted as $F_b$, which is given as follows:

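$$\begin{aligned}
F_b &= \int \log\frac{q(\zeta_b)}{v(\zeta_b)}\, dV(\zeta_b) + \int \log\frac{q(\sigma^2)}{v(\sigma^2)}\, dV(\sigma^2) + \int \log f(y^* \mid \zeta_b, \sigma^2)\, dV(\zeta_b, \sigma^2) \\
&= \log\frac{1}{b_b} - \log\frac{1}{B_b} - \frac{1}{2 b_b^2}\left( (A_b - a_b)^2 + B_b^2 \right) + \frac{1}{2} \\
&\quad + (G - g)(\log H - \psi(G)) + G\left( 1 - \frac{h}{H} \right) + \log\frac{h^g}{\Gamma(g)} - \log\frac{H^G}{\Gamma(G)} \\
&\quad - \frac{n^*}{2}\left( \log 2\pi + \log H - \psi(G) \right) - \frac{1}{2}\,\frac{G}{H}\left( \sum_{i=1}^{n^*} (y_i^* - A_b)^2 + n^* B_b^2 \right),
\end{aligned}$$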

where $\Gamma(\cdot)$ is the gamma function and $\psi(\cdot)$ is the digamma function.

Once we obtain the values of each $F_b$ for $b = 1, \dots, B$, we can obtain a lower bound for (5.9):

$$\sum_{b=1}^{B} E_{q(v_b)}[v_b]\, L_b \ge \sum_{b=1}^{B} E_{q(v_b)}[v_b] \exp(F_b) \equiv F.$$

Thus, we propose to use $F$ as an approximation for the posterior predictive distribution $p(y^* \mid y)$.
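The procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the thesis code: it assumes $Q(\zeta_b) = N(a_b, b_b^2)$ and $Q(\sigma^2) = IG(g, h)$ with shape $g$ and scale $h$, it evaluates $F_b$ directly as the expected log likelihood under $V$ minus the two Kullback-Leibler divergences from $V$ to $Q$ (keeping all constants), and the function names and the numbers in the usage example are made up.

```python
import numpy as np
from scipy.special import digamma, gammaln

def fit_component(y, a, b2, g, h, tol=1e-10, max_iter=1000):
    """Iterate the updates for A_b, B_b^2, G, H until H stops changing."""
    y = np.asarray(y, dtype=float)
    n, ybar = y.size, y.mean()
    S = ((y - ybar) ** 2).sum()      # total sum of squares S*
    G = g + 0.5 * n                  # G does not depend on the other parameters
    H = h + 0.5 * S                  # starting value for H
    for _ in range(max_iter):
        prec = G / H * n + 1.0 / b2
        A = (G / H * y.sum() + a / b2) / prec
        B2 = 1.0 / prec
        H_new = h + 0.5 * S + 0.5 * n * ((A - ybar) ** 2 + B2)
        converged = abs(H_new - H) < tol
        H = H_new
        if converged:
            break
    return A, B2, G, H

def lower_bound(y, a, b2, g, h):
    """F_b = E_V[log f] - KL(v(zeta) || q(zeta)) - KL(v(sigma2) || q(sigma2))."""
    y = np.asarray(y, dtype=float)
    n = y.size
    A, B2, G, H = fit_component(y, a, b2, g, h)
    e_log_s2 = np.log(H) - digamma(G)   # E_V[log sigma^2]
    e_prec = G / H                      # E_V[1 / sigma^2]
    # Negative KL between the normal densities v(zeta) and q(zeta).
    neg_kl_zeta = 0.5 * np.log(B2 / b2) - ((A - a) ** 2 + B2) / (2 * b2) + 0.5
    # Negative KL between the inverse-gamma densities v(sigma2) and q(sigma2).
    neg_kl_s2 = ((G - g) * e_log_s2 + G * (1 - h / H)
                 + g * np.log(h) - gammaln(g) - G * np.log(H) + gammaln(G))
    # Expected log likelihood of the new group under V.
    e_loglik = (-0.5 * n * (np.log(2 * np.pi) + e_log_s2)
                - 0.5 * e_prec * (((y - A) ** 2).sum() + n * B2))
    return neg_kl_zeta + neg_kl_s2 + e_loglik

def vb_predictive(y, w, a, b2, g, h):
    """F = sum_b E[v_b] exp(F_b), the proposed approximation to p(y*|y)."""
    F_b = np.array([lower_bound(y, a[b], b2[b], g, h) for b in range(len(w))])
    return float(np.sum(np.asarray(w) * np.exp(F_b)))

# Hypothetical inputs: two components, one new group of five observations.
y_star = np.array([0.9, 1.1, 1.3, 0.7, 1.0])
F = vb_predictive(y_star, w=[0.6, 0.4], a=[1.0, 3.0], b2=[0.05, 0.05], g=50.0, h=30.0)
print(np.log(F))
```

Because each $F_b$ is a lower bound on $\log L_b$, the value returned by `vb_predictive` should never exceed the mixture of the exact $L_b$ values; a Monte Carlo estimate of $L_b$ obtained by sampling from $Q$ offers a simple sanity check on small examples.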

5.5 Numerical studies

We examine the performance of the VB method by comparing it with the two MCMC methods on simulated data. To generate the data, we set the parameters of the base distribution in (5.2) to $\mu = 0$ and $\tau^2 = 16$, and set $\sigma^2 = 0.64$. We use the truncated stick-breaking representation to construct the random distribution $F$. For demonstration purposes, we simply truncate $F$ at level 5, as shown in Table 5.1. A data set of 60 groups is generated from $F$, and each group contains 80 data points. We use 50 groups as the observed data and 10 groups as the future data.

Table 5.1: A random distribution F, truncated at level 5

ζ_b       -2.22   -0.54    1.01    4.28    7.10
P(ζ_b)     0.35    0.14    0.13    0.13    0.26
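The data-generating step can be sketched as follows. This is a sketch under assumptions rather than the original simulation code: it assumes the one-way random-effects structure in which each group draws a single mean from $F$ and its 80 observations are normal around that mean with variance $\sigma^2 = 0.64$; the seed and the choice of the first 50 groups as the observed data are arbitrary. The probabilities in Table 5.1 are rounded and sum to 1.01, so they are renormalised before sampling.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

atoms = np.array([-2.22, -0.54, 1.01, 4.28, 7.10])   # zeta_b from Table 5.1
probs = np.array([0.35, 0.14, 0.13, 0.13, 0.26])     # P(zeta_b) from Table 5.1
probs = probs / probs.sum()   # rounded table values sum to 1.01, so renormalise

n_groups, n_per_group, sigma2 = 60, 80, 0.64

# Each group draws its own mean from F, then 80 observations around that mean.
group_means = rng.choice(atoms, size=n_groups, p=probs)
y = rng.normal(loc=group_means[:, None], scale=np.sqrt(sigma2),
               size=(n_groups, n_per_group))

# Assumed split: first 50 groups observed, last 10 held out as future data.
y_observed, y_future = y[:50], y[50:]
print(y_observed.shape, y_future.shape)
```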

In the VB learning, we assume we have no knowledge about the distribution F , and also mis-specify the truncation level to 10. The algorithm converges after 19 iterations.

Table 5.2: The VB approximations for the random distribution F

b          1      2      3      4      5      6      7      8      9      10
E[v_b]     0.167  0.16   0.12   0.12   0.01   0.01   0.01   0.13   0.13   0.11
E[ζ_b]    -2.24  -2.24  -0.55   0.97   2.06   2.06   2.06   4.23   7.12   7.12

Table 5.2 gives the expected values of $v_b$ and $\zeta_b$ under the VB approximations. We can see a clear pattern. The expected probability weights for components 5, 6, and 7 are close to zero, which suggests that they can be ruled out of the true model. Components 1 and 2 share exactly the same value of −2.24, which is close to the value of component 1 in Table 5.1, and their cumulative expected probability weight of 0.327 is also close to 0.35 in Table 5.1. We can observe a similar situation for components 9 and 10. Thus, by combining identical components (those with the same values) and ruling out the empty components (those with very small probability weights), we can conclude that VB picks up 5 components for the random distribution $F$.
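The merging rule described above can be written down explicitly. The sketch below is only one way to operationalise it: the weight threshold of 0.05 and the rounding to two decimals are hypothetical choices made for illustration, and the text does not state a specific cut-off.

```python
from collections import defaultdict

# Expected weights and atoms from Table 5.2.
weights = [0.167, 0.16, 0.12, 0.12, 0.01, 0.01, 0.01, 0.13, 0.13, 0.11]
atoms = [-2.24, -2.24, -0.55, 0.97, 2.06, 2.06, 2.06, 4.23, 7.12, 7.12]

def merge_components(weights, atoms, min_weight=0.05, decimals=2):
    """Drop near-empty components, then pool the weights of equal atoms."""
    merged = defaultdict(float)
    for w, z in zip(weights, atoms):
        if w < min_weight:          # hypothetical cut-off for "empty" components
            continue
        merged[round(z, decimals)] += w
    return dict(merged)

components = merge_components(weights, atoms)
for z, w in components.items():
    print(f"{z:6.2f}  {w:.3f}")
```

With these inputs, five distinct atoms remain, with pooled weights of about 0.327 at −2.24 and 0.24 at 7.12.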

For the Polya-urn type Gibbs sampler (Algorithm 6), we run $2 \times 10^5$ iterations. We use the last 20% of the draws, by which point we believe the chain has reached its stationary distribution. To reduce the effect of serial correlation, we keep every 25th draw. The frequencies of the number of distinct $\zeta_b$ are given in Table 5.3. We see that the posterior probability favors 5, 6, or 7 components, and that 6 components has the largest probability.

Table 5.3: Posterior probabilities for the number of ζ

# of ζ       5      6      7      8      9      10
P(# of ζ)    0.270  0.386  0.254  0.068  0.018  0.002
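The burn-in, thinning, and tabulation steps behind Table 5.3 can be illustrated as follows. The helper names are hypothetical, and the cluster labels in the usage example are made up; only the chain length, the retained fraction, and the thinning interval come from the text.

```python
from collections import Counter

def kept_indices(n_iter, keep_frac=0.2, thin=25):
    """Indices retained after keeping the last keep_frac of the chain and thinning."""
    start = round(n_iter * (1 - keep_frac))
    return range(start, n_iter, thin)

def distinct_count_probs(label_draws):
    """Relative frequency of the number of distinct labels per retained draw."""
    counts = Counter(len(set(draw)) for draw in label_draws)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}

print(len(kept_indices(200_000)))     # retained draws for the Polya-urn sampler

# Made-up cluster labels for four retained draws of three groups each.
toy_draws = [[0, 0, 1], [0, 1, 2], [1, 1, 1], [2, 2, 0]]
print(distinct_count_probs(toy_draws))
```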

For the blocked Gibbs sampler (Algorithm 8), we run $2.5 \times 10^6$ iterations and use the last 20% of the draws. To reduce the effect of serial correlation, we keep every 25th draw. Even with the order constraints on ζ, the chain still shows signs of label switching. Thus, a single value of $v_b$ or $\zeta_b$ may lose its interpretability.

Finally, we compare the posterior predictive distributions approximated by the three methods. We compute the log predictive likelihoods, shown in Table 5.4, for the 10 groups of future data. For the Gibbs samplers, an additional 2,500 samples are collected and used in the computation. We see that the three methods give very similar values. The mean values are −95.95, −97.30, and −97.32 for the Polya-urn sampler, the blocked sampler, and VB, respectively. A t test comparing the log predictive likelihoods computed by Algorithm 8 and by VB cannot reject the hypothesis that the true difference in means is equal to 0 (p-value 0.9923). The same test for Algorithm 8 versus Algorithm 6 gives a p-value of 0.5049.

Table 5.4: Log predictive likelihood for 10 groups of future data

Group        1        2        3        4        5         6        7        8         9        10
Polya-urn   -96.19   -98.43   -89.45   -97.35   -104.31   -95.64   -90.36   -99.84    -92.86   -95.11
Blocked     -97.40   -99.67   -90.59   -98.53   -105.84   -96.76   -91.50   -100.82   -95.53   -96.32
VB          -97.29   -99.88   -90.46   -98.74   -105.90   -96.88   -91.37   -100.62   -95.47   -96.54
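The comparison of means can be repeated from the values in Table 5.4. The text does not say which variant of the t test was used; the sketch below assumes an unpaired two-sample test without the equal-variance assumption (Welch's test), using `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Log predictive likelihoods from Table 5.4.
polya = np.array([-96.19, -98.43, -89.45, -97.35, -104.31,
                  -95.64, -90.36, -99.84, -92.86, -95.11])
blocked = np.array([-97.40, -99.67, -90.59, -98.53, -105.84,
                    -96.76, -91.50, -100.82, -95.53, -96.32])
vb = np.array([-97.29, -99.88, -90.46, -98.74, -105.90,
               -96.88, -91.37, -100.62, -95.47, -96.54])

print("means:", polya.mean(), blocked.mean(), vb.mean())

# Assumed test: unpaired, unequal variances (Welch).
test_vb = stats.ttest_ind(blocked, vb, equal_var=False)
test_polya = stats.ttest_ind(blocked, polya, equal_var=False)
print("Algorithm 8 vs VB:          p =", test_vb.pvalue)
print("Algorithm 8 vs Algorithm 6: p =", test_polya.pvalue)
```

Under this assumption the p-values should land close to the reported 0.9923 and 0.5049. A paired test would be a natural alternative, since all three methods are evaluated on the same ten groups, but it would answer a different question and give different p-values.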

5.6 Discussion

The variational Bayes method provides a computationally efficient technique to approximate certain posterior quantities in the context of hierarchical modelling using Dirichlet process priors. To avoid the limitation of the existing variational formalism, which relies on conjugate exponential families, we consider VB in a new framework. The parameter separation parameterization (Section 2.5.2) gives a factorization which allows flexible dependence structures. Based on this new framework, we provide a full variational solution for the Dirichlet process with a non-conjugate base prior. The numerical results show that the VB method is very computationally efficient. Moreover, the comparison with two different MCMC methods shows that VB provides accurate approximations for the posterior predictive distribution. Finally, we propose an empirical method to estimate the truncation level for the truncated DP.

Chapter 6 Variational Bayes for Regime-switching Lognormal Models

This chapter describes how to apply the VB method to the regime-switching lognormal model and how it provides a computationally fast solution to quantify the uncertainty in the model specification and parameter specification. The results show that the method can recover the model structure exactly, gives reasonable point estimates, and is very computationally efficient. The potential problems of the method in quantifying the parameter uncertainty are discussed. To remedy these problems, the methods proposed in Chapter 4 are used to compute the true posterior covariance matrix.

6.1 Introduction

Switching between different states or regimes is a common phenomenon in many time series, and regime-switching models, originally proposed by Hamilton (1989), have been used to model these switching processes. Of particular interest to this chapter is the regime-switching lognormal model (RSLN) proposed by Hardy (2001). As demonstrated in Hardy (2002), the maximum likelihood estimate (MLE) does not give a simple method to deal with parameter uncertainty. The asymptotic normality of maximum likelihood estimators may not apply for sample sizes commonly found in practice. Hence, to understand parameter uncertainty, Hardy (2002) considered the RSLN model in a Bayesian framework using the Metropolis-Hastings algorithm. Furthermore, model uncertainty, in particular selecting the correct number of regimes, is a major issue. Hence, model selection criteria have to be used to choose the best model. Hardy (2001) found that a two-regime RSLN model maximized the Bayes Information Criterion (BIC) (Schwarz, 1978) for both monthly TSE 300 total return data and S&P 500 total return data; however, according to the Akaike Information Criterion (AIC) (Akaike, 1974), a three-regime model was optimal for the S&P data. To account for the model uncertainty associated with the number of regimes, Hartman and Heaton (2011) offered a dynamic estimation of the number of regimes using a Chinese restaurant process.

MCMC methods make possible the computation of all posterior quantities; however, there are a number of practical issues associated with their implementation. Detailed discussions can be found in Chapter 1. In particular, computational speed is one of the main advantages of VB.

This chapter shows how to apply the VB method to the RSLN model and presents a solution to the model specification problem. In particular, it looks at how to find the appropriate number of regimes. While the simplification in the dependence structure gives computational advantages, it also comes at a cost. For example, we found that the posterior variance may be underestimated and the correlation structure distorted. We will use the techniques introduced in Chapter 4 to approximate the true posterior covariance matrix.

Moreover, through the numerical results, we can observe that the VB approximations tend to present an approximately symmetric and bell shaped pattern. In this chapter, we aim to explore the asymptotic properties of the VB method.

tion in the RSLN model. Numerical studies on simulated data and real data are provided in Section 6.3, where the VB method is compared with both the criterion-based model selection procedure and the MCMC method. Section 6.4 uses the three methods proposed in Chapter 4 to estimate the true posterior covariance matrix. Section 6.5 discusses the asymptotic normality. Conclusions are given in the last section.
