3.3 Bayesian inference
3.3.2 The prior distributions
Unlike the derivation of the likelihood in Section 3.3.1, finding a joint prior distribution for our parameters requires us to choose suitable distributions. In Bayesian analysis, this choice of prior distributions is a delicate but unavoidable step, because the posterior distribution, and hence the whole analysis, is affected by it.
The general prior
Let $(x_t)_{t=1}^{T}$ be a realisation of a p-dimensional Vector Autoregressive process of lag length k from which the Vector Error Correction Model in (2.10) is derived. We then have three unknown parameters to determine: Π, Ψ and the covariance matrix of the errors Σ.
The works of Bauwens and Lubrano (1996), Geweke (1996) and Kleibergen and van Dijk (1994) directly impose a reduced rank r on the cointegrating matrix Π and decompose the latter into two full-rank p × r matrices α and β, which they then infer in order to analyse the cointegration relations. For each model, the Bayesian analysis is conditional on the rank r being known beforehand. In practice, they fit different models for different values of the cointegration rank and generally use posterior or predictive Bayesian odds ratios to assess its value.
On the other hand, the method used by Villani (2005) consists of inferring the parameters α, β, Ψ and Σ of the Error Correction Model conditional on the rank, and of developing a posterior distribution for the rank that is conditional on the data only (see Section 2.4.3).
In this chapter, we want to determine the cointegration rank by including it in the MCMC procedure together with all the other parameters. To that end, we decided not to decompose Π and to use a non-singular prior distribution on Π, that is, a distribution whose draws are non-singular (invertible) matrices. The idea of this chapter is to simulate the matrix Π at each step of the MCMC procedure and to estimate the rank of that matrix based on the irrelevance of some of its singular values. We consider in this thesis that the number of independent cointegration relations, i.e. the cointegration rank, is fully determined by the cointegrating matrix Π, and thus does not need a prior distribution of its own.
A non-singular prior distribution for Π given Σ
This section gives a non-singular prior distribution for Π and explains why we choose non-singularity for Π even though the matrix is theoretically singular.
Since our time series are assumed to be I(1), so that at least one of them is non-stationary, Π has reduced rank r. In principle, therefore, we should not choose a non-singular distribution for Π. However, the working assumption of this chapter is to treat the cointegrating matrix Π as a full-rank matrix. We first assume that Π has a non-singular prior distribution. A non-singular posterior distribution is then derived and, using the programming language R (R Core Team, 2013), we will see that for each simulated cointegrating matrix Π some singular values are close to 0. These singular values are regarded as irrelevant, and the cointegration rank is estimated by the number of singular values that are not regarded as irrelevant.
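The rank estimate described above can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, and the function name `estimate_rank` and the relative threshold `rel_tol` are assumptions made for this illustration only; the text does not specify here how irrelevance of a singular value is decided.

```python
import numpy as np

def estimate_rank(Pi, rel_tol=1e-6):
    """Count the singular values of Pi exceeding rel_tol times the largest one.

    rel_tol is a hypothetical cut-off for "irrelevant" singular values.
    """
    s = np.linalg.svd(np.asarray(Pi, dtype=float), compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

# A 3 x 3 matrix that is invertible but dominated by one direction:
# a rank-one product alpha beta' plus a tiny full-rank perturbation.
alpha = np.array([[1.0], [2.0], [3.0]])
beta = np.array([[1.0], [0.0], [-1.0]])
Pi = alpha @ beta.T + 1e-9 * np.eye(3)

rank_hat = estimate_rank(Pi)  # counts only the dominant singular value
```

In an MCMC setting, such a function would be applied to every simulated Π, giving one rank estimate per draw.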
A reasonable non-singular prior for Π given Σ is a matrix normal distribution:
$$\Pi \mid \Sigma \sim N_{p \times p}(0,\, v^{-1}\Sigma,\, I_p) \tag{3.5}$$
The aim of this section is now to motivate the choice of this normal distribution. First of all, the prior is conjugate, i.e. the posterior distribution will also be a matrix normal distribution. Now let us analyse the form of the prior distribution:
$$\Pi \mid \Sigma \sim N_{p \times p}(0,\, v^{-1}\Sigma,\, I_p) \iff \mathrm{Vec}(\Pi) \mid \Sigma \sim N_{p^2}(0,\, I_p \otimes v^{-1}\Sigma) \tag{3.6}$$
We have chosen the prior mean to be equal to 0 so that it has no influence on the values of the coefficients at the outset. The scalar v is a fixed tuning hyperparameter reflecting how strongly the probability distribution of Π is concentrated around 0 (i.e. the mean of Π). On the other hand, since we have no prior information and do not want to emphasise the initial non-cointegration assumption too much, we inflate the variance-covariance matrix of Π, which gives a weakly informative prior on Π. To that end, we use the scalar v to obtain a larger covariance matrix: a small value such as v = 0.001 multiplies the variance by v^{-1} = 1000. The Bayesian updates will in any case shift the mean of Π towards values supported by the data and the other parameters.
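To make the role of v concrete, the following sketch draws Π from the matrix normal prior (3.5) for two values of v. It is an illustration in Python with NumPy, not the implementation used in this thesis; the function name `draw_pi_prior`, the example Σ and the seed are assumptions.

```python
import numpy as np

def draw_pi_prior(Sigma, v, rng):
    """Draw Pi | Sigma ~ N_{p x p}(0, Sigma / v, I_p) using a Cholesky factor."""
    Sigma = np.asarray(Sigma, dtype=float)
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)      # Sigma = L L'
    Z = rng.standard_normal((p, p))    # independent N(0, 1) entries
    return (L @ Z) / np.sqrt(v)        # each column has covariance Sigma / v

# Hypothetical 2 x 2 error covariance, for illustration only
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Re-using the same seed isolates the effect of v on the spread of the draw
Pi_tight = draw_pi_prior(Sigma, v=1.0, rng=np.random.default_rng(0))
Pi_diffuse = draw_pi_prior(Sigma, v=0.001, rng=np.random.default_rng(0))
```

With the same underlying normal draws, the entries for v = 0.001 are those for v = 1 multiplied by the square root of 1000 (about 31.6), which corresponds to a variance 1000 times larger.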
Let us now look at the variance of Vec(Π) in order to explain our choice in more detail:
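$$I_p \otimes v^{-1}\Sigma =
\begin{pmatrix}
v^{-1}\Sigma & 0 & \cdots & 0 \\
0 & v^{-1}\Sigma & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & v^{-1}\Sigma
\end{pmatrix}
= v^{-1}
\begin{pmatrix}
\Sigma & 0 & \cdots & 0 \\
0 & \Sigma & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma
\end{pmatrix}$$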
This variance must depend on Σ since it is reasonable to think that Π depends on the error terms of our model. In order to remain as objective as possible, we use a block diagonal matrix in which every diagonal block is the same multiple v^{-1} of Σ. We set all off-diagonal blocks to 0 because we assume that the columns of Π are uncorrelated, that is, if π_i denotes the i-th column of Π, then Cov[π_i, π_k] = 0_{p×p} for i ≠ k.
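The block diagonal structure of the variance of Vec(Π) can be checked numerically. The sketch below, in Python with NumPy, builds the Kronecker product for a small example; the function name `vec_prior_covariance` and the example Σ are assumptions made for illustration.

```python
import numpy as np

def vec_prior_covariance(Sigma, v):
    """Covariance of Vec(Pi) | Sigma, i.e. the Kronecker product I_p (x) (Sigma / v)."""
    Sigma = np.asarray(Sigma, dtype=float)
    p = Sigma.shape[0]
    return np.kron(np.eye(p), Sigma / v)

# Hypothetical 2 x 2 error covariance, for illustration only
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
V = vec_prior_covariance(Sigma, v=0.001)

diagonal_block = V[:2, :2]       # Cov[pi_1, pi_1] = Sigma / v
off_diagonal_block = V[:2, 2:]   # Cov[pi_1, pi_2] = 0
```

Each diagonal block equals Σ scaled by 1000, and the off-diagonal blocks are zero, which is the statement that distinct columns of Π are uncorrelated.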
A prior for Ψ given Σ
Ψ is a random matrix of size d × p (with d = (k − 1)p) that depends on Σ. Its prior is chosen to be a matrix normal distribution, a departure from Villani (2005), who sets a uniform prior on Ψ for simplicity. In addition, as for the distribution of Π, we can introduce a scale w in order to control the weakness of the prior information about Ψ. However, in all the results of this thesis, a scale of w = 1 was used without any problem. The scale hyperparameter is kept for generality, in case it is needed.
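The prior distribution of Ψ|Σ is in that case a zero-mean matrix normal distribution whose covariance is scaled by w^{-1}Σ; its density is given up to proportionality in (3.11).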
A prior for Σ: Inverse-Wishart distribution
The definition below recalls the probability density function of an Inverse-Wishart distribution:
Definition 6. Probability density function of the Inverse-Wishart distribution
Let V ∼ IWp(B, m) where B is a positive definite scale matrix, and m and p are positive integers. Then V is positive definite and has the probability density function:
$$f(V) = \frac{|B|^{m/2}}{2^{mp/2}\,\Gamma_p(m/2)}\,|V|^{-(m+p+1)/2}\exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(BV^{-1}\right)\right)$$
where Γp(·) is the multivariate gamma function.
The Inverse-Wishart distribution is commonly used in Bayesian statistics to infer covariance matrices of normally distributed data. For instance, consider a sequence ε_1, ε_2, ..., ε_N of N random variables where each random p-vector ε_i has a multivariate normal distribution with mean 0 and covariance matrix Σ. If an Inverse-Wishart prior IWp(B, m) is defined for Σ, the prior is conjugate and we obtain an Inverse-Wishart posterior distribution IWp(B + S, m + N), where S is the sample sum of squares $S = \sum_{i=1}^{N} \varepsilon_i \varepsilon_i'$. The equivalent univariate distribution is the Inverse-Gamma, which is likewise used to infer the variance of a univariate random variable in Bayesian statistics.
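The conjugate update from IWp(B, m) to IWp(B + S, m + N) amounts to simple bookkeeping on the two parameters. The following Python sketch illustrates it, writing ε_i for the i-th zero-mean error vector; the function name `inverse_wishart_update` and the toy data are assumptions made for illustration.

```python
import numpy as np

def inverse_wishart_update(B, m, eps):
    """Return the posterior parameters (B + S, m + N).

    eps holds one zero-mean p-vector per row, so S = sum_i eps_i eps_i' = eps' eps.
    """
    eps = np.asarray(eps, dtype=float)
    N = eps.shape[0]
    S = eps.T @ eps
    return np.asarray(B, dtype=float) + S, m + N

# Toy example with p = 2 and N = 3 observations (made-up numbers)
B = np.eye(2)
m = 5
eps = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [1.0, 1.0]])

B_post, m_post = inverse_wishart_update(B, m, eps)
```

Here S = [[2, 1], [1, 5]], so the posterior scale matrix is [[3, 1], [1, 6]] and the posterior degrees of freedom are 8.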
In this chapter, the prior of Σ is chosen to be an Inverse-Wishart distribution with parameters A and q:
$$\Sigma \sim IW_p(A, q) \tag{3.8}$$
The parameter A is called the scale matrix and q represents the degrees of freedom of the Inverse-Wishart distribution. A and q are therefore hyperparameters which we must set to suitable values. The hyperparameter A is estimated from a pre-sample (a training data set) of the data (see Section 3.3.6). As for q, it must be strictly greater than p + 3 in order for the variance of Σ (see equation (3.33) in Section 3.3.6) to stay positive definite; see also the definition of the density of an Inverse-Wishart distribution in Gupta and Nagar (2000). We subjectively took the value q = p + 4.
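As a short worked illustration of what the choice q = p + 4 implies, recall the standard expression for the mean of an Inverse-Wishart distribution under the parameterisation of the density used here, which holds whenever q > p + 1. This is a textbook property added for orientation and is not derived in this section.

```latex
\mathbb{E}[\Sigma] = \frac{A}{q - p - 1},
\qquad\text{so that for } q = p + 4:\qquad
\mathbb{E}[\Sigma] = \frac{A}{3}.
```

Under this choice the prior mean of Σ is one third of the scale matrix A, whatever the dimension p.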
The prior for all the parameters
In this chapter, we will assume that Π and Ψ are independent conditionally on Σ. Thus, we can easily obtain the joint prior distribution of these parameters, which constitutes the prior of our model:
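$$f(\Pi, \Psi, \Sigma) = f(\Pi \mid \Sigma)\, f(\Psi \mid \Sigma)\, f(\Sigma) \tag{3.9}$$
According to the chosen prior distributions of Π|Σ, Ψ|Σ and Σ, their respective densities satisfy the following proportionality relations:
$$f(\Pi \mid \Sigma) \propto |\Sigma|^{-p/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\Sigma^{-1} v\,\Pi\Pi'\right)\right) \tag{3.10}$$
$$f(\Psi \mid \Sigma) \propto |\Sigma|^{-d/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\Sigma^{-1} w\,\Psi\Psi'\right)\right) \tag{3.11}$$
$$f(\Sigma) \propto |\Sigma|^{-(p+q+1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\Sigma^{-1} A\right)\right) \tag{3.12}$$
Therefore, by using (3.9) and multiplying f(Π|Σ), f(Ψ|Σ) and f(Σ), we immediately obtain the proportionality relation satisfied by the full prior of the VECM:
$$f(\Pi, \Psi, \Sigma) \propto |\Sigma|^{-(2p+d+q+1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\Sigma^{-1} v\,\Pi\Pi' + \Sigma^{-1} A + \Sigma^{-1} w\,\Psi\Psi'\right)\right) \tag{3.13}$$
where | · | denotes the determinant and Tr(·) denotes the trace.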
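The joint prior kernel in (3.13) is most conveniently evaluated on the log scale. The sketch below is an illustration in Python with NumPy; the function name `log_prior_kernel` is an assumption, and Ψ is passed with p rows and d columns (an assumption of this sketch) so that ΨΨ' is p × p and conformable with Σ^{-1} inside the trace.

```python
import numpy as np

def log_prior_kernel(Pi, Psi, Sigma, A, v, w, q):
    """Log of the unnormalised joint prior (3.13), up to an additive constant.

    Assumes Pi is p x p, Psi is p x d (so Psi Psi' is p x p),
    and Sigma and A are p x p positive definite matrices.
    """
    Pi, Psi, Sigma, A = (np.asarray(M, dtype=float) for M in (Pi, Psi, Sigma, A))
    p = Sigma.shape[0]
    d = Psi.shape[1]
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    inner = v * (Pi @ Pi.T) + A + w * (Psi @ Psi.T)
    # Tr(Sigma^{-1} M) computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(Sigma, inner))
    return -0.5 * (2 * p + d + q + 1) * logdet - 0.5 * trace_term

# Toy evaluation with p = 2, d = 2 and q = p + 4 (made-up inputs)
I2 = np.eye(2)
value = log_prior_kernel(Pi=I2, Psi=np.zeros((2, 2)), Sigma=I2, A=I2, v=1.0, w=1.0, q=6)
```

For these inputs the log-determinant is 0 and the trace is 4, so the log kernel equals −2. Working with logarithms avoids the underflow that the determinant power in (3.13) can cause for larger p.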