GENERALIZED LINEAR MODELS WHEN THE EXPOSURE IS UNTRANSFORMED
MISE Turning
5 MEASUREMENT ERROR CORRECTION FOR SELECTION OF A QUADRATIC TERM
5.4 BAYESIAN TRANSFORMATION SELECTION (BTS)
As an alternative to using either DIC or the 95% CrIs to determine the best model, one may employ methods traditionally used for Bayesian variable selection, that is, for selecting a subset of variables from a larger set. Bayesian variable selection encompasses a wide range of specific methods [120]. A few publications have explored Bayesian variable and transformation selection together [121–123], although only one specifically explores selection of fractional polynomial transformations [55]. In this thesis, I focus solely on selection of the best transformation for a single error-prone variable; therefore, this approach is referred to as “Bayesian transformation selection” (BTS) when adapted to this context. As with DIC, this approach is unnecessary for the selection of the quadratic model over the linear model. BTS was applied here in order to assess its performance in a simpler setting before extending the method to the selection of fractional polynomial models for an error-prone exposure in the next chapter.
I will first describe the Kuo and Mallick method of Bayesian variable selection [60] (Section 5.4.1). This method is arguably the simplest to implement and requires no tuning variables (i.e. parameters with no other purpose than to ensure good mixing of the sampler). In Section 5.4.2, I adapt the Kuo and Mallick method to determine the posterior probability of a quadratic model (Equation 4.1) over a linear model (Equation 1.1) in the presence of exposure measurement error. To my knowledge, this method has not previously been used to select the best transformation(s) of a single variable.
Incorporation of the selection of the squared term into the Bayesian model has the additional advantage of being able to include uncertainty due to model selection in the 95% CrIs of the regression coefficients. Uncertainty due to model selection is not likewise incorporated into the 95% CI/CrIs of any of the other methods in this chapter.
5.4.1 Kuo and Mallick method applied to Bayesian variable selection
Kuo and Mallick outlined a Bayesian method of variable selection designed to select a reduced number of variables for a generalized linear model [60]. Given $J$ predictor variables, indexed by $j = 1, 2, \ldots, J$, the regression coefficient, $\beta_j$, for each variable is replaced with a composite parameter which can take a “spike and slab” prior. The composite parameter $\theta_j$ is composed of the original regression coefficient, which takes a Gaussian prior, and an indicator value, $I_j$, which takes a Bernoulli prior with probability $p$, i.e. $\theta_j = I_j \beta_j$. If $I_j$ takes on the value one, then the value of $\theta_j$ is drawn from the Gaussian distribution associated with $\beta_j$, or the “slab”. If $I_j$ takes on the value zero, then $\theta_j$ is also zero, or takes its value from the density “spike” at zero. If $\theta_j$ is zero, then the variable $X_j$ contributes nothing to the linear predictor.
A generalized linear model including the spike and slab prior for all regression coefficients is:
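$$g(E[Y_i \mid \boldsymbol{X}_i]) = \beta_0 + \beta_1 I_1 X_{1i} + \cdots + \beta_J I_J X_{Ji} = \beta_0 + \boldsymbol{I}\boldsymbol{\beta}\boldsymbol{X}_i = \beta_0 + \boldsymbol{\theta}\boldsymbol{X}_i \tag{5.1}$$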
An appropriate hyperprior distribution may be specified for $p$, reflecting either no prior knowledge (e.g. a uniform distribution from 0 to 1) or some prior knowledge (e.g. a normal distribution centered at an expected value). Alternatively, $p$ may be assigned a fixed value reflecting the desired proportion of variables to be retained (or selected) in the final model.
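The spike and slab prior described above can be illustrated by drawing the composite parameter (indicator times coefficient) directly from its prior. The following is only a minimal sketch: the function name `draw_spike_and_slab`, the zero-mean slab, and the slab standard deviation of 1 are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def draw_spike_and_slab(n_draws, n_vars, slab_sd=1.0, p=None, seed=0):
    """Draw composite parameters theta_j = I_j * beta_j from the prior.

    p=None places a Uniform(0, 1) hyperprior on the inclusion probability;
    otherwise p is treated as a fixed value. The zero-mean Gaussian slab
    is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    if p is None:
        p_draws = rng.uniform(0.0, 1.0, size=(n_draws, 1))
    else:
        p_draws = np.full((n_draws, 1), float(p))
    # I_j ~ Bernoulli(p): True selects the slab, False selects the spike.
    indicators = rng.random((n_draws, n_vars)) < p_draws
    # beta_j ~ N(0, slab_sd^2), drawn whether or not it is selected.
    betas = rng.normal(0.0, slab_sd, size=(n_draws, n_vars))
    return indicators * betas, indicators


theta, indicators = draw_spike_and_slab(n_draws=20000, n_vars=3, p=0.5)
print("share of prior draws at the spike:", np.mean(theta == 0.0))
```

With a fixed inclusion probability of 0.5, roughly half of the prior draws sit exactly at zero and the remainder follow the Gaussian slab.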
In the samples drawn from this model using MCMC (sampled values are indicated by a hat, e.g. $\hat{\boldsymbol{I}}$), different values of the vector $\boldsymbol{I}$ can be said to define different substantive models. The model drawn most frequently, that is, the model with the highest posterior probability [60], may be said to be the nominal best model. As several models may have a high posterior probability, or the performance of a specific model within the set of possible models may be of interest, the magnitude of the evidence in favor of one model versus another may be assessed by use of the Bayes factor (BF), which is described in Section 5.4.2.
When $\hat{I}_j = 0$, particularly over many consecutive samples, $\beta_j$ continues to be sampled but contributes nothing to the linear predictor, so its draws are informed only by its prior. As a result, the value of each $\hat{\beta}_j$ may rarely be in a region of the model space where there is posterior support for $I_j$ to change from 0 to 1. Good mixing in this setting is defined by the frequency with which $\hat{I}_j$ changes value [120], and it can be quite challenging to achieve. Therefore, the priors for the regression coefficients $\boldsymbol{\beta}$ cannot be very vague.
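The sampling scheme described in this section can be sketched as a small Gibbs sampler. This is a deliberately simplified illustration, not the implementation used in the thesis: it assumes a Gaussian outcome with known residual standard deviation, no intercept, no measurement error, a zero-mean Gaussian slab, and a fixed inclusion probability. The function name `kuo_mallick_gibbs` and all default values are hypothetical.

```python
import numpy as np


def kuo_mallick_gibbs(x, y, p=0.5, slab_sd=1.0, sigma=1.0, n_iter=3000, seed=0):
    """Gibbs sampler for y_i = sum_j I_j * beta_j * x_ij + N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    n_vars = x.shape[1]
    beta = np.zeros(n_vars)
    ind = np.ones(n_vars)                  # start with every variable included
    beta_draws = np.empty((n_iter, n_vars))
    ind_draws = np.empty((n_iter, n_vars))
    xtx = np.sum(x ** 2, axis=0)
    for it in range(n_iter):
        for j in range(n_vars):
            # Residual after removing all terms except variable j.
            theta = ind * beta
            resid = y - x @ theta + x[:, j] * theta[j]
            if ind[j] == 1:
                # Included: conjugate Gaussian update for beta_j.
                prec = xtx[j] / sigma ** 2 + 1.0 / slab_sd ** 2
                mean = (x[:, j] @ resid) / sigma ** 2 / prec
                beta[j] = rng.normal(mean, np.sqrt(1.0 / prec))
            else:
                # Excluded: beta_j is drawn from its prior (the slab).
                beta[j] = rng.normal(0.0, slab_sd)
            # Log odds of I_j = 1 versus I_j = 0 given the current beta_j.
            fit = x[:, j] * beta[j]
            log_odds = np.log(p / (1.0 - p)) + (
                np.sum(resid ** 2) - np.sum((resid - fit) ** 2)
            ) / (2.0 * sigma ** 2)
            log_odds = np.clip(log_odds, -30.0, 30.0)   # avoid overflow in exp
            ind[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-log_odds)))
        beta_draws[it] = beta
        ind_draws[it] = ind
    return beta_draws, ind_draws


# Simulated data: the first predictor has a real effect, the second has none.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(200, 2))
y = 1.0 * x[:, 0] + data_rng.normal(size=200)
beta_draws, ind_draws = kuo_mallick_gibbs(x, y)
inclusion = ind_draws[500:].mean(axis=0)   # discard 500 burn-in draws
print("posterior inclusion probabilities:", inclusion)
```

The branch for an excluded variable shows why vague priors hurt mixing: while the indicator is zero the coefficient is drawn from the slab alone, so a wide slab rarely proposes a value that the data support.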
5.4.2 Kuo and Mallick method applied to selection of the quadratic model (Bayesian transformation selection)
In order to apply the above method to the selection of the quadratic model (Equation 4.1) versus the linear model (Equation 1.1), the “spike and slab” prior is applied only to the regression coefficient of the squared term:
$$g(E[Y_i \mid X_i, \boldsymbol{Z}_i]) = \beta_0 + \beta_{X1} X_i + I_{X2}\,\beta_{X2} X_i^2 + \boldsymbol{\beta}_Z^{T} \boldsymbol{Z}_i \tag{5.2}$$
Each sample of $I_{X2}$ takes the value either 1 or 0 as dictated by the draw from the Bernoulli distribution with probability $p$; if $\hat{I}_{X2} = 1$ then the $X_i^2$ term is included in the model and if $\hat{I}_{X2} = 0$ then it is not. If the hyperparameter $p$ is assigned the fixed value 0.5, neither inclusion nor exclusion of the $X_i^2$ term is favored; this value is suggested by George and McCulloch in order to be uninformative [124]. The effect of the choice of value for $p$ is examined in a sensitivity analysis in Section 5.5.8.
A BF may be used to assess whether there is significant evidence in favor of the quadratic model over the linear model (Box 5.1 presents a common interpretation of the value of the BF [125,126]). The BF may be considered as a constant which quantifies the magnitude of the evidence in the data for one model (or hypothesis) over another [127]. Specifically, the posterior odds are equivalent to the prior odds multiplied by the BF:
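$$\frac{P(I_{X2} = 1 \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}; \boldsymbol{\theta})}{P(I_{X2} = 0 \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}; \boldsymbol{\theta})} = \frac{P(I_{X2} = 1)}{P(I_{X2} = 0)} \times \frac{P(\boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y} \mid \boldsymbol{Z}, I_{X2} = 1)}{P(\boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y} \mid \boldsymbol{Z}, I_{X2} = 0)} \tag{5.3}$$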
Box 5.1 Bayes Factor (BF)
BF = posterior odds / prior odds
BF = 1–3: weak evidence
BF = 3–10: moderate evidence
BF = 10–100: strong evidence
The posterior probability of the quadratic model is equivalent to the proportion of samples wherein $\hat{I}_{X2} = 1$ is sampled. If $p$ is assigned a fixed value, then the prior probability of the quadratic model is $p$ and the prior probability of the linear model is $1 - p$. The prior odds are then $p / (1 - p)$.
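The calculation described above, posterior odds from the sampled indicator divided by the prior odds, can be sketched in a few lines. The function names are hypothetical, and the treatment of the range boundaries (a BF of exactly 3 or 10 is assigned to the higher category) is an assumption, since Box 5.1 does not specify it.

```python
import math


def bayes_factor_from_indicator(indicator_samples, p):
    """Return (posterior probability, BF) for the quadratic model.

    indicator_samples: sampled values (0 or 1) of the indicator for the
    squared term; p: fixed prior probability of including that term.
    """
    n = len(indicator_samples)
    n_quadratic = sum(1 for value in indicator_samples if value == 1)
    n_linear = n - n_quadratic
    posterior_prob = n_quadratic / n
    prior_odds = p / (1.0 - p)
    # If the linear model was never sampled the posterior odds are unbounded.
    posterior_odds = math.inf if n_linear == 0 else n_quadratic / n_linear
    return posterior_prob, posterior_odds / prior_odds


def interpret_bayes_factor(bf):
    """Map a BF onto the ranges listed in Box 5.1."""
    if 1 <= bf < 3:
        return "weak evidence"
    if 3 <= bf < 10:
        return "moderate evidence"
    if 10 <= bf <= 100:
        return "strong evidence"
    return "outside the ranges listed in Box 5.1"


# Example: 9000 of 10000 samples include the squared term, with p = 0.5.
samples = [1] * 9000 + [0] * 1000
posterior_prob, bf = bayes_factor_from_indicator(samples, p=0.5)
print(posterior_prob, bf, interpret_bayes_factor(bf))
```

In this example the posterior odds are 9 and the prior odds are 1, so the BF is 9. The same samples under a prior probability of 0.2 would give prior odds of 0.25 and therefore a BF of 36.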
For very strong quadratic associations, when the model is believed to have converged to the stationary distribution, all samples will be from the quadratic model. In this case, $I_{X2}$ will have a posterior mean value of 1, $E[\hat{I}_{X2} \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\theta}] = 1$, as well as a 95% CrI from 1 to 1. Regression coefficients and their variances can then be estimated from the posterior in the same manner as for standard MCMC. However, for a weaker quadratic association, when $E[\hat{I}_{X2} \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\theta}]$ is strictly between 0 and 1, some of the samples will be from the quadratic model and some will be from the linear model. In these cases, $E[\hat{I}_{X2} \mid \boldsymbol{W}, \boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{\theta}]$ represents the posterior probability that inclusion of the squared term is appropriate.
Unfortunately, in cases where there are a substantial number of samples drawn from both models, convergence diagnostics such as $\hat{R}$ cannot be used to assess convergence. This is characteristic of MCMC sampling with mixture models wherein “label-switching” occurs between samples to indicate different models being sampled [28,128].
As in the standard application of the approach, the prior for $\beta_{X2}$ must not be too vague (specific prior values are discussed in Section 5.5.2), otherwise mixing will be poor and the sampler will rarely if ever change $\hat{I}_{X2} = 0$ to $\hat{I}_{X2} = 1$ [120]. Appropriate scaling of the untransformed and transformed terms, i.e. $X$ and $X^2$, as outlined in Section 4.5.3, is important for efficient mixing and convergence of this model [28,85].
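Mixing as defined in Section 5.4.1, the frequency with which the sampled indicator changes value, can be monitored directly from the chain. The sketch below is one simple way to do so; the function name `indicator_switch_rate` is hypothetical, and no threshold for an acceptable rate is implied by the source.

```python
def indicator_switch_rate(indicator_samples):
    """Proportion of consecutive sample pairs in which the indicator changes."""
    samples = list(indicator_samples)
    if len(samples) < 2:
        raise ValueError("at least two samples are needed")
    switches = sum(1 for prev, curr in zip(samples, samples[1:]) if prev != curr)
    return switches / (len(samples) - 1)


# Two chains with the same posterior mean (0.5) but very different mixing.
sticky_chain = [0, 0, 0, 0, 1, 1, 1, 1]
mobile_chain = [0, 1, 0, 1, 0, 1, 0, 1]
print(indicator_switch_rate(sticky_chain))
print(indicator_switch_rate(mobile_chain))
```

The two example chains give the same estimate of the posterior probability of the quadratic model, yet the first changes state only once. The posterior mean of the indicator therefore says nothing about mixing on its own.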
In order to select the best model describing the exposure-outcome relationship while accommodating measurement error, this modified substantive model (Equation 5.2) and specified priors are included in the joint model with the specified exposure and measurement error models in the modular fashion discussed previously (Section 2.3).
The 95% CrI and variance estimates for the regression coefficients include uncertainty due to model selection as well as measurement error and population sampling.