5 MEASUREMENT ERROR CORRECTION FOR SELECTION OF A QUADRATIC TERM

5.4 BAYESIAN TRANSFORMATION SELECTION (BTS)

As an alternative to using either DIC or the 95% CrIs to determine the best model, one may employ methods traditionally used for Bayesian variable selection, that is, for selecting a subset of variables from a larger set. Bayesian variable selection encompasses a wide range of specific methods [120]. A few publications have explored Bayesian variable and transformation selection together [121–123], although only one explores selection of fractional polynomial transformations [55]. In this thesis, I focus solely on selection of the best transformation for a single error-prone variable; this approach is therefore referred to as “Bayesian transformation selection” (BTS) when adapted to this context. As with DIC, this approach is unnecessary for the selection of the quadratic model over the linear model. BTS was applied here in order to assess its performance in a simpler setting before extending the method to the selection of fractional polynomial models for an error-prone exposure in the next chapter.

I will first describe the Kuo and Mallick method of Bayesian variable selection [60] (Section 5.4.1). This method is arguably the simplest to implement and requires no tuning variables (i.e. parameters with no other purpose than to ensure good mixing of the sampler). In Section 5.4.2, I adapt the Kuo and Mallick method to determine the posterior probability of a quadratic model (Equation 4.1) over a linear model (Equation 1.1) in the presence of exposure measurement error. To my knowledge, this method has not previously been used to select the best transformation(s) of a single variable.

Incorporation of the selection of the squared term into the Bayesian model has the additional advantage of being able to include uncertainty due to model selection in the 95% CrIs of the regression coefficients. Uncertainty due to model selection is not likewise incorporated into the 95% CI/CrIs of any of the other methods in this chapter.

5.4.1 Kuo and Mallick method applied to Bayesian variable selection

Kuo and Mallick outlined a Bayesian method of variable selection designed to select a reduced number of variables for a generalized linear model [60]. Given $J$ predictor variables $\boldsymbol{V} = (V_1, \ldots, V_J)$, the regression coefficient $\beta_j$ for each variable, $j = 1, 2, \ldots, J$, is replaced with a composite parameter which takes a “spike and slab” prior. The composite parameter $\varphi_j$ is composed of the original regression coefficient, which takes a Gaussian prior, and an indicator variable $I_j$, which takes a Bernoulli prior with probability $\pi$, i.e. $\varphi_j = I_j \beta_j$. If $I_j$ takes the value one, then the value of $\varphi_j$ is drawn from the Gaussian distribution associated with $\beta_j$, the “slab”. If $I_j$ takes the value zero, then $\varphi_j$ is also zero; that is, it takes its value from the density “spike” at zero. If $\varphi_j$ is zero, then the variable $V_j$ contributes nothing to the linear predictor.

A generalized linear model including the spike and slab prior for all regression coefficients is:

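$$ g(E[Y_i \mid X_i]) = \beta_0 + \beta_1 I_1 V_{1i} + \cdots + \beta_J I_J V_{Ji} = \beta_0 + \boldsymbol{I}\boldsymbol{\beta}\boldsymbol{V}_i = \beta_0 + \boldsymbol{\varphi}\boldsymbol{V}_i. \tag{5.1} $$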

An appropriate hyperprior distribution may be specified for $\pi$, reflecting either no prior knowledge (e.g. a uniform distribution from 0 to 1) or some prior knowledge (e.g. a normal distribution centered at an expected value). Alternatively, $\pi$ may be assigned a fixed value reflecting the desired proportion of variables to be retained (or selected) in the final model.
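The composite parameter can be illustrated by drawing from its prior. The following is a minimal sketch, assuming a fixed $\pi$ and a zero-mean Gaussian slab; the function name `draw_spike_and_slab` and the argument `slab_sd` are hypothetical and are not taken from Kuo and Mallick [60].

```python
import random


def draw_spike_and_slab(n_coef, pi=0.5, slab_sd=1.0, rng=None):
    """Draw (I_j, beta_j, phi_j) from the prior for j = 1..n_coef.

    Assumptions for illustration: pi is fixed and the slab is N(0, slab_sd^2).
    """
    rng = rng or random.Random()
    draws = []
    for _ in range(n_coef):
        indicator = 1 if rng.random() < pi else 0  # I_j ~ Bernoulli(pi)
        beta = rng.gauss(0.0, slab_sd)             # beta_j ~ N(0, slab_sd^2)
        phi = indicator * beta                     # phi_j = I_j * beta_j
        draws.append((indicator, beta, phi))
    return draws


prior_draws = draw_spike_and_slab(n_coef=1000, pi=0.5, slab_sd=1.0,
                                  rng=random.Random(1))
# Share of draws that fall in the slab; close to pi for a large number of draws.
share_in_slab = sum(ind for ind, _, _ in prior_draws) / len(prior_draws)
```

Note that the coefficient is drawn whether or not the indicator is one; only the product enters the linear predictor.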

In the samples drawn from this model using MCMC, with sampled values indicated by a tilde (e.g. $\tilde{\boldsymbol{I}}$), different values of the vector $\boldsymbol{I}$ define different substantive models. The model drawn most frequently, that is, the one with the highest posterior probability [60], may be said to be the nominal best model. As several models may have a high posterior probability, or the performance of a specific model within the set of possible models may be of interest, the magnitude of the evidence in favor of one model versus another may be assessed using the Bayes factor (BF), which is described in Section 5.4.2.
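Tabulating the sampled indicator vectors gives the posterior probability of each substantive model. A minimal sketch follows, assuming the retained samples of the indicator vector are available as tuples; the sample values are made up for illustration and the function name is hypothetical.

```python
from collections import Counter


def model_posterior_probabilities(indicator_samples):
    """Estimate the posterior probability of each model as its sampling frequency."""
    counts = Counter(tuple(sample) for sample in indicator_samples)
    n = len(indicator_samples)
    return {model: count / n for model, count in counts.most_common()}


# Made-up retained samples of (I_1, I_2) for illustration only.
samples = [(1, 0), (1, 0), (1, 1), (1, 0), (0, 0), (1, 1), (1, 0), (1, 0)]
probs = model_posterior_probabilities(samples)
best_model = max(probs, key=probs.get)  # nominal best model: (1, 0), probability 0.625
```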

When $\tilde{I}_j = 0$, particularly over many consecutive samples, $\beta_j$ continues to be sampled (from its prior, since the data then carry no information about it) but contributes nothing to the linear predictor. As a result, the value of each $\tilde{\beta}_j$ may rarely be in a region of the model space where there is posterior support for $I_j$ to change from 0 to 1. Good mixing in this setting is defined by the frequency with which $\tilde{I}_j$ changes value [120]. Appropriate mixing can be quite challenging to achieve. Therefore, the priors for the regression coefficients $\boldsymbol{\beta}$ cannot be very vague.
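One simple way to quantify mixing in the sense described above is the proportion of consecutive samples in which the indicator changes value. This is a minimal sketch; the function name `indicator_switch_rate` is hypothetical, and both chains are made up for illustration.

```python
def indicator_switch_rate(chain):
    """Proportion of consecutive sample pairs in which the indicator changes value."""
    if len(chain) < 2:
        return 0.0
    switches = sum(1 for prev, curr in zip(chain, chain[1:]) if prev != curr)
    return switches / (len(chain) - 1)


# A chain that changes value only once (poor mixing) ...
sticky_chain = [0] * 50 + [1] * 50
# ... versus a chain that changes value at every step.
alternating_chain = [0, 1] * 50

sticky_rate = indicator_switch_rate(sticky_chain)
alternating_rate = indicator_switch_rate(alternating_chain)
```

A very low rate for a chain whose posterior inclusion probability is well inside (0, 1) would suggest that the prior for the corresponding coefficient is too vague.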

5.4.2 Kuo and Mallick method applied to selection of the quadratic model (Bayesian transformation selection)

In order to apply the above method to the selection of the quadratic model (Equation 4.1) versus the linear model (Equation 1.1), the “spike and slab” prior is applied only to the regression coefficient of the squared term:

$$ g(E[Y_i \mid X_i, \boldsymbol{Z}_i]) = \beta_0 + \beta_{X1} X_i + I_{X2} \beta_{X2} X_i^2 + \boldsymbol{\beta}_Z^T \boldsymbol{Z}_i. \tag{5.2} $$

Each sample of $I_{X2}$ takes the value either 1 or 0, as dictated by a draw from a Bernoulli distribution with prior probability $\pi$; if $\tilde{I}_{X2} = 1$ then the $X_i^2$ term is included in the model, and if $\tilde{I}_{X2} = 0$ then it is not. If the hyperparameter $\pi$ is assigned the fixed value 0.5, neither inclusion nor exclusion of the $X_i^2$ term is favored a priori; this value is suggested by George and McCulloch in order to be uninformative [124]. The effect of the choice of value for $\pi$ is discussed further in Section 5.5.8, where it is examined in a sensitivity analysis.
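The mechanics of sampling the indicator for the squared term can be sketched with a small Gibbs sampler. This is an illustration under strong simplifying assumptions, not the joint model used in this thesis: the exposure is taken as measured without error, there are no covariates, the outcome is Gaussian with an identity link and a known residual standard deviation `sigma`, and all coefficients share a zero-mean Gaussian prior with standard deviation `prior_sd`. All names and the simulated data are hypothetical.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def kuo_mallick_quadratic(x, y, n_iter=1500, burn_in=500, pi=0.5,
                          prior_sd=1.0, sigma=1.0, seed=0):
    """Gibbs sampler for y = b0 + b1*x + I*b2*x^2 + N(0, sigma^2) noise."""
    rng = random.Random(seed)
    cols = [[1.0] * len(x), list(x), [xi * xi for xi in x]]
    beta = [0.0, 0.0, 0.0]
    indicator = 1
    kept_indicator, kept_phi = [], []
    for it in range(n_iter):
        for k in range(3):
            # The squared-term column is switched off when the indicator is 0.
            scale = [1.0, 1.0, float(indicator)]
            col = [scale[k] * c for c in cols[k]]
            # Residual with coefficient k removed from the linear predictor.
            resid = [
                yi - sum(scale[m] * beta[m] * cols[m][i]
                         for m in range(3) if m != k)
                for i, yi in enumerate(y)
            ]
            precision = 1.0 / prior_sd ** 2 + sum(c * c for c in col) / sigma ** 2
            mean = sum(c * r for c, r in zip(col, resid)) / sigma ** 2 / precision
            # When the indicator is 0 the column is all zeros, so b2 is
            # drawn from its prior.
            beta[k] = rng.gauss(mean, math.sqrt(1.0 / precision))
        # Update the indicator from its Bernoulli full conditional.
        base = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]
        ssr0 = sum(r * r for r in base)
        ssr1 = sum((r - beta[2] * xi * xi) ** 2 for r, xi in zip(base, x))
        log_odds = math.log(pi / (1.0 - pi)) - (ssr1 - ssr0) / (2.0 * sigma ** 2)
        indicator = 1 if rng.random() < _sigmoid(log_odds) else 0
        if it >= burn_in:
            kept_indicator.append(indicator)
            kept_phi.append(indicator * beta[2])
    return kept_indicator, kept_phi


# Simulated data with a strong quadratic association (illustration only).
data_rng = random.Random(42)
x = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1.0 + 0.5 * xi + 1.0 * xi * xi + data_rng.gauss(0.0, 1.0) for xi in x]
indicators, phis = kuo_mallick_quadratic(x, y)
inclusion_probability = sum(indicators) / len(indicators)
```

In this sketch the coefficient of the squared term is drawn from its prior whenever the indicator is zero, which is why a very vague prior makes a return to the quadratic model unlikely.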

A BF may be used to assess whether there is significant evidence in favor of the quadratic model over the linear model (Box 5.1 presents a common interpretation of the value of the BF [125,126]). The BF may be considered as a constant which quantifies the magnitude of the evidence in the data for one model (or hypothesis) over another [127]. Specifically, the posterior odds are equivalent to the prior odds multiplied by the BF:

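$$ \frac{f(I_{X2} = 1 \mid \boldsymbol{W}, X, Y, \boldsymbol{Z}; \boldsymbol{\theta})}{f(I_{X2} = 0 \mid \boldsymbol{W}, X, Y, \boldsymbol{Z}; \boldsymbol{\theta})} = \frac{f(I_{X2} = 1)}{f(I_{X2} = 0)} \times \frac{f(\boldsymbol{W}, Y, \boldsymbol{Z} \mid X, I_{X2} = 1)}{f(\boldsymbol{W}, Y, \boldsymbol{Z} \mid X, I_{X2} = 0)}. \tag{5.3} $$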

Box 5.1 Bayes Factor (BF)

BF = posterior odds / prior odds

BF = 1 – 3: weak evidence
BF = 3 – 10: moderate evidence
BF = 10 – 100: strong evidence

The posterior probability of the quadratic model is estimated by the proportion of samples in which $\tilde{I}_{X2} = 1$. If $\pi$ is assigned a fixed value, then the prior probability of the quadratic model is $\pi$ and the prior probability of the linear model is $1 - \pi$. The prior odds are then $\pi / (1 - \pi)$.
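The calculation of the BF from the sampled indicator values can be sketched as follows, assuming a fixed $\pi$. The function name and the example counts are hypothetical; when every retained sample comes from one model, the estimate is limited by the number of samples, and this sketch simply returns infinity or zero.

```python
def bayes_factor_from_indicator(indicator_samples, pi=0.5):
    """Bayes factor for the quadratic over the linear model: posterior odds / prior odds."""
    n = len(indicator_samples)
    n_quadratic = sum(indicator_samples)
    if n_quadratic == n:
        return float("inf")   # no samples from the linear model
    if n_quadratic == 0:
        return 0.0            # no samples from the quadratic model
    posterior_probability = n_quadratic / n
    posterior_odds = posterior_probability / (1.0 - posterior_probability)
    prior_odds = pi / (1.0 - pi)
    return posterior_odds / prior_odds


# Made-up example: 900 of 1000 retained samples have the indicator equal to 1.
example_samples = [1] * 900 + [0] * 100
bf_equal_prior = bayes_factor_from_indicator(example_samples, pi=0.5)
bf_sceptical_prior = bayes_factor_from_indicator(example_samples, pi=0.2)
```

With equal prior probabilities the BF equals the posterior odds, roughly 9 in this example, which falls in the “moderate evidence” band of Box 5.1. The same samples under a prior of 0.2 would imply a larger BF, because the prior odds are smaller.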

For very strong quadratic associations, once the sampler is believed to have converged to the stationary distribution, all samples will be from the quadratic model. In this case, $I_{X2}$ will have a posterior mean of 1, $E[\tilde{I}_{X2} \mid \boldsymbol{W}, X, Y, \boldsymbol{Z}, \boldsymbol{\theta}] = 1$, as well as a 95% CrI from 1 to 1. Regression coefficients and their variances can then be estimated from the posterior in the same manner as for standard MCMC. However, for a weaker quadratic association, when $E[\tilde{I}_{X2} \mid \boldsymbol{W}, X, Y, \boldsymbol{Z}, \boldsymbol{\theta}]$ is strictly between 0 and 1, some of the samples will be from the quadratic model and some from the linear model. In these cases, $E[\tilde{I}_{X2} \mid \boldsymbol{W}, X, Y, \boldsymbol{Z}, \boldsymbol{\theta}]$ represents the posterior probability that inclusion of the squared term is appropriate.
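When samples come from both models, summaries of the composite coefficient (the indicator multiplied by the coefficient of the squared term) average over the two models. A minimal sketch follows, assuming the retained samples of the indicator and the coefficient are available as lists. The function name is hypothetical, the sample values are made up, and the interval is a simple order-statistic approximation chosen for illustration.

```python
def summarize_composite(indicators, betas, level=0.95):
    """Summarize phi = I * beta over all retained samples (both models pooled)."""
    phi = sorted(i * b for i, b in zip(indicators, betas))
    n = len(phi)
    tail = (1.0 - level) / 2.0
    lower = phi[int(tail * (n - 1))]          # lower order statistic
    upper = phi[int((1.0 - tail) * (n - 1))]  # upper order statistic
    return {
        "inclusion_probability": sum(indicators) / n,
        "phi_mean": sum(phi) / n,
        "phi_interval": (lower, upper),
    }


# Made-up samples: 60 from the quadratic model and 40 from the linear model.
example_indicators = [1] * 60 + [0] * 40
example_betas = [0.5] * 100
summary = summarize_composite(example_indicators, example_betas)
```

Because samples with the indicator equal to zero contribute exact zeros, the interval for the composite coefficient widens toward zero as the posterior probability of the linear model increases.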

Unfortunately, in cases where a substantial number of samples are drawn from both models, convergence diagnostics such as $\hat{R}$ cannot be used to assess convergence. This is characteristic of MCMC sampling with mixture models, wherein “label-switching” occurs between samples to indicate different models being sampled [28,128].

As in the standard application of the approach, the prior for $\beta_{X2}$ must not be too vague (specific prior values are discussed in Section 5.5.2); otherwise mixing will be poor and the sampler will rarely, if ever, change from $\tilde{I}_{X2} = 0$ to $\tilde{I}_{X2} = 1$ [120]. Appropriate scaling of the untransformed and transformed terms, i.e. $X$ and $X^2$, as outlined in Section 4.5.3, is important for efficient mixing and convergence of this model [28,85].

In order to select the best model describing the exposure-outcome relationship while accommodating measurement error, this modified substantive model (Equation 5.2) and specified priors are included in the joint model with the specified exposure and measurement error models in the modular fashion discussed previously (Section 2.3).

The 95% CrI and variance estimates for the regression coefficients include uncertainty due to model selection as well as measurement error and population sampling.

5.5 SIMULATION STUDY EXTENSION TO MODEL SELECTION FOR A

In document Use of the Bayesian family of methods to correct for effects of exposure measurement error in polynomial regression models (Page 104-107)