and this can be estimated by the Wald (ratio) estimator defined in Section 2.3.1. Note that the J genetic associations with the outcome βYj (j = 1, . . . , J) can be estimated
by regressing the outcome against each of the genetic variants in linear regression models.
Figure 2.1 contains a DAG of Gj, X, U and Y, where Gj satisfies the IV assumptions.
As highlighted in Section 1.5.1, the arrow between Gj and X does not have to
be causal, but Gj should be in linkage disequilibrium with the genetic variant that
has a causal effect on X. In Figure 2.1, we have included the parameters defined in the model assumptions, i.e. θ represents the causal parameter of interest as defined in Equation 2.2. Strictly speaking, DAGs should provide a non-parametric representation of the relationships between a set of variables, but throughout this dissertation we include the parameters considered in the model assumptions for ease of interpretation.
[Figure 2.1: DAG with arrows Gj → X (βXj), X → Y (θ), U → X (ζX) and U → Y (ζY).]
Fig. 2.1 Directed acyclic graph illustrating the Mendelian randomization assumptions for the J genetic variants Gj (j = 1, . . . , J) to investigate the causal effect of a continuous risk factor
X on a continuous outcome Y . The genetic effect of Gj on X is βXj, and the causal effect of
the risk factor X on the outcome Y is θ. U represents the set of unmeasured variables that confound the association between X and Y with effects ζX and ζY.
2.3 Estimating the causal effect
Under the model assumptions defined in Section 2.2, we consider the IV methods that are most frequently used in Mendelian randomization to estimate the causal parameter θ in Equation 2.2: the Wald (ratio) estimator, which typically uses summary level data (Section 2.3.1) [2]; and two-stage least squares (TSLS) regression, which uses individual level data (Section 2.3.2) [38]. Since this dissertation is primarily interested in Mendelian randomization methods that use summary level data, the Wald (ratio) estimator is discussed in detail. Although not considered here, methods based on limited information maximum likelihood [39], the generalised method of moments [40, 41], and Bayesian approaches [42, 43] may also be used to estimate the causal effect. Since
we assume that Y is a continuous variable throughout this Chapter, we also highlight some of the issues that arise when estimating the causal effect for a binary outcome (Section 2.3.3).
2.3.1 Wald (ratio) estimator
We assume that we have summary level data on the risk factor and outcome from two independent samples: the genetic association estimates (ˆβXj and ˆβYj) and their standard errors (se(ˆβXj) and se(ˆβYj)) for the J genetic variants Gj (j = 1, . . . , J). The causal effect θ of the risk factor X on the outcome Y can be estimated with one genetic variant Gj using the Wald (ratio) method by dividing the genetic association estimate with the outcome by the genetic association estimate with the risk factor:
\[ \hat{\theta}_j = \frac{\hat{\beta}_{Yj}}{\hat{\beta}_{Xj}}. \tag{2.3} \]
The ratio method can also be applied directly to individual level data. For example, if
Gj consisted of two subgroups, then the ratio estimator is the average difference in the
outcome between the two subgroups of Gj divided by the average difference in the
risk factor between the two subgroups of Gj.
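As a minimal sketch of Equation 2.3, the Python snippet below computes the ratio estimate from summary level associations and, for a variant that defines two subgroups, from individual level data, with the mean difference in the outcome in the numerator and the mean difference in the risk factor in the denominator. The function names and the toy data are hypothetical and are included only for illustration.

```python
def wald_ratio(beta_y, beta_x):
    """Ratio estimate of the causal effect from one variant (Equation 2.3)."""
    if beta_x == 0:
        raise ValueError("the variant must be associated with the risk factor")
    return beta_y / beta_x


def group_mean(values, groups, level):
    """Mean of `values` among individuals whose group label equals `level`."""
    selected = [v for v, g in zip(values, groups) if g == level]
    return sum(selected) / len(selected)


def wald_ratio_two_groups(x, y, g):
    """Ratio estimate from individual level data with a binary variant g (0/1)."""
    diff_y = group_mean(y, g, 1) - group_mean(y, g, 0)  # difference in outcome
    diff_x = group_mean(x, g, 1) - group_mean(x, g, 0)  # difference in risk factor
    return wald_ratio(diff_y, diff_x)


# Hypothetical toy data: six individuals, three in each subgroup of the variant.
g = [0, 0, 0, 1, 1, 1]
x = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]  # subgroup means 2.0 and 3.0
y = [0.5, 1.0, 1.5, 1.0, 1.5, 2.0]  # subgroup means 1.0 and 1.5

theta_hat = wald_ratio_two_groups(x, y, g)
print(theta_hat)
```

With a single binary variant, the difference in subgroup means equals the slope from regressing the trait on the variant, so the two routes to the ratio estimate agree.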
An estimate of the causal effect based on all the genetic variants can also be obtained from the weighted average of the J causal ratio estimates:
\[ \hat{\theta}_{IVW} = \frac{\sum_{j=1}^{J} w_j \hat{\theta}_j}{\sum_{j=1}^{J} w_j}, \tag{2.4} \]
where wj is the inverse-variance of the causal ratio estimate ˆθj [44]. The pooled estimate in Equation 2.4 is known as the ‘inverse-variance weighted’ (IVW) method [45]. Under a fixed effect model, where we assume that there is no heterogeneity among the causal ratio estimates [46], the variance of the IVW estimate is given by:
\[ \mathrm{var}(\hat{\theta}_{IVW}) = \frac{1}{\sum_{j=1}^{J} w_j}. \tag{2.5} \]
The inverse-variance weights wj in Equations 2.4 and 2.5 can be approximated from a delta method expansion of the ratio estimate [47]. The first order approximation of wj from the delta expansion is most commonly used in the IVW estimator [48]:
\[ \text{1st order approximation of } w_j = \frac{\hat{\beta}_{Xj}^2}{\mathrm{se}(\hat{\beta}_{Yj})^2}. \tag{2.6} \]
Equation 2.6 assumes that there is no uncertainty in the genetic associations with the risk factor, known as the NO Measurement Error (NOME) assumption [49]. The NOME assumption will only be satisfied if N1 is infinite. Since summary level data
is obtained from GWASs and consortia with very large sample sizes, the NOME assumption may be considered reasonable.
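The fixed-effect IVW estimate in Equations 2.4 and 2.5 with the first order weights of Equation 2.6 can be sketched in a few lines of Python. This is an illustration under the NOME assumption rather than a reference implementation; the function name and the summary statistics are hypothetical.

```python
from math import sqrt


def ivw_fixed_effect(beta_x, beta_y, se_y):
    """Fixed-effect IVW estimate and standard error with first order weights."""
    # First order weights (Equation 2.6): uncertainty in beta_x is ignored.
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    # Variant-specific ratio estimates (Equation 2.3).
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    # Weighted average of the ratio estimates (Equation 2.4).
    theta = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    # Standard error from the fixed-effect variance (Equation 2.5).
    se = sqrt(1.0 / sum(weights))
    return theta, se


# Hypothetical summary statistics for three variants with a common ratio of 0.5.
beta_x = [0.10, 0.20, 0.15]
beta_y = [0.05, 0.10, 0.075]
se_y = [0.01, 0.02, 0.01]

theta_ivw, se_ivw = ivw_fixed_effect(beta_x, beta_y, se_y)
print(theta_ivw, se_ivw)
```

Because the weights depend on the squared association with the risk factor, variants that are more strongly associated with the risk factor contribute more to the pooled estimate.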
The causal effect of the risk factor on the outcome can also be estimated using a weighted linear regression of the genetic association estimates with the risk factor (ˆβXj) and the genetic association estimates with the outcome (ˆβYj) [45], with the inverse-variance as weights (se(ˆβYj)^−2):
\[ \hat{\beta}_{Yj} = \theta_{IVW} \hat{\beta}_{Xj} + \epsilon_j, \quad \epsilon_j \sim N\left(0, \phi^2 \, \mathrm{se}(\hat{\beta}_{Yj})^2\right), \tag{2.7} \]
where ϵj represents the error term, φ represents the residual standard error, and the
intercept term is set to zero under the IV2 and IV3 assumptions. To obtain the same variance as the IVW estimate in Equation 2.5, the residual standard error in the weighted linear regression model in Equation 2.7 must be set to one. By fixing φ to one, Equation 2.7 is equivalent to performing a fixed-effect meta-analysis of the J causal ratio estimates ˆθj (j = 1, . . . , J) [50].
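The equivalence between the weighted regression in Equation 2.7 with φ fixed to one and the weighted average in Equations 2.4 to 2.6 can be checked numerically. The sketch below uses the closed-form slope of a weighted regression through the origin; the function names and the summary statistics are hypothetical.

```python
from math import sqrt


def ivw_by_weighted_regression(beta_x, beta_y, se_y):
    """Slope of Equation 2.7 (no intercept) and its standard error when phi = 1."""
    w = [1.0 / s ** 2 for s in se_y]  # regression weights se(beta_y)^-2
    sxx = sum(wi * bx * bx for wi, bx in zip(w, beta_x))
    sxy = sum(wi * bx * by for wi, bx, by in zip(w, beta_x, beta_y))
    return sxy / sxx, sqrt(1.0 / sxx)


def ivw_by_weighted_average(beta_x, beta_y, se_y):
    """Weighted average of ratio estimates with first order weights."""
    w = [bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    theta = sum(wi * r for wi, r in zip(w, ratios)) / sum(w)
    return theta, sqrt(1.0 / sum(w))


# Hypothetical summary statistics with heterogeneous ratio estimates.
beta_x = [0.10, 0.20, 0.15]
beta_y = [0.02, 0.14, 0.075]
se_y = [0.01, 0.02, 0.01]

theta_reg, se_reg = ivw_by_weighted_regression(beta_x, beta_y, se_y)
theta_avg, se_avg = ivw_by_weighted_average(beta_x, beta_y, se_y)
print(theta_reg, theta_avg, se_reg, se_avg)
```

The two routes agree because each first order weight multiplied by its ratio estimate equals the regression weight multiplied by the product of the two association estimates.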
If heterogeneity among the ratio estimates is suspected, then a multiplicative random-effects model may be preferred to a fixed-effect model. Although the point estimates from the fixed- and random-effect models will be the same, the standard error of the causal estimate from the multiplicative random-effects model will be larger if there is heterogeneity among the ratio estimates. The variance of the IVW estimator under a multiplicative random-effect model with first order weights (Equation 2.6) is given by:
\[ \mathrm{var}(\hat{\theta}_{IVW}) = \frac{\hat{\phi}^2}{\sum_{j=1}^{J} \hat{\beta}_{Xj}^2 \, \mathrm{se}(\hat{\beta}_{Yj})^{-2}}, \]
where ˆφ is the estimate of the residual standard error. If ˆφ > 1, then this suggests that there is over-dispersion in the ratio estimates [50]. Note that it is not biologically plausible for the causal ratio estimates to be under-dispersed (ˆφ < 1) if the genetic variants are independent (not in linkage disequilibrium) [46]. The estimate ˆφ is therefore not allowed to be lower than one, which ensures that the causal estimate from the multiplicative random-effect model is never more precise than the estimate from the fixed-effect model.
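A minimal Python sketch of the multiplicative random-effects standard error follows. It assumes that the residual standard error is estimated with J − 1 degrees of freedom (one slope and no intercept), which is a convention adopted here for illustration and not one stated in the text; the function name and data are also hypothetical.

```python
from math import sqrt


def ivw_multiplicative_random_effects(beta_x, beta_y, se_y):
    """IVW estimate with a multiplicative random-effects standard error."""
    w = [1.0 / s ** 2 for s in se_y]
    sxx = sum(wi * bx * bx for wi, bx in zip(w, beta_x))
    sxy = sum(wi * bx * by for wi, bx, by in zip(w, beta_x, beta_y))
    theta = sxy / sxx  # same point estimate as the fixed-effect model
    # Weighted residual sum of squares around the fitted line through the origin.
    rss = sum(wi * (by - theta * bx) ** 2
              for wi, bx, by in zip(w, beta_x, beta_y))
    # Assumption: J - 1 degrees of freedom (requires at least two variants).
    phi_hat = sqrt(rss / (len(beta_x) - 1))
    # phi is not allowed to fall below one, so the standard error is never
    # smaller than the fixed-effect standard error.
    se = max(1.0, phi_hat) * sqrt(1.0 / sxx)
    return theta, se, phi_hat


beta_x = [0.10, 0.20, 0.15]
se_y = [0.01, 0.02, 0.01]

# Heterogeneous ratio estimates: phi_hat exceeds one and inflates the standard error.
theta_het, se_het, phi_het = ivw_multiplicative_random_effects(
    beta_x, [0.02, 0.14, 0.075], se_y)

# Homogeneous ratio estimates: phi_hat is below one and is truncated at one.
theta_hom, se_hom, phi_hom = ivw_multiplicative_random_effects(
    beta_x, [0.05, 0.10, 0.075], se_y)

print(phi_het, se_het, phi_hom, se_hom)
```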
Instead of using multiplicative random-effects, an additive random-effects model could be used (not considered throughout this dissertation). This would be equivalent
to performing an additive random-effects meta-analysis of the J causal ratio estimates ˆθj (j = 1, . . . , J) [50]. The estimates and standard errors from the fixed-effects and
additive random-effects models will differ if there is heterogeneity among the J causal ratio estimates ˆθj (j = 1, . . . , J). However, additive random-effects are rarely used
in Mendelian randomization, with multiplicative random-effects generally being used when heterogeneity among the ratio estimates is suspected. This preference may be due to the fixed-effects and multiplicative random-effects models estimating the same point estimate. Additionally, Bowden et al. [51] have cautioned against the use of additive random-effects as weak instruments may be given too much weight under certain scenarios, resulting in more biased estimates of the causal effect under the additive random-effects model than the fixed-effect model.
2.3.2 Two-stage least squares regression
If there is individual level data on the risk factor, outcome, and genetic variants, then the causal effect θ can be estimated using two-stage least squares (TSLS) regression [38] in a one-sample Mendelian randomization study. Under TSLS regression, θ is estimated from the two linear regression models: 1) the regression of the risk factor
X against the genetic variants G; and 2) the regression of the outcome Y against the
predicted values of the risk factor ˆX from 1). The coefficient of ˆX in the second stage
regression model is the TSLS estimate of the causal effect θ. If TSLS is performed manually, the uncertainty in the first stage regression will not have been accounted for, and the causal estimate will be too precise. As such, TSLS regression software should be used to obtain accurate standard errors of the causal estimate. The estimate from the IVW method will be asymptotically equivalent to the estimate from the TSLS method if the genetic variants are uncorrelated [52].
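The two stages can be sketched in Python for the simplest case of a single genetic variant. The snippet simulates hypothetical data and contrasts the standard error taken from a manual second-stage fit, whose residuals use the predicted risk factor, with a standard error whose residuals use the observed risk factor. The simulation settings, the function names, and the n − 2 degrees of freedom are assumptions made for illustration; dedicated TSLS software may use different conventions.

```python
import random
from math import sqrt


def simple_ols(x, y):
    """Intercept and slope from least squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def tsls_single_variant(g, x, y):
    """TSLS estimate with one variant, plus manual and corrected standard errors."""
    a1, b1 = simple_ols(g, x)                 # stage 1: X on G
    x_hat = [a1 + b1 * gi for gi in g]        # predicted risk factor
    a2, theta = simple_ols(x_hat, y)          # stage 2: Y on predicted X
    n = len(y)
    mean_x_hat = sum(x_hat) / n
    sxx_hat = sum((xh - mean_x_hat) ** 2 for xh in x_hat)
    # Manual second stage: residuals computed from the predicted risk factor.
    rss_manual = sum((yi - a2 - theta * xh) ** 2 for yi, xh in zip(y, x_hat))
    # Corrected version: residuals computed from the observed risk factor.
    rss_tsls = sum((yi - a2 - theta * xi) ** 2 for yi, xi in zip(y, x))
    se_manual = sqrt(rss_manual / (n - 2) / sxx_hat)
    se_tsls = sqrt(rss_tsls / (n - 2) / sxx_hat)
    return theta, se_manual, se_tsls


# Hypothetical simulation: additive variant, unmeasured confounder u, true effect 0.4.
random.seed(1)
n = 2000
g = [(random.random() < 0.3) + (random.random() < 0.3) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [0.5 * gi + ui + random.gauss(0, 1) for gi, ui in zip(g, u)]
y = [0.4 * xi + ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

theta_tsls, se_manual, se_tsls = tsls_single_variant(g, x, y)
ratio_check = simple_ols(g, y)[1] / simple_ols(g, x)[1]
print(theta_tsls, ratio_check, se_manual, se_tsls)
```

With a single variant, the TSLS point estimate coincides with the ratio estimate in Equation 2.3, which the final comparison illustrates.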
2.3.3 Binary outcomes
Throughout this Section, we only consider linear additive models where the risk factor X and outcome Y are continuous variables. It is likely that the outcome of interest will be binary in an epidemiological study, and the causal odds ratio will be the preferred measure of association. Odds ratios are a non-collapsible measure of association, meaning that if the odds ratio takes a constant value across the strata of a covariate, the value obtained from the marginal analysis may not be equal to this constant value [53]. Whilst the numerator in Equation 2.3 could be replaced with the estimate of the log odds ratio of the jth genetic variant with the outcome, and the