1.5 Mendelian randomization

Mendelian randomization uses genetic variants as IVs to detect and/or estimate the causal effect of a risk factor on an outcome using observational data. Katan [14] first introduced the idea of using genetic variants as IVs to detect causal effects, and their use in epidemiological research has been popularized by Davey Smith and Ebrahim [15].

In this section, we discuss the merits of using genetic variants as IVs and introduce the different types of Mendelian randomization studies.

1.5.1 Using genetic variants as instrumental variables

A genetic variant must be associated with the risk factor for the IV1 assumption to be satisfied. Since the number of genome-wide association studies (GWASs) has increased substantially, and the results from these studies are usually publicly available, this assumption should be relatively straightforward to verify. Typically, uncorrelated genetic variants (not in linkage disequilibrium) that are associated with the risk factor at the genome-wide significance level (p-value < 5 × 10⁻⁸) are considered in a Mendelian randomization study.
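The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a full clumping procedure: the `Variant` class, the `rsid` values, and the `ld_block` label are hypothetical, and treating variants that share a label as correlated is a simplifying assumption standing in for a proper linkage disequilibrium calculation.

```python
from dataclasses import dataclass

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


@dataclass
class Variant:
    rsid: str        # hypothetical variant identifier
    p_value: float   # p-value for the association with the risk factor
    ld_block: str    # hypothetical label; a shared label is assumed to mean "in LD"


def select_instruments(variants):
    """Keep genome-wide significant variants, one (smallest p-value) per LD block."""
    best = {}
    for v in variants:
        if v.p_value >= GENOME_WIDE_P:
            continue  # not associated strongly enough with the risk factor
        current = best.get(v.ld_block)
        if current is None or v.p_value < current.p_value:
            best[v.ld_block] = v
    return sorted(best.values(), key=lambda v: v.p_value)


# Made-up example values, for illustration only.
variants = [
    Variant("rs_a", 1e-12, "block1"),
    Variant("rs_b", 3e-9, "block1"),   # significant, but correlated with rs_a
    Variant("rs_c", 4e-8, "block2"),
    Variant("rs_d", 2e-6, "block3"),   # fails the significance threshold
]
selected = select_instruments(variants)
print([v.rsid for v in selected])
```

In practice the correlation between variants would be assessed against a reference panel rather than a pre-assigned block label.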

Since increases in sample sizes have led to more genetic variants being identified in GWASs, and common genetic variants typically explain little variation in the risk factor, many Mendelian randomization analyses now include multiple genetic variants as IVs [16]. The genetic variants do not have to be causally associated with the risk factor to be valid IVs. Any genetic variant that is in linkage disequilibrium with the causal variant and satisfies the IV assumptions can be used as an IV [17]. Including multiple genetic variants in the analysis will only increase the power to detect the causal effect if the variants explain additional variability in the risk factor [18, 19]. Note that since genetic variants are determined at conception, the association between the variant and the risk factor should not be subject to reverse causation [20, 21].

The IV2 assumption, that the genetic variant is not associated with any of the unmeasured confounders of the risk factor–outcome association, is untestable. The assumption that genetic variants are ‘randomly’ distributed in the population, combined with Mendel’s laws of inheritance, is often used to justify the validity of the IV2 assumption, as it implies that the genetic variants are distributed independently of potentially confounding variables, such as social and environmental factors [15]. The credibility of the IV2 assumption can be assessed by testing the associations of the genetic variants with known measured confounders of the risk factor–outcome association in the dataset used in the main analysis, and by looking up the genetic associations with known unmeasured confounders in external datasets and consortia. Although these checks are sensible, they are by no means exhaustive.
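As a rough illustration of the first check, the association between a variant and a measured confounder can be estimated by a simple linear regression of the confounder on the genotype. The sketch below assumes an additive genotype coding (0, 1 or 2 copies of an allele) and a continuous confounder; the function name and the data are hypothetical and are not taken from any analysis in this dissertation.

```python
import math


def slope_and_se(genotype, confounder):
    """Least-squares slope of confounder on genotype, with its standard error."""
    n = len(genotype)
    g_bar = sum(genotype) / n
    c_bar = sum(confounder) / n
    sxx = sum((g - g_bar) ** 2 for g in genotype)
    sxy = sum((g - g_bar) * (c - c_bar) for g, c in zip(genotype, confounder))
    slope = sxy / sxx
    intercept = c_bar - slope * g_bar
    # Residual sum of squares, used for the residual variance estimate.
    rss = sum((c - intercept - slope * g) ** 2
              for g, c in zip(genotype, confounder))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


# Made-up data: genotype coded as allele count, confounder on an arbitrary scale.
genotype = [0, 0, 1, 1, 2, 2]
confounder = [1.0, 3.0, 2.0, 2.0, 1.0, 3.0]
slope, se = slope_and_se(genotype, confounder)
print(slope, se)
```

A slope that is small relative to its standard error gives no evidence of an association with that confounder, although, as noted above, it cannot rule out associations with confounders that were not measured.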

If a genetic variant is associated with more than one trait, it is said to be a ‘pleiotropic’ variant. The inclusion of a pleiotropic genetic variant in a Mendelian randomization analysis may lead to the violation of the IV2 or IV3 assumptions. Since GWASs have identified many genetic variants that are associated with multiple traits, including pleiotropic variants in a Mendelian randomization study is a major concern [22]. This limitation has led to various methods being introduced into the Mendelian randomization literature that either detect and remove pleiotropic variants, or consistently estimate causal effects in the presence of pleiotropic variants.

1.5.2 Classification of studies

Figure 1.2 illustrates the two main types of Mendelian randomization studies considered in the literature and in this dissertation, and the type of data that can be used in the analysis of each. When Mendelian randomization was initially considered in the literature, data on a single set of individuals were generally used; this is known as a ‘one–sample’ Mendelian randomization study [23]. Typically, individual level data on the risk factor, outcome, and genetic variants are used in the analysis model for one–sample Mendelian randomization. However, it is possible for estimates and standard errors of the genetic associations with the risk factor and with the outcome, referred to as ‘summary level data’, to be used in the analysis of a one–sample Mendelian randomization study.

It has become increasingly popular for Mendelian randomization analyses to use data from two independent samples; this is known as a ‘two–sample’ Mendelian randomization study [24]. Two–sample Mendelian randomization studies generally use summary level data, where the estimates and standard errors of the genetic associations with the risk factor are obtained from one sample, and the estimates and standard errors of the genetic associations with the outcome are obtained from the other sample. It is assumed that the two independent samples come from the same underlying population.
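To make the use of summary level data concrete, the sketch below combines variant-level estimates into a single causal estimate using per-variant ratio estimates and a fixed-effect inverse-variance weighted (IVW) average. The IVW estimator is one commonly used approach and is shown here only as an illustration; it is not described in this section. The function names and all numerical values are hypothetical, and the standard errors use a first-order approximation that ignores the uncertainty in the genetic associations with the risk factor.

```python
import math


def wald_ratios(beta_x, beta_y, se_y):
    """Per-variant ratio estimates with first-order standard errors."""
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    ses = [sy / abs(bx) for bx, sy in zip(beta_x, se_y)]
    return ratios, ses


def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect inverse-variance weighted combination of the ratio estimates."""
    numerator = sum(bx * by / sy ** 2
                    for bx, by, sy in zip(beta_x, beta_y, se_y))
    denominator = sum(bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Made-up summary level data for three uncorrelated variants.
beta_x = [0.10, 0.20, 0.40]   # variant-risk factor associations (sample 1)
beta_y = [0.05, 0.10, 0.20]   # variant-outcome associations (sample 2)
se_y = [0.01, 0.02, 0.02]     # standard errors of beta_y

ratios, ratio_ses = wald_ratios(beta_x, beta_y, se_y)
estimate, estimate_se = ivw_estimate(beta_x, beta_y, se_y)
print(ratios, estimate, estimate_se)
```

Because only the association estimates and their standard errors are needed, the two sets of inputs can come from different published GWASs, which is what makes the two–sample design practical.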

Typically, ‘summary level data’ refers to the case where the genetic associations with the risk factor and the genetic associations with the outcome have been estimated in two independent samples (i.e. a two–sample Mendelian randomization study). However, as noted above, it is possible for summary level data to be used in a one–sample study. Throughout this dissertation, we assume that ‘summary level data’ refers to the two–sample setting unless explicitly stated otherwise.

Since access to individual level data can be restrictive, and summary level data is often publicly available from GWASs and large consortia, two–sample Mendelian randomization studies continue to grow in popularity [25]. This has led to numerous methodological developments in using summary level data in Mendelian randomization. Databases, such as Phenoscanner [26], and software, such as MR-Base [27], have been developed to allow users to extract summary level data from published GWASs and
