Single Marker Tests - Statistical Methods

Chapter 2 General Methods

2.3 Statistical Methods

2.3.1 Single Marker Tests

Several software applications are available to perform single marker tests, commonly known as genome-wide association studies (GWAS) when all autosomes are investigated. Commonly used applications employed in this series of investigations include SNPTEST v2.5, PLINK v1.9/2.0 and BOLT-LMM v2.3 (Purcell et al., 2007; Marchini and Howie, 2010; Chang et al., 2015; Loh et al., 2015; Loh et al., 2018). All single marker association tests performed here used linear or logistic regression or, as seen in Chapter 6, the more recently developed mixed linear models.

2.3.1.1 Linear and Logistic Regression

Linear and logistic regression models are the simplest methods of performing single marker association tests whilst allowing the inclusion of covariates (such as age, sex or ethnicity). These covariates are likely predictors of the trait of interest and may also share independent genetic associations with it; they must therefore be accounted for in order to prevent confounding when testing for associations between genetic variants and the trait of interest (Clarke et al., 2011; Bush and Moore, 2012). Linear regression was employed for continuous traits, with effect sizes reported as the average increase (or decrease) in trait value for each copy of the “effect” allele (Equation 2.1).

$$Y = \beta_0 + \beta_g X_g + \beta_c X_c + \varepsilon$$

Equation 2.1: Linear Regression Equation. Y = expected outcome (trait); β0 = intercept of linear regression line; βg = effect size per copy of the effect allele g; Xg = number of copies of the effect allele g carried; βc = effect size per unit of the covariate c (e.g. age, sex); Xc = value of the covariate c; ε = residual error term.

Conversely, for dichotomous traits, logistic regression was employed as there are only two trait classifications, with variant effects reported as odds ratios. The odds of case status are the probability of being a case divided by the probability of being a control (Equation 2.2). The odds ratio states the average factor by which these odds increase (or decrease) for each copy of the “effect” allele possessed. If the odds of case status are the same regardless of the genotype carried at a given variant, the odds ratio will be 1.

$$\text{Odds} = \frac{P_{\text{case}}}{P_{\text{control}}}$$

Equation 2.2: Odds of case status for a given genotype. Pcase = probability of being a case; Pcontrol = probability of being a control.
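As a worked illustration of these quantities, the sketch below computes the odds of case status within each genotype group and the ratio of the odds between groups. The case and control counts are invented purely for illustration (they were chosen so that each additional copy of the effect allele multiplies the odds by 1.5), and the function names are hypothetical.

```python
def odds(p_case):
    """Odds of case status: P(case) / P(control), with P(control) = 1 - P(case)."""
    return p_case / (1.0 - p_case)


def odds_ratio(p_case_a, p_case_b):
    """Ratio of the odds of case status in group A to the odds in group B."""
    return odds(p_case_a) / odds(p_case_b)


# Hypothetical (cases, controls) counts, keyed by number of effect alleles carried.
counts = {0: (100, 400), 1: (150, 400), 2: (90, 160)}

# Probability of being a case within each genotype group.
p_case = {g: cases / (cases + controls) for g, (cases, controls) in counts.items()}

or_het = odds_ratio(p_case[1], p_case[0])  # one copy vs. none
or_hom = odds_ratio(p_case[2], p_case[0])  # two copies vs. none

print(f"odds ratio, 1 copy vs 0:   {or_het:.3f}")
print(f"odds ratio, 2 copies vs 0: {or_hom:.3f}")
```

With these counts the two-copy odds ratio (2.25) is the square of the one-copy odds ratio (1.5), which is the pattern expected when each copy of the allele multiplies the odds by a constant factor.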

Rather than estimating the expected trait value for a given genotype, as is the case for linear regression, logistic regression estimates the natural logarithm (ln) of the odds of case status, given the predictors (Equation 2.3; Lever et al., 2016). The coefficient βg is therefore a log odds ratio, and exponentiating it gives the odds ratio per copy of the effect allele.

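$$\ln\left(\frac{P_{\text{case}}}{P_{\text{control}}}\right) = \beta_0 + \beta_g X_g + \beta_c X_c + \varepsilon$$

Equation 2.3: Logistic Regression Equation. ln(Pcase/Pcontrol) = log odds of case status given the predictors (X); β0 = intercept of logistic regression line; βg = effect size (log odds ratio) per copy of the effect allele g; Xg = number of copies of the effect allele g carried; βc = effect size per unit of the covariate c (e.g. age, sex); Xc = value of the covariate c; ε = residual error term. Adapted from Lever et al. (2016).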
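To make the link between the fitted coefficient and the reported odds ratio concrete, the following minimal sketch fits a logistic regression to simulated data and exponentiates the genotype coefficient. It is not the procedure used by SNPTEST or PLINK; the data, the "true" coefficients and the `fit_logistic` helper (a plain Newton-Raphson fit written with NumPy) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data: every value here is an illustrative assumption.
genotype = rng.binomial(2, 0.3, size=n)   # effect-allele count: 0, 1 or 2
age = rng.normal(0.0, 1.0, size=n)        # standardised covariate
true_beta = np.array([-1.0, 0.4, 0.2])    # intercept, beta_g, beta_c

X = np.column_stack([np.ones(n), genotype, age])
p_case = 1.0 / (1.0 + np.exp(-X @ true_beta))
is_case = rng.binomial(1, p_case)         # 1 = case, 0 = control


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson updates."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


beta_hat = fit_logistic(X, is_case)
per_allele_or = np.exp(beta_hat[1])       # log odds ratio -> odds ratio

print("estimated log odds ratio per allele:", round(beta_hat[1], 3))
print("estimated per-allele odds ratio:    ", round(per_allele_or, 3))
```

The estimate of βg should lie close to the simulated value of 0.4, corresponding to an odds ratio of roughly exp(0.4) per copy of the effect allele.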

In all instances, an additive model has been applied, whereby reported effects are an average per copy of the effect allele possessed (i.e. if β is the magnitude of effect for an individual who is heterozygous for a particular variant, then the magnitude of effect is 2β for an individual who is homozygous for the same allele).
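The additive coding can be sketched as an ordinary least-squares fit in which the genotype enters as an allele count (0, 1 or 2) alongside the covariates. The simulated trait, covariates and effect sizes below are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Simulated predictors: additive genotype coding plus two covariates.
genotype = rng.binomial(2, 0.3, size=n).astype(float)  # copies of the effect allele
age = rng.normal(0.0, 1.0, size=n)                     # standardised age
sex = rng.integers(0, 2, size=n).astype(float)         # 0/1 indicator

# Simulated continuous trait with a per-allele effect of 0.5 (illustrative value).
beta_g = 0.5
trait = 1.0 + beta_g * genotype + 0.3 * age - 0.2 * sex + rng.normal(0.0, 1.0, size=n)

# Design matrix: intercept, genotype, covariates (as in Equation 2.1).
X = np.column_stack([np.ones(n), genotype, age, sex])
beta_hat, *_ = np.linalg.lstsq(X, trait, rcond=None)

print("estimated per-allele effect:       ", round(beta_hat[1], 3))
print("implied shift for a heterozygote:  ", round(beta_hat[1], 3))
print("implied shift for a homozygote:    ", round(2 * beta_hat[1], 3))
```

Because the genotype is a single numeric column, the model estimates one slope, so the homozygote shift is by construction twice the heterozygote shift.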

2.3.1.2 Mixed Linear Models

As implemented in BOLT-LMM v2.3 (Loh et al., 2015; Loh et al., 2018), mixed linear models are an alternative to standard linear regression models for single marker association tests.

$$Y = \beta_0 + \beta_g X_g + \beta_c X_c + G + \varepsilon$$

Equation 2.4: Mixed Linear Model. Y = expected outcome (trait); β0 = intercept of linear regression line; βg = effect size per copy of the effect allele g; Xg = number of copies of the effect allele g carried; βc = effect size per unit of the covariate c (e.g. age, sex); Xc = value of the covariate c; G = genetic effects (included as a random effect); ε = residual error term.

Mixed linear models share features of standard linear regression models (Equation 2.4 vs. Equation 2.1), with the terms from linear regression forming the fixed effects portion of the mixed linear model. However, a key advantage of mixed linear models over linear regression is that residual population stratification and relatedness within the study sample can be accounted for as random effects (term G in Equation 2.4). If left unaccounted for, these can lead to reduced power to detect associations or an excess of false positive association signals (Yang et al., 2014). Accounting for this structure therefore permits the inclusion of related individuals who would traditionally be excluded from analyses, thus improving association study power.

Mixed linear models traditionally use pre-constructed genetic relationship matrices (GRMs) to account for this residual population structure (Yang et al., 2014); BOLT-LMM, by contrast, computes the parameters required for the random effects portion of the model during the analysis. This improves computational efficiency, particularly for the leave-one-chromosome-out (LOCO) procedure implemented by BOLT-LMM, which would be computationally intensive if multiple GRMs had to be stored in memory (as is the case for the alternative software application GCTA-LOCO (Yang et al., 2014)). In LOCO analyses, variants situated on the same chromosome as the variant tested are not included in the computation of the random effects portion of the model (Yang et al., 2014). Without this exclusion, the tested variant would be considered twice in the model: once as a fixed effect (as the tested variant) and again as a random effect (either directly, or indirectly through variants in linkage disequilibrium (LD) acting as proxies). This over-fitting would dampen the true association signal of the tested variant; use of the LOCO procedure therefore improves the power to identify associations.
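The LOCO idea can be made concrete with the explicit-GRM formulation, in which one GRM is built per chromosome from the variants on all other chromosomes. The sketch below is an illustration of that general approach only: BOLT-LMM does not construct explicit GRMs, and the genotypes, chromosome labels and function names here are hypothetical. The GRM is computed from genotypes standardised by their sample allele frequencies.

```python
import numpy as np

rng = np.random.default_rng(2)
n_individuals, n_variants = 50, 600

# Simulated genotypes (0/1/2) and a toy chromosome label for each variant.
freqs = rng.uniform(0.1, 0.5, size=n_variants)
genotypes = rng.binomial(2, freqs, size=(n_individuals, n_variants)).astype(float)
chromosome = rng.integers(1, 4, size=n_variants)  # three toy chromosomes: 1, 2, 3


def grm(genotypes):
    """GRM from standardised genotypes: K = Z Z^T / (number of variants used)."""
    p = genotypes.mean(axis=0) / 2.0                   # sample allele frequencies
    keep = (p > 0.0) & (p < 1.0)                       # drop monomorphic variants
    z = (genotypes[:, keep] - 2.0 * p[keep]) / np.sqrt(2.0 * p[keep] * (1.0 - p[keep]))
    return z @ z.T / keep.sum()


def loco_grms(genotypes, chromosome):
    """One GRM per chromosome, each built from variants on all OTHER chromosomes."""
    return {int(c): grm(genotypes[:, chromosome != c]) for c in np.unique(chromosome)}


grms = loco_grms(genotypes, chromosome)
for c, k in grms.items():
    print(f"testing chromosome {c}: GRM shape {k.shape}, built without chromosome {c}")
```

A variant on chromosome 1 would then be tested against `grms[1]`, so that neither the variant itself nor its LD proxies contribute to the random effect. Holding one n × n matrix per chromosome is what makes the explicit approach memory-intensive for large samples.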

BOLT-LMM can employ two different models when performing single marker association tests: the standard “infinitesimal” model and the Bayesian “non-infinitesimal” model (Loh et al., 2015). The “infinitesimal” model assumes that all variants are causal and are of small, normally distributed effect sizes. This model is equivalent to alternative mixed linear models used for single marker association tests. The “non-infinitesimal” model, on the other hand, is more complex and is only applied if BOLT-LMM computes that there will be an expected improvement in performance compared to the standard “infinitesimal” model (Loh et al., 2015). This Bayesian model assumes that the effect sizes of variants do not follow a single pre-defined normal distribution, as is the case with frequentist tests (i.e. standard linear/logistic regression or the BOLT-LMM “infinitesimal” model), but can instead be fitted to a mixture of two normal distributions that allows for a small number of variants of large effect, with the remaining variants assumed to be of small effect size. These probability distributions are calculated by the software based on the input data, and parameters from these distributions are subsequently incorporated into the mixed linear model (Stephens and Balding, 2009). For all single marker association tests performed using BOLT-LMM, the standard “infinitesimal” model was used, as the software computed that there would be no gain in performance from the “non-infinitesimal” model.
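The difference between the two effect-size assumptions can be illustrated by simulation. The sketch below draws effect sizes from a single normal distribution and from a mixture of two normal distributions with the same total variance. The mixing proportion and standard deviations are arbitrary illustrative values; they are not estimates produced by BOLT-LMM, nor its internal parameterisation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_variants = 100_000

# Illustrative parameters only (assumptions, not BOLT-LMM output).
p_large = 0.02                   # fraction of variants in the wide component
sd_large, sd_small = 0.10, 0.01  # standard deviations of the two components

# For a mixture of zero-mean normals, the variance is the weighted sum of
# the component variances.
total_var = p_large * sd_large**2 + (1 - p_large) * sd_small**2

# "Infinitesimal"-style assumption: one normal with the same total variance.
effects_single = rng.normal(0.0, np.sqrt(total_var), size=n_variants)

# "Non-infinitesimal"-style assumption: mixture of two normals.
is_large = rng.random(n_variants) < p_large
effects_mixture = np.where(
    is_large,
    rng.normal(0.0, sd_large, size=n_variants),
    rng.normal(0.0, sd_small, size=n_variants),
)


def frac_beyond(effects, threshold=0.05):
    """Fraction of variants whose absolute effect exceeds the threshold."""
    return float(np.mean(np.abs(effects) > threshold))


print("fraction with |effect| > 0.05, single normal:", frac_beyond(effects_single))
print("fraction with |effect| > 0.05, mixture:      ", frac_beyond(effects_mixture))
```

Although both sets of effects have the same overall variance, the mixture places more variants in the tails, which is the situation in which a model allowing a few variants of large effect would be expected to help.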
