SNP-Heritability - Statistical Methods

Chapter 2 General Methods

2.3 Statistical Methods

2.3.6 SNP-Heritability

SNP-heritability estimates were calculated using Genome-wide Complex Trait Analysis (GCTA) (Yang et al., 2011a). In the most commonly employed analysis approach (GCTA-GREML), GCTA is used to estimate the proportion of trait variance due to the additive effects of commonly occurring SNPs ($h^2_{SNP}$). This is done by fitting the pairwise relatedness between individuals, in the form of a genetic relationship matrix (GRM), as a random effect in a mixed linear model using restricted maximum likelihood (REML) (Yang et al., 2011a). The model fitted is equivalent to the mixed linear model described in Equation 2.4 and allows for the inclusion of covariates; however, here it is the variance of the trait due to genetic effects ($\sigma_g^2$) that is of interest. The trait variance ($V_P$) is defined as the sum of two components: the variance due to the genetic variants included in estimation of the GRM, taking into account the variance-covariance structure of these variants between individuals ($A_g\sigma_g^2$), and the residual variance attributable to non-genetic effects ($I\sigma_\varepsilon^2$) (Equation 2.9; Yang et al., 2011a). If multiple GRMs are fitted, as in the case of a joint analysis, $A_g\sigma_g^2$ becomes $\sum_i A_i\sigma_i^2$, where i indexes the individual variance components and a separate GRM ($A_i$) is constructed for each.

$$V_P = A_g\sigma_g^2 + I\sigma_\varepsilon^2$$

Equation 2.9: Estimating trait variation using GCTA. $V_P$ = trait variance; $A_g$ = genetic relationship matrix (GRM); $\sigma_g^2$ = variance due to variants contributing to the GRM; $I$ = identity matrix of residual effects (diagonal elements = 1, off-diagonal elements = 0); $\sigma_\varepsilon^2$ = variance due to non-genetic effects (Yang et al., 2011a).

In order to construct the GRM required to estimate variance due to additive genetic effects, a relatedness coefficient between each pair of individuals was calculated using Equation 2.10, which provides a weighted count of how many variant genotypes are common to both individuals. A relatedness coefficient of less than 0.025, which corresponds to the level of genetic similarity expected for second or third cousins, is generally used to define “unrelated” individuals (Yang et al., 2010).

$$A_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1 - p_i)}$$

Equation 2.10: Determining genetic relatedness between two individuals, j and k. $A_{jk}$ = genetic relationship coefficient; i = variant number (1, 2, 3… N); $p_i$ = frequency of the reference allele of variant i; $x_{ij}$ = genotype of individual j at variant i; $x_{ik}$ = genotype of individual k at variant i (Yang et al., 2011a).
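
As an illustration of Equation 2.10, the following minimal Python sketch computes a GRM from a small genotype table. It is not GCTA's implementation: the function name `additive_grm`, the choice to estimate allele frequencies from the sample itself, the skipping of monomorphic variants, and the use of the same expression for the diagonal elements are all assumptions made for this example, and GCTA's own handling of these details may differ.

```python
def additive_grm(genotypes):
    """Sketch of Equation 2.10 (hypothetical helper, not part of GCTA).

    genotypes[i][j] is the reference-allele count (0, 1 or 2) of
    individual j at variant i. Returns a square list-of-lists matrix
    with one row and one column per individual.
    """
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    n_used = 0
    for row in genotypes:
        # Assumption: allele frequency p_i is estimated from this sample.
        p = sum(row) / (2 * n_ind)
        denom = 2 * p * (1 - p)
        if denom == 0:
            # Assumption: monomorphic variants carry no information; skip them.
            continue
        n_used += 1
        centred = [x - 2 * p for x in row]
        for j in range(n_ind):
            for k in range(n_ind):
                grm[j][k] += centred[j] * centred[k] / denom
    if n_used == 0:
        raise ValueError("no polymorphic variants supplied")
    # Average over the variants that were actually used (the 1/N term).
    return [[value / n_used for value in r] for r in grm]


# Toy data: 3 variants (rows) genotyped in 3 individuals (columns).
toy = [[0, 1, 2], [2, 1, 0], [0, 0, 2]]
for line in additive_grm(toy):
    print([round(v, 3) for v in line])
```

Because the genotypes are centred on sample allele frequencies, each row of the resulting matrix sums to approximately zero, which is a convenient sanity check on toy data.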

Because heritability is restricted to values between 0 and 1, the test of whether a SNP-heritability estimate is greater than zero is one-sided. The p-value is therefore obtained by halving the p-value from a χ2 test with 1 degree of freedom applied to the likelihood ratio test (LRT) statistic. This LRT statistic is computed by GCTA during analysis; however, it can be closely approximated by $(\hat{h}^2 / SE)^2$, where $\hat{h}^2$ is the estimated heritability and SE its standard error (Yang et al., 2011a).
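
The approximation described above can be sketched in a few lines of Python. This is an illustrative calculation only: the function name `approx_greml_p_value` and the input values are hypothetical, and the statistic computed is the approximation $(\hat{h}^2 / SE)^2$ rather than the LRT statistic reported by GCTA. The χ2 tail probability for 1 degree of freedom is obtained from the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def approx_greml_p_value(h2, se):
    """Approximate one-sided p-value for H0: h2 = 0 (illustrative sketch)."""
    # Approximate test statistic: (estimate / standard error) squared.
    statistic = (h2 / se) ** 2
    # Upper-tail probability of a chi-squared distribution with 1 df.
    p_chi2 = math.erfc(math.sqrt(statistic / 2))
    # Halve it because heritability cannot be negative.
    return statistic, p_chi2 / 2


# Hypothetical estimate and standard error, chosen only for illustration.
statistic, p_value = approx_greml_p_value(h2=0.30, se=0.10)
print(f"statistic = {statistic:.2f}, p = {p_value:.5f}")
```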

2.3.6.1 Observed to Liability Scale Conversion

As heritability is traditionally estimated on the ‘observed scale’ (see below for definition), conversion to the ‘liability scale’ was required for the heritability estimates of the different dichotomous traits to be compared either against each other or across different case threshold definitions (Lee et al., 2011). The liability scale is a continuous scale used for discrete variables, for example cases and controls for a particular phenotype. It is assumed that the liability is normally distributed across the study sample, with the area under the normal distribution curve representing the sample (Figure 2.3).

Figure 2.3: The normal probability density function for transformation from the observed scale to the liability scale. If cases account for 20% of the individuals sampled (shaded area), the height of the function at the threshold value, t (edge of shaded area), is 0.28 (z, denoted by dotted line). Adapted from Lee et al. (2011).

The threshold value (t) on the liability scale is set according to the prevalence of the phenotype, i.e. the number of cases relative to the sample size. For example, if a particular phenotype is presumed to have a prevalence of 20%, the threshold on the liability scale is set such that 20% of the area under the curve lies to the right of the threshold value, with the remaining 80% of the area under the curve to the left of the threshold value (Figure 2.3). This transformation allows a dichotomous trait to be expressed on a continuous scale. Heritability on the ‘liability scale’ is independent of the prevalence of the investigated phenotype (Falconer and Mackay, 1996). It has been reported by Lee et al. (2011) that without such a conversion, heritability estimates for dichotomous traits are prone to a downward bias due to uncorrected measurement errors. Transformation to the liability scale is performed automatically by GCTA for dichotomous traits (based on Equation 2.11; Lee et al., 2011).

$$h_l^2 = h_o^2\,\frac{K(1 - K)}{z^2}\,\frac{K(1 - K)}{P(1 - P)}$$

Equation 2.11: Transformation from heritability on the observed scale ($h_o^2$) to heritability on the liability scale ($h_l^2$) with consideration of ascertainment bias. K = population prevalence, P = sample prevalence, z = height of the standard normal probability density function at threshold t (Lee et al., 2011).

This transformation also takes into consideration ascertainment bias whereby there is a difference between the population and sample prevalence rates.
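
To make Equation 2.11 concrete, the sketch below computes the threshold t and density height z for a given population prevalence and then applies the conversion. It uses only the Python standard library; the function names `threshold_height` and `observed_to_liability` and the example heritability value are hypothetical and are not taken from GCTA.

```python
from statistics import NormalDist


def threshold_height(k):
    """Return (t, z): the liability threshold for prevalence k and the
    height of the standard normal density at that threshold."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    t = std_normal.inv_cdf(1 - k)  # a proportion k of the area lies right of t
    return t, std_normal.pdf(t)


def observed_to_liability(h2_obs, k, p):
    """Apply Equation 2.11 with population prevalence k and sample prevalence p."""
    _, z = threshold_height(k)
    return h2_obs * (k * (1 - k) / z ** 2) * (k * (1 - k) / (p * (1 - p)))


# With a 20% prevalence, z is approximately 0.28, as in Figure 2.3.
t, z = threshold_height(0.20)
print(f"t = {t:.3f}, z = {z:.3f}")

# Hypothetical observed-scale estimate with no ascertainment (P equal to K).
print(f"{observed_to_liability(h2_obs=0.10, k=0.20, p=0.20):.3f}")
```

When the sample prevalence equals the population prevalence, the final term of Equation 2.11 equals 1 and the conversion reduces to multiplying by $K(1 - K)/z^2$.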

2.3.6.2 Consideration of Uncorrected Population Effects

Effects such as population stratification and cryptic relatedness within a sample may affect the results and interpretations of genetic studies if not accounted for. The major principal components have commonly been included as covariates when performing such studies to adjust for population stratification (Price et al., 2006; Price et al., 2010). However, mixed linear models, like those utilised by GCTA, can also be used to account for uncorrected population stratification and cryptic relatedness (Yu et al., 2006; Kang et al., 2010; Zhang et al., 2010).

For SNP-heritability estimation using GCTA or equivalent methods, there is no general consensus regarding the inclusion or exclusion of principal components to account for population stratification within the examined sample. However, some authors have suggested that the inclusion of principal components may not fully account for population stratification, and may in fact make SNP-heritability estimates less accurate due to over-fitting (Browning and Browning, 2011; Goddard et al., 2011; Dandine-Roulland et al., 2016; Krishna Kumar et al., 2016).

Strategies have been proposed to estimate the degree of inflation of SNP-heritability estimates due to uncorrected population stratification and cryptic relatedness, based on the principle that genetic regions (e.g. chromosomes) are independent of each other if there are no related individuals present and the sample is homogeneous (Yang et al., 2011b). If related individuals are present and/or a non-homogeneous sample is analysed, these regions will no longer be independent, since there will be some correlation between them through ancestry informative markers (AIMs). AIMs are variants whose allele frequencies are highly correlated within population subgroups. For example, if there are loci where particular alleles are found at high frequency in individuals of European ancestry, yet these same alleles are much rarer in all other population subgroups, these alleles would be deemed to be specific to individuals of European ancestry. It is expected that these AIMs are randomly distributed throughout the genome; hence, if left unaccounted for, these population effects will cause regions to be correlated with each other (Yang et al., 2011b). Thus, SNP-heritability estimates from an individual region are expected to be inflated in individual analyses; yet, when all regions are combined in a joint analysis, these correlated effects are taken into consideration, as the estimate for each region ($h^2_{joint}$) is obtained by conditioning on all other regions. As a result, estimates for each region would be independent of all other regions included in the model and inflation from these population effects would be eliminated (Yang et al., 2011b).

To ascertain the extent to which the SNP-heritability estimates obtained may be biased (inflated) due to uncorrected population stratification or residual cryptic relatedness, the method proposed by Yang et al. (2011b) was utilised. First, the genome was split into individual chromosomes, with SNP-heritability estimated for each chromosome separately and then jointly for all chromosomes. Next, for each chromosome, the difference between the heritability estimates from the individual and joint analyses was regressed against chromosome length (obtained from the UCSC Genome Browser (Kent et al., 2002), using NCBI human genome build 37 (hg19 / GRCh37) coordinates) (Figure 5.1). Finally, an estimate for the proportion of variance attributable to population structure across the whole genome was obtained by applying Equation 2.12 (Yang et al., 2011b):

$$b_0\,\frac{22}{21} + b_1\sum_{C=1}^{22}\frac{L_C}{21}$$

Equation 2.12: Estimating the proportion of variance attributable to population structure across the whole genome. $b_0$ = intercept; $b_1$ = gradient; C = chromosome; $L_C$ = length of chromosome C.
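
The regression step and Equation 2.12 can be sketched as follows. This is a simplified illustration, not the analysis code used in the study: the function name `population_structure_variance` is hypothetical, the regression is an ordinary least-squares fit written out by hand, and the input values are synthetic numbers rather than real estimates or real chromosome lengths.

```python
def population_structure_variance(h2_sep, h2_joint, lengths):
    """Regress (separate - joint) estimates on chromosome length and
    apply Equation 2.12. Expects one value per autosome (22 in total)."""
    if not (len(h2_sep) == len(h2_joint) == len(lengths) == 22):
        raise ValueError("expected 22 values, one per autosome")
    diffs = [s - j for s, j in zip(h2_sep, h2_joint)]
    mean_len = sum(lengths) / 22
    mean_diff = sum(diffs) / 22
    # Ordinary least-squares gradient (b1) and intercept (b0).
    sxx = sum((l - mean_len) ** 2 for l in lengths)
    sxy = sum((l - mean_len) * (d - mean_diff) for l, d in zip(lengths, diffs))
    b1 = sxy / sxx
    b0 = mean_diff - b1 * mean_len
    # Equation 2.12: b0 * 22/21 + b1 * sum(L_C) / 21.
    total = b0 * 22 / 21 + b1 * sum(lengths) / 21
    return b0, b1, total


# Synthetic inputs for illustration only (arbitrary length units).
lengths = [float(c) for c in range(1, 23)]
h2_joint = [0.010] * 22
h2_sep = [j + 0.001 + 0.0001 * l for j, l in zip(h2_joint, lengths)]
b0, b1, total = population_structure_variance(h2_sep, h2_joint, lengths)
print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, total = {total:.5f}")
```

The units of chromosome length only rescale $b_1$; the whole-genome estimate is unchanged as long as the same units are used in the regression and in the sum.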

2.3.6.3 Quantification of Dominance Effects

In addition, GCTA can be used to estimate the proportion of trait variance due to partitioned additive and dominance effect components (GCTA-GREMLd) (Zhu et al., 2015). Here, separate GRMs are constructed by GCTA prior to SNP-heritability estimation; one based on additive effects and another based on dominance effects (Equations 2.10 and 2.13). As with the aforementioned joint analysis for estimating uncorrected population effects, a joint analysis is performed in order to partition the overall genetic component of trait variance into these two constituent parts.

In order to compute the GRM based on dominance effects, Equation 2.10 is adjusted to account for the recoded genotypic values, thus becoming Equation 2.13. For the genotypes AA, AB and BB respectively, the additive coding is x = 0, 1, 2 and the dominance coding is x’D = 0, 2p, (4p – 2), where p is the frequency of allele B:

$$D_{jk} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x'_{D(ij)} - 2p_i^2)(x'_{D(ik)} - 2p_i^2)}{4p_i^2(1 - p_i)^2}$$

Equation 2.13: Determining genetic relatedness between two individuals, j and k, based on dominance effects. $D_{jk}$ = genetic relationship coefficient; i = variant number (1, 2, 3… N); $p_i$ = frequency of the reference allele of variant i; $x'_{D(ij)}$ = genotype of individual j at variant i; $x'_{D(ik)}$ = genotype of individual k at variant i (Zhu et al., 2015).
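
The dominance recoding and Equation 2.13 can be illustrated with a short Python sketch. As before, this is not GCTA's implementation: the function name `dominance_grm`, the estimation of allele frequencies from the sample, and the skipping of monomorphic variants are assumptions made for this example. Genotypes are taken to be counts of allele B (AA = 0, AB = 1, BB = 2).

```python
def dominance_grm(genotypes):
    """Sketch of Equation 2.13 (hypothetical helper, not part of GCTA).

    genotypes[i][j] is the count of allele B (0, 1 or 2) carried by
    individual j at variant i.
    """
    n_ind = len(genotypes[0])
    grm = [[0.0] * n_ind for _ in range(n_ind)]
    n_used = 0
    for row in genotypes:
        # Assumption: the frequency of allele B is estimated from this sample.
        p = sum(row) / (2 * n_ind)
        denom = 4 * p ** 2 * (1 - p) ** 2
        if denom == 0:
            # Assumption: skip monomorphic variants (zero denominator).
            continue
        n_used += 1
        # Dominance coding for AA, AB, BB: 0, 2p, 4p - 2.
        recode = {0: 0.0, 1: 2 * p, 2: 4 * p - 2}
        # Centre on 2p^2, the mean of the recoded values in Equation 2.13.
        centred = [recode[x] - 2 * p ** 2 for x in row]
        for j in range(n_ind):
            for k in range(n_ind):
                grm[j][k] += centred[j] * centred[k] / denom
    if n_used == 0:
        raise ValueError("no polymorphic variants supplied")
    return [[value / n_used for value in r] for r in grm]


# Toy data: 2 variants (rows) genotyped in 3 individuals (columns).
for line in dominance_grm([[0, 1, 2], [1, 1, 0]]):
    print([round(v, 3) for v in line])
```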

As a result, Equation 2.9 becomes:

$$V_P = A\sigma_A^2 + D\sigma_D^2 + I\sigma_\varepsilon^2$$

Equation 2.14: Estimating trait variation due to additive and dominance effects using GCTA-GREMLd. $V_P$ = trait variance; $A$ = additive effects GRM; $\sigma_A^2$ = variance due to additive effects of variants contributing to the GRM; $D$ = dominance effects GRM; $\sigma_D^2$ = variance due to dominance effects of variants contributing to the GRM; $I$ = identity matrix of residual effects; $\sigma_\varepsilon^2$ = variance due to non-genetic effects.

In document Discovery of genetic determinants for refractive error (Page 84-92)