Chapter 2 General Methods
2.3 Statistical Methods
2.3.5 Gene-based and Gene-set Analyses
Many GWAS investigations have primarily concentrated on single marker tests. However, translating the association signals identified by a GWAS into potential causal mechanisms or biological functions is not straightforward. Combining test results across whole genes or collections of genes aims to provide a greater understanding of the functional consequences of genetic variants with respect to the trait of interest (Neale and Sham, 2004; Cantor, Lange and Sinsheimer, 2010).
Two main software applications available to perform gene-based association tests using association test summary statistics are VEGAS2 and MAGMA (de Leeuw et al., 2015; Mishra and Macgregor, 2015). For analyses undertaken here, both applications were employed.
For both software applications, genes were defined according to NCBI build 37 (hg19/GRCh37) coordinates, the build to which all variants were mapped during their respective imputation phases. Furthermore, flanking regions of 50 kb (unless otherwise stated) were appended to the gene transcription start and stop sites. These flanking regions were included as variants in these
nearest gene but other nearby genes too (Guo and Jamison, 2005; Schork et al., 2013; Brodie, Azaria and Ofran, 2016; Corradin et al., 2016). For both applications, LD patterns between variants included in the analysis for each gene were estimated using an ancestry matched reference panel (see below).
2.3.5.1 VEGAS2
Developed by Liu et al. (2010b) and updated by Mishra and Macgregor (2015), VEGAS2 (VErsatile Gene-based Association Study 2) computes gene-based test statistics by first converting single marker association test p-values into upper tail χ2 test statistics with 1 degree of freedom, then summing these across all variants within each gene locus to give a single value. If the variants were independent, this gene-based test statistic would follow a χ2 distribution with n degrees of freedom under the null hypothesis of no association, where n is the number of variants within the gene locus. However, variants within a gene locus are rarely in complete linkage equilibrium (i.e. independent of each other), so the correlation between variants also needs to be considered. These LD patterns are represented by an n × n matrix of pairwise LD values (Σ), estimated from an ancestry-matched reference panel. To obtain p-values for the gene-based association tests, VEGAS2 performs a two-step process. Firstly, values following a multivariate normal distribution with a mean of zero and covariance matrix Σ are simulated by the software. Secondly, the gene-based test statistics derived from these simulated values are compared against the observed gene-based χ2 test statistic described above. Gene-based p-values are defined as the proportion of simulations where the simulated test statistic is greater than the observed gene-based test statistic (Equation 2.7).
\[ P = \frac{\text{No. of simulations where simulated } \chi^2 > \text{observed } \chi^2 \text{ test statistic}}{\text{Total number of simulations}} \]
Equation 2.7: VEGAS2 Gene-Based P-value Calculation.
As the observed gene-based test statistic is based on the upper tail of the χ2 distribution, few simulations would be expected to have greater χ2 values in the event that the gene is associated with the trait of interest.
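The simulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the VEGAS2 implementation: the function names, the toy LD matrix and the fixed number of simulations are assumptions made for the example. It also assumes that each simulated multivariate normal vector is turned into a simulated gene-based statistic by squaring and summing its elements.

```python
import math
import random
from statistics import NormalDist


def p_to_chi2_1df(p):
    """Convert a two-sided p-value to an upper tail chi-square (1 df) statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; sign is irrelevant once squared
    return z * z


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_gene_p(p_values, ld, n_sims, rng):
    """Return (observed statistic, empirical gene-based p-value)."""
    observed = sum(p_to_chi2_1df(p) for p in p_values)
    lower = cholesky(ld)
    n = len(p_values)
    exceed = 0
    for _ in range(n_sims):
        independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Multiplying by the Cholesky factor gives draws with covariance `ld`.
        correlated = [sum(lower[i][k] * independent[k] for k in range(i + 1))
                      for i in range(n)]
        if sum(z * z for z in correlated) > observed:
            exceed += 1
    return observed, exceed / n_sims


# Toy example: three variants in one gene with a hypothetical LD matrix.
toy_ld = [[1.0, 0.5, 0.2],
          [0.5, 1.0, 0.3],
          [0.2, 0.3, 1.0]]
observed_stat, gene_p = simulate_gene_p([1e-4, 1e-3, 0.01], toy_ld, 2000, random.Random(1))
```

With strongly associated variants, as in the toy input, almost no simulated statistic exceeds the observed one, so the empirical p-value is close to zero.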
It is important to note that VEGAS2 uses simulations rather than permutations to determine the association signal. Before the development of VEGAS, gene-based association tests such as those performed by PLINK [command --set-test] were computationally intensive. In the method implemented by PLINK (Purcell et al., 2007), rather than performing simulations to generate alternative χ2 test statistics to compare against, individuals' phenotype data were permuted over thousands of iterations. Each permutation generated a new set of single marker association test statistics, which were then combined into a single gene-based test statistic. As with VEGAS, the proportion of permuted gene-based χ2 test statistics greater than the observed χ2 test statistic was defined as the gene-based p-value.
In order to maximise efficiency, the number of simulations performed varies depending on the p-value obtained after each round of simulations. Initially, 1000 simulations are performed before generating the gene-based p-value for a particular gene. If this p-value is less than 0.1, 10,000 simulations are performed before reviewing again. If this new gene-based p-value is less than 0.001, 1 million
than the observed test statistic after 1 million simulations, no further simulations are performed and the gene is assigned a reported p-value of P < 1 × 10⁻⁶. This is because the Bonferroni-adjusted p-value threshold (0.05 / number of genes) has already been crossed. This threshold is itself known to be overly conservative, as gene regions are not fully independent of each other: some variants may contribute to more than one gene (Liu et al., 2010b).
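The staged simulation scheme can be sketched as follows. This is an illustrative sketch only. The callable `count_exceedances` is a hypothetical stand-in for one round of simulations: it takes a number of simulations and returns how many simulated statistics exceeded the observed one. Whether earlier simulations are reused at later stages is not specified here, so the sketch simply reruns each stage from scratch.

```python
def adaptive_gene_p(count_exceedances,
                    stages=((1_000, 0.1), (10_000, 0.001), (1_000_000, None))):
    """Run simulation stages of increasing size.

    Each stage is (number of simulations, escalation threshold). The next
    stage is run only if the empirical p-value falls below the threshold.
    Returns (simulations run in final stage, p-value, is_upper_bound).
    """
    for n_sims, escalate_below in stages:
        exceed = count_exceedances(n_sims)
        p_value = exceed / n_sims
        if escalate_below is None or p_value >= escalate_below:
            break
    if exceed == 0:
        # No simulated statistic exceeded the observed one: report P < 1 / n_sims.
        return n_sims, 1.0 / n_sims, True
    return n_sims, p_value, False


# Hypothetical gene for which 1% of simulations exceed the observed statistic.
result = adaptive_gene_p(lambda n_sims: n_sims // 100)
```

In this example the first stage gives P = 0.01, which is below 0.1, so 10,000 simulations are run. The second stage again gives P = 0.01, which is not below 0.001, so the procedure stops there.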
Definitions of gene loci and reference files for estimating LD patterns were built into the VEGAS2 software application. Specifically, gene loci were defined according to a list of all RefSeq genes obtained from the UCSC table browser (Karolchik et al., 2004), and LD patterns were estimated by VEGAS2 using reference files composed of data for the 379 unrelated individuals of European ancestry from Phase 1, Version 3 of the 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2012; Mishra and Macgregor, 2015). The European ancestry dataset was used because the cohorts included in these analyses were restricted to those of European ancestry. The 1000 Genomes Project reference dataset is notably larger than the HapMap2 reference dataset used in the initial release of VEGAS (379 vs. 90 individuals), therefore providing greater accuracy when estimating LD patterns (Mishra and Macgregor, 2015).
2.3.5.2 MAGMA
Unlike VEGAS2, the MAGMA (Multi-marker Analysis of GenoMic Annotation; de Leeuw et al., 2015) software application does not have gene definitions or reference files for estimating the LD structure of variants built in; however, these files are available for download from the MAGMA website (URL: https://ctg.cncr.nl/software/magma). From here, gene definitions (in build 37 coordinates) originally obtained from the NCBI Entrez Gene database (Maglott et al., 2011) were downloaded, alongside reference files for European ancestry individuals from the 1000 Genomes Project.
An initial “Annotation” step must be run in MAGMA in order to assign variants to genes for gene-based analyses. It is at this stage that flanking regions can be appended to the gene transcription start and stop sites so that variants that may affect a particular gene’s regulatory processes are also included.
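The logic of the annotation step can be illustrated with a short Python sketch. It is not MAGMA's own code: the function name, gene labels, variant IDs and coordinates are all hypothetical, and the only behaviour taken from the text is that a variant is assigned to a gene when it lies between the gene's start and stop sites extended by a flanking window.

```python
from collections import defaultdict


def annotate(variants, genes, window_kb=50):
    """Assign variants to every gene whose extended boundaries contain them.

    variants: {variant_id: (chromosome, position)}
    genes:    {gene_id: (chromosome, start, stop)}
    """
    window = window_kb * 1000
    assigned = defaultdict(list)
    for gene, (chrom, start, stop) in genes.items():
        for variant, (v_chrom, pos) in variants.items():
            if v_chrom == chrom and start - window <= pos <= stop + window:
                assigned[gene].append(variant)
    return dict(assigned)


# Hypothetical gene and variant coordinates.
genes = {"GENE_A": ("1", 100_000, 150_000),
         "GENE_B": ("1", 190_000, 220_000)}
variants = {"rs1": ("1", 60_000), "rs2": ("1", 170_000),
            "rs3": ("1", 300_000), "rs4": ("2", 120_000)}
annotation = annotate(variants, genes)
# annotation == {"GENE_A": ["rs1", "rs2"], "GENE_B": ["rs2"]}
```

Note that `rs2` falls within the flanking regions of both genes and is therefore assigned to each of them, which is one reason why gene-based tests are not fully independent of each other.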
As GWAS summary statistics (variant ID labels and association test p-values) from single marker tests were used as input data in the absence of raw genotype data, as was the case in Chapter 4, the “snp-wise=mean” model was implemented by default. All variants assigned to a specific gene in the annotation step were included in the analysis of that gene. The observed gene-based test statistic was determined by converting the respective variant p-values to -log10 values (i.e. a variant with P = 1 × 10⁻⁶ would be converted to a value of 6) and summing them. The respective gene-based p-values are subsequently computed by MAGMA using these gene-based test statistics, the LD structure of the variants analysed and a sampling distribution appropriate for the test statistic as determined by MAGMA.
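As a worked illustration of the observed test statistic described above, the following Python snippet sums the -log10 transformed p-values for a hypothetical gene with three variants. The function name and p-values are made up for the example, and the snippet covers only the observed statistic, not the subsequent p-value calculation, which depends on the LD structure and on MAGMA's sampling distribution.

```python
import math


def summed_neg_log10(p_values):
    """Sum of -log10(p) across all variants assigned to a gene."""
    return sum(-math.log10(p) for p in p_values)


# Hypothetical variant p-values: contributions of 6, 2 and 1 respectively.
gene_statistic = summed_neg_log10([1e-6, 0.01, 0.1])  # approximately 9
```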
pathway). For the gene-set analyses undertaken in the following chapters, “competitive” gene-set analysis was performed with Z-statistics from the MAGMA gene-based analysis used as input data. From the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005), all gene-sets and their definitions (i.e. which genes contribute to each gene-set) were downloaded and used in these gene-set analyses.
In the competitive gene-set analysis implemented by MAGMA, a linear regression is performed with genes coded as “1” if they are in the gene-set analysed or “0” if they are not, and the outcome taken as the gene-based Z-statistic. The mean difference in association between genes within the gene-set and those outside of it (βS) is tested against the null hypothesis of no difference in association (H0: βS = 0) (Equation 2.8; de Leeuw et al., 2015).
\[ Z = \beta_0 + S\beta_S + C\beta_C + \varepsilon \]
Equation 2.8: Linear regression model for competitive gene-set analyses in MAGMA. Z = gene-based Z-statistics (outcome); β0 = intercept of the linear regression model; S = indicator variable coded 1 for genes in the gene-set and 0 otherwise; βS = difference in association between genes within the gene-set and those outside it; C = matrix of covariates (e.g. no. of variants within a gene) with effects βC; ε = residuals with correlations aligned with the gene-gene correlations to account for LD between genes (de Leeuw et al., 2015).
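The structure of this regression can be illustrated with a simplified Python sketch. It fits the model by ordinary least squares with a single covariate and treats the residuals as independent, so it deliberately ignores the gene-gene correlations that MAGMA accounts for and should not be read as MAGMA's estimation procedure. All function names and data values are hypothetical.

```python
def solve_normal_equations(design, outcome):
    """Ordinary least squares via Gauss-Jordan elimination on (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y for row, y in zip(design, outcome)) for a in range(k)]
    aug = [xtx[a] + [xty[a]] for a in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[a][k] / aug[a][a] for a in range(k)]


def competitive_fit(gene_z, in_set, covariate):
    """Estimate (beta_0, beta_S, beta_C) for Z = beta_0 + S*beta_S + C*beta_C + error."""
    design = [[1.0, float(s), float(c)] for s, c in zip(in_set, covariate)]
    return tuple(solve_normal_equations(design, gene_z))


# Hypothetical data: eight genes, three of which belong to the gene-set.
in_set = [1, 1, 1, 0, 0, 0, 0, 0]
n_variants = [10, 40, 25, 30, 12, 55, 20, 8]
gene_z = [1.9, 2.4, 1.6, 0.7, 0.4, 1.1, 0.9, 0.2]
beta_0, beta_s, beta_c = competitive_fit(gene_z, in_set, n_variants)
```

A positive estimate of `beta_s` indicates that genes in the gene-set have, on average, larger gene-based Z-statistics than genes outside it after adjusting for the covariate.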