Gene-based testing: combined VC and burden test implemented in MASS

4.5. Specific Aim 2: Gene-based rare variant analysis of red blood cell traits

4.5.1. Gene-based testing: combined VC and burden test implemented in MASS

As briefly introduced above, combining burden and variance-component (VC) tests improves power to detect loci with different distributions of causal variants. In transcripts containing few variants with large effects that all act in the same direction, a burden test is better powered to detect an effect. However, in loci where a large number of non-causal variants are present alongside functional variants, or where causal variants exhibit opposite directions of effect (both reasonable scenarios given current knowledge of human evolution), SKAT (Sequence Kernel Association Test) is more powerful than burden tests. A combined test that allows for both scenarios uses the effect magnitudes and standard errors of all rare variants within a defined region to identify genes associated with the trait of interest [615].

First, we will generate variant annotation set files in R for the deleterious and CADD-filtered transcript sets (see below). Using each annotation set as the respective group file, we will generate score statistics by transcript group with SUGEN, which is well suited to produce input files for the combined test because it automatically outputs a MASS-compatible results file (https://github.com/dragontaoran/SUGEN) [616]. We will then use the “VC-O” test flag in MASS to perform the combined test with a minor allele count (MAC) minimum [...] can only simulate a minimum p-value of 1E-6, which is higher than our genome-wide significance threshold, we will increase the number of simulations to 25 million (minimum p ~ 4E-08) for variants with a p-value lower than 1E-5 in the first run. Due to computational limitations, we currently do not expect to simulate p-values lower than 1E-9.
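The relationship between the number of simulations and the smallest p-value that can be resolved can be sketched as follows. This is a minimal illustration that assumes the simple 1/N convention for an empirical p-value (some implementations use 1/(N+1) instead); the function names are hypothetical and are not part of MASS.

```python
def min_resolvable_p(n_simulations: int) -> float:
    """Smallest nonzero empirical p-value under the assumed 1/N convention."""
    if n_simulations <= 0:
        raise ValueError("n_simulations must be positive")
    return 1.0 / n_simulations


def resolves_threshold(n_simulations: int, alpha: float) -> bool:
    """True if a simulation of this size can produce a p-value at or below alpha."""
    return min_resolvable_p(n_simulations) <= alpha


GWS_ALPHA = 2.87e-7  # genome-wide significance threshold quoted in this section

for n in (1_000_000, 25_000_000):
    print(n, min_resolvable_p(n), resolves_threshold(n, GWS_ALPHA))
```

Under this assumption, one million simulations bottom out at 1E-6, above the significance threshold, whereas 25 million simulations reach 4E-08, below it.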

MASS functions primarily as a meta-analysis program, so it automatically outputs both fixed-effect and random-effect p-values, as well as the Het-SKAT-O p-value computed with the method described by Lee et al. [615]. However, our study does not meet the criteria for the Lee method, which was designed for binary outcomes with small study populations, and we are evaluating only one study population; we will therefore use the fixed-effects p-value. SKAT-O combines the burden and SKAT statistics as a weighted average within one framework, resulting in the following test statistic:

$$Q_k(\rho) = (1 - \rho)\, Q_{k,\mathrm{SKAT}} + \rho\, Q_{k,\mathrm{burden}}$$

For a fixed ρ, this statistic follows a mixture of chi-square distributions under the null hypothesis, and its SKAT component sums squared per-variant scores, which allows effects with opposite directions to be combined without cancelling. Because we are underpowered to evaluate ancestry-specific effects genome-wide, we will perform sensitivity analyses in African American-only and Hispanic/Latino-only subpopulations for transcripts significant for each trait within each mask.
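The weighted statistic above can be illustrated with a small numerical sketch. It assumes the standard score-based definitions, in which the SKAT statistic is the sum of squared weighted per-variant scores and the burden statistic is the square of their sum; the function name, scores, and weights are hypothetical and do not reproduce the MASS implementation.

```python
def skat_o_statistic(scores, weights, rho):
    """Q(rho) = (1 - rho) * Q_SKAT + rho * Q_burden for one gene.

    scores  : per-variant score statistics (hypothetical values)
    weights : per-variant weights
    rho     : mixing parameter in [0, 1]; 0 gives SKAT, 1 gives the burden test
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    weighted = [w * s for w, s in zip(weights, scores)]
    q_skat = sum(x * x for x in weighted)   # squares first: signs cannot cancel
    q_burden = sum(weighted) ** 2           # sums first: opposite signs cancel
    return (1.0 - rho) * q_skat + rho * q_burden


same_direction = [2.0, 2.0]
opposite_direction = [2.0, -2.0]
unit_weights = [1.0, 1.0]

for rho in (0.0, 0.5, 1.0):
    print(rho,
          skat_o_statistic(same_direction, unit_weights, rho),
          skat_o_statistic(opposite_direction, unit_weights, rho))
```

With opposite-direction scores the burden component is zero while the SKAT component is unchanged, which is why weighting between the two retains power in both scenarios. In SKAT-O proper, ρ is not fixed in advance but chosen over a grid to minimize the p-value.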

4.5.1.1 Variant inclusion, genic unit definition, and adjustment for known variants

To perform gene-based testing, we will generate an inclusive list of all variants with a minor allele frequency (MAF) <1% in the MEGA-genotyped study population. Parameters will be defined based on previously described standard practices in the field. A maximum MAF threshold of 1% will be used for including variants, a strategy supported both by field standards and by the fact that common and low-frequency variants will be analyzed individually in [...]

Two groups of variants will be used to define the respective annotation sets: deleterious coding variants only (frameshift, stop-gain, stop-loss, and nonsynonymous coding variants), and a set additionally allowing for filtered regulatory variants (using the recommended PHRED-scaled CADD score filter of >10 to restrict to variants expected to have modest or higher effects) [617].
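The two annotation sets described above can be sketched as a simple filter. This is an illustration only: the field names (`id`, `transcript`, `consequence`, `cadd_phred`, `maf`), the consequence labels, and the strict MAF < 1% cutoff are assumptions, and the actual annotation files from the PAGE coordinating center may be structured differently.

```python
# Hypothetical consequence labels; real annotation files may use other terms.
DELETERIOUS = {"frameshift", "stop_gain", "stop_loss", "nonsynonymous"}
CADD_PHRED_MIN = 10.0  # PHRED-scaled CADD filter (> 10) from the text
MAF_MAX = 0.01         # rare variants only (assumed strict MAF < 1%)


def build_annotation_sets(variants):
    """Group rare variant IDs by transcript for the two masks."""
    deleterious_set, cadd_set = {}, {}
    for v in variants:
        if v["maf"] >= MAF_MAX:
            continue  # common and low-frequency variants are excluded
        is_deleterious = v["consequence"] in DELETERIOUS
        is_filtered_regulatory = (
            v["consequence"] == "regulatory" and v["cadd_phred"] > CADD_PHRED_MIN
        )
        if is_deleterious:
            deleterious_set.setdefault(v["transcript"], []).append(v["id"])
        if is_deleterious or is_filtered_regulatory:
            cadd_set.setdefault(v["transcript"], []).append(v["id"])
    return deleterious_set, cadd_set


example = [
    {"id": "v1", "transcript": "T1", "consequence": "frameshift", "cadd_phred": 30.0, "maf": 0.002},
    {"id": "v2", "transcript": "T1", "consequence": "regulatory", "cadd_phred": 12.5, "maf": 0.004},
    {"id": "v3", "transcript": "T1", "consequence": "regulatory", "cadd_phred": 3.1, "maf": 0.001},
    {"id": "v4", "transcript": "T2", "consequence": "nonsynonymous", "cadd_phred": 22.0, "maf": 0.05},
]
deleterious_set, cadd_set = build_annotation_sets(example)
print(deleterious_set)  # {'T1': ['v1']}
print(cadd_set)         # {'T1': ['v1', 'v2']}
```

In this toy input, the low-CADD regulatory variant and the common nonsynonymous variant are dropped from both sets, and the CADD-filtered set is a superset of the deleterious set.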

Using whole-genome-sequence annotation files for all MEGA-imputed variants generated by the PAGE coordinating center, we will compile both annotation sets, defining all transcripts that meet the aforementioned criteria and comprise >2 variants. All genes with ENSEMBL-defined transcripts will be included in the analysis, resulting in a genome-wide significance (GWS) alpha threshold of 2.87 × 10^-7, a significance threshold Bonferroni-corrected for approximately 25,000 genes and seven traits. Variants with an imputation quality below 0.4 were excluded by the PAGE coordinating center during imputation of the VCF genotype files. Genic unit size (i.e., the number of variants included per gene) is expected to range from very small (~3 variants, particularly for the deleterious mask) to very large (several hundred for protein-coding genes with a large number of CADD-significant regulatory variants).
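The Bonferroni arithmetic behind the significance threshold can be checked with a short calculation. This sketch assumes a family-wise alpha of 0.05 and uses the round figure of 25,000 genes; the function name is hypothetical.

```python
def bonferroni_alpha(family_alpha: float, n_genes: int, n_traits: int) -> float:
    """Per-test threshold after correcting for every gene-by-trait test."""
    return family_alpha / (n_genes * n_traits)


alpha = bonferroni_alpha(0.05, 25_000, 7)
print(f"{alpha:.3e}")  # 2.857e-07
```

With exactly 25,000 genes the result is about 2.86 × 10^-7; the quoted 2.87 × 10^-7 would correspond to a gene count slightly below 25,000, which is presumably the exact number of transcripts tested.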

4.5.1.2 Power

As an approximation, PAGE investigators recently performed power simulations for SKAT testing in 10,000 individuals under the assumption of 1% variance explained per region. In SKAT and related tests, power in a region depends on the proportion of phenotypic variance explained by the linear effects of the full set of variants included in the test (the narrow-sense heritability, h^2), the number of variants tested (varying from a handful to many thousands), and the sample size (scaled by the imputation info). Finally, there is a Bonferroni correction for the total number of regions tested. Figure 7 shows approximate power curves for selected values of sample size, heritability, and number of variants tested per region, at alpha = 0.05/number of regions. With a sample size of 10,000 and 1,000 variants tested per region, there is 80% power to detect heritability greater than 3% in any one region; with a sample size of 100,000 individuals, the detectable heritability would be approximately 10 times lower, or 0.3%. These simulations provide a reasonable framework for the power that can be expected from the proposed work.
