To evaluate the extensions of the LLARRMA framework that incorporate dominance modeling, we test the models in a variety of settings. We present two simulation studies. The first focuses on method performance when only a single effect type is present, e.g. only additive or only minor allele dominant. The second considers models that contain a mixture of effect types, in two settings: one in which effects are most likely to be additive, and one in which additive effects are less prevalent.
3.3.1 Simulating Genotypes
We simulate data sets using HAPGEN2 (Su, Marchini and Donnelly, 2011), which was developed to generate SNP data for complex diseases with an LD structure that mimics a provided real data set. Using HAPGEN2, we can easily generate a new data set for each simulation, which lets us test on a wider variety of data sets than if we generated phenotypes from a single fixed real data set.
The simulated data set we consider is a HAPGEN2 version of the hit region from the ’58 data (WTCCC, 2007) used in Valdar et al. (2012). Because not all of the SNPs selected from the ’58 data are available in HapMap (Tanaka, 2009), the subset of 386 SNPs of the hit region present in HapMap is used to generate our HAPGEN2 data set. Each data set consists of 2500 subjects. The LD of this hit region is displayed in Fig 3.1.
Figure 3.1: LD structure of the HAPGEN2 data sets used in the simulations. Shading indicates pairwise LD between SNPs, ranging from white ($r^2 = 0$) to black ($r^2 = 1$).
3.3.2 Simulation study 1: preliminary model comparisons
In extending the RMA model to detect dominance, we want to compare how the extended model fares in several different settings. Each sub-simulation evaluates how the methods compare when applied to a model with only a single type of effect.
Placement of true loci
The location of the true loci for the simulation sub-studies will be chosen at random, with a restriction on the minor allele frequencies (MAF) of the selected loci. The MAF of true signal SNPs has been restricted to be at least 0.1 to ensure sufficient signal is present to detect dominant effects.
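The locus selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 0/1/2 allele-count coding, and the toy genotype matrix are all assumptions made for the example.

```python
import random

def minor_allele_frequency(genotypes):
    """MAF of one SNP from genotypes coded as allele counts 0, 1, or 2."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def choose_true_loci(genotype_columns, n_loci=5, min_maf=0.1, rng=random):
    """Sample n_loci SNP indices uniformly among SNPs with MAF >= min_maf."""
    eligible = [j for j, col in enumerate(genotype_columns)
                if minor_allele_frequency(col) >= min_maf]
    if len(eligible) < n_loci:
        raise ValueError("not enough SNPs pass the MAF filter")
    return sorted(rng.sample(eligible, n_loci))

# Toy example (hypothetical data): 4 SNPs genotyped on 10 subjects.
toy = [
    [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],  # MAF 0.45
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # MAF 0.05, filtered out
    [2, 2, 1, 2, 2, 2, 2, 2, 2, 2],  # MAF 0.05, filtered out
    [1, 1, 0, 0, 1, 0, 2, 0, 1, 0],  # MAF 0.30
]
print(choose_true_loci(toy, n_loci=2))  # [0, 3]
```

Only the first and last toy SNPs pass the filter, so both are always selected here; with the 386-SNP region the draw would be genuinely random among the eligible SNPs.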
Simulating phenotypes
Phenotypes are simulated based on the regression model given by Eq 3.1. Given a set of true SNPs with genotypes $X_q$, corresponding model predictors $A_q$ and $D_q$, and corresponding effects $\beta_{a,q}$ and $\beta_{d,q}$, we first calculate each individual's expected phenotype $y_i = A_q^T \beta_{a,q} + D_q^T \beta_{d,q}$, and then add a Gaussian error $e_i \sim N(0, \sigma)$ to obtain the individual's observed phenotype. Here $\sigma$ is chosen to obtain the desired signal-to-noise ratio (SNR) of 1/4, where

$$\mathrm{SNR} = \frac{\sqrt{[\beta_a, \beta_d]^T \, \mathrm{var}([A, D]) \, [\beta_a, \beta_d]}}{\sigma}.$$

This corresponds to the region explaining 5.8% of the phenotype's variability, which is comparable to the observed variability explained within hit regions in Warren et al. (2012) and Dastani et al. (2012).
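As a rough illustration of the phenotype simulation just described, the sketch below computes the expected phenotypes, scales the noise to hit a target SNR, and adds Gaussian errors. Several things here are assumptions rather than details from the text: the additive predictor is coded as the minor allele count, the dominance predictor as a heterozygote indicator, $\sigma$ is treated as the error standard deviation, and the toy genotypes are drawn independently rather than from HAPGEN2.

```python
import random
import statistics

def simulate_phenotypes(A, D, beta_a, beta_d, snr=0.25, rng=random):
    """A, D: per-subject lists of additive / dominance predictors at the true loci."""
    expected = [
        sum(a * ba for a, ba in zip(a_i, beta_a))
        + sum(d * bd for d, bd in zip(d_i, beta_d))
        for a_i, d_i in zip(A, D)
    ]
    # The numerator of the SNR is the standard deviation of the expected
    # phenotypes (population form, dividing by n), so sigma = sd / SNR.
    sigma = statistics.pstdev(expected) / snr
    observed = [y + rng.gauss(0.0, sigma) for y in expected]
    return expected, observed, sigma

# Toy data (hypothetical): 2500 subjects, 5 independent loci with MAF 0.3.
rng = random.Random(1)
n, q, maf = 2500, 5, 0.3
geno = [[sum(rng.random() < maf for _ in range(2)) for _ in range(q)]
        for _ in range(n)]
A = geno                                                 # assumed additive coding
D = [[1 if g == 1 else 0 for g in row] for row in geno]  # assumed dominance coding
beta_a = [1.35, -1.35, 1.35, 1.35, -1.35]
beta_d = [0.0] * q                                       # additive-only example

expected, observed, sigma = simulate_phenotypes(A, D, beta_a, beta_d, rng=rng)
signal = statistics.pvariance(expected)
print(round(signal / (signal + sigma ** 2), 4))  # 0.0588
```

With SNR = 1/4 the population-level proportion of variance explained is SNR^2 / (1 + SNR^2) = 1/17, about 0.059, close to the 5.8% quoted above.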
Simulation substudies: generation of model effects
The simulations are broken into 5 sub-simulations to investigate how each model performs in each specific setting. For each sub-simulation we consider 5 true loci with effects $\beta^*_q$ generated as $N(1.35(-1)^{\nu_j}, 0.02^2)$ with $\nu_j \sim \mathrm{Bernoulli}(0.5)$. Each sub-simulation emphasizes a different combination of $\beta_a$ and $\beta_d$ as a function of $\beta^*_q$ to consider additive only, heterosis, and general dominant effects. The sub-simulation models are summarized in Table 3.2.
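Table 3.2: Summary of the sub-simulation models, where $\beta^*_q \sim N(1.35(-1)^{\nu_j}, 0.02^2)$ with $\nu_j \sim \mathrm{Bernoulli}(0.5)$, $\alpha$ is chosen randomly from $\{0.5, 0.75, 1, 1.25\}$, and $\upsilon_j \sim \mathrm{Bernoulli}(0.5)$.

Substudy  Model                   Additive predictor          Dominant predictor
1A        Additive                $\beta_{a,q} = \beta^*_q$   $\beta_{d,q} = 0$
1B        Minor Allele Dominant   $\beta_{a,q} = \beta^*_q$   $\beta_{d,q} = \beta_{a,q}$
1C        Major Allele Dominant   $\beta_{a,q} = \beta^*_q$   $\beta_{d,q} = -\beta_{a,q}$
1D        Heterosis               $\beta_{a,q} = 0$           $\beta_{d,q} = \beta^*_q$
1E        General Dominant        $\beta_{a,q} = \beta^*_q$   $\beta_{d,q} = \alpha(-1)^{\upsilon_j}\beta_{a,q}$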
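The effect-generation rules in Table 3.2 can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the second parameter of the normal distribution is read as a standard deviation of 0.02, and $\alpha$ and the sign flip in 1E are assumed to be redrawn for every locus (the text does not say whether they are drawn per locus or per data set).

```python
import random

ALPHAS = (0.5, 0.75, 1.0, 1.25)

def random_sign(rng):
    """Return (-1)^b with b ~ Bernoulli(0.5)."""
    return -1.0 if rng.random() < 0.5 else 1.0

def draw_effects(substudy, n_loci=5, rng=random):
    """Return (beta_a, beta_d) lists for one of the sub-studies 1A to 1E."""
    beta_a, beta_d = [], []
    for _ in range(n_loci):
        # Base effect: mean +/-1.35, standard deviation 0.02 (assumed reading).
        base = rng.gauss(1.35 * random_sign(rng), 0.02)
        if substudy == "1A":      # additive only
            a, d = base, 0.0
        elif substudy == "1B":    # minor allele dominant
            a, d = base, base
        elif substudy == "1C":    # major allele dominant
            a, d = base, -base
        elif substudy == "1D":    # heterosis
            a, d = 0.0, base
        elif substudy == "1E":    # general dominant
            # Assumption: alpha and the sign flip are redrawn for every locus.
            a, d = base, rng.choice(ALPHAS) * random_sign(rng) * base
        else:
            raise ValueError(f"unknown sub-study: {substudy}")
        beta_a.append(a)
        beta_d.append(d)
    return beta_a, beta_d

rng = random.Random(2012)
for name in ("1A", "1B", "1C", "1D", "1E"):
    a, d = draw_effects(name, rng=rng)
    print(name, [round(x, 2) for x in a], [round(x, 2) for x in d])
```

Note that 1E with $\alpha = 1$ reproduces 1B or 1C depending on the sign flip, which is visible in the code as the 1E branch reducing to `base` or `-base`.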
3.3.3 Simulation study 2: general predictors
For our second simulation study, we consider a general setting that is a mixture of the effects tested in simulation study 1. Specifically, we consider a combination of simulations 1A, 1D, and 1E, since the settings of 1B and 1C are special cases of 1E.
Let $(m_a, m_d, m_h)$ be the numbers of true additive, dominant, and heterosis effects in the model, respectively. We propose to model $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a, p_d, p_h)$, where $p_a$, $p_d$, and $p_h$ are the probabilities of a true locus having an additive, dominant, or heterosis effect, respectively. Under this model, we can characterize the simulation 1 sub-studies by explicitly setting two of the three probabilities to zero. Thus, we have a natural extension of the simple sub-studies to a more general simulation. We propose two simulation settings. The first models $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a = 0.6, p_d = 0.3, p_h = 0.1)$, which deviates slightly from the standard complex trait analysis assumption by allowing some effects to differ from additivity. In the second setting, we model $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a = 0.3, p_d = 0.6, p_h = 0.1)$, which emphasizes a more extreme view in which non-additive effects are most prevalent.
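The multinomial draw of effect types can be sketched as follows. The function name and the labels are hypothetical; the sketch simply assigns each of the 5 true loci an effect type independently, which yields multinomial counts.

```python
import random
from collections import Counter

EFFECT_TYPES = ("additive", "dominant", "heterosis")

def draw_effect_counts(p_a, p_d, p_h, n_loci=5, rng=random):
    """Draw (m_a, m_d, m_h) ~ Multinomial(n_loci, p_a, p_d, p_h)."""
    # One independent categorical draw per true locus.
    types = rng.choices(EFFECT_TYPES, weights=(p_a, p_d, p_h), k=n_loci)
    counts = Counter(types)
    return tuple(counts[t] for t in EFFECT_TYPES)

rng = random.Random(7)
print(draw_effect_counts(0.6, 0.3, 0.1, rng=rng))  # setting 1: mostly additive
print(draw_effect_counts(0.3, 0.6, 0.1, rng=rng))  # setting 2: mostly dominant
print(draw_effect_counts(1.0, 0.0, 0.0, rng=rng))  # (5, 0, 0): additive-only case
```

Setting two of the three probabilities to zero, as in the last call, forces all 5 loci into a single effect type and so recovers a single-effect sub-study.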
3.3.4 Computation
All analyses were performed in R (R Development Core Team, 2010), with the glmnet package (Friedman, Hastie and Tibshirani, 2010) used for fitting LASSO models and