Simulation framework

To evaluate the extensions of the LLARRMA framework that incorporate dominance modeling, we test the models in a variety of settings. We present two simulation studies. The first focuses on method performance when only a single effect type is present, i.e., only additive, only minor allele dominant, etc. The second considers models that contain a mixture of effect types, in two settings: one in which effects are most likely to be additive, and one in which additive effects are less prevalent.

3.3.1 Simulating Genotypes

We have chosen to simulate data sets using HAPGEN2 (Su, Marchini and Donnelly, 2011) which was developed for generation of SNP data for complex diseases that have an LD structure mimicking a provided real data set. With the use of HAPGEN2, we are able to easily generate a new data set for each simulation, allowing us to test on a wider variety of data sets than if we generated phenotypes based on a single fixed real data set.

The simulated data set we consider is a HAPGEN2 version of the hit region from the ’58 data (WTCCC, 2007) used in Valdar et al. (2012). Because not all of the SNPs selected from the ’58 data are available from HapMap (Tanaka, 2009), the subset of 386 SNPs of the hit region present in HapMap is used to generate our HAPGEN2 data set. Each data set consists of 2500 subjects. The LD of this hit region is displayed in Fig 3.1.

[Figure 3.1 image omitted: LD (r2) heatmap of the HAPGEN2 data; axis ticks run from 0 to 376 SNPs, color scale from 0.0 to 1.0.]

Figure 3.1: LD structure of the HAPGEN2 data sets used in the simulations. Shading indicates pairwise LD between SNPs, ranging from white (r2 = 0) to black (r2 = 1).

3.3.2 Simulation study 1: preliminary model comparisons

In extending the RMA model to detect dominance, we want to compare how the extended model fares in several different settings. Each sub-simulation evaluates how the methods compare when applied to a model with only a single type of effect.

Placement of true loci

The location of the true loci for the simulation sub-studies will be chosen at random, with a restriction on the minor allele frequencies (MAF) of the selected loci. The MAF of true signal SNPs has been restricted to be at least 0.1 to ensure sufficient signal is present to detect dominant effects.
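As an illustration of the placement rule above, the following minimal Python sketch draws true loci among SNPs that pass the MAF filter. It assumes that "at random" means uniform sampling without replacement and that 5 loci are drawn, as in the sub-studies described later. The function name `choose_true_loci` and the toy MAF values are hypothetical and are not taken from the HAPGEN2 data.

```python
import random


def choose_true_loci(mafs, n_loci=5, min_maf=0.1, rng=None):
    """Pick n_loci SNP indices uniformly at random among SNPs with MAF >= min_maf.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    rng = rng or random.Random()
    eligible = [i for i, maf in enumerate(mafs) if maf >= min_maf]
    if len(eligible) < n_loci:
        raise ValueError("not enough SNPs pass the MAF filter")
    return sorted(rng.sample(eligible, n_loci))


# Toy MAFs for 386 SNPs (made-up values, for illustration only).
rng = random.Random(2012)
toy_mafs = [rng.uniform(0.01, 0.5) for _ in range(386)]
true_loci = choose_true_loci(toy_mafs, rng=rng)
```

Filtering before sampling, rather than rejecting draws afterwards, keeps every eligible SNP equally likely to be selected.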

Simulating phenotypes

Phenotypes are simulated based on the regression model given by Eq 3.1. Given a set of true SNPs with genotypes $X_q$, corresponding model predictors $A_q$ and $D_q$, and corresponding effects $\beta_{a,q}$ and $\beta_{d,q}$, we first calculate each individual's expected phenotype $y_i = A_q^T \beta_{a,q} + D_q^T \beta_{d,q}$, and then add a Gaussian error $e_i \sim N(0, \sigma)$ to obtain the individual’s observed phenotype. Here $\sigma$ is chosen to obtain the desired signal-to-noise ratio (SNR) of 1/4, where

$$\mathrm{SNR} = \frac{\sqrt{[\beta_a, \beta_d]^T \, \mathrm{var}([A, D]) \, [\beta_a, \beta_d]}}{\sigma}.$$

This corresponds to the region explaining 5.8% of the phenotype's variability, which is comparable to the observed variability explained within hit regions in Warren et al. (2012) and Dastani et al. (2012).
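The phenotype step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses. It assumes the additive predictor is the minor-allele count (0, 1, 2) and the dominance predictor is a heterozygote indicator, since the actual coding is set by Eq 3.1. It also treats $\sigma$ as a standard deviation and replaces the population variance of the signal with its sample variance. The toy genotypes are drawn independently under Hardy-Weinberg proportions, with no LD, purely to make the example runnable. All function names are hypothetical.

```python
import random
import statistics


def code_predictors(minor_allele_count):
    """Assumed coding: additive = minor-allele count, dominance = heterozygote indicator."""
    return minor_allele_count, 1 if minor_allele_count == 1 else 0


def simulate_phenotypes(genotypes, beta_a, beta_d, snr=0.25, rng=None):
    """Return (expected, observed, sigma) for rows of minor-allele counts at the true loci."""
    rng = rng or random.Random()
    expected = []
    for person in genotypes:
        total = 0.0
        for g, b_a, b_d in zip(person, beta_a, beta_d):
            a, d = code_predictors(g)
            total += a * b_a + d * b_d
        expected.append(total)
    # The sample variance of the linear predictor equals beta' S beta,
    # with S the sample covariance of the predictors.
    signal_sd = statistics.pstdev(expected)
    sigma = signal_sd / snr  # noise standard deviation giving the target SNR
    observed = [mu + rng.gauss(0.0, sigma) for mu in expected]
    return expected, observed, sigma


def variance_explained(snr):
    """Share of phenotypic variance due to the signal when SNR is a ratio of standard deviations."""
    return snr ** 2 / (1.0 + snr ** 2)


# Toy data: 2500 subjects, 5 independent loci (no LD; illustration only).
rng = random.Random(1)
toy_mafs = [0.1, 0.2, 0.3, 0.4, 0.5]
genotypes = [[(rng.random() < p) + (rng.random() < p) for p in toy_mafs]
             for _ in range(2500)]
beta_a = [1.35, -1.35, 1.35, -1.35, 1.35]
beta_d = [0.0, 0.0, 0.0, 0.0, 0.0]  # additive-only toy effects
expected, observed, sigma = simulate_phenotypes(genotypes, beta_a, beta_d, rng=rng)
```

Under this reading, `variance_explained(0.25)` equals 1/17, or roughly 5.9%, which is in line with the figure quoted above.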

Simulation substudies: generation of model effects

The simulations are broken into 5 sub-simulations in order to investigate how each model performs in each specific setting. For each sub-simulation we consider 5 true loci with effects $\beta^*_q$ generated as $N(1.35(-1)^{\nu_j}, 0.02^2)$ with $\nu_j \sim \mathrm{Bernoulli}(0.5)$.

Each sub-simulation emphasizes a different combination of $\beta_a$ and $\beta_d$ as a function of $\beta^*_q$ to consider additive only, heterosis, and general dominant effects. Table 3.2 summarizes these models.

Table 3.2: Summary of the sub-simulation models, where $\beta^*_q \sim N(1.35(-1)^{\nu_j}, 0.02^2)$ with $\nu_j \sim \mathrm{Bernoulli}(0.5)$, $\alpha$ is chosen randomly from $\{0.5, 0.75, 1, 1.25\}$, and $\upsilon_j \sim \mathrm{Bernoulli}(0.5)$.

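Substudy   Model                   Additive predictor            Dominant predictor
1A         Additive                $\beta_{a,q} = \beta^*_q$     $\beta_{d,q} = 0$
1B         Minor Allele Dominant   $\beta_{a,q} = \beta^*_q$     $\beta_{d,q} = \beta_{a,q}$
1C         Major Allele Dominant   $\beta_{a,q} = \beta^*_q$     $\beta_{d,q} = -\beta_{a,q}$
1D         Heterosis               $\beta_{a,q} = 0$             $\beta_{d,q} = \beta^*_q$
1E         General Dominant        $\beta_{a,q} = \beta^*_q$     $\beta_{d,q} = \alpha(-1)^{\upsilon_j}\beta_{a,q}$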
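The rows of Table 3.2 can be sketched as a small effect generator. This is a hedged Python illustration, not the original simulation code. It reads the variance of the base effect as $0.02^2$, so the standard deviation is 0.02, and it assumes that $\alpha$ and the sign in sub-study 1E are drawn independently for each locus, which the text does not state explicitly. The names `draw_base_effect` and `draw_effects` are hypothetical.

```python
import random

ALPHAS = (0.5, 0.75, 1.0, 1.25)


def draw_base_effect(rng):
    """Draw the base effect from N(1.35 * (-1)^nu, 0.02^2) with nu ~ Bernoulli(0.5)."""
    sign = -1.0 if rng.random() < 0.5 else 1.0
    return rng.gauss(1.35 * sign, 0.02)  # second argument is the standard deviation


def draw_effects(substudy, rng):
    """Return (beta_a, beta_d) for one true locus under the given sub-study."""
    base = draw_base_effect(rng)
    if substudy == "1A":  # additive only
        return base, 0.0
    if substudy == "1B":  # minor allele dominant
        return base, base
    if substudy == "1C":  # major allele dominant
        return base, -base
    if substudy == "1D":  # heterosis
        return 0.0, base
    if substudy == "1E":  # general dominant
        alpha = rng.choice(ALPHAS)  # assumed to be drawn per locus
        sign = -1.0 if rng.random() < 0.5 else 1.0
        return base, alpha * sign * base
    raise ValueError("unknown substudy: " + repr(substudy))


rng = random.Random(3)
effects_1e = [draw_effects("1E", rng) for _ in range(5)]  # 5 true loci
```

With $\alpha = 1$, the 1E rule returns the same pairs as 1B or 1C, depending on the sign drawn.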

3.3.3 Simulation study 2: general predictors

For our second simulation study, we consider a general setting that is a mixture of the effects tested in simulation study 1. Specifically, we consider a combination of simulations 1A, 1D, and 1E, since the settings of 1B and 1C are special cases of 1E ($\alpha = 1$ with either sign).

Let $(m_a, m_d, m_h)$ be the number of true additive, dominant, and heterosis effects in the model, respectively. We propose to model $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a, p_d, p_h)$, where $p_a$, $p_d$, and $p_h$ are the probabilities of a true locus having an additive, dominant, or heterosis effect, respectively. Under this model, we can characterize the simulation 1 sub-studies by explicitly setting two of the three probabilities to zero. Thus, we have a natural extension of the simple sub-studies to a more general simulation. We propose two simulation settings. The first models $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a = 0.6, p_d = 0.3, p_h = 0.1)$, which deviates slightly from the standard complex trait analysis assumption by allowing some effects to differ from additivity. In the second setting, we model $(m_a, m_d, m_h) \sim \mathrm{Multinomial}(5, p_a = 0.3, p_d = 0.6, p_h = 0.1)$, which emphasizes a more extreme view in which non-additive effects are most prevalent.
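The multinomial draw of effect-type counts can be sketched as below. This minimal Python illustration uses only the standard library: 5 independent categorical draws, counted by type, are equivalent to one Multinomial(5, p_a, p_d, p_h) draw. The function name `draw_effect_counts` and the seed are hypothetical.

```python
import random
from collections import Counter

EFFECT_TYPES = ("additive", "dominant", "heterosis")


def draw_effect_counts(probs, n_loci=5, rng=None):
    """Draw (m_a, m_d, m_h) ~ Multinomial(n_loci, probs) via n_loci categorical draws."""
    rng = rng or random.Random()
    draws = rng.choices(EFFECT_TYPES, weights=probs, k=n_loci)
    counts = Counter(draws)
    return tuple(counts[t] for t in EFFECT_TYPES)


rng = random.Random(7)
setting_1 = draw_effect_counts((0.6, 0.3, 0.1), rng=rng)   # mostly additive
setting_2 = draw_effect_counts((0.3, 0.6, 0.1), rng=rng)   # mostly dominant
degenerate = draw_effect_counts((1.0, 0.0, 0.0), rng=rng)  # additive only
```

Setting two of the three probabilities to zero, as in the `degenerate` example, puts all 5 loci in a single effect type, which is how the single-effect sub-studies arise as special cases.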

3.3.4 Computation

All analyses were performed in R (R Development Core Team, 2010), with the glmnet package (Friedman, Hastie and Tibshirani, 2010) used for fitting LASSO models and
