CHAPTER 3: A PERMUTATION-BASED GENETIC AS-
3.1 Introduction
In a cohort study, participants are followed over a period of time to evaluate the associ- ation between an outcome of interest (such as a disease) and exposures (risk factors). This study design allows one to measure exposures before the disease occurs and estimate the incidence rate, which is an attractive design for studies in common diseases. However, co- hort studies take a long time to complete and are relative expensive, which is problematic in genetic studies. Moreover, the sample size would be prohibitively large for rare diseases in order to have sufficient power. Compared with cohort studies, case-control studies are much more cost-effective. They are attractive alternatives where researchers select partici- pants with disease (cases) and without the disease (controls). Although case-control stud- ies have a greater potential for confounding and do not allow one to estimate the disease’s incidence rate, they are generally much cheaper and easier to perform than cohort studies. In particular, genome-wide association studies (GWAS) typically use case-control design.
GWA studies are expensive in both time and cost, and researchers want to collect as much information as possible to maxmize the return. In addition to identifying single- nucleotide polymorphisms (SNP) genotype’s associated with the disease of interest, sec- ondary phenotypes, which are associated with primary outcome are also collected and studied. For example, in a study of TMD, one may be interested in SNP’s association with pain sensitivity in addition to SNP’s association with TMD case-control status. TMD pa- tients tend to be more sensitive to pain than TMD-free controls, so the secondary outcome
pain density might be correlated with TMD case status. Similary, the association between body mass index (BMI) and SNPs were studied to understand the etiology of type II dia- betes (Frayling et al. 2007). In the GWA study of lung cancer, the correlation of secondary outcome smoking and genetic variants were investigated to reveal useful information in lung cancer (Villeneuve and Mao 1993) (Hung et al. 2008). Moreover, secondary pheno- type provide more information about disease and are useful in gene mapping as well (He et al. 2011).
The secondary phenotype studies should be analyzed carefully to avoid problematic or even misleading results. The common approaches to analyze the association of secondary outcome and SNPs in case-control studies are standard regression methods by using cases only, controls only, samples with both cases and controls, or joint analysis of cases and controls adjusting for disease status. However, the above methods led to spurious associ- ation of secondary outcome and genetic variants because the samples from both cases and controls are not representative of random samples from population. Hence none of them are reliable (Monsees et al. 2009, Lin and Zeng 2009).
Data from case-control studies are not a representative sample of the population since cases are overrepresented. it has been shown that logistic regression produce unbiased co- efficient estimates for estimating the risk of case status in a case-control study. (Prentice and Pyke 1979) However, there is no guarantee that coefficient estimates are unbiased if the outcome is an secondary phenotype rather than case status. Indeed, logistic regres- sion without adjustment for the study design will produce biased estimates (Monsees et al. 2009).
A number of statistical methods have been developed for the remedies. Richardson et al. (2007) recommended inverse probability weighting (IPW), whereby cases and con- trols are weighted such that the total weights of cases are comparable to the proportion
of cases in the general population. However, standard logistic regression models will pro- duce incorrect estimates of the standard error after applying such weights. Monsees et al. (2009) demonstrated that one may calculate the standard error correctly using a sand- wich estimator of the variance. The estimates by IPW method can be unstable when the secondary phenotype is strongly associated with case status. For example, in genetic stud- ies of TMD, one may wish to find genetic factors associated with the severity of orofacial pain. Essentially all TMD cases will report some level of orofacial pain, but the majority of controls won’t. When standard IPW regression is applied in this situation, tests of no association between a given SNP and pain severity produce widely inflated type I errors. In this situation, a more robust method is needed , which will help to identify genetic risk factors for TMD, or be used as a tool for future similar situations in other GWA studies. Wang and Shete (2011) proposed a bootstrap-based method for the estimation of the odds ratio for a binary secondary outcome. However, it cannot be applied to a continuous out- come. Lin and Zeng (2009) proposed a method based on maximum likelihood that can be applied to both continuous and categorical intermediate phenotypes. This method is only unbiased when the genotypes are not related to the primary outcome (Li et al. 2010).
In this chapter, we propose a permutation-based IPW method to analyze the associa- tion between a genetic marker and secondary phenotypes in a case-control study where the secondary outcome is correlated with case status. The proposed method has the advantage of being completely non-parametric, so it will produce valid p-values even when the as- sumptions of parametric models are violated (which is a common occurence when examin- ing secondary phenotypes in the OPPERA study). We showed that the proposed method has comparable power to existing parametric methods and it avoided the risk of elevated type I error when the model assumptions are violated. We further applied the method to the OPPERA data, and found two SNPs highly associated with clinical pain density.