CHAPTER 2 METHODS

2.3 Variable Measurement

2.4.5 Population Stratification

Population stratification has been defined as differences in allele frequencies between cases and controls due to systematic differences in ancestry rather than associations of genes with disease. In genetic epidemiology studies, population stratification may produce biased results when the outcome of interest also varies by genetic ancestry. It can produce both false positive and false negative results, and the strength of the bias depends on the magnitude of allelic variation among ancestry groups and on differences in disease among those same groups.211 In non-genetic epidemiologic studies, race is often treated as a confounder for similar reasons: self-reported race is considered a marker for a wide variety of social, cultural, dietary, economic, stress, and educational experiences. In genetic epidemiologic studies, however, race is also seen as a marker of genetic ancestry, which has practical consequences for LD structure, haplotypes, and allele frequencies. In this setting, self-reported race is often an inadequate surrogate for genetic ancestry. Not only are individuals poor reporters of their genetic ancestry212 at the level necessary for genetic studies, but self-reported race may also fail to capture the genetic heterogeneity within self-reported racial groups.213, 214

While population stratification can be quite complex in large metropolitan populations with a wide variety of genetic ancestries, the concern remains in largely biracial populations such as the one in this study. In the United States, among those who self-report as African American, the percentage of European genetic ancestry is quite variable. As measured in metropolitan centers across the US, the estimated European ancestral proportion ranged from 11.6% in Charleston, SC to 22.5% in New Orleans, LA,215 with geographically isolated areas showing even more extreme values.216

A number of methods have been proposed for addressing population stratification in genetic epidemiology studies. Genomic control was an early method that used random markers to estimate an inflation factor, which was then used to adjust all test statistics. Genomic control protects only against false positives and tends to overcorrect when non-random markers are used.211
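
The genomic control adjustment can be illustrated with a minimal sketch. It assumes 1-degree-of-freedom chi-square association statistics and uses the median-based estimate of the inflation factor; the function names are hypothetical, and flooring the factor at 1 is a convention assumed here rather than something specified in this chapter.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_marker_chi2):
    """Estimate the inflation factor (lambda) from chi-square statistics
    computed at random markers presumed unrelated to the outcome."""
    lam = median(null_marker_chi2) / CHI2_1DF_MEDIAN
    # Assumed convention: never adjust statistics upward, so floor at 1.
    return max(lam, 1.0)


def genomic_control(test_chi2, lam):
    """Divide every test statistic by the same inflation factor."""
    return [x / lam for x in test_chi2]
```

Because every statistic is divided by one common factor, the adjustment can only shrink test statistics, which is consistent with genomic control protecting against false positives but not false negatives.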

More recent work has focused on choosing ancestry informative markers (AIMs), which are independent markers throughout the genome that have large allele frequency differences between the ancestral populations of interest. Although the allele frequency difference between ancestral populations (δ) is important in selecting AIMs, factors such as the allele frequency in the ancestral population irrespective of δ (p) and the respective genetic contribution of each ancestral population to the admixed population (m) also influence the precision of ancestral estimates.217 Pfaff’s Fisher Information Criterion217 combines δ, p, and m into a single information estimate, which allows markers to be selected to optimize precision based on the actual, or hypothesized, ancestral populations and admixture proportions.
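
How δ, p, and m can jointly determine marker informativeness may be seen in a small sketch. It uses the standard Fisher information for a single allele drawn from a two-way admixed population, δ² / (q(1 − q)) with q = m·p_a + (1 − m)·p_b; this is a textbook binomial result offered as an assumption, and it may differ in detail from the criterion in reference 217. The names `p_a`, `p_b`, and the marker labels are hypothetical.

```python
def marker_information(p_a, p_b, m):
    """Per-allele Fisher information about the admixture proportion m.

    p_a, p_b: allele frequencies in ancestral populations A and B.
    m: proportion of ancestry contributed by population A.
    """
    delta = p_a - p_b
    q = m * p_a + (1.0 - m) * p_b  # expected frequency in the admixed group
    if not 0.0 < q < 1.0:
        raise ValueError("admixed allele frequency must lie strictly in (0, 1)")
    return delta ** 2 / (q * (1.0 - q))


def rank_markers(markers, m):
    """Sort (name, p_a, p_b) tuples from most to least informative at m."""
    return sorted(
        markers,
        key=lambda marker: marker_information(marker[1], marker[2], m),
        reverse=True,
    )
```

Under these assumptions, two markers with the same δ of 0.4, say frequencies (0.5, 0.1) and (0.9, 0.5), swap rank when m changes from 0.2 to 0.8, which is why δ alone is an incomplete selection criterion.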

Estimating genetic ancestry using AIMs can be accomplished using a number of different methods, and consensus on the optimal method has not yet emerged. Commonly used methods include maximum likelihood estimation (MLE), structured association, and principal component analysis.218 When markers are informative and ancestral group information is extensive and accurate, both MLE and structured association methods perform well.219 MLE methods have been shown to be superior when marker information is low and there is little information on allele frequencies in the ancestral population. MLE methods are generally faster and less computationally intensive; however, when the assumption of independence among markers is violated, the confidence intervals may be too narrow.219 Structured association methods use Bayesian and MCMC methods to assign individuals to clusters or sub-populations. They depend on the number of ancestral populations specified, which is at the discretion of the investigator and can be difficult both to determine and to interpret in cosmopolitan populations.211
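
A minimal sketch of maximum likelihood estimation of individual ancestry with two ancestral populations is given below. It is not the algorithm implemented in FRAPPE or any other package: it assumes known ancestral allele frequencies, independent markers, and genotypes coded as counts (0, 1, or 2) of a reference allele, and all function and argument names are hypothetical.

```python
import math


def log_likelihood(m, genotypes, freq_a, freq_b, eps=1e-9):
    """Log-likelihood of ancestry proportion m for one individual.

    Summing over markers is where independence among markers is assumed.
    """
    total = 0.0
    for g, pa, pb in zip(genotypes, freq_a, freq_b):
        q = m * pa + (1.0 - m) * pb        # expected reference-allele frequency
        q = min(max(q, eps), 1.0 - eps)    # keep log() finite at the boundaries
        total += g * math.log(q) + (2 - g) * math.log(1.0 - q)
    return total


def estimate_ancestry(genotypes, freq_a, freq_b, tol=1e-6):
    """Maximize the log-likelihood over m in [0, 1] by ternary search.

    The search is valid because the log-likelihood is concave in m
    (a sum of logarithms of functions that are linear in m).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, genotypes, freq_a, freq_b) < log_likelihood(
            m2, genotypes, freq_a, freq_b
        ):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0
```

Because the contributions of all markers are simply added, correlated markers would be counted as if each carried independent information, which is one way to see why confidence intervals can become too narrow when independence is violated.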

Principal component analysis (PCA) is also used to correct for population stratification. PCA methods infer continuous axes of genetic variation by using eigenvectors of the covariance matrix of SNPs between samples.220 Unlike structured association methods, which depend on the correct choice of the number of clusters, PCA techniques are invariant to the number of axes chosen. Although many PCA techniques employ all the SNPs genotyped in GWAS panels, work has shown that well-chosen AIMs panels of 50-200 SNPs are equally good at controlling bias and optimizing power.221
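
The eigenvector computation behind PCA-based ancestry axes can be sketched as follows. This is an illustration using NumPy rather than the Eigenstrat implementation; it centers each SNP but omits the per-SNP scaling that published implementations commonly apply, and the function name and input layout are assumptions.

```python
import numpy as np


def principal_axes(genotypes, n_axes=2):
    """Return the leading eigenvalues and eigenvectors of the
    sample-by-sample covariance matrix.

    genotypes: array-like of shape (n_samples, n_snps) holding allele
    counts coded 0, 1, or 2.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)                   # center each SNP across samples
    cov = g @ g.T / g.shape[1]               # covariance between samples
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]
    return eigvals[order], eigvecs[:, order]
```

Each column of the returned eigenvector matrix gives one coordinate per sample along a continuous axis of variation, and these coordinates could then be entered as covariates in an association model.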

Software exists for MLE (FRAPPE), structured association (STRUCTURE), and PCA (Eigenstrat) methods. STRUCTURE will be used initially to calculate ancestry, given the informativeness of the AIMs used, the relatively small size of the study population, the presence of only two ancestral populations, the ease of use of and familiarity with the software, and previous work in similar populations in North Carolina,189 which found high correlation between MLE and STRUCTURE estimates.

Additional issues with ancestry

While adjustment for genetic ancestry will address confounding by population stratification, there is also the possibility of heterogeneity between the two genetic ancestry groups included in this study. Previous genetic epidemiologic studies of preterm birth4, 6, 97, 98, 113, 114, 149, 155, 156 found that genetic associations varied by genetic ancestry. Pathway analysis in one cohort composed of US Whites and African Americans, with spontaneous preterm birth as the outcome, suggested that different pathways were operating in the two racial groups.97 Tag selection for this study resulted in some SNPs that are polymorphic in a single ancestry group (Tables S1 and S2 in the Appendix). While these private SNPs may be important for a specific ancestry group, they provide no additional information for the other group.

Although an analysis with all women combined would have more power because of the larger case and control groups, differences in the size of the racial groups, the likelihood that different genes and SNPs are acting in different genetic ancestry groups, and the presence of private SNPs increase the likelihood that associations would be missed in a combined analysis of both genetic ancestry groups. For this reason, all analyses were performed within strata of genetic ancestry and additionally adjusted for continuous percent ancestry.
