
I.4. DETERMINING THE VALIDITY OF AN ASSOCIATION: STATISTICAL INFERENCE

I.4.1.1. Population stratification

“Human populations differ from one another almost entirely in the varying proportions of the allelic genes of the various sets of hereditary factors, and not in the kinds of genes they contain. The extreme positions held by those who on the one hand maintain that there are no significant genetic differences between human races, and those who on the other hand hold that certain races are ‘superior’ and others ‘inferior’, require drastic modification in the light of the accumulated data on the gene frequency dynamics of human populations.”

Laurence Snyder, 1951

The definition of race is usually subjective and based on proxies such as skin colour, language, physical properties and geographical location (Bamshad et al., 2003; Pritchard et al., 2000), with little insight into the genetic differences or similarities between populations (Foster and Sharp, 2002; Witzig, 1996; Goodman, 2000). Indeed, genetically similar groups may be labelled differently due to cultural differences or geographical location. Likewise, genetically dissimilar groups may be classified as a single population.

Comparative studies of within-group versus between-group genetic diversity indicate that approximately 90% of genetic variation occurs within human populations. Consequently, only about 10% of genetic variation is attributed to between-group differences (Lewontin, 1972; Cavalli-Sforza and Piazza, 1975; Jorde et al., 2001; Barbujani et al., 1997), which influences average differences in physical characteristics, disease susceptibility and treatment outcome amongst populations. In all the major population groups, there seems to be some degree of cryptic population substructure, which generally follows ethnic lines (Ziv and Burchard, 2003).

In population stratification, the observed association between a genetic susceptibility variant (G) and disease (D) is biased, due to the fact that G is associated with some true risk factor that varies with ethnicity. Therefore, if the population under investigation comprises cryptic subpopulations in which allele frequencies for the candidate gene and baseline risks of disease differ, it may result in spurious association between a genetic variant and the disease under investigation. This is because any allele that has a higher frequency in the subpopulation possessing a greater disease risk will appear to be associated with the disease. Likewise, it is also possible that population stratification may result in a Type II error – if a disease is more prevalent amongst a subgroup possessing a lower frequency of the disease-causing allele, the association with the disease will be masked (Deng, 2001).
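The confounding mechanism described above can be illustrated with a small worked calculation. The sketch below is not taken from any cited study: the function name and the subpopulation sizes, allele frequencies and disease risks are hypothetical values chosen for illustration. Within each subpopulation the allele is assumed to be independent of disease, so any pooled association is produced purely by stratification.

```python
def pooled_allelic_odds_ratio(subpops):
    """Expected pooled odds ratio for an allele that has NO effect on disease.

    subpops: list of (size, allele_freq, disease_risk) tuples (hypothetical
    values). Within each subpopulation the allele is independent of disease.
    """
    case_a = case_other = ctrl_a = ctrl_other = 0.0
    for size, freq, risk in subpops:
        cases = size * risk
        controls = size * (1 - risk)
        # Independence within the stratum: cases and controls carry the
        # allele at the same frequency.
        case_a += cases * freq
        case_other += cases * (1 - freq)
        ctrl_a += controls * freq
        ctrl_other += controls * (1 - freq)
    return (case_a * ctrl_other) / (case_other * ctrl_a)


# One homogeneous population: no association (odds ratio of 1).
homogeneous = pooled_allelic_odds_ratio([(1000, 0.8, 0.10)])

# Two cryptic subpopulations: the high-risk group also has the higher
# allele frequency, so the pooled sample shows a spurious association.
stratified = pooled_allelic_odds_ratio([(1000, 0.8, 0.10), (1000, 0.2, 0.02)])

print(round(homogeneous, 3), round(stratified, 3))
```

With these illustrative inputs the pooled odds ratio is well above 1, although the allele has no effect in either subpopulation.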

Population stratification normally arises when the genetic background of the source populations differs between cases and controls (Cardon and Palmer, 2003), although it can also occur as a result of “cryptic relatedness” within a population considered to represent a sample of independent cases and controls. The hypothesis is that, if the disorder (in this case OCD) has a genetic aetiology, the affected individuals in the study are likely to be more genetically similar to one another than case-control pairs are, because they share a disorder that has, in essence, a common genetic basis. Thus, under the initial assumption of an independent sample and no genetic association with the disease, the false-positive rate may be increased due to cryptic relatedness within the case subjects (Bacanu et al., 2000; Devlin and Roeder, 1999).

In most epidemiological and disease risk studies, self-reported ancestry can serve as a suitable proxy for genetic clustering, with the obvious exception of recently admixed populations (Thomas and Witte, 2002). However, in case-control genetic association studies that require the identification of loci with very small effects, even the slightest difference in genetic ancestry between cases and controls may result in false positive results. Therefore, in such studies, if one is uncertain about the presence of cryptic subpopulations or the degree of admixture in the population from which the sample is drawn, methods that are capable of detecting, and correcting for, such stratification should be employed. The goal when correcting for population stratification is to determine whether cases and controls differ in ancestry to such an extent that an excess number of markers will, by chance, be associated with disease status. One can account and correct for stratification by using better measures of populations, making use of family members as controls and by means of genomic adjustment (Thomas and Witte, 2002).

I.4.1.1.1. Better measures of populations

As has already been mentioned, broad conventional population categories usually result in confounding due to population stratification. It has therefore been proposed that more specific, detailed information regarding a subject’s ethnic origin be obtained when conducting association studies, so that individuals can be allocated to the finest ethnic origins that can be determined. For example, in mixed ethnic families, it may be more valuable to obtain information regarding the origins of an individual’s parents and grandparents. This allows one to construct a covariate for each stratum in the analysis (rather than allocating the individual to a single stratum), noting the proportion of ancestors derived from each stratum, subsequently adjusting for these covariates using a multiple logistic regression model (Thomas and Witte, 2002).
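As a minimal sketch of how such ancestry covariates might be constructed, the helper below converts the reported origins of a subject's ancestors into one proportion per stratum. The function name and the stratum labels are hypothetical and serve only as illustration. The resulting values would then be entered as covariates in the logistic regression model, which is not shown here.

```python
from collections import Counter


def ancestry_covariates(ancestor_origins, strata):
    """Return the proportion of reported ancestors derived from each stratum.

    ancestor_origins: origins of e.g. the four grandparents (hypothetical labels).
    strata: ordered list of strata for which covariates are required.
    """
    if not ancestor_origins:
        raise ValueError("at least one ancestor origin is required")
    counts = Counter(ancestor_origins)
    total = len(ancestor_origins)
    # Strata with no reported ancestors receive a proportion of 0.
    return [counts[s] / total for s in strata]


strata = ["Dutch", "German", "French"]
grandparents = ["Dutch", "Dutch", "German", "French"]
print(ancestry_covariates(grandparents, strata))  # [0.5, 0.25, 0.25]
```

Because the proportions sum to one across strata, one stratum would normally be omitted as the reference category when the model is fitted.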

I.4.1.1.2. Using family members as controls

Two major family-based association tests presently make use of parents and siblings as family-based controls: the transmission disequilibrium test (TDT) and the haplotype relative risk (HRR). Briefly, the TDT compares the frequency of a marker allele at a given locus in a sample of probands with the frequency of the parental non-transmitted alleles (“controls”). If transmission of the marker allele from heterozygous parents to affected individuals exceeds that expected by chance alone, it is assumed to be associated with the disorder in some way (Spielman et al., 1993).
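The TDT statistic is commonly written as a McNemar-type χ² with one degree of freedom, (b − c)² / (b + c), where b is the number of times heterozygous parents transmit the marker allele to affected offspring and c is the number of times they do not. The sketch below computes this statistic; the function name and the counts are illustrative only.

```python
import math


def tdt_statistic(transmitted, not_transmitted):
    """McNemar-type TDT chi-square (1 df) and its p-value.

    transmitted: times heterozygous parents passed the marker allele
                 to an affected child (b).
    not_transmitted: times they passed the other allele instead (c).
    """
    informative = transmitted + not_transmitted
    if informative == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - not_transmitted) ** 2 / informative
    # Upper-tail probability of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = tdt_statistic(60, 40)  # illustrative counts
print(chi2, round(p, 4))
```

Only heterozygous parents contribute to the statistic, which is why uninformative trios are discarded in practice.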

The haplotype relative risk test looks at an affected individual and both his/her parents (Falk and Rubenstein, 1987). All three individuals are typed for a genetic marker that is hypothesised to be associated with susceptibility to the disorder. The genotypic frequencies in affected children are calculated and compared to the genotypic frequencies formed by merging the parental alleles that are not transmitted to the affected child. In effect, this creates a “pseudo-control” genotype from the alleles that are not transmitted to the affected offspring (Terwilliger and Ott, 1992). The marker allele frequencies are then compared between the case and pseudo-control group, and the resulting odds ratio is known as the HRR (Falk and Rubenstein, 1987; Schaid, 1998; Schulze and McMahon, 2002).
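The construction of the pseudo-control genotype can be sketched as follows. This is a simplified illustration that follows the description above (marker alleles compared between cases and pseudo-controls as an odds ratio), not the original authors' implementation. The function names, the allele labels and the three trios are hypothetical, and the consistency check on the child's genotype is deliberately minimal.

```python
from collections import Counter


def non_transmitted(father, mother, child):
    """Pseudo-control genotype: the parental alleles not passed to the child."""
    pool = Counter(father) + Counter(mother)
    pool.subtract(Counter(child))
    if any(n < 0 for n in pool.values()):
        raise ValueError("child genotype is inconsistent with the parents")
    return tuple(sorted(pool.elements()))


def allele_odds_ratio(trios, marker):
    """Odds ratio of the marker allele in cases versus pseudo-controls."""
    case_m = case_o = ctrl_m = ctrl_o = 0
    for father, mother, child in trios:
        pseudo = non_transmitted(father, mother, child)
        case_m += child.count(marker)
        case_o += 2 - child.count(marker)
        ctrl_m += pseudo.count(marker)
        ctrl_o += 2 - pseudo.count(marker)
    # Raises ZeroDivisionError if a cell is empty; a real analysis would
    # need to handle sparse tables explicitly.
    return (case_m * ctrl_o) / (case_o * ctrl_m)


trios = [  # (father, mother, affected child), illustrative genotypes
    (("A", "B"), ("A", "B"), ("A", "A")),
    (("A", "B"), ("B", "B"), ("A", "B")),
    (("A", "B"), ("A", "B"), ("A", "B")),
]
print(non_transmitted(*trios[0]))
print(allele_odds_ratio(trios, "A"))
```

Taking the difference between the pooled parental alleles and the child's alleles avoids having to decide which parent transmitted which allele.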

However, utilising family members in genetic studies may not always be the most feasible option, since such studies have been found to possess limitations. Not every case has a sibling, and Teng and Risch (1999) found that using unaffected siblings as controls resulted in a substantial decrease in power when compared to studies using unrelated controls. When using parents as controls, at least one parent has to be readily available, and the possibility exists that some of the parent-case trios will be discarded because they are uninformative. Moreover, TDT-related methods yield approximately two-thirds of the genotyping efficiency of the case-control design, because three genotypes (two parents and the proband) are required for each trio, compared with two for each case-control pair. Finally, it is especially difficult to obtain large collections of family members for psychiatric disorders, since there seems to be a certain amount of stigma attached to being diagnosed with a psychiatric disorder.

Therefore, population-based case-control association analyses offer numerous advantages over the family-based association methods, including easier and cheaper recruitment of subjects; greater power to detect associations where the GRR is low; the inclusion of a more representative sample of subjects than in family-based designs, and the ability to explore environmental co-actions.

I.4.1.1.3. Genomic Adjustment

If population substructure affects candidate gene allele frequencies, then, theoretically, there should also be systematic differences in the allele frequencies at other genes (Devlin and Roeder, 1999; Pritchard and Rosenberg, 1999). It is these differences that are exploited when using genomic adjustment to detect and control for population stratification. These methods can be divided into two broad categories. The first comprises model-based or structured association (SA) methods, which assume that the heterogeneous sample population is composed of genetically homogeneous subpopulations. Programs implementing this design are Structure (Pritchard et al., 2000) and latent class analysis (LCA) programs (such as L-POP) (Sham and Purcell, 2002). The second comprises non-model-based, or genomic control, methods, which correct for population stratification by accounting for the overdispersion of test statistics generated by population substructure. Genomic Control (GC) (Devlin and Roeder, 1999) is an example of a program implementing non-model-based methodology. Both categories utilise a panel of polymorphic markers that may or may not be linked to the candidate locus.

i. Model-based methods

Model-based methods attempt to detect the underlying population substructure and adjust the association accordingly. They are conducive to association studies since they allow the identification of situations resulting in false positive and negative findings, and the choice of markers should not bias the subsequent correction in any way. Structure (Pritchard et al., 2000) is a Bayesian model-based algorithm that assigns individuals probabilistically to one or more subpopulations based on allelic frequencies at each locus studied. It involves genotyping random markers (in linkage equilibrium with each other) in order to reflect the baseline genetic differences between cases and controls. The procedure places individuals into ‘K’ number of clusters. ‘K’ is chosen in advance, but can be varied across independent runs of the Structure algorithm. It is possible for individuals to have membership in multiple clusters; in this case, the program will indicate an estimate of the fraction of the individual’s genome that originated from each of the ‘K’ subpopulations, providing a means for capturing the degree of admixture.
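The idea of probabilistic assignment can be illustrated with a heavily simplified calculation. The sketch below is not the Structure algorithm, which estimates allele frequencies and memberships jointly by Markov chain Monte Carlo and can model admixture. Here the allele frequencies of the K clusters are assumed to be known, individuals are assumed not to be admixed, loci are assumed to be unlinked and in Hardy-Weinberg equilibrium, and the prior over clusters is uniform. The function name and all numerical values are hypothetical.

```python
import math


def membership_posterior(genotype, cluster_freqs):
    """Posterior probability that an individual belongs to each of K clusters.

    genotype: copies (0, 1 or 2) of the reference allele at each locus.
    cluster_freqs: for each cluster, the reference-allele frequency at each
                   locus (assumed known and strictly between 0 and 1).
    """
    log_liks = []
    for freqs in cluster_freqs:
        ll = 0.0
        for g, p in zip(genotype, freqs):
            # Hardy-Weinberg genotype probability at one locus.
            ll += math.log(math.comb(2, g) * p ** g * (1 - p) ** (2 - g))
        log_liks.append(ll)
    # Uniform prior; normalise in a numerically stable way.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


genotype = [2, 2, 0]  # three illustrative biallelic markers
cluster_freqs = [[0.9, 0.8, 0.1], [0.2, 0.3, 0.7]]  # K = 2
print([round(x, 4) for x in membership_posterior(genotype, cluster_freqs)])
```

Even three well-differentiated markers assign this hypothetical individual to the first cluster with high probability; less differentiated markers would leave the assignment far more uncertain.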

The major drawback of this method is that, although it allows the detection of population structure, the algorithm itself offers no means of adjusting the significance value if the stratification is found to influence the validity of the association. However, a program called “strat” has been designed (Pritchard, 2000) in order to correct for confounding due to stratification. Structure is currently a widely used program, and has been successfully implemented in numerous studies attempting to delineate human population structure (Rosenberg et al., 2002; Bamshad et al., 2003) and the genetic structure of certain dog breeds (Parker et al., 2004); it has even been used to distinguish between selected breeds of chickens (Rosenberg et al., 2001).

A slightly modified approach to the methods implemented in Structure is represented by the latent class analysis (LCA) of population substructure (Satten et al., 2001; Purcell and Sham, 2004). This implementation involves the simultaneous estimation of population membership and the effect of the disease-susceptibility variant in case subjects in the respective subpopulations, thus bypassing the 2-stage procedure required by Structure.

ii. Non-model-based methods

Genomic Control (GC) methods utilise unlinked markers that are usually independent of disease to calculate a correction factor to control for the inflated χ² value that is a consequence of population stratification (Devlin et al., 2001; Devlin and Roeder, 1999; Bacanu et al., 2000). In other words, the method involves re-calibrating the χ² value for association according to how many of the control markers (the null loci) are found to be associated with the disease. It is therefore important to choose the control markers so that they are randomly distributed and thus provide a true reflection of the overall differences between cases and controls; any marker that exhibits a higher degree of differentiation between cases and controls will result in an overly conservative adjustment of the χ² values.
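The re-calibration can be sketched using the median-based estimator of the inflation factor λ: the median χ² across the null loci divided by the median of a χ² distribution with one degree of freedom (approximately 0.455). The candidate χ² is then divided by λ. In the sketch below, the function name and the χ² values are illustrative, and bounding λ below by 1 is a common convention adopted here as an assumption.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df


def genomic_control(candidate_chi2, null_chi2):
    """Return (lambda, adjusted chi-square, adjusted p-value).

    candidate_chi2: 1-df association statistic at the candidate marker.
    null_chi2: 1-df association statistics at the null (control) markers.
    """
    inflation = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    # Assumption: never inflate the candidate statistic (lambda >= 1).
    lam = max(1.0, inflation)
    adjusted = candidate_chi2 / lam
    p_value = math.erfc(math.sqrt(adjusted / 2))  # upper tail, 1 df
    return lam, adjusted, p_value


null_loci = [0.2, 0.5, 0.9098, 1.5, 3.0]  # illustrative values only
lam, adjusted, p = genomic_control(8.0, null_loci)
print(round(lam, 3), round(adjusted, 3), round(p, 4))
```

Because the median is used, a few strongly differentiated null markers have little influence on λ, but a panel that is systematically too differentiated would still over-correct, as noted above.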

A drawback of this method is that the number of markers required can be prohibitive – for a reliable and valid correction for the presence of population substructure, 50 or more control markers may be required (Devlin et al., 2001). Moreover, the method is limited only to SNPs (Bacanu et al., 2000; Devlin and Roeder, 1999). It has also been found that GC methods do not control against a Type I error if the difference in candidate allele frequencies between populations is small (Redden and Allison, 2003).

Accuracy to detect population substructure using genomic adjustment

The resolution at which population substructure can be detected depends largely on a combination of the characteristics of the genetic data utilised in the study, including expected heterozygosity or number of alleles at a locus (Shriver et al., 1997; Bamshad et al., 2003) and maximal difference in allele frequencies between the populations under investigation (Rosenberg et al., 2001). The more informative a marker, the greater the power to reject the null hypothesis of no genome-wide differences in allele frequencies between the case and control populations. For biallelic markers, informativity will be maximised if one allele is absent from one of the subgroups or populations under investigation and occurs in only one of the populations (Bamshad et al., 2003).

The accuracy with which subpopulations are characterised will also depend on the level of genetic variance within and between the subpopulations. A positive Fst value (a measure of the overall genetic differentiation between subpopulations) indicates that individuals from the same subpopulation are more genetically similar than those from different subpopulations. One also has to take note of the variance within subpopulations: if variances within subpopulations are high, it becomes more difficult to assign individuals to a particular population (Bamshad et al., 2003). Interestingly, Bamshad et al. (2003) found that Alu insertion/deletion markers possessed higher Fst values than microsatellite markers, and that these values were similar to those obtained for other diallelic markers; the lower Fst values of the microsatellites could be attributed to the high mutation rate of these polymorphisms. Alu polymorphisms have been successfully utilised to infer population structure, and have been found to possess comparable power to detect structure and assign origin: Bamshad et al. (2003) found that a minimum of 60 Alu markers were required to assign individuals to the correct continent of origin with a mean accuracy of 90%.
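For a single biallelic marker, Fst can be computed from expected heterozygosities as (H_T − H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the average expected heterozygosity within subpopulations. The sketch below uses this textbook definition with known allele frequencies and ignores sampling corrections; the function name and the frequencies are illustrative.

```python
def fst_biallelic(freqs, sizes=None):
    """Fst for one biallelic marker from subpopulation allele frequencies.

    freqs: frequency of one allele in each subpopulation.
    sizes: optional relative subpopulation sizes (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [s / total for s in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_total = 2 * p_bar * (1 - p_bar)  # pooled expected heterozygosity
    h_within = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    if h_total == 0:
        raise ValueError("marker is monomorphic in the pooled population")
    return (h_total - h_within) / h_total


print(fst_biallelic([0.5, 0.5]))            # identical frequencies: 0.0
print(round(fst_biallelic([0.2, 0.8]), 2))  # moderate differentiation: 0.36
print(fst_biallelic([0.0, 1.0]))            # allele private to one group: 1.0
```

The last case corresponds to the maximally informative biallelic marker, in which an allele is confined to a single population.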

The number of markers required depends on a combination of the number of alleles, heterozygosity and Fst values. Obviously, a population that exhibits a fine substructure requires a larger number of markers (and larger sample size) to resolve this substructure, although, in such a case, the degree to which the genetic association is confounded would be lower (Pritchard and Rosenberg, 1999).

I.4.1.1.4. The value of isolated populations in genetic association studies, and a brief overview of the genetic history of the South African Afrikaner

The statistical power to detect a true association depends, to a large extent, on the amount of background noise within the population from which the subjects are sampled. This “noise” comprises a number of genetic and environmental aspects, which may vary amongst populations. Association studies in heterogeneous populations present with varying degrees of background noise; consequently, large samples are required to attain sufficient statistical power. In homogeneous populations, however, environmental and genetic variation is limited, improving the signal-to-noise ratio, and the statistical power of the study.

By definition, all isolated populations originate from a few founders. Most of these populations experience bottlenecks, after which periods of rapid population growth (due to increased reproduction, not immigration) occur. During the bottleneck, the population experiences inbreeding and random genetic drift, ultimately limiting the genetic diversity and the number of new mutations occurring within the current population. Since recessive and neutral alleles are both subject to genetic drift in a population isolate, each population usually has its own set of recessive diseases that occur at relatively high frequency. Genetic drift will have much the same effect on rare marker alleles and haplotypes as it does on recessive and neutral alleles; common marker alleles and haplotypes will, on the other hand, not be affected to any large extent by drift, unless the number of initial founders is very small.
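The differing effect of drift on rare and common alleles can be illustrated with a simple Wright-Fisher simulation, in which each generation is formed by random sampling of gene copies from the previous one. The sketch below is illustrative only: the population sizes, starting frequencies, number of generations and random seed are arbitrary assumptions, and selection, mutation and migration are ignored.

```python
import random


def wright_fisher_loss_rate(pop_size, start_freq, generations, replicates, seed=0):
    """Fraction of replicate populations in which the allele is lost by drift."""
    rng = random.Random(seed)
    copies = 2 * pop_size  # diploid individuals
    lost = 0
    for _ in range(replicates):
        freq = start_freq
        for _ in range(generations):
            # Binomial sampling of the next generation's gene copies.
            count = sum(rng.random() < freq for _ in range(copies))
            freq = count / copies
            if freq == 0.0:
                break
        if freq == 0.0:
            lost += 1
    return lost / replicates


# Rare allele in a small founder group versus a larger population,
# and a common allele in the same small founder group.
rare_small = wright_fisher_loss_rate(20, 0.02, 10, 100)
rare_large = wright_fisher_loss_rate(500, 0.02, 10, 100)
common_small = wright_fisher_loss_rate(20, 0.5, 10, 100)
print(rare_small, rare_large, common_small)
```

Under these assumptions the rare allele is frequently lost in the small founder group, whereas the common allele almost always persists. Replicates in which a rare allele survives are the ones in which it may drift to a relatively high frequency.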

It is highly probable that, for complex diseases, the underlying susceptibility alleles are relatively common (section I.3.2.3), and experience very little selection pressure (Lander, 1996; Collins et al., 1997; Risch and Merikangas, 1996). Consequently, these variants predate the “Out-of-Africa” expansion, and the genome segments on which they are located have experienced recombination over a large number of years (approximately 100 000 years), erasing much of the LD around the susceptibility locus. A high marker density will thus be required in order to detect these susceptibility variants using LD mapping strategies (Kruglyak, 1999[b]).

Exploiting the genomic structure of populations that exhibit an increased level of LD will thus be conducive to the detection of the underlying susceptibility alleles. Decay of LD is related to the number of recombination events and the effective population size (Hartl and Clark, 1997); therefore, one of the major demographic features influencing LD mapping studies is the number of generations to the most recent common ancestor (MRCA) (Wright et al., 1999).

Recombination events serve to equilibrate linked alleles: for disease alleles that are relatively ancient, the number of recombination events that would have occurred is high; therefore, the surrounding genomic regions that are identical-by-descent (IBD) will be relatively small. The rate of decay of LD is reduced by other factors that reduce the effective size of the population, such as inbreeding.
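In the simplest deterministic model, the LD coefficient D between two loci decays by a factor of (1 − θ) per generation, where θ is the recombination fraction, so that D after t generations equals D₀(1 − θ)^t. The sketch below applies this relationship to an ancient allele and to a recently founded population. The generation time of 25 years and the recombination fraction of 0.01 are assumptions made for illustration, and drift and inbreeding are ignored.

```python
def ld_after_generations(d0, recomb_fraction, generations):
    """Expected LD coefficient after a number of generations of recombination."""
    return d0 * (1 - recomb_fraction) ** generations


YEARS_PER_GENERATION = 25  # assumption, for illustration only
THETA = 0.01               # markers roughly 1 cM apart (assumption)

ancient = ld_after_generations(1.0, THETA, 100_000 // YEARS_PER_GENERATION)
recent = ld_after_generations(1.0, THETA, 350 // YEARS_PER_GENERATION)

print(f"fraction of initial LD left after ~100 000 years: {ancient:.2e}")
print(f"fraction of initial LD left after ~350 years:     {recent:.3f}")
```

Under these assumptions almost all of the initial LD survives the few generations since a recent founding event, whereas essentially none survives over the timescale of an ancient variant.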

It is thus evident that isolated populations, with relatively less average time to the MRCA, and a high occurrence of inbreeding, exhibit higher degrees of LD over longer genomic distances compared to more stable, outbred populations (Graham and Thompson, 1998; Shifman and Darvasi, 2001). One will recall from the section on LD and haplotype mapping that the required increase in sample size when testing the association between a single SNP and a complex disorder is inversely proportional to the value of r² (Ardlie et al., 2002; Shifman and Darvasi, 2001; Laan and Paabo, 1997). Consequently, due to the extended levels of LD, isolated populations require a smaller increase in sample size to identify the specific disease gene, and are therefore more conducive to the genetic association mapping procedure.
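A commonly used approximation is that, if N subjects would be needed to detect the association when the causal variant itself is genotyped, roughly N / r² subjects are needed when a marker in LD with it is genotyped instead. The sketch below applies this approximation; the function name and the sample sizes are illustrative.

```python
import math


def required_sample_size(n_direct, r_squared):
    """Approximate sample size needed when testing a marker in LD (r^2)
    with the causal variant, given the size needed for the variant itself."""
    if not 0 < r_squared <= 1:
        raise ValueError("r_squared must lie in (0, 1]")
    return math.ceil(n_direct / r_squared)


for r2 in (1.0, 0.5, 0.25):
    print(r2, required_sample_size(1000, r2))
```

Higher r² between marker and susceptibility variant, as expected in isolated populations with extended LD, therefore translates directly into a smaller required sample.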

Ultimately, the genetic heterogeneity will also be reduced, resulting in a significant increase in GRR, making the association between variant and disease easier to detect (Shifman and Darvasi, 2001).

Genealogical records form the mainstay of any genetic investigation in isolated populations, since they allow for the identification of large pedigrees. These pedigrees will probably comprise multiple affected individuals, and, by utilising the genealogical records, one will be able to delineate the number of meiotic steps separating affected individuals, which will facilitate the identification of IBD segments (Kruglyak, 1999[b]; Heutink and Oostra, 2002).

Genealogical records also allow the identification of genetic “incomers”, facilitating the more accurate estimation of genetic variability (Angius et al., 2001).

The South African Afrikaner population: a brief overview of their history and suitability in localising disease-susceptibility genes by means of association studies

The Afrikaans-speaking Caucasian sub-population in South Africa, often referred to as Afrikaners, are of predominantly Dutch, German and French descent (Dunning et al., 2000; Jenkins, 1990; Botha and Beighton, 1983). Their history over the past 350 years has contributed to their geographic and cultural isolation, and their relative genetic homogeneity.

The Dutch were the first Caucasians to settle in the Cape in 1652, and their numbers were subsequently boosted by French Huguenot and German immigrants. By 1687, the founding Afrikaner population consisted of only about 90 families (Theal, 1964). From this time on, the settler population expanded rapidly with large families of 10 or more children (Botha and Beighton, 1983), and marriage between family members was common in early generations.

Over a period of about 300 years, the Afrikaner population underwent a 2500-fold increase, whereas Britain’s population increased only six-fold over the same period (Jenkins, 1990). This population growth was almost entirely due to reproduction, as the immigration following the founding event in 1652 was minimal (Jenkins, 1990).

The Afrikaner immigrants spread inland from about 1838, forming small, geographically isolated communities. Their language and religion (most Afrikaners were members of the Dutch Reformed Church) contributed further to their isolation and social cohesion. This cultural identity has, for the most part, been maintained, largely due to intermarriage (Botha

In document Investigating the molecular aetiology of Obsessive-compulsive disorder (OCD) and clinically-defined subsets of OCD (Page 61-71)