CHAPTER 4: EVIDENCE OF SELECTION FROM GENOMIC DATA
4.1 INTRODUCTION
4.1.3 EXISTING METHODS OF DETECTING SELECTION
While the genetic phenomena described in the previous section have been known for a considerable time, the density of polymorphism data from a wild population has rarely been sufficient to allow practical genome-wide detection of selective sweeps by searching directly for the haplotypes undergoing them. Nonetheless, the general principles of sweep detection have frequently been applied more modestly to specific loci containing genes of likely scientific interest, and have remained essentially the same for at least twelve years.
The major goal of this project was to detect instances within the UK population of A. thaliana of adaptation to local habitat despite gene flow from other sources, and to attempt prediction of possible cause(s). The typical pattern of a study of local adaptation involves first identifying samples displaying phenotypes of high fitness exclusively in their native habitat, and then seeking evidence that genes possessing variation associated with these traits have undergone selection in the observable past. This chapter essentially sought to reverse that process, by identifying genomic loci in samples taken from particular habitats exhibiting signatures of selection, which should serve as targets for validation via future field experiments.
Sabeti et al. (2002) demonstrated an approach and thought process that served as a major inspiration for the work carried out in this chapter. Working with two loci in the human genome suspected to possess variation associated with resistance to malaria, they identified core haplotypes and measured the degree of conserved co-segregating similarity in flanking loci in order to estimate the age of each haplotype. Recently emerged haplotypes (those with a high degree of co-segregating variation) found at a high frequency in the studied population were marked as likely candidates for selection, having risen to high frequency before meiotic recombination broke down the linkage disequilibrium with the surrounding variation. Such a rapid rise is unlikely for selectively neutral variation. To gauge the likelihood of any such instance being a true signature of selection, the degree of deviation from haplotypes simulated under a coalescent process was quantified. Several haplotypes were identified as exhibiting a highly significant deviation from coalescent expectations, and thus as probable instances of alleles favoured by selective sweeps.
Detection of selection across broader sections of the genome from genetic data has historically proved much more problematic, however. Genome-wide detection of selection had been attempted by comparison with predictions drawn from population genetic models (Hanfstingl et al. 1994; Hagenblad & Nordborg 2002; Nordborg et al. 2005), but prior to the advent of widely available whole genome sequencing these attempts were plagued by a lack of cross-compatibility of data from various experiments, and by difficulties in determining statistical significance due to confounding from drift and demographic factors (see Chapter 1.3.1; for review, see Sabeti et al. 2006). Methods for detecting signatures of selection fall into at least five different classes, each searching for distinct genomic patterns arising as a consequence of selective sweeps, and each with its own strengths and weaknesses. The suitability of each class of analysis to the goals of this project will now be discussed.
• Proportion of non-synonymous mutations
Most evolution of a genotype is expected to proceed through neutral changes (i.e., those with no effect on a phenotype). In terms of base substitutions within a sequence, this means that substitutions producing no change in phenotype are typically expected to be observed much more frequently than substitutions producing a change in phenotype. This ratio may be quantified by comparing the number of sequence differences producing codons coding for different amino acids (non-synonymous, or functional, mutations) with the number producing codons coding for the same amino acid (synonymous, or non-functional, mutations). Once measured, this ratio may then be compared against the equivalent ratio at the same loci in other species, the ratio at loci carefully chosen for their neutrality, or the typical ratio across the rest of the genome. Sustained selection over long timescales has been identified through higher proportions of non-synonymous mutations than expected by chance (McDowell 1998; Rose 2004; Ding et al. 2007). Since deleterious mutations are unlikely ever to rise to a high frequency in a population (due to selection acting against them), it is reasonable to conclude that such an observation constitutes a signature of a selective sweep.
This type of analysis is routinely applied to sequence data collected from closely related species, and is best suited to the analysis of strong, persistent selection pressures at a single gene’s locus over many millions of years. SNP data is not ideally suited to this type of analysis, whereas resequencing data, such as that from the 1001 Arabidopsis Genome project, is. This method was therefore not utilised for the primary detection of sweeps, but may be useful for secondary analysis of candidate loci.
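The comparison of non-synonymous and synonymous differences described above may be illustrated with a minimal sketch. It is not the procedure used in this project. The function name `count_substitutions` and the example sequences are hypothetical, and the code returns raw counts only: a full estimate of the ratio would also normalise by the number of synonymous and non-synonymous sites, which is omitted here.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitutions(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codons.

    Both sequences must be aligned, in frame, and of equal length.
    Codons differing at more than one position are skipped, because the
    order of the underlying changes is ambiguous (a simplifying assumption).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# Hypothetical example: GCT -> GCC keeps alanine, AAA -> AGA changes lysine to arginine.
syn, nonsyn = count_substitutions("ATGGCTAAA", "ATGGCCAGA")
print(syn, nonsyn)
```

An excess of non-synonymous over synonymous differences, relative to a neutral reference, is the pattern this class of method looks for.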
• Local reduction of genetic diversity
As a selective sweep progresses and linked alleles are drawn towards fixation by genetic hitchhiking, the genetic diversity (i.e., the number of alleles in the population) at those linked loci necessarily decreases from the typical level encountered across the rest of the genome. Selective sweeps may therefore be recognised by a pronounced drop in the genetic diversity of genotypes centred on a particular locus (Carlson et al. 2005; Sabeti et al. 2006). Eventually, diversity at the linked loci rises again. If the sweep occurred across the entire native range of the species, diversity will rise slowly as new mutations begin to appear. If the sweep occurred only across a fraction of the species’ range (for example, if it were restricted to a relatively isolated sub-population), diversity at these loci may be restored more quickly as migrants reintroduce variation.
While classic selective sweeps decrease allelic diversity at linked loci, balancing selection has been shown to actually increase diversity (Charlesworth 2006). This provides a means of not only identifying selection, but of predicting its nature.
SNP datasets are well suited to this type of analysis, which may inform us of the nature of selection occurring up to several hundred thousand years in the past (Sabeti et al. 2006; Pritchard et al. 2010; Hernandez et al. 2011). A simple implementation of this method was carried out in this project, and the results contrasted with other methods employed in this chapter.
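A local reduction in diversity of this kind may be sketched as a sliding-window scan. The example below is an illustration under stated assumptions and not the implementation used in this project: SNPs are assumed to be biallelic and coded 0/1, each sample is treated as a single haplotype, and the function names, window size and data are hypothetical. Diversity is measured as the average number of pairwise differences per site, summed within each window.

```python
def site_diversity(alleles):
    """Average pairwise difference at one biallelic site (alleles coded 0/1)."""
    n = len(alleles)
    k = sum(alleles)
    return k * (n - k) / (n * (n - 1) / 2)


def windowed_diversity(positions, genotypes, window, step):
    """Sum per-site diversity in sliding windows along one chromosome.

    positions: sorted SNP coordinates (bp)
    genotypes: one list of 0/1 alleles per SNP, one entry per sampled haplotype
    Returns a list of (window_start, summed_diversity) tuples.
    """
    per_site = [site_diversity(g) for g in genotypes]
    results = []
    start = 0
    while start <= positions[-1]:
        total = sum(
            d for p, d in zip(positions, per_site) if start <= p < start + window
        )
        results.append((start, total))
        start += step
    return results


# Hypothetical data: five SNPs typed in four haplotypes.
positions = [100, 250, 900, 1500, 1900]
genotypes = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
for start, diversity in windowed_diversity(positions, genotypes, window=1000, step=1000):
    print(start, round(diversity, 3))
```

Windows whose diversity falls well below the genome-wide level would be the candidates for a sweep. In practice the sum would normally be divided by the number of callable bases in the window.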
• Presence of high-frequency derived alleles
Derived alleles (i.e., those created by mutation of ancestral alleles) usually exist at low frequency in a population. Should these alleles be linked to an allele that undergoes a selective sweep, they will be drawn towards fixation through genetic hitchhiking. Loci undergoing selective sweeps may therefore be identified by the presence of derived alleles at unusually high frequency.
This analysis requires knowledge of a population’s ancestral alleles, in order that they may be distinguished from derived alleles. In A. thaliana, ancestral genotypes cannot be inferred with any confidence, since the population structure and degree of admixture render any attempt futile (see Chapter 2); therefore, this method of detecting selection was not used in this project.
• Population differentiation
If a population is divided into relatively distinct sub-populations, then large differences in allele frequencies between sub-populations may be indicative of a selective sweep (Kreitman 2000; Sabeti et al. 2007). Distinguishing the precise cause of observations of this nature in the absence of additional information is often extremely challenging, however, as the same observations may very often be attributed with at least equal plausibility to demographic effects.
Since research in this chapter set out explicitly to develop a means of distinguishing between demographic and selective effects, this method was not employed.
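Although this method was not employed, the quantity it relies on may be sketched for completeness. The example below assumes a single biallelic site and two equally weighted sub-populations, and computes differentiation as the proportional reduction in expected heterozygosity within sub-populations relative to the pooled population. The function name `fst` and the allele frequencies are hypothetical, and sample-size corrections are omitted.

```python
def fst(p1, p2):
    """F_ST at one biallelic site from allele frequencies in two equally
    weighted sub-populations, computed as (H_T - H_S) / H_T."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # site is monomorphic across both sub-populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two sub-populations.
print(fst(0.5, 0.5))  # identical frequencies: no differentiation
print(fst(0.2, 0.8))  # divergent frequencies: substantial differentiation
```

A high value at a single locus is equally compatible with drift in a small or isolated sub-population, which is the ambiguity noted above.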
• Haplotype length
Loci undergoing a selective sweep are likely to maintain linkage with nearby alleles as the sweep progresses (as described in the previous section). Loci under selection are therefore identifiable due to the preservation of a greater degree of linkage than expected for their observed frequency.
Detection of selection via haplotypes may only detect very recent selection events, since large haplotypes tend to break down rapidly. On the other hand, the haplotype-based detection method is capable of detecting partial sweeps (in which the allele under selection rises in frequency, but does not reach fixation), and is relatively unaffected by any potential biases arising from the choice of SNPs to use in the analysis (see Chapter 2.3.1). This method of detection is therefore ideally suited both to the data available to this project and to its goals.
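The preservation of linkage around a core allele may be quantified by extended haplotype homozygosity (EHH): the probability that two randomly chosen chromosomes carrying the core allele are identical across the whole interval from the core to a given flanking SNP. The sketch below is illustrative only and does not reproduce the analysis carried out in this chapter. The function name `ehh` and the haplotypes are hypothetical, haplotypes are assumed to be phased strings of alleles, and only SNPs to the right of the core are scanned.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele):
    """Extended haplotype homozygosity moving rightwards from a core SNP.

    haplotypes: equal-length strings of alleles, one per sampled chromosome
    Returns a list whose j-th entry is the EHH for the span from the core
    SNP to the j-th SNP to its right (entry 0 is the core itself, always 1.0).
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    total_pairs = comb(len(carriers), 2)
    if total_pairs == 0:
        raise ValueError("need at least two chromosomes carrying the core allele")
    values = []
    for end in range(core_index, len(haplotypes[0])):
        # Group carriers by their haplotype over the interval core..end.
        counts = Counter(h[core_index:end + 1] for h in carriers)
        identical_pairs = sum(comb(c, 2) for c in counts.values())
        values.append(identical_pairs / total_pairs)
    return values


# Hypothetical phased haplotypes; the core SNP is the first position.
haps = ["1000", "1000", "1001", "1010", "0110", "0101"]
print(ehh(haps, core_index=0, core_allele="1"))
print(ehh(haps, core_index=0, core_allele="0"))
```

A core allele whose EHH decays slowly with distance despite being at high frequency is the pattern expected under a recent sweep. Significance would still need to be assessed against simulated or genome-wide expectations.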
4.1.4 DISEASE RESISTANCE IN A.THALIANA: MODEL PLANT MEETS MODEL