CHAPTER 4: EVIDENCE OF SELECTION FROM GENOMIC DATA

4.1 INTRODUCTION

4.1.3 EXISTING METHODS OF DETECTING SELECTION

While the genetic phenomena described in the previous section have been known to exist for a considerable time, the density of polymorphism data from a wild population has rarely been sufficient to allow the practical genome-wide detection of selective sweeps by searching directly for the haplotypes undergoing them. Nonetheless, the general principles of sweep detection have frequently been applied more modestly to specific loci containing genes of likely scientific interest, and have remained essentially the same for at least twelve years.

The major goal of this project was to detect instances of adaptation to local habitat within the UK population of A. thaliana, despite gene flow from other sources, and to attempt to predict the possible cause(s). The typical pattern of a study of local adaptation involves first identifying samples displaying phenotypes of high fitness exclusively in their native habitat, and then seeking evidence that genes possessing variation associated with these traits have undergone selection in the observable past. This chapter essentially sought to reverse that process, by identifying genomic loci that exhibit signatures of selection in samples taken from particular habitats; these loci should serve as targets for validation via future field experiments.

Sabeti et al. (2002) demonstrated an approach and thought process that served as a major inspiration for the work carried out in this chapter. Working with two loci in the human genome suspected to possess variation associated with resistance to malaria, they identified core haplotypes and measured the degree of conserved co-segregating similarity in flanking loci in order to estimate the age of each haplotype. Recently emerged haplotypes (those with a high degree of co-segregating variation) found at a high frequency in the studied population were marked as likely candidates for selection, having risen to high frequency before meiotic recombination broke down the linkage disequilibrium with the surrounding variation. Such a pattern is unlikely for selectively neutral variation. To gauge the likelihood of any such instance being a true signature of selection, the degree of deviation from haplotypes simulated under a coalescent process was quantified. Several haplotypes were identified as exhibiting a highly significant deviation from coalescent expectations, and thus as probable instances of alleles favoured by selective sweeps.
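The comparison against coalescent expectations can be illustrated with a minimal sketch. It assumes that a haplotype statistic has already been computed both for the observed locus and for each dataset simulated under neutrality; the function name and the example values are hypothetical, and the simulations themselves are not shown. It is a generic empirical p-value, not necessarily the procedure of Sabeti et al. (2002).

```python
def empirical_p_value(observed, simulated):
    """One-sided empirical p-value: the fraction of neutral simulations whose
    statistic is at least as large as the observed value.

    Adding 1 to the numerator and denominator avoids reporting p = 0 from a
    finite number of simulations.
    """
    if not simulated:
        raise ValueError("at least one simulated value is required")
    as_extreme = sum(1 for s in simulated if s >= observed)
    return (as_extreme + 1) / (len(simulated) + 1)


# Hypothetical values: statistic under neutral simulation vs. the observed locus.
neutral_values = [0.12, 0.30, 0.25, 0.18, 0.41, 0.22, 0.35, 0.28, 0.15]
p = empirical_p_value(0.55, neutral_values)  # no simulated value reaches 0.55
```

With only nine simulations the smallest attainable value is 0.1, so in practice many more simulated datasets would be needed to support a claim of high significance.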

Detection of selection across broader sections of the genome from genetic data has historically proved much more problematic, however. Genome-wide detection of selection had been attempted by comparison with predictions drawn from population genetic models (Hanfstingl et al. 1994; Hagenblad & Nordborg 2002; Nordborg et al. 2005), but prior to the advent of widely available whole genome sequencing these attempts were plagued by a lack of cross-compatibility of data from various experiments, and by difficulties in determining statistical significance due to confounding from drift and demographic factors (see Chapter 1.3.1; for review, see Sabeti et al. 2006).

Methods for detecting signatures of selection fall into at least five different classes, each searching for distinct genomic patterns arising as a consequence of selective sweeps, and each with its own strengths and weaknesses. The suitability of each class of analysis to the goals of this project will now be discussed.

• High proportion of function-altering mutations

Most evolution of a genotype is expected to proceed through neutral changes (i.e., those with no effect on a phenotype). In terms of base substitutions within a sequence, this means that substitutions producing no change in phenotype may typically be expected to be observed much more frequently than substitutions producing a change in phenotype. This ratio may be quantified by comparing the number of sequence differences producing codons coding for different amino acids (non-synonymous, or functional, mutations) with the number producing codons coding for the same amino acid (synonymous, or non-functional, mutations). Once measured, this ratio may then be compared against the equivalent ratio at the same loci in other species, the ratio at loci carefully chosen for their neutrality, or the typical ratio across the rest of the genome. Sustained selection over long timescales has been identified through higher proportions of non-synonymous mutations than expected by chance (McDowell 1998; Rose 2004; Ding et al. 2007). Since deleterious mutations are unlikely ever to rise to a high frequency in a population (due to selection acting against them), it is reasonable to conclude that such an observation constitutes a signature of a selective sweep.

This type of analysis is routinely applied to sequence data collected from closely related species, and is best suited to the analysis of strong, persistent selection pressures at a single gene's locus over many millions of years. SNP data is not ideally suited to this type of analysis, but resequencing data, such as that from the 1001 Arabidopsis Genome project, is. Therefore, this method was not utilised for the primary detection of sweeps, but may be useful for secondary analysis of candidate loci.
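The counting step behind this class of analysis can be sketched as follows. The sketch assumes two aligned, in-frame coding sequences of equal length written in upper-case A, C, G and T, and it simply classifies each differing codon as synonymous or non-synonymous under the standard genetic code. The function name is hypothetical, and a full analysis would additionally normalise by the number of synonymous and non-synonymous sites, which is not attempted here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1       # same amino acid encoded
        else:
            non_synonymous += 1   # amino acid changed
    return synonymous, non_synonymous


# GCT -> GCC leaves alanine unchanged; AAA -> AGA replaces lysine with arginine.
counts = count_codon_differences("ATGGCTAAA", "ATGGCCAGA")
```

Counting whole codons rather than individual substitutions keeps the sketch short, at the cost of treating a codon with two changed bases as a single difference.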

• Local reduction of genetic diversity

As a selective sweep progresses and linked alleles are drawn towards fixation by genetic hitchhiking, the genetic diversity (i.e., the number of alleles in the population) at those linked loci necessarily decreases from the typical level encountered across the rest of the genome. Selective sweeps may therefore be recognised by a sudden and progressive drop in genetic diversity centred on a particular locus (Carlson et al. 2005; Sabeti et al. 2006). Eventually, diversity at the linked loci rises again. If the sweep occurred across the entire native range of the species, diversity will rise slowly as new mutations begin to appear; if the sweep occurred only across a fraction of the species' range (if, for example, it were restricted to a relatively isolated sub-population), diversity at these loci may be restored more quickly as migrants reintroduce variation.

While classic selective sweeps decrease allelic diversity at linked loci, balancing selection has been shown to actually increase diversity (Charlesworth 2006). This provides a means not only of identifying selection, but also of predicting its nature.

SNP datasets are well suited to this type of analysis, which may inform us of the nature of selection occurring up to several hundred thousand years in the past (Sabeti et al. 2006; Pritchard et al. 2010; Hernandez et al. 2011). A simple implementation of this method was carried out in this project, and the results contrasted with other methods employed in this chapter.
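A scan of this kind might be sketched as below. This is an illustration only and not the implementation used in this project: it assumes phased haplotypes supplied as equal-length strings of SNP alleles, measures diversity at each site as the probability that two sampled chromosomes differ, and averages over windows defined by SNP count rather than physical distance. All names and the toy data are hypothetical.

```python
from collections import Counter


def site_diversity(alleles):
    """Unbiased probability that two sampled chromosomes differ at one site."""
    n = len(alleles)
    if n < 2:
        return 0.0
    homozygosity = sum((c / n) ** 2 for c in Counter(alleles).values())
    return (n / (n - 1)) * (1.0 - homozygosity)


def windowed_diversity(haplotypes, window, step):
    """Mean per-site diversity in windows of `window` SNPs, moved by `step`."""
    n_sites = len(haplotypes[0])
    per_site = [
        site_diversity([h[i] for h in haplotypes]) for i in range(n_sites)
    ]
    results = []
    for start in range(0, n_sites - window + 1, step):
        chunk = per_site[start:start + window]
        results.append((start, sum(chunk) / window))
    return results


# Toy data: the first three SNPs are variable, the last three are monomorphic,
# as might be seen in the immediate vicinity of a completed sweep.
toy_haplotypes = ["ACGTTT", "AGGTTT", "CCATTT", "CGATTT"]
scan = windowed_diversity(toy_haplotypes, window=3, step=3)
```

Windows whose mean diversity falls well below the genome-wide level would be candidates for a sweep, whereas unusually high values might point towards balancing selection.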

• Presence of high-frequency derived alleles

Derived alleles (i.e., those created by mutation of ancestral alleles) usually exist at low frequency in a population. Should these alleles be linked to an allele that undergoes a selective sweep, they will be drawn towards fixation through genetic hitchhiking. Loci undergoing selective sweeps may therefore be identified by the presence of derived alleles at unusually high frequency.

This analysis requires knowledge of a population's ancestral alleles, in order that they may be distinguished from derived alleles. In A. thaliana, ancestral genotypes cannot be inferred with any confidence, since the population structure and degree of admixture render any attempt futile (see Chapter 2); therefore, this method of detecting selection was not used in this project.
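For completeness, the calculation that this class of method relies upon is sketched below, although it was not applied in this project. The sketch assumes that the ancestral allele at each SNP is known (or marked as None where it is not), which is precisely the information unavailable for A. thaliana. The function names and the threshold are hypothetical.

```python
def derived_allele_frequencies(haplotypes, ancestral):
    """Frequency of non-ancestral alleles at each site (None if unknown)."""
    n = len(haplotypes)
    frequencies = []
    for i, ancestral_allele in enumerate(ancestral):
        if ancestral_allele is None:
            frequencies.append(None)  # ancestral state could not be inferred
            continue
        derived = sum(1 for h in haplotypes if h[i] != ancestral_allele)
        frequencies.append(derived / n)
    return frequencies


def high_frequency_derived_sites(frequencies, threshold):
    """Indices of sites whose derived allele frequency reaches `threshold`."""
    return [
        i for i, f in enumerate(frequencies)
        if f is not None and f >= threshold
    ]


# Toy data: at site 0 the derived allele G is carried by 4 of 5 chromosomes.
toy_haplotypes = ["AT", "GT", "GT", "GT", "GC"]
freqs = derived_allele_frequencies(toy_haplotypes, ["A", "T"])
candidates = high_frequency_derived_sites(freqs, threshold=0.75)
```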

• Population differentiation

If a population is divided into relatively distinct sub-populations, then large differences in allele frequencies between sub-populations may be indicative of a selective sweep (Kreitman 2000; Sabeti et al. 2007). Distinguishing the precise cause of observations of this nature in the absence of additional information is often extremely challenging, however, as the same observations may very often be attributed with at least equal plausibility to demographic effects.

Since research in this chapter set out explicitly to develop a means of distinguishing between demographic and selective effects, this method was not employed.
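Although this method was not employed, the quantity usually examined can be sketched briefly. The example assumes a single biallelic SNP sampled in two sub-populations of equal weight and computes a simple fixation index from expected heterozygosities; the function names are hypothetical, and more careful estimators correct for sample size. A high value indicates differentiation only, and does not by itself separate selection from demography.

```python
def allele_frequency(alleles, allele):
    """Proportion of sampled chromosomes carrying `allele`."""
    return sum(1 for a in alleles if a == allele) / len(alleles)


def fst_two_populations(pop1, pop2, allele):
    """Fixation index (H_T - H_S) / H_T for one biallelic SNP, two groups."""
    p1 = allele_frequency(pop1, allele)
    p2 = allele_frequency(pop2, allele)
    p_mean = (p1 + p2) / 2
    h_total = 2 * p_mean * (1 - p_mean)      # heterozygosity if pooled
    if h_total == 0:
        return 0.0                           # site is monomorphic overall
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Toy data: complete differentiation versus identical allele frequencies.
fixed_difference = fst_two_populations(["A", "A", "A"], ["G", "G", "G"], "A")
no_difference = fst_two_populations(["A", "G"], ["G", "A"], "A")
```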

• Haplotype length

Loci undergoing a selective sweep are likely to maintain linkage with nearby alleles as the sweep progresses (as described in the previous section). Loci under selection are therefore identifiable due to the preservation of a greater degree of linkage than expected for their observed frequency.

Detection of selection via haplotypes may only detect very recent selection events, since large haplotypes tend to break down rapidly. On the other hand, the haplotype-based detection method is capable of detecting partial sweeps (in which the allele under selection rises in frequency, but does not reach fixation), and is relatively unaffected by any potential biases arising from the choice of SNPs to use in the analysis (see Chapter 2.3.1). This method of detection is therefore ideally suited both to the data available to this project and to its goals.
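One widely used statistic of this kind is extended haplotype homozygosity (EHH), introduced by Sabeti et al. (2002): the probability that two randomly chosen chromosomes carrying a given core allele are identical over the whole interval from the core to a given distance. The sketch below is a minimal illustration under stated assumptions (phased haplotypes as equal-length strings, distance measured in SNPs to the right of the core, hypothetical function names and toy data), and is not necessarily the calculation performed in this chapter.

```python
from collections import Counter


def pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def ehh(haplotypes, core_index, core_allele, distance):
    """EHH among carriers of `core_allele`, `distance` SNPs right of the core."""
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    if len(carriers) < 2:
        raise ValueError("at least two chromosomes must carry the core allele")
    end = core_index + distance + 1
    extended = Counter(h[core_index:end] for h in carriers)
    identical_pairs = sum(pairs(c) for c in extended.values())
    return identical_pairs / pairs(len(carriers))


# Toy data: four chromosomes carry core allele A at position 0.
toy_haplotypes = ["ACGT", "ACGT", "ACGA", "ACTA", "GCGT"]
decay = [ehh(toy_haplotypes, 0, "A", d) for d in range(4)]
```

In the toy data EHH stays at 1 for the first flanking SNP and then decays as recombinant or mutated haplotypes appear; a core allele that is both common and slow to decay would be a candidate for recent selection.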

4.1.4 DISEASE RESISTANCE IN A. THALIANA: MODEL PLANT MEETS MODEL