
International HapMap Project, 1000 Genomes Project and Encode

1.5   Identifying susceptibility genes in complex disease


alleles that were present on the particular chromosomal background on which it arose, and this association is measured by the amount of LD, i.e. alleles within a haplotype show LD and reside between recombination hotspots.

The concept of LD centres on the non-random association of alleles at different loci. Natural selection, or chance, caused the spread of common SNP mutations that arose thousands of generations ago. A second mutation occurring later but close to an earlier one results in both alleles being transmitted together to the same offspring in subsequent generations. It is this model that is exploited in a GWAS (Xiong and Guo 1997). An increased risk of disease caused by one SNP denotes direct association between that SNP and disease in the population, and indirect association between several nearby SNPs and disease due to LD. Therefore it is possible to identify association in a chromosomal region without genotyping every SNP in a GWAS, i.e. by using tagging SNPs. LD is prone to decay by recombination (since the probability of recombination increases with distance, the strength of LD between loci declines with distance), recurrence of the same mutation and gene conversion.
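As an illustration of how the "amount of LD" between two loci can be quantified, the following minimal sketch computes three commonly used measures (D, |D'| and r²) from haplotype and allele frequencies, and models the decay of D through recombination. The text above does not specify which measure is used, so the choice of measures, the function names and the input frequencies are assumptions made for illustration only. The decay formula assumes a large, randomly mating population with a constant recombination fraction c between the two loci, and ignores recurrent mutation and gene conversion.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: population frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    # D: excess of the observed haplotype frequency over the frequency
    # expected if the two alleles were associated at random.
    d = p_ab - p_a * p_b
    # D_max: largest magnitude D can take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    # r^2: squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


def ld_after_generations(d0, c, t):
    """Expected D after t generations, given recombination fraction c."""
    # Each generation, recombination breaks down a fraction c of the association.
    return d0 * (1 - c) ** t


if __name__ == "__main__":
    # Hypothetical frequencies: A and B always occur on the same haplotype.
    print(ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5))
    # Tightly linked loci (c = 0.001) retain most of their LD after 100 generations.
    print(ld_after_generations(d0=0.25, c=0.001, t=100))
```

A tagging SNP is useful when r² between it and the untyped SNP is high, because the power to detect an indirect association falls as r² falls.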

 

1.5.2 International HapMap Project, 1000 Genomes Project and Encode

The International HapMap Project commenced in 2002 with the aim of mapping all common genetic variation (MAF greater than 5%) across 11 populations (1,400 individuals), equating to 3.5 million SNPs. There have been 26 data releases so far, capturing approximately 90% of genetic variation in the Caucasian population by using high-throughput genotyping chips (Consortium 2003; Thorisson, Smith et al. 2005). This dataset was the first to describe the different types of variants, where they occur in our DNA and their distribution within and amongst populations. By comparing 1,400 individual DNA sequences, haplotypes could be deciphered by mapping chromosomal regions of shared genetic variants. This preceded the initiation and rise of many GWA studies, as the HapMap provided a detailed measurement of genetic variation and LD patterns across major populations, as well as the identification of tag SNPs that act as haplotype markers (Smith, Wang et al. 2006). Over the last decade the quantity of known variation has increased from the 20% discovered by the HGP to 90% of mapped human variation with the help of HapMap and other similar projects.

The 1000 Genomes Project (1000G) was set up in 2007 with the goal of identifying 95% of SNPs present at a frequency of at least 1% in a range of populations (www.1000genomes.org). In the pilot phase, which commenced in 2008, three different strategies were used: high coverage sequencing of family trios to obtain true phasing of the variants detected; low coverage sequencing of many individuals (179) to allow broader detection of variants, but requiring statistical phasing; and sequencing of specific exon targets in a larger number of individuals (700) to allow detection of rare variants, which would remain unphased (Durbin, Abecasis et al. 2010). A main goal here was to reconstruct haplotypes using all variants typed from all datasets. The more recently published phase one dataset includes the genomes of 1,092 individuals from 14 populations (Abecasis, Auton et al. 2012). In this paper, functional variation was mapped by a combination of low coverage whole genome sequence data (2-6x read depth), targeted deep exome sequence data (50-100x), and dense SNP genotype data. The phase two dataset compiled in 2011 includes 1,715 individuals from 19 populations. The final phase three includes an additional 2,500 African and South Asian samples. This public reference catalogue of human genetic variation is already being used for imputation and will aid in identifying previously missed associations and provide a filter in Mendelian disease for exclusionary purposes.

Another project, the Encyclopedia of DNA Elements (Encode), published a myriad of papers in 2012 based on the identification of transcribed regions, transcription factor association, chromatin structure and histone modifications in the human genome. This project differs completely from the genotype-based HapMap and 1000G projects and focuses on functional elements of gene products, giving previously unknown insights into gene regulation and how statistical associations with disease correspond to these functional elements (Dunham, Kundaje et al. 2012).

1.5.3 Family based studies

Family based designs for the investigation of inherited disease have been used since Mendel’s laws of inheritance came to dominate the fundamental concepts of genetics. Studies of extended pedigrees have several favourable features for novel gene discovery: causative gene pathways are more homogeneous and there is a certain level of phenotypic control against genetic background and environmental exposures (Borecki and Province 2008). Gene mapping strategies utilize linkage and association studies, both of which use family data, but association studies can also be performed with unrelated individuals. A commonly used family based association test is the transmission disequilibrium test (TDT), first introduced in 1993 (Spielman, McGinnis et al. 1993). A TDT uses parents as controls for the cases, who are the affected offspring, so any confounding effects of population stratification are removed. The purpose of the test is to determine whether the disease allele is transmitted from parent to offspring more often in a disease population, using genetic markers in nuclear families (trios) and mapping disequilibrium between the marker allele and the disease locus. If the disease allele is transmitted to unrelated cases more often than expected by chance, this implicates a linked allele that is associated with the disease mutation. If the allele is only seen in related cases, then it becomes a test of linkage, not association. In essence, the TDT combines linkage and association approaches in cases where either approach performed separately has failed to provide a positive result. This test has been extended to include all family members and genotypic information (Abecasis, Cookson et al. 2000).
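The logic of the TDT for a single biallelic marker can be sketched as follows. Only heterozygous parents are informative: b counts the times such a parent transmitted the candidate allele to an affected child, c counts the times it transmitted the other allele, and the McNemar-type statistic (b - c)² / (b + c) is compared with a chi-square distribution with one degree of freedom. The genotype coding (number of copies of the candidate allele), the function names and the example trios are assumptions made for illustration. The sketch handles complete trios only and does not cover the extension to wider pedigrees.

```python
import math


def count_transmissions(trios):
    """Count allele transmissions from heterozygous parents.

    Each trio is (father, mother, child); every genotype is coded as the
    number of copies (0, 1 or 2) of the candidate allele.
    Returns (b, c): transmissions and non-transmissions of the allele.
    """
    b = c = 0
    for father, mother, child in trios:
        parents = (father, mother)
        n_het = sum(1 for g in parents if g == 1)
        # A homozygous parent always passes on g // 2 copies (0 or 1).
        from_hom = sum(g // 2 for g in parents if g != 1)
        # Whatever remains in the child came from the heterozygous parent(s).
        from_het = child - from_hom
        if not 0 <= from_het <= n_het:
            raise ValueError(f"Mendelian inconsistency in trio {(father, mother, child)}")
        b += from_het
        c += n_het - from_het
    return b, c


def tdt(b, c):
    """Return (chi-square statistic, p-value) for the TDT with 1 df."""
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical data: 60 trios in which the heterozygous father transmitted
    # the allele, and 40 in which he did not (the mother carries no copies).
    trios = [(1, 0, 1)] * 60 + [(1, 0, 0)] * 40
    b, c = count_transmissions(trios)
    print(b, c, tdt(b, c))
```

Because the comparison is made within each family, between the transmitted and untransmitted alleles of the same parent, differences in allele frequency between subpopulations do not bias the statistic.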

Where association analysis is powerful for the detection of common alleles that confer modest disease risk, linkage analysis is more powerful for identifying high-risk disease alleles. The independence of segregation, as inferred by Mendel’s law of independent assortment, does not always hold: there are groups of traits which are linked, and the genes controlling them tend to be inherited together by the offspring as a group, not independently. This is the underlying principle of a linkage study: if two individuals are phenotypically similar, i.e. carry disease, then a genetic marker located near a disease susceptibility gene must also be