
Chapter 3: Whole-genome analysis and

3.2.2 Comparison of original and re-sequenced whole genomes of Rd and R2866

3.2.2.2 Identification of indels

Indels, like SNPs, are a form of nucleotide-level genetic variation in which one or more nucleotides are inserted or deleted in a query DNA sequence relative to a reference DNA sequence. Indels were identified between the original and re-sequenced whole genomes of Rd and R2866. As with SNPs, the original genome was used as the reference for detecting indels and the re-sequenced genome as the query.
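As an illustration of this reference/query convention, the minimal Python sketch below classifies a single indel from its two alleles. It assumes the anchored notation used in Tables 3.3 and 3.4, in which both alleles share the same leading base (for example Ref A, Que AG), and it labels the event from the point of view of the query. The function name `classify_indel` is hypothetical and is not part of the analysis pipeline used in this work.

```python
from __future__ import annotations


def classify_indel(ref: str, que: str) -> tuple[str, int]:
    """Return (kind, length) for an anchored Ref/Que allele pair.

    Assumption: both alleles start with the same anchor base, as in
    Tables 3.3 and 3.4. "insertion" means the query carries extra bases
    relative to the reference; "deletion" means it lacks them.
    """
    if not ref or not que or ref[0] != que[0]:
        raise ValueError("alleles must share the same leading anchor base")
    diff = len(que) - len(ref)
    if diff == 0:
        raise ValueError("equal-length alleles do not describe an indel")
    kind = "insertion" if diff > 0 else "deletion"
    return kind, abs(diff)


print(classify_indel("A", "AG"))   # ('insertion', 1)
print(classify_indel("TAA", "T"))  # ('deletion', 2)
```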

 

A total of 226 indels were detected in the Rd strain. Of these, 93 were located in intergenic regions and the remaining 133 in genes. Sixty-one indels were identified in 45 genes that were annotated as pseudogenes in the original published Rd genome sequence (see Table 3.3). The majority of these indels corrected frameshifts in the coding sequence, which were most likely the reason the genes had been annotated as pseudogenes. Only three indels, in the HI0247, HI1099a and HI1268 genes, did not obviously correct the reading frame.
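The way a single-base indel can restore a reading frame can be sketched in Python as follows. This is a toy example under stated assumptions: the sequence is invented for illustration and is not taken from the Rd genome, the coordinate is treated as the 1-based position of the anchor base, and the helper names `apply_indel`, `codons` and `shifts_frame` are hypothetical.

```python
from __future__ import annotations


def apply_indel(seq: str, pos: int, ref: str, que: str) -> str:
    """Replace the Ref allele at 1-based position `pos` with the Que allele."""
    start = pos - 1
    if seq[start:start + len(ref)] != ref:
        raise ValueError("Ref allele does not match the sequence at this position")
    return seq[:start] + que + seq[start + len(ref):]


def codons(seq: str) -> list[str]:
    """Split a sequence into complete codons, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def shifts_frame(ref: str, que: str) -> bool:
    """True when the net length change is not a multiple of three."""
    return (len(que) - len(ref)) % 3 != 0


# Hypothetical "original" sequence lacking one base: the stop codon TAA
# is no longer read in frame.
original = "ATGGTGAAACCTAA"
print(codons(original))          # ['ATG', 'GTG', 'AAA', 'CCT']

# A 1-bp insertion (Ref G, Que GC at position 4) brings TAA back into frame.
resequenced = apply_indel(original, 4, "G", "GC")
print(codons(resequenced))       # ['ATG', 'GCT', 'GAA', 'ACC', 'TAA']
print(shifts_frame("G", "GC"))   # True
```

A length change that is a multiple of three (for example a hypothetical Ref ACGT, Que A) would leave the downstream frame unchanged, so `shifts_frame` returns False for it.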

                       

Table 3.3: Indels present in pseudogenes in the Rd strain. The original (reference; Ref) and re-sequenced (query; Que) genomes were compared.

Coordinate (bp)   Ref   Que   Pseudogene
108367            A     AG    HI0101
130905            AT    A     HI0116
166073            A     AT    HI0148.1
170592            C     CT    dcuB
265439            TC    T     HI0234
266933            G     GT    perM
277114            TAA   T     HI0247
313016            AG    A     HI0279m
354945            G     GC    HI0326
369921            G     GC    napF
370788            GC    G     napA
576723            GA    G     devB
607968            G     GC    HI0585
654433            GC    G     HI0620.1
787178            T     TAA   HI0732
787337            A     AT
787392            A     AT
892480            A     AG    HI0842
921808            AG    A     HI0869
922034            CA    C
929031            CA    C     pepB
1014632           AG    A     HI0956
1034913           T     TA    HI0976
1082187           T     TG    HI1018
1127714           G     GT    HI1063
1128978           T     TC
1128980           T     TC
1144510           CG    C     pnuC
1144973           AC    A
1161063           C     CA    HI1099a
1195317           TG    T     HI1126.1
1195324           TG    T
1195695           C     CA
1249451           G     GC    HI1183
1257711           TA    T     HI1191
1257768           TG    T
1257775           TG    T
1257781           TG    T
1307904           T     TA    HI1235
1330978           AG    A     HI1254
1331102           C     CT
1331439           C     CT
1347623           C     CA    HI1268
1377957           T     TG    dgt
1476596           TC    T     pstB
1523755           GC    G     HI1434.2
1525290           GT    G     thiI
1541685           G     GA    HI1458m
1569019           G     GC    mor
1580044           G     GC    HI1506
1607054           GT    G     HI1534
1608081           G     GC
1634585           TC    T     HI1565m
1637524           G     GC    HI1570
1648801           A     AT    HI1581
1662092           A     AG    ftsK
1662266           AT    A
1662939           AC    A
1687060           TG    T     HI1619
1734148           G     GC    HI1666
1789588           GC    G     HI1718

The remaining 71 indels were detected in 51 coding sequences (see Table 3.4). The majority of these indels were located at either the very start or the very end of a gene and therefore did not greatly alter gene length. Six indels were located in coding sequences that were split across two reading frames in the original genome; these indels corrected the frameshift. Seven indels were located in the middle of coding sequences and had no effect on gene length. Only three indels were located in the middle of a coding sequence and resulted in a significantly truncated gene. The distribution of all indels and SNPs in the Rd genome is depicted in Figure 3.1.

                                 

Table 3.4: Indels identified in protein-coding genes in the Rd strain. The original (reference; Ref) and re-sequenced (query; Que) genomes were compared.

Coordinate (bp)   Ref   Que     Gene       Gene product                                              Comments
52066             C     CA      HI0050m    Integral membrane protein transporter                     Start of gene
77924             A     AT      ppnK       Inorganic polyphosphate/ATP-NAD kinase                    End of gene
104315            A     AG      hitA       Iron-utilisation periplasmic protein hFbpA                Corrects frameshift
142122            TG    T       fbpC       Ferric transporter ATP-binding protein                    End of gene
142152            TA    T
142169            TC    T
142202            GC    G
142225            TC    T
142237            GC    G
143334            G     GCA     afuB       Ferric transport system permease-like protein             End of gene
162571            G     GCA     HI0147     Hypothetical protein                                      End of gene
162605            G     GA
201555            AG    A       tatA       Sec-independent protein secretion pathway component TatA  Start of gene
201574            AG    A
391547            AG    A       HI0367     Hypothetical protein                                      End of gene
401967            AC    A       HI0380.2   tRNA-Lys                                                  Start of gene
511410            CA    C       aphA       Acid phosphatase/phosphotransferase                       Corrects frameshift
560060            GC    G       ureH       Urease accessory protein                                  End of gene
579334            A     AT      HI0559.1   Hypothetical protein                                      End of gene
588149            AG    A       tex        Transcription accessory protein                           End of gene
608500            GC    G       pepE       Peptidase E                                               End of gene
620939            G     GT      ccrB       Camphor resistance protein CrcB                           Possibly true
706625            TC    T       oapA       Hemoglobin-binding protein                                Start of gene
721475            G     GT      HI0680     RarD protein                                              End of gene
850970            A     AG      rpsM       30S ribosomal protein S13                                 End of gene
926774            G     GA      HI0874     Hypothetical protein                                      End of gene
989395            A     AT      HI0930     Hypothetical protein                                      Start of gene
1036342           TG    T       prmA       Ribosomal protein L11 methyltransferase                   Middle of gene; no effect
1036357           AT    A
1036363           GA    G
1068024           GC    G       HI1004     Peptidyl-prolyl cis-trans isomerase                       Corrects frameshift
1104427           C     CT      ureF       Urease accessory protein                                  Start of gene
1229388           TC    T       HI1159m    Thioredoxin domain-containing protein                     Corrects frameshift
1231800           T     TCCGC   HI1162     Hypothetical protein                                      Start of gene
1244068           TG    T       HI1174     Opacity protein                                           Start of gene
1277863           A     AG      lysS       Lysyl-tRNA synthetase                                     Middle of gene; no effect
1277877           GA    G
1290107           CA    C       cmk        Cytidylate kinase                                         End of gene
1290126           TC    T
1300566           TC    T       lpdA       Dihydrolipoamide dehydrogenase                            End of gene
1370921           A     AGC     truB       tRNA pseudouridine synthase B                             End of gene
1375515           CG    C       HI1296     Nuclease                                                  End of gene
1375546           A     AT
1394146           TC    T       HI1317     Hypothetical protein                                      Start of gene
1394184           TC    T
1449975           AT    A       HI1364     Transcriptional regulator                                 End of gene
1449988           TG    T
1449998           GT    G
1480065           G     GA      pstS       Phosphate ABC transporter substrate-binding protein       Corrects frameshift
1505885           G     GC      HI1410     Terminase large subunit-like protein                      Start of gene
1509059           G     GC      HI1418     Hypothetical protein                                      End of gene
1527279           G     GCAC    ispA       Geranyltranstransferase                                   End of gene
1564858           C     CG      muB        DNA transposition protein                                 End of gene
1564863           G     GC
1570070           T     TG      HI1493     Hypothetical protein                                      Possibly true
1571863           CG    C       HI1498.1   Hypothetical protein                                      End of gene
1587908           A     AG      HI1516m    Unannotated                                               Possibly true
1617702           CA    C       HI1546     Hypothetical protein                                      End of gene
1621283           G     GA      bioD       Dithiobiotin synthetase                                   End of gene
1647894           CA    C       lpp        15 kDa peptidoglycan-associated lipoprotein               End of gene
1647913           CA    C
1647920           CA    C
1675969           G     GT      tyrS       Tyrosyl-tRNA synthetase                                   End of gene
1689811           AT    A       HI1625     Hypothetical protein                                      End of gene
1691975           C     CA      HI1629     Hypothetical protein                                      End of gene
1691994           T     TA
1697887           A     AT      purR       DNA-binding transcriptional repressor PurR                Middle of gene; no effect
1697907           TG    T
1718873           A     AG      tldD       Hypothetical protein                                      Corrects frameshift
1738448           G     GC      HI1670     Solute/DNA competence effector                            End of gene
1775364           ATC   A       HI1704     Hypothetical protein                                      End of gene
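One possible reading of the "no effect" comments in Table 3.4 is that the indels within each of these genes sum to a net length change that is a multiple of three, so the frame downstream of the last indel is unchanged. The short Python sketch below checks this for the prmA, lysS and purR rows of the table. The interpretation is an assumption made here for illustration and is not stated in the analysis itself; the variable names are hypothetical.

```python
from collections import defaultdict

# (gene, Ref, Que) rows copied from Table 3.4 for the three genes
# annotated "Middle of gene; no effect".
rows = [
    ("prmA", "TG", "T"), ("prmA", "AT", "A"), ("prmA", "GA", "G"),
    ("lysS", "A", "AG"), ("lysS", "GA", "G"),
    ("purR", "A", "AT"), ("purR", "TG", "T"),
]

# Sum the length change (query minus reference) over all indels in each gene.
net_change = defaultdict(int)
for gene, ref, que in rows:
    net_change[gene] += len(que) - len(ref)

for gene, change in net_change.items():
    status = "in frame" if change % 3 == 0 else "frameshift"
    print(gene, change, status)
# prmA -3 in frame
# lysS 0 in frame
# purR 0 in frame
```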

                             

Figure 3.1: Positions of indels and SNPs in the Rd genome. The genomic locations of indels and SNPs are shown in two concentric circles. The outer circle shows the positions of all identified indels, while the inner circle shows the positions of all identified SNPs.

Only one indel was detected in the R2866 strain, at coordinate 953,610 bp in the original genome. It was located in an intergenic region containing several tetranucleotide repeats. The indel itself comprised a single GCAA repeat unit, which was present in the original R2866 genome but absent from the re-sequenced genome.
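A difference of one repeat unit of this kind can be illustrated by counting tandem copies of the unit in each sequence. The Python sketch below is a minimal example under stated assumptions: the two sequences and their repeat counts (five and four copies) are invented for illustration, since the actual number of GCAA copies at this locus is not given here, and the function name `longest_tandem_run` is hypothetical.

```python
import re


def longest_tandem_run(seq: str, unit: str) -> int:
    """Return the largest number of consecutive copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical tracts: flanking bases and copy numbers are made up.
original = "TTAC" + "GCAA" * 5 + "GGTA"
resequenced = "TTAC" + "GCAA" * 4 + "GGTA"

print(longest_tandem_run(original, "GCAA"))      # 5
print(longest_tandem_run(resequenced, "GCAA"))   # 4
```

The difference of one copy between the two counts corresponds to a 4-bp deletion in the query relative to the reference.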