Identification of indels

Chapter 3: Whole-genome analysis and

3.2.2 Comparison of original and re-sequenced whole genomes of Rd and R2866

3.2.2.2 Identification of indels

Indels, like SNPs, are a form of nucleotide-level genetic variation: relative to a reference DNA sequence, a query DNA sequence carries an insertion or a deletion of one or more nucleotides. Indels were identified between the original and re-sequenced whole genomes of Rd and R2866. As with SNPs, the original genome served as the reference for indel detection and the re-sequenced genome as the query.
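As a minimal sketch of the reference/query convention described above, the following Python snippet classifies an indel from its Ref and Que alleles by comparing their lengths. It is an illustration only, not the detection method used in this work; the function name `classify_indel` is hypothetical, and the two example allele pairs are taken from Table 3.3.

```python
from typing import Tuple


def classify_indel(ref: str, que: str) -> Tuple[str, int]:
    """Return the indel type and its length in bp for a Ref/Que allele pair."""
    if len(ref) == len(que):
        raise ValueError("alleles of equal length do not describe an indel")
    # A longer query allele means bases were gained relative to the reference.
    kind = "insertion" if len(que) > len(ref) else "deletion"
    return kind, abs(len(que) - len(ref))


print(classify_indel("A", "AG"))   # HI0101, coordinate 108367 -> ('insertion', 1)
print(classify_indel("TAA", "T"))  # HI0247, coordinate 277114 -> ('deletion', 2)
```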

A total of 226 indels were detected in the Rd strain. Of these, 93 were located in intergenic regions and the remaining 133 in genes. Sixty-one indels were identified in 45 genes that were annotated as pseudogenes in the original published Rd genome sequence (see Table 3.3). Most of these indels lay within the coding sequence and corrected frameshifts, which were most likely the reason the genes had been annotated as pseudogenes. Only three indels, in the HI0247, HI1099a and HI1268 genes, did not obviously correct the reading frame.
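The idea of an indel correcting a frameshift can be illustrated with a toy example. The sketch below uses a short hypothetical coding sequence (not taken from the Rd genome) in which a missing base causes a premature stop codon; applying a single-base insertion restores the full-length reading frame. The helper names `apply_indel` and `codons_before_stop` are assumptions made for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def apply_indel(seq, pos, ref, que):
    """Replace the allele `ref` at 0-based position `pos` of `seq` with `que`."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + que + seq[pos + len(ref):]


def codons_before_stop(seq):
    """Count in-frame codons from position 0 up to the first stop codon."""
    count = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return count
        count += 1
    return count


# Hypothetical "original" sequence: a missing base shifts the frame,
# so the codons read ATG GTG AAT TGA and translation stops early.
original = "ATGGTGAATTGACCAAATAA"
# Hypothetical indel (Ref "G", Que "GC") restoring the missing base.
resequenced = apply_indel(original, 3, "G", "GC")

print(codons_before_stop(original))     # 3
print(codons_before_stop(resequenced))  # 6
```

A check of this kind only shows that the reading frame downstream of the indel has changed; deciding whether a gene is genuinely restored still requires inspecting the resulting open reading frame.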

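Table 3.3: Indels present in pseudogenes in the Rd strain. The original (reference; Ref) and re-sequenced (query; Que) genomes were compared.

| Coordinate (bp) | Ref | Que | Pseudogene |
|---|---|---|---|
| 108367 | A | AG | HI0101 |
| 130905 | AT | A | HI0116 |
| 166073 | A | AT | HI0148.1 |
| 170592 | C | CT | dcuB |
| 265439 | TC | T | HI0234 |
| 266933 | G | GT | perM |
| 277114 | TAA | T | HI0247 |
| 313016 | AG | A | HI0279m |
| 354945 | G | GC | HI0326 |
| 369921 | G | GC | napF |
| 370788 | GC | G | napA |
| 576723 | GA | G | devB |
| 607968 | G | GC | HI0585 |
| 654433 | GC | G | HI0620.1 |
| 787178 | T | TAA | HI0732 |
| 787337 | A | AT | HI0732 |
| 787392 | A | AT | HI0732 |
| 892480 | A | AG | HI0842 |
| 921808 | AG | A | HI0869 |
| 922034 | CA | C | HI0869 |
| 929031 | CA | C | pepB |
| 1014632 | AG | A | HI0956 |
| 1034913 | T | TA | HI0976 |
| 1082187 | T | TG | HI1018 |
| 1127714 | G | GT | HI1063 |
| 1128978 | T | TC | HI1063 |
| 1128980 | T | TC | HI1063 |
| 1144510 | CG | C | pnuC |
| 1144973 | AC | A | pnuC |
| 1161063 | C | CA | HI1099a |
| 1195317 | TG | T | HI1126.1 |
| 1195324 | TG | T | HI1126.1 |
| 1195695 | C | CA | HI1126.1 |
| 1249451 | G | GC | HI1183 |
| 1257711 | TA | T | HI1191 |
| 1257768 | TG | T | HI1191 |
| 1257775 | TG | T | HI1191 |
| 1257781 | TG | T | HI1191 |
| 1307904 | T | TA | HI1235 |
| 1330978 | AG | A | HI1254 |
| 1331102 | C | CT | HI1254 |
| 1331439 | C | CT | HI1254 |
| 1347623 | C | CA | HI1268 |
| 1377957 | T | TG | dgt |
| 1476596 | TC | T | pstB |
| 1523755 | GC | G | HI1434.2 |
| 1525290 | GT | G | thiI |
| 1541685 | G | GA | HI1458m |
| 1569019 | G | GC | mor |
| 1580044 | G | GC | HI1506 |
| 1607054 | GT | G | HI1534 |
| 1608081 | G | GC | HI1534 |
| 1634585 | TC | T | HI1565m |
| 1637524 | G | GC | HI1570 |
| 1648801 | A | AT | HI1581 |
| 1662092 | A | AG | ftsK |
| 1662266 | AT | A | ftsK |
| 1662939 | AC | A | ftsK |
| 1687060 | TG | T | HI1619 |
| 1734148 | G | GC | HI1666 |
| 1789588 | GC | G | HI1718 |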

The remaining 71 indels were detected in 51 coding sequences (see Table 3.4). Most of these were located at either the very start or the very end of a gene and therefore did not greatly alter gene length. Six indels were present in coding sequences that were split across two reading frames in the original genome, and corrected the frameshift. Seven indels were located in the middle of coding sequences without any effect on gene length. Only three indels were located in the middle of a coding sequence and resulted in a significantly truncated gene. The distribution of all indels and SNPs in the Rd genome is depicted in Figure 3.1.
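One way to see why the seven mid-gene indels need not affect gene length is to sum the length changes per gene. The sketch below is an illustration rather than the analysis performed in this work: it takes the seven rows of Table 3.4 annotated as "Middle of gene; no effect" and computes the net change in base pairs for each gene. The function name `net_length_change` is hypothetical.

```python
# (coordinate, Ref, Que, gene) for the seven indels annotated in Table 3.4
# as "Middle of gene; no effect"
mid_gene_indels = [
    (1036342, "TG", "T", "prmA"),
    (1036357, "AT", "A", "prmA"),
    (1036363, "GA", "G", "prmA"),
    (1277863, "A", "AG", "lysS"),
    (1277877, "GA", "G", "lysS"),
    (1697887, "A", "AT", "purR"),
    (1697907, "TG", "T", "purR"),
]


def net_length_change(rows):
    """Sum len(Que) - len(Ref) per gene; a negative value is a net deletion."""
    totals = {}
    for _coord, ref, que, gene in rows:
        totals[gene] = totals.get(gene, 0) + len(que) - len(ref)
    return totals


totals = net_length_change(mid_gene_indels)
print(totals)  # {'prmA': -3, 'lysS': 0, 'purR': 0}

# A net change that is a multiple of three leaves the downstream frame intact.
frame_preserved = {gene: change % 3 == 0 for gene, change in totals.items()}
print(frame_preserved)  # {'prmA': True, 'lysS': True, 'purR': True}
```

For these rows the net change per gene is a multiple of three, which is consistent with the reading frame downstream of the indels being unchanged.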

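Table 3.4: Indels identified in protein-coding genes in the Rd strain. The original (reference; Ref) and re-sequenced (query; Que) genomes were compared.

| Coordinate (bp) | Ref | Que | Gene | Gene product | Comments |
|---|---|---|---|---|---|
| 52066 | C | CA | HI0050m | Integral membrane protein transporter | Start of gene |
| 77924 | A | AT | ppnK | Inorganic polyphosphate/ATP-NAD kinase | End of gene |
| 104315 | A | AG | hitA | Iron-utilisation periplasmic protein hFbpA | Corrects frameshift |
| 142122 | TG | T | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 142152 | TA | T | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 142169 | TC | T | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 142202 | GC | G | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 142225 | TC | T | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 142237 | GC | G | fbpC | Ferric transporter ATP-binding protein | End of gene |
| 143334 | G | GCA | afuB | Ferric transport system permease-like protein | End of gene |
| 162571 | G | GCA | HI0147 | Hypothetical protein | End of gene |
| 162605 | G | GA | HI0147 | Hypothetical protein | End of gene |
| 201555 | AG | A | tatA | Sec-independent protein secretion pathway component TatA | Start of gene |
| 201574 | AG | A | tatA | Sec-independent protein secretion pathway component TatA | Start of gene |
| 391547 | AG | A | HI0367 | Hypothetical protein | End of gene |
| 401967 | AC | A | HI0380.2 | tRNA-Lys | Start of gene |
| 511410 | CA | C | aphA | Acid phosphatase/phosphotransferase | Corrects frameshift |
| 560060 | GC | G | ureH | Urease accessory protein | End of gene |
| 579334 | A | AT | HI0559.1 | Hypothetical protein | End of gene |
| 588149 | AG | A | tex | Transcription accessory protein | End of gene |
| 608500 | GC | G | pepE | Peptidase E | End of gene |
| 620939 | G | GT | ccrB | Camphor resistance protein CrcB | Possibly true |
| 706625 | TC | T | oapA | Hemoglobin-binding protein | Start of gene |
| 721475 | G | GT | HI0680 | RarD protein | End of gene |
| 850970 | A | AG | rpsM | 30S ribosomal protein S13 | End of gene |
| 926774 | G | GA | HI0874 | Hypothetical protein | End of gene |
| 989395 | A | AT | HI0930 | Hypothetical protein | Start of gene |
| 1036342 | TG | T | prmA | Ribosomal protein L11 methyltransferase | Middle of gene; no effect |
| 1036357 | AT | A | prmA | Ribosomal protein L11 methyltransferase | Middle of gene; no effect |
| 1036363 | GA | G | prmA | Ribosomal protein L11 methyltransferase | Middle of gene; no effect |
| 1068024 | GC | G | HI1004 | Peptidyl-prolyl cis-trans isomerase | Corrects frameshift |
| 1104427 | C | CT | ureF | Urease accessory protein | Start of gene |
| 1229388 | TC | T | HI1159m | Thioredoxin domain-containing protein | Corrects frameshift |
| 1231800 | T | TCCGC | HI1162 | Hypothetical protein | Start of gene |
| 1244068 | TG | T | HI1174 | Opacity protein | Start of gene |
| 1277863 | A | AG | lysS | Lysyl-tRNA synthetase | Middle of gene; no effect |
| 1277877 | GA | G | lysS | Lysyl-tRNA synthetase | Middle of gene; no effect |
| 1290107 | CA | C | cmk | Cytidylate kinase | End of gene |
| 1290126 | TC | T | cmk | Cytidylate kinase | End of gene |
| 1300566 | TC | T | lpdA | Dihydrolipoamide dehydrogenase | End of gene |
| 1370921 | A | AGC | truB | tRNA pseudouridine synthase B | End of gene |
| 1375515 | CG | C | HI1296 | Nuclease | End of gene |
| 1375546 | A | AT | HI1296 | Nuclease | End of gene |
| 1394146 | TC | T | HI1317 | Hypothetical protein | Start of gene |
| 1394184 | TC | T | HI1317 | Hypothetical protein | Start of gene |
| 1449975 | AT | A | HI1364 | Transcriptional regulator | End of gene |
| 1449988 | TG | T | HI1364 | Transcriptional regulator | End of gene |
| 1449998 | GT | G | HI1364 | Transcriptional regulator | End of gene |
| 1480065 | G | GA | pstS | Phosphate ABC transporter substrate-binding protein | Corrects frameshift |
| 1505885 | G | GC | HI1410 | Terminase large subunit-like protein | Start of gene |
| 1509059 | G | GC | HI1418 | Hypothetical protein | End of gene |
| 1527279 | G | GCAC | ispA | Geranyltranstransferase | End of gene |
| 1564858 | C | CG | muB | DNA transposition protein | End of gene |
| 1564863 | G | GC | muB | DNA transposition protein | End of gene |
| 1570070 | T | TG | HI1493 | Hypothetical protein | Possibly true |
| 1571863 | CG | C | HI1498.1 | Hypothetical protein | End of gene |
| 1587908 | A | AG | HI1516m | Unannotated | Possibly true |
| 1617702 | CA | C | HI1546 | Hypothetical protein | End of gene |
| 1621283 | G | GA | bioD | Dethiobiotin synthetase | End of gene |
| 1647894 | CA | C | lpp | 15 kDa peptidoglycan-associated lipoprotein | End of gene |
| 1647913 | CA | C | lpp | 15 kDa peptidoglycan-associated lipoprotein | End of gene |
| 1647920 | CA | C | lpp | 15 kDa peptidoglycan-associated lipoprotein | End of gene |
| 1675969 | G | GT | tyrS | Tyrosyl-tRNA synthetase | End of gene |
| 1689811 | AT | A | HI1625 | Hypothetical protein | End of gene |
| 1691975 | C | CA | HI1629 | Hypothetical protein | End of gene |
| 1691994 | T | TA | HI1629 | Hypothetical protein | End of gene |
| 1697887 | A | AT | purR | DNA-binding transcriptional repressor PurR | Middle of gene; no effect |
| 1697907 | TG | T | purR | DNA-binding transcriptional repressor PurR | Middle of gene; no effect |
| 1718873 | A | AG | tldD | Hypothetical protein | Corrects frameshift |
| 1738448 | G | GC | HI1670 | Solute/DNA competence effector | End of gene |
| 1775364 | ATC | A | HI1704 | Hypothetical protein | End of gene |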

Figure 3.1: Positions of indels and SNPs in the Rd genome. The genomic locations of indels and SNPs are shown in two concentric circles. The outer circle shows the positions of all identified indels, while the inner circle shows the positions of all identified SNPs.

Only one indel was detected in the R2866 strain, at coordinate 953,610 bp of the original genome. It was located in an intergenic region containing several tetranucleotide repeats. The indel consisted of a single GCAA repeat unit, which was present in the original R2866 genome but absent from the re-sequenced genome.
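A difference of one repeat unit can be illustrated by counting tandem copies of GCAA in two sequences. The sketch below uses hypothetical sequences: the flanking bases and the number of repeat units are invented for illustration and do not reflect the actual R2866 locus. The function name `longest_tandem_run` is likewise an assumption.

```python
import re


def longest_tandem_run(seq, unit):
    """Return the largest number of consecutive copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical tract: five GCAA units in the original, four after re-sequencing.
original = "TTCA" + "GCAA" * 5 + "TGGC"
resequenced = "TTCA" + "GCAA" * 4 + "TGGC"

print(longest_tandem_run(original, "GCAA"))     # 5
print(longest_tandem_run(resequenced, "GCAA"))  # 4
print(len(original) - len(resequenced))         # 4 (one repeat unit)
```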

In document The application of high throughput sequencing to study the genome composition and transcriptional response of Haemophilus influenzae (Page 87-93)
