2.3 Results and Discussion
2.3.3 Improvements to the S. hominis genome using long-read sequence data
PacBio sequencing was used to produce an improved assembly of S. hominis strain J31. PacBio sequencing of this strain resulted in one large contig plus a further 4 smaller contigs (Table 2.5). The PacBio assembly was compared to the Illumina assembly by aligning the two and displaying the result as a nucleotide MUMmer plot (Figure 2.1). Regions of similarity between the two assemblies, that is, forward alignments with the same arrangement in both, are shown in red along the line f(x)=x. Reverse alignments are shown in blue, and translocated regions appear as deviations from f(x)=x. The MUMmer plot shows that many of the Illumina contigs are assembled differently from the corresponding regions of the PacBio genome.
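The reading of the plot described above can be sketched in code. This is a minimal illustration, not the procedure used to produce Figure 2.1: the `Alignment` type, the `classify` function and the `tolerance` parameter are hypothetical names, the coordinates are invented, and it assumes that a reverse-strand match is reported with descending query coordinates.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One aligned segment (hypothetical type; 1-based coordinates)."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


def classify(aln: Alignment, tolerance: int) -> str:
    """Label a segment the way it would appear on the dot plot."""
    # Assumption: reverse-strand matches have descending query coordinates.
    if aln.qry_start > aln.qry_end:
        return "reverse"
    # Forward matches that start at (nearly) the same position in both
    # sequences lie on the line f(x)=x.
    if abs(aln.ref_start - aln.qry_start) <= tolerance:
        return "forward, on diagonal"
    # Forward matches away from the diagonal suggest a translocated region.
    return "forward, displaced"


# Invented coordinates for illustration only; not taken from Figure 2.1.
examples = [
    Alignment(1, 50_000, 1, 50_000),
    Alignment(60_000, 90_000, 90_000, 60_000),
    Alignment(100_000, 130_000, 400_000, 430_000),
]
for aln in examples:
    print(aln, classify(aln, tolerance=1_000))
```

Because the Illumina assembly consists of many contigs, the position of a contig on the query axis depends on the order in which the contigs are concatenated, so a displaced forward match may reflect contig ordering rather than a true translocation.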
Table 2.5 Contig lengths and read coverage per contig resulting from the PacBio assembly of S. hominis J31
Contig    Length (bp)    Coverage (X)
0         2,188,298      207
1         60,147         256
2         28,669         92.5
3         40,632         117.5
4         6,417          45
Further investigation of the PacBio contigs showed that the longest contig, contig 0, had 207X coverage and represents the bacterial chromosome (Table 2.5). Evidence that this contig is the entire, circular bacterial chromosome comes from a 15,000 bp region at the start of the contig that is repeated at the end. The gene content and synteny of the region are conserved, and an rRNA gene operon is present. This indicates that some of the tRNAs in the expanded repertoire observed in the J31 PacBio assembly, and 4 of the annotated rRNA genes, may be the result of a sequencing artefact, as may the cluster of fol genes present at the end of the chromosome contig (Figure 2.2). When this 15,000 bp region was searched against contig 0 using BLAST, there were an additional 4 significant hits along the contig (Figure 2.3). When the regions of these BLAST hits were investigated, the genes of the rRNA operon were found to be the only genes present.
The 4 smaller contigs were also examined to determine whether they might represent complete circular plasmids. Bases from the beginning of each contig were searched against the contig itself using BLAST, and the sequence was assumed to be circular if this produced significant hits to both the start and the end of the contig. Further evidence for the circularity of these contigs is found in the annotation of the plasmids discussed in Chapter 3 (Section 3.3.15.1), where the gene annotation from the beginning of each plasmid is repeated at the end. The exception is contig 4, the smallest contig. Although the first 1000 bases produce significant BLAST hits to the beginning and the end of the contig, there is also a significant hit of the same region equidistant between the two (Figure 2.4 D). One explanation is misassembly of the contig; another is that the contig represents two copies of a small plasmid. This contig also exhibits lower read coverage (Table 2.5), which may be because its length is approaching the read length for PacBio sequencing.
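The logic of the circularity check can be illustrated with a small sketch. It substitutes exact string matching for BLAST, so it is a simplification rather than the method used here, and it runs on short invented sequences. The function names and the `end_window` parameter are hypothetical.

```python
def find_exact_hits(contig: str, probe: str) -> list[int]:
    """Return the 0-based start of every exact occurrence of probe in contig."""
    hits = []
    pos = contig.find(probe)
    while pos != -1:
        hits.append(pos)
        pos = contig.find(probe, pos + 1)
    return hits


def summarise_hits(contig: str, probe_len: int, end_window: int) -> dict:
    """Search the first probe_len bases of a contig against the contig itself.

    Hits starting within end_window bases of the contig end are reported as
    terminal; any other hit apart from the self-match at position 0 is
    reported as internal.
    """
    probe = contig[:probe_len]
    hits = find_exact_hits(contig, probe)
    boundary = len(contig) - end_window
    return {
        "hits": hits,
        "terminal": [h for h in hits if h >= boundary],
        "internal": [h for h in hits if 0 < h < boundary],
    }


# Invented 40-base unit for illustration only; not S. hominis sequence.
unit = "ATGCGTACCTGAAGTCTTAGGCATCCGATTGACCGTAAGC"

# A circular molecule assembled with its first 10 bases repeated at the end.
single_copy = unit + unit[:10]
# The same molecule assembled as two tandem copies plus the terminal repeat.
double_copy = unit + unit + unit[:10]

print(summarise_hits(single_copy, probe_len=10, end_window=10))
print(summarise_hits(double_copy, probe_len=10, end_window=10))
```

In this toy case a terminal hit with no internal hit matches the pattern taken as evidence of circularity, whereas an additional hit midway along the contig matches the pattern described for contig 4. The sketch cannot distinguish between misassembly and a genuine two-copy plasmid.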
Figure 2.2 Detailed analysis of the alignment of the first 15 000 bp of contig 0 with all contigs of the J31 PacBio assembly. A) The first 15 000 bases of contig 0 aligned with the start of contig 0. The yellow line indicates the end of the speculated overlap region. Genes in the region are annotated (left to right) folP_1, FolP_2, FolB_1, FolK_1 and LysS_1. These genes are followed by a 5S rRNA gene, a cluster of rRNAs, and 16S, 23S and 5S rRNA genes. B) The first 15 000 bases of contig 0 aligned with the end of contig 0. The yellow line indicates the start of the speculated overlap region; the blue line indicates the end of contig 0. Genes in the
Figure 2.3 Alignment of the first 15000 bases of contig 0 with the rest of contig 0. Green arrows indicate the locations of distributed rRNA operons which occur as matches to a region of the first 15000 bases containing rRNA genes. The regions indicated by the green arrows are around 5500 bp in length and contain only rRNA genes.
Figure 2.4 Visualisation of the circularity of the contigs produced from the PacBio assembly of S. hominis strain J31. A) The first 15,000 bp of contig 1 aligned against contig 1 B) The first 10,000 bp of contig 2 aligned against contig 2 C) The first 15,000 bp of contig 3 aligned against contig 3 D) The first
2.3.4 Gene synteny of the S. hominis genomes sequenced
Since the large contig resulting from PacBio sequencing was taken to be the circular bacterial chromosome, this assembly is considered to be of better quality than the Illumina assembly of the J31 genome by the metrics stated earlier, namely a large N50 and a low contig number. Consequently, the contigs resulting from CLC Workbench assembly of the S. hominis genomes sequenced on the Illumina platform were reordered relative to the PacBio HGAP assembly of S. hominis J31.
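The two metrics can be computed directly from the contig lengths in Table 2.5. The following is a minimal sketch using the common definition of N50 (the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length). The function name is illustrative.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running total to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Contig lengths (bp) of the J31 PacBio assembly, from Table 2.5.
pacbio_lengths = [2188298, 60147, 28669, 40632, 6417]

print("contigs:", len(pacbio_lengths))
print("total length:", sum(pacbio_lengths))
print("N50:", n50(pacbio_lengths))
```

Because contig 0 alone accounts for more than half of the 2,324,163 bp total, the N50 of the PacBio assembly equals the length of contig 0.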
Alignments of the reordered assemblies were performed using progressiveMauve and visualised using Mauve (version snapshot 2015-01-25 (1)). Locally collinear blocks (LCBs) represent homologous backbone sequence. Rearrangements such as translocations and inversions can be identified from LCBs that occupy different locations on the backbone. The high degree of synteny among the S. hominis genomes is indicated by the conserved arrangement of the LCBs (Figure 2.5). This synteny, together with the differences observed between the J31 Illumina and PacBio assemblies (Figure 2.1), indicates that reordering contigs against a PacBio reference can result in an improved arrangement of assembled contigs.
Figure 2.5 progressiveMauve genome alignment of the S. hominis genomes. Contigs resulting from CLC Workbench assemblies of Illumina sequence data were reordered relative to the HGAP assembly of PacBio sequence data from strain J31. The coloured blocks indicate homologous regions without any internal