Genome annotation and comparative analysis

2. Materials and methods

2.12 Genome annotation and comparative analysis

The assembled genomes were annotated using Prokka v1.12 [112], executed using in-house Perl scripts (Appendix D) [113]. Prokka uses external programmes for predicting genomic features, and annotation of the genomes includes predicting co-ordinates of coding regions and their putative products [112]. The external programmes used are: Prodigal (v2.6.3) [114] which predicts coding sequences; RNAmmer (v1.2) [115], ribosomal RNA genes; ARAGORN (v1.2) [116], transfer RNA genes; SignalP (v4.1) [117], signal leader peptides and Infernal (v1.1.2) [118], non-coding RNA. Prokka annotates the genomes in two steps, (i) Prodigal [114] predicts co-ordinates of coding genes and (ii) identifies

the putative gene product by comparing the sequence to known sequences in a database [112]. Prokka searches the databases to assign an annotation using a hierarchical system, initially searching from reliable databases through to

domain-specific databases. For each database the e-value threshold of 10-6_is

used [112]. Examples of databases utilised by Prokka include user-provided databases, UniProt [119], RefSeq, Pfam [120], TIGRFAMs [121], and if the sequence does not match any known proteins in these databases it is labelled as a hypothetical protein. Prokka outputs a range of files such as fasta files of the genomic sequences, summary statistics and an annotated GenBank file [112].

2.12.2 Center for Genomic Epidemiology

The Center for Genomic Epidemiology (CGE) [122] is an online service that contains a range of databases and tools for WGS analysis. The Bacterial Analysis Pipeline [123] was used to upload the assembled genomes and corresponding metadata, which then ran the genomes through various other CGE databases. These databases included the SpeciesFinder v1.2 [124] and SerotypeFinder v1.1

[23] which were used to confirm that the WGS belonged E. coli serogroup O145

and identify the H antigens carried by each genome, if present. The CGE VirulenceFinder v1.5 [125], PlasmidFinder [126] and MLST v1.8 [24] were also

used to identify various virulence factors, E. coli plasmid types and the ST type,

respectively.

2.12.3 Comparison of virulence genes

The presence or absence of 31 virulence genes (identified using the VirulenceFinder, section 2.12.2) were used to make a Neighbor-Net tree according to Euclidean distance (A/Prof. Patrick Biggs). The tree was subsequently edited using the Interactive Tree of Life (iTOL) v4.1 and isolate

metadata was included for eae subtype, ST and isolation source as indicated by

the colour keys. The presence and absence of the virulence factors that differ between isolates (n=23) was indicated by the matrix. Virulence factors that were either present or absent in all of the strains were not shown in the matrix but were used to construct the tree.

2.12.4 Ribosomal multi-locus sequence typing (rMLST)

In silico analysis of rMLST was performed based on classification described by

Jolley et al. [127] using in-house Perl scripts (Appendix D) [113]. Briefly, in silico

rMLST analysis uses the BIGSdb database which contains all of the known

diversity among the ribosomal protein genes. New alleles are periodically added to the database, and checks are in place to ensure there is no redundancy. Unique sequences in the database are designated an arbitrary allele number [127], and a combination of alleles are used to distinguish and identify different rMLST types. rMLST provides an indication of phylogenetic relationships using

SNPs identified in the 53 genes encoding the ribosome protein subunits (rps, rpm

and rpl). These genes are found in all bacteria and are under constant selection

for functional conservation and therefore are used to resolve bacterial phylogenies [127]. This method is also robust against horizontal genetic

exchange. The in silico rMLST analysis produced a range of outputs, and the

alignments can be viewed in SplitsTree v4.14.4 [111].

2.12.5 Identification of the locus of enterocyte effacement (LEE) pathogenicity island integration sites and stx-bacteriophage insertion sites

The LEE pathogenicity island integration sites were identified using two methods:

(i) visualisation of the GenBank files in Geneious [94] and identification of the eae

coding sequences (CDS), tir and other LEE-encoded genes including a prophage

integrase next to the likely tRNA gene (selC, pheU or pheV) integration site, or

(ii) the contigs were assembled to a reference genome and the likely tRNA integration site predicted based on the mapped contigs and gene synteny. The

reference genomes used are shown in Table 2.2. The stx-bacteriophage insertion

sites were also detected by identifying either (i) the insertion of stx genes in

known insertion sites for stx-positive strains, or (ii) lack off disruption of known

Table 2.2: Reference genomes used to identify LEE pathogenicity island integration sites.

Serotype Strain LEE

integration site

Accession

no. Publication

O145:H28 RM13514 tRNA selC NZ_CP006027 [46]

O145:H28 RM12761 tRNA selC NZ_CP007133 [46]

O26:H11 11368 tRNA pheU AP010953 [47]

O103:H2 12009 tRNA pheV AP010958 [47]

2.12.6 Download of publicly available serogroup O145 raw read data Publicly available serogroup O145 WGS data (Appendix E) were identified to offer a global comparison to New Zealand serogroup O145 strains sequenced. Serogroup O145 strains were identified from NCBI [128], EnteroBase [129] and published papers. Only whole genome sequences in which the raw read sequence data was publicly available were analysed in this study using the same pipeline (quality assessment and assembly) for improved comparison. The raw read sequence data was downloaded using the SRA toolkit [130]. Publicly available WGS data was excluded from the analysis if any discrepancies were identified during the quality assessment such as an unexpected genome size or GC content indicative of potential contamination.

2.12.7 Identification of orthologous groups

GET_HOMOLOGUES (v20170302) [131], a program used for comparative genome analysis, was used to identify orthologous groups from WGS data. The sequence-clustering algorithms COGtriangles [132] and OrthoMCL [133] were used to identify orthologous groups and distinguish paralogues, according to BLAST reciprocal best hit results [131]. Both sequencing-clustering algorithms were used to obtain a consensus of the cluster sets. GET_HOMOLOGUES was also used to determine the genome composition of the strains, by estimating the core and pan genome size by random sampling of the genomes [131] and by calculating the cloud, shell and core genome as defined by Tettelin et al. [134]. The definition of these genome statistics are: core, genes found in all genomes being investigated; soft-core, genes found in 95% genomes [135]; cloud, genes

found in a few genomes (the second most populated gene cluster); and shell, the remaining genes present in multiple genomes [131]. The core and pan genome identified by GET_HOMOLOGUES [131] was compared to the core and pan genome identified by Roary (v3.8.2) [136] as described in section 2.13.1.

2.12.8 BEAST

BEAST 2.0 (v2.4.6) [137], which uses Bayesian evolutionary analysis, was used to estimate the time of most recent common ancestor (TMRCA) of aligned core genomes based on mutations rates, and was calibrated using the isolation date of each strain. Briefly, the core SNP alignment obtained using Snippy v3.0 [107] was analysed using Gubbins (v2.2.2) [138], an algorithm that removes regions containing increased densities of base substitutions. This core SNP alignment, with removed recombination regions, was used as the input for BEAUti, a program to set the parameters for BEAST 2.0. For BEAST 2.0, the following parameters were used: coalescent constant population, the Hasegawa-Kishino- Yano (HKY) substitution model, a relaxed molecular clock model, effective sample sizes (ESS) greater than 100 were accepted and the Markov Chain Monte Carlo (MCMC) chain was run for 50,000,000 iterations. A 10% burn-in was used and the data was stored every 10,000 iterations. Summary statistics were analysed using the BEAST 2.0 program Tracer, and the tree estimates were visualised using FigTree v1.4.3 [139].

2.13 Comparison of phenotypic and genotypic data

In document Metabolic characteristics and genomic epidemiology of Escherichia coli serogroup O145 : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Microbiology at Massey University, Palmerston North, New Zealand (Page 47-51)