Bioinformatic data analysis

Chapter 2: Materials and Methods

2.12 Bioinformatic data analysis

2.12.1 H. influenzae genome sequences

Whole-genome reference sequences of the Rd and R2866 strains were available from the NCBI database (http://www.ncbi.nlm.nih.gov). Accession numbers were NC_000907 for Rd and CP002277 for R2866.

2.12.2 Whole-genome assembly

SPAdes was used to assemble sequencing reads into contiguous sequences (contigs) (Bankevich et al., 2012). "Careful" mode was selected to reduce the number of mismatches and short insertions and deletions (indels). QUAST, included in SPAdes software, was used to assess the whole-genome assembly properties (Gurevich et al., 2013). Contigs were removed if they were shorter than 200 bp and the read coverage was lower than 10x. Mauve was used to align contigs to the appropriate reference genome sequence from the NCBI database (Darling et al., 2004). The Mauve Contig Mover module was subsequently used to reorder contigs based on the Rd or R2866 reference genome (see section 2.12.1) (Rissman et al., 2009). Ordered contigs were concatenated into one complete sequence with the EMBOSS union online tool (http://www.bioinformatics.nl/cgi-bin/emboss/union). Qualimap was used to determine the read coverage of each assembled genome after mapping sequencing reads against the assembled genome (see section 2.12.5) (Garcia-Alcalde et al., 2012).
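The contig filtering criterion can be illustrated with a minimal Python sketch. It is not the procedure used in this work: it assumes SPAdes-style contig names of the form NODE_1_length_5230_cov_35.7, it uses the k-mer coverage stored in that name as a stand-in for read coverage, and it reads the criterion literally, so a contig is discarded only when it fails both thresholds. The function name keep_contig is hypothetical.

```python
import re

# Assumption: contig headers follow the SPAdes naming scheme,
# e.g. "NODE_1_length_5230_cov_35.7" (cov is k-mer coverage).
HEADER_RE = re.compile(r"length_(\d+)_cov_([\d.]+)")


def keep_contig(header, min_length=200, min_coverage=10.0):
    """Return True if a contig should be retained.

    Literal reading of the criterion: a contig is removed when it is
    shorter than min_length AND its coverage is below min_coverage.
    """
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"unrecognised contig header: {header!r}")
    length = int(match.group(1))
    coverage = float(match.group(2))
    too_short = length < min_length
    too_shallow = coverage < min_coverage
    return not (too_short and too_shallow)


if __name__ == "__main__":
    for name in ("NODE_1_length_5230_cov_35.7", "NODE_9_length_150_cov_3.2"):
        print(name, "keep" if keep_contig(name) else "remove")
```

If either condition alone should exclude a contig, the final line of keep_contig would use "or" instead of "and".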

2.12.3 Whole-genome annotation

Prokka was used to annotate the sequenced whole genomes of H. influenzae Rd and R2866 strains (Seemann, 2014). It was important to retain the original annotation of the genome sequences of these strains. Hence, the makeblastdb command-line tool (part of the BLAST+ package) was used to create a genus database from the reference genome sequences (Camacho et al., 2009). The genus database was used during Prokka annotation of the sequenced genomes.

2.12.4 Sequence comparison and visualisation

Whole-genome and RNA-Seq data were visualised in the Artemis genome browser (Rutherford et al., 2000). The Artemis Comparison Tool (ACT) was used to compare the genomes of Rd and R2866 strains (Carver et al., 2005). For this purpose, comparison files were generated with the online tool WebACT (http://www.webact.org/WebACT/home) using the BLASTn algorithm with default parameters. The average nucleotide identity (ANI) was calculated using best hit and reciprocal best hit methods (http://enve-omics.ce.gatech.edu/ani/) (Goris et al., 2007).

2.12.5 Mapping and processing sequencing reads

Paired-end reads from RNA-Seq experiments were in opposite orientations: the first read was reverse (3'-5') and the second read was forward (5'-3'). To visualise mapped RNA-Seq reads in Artemis, both reads needed to be in the same orientation. Hence, the first read was reverse complemented using the seqtk command-line tool, so that both reads were in the forward orientation. This was not required for whole-genome sequencing reads.
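The reverse-complement step can be sketched in Python as follows. This is an illustration of the operation rather than the seqtk implementation; the function name and the example record are hypothetical. Note that the quality string must be reversed together with the sequence so that each quality value stays attached to its base.

```python
# Translation table for DNA bases (N is left as N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_record(name, sequence, quality):
    """Reverse complement one FASTQ record.

    The sequence is complemented and reversed; the quality string is
    only reversed, keeping every score aligned with its base.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality differ in length")
    return name, sequence.translate(COMPLEMENT)[::-1], quality[::-1]


if __name__ == "__main__":
    print(reverse_complement_record("read1/1", "AACGT", "IIHG#"))
```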

The reference genome was indexed using the bowtie2-build command (Langmead and Salzberg, 2012). Sequencing reads were mapped to the reference genome using bowtie2 (Langmead and Salzberg, 2012). Read alignment data were generated in SAM (sequence alignment/map) file format. SAMtools was used to convert the alignment data to BAM (binary alignment/map) file format, which is a binary version of the SAM file format (Li et al., 2009). SAMtools was subsequently used to sort and index BAM files.
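The order of the mapping steps can be summarised with a short Python sketch that only builds the command lines. The file names, the helper name mapping_commands and the absence of any tuning options are assumptions, since the exact options used are not stated here. The "samtools sort -o" form assumes SAMtools 1.x; earlier releases took an output prefix instead.

```python
import shlex


def mapping_commands(reference_fasta, reads_1, reads_2, prefix):
    """Return the index, map, convert, sort and index commands in order.

    All file names are derived from a hypothetical sample prefix.
    """
    index = f"{prefix}_index"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["bowtie2-build", reference_fasta, index],             # index the reference
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],  # map read pairs
        ["samtools", "view", "-b", "-o", bam, sam],            # SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],           # coordinate sort
        ["samtools", "index", sorted_bam],                     # index the sorted BAM
    ]


if __name__ == "__main__":
    for command in mapping_commands("Rd.fasta", "reads_1.fastq", "reads_2.fastq", "sample"):
        print(shlex.join(command))
```

Each list can be passed to subprocess.run(command, check=True) on a system where bowtie2 and SAMtools are installed.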

2.12.6 Genome variant calling

The SAMtools command "mpileup" was used to generate a pileup format file from a sorted BAM file and a FASTA file of the reference genome (Li et al., 2009). This was used as input for VarScan2 software, which identifies single nucleotide polymorphisms (SNPs) and indels present between two genome sequences (Koboldt et al., 2012). The minimum read coverage was set to 20. The minimum number of reads needed to support a SNP or an indel was set to 15. The minimum base quality was set to 30. The minimum variant allele frequency threshold was 0.9. Finally, the minimum allele frequency for a variant to be called homozygous was set to 0.9.
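The combined effect of these thresholds on a single genome position can be shown with a small Python sketch. It is not VarScan2 itself: the Site class and function name are hypothetical, and the base-quality cut-off of 30 is assumed to have been applied already when counting reads.

```python
from dataclasses import dataclass

MIN_COVERAGE = 20            # minimum read coverage at the position
MIN_VARIANT_READS = 15       # minimum reads supporting the SNP or indel
MIN_VARIANT_FREQUENCY = 0.9  # minimum variant allele frequency


@dataclass
class Site:
    depth: int          # reads at the position with base quality >= 30
    variant_reads: int  # of those, reads supporting the variant allele


def passes_thresholds(site):
    """Return True if the site satisfies all three thresholds."""
    if site.depth < MIN_COVERAGE:
        return False
    if site.variant_reads < MIN_VARIANT_READS:
        return False
    return site.variant_reads / site.depth >= MIN_VARIANT_FREQUENCY


if __name__ == "__main__":
    print(passes_thresholds(Site(depth=40, variant_reads=38)))  # frequency 0.95
    print(passes_thresholds(Site(depth=40, variant_reads=30)))  # frequency 0.75
```

With a frequency threshold of 0.9, a variant supported by 30 of 40 reads is rejected even though it exceeds the read-count threshold, which is the behaviour expected for a haploid bacterial genome.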

2.12.7 Differential gene expression analysis

The R package DESeq2 uses a negative binomial distribution model to test for differential expression in RNA-Seq data (Love et al., 2014). Sorted BAM and GFF (general feature format) files were used as input for the coverageBed tool, outputting a text file with read coverage information for every feature in the genome. These text files, one per biological replicate, were used as input for DESeq2. P-values were adjusted to control the false discovery rate at 5% using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). Data were further filtered by applying a standard cut-off of 2 for the fold change and 0.05 for the adjusted p-value (Baddal et al., 2015).
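The p-value adjustment and the final filter can be illustrated in Python. DESeq2 performs the Benjamini-Hochberg adjustment internally, so this sketch only shows the arithmetic. It assumes that the fold-change cut-off of 2 applies in both directions, that is an absolute log2 fold change of at least 1, and that the adjusted p-value must be strictly below 0.05; both function names are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def is_differentially_expressed(log2_fold_change, adjusted_p,
                                min_fold_change=2.0, max_adjusted_p=0.05):
    """Apply the fold-change and adjusted p-value cut-offs."""
    large_change = abs(log2_fold_change) >= math.log2(min_fold_change)
    return large_change and adjusted_p < max_adjusted_p


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
    print(is_differentially_expressed(log2_fold_change=1.5, adjusted_p=0.01))
```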

2.12.8 Analysis of enriched functional groups

DAVID (Database for Annotation, Visualization, and Integrated Discovery) was used to identify gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were enriched in lists of differentially expressed genes (Huang da et al., 2009a, Huang da et al., 2009b). Reference Sequence (RefSeq) protein identifiers for every gene in each list were used as input. KEGG pathway diagrams were generated using KEGG Mapper (http://www.kegg.jp/kegg/tool/map_pathway2.html).

2.12.9 TPM normalisation

For absolute expression analysis, RNA-Seq data were manually normalised using the Transcripts per Million (TPM) method (Wagner et al., 2012).
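The TPM calculation can be written as a short Python function. This is a minimal sketch of the standard formula, in which each read count is divided by the feature length and the resulting rates are scaled to sum to one million; the function name and the example values are hypothetical, and feature lengths are taken in base pairs.

```python
def tpm(read_counts, lengths_bp):
    """Convert per-gene read counts to Transcripts per Million.

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    if len(read_counts) != len(lengths_bp):
        raise ValueError("counts and lengths differ in number")
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


if __name__ == "__main__":
    # Two genes with the same reads per base receive the same TPM.
    print(tpm([10, 20, 0], [1000, 2000, 500]))
```

Because the values always sum to one million within a sample, TPM values of the same gene can be compared between samples as proportions of the total transcript pool.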

2.12.10 BLAST

All BLAST searches were performed online on the BLAST server (http://blast.ncbi.nlm.nih.gov) or using the BLAST+ package on the command line (Camacho et al., 2009). Homology searches for ncRNAs were carried out using an E-value cut-off of 1e-05.
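Applying the E-value cut-off to command-line results can be sketched in Python. The example assumes that hits were saved in the BLAST+ tabular format (-outfmt 6) with its default twelve columns, in which the E-value is the eleventh; the function name and the example hits are hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-5):
    """Keep tabular BLAST hits whose E-value is at most max_evalue.

    Assumes default -outfmt 6 columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue))
    return hits


if __name__ == "__main__":
    example = [
        "ncRNA1\tgenomeA\t98.5\t120\t2\t0\t1\t120\t500\t619\t3e-50\t210",
        "ncRNA1\tgenomeB\t80.0\t40\t8\t0\t1\t40\t900\t939\t0.002\t35.1",
    ]
    print(filter_blast_hits(example))
```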

2.12.11 Identification of ncRNAs

Sorted BAM files and a GFF file, containing coordinates of the coding sequences, were used as input for coverageBed and genomeCoverageBed command-line tools, which are both part of the BEDTools suite (Quinlan and Hall, 2010). CoverageBed was used to produce read coverage information for each nucleotide that is present in every coding sequence in a genome. GenomeCoverageBed was used to produce read coverage information for each nucleotide in the genome: on both strands and for each strand separately. These files were used as input for a Python script, which was written in-house to identify ncRNA sequences from RNA-Seq data. See Chapter 5 for a detailed description of the script.
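To indicate how per-nucleotide coverage and coding sequence coordinates can be combined, the following Python sketch reports contiguous regions outside coding sequences whose coverage stays above a threshold. It is a generic illustration only and does not reproduce the in-house script described in Chapter 5: the function name, the depth threshold and the minimum length are all hypothetical, and strand information is ignored.

```python
def expressed_intergenic_regions(coverage, cds_intervals,
                                 min_depth=10, min_length=50):
    """Find covered regions that do not overlap coding sequences.

    coverage      -- per-nucleotide read depth; index 0 is position 1
    cds_intervals -- (start, end) pairs, 1-based and inclusive
    Returns (start, end) pairs, 1-based and inclusive.
    """
    in_cds = [False] * len(coverage)
    for start, end in cds_intervals:
        for pos in range(start - 1, min(end, len(coverage))):
            in_cds[pos] = True

    regions = []
    run_start = None
    for pos, depth in enumerate(coverage):
        expressed = depth >= min_depth and not in_cds[pos]
        if expressed and run_start is None:
            run_start = pos  # a new covered region begins here
        elif not expressed and run_start is not None:
            if pos - run_start >= min_length:
                regions.append((run_start + 1, pos))
            run_start = None
    # Close a region that extends to the end of the sequence.
    if run_start is not None and len(coverage) - run_start >= min_length:
        regions.append((run_start + 1, len(coverage)))
    return regions


if __name__ == "__main__":
    depth = [0] * 5 + [20] * 10 + [0] * 5 + [30] * 10
    print(expressed_intergenic_regions(depth, [(21, 30)], min_length=5))
```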

2.12.12 RNA and protein family analysis

Protein domain and family analysis was carried out using the InterPro database (Mitchell et al., 2015). The Rfam database was used to identify homologues from known RNA families (Griffiths-Jones et al., 2003, Nawrocki et al., 2015).

2.12.13 RNA secondary structure and gene targets

RNA secondary structure was predicted using the RNAfold web server (Hofacker and Stadler, 2006). Homologues of ncRNAs were identified using the GLASSgo online tool with the "very high specificity" option (http://rna.informatik.uni-freiburg.de). Five homologous sequences were then used to predict potential gene targets using CopraRNA (Wright et al., 2013, Wright et al., 2014). Potential target sequences were analysed within 75 bp around the start codon of each gene.

2.12.14 Figure generation and statistical analysis

Microsoft Excel was used to produce simple graphs of numeric data. False colour heatmaps were generated in R using the "heatmap.2" function of the "gplots" package. The Circos tool was used to visualise the genomic data in a circular layout (Krzywinski et al., 2009). Venn diagrams were generated with the online tool Venny (http://bioinfogp.cnb.csic.es/tools/venny/). Image analysis was performed using the Fiji image processing package (Schindelin et al., 2012).

In document The application of high throughput sequencing to study the genome composition and transcriptional response of Haemophilus influenzae (Page 76-81)