
Chapter 2: Materials and Methods

2.12 Bioinformatic data analysis

2.12.1 H. influenzae genome sequences

Whole-genome reference sequences of the Rd and R2866 strains were available from the NCBI database (http://www.ncbi.nlm.nih.gov). The accession numbers were NC_000907 for Rd and CP002277 for R2866.

 

2.12.2 Whole-genome assembly

SPAdes software was used to assemble sequencing reads into contiguous sequences (contigs) (Bankevich et al., 2012). "Careful" mode was selected to reduce the number of mismatches as well as short insertions and deletions (indels). QUAST, included in SPAdes software, was used to assess the whole-genome assembly properties (Gurevich et al., 2013). Contigs were removed if they were shorter than 200 bp and the read coverage was lower than 10x. Mauve was used to align contigs to the appropriate reference genome sequence from the NCBI database (Darling et al., 2004). The Mauve Contig Mover module was subsequently used to reorder contigs based on the Rd or R2866 reference genome (see section 2.12.1) (Rissman et al., 2009). Ordered contigs were concatenated into one complete sequence with the EMBOSS union online tool (http://www.bioinformatics.nl/cgi-bin/emboss/union). Qualimap was used to determine the read coverage of each assembled genome after mapping sequencing reads against the assembled genome (see section 2.12.5) (Garcia-Alcalde et al., 2012).
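The contig filtering step can be illustrated with a minimal Python sketch. It assumes that contig names follow the SPAdes convention "NODE_<n>_length_<bp>_cov_<x>" and that the coverage value in the header (k-mer coverage) is an acceptable stand-in for read coverage; neither assumption is stated in the text. The function names and the require_both switch are hypothetical. With require_both=True a contig is removed only when it fails both criteria, which follows the wording above literally; require_both=False removes a contig that fails either criterion.

```python
import re

# SPAdes-style header, e.g. "NODE_1_length_5120_cov_35.7" (assumed naming).
HEADER_RE = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def parse_spades_header(header):
    """Return (length_bp, coverage) parsed from a SPAdes-style contig header."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    return int(match.group(1)), float(match.group(2))


def keep_contig(header, min_length=200, min_coverage=10.0, require_both=True):
    """Decide whether a contig is retained under the length/coverage cut-offs."""
    length, coverage = parse_spades_header(header)
    too_short = length < min_length
    too_shallow = coverage < min_coverage
    if require_both:
        # Literal reading of the text: short AND poorly covered.
        remove = too_short and too_shallow
    else:
        # Stricter alternative reading: short OR poorly covered.
        remove = too_short or too_shallow
    return not remove
```

Applied to the headers of a contigs FASTA file, keep_contig returns True for every contig that would be retained.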

 

2.12.3 Whole-genome annotation

Prokka was used to annotate the sequenced whole genomes of the H. influenzae Rd and R2866 strains (Seemann, 2014). It was important to retain the original annotation of the genome sequences of these strains. Hence, the makeblastdb command-line tool (part of the BLAST+ package) was used to create a genus database from the reference genome sequences (Camacho et al., 2009). This genus database was then used during Prokka annotation of the sequenced genomes.

 

2.12.4 Sequence comparison and visualisation

Whole-genome and RNA-Seq data were visualised in the Artemis genome browser (Rutherford et al., 2000). The Artemis Comparison Tool (ACT) was used to compare the genomes of the Rd and R2866 strains (Carver et al., 2005). For this purpose, comparison files were generated with the online tool WebACT (http://www.webact.org/WebACT/home) using the BLASTn algorithm with default parameters. The average nucleotide identity (ANI) was calculated using the best hit and reciprocal best hit methods (http://enve-omics.ce.gatech.edu/ani/) (Goris et al., 2007).

 

2.12.5 Mapping and processing sequencing reads

Paired-end reads from RNA-Seq experiments were in opposite orientations: the first read was reverse (3'-5') and the second read was forward (5'-3'). To visualise mapped RNA-Seq reads in Artemis, both reads needed to be in the same orientation. Hence, the first read was reverse complemented using the seqtk command-line tool, so that both reads were in the forward orientation. This was not required for whole-genome sequencing reads.
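The reverse-complement operation applied to the first read can be sketched in Python as follows. This is only an illustration of the transformation, not the seqtk implementation; the function names and the tuple layout of a FASTQ record are assumptions made for the example. Note that the quality string has to be reversed together with the sequence so that each quality score stays attached to its base.

```python
# Translation table for the unambiguous bases and N (upper and lower case).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def reverse_complement_fastq(record):
    """Reverse complement one FASTQ record given as (header, sequence, quality)."""
    header, sequence, quality = record
    # The qualities are reversed (not complemented) to stay aligned with bases.
    return header, reverse_complement(sequence), quality[::-1]
```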

The reference genome was indexed using the bowtie2-build command, and sequencing reads were mapped to the reference genome using bowtie2 (Langmead and Salzberg, 2012). Read alignment data were generated in SAM (sequence alignment/map) file format. SAMtools was used to convert the alignment data to BAM (binary alignment/map) file format, the binary version of SAM (Li et al., 2009). SAMtools was subsequently used to sort and index the BAM files.
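The order of the mapping steps can be summarised with a short Python sketch that only assembles the command lines. All file names and the helper name mapping_commands are hypothetical, the syntax shown is that of recent Bowtie 2 and SAMtools 1.x releases, and any read-orientation or sensitivity options used in this work are not stated in the text and are therefore omitted.

```python
def mapping_commands(reference, reads_1, reads_2, prefix):
    """Return the index, map, convert, sort and index commands as argument lists."""
    index = f"{prefix}_index"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Build the Bowtie 2 index from the reference FASTA file.
        ["bowtie2-build", reference, index],
        # Map paired-end reads and write the alignments in SAM format.
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Sort the BAM file by coordinate.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Index the sorted BAM file.
        ["samtools", "index", sorted_bam],
    ]
```

Each list could be passed to subprocess.run(command, check=True) on a system where both tools are installed.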

 

2.12.6 Genome variant calling

The SAMtools command "mpileup" was used to generate a pileup format file from a sorted BAM file and a FASTA file of the reference genome (Li et al., 2009). This file was used as input for VarScan2 software, which identifies single nucleotide polymorphisms (SNPs) and indels present between two genome sequences (Koboldt et al., 2012). The minimum read coverage was set to 20. The minimum number of reads needed to support a SNP or an indel was set to 15. The minimum base quality was set to 30. The minimum allele frequency threshold was 0.9. Finally, the minimum allele frequency required to call a homozygote was also set to 0.9.
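The combined effect of the count-based thresholds can be expressed as a small Python check. This is a sketch of the filtering logic only, not of VarScan2 itself; the function name is hypothetical, and the read counts are assumed to include only bases that already passed the base quality cut-off of 30.

```python
def passes_thresholds(depth, variant_reads, min_coverage=20,
                      min_variant_reads=15, min_variant_freq=0.9):
    """Return True if a candidate variant satisfies all three thresholds.

    depth         -- number of quality-filtered reads covering the position
    variant_reads -- number of those reads supporting the variant allele
    """
    if depth <= 0 or depth < min_coverage:
        return False
    if variant_reads < min_variant_reads:
        return False
    # Variant allele frequency among the quality-filtered reads.
    return variant_reads / depth >= min_variant_freq
```

For example, a position covered by 40 reads of which 38 support the variant passes (frequency 0.95), whereas 30 supporting reads out of 40 does not (frequency 0.75).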

 

2.12.7 Differential gene expression analysis

The R package DESeq2 uses a negative binomial distribution model to test for differential expression in RNA-Seq data (Love et al., 2014). Sorted BAM and GFF (general feature format) files were used as input for the coverageBed tool, which outputs a text file with read coverage information for every feature in the genome. These text files, one per biological replicate, were used as input for DESeq2. P-values were adjusted to control the false discovery rate at 5% using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). Data were further filtered by applying a standard cut-off of 2 for the fold change and 0.05 for the adjusted p-value (Baddal et al., 2015).
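The multiple-testing adjustment and the subsequent filter can be illustrated with the following Python sketch. DESeq2 reports Benjamini-Hochberg adjusted p-values itself, so this code only shows the arithmetic and is not part of the analysis pipeline. The function names are hypothetical, and two details are assumptions: the fold change cut-off of 2 is applied in both directions (an absolute log2 fold change of at least 1), and the adjusted p-value must be strictly below 0.05.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_differentially_expressed(log2_fold_change, padj,
                                fold_change_cutoff=2.0, padj_cutoff=0.05):
    """Apply the fold change and adjusted p-value cut-offs to one gene."""
    large_change = abs(log2_fold_change) >= math.log2(fold_change_cutoff)
    return large_change and padj < padj_cutoff
```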

2.12.8 Analysis of enriched functional groups

DAVID (Database for Annotation, Visualization, and Integrated Discovery) was used to identify gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were enriched in lists of differentially expressed genes (Huang da et al., 2009a, Huang da et al., 2009b). Reference Sequence (RefSeq) protein identifiers for every gene in a list were used as input. KEGG pathway diagrams were generated using KEGG Mapper (http://www.kegg.jp/kegg/tool/map_pathway2.html).

 

2.12.9 TPM normalisation

For absolute expression analysis, RNA-Seq data were manually normalised using the Transcripts per Million (TPM) method (Wagner et al., 2012).
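In the TPM method, the read count of each gene is first divided by the gene length, and the resulting rates are then scaled so that they sum to one million: TPM_i = (c_i / l_i) / sum_j (c_j / l_j) x 10^6, where c_i is the read count and l_i the length of gene i. A minimal Python sketch of this calculation is given below; the function name and the input layout (two parallel lists) are assumptions made for the example.

```python
def tpm(counts, lengths):
    """Return TPM values for parallel lists of read counts and gene lengths (bp)."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same length")
    # Length-normalised read rate for each gene.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Scale the rates so that they sum to one million.
    return [rate / total * 1e6 for rate in rates]
```

Because the values always sum to one million, TPM values are comparable between samples in a way that raw counts are not.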

 

2.12.10 BLAST

All BLAST searches were performed either online on the BLAST server (http://blast.ncbi.nlm.nih.gov) or on the command line using the BLAST+ package (Camacho et al., 2009). Homology searches of ncRNAs were carried out using an E-value cut-off of 1e-05.
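Applying the E-value cut-off to command-line results can be sketched in Python as follows. The sketch assumes that hits were written in the standard 12-column tabular format of BLAST+ (-outfmt 6), in which the E-value is the eleventh column; the output format actually used is not stated in the text, and the function name is hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-5):
    """Return (query, subject, evalue) for tabular BLAST hits within the cut-off."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])  # column 11 of -outfmt 6 is the E-value
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue))
    return hits
```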

 

2.12.11 Identification of ncRNAs

Sorted BAM files and a GFF file containing the coordinates of the coding sequences were used as input for the coverageBed and genomeCoverageBed command-line tools, which are both part of the BEDTools suite (Quinlan and Hall, 2010). CoverageBed was used to produce read coverage information for each nucleotide in every coding sequence in a genome. GenomeCoverageBed was used to produce read coverage information for each nucleotide in the genome, both for the two strands combined and for each strand separately. These files were used as input for a Python script, written in-house, to identify ncRNA sequences from RNA-Seq data. See Chapter 5 for a detailed description of the script.
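The in-house script itself is described in Chapter 5 and is not reproduced here. Purely as an illustration of how per-nucleotide coverage and coding sequence coordinates can be combined, the following Python sketch reports stretches of continuous coverage that lie outside annotated coding sequences on one strand. The function name, the coverage threshold of 10 reads and the minimum length of 50 nt are hypothetical values chosen for the example and should not be read as the parameters or the logic of the actual script.

```python
def candidate_regions(coverage, cds_intervals, min_coverage=10, min_length=50):
    """Return 1-based inclusive (start, end) regions covered outside coding sequences.

    coverage      -- per-nucleotide read depth for one strand (index 0 = position 1)
    cds_intervals -- iterable of 1-based inclusive (start, end) coding coordinates
    """
    # Mark every position that belongs to a coding sequence.
    in_cds = [False] * len(coverage)
    for start, end in cds_intervals:
        for pos in range(max(start, 1), min(end, len(coverage)) + 1):
            in_cds[pos - 1] = True

    regions = []
    region_start = None
    for i, depth in enumerate(coverage):
        expressed = depth >= min_coverage and not in_cds[i]
        if expressed and region_start is None:
            region_start = i  # a new covered stretch begins
        elif not expressed and region_start is not None:
            if i - region_start >= min_length:
                regions.append((region_start + 1, i))
            region_start = None
    # Close a stretch that runs to the end of the sequence.
    if region_start is not None and len(coverage) - region_start >= min_length:
        regions.append((region_start + 1, len(coverage)))
    return regions
```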

 

2.12.12 RNA and protein family analysis

Protein domain and family analysis was carried out using the InterPro database (Mitchell et al., 2015). The Rfam database was used to identify homologues from known RNA families (Griffiths-Jones et al., 2003, Nawrocki et al., 2015).

 

2.12.13 RNA secondary structure and gene targets

RNA secondary structure was predicted using the RNAfold web server (Hofacker and Stadler, 2006). Homologues of ncRNAs were identified using the GLASSgo online tool with the "very high specificity" option (http://rna.informatik.uni-freiburg.de). Five homologous sequences were then used to predict potential gene targets using CopraRNA (Wright et al., 2013, Wright et al., 2014). Potential target sequences were analysed in a region of 75 bp around the start codon of each gene.

 

2.12.14 Figure generation and statistical analysis

Microsoft Excel was used to produce simple graphs of numeric data. False colour heatmaps were generated in R using the "heatmap.2" function of the "gplots" package. The Circos tool was used to visualise the genomic data in a circular layout (Krzywinski et al., 2009). Venn diagrams were generated with the online tool Venny (http://bioinfogp.cnb.csic.es/tools/venny/). Image analysis was performed using the Fiji image processing package (Schindelin et al., 2012).