Bioinformatics workpackages

(1)

IFB General Assembly

January 2016

Valentin Loux

(2)

WP organization

• WP1: two tasks oriented towards the e-infrastructure and the adaptation of regional sequencing facilities' infrastructure

• WP2: 7 bioinformatics workpackages with common objectives:
  – List and evaluate existing analysis software
  – Current use-cases analysis pipeline implementation
  – Installation and distribution of the pipelines

(3)

(4)

WP1.1 : e-infrastructure

Archiving of the initial data

Storage space dedicated to FG: 2 PB

HSM: 5 PB, extensible

Airain supercomputer: 20,000 cores (420 Tflops), of which 3,000 are dedicated to FG

The storage is accessible from the other TGCC supercomputers, including Curie (100,000 cores, 2 Pflops, PRACE calls for projects)

220 bioinformatics software packages installed.

(5)

2.1: Quality Control and technology evaluation (S. Engelen)

2.2: Read mapping software evaluation (V. Loux)

2.3: Assembly software evaluation (J. M. Aury)

2.4: Variant detection in genomic data (F. Artiguenave, E. Barillot)

2.5: Expression level (RNA-seq) (D. Gautheret, E. Rivals)

2.6: Gene expression regulation: {ChIP,Methyl,sRNA}-seq (C. Gaspin, H. Touzet, J. van Helden, N. Touleimat)

2.7: Genomic and metagenomic data analysis

(6)

WP2.1 & 2.3 : QC and assembly

Focus on «long read» technologies:
  – ONT MinION
  – Pacific Biosciences

Read error correction (using 2nd generation sequencing technologies)
  – NaS

(7)

Existing algorithms are not suited to long reads with many errors

2D MinION reads: aligned 83%, mean identity 75%

1D MinION reads: aligned 17%, mean identity 56%

Oxford Nanopore data

(8)

NaS is based on micro-assemblies to produce near-perfect reads

[Figure: NaS workflow. Illumina short reads and a MinION read → short reads alignment (seed reads) → short reads recruitment (recruited and seed reads) → short reads micro-assembly → NaS read]

METHODOLOGY ARTICLE (Open Access)

Genome assembly using Nanopore-guided long and error-free DNA reads

Mohammed-Amin Madoui1†, Stefan Engelen1†, Corinne Cruaud1, Caroline Belser1, Laurie Bertrand1, Adriana Alberti1, Arnaud Lemainque1, Patrick Wincker1,2,3 and Jean-Marc Aury1*

Abstract

Background: Long-read sequencing technologies were launched a few years ago, and in contrast with short-read sequencing technologies, they offered a promise of solving assembly problems for large and complex genomes. Moreover by providing long-range information, it could also solve haplotype phasing. However, existing long-read technologies still have several limitations that complicate their use for most research laboratories, as well as in large and/or complex genome projects. In 2014, Oxford Nanopore released the MinION® device, a small and low-cost single-molecule nanopore sequencer, which offers the possibility of sequencing long DNA fragments.

Results: The assembly of long reads generated using the Oxford Nanopore MinION® instrument is challenging as existing assemblers were not implemented to deal with long reads exhibiting close to 30% of errors. Here, we presented a hybrid approach developed to take advantage of data generated using MinION® device. We sequenced a well-known bacterium, Acinetobacter baylyi ADP1 and applied our method to obtain a highly contiguous (one single contig) and accurate genome assembly even in repetitive regions, in contrast to an Illumina-only assembly. Our hybrid strategy was able to generate NaS (Nanopore Synthetic-long) reads up to 60 kb that aligned entirely and with no error to the reference genome and that spanned highly conserved repetitive regions. The average accuracy of NaS reads reached 99.99% without losing the initial size of the input MinION® reads.

Conclusions: We described NaS tool, a hybrid approach allowing the sequencing of microbial genomes using the MinION® device. Our method, based ideally on 20x and 50x of NaS and Illumina reads respectively, provides an efficient and cost-effective way of sequencing microbial or small eukaryotic genomes in a very short time even in small facilities. Moreover, we demonstrated that although the Oxford Nanopore technology is a relatively new sequencing technology, currently with a high error rate, it is already useful in the generation of high-quality genome assemblies.

Keywords: Nanopore sequencing, Oxford nanopore, MinION® device, de novo genome assembly, Genome finishing

Background

The technology of long-read sequencing now offers different alternatives to solve genome assembly problems (for example, in complex regions involving repeated elements or segmental duplications) and haplotype phasing, which cannot be resolved adequately by short-read sequencing. Application of the single-molecule real-time sequencing (SMRT) platform produced by Pacific Biosciences to small microbial as well as large complex eukaryotic genomes demonstrated the possibility of considerably improving genome assembly quality [1-4]. Microbial genome could now be fully assembled (at least in some cases) using Pacific Biosciences’s SMRT reads alone [2] or in combination with short but high quality reads [1]. The high error rate of SMRT reads renders the necessity for either deep coverage or a strategy of error correction using Illumina reads. It’s clear that the current yield and high cost per base of this technology remain a barrier for most genomic projects targeting large genomes. Moreover, the price of the commercially available Pacific Biosystems PacBio RS II instrument is high and the needs in terms of infrastructure and implementation does not make it accessible to the whole research community. Similar improvements in read length were also accomplished by the Illumina Truseq synthetic

† Equal contributors

1 Commissariat à l’Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057 Evry, France

Madoui et al. BMC Genomics (2015) 16:327

(9)

Read set                    MinION reads (1D and 2D)    NaS (sensitive mode)
# reads                     89 011                      28 492
Cumul. size (Mb)            381.3                       271.6 (75X)
N50 size (bp)               13 376                      13 256
Avg size (bp)               4 284                       9 530
Max size (bp)               137 043                     91 255
# reads > 10 kb             14 947                      11 252
Number of aligned reads     29 954 (33.6%)              28 492 (100%)
Average identity percent    70%                         99.9897%
Max alignment size          84 914                      91 255
Error-free reads            0 (0%)                      27 049 (94.93%)

Acinetobacter dataset

NaS reads were obtained from 6 MinION runs (R7 and R7.3 flowcells) and 50X of 2x250bp Illumina reads

https://github.com/institut-de-genomique/NaS
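
The size statistics in the table above (N50, average size, number of reads over 10 kb) can be derived from a plain list of read lengths. The following is a minimal sketch in Python, not part of NaS itself: the function names `n50` and `summarize` and the toy read lengths are assumptions made for illustration. The last two lines only check that the reported cumulative sizes and read counts agree with the reported average sizes.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty list of lengths")


def summarize(lengths, min_len=10_000):
    """Summary statistics in the style of the table (hypothetical helper)."""
    return {
        "n_reads": len(lengths),
        "cumul_bp": sum(lengths),
        "n50_bp": n50(lengths),
        "avg_bp": sum(lengths) / len(lengths),
        "max_bp": max(lengths),
        "n_over_min_len": sum(1 for x in lengths if x > min_len),
    }


# Toy read lengths, for illustration only (not the Acinetobacter data).
toy = [2_000, 4_000, 6_000, 8_000, 12_000, 20_000]
stats = summarize(toy)

# Consistency check on the figures reported in the table:
# cumulative size / number of reads should match the average size.
minion_avg = 381.3e6 / 89_011   # close to the reported 4 284 bp
nas_avg = 271.6e6 / 28_492      # close to the reported 9 530 bp
```

Sorting in decreasing order and stopping at the first read that brings the running total to half of all bases is the usual way to obtain N50; ties and rounding conventions may differ between tools.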


(10)

PacBio: hybrid error correction with LoRDEC

BIOINFORMATICS, ORIGINAL PAPER

Vol. 30 no. 24 2014, pages 3506–3514

doi:10.1093/bioinformatics/btu538

Sequence analysis

Advance Access publication August 26, 2014

LoRDEC: accurate and efficient long read error correction

Leena Salmela1,* and Eric Rivals2,*

1 Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and 2 LIRMM and Institut de Biologie Computationnelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France

Associate Editor: Michael Brudno

ABSTRACT

Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space.

Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.

Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.

Supplementary information: Supplementary data are available at Bioinformatics online.

Received on April 6, 2014; revised on July 28, 2014; accepted on August 4, 2014
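
The idea in the Results paragraph above (a de Bruijn graph built from short reads, then a path search between two trusted k-mers that flank an erroneous region) can be illustrated with a small sketch. This is not LoRDEC's implementation: LoRDEC is written in C++ and uses a succinct graph representation, whereas the sketch below uses a plain Python dictionary and a bounded breadth-first search. The names `build_dbg` and `find_path`, the toy k-mers and the `max_steps` value are all assumptions.

```python
from collections import deque


def build_dbg(kmers):
    """Map each k-mer to its successor k-mers (overlap of k-1 characters)."""
    kmers = set(kmers)
    graph = {}
    for kmer in kmers:
        graph[kmer] = [kmer[1:] + base for base in "ACGT"
                       if kmer[1:] + base in kmers]
    return graph


def find_path(graph, source, target, max_steps):
    """Breadth-first search for a path of at most `max_steps` edges from
    `source` to `target`; returns the spelled sequence, or None."""
    queue = deque([(source, source, 0)])  # (node, spelled sequence, steps)
    while queue:
        node, seq, steps = queue.popleft()
        if node == target and steps > 0:
            return seq
        if steps < max_steps:
            for nxt in graph.get(node, []):
                queue.append((nxt, seq + nxt[-1], steps + 1))
    return None


# Toy graph built from four k-mers (k = 4) taken from accurate short reads.
graph = build_dbg(["ACGT", "CGTG", "GTGC", "TGCA"])

# Bridge an erroneous long-read region flanked by the k-mers ACGT and TGCA.
corrected = find_path(graph, "ACGT", "TGCA", max_steps=5)
```

Bounding the search with `max_steps` keeps the exploration finite even when the graph contains cycles; a real tool would also compare candidate paths with the long-read region before accepting one.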

1 INTRODUCTION

Sequencing, the determination of DNA or RNA sequences, now belongs to the basic experiments in life sciences. Compared with the Sanger method, the so-called next-generation sequencing technologies (of the second, third or even fourth generations) have drastically lowered its cost and increased its efficiency, making genome-wide and transcriptome-wide sequencing feasible. Numerous types of ‘omics’ experiments, beyond de novo genome sequencing and assembly, have been invented and rely on high-throughput sequencing.

All currently available technologies produce reads that represent only a piece of the target molecule sequence. Processing these reads requires aligning them against other sequences: for instance, while mapping them against a reference genome, or when computing overlaps among reads during assembly. Optimal, and sometimes suboptimal, alignments are retained for further analysis. The strength of an alignment (and hence its usefulness) is mostly controlled by two factors: its percentage of identity and its length. Clearly, errors introduced during the sequencing process, sequencing errors, blur the signal in an alignment by introducing mismatches or by breaking it into shorter ones. Weaker alignments may not pass subsequent filters and are lost for downward analyses. The finer the analysis, the higher the necessity to capture the information available in all alignments: for instance, when trying to bridge a gap in a less covered region of genome during assembly, or to reconstruct the sequence of a less expressed RNA. To counteract sequencing errors, error correction algorithms have been found effective for de novo assembly (Salzberg et al., 2012), and so they are often incorporated in assembly pipelines [see e.g. Euler SR (Chaisson and Pevzner, 2008), ALLPATHS-LG (Gnerre et al., 2011) and SOAPdenovo2 (Luo et al., 2012)].

1.1 Related works for second-generation sequencing

In the case of long sequences (Sanger or PacBio reads), algorithms compute multiple alignments of the reads and call a consensus sequence to correct erroneous regions. Alignment computation has the inconvenience of long running time and parameter dependency (Salmela and Schröder, 2011). In the case of second-generation reads, meaning larger input size and modest error rates, the key idea is to exploit the coverage of sequencing. One distinguishes erroneous from error-free substrings by counting their number of occurrences in the read set. With a sufficient coverage, it is possible to compute a minimal threshold such that, with high probability, each error-free k-mer appears at least that number of times in the read set. A k-mer above/below the threshold is qualified as solid or weak, respectively. This idea is exploited in second-generation assembly programs based on De Bruijn Graphs (DBG), where only solid k-mers form the nodes of the DBG (e.g. Zerbino and Birney, 2008), or during mapping against a reference to distinguish erroneous positions from biological mutations (Philippe et al., 2013). Many current error correction algorithms for second-generation sequencing (Illumina, Roche, or Solid) adopt this counting strategy, also called spectral alignment (Chaisson et al., 2004; Pevzner et al., 2001): one computes the spectrum of solid k-mers and corrects each read by updating each weak k-mer with its closest solid k-mer. Implementation relying on hash tables is well adapted to k-mers (i.e. to substrings of fixed length), while approaches based on more flexible indexes of the reads (e.g. suffix trees or suffix arrays) can correct substrings of different lengths (Salmela, 2010; Schröder et al., 2009). Spectral alignment-based approaches are more efficient and scalable than
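
The counting strategy described above, in which k-mers are classified as solid or weak according to their number of occurrences in the read set, can be sketched in a few lines of Python. This is an illustration only, not code from any of the cited tools; the function names, the toy reads and the threshold of 2 are assumptions, and a real tool would derive the threshold from the sequencing coverage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, threshold):
    """Return the k-mers seen at least `threshold` times (solid);
    all other k-mers are considered weak."""
    return {kmer for kmer, n in counts.items() if n >= threshold}


def weak_positions(read, k, solid):
    """Start positions in `read` whose k-mer is not solid."""
    return [i for i in range(len(read) - k + 1)
            if read[i:i + k] not in solid]


# Toy example: four error-free copies of a sequence and one read with
# a single substitution (G -> T at 0-based position 4).
reads = ["ACGTGCA"] * 4 + ["ACGTTCA"]
k = 4
counts = count_kmers(reads, k)
solid = solid_kmers(counts, threshold=2)
suspect = weak_positions("ACGTTCA", k, solid)
```

In this toy case every k-mer overlapping the substituted base is weak, which is how a run of weak k-mers points to the erroneous region of a read.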

(11)

WP2.2 : Mapping

http://mapdecode.france-genomique.org

• 96 tools listed and categorized

(12)

WP2.4 : variant analyses

• Bacteria:
  – Development of a pipeline for bacteria population analysis

• Human:
  – Publicly available analysis workflows (GATK, whole exome sequencing,…) deployed on galaxy-public.curie.fr
  – Varscope2.0 to search for SNP, Structural Variants & Copy Number Variations implemented and deployed on e-infrastructure

(13)

WP2.5 : Transcriptome analysis

• List and categorization of more than 50 isoform detection software tools

• RNAprof: detection of differentially expressed transcript isoforms (Tran et al., RNA Biology 2015, in press)

• Pipeline for de novo transcriptome assembly

• SARTools: a DESeq2- and edgeR-based R pipeline for comprehensive differential analysis of RNA-Seq data.

• Training session:
  – Assembly

(14)

WP2.6 : Regulation

sRNA-seq :

• Command line and Galaxy pipelines for miRNA detection and annotation:
  – Model organisms
  – Animals
  – Plants

• Small RNA workshop during May 2015, France

(15)

WP2.6 : ChIP-seq

Figure 1: ChIP-seq workflow

• Regulatory Sequence Analysis Tools (RSAT) integration into Galaxy

• ChIP-Seq Virtual Machine deployed and available on IFB’s cloud

• Implementation of pipelines for specific use cases

(16)

WP2.6 : Methyl-seq

• Listing, categorization and evaluation of available analysis tools

• Implementation, deployment and acceleration of the selected analysis pipeline on the e-infrastructure

Evaluation and installation of a bisulfite sequencing data processing pipeline

Xavier Benigni, Nizar Touleimat, François Artiguenave

12 November 2015

  – Laboratoire de Bioinformatique
  – Centre National de Génotypage
  – Direction des Sciences du Vivant
  – Commissariat à l’Énergie Atomique

Figure 12: Overall diagram of the BS-seq pipeline implemented at the CCRT. Illustration of the processing of BS-seq data from one sequencing lane. The same process is applied to data from several lanes, merged at the "data merge" step. The pipeline can also process several lanes independently.