Bioinformatics workpackages
IFB General Assembly
January 2016
Valentin Loux
WP organization
• WP1: two tasks oriented towards the e-infrastructure and regional sequencing facilities infrastructure adaptation
• WP2: 7 bioinformatics workpackages with common objectives:
– List and evaluate existing analysis software
– Implementation of analysis pipelines for current use cases
– Installation and distribution of the pipelines
WP1.1 : e-infrastructure
Archiving of the initial data
Storage space dedicated to FG: 2 PB
HSM: 5 PB, extensible
Airain supercomputer: 20,000 cores (420 Tflops), of which 3,000 are dedicated to FG
The storage is accessible from the other TGCC supercomputers, including Curie (100,000 cores, 2 Pflops, PRACE calls for projects)
220 bioinformatics software packages installed.
2.1 : Quality Control and technology evaluation (S. Engelen)
2.2 : Read mapping software evaluation (V. Loux)
2.3 : Assembly software evaluation (J. M. Aury)
2.4 : Variant detection in genomic data (F. Artiguenave – E. Barillot)
2.5 : Expression level (RNA-seq) (D. Gautheret – E. Rivals)
2.6 : Gene expression regulation: {ChIP,Methyl,sRNA}-seq (C. Gaspin – H. Touzet – J. van Helden – N. Touleimat)
2.7 : Genomic and metagenomic data analysis
WP2.1 & 2.3 : QC and assembly
Focus on « long read » technologies:
– ONT MinION
– Pacific Biosciences
Read error correction (using 2nd generation sequencing technologies)
– NaS
Existing algorithms are not suited to long reads with many errors
2D MinION reads: aligned 83%, mean identity 75%
1D MinION reads: aligned 17%, mean identity 56%
Oxford Nanopore data
NaS is based on micro-assemblies to produce near perfect reads
[Figure: NaS workflow. Illumina short reads and MinION read → short reads alignment (seed-reads) → short reads recruitment (recruited and seed reads) → short reads micro-assembly → NaS read]
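The seed and recruitment steps of the workflow above can be illustrated with a toy sketch. This is not the NaS implementation: shared k-mers stand in for real read alignment, the micro-assembly step is left out, and the function names, the sequences, `k` and `min_shared` are all hypothetical choices made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_seeds(long_read, short_reads, k, min_shared):
    """Keep short reads sharing at least min_shared k-mers with the long read.

    Toy stand-in for aligning short reads onto the noisy long read.
    """
    long_kmers = kmers(long_read, k)
    return [r for r in short_reads
            if len(kmers(r, k) & long_kmers) >= min_shared]


def recruit(seeds, short_reads, k, min_shared):
    """Keep non-seed short reads sharing at least min_shared k-mers
    with the pooled k-mers of the seed reads."""
    seed_kmers = set()
    for seed in seeds:
        seed_kmers |= kmers(seed, k)
    seed_set = set(seeds)
    return [r for r in short_reads
            if r not in seed_set
            and len(kmers(r, k) & seed_kmers) >= min_shared]


# Toy data: a noisy long read and three accurate short reads.
long_read = "ACGTACGCATACAGGC"
short_reads = ["GTACGGAT", "CGGATTCA", "TTTTTTTT"]

seeds = select_seeds(long_read, short_reads, k=4, min_shared=2)
recruited = recruit(seeds, short_reads, k=4, min_shared=2)
print("seed reads:", seeds)
print("recruited reads:", recruited)
```

In this toy data the second short read shares nothing with the noisy long read, yet it is recruited through the seed read. That is the point of recruitment: it reaches short reads covering regions where the long read is too erroneous to match directly.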
METHODOLOGY ARTICLE (Open Access)
Genome assembly using Nanopore-guided long and error-free DNA reads
Mohammed-Amin Madoui1†, Stefan Engelen1†, Corinne Cruaud1, Caroline Belser1, Laurie Bertrand1, Adriana Alberti1, Arnaud Lemainque1, Patrick Wincker1,2,3 and Jean-Marc Aury1*
Abstract
Background: Long-read sequencing technologies were launched a few years ago, and in contrast with short-read sequencing technologies, they offered a promise of solving assembly problems for large and complex genomes. Moreover by providing long-range information, it could also solve haplotype phasing. However, existing long-read technologies still have several limitations that complicate their use for most research laboratories, as well as in large and/or complex genome projects. In 2014, Oxford Nanopore released the MinION® device, a small and low-cost single-molecule nanopore sequencer, which offers the possibility of sequencing long DNA fragments.
Results: The assembly of long reads generated using the Oxford Nanopore MinION® instrument is challenging as existing assemblers were not implemented to deal with long reads exhibiting close to 30% of errors. Here, we presented a hybrid approach developed to take advantage of data generated using MinION® device. We sequenced a well-known bacterium, Acinetobacter baylyi ADP1 and applied our method to obtain a highly contiguous (one single contig) and accurate genome assembly even in repetitive regions, in contrast to an Illumina-only assembly. Our hybrid strategy was able to generate NaS (Nanopore Synthetic-long) reads up to 60 kb that aligned entirely and with no error to the reference genome and that spanned highly conserved repetitive regions. The average accuracy of NaS reads reached 99.99% without losing the initial size of the input MinION® reads.
Conclusions: We described NaS tool, a hybrid approach allowing the sequencing of microbial genomes using the MinION® device. Our method, based ideally on 20x and 50x of NaS and Illumina reads respectively, provides an efficient and cost-effective way of sequencing microbial or small eukaryotic genomes in a very short time even in small facilities. Moreover, we demonstrated that although the Oxford Nanopore technology is a relatively new sequencing technology, currently with a high error rate, it is already useful in the generation of high-quality genome assemblies.
Keywords: Nanopore sequencing, Oxford nanopore, MinION® device, de novo genome assembly, Genome finishing
Background
The technology of long-read sequencing now offers different alternatives to solve genome assembly problems (for example, in complex regions involving repeated elements or segmental duplications) and haplotype phasing, which cannot be resolved adequately by short-read sequencing. Application of the single-molecule real-time sequencing (SMRT) platform produced by Pacific Biosciences to small microbial as well as large complex eukaryotic genomes demonstrated the possibility of considerably improving genome assembly quality [1-4]. Microbial genome could now be fully assembled (at least in some cases) using Pacific Biosciences’s SMRT reads alone [2] or in combination with short but high quality reads [1]. The high error rate of SMRT reads renders the necessity for either deep coverage or a strategy of error correction using Illumina reads. It’s clear that the current yield and high cost per base of this technology remain a barrier for most genomic projects targeting large genomes. Moreover, the price of the commercially available Pacific Biosystems PacBio RS II instrument is high and the needs in terms of infrastructure and implementation does not make it accessible to the whole research community. Similar improvements in read length were also accomplished by the Illumina Truseq synthetic
†Equal contributors
1Commissariat à l’Energie Atomique (CEA), Institut de Génomique (IG),
Genoscope, BP5706, 91057 Evry, France
Full list of author information is available at the end of the article
© 2015 Madoui et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Madoui et al. BMC Genomics (2015) 16:327
Read set                    MinION reads (1D and 2D)    NaS (sensitive mode)
# reads                     89 011                      28 492
Cumul size (Mb)             381.3                       271.6 (75X)
N50 size (bp)               13 376                      13 256
Avg size (bp)               4 284                       9 530
Max size (bp)               137 043                     91 255
# reads >10Kb               14 947                      11 252
Number of aligned reads     29 954 (33.6%)              28 492 (100%)
Average identity percent    70%                         99.9897%
Max alignment size          84 914                      91 255
Error-free reads            0 (0%)                      27 049 (94.93%)
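The size metrics in the table above (number of reads, cumulative size, N50, average and maximum size, reads longer than 10 kb) can all be derived from a list of read lengths. The sketch below is a minimal illustration under that assumption. The function names and the toy lengths are hypothetical, and N50 is taken in its usual sense: the length L such that reads of length at least L hold half or more of the total bases.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths, long_threshold=10_000):
    """Compute the size metrics reported for a read set."""
    return {
        "n_reads": len(lengths),
        "cumul_size": sum(lengths),
        "n50": n50(lengths),
        "avg_size": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_size": max(lengths, default=0),
        "n_long": sum(1 for x in lengths if x > long_threshold),
    }


# Toy read lengths in bp (not the Acinetobacter data).
stats = summarize([2000, 12000, 4000, 15000])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The same arithmetic gives a quick consistency check on the table: 381.3 Mb spread over 89 011 MinION reads is about 4 284 bp per read, the reported average size.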
Acinetobacter dataset
NaS reads were obtained from 6 MinION runs (R7 and R7.3 flowcells) and 50X of 2x250bp Illumina reads
https://github.com/institut-de-genomique/NaS
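Coverage figures such as 50X or 75X are total sequenced bases divided by genome size. The sketch below shows that arithmetic. The genome size of 3.6 Mb is an assumption made for the example (it is roughly what 271.6 Mb at 75X implies), and the function names are hypothetical.

```python
def coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def read_pairs_needed(target_coverage, genome_size, read_length):
    """Number of paired-end read pairs needed to reach a target depth.

    All arguments are integers; the result is rounded up.
    """
    bases_needed = target_coverage * genome_size
    bases_per_pair = 2 * read_length
    return -(-bases_needed // bases_per_pair)  # integer ceiling division


GENOME_SIZE = 3_600_000  # assumed genome size in bp, for illustration only

nas_depth = coverage(271_600_000, GENOME_SIZE)
pairs = read_pairs_needed(50, GENOME_SIZE, 250)
print(f"NaS depth: {nas_depth:.1f}X")
print(f"2x250bp pairs for 50X: {pairs}")
```

Under this assumed genome size, 50X of 2x250bp reads corresponds to about 360 000 read pairs.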
PacBio: hybrid error correction with LoRDEC
BIOINFORMATICS, Vol. 30 no. 24 2014, pages 3506–3514
ORIGINAL PAPER, Sequence analysis
doi:10.1093/bioinformatics/btu538
Advance Access publication August 26, 2014
LoRDEC: accurate and efficient long read error correction
Leena Salmela1,* and Eric Rivals2,*
1Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and 2LIRMM and Institut de Biologie Computationnelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
Associate Editor: Michael Brudno
ABSTRACT
Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space.
Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy.
Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on April 6, 2014; revised on July 28, 2014; accepted on August 4, 2014
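The abstract's core idea, building a de Bruijn graph from the short reads and looking for a path that spells a corrective sequence, can be illustrated with a minimal sketch. This is not LoRDEC's implementation: it uses a plain Python dictionary instead of a succinct graph and a simple breadth-first search instead of LoRDEC's path selection. The function names, `k`, `min_count` and `max_steps` are hypothetical.

```python
from collections import Counter, deque


def build_dbg(short_reads, k, min_count):
    """Build a de Bruijn graph whose nodes are k-mers seen at least
    min_count times; edges link k-mers overlapping by k-1 bases."""
    counts = Counter(read[i:i + k]
                     for read in short_reads
                     for i in range(len(read) - k + 1))
    nodes = {kmer for kmer, count in counts.items() if count >= min_count}
    return {kmer: [kmer[1:] + base for base in "ACGT"
                   if kmer[1:] + base in nodes]
            for kmer in nodes}


def find_path(graph, source, target, max_steps):
    """Breadth-first search from source to target k-mer.

    Returns the sequence spelled by a shortest path using at most
    max_steps edges, or None if no such path exists.
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, spelled = queue.popleft()
        if node == target:
            return spelled
        if len(spelled) - len(source) >= max_steps:
            continue  # path budget exhausted on this branch
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, spelled + nxt[-1]))
    return None


# Toy data: two identical accurate short reads.
graph = build_dbg(["ACGTTGCA", "ACGTTGCA"], k=4, min_count=2)
corrected = find_path(graph, "ACGT", "TGCA", max_steps=10)
print(corrected)
```

In a long read, `source` and `target` would be two k-mers present in the graph that flank an erroneous stretch; the spelled path is then a candidate replacement for that stretch.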
1 INTRODUCTION
Sequencing, the determination of DNA or RNA sequences, now belongs to the basic experiments in life sciences. Compared with the Sanger method, the so-called next-generation sequencing technologies (of the second, third or even fourth generations) have drastically lowered its cost and increased its efficiency, making genome-wide and transcriptome-wide sequencing feasible. Numerous types of ‘omics’ experiments, beyond de novo genome sequencing and assembly, have been invented and rely on high-throughput sequencing.
All currently available technologies produce reads that represent only a piece of the target molecule sequence. Processing these reads requires aligning them against other sequences: for instance, while mapping them against a reference genome, or when computing overlaps among reads during assembly. Optimal, and sometimes suboptimal, alignments are retained for further analysis. The strength of an alignment (and hence its usefulness) is mostly controlled by two factors: its percentage of identity and its length. Clearly, errors introduced during the sequencing process, sequencing errors, blur the signal in an alignment by introducing mismatches or by breaking it into shorter ones. Weaker alignments may not pass subsequent filters and are lost for downward analyses. The finer the analysis, the higher the necessity to capture the information available in all alignments: for instance, when trying to bridge a gap in a less covered region of genome during assembly, or to reconstruct the sequence of a less expressed RNA. To counteract sequencing errors, error correction algorithms have been found effective for de novo assembly (Salzberg et al., 2012), and so they are often incorporated in assembly pipelines [see e.g. Euler SR (Chaisson and Pevzner, 2008), ALLPATHS-LG (Gnerre et al., 2011) and SOAPdenovo2 (Luo et al., 2012)].
1.1 Related works for second-generation sequencing
In the case of long sequences (Sanger or PacBio reads), algorithms compute multiple alignments of the reads and call a consensus sequence to correct erroneous regions. Alignment computation has the inconvenience of long running time and parameter dependency (Salmela and Schröder, 2011). In the case of second-generation reads, meaning larger input size and modest error rates, the key idea is to exploit the coverage of sequencing. One distinguishes erroneous from error-free substrings by counting their number of occurrences in the read set. With a sufficient coverage, it is possible to compute a minimal threshold such that, with high probability, each error-free k-mer appears at least that number of times in the read set. A k-mer above/below the threshold is qualified as solid or weak, respectively. This idea is exploited in second-generation assembly programs based on De Bruijn Graphs (DBG), where only solid k-mers form the nodes of the DBG (e.g. Zerbino and Birney, 2008), or during mapping against a reference to distinguish erroneous positions from biological mutations (Philippe et al., 2013). Many current error correction algorithms for second-generation sequencing (Illumina, Roche, or Solid) adopt this counting strategy, also called spectral alignment (Chaisson et al., 2004; Pevzner et al., 2001): one computes the spectrum of solid k-mers and corrects each read by updating each weak k-mer with its closest solid k-mer. Implementation relying on hash tables is well adapted to k-mers (i.e. to substrings of fixed length), while approaches based on more flexible indexes of the reads (e.g. suffix trees or suffix arrays) can correct substrings of different lengths (Salmela, 2010; Schröder et al., 2009). Spectral alignment-based approaches are more efficient and scalable than
*To whom correspondence should be addressed.
© The Author 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
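The solid/weak k-mer distinction described in the introduction excerpt above can be sketched with a hash-table count. This is a minimal illustration only: the function names, the toy reads and the threshold are hypothetical, and the sketch treats a count equal to the threshold as solid, since error-free k-mers are expected to appear at least that many times.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count the occurrences of every k-mer across all reads."""
    return Counter(read[i:i + k]
                   for read in reads
                   for i in range(len(read) - k + 1))


def split_solid_weak(counts, threshold):
    """Split k-mers into solid (count >= threshold) and weak (count below)."""
    solid = {kmer for kmer, count in counts.items() if count >= threshold}
    weak = set(counts) - solid
    return solid, weak


# Toy read set: the third read carries one substitution (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
solid, weak = split_solid_weak(counts, threshold=2)
print("solid:", sorted(solid))
print("weak:", sorted(weak))
```

The k-mers overlapping the substitution occur only once and fall in the weak set. A spectral-alignment corrector would then replace each of them with its closest solid k-mer.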
WP2.2 : Mapping
http://mapdecode.france-genomique.org
• 96 tools listed and categorized
WP2.4 : variant analyses
• Bacteria:
– Development of a pipeline for bacterial population analysis
• Human:
– Publicly available analysis workflows (GATK, whole exome sequencing, …) deployed on galaxy-public.curie.fr
– Varscope2.0, to search for SNPs, Structural Variants & Copy Number Variations, implemented and deployed on the e-infrastructure
WP2.5 : Transcriptome analysis
• Listing and categorization of more than 50 isoform detection software tools
• RNAprof: detection of differentially expressed transcript isoforms (Tran et al., RNA Biology 2015, in press)
• Pipeline for de novo transcriptome assembly
• SARTools: a DESeq2- and edgeR-based R pipeline for comprehensive differential analysis of RNA-Seq data.
• Training session:
– Assembly
WP2.6 : Regulation
sRNA-seq:
• Command line and Galaxy pipelines for miRNA detection and annotation:
– Model organisms
– Animals
– Plants
• Small RNA workshop held in France in May 2015
WP2.6 : ChIP-seq
Figure 1: ChIP-seq workflow
• Regulatory Sequence Analysis Tools (RSAT) integration into Galaxy
• ChIP-Seq Virtual Machine deployed and available on IFB’s cloud
• Implementation of pipelines for specific use cases
WP2.6 : Methyl-seq
• Listing, categorization and evaluation of available analysis tools
• Implementation, deployment and acceleration of the selected analysis pipeline on the e-infrastructure
Evaluation and installation of a bisulfite sequencing data processing pipeline
Xavier Benigni, Nizar Touleimat, François Artiguenave
November 12, 2015
– Laboratoire de Bioinformatique
– Centre National de Génotypage
– Direction des Sciences du Vivant
– Commissariat à l’Énergie Atomique
Figure 12: Overall diagram of the BS-seq pipeline implemented at the CCRT. Illustrated for the processing of BS-seq data from one sequencing lane. The same process is applied to data from several lanes, merged at the "data merge" step. The pipeline can also process several lanes independently.