A L I M E N T A T I O N
S. Marthey / O. Rué
Workflow (Galaxy)
Inputs: Fastq, Reference Genome
• Format Conversion: Groomer
• Quality Control: FastQC
• Mapping: BWA
• Format conversion: Sam-to-Bam
• Removing PCR duplicates: MarkDup
• Preprocess GATK: Indel Realignment
• Preprocess GATK: Base Recalibration
• Variant Calling GATK: Unified Genotyper
• Mpileup
• Variant Calling: VarScan
• VCF Filtering
• VCF Annotation
Outline
• Introduction
• NGS data preprocessing
• Variant discovery
• Why perform realignment/recalibration?
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, McKenna et al. (2010)
• GATK: Genome Analysis ToolKit
• http://www.broadinstitute.org/gatk/about/
• Developed by the Broad Institute development team (USA)
• Used in many projects (1000 Genomes Project, The Cancer Genome Atlas...)
• Originally developed for human genetics, but now generic
• Written in Java
• Citations:
Sources          2010   2011   2012   2013
GATK Website*       2      9     25      …
Google Scholar     28    145    436    767
* Nature, Science, Nature Genetics, Nature Biotechnology, New England Journal of Medicine, Cell, and Genome Research.
How is a SNP detected?
Complex Bayesian algorithms based on:
• Four scales: base, read, position, genotype
• Criteria: ALT allele count, REF allele count, ALT/REF ratio, read depth, overall genotype association, SNP quality, forward/reverse strand
• Phred quality (base quality, mapping quality): 10 => Perror = 1/10; 30 => Perror = 1/1000
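The Phred examples on this slide follow the standard relation Perror = 10^(-Q/10). A minimal Python sketch of that conversion (the function names are illustrative, not taken from any tool):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Q10 corresponds to 1 error in 10, Q30 to 1 error in 1000.
for q in (10, 20, 30):
    print(f"Q{q}: P_error = {phred_to_error_prob(q):.4f}")
```

The same scale applies to base qualities and to mapping qualities.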
How is a SNP detected?
Known sequencing biases:
– GA / HiSeq: base quality
– 454: homopolymers
– SOLiD: base quality + color-space translation
Raw reads
• Produced by the sequencers' software
• A first read recalibration/correction step can be performed:
– 454: Pyrobayes / Pyrocleaner
– SOLiD: Rsolid
– Illumina: Ibis / BayesCall
+ Error rate improved by 5 to 30%
Raw reads
• Fastq
• csFasta + Qual
@HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
NATAAATGCTGTCATACAGACTTGTTGGTGTTGTAAGGCAGCAGACTCCTTTGAGCTTTCATCCGAGAACAATTGAGACTAAATTCCTGGTGCAAAGTCCA
+HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
BP\cceeegffgghfhiiiefgihhhii[baegegfgiiiihhiiihhfhfhighihiiifhhfihieeegaceeedcdddd`bcbcccbbcbcccccbcb
@HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
NAAGAAGGCACGAAGCAACTACTTCACTGCATGCTGCCTGTCCTTGGGCTGTTTGCTGCCTTTGGCTAACACCTTTGATTATTTCTGGCTAAGTAGATAGG
+HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
BS\ceeeegggfgiiiiiiiiiiiiiiiiiiiiiiiiiiihiiiiihhghihhihiiiiiihhiihgggfgeeedddddddbededcccc`bcbeccddcc
@HISEQ4_0105:4:1101:3251:1984#TAGCTT/1
NAGAGCTATTTATGAAAACGAGGATGACTAAAACTGCCCAGAAAAAAAACCAACCAACCACGTTTCCAGTGACTGCCACCCTTAGCAAGCAAGGTAATAACA
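A Fastq record is four lines: a header starting with "@", the sequence, a separator starting with "+", and one quality character per base. The Python sketch below reads such records and re-encodes qualities from ASCII offset 64 to offset 33 (Sanger), which is the kind of conversion a "Groomer" step performs. The offset of 64 is an assumption here, suggested by characters such as "h" and "i" in the records above; the function names and the short example record are hypothetical.

```python
from typing import Iterator, List, Tuple

def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from complete 4-line Fastq records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (l.rstrip("\n") for l in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed Fastq record at line {i + 1}")
        yield header[1:], seq, qual

def phred64_to_phred33(qual: str) -> str:
    """Re-encode a quality string from ASCII offset 64 to offset 33 (Sanger)."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)

# Hypothetical 4-base record, assumed to be encoded with offset 64.
example = ["@read1", "ACGT", "+", "hhBB"]
for name, seq, qual in read_fastq(example):
    scores = [ord(c) - 64 for c in qual]
    print(name, seq, scores, phred64_to_phred33(qual))
```

A trailing incomplete record (fewer than four lines) is skipped by this sketch rather than reported.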
Mapping
• Alignment of reads against the reference genome
• Any software that produces BAM files
– e.g. BWA, Bowtie, Gsnap, SOAP, SSAHA
http://seqanswers.com/forums/showthread.php?t=43
1 file per lane, per individual, or per condition
Mapping
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 73 13 10354712 37 101M = 10354712 0
GCCTAGTCCTTTGAGACAGGAGTAAGACAAGAACTCAGGTTAGGGACCTCAAGGACTTGCTGAAGCCCACAAAGATTAGGACAAGCTAATGGAACTCAGAC
@@CFDFDFHGHHHIIJJJIJJJCFHIJIJIIFIJJJIJECFGGIGJIIJIJIIJJIGIIIIGGIJJJIGHHEFDFFFDDDCCED?BDDCCDDCDDDDCAC: X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 133 13 10354712 0 * = 10354712 0
GTTAGGGACCTTAAGGATCAATCTTGTCTGAGTTCCATTAGCTTGTCCTAATCTTTGTGGGCTTCAGCAAGTCCTTGAGGTCCCTAACCTGAGTTCTTGTC @@CFFFFFHHHHHJJJJIJJJJIIIIGIIJJJFHIJIJIIJJJJE?DGGCGHIJIJIGIIIIDGFHIIIIGHIJJF@CEH@CFF@CCEEA=CC;@ACA@C5 RG:Z:ind1
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 99 13 10355951 60 101M = 10355989 139
TGGGAAGGCTTACTGTCTTCATGCAGGATCTGTGTGGCTCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCAT
@@CFDFF?DHFHHIIHGHIJJG@HG<FHIIIIJJGGGDGIIJIIJJIGGEBD*?DDGHGGGIGHIH>GG;C>AAAC@DFD;@CECAACDCBBBB9A>>@CA X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 147 13 10355989 60 101M = 10355951 -139
TCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCATCAGTATAAGGCACTCTGAAAGAAAGCAATCTAAATCCC
:>DCDDDECAA>>@BFFEC@EIHE;GBHF=GFGHGGGGIIHFHGDG@GDB9IIJIIGHHGGGHIIGDIIHFHHEFGEIIJHGH?GBGIHHGGDFFDFFCC@ X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
PHOSPHORE:181:C0KD3ACXX:8:1202:6947:20338 99 13 10358279 29 99M = 10358378 154
GCAGGCTTTTAAGAATATGTTCTGTTTTCAAATAGTAACCCAAAAAGGGGTGGGGGCGGGGGCAAAGTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
CC@FFFFFGHGFHFGGGII>JHGGEHIJIIEHHEGHIGHIJJGGIJFGIJ@FHIIHFBDBDDBB@BC44@:@4?><8A2<2?8?<B<<2<2<<A<ABB? X0:i:1 X1:i:0 MD:Z:99 RG:Z:ind2 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:U
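The second column of each alignment line above is the SAM FLAG, a bit field. A minimal Python sketch that decodes it, using the bit meanings defined in the SAM specification (the function name and the wording of the labels are illustrative):

```python
# Bit meanings from the SAM specification; label wording is ours.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "properly paired",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

# FLAG values that appear in the alignment lines above.
for flag in (73, 133, 99, 147):
    print(flag, decode_flag(flag))
```

For the first pair above, 73 decodes to a mapped first read whose mate is unmapped, and 133 is that unmapped mate, which is consistent with its "*" CIGAR and mapping quality of 0.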
Duplicate Marking/Removing
• PCR duplicates (introduced during library construction)
• Samtools rmdup or Picard MarkDuplicates
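The idea behind duplicate marking can be sketched as follows: reads that map to the same chromosome, position and strand are treated as copies of one fragment, and only the copy with the highest total base quality is kept. This is a simplified Python illustration under those assumptions, not the algorithm of Samtools rmdup or Picard MarkDuplicates, which also take into account details such as the mate position and clipping. The `Read` type and the example reads are hypothetical.

```python
from collections import defaultdict
from typing import List, NamedTuple, Set

class Read(NamedTuple):
    name: str
    chrom: str
    pos: int        # leftmost mapped position
    reverse: bool   # True if mapped on the reverse strand
    qual: str       # quality string, assumed Phred+33

def mark_duplicates(reads: List[Read]) -> Set[str]:
    """Return the names of reads flagged as duplicates."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.pos, r.reverse)].append(r)
    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality in each group.
        best = max(group, key=lambda r: sum(ord(c) - 33 for c in r.qual))
        duplicates.update(r.name for r in group if r.name != best.name)
    return duplicates

reads = [
    Read("a", "13", 100, False, "IIII"),
    Read("b", "13", 100, False, "##II"),   # same start and strand as "a", lower quality
    Read("c", "13", 200, False, "IIII"),
]
print(mark_duplicates(reads))
```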
Local Realignment
• Identification of the regions to realign:
“ The algorithm begins by first identifying regions for realignment where 1) at least one read contains an indel, 2) there exists a cluster of mismatching bases or 3) an already known indel segregates at the site …”
• Realignment of the reads:
“ Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0 …”
DePristo et al (2011)
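Criterion 1 of the quoted description (at least one read contains an indel) can be illustrated from the CIGAR strings alone. The Python sketch below collects the reference intervals touched by insertions or deletions and merges overlapping ones into candidate regions. It is an illustration of the idea only, not GATK's implementation: mismatch clusters and known indels (criteria 2 and 3) are ignored, and the function names and example reads are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_intervals(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Reference intervals (start, end), 1-based inclusive, touched by indels in one read."""
    intervals = []
    ref = pos  # current reference coordinate
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "I":
            intervals.append((ref - 1, ref))       # insertion lies between ref-1 and ref
        elif op == "D":
            intervals.append((ref, ref + n - 1))   # deleted reference bases
        if op in "MDN=X":                          # operations that consume the reference
            ref += n
    return intervals

def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching intervals into candidate regions."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical reads: (leftmost position, CIGAR).
reads = [(100, "50M2D49M"), (120, "30M1I70M"), (300, "101M")]
candidates = [iv for pos, cigar in reads for iv in indel_intervals(pos, cigar)]
print(merge_intervals(candidates))
```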
Base quality recalibration
Raw data: mean BQ = 32.8, median = 36.7
« The per-base quality scores, which convey the
probability that the called base in the read is
the true sequenced base, are quite inaccurate and
co-vary with features like sequencing technology,
machine cycle and sequence context »
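The principle of recalibration can be sketched as comparing the quality a sequencer reports with the mismatch rate actually observed at positions that are not known to be polymorphic. The Python sketch below groups bases by reported quality only and uses a +1/+2 pseudocount; both choices are simplifying assumptions, and this is not GATK's implementation, which also stratifies by covariates such as machine cycle and sequence context. The function name and the example counts are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def empirical_qualities(observations: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each reported quality to an empirical Phred quality.

    observations: (reported_quality, is_mismatch) for every base that
    overlaps a position NOT known to be a variant.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [n_bases, n_mismatches]
    for q, mismatch in observations:
        counts[q][0] += 1
        counts[q][1] += int(mismatch)
    # Pseudocounts (+1 / +2) avoid log10(0) when no mismatch is observed.
    return {
        q: -10 * math.log10((mm + 1) / (n + 2))
        for q, (n, mm) in counts.items()
    }

# Hypothetical counts: 998 bases reported as Q40, 9 of which mismatch the reference.
obs = [(40, True)] * 9 + [(40, False)] * 989
print(empirical_qualities(obs))
```

In this example the bases reported as Q40 behave like Q20 bases, so their scores would be lowered. Excluding known variant sites is what makes the method depend on a catalogue of known SNPs.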
Consequences
Base quality recalibration
Raw data: mean BQ = 32.8, median = 36.7
Recalibrated data: mean BQ = 28.8, median = 28.7
• Reduced variability
Analysis-ready reads
• New BAM file
• Can then be used with other tools for downstream analyses (Samtools mpileup, Popoolation, etc.)
Single vs Multiple sample analysis
Data processing and analysis of genetic variation using next-generation sequencing, Mark DePristo, Dec. 8th, 2011 (http://www.broadinstitute.org/gatk/best-practices.htm)
Unified Genotyper
• GATK tool
• Multiple-sample analysis
• Several detection modes:
– SNP
– Indels
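To give an idea of what a Bayesian genotype caller computes at one position, the Python sketch below evaluates the likelihood of the three diploid genotypes (REF/REF, REF/ALT, ALT/ALT) from the observed bases and their Phred qualities. It is a textbook-style simplification for a single sample with no priors, not the Unified Genotyper's code; the function name and the example pileup are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def genotype_log_likelihoods(
    pileup: List[Tuple[str, int]], ref: str, alt: str
) -> Dict[str, float]:
    """log10 likelihoods of genotypes RR, RA, AA given (base, Phred quality) pairs."""
    result = {}
    for name, alleles in (("RR", (ref, ref)), ("RA", (ref, alt)), ("AA", (alt, alt))):
        total = 0.0
        for base, q in pileup:
            e = 10 ** (-q / 10)  # error probability of this base
            # Each of the two alleles is sampled with probability 1/2;
            # a base matches its allele with probability 1 - e, else e / 3.
            p = sum((1 - e) if base == a else e / 3 for a in alleles) / 2
            total += math.log10(p)
        result[name] = total
    return result

# Hypothetical pileup: 5 reads support A (REF), 5 reads support G (ALT), all Q30.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
likelihoods = genotype_log_likelihoods(pileup, ref="A", alt="G")
print(max(likelihoods, key=likelihoods.get), likelihoods)
```

Because every base enters through its error probability, inaccurate base qualities shift these likelihoods directly.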
VCF format
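A VCF data line is tab-separated, with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) defined by the VCF specification. A minimal Python sketch that parses one such line; the function name and the example line are hypothetical, and sample columns are ignored:

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")   # several ALT alleles at multi-allelic sites
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True   # keys without "=" are flags
    record["INFO"] = info
    return record

# Hypothetical multi-allelic record, for illustration only.
line = "13\t10354712\t.\tG\tA,T\t50.0\tPASS\tDP=24;AF=0.5,0.1;DB"
rec = parse_vcf_line(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```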
Comparison of SNP calling tools
• SIGENAE Team – LGC – INRA
• APACHE Project (Alain Vignal)
• Goal: find SNPs (Single Nucleotide Polymorphisms) that differentiate populations
• Barbary duck: no reference genome (the Beijing duck genome is available)
Impact of realignment / recalibration on SNP count
[Figure: SNP count (axis 0 to 80,000) on raw, realigned, recalibrated, and realigned & recalibrated data for Mpileup, Mpileup -B, Mpileup -E, GATK and Popoolation2; Δ labels: 777%, 714%, 42%, 45%]
• More homogeneous SNP count
Reliable results with other species?
[Figure: SNP count per tool (Mpileup, Mpileup -B, Mpileup -E, GATK) on raw, realigned, recalibrated, and realigned/recalibrated BAMs; panels DUCK (axis 0 to 80,000), PIG (axis 0 to 200,000), CHICKEN (axis 0 to 90,000)]
           Raw data   Realigned/Recal data
Δ tools    234%       4%
Δ tools    777%       20%
Δ tools    454%       9%
Conclusion
• Variability among the SNPs called by different tools
• GATK realignment/recalibration greatly reduces this variability
• Base quality scores have a high impact
• Reliable on various DNA data, but not on RNA data
« … We recommend a recalibration of per-base quality
scores as in GATK or SOAPsnp … »
« … Several additional steps can be taken to improve
genotype calls, such as local realignments ... »
Summary
GATK takes some getting used to
Strengths:
Fairly fast to run, thanks to possible parallelization
Allele counting
Handling of multi-allelic positions
Many features and options
SNPs appear to be reliable
Frequent improvements
Website
Weaknesses:
Recalibration relies on known SNPs...
Originally designed for human genome analysis
Many steps before running the UnifiedGenotyper
The GATK reference website
• Software and resource downloads (vcf)
• Analysis guide
• Best Practices
• Forum
• Technical documentation
• Etc.
http://www.broadinstitute.org/gatk/index.php
References
• Samtools:
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25, 2078-9 (2009).
Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18, 1851-8 (2008).
• GATK:
DePristo M. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491 (2011).
McKenna A.H., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010).
• Popoolation2:
Kofler R., Pandey R.V., Schlotterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics (2011).
• Pyrobayes:
Quinlan A.R., Stewart D.A., Strömberg M.P., Marth G.T. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods (2008).
• BayesCall:
Kao W.-C., Stevens K., Song Y.S. BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. (2009).
• Ibis:
Kircher M., Stenzel U., Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. (2009).
• Pyrocleaner