
Workflow. Reference Genome. Variant Calling. Galaxy Format Conversion Groomer. Mapping BWA GATK Preprocess


Academic year: 2021


A L I M E N T A T I O N

S. Marthey / O. Rué

Workflow

Inputs: Fastq, Reference Genome

• Galaxy format conversion: Groomer
• Quality control: FastQC
• Mapping: BWA
• Format conversion: Sam-to-Bam
• Removing PCR duplicates: MarkDup
• GATK preprocessing: Indel Realignment
• GATK preprocessing: Base Recalibration
• Variant calling: GATK Unified Genotyper
• Mpileup
• Variant calling: VarScan
• VCF Filtering
• VCF Annotation


Outline

Introduction

Preprocessing of NGS data

Variant discovery

Why perform realignment/recalibration?


The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data, McKenna et al. (2010)

GATK: Genome Analysis Toolkit

http://www.broadinstitute.org/gatk/about/

• Developed by the Broad Institute development team (USA)
• Used in many projects (1000 Genomes Project, The Cancer Genome Atlas...)
• Originally developed for human genetics, but now generic
• Written in Java

Citations, by source (years shown: 2010, 2011, 2012, 2013):

• GATK Website*: 2, 9, 25
• Google Scholar: 28, 145, 436, 767

* Nature, Science, Nature Genetics, Nature Biotechnology, New England Journal of Medicine, Cell, and Genome Research.


How is a SNP detected?

Complex Bayesian algorithms based on criteria at four scales (base, read, position, genotype):

• Base Phred quality
• Mapping quality
• Forward/Reverse
• ALT allele count
• REF allele count
• ALT / REF
• Read depth
• Overall genotype association
• SNP quality

Phred quality: 10 => Perror = 1 / 10; 30 => Perror = 1 / 1000
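The two Phred values above follow from the standard definition Q = -10 · log10(Perror). A minimal Python sketch of the conversion in both directions (the function names are illustrative, not taken from any tool in the workflow):

```python
import math

def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Print the error probability for a few common quality values.
for q in (10, 20, 30):
    print(f"Q{q}: Perror = {phred_to_error(q):.4f}")
```

The same scale is used for base qualities and for mapping qualities.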


How is a SNP detected?

Known sequencing biases:

– GA / HiSeq: base quality

– 454: homopolymers

– SOLiD: base quality + color-space translation



Raw reads

• Produced by the sequencers' software

• A first read recalibration/correction step can be performed:

– 454: Pyrobayes / Pyrocleaner

– SOLiD: Rsolid

– Illumina: Ibis / BayesCall

+ Error rate improved by 5 to 30%


Raw reads

• Fastq

• csFasta + Qual

@HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
NATAAATGCTGTCATACAGACTTGTTGGTGTTGTAAGGCAGCAGACTCCTTTGAGCTTTCATCCGAGAACAATTGAGACTAAATTCCTGGTGCAAAGTCCA
+HISEQ4_0105:4:1101:1533:1998#TAGCTT/1
BP\cceeegffgghfhiiiefgihhhii[baegegfgiiiihhiiihhfhfhighihiiifhhfihieeegaceeedcdddd`bcbcccbbcbcccccbcb
@HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
NAAGAAGGCACGAAGCAACTACTTCACTGCATGCTGCCTGTCCTTGGGCTGTTTGCTGCCTTTGGCTAACACCTTTGATTATTTCTGGCTAAGTAGATAGG
+HISEQ4_0105:4:1101:2421:1947#TAGCTT/1
BS\ceeeegggfgiiiiiiiiiiiiiiiiiiiiiiiiiiihiiiiihhghihhihiiiiiihhiihgggfgeeedddddddbededcccc`bcbeccddcc
@HISEQ4_0105:4:1101:3251:1984#TAGCTT/1
NAGAGCTATTTATGAAAACGAGGATGACTAAAACTGCCCAGAAAAAAAACCAACCAACCACGTTTCCAGTGACTGCCACCCTTAGCAAGCAAGGTAATAAC
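A FASTQ record is four lines: "@" header, sequence, "+" separator, and one quality character per base. The quality characters in the reads above (letters up to "i") suggest the older Phred+64 encoding, which is presumably why the workflow starts with a Groomer conversion step. The Python sketch below reads records and decodes qualities; the function names and the offset heuristic are illustrative assumptions, not the Groomer's actual logic.

```python
from typing import Iterable, Iterator, List, Tuple

def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if (qual is None or not header.startswith("@")
                or not plus.startswith("+") or len(seq) != len(qual)):
            raise ValueError(f"malformed or truncated record: {header}")
        yield header[1:], seq, qual

def decode_quality(qual: str, offset: int) -> List[int]:
    """Convert quality characters to Phred scores (offset 33 or 64)."""
    return [ord(c) - offset for c in qual]

def guess_offset(qual: str) -> int:
    """Crude heuristic: a character below '@' (ASCII 64) rules out Phred+64."""
    return 33 if min(qual) < "@" else 64

# Hypothetical short record reusing the first five bases of the first read above.
example = ["@read1\n", "NATAA\n", "+\n", "BP\\cc\n"]
for name, seq, qual in read_fastq(example):
    offset = guess_offset(qual)
    print(name, seq, offset, decode_quality(qual, offset))
```

The heuristic can misclassify a Phred+33 read whose qualities are all high, so in practice the encoding should be confirmed from the sequencing platform and pipeline version.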

Mapping

• Alignment of reads against the reference genome

• Any software producing BAM files

– e.g. BWA, Bowtie, Gsnap, SOAP, SSAHA

http://seqanswers.com/forums/showthread.php?t=43

1 file per lane / individual / condition

or


Mapping

PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 73 13 10354712 37 101M = 10354712 0

GCCTAGTCCTTTGAGACAGGAGTAAGACAAGAACTCAGGTTAGGGACCTCAAGGACTTGCTGAAGCCCACAAAGATTAGGACAAGCTAATGGAACTCAGAC

@@CFDFDFHGHHHIIJJJIJJJCFHIJIJIIFIJJJIJECFGGIGJIIJIJIIJJIGIIIIGGIJJJIGHHEFDFFFDDDCCED?BDDCCDDCDDDDCAC: X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind1 XG:i:0 AM:i:0 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

PHOSPHORE:181:C0KD3ACXX:8:2101:3676:147949 133 13 10354712 0 * = 10354712 0

GTTAGGGACCTTAAGGATCAATCTTGTCTGAGTTCCATTAGCTTGTCCTAATCTTTGTGGGCTTCAGCAAGTCCTTGAGGTCCCTAACCTGAGTTCTTGTC @@CFFFFFHHHHHJJJJIJJJJIIIIGIIJJJFHIJIJIIJJJJE?DGGCGHIJIJIGIIIIDGFHIIIIGHIJJF@CEH@CFF@CCEEA=CC;@ACA@C5 RG:Z:ind1

PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 99 13 10355951 60 101M = 10355989 139

TGGGAAGGCTTACTGTCTTCATGCAGGATCTGTGTGGCTCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCAT

@@CFDFF?DHFHHIIHGHIJJG@HG<FHIIIIJJGGGDGIIJIIJJIGGEBD*?DDGHGGGIGHIH>GG;C>AAAC@DFD;@CECAACDCBBBB9A>>@CA X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

PHOSPHORE:181:C0KD3ACXX:8:1206:13256:144743 147 13 10355989 60 101M = 10355951 -139

TCCTTACTTTCAACAGCCTCCATTACCAATTCCAGGGAAAGTCTCCATCAACCAGGAATGCATCAGTATAAGGCACTCTGAAAGAAAGCAATCTAAATCCC

:>DCDDDECAA>>@BFFEC@EIHE;GBHF=GFGHGGGGIIHFHGDG@GDB9IIJIIGHHGGGHIIGDIIHFHHEFGEIIJHGH?GBGIHHGGDFFDFFCC@ X0:i:1 X1:i:0 MD:Z:101 RG:Z:ind2 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U

PHOSPHORE:181:C0KD3ACXX:8:1202:6947:20338 99 13 10358279 29 99M = 10358378 154

GCAGGCTTTTAAGAATATGTTCTGTTTTCAAATAGTAACCCAAAAAGGGGTGGGGGCGGGGGCAAAGTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTGT

CC@FFFFFGHGFHFGGGII>JHGGEHIJIIEHHEGHIGHIJJGGIJFGIJ@FHIIHFBDBDDBB@BC44@:@4?><8A2<2?8?<B<<2<2<<A<ABB? X0:i:1 X1:i:0 MD:Z:99 RG:Z:ind2 XG:i:0 AM:i:29 NM:i:0 SM:i:29 XM:i:0 XO:i:0 XT:A:U
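The second field of each SAM record above is a bitwise FLAG (73, 133, 99 and 147 in the records shown). The Python sketch below decodes it using the bit values defined in the SAM specification; the short label strings are illustrative names chosen here.

```python
from typing import List

# Bit values from the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

for flag in (73, 133, 99, 147):
    print(flag, decode_flag(flag))
```

Decoded this way, the first two records (73 and 133) are a pair in which only the first read mapped, while 99 and 147 form a properly paired forward/reverse couple.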


Duplicate Marking/Removing

• PCR duplicates (library construction)

• Samtools rmdup or Picard MarkDuplicates

Identification
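The idea behind duplicate identification can be sketched as follows: reads that start at the same position on the same strand are assumed to come from the same original fragment, and only the best one is kept. This is a deliberately simplified Python illustration, not the algorithm of Samtools rmdup or Picard MarkDuplicates (which also account for mate positions and clipping); the `Read` type and its fields are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple

class Read(NamedTuple):
    name: str
    chrom: str
    pos: int                      # leftmost mapped position
    reverse: bool                 # True if mapped to the reverse strand
    base_quals: Tuple[int, ...]   # Phred scores of the bases

def find_duplicates(reads: List[Read]) -> Set[str]:
    """Return names of reads considered duplicates of a better-quality read."""
    groups: Dict[Tuple[str, int, bool], List[Read]] = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.pos, r.reverse)].append(r)
    duplicates: Set[str] = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality.
        best = max(group, key=lambda r: sum(r.base_quals))
        duplicates.update(r.name for r in group if r is not best)
    return duplicates

reads = [
    Read("a", "13", 100, False, (30, 30)),
    Read("b", "13", 100, False, (35, 35)),
    Read("c", "13", 100, True, (20, 20)),
    Read("d", "13", 200, False, (10, 10)),
]
print(find_duplicates(reads))
```

Marking tools record the result in the SAM FLAG (bit 0x400) so that downstream callers can skip duplicates, whereas removal tools drop the reads from the file.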


Local Realignment

• Identification of regions to realign:

“The algorithm begins by first identifying regions for realignment where 1) at least one read contains an indel, 2) there exists a cluster of mismatching bases or 3) an already known indel segregates at the site …”

• Realignment of reads:

“Next, all reads are realigned against just the best haplotype Hi and the reference (H0), and each read Rj is assigned to Hi or H0 …”

DePristo et al. (2011)
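Criterion 1 of the quoted passage (at least one read contains an indel) can be checked directly from the CIGAR string of each alignment. The Python sketch below covers only that criterion and is not GATK's implementation; the function names are illustrative.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '50M2D49M' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def has_indel(cigar: str) -> bool:
    """True if the alignment contains an insertion (I) or deletion (D)."""
    return any(op in "ID" for _, op in parse_cigar(cigar))

def indel_sites(pos: int, cigar: str) -> List[Tuple[int, int, str]]:
    """Return 1-based reference intervals (start, end, op) for each indel."""
    sites = []
    ref = pos
    for length, op in parse_cigar(cigar):
        if op == "D":
            sites.append((ref, ref + length - 1, op))
        elif op == "I":
            # An insertion sits between reference bases ref-1 and ref.
            sites.append((ref - 1, ref, op))
        if op in REF_CONSUMING:
            ref += length
    return sites

print(has_indel("101M"))                 # a read aligned without gaps
print(indel_sites(100, "50M2D49M"))      # hypothetical read with a 2-base deletion
```

Intervals collected this way from all reads would then be merged into candidate regions; criteria 2 and 3 require mismatch counts and a list of known indels, which are not modeled here.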


Base quality recalibration

Raw data: mean BQ = 32.8, median = 36.7

“The per-base quality scores, which convey the probability that the called base in the read is the true sequenced base, are quite inaccurate and co-vary with features like sequencing technology, machine cycle and sequence context”


Consequences

Base quality recalibration

Raw data: mean BQ = 32.8, median = 36.7
Recalibrated data: mean BQ = 28.8, median = 28.7

• Reduced variability
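The principle of recalibration can be sketched in a few lines: for each reported quality value, count how often bases with that value actually mismatch the reference at sites not known to be variant, and convert the observed error rate back to a Phred score. This Python sketch is a simplification under stated assumptions: it stratifies only by reported quality, whereas the real procedure also uses covariates such as machine cycle and sequence context, and the +1/+2 smoothing is a choice made here for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def empirical_qualities(observations: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each reported quality to an empirical Phred score.

    observations: (reported_quality, is_mismatch) pairs for bases at
    positions that are not known polymorphic sites.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for q, mismatch in observations:
        counts[q][0] += int(mismatch)
        counts[q][1] += 1
    # The +1 / +2 smoothing avoids log10(0) when a bin has no mismatches.
    return {q: -10 * math.log10((m + 1) / (n + 2)) for q, (m, n) in counts.items()}

# Hypothetical counts: 998 bases reported as Q30, 9 of which mismatch.
obs = [(30, True)] * 9 + [(30, False)] * 989
print(empirical_qualities(obs))
```

In this made-up example the bases reported as Q30 behave like Q20 bases, the kind of downward correction visible in the mean and median values above.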


Analysis-ready reads

• New BAM file

• Can then be used with other tools for downstream analyses (Samtools mpileup, Popoolation, etc.)


Single vs Multiple sample analysis

Data processing and analysis of genetic variation using next-generation sequencing, Mark DePristo, Dec. 8th, 2011 (http://www.broadinstitute.org/gatk/best-practices.htm)


Unified Genotyper

• GATK tool

• Multiple sample analysis

• Different detection modes

– SNP

– Indels
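To give an intuition for Bayesian SNP calling, the Python sketch below computes genotype likelihoods at a single position from the observed bases and their Phred qualities, using a textbook diploid model. It is not the Unified Genotyper's actual implementation: it ignores priors, mapping quality and multi-sample information, and all names are illustrative.

```python
import math
from typing import Dict, List, Tuple

def genotype_log_likelihoods(
    pileup: List[Tuple[str, int]], ref: str, alt: str
) -> Dict[str, float]:
    """log10 P(observed bases | genotype) for REF/REF, REF/ALT and ALT/ALT.

    pileup: (base, phred_quality) pairs observed at one position.
    """
    genotypes = {"REF/REF": (ref, ref), "REF/ALT": (ref, alt), "ALT/ALT": (alt, alt)}
    result = {}
    for name, alleles in genotypes.items():
        total = 0.0
        for base, q in pileup:
            e = 10 ** (-q / 10)  # base error probability
            # Each allele is sampled with probability 1/2; a base matches its
            # allele with probability 1 - e, or is a specific error with e / 3.
            p = sum((1 - e) if base == a else e / 3 for a in alleles) / 2
            total += math.log10(p)
        result[name] = total
    return result

def call_genotype(pileup: List[Tuple[str, int]], ref: str, alt: str) -> str:
    """Return the genotype with the highest likelihood (flat prior)."""
    ll = genotype_log_likelihoods(pileup, ref, alt)
    return max(ll, key=ll.get)

# Hypothetical pileup: five reads support A (REF) and five support G (ALT), all Q30.
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
print(call_genotype(pileup, "A", "G"))
```

Because the error term comes straight from the base quality, miscalibrated qualities shift these likelihoods, which is one reason recalibration matters before calling.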


VCF format


Comparison of SNP calling tools

• SIGENAE Team – LGC – INRA

• APACHE Project (Alain Vignal)

• Goal: find SNPs (Single Nucleotide Polymorphisms) that differentiate populations

• Barbary duck: no reference genome (the Beijing duck genome is available)

[Images: Beijing duck, Barbary duck]


Impact of realignment / recalibration on SNP count

[Bar chart: SNP count (0 to 80000) per tool (Mpileup, Mpileup -B, Mpileup -E, GATK, Popoolation2) for each dataset]

Dataset: raw data | realigned data | recalibrated data | realigned & recalibrated data
Δ:       777%     | 714%           | 42%               | 45%

• More homogeneous SNP count


Reliable results with other species?

[Bar charts: SNP count per tool (Mpileup, Mpileup -B, Mpileup -E, GATK) on raw, realigned, recalibrated, and realigned/recalibrated BAMs; panels: DUCK, PIG, CHICKEN]

Δ tools    Raw data    Realigned/Recal data
           234%        4%
           777%        20%
           454%        9%


Conclusion

• SNP calls vary between tools

• GATK realignment/recalibration greatly helps to reduce this variability

• High impact of base quality scores

• Reliable on various DNA data, but not on RNA data

“… We recommend a recalibration of per-base quality scores as in GATK or SOAPsnp …”

“… Several additional steps can be taken to improve genotype calls, such as local realignments ...”


Summary

GATK takes some getting used to

Strengths:

• Fairly fast to run thanks to possible parallelization

• Allele counting

• Handles multi-allelic positions

• Many features and options

• SNPs appear to be reliable

• Frequent improvements

• Website

Weaknesses:

• Recalibration relies on known SNPs...

• Originally created for the analysis of human genomes

• Many steps before running the UnifiedGenotyper


The GATK reference website

• Software downloads + resources (VCF)

• Analysis guide

• Best Practices

• Forum

• Technical documentation

• Etc.

http://www.broadinstitute.org/gatk/index.php


References

Samtools

Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9 (2009).

Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18:1851-8 (2008).

GATK

DePristo M. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491 (2011).

McKenna A.H., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010).

Popoolation2

Kofler R., Pandey R.V., Schlotterer C. PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics (2011).

Pyrobayes

Quinlan A.R., Stewart D.A., Strömberg M.P., Marth G.T. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods (2008).

BayesCall

Kao W-C., Stevens K., Song Y.S. BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. (2009).

Ibis

Kircher M., Stenzel U., Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. (2009).

Pyrocleaner
