• No results found

Metagenomic Assembly

N/A
N/A
Protected

Academic year: 2021

Share "Metagenomic Assembly"

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Metagenomic Assembly

Successes, Validation,

And

Challenges

Matthew Scholz

Los Alamos National Laboratory

Genome Sciences Division

(2)

Metagenomic Assembly is Complex

Complexity dictates method of assembly

Simple communities assemble easily

Complex communities….do not

Complexity dictates sequencing requirement

More sequence means more complexity and

(3)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Assembly Successes

High throughput pipeline

Improved assemblies

JGI/LANL has successfully assembled 123 metagenomes in last

year

Average time ~ 1 week/sample

HMP metagenome assemblies

LANL assembled 223 metagenomes from whole genome shotgun

sequencing of HMP

Assembly of 10 site specific samples (multiple samples from same

site)

(4)

Assembly differences

HMP shotgun metagenome assembly

Optimization

Tool Selection

Kmer Selection

Selection of # Cores

Volume production

1 Kmer

Metrics

JGI Metagenome Assembly

Multiple tool selection

Range of Kmers utilized

(5)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

HMP assemblies

Draft

Many Samples

Many metrics

Max contig

Total Assembly size

Etc.

(6)

JGI/LANL Metagenome

Assembly Pipeline

(7)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

JGI/LANL Assembly Process

“Improved” merging of Multiple assemblies

Statistical Metrics are better

(8)

Validation

How do you validate metagenomes?

Contig Statistics

Read Mapping

Annotation

Similarity Searches

Phylogenetic distribution

Reference genomes

(9)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Read Mapping

Are contigs correctly assembled?

Do read data confirm contigs?

Edge effects

Generate information about coverage

Average Fold Coverage

Percent of Contig Covered

SNP/INDEL information

(10)

Finding Potentially Erroneous Contigs with

Read mapping

Read coverage of each base

from pipeline.

Lines delineate regions where

mean coverage deviates past

thresholds

Is this a good contig?

Where did contig come from?

Can use to automatically

break or discard contigs that

fail read-mapping

(11)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Reference Genomes

Map reads to reference genomes

Determine which/how many sequenced genomes present in sample

Allow partitioning of reads

SNP/INDEL analysis

Post-Assembly

Map contigs to genomes

SNP/INDEL

(12)

Ongoing Challenges

Hardware/Software

Kmer assembly Speed/RAM tradeoff

Algorithms for metagenomes

Too much software

How to Determine “Correct” assembly?

Metadata

Too much data

(13)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Read Incorporation

0

10

20

30

40

50

60

70

80

90

100

0.00E+00

2.00E+08

4.00E+08

6.00E+08

8.00E+08

% of Reads Incorporated

# of Reads in Sample

Biofilm

Compost

Sediment

Soil

Water

(14)

Not Enough Coverage

Coverage varies by sample

• 

1 Lane HiSeq is maximum for current hardware/assemblers (complex samples)

~ 70X coverage

~28X coverage

~600X coverage

~6000X coverage

(15)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Concluding Thoughts

We need metadata (standards)

Tiers

1.

Assembly Statistics

2.

Assembly Level

3.

Contig analysis

(16)

Classification of Metagenomes

1.

Assembly Statistics

% Read Incorporation

Total Assembled Bases (post filtering?)

G+C Content

Coverage histograms

2.

Assembly Level

Draft

1 assembler, 1 set of parameters

Improved Draft

Current JGI/LANL assembly/merge method

Read based validation

High Quality (Theoretical)

Binning strategies pre-assembly

(17)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Classification of Metagenomes

1.

Assembly Statistics

2.

Assembly Level

3.

Contig analysis

Read mapping based validation

Clustering

(18)

Many Thanks:

Wet-lab Team

Cheryl Gleasner

Kim McMurry

Krista Reitenga

Xiaohong Shen

Others…

Management Team

Chris Detter

David Bruce

Tracy Erkkila

Lance Green

Shunsheng Han

Project Management

Shannon Johnson

Lynne Goodwin

Others…

Metagenomics and Data

Analysis Team

Patrick Chain

Tracey Freitas

Ron Croonenberg

Bin Hu

Chien-Chi Lo

Shawn Starkenburg

Gary Xie

Shannon Steinfadt

Others…

Informatics Team

Ben Allen

Andy Seirp

Criag Blackhart

Yan Xu

Todd Yilk

Single cell work

Roger Lasken

Ramunas

Stepanaskus

Steve Hallam

Finishing and SCG

Olga Chertov

Karen Davenport

Armand Dichosa

Michael Fitzsimons

Ahmet Zeytun

Others…

Metagenome work

Jim Tiedje

Titus Brown

Adina Howe

HMP consortium

Mihai Pop

Joe Zhou

Kmer team

Joel Berendzen

Nick Hengartner

Ben McMahon

Judith Cohn

(19)

LA-UR-11-10900

Advancing Science with DNA Sequence The World’s Greatest Science Protecting America

Metagenomic Assembly

Strategies

Bigger computers

• 

1 Lane HiSeq PE reads = 400 M reads

• 

1TB RAM (complex communities)

Better Assemblers

• 

MetaIDBA

• 

MetaVelvet

• 

Ray

• 

ABySS

• 

AllPaths

Binning Reads

• 

Unsupervised/Heuristic

• 

Machine Learning

• 

Statistical

• 

Reference Based

References

Related documents

The HP SmartStream Photo Enhancement Server is a component of the full end-to-end offering that HP Indigo offers for photo specialty printing, which also includes: file creation

Power plants are included in the ETS sector, while electricity is partly used by the non-ETS sector. The ETS and non-ETS sectors have different mechanisms to reach reduction

Information that was previously only accessible through books or by studying in Islamic Boarding Schools ( pesantren ) can now be easily accessed on Google.

We also include as covariates time dummies ( c t ) and, in the case of foreign immigration to Spain (or emigration of Spaniards), the logarithm of the stock of migrants of the

The main objective of this paper is to design a finite difference WENO scheme for the gas dynamic equations with gravitational source terms, which maintains the well-balanced

Assessment of the results from both the operational (predominantly deep-focussed) and research (predominantly shallow-focussed) monitoring activities from Sleipner

Based on the key principles of the World Bank and open source information from the Financial Access 2010 database, an article has developed an algorithm for evaluating the quality

Because there were 568 AHs with more than 2,000 APs, the level of impacts caused by the road is ranked significant, in accordance with OP4.12 on Involuntary resettlement of the