
Next generation sequencing (NGS)


Vijayachitra Modhukur

BIIT

[email protected]

Next generation sequencing (NGS)

Bioinformatics course

11/13/12

Microarrays vs NGS

• Sequences do not need to be known in advance
• Highly quantitative
• Lower noise levels; NGS does not suffer from cross-hybridization
• NGS provides increased sensitivity to detect rare sequences in complex genomic samples
• Accurate single-nucleotide resolution permits discrimination between highly related sequences
• The lowered cost of NGS makes comprehensive mapping of multiple features possible

Why sequencing?

• Genome architecture
• Disease diagnosis
• Variability studies
• Comparative genomics
• Gene regulation
• Drug design
• and many more

Different generations (computers and sequencing)

First Generation – Sanger sequencing

• http://www.youtube.com/watch?

Application – Human Genome Project

1990-2003

Human Genome Project key findings

1. There are approximately 23,000 genes in human beings, the same range as in mice and roundworms. Understanding how these genes express themselves will provide clues to how diseases are caused.
2. The human genome has significantly more segmental duplications (nearly identical, repeated sections of DNA) than other mammalian genomes. These sections may underlie the creation of new primate-specific genes.
3. At the time the draft sequence was published, fewer than 7% of protein families appeared to be vertebrate-specific.

http://en.wikipedia.org/wiki/Human_Genome_Project/

Second generation sequencing


Breakthrough NGS technology

E. R. Mardis, Nature 470, 198-203 (2011), doi:10.1038/nature09796

NGS platforms

Leading platforms | 454 | Solexa/Illumina | SOLiD (ABI)
Bp per run | 400 Mb | 2-3 Gb | 3-6 Gb
Read length | 250-400 bp | 35-50 (70-100) bp | 35-50 bp
Run time | 10 hr | 2.5 days | 5 days
Download | 20 min | 27 hr (44 min) | ~1 day
Analysis | 2-5 hr | 2 days | 2-3 days
Files | 20-50 Gb | 1 T | 1 T

Massive amount of sequenced data


Application


Human genome


1,000 Genomes Project

• Small inter-individual differences in regulatory regions found in all human populations
• Association of genetic variation with disease
• Discovery of novel genetic variants such as SNPs, CNVs, etc.
• Improvement of the human reference sequence

Key results

• Each person carries 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders.

Data to analysis

Name | Description
BLAT | BLAST-Like Alignment Tool. Can handle one mismatch in the initial alignment step.
Bowtie | Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for the human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
BWA | Uses a Burrows-Wheeler transform to create an index of the genome. It is a bit slower than Bowtie but allows indels in the alignment.
ELAND | Implemented by Illumina. Includes ungapped alignment with a finite read length.
GMAP and GSNAP | Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
MAQ | Ungapped alignment that takes into account quality scores for each base.
MOSAIK | Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
RazerS | No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
SHRiMP | Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
SLIDER | An application for Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as input for alignment to a reference sequence or a set of reference sequences.
SOAP | Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT; uses a 12-letter hash table. SOAP2 is much faster than the first version.
SOCS | For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
SSAHA | Fast for a small number of variants.
Taipan | De novo assembler for Illumina reads.

Quality scores

• Each base from a sequencer comes with a quality score
• Quality scores encode base-calling error probabilities
• Phred quality score: $Q = -10 \log_{10} P$, where $P$ is the probability that the base call is wrong
• A higher quality score indicates a smaller probability of error
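The relation between Phred scores and error probabilities can be illustrated with a minimal Python sketch. The function names are hypothetical, and the ASCII offset of 33 used to decode a FASTQ quality string is an assumption (the common Sanger-style encoding); older Illumina files used a different offset.

```python
import math

def phred_from_error(p):
    """Phred score Q = -10 * log10(P) for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger-style encoding).
    """
    return [ord(c) - offset for c in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {error_from_phred(q):.4f}")
    print(decode_quality_string("I5!"))
```

With these definitions, Q20 corresponds to a 1 in 100 chance of a wrong base call and Q30 to 1 in 1,000.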


FASTQ

Raw data

SAM/BAM format

Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010

Proliferation of alignment formats over the years: CIGAR, PSL, GFF, XML, etc.

SAM (Sequence Alignment/Map) format
• Single unified format for storing read alignments to a reference genome

BAM (Binary Alignment/Map) format
• Binary equivalent of SAM
• Developed for fast processing/indexing

Advantages
• Can store alignments from most aligners
• Supports multiple sequencing technologies
• Supports indexing for quick retrieval/viewing
• Compact size (e.g. 112 Gbp Illumina = 116 GB disk space)
• Reads can be grouped into logical groups, e.g. lanes, libraries, individuals/genotypes
• Supports second-best base call/quality for hard-to-call bases
• Possibility of storing raw sequencing data in BAM as a replacement for SRF and FASTQ

Each bit in SAM format
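As a rough illustration of what a SAM record contains, the sketch below splits one alignment line into the eleven mandatory tab-separated fields and decodes the bitwise FLAG value. The helper names and the example read are hypothetical; the field names and flag bits follow the SAM specification as I understand it.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Meaning of each bit in the FLAG field.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return record

def decode_flag(flag):
    """List the names of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

if __name__ == "__main__":
    # Hypothetical read, made up for illustration only.
    line = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\tNM:i:0"
    rec = parse_sam_line(line)
    print(rec["RNAME"], rec["POS"], rec["CIGAR"], decode_flag(rec["FLAG"]))
```

For real data, a dedicated library or `samtools` is the safer choice; this sketch only shows how the fields and bits fit together.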


Sequence alignment

• Reference alignment

Spaced seed vs BWT


Burrows-Wheeler transform

• Original: WBWBWB#

The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it rotates text through all possible combinations, as shown in the Rotate column of Table 4-1. Second, it sorts each line alphabetically, as shown in the Sort column of Table 4-1. Third, it outputs the final column of the sorted list, which groups identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.

Table 4-1: Rotating and Sorting Data

Rotate | Sort | Output
WBWBWB# | BWBWB#W | W
#WBWBWB | BWB#WBW | W
B#WBWBW | B#WBWBW | W
WB#WBWB | WBWBWB# | #
BWB#WBW | WBWB#WB | B
WBWB#WB | WB#WBWB | B
BWBWB#W | #WBWBWB | B

At this point, the BWT algorithm hasn’t compressed any data but merely rearranged the data to group identical characters together; the BWT algorithm has rearranged the data to make the run-length encoding algorithm more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.

After compressing data, you’ll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.

The BWT algorithm works in reverse by adding the original BWT output (WWW#BBB) and then sorting the lines repetitively a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.

After the final add and sort step, the BWT algorithm looks for the only line that has the end of data character (#) as the last character, which identifies the original, uncompressed data. The BWT algorithm is both simple to understand and implement, which makes it easy to use for speeding up ordinary run-length encoding.
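The rotate, sort and output steps, the run-length encoding, and the add-and-sort inversion described above can be sketched in a few lines of Python. This is a naive illustration, not how aligners such as Bowtie or BWA build their indexes. The function names are hypothetical, and the sort order that places the end marker # after all letters is an assumption read off Table 4-1; many implementations instead use a marker that sorts first.

```python
def _sort_key(s, end="#"):
    # Place the end marker after every other character, which reproduces
    # the Sort column of Table 4-1 (assumed ordering: B < W < #).
    return [(1, "") if c == end else (0, c) for c in s]

def bwt(text, end="#"):
    """Rotate, sort, and output the last column."""
    assert text.endswith(end) and text.count(end) == 1
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda r: _sort_key(r, end))
    return "".join(r[-1] for r in rotations)

def run_length_encode(s):
    """Replace each run of n > 1 identical characters with n + character."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        out.append((str(run) if run > 1 else "") + s[i])
        i = j
    return "".join(out)

def inverse_bwt(last_column, end="#"):
    """Add the BWT output as a new first column and sort, len(text) times."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        rows = sorted((c + r for c, r in zip(last_column, rows)),
                      key=lambda r: _sort_key(r, end))
    # The original text is the only row that ends with the end marker.
    return next(r for r in rows if r.endswith(end))

if __name__ == "__main__":
    transformed = bwt("WBWBWB#")
    print(transformed)                     # WWW#BBB
    print(run_length_encode(transformed))  # 3W#3B
    print(inverse_bwt(transformed))        # WBWBWB#
```

The inversion here costs one full sort per character, which is fine for a 7-character string but far too slow for a genome.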


Sequence assembly: solving a jigsaw puzzle

Sequence assembly: repeating patterns

Greedy Assemblers

• Greedily joins together the reads that are most similar to each other.
• Examples: Phrap, CAP3, TIGR Assembler

© 2009 SIB LF June 4, 2010

Greedy

• Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.

• An example is shown below, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
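The greedy strategy can be illustrated with a toy Python sketch that repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap. The function names, the `min_len` threshold and the example reads are all hypothetical; real assemblers such as Phrap or CAP3 tolerate sequencing errors and handle reverse complements, which this sketch does not.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because each merge looks only at the single best overlap, two copies of a repeat can be joined in the wrong place, which is the mis-assembly risk mentioned above.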

Overlap-layout-consensus

• Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure below). This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.

• An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed. In a layout stage, the graph is simplified by removing redundant information. Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.
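The overlap and layout stages can be sketched as follows: build a graph whose edges are suffix-prefix overlaps, then search for a path that visits every read exactly once. The function names and reads are hypothetical, and the brute-force search over all read orderings is only workable for a handful of reads; it is shown to make the Hamiltonian-path formulation concrete, not as a practical method.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Overlap stage: directed edges (a, b) -> overlap length."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph

def hamiltonian_path(reads, graph):
    """Layout stage: an ordering that uses every read once, following edges."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None

def spell_path(path, graph):
    """Join the reads along the path, skipping the overlapping prefix each time."""
    sequence = path[0]
    for a, b in zip(path, path[1:]):
        sequence += b[graph[(a, b)]:]
    return sequence

if __name__ == "__main__":
    reads = ["TACCTGA", "ATGGCGT", "GCGTACC"]
    graph = overlap_graph(reads)
    path = hamiltonian_path(reads, graph)
    print(path, spell_path(path, graph))
```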


Overlap-layout-consensus (OLC)

Barbara Hutter, Assembly, page 9

• Overlap: based on all pairwise comparisons, construction of an overlap graph
  nodes = reads (sequences)
  edges = connections between overlapping reads
• Layout: look for paths in the overlap graph which are segments of the genome to assemble (contigs)
  goal: find a Hamiltonian path = a path that contains all nodes exactly once
• Consensus: following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome
  in case of different nucleotides: majority vote considering base qualities

Programs using OLC:
Arachne, Celera Assembler (CABOG), Newbler, Minimus, Edena, CAP, PCAP
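The consensus step, a majority vote that takes base qualities into account, might look like the sketch below. Weighting each base by the sum of its Phred scores is an assumption made for illustration; the slides do not specify the exact scheme, and the function names and example columns are hypothetical.

```python
from collections import defaultdict

def consensus_base(bases, quals):
    """Pick the base with the highest total quality in one alignment column."""
    weight = defaultdict(int)
    for base, qual in zip(bases, quals):
        weight[base] += qual
    # sorted() makes ties deterministic (alphabetically first base wins).
    return max(sorted(weight), key=weight.get)

def consensus(columns):
    """Build a consensus sequence from (bases, quals) pairs, one per column."""
    return "".join(consensus_base(bases, quals) for bases, quals in columns)

if __name__ == "__main__":
    columns = [
        ("AAA", [30, 30, 30]),  # all reads agree
        ("CCT", [30, 25, 10]),  # low-quality T is outvoted
        ("GGT", [5, 5, 40]),    # one confident T outweighs two weak G calls
    ]
    print(consensus(columns))
```

The third column shows how this differs from a plain majority vote: two low-quality calls lose to a single high-quality one.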


http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next


Online resources

• NCBI-SRA
• NCBI-GEO
• The European Nucleotide Archive (ENA)
• ArrayExpress

Visualization tools

sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats15, widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.

Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result16. In some cases, visual inspection can facilitate the evaluation and interpretation

Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView17, MapView18 and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.

In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq19 and Gap5 use the vertical-axis position to indicate insertion size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations—for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous

Table 1 | Tools for visualizing sequencing data

Name | Cost | OS | Description | URL

Stand-alone tools
ABySS-Explorer25 | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed3* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene14 | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView17 | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap12,13 | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye6 | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView18 | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview8 | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/

Web-based tools
LookSeq19 | Free | | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer7 | Free | | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/

Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement.

*Our recommendation


Dr. Ece Gamsiz


Next lectures

• RNA sequencing: method, applications, advantages over microarrays
• ChIP sequencing
• Epigenomics: DNA methylation, histone modification
