Next generation sequencing (NGS)

(1)

Vijayachitra Modhukur

BIIT

[email protected]

Bioinformatics course

11/13/12

(2)

(3)

Microarrays vs NGS




Sequences do not need to be known in advance



Highly quantitative



Lower noise levels; does not suffer from cross-hybridization



NGS provides increased sensitivity to detect rare sequences

in complex genomic samples



Accurate single-nucleotide resolution permits the

discrimination between highly related sequences



The lowered cost of NGS makes comprehensive mapping of

multiple features possible

(4)

(5)

Why sequencing?



Genome architecture



Disease diagnosis



Variability studies



Comparative genomics



Gene regulation



Drug design



and many more……


(6)

(7)

Different generations (computers and

sequencing)

11/14/12

(8)

First Generation – Sanger sequencing



http://www.youtube.com/watch?

(9)

Application – Human genome project

1990-2003


(10)

Human Genome Project: key findings



1. There are approximately 23,000 genes in human beings,

the same range as in mice and roundworms. Understanding

how these genes express themselves will provide clues to

how diseases are caused.



2. The human genome has significantly more segmental

duplications (nearly identical, repeated sections of DNA)

than other mammalian genomes. These sections may underlie

the creation of new primate-specific genes



3. At the time when the draft sequence was published fewer

than 7% of protein families appeared to be vertebrate specific

http://en.wikipedia.org/wiki/Human_Genome_Project/

(11)

Second generation sequencing


(12)

(13)

ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Breakthrough NGS technology


(14)

NGS platforms

Leading platforms   454          Solexa/Illumina      SOLiD (ABI)
Bp per run          400 Mb       2-3 Gb               3-6 Gb
Read length         250-400 bp   35-50 (70-100) bp    35-50 bp
Run time            10 hr        2.5 days             5 days
Download            20 min       27 hr (44 min)       ~1 day
Analysis            2-5 hr       2 days               2-3 days
Files               20-50 Gb     1 T                  1 T

(15)

Massive amount of sequenced data


(16)

(17)

Application


(18)

(19)

Human genome


(20)

(21)

1000 Genomes Project




Small inter-individual differences in regulatory regions, found in all human populations

Association of genetic variation with disease

Discovery of novel genetic variants such as SNPs, CNVs, etc.

Improvement of the human reference sequence



Key results



“Each person carries 250 to 300 loss-of-function variants in

annotated genes and 50 to 100 variants previously implicated

in inherited disorders”.

(22)

(23)

data to analysis

(24)

(25)


Name: Description

BLAT: BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.

Bowtie: Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.

BWA: Uses a Burrows-Wheeler transform to create an index of the genome. It's a bit slower than Bowtie but allows indels in alignment.

ELAND: Implemented by Illumina. Includes ungapped alignment with a finite read length.

GMAP and GSNAP: Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.

MAQ: Ungapped alignment that takes into account quality scores for each base.

MOSAIK: Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.

RazerS: No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.

SHRiMP: Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.

SLIDER: Slider is an application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.

SOAP: Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. Now SOAP2 is much faster than the first version.

SOCS: For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.

SSAHA: Fast for a small number of variants.

Taipan: De novo assembler for Illumina reads.

(26)

Quality scores



Each base from a sequencer comes with a quality score



Base-calling error probabilities



Phred quality score



Q = -10 log10(P), where P is the base-calling error probability



higher quality score indicates a smaller probability of error
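The relationship between Phred scores and error probabilities can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names are made up for this example, and the ASCII offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption, since some older Illumina pipelines used an offset of 64.

```python
import math

def phred_from_error(p):
    """Phred quality score for a base-calling error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Base-calling error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def decode_quality(qual, offset=33):
    """Decode an ASCII-encoded quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(c) - offset for c in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {error_from_phred(q):.4f}")
```

For instance, Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000.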

(27)

Quality scores


(28)

(29)

FASTQ

Raw data
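As a rough sketch of what a raw FASTQ record looks like to a program, the snippet below reads four-line records (header starting with @, sequence, separator starting with +, quality string). The names `FastqRecord` and `parse_fastq` are hypothetical, the read in the usage example is invented, and the code assumes unwrapped four-line records with a Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]

def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    nonblank = [line.rstrip("\n") for line in lines if line.strip()]
    if len(nonblank) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(nonblank), 4):
        header, seq, plus, qual = nonblank[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes one Phred score (offset is an assumption).
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])

if __name__ == "__main__":
    example = "@read1\nGATTACA\n+\nIIII#5!\n".splitlines()
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```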

(30)

(31)

SAM/BAM format


Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010

SAM/BAM Format

Proliferation of alignment formats over the years: Cigar, psl, gff, xml etc.

SAM (Sequence Alignment/Map) format



Single unified format for storing read alignments to a reference genome

BAM (Binary Alignment/Map) format



Binary equivalent of SAM



Developed for fast processing/indexing

Advantages



Can store alignments from most aligners



Supports multiple sequencing technologies



Supports indexing for quick retrieval/viewing



Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)



Reads can be grouped into logical groups e.g. lanes, libraries,

individuals/genotypes



Supports second best base call/quality for hard to call bases

Possibility of storing raw sequencing data in BAM as replacement to SRF &

fastq

(32)

(33)

Each bit in SAM format
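Assuming this slide refers to the bitwise FLAG field of a SAM record, the sketch below decodes a FLAG value into its individual bits. The bit values follow the SAM specification; the dictionary and function names are illustrative only.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

if __name__ == "__main__":
    for name in decode_flag(99):
        print(name)
```

A FLAG of 99 (= 1 + 2 + 32 + 64) therefore describes a properly paired read that is first in its pair and whose mate maps to the reverse strand.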


(34)

Sequence alignment



Reference alignment

(35)

Spaced seed vs BWT


(36)

Burrows-Wheeler transform



Original :

WBWBWB#

(37)

Burrows-Wheeler transform

Book IV, Chapter 4: Data Compression Algorithms. Lossless Data Compression Algorithms, p. 437

The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it rotates text through all possible combinations, as shown in the Rotate column of Table 4-1. Second, it sorts each line alphabetically, as shown in the Sort column of Table 4-1. Third, it outputs the final column of the sorted list, which groups identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.

Table 4-1 Rotating and Sorting Data

Rotate     Sort       Output
WBWBWB#    BWBWB#W    W
#WBWBWB    BWB#WBW    W
B#WBWBW    B#WBWBW    W
WB#WBWB    WBWBWB#    #
BWB#WBW    WBWB#WB    B
WBWB#WB    WB#WBWB    B
BWBWB#W    #WBWBWB    B

At this point, the BWT algorithm hasn’t compressed any data but merely rearranged the data to group identical characters together; the BWT algorithm has rearranged the data to make the run-length encoding algorithm more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.

After compressing data, you’ll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.

The BWT algorithm works in reverse by adding the original BWT output (WWW#BBB) and then sorting the lines repetitively a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.

After the final add and sort step, the BWT algorithm looks for the only line that has the end of data character (#) as the last character, which identifies the original, uncompressed data. The BWT algorithm is both simple to understand and implement, which makes it easy to use for speeding up ordinary run-length encoding.
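The rotate, sort, output procedure and its reversal can be tried out with a small Python sketch. It is a naive illustration under stated assumptions, not how aligners such as Bowtie or BWA build their indexes: the function names are made up, the end marker # is sorted after the letters (the ordering implied by the Sort column of Table 4-1, whereas plain ASCII order would place # first), and the run-length coder assumes the text itself contains no digits.

```python
import re
from itertools import groupby

END = "#"  # end-of-data marker, as in the WBWBWB# example

def _sort_key(row):
    # Assumption: the end marker sorts after every other character,
    # which reproduces the Sort column of Table 4-1.
    return [(1, "") if c == END else (0, c) for c in row]

def bwt(text):
    """Burrows-Wheeler transform of a string that ends with the END marker."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return "".join(row[-1] for row in sorted(rotations, key=_sort_key))

def inverse_bwt(last_column):
    """Invert the transform by repeated 'add and sort' steps."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted((c + row for c, row in zip(last_column, table)), key=_sort_key)
    # The original text is the only row that ends with the END marker.
    return next(row for row in table if row.endswith(END))

def rle_encode(text):
    """Run-length encode; runs of length 1 are written without a count."""
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)

def rle_decode(encoded):
    """Decode the output of rle_encode (assumes the text has no digits)."""
    return "".join(ch * int(n or 1) for n, ch in re.findall(r"(\d*)(\D)", encoded))

if __name__ == "__main__":
    transformed = bwt("WBWBWB" + END)
    packed = rle_encode(transformed)
    print(transformed, packed, inverse_bwt(rle_decode(packed)))
```

Sorting all rotations explicitly takes time and memory that grow quickly with input length, so this form is only suitable for toy strings.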


(38)

Sequence assembly: solving a jigsaw puzzle

(39)

Sequence assembly: repeating patterns


(40)

Greedy Assemblers



Greedily joins the reads together that are most similar to

each other.



Examples: Phrap, CAP3, TIGR Assembler

Greedy

• Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.

• An example is shown below, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
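The greedy strategy can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: overlaps are exact suffix-prefix matches (real assemblers tolerate sequencing errors and consider reverse complements), the reads are invented for the example, and the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

if __name__ == "__main__":
    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(reads))
```

Because each merge is chosen using only the current best overlap, two reads from different copies of a repeat can be joined just as readily as true neighbours, which is the source of the mis-assemblies mentioned above.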

Overlap-layout-consensus

• Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure below). This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.

• An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed. In a layout stage, the graph is simplified by removing redundant information. Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.
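A toy version of the overlap and layout stages might look like the following Python sketch. It rests on several assumptions: overlaps are exact suffix-prefix matches, the Hamiltonian path is found by brute force over all read orderings (feasible only for a handful of reads, since the general problem is NP-hard), and the final step merely splices reads along the path rather than building a true multiple-alignment consensus. All names and the example reads are illustrative.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph as a dict: (read_a, read_b) -> overlap length."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k > 0:
                    graph[(a, b)] = k
    return graph

def hamiltonian_layout(reads, graph):
    """Brute-force search for an ordering that visits every read exactly once."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None

def splice(order, graph):
    """Join reads along the layout, dropping the overlapping prefix each time."""
    sequence = order[0]
    for a, b in zip(order, order[1:]):
        sequence += b[graph[(a, b)]:]
    return sequence

if __name__ == "__main__":
    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    graph = overlap_graph(reads)
    layout = hamiltonian_layout(reads, graph)
    print(layout, splice(layout, graph))
```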


(41)

Overlap layout consensus

Page 9 Barbara Hutter Assembly

● Based on all pairwise comparisons
● Construction of an overlap graph
  • nodes = reads (sequences)
  • edges = connections between overlapping reads
● Layout: look for paths in the overlap graph which are segments of the genome to assemble (contigs)
  • goal: find Hamiltonian path = a path that contains all nodes exactly once
● Consensus: following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome
  • in case of different nucleotides: majority vote considering base qualities
● Programs using the OLC:
  • Arachne, Celera Assembler (CABOG), newbler, Minimus, Edena, CAP, PCAP
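The quality-aware majority vote in the consensus step could be sketched as below. The weighting scheme is an assumption made for illustration (each base call votes with its Phred score, and ties go to the base seen first); the slide does not specify how the listed assemblers weigh qualities, and the function name is hypothetical.

```python
from collections import defaultdict

def consensus_base(column):
    """Pick the consensus base for one alignment column.

    column is a list of (base, phred_quality) pairs, one per read covering
    the position. Each call votes with its quality score (an assumed scheme).
    """
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    return max(weight, key=weight.get)

if __name__ == "__main__":
    # Two medium-quality A calls outvote one high-quality C call (50 vs 40).
    print(consensus_base([("A", 30), ("A", 20), ("C", 40)]))
    # Two low-quality A calls are outvoted by one high-quality C call (20 vs 40).
    print(consensus_base([("A", 10), ("A", 10), ("C", 40)]))
```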

Overlap-Layout-Consensus

http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next


(42)

(43)

Online resources



NCBI-SRA



NCBI-GEO



The European Nucleotide Archive (ENA)



ArrayExpress


(44)

Visualization tools


sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats15, widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.

Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result16. In some cases, visual inspection can facilitate the evaluation and interpretation

Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView17, MapView18 and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.

In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq19 and Gap5 use the vertical-axis position to indicate insertion size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations, for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous

Table 1 | Tools for visualizing sequencing data

Name | Cost | OS | Description | URL

Stand-alone tools
ABySS-Explorer25 | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed3* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene14 | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView17 | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap12,13 | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye6 | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView18 | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview8 | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/

Web-based tools
LookSeq19 | Free | - | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer7 | Free | - | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/

Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement.

*Our recommendation


(45)

Dr. Ece Gamsiz


(46)

Next lectures



RNA sequencing, method, application, advantages over

microarrays



ChIP sequencing



Epigenomics, DNA methylation, histone modification