Vijayachitra Modhukur
BIIT
Next generation sequencing (NGS)
Bioinformatics course
11/13/12
Microarrays vs NGS
Sequences do not need to be known in advance
Highly quantitative
Lower noise levels; no cross-hybridization
NGS provides increased sensitivity to detect rare sequences
in complex genomic samples
Accurate single-nucleotide resolution permits the
discrimination between highly related sequences
The lowered cost of NGS makes comprehensive mapping of
multiple features possible
Why sequencing?
Genome architecture
Disease diagnosis
Variability studies
Comparative genomics
Gene regulation
Drug design
and many more……
Different generations (computers and sequencing)
First Generation – Sanger sequencing
http://www.youtube.com/watch?
Application – Human Genome Project (1990-2003)
Human Genome Project key findings
1. There are approximately 23,000 genes in human beings,
the same range as in mice and roundworms. Understanding
how these genes express themselves will provide clues to
how diseases are caused.
2. The human genome has significantly more segmental
duplications (nearly identical, repeated sections of DNA)
than other mammalian genomes. These sections may underlie
the creation of new primate-specific genes
3. At the time when the draft sequence was published fewer
than 7% of protein families appeared to be vertebrate specific
http://en.wikipedia.org/wiki/Human_Genome_Project/
Second generation sequencing
ER Mardis. Nature 470, 198-203 (2011). doi:10.1038/nature09796
Breakthrough NGS technology
NGS platforms
Leading platforms    454           Solexa/Illumina      SOLiD (ABI)
Bases per run        400 Mb        2-3 Gb               3-6 Gb
Read length          250-400 bp    35-50 (70-100) bp    35-50 bp
Run time             10 hr         2.5 days             5 days
Download             20 min        27 hr (44 min)       ~1 day
Analysis             2-5 hr        2 days               2-3 days
File size            20-50 GB      1 TB                 1 TB
Massive amount of sequenced data
Application
Human genome
1000 Genomes Project
Small inter-individual differences in regulatory regions, found in all human populations
Association of genetic variation with disease
Discovery of novel genetic variants such as SNPs, CNVs, etc.
Improvement of the human reference sequence
Key results
“Each person carries 250 to 300 loss-of-function variants in
annotated genes and 50 to 100 variants previously implicated
in inherited disorders.”
data to analysis
Name            Description
BLAT            BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.
Bowtie          Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
BWA             Uses a Burrows-Wheeler transform to create an index of the genome. It's a bit slower than Bowtie but allows indels in alignment.
ELAND           Implemented by Illumina. Includes ungapped alignment with a finite read length.
GMAP and GSNAP  Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
MAQ             Ungapped alignment that takes into account quality scores for each base.
MOSAIK          Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
RazerS          No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
SHRiMP          Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
SLIDER          Slider is an application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.
SOAP            Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. Now SOAP2 is much faster than the first version.
SOCS            For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
SSAHA           Fast for a small number of variants.
Taipan          De novo assembler for Illumina reads.
Quality scores
Each base from a sequencer comes with a quality score
Base-calling error probabilities
Phred quality score
Q = -10 log10(P), where P is the base-calling error probability
A higher quality score indicates a smaller probability of error
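As a small worked example of the Phred relation, the sketch below converts between an error probability P and a quality score Q using Q = -10 log10(P). The function names are made up for illustration.

```python
import math


def phred_from_error(p):
    """Phred quality score for a base-calling error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Base-calling error probability for a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Q20 means 1 error in 100 calls, Q30 means 1 in 1000, and so on.
    for q in (10, 20, 30, 40):
        print(q, error_from_phred(q))
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability.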
FASTQ
Raw data
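A FASTQ file stores each read as four lines: a header starting with @, the bases, a + separator, and one quality character per base. The sketch below reads such records and decodes the quality string. It assumes the Phred+33 (Sanger) encoding; older Illumina pipelines used an offset of 64. The read in the usage example is invented.

```python
import io

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 encoding


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank trailing line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


if __name__ == "__main__":
    example = io.StringIO("@read1\nGATTACA\n+\nIIII#5?\n")  # made-up read
    for name, seq, quals in read_fastq(example):
        print(name, seq, quals)
```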
SAM/BAM format
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
SAM/BAM Format
Proliferation of alignment formats over the years: Cigar, psl, gff, xml etc.
SAM (Sequence Alignment/Map) format
Single unified format for storing read alignments to a reference genome
BAM (Binary Alignment/Map) format
Binary equivalent of SAM
Developed for fast processing/indexing
Advantages
Can store alignments from most aligners
Supports multiple sequencing technologies
Supports indexing for quick retrieval/viewing
Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)
Reads can be grouped into logical groups e.g. lanes, libraries,
individuals/genotypes
Supports second best base call/quality for hard to call bases
Possibility of storing raw sequencing data in BAM as replacement to SRF &
fastq
Each bit in SAM format
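Each SAM alignment line has eleven mandatory tab-separated fields, and the FLAG field packs several yes/no properties into single bits. The sketch below splits one line into those fields and decodes the flag. The field names and bit values follow the SAM specification; the alignment line and the helper names are invented for illustration.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields (optional tags dropped)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag):
    """Names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    line = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII"  # made-up
    rec = parse_sam_line(line)
    print(rec.rname, rec.pos, rec.cigar, decode_flag(rec.flag))
```

POS is 1-based in SAM. Header lines, which start with @, are not alignment records and should be skipped before calling the parser.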
Sequence alignment
Reference alignment
Spaced seed vs BWT
Burrows-Wheeler transform
Original: WBWBWB#
Burrows-Wheeler transform
Book IV, Chapter 4: Data Compression Algorithms. Lossless Data Compression Algorithms, p. 437
The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it rotates text through all possible combinations, as shown in the Rotate column of Table 4-1. Second, it sorts each line alphabetically, as shown in the Sort column of Table 4-1. Third, it outputs the final column of the sorted list, which groups identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.
Table 4-1: Rotating and Sorting Data
Rotate     Sort       Output
WBWBWB#    BWBWB#W    W
#WBWBWB    BWB#WBW    W
B#WBWBW    B#WBWBW    W
WB#WBWB    WBWBWB#    #
BWB#WBW    WBWB#WB    B
WBWB#WB    WB#WBWB    B
BWBWB#W    #WBWBWB    B
At this point, the BWT algorithm hasn’t compressed any data but merely rearranged the data to group identical characters together; the BWT algorithm has rearranged the data to make the run-length encoding algorithm more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.
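The rotate, sort and output steps, followed by run-length encoding, can be sketched in a few lines of Python. One assumption is worth stating: the Sort column of Table 4-1 places the # marker after the letters, so the sketch ranks # last to reproduce the table. Plain ASCII sorting would put # first and give a different, equally valid output. The function names are illustrative.

```python
from itertools import groupby

END = "#"  # end-of-data marker used in the example


def sort_key(s):
    # Rank the end marker after every other character, as in Table 4-1.
    return [(ch == END, ch) for ch in s]


def bwt(text):
    """Burrows-Wheeler transform of a string that ends with the END marker."""
    if not text.endswith(END):
        raise ValueError("text must end with the end-of-data marker")
    rotations = [text[i:] + text[:i] for i in range(len(text))]  # step 1: rotate
    rotations.sort(key=sort_key)                                 # step 2: sort
    return "".join(r[-1] for r in rotations)                     # step 3: last column


def rle_encode(s):
    """Run-length encode: a run of n > 1 equal characters becomes '<n><char>'."""
    out = []
    for ch, run in groupby(s):
        n = len(list(run))
        out.append((str(n) if n > 1 else "") + ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("WBWBWB#")
    print(transformed, rle_encode(transformed))
```

This naive version stores every rotation, so it needs memory quadratic in the text length. Tools that index whole genomes build the transform from a suffix array instead.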
After compressing data, you’ll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.
The BWT algorithm works in reverse by adding the original BWT output (WWW#BBB) and then sorting the lines repetitively a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.
After the final add and sort step, the BWT algorithm looks for the only line that has the end of data character (#) as the last character, which identifies the original, uncompressed data. The BWT algorithm is both simple to understand and implement, which makes it easy to use for speeding up ordinary run-length encoding.
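The add-and-sort procedure for reversing the transform can be sketched as follows. As an assumption, the sketch ranks the # marker after the letters, because the inverse only works when it uses the same character order as the forward transform that produced WWW#BBB. The run-length decoder assumes the data itself contains no digits. The function names are illustrative.

```python
END = "#"  # end-of-data marker used in the example


def sort_key(s):
    # Must match the order used by the forward transform: END sorts last.
    return [(ch == END, ch) for ch in s]


def rle_decode(s):
    """Undo the simple run-length code, e.g. '3W#3B' -> 'WWW#BBB'."""
    out, count = [], ""
    for ch in s:
        if ch.isdigit():
            count += ch                       # collect a multi-digit run length
        else:
            out.append(ch * int(count or 1))  # no count means a run of one
            count = ""
    return "".join(out)


def inverse_bwt(last_column):
    """Rebuild the original text by repeated 'add the BWT output, then sort'."""
    rows = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Add: prepend the BWT output column to the current rows. Sort: reorder.
        rows = sorted((c + r for c, r in zip(last_column, rows)), key=sort_key)
    # The original text is the only row that ends with the end marker.
    return next(r for r in rows if r.endswith(END))


if __name__ == "__main__":
    print(inverse_bwt(rle_decode("3W#3B")))
```

For the 7-character example the loop runs seven times, matching the seven add and sort steps described above.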
Sequence assembly - solving a jigsaw puzzle
Sequence assembly - repeating patterns
Greedy Assemblers
Greedily join together the reads that are most similar to each other.
Examples: Phrap, CAP3, TIGR Assembler
© 2009 SIB LF June 4, 2010
Greedy
• Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.
• An example is shown below, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
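The greedy strategy can be illustrated with a toy assembler that repeatedly merges the two sequences with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions: exact overlaps, no sequencing errors, no reverse complements and no reads contained inside other reads. It is not the algorithm of any particular program, and the reads and the min_len threshold in the usage example are invented.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["TACGGA", "ATGGCGT", "GCGTACG"]))
```

Because every step looks only at the single best local overlap, two copies of a repeat can be joined in the wrong place, which is the mis-assembly risk noted above.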
Overlap-layout-consensus
• Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure below). This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.
• An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed. In a layout stage, the graph is simplified by removing redundant information. Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.
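The three stages can be sketched on a toy input: compute all pairwise overlaps, search the overlap graph for a path that visits every read once, then merge the reads along that path. The sketch assumes error-free reads, so the consensus stage reduces to joining overlaps. The Hamiltonian path is found by brute force over all orderings, which is only feasible for a handful of reads. All names and the example reads are invented.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Overlap stage: directed edges (a, b) -> overlap length for overlapping reads."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph


def hamiltonian_path(reads, graph):
    """Layout stage (brute force): an ordering in which consecutive reads overlap."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None


def merge_path(path, graph):
    """Consensus stage, simplified: join the reads along the path by their overlaps."""
    sequence = path[0]
    for a, b in zip(path, path[1:]):
        sequence += b[graph[(a, b)]:]
    return sequence


if __name__ == "__main__":
    reads = ["TACGGA", "ATGGCGT", "GCGTACG"]
    graph = overlap_graph(reads)
    path = hamiltonian_path(reads, graph)
    print(path, merge_path(path, graph))
```

Finding a Hamiltonian path is NP-hard in general, which is why practical assemblers first simplify the graph and then rely on heuristics.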
Overlap layout consensus
Page 9 Barbara Hutter Assembly
● Based on all pairwise comparisons
● Construction of an overlap graph
  • nodes = reads (sequences)
  • edges = connections between overlapping reads
● Layout: look for paths in the overlap graph which are segments of the genome to assemble (contigs)
  • goal: find Hamiltonian path = a path that contains all nodes exactly once
● Consensus: following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome
  • in case of different nucleotides: majority vote considering base qualities
● Programs using the OLC:
  • Arachne, Celera Assembler (CABOG), newbler, Minimus, Edena, CAP, PCAP
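The majority vote in the consensus step can be illustrated for a single alignment column. The sketch weights each base call by its Phred quality and picks the base with the largest total. Summing quality scores is one simple weighting chosen for illustration; it is an assumption, not the scheme of any program listed above.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the highest summed quality.

    column: list of (base, phred_quality) pairs, one per read covering the position.
    """
    weight = defaultdict(int)
    for base, quality in column:
        weight[base] += quality
    return max(weight, key=weight.get)


def consensus(columns):
    """Consensus sequence for a list of alignment columns."""
    return "".join(consensus_base(col) for col in columns)


if __name__ == "__main__":
    columns = [
        [("A", 30), ("A", 30), ("A", 35)],  # all reads agree
        [("C", 30), ("C", 25), ("T", 10)],  # a low-quality T is outvoted
        [("G", 10), ("G", 12), ("T", 40)],  # one confident T beats two weak Gs
    ]
    print(consensus(columns))
```

With equal qualities the vote reduces to a plain majority. The third column shows how base qualities can overrule the raw count.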
Overlap-Layout-Consensus
http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next
Online resources
NCBI-SRA
NCBI-GEO
The European Nucleotide Archive (ENA)
ArrayExpress
Visualization tools
sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats15, widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.
Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result16. In some cases, visual inspection can facilitate the evaluation and interpretation
Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView17, MapView18 and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.
In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq19 and Gap5 use the vertical-axis position to indicate insertion size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations, for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous
Table 1 | Tools for visualizing sequencing data
Name | Cost | OS | Description | URL
Stand-alone tools
ABySS-Explorer25 | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed3* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene14 | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView17 | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap12,13 | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye6 | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView18 | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview8 | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/
Web-based tools
LookSeq19 | Free | - | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer7 | Free | - | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/
Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement.
*Our recommendation
Dr. Ece Gamsiz
Next lectures
RNA sequencing: method, applications, advantages over microarrays
ChIP sequencing
Epigenomics: DNA methylation, histone modification