Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)


A typical RNA-Seq experiment

• Library construction

Protocol variations

• Fragmentation methods
  - RNA: nebulization, hydrolysis
  - cDNA: sonication, DNase I treatment
• Depletion of highly abundant transcripts
  - Positive selection of mRNA: poly(A) selection
  - Negative selection: rRNA depletion (RiboMinus™)
• Strand specificity
  - Most RNA sequencing is not strand-specific
• Single-end or paired-end sequencing

http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf

Single-end and paired-end sequencing

Microarrays versus RNA-Seq

RNA-Seq
  - Counting
  - Absolute abundance of transcripts
  - All transcripts are present and can be analyzed
    - mRNA / ncRNA (snoRNA, lncRNA, eRNA, miRNA, ...)
  - Analysis of allele-specific gene expression
  - Detection of fusions and other structural variations

Microarrays
  - Indirect record of expression level (complementary probe)
  - Relative abundance
  - Cross-hybridization
  - Content limited (can only show you what you're already looking for)

High reproducibility and dynamic range

(a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; R² = 0.96.

(c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample (1.2 × 10⁴ to 1.2 × 10⁹ transcripts per sample; R² > 0.99).
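As a rough illustration of how replicate agreement such as the R² in panel (a) could be quantified, the sketch below computes R² as the squared Pearson correlation of log2-transformed RPKM values. The function names `r_squared` and `log2_rpkm`, the pseudocount of 1 and the toy values are assumptions made for illustration; they are not the data or the exact procedure behind the figure.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant input has zero variance and raises ZeroDivisionError.
    return cov * cov / (var_x * var_y)


def log2_rpkm(values, pseudocount=1.0):
    """Log-transform RPKM values; the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical RPKM values for five gene models in two technical replicates.
replicate_1 = [0.5, 3.0, 12.0, 150.0, 980.0]
replicate_2 = [0.8, 2.5, 14.0, 140.0, 1010.0]

print(round(r_squared(log2_rpkm(replicate_1), log2_rpkm(replicate_2)), 3))
```

Working on the log scale keeps a handful of very highly expressed genes from dominating the correlation, which matters when expression spans several orders of magnitude.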

RNA-Seq drawbacks

• Current disadvantages
  - More expensive than standard expression arrays
  - More time-consuming than any microarray technology
  - Some (lots of) data analysis issues
    - Computing accurate transcript models
    - Mapping reads to splice junctions
    - Contribution of high-abundance RNAs (e.g. ribosomal) could dilute the remaining transcript population; sequencing depth is important

http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf
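The dilution point above is simple arithmetic: if a fraction of the library consists of high-abundance RNA, only the remainder of the reads is available for the transcripts of interest. The sketch below makes this explicit. The function name and the numbers (20 million reads, 75% abundant RNA) are hypothetical values chosen for illustration, not measurements.

```python
def informative_reads(total_reads, abundant_fraction):
    """Reads left for the remaining transcripts once a fraction of the
    library is taken up by high-abundance RNA (e.g. ribosomal RNA)."""
    if not 0.0 <= abundant_fraction <= 1.0:
        raise ValueError("abundant_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - abundant_fraction))


# Hypothetical run: 20 million reads, 75% of them from abundant RNA.
total = 20_000_000
for fraction in (0.0, 0.5, 0.75):
    print(f"abundant fraction {fraction:.2f}: "
          f"{informative_reads(total, fraction):,} informative reads")
```

Under these assumed numbers, three quarters of the sequencing effort is spent on RNA that is usually not of interest, which is why depletion or a higher sequencing depth is needed.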

Do arrays and RNA-Seq tell a consistent story?

“The relationship is not quite linear … but the vast majority of the expression values are similar between the methods. Scatter increases at low expression … as background correction methods for arrays are complicated when signal levels approach noise levels. Similarly, RNA-Seq is a sampling method and stochastic events become a source of error in the quantification of rare transcripts.”

“Given the substantial agreement between the two methods, the array data in the literature should be durable.”

Comparison of array and RNA-Seq data for measuring differential gene expression in the heads of male and female D. pseudoobscura
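The remark in the first quotation that RNA-Seq is a sampling method can be illustrated with a small calculation. The sketch assumes that the read count of a transcript follows a Poisson distribution (a common simplification, not something stated on the slide), in which case the coefficient of variation is 1/sqrt(expected count). The function names, the sequencing depth and the transcript fractions are hypothetical.

```python
import math


def expected_count(total_mapped_reads, transcript_fraction):
    """Expected number of reads drawn from a transcript that makes up
    the given fraction of the sequenced library."""
    return total_mapped_reads * transcript_fraction


def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count."""
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    return 1.0 / math.sqrt(mean_count)


# Hypothetical library of 10 million mapped reads.
depth = 10_000_000
for fraction in (1e-7, 1e-6, 1e-4):
    mean = expected_count(depth, fraction)
    print(f"fraction={fraction:g}  expected reads={mean:g}  CV={poisson_cv(mean):.2f}")
```

Under this assumption a transcript expected to yield a single read has a relative error of about 100%, while one yielding a thousand reads has only a few percent, which matches the increased scatter at low expression.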

Sequencing technologies

• RNA-Seq has been performed using the SOLiD, Illumina and 454 platforms
  - However, RNA-Seq needs high dynamic range
    - SOLiD and Illumina sequencers
  - Need PCR amplification
    - Low coverage of GC-rich transcripts
  - Future?
    - Helicos
      - No PCR amplification
      - Direct RNA sequencing seems possible
      - High error rates
    - Pacific Biosciences

Current technologies

  - Roche 454 (pyrosequencing)
  - Illumina Genome Analyzer IIx (sequencing by synthesis)
  - ABI SOLiD (Sequencing by Oligonucleotide Ligation and Detection; sequencing by ligation)

Illumina

Illumina Genome Analyzer (I)

Illumina Genome Analyzer (II)

Illumina Genome Analyzer (III)

Illumina Genome Analyzer

ABI Next Gen Sequencing: SOLiD

http://www.flickr.com/photos/ricardipus/4452215186/

• Uses emulsion PCR (emPCR) on magnetic beads
• Sequencing by ligation using fluorescent probes

Sequencing by ligation (SOLiD)

We have a flow cell (basically a microscope slide that can be serially exposed to any liquids desired) whose surface is coated with thousands of beads, each containing a single genomic DNA species with unique adapters on either end. Each microbead can be considered a separate sequencing reaction, and all are monitored simultaneously via sequential digital imaging. Up to this point all next-gen sequencing technologies are very similar; this is where ABI/SOLiD diverges dramatically (see figure 4).

Sequencing by ligation (SOLiD)

  - According to the colour code table, four different dinucleotides may correspond to the same colour.
  - This ambiguity is resolved using the primer n-1 round.
  - The first ligation of this primer round analyses a dinucleotide with one known nucleotide.

• Sequences are in color space format (csfasta)

>853_64_861_F3
T00101210200133033001332333320203311

Dibases: TT TT TG GG ...
Sequence: TTGG...
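The translation from colours to bases shown above can be sketched in a few lines. The sketch relies on the standard SOLiD colour code, in which coding the bases as A=0, C=1, G=2, T=3 makes each colour the XOR of two adjacent base codes. The function name `decode_color_space` is an illustrative choice, not part of any vendor tool.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; colour = code(base1) XOR code(base2)


def decode_color_space(read):
    """Decode one csfasta read line (primer base followed by colours 0-3)
    into bases. The leading primer base is used only to start decoding
    and is not part of the returned sequence."""
    primer_base, colors = read[0], read[1:]
    if primer_base not in BASES:
        raise ValueError(f"unexpected primer base: {primer_base!r}")
    previous = BASES.index(primer_base)
    decoded = []
    for color in colors:
        if color not in "0123":
            raise ValueError(f"unexpected colour: {color!r}")
        # Known previous base + colour determine the next base.
        previous ^= int(color)
        decoded.append(BASES[previous])
    return "".join(decoded)


read = "T00101210200133033001332333320203311"
print(decode_color_space(read)[:4])  # TTGG, as on the slide
```

Because every base depends on all colours before it, a single miscalled colour corrupts the rest of a naively decoded read. This is one reason such reads are usually aligned in colour space rather than translated first.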

Advantage of 2-base encoding

• Higher accuracy for SNP detection
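A short sketch of why two-base encoding helps with SNP detection: each base contributes to two adjacent colours, so a true single-base substitution changes two neighbouring colours, whereas an isolated colour change is more likely a measurement error. The sequences below are made up for illustration, and the helper names are hypothetical.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; colour = XOR of adjacent base codes


def encode_colors(sequence):
    """Encode a base sequence as the list of colours between adjacent bases."""
    codes = [BASES.index(base) for base in sequence]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def changed_positions(colors_a, colors_b):
    """Indices at which two colour lists of equal length differ."""
    return [i for i, (a, b) in enumerate(zip(colors_a, colors_b)) if a != b]


reference = "ACGGTCA"
snp_read = "ACGATCA"  # hypothetical SNP: G -> A at base index 3

print(encode_colors(reference))
print(encode_colors(snp_read))
print(changed_positions(encode_colors(reference), encode_colors(snp_read)))  # [2, 3]
```

The substitution at base index 3 alters exactly the two colours that involve that base (indices 2 and 3), which is the signature a variant caller can require before accepting a SNP.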

Typical analysis pipeline

1. Raw sequence data
2. Pre-processing: removing artifacts (poor quality reads, sequencing adaptors, low-complexity reads, near-identical reads (PCR biases))
3. Assembly
   - Reference-based
   - De novo strategy
   - Combined strategy (reference-based and de novo)
4. Summarize to gene/transcript
5. Statistical analysis: differentially expressed genes
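The pre-processing step of the pipeline can be sketched as a chain of simple filters. This is a minimal illustration under several assumptions: reads are (sequence, quality) pairs with Phred+33 quality strings, the adaptor is matched exactly, only exact duplicates are removed (a simplification of "near-identical"), and all thresholds and the adaptor sequence are placeholder values. Dedicated tools should be used for real data.

```python
from collections import Counter


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adaptor, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


def is_low_complexity(sequence, max_single_base_fraction=0.8):
    """Flag reads dominated by a single base (e.g. poly-A runs)."""
    most_common_count = Counter(sequence).most_common(1)[0][1]
    return most_common_count / len(sequence) > max_single_base_fraction


def preprocess(reads, adapter, min_mean_quality=20, min_length=20):
    """Apply adaptor trimming, then length, quality, complexity and
    exact-duplicate filters. Thresholds are illustrative defaults."""
    seen = set()
    kept = []
    for sequence, quality in reads:
        sequence, quality = trim_adapter(sequence, quality, adapter)
        if len(sequence) < max(1, min_length):
            continue  # too short after trimming
        if mean_quality(quality) < min_mean_quality:
            continue  # poor quality
        if is_low_complexity(sequence):
            continue  # low complexity
        if sequence in seen:
            continue  # exact duplicate (possible PCR artifact)
        seen.add(sequence)
        kept.append((sequence, quality))
    return kept


EXAMPLE_ADAPTER = "AGATCGGAAGAGC"  # placeholder adaptor sequence
example_reads = [
    ("ACGTACGTACGTACGTACGTACGT", "I" * 24),    # good read
    ("ACGTACGTACGTACGTACGTACGT", "I" * 24),    # duplicate of the first
    ("ACGTTGCAACGTTGCAACGTTGCA", "#" * 24),    # poor quality
    ("A" * 24, "I" * 24),                      # low complexity
    ("ACGTACGTAC" + EXAMPLE_ADAPTER + "A", "I" * 24),  # mostly adaptor
]
kept = preprocess(example_reads, adapter=EXAMPLE_ADAPTER)
print(len(kept), "of", len(example_reads), "reads kept")
```

Trimming comes first so that the length and quality filters are applied to the part of the read that will actually be aligned.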

Overview of the reference-based transcriptome assembly

a | Reads (grey) are first aligned to a reference genome using a splice-aware aligner, such as TopHat or SpliceMap.

b | A connectivity or splice graph is then constructed to represent all possible isoforms at a locus.

c,d | Finally, alternative paths through the graph (blue, red, yellow and green) are followed to join compatible reads together into isoforms.
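Steps b to d above can be illustrated with a toy splice graph: exons are nodes, observed junctions are directed edges, and each path from a first exon to a last exon is a candidate isoform. The exon names and junctions below are invented, and the sketch assumes an acyclic graph (exons ordered along the genome). Real assemblers typically select a subset of paths supported by the reads rather than reporting all of them.

```python
def build_splice_graph(junctions):
    """Build an adjacency map from (upstream_exon, downstream_exon) pairs."""
    graph = {}
    for upstream, downstream in junctions:
        graph.setdefault(upstream, set()).add(downstream)
        graph.setdefault(downstream, set())
    return graph


def enumerate_isoforms(graph):
    """List every path from an exon with no incoming junction to an exon
    with no outgoing junction. Assumes the graph has no cycles."""
    has_incoming = {target for targets in graph.values() for target in targets}
    sources = sorted(node for node in graph if node not in has_incoming)
    isoforms = []

    def walk(node, path):
        path = path + [node]
        if not graph[node]:  # terminal exon: the path is a complete isoform
            isoforms.append(path)
            return
        for next_node in sorted(graph[node]):
            walk(next_node, path)

    for source in sources:
        walk(source, [])
    return isoforms


# Hypothetical locus: exon E2 is skipped in one isoform.
junctions = [("E1", "E2"), ("E2", "E3"), ("E1", "E3"), ("E3", "E4")]
for isoform in enumerate_isoforms(build_splice_graph(junctions)):
    print("-".join(isoform))
```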

Summarize to gene expression level

  - Transcripts of different lengths have different read counts
  - Tag count is normalized for transcript length and total read number in the measurement (RPKM, Reads Per Kilobase of exon model per Million mapped reads)
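The RPKM normalization described above amounts to dividing the read count by the exon model length in kilobases and by the number of mapped reads in millions. A minimal sketch follows; the function name and the example numbers are illustrative assumptions.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads:
    RPKM = 1e9 * C / (N * L), with C reads on the exon model,
    N total mapped reads and L exon model length in bp."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical example: two transcripts in a library of 10 million mapped reads.
total = 10_000_000
print(rpkm(500, 2_000, total))    # shorter transcript, 500 reads
print(rpkm(1_000, 4_000, total))  # transcript twice as long, twice the reads
```

Both transcripts in this made-up example get the same RPKM, which shows how the normalization removes the effect of transcript length on the raw read count.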
