Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)


Academic year: 2021


(1)

Introduction to transcriptome analysis using High-Throughput Sequencing technologies (HTS)

(2)
(3)

A typical RNA-Seq experiment

• Library construction

(4)

Protocol variations

• Fragmentation methods

RNA: nebulization, hydrolysis

cDNA: sonication, DNase I treatment

• Depletion of highly abundant transcripts

Positive selection of mRNA: poly(A) selection

Negative selection: rRNA depletion (RiboMinus™)

• Strand specificity

Most RNA sequencing is not strand-specific

• Single-end or paired-end sequencing

http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf

(5)

Single-end and paired-end sequencing

(6)

Microarrays versus RNA-Seq

RNA-seq

Counting

Absolute abundance of transcripts

All transcripts are present and can be analyzed

mRNA / ncRNA (snoRNA, lncRNA, eRNA, miRNA, ...)

Analysis of allele-specific gene expression

Detection of fusions and other structural variations

Microarrays

Indirect record of expression level (complementary probe)

Relative abundance

Cross-hybridization

Content limited (can only show you what you're already looking for)

(7)

High reproducibility and dynamic range

(a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; $R^2 = 0.96$.

(c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample ($1.2 \times 10^4$ to $1.2 \times 10^9$ transcripts per sample; $R^2 > 0.99$).

(8)

RNA-Seq drawbacks

• Current disadvantages

More expensive than standard expression arrays

More time-consuming than any microarray technology

Some (lots of) data analysis issues

Computing accurate transcript models

Mapping reads to splice junctions

Contribution of high-abundance RNAs (e.g. ribosomal) could dilute the remaining transcript population; sequencing depth is important

http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf

(9)

Do arrays and RNA-Seq tell a consistent story?

“The relationship is not quite linear … but the vast majority of the expression values are similar between the methods. Scatter increases at low expression … as background correction methods for arrays are complicated when signal levels approach noise levels. Similarly, RNA-Seq is a sampling method and stochastic events become a source of error in the quantification of rare transcripts.”

“Given the substantial agreement between the two methods, the array data in the literature should be durable.”

Comparison of array and RNA-Seq data for measuring differential gene expression in the heads of male and female D. pseudoobscura

(10)

Sequencing technologies

• RNA-Seq has been performed using the SOLiD, Illumina and 454 platforms

However, RNA-Seq needs a high dynamic range

SOLiD and Illumina sequencers

Need PCR amplification

Low coverage of GC-rich transcripts

Future?

Helicos

No PCR amplification

Direct RNA sequencing seems possible

High error rates

Pacific Biosciences

(11)

Current technologies

Roche 454 (pyrosequencing)

Genome Analyzer IIx Illumina (sequencing by synthesis)

ABI SOLiD (Sequencing by Oligonucleotide Ligation and Detection; sequencing by ligation)

(12)

Illumina

(13)

Illumina Genome Analyzer (I)

(14)

Illumina Genome Analyzer (II)

(15)

Illumina Genome Analyzer (III)

(16)

Illumina Genome Analyzer

(17)

ABI Next Gen Sequencing: SOLiD

http://www.flickr.com/photos/ricardipus/4452215186/

(18)

ABI Next Gen Sequencing: SOLiD

• Uses emPCR on magnetic beads

• Sequencing by ligation using fluorescent probes

(19)

Sequencing by ligation (SOLiD)

We have a flow cell (basically a microscope slide that can be serially exposed to any liquids desired) whose surface is coated with thousands of beads, each carrying a single genomic DNA species with unique adapters on either end. Each microbead can be considered a separate sequencing reaction, and all of them are monitored simultaneously via sequential digital imaging. Up to this point all next-gen sequencing technologies are very similar; this is where ABI/SOLiD diverges dramatically (see figure 4).

(20)

Sequencing by ligation (SOLiD)

(21)

Sequencing by ligation (SOLiD)

(22)

According to the colour code table, four different dinucleotides may correspond to the same colour.

This ambiguity is resolved using the primer n-1 round.

The first ligation of this primer round analyses a dinucleotide in which one nucleotide is already known.

Sequencing by ligation (SOLiD)

(23)

• Sequences are in color space format (csfasta)

>853_64_861_F3

T00101210200133033001332333320203311

Dibases: TT TT TG GG ...

Sequence: TTGG...
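The translation from colours to bases shown above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard SOLiD colour code (0: AA CC GG TT, 1: AC CA GT TG, 2: AG GA CT TC, 3: AT TA CG GC), and the names `TRANSITIONS` and `decode_colorspace` are made up for this example.

```python
# Each colour (0-3) encodes the transition between two adjacent bases.
# TRANSITIONS[previous_base][colour] gives the next base.
# Assumed colour code: 0: AA CC GG TT, 1: AC CA GT TG,
#                      2: AG GA CT TC, 3: AT TA CG GC
TRANSITIONS = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colorspace(read):
    """Translate a colour-space read (known first base + colour digits) to bases.

    The leading base comes from the adapter and is not part of the
    returned sequence.
    """
    previous = read[0]
    bases = []
    for colour in read[1:]:
        previous = TRANSITIONS[previous][int(colour)]
        bases.append(previous)
    return "".join(bases)


read = "T00101210200133033001332333320203311"
print(decode_colorspace(read)[:4])  # TTGG
```

Because every base depends on the previous one, a single wrong colour call changes all bases downstream of it, which is why colour-space reads are usually aligned in colour space rather than decoded first.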

(24)

Advantage of 2-base encoding

• Higher accuracy for SNP detection
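A small sketch may help show where this accuracy comes from. With 2-base encoding every base is covered by two colours, so a true single-base change alters two adjacent colours, whereas an isolated colour change is more likely a measurement error. The code assumes the standard SOLiD colour code; the sequences and the helper names `encode` and `changed_positions` are invented for illustration.

```python
# Build the dinucleotide -> colour table (assumed standard SOLiD code).
COLOUR = {}
for colour, pairs in enumerate(
    ["AA CC GG TT", "AC CA GT TG", "AG GA CT TC", "AT TA CG GC"]
):
    for pair in pairs.split():
        COLOUR[pair] = colour


def encode(seq):
    """Return the list of colours for each overlapping dinucleotide."""
    return [COLOUR[seq[i:i + 2]] for i in range(len(seq) - 1)]


def changed_positions(a, b):
    """Indices at which two equally long colour lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


reference = "ACGGTCA"
variant = "ACGATCA"  # hypothetical SNP: G -> A at index 3

ref_colours = encode(reference)
var_colours = encode(variant)
print(ref_colours)  # [1, 3, 0, 1, 2, 1]
print(var_colours)  # [1, 3, 2, 3, 2, 1]
print(changed_positions(ref_colours, var_colours))  # [2, 3]
```

The SNP shows up as two neighbouring colour changes (positions 2 and 3), which is the signature a caller can require before accepting a variant.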

(25)

Typical analysis pipeline

Raw sequence data

Pre-processing

Removing artifacts: poor-quality reads, sequencing adaptors, low-complexity reads, near-identical reads (PCR biases)

Summarize to gene/transcript

Statistical analysis: differentially expressed genes

Assembly

1. Reference-based strategy

2. De novo strategy

3. Combined strategy (1 & 2)
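The pre-processing step of this pipeline can be illustrated with a toy filter. This is a rough sketch under simplifying assumptions, not a replacement for dedicated tools: reads are plain (sequence, quality list) tuples, only exact duplicates are removed (rather than near-identical reads), the adaptor is matched exactly, and all thresholds, the adaptor string and the function names are placeholders chosen for the example.

```python
from collections import Counter


def trim_adaptor(seq, qual, adaptor):
    """Cut the read at the first exact occurrence of the adaptor, if any."""
    pos = seq.find(adaptor)
    return (seq, qual) if pos == -1 else (seq[:pos], qual[:pos])


def is_low_complexity(seq, max_fraction=0.8):
    """Flag reads dominated by a single base (e.g. poly-A stretches)."""
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) > max_fraction


def preprocess(reads, adaptor, min_mean_quality=20, min_length=10):
    """Trim adaptors, then drop short, poor, low-complexity or duplicate reads."""
    seen = set()
    kept = []
    for seq, qual in reads:
        seq, qual = trim_adaptor(seq, qual, adaptor)
        if len(seq) < min_length:
            continue  # too short after trimming
        if sum(qual) / len(qual) < min_mean_quality:
            continue  # poor average quality
        if is_low_complexity(seq):
            continue  # low-complexity read
        if seq in seen:
            continue  # exact duplicate (possible PCR artifact)
        seen.add(seq)
        kept.append((seq, qual))
    return kept


reads = [
    ("ACGTTGCATGCAAGCT", [30] * 16),       # good read
    ("ACGTTGCATGCAAGCT", [30] * 16),       # duplicate of the first read
    ("AAAAAAAAAAAAAAAA", [30] * 16),       # low complexity
    ("TGCATGGACCTAGGTA", [5] * 16),        # poor quality
    ("GGATCCTTAGCAGATCGGAAG", [30] * 21),  # ends with the adaptor
]
kept = preprocess(reads, adaptor="AGATCGGAAG")
print([seq for seq, _ in kept])  # ['ACGTTGCATGCAAGCT', 'GGATCCTTAGC']
```

Note that removing duplicates is debatable for RNA-Seq, since highly expressed transcripts legitimately produce identical reads; the step is included here only because the slide lists it.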

(26)

Overview of the reference-based transcriptome assembly

a | Reads (grey) are first aligned to a reference genome using a splice-aware aligner, such as TopHat or SpliceMap.

b | A connectivity or splice graph is then constructed to represent all possible isoforms at a locus.

c,d | Finally, alternative paths through the graph (blue, red, yellow and green) are followed to join compatible reads together into isoforms.
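The idea of following alternative paths through a splice graph can be sketched as a simple path enumeration. The graph below is a made-up locus with four exons, where an edge means that spliced reads connect the two exons; `enumerate_isoforms` is a hypothetical helper, and real assemblers do not report every path but select a subset that is supported by the reads.

```python
def enumerate_isoforms(graph, start, end):
    """Return every path from start to end in an acyclic splice graph."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for next_exon in graph.get(node, []):
            stack.append((next_exon, path + [next_exon]))
    return paths


# Hypothetical locus: exon2 and exon3 can each be skipped.
splice_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon3", "exon4"],
    "exon3": ["exon4"],
    "exon4": [],
}

isoforms = enumerate_isoforms(splice_graph, "exon1", "exon4")
for path in sorted(isoforms):
    print(" - ".join(path))
```

For this graph the enumeration yields three candidate isoforms: the full-length transcript, one skipping exon3 and one skipping exon2.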

(27)

Summarize to gene expression level

Transcripts of different lengths yield different read counts

The tag count is normalized for transcript length and for the total number of reads in the measurement (RPKM, Reads Per Kilobase of exon model per Million mapped reads)

1 RPKM corresponds to approximately one transcript per cell

FPKM, Fragments Per Kilobase of exon model per Million mapped reads (paired-end sequencing)
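Written out, the definition is RPKM = 10^9 × C / (N × L), with C the number of reads mapped to the exons of a gene, N the total number of mapped reads and L the summed exon length in base pairs. A minimal Python sketch follows; the function name `rpkm` and the numbers are invented for illustration.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on each of two transcripts,
# in a run with 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0 for a 2 kb transcript
print(rpkm(500, 1000, 10_000_000))  # 50.0 for a 1 kb transcript
```

The same read count gives twice the RPKM on the transcript that is half as long, which is exactly the length correction described above. FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.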
