Normalization of RNA-Seq


Davide Risso

Modified: April 27, 2012. Compiled: April 27, 2012

1 Retrieving the data

Usually, an RNA-Seq data analysis “from scratch” starts with a set of FASTQ files (see e.g. http://en.wikipedia.org/wiki/FASTQ_format) which contain information on both the quality and the sequence of the short reads.

There are several tools to align the reads to the reference genome (e.g., Bowtie, TopHat, GSNAP, Stampy, ...).

A common output file format is the SAM/BAM format (described at http://samtools.sourceforge.net/).

You just saw how to align reads when you don’t have a genome, and how to summarize them. When you do have a genome, a standard approach is to align the reads with Bowtie or TopHat, and then summarize them over “regions of interest”, such as genes, exons, non-coding RNAs, etc.

To do this, you need your aligned reads and an annotation for your reference genome. There are tools and packages to summarize the aligned reads into gene counts. One of them is HTSeq (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html).

The simple command:

$ htseq-count example.sam Saccharomyces_cerevisiae.EF2.60.gtf

will produce a table of counts, i.e.,

YAL002W 1
YAL003W 19
YAL005C 8
YAL007C 2
YAL008W 2
YAL012W 9
YAL014C 1
YAL016W 3
YAL017W 2
YAL019W 1

By doing this for every sample in your study you end up with a table with m rows (genes) and n columns (samples). This is what you have in the file “geneLevelCounts.txt”.

$ head geneLevelCounts.txt
YAL067W-A 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL067C 0 0 0 2 2 1 9 7 20 11 13 44 12 13
YAL066W 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL065C 0 0 0 0 0 0 1 0 0 0 0 0 0 0
YAL064W-B 0 0 0 0 0 0 2 1 0 0 0 0 0 0
YAL064C-A 0 0 0 0 0 0 1 0 0 0 0 0 0 0
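Combining the per-sample outputs into one genes-by-samples table can be sketched in base R as follows. This is only an illustration: the sample names and the in-line text standing in for the htseq-count output files are hypothetical, and real htseq-count output also ends with a few summary rows (e.g., for reads with no feature) that would have to be dropped first.

```r
# Hypothetical per-sample htseq-count outputs (gene ID, tab, count).
# In practice these would be one file per sample on disk.
sample_files <- list(
  sampleA = "YAL002W\t1\nYAL003W\t19\nYAL005C\t8",
  sampleB = "YAL002W\t0\nYAL003W\t25\nYAL005C\t3"
)

# Read one two-column table and return a named vector of counts.
read_counts <- function(txt) {
  tab <- read.table(text = txt, sep = "\t",
                    col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
  setNames(tab$count, tab$gene)
}

count_list <- lapply(sample_files, read_counts)

# All samples must list the same genes in the same order before binding.
genes <- names(count_list[[1]])
stopifnot(all(sapply(count_list, function(x) identical(names(x), genes))))

# m genes (rows) by n samples (columns).
count_matrix <- do.call(cbind, count_list)
count_matrix
```

With real files, the text argument would be replaced by the path of each sample's output file.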

Today, we will consider an example based on the data analyzed in Risso et al. [7]. The Sherlock Lab at Stanford sequenced 10 strains of Saccharomyces cerevisiae grown in three media, namely YPD, Delft, and Glycerol, each with 3-4 biological replicates.

Illumina’s standard Genome Analyzer pre-processing pipeline was used to yield 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie [4], considering only uniquely mapped reads and allowing up to two mismatches.

The read count for a given gene is defined as the number of reads whose 5’-end falls within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.

For Exploratory Data Analysis (EDA) and normalization purposes, it is useful to consider some features of the genes, such as GC-content and gene length.

To obtain this information, we need the gene sequences, which can be retrieved from different sources (e.g., Ensembl, UCSC, FlyBase, ...). In the yeast community, a standard resource is the SGD website (http://www.yeastgenome.org). In general, a good resource is Ensembl (http://www.ensembl.org). In any case, you need to download the sequences of your regions of interest (e.g., protein coding genes, non-coding RNAs, ...), usually in FASTA format.

Example of FASTA format:

$ head Scer.fasta
>YAL001C
ATGGTACTGACGATTTATCCTGACGAACTCGTACAAA...
>YAL002W
ATGGAGCAAAATGGCCTTGACCACGA...

Once you have your FASTA file, it is easy to compute the length and GC-content of each gene using the ShortRead Bioconductor package.

Bioconductor (http://bioconductor.org) is an open source project based on the R statistical programming language (http://r-project.org).

Open a terminal and type R. This will open an R console.

> library(ShortRead)
> filename <- "Scer.fasta"
> fa <- readFasta(filename)
> # count A, C, G, T (and "other") in each sequence
> abc <- alphabetFrequency(sread(fa), baseOnly=TRUE)
> # assumption: each FASTA header is just the gene name, used here to label the rows
> rownames(abc) <- as.character(id(fa))
> alphabet <- abc[,1:4]
> # GC-content: (C + G) / (A + C + G + T)
> gc <- rowSums(alphabet[,2:3])/rowSums(alphabet)
> length <- width(sread(fa))
> head(gc)
  YAL001C   YAL002W   YAL003W   YAL004W   YAL005C   YAL007C
0.3712317 0.3717647 0.4460548 0.4490741 0.4406428 0.3703704
> head(length)
[1] 3483 3825  621  648 1929  648

We can create a data.frame to store this information; we will call it “geneInfo”.

> geneInfo <- data.frame(length=length, gc=gc)
> head(geneInfo)
        length        gc
YAL001C   3483 0.3712317
YAL002W   3825 0.3717647
YAL003W    621 0.4460548
YAL004W    648 0.4490741
YAL005C   1929 0.4406428
YAL007C    648 0.3703704
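To make explicit what the length and GC-content computation amounts to, here is a small base R sketch on toy sequences. The gene names and sequences are made up for illustration, and the sketch does not use ShortRead; it simply counts G and C characters and divides by the sequence length (the toy sequences contain no ambiguous bases).

```r
# Toy sequences standing in for the entries of a FASTA file (hypothetical).
seqs <- c(geneA = "ATGGCGCTA", geneB = "ATATATTTAA", geneC = "GGGCCCAT")

# Split each sequence into single bases and count the G and C bases.
gc_count <- sapply(strsplit(seqs, ""), function(b) sum(b %in% c("G", "C")))

# Sequence length and GC fraction per gene.
len <- nchar(seqs)
gc_toy <- gc_count / len

# Same layout as geneInfo: one row per gene, columns length and gc.
data.frame(length = len, gc = gc_toy)
```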

2 Exploratory Data Analysis

We will use the EDASeq [6] R package for the EDA and the normalization. This package provides a class of objects named SeqExpressionSet, useful to store gene counts along with gene and lane information.

First of all, we need to read the counts into R. This is done with the read.table function:

> geneLevelCounts <- read.table("geneLevelCounts.txt", header=TRUE, row.names=1)
> laneInfo <- read.table("laneInfo.txt", header=TRUE, row.names=1)

We want to filter out the non-expressed genes. For simplicity, we keep only the genes with an average read count of 10 or more across all lanes, as a rough proxy for genes expressed in all growth conditions.

> means <- rowMeans(geneLevelCounts)
> filter <- means >= 10
> table(filter)
filter
FALSE  TRUE
 1041  5534
> geneLevelCounts <- geneLevelCounts[filter,]

This leaves us with 5534 genes.
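Note that a filter on the overall mean can keep a gene that is highly expressed in one condition and silent in another. The toy sketch below (hypothetical genes, lanes, and conditions, not the yeast data) contrasts the overall-mean filter with a stricter variant that requires a mean count of at least 10 within every condition; it is one possible alternative, not the filter used in this tutorial.

```r
# Toy counts: 3 genes x 4 lanes, two hypothetical conditions A and B.
toy <- matrix(c(20, 25,  0,  1,    # geneX: high in A, silent in B
                12, 15, 11, 14,    # geneY: expressed everywhere
                 1,  0,  2,  1),   # geneZ: barely expressed
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneX", "geneY", "geneZ"),
                              c("A1", "A2", "B1", "B2")))
cond <- factor(c("A", "A", "B", "B"))

# Overall-mean filter: mean count of at least 10 across all lanes.
keep_overall <- rowMeans(toy) >= 10

# Stricter variant: mean count of at least 10 within every condition.
cond_means <- sapply(levels(cond),
                     function(k) rowMeans(toy[, cond == k, drop = FALSE]))
keep_each <- apply(cond_means >= 10, 1, all)

rbind(keep_overall, keep_each)
```

Here geneX passes the overall filter but fails the per-condition one.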

Now we can store this information (gene and lane info along with gene counts) in one single object.

> library(EDASeq)
> data <- newSeqExpressionSet(exprs = as.matrix(geneLevelCounts),
+                             featureData = geneInfo[rownames(geneLevelCounts), ],
+                             phenoData = laneInfo)
> data
SeqExpressionSet (storageMode: lockedEnvironment)
assayData: 5534 features, 14 samples
  element names: exprs, offset
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: lib_prep conditions flow_cell lib_prep_proto
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
> head(exprs(data))
          Y1_1 Y1_2 Y2_1 Y2_2 Y7_1 Y7_2 Y4_1 Y4_2   D1   D2
YAL062W     11    4    6    8   12    9   41   43   54   38
YAL061W     33   17   50   20   77   51  177  166  311  338
YAL060W    209  129  216  181  387  286 1328 1386 3316 1262
YAL059W     78   55   82   73  187  121  658  686  176   46
YAL058W     95   56  101   87  232  163  618  581  305  117
YAL056C-A   27   17    8    5   11    7    5    1   19    2
            D7    G1   G2   G3
YAL062W     44  1628   57  256
YAL061W    301 29951 1310 2208
YAL060W   1130 16548 5222 3482
YAL059W     51   226   75  127
YAL058W     97   681  226  216
YAL056C-A    6     1   34   12
> pData(data)
     lib_prep conditions flow_cell lib_prep_proto
Y1_1       Y1        YPD     428R1      Protocol1
Y1_2       Y1        YPD     4328B      Protocol1
Y4_1       Y4        YPD     61MKN      Protocol2
Y4_2       Y4        YPD     61MKN      Protocol2
D1         D1        Del     428R1      Protocol1
G1         G1        Gly     6247L      Protocol2
G2         G2        Gly     62OAY      Protocol1
G3         G3        Gly     62OAY      Protocol1
> head(fData(data))
          length        gc
YAL062W     1374 0.4868996
YAL061W     1254 0.4840510
YAL060W     1149 0.4499565
YAL059W      639 0.4037559
YAL058W     1509 0.4340623
YAL056C-A    351 0.4131054

We can look at some graphical summaries of the data. These will help us discover biases and artifacts in the data.

Between-lane distribution of gene-level counts. One of the main considerations when dealing with gene-level counts is the difference in count distributions between lanes. The boxplot method provides an easy way to produce boxplots of the logarithms of the gene counts in each lane.

> # wrapping in factor() keeps this working when the column was read as character
> colors <- as.numeric(factor(pData(data)[, 2])) + 1
> boxplot(data, col=colors)

[Figure: boxplots of the log gene-level counts, one box per lane (Y1_1, ..., G3), colored by growth condition; vertical axis from 0 to 12.]

Over-dispersion. The function meanVarPlot can be used to check whether the count data are over-dispersed. Under the Poisson distribution the variance equals the mean, so one would expect the points to be evenly scattered around the black line.

> meanVarPlot(data[, 1:8], log=TRUE)

[Figure: mean-variance plot for lanes 1 to 8; horizontal axis: mean (0 to 10), vertical axis: variance (0 to 20).]
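The difference between Poisson and over-dispersed counts can be illustrated with a short simulation in base R. All numbers below (number of genes and lanes, the distribution of the true means, the negative binomial size parameter) are arbitrary choices for illustration and are not estimated from the yeast data.

```r
set.seed(1)
n_genes <- 2000
n_lanes <- 8
mu <- rexp(n_genes, rate = 1 / 100)  # hypothetical true mean count per gene

# Poisson counts: the variance equals the mean.
pois <- matrix(rpois(n_genes * n_lanes, lambda = mu), nrow = n_genes)

# Negative binomial counts: variance = mu + mu^2 / size, i.e. over-dispersed.
nb <- matrix(rnbinom(n_genes * n_lanes, mu = mu, size = 5), nrow = n_genes)

# Ratio of summed per-gene variances to summed per-gene means
# (close to 1 under the Poisson model, larger under over-dispersion).
dispersion_ratio <- function(x) {
  m <- rowMeans(x)
  v <- apply(x, 1, var)
  sum(v) / sum(m)
}

ratios <- c(poisson = dispersion_ratio(pois), negbin = dispersion_ratio(nb))
ratios
```

On a mean-variance plot, the Poisson genes would scatter around the identity line, while the negative binomial genes would lie above it, increasingly so for larger means.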

Gene-specific effects on read counts. Several authors have reported selection biases related to sequence features such as gene length, GC-content, and mappability [2, 3, 5, 7].

Using biasPlot, one can see the dependence of gene-level counts on GC-content. The same plot could be created for gene length or mappability instead of GC-content.

> biasPlot(data[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: gene counts (log, 0 to 8) versus gc (0.2 to 0.6); curves labelled Y1, Y2, Y7, Y4.]

3 Normalization

Following Risso et al. [7], we consider two main types of effects on gene-level counts: (1) within-lane gene-specific (and possibly lane-specific) effects, e.g., related to gene length or GC-content, and (2) effects related to between-lane distributional differences, e.g., sequencing depth. Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. We recommend normalizing for within-lane effects prior to between-lane normalization.

EDASeq implements four within-lane normalization methods, namely: loess robust local regression of read counts (log) on a gene feature such as GC-content (loess), global-scaling between feature strata using the median (median), global-scaling between feature strata using the upper-quartile (upper), and full-quantile normalization between feature strata (full). For a discussion of these methods in the context of GC-content normalization, see Risso et al. [7].
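The idea behind global-scaling between feature strata can be sketched in base R for a single simulated lane. This is an illustration of the principle only, under assumed values (number of genes, 10 equal-size strata, a made-up GC effect); it is not the EDASeq implementation, which may differ in how strata are formed and how results are rounded.

```r
set.seed(2)
n <- 1000
gc_content <- runif(n, 0.3, 0.6)
# Hypothetical lane whose expected count rises with GC-content.
lane <- rpois(n, lambda = exp(3 + 4 * (gc_content - 0.45)))

# Split genes into 10 equal-size GC-content strata.
breaks <- quantile(gc_content, probs = seq(0, 1, length.out = 11))
strata <- cut(gc_content, breaks = breaks, include.lowest = TRUE)

# Scale each stratum so that its median equals the overall median.
stratum_median <- as.vector(tapply(lane, strata, median))
scale_factor <- median(lane) / stratum_median
lane_norm <- lane * scale_factor[as.integer(strata)]

# Before: stratum medians increase with GC-content; after: they coincide.
rbind(before = tapply(lane, strata, median),
      after  = tapply(lane_norm, strata, median))
```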

Regarding between-lane normalization, the package implements three of the methods introduced in Bullard et al. [2]: global-scaling using the median (median), global-scaling using the upper-quartile (upper), and full-quantile normalization (full).
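As a rough illustration of two of these between-lane methods, the following base R sketch applies upper-quartile scaling and full-quantile normalization to a small made-up count matrix. It is a minimal sketch of the general techniques; the package's own functions may differ in details such as rounding and the handling of ties.

```r
# Toy counts: 5 genes x 3 lanes; lane3 was sequenced roughly twice as deeply.
counts <- cbind(lane1 = c(10, 20, 30, 40, 100),
                lane2 = c(12, 18, 33, 41,  90),
                lane3 = c(20, 44, 58, 85, 210))

# Global scaling: divide each lane by its upper quartile, then multiply by
# the mean upper quartile so the values stay on the scale of counts.
uq <- apply(counts, 2, quantile, probs = 0.75)
scaled <- sweep(counts, 2, uq / mean(uq), "/")

# Full-quantile: replace the k-th smallest value in each lane by the mean of
# the k-th smallest values across lanes (ties are broken by position here).
sorted <- apply(counts, 2, sort)
ref <- rowMeans(sorted)
fq <- apply(counts, 2, function(x) ref[rank(x, ties.method = "first")])

scaled
fq
```

After upper-quartile scaling the lanes share the same upper quartile but may still differ elsewhere in the distribution; after full-quantile normalization every lane has exactly the same set of values.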

> dataWithin <- withinLaneNormalization(data, "gc", which="full")
> dataNorm <- betweenLaneNormalization(dataWithin, which="median")

After normalization the GC-content bias is reduced, and the gene-level counts are comparable across lanes.

> biasPlot(dataNorm[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: gene counts (log, 0 to 8) versus gc (0.2 to 0.6) after normalization; curves labelled Y1, Y2, Y7, Y4.]

> boxplot(dataNorm, col=colors)

[Figure: boxplots of the normalized log gene-level counts, one box per lane; vertical axis from 0 to 10.]

Moreover, the overdispersion is reduced in normalized counts, even though the Poisson assumption still does not hold true.

> meanVarPlot(dataNorm[, 1:8], log=TRUE)

[Figure: mean-variance plot of the normalized counts for lanes 1 to 8; horizontal axis: mean (0 to 10), vertical axis: variance (0 to 15).]

You can write your normalized counts to a file with the write.table function:

> write.table(dataNorm, file="normalizedCounts.txt", sep="\t", quote=FALSE)

4 Differential expression (DE) analysis

One of the main applications of RNA-Seq is differential expression analysis. The normalized counts (or the original counts and the offset) obtained using the EDASeq package can be supplied to packages such as edgeR [8] or DESeq [1] to find differentially expressed genes.

Some authors have argued that it is better to leave the count data unchanged to preserve their sampling properties and instead use an offset for normalization purposes in the context of DE analysis [1, 3, 8]. This can be achieved easily using the argument offset in both normalization functions.

> dataOffset <- withinLaneNormalization(data, "gc",
+                                       which="full", offset=TRUE)
> dataOffset <- betweenLaneNormalization(dataOffset,
+                                        which="full", offset=TRUE)
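The general role of an offset in a count model can be illustrated with a base R Poisson regression on simulated data. This sketch is not how EDASeq, edgeR, or DESeq use the offset internally; the scaling factors, group labels, and effect size are all made up for the example.

```r
set.seed(3)
n <- 200
group <- rep(c(0, 1), each = n / 2)
depth <- runif(n, 0.5, 2)            # hypothetical lane-specific scaling factors
true_lfc <- 0.7                      # true log fold-change between groups
y <- rpois(n, lambda = depth * exp(2 + true_lfc * group))

# The counts stay untouched; the normalization enters the model as
# log(depth) with a coefficient fixed at 1.
fit <- glm(y ~ group + offset(log(depth)), family = poisson)
coef(fit)
```

The estimated coefficient for group should be close to the simulated log fold-change, even though the raw counts were never rescaled.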

4.1 DESeq

If one wants to use the normalized data to perform a DE analysis with DESeq, there is a simple way to convert the data to the format needed by DESeq.

> library(DESeq)
> counts <- as(dataNorm, "CountDataSet")
> counts
CountDataSet (storageMode: environment)
assayData: 5534 features, 14 samples
  element names: counts
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: sizeFactor lib_prep ... lib_prep_proto (5 total)
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

5 SessionInfo

> toLatex(sessionInfo())

• R version 2.15.0 (2012-03-30), x86_64-apple-darwin10.8.0
• Locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
• Base packages: base, datasets, graphics, grDevices, methods, stats, utils
• Other packages: aroma.light 1.24.0, Biobase 2.16.0, BiocGenerics 0.2.0, Biostrings 2.24.1, DESeq 1.8.1, EDASeq 1.2.0, GenomicRanges 1.8.3, IRanges 1.14.2, lattice 0.20-6, latticeExtra 0.6-19, locfit 1.5-7, R.methodsS3 1.2.2, R.oo 1.9.3, RColorBrewer 1.0-5, Rsamtools 1.8.1, ShortRead 1.14.1
• Loaded via a namespace (and not attached): annotate 1.34.0, AnnotationDbi 1.18.0, bitops 1.0-4.1, BSgenome 1.24.0, DBI 0.2-5, genefilter 1.38.0, geneplotter 1.34.0, grid 2.15.0, hwriter 1.3, KernSmooth 2.23-7, RCurl 1.91-1, RSQLite 0.11.1, rtracklayer 1.16.1, splines 2.15.0, stats4 2.15.0, survival 2.36-12, tools 2.15.0, XML 3.9-4, xtable 1.7-0, zlibbioc 1.2.0

References

[1] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.

[2] Bullard, J., Purdom, E., Hansen, K., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.

[3] Hansen, K., Irizarry, R., and Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics.

[4] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.

[5] Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4(1), 14.

[6] Risso, D. and Dudoit, S. (2011). EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq. R package version 1.2.0. http://www.bioconductor.org/packages/release/bioc/html/EDASeq.html.

[7] Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics, 12, 480.

[8] Robinson, M., McCarthy, D., and Smyth, G. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139.