Statistical challenges in RNA-Seq data analysis

(1)

Statistical challenges in RNA-Seq data analysis

Julie Aubert

UMR 518 AgroParisTech-INRA Mathématiques et Informatique Appliquées

(2)

Outline

1 Introduction 2 Experimental design 3 Normalization 4 Differential analysis 5 Conclusions 6 R and Bioconductor

(3)

Introduction

A statistical model : what for ?

Aim of an experiment : answer to a biological question

Results of an experiment : (numerous, numerical) measurements Model : mathematical formula that relates the experimental

conditions and the observed measurements (response) (Statistical) modelling : translating a biological question into a

mathematical model (6=PIPELINE !) Statistical model : mathematical formula involving

the experimental conditions, the biological response,

(4)

Introduction Biological question Low-level analysis Experimental design Sequencing experiment Higher-level analysis

Biological validation and interprétation

Exploratory Data Analysis, image analysis, base calling, read mapping, metadata integration

Exploratory Data Analysis, normalization and expression quantification, differential analysis, metadata integration

*Adapted from S. Dudoit, Berkeley

(5)

Experimental design

Experimental Design

Basic principles - Fisher (1935)

(technical and biological) replications Randomization

Blocking

Application to NGS

Identify controllable biases / technical specificities

lane effect<run effect<library prep effect<<biological effect [Marioni et al 2008, Bullard et al 2010]

(6)

Experimental design

Experimental Design

A good design is a list of experiments to conduct in order to answer to the asked question which maximize collected information and minimize experiments cost with respect to constraints.

RNA-seq

Which genes or transcripts are expressed ? differentially expressed between several conditions ?

(7)

Experimental design

Experimental Design

Technical choices

Choice of sequencing technology, type of reads (paired-end ?), type of sequencing (directional ?), library preparation protocol

Sequencing depth

Barcoding (attaching a known sequence of nucleotides to the 3’ ends of the NGS technology adapter sequences identifing a sample) or not

Pooling* of barcoded sample for a simultaneous sequencing and number of samples.

(8)

Experimental design

Experimental Design

Replicate number and sample allocation to runs/lanes Technical vs biological replicates

Increasing the number of bio. replicates increases the precision and generalizability of the results

Technical variability =>inconsistent detection of exons at low levels of coverage (<5reads per nucleotide) (McIntyre et al. 2011)

Doing technical replication may be important in studies where low abundant mRNAs are the focus.

Guidelines from the Encyclopedia of DNA Elements (ENCODE) consortium (June 2011)

(9)

Experimental design

(10)

Experimental design

RNA-sequencing

# of reads mapped to Gene 1 Gene 2 … Gene m 25 320 … 23

Random fragmentation

Reverse transcription

mRNAs from a sample

Adapted from Li et al. (2011)

Fragmented mRNAs cDNAs

A vector of counts counting mapping PCR amplification & sequencing

mapped reads A list of reads from Gene 19 from Gene 23 … from Gene 56 ATTGCC... GCTAAC... … AGCCTC... __ _ __ _ __ … __ __ _ __ _ __ _ __ … __ __ _ _____ ___ … _____

(11)

Experimental design

Sequencing Mapping Counts

Fragment expression quantification

(12)

Experimental design

Exploratory data analysis

Conducting data analysis is like drinking a fine wine. It is important to swirl and sniff the wine, to unpack the complex bouquet and to appreciate the experience. Gulping the wine doesn’t work.(Daniel B. Wright - 2003)

(13)

Normalization

Definition

Normalization is a process designed to identify and correct technical biases removing the least possible biological signal. This step is technology and platform-dependant.

Within-lane normalization

Normalisation enabling comparisons of fragments (genes) from a same sample.

No need in a differential analysis context.

(14)

Normalization

Sources of variability

Within-sample Gene length

Sequence composition (GC content)

Between-sample

Depth (total number of sequenced and mapped reads) Sampling bias in library construction ?

Presence of majority fragments

Sequence composition du to PCR-amplification step in library preparation‘(Pickrell et al. 2010, Risso et al. 2011)

(15)

Normalization

Length bias (Oshlack 2009, Bullard et al. 2010)

At same expression level, a long transcript will have more reads than a shorter transcript. Number of reads 6=expression level

µ=E(X) =cNL=Var(X)

X mesured number of reads in a library mapping a specific transcript, Poisson r.v.

c proportionnality constant N total number of transcripts L gene length

Test :

t = p X1−X2

(16)

Normalization

StatOmique workshop

http://vim-iip.jouy.inra.fr:8080/statomique/

At lot of different normalization methods...

Some are part of models for DE, others are ’stand-alone’ They do not rely on similar hypotheses

But all of them claim to remove technical bias associated with RNA-seq data

Which one is the best ?

How to and on which criteria choice a normalisation adapted to our experiment ?

What impact of the bioinformatics, normalisation step or differential analysis method on lists of DE genes ?

(17)

Normalization

Global normalisation methods

Normalised counts are raw counts divided by a scaling factor calculated for each sample

Distribution adjustment

Assumption (TC, UQ, Median) : read counts are prop. to expression level and sequencing depth

Total number of reads : TC (Marioni et al. 2008) Quantile : FQ (Robinson and Smyth 2008) Upper Quartile : UQ (Bullard et al. 2010) Median

(18)

Normalization

Notations

xij : number of reads for genei in samplej

Nj : number of reads in sample j (library size of sample j)

n : number of samples in the experiment

ˆ

sj : normalization factor associated with samplej

Li : length of gene i

(19)

Normalization

Notations

ˆ

(20)

Normalization

Notations

ˆ

(21)

Normalization

Notations

ˆ

(22)

Normalization

Notations

ˆ

(23)

Normalization

Total read count normalization (TC) (Marioni et al. 2008)

Adjust for lane sequencing depth (library size)

Motivation greater lane sequencing depth=>greater counts whatever the transcript length and the expression level

Assumptionread counts are proportional to expression level and sequencing depth (same RNAs in equal proportion)

Methoddivide transcript read count by total number of reads

xij Nj = xij ˆ sj , ˆsj =Nj (1)

(24)

Normalization

Problemmakes frequenciescomparable between lanes, not read counts

Solution rescale scaling factors so that their sum across lanes is equal to, e.g., the number of lanes

Makes normalization procedures comparable

ˆsj = Nj 1 n P lNl (2)

(25)

Normalization

Upper Quartile normalization UQ (Bullard et al. 2010)

Adjust for lane sequencing depth

Motivation total read count is strongly dependent on a few highly expressed transcripts

Assumptionread counts are proportional to expression level and sequencing depth

Methoddivide transcript read count by, e.g., upper quartile

xij ˆ sj , ˆsj = Q3j 1 n P lQ3l (3)

(26)

Normalization

Other quantile-based normalizations

Median

Adjust for lane sequencing depth

xij ˆ sj , ˆsj = medianj 1 n P lmedianl (4)

Full quantile normalization FQ (Robinson and Smyth 2008)

Adjust for differences in count distribution (inspired from the microarray literature)

Assumptionread counts have identical distribution across lanes

Methodall quantiles of the count distributions are matched between lanes

(27)

Normalization

RPKM normalization

Reads Per Kilobase per Million mapped reads

Adjust for lane sequencing depth (library size) and gene length

Motivation greater lane sequencing depth and gene length=>

greater counts whatever the expression level

Assumptionread counts are proportional to expression level, gene length and sequencing depth (same RNAs in equal proportion)

Methoddivide gene read count by total number of reads (in million) and gene length (in kilobase)

xij

Nj ∗Li

∗103∗106 (5)

(28)

Normalization

The Effective Library Size concept

Motivation

Different biological conditions express different RNA repertoires, leading to different total amounts of RNA

Assumption

A majority of transcripts is not differentially expressed

Aim

Minimizing effect of (very) majority sequences

(29)

Normalization

Methods based on the Effective Library Size Concept

Trimmed Mean of M-values Robinson et al. 2010 (edgeR)

Filter on transcripts with nul counts, on the resp. 30% and 5% more extreme

Mi =log2(_YYik/Nk

ik0/N_k0)and A values

Hyp : We may not estimate the total ARN production in one condition but we may estimate a global expression change between two conditions from non extremeMi

distribution.

Calculation of a scaling factor between two conditions and normalization to avoid dependance on a reference sample

Anders and Huber 2010 - Package DESeq

ˆ sj=mediani( kij (πm v=1kiv)1/m )

(30)

Normalization

4 real datasets and one simulated dataset

RNA-seq or miRNA-seq, DE, at least 2 conditions, at least 2 bio. rep., no tech. rep.

(31)

Normalization

Comparison procedures

Distribution and properties of normalized datasets Boxplots, variability between biological replicates

Comparison of DE genes

Differential analysis : DESeq v1.6.1 (Anders and Huber 2010), default param.

Number of common DE genes, similarity between list of genes (dendrogram - binary distance and Ward linkage)

Power and control of the Type-I error rate simulated data

(32)

Normalization

Normalized data distribution

When large diff. in lib. size, TC and RPKM do not improve over the raw counts.

Example :Mus musculus dataset

(33)

Normalization

Within-condition variability

(34)

Normalization

Number of DE genes

DESeq v1.6.0, default parameters

Input data : raw counts + scaling factorsˆsj (except RPKM)

RPKM : normalized datanon rounded and normalization parameterˆsj =1

Example : E. histolyticadataset, common genes

(35)

Normalization

Consensus dendrogram - Ward linkage algo.

(36)

Normalization

Type-I Error Rate and Power (Simulated data)

Inflated FP rate for all the methods except TMM and DESeq

(37)

Normalization

So the Winner is ... ?

In most cases

The methods yield similar results

However ...

(38)

Normalization

So the Winner is ... ?

In most cases

The methods yield similar results

However ...

Differences appear based on data characteristics

(39)

Normalization

Interpretation

RawCountOften fewer differential expressed genes (A. fumigatus : no DE gene)

TC, RPKM

Sensitive to the presence of majority genes Less effective stabilization of distributions Ineffective (similar to RawCount)

FQ

Can increase between group variance

Is based on an very (too) strong assumption (similar distributions)

Median High variability of housekeeping genes

Csq1 : non-uniformity of the distribution of reads along the genome Csq2 : technical variability within and between-sample

A normalization is needed and has a great impact on the DE genes (Bullard et al 2010), StatOmique

Detection of differential expression in RNA-seq data is inherently biased (more power to detect DE of longer genes)

Do not normalise by gene length in a context of differential analysis.

(45)

Normalization

Conclusions on “StatOmique” workshop

Hypothesis : the majority of genes is invariant between two samples. Differences between methods when presence of majority sequences, very different library depths.

TMM and DESeq : performant and robust methods in a DE analysis context on the gene scale.

Normalisation isnecessary and not trivial.

Perspectives

Application to transcripts :estimation 6= count: to do. Comparison of differential analysis methods (in progress)

(46)

Differential analysis

Aim : To detect differentially expressed genes between two conditions Discrete quantitative data

Few replicates

Overdispersion problem

Challenge : method which takes into account overdispersion and few number of replicates

Proposed methods : edgeR, DESeq for the most used and known An abundant littérature

Few Comparison of methods : Pachter et al. (2011), Kvam et Liu (2012)

(47)

Differential analysis gene-by-gene- with replicates

For each gene i

Is there a significant difference in expression between condition A and B ? Statistical model (definition and parameter estimation) - Generalized linear framework

Test : Equality of relative abundance of gene i in condition A and B vs non-equality

Let beYijk the count for replicate j in condition k from gene i

Yijk follows a Poisson distribution (µijk).

µijk

mjk =exp(αi+βij)<=>log( µijk

mjk) =αi +βij

(48)

Alternative : Negative Binomial Models

A supplementary dispersion parameter φto model the variance

How to obtain good estimates of overdispersion parameter for each gene ?

1 Pseudo-likelihood (goodness-of-fit statistic) 2 _{Quasi-likelihood (deviance statistic)}

3 _{Adjusted likelihood}

Many genes, very few biological samples - so difficult to estimateφ on a gene-by-gene basis

(49)

φ

estimation - Some proposed solutions

Method Variance Reference

DESeq µ(1+φµµ) Anders et Huber (2010)

edgeR µ(1+φµ) Robinson et Smyth (2009) NBPseq µ(1+φµα−1) Di et al. (2011)

edgeR : borrow information across genes for stable estimates ofφ : 3 ways to estimate φ(common, trend, moderated)

DESeq : estimation of the relationship between mean and variance using a local or parametric regression

(50)

Negative Binomial Models

µijk =λijmjk wheremjk : size factor (library size) Test : H0i :λiA =λiB vsH1i :λiA 6=λiB

edgeR

Adjust observed counts up or down depending on whether library sizes are below or above the geometric mean=>Creates approximately identically distributed pseudodata

Estimation ofφi by conditional ML conditioning on the TC for gene i Empirical Bayes procedure to shrink dispersions toward a consensus value

An exact test analogous to Fisher’s exact test but adapted to overdispersed data (Robinson and Smyth 2008)

DESeq

Test similar to Fisher’s exact test (calculation has changed)

(51)

Practical considerations

Input Data= raw counts

normalization offsets are included in the model

Version matters : edgeR 2.4.1 et DESeq 1.6.1 (Bioconductor 2.9)

edgeR

TMM normalization Common dispersion must be estimated before tagwise dispersions GLM functionality (for experiments with multiple factors) now available

DESeq

Two possibilities to obtain a smooth functions fj (·)

Conservative estimates : genes are assigned the maximum of the fitted and empirical values ofα (sharingMode = "maximum")

(52)

edgeR or DESeq ?

None is perfect !

results obtained by these two methods are mostly the same Robles et al. 2011

edgeR

a slightly inflated FPR from edgeR (small values of n or only high-counts transcripts)

Performance improves as number of replicates increases

DESeq

conservative whatever n

over-conservative behaviour when only low-counts (<100)

Remark : non-parametric methods not enough detection power (Kvam and Liu 2012).

(53)

Conclusions on differential analysis

Methods dedicated to microarrays are not applicable to RNA-seq Fisher test is adapted only in the case without replicate (do not allow inference)

Poisson model is not adapted to experiments with biological replicates Negative binomial model : distinction of methods on the information sharing for modelisation of the dispersion parameter

Negative binomial model not enough flexible ? (how to take into account zero-inflation and heavy tail) : ZI-BN, Tweedie-Poisson Numerous methods are regularly proposed : no standard

(60)

R and Bioconductor

R un langage de programmation

R est un langage interprété et non compilé.

Utiliser R consiste à manipuler des objets (données ou fonctions). Un objet est défini par :

un type : vecteur, tableau, matrice, liste, etc. un mode : numerique, caractère, complexe, logique. une taille : nombre d’éléments contenus dans l’objet. une valeur : NA si manquant, NaN, -Inf, Inf.

un nom

Tests, boucles, fonctions, classes d’objets etc.

(61)

R and Bioconductor

Chercher de l’aide

help.start() : affiche un menu d’aide

help(maFonction) : affiche la description de maFonction

?maFonction : affiche la description demaFonction

help.search(“motclef”) : donne la liste des commandes dans lesquelles apparaitmotclef

apropos(“motclef”) : affiche toutes les fonctions dont le nom contient

motclef

Vignettes : certains packages disposent de Vignettes (guide utilisateur). Remarque : on peut utiliser des expressions régulières (?regexp)

(62)

R and Bioconductor

Quelques fonctions utiles

getwd() : affiche le répertoire de travail

setwd(“CheminMonRepertoire“) : spécifier son répertoire de travail list.files() : quels fichiers dans mon répertoire de travail ?

ls() : quels objets dans ma session R ?

load(“monAncienneSession.RData”) : charger un environnement save() : sauver un environnement.

(63)

R and Bioconductor

(64)

R and Bioconductor

Bioconductor

Projet dédié à l’analyse de données génomiques à haut débit sous R. pour les microarrays, les données de séquences, l’annotation, high-throughput assays

2 versions par an suite à chaque nouvelle version de R. Actuellement : v2.9 (R 2.15)

516 bibliothèques de fonctions (libraries) pour l’analyse et la réprésentation graphique

plus de 500 bibliothèques d’annotation Installation

source("http ://bioconductor.org/biocLite.R") biocLite(“monPackage”)

biocLite(c("pkg1", "pkg2"))

(65)

R and Bioconductor

Bioconductor pour l’analyse de données NGS

Différents types de données gérés : ChIPseq, RNAseq, reséquençage Différents types de fichiers gérés : Fasta, fastaq, BAM, GFF, BED,

(66)

R and Bioconductor

Bioconductor pour l’analyse de données NGS

Différents types d’analyse

Manipulation des séquences, des génomes ou régions génomiques

Biostrings, IRanges, GenomicRanges, genomeIntervals

Analyse qualité, prétraitement

ShortRead, Rsamtools

ChIPseq : Recherche et annotation des pics, recherche de motifs

chipseq, chIPseqR, ChIPpeakAnno, BayesPeak, PICS, CSAR, ChIPsim, rGADEM

RNAseq : Recherche de gènes différentiellement exprimés, d’évènements d’épissage alternatif

DESeq, DEXSeq, edgeR, baySeq, DEGseq, etc.

Reséquençage ciblé : analyse de l’enrichissement :TEQC

from C. Keime

(67)

R and Bioconductor

First commands

Installation des packages :

source("http ://www.bioconductor.org/biocLite.R") biocLite(c("DESeq", "edgeR"))

Chargement des packages : library(DESeq) library(edgeR)

(68)

R and Bioconductor

edgeR main commands

generate raw counts from NB, create list object

y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4) rownames(y) <- paste("Gene",1 :nrow(y),sep=".")

group <- factor(c(1,1,2,2))

perform DA with edgeR

y <- DGEList(counts=y,group=group) y <- calcNormFactors(y,method="TMM") y <- estimateCommonDisp(y)

y <- estimateTagwiseDisp(y)

result <- exactTest(y,dispersion="tagwise") Observe some results - DGE with FDR BH

topTags(result)

summary(decideTestsDGE(result),p.value=0.05)

(69)

R and Bioconductor

DESeq main commands

cds <- newCountDataSet(y, group) cds <- estimateSizeFactors(cds) sizeFactors(cds)

cds <- estimateDispersions(cds) res <- nbinomTest( cds, "1", "2" )

(70)

R and Bioconductor

Quelques références pour débuter

http://www.r-project.org/: manuel, FAQ, RJournal, etc... http://www.bioconductor.org/help/publications/

cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf G. Millot, (2009), Comprendre et réaliser les tests statistiques à l’aide de R, Editions De Boeck, 704 p.

(71)

R and Bioconductor

References

Anders, S and Huber, W. (2010) Differential expression analysis for sequence count data,Genome Biology,11 :R106.

Bullard JH, Purdom E, Hansen KD, Dudoit S. (2010)Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments,BMC Bioinformatics, 11 :94

Di Y, Schaef, DW, Cumbie JS, Chang JH (2011)The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq,Statistical Applications in Genetics and Molecular Biology, 10(1), Article 24.

(72)

R and Bioconductor

References

Kvam V, Liu P (2012)A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data Li J, Jiang H, Wong WH, (2010)Modeling non-uniformity in short read rates in RNA-Seq data,Genome Biology, 11 :R50

Li J, Witten DM, Johnstone IM, Tisbhirani R (2011)Normalization, testing, and false discovery rate estimation for RNA-sequencing data,Biostatistics, 1-16

Marioni J.C., Mason C.E. et al. (2008) RNA-seq : An assessment of technical reproducibility and comparison with gene expression arrays,

Genome Research, 18 : 1509-1517.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq.Nature Methods, 5(7), 621-628

Robles et al (2012) Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing, preprint

Pachter L (2011)Models for transcript quantification from RNA-seq

(73)

R and Bioconductor

References

Robinson MD, Oshlack A. (2010)A scaling normalization method for differential expression analysis of RNA-seq data.Genome Biology, 11 :R25

Robinson MD and Smyth, GK. (2008) Small-sample estimation of negative binomial dispersion, with applications to SAGE data

Biostatistics, 9, 2 ; 321-332

Robinson MD, McCarthy DJ, Smyth, GK. (2009)edgeR : a Bioconductor package for differential expression analysis of digital gene expression data,Bioinformatics

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. (2011) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation,Nature

http://www.r-project.org/

www.bioconductor.org

http://www.bioconductor.org/help/publications/