Gene Expression Analysis
Jie Peng
Department of Statistics University of California, Davis
May 2012
RNA expression technologies
• High-throughput technologies to measure the expression levels of thousands of genes simultaneously: Microarray, RNA-seq.
Platforms: Affymetrix GeneChip arrays; Genome Analyzer II, HiSeq 1000/2000.
• Goal: study the effects of treatments, developmental stages, tissues, etc. on gene expression.
• Experimental design issues.
• pooling, replication
• multiplexing – include multiple barcoded samples in the same sequencing reaction
• lane, flow cell run, batch
• Library preparation.
• Extract data: image analysis; read mapping.
Analyzing data
• Data structure: microarray – intensity value for each probe on the array; RNA-seq – mapped read count for each gene.
• Data exploration, filtering
• Normalization
• Fitting differential expression (DE) models
• Calling significant genes
Data exploration
• Plots: MA plots, histograms, etc.
• Summaries: mean/median, variance/MAD, missing rate, library size, etc.
• Filtering:
• Microarray: low intensity, low variation
• RNA-seq: low count
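The MA plot mentioned above can be computed directly from two samples. The following is a minimal sketch in base R on simulated counts; the helper name `ma_values` and the 0.5 pseudo-count (used to avoid taking the log of zero) are illustrative assumptions, not part of the lecture.

```r
## M = log ratio, A = average log abundance of two samples.
## `ma_values` and the 0.5 pseudo-count are illustrative assumptions.
ma_values <- function(y1, y2, pseudo = 0.5) {
  l1 <- log2(y1 + pseudo)
  l2 <- log2(y2 + pseudo)
  data.frame(M = l1 - l2, A = (l1 + l2) / 2)
}

set.seed(1)
y1 <- rpois(1000, lambda = 50)   # simulated counts, sample 1
y2 <- rpois(1000, lambda = 50)   # simulated counts, sample 2
ma <- ma_values(y1, y2)
plot(ma$A, ma$M, pch = 19, cex = 0.4, xlab = "A", ylab = "M")
abline(h = 0, lty = 2)           # comparable samples scatter around M = 0
```

If the two samples are comparable, the cloud of points should be centered on the horizontal line M = 0 across the range of A.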
Normalization
• Remove systematic biases due to library preparation, RNA composition, etc. such that samples are comparable.
• Depends on technology and platform.
• Basic assumption: majority of genes are not differentially expressed across samples.
• Global normalization – match certain global features of the samples. For example, make all samples have the same median and MAD, or make all samples have the same 75% quantile. Does not change the data much (often only up to a scaling factor), but may not remove all systematic biases.
• Quantile normalization – impose the same empirical distribution on every sample. May change the data a lot, and may reduce signal while removing bias.
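As an illustration of global normalization, here is a minimal sketch in base R that gives every sample the same median and MAD. The helper name `global_norm` and the choice of the across-sample averages as the common targets are assumptions made for this example.

```r
## Shift and scale each column (sample) to a common median and MAD.
## `global_norm` is a hypothetical helper; the targets are the averages
## of the per-sample medians and MADs (an illustrative choice).
global_norm <- function(x) {
  med <- apply(x, 2, median)
  md  <- apply(x, 2, mad)
  z <- sweep(sweep(x, 2, med, "-"), 2, md, "/")   # median 0, MAD 1
  z * mean(md) + mean(med)
}

set.seed(1)
## Simulated log intensities for three samples with different location and scale.
x <- cbind(rnorm(500, 8, 1), rnorm(500, 9, 2), rnorm(500, 7.5, 0.5))
xn <- global_norm(x)
apply(xn, 2, median)   # identical across samples
apply(xn, 2, mad)      # identical across samples
```

Each sample is only shifted and rescaled, so the ordering of genes within a sample is unchanged.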
Quantile normalization: an R implementation
quan.norm <- function(x, quan = 0.5) {
  ## x: p by n data matrix, where rows are genes and columns are samples.
  norm <- x
  p <- nrow(x)
  n <- ncol(x)
  x.sort <- apply(x, 2, sort)   ## sort genes within a sample
  x.rank <- apply(x, 2, rank)   ## rank genes within a sample (ties get average ranks)
  ## find the common distribution to be matched to:
  qant.sort <- matrix(apply(x.sort, 1, quantile, probs = quan), p, n, byrow = FALSE)
  ## match each sample to the common distribution;
  ## floor() makes explicit that fractional (tied) ranks are truncated when used as indices:
  for (i in 1:n) {
    norm[, i] <- qant.sort[floor(x.rank[, i]), i]
  }
  return(norm)
}
Normalization of RNA-seq data
Global normalization by scaling.
• Library size normalization
• choose a reference sample: e.g., the sample with a median library size.
• for a target sample: multiply its counts by the ratio between the library size of the reference and that of the target.
• TMM normalization
• takes into account RNA composition differences.
• Ref: Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.
• Quantile-matched normalization
• match a certain quantile across samples: e.g., make the 75%-quantile of counts the same for all samples.
• Ref: Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11:94, 2010.
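The two scaling schemes above (library size and quantile matching) can be sketched in a few lines of base R. This is a simplified illustration on simulated counts, not the implementation used in any package; the variable names and the rule for picking the reference sample are assumptions.

```r
## Toy count matrix (simulated): rows are genes, columns are samples.
set.seed(1)
counts <- matrix(rpois(5 * 2000, lambda = rep(c(20, 30, 40, 25, 35), each = 2000)),
                 nrow = 2000, dimnames = list(NULL, paste0("s", 1:5)))

## Library size normalization: the reference is the sample with the median library size.
lib.size <- colSums(counts)
ref <- order(lib.size)[ceiling(length(lib.size) / 2)]
counts.lib <- sweep(counts, 2, lib.size[ref] / lib.size, "*")

## Quantile-matched normalization: make the 75% quantile of counts equal to the reference's.
uq <- apply(counts, 2, quantile, probs = 0.75)
counts.uq <- sweep(counts, 2, uq[ref] / uq, "*")

colSums(counts.lib)                             # all equal to lib.size[ref]
apply(counts.uq, 2, quantile, probs = 0.75)     # all equal to uq[ref]
```

Both methods multiply each sample by a single factor; they differ only in which summary of the sample that factor equalizes.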
RNA composition
• Observed quantities:
• counts: $Y_{gk}$ – number of reads mapped to gene $g$ in sample $k$.
• library size: $N_k := \sum_g Y_{gk}$ – total number of mapped reads in sample $k$.
• gene length: $L_g$ – length of gene $g$.
• Unobserved quantities:
• abundance: $A_{gk}$ – number of RNA transcripts of gene $g$ in sample $k$.
• total abundance: $A_k := \sum_g A_{gk}$ – total amount of RNA transcripts in sample $k$. $S_k := \sum_g A_{gk} L_g$.
• relative abundance: $\lambda_{gk} := A_{gk}/A_k$.
• For each gene $g$, we’d like to compare the relative abundance across samples, e.g., testing $H_{0g}: \lambda_{g1} = \lambda_{g2}$.
• The expected value of $Y_{gk}$ can be modeled as
$$E(Y_{gk}) = \frac{A_{gk} L_g}{\sum_s A_{sk} L_s} N_k = (\lambda_{gk} L_g)\left(\frac{A_k}{S_k} N_k\right) =: \mu_{gk}.$$
• Effective library size: $\tilde{N}_k := \frac{A_k}{S_k} N_k$.
• If $\tilde{N}_1 = \tilde{N}_2$, then comparing $\lambda_{g1}, \lambda_{g2}$ is equivalent to comparing $\mu_{g1}, \mu_{g2}$, which can be done by using a test based on the observed counts $Y_{gk}$.
• The goal is therefore to equalize the effective library size across samples.
• Note that $E(Y_{gk}/N_k) = (\lambda_{gk} L_g)(A_k/S_k)$.
• By assuming that most genes are not DE, i.e., for most genes, $\lambda_{g1} = \lambda_{g2}$, the trimmed mean of the log ratios
$$\left\{ M_g := \log \frac{Y_{g1}/N_1}{Y_{g2}/N_2} \right\}_g$$
can be used to estimate
$$\log \frac{A_1/S_1}{A_2/S_2}.$$
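The trimmed-mean argument above can be checked on simulated data. The sketch below is a simplified, unweighted version of the idea and not edgeR's TMM implementation (which, per Robinson and Oshlack, also trims on abundance and uses weights). All numbers (gene lengths, abundances, sequencing depth, 30% trimming, base-2 logs) are assumptions chosen for illustration.

```r
set.seed(1)
G <- 2000
L  <- c(rep(10, 100), rep(1, 100), rep(2, G - 200))    # gene lengths L_g
w1 <- c(rep(50, 100), rep(5, 100), rep(10, G - 200))   # abundances A_g1
w2 <- c(rep(5, 100), rep(50, 100), rep(10, G - 200))   # abundances A_g2: first 200 genes are DE
lambda1 <- w1 / sum(w1)                                # relative abundances
lambda2 <- w2 / sum(w2)

## E(Y_gk) is proportional to lambda_gk * L_g, with about 2e6 reads per sample.
depth <- 2e6
Y1 <- rpois(G, depth * lambda1 * L / sum(lambda1 * L))
Y2 <- rpois(G, depth * lambda2 * L / sum(lambda2 * L))
N1 <- sum(Y1); N2 <- sum(Y2)

M <- log2((Y1 / N1) / (Y2 / N2))   # gene-wise log ratios M_g
M <- M[is.finite(M)]               # drop genes with a zero count
est <- mean(M, trim = 0.3)         # drop 30% of the genes from each tail

## Target: log2 of (A_1/S_1)/(A_2/S_2), where A_k/S_k = 1 / sum(lambda_k * L).
truth <- log2(sum(lambda2 * L) / sum(lambda1 * L))
c(estimate = est, truth = truth)
```

In this simulation the library sizes are nearly equal, yet the log ratios of the non-DE genes are centered away from zero; the trimmed mean recovers that offset because the DE genes fall in the trimmed tails.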
Model expression data
• Microarray data: assume a multiplicative noise model and model the log intensity as normal random variables.
• RNA-seq data.
• Within a sample, it is reasonable to model the counts as Poisson random variables with means proportional to the relative RNA abundance. When comparing two samples:
• R function glm() with family="poisson" can be used to fit data.
• findings are restricted to these two samples and cannot be generalized to general populations.
• To account for biological variations across samples, various overdispersion models are considered.
• overdispersion: variance > mean. Note that for Poisson random variables, variance = mean.
• commonly used overdispersion models: negative binomial, quasi-Poisson, quasi-Binomial.
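To make the contrast between the Poisson and an overdispersion model concrete, the sketch below fits both to one simulated gene with glm(). The data are simulated from a negative binomial distribution; the group sizes, library sizes, fold change, and dispersion value are assumptions for illustration, and the log library size enters as an offset.

```r
set.seed(1)
group <- factor(rep(c("grp1", "grp2"), each = 8))
lib.size <- round(runif(16, 1e6, 2e6))
## One gene with a 2-fold change between groups and dispersion 0.2 (simulated).
mu <- lib.size * 1e-4 * ifelse(group == "grp2", 2, 1)
y <- rnbinom(16, mu = mu, size = 1 / 0.2)

fit.pois  <- glm(y ~ group + offset(log(lib.size)), family = poisson)
fit.quasi <- glm(y ~ group + offset(log(lib.size)), family = quasipoisson)

## The coefficient estimates agree; the quasi-Poisson standard errors are
## inflated by the square root of the estimated dispersion.
summary(fit.pois)$coefficients
summary(fit.quasi)$coefficients
summary(fit.quasi)$dispersion
```

With replicate-to-replicate variation of this size, the Poisson fit reports standard errors that are too small, which is why it overstates significance when biological replicates are compared.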
Cautions.
• The Poisson model is based on the assumption that reads are randomly and independently distributed.
• This may not be true due to various reasons such as random hexamer priming, GC content bias.
• Ref: Kasper D. Hansen, Steven E. Brenner, and Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12):e131, 2010; Davide Risso, Katja Schwartz, Gavin Sherlock, and Sandrine Dudoit. GC-content normalization for RNA-Seq data. BMC Bioinformatics, 12:480, 2011.
• Corrections and normalizations may be necessary depending on the goal of the study.
• Underdispersion is sometimes observed. The quasi-Poisson model can handle both overdispersion and underdispersion; the negative binomial model can only handle overdispersion.
Differentially expressed genes
• Microarray: (moderated) t-tests based on log intensities.
• RNA-seq: likelihood ratio tests or exact tests based on counts.
• Permutation tests, rank tests, empirical Bayes methods, etc.
• Multiple comparison adjustment: based on p-values.
• Control familywise error rate (FWER): Bonferroni, Holm, etc.
• Control false discovery rate (FDR): Benjamini & Hochberg (BH), Benjamini & Yekutieli (2001) (BY), etc.
• R function p.adjust.
• Other variants of FDR: R package locfdr, R package qvalue.
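The adjustment methods listed above are all available through p.adjust. A short sketch on simulated p-values follows; the mixture of 900 null and 100 non-null p-values is an assumption for illustration.

```r
set.seed(1)
## 900 null p-values (uniform) plus 100 small p-values standing in for DE genes (simulated).
p <- c(runif(900), rbeta(100, 0.1, 10))
methods <- c("bonferroni", "holm", "BH", "BY")
adj <- sapply(methods, function(m) p.adjust(p, method = m))
colSums(adj < 0.05)   # number of genes called at level 0.05 by each method
```

The FWER methods are stricter than the FDR methods: for every gene the Holm-adjusted p-value is no larger than the Bonferroni one, and the BH-adjusted p-value is no larger than either the Holm or the BY one.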
R packages
• Microarray: affy, limma, etc.
• RNA-seq: DESeq, edgeR, glm() (a function in the stats package), etc.
• Bioconductor package edgeR
• Based on negative binomial models: Y ∼ NB(µ, φ), E(Y ) = µ, Var(Y ) = µ(1 + µφ) (µ > 0, φ > 0).
• To account for small sample sizes as is typical in RNA-seq studies, edgeR also utilizes empirical Bayes ideas to pool information across genes.
• Ref: Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140, 2010; M. D Robinson and G. K Smyth. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21):2881-2887, 2007; M. D Robinson and G. K Smyth. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2):321-332, 2008.
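The mean-variance relation of the negative binomial model above can be checked by simulation in base R. In rnbinom, the `size` argument corresponds to 1/φ under the `mu` parameterization; the values µ = 100 and φ = 0.1 are arbitrary choices for this sketch.

```r
set.seed(1)
mu <- 100
phi <- 0.1
y <- rnbinom(1e5, mu = mu, size = 1 / phi)   # size = 1/phi
c(mean = mean(y), var = var(y), theory.var = mu * (1 + mu * phi))
```

The square root of φ is the biological coefficient of variation that edgeR plots with plotBCV.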
A Case study
• An RNA-seq data set with two groups: grp1 – eight replicates, grp2 – seven replicates.
• Data exploration.
• Data matrix: row – gene, column – sample. dim(counts)
geneID   grp1-sample1   grp1-sample2   grp1-sample3   ...
gene1    0              0              0              ...
gene2    109            71             128            ...
gene3    3              2              10             ...
...      ...            ...            ...            ...
• Library size: barplot(colSums(counts))
• Filtering:
allzero=(rowSums(counts)==0)
counts=counts[!allzero,]
dim(counts)
• Clustering of samples: are samples from the same group clustered together?
> library(edgeR)
> group=factor(c(rep(1,8), rep(2,7)))
> d=DGEList(counts=counts,group=group)
> d$samples$lib.size
> plotMDS(d)
[Figure: MDS plot of the 15 samples (x-axis: Dimension 1, y-axis: Dimension 2), with points labeled grp1-sample 1 through grp1-sample 8 and grp2-sample 1 through grp2-sample 7.]
Normalization and MA plots.
> d=calcNormFactors(d,method="TMM")
> samp1="grp1-sample 7"; samp2="grp2-sample 5"
> maPlot(d$counts[,samp1],d$counts[,samp2],normalize=TRUE,
+ lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)
> abline(h=0, lty=2)
> eff.libsize=d$samples$lib.size*d$samples$norm.factors
> names(eff.libsize)=colnames(d$counts)
> maPlot(d$counts[,samp1]/eff.libsize[samp1],
+ d$counts[,samp2]/eff.libsize[samp2],normalize=FALSE,
+ lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)
> abline(h=0, lty=2)
Two-group comparison and gene calling.
• Estimate dispersion parameters and plot genewise biological coefficient of variation (square root of dispersion) against gene abundance (in log2 counts per million).
> d=estimateCommonDisp(d, verbose=TRUE)
> d$common.dispersion
> d=estimateTagwiseDisp(d,prior.n=getPriorN(d))
> plotBCV(d)
• Exact test and gene calling.
> et=exactTest(d,pair=1:2,dispersion="tagwise",
+ rejection.region="doubletail",big.count=900)
> topTags(et,n=100, adjust.method="BY")
> de=decideTestsDGE(et, adjust.method="BY",
+ p.value=0.05)
> summary(de)
• FDR method "BY" controls the FDR under arbitrary dependence among the tests and is more conservative than method "BH".
• Draw smear plot of log concentration vs. log fold-change: find both statistically significant and practically significant DE genes.
> plotSmear(et,
+ de.tags=rownames(et$table)[as.logical(de)])
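Selecting genes that are both statistically and practically significant amounts to combining an adjusted p-value cutoff with a fold-change cutoff. The sketch below uses a simulated table with columns named `logFC` and `PValue`, standing in for `et$table`; the data and the thresholds (FDR 0.05, at least a 2-fold change) are assumptions for illustration.

```r
set.seed(1)
## Simulated stand-in for et$table: log2 fold changes and raw p-values.
tab <- data.frame(logFC  = rnorm(1000, sd = 1.5),
                  PValue = c(rbeta(200, 0.1, 10), runif(800)))
rownames(tab) <- paste0("gene", 1:1000)

fdr  <- p.adjust(tab$PValue, method = "BY")
keep <- fdr < 0.05 & abs(tab$logFC) >= 1   # significant and at least 2-fold
sum(keep)
head(tab[keep, ])
```

A gene with a tiny p-value but a negligible fold change is excluded by the second condition, which is the distinction the smear plot is meant to show.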
Look at pvalue distribution
• Histogram:
> hist(et$table$PValue, breaks=50,xlab="pvalue")
• Observe an unusually high bar at p-values close to one.
• Examine log p-value vs. log-concentration/log-cpm: this bar comes primarily from genes with a small number of counts.
• Use a threshold (e.g., 10) on the total number of counts across samples to filter out low-count genes.
• A similar phenomenon occurs when analyzing exon sequence data in GWAS.
[Figure: histograms of p-values (x-axis: pvalue, y-axis: Frequency) in four panels: all genes; genes with at least 10 total counts (84% of genes pass); genes with at least 20 total counts (79% of genes pass); genes with at least 40 total counts (73% of genes pass).]
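The total-count filter described above is a one-line operation on the count matrix. A minimal sketch on simulated counts follows; the mix of expression levels and the negative binomial parameters are assumptions, and the threshold of 10 is the example value from the slide.

```r
set.seed(1)
## Simulated counts for 1000 genes in 15 samples; the first 300 genes are barely expressed.
mu <- rep(c(0.2, 5, 100), times = c(300, 300, 400))
counts <- matrix(rnbinom(1000 * 15, mu = mu, size = 5), nrow = 1000)

keep <- rowSums(counts) >= 10    # total count across all samples
mean(keep)                       # fraction of genes that pass the filter
counts.filtered <- counts[keep, , drop = FALSE]
dim(counts.filtered)
```

Repeating this with thresholds of 20 and 40 and re-drawing the p-value histogram each time shows how the bar near one shrinks as more low-count genes are removed.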
Summary
• Explore data by graphs and numerical summaries.
• Examine normalization by MA plots.
• Filter out genes with small counts.
• Look at both p-values and fold change for significant genes.