
Gene Expression Analysis

Jie Peng

Department of Statistics University of California, Davis

May 2012


RNA expression technologies

High-throughput technologies to measure the expression levels of thousands of genes simultaneously: Microarray, RNA-seq.

Platforms: Affymetrix GeneChip arrays; Genome Analyzer II, HiSeq 1000/2000.

Goal: study the effects of treatments, developmental stages, tissues, etc. on gene expression.

Experimental design issues.

pooling, replication

multiplexing – include multiple barcoded samples in the same sequencing reaction

lane, flow cell run, batch

Library preparation.

Extract data: image analysis; reads mapping.


Analyzing data

Data structure: microarray – intensity value for each probe on the array; RNA-seq: mapped reads count for each gene.

Data exploration, filtering

Normalization

Fitting differential expression (DE) models

Calling for significant genes


Data exploration

Plots: MA plots, histograms, etc.

Summaries: mean/median, variance/MAD, missing rate, library size, etc.

Filtering:

Microarray: low intensity, low variation

RNA-seq: low count
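The summaries and the low-count filter listed above can be sketched in base R. The count matrix below is simulated, and the names `min.total`, `keep` and `filtered` as well as the threshold of 10 are illustrative assumptions rather than part of the original analysis.

```r
## Simulated counts (hypothetical data): 1000 genes x 6 samples
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 20, size = 0.5), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))

lib.size  <- colSums(counts)            # library size of each sample
gene.med  <- apply(counts, 1, median)   # per-gene median count
gene.mad  <- apply(counts, 1, mad)      # per-gene MAD
zero.rate <- rowMeans(counts == 0)      # fraction of samples with a zero count

## Filter: keep genes whose total count across samples reaches a threshold
min.total <- 10                         # illustrative threshold (an assumption)
keep      <- rowSums(counts) >= min.total
filtered  <- counts[keep, , drop = FALSE]
c(before = nrow(counts), after = nrow(filtered))
```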


Normalization

Remove systematic biases due to library preparation, RNA composition, etc. such that samples are comparable.

Depends on the technology and platform.

Basic assumption: majority of genes are not differentially expressed across samples.

Global normalization – match certain global features of the samples. For example, make all samples have the same median and MAD, or make all samples have the same 75% quantile. Does not change the data much (often only up to a scaling factor), but may not remove all systematic biases.

Quantile normalization – impose the same empirical distribution on every sample. May change the data a lot, and may reduce signal while removing bias.
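A minimal sketch of global normalization by matching the median and MAD, assuming log-scale intensities. The data are simulated, the function name `global.norm` is hypothetical, and using the average median and average MAD as the common targets is only one possible choice.

```r
## Hypothetical log-intensity matrix: 500 probes x 4 arrays, with
## array-specific shifts and scales standing in for systematic bias
set.seed(2)
x <- sapply(1:4, function(k) rnorm(500, mean = 8 + 0.5 * k, sd = 1 + 0.2 * k))

## Global normalization: give every sample the same median and MAD
global.norm <- function(x) {
  med <- apply(x, 2, median)
  md  <- apply(x, 2, mad)
  target.med <- mean(med)                      # common location (a choice)
  target.mad <- mean(md)                       # common scale (a choice)
  centered <- sweep(x, 2, med, "-")            # every column now has median 0
  scaled   <- sweep(centered, 2, md / target.mad, "/")
  scaled + target.med
}

x.norm <- global.norm(x)
round(rbind(median = apply(x.norm, 2, median),
            mad    = apply(x.norm, 2, mad)), 3)
```

Each sample is only shifted and rescaled, so the shape of its distribution is unchanged; this is the sense in which global normalization changes the data less than quantile normalization.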


Quantile normalization: an R implementation

quan.norm<-function(x,quan=0.5){
  ## x: p by n data matrix, where columns are the samples.
  norm<-x
  p<-nrow(x)
  n<-ncol(x)
  x.sort<-apply(x, 2, sort) ## sort genes within a sample
  ## rank genes within a sample; ties.method="first" keeps the ranks
  ## integer-valued so that they can be used as row indices below
  x.rank<-apply(x, 2, rank, ties.method="first")
  ## find the common distribution to be matched to:
  qant.sort<-matrix(apply(x.sort,1,quantile, probs=quan),
                    p,n,byrow=FALSE)
  ## match each sample to the common distribution:
  for (i in 1:n){
    norm[,i]<-qant.sort[x.rank[,i],i]
  }
  return(norm)
}


Normalization of RNA-seq data

Global normalization by scaling.

Library size normalization

choose a reference sample: e.g., the sample with a median library size.

for a target sample: multiply its counts by the ratio between the library size of the reference and that of the target.

TMM normalization

takes into account RNA composition differences.

Ref: Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.

Quantile-matched normalization

match a certain quantile across samples: e.g., make the 75%-quantile of counts the same for all samples.

Ref: Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11:94, 2010.
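The library-size and quantile-matched scalings described above can be sketched in base R as follows. The counts are simulated, and the names `ref`, `lib.norm` and `uq.norm` are hypothetical; this is an illustration of the scaling idea, not the implementation used by any particular package.

```r
set.seed(3)
## Hypothetical counts: 2000 genes x 5 samples with different sequencing depths
depth  <- c(1, 2, 0.5, 1.5, 3)
abund  <- rexp(2000, 1 / 50)
counts <- sapply(depth, function(d) rpois(2000, lambda = d * abund))
colnames(counts) <- paste0("s", 1:5)

## Library size normalization: the reference is the sample whose library
## size is closest to the median library size
lib.size <- colSums(counts)
ref      <- which.min(abs(lib.size - median(lib.size)))
lib.norm <- sweep(counts, 2, lib.size[ref] / lib.size, "*")

## Quantile-matched normalization: scale each sample so that its
## 75%-quantile equals that of the reference sample
q75     <- apply(counts, 2, quantile, probs = 0.75)
uq.norm <- sweep(counts, 2, q75[ref] / q75, "*")

rbind(lib.size = colSums(lib.norm),
      q75      = apply(uq.norm, 2, quantile, probs = 0.75))
```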


RNA composition

Observed quantities:

counts: $Y_{gk}$ – number of reads mapped to gene $g$ in sample $k$.

library size: $N_k := \sum_g Y_{gk}$ – total number of mapped reads in sample $k$.

gene length: $L_g$ – length of gene $g$.

Unobserved quantities:

abundance: $A_{gk}$ – number of RNA transcripts of gene $g$ in sample $k$.

total abundance: $A_k := \sum_g A_{gk}$ – total amount of RNA transcripts in sample $k$. Also define $S_k := \sum_g A_{gk} L_g$.

relative abundance: $\lambda_{gk} := A_{gk} / A_k$.

For each gene $g$, we'd like to compare the relative abundance across samples, e.g., testing

$H_{0g}: \lambda_{g1} = \lambda_{g2}$.


The expected value of $Y_{gk}$ can be modeled as

$E(Y_{gk}) = \frac{A_{gk} L_g}{\sum_s A_{sk} L_s} N_k = (\lambda_{gk} L_g) \left( \frac{A_k}{S_k} N_k \right) =: \mu_{gk}$.

Effective library size: $\tilde{N}_k := \frac{A_k}{S_k} N_k$.

If $\tilde{N}_1 = \tilde{N}_2$, then comparing $\lambda_{g1}$, $\lambda_{g2}$ is equivalent to comparing $\mu_{g1}$, $\mu_{g2}$, which can be done by using a test based on the observed counts $Y_{gk}$.

The goal is therefore to equalize the effective library size across samples.


Note that $E(Y_{gk} / N_k) = (\lambda_{gk} L_g)(A_k / S_k)$.

By assuming that most genes are not DE, i.e., for most genes $\lambda_{g1} = \lambda_{g2}$, the trimmed mean of the log ratios

$\left\{ M_g := \log \frac{Y_{g1} / N_1}{Y_{g2} / N_2} \right\}_g$

can be used to estimate

$\log \frac{A_1 / S_1}{A_2 / S_2}$.
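The estimate above can be illustrated with a small simulation. This is a simplified sketch of the idea only: it uses a plain trimmed mean and omits the precision weights and the additional trimming on abundance of the published TMM method. The gene lengths, abundances, the trimming fraction 0.3 and all variable names are assumptions made for the illustration.

```r
set.seed(4)
G <- 2000
L <- sample(500:5000, G, replace = TRUE)   # hypothetical gene lengths
L[1:20] <- 10000; L[21:40] <- 500          # a few long and a few short genes

lam1 <- rexp(G)
lam1[1:20] <- 0.5; lam1[21:40] <- 50       # sample 1: the short genes dominate
lam2 <- lam1
lam2[1:20]  <- lam1[21:40]                 # sample 2: the long genes dominate
lam2[21:40] <- lam1[1:20]
lam1 <- lam1 / sum(lam1); lam2 <- lam2 / sum(lam2)   # relative abundances

## E(Y_gk) = lambda_gk L_g / sum_s(lambda_sk L_s) * N_k, with N_k about 2e6
y1 <- rpois(G, 2e6 * lam1 * L / sum(lam1 * L))
y2 <- rpois(G, 2e6 * lam2 * L / sum(lam2 * L))

N1 <- sum(y1); N2 <- sum(y2)
ok <- y1 > 0 & y2 > 0                      # log ratios need positive counts
M  <- log((y1[ok] / N1) / (y2[ok] / N2))

est   <- mean(M, trim = 0.3)                     # trimmed mean of the log ratios
truth <- log(sum(lam2 * L) / sum(lam1 * L))      # log[(A1/S1)/(A2/S2)]
c(estimate = est, truth = truth)
```

Only 40 of the 2000 genes differ in relative abundance here, so trimming removes them and the remaining log ratios centre on the composition factor rather than on zero.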


Model expression data

Microarray data: assume a multiplicative noise model and model the log intensity as normal random variables.

RNA-seq data.

Within a sample, it is reasonable to model the counts as Poisson random variables with means proportional to the relative RNA abundance. When comparing two samples:

R function glm() with family="poisson" can be used to fit the data.

findings are restricted to these two samples and cannot be generalized to general populations.

To account for biological variations across samples, various overdispersion models are considered.

overdispersion: variance > mean. Note that for Poisson random variables, variance = mean.

commonly used overdispersion models: negative binomial, quasi-Poisson, quasi-Binomial.
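A minimal sketch of fitting Poisson and quasi-Poisson models with glm() for a single gene. The counts and library sizes are simulated (overdispersed by construction), and entering log library size as an offset is an assumption of this sketch, chosen so that the group coefficient refers to relative abundance.

```r
set.seed(5)
## Hypothetical counts for one gene: 8 + 7 replicates with overdispersion
group    <- factor(c(rep(1, 8), rep(2, 7)))
lib.size <- round(runif(15, 1e6, 3e6))
mu       <- lib.size * ifelse(group == 1, 2e-5, 4e-5)
y        <- rnbinom(15, mu = mu, size = 5)     # variance = mu + mu^2 / 5

## log(library size) enters as an offset
fit.pois  <- glm(y ~ group, family = poisson,      offset = log(lib.size))
fit.quasi <- glm(y ~ group, family = quasipoisson, offset = log(lib.size))

summary(fit.quasi)$dispersion           # estimated dispersion (fixed at 1 for Poisson)
coef(summary(fit.pois))["group2", ]     # same estimate under both models,
coef(summary(fit.quasi))["group2", ]    # standard error rescaled by sqrt(dispersion)
```

The two fits give the same coefficient estimates; the quasi-Poisson fit only rescales the standard errors, which is why the Poisson fit looks overconfident when the data are overdispersed.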


Cautions.

The Poisson model is based on the assumption that reads are randomly and independently distributed.

This may not be true due to various reasons such as random hexamer priming, GC content bias.

Ref: Kasper D. Hansen, Steven E. Brenner, Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12):e131, 2010.

Ref: Davide Risso, Katja Schwartz, Gavin Sherlock and Sandrine Dudoit. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics, 12:480, 2011.

Corrections and normalizations may be necessary depending on the goal of the study.

Underdispersion is sometimes observed. The quasi-Poisson model can deal with both overdispersion and underdispersion.

The negative binomial model can only model overdispersion.


Differentially expressed genes

Microarray: (moderated) t-tests based on log intensities.

RNA-seq: likelihood ratio tests or exact tests based on counts.

Permutation tests, rank tests, empirical Bayes methods, etc.

Multiple comparison adjustment: based on p-values.

Control familywise error rate (FWER): Bonferroni, Holm, etc.

Control false discovery rate (FDR): Benjamini & Hochberg (BH), Benjamini & Yekutieli (2001) (BY), etc.

R function p.adjust.

Other variants of FDR: R package locfdr, R package qvalue.
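The adjustments named above can be compared with p.adjust on simulated p-values. The mixture of 900 null and 100 non-null p-values below is an assumption made purely for illustration.

```r
set.seed(6)
## Hypothetical p-values: 900 null genes (uniform) and 100 DE genes (small p-values)
p <- c(runif(900), rbeta(100, 0.1, 10))

methods <- c("bonferroni", "holm", "BH", "BY")
adj <- sapply(methods, function(m) p.adjust(p, method = m))

## Number of genes called at level 0.05 under each adjustment
colSums(adj <= 0.05)
```

For any set of p-values the adjusted values satisfy BH <= BY and BH <= Holm <= Bonferroni, so the FWER methods never call more genes than BH at the same level.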


R packages

Microarray: affy, limma, etc.

RNA-seq: DESeq, edgeR, glm, etc.

Bioconductor package edgeR

Based on negative binomial models: Y ∼ NB(µ, φ), E(Y ) = µ, Var(Y ) = µ(1 + µφ) (µ > 0, φ > 0).

To account for small sample sizes as is typical in RNA-seq studies, edgeR also utilizes empirical Bayes ideas to pool information across genes.

Ref: Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140, 2010.

Ref: M. D Robinson and G. K Smyth. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21):2881-2887, 2007.

Ref: M. D Robinson and G. K Smyth. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2):321-332, 2008.
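The negative binomial mean-variance relation Var(Y) = µ(1 + µφ) can be checked by simulation in base R. The values of `mu` and `phi` are arbitrary; the only mapping relied on is that the `size` argument of rnbinom corresponds to 1/φ when the distribution is specified through `mu`.

```r
set.seed(7)
mu  <- 100
phi <- 0.2

## rnbinom(mu = , size = ) has variance mu + mu^2 / size, so size = 1 / phi
y <- rnbinom(1e5, mu = mu, size = 1 / phi)

c(sample.mean = mean(y),
  sample.var  = var(y),
  model.var   = mu * (1 + mu * phi))
```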


A Case study

An RNA-seq data set with two groups: grp1 – eight replicates, grp2 – seven replicates.

Data exploration.

Data matrix: row – gene, column – sample. dim(counts)

geneID   grp1-sample1   grp1-sample2   grp1-sample3   ...
gene1    0              0              0              ...
gene2    109            71             128            ...
gene3    3              2              10             ...
...      ...            ...            ...            ...

Library size: barplot(colSums(counts))

Filtering:

allzero=(rowSums(counts)==0);counts=counts[!allzero,];

dim(counts)

Clustering of samples: are samples from the same group clustered together?
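One way to check whether samples from the same group cluster together is hierarchical clustering on log counts per million. This is a sketch on simulated data; the group structure, the pseudo-count of 1 and the choice of Euclidean distance with average linkage are all assumptions of the example.

```r
set.seed(8)
## Hypothetical counts: 1000 genes, 8 + 7 samples, 100 genes shifted in grp2
group <- rep(c("grp1", "grp2"), c(8, 7))
base  <- rexp(1000, 1 / 100)
fold  <- c(rep(8, 100), rep(1, 900))
mu    <- cbind(matrix(base, 1000, 8), matrix(base * fold, 1000, 7))
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = 1000)
colnames(counts) <- paste0(group, "-sample", c(1:8, 1:7))

## log2 counts per million, then average-linkage clustering of the samples
logcpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)
hc     <- hclust(dist(t(logcpm)), method = "average")
## plot(hc) draws the dendrogram

cl <- cutree(hc, k = 2)
table(group, cluster = cl)
```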


> library(edgeR)

> group=factor(c(rep(1,8), rep(2,7)))

> d=DGEList(counts=counts, group=group)

> d$samples$lib.size

> plotMDS(d)

[Figure: MDS plot of the 15 samples (grp1-sample 1 to 8, grp2-sample 1 to 7). x-axis: Dimension 1; y-axis: Dimension 2.]


Normalization and MA plots.

> d=calcNormFactors(d,method="TMM")

> samp1="grp1-sample 7"; samp2="grp2-sample 5"

> maPlot(d$counts[,samp1],d$counts[,samp2],normalize=TRUE,
+ lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)

> abline(h=0, lty=2)

> eff.libsize=d$samples$lib.size*d$samples$norm.factors

> names(eff.libsize)=colnames(d$counts)

> maPlot(d$counts[,samp1]/eff.libsize[samp1],
+ d$counts[,samp2]/eff.libsize[samp2],normalize=FALSE,
+ lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)

> abline(h=0, lty=2)
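For reference, the quantities shown in an MA plot can be computed by hand: M is the log ratio of the two samples and A is their average log abundance. The sketch below uses simulated counts and takes the effective library sizes to be the plain column totals, which is an assumption; it does not reproduce the smearing of zero counts or other details of maPlot.

```r
set.seed(9)
## Hypothetical counts for two samples sharing the same relative abundances
lam <- rexp(3000, 1 / 40)
x   <- rpois(3000, lam)
y   <- rpois(3000, 1.5 * lam)      # second sample sequenced 1.5 times deeper
eff <- c(sum(x), sum(y))           # effective library sizes (assumed equal to totals)

ok <- x > 0 & y > 0                # logarithms need positive counts
M  <- log2((x[ok] / eff[1]) / (y[ok] / eff[2]))            # log ratio
A  <- (log2(x[ok] / eff[1]) + log2(y[ok] / eff[2])) / 2    # average log abundance

## plot(A, M, pch = 19, cex = 0.4); abline(h = 0, lty = 2)
median(M)
```

If the normalization is adequate, the cloud of points should be centred on the line M = 0 across the range of A.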


Two-group comparison and gene calling.

Estimate dispersion parameters and plot genewise biological coefficient of variation (square root of dispersion) against gene abundance (in log2 counts per million).

> d=estimateCommonDisp(d, verbose=TRUE)

> d$common.dispersion

> d=estimateTagwiseDisp(d,prior.n=getPriorN(d))

> plotBCV(d)


Exact test and gene calling.

> et=exactTest(d,pair=1:2,dispersion="tagwise",
+ rejection.region="doubletail",big.count=900)

> topTags(et,n=100, adjust.method="BY")

> de=decideTestsDGE(et, adjust.method="BY",
+ p.value=0.05)

> summary(de)

FDR method "BY" takes into account dependency and is more conservative than method "BH".

Draw smear plot of log concentration vs. log fold-change: find both statistically significant and practically significant DE genes.

> plotSmear(et,

+ de.tags=rownames(et$table)[as.logical(de)])
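Combining statistical and practical significance amounts to requiring both a small adjusted p-value and a large enough fold change. The sketch below works on a hypothetical results table with columns named `logFC` and `PValue`; the simulated values and the cutoff of 1 on the absolute log2 fold change are assumptions for illustration.

```r
set.seed(10)
## Hypothetical results table: one row per gene
tab <- data.frame(logFC  = rnorm(2000, sd = 1.2),
                  PValue = c(rbeta(200, 0.1, 10), runif(1800)))

fdr <- p.adjust(tab$PValue, method = "BY")

stat.sig <- fdr <= 0.05              # statistically significant
prac.sig <- abs(tab$logFC) >= 1      # at least a two-fold change (illustrative cutoff)
called   <- stat.sig & prac.sig

table(statistically = stat.sig, practically = prac.sig)
```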


Look at pvalue distribution

Histogram:

> hist(et$table$PValue, breaks=50,xlab="pvalue")

Observe an unusually high bar for p-values close to one.

Examine log-pvalue vs. log-concentration/log-cpm: this bar is primarily from genes with a small number of counts.

Use a threshold (e.g., 10) on the total number of counts across samples to filter out low-count genes.

A similar phenomenon occurs when analyzing exon sequence data in GWAS studies.
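Why low-count genes pile up near a p-value of one can be seen from the discreteness of any exact test on small counts. The sketch below uses a simple conditional binomial test for two samples with equal library sizes as a stand-in; it is not the negative binomial exact test of edgeR, and the helper name `pval` is hypothetical.

```r
## Given the total y1 + y2, y1 ~ Binomial(total, 0.5) when the gene is not DE
## (two samples, equal library sizes; a stand-in for illustration only)
pval <- function(y1, y2) binom.test(y1, y1 + y2, p = 0.5)$p.value

pval(1, 0)     # total count 1: the only attainable p-value is 1
pval(1, 1)     # total count 2, evenly split: again 1
pval(3, 1)     # total count 4: still far from significance
pval(30, 10)   # the same 3:1 ratio with ten times more reads
```

With very few reads the test has only a handful of attainable p-values, most of them equal or close to one, which is consistent with filtering on total count before testing.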


[Figure: histogram of pvalues. x-axis: pvalue; y-axis: Frequency.]


[Figure: four histograms of p-values (x-axis: pvalue; y-axis: Frequency). Panel titles: "histogram of pvalues"; "genes with at least 10 total counts: 84% genes pass"; "genes with at least 20 total counts: 79% genes pass"; "genes with at least 40 total counts: 73% genes pass".]


Summary

Explore data by graphs and numerical summaries.

Examine normalization by MA plots.

Filter out genes with small counts.

Look at both p-values and fold change for significant genes.
