Analysis of two channel array data

(1)

March 23 2004

(2)

Summary

• Basics

• Differential Gene Expression

• Single Slide analysis

• Replicates

• SAM analysis

• Analysis resources

(3)

Data files

Data files (.gpr files) typically contain many columns with information about spot size, spot shape, intensity measurements, background correction, variances, etc. Select the columns you need, e.g.

• Channel 1 intensity (median)

• Channel 1 background

• Channel 2 intensity (median)

• Channel 2 background

• Flag / spot failed info
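As an illustration of selecting only the needed columns, here is a minimal Python sketch. It assumes the file follows the tab-delimited ATF layout used by GenePix (a signature line, a line giving the number of header records, the header records, then a column header row) and that the columns are named "F635 Median", "B635 Median", "F532 Median", "B532 Median" and "Flags". These names and the sample content are assumptions for illustration; check the header of your own files.

```python
import csv
import io

# Column names assumed for a 635 nm / 532 nm scan; adjust to your file.
WANTED = ["F635 Median", "B635 Median", "F532 Median", "B532 Median", "Flags"]

def read_gpr(handle, wanted=WANTED):
    """Return one dict per spot, keeping only the wanted columns."""
    handle.readline()                              # line 1: ATF signature
    n_header = int(handle.readline().split()[0])   # line 2: header record count
    for _ in range(n_header):                      # skip optional header records
        handle.readline()
    reader = csv.DictReader(handle, delimiter="\t")
    return [{name: row[name] for name in wanted} for row in reader]

# Tiny made-up file, for illustration only.
SAMPLE = (
    "ATF\t1.0\n"
    "2\t7\n"
    '"Type=GenePix Results 3"\n'
    '"Wavelengths=635\t532"\n'
    '"Block"\t"Name"\t"F635 Median"\t"B635 Median"\t"F532 Median"\t"B532 Median"\t"Flags"\n'
    '1\t"geneA"\t1200\t100\t650\t90\t0\n'
    '1\t"geneB"\t95\t100\t400\t85\t-50\n'
)

spots = read_gpr(io.StringIO(SAMPLE))
for spot in spots:
    print(spot)
```

The values are returned as strings; convert them to numbers before any calculation. For a real file, pass an open file handle instead of the in-memory sample.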

(4)

Example GPR file

(5)

Log ratio calculation

Background correction:

$c_1 = f_1 - b_1$, or 0 if this value is negative,

$c_2 = f_2 - b_2$, or 0 if this value is negative,

where $f_i$ is the measured intensity and $b_i$ the background of channel $i$.

Log ratio calculation:

$M = \log(c_1) - \log(c_2)$

Intensity calculation:

$A = \dfrac{\log(c_1) + \log(c_2)}{2}$

Average (log) intensity of the red and green channel.
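The calculation can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed implementation: the function name is hypothetical, log base 2 is used, and spots where either background-corrected channel is 0 are returned as `None`, since the logarithm of 0 is undefined (how to treat such spots is a choice the analyst has to make).

```python
import math

def log_ratio_and_intensity(f1, b1, f2, b2):
    """Return (M, A) on the log2 scale, or None if either channel is 0."""
    c1 = max(f1 - b1, 0)   # background-corrected channel 1
    c2 = max(f2 - b2, 0)   # background-corrected channel 2
    if c1 == 0 or c2 == 0:
        return None        # log(0) is undefined; spot is unusable (assumption)
    m = math.log2(c1) - math.log2(c2)
    a = (math.log2(c1) + math.log2(c2)) / 2
    return m, a

print(log_ratio_and_intensity(1124, 100, 356, 100))  # (2.0, 9.0)
print(log_ratio_and_intensity(95, 100, 400, 85))     # None
```

In the first call the corrected values are 1024 and 256, so M = 10 - 8 = 2 (a 4-fold difference) and A = 9.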

(6)

Calculate log ratio and log intensity

• Log ratio and log intensity (in this example using log2 for both measurements)

• Data from a single slide – red median, green median, backgrounds and flags.

(7)

Plot of raw channel values

(8)

Examining MA plot

(9)

Normalisation strategies

There are a number of normalisation strategies available, e.g.

• Using housekeeping genes – assumption is that these genes are ubiquitously expressed across all conditions.

• Spiked controls.

• Loess normalisation – assumption is that, within a given intensity window, up-regulated and down-regulated genes balance out, so that the average log ratio is zero. Loess correction should be carried out on log ratio scores relative to log intensity values. Further loess corrections can be carried out, e.g. based on spatial location; see Y. Fang’s methods.

(10)

Example of intensity dependent normalisation

Normalisation pulls the distribution of log ratios so that it is centred on the x axis (log ratio of zero) across the whole intensity range.

(11)

Loess normalisation of log ratio based on log intensity

This kind of normalisation can be carried out in microarray analysis software packages, or in MATLAB or R.

The main parameter is the size of the window, i.e. the percentage of data points taken into account in each local fit.
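To make the idea concrete, here is a simplified pure-Python sketch of intensity-dependent normalisation: a local straight-line fit of log ratio (M) against log intensity (A) with tricube weights, subtracted from M. It is not the loess implementation of any particular package (there are no robustness iterations), the function names and the default `span` are assumptions, and its running time grows quadratically with the number of spots, so it is only meant to show the principle.

```python
def loess_fit(a_values, m_values, span=0.4):
    """Local linear fit of M at each A, using tricube weights."""
    n = len(a_values)
    k = max(2, int(round(span * n)))          # number of points per window
    fitted = []
    for a0 in a_values:
        dists = [abs(a - a0) for a in a_values]
        h = sorted(dists)[k - 1] or 1e-12     # window half-width
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def loess_normalise(a_values, m_values, span=0.4):
    """Subtract the intensity-dependent trend from the log ratios."""
    fit = loess_fit(a_values, m_values, span)
    return [m - f for m, f in zip(m_values, fit)]

# Made-up data with a purely intensity-dependent bias: M = 0.5 * A - 3.
a_demo = [6 + 0.1 * i for i in range(50)]
m_demo = [0.5 * a - 3 for a in a_demo]
m_norm = loess_normalise(a_demo, m_demo)
print(max(abs(v) for v in m_norm))  # close to zero: the trend is removed
```

A larger `span` gives a smoother curve that follows the data less closely; a smaller one follows local structure but is more sensitive to noise.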

(12)

Detecting differential expression

• Looking for genes with significant fold change

• What is significant fold change?

• What about low intensity genes – high “fold change” might just be noise?

• How do we assess confidence levels?

(13)

Detecting differential expression for a single slide

Option: Choose cutoffs to

• Reject all genes with low intensity

• Reject all genes with low fold change

(14)

How to choose fold change / intensity cutoffs

Options

• Choose an arbitrary cutoff – a favourite seems to be 2-fold. Advantage: easy. Assumption made is that everything worked perfectly and that 2-fold is significant.

• Use the housekeeping genes to estimate the mean and standard deviation of the log ratios, then choose e.g. a 3 standard deviation cutoff (DeRisi 1997). Assumption made is that the within-slide variance of the housekeeping genes is indicative of the between-slide technical variance. With a single slide, we can’t get at the “between-slide” technical variance directly.

• Use the above, combined with a cutoff for the intensity.

• Other methods, e.g. Newton et al’s Gamma-Gamma-Bernoulli method.
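The housekeeping-gene option combined with an intensity cutoff could look like the following Python sketch. The function name, the minimum log intensity of 8 and the example values are all assumptions chosen for illustration, not recommended settings.

```python
import statistics

def call_differential(spots, housekeeping, n_sd=3.0, min_a=8.0):
    """spots maps gene name -> (M, A). Return names passing both cutoffs."""
    hk_m = [spots[name][0] for name in housekeeping]
    centre = statistics.mean(hk_m)   # mean log ratio of housekeeping genes
    sd = statistics.stdev(hk_m)      # their standard deviation
    return sorted(
        name for name, (m, a) in spots.items()
        if a >= min_a and abs(m - centre) > n_sd * sd
    )

# Made-up log ratios (M) and log intensities (A).
spots = {
    "hk1": (-0.1, 10.0), "hk2": (0.1, 10.5),
    "hk3": (0.0, 9.8), "hk4": (0.2, 10.2),
    "geneX": (1.5, 10.0),   # large change, good intensity
    "geneY": (2.0, 5.0),    # large change, but low intensity
    "geneZ": (0.2, 11.0),   # good intensity, small change
}
print(call_differential(spots, ["hk1", "hk2", "hk3", "hk4"]))  # ['geneX']
```

Here the housekeeping genes give a mean log ratio of 0.05 and a standard deviation of about 0.13, so only geneX lies more than 3 standard deviations away while also passing the intensity cutoff.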

(15)

Differential Expression: Replicates

There are two main types of replicate which can be carried out:

• Biological replicates

• Technical replicates

Using an experimental design which includes both types of replication enables us to estimate both the biological variance and the noise / error.

Technical replicates may involve repeat hybridisations (hybs) using the same labelled sources, or repeats of labelling and hybs using the same raw sources.

(16)

Analysis when replicates have been used

One pair of conditions to compare, several slides.

Can use the replicates as part of the quality assessment process. This enables us to identify which slides should be discarded.

Can estimate the between-slide variance, either on a gene-by-gene basis or as some overall estimate, in order to assess the significance of results.

(17)

Performing a T test on data

The classical method is to calculate some statistic for each gene and then apply a cutoff based on it.

E.g. Student’s t-test, or a modified t-test.

See this in the following example.

(18)

Example – rat / anti-inflammatory experiment

Picked this example from GEO – picked one where there were technical replicates of each comparison.

This example is a reference design – each condition is compared against a pooled reference. Each condition has three hybs.

The reference for the hybs is pooled RNA from all conditions.

(19)

Example data

(20)

Example Data

Conditions:

• Uninjured

• Injured

• Various anti-inflammatory drugs

Three slides per condition.

Each hyb is against a pooled reference.

(21)

T Test

Carry out a t-test on A vs B. Note that, because it’s a reference design, we are interested in the difference between the log ratio values of the two conditions, each measured against the pooled reference.
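For a single gene, the comparison can be sketched as below. This computes the classical two-sample Student's t statistic with pooled variance from the log ratios of condition A and condition B; the function name and the numbers are illustrative assumptions, and converting the statistic to a p value (from the t distribution with the returned degrees of freedom) is left to a statistics package.

```python
import math
import statistics

def t_statistic(x, y):
    """Two-sample Student's t (pooled variance) and degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled * (1 / nx + 1 / ny))   # standard error of the difference
    return (statistics.mean(x) - statistics.mean(y)) / se, nx + ny - 2

# Made-up log ratios (condition vs pooled reference), three hybs each.
log_ratio_a = [1.0, 1.2, 1.1]
log_ratio_b = [0.1, 0.0, 0.2]
t, df = t_statistic(log_ratio_a, log_ratio_b)
print(round(t, 2), df)
```

Because both conditions share the same reference, the reference cancels in the difference of mean log ratios. With only three hybs per condition the variance estimate is very uncertain, and the function fails with a division by zero if all values within both groups are identical.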

(22)

Fold change and low p value criteria

(23)

Discussion

The t-test is fine, but it looks at each gene individually.

The number of repeats carried out is often fairly small, which brings into question the use of the t-test or any other statistic which looks only at each gene individually. Other methods seek to incorporate information about all genes, and so benefit from the large amount of data available. Some of these make the assumption that all genes, when normalised by their means, have the same variance.

(24)

Using SAM

SAM (Significance Analysis of Microarrays) is used to assess false discovery rates (FDR).

Having chosen a cutoff for the statistic, SAM can be used to estimate what the false discovery rate will be when that cutoff is used.

(25)

SAM method

• Start off with the data from the two conditions, labelled 1 and 2. [1 1 1 1 1 2 2 2 2 2]

• Calculate the statistic, as usual (e.g. this could be the t statistic from Student’s t-test).

• Choose the cutoff t value TC.

• We want to know, if we have “random data”, how often we would be accepting (presumably false) conclusions drawn from that data. To make the random data, permute the labels and try it again. [1 2 2 2 1 2 2 1 1 1]

• How many of the values from the permuted data are greater than TC?

• This estimate is biased upward, and so a fudge correction is made.
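The permutation idea in the steps above can be sketched in Python as follows. This is a simplified illustration and not the SAM software: it uses a plain pooled-variance t statistic with a two-sided cutoff, random label permutations, and omits the final correction for upward bias. All names, the simulated data and the default number of permutations are assumptions.

```python
import math
import random

def t_stat(values, labels):
    """Pooled-variance t statistic between label 1 and label 2 slides."""
    x = [v for v, lab in zip(values, labels) if lab == 1]
    y = [v for v, lab in zip(values, labels) if lab == 2]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled = ss / (len(x) + len(y) - 2)
    return (mx - my) / math.sqrt(pooled * (1 / len(x) + 1 / len(y)))

def estimate_fdr(data, labels, tc, n_perm=100, seed=0):
    """data maps gene -> one value per slide. Returns (called, false, fdr)."""
    called = sum(abs(t_stat(v, labels)) > tc for v in data.values())
    rng = random.Random(seed)
    false_calls = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)            # "random data": permuted labels
        false_calls += sum(abs(t_stat(v, perm)) > tc for v in data.values())
    expected_false = false_calls / n_perm   # average calls per permutation
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, expected_false, fdr

# Simulated data: 100 unchanged genes and 20 genes shifted in condition 1.
rng = random.Random(1)
labels = [1] * 5 + [2] * 5
data = {f"null{i}": [rng.gauss(0, 1) for _ in labels] for i in range(100)}
for i in range(20):
    data[f"diff{i}"] = [rng.gauss(6 if lab == 1 else 0, 1) for lab in labels]

called, expected_false, fdr = estimate_fdr(data, labels, tc=4.0)
print(called, expected_false, fdr)
```

The estimated false discovery rate is the average number of genes passing the cutoff under permuted labels divided by the number passing it with the true labels.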

(26)

Obtaining SAM

SAM can be obtained from the SAM website at

http://www-stat.stanford.edu/~tibs/SAM/index.html

(27)

Experiments with several conditions

• Apply ANOVA analysis, e.g. use software available from the Churchill lab:

http://www.jax.org/staff/churchill/labsite/software/anova/index.html

(28)

Methods

• Cluster analysis: gene hunt – what genes behave like my favourite gene?

• Search for a predictor set which predicts behaviour: classification of source types

• Looking at differential expression: gene hunt – what genes are involved in this process?

(29)

PCA

Principal component analysis

Used as a quality check – performing PCA on the measurements to make sure conditions group as expected.

Used as dimensionality reduction – find a subset of genes which can be used to diagnose subtype.

(30)

Cluster analysis

• Hierarchical clustering

• Non-hierarchical clustering

• Top down methods

• Bottom up methods

Match the question to the cluster technique.

Remove uninteresting genes before clustering.

Apply a number of clustering techniques – see which things come up in most cases.

(31)

Resources – texts on microarray analysis

• Steen Knudsen “A Biologist’s Guide to Analysis of DNA Microarray Data”: explanation of basics and stats

• Terry Speed “Statistical Analysis of Gene Expression Microarray Data”: survey of stats methods available

• Giovanni Parmigiani et al “The Analysis of Gene Expression Data”: survey of software tools available

(32)

Resources – discussions of experimental design issues

• Terry Speed’s microarray pages:
http://stat-www.berkeley.edu/users/terry/Group/index.html

• TIGR microarray pages:
http://www.tigr.org/tdb/microarray/

• Gary Churchill’s microarray pages:
http://www.jax.org/staff/churchill/labsite/

(33)

Resources: Papers on methods for determining differential expression
