Analysis of two channel array data
March 23 2004
Summary
• Basics
• Differential Gene Expression
• Single Slide analysis
• Replicates
• SAM analysis
• Analysis resources
Data files
Data files (.gpr files) typically contain many columns with information about spot size, spot shape, intensity measurements, background correction, variances, etc. Select the columns you need, e.g.
• Channel 1 intensity (median)
• Channel 1 background
• Channel 2 intensity (median)
• Channel 2 background
• Flag / spot failed info
Example GPR file
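A minimal sketch of pulling the columns listed above out of a GPR file, using only the Python standard library. It assumes the usual tab-delimited ATF layout (a version line, a line giving the number of header records, the header records, then a column-title row). The column names `F635 Median`, `B635 Median`, `F532 Median`, `B532 Median` and `Flags` are assumptions that depend on the scanner wavelengths and software version, and `read_gpr` is a hypothetical helper name.

```python
import csv
import io

# Assumed column names; check them against the title row of your own file.
COLUMNS = ["F635 Median", "B635 Median", "F532 Median", "B532 Median", "Flags"]

def read_gpr(handle, columns=COLUMNS):
    """Return one dict per spot, keeping only the requested numeric columns."""
    lines = handle.read().splitlines()
    # Line 1 is the ATF version; the first field of line 2 is the number
    # of header records that precede the column-title row.
    n_header = int(lines[1].split()[0])
    table = lines[2 + n_header:]
    reader = csv.DictReader(table, delimiter="\t")
    return [{name: float(row[name]) for name in columns} for row in reader]

# Tiny made-up file, for illustration only.
demo = "\n".join([
    "ATF\t1.0",
    "2\t6",
    '"Type=GenePix Results 3"',
    '"Scanner=hypothetical"',
    "\t".join(['"Name"', '"F635 Median"', '"B635 Median"',
               '"F532 Median"', '"B532 Median"', '"Flags"']),
    "\t".join(['"geneA"', "1124", "100", "356", "100", "0"]),
    "\t".join(['"geneB"', "90", "110", "300", "75", "-50"]),
])
spots = read_gpr(io.StringIO(demo))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.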
Log ratio calculation
Background correction:
$c_1 \leftarrow c_1 - b_1$, or 0 if this value is negative,
$c_2 \leftarrow c_2 - b_2$, or 0 if this value is negative.
Log ratio calculation:
$M = \log(c_1) - \log(c_2)$
Intensity calculation:
$A = \frac{\log(c_1) + \log(c_2)}{2}$
Average (log) intensity of the red and green channel.
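The background correction, log ratio M and average log intensity A can be sketched for a single spot as below. Base 2 is used for the logs; the function name and the choice to return `None` for spots where a corrected channel is zero (where the log is undefined) are assumptions made for this illustration.

```python
import math

def log_ratio_and_intensity(f1, b1, f2, b2):
    """Return (M, A) for one spot, or None if a corrected channel is zero.

    f1, f2 are the channel intensities and b1, b2 their backgrounds.
    """
    # Background correction, clipped at zero.
    c1 = max(f1 - b1, 0.0)
    c2 = max(f2 - b2, 0.0)
    if c1 == 0.0 or c2 == 0.0:
        return None  # log of zero is undefined, so the spot is skipped
    m = math.log2(c1) - math.log2(c2)        # log ratio
    a = (math.log2(c1) + math.log2(c2)) / 2  # average log intensity
    return m, a

# Corrected channels are 1024 and 256, so M = 10 - 8 and A = (10 + 8) / 2.
print(log_ratio_and_intensity(1124, 100, 356, 100))  # (2.0, 9.0)
```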
Calculate log ratio and log intensity
• Log ratio and log intensity (in this example using log2 for both measurements)
• Data from a single slide – red median, green median, backgrounds and flags.
Plot of raw channel values
Examining MA plot
Normalisation strategies
There are a number of normalisation strategies available, e.g.
• Using housekeeping genes – assumption is that these genes are expressed at a constant level across all conditions.
• Spiked controls.
• Loess normalisation – assumption is that, within any given intensity window, up-regulated and down-regulated genes balance out, so the average log ratio is zero. Loess correction should be carried out on the log ratio (M) as a function of log intensity (A). Further loess corrections can be carried out, e.g. based on spatial location, see Y.Fang’s methods.
Example of intensity dependent normalisation
Normalisation pulls the distribution to be centered on the x axis
Loess normalisation of log ratio based on log intensity
This kind of normalisation can be carried out in microarray analysis software packages, or in MATLAB or R.
The main parameter is the size of the window, i.e. the percentage of data points taken into account in each local fit.
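To make the window parameter concrete, here is a rough sketch of intensity-dependent normalisation in plain Python. It fits a local straight line with tricube weights around each spot and subtracts the fitted trend from M. It is a simplified stand-in for the loess routines in R or MATLAB (no robustness iterations, and quadratic running time), and the names `loess_fit`, `loess_normalise` and the default `span=0.4` are assumptions for illustration.

```python
def loess_fit(x, y, span=0.4):
    """Fitted values from a local linear fit of y on x with tricube weights."""
    n = len(x)
    k = max(2, int(round(span * n)))  # number of points in each window
    fitted = []
    for x0 in x:
        # Window half-width: distance to the k-th nearest point.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalise(m, a, span=0.4):
    """Subtract the intensity-dependent trend of M on A from each M value."""
    trend = loess_fit(a, m, span)
    return [mi - ti for mi, ti in zip(m, trend)]

# Made-up data with a purely intensity-dependent bias in M.
a_vals = [6 + 0.5 * i for i in range(20)]
m_vals = [0.3 * ai - 2.0 for ai in a_vals]
m_norm = loess_normalise(m_vals, a_vals)  # values close to zero
```

A larger `span` gives a smoother trend; a smaller one follows local structure more closely.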
Detecting differential expression
• Looking for genes with significant fold change
• What is significant fold change?
• What about low intensity genes – high “fold change” might just be noise?
• How do we assess confidence levels?
Detecting differential expression for a single slide
Option: Choose cutoffs to
• Reject all genes with low intensity
• Reject all genes with low fold change
How to choose fold change / intensity cutoffs
Options
• Choose an arbitrary cutoff – a favourite seems to be 2 fold. Advantage: easy. Assumption made is that everything worked perfectly and that 2 fold is significant.
• Use the housekeeping genes to estimate the mean and standard deviation of the log ratios, then choose e.g. a 3 standard deviation cutoff (DeRisi 1997). Assumption made is that the within-slide variance of the housekeeping genes is indicative of the between-slide technical variance. With a single slide, we cannot measure the between-slide technical variance directly.
• Use the above, combined with a cutoff for the intensity.
• Other methods, e.g. Newton et al’s Gamma-Gamma-Bernoulli method.
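The housekeeping-gene option combined with an intensity cutoff might look like the sketch below. The 3 standard deviation threshold comes from the text; the intensity cutoff `min_a=8.0`, the function name and the example numbers are assumptions chosen only for illustration.

```python
import statistics

def select_genes(m, a, housekeeping_idx, n_sd=3.0, min_a=8.0):
    """Indices of genes passing both the intensity and fold change cutoffs.

    m, a: log ratio and average log intensity per gene.
    housekeeping_idx: positions of the housekeeping genes in m.
    """
    hk = [m[i] for i in housekeeping_idx]
    centre = statistics.mean(hk)
    sd = statistics.stdev(hk)  # within-slide spread of housekeeping genes
    return [
        i for i, (mi, ai) in enumerate(zip(m, a))
        if ai >= min_a and abs(mi - centre) > n_sd * sd
    ]

# Made-up values: genes 0 to 2 play the role of housekeeping genes.
m = [0.1, -0.1, 0.0, 2.5, 3.0, -0.05]
a = [10, 11, 9, 12, 6, 10]
print(select_genes(m, a, housekeeping_idx=[0, 1, 2]))  # [3]
```

Gene 4 has the largest log ratio but is rejected because its intensity is below the cutoff.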
Differential Expression Replicates
There are two main types of replicate which can be carried out:
• Biological replicates
• Technical replicates
Using an experimental design with both types of replication enables us to estimate both the biological variance and the technical noise / error.
Technical replicates may involve repeat hybridisations (hybs) using the same labelled sources, or repeats of labelling and hybridisation using the same raw sources.
Analysis when replicates have been used
One pair of conditions to compare, several slides.
Can use the replicates as part of the quality assessment process. Enables us to identify which slides should be discarded.
Can estimate the between-slide variance, either on a gene by gene basis or as some overall estimate, in order to assess the significance of results.
Performing a T test on data
The classical method is to compute a test statistic for each gene and then apply a cutoff to it, e.g. Student’s t-test or a modified t-test.
See this in the following example.
Example – rat / anti-inflammatory experiment
Picked this example from GEO – picked one where there were technical replicates of each comparison.
This example is a reference design – each condition is compared against a pooled reference. Each condition has three hybs.
The reference for the hybs is pooled RNA from all conditions.
Example data
Conditions:
• Uninjured
• Injured
• Various anti-inflammatory drugs
Three slides per condition.
Each hyb is against a pooled reference
T Test
Carry out a t-test on condition A vs condition B. Note that, because this is a reference design, we are interested in the difference between the log ratio values of the two conditions.
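For one gene in a reference design, the test compares the log ratios (each condition vs the pooled reference) from the slides of condition A with those from condition B. The sketch below computes Student's two-sample t statistic with pooled variance; the function name and the example values are assumptions, and turning the statistic into a p value needs the t distribution, which is left out here.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom.

    group_a, group_b: log ratios vs the reference for one gene, one per slide.
    Assumes the pooled variance is not zero.
    """
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a)
    vb = statistics.variance(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / se, df

# Made-up log ratios for one gene, three slides per condition.
t, df = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 3), df)  # -3.674 4
```

With only three slides per condition the variance estimate is based on very few values, which is the weakness discussed on the following slide.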
Fold change and low p value criteria
Discussion
The t-test is fine, but it looks at each gene individually.
The number of repeats carried out is often fairly small, which calls into question the use of the t-test or any other statistic that looks only at each gene individually. Other methods seek to incorporate information about all genes, and so benefit from the large amount of data available. Some of these assume that all genes, when normalised by means, have the same variance.
Using SAM
SAM (Significance Analysis of Microarrays) is used to assess false discovery rates (FDR).
Having chosen a cutoff for the statistic, SAM can be used to estimate what the false discovery rate will be when that cutoff is used.
SAM method
• Start off with the data from the two conditions, labelled 1 and 2. [1 1 1 1 1 2 2 2 2 2]
• Calculate the statistic for each gene, as usual (e.g. this could be the t statistic from Student’s t-test).
• Choose the cutoff t value TC.
• We want to know, if we have “random data”, how often we would be accepting (presumably false) conclusions drawn from that data. To make the random data, permute the labels and calculate the statistics again. [1 2 2 2 1 2 2 1 1 1]
• How many of the values from the permuted data are greater than TC? This gives an estimate of the number of false positives.
• This estimate is biased upward, and so a fudge correction is made.
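The permutation idea in the steps above can be sketched as follows. This is a simplified illustration, not the SAM implementation: it uses a plain difference of means as a placeholder statistic, enumerates every relabelling rather than sampling, and omits the final correction step. All function names and data values are assumptions.

```python
import itertools
import statistics

def mean_diff(g1, g2):
    """Placeholder statistic: difference of group means."""
    return statistics.mean(g1) - statistics.mean(g2)

def permutation_fdr(data, labels, cutoff, stat=mean_diff):
    """Estimate the FDR at a cutoff by relabelling the slides.

    data: one list of values per gene, one value per slide.
    labels: 1 or 2 for each slide.
    Returns (genes called, mean false calls per relabelling, estimated FDR).
    """
    def abs_stats(lab):
        out = []
        for row in data:
            g1 = [v for v, l in zip(row, lab) if l == 1]
            g2 = [v for v, l in zip(row, lab) if l == 2]
            out.append(abs(stat(g1, g2)))
        return out

    called = sum(s > cutoff for s in abs_stats(labels))
    n_slides, n1 = len(labels), labels.count(1)
    false_counts = []
    for ones in itertools.combinations(range(n_slides), n1):
        perm = [1 if i in ones else 2 for i in range(n_slides)]
        if perm == list(labels):
            continue  # skip the real labelling
        false_counts.append(sum(s > cutoff for s in abs_stats(perm)))
    expected_false = sum(false_counts) / len(false_counts)
    fdr = expected_false / called if called else 0.0
    return called, expected_false, fdr

# Made-up data: gene 0 differs between the conditions, gene 1 does not.
data = [[0.0, 0.1, -0.1, 3.0, 3.1, 2.9],
        [0.0, 0.2, -0.2, 0.1, -0.1, 0.0]]
called, expected_false, fdr = permutation_fdr(data, [1, 1, 1, 2, 2, 2], cutoff=2.0)
```

With many slides the number of relabellings grows quickly, so a random sample of permutations would normally replace the full enumeration.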
Obtaining SAM
SAM can be obtained from the SAM website at
http://www-stat.stanford.edu/~tibs/SAM/index.html
Experiments with several conditions
• Apply ANOVA, e.g. using software available from the Churchill lab:
http://www.jax.org/staff/churchill/labsite/software/anova/index.html
Methods
• Cluster analysis: gene hunt – what genes behave like my favourite gene?
• Search for a predictor set which predicts behaviour: classification of source types
• Looking at differential expression: gene hunt – what genes are involved in this process?
PCA
Principal component analysis
Used as a quality check – performing PCA on the measurements to make sure conditions group as expected.
Used as dimensionality reduction – find a subset of genes which can be used to diagnose subtype.
Cluster analysis
• Hierarchical clustering
• Non-hierarchical clustering
• Top down methods
• Bottom up methods
Match the question to the cluster technique.
Remove uninteresting genes before clustering.
Apply a number of clustering techniques – see which things come up in most cases.
Resources – texts on microarray analysis
• Steen Knudsen, “A Biologist’s Guide to Analysis of DNA Microarray Data”: explanation of basics and stats
• Terry Speed, “Statistical Analysis of Gene Expression Microarray Data”: survey of stats methods available
• Giovanni Parmigiani et al, “The Analysis of Gene Expression Data”: survey of software tools available
Resources – discussions of experimental design issues
• Terry Speed’s microarray pages: http://stat-www.berkeley.edu/users/terry/Group/index.html
• TIGR microarray pages: http://www.tigr.org/tdb/microarray/
• Gary Churchill’s microarray pages: http://www.jax.org/staff/churchill/labsite/
Resources: Papers on methods for determining differential expression
• “On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data”, Newton et al, J. Comp Biol
• “Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates”, Cui et al
Resources Data repositories
• GEO http://www.ncbi.nlm.nih.gov/geo
•