Statistical issues in the analysis of microarray data
Daniel Gerhard
Institute of Biostatistics, Leibniz University of Hannover
ESNATS Summerschool, Zermatt
D. Gerhard (LUH) Analysis of microarray data 23. Sep 09 1 / 30
Table of Contents
1 Outline
2 Experimental design
3 Statistical modelling
4 Hypothesis testing
5 Gene set enrichment analysis
6 Classification
Outline
Focus is set on
Single-channel microarrays
  One sample per array
  Gene expression values for thousands of oligonucleotides
Identifying genes that are differentially expressed due to a treatment
Finding significantly differentially expressed genes at a given error probability
(Predicting a treatment level from the gene expression data)
Controlled experiments
Independent replications
Multiple sources of variability are present:
  Sample, array, and environmental variability, ...
Account for this variability in the experimental design through replication
  of arrays, samples, multiple time points, ...
Randomisation
Needed to separate treatment effects from other factors that might influence gene expression
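The randomisation step can be illustrated with a minimal sketch. It assumes a balanced design in which each sample is allocated to one treatment purely by chance; the function name `randomise`, the sample IDs, and the seed are hypothetical and not part of the original slides.

```python
import random


def randomise(sample_ids, treatments, seed=None):
    """Randomly allocate samples to treatments in a balanced way.

    Assumes the number of samples is a multiple of the number of treatments.
    """
    if len(sample_ids) % len(treatments) != 0:
        raise ValueError("number of samples must be a multiple of the number of treatments")
    rng = random.Random(seed)
    # Build a balanced label vector, then shuffle it so that the
    # allocation does not depend on the order of the samples.
    labels = list(treatments) * (len(sample_ids) // len(treatments))
    rng.shuffle(labels)
    return dict(zip(sample_ids, labels))


# Hypothetical sample IDs, for illustration only.
samples = [f"sample_{i:02d}" for i in range(1, 13)]
allocation = randomise(samples, ["Control", "Treatment"], seed=42)
for sample, treatment in allocation.items():
    print(sample, treatment)
```

The same idea could be applied to the order in which arrays are hybridised or scanned, so that processing order is not confounded with treatment.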
Experimental design
Planning an experiment
Multiple arrays per sample?
  Enables estimating array variability.
  Requires a large amount of RNA.
More complex designs need a larger number of arrays and samples
Measure covariates that are not of direct interest but might influence gene expression
Simple classic design
2 treatments (control/treatment), multiple arrays/samples per treatment
Data structure
          Treatment A                   Treatment B                   ...
          Array 1   Array 2   Array 3   Array 4   Array 5   Array 6   ...
Gene 1    y_11      y_12      y_13      y_14      y_15      y_16      ...
Gene 2    y_21      y_22      y_23      y_24      y_25      y_26      ...
Gene 3    y_31      y_32      y_33      y_34      y_35      y_36      ...
...       ...       ...       ...       ...       ...       ...       ...
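One way to hold this layout in code is a genes-by-arrays table together with a vector of treatment labels, one label per array. The sketch below is only an illustration: the numbers are made-up placeholders, and the helper `split_by_treatment` is a hypothetical name.

```python
# One treatment label per array (column); three arrays per treatment assumed.
treatments = ["A", "A", "A", "B", "B", "B"]

# Rows are genes, columns are arrays: expression[gene][j] corresponds to y_ij.
# The values are made-up placeholders, not real measurements.
expression = {
    "Gene 1": [7.9, 8.1, 8.0, 10.2, 9.8, 10.1],
    "Gene 2": [6.5, 6.4, 6.7, 6.6, 6.3, 6.5],
    "Gene 3": [9.1, 9.0, 8.8, 7.2, 7.0, 6.9],
}


def split_by_treatment(values, labels):
    """Group one gene's expression values by the treatment of each array."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


groups = split_by_treatment(expression["Gene 1"], treatments)
print(groups)
```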
Data example
Generating artificial data
  2 treatments (A, B)
  20 arrays per treatment
  5000 genes per array
Normally distributed residuals and array effects
  within-array sd = 1; between-array sd = 0.5
100 genes show an effect (δ = ±2)
2^x transformation
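A data set of this kind could be generated roughly as follows. This is a sketch under stated assumptions, not the code used for the slides: it assumes the additive model y_ij = baseline + δ_i · [array j is in treatment B] + a_j + e_ij on the log2 scale, with array effects a_j ~ N(0, 0.5²) and residuals e_ij ~ N(0, 1). The baseline of 8, the alternating sign of δ, the placement of the 100 affected genes in the first rows, and the seed are all assumptions made for illustration.

```python
import random


def simulate_microarray(n_genes=5000, n_per_group=20, n_effect=100,
                        delta=2.0, sd_within=1.0, sd_between=0.5,
                        baseline=8.0, seed=1):
    """Simulate a genes x arrays matrix of raw-scale (2^x) intensities."""
    rng = random.Random(seed)
    treatments = ["A"] * n_per_group + ["B"] * n_per_group

    # One random effect per array (between-array variability).
    array_effects = [rng.gauss(0.0, sd_between) for _ in treatments]

    # Assumption: the first n_effect genes carry an effect with alternating
    # sign (+delta, -delta, ...); all remaining genes have no effect.
    effects = [delta if g % 2 == 0 else -delta for g in range(n_effect)]
    effects += [0.0] * (n_genes - n_effect)

    data = []
    for g in range(n_genes):
        row = []
        for j, trt in enumerate(treatments):
            # Log2-scale value: baseline + array effect + residual.
            log_value = baseline + array_effects[j] + rng.gauss(0.0, sd_within)
            if trt == "B":
                log_value += effects[g]  # treatment effect only in group B
            row.append(2.0 ** log_value)  # 2^x transformation
        data.append(row)
    return data, treatments, effects


data, treatments, effects = simulate_microarray()
print(len(data), "genes x", len(data[0]), "arrays")
```

Taking log2 of the simulated intensities returns the data to the scale on which the effects are additive.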