Annotation and Analysis of Microarray Data
A primer for NERC researchers
Environmental Genomics Thematic Programme Data Centre
http://envgen.nox.ac.uk
Data and the NERC
• Data is an asset
• Data may have unforeseen uses
• Analysis loses information
• Bulk analysis and data mining need “uniform” data
• Data stored without adequate annotation is useless
• Data rescue is expensive and unreliable
Metadata and Microarrays
• Sequence data is static
• Post-genome is very state-dependent
– Transcriptome = no. of cell types * no. of environmental conditions
– Annotation matters
– Data comparisons matter
• We need to take lessons from the gene naming debacle: one gene, many names
– Protein-tyrosine phosphatase, non-receptor type 6, Protein-tyrosine phosphatase 1C, PTP-1C, Hematopoietic cell protein-tyrosine phosphatase, SH-PTP1, Protein-tyrosine phosphatase SHP-1
– LARD, death receptor 3 beta, WSL-1R protein, lymphocyte associated receptor of death, death receptor 3
Metadata standards and data repositories
• A repository needs to keep all relevant metadata associated with a data set
• To be easily submitted, and to be searchable, data must adhere to standards, both in content and format
Thus, we have to decide:
• What should be captured and how?
• What format should data be in for submission?
What is MIAME?
• MIAME is the internationally adopted standard for the Minimal Information About a Microarray Experiment.
• The result of an MGED (www.mged.org) driven effort to codify the description of a microarray experiment.
• MIAME aims to define the core that is common to most experiments.
• Ultimately, it tries to specify the collection of information that would be needed to allow somebody to completely reproduce an experiment that was performed elsewhere.
The Six Parts of MIAME
1. Experimental design: the set of hybridization experiments as a whole
2. Array design: each array used and each element (spot, feature) on the array
3. Samples: samples used, extract preparation and labeling
4. Hybridizations: procedures and parameters
5. Measurements: images, quantification and specifications
6. Normalization controls: types, values and specifications
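One way to keep track of the six parts while annotating is a simple checklist. The sketch below is illustrative only: the part labels and the example entries are hypothetical names chosen for this sketch, not MIAME field names or the format of any annotation tool.

```r
# The six MIAME parts as checklist keys (labels chosen for this sketch only)
miame_parts <- c("experimental_design", "array_design", "samples",
                 "hybridizations", "measurements", "normalization_controls")

# A partly completed, made-up annotation record
annotation <- list(
  experimental_design = "Time course, 3 biological replicates per time point",
  array_design        = "In-house cDNA array, 2 spots per clone",
  samples             = "",
  measurements        = "Scanned images plus quantification output"
)

# A part counts as missing if it is absent from the record or left empty
is_filled <- vapply(miame_parts, function(part) {
  entry <- annotation[[part]]
  !is.null(entry) && nzchar(entry)
}, logical(1))

missing_parts <- miame_parts[!is_filled]
print(missing_parts)
```

A check of this kind only shows that something has been entered for each part; whether the entry is adequate still needs a human reader.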
MIAME definitions
• Available from www.mged.org
• All details mentioned in MIAME should be captured
• Latest draft: Version 1.1 (Draft 5, March 5, 2002)
• See also: A. Brazma, et al., Nature Genetics, vol 29 (December 2001), pp 365 - 371
But…
• Environmental genomics is a diverse, heterogeneous discipline, often involving multi-factorial experiments that can have an almost infinite number of experimental parameters.
• Describing this sort of data is hard.
• MIAME does not have the required vocabulary.
• However, NERC has made a commitment to making MIAME compliance a de-facto standard within its Science Programmes.
• NERC has invested in reconciling these…
MIAME/Env
• MIAME/Env is an initiative spearheaded by the EGTDC to extend MIAME standards for annotation of environmental genomic data.
• Includes the development of controlled vocabularies / ontologies to describe environmental genomic experiments.
• MIAME/Env developed with the support of MGED society and in collaboration with MIAME/Tox and members of the EBI.
Microarray Annotation for Environmental Researchers
• use the Standard
– The MIAME/Env model is developed in communication with EG funded researchers to ensure that environmental genomics experiments and data can be adequately described to MIAME standards
• use the Software
– maxdLoad2 is software developed by EGTDC partners facilitating
• MIAME/Env annotation
• Export in an appropriate format for submission to ArrayExpress
Do I have to?
Simple Answer:
YES!!!
More specifically:
• You need to adhere to metadata standards to submit to a public repository
• You need to submit to a public repository (e.g. ArrayExpress) to get an accession number for your data
• You need to have an accession number for your data in order to publish on it in major journals
The final word:
• NERC requires grant holders to comply with MIAME standards for microarray data
Benefits of using a data repository
End users/Researchers:
• Facilitates data sharing
• Catalogued / Backed-up
• Pervasive advertisement for your work
Bioinformaticians/Developers:
• Access to data for analysis and algorithm development
• Improves search capabilities
• Encourages development of more capable software for annotation, analysis and submission
Bio-Linux
The EGTDC distribution system for bioinformatics solutions
• Key bioinformatics software and documentation in a Linux environment
• Aim: to maximise the benefits of a pre-installed analysis system.
• provision of key software
• tools for automation of analysis and other customisations
• computing power
• ensure that what is provided can be reasonably maintained and supported
Software on Bio-Linux
Includes programs for:
• Sequence analysis
• Similarity searching
• Sequence alignment
• Phylogenetics
• Genome annotation and analysis
• ESTs
• Transcriptomics
Bio-Linux
Transcriptomics Databases:
• maxdLoad2
• GeNet access
Transcriptomics Analysis:
• maxdView
• GeneSpring
• R/BioConductor
MIAME/Env annotation and MAGE/ML export:
• maxdLoad2
[Workflow diagram, shown twice. Components: Raw Data; Expression measures (not normalised), from proprietary software (e.g. Affymetrix); Quality Control, Normalisation, Analysis and Presentation with GeneSpring, R/BioConductor, maxdView and other analysis programs; MIAME/Env Annotation with maxdLoad2 (on Bio-Linux); GeNet; ArrayExpress.]
Transcriptomics Databases
Tools on Bio-Linux:
• maxdLoad2
• GeNet access
maxdLoad2
Navigator
Top level user interface
GeNet
Via Web Interface
Via GeneSpring
GeNet and maxdLoad2
Both are databases designed to handle transcriptomic data.
Differences:
GeNet
• Centralised repository
• Geared towards use as an analysis and sharing tool as well as a storage area
• Partial MIAME compliance is possible, but not the default
• Great for sharing data and analyses
maxdLoad2
• Local repository
• More like a LIMS for transcriptomic data
• Geared towards MIAME compliant annotation, storage and export to public database
Transcriptomic Analysis
Tools on Bio-Linux:
• maxdView
• GeneSpring
• R/BioConductor
Which software should I use?
Commercial vs. Open Source
• Commercial: GeneSpring
• Open source: maxdView, R/BioConductor
Ease of Use
GeneSpring > maxdView > R/BioConductor
Fine tuned control
R/BioConductor > maxdView > GeneSpring
Why use just one?
E.g.
• R/BioConductor (fine tuned control) + GeneSpring (ease of use)
• R/BioConductor (pre-analysis choices) + maxdView (easy but fine tuned manipulation)
Alternatively:
• maxdView + GeneSpring
• All of them…
GeneSpring
Benefits:
• Graphical interface
• Choices of views
• Venn diagram visualisations
• Intuitive interface for filtering
• Extensive documentation
• Context dependent help
maxdView
Benefits:
• Graphical interface
• Quality control options
• Many analyses possible via menus or “calculator”
• Strong filtering capabilities
• Context dependent help
R/BioConductor
Command line package
Benefits:
• flexible
• many, many functions to choose from
• take advantage of the full functionality of the R stats package
• high degree of control
• great plotting facilities
• promotes thinking about data
• lots of documentation and help available
• automation possibilities
• some graphical facilities available
Documentation and Tutorials
GeneSpring
  Documentation:
  • Extensive
  • Available via help menu
  Tutorials:
  • Basic tutorial available via help menu
maxdView
  Documentation:
  • Good
  • Available via help menu
  Tutorials:
  • Basic tutorial
  • Working with clusters tutorial
  • Commands and hotkeys tutorial
  (all available via help menu)
R/BioConductor
  Documentation:
  • Extensive
  • Available via command line or via BioConductor website
  Tutorials:
  • Numerous
  • Available via command line or via BioConductor website
Overview of Microarray Analysis Steps
[Diagram: five numbered steps (Step 1 to Step 5). Step 1 takes text files, GPR files, etc. Box labels: Load Data, Apply Filters, Normalise, Analyse, Quality Control.]
Raw Data → translation into → Expression measures (not normalised)
• The raw microarray data scanned from images needs to be translated into some measurement of expression.
• The measurement used depends on the technology, e.g. relative measures (cDNA chips) or absolute measures (e.g. GeneChip).
• The measurements calculated depend on the algorithm used (e.g. MAS 5.0 vs. RMA for GeneChips).
• Background correction happens at this point.
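As a rough illustration of this translation step for a two-colour cDNA array, the sketch below subtracts a local background from each channel and takes a log2 ratio as the relative expression measure. The spot values and column names are made up, and plain subtraction is only one of several background-correction approaches; this is not the algorithm of any particular package.

```r
# Made-up foreground/background intensities for three spots
spots <- data.frame(
  red_fg   = c(1200, 530, 8000),
  red_bg   = c(200, 180, 210),
  green_fg = c(700, 520, 2100),
  green_bg = c(190, 170, 200)
)

# Background-correct each channel; floor at 1 so the log is always defined
red   <- pmax(spots$red_fg - spots$red_bg, 1)
green <- pmax(spots$green_fg - spots$green_bg, 1)

# Relative expression measure: log2(red / green)
log_ratio <- log2(red / green)
print(round(log_ratio, 2))
```

A log ratio of 0 means equal signal in both channels; positive values mean more signal in the red channel.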
Import
GeneSpring
  Import mechanism type: Graphical
  Import file types:
  • Text files (e.g. tab delimited)
  • Upload from database
  Other notes:
  • Assumes “summarised” data.
  • Some level of normalisation will be applied automatically.
  • Should recognise “common” formats.
  • Can save formats for rapid loading later.
maxdView
  Import mechanism type: Graphical
  Import file types:
  • Text files (e.g. tab delimited)
  • maxdView native files (XML)
  • Upload from database
  Other notes:
  • For analysis, load up “summarised” data.
  • Pre-summarised data can also be loaded for quality control.
  • Remembers your previous format choices.
R/BioConductor
  Import mechanism type: Command line (some graphical tools available)
  Import file types:
  • Text files (e.g. tab delimited) and any file type supported by R
  Other notes:
  • Raw data (e.g. .CEL files) or “summarised” data can be loaded.
Export
GeneSpring
  Export mechanism type: Menu
  Export options:
  • Upload to database
  • Use External Programming Interface to transfer to another program (e.g. R)
  • Graphical files (e.g. plots)
  Other notes:
  • Difficult to retrieve pre-normalised data from GeneSpring.
maxdView
  Export mechanism type: Menu
  Export options:
  • Text files (e.g. tab delimited)
  • maxdView native files (XML)
  • Download to database
  • Graphical files (e.g. plots)
  Other notes:
  • Can choose the columns of data to save.
  • Text files and database data include your data only.
  • maxdView native files include information about all the viewing options, etc., you had set when you saved the file.
R/BioConductor
  Export mechanism type: Command line
  Export options:
  • Text files (e.g. tab delimited)
  • Graphical files (e.g. plots)
Quality Control
Very Important!
Generating high quality microarray data requires rigorous quality control measures at each individual step of the process:
• experimental design of the study
• the generation of samples
• extraction of RNA
• labeling of the probe
• microarray hybridization
• analysis
Systematic, reproducible errors can be minimized by applying various normalisations…BUT:
You should not try to rescue low quality hybridizations with mathematical techniques!
Quality Control
Do the arrays look alright?
Look at the actual image scans – are there quality issues to be addressed on any of the chips?
Quality Control
Does the data have the distribution you expect?
The common array analysis functions assume that most genes will not change in expression level and that your data is lognormal.
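A quick way to look at the lognormal assumption is to log-transform the intensities and check whether the result looks roughly normal. The sketch below uses base R only and simulated intensities (not real array data); the skewness helper is defined here for illustration and is not from any package.

```r
# Simulated raw intensities (stand-in for one array's summarised values)
set.seed(1)
intensity <- rlnorm(5000, meanlog = 7, sdlog = 1.2)

# If the data are lognormal, log2(intensity) should look roughly normal
log_int <- log2(intensity)

# Simple moment-based skewness: near 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew <- skewness(intensity)
log_skew <- skewness(log_int)

# Visual checks: histogram and normal Q-Q plot of the log-transformed data
hist(log_int, breaks = 50, main = "log2 intensity")
qqnorm(log_int)
qqline(log_int)
```

With real data, a strongly skewed or multi-modal histogram after the log transform is a prompt to look more closely at the array before normalising.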
Quality Control
Figure and text from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/Method_qc2.html
Quality Control
GeneSpring
  Functions available: Few
  Examples:
  • Can filter out spots with particular features (e.g. very high or very low intensity) before further manipulation.
  Other notes:
  • Ideally, more extensive quality control should take place before uploading data into GeneSpring.
maxdView
  Functions available: Some
  Examples:
  • Benford Analyser
  • Distograms of data
  • Easy methods to generate means, std. devs, etc., and filter on these
  • Flexible filtering system
  Other notes:
  • Good levels of quality control can be achieved using maxdView
  • Requires good knowledge of the application to get full benefit
R/BioConductor
  Functions available: Extensive
  Examples:
  • Many quality control functions for different types of data
  • Many, many options
  Other notes:
  • Highly recommended
  • Not user friendly at first!
Quality Control
Does the data have the distribution you expect?
This plot is the result of running the Benford Analyser on data (pre-normalisation) in maxdView.
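For readers without maxdView to hand, the idea behind a Benford-style check can be sketched in base R: compare the observed frequencies of leading digits with those expected under Benford's law, P(d) = log10(1 + 1/d). This assumes the Benford Analyser performs a comparison of this kind; the data are simulated and the helper function is hypothetical.

```r
# Expected leading-digit frequencies under Benford's law
benford_expected <- log10(1 + 1 / (1:9))

# Leading (first significant) digit of each positive, finite value
first_digit <- function(x) {
  x <- abs(x[is.finite(x) & x != 0])
  floor(x / 10^floor(log10(x)))
}

# Simulated pre-normalisation intensities spanning several orders of magnitude
set.seed(2)
intensity <- rlnorm(10000, meanlog = 7, sdlog = 2)

observed <- tabulate(first_digit(intensity), nbins = 9) / length(intensity)

comparison <- rbind(expected = benford_expected, observed = observed)
colnames(comparison) <- 1:9
print(round(comparison, 3))
```

Large departures from the expected frequencies suggest that the values have been truncated, saturated or otherwise processed.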
Quality Control
Fit your data and take a look at the reconstructed image surface using R/BioConductor:
>library(affyPLM)
>pset <- fitPLM(myData)   # fit a probe-level model to the data
>image(pset)              # chip pseudo-images from the fit
Quality Control
Check out the density curves of the PM data using R/BioConductor:
>hist(myData, col=pops2, type="l")
Normalisation
GeneSpring
  • Graphical menu system
  • Hints about effects of normalisations given in window
maxdView
  • Graphical menu system
  • Hints about effects of normalisations given in window
R/BioConductor
  • Extensive choice
  • Need to read about the methods before applying them
General advice:
• Apply normalisations that make sense for your data
• Use plotting facilities to view your data before and after normalisation to check the effect
Normalisation
[Screenshots: normalisation in GeneSpring and in maxdView]
Normalisation
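>pops2 <- pData(myData)[,2]
>boxplot(myData, col = pops2 + 1)
Pre-normalisation
R/BioConductor
# expresso() from the affy package is assumed here; it is the function that takes these method arguments
>eset <- expresso(myData, bgcorrect.method = "rma", normalize.method = "quantiles", pmcorrect.method = "pmonly", summary.method = "medianpolish")
>boxplot(eset, col = pops2 + 1)
Post-normalisation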
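The quantile option of normalize.method forces every array to share the same intensity distribution. A minimal base R sketch of that idea is given below, on simulated data. It is a simplified illustration (ties are broken by order rather than averaged) and is not the Bioconductor implementation; the function and chip names are hypothetical.

```r
# Simplified quantile normalisation: columns are arrays, rows are probes
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                         # each array sorted independently
  target <- rowMeans(sorted)                          # mean distribution across arrays
  normalised <- apply(ranks, 2, function(r) target[r])
  dimnames(normalised) <- dimnames(x)
  normalised
}

# Three simulated arrays with different overall brightness and spread
set.seed(3)
arrays <- cbind(chipA = rlnorm(1000, 7.0, 1.0),
                chipB = rlnorm(1000, 7.5, 1.2),
                chipC = rlnorm(1000, 6.8, 0.9))

log_arrays <- log2(arrays)
norm <- quantile_normalise(log_arrays)

# Boxplots before and after: the normalised arrays have identical distributions
boxplot(as.data.frame(log_arrays), main = "Pre-normalisation")
boxplot(as.data.frame(norm), main = "Post-normalisation")
```

Because the method forces the distributions to be identical, it is only appropriate when the arrays are expected to have similar overall distributions.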
Filters
• A Filter is a rule applied to each Spot
• Spots which do not pass through the filter are ignored in downstream steps
• Filters are useful for reducing the complexity of analyses or visualisations by discarding uninteresting Spots. They can also be used to locate Spots which match particular criteria.
GeneSpring
Filter on Error
maxdView
MultiFilter
R/BioConductor
>library(genefilter)
Have to define your filter and then apply it.
Filters can be saved and used again.
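The define-then-apply pattern can be sketched in base R without any Bioconductor packages. The example below builds a reusable filter ("at least k arrays above intensity A") and applies it to each spot. The data are simulated and the function names are hypothetical; genefilter provides its own filter constructors, none of which are used here.

```r
# Simulated expression matrix: 200 spots (rows) on 6 arrays (columns)
set.seed(4)
expr <- matrix(rlnorm(200 * 6, meanlog = 6, sdlog = 1), nrow = 200,
               dimnames = list(paste0("spot", 1:200), paste0("array", 1:6)))

# Define a filter: returns a function that tests one spot's values
k_over_a <- function(k, A) {
  function(x) sum(x > A) >= k
}

# The filter is an ordinary R object, so it can be kept and reused
my_filter <- k_over_a(k = 3, A = 500)

# Apply the filter to every spot; spots that fail are dropped downstream
keep <- apply(expr, 1, my_filter)
filtered <- expr[keep, , drop = FALSE]

cat(sum(keep), "of", nrow(expr), "spots pass the filter\n")
```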
Statistics and clustering
Most statistical tests have underlying assumptions – know what these are and whether they are valid for your data!
GeneSpring, maxdView and R/BioConductor all provide facilities to run various statistical analyses and clustering algorithms.
R provides the most extensive choice.
GeneSpring
maxdView
TTest
R/BioConductor
>library(multtest)   # designed for microarray data
Many clustering functions available within R libraries
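Testing thousands of genes at once is where multiple testing matters. As a hedged illustration using base R only (t.test and p.adjust, not the multtest procedures), the sketch below runs a per-gene t-test on simulated data and compares the number of calls before and after a Benjamini-Hochberg adjustment. The group sizes and effect size are made up.

```r
# Simulated log-scale data: 1000 genes, 4 control and 4 treated arrays
set.seed(5)
n_genes <- 1000
group <- factor(rep(c("control", "treated"), each = 4))
expr <- matrix(rnorm(n_genes * 8), nrow = n_genes)

# Make the first 50 genes genuinely different in the treated group
expr[1:50, group == "treated"] <- expr[1:50, group == "treated"] + 3

# One Welch t-test per gene
p_raw <- apply(expr, 1, function(x) {
  t.test(x[group == "treated"], x[group == "control"])$p.value
})

# Benjamini-Hochberg adjustment to control the false discovery rate
p_bh <- p.adjust(p_raw, method = "BH")

cat("Genes with raw p < 0.05:     ", sum(p_raw < 0.05), "\n")
cat("Genes with adjusted p < 0.05:", sum(p_bh < 0.05), "\n")
```

The t-test assumes roughly normal data within each group, which is one reason to check the distribution of your data first.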
Other topics to consider
• Potential for automation
• Statistical choices
• Plotting choices
• Ability to interface with other programs
• No doubt lots of other things…
The danger of the black box
User friendly software is:
a) easy to use
b) easy to abuse
c) both of the above
What is your aim?
Looking for genes to test biologically?
• How many false positives can you afford?
• How many false negatives can you afford?
• How many replicates (technical? biological?) will you need to use the appropriate analysis methods?
Your analysis methods should take these issues into account.
Example: What is significant change?
Is a 2-fold change in expression meaningful?
• Do you have enough replicates to justify your claims statistically?
• Is it meaningful if the absolute expression level is low?
– What is the std. dev. of your measurements?
– Noise envelope diagrams – precision is an issue
• Is it meaningful if the absolute expression level is high?
– Saturation effects
– Accuracy issues
The moral
Experimental design is more important than which analysis package you choose to use.
Plan your experiments! Your experimental design will affect what meaningful analyses you can do.
Plan your analyses! There are many steps to carrying out transcriptomic analysis properly.
Don’t give in to the temptation of the black box!
Key Web Sites
BioConductor: www.bioconductor.org
GeneSpring: www.silicongenetics.com
maxd: bioinf.man.ac.uk/microarray/maxd/
R: www.r-project.org
Key EGTDC pages:
Home page: envgen.nox.ac.uk
Bioinformatics Solutions: envgen.nox.ac.uk/software.html
Bio-Linux: envgen.nox.ac.uk/biolinux.html
Normalisation
Intrachip vs. interchip
• E.g. experiment with Affy: may need to normalise regionally (intrachip) and across chips (interchip) before the data are comparable
• Experiment with cDNA: normalise intrachip and interchip?
Distogram
Normalisation
Technical Issues
• Biased response of dyes
• Positional bias of spots
• Bias due to gene sequence
• Inconsistencies between batches of chips
Quality Control
Remove using Lowess
SVD & PCA
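Removing intensity-dependent dye bias with lowess can be sketched in base R. The example below simulates a two-colour array with a bias that varies with intensity, fits a lowess curve to the MA plot (M = log ratio, A = mean log intensity) and subtracts it. The data, the bias shape and the smoothing span f = 0.3 are all assumptions made for illustration.

```r
# Simulated two-colour log2 intensities with an intensity-dependent dye bias
set.seed(6)
n <- 2000
log_green <- runif(n, 6, 14)
bias      <- 1.5 - 0.1 * log_green            # made-up bias, stronger at low intensity
log_red   <- log_green + bias + rnorm(n, sd = 0.3)

# MA representation of the array
M <- log_red - log_green                      # log ratio
A <- (log_red + log_green) / 2                # mean log intensity

# Fit the trend of M against A, then evaluate it at every spot
fit   <- lowess(A, M, f = 0.3)
trend <- approx(fit$x, fit$y, xout = A, ties = mean)$y

# Lowess-normalised log ratios: the intensity-dependent trend is removed
M_norm <- M - trend

plot(A, M, pch = ".", main = "Before lowess")
lines(fit, col = "red")
plot(A, M_norm, pch = ".", main = "After lowess")
abline(h = 0, col = "red")
```

This relies on the assumption stated earlier that most genes do not change in expression, so the bulk of the M values should centre on zero at every intensity.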
Help Documentation
GeneSpring
• Color Bar: for gene coloring (default coding: expression level)
• Genome Browser: to view expression data
• Navigator: for project file management
• Views
R/BioConductor
Command line statistics package
Pros:
• flexible
• lots of functionality
• high degree of control
• great plotting facilities
• promotes thinking about data
• lots of documentation and help available
Cons:
• STEEP learning curve at beginning
R/BioConductor
E.g. With Affymetrix data
Can load data at various stages
• summary values
• raw values
• transformed values
• etc.…
Can then apply relevant functions using various libraries
R/BioConductor
E.g. With Affymetrix data
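>library(affy)
>listocelfiles <- list.celfiles("/home/user1/myfiles/", full.names = TRUE)   # paths of the .CEL files
>myData <- read.affybatch(filenames = listocelfiles)                        # read them into an AffyBatch
>phenodata <- read.phenoData("phenodata.txt")                               # sample annotation (newer Biobase releases use read.AnnotatedDataFrame)
>phenoData(myData) <- phenodata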
UGLY!
Mitigating factors:
• The environment can be saved so you do not have to recreate objects from scratch each time
• Files with sets of commands can be “sourced” so that many tasks are run automatically on starting R, or can be started up easily when in R
You can do things like this…