(1)

Annotation and Analysis of Microarray Data

A primer for NERC researchers

Environmental Genomics Thematic Programme Data Centre

http://envgen.nox.ac.uk

(2)

Data and the NERC

• Data is an asset

• Data may have unforeseen uses

• Analysis loses information

• Bulk analysis and data mining needs “uniform” data

• Data stored without adequate annotation is useless

• Data rescue is expensive and unreliable

(3)

Metadata and Microarrays

• Sequence data is static

• Post-genomic data is very state-dependent

– Transcriptome = no. of cell types * no. of environmental conditions

– Annotation matters

– Data comparisons matter

• We need to take lessons from the gene debacle (one gene, many names)

– Protein-tyrosine phosphatase, non-receptor type 6, Protein-tyrosine phosphatase 1C, PTP-1C, Hematopoietic cell protein-tyrosine phosphatase, SH-PTP1, Protein-tyrosine phosphatase SHP-1

– LARD, death receptor 3 beta, WSL-1R protein, lymphocyte associated receptor of death, death receptor 3

(4)

Metadata standards and data repositories

• A repository needs to keep all relevant metadata associated with a data set

• To be easily submitted, and to be searchable, data must adhere to standards, both in content and format

Thus, we have to decide:

• What should be captured and how?

• What format should data be in for submission?

(5)

What is MIAME?

• MIAME is the internationally adopted standard for the Minimum Information About a Microarray Experiment.

• It is the result of an MGED (www.mged.org) driven effort to codify the description of a microarray experiment.

• MIAME aims to define the core that is common to most experiments.

• Ultimately, it tries to specify the collection of information that would be needed to allow somebody to completely reproduce an experiment that was performed elsewhere.

(6)

The Six Parts of MIAME

1. Experimental design: the set of hybridization experiments as a whole

2. Array design: each array used and each element (spot, feature) on the array

3. Samples: samples used, extract preparation and labeling

4. Hybridizations: procedures and parameters

5. Measurements: images, quantification and specifications

6. Normalization controls: types, values and specifications
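
The six parts can be treated as a simple checklist. The R sketch below tests whether an annotation record has an entry for each part. The record layout and the names `miame_parts`, `missing_miame_parts` and `my_annotation` are hypothetical and invented for this illustration; they are not part of MIAME, maxdLoad2 or ArrayExpress.

```r
# The six MIAME parts (names chosen for this sketch only).
miame_parts <- c("experimental_design", "array_design", "samples",
                 "hybridizations", "measurements", "normalization_controls")

# Return the MIAME parts that are absent or empty in an annotation record.
missing_miame_parts <- function(annotation) {
  present <- vapply(miame_parts, function(part) {
    value <- annotation[[part]]
    !is.null(value) && length(value) > 0 && all(nzchar(as.character(value)))
  }, logical(1))
  miame_parts[!present]
}

# Hypothetical, partly filled-in record.
my_annotation <- list(
  experimental_design = "Time course, 3 biological replicates",
  array_design        = "Custom cDNA array",
  samples             = "Total RNA, Cy3/Cy5 labelled",
  hybridizations      = ""
)

# Lists the parts that still need to be described.
missing_miame_parts(my_annotation)
```

A real MIAME check involves far more detail within each part; this only illustrates the idea of validating annotation before submission.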

(7)

MIAME definitions

• Available from www.mged.org

• All details mentioned in MIAME should be captured

• Latest draft: Version 1.1 (Draft 5, March 5, 2002)

• See also: A. Brazma, et al., Nature Genetics, vol 29 (December 2001), pp 365 - 371

(8)

But…

• Environmental genomics is a diverse, heterogeneous discipline, often involving multi-factorial experiments that can have an almost infinite number of experimental parameters.

• Describing this sort of data is hard.

• MIAME does not have the required vocabulary.

• However, NERC has made a commitment to making MIAME compliance a de facto standard within its Science Programmes.

• NERC has invested in reconciling these…

(9)

MIAME/Env

• MIAME/Env is an initiative spearheaded by the EGTDC to extend MIAME standards for annotation of environmental genomic data.

• Includes the development of controlled vocabularies / ontologies to describe environmental genomic experiments.

• MIAME/Env is developed with the support of the MGED Society and in collaboration with MIAME/Tox and members of the EBI.

(10)

Microarray Annotation for Environmental Researchers

• Use the Standard

– The MIAME/Env model is developed in communication with EG funded researchers to ensure that environmental genomics experiments and data can be adequately described to MIAME standards

• Use the Software

– maxdLoad2 is software developed by EGTDC partners facilitating:

• MIAME/Env annotation

• Export in an appropriate format for submission to ArrayExpress

(11)

Do I have to?

Simple Answer:

YES!!!

More specifically:

• You need to adhere to metadata standards to submit to a public repository

• You need to submit to a public repository (e.g. ArrayExpress) to get an accession number for your data

• You need to have an accession number for your data in order to publish on it in major journals

The final word:

• NERC requires grant holders to comply with MIAME standards for microarray data

(12)

Benefits of using a data repository

End users/Researchers

• Facilitates data sharing

• Catalogued / Backed-up

• Pervasive advertisement for your work

Bioinformaticians/Developers

• Access to data for analysis and algorithm development

• Improves search capabilities

• Encourages development of more capable software for annotation, analysis and submission

(13)

Bio-Linux

The EGTDC distribution system for bioinformatics solutions

• Key bioinformatics software and documentation in a Linux environment

• Aim: to maximise the benefits of a pre-installed analysis system.

• provision of key software

• tools for automation of analysis and other customisations

• computing power

• ensure that what is provided can be reasonably maintained and supported

(14)

Software on Bio-Linux

Includes programs for:

• Sequence analysis

• Similarity searching

• Sequence alignment

• Phylogenetics

• Genome annotation and analysis

• ESTs

• Transcriptomics

(15)

Bio-Linux

Transcriptomics Databases: maxdLoad2, GeNet access

Transcriptomics Analysis: maxdView, GeneSpring, R/BioConductor

MIAME/Env annotation and MAGE-ML export: maxdLoad2

(16)

Workflow diagram. Elements shown: Raw Data; Proprietary software (e.g. Affymetrix); Expression measures (not normalised); GeNet, maxdLoad2, R/BioConductor; MIAME/Env Annotation; ArrayExpress; GeneSpring, R/BioConductor, maxdView, other analysis programs; Quality Control, Normalisation, Analysis, Presentation.

(17)

Workflow diagram with Bio-Linux. Elements shown: Raw Data; Proprietary software (e.g. Affymetrix); Expression measures (not normalised); Bio-Linux; GeNet, maxdLoad2, R/BioConductor; MIAME/Env Annotation; ArrayExpress; GeneSpring, R/BioConductor, maxdView, other analysis programs; Quality Control, Normalisation, Analysis, Presentation.

(18)

Transcriptomics Databases

Tools on Bio-Linux: maxdLoad2, GeNet access

(19)

maxdLoad2

Navigator

Top level user interface

(20)

GeNet

Via Web Interface

Via GeneSpring

(21)

GeNet and maxdLoad2

Both are databases designed to handle transcriptomic data.

Differences:

GeNet

Centralised repository

Geared towards use as an analysis and sharing tool as well as a storage area

Partial MIAME compliance is possible, but not the default

Great for sharing data and analyses

maxdLoad2

Local repository

More like a LIMS system for transcriptomic data

Geared towards MIAME compliant annotation, storage and export to public database

(22)

Transcriptomic Analysis

Tools on Bio-Linux: maxdView, GeneSpring, R/BioConductor

(23)

Which software should I use??

Commercial vs. Open Source

Commercial: GeneSpring

Open Source: maxdView, R/BioConductor

Ease of Use

GeneSpring > maxdView > R/BioConductor

Fine-tuned control

R/BioConductor > maxdView > GeneSpring

(24)

Why use just one??

E.g.

Fine-tuned control → R/BioConductor + Ease of use → GeneSpring

Pre-analysis choices → R/BioConductor + Easy but fine-tuned manipulation → maxdView

Alternatively:

maxdView + GeneSpring

All of them…

(25)

GeneSpring

Benefits:

• Graphical interface

• Choices of views

• Venn diagram visualisations

• Intuitive interface for filtering

• Extensive documentation

• Context dependent help

(26)

maxdView

Benefits:

• Graphical interface

• Quality control options

• Many analyses possible via menus or “calculator”

• Strong filtering capabilities

• Context dependent help

(27)

R/BioConductor

Command line package

Benefits:

• flexible

• many, many functions to choose from

• take advantage of the full functionality of the R stats package

• high degree of control

• great plotting facilities

• promotes thinking about data

• lots of documentation and help available

• automation possibilities

• some graphical facilities available

(28)

Documentation and Tutorials

GeneSpring

• Documentation: extensive; available via help menu

• Tutorials: basic tutorial available via help menu

maxdView

• Documentation: good; available via help menu

• Tutorials: basic tutorial; working with clusters tutorial; commands and hotkeys tutorial (all available via help menu)

R/BioConductor

• Documentation: extensive; available via command line or via BioConductor website

• Tutorials: numerous; available via command line or via BioConductor website

(29)

Overview of Microarray Analysis Steps

Load Data

Apply Filters

Normalise

Analyse

Quality Control

Step 1

Text, GPR file, etc…

Step 2

Step 3

Step 4

Step 5

(30)

Raw Data: translation into Expression measures (not normalised)

• The raw microarray data scanned from images needs to be translated into some measurement of expression.

• The measurement used depends on the technology, e.g. relative measures (cDNA chips) or absolute measures (e.g. GeneChip).

• The measurements calculated depend on the algorithm used (e.g. MAS 5.0 vs. RMA for GeneChips).

• Background correction happens at this point.

(31)

Import

GeneSpring

• Import mechanism type: graphical

• Import file types: text files (e.g. tab delimited); upload from database

• Other notes: assumes "summarised" data. Some level of normalisation will be applied automatically. Should recognise "common" formats. Can save formats for rapid loading later.

maxdView

• Import mechanism type: graphical

• Import file types: text files (e.g. tab delimited); maxdView native files (XML); upload from database

• Other notes: for analysis, load up "summarised" data. Pre-summarised data can also be loaded for quality control. Remembers your previous format choices.

R/BioConductor

• Import mechanism type: command line; some graphical tools available

• Import file types: text files (e.g. tab delimited) and any file type supported by R

• Other notes: raw data (e.g. .CEL files) or "summarised" data can be loaded.

(32)

Export

GeneSpring

• Export mechanism type: menu

• Export options: upload to database; use the External Programming Interface to transfer to another program (e.g. R); graphical files (e.g. plots)

• Other notes: difficult to retrieve pre-normalised data from GeneSpring.

maxdView

• Export mechanism type: menu

• Export options: text files (e.g. tab delimited); maxdView native files (XML); download to database; graphical files (e.g. plots)

• Other notes: can choose the columns of data to save. Text files and database data include your data only. maxdView native files include information about all the viewing options, etc., you had set when you saved the file.

R/BioConductor

• Export mechanism type: command line

• Export options: text files (e.g. tab delimited); graphical files (e.g. plots)

(33)

Quality Control

Very Important!

Generating high quality microarray data requires rigorous quality control measures at each individual step of the process:

• experimental design of the study

• the generation of samples

• extraction of RNA

• labeling of the probe

• microarray hybridization

• analysis

Systematic, reproducible errors can be minimized by applying various normalisations…BUT:

You should not try to rescue low quality hybridizations with mathematical techniques!

(34)

Quality Control

Do the arrays look alright?

Look at the actual image scans – are there quality issues to be addressed on any of the chips?

(35)

Quality Control

Does the data have the distribution you expect?

The common array analysis functions assume that most genes will not change in expression level and that your data is lognormal.
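
One quick way to look at the lognormal assumption is to compare the skewness of intensities before and after a log2 transform. The base R sketch below uses simulated intensities (lognormal by construction); the helper `skewness` and all variable names are assumptions made for this illustration, and real data would replace `raw_intensity`.

```r
set.seed(1)

# Simulated raw intensities for 1000 spots (lognormal by construction).
raw_intensity <- rlnorm(1000, meanlog = 7, sdlog = 1)

# Simple moment-based skewness: about 0 for a symmetric distribution.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# On the raw scale the distribution is strongly right-skewed...
raw_skew <- skewness(raw_intensity)

# ...while the log2-transformed values should look roughly symmetric.
log_intensity <- log2(raw_intensity)
log_skew <- skewness(log_intensity)

c(raw = raw_skew, log2 = log_skew)

# Visual check in an interactive session:
# hist(log_intensity, breaks = 50); qqnorm(log_intensity); qqline(log_intensity)
```

If the log-transformed values are still clearly skewed or have several peaks, that is a reason to look more closely at the arrays before going further.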

(36)

Quality Control

Figure and text from: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/Method_qc2.html

(37)

Quality Control

GeneSpring

• Functions available: few

• Examples: can filter out spots with particular features (e.g. very high or very low intensity) before further manipulation.

• Other notes: ideally, more extensive quality control should take place before uploading data into GeneSpring.

maxdView

• Functions available: some

• Examples: Benford Analyser; distograms of data; easy methods to generate means, std. devs, etc., and filter on these; flexible filtering system

• Other notes: good levels of quality control can be achieved using maxdView; requires good knowledge of the application to get full benefit

R/BioConductor

• Functions available: extensive

• Examples: many quality control functions for different types of data; many, many options

• Other notes: highly recommended; not user friendly at first!

(38)

Quality Control

Does the data have the distribution you expect?

This plot is the result of running the Benford Analyser on data (pre-normalisation) in maxdView.
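
For readers without maxdView to hand, the idea can be sketched in base R, assuming (as its name suggests) that the Benford Analyser compares the leading digits of the data with Benford's law, P(d) = log10(1 + 1/d). The data are simulated and the helper `leading_digit` is invented for this example; this is not maxdView's implementation.

```r
set.seed(42)

# Benford's law: expected proportion of values whose first digit is d.
benford_expected <- log10(1 + 1 / (1:9))

# First significant digit of each positive value.
leading_digit <- function(x) {
  x <- x[is.finite(x) & x > 0]
  floor(x / 10^floor(log10(x)))
}

# Simulated pre-normalisation intensities spanning several orders of magnitude.
intensity <- rlnorm(5000, meanlog = 6, sdlog = 2)

# Observed proportion of each leading digit 1..9.
observed <- as.numeric(table(factor(leading_digit(intensity), levels = 1:9))) /
  length(intensity)

round(rbind(expected = benford_expected, observed = observed), 3)
```

Large departures from the expected proportions can hint at saturation, thresholding or other processing artefacts, but they are a prompt to investigate rather than proof of a problem.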

(39)

Quality Control

Fit your data and take a look at the reconstructed image surface using R/BioConductor:

>library(affyPLM)

>pset <- fitPLM(myData)

>image(pset)

(40)

Quality Control

Check out the density curves of the PM data using R/BioConductor

>hist(myData, col=pops2, type="l")

(41)

Normalisation

GeneSpring

• Graphical menu system

• Hints about effects of normalisations given in window

maxdView

• Graphical menu system

• Hints about effects of normalisations given in window

R/BioConductor

• Extensive choice

• Need to read about the methods before applying them

General advice:

• Apply normalisations that make sense for your data

• Use plotting facilities to view your data before and after normalisation to check
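
To make "before and after" concrete, here is a minimal base R sketch of one common method, quantile normalisation, applied to simulated log2 data. It is a teaching sketch under stated assumptions, not the implementation used by GeneSpring, maxdView or the BioConductor packages; the function name `quantile_normalise` and the data are invented for the example.

```r
set.seed(7)

# Simulated log2 expression values: 500 genes x 3 arrays,
# with array 2 shifted and array 3 stretched relative to array 1.
expr <- cbind(array1 = rnorm(500, mean = 8, sd = 1),
              array2 = rnorm(500, mean = 9, sd = 1),
              array3 = rnorm(500, mean = 8, sd = 1.5))

# Quantile normalisation: give every array the same empirical distribution,
# namely the mean of the sorted values across arrays.
quantile_normalise <- function(x) {
  sorted     <- apply(x, 2, sort)                        # sort each column
  reference  <- rowMeans(sorted)                         # mean distribution
  ranks      <- apply(x, 2, rank, ties.method = "first") # rank within column
  normalised <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalised) <- dimnames(x)
  normalised
}

expr_norm <- quantile_normalise(expr)

# Compare medians before and after normalisation.
rbind(before = apply(expr, 2, median), after = apply(expr_norm, 2, median))

# Visual check in an interactive session:
# par(mfrow = c(1, 2)); boxplot(expr, main = "Before"); boxplot(expr_norm, main = "After")
```

After this step every array has the same distribution by construction, which is exactly why it should only be applied when that assumption makes sense for the experiment.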

(42)

GeneSpring

maxdView

Normalisation

(43)

Normalisation

R/BioConductor

Pre-normalisation

>pops2 <- pData(myData)[,2]

>boxplot(myData, col = pops2 + 1)

Post-normalisation

>eset <- expresso(myData, bgcorrect.method="rma", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")

>boxplot(eset, col = pops2 + 1)

(44)

Filters

• A Filter is a rule applied to each Spot

• Spots which do not pass through the filter are ignored in downstream steps

• Filters are useful for reducing the complexity of analyses or visualisations by discarding uninteresting Spots. They can also be used to locate Spots which match particular criteria.
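
The idea of a filter as a per-spot rule can be sketched in base R as a function returning TRUE for spots that pass. The spot table, thresholds and function names below are hypothetical and chosen only for illustration; they do not correspond to any particular package.

```r
# Hypothetical spot table: one row per spot.
spots <- data.frame(
  id        = paste0("spot", 1:6),
  intensity = c(50, 1200, 65000, 800, 30, 4000),
  flagged   = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# A filter is a rule that returns TRUE for spots that pass.
intensity_filter <- function(spots, low = 100, high = 60000) {
  spots$intensity >= low & spots$intensity <= high
}
flag_filter <- function(spots) !spots$flagged

# Combine filters: a spot is kept only if it passes all of them.
keep <- intensity_filter(spots) & flag_filter(spots)

# Filters can discard uninteresting spots...
passed <- spots[keep, ]
# ...or locate spots matching a criterion (here: very bright spots).
saturated <- spots$id[spots$intensity > 60000]

passed$id
saturated
```

Keeping the rules as named functions means the same filter can be reused on another data set.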

(45)

GeneSpring

Filter on Error

(46)

maxdView

MultiFilter

(47)

R/BioConductor

>library(genefilter)

Have to define your filter and then apply it.

Filters can be saved and used again.

(48)

Statistics and clustering

Most statistical tests have underlying assumptions – know what these are and whether they are valid for your data!

GeneSpring, maxdView and R/BioConductor all provide facilities to run various statistical analyses and clustering algorithms.

R provides the most extensive choice.

(49)

GeneSpring

(50)

maxdView

TTest

(51)

R/BioConductor

>library(multtest)  # designed for microarray data

Many clustering functions available within R libraries

(52)

Other topics to consider

• Potential for automation

• Statistical choices

• Plotting choices

• Ability to interface with other programs

• No doubt lots of other things…

(53)

The danger of the black box

User friendly software is:

a) easy to use

b) easy to abuse

c) both of the above

(54)

What is your aim?

Looking for genes to test biologically?

• How many false positives can you afford?

• How many false negatives can you afford?

• How many replicates (technical? biological?) will you need to use the appropriate analysis methods?

Your analysis methods should take these issues into account.

(55)

Example: What is significant change?

Is a 2-fold change in expression meaningful?

Do you have enough replicates to justify your claims statistically?

Is it meaningful if the absolute expression level is low?

– What is the std. dev. of your measurements?

– Noise envelope diagrams – precision is an issue

Is it meaningful if the absolute expression level is high?

– Saturation effects

– Accuracy issues
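
These questions can be explored on simulated data. The base R sketch below contrasts a naive 2-fold cut-off with per-gene t-tests adjusted for multiple testing. The number of replicates, the noise level and the choice of Welch t-test with Benjamini-Hochberg adjustment are assumptions made for this illustration, not a recommended analysis.

```r
set.seed(123)

n_genes <- 1000
n_reps  <- 4    # replicates per condition (an assumption for this sketch)

# Simulated log2 expression: most genes unchanged,
# the first 50 truly up 2-fold (+1 on the log2 scale).
control   <- matrix(rnorm(n_genes * n_reps, mean = 8, sd = 0.5), nrow = n_genes)
treatment <- matrix(rnorm(n_genes * n_reps, mean = 8, sd = 0.5), nrow = n_genes)
treatment[1:50, ] <- treatment[1:50, ] + 1

# Per-gene log2 fold change and Welch t-test p-value.
log2_fc <- rowMeans(treatment) - rowMeans(control)
p_value <- vapply(seq_len(n_genes), function(i) {
  t.test(treatment[i, ], control[i, ])$p.value
}, numeric(1))

# Adjust for testing many genes at once (Benjamini-Hochberg false discovery rate).
p_adj <- p.adjust(p_value, method = "BH")

# Compare a naive 2-fold cut-off with a statistically supported call.
fold_only   <- abs(log2_fc) >= 1
significant <- p_adj < 0.05

table(fold_only, significant)
```

Rerunning with a different `n_reps` or `sd` shows how strongly the answer to "is a 2-fold change meaningful?" depends on replication and measurement noise.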

(56)

The moral

Experimental design is more important than which analysis package you choose to use.

Plan your experiments! Your experimental design will affect what meaningful analyses you can do.

Plan your analyses! There are many steps to carrying out transcriptomic analysis properly.

Don’t give in to the temptation of the black box!

(57)

Key Web Sites

BioConductor: www.bioconductor.org

GeneSpring: www.silicongenetics.com

maxd: bioinf.man.ac.uk/microarray/maxd/

R: www.r-project.org

Key EGTDC pages:

Home page: envgen.nox.ac.uk

Bioinformatics Solutions: envgen.nox.ac.uk/software.html

Bio-Linux: envgen.nox.ac.uk/biolinux.html

(58)
(59)
(60)

Normalisation

Intrachip and Interchip

E.g. experiment with Affymetrix arrays: may need to normalise regionally (intrachip) and across chips (interchip) before the data are comparable

Experiment with cDNA arrays: normalise intrachip and interchip?

(61)

Distogram

(62)

Normalisation

(63)

Technical Issues

• Biased response of dyes

• Positional bias of spots

• Bias due to gene sequence

• Inconsistencies between batches of chips

 Quality Control

Remove using Lowess
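
As an illustration of lowess-based correction, the base R sketch below removes an intensity-dependent trend from simulated two-colour data using `lowess()` from the stats package. The simulated bias, the smoothing span `f = 0.3` and the variable names are assumptions for this example only.

```r
set.seed(99)

# Simulated two-colour data: A = average log intensity, M = log ratio.
n <- 2000
A <- runif(n, min = 6, max = 14)
# Intensity-dependent dye bias (curved trend) plus noise; no true differential expression.
M <- 0.05 * (A - 10)^2 - 0.3 + rnorm(n, sd = 0.2)

# Fit a lowess curve of M against A and subtract it from each spot.
fit    <- lowess(A, M, f = 0.3)
trend  <- approx(fit$x, fit$y, xout = A, ties = mean)$y
M_norm <- M - trend

# The intensity-dependent trend should be largely gone after correction.
c(before = mean(abs(M)), after = mean(abs(M_norm)))

# Visual check in an interactive session:
# plot(A, M); lines(fit, col = "red"); plot(A, M_norm); abline(h = 0)
```

The correction assumes that most spots are not differentially expressed, so the fitted curve reflects bias rather than biology.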

(64)

SVD & PCA

(65)

Help Documentation

(66)

GeneSpring

Color Bar

for gene coloring (default coding:

expression level)

Genome Browser

to view expression data

Navigator

for project file management

(67)

Views

(68)

R/BioConductor

Command line statistics package

Pros:

• flexible

• lots of functionality

• high degree of control

• great plotting facilities

• promotes thinking about data

• lots of documentation and help available

Cons:

• STEEP learning curve at beginning

(69)

R/BioConductor

E.g. With Affymetrix data

Can load data at various stages

• summary values

• raw values

• transformed values

• etc.…

Can then apply relevant functions using various libraries

(70)

R/BioConductor

E.g. With Affymetrix data

>library(affy)

>listocelfiles = list.celfiles(path = "/home/user1/myfiles/", full.names = TRUE)

>myData = read.affybatch(filenames = listocelfiles)

>phenodata <- read.phenoData("phenodata.txt")

>phenoData(myData) <- phenodata

UGLY!

Mitigating factors:

Environment can be saved so you do not have to recreate objects from scratch each time

Files with sets of commands can be "sourced" so that many tasks are automatically run on starting R, or can be started up easily when in R

You can do things like this…
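
The two mitigating factors can be tried out in base R with `source()`, `save()` and `load()`. The sketch below works in a temporary directory; the file names and the toy objects `expr` and `gene_means` are hypothetical.

```r
# Work in a temporary directory so the sketch leaves no files in your project.
work_dir    <- tempfile("r_session_")
dir.create(work_dir)
script_file <- file.path(work_dir, "load_my_data.R")     # hypothetical script name
image_file  <- file.path(work_dir, "my_analysis.RData")  # hypothetical file name

# 1. Put a set of commands in a file...
writeLines(c(
  "expr <- matrix(1:6, nrow = 3, dimnames = list(paste0('gene', 1:3), c('a1', 'a2')))",
  "gene_means <- rowMeans(expr)"
), script_file)

# ...and "source" it to run them all in one go.
source(script_file, local = TRUE)

# 2. Save objects so they need not be recreated from scratch next time.
#    (save.image() would save the whole workspace instead.)
save(expr, gene_means, file = image_file)

# In a later session, load() restores them (simulated here by removing them first).
rm(expr, gene_means)
load(image_file)
gene_means
```

Commands placed in a `.Rprofile` file are run automatically when R starts, which is one way to have routine tasks run on start-up.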
