Basic Analysis of Microarray Data

A User Guide and Tutorial

Scott A. Ness, Ph.D.

Co-Director, Keck-UNM Genomics Resource and Dept. of Molecular Genetics and Microbiology, University of New Mexico HSC

Tel. 505-272-9883 e-mail: ness@unm.edu

web: http://hsc.unm.edu/som/micro/scott.html

Introduction

Microarrays provide biological scientists with a powerful new tool for simultaneously analyzing the expression of many thousands of genes. However, the data sets generated by these experiments are large and complex. This document provides the basic information and background necessary for biological scientists to successfully perform and understand the results of microarray experiments. The goal is to provide biological scientists with the information necessary to design their microarray experiments in a way that will provide the best possible data, and to describe the basics of analyzing microarray data in a way that fits the design of biological experiments.

Overview

This document is divided into two sections. Section 1 describes the basics of microarray data and its analysis, including the differences between raw and normalized data, the concepts behind cluster analysis and the basics of experimental design as applied to microarray experiments. Section 2 provides a tutorial, with sample data, for learning the basics of GeneSpring.

Other sources of information

For Affymetrix users, the Affymetrix manual for Data Mining Tool (DMT) contains a tutorial that can be used either with sample data or with your own data. The DMT tutorial provides an excellent overview of clustering methods and algorithms and statistics that can be applied to microarray data.

Silicon Genetics provides excellent support including detailed manuals that can be downloaded from their web site (http://www.silicongenetics.com/cgi/SiG.cgi/Support/GeneSpring/support.smf). You must register with them to download the PDF manual. Detailed on-line help is also available from within GeneSpring.

Section 1: Basics of Microarray Data Analysis

This section is a general overview of issues that affect microarray experiments and the analysis of the resulting data. It should be a good introduction for anyone beginning to analyze their own data, or preparing to design experiments for microarray analysis.

Types of microarrays and microarray data

The two basic types of microarrays are Affymetrix GeneChips, which have short (25-mer) oligonucleotides synthesized directly on the glass, and spotted arrays, which are made by dipping pens into a concentrated DNA solution and physically depositing the DNA on a microscope slide. There are some other variants on these approaches as well, such as specialized ink-jet printers that can be used to synthesize oligos directly on a solid support, but they are not yet in widespread use.

Affymetrix GeneChips

The Affymetrix microarrays have a number of advantages. First, the features (DNA spots) are extremely uniform and very close together. As a consequence, a single GeneChip can contain more than 400,000 different DNA spots, which allows these commercial chips to contain a large number of controls and multiple (16-20) spots for each gene. Although there have been some complaints of lot-to-lot variability, results with Affymetrix GeneChips are, for the most part, highly reproducible.

The Affymetrix microarrays use short (25-mer) oligonucleotides. Up to 20 perfect match oligos are present for each gene, as well as an equal number of corresponding mismatch oligos that have a single nucleotide substitution. Thus, the Affymetrix system is very good at detecting single nucleotide differences. The disadvantage is that allelic differences could lead to changes in the data. However, in general, the Affymetrix system works extremely well, is very user friendly and is highly recommended as a starting point for investigators beginning their first microarray experiments.

Spotted arrays

Custom, or spotted, arrays have two major advantages: cost and flexibility. Once the DNA samples have been prepared, spotted arrays can be produced for a very low cost – essentially just the price of the glass slides and the labor involved. In addition, spotted arrays can be designed to contain only the genes of interest. The disadvantage is that the DNA must be purchased (in the case of oligonucleotides) or prepared by PCR and purification (for cDNA clones). These can be costly and problematic procedures.

Using longer probes (cDNAs or longer oligonucleotides) provides more specificity in the hybridization, and limits the effects of single nucleotide polymorphisms or allelic differences. However, the DNA spots made by spotters are much larger than the features present on an Affymetrix array. As a consequence, there are many fewer spots per slide, and many fewer controls. For example, with spotted arrays it is rare to have more than 15,000 spots per microscope slide, or to have more than duplicate spots for each gene. This makes spotted arrays more prone to reproducibility problems, especially chip-to-chip variation. In contrast, the Affymetrix arrays have 16 or more spots per gene, and use statistical comparisons of the results to determine the expression level.

Data normalization

In order to compare gene expression results, it is necessary to normalize the microarray data, and there are two basic ways this is done. Per-chip normalization is essentially a type of scaling to adjust the total or average intensity of each array. Per-gene normalization compares the results for a single gene across all the samples.

Scaling or per-chip normalization

Per-chip normalization is extremely useful to help eliminate minor differences in probe preparation, hybridization conditions, etc. Essentially, this is like turning the sensitivity of the scanner up or down, or adjusting the brightness of the laser, so that each sample has the same overall intensity. Usually, the adjustments are made to set the average fluorescence intensity to some standard value, so that all the intensities on the chip go up or down to a similar degree. This approach makes sense if the samples are all similar, e.g. all from the same types of cells or tissues. However, this type of normalization will obscure some aspects of the data, such as whether the RNA samples or the probe preparation steps were equivalent for each sample. Thus, without scaling, “bad” samples will be obviously dimmer across the board. With scaling, “bad” samples will be more difficult to detect. In addition, for samples of poor quality, where relatively few probes yield a detectable signal, those values that are detected will be amplified disproportionately so that they represent too great a fraction of the average total fluorescence.

With the Affymetrix system, our standard approach is to turn scaling on, and to set the average fluorescence for each GeneChip to 500. After analysis of the data it is important to check the scaling factors that the software used. The best samples will have scaling factors less than 20. Poor quality samples will have high scaling factors, often greater than 100. Comparing the scaling factors is one of the best ways of judging the quality of Affymetrix data when scaling has been used. In general, samples with high scaling factors should be avoided.
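
The scaling idea can be illustrated with a minimal Python sketch. It assumes a plain arithmetic mean over all intensities on a chip; the calculation performed by the Affymetrix software may differ in detail (for example, in how outliers are handled). The function names, the sample intensities and the "intermediate" label are hypothetical, and the cutoffs of 20 and 100 are simply the rough guides given in the text.

```python
from statistics import mean

TARGET_INTENSITY = 500.0  # target average intensity used in this guide


def scale_chip(intensities, target=TARGET_INTENSITY):
    """Scale one chip so that its mean intensity equals the target.

    Returns the scaled values and the scaling factor that was applied.
    Simplification: uses a plain mean of all values on the chip.
    """
    chip_mean = mean(intensities)
    if chip_mean <= 0:
        raise ValueError("chip mean must be positive")
    factor = target / chip_mean
    return [value * factor for value in intensities], factor


def quality_label(factor):
    """Rough quality call based on the guide's rules of thumb."""
    if factor < 20:
        return "good"
    if factor > 100:
        return "poor"
    return "intermediate"


# Hypothetical raw intensities for a bright chip and a very dim chip.
bright_chip = [400.0, 800.0, 1200.0, 1600.0]  # mean 1000
dim_chip = [1.0, 2.0, 3.0, 4.0]               # mean 2.5

scaled_bright, bright_factor = scale_chip(bright_chip)
scaled_dim, dim_factor = scale_chip(dim_chip)

print(bright_factor, quality_label(bright_factor))
print(dim_factor, quality_label(dim_factor))
```

After scaling, both chips have the same average intensity, so the dim chip no longer looks dim. Only its large scaling factor reveals the problem, which is why the scaling factors should always be checked.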

Per gene normalization

The goal of microarray experiments is to identify the genes whose expression changes under different conditions. Per-gene normalization is necessary to compare the relative gene expression profiles of genes that may be expressed at very different absolute levels. An example of this is shown in the figures below where the samples appear along the X-axis, expression levels are on the Y-axis, and each line represents a different gene. Here, microarrays were used to follow changes in the expression of more than 18,000 human genes. The data have been filtered to show only the 9 genes that were induced the most in samples 5 and 6. In the figure at left, the normalized data were plotted as fold change. All 9 genes show a similar pattern, and the lines representing the different genes cluster together nicely. In the figure at right, the raw data were plotted for the same 9 genes. Although the pattern is still discernible, the genes do not appear to be clustered or expressed in the same pattern.
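
A small Python sketch shows why per-gene normalization makes profiles comparable. It assumes each gene is divided by its own median across all samples; this is one common choice and not necessarily the setting used for the figures. The two gene profiles are hypothetical.

```python
from statistics import median


def normalize_per_gene(profile):
    """Express each value as a fold change relative to the gene's median.

    Assumption: the median across all samples is used as the baseline.
    """
    baseline = median(profile)
    return [value / baseline for value in profile]


# Two hypothetical genes, both induced 5-fold in samples 5 and 6,
# but expressed at very different absolute levels.
high_gene = [1000.0, 1000.0, 1000.0, 1000.0, 5000.0, 5000.0]
low_gene = [10.0, 10.0, 10.0, 10.0, 50.0, 50.0]

print(normalize_per_gene(high_gene))
print(normalize_per_gene(low_gene))
```

The raw values differ by a factor of 100, yet after normalization the two profiles are identical, which is what allows the lines to overlap when plotted as fold change.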

Clustering and comparisons

Clustering algorithms work by finding genes that have similar patterns of gene expression. In the example shown above, the 9 genes would form a nice cluster, because all are induced more than 4-fold, relative to the other samples, in samples 5 and 6. As a consequence, all clustering methods require normalized data. Therefore, the results are highly dependent on how the data were normalized, and what data are included in the clustering analysis.
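
The dependence of clustering on normalization can be seen with a simple distance calculation. The sketch below is illustrative only: it uses Euclidean distance as the similarity measure and median-based per-gene normalization, both of which are assumptions, and the gene profiles are hypothetical.

```python
import math
from statistics import median


def per_gene_normalize(profile):
    """Divide each value by the gene's median (assumed baseline)."""
    baseline = median(profile)
    return [value / baseline for value in profile]


# Two hypothetical genes with the same pattern at different absolute levels.
gene_a = [1000.0, 1100.0, 900.0, 1000.0, 4000.0, 4400.0]
gene_b = [10.0, 11.0, 9.0, 10.0, 40.0, 44.0]

# Distance between the raw profiles is dominated by the absolute levels.
raw_distance = math.dist(gene_a, gene_b)

# Distance between the normalized profiles reflects only the pattern.
normalized_distance = math.dist(
    per_gene_normalize(gene_a), per_gene_normalize(gene_b)
)

print(raw_distance, normalized_distance)
```

With raw data a distance-based method would place these two genes far apart, while with normalized data they are essentially indistinguishable and would fall into the same cluster.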

The forest vs. trees problem

The biggest problem with clustering is that it appears to offer a “one-click” solution to finding interesting genes. However, if too many genes are analyzed at once, the clusters become meaningless. This is a type of “forest vs. trees” problem. The presence of too many irrelevant genes obscures the changes in a few and makes them difficult to detect. For example, consider the results of K-means clustering shown below. In each case, the default clustering algorithms in GeneSpring were used, set on 5 clusters and 100 iterations. At left is the clustering performed on all 18,000 genes. At right, the same clustering method was used on 157 genes after filtering for those that were more than 2.5 fold induced and expressed at a value of at least 1200 in two or more samples.

The clusters on the left are very large (>2500 genes each) and do not distinguish different functional groups of genes. In this case, the “forest” of unchanging genes has obscured the fact that some genes are changing dramatically.

At right, the genes are clearly grouped into small clusters (<100 genes) with similar expression patterns. In particular, the cluster at top left looks very similar to the 9 genes shown in the previous figure. The use of filtering has removed the irrelevant genes, and permitted the formation of useful clusters. This is an example of why filtering is so important.

Defining thresholds

Filtering the data is important for focusing on the most important changes. However, it also introduces some bias. In the example shown above, the filters eliminated all but 157 out of more than 18,000 genes. The thresholds that were chosen were completely arbitrary (>2.5 fold induced in two or more samples, and expressed above 1200 in two or more samples), so this approach could eliminate some genes that really are regulated in an interesting fashion, but that are expressed at very low levels or that are induced to a lesser extent.
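
The bias introduced by thresholds can be made concrete with a short Python sketch. The filter below applies the two criteria described above (more than 2.5-fold induced in two or more samples, and expressed at 1200 or more in two or more samples). The gene names and values are hypothetical, and the function is an illustration rather than a reproduction of the GeneSpring filter.

```python
def passes_filters(raw, fold, min_fold=2.5, min_raw=1200.0, min_samples=2):
    """Return True if a gene meets both the fold and expression criteria."""
    induced = sum(1 for value in fold if value > min_fold)
    expressed = sum(1 for value in raw if value >= min_raw)
    return induced >= min_samples and expressed >= min_samples


# Hypothetical genes: raw intensities and fold changes for four samples.
genes = {
    "gene_strong": {"raw": [500, 480, 2600, 2500], "fold": [1.0, 0.96, 5.2, 5.0]},
    "gene_low_level": {"raw": [50, 55, 400, 420], "fold": [1.0, 1.1, 8.0, 8.4]},
    "gene_modest": {"raw": [1500, 1400, 3000, 3100], "fold": [1.0, 0.93, 2.0, 2.07]},
}

# Thresholds from the text.
kept_strict = [
    name for name, data in genes.items()
    if passes_filters(data["raw"], data["fold"])
]

# Looser, equally arbitrary thresholds.
kept_loose = [
    name for name, data in genes.items()
    if passes_filters(data["raw"], data["fold"], min_fold=1.8, min_raw=300.0)
]

print(kept_strict)
print(kept_loose)
```

With the strict thresholds only one gene survives. The strongly induced but weakly expressed gene and the modestly induced gene are both lost, even though either could be biologically interesting.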

The Importance of Replicates

The experiments shown above were performed with four sample types or treatment groups, performed in two independent, replicate experiments. As a result, the analysis can be set up to look specifically for genes whose expression changes in a similar manner in both replicates of a particular type, for example, genes that are induced in both samples 5 and 6 (top left cluster in right panel, above). Researchers planning microarray experiments should plan to analyze at least two independent replicates (i.e. the complete experiment is done twice, generating two independent RNA samples, and independent probes are hybridized to different microarrays) for each treatment group.
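
The value of replicates can be sketched in Python as a simple consistency check. The group names, the assignment of samples to groups, the 2-fold cutoff and the gene profiles are all hypothetical; the only point taken from the text is that a gene should change in both replicates of a treatment group.

```python
# Hypothetical layout: four treatment groups, two replicates each.
# Indices are zero-based, so (4, 5) corresponds to samples 5 and 6.
REPLICATE_GROUPS = {
    "group_A": (0, 1),
    "group_B": (2, 3),
    "group_C": (4, 5),
    "group_D": (6, 7),
}


def consistently_induced(fold_changes, group, min_fold=2.0):
    """Return True only if every replicate in the group reaches min_fold."""
    return all(fold_changes[index] >= min_fold for index in REPLICATE_GROUPS[group])


# Hypothetical fold changes across the eight samples.
reproducible_gene = [1.0, 1.1, 0.9, 1.0, 4.2, 3.8, 1.0, 1.2]
one_off_gene = [1.0, 1.0, 1.1, 0.9, 4.5, 1.1, 1.0, 1.0]

print(consistently_induced(reproducible_gene, "group_C"))
print(consistently_induced(one_off_gene, "group_C"))
```

The second gene looks strongly induced in sample 5, but the change is not reproduced in sample 6. Without the replicate there would be no way to tell these two genes apart.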

Section 2 – Data Analysis Tutorial

This step-by-step guide is meant as a general description for researchers using the Affymetrix system to generate data, and then using GeneSpring for the analysis. This tutorial outlines the basic steps, and should be especially helpful for new users. The thresholds and limits provided here have been used successfully, but are meant only as a starting point. Each user should adjust these for their own purposes.

Experimental Design

This tutorial will use Affymetrix data from six samples, two replicates each of three treatment groups. The samples are RNAs prepared from MCF7 cells infected with various adenoviruses. Two samples each were from uninfected cells or cells infected with adenoviruses expressing c-Myb or a control adenovirus.

Affymetrix Analysis

The cRNA probes were made according to Affymetrix protocols, and were analyzed using Affymetrix U95A GeneChips. Analysis conditions were set with scaling turned on, and the average intensity set to 500. No other analysis was performed with the Affymetrix software. The data was exported by choosing File>Save As in Microarray Suite, to generate a tab-delimited text file named “TutorialPivotData.txt”.

Importing Data into GeneSpring

Use the following steps to import the data into GeneSpring:

1. Start GeneSpring. The default data set (usually yeast) should load.

2. From the File menu choose “Autoload Experiment”. Browse to the file (TutorialPivotData.txt) and choose “open”.

3. GeneSpring should open a Format Verification dialog box asking whether the data is in Affymetrix Pivot Table format. Choose yes.

4. The “Select or Create Genome” dialog box will open. In GeneSpring, genomes are lists of genes. At this point you can choose an existing genome or create a new one. If you want to compare this data to previous data, choose the genome of the previous data. For this example, enter “Tutorial Genome” as the Genome Name, then click on “Create New Genome”. GeneSpring will open a box to confirm that you want to create a new genome – choose yes.

5. After a short wait a new window should appear as well as a dialog box asking if there are additional files. For this example choose “No, load only this file”.

6. A new GeneSpring window will appear, then a window labeled “Choose Experiment Name”. Change the name to “Tutorial Experiment” and click on Save at the bottom of the window. The default yeast window should still be open. Close the yeast window and expand the new Tutorial Experiment window. The new window should look like the one at right. Each little box represents a gene. The colors indicate relative expression, according to the color bar at the right. At this stage, the data has been successfully imported and is ready for preliminary analysis.

Preliminary GeneSpring Analysis

Every time new data is loaded into GeneSpring, three additional steps must be performed before the data can be analyzed in a meaningful way. The three steps are all in the Experiments menu.

Change Experiment Parameters

From the Experiments menu, choose Change Experiment Parameters. A window appears showing the six samples with their respective labels. At this point, additional labels can be added as new parameters, or the samples can be reordered. It is important to note the sample numbers (Sample 1, Sample 2, etc.) do not change, even if the samples are reordered. Click at the top to highlight the column containing the sample names, then click on the button marked “Set Value Order”. In the next window, check that the order is OK, or use the buttons to modify it, then select OK and Save.

Experiment Normalizations

From the Experiments menu, choosing Experiment Normalizations opens a window that controls the normalization settings. This is probably the most important window in GeneSpring, as these settings affect all the subsequent analyses and results. Since the Affymetrix system has already ‘scaled’ our data, change the setting in the “Per Chip Normalization” section to “No per-sample normalization”.

Rather than normalize to the mean of all the samples, we will designate two samples to which all the others should be compared. In this case, Sample 1 and Sample 2 are the cells infected by the control adenovirus. In the ‘Per Gene Normalizations’ section, select the button next to “Use Samples” and enter the sample numbers, separated by commas (i.e. “1,2”) in the box. Click on Save at the bottom to complete the process.

When the normalization window closes, the colors in the main window will be changed, mostly to yellow. Unless the default colors have been changed, yellow indicates unchanged expression. With the samples normalized properly, most genes will not be changed, and will have a yellow color.

However, moving the scroll bar from left to right at the bottom of the window will show the colors, and relative

Change Experiment Interpretation

From the Experiments menu, choose “Change Experiment Interpretation”. Here, make sure that the top pull-down menu shows “Ratio (signal/control)”.

For experiments with few or no replicates, such as this one, you should select the button marked “Use Global Error Model”. Make sure that only one parameter is set as “Continuous Element” (usually the first) and that the others are set on “Replicate”. If everything is set correctly, the window should look like the one at right. Click on Save to continue.

Now the data has been loaded, labeled, normalized and ordered and is ready for more detailed analyses.

Data Filtering

For our analysis, we will first limit ourselves to the genes that change appreciably. Using filters, we will eliminate genes that don’t change or that are expressed at background levels.

From the Tools menu select “Filtering and Statistical Analysis”. The window that opens does not look like much, but is probably the most powerful analysis tool in the GeneSpring program. Click on the little “+” sign next to Experiments to expand it and make “Tutorial Experiment” visible.

Identify expressed genes

Right-click (Ctrl-click on Macs) on the “Tutorial Experiment” label, then choose “Add Expression Percentage Restriction” from the pop-up menu. The Expression Level Percentage Restrictions dialog box will open. For our first filter, we will restrict our analyses to the genes that are expressed at a level of at least 1200 in any two samples. Since all of our samples were done in replicate, any genes that are truly expressed above background should be above 1200 in at least two samples. Type 1200 in the “Minimum” box, then type a 2 in the “In at least X out of a total of 6 conditions” box, then choose “Raw Data” from the “Restriction applies to” pull-down menu, then click on OK. At the top of the “Filtering and Statistical Analysis” window, it should now indicate that only “1,838 of 12,625 genes pass restrictions”. This filter eliminated about 85% of the genes, namely those that were not expressed at significant levels in these cells. Save this gene list by clicking the “Make List” button. In the “New Gene List” window, name the list “1838 expressed genes” and click on Save.

Check the global quality of the samples

This is an excellent time to check the quality of the samples, using the 1838 expressed genes as a guide. However, this requires some bias, based on the biology. In our experiment, we expect the replicates to be very similar. Indeed, we expect that most genes will not change expression in this experiment. The whole point is to identify the subset of genes that do change. The best way to detect samples of poor quality is to look at global expression patterns with a gene tree.

With the 1838 expressed genes list selected, click on the Tools menu and choose Clustering. The Clustering window should appear and should indicate at the top that the Genes to Use will be “Choose from Genes: 1838 expressed genes”. In the middle of the window, for “Clustering Method” click on the menu and choose “GeneTree” then click Start at the bottom.

After a few moments, the Name New Gene Tree window should appear. Rename this “1838 expressed genes” then click Save at the bottom. Click on Close in the Clustering window to return to the main window, which should show a gene tree like the one shown below.

This figure summarizes the expression of all 1838 genes. The rows going across are the six samples. Each column represents a gene. The expression scale is shown in the colorbar at right with yellow unchanged, red induced and blue repressed, relative to the controls.

Remember that the data were normalized to the mean of samples 1 and 2. Nevertheless, Sample 2 has a lot of blue, as does sample 5, second from the bottom. In fact, these two samples are almost mirror images of the others. This is highly suggestive that something is wrong with those samples.

Based on this result, we checked the scaling factors for these two samples, and found that they were very high (>60). Apparently, something about the probe synthesis or hybridization affected the quality of these samples.

Since the quality of sample 2 is in doubt, it should, in hindsight, definitely NOT be used for normalizing the other samples.
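
The visual check described above can be complemented by a numerical one. The following Python sketch is not the gene tree method; it swaps in a simpler technique, computing the Pearson correlation between samples across genes and flagging any sample that is, on average, negatively correlated with the others (a “mirror image”). The sample names, the expression values and the cutoff of zero are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)


# Hypothetical normalized expression of five genes in four samples.
samples = {
    "sample_1": [1.0, 2.0, 0.5, 1.5, 0.8],
    "sample_2": [1.0, 0.5, 2.0, 0.7, 1.3],  # pattern opposite to the others
    "sample_3": [1.1, 1.9, 0.6, 1.4, 0.9],
    "sample_4": [0.9, 2.1, 0.5, 1.6, 0.7],
}


def mean_correlation_with_others(name):
    """Average correlation between one sample and all other samples."""
    others = [other for other in samples if other != name]
    return sum(pearson(samples[name], samples[o]) for o in others) / len(others)


# Flag samples whose average correlation with the rest is negative.
suspect_samples = [
    name for name in samples if mean_correlation_with_others(name) < 0
]
print(suspect_samples)
```

A sample flagged this way is only a candidate for closer inspection. As in the tutorial, the scaling factors should be checked before deciding that the sample is bad.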

Remove bad data

The bad samples can be left, or removed. In this case, we will remove them by using the Merge/Split Experiments window to separate the good ones. From the Experiments menu choose “Merge/Split Experiments”. Click on the little “+” to expand the “Tutorial Experiment” label, then choose each of the “good” data sets, Samples 1, 3, 4 and 6 and click the “Add” button for each. When they are all added to the list at right, click on OK.

An Experiment Parameters window will open. Highlight the column with the sample names, click on Set Value Order then OK and Save. A Choose Experiment Name window will appear. Name this experiment Good Tutorial then click Save. With the bad data removed, the gene tree should appear as shown at right.

Renormalize the data

With the bad samples removed, the data should be renormalized. From the Experiments menu choose Experiment Normalizations. At the bottom right of the window, uncheck the box marked “Start with normalized data”. Near the top, in the “Per Spot Normalizations” area, enter 0.1 in the “Use values over” box. Make sure the “No per-sample normalization” button is checked in the middle of the window. Finally, in the “Per Gene Normalizations” section, click on “Use samples”, enter “1,4” in the box and click on Save at the bottom. Now the data will be normalized to the mean of the remaining uninfected sample (#1) and the remaining control virus sample (#4).
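
The effect of these settings on a single gene can be sketched in Python. Two assumptions are made here and should be treated as such: that the “Use values over” setting acts as a floor (raw values below 0.1 are raised to 0.1), and that the per-gene baseline is the arithmetic mean of the chosen control samples. The function name and the raw values are hypothetical.

```python
from statistics import mean


def normalize_to_controls(raw_by_sample, controls, floor=0.1):
    """Normalize one gene to the mean of its control samples.

    raw_by_sample maps sample number -> raw value for a single gene.
    Assumption: values below `floor` are raised to `floor` first.
    """
    floored = {
        sample: max(value, floor) for sample, value in raw_by_sample.items()
    }
    baseline = mean([floored[sample] for sample in controls])
    return {sample: value / baseline for sample, value in floored.items()}


# Hypothetical raw values for one gene in the four remaining samples.
raw_by_sample = {1: 800.0, 3: 2400.0, 4: 1200.0, 6: 0.0}

normalized = normalize_to_controls(raw_by_sample, controls=(1, 4))
print(normalized)
```

Here the baseline is the mean of samples 1 and 4, so sample 3 appears as a 2.4-fold induction. The floor keeps the zero in sample 6 from producing a normalized value of exactly zero, which would be awkward on a logarithmic scale.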

Identify induced genes

Use filtering to identify induced genes. From the Tools menu, select Filtering and Statistical Analysis. Click on the little “+” signs to expand the Gene Lists and Experiments lists at top left. Click once on the Gene List “1838 expressed genes”, then right-click (or Ctrl-click) on the “Good Tutorial” label, and select Add Expression Percentage Restriction from the pop-up menu. Type 2 in the “Minimum” box, put a 2 in the number of samples box, and make sure that the pull-down menu shows “Normalized Data”, then click OK. Now only 13 of 1,838 genes should pass the restrictions. These 13 genes are induced at least 2-fold in at least two samples, and are also in the 1838 expressed gene list.

Save this gene list by clicking the “Make List” button. In the “New Gene List” window, name the list “13 induced genes” and click on Save.

Identify repressed genes

Now we will look for repressed genes. From the “Filtering and Statistical Analysis” window, remove the previous filters, then right-click (Ctrl-click) on the “Good Tutorial” label and choose “Add Expression Percentage Restriction” as before. Add a restriction, this time setting the MAXIMUM to 0.4 in 2 samples, looking at normalized data. This will restrict the data to those genes that went down to 0.4-fold or less in at least two samples, compared to the controls. This filter should limit the set to 12 genes. Click on Make List and name this set “12 repressed genes”.
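
The logic of the induced and repressed filters described above can be sketched in Python. This is an illustration of the idea, not the GeneSpring implementation; in particular, treating the minimum and maximum as inclusive bounds is an assumption, and the gene profiles are hypothetical.

```python
def meets_restriction(values, min_count, minimum=None, maximum=None):
    """Return True if at least min_count values fall within the bounds.

    Either bound may be omitted. Bounds are treated as inclusive here,
    which is an assumption.
    """
    def in_range(value):
        above = minimum is None or value >= minimum
        below = maximum is None or value <= maximum
        return above and below

    return sum(1 for value in values if in_range(value)) >= min_count


# Hypothetical normalized values for three genes in four samples.
induced_gene = [1.0, 2.5, 1.0, 3.1]
repressed_gene = [1.0, 0.3, 1.1, 0.35]
unchanged_gene = [1.0, 1.1, 0.9, 1.0]

for gene in (induced_gene, repressed_gene, unchanged_gene):
    is_induced = meets_restriction(gene, min_count=2, minimum=2.0)
    is_repressed = meets_restriction(gene, min_count=2, maximum=0.4)
    print(is_induced, is_repressed)
```

The same function covers both filters: a minimum of 2 in two samples selects induced genes, and a maximum of 0.4 in two samples selects repressed genes.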

Look for specific patterns

Now we will look specifically for genes that are up-regulated in our replicate samples infected by the c-Myb adenovirus. Use the Remove button to remove any previous restrictions. At the top left, expand the Gene List window, and choose the 1838 expressed genes list. Click on the little “+” next to the “Good Tutorial” label to expand it, and then again on the “+” next to “Default Interpretation”. This should show a list of all the samples, as shown at right.

Right-click (Ctrl-click) on the first c-Myb sample and choose “Add Expression Restriction” from the pop-up menu. Enter 3 in the Minimum box, make sure the pull-down menu shows “Normalized Data” then click on OK. Repeat this for the second c-Myb sample. The “Filtering and Statistical Analysis” window should indicate that only “3 of 1,838 genes pass restrictions”. Click on the “Make List” button and name this list “3 Myb induced” in the New Gene List window. Close the “Filtering and Statistical Analysis” window by clicking on Cancel at the bottom.

In the main window, click on the View menu and choose Graph. You should have a window like the one shown at right, displaying only the 3 Myb induced genes in the most recently-saved gene list. In the main window, at the top under Gene Lists, click on the list labeled “13 induced genes”. Note that the main window displays only the genes in the list selected at the left.

Combine Lists with Venn Diagram

In the main window, right-click (Ctrl-click) on the name “13 induced genes” in the Gene List panel at upper left and choose “Venn Diagram” and then “Left (Red)” in the pop-up menus. Do the same for the 3 Myb induced and 12 repressed gene lists, choosing “Right (Green)” and “Bottom (Blue)”, respectively. Then click once on the All Genes label at upper left to populate the Venn Diagram with all the genes. You should get a diagram like the one at right. The Venn Diagram is an intuitive way of comparing gene lists, showing how many genes are in common. In this case, the 13 induced genes list also includes all 3 of the genes in the Myb induced list. The 12 repressed genes list does not overlap with the others. Right-click (Ctrl-click) in the middle of the Venn Diagram and choose “Make list of genes in any list” from the pop-up menu. Save this list with the name “25 regulated genes”.
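
The Venn Diagram comparison corresponds to ordinary set operations, as the following Python sketch shows. The gene identifiers are placeholders invented for illustration; only the list sizes and the overlaps follow the tutorial.

```python
# Placeholder gene identifiers; only the list sizes match the tutorial.
induced = {f"gene_{i:02d}" for i in range(1, 14)}     # 13 induced genes
myb_induced = {"gene_01", "gene_02", "gene_03"}       # 3 Myb induced genes
repressed = {f"gene_{i:02d}" for i in range(14, 26)}  # 12 repressed genes

# Genes shared between lists (the overlapping regions of the diagram).
induced_and_myb = induced & myb_induced
induced_and_repressed = induced & repressed

# "Make list of genes in any list" is the union of the three sets.
regulated = induced | myb_induced | repressed

print(len(induced_and_myb), len(induced_and_repressed), len(regulated))
```

Because the 3 Myb induced genes are all contained in the 13 induced genes, and the repressed list does not overlap with either, the union contains 13 + 12 = 25 genes.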

From the main window, in the Colorbar menu, choose “Color by Expression” to hide the Venn Diagram.

Clustering

With the 25 regulated genes list selected, click on the Tools menu and choose Clustering. The Clustering window should appear and should indicate at the top that the Genes to Use will be “Choose from Genes: 25 regulated genes”. In the middle of the window, for “Clustering Method” click on the menu and choose “GeneTree”. Just below that for “Measure similarity by”, click on the menu and choose “Smooth Correlation”. Then click Start at the bottom.

After a moment, the Name New Gene Tree window should appear. Rename this “25 regulated genes” then click Save at the bottom. Click on Close in the Clustering window to return to the main window, which should show the gene tree at right.
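
For readers curious about what a gene tree represents, the following Python sketch builds a small tree by agglomerative (bottom-up) hierarchical clustering. It is a generic textbook approach and is not claimed to be the GeneSpring algorithm: it uses standard Pearson correlation in place of “Smooth Correlation”, and average linkage is an assumed choice. The gene names and profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)


def cluster_distance(cluster_a, cluster_b, profiles):
    """Average linkage: mean of (1 - r) over all cross-cluster gene pairs."""
    distances = [
        1.0 - pearson(profiles[a], profiles[b])
        for a in cluster_a
        for b in cluster_b
    ]
    return sum(distances) / len(distances)


def build_gene_tree(profiles):
    """Repeatedly merge the two closest clusters; return merges in order."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical normalized profiles across four samples.
profiles = {
    "induced_1": [1.0, 1.1, 3.0, 3.2],
    "induced_2": [0.9, 1.0, 2.8, 3.1],
    "repressed_1": [1.0, 0.9, 0.3, 0.35],
    "repressed_2": [1.1, 1.0, 0.4, 0.3],
}

merges = build_gene_tree(profiles)
for left, right in merges:
    print(left, "+", right)
```

The two induced genes are joined first, then the two repressed genes, and only at the last step are the two groups joined. Read from the leaves upward, this order of merges is the branching structure that a gene tree displays.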

Summary

This completes the basic tutorial and should prepare you to start analyzing your own data, and to design and interpret microarray experiments. GeneSpring has many other tools and features that are not mentioned here. Please consult their documentation for more details.

Remember, the basic steps are: (1) Import and normalize the data. (2) Do a preliminary analysis to evaluate the quality of the samples. (3) Eliminate bad samples and re-normalize, if necessary. (4) Use filters to focus on the expressed genes, and to define sets of genes that are regulated in specific, interesting patterns. (5) Use clustering to compare the identified sub-sets of genes.