K-means Clustering Algorithm

3.2 Microarray Data Used in Examples

The National Cancer Institute’s (NCI) Developmental Therapeutics Program (DTP) has intensively studied 60 cancer cell lines (Ross et al., 2000), which are known as the NCI 60. This chapter compares the results of three independent microarray experiments which studied the gene expression patterns in 60 human cancer cell lines derived from the nine tumor types listed in Table 3.1.

Table 3.1: Nine Tumor Types from NCI 60

Tumor Type                       Number of Cell Lines
Breast Cancer                    8
Central Nervous System Cancer    6
Colon Cancer                     7
Leukemia                         6
Melanoma                         8
Non-Small Cell Lung Cancer       9
Ovarian Cancer                   6
Prostate Cancer                  2
Renal Cancer                     8

Each of the three experiments was performed by a different group and targeted the same 60 cell lines. However, the experiments included different numbers of genes and used different microarray technologies. A large number of these genes should be common to the three experiments.

The first data set is from a microarray experiment performed by Ross et al. (2000). Using the two-color complementary DNA (cDNA) design, microarrays were prepared by robotically spotting 9,703 human cDNAs on glass microscope slides. The cDNAs included approximately 8,000 unique genes. Each hybridization compared Cy5-labeled cDNA reverse transcribed from mRNA isolated from one of the cell lines with Cy3-labeled cDNA reverse transcribed from a reference mRNA sample. The reference sample, used in all of the hybridizations, was prepared by pooling equal amounts of mRNA from 12 of the cell lines. Only 6,165 genes had complete data for all 60 cell lines. These values were transformed using the usual log of the background-corrected ratio of the two channels. The investigators presented the results from an average linkage cluster analysis using Pearson’s correlation as the similarity measure. They found that cell lines with common tissues of origin tended to cluster together. Cluster analyses were repeated using different subsets of genes to assess cluster robustness. The authors concluded that the clusters appear to be reasonably robust. A major goal of this experiment was to examine the chemosensitivity of the NCI 60 to about 70,000 different chemical compounds. The chemosensitivity data have been analyzed by Ross et al. (2000) as well as in separate studies by Paull et al. (1989), van Osdol et al. (1994), and Weinstein et al. (1992, 1997) and are not discussed in this chapter.
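To make the clustering procedure concrete, the following is a minimal sketch of average linkage clustering with 1 - r (r being Pearson's correlation) as the distance between cell lines. It is an illustration under stated assumptions, not the software used by any of the investigators: the function names, the cell-line labels, and the expression values are all invented for the example. In practice a library routine would normally be used; for instance, SciPy's `scipy.cluster.hierarchy.linkage` accepts `method="average"` and `metric="correlation"`.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile has zero variance).
    """
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering with average linkage and distance 1 - r.

    profiles: dict mapping a cell-line label to its list of expression values.
    Returns a sorted list of clusters, each a sorted list of labels.
    """
    names = list(profiles)
    # Pairwise distances between individual cell lines, computed once.
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def avg_dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]  # safe because j > i
    return sorted(sorted(c) for c in clusters)


# Invented toy profiles (four genes per cell line), for illustration only.
toy = {
    "leukemia_1": [1.0, 2.0, 3.0, 4.0],
    "leukemia_2": [2.0, 4.0, 6.0, 9.0],
    "renal_1": [4.0, 3.0, 2.0, 1.0],
    "renal_2": [8.0, 7.0, 3.0, 1.0],
}
clusters = average_linkage(toy, n_clusters=2)
print(clusters)
```

In this toy example the two profiles that rise across the genes end up in one cluster and the two that fall end up in the other, which mirrors the observation that cell lines with a common tissue of origin tend to cluster together.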

The second experiment used the Affymetrix design and examined 5,611 genes for each of the 60 cell lines. This experiment was performed by the Millenium Pharmaceutical Company (http://dtp.nci.nih.gov/mtargets/millenium.html). Poly-A RNA was purified from the 60 human tumor cell lines using the Invitrogen Fast Track 2.0 System. All other steps in RNA extraction and preparation for hybridization were performed as suggested by Wodicka et al. (1997). The Affymetrix GeneChip system was used in these experiments. The Hu6000 chip design was used, consisting of 65,000 features each containing on the order of 10 million oligonucleotides designed on the basis of sequence data available from GenBank.

The oligonucleotides on the arrays were designed at Affymetrix to cover the complementary strand at the 3' end of the human genes. About 4,000 known, fully sequenced human gene cDNAs and more than 2,000 human ESTs displaying some similarity with known genes characterized in other organisms are represented in a set of four chips. Most genes are represented by 20 overlapping oligonucleotides. A homosubstitution mismatch oligonucleotide is included for each probe design. The sequences of the oligonucleotide probes on the arrays were selected based on a combination of sequence uniqueness criteria and empirical rules developed at Affymetrix for the selection of oligonucleotides. Quantitative scanning of each array and the subsequent analysis were done using the Microarray Suite 4.0 software from Affymetrix as described by Wodicka et al. (1997). The values reported by the authors are the average of the differences (signal from perfect match - signal from mismatch) after discarding the maximum, the minimum, and any outliers beyond three standard deviations from the mean for the perfect match oligonucleotides. Values less than zero represent measurements for which the mismatch oligonucleotide gave a greater signal than the perfect match oligonucleotide. Clustering results using the average linkage method and Pearson correlation as the measure of similarity were reported. The investigators found that cell lines with common tissues of origin tended to cluster together.
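The averaging rule described above can be sketched as follows. This is only one possible reading of the description, not the Microarray Suite 4.0 implementation: the order of the trimming steps (drop the extreme differences first, then apply the three-standard-deviation rule to the remaining differences) is an assumption, and the function name and probe intensities are invented for illustration.

```python
from statistics import mean, stdev


def average_difference(pm, mm, n_sd=3.0):
    """Trimmed average of perfect match minus mismatch signals for one gene.

    pm, mm: equal-length lists of probe-pair intensities (at least 4 pairs).
    Assumed order of steps: drop the smallest and largest difference, then
    drop any remaining difference more than n_sd standard deviations from
    the mean of what is left, then average the survivors.
    """
    diffs = sorted(p - m for p, m in zip(pm, mm))
    trimmed = diffs[1:-1]  # discard the minimum and the maximum
    center, spread = mean(trimmed), stdev(trimmed)
    kept = [d for d in trimmed if abs(d - center) <= n_sd * spread]
    return mean(kept)


# Invented intensities: one probe pair gives an extreme difference of 400.
pm = [120, 130, 125, 500, 110, 128]
mm = [100, 100, 100, 100, 100, 100]
expressed = average_difference(pm, mm)

# When mismatch probes are brighter than perfect match probes, the result
# is negative, as noted in the text.
below_zero = average_difference([90, 95, 92, 91], [100, 100, 100, 100])
print(expressed, below_zero)
```

The second call shows how a negative reported value arises: every mismatch signal exceeds the corresponding perfect match signal, so the average difference falls below zero.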

The third experiment also used the Affymetrix design and collected data from 7,129 genes for each of the 60 cell lines. This experiment was reported by a group at the Massachusetts Institute of Technology (Staunton et al., 2001). Poly-A selected RNA from each cell line was used to prepare biotinylated cRNA targets. These targets were hybridized to Affymetrix high-density Hu6800 microarrays, washed, stained with phycoerythrin-conjugated streptavidin, and signal amplified using biotinylated anti-streptavidin antibodies.

Expression values were calculated using Affymetrix’s Microarray Suite 4.0 software. Measurements below 100 units were set to 100 units. Setting the threshold in this manner could create a systematic artifactual bias in the distribution of the signals, because every value below the threshold is replaced by the same number.
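The effect of the 100-unit floor can be illustrated with a small sketch. The raw values are invented for illustration, and `floor_expression` is a hypothetical helper, not part of any Affymetrix software.

```python
def floor_expression(values, floor=100.0):
    """Replace every measurement below `floor` with `floor`."""
    return [max(v, floor) for v in values]


# Invented raw expression values, including a negative average difference.
raw = [-35.0, 12.0, 80.0, 99.0, 100.0, 250.0, 1200.0]
floored = floor_expression(raw)

# Share of measurements sitting exactly at the floor after thresholding.
share_at_floor = sum(v == 100.0 for v in floored) / len(floored)
print(floored)
print(round(share_at_floor, 2))
```

Here four distinct low values and one genuine reading of 100 become indistinguishable, so a large share of the measurements piles up at a single value. A point mass of this kind is what could distort the distribution of the signals.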

The authors reported results from an average linkage cluster analysis and found that the cell lines with similar tissues of origin tended to cluster together. Most of the paper by Staunton et al. (2001) focuses on chemosensitivity data, which are not discussed in this chapter.

Notice that the three experiments involve different numbers of genes. There is not enough information in the publicly available data files to match up the common genes in the experiments. However, the 60 cell lines are easily matched. To our knowledge, no systematic study of the effect of cluster labeling on clustering method agreement measures has been reported for any of the three experiments. This chapter compares the labeling effects on clusters of cell lines.
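Matching the cell lines across the three data sets amounts to aligning their labels. The sketch below assumes, hypothetically, that the same cell line may be spelled with different case or punctuation in different files; the normalization rule, the function names, and the example spellings are illustrative assumptions and do not describe the actual data files.

```python
def normalize(label):
    """Hypothetical normalization: upper-case, keep only letters and digits."""
    return "".join(ch for ch in label.upper() if ch.isalnum())


def match_cell_lines(*label_lists):
    """Return {normalized label: tuple of original spellings, one per data set}
    for the cell lines present in every data set."""
    keyed = [{normalize(label): label for label in labels} for labels in label_lists]
    common = set(keyed[0]).intersection(*keyed[1:])
    return {key: tuple(k[key] for k in keyed) for key in sorted(common)}


# Invented spellings of a few cell-line labels in three hypothetical files.
set_one = ["MCF7", "HL-60", "SK-MEL-5"]
set_two = ["mcf7", "HL60", "SK_MEL_5"]
set_three = ["MCF-7", "hl-60", "A549"]

matched = match_cell_lines(set_one, set_two, set_three)
print(matched)
```

Only the labels found in all three lists are returned, each with its original spelling in every data set, so the columns of the three expression matrices can be put in the same order before clusterings are compared.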