Clustering Problems and Clustering Methods for Microarray Data


Hans-Hermann Bock, RWTH Aachen

bock@stochastik.rwth-aachen.de

DANK Autumn Meeting (Herbsttagung DANK), Düsseldorf, 14 November 2003

1. Gene expression data and clustering problems

2. Classical hierarchical clustering methods

3. Variance criterion and k-means clustering

4. Maximum-likelihood clustering using autoregressive time series

5. Kohonen nets, Self-Organizing Maps (SOM)

6. Clustering genes using auxiliary information

7. Simultaneous clustering of genes and samples

8. Gene shaving

9. Software for gene clustering


1. Gene expression data and clustering problems

Analysis of gene expression data is an important problem today.

Typical data set:

• n genes k = 1, ..., n from cDNA or mRNA react with RNA (DNA) in a hybridization process

• in p situations j = 1, ..., p (e.g., p samples, p time points, p tissues, ...)

Result:

• An n × p matrix X = (x_kj) of 'expression levels', where x_kj := value/intensity of gene k for situation j

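Rows correspond to genes, columns to samples:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{k1} & \cdots & x_{kj} & \cdots & x_{kp} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
=
\begin{pmatrix} x_1 \\ \vdots \\ x_k \\ \vdots \\ x_n \end{pmatrix}
= X
$$

where $x_k = (x_{k1}, \ldots, x_{kp})$ is the row of expression levels of gene $k$.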
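The n × p layout described above (genes as rows, situations as columns) can be illustrated with a small sketch. The numbers and the helper names `gene_profile` and `situation_profile` are hypothetical and serve only to fix the indexing convention.

```python
# Toy expression matrix X with n = 3 genes (rows) and p = 4 situations
# (columns). All values are made up for illustration.
X = [
    [2.0, 2.5, 3.0, 3.5],   # gene 1
    [0.5, 0.4, 0.6, 0.5],   # gene 2
    [3.0, 2.0, 1.0, 0.0],   # gene 3
]

n = len(X)       # number of genes
p = len(X[0])    # number of situations


def gene_profile(X, k):
    """Row k (1-based): expression of gene k across all p situations."""
    return X[k - 1]


def situation_profile(X, j):
    """Column j (1-based): expression of all n genes in situation j."""
    return [row[j - 1] for row in X]
```

Clustering genes then means grouping the rows returned by `gene_profile`, while clustering samples means grouping the columns returned by `situation_profile`.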

Many different versions and technologies:

• cDNA arrays

• Oligonucleotide arrays (Affymetrix, Agilent)

• Serial analysis of gene expression (SAGE)


Analysis of microarray data:

• Analysis of unknown, but conjectured (or observed), heterogeneity among samples (columns), e.g., tumor versus normal tissues, different time points, ...

• Comparing the behaviour of different genes (rows)

• Visualization of large data tables (e.g., coloured arrays)

• Looking for interesting patterns (in rows, in columns)

Clustering:

• Find clusters of similarly expressing genes (rows)

• Find clusters of similarly behaving columns, e.g., tissues, mRNAs/oligos, time points, diseases

Main goals of clustering:

– Ordering of columns and rows such that structures become evident

– Detecting unknown functions of genes (from class membership and class characteristics)

– Selection of predictive genes (one from each gene cluster)

– Prediction of survival rates (e.g., from a class-specific Cox model)

– Decreasing costs for sequencing (only one DNA per cluster!)


Various clustering methods:

Graph-theoretic methods:

Highly connected subgraphs: Hartuv et al. (1999)

Random binary graphs with an error model, cluster affinity search technique: Ben-Dor et al. (1999)

Connected components as a 'relevance network': Butte et al. (2000)
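The idea of reading clusters off the connected components of a graph can be sketched as follows. This is only an illustration, not the procedure of any of the cited papers: it assumes the absolute Pearson correlation as the pairwise association measure and a user-chosen `threshold`, and the function names are hypothetical.

```python
import math
from collections import deque


def pearson(a, b):
    """Pearson correlation of two profiles (0.0 if one is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def relevance_components(X, threshold):
    """Link rows i, j when |correlation| >= threshold and return the
    connected components of the resulting graph as lists of row indices."""
    n = len(X)
    neighbours = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(X[i], X[j])) >= threshold:
                neighbours[i].append(j)
                neighbours[j].append(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.append(node)
            for nb in neighbours[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(sorted(component))
    return components


# Toy data: rows 0 and 1 rise together, rows 2 and 3 oscillate in
# opposite phase (strong negative correlation also creates a link).
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, -1.0, 1.0, -1.0],
    [-2.0, 2.0, -2.0, 2.0],
]
components = relevance_components(X, threshold=0.9)
```

Lowering the threshold adds edges and merges components, so the choice of threshold directly controls how coarse the resulting grouping is.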

Model-based clustering methods:

Mixture of normal distributions: Alon et al. (1999)

Fixed-classification models: Yeung et al. (2001)

k-means or SSQ clustering: Tavazoie et al. (1999)

Bayesian models with an autoregressive process for time series: Ramoni et al. (2002)

Mode clustering with an estimated probability density and dimension reduction: Bonnet et al. (2002)
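For the k-means (SSQ) entry in the list above, a minimal sketch of the usual alternating scheme is given below: assign each row to its nearest class centre, then recompute each centre as the class mean, which never increases the sum-of-squares criterion. The function names and the toy data are hypothetical, and the initial centres are passed in explicitly so that the run is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(X, centres, max_iter=100):
    """Alternate nearest-centre assignment and mean update until the
    labels stop changing. Returns (labels, centres, ssq)."""
    centres = [list(c) for c in centres]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)),
                key=lambda c: squared_distance(row, centres[c]))
            for row in X
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centres)):
            members = [row for row, lab in zip(X, labels) if lab == c]
            if members:  # an empty class keeps its previous centre
                centres[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    # Variance (SSQ) criterion: within-class sum of squared deviations.
    ssq = sum(squared_distance(row, centres[lab])
              for row, lab in zip(X, labels))
    return labels, centres, ssq


# Toy data with two well-separated groups; rows 0 and 2 act as the
# initial centres.
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels, centres, ssq = kmeans(X, centres=[X[0], X[2]])
```

The result depends on the initial centres, so in practice several starts are usually compared by their SSQ value.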

Hierarchical clustering methods (mainly average-linkage agglomerative):

Classical average-linkage agglomerative clustering: Alizadeh et al. (2000), Eisen et al. (1998), Weinstein et al. (1997), Wen et al. (1998), Scherf et al. (2000)

Classical average-linkage divisive clustering: Alon et al. (1999), Ramoni et al. (2002)
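A naive sketch of average-linkage agglomerative clustering: start with one cluster per gene and repeatedly merge the two clusters whose mean pairwise dissimilarity is smallest. The dissimilarity used here, 1 minus the Pearson correlation, is an assumption made for the example, and the function names are hypothetical. The cubic-time loop is meant for readability, not for full-size arrays.

```python
import math


def pearson_distance(a, b):
    """1 - Pearson correlation between two expression profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0  # constant profile: treated as uncorrelated
    return 1.0 - cov / math.sqrt(va * vb)


def average_linkage(X, num_clusters):
    """Merge clusters until num_clusters remain (at least one).
    Returns the clusters as sorted lists of row indices."""
    d = [[pearson_distance(a, b) for b in X] for a in X]
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > max(1, num_clusters):
        best = None
        for s in range(len(clusters)):
            for t in range(s + 1, len(clusters)):
                pairs = [d[i][j] for i in clusters[s] for j in clusters[t]]
                avg = sum(pairs) / len(pairs)  # average-linkage distance
                if best is None or avg < best[0]:
                    best = (avg, s, t)
        _, s, t = best
        clusters[s] = clusters[s] + clusters[t]
        del clusters[t]
    return [sorted(c) for c in clusters]


# Toy data: two rising and two falling profiles.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.2, 4.0, 2.1],
]
clusters = average_linkage(X, num_clusters=2)
```

Recording the order of the merges instead of stopping at a fixed number of clusters would give the dendrogram that is typically used to reorder the rows of the data matrix.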

Various, more or less heuristic methods:

'Gene shaving' (related to PCA clustering, projection pursuit clustering): Hastie et al. (2000), Choi et al. (2001)

Cluster scoring, significance analysis: Tibshirani et al. (2001)

Two-way clustering of contingency tables: Bock (2003)

Selection of interesting splits (bipartitions) together with variable reduction: Golub et al. (1999), Dudoit et al. (2000), Heydebreck et al. (2001), Markowetz et al. (2003); also Allison et al. (2002)

Resampling and bagging for clustering methods: Fridlyand and Dudoit (2001)

Neural networks, Kohonen maps, SOM:

Kohonen maps, SOMs: Tamayo et al. (1999), Herrero et al. (2001)

Support vector methods: Brown et al. (2000), Markowetz et al. (2003)

Temporal processes, time series (time = columns of the data matrix):

Bayesian models with an autoregressive process for time series: Ramoni et al. (2002a, 2002b), Ben-Dor (1999)

Kohonen maps: Tamayo et al. (1999)

Graphics:

Shaded diagrams for similarity matrices: Hartuv et al. (1999), Ben-Dor et al. (1999)

Rearrangement of the data matrix

Bipartitions: Markowetz et al. (2003), Heydebreck et al. (2001)

Cluster profiles in SOM: Tamayo et al. (1999)


9. Software for clustering gene expression data

CLUSTER: for classical methods (Alizadeh et al. 2000)

GENECLUSTER: for SOMs (Tamayo et al. 1999)

CAGED: for Bayesian and ML methods for time series (Ramoni et al. 2002); see CSDA 40 (2002), p. 425

http:/kebab/tch/harvard.edu/caged/

http://genomethods.org/cadeg/about.htm

BioMine from Gene Network Sciences

Bibliography:

Alizadeh, A.A., Eisen, M.B. and 29 other authors (2000): Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Allison, D.B., Gadbury, G.L., Heo, M., Fernández, J.R., Lee, Ch.-K., Prolla, T.A., Weindruch, R. (2002): A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 39, 1-20.

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999): Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750.

Beißbarth, T., and 10 co-authors (2000): Processing and quality control of DNA array hybridization data. Bioinformatics 16, 1014-1022.

Ben-Dor, A., Shamir, R., Yakhini, Z. (1999): Clustering gene expression patterns. J. of Computational Biology 6, 281-297.

Bittner et al. (2000): Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406, 536-540.

Bock, H.-H. (2003): Two-way clustering for contingency tables: maximizing a dependence measure. In: M. Schader, M. Vichi, W. Gaul (eds.): Between data science and applied data analysis. Proc. 26th Conference of the Gesellschaft für Klassifikation, Univ. Mannheim, July 22-24, 2002. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Verlag, Heidelberg, 2003 (in press).

Bonnet, N., Herbin, M., Cutrona, J., Zahm, J.-M. (2002): A new clustering approach based on the estimation of the probability density function, for gene expression data. In: K. Jajuga, A. Sokolowski, H.-H. Bock (eds.): Classification, clustering, and data analysis. Recent advances and applications. Proc. IFCS-2002, Cracow, Poland. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Verlag, Heidelberg, 2002, 35-42.

Botstein, D., Brown, P. (1999): Exploring the new world of the genome with DNA microarrays. Nature Genetics (Supp.) 21, 33-7.

Brown, M., Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Furey, T., Ares, M., Haussler, D. (2000): Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 97, 262-267.

Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R., Kohane, I.S. (2000): Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA 97, 12182-12186.

Carr, D.B., Somogyi, R., Michaels, G. (1997): Templates for looking at gene expression clustering. Statistical Computing & Graphics Newsletter 8, 20-29.

Choi, D., Lee, H., Jun, Ch.-H. (2001): On combining clustering methods for microarray data analysis. Bull. ISI, Proc. of the 53rd Session, Seoul, Korea, 2001. Vol. LIX, Book 3, 229-230.

Dewey, T.G., Galas, D.J. (2001): Dynamic models of gene expression and classification. Springer-Verlag, Berlin.

Dougherty, E.R., Barrera, J., Brun, P.O., Botstein, D. (2002): Inference from clustering with application to gene-expression microarrays. J. Computational Biology 9, 105-126.

Drmanac, S., Stavropoulos, N.A., Labat, I., Vonau, J., Hauser, B., Soares, M.B., Drmanac, R. (1996): Gene-representing cDNA clusters defined by hybridization of 57419 clones from infant brain libraries with short oligonucleotide probes. Genomics 37, 29-40.


Dudoit, S., Fridlyand, J., Speed, T. (2002): Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97, 77-87.

Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. (1998): Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863-14868.

Fridlyand, J., and Dudoit, S. (2001): Application of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. Technical Report no. 600, Division of Biostatistics, School of Public Health, Univ. of California, Berkeley, 50 pp.

Getz, G., Levine, E., Domany, E. (2000): Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA 97, 12079-12084.

Ghosh, D., Chinnaiyan, A.M. (2002): Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18, 275-286.

Golub, T.R., Slonim, D.K., Tamayo, P., and 9 other authors (1999): Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., Brown, P. (2000): 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2), research0003.1-0003.21.

Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrach, H., Shamir, R. (1999): An algorithm for clustering cDNAs for gene expression analysis. Proc. Third Annual International Conference on Computational Molecular Biology (RECOMB 99), 188-197.

Herbin, M., Bonnet, N., Vautrot, P. (1996): A clustering method based on the estimation of the probability density function and on the skeleton by influence zones. Pattern Recognition Letters 17, 1557-1568.

Herbin, M., Bonnet, N., Vautrot, P. (2001): Estimation of the number of clusters and influence zones. Pattern Recognition Letters 22, 1141-1150.

Herrero, J., Valencia, A., Dopazo, J. (2001): A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17, 126-136.

Heydebreck, A.v., Huber, W., Poustka, A., Vingron, M. (2001): Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics 17, Suppl. 1, S107-S114.

Kaski, S. (2003): SOM-based exploratory analysis of gene expression data. Helsinki Univ. of Technology.

Kaski, S., Nikkilä, J., Törönen, P., Castrén, E., Wong, G. (2003): Analysis and visualization of gene expression data using self-organizing maps. Helsinki Univ. of Technology.

Kaski, S., Sinkkonen, J., Nikkilä, J. (2003): Clustering gene expression data by mutual information with gene function. Neural Network Research Centre, Helsinki Univ. of Technology.

Markowetz, F., Heydebreck, A.v. (2002): Class discovery in gene expression data: characterizing splits by support vector machines. In: M. Schader, M. Vichi, W. Gaul (eds.): Between data science and applied data analysis. Proc. 26th Conference of the Gesellschaft für Klassifikation, Univ. of Mannheim, July 22-24, 2002. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Verlag, Heidelberg, 2003 (in press).

McLachlan, G.J., Bean, R.W., Peel, D. (2002): A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422.


Meier-Ewert, S., Mott, R., Lehrach, H. (1995): Gene identification by oligonucleotide fingerprinting - a pilot study. Technical Report, Max-Planck Institute.

Merz, P. (2003): An iterated local search approach for minimum sum-of-squares clustering. To appear.

Ramoni, M.F., Sebastiani, P., Kohane, I.S. (2002a): Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA 99, 9121-9126.

Ramoni, M.F., Sebastiani, P., Cohen, P. (2002b): Bayesian clustering by dynamics. Machine Learning 47, 91-121.

Scherf, U. et al. (2000): A gene expression database for the molecular pharmacology of cancer. Nature Genetics 24, 236-244.

Sinkkonen, J., Kaski, S. (2003): Clustering based on conditional distributions in an auxiliary space. Neural Computation (to appear).

Speed, T. (ed.) (2002): Statistical analysis of gene expression microarray data. CRC Press, USA.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., Golub, T.R. (1999): Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M. (1999): Systematic determination of genetic network architecture. Nature Genetics 22, 281-285.

Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P. (1999): Clustering methods for the analysis of DNA microarray data. Manuscript.

Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D., Brown, P. (1999): Clustering methods for the analysis of DNA microarray data. Report, Univ. of Stanford at http://www-stat.stanford.edu.

Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P., Botstein, D. (2001): Exploratory screening of genes and clusters from microarray experiments. Internal Report, Univ. of Stanford at http://www-stat.stanford.edu.

Weinstein, J.N., et al. (1997): An information-intensive approach to molecular pharmacology of cancer. Science 275, 343-349.

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., Somogyi, R. (1998): Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA 95, 334-339.

Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E., Ruzzo, W.L. (2001): Model-based clustering and data transformations for gene expression data. Bioinformatics 17, 977-987.

Yeung, K.Y., Haynor, D.R., Ruzzo, W.L. (2001): Validating clustering for gene expression data. Bioinformatics 17, 309-318.

Yeung, K.Y., Ruzzo, W.L. (2001): Principal component analysis for clustering gene expression data. Bioinformatics 17, 763-774.

See also:

http://www.molgen.mpg.de/ heydebre/explit.html

http://cmgm.stanford.edu/pbrown/mguide