Clustering and Visualising Microarray Data Patterns

(1)

data mining and integration applications in biosciences. He is on the editorial board of the IEEE Transactionson

Nanobioscience and theOnline Journal of Bioinformatics.

Keywords:clustering, expression data, cluster annotation and visualisation, data mining

Francisco Azuaje, School of Computing and Mathematics,

University of Ulster at Jordanstown, Newtownabbey BT37 0QB, UK Tel: +44 (0)28 9036 6553 Fax: +44 (0)28 9036 6803 E-mail: [email protected]

to discovering and visualising

microarray data patterns

Francisco Azuaje

Date received (in revised form): 18th December 2002

Abstract

This article focuses on clustering techniques for the analysis of microarray data and discusses contributions and applications for the implementation of intelligent diagnostic systems and therapy design studies. Approaches to validating and visualising expression clustering results and software and other relevant resources to support clustering-based analyses are reviewed. Finally, this paper addresses current limitations and problems that need to be investigated for the development of an advanced generation of pattern discovery tools.

INTRODUCTION

The analysis of microarray data has become a fundamental approach to modelling important biological processes. Its outcomes may have a profound impact on the development of new methods for the prevention and treatment of

diseases.1,2However, it has been shown that traditional statistical techniques are not sufficient to address many of the data mining and integration challenges of this domain.3

Key data analysis problems that need to be considered are: data preprocessing, which includes missing value estimation,4 normalisation5and feature

transformation;6feature selection;7 supervised classification;8–10clustering;11 correlation and association studies;12 statistical validation and evaluation.11 Several techniques originating from statistics, machine learning, information retrieval and extraction and signal processing have been applied to perform these functions.3A number of books have recently reported the acquisition,

discovery and classification of expression patterns.13–15In 2000 a critical assessment of microarray data-analysis techniques (CAMDA) was initiated.16

Several journal papers have overviewed microrray data acquisition and analysis issues. Dugganet al.17introduced basic

experimental and data analysis techniques for cDNA microarrays. Lockhart and Winzeler1presented microarray methods and applications. They discussed goals for the development of more effective expression data analysis techniques. Sherlock18overviewed data analysis problems and procedures based on self-organising maps(SOMs), k-meansand hierarchical clustering. Quackenbush19 reviewed general aspects in microarray data analysis, ranging from data normalisation and distance metrics, through clustering algorithms and principal component analysis, to supervised classification methods. Wu20 has studied important topics for the analysis of gene expression profiles, which involve: data acquisition, preprocessing, traditional clustering methods such as hierarchical clustering and SOMs, and statistical procedures for hypothesis testing. Azuaje3discussed data mining and management problems, including

discovery goals, methods and applications in a number of biomedical domains.

Unlike the papers referred above, this review focuses on clustering approaches to analysing microarray data.11This contribution discusses not only solutions based on traditional algorithms (such as k-means, SOMs and hierarchical clustering), but also novel advances and

(2)

applications. It places emphasis on key design, visualisation and validation issues required to develop data clustering tasks. It highlights problems that need to be addressed by an advanced generation of clustering-based analysis tools. Software, books and electronic resources relevant to clustering-based interpretation will be given.

EXPRESSION DATA

CLUSTERING: BASIC

DEFINITIONS AND

APPLICATIONS

Clustering identifies groups of genes or samples showing similar expression patterns. These partitions are known as clusters.Clustering studies are based on the idea that genes that are contained in a particular pathway should be co-regulated and therefore should exhibit similar patterns of expression.21Figure 1 shows a 2D representation of a clustering result. In this hypothetical example, two types of genes, each one associated with a different biological function (AandB), are

clustered based on their expression profiles. Each cluster is encircled, and the genes that are linked to each cluster are

displayed randomly within the correspondent circle.

Clustering algorithms are useful for: (a) measuring the similarity between genes according to their expression patterns under different conditions; or (b)

measuring the similarity between samples described by the expression levels of a set of genes.3This framework has facilitated the classification of diseases, the discovery of gene function, and the identification of relationships among a subset of variables such as biological conditions or

perturbations.22,23 Moreover, clustering has been proposed to support the

discovery of sub-categories for diagnostic and prognostic purposes.24Hypothesis testing and cluster description procedures have been implemented to evaluate experimental models and generate decision support rules.25

A clustering algorithm generally requires the data to be described by a matrix of values. In a microarray data experiment, an element of such a matrix may represent, for instance, the gene expression value associated with a particular biochemical perturbation. Other methods process amatrix of pairwise values, where each of them is associated with the similarity (or dissimilarity) value between two objects. A matrix element may define, for instance, the similarity between two genes expressed under a specific biological condition.

Clustering algorithms typically aim to optimise a partition quality measure. These measures may related to: (a) the heterogeneity of the clusters, also known as their intra-cluster distances; and (b) their separation from the rest of the data, also referred to as the inter-cluster distances. Therefore, clustering algorithm designers and users may need to define methods not only to assess the distance between genes (or samples), but also intra- and inter-cluster distances. Several sample-to-sample, intra- and inter-cluster metrics have been applied and/or

combined in microarray data clustering models.11

The selection of a clustering algorithm

Clustering identifies genes showing similar expression patterns

Clustering algorithms optimise a partition quality measure

Figure 1: A 2D representation of a

clustering result. A hypothetical example, in which two types of genes, each one

associated with a different biological function (AandB), are clustered based on their expression profiles

Genes related to functionA Genes related to functionB

(3)

depends on the statistical nature of the data, problem and user requirements, and the attributes and constraints exhibited by the algorithms available. Clustering algorithms may be categorised into different types, according to the way they process and partition the data. Hasti and colleagues,26for example, define three major types of clustering techniques: combinatorial algorithms,mixture modelling andmode seeking. Combinatorial algorithms partition the data without making reference to a probability model. Mixture modelling methods assume that the data originate from a population characterised by a probability density function. In this case, a density function is defined by a mixture of component density functions, where each component represents one of the existing classes. The mode seeking models apply a non-parametric method, which aims to predict different modes of a probability density function. Some algorithms for the generation of association rules represent typical examples of this category.25 The k-means method, hierarchical clustering and SOMs may be defined as examples of combinatorial algorithms.

Hierarchical clustering may be implemented by applyingagglomerative anddivisiveparadigms. These techniques aim to produce a tree-like structure in which the nodes represent subsets of an expression data set. The agglomerative paradigm starts at the bottom of the data hierarchy (individual genes or samples). At each hierarchical level, it recursively merges a selected pair of clusters into a single cluster. An agglomerative

technique depends on the approach used to assess the separation between clusters, such assingleandaverage linkage

methods.19Divisive strategies perform a top-down approach because they begin to process the entire data set as a single cluster, and recursively divide the existing clusters into two clusters at each iteration. Figure 2 illustrates a tree (also known as dendrogram) obtained after performing a hierarchical clustering of a collection of genes. Each column is associated with a

gene, and the branch lengths represent the distance or dissimilarity between genes.

Azuaje and Bolshakova11categorise microarray data clustering algorithms into three major types: (a) hierarchical

clustering, (b) models based on iterative relocation and (c) adaptive systems and other advances. Hierarchical models may be defined as above. Models based on iterative relocation consist of a number of ‘learning’ steps to search for an optimal partition of samples. Such processes commonly require: (1) the specification of an initial partition of objects into a

number of classes; (2) the definition of a number of clustering parameters to implement the search process and assess the adequacy of its outcomes; (3) a set of procedures to transform the structure or composition of a partition; and (4) a repetitive sequence of such transformation procedures. Typical examples are the SOMs and thek-means algorithm. Adaptive systems and other advances may consist of a mixture of hierarchical and

Hierarchical clustering may be implemented using agglomerative and divisive models

Iterative relocation techniques search for an optimal partition

Figure 2: A dendrogram representing

the hierarchical clustering of a collection of genes

Gene 1

Gene 3

Gene 4

Gene 2

Gene 6

Gene 5

(4)

iterative relocation clustering. They may also combine different clustering

paradigms to improve classification accuracy3or to perform clustering as a supervised method.27Dopazo28classifies expression data clustering methods into two major categories: hierarchical and non-hierarchical.

It has been generally accepted that the application of more than one clustering technique may facilitate the generation of accurate and reliable results. Thus,

scientists may implement voting strategies, consensus classifications, clustering fusion techniques and statistical measures of consistency.11,29Azuaje and Bolshakova11 have outlined basic factors for the selection of clustering algorithms in expression analysis applications.

ADVANCES IN

CLUSTERING-BASED

INTERPRETATION OF

MICROARRAY DATA

Bioscientists may choose clustering solutions from a diverse and extensive collection of algorithms originating from statistical learning, pattern recognition and machine learning.19,20Recent advances include, for example, different methods based on the combination of multi-layer neural networks and fuzzy logic,27,30 concepts and techniques from the graph theory31and several Bayesian

approaches.32,33 Unlike traditional methods, some of these techniques may be adapted and/or combined to perform both supervised and unsupervised classification of expression patterns.27 These models also exhibit advantages in terms of their computational efficiency and learning complexity. For example, the models reported in Azuaje27and Tomidaet al.30require the user to define a small number of learning parameters in comparison to other neural networks. One of these systems, which is based on a neural network known as Simplified Fuzzy ARTMAP, may allow the

execution of a clustering task with a single processing iteration.27

The combination of traditional and

advanced solutions, together with the application of decision support and evaluation models may provide the basis for more reliable and understandable analyses.

Wuet al.34have proposed a novel approach to predicting gene function based on expression data. They showed the importance of applying multiple clustering methods to discover relevant biological patterns. The clustering methods applied were: hierarchical clustering,k-means, SOMs, ‘block up/

down’ and ‘top-N’.34Each method may

produce partially overlapping expression clusters. A class prediction provided by a clustering algorithm is associated with a probability value,P, which may assign a gene to multiple functional categories. This strategy assesses the possibility that a cluster of genes was obtained by chance. Thus, predictions are based on the

minimumP-value exhibited by a category in the cluster. Similarly, they implement tools to visualise gene clusters and their annotations. This framework also suggested that small clusters are more suitable for producing reliable predictions. Wuet al.confirmed the validity of these results by implementing biochemical experiments. This is an elegant and robust solution, which demonstrates how machine learning models, statistics and validation methods may be integrated to improve the interpretation of biological data. One important aspect of this type of frameworks is that, unlike traditional methods, it can associate multiple reliable predictions to a gene, which are described by probabilities. The implementation of tools for automatically annotating clusters also allows users to accelerate the

validation of predictions. Figure 3 summarises the sequence of analysis steps of this approach.

The statistical evaluation or validation of clusters may comprise the

implementation of tests to measure consistency and significance, and validity indices to estimate their reliability.35,36 These factors have received relatively little attention in microarray data analyses.3

Machine learning and statistics may be combined to interpret clusters

Consistency tests and validity indices are applied to assess clustering results

(5)

Apart from the validation procedure implemented by Wuet al.,34other studies have suggested, for instance, the

application of permutation tests37 to aid in the selection of robust clusters. This type of methods may be applied to assess the relevance of clusters generated by different algorithms. But also they may represent central processing components in a clustering technique. One example is theself-organising tree algorithm(SOTA),37 which allows the visualisation of

hierarchical relationships. The product of this divisive method is a binary tree of clusters, whose size may be controlled by heterogeneity threshold or validity values. SOTA embodies some of the features observed in hierarchical clustering and SOMs. Moreover, it adds key capabilities, such as the assessment of cluster statistical significance and the generation of clusters at different hierarchical levels, which are not implemented by those traditional methods. Unlike the traditional

hierarchical clustering and SOMs, SOTA statistically describes the quality and robustness of the obtained clusters. It has been shown that this system may be faster than the typical hierarchical model when processing massive expression data sets. Additionally, like the SOMs, SOTA generates prototypes to characterise the clusters. Table 1 compares fundamental attributes exhibited by the traditional hierarchical clustering, SOMs and the SOTA algorithm.

The estimation of the number of

clusters represented in an expression data set is a crucial and complex task, which may significantly influence the products of an analysis process. There are three major types of methods:null hypothesis tests,internal validity indicesandexternal validity indices. The first type aims to provide evidence against the hypothesis: ‘there are no clusters in the data’.38 Internal validity indices are calculated using the same data used to perform clustering. Several indices have been

proposed by Kaufman and Rousseeuw,39

Bezdek and Pal40and Tibshiraniet al.41 Bolshakova and Azuaje42have successfully applied internal validity indices to

expression data analyses. They have also proposed strategies to combine the outcomes originating from multiple validity indicators, which may be used to generate more reliable and robust predictions about the correct number of clusters.

External indices assess the agreement between an experimental partition and a reference partition. The experimental partition is the clustering product under study, while the reference data set may be a partition witha prioriknown cluster structure.35,38 Dudoit and Fridlyand38 have designed a method based on the resampling of learning and test sets, and the application of external validity indices. They have demonstrated that such a system can accurately predict the number of clusters in a cancer expression database. Similarly, it shows a robust prediction

The estimation of the number of correct clusters is a complex task

Null hypothesis tests and validity indices support this task

Figure 3: A gene function prediction and validation system based on the generation of

multiple expression clusters34

(I) Expression data (II) Clustering (III) Annotation of clusters (IV) Predictions (V) Validation k-means, hierarchical, ‘top-N’, SOMs, ’up/down block’ Annotation algorithm, cellular role and

localisation P-values, selection of relevant clusters

(6)

performance in comparison to other validity indices. This solution, unlike the ones studied by Bolshakova and Azuaje, estimates both the correct number of clusters and the quality of a clustering algorithm based on the reproducibility of clustering results.

DATA INTEGRATION

Clustering algorithms may be useful to support the integration of microarray and other biomolecular data resources. Werner43has combined expression cluster analysis and promoter sequence

information for the identification of target genes in therapeutic studies. Barash and Friedman33have implemented a probability model of binding sites and expression patterns to improve the classification of genes and the description of clusters. Liang and Kachalo44 have studied the integration of gene sequences, microarray data and other chemical structures to aid in the identification of differentially expressed genes and the classification of biological samples.

Bilu and Linial45have applied clustering algorithms to combine information originating from BLAST searches and expression patterns. They showed a significant correlation between clusters of genes described by the BLASTE-scores of their similarity and clusters of

expression patterns.

Chaussabel and Sher46 have

implemented an approach to combining

expression data and large literature resources. They applied clustering methods to define associations between relevant literature profiles and expression data, which may represent a powerful tool to support gene discovery functions. In this study the terms found in thousands of abstracts significant to a collection of genes are retrieved and filtered. This vocabulary allows the generation of a structure of term–occurrence indices for each gene. A hierarchical clustering algorithm is then applied to detect relationships and display groups of genes based on term occurrence patterns. Thus, a user interprets occurrence indicators for terms such as ‘apoptosis’ and

‘histocompatibility’, which may provide an insight into the functional roles assigned to the genes in the literature. However, this research may benefit from the incorporation of more reliable methods for the retrieval of relevant literature records and terms associated with each gene.

VISUALISATION OF

EXPRESSION CLUSTERS

Cluster visualisation has traditionally consisted of the representation of a dendrogram together with a colour-coded plot of the expression data.19 Figure 4 depicts the hierarchical clustering of a number of genes, which are characterised by the expression values obtained during a number of experiments. In the original

Clustering may support data integration

There is a need to improve cluster visualisation

Table 1: Comparison of clustering techniques: traditional hierarchical clustering (THC) and

SOMs, against an alternative solution (SOTA), which incorporates computational features desirable for improving cluster identification and validation. Table adapted from Herrero

et al.37

Property THC SOM SOTA

Topology Hierarchical tree Hexagonal or rectangular Hierarchical tree

Structure adaptation Aggregative Static Divisive

Number of clusters As many as items User predefined Customisable

Statistical assessment of clusters No No Yes

Generation of clusters at different hierarchical levels

No No Yes

Noise tolerance and robustness No Yes Yes

Generation of cluster prototypes No Yes Yes

(7)

plot generated by theTreeViewsoftware,47 upregulated genes are depicted in red, downregulated genes appear in green, and the intensity of the colour is associated with the relative expression value. These tools may be combined with other graphical representations of cluster

annotation information, confidence values and literature resources.34,46One of its limitations is that the identification of clusters and associations is generally left to the user. Moreover, its complexity may be significantly increased when thousands of genes or experiments are analysed.

SOMs, self-adaptive neural networks and other non-linear mapping techniques may provide the basis for the

implementation of more effective hierarchical and non-hierarchical visualisation methods.48 Advanced self-organising networks can automatically model and organise the data under study

based on adaptive structures, which do not need to be predefined by the user. The shape of the resulting structures can reflect complex similarity relationships, and multi-resolution hierarchical analysis of clusters may be implemented.49Thus, they provide graphical components to characterise the data in a more meaningful and interactive fashion. The application of Sammon’s Non-Linear Mappinghas been studied to detect relevant clusters and outlying elements using 2D and 3D graphical representations.50However, these visualisation tasks may be enormously facilitated by developing novel algorithms for the automatic determination of relevant partitions and cluster boundaries.51

Scientists may take advantage of existing data mining techniques to improve their visualisation and interpretation capabilities. Thus, it is

Existing visualisation techniques exhibit several limitations It is crucial to develop more understandable and interactive techniques

(8)

important to assess different adaptations and applications of methods such as: surface and volume visualisation, parallel coordinates, hyperbolic projections, dimensional stacking and other visual data mining methods,52which may represent useful options for cluster analysis and visualisation. There is also the need to develop tools to graphically represent the importance of features or attributes in classification studies, which can be measured by applying multiple relevance discovery methods.

CLUSTERING SOFTWARE

AND BIBLIOGRAPHIC

RESOURCES

Reviews on software for microarray data analysis have been published in this journal53and in Berraret al.54Table 2 shows a list of both publicly available and commercial software for clustering-based microarray data analysis. Most of them are designed to run on recent versions of WindowsorLinux.They allow the implementation of clustering algorithms such as: hierarchical clustering,k-means, SOMs, biclustering55 and Bayesian models.

Several books on gene expression

analysis have been recently published, which dedicate chapters on clustering-based models and

interpretation.13,14,15,53,56

DISCUSSION AND

CONCLUSION

Clustering is a fundamental microarray data analysis task.57Several algorithms have represented key components of expression analysis applications, which include expression profiling in response to cancer treatments,58,59plant biology studies60and the identification of differentially expressed genes in disease models.61 They will continue playing an important role in molecular diagnosis,62,63 prognostic classification,64determination of expression signatures for complex diseases and developmental processes,65 and the understanding of physiological systems.66Furthermore, clustering methods will significantly support new microarray data application areas, such as the search for vaccine candidates in parasitology.67

In order to fully exploit the predictive capabilities of these systems, it is crucial to develop solutions that have been partially or never approached. Microarray data

Feature relevance representation is important

Clustering will continue playing a key role in expression data studies

Table 2: Software tools for clustering-based analysis of microarray data

Product Clustering algorithms Designer and access

BioMine k-means, biclustering Gene Network Sciences Inc.

http://www.gnsbiotech.com/biomine.shtml Cluster and TreeView Hierarchical clustering, SOMs,k-means. Michael Eisen, Lawrence Berkeley National Lab and

University of California at Berkeley http://rana.lbl.gov/EisenSoftware.htm CNIO Server Hierarchical clustering, SOMs, SOTA. Bioinformatics Unit, Spanish National Cancer

Center (CNIO)

http://bioinfo.cnio.es/dnarray/analysis CAGED Bayesian clustering Children’s Hospital, Harvard Medical School.

http://genomethods.org/caged

GeneCluster SOMs Center for Genome Research, Whitehead

Institute-MIT, http://www-genome.wi.mit.edu/cancer

GeneMaths SOMs Applied Maths

http://www.applied-maths.com/home.html Genesis Hierarchical clustering,k-means, SOMs Institute of Biomedical Engineering, Graz University

of Technology http://genome.tugraz.at/Software/ GenesisCenter.html

GeneSpring Hierarchical clustering,k-means, SOMs Silicon Genetics http://www.silicongenetics.com Rosetta Resolver k-means,k-median, SOMs Rosetta Biosoftware

http://www.rosettabio.com/home.html Spotfire DecisionSite Hierarchical andk-means clustering Spotfire http://www.spotfire.com/products/fg.asp

(9)

analyses require the implementation of more rigorous statistical evaluation frameworks.3,36Integrated software platforms, comprising several clustering techniques and cluster validation methods, should improve users’ pattern recognition capabilities and the quality of results. The problems of estimating the correct number of clusters and identifying cluster boundaries require further

investigations.68,69 Such platforms should also implement data reduction or

projection methods, such as principal component analysis and advanced non-linear mapping techniques,70and strategies for the estimation of missing values.71

This review has also discussed solutions for data integration and cluster

visualisation. Both tasks may be greatly benefited from advanced research on feature selection or relevance discovery.72 They may significantly contribute, for example, to the development of new options for simultaneously clustering genes and samples.73

An advanced generation of clustering-based techniques will require the

implementation of more powerful and understandable mechanisms to visualise patterns. Significant improvements in computing power, and the consolidation and emergence of sophisticated

information visualisation techniques should support the achievement of those goals.74

Researchers should promote

investigations on the representation and visualisation of patterns that are novel and useful for the genomic expression research community. An intelligent generation of microarray data analysis tools must take into account scientists’ perceptions of what is relevant for biomedical

knowledge discovery. Are there graphical metaphors, tools or formats within a particular expression analysis task that might be more acceptable to users? Are there more understandable representations of uncertainty, cluster membership and feature relevance in clustering-based analysis? Can 3D visualisation techniques

provide benefits over 2D representations of informational patterns? Answers to these and other important questions require the development of long-term collaborations between converging technologies and disciplines.

References

1. Lockhart, D. and Winzeler, E. (2000), ‘Genomics, gene expression and DNA arrays’,

Nature, Vol. 405, pp. 827–836.

2. Ross, D., Scherf, U., Eisen, M.et al. (2000), ‘Systematic variation in gene expression patterns in human cancer cell lines’,Nature Genet., Vol. 24, pp. 227–235.

3. Azuaje, F. (2002), ‘In silicoapproaches to microarray-based disease classification and gene function discovery’,Ann. Med., Vol. 34, pp. 299–305.

4. Troyanskaya, O., Botstein, D. and Altman, R. (2002), ‘Missing value estimation’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 65–75.

5. Morrison, N. and Hoyle, D. (2002), ‘Normalization’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 76–90. 6. Wall, M., Rechtsteiner, A. and Rocha, L. (2002), ‘Singular value decomposition and principal component analysis’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 91–109.

7. Xing, E. (2002), ‘Feature selection in microarray analysis’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 110–131.

8. Dudoit, S. and Fridlyand, J. (2002), ‘Introduction to classification in microarray experiments’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 132–149.

9. Mukherjee, S. (2002), ‘Classifying microarray data using support vector machines’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 166–185.

10. Ringner, M., Eden, P. and Johansson, P. (2002), ‘Classification of expression patterns using artificial neural networks’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data More rigorous

evaluation frameworks are required

Tools must take into account scientists’ perceptions

(10)

Analysis’, Kluwer Academic Publishers, London, pp. 166–185.

11. Azuaje, F. and Bolshakova, N. (2002), ‘Clustering genomic expression data: Design and evaluation principles’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 230–245.

12. Lin, S. and Johnson, K. (2002), ‘Correlation and association studies’, in Berrar, D., Dubitzky, W. and Granzow, M., Eds, ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic Publishers, London, pp. 289–305.

13. Baldi, P., Hatfield, G. and Hatfield W. (2002), ‘DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling’, Cambridge University Press, Cambridge.

14. Knudsen, S. (2002), ‘A Biologist’s Guide to Analysis of DNA Microarray Data’, John Wiley & Sons, Chichester.

15. Kohane, I., Kho, A. and Butte, A. (2002), ‘Microarrays for an Integrative Genomics’, MIT Press, Cambridge, MA.

16. Johnson, K. and Lin, S. (2001), ‘Call to work together on microarray data analysis’,Nature, Vol. 411, p. 885.

17. Duggan, D., Bittner, M., Chen, Y.et al.

(1999), ‘Expression profiling using cDNA microarrays’,Nature Genet. Suppl., Vol. 21, pp. 10–14.

18. Sherlock, G. (2000), ‘Analysis of large-scale gene expression data’,Current Opin. Immunol., Vol. 12, pp. 201–205.

19. Quackenbush, J. (2001), ‘Computational analysis of microarray data’,Nature Rev. Genet., Vol. 2, pp. 418–427.

20. Wu, T. (2002), ‘Large-scale analysis of gene expression profiles’,Briefings Bioinformatics, Vol. 3, pp. 7–17.

21. Fitch, J. and Sokhansanj, B. (2000), ‘Genomic engineering: Moving beyond DNA sequence to function’,Proc. IEEE, Vol. 12,

pp. 1949–1971.

22. Perou, C., Sorlie, T., Eisen, M.et al.(2000), ‘Molecular portraits of human breast tumours’,

Nature, Vol. 406, pp. 747–752. 23. Ideker, T., Thorsson, V., Ranish, J.et al.

(2001), ‘Integrated genomic and proteomic analyses of a systematically perturbated metabolic network’,Science, Vol. 292, pp. 929–933.

24. Golub, T., Slonim, D., Tamayoet al. (1999), ‘Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring’,Science, Vol. 286, pp. 531–537.

25. Granzow, M., Berrar, D., Dubitzky, W.et al.

(2001), ‘Tumor identification by gene expression profiles: A comparison of five different clustering methods’,ACM-SIGBIO Lett., Vol. 21, pp. 16–22.

26. Hasti, T., Tibshirani, R. and Friedman, J. (2001), ‘The Elements of Statistical Learning’, Springer, New York.

27. Azuaje, F. (2001), ‘A computational neural approach to support the discovery of gene function and classes of cancer’,IEEE Trans. Biomed. Eng., Vol. 48, pp. 332–339.

28. Dopazo, J. (2002), ‘Microarray data processing and analysis’, in Lin, S. and Johnson, K., Eds, ‘Methods of Microarray Data Analysis II’, Kluwer Academic Publishers, Boston, MA, pp. 43–63.

29. Everitt, B. (1993), ‘Cluster Analysis’, Edward Arnold, London.

30. Tomida, S., Hanai, T., Honda, H. and Kobayashi, T. (2002), ‘Analysis of expression profiles using fuzzy adaptive resonance theory’,

Bioinformatics, Vol. 18, pp. 1073–1083. 31. Xu, Y., Olman, V. and Xu, D. (2002),

‘Clustering gene expression data using graph-theoretic approach: An application of minimum spanning trees’,Bioinformatics, Vol. 18, pp. 536–545.

32. Medvedovic, M. and Sivaganesan, S. (2002), ‘Bayesian infinite mixture model based clustering of gene expression profiles’,

Bioinformatics, Vol. 18, pp. 1194–1206. 33. Barash, Y. and Friedman, N. (2002),

‘Context-specific Bayesian clustering for gene expression data’,J. Comp. Biol., Vol. 9, pp. 169–191. 34. Wu, L., Hughes, T., Davierwala, A.et al.

(2002), ‘Large-scale prediction ofSaccharomyces cerevisiaegene function using overlapping transcriptional clusters’,Nature Genet., Vol. 31, pp. 255–265.

35. Yeung, K., Haynor, D. and Ruzzo, W. (2001), ‘Validating clustering for gene expression data’,Bioinformatics, Vol. 17, pp. 309–318.

36. Mendez, M., Hodar, C., Vulpe, C.et al.

(2002), ‘Discriminant analysis to evaluate clustering of gene expression data’,FEBS Lett., Vol. 522, pp. 24–28.

37. Herrero, J., Valencia, A. and Dopazo, J. (2001), ‘A hierarchical unsupervised growing neural network for clustering gene expression patterns’,Bioinformatics, Vol. 17, pp. 126–136. 38. Dudoit, S. and Fridlyand, J. (2002), ‘A

prediction-based resampling method for estimating the number of clusters in a dataset’,

Genome Biol., Vol. 3 (Biomed Central; URL: http://genomebiology.com).

39. Kaufman, L. and Rousseeuw,P. (1990), ‘Finding Groups in Data: An Introduction to Cluster Analysis’, Wiley, New York.

(11)

40. Bezdek, J. and Pal, N. (1998), ‘Some new indexes of cluster validity’,IEEE Trans. Systems, Man Cybernetics, Vol. 28, pp. 301–315.

41. Tibshirani, R., Walther, G. and Hastie, T. (2000), ‘Estimating the number of clusters in a dataset via the gap statistic’, Technical report, Department of Biostatistics, Stanford University (URL: http://

www.stat.stanford.edu/~tibs/research.html). 42. Bolshakova, N. and Azuaje, F. (2003), ‘Cluster

validation techniques for genome expression data’,Signal Proc., in press.

43. Werner, T. (2001), ‘Cluster analysis and promoter modelling as bioinformatics tools for the identification of target genes from expression array data’,Pharmacogenomics, Vol. 2, pp. 25–36.

44. Liang, J. and Kachalo, S. (2002),

‘Computational analysis of microarray gene expression profiles: clustering, classification, and beyond’,Chemometrics Intell. Lab. Syst., Vol. 62, pp. 199–216.

45. Bilu, Y. and Linial, M. (2002), ‘The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classifications’,J. Comp. Biol., Vol. 9, pp. 193–210.

46. Chaussabel, D. and Sher, A. (2002), ‘Mining microarray expression data by literature profiling’,Genome Biol., Vol. 3 (Biomed Central; URL: http://genomebiology.com). 47. URL: http://rana.lbl.gov/EisenSoftware.htm 48. Kohonen, T. (2001), ‘Self-organizing Maps’,

Springer, Berlin.

49. Wang, H., Azuaje, F. and Black, N. (2003), ‘Improving biomolecular pattern discovery and visualisation with hybrid self-adaptive

networks’, submitted toIEEE Trans. Nanobiosci, in press.

50. Ewing, R. and Cherry, M. (2001), ‘Visualization of expression clusters using Sammon’s non-linear mapping’,Bioinformatics, Vol. 17, pp. 658–659.

51. Horimoto, K. and Toh, H. (2001), ‘Statistical estimation of cluster boundaries in gene expression profile data’,Bioinformatics, Vol. 17, pp. 1143–1151.

52. Fayyad, U., Grinstein, G. and Wierse, A., Eds, (2002), ‘ Information Visualization in Data Mining and Knowledge Discovery’, Morgan Kaufmann, San Francisco.

53. Reviews inBriefings in Bioinformatics(2001), Vol. 2, pp. 388–404.

54. Berrar, D., Dubitzky, W. and Granzow, M., Eds, (2002), ‘A Practical Approach to Microarray Data Analysis’, Kluwer Academic, London.

55. Cheng, Y. and Church, G. (2000),

‘Biclustering of expresssion data’, in ‘Proceedings of 8th ISMB International Conference on Intelligent Systems for Molecular Biology, August 19–23, La Jolla, California’, AAAI Press, Menlo Park, CA. 56. Warrington, J., Todd, R. and Wong, D., Eds,

(2002), ‘Microarrays and Cancer Research’, Eaton Pub. Co., MA.

57. Tamames, J., Clark, D., Herrero, J.et al.

(2002), ‘Bioinformatics methods for the analysis of expression arrays: Data clustering and information extraction’,J. Biotechnol., Vol. 98, pp. 269–283.

58. Sotiriou, C., Powles, T., Dowsett, M.et al.

(2002), ‘Gene expression profiles derived from fine needle aspiration correlate with response to systemic chemotherapy in breast cancer’,

Breast Cancer Res., Vol. 4 (BioMed Central; URL: http://breast-cancer-research.com). 59. Kitahara, O., Katagiri, T., Tsunoda, T.et al.

(2002), ‘Classification of sensitivity or resistance of cervical cancers to ionizing radiation according to expression profiles of 62 genes selected by cDNA microarray analysis’,

Neoplasia, Vol. 4, pp. 295–303.

60. Cho, Y., Fernandes, J., Kim, S. and Walbot, V. (2002), ‘Gene-expression profile comparisons distinguish seven organs of maize’,Genome Biol., Vol. 3 (BioMed Central; URL: http://genomebiology.com).

61. Zou, J., Young, S., Zhu, F.et al.(2002), ‘Microarray profile of differentially expressed genes in a monkey model of allergic asthma’,

Genome Biol., Vol. 3 (BioMed Central; URL: http://genomebiology.com).

62. Lin, Y., Furukawa, Y., Tsunoda, T.et al.

(2002), ‘Molecular diagnosis of colorectal tumors by expression profiles of 50 genes expressed differentially in adenomas and carcinomas’,Oncogene, Vol. 21, pp. 4120–4128.

63. Matei, D., Graeber, T., Baldwin, R.et al.

(2002), ‘Gene expression in epithelial ovarian carcinoma’,Oncogene, Vol. 21,

pp. 6289–6298.

64. Devilard, E., Bertucci, F., Trempat, P.et al.

(2002), ‘Gene expression profiling defines molecular subtypes of classical Hodgkin’s disease’,Oncogene, Vol. 21, pp. 3095–3102. 65. Ferrando, A., Neuberg, D., Staunton, J.et al.

(2002), ‘Gene expression signatures define novel oncogenic pathways in T cell acute lymphoblastic leukemia’,Cancer Cell, Vol. 1, pp. 75–87.

66. Bonaventure, P.,Guo, H., Tian, B.et al.

(2002), ‘Nuclei and subnuclei gene expression profiling in mammalian brain‘,Brain Res., Vol. 943, pp. 38–47.

67. Rathod, P, Ganesan, K, Hayward, R.et al.

(2002), ‘DNA microarrays for malaria’,Trends Parasitol., Vol. 18, pp. 39–45.

(12)

68. Lukashin, A.V. and Fuchs, R. (2001), ‘Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters’,Bioinformatics, Vol. 17, pp. 405–414.

69. De Smet, F., Mathys, J., Marchal, K.et al.

(2002), ‘Adaptive quality-based clustering of gene expression profiles’,Bioinformatics, Vol. 18, pp. 735–746.

70. Yeung, K. and Ruzzo, W. (2001), ‘Principal component analysis for clustering gene expression data’,Bioinformatics, Vol. 17, pp. 763–774.

71. Troyanskaya, O., Cantor, M., Sherlock, G.et

al.(2001), ‘Missing value estimation methods for DNA microarrays’,Bioinformatics, Vol. 17, pp. 520–525.

72. Kele, S., van der Laan, M. and Eisen, M. (2002), ‘Identification of regulatory elements using a feature selection method’,

Bioinformatics, Vol. 18, pp. 1167–1175. 73. Pollard, K. and van der Laan, M. (2002),

‘Statistical inference for simultaneous clustering of gene expression data’,Math. Biosci., Vol. 176, pp. 99–121.

74. Ball, B. (2002), ‘Data visualization: Picture this’,Nature, Vol. 418, pp. 11–13.