Pattern Recognition Techniques in Microarray Data Analysis: A Survey

Faramarz Valafar

Department of Computer Science, San Diego State University

5500 Campanile Drive, San Diego, CA 92182, USA

[email protected]

Keywords: Pattern recognition, microarray, datamining, expression level data, neural networks, statistical analysis

Abstract: Recent development of technologies (e.g. microarray technology) that are capable of producing massive amounts of genetic data has highlighted the need for new pattern recognition techniques that can mine and discover “biologically meaningful” knowledge in large data sets. Many researchers have begun devising such data mining techniques. As such, there is a need for survey articles that periodically review and summarize the work that has been done in the area. This article presents one such survey. The first portion of the paper provides the basic biology (mostly for non-biologists) that is required in such a project. It is meant only as a starting point for experts in technical fields who wish to embark on this new area of bioinformatics. The second portion of the paper is a survey of various data mining techniques that have been used in mining microarray data for biological knowledge and information (such as sequence information). This survey is not meant to be treated as complete in any form, as the area is currently one of the most active, and the body of research is very large. Furthermore, the applications of the techniques mentioned here are not meant to be taken as the most significant applications of the techniques, but simply as some examples among many.

Molecular Genome Biology

Cell: Every living organism has a basic and fundamental working unit, which we call the cell. Humans, like many other multicellular organisms (metazoa), have trillions of cells. Some organisms, such as yeast, consist of only a single cell; these are called unicellular organisms. There are many types of cells (e.g. blood, skin, and nerve cells), but in a multicellular organism they can all be traced back to a single cell, the fertilized egg.

Genome and chromosome: The genome can be considered the blueprint for all cellular structures and activities in the body. This blueprint is encoded in DNA (deoxyribonucleic acid) molecules. Each cell of an organism contains a complete copy of this blueprint, the organism’s genome. The human genome is distributed along 22 pairs of autosomal chromosomes and one pair of sex chromosomes. A chromosome is made of compressed and entwined DNA (Figure 1). In each pair of chromosomes, one is paternally inherited and the other is maternally inherited.

DNA and gene: A DNA molecule is a double-stranded polymer structured in the form of a double helix (Figure 2). A gene is a protein-coding segment of the chromosomal DNA that directs the synthesis of a protein. DNA is composed of four basic molecular units called nucleotides. Each nucleotide comprises a phosphate group, a deoxyribose carbohydrate (sugar), and one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T) (Figure 2). The two chains of the DNA are held together by hydrogen bonds between the bases (base-pairs). Base-pairing occurs only between G and C, or between A and T.

Figure 2: The DNA molecule is a double-stranded, double helix polymer.
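The base-pairing rule stated above (A with T, G with C) can be illustrated with a minimal sketch. The function names and the seven-base fragment are hypothetical and chosen only for illustration; strand directionality is ignored here.

```python
# Pairing rule from the text: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the partner strand, base by base, in the same reading order."""
    return "".join(PAIR[base] for base in strand.upper())


def count_base_pairs(strand: str, partner: str) -> int:
    """Count positions at which the two strands form a valid A-T or G-C pair."""
    return sum(1 for a, b in zip(strand.upper(), partner.upper()) if PAIR.get(a) == b)


if __name__ == "__main__":
    fragment = "GATTACA"  # hypothetical fragment, not taken from any real gene
    partner = complementary_strand(fragment)
    print(fragment, partner, count_base_pairs(fragment, partner))
```

Because each base has exactly one partner, either strand fully determines the other, which is the property that hybridization-based measurements rely on.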

The human genome comprises 23 pairs of chromosomes containing nearly three billion base-pairs and thirty to forty thousand genes. (The number of genes in the human genome has been the subject of much debate in the past years. Initially this number was estimated at anywhere from 65,000 to as high as 150,000 [e.g. John Quackenbush and colleagues at The Institute for Genomic Research estimated this number to be about 120,000]. The human genome project (HGP), however, revealed that this number is much lower than initially thought, and is somewhere between thirty and forty thousand.) Specifically, HGP revealed 26,588 protein-encoding transcripts (genes) for which there was strong corroborative evidence.1 The study also presented roughly 12,000 additional computationally derived genes with mouse matches, or weak supporting evidence.1

Proteins: Proteins are large molecules composed of one or more chains of amino acids. Proteins play a structural or functional role in all parts of an organism’s body. As a result, proteins determine, for instance, how foreign organisms and our body interact with each other (relevant to infectious diseases), or how a body produces, or reacts to, cancerous cells.

The DNA is responsible for the synthesis of proteins. The sequence in which base nucleotides appear in the DNA (base sequence) determines the order of amino acids that comprise a protein. As a result, these sequences determine the production level of hormones, enzymes, antibodies, etc. The DNA also determines the types of different cells, by using a mechanism called differential gene expression. Each cell contains a complete copy of the organism's genome. However, not all genes are expressed in all cells all the time. Differential gene expression determines where, when, and in what quantity each gene is expressed. This process produces different types of cells (skin, blood, nerve, dividing, cancerous, etc.) using the same genome. On average, only forty percent of human genes are expressed at a given time.

[Figure 2 labels: sugar (S), phosphate (P), sugar-phosphate backbone, nucleotide, base, hydrogen bonds]

Sydney Brenner, Francis Crick, and colleagues were the pioneers in the area in the early 60’s. Brenner and colleagues first proposed that a particular set of RNA molecules, later called transfer RNAs (tRNAs), act to “decode” the DNA.2 In the same year, Brenner et al. showed that short-lived RNA molecules, which they called messenger RNA (mRNA), carry the genetic instructions from DNA to structures in the cell called ribosomes, which they also showed to be the site of protein synthesis.3 Brenner’s groundbreaking work in 1965 laid out the theoretical foundation for developing techniques that could measure gene expression levels. The results of Brenner’s work, along with other discoveries that came later, can be partially summarized as follows: the expression of the genetic information stored in the DNA molecule occurs in two stages, called transcription and translation. During transcription, the DNA is “transcribed” into mRNA (messenger ribonucleic acid). (RNA is a nucleic acid molecule similar to DNA. However, it is only single stranded, its carbohydrate is ribose instead of deoxyribose, and it uses uracil (U) in place of the thymine (T) of DNA.) During translation, the mRNA is translated to produce a protein. The effects of many possible sequence variations in transcription factor binding sites can be determined using microarray binding experiments (e.g. Church and colleagues’ work at Harvard Medical School4).

The human genome is largely (about 98%) comprised of non-coding regions. Within genes, the non-coding sequences (sequences that do not participate in coding for protein synthesis) are called introns, whereas the coding sequences are called exons. Only exons, which comprise only about two percent of the human genome, contain the sequences that are translated into proteins. The introns, or non-coding regions, may provide chromosomal structural integrity or act as regulatory regions (controlling when, where, and in what quantity proteins are made).

Basic problem definition: How can we understand the role of the genome as a whole in biological function? The central dogma in molecular biology, as it pertains to genome biology, is two-fold. The first part, and the one more relevant to this survey, is that understanding gene expression will explain cell function and cell pathology. Two main problems in this challenge persist. First, given the recent developments in measuring expression levels (in particular in microarray technology), how can we produce gene sequence information from gene expression data? Second, how can we go from sequence to function? In other words, how can we define the role of each gene (or sequence of genes) in some biological function, and subsequently understand how the genome functions as a whole?

A question that arises from the basic problem definition, and one that we will not focus on in this study, is: how can we measure gene expression? The second part of our central dogma discussed above is that gene expression is measurable in terms of specific mRNA abundance. Much effort in this direction has been expended on the development of microarray technology. Using this technology, various genome projects have revealed the complete DNA sequence of many organisms, including that of humans.1,5 One of the impacts of microarray technology has been the increased acquisition rate of genomic data to levels several orders of magnitude higher than initially anticipated. Microarray technology allows expression levels of thousands of genes to be measured at the same time. Many important unanswered questions could potentially be answered by correctly selecting, assembling, analyzing, and interpreting these data.

Contrary to what was originally thought, the complete genome sequence doesn’t tell us much about how the organism functions as a biological system. We need to study how different gene products function to produce various components. Most important activities are not the result of a single molecule, but depend on the coordinated effects of multiple molecules. These activities and their components are called pathways. To understand a specific biological function in an organism, we need to understand the relevant pathways for that function in that organism. For instance, TGF-β (transforming growth factor beta) plays an essential role in the control of development and morphogenesis in multicellular organisms. At the moment, it is not clear how microarray data can assist in understanding biochemical pathways. However, it is predicted that the relative levels of many genes need to be examined simultaneously in order to understand pathways and their function. There are many open questions regarding the relationship between expression levels and pathways.

Microarray technology: Microarray technology relies on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale. Microarrays have revolutionized the study of the genome by allowing researchers, for the first time, to study the expression of thousands of genes simultaneously.

The action of discovering patterns of gene expression is closely related to correlating sequences of genes to specific biological functions, and thereby understanding the role of genes in biological functions on a genomic scale. The ability to simultaneously study thousands of genes under a host of differing conditions presents an immense challenge in the fields of computational science and data mining.

In order to properly comprehend and interpret expression data produced by microarray technology, new computational and data mining techniques need to be developed. The analysis and understanding of microarray data (expression level data) includes a search for genes that have similar or correlated patterns of expression. Among the various frameworks in which pattern recognition has traditionally been formulated, the statistical approach, neural network techniques, and methods imported from statistical learning theory have been applied in microarray data analysis.6 We have surveyed a number of these techniques and present a summary of that survey below.

Simple approaches: In order to understand the connection between microarray data and biological function, one could ask the following fundamental question: how do expression levels of specific genes or gene-sequences differ in a control group versus a treatment group? In other words, in a controlled study, if a specific biological function (condition) is present, how do expression levels change when the function (condition) is turned off (absent), and vice versa? This line of questioning could be useful in, for instance, determining the effect of a specific genomic treatment for a disease. In such a case, the question would be: has the treatment changed the expression level(s) of specific (target) gene(s)/gene-sequence(s) to noticeably different levels? If so, have the changes in expression levels resulted in eliminating (or alleviating) the symptoms of the patient’s condition? If the answers to the above questions are “yes”, then a correlation between the specific genes/gene-sequences that show changed levels and the biological function can be drawn.

One of the simple methods attempting to answer the above question is called the fold approach. In this approach, if the average expression level of a gene has changed by a predetermined number of “folds”, that gene's expression is declared to have been changed (from on to off, or vice versa) due to the treatment. In many studies, a 2-fold technique is used (rather than 3-fold or 4-fold), in which the average expression level has to change by at least a factor of two relative to its initial level in order for it to be classified as changed. The drawback of this method is that it is unlikely to reveal the desired correlation between expression data and function, as a predetermined factor of 2 (or 3 or 4) has different significance depending on the expression levels of various genes. A further drawback is that this method considers only the expression level of the gene in question when determining whether it has been turned “on” or “off”. A better and more biologically relevant method of analysis would be to consider expression patterns of related (or neighboring) genes to determine the “on” or “off” state of the gene currently under observation. The fold approach does not allow this type of analysis.
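As a minimal sketch of the fold approach, the following compares mean expression in a control group and a treatment group, gene by gene, against a fixed factor. The gene names, the replicate values, and the choice to flag changes in both directions are assumptions made for illustration.

```python
from statistics import mean


def fold_change_calls(control, treatment, fold=2.0):
    """Label each gene 'up', 'down', or 'unchanged' using a fixed fold threshold.

    `control` and `treatment` map a gene name to a list of replicate
    expression levels (positive numbers on the raw, non-log scale).
    """
    calls = {}
    for gene in control:
        ratio = mean(treatment[gene]) / mean(control[gene])
        if ratio >= fold:
            calls[gene] = "up"
        elif ratio <= 1.0 / fold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


if __name__ == "__main__":
    # Hypothetical measurements for three hypothetical genes.
    control = {"geneA": [100, 120, 110], "geneB": [10, 12, 8], "geneC": [500, 520, 480]}
    treatment = {"geneA": [230, 250, 240], "geneB": [11, 9, 13], "geneC": [200, 260, 230]}
    print(fold_change_calls(control, treatment))
```

Note that the same threshold is applied to a weakly expressed gene (geneB) and a strongly expressed one (geneC), which is exactly the weakness discussed above: the rule ignores how variable each gene is at its own expression level.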

Similar to the fold approach, the t-test7 is another simple method applied in gene expression analysis. It operates on the logarithm of expression levels, and requires computing the means and variances of both the treatment and control groups. The fundamental problem with the t-test is that it requires repeated treatment and control experiments, which is both tedious and costly. As a result, experiments are often run with only a small number of repetitions, which could affect the reliability of this mean-based approach.8,9
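The computation described above can be sketched as a two-sample t statistic on log-transformed expression levels. The unequal-variance (Welch) form and the base-2 logarithm are assumptions; the surveyed studies may use other variants. Each group needs at least two replicates, and the replicates must not all be identical in both groups.

```python
import math
from statistics import mean, variance


def log_t_statistic(control, treatment):
    """Two-sample t statistic (Welch form, an assumed variant) on log2 expression."""
    log_control = [math.log2(x) for x in control]
    log_treatment = [math.log2(x) for x in treatment]
    # Standard error combines the sample variances of both groups.
    standard_error = math.sqrt(
        variance(log_control) / len(log_control)
        + variance(log_treatment) / len(log_treatment)
    )
    return (mean(log_treatment) - mean(log_control)) / standard_error


if __name__ == "__main__":
    # Hypothetical replicate measurements for a single gene.
    print(log_t_statistic([1, 2, 4], [4, 8, 16]))
```

With only three replicates per group, the variance estimates in the denominator are themselves very noisy, which is the reliability concern raised above.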

Karhunen–Loève expansion: The Karhunen–Loève expansion,10 as it is known in pattern recognition, is also known as principal component analysis (PCA), or singular value decomposition (SVD)11 in statistics. Use of SVD in microarray data analysis has been suggested by various researchers.12-16 PCA is a linear mathematical technique that finds basis vectors that span the problem space (gene expression space). These vectors are called principal components (PCs). In expression data analysis, the vectors are also referred to as mean expression vectors, or eigengenes. An eigengene (PC) can be thought of as a major pattern in the data set (expression data). The more eigengenes are used to span (model) the problem space, the more accurate the representation will be. However, one should also be aware that the lower the significance of an eigengene, the more noise it represents. So a balance needs to be struck between the need for maximal coverage of the problem space and the need for elimination of noise. In most cases, PCA reduces the dimensionality of the problem space without much loss of generality or information. It is easy to think of each eigengene as the mean expression vector representing a cluster of expression data (expression pattern). In most studies involving PCA, the technique has been used to find underlying patterns or “modes” in expression data with the intention of linking these modes to the action of transcriptional regulators.

The advantage of PCA is its simplicity and the ease with which the algorithm can be understood. The method’s inherently linear nature is its prominent disadvantage. Because of its linear property, PCA works well in problem spaces that are linearly separable. However, it is yet to be shown that finding underlying modes in expression data is a linearly separable problem. PCA is a powerful technique for the analysis of gene-expression data when used in combination with another classification technique, such as k-means clustering or self-organizing maps (SOMs), that requires the user to specify the number of clusters. We will discuss both of these methods in the coming sections.
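To make the idea of an eigengene concrete, the sketch below extracts only the leading principal component of a small genes-by-arrays matrix using power iteration. This is a simplification: a full PCA or SVD would return all components, and the matrix values, the column centering, and the start vector are assumptions chosen for illustration.

```python
import math


def leading_eigengene(expression, iterations=200):
    """Leading principal component of a genes x arrays matrix via power iteration.

    Returns (component, scores, explained_fraction).
    """
    n_genes, n_arrays = len(expression), len(expression[0])
    # Center each array (column) so the component describes variation, not offset.
    means = [sum(row[j] for row in expression) / n_genes for j in range(n_arrays)]
    centered = [[row[j] - means[j] for j in range(n_arrays)] for row in expression]

    def project(v):
        return [sum(x * w for x, w in zip(row, v)) for row in centered]

    # Deterministic, non-uniform start vector (an arbitrary choice).
    v = [float(j + 1) for j in range(n_arrays)]
    for _ in range(iterations):
        scores = project(v)
        # Apply X^T X to v without forming the covariance matrix explicitly.
        w = [sum(row[j] * s for row, s in zip(centered, scores)) for j in range(n_arrays)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variation left to decompose")
        v = [x / norm for x in w]

    scores = project(v)
    total = sum(x * x for row in centered for x in row)
    explained = sum(s * s for s in scores) / total
    return v, scores, explained


if __name__ == "__main__":
    # Hypothetical expression levels: 5 genes measured on 3 arrays.
    data = [[1.0, 2.1, 2.9], [2.0, 3.9, 5.2], [3.0, 6.2, 6.8], [4.0, 8.0, 9.1], [2.5, 5.1, 4.9]]
    component, scores, explained = leading_eigengene(data)
    print(component, explained)
```

The returned fraction of variance explained is one simple way to judge how many components are worth keeping before the remaining ones mostly represent noise.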

Bayesian Belief Networks (BBN): A more sophisticated approach, the Bayesian probabilistic framework, has been suggested by various researchers for analyzing expression data.8,17,18 An alternative approach to the SVD methodology described above is to use prior knowledge of the regulatory network's architecture to design competing models, and then use Bayesian belief networks to pick the model that best fits the expression data.19 Gifford and co-workers have used this approach to distinguish between two competing models for galactose regulation.20 Friedman and co-workers have used BBNs to analyze genome-wide expression data in order to identify significant interactions between genes in a variety of metabolic and regulatory pathways.21,17

Baldi et al.8 used BBNs to model “log-expression values by independent normal distributions, parameterized by corresponding means and variances with hierarchical prior distributions. They derive point estimates for both parameter and hyperparameter, and regularize expressions for the variance of each gene by combining the empirical variance with local background variance associated with neighboring genes”. The authors report that the approach compares favorably with the simple t-test and fold method, and that it can accommodate noise, variability, and low replication often typical of microarray data.8

Clustering

Clustering is an approach that discovers natural groupings in the data. These groupings are also called clusters, or classes. Some clustering algorithms, such as Derivative-clustering algorithms, can also be used for prediction. Hierarchical clustering and k-means clustering (also called partitioning) are two major classes of clustering algorithms. For a more in-depth review of the recent work in clustering, see Fasulo.22

k-means Clustering (Partitioning): k-means is a divisive clustering approach. It partitions data (genes or experiments) into groups that have similar expression patterns. k is the number of clusters (also sometimes called buckets, bins, or classes) that the user believes the data should fall into. The number k is an input value given to the algorithm by the user. The k-means clustering algorithm is a three-step process. In the first step, the algorithm randomly assigns all training data to one of the k clusters. In the second step, the mean intra-class (within-cluster) and inter-class (between-cluster) distances are calculated. The mean intra-class distance (δc) of each cluster is calculated by first calculating a mean vector (µc) for each cluster and then averaging the distances between the vectors (data) of the cluster and its mean vector. In expression level data, the mean vector is called the average expression vector. In the following formulae, we assume the Euclidean measure of distance and arithmetic averaging for calculating the means; n_c denotes the number of vectors v_i in cluster c.

$$\mu_c = \frac{1}{n_c} \sum_{i=1}^{n_c} v_i \qquad \text{and} \qquad \delta_c = \frac{1}{n_c} \sum_{i=1}^{n_c} \lVert v_i - \mu_c \rVert$$

In various forms of the algorithm, various measures of distance and averaging techniques have been used. A popular and more representative measure of distance has been the Mahalanobis distance. The mean inter-class distance between two clusters is the distance between their respective mean vectors. ∆1,2, for instance, is the mean inter-class distance between clusters 1 and 2:

$$\Delta_{1,2} = \lVert \mu_1 - \mu_2 \rVert$$

The third step is an iterative step, and its goal is to minimize the mean intra-class distances (δc), maximize the inter-class distances (∆1,2), or both, by moving data from one cluster to another. In each iteration, one piece of data is moved to the cluster whose mean vector µc is closest to it. After each move, the mean expression vectors of the two affected classes are recalculated. This process continues until any further move would increase the mean intra-class distances (expression variability within each class) or reduce the inter-class distances. There are additional (sometimes optional) steps that can be found in variations of the basic k-means clustering algorithm. Quackenbush discusses a few optional steps and variations of this algorithm.23
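The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the common batch variant, in which every vector is reassigned to its nearest mean vector in each pass, rather than moving one vector at a time as described above. The seed, the toy data, and the helper names are hypothetical. The helper `within_cluster_spread` computes δc and `dist(means[a], means[b])` gives ∆a,b.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Arithmetic mean vector (the average expression vector) of a cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(data, k, seed=0, max_iter=100):
    if k > len(data):
        raise ValueError("k must not exceed the number of vectors")
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    labels = [0] * len(data)
    for rank, index in enumerate(order):
        labels[index] = rank % k  # step 1: random assignment, no cluster left empty
    means = [None] * k
    for _ in range(max_iter):
        # Step 2: recompute the mean vector of every non-empty cluster.
        for c in range(k):
            members = [v for v, label in zip(data, labels) if label == c]
            if members:
                means[c] = mean_vector(members)
        # Step 3 (batch form): move each vector to the cluster with the closest mean.
        new_labels = [min(range(k), key=lambda c: dist(v, means[c])) for v in data]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, means


def within_cluster_spread(data, labels, means, c):
    """Mean distance between the members of cluster c and its mean vector."""
    members = [v for v, label in zip(data, labels) if label == c]
    return sum(dist(v, means[c]) for v in members) / len(members)


if __name__ == "__main__":
    # Hypothetical two-condition expression vectors forming two obvious groups.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, means = k_means(data, k=2)
    print(labels, means)
    print(within_cluster_spread(data, labels, means, 0), dist(means[0], means[1]))
```

The result depends on the random initial assignment, so in practice the procedure is often repeated with several seeds and the partition with the smallest within-cluster spread is kept.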

k-means clustering is easy to implement. However, the major disadvantage of this method is that the number k is often not known in advance. Another potential problem with this method is that because each gene is uniquely assigned to some cluster, it is difficult for the method to accommodate a large number of stray data points, intermediates, or outliers. Further concerns have to do with the biological interpretation (in the case of expression data) of the final clustered data. In this regard, Tamayo et al. explain that k-means clustering is “a completely unstructured approach, which proceeds in an entirely local fashion and produces an unorganized collection of clusters that is not conducive to interpretation”.24

An advantage of the k-means algorithm is that, because of its simplicity, it can be used in a variety of problems. For instance, the most recent variant of the k-means clustering algorithm (at the time of this survey) designed specifically for the assessment of gene spots (on the array images) is the work of Bozinov et al.25 The technique is based on clustering pixels of a target area into foreground and background clusters. The authors report: “results from the analysis of real gene spots indicate that our approach performs superior to other existing analytical methods”.25

Hierarchical Clustering: There are various hierarchical clustering algorithms that can be applied to microarray data analysis. These include single-linkage clustering, complete-linkage clustering, average-linkage clustering, weighted pair-group averaging, and within pair-group averaging.23,26-28 These algorithms differ only in the manner in which distances are calculated between the growing clusters and the remaining members of the data set. Hierarchical clustering algorithms usually generate a gene similarity score for all gene combinations, place the scores in a matrix, join those genes that have the highest score, and then continue to join progressively less similar pairs. In the clustering process, after the similarity scores are calculated, the most closely related pairs are identified in an above-diagonal scoring matrix. In this process, a node in the hierarchy is created for the highest-scoring pair, the gene expression profiles of the two genes are averaged, and the joined elements are weighted by the number of elements they contain. The matrix is then updated, replacing the two joined elements by the node. For n genes, the process is repeated n-1 times until a single element (that contains all genes) remains.
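The joining procedure described above can be sketched as follows. The Pearson correlation is assumed as the similarity score (the text does not fix one), the gene names and profiles are hypothetical, and the sketch recomputes all pairwise scores at every step instead of updating a stored matrix. Profiles must not be constant, since the correlation is then undefined.

```python
import math


def pearson(a, b):
    """Pearson correlation, used here as the (assumed) similarity score."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return covariance / (spread_a * spread_b)


def hierarchical_cluster(profiles):
    """Agglomerate gene profiles; return the list of merges (left, right, score)."""
    # Each node stores its averaged profile and the number of genes it contains.
    nodes = {name: (list(vector), 1) for name, vector in profiles.items()}
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        # Find the highest-scoring pair among all current nodes.
        score, left, right = max(
            ((pearson(nodes[a][0], nodes[b][0]), a, b)
             for i, a in enumerate(names) for b in names[i + 1:]),
            key=lambda item: item[0],
        )
        (vec_l, n_l), (vec_r, n_r) = nodes.pop(left), nodes.pop(right)
        # Average the two profiles, weighting each by the number of genes it holds.
        merged = [(x * n_l + y * n_r) / (n_l + n_r) for x, y in zip(vec_l, vec_r)]
        nodes[(left, right)] = (merged, n_l + n_r)
        merges.append((left, right, score))
    return merges


if __name__ == "__main__":
    # Hypothetical expression profiles over four time points.
    profiles = {
        "g1": [1, 2, 3, 4],
        "g2": [2, 4, 6, 8.5],
        "g3": [4, 3, 2, 1],
        "g4": [8, 6, 4, 2.5],
    }
    for left, right, score in hierarchical_cluster(profiles):
        print(left, right, round(score, 3))
```

The list of merges, read in order, is the tree (dendrogram) that is typically drawn next to the expression matrix.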

The first report by Wen et al.26 uses clustering and data-mining techniques to analyze large-scale gene expression data. This report is significant in that it shows how integrating the results that were obtained by using various distance metrics can reveal different but meaningful patterns in the data. Eisen et al.27 also make an elegant demonstration of the power of hierarchical clustering in the analysis of microarray data.

Similar to the k-means algorithm, the advantage of hierarchical clustering is in its simplicity. A further advantage of the hierarchical technique over the k-means method is that the results from hierarchical clustering methods can easily be visualized. Although hierarchical cluster analysis is a powerful technique and possesses clear advantages for expression data analysis, it also presents researchers with two major drawbacks. The first problem arises from the “greedy” nature of the algorithm. Hierarchical clustering is essentially a greedy algorithm, and like other such algorithms, it suffers from sensitivity to early mistakes in the greedy process. Because by definition greedy algorithms cannot go back (backtrack) to redo a step that was taken by mistake, “small errors in cluster assignment in early stages of the algorithm can be drastically amplified”.29 Therefore, relying on results produced by arbitrarily imposed clustering structures (that do not correspond to reality) can be misleading. For instance, in time-course gene expression studies, hierarchical clustering has received mixed reviews. These algorithms often fail to discriminate between different patterns of variation in time. For instance, a gene expression pattern for which a high value is found at an intermediate time point will be clustered with another gene for which a high value is found at a later point in time. These variations have to be separated in a subsequent step.

The second drawback of hierarchical clustering is best described by Quackenbush: “one potential problem with many hierarchical clustering methods is that, as clusters grow in size, the expression vector that represents the cluster might no longer represent any of the genes in the cluster. Consequently, as clustering progresses, the actual expression patterns of the genes themselves become less relevant”.23 As a result, an active area of research in hierarchical cluster analysis is in detecting when to stop the hierarchical merging of elements. In this direction, new hybrid techniques are emerging that combine hierarchical methodology with k-means techniques.

Mixture models and EM (expectation maximization): Mixture models are a divide and conquer approach to statistical modeling. Mixture modeling comes from the realization that not all variables are measurable. These variables are called latent or hidden variables. Mixture models use positive convex combination of distributions (of measured and latent variables) to build a model.30,31 The reasoning for modeling with latent variables is two-fold. First, in most natural systems (biological or medical), it is not possible to measure the value of all the parameters involved. As a result, some variables that often play a role in defining the behavior of a system remain “hidden”. The second reason for using latent variables comes from the experience gained in modeling systems that are not “unknown” or “ambiguous” in as many aspects as those in nature. The experience (in particular from modeling engineering systems) shows that one can often build a simpler model using latent variables, as compared to one that does not use latent variables.

In general, there are two types of mixture models, conditional and unconditional. The unconditional varieties are generally used for density estimation, and the conditional ones are used for regression and classification problems. If the point of distinction between classification and clustering is defined to be the knowledge of class labels, mixture models become a common tool for clustering problems (where the class labels are unknown). A popular version of the mixture models is the Gaussian mixture model in which Gaussian densities are “mixed” to model a system.

Expectation-maximization (EM) is a two-step iterative process that maximizes the log-likelihood of a mixture model. During the expectation step, the distribution of the latent variables is computed. During the maximization step, the parameters of the model are updated so as to maximize the log-likelihood of the mixture (calculated using the latent variable distributions) based on the observed data and current estimates. Applying the EM algorithm to the corresponding mixture model can serve as a complementary analysis to standard hierarchical clustering. “An attractive feature of the mixture modeling approach is that the strength of evidence measure for the number of true clusters in the data is computed”.30 This assessment of reliability addresses the primary deficiency of hierarchical clustering and is often an important question to biologists considering data from microarray studies.
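The two EM steps can be illustrated on the simplest case: a one-dimensional mixture of two Gaussian densities, where the latent variable is the unknown component membership of each value. The starting values, the variance floor, and the data are assumptions made for this sketch; it is not the EMMIX-GENE procedure and it does not guard against a component losing all of its weight.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances, log_likelihood_history).
    """
    overall_mean = sum(values) / len(values)
    overall_var = sum((x - overall_mean) ** 2 for x in values) / len(values)
    means = [min(values), max(values)]  # crude, deterministic start (an assumption)
    variances = [max(overall_var, 1e-6)] * 2
    weights = [0.5, 0.5]
    history = []
    for _ in range(iterations):
        # E step: posterior probability that each value came from each component.
        resp, log_lik = [], 0.0
        for x in values:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
            log_lik += math.log(total)
        history.append(log_lik)
        # M step: re-estimate parameters from the responsibility-weighted data.
        for k in range(2):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / n_k
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / n_k, 1e-6
            )
    return weights, means, variances, history


if __name__ == "__main__":
    # Hypothetical log-expression values drawn from two groups of samples.
    values = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
    weights, means, variances, history = em_two_gaussians(values)
    print(weights, means, variances)
    print(history[0], history[-1])
```

Unlike k-means, each value receives a probability of belonging to each component rather than a hard assignment, and the log-likelihood recorded at every iteration gives a direct measure of how well the mixture fits.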

Mixture models and EM have been suggested for clustering in microarray expression data analysis. McLachlan and colleagues,32 for instance, used a mixture model and EM in the development of a software package that they called EMMIX-GENE. The authors use EMMIX-GENE to cluster microarray expression data recorded from colon and leukemia tissue samples. For both data sets, relevant subsets of genes that revealed biologically significant clustering of the tissues were selected. The revealed clusters were consistent with the results of external examination of the tissue, or with a priori biological knowledge of these sets.32

Gene Shaving: Gene shaving is a recently developed and popular statistical method for discovering patterns in gene expression data. The original algorithm uses PCA and was proposed by Hastie, Tibshirani et al.33 Later variations are under development, some of which replace the PCA step with a nonlinear variety. Gene shaving is designed to identify clusters of genes with coherent expression patterns and large variation across the samples. Using this method, the authors have successfully analyzed gene expression measurements made on samples from patients with diffuse large B-cell lymphoma, and identified a small cluster of genes whose expression is highly predictive of survival.33 The shaving algorithm can be summarized in the following steps:

1) Build the expression matrix Ξ for all genes, and center each row of Ξ to zero mean.
2) Compute the leading principal component of the rows of Ξ.
3) “Shave off” the proportion (typically 10%) of the genes having the smallest absolute inner product with the leading principal component.
4) Repeat the second and third steps until only one gene remains.
5) Estimate the optimal cluster size k (for maximum gap).
6) Orthogonalize each row of Ξ with respect to the average gene in the size-optimized gene cluster.
7) Repeat the first five steps with the orthogonalized data to find the second optimal cluster.

This process is continued until a maximum of M clusters are reached (M is chosen a priori).
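Steps 1 to 4 of the shaving procedure can be sketched as below, producing the nested sequence of gene sets from which an optimal cluster would later be chosen. The gap-based size selection (step 5) and the orthogonalization (steps 6 and 7) are deliberately omitted. The power-iteration routine, the rule of shaving at least one gene per pass, and the toy matrix are assumptions made for illustration.

```python
import math


def center_rows(matrix):
    """Step 1: center every gene (row) to zero mean."""
    return [[x - sum(row) / len(row) for x in row] for row in matrix]


def leading_pc(rows, iterations=100):
    """Step 2: leading principal component of the rows, by power iteration."""
    n_samples = len(rows[0])
    v = [float(j + 1) for j in range(n_samples)]  # arbitrary non-constant start
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        w = [sum(row[j] * s for row, s in zip(rows, scores)) for j in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("remaining genes show no variation")
        v = [x / norm for x in w]
    return v


def shave(matrix, names, fraction=0.10):
    """Steps 3 and 4: repeatedly drop the genes least aligned with the leading PC.

    Returns the nested sequence of gene-name lists, largest first.
    """
    rows = dict(zip(names, center_rows(matrix)))
    nested = [list(rows)]
    while len(rows) > 1:
        pc = leading_pc(list(rows.values()))
        strength = {g: abs(sum(x * w for x, w in zip(r, pc))) for g, r in rows.items()}
        n_drop = max(1, int(fraction * len(rows)))  # always shave at least one gene
        for gene in sorted(strength, key=strength.get)[:n_drop]:
            del rows[gene]
        nested.append(list(rows))
    return nested


if __name__ == "__main__":
    # Hypothetical expression matrix: 4 genes (rows) by 4 samples (columns).
    names = ["gA", "gB", "gC", "gD"]
    matrix = [[0, 0, 10, 10], [10, 10, 0, 0], [1, 0, 1, 0], [5, 5, 5, 5.2]]
    for gene_set in shave(matrix, names):
        print(gene_set)
```

Because the leading principal component is recomputed after every shave, the cost grows quickly with the number of genes, which is the computational burden noted for this method.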

There are three varieties of the original shaving algorithm: supervised, partially supervised, and unsupervised. In supervised and partially supervised shaving, available information about genes and samples (outcome measure, known properties of genes or samples, or any other a priori information) can be used to label the data as a means to supervise the clustering process and ensure meaningful groupings. Gene shaving also allows genes to belong to more than one cluster. These two properties (the ability to supervise and multiple groupings for the same gene) make gene shaving different from most hierarchical clustering and other widely used methods for analyzing gene expression data.

The most prominent advantage of the shaving methods was expressed by the authors themselves: “(shaving methods) search for clusters of genes showing both high variation across the samples, and coherence (correlation) across the genes. Both of these aspects are important and cannot be captured by simple clustering of genes or setting threshold of individual genes based on the variation over samples”.33

The drawback of this method is the computational intensity of the algorithm. The shaving process requires repeated computation of the largest principal component of a large set of variables. Variant approaches should consider replacement algorithms for PCA that are less computationally intensive.

Support Vector Machine (SVM): SVM is a supervised machine learning technique in the sense that vectors are classified with respect to known reference vectors. “SVM solve the problem by mapping the gene-expression vectors from expression space into a higher-dimensional feature space, in which distance is measured using a mathematical function known as a kernel function, and the data can then be separated into two classes”.23 Gene expression vectors can be thought of as points in an n-dimensional space. For microarray analysis, sets of genes are identified that represent a target pattern of gene expression. The SVM is then trained to discriminate between the data points for that pattern (positive points in the feature space) and other data points that do not show that pattern (negative points in the feature space). “With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable”.34 SVM is a linear technique in the feature space in the sense that it uses hyperplanes as separating surfaces between positive and negative points in the feature space. Specifically, SVM chooses the hyperplane that provides the maximum margin between the plane surface and the positive and negative points. This feature provides a mechanism to avoid overfitting. Once the separating hyperplane has been selected, “the decision function for classifying points with respect to the hyperplane only involves dot product between points in the feature space”, which carries a low computational burden.34
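The maximum-margin idea can be illustrated with the simplest special case, a linear SVM in which the feature space is the expression space itself. The sketch below is an assumption-laden illustration: it minimizes the regularized hinge loss by full-batch subgradient descent rather than by the quadratic programming formulation used in the cited work, and the names train_linear_svm, predict, lam, and lr are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Fit w, b by minimizing lam/2 * ||w||^2 + mean hinge loss.

    y must contain the labels +1 (positive points) and -1 (negative points).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points that violate the margin
        # Subgradient of the objective with respect to w and b.
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Classify points by the side of the hyperplane they fall on."""
    return np.where(X @ w + b >= 0, 1, -1)
```

The decision in predict uses only a dot product with w, which is the low computational burden mentioned above. A kernel SVM replaces that dot product with kernel evaluations against the support vectors.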

Because the linear boundaries in the feature space map to nonlinear boundaries in the gene expression space, SVM can be considered a nonlinear separation technique. The important advantage of SVM is that it makes it possible to train generalizable, nonlinear classifiers in high-dimensional space using a small training set. For large training sets, SVM typically selects a small support set that is necessary for designing the classifier, thereby minimizing the computational requirements during testing. Furey et al.35 praise SVM as follows: “It (SVM) has demonstrated the ability to not only correctly separate entities into appropriate classes, but also identify instances whose established classification is not supported by the data. It performs well in multiple biological analyses, having remarkable robust performance with respect to sparse and noisy data”.

SVM, however, does have drawbacks that can severely affect the outcome. One of the major drawbacks of SVM is its sensitivity to the choice of kernel function, parameters, and penalties. For instance, if the kernel function is not appropriately chosen for the training data, SVM may not be able to find a separating hyperplane in feature space.34 Choosing the best kernel function, parameters, and penalties for the SVM algorithm can be difficult in many cases. Because of the sensitivity of the algorithm to these choices, different choices often yield completely different classifications. As a result, “it is necessary to successively increase kernel complexity until an appropriate (biologically sound) classification is achieved”.23 The ad hoc nature of the penalty term (error penalty), the computational complexity of the training procedure (a quadratic minimization problem), and the risk of overfitting as kernel complexity increases are further drawbacks of this method.

Hidden Markov Model: Hidden Markov models (HMMs) have been extensively studied for speech recognition and processing. Haussler et al.36 and Krogh et al.37 were among the pioneers in using HMMs in biological data mining. HMMs are state-based models, in which states are defined based on a priori knowledge of the biological system to be modeled. The most popular version of HMM in sequencing and microarray data analysis is the profile HMM. Profile HMMs combine the profile38 and weight matrix method39 and treat the gaps in a systematic way40.

HMMs have been applied to various problems in biological data mining, including gene finding37,41,42 and sequence alignment43,44. Sonnhammer created Pfam, a protein database using HMM engines to provide a means to align multiple sequences.43 Sequence alignment and modeling (SAM), created by Karplus, is another example of the application of HMMs in sequence alignment.44 A more recent application of HMMs is the work of Church and colleagues4 from Harvard Medical School, in which the authors studied the very important question of whether one can predict new sites on the basis of a few known sites. For this analysis, the authors generated a probability distribution over all potential sequences using a hidden Markov model (HMM) derived from a weight matrix created from a few sites. The authors note the limitation of the technique in that “even when contained within the structure of a HMM, the use of binding site weight matrices makes the assumption that individual nucleotides, or groups of nucleotides, within the DNA binding site can be treated independently.”
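The independence assumption in the quotation above can be made concrete with a plain position weight matrix. The sketch below is only an illustration of that assumption, not the HMM used by the cited authors: it builds a weight matrix from a few sites and scores a sequence as the product of per-position probabilities. The example sites, the pseudocount, and the function names are hypothetical.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"

def weight_matrix(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned sites."""
    length = len(sites[0])
    counts = np.full((length, len(ALPHABET)), pseudocount)
    for site in sites:
        for pos, base in enumerate(site):
            counts[pos, ALPHABET.index(base)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def site_probability(seq, pwm):
    """Probability of seq when every position is treated independently."""
    return float(np.prod([pwm[pos, ALPHABET.index(base)]
                          for pos, base in enumerate(seq)]))

# Hypothetical known sites; the resulting distribution covers all 4^5 sequences.
sites = ["TGACT", "TGACA", "TGAGT"]
pwm = weight_matrix(sites)
total = sum(site_probability("".join(s), pwm)
            for s in itertools.product(ALPHABET, repeat=5))
```

Because the score is a product over positions, any dependency between neighboring nucleotides is invisible to this model, which is the limitation the authors point out.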

Other clustering techniques in microarray data analysis: Sasik et al.29 have presented the Percolation Clustering approach to clustering of gene expression patterns based on the mutual connectivity of the patterns. Unlike SOM or k-means, which force gene expression data into a fixed number of predetermined clustering structures, this approach aims to reveal the natural tendency of the data to cluster, in analogy to the physical phenomenon of percolation.

GA/KNN is another algorithm, described by Li et al.:45 “This approach combines a Genetic Algorithm (GA) and the k-Nearest Neighbor (KNN) method to identify genes that can jointly discriminate between different classes of samples. The GA/KNN is a supervised stochastic pattern recognition method. It is capable of selecting a subset of predictive genes from a set of large noisy data for sample classification.”45 We will discuss genetic algorithms next.

Genetic Algorithm: Holland46 invented the genetic algorithm (GA) in 1975. GA is essentially an optimization technique that was inspired by mutation (in nature) that gives rise to biological evolution. In GA, the coordinates of points in the problem space are organized as a sequence, much like sequences of genes. The process of searching for a maximum or a minimum is accomplished by mutating the sequence, and hence arriving at a new coordinate. At each new coordinate, the function is evaluated, and if the new point is determined to be more optimal than those previously observed, the new point is stored as the new extremum (minimum or maximum). GA has been used in a variety of applications in sequencing. For instance, in DNA fragment assembly, the work of Parsons et al47, Cedeno and Vemuri48, and Fickett and Cinkosky49 can be mentioned. Zhang and Wong50 applied GA to multiple molecular sequence alignment. Most varieties of GA differ in the way the sequences are mutated, and hence search the problem space in different patterns. As an example of a variety, H. Valafar’s distributed global optimization (DGO) algorithm can be mentioned.51 Hybrid systems also exist in which, for instance, a neural network is built using a GA as the learning algorithm. For example, H. Valafar used the DGO as the learning algorithm of a multilayer, feed-forward neural network to develop a system that could automatically identify the chemical structure of a group of complex carbohydrates and some glycoproteins from their 1H-NMR spectra.51,52
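The mutate, evaluate, and keep-the-best loop described above can be sketched as follows. This is a deliberately minimal, mutation-only search over bit strings; full genetic algorithms usually also maintain a population and recombine sequences, which is left out here. The function names, the mutation rate, and the toy fitness function used in the example are assumptions for illustration.

```python
import random

def mutate(bits, rate, rng):
    """Flip each bit independently with probability rate."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

def mutation_search(fitness, n_bits=20, generations=2000, rate=0.05, seed=0):
    """Maximize fitness over bit sequences by repeated mutation."""
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n_bits)]
    best_fit = fitness(best)
    for _ in range(generations):
        candidate = mutate(best, rate, rng)
        cand_fit = fitness(candidate)
        # Store the new point only if it improves on the best seen so far.
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return best, best_fit

# Toy example: the fitness of a sequence is the number of ones it contains.
best, best_fit = mutation_search(sum)
```

Varieties of GA differ mainly in how mutate (and, where present, recombination) proposes new sequences.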

Artificial Neural Network:

Artificial neural networks (ANNs) belong to the adaptive class of techniques in machine learning. ANNs have been used as solutions to various types of problems (e.g. pattern recognition, prediction, estimation, etc.). However, ANNs’ success as an intelligent pattern recognition methodology has been advertised most prominently. ANNs were inspired by the brain (a biological neural network). Most models of ANNs are organized in the form of a number of processing units (also called artificial neurons, or simply neurons53), and a number of weighted connections (artificial synapses) between the neurons. The process of building an ANN (similar to its biological inspiration) involves a learning episode (also called training). During the learning episode, the network observes a sequence of recorded data, and adjusts the strength of its synapses according to a learning algorithm and based on the observed data. The process of adjusting the synaptic strengths in order to be able to accomplish a certain task (much like the brain) is called “learning”. Learning algorithms are generally divided into two types, supervised and unsupervised. Supervised algorithms require labeled training data. In other words, they require more a priori knowledge about the training set. The most important, and attractive, feature of ANNs is their capability of learning (generalizing) from example (extracting knowledge from data). ANNs can do this without any prespecified rules that define intelligence or represent an expert’s knowledge. This feature makes the ANN an attractive choice for gene expression analysis and sequencing. ANNs were the first group of machine learning algorithms to be used on a biological pattern recognition problem.54

Due to their power and flexibility, ANNs have even been used as tools for selection of relevant variables, which can in turn greatly increase the expert’s knowledge and understanding of the problem. For instance, Selaru et al. used ANNs to distinguish among subtypes of neoplastic colorectal lesions.55 They then used the trained ANN to identify the relevant genes that are used to make this distinction. Specifically, the authors evaluated the ability of ANNs to distinguish between complementary DNA (cDNA) microarray data (8064 clones) of two types of colorectal lesions (sporadic colorectal adenomas and cancers or SAC, and inflammatory bowel disease-associated or IBD-associated dysplasias and cancers). Selaru and colleagues report the failure of hierarchical clustering to make the above distinction, even when all 8064 clones were used. ANNs not only correctly identified all twelve samples of the test set (3 IBDNs and 9 SACs), but also helped identify the subset of genes that were important to make this diagnostic distinction: “Using an iterative process based on the computer programs GeneFinder, Cluster, and MATLAB, we reduced the number of clones used for diagnosis from 8064 to 97.” Using the 97 clones, even the cluster analysis was then able to make the correct distinction between the two types of lesions. The authors conclude: “Our results suggest that ANNs have the potential to discriminate among subtly different clinical entities, such as IBDNs and SACs, as well as to identify gene subsets having the power to make these diagnostic distinctions.”

There is a very large body of research that has resulted in a large number of ANN designs. For a more comprehensive review of the various ANN types, see 56,57. In this paper, we discuss only some of the types that have been used in sequencing.

Layered, feed-forward neural networks: This is a class of ANNs whose neurons are organized in layers. The layers are normally fully connected, meaning that each element (neuron) of a layer is connected to each element of the next layer. However, self-organizing varieties also exist in which a network starts either with a minimal number of synaptic connections between the layers and adds to the number as training progresses (constructive), or starts as a fully connected network and prunes connections based on the data observed in training (destructive).56,57

Backpropagation56,57 is a learning algorithm that, in its original version, belongs to the gradient descent optimization methods58. It is the most popular learning algorithm that has been used to train layered ANNs. A large number of varieties of the algorithm have been developed that use various optimization techniques.57 The combination of the backpropagation learning algorithm and the feed-forward, layered network provides the most popular type of ANN. These ANNs have been applied to virtually all pattern recognition problems, and are typically the first networks tried on a new problem. The reason for this is the simplicity of the algorithm, and the vast body of research that has studied these networks. As such, in sequencing many researchers have also used this type of network as a first line of attack. Examples include 59,60. Wu59 developed a system called gene classification artificial neural system (GenCANS), which is based on a three-layered, feed-forward backpropagation network. GenCANS was designed to “classify new (unknown) sequences into predefined (known) classes. It involves two steps, sequence encoding and neural network classification, to map molecular sequences (input) into gene families (output)”. The same type of network has been used to perform rapid searches for sequences of proteins.60
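A minimal sketch of a fully connected, feed-forward network trained by gradient-descent backpropagation is given below. It is not GenCANS or any of the cited systems; the sigmoid units, squared-error loss, layer size, learning rate, and the XOR toy data are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer network; returns weights and loss history."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass through the two layers.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: propagate the error signal layer by layer
        # (the constant factor 2 of the squared error is folded into lr).
        d_out = err * out * (1.0 - out)
        d_h = (d_out @ W2.T) * h * (1.0 - h)
        # Gradient-descent update of the synaptic weights.
        W2 -= lr * h.T @ d_out / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_h / len(X)
        b1 -= lr * d_h.mean(axis=0)
    return (W1, b1, W2, b2), losses

# Toy data: the XOR problem, which is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params, losses = train_mlp(X, y)
```

The hidden-layer size is fixed in advance here, so the size has to be chosen by the user before training begins.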

Other examples can be found in Snyder and Stormo’s work in designing a system called GeneParser.61 Here the authors experimented with two variations of a single-layer network (one fully connected, and one partially connected with an activation bias added to some inputs), as well as a partially connected two-layer network. The authors use dynamic programming as the learning algorithm in order to train the system for protein sequencing.

As mentioned before, the advantage of these networks is in their simplicity of implementation and understanding of the underlying mathematics. Because of the large body of research conducted on these networks, there are a large number of public domain software packages available that implement virtually all the different varieties of the network.

Although these networks are theoretically capable of separating a problem space into appropriate classes irrespective of the complexity of the separation boundaries, one of their “classical” disadvantages is that a certain amount of a priori knowledge is required in order to build a useful network. A crucial factor in training a useful network is its size (number of layers, size of layers, and number of synaptic connections). In many cases, it takes a large number of simulations before a close-to-optimum size of the network is found. If the network is designed to be larger than this optimum size, it will memorize (also called over-fit) the data rather than generalizing and extracting knowledge. If the network is chosen to be smaller than the optimum size, the network will never learn the entire task at hand. An attractive alternative to these networks is the class of self-organizing networks, which automatically, or semi-automatically, determine the optimal size from the data at hand.

Self-organizing neural networks

These networks are a very large class of neural networks whose structure (number of neurons, number of synaptic connections, number of modules, or number of layers) changes during learning based on the observed data. There are two classes of this type of network: destructive and constructive.57 Destructive networks start with a fully connected topology, and the learning algorithm prunes synapses (sometimes entire neurons, modules, or layers) based on the observed data. The network that remains after learning is usually sparsely connected. Constructive algorithms start with a minimally connected network, and gradually add synapses (neurons, modules, or layers) as training progresses in order to accommodate the complexity of the task at hand.

Self-organizing networks fall into some well known groups. Below we summarize the application of some of these major groups in sequencing and expression level data analysis.


Self-Organizing Map: A self-organizing map (SOM)62 is a type of neural network approach first proposed by Kohonen63. SOMs have been used as a divisive clustering approach in many areas, including genomics. Several groups have used SOMs to discover patterns in gene expression data. Among these, Toronen et al64, Golub et al65, and Tamayo et al66 can be mentioned. Tamayo and colleagues66 use self-organizing maps to explore patterns of gene expression generated using Affymetrix arrays, and provide the GENECLUSTER implementation of SOMs. Tamayo and his colleagues explain the implementation of SOM for expression level data as follows: “An SOM has a set of nodes with a simple topology and a distance function on the nodes. Nodes are iteratively mapped into k-dimensional “gene expression space” (in which the i-th coordinate represents the expression level in the i-th sample)”66. A SOM assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. The summary of the basic SOM algorithm is perhaps best described in Quackenbush’s review: “First, random vectors are constructed and assigned to each partition. Second, a gene is picked at random and, using a selected distance metric, the reference vector that is closest to the gene is identified. Third, the reference vector is then adjusted so that it is more similar to the vector of the assigned gene. The reference vectors that are nearby on the two-dimensional grid are also adjusted so that they are more similar to the vector of the assigned gene. Fourth, steps 2 and 3 are iterated several thousand times, decreasing the amount by which the reference vectors are adjusted and increasing the stringency used to define closeness in each step. As the process continues, the reference vectors converge to fixed values. Last, the genes are mapped to the relevant partitions depending on the reference vector to which they are most similar”.67
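The steps in the quoted summary can be sketched as follows. This is an illustrative reading of those steps and not the GENECLUSTER implementation: the Euclidean distance, the Gaussian neighborhood, the linear decay schedules, and the 3 x 2 grid are assumptions, as are the function names train_som and assign.

```python
import numpy as np

def train_som(data, grid=(3, 2), iters=2000, lr0=0.5, radius0=1.5, seed=0):
    """Fit one reference vector per grid node to the rows (genes) of data."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    # First: random reference vectors, one per partition.
    refs = rng.normal(size=(rows * cols, data.shape[1]))
    for t in range(iters):
        # Fourth: shrink the adjustment and tighten "closeness" over time.
        frac = 1.0 - t / iters
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        # Second: pick a gene at random and find the closest reference vector.
        x = data[rng.integers(len(data))]
        winner = np.argmin(np.linalg.norm(refs - x, axis=1))
        # Third: move the winner, and nodes nearby on the grid, toward the gene.
        grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
        influence = np.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
        refs += lr * influence[:, None] * (x - refs)
    return refs

def assign(data, refs):
    """Last: map each gene to the partition with the most similar reference."""
    dists = np.linalg.norm(data[:, None, :] - refs[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

Note that the grid shape, and therefore the number of partitions, is supplied by the user.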

SOMs, like the gene shaving68 approach, have the distinct advantage that they allow a priori knowledge to be included in the clustering process. Tamayo et al explain this and other advantages of the SOM approach as follows: “The SOM has a number of features that make them particularly well suited to clustering and analyzing gene expression patterns. They are ideally suited to exploratory data analysis, allowing one to impose partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the nonstructure of k-means clustering) facilitating easy visualization and interpretation. SOMs have good computational properties and are easy to implement, reasonably fast, and are scalable to large data sets”.66

The most prominent disadvantage of the SOM approach is that it is difficult to know when to stop the algorithm. If the map is allowed to grow indefinitely, the size of SOM is gradually increased to a point where clearly different sets of expression patterns are identified. Therefore, as with k-means clustering, the user has to rely on some other source of information, such as PCA, to determine the number of clusters that best represents the available data. For this reason, Sasik and his colleagues believe that “SOM, as implemented by Tamayo et al66, is essentially a restricted version of k-means: Here, the k clusters are linked by some arbitrary user-imposed topological constraints (e.g. a 3 x 2 grid), and as such suffers from all of the problems mentioned above for k-means (and more), except that the constraints expedites the optimization process”.69

Self-organizing feature maps (SOFM) is one of many varieties of SOM that should be mentioned here.62,63 Growing cell structure (GCS)70 is in turn a derivative of SOFM. It is a self-organizing and incremental (constructive) neural learning approach. The unsupervised variety of GCS was used by Azuaje71 for discovery of gene expression patterns in B-cell lymphoma. Using cDNA microarray expression data, GCS was able to “identify normal and diffuse large B-cell lymphoma (DLBCL) patients. Furthermore, it distinguishes patients with molecularly distinct types of DLBCL without previous knowledge of those subclasses”.

Self-organizing trees: Self-organizing trees are normally constructive neural network methods that develop into a tree (usually binary tree) topology during learning. The works of Dopazo et al72, Wang et al73,74, and Herrero et al75 are examples of the application of these networks to sequencing. Dopazo and Carazo introduce the self-organizing tree algorithm (SOTA).72 SOTA is a hierarchical neural network that grows into a binary tree topology. For this reason SOTA can be considered a hierarchical clustering algorithm68. SOTA is based on Kohonen’s SOM (discussed above) and Fritzke’s growing cell structures70.


SOTA was originally designed to analyze pre-aligned sequences of genes (phylogenetic reconstruction). Wang and colleagues extended that work in 73,74 in order to classify protein sequences: “(SOTA) is now adapted to be able to analyze patterns associated to the frequency of residues along a sequence, such as protein dipeptide composition and other n-gram compositions”. Herrero and colleagues demonstrate the application of SOTA to microarray expression data.75 They show that SOTA’s performance is superior to that of classical hierarchical clustering techniques. Among the advantages of SOTA, as compared to hierarchical clustering algorithms, are its time complexity and its top-to-bottom hierarchical approach. SOTA’s runtimes are approximately linear in the number of items to be classified, making it suitable for large datasets. Also, because SOTA forms higher clusters in the hierarchy before forming the lower clusters, it can be stopped at any level of the hierarchy and still produce meaningful intermediate results. There are many other types of self-organizing trees that for space considerations cannot be mentioned here. SOTA was only chosen here to give a brief introduction to these types of networks.

ART and its derivatives: Adaptive Resonance Theory (ART) was introduced by Stephen Grossberg in 1976.76,77 Networks designed based on ART are unsupervised and self-organizing, and only learn in the so-called “resonant” state. ART can form (stable) clusters of arbitrary sequences of input patterns by learning (entering resonant states) and self-organizing. Since its inception, many derivatives of ART have emerged. Among these, ART-1 78 (the binary version of ART; forms clusters of binary input data), ART-2 79 (analog version of ART), ART-2A 80 (fast version of ART-2), ART-3 81 (includes "chemical transmitters" to control the search process in a hierarchical ART structure), and ARTMAP 82 (supervised version of ART) can be mentioned. Many hybrid varieties such as Fuzzy-ART 83, Fuzzy-ARTMAP 84,85 (supervised Fuzzy-ART) and simplified Fuzzy-ARTMAP (SFAM) 86 have also been developed.

The ART family of networks has broad application in virtually all areas of pattern recognition. As such, they have also been applied in biological sequencing problems. In general, in problem settings where the number of clusters is not previously known, researchers tend to use unsupervised ART, whereas when the number of clusters is known a priori, the supervised version, ARTMAP, is usually used.
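How an unsupervised ART network arrives at the number of clusters on its own can be sketched with Fuzzy ART. The complement coding, choice function, vigilance test, and update rule below follow the commonly cited Fuzzy ART formulation, which this survey does not spell out, so the details should be read as assumptions. The parameter names rho (vigilance), alpha, and beta, their values, and the toy data are hypothetical.

```python
import numpy as np

def fuzzy_art(data, rho=0.75, alpha=0.001, beta=1.0):
    """Cluster rows of data (scaled to [0, 1]); returns labels and weights."""
    inputs = np.hstack([data, 1.0 - data])  # complement coding
    weights = []
    labels = []
    for vec in inputs:
        # Rank existing categories by the choice function.
        order = sorted(
            range(len(weights)),
            key=lambda j: -np.minimum(vec, weights[j]).sum()
            / (alpha + weights[j].sum()),
        )
        chosen = None
        for j in order:
            match = np.minimum(vec, weights[j]).sum() / vec.sum()
            if match >= rho:
                # Resonance: the category is close enough, so it learns.
                weights[j] = (beta * np.minimum(vec, weights[j])
                              + (1.0 - beta) * weights[j])
                chosen = j
                break
        if chosen is None:
            # No category passes the vigilance test: create a new one.
            weights.append(vec.copy())
            chosen = len(weights) - 1
        labels.append(chosen)
    return np.array(labels), np.array(weights)

# Toy data: two tight groups of two-dimensional profiles.
profiles = np.array([[0.10, 0.10], [0.12, 0.09], [0.90, 0.90], [0.88, 0.92]])
labels, weights = fuzzy_art(profiles)
```

The number of categories is never passed in. It emerges from the vigilance threshold, with higher values of rho producing more and tighter clusters.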

Among the unsupervised implementations, the work of Tomida et al87 should be mentioned. Here the authors used Fuzzy ART for expression level data analysis. This study is significant in that it compares the results of Fuzzy ART with those of hierarchical clustering, k-means clustering, and SOMs in analyzing and sequencing expression level data on the genomic scale. The authors consider a two-dimensional problem space in which the first dimension is the expression level and the second dimension is time, in order to capture the variation of expression levels over time. They report that Fuzzy ART achieved the lowest gap index among the four algorithms considered. (The authors use the gap index to “evaluate the similarity of profiles as area between each profile and average profile during temporal phase”.) In terms of robustness of clustering results, the authors report a robustness of 82.2% for Fuzzy ART versus 71.1%, 44.4% and 46.7% for hierarchical clustering, k-means clustering, and SOMs, respectively. As a result, the authors conclude that Fuzzy ART also performs best on noisy data.

Among supervised implementations, Azuaje’s use of SFAM in cDNA data analysis with the purpose of discovery of gene function in cancer can be mentioned.88 In this study, the authors use a data set generated for a previous study71 using SOFM. Here the authors were able to also distinguish between classes of cancer using SFAM instead of SOFM (reported earlier in this paper).

Other neural network techniques

Molecular computers and molecular neural networks: The idea of molecular and cellular computing dates back to the late 1950’s when Richard Feynman delivered his famous paper describing "sub-microscopic" computers. He suggested that living cells and molecules could be thought of as potential computational components. Along the same lines, a revolutionary form of computing was developed by Leonard Adleman in 1994, when he announced that he had been able to solve a small instance of a computationally intractable problem using a small vial of DNA.89 Adleman had accomplished this by representing information as sequences of bases in DNA molecules. He then showed how to exploit the self-assembly property of DNA, and use DNA-manipulation techniques to implement a simple but massively parallel random search.

As a result, it has been suggested by Mills90 (among others) that a “DNA computer using cDNA as input might be ideal for clinical cell discrimination”. In particular, Mills recommends the neural network version of Adleman’s DNA computing scheme, also developed by Mills and his colleagues91, for the classification and sequencing tasks involved in expression level profiling. Mills reports in 90 that his “preliminary experimental results suggest that expression profiling should be feasible using a DNA neural network that acts directly on cDNA”.

The DNA neural networks are the most recent and innovative ANN solution suggested for expression data profiling and sequencing. Their massively parallel nature offers exceedingly fast computation. The method shows high potential and could offer a breakthrough advancement to the field.

Bayesian Neural Networks: A number of other networks have recently been suggested as solutions for sequencing and expression data analysis. For instance, Bayesian neural networks (BNNs) have recently been used for gene expression analysis. Liang et al92 have used BNNs with structural learning for exploring microarray gene expression data. The BNNs are an important addition to the host of ANN solutions that have been offered to the problem at hand, as they represent a large group of hybrid ANNs that combine classical ANNs with statistical classification and prediction theories.

Missing data

Missing data is one of the problems that pattern recognition techniques need to deal with when analyzing microarray expression data. Microarray experiments often generate expression data arrays with some missing values. Many techniques, such as hierarchical and K-means clustering, are ill suited to analyze such problem spaces, as they require a complete data matrix to do the analysis. It is important to recognize the fact that such methods require a complementary preprocessing algorithm to fill in an estimate for the missing data. Various interpolation algorithms can be used for this purpose with varying degrees of success. These techniques can range from simple steps, such as filling the spot of the missing data with a zero (which is only partially effective for a very narrow range of classification algorithms) or row (gene) averaging, to using sophisticated interpolation techniques to fill in the missing data spots.

Some algorithms are better equipped to deal with missing data, and provide the means to estimate the missing data. For instance, the k-nearest neighbors (KNN) algorithm “selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiment 2 - N (where N is the total number of experiments). A weighted average of values in experiment 1 from k closest genes is then used as an estimate for missing value in gene A”.93
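The quoted procedure can be sketched as follows. This is a minimal illustration for a single missing entry and not the implementation of the cited authors: the root-mean-square distance over shared experiments and the inverse-distance weighting are assumptions, as are the function name knn_impute_value and the toy matrix.

```python
import numpy as np

def knn_impute_value(X, gene, col, k=3):
    """Estimate the missing entry X[gene, col] from the k most similar genes."""
    other_cols = [c for c in range(X.shape[1]) if c != col]
    target = X[gene, other_cols]
    candidates = []
    for g in range(X.shape[0]):
        # Only genes with a value present in the experiment of interest qualify.
        if g == gene or np.isnan(X[g, col]):
            continue
        row = X[g, other_cols]
        shared = ~np.isnan(target) & ~np.isnan(row)
        if not shared.any():
            continue
        # Similarity is measured on the remaining experiments only.
        dist = np.sqrt(np.mean((target[shared] - row[shared]) ** 2))
        candidates.append((dist, X[g, col]))
    if not candidates:
        raise ValueError("no gene has a value in the requested experiment")
    candidates.sort(key=lambda pair: pair[0])
    dists = np.array([d for d, _ in candidates[:k]])
    values = np.array([v for _, v in candidates[:k]])
    # Weighted average: closer genes contribute more.
    w = 1.0 / (dists + 1e-12)
    return float(np.sum(w * values) / np.sum(w))

# Toy matrix: gene 0 is missing its value in experiment 0.
X = np.array([[np.nan, 1.0, 2.0, 3.0],
              [10.0, 1.0, 2.0, 3.0],
              [50.0, 9.0, 9.0, 9.0],
              [12.0, 1.0, 2.0, 3.2]])
estimate = knn_impute_value(X, gene=0, col=0, k=2)
```

In the toy matrix, gene 1 matches gene 0 exactly on the other experiments, so the estimate is dominated by its value.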

Another algorithm well suited to missing data is SVD. This is because intermediate results of SVD can be used to estimate the missing data. Specifically, once the k most significant eigengenes are identified, they can be used to estimate a missing value in a gene. First, the gene with the missing value is regressed against the k eigengenes, and then the coefficients of the regression are used to reconstruct the missing value from a linear combination of the k eigengenes. On comparing SVD with the row averaging technique in estimating missing values, Troyanskaya et al say the following: “Row average method performs poorly on non-time series data. KNN- and SVD-based methods provide fast and accurate ways of estimating missing values for microarray data. Both methods far surpass the row average solutions by taking advantage of the correlation structure of the data to estimate missing expression values. Compared to SVD impute, KNN-based imputation shows less deterioration in performance with increasing percent of missing entries. In addition, KNN impute method is more robust than SVD to the type of data for which estimation is performed, performing better on non-time series or noisy data. KNN impute is also less sensitive to the exact parameters used (number of nearest neighbors), whereas the SVD-based method shows sharp deterioration in performance when a non-optimal fraction of missing values is used”.93 Troyanskaya et al93 give a thorough review of the missing data problem in gene expression analysis.
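The regress-and-reconstruct step described above can be sketched as follows. This is a simplified illustration and not the procedure of Troyanskaya et al: it estimates the eigengenes only from genes that have no missing values, and handles a single missing entry in one pass. The function name svd_impute_value and the synthetic rank-2 data are assumptions.

```python
import numpy as np

def svd_impute_value(X, gene, col, k=2):
    """Estimate the missing entry X[gene, col] from the top k eigengenes."""
    # Eigengenes: leading right singular vectors of the complete rows.
    complete = ~np.isnan(X).any(axis=1)
    _, _, vt = np.linalg.svd(X[complete], full_matrices=False)
    eigengenes = vt[:k]
    # Regress the gene against the eigengenes on its observed experiments.
    observed = ~np.isnan(X[gene])
    coef, *_ = np.linalg.lstsq(eigengenes[:, observed].T,
                               X[gene, observed], rcond=None)
    # Reconstruct the missing value as a linear combination of eigengenes.
    return float(coef @ eigengenes[:, col])

# Synthetic example: an exactly rank-2 expression matrix with one value removed.
rng = np.random.default_rng(0)
full = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))
X = full.copy()
X[4, 3] = np.nan
estimate = svd_impute_value(X, gene=4, col=3, k=2)
```

On noise-free low-rank data such as this, the reconstruction is exact when k matches the rank, whereas on real data the result depends on the choice of k.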

Conclusion

There are a number of pattern recognition techniques to analyze microarray expression data.6,8,23 The simplest category of these techniques is based on individual gene analysis. Examples of these techniques are the fold approach, the t-test rule, and the Bayesian framework. More sophisticated techniques include clustering analysis methods and the SVD technique. The hypothesis behind using clustering methods is that “genes in a cluster must share some common function or regulatory elements. However, classifications based on clustering algorithms are dependent on the particular methods used, the manner in which the data are normalized within and across experiments, and the manner in which we measure the similarity. Although different techniques might be more or less appropriate for different data sets, there is no such thing as a single correct classification.”23 We believe that the performance of a technique (newly developed or old) should ultimately be measured in its practicality, and in the biological soundness and relevance of its results. For instance, as noted in 4, methods that presume the ability to treat nucleotides individually (such as an HMM built from binding site weight matrices) are harder to take advantage of. As such, techniques that allow the incorporation of a priori biological knowledge (partially supervised), and offer flexibility in altering their underlying assumptions, show greater potential.

This review represents only a small part of the research being conducted in the area, and is only meant as a complement to, and continuation of, the surveys that others have conducted in this area6,23,93-95. It should in no way be taken as a complete survey of all algorithms. For the reason of limited space, some significant developments in the area had to be left out. Furthermore, new techniques and algorithms are being proposed for microarray data analysis on a daily basis, making survey articles such as this highly time dependent.

References

1. Venter, J.C. et al. 2001. The sequence of the human genome. Science 291, 1304–1351.

2. Crick F.H.C. et al. 1961. General nature of the genetic code for proteins. Nature 192: 1227-32.

3. Brenner S. et al. 1961. An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature 190: 576-81.

4. Bulyk ML, P.L. Johnson, G.M. Church. Mar. 1, 2002. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 30(5):1255-61.

5. Lander, E.S. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

6. Szabo A. et al. 2002. Variable selection and pattern recognition with gene expression data generated by the microarray technology. Math Biosci 176(1), 71-98.

7. Stanton A. Glantz. 2001. Primer of Biostatistics. Fifth Edition. McGraw-Hill Professional.

8. Baldi, P. et al. 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes. Bioinformatics 17(6), 509-519.

9. Pan, W. 2002. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18(4), 546-54.

10. Mallat, S. G. 1999. A Wavelet Tour of Signal Processing, 2nd edition, Academic Press, San Diego.

11. Anderson, T. W. 1984. Introduction to Multivariate Statistical Analysis, 2nd edition, Wiley, NY.

12. Alter, O. et al. 2000. Singular Value Decomposition for Genome-wide Expression Data Processing and Modeling. Proc. Natl. Acad. Sci. USA 97, 10101-10106.

13. Holter N.S. et al. 2001. Dynamic modeling of gene expression data. Proc Natl Acad Sci USA 98: 1693–1698.

14. Holter N.S. 2000. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA 97: 8409–8414.

15. Raychaudhuri S., JM Stuart, RB Altman. 2000. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput, 455-466.

16. Wu C., M. Berry, et al. 1995. Neural networks for full-scale protein sequence classification: sequence encoding with singular value decomposition. Machine Learning 21, 177-193.

17. Friedman N., M. Linial, I. Nachman and D. Pe'er. 2000. Using Bayesian networks to analyze expression data. J Comput Biol 7, 601–620.

18. Drawid A. and M. Gerstein. 2000. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol 301, 1059–1075.

19. Gifford D.K. 2001. Blazing pathways through genetic mountains. Science 293, 2049–2051.

20. Hartemink AJ, Gifford DK, et al. 2001. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac Symp Biocomput 2001, 422-433.

21. D. Pe'er, A. Regev, G. Elidan and N. Friedman. 2001. Inferring subnetworks from perturbed expression profiles. Bioinformatics 17 Suppl. 1, S215–S224.

22. Fasulo, D. 1999. An Analysis of Recent Work on Clustering Algorithms. Technical Report: 01-03-02, Department of Computer Science and Engineering, University of Washington.

23. Quackenbush, J. 2001. Computational Analysis of Microarray Data. Nature Reviews Genetics 2, 418-427.

24. Tamayo, P. et al. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912.

25. Bozinov D. 2002. Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics 18(5), 747-56.

26. Wen, X. et al. 1998. Large-scale Temporal Gene Expression Mapping of Central Nervous System Development. Proc. Natl. Acad. Sci. USA 95, 334-339.

27. Eisen, M. B. et al. 1998. Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc. Natl. Acad. Sci. USA 95, 14863-14868.

28. Alon, U. et al. 1999. Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc. Natl. Acad. Sci. USA 96, 6745-6750.

29. Sasik, R. et al. 2001. Percolation Clustering: A Novel Approach to the Clustering of Gene Expression Patterns in Dictyostelium Development. PSB Proceedings 6, 335-347.

30. Baldi, P. 2000. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics 16, 367-371.

31. Ghosh, D. 2001. Mixture modeling of gene expression data from microarray experiments. Bioinformatics 18(2), 275-286.

32. McLachlan G. J., R. W. Bean, and D. Peel. 2002. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422.

33. Hastie, T. et al. 2000. “Gene Shaving” as a Method for Identifying Distinct Sets of Genes with Similar Expression Pattern. Genome Biology 1, 1-21.

34. Brown, M. P. S. et al. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 97, 262-267.

35. Furey, T.S. et al. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906-914.

36. Brown M.P., R. Hughey, A. Krogh, I. S. Mian, K. Sjölander, and D. Haussler. July 1993. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In L. Hunter, D. Searls, and J. Shavlik, editors, ISMB-93, pages 47-55, Menlo Park, CA. AAAI/MIT Press.

37. Krogh A., I.S. Mian, and D. Haussler. 1994. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22: 4768-4778.

38. Gribskov, M. and S. Veretnik. 1996. Identification of sequence patterns with profile analysis. Methods in Enzymology. 266: 198-227.

39. Staden, R. 1989. Methods for calculating the probabilities of finding patterns in sequences. CABIOS. 5: 89-96.

40. R. Hughey and A. Krogh. 1996. Hidden Markov models for sequence analysis: Extension and analysis of the basic method, CABIOS, 12(2):95-107.

41. Henderson, J., S. Salzberg and K. Fasman. 1997. Finding genes in human DNA with a hidden Markov model. J. Comput. Biol. 4: 127-141.

42. Lukashin, A.V. and M. Borodovsky. 1998. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 26: 1107-1115.

43. Sonnhammer E.L.L., S.R. Eddy, and R. Durbin. 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function, and Genetics 28: 405-420.

44. Karplus, K., K.C. Barrett, and R. Hughey. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846-856.

45. Li L.P., et al. 2001. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17(12), 1131-1142.

46. Holland, J.H. 1975. Adaptation in natural and artificial systems. Cambridge, MA: MIT Press.

47. Parsons R.J., S. Forrest, and C. Burks. 1995. Genetic algorithms, operators, and DNA fragment assembly. Machine Learning 21: 11-33.

48. Cedeno, W. and V. Vemuri. 1993. An investigation of DNA mapping with genetic algorithms: preliminary results. In: Proc. of the Fifth Workshop on Neural Networks, Vol. 2204 of SPIE.

49. Fickett, J. and M. Cinkosky. 1993. A genetic algorithm for assembling chromosome physical maps. In: Proc. of the Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis. St. Petersburg, FL: World Scientific. p. 272-285.

50. Zhang, C. and A.K. Wong. 1997. A genetic algorithm for multiple molecular sequence alignment. Comput. Appl. Biosci. 13: 565-581.

51. Valafar, H., F. Valafar, and O. Ersoy. 1996. Distributed Global Optimization (DGO). Proceedings of the International Conference on Neural Networks. Washington, DC, June 2-6, 530a-536.

52. Valafar H., O. Ersoy, F. Valafar. June 1998. Parallel Implementation of Distributed Global Optimization (DGO). Proceedings of the International Conference on Parallel Distributed Processing Techniques and Applications. Las Vegas, Nevada, 1782-1787.

53. McCulloch, W.S. and W. Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics. 5: 115-133.

54. Stormo, G., Schneider, T., Gold, L. & Ehrenfeucht, A. 1982. Use of the perceptron algorithm to distinguish translational initiation in E. coli. Nucleic Acids Research 10: 2997-3011.

55. Selaru F.M., Y. Xu, J. Yin et al. 2002. Artificial neural networks distinguish among subtypes of neoplastic colorectal lesions. Gastroenterology 122: 606-613.

56. Rumelhart, D.E. and J.L. McClelland. 1988. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. 1 and 2. MIT Press, Cambridge, MA.

57. Rojas R. 1996. Neural Networks - A Systematic Introduction. Springer-Verlag, Berlin, New York.

58. M. H. Hassoun. March 1995. Fundamentals of Artificial Neural Networks. MIT Press.

59. Wu, C. H. 1995. Chapter titled “Gene Classification Artificial Neural System” in Methods In Enzymology: Computer Methods for Macromolecular Sequence Analysis, Edited by Russell F. Doolittle, Academic Press, New York.

60. Wu, C., S. Zhao, H. L. Chen, C. J. Lo and J. McLarty. 1996. Motif identification neural design for rapid and sensitive protein family search. CABIOS, 12 (2), 109-118.

61. Snyder E.E., G.D. Stormo. 1995. Identification of Coding Regions in Genomic DNA. J. Mol. Biol. 248: 1-18.

62. Kohonen T. 2001. Self-Organizing Maps, 3rd Extended Edition. Springer, Berlin, Heidelberg, NY.

63. Kohonen T. 1990. The Self-organizing Map. Proc. IEEE 78, 1464–1480.

64. Toronen P. et al. 1999. Analysis of gene expression data using self organizing maps. FEBS Letters 451, 142-146.

65. Golub T. R. et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537.

66. Tamayo, P. et al. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912.

67. Quackenbush, J. 2001. Computational Analysis of Microarray Data. Nature Reviews Genetics 2, 418-427.

68. Li M., B. Wang, Z. Momeni, and F. Valafar. 2002. Pattern Recognition Techniques in Microarray Data Analysis. Proceedings of Mathematics and Engineering Techniques in Medicine and Biological Sciences. June 24-27, 2002. In print.

69. Sasik, R. et al. 2001. Percolation Clustering: A Novel Approach to the Clustering of Gene Expression Patterns in Dictyostelium Development. PSB Proceedings 6, 335-347.