Clustering of genes or samples is a standard procedure performed for the analysis of gene expression data. While gene-wise clustering can bring functionally related genes together, clustering of samples can provide insight into disease subtypes or patient groups. This is of particular importance when subgroups can be linked to clinically relevant features such as recurrence or severity of disease.
36 3. Unsupervised Decision Trees
agglomerative hierarchical clustering with distances based on Pearson correlation (Eisen et al., 1998) as well as more sophisticated methods such as self-organizing maps (Tamayo et al., 1999). These methods are usually applied to all measured genes, or to those satisfying some filtering criteria, in order to obtain a grouping of the samples, often in the shape of a dendrogram. This study only covers clustering of samples, but as the formal question of grouping similar objects remains the same, most clustering methods can be applied to samples and genes alike.
In general, the resulting groups or dendrograms give no hint about the biological processes in which the samples differ most between clusters, an obvious question to ask from a biological point of view. This issue is commonly addressed by identifying significantly differentially expressed (DE) genes using statistical tests (Tusher et al., 2001; Gentleman et al., 2005; Lönnstedt and Speed, 2002) and then detecting over- or under-representation of positive genes within pre-defined functional classes (Palenchar et al., 2004; Al-Shahrour et al., 2004). Other approaches include gene-list based methods such as iterative group analysis (iGA) (Breitling et al., 2004), LACK (Kim and Falkow, 2003) and various resampling based methods that find e.g. significantly high pairwise gene correlations, high learnability (Pavlidis et al., 2002), or conspicuousness (Zien et al., 2000).
An aspect shared by all the methods based on DE genes is that they are univariate, i.e. they cannot address dependencies between genes because they work one gene at a time. For each gene, the association of the expression measurement with the sample label is assessed, and afterwards the distribution of the outcome is investigated. Since genes work together and are highly dependent on each other, this approach does not reflect the complexity of real biological systems. The resampling based methods do better in this respect, but as they do not consider all genes simultaneously, they cannot discover more general tendencies in the data.
Using a standard clustering approach and looking for DE genes associated with a second variable, e.g. disease severity, is thus based on two disparate concepts. The clustering usually works on all genes simultaneously, mixing information from a large and heterogeneous set of biological processes; identification of DE genes, on the other hand, does not take dependencies between genes into account at all.
Our approach instead includes information on predefined gene classes that represent different biological processes, in this case a mapping of genes to Gene Ontology (GO) (The Gene Ontology Consortium, 2000), directly in the clustering procedure. For each such gene class, a single feature is computed using principal component analysis (PCA). Then, the features that support a clear grouping of the samples are selected. These should give an indication of the biological processes in which the sample groups differ the most. Thus, we identify differentially expressed gene classes instead of differentially expressed genes and simultaneously compute a clustering that makes use of these known gene classes. Similar to our approach, Lottaz and Spang (2005) take advantage of functional gene classes, introducing a classification method that combines expression values of genes that are a priori known to be related. Our goal is instead to use such information in clustering methods.
3.2 Background 37
call it GO unsupervised decision trees (GO-UDTs), as it results in decision tree-like structures. On a publicly available expression dataset from a prostate cancer study, our method yields clusters similar to those shown in the original publication by Lapointe et al. (2004); in addition, it provides valuable indications for a biological interpretation.
3.2.1 Unsupervised Decision Trees
Our goal is to remedy one major drawback of current clustering methods – their lack of interpretability. In order to reach that goal, we borrow from methods known from classification theory (Quinlan, 1993) and adapt decision trees to our clustering problem.
Conventional decision trees need to be trained on labeled data, which is not available in an unsupervised setting like clustering. Without labeled data, an algorithm is needed that determines splits and corresponding rules in an unsupervised way. Such algorithms have been developed previously in the form of UDTs and used for clustering (Karakos et al., 2005; Basak and Krishnapuram, 2005; Bellot and El-Bèze, 2000). UDTs are constructed like conventional decision trees by splitting the data into subsets, selecting a simple feature and a cut-off as a rule for each split. Instead of using labeled training data to determine a split, UDTs make use of an objective function that measures the quality of the resulting clustering. To our knowledge, UDTs have not been used for the analysis of biological data before.
3.2.2 GO-UDTs
In order to make UDTs applicable to the clustering of samples from gene expression data, we have designed two new objective functions using features based on functional gene classes. The intuition behind these functions is to score the quality of a split using a measure of the separation of the resulting groups. Each feature used for splitting the data is based on genes from a single functional class. The GO-UDT algorithm computes a dendrogram in a top-down manner, in each step dividing the samples into subsets that exhibit large differences in at least some gene classes. Given a subset of the samples, the following steps are performed to determine the gene classes that imply the optimal split at a tree node.
1. For each gene class, compute a single feature by
• selecting all genes belonging to that class,
• summarizing the data matrix built up with the genes from that gene class using PCA and
• selecting the first principal component (PC).
2. Cluster the samples according to each gene class using the computed feature.
3. Score gene classes according to an objective function which measures the quality of the separation of the resulting clusters.
4. Select a set of high scoring gene classes that imply a similar clustering.
5. Recursively re-partition the resulting sample subsets until some stopping condition is fulfilled.
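As an illustration of step 1, the following minimal sketch computes the first-PC feature for one gene class. It is not the authors' implementation: the function name `first_pc_scores`, the samples-by-genes input layout, and the use of power iteration (instead of a library PCA routine) are assumptions made only to keep the example self-contained.

```python
import math
import random

def first_pc_scores(rows, iters=200, seed=0):
    """Project samples onto the first principal component of one gene class.

    rows: list of samples, each a list of expression values for the genes
    of a single gene class (assumed layout: samples x genes).
    Returns one score per sample; the sign of the scores is arbitrary.
    """
    n, p = len(rows), len(rows[0])
    # Center each gene (column) so the component captures variance, not the mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Random start: almost surely not orthogonal to the leading direction.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]

    # Power iteration on X^T X, evaluated as X^T (X v) without forming the matrix.
    for _ in range(iters):
        Xv = [sum(x[j] * v[j] for j in range(p)) for x in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # the gene class shows no variance across samples
            return [0.0] * n
        v = [c / norm for c in w]

    # The feature is the projection of each centered sample onto v.
    return [sum(x[j] * v[j] for j in range(p)) for x in X]
```

Because the sign of a principal component is arbitrary, only the relative positions of the samples along the returned scores are meaningful for the subsequent clustering. Power iteration converges slowly when the two largest eigenvalues are close, so a production analysis would more likely rely on an established PCA routine.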
At each inner node of the tree, the samples are split into two groups. Using only the first PC of a gene class, a global optimum for the clustering can be achieved using algorithms like K-means or partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990).
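The reason a global optimum is attainable is that the feature is one-dimensional: under the K-means criterion, the two optimal clusters are contiguous in sorted order, so only n-1 cut points need to be considered. The sketch below illustrates this with an exhaustive scan of the cut points. It is a stand-in for K-means or PAM, not a reproduction of either, and the function name `best_two_way_split` is hypothetical.

```python
def best_two_way_split(scores):
    """Globally optimal two-cluster partition of one-dimensional scores.

    Minimizes the within-cluster sum of squares (the K-means criterion).
    Returns (labels, cutoff, within_ss); labels follow the input order.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two samples to split")
    order = sorted(range(n), key=lambda i: scores[i])
    xs = [scores[i] for i in order]

    # Prefix sums make the cost of every candidate cut an O(1) lookup.
    pre, pre2 = [0.0], [0.0]
    for x in xs:
        pre.append(pre[-1] + x)
        pre2.append(pre2[-1] + x * x)

    def sse(lo, hi):
        """Sum of squared deviations of xs[lo:hi] from its own mean."""
        s = pre[hi] - pre[lo]
        return (pre2[hi] - pre2[lo]) - s * s / (hi - lo)

    # k is the size of the lower cluster; every k from 1 to n-1 is tried.
    best_k = min(range(1, n), key=lambda k: sse(0, k) + sse(k, n))

    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = 0 if rank < best_k else 1
    cutoff = (xs[best_k - 1] + xs[best_k]) / 2.0
    return labels, cutoff, sse(0, best_k) + sse(best_k, n)
```

The returned cut-off lies midway between the two samples adjacent to the split, which matches the decision-tree view of a node as a rule of the form "feature below or above a threshold".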
Two different objective functions, described in Section 3.3.3, were used for evaluating the obtained clusters. The first is called the model comparison (MC) score and is based on the comparison of a unimodal to a bimodal model. The second, weighted silhouette (WS), is based on a well-known characteristic of clusterings, the Silhouette index (Rousseeuw, 1987). Using either of these, the best clustering is chosen and the node is annotated with the gene class that was used for the split.
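For orientation, the plain Silhouette index on which the WS score builds can be sketched as follows. This is the standard, unweighted index of Rousseeuw (1987) applied to a one-dimensional feature, not the weighted variant defined in Section 3.3.3; the function names are hypothetical. For each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of another cluster, and the silhouette width is (b - a) / max(a, b).

```python
def silhouette_widths(scores, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for every sample.

    scores: one-dimensional feature values (e.g. first-PC scores)
    labels: cluster label of each sample, in the same order
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is set to 0 by convention
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(abs(scores[i] - scores[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(scores[i] - scores[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

def mean_silhouette(scores, labels):
    """Average silhouette width, used here as a simple split-quality score."""
    w = silhouette_widths(scores, labels)
    return sum(w) / len(w)
```

Values close to 1 indicate well-separated clusters, while values near 0 or below indicate that samples sit between clusters or are assigned to the wrong one. Ranking gene classes by such a score is one way to read step 3 of the algorithm, under the assumption that the unweighted index is an acceptable proxy.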
This way, functional classes are identified which most strongly exhibit natural partitionings. Furthermore, the labels of these classes are expected to carry relevant information about the reason for that partitioning.