
3.4 Proposed Validation Method

3.4.5 Confidence Thresholds

To draw a statistical conclusion, it is crucial to select representative p-value cut-offs so that performance can be evaluated on a significance basis. Adjustment of p-values for multiplicity is performed using the notion of the false discovery rate (FDR) [9]. FDR offers a different point of view on testing errors by controlling the expected proportion of erroneous rejections of the null hypotheses,

$$E\left[\frac{|\text{False Positives}|}{|\text{False Positives}| + |\text{True Positives}|}\right].$$

For a given threshold $\alpha$, the Benjamini-Hochberg procedure states that if $p_1, p_2, \ldots, p_m$ are the observed p-values sorted in ascending order, one can find the largest index $b = \max\{i \mid p_i \le i\alpha/m\}$ and reject the null hypotheses $H_0^1, H_0^2, \ldots, H_0^b$. After adjustment, p-values can be compared directly with any chosen
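The step-up rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function name `benjamini_hochberg`, its return values and the default `alpha=0.05` are assumptions made for the example. It sorts the p-values, finds the largest rank $b$ with $p_b \le b\alpha/m$, and also computes adjusted p-values in the usual step-up form so that they can be compared directly with a threshold.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Illustrative Benjamini-Hochberg step-up procedure (hypothetical helper).

    Returns (b, rejected, adjusted) where b is the number of rejected
    hypotheses, rejected lists their original indices, and adjusted holds
    the BH-adjusted p-values in the original order.
    """
    m = len(p_values)
    if m == 0:
        return 0, [], []

    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank b such that the b-th smallest p-value is <= b * alpha / m.
    b = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            b = rank

    # Step-up: every hypothesis ranked at or below b is rejected.
    rejected = sorted(order[:b])

    # Adjusted p-values: p * m / rank, made monotone from the largest rank down.
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min

    return b, rejected, adjusted


if __name__ == "__main__":
    b, rejected, adjusted = benjamini_hochberg([0.001, 0.5, 0.012, 0.6, 0.035])
    print(b, rejected, adjusted)
```

Under this sketch, rejecting the hypotheses whose adjusted p-value is at most `alpha` gives the same set as the rank-based rule, which is why adjusted values can be compared against a threshold without repeating the procedure.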

3.5 Experimental Results

Consistency, accuracy and discriminability are the main attributes of the validity indices to be assessed in this experimental section. To this aim, we design three comparative experiments, allowing the proposed WB index to be assessed in many aspects. Biological data sets with distinct features and various complexities are used. Five other validity indices, including two GO-driven and three data-driven indices, are compared with the proposed index. Six popular clustering algorithms are selected to represent the wide spectrum of clustering methods.

The three data sets used in the experiments are: the yeast cell cycle (Y5) data set (as described in Section 2.4.2), the yeast galactose data set (as described in Section 2.4.3), and the Arabidopsis L. Heynth diurnal data set. The yeast Y5 data set is popular in the clustering literature for its easy accessibility. The challenges from this data set are posed partly by the ambiguities among the five cell cycle phases and partly by the poor quality of the data set. Compared with the Y5 data set, the yeast galactose data set shows more distinguishable expression patterns. Its genes reflect four functional categories in GO.

Arabidopsis L. Heynth diurnal data set

The Arabidopsis L. Heynth diurnal data set [124] is collected from an experiment to investigate the impact of the diurnal cycle on the starch metabolism in the leaves of Arabidopsis L. Heynth. It is a larger data set with 800 genes but with only 11 time points and two replicates. For the assessment of our validation scheme, a subset of 800 genes is used, which was previously selected using the periodicity test [147]. All data sets in the experiments are filtered. Because of noise and limited annotation knowledge, involving a whole data set prevents us from interpreting the performance of the proposed methods under evaluation. By using filtered data sets, the interference of unknown factors is significantly reduced, which provides a clearer picture of the role the methods play. Figure 3.6 shows the time series with one replicate concatenated with the other. Ambiguities, especially in the fifth cluster, indicate the difficulty of clustering this data set.

[Figure 3.6 placeholder: eight panels, one per cluster (1 to 8); x-axis: Time points; y-axis: Gene expression.]

Figure 3.6: The Arabidopsis L. Heynth diurnal data clustered into eight clusters by K-means clustering.

In addition to the proposed index, two GO-driven indices are used for comparison: the biological homogeneity index (BHI) and the biological stability index (BSI). The three data-driven indices, namely the Calinski and Harabasz (CH) index [21], the Davies-Bouldin index and the Dunn index (as described in Section 3.3.3.1), judge the clustering quality from the data alone, without taking GO into account. The idea behind the CH index is to compute the pairwise sum of squared distances between clusters using the microarray data, and to compare it to the internal sum of squared distances within each cluster.
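As an illustration of the between-versus-within comparison behind the CH index, the following Python sketch uses the common centroid-based form of the index, in which the between-cluster sum of squares is divided by $k-1$ and the within-cluster sum of squares by $n-k$. That exact normalisation, the function name `calinski_harabasz` and the toy data are assumptions made for the example; they are not taken from the experiments reported here.

```python
def calinski_harabasz(points, labels):
    """Illustrative CH index for a hard partition (hypothetical helper).

    points: list of equal-length numeric vectors (e.g. expression profiles).
    labels: cluster label for each point.
    """
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CH index needs 2 <= k < n clusters")

    dim = len(points[0])

    def mean(rows):
        # Component-wise mean of a list of vectors.
        return [sum(row[j] for row in rows) / len(rows) for j in range(dim)]

    def sq_dist(a, b):
        # Squared Euclidean distance between two vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted squared distances of centroids to the overall mean
    within = 0.0   # squared distances of points to their own centroid
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)

    if within == 0.0:
        return float("inf")  # perfectly compact clusters
    return (between / (k - 1)) / (within / (n - k))


if __name__ == "__main__":
    toy = [[0, 0], [0, 2], [10, 0], [10, 2]]
    print(calinski_harabasz(toy, [0, 0, 1, 1]))  # well-separated partition
    print(calinski_harabasz(toy, [0, 1, 0, 1]))  # poorly separated partition
```

On the toy data, the partition that groups nearby points scores far higher than the one that mixes them, which matches the convention that a large CH score corresponds to a good partition.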

For both the CH index and the Dunn index, a large score corresponds to a good partition. For the Davies-Bouldin index, however, a set of compact clusters is associated with a small value. In the following experiments, the scores of the Davies-Bouldin index are inverted so that large scores correspond to good partitions for all the indices.

We design three experiments to assess the performance of the proposed GO validation indices from different aspects. In the first experiment, six clustering algorithms are evaluated in their applications to the yeast Y5 data set and the Arabidopsis diurnal data set with the six validity indices. In the second experiment, we use the yeast galactose data set and its cluster assignment to the four functional categories in a perturbation test to assess the sensitivity and consistency of the proposed validation index with different levels of random errors. The last experiment tests the accuracy of the proposed index by finding the optimum number of clusters for the yeast Y5 data set.
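One plausible way to introduce a controlled level of random error into a reference cluster assignment, as the perturbation test in the second experiment requires, is sketched below in Python. The text does not specify how the errors are generated, so everything here is an assumption made for illustration: the function name `perturb_labels`, the choice to move a fixed fraction of genes to a different category chosen uniformly at random, and the `seed` parameter.

```python
import random


def perturb_labels(labels, error_rate, seed=None):
    """Reassign a fraction of items to a different cluster (hypothetical helper).

    labels: reference cluster assignment, one label per gene.
    error_rate: fraction of genes to move, between 0 and 1.
    seed: optional seed so that a perturbation can be reproduced.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    categories = sorted(set(labels))
    if len(categories) < 2:
        raise ValueError("need at least two clusters to perturb")

    rng = random.Random(seed)
    n_errors = round(error_rate * len(labels))
    perturbed = list(labels)

    # Pick the genes to move without replacement, then give each one a label
    # that differs from its reference label.
    for i in rng.sample(range(len(labels)), n_errors):
        alternatives = [c for c in categories if c != labels[i]]
        perturbed[i] = rng.choice(alternatives)
    return perturbed


if __name__ == "__main__":
    reference = [0] * 5 + [1] * 5 + [2] * 5 + [3] * 5
    for rate in (0.0, 0.1, 0.2, 0.4):
        noisy = perturb_labels(reference, rate, seed=1)
        changed = sum(a != b for a, b in zip(reference, noisy))
        print(rate, changed)
```

A validity index that is sensitive and consistent would be expected to score the perturbed assignments progressively worse as the error rate grows; scoring each perturbed assignment is left to whichever index is under evaluation.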

In document Statistical inference from large scale genomic data (Page 114-117)