

4.1.2.1 Predictability of Expression Patterns

Using only the binding activity upstream of a gene across 204 of the 275 annotated transcriptional regulators in yeast, as measured by the global binding data reported by Harbison et al. (2004), our neural network classifier was able to place 86% of cell cycle regulated genes into their proper phase-specific expression pattern (Figure 4.2). Although we were able to construct an average-of-bests network that showed this high degree of reliability in predicting gene expression behavior, individual ANNs trained on only 80% of the data and tested on the remaining 20% had an average prediction accuracy of 50%, a minimum prediction accuracy of 40%, and a maximum prediction accuracy of 65%. Interestingly, 125 genes in the dataset were predicted correctly by every ANN we trained, and 108 genes were classified incorrectly by every ANN we trained. The remaining genes were predicted correctly by only a fraction of the individual ANN runs.

Figure 4.3b shows the relative reproducibility of the rank order of regulators when we compare an average-of-bests ANN built on the first 20 ANN runs with an average-of-bests ANN built on the second 20 ANN runs. The ranking of regulators was based on the sum of squared weights taken across all expression classes for each regulator. This ranking paradigm focuses on the regulators that have the most significant weights, both positive and negative, in the ANN's computation of expression class predictions. In general, the ranks of regulators are stable across multiple training runs.

Figure 4.4 shows the distribution of predictability across the EM clusters. Genes with the highest predictability are enriched in EM2, the cluster that corresponds best to late G1.
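The sum-of-squared-weights ranking described above can be illustrated with a minimal sketch. The weight layout (`weights[c][r]`, one row per expression class and one column per regulator), the function name `rank_regulators`, and the toy values are all assumptions made for illustration; they are not taken from the original implementation.

```python
def rank_regulators(weights):
    """Rank regulators by the sum of squared weights across expression classes.

    weights[c][r] is assumed to be the weight linking regulator r to
    expression class c (a hypothetical layout chosen for this sketch).
    Returns (scores, ranks), where ranks[r] == 1 marks the most influential
    regulator.
    """
    n_regulators = len(weights[0])
    # Squaring makes strongly negative weights count as much as strongly
    # positive ones.
    scores = [sum(row[r] ** 2 for row in weights) for r in range(n_regulators)]
    # Sort regulator indices from highest to lowest score.
    order = sorted(range(n_regulators), key=lambda r: -scores[r])
    ranks = [0] * n_regulators
    for position, r in enumerate(order, start=1):
        ranks[r] = position
    return scores, ranks


# Toy weight matrix: 2 expression classes x 3 regulators (made-up values).
toy_weights = [
    [0.1, -2.0, 0.5],
    [0.2, 1.5, -0.4],
]
scores, ranks = rank_regulators(toy_weights)
print(ranks)
```

In this toy case the second regulator ranks first because its weights are large in magnitude in both classes, even though one of them is negative. Comparing the ranks obtained from two independent sets of runs, as in the reproducibility analysis, would amount to calling this function on each set and plotting one rank list against the other.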

[Figure 4.2 image: confusion matrix comparing EM MoDG Expression Class with NN Predictions. NMI = 0.65, NMI' = 0.62, LA = 0.86]

Figure 4.2: Confusion array showing the average-of-bests ANN vs EM MoDG expression classes (see methods 4.1.1.2). Here we compare the expression class predictions of the average-of-bests ANN, which was created by averaging 40 ANNs trained to predict expression behavior from the binding data available for a gene. Each of the 40 ANNs was trained on 80% of the data and tested on the remaining 20%, and each was selected as the best-performing network out of 10 networks trained on the same data split but initialized with different seeds. These two classifications have a similarity of 0.86 by linear assignment.
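As a rough illustration of a linear assignment similarity between two classifications, the sketch below assumes the score is the largest fraction of genes that can be matched under a one-to-one pairing of the two sets of class labels. That definition, the function name, and the toy matrix are assumptions for this example; the exact formulation used for the reported value may differ. A brute-force search over permutations is used because it is simple and adequate for a small number of classes.

```python
from itertools import permutations


def linear_assignment_similarity(confusion):
    """Best one-to-one label matching score for a square confusion matrix.

    confusion[i][j] counts genes placed in class i by one classification and
    class j by the other. The score is the maximum matched count over all
    one-to-one pairings of labels, divided by the total number of genes.
    Brute force costs k! evaluations, which is fine for k = 5 classes.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(
        sum(confusion[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return best / total


# Toy 3-class confusion matrix (made-up counts, not the data in Figure 4.2).
toy_confusion = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
print(linear_assignment_similarity(toy_confusion))
```

Because the score is maximized over label pairings, two classifications that agree perfectly but number their classes differently still score 1.0.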

[Figure 4.3 image, four panels: (a) ANN Accuracy (x-axis: Prediction Accuracy; y-axis: Number of ANNs); (b) ANN Reproducibility (axes: Mean rank in first 20 NN runs, Mean rank in second 20 NN runs); (c) Predictability vs Expression Level (axes: Fraction of NNs correctly classifying, Expression Level); (d) Predictability vs Binding Level (axes: Fraction of NNs correctly classifying, Binding Level)]

Figure 4.3: ANN prediction accuracy histogram and correlations with binding and expression levels. We trained 40 ANNs (see methods 4.1.1.2) to predict a gene's expression behavior from only the regulator binding activity upstream of its start of transcription. Each network was trained on 80% of the data and tested on the remaining 20%. a) The distribution of ANN accuracy across the 40 trained ANNs. The x-axis shows bins of accuracy ranges; the y-axis counts the number of ANNs that showed the designated prediction accuracy. b) The relative reproducibility of the ANN rankings. Each regulator was ranked by its net influence in the ANN using the sum of squared weights across the classes in the weight matrix ($\sum_c w_{c,r}^2$). Shown is a scatter plot of the regulator ranks from the first 20 ANNs vs the second 20 ANNs trained. c) Scatter plot of the predictability (fraction of ANNs classifying a gene correctly) vs the mean absolute expression level of the 4 highest measured time points for each gene. d) Predictability vs mean binding level for the 10 highest bound regulators.
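The predictability measure used in panels (c) and (d) can be sketched as follows. This minimal example assumes that every trained ANN produces a class prediction for every gene and that predictability is the plain fraction of runs that agree with the reference class; whether only held-out predictions were counted is not stated here, so that choice, the function name, and the toy data are assumptions.

```python
def predictability(predictions, true_classes):
    """Fraction of ANN runs that classify each gene correctly.

    predictions[n][g] is the class predicted for gene g by run n
    (hypothetical layout); true_classes[g] is the reference class.
    """
    n_runs = len(predictions)
    return [
        sum(run[g] == true_classes[g] for run in predictions) / n_runs
        for g in range(len(true_classes))
    ]


# Toy data: 4 runs x 3 genes (made-up labels).
true_classes = [1, 2, 3]
predictions = [
    [1, 2, 1],
    [1, 3, 1],
    [1, 2, 2],
    [1, 1, 1],
]
fractions = predictability(predictions, true_classes)

# Genes that every run gets right, and genes that every run gets wrong.
always_correct = [g for g, f in enumerate(fractions) if f == 1.0]
never_correct = [g for g, f in enumerate(fractions) if f == 0.0]
print(fractions, always_correct, never_correct)
```

The two index lists correspond to the kinds of gene sets discussed in the text: genes predicted correctly by every ANN and genes misclassified by every ANN.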

Figure 4.4: Distribution of neural network prediction accuracy across EM MoDG clusters. The y-axis of the top panel measures the number of genes correctly classified by the indicated fraction of the trained ANNs (x-axis; the bin range is specified in the lower right corner of the corresponding confusion array cells). Each bin is then broken up across the 5 EM MoDG clusters using a confusion array. The color map within the confusion array in the lower panel is as shown in figure 2.1.

The average-of-bests ANN does not show a bias in expression class prediction. The confusion array shown in figure 4.2 illustrates that, for the average-of-bests ANN, expression class prediction errors are evenly distributed throughout the dataset and are not specific to a single phase of the cell cycle. Further, in the case of predicting the canonical G1 expression behavior, which is represented by EM2, 65% of the misclassifications are slight errors whereby a gene is predicted to be expressed in either of the neighboring classes (i.e., EM1 or EM3). However, in the case of EM1, 56% of the errors are misclassifications into a single non-adjacent kinetic expression cluster, EM4.
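The share of misclassifications that fall into neighboring classes, as quoted above for EM2, can be computed from a confusion matrix along the following lines. The sketch assumes that rows hold the reference class and columns the predicted class, and that classes are ordered by kinetic phase; the `cyclic` flag, which would treat the first and last classes as neighbors, is an assumption added for illustration, as are the function name and toy counts.

```python
def adjacent_error_fraction(confusion, true_class, cyclic=False):
    """Fraction of a class's errors that land in a neighboring class.

    confusion[i][j] is assumed to count genes of reference class i that were
    predicted as class j, with classes ordered by phase (0-based indices).
    """
    k = len(confusion)
    row = confusion[true_class]
    errors = sum(row) - row[true_class]
    if errors == 0:
        return 0.0
    neighbors = {true_class - 1, true_class + 1}
    if cyclic:
        # Wrap around so the first and last classes count as adjacent.
        neighbors = {n % k for n in neighbors}
    else:
        neighbors = {n for n in neighbors if 0 <= n < k}
    neighbors.discard(true_class)
    return sum(row[n] for n in neighbors) / errors


# Toy 5-class confusion matrix (made-up counts, not the data in Figure 4.2).
toy_confusion = [
    [10, 2, 0, 1, 1],
    [3, 20, 2, 0, 1],
    [0, 1, 15, 2, 0],
    [1, 0, 2, 12, 1],
    [2, 0, 0, 1, 9],
]
# Second class (index 1): 6 errors, 5 of them in the neighboring classes.
print(adjacent_error_fraction(toy_confusion, 1))
```

Whether the first and last clusters should be treated as adjacent depends on how the clusters map onto the cell cycle, which is why that choice is left as a parameter in this sketch.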

4.1.2.2 Parsing the ANN Weight Matrix and Relating Inferred Regulatory Presence