5.6 Probe level uncertainty in microarray measurements
5.6.1 Experiments on uncertainty
By randomly selecting 53 datasets from ArrayExpress and GEO databases, we calculated the expression values and uncertainties with multi-mgMOS. It should be noted that these uncertainties are technical variances. For three of these datasets we applied the classification. And for the same three datasets we also applied the mas5calls to get the p-values. A comparison between the uncertainties obtained with multi-mgMOS and the p-values obtained with mas5calls is made.
After obtaining the uncertainties, each array is quantized with GMM and for those genes which are represented as expressed after binarized, uncertainties or p-values are recorded. Then we calculated the average uncertainty or (p-values for mas5calls) for the expressed genes in an array. This procedure is summarized below:
Chapter 5 Improving the performance of inferences with a signal sensitive metric 86 Exp1 Std1 Exp2 Std2 Genes 1 0 .. . 1 0 (n,1) 0.7 0.9 .. . 0.3 0.8 (n,1) 1 1 .. . 1 0 (n,1) 0.4 0.5 .. . 0.2 0.8 (n,1) P n P n 5.6.1.1 Results
First we present the results from multi-mgMOS analysis. After binarizing the expression values for each experiment by using GMM, the number of expressed genes are counted and for those genes which are expressed the mean of uncertainties are calculated. The number of expressed genes vs. the average uncertainty of expressed genes is plotted for three of the datasets used in the previous section of classification (see Fig. 5.3). The result of this experiment showed some unnoticed systematic variations in microarray measurements i.e., average uncertainty over expressed genes is lower when there are more expressed genes. This should not be confused with lower signal measurements having higher uncertainties. The above result is reached solely by analysing expressed genes and the corresponding uncertainties.
0 200 400 600 800 1000 1200 1400 1600 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
No of genes expressed in an array (multi−mgMOS)
Average uncertainity for expressed genes
Gordon et al. Linear Fit West et al. Linear Fit Huang et al Linear Fit
Figure 5.3:A systematic variation in probe level uncertainty of Affymetrix microarray data. Scatter plots of uncertainties against number of expressed genes, and the linear regression lines, for the three datasets whom classification had been applied in the
Section 5.3.2.
Looking at these three datasets, as there are more expressed genes in an array, the uncertainty is lower; we wanted to test this in a more systematic way. For this we downloaded 53 randomly selected datasets from public repositories. Except in one of these experiments (data not shown) we found that as there are more expressed genes, microarray measurements are more certain (see Fig. 5.4). For this particular reason
Chapter 5 Improving the performance of inferences with a signal sensitive metric 87 we suggest that the Tanimoto coefficient performs better than the other metrics and improves the performances of inference problems.
0 1000 2000 3000 4000 5000 6000 7000 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
No of genes expressed in an array (multi−mgMOS)
Average uncertainity for expressed genes
Figure 5.4: A systematic variation in probe level uncertainty of Affymetrix microar- ray data. 53 randomly chosen arrays we plot the average uncertainty of determining expression levels against the number of genes detected as present. Only liner regression
lines are shown for clarity.
In order to show that these results are not the consequence of noise, we test the same idea above by using Affymetrix P and A calls and the corresponding p-values. Since detection calls is a non-parametric approach it is a robust method as explained in section 2.3.8. Expression values obtained with mas5 function in R are binarized by using the GMM and the corresponding p-values are stored. Then by applying the same methods above we counted the number of genes expressed in an array and calculated the mean of the p-values for the expresses genes. Fig. 5.5 shows these results. It can be seen that the p-values are also lower when there are more expressed genes which strengthen our results obtained with multi-mgMOS.
The biological interpretation of this result needs to be further investigated. Whether these results are only because of the particular array property or is it the same for most of the arrays needs to be verified with experiments.
5.7
Summary
In this chapter we show how to improve the inferences drawn from transcriptome data when binary data is used. We use a similarity metric suitable for binary data. This metric, called Tanimoto similarity, is widely applied in chemoinformatics and has been shown to give best performance for binary data. We focused on classification and clus- tering and presented experimental results. Our results showed that using binary gene
Chapter 5 Improving the performance of inferences with a signal sensitive metric 88 0 200 400 600 800 1000 1200 1400 1600 1800 2000 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045
No of genes expressed in an array (mas5)
Average p−values for expressed genes
Gordon et al. Linear fit West et al. Linear Fit Huang et al. Linear Fit
Figure 5.5: Uncertainty graph with p-values. As there are more expressed genes in an array, the average p-values are lower which supports the finding of multi-mgMOS
results.
expression data with metrics suitable for binary data improves the performance of in- ference from gene expression data.
We further show an advantage of using binary data with Tanimoto kernel in classification. By binarizing data we can combine datasets from cross-platform studies and this improve the classification performance of the individual datasets. Our approach has been shown to outperform the existing works in literature. We compare cross-platform classification to the Zilliox and Irizarry (2007)’ s bar code approach and raise a critical appraisal of this method. One aim of the bar code is to remove the lab effects of microarray measurements. However, bar code approach is only limited to one type of array and is not tested on cross-platform studies. And this is not the only drawback of bar code approach. Our experimental results show that Tanimoto kernel in SVM outperforms distance-to-template classifiers. This is highly expected for a kernel in SVM to perform much better than the distance-to-template classifiers as the kernels use higher dimensions in feature space to separate the data. Our search to find better templates than the ones obtained with mean as Zilliox and Irizarry (2007)’ s bar code, fails to do better than Tanimoto kernel. But we go further and make a direct comparison between Tanimoto similarity and Euclidean distance with spectral clustering. As the results of spectral clustering mainly rely on the similarity matrix and thus the similarity metric used, we choose spectral clustering as our method. The results showed that for transcriptome data Tanimoto similarity is doing better than the Euclidean distance in this context. We explain the success of Tanimoto coefficient with the uncertainties of the probe level data. It is worth mentioning here that the uncertainties obtained by multi-mgMOS is due to technical variance and not biological variance. Using biological variances here would be more appropriate but this is left as a future work of this work. By using 53 randomly selected datasets and two different methods, namely multi-mgMOS and
Chapter 5 Improving the performance of inferences with a signal sensitive metric 89 p-values of detection calls where the latter is a robust method, to extract expression values from probe level data, it has been shown that as there are more expressed genes in an array the uncertainty of these genes are lower in Affymetrix GeneChip technology. This property of microarray measurements when combined with the fact that Tanimoto coefficient focuses on the number of bits on (1s) in an array, helps to explain the success of Tanimoto coefficients for the inferences drawn from microarray data. With this weighting of expressed genes, we explain the success of Tanimoto coefficient. While no molecular level explanation is offered for this, we show that when this systematic property of Affymetrix is taken into consideration it improves the quality of inferences drawn from microarray data and hope that this will be taken into account before making any further processes with such data.
By using Tanimoto coefficient, we show that the quality of inferences drawn from mi- croarray data can be improved. Further more, certain advantages of using binarized transcriptome data e.g., in cross-platform analysis, have been shown. In the next chap- ter we show experimental results from classification that binary transcriptome data significantly reduces the effect of algorithm choice for pre-processing raw data especially when used with Tanimoto kernel.
Chapter 6
Reduction in algorithmic
variability
6.1
Introduction
A plethora of computational methods for the statistical analysis of high throughput gene expression measurements is available to users interested in making inferences from tran- scriptomes. These steps are commonly known as pre-processing stages. Pre-processing stages includes background correction, within and between-array normalizations, probe- specific correction and summarization. After the raw data is processed it is ready for more sophisticated machine learning approaches such as classification, cluster analy- sis and the modelling of time-course data by means of dynamical systems. The pre- processing stages lead to quantifications of relative mRNA abundances, taking scanned images as input. The pre-processing stage has been shown to have an important effect on the results of statistical inference approaches (Barash et al., 2004; Cope et al., 2004; Choe et al., 2005; Ploner et al., 2005; Shedden et al., 2005; Millenaar et al., 2006; Allison et al., 2006; Qin et al., 2006). In this chapter we show that binary representation of transcriptome data has the desirable property of reducing the variability introduced at the pre-processing stages due to algorithmic choice. We review the effect of the choice of algorithms on different problems and suggest that using binary representation of mi- croarray data with Tanimoto kernel for SVM reduces the effect of the choice of algorithm and simultaneously improves the performance of classification on transcriptome data.