Simulation 2 - Bayesian Analysis of Microarray Experiments with Multiple

CHAPTER 3. Bayesian Analysis of Microarray Experiments with Multiple

3.4 Simulation

3.4.2 Simulation 2

In the classical gene-by-gene mixed linear analysis of the original barley data set in Section 3.2, the variance components of some random factors were estimated to be zero under the REML method for many genes. Based on this observation, we assumed that the true underlying random part of the model (3.1) may differ from gene to gene, so that in total $2^3 = 8$ versions of the model (3.1) can be considered for the barley data set. Therefore, when generating the second data set, we tried to mimic the original barley data set by allowing some variance components to be zero for some genes.

As in the first simulation, a random set of one thousand genes was chosen from the original barley data set such that, for some genes in the set, at least one variance component estimate was zero. All eight models were equally represented in this simulation; that is, data for 125 genes were generated under each of the eight models. Random effects and residual errors were simulated as for the first simulated data set. Again, we allowed only 10 percent of the genes to have fixed treatment effects.

After the data set was generated, it was analyzed by both the classical linear mixed model approach and the hierarchical Bayesian modeling approach. As in the analysis of the first data set, we used the SAS PROC MIXED procedure to fit the full linear mixed model (3.1) for each gene. Variance components were estimated by the REML method, and the Kenward-Roger (KR) method was used to determine the denominator degrees of freedom and the F-statistics for the tests of fixed effects. Again applying the method of Storey and Tibshirani (2003), we determined the number of significant genes with respect to each fixed effect at four nominal FDR control levels (Table 3.3). For example, when the FDR is controlled at 0.1, twenty genes are declared to have a significant genotype effect. Fourteen of these genes are correctly identified as having a true genotype effect (Table 3.4).
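The simulation design described above can be sketched in a few lines. This is only an illustrative sketch, not the code used for the thesis: the factor labels `factor_a`, `factor_b`, and `factor_c` are hypothetical placeholders (the three random factors are not named in this section), and the sketch assumes a single flag per gene for "has a fixed treatment effect", whereas the actual simulation may have assigned genotype, time, and interaction effects separately.

```python
import itertools
import random

# Hypothetical labels: this section does not name the three random factors.
RANDOM_FACTORS = ("factor_a", "factor_b", "factor_c")
N_GENES = 1000
FRACTION_WITH_EFFECT = 0.10


def enumerate_models(factors):
    """Return every on/off pattern of the variance components (2^k patterns)."""
    return [
        dict(zip(factors, pattern))
        for pattern in itertools.product((False, True), repeat=len(factors))
    ]


def assign_design(n_genes, models, fraction_with_effect, seed=0):
    """Give each gene a model index and a flag for a true fixed effect."""
    rng = random.Random(seed)
    per_model = n_genes // len(models)
    # Equal representation: the same number of genes under each model.
    model_index = [m for m in range(len(models)) for _ in range(per_model)]
    rng.shuffle(model_index)
    # Assumption: one shared flag per gene rather than one flag per effect.
    n_effect = round(fraction_with_effect * n_genes)
    with_effect = set(rng.sample(range(n_genes), n_effect))
    return [
        {"gene": g, "model": model_index[g], "has_effect": g in with_effect}
        for g in range(n_genes)
    ]


models = enumerate_models(RANDOM_FACTORS)
design = assign_design(N_GENES, models, FRACTION_WITH_EFFECT)
```

A value of `True` in a model pattern means the corresponding variance component is nonzero for genes generated under that model; the all-`True` pattern corresponds to the full model (3.1).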

Controlling the FDR at the same nominal level, 46 genes are declared significant for changing expression levels over time, and 38 genes are declared to have a significant interaction between genotype and time (Table 3.4). In this analysis, 44 of the 46 most significant genes were correctly identified as having a true time effect, and 26 of the 38 most significant genes were correctly identified as having a true interaction effect (Table 3.4).
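To make the FDR step concrete, the following is a minimal sketch of q-value estimation in the spirit of Storey and Tibshirani (2003). It is a simplification under a stated assumption: the proportion of true null hypotheses is estimated at a single fixed tuning value `lam = 0.5`, whereas the published method smooths this estimate over a range of tuning values. The function names are illustrative.

```python
def q_values(p_values, lam=0.5):
    """Estimate q-values from p-values using a fixed-lambda estimate of pi0."""
    m = len(p_values)
    # Share of p-values above lam, rescaled, estimates the null proportion.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


def count_significant(p_values, fdr_level):
    """Number of tests whose q-value is at or below the nominal FDR level."""
    return sum(q <= fdr_level for q in q_values(p_values))
```

Calling `count_significant` on the per-gene p-values for one fixed effect at each nominal level would give counts of the kind reported in Table 3.3.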

We also analyzed the second simulated data set with the hierarchical Bayesian model described in Section 3.3. Four statistics (3.4, 3.5, 3.6, and 3.7) were calculated for each gene from the posterior distribution of each treatment effect. Genes were ordered by significance with respect to genotype, time, and interaction effects using the values of F1, F2, F3, or F4. Table 3.4 reports the number of correctly identified genes under the hierarchical Bayesian analysis when F4 is used to rank the genes. For example, all of the 20 most significant genes with respect to genotype effect were correctly identified as having a true genotype effect. Similarly, 42 of the 46 most significant genes with respect to time effect were truly differentially expressed, and 31 of the 38 most significant genes with respect to interaction effect were correctly identified as having different time patterns across the two genotypes (Table 3.4).

If we look at the 100 most significant genes under both analyses, the hierarchical Bayesian method correctly identified many more genes than the classical mixed linear model analysis (Table 3.4). For example, when testing for genotype effect, the classical mixed linear model analysis identified 31 genes correctly, whereas the hierarchical Bayesian analysis identified 61 genes correctly. These genes include 26 of the 31 genes that were correctly identified by the classical analysis. Similarly, for the genotype-by-time interaction, the classical mixed linear model analysis identified 37 genes correctly, whereas the hierarchical Bayesian analysis identified 54 genes correctly (Table 3.4). These observations provide some evidence that the hierarchical Bayesian approach yields a better rank ordering of the genes than the classical analysis that fits the same mixed linear model to every gene.
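The top-100 comparison above amounts to counting, for each method, how many of its highest-ranked genes truly carry the effect, and how many correct genes the two methods share. A minimal sketch follows, assuming each method supplies one score per gene with larger scores meaning stronger evidence; the function and argument names are illustrative.

```python
def top_k(scores, k):
    """Return the set of gene ids with the k largest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def compare_rankings(scores_a, scores_b, truly_affected, k):
    """Count correct genes in each method's top k and in their overlap."""
    correct_a = top_k(scores_a, k) & truly_affected
    correct_b = top_k(scores_b, k) & truly_affected
    return {
        "correct_a": len(correct_a),
        "correct_b": len(correct_b),
        "correct_in_both": len(correct_a & correct_b),
    }
```

Here `scores_a` could hold the classical F-statistics, `scores_b` one of the Bayesian statistics, and `truly_affected` the set of genes simulated with the effect in question.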

We also used ROC curves to compare the power of the methods to distinguish differentially and non-differentially expressed genes with respect to the fixed effects. In Figures 3.7 – 3.7, we plot ROC curves for the analysis that fits the true underlying linear mixed model for each gene, the analysis that fits the full linear mixed model (3.1), and the analysis with the hierarchical Bayesian approach. In Figure 3.7 and Figure 3.7, we observe that the hierarchical Bayesian analysis with the F4 statistic has the highest power to distinguish differentially and non-differentially expressed genes with respect to genotype effect and genotype-by-time interaction, respectively. For all the fixed effects, the hierarchical Bayesian analysis with the F1 and F3 statistics produced the same ordering of genes (Figures 3.7, 3.7, and 3.7). In addition, both statistics give the best ranking of the genes with respect to time effect (Figure 3.7). Across all figures, we conclude that the hierarchical Bayesian analysis with any of the F statistics rank orders the genes better than either the analysis that fits the same mixed linear model to every gene or the analysis that fits the true underlying model for each gene.
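For readers who want to reproduce this kind of comparison, an ROC curve can be traced directly from a ranking statistic and the known simulation truth. The sketch below is illustrative only: it assumes larger scores indicate stronger evidence of differential expression, that both classes are present, and it does not treat tied scores specially.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    labels[i] is 1 if gene i is truly differentially expressed, else 0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Lower the cutoff one gene at a time, from the highest score down.
    for _, is_de in ranked:
        if is_de:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

Plotting `roc_points` for each statistic on the same axes gives curves that can be compared as in the figures; `auc` summarizes each curve in a single number, with a higher area indicating a better rank ordering.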

In document Classical and Bayesian mixed model analysis of microarray data for detecting gene expression and DNA differences (Page 95-97)