Discussion - A Modified Mixture Model Approach to the Large

Chapter 2 A Modified Mixture Model Approach to the Large

2.4 Discussion

Sample Splitting: In a multiple testing situation, if all of the available data are used to fit a contamination model (a mixture of null and non-null components), then using the same data to detect the non-null cases may cause a feedback loop. The sample splitting in the proposed method allows one part of the available information to be used for model building and the other part for screening significant cases, and hence avoids that drawback. Further, when the data are randomly split multiple times, each split produces a different training set (the training sets need not be disjoint). Fitting models on these different training splits helps to neutralize the effect of sources of variation (noise) other than the one of interest in the study.

The use of only partial information for model building may lead to some loss of power for an individual training set, but repeating the sample splitting and combining the resulting rejection regions overcomes that. However, the combined rejection region also accumulates false discoveries, which reduces precision. Screening only the cases with high detection frequencies, with enough repetitions of the split, can balance power and precision.
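The repeated-splitting idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the fitting procedure of this chapter: the data are assumed to be a cases-by-replicates array, the replicate columns are what gets split, and the null model is a simple median/MAD fit standing in for the modified mixture model. The function name `detection_frequencies` and the parameters `train_frac`, `cutoff`, and the 0.8 frequency threshold are hypothetical.

```python
import numpy as np

def detection_frequencies(data, n_splits=200, train_frac=0.5, cutoff=3.0, seed=0):
    """Fraction of random splits in which each case is flagged.

    data: (n_cases, n_reps) array. The median/MAD null fit below is a
    stand-in (assumption), not the modified mixture model of the chapter.
    """
    rng = np.random.default_rng(seed)
    n_cases, n_reps = data.shape
    n_train = min(max(int(round(train_frac * n_reps)), 1), n_reps - 1)
    counts = np.zeros(n_cases)
    for _ in range(n_splits):
        perm = rng.permutation(n_reps)
        train, test = perm[:n_train], perm[n_train:]
        # Model building: estimate the null centre and scale from the training part only.
        train_stat = data[:, train].mean(axis=1)
        center = np.median(train_stat)
        scale = 1.4826 * np.median(np.abs(train_stat - center))
        # Screening: standardise the held-out statistic with the training-based null,
        # rescaling for the different number of replicates in the two parts.
        test_stat = data[:, test].mean(axis=1)
        test_scale = scale * np.sqrt(len(train) / len(test))
        z = (test_stat - center) / test_scale
        counts += np.abs(z) > cutoff
    return counts / n_splits

# Simulated example: 1000 cases, 10 replicates, the first 20 cases shifted by 3.
rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 10))
data[:20] += 3.0
freq = detection_frequencies(data)
selected = np.flatnonzero(freq >= 0.8)  # keep only frequently detected cases
```

Raising the frequency threshold trades power for precision: a case flagged in only a few splits contributes to the union of rejection regions but is dropped from `selected`.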

F-Network Plots: The other benefit of repeated sample splitting is the frequency network, or F-network, plot, which is constructed from the detection frequencies across the repeated splits. We emphasize that the F-network plots generated by this method should not be used as “proof” of group behavior among the cases. If two unrelated non-null cases deviate strongly from the null distribution, they are bound to be detected frequently as significant, no matter how the data are split. These two unrelated cases may then appear together in the detected sets repeatedly and consequently show up in the F-network plot as a group. Therefore, the F-network plot is not intended to be used as a basis for a causal relation.

However, the simulation study shows that if some screened non-null cases are indeed correlated, that relation is captured in the group structure of the F-network plot. Thus we recommend that these plots be used as an exploratory tool that precedes further investigation to establish possible causal relationships between cases that show high concurrent detection frequencies in this method. When a study includes thousands of cases, even a starting point for an exploratory network analysis can be highly useful and cost-effective.
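The caveat about unrelated cases can be seen in a small sketch. It assumes that an F-network edge is drawn between two cases whose concurrent detection frequency exceeds a threshold; the exact construction of the plot is not given in this section, so the functions `co_detection` and `edges` and the 0.8 threshold are hypothetical.

```python
import numpy as np

def co_detection(detected):
    """detected: (n_splits, n_cases) 0/1 array of detection indicators.

    Returns per-case detection frequencies and the matrix of concurrent
    detection frequencies (fraction of splits in which both cases are flagged).
    """
    d = np.asarray(detected, dtype=float)
    n_splits = d.shape[0]
    freq = d.mean(axis=0)
    co_freq = d.T @ d / n_splits
    return freq, co_freq

def edges(co_freq, min_co_freq=0.8):
    """Pairs (i, j), i < j, whose concurrent detection frequency reaches the threshold."""
    i, j = np.triu_indices_from(co_freq, k=1)
    keep = co_freq[i, j] >= min_co_freq
    return list(zip(i[keep].tolist(), j[keep].tolist()))

# Simulated indicators over 500 splits (all probabilities are illustrative).
rng = np.random.default_rng(0)
n_splits = 500
strong_a = rng.random(n_splits) < 0.95   # strong non-null, detected independently
strong_b = rng.random(n_splits) < 0.95   # another strong non-null, unrelated to the first
shared = rng.random(n_splits) < 0.60     # two related cases detected together
null_case = rng.random(n_splits) < 0.02  # a null case, rarely detected
detected = np.column_stack([strong_a, strong_b, shared, shared, null_case])

freq, co = co_detection(detected)
links = edges(co, min_co_freq=0.8)
```

In this simulation the two independent strong cases (columns 0 and 1) are linked simply because each is detected in almost every split, while the truly related pair (columns 2 and 3) falls below the threshold. This is why an edge is a prompt for follow-up rather than evidence of a relationship.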

The relevance and effectiveness of the proposed method can be explained particularly well in microarray analysis, where the main goal is the identification of groups of differentially expressed genes. These types of studies are commonly used to identify the genes that are associated with a specific biological behavior. Genes detected through the proposed methodology can be isolated for detailed follow-up functional studies. For example, a biologist may look into the few most frequently identified significant genes to distinguish the “regulators”. A systematic gene knockout experiment, conducted on the sets that appear in the F-network plot as a group, can reveal the effect of individual genes on the biological response of interest. This can provide a novel starting point for the elucidation of gene networks or hierarchical regulation patterns in a biological system. Thus, the proposed analysis can guide an exploratory biological study where, instead of experimentally investigating the effect of every gene, a small subset of significant ones can be selected for further experimentation to establish their individual or collective role in the biological response of interest.

Conditional Independence: An interesting question can be posed: “when does the empirical Bayes model (2.1) work?” It of course works when the observations are i.i.d. according to (2.1). But in many cases (in particular, the microarray example considered here), that assumption is not correct. The model does work, though, for pooled observations that are conditionally independent. For example, in a microarray, clusters of genes may be acting together but still be conditionally independent. This is a typical argument used in multiple hypothesis testing cases (Karlin and Taylor, 1981).

Pooling Data: In situations where we have clusters of observations and the observations within a cluster are conditionally independent, pooling the observations can result in observations that are i.i.d. from the pooled/mixed distribution model. Grego et al. (1990) were among the first to suggest the use of mixed distribution methodology to analyze such data when the observations are exponentially distributed. Here we provide a justification of this type of analysis for more complicated situations such as the one considered in this paper.

To see this, consider $k$ clusters, where there are $n_i$ observations $X_{i,1}, \ldots, X_{i,n_i}$ in cluster $C_i$. Consider the situation where the joint density of all the observations can be written as

$$g(I) \prod_{j=1}^{k} \left[ \prod_{m=1}^{n_j} f(x_{j,m} \mid \lambda_{j,m})\, g_{j,m}(\lambda_{j,m} \mid I_j) \right] \qquad (2.14)$$

where $I = (I_1, \ldots, I_k)$ is a vector of indicator variables indicating whether the clusters are in the background state ($I_j = 0$) or in the signal state ($I_j = 1$). Note that, if $g(I) = \prod_{j=1}^{k} \cdots$

For example, a cluster might be a biological network of genes where the indicator $I = 0$ denotes that the genes in the network are not being differentially expressed and any expressed genes are simply background, while if $I = 1$ the network is being differentially expressed (signal). Typically, we do not know the clusters/networks and are simply pooling the data. Thus, from (2.14),

$$\{(X_{j,m}, \Lambda_{j,m})\} \text{ given } I \text{ are independent.} \qquad (2.15)$$

Notice that this is almost an empirical Bayes or mixture model formulation, except that the observations are not identically distributed.

However, notice that, given $(I, \Lambda)$, the conditional distribution of the $X$'s is given by $X_{j,m} \mid (I, \Lambda) \sim f(x_{j,m} \mid \Lambda_{j,m})$. Thus, if we pool the data, then, given $(I, \Lambda)$, the resulting $X$'s have marginal mixed density

$$f(x) = \int f(x \mid \lambda)\, m(d\lambda) \qquad (2.16)$$

with support $\Lambda$, where the point masses are determined by $\{g_{j,m}(\Lambda_{j,m} \mid I_j)\}$. That is, we are observing $X$'s for each gene from the marginal density (2.16), where they can be considered conditionally independent in the pooled data. Thus, given $(I, \Lambda)$ and (2.15), the form of (2.16) justifies the use of the mixture distribution/empirical Bayes approach that we developed.
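One way to make the pooled density (2.16) concrete is sketched below. It rests on an assumption that the text does not state explicitly: that a pooled observation is equally likely to be any of the $N = \sum_{j=1}^{k} n_j$ observations. The within-cluster index is written $l$ here to avoid a clash with the mixing measure $m$. Under that assumption, and given $(I, \Lambda)$, the mixing measure places equal point masses on the realized values $\Lambda_{j,l}$, which are themselves draws from $g_{j,l}(\cdot \mid I_j)$:

```latex
\[
  m(d\lambda) \;=\; \frac{1}{N} \sum_{j=1}^{k} \sum_{l=1}^{n_j} \delta_{\Lambda_{j,l}}(d\lambda),
  \qquad
  f(x) \;=\; \int f(x \mid \lambda)\, m(d\lambda)
        \;=\; \frac{1}{N} \sum_{j=1}^{k} \sum_{l=1}^{n_j} f(x \mid \Lambda_{j,l}).
\]
```

Read this way, the pooled observations follow a finite mixture of the component densities $f(x \mid \Lambda_{j,l})$, which is the form the mixture/empirical Bayes analysis works with.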

2.5 Conclusion

In conclusion, we present a method that can be used for identifying significant cases when carrying out a large number of simultaneous tests. We propose a cross-validation type analysis where a part of the available information goes into the understanding of the underlying process or model fitting while the other part goes into screening for extreme cases. Random splitting and repeated screening provide a way to reduce the noise (other sources of variation) in the analysis and as a by-product we get an exploratory look into the network pattern for significant cases.
