2.5 Control of global familywise error under the null
2.5.2 Proof
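Define $\mathcal{C}_m := \{A : |A| = m\}$. By construction of the Benjamini-Hochberg procedure, a set $A \in \mathcal{C}_m$ is a fixed point only if the following event occurs:
\[
\bigcap_{j \in A} \left\{ p(j, A; X) \le \frac{m\alpha}{d} \right\} \;\cap \bigcap_{j \in [d] \setminus A} \left\{ p(j, A; X) > \frac{m\alpha}{d} \right\}. \tag{2.10}
\]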
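Since the p-values are independent and uniformly distributed under the null, this implies that for any $A \in \mathcal{C}_m$,
\[
P_0\big( u_\alpha(A) = A \big) = \left( \frac{m\alpha}{d} \right)^{m} \left( 1 - \frac{m\alpha}{d} \right)^{d-m}. \tag{2.11}
\]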
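Define $\mathcal{A}_m(X, \alpha)$ to be the class of all stable sets of size $m$. Then, using equation (2.11) and a union bound over the $\binom{d}{m}$ sets in $\mathcal{C}_m$,
\[
P_0\big( |\mathcal{A}_m(X, \alpha)| > 0 \big) \le \binom{d}{m} \left( \frac{m\alpha}{d} \right)^{m} \left( 1 - \frac{m\alpha}{d} \right)^{d-m}. \tag{2.12}
\]
Applying the inequality $\binom{d}{m} \le \frac{1}{\sqrt{2\pi}} \left( \frac{ed}{m} \right)^{m}$, which follows from $\binom{d}{m} \le d^m / m!$ and Stirling's bound $m! \ge \sqrt{2\pi}\,(m/e)^m$, gives
\[
\sqrt{2\pi}\, P_0\big( |\mathcal{A}_m(X, \alpha)| > 0 \big) \le (e\alpha)^{m} \left( 1 - \frac{m\alpha}{d} \right)^{d-m} \le (e\alpha)^{m}.
\]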
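Since $\mathcal{A}(X, \alpha) = \bigcup_{m} \mathcal{A}_m(X, \alpha)$, a union bound gives
\[
\sqrt{2\pi}\, P_0\big( |\mathcal{A}(X, \alpha)| > 0 \big) \le \sum_{m=2}^{d} (e\alpha)^{m} = \sum_{m=1}^{d} (e\alpha)^{m} - e\alpha.
\]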
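Since $\alpha \le 0.15 < 1/e$, we have $e\alpha < 1$, and the sum on the right-hand side is a finite geometric series with ratio less than one. Thus,
\[
\sqrt{2\pi}\, P_0\big( |\mathcal{A}(X, \alpha)| > 0 \big) \le \frac{e\alpha \left[ 1 - (e\alpha)^{d} \right]}{1 - e\alpha} - e\alpha \le \frac{(e\alpha)^{2}}{1 - e\alpha}. \tag{2.13}
\]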
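We want to show that $P_0\big( |\mathcal{A}(X, \alpha)| > 0 \big) \le \alpha$. By (2.13), it suffices to show that
\[
\frac{(e\alpha)^{2}}{1 - e\alpha} \le \sqrt{2\pi}\, \alpha. \tag{2.14}
\]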
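Inequality (2.14) can be checked numerically over the stated range of $\alpha$. The sketch below is illustrative only and is not part of the original proof; the function names `lhs`, `rhs`, and `holds_on_grid` are hypothetical. Dividing (2.14) by $\alpha$ gives $e^2\alpha/(1 - e\alpha) \le \sqrt{2\pi}$, whose left side is increasing in $\alpha$ on $(0, 1/e)$, so the binding case is the largest permitted value, $\alpha = 0.15$.

```python
import math

def lhs(alpha):
    """Left side of (2.14): (e*alpha)^2 / (1 - e*alpha), valid for alpha < 1/e."""
    x = math.e * alpha
    return x * x / (1.0 - x)

def rhs(alpha):
    """Right side of (2.14): sqrt(2*pi) * alpha."""
    return math.sqrt(2.0 * math.pi) * alpha

def holds_on_grid(alpha_max=0.15, n=1000):
    """Check (2.14) on an evenly spaced grid of alpha values in (0, alpha_max]."""
    grid = (alpha_max * i / n for i in range(1, n + 1))
    return all(lhs(a) <= rhs(a) for a in grid)

if __name__ == "__main__":
    print(holds_on_grid())          # grid check for alpha <= 0.15
    print(lhs(0.15) <= rhs(0.15))   # the binding case alpha = 0.15
```

A grid check of this kind is not a proof, but together with the monotonicity remark above it indicates that the restriction $\alpha \le 0.15$ leaves some slack in (2.14).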
CHAPTER 3
Differential Correlation Mining
3.1 Introduction
Given data obtained under two sampling conditions, it is often of interest to identify variables that behave differently in one condition than in the other. In this chapter, we present a method for differential association mining called Differential Correlation Mining (DCM). The Differential Correlation Mining method identifies differentially correlated sets of variables, with the property that the average pairwise correlation between variables in a set is higher under one sample condition than the other. Differential Correlation Mining is a VSAT-style algorithm, so updates are performed via hypothesis testing of individual variables, based on the asymptotic distribution of their average differential correlation.
We refer to the target variable sets of Differential Correlation Mining as differentially correlated (DC) cliques. In a graph, a clique is a set of nodes that is fully connected, in the sense that there is an edge between every pair of nodes in the set. Informally, a DC clique is a set of variables such that each variable in the set has a positive (usually large) average differential correlation with the other variables in the set. More formally, let $R_1, R_2$ be the $d \times d$ population correlation matrices of the distributions underlying sampling conditions 1 and 2, respectively. Let $A \subset [d]$, where $[d]$ is the index set $\{1, \ldots, d\}$, and define
\[
\Delta(j, A) = \frac{1}{|A|} \sum_{k \in A} (R_1 - R_2)_{jk} \tag{3.1}
\]
to be the average difference of correlations between variable $j$ and variables in index set $A$. Here the subscript $jk$ denotes the element in the $j$-th row and $k$-th column of the corresponding matrix, and $|A|$ is the cardinality of the set $A$. We formally define DC cliques as follows.
Definition 3. Let $R_1, R_2$ be given and let $\Delta(\cdot, \cdot)$ be defined as in (3.1). An index set $A \subseteq [d]$ with at least two elements is a DC clique for $R_1 - R_2$ if
1. $\Delta(j, A) > 0$ if and only if $j \in A$,
2. The set $A$ cannot be written as a disjoint union of nonempty index sets $A_1, A_2 \subset [d]$ such that $A_1$ and $A_2$ satisfy condition 1 above.
Condition 1 ensures that no relevant variables are omitted from a DC clique (every variable that is positively differentially correlated relative to the set $A$ is included in $A$) and that a DC clique does not contain any extraneous elements. Condition 1 implies that a DC clique has larger average pairwise correlation under the first distribution than under the second. Condition 2 ensures that a DC clique cannot be subdivided into two smaller DC cliques. Importantly, the definition places no conditions on the correlation matrices $R_1$ and $R_2$. In particular, $R_1$ and $R_2$ need not be sparse, and need not satisfy any structural constraints such as bandedness. For a given pair $R_1, R_2$, it may happen that no DC cliques exist, or that the entire variable set forms a DC clique.
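When $R_1$ and $R_2$ are known, Condition 1 can be checked directly from (3.1). The following is a minimal sketch and not the DCM implementation: the function names `delta` and `satisfies_condition_1` and the toy matrices are hypothetical, indices are 0-based, matrices are stored as nested lists, and Condition 2 is not checked.

```python
def delta(j, A, R1, R2):
    """Average differential correlation (3.1) between variable j and index set A."""
    return sum(R1[j][k] - R2[j][k] for k in A) / len(A)

def satisfies_condition_1(A, R1, R2):
    """True if Delta(j, A) > 0 holds exactly for the indices j that belong to A."""
    A = set(A)
    d = len(R1)
    return all((delta(j, A, R1, R2) > 0) == (j in A) for j in range(d))

# Toy example (hypothetical): variables 0 and 1 are correlated only under condition 1.
R1 = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
R2 = [[1.0 if j == k else 0.0 for k in range(4)] for j in range(4)]

if __name__ == "__main__":
    print(satisfies_condition_1({0, 1}, R1, R2))     # True
    print(satisfies_condition_1({0, 1, 2}, R1, R2))  # False: Delta(2, A) = 0
```

In the toy example the set $\{0, 1\}$ satisfies Condition 1, while adding the uncorrelated variable 2 violates it, because an extraneous element has zero average differential correlation with the set.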
Note that the definition of DC cliques is not symmetric: in general, the DC cliques for $R_1 - R_2$ will be different from those for $R_2 - R_1$. The difference lies not in the relational structure itself, but rather in how we order the sample conditions (1 or 2). For example, in biological data, one sample group may involve a treatment condition, while the other is a reference or control group. A DC clique for $R_1 - R_2$ would contain genes that are more highly correlated in Condition 1 than in Condition 2, for example, a protein pathway that is more active in Condition 1. This structure is illustrated in Figure 1.1.
The asymmetry in DC cliques could be eliminated by replacing the difference $R_1 - R_2$ in (3.1) by a symmetric notion of difference such as $|R_1 - R_2|$. However, a variable set based on absolute difference (or similar) could contain a mixture of elements with positive correlation to $A$ and elements with negative correlation to $A$. Such mixed groups would not exhibit the unified block structure of the type seen in Figure 1.1. Further, large variable sets with strong average negative correlation cannot occur. Simple algebra shows that since $R_1$ is positive definite, the average pairwise correlation in Condition 1 of any set $A$ with $m$ elements must be at least $-1/(m-1)$.
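The algebra behind this lower bound can be spelled out as follows. This is a sketch in which $\mathbf{1}_A$ (the indicator vector of $A$) and $\bar{r}_A$ (the average off-diagonal correlation within $A$) are notation introduced here for illustration; only the nonnegativity of the quadratic form of $R_1$ is used.

```latex
\begin{align*}
0 \le \mathbf{1}_A^{t} R_1 \mathbf{1}_A
  &= \sum_{j \in A} (R_1)_{jj}
   + \sum_{\substack{j, k \in A \\ j \ne k}} (R_1)_{jk} \\
  &= m + m(m-1)\, \bar{r}_A,
  \qquad
  \bar{r}_A := \frac{1}{m(m-1)} \sum_{\substack{j, k \in A \\ j \ne k}} (R_1)_{jk},
\end{align*}
so that $\bar{r}_A \ge -\dfrac{1}{m-1}$.
```

For a set of $m = 10$ variables, for instance, the average pairwise correlation cannot fall below $-1/9 \approx -0.11$.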
As defined above, DC cliques are features of the underlying population distributions of the data. In practice, we will replace $R_1, R_2$ with estimates from observations, accounting for the uncertainty in these estimators, to select empirical DC cliques. The broad objective of Differential Correlation Mining is to use observed data to identify DC cliques, or approximations of these, without prior knowledge of the identity, number, or size of the DC cliques present in the population. The Differential Correlation Mining algorithm and supporting analysis described here can be readily adapted to a non-differential correlation mining algorithm. An implementation of a correlation mining procedure is included along with the public DCM software.
Notation. In what follows, we assume that the data under condition 1 consist of $n_1$ independent samples drawn from a distribution $F_1$ with correlation matrix $R_1$, and that the data under condition 2 consist of $n_2$ independent samples drawn from a distribution $F_2$ with correlation matrix $R_2$. Let $X_1 = (U_1, \ldots, U_d) \in \mathbb{R}^{n_1 \times d}$ and $X_2 = (V_1, \ldots, V_d) \in \mathbb{R}^{n_2 \times d}$ denote the resulting data matrices in standard sample-by-variable form. Thus $U_j \in \mathbb{R}^{n_1}$ denotes the measurements of variable $j$ under condition 1, while $V_j \in \mathbb{R}^{n_2}$ denotes the measurements of variable $j$ under condition 2. Let $X_{1,A} = (U_j)_{j \in A}$ and $X_{2,A} = (V_j)_{j \in A}$ denote the restriction of $X_1$ and $X_2$, respectively, to a variable set $A \subset [d]$. Similarly, let $R_{1,A}$ and $R_{2,A}$ denote the correlation matrices of $F_1$ and $F_2$ restricted to the variables in $A$.
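Let $\tilde{U}_j$ and $\tilde{V}_j$ be the standardized versions of $U_j$ and $V_j$, respectively, such that $\|\tilde{U}_j\| = \|\tilde{V}_j\| = 1$, and define $\tilde{X}_1 = (\tilde{U}_1, \ldots, \tilde{U}_d)$ and $\tilde{X}_2 = (\tilde{V}_1, \ldots, \tilde{V}_d)$. Finally, let $\widehat{R}_1$ and $\widehat{R}_2$ denote the usual sample correlation matrices of $X_1$ and $X_2$, respectively (and $\widehat{R}_{1,A}$ and $\widehat{R}_{2,A}$ those of the appropriate restricted datasets). Thus $(\widehat{R}_1)_{jk} = \widehat{\mathrm{cor}}(U_j, U_k) = (\tilde{X}_1^{t} \tilde{X}_1)_{jk}$, and a similar relation holds for $\widehat{R}_2$.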
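The identity between the sample correlation matrix and the Gram matrix of the standardized columns can be illustrated numerically. The sketch below assumes that "standardized" means centering each column and rescaling it to unit Euclidean norm; the function names and the simulated data are hypothetical, and the data are stored column-wise as plain lists for simplicity.

```python
import math
import random

def standardize(column):
    """Center a column and rescale it to unit Euclidean norm."""
    mean = sum(column) / len(column)
    centered = [x - mean for x in column]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]

def sample_correlation_matrix(columns):
    """Gram matrix of the standardized columns, i.e. the entries of X~^t X~."""
    std = [standardize(c) for c in columns]
    return [[sum(a * b for a, b in zip(u, v)) for v in std] for u in std]

def pearson(u, v):
    """Textbook Pearson correlation, used only as an independent check."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

random.seed(0)
# Hypothetical data: d = 3 variables, n1 = 50 samples, one list per variable.
U = [[random.gauss(0.0, 1.0) for _ in range(50)] for _ in range(3)]
R1_hat = sample_correlation_matrix(U)
```

Each entry `R1_hat[j][k]` agrees with `pearson(U[j], U[k])` up to floating-point error, and the diagonal entries equal one because every standardized column has unit norm.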