5.2 Future work: Bi-community detection for correlation networks
5.2.5 Simulation results
5.2.5.3 Competing methods
In addition to CBCE, we run two other bi-partite community detection approaches through our simulation framework. Given a bi-partite correlation network G := (N1, N2, D), each method depends on the computation of the cross-correlation matrix C := X^t Y.
1. Bipartite Recursively-Induced Modules (BRIM): Barber (2007) extended the modularity score (see Section 1.2.3) to bi-partite networks, and introduced the BRIM method for community detection on bi-partite networks as a technique for local maximization of the bi-partite modularity. As BRIM operates only on binary bi-partite networks, we introduce the following procedure to adapt the method to our setting:
(i) Convert the entries of C to t-statistic p-values.
(ii) Compute the Benjamini-Hochberg threshold τ = τ(C) at level α = 0.1.
(iii) Dichotomize the entries of C into 0-1 variables, where an entry becomes 1 if and only if it is less than τ.
(iv) Apply BRIM to the binary bi-partite network G0 := (N1, N2, C).
Note that after step (iii), some nodes may have no cross-edges. We remove these nodes from G0 and automatically assign them to the background.
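Steps (ii) and (iii), together with the removal of isolated nodes, can be sketched as follows. This is a minimal illustration rather than the code used in the simulations: it assumes step (i) has already produced a matrix of p-values (here a list of lists named pmat), and the function names bh_threshold and dichotomize are hypothetical. The boundary convention is also an assumption: τ is taken to be the largest p-value that passes its Benjamini-Hochberg bound, and an edge is drawn when a p-value is at most τ.

```python
def bh_threshold(pvals, alpha=0.1):
    """Benjamini-Hochberg threshold over a flat list of p-values.

    Returns the largest sorted p-value p_(k) with p_(k) <= k * alpha / m,
    or None when no p-value passes (assumed convention, see lead-in).
    """
    ordered = sorted(pvals)
    m = len(ordered)
    tau = None
    for k, p in enumerate(ordered, start=1):
        if p <= k * alpha / m:
            tau = p  # keep the largest p-value that passes its bound
    return tau


def dichotomize(pmat, alpha=0.1):
    """Turn a p-value matrix into a 0-1 cross-adjacency matrix.

    Rows and columns with no cross-edges are dropped and reported
    separately so they can be assigned to the background.
    """
    n_rows = len(pmat)
    n_cols = len(pmat[0]) if pmat else 0
    tau = bh_threshold([p for row in pmat for p in row], alpha)
    adj = [[1 if (tau is not None and p <= tau) else 0 for p in row]
           for row in pmat]
    keep_rows = [i for i in range(n_rows) if any(adj[i])]
    keep_cols = [j for j in range(n_cols) if any(adj[i][j] for i in range(n_rows))]
    bg_rows = [i for i in range(n_rows) if i not in keep_rows]
    bg_cols = [j for j in range(n_cols) if j not in keep_cols]
    reduced = [[adj[i][j] for j in keep_cols] for i in keep_rows]
    return reduced, keep_rows, keep_cols, bg_rows, bg_cols
```

The reduced matrix is what would be passed to a BRIM implementation in step (iv); the nodes listed in bg_rows and bg_cols are the ones assigned to the background.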
2. Independent Row-Column k-means: One potential solution to bi-community detection in correlation networks is to apply bi-clustering to C. A common approach to bi-clustering is to cluster the rows and columns separately, called “Independent Row-Column Clustering” (IRCC) in Shabalin et al. (2009). In this paper we apply k-means IRCC alongside BRIM and CBCE. IRCC is somewhat ill-suited to finding bi-communities, as there is no natural way to pair the row clusters with the column clusters. As such, we apply the following routine to the results of any IRCC method:
(i) Given cross-correlation matrix C, row clusters S := {S_1, S_2, . . . , S_K} and column clusters S' := {S'_1, S'_2, . . . , S'_K}.
(ii) Let C_ij be the sub-matrix of C formed by the row-subset S_i and the column-subset S'_j. Compute a K × K matrix M with general entry M_ij defined as the entry-wise mean of C_ij.
(iii) Let m := max M_ij, and let i_m, j_m be the indices of m in M.
(iv) Define C1 := S_{i_m} and C2 := S'_{j_m}. Add the bi-community (C1, C2) to a bi-community collection C.
(v) Remove S_{i_m} and S'_{j_m} from S and S', respectively. Re-set K ← K − 1. If K = 0, terminate and return C. Otherwise, return to step (ii).
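The pairing routine above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used in the simulations: the function name pair_clusters is hypothetical, the matrix is a plain list of lists, clusters are non-empty lists of row or column indices, and ties in the maximum are broken by first occurrence. Because the block means of the remaining clusters do not change between iterations, M is computed once rather than recomputed at each return to step (ii).

```python
def pair_clusters(C, row_clusters, col_clusters):
    """Greedily pair row clusters with column clusters.

    C            : cross-correlation matrix as a list of lists
    row_clusters : list of K non-empty lists of row indices
    col_clusters : list of K non-empty lists of column indices
    Returns a list of (row_cluster, col_cluster) pairs, strongest first.
    """
    if len(row_clusters) != len(col_clusters):
        raise ValueError("need the same number of row and column clusters")

    def block_mean(rows, cols):
        # Entry-wise mean of the sub-matrix C[rows, cols] (step ii).
        vals = [C[i][j] for i in rows for j in cols]
        return sum(vals) / len(vals)

    K = len(row_clusters)
    M = {(a, b): block_mean(row_clusters[a], col_clusters[b])
         for a in range(K) for b in range(K)}

    remaining_rows = list(range(K))
    remaining_cols = list(range(K))
    pairs = []
    while remaining_rows:
        # Step iii: locate the largest remaining entry of M.
        a, b = max(((a, b) for a in remaining_rows for b in remaining_cols),
                   key=M.get)
        # Step iv: record the bi-community.
        pairs.append((row_clusters[a], col_clusters[b]))
        # Step v: remove the paired clusters and continue.
        remaining_rows.remove(a)
        remaining_cols.remove(b)
    return pairs
```

For example, with two row clusters and two column clusters, the routine first pairs the clusters whose block has the largest mean cross-correlation, then pairs the two that remain.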
The algorithm above iteratively finds the strongest pairings of the row and column clusters. In each simulation, we set k, the number of row and column clusters that k-means will find, to the true number of bi-communities in the model that produced the simulation (including the background node set). We also chose the k-means background node set by choosing the bi-community from C with the closest Jaccard match to the true background node set. Each of these procedures can be viewed as an “oracle” shortcut to the k-means IRCC approach, and the resulting method is therefore a generous version of its use in applications, where the number of bi-communities and the background node set are unknown.
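The oracle choice of the background node set can be illustrated as below. This sketch rests on an assumption the text does not spell out: each bi-community is compared to the true background as a single set of labelled nodes, with row and column nodes tagged so that equal indices on the two sides do not collide. The names jaccard and pick_background are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pick_background(bicommunities, true_background):
    """Index of the bi-community closest (in Jaccard) to the true background.

    bicommunities   : list of (row_indices, col_indices) pairs
    true_background : set of ("row", i) and ("col", j) labelled nodes
    """
    def nodes(bc):
        rows, cols = bc
        return {("row", i) for i in rows} | {("col", j) for j in cols}

    scores = [jaccard(nodes(bc), true_background) for bc in bicommunities]
    return max(range(len(scores)), key=scores.__getitem__)
```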
5.2.5.4 Simulation settings and results
In this section we present three settings in which parameters of the simulation model are toggled, to assess the competing methods’ sensitivities to aspects of the model. In each setting, we move one parameter of the model along an even grid, simulating 50 instances of the model at each parameter value. The performance metrics described in Section 5.2.5.2 are then averaged over the 50 repetitions. We describe the settings and results below.
Increasing the noise variance σ². In the first simulation setting, we increase σ² significantly. We see that the BRIM method performs quite poorly, and that the overall accuracy of CBCE drops off more quickly than does that of k-means IRCC. However, this result should be weighed against the fact that the k-means IRCC approach involves oracle settings of the number of bi-communities and the identity of the background node set. Absent these settings, it may be more difficult to achieve similar performance. Furthermore, the CBCE method involves explicit significance testing, which will be more sensitive (in general) to the absence of signal than a basic optimization method like k-means.
Figure 5.6: Simulation model instances with varying µ_β (betamean).
Decreasing the mean regression parameter µ_β. In the second simulation setting, the mean of the (random) regression parameters is allowed to tend to zero. We see that BRIM remains the least accurate performer, and that CBCE and k-means perform comparably, with CBCE dipping a little below k-means for low µ_β. Again, we temper this result with the fact that the k-means approach is generously informed by oracle settings, and that CBCE has built-in background node detection capability. Furthermore, CBCE is approximately three times faster than k-means IRCC in this simulation setting.
Figure 5.7: Simulation model instances with varying σ² (s2).
Increasing the proportion of background nodes vs. bi-community nodes g. In a third simulation setting, the ratio of the size of the background node set to the number of bi-community nodes is increased many-fold. In response to more background nodes in the model, the performance of both BRIM and k-means degrades considerably, while the performance of CBCE remains near-optimal (see Figure 5.8). This displays the unique ability of a testing approach to bi-community detection to accurately distinguish between background nodes and nodes in bi-communities. Such an ability is particularly important in large-scale genomic data, as important gene regulation sub-networks comprise only a small fraction of the genome. Additionally, the size of the network increases linearly with g, and we see that the computation time of k-means increases at least quadratically in response. This displays some of the computational complications with naive clustering approaches to bi-clustering and bi-community detection, which are surmounted by the set-by-set testing approach inherent to CBCE.
Figure 5.8: Simulation model instances with varying g (“bgmult”).