Using random matrix theory to study detectability

2.5 Detectability in a single stratum

2.5.3 Using random matrix theory to study detectability

Since node-to-community assignments, $z$, can be inferred with spectral methods, random matrix theory (Benaych-Georges and Nadakuditi, 2011; Nadakuditi and Newman, 2013) is a useful approach for studying partitioning and phase transitions in detectability (i.e., node-to-community assignment accuracy) (Nadakuditi and Newman, 2012; Peixoto, 2013; Sarkar et al., 2013). In this approach, a phase transition in detectability corresponds to the disappearance of gaps between isolated eigenvalues (whose corresponding eigenvectors reflect community structure) and bulk eigenvalues (which arise due to stochasticity and whose $N \to \infty$ limiting distribution is given by a spectral density $P(\lambda)$). The theory we develop in this work is based on the modularity matrix, $\bar{B}_{ij} = \bar{A}_{ij} - \rho L$ (Newman and Girvan, 2004). We first study $\Delta^*$ for the summation network. We analyze the distribution of real eigenvalues $\{\lambda_i\}$ of $\bar{B}$ (in descending order). First, we describe the statistical properties of the entries $\{\bar{A}_{ij}\}$, which are independent random variables following a binomial distribution with $P(\bar{A}_{ij} = a) = f(a; L, \Pi_{z_i z_j})$,

where $f(a; L, p) = \binom{L}{a} p^a (1-p)^{L-a}$ (2.15)

has mean $Lp$ and variance $Lp(1-p)$. With sufficiently large variance in the edge probabilities, we find that the limiting $N \to \infty$ distribution of bulk eigenvalues for $\bar{B}$ is given by a semicircle distribution,

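$P(\lambda) = \frac{\sqrt{\lambda_2^2 - \lambda^2}}{\pi \lambda_2^2 / 2}$ (2.16)

for $|\lambda| < \lambda_2$, where

$\lambda_2 = \sqrt{4NL[\rho(1-\rho) - \Delta^2/4]}$ (2.17)

is the upper bound on the support of the spectral density and is the limiting $N \to \infty$ value of the second-largest eigenvalue. The largest eigenvalue of $\bar{B}$ in the $N \to \infty$ limit is the isolated eigenvalue,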

$\lambda_1 = NL\Delta/2 + 2[\rho(1-\rho) - \Delta^2/4]/\Delta$. (2.18)

The eigenvector $v$ corresponding to $\lambda_1$ gives the spectral bipartition. Here, the inferred community label of node $i$ is determined by the sign of $v_i$ and, provided that the largest eigenvalue corresponds to this isolated eigenvalue, $\lambda_1$, the eigenvector entries $\{v_i\}$ are correlated with the node-to-community labels, $z$. As shown in Taylor et al. (2015), we can derive a detectability equation that accounts for the number of layers,

$NL\Delta = \sqrt{4NL\rho(1-\rho)}$. (2.19)
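To make the detectability equation concrete, the following minimal Python sketch evaluates the isolated eigenvalue (2.18) and solves (2.19) for the detectability limit of the summation network. The function names are hypothetical, and the bulk edge is assumed here to be $\lambda_2 = \sqrt{4NL[\rho(1-\rho) - \Delta^2/4]}$, the semicircle radius implied by the binomial entry variance; treat that expression as an assumption of the sketch rather than a quoted result.

```python
from math import sqrt


def bulk_edge(N, L, rho, delta):
    """Assumed bulk edge lambda_2 (semicircle radius) for the modularity matrix."""
    return sqrt(4 * N * L * (rho * (1 - rho) - delta**2 / 4))


def isolated_eigenvalue(N, L, rho, delta):
    """Isolated eigenvalue lambda_1 as written in (2.18)."""
    return N * L * delta / 2 + 2 * (rho * (1 - rho) - delta**2 / 4) / delta


def detectability_limit(N, L, rho):
    """Solve (2.19), N*L*Delta = sqrt(4*N*L*rho*(1 - rho)), for Delta."""
    return 2 * sqrt(rho * (1 - rho) / (N * L))


if __name__ == "__main__":
    N, rho = 10**4, 0.5
    for L in (1, 4, 16):
        d = detectability_limit(N, L, rho)
        # At the limit, the isolated eigenvalue should (nearly) touch the bulk edge.
        print(L, d, isolated_eigenvalue(N, L, rho, d), bulk_edge(N, L, rho, d))
```

Quadrupling the number of layers halves the computed limit, which is the $L^{-1/2}$ scaling contained in (2.19).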

We now study $\Delta^*$ for the thresholded networks, which correspond to single-layer SBMs in which the community labels, $z$, are identical to those of the multilayer SBM, but there are new effective block edge probabilities

$\hat{\Pi}^{(\tilde{L})}_{nm} = 1 - F(\tilde{L} - 1; L, \Pi_{nm})$, (2.20)

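where $F(a; L, p)$ is the cumulative distribution function for the binomial distribution $f(a; L, p)$. The effective probabilities for the AND and OR networks are $\hat{\Pi}^{(L)}_{nm} = (\Pi_{nm})^L$ and $\hat{\Pi}^{(1)}_{nm} = 1 - (1 - \Pi_{nm})^L$, respectively. For the two-community SBM, the effective probabilities are $\hat{p}^{(\tilde{L})}_{\mathrm{in,out}} = 1 - F(\tilde{L} - 1; L, p_{\mathrm{in,out}})$, with $\hat{\Delta}^{(\tilde{L})} = \hat{p}^{(\tilde{L})}_{\mathrm{in}} - \hat{p}^{(\tilde{L})}_{\mathrm{out}}$ and $\hat{\rho}^{(\tilde{L})} = (\hat{p}^{(\tilde{L})}_{\mathrm{in}} + \hat{p}^{(\tilde{L})}_{\mathrm{out}})/2$.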
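The effective probabilities can be evaluated numerically with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, it assumes $p_{\mathrm{in,out}} = \rho \pm \Delta/2$, and it assumes that the thresholded network's limit is obtained by applying the single-layer ($L = 1$) form of (2.19) to the effective quantities $\hat{\Delta}$ and $\hat{\rho}$, with the root located by bisection.

```python
from math import comb, sqrt


def binom_sf(k, L, p):
    """P(X >= k) for X ~ Binomial(L, p), i.e. 1 - F(k - 1; L, p)."""
    return sum(comb(L, a) * p**a * (1 - p) ** (L - a) for a in range(k, L + 1))


def effective_params(rho, delta, L, L_thr):
    """Return (delta_hat, rho_hat) for the network thresholded at L_thr."""
    p_in, p_out = rho + delta / 2, rho - delta / 2  # assumed parametrization
    ph_in, ph_out = binom_sf(L_thr, L, p_in), binom_sf(L_thr, L, p_out)
    return ph_in - ph_out, (ph_in + ph_out) / 2


def gap(delta, N, rho, L, L_thr):
    """N*delta_hat - sqrt(4*N*rho_hat*(1 - rho_hat)); non-negative when detectable."""
    d_hat, r_hat = effective_params(rho, delta, L, L_thr)
    # Clamp at zero to guard against tiny negative values from rounding.
    return N * d_hat - sqrt(4 * N * max(r_hat * (1 - r_hat), 0.0))


def thresholded_limit(N, rho, L, L_thr, tol=1e-12):
    """Bisect for a root of gap() in delta; return None if no sign change exists."""
    lo, hi = 0.0, 2 * min(rho, 1 - rho)  # keeps p_in and p_out inside [0, 1]
    if gap(hi, N, rho, L, L_thr) < 0:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid, N, rho, L, L_thr) < 0:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    N, L, rho = 10**4, 4, 0.9
    print("AND:", thresholded_limit(N, rho, L, L_thr=L))
    print("OR: ", thresholded_limit(N, rho, L, L_thr=1))
```

Bisection assumes the gap changes sign once on the search interval; for parameter choices where that does not hold, a finer scan over $\Delta$ would be the safer option.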

2.5.4 Results

In Figures 2.7 (a) and (b), we show $\Delta^*$ versus the mean edge probability, $\rho$, for the different aggregation methods: (i) a single layer (red dot-dashed curves), which is identical in panels (a) and (b); (ii) the summation network (blue dashed curves), for which the curve in (b) corresponds to the curve in panel (a) rescaled by a factor of 1/2; and (iii) thresholded networks (solid curves), which shift left-to-right with increasing $\tilde{L}$. This shift is evident when comparing $\Delta^*$ for the AND ($\tilde{L} = L$, gold circles) and OR ($\tilde{L} = 1$, cyan squares) networks. We find that when $\rho$ is large, the AND (OR) network has a relatively small (large) detectability limit. In other words, aggregating layers using the AND (OR) operation is beneficial for dense (sparse) networks. Note that the results shown in Figure 2.7 are just a subset of the results from Taylor et al. (2015).

2.5.5 Conclusion

In this work, we studied limitations on community detection for multilayer networks with layers drawn from a common SBM. As an illustrative model, we analyzed the effect of layer aggregation on the detectability limit $\Delta^*$ for two equal-sized communities. When layers are aggregated by summation, we analytically showed that $\Delta^*$ vanishes as $O(L^{-1/2})$. When layers are aggregated by thresholding this summation, $\Delta^*$ depends on the choice of threshold, $\tilde{L}$. For $\tilde{L} = \lceil \rho L \rceil$, we analytically found $\Delta^*$ to also vanish as $O(L^{-1/2})$. We note that our analysis also describes layer aggregation by taking the mean, $L^{-1} \sum_l A^{(l)}$, since the multiplication of a matrix by a constant simply scales the eigenvalues by that constant. Thus, our results are in agreement with previous work that proved the consistency of spectral clustering via the mean adjacency matrix (Han et al., 2015).
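As a worked step, the $O(L^{-1/2})$ scaling for the summation network can be read off by rearranging the detectability equation (2.19) with $N$ and $\rho$ held fixed. The lines below are only that rearrangement, plus the observation about rescaling by $1/L$ for the mean; no new result is claimed.

```latex
\begin{align*}
NL\,\Delta^* &= \sqrt{4NL\,\rho(1-\rho)} \\
\Delta^* &= \frac{\sqrt{4NL\,\rho(1-\rho)}}{NL}
          = 2\sqrt{\frac{\rho(1-\rho)}{NL}}
          = O\!\left(L^{-1/2}\right), \\[4pt]
\text{mean aggregation:}\quad
\lambda_i\!\left(L^{-1}\bar{B}\right) &= L^{-1}\lambda_i\!\left(\bar{B}\right)
\;\Longrightarrow\;
\lambda_1 = \lambda_2 \text{ is reached at the same } \Delta^*.
\end{align*}
```

For example, with $N = 10^4$ and $\rho = 0.5$, this gives $\Delta^* = 0.005$ for $L = 4$ and $\Delta^* = 0.0025$ for $L = 16$.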

Finally, it is commonplace to threshold pairwise-interaction data to construct network representations that are sparse and unweighted and can be studied at a lower computational cost. Our research provides insight into this common, yet not well understood, practice. It would be interesting to extend this work to allow the SBMs of layers to be correlated rather than identical (Abbe et al., 2016) or organized into ‘strata’ (Stanley et al., 2016) (i.e., layers within a single stratum are similar, but they differ across strata). We are currently extending our analysis to hierarchical stochastic block models.

Figure 2.7: Layer aggregation enhances the detectability of community structure. (a), (b) We plot the detectability limit $\Delta^*$ versus mean edge probability $\rho$ for a single network layer (red dot-dashed curves), the aggregate network obtained by summation (blue dashed curves), and aggregate networks obtained by thresholding this summation at $\tilde{L} \in \{1, 2, 3, 4\}$ (solid curves). Gold circles and cyan squares highlight $\tilde{L} = L$ and $\tilde{L} = 1$, which we refer to as AND and OR networks, respectively. Results are shown for $N = 10^4$ nodes with (a) $L = 4$ and (b) $L = 16$ layers. (c) For $L = 4$, we show $\Delta^*$ versus $\rho$ for the optimal threshold $\tilde{L} = \lceil \rho L \rceil$ (orange triangles), which lies on the solution curves for $\tilde{L} \in \{1, \dots, L\}$ (solid curves). (d) We show $\Delta^*$ for $\tilde{L} = \lceil \rho L \rceil$ with $L \in \{4, 16\}$. These piecewise-continuous solutions collapse onto the asymptotic solution $\Delta^*_{\mathrm{asym}}$ (black curve) as $L$ increases. In panels (c) and (d), we additionally plot $\Delta^*$ for the summation network (blue dashed curves).

CHAPTER 3

Network compression for community detection with super nodes

In practice, social and biological networks are quite large, with hundreds of thousands or millions of nodes. For example, the SNAP (Stanford Network Analysis Project) network repository (https://snap.stanford.edu/data/) houses various types of social and technological networks, including the Amazon co-purchasing network, which has 548,000 nodes and 1,788,725 edges. Working with a network of this size is overwhelming and makes computations slow and variable. In particular, we will show that community detection algorithms produce highly variable outputs on large networks or take a long time to run. In this chapter, we introduce a pre-processing technique that reduces an originally large network to a smaller network with a user-defined number of nodes, or ‘super nodes’, on which the analysis can then be performed. This work is from our paper Compressing Networks with Super Nodes (Stanley et al., 2017).
