Experimental analysis - Similarity analysis among different community discovery methods

1.5 Similarity analysis among different community discovery methods

1.5.2 Experimental analysis

To get a general idea of the level of agreement and disagreement among different community detection methods (modularity maximization, information flow maps and statistical inference) in practice, I perform some experiments to study the similarity among the patterns identified by these methods on two types of synthetically generated networks. The first is generated using the LFR benchmark [24], which maintains a power-law degree distribution in the resulting networks, while the second is generated using stochastic block modeling (SBM) [19] without forcing any degree distribution on the output networks. In this section, I refer to these two types of networks as LFR networks and SBM networks respectively. The goal of choosing these two generative processes is not necessarily to study the effect of the degree distribution on the level of agreement between the methods, but to cover the different generative models that have been used in the literature to generate networks with implanted communities. While I still invite more research on the similarity and dissimilarity among community detection methods using more types of networks, including real-world networks, I claim that these experiments, as designed and presented, are sufficient to get a general idea of the level of agreement among the methods discussed above.

1.5.2.1 Results and observations

Figures 1.11 and 1.12 each report the pairwise similarity among the different community detection methods, together with their accuracy (ability to recover the ground truth communities), for LFR and SBM networks respectively. The three methods, modularity Maximization, statistical Inference and information Flow, are referred to as M, I and F respectively in these figures. For each type of network, the patterns were analyzed in four cases: 1) when the network is undirected and unweighted (UD-UW), 2) when the network is undirected and weighted (UD-W), 3) when the network is directed and unweighted (D-UW) and 4) when the network is directed and weighted (D-W). In addition, Table 1.1 reports the accuracy values achieved by the different community detection methods when averaged over the experiments on LFR networks, SBM networks, and all networks (AvgAccLFR, AvgAccSBM, and AvgAcc respectively).

Experiments were repeated for different assignments of the mixing parameter µ, which controls the percentage of cross-community edges in the generated network. General observations about the results are summarized in the following:

• Different community detection methods do not always agree on the clusterings they recover from the networks. Moreover, the similarity patterns among them are not necessarily maintained across the different settings of the network's directionality/weight, across different assignments of the mixing parameter µ, or across different network typologies (LFR and SBM). This finding is important as it sheds light on the importance of understanding the logic behind each method, and how that logic translates in each typology, in order to provide more consistent interpretations of the patterns identified by each method.

• With networks that are generated using LFR benchmarks (Figure 1.11), there is perfect agreement among the three methods on directed-unweighted networks (D-UW), and their ability to recover the ground truth communities in this case is not affected by the level of noise µ in the network. This agreement is relatively maintained with directed-weighted networks (D-W), as well as with undirected-unweighted networks (UD-UW), as long as the level of noise in the network is moderate (less than 0.25); beyond that point, modularity maximization starts to become an outlier. With undirected-weighted networks, statistical inference shows a very poor ability to recover the ground truth communities, while modularity maximization and information flow maps behave similarly in recovering them.

                            AvgAccLFR   AvgAccSBM   AvgAcc
Information flow mapping    0.99        0.6         0.8
Statistical inference       0.61        0.89        0.7
Modularity maximization     0.84        0.23        0.5

Table 1.1: Average accuracy values of the different community detection methods, where AvgAccLFR is the average accuracy over all experiments on LFR networks, AvgAccSBM is the average accuracy over all experiments on SBM networks, and AvgAcc is the average accuracy over all experiments.

• With networks that are generated using stochastic block modeling (SBM), both statistical inference and information flow mapping show a good level of agreement and a good ability to recover the ground truth communities on weighted networks, whether directed or not (Figure 1.12). With unweighted networks that are generated using stochastic block modeling, both modularity maximization and information flow maps show a poor ability to recover the ground truth communities, while statistical inference demonstrates good accuracy at moderate levels of noise in the network (less than 0.25).

• On average, statistical inference seems to outperform the other methods in recovering the ground truth communities when the networks are generated using stochastic block modeling, while information flow mapping and modularity maximization outperform statistical inference on networks that are generated using LFR benchmarks. Generally speaking, information flow mapping seems to outperform the other methods in its ability to recover the ground truth communities across the different types of networks and the different levels of noise in the network (Table 1.1).

1.5.2.2 Experimental settings

Figure 1.11: Accuracy and pairwise similarity among three community detection methods, namely modularity maximization (M), information flow maps (F) and statistical inference (I), with networks generated using the LFR benchmark in four different cases: when the network is undirected and unweighted (UD-UW), when the network is undirected and weighted (UD-W), when the network is directed and unweighted (D-UW) and when the network is directed and weighted (D-W). (a) Pairwise similarity among community detection methods with three values of the mixing parameter µ ∈ {0.05, 0.2, 0.45}. (b) Accuracy of different community detection methods as a function of the mixing parameter µ.

The networks used in these experiments consist of 1000 nodes. The parameters used for generating LFR networks [24] are: minimum degree = 15, maximum degree = 50, minimum number of communities = 20, maximum number of communities = 50 and mixing parameter for the weights in weighted networks = 0.1. The process of generating SBM networks takes as input a clustering C over the network nodes and a probability matrix which specifies the edge creation probabilities within and across the communities of C. In these experiments, C was chosen to consist of 10 communities, where the community memberships were sampled from a categorical distribution and the community sizes were sampled from a Dirichlet distribution (as used by [25]). Edge probabilities were chosen to be between 0.5 and 0.7 for the within-community edges and µ (the value of the mixing parameter) for the cross-community edges. To generate weighted networks (which the model used does not support directly), I started by generating unweighted networks, then placed weights randomly such that they take higher values when they connect two nodes of the same community and lower values otherwise.
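The SBM generation procedure described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the experiments: the function names `sample_sbm` and `assign_weights`, the Dirichlet concentration `alpha`, the choice of drawing one within-community probability per community uniformly from [0.5, 0.7], and the two weight ranges are all hypothetical, since the text does not specify them. Only the undirected case is shown.

```python
import random


def sample_sbm(n=1000, k=10, mu=0.05, p_in_range=(0.5, 0.7), alpha=1.0, seed=0):
    """Sample an undirected, unweighted SBM network with planted communities.

    alpha and the per-community uniform draw of p_in are assumptions.
    """
    rng = random.Random(seed)
    # Community proportions ~ Dirichlet(alpha, ..., alpha), via normalized gamma draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Community membership of every node ~ Categorical(proportions).
    membership = rng.choices(range(k), weights=proportions, k=n)
    # One within-community edge probability per community (assumed uniform in range).
    p_in = [rng.uniform(*p_in_range) for _ in range(k)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            same = membership[u] == membership[v]
            p = p_in[membership[u]] if same else mu  # mu for cross-community pairs
            if rng.random() < p:
                edges.append((u, v))
    return membership, edges


def assign_weights(edges, membership, within=(0.5, 1.0), cross=(0.0, 0.5), seed=0):
    """Attach random weights: higher within communities, lower across (ranges assumed)."""
    rng = random.Random(seed)
    weighted = []
    for u, v in edges:
        low, high = within if membership[u] == membership[v] else cross
        weighted.append((u, v, rng.uniform(low, high)))
    return weighted
```

Calling `sample_sbm(n=1000, k=10, mu=0.2)` followed by `assign_weights(...)` would give one undirected weighted instance; the returned `membership` list plays the role of the ground truth clustering C.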

Figure 1.12: Accuracy and pairwise similarity among three community detection methods, namely modularity maximization (M), information flow maps (F) and statistical inference (I), with networks generated using stochastic block modeling in four different cases: when the network is undirected and unweighted (UD-UW), when the network is undirected and weighted (UD-W), when the network is directed and unweighted (D-UW) and when the network is directed and weighted (D-W). (a) Pairwise similarity among community detection methods with three values of the mixing parameter µ ∈ {0.05, 0.2, 0.45}. (b) Accuracy of different community detection methods as a function of the mixing parameter µ.

For each of the three methods, a representative algorithm that provides a good approximation of the logic behind the method was chosen. This resulted in choosing the Louvain [15], Infomap [18] and MCMC [22] algorithms as representatives of the modularity maximization, information flow mapping and statistical inference methods respectively. For calculating the similarity and accuracy values, the adjusted mutual information was used for its sensitivity to different types of dissimilarities and its correction for by-chance agreement among clusterings, as we discussed in a previous study [26] reported in Appendix C.
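To make the similarity and accuracy measure concrete, the following is a minimal sketch of the adjusted mutual information between two clusterings, using the standard expected mutual information under random permutations of the labels. The function names are illustrative, and normalizing by the arithmetic mean of the two entropies is an assumption: the text does not state which normalization was used.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of a clustering given its cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def mutual_information(labels_a, labels_b):
    """Mutual information between two label sequences over the same nodes."""
    n = len(labels_a)
    a, b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum((nij / n) * math.log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(a_counts, b_counts, n):
    """Expected MI when cluster sizes are fixed and labels are randomly permuted."""
    def lf(k):  # log(k!)
        return math.lgamma(k + 1)

    emi = 0.0
    for ai in a_counts:
        for bj in b_counts:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log-probability of observing an overlap of size nij (hypergeometric).
                log_p = (lf(ai) + lf(bj) + lf(n - ai) + lf(n - bj)
                         - lf(n) - lf(nij) - lf(ai - nij) - lf(bj - nij)
                         - lf(n - ai - bj + nij))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_p)
    return emi


def adjusted_mutual_information(labels_a, labels_b):
    """AMI = (MI - E[MI]) / (mean(H_a, H_b) - E[MI]); mean normalization is assumed."""
    n = len(labels_a)
    a_counts = list(Counter(labels_a).values())
    b_counts = list(Counter(labels_b).values())
    mi = mutual_information(labels_a, labels_b)
    emi = expected_mutual_information(a_counts, b_counts, n)
    denom = 0.5 * (entropy(a_counts, n) + entropy(b_counts, n)) - emi
    # Degenerate case (e.g., both clusterings have a single cluster): treat as identical.
    return 1.0 if abs(denom) < 1e-12 else (mi - emi) / denom
```

Under this sketch, accuracy would correspond to calling the function with a method's output and the ground truth clustering, and pairwise similarity to calling it on the outputs of two methods (M and I, M and F, I and F).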

With weighted networks, modularity maximization translates into maximizing the total sum of weights (rather than the number of edges) within communities and minimizing that amount across communities. The same algorithm (i.e., Louvain) achieves that if weights, rather than 0s/1s for the absence/existence of edges, are assigned to the entries of the adjacency matrix A. However, with directed networks, a variant of Louvain that accounts for the direction of edges is used [27]. As for the other algorithms (Infomap and MCMC), directionality and edge weights are naturally accounted for in their implementations.
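As an illustration of the weighted objective for undirected networks, the sketch below evaluates the standard weighted modularity Q = Σ_c [W_c / W − (S_c / 2W)²], where W is the total edge weight, W_c the weight inside community c and S_c the summed node strengths of c. It only evaluates the objective that a Louvain-style algorithm would optimize; it is not the Louvain procedure itself, nor the directed variant of [27], and the function name and edge-list representation are illustrative assumptions.

```python
def weighted_modularity(edges, community):
    """Weighted modularity of an undirected network.

    edges: iterable of (u, v, w) with each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    total = 0.0      # W: total edge weight
    internal = {}    # W_c: weight of edges with both endpoints in community c
    strength = {}    # S_c: sum of node strengths (weighted degrees) in community c
    for u, v, w in edges:
        total += w
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    if total == 0:
        raise ValueError("the network has no edge weight")
    return sum(internal.get(c, 0.0) / total - (s / (2.0 * total)) ** 2
               for c, s in strength.items())
```

Setting every weight to 1 recovers the usual unweighted modularity, which matches the remark above that the same algorithm can be reused once weights replace the 0/1 entries of A.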
