Modularity based algorithms - Markov chains and random walks


2.1.2 Modularity based algorithms

Girvan and Newman [130] initiated recent work on detecting and evaluating communities in large networks. They introduced a fast greedy technique which relies on maximising a quality function called modularity, defined for a partition C as

$$Q(C) = \frac{1}{2m} \sum_{ij} \left[ a_{ij} - \frac{k_i k_j}{2m} \right] \delta(c(i), c(j)) \qquad (2.6)$$

where $c(i)$ is the community to which node $i$ is assigned, and the Kronecker delta function $\delta(c(i), c(j)) = 1$ if nodes $i$ and $j$ belong to the same community and 0 otherwise. The complexity of the Girvan-Newman algorithm is $O(n^3)$, and it is limited to networks with around $n = 10^3$ nodes.
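As a minimal sketch of how equation (2.6) can be evaluated, the following Python function computes Q directly from an adjacency matrix. It assumes an undirected, unweighted graph in which a_ij is the adjacency matrix entry, k_i is the degree of node i and m is the number of edges. The function name `modularity` and the toy graph (two triangles joined by a single edge) are illustrative choices and are not taken from the cited papers.

```python
import numpy as np

def modularity(A, communities):
    """Evaluate Q(C) of equation (2.6) for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(communities)
    k = A.sum(axis=1)                      # node degrees k_i
    two_m = A.sum()                        # 2m: every edge is counted twice
    same = c[:, None] == c[None, :]        # Kronecker delta on community labels
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_all = modularity(A, [0, 0, 0, 0, 0, 0])     # everything in one community
print(q_split, q_all)
```

Placing every node in a single community gives Q = 0 by construction, so a positive value for the two-triangle partition indicates more within-community edges than the null model expects.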

The rapid growth in network size has motivated researchers to find more efficient methods. Clauset et al. [32] developed the Girvan and Newman technique into the fast greedy modularity optimization algorithm, which reduces the running time to $O(n \log^2 n)$ and raises the practical limit on network size to $10^6$ nodes. This difference in performance between the greedy technique and fast greedy modularity optimization stems from how the two algorithms work. The greedy algorithm counts, for every edge, the number of shortest paths between pairs of nodes that pass through it, then removes the edge $e$ with the highest count. Fast greedy modularity optimization, by contrast, starts from isolated nodes and iteratively adds edges between them until the modularity cannot increase any further.
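The agglomerative idea can be sketched as follows. This is a deliberately naive illustration, not the data structures of Clauset et al. that give the quoted running time: it rescans all community pairs at every step. It assumes an undirected graph given as a symmetric adjacency matrix, uses the standard merge gain ΔQ = 2(e_ij − a_i a_j), and stops when no merge increases modularity, following the description above. The name `greedy_agglomeration` and the toy graph are hypothetical.

```python
import numpy as np

def greedy_agglomeration(A):
    """Naive greedy modularity agglomeration (illustrative sketch only)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    e = A / A.sum()            # e[i, j]: fraction of edge ends joining i to j
    a = e.sum(axis=1)          # a[i]: fraction of edge ends attached to i
    members = {i: [i] for i in range(n)}   # start with singleton communities
    while True:
        best_gain, best_pair = 0.0, None
        ids = sorted(members)
        for x, i in enumerate(ids):
            for j in ids[x + 1:]:
                if e[i, j] > 0:            # only connected communities can merge
                    gain = 2.0 * (e[i, j] - a[i] * a[j])
                    if gain > best_gain + 1e-12:
                        best_gain, best_pair = gain, (i, j)
        if best_pair is None:              # no merge increases modularity
            break
        i, j = best_pair
        e[i, :] += e[j, :]                 # fold community j into community i
        e[:, i] += e[:, j]
        e[j, :] = 0.0
        e[:, j] = 0.0
        a[i] += a[j]
        a[j] = 0.0
        members[i].extend(members.pop(j))
    return [sorted(nodes) for nodes in members.values()]

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

partition = greedy_agglomeration(A)
print(partition)
```

On this toy graph the final candidate merge, joining the two triangles, has negative gain, so the procedure stops with each triangle as its own community.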

The fast modularity optimization algorithm by Blondel et al. [20], now known as the Louvain algorithm, has one of the best results in the comparison tests [89]. It runs through a series of steps, each consisting of two iterations (passes), as shown in Figure 2.1.1. The first phase of this algorithm starts by assigning each node in the network to its own community.

CHAPTER 2 Network Clustering Algorithms

Figure 2.1.1: Fast modularity optimization algorithm by Blondel et al. The diagram shows the steps of the algorithm; each step consists of two iterations. Starting from the network on the bottom left, the first iteration merges neighboring communities so as to produce the largest modularity, as shown at the top of the figure. The second iteration treats these clusters as super-nodes and aggregates them to build a new network of communities. The steps are repeated until no increase in modularity is possible.

The second phase starts by treating the previously found communities as super-nodes in a new network and repeats the first phase on this new network, merging super-nodes to achieve a higher modularity value. These steps are repeated iteratively until the maximum modularity is reached, resulting in multiple levels of communities as super-nodes.² The complexity of the Louvain algorithm is linear in the number of edges in the network, that is $O(m)$ [60].
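The aggregation into super-nodes can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference implementation of Blondel et al.: the graph is a symmetric (possibly weighted) adjacency matrix, and edges inside a community become a self-loop on its super-node whose weight counts each internal edge twice. Under that convention the modularity of a partition of the original graph equals the modularity of the singleton partition of the aggregated graph. The names `aggregate` and `modularity` are hypothetical.

```python
import numpy as np

def modularity(W, labels):
    """Weighted form of equation (2.6) for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    c = np.asarray(labels)
    k = W.sum(axis=1)
    two_m = W.sum()
    same = c[:, None] == c[None, :]
    return float(((W - np.outer(k, k) / two_m) * same).sum() / two_m)

def aggregate(W, labels):
    """Collapse each community into one super-node (second phase sketch)."""
    W = np.asarray(W, dtype=float)
    ids, idx = np.unique(np.asarray(labels), return_inverse=True)
    S = np.zeros((W.shape[0], len(ids)))   # node-to-community indicator matrix
    S[np.arange(W.shape[0]), idx] = 1.0
    return S.T @ W @ S                     # super-node weights, incl. self-loops

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

labels = [0, 0, 0, 1, 1, 1]
B = aggregate(A, labels)                   # 2 x 2 network of super-nodes
q_before = modularity(A, labels)
q_after = modularity(B, [0, 1])
print(B, q_before, q_after)
```

Because modularity is preserved by this construction, the first phase can be rerun on the smaller matrix `B` without losing track of the quality of the underlying partition.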

² Figure reprinted with permission from Ref. [20]. © SISSA Medialab Srl. Reproduced by permission

Many efforts have been devoted to further reducing the running time of modularity optimization and to extending the size of network that can be clustered. For instance, the Radicchi et al. [147] algorithm, in the spirit of Girvan-Newman, iteratively removes edges, but removes the edges with the lowest edge clustering coefficient instead of those with the highest betweenness. The complexity of this algorithm is $O(n^2)$, which is an improvement on the greedy technique. Another example of an algorithm that takes modularity optimization as its main quality function is that of Guimera and Amaral [72].

The Walktrap algorithm proposed by Pons and Latapy [143] uses random walks to define a distance which measures the structural similarity between nodes and between communities. It is based on the idea that a random walker tends to become trapped in a dense part of a network, which corresponds to a community. Starting from an initial assignment of each node to its own community, the two communities at minimum distance are merged and the process is iterated. The bottom-up hierarchy is represented in a dendrogram, and the algorithm stops when a partition with maximum modularity is obtained. The time complexity of this algorithm is $O(mn^2)$ [143].
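A random-walk distance of this kind can be sketched as below. The sketch uses the node-to-node distance usually associated with Walktrap, r_ij(t) = sqrt(Σ_k (P^t_ik − P^t_jk)² / d(k)), where P is the transition matrix of the walk and d(k) the degree of node k. The walk length t = 3, the function name `walk_distance` and the toy graph are assumptions made for illustration, and the sketch requires every node to have at least one neighbour.

```python
import numpy as np

def walk_distance(A, i, j, t=3):
    """Distance between nodes i and j based on t-step random walks."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # degrees d(k); must all be positive
    P = A / d[:, None]                     # one-step transition probabilities
    Pt = np.linalg.matrix_power(P, t)      # t-step transition probabilities
    return float(np.sqrt((((Pt[i] - Pt[j]) ** 2) / d).sum()))

# Hypothetical toy graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

d_same = walk_distance(A, 0, 1)    # two nodes in the same triangle
d_other = walk_distance(A, 0, 5)   # nodes in different triangles
print(d_same, d_other)
```

Nodes in the same dense region see the rest of the graph in almost the same way after a short walk, so their distance is small; this is what drives the merging order in the hierarchy.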

However, modularity optimisation algorithms are subject to a resolution limit on the size of communities they can detect. “Good” small structural communities may remain undetected by the modularity function. This is because the modularity function is based on a null model [130] that assumes each node can interact with every other node [60]. In real-world networks, for example the Web graph, “this assumption is not correct” [60]. If a cluster $c_1$ has total degree $k_{c_1}$ and a cluster $c_2$ has total degree $k_{c_2}$, then the expected number of edges between the two clusters $c_1$ and $c_2$ is $m_{c_1 c_2} = k_{c_1} k_{c_2} / 2m$ [60, 129]. If there are more edges than expected between $c_1$ and $c_2$, which indicates a strong correlation between the two clusters, modularity is higher when they are placed in the same cluster than when each stands alone. Therefore, $c_1$ and $c_2$ are merged into one cluster.
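The merging criterion can be made explicit with a short derivation from equation (2.6). The symbols $l_c$ (number of edges inside cluster $c$) and $l_{c_1 c_2}$ (number of edges between $c_1$ and $c_2$) are notation introduced here for illustration and are not used elsewhere in the text.

```latex
\begin{align*}
Q &= \sum_{c} \left[ \frac{l_c}{m} - \left( \frac{k_c}{2m} \right)^{2} \right], \\
\Delta Q_{c_1 \cup c_2}
  &= \frac{l_{c_1 c_2}}{m} - \frac{2\, k_{c_1} k_{c_2}}{(2m)^{2}}
   = \frac{1}{m} \left( l_{c_1 c_2} - \frac{k_{c_1} k_{c_2}}{2m} \right).
\end{align*}
```

Merging therefore increases modularity exactly when the observed number of edges between the two clusters exceeds the expected number $k_{c_1} k_{c_2} / 2m$. When $m$ is large and both clusters are small, this expected number falls below one, so even a single connecting edge is enough to favour the merge.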

Fortunato and Barthelemy [61] showed that communities with internal edge numbers $\le O(\sqrt{m})$ may not be detected. Small strong communities in large networks may fail to be resolved, even when they are well defined. An illustrative example appears in Figure 2.1.2.


Figure 2.1.2: Maximisation of modularity Q will fail to identify cliques in this example; e.g. if q p, there is higher modularity for the pair of cliques $K_p$ joined by a single edge than for the cliques themselves.
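The effect can be checked numerically on a ring of cliques, a construction commonly used to illustrate the resolution limit. It is not necessarily the configuration drawn in Figure 2.1.2. The sketch below assumes identical cliques of p nodes, each joined to the next by a single edge, and compares the partition with one community per clique against the partition that merges adjacent cliques in pairs. The function names and parameter values are hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Evaluate Q of equation (2.6) for a symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(labels)
    k = A.sum(axis=1)
    two_m = A.sum()
    same = c[:, None] == c[None, :]
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

def ring_of_cliques(n_cliques, p):
    """Adjacency matrix of n_cliques copies of K_p joined in a ring."""
    n = n_cliques * p
    A = np.zeros((n, n))
    for c in range(n_cliques):
        lo = c * p
        A[lo:lo + p, lo:lo + p] = 1.0          # fill the clique block
        u = lo + p - 1                          # last node of this clique
        v = ((c + 1) % n_cliques) * p           # first node of the next clique
        A[u, v] = A[v, u] = 1.0                 # single connecting edge
    np.fill_diagonal(A, 0.0)                    # no self-loops
    return A

def compare(n_cliques, p):
    """Return (Q for single cliques, Q for adjacent cliques merged in pairs)."""
    A = ring_of_cliques(n_cliques, p)
    single = np.repeat(np.arange(n_cliques), p)
    pairs = np.repeat(np.arange(n_cliques) // 2, p)
    return modularity(A, single), modularity(A, pairs)

q_single_small, q_pairs_small = compare(6, 3)    # few triangles
q_single_large, q_pairs_large = compare(30, 3)   # many triangles
print(q_single_small, q_pairs_small, q_single_large, q_pairs_large)
```

With 6 triangles the natural partition into single cliques has the higher modularity, but with 30 triangles the merged pairs score higher, even though the cliques themselves have not changed.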

The authors of the Louvain algorithm claimed that its multi-level nature seems to circumvent the resolution limit problem of modularity, and this appeared to be borne out by its high performance on the LFR benchmark (see Section 2.2 below).

However, Lancichinetti et al. [90] very recently acknowledged that they did not use the subsequent iterates of the Louvain algorithm in determining its performance, only the first phase, because the performance of the final level would be very poor, owing to the resolution limit.

In document Complex information networks – detecting community structure in bipartite networks (Page 60-64)