Overlapping Community and Node Discovery Algorithm Based on Edge Similarity


2017 International Conference on Electronic and Information Technology (ICEIT 2017) ISBN: 978-1-60595-526-1


Dong-ming CHEN¹, Dong-fang SIMA¹,* and Xin-yu HUANG¹

Software College of Northeastern University, Shenyang, China

*Corresponding author

Keywords: Edge similarity, Line graph, Overlapping community, Overlapping node

Abstract. Most community detection algorithms are designed from the perspective of nodes and usually neglect the overlapping structure in networks, while some of them suffer from high complexity, inaccuracy and low stability. To solve these issues, an overlapping community and node discovery algorithm based on edge similarity is proposed. In this paper, the line graph is established according to the incidence matrix. The algorithm then proceeds on the line graph, and finally the community detection results are restored to the original network, so that overlapping communities and nodes are discovered. Several experiments carried out on different datasets demonstrate that the proposed algorithm is effective.

Introduction

Most existing community detection algorithms [1] for complex networks [2] are designed from the perspective of nodes, in which a node can be assigned to only one community. However, in real networks, some nodes belong to several communities; these are known as overlapping nodes [3], and the communities that share them as overlapping communities [4]. Because overlapping nodes are related to many communities simultaneously, they usually act as bridges between different communities and play an important role in networks. As networks grow in scale, further research on overlapping community and node discovery is of great significance.

Mining overlapping nodes in networks is usually achieved by discovering overlapping communities [5]. Although this line of research has produced some results, it still has shortcomings. Palla et al. [6] proposed the well-known Clique Percolation Method in 2005; however, its main problem is that the parameter k is difficult to determine. Gregory put forward the CONGA algorithm [7], whose effect depends heavily on empirical parameters. In general, current overlapping node detection algorithms suffer from problems such as high complexity and low accuracy. To solve these issues, this paper proposes an overlapping community and node discovery algorithm based on edge similarity. The object of study is transformed from nodes to edges, and the overlapping nodes are then discovered by converting between the line graph and the original network.

Algorithm Description

Line Graph


Definition 1: An $N \times N$ adjacency matrix $A$ ($N$ is the number of nodes) is used to represent the network $G$. If the element $A_{ij} = 1$, there is an edge between node $i$ and node $j$ in the network; if $A_{ij} = 0$, there is no edge between $i$ and $j$.

Definition 2: An $N \times L$ incidence matrix $B$ ($L$ is the number of links) is used to represent the network $G$. If the element $B_{i\alpha} = 1$, the node $i$ is associated with the edge $\alpha$; if the element $B_{i\alpha} = 0$, there is no link between the two.

In the incidence matrix, the degree $k_i$ of a node $i$ and the number of nodes $k_\alpha$ attached to a link $\alpha$ (always equal to two) are given as follows:

$$k_i = \sum_{\alpha} B_{i\alpha}, \qquad k_\alpha = \sum_{i} B_{i\alpha} \tag{1}$$

Another expression of the adjacency matrix is given by the following formula:

$$A_{ij} = \sum_{\alpha} B_{i\alpha} B_{j\alpha} - k_i \delta_{ij} \tag{2}$$

The line graph is constructed by means of the node mapping of a bipartite network [8]. The adjacency matrix $C$ is used to represent the line graph of the network, where $C$ is an $L \times L$ matrix. According to the transformation formula mentioned above, the adjacency matrix expression of the line graph can be obtained:

$$C_{\alpha\beta} = \sum_{i} B_{i\alpha} B_{i\beta} \, (1 - \delta_{\alpha\beta}) \tag{3}$$

When the element $C_{\alpha\beta} = 1$, the two edges $\alpha$ and $\beta$ share a common node; otherwise $C_{\alpha\beta} = 0$. This transformation does not lose any information in the original network. Therefore, the theory of line graphs is widely used.
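As an illustration of formulas (1) to (3), the following minimal sketch builds the incidence matrix of a small toy network and derives the degrees, the adjacency matrix and the line graph from it. The toy edge list and all function names are assumptions made for this example, and formula (2) is taken to read $A_{ij} = \sum_{\alpha} B_{i\alpha} B_{j\alpha} - k_i \delta_{ij}$.

```python
def incidence_matrix(n_nodes, edges):
    """N x L matrix B with B[i][a] = 1 if node i is an endpoint of edge a."""
    B = [[0] * len(edges) for _ in range(n_nodes)]
    for a, (u, v) in enumerate(edges):
        B[u][a] = 1
        B[v][a] = 1
    return B


def node_degrees(B):
    """Formula (1), left part: k_i is the row sum of B."""
    return [sum(row) for row in B]


def edge_sizes(B):
    """Formula (1), right part: k_alpha is the column sum of B (always 2)."""
    return [sum(col) for col in zip(*B)]


def adjacency_from_incidence(B):
    """Formula (2) as assumed here: A_ij = sum_a B_ia * B_ja - k_i * delta_ij."""
    k = node_degrees(B)
    n = len(B)
    return [[sum(x * y for x, y in zip(B[i], B[j])) - (k[i] if i == j else 0)
             for j in range(n)] for i in range(n)]


def line_graph_adjacency(B):
    """Formula (3): C_ab = sum_i B_ia * B_ib * (1 - delta_ab)."""
    cols = list(zip(*B))
    n_edges = len(cols)
    return [[0 if a == b else sum(x * y for x, y in zip(cols[a], cols[b]))
             for b in range(n_edges)] for a in range(n_edges)]


# Hypothetical toy network: a triangle 0-1-2 plus a pendant edge 2-3.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
B = incidence_matrix(4, toy_edges)
k_node = node_degrees(B)
k_edge = edge_sizes(B)
A = adjacency_from_incidence(B)
C = line_graph_adjacency(B)
```

In this toy network the edge (0, 1) and the edge (2, 3) have no common endpoint, so the corresponding entry of `C` is zero, while every other pair of distinct edges shares a node.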

Edge Similarity

Suppose that in network $G$, the edges $e_{ip}$ and $e_{pj}$ have a common node $p$. We call $p$ a shared node, and nodes $i$ and $j$ contribution nodes. In the network topology, neighbor nodes can usually provide a lot of useful information, so we can reasonably define the edge similarity by means of the two nodes associated with one edge and the neighbor nodes of these two nodes.

In summary, this paper considers two influences: that of the common neighbor nodes of the contribution nodes on the two edges connected to the contribution nodes, and that of each contribution node itself on the edge it is connected to. By improving the RA index [9], the edge similarity is computed as follows:

$$S(e_{ip}, e_{pj}) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k(z)} + \frac{1}{|\Gamma(i) \cup \Gamma(j)|} \tag{4}$$

where $\Gamma(i)$ represents the set of neighbor nodes of node $i$, $\Gamma(i) \cap \Gamma(j)$ is the set of common neighbor nodes of node $i$ and node $j$, $k(z)$ represents the degree of node $z$, and $\Gamma(i) \cup \Gamma(j)$ denotes the set of all neighbor nodes of node $i$ and node $j$.
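The edge similarity of formula (4) can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function name, the dictionary-of-sets representation and the toy network are assumptions, and the second term is assumed to use the number of elements of $\Gamma(i) \cup \Gamma(j)$ as its denominator.

```python
def edge_similarity(neighbors, i, j):
    """Similarity of two edges e_ip and e_pj that share a node p.

    neighbors maps each node to the set of its neighbor nodes;
    i and j are the contribution nodes (the non-shared endpoints).
    """
    common = neighbors[i] & neighbors[j]      # Gamma(i) intersect Gamma(j)
    union = neighbors[i] | neighbors[j]       # Gamma(i) union Gamma(j)
    # First term: each common neighbor z contributes 1 / k(z), as in the RA index.
    ra_term = sum(1.0 / len(neighbors[z]) for z in common)
    # Second term: assumed to be 1 / |Gamma(i) union Gamma(j)|.
    return ra_term + 1.0 / len(union)


# Hypothetical toy network: a triangle 0-1-2 plus a pendant edge 2-3.
toy_neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

s_inside = edge_similarity(toy_neighbors, 0, 2)   # edges (0,1) and (1,2), shared node 1
s_pendant = edge_similarity(toy_neighbors, 0, 3)  # edges (0,2) and (2,3), shared node 2
```

Because the two edges share the node $p$, that node is always a common neighbor of $i$ and $j$, so neither denominator can be zero for a valid pair of adjacent edges.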

Algorithm Process

Step 1: Network initialization.

1) Build the original graph $G(V(G), E(G))$ of the network.

2) Count the nodes and edges in the network $G$, and give each node and edge an identifier, where the edge identifiers increase according to the input order.

3) Find the neighbor node set $Neighbor(i)$ of each node $i$ in the network.

4) Construct the line graph $G'(V(G'), E(G'))$ according to the incidence matrix and formula (3); here the mapping relation $E(G) \to V(G')$ exists.
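Step 1 can be sketched with plain dictionaries instead of explicit matrices. The function name `initialize`, the returned structures and the toy input are hypothetical choices for illustration; skipping duplicate edges and self-loops is an extra assumption not stated in the text.

```python
from collections import defaultdict
from itertools import combinations


def initialize(edge_list):
    """Assign edge identifiers, collect neighbor sets and build the line graph."""
    edge_id = {}                    # (u, v) with u < v -> identifier in input order
    neighbors = defaultdict(set)    # node -> set of neighbor nodes
    incident = defaultdict(list)    # node -> identifiers of incident edges
    for u, v in edge_list:
        key = (min(u, v), max(u, v))
        if u == v or key in edge_id:
            continue                # assumption: ignore self-loops and duplicates
        eid = len(edge_id)
        edge_id[key] = eid
        neighbors[u].add(v)
        neighbors[v].add(u)
        incident[u].append(eid)
        incident[v].append(eid)
    # Line graph: one node per edge identifier; two of them are adjacent
    # when the original edges share an endpoint.
    line_adj = {eid: set() for eid in edge_id.values()}
    for eids in incident.values():
        for a, b in combinations(eids, 2):
            line_adj[a].add(b)
            line_adj[b].add(a)
    return edge_id, dict(neighbors), line_adj


# Hypothetical toy network: a triangle 0-1-2 plus a pendant edge 2-3.
edge_id, neighbors, line_adj = initialize([(0, 1), (1, 2), (0, 2), (2, 3)])
```

Keeping `edge_id` makes the mapping $E(G) \to V(G')$ explicit, which is what later allows a partition of the line graph to be mapped back to the original network.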

Step 2: Apply the splitting algorithm to divide communities in the network $G'$.

1) Arbitrarily select two adjacent edges $e_{ip}$ and $e_{pj}$ in network $G$, and calculate the similarity $S(e_{ip}, e_{pj})$ between the two edges in accordance with formula (4).

2) Find the current minimum edge similarity $S_{min}(e_{ip}, e_{pj})$ in network $G$.

3) By the mapping relation $E(G) \to V(G')$, find the edge $e_p$ in the line graph $G'$ that corresponds to the node $p$ in the graph $G$, and remove the edge $e_p$.

4) According to the modularity calculation formula:

$$Q = \sum_{i=1}^{m} \left[ \frac{E[i][i]}{E} - \left( \frac{\sum_{j=1 \sim m,\, j \neq i} E[i][j]}{E} \right)^{2} \right] \tag{5}$$

calculate the modularity $Q_{current}$ of the current network $G'$ (the modularity of the initial community structure is $Q_0$). Here $m$ is the given number of communities in the division, $E$ represents the total number of edges of the network, $E[i][i]$ denotes the number of edges within the community $i$, and $E[i][j]$ the number of edges between the communities $i$ and $j$.

5) Compare $Q_{current}$ with $Q_{max}$. If $Q_{current} \geq Q_{max}$, record the current $Q$ value as $Q_{max} = Q_{current}$, record the community partition of the line graph $G'$ at the same time, and repeat the procedure in Step 2. Otherwise, go to Step 3.

Step 3: Discovering overlapping communities and nodes.

1) Traverse the nodes in $G'$, $v_i \in V(G')$. If $v_i \in community[j]$, we can get the corresponding edge in $G$ by the mapping relation $E(G) \to V(G')$, that is, $e_i = (u, v) \in E(G)$, and add the two nodes $u$ and $v$ corresponding to the edge $e_i$ to $cluster(j)$ ($j \in [1...c]$, where $c$ is the number of communities).

2) $\forall j \in [1...c]$, $cluster(1) \cap cluster(2) \cap ... \cap cluster(j)$ represents the overlapping part of the $j$ communities.

3) Count the number of times each node of the network $G$ occurs in $cluster(i)$ ($i \in [1...c]$). If the number is greater than one, the node exists in more than one community, namely, the node is an overlapping node.
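Step 3 amounts to mapping each line-graph node back to its original edge and counting cluster memberships. The sketch below assumes a partition of the line graph is already available as a dictionary from edge identifier to community label; the function name and the toy data are hypothetical.

```python
from collections import Counter


def recover_overlaps(edge_endpoints, edge_community):
    """Map an edge partition back to node clusters and find overlapping nodes.

    edge_endpoints: edge identifier -> (u, v) in the original network G
    edge_community: edge identifier -> community label found in the line graph G'
    """
    clusters = {}
    for eid, label in edge_community.items():
        u, v = edge_endpoints[eid]
        # Both endpoints of the edge join the cluster of the edge's community.
        clusters.setdefault(label, set()).update((u, v))
    # A node that occurs in more than one cluster is an overlapping node.
    counts = Counter(node for members in clusters.values() for node in members)
    overlapping = {node for node, times in counts.items() if times > 1}
    return clusters, overlapping


# Hypothetical example: two triangles' worth of edges meeting at node 2.
toy_endpoints = {0: (0, 1), 1: (1, 2), 2: (0, 2), 3: (2, 3), 4: (3, 4)}
toy_partition = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}
clusters, overlapping = recover_overlaps(toy_endpoints, toy_partition)
```

In this example node 2 is an endpoint of edges from both communities, so it is the only overlapping node, and the intersection of the two clusters is the overlapping part.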

Complexity Analysis

Suppose an undirected network $G$ with $n$ nodes and $m$ edges. The complexity of the proposed algorithm is analyzed below:

a) Compute the similarity of the edges and the modularity. The time complexity of the edge similarity is the product of the number of nodes and the number of neighbor nodes of these nodes, that is, $n \cdot Card(\Gamma(i))$, where $Card(\cdot)$ represents the number of elements in a collection. The time complexity of the modularity is the square of the number of nodes in $G'$, that is, $m^2$. Since the time complexity of the edge similarity is less than that of the modularity, the time complexity of step a) is $O(m^2)$.

b) Remove the edge with the minimum edge similarity, and repeat step a) until $Q_{current} < Q_{max}$ is satisfied. The time complexity of removing edges is $O(k)$, where $k$ is the number of edges removed.

Consequently, the overall time complexity of the algorithm is $O(km^2)$.


Experimental Analysis

Experiment on Zachary’s Karate Club Dataset [10]

This dataset is used to verify the accuracy of the proposed algorithm. The initial number of communities is 1. When the network is divided into 6 parts, the modularity reaches its maximum value (0.5524). The nodes in the 6 communities were labeled and marked sequentially. Then, the partition results in the line graph are mapped back to the original network, and the final result is shown in Figure 1.

Figure 1. Labeled network structure of the Zachary’s Karate club.

Figure 1 shows that some nodes are connected only by edges of the same color, while others are connected by edges of different colors. A node whose edges have more than one color belongs to different communities, and such nodes are overlapping nodes. The nodes linked by the edges of the network $G$ are added to the clusters corresponding to those edges, and the overlapping portions of these clusters are overlapping communities. The detailed results are shown in Table 1.

Table 1. Cluster of nodes for the Zachary's karate club dataset in G.

Cluster   Nodes
C1        1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 18, 20, 22, 32
C2        2, 3, 4, 8, 9, 10, 13, 14, 18, 20, 22, 28, 29, 31
C3        5, 6, 7, 11, 17
C4        3, 9, 15, 16, 19, 21, 23, 24, 26, 27, 28, 30, 31, 32, 33
C5        25, 26, 28, 29, 32
C6        9, 10, 14, 15, 16, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 32, 34

The overlapping areas of these clusters are overlapping communities, and the nodes that appear repeatedly are overlapping nodes. Extract the network information from above and get the result as shown in Table 2.

Table 2. Overlapping communities and nodes distributions in Zachary club network.

Overlapping area   Nodes
C1∩C2∩C4∩C6        9
C1∩C4∩C5∩C6        32
C2∩C4∩C5∩C6        28
C1∩C2∩C4           3
C2∩C5∩C6           29
C1∩C2∩C6           14, 20
C2∩C4∩C6           31
C1∩C2              2, 4, 8, 13, 18, 22
C1∩C3              5, 6, 7, 11
C2∩C6              10
C4∩C5              26
C4∩C6              15, 16, 19, 21, 23, 24, 27, 30, 33

To sum up, the Zachary karate club network has 12 overlapping areas, and except for nodes 1, 12, 17, 25 and 34, the nodes belong to several communities. We can conclude that almost everyone in this karate club shares some identity with other members; for example, they may have the same coach or compete together. Therefore, almost every person corresponds to an overlapping identity and exists in overlapping groups.

Experiment on Dolphin Social Network Dataset [11]


corresponding clusters, and the overlapping portions of those clusters are overlapping communities.

Figure 2. Labelled network structure of the Dolphin social network.


The detailed results are shown in Table 3.

Table 3. Cluster of nodes for the Dolphin social dataset in G.

Cluster   Nodes
C1        0, 3, 4, 7, 8, 10, 11, 15, 18, 21, 23, 24, 28, 29, 35, 36, 39, 40, 43, 45, 50, 51, 52, 55, 59
C2        0, 2, 10, 16, 42, 44, 50, 53, 61
C3        0, 3, 8, 12, 14, 16, 21, 24, 33, 34, 36, 37, 38, 40, 43, 44, 45, 46, 49, 50, 52, 53, 58, 61
C4        0, 7, 8, 10, 16, 18, 19, 20, 25, 26, 27, 28, 30, 36, 38, 42, 44, 47, 50
C5        1, 5, 6, 7, 9, 13, 17, 19, 22, 25, 26, 27, 28, 31, 36, 39, 41, 48, 54, 57
C6        5, 6, 9, 13, 32, 41, 56, 60

The nodes that appear repeatedly in Table 3 (numbered 0, 3, 5, 6, 7, 8, 9, 10, 13, 16, 18, 19, 20, 21, 24, 25, 26, 27, 28, 36, 38, 39, 40, 41, 42, 43, 44, 45, 50, 52, 53, 61) belong to more than one community, so these nodes are overlapping nodes. The Dolphin social network has 62 nodes in total, representing 62 dolphins. Dolphins are social animals, and when a key member leaves the group, the population splits into several smaller groups. It can therefore be inferred that where different dolphin groups are connected, there are overlapping parts.

Comparison Experiment


Figure 3 shows the clustering accuracy of the proposed algorithm compared with the GN [12] and LPA [13] algorithms on different datasets. The results indicate that the proposed algorithm obtains high modularity and the optimal classification results in fewer iterations. Therefore, it is concluded that the algorithm proposed in this paper outperforms the GN and LPA algorithms in classification accuracy. Moreover, overlapping communities and overlapping nodes can be discovered simultaneously.

Figure 3. Clustering accuracy of different algorithms on Zachary data set and Dolphin dataset.

Conclusions


social network verified the algorithm is effective. Compared with the GN and LPA algorithms, the proposed algorithm has higher efficiency and better accuracy. In addition, the algorithm is generally applicable to the analysis of complex network structures in practice. The next step is to apply the proposed algorithm in a distributed environment to deal with large-scale networks.

Acknowledgement

This work was partially supported by Liaoning Natural Science Foundation under Grant No. 20170540320 and Research project of Liaoning Department of Education under Grant No. L2015173.

References

[1] Kakkar S, Beniwal S. Discovering overlapping community structure in networks through co-clustering[C]//International Conference on Inventive Computation Technologies. IEEE, 2017.

[2] Gao Z K, Small M, Kurths J. Complex network analysis of time series[J]. Epl, 2016, 116(5):50001.

[3] Ahn Y Y, Bagrow J P, Lehmann S. Link communities reveal multiscale complexity in networks[J]. Nature, 2010, 466(7307):761.

[4] Psorakis I, et al. Overlapping community detection using Bayesian non-negative matrix factorization[J]. Physical Review E, 2011, 83:066114.

[5] Todeschini A, Caron F. Exchangeable Random Measures for Sparse and Modular Graphs with Overlapping Communities[J]. 2016.

[6] Palla G, Barabási A L, Vicsek T. Quantifying social group evolution[J]. Nature, 2007, 446(446):664-667.

[7] Gregory S. An Algorithm to Find Overlapping Community Structure in Networks[C]// European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer-Verlag, 2007:91-102.

[8] Evans T S, Lambiotte R. Line graphs, link partitions, and overlapping communities[J]. Physical Review E Statistical Nonlinear & Soft Matter Physics, 2009, 80(2):145-148.

[9] Hsu I W H, Volkova M S S. Link Prediction in Social Networks[J]. Springerbriefs in Computer Science, 2016:246-250.

[10] Zachary W. An information flow model for conflict and fission in small groups[J]. Anth Res, 1977, 33:452-473.

[11] Amaral L.A.N, Scala A, Barthelemy M, et al. Classes of small-world networks[J]. Proc. Natl. Acad. Sci. USA, 2000, 97(21):11149-11152.

[12] Girvan M, Newman M.E.J. Community structure in social and biological networks [J]. Proc Natl. Acad. Sci. USA, 2002, 99(12): 7821-7826.