Research on Multi label Propagation Algorithm for Community Detection Based on Weibo

(1)

2017 2nd International Conference on Software, Multimedia and Communication Engineering (SMCE 2017) ISBN: 978-1-60595-458-5

Research on Multi-label Propagation Algorithm

for Community Detection Based on Weibo

Yun PAN, Bo XIE

*

and Xiao-jun LI

College of Computer and Information Engineering, Zhejiang Gongshang University, China

*Corresponding author

Keywords: Weibo, Community detection, LPA, COPRA.

Abstract. Community structure in social network has an overlapping property from where people can

find a lot of valuable information by community detection. It will help users to make accurate and valuable judgments as well as decisions. However, the increasing amount of information has led to difficulties for traditional technology to meet the requirement of big data processing. In order to effectively process data in large-scale network, a multi-label propagation algorithm for community detection has been implemented by using BSP parallel model. This algorithm combines the idea of LPA and COPRA. It finds out communities with active Weibo users, and then uses these communities as labels to perform label propagation. In the end, it classifies the remaining users’ communities. The algorithm solves LPA’s problems of instability and low accuracy. Its effectiveness and efficiency were tested and verified by using data from Weibo network.

Introduction

Community detection in social networks is important to the understanding of topology, composition, evolution as well as propagation characteristics of the network itself. The use of social networks, such as mining implicit information from Weibo, studying its community relations, and analyzing its community structure, is the foundation of public opinion monitoring, marketing promotion, etc.

The research of community structure includes both overlapping and non-overlapping community detection, in which the latter one is more mature. It began when Girvan and Newan [1] proposed GN algorithm in 2002. Later, based on their work, researchers proposed a series of algorithms. Andersen [2] added one or more seed nodes (users) and kept adding neighbor nodes until a community was found. Clauset [3] detected communities by selecting nodes that can get maximum ratio of sides’ number within communities to those between communities. But all these algorithms have high complexities. Take GN algorithm for example, it has a complexity which is O(n3) on sparse networks and is not applicable for large-scale social networks.

The network nodes belonging to different communities are called overlapping community structures and discovering these communities will help to precisely understand the network topology. In 2005, Nature journal published Clique Percolation Method (CPM) [4] proposed by Palla. That method assumes the community’s structure to be formed by adjacent complete sub-map and can be used to discover overlapping community structures when a node belongs to a number of cliques. More clique-based methods have been proposed after that. But they are more like pattern matching algorithms rather than community structure discovery algorithms in general meaning.

In recent years, discovering overlapping communities by edge division has come to be an important method. Alan et al. [5] proposed a hierarchical clustering method based on it. Evan et al. [6] proposed a method which constructs weighted edge graphs to improve accuracy. But Fortunato [7] pointed out that edge division does not guarantee better results than node-based community detection methods.

(2)

algorithm. However, its high computational complexity makes it unusable to large-scale networks. To solve this problem, localized intermediation algorithm such as CONGO [10] is proposed by limiting the shortest path length. It simply limits the shortest path length without considering too much about the network topology, however, the performance improvement leads to reduction of algorithm’s accuracy.

With the development of social networking and communication industry, the number of social network users has gradually increased, and the network scale has increased dramatically. This brings significant large data characteristics. However, most traditional community detecting algorithms are serial ones and must meet certain prerequisites to reach maximum performance. These restrictions make traditional algorithms inapplicable in large data processing.

This paper analyzes Weibo users, as well as those forwarded and commented content. Builds their subject models and find out Weibo communities with obvious subject inclination. Then based on these communities, reverse mining those communities with non-active users.

Multi-label Propagation Algorithm for Community Detection Based on Weibo

This paper proposes a Parallelized Multi-label Propagation Algorithm for Community Detection Based on Weibo (PMLPA). It first construct LDA subject model for users’ Weibo. Then find out active users’ communities by applying fuzzy cluster (one user can belong to multiple categories). These communities are marked as samples and are propagate through the Weibo network. In this way, we can discover those communities with inactive users and merge the users of same communities.

Characteristic parameters include: 1. User’s core degree

Core degree is the ratio of following users and followed users which is expressed as:

A A A F C P

 (1)

Where FA is the number of A’s followers, PA is the number of users A follows. 2. User’s propagation factor

For user A, if the number of users he followed is PA, the number his followers are FA and his core degree is CA. Then, the propagate factor of A for his following users is:

2 A A A A F H C P

   (2)

3. Propagation factor between users

Propagation factor between users includes: 1) user A and B follows each other, 2) A follows B while B doesn’t follow A. Under situation where both A and B follows each other, the propagation factor can be described as:





B

2 + 2

A

AB AB A B AB

A B

F F

H M H H M

P P

 

    _ _

  (3)

When A follows B but B doesn’t follow A, the propagation factor can be described as:





B

2 2

A

AB AB A B AB

A B

F F

H M H H M

P P

 

    _  _

  (4)

Where MAB is the number of people that are followed both by A and B.

Algorithm Description

(3)

factor and defines the community label’s updating rule. Combined with the community overlap propagation algorithm (COPRA) [11], it first finds out the communities of active users by mining Weibo content, therefore, we get classified users u1,u2,u3,…,uk. Then uses these classified users as nodes of marked communities to perform label propagation. The detailed algorithm is described as follows:

1. Calculate community propagation factor ρu,i of Weibo user u and his following user i. That is, for each user u, calculate propagation factor Hu,i between u and i. After normalization, we get community propagation factor ρu,i.

2. Initialize community label factor. First, use FuzzyKmeans algorithm to get classified users: u1,u2,u3,…,uk. These users are used as nodes of marked communities which is: ui-{(c1,b1),(c2,b2),…,(ck,bk)}, where ci is Weibo user ui’s community, bi is dependent coefficient of user ui and his community. Then, assigns a unique community ID cj (j≠1,2,…,k) to users of unclassified communities and their dependent coefficient is set to 1, which is vj→{(cj,1)}.

3. Set iterate number t=1.

4. Update Weibo user u’s community labels and his dependent coefficient. Retrieve these labels and dependent coefficient from last iterate procedure and use them to calculate the sum of all relevant communities’ dependent coefficients.

 

*

1 ,

( )

, ,

t t u v

v Friends u

Sum c u b_ c v







 (5)

Then divides it by the maximum dependent coefficient of community and get evaluation threshold, which is:

 



*



1 , , , max , t t

t u v

Sum c u f c u

b_ c v



 (6)

By setting evaluation threshold α, for example 0.45 (this value can be adjusted according to experiment value), if ft(c,u)>=α, then it means Weibo user u belongs to community c. If for community c1,c2,…,ck, there are ft(ci,u)>=α, then the coefficient describing user u belonging to community ci is:













1 , , , t i i k i t

f c u b u c

c u





(7) 5. The iterate process halt when Weibo users’ communities stop changing. Otherwise, set t=t+1 and jump to step 4.

6. Finally, merge all the Weibo users with the same community label.

Algorithm Implementation and Experimental Result. The algorithm takes the ideas of

messaging in BSP model. It uses Apache Giraph which implements BSP model to deal with big data and detect social network communities.

The implementation procedure includes:

1. Uses Hadoop and MapReduce to build subject models for Weibo users and classify communities.

2. Calculates community propagation factor ρu,i for user u and his following user i. 3. Initializes users’ community labels.

(4)

We collected experimental data (2015.09 - 2015.12) from Sina Weibo which includes Weibo content, basic user profiles, users’ followers, etc. It has overall 7 million Weibo content and 130 thousand user profiles. The collected data meets the following criteria:

1. Weibo content number per user is greater than 100. The fewer Weibo content the lower possibility to dig out user interest by LDA model.

2. Number of one’s following users is less than 1000. More following users imply unclear interest.

3. Users with less than 5 followers are removed. These kinds of users usually account for more than half of the total user number, which has negative impact on the accuracy of the model.

After data cleaning, the final experimental data collection contains 4800 users and 187144 concerned edges. Weibo content posted by these users is used to build LDA subject model. The constructed social network using these data by Apache Gephi is shown in Figure 1.

Figure 1. Social network of Weibo users.

Using multi-label propagation algorithm introduced in this article to detect communities, we get thirty-two different communities. The classified communities are shown in Figure 2. The total user number of these communities is greater than original users’ which means most of them can be divided into multiple communities. This indicates that the algorithm is capable of finding out the overlapping structure in Weibo social network.

Figure 2. Weibo communities detected by multi-label propagation algorithm.

[image:4.595.225.370.428.544.2]

We adjusted different threshold values in our experiment and used Qov quality function to assess the experimental result. The comparison between PMLPA and the BMLPA [12] is shown in Table 1.

Table 1. Comparison between PMLPA and BMLPA.

Qov Running time

PMLPA BMLPA PMLPA BMLPA

Parameter

α

0.45 0.754 0.712

5.21s 17.34s 0.50 0.654 0.641

0.55 0.653 0.638 0.60 0.521 0.513 0.65 0.516 0.509 0.70 0.443 0.421 0.75 0.441 0.416

(5)

We set PMLPA’s Qov to 0.45 and compared it to other algorithms such as COPRA, CPM, etc. The result is shown in Table 2.

Table 2. Comparison between PMLPA and other overlapping community algorithms.

PMLPA BMLPA SLPA COPRA CPM

Qov 0.754 0.712 0.653 0.369 -

Running time 5.21s 17.34s 12.12s 43.21s ∞

The comparison indicates that PMLPA has a higher Qov and requires less running time compared to other algorithm.

Summary

In this article, we combine the ideas of LPA and COPRA, introduce Weibo propagation factor and presented PMLPA algorithm to detect overlapping communities. It solves the instability and lack of accuracy problems in LPA. We also introduce thoughts from Hadoop, MapReduce and BSP’s message mechanism into the algorithm. And they are useful in Weibo users’ subject modeling, as well as active community detection. Experiment based on Weibo data collection proves PMLPA is more efficiency when dealing with practical problems than the others.

The algorithm presented in this article is suitable for social network like Weibo, and the propagation factor is only related to the number of Weibo users’ followers and following users, no other factors are involved. In future research, factors such as average forward rate and average comment rate could be taken into account.

Acknowledgement

This research was supported by the Technical research and social development projects of Zhejiang, China (2014C33078).

References

[1] M. Girvan, E.J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences of the United States of America (2002) 7821-7826.

[2] Andersen, R., and Lang, K.J., Communities from Seed Sets. Proc. International World Wide Web Conference; Edinburgh(GB), Edinburgh(GB), (2006) pp. 10

[3] A. Clauset, Finding local community structure in networks, Physical Review (2005) 6131-6132.

[4] Gergely Palla, I. Derényi, I. Farkas, Tamás Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature (2005) 814-818.

[5] Yong-Yeol Ahn, J.P. Bagrow, S. Lehmann, Link communities reveal multiscale complexity in networks, Nature (2010) 761-764.

[6] T.S. Evans, R. Lambiotte, Line graphs, link partitions, and overlapping communities, Physical Review. E (2009) 16105-16112.

[7] S. Fortunato, Community detection in graphs, Physics (2010) 75-174.

[8] U.N. Raghavan, R. Albert, S. Kumara, Near linear time algorithm to detect community structures in large-scale networks, Physical Review (2007) 36101-36106.

(6)

[10] S. Gregory, A fast algorithm to find overlapping communities in networks. Proc. Joint European Conference on Machine Learning and Principles of Knowledge Discovery in Databases, Antwerp (Belgium), (2008) 408-423.

[11] S. Gregory, Finding overlapping communities in networks by label propagation, New Journal of Physics (2010) 103-118.