Graph Processing and Social Networks

(1)

Presented by Shu Jiayu, Yang Ji

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

(2)

Outline



Background



Graph database



Large graph processing



Social network analysis



Conclusion

2015/4/20 2

(3)

Background

Graphs are everywhere

(4)

Background

 Graph processing

 Online query processing

OLTP workloads for quick low-latency access to small portions of graph data

 Offline graph analysis

OLAP workloads allowing batch processing of large portions of a graph

 Graph database & graph mining system

 e.g. Neo4j (graph database), Pregel (graph mining system)

(5)

Graph Database

 What is a graph database

 Graph database model: nodes, edges, properties

 Storage is optimized for data represented as a graph

 Storage is optimized for the traversal of the graph

 Flexible data model

……

(6)

Graph Database

 Why a graph database

 Focuses on relationships between entities

 Supports a greater level of data complexity

 Ease of data modeling

…….

 Graph database vs. relational database

 Relational databases are well suited to findAll-like queries

 Graph databases are well suited to exploring relationships
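The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `EDGES` data and all function names are hypothetical, and a real graph database stores and indexes adjacency very differently. The sketch shows a friends-of-friends query answered once by scanning a flat edge table (relational style) and once by following adjacency lists (graph style).

```python
from collections import defaultdict

# Hypothetical toy data: (source, relationship, target) triples.
EDGES = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("bob", "FRIEND", "dave"),
    ("carol", "FRIEND", "erin"),
]


def friends_of_friends_by_scan(edges, person):
    """Relational style: two passes over the whole edge table (a self-join)."""
    first_hop = {t for s, _, t in edges if s == person}
    return {t for s, _, t in edges if s in first_hop} - {person}


def build_adjacency(edges):
    """Graph style: index each node's outgoing edges once, up front."""
    adjacency = defaultdict(list)
    for source, _, target in edges:
        adjacency[source].append(target)
    return adjacency


def friends_of_friends_by_traversal(adjacency, person):
    """Follow edges from the start node; only the local neighborhood is read."""
    result = set()
    for friend in adjacency.get(person, []):
        result.update(adjacency.get(friend, []))
    return result - {person}


ADJACENCY = build_adjacency(EDGES)
```

Both functions return the same answer; the difference is that the traversal cost depends on the size of the neighborhood rather than on the size of the whole edge table.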

(7)

Graph Database

e.g. representing a business problem and its associated entities

(8)

Graph Database: an example

 Neo4j

 Property Graph Model

 Supports ACID (atomicity, consistency, isolation, durability)

(9)

Large-scale Graph

 Large graph processing challenges

 Large graphs exceed the memory, and even the disk capacity, of a single machine

 The computational power of a single machine is limited

……

 Solutions

 Distributed parallel processing

(10)

Large Graph Processing Systems

 MapReduce-based Pegasus

 Computation model is MapReduce

 A large graph mining library on top of Hadoop/MapReduce

 BSP-based Pregel

 Adopts the BSP (Bulk Synchronous Parallel) programming model

 A large-scale graph processing system built on top of BSP

(11)

Large Graph Processing System: Pegasus

 MapReduce programming model

 Map function

input: a key/value pair

output: a set of intermediate key/value pairs

 Reduce function

input: a set of values for an intermediate key

output: a set of key/value pairs

(12)

Large Graph Processing System: Pegasus

 e.g. count the number of occurrences of each word
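The word-count example can be sketched as a single-process simulation of the map, shuffle and reduce phases. The helper `run_mapreduce` and the function names are hypothetical and stand in for what a framework such as Hadoop does across many machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: one input pair (line number, text) -> intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: all values for one intermediate key -> (word, total) pair."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = {}
    for k in sorted(groups):
        for out_key, out_value in reduce_fn(k, groups[k]):
            output[out_key] = out_value
    return output


def word_count(lines):
    """Count occurrences of each word in a list of text lines."""
    return run_mapreduce(enumerate(lines), map_fn, reduce_fn)
```

For instance, `word_count(["a b a", "b c"])` groups the intermediate pairs by word before summing them.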

(13)

Large Graph Processing System: Pegasus

 GIM-V (Generalized Iterated Matrix-Vector multiplication)

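\[
M \times v = v' \quad \text{where} \quad v'_i = \sum_{j=1}^{n} m_{i,j}\, v_j
\]

\[
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
=
v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
\]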

 combine2: multiply \( m_{i,j} \) and \( v_j \)

 combineAll: sum the \( n \) multiplication results for node \( i \)

 assign: overwrite the previous value of \( v_i \) with the new result to obtain \( v'_i \)
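The three operations can be sketched as pluggable functions in plain Python. This is a minimal single-machine illustration, not the Pegasus implementation: the sparse-dictionary matrix layout and the names `gim_v` and `matvec` are assumptions made for the example.

```python
def gim_v(matrix, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step.

    matrix: sparse dict {(i, j): m_ij}; vector: list of length n.
    """
    n = len(vector)
    partial = [[] for _ in range(n)]
    for (i, j), m_ij in matrix.items():
        # combine2 pairs each matrix entry with the matching vector element.
        partial[i].append(combine2(m_ij, vector[j]))
    # combine_all merges the partial results of row i; assign yields v'_i.
    return [assign(vector[i], combine_all(i, partial[i])) for i in range(n)]


def matvec(matrix, vector):
    """Ordinary matrix-vector multiplication as a special case of GIM-V."""
    return gim_v(
        matrix,
        vector,
        combine2=lambda m, v: m * v,
        combine_all=lambda i, xs: sum(xs),
        assign=lambda old, new: new,
    )
```

Swapping in different `combine2`, `combine_all` and `assign` functions changes the algorithm without changing the iteration skeleton, which is the point of the generalization.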

(14)

Large Graph Processing System: Pegasus

 Application: PageRank (calculate relative importance of web pages)

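\[
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
=
v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
\]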

 𝑀 : a transition matrix, 𝑣 : rank vector, 𝑣′: a new rank vector

 input: an edge file and a vector file

 Stage 1: performs the combine2 operation by joining column j of the matrix with element j of the vector, and outputs key/value pairs keyed by row index i

 Stage 2: combines all partial results from Stage 1 for each key (combineAll) and assigns the new vector to the old one
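The two stages can be sketched as an in-memory simulation. Several details are assumptions that the slides do not specify: a damping factor `c = 0.85`, a random-jump term `(1 - c) / n` added in the combineAll step, a fixed number of iterations, and every node having at least one outgoing edge. All function names are hypothetical.

```python
from collections import defaultdict


def build_transition(edges):
    """Column-normalized matrix entries (i, j, m_ij), where node j links to i."""
    out_degree = defaultdict(int)
    for src, _dst in edges:
        out_degree[src] += 1
    return [(dst, src, 1.0 / out_degree[src]) for src, dst in edges]


def stage1(matrix_entries, vector, c):
    """combine2: join column j of M with element j of v; key the result by row i."""
    for i, j, m_ij in matrix_entries:
        yield i, c * m_ij * vector[j]


def stage2(pairs, vector, c):
    """combineAll + assign: sum the partial results per row, add the jump term."""
    n = len(vector)
    sums = defaultdict(float)
    for i, partial in pairs:
        sums[i] += partial
    return {node: (1.0 - c) / n + sums[node] for node in vector}


def pagerank(edges, nodes, c=0.85, iterations=50):
    """Iterate the two stages, starting from a uniform rank vector."""
    vector = {node: 1.0 / len(nodes) for node in nodes}
    matrix_entries = build_transition(edges)
    for _ in range(iterations):
        vector = stage2(stage1(matrix_entries, vector, c), vector, c)
    return vector
```

In a real MapReduce job the join in Stage 1 and the grouping in Stage 2 would each be a separate shuffle over the edge file and the vector file.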

(15)

Large Graph Processing System: Pregel

 BSP (Bulk Synchronous Parallel) model

(16)

Large Graph Processing System: Pregel

 Google’s implementation of BSP

 Node -> Vertex

 Message passing

 Combiners

 Aggregators

Vertex ID Vertex Value

(17)

Large Graph Processing System: Pregel

 Application: PageRank

 Initializes the value of each vertex in superstep 0

 Each vertex sends along each of its outgoing edges its tentative PageRank divided by its number of outgoing edges

 In each superstep, each vertex sums the values arriving in messages and calculates its new tentative PageRank from that sum

 Terminates when convergence is achieved
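The vertex-centric steps above can be sketched as a sequential simulation of supersteps. This is not Pregel's API: the function name, the damping factor of 0.85, the convergence tolerance, and the requirement that every vertex has at least one outgoing edge (and that every target is itself a key of `out_edges`) are assumptions made for the example.

```python
def pregel_pagerank(out_edges, damping=0.85, tolerance=1e-10, max_supersteps=200):
    """Simulate vertex-centric PageRank.

    out_edges: dict mapping each vertex to a non-empty list of target vertices.
    Returns (values, number_of_supersteps_run).
    """
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}  # superstep 0: initialize values
    superstep = 0
    for superstep in range(1, max_supersteps + 1):
        # Message passing: each vertex sends value / out-degree along each edge.
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            share = value[v] / len(targets)
            for t in targets:
                inbox[t].append(share)
        # Compute: each vertex sums its incoming messages.
        new_value = {
            v: (1.0 - damping) / n + damping * sum(inbox[v]) for v in out_edges
        }
        delta = max(abs(new_value[v] - value[v]) for v in out_edges)
        value = new_value
        if delta < tolerance:  # convergence: no value changed noticeably
            break
    return value, superstep
```

The barrier between supersteps is implicit here: all messages are collected in `inbox` before any vertex value is updated.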

(18)

Introduction to Social Networks

 A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest

 Social network analysis (SNA) is the study of social networks to understand their structure and behavior

(19)

Data Mining for Social Network Analysis

 Community Detection

 Link Prediction

 Search in Social Networks

 Trust in Social Networks

 Characterization of Social Networks

 Other Research Topics in Social Networks

(20)

Community Detection

 Discovering communities of users in a social network

 Community: a “tightly-knit region” of the network

 Has strong internal node-node connections

 Has weaker external connections

 Community detection algorithms stress high internal connectivity and low external connectivity for a given community

(21)

Girvan-Newman Algorithm

 Calculate edge-betweenness for all edges

 Remove the edge with highest betweenness

 Recalculate betweenness

 Repeat until all edges are removed, or modularity function is optimized (depending on variation)
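The loop above can be sketched in plain Python. Two simplifications are assumptions of this example rather than part of the slides: edge betweenness is computed with a Brandes-style breadth-first accumulation, and the loop stops at a requested number of communities (`target_communities`, a hypothetical parameter) instead of optimizing modularity. The graph is assumed to be undirected, unweighted and free of self-loops.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an undirected graph given as {node: set(neighbors)}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, paths, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:  # BFS counting shortest paths from source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], paths[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = {v: 0.0 for v in order}
        for w in reversed(order):  # push path credit back toward the source
            for v in preds[w]:
                flow = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += flow
                credit[v] += flow
    # Every unordered vertex pair was counted once from each endpoint.
    return {edge: s / 2.0 for edge, s in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        result.append(comp)
    return result


def girvan_newman(edges, target_communities):
    """Remove highest-betweenness edges until enough components exist."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while len(components(adj)) < target_communities and any(adj.values()):
        scores = edge_betweenness(adj)  # recalculated after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

On two triangles joined by a single bridge, the bridge has the highest betweenness and is removed first, leaving the two triangles as communities.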

(22)

Girvan-Newman Algorithm

 Edge Betweenness

 Measures the contribution of an edge to all shortest paths

 Compute all shortest paths between every pair of vertices

 If there are N shortest paths between a pair of vertices, each path gets a weight of 1/N

 Edge betweenness example: edge E-A

 Pair D-B: +0.5

 Pair E-B: +0.5

 Pair E-A: +1

 Total = 2
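The 1/N weighting can be checked directly by enumerating shortest paths. The sketch below is a brute-force illustration for small graphs. Because the example figure is not reproduced here, the test graph used with it is a hypothetical four-vertex cycle, not the graph from the slide, and the function names are also hypothetical.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(adj, source, target):
    """Enumerate every shortest path from source to target in an unweighted graph."""
    dist, preds = {source: 0}, {source: []}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w], preds[w] = dist[v] + 1, []
                queue.append(w)
            if dist[w] == dist[v] + 1:
                preds[w].append(v)
    if target not in dist:
        return []

    def walk_back(node):
        # Rebuild paths by following predecessor links back to the source.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in walk_back(p)]

    return walk_back(target)


def edge_betweenness_by_paths(adj):
    """Give each of the N shortest paths of a pair weight 1/N; sum per edge."""
    score = {}
    for s, t in combinations(sorted(adj), 2):
        paths = all_shortest_paths(adj, s, t)
        for path in paths:
            weight = 1.0 / len(paths)
            for u, v in zip(path, path[1:]):
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + weight
    return score


# Hypothetical test graph: the cycle a-b-c-d-a.
SQUARE = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
```

In the cycle, each edge receives 1 from its own endpoints and 0.5 from each of the two opposite-corner pairs, whose two shortest paths split the weight.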

[Figure: example graph with vertices A, B, C, D, E]

(23)

Girvan-Newman Algorithm: Example

(24)

Girvan-Newman Algorithm: Example

Betweenness(7-8) = 7 × 7 = 49

Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 11 = 33

Betweenness(1-3) = 1 × 12 = 12

(25)

Girvan-Newman Algorithm: Example

(26)

Girvan-Newman Algorithm: Example

Betweenness of every edge = 1

(27)

Link Prediction

 Predict likely interactions, not explicitly observed, based on observed links

 Primarily used to predict likely new friendships and to study friendship structures and co-authorship networks

 Given a snapshot of a social network, it is possible to infer new interactions between members who have never interacted before

(28)

Link Prediction Methods

 Given the input graph G, a connection weight score(x,y) is assigned to a pair of nodes <x,y>

 A ranked list is produced in decreasing order of score(x,y)

 It can be viewed as computing a measure of proximity or “similarity” between nodes x and y

(29)

Link Prediction Methods

 Node Neighborhood Based Methods

 Common neighbors

 Jaccard’s coefficient

 Adamic-Adar

 All Paths Based Methodologies

 PageRank

 SimRank

 Higher Level Approaches

 Clustering

(30)

Node Neighborhood Based Methods

 Common neighbors

 \( \mathrm{score}(u, v) = |N(u) \cap N(v)| \)

 Jaccard’s coefficient

 \( \mathrm{score}(u, v) = \dfrac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} \)

 Adamic-Adar

 \( \mathrm{score}(u, v) = \sum_{z \in N(u) \cap N(v)} \dfrac{1}{\log |N(z)|} \)
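The three neighborhood scores translate almost directly into set operations. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets; the function names and the `rank_candidates` helper are hypothetical and only illustrate how a ranked list of candidate links could be produced.

```python
import math


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Shared neighbors weighted so that low-degree neighbors count more.

    A common neighbor of two distinct nodes has degree >= 2, so log(...) > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def rank_candidates(adj, score):
    """Score every non-adjacent pair and sort in decreasing order of score."""
    nodes = sorted(adj)
    pairs = [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adj[u]
    ]
    return sorted(pairs, key=lambda pair: score(adj, *pair), reverse=True)
```

Any of the three functions can be passed as `score`, which makes it easy to compare the rankings they produce on the same snapshot.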

(31)

All Paths Based Method: PageRank

 PageRank is an algorithm for ranking objects.

 PageRank models a user performing a random walk: the user opens a page and then repeatedly clicks a link on the current page.

(32)

All Paths Based Method: SimRank

 SimRank is a link analysis algorithm that works on a graph G to measure the similarity between two vertices u and v in the graph.

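 The similarity of nodes u and v is denoted by s(u,v) ∈ [0,1]; if u = v, then s(u,v) = 1

 The definition is recursive: the similarity of u and v is computed from the similarities of their neighbors

 \( s(u, v) = \dfrac{C}{|N(u)|\,|N(v)|} \sum_{a \in N(u)} \sum_{b \in N(v)} s(a, b) \), where C ∈ (0,1) is a constant decay factor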
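The recursive definition is usually evaluated by fixed-point iteration, which can be sketched as follows. The sketch follows the slide's formula over the neighbor sets N(u). The decay value 0.8, the iteration count, the function name, and the convention that a pair scores 0 when either node has no neighbors are assumptions of this example.

```python
def simrank(adj, decay=0.8, iterations=10):
    """Iteratively compute s(u, v) for all pairs; adj maps node -> set(neighbors)."""
    nodes = list(adj)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                elif not adj[u] or not adj[v]:
                    new_sim[(u, v)] = 0.0
                else:
                    # Sum the previous similarity over all neighbor pairs.
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new_sim[(u, v)] = decay * total / (len(adj[u]) * len(adj[v]))
        sim = new_sim
    return sim
```

This all-pairs version needs memory quadratic in the number of nodes, so it is only practical for small graphs.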

(33)

Conclusion

Graph Processing

 Online query processing: graph database (Neo4j)

 Offline graph analysis: large graph mining systems (Pegasus, Pregel)

Social Network Analysis

 Community Detection

(34)

References

 Angles R, Gutierrez C. Survey of graph database models[J]. ACM Computing Surveys (CSUR), 2008, 40(1): 1.

 Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.

 Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A peta-scale graph mining system implementation and observations[C]//Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on. IEEE, 2009: 229-238.

 Kang U, Tsourakakis C E, Faloutsos C. Pegasus: mining peta-scale graphs[J]. Knowledge and information systems, 2011, 27(2): 303-325.

 Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.

 Shao B, Wang H, Xiao Y. Managing and mining large graphs: systems and implementations[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012: 589-592.

(35)

References

 Newman, Mark EJ. "Modularity and community structure in networks." Proceedings of the National Academy of Sciences 103.23 (2006): 8577-8582.

 Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.

 Girvan, Michelle, and Mark EJ Newman. "Community structure in social and biological networks." Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826.

 Liben‐Nowell, David, and Jon Kleinberg. "The link‐prediction problem for social networks." Journal of the American society for information science and technology 58.7 (2007): 1019-1031.

 Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." Proceedings of the eighth ACM SIGKDD international
