Graph Processing and Social Networks
Presented by Shu Jiayu, Yang Ji
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Outline
Background
Graph database
Large graph processing
Social networks analysis
Conclusion
2015/4/20 2
Background
Graphs are everywhere
Background
Graph processing
Online query processing
OLTP workloads for quick low-latency access to small portions of graph data
Offline graph analysis
OLAP workloads allowing batch processing of large portions of a graph
Graph databases and graph mining systems
e.g. Neo4j, Pregel
Graph Database
What is a graph database
Graph database model: nodes, edges, properties
Storage is optimized for data represented as a graph
Storage is optimized for the traversal of the graph
Flexible data model
……
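The node/edge/property model and traversal-oriented storage described above can be illustrated with a minimal in-memory sketch. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `reachable`) and the sample data are hypothetical and are not the API of any real graph database.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (hypothetical, not a real database API)."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of outgoing Edge objects

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def add_edge(self, source, target, label, **properties):
        self.outgoing[source].append(Edge(source, target, label, properties))

    def reachable(self, start, label, max_hops):
        """Breadth-first traversal along edges that carry the given label."""
        seen = {start}
        frontier = deque([(start, 0)])
        result = []
        while frontier:
            node_id, hops = frontier.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for edge in self.outgoing[node_id]:
                if edge.label == label and edge.target not in seen:
                    seen.add(edge.target)
                    result.append(edge.target)
                    frontier.append((edge.target, hops + 1))
        return result


graph = PropertyGraph()
graph.add_node("alice", name="Alice")
graph.add_node("bob", name="Bob")
graph.add_node("carol", name="Carol")
graph.add_edge("alice", "bob", "KNOWS", since=2012)
graph.add_edge("bob", "carol", "KNOWS", since=2014)
friends_within_two_hops = graph.reachable("alice", "KNOWS", max_hops=2)
```

Keeping the outgoing edges next to each node is what makes the traversal cheap here: following a relationship is a list lookup rather than a join.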
Graph Database
Why a graph database
Focus on relationships between entities
Handles more complex, highly connected data
Ease of data modeling
……
Graph database vs. relational database
Relational databases are well fitted to findAll-like queries
Graph databases are suited for exploring relationships
Graph Database
e.g. Represent a business problem and associated entities
Graph Database: an example
Neo4j
Property Graph Model
Supports ACID (atomicity, consistency, isolation, durability)
Large-scale Graph
Large graph processing challenges
Large graphs exceed the memory, and even the disk capacity, of a single machine
The computational power of a single machine is limited
……
Solution
Distributed parallel processing
Large Graph Processing Systems
MapReduce-based Pegasus
Computation model is MapReduce
A large graph mining library on top of Hadoop/MapReduce
BSP-based Pregel
Adopts the BSP (Bulk Synchronous Parallel) programming model
A large graph processing library on top of BSP
Large Graph Processing System: Pegasus
MapReduce programming model
Map function
input: a key/value pair
output: a set of intermediate key/value pairs
Reduce function
input: a set of values for an intermediate key
output: a set of key/value pairs
Large Graph Processing System: Pegasus
e.g. count the number of occurrences of each word
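The word-count example can be sketched in a single process as follows. The helper `run_mapreduce` is a hypothetical stand-in for the framework's map, shuffle, and reduce phases, and the two sample documents are made up for illustration; this is not Hadoop code.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: one input key/value pair -> intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_counts(word, counts):
    """Reduce: all values for one intermediate key -> output key/value pairs."""
    return [(word, sum(counts))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


# Hypothetical input: (document id, document text) pairs.
documents = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
word_counts = dict(run_mapreduce(documents, map_words, reduce_counts))
```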
Large Graph Processing System: Pegasus
GIM-V (Generalized Iterated Matrix-Vector multiplication)
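$M \times v = v'$, where $v'_i = \sum_{j=1}^{n} m_{i,j} v_j$
$$
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
=
v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
$$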
combine2: multiply 𝑚𝑖,𝑗 and 𝑣𝑗
combineAll: sum n multiplication results for node i
assign: overwrite previous value of 𝑣𝑖 with new result to make 𝑣𝑖′
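The three operations above can be read as pluggable functions. The sketch below is a minimal single-machine illustration, assuming a sparse matrix stored as (i, j, m_ij) triples and a vector stored as a dictionary; the function names `gimv`, `multiply`, `sum_all`, and `overwrite` are chosen here for illustration and are not the Pegasus API.

```python
def gimv(matrix_entries, vector, combine2, combine_all, assign):
    """One GIM-V iteration over a sparse matrix given as (i, j, m_ij) triples."""
    partial = {i: [] for i in vector}
    for i, j, m_ij in matrix_entries:
        partial[i].append(combine2(m_ij, vector[j]))
    return {i: assign(vector[i], combine_all(i, partial[i])) for i in vector}


# Ordinary matrix-vector multiplication expressed with the three operations.
def multiply(m_ij, v_j):          # combine2
    return m_ij * v_j


def sum_all(i, values):           # combineAll
    return sum(values)


def overwrite(old_value, new_value):  # assign
    return new_value


# M = [[1, 2], [0, 3]] and v = (10, 20), with indices starting at 0.
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
v = {0: 10.0, 1: 20.0}
v_new = gimv(entries, v, multiply, sum_all, overwrite)
```

With these choices `v_new` is the ordinary product of M and v; the point of the generalization is that other graph computations can be obtained by swapping the three functions.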
Large Graph Processing System: Pegasus
Application: PageRank (calculate relative importance of web pages)
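$$
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
=
v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
$$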
𝑀 : a transition matrix, 𝑣 : rank vector, 𝑣′: a new rank vector
input: an edge file and a vector file
Stage 1: performs combine2 operation by combining columns of matrix with rows of vector, outputs key/value pairs
Stage 2: combines all partial results from Stage 1 and assigns new vector to the old
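The two stages can be imitated in one process as below. This is a sketch under stated assumptions: the edge file is taken to hold (i, j, m_ij) records of a column-stochastic transition matrix, the vector file holds (j, v_j) records, no damping factor is applied because the description above does not mention one, and the three-page graph is made up for illustration.

```python
from collections import defaultdict


def stage1(edge_records, vector_records):
    """Join matrix column j with vector element v_j; emit (i, m_ij * v_j)."""
    by_column = defaultdict(lambda: {"edges": [], "value": 0.0})
    for i, j, m_ij in edge_records:
        by_column[j]["edges"].append((i, m_ij))
    for j, v_j in vector_records:
        by_column[j]["value"] = v_j
    partials = []
    for group in by_column.values():
        for i, m_ij in group["edges"]:
            partials.append((i, m_ij * group["value"]))  # combine2
    return partials


def stage2(partials, node_ids):
    """Sum partial products per row (combineAll) and emit the new vector (assign)."""
    sums = defaultdict(float)
    for i, product in partials:
        sums[i] += product
    return [(i, sums[i]) for i in node_ids]


# Hypothetical graph with links 0->1, 0->2, 1->2, 2->0.
# m_ij = 1 / outdegree(j) when page j links to page i.
node_ids = [0, 1, 2]
edge_file = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
vector_file = [(i, 1.0 / 3) for i in node_ids]
for _ in range(100):
    vector_file = stage2(stage1(edge_file, vector_file), node_ids)
ranks = dict(vector_file)
```

On this small graph the iteration settles at ranks of about 0.4, 0.2, and 0.4 for pages 0, 1, and 2.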
Large Graph Processing System: Pregel
BSP (Bulk Synchronous Parallel) model
Large Graph Processing System: Pregel
Google’s implementation of BSP
Node -> Vertex
Message passing
Combiners
Aggregators
[Figure labels: Vertex ID, Vertex Value]
Large Graph Processing System: Pregel
Application: PageRank
Initializes the value of each vertex in superstep 0
Each vertex sends its tentative PageRank, divided by its number of outgoing edges, along each outgoing edge
In each superstep, each vertex sums the values arriving in messages and computes its new tentative PageRank from that sum
Terminates when convergence is achieved
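The superstep loop described above can be imitated in a single process. This sketch is not the Pregel API: the function name, the tolerance, and the sample graph are hypothetical, the damping factor 0.85 follows the PageRank example in the Pregel paper, and every vertex is assumed to have at least one outgoing edge.

```python
def pregel_pagerank(out_edges, damping=0.85, tolerance=1e-10, max_supersteps=200):
    """Single-process imitation of synchronous supersteps (not the Pregel API)."""
    n = len(out_edges)
    values = {v: 1.0 / n for v in out_edges}  # superstep 0: initial value
    inbox = {v: [] for v in out_edges}
    for superstep in range(max_supersteps):
        # Compute phase: each vertex reads the messages sent in the previous superstep.
        new_values = {}
        for v in out_edges:
            if superstep == 0:
                new_values[v] = values[v]
            else:
                new_values[v] = (1 - damping) / n + damping * sum(inbox[v])
        # Communication phase: send rank / out-degree along every outgoing edge.
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for target in targets:
                outbox[target].append(new_values[v] / len(targets))
        change = max(abs(new_values[v] - values[v]) for v in out_edges)
        values, inbox = new_values, outbox  # barrier between supersteps
        if superstep > 0 and change < tolerance:
            break  # convergence reached
    return values


# Hypothetical graph: vertex -> list of vertices it links to.
web = {0: [1, 2], 1: [2], 2: [0]}
pagerank = pregel_pagerank(web)
```

Messages produced in one superstep are only visible in the next one, which is the synchronization barrier that the BSP model provides.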
Introduction to Social Networks
A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest
Social network analysis (SNA) is the study of social networks to understand their structure and behavior
Data Mining for Social Network Analysis
Community Detection
Link Prediction
Search in Social Networks
Trust in Social Networks
Characterization of Social Networks
Other Research Topics in Social Networks
Community Detection
Discovering communities of users in a social network
Community: a “tightly-knit region” of the network
Has strong internal node-node connections
Weaker external connections
Community detection algorithms favor high internal connectivity and low external connectivity for a given community
Girvan-Newman Algorithm
Calculate edge-betweenness for all edges
Remove the edge with highest betweenness
Recalculate betweenness
Repeat until all edges are removed, or modularity function is optimized (depending on variation)
Girvan-Newman Algorithm
Edge Betweenness
Measures the contribution of an edge to the shortest paths between all pairs of vertices
Computed from all shortest paths between every pair of vertices
If there are N shortest paths between a pair of vertices, each path gets a weight equal to 1/N
Edge betweenness example: edge E-A
Shortest paths between D and B: +0.5
Shortest paths between E and B: +0.5
Shortest path between E and A: +1
Total = 2
[Figure: example graph with vertices A, B, C, D, E]
Girvan-Newman Algorithm: Example
Betweenness(7-8) = 7 × 7 = 49
Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 11 = 33
Betweenness(1-3) = 1 × 12 = 12
Girvan-Newman Algorithm: Example
Betweenness of every edge = 1
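The Girvan-Newman steps and the betweenness values in this example can be checked with a short sketch. Two assumptions are made: the 14-vertex edge list below is a reconstruction chosen to be consistent with the quoted values (the original figure is not available), and the loop stops at a requested number of communities rather than optimizing modularity. Edge betweenness is computed with a Brandes-style breadth-first search.

```python
from collections import deque


def edge_betweenness(adjacency):
    """Edge betweenness for an unweighted, undirected graph (dict of sets)."""
    scores = {frozenset((u, v)): 0.0 for u in adjacency for v in adjacency[u]}
    for source in adjacency:
        # BFS from source: count shortest paths (sigma) and record predecessors.
        dist, sigma, preds = {source: 0}, {source: 1.0}, {source: []}
        order, queue = [], deque([source])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adjacency[u]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[u] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate path fractions, farthest vertices first.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                scores[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered vertex pair was counted once from each endpoint.
    return {edge: score / 2.0 for edge, score in scores.items()}


def connected_components(adjacency):
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(adjacency[u] - component)
        seen |= component
        components.append(component)
    return components


def girvan_newman(adjacency, target_communities):
    """Remove the highest-betweenness edge until enough components appear."""
    graph = {u: set(neighbors) for u, neighbors in adjacency.items()}
    while len(connected_components(graph)) < target_communities:
        scores = edge_betweenness(graph)  # recalculated after every removal
        if not scores:
            break  # all edges removed
        u, v = max(scores, key=scores.get)
        graph[u].discard(v)
        graph[v].discard(u)
    return connected_components(graph)


# Assumed reconstruction of the example graph: four triangles joined via 7 and 8.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 7), (6, 7), (7, 8),
         (8, 9), (8, 12), (9, 10), (9, 11), (10, 11), (12, 13), (12, 14), (13, 14)]
example = {v: set() for v in range(1, 15)}
for a, b in edges:
    example[a].add(b)
    example[b].add(a)

initial_scores = edge_betweenness(example)
two_communities = girvan_newman(example, 2)
six_communities = girvan_newman(example, 6)
```

On the assumed graph, `initial_scores` gives 49 for edge 7-8, 33 for edge 3-7, and 12 for edge 1-3. Asking for six communities leaves the four triangles plus the isolated vertices 7 and 8, where every remaining edge has betweenness 1.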
Link Prediction
Predict likely interactions, not explicitly observed, based on observed links
Primarily used to predict likely new friendships and to study friendship structures and co-authorship networks
Given a snapshot of a social network, it is possible to infer new interactions between members who have never interacted before
Link Prediction Methods
Given the input graph G, a connection weight score(x,y) is assigned to a pair of nodes <x,y>
A ranked list is produced in decreasing order of score(x,y)
It can be viewed as computing a measure of proximity or “similarity” between nodes x and y
Link Prediction Methods
Node Neighborhood Based Methods
Common neighbors
Jaccard’s coefficient
Adamic-Adar
All Paths Based Methodologies
PageRank
SimRank
Higher Level Approaches
Clustering
Node Neighborhood Based Methods
Common neighbors
$\mathrm{score}(u, v) = |N(u) \cap N(v)|$
Jaccard’s coefficient
$\mathrm{score}(u, v) = \dfrac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$
Adamic-Adar
$\mathrm{score}(u, v) = \sum_{z \in N(u) \cap N(v)} \dfrac{1}{\log |N(z)|}$
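The three neighborhood-based scores, together with the ranked list of candidate pairs, can be sketched as follows. The graph is assumed to be undirected and stored as a dictionary of neighbor sets; the function names and the five-node sample network are hypothetical.

```python
import math
from itertools import combinations


def common_neighbors(graph, u, v):
    return len(graph[u] & graph[v])


def jaccard(graph, u, v):
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph, u, v):
    # A common neighbor z is adjacent to both u and v, so |N(z)| >= 2 and log > 0.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])


def rank_candidates(graph, score):
    """Score every pair that is not yet linked; sort by decreasing score."""
    pairs = [(u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]]
    return sorted(pairs, key=lambda pair: score(graph, *pair), reverse=True)


# Hypothetical friendship network.
network = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranked_by_common_neighbors = rank_candidates(network, common_neighbors)
ranked_by_adamic_adar = rank_candidates(network, adamic_adar)
```

In this sample network the pair (a, d) shares two neighbors, b and c, so it heads the ranking under all three scores.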
All Paths Based Method: PageRank
PageRank is an algorithm for ranking objects.
PageRank assumes that a user performs a random walk: the user opens a page and then repeatedly clicks on a link on the current page.
All Paths Based Method: SimRank
SimRank is a link analysis algorithm that works on a graph G to measure the similarity between two vertices u and v in the graph.
The similarity of nodes u and v is denoted by s(u,v) ∈ [0,1]. If u = v, then s(u,v) = 1
The definition is recursive: the similarity of u and v is computed from the similarities of their neighbors.
$s(u, v) = \dfrac{C}{|N(u)|\,|N(v)|} \sum_{a \in N(u)} \sum_{b \in N(v)} s(a, b)$
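The recursive definition can be evaluated by fixed-point iteration, starting from s(u,u) = 1 and s(u,v) = 0 otherwise. The sketch below makes several assumptions: the constant C is named `decay` and set to 0.8, the number of iterations is fixed at 10, a pair in which either node has no neighbors gets similarity 0, and the star-shaped sample graph is made up.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Iteratively evaluate s(u, v) for all pairs; `decay` plays the role of C."""
    nodes = list(neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        updated = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    updated[(u, v)] = 1.0
                elif not neighbors[u] or not neighbors[v]:
                    updated[(u, v)] = 0.0  # assumption: no neighbors, no similarity
                else:
                    total = sum(sim[(a, b)] for a in neighbors[u] for b in neighbors[v])
                    updated[(u, v)] = decay * total / (len(neighbors[u]) * len(neighbors[v]))
        sim = updated
    return sim


# Hypothetical star graph: x and y are both connected only to c.
star = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}}
similarity = simrank(star)
```

Here x and y have the same single neighbor, so s(x, y) = C · s(c, c) = 0.8, while c and x stay at 0 because none of their neighbor pairs are ever similar.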
Conclusion
Graph Processing
Online query processing
Graph database
Neo4j
Offline graph analysis
Large graph mining systems
Pegasus
Pregel
Social Network Analysis
Community Detection
References
Angles R, Gutierrez C. Survey of graph database models[J]. ACM Computing Surveys (CSUR), 2008, 40(1): 1.
Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A peta-scale graph mining system implementation and observations[C]//Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on. IEEE, 2009: 229-238.
Kang U, Tsourakakis C E, Faloutsos C. Pegasus: mining peta-scale graphs[J]. Knowledge and information systems, 2011, 27(2): 303-325.
Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.
Shao B, Wang H, Xiao Y. Managing and mining large graphs: systems and implementations[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012: 589-592.
Newman, Mark EJ. "Modularity and community structure in networks." Proceedings of the National Academy of Sciences 103.23 (2006): 8577-8582.
Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.
Girvan, Michelle, and Mark EJ Newman. "Community structure in social and biological networks." Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826.
Liben‐Nowell, David, and Jon Kleinberg. "The link‐prediction problem for social networks." Journal of the American society for information science and technology 58.7 (2007): 1019-1031.
Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." Proceedings of the eighth ACM SIGKDD international