(1)

Graph Processing and Social Networks

Presented by Shu Jiayu, Yang Ji

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

(2)

Outline

Background

Graph database

Large graph processing

Social networks analysis

Conclusion

2015/4/20 2

(3)

Background

Graphs are everywhere

(4)

Background

Graph processing

Online query processing

OLTP workloads for quick low-latency access to small portions of graph data

Offline graph analysis

OLAP workloads allowing batch processing of large portions of a graph

Graph databases and graph mining systems

e.g. Neo4j (graph database), Pregel (graph mining system)

(5)

Graph Database

What is a graph database

Graph database model: nodes, edges, and properties

Storage is optimized for data represented as a graph

Storage is optimized for the traversal of the graph

Flexible data model

……
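The node/edge/property model above can be sketched in a few lines of Python. This is only an illustrative in-memory structure, not how Neo4j or any real graph database stores data; the names `PropertyGraph`, `add_node`, `add_edge`, and `neighbors` are hypothetical. It keeps an adjacency index per node so that a traversal follows edges directly instead of scanning a table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: int
    dst: int
    rel_type: str
    props: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph (illustration only)."""

    def __init__(self):
        self.nodes = {}  # node id -> Node
        self.out = {}    # node id -> list of outgoing edges (adjacency index)

    def add_node(self, node_id, labels, **props):
        self.nodes[node_id] = Node(node_id, set(labels), props)
        self.out.setdefault(node_id, [])

    def add_edge(self, src, dst, rel_type, **props):
        self.out[src].append(Edge(src, dst, rel_type, props))

    def neighbors(self, node_id, rel_type):
        # Traversal step: follow outgoing edges of one relationship type.
        return [self.nodes[e.dst] for e in self.out[node_id] if e.rel_type == rel_type]


g = PropertyGraph()
g.add_node(1, ["Person"], name="Alice")
g.add_node(2, ["Person"], name="Bob")
g.add_node(3, ["Person"], name="Carol")
g.add_edge(1, 2, "KNOWS", since=2012)
g.add_edge(2, 3, "KNOWS", since=2014)

# Two-hop traversal: friends of Alice's friends.
friends_of_friends = sorted(
    {fof.props["name"] for f in g.neighbors(1, "KNOWS") for fof in g.neighbors(f.node_id, "KNOWS")}
)
```

Each hop touches only the edges of the current node, which is the kind of relationship exploration that graph storage is optimized for.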

(6)

Graph Database

Why a graph database

Focuses on relationships between entities

Handles more complex, highly connected data

Ease of data modeling

……

Graph database vs. relational database

Relational databases are well suited to findAll-like queries (retrieving every record that matches a condition)

Graph databases are suited for exploring relationships

(7)

Graph Database

e.g. Represent a business problem and associated entities

(8)

Graph Database: an example

Neo4j

Property Graph Model

Supports ACID (atomicity, consistency, isolation, durability)

(9)

Large-scale Graph

Large graph processing challenges

Large graphs exceed the memory, and even the disk capacity, of a single machine

The computational power of a single machine is limited

……

Solutions

Distributed parallel processing

(10)

Large Graph Processing Systems

MapReduce-based Pegasus

Computation model is MapReduce

A large graph mining library on top of Hadoop/MapReduce

BSP-based Pregel

Adopts the BSP (Bulk Synchronous Parallel) programming model

A large graph processing library on top of BSP

(11)

Large Graph Processing System: Pegasus

MapReduce programming model

Map function

input: a key/value pair

output: a set of intermediate key/value pairs

Reduce function

input: a set of values for an intermediate key

output: a set of key/value pairs

(12)

Large Graph Processing System: Pegasus

e.g. count the number of occurrences of each word
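The word-count example can be sketched as a single-process simulation of the map and reduce functions described on the previous slide. This is not Hadoop code: `run_mapreduce` is a hypothetical driver that stands in for the framework's shuffle phase, and the input records are made up for illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    # input: one key/value pair (line offset, line text)
    # output: intermediate key/value pairs (word, 1)
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    # input: all values collected for one intermediate key
    # output: key/value pairs (word, total count)
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)  # shuffle: group values by intermediate key
    result = {}
    for k, values in groups.items():
        for out_key, out_value in reduce_fn(k, values):
            result[out_key] = out_value
    return result


docs = [(0, "the quick brown fox"), (1, "the lazy dog and the fox")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

In a real cluster the map calls and the reduce calls run in parallel on different machines; only the grouping by key forces data to move between them.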

(13)

Large Graph Processing System: Pegasus

GIM-V (Generalized Iterated Matrix-Vector multiplication)

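\[
M \times v = v', \quad \text{where } v'_i = \sum_{j=1}^{n} m_{i,j} v_j
\]

\[
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
= v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
\]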

combine2: multiply $m_{i,j}$ and $v_j$

combineAll: sum the n multiplication results for node i

assign: overwrite the previous value of $v_i$ with the new result to obtain $v'_i$
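The three operations can be sketched as pluggable functions around one generic iteration. This is a minimal single-machine illustration, not the Pegasus implementation: the function name `gimv`, the sparse `(i, j, m_ij)` edge format, and the callback signatures are assumptions made for this example.

```python
def gimv(edges, v, combine2, combine_all, assign):
    """One GIM-V iteration (illustrative sketch).

    edges: list of (i, j, m_ij) non-zero matrix entries
    v:     dict mapping node index j to its current value v_j
    """
    partial = {i: [] for i in v}
    for i, j, m_ij in edges:
        partial[i].append(combine2(m_ij, v[j]))  # combine2 on each matrix entry
    # combineAll reduces the partial results of row i; assign produces v'_i
    return {i: assign(v[i], combine_all(i, partial[i])) for i in v}


# Ordinary matrix-vector multiplication as one instance of GIM-V.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = {0: 1.0, 1: 1.0}
v_new = gimv(
    M,
    v,
    combine2=lambda m_ij, v_j: m_ij * v_j,
    combine_all=lambda i, xs: sum(xs),
    assign=lambda old, new: new,
)
```

Swapping in different callbacks gives other graph algorithms without changing the iteration itself, which is the point of the generalization.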

(14)

Large Graph Processing System: Pegasus

Application: PageRank (calculate relative importance of web pages)

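\[
\begin{pmatrix} m_{1,1} & \cdots & m_{1,n} \\ \vdots & \ddots & \vdots \\ m_{n,1} & \cdots & m_{n,n} \end{pmatrix}
\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}
=
\begin{pmatrix} m_{1,1}v_1 + m_{1,2}v_2 + \cdots + m_{1,n}v_n \\ \vdots \\ m_{n,1}v_1 + m_{n,2}v_2 + \cdots + m_{n,n}v_n \end{pmatrix}
= v_1 \begin{pmatrix} m_{1,1} \\ \vdots \\ m_{n,1} \end{pmatrix} + \cdots + v_n \begin{pmatrix} m_{1,n} \\ \vdots \\ m_{n,n} \end{pmatrix}
\]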

𝑀 : a transition matrix, 𝑣 : rank vector, 𝑣′: a new rank vector

input: an edge file and a vector file

Stage 1: performs combine2 operation by combining columns of matrix with rows of vector, outputs key/value pairs

Stage 2: combines all partial results from Stage 1 and assigns new vector to the old
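The two stages can be sketched as plain Python functions over in-memory "files". This is a simulation under stated assumptions, not Pegasus code: the edge file is assumed to hold `(i, j, m_ij)` rows of the transition matrix, the vector file `(j, v_j)` rows, and the damping factor `c = 0.85` with the `(1 - c) / n` term is the usual PageRank formulation, which the slides do not spell out.

```python
from collections import defaultdict


def stage1(edge_file, vector_file, c):
    # Join matrix column j with vector element v_j, then apply combine2.
    by_col = defaultdict(lambda: {"m": [], "v": 0.0})
    for i, j, m_ij in edge_file:
        by_col[j]["m"].append((i, m_ij))
    for j, v_j in vector_file:
        by_col[j]["v"] = v_j
    partials = []
    for group in by_col.values():
        for i, m_ij in group["m"]:
            partials.append((i, c * m_ij * group["v"]))  # key = row index i
    return partials


def stage2(partials, nodes, c):
    # combineAll: sum partial results per row; assign: emit the new vector.
    sums = defaultdict(float)
    for i, x in partials:
        sums[i] += x
    n = len(nodes)
    return [(i, (1.0 - c) / n + sums[i]) for i in nodes]


def pagerank(links, c=0.85, iterations=50):
    nodes = sorted(links)
    # m_ij = 1 / outdegree(j) when page j links to page i
    edge_file = [(dst, src, 1.0 / len(outs)) for src, outs in links.items() for dst in outs]
    vector_file = [(i, 1.0 / len(nodes)) for i in nodes]
    for _ in range(iterations):
        vector_file = stage2(stage1(edge_file, vector_file, c), nodes, c)
    return dict(vector_file)


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The sketch assumes every page has at least one outgoing link; handling dangling pages would need an extra step that is left out here.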

(15)

Large Graph Processing System: Pregel

BSP (Bulk Synchronous Parallel) model

(16)

Large Graph Processing System: Pregel

Google’s implementation of BSP

Node -> Vertex

Message passing

Combiners

Aggregators

[Figure: a vertex with its vertex ID and vertex value]

(17)

Large Graph Processing System: Pregel

Application: PageRank

Initializes the value of each vertex in superstep 0

Each vertex sends along each of its outgoing edges its tentative PageRank divided by its number of outgoing edges

In each superstep, each vertex sums the values arriving in messages and calculates its new tentative PageRank from that sum

Terminates when convergence is achieved
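The superstep loop above can be simulated in one process as follows. This is a hedged sketch rather than Pregel's API: the function name, the `inbox`/`outbox` dictionaries, the damping factor 0.85, and the tolerance-based stopping test are assumptions chosen for illustration.

```python
def pregel_pagerank(out_edges, damping=0.85, tol=1e-10, max_supersteps=100):
    """Vertex-centric PageRank, simulated superstep by superstep."""
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}  # superstep 0: initialize values
    inbox = {v: [] for v in out_edges}
    for superstep in range(max_supersteps):
        if superstep >= 1:
            # Each vertex sums its incoming messages and updates its value.
            new = {v: (1.0 - damping) / n + damping * sum(inbox[v]) for v in out_edges}
            change = max(abs(new[v] - value[v]) for v in out_edges)
            value = new
            if change < tol:
                break  # converged: every vertex would vote to halt
        # Each vertex sends value / out-degree along every outgoing edge.
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                outbox[t].append(value[v] / len(targets))
        inbox = outbox  # synchronization barrier between supersteps
    return value


pregel_ranks = pregel_pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Messages written in one superstep are only read in the next, which is what the barrier in the BSP model guarantees.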

(18)

Introduction to Social Networks

A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest

Social network analysis (SNA) is the study of social networks to understand their structure and behavior

(19)

Data Mining for Social Network Analysis

Community Detection

Link Prediction

Search in Social Networks

Trust in Social Networks

Characterization of Social Networks

Other Research Topics in Social Networks

(20)

Community Detection

Discovering communities of users in a social network

Community: a "tightly-knit region" of the network

Has strong internal node-node connections

Weaker external connections

Community detection algorithms stress high internal connectivity and low external connectivity for a given community

(21)

Girvan-Newman Algorithm

Calculate edge-betweenness for all edges

Remove the edge with highest betweenness

Recalculate betweenness

Repeat until all edges are removed, or modularity function is optimized (depending on variation)
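The four steps can be sketched in pure Python. Two simplifications are assumed here and are not from the slides: edge betweenness is computed with a Brandes-style accumulation, and the loop stops once a requested number of communities exists (the hypothetical `target_communities` parameter) instead of removing all edges or optimizing modularity.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph (dict of sets)."""
    bet = {}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[u] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Push path shares back from the farthest vertices toward s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((u, w))
                bet[edge] = bet.get(edge, 0.0) + share
                delta[u] += share
    # Every unordered pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj, target_communities=2):
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    while len(components(adj)) < target_communities:
        bet = edge_betweenness(adj)  # recalculated after every removal
        if not bet:
            break  # no edges left
        u, v = max(bet, key=bet.get)  # edge with the highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the bridge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = girvan_newman(graph)
```

In this small graph the bridge 3-4 carries all 3 × 3 shortest paths between the two triangles, so it is removed first and the triangles become the two communities.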

(22)

Girvan-Newman Algorithm

Edge Betweenness

Measures how much an edge contributes to the shortest paths between all pairs of vertices

Compute all shortest paths between every pair of vertices

If there are N shortest paths between a pair of vertices, each path gets a weight equal to 1/N

Edge Betweenness Example – EA

D-B +0.5

E-B +0.5

E-A +1

Total =2
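The 1/N weighting can be checked by brute force on a small graph. The sketch below enumerates every shortest path between every pair of vertices and adds 1/N to each edge on each path. The 4-cycle used as input is a made-up graph for illustration, not the graph from the slide, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(adj, s, t):
    # BFS from s, recording every predecessor that lies on a shortest path.
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                preds[w] = [u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                preds[w].append(u)
    if t not in dist:
        return []
    # Walk back from t to s along the recorded predecessors.
    paths, stack = [], [[t]]
    while stack:
        path = stack.pop()
        if path[-1] == s:
            paths.append(path[::-1])
        else:
            stack.extend(path + [p] for p in preds[path[-1]])
    return paths


def edge_betweenness_by_paths(adj):
    score = {}
    for s, t in combinations(sorted(adj), 2):
        paths = all_shortest_paths(adj, s, t)
        for path in paths:
            for u, v in zip(path, path[1:]):
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + 1.0 / len(paths)  # weight 1/N
    return score


# Hypothetical 4-cycle A-B-C-D-A.
square = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
scores = edge_betweenness_by_paths(square)
```

For edge A-B the pair A-B contributes 1, and the opposite pairs A-C and B-D each contribute 0.5 because one of their two shortest paths uses the edge, giving a total of 2.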

[Figure: example graph with vertices A, B, C, D, and E]

(23)

Girvan-Newman Algorithm: Example

(24)

Girvan-Newman Algorithm: Example

Betweenness(7-8) = 7 × 7 = 49

Betweenness(3-7) = Betweenness(6-7) = Betweenness(8-9) = Betweenness(8-12) = 3 × 11 = 33

Betweenness(1-3) = 1 × 12 = 12

(25)

Girvan-Newman Algorithm: Example

(26)

Girvan-Newman Algorithm: Example

Betweenness of every edge = 1

(27)

Link Prediction

Predict likely interactions, not explicitly observed, based on observed links

Primarily used to predict the possibility of new friends, study friend structures and co-authorship networks.

Given a snapshot of a social network, it is possible to infer new interactions between members who have never interacted before

(28)

Link Prediction Methods

Given the input graph G, a connection weight score(x,y) is assigned to a pair of nodes <x,y>

A ranked list is produced in decreasing order of score(x,y)

It can be viewed as computing a measure of proximity or "similarity" between nodes x and y

(29)

Link Prediction Methods

Node Neighborhood Based Methods

Common neighbors

Jaccard’s coefficient

Adamic-Adar

All Paths Based Methodologies

PageRank

SimRank

Higher Level Approaches

Clustering

(30)

Node Neighborhood Based Methods

Common neighbors

\[ \mathrm{score}(u, v) = |N(u) \cap N(v)| \]

Jaccard's coefficient

\[ \mathrm{score}(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} \]

Adamic-Adar

\[ \mathrm{score}(u, v) = \sum_{z \in N(u) \cap N(v)} \frac{1}{\log |N(z)|} \]
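The three neighborhood scores, and the ranked list they produce, can be sketched directly from their definitions. The graph and the function names below are made up for illustration; N(u) is taken to be the set of neighbors of u in an undirected graph, and the natural logarithm is assumed for Adamic-Adar.

```python
import math
from itertools import combinations


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # A common neighbor z has at least two neighbors (u and v), so log > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def rank_candidates(adj, score):
    # Score every pair that is not yet linked; highest score first.
    pairs = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
    return sorted(pairs, key=lambda p: score(adj, *p), reverse=True)


# Hypothetical undirected friendship graph.
friends = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
ranking = rank_candidates(friends, adamic_adar)
```

Adamic-Adar gives more weight to common neighbors that have few connections themselves, so a shared low-degree friend counts for more than a shared hub.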

(31)

All Paths Based Method: PageRank

PageRank is one of the algorithms that aims to perform object ranking.

The assumption PageRank makes is that a user starts a random walk by opening a page and then clicking on a link on that page.

(32)

All Paths Based Method: SimRank

SimRank is a link analysis algorithm that works on a graph G to measure the similarity between two vertices u and v in the graph.

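For nodes u and v, the similarity is denoted by s(u,v) ∈ [0,1]. If u = v, then s(u,v) = 1

The definition is recursive: s(u,v) is computed from the similarities of the neighbors of u and the neighbors of v, where C is a constant between 0 and 1

\[ s(u, v) = \frac{C}{|N(u)|\,|N(v)|} \sum_{a \in N(u)} \sum_{b \in N(v)} s(a, b) \]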
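Because the definition refers to itself, SimRank is usually computed by fixed-point iteration. The sketch below is a naive version under stated assumptions: `C = 0.8`, ten iterations, and the convention that a pair involving a node with no neighbors scores 0 are choices made for this example, and N(u) is read as the neighbor set of u as on the slide.

```python
def simrank(adj, C=0.8, iterations=10):
    """Naive iterative SimRank over all node pairs (illustrative sketch)."""
    nodes = list(adj)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not adj[u] or not adj[v]:
                    new[(u, v)] = 0.0  # assumed convention for isolated nodes
                else:
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new[(u, v)] = C * total / (len(adj[u]) * len(adj[v]))
        sim = new
    return sim


# Hypothetical path graph u - x - v: u and v share their only neighbor.
path = {"u": ["x"], "x": ["u", "v"], "v": ["x"]}
similarity = simrank(path)
```

Here u and v have identical neighborhoods, so s(u,v) settles at C, while u and x never become similar because their neighborhoods have nothing in common.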

(33)

Conclusion

Graph Processing

Online query processing

Graph database: Neo4j

Offline graph analysis

Large graph mining systems: Pegasus, Pregel

Social Network Analysis

Community Detection

(34)

References

Angles R, Gutierrez C. Survey of graph database models[J]. ACM Computing Surveys (CSUR), 2008, 40(1): 1.

Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.

Kang U, Tsourakakis C E, Faloutsos C. Pegasus: A peta-scale graph mining system implementation and observations[C]//Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on. IEEE, 2009: 229-238.

Kang U, Tsourakakis C E, Faloutsos C. Pegasus: mining peta-scale graphs[J]. Knowledge and information systems, 2011, 27(2): 303-325.

Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.

Shao B, Wang H, Xiao Y. Managing and mining large graphs: systems and implementations[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012: 589-592.

(35)

References

Newman, Mark EJ. "Modularity and community structure in networks." Proceedings of the National Academy of Sciences 103.23 (2006): 8577-8582.

Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical comparison of algorithms for network community detection." Proceedings of the 19th international conference on World wide web. ACM, 2010.

Girvan, Michelle, and Mark EJ Newman. "Community structure in social and biological networks." Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826.

Liben‐Nowell, David, and Jon Kleinberg. "The link‐prediction problem for social networks." Journal of the American society for information science and technology 58.7 (2007): 1019-1031.

Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." Proceedings of the eighth ACM SIGKDD international

(36)

Thank You
