Graph Mining. Tiphaine Viard

Academic year: 2021

(1)

Graph Mining

Tiphaine Viard

(6)

Graph data

• Infrastructure: roads, railways, power grid, internet, ...
• Communication: phone, emails, flights, ...
• Information: Web, Wikipedia, knowledge bases, ...
• Social networks: Facebook, Twitter, LinkedIn, ...
• Marketing: customer-product, user-content, ...
• Text analysis: words, text similarity, ...
• Biology: brain, proteins, genetics, ...

Graphs may be:

• directed / undirected / bipartite
• weighted or not
• labelled or not

(8)

Graph mining

Objective: Extract useful information / learn from graphs

(11)

Basic notions

Graph = a tuple $G = (V, E)$, with $E \subseteq V \otimes V$, $(u, v) = (v, u)$, $u \neq v$

$n = |V|$, $m = |E|$

Neighbours: $N(u) = \{v : \exists (u, v) \in E\}$

Degree = size of neighbourhood: $d(u) = |N(u)| \cong \sum_{v \in V} w_{uv}$

Density = edge probability: $\delta(G) = \frac{2m}{n(n-1)}$
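The definitions above can be checked on a toy graph. The following is a minimal sketch in plain Python; the edge list and the helper names (`neighbours`, `degree`, `density`) are illustrative assumptions, not part of the course material. Nodes are inferred from the edge list, so isolated nodes are not represented.

```python
def neighbours(edges):
    """Build N(u) for every node of an undirected graph given as a list of pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            raise ValueError("self-loops are excluded (u != v)")
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)  # (u, v) = (v, u)
    return adj


def degree(adj, u):
    """d(u) = |N(u)|."""
    return len(adj[u])


def density(adj):
    """delta(G) = 2m / (n (n - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    return 2 * m / (n * (n - 1))


# Hypothetical example: a triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = neighbours(edges)
deg_c = degree(adj, "C")
dens = density(adj)
```

Here n = 4 and m = 4, so the density is 8 / 12, i.e. two thirds of all possible edges are present.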

(12)

Quiz

What is the degree of the node in red? A) 2

(13)

Key questions

Given some graph:

• What are the most important nodes?
• What are the most important nodes relative to some other nodes?
• How is the graph structured?
• How to infer labels?
• How to predict new links?

(14)

Key properties

Most real-world graphs are sparse

Dataset            #nodes   #edges   Density
Flights            2,939    30,500   ≈ 10^-3
Amazon products    335k     925k     ≈ 10^-5
Actors             382k     33M      ≈ 10^-4
Wikipedia (en)     12M      378M     ≈ 10^-6
Twitter            42M      1.5G     ≈ 10^-6
Friendster         68M      2.5G     ≈ 10^-7

(15)

Traversing graphs

Goal: visit every node reachable from a starting node, e.g. to check connectivity or compute shortest paths

A foundational algorithm for the rest of this session

• Inputs: a graph G, a list C of nodes to consider, a list S of seen nodes
• Add a node r (the root) to C
• While |C| > 0:
    • Remove a node u from C, add it to S
    • Output u
    • Add to C all nodes of N(u) that are in neither S nor C

[Figure: example graph with nodes A, B, C, D, E]

Two strategies: breadth (queue) vs depth (stack)
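A minimal sketch of the traversal in Python, assuming the graph is given as a dictionary of neighbour lists. The example graph is hypothetical, since the figure is not available here. One small variation on the steps listed on the slide: a single set `seen` records every node that has ever been added to C, which prevents a node from being queued twice.

```python
from collections import deque


def traverse(adj, root, strategy="breadth"):
    """Return the nodes reachable from root, in visiting order.

    strategy="breadth" uses C as a queue (first in, first out);
    any other value uses C as a stack (last in, first out).
    """
    to_visit = deque([root])  # C: nodes still to consider
    seen = {root}             # nodes already added to C at some point
    order = []
    while to_visit:
        u = to_visit.popleft() if strategy == "breadth" else to_visit.pop()
        order.append(u)       # "output u"
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                to_visit.append(v)
    return order


# Hypothetical graph, not the one from the slide.
adj = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
bfs_order = traverse(adj, "A", "breadth")
dfs_order = traverse(adj, "A", "depth")
```

The only difference between the two strategies is which end of C the next node is taken from. The graph is connected exactly when the returned order contains every node.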

(16)

Quiz

[Figure: graph with nodes A, B, C, D, E]

Given the above graph, in what order will the nodes be output by a breadth-first search starting from A? (When there are several candidates, always take the first one clockwise.)

(17)

Outline

1. Sparse matrices

2. PageRank

3. Clustering

(19)

Graphs as matrices

• Directed graphs: n nodes → adjacency matrix A of size n × n, with $A_{ij} > 0$ iff there is an edge from i to j
• Undirected graphs: n nodes → adjacency matrix A of size n × n, with $A_{ij} = A_{ji} > 0$ iff there is an edge between i and j
• Bipartite graphs: $n_1$, $n_2$ nodes → biadjacency matrix B of size $n_1 \times n_2$
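As a small illustration of these three representations, the sketch below builds dense matrices as nested Python lists. The function names and example edges are assumptions made for this example, and unit weights are used throughout.

```python
def adjacency_matrix(n, edges, directed=False):
    """Dense n x n adjacency matrix; A[i][j] > 0 iff there is an edge from i to j."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # undirected: A[i][j] = A[j][i]
    return a


def biadjacency_matrix(n1, n2, edges):
    """Dense n1 x n2 biadjacency matrix; rows and columns index the two node sets."""
    b = [[0] * n2 for _ in range(n1)]
    for i, j in edges:
        b[i][j] = 1
    return b


edges = [(0, 1), (1, 2)]
a_directed = adjacency_matrix(3, edges, directed=True)
a_undirected = adjacency_matrix(3, edges)
b = biadjacency_matrix(2, 3, [(0, 0), (0, 2), (1, 1)])
```

Dense storage needs n × n entries whatever the number of edges, which is wasteful for sparse graphs.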

(22)

The COOrdinate format

A sparse matrix (in dense format):

3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0

The same matrix in COO (COOrdinate) format, as (row, column, value) triplets:

(0, 0, 3), (0, 2, 5), (0, 6, 4), ...

or equivalently, as three parallel arrays:

row = 0, 0, 0, ...
col = 0, 2, 6, ...
data = 3, 5, 4, ...
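The COO triplets can be produced from the dense matrix with a few lines of Python. This is only a sketch: the helper `to_coo` and the names `rows`, `cols`, `data` are chosen for this example.

```python
# The 5 x 10 example matrix from the slide, in dense format.
dense = [
    [3, 0, 5, 0, 0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0, 2, 4, 0, 0],
    [1, 0, 2, 0, 0, 3, 5, 3, 0, 0],
    [0, 2, 0, 4, 1, 5, 0, 0, 1, 0],
    [5, 0, 3, 0, 4, 0, 0, 0, 0, 0],
]


def to_coo(matrix):
    """List the non-zero entries as (row, column, value) triplets, row by row."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


coo = to_coo(dense)
# The same information as three parallel arrays.
rows, cols, data = (list(t) for t in zip(*coo))
```

Only the 20 non-zero entries are stored, instead of all 50 entries of the dense matrix.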

(24)

The CSR (Compressed Sparse Row) format

A (not so) sparse matrix (in dense format):

2 0
0 3

The same matrix in coordinates:

(0, 0, 2), (1, 1, 3)

or equivalently, in CSR format:

indices = 0, 1
indptr = 0, 1, 2
data = 2, 3

(26)

The CSR (Compressed Sparse Row) format

A sparse matrix (in dense format):

3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0

Coordinates: (0, 0, 3), (0, 2, 5), (0, 6, 4), ...

The same matrix in CSR (Compressed Sparse Row) format:

indices = 0, 2, 6, 7, 4, 6, 7, 0, ...
indptr = 0, 4, 7, 12, 17, 20
data = 3, 5, 4, 4, 4, 2, 4, ...
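To make the three CSR arrays concrete, here is a minimal hand-written conversion in plain Python (the helper `to_csr` is an assumption for this sketch, not a library function). `indices` holds the column of each non-zero entry, `data` its value, and `indptr[i]:indptr[i + 1]` is the slice of both arrays that belongs to row i.

```python
dense = [
    [3, 0, 5, 0, 0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0, 2, 4, 0, 0],
    [1, 0, 2, 0, 0, 3, 5, 3, 0, 0],
    [0, 2, 0, 4, 1, 5, 0, 0, 1, 0],
    [5, 0, 3, 0, 4, 0, 0, 0, 0, 0],
]


def to_csr(matrix):
    """Convert a dense matrix (list of rows) to the CSR arrays."""
    indices, indptr, data = [], [0], []
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                indices.append(j)
                data.append(value)
        # After each row, record where the next row starts.
        indptr.append(len(indices))
    return indices, indptr, data


indices, indptr, data = to_csr(dense)
```

Compared with COO, the row array of length m is replaced by `indptr`, which has one entry per row plus one.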

(27)

Properties of the CSR format

Pros

• Efficient storage
• Fast row slicing
• Fast matrix-vector product

Cons

• Slow column slicing
• Slow modification (e.g., add an entry)
• Slow transpose
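The sketch below suggests where these properties come from, using hand-written helpers on the small 2 × 2 example (all function names are assumptions for this illustration). A row is a contiguous slice of the arrays, whereas a column requires scanning every row.

```python
def csr_row(indices, indptr, data, i):
    """Row i as (column, value) pairs: one contiguous slice, hence fast."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


def csr_matvec(indices, indptr, data, x):
    """y = A x, touching each stored entry exactly once."""
    y = []
    for i in range(len(indptr) - 1):
        y.append(sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1])))
    return y


def csr_column(indices, indptr, data, j):
    """Column j as (row, value) pairs: every row must be scanned, hence slow."""
    column = []
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == j:
                column.append((i, data[k]))
    return column


# The 2 x 2 matrix [[2, 0], [0, 3]] in CSR format.
indices, indptr, data = [0, 1], [0, 1, 2], [2, 3]
row_1 = csr_row(indices, indptr, data, 1)
product = csr_matvec(indices, indptr, data, [10, 100])
column_1 = csr_column(indices, indptr, data, 1)
```

Adding an entry is slow for a similar reason: it shifts the tail of `indices` and `data` and updates every later value of `indptr`.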

(29)

Outline

1. Sparse matrices

2. PageRank

3. Clustering

(30)

PageRank

How to identify the most “important” nodes in a graph?

(32)
(33)

Personalized PageRank

How to identify the most “important” nodes in a graph relative to some other nodes?

(34)
(35)

Random walk

First scheme

• $P_{ij} = A_{ij} / w_i$: probability of moving from i to j, where $w_i = \sum_j A_{ij}$ is the weight (degree) of node i
• A Markov chain with transition matrix $P = D^{-1} A$, with $D = \mathrm{diag}(w)$
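A minimal sketch of the transition matrix on a dense adjacency matrix, in plain Python. The example graph is hypothetical, and leaving the row of a node without outgoing edges at zero is a choice made for this sketch.

```python
def transition_matrix(a):
    """P = D^-1 A: divide each row of A by its sum w_i."""
    p = []
    for row in a:
        w = sum(row)
        if w > 0:
            p.append([x / w for x in row])
        else:
            # Node without outgoing edges: the row stays at zero.
            p.append([0.0] * len(row))
    return p


# Hypothetical undirected graph: triangle 0-1-2 plus node 3 attached to node 2.
a = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
p = transition_matrix(a)
```

Every row of P sums to 1 here, so each row is a probability distribution over the next node of the walk.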

(36)

A simple example

[Figure: example graph with nodes A, B, C, D]

(37)
(38)

Quiz

[Figure: graph with nodes A, B, C, D]

(39)

Limits of our approach

Dead ends (nodes without outgoing edges): scores vanish to 0

Spider traps (groups of nodes with no edge leading out of the group): attract all the importance into the trap

(41)

Random walk with restart

• Fix $\alpha \in (0, 1)$
• Walk with probability $\alpha$, restart with probability $1 - \alpha$
• Restart distribution $\mu$ on V (e.g., uniform)
• New transition matrix: $P^{(\alpha)} = \alpha P + (1 - \alpha) \mathbf{1} \mu$
• Forces restart from dead ends
• Eventually escapes spider traps
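The scores can be approximated by repeatedly applying the transition matrix with restart to a distribution over the nodes. The following is a rough sketch on a dense adjacency matrix, not an optimized implementation; the default `alpha=0.85`, the fixed number of iterations and the function name are assumptions for this example. From a dead end, the walk restarts with probability 1.

```python
def pagerank(a, alpha=0.85, mu=None, n_iter=100):
    """Approximate the stationary distribution of the random walk with restart."""
    n = len(a)
    mu = mu or [1.0 / n] * n           # restart distribution, uniform by default
    w = [sum(row) for row in a]        # out-weights w_i
    pi = list(mu)
    for _ in range(n_iter):
        new = [0.0] * n
        for i in range(n):
            if w[i] == 0:
                restart = pi[i]        # dead end: forced restart
            else:
                restart = (1 - alpha) * pi[i]
                for j in range(n):
                    if a[i][j]:
                        # Walk step: follow an edge with probability alpha.
                        new[j] += alpha * pi[i] * a[i][j] / w[i]
            for j in range(n):
                new[j] += restart * mu[j]
        pi = new
    return pi


# Directed cycle 0 -> 1 -> 2 -> 0: all nodes are equally important.
cycle_scores = pagerank([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
# Edge 0 -> 1 where node 1 is a dead end.
dead_end_scores = pagerank([[0, 1], [0, 0]])
```

Passing a non-uniform `mu` changes where the walk restarts, which is the role of the restart distribution in the list above.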

(42)

Personalized PageRank

Only requires minor adaptations

• Given some seed set $S \subset V$

(43)

Co-ranking

(44)

Outline

1. Sparse matrices

2. PageRank

3. Clustering

(45)

Graph clustering

(46)
(47)

Characters of Les Miserables

[Figure: graph of the characters, with node labels such as Myriel, Napoleon, Valjean, Fantine, Cosette, Javert, Marius, Gavroche and Enjolras]

(48)
(49)
(50)
(51)

Graph clustering

• A clustering of a graph $G = (V, E)$ of n nodes and m edges is any function $C : V \to \{1, \ldots, K\}$
• In general, K is unknown (unlike in K-means) and we look for the best clustering irrespective of the value of K

(54)

Modularity

The modularity of clustering C is defined by:

$Q(C) = \frac{1}{2m} \sum_{i,j \in V} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta_{C(i),C(j)}$
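A direct, unoptimized transcription of this formula for an unweighted, undirected graph is sketched below. The graph (two triangles joined by one edge) and the names used are assumptions for this example.

```python
def modularity(adj, labels):
    """Q(C) = 1/(2m) * sum over i, j of (A_ij - d_i d_j / (2m)) * [C(i) == C(j)]."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2m
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] == labels[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
q_triangles = modularity(adj, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
q_single = modularity(adj, {u: 0 for u in adj})
```

With m = 7, each triangle contains 3 edges and has total degree 7, so Q = 2 × (3/7 − (7/14)²) = 5/14 ≈ 0.357. Putting all nodes in a single cluster gives Q = 0.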

(55)

Maximizing the modularity

Consider the following problem:

$\max_C Q(C)$

• Combinatorial problem!

(56)
(57)

The Louvain algorithm

Greedy algorithm:

1. (Initialization) C ← identity

2. (Maximization) While modularity Q(C ) increases, update C by moving one node from one cluster to another

3. (Aggregation) Merge all nodes belonging to the same cluster into a single node, update the weights accordingly and apply step 2 to the aggregate graph
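The sketch below illustrates steps 1 and 2 only (initialization and node moves) on an unweighted graph. It recomputes the modularity from scratch for each candidate move, which is far slower than the incremental updates used in practice, and it leaves out the aggregation step entirely. All names and the example graph are assumptions for this illustration.

```python
def modularity(adj, labels):
    """Modularity of a clustering of an unweighted, undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] == labels[j]:
                q += (1.0 if j in adj[i] else 0.0) - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


def local_moves(adj, tol=1e-12):
    """Steps 1-2: start from singletons, move nodes while modularity increases."""
    labels = {u: u for u in adj}                 # step 1: C <- identity
    improved = True
    while improved:                              # step 2
        improved = False
        for u in adj:
            best_label, best_q = labels[u], modularity(adj, labels)
            current = labels[u]
            for v in sorted(adj[u]):             # candidate clusters: those of the neighbours
                if labels[v] == current:
                    continue
                labels[u] = labels[v]            # try the move
                q = modularity(adj, labels)
                labels[u] = current              # undo it
                if q > best_q + tol:
                    best_label, best_q = labels[v], q
            if best_label != current:
                labels[u] = best_label
                improved = True
    return labels


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
labels = local_moves(adj)
```

On this small graph the node moves alone recover the two triangles. On larger graphs, the aggregation step is what lets whole clusters merge.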

(59)

Extensions

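• Weighted graphs: $Q(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$
• Resolution parameter: $Q_\gamma(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \gamma \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$
• Directed graphs: seen as bipartite graphs, i.e., with adjacency matrix $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$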

(60)

Outline

1. Sparse matrices

2. PageRank

3. Clustering

(61)

Graph embedding

How to transform graph data into vector data, so as to preserve the proximity between nodes?

[Figure: two-dimensional embedding of the nodes]

(63)

Back to random walks

• $P_{ij} = A_{ij} / w_i$: probability of moving from i to j
• A Markov chain with transition matrix $P = D^{-1} A$, with $D = \mathrm{diag}(w)$

(64)

Spectral analysis

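Spectral decomposition: $PV = V\Lambda$, $V^T D V = I$, where

• $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_1 = 1 > \lambda_2 \geq \ldots \geq \lambda_n \geq -1$
• $V = (v_1, \ldots, v_n)$ with $v_1 \propto \mathbf{1}$

Notes:

• If the graph is disconnected, with k connected components, then $\lambda_1 = \ldots = \lambda_k = 1 > \lambda_{k+1}$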
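This decomposition can be computed numerically for an undirected graph. The sketch below assumes NumPy is available and that no node is isolated. It uses the symmetric matrix $D^{-1/2} A D^{-1/2}$, which has the same eigenvalues as P, and rescales its eigenvectors by $D^{-1/2}$. The function name and the example graph are assumptions for this illustration.

```python
import numpy as np


def walk_spectrum(a):
    """Eigenvalues (in decreasing order) and eigenvectors of P = D^-1 A, A symmetric."""
    a = np.asarray(a, dtype=float)
    w = a.sum(axis=1)
    s = a / np.sqrt(np.outer(w, w))           # D^-1/2 A D^-1/2, symmetric
    eigvals, u = np.linalg.eigh(s)            # eigh returns eigenvalues in ascending order
    order = np.argsort(-eigvals)
    v = u[:, order] / np.sqrt(w)[:, None]     # V = D^-1/2 U, so that V^T D V = I
    return eigvals[order], v


# Two disjoint triangles: k = 2 connected components.
triangle = np.ones((3, 3)) - np.eye(3)
a = np.block([[triangle, np.zeros((3, 3))], [np.zeros((3, 3)), triangle]])
eigvals, v = walk_spectrum(a)
```

With two components, the eigenvalue 1 appears twice and the third eigenvalue is strictly smaller, in line with the note above.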

(66)

Example

(67)

Barbell

(68)

Stochastic block model

(69)
(70)

Spectral embedding

The embedding in dimension k is given by the k + 1 leading eigenvectors of P, skipping the first one (which is constant)
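A minimal sketch of this embedding, assuming NumPy, an undirected graph and no isolated node; no extra normalization of the eigenvectors is applied. The barbell-like example (two triangles joined by one edge) and the function name are assumptions for this illustration.

```python
import numpy as np


def spectral_embedding(a, k):
    """Embed the nodes in dimension k with eigenvectors 2, ..., k + 1 of P = D^-1 A."""
    a = np.asarray(a, dtype=float)
    w = a.sum(axis=1)
    s = a / np.sqrt(np.outer(w, w))        # symmetric matrix with the same eigenvalues as P
    eigvals, u = np.linalg.eigh(s)
    order = np.argsort(-eigvals)[1:k + 1]  # leading eigenvectors, skipping the first
    return u[:, order] / np.sqrt(w)[:, None]


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
a = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    a[i, j] = a[j, i] = 1.0
embedding = spectral_embedding(a, 1)
```

In dimension 1, the sign of the coordinate already separates the two triangles: nodes of the same triangle end up on the same side of zero.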

(71)

Extensions

• Various normalizations can be applied to the eigenvectors (depending on the eigenvalues)
• Embedding on the unit sphere → cosine similarity through dot product
• Bipartite graphs → co-embedding
• Directed graphs: seen as bipartite graphs, i.e., with adjacency matrix $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$

(72)

Getting inspiration from language processing: word2vec

Goal: predict contextual words

How? Extract vector representations of words in a text

$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \in [-1, 1]$

Two models: CBOW vs skip-gram
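The cosine similarity between two vectors can be computed as in the following plain-Python sketch (the function name and the example vectors are assumptions; both vectors must be non-zero).

```python
import math


def cosine(x, y):
    """cos(x, y) = x . y / (||x|| ||y||), a value in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


orthogonal = cosine([1, 0], [0, 1])   # no similarity
aligned = cosine([1, 2], [2, 4])      # same direction
opposite = cosine([1, 0], [-1, 0])    # opposite directions
```

The value depends only on the directions of the vectors, not on their lengths.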

(73)

On graphs: node2vec

Text: a special graph

Look → at → this → crazy → koala → eating → newspapers ...

(75)

Node2vec

Actually, biased random walks

• Walk length: number of nodes in each random walk
• p: return parameter
• q: breadth-depth parameter

Objective: $\max_f \sum_u \log \Pr(N(u) \mid f(u))$
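A rough sketch of one biased walk on an unweighted graph, in plain Python. It follows the usual node2vec weighting: when the walk has just moved from `prev` to the current node, a neighbour gets weight 1/p if it is `prev` (return), weight 1 if it is also a neighbour of `prev`, and weight 1/q otherwise (move further away). The function name, the `seed` argument and the example graph are assumptions for this illustration. Learning the embedding f from the walks is not shown.

```python
import random


def biased_walk(adj, start, walk_length, p=1.0, q=1.0, seed=None):
    """Generate one walk of at most walk_length nodes, biased by p and q."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break                              # dead end: stop the walk early
        if len(walk) == 1:
            nxt = rng.choice(nbrs)             # first step: uniform choice
        else:
            prev = walk[-2]
            weights = []
            for x in nbrs:
                if x == prev:
                    weights.append(1.0 / p)    # return to the previous node
                elif x in adj[prev]:
                    weights.append(1.0)        # stay at distance 1 from prev
                else:
                    weights.append(1.0 / q)    # move away from prev
            nxt = rng.choices(nbrs, weights=weights, k=1)[0]
        walk.append(nxt)
    return walk


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
walk = biased_walk(adj, start=0, walk_length=10, p=0.5, q=2.0, seed=0)
```

A small p makes the walk backtrack more often, while a large q keeps it close to the previous node, giving a more breadth-first exploration.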

(76)

Summary

Many datasets have a graph structure, which requires suitable data structures and algorithms:
