Graph Mining. Tiphaine Viard

(2)

Graph data

- Infrastructure: roads, railways, power grid, internet, ...
- Communication: phone, emails, flights, ...
- Information: Web, Wikipedia, knowledge bases, ...
- Social networks: Facebook, Twitter, LinkedIn, ...
- Marketing: customer-product, user-content, ...
- Text analysis: words, text similarity, ...
- Biology: brain, proteins, genetics, ...

Graphs may be:

- directed / undirected / bipartite
- weighted or not
- labelled or not

(8)

Graph mining

Objective: Extract useful information / learn from graphs

(9)

Basic notions

Graph = a tuple $G = (V, E)$ with $E \subseteq V \otimes V$, $(u, v) = (v, u)$, $u \neq v$

$n = |V|$, $m = |E|$

Neighbours: $N(u) = \{v : \exists (u, v) \in E\}$

Degree = size of neighbourhood

$d(u) = |N(u)| \cong \sum_{v \in V} w_{uv}$

Density = edge probability

$\delta(G) = \frac{2m}{n \cdot (n - 1)}$
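The notions above can be checked on a toy graph. The following is a minimal sketch, assuming an undirected, unweighted graph stored as an edge list; the node names and the helper functions `neighbours`, `degree` and `density` are illustrative, not part of the course material.

```python
def neighbours(edges, u):
    """Return N(u): all nodes sharing an edge with u (undirected graph)."""
    return {b if a == u else a for a, b in edges if u in (a, b)}


def degree(edges, u):
    """d(u) = |N(u)|."""
    return len(neighbours(edges, u))


def density(nodes, edges):
    """delta(G) = 2m / (n (n - 1)): fraction of possible edges that exist."""
    n, m = len(nodes), len(edges)
    return 2 * m / (n * (n - 1))


# Hypothetical toy graph: a triangle a-b-c plus a pendant node d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

print(neighbours(edges, "c"))  # the three neighbours of c
print(degree(edges, "c"))      # 3
print(density(nodes, edges))   # 4 edges out of 6 possible ones
```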

(12)

Quiz

What is the degree of the node in red? A) 2

(13)

Key questions

Given some graph:

- What are the most important nodes?
- What are the most important nodes relative to some other nodes?
- How is the graph structured?
- How to infer labels?
- How to predict new links?

(14)

Key properties

Most real-world graphs are sparse

Dataset            #nodes   #edges   Density
Flights             2,939   30,500   ≈ 10^-3
Amazon products      335k     925k   ≈ 10^-5
Actors               382k      33M   ≈ 10^-4
Wikipedia (en)        12M     378M   ≈ 10^-6
Twitter               42M     1.5G   ≈ 10^-6
Friendster            68M     2.5G   ≈ 10^-7

(15)

Traversing graphs

Goal: explore the graph, test connectivity, compute shortest paths

One of the foundational algorithms for the rest of this session

- Input: a graph G, a list of nodes to consider C, a list of seen nodes S
- Add a node r (the root) to C
- While |C| > 0:
    - Remove a node u from C, add it to S
    - Output u
    - Add to C all nodes of N(u) that are in neither S nor C

[Figure: example graph on nodes A, B, C, D, E]

Two strategies: breadth (queue) vs depth (stack)
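The traversal described above can be sketched in a few lines. This is a minimal sketch, assuming the graph is given as adjacency lists; the example graph is hypothetical (the edges of the slide's figure are not recoverable), and nodes already seen or already waiting are skipped so that the loop terminates. The only difference between the two strategies is which end of the container the next node is taken from.

```python
from collections import deque


def traverse(graph, root, strategy="breadth"):
    """Return the nodes reachable from root, in the order they are output."""
    to_visit = deque([root])  # the list C of nodes to consider
    seen = []                 # the list S of seen nodes, in output order
    while to_visit:
        # Queue (first in, first out) for breadth, stack (last in, first out) for depth.
        u = to_visit.popleft() if strategy == "breadth" else to_visit.pop()
        seen.append(u)
        for v in graph[u]:
            if v not in seen and v not in to_visit:
                to_visit.append(v)
    return seen


# Hypothetical graph on five nodes.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

breadth_order = traverse(graph, "A", strategy="breadth")
depth_order = traverse(graph, "A", strategy="depth")
print(breadth_order)
print(depth_order)
```

The graph is connected exactly when the traversal outputs every node.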

(16)

Quiz

[Figure: graph on nodes A, B, C, D, E]

Given the above graph, in what order will nodes be output by a breadth-first search starting from A? (When there are multiple candidates, always take the first one clockwise.)

(17)

Outline

1. Sparse matrices

2. PageRank

3. Clustering

(19)

Graphs as matrices

- Directed graphs: n nodes → adjacency matrix A of size $n \times n$, $A_{ij} > 0$ iff there is an edge from i to j
- Undirected graphs: n nodes → adjacency matrix A of size $n \times n$, $A_{ij} = A_{ji} > 0$ iff there is an edge between i and j
- Bipartite graphs: $n_1$, $n_2$ nodes → biadjacency matrix B of size $n_1 \times n_2$
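The three matrix representations can be illustrated on small examples. This is a minimal sketch using dense nested lists for readability; the helper names `adjacency` and `biadjacency` and the example edges are assumptions made for the illustration.

```python
def adjacency(n, edges, directed=True):
    """Build the n x n adjacency matrix: A[i][j] > 0 iff there is an edge i -> j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # undirected graphs give a symmetric matrix
    return A


def biadjacency(n1, n2, edges):
    """Build the n1 x n2 biadjacency matrix of a bipartite graph."""
    B = [[0] * n2 for _ in range(n1)]
    for i, j in edges:
        B[i][j] = 1  # i indexes the first node set, j the second
    return B


edges = [(0, 1), (1, 2), (2, 0)]
A_directed = adjacency(3, edges)
A_undirected = adjacency(3, edges, directed=False)
B = biadjacency(2, 3, [(0, 0), (0, 2), (1, 1)])

print(A_directed)
print(A_undirected)
print(B)
```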

(22)

The COOrdinate format

A sparse matrix (in dense format):

3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0

The same matrix in COO (COOrdinate) format:

(0, 0, 3), (0, 2, 5), (0, 6, 4), . . .

or equivalently,

row = 0, 0, 0, . . .
col = 0, 2, 6, . . .
data = 3, 5, 4, . . .
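The COO representation can be sketched in plain Python. The code below assumes that the flattened values of the slide form a 5 × 10 matrix, which is consistent with the CSR arrays given for the same matrix later in the deck; the function name `to_coo` and the three array names `rows`, `cols`, `data` are illustrative.

```python
# Assumed 5 x 10 layout of the matrix shown on the slide.
dense = [
    [3, 0, 5, 0, 0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0, 2, 4, 0, 0],
    [1, 0, 2, 0, 0, 3, 5, 3, 0, 0],
    [0, 2, 0, 4, 1, 5, 0, 0, 1, 0],
    [5, 0, 3, 0, 4, 0, 0, 0, 0, 0],
]


def to_coo(matrix):
    """List the non-zero entries as (row, column, value) triplets."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]


coo = to_coo(dense)

# Equivalent view: three parallel arrays instead of a list of triplets.
rows, cols, data = (list(column) for column in zip(*coo))

print(coo[:3])
print(rows[:5], cols[:5], data[:5])
```

Only the 20 non-zero entries are stored, instead of the 50 entries of the dense format.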

(24)

The CSR (Compressed Sparse Row) format

A (not so) sparse matrix (in dense format):

2 0
0 3

The same matrix in coordinates:

(0, 0, 2), (1, 1, 3)

or equivalently, in CSR format:

indices = 0, 1
indptr = 0, 1, 2
data = 2, 3

(26)

The CSR (Compressed Sparse Row) format

3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0

Coordinates: (0, 0, 3), (0, 2, 5), (0, 6, 4), . . .

The same matrix in CSR (Compressed Sparse Row) format:

indices = 0, 2, 6, 7, 4, 6, 7, 0, . . .
indptr = 0, 4, 7, 12, 17, 20
data = 3, 5, 4, 4, 4, 2, 4, . . .
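To see where the three CSR arrays come from, the conversion can be written by hand. This is a minimal sketch, assuming the matrix of the slide is laid out as 5 rows of 10 values (the layout that matches the `indptr` array above); the function name `to_csr` is illustrative.

```python
def to_csr(matrix):
    """Convert a dense matrix (list of rows) to the CSR arrays.

    indices: column index of each non-zero entry, row after row
    indptr:  indptr[i]..indptr[i + 1] delimits the entries of row i
    data:    value of each non-zero entry, in the same order as indices
    """
    indices, indptr, data = [], [0], []
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                indices.append(j)
                data.append(value)
        indptr.append(len(indices))  # running count of non-zero entries
    return indices, indptr, data


# Assumed 5 x 10 layout of the matrix shown on the slide.
dense = [
    [3, 0, 5, 0, 0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0, 2, 4, 0, 0],
    [1, 0, 2, 0, 0, 3, 5, 3, 0, 0],
    [0, 2, 0, 4, 1, 5, 0, 0, 1, 0],
    [5, 0, 3, 0, 4, 0, 0, 0, 0, 0],
]

indices, indptr, data = to_csr(dense)
print(indices[:8])
print(indptr)
print(data[:7])
```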

(27)

Properties of the CSR format

Pros

- Efficient storage
- Fast row slicing
- Fast matrix-vector product

Cons

- Slow column slicing
- Slow modification (e.g., add an entry)
- Slow transpose
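These trade-offs follow directly from the layout of the arrays. The sketch below is illustrative only (the function names and the 3 × 3 example are assumptions): a row is a contiguous slice delimited by `indptr`, which also makes the matrix-vector product a single pass over the data, whereas extracting a column requires scanning every stored entry.

```python
def csr_row(indices, indptr, data, i):
    """Row i is a contiguous slice: no search needed."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


def csr_matvec(indices, indptr, data, x):
    """Compute y = A x, touching each stored entry exactly once."""
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        for i in range(len(indptr) - 1)
    ]


def csr_column(indices, indptr, data, j):
    """Column j is scattered across the rows: every entry must be inspected."""
    column = []
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == j:
                column.append((i, data[k]))
    return column


# CSR arrays of the dense matrix [[2, 0, 1], [0, 0, 0], [0, 3, 0]].
indices, indptr, data = [0, 2, 1], [0, 2, 2, 3], [2, 1, 3]

print(csr_row(indices, indptr, data, 0))
print(csr_matvec(indices, indptr, data, [1, 2, 3]))
print(csr_column(indices, indptr, data, 1))
```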

(29)

PageRank

How to identify the most “important” nodes in a graph?

(33)

Personalized PageRank

How to identify the most “important” nodes in a graph relative to some other nodes?

(35)

Random walk

First scheme

- $P_{ij} = A_{ij} / w_i$: probability of moving from i to j, where $w_i = \sum_j A_{ij}$ is the (weighted) out-degree of i
- A Markov chain with transition matrix $P = D^{-1} A$, with $D = \mathrm{diag}(w)$
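The transition matrix can be computed by normalizing each row of the adjacency matrix. This is a minimal sketch on a hypothetical 4-node undirected graph, using dense lists for readability; rows of nodes without outgoing edges are left at zero, which is precisely the dead-end problem discussed later.

```python
def transition_matrix(A):
    """P = D^{-1} A: divide each row of A by its sum w_i."""
    P = []
    for row in A:
        w = sum(row)
        P.append([a / w for a in row] if w > 0 else [0.0] * len(row))
    return P


def step(distribution, P):
    """One step of the walk: new[j] = sum_i distribution[i] * P[i][j]."""
    n = len(P)
    return [sum(distribution[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical graph: triangle 0-1-2 plus node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
P = transition_matrix(A)

start = [1.0, 0.0, 0.0, 0.0]   # the walker starts at node 0
after_one_step = step(start, P)
print(P[2])
print(after_one_step)
```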

(36)

A simple example

[Figure: example graph on nodes A, B, C, D]

(38)

Quiz

[Figure: graph on nodes A, B, C, D]

(39)

Limits of our approach

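- Dead ends (nodes without outgoing edges): the walk cannot continue, so scores vanish to 0
- Spider traps (groups of nodes with no edge leaving the group): the trap attracts all the importance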

(41)

Random walk with restart

- Fix $\alpha \in (0, 1)$
- Walk with probability $\alpha$, restart with probability $1 - \alpha$
- Restart distribution $\mu$ on V (e.g., uniform)
- New transition matrix: $P^{(\alpha)} = \alpha P + (1 - \alpha) \mathbf{1} \mu$
- Forces restart from dead ends
- Eventually escapes spider traps
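The random walk with restart can be simulated by repeatedly applying the new transition matrix to a distribution (power iteration). The sketch below is one possible reading of the slide, not a reference implementation: the function name `pagerank`, the default value of `alpha`, the fixed number of iterations and the dense adjacency lists are all assumptions. Dead ends restart with probability 1.

```python
def pagerank(A, alpha=0.85, mu=None, iterations=100):
    """Approximate the stationary distribution of the walk with restart."""
    n = len(A)
    mu = mu or [1.0 / n] * n            # restart distribution, uniform by default
    out_weight = [sum(row) for row in A]
    scores = list(mu)
    for _ in range(iterations):
        new_scores = [0.0] * n
        for i in range(n):
            for j in range(n):
                if out_weight[i] == 0:
                    move = mu[j]        # dead end: forced restart
                else:
                    move = alpha * A[i][j] / out_weight[i] + (1 - alpha) * mu[j]
                new_scores[j] += scores[i] * move
        scores = new_scores
    return scores


# Directed cycle 0 -> 1 -> 2 -> 0: all nodes are equally important.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
cycle_scores = pagerank(cycle)

# Directed path 0 -> 1 -> 2, where node 2 is a dead end.
path = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
path_scores = pagerank(path)

# Restarting always from node 0 ranks nodes relative to node 0.
seeded_scores = pagerank(cycle, mu=[1.0, 0.0, 0.0])

print(cycle_scores)
print(path_scores)
print(seeded_scores)
```

Passing a restart distribution concentrated on a few nodes, as in the last call, gives scores relative to those nodes.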

(42)

Personalized PageRank

Only requires minor adaptations

- Given some seed set $S \subset V$, take a restart distribution $\mu$ supported on S (e.g., uniform on S)

(43)

Co-ranking

(45)

Graph clustering

(46)

(47)

[Figure: graph of the characters of Les Miserables]

(51)

Graph clustering

- The clustering of a graph $G = (V, E)$ of n nodes and m edges is any function $C : V \to \{1, \dots, K\}$
- In general, K is unknown (unlike K-means) and we look for the best clustering irrespective of the value of K

(54)

Modularity

The modularity of clustering C is defined by:

$Q(C) = \frac{1}{2m} \sum_{i,j \in V} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta_{C(i),C(j)}$
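The definition can be evaluated literally on a small graph. This is a minimal sketch, assuming an undirected, unweighted graph stored as adjacency lists; the example (two triangles joined by one edge) and the function name `modularity` are illustrative. The double loop is quadratic in the number of nodes and only meant to mirror the formula.

```python
def modularity(adj, labels):
    """Q(C) for an undirected, unweighted graph given as adjacency lists."""
    degree = {u: len(neighbours) for u, neighbours in adj.items()}
    two_m = sum(degree.values())  # each edge is counted twice
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] == labels[j]:  # Kronecker delta on the clusters
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - degree[i] * degree[j] / two_m
    return q / two_m


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
two_clusters = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
one_cluster = {u: 1 for u in adj}

q_two = modularity(adj, two_clusters)
q_one = modularity(adj, one_cluster)
print(q_two)  # 5/14: the two triangles form good clusters
print(q_one)  # 0: putting everything in one cluster is no better than chance
```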

(55)

Maximizing the modularity

Consider the following problem:

$\max_C Q(C)$

- Combinatorial problem!

(57)

The Louvain algorithm

Greedy algorithm:

1. (Initialization) C ← identity (each node starts in its own cluster)

2. (Maximization) While modularity Q(C ) increases, update C by moving one node from one cluster to another

3. (Aggregation) Merge all nodes belonging to the same cluster into a single node, update the weights accordingly and apply step 2 to the aggregate graph
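Steps 1 and 2 can be sketched as follows. This is a deliberately naive illustration, not the actual Louvain implementation: it recomputes the full modularity for every candidate move (real implementations use an incremental gain formula) and it omits the aggregation step 3. The graph, the function names and the tolerance `1e-12` are assumptions.

```python
def modularity(adj, labels):
    """Q(C) for an undirected, unweighted graph given as adjacency lists."""
    degree = {u: len(neighbours) for u, neighbours in adj.items()}
    two_m = sum(degree.values())
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] == labels[j]:
                q += (1.0 if j in adj[i] else 0.0) - degree[i] * degree[j] / two_m
    return q / two_m


def local_moves(adj):
    """Greedy maximization: move nodes to a neighbour's cluster while Q increases."""
    labels = {u: u for u in adj}          # step 1: identity clustering
    best_q = modularity(adj, labels)
    improved = True
    while improved:                        # step 2: repeat until no move helps
        improved = False
        for u in adj:
            original = labels[u]
            best_label = original
            for v in adj[u]:
                candidate = labels[v]
                if candidate == original:
                    continue
                labels[u] = candidate      # tentatively move u
                q = modularity(adj, labels)
                if q > best_q + 1e-12:     # keep only strict improvements
                    best_q, best_label = q, candidate
            labels[u] = best_label         # commit the best move (or stay)
            if best_label != original:
                improved = True
    return labels, best_q


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels, q = local_moves(adj)
print(labels, q)
```

On this example the local moves alone recover the two triangles; on larger graphs the aggregation step is what lets the algorithm merge small clusters further.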

(59)

Extensions

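- Weighted graphs: $Q(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$, with $w_i = \sum_j A_{ij}$ and $w = \sum_i w_i$
- Resolution parameter: $Q_\gamma(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \gamma \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$
- Directed graphs: seen as bipartite graphs, i.e., $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$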

(61)

Graph embedding

How to transform graph data into vector data, so as to preserve the proximity between nodes?

(63)

Back to random walks

- $P_{ij} = A_{ij} / w_i$: probability of moving from i to j
- A Markov chain with transition matrix $P = D^{-1} A$, with $D = \mathrm{diag}(w)$

(64)

Spectral analysis

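Spectral decomposition: $PV = V \Lambda$, $V^T D V = I$, where

- $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ with $\lambda_1 = 1 > \lambda_2 \ge \dots \ge \lambda_n \ge -1$ (for a connected graph)
- $V = (v_1, \dots, v_n)$ with $v_1 \propto \mathbf{1}$

Notes:

- If the graph is disconnected, with k connected components, then $\lambda_1 = \dots = \lambda_k = 1 > \lambda_{k+1}$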

(66)

Example

(67)

Barbell

(68)

Stochastic block model

(70)

Spectral embedding

Embedding in dimension k obtained from the k + 1 leading eigenvectors of P, skipping the first one (which is constant): node i is mapped to $(v_2(i), \dots, v_{k+1}(i))$
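For k = 1, the embedding is given by the second eigenvector of P, which can be approximated without any linear algebra library. The sketch below is an illustration under several assumptions: a connected, undirected, unweighted graph without isolated nodes, power iteration on the lazy walk (P + I)/2 so that all eigenvalues are non-negative, and removal of the constant component at each step. The function name and the barbell-like example are hypothetical.

```python
def second_eigenvector(adj, iterations=200):
    """Approximate (v_2, lambda_2) of P = D^{-1} A by deflated power iteration."""
    nodes = list(adj)
    w = {u: len(adj[u]) for u in nodes}              # degrees, D = diag(w)
    total = sum(w.values())
    x = {u: float(k) for k, u in enumerate(nodes)}   # arbitrary starting vector
    for _ in range(iterations):
        # Remove the component along v_1 (constant vector), orthogonality w.r.t. D.
        mean = sum(w[u] * x[u] for u in nodes) / total
        x = {u: x[u] - mean for u in nodes}
        # Lazy walk step: x <- (x + P x) / 2.
        x = {u: 0.5 * (x[u] + sum(x[v] for v in adj[u]) / w[u]) for u in nodes}
        # Normalize so that x^T D x = 1.
        norm = sum(w[u] * x[u] ** 2 for u in nodes) ** 0.5
        x = {u: x[u] / norm for u in nodes}
    px = {u: sum(x[v] for v in adj[u]) / w[u] for u in nodes}
    eigenvalue = sum(w[u] * x[u] * px[u] for u in nodes)  # Rayleigh quotient
    return x, eigenvalue


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
embedding, eigenvalue = second_eigenvector(adj)
print(embedding)
print(eigenvalue)
```

In this one-dimensional embedding, the nodes of each triangle land on the same side of zero, so the sign of the coordinate already separates the two groups.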

(71)

Extensions

- Various normalizations can be applied to the eigenvectors (depending on the eigenvalues)
- Embedding on the unit sphere → cosine similarity through dot product
- Bipartite graphs → co-embedding
- Directed graphs: seen as bipartite graphs, i.e., $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$

(72)

Getting inspiration from language processing: word2vec

Goal: Predict contextual words

How? Extract vector representations of words in a text

$\cos(x, y) = \frac{x \cdot y}{\|x\| \|y\|} \in [-1, 1]$

Two models: CBOW vs skip-gram
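The cosine similarity used to compare two vector representations can be computed directly from its definition. This is a minimal sketch; the function name `cosine` and the example vectors are illustrative, and both vectors are assumed to be non-zero.

```python
import math


def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||), assuming x and y are non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


print(cosine([1, 0], [0, 1]))    # orthogonal vectors
print(cosine([1, 2], [2, 4]))    # same direction
print(cosine([1, 2], [-1, -2]))  # opposite directions
```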

(73)

On graphs: node2vec

Text: a special graph

See → this → crazy → koala → that → eats → newspapers...

(74)

Node2vec

In practice, node2vec relies on biased random walks

- Walk length: number of nodes in each random walk
- p: return parameter
- q: breadth-depth parameter

Objective: $\max_f \sum_u \log \Pr(N(u) \mid f(u))$
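The role of p and q can be seen in how a single walk is generated. The sketch below covers only the walk generation, under the usual node2vec convention that a step back to the previous node has weight 1/p, a step to a common neighbour of the previous node has weight 1, and any other step has weight 1/q. Learning the embedding f from the walks (with a skip-gram model) is not shown; the function name, the example graph and the fixed random seed are assumptions.

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased random walk of at most `length` nodes."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        candidates = adj[current]
        if not candidates:
            break                                  # dead end: stop the walk
        if len(walk) == 1:
            walk.append(rng.choice(candidates))    # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for x in candidates:
            if x == previous:
                weights.append(1.0 / p)            # return to the previous node
            elif x in adj[previous]:
                weights.append(1.0)                # stay close (breadth-like)
            else:
                weights.append(1.0 / q)            # move away (depth-like)
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
walk = biased_walk(adj, start=0, length=10, p=0.5, q=2.0, rng=random.Random(42))
print(walk)
```

A small p makes the walk return often and stay local, while a small q pushes it to explore further from its starting point.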

(76)

Summary

Many data have a graph structure, which requires suitable data structures and algorithms: