Graph Mining
Tiphaine Viard
Graph data
- Infrastructure: roads, railways, power grid, internet, ...
- Communication: phone, emails, flights, ...
- Information: Web, Wikipedia, knowledge bases, ...
- Social networks: Facebook, Twitter, LinkedIn, ...
- Marketing: customer-product, user-content, ...
- Text analysis: words, text similarity, ...
- Biology: brain, proteins, genetics, ...
Graphs may be:
- directed / undirected / bipartite
- weighted or not
- labelled or not
Graph mining
Objective: Extract useful information / learn from graphs
Basic notions
Graph = a tuple $G = (V, E)$, $E \subseteq V \otimes V$
$(u, v) = (v, u)$, $u \neq v$
$n = |V|$, $m = |E|$
Neighbours: $N(u) = \{v : \exists (u, v) \in E\}$
Degree = size of neighbourhood
$d(u) = |N(u)| \cong \sum_{v \in V} w_{uv}$
Density = edge probability
$\delta(G) = \frac{2m}{n (n - 1)}$
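The notions above can be checked on a small example. The sketch below assumes an undirected, unweighted graph stored as a list of node pairs; the toy graph and the helper names `neighbours` and `density` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def neighbours(edges):
    """Build N(u) for an undirected graph given as a list of (u, v) pairs."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # the definition above requires u != v (no self-loops)
        nbrs[u].add(v)
        nbrs[v].add(u)
    return dict(nbrs)

def density(n, m):
    """delta(G) = 2m / (n (n - 1)): fraction of node pairs that are edges."""
    return 2 * m / (n * (n - 1))

# Hypothetical toy graph: a triangle A-B-C plus a node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
N = neighbours(edges)
degree = {u: len(N[u]) for u in N}   # d(u) = |N(u)|
n, m = len(N), len(edges)

print(degree)          # A and B have degree 2, C has degree 3, D has degree 1
print(density(n, m))   # 2 * 4 / (4 * 3), i.e. two thirds of all pairs
```

With 4 nodes there are 6 possible pairs, 4 of which are edges, which is where the value 2/3 comes from.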
Quiz
What is the degree of the node in red? A) 2
Key questions
Given some graph:
- What are the most important nodes?
- What are the most important nodes relative to some other nodes?
- How is the graph structured?
- How to infer labels?
- How to predict new links?
Key properties
Most real-world graphs are sparse
Dataset            #nodes   #edges   Density
Flights            2,939    30,500   ≈ 10^-3
Amazon products    335k     925k     ≈ 10^-5
Actors             382k     33M      ≈ 10^-4
Wikipedia (en)     12M      378M     ≈ 10^-6
Twitter            42M      1.5G     ≈ 10^-6
Friendster         68M      2.5G     ≈ 10^-7
Traversing graphs
Goal: get through the graph, check connectivity, find shortest paths
One of the foundational algorithms for the rest of this session
- A graph G, a list of nodes to consider C, a list of seen nodes S
- Add a node r (the root) to C
- While |C| > 0:
    - Remove a node u from C
    - If u is not in S: add it to S, output u, and add all nodes of N(u) that are not in S to C
[Figure: example graph on nodes A, B, C, D, E]
Two strategies: breadth-first (C is a queue) vs depth-first (C is a stack)
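A minimal sketch of this traversal, assuming the graph is stored as a dictionary mapping each node to its list of neighbours. The example graph is hypothetical (the edges of the figure are not recoverable here), and the function name `traverse` is an illustration, not a library call. A node may be pushed to C several times; it is skipped when popped if it has already been seen.

```python
from collections import deque

def traverse(graph, root, strategy="breadth"):
    """Generic traversal: C is used as a queue (breadth) or as a stack (depth)."""
    to_visit = deque([root])   # C: nodes to consider
    seen = set()               # S: nodes already output
    order = []
    while to_visit:
        # Queue: take the oldest node. Stack: take the most recent one.
        u = to_visit.popleft() if strategy == "breadth" else to_visit.pop()
        if u in seen:
            continue           # u was added to C more than once
        seen.add(u)
        order.append(u)
        for v in graph[u]:
            if v not in seen:
                to_visit.append(v)
    return order

# Hypothetical undirected graph, as adjacency lists.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

print(traverse(graph, "A", "breadth"))  # ['A', 'B', 'C', 'D', 'E']
print(traverse(graph, "A", "depth"))    # ['A', 'C', 'D', 'E', 'B']
```

The only difference between the two strategies is which end of C the next node is taken from. If the output does not contain every node, the graph is not connected.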
Quiz
[Figure: graph on nodes A, B, C, D, E]
Given the above graph, in what order will the nodes be output by a breadth-first search starting from A? (When there are several candidates, always take the first one clockwise.)
Outline
1. Sparse matrices
2. PageRank
3. Clustering
Graphs as matrices
- Directed graphs
n nodes → adjacency matrix $A$ of size $n \times n$, $A_{ij} > 0$ iff edge from $i$ to $j$
- Undirected graphs
n nodes → adjacency matrix $A$ of size $n \times n$, $A_{ij} = A_{ji} > 0$ iff edge between $i$ and $j$
- Bipartite graphs
$n_1$, $n_2$ nodes → biadjacency matrix $B$ of size $n_1 \times n_2$
The COOrdinate format
A sparse matrix (in dense format):
3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0
The same matrix in COO (COOrdinate) format:
(0, 0, 3), (0, 2, 5), (0, 6, 4), . . .
or equivalently, as three arrays:
row = 0, 0, 0, . . .
col = 0, 2, 6, . . .
data = 3, 5, 4, . . .
The CSR (Compressed Sparse Row) format
A (not so) sparse matrix (in dense format):
2 0
0 3
The same matrix in coordinates:
(0, 0, 2), (1, 1, 3)
or equivalently, in CSR format:
indices = 0, 1
indptr = 0, 1, 2
data = 2, 3
The CSR (Compressed Sparse Row) format
A sparse matrix (in dense format):
3 0 5 0 0 0 4 4 0 0
0 0 0 0 4 0 2 4 0 0
1 0 2 0 0 3 5 3 0 0
0 2 0 4 1 5 0 0 1 0
5 0 3 0 4 0 0 0 0 0
Coordinates: (0, 0, 3), (0, 2, 5), (0, 6, 4), . . .
The same matrix in CSR (Compressed Sparse Row) format:
indices = 0, 2, 6, 7, 4, 6, 7, 0, . . .
indptr = 0, 4, 7, 12, 17, 20
data = 3, 5, 4, 4, 4, 2, 4, . . .
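To make the two formats concrete, here is a dependency-free sketch that rebuilds the COO triples and the CSR arrays of the example matrix. It assumes the matrix has 5 rows and 10 columns, which is inferred from the six entries of indptr and the 50 values listed. The helper names are hypothetical; in practice a sparse matrix library would be used instead.

```python
def dense_to_coo(matrix):
    """List the non-zero entries as (row, column, value) triples, row by row."""
    return [(i, j, x)
            for i, row in enumerate(matrix)
            for j, x in enumerate(row) if x != 0]

def coo_to_csr(coo, n_rows):
    """Convert triples sorted by row into the three CSR arrays."""
    indices = [j for _, j, _ in coo]       # column index of each entry
    data = [x for _, _, x in coo]          # value of each entry
    indptr = [0] * (n_rows + 1)
    for i, _, _ in coo:
        indptr[i + 1] += 1                 # number of entries in row i
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]         # cumulative sum: start of each row
    return indices, indptr, data

def csr_row(indices, indptr, data, i):
    """Row i is stored in positions indptr[i] to indptr[i + 1] (excluded)."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

dense = [
    [3, 0, 5, 0, 0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0, 2, 4, 0, 0],
    [1, 0, 2, 0, 0, 3, 5, 3, 0, 0],
    [0, 2, 0, 4, 1, 5, 0, 0, 1, 0],
    [5, 0, 3, 0, 4, 0, 0, 0, 0, 0],
]
coo = dense_to_coo(dense)
indices, indptr, data = coo_to_csr(coo, len(dense))

print(coo[:3])                             # [(0, 0, 3), (0, 2, 5), (0, 6, 4)]
print(indptr)                              # [0, 4, 7, 12, 17, 20]
print(csr_row(indices, indptr, data, 1))   # [(4, 4), (6, 2), (7, 4)]
```

Reading a row only needs two lookups in indptr, whereas reading a column requires scanning all of indices.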
Properties of the CSR format
Pros
- Efficient storage
- Fast row slicing
- Fast matrix-vector product
Cons
- Slow column slicing
- Slow modification (e.g., add an entry)
- Slow transpose
PageRank
PageRank
How to identify the most “important” nodes in a graph?
Personalized PageRank
How to identify the most “important” nodes in a graph relative to some other nodes?
Random walk
First scheme
- $P_{ij} = A_{ij} / w_i$, probability of moving from $i$ to $j$, where $w_i = \sum_j A_{ij}$
- A Markov chain with transition matrix $P = D^{-1} A$ with $D = \mathrm{diag}(w)$
A simple example
[Figure: example graph on nodes A, B, C, D]
Quiz
[Figure: graph on nodes A, B, C, D]
Limits of our approach
- Dead ends (nodes with no outgoing edge): scores vanish to 0
- Spider traps (groups of nodes with no edge leaving the group): all the importance accumulates in the trap
Random walk with restart
- Fix $\alpha \in (0, 1)$
- Walk with probability $\alpha$, restart with probability $1 - \alpha$
- Restart distribution $\mu$ on $V$ (e.g., uniform)
- New transition matrix:
$P^{(\alpha)} = \alpha P + (1 - \alpha) \mathbf{1} \mu$
- Forces restart from dead ends
- Eventually escapes spider traps
Personalized PageRank
Only requires minor adaptations
- Given some seed set $S \subset V$
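The random walk with restart can be sketched as a power iteration. The code below is an illustration under stated assumptions, not a reference implementation: the graph is a dictionary of weighted successors, the value alpha = 0.85 is a common default that does not come from the slides, and the personalized variant is obtained by taking the restart distribution uniform on the seed set (the usual choice). The function name `pagerank` and the toy graph are hypothetical.

```python
def pagerank(adjacency, alpha=0.85, restart=None, n_iter=100):
    """Scores of the random walk with restart, by power iteration.

    adjacency[i] maps each successor j of i to the weight A_ij.
    restart is the distribution mu over all nodes (uniform if None).
    """
    nodes = list(adjacency)
    if restart is None:
        restart = {u: 1 / len(nodes) for u in nodes}
    scores = dict(restart)
    for _ in range(n_iter):
        new = {u: 0.0 for u in nodes}
        restart_mass = 0.0
        for i in nodes:
            w_i = sum(adjacency[i].values())
            if w_i == 0:
                restart_mass += scores[i]        # dead end: forced restart
                continue
            restart_mass += (1 - alpha) * scores[i]
            for j, a_ij in adjacency[i].items():
                new[j] += alpha * scores[i] * a_ij / w_i   # walk step, P_ij
        for u in nodes:
            new[u] += restart_mass * restart[u]
        scores = new
    return scores

# Hypothetical directed graph: D points to C but nobody points to D.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"C": 1},
    "C": {"A": 1},
    "D": {"C": 1},
}

scores = pagerank(graph)
print(max(scores, key=scores.get))   # C, which receives links from A, B and D

# Personalized PageRank: restart only from the seed set S = {D}.
seeds = {"D"}
mu = {u: (1 / len(seeds) if u in seeds else 0.0) for u in graph}
personalized = pagerank(graph, restart=mu)
```

The scores always sum to 1, since each step only redistributes probability mass between nodes.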
Co-ranking
Graph clustering
Characters of Les Miserables
[Figure: graph whose nodes are the characters of Les Miserables (Myriel, Valjean, Fantine, Cosette, Javert, Marius, Gavroche, ...)]
Graph clustering
- A clustering of a graph $G = (V, E)$ with $n$ nodes and $m$ edges is any function $C : V \to \{1, \ldots, K\}$
- In general, $K$ is unknown (unlike in $K$-means) and we look for the best clustering irrespective of the value of $K$
Modularity
The modularity of clustering C is defined by:
$Q(C) = \frac{1}{2m} \sum_{i,j \in V} \left( A_{ij} - \frac{d_i d_j}{2m} \right) \delta_{C(i),C(j)}$
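The modularity formula can be evaluated directly on a small graph. This is a naive sketch with a double loop over all node pairs, assuming an undirected graph stored as a symmetric dictionary of neighbours; the toy graph (two triangles joined by one edge) and the function name are hypothetical.

```python
def modularity(adjacency, clusters):
    """Q(C) = 1/(2m) * sum over i, j of (A_ij - d_i d_j / (2m)) if C(i) == C(j)."""
    degree = {i: sum(adjacency[i].values()) for i in adjacency}
    two_m = sum(degree.values())           # sum of degrees = 2m
    q = 0.0
    for i in adjacency:
        for j in adjacency:
            if clusters[i] == clusters[j]:
                q += adjacency[i].get(j, 0) - degree[i] * degree[j] / two_m
    return q / two_m

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = {u: {} for u in range(6)}
for u, v in edges:
    adjacency[u][v] = 1
    adjacency[v][u] = 1

two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {u: "a" for u in adjacency}

print(modularity(adjacency, two_triangles))   # 5/14, about 0.357
print(modularity(adjacency, one_cluster))     # 0 up to rounding
```

Putting all nodes in a single cluster always gives a modularity of 0, which is why the clustering into two triangles scores higher.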
Maximizing the modularity
Consider the following problem:
$\max_C Q(C)$
- Combinatorial problem!
The Louvain algorithm
Greedy algorithm:
1. (Initialization) C ← identity (each node starts in its own cluster)
2. (Maximization) While modularity Q(C ) increases, update C by moving one node from one cluster to another
3. (Aggregation) Merge all nodes belonging to the same cluster into a single node, update the weights accordingly and apply step 2 to the aggregate graph
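Steps 1 and 2 can be sketched as follows. This is a deliberately simplified illustration, not the Louvain algorithm as implemented in libraries: modularity is recomputed from scratch after each tentative move (real implementations compute the gain incrementally), and the aggregation of step 3 is left out. The graph representation and function names are hypothetical.

```python
def modularity(adjacency, clusters):
    """Q(C) for an undirected graph given as a symmetric dict of neighbours."""
    degree = {i: sum(adjacency[i].values()) for i in adjacency}
    two_m = sum(degree.values())
    q = 0.0
    for i in adjacency:
        for j in adjacency:
            if clusters[i] == clusters[j]:
                q += adjacency[i].get(j, 0) - degree[i] * degree[j] / two_m
    return q / two_m

def local_moves(adjacency):
    """Steps 1 and 2 only: move single nodes while modularity increases."""
    clusters = {u: u for u in adjacency}          # step 1: C <- identity
    best = modularity(adjacency, clusters)
    improved = True
    while improved:                               # step 2
        improved = False
        for u in adjacency:
            for v in adjacency[u]:
                target = clusters[v]              # try the cluster of a neighbour
                if target == clusters[u]:
                    continue
                previous = clusters[u]
                clusters[u] = target
                q = modularity(adjacency, clusters)
                if q > best + 1e-12:              # keep strictly improving moves
                    best = q
                    improved = True
                else:
                    clusters[u] = previous        # otherwise undo the move
    return clusters, best

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = {u: {} for u in range(6)}
for u, v in edges:
    adjacency[u][v] = 1
    adjacency[v][u] = 1

clusters, q = local_moves(adjacency)
print(clusters, q)   # the two triangles end up in two different clusters
```

Only clusters of neighbours are tried, since moving a node to a cluster it has no edge to cannot increase its contribution to the modularity.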
Extensions
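- Weighted graphs: $Q(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$
- Resolution parameter: $Q_\gamma(C) = \frac{1}{w} \sum_{i,j \in V} \left( A_{ij} - \gamma \frac{w_i w_j}{w} \right) \delta_{C(i),C(j)}$
- Directed graphs: seen as bipartite graphs, i.e., $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$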
Graph embedding
How to transform graph data into vector data, so as to preserve the proximity between nodes?
Back to random walks
- $P_{ij} = A_{ij} / w_i$, probability of moving from $i$ to $j$
- A Markov chain with transition matrix $P = D^{-1} A$ with $D = \mathrm{diag}(w)$
Spectral analysis
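Spectral decomposition $PV = V \Lambda$, $V^T D V = I$ where
- $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_1 = 1 > \lambda_2 \geq \ldots \geq \lambda_n \geq -1$
- $V = (v_1, \ldots, v_n)$ with $v_1 \propto \mathbf{1}$
Notes:
- If the graph is disconnected, with $k$ connected components, then $\lambda_1 = \ldots = \lambda_k = 1 > \lambda_{k+1}$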
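One way to see why this decomposition exists with real eigenvalues, assuming an undirected graph ($A$ symmetric) with no isolated node ($w_i > 0$), neither of which is stated explicitly on the slide: $P$ is similar to a symmetric matrix, written $S$ here (a notation introduced only for this derivation), with orthonormal eigenvectors $U$.

```latex
\begin{align*}
S &= D^{1/2} P D^{-1/2} = D^{-1/2} A D^{-1/2}, \qquad S^T = S, \\
S U &= U \Lambda, \qquad U^T U = I, \\
V &= D^{-1/2} U \;\Longrightarrow\; P V = D^{-1/2} S U = V \Lambda,
\qquad V^T D V = U^T U = I .
\end{align*}
```

The eigenvalues of $P$ are therefore those of $S$, and the normalization $V^T D V = I$ is the orthonormality of $U$ in disguise.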
Example
Barbell
Stochastic block model
Spectral embedding
Embedding in dimension k is obtained from the k + 1 leading eigenvectors of P, skipping the first one (which is constant)
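For k = 1, the embedding is the second eigenvector of P. The sketch below approximates it without any library, using power iteration on the lazy walk (I + P) / 2, a swapped-in technique chosen to keep the example dependency-free; in practice an eigensolver would compute all k + 1 eigenvectors at once. It assumes a connected undirected graph and a starting vector that is not orthogonal to the second eigenvector. The function name and the toy graph are hypothetical.

```python
def second_eigenvector(adjacency, n_iter=500):
    """Approximate the second eigenvector of P = D^-1 A by power iteration."""
    nodes = list(adjacency)
    w = {i: sum(adjacency[i].values()) for i in nodes}
    total = sum(w.values())
    x = {i: float(k) for k, i in enumerate(nodes)}   # arbitrary starting vector
    for _ in range(n_iter):
        # Remove the component along the constant vector (D inner product).
        mean = sum(w[i] * x[i] for i in nodes) / total
        x = {i: x[i] - mean for i in nodes}
        # One step of the lazy walk (I + P) / 2, with eigenvalues in [0, 1].
        x = {i: 0.5 * x[i]
                + 0.5 * sum(a * x[j] for j, a in adjacency[i].items()) / w[i]
             for i in nodes}
        # Normalise so that x^T D x = 1.
        norm = sum(w[i] * x[i] ** 2 for i in nodes) ** 0.5
        x = {i: x[i] / norm for i in nodes}
    return x

# Hypothetical barbell-like graph: triangles {0, 1, 2} and {3, 4, 5}, edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = {u: {} for u in range(6)}
for u, v in edges:
    adjacency[u][v] = 1
    adjacency[v][u] = 1

embedding = second_eigenvector(adjacency)
print(embedding)   # nodes 0, 1, 2 get one sign, nodes 3, 4, 5 the other
```

The sign of the coordinate already separates the two triangles, which is the kind of proximity the embedding is meant to preserve.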
Extensions
- Various normalizations can be applied to the eigenvectors (depending on the eigenvalues)
- Embedding on the unit sphere → cosine similarity through dot product
- Bipartite graphs → co-embedding
- Directed graphs: seen as bipartite graphs, i.e., $\begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}$
Getting inspiration from language processing:
word2vec
Goal: Predict context words
How? Extract vector representations of words in a text
$\cos(x, y) = \frac{x \cdot y}{\|x\| \|y\|} \in [-1, 1]$
Two models: CBOW vs skip-gram
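The cosine similarity used to compare word vectors is short enough to write out. A minimal sketch on plain Python lists; the vectors are made up for illustration and both are assumed to be non-zero.

```python
import math

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||), assuming x and y are non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine([1, 2], [2, 4]))    # same direction: close to 1
print(cosine([1, 0], [0, 1]))    # orthogonal: 0
print(cosine([1, 0], [-1, 0]))   # opposite directions: -1
```

Only the direction of the vectors matters, not their length, which is why embeddings are often normalized to the unit sphere before comparison.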
On graphs: node2vec
Text: A special graph
Voyez → ce → koala → fou → qui → mange → des → journaux... (French: "See this crazy koala that eats newspapers...")
Node2vec
Actually, biased random walks
- Walk length: how many nodes are in each random walk
- p: return parameter
- q: breadth-depth parameter
Objective: $\max_f \sum_u \log \Pr(N(u) \mid f(u))$
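A sketch of one biased walk, following the weighting of the node2vec paper: from the current node, going back to the previous node has weight 1/p, moving to a neighbour of the previous node has weight 1, and moving further away has weight 1/q. The graph is assumed unweighted and stored as a dictionary of neighbour sets; the function name, the toy graph and the parameter values are hypothetical. Learning the embedding f from the walks (the skip-gram step) is not shown.

```python
import random

def biased_walk(adjacency, start, walk_length, p=1.0, q=1.0, rng=None):
    """One node2vec-style walk containing at most walk_length nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbours = list(adjacency[current])
        if not neighbours:
            break                              # dead end: stop the walk
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))   # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for x in neighbours:
            if x == previous:
                weights.append(1 / p)          # return to the previous node
            elif x in adjacency[previous]:
                weights.append(1.0)            # stay close (breadth-like move)
            else:
                weights.append(1 / q)          # move away (depth-like move)
        walk.append(rng.choices(neighbours, weights=weights)[0])
    return walk

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = {u: set() for u in range(6)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

walk = biased_walk(graph, start=0, walk_length=10, p=0.5, q=2.0,
                   rng=random.Random(0))
print(walk)
```

With a small p and a large q the walk tends to stay near its starting point; with a large p and a small q it tends to explore further away.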
Summary
Many data have a graph structure, which requires suitable data structures and algorithms: