CIS 700:
βalgorithms for Big Dataβ
Grigory Yaroslavtsev
http://grigory.us
Lecture 6: Graph Sketching
Slides at http://grigory.us/big-data-class.html
Sketching Graphs?
β’ We know how to sketch vectors: π£ β ππ£
β’ How about sketching graphs?
β’ πΊ π, πΈ β‘ π΄πΊ (adjacency matrix): π΄πΊ β ππ΄πΊ
β’ Sketch columns of π΄πΊ
β’ π = π , π = |πΈ|
β’ π(ππππ¦(log π)) sketch per vertex / π (π) total
β Check connectivity β Check bipartiteness
β’ As always, space rather than dimension. Why?
Graph Streams
β’ Semi-streaming model: [Muthukrishnan β05; Feigenbaum, Kannan, McGregor, Suri, Zhangβ05]
β Graph defined by the stream of edges π1, β¦ , ππ β Space π π , edges processed in order
β Connectivity is easy on π (π) space for insertion-only
β’ Dynamic graphs:
β Stream of insertion/deletion updates
+ ππ1, βππ2, β¦ , βπππ‘ (assume sequence is correct) β Resulting graph has edge ππ if it wasnβt deleted after
the last insertion
β’ Linear sketching dynamic graphs:
ππ΄πΊβπ = ππ΄πΊ β MAe
Distributed Computing
β’ Linear sketches for distributed processing
β’ π servers with o(m) memory:
β Send π/π edges (πΈ1, β¦ , πΈπ ) to each server β Compute sketches ππΈ1, β¦ , ππΈπ locally
β Send sketches to a central server β Compute ππ΄πΊ = ππΈπ π π
β’ π has to have a small representation (same issue as in streaming)
Connectivity
β’ Thm. Connectivity is sketchable in π (π) space
β’ Framework:
β Take existing connectivity algorithm (Boruvka) β Sketch π΄πΊ β ππ΄πΊ
β Run Boruvka on ππ΄πΊ
β’ Important that the sketch is homomorphic w.r.t the algorithm
Part 1: Parallel Connectivity (Boruvka)
β’ Repeat until no edges left:
β For each vertex, select any incident edge β Contract selected edges
β’ Lemma: process converges in π(log π) steps
Part 2: Graph Representation
β’ For a vertex π let ππ be a vector in β π2
β’ Non-zero entries for edges π, π
β ππ π, π = 1 if π > π β ππ π, π = β1 if j < π
β’ Example:
π1 = 1, 1, 1, 1, 0, β¦ , 0
π2 = β1, 0, 0, 0, 0, 0, 1, 0, 1, β¦ , 0
β’ Lem: For any π β π supp πβπ ππ = πΈ(π, π β π)
1
2 6
4 7
5 3
1,2 , 1,3 , 1,4 , 1,5 , 1,6 , 1,7 , 2,3 , 2.4 . 2,5 , β¦
Part 3: πΏ
0-Sampling
β’ There is a distribution over π β βπΓπ with π = π(log2 π) such w.p. 9/10 that βπ β βπ:
πΆπ β π β π π’ππ(π)
[Cormode, Muthukrishnan, Rozenbaumβ05; Jowhari, Saglam, Tardos β11]
β’ Constant probability suffices β still π log π Boruvka iterations
Final Algorithm
β’ Construct log π β0-samplers for each ππ
β’ Run Boruvka on sketches:
β Use πΆ1ππ to get an edge incident on a node π β For π = 2 to π‘:
β’ To get incident edge on a component π β π use:
πΆπππ = πΆπ ππ
πβπ
β
πβπ
β π β π π’ππ ππ
πβπ
= πΈ(π, π β π)
K-Connectivity
β’ Graph is π-connected is every cut has size β₯ π
β’ Thm: There is a π(ππ log3 π)-size linear sketch for k-connectivity
β’ Generalization: There is an π(π log5 π /π2)-
size linear sketch which allows to approximate all cuts in a graph up to error (1 Β± π)
K-connectivity Algorithm
β’ Algorithm for π-connectivity:
β Let πΉ1 be a spanning forest of πΊ(π, πΈ) β For π = 2, β¦ , π
β’ Let πΉπ be a spanning forest of πΊ(π, πΈ β πΉ1 β β― β πΉπβ1)
β’ Lem: πΊ(π, πΉ1 + β― + πΉπ) is k-connected iff G(V,E) is.
β’ β Trivial
β’ β Consider a cut in πΊ(π, ππ=1 πΉπ) of size < π
β βπβ: this cut didnβt grow in step πβ
β there is a cut in πΊ(π, πΈ) of size < π
β contradiction
K-connectivity Algorithm
β’ Construct π independent linear sketches
*π1π΄πΊ, π2π΄πΊ β¦ , πππ΄πΊ+ for connectivity
β’ Run π-connectivity algorithm on sketches:
β Use π1π΄πΊ to get a spanning forest πΉ1 of πΊ β Use π2π΄πΊ β π2π΄πΉ1 = π2(π΄πΊβπΉ1) to find πΉ2
β Use π3π΄πΊ β π3π΄πΉ1 β π3π΄πΉ2 = π3(π΄πΊβπΉ1βπΉ2) to find πΉ3
β β¦
Bipartiteness
β’ Reduction: Given πΊ define πΊβ² where vertices
π£ β (π£1, π£2); edges π’, π£ β (π’1, π£2) & (π’2, π£1)
β’ Lem: # connected components doubles iff the graph is bipartite.
β’ Thm: π(π log3 π)-size linear sketch for k- connectivity (sketch πΊβ² (implicitly).)
β β
Minimum Spanning Tree
β’ If ππ = # connected components in a subgraph induced by edges of weight β€ 1 + π π:
π€ πππ β€ π β 1 + π π + ππππ β€ 1 + π π€ πππ
π=0β¦πβ1
where ππ = ( 1 + π π+1β 1 + π π
β’ cc(G) = #connected components of πΊ
β’ Round weights up to the nearest power of 1 + π
β’ πΊπ β‘ subgraph with edges of weight β€ 1 + π π
β’ Edges taken by the Kruskalβs algorithm:
β n β cc(πΊ0) edges of weight 1
β ππ πΊ0 β ππ(πΊ1) edges of weight (1 + π) β β¦
β cc πΊπβ1 β cc πΊπ edges of weight 1 + π π
Minimum Spanning Tree
β’ Let π = log1+π π where π = max edge weight
β’ Overall weight:
π β ππ πΊ0 + 1 + ππ1 π (ππ πΊπβ1 β ππ(πΊπ))
= π β 1 + π π + ( 1 + π π+1β 1 + π π) ππ(πΊπ)
πβ1
0
β’ Thm: 1 + π -approx. MST weight can be computed with π π linear sketch for π = ππππ¦(π)
MST: Single Linkage Clustering
β’ [Zahnβ71] Clustering via MST (Single-linkage):
k clusters: remove π β π longest edges from MST
β’ Maximizes minimum intercluster distance
[Kleinberg, Tardos]
Cut Sparsification
β’ Two problems:
β Approximating Min-Cut in the graph (up to 1 Β± π) β Preserving all cuts in the graph (up to 1 Β± π)
β’ General cut sparsification framework:
β Sample each edge π with probability ππ β Assign sampled edges weights 1/ππ
β’ Expected weight of each cut is preserved, but too many cuts β canβt take union bound
Cut Sparsification
β’ For an edge π let ππ = weight of the minimum cut that contains π
β’ π = size of the Min-Cut in G
β’ Thm [Fung et al.]: If πΊ is an undirected weighted graph the if ππ β₯ min πΆ logπ 2 π
π π2 , 1 then the cut sparsification alg. Preserves
weights of all cuts up to (1 Β± π)
β’ Thm [Karger]: ππ β₯ min πΆ log ππ π2 , 1 preserves Min-Cut up to (1 Β± π)
Minimum Cut
Algorithm:
β’ For π = 0,1, β¦ , 2 log π :
β Let πΊπ be the subgraph of πΊ where each edge is sampled with probability 1/2π
β Let π»π = F1, β¦ , πΉπ where π = π π12 β log π and πΉπ are forests constructed by the k-connectivity alg.
β’ Return 2ππ(π»π) where π = min *π βΆ π π»π < π+
Space: π π logπ24 π , works for dynamic graph streams
Minimum Cut: Analysis
β’ Key property: If πΊπ has β€ π edges across a cut then π»π contains all such edges
β’ πβ = log max 1, ππ2
6 log π
β’ π β€ πβ β ππ β₯ min 6 log π
ππ2 , 1 β min cut in πΊπ is approximating min-cut in πΊ up to (1 Β± π)
β’ π = πβ: By Chernoff bound # edges in πΊπβ that crosses min-cut in πΊ is π π12 log π β€ π w.h.p.
Cut Sparsification
Algorithm:
β’ For π = 0,1, β¦ , 2 log π :
β Let πΊπ be the subgraph of πΊ where each edge is sampled with probability 1/2π
β Let π»π = F1, β¦ , πΉπ where π = π π12 β log2 π and πΉπ are forests constructed by the k-connectivity alg.
β’ For each edge π let ππ = min i: ππ π»π < π .
β’ If π β π»ππ then add e to the sparsifier with weight 2ππ
β’ Space: π π logπ25 π , works for dynamic graph streams
β’ Analysis similar to the Min-Cut using [Fung et al.]