• No results found

CIS 700: algorithms for Big Data

N/A
N/A
Protected

Academic year: 2021

Share "CIS 700: algorithms for Big Data"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

CIS 700:

β€œalgorithms for Big Data”

Grigory Yaroslavtsev

http://grigory.us

Lecture 6: Graph Sketching

Slides at http://grigory.us/big-data-class.html

(2)

Sketching Graphs?

β€’ We know how to sketch vectors: 𝑣 β†’ 𝑀𝑣

β€’ How about sketching graphs?

β€’ 𝐺 𝑉, 𝐸 ≑ 𝐴𝐺 (adjacency matrix): 𝐴𝐺 β†’ 𝑀𝐴𝐺

β€’ Sketch columns of 𝐴𝐺

β€’ 𝑛 = 𝑉 , π‘š = |𝐸|

β€’ 𝑂(π‘π‘œπ‘™π‘¦(log 𝑛)) sketch per vertex / 𝑂 (𝑛) total

– Check connectivity – Check bipartiteness

β€’ As always, space rather than dimension. Why?

(3)

Graph Streams

β€’ Semi-streaming model: [Muthukrishnan ’05; Feigenbaum, Kannan, McGregor, Suri, Zhang’05]

– Graph defined by the stream of edges 𝑒1, … , π‘’π‘š – Space 𝑂 𝑛 , edges processed in order

– Connectivity is easy on 𝑂 (𝑛) space for insertion-only

β€’ Dynamic graphs:

– Stream of insertion/deletion updates

+ 𝑒𝑖1, βˆ’π‘’π‘–2, … , βˆ’π‘’π‘–π‘‘ (assume sequence is correct) – Resulting graph has edge 𝑒𝑖 if it wasn’t deleted after

the last insertion

β€’ Linear sketching dynamic graphs:

π‘€π΄πΊβˆ–π‘’ = 𝑀𝐴𝐺 βˆ’ MAe

(4)

Distributed Computing

β€’ Linear sketches for distributed processing

β€’ 𝑆 servers with o(m) memory:

– Send π‘š/𝑆 edges (𝐸1, … , 𝐸𝑠) to each server – Compute sketches 𝑀𝐸1, … , 𝑀𝐸𝑠 locally

– Send sketches to a central server – Compute 𝑀𝐴𝐺 = 𝑀𝐸𝑠𝑖 𝑖

β€’ 𝑀 has to have a small representation (same issue as in streaming)

(5)

Connectivity

β€’ Thm. Connectivity is sketchable in 𝑂 (𝑛) space

β€’ Framework:

– Take existing connectivity algorithm (Boruvka) – Sketch 𝐴𝐺 β†’ 𝑀𝐴𝐺

– Run Boruvka on 𝑀𝐴𝐺

β€’ Important that the sketch is homomorphic w.r.t the algorithm

(6)

Part 1: Parallel Connectivity (Boruvka)

β€’ Repeat until no edges left:

– For each vertex, select any incident edge – Contract selected edges

β€’ Lemma: process converges in 𝑂(log 𝑛) steps

(7)

Part 2: Graph Representation

β€’ For a vertex 𝑖 let π‘Žπ‘– be a vector in ℝ 𝑛2

β€’ Non-zero entries for edges 𝑖, 𝑗

– π‘Žπ‘– 𝑖, 𝑗 = 1 if 𝑗 > 𝑖 – π‘Žπ‘– 𝑖, 𝑗 = βˆ’1 if j < 𝑖

β€’ Example:

π‘Ž1 = 1, 1, 1, 1, 0, … , 0

π‘Ž2 = βˆ’1, 0, 0, 0, 0, 0, 1, 0, 1, … , 0

β€’ Lem: For any 𝑆 βŠ† 𝑉 supp π‘–βˆˆπ‘† π‘Žπ‘– = 𝐸(𝑆, 𝑉 βˆ– 𝑆)

1

2 6

4 7

5 3

1,2 , 1,3 , 1,4 , 1,5 , 1,6 , 1,7 , 2,3 , 2.4 . 2,5 , …

(8)

Part 3: 𝐿

0

-Sampling

β€’ There is a distribution over 𝑀 ∈ β„π‘‘Γ—π‘š with 𝑑 = 𝑂(log2 π‘š) such w.p. 9/10 that βˆ€π‘Ž ∈ β„π‘š:

πΆπ‘Ž β†’ 𝑒 ∈ 𝑠𝑒𝑝𝑝(π‘Ž)

[Cormode, Muthukrishnan, Rozenbaum’05; Jowhari, Saglam, Tardos β€˜11]

β€’ Constant probability suffices βˆ’ still 𝑂 log 𝑛 Boruvka iterations

(9)

Final Algorithm

β€’ Construct log 𝑛 β„“0-samplers for each π‘Žπ‘–

β€’ Run Boruvka on sketches:

– Use 𝐢1π‘Žπ‘— to get an edge incident on a node 𝑗 – For 𝑖 = 2 to 𝑑:

β€’ To get incident edge on a component 𝑆 βŠ† 𝑉 use:

πΆπ‘–π‘Žπ‘— = 𝐢𝑖 π‘Žπ‘—

π‘—βˆˆπ‘†

β†’

π‘—βˆˆπ‘†

β†’ 𝑒 ∈ 𝑠𝑒𝑝𝑝 π‘Žπ‘—

π‘—βˆˆπ‘†

= 𝐸(𝑆, 𝑉 βˆ– 𝑆)

(10)

K-Connectivity

β€’ Graph is π‘˜-connected is every cut has size β‰₯ π‘˜

β€’ Thm: There is a 𝑂(π‘›π‘˜ log3 𝑛)-size linear sketch for k-connectivity

β€’ Generalization: There is an 𝑂(𝑛 log5 𝑛 /πœ–2)-

size linear sketch which allows to approximate all cuts in a graph up to error (1 Β± πœ–)

(11)

K-connectivity Algorithm

β€’ Algorithm for π‘˜-connectivity:

– Let 𝐹1 be a spanning forest of 𝐺(𝑉, 𝐸) – For 𝑖 = 2, … , π‘˜

β€’ Let 𝐹𝑖 be a spanning forest of 𝐺(𝑉, 𝐸 βˆ– 𝐹1 βˆ– β‹― βˆ– πΉπ‘–βˆ’1)

β€’ Lem: 𝐺(𝑉, 𝐹1 + β‹― + πΉπ‘˜) is k-connected iff G(V,E) is.

β€’ β‡’ Trivial

β€’ ⇐ Consider a cut in 𝐺(𝑉, π‘˜π‘–=1 𝐹𝑖) of size < π‘˜

β‡’ βˆƒπ‘–βˆ—: this cut didn’t grow in step π‘–βˆ—

β‡’ there is a cut in 𝐺(𝑉, 𝐸) of size < π‘˜

β‡’ contradiction

(12)

K-connectivity Algorithm

β€’ Construct π‘˜ independent linear sketches

*𝑀1𝐴𝐺, 𝑀2𝐴𝐺 … , π‘€π‘˜π΄πΊ+ for connectivity

β€’ Run π‘˜-connectivity algorithm on sketches:

– Use 𝑀1𝐴𝐺 to get a spanning forest 𝐹1 of 𝐺 – Use 𝑀2𝐴𝐺 βˆ’ 𝑀2𝐴𝐹1 = 𝑀2(π΄πΊβˆ’πΉ1) to find 𝐹2

– Use 𝑀3𝐴𝐺 βˆ’ 𝑀3𝐴𝐹1 βˆ’ 𝑀3𝐴𝐹2 = 𝑀3(π΄πΊβˆ’πΉ1βˆ’πΉ2) to find 𝐹3

– …

(13)

Bipartiteness

β€’ Reduction: Given 𝐺 define 𝐺′ where vertices

𝑣 β†’ (𝑣1, 𝑣2); edges 𝑒, 𝑣 β†’ (𝑒1, 𝑣2) & (𝑒2, 𝑣1)

β€’ Lem: # connected components doubles iff the graph is bipartite.

β€’ Thm: 𝑂(𝑛 log3 𝑛)-size linear sketch for k- connectivity (sketch 𝐺′ (implicitly).)

β†’ β†’

(14)

Minimum Spanning Tree

β€’ If 𝑛𝑖 = # connected components in a subgraph induced by edges of weight ≀ 1 + πœ– 𝑖:

𝑀 𝑀𝑆𝑇 ≀ 𝑛 βˆ’ 1 + πœ– π‘Ÿ + πœ†π‘–π‘›π‘– ≀ 1 + πœ– 𝑀 𝑀𝑆𝑇

𝑖=0β€¦π‘Ÿβˆ’1

where πœ†π‘– = ( 1 + πœ– 𝑖+1βˆ’ 1 + πœ– 𝑖

β€’ cc(G) = #connected components of 𝐺

β€’ Round weights up to the nearest power of 1 + πœ–

β€’ 𝐺𝑖 ≑ subgraph with edges of weight ≀ 1 + πœ– 𝑖

β€’ Edges taken by the Kruskal’s algorithm:

– n – cc(𝐺0) edges of weight 1

– 𝑐𝑐 𝐺0 βˆ’ 𝑐𝑐(𝐺1) edges of weight (1 + πœ–) – …

– cc πΊπ‘–βˆ’1 βˆ’ cc 𝐺𝑖 edges of weight 1 + πœ– 𝑖

(15)

Minimum Spanning Tree

β€’ Let π‘Ÿ = log1+πœ– π‘Š where π‘Š = max edge weight

β€’ Overall weight:

𝑛 βˆ’ 𝑐𝑐 𝐺0 + 1 + πœ–π‘Ÿ1 𝑖 (𝑐𝑐 πΊπ‘–βˆ’1 βˆ’ 𝑐𝑐(𝐺𝑖))

= 𝑛 βˆ’ 1 + πœ– π‘Ÿ + ( 1 + πœ– 𝑖+1βˆ’ 1 + πœ– 𝑖) 𝑐𝑐(𝐺𝑖)

π‘Ÿβˆ’1

0

β€’ Thm: 1 + πœ– -approx. MST weight can be computed with 𝑂 𝑛 linear sketch for π‘Š = π‘π‘œπ‘™π‘¦(𝑛)

(16)

MST: Single Linkage Clustering

β€’ [Zahn’71] Clustering via MST (Single-linkage):

k clusters: remove π’Œ βˆ’ 𝟏 longest edges from MST

β€’ Maximizes minimum intercluster distance

[Kleinberg, Tardos]

(17)

Cut Sparsification

β€’ Two problems:

– Approximating Min-Cut in the graph (up to 1 Β± πœ–) – Preserving all cuts in the graph (up to 1 Β± πœ–)

β€’ General cut sparsification framework:

– Sample each edge 𝑒 with probability 𝑝𝑒 – Assign sampled edges weights 1/𝑝𝑒

β€’ Expected weight of each cut is preserved, but too many cuts βˆ’ can’t take union bound

(18)

Cut Sparsification

β€’ For an edge 𝑒 let πœ†π‘’ = weight of the minimum cut that contains 𝑒

β€’ πœ† = size of the Min-Cut in G

β€’ Thm [Fung et al.]: If 𝐺 is an undirected weighted graph the if 𝑝𝑒 β‰₯ min 𝐢 logπœ† 2 𝑛

𝑒 πœ–2 , 1 then the cut sparsification alg. Preserves

weights of all cuts up to (1 Β± πœ–)

β€’ Thm [Karger]: 𝑝𝑒 β‰₯ min 𝐢 log π‘›πœ† πœ–2 , 1 preserves Min-Cut up to (1 Β± πœ–)

(19)

Minimum Cut

Algorithm:

β€’ For 𝑖 = 0,1, … , 2 log 𝑛 :

– Let 𝐺𝑖 be the subgraph of 𝐺 where each edge is sampled with probability 1/2𝑖

– Let 𝐻𝑖 = F1, … , πΉπ‘˜ where π‘˜ = 𝑂 πœ–12 β‹… log 𝑛 and 𝐹𝑖 are forests constructed by the k-connectivity alg.

β€’ Return 2π‘—πœ†(𝐻𝑗) where 𝑗 = min *𝑖 ∢ πœ† 𝐻𝑖 < π‘˜+

Space: 𝑂 𝑛 logπœ–24 𝑛 , works for dynamic graph streams

(20)

Minimum Cut: Analysis

β€’ Key property: If 𝐺𝑖 has ≀ π‘˜ edges across a cut then 𝐻𝑖 contains all such edges

β€’ π‘–βˆ— = log max 1, πœ†πœ–2

6 log 𝑛

β€’ 𝑖 ≀ π‘–βˆ— β‡’ 𝑝𝑒 β‰₯ min 6 log 𝑛

πœ†πœ–2 , 1 β‡’ min cut in 𝐺𝑖 is approximating min-cut in 𝐺 up to (1 Β± πœ–)

β€’ 𝑖 = π‘–βˆ—: By Chernoff bound # edges in πΊπ‘–βˆ— that crosses min-cut in 𝐺 is 𝑂 πœ–12 log 𝑛 ≀ π‘˜ w.h.p.

(21)

Cut Sparsification

Algorithm:

β€’ For 𝑖 = 0,1, … , 2 log 𝑛 :

– Let 𝐺𝑖 be the subgraph of 𝐺 where each edge is sampled with probability 1/2𝑖

– Let 𝐻𝑖 = F1, … , πΉπ‘˜ where π‘˜ = 𝑂 πœ–12 β‹… log2 𝑛 and 𝐹𝑖 are forests constructed by the k-connectivity alg.

β€’ For each edge 𝑒 let 𝑗𝑒 = min i: πœ†π‘’ 𝐻𝑖 < π‘˜ .

β€’ If 𝑒 ∈ 𝐻𝑗𝑒 then add e to the sparsifier with weight 2𝑗𝑒

β€’ Space: 𝑂 𝑛 logπœ–25 𝑛 , works for dynamic graph streams

β€’ Analysis similar to the Min-Cut using [Fung et al.]

References

Related documents