Cycles and Full Connectivity - Foundations of Data Science 1

This section considers when cycles form and when the graph becomes fully connected. For both of these problems, we look at each subset of $k$ vertices and see when they form either a cycle or a connected component.

4.5.1 Emergence of Cycles

The emergence of cycles in $G(n, p)$ has a threshold at $p = 1/n$.

Theorem 4.13 The threshold for the existence of cycles in $G(n, p)$ is $p = 1/n$.

Proof: Let $x$ be the number of cycles in $G(n, p)$. To form a cycle of length $k$, the vertices can be selected in $\binom{n}{k}$ ways. Given the $k$ vertices of the cycle, they can be ordered by arbitrarily selecting a first vertex, then a second vertex in one of $k-1$ ways, a third in one of $k-2$ ways, etc. Since a cycle and its reversal are the same cycle, divide by 2. Thus, there are $\binom{n}{k}\frac{(k-1)!}{2}$ possible cycles of length $k$ and

\[
E(x) = \sum_{k=3}^{n} \binom{n}{k} \frac{(k-1)!}{2} p^k
\le \sum_{k=3}^{n} \frac{n^k}{2k} p^k
\le \sum_{k=3}^{n} (np)^k
= (np)^3 \, \frac{1-(np)^{n-2}}{1-np}
\le 2(np)^3,
\]

provided that $np < 1/2$. When $p$ is asymptotically less than $1/n$, then $\lim_{n\to\infty} np = 0$ and $\lim_{n\to\infty} \sum_{k=3}^{n} (np)^k = 0$. So, as $n$ goes to infinity, $E(x)$ goes to zero. Thus, the graph almost surely has no cycles by the first moment method. A second moment argument can be used to show that for $p = d/n$, $d > 1$, a graph will have a cycle with probability tending to one.
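The chain of inequalities above can be checked numerically. The following is a minimal sketch, not part of the original text; the function names `expected_cycles` and `geometric_sum` are illustrative. It accumulates $E(x)$ term by term, using the identity $\binom{n}{k}\frac{(k-1)!}{2} = \frac{n(n-1)\cdots(n-k+1)}{2k}$ to avoid large factorials.

```python
def expected_cycles(n: int, p: float) -> float:
    """E(x) = sum_{k=3}^{n} C(n,k) * (k-1)!/2 * p^k, accumulated term by term."""
    total = 0.0
    falling = 1.0  # running value of n(n-1)...(n-k+1) * p^k
    for k in range(1, n + 1):
        falling *= (n - k + 1) * p
        if k >= 3:
            # C(n,k) * (k-1)! / 2 equals n(n-1)...(n-k+1) / (2k)
            total += falling / (2 * k)
    return total


def geometric_sum(n: int, p: float) -> float:
    """sum_{k=3}^{n} (np)^k in closed form (assumes np != 1)."""
    r = n * p
    return r**3 * (1 - r ** (n - 2)) / (1 - r)


if __name__ == "__main__":
    n = 500
    p = 0.4 / n  # np = 0.4 < 1/2, so the final bound 2(np)^3 applies
    print(expected_cycles(n, p), geometric_sum(n, p), 2 * (n * p) ** 3)
```

With $p = 1$ the function simply counts the cycles of the complete graph, which gives a quick sanity check on small $n$.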

The argument above does not yield a sharp threshold since we argued that $E(x) \to 0$ only under the assumption that $p$ is asymptotically less than $1/n$. A sharp threshold requires $E(x) \to 0$ for $p = d/n$, $d < 1$.

Consider what happens in more detail when $p = d/n$, $d$ a constant.

\[
E(x) = \sum_{k=3}^{n} \binom{n}{k} \frac{(k-1)!}{2} p^k
= \frac{1}{2} \sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{k!} (k-1)! \, p^k
= \frac{1}{2} \sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{n^k} \, \frac{d^k}{k}.
\]

$E(x)$ converges if $d < 1$, and diverges if $d \ge 1$. If $d < 1$, $E(x) \le \frac{1}{2} \sum_{k=3}^{n} \frac{d^k}{k}$ and $\lim_{n\to\infty} E(x)$ equals a constant greater than zero. If $d = 1$, $E(x) = \frac{1}{2} \sum_{k=3}^{n} \frac{n(n-1)\cdots(n-k+1)}{n^k} \frac{1}{k}$. Consider only the first $\log n$ terms of the sum. Since $\frac{n}{n-i} = 1 + \frac{i}{n-i} \le e^{i/(n-i)}$, it follows that $\frac{n(n-1)\cdots(n-k+1)}{n^k} \ge 1/2$ for these terms. Thus,

\[
E(x) \ge \frac{1}{2} \sum_{k=3}^{\log n} \frac{n(n-1)\cdots(n-k+1)}{n^k} \, \frac{1}{k}
\ge \frac{1}{4} \sum_{k=3}^{\log n} \frac{1}{k}.
\]

Then, in the limit as $n$ goes to infinity,

\[
\lim_{n\to\infty} E(x) \ge \lim_{n\to\infty} \frac{1}{4} \sum_{k=3}^{\log n} \frac{1}{k} = \infty,
\]

since the harmonic sum up to $\log n$ grows like $\log\log n$.

For $p = d/n$, $d < 1$, $E(x)$ converges to a nonzero constant and, with some nonzero probability, graphs will have a constant number of cycles independent of the size of the graph. For $d > 1$, $E(x)$ tends to infinity and a second moment argument shows that graphs will have an unbounded number of cycles increasing with $n$.

Property                                            Threshold
cycles                                              1/n
giant component                                     1/n
giant component + isolated vertices                 (1/2) ln n / n
connectivity, disappearance of isolated vertices    ln n / n
diameter two                                        sqrt(2 ln n / n)
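The contrast between $d < 1$ and $d = 1$ can be seen numerically. The sketch below is illustrative only (the names `expected_cycles_at` and `limit_for_d_below_one` are not from the text). For $d < 1$ it compares $E(x)$ with the closed form $\frac{1}{2}\left(-\ln(1-d) - d - \frac{d^2}{2}\right)$, which follows from the standard series $\sum_{k\ge 1} d^k/k = -\ln(1-d)$ after dropping the $k = 1, 2$ terms.

```python
from math import log


def expected_cycles_at(n: int, d: float) -> float:
    """E(x) for p = d/n, as (1/2) * sum_k [n(n-1)...(n-k+1) / n^k] * d^k / k."""
    total = 0.0
    ratio = 1.0  # running value of [n(n-1)...(n-k+1) / n^k] * d^k
    for k in range(1, n + 1):
        ratio *= (n - k + 1) / n * d
        if k >= 3:
            total += ratio / (2 * k)
    return total


def limit_for_d_below_one(d: float) -> float:
    """(1/2) * sum_{k>=3} d^k / k, valid for 0 <= d < 1."""
    return 0.5 * (-log(1 - d) - d - d * d / 2)


if __name__ == "__main__":
    for n in (10**2, 10**3, 10**4, 10**5):
        # d = 0.5 settles near a constant; d = 1 keeps growing with n
        print(n, expected_cycles_at(n, 0.5), expected_cycles_at(n, 1.0))
    print("limit for d = 0.5:", limit_for_d_below_one(0.5))
```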

4.5.2 Full Connectivity

As $p$ increases from $p = 0$, small components form. At $p = 1/n$ a giant component emerges and swallows up smaller components, starting with the larger components and ending up swallowing isolated vertices, forming a single connected component at $p = \frac{\ln n}{n}$, at which point the graph becomes connected. We begin our development with a technical lemma.

Lemma 4.14 The expected number of connected components of size $k$ in $G(n, p)$ is at most $\binom{n}{k} k^{k-2} p^{k-1} (1-p)^{kn-k^2}$.

Proof: The probability that $k$ vertices form a connected component consists of the product of two probabilities. The first is the probability that the $k$ vertices are connected, and the second is the probability that there are no edges out of the component to the remainder of the graph. The first probability is at most the sum, over all spanning trees of the $k$ vertices, of the probability that the edges of the spanning tree are present. The "at most" in the lemma statement is because $G(n, p)$ may contain more than one spanning tree on these nodes and, in this case, the union bound is higher than the actual probability. There are $k^{k-2}$ spanning trees on $k$ nodes. See Section 11.7.6 in the appendix. The probability of all the $k-1$ edges of one spanning tree being present is $p^{k-1}$ and the probability that there are no edges connecting the $k$ vertices to the remainder of the graph is $(1-p)^{k(n-k)}$. Thus, the probability of one particular set of $k$ vertices forming a connected component is at most $k^{k-2} p^{k-1} (1-p)^{kn-k^2}$. Summing over the $\binom{n}{k}$ choices of the vertex set, the expected number of connected components of size $k$ is at most $\binom{n}{k} k^{k-2} p^{k-1} (1-p)^{kn-k^2}$.
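The count of $k^{k-2}$ spanning trees used in the proof can be confirmed by brute force for small $k$. This is a minimal sketch for illustration (the function name `count_spanning_trees` is hypothetical): it enumerates every set of $k-1$ edges on $k$ labeled vertices and keeps those that connect all vertices, since $k-1$ edges that connect $k$ vertices necessarily form a tree.

```python
from itertools import combinations


def _find(parent: list[int], u: int) -> int:
    """Union-find root lookup with path halving."""
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u


def count_spanning_trees(k: int) -> int:
    """Count labeled trees on k vertices by testing every (k-1)-edge subset."""
    all_edges = list(combinations(range(k), 2))
    count = 0
    for edges in combinations(all_edges, k - 1):
        parent = list(range(k))
        components = k
        for a, b in edges:
            ra, rb = _find(parent, a), _find(parent, b)
            if ra != rb:
                parent[ra] = rb
                components -= 1
        if components == 1:  # k-1 edges joining all k vertices: a spanning tree
            count += 1
    return count


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, count_spanning_trees(k), k ** (k - 2))
```

The enumeration grows quickly with $k$, so it is only practical for very small values.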

We now prove that for $p = \frac{1}{2} \frac{\ln n}{n}$, the giant component has absorbed all small components except for isolated vertices.

Theorem 4.15 Let $p = c \frac{\ln n}{n}$. For $c > 1/2$, almost surely there are only isolated vertices and a giant component. For $c > 1$, almost surely the graph is connected.

Proof: We prove that almost surely for $c > 1/2$, there is no connected component with $k$ vertices for any $k$, $2 \le k \le n/2$. This proves the first statement of the theorem since, if there were two or more components that are not isolated vertices, both of them could not be of size greater than $n/2$. The second statement, that for $c > 1$ the graph is connected, then follows from Theorem 4.6, which states that isolated vertices disappear at $c = 1$.
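The two regimes in the theorem can be explored empirically on small graphs. The following sketch is an illustration under stated assumptions, not part of the proof: it samples $G(n, p)$ edge by edge with Python's `random.Random` and tracks components with a union-find structure (the function name `component_sizes` is hypothetical). Since the theorem is asymptotic, small $n$ only approximates the stated behavior.

```python
import random
from collections import Counter
from math import log


def component_sizes(n: int, p: float, rng: random.Random) -> list[int]:
    """Sample G(n, p) and return its component sizes, largest first."""
    parent = list(range(n))

    def find(u: int) -> int:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # edge {u, v} is present with probability p
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv

    return sorted(Counter(find(u) for u in range(n)).values(), reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    for c in (0.3, 0.75, 1.5, 3.0):
        sizes = component_sizes(n, c * log(n) / n, rng)
        # Show the giant component and the few largest remaining components
        print(c, sizes[:5], "components:", len(sizes))
```

For $1/2 < c < 1$ one would expect the non-giant components to be mostly isolated vertices, and for larger $c$ a single component, though any individual sample at finite $n$ may deviate.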

We now show that for $p = c \frac{\ln n}{n}$, the expected number of components of size $k$, $2 \le k \le n/2$, is less than $n^{1-2c}$ and thus for $c > 1/2$ there are no components, except for isolated vertices and the giant component. Let $x_k$ be the number of connected components of size $k$. Substitute $p = c \frac{\ln n}{n}$ into $\binom{n}{k} k^{k-2} p^{k-1} (1-p)^{kn-k^2}$ and simplify using $\binom{n}{k} \le (en/k)^k$, $1-p \le e^{-p}$, $k-1 < k$, and $x = e^{\ln x}$ to get

\[
E(x_k) \le \exp\left( \ln n + k + k \ln\ln n - 2\ln k + k \ln c - ck \ln n + ck^2 \frac{\ln n}{n} \right).
\]

Keep in mind that the leading terms here for large $k$ are the last two and, in fact, at $k = n$, they cancel each other so that our argument does not prove the fallacious statement for $c \ge 1$ that there is no connected component of size $n$, since there is. Let

\[
f(k) = \ln n + k + k \ln\ln n - 2\ln k + k \ln c - ck \ln n + ck^2 \frac{\ln n}{n}.
\]

Differentiating with respect to $k$,

\[
f'(k) = 1 + \ln\ln n - \frac{2}{k} + \ln c - c \ln n + \frac{2ck \ln n}{n}
\quad\text{and}\quad
f''(k) = \frac{2}{k^2} + \frac{2c \ln n}{n} > 0.
\]

Thus, $f$ is convex, and the function $f(k)$ attains its maximum over the range $[2, n/2]$ at one of the extreme points 2 or $n/2$. At $k = 2$, $f(2) \approx (1-2c) \ln n$ and at $k = n/2$, $f(n/2) \approx -c \frac{n}{4} \ln n$. So $f(k)$ is maximum at $k = 2$. For $k = 2$, the bound $E(x_k) \le e^{f(k)}$ is approximately $e^{(1-2c)\ln n} = n^{1-2c}$ and is geometrically falling as $k$ increases from 2. At some point the bound on $E(x_k)$ starts to increase but never gets above $n^{-\frac{c}{4} n}$. Thus, the expected sum of the number of components of size $k$, for $2 \le k \le n/2$, is

\[
E\left( \sum_{k=2}^{n/2} x_k \right) = O(n^{1-2c}).
\]

This expected number goes to zero for $c > 1/2$ and the first-moment method implies that, almost surely, there are no components of size between 2 and $n/2$. This completes the proof of Theorem 4.15.
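The role of $f(k)$ in the proof can be checked numerically. The sketch below is illustrative (the names `log_lemma_bound` and `f` mirror the text but the code is not from the original). It evaluates the logarithm of the Lemma 4.14 bound using `math.lgamma` for $\ln\binom{n}{k}$, compares it with $f(k)$, and locates the maximum of $f$ over $2 \le k \le n/2$. The comparison assumes $c \ln n \ge 1$, which is what the simplification $k - 1 < k$ needs.

```python
from math import lgamma, log, log1p


def log_lemma_bound(n: int, k: int, p: float) -> float:
    """ln of C(n,k) * k^(k-2) * p^(k-1) * (1-p)^(kn - k^2)."""
    log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_binom + (k - 2) * log(k) + (k - 1) * log(p) + (k * n - k * k) * log1p(-p)


def f(n: int, k: int, c: float) -> float:
    """The exponent f(k) from the proof, for p = c ln(n) / n."""
    ln_n = log(n)
    return (ln_n + k + k * log(ln_n) - 2 * log(k) + k * log(c)
            - c * k * ln_n + c * k * k * ln_n / n)


if __name__ == "__main__":
    n = 10**4
    for c in (0.75, 1.0):
        ks = range(2, n // 2 + 1)
        best = max(ks, key=lambda k: f(n, k, c))
        print(c, "argmax:", best, "f(2):", f(n, 2, c), "(1-2c) ln n:", (1 - 2 * c) * log(n))
```

The printed values make visible that $f(2)$ differs from $(1-2c)\ln n$ by the lower-order terms $2 + 2\ln\ln n - 2\ln 2 + 2\ln c + 4c\frac{\ln n}{n}$, so for moderate $n$ and $c$ close to $1/2$ the bound is noticeably larger than $n^{1-2c}$.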

4.5.3 Threshold for O(ln n) Diameter

We now show that within a constant factor of the threshold for graph connectivity, not only is the graph connected, but its diameter is $O(\ln n)$. That is, if $p$ is $\Omega(\ln n / n)$, the diameter of $G(n, p)$ is $O(\ln n)$.

Consider a particular vertex $v$. Let $S_i$ be the set of vertices at distance $i$ from $v$. We argue that as $i$ grows, $|S_1| + |S_2| + \cdots + |S_i|$ grows by a constant factor up to a size of $n/1000$. This implies that in $O(\ln n)$ steps, at least $n/1000$ vertices are connected to $v$. Then, there is a simple argument at the end of the proof of Theorem 4.17 that a pair of $n/1000$ sized subsets, connected to two different vertices $v$ and $w$, have an edge between them.

Lemma 4.16 Consider $G(n, p)$ for sufficiently large $n$ with $p = c \ln n / n$ for any $c > 0$. Let $S_i$ be the set of vertices at distance $i$ from some fixed vertex $v$. If $|S_1| + |S_2| + \cdots + |S_i| \le n/1000$, then

\[
\text{Prob}\Big( |S_{i+1}| < 2\big(|S_1| + |S_2| + \cdots + |S_i|\big) \Big) \le e^{-10|S_i|}.
\]

Proof: Let $|S_i| = k$. For each vertex $u$ not in $S_1 \cup S_2 \cup \ldots \cup S_i$, the probability that $u$ is not in $S_{i+1}$ is $(1-p)^k$ and these events are independent. So, $|S_{i+1}|$ is the sum of $n - (|S_1| + |S_2| + \cdots + |S_i|)$ independent Bernoulli random variables, each with probability

\[
1 - (1-p)^k \ge 1 - e^{-ck \ln n / n}
\]

of being one. Note that $n - (|S_1| + |S_2| + \cdots + |S_i|) \ge 999n/1000$. So,

\[
E(|S_{i+1}|) \ge \frac{999n}{1000} \left( 1 - e^{-ck \frac{\ln n}{n}} \right).
\]

Subtracting $200k$ from each side,

\[
E(|S_{i+1}|) - 200k \ge \frac{n}{2} \left( 1 - e^{-ck \frac{\ln n}{n}} - 400 \frac{k}{n} \right).
\]

Let $\alpha = \frac{k}{n}$ and $f(\alpha) = 1 - e^{-c\alpha \ln n} - 400\alpha$. By differentiation $f''(\alpha) \le 0$, so $f$ is concave and the minimum value of $f$ over the interval $[0, 1/1000]$ is attained at one of the end points. It is easy to check that both $f(0)$ and $f(1/1000)$ are greater than or equal to zero for sufficiently large $n$. Thus, $f$ is nonnegative throughout the interval, proving that $E(|S_{i+1}|) \ge 200|S_i|$. The lemma follows from Chernoff bounds.
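How large "sufficiently large $n$" must be can be read off from the endpoint condition $f(1/1000) \ge 0$, which is equivalent to $c \ln n \ge 1000 \ln(5/3)$. The sketch below is an illustration that takes the constants in the proof at face value; the function names are hypothetical, and $\ln n$ is passed directly (as `ln_n`) so that astronomically large $n$ can be represented.

```python
from math import exp, log


def f(alpha: float, c: float, ln_n: float) -> float:
    """f(alpha) = 1 - exp(-c * alpha * ln n) - 400 * alpha, from the proof."""
    return 1.0 - exp(-c * alpha * ln_n) - 400.0 * alpha


def min_on_interval(c: float, ln_n: float, steps: int = 1000) -> float:
    """Minimum of f over an evenly spaced grid on [0, 1/1000]."""
    return min(f(i / steps / 1000.0, c, ln_n) for i in range(steps + 1))


def ln_n_threshold(c: float) -> float:
    """Smallest ln n for which f(1/1000) >= 0, i.e. n^(-c/1000) <= 0.6."""
    return 1000.0 * log(5.0 / 3.0) / c


if __name__ == "__main__":
    c = 1.0
    print("required ln n:", ln_n_threshold(c))
    print("min of f at ln n = 690:", min_on_interval(c, 690.0))
    print("f(1/1000) at n = 10^6:", f(0.001, c, log(1e6)))
```

Under these constants the condition is only met for extremely large $n$ when $c$ is small, which is consistent with the lemma being an asymptotic statement.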

Theorem 4.17 For $p \ge c \ln n / n$, where $c$ is a sufficiently large constant, almost surely, $G(n, p)$ has diameter $O(\ln n)$.

Proof: By Corollary 4.2, almost surely, the degree of every vertex is $\Omega(np) = \Omega(\ln n)$, which is at least $20 \ln n$ for $c$ sufficiently large. Assume this holds. So, for a fixed vertex $v$, $S_1$ as defined in Lemma 4.16 satisfies $|S_1| \ge 20 \ln n$.

Let $i_0$ be the least $i$ such that $|S_1| + |S_2| + \cdots + |S_i| > n/1000$. From Lemma 4.16 and the union bound, the probability that for some $i$, $1 \le i \le i_0 - 1$, $|S_{i+1}| < 2(|S_1| + |S_2| + \cdots + |S_i|)$ is at most $\sum_{k=20 \ln n}^{n/1000} e^{-10k} \le 1/n^4$. So, with probability at least $1 - (1/n^4)$, each $S_{i+1}$ is at least double the sum of the previous $S_j$'s, which implies that in $O(\ln n)$ steps, $i_0 + 1$ is reached.
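The two quantitative claims in this step can be spelled out as follows. This is a worked sketch added for clarity; the shorthand $T_i = |S_1| + \cdots + |S_i|$ is introduced here and is not notation from the text.

```latex
% Geometric tail: the union bound is far below 1/n^4 for n >= 2.
\begin{align*}
\sum_{k = 20\ln n}^{n/1000} e^{-10k}
  \;\le\; \sum_{k \ge 20\ln n} e^{-10k}
  \;=\; \frac{e^{-200 \ln n}}{1 - e^{-10}}
  \;=\; \frac{n^{-200}}{1 - e^{-10}}
  \;\le\; 2\,n^{-200}
  \;\le\; \frac{1}{n^{4}} .
\end{align*}

% Doubling: write T_i = |S_1| + ... + |S_i|. If |S_{i+1}| >= 2 T_i, then
\begin{align*}
T_{i+1} = T_i + |S_{i+1}| \;\ge\; 3\,T_i
  \quad\Longrightarrow\quad
T_i \;\ge\; 3^{\,i-1}\,|S_1| ,
\end{align*}
% so T_i exceeds n/1000 after at most about log_3 n = O(ln n) steps.
```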

Consider any other vertex $w$. We wish to find a short $O(\ln n)$ length path between $v$ and $w$. By the same argument as above, the number of vertices at distance $O(\ln n)$ from $w$ is at least $n/1000$. To complete the argument, either these two sets intersect, in which case we have found a path from $v$ to $w$ of length $O(\ln n)$, or they do not intersect. In the latter case, with high probability there is some edge between them. For a pair of disjoint sets of size at least $n/1000$, the probability that none of the possible $n^2/10^6$ or more edges between them is present is at most $(1-p)^{n^2/10^6} = e^{-\Omega(n \ln n)}$. There are at most $2^{2n}$ pairs of such sets and so the probability that there is some such pair with no edges is $e^{-\Omega(n \ln n) + O(n)} \to 0$. Note that there is no conditioning problem since we are arguing this for every pair of such sets. Think of whether such an argument made for just the $n$ subsets of vertices, which are vertices at distance at most $O(\ln n)$ from a specific vertex, would work.
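The layer growth used in the proof can be observed on a sampled graph. The following is a minimal sketch under stated assumptions, not a verification of the theorem: it draws $G(n, p)$ with Python's `random.Random`, runs breadth-first search from one vertex, and reports the layer sizes $|S_0|, |S_1|, |S_2|, \ldots$ with $S_0 = \{v\}$. The function names `gnp_adjacency` and `bfs_layer_sizes` are illustrative.

```python
import random
from collections import deque
from math import log


def gnp_adjacency(n: int, p: float, rng: random.Random) -> list[list[int]]:
    """Sample G(n, p) as adjacency lists."""
    adj: list[list[int]] = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def bfs_layer_sizes(adj: list[list[int]], source: int) -> list[int]:
    """Return [|S_0|, |S_1|, ...], where S_i is the set at distance i from source."""
    dist = [-1] * len(adj)
    dist[source] = 0
    sizes = [1]
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                if dist[w] == len(sizes):
                    sizes.append(0)  # first vertex found in a new layer
                sizes[dist[w]] += 1
                queue.append(w)
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)
    n = 2000
    adj = gnp_adjacency(n, 3 * log(n) / n, rng)
    layers = bfs_layer_sizes(adj, 0)
    print("layer sizes:", layers, "eccentricity of vertex 0:", len(layers) - 1)
```

The number of layers minus one is the eccentricity of the start vertex, and the diameter is the maximum eccentricity over all start vertices, so repeating the search from every vertex would give the diameter of the sample.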
