Matrix estimation for generalized block models

In this section we phrase a result essentially due to Abbe and Sandon [3] (and closely related to results by Bordenave et al. [42]) in somewhat more general terms. This turns out to be enough to capture an algorithm to estimate a pairwise vertex-similarity matrix in the $d, k, \alpha, \varepsilon$ mixed-membership block model when $\varepsilon^2 d > k^2(\alpha+1)^2$.

Let $U$ be a universe of labels, endowed with some base measure $\nu$ such that $\int 1 \, d\nu = 1$. Let $\mu$ be a probability distribution on $U$ with a density relative to $\nu$. (We abuse notation by conflating $\mu$ and its associated density.) Let $W : U \times U \to \mathbb{R}_+$ be a bounded nonnegative function with $W(x, y) = W(y, x)$ for every $x, y \in U$. Consider a random graph model $G(n, d, W, \mu)$ sampled as follows. For each of $n$ vertices, draw a label $x_i \sim \mu$ independently. Then for each pair $ij \in [n]^2$, independently add the edge $(i, j)$ to the graph with probability $\frac{d}{n} W(x_i, x_j)$. (This captures the $W$-random graph models used in the literature on graphons.)
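The sampling procedure above can be sketched in a few lines for a finite label set. This is only an illustration under stated assumptions: the function names and the dictionary encoding of $\mu$ are hypothetical, the graph is treated as simple and undirected (unordered pairs $i < j$), and the edge probability is clipped at 1 for safety. The example kernel is the $k$-community block-model kernel $1 + \varepsilon(1[x = y] - 1/k)$.

```python
import random

def sample_w_random_graph(n, d, W, mu, rng):
    """Sample labels and edges from G(n, d, W, mu) for a finite label set.

    mu: dict mapping each label to its probability (hypothetical encoding).
    W:  symmetric nonnegative function; edge ij appears w.p. (d/n) * W(x_i, x_j).
    """
    labels = list(mu)
    weights = [mu[x] for x in labels]
    # Draw one label per vertex, independently from mu.
    x = rng.choices(labels, weights=weights, k=n)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: clip at 1 so the value is always a valid probability.
            p = min(1.0, d / n * W(x[i], x[j]))
            if rng.random() < p:
                edges.add((i, j))
    return x, edges

def sbm_kernel(k, eps):
    """Kernel of the k-community block model: 1 + eps * (1[x == y] - 1/k)."""
    return lambda x, y: 1.0 + eps * ((1.0 if x == y else 0.0) - 1.0 / k)

rng = random.Random(0)
k, eps, n, d = 3, 0.9, 300, 5.0
mu = {c: 1.0 / k for c in range(k)}
x, edges = sample_w_random_graph(n, d, sbm_kernel(k, eps), mu, rng)
print(len(edges))  # roughly n * d / 2, since the kernel averages to 1
```

Because the kernel has average 1 under $\mu$, the expected degree of each vertex is about $d$, which is why $d$ plays the role of the average degree throughout.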

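Let $\mathcal{F}$ denote the space of square-integrable functions $f : U \to \mathbb{R}$, endowed with the inner product $\langle f, g \rangle = \mathbb{E}_{x \sim \mu} f(x) g(x)$. That is, $f \in \mathcal{F}$ if $\mathbb{E}_{x \sim \mu} f(x)^2$ exists.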

We assume throughout that

1. (Stochasticity) For every $x \in U$, the average $\mathbb{E}_{y \sim \mu} W(x, y) = 1$.

2. (Finite rank) $W$ has a finite-rank decomposition $W(x, y) = \sum_{i \le r} \lambda_i f_i(x) f_i(y)$, where $\lambda_i \in \mathbb{R}$ and $f_i : U \to \mathbb{R}$. The values $\lambda_i$ are the eigenvalues of $W$ with respect to the inner product generated by $\mu$. The eigenfunctions are orthonormal with respect to the $\mu$ inner product. Notice that the assumptions on $W$ imply that its top eigenfunction $f_1(x)$ is the constant function, with eigenvalue $\lambda_1 = 1$.

3. (Niceness I) Certain rational moments of $\mu^{-1}$ exist; that is, $\mathbb{E}_{x \sim \mu}\, \mu(x)^{-t}$ exists for $t = 3/2, 2$.

4. (Niceness II) $W$ and $\mu$ are nice enough that $W(x, y) \le 1/\sqrt{\mu(x)\mu(y)}$ and $|\overline{W}(x, y)| \le \lambda_2/\sqrt{\mu(x)\mu(y)}$ for every $x, y \in U$, where $\overline{W}(x, y) = W(x, y) - 1$. (Notice that in the case of discrete $W$ and $\mu$ this is always true, and for smooth enough $W$ and $\mu$ it is true via a $\delta$-function argument.)

The function $W$ induces a Markov operator $W : \mathcal{F} \to \mathcal{F}$. If $f \in \mathcal{F}$, then

\[ (W f)(x) = \mathbb{E}_{y \sim \mu} W(x, y) f(y) . \]

(We abuse notation by conflating the function $W$ and the Markov operator $W$.)
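For a finite label set the Markov operator is just a $\mu$-weighted matrix-vector product, and the stochasticity assumption and the second eigenvalue can be checked exactly. The following sketch does this for the $k$-community block-model kernel $W(x, y) = 1 + \varepsilon(1[x = y] - 1/k)$ with uniform $\mu$; the kernel, the parameter values, and the helper names are illustrative assumptions rather than part of the text above.

```python
from fractions import Fraction

def apply_operator(W, mu, f):
    """Return Wf, where (Wf)(x) = sum_y mu[y] * W[x][y] * f[y]."""
    k = len(mu)
    return [sum(mu[y] * W[x][y] * f[y] for y in range(k)) for x in range(k)]

def inner(mu, f, g):
    """Inner product <f, g> = E_{x ~ mu} f(x) g(x)."""
    return sum(m * a * b for m, a, b in zip(mu, f, g))

k, eps = 4, Fraction(1, 2)
mu = [Fraction(1, k)] * k
# Block-model kernel W(x, y) = 1 + eps * (1[x == y] - 1/k).
W = [[1 + eps * ((1 if x == y else 0) - Fraction(1, k)) for y in range(k)]
     for x in range(k)]

ones = [Fraction(1)] * k
# Centered indicator of label 0; it is orthogonal to the constant function.
f = [(1 if x == 0 else 0) - Fraction(1, k) for x in range(k)]

stochastic = apply_operator(W, mu, ones) == ones   # E_y W(x, y) = 1 for all x
Wf = apply_operator(W, mu, f)
lambda_2 = inner(mu, f, Wf) / inner(mu, f, f)      # Rayleigh quotient of f
print(stochastic, lambda_2)
```

In this example the centered indicator is an exact eigenfunction with eigenvalue $\varepsilon/k$, so the condition $d\lambda_2^2 > 1$ reads $\varepsilon^2 d > k^2$.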

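Theorem 8.4.1 (Implicit in [3]). Suppose the operator $W$ has eigenvalues $1 = \lambda_1 > \lambda_2 > \cdots > \lambda_r$ (each possibly with higher multiplicity) and $\delta \stackrel{\mathrm{def}}{=} 1 - \frac{1}{d\lambda_2^2} > 0$. Let $\Pi$ be the projector to the second eigenspace of the operator $W$. For types $x_1, \ldots, x_n \sim \mu$, let $A \in \mathbb{R}^{n \times n}$ be the random matrix $A_{ij} = \Pi(x_i, x_j)$, where we abuse notation and think of $\Pi : U \times U \to \mathbb{R}$. There is an algorithm with running time $n^{\mathrm{poly}(1/\delta)}$ which outputs an $n \times n$ matrix $P$ such that for $x, G \sim G(n, d, W, \mu)$,

\[ \mathbb{E}_{x,G} \operatorname{Tr} P \cdot A \ge \delta^{O(1)} \cdot \left( \mathbb{E}_{x,G} \|A\|^2 \right)^{1/2} \left( \mathbb{E}_{x,G} \|P\|^2 \right)^{1/2} . \]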

When $U$ is discrete with $k$ elements one recovers the usual $k$-community stochastic block model, and the condition $d\lambda_2^2 > 1$ matches the Kesten-Stigum condition in that setting. When $d\lambda_2^2 > 1 + \delta$, the guarantees of Abbe and Sandon can be obtained by applying the above theorem to obtain an estimator $P$ for the matrix $M = \sum_{s \in [k]} v_s v_s^\top$, where $v_s \in \mathbb{R}^n$ is the centered indicator vector of community $s$. The estimator $P$ will have at least $\delta^{O(1)}/k$ correlation with $M$, and a random vector in the span of the top $k/\delta^{O(1)}$ eigenvectors of $P$ will have correlation $(\delta/k)^{O(1)}$ with some $v_s$. Thresholding that vector leads to the guarantees of Abbe and Sandon for the $k$-community block model, with one difference: Abbe and Sandon's algorithm runs in $O(n \log n)$ time, much faster than the $n^{\mathrm{poly}(1/\delta)}$ running time outlined above. In essence, they achieve this by computing an estimator $P'$ for $M$ which counts only non-backtracking paths in $G$ (the estimator $P$ counts self-avoiding paths).

In Section 8.4.1 we prove a corollary of Theorem 8.4.1. This yields the algorithm discussed in Theorem 8.1.2 for the mixed-membership block model. As discussed before, the quantitative recovery guarantees of this algorithm are weaker than those of our final algorithm, whose recovery accuracy depends only on the distance $\delta$ of the signal-to-noise ratio of the mixed-membership block model to 1. In Section 8.4.2 we prove Theorem 8.4.1.

8.4.1 Matrix estimation for the mixed-membership model

We turn to the mixed-membership model and show that Theorem 8.4.1 yields an algorithm for partial recovery in the mixed-membership block model. However, the correlation of the vectors output by this algorithm with the underlying community memberships depends both on the signal-to-noise ratio and on the number $k$ of communities. (In particular, when $k$ is super-constant this algorithm no longer solves the partial recovery task.)

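Definition 8.4.2 (Mixed-Membership Block Model). Let $G(n, d, \varepsilon, \alpha, k)$ be the following random graph model. For each vertex $i \in [n]$, sample a probability vector $\sigma_i \in \mathbb{R}^k_{\ge 0}$ with $\sum_{t \in [k]} \sigma_i(t) = 1$ according to the following (simplified) Dirichlet distribution:

\[ \Pr(\sigma_i) \propto \prod_{t \in [k]} \sigma_i(t)^{\alpha/k - 1} . \]

For each pair of vertices $i, i' \in [n]$, sample communities $t \sim \sigma_i$ and $t' \sim \sigma_{i'}$. If $t = t'$, add the edge $\{i, i'\}$ to $G$ with probability $\frac{d}{n}\left(1 + (1 - \frac{1}{k})\varepsilon\right)$. If $t \ne t'$, add the edge $\{i, i'\}$ to $G$ with probability $\frac{d}{n}\left(1 - \frac{\varepsilon}{k}\right)$. (For simplicity, throughout this paper we consider only the case that the communities have equal sizes and the connectivity matrix has just two unique entries.)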

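Theorem 8.4.3 (Constant-degree partial recovery for mixed-membership block model, $k$-dependent error). For every $\delta > 0$ and $d(n), \varepsilon(n), k(n), \alpha(n)$, there is an algorithm with running time $n^{O(1) + 1/\delta^{O(1)}}$ with the following guarantees when

\[ \delta \stackrel{\mathrm{def}}{=} 1 - \frac{k^2(\alpha + 1)^2}{\varepsilon^2 d} > 0 \quad \text{and} \quad k, \alpha \le n^{o(1)} \quad \text{and} \quad \varepsilon^2 d \le n^{o(1)} . \]

Let $\sigma, G \sim G(n, d, \varepsilon, \alpha, k)$ and for $s \in [k]$ let $v_s \in \mathbb{R}^n$ be given by $v_s(i) = \sigma_i(s) - \frac{1}{k}$. The algorithm outputs a vector $x$ such that $\mathbb{E} \langle x, v_1 \rangle^2 \ge \delta' \|x\|^2 \|v_1\|^2$, for some $\delta' \ge (\delta/k)^{O(1)}$.²³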

Ideally one would prefer an algorithm which outputs $\tau_1, \ldots, \tau_n \in \Delta_{k-1}$ with $\mathrm{corr}(\sigma, \tau) \ge \delta'/(\alpha + 1)$. If one knew that $\langle x, v_1 \rangle \ge \delta' \|x\| \|v_1\|$ rather than merely the guarantee on $\langle x, v_1 \rangle^2$ (which does not include a guarantee on the sign of $x$), then this could be accomplished by correlation-preserving projection, Theorem 8.2.3. The tensor methods we use in our final algorithm for the mixed-membership model are able to obtain a guarantee on $\langle x, v_1 \rangle$ and hence can output probability vectors $\tau_1, \ldots, \tau_n$.²⁴

²³ The requirement $\varepsilon^2 d \le n^{o(1)}$ is for technical convenience only; as $\varepsilon^2 d$ increases the recovery problem only becomes easier.

To prove Theorem 8.4.3 we will apply Theorem 8.4.1 and then a simple spectral rounding algorithm; the next two lemmas capture these two steps.

Lemma 8.4.4 (Mixed-membership block model, matrix estimation). If $U$ is the $(k-1)$-simplex, $\mu$ is the $\alpha, k$ Dirichlet distribution, and $W(\sigma, \sigma') = 1 - \frac{\varepsilon}{k} + \varepsilon \langle \sigma, \sigma' \rangle$, then $G(n, d, W, \mu)$ is the mixed-membership block model with parameters $k, d, \alpha, \varepsilon$. In this case, the second eigenvalue of $W$ has multiplicity $k - 1$ and has value $\lambda_2 = \frac{\varepsilon}{k(\alpha + 1)}$.

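Proof. The first part of the claim follows from the definitions. For the second part, note that $W$ has the following decomposition:

\[ W(\sigma, \tau) = 1 + \sum_{i \le k} \varepsilon \left(\sigma_i - \tfrac{1}{k}\right)\left(\tau_i - \tfrac{1}{k}\right) . \]

The functions $\sigma \mapsto \sigma_i - \frac{1}{k}$ are all orthogonal to the constant function $\sigma \mapsto 1$ with respect to $\mu$; i.e. $\mathbb{E}_{\sigma \sim \mu} 1 \cdot (\sigma_i - \frac{1}{k}) = 0$ because $\mathbb{E}\, \sigma_i = \frac{1}{k}$. It will be enough to test the Rayleigh quotient

\[ \frac{\mathbb{E}_{\sigma \sim \mu} f(\sigma) \cdot (W f)(\sigma)}{\mathbb{E}_{\sigma \sim \mu} f(\sigma)^2} \]

with any function $f(\sigma)$ in the span of the functions $\sigma \mapsto \sigma_i - \frac{1}{k}$. If we pick $f(\sigma) = \sigma_1 - \frac{1}{k}$ the remaining calculation is routine, using only the second moments of the Dirichlet distribution (see Fact 8.4.5 below).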
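For readers who want the two omitted steps spelled out, the following is a sketch of the calculation. It uses only Definition 8.4.2 and the second-moment identity of Fact 8.4.5; the shorthand $\tilde\sigma = \sigma - \frac{1}{k} \cdot 1$ is the centering used in that fact, and $f(\sigma) = \tilde\sigma_1$ as in the proof.

```latex
% Step 1 (edge probability). Given \sigma, \sigma', the sampled communities agree
% with probability \Pr[t = t'] = \langle \sigma, \sigma' \rangle, so
\Pr\big[\{i,i'\} \in G \,\big|\, \sigma, \sigma'\big]
  = \frac{d}{n}\Big[ \langle\sigma,\sigma'\rangle \big(1 + (1 - \tfrac{1}{k})\varepsilon\big)
      + \big(1 - \langle\sigma,\sigma'\rangle\big)\big(1 - \tfrac{\varepsilon}{k}\big) \Big]
  = \frac{d}{n}\Big[ 1 - \tfrac{\varepsilon}{k} + \varepsilon \langle\sigma,\sigma'\rangle \Big]
  = \frac{d}{n}\, W(\sigma, \sigma').

% Step 2 (second eigenvalue). Since \mathbb{E}\,\tilde\tau_1 = 0 and
% \mathbb{E}\,\tilde\tau\tilde\tau^\top = \frac{1}{k(\alpha+1)} \Pi,
(Wf)(\sigma)
  = \mathbb{E}_{\tau\sim\mu} \Big[ 1 + \varepsilon \langle \tilde\sigma, \tilde\tau \rangle \Big] \tilde\tau_1
  = \varepsilon \sum_{i \le k} \tilde\sigma_i \, \mathbb{E}\, \tilde\tau_i \tilde\tau_1
  = \frac{\varepsilon}{k(\alpha+1)} \, (\Pi \tilde\sigma)_1
  = \frac{\varepsilon}{k(\alpha+1)} \, f(\sigma).
```

The last step uses $\Pi \tilde\sigma = \tilde\sigma$, which holds because the coordinates of $\tilde\sigma$ sum to zero. The same computation applies to every coordinate function $\sigma \mapsto \sigma_i - \frac{1}{k}$, and these $k$ functions sum to zero, which is consistent with the stated multiplicity $k - 1$.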

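Fact 8.4.5 (Special case of Fact 8.8.3). Let $\sigma \in \mathbb{R}^k$ be distributed according to the $\alpha, k$ Dirichlet distribution. Let $\tilde\sigma = \sigma - \frac{1}{k} \cdot 1$ be centered. Then $\mathbb{E}\, \tilde\sigma \tilde\sigma^\top = \frac{1}{k(\alpha + 1)} \cdot \Pi$, where $\Pi$ is the projector to the complement of the all-1s vector in $\mathbb{R}^k$.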
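The identity in Fact 8.4.5 can be checked exactly from the standard first and second moments of a symmetric Dirichlet distribution with concentration $\alpha/k$ per coordinate. The sketch below does so with rational arithmetic and also evaluates the gap $\delta = 1 - 1/(d\lambda_2^2)$ for $\lambda_2 = \varepsilon/(k(\alpha+1))$. The function names are hypothetical, and the moment formulas in the docstring are the textbook Dirichlet moments rather than anything stated in this section.

```python
from fractions import Fraction

def dirichlet_centered_second_moment(k, alpha):
    """Exact E[(s - 1/k)(s - 1/k)^T] for s ~ Dirichlet(alpha/k, ..., alpha/k).

    Standard moments with a = alpha/k and total concentration alpha:
      E[s_i] = 1/k,
      E[s_i^2] = a(a + 1) / (alpha(alpha + 1)),
      E[s_i s_j] = a^2 / (alpha(alpha + 1)) for i != j.
    """
    total = Fraction(alpha)
    a = total / k
    denom = total * (total + 1)
    same, diff = a * (a + 1) / denom, a * a / denom
    mean_sq = Fraction(1, k * k)
    return [[(same if i == j else diff) - mean_sq for j in range(k)]
            for i in range(k)]

def projector_off_ones(k):
    """Projector onto the complement of the all-1s vector: I - J/k."""
    return [[(1 if i == j else 0) - Fraction(1, k) for j in range(k)]
            for i in range(k)]

def snr_gap(k, alpha, eps, d):
    """delta = 1 - 1/(d * lambda_2^2) with lambda_2 = eps / (k (alpha + 1))."""
    lam2 = Fraction(eps) / (k * (Fraction(alpha) + 1))
    return 1 - 1 / (d * lam2 * lam2)

k, alpha = 3, Fraction(1, 2)
cov = dirichlet_centered_second_moment(k, alpha)
scale = 1 / (k * (alpha + 1))
matches_fact = cov == [[scale * p for p in row] for row in projector_off_ones(k)]
print(matches_fact, snr_gap(2, 1, Fraction(1, 2), 128))
```

Note that `snr_gap` agrees with the expression $1 - k^2(\alpha+1)^2/(\varepsilon^2 d)$ from Theorem 8.4.3, since the two are algebraically identical.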

²⁴ ... between $x$ and $-x$. Since we are focused on what can be accomplished by matrix estimation methods generally, we leave this to the reader.

We analyze a simple rounding algorithm.

Lemma 8.4.6. Let $M = \sum_{i=1}^k v_i v_i^\top$ be an $n \times n$ symmetric rank-$k$ PSD matrix. Let $P \in \mathbb{R}^{n \times n}$ be another symmetric matrix such that $\langle P, M \rangle \ge \delta \|P\| \|M\|$ (where $\| \cdot \|$ is the Frobenius norm). Then for at least one vector $v$ among $v_1, \ldots, v_k$, a random unit vector $x$ in the span of the top $(k/\delta)^{O(1)}$ eigenvectors of $P$ satisfies

\[ \mathbb{E} \langle x, v \rangle^2 \ge (\delta/k)^{O(1)} \|v\|^2 . \]
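The rounding step of Lemma 8.4.6 is easy to sketch numerically. The example below is an illustration under assumptions, not the algorithm analyzed in this section: it uses NumPy, takes "top" to mean largest eigenvalues, treats the number `m` of eigenvectors as a free parameter standing in for $(k/\delta)^{O(1)}$, and builds a toy estimator $P$ as $M$ plus small symmetric Gaussian noise, with $v_s$ the centered indicator vectors of $k$ balanced communities.

```python
import numpy as np

def round_from_estimator(P, m, rng):
    """Return a random unit vector in the span of the top-m eigenvectors of P."""
    P = (P + P.T) / 2.0                    # symmetrize defensively
    eigvals, eigvecs = np.linalg.eigh(P)   # eigenvalues in ascending order
    top = eigvecs[:, -m:]                  # n x m matrix with orthonormal columns
    g = rng.standard_normal(m)             # Gaussian coefficients give a uniform direction
    x = top @ g
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
n, k = 60, 3
community = np.arange(n) % k
# Columns of V are the centered indicator vectors v_s (toy planted structure).
V = (community[:, None] == np.arange(k)[None, :]).astype(float) - 1.0 / k
M = V @ V.T
noise = rng.standard_normal((n, n)) * 0.01
P = M + (noise + noise.T) / 2.0            # toy estimator: M plus small symmetric noise

# The v_s sum to zero, so M has rank k - 1; use m = k - 1 eigenvectors here.
x = round_from_estimator(P, k - 1, rng)
sq_corr = (V.T @ x) ** 2 / np.sum(V ** 2, axis=0)   # <x, v_s>^2 / ||v_s||^2
best = float(sq_corr.max())
print(best)
```

In this toy setting the squared correlations over the $k$ vectors sum to roughly $k/(k-1)$, so at least one of them is bounded below by a constant depending only on $k$, in the spirit of the lemma.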

Now we can prove Theorem 8.4.3.

Proof of Theorem 8.4.3. Lemma 8.4.4 shows that the conditions of Theorem 8.4.1 hold, and hence (via color coding) there is an $n^{\mathrm{poly}(1/\delta)}$-time algorithm to compute a matrix $P$ such that $\langle P, M \rangle \ge \delta^{O(1)} \|P\| \|M\|$ with probability at least $\delta^{O(1)}$, where $M = \sum_{s \in [k]} v_s v_s^\top$. (The reader may check that the matrix $A$ of Theorem 8.4.1 is in this case the matrix $M$ described here.)

Applying Lemma 8.4.6 shows that a random unit vector $x$ in the span of the top $(k/\delta)^{O(1)}$ eigenvectors of $P$ satisfies $\mathbb{E} \langle x, v \rangle^2 \ge (\delta/k)^{O(1)} \|v\|^2$, where $v \in \mathbb{R}^n$ has entries $v_i = \sigma_i(1) - \frac{1}{k}$. (The choice of 1 is without loss of generality.)

8.4.2 Proof of Theorem 8.4.1

Definition 8.4.7. For a pair of functions $A, B : U \times U \to \mathbb{R}$, we denote by $AB$ their product, whose entries are $(AB)(x, y) = \mathbb{E}_{z \sim \mu} A(x, z) B(z, y)$.

The strategy to prove Theorem 8.4.1 will as usual be to apply Lemma 8.2.1. We check the conditions of that lemma in the following lemmas, deferring their proofs until the end of this section.

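Lemma 8.4.8. Let $G_{ij}$ be the 0/1 indicator for the presence of edge $i \sim j$ in a graph $G$. As usual, let $\mathrm{SAW}_\ell(i, j)$ be the collection of simple paths of length $\ell$ in the complete graph on $n$ vertices from $i$ to $j$.

Let $x, G \sim G(n, d, W, \mu)$. Let $\alpha \in \mathrm{SAW}_\ell(i, j)$. Let $p_\alpha(G) = \prod_{ab \in \alpha} (G_{ab} - \frac{d}{n})$. Let $\overline{W}(x, y) = W(x, y) - 1$. Then

\[ \mathbb{E} \left[ p_\alpha(G) \mid x_i, x_j \right] = \left(\frac{d}{n}\right)^{\ell} \overline{W}^{\ell-1}(x_i, x_j) . \]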

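Lemma 8.4.9. With the same notation as in Lemma 8.4.8, as long as $\ell \ge C \log n/\delta^{O(1)}$ for a large-enough constant $C$,

\[ \left(\frac{n}{d}\right)^{2\ell} \mathbb{E} \sum_{\alpha, \beta \in \mathrm{SAW}_\ell(i,j)} p_\alpha(G) p_\beta(G) \le \delta^{-O(1)} \cdot |\mathrm{SAW}_\ell(i, j)|^2 \cdot \mathbb{E}\, \overline{W}^{\ell-1}(x_i, x_j)^2 . \]

(The constant $C$ depends on $W$ and the moments of $\mu$.)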

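Proof of Theorem 8.4.1. Let $B_{ij} = \lambda_2^{-(\ell-1)} \overline{W}^{\ell-1}(x_i, x_j)$. By Lemma 8.4.8, Lemma 8.4.9, and Lemma 8.2.1, there is a matrix polynomial $P(G)$, computable to $1/\mathrm{poly}(n)$-accuracy in time $n^{\mathrm{poly}(1/\delta)}$ by color coding, such that

\[ \mathbb{E} \operatorname{Tr} P B^\top \ge \delta^{O(1)} \left( \mathbb{E} \|P\|^2 \right)^{1/2} \left( \mathbb{E} \|B\|^2 \right)^{1/2} . \]

At the same time, $B - A$ has entries

\[ (B - A)_{ij} = \sum_{3 \le t \le r} \left( \frac{\lambda_t}{\lambda_2} \right)^{\ell-1} \Pi_t(x_i, x_j) , \]

where $\Pi_t$ projects to the $t$-th eigenspace of $W$. Since $W$ is bounded, choosing $\ell$ a large enough multiple of $\log n$ ensures that $\mathbb{E} \|B - A\|^2 \le n^{-100}\, \mathbb{E} \|B\|^2$, so the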

8.4.3 Proofs of Lemmas

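Proof of Lemma 8.4.8. As usual, we simply expand $p_\alpha$, obtaining

\[ \mathbb{E} \left[ p_\alpha(G) \mid x_i, x_j \right] = \mathbb{E}_x \left[ \prod_{ab \in \alpha} \frac{d}{n} \cdot \left( W(x_a, x_b) - 1 \right) \,\middle|\, x_i, x_j \right] = \left(\frac{d}{n}\right)^{\ell} \cdot \overline{W}^{\ell-1}(x_i, x_j) . \]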

We will need some small facts to help in proving Lemma 8.4.9.

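Fact 8.4.10. If $\ell - t > C \log n$ for large enough $C = C(W)$, then

\[ \lambda_2^{2t}\, \mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell-t}(x, y)^2 \le (1 + o(1)) \cdot \mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell}(x, y)^2 . \]

Also, for any $t \le \ell$,

\[ \lambda_2^{2t}\, \mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell-t}(x, y)^2 \le r \cdot \mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell}(x, y)^2 , \]

where $r$ is the rank of $W$.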

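Proof. Using the eigendecomposition of $W$, we have that $\mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell-t}(x, y)^2 = \sum_{2 \le i \le r} \lambda_i^{2(\ell-t)}$ and similarly $\mathbb{E}_{x,y \sim \mu} \overline{W}^{\ell}(x, y)^2 = \sum_{2 \le i \le r} \lambda_i^{2\ell}$. If $i > 2$, then

\[ \lambda_2^{2t} \lambda_i^{2(\ell-t)} = \lambda_2^{2\ell} (\lambda_i/\lambda_2)^{2(\ell-t)} \le \lambda_2^{2\ell}/n \]

by our assumption that $\ell - t > C \log n$ for large enough $C$. This finishes the proof of the first claim; the second one is similar.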

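Proof of Lemma 8.4.9. Pairs $\alpha, \beta$ which share only the vertices $i, j$ each contribute exactly $\mathbb{E}\, \overline{W}^{\ell-1}(x_i, x_j)^2$ to the left-hand side, by Lemma 8.4.8. Consider next the contribution of $\alpha, \beta$ whose shared edges form paths originating at $i$ and $j$. Suppose there are $t$ such shared edges. Then

\[ \mathbb{E}\, p_\alpha p_\beta = \left(\frac{d}{n}\right)^{2\ell-t} \mathbb{E}_x \prod_{ab \in \alpha \triangle \beta} \overline{W}(x_a, x_b) \cdot \prod_{ab \in \alpha \cap \beta} \left( W(x_a, x_b) + O(d/n) \right) = \left(\frac{d}{n}\right)^{2\ell-t} (1 + O(d/n))^t\, \mathbb{E}\, \overline{W}^{2(\ell-t-1)}(x, y)^2 , \]

where for the second equality we used the assumption $\mathbb{E}_{x \sim \mu} W(x, y) = 1$ for every $y$.

If $\ell - t > C \log n$ for the constant in Fact 8.4.10, then this is at most $(1 + o(1)) \left(\frac{d}{n}\right)^{2\ell-t} \lambda_2^{-2t}\, \mathbb{E}\, \overline{W}^{2(\ell-1)}(x, y)^2$, and for every $t \le \ell$ it is at most $r \cdot \left(\frac{d}{n}\right)^{2\ell-t} \lambda_2^{-2t}\, \mathbb{E}\, \overline{W}^{2(\ell-1)}(x, y)^2$.

There are at most $|\mathrm{SAW}_\ell(i, j)|^2/n^t \cdot t$ choices for such pairs $\alpha, \beta$, except when $t = \ell$, in which case there are $|\mathrm{SAW}_\ell(i, j)|^2/n^{t-1}$ choices. So the total contribution from such $\alpha, \beta$ is at most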

[...] $\mathbb{E}_{x,y}\, \overline{W}^{\ell-1}(x, y)^2$.

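It remains to handle pairs $\alpha, \beta$ which share $t$ vertices and $s$ edges for $t > s$. If $t, s \le \ell - 2$, then there are only $n^{2(\ell-1)-s} \ell^{O(t-s)}$ choices for such a pair $\alpha, \beta$. The contribution of each such pair we bound as follows:

\[ \mathbb{E}\, p_\alpha p_\beta = \left(\frac{d}{n}\right)^{2\ell-2s} \mathbb{E} \prod_{ab \in \alpha \cap \beta} \left( G_{ab} - \tfrac{d}{n} \right)^2 \cdot \prod_{ab \in \alpha \triangle \beta} \overline{W}(x_a, x_b) . \]

Now, $\mathbb{E} \left[ (G_{ab} - \frac{d}{n})^2 \mid x \right] = \frac{d}{n} \left( W(x_a, x_b) + O(d/n) \right)$ by straightforward calculations, so the above is

\[ (1 + O(d/n))^s \left(\frac{d}{n}\right)^{2\ell-s} \mathbb{E}_x \prod_{ab \in \alpha \cap \beta} W(x_a, x_b) \cdot \prod_{ab \in \alpha \triangle \beta} \overline{W}(x_a, x_b) \le (1 + O(d/n))^s \left(\frac{d}{n}\right)^{2\ell-s} \lambda_2^{2\ell-s}\, \mathbb{E} \prod_{a \in \alpha \cup \beta} \mu(x_a)^{-\deg_{\alpha,\beta}(a)/2} , \]

where $\deg_{\alpha,\beta}(a)$ is the degree of the vertex $a$ in the graph $\alpha \cup \beta$. Any degree-2 vertices simply contribute 1 in the above, since $\mathbb{E}_{x \sim \mu} 1/\mu(x) = 1$. There are at most $t - s$ vertices of higher degree; they may have degree at most 4. They each contribute at most some number $C = C(\mu)$ by the niceness assumptions on $\mu$. So the above is at most

\[ (1 + o(1)) \left(\frac{d}{n}\right)^{2\ell-s} \lambda_2^{2\ell-s} \exp\{O(t - s)\} . \]

Putting things together as in Lemma 8.3.7 finishes the proof.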

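Proof of Lemma 8.4.6. By averaging, there is some $v$ among $v_1, \ldots, v_k$ such that

\[ \langle P, v v^\top \rangle \ge \frac{\delta}{k} \cdot \|P\| \cdot \|M\| \ge \frac{\delta}{k} \cdot \|P\| \cdot \|v v^\top\| , \]

where the second inequality uses $M \succeq 0$. Renormalizing, we may assume $P$ has Frobenius norm 1 and $v$ is a unit vector; in this case we obtain $\langle v, P v \rangle \ge \delta/k$. Writing out the eigendecomposition of $P$, let $P = \sum_{i=1}^n \lambda_i u_i u_i^\top$ and we get

\[ \sum_{i=1}^n \lambda_i \langle v, u_i \rangle^2 \ge \delta/k . \]

By Cauchy-Schwarz,

\[ \sum_{i=1}^n \lambda_i \langle v, u_i \rangle^2 \le \left( \sum_{i=1}^n \lambda_i^2 \langle v, u_i \rangle^2 \right)^{1/2} , \]

and hence $\sum_{i=1}^n \lambda_i^2 \langle v, u_i \rangle^2 \ge (\delta/k)^2$, while $\sum_{i=1}^n \lambda_i^2 = 1$. Now the Lemma follows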
