Concentration of objective function - Some Algorithms and Paradigms for Big Data

The goal of this section is to prove the following concentration theorem for the reweighted sample covariance matrix $\hat{\Sigma}$.

Proposition 4.5.1 (Concentration of sample matrix). There is a universal constant $C$ so that the following holds. Fix a sparsity parameter $s$, and let $a$ be defined using Assumption 4.2.1. Suppose Assumption (Af,Z,s) holds, and let $x^*$ be a unit norm vector. If $a$ is non-Gaussian, further assume that $x^*$ is admissible. Then for any $\delta > 0$, we have

\[ \|\hat{\Sigma} - \Sigma\|_\infty \le \frac{C\big(\sqrt{\log(n/\delta)} + \log^2(s/\delta)\big)}{\sqrt{m}} \]

with probability at least $1 - \delta$, provided $m \ge C \max\{s/\delta,\ \log(n/\delta)\log^2 m\}$.

Proof. Without loss of generality, assume that the support of $x^*$ is contained in the first $s$ coordinates. Let $P_s$ denote the projection to the first $s$ coordinates. We write

\[ \|\hat{\Sigma} - \Sigma\|_\infty = \max\Big\{ \|P_s(\hat{\Sigma} - \Sigma)P_s\|_\infty,\ \|P_s(\hat{\Sigma} - \Sigma)P_s^\perp\|_\infty,\ \|P_s^\perp(\hat{\Sigma} - \Sigma)P_s^\perp\|_\infty \Big\}, \tag{4.11} \]

and bound each of the terms on the right separately.

For the first term, we shall use the fact that each entry is the mean of $m$ i.i.d. $\psi_{1/2}$ random variables (see Section 2.2). This tail decay gives us a relatively strong large deviation inequality, which we can use together with a union bound. In more detail, let $1 \le k, l \le s$. Then

\[ (\hat{\Sigma} - \Sigma)_{kl} = \frac{1}{m}\sum_{i=1}^m \Big[ (a_i)_k (a_i)_l f(\langle a_i, x^*\rangle) - \mathbb{E}\{(a_i)_k (a_i)_l f(\langle a_i, x^*\rangle)\} \Big]. \]

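We now use Proposition 2.2.6 followed by Proposition 2.2.5 twice to get

\[ \big\|(a)_k (a)_l f(\langle a, x^*\rangle) - \mathbb{E}\{(a)_k (a)_l f(\langle a, x^*\rangle)\}\big\|_{\psi_{1/2}} \lesssim \|(a)_k (a)_l f(\langle a, x^*\rangle)\|_{\psi_{1/2}} \lesssim \|(a)_k (a)_l\|_{\psi_1} \|f(\langle a, x^*\rangle)\|_{\psi_1} \lesssim \|(a)_k\|_{\psi_2} \|(a)_l\|_{\psi_2} \|f(\langle a, x^*\rangle)\|_{\psi_1}. \]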

Each of the terms in the product on the right hand side is bounded by an absolute constant by assumption. As such, the quantity on the left is also bounded by an absolute constant. We may thus use Proposition 2.2.9 to see that

\[ \mathbb{P}\big\{ |(\hat{\Sigma} - \Sigma)_{kl}| > t/\sqrt{m} \big\} \le C\exp(-c\sqrt{t}) \]

for $t > 0$ large enough. Pick $t \sim \log^2(s/\delta)$. Then we can take a union bound over all $s^2$ choices of $k$ and $l$ to get

\[ \|P_s(\hat{\Sigma} - \Sigma)P_s\|_\infty \lesssim \frac{\log^2(s/\delta)}{\sqrt{m}} \]

with probability at least $1 - \delta/4$.
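The choice of $t$ can be checked by a short union-bound computation. The sketch below assumes that the entrywise tail bound has the form $C\exp(-c\sqrt{t})$, the usual shape for a normalized sum of $\psi_{1/2}$ variables; the constants $C$ and $c$ are placeholders, and $0 < \delta < 1$ and $s \ge 2$ are assumed so that $\log(s/\delta)$ is bounded below.

```latex
\begin{align*}
\mathbb{P}\Big\{ \max_{1 \le k,l \le s} |(\hat{\Sigma}-\Sigma)_{kl}| > \frac{t}{\sqrt{m}} \Big\}
  &\le s^2 \cdot C\exp(-c\sqrt{t}), \\
s^2 \cdot C\exp(-c\sqrt{t}) \le \frac{\delta}{4}
  &\iff \sqrt{t} \ge \frac{1}{c}\log\frac{4Cs^2}{\delta}, \\
\log\frac{4Cs^2}{\delta} &\le 2\log\frac{s}{\delta} + \log(4C) \lesssim \log\frac{s}{\delta}.
\end{align*}
```

Under these assumptions, any $t$ of order $\log^2(s/\delta)$ with a large enough implied constant makes the failure probability at most $\delta/4$.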

We next bound the other two quantities in (4.11) via a conditioning argument similar to that in [81]. The key idea is to condition on the probability $1 - \delta/4$ event over which the three statements in Lemma 4.5.3 hold, and to observe that this event is independent of the random variables $(a_i)_k$ for $1 \le i \le m$, $s < k \le n$. Hence, conditioning on the event does not alter the joint distribution of this set of random variables.

We consider a typical entry in $P_s(\hat{\Sigma} - \Sigma)P_s^\perp$, which is of the form

\[ \frac{1}{m}\sum_{i=1}^m (a_i)_k (a_i)_l f(\langle a_i, x^*\rangle), \qquad 1 \le k \le s,\ s < l \le n. \tag{4.12} \]

Fixing all randomness apart from $(a_i)_l$ for all indices $1 \le i \le m$, $s < l \le n$, we can use Hoeffding's inequality (Proposition 2.2.2) to conclude that for each $l$, (4.12) is a subgaussian random variable with variance $\frac{1}{m^2}\sum_{i=1}^m (a_i)_k^2 f(\langle a_i, x^*\rangle)^2$. By the second statement of Lemma 4.5.3, this is bounded by $C/m$, so that

\[ \mathbb{P}\Big\{ \Big|\frac{1}{m}\sum_{i=1}^m (a_i)_k (a_i)_l f(\langle a_i, x^*\rangle)\Big| > \frac{t}{\sqrt{m}} \Big\} \le 2\exp(-ct^2). \tag{4.13} \]
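To spell out the variance computation behind (4.13), here is a sketch under assumptions that are not stated explicitly above: the coordinates $(a_i)_l$ with $l > s$ are mean-zero, independent of everything held fixed, and have $\psi_2$ norm at most an absolute constant $K$. The weights $w_i$ are notation introduced only for this sketch.

```latex
% With k <= s < l fixed, condition on (a_i)_k and f(<a_i, x^*>) and set
%   w_i = (a_i)_k f(<a_i, x^*>) / m,
% so that the entry (4.12) equals \sum_i w_i (a_i)_l.
\begin{align*}
\Big\|\sum_{i=1}^m w_i (a_i)_l\Big\|_{\psi_2}^2
  &\lesssim \sum_{i=1}^m w_i^2 \,\|(a_i)_l\|_{\psi_2}^2
  \le \frac{K^2}{m^2}\sum_{i=1}^m (a_i)_k^2 f(\langle a_i,x^*\rangle)^2
  \le \frac{CK^2}{m}, \\
\mathbb{P}\Big\{\Big|\sum_{i=1}^m w_i (a_i)_l\Big| > \frac{t}{\sqrt{m}}\Big\}
  &\le 2\exp\Big(-\frac{c\,t^2/m}{CK^2/m}\Big) = 2\exp\Big(-\frac{c\,t^2}{CK^2}\Big).
\end{align*}
```

The factors of $m$ cancel, which is why the exponent in (4.13) depends on $t$ alone once the constants are absorbed into $c$.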

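Choosing $t \sim \sqrt{\log(n/\delta)}$ and taking a union bound over the entries ($1 \le k \le s$, $s < l \le n$) gives

\[ \|P_s(\hat{\Sigma} - \Sigma)P_s^\perp\|_\infty \lesssim \sqrt{\frac{\log(n/\delta)}{m}} \]

with probability at least $1 - \delta/4$.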

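Finally, each entry of $P_s^\perp(\hat{\Sigma} - \Sigma)P_s^\perp$ is of the form

\[ \frac{1}{m}\sum_{i=1}^m f(\langle a_i, x^*\rangle)\Big( (a_i)_k (a_i)_l - \mathbb{E}\{(a_i)_k (a_i)_l\} \Big), \qquad s < k, l \le n. \tag{4.14} \]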

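We again fix all randomness apart from $(a_i)_l$ for all indices $1 \le i \le m$, $s < l \le n$. Observe that $(a_i)_k (a_i)_l - \mathbb{E}\{(a_i)_k (a_i)_l\}$, $s < k, l \le n$, are centered subexponential random variables. We may therefore apply Bernstein's inequality, together with the first and third statements of Lemma 4.5.3, to obtain the tail bound:

\[ \mathbb{P}\Big\{ \Big|\frac{1}{m}\sum_{i=1}^m f(\langle a_i, x^*\rangle)\Big( (a_i)_k (a_i)_l - \mathbb{E}\{(a_i)_k (a_i)_l\} \Big)\Big| > \frac{t}{\sqrt{m}} \Big\} \le 2\exp\Big(-c\min\Big\{ t^2,\ \frac{t\sqrt{m}}{\log m} \Big\}\Big). \tag{4.15} \]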
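The two branches of the minimum in (4.15) can be traced through a weighted Bernstein inequality. The sketch below assumes the standard form of that inequality for independent, centered subexponential variables $X_i$ with $\|X_i\|_{\psi_1} \le K$; the symbols $w_i$, $X_i$, $K$ and $u$ are introduced here for illustration, and the third statement of Lemma 4.5.3 is read as a bound on $|f|$.

```latex
% Conditionally fixed weights and centered summands:
%   w_i = f(<a_i, x^*>)/m,   X_i = (a_i)_k (a_i)_l - E{(a_i)_k (a_i)_l}.
\begin{align*}
\mathbb{P}\Big\{ \Big|\sum_{i=1}^m w_i X_i\Big| > u \Big\}
  &\le 2\exp\Big(-c\min\Big\{ \frac{u^2}{K^2\sum_i w_i^2},\ \frac{u}{K\max_i |w_i|} \Big\}\Big), \\
\sum_{i=1}^m w_i^2 &= \frac{1}{m^2}\sum_{i=1}^m f(\langle a_i,x^*\rangle)^2 \le \frac{C}{m}, \qquad
\max_i |w_i| \le \frac{C\log m}{m}.
\end{align*}
% Setting u = t/\sqrt{m}:
\begin{align*}
\frac{u^2}{\sum_i w_i^2} \ge \frac{t^2/m}{C/m} = \frac{t^2}{C}, \qquad
\frac{u}{\max_i |w_i|} \ge \frac{t/\sqrt{m}}{C\log m/m} = \frac{t\sqrt{m}}{C\log m}.
\end{align*}
```

Up to constants, the quadratic branch is the smaller one when $t\sqrt{m}/\log m \ge t^2$, that is, when $m \ge t^2\log^2 m$. With $t \sim \sqrt{\log(n/\delta)}$ this is the requirement $m \gtrsim \log(n/\delta)\log^2 m$.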

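Once again, choosing $t \sim \sqrt{\log(n/\delta)}$ and taking a union bound over $s < k, l \le n$ gives

\[ \|P_s^\perp(\hat{\Sigma} - \Sigma)P_s^\perp\|_\infty \lesssim \sqrt{\frac{\log(n/\delta)}{m}}, \]

with probability at least $1 - \delta/4$ provided that $m \gtrsim \log(n/\delta)\log^2 m$. Combining the three bounds with the probability $1 - \delta/4$ event of Lemma 4.5.3 via a union bound completes the proof.

Remark 4.5.2. When $a$ is a standard Gaussian, [81] gave the bound

\[ \|\hat{\Sigma} - \Sigma\|_\infty \le \frac{C\sqrt{\log(n/\delta)}}{\sqrt{m}} \]

with roughly the same tail probability. Hence, the only price to having more distributional generality is the additional $\log^2(s/\delta)$ term in the numerator.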
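For readers who want to see the entrywise deviation numerically, the following is a minimal simulation sketch, not the procedure of [81]. It assumes standard Gaussian measurements, the illustrative link $f(u) = u^2$, and $x^* = e_1$, for which the population matrix is $I + 2x^*(x^*)^T$. The function names `reweighted_covariance`, `max_entry_deviation` and `simulate` are hypothetical.

```python
import random


def reweighted_covariance(rows, weights):
    """Return (1/m) * sum_i weights[i] * a_i a_i^T as a list of lists."""
    m = len(rows)
    n = len(rows[0])
    acc = [[0.0] * n for _ in range(n)]
    for a, w in zip(rows, weights):
        for k in range(n):
            wk = w * a[k]
            for l in range(n):
                acc[k][l] += wk * a[l]
    return [[v / m for v in row] for row in acc]


def max_entry_deviation(A, B):
    """Entrywise maximum norm of A - B."""
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def simulate(m, n=6, seed=0):
    """Deviation of the sample matrix from its mean for one random draw.

    Assumptions (illustrative only): a_i is standard Gaussian, x* = e_1,
    and f(u) = u^2, so the weight of sample i is (a_i)_1 squared.
    """
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    weights = [a[0] ** 2 for a in rows]
    sigma_hat = reweighted_covariance(rows, weights)
    # Population matrix under these assumptions: identity plus 2 * e_1 e_1^T.
    sigma = [[(1.0 if k == l else 0.0) + (2.0 if k == l == 0 else 0.0)
              for l in range(n)] for k in range(n)]
    return max_entry_deviation(sigma_hat, sigma)
```

Calling `simulate(m)` for increasing values of `m` should show the deviation shrinking roughly like $1/\sqrt{m}$, in line with Proposition 4.5.1; no specific numerical output is claimed here.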

Lemma 4.5.3. Let the hypotheses of Proposition 4.5.1 hold. There is an absolute constant $C$ such that the following holds. Let $I$ denote the support of $x^*$. Then for any $\delta > 0$, so long as $m \ge Cs/\delta$, the following three statements hold simultaneously with probability at least $1 - \delta/4$.

1. $\sum_{i=1}^m f(\langle a_i, x^*\rangle)^2 \le Cm$.

2. $\max_{k \in I} \sum_{i=1}^m (a_i)_k^2 f(\langle a_i, x^*\rangle)^2 \le Cm$.

3. $\max_{1 \le i \le m} f(\langle a_i, x^*\rangle) \le C\log m$.

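Proof. By Assumption (Af,Z,s), we know that $\|f(\langle a_i, x^*\rangle)\|_{\psi_1}$ is bounded by an absolute constant. As such, Proposition 2.2.4 implies that both its second and fourth moments are also bounded. Furthermore, we have

\[ \mathrm{Var}\big(f(\langle a_i, x^*\rangle)^2\big) \le \mathbb{E}\{f(\langle a_i, x^*\rangle)^4\} \le C. \]

Applying Chebyshev's inequality together with the second moment bound, we thus get

\[ \mathbb{P}\Big\{ \sum_{i=1}^m f(\langle a_i, x^*\rangle)^2 \ge m(C + t) \Big\} \le \frac{C}{mt^2}. \tag{4.16} \]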
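The step leading to (4.16) is a direct application of Chebyshev's inequality. As a sketch, write $S$ for the sum in question (notation introduced here) and use that the $m$ summands are i.i.d.:

```latex
% S = \sum_{i=1}^m f(<a_i, x^*>)^2, a sum of m i.i.d. terms; t > 0.
\begin{align*}
\mathbb{E}S &= m\,\mathbb{E}\{f(\langle a, x^*\rangle)^2\} \le Cm, \qquad
\operatorname{Var}(S) = m\operatorname{Var}\big(f(\langle a, x^*\rangle)^2\big) \le Cm, \\
\mathbb{P}\{S \ge m(C+t)\} &\le \mathbb{P}\{S - \mathbb{E}S \ge mt\}
  \le \frac{\operatorname{Var}(S)}{(mt)^2} \le \frac{C}{mt^2}.
\end{align*}
```

The same constant $C$ is used for the second and fourth moment bounds, which is harmless since $C$ may be enlarged.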

We can use the same argument together with a union bound over $k \in I$ to get

\[ \mathbb{P}\Big\{ \max_{k \in I} \sum_{i=1}^m (a_i)_k^2 f(\langle a_i, x^*\rangle)^2 \ge m(C + t) \Big\} \le \frac{Cs}{mt^2}. \tag{4.17} \]

Finally, we again use the union bound and the subexponential tail bound to get

\[ \mathbb{P}\Big\{ \max_{1 \le i \le m} f(\langle a_i, x^*\rangle) \ge t\log m \Big\} \le 2m\exp(-ct\log m) = 2m^{1-ct}. \tag{4.18} \]

Choose $t$ to be any fixed constant in (4.16) and (4.17), and choose $t$ to be a constant larger than $2/c$ in (4.18). Then each of these probability bounds is at most of order $O(s/m)$, so that $m \gtrsim s/\delta$ suffices for all three statements to hold with probability at least $1 - \delta/4$.
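The final bookkeeping can be written out as follows. This is a sketch in which $t_0$ denotes the fixed constant used in (4.16) and (4.17), $t_1 > 2/c$ the constant used in (4.18), and $C'$ a constant depending only on $C$ and $t_0$; all three symbols are introduced here for illustration.

```latex
% Union bound over the three failure events, using s >= 1 and m >= 1.
\begin{align*}
\mathbb{P}\{\text{some statement fails}\}
  &\le \frac{C}{m t_0^2} + \frac{Cs}{m t_0^2} + 2m^{1 - c t_1} \\
  &\le \frac{C}{m t_0^2} + \frac{Cs}{m t_0^2} + \frac{2}{m}
  \le \frac{C' s}{m}, \qquad C' = \frac{2C}{t_0^2} + 2, \\
\frac{C' s}{m} \le \frac{\delta}{4} &\iff m \ge \frac{4C' s}{\delta}.
\end{align*}
```

The middle inequality uses $1 - ct_1 < -1$, which is exactly what the requirement $t_1 > 2/c$ provides.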

4.6 Comments and open questions

In this chapter, we have analyzed the problem of misspecified phase retrieval, and improved upon the work of Neykov et al. in [81]. In particular, we have shown that the first stage of their algorithm suffices for signal recovery with the same sample complexity, and extended the analysis to non-Gaussian measurements. Furthermore, we showed how the algorithm can be generalized to recover a signal vector $x^*$ efficiently given geometric prior information other than sparsity.

Experts in compressed sensing may have observed that while the sample complexity for algorithms for misspecified linear regression scales linearly with the sparsity parameter, our sample complexity bounds for misspecified phase retrieval scale instead with the square of the parameter. In [81], the authors showed numerical evidence that this discrepancy is due to the statistical inefficiency of the algorithm, and not merely a slackness in the mathematical analysis.

This $s^2$ scaling is also observed in all other efficient algorithms for sparse phase retrieval, and it is an open question whether there exist computationally efficient algorithms that can do better. The authors of [81] conjecture that the answer is in the negative. This is supported by results by Berthet and Rigollet, who show that computationally efficient algorithms for the related problem of detecting sparse principal components, using $O(s^{2-\epsilon})$ samples for any $\epsilon > 0$, would lead to computationally efficient algorithms for solving hard instances of the planted clique problem [11,10]. This is widely conjectured to be impossible.

It would also be interesting to investigate whether there is slackness in the sample complexity bound for signal recovery using general geometric constraints (Theorem 4.2.5). In particular, I do not know how to bound $\gamma_1(K, \|\cdot\|)$ where $K$ is the set of unit trace PSD matrices $X$ with $\|X\|_1 \le s$. Hence, it is not yet clear whether Theorem 4.2.2 can be derived from Theorem 4.2.5.

Finally, the literature on high-dimensional signal recovery from non-Gaussian measurements is still fairly limited. In this work, we have proved a recovery guarantee for admissible signal vectors in the case of misspecified phase retrieval. Hopefully, this guarantee can be extended to larger classes of signal vectors in the near future.
