Finding the largest low rank clusters with Ky Fan 2 k norm and l1 norm

(1)

Original citation:

Doan, Xuan Vinh and Vavasis, Stephen. (2016) Finding the largest low-rank clusters

with Ky Fan 2-k-norm and l1-norm. SIAM Journal on Optimization, 26 (1). pp. 274-312.

Permanent WRAP url:

http://wrap.warwick.ac.uk/74985

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work of researchers of the

University of Warwick available open access under the following conditions. Copyright ©

and all moral rights to the version of the paper presented here belong to the individual

author(s) and/or other copyright owners. To the extent reasonable and practicable the

material made available in WRAP has been checked for eligibility before being made

available.

Copies of full items can be used for personal research or study, educational, or

not-for-profit purposes without prior permission or charge. Provided that the authors, title and

full bibliographic details are credited, a hyperlink and/or URL is given for the original

metadata page and the content is not changed in any way.

Publisher’s statement:

“First Published in SIAM Journal on Optimization in 26 (1) published by the

Society for Industrial and Applied Mathematics (SIAM). Copyright © by SIAM.

Unauthorized reproduction of this article is prohibited.

A note on versions:

The version presented in WRAP is the published version or, version of record, and may

be cited as it appears here.

(2)

FINDING THE LARGEST LOW-RANK CLUSTERS WITH KY FAN 2-k-NORM AND₁-NORM∗

XUAN VINH DOAN† AND STEPHEN VAVASIS‡

Abstract. We propose a convex optimization formulation with the Ky Fan 2-k-norm and1 -norm to ﬁndklargest approximately rank-one submatrix blocks of a given nonnegative matrix that has low-rank block diagonal structure with noise. We analyze low-rank and sparsity structures of the optimal solutions using properties of these two matrix norms. We show that, under certain hypotheses, with high probability, the approach can recover rank-one submatrix blocks even when they are corrupted with random noise and inserted into a much larger matrix with other random noise blocks.

Key words. low rank matrix approximation, nonnegative matrix factorization, Ky Fan norm, subgaussian random noise, biclustering

AMS subject classifications.15A23, 15A60, 15B52, 90C22, 62H30

DOI.10.1137/140962097

1. Introduction. Given a matrix A ∈ Rm×n that has low-rank block diago-nal structure with noise, we would like to find that low-rank block structure of A. Doan and Vavasis [6] have proposed a convex optimization formulation to find a large approximately rank-one submatrix of A with the nuclear norm and ₁-norm. The proposed LAROS problem (“large approximately rank-one submatrix”) in [6] can be used to sequentially extract features in data. For example, given a corpus of docu-ments in some language, it can be used to co-cluster (or bicluster) both terms and documents, i.e., to identify simultaneously both subsets of terms and subsets of doc-uments strongly related to each other from theterm-document matrix A∈Rm×n of the underlying corpus ofndocuments withmdefined terms (see, for example, Dhillon [4]). Here, “term” means a word in the language, excluding common words such as articles and prepositions. The (i, j) entry ofAis the number of occurrences of termi in documentj, perhaps normalized. Another example is the biclustering of gene ex-pression data to discover exex-pression patterns of gene clusters with respect to different sets of experimental conditions (see the survey by Madeira and Oliveira [16] for more details). Gene expression data can be represented by a matrix Awhose rows are in correspondence with different genes and columns are in corresponence with different experimental conditions. The valueaij is the measurement of the expression level of

geneiunder the experimental conditionj.

If the selected terms in a bicluster occur proportionally in the selected documents, we can intuitively assign a topic to that particular term-document bicluster. Similarly, if the expression levels of selected genes are proportional in all selected experiments of a bicluster in the second example, we can identify an expression pattern for the given

∗_{Received by the editors March 24, 2014; accepted for publication (in revised form) November 25,}

2015; published electronically January 26, 2016. This work was supported in part by the U.S. Air Force Oﬃce of Scientiﬁc Research, a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and a grant from MITACS.

http://www.siam.org/journals/siopt/26-1/96209.html

†_{DIMAP and ORMS Group, Warwick Business School, University of Warwick, Coventry, CV4}

7AL, UK ([email protected]). This research was partly done while the author was at the De-partment of Combinatorics and Optimization, University of Waterloo, Canada.

‡_{Department of Combinatorics and Optimization, University of Waterloo, 200 University Avenue}

West, Waterloo, ON, N2L 3G1, Canada ([email protected]).

274

(3)

gene-experimental condition bicluster. Mathematically, for each biclusteri, we obtain a subset Ii ⊂ {1, . . . , m} and Ji ⊂ {1, . . . , n} and the submatrix block A(Ii,Ji) is

approximately rank one, i.e.,A(Ii,Ji)≈wihTi. Assuming there arekbiclusters and

Ii∩ Ij=∅ andJi∩ Jj =∅for alli=j, we then have the following approximation:

(1.1) A≈[ ¯w₁, . . . ,wk¯ ][¯h₁, . . . ,hk¯ ]T,

where ¯wi and ¯hi are the zero-padded extensions ofwi andhi to vectors of lengthm andn, respectively. If the matrixAis nonnegative and consists of thesek(row- and column-exclusive) biclusters, we may assume thatwi,hi≥0for alli(a consequence of the Perron–Frobenius theorem; see, for example, Golub and Van Loan [9] for more details). ThusA≈W HT, whereW,H ≥0, which is an approximatenonnegative matrix factorization (NMF) of the matrix A. In this paper, we shall follow the NMF representation to ﬁnd row- and column-exclusive biclusters. Note that there are diﬀerent frameworks for biclustering problems such as the graph partitioning models used in Dhillon [4], Tanay, Sharan, and Shamir [19], and Ames [1], among other models (see, for example, the survey by Nan, Boyko, and Pardalos [7]).

Approximate and exact NMF problems are difficult to solve. The LAROS problem proposed by Doan and Vavasis [6] can be used as a subroutine for a greedy algorithm with which columns ofW andHare constructed sequentially. Each pair of columns corresponds to a feature (or pattern) in the original data matrixA. Given the prop-erties of the LAROS problem, the most significant feature (in size and magnitude) will be constructed first with the appropriate parameter.

The iterated use of the LAROS algorithm of [6] to extract blocks one at a time, however, will not succeed in the case that there are two or more hidden blocks of roughly the same magnitude. In order to avoid this issue, we propose a new convex formulation that allows us to extract several (nonoverlapping) features simultaneously. In section 2, we study the proposed convex relaxation and the properties of its opti-mal solutions. In section 3, we provide conditions to recover low-rank block structure of the block diagonal data matrix A in the presence of random noise. Finally, we demonstrate our results with some numerical examples in section 4, including a syn-thetic biclustering example and a synsyn-thetic gene expression example from the previous literature.

Notation. A,X = trace(ATX) is used to denote the inner product of two matricesAandX inRm×n. X₁means the sum of the absolute values of all entries ofX, i.e., the ₁-norm of vec(X), the long vector constructed by the concatenation of all columns of X. Similarly, X_∞ is the maximum absolute value of entries of

X, i.e., the_∞-norm of vec(X).

2. Matrix norm minimization. We start with the following general norm minimization problem, which has been considered in [6]:

(2.1) min |X|

s.t. A,X ≥1,

where| · |is an arbitrary norm function onRm×n. The associated dual norm| · | is deﬁned as

(2.2) |A|

_{= max} _A_,_Y

s.t. |Y| ≤1.

These two optimization problems are closely related and their relationship is captured in the following lemmas and theorem discussed in Doan and Vavasis [6].

(4)

Lemma 2.1. Matrix X∗ is an optimal solution of problem (2.1) if and only if

Y∗_{= (}_|_A_|₎_X∗ _{is an optimal solution of problem}_(2.2)_.

Lemma 2.2. The set of all optimal solutions of problem(2.2)is the subdiﬀerential

of the dual norm function | · | atA,∂|A|.

Theorem 2.3 (Doan and Vavasis [6]). The following statements are true:

(i) The set of optimal solutions of problem (2.1) is (|A|)−1∂|A|, where

∂| · | _{is the subdiﬀerential of the dual norm function}_{| · |}_.

(ii) Problem (2.1) has a unique optimal solution if and only if the dual norm function | · | is diﬀerentiable atA.

The LAROS problem in [6] belongs to a special class of (2.1) with parametric matrix norms of the form|X|θ=|X|+θX1, where| · |is aunitarily invariant norm andθis a nonnegative parameter,θ≥0:

(2.3) min |X|+θX1

s.t. A,X ≥1.

A norm| · |is unitarily invariant if|UXV|=|X|for all pairs of unitary matrices

U and V (see, for example, Lewis [15] for more details). For the LAROS problem,

|X|is the nuclear norm, |X|=X_∗, which is the sum of singular values of X. In order to characterize the optimal solutions of (2.3), we need to compute the dual norm| · |_θ:

(2.4) |A|

θ= max A,Y

s.t. |Y|+θY₁≤1.

The following proposition, which is a straightforward generalization of Proposition 7 in [6], provides a dual formulation to compute| · |_θ.

Proposition 2.4. The dual norm |A|_θ with θ >0 is the optimal value of the

following optimization problem:

(2.5) |A|

θ= min max

|Y|_{, θ}−1_Z ∞

s.t. Y +Z=A.

The optimality conditions of (2.3) are described in the following proposition, which is again a generalization of Proposition 9 in [6].

Proposition 2.5. _{Consider a feasible solution}X_{of problem}₍₂.₃₎_{. If there exists}

(Y,Z)that satisﬁes the conditions

(i) Y +Z=Aand|Y|=θ−1Z_∞,

(ii) X∈α∂|Y|,α≥0,

(iii) X∈β∂Z_∞,β ≥0,

(iv) α+θβ=A∗_θ−1,

thenX is an optimal solution of problem(2.3). In addition, if

(v) | · | is diﬀerentiable atY or · _∞ is diﬀerentiable atZ, thenX is the unique optimal solution.

The low-rank structure of solutions obtained from the LAROS problem comes from the fact that the dual norm of the nuclear norm is the spectral norm (or 2-norm), X = σ₁(X), the largest singular value of X. More exactly, it is due to the structure of the subdiﬀerential∂ · . According to Zi¸etak [22], ifY =UΣVT is a singular value decomposition of Y and sis the multiplicity of the largest singular value ofY, the subdiﬀerential∂Yis written as follows:

∂Y=

U

S 0

0 0

VT :S ∈ S₊s,S_∗= 1 ,

(5)

whereS₊s is the set of positive semidefinite matrices of sizes. The description of the subdifferential shows that the maximum possible rank ofX ∈α∂Y is the multi-plicity of the largest singular value ofY and ifs= 1, we achieve rank-one solutions. This structural property of the subdifferential ∂ · motivates the norm optimiza-tion formulaoptimiza-tion for the LAROS problem, which aims to find asingle approximately rank-one submatrix of the data matrix A. We now propose a new pair of norms that would allow us to handle several approximately rank-one submatrices simulta-neously instead of individual ones. Let us consider the following norm, which we call the Ky Fan 2-k-norm given its similar formulation to that of the classical Ky Fan

k-norm:

(2.6) |A|k,2=

_k

i=1 σ2

i(A) 1

2 ,

whereσ₁≥. . . σk ≥0 are the ﬁrstklargest singular values ofA,k≤k0= rank(A).

The dual norm of the Ky Fan 2-k-norm is denoted by| · |_k,₂. According to Bhatia [3], the Ky Fan 2-k-norm is aQ-norm, which is unitarily invariant (Deﬁnition IV.2.9 in [3]). Since the Ky Fan 2-k-norm is unitarily invariant, we can deﬁne its corresponding symmetric gauge function, · _k,₂:Rn→R, as follows:

(2.7) x_k,₂=

_k

i=1 |x|2₍i)

1 2 ,

where|x|₍_i₎is the (n−i+ 1)st order statistic of|x|. The dual norm of this gauge func-tion (or more exactly, its square), has been used in Argyriou, Foygel, and Srebro [2] as a regularizer in sparse prediction problems. More recently, its matrix counterpart is considered in McDonald, Pontil, and Stamos [17] as a special case of the matrix cluster norm deﬁned in [13], whose square is used for multitask learning regulariza-tion. On the other hand, the square Ky Fan 2-k-norm is considered as a penalty in low-rank regression analysis in Giraud [8]. In this paper, we are going to use the dual Ky Fan 2-k-norm, not its square, in our formulation given its structural properties, which will be explained later.

Whenk= 1, the Ky Fan 2-k-norm becomes the spectral norm, whose subdiﬀer-ential has been used to characterize the low-rank structure of the optimal solutions of the LAROS problem. We now propose the following optimization problem, of which the LAROS problem is a special instance withk= 1:

(2.8) min |X|

k,2+θX1

s.t. A,X ≥1,

where θ is a nonnegative parameter, θ ≥ 0. The proposed formulation is an in-stance of the parametric problem (2.3) and we can use results obtained in Propo-sitions 2.4 and 2.5 to characterize its optimal solutions. Before doing so, we ﬁrst provide an equivalent semideﬁnite optimization formulation for (2.8) in the following proposition.

(6)

Proposition 2.6. Assuming m ≥ n, the optimization problem (2.8) is then

equivalent to the following semideﬁnite optimization problem:

(2.9)

min

p,P,Q,R,X p+trace(R) +θ EQ

s.t. kp−trace(P) = 0,

pI−P 0,

P −1

2XT −1

2X R

0,

Q≥X, Q≥ −X,

A,X ≥1,

whereE is the matrix of all ones.

Proof. We ﬁrst consider the dual norm|X|_k,₂. We have that

(2.10) |X|

k,2= max XY

s.t. |Y|k,2≤1.

Since m ≥ n, we have that (|Y|k,2)2 =|YTY|k, where| · |k is the Ky Fank

-norm, i.e., the sum ofklargest singular values. SinceYTY is symmetric,|YTY|kis actually the sum ofklargest eigenvalues ofYTY. Similar tox_k, which is the sum of k largest elements of x, we obtain the following (dual) optimization formulation for|YTY|k (for example, see Laurent and Rendl [14]):

|YT_Y_|

k= min kz+ trace(U) s.t. zI+U YTY,

U 0.

Applying the Schur complement, we have that

|YT_Y_|

k= min kz+ trace(U)

s.t.

zI+U YT

Y I

0,

U 0.

Thus, the dual norm| · |_k,₂ can be computed as follows:

|X|

k,2= max XY

s.t. kz + trace(U)≤1,

zI+U YT

Y I

0,

U 0.

Applying the strong duality theory under Slater’s condition, we have that

(2.11)

|X|

k,2= min p+ trace(R)

s.t. kp−trace(P) = 0,

pI−P 0,

P −1

2XT −1

2X R

0.

The reformulation ofX₁ is straightforward with the new decision variableQand additional constraintsQ≥X andQ≥ −X, given the fact that the main problem is a minimization problem.

(7)

Proposition 2.6 indicates that in general, we can solve (2.8) by solving its equiv-alent semideﬁnite optimization formulation (2.9) with any SDP solver. We are now ready to study some properties of optimal solutions of (2.8). We have that|X|_k,₂+

θX₁ is a norm forθ≥0 and we denote it by|X|k,2,θ. According to Proposition

2.4, the dual norm|X|_k,₂_,θ,

(2.12) |A|

k,2,θ= max A,X

s.t. X_k,₂_,θ≤1,

can be calculated by solving the following optimization problem givenθ >0:

(2.13) |A|

k,2,θ= min max

|Y|k,2, θ−1Z∞

s.t. Y +Z=A.

Similar to Proposition 2.5, we can provide the optimality conditions for (2.8) in the following proposition.

Proposition 2.7. Consider a feasible solutionXof problem(2.8). If there exists

(Y,Z)that satisﬁes the conditions

(i) Y +Z=Aand|Y|k,2=θ−1Z_∞,

(ii) X∈α∂|Y|k,2,α≥0,

(iii) X∈β∂Z_∞,β ≥0,

(iv) α+θβ= (A∗_k,₂_,θ)−1,

thenX is an optimal solution of Problem (2.3). In addition, if

(v) | · |k,2 is diﬀerentiable atY or · ∞ is diﬀerentiable atZ,

thenX is the unique optimal solution.

The optimality conditions presented in Proposition 2.7 indicate that some prop-erties of optimal solutions of (2.8) can be derived from the structure of∂| · |k,2. We

shall characterize the subdiﬀerential∂| · |k,2 next. According to Watson [21], since | · |k,2is a unitarily invariant norm,∂|A|k,2is related to∂σ(A)k,2, whereσ(A)

is the vector of singular values ofA. LetA=0be a matrix with singular values that satisfy

σ₁≥ · · ·> σk−t+1 =· · ·=σk =· · ·=σk+s>· · · ≥σp,

where p = min{m, n}, so that the multiplicity of σk is s+t. The subdiﬀerential

∂σk,2 is characterized in the following lemma.

Lemma 2.8. v∈∂σ

k,2 if and only ifv satisﬁes the following conditions:

(i) vi= σi

σk,2 for alli= 1, . . . , k−t.

(ii) vi=τi_σσk

k,2,0≤τi≤1, for alli=k−t+ 1, . . . , k+s, and

k+s

i=k−t+1τi=t.

(iii) vi= 0 for alli=k+s+ 1, . . . , p.

Proof. LetNk be the collection of all subsets with k elements of{1, . . . , p}, we

have

σk,2= max_N_∈N

k

fN(σ),

where fN(σ) =i∈Nσi2 1

2 _{for all} _N _{∈ N}

k. According to the Dubovitski–Milyutin

theorem (see, for example, Tikhomirov [20]), the subdiﬀerential of · _k,₂is computed as follows:

∂σk,2= conv

∂fN(σ) : N ∈ Nk, fN(σ) =σk,2

.

(8)

With the structure ofσ, clearly{1, . . . , k−t} ∈N for allN ∈ Nksuch thatfN(σ) =

σk,2. The remainingt elements of N are chosen from s+t values from {k−t+

1, . . . , k+s}. Sinceσ=0, allfN that satisfyfN(σ) =σk,2are diﬀerentiable atσ

(even in the caseσk = 0) and

∂fN(σ)

∂σi =

σi

σk,2 ∀

i∈N, ∂fN(σ)

∂σi = 0, i∈N.

Thus if v ∈ ∂σ_k,₂ for all i = 1, . . . , k−t, then, vi = σi

σ_k,₂ and vi = 0 for all i=k+s+ 1, . . . , p.

Finally, counting arguments for the appearance of each index in{k−t+1, . . . , k+s} with respect to all subsetsN ∈ Nk that satisfyfN(σ) =σk,2 allows us to

charac-terizevifori=k−t+ 1, . . . , k+sasvi=τi_σσk

k,2, 0≤τi ≤1, and

k+s

i=k−t+1τi=t.

We are ready to characterize the subdiﬀerential of | · |k,2 with the following

proposition.

Proposition 2.9. Consider A =0. Let A=UΣVT be a particular singular

value decomposition of Aand assume that σ(A)satisﬁes σ₁ ≥ · · ·> σk−t+1 =· · ·=

σk = · · · = σk+s > · · · ≥ σp. Then, G ∈ ∂|A|k,2 if and only if there exists T ∈R(s+t)×(s+t) _{such that}

G= 1

|A|k,2

U_[:,1:k−t]Σ[1:k−t,1:k−t]VT[:,1:k−t]+σkU[:,k−t+1:k+s]T VT[:,k−t+1:k+s]

,

whereT is symmetric positive semideﬁnite, T ≤1, andT_∗=t. Proof. According to Watson [21], we have that

∂|A|k,2=

UDiag(g)VT : A=UΣVTis any SVD ofA, g∈∂σ(A)_k,₂

.

Let A = UΣVT be a particular singular value decomposition of A and assume that a singular value σi > 0 has the multiplicity of r with corresponding singular

vectorsUi∈Rm×randVi ∈Rn×r. Then for any singular value decomposition ofA, A= ¯UΣV¯T, there exists an orthonormal matrixW ∈Rr×r, W WT =I, such that

¯

Ui=UiW and ¯Vi=ViW (for example, see Zi¸etak [22]).

Combining these results with Lemma 2.8, the proof is straightforward with a singular value (or eigenvalue) decomposition of matrixT.

Corollary 2.10. | · |_k,₂ is diﬀerentiable at anyA=0such that σ_k > σ_k₊₁

(σp+1= 0) or σk = 0.

Proof. Ifσk= 0, then, according to Proposition 2.9,

G∈∂|A|k,2 ⇔ G= 1

|A|k,2U[:,1:k−t]Σ[1:k−t,1:k−t]V

T

[:,1:k−t].

Now, ifσk > σk+1, thens= 0. Thus T =I is unique since T ∈ St, T_∗ =t, and

T ≤ 1. Thus ∂|A|k,2 is a singleton, which implies | · |k,2 is diﬀerentiable at

A.

Proposition 2.9 shows that problem (2.8) with θ = 0 is a convex optimization problem that ﬁnds k-approximation of a matrix A. It also shows that intuitively, problem (2.8) can be used to recoverk largest approximately rank-one submatrices

(9)

withθ >0. Note that for Ky Fank-norm, ifσk(A)> σk+1(A), its subdiﬀerential at

Ais a singleton with a unique subgradient:

∂|A|k=

U

Ik 0

0 0

VT ,

where A = UΣVT is a singular value decomposition of A and Ik is the identity matrix in Rk×k (see, for example, Watson [21]). In this particular case, the unique subgradient of the Ky Fank-norm provides the information of singular vectors corre-sponding to the klargest singular values. Having said that, it does not preserve the information of singular values. Whenθ = 0, the proposed formulation with the Ky Fank-norm will not return the rank-kapproximation of the matrixAas the Ky Fan 2-k-norm does. In the next section, we shall study the recovery of these submatrices under the presence of random noise.

3. Recovery with block diagonal matrices and random noise. We con-siderA=B+R, whereB is a block diagonal matrix, each block having rank one, whileR is a noise matrix. The main theorem shows that under certain assumptions concerning the noise, the positions of the blocks can be recovered from the solution of (2.8). As mentioned in the introduction, this corresponds to solving a special case of the approximate NMF problem, that is, a factorizationA≈W HT, whereW andH are nonnegative matrices. The special case solved is that W andH each consist of nonnegative columns with nonzeros in disjoint positions (so thatAis approximately a matrix with disjoint blocks each of rank one). Even this special case of NMF is NP-hard unless further restrictions are placed on the data model given the fact that the (exact) LAROS problem is NP-hard (see [6] for details).

Before starting the proof of the theorem, we need to consider some properties of subgaussian random variables. A random variablexisb-subgaussianifE[x] = 0 and there exists ab >0 such that for allt∈R,

(3.1) Eetx≤eb22t2.

We can apply the Markov inequality for the b-subgaussian random variable x and obtain the following inequalities:

(3.2) P(x≥t)≤exp(−t2/(2b2)) andP(x≤ −t)≤exp(−t2/(2b2)) ∀t >0. The next three lemmas, which show several properties of random matrices and vectors with independent subgaussian entries, are adopted from Doan and Vavasis [6] and references therein.

Lemma 3.1. Letx₁, . . . , x_k be independentb-subgaussian random variables and let

a₁, . . . , ak be scalars that satisfyki=1a2i = 1. Thenx= k

i=1aixi is ab-subgaussian random variable.

Lemma 3.2. Let B ∈ Rm×n be a random matrix, where b_ij are independent

b-subgaussian random variables for all i= 1, . . . , m, and j = 1, . . . , n. Then for any

u >0,

P(B ≥u)≤exp

−

8u2

81b2 −(log 7)(m+n)

.

(10)

Lemma 3.3. Let x,y be two vectors in Rn with independent and identically

distributed (i.i.d.) b-subgaussian entries. Then for anyt >0,

PxT_y_≥_t_≤_exp₋_min t2

(4eb2)2n,

t

4eb2 and

PxT_y_{≤ −}_t_≤_exp₋_min t2

(4eb2)2n,

t

4eb2 .

With these properties of subgaussian variables presented, we are now able to state and prove the main theorem, which gives suﬃcient conditions for optimization problem (2.8) to recoverk blocks in the presence of noise.

Theorem 3.4. SupposeA =B+R, where B is a block diagonal matrix with

k₀ blocks, that is, B= diag(B₁, . . . ,B_k₀), whereBi= ¯σiu¯iv¯Ti,u¯i∈Rmi,v¯i ∈Rni,

u¯i₂=v¯i₂= 1,u¯i>0,v¯i >0for alli= 1, . . . , k0. Assume the blocks are ordered

so that σ¯₁ ≥σ¯₂ ≥ · · · ≥σ¯k0 >0. Matrix R is a random matrix composed of blocks

in which each entry is a translated b-subgaussian variable, i.e., there exists μij ≥0 such that elements of the matrix block Rij/(φiφj)1/2 −μijemieTnj are independent

b-subgaussian random variables for all i, j = 1, . . . , k₀. Here φi = ¯σi/√mini, i =

1, . . . , k₀, is a scaling factor to match the scale of Rij with that of Bi andBj, and em denotes the m-vector of all1’s.

We deﬁne the following positive scalars that control the degree of heterogeneity among the ﬁrstk blocks:

δu≤ min

i=1,...,kui¯ 1/

√_m

i,

(3.3)

δv≤ min

i=1,...,k¯vi1/

√_n

i,

(3.4)

ξu≤ min i=1,...,k

min

j=1,...,m_iu¯i,j

√_m

i,

(3.5)

ξv≤ min i=1,...,k

min

j=1,...,ni ¯

vi,j

√_n

i,

(3.6)

πu≥ max i=1,...,k

max

j=1,...,mi ¯

ui,j

√_m

i,

(3.7)

πv≥ max i=1,...,k

max

j=1,...,ni ¯

vi,j

√_n

i,

(3.8)

ρm≥ max

i,j=1,...,kmi/mj,

(3.9)

ρn≥ max

i,j=1,...,kni/nj,

(3.10)

ρσ≥σ¯1/¯σk.

(3.11)

We also assume that the blocks do not diverge much from being square; more precisely we assume that mi ≤ O(n2j) and ni ≤ O(m2j) for i, j = 1, . . . , k. Let p= (k, δu, δv, πu, πv, ξu, ξv, ρσ, ρm, ρn)denote the vector of parameters controlling the heterogeneity.

For the remaining noise blocks i=k+ 1, . . . , k₀, we assume that their dominant singular values are substantially smaller than those of the ﬁrstk,

(3.12) σ¯k+1≤ 0_k.23¯_{+ 1}σk,

(11)

that their scale is bounded,

(3.13) φi≤c0φj

for all i=k+ 1, . . . , k₀ and j = 1, . . . , k, where c₀ is a constant, and that their size is bounded,

(3.14)

k0

i=k+1

(mi+ni)≤c1(p, c0, b) min_i₌₁_,...kmini,

wherec₁(p, c₀, b)is given by (3.123) below. Assume that

(3.15) μij ≤c2(p, c0)

for alli, j= 1, . . . , k₀, wherec₂(p, c₀)is given by (3.118) below. Then provided that

(3.16) c₃(p)

_k

i=1 mini

−1/2

≤θ≤2c₃(p)

_k

i=1 mini

−1/2 ,

where c₃(p) is given by (3.115) below, the optimization problem (2.8) will return X with nonzero entries precisely in the positions ofB₁, . . . ,B_kwith probability exponen-tially close to1 asmi, ni → ∞for alli= 1, . . . , k0.

Remarks.

1. Note that the theorem does not recover the exact values of (¯σi,ui¯ ,vi¯ ); it is clear that this is impossible in general under the assumptions made.

2. The theorem is valid under arbitrary permutation of the rows and columns (i.e., the block structure may be “concealed”) since (2.8) is invariant under such transformations.

3. Given the fact that for alli= 1, . . . , k,

0<

min

j=1,...,mi ¯

ui,j

√_m

i≤ u¯i₁/√mi ≤1≤

max

j=1,...,mi ¯

ui,j

√_m

i,

we can always choose ξu, δu, and πu such that 0 < ξu ≤ δu ≤ 1 ≤ πu. Similarly, we assume 0< ξv≤δv ≤1≤πv. These parameters measure how

much ¯ui and ¯vi diverge from emi and eni after normalization, respectively. The best case for our theory (i.e., the least restrictive values of parameters) occurs when all of these scalars are equal to 1. Similarlyρσ, ρm, ρn≥1, and

the best case for the theory is when they are all equal to 1.

4. It is an implicit assumption of the theorem that the scalars contained in p as well as b, which controls the subgaussian random variables, stay ﬁxed as

mi, ni→ ∞.

5. As compared to the recovery result in Ames [1] for the planted k-biclique problem, our result for the general bicluster problem is in general weaker in terms of noise magnitude (as compared to data magnitude) but stronger in terms of block sizes. Ames [1] requires mi = τ_i2ni, whereτi are scalars for all i, i = 1, . . . , k+ 1, whereas we only need mi ≤ O(n2_j) and ni ≤O(m2_j) for i, j = 1, . . . , k. More importantly, the noise block size, nk+1, is more restricted as compared to data block sizes, ni, for i= 1, . . . , k, in Ames [1]

with the condition

c₁√k+√nk+1+ 1

k+1

i=1

ni+βτk+1nk+1≤c2γ_i₌₁min_,...,kni.

(12)

In contrast, for our recovery result, (3.14) means that the total size of the noise blocks can be much larger (approximately the square) than the size of the data blocks. Thus, the theorem shows that the k blocks can be found even though they are hidden in a much larger matrix. In the special case whenk₀=k+ 1,mi=ni=n, ¯σi= ¯σfor alli= 1, . . . , k, andmk+1=nk+1,

combining (3.13) and (3.14), we will obtain the following condition, which clearly shows the relationship between block sizes:

¯

σk+1

c₀¯σ n≤nk+1≤

c₁(p, c₀, b)

2 n

2_.

6. As compared to the recovery result in Doan and Vavasis [6] whenk= 1, our recovery result is for a more general setting with ¯σ₂ > 0 instead of ¯σ₂ = 0 as in Doan and Vavasis [6]. We therefore need additional conditions on ¯σi,

i = 1,2. In addition, we need to consider the oﬀ-diagonal blocks (i, j) for

i, j= 1, . . . , k, which is not needed whenk= 1. This leads to more (stringent) conditions on the noise magnitudes. Having said that, the conditions on the parameterθ and block sizes remain similar. We still require θ to be in the order of (m₁n₁)−1/2 as in Doan and Vavasis [6]. The conditionsm₁≤O(n2₁) and n₁ ≤ O(m2₁) are similar to the condition m₁n₁ ≥ Ω((m₁+n₁)4/3) in Doan and Vavasis [6]. Finally, the condition m₂+n₂ ≤c₁(p, c₀, b)m₁n₁ is close to the conditionm₁n₁≥Ω(m₁+m₂+n₁+n₂), which again shows the similarity of these recovery results in terms of block sizes.

In order to simplify the proof, we ﬁrst consolidate all blocksi=k+1, . . . , k₀into a single block and call it block (k+1) of size ¯mk+1×¯nk+1where ¯mk+1=ki=0k+1miand

¯

nk+1 =ki=0k+1ni The only diﬀerence is that the new block ¯Bk+1,k+1∈Rm¯k+1×¯nk+1

is now a block diagonal matrix withk₀−kblocks instead of a rank-one block. Similarly, new blocks ¯Ri,k+1 and ¯Rk+1,i, i= 1, . . . , k0, now have more than one subblock with

diﬀerent parameters μ instead of a single one. This new block structure helps us derive the optimality conditions more concisely. Clearly, we would like to achieve the optimal solutionX with the following structure:

X=

⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝

σ₁u₁vT

1 0 · · · · · · 0 0 . .. . .. 0

..

. . .. . .. 0 ...

..

. 0 σkukvT_k 0

0 · · · · 0 0

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

,

where ui₂ =vi₂ = 1 fori = 1, . . . , k. Padding appropriate zeros to ui andvi to constructu0_i ∈ Rm₊ andv0_i ∈ Rn₊ for i= 1, . . . , k, we obtain suﬃcient optimality conditions based on Proposition 2.7 as follows:

There existY andZsuch thatY +Z=Aand

Y =|A|_k,₂_,θ

_k

i=1

σiu0i(v0i)T +W , Z=θ|A|k,2,θV,

where σi > 0 for i = 1, . . . , k, k_i₌₁σ2_i = 1, W ≤ mini=1,...,k{σi},

W v0

i =0, WTu0i =0, fori= 1, . . . , k, and V∞≤1, Vii =emieTni, fori= 1, . . . , k.

(13)

Since A has the block structure, we can break these optimality conditions into ap-propriate conditions for each block. Starting with diagonal (i, i) blocks, i= 1, . . . , k, the detailed conditions are

σiuivT

i +Wii =λ(¯σiui¯ ¯vTi +Rii)−θemieTni, (3.17)

WT iiui =0,

(3.18)

Wiivi =0,

(3.19)

where λ = 1/|A|_k,₂_,θ. For nondiagonal (i, j) blocks, i =j and i, j = 1, . . . , k, we obtain the following conditions:

Wij+θVij =λRij, (3.20)

WT

ijui=0,

(3.21)

Wijvj=0,

(3.22)

Vij_∞≤1.

(3.23)

For (i, k+ 1) blocks,i= 1, . . . , k, we have that

Wi,k+1+θVi,k+1=λR¯i,k+1,

(3.24)

WT

i,k+1ui=0,

(3.25)

Vi,k+1_∞≤1. (3.26)

Similarly, for (k+ 1, j) blocks,j= 1, . . . , k, the conditions are

Wk+1,j+θVk+1,j =λRk¯ ₊₁,j,

(3.27)

Wk+1,jvj =0,

(3.28)

Vk+1,j_∞≤1.

(3.29)

Finally, the (k+ 1, k+ 1) block needs the following conditions:

Wk+1,k+1+θVk+1,k+1 =λB¯k+1,k+1+ ¯Rk+1,k+1,

(3.30)

Vk+1,k+1_∞≤1. (3.31)

The remaining conditions are not block separable. We still needσi >0,i= 1, . . . , k,

and k_i₌₁σ_i2 = 1. The last condition, which is W ≤ mini=1,...,k{σi}, can be

replaced by the following suﬃcient conditions that are block separable by applying the fact thatW2≤_i,jWij2:

(3.32) Wij ≤ _k1

+ 1i=1min,...,k{σi}, i, j= 1, . . . , k+ 1.

With these suﬃcient block separable conditions, in order to construct (V,W), we now need to construct (Vij,Wij) for diﬀerent pairs (i, j) block by block. The block by block details are shown in the following analysis.

In the following proof, we assume that the random matrixRis chosen in stages: the diagonal blocksRii,i= 1, . . . , k, are selected before the off-diagonal blocks. This allows us to treat the diagonal blocks as deterministic during the analysis of the off-diagonal blocks. This technique of staging independent random variables is by now standard in the literature; see e.g., the “golfing” analysis of the matrix completion problem by Gross [11].

(14)

3.1. Analysis for block (i, i), i = 1, . . . , k. We begin with the proof of the existence of a λ > 0 that satisfies the optimality conditions. We then show the sufficient condition (3.32) for block (i, i),i= 1, . . . , k. The final condition that needs to be proved for these blocks is the positivity ofui andvi,i= 1, . . . , k.

3.1.1. Existence ofλ∗. The conditions for the (i, i) block,i= 1, . . . , k, namely, (3.17)–(3.19), indicate that (σi,ui,vi) is the dominant singular triple of the matrix

Li=λ(¯σiui¯ v¯T_i +Rii)−θem_ieT_n_i. They also indicate that

(3.33) Wii=σ2(Li)

since (3.17)–(3.19) are equivalent to the ﬁrst step of a singular value decomposition ofLi.

For the rest of this analysis, it is more convenient notationally to work with

τ=λ/θ rather than withλdirectly. The conditionk_i₌₁σ_i2= 1 becomes

(3.34) f(τ) =

k

i=1

!!τ(¯σiu¯iv¯Ti +Rii)−emieTni!!

2₋_θ₋₂

= 0.

We will prove that there exists τ∗ >0 such that f(τ∗) = 0. More precisely, we will focus our analysis off(τ) forτ∈[τ, τu], whereτis given by (3.110) andτu is given by (3.116) below, and prove that there existsτ∗∈[τ, τu] such thatf(τ∗) = 0.

Letting Q_ij = Rij/"φiφj −μijemieTnj for i, j = 1, . . . , k, clearly, Qij are b -subgaussian random matrices with independent elements. The function f can be rewritten as follows:

f(τ) =

k

i=1

!!τσ¯iui¯ v¯T_i −(1−τφiμii)em_ieT_n_i+τφiQ_ii!!2−θ−2

=

k

i=1

Pi(τ) +τφiQii2−θ−2,

(3.35)

where Pi(τ) =τσ¯iui¯ v¯_iT −(1−τφiμii)em_ieT_n_i. Applying the triangle inequality, we have

(3.36) Pi(τ) −τφiQii ≤ Pi(τ) +τφiQii ≤ Pi(τ)+τφiQii.

We start the analysis withPi(τ). We ﬁrst deﬁne the function

(3.37) gi(τ;a) =φ2iτ2−2aφiτ(1−μiiφiτ) + (1−μiiφiτ)2,

which is a quadratic function inτ with any ﬁxed parametera. Note by (3.119) below that τu≤0.3/(φiμii) for alli= 1, . . . , k, so 1−μiiφiτ≥0 andτ≥0 forτ∈[τ, τu]. Therefore, provideda≤1 andτ∈[τ, τu],

gi(τ;a) = (φiτ−(1−μiiφiτ))2+ 2(1−a)φiτ(1−μiiφiτ)

≥(φiτ−(1−μiiφiτ))2 = (φi(1 +μii)τ−1)2

(3.38)

≥0. (3.39)

The dominant singular triple of Pi(τ) = τσ¯iu¯iv¯iT −(1−τφiμii)emieTni is now analyzed for a ﬁxed τ ∈ [τ, τu]. It is clear that the dominant right singular vector

(15)

lies in span{v¯i,eni}since this is the range of (Pi(τ))T. Lettingζi=Pi(τ)2 be the square of the dominant singular value, it is clear thatζi is a solution of the following

eigenvector problem:

(Pi(τ))TPi(τ)(αv¯i+βeni) =ζi(αv¯i+βeni).

Expanding and gathering multiples of ¯vi and en_i, we obtain the 2×2 eigenvalue problem

(3.40) Mi

α β

=ζi

α β

,

whereMi is the 2×2 matrix, (3.41)

τ2_σ_¯2

i −τσ¯ihi(τ)ui¯ ₁vi¯ ₁ τ2σ¯2i −τσ¯ihi(τ)ui¯ ₁ni

(hi(τ))2vi¯ ₁mi−τσ¯ihi(τ)ui¯ ₁ (hi(τ))2mini−τσ¯ihi(τ)ui¯ ₁¯vi₁

,

andhi(τ) = 1−τφiμii,i= 1, . . . , k. Thus,ζi is a root of the equation

(3.42) ζ_i2−trace(Mi)ζi+ det(Mi) = 0,

where

trace(Mi) =τ2¯σi2−2τσ¯i(1−τφiμii)u¯i₁v¯i₁+ (1−τφiμii)2mini

=miniτ2φ2i −2τφi(1−τφiμii)δu,iδv,i+ (1−τφiμii)2 =minigi(τ;δu,iδv,i)

(3.43)

and

det(Mi) =τ2σ¯2_i(1−τφiμii)2(mi− ui¯ 2₁)(ni− vi¯ 2₁) =m2_in2_iτ2φ2_i(1−τφiμii)2(1−δ2u,i)(1−δv,i2 )

(3.44)

≥0.

Here, we have introduced notation

δu,i=ui¯ 1/√mi,

δv,i=¯vi1/√ni

that we will continue to use for the remainder of the proof. It is apparent that

δu,i∈[δu,1] by (3.3) and similarlyδv,i∈[δv,1].

Let Δ be the discriminant of the quadratic equation (3.42), that is,

(3.45) Δ = trace(Mi)2−4 det(Mi).

Letting Δ= Δ/(mini)2, we have that

Δ =

#

τ2_φ2

i −2τφi(1−τφiμii)

δu,iδv,i+ $

(1−δ_u,i2 )(1−δ2_v,i)

+ (1−τφiμii)2 %

· #τ2_φ2

i −2τφi(1−τφiμii)

δu,iδv,i− $

(1−δ_u,i2 )(1−δ2_v,i)

+ (1−τφiμii)2 %

=gi

τ;δu,iδv,i+ $

(1−δ_u,i2 )(1−δ2_v,i)

· gi

τ;δu,iδv,i− $

(1−δ_u,i2 )(1−δ2_v,i)

.

(3.46)

(16)

As 1−(δu,iδv,i+

$

(1−δ_u,i2 )(1−δ2_v,i))2 = (δu,i $

1−δ_v,i2 −δv,i $

1−δ2_u,i)2 ≥0, the second argument to each invocation of gi in the previous equation is less than or equal to 1. In addition, sinceτ ∈[τ, τu], it follows that both evaluations ofgi yield

nonnegative numbers, and therefore Δ≥0. We next claim that

(3.47) Δ =m2_in2_igi(τ;pi(τ))2

for a continuouspi(τ)∈[δu,iδv,i,1] for allτ ∈[τ, τu]. In other words, there exists a

continuouspi(τ) in the range [a,1] satisfying the equation

(3.48) gi(τ;pi(τ))2 =gi(τ;a+c)gi(τ;a−c), where, for this paragraph,a=δu,iδv,i and c=

$

(1−δ_u,i2 )(1−δ2_v,i). This is proved by ﬁrst treating pi as an unknown and expanding (3.48). After simpliﬁcation, the

result is a quadratic equation for pi. The facts that 0≤a, c≤1 anda+c≤1 allow

one to argue that the quadratic equation has a sign change over the interval [a,1] for all τ ∈ [0,1/(φiμii)] (hence for all τ ∈ [τ, τu]). Thus, the quadratic has a unique

root in this interval, which may be taken to bepi; it must vary continuously with the

coeﬃcients of the quadratic and hence withτ. The details are left to the reader. In addition toτ, pi(τ) depends onμii,φi,δu,i, andδv,i.

Thus, by the quadratic formula applied to (3.42), we can obtain ζi as the larger root

(3.49) ζi= 1

2(trace(Mi(τ)) +

√

Δ) =minigi(τ;ai(τ)),

where the second equation comes from adding (3.43) to the square root of (3.47) and noting that for anyτ, a, b, (gi(τ;a) +gi(τ;b))/2 =gi(τ; (a+b)/2). Here, we have that

(3.50) ai(τ) = 1

2(δu,iδv,i+pi(τ)).

By the earlier bound onpi(τ), this impliesai(τ)∈[ai, ai], where

(3.51) a_i=δu,iδv,i; ai= 1

2+ 1 2δu,iδv,i.

Clearly 0≤a_i ≤ai ≤1 for allisinceδu,i, δv,i∈[0,1]. Note that tighter bounds are

possible by a more careful analysis of Δ.

Sinceζi=Pi(τ)2, we can then expressPi(τ) as follows: (3.52) Pi(τ)="ζi="minigi(τ;ai(τ)).

Next, consider again (3.38); the right-hand side is a convex quadratic function of

τ with minimizer at 1/(φi(1 +μii)). It follows from (3.112) that τ ≥2/φi for alli.

Thus, forτ∈[τ, τu], we have that

τ ≥_φ2

i ≥

1

φi(1/2 +μii) >

1

φi(1 +μii).

Thus, the right-hand side of (3.38) is an increasing function of τ forτ ∈[τ, τu]. We then have, for anyτ∈[τ, τu],

gi(τ;a)≥

φi(1 +μii)

1

φi(1/2 +μii)

−1

2

=

1 1 + 2μii

2 .

(17)

Thus,

(3.53) Pi(τ) ≥

√_m

ini

1 + 2μii

for anyτ ∈[τ, τu]. We also have a second lower bound that grows linearly withτ:

"

gi(τ;a)≥φi(1 +μii)τ−1 (3.54)

=φiτ/2 + (φi(1/2 +μii)τ−1)

≥φiτ/2,

(3.55)

where the ﬁrst inequality follows from (3.38) and the other inequality is due to the fact thatτφi≥2 as noted above. This implies

Pi(τ) ≥√miniφiτ/2

= ¯σiτ/2. (3.56)

Next, we combine this linear lower bound on Pi(τ) with an upper bound on

Qiiin order to be able to take advantage of (3.36).

Claim 1. !!Q_ij!!≤(m_in_j)38 with probability exponentially close to1 asmi, nj →

∞for all i, j= 1, . . . , k.

To establish the claim observe that Q_ij is random with i.i.d. elements that are

b-subgaussian. Thus by Lemma 3.2(i),

(3.57) P!!Q_ij!!≥(minj)3/8

≤exp

−

(minj)3/4

81b2 −(log 7)(mi+nj)

,

where u is set to be (minj)3/8. The right-hand side tends to zero exponentially

fast since (minj)3/4 asymptotically dominates mi+nj under the assumption that

m1_i/4≤O(n1_j/2) andn1_j/4≤O(m1_i/2), which was stated as a hypothesis in the theorem. Then with a probability exponentially close to 1, the event in (3.57) does not happen, hence we assume!!Q_ij!!≤(minj)38. Focusing on thei=jcase for now, this implies

(3.58) Q_ii ≤√mini/40

for large mini; since the theorem applies to the asymptotic range, we assume this inequality holds true as well. Combining the inequality (3.58) with (3.56) and (3.36), we obtain

Pi(τ) +τφiQ_ii= (1 +γi(τ))Pi(τ)

= (1 +γi(τ))"minigi(τ;ai(τ)). (3.59)

In the ﬁrst line, we have introduced scalarγi(τ) to stand for a quantity in the range [−1/20,1/20] that varies continuously withτ. This notation will be used throughout the remainder of the proof. The second line follows from (3.52). Combining (3.59) and (3.56), we conclude

(3.60) Pi(τ) +τφiQii ≥0.47¯σiτ.

Finally, because Pi(τ) +τφiQii is a rescaling of the right-hand side of (3.17) by θ,

we conclude that

(3.61) σi≥0.47¯σiτθ.

(18)

Applying (3.59) to the formulation off(τ) in (3.35), we have that, forτ ∈[τ, τu],

f(τ) =

k

i=1

Pi(τ) +τφiQii2−θ−2

=

k

i=1

mini(1 +γi(τ))2gi(τ;ai(τ))−θ−2

=A(τ)τ2−2B(τ)τ−C(τ).

The third line is obtained by expanding the quadratic formula forgi(τ;ai(τ)), which

results in

A(τ) =

k

i=1

(1 +γi(τ))2σ¯i2(1 + 2μiiai(τ) +μ2ii),

B(τ) =

k

i=1

(1 +γi(τ))2√miniσ¯i(ai(τ) +μii),

C(τ) =θ−2−

k

i=1

(1 +γi(τ))2mini.

We will now prove that there existsτ∗ ∈[τ, τu] such thatf(τ∗) = 0 by applying the following lemma, which is a speciﬁc form of the intermediate theorem for “pseudo-quadratic” functions.

Lemma 3.5. _{Consider a real-valued function}fˆ₍τ₎_{of the form}

ˆ

f(τ) =A(τ)τ2−2B(τ)τ−C(τ),

whereA(τ),B(τ),C(τ)are continuous functions ofτ. Suppose there are two triples of positive numbers(A, B, C)<(A, B, C)(where <is understood elementwise). Deﬁne

(3.62) τ= B+

"

B2+A·C

A

and

(3.63) τ_u = B+

$

B2+A·C

A .

(Clearlyτ< τ_u.) Suppose further that there is an interval[τ, τu]such thatτ≤τ ≤

τ

u ≤τu and such that for all τ∈[τ, τu],

(A, B, C)≤(A(τ), B(τ), C(τ))≤(A, B, C).

Then there exists a rootτ∗ ∈[τ, τ_u](and therefore also in[τ, τu]) such thatfˆ(τ∗) = 0. Proof. Some simple algebra shows that ˆf(τ) =A(τ)(τ)2−2B(τ)(τ)−C(τ)≤ 0, while ˆf(τ_u) =A(τ_u)(τ_u)2−2B(τ_u)τ_u −C(τ_u)≥0, so there is aτ∗ ∈[τ, τ_u] such that ˆf(τ∗) = 0 by the intermediate value theorem.

(19)

In order to apply Lemma 3.5, we now deﬁne the following scalars:

A= (10/9)

k

i=1

¯

σ2

i(1 + 2μiiai+μ2_ii),

A= 0.90

k

i=1

¯

σ2

i(1 + 2μiiai+μ2ii),

B = (10/9)

k

i=1 √_m

iniσ¯i(ai+μii),

B = 0.90

k

i=1 √_m

iniσ¯i(ai+μii),

C= (10/9)

θ−2₋k

i=1 mini

,

C= (9/10)

θ−2₋

k

i=1 mini

.

It is obvious that (0,0)<(A, B)<(A, B). It follows from (3.16), (3.111), and (3.115) below that the parenthesized quantity in the deﬁnitions ofC, C is positive,

θ−2₋k

i=1

mini ≥

1 4c₃(p)2 −1

k

i=1 mini

= 1.24(c₄(p))2

_ρ

mρnk

k+ρmρn−1 k

i=1

mini>0,

and hence we also have 0< C < C.

In addition, given the fact that ai(τ) ∈ [ai;ai] and γi(τ) ∈ [−1/20; 1/20] for

τ ∈ [τ, τu], this establishes for this interval that (A, B, C) ≤ (A(τ), B(τ), C(τ)) ≤

(A, B, C).

We now show thatτ ≥τ andτ_u ≤τu. We have that

τ

=

B+"B2+A·C

A ≥

√ A·C

A .

Using the facts that 0≤ai ≤1, 0≤μii ≤c2(p, c0)≤0.08 (see (3.118) below), and θ−2₋k

i=1mini≥1.24(c4(p))2(k+ρρmmρρnnk−1) k

i=1mini>0 as above, we have that

τ

≥

_ρ

mρnk

k+ρmρn−1

₁/2 c₄(p)

_k

i=1 mini

1/2_k

i=1

¯

σ2

i −1/2

.

Next, observe that

_k i=1 ¯ σ2 i −1/2

≥k−1/2_σ−1 1