• No results found

Finding the largest low rank clusters with Ky Fan 2 k norm and l1 norm

N/A
N/A
Protected

Academic year: 2020

Share "Finding the largest low rank clusters with Ky Fan 2 k norm and l1 norm"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Original citation:

Doan, Xuan Vinh and Vavasis, Stephen. (2016) Finding the largest low-rank clusters

with Ky Fan 2-k-norm and l1-norm. SIAM Journal on Optimization, 26 (1). pp. 274-312.

Permanent WRAP url:

http://wrap.warwick.ac.uk/74985

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work of researchers of the

University of Warwick available open access under the following conditions. Copyright ©

and all moral rights to the version of the paper presented here belong to the individual

author(s) and/or other copyright owners. To the extent reasonable and practicable the

material made available in WRAP has been checked for eligibility before being made

available.

Copies of full items can be used for personal research or study, educational, or

not-for-profit purposes without prior permission or charge. Provided that the authors, title and

full bibliographic details are credited, a hyperlink and/or URL is given for the original

metadata page and the content is not changed in any way.

Publisher’s statement:

“First Published in SIAM Journal on Optimization in 26 (1) published by the

Society for Industrial and Applied Mathematics (SIAM). Copyright © by SIAM.

Unauthorized reproduction of this article is prohibited.

A note on versions:

The version presented in WRAP is the published version or, version of record, and may

be cited as it appears here.

(2)

FINDING THE LARGEST LOW-RANK CLUSTERS WITH KY FAN 2-k-NORM AND1-NORM

XUAN VINH DOAN AND STEPHEN VAVASIS

Abstract. We propose a convex optimization formulation with the Ky Fan 2-k-norm and1 -norm to findklargest approximately rank-one submatrix blocks of a given nonnegative matrix that has low-rank block diagonal structure with noise. We analyze low-rank and sparsity structures of the optimal solutions using properties of these two matrix norms. We show that, under certain hypotheses, with high probability, the approach can recover rank-one submatrix blocks even when they are corrupted with random noise and inserted into a much larger matrix with other random noise blocks.

Key words. low rank matrix approximation, nonnegative matrix factorization, Ky Fan norm, subgaussian random noise, biclustering

AMS subject classifications.15A23, 15A60, 15B52, 90C22, 62H30

DOI.10.1137/140962097

1. Introduction. Given a matrix A Rm×n that has low-rank block diago-nal structure with noise, we would like to find that low-rank block structure of A. Doan and Vavasis [6] have proposed a convex optimization formulation to find a large approximately rank-one submatrix of A with the nuclear norm and 1-norm. The proposed LAROS problem (“large approximately rank-one submatrix”) in [6] can be used to sequentially extract features in data. For example, given a corpus of docu-ments in some language, it can be used to co-cluster (or bicluster) both terms and documents, i.e., to identify simultaneously both subsets of terms and subsets of doc-uments strongly related to each other from theterm-document matrix ARm×n of the underlying corpus ofndocuments withmdefined terms (see, for example, Dhillon [4]). Here, “term” means a word in the language, excluding common words such as articles and prepositions. The (i, j) entry ofAis the number of occurrences of termi in documentj, perhaps normalized. Another example is the biclustering of gene ex-pression data to discover exex-pression patterns of gene clusters with respect to different sets of experimental conditions (see the survey by Madeira and Oliveira [16] for more details). Gene expression data can be represented by a matrix Awhose rows are in correspondence with different genes and columns are in corresponence with different experimental conditions. The valueaij is the measurement of the expression level of

geneiunder the experimental conditionj.

If the selected terms in a bicluster occur proportionally in the selected documents, we can intuitively assign a topic to that particular term-document bicluster. Similarly, if the expression levels of selected genes are proportional in all selected experiments of a bicluster in the second example, we can identify an expression pattern for the given

Received by the editors March 24, 2014; accepted for publication (in revised form) November 25,

2015; published electronically January 26, 2016. This work was supported in part by the U.S. Air Force Office of Scientific Research, a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada, and a grant from MITACS.

http://www.siam.org/journals/siopt/26-1/96209.html

DIMAP and ORMS Group, Warwick Business School, University of Warwick, Coventry, CV4

7AL, UK ([email protected]). This research was partly done while the author was at the De-partment of Combinatorics and Optimization, University of Waterloo, Canada.

Department of Combinatorics and Optimization, University of Waterloo, 200 University Avenue

West, Waterloo, ON, N2L 3G1, Canada ([email protected]).

274

(3)

gene-experimental condition bicluster. Mathematically, for each biclusteri, we obtain a subset Ii ⊂ {1, . . . , m} and Ji ⊂ {1, . . . , n} and the submatrix block A(Ii,Ji) is

approximately rank one, i.e.,A(Ii,Ji)wihTi. Assuming there arekbiclusters and

Ii∩ Ij= andJi∩ Jj =for alli=j, we then have the following approximation:

(1.1) A[ ¯w1, . . . ,wk¯ ][¯h1, . . . ,hk¯ ]T,

where ¯wi and ¯hi are the zero-padded extensions ofwi andhi to vectors of lengthm andn, respectively. If the matrixAis nonnegative and consists of thesek(row- and column-exclusive) biclusters, we may assume thatwi,hi0for alli(a consequence of the Perron–Frobenius theorem; see, for example, Golub and Van Loan [9] for more details). ThusAW HT, whereW,H 0, which is an approximatenonnegative matrix factorization (NMF) of the matrix A. In this paper, we shall follow the NMF representation to find row- and column-exclusive biclusters. Note that there are different frameworks for biclustering problems such as the graph partitioning models used in Dhillon [4], Tanay, Sharan, and Shamir [19], and Ames [1], among other models (see, for example, the survey by Nan, Boyko, and Pardalos [7]).

Approximate and exact NMF problems are difficult to solve. The LAROS problem proposed by Doan and Vavasis [6] can be used as a subroutine for a greedy algorithm with which columns ofW andHare constructed sequentially. Each pair of columns corresponds to a feature (or pattern) in the original data matrixA. Given the prop-erties of the LAROS problem, the most significant feature (in size and magnitude) will be constructed first with the appropriate parameter.

The iterated use of the LAROS algorithm of [6] to extract blocks one at a time, however, will not succeed in the case that there are two or more hidden blocks of roughly the same magnitude. In order to avoid this issue, we propose a new convex formulation that allows us to extract several (nonoverlapping) features simultaneously. In section 2, we study the proposed convex relaxation and the properties of its opti-mal solutions. In section 3, we provide conditions to recover low-rank block structure of the block diagonal data matrix A in the presence of random noise. Finally, we demonstrate our results with some numerical examples in section 4, including a syn-thetic biclustering example and a synsyn-thetic gene expression example from the previous literature.

Notation. A,X = trace(ATX) is used to denote the inner product of two matricesAandX inRm×n. X1means the sum of the absolute values of all entries ofX, i.e., the 1-norm of vec(X), the long vector constructed by the concatenation of all columns of X. Similarly, X is the maximum absolute value of entries of

X, i.e., the-norm of vec(X).

2. Matrix norm minimization. We start with the following general norm minimization problem, which has been considered in [6]:

(2.1) min |X|

s.t. A,X1,

where| · |is an arbitrary norm function onRm×n. The associated dual norm| · | is defined as

(2.2) |A|

= max A,Y

s.t. |Y| ≤1.

These two optimization problems are closely related and their relationship is captured in the following lemmas and theorem discussed in Doan and Vavasis [6].

(4)

Lemma 2.1. Matrix X is an optimal solution of problem (2.1) if and only if

Y= (|A|)X is an optimal solution of problem(2.2).

Lemma 2.2. The set of all optimal solutions of problem(2.2)is the subdifferential

of the dual norm function | · | atA,∂|A|.

Theorem 2.3 (Doan and Vavasis [6]). The following statements are true:

(i) The set of optimal solutions of problem (2.1) is (|A|)−1∂|A|, where

∂| · | is the subdifferential of the dual norm function| · |.

(ii) Problem (2.1) has a unique optimal solution if and only if the dual norm function | · | is differentiable atA.

The LAROS problem in [6] belongs to a special class of (2.1) with parametric matrix norms of the form|X=|X|+θX1, where| · |is aunitarily invariant norm andθis a nonnegative parameter,θ≥0:

(2.3) min |X|+θX1

s.t. A,X1.

A norm| · |is unitarily invariant if|UXV|=|X|for all pairs of unitary matrices

U and V (see, for example, Lewis [15] for more details). For the LAROS problem,

|X|is the nuclear norm, |X|=X, which is the sum of singular values of X. In order to characterize the optimal solutions of (2.3), we need to compute the dual norm| · |θ:

(2.4) |A|

θ= max A,Y

s.t. |Y|+θY11.

The following proposition, which is a straightforward generalization of Proposition 7 in [6], provides a dual formulation to compute| · |θ.

Proposition 2.4. The dual norm |A|θ with θ >0 is the optimal value of the

following optimization problem:

(2.5) |A|

θ= min max

|Y|, θ−1Z

s.t. Y +Z=A.

The optimality conditions of (2.3) are described in the following proposition, which is again a generalization of Proposition 9 in [6].

Proposition 2.5. Consider a feasible solutionXof problem(2.3). If there exists

(Y,Z)that satisfies the conditions

(i) Y +Z=Aand|Y|=θ−1Z,

(ii) X∈α∂|Y|,α≥0,

(iii) X∈β∂Z 0,

(iv) α+θβ=Aθ−1,

thenX is an optimal solution of problem(2.3). In addition, if

(v) | · | is differentiable atY or · is differentiable atZ, thenX is the unique optimal solution.

The low-rank structure of solutions obtained from the LAROS problem comes from the fact that the dual norm of the nuclear norm is the spectral norm (or 2-norm), X = σ1(X), the largest singular value of X. More exactly, it is due to the structure of the subdifferential∂ · . According to Zi¸etak [22], ifY =UΣVT is a singular value decomposition of Y and sis the multiplicity of the largest singular value ofY, the subdifferentialYis written as follows:

Y=

U

S 0

0 0

VT :S ∈ S+s,S= 1 ,

(5)

whereS+s is the set of positive semidefinite matrices of sizes. The description of the subdifferential shows that the maximum possible rank ofX ∈α∂Y is the multi-plicity of the largest singular value ofY and ifs= 1, we achieve rank-one solutions. This structural property of the subdifferential ∂ · motivates the norm optimiza-tion formulaoptimiza-tion for the LAROS problem, which aims to find asingle approximately rank-one submatrix of the data matrix A. We now propose a new pair of norms that would allow us to handle several approximately rank-one submatrices simulta-neously instead of individual ones. Let us consider the following norm, which we call the Ky Fan 2-k-norm given its similar formulation to that of the classical Ky Fan

k-norm:

(2.6) |A|k,2=

k

i=1 σ2

i(A) 1

2 ,

whereσ1≥. . . σk 0 are the firstklargest singular values ofA,k≤k0= rank(A).

The dual norm of the Ky Fan 2-k-norm is denoted by| · |k,2. According to Bhatia [3], the Ky Fan 2-k-norm is aQ-norm, which is unitarily invariant (Definition IV.2.9 in [3]). Since the Ky Fan 2-k-norm is unitarily invariant, we can define its corresponding symmetric gauge function, · k,2:Rn→R, as follows:

(2.7) xk,2=

k

i=1 |x|2(i)

1 2 ,

where|x|(i)is the (n−i+ 1)st order statistic of|x|. The dual norm of this gauge func-tion (or more exactly, its square), has been used in Argyriou, Foygel, and Srebro [2] as a regularizer in sparse prediction problems. More recently, its matrix counterpart is considered in McDonald, Pontil, and Stamos [17] as a special case of the matrix cluster norm defined in [13], whose square is used for multitask learning regulariza-tion. On the other hand, the square Ky Fan 2-k-norm is considered as a penalty in low-rank regression analysis in Giraud [8]. In this paper, we are going to use the dual Ky Fan 2-k-norm, not its square, in our formulation given its structural properties, which will be explained later.

Whenk= 1, the Ky Fan 2-k-norm becomes the spectral norm, whose subdiffer-ential has been used to characterize the low-rank structure of the optimal solutions of the LAROS problem. We now propose the following optimization problem, of which the LAROS problem is a special instance withk= 1:

(2.8) min |X|

k,2+θX1

s.t. A,X1,

where θ is a nonnegative parameter, θ 0. The proposed formulation is an in-stance of the parametric problem (2.3) and we can use results obtained in Propo-sitions 2.4 and 2.5 to characterize its optimal solutions. Before doing so, we first provide an equivalent semidefinite optimization formulation for (2.8) in the following proposition.

(6)

Proposition 2.6. Assuming m n, the optimization problem (2.8) is then

equivalent to the following semidefinite optimization problem:

(2.9)

min

p,P,Q,R,X p+trace(R) +θ EQ

s.t. kp−trace(P) = 0,

pIP 0,

P 1

2XT 1

2X R

0,

QX, Q≥ −X,

A,X1,

whereE is the matrix of all ones.

Proof. We first consider the dual norm|X|k,2. We have that

(2.10) |X|

k,2= max XY

s.t. |Y|k,21.

Since m n, we have that (|Y|k,2)2 =|YTY|k, where| · |k is the Ky Fank

-norm, i.e., the sum ofklargest singular values. SinceYTY is symmetric,|YTY|kis actually the sum ofklargest eigenvalues ofYTY. Similar toxk, which is the sum of k largest elements of x, we obtain the following (dual) optimization formulation for|YTY|k (for example, see Laurent and Rendl [14]):

|YTY|

k= min kz+ trace(U) s.t. zI+U YTY,

U 0.

Applying the Schur complement, we have that

|YTY|

k= min kz+ trace(U)

s.t.

zI+U YT

Y I

0,

U 0.

Thus, the dual norm| · |k,2 can be computed as follows:

|X|

k,2= max XY

s.t. kz + trace(U)1,

zI+U YT

Y I

0,

U 0.

Applying the strong duality theory under Slater’s condition, we have that

(2.11)

|X|

k,2= min p+ trace(R)

s.t. kp−trace(P) = 0,

pIP 0,

P 1

2XT 1

2X R

0.

The reformulation ofX1 is straightforward with the new decision variableQand additional constraintsQX andQ≥ −X, given the fact that the main problem is a minimization problem.

(7)

Proposition 2.6 indicates that in general, we can solve (2.8) by solving its equiv-alent semidefinite optimization formulation (2.9) with any SDP solver. We are now ready to study some properties of optimal solutions of (2.8). We have that|X|k,2+

θX1 is a norm forθ≥0 and we denote it by|X|k,2. According to Proposition

2.4, the dual norm|X|k,2,

(2.12) |A|

k,2= max A,X

s.t. Xk,21,

can be calculated by solving the following optimization problem givenθ >0:

(2.13) |A|

k,2= min max

|Y|k,2, θ−1Z

s.t. Y +Z=A.

Similar to Proposition 2.5, we can provide the optimality conditions for (2.8) in the following proposition.

Proposition 2.7. Consider a feasible solutionXof problem(2.8). If there exists

(Y,Z)that satisfies the conditions

(i) Y +Z=Aand|Y|k,2=θ−1Z,

(ii) X∈α∂|Y|k,2,α≥0,

(iii) X∈β∂Z 0,

(iv) α+θβ= (Ak,2)−1,

thenX is an optimal solution of Problem (2.3). In addition, if

(v) | · |k,2 is differentiable atY or · ∞ is differentiable atZ,

thenX is the unique optimal solution.

The optimality conditions presented in Proposition 2.7 indicate that some prop-erties of optimal solutions of (2.8) can be derived from the structure of∂| · |k,2. We

shall characterize the subdifferential∂| · |k,2 next. According to Watson [21], since | · |k,2is a unitarily invariant norm,∂|A|k,2is related to∂σ(A)k,2, whereσ(A)

is the vector of singular values ofA. LetA=0be a matrix with singular values that satisfy

σ1≥ · · ·> σk−t+1 =· · ·=σk =· · ·=σk+s>· · · ≥σp,

where p = min{m, n}, so that the multiplicity of σk is s+t. The subdifferential

σk,2 is characterized in the following lemma.

Lemma 2.8. v∈∂σ

k,2 if and only ifv satisfies the following conditions:

(i) vi= σi

σk,2 for alli= 1, . . . , k−t.

(ii) vi=τiσσk

k,2,0≤τi≤1, for alli=k−t+ 1, . . . , k+s, and

k+s

i=k−t+1τi=t.

(iii) vi= 0 for alli=k+s+ 1, . . . , p.

Proof. LetNk be the collection of all subsets with k elements of{1, . . . , p}, we

have

σk,2= maxN∈N

k

fN(σ),

where fN(σ) =i∈Nσi2 1

2 for all N ∈ N

k. According to the Dubovitski–Milyutin

theorem (see, for example, Tikhomirov [20]), the subdifferential of · k,2is computed as follows:

σk,2= conv

∂fN(σ) : N ∈ Nk, fN(σ) =σk,2

.

(8)

With the structure ofσ, clearly{1, . . . , k−t} ∈N for allN ∈ Nksuch thatfN(σ) =

σk,2. The remainingt elements of N are chosen from s+t values from {k−t+

1, . . . , k+s}. Sinceσ=0, allfN that satisfyfN(σ) =σk,2are differentiable atσ

(even in the caseσk = 0) and

∂fN(σ)

∂σi =

σi

σk,2

i∈N, ∂fN(σ)

∂σi = 0, i∈N.

Thus if v σk,2 for all i = 1, . . . , k−t, then, vi = σi

σk,2 and vi = 0 for all i=k+s+ 1, . . . , p.

Finally, counting arguments for the appearance of each index in{k−t+1, . . . , k+s} with respect to all subsetsN ∈ Nk that satisfyfN(σ) =σk,2 allows us to

charac-terizevifori=k−t+ 1, . . . , k+sasvi=τiσσk

k,2, 0≤τi 1, and

k+s

i=k−t+1τi=t.

We are ready to characterize the subdifferential of | · |k,2 with the following

proposition.

Proposition 2.9. Consider A =0. Let A=UΣVT be a particular singular

value decomposition of Aand assume that σ(A)satisfies σ1 ≥ · · ·> σk−t+1 =· · ·=

σk = · · · = σk+s > · · · ≥ σp. Then, G ∂|A|k,2 if and only if there exists T R(s+t)×(s+t) such that

G= 1

|A|k,2

U[:,1:k−t]Σ[1:k−t,1:k−t]VT[:,1:k−t]+σkU[:,k−t+1:k+s]T VT[:,k−t+1:k+s]

,

whereT is symmetric positive semidefinite, T1, andT=t. Proof. According to Watson [21], we have that

∂|A|k,2=

UDiag(g)VT : A=UΣVTis any SVD ofA, g∈∂σ(A)k,2

.

Let A = UΣVT be a particular singular value decomposition of A and assume that a singular value σi > 0 has the multiplicity of r with corresponding singular

vectorsUi∈Rm×randVi Rn×r. Then for any singular value decomposition ofA, A= ¯UΣV¯T, there exists an orthonormal matrixW Rr×r, W WT =I, such that

¯

Ui=UiW and ¯Vi=ViW (for example, see Zi¸etak [22]).

Combining these results with Lemma 2.8, the proof is straightforward with a singular value (or eigenvalue) decomposition of matrixT.

Corollary 2.10. | · |k,2 is differentiable at anyA=0such that σk > σk+1

(σp+1= 0) or σk = 0.

Proof. Ifσk= 0, then, according to Proposition 2.9,

G∈∂|A|k,2 G= 1

|A|k,2U[:,1:k−t]Σ[1:k−t,1:k−t]V

T

[:,1:k−t].

Now, ifσk > σk+1, thens= 0. Thus T =I is unique since T ∈ St, T =t, and

T 1. Thus ∂|A|k,2 is a singleton, which implies | · |k,2 is differentiable at

A.

Proposition 2.9 shows that problem (2.8) with θ = 0 is a convex optimization problem that finds k-approximation of a matrix A. It also shows that intuitively, problem (2.8) can be used to recoverk largest approximately rank-one submatrices

(9)

withθ >0. Note that for Ky Fank-norm, ifσk(A)> σk+1(A), its subdifferential at

Ais a singleton with a unique subgradient:

∂|A|k=

U

Ik 0

0 0

VT ,

where A = UΣVT is a singular value decomposition of A and Ik is the identity matrix in Rk×k (see, for example, Watson [21]). In this particular case, the unique subgradient of the Ky Fank-norm provides the information of singular vectors corre-sponding to the klargest singular values. Having said that, it does not preserve the information of singular values. Whenθ = 0, the proposed formulation with the Ky Fank-norm will not return the rank-kapproximation of the matrixAas the Ky Fan 2-k-norm does. In the next section, we shall study the recovery of these submatrices under the presence of random noise.

3. Recovery with block diagonal matrices and random noise. We con-siderA=B+R, whereB is a block diagonal matrix, each block having rank one, whileR is a noise matrix. The main theorem shows that under certain assumptions concerning the noise, the positions of the blocks can be recovered from the solution of (2.8). As mentioned in the introduction, this corresponds to solving a special case of the approximate NMF problem, that is, a factorizationAW HT, whereW andH are nonnegative matrices. The special case solved is that W andH each consist of nonnegative columns with nonzeros in disjoint positions (so thatAis approximately a matrix with disjoint blocks each of rank one). Even this special case of NMF is NP-hard unless further restrictions are placed on the data model given the fact that the (exact) LAROS problem is NP-hard (see [6] for details).

Before starting the proof of the theorem, we need to consider some properties of subgaussian random variables. A random variablexisb-subgaussianifE[x] = 0 and there exists ab >0 such that for allt∈R,

(3.1) Eetx≤eb22t2.

We can apply the Markov inequality for the b-subgaussian random variable x and obtain the following inequalities:

(3.2) P(x≥t)exp(−t2/(2b2)) andP(x≤ −t)exp(−t2/(2b2)) ∀t >0. The next three lemmas, which show several properties of random matrices and vectors with independent subgaussian entries, are adopted from Doan and Vavasis [6] and references therein.

Lemma 3.1. Letx1, . . . , xk be independentb-subgaussian random variables and let

a1, . . . , ak be scalars that satisfyki=1a2i = 1. Thenx= k

i=1aixi is ab-subgaussian random variable.

Lemma 3.2. Let B Rm×n be a random matrix, where bij are independent

b-subgaussian random variables for all i= 1, . . . , m, and j = 1, . . . , n. Then for any

u >0,

P(B ≥u)exp

8u2

81b2 (log 7)(m+n)

.

(10)

Lemma 3.3. Let x,y be two vectors in Rn with independent and identically

distributed (i.i.d.) b-subgaussian entries. Then for anyt >0,

PxTytexpmin t2

(4eb2)2n,

t

4eb2 and

PxTy≤ −texpmin t2

(4eb2)2n,

t

4eb2 .

With these properties of subgaussian variables presented, we are now able to state and prove the main theorem, which gives sufficient conditions for optimization problem (2.8) to recoverk blocks in the presence of noise.

Theorem 3.4. SupposeA =B+R, where B is a block diagonal matrix with

k0 blocks, that is, B= diag(B1, . . . ,Bk0), whereBi= ¯σiu¯iv¯Ti,u¯i∈Rmi,v¯i Rni,

u¯i2=v¯i2= 1,u¯i>0,v¯i >0for alli= 1, . . . , k0. Assume the blocks are ordered

so that σ¯1 ≥σ¯2 ≥ · · · ≥σ¯k0 >0. Matrix R is a random matrix composed of blocks

in which each entry is a translated b-subgaussian variable, i.e., there exists μij 0 such that elements of the matrix block Rij/(φiφj)1/2 −μijemieTnj are independent

b-subgaussian random variables for all i, j = 1, . . . , k0. Here φi = ¯σi/√mini, i =

1, . . . , k0, is a scaling factor to match the scale of Rij with that of Bi andBj, and em denotes the m-vector of all1’s.

We define the following positive scalars that control the degree of heterogeneity among the firstk blocks:

δu≤ min

i=1,...,kui¯ 1/

m

i,

(3.3)

δv≤ min

i=1,...,k¯vi1/

n

i,

(3.4)

ξu≤ min i=1,...,k

min

j=1,...,miu¯i,j

m

i,

(3.5)

ξv≤ min i=1,...,k

min

j=1,...,ni ¯

vi,j

n

i,

(3.6)

πu≥ max i=1,...,k

max

j=1,...,mi ¯

ui,j

m

i,

(3.7)

πv≥ max i=1,...,k

max

j=1,...,ni ¯

vi,j

n

i,

(3.8)

ρm≥ max

i,j=1,...,kmi/mj,

(3.9)

ρn≥ max

i,j=1,...,kni/nj,

(3.10)

ρσ≥σ¯1/¯σk.

(3.11)

We also assume that the blocks do not diverge much from being square; more precisely we assume that mi O(n2j) and ni O(m2j) for i, j = 1, . . . , k. Let p= (k, δu, δv, πu, πv, ξu, ξv, ρσ, ρm, ρn)denote the vector of parameters controlling the heterogeneity.

For the remaining noise blocks i=k+ 1, . . . , k0, we assume that their dominant singular values are substantially smaller than those of the firstk,

(3.12) σ¯k+1 0k.23¯+ 1σk,

(11)

that their scale is bounded,

(3.13) φi≤c0φj

for all i=k+ 1, . . . , k0 and j = 1, . . . , k, where c0 is a constant, and that their size is bounded,

(3.14)

k0

i=k+1

(mi+ni)≤c1(p, c0, b) mini=1,...kmini,

wherec1(p, c0, b)is given by (3.123) below. Assume that

(3.15) μij ≤c2(p, c0)

for alli, j= 1, . . . , k0, wherec2(p, c0)is given by (3.118) below. Then provided that

(3.16) c3(p)

k

i=1 mini

−1/2

≤θ≤2c3(p)

k

i=1 mini

−1/2 ,

where c3(p) is given by (3.115) below, the optimization problem (2.8) will return X with nonzero entries precisely in the positions ofB1, . . . ,Bkwith probability exponen-tially close to1 asmi, ni → ∞for alli= 1, . . . , k0.

Remarks.

1. Note that the theorem does not recover the exact values of (¯σi,ui¯ ,vi¯ ); it is clear that this is impossible in general under the assumptions made.

2. The theorem is valid under arbitrary permutation of the rows and columns (i.e., the block structure may be “concealed”) since (2.8) is invariant under such transformations.

3. Given the fact that for alli= 1, . . . , k,

0<

min

j=1,...,mi ¯

ui,j

m

i≤ u¯i1/√mi 1

max

j=1,...,mi ¯

ui,j

m

i,

we can always choose ξu, δu, and πu such that 0 < ξu δu 1 πu. Similarly, we assume 0< ξv≤δv 1≤πv. These parameters measure how

much ¯ui and ¯vi diverge from emi and eni after normalization, respectively. The best case for our theory (i.e., the least restrictive values of parameters) occurs when all of these scalars are equal to 1. Similarlyρσ, ρm, ρn≥1, and

the best case for the theory is when they are all equal to 1.

4. It is an implicit assumption of the theorem that the scalars contained in p as well as b, which controls the subgaussian random variables, stay fixed as

mi, ni→ ∞.

5. As compared to the recovery result in Ames [1] for the planted k-biclique problem, our result for the general bicluster problem is in general weaker in terms of noise magnitude (as compared to data magnitude) but stronger in terms of block sizes. Ames [1] requires mi = τi2ni, whereτi are scalars for all i, i = 1, . . . , k+ 1, whereas we only need mi O(n2j) and ni ≤O(m2j) for i, j = 1, . . . , k. More importantly, the noise block size, nk+1, is more restricted as compared to data block sizes, ni, for i= 1, . . . , k, in Ames [1]

with the condition

c1√k+√nk+1+ 1

k+1

i=1

ni+βτk+1nk+1≤c2γi=1min,...,kni.

(12)

In contrast, for our recovery result, (3.14) means that the total size of the noise blocks can be much larger (approximately the square) than the size of the data blocks. Thus, the theorem shows that the k blocks can be found even though they are hidden in a much larger matrix. In the special case whenk0=k+ 1,mi=ni=n, ¯σi= ¯σfor alli= 1, . . . , k, andmk+1=nk+1,

combining (3.13) and (3.14), we will obtain the following condition, which clearly shows the relationship between block sizes:

¯

σk+1

c0¯σ n≤nk+1

c1(p, c0, b)

2 n

2.

6. As compared to the recovery result in Doan and Vavasis [6] whenk= 1, our recovery result is for a more general setting with ¯σ2 > 0 instead of ¯σ2 = 0 as in Doan and Vavasis [6]. We therefore need additional conditions on ¯σi,

i = 1,2. In addition, we need to consider the off-diagonal blocks (i, j) for

i, j= 1, . . . , k, which is not needed whenk= 1. This leads to more (stringent) conditions on the noise magnitudes. Having said that, the conditions on the parameterθ and block sizes remain similar. We still require θ to be in the order of (m1n1)−1/2 as in Doan and Vavasis [6]. The conditionsm1≤O(n21) and n1 O(m21) are similar to the condition m1n1 Ω((m1+n1)4/3) in Doan and Vavasis [6]. Finally, the condition m2+n2 ≤c1(p, c0, b)m1n1 is close to the conditionm1n1Ω(m1+m2+n1+n2), which again shows the similarity of these recovery results in terms of block sizes.

In order to simplify the proof, we first consolidate all blocksi=k+1, . . . , k0into a single block and call it block (k+1) of size ¯mk+1ׯnk+1where ¯mk+1=ki=0k+1miand

¯

nk+1 =ki=0k+1ni The only difference is that the new block ¯Bk+1,k+1Rm¯k+1ׯnk+1

is now a block diagonal matrix withk0−kblocks instead of a rank-one block. Similarly, new blocks ¯Ri,k+1 and ¯Rk+1,i, i= 1, . . . , k0, now have more than one subblock with

different parameters μ instead of a single one. This new block structure helps us derive the optimality conditions more concisely. Clearly, we would like to achieve the optimal solutionX with the following structure:

X=

⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝

σ1u1vT

1 0 · · · · · · 0 0 . .. . .. 0

..

. . .. . .. 0 ...

..

. 0 σkukvTk 0

0 · · · · 0 0

⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

,

where ui2 =vi2 = 1 fori = 1, . . . , k. Padding appropriate zeros to ui andvi to constructu0i Rm+ andv0i Rn+ for i= 1, . . . , k, we obtain sufficient optimality conditions based on Proposition 2.7 as follows:

There existY andZsuch thatY +Z=Aand

Y =|A|k,2

k

i=1

σiu0i(v0i)T +W , Z=θ|A|k,2V,

where σi > 0 for i = 1, . . . , k, ki=1σ2i = 1, W mini=1,...,k{σi},

W v0

i =0, WTu0i =0, fori= 1, . . . , k, and V∞≤1, Vii =emieTni, fori= 1, . . . , k.

(13)

Since A has the block structure, we can break these optimality conditions into ap-propriate conditions for each block. Starting with diagonal (i, i) blocks, i= 1, . . . , k, the detailed conditions are

σiuivT

i +Wii =λσiui¯ ¯vTi +Rii)−θemieTni, (3.17)

WT iiui =0,

(3.18)

Wiivi =0,

(3.19)

where λ = 1/|A|k,2. For nondiagonal (i, j) blocks, i =j and i, j = 1, . . . , k, we obtain the following conditions:

Wij+θVij =λRij, (3.20)

WT

ijui=0,

(3.21)

Wijvj=0,

(3.22)

Vij1.

(3.23)

For (i, k+ 1) blocks,i= 1, . . . , k, we have that

Wi,k+1+θVi,k+1=λR¯i,k+1,

(3.24)

WT

i,k+1ui=0,

(3.25)

Vi,k+11. (3.26)

Similarly, for (k+ 1, j) blocks,j= 1, . . . , k, the conditions are

Wk+1,j+θVk+1,j =λRk¯ +1,j,

(3.27)

Wk+1,jvj =0,

(3.28)

Vk+1,j1.

(3.29)

Finally, the (k+ 1, k+ 1) block needs the following conditions:

Wk+1,k+1+θVk+1,k+1 =λB¯k+1,k+1+ ¯Rk+1,k+1,

(3.30)

Vk+1,k+11. (3.31)

The remaining conditions are not block separable. We still needσi >0,i= 1, . . . , k,

and ki=1σi2 = 1. The last condition, which is W mini=1,...,k{σi}, can be

replaced by the following sufficient conditions that are block separable by applying the fact thatW2i,jWij2:

(3.32) Wij ≤ k1

+ 1i=1min,...,k{σi}, i, j= 1, . . . , k+ 1.

With these sufficient block separable conditions, in order to construct (V,W), we now need to construct (Vij,Wij) for different pairs (i, j) block by block. The block by block details are shown in the following analysis.

In the following proof, we assume that the random matrixRis chosen in stages: the diagonal blocksRii,i= 1, . . . , k, are selected before the off-diagonal blocks. This allows us to treat the diagonal blocks as deterministic during the analysis of the off-diagonal blocks. This technique of staging independent random variables is by now standard in the literature; see e.g., the “golfing” analysis of the matrix completion problem by Gross [11].

(14)

3.1. Analysis for block (i, i), i = 1, . . . , k. We begin with the proof of the existence of a λ > 0 that satisfies the optimality conditions. We then show the sufficient condition (3.32) for block (i, i),i= 1, . . . , k. The final condition that needs to be proved for these blocks is the positivity ofui andvi,i= 1, . . . , k.

3.1.1. Existence ofλ. The conditions for the (i, i) block,i= 1, . . . , k, namely, (3.17)–(3.19), indicate that (σi,ui,vi) is the dominant singular triple of the matrix

Li=λσiui¯ v¯Ti +Rii)−θemieTni. They also indicate that

(3.33) Wii=σ2(Li)

since (3.17)–(3.19) are equivalent to the first step of a singular value decomposition ofLi.

For the rest of this analysis, it is more convenient notationally to work with

τ=λ/θ rather than withλdirectly. The conditionki=1σi2= 1 becomes

(3.34) f(τ) =

k

i=1

!!τσiu¯iv¯Ti +Rii)emieTni!!

2θ−2

= 0.

We will prove that there exists τ∗ >0 such that f(τ∗) = 0. More precisely, we will focus our analysis off(τ) forτ∈[τ, τu], whereτis given by (3.110) andτu is given by (3.116) below, and prove that there existsτ∗∈[τ, τu] such thatf(τ∗) = 0.

Letting Qij = Rij/"φiφj −μijemieTnj for i, j = 1, . . . , k, clearly, Qij are b -subgaussian random matrices with independent elements. The function f can be rewritten as follows:

f(τ) =

k

i=1

!!τσ¯iui¯ v¯Ti (1−τφiμii)emieTni+τφiQii!!2−θ−2

=

k

i=1

Pi(τ) +τφiQii2−θ−2,

(3.35)

where Pi(τ) =τσ¯iui¯ v¯iT (1−τφiμii)emieTni. Applying the triangle inequality, we have

(3.36) Pi(τ) −τφiQii ≤ Pi(τ) +τφiQii ≤ Pi(τ)+τφiQii.

We start the analysis withPi(τ). We first define the function

(3.37) gi(τ;a) =φ222aφiτ(1−μiiφiτ) + (1−μiiφiτ)2,

which is a quadratic function inτ with any fixed parametera. Note by (3.119) below that τu≤0.3/(φiμii) for alli= 1, . . . , k, so 1−μiiφiτ≥0 andτ≥0 forτ∈[τ, τu]. Therefore, provideda≤1 andτ∈[τ, τu],

gi(τ;a) = (φiτ−(1−μiiφiτ))2+ 2(1−a)φiτ(1−μiiφiτ)

(φiτ−(1−μiiφiτ))2 = (φi(1 +μii)τ−1)2

(3.38)

0. (3.39)

The dominant singular triple of Pi(τ) = τσ¯iu¯iv¯iT (1−τφiμii)emieTni is now analyzed for a fixed τ [τ, τu]. It is clear that the dominant right singular vector

(15)

lies in span{v¯i,eni}since this is the range of (Pi(τ))T. Lettingζi=Pi(τ)2 be the square of the dominant singular value, it is clear thatζi is a solution of the following

eigenvector problem:

(Pi(τ))TPi(τ)(αv¯i+βeni) =ζi(αv¯i+βeni).

Expanding and gathering multiples of ¯vi and eni, we obtain the 2×2 eigenvalue problem

(3.40) Mi

α β

=ζi

α β

,

whereMi is the 2×2 matrix, (3.41)

τ2σ¯2

i −τσ¯ihi(τ)ui¯ 1vi¯ 1 τ2σ¯2i −τσ¯ihi(τ)ui¯ 1ni

(hi(τ))2vi¯ 1mi−τσ¯ihi(τ)ui¯ 1 (hi(τ))2mini−τσ¯ihi(τ)ui¯ 1¯vi1

,

andhi(τ) = 1−τφiμii,i= 1, . . . , k. Thus,ζi is a root of the equation

(3.42) ζi2trace(Mi)ζi+ det(Mi) = 0,

where

trace(Mi) =τσi22τσ¯i(1−τφiμii)u¯i1v¯i1+ (1−τφiμii)2mini

=miniτ2φ2i 2τφi(1−τφiμii)δu,iδv,i+ (1−τφiμii)2 =minigi(τ;δu,iδv,i)

(3.43)

and

det(Mi) =τ2σ¯2i(1−τφiμii)2(mi− ui¯ 21)(ni− vi¯ 21) =m2in2iτ2φ2i(1−τφiμii)2(1−δ2u,i)(1−δv,i2 )

(3.44)

0.

Here, we have introduced notation

δu,i=ui¯ 1/√mi,

δv,i=¯vi1/√ni

that we will continue to use for the remainder of the proof. It is apparent that

δu,i∈[δu,1] by (3.3) and similarlyδv,i∈[δv,1].

Let Δ be the discriminant of the quadratic equation (3.42), that is,

(3.45) Δ = trace(Mi)24 det(Mi).

Letting Δ= Δ/(mini)2, we have that

Δ =

#

τ2φ2

i 2τφi(1−τφiμii)

δu,iδv,i+ $

(1−δu,i2 )(1−δ2v,i)

+ (1−τφiμii)2 %

· #τ2φ2

i 2τφi(1−τφiμii)

δu,iδv,i− $

(1−δu,i2 )(1−δ2v,i)

+ (1−τφiμii)2 %

=gi

τ;δu,iδv,i+ $

(1−δu,i2 )(1−δ2v,i)

· gi

τ;δu,iδv,i− $

(1−δu,i2 )(1−δ2v,i)

.

(3.46)

(16)

As 1(δu,iδv,i+

$

(1−δu,i2 )(1−δ2v,i))2 = (δu,i $

1−δv,i2 −δv,i $

1−δ2u,i)2 0, the second argument to each invocation of gi in the previous equation is less than or equal to 1. In addition, sinceτ [τ, τu], it follows that both evaluations ofgi yield

nonnegative numbers, and therefore Δ0. We next claim that

(3.47) Δ =m2in2igi(τ;pi(τ))2

for a continuouspi(τ)[δu,iδv,i,1] for allτ [τ, τu]. In other words, there exists a

continuouspi(τ) in the range [a,1] satisfying the equation

(3.48) gi(τ;pi(τ))2 =gi(τ;a+c)gi(τ;a−c), where, for this paragraph,a=δu,iδv,i and c=

$

(1−δu,i2 )(1−δ2v,i). This is proved by first treating pi as an unknown and expanding (3.48). After simplification, the

result is a quadratic equation for pi. The facts that 0≤a, c≤1 anda+c≤1 allow

one to argue that the quadratic equation has a sign change over the interval [a,1] for all τ [0,1/(φiμii)] (hence for all τ [τ, τu]). Thus, the quadratic has a unique

root in this interval, which may be taken to bepi; it must vary continuously with the

coefficients of the quadratic and hence withτ. The details are left to the reader. In addition toτ, pi(τ) depends onμii,φi,δu,i, andδv,i.

Thus, by the quadratic formula applied to (3.42), we can obtain ζi as the larger root

(3.49) ζi= 1

2(trace(Mi(τ)) +

Δ) =minigi(τ;ai(τ)),

where the second equation comes from adding (3.43) to the square root of (3.47) and noting that for anyτ, a, b, (gi(τ;a) +gi(τ;b))/2 =gi(τ; (a+b)/2). Here, we have that

(3.50) ai(τ) = 1

2(δu,iδv,i+pi(τ)).

By the earlier bound onpi(τ), this impliesai(τ)[ai, ai], where

(3.51) ai=δu,iδv,i; ai= 1

2+ 1 2δu,iδv,i.

Clearly 0≤ai ≤ai 1 for allisinceδu,i, δv,i∈[0,1]. Note that tighter bounds are

possible by a more careful analysis of Δ.

Sinceζi=Pi(τ)2, we can then expressPi(τ) as follows: (3.52) Pi(τ)="ζi="minigi(τ;ai(τ)).

Next, consider again (3.38); the right-hand side is a convex quadratic function of

τ with minimizer at 1/(φi(1 +μii)). It follows from (3.112) that τ 2/φi for alli.

Thus, forτ∈[τ, τu], we have that

τ φ2

i

1

φi(1/2 +μii) >

1

φi(1 +μii).

Thus, the right-hand side of (3.38) is an increasing function of τ forτ [τ, τu]. We then have, for anyτ∈[τ, τu],

gi(τ;a)

φi(1 +μii)

1

φi(1/2 +μii)

1

2

=

1 1 + 2μii

2 .

(17)

Thus,

(3.53) Pi(τ)

m

ini

1 + 2μii

for anyτ [τ, τu]. We also have a second lower bound that grows linearly withτ:

"

gi(τ;a)≥φi(1 +μii)τ−1 (3.54)

=φiτ/2 + (φi(1/2 +μii)τ−1)

≥φiτ/2,

(3.55)

where the first inequality follows from (3.38) and the other inequality is due to the fact thatτφi≥2 as noted above. This implies

Pi(τ) ≥√miniφiτ/2

= ¯σiτ/2. (3.56)

Next, we combine this linear lower bound on Pi(τ) with an upper bound on

Qiiin order to be able to take advantage of (3.36).

Claim 1. !!Qij!!(minj)38 with probability exponentially close to1 asmi, nj

∞for all i, j= 1, . . . , k.

To establish the claim observe that Qij is random with i.i.d. elements that are

b-subgaussian. Thus by Lemma 3.2(i),

(3.57) P!!Qij!!(minj)3/8

exp

(minj)3/4

81b2 (log 7)(mi+nj)

,

where u is set to be (minj)3/8. The right-hand side tends to zero exponentially

fast since (minj)3/4 asymptotically dominates mi+nj under the assumption that

m1i/4≤O(n1j/2) andn1j/4≤O(m1i/2), which was stated as a hypothesis in the theorem. Then with a probability exponentially close to 1, the event in (3.57) does not happen, hence we assume!!Qij!!(minj)38. Focusing on thei=jcase for now, this implies

(3.58) Qii ≤√mini/40

for large mini; since the theorem applies to the asymptotic range, we assume this inequality holds true as well. Combining the inequality (3.58) with (3.56) and (3.36), we obtain

Pi(τ) +τφiQii= (1 +γi(τ))Pi(τ)

= (1 +γi(τ))"minigi(τ;ai(τ)). (3.59)

In the first line, we have introduced scalarγi(τ) to stand for a quantity in the range [1/20,1/20] that varies continuously withτ. This notation will be used throughout the remainder of the proof. The second line follows from (3.52). Combining (3.59) and (3.56), we conclude

(3.60) Pi(τ) +τφiQii ≥0.47¯σiτ.

Finally, because Pi(τ) +τφiQii is a rescaling of the right-hand side of (3.17) by θ,

we conclude that

(3.61) σi≥0.47¯σiτθ.

(18)

Applying (3.59) to the formulation off(τ) in (3.35), we have that, forτ [τ, τu],

f(τ) =

k

i=1

Pi(τ) +τφiQii2−θ−2

=

k

i=1

mini(1 +γi(τ))2gi(τ;ai(τ))−θ−2

=A(τ)τ22B(τ)τ−C(τ).

The third line is obtained by expanding the quadratic formula forgi(τ;ai(τ)), which

results in

A(τ) =

k

i=1

(1 +γi(τ))2σ¯i2(1 + 2μiiai(τ) +μ2ii),

B(τ) =

k

i=1

(1 +γi(τ))2√miniσ¯i(ai(τ) +μii),

C(τ) =θ−2−

k

i=1

(1 +γi(τ))2mini.

We will now prove that there existsτ∗ [τ, τu] such thatf(τ∗) = 0 by applying the following lemma, which is a specific form of the intermediate theorem for “pseudo-quadratic” functions.

Lemma 3.5. Consider a real-valued functionfˆ(τ)of the form

ˆ

f(τ) =A(τ)τ22B(τ)τ−C(τ),

whereA(τ),B(τ),C(τ)are continuous functions ofτ. Suppose there are two triples of positive numbers(A, B, C)<(A, B, C)(where <is understood elementwise). Define

(3.62) τ= B+

"

B2+A·C

A

and

(3.63) τu = B+

$

B2+A·C

A .

(Clearlyτ< τu.) Suppose further that there is an interval[τ, τu]such thatτ≤τ

τ

u ≤τu and such that for all τ∈[τ, τu],

(A, B, C)(A(τ), B(τ), C(τ))(A, B, C).

Then there exists a rootτ∗ [τ, τu](and therefore also in[τ, τu]) such thatfˆ(τ∗) = 0. Proof. Some simple algebra shows that ˆf(τ) =A(τ)(τ)22B(τ)(τ)−C(τ) 0, while ˆf(τu) =A(τu)(τu)22B(τu)τu −C(τu)0, so there is aτ∗ [τ, τu] such that ˆf(τ∗) = 0 by the intermediate value theorem.

(19)

In order to apply Lemma 3.5, we now define the following scalars:

A= (10/9)

k

i=1

¯

σ2

i(1 + 2μiiai+μ2ii),

A= 0.90

k

i=1

¯

σ2

i(1 + 2μiiai+μ2ii),

B = (10/9)

k

i=1 m

iniσ¯i(ai+μii),

B = 0.90

k

i=1 m

iniσ¯i(ai+μii),

C= (10/9)

θ−2k

i=1 mini

,

C= (9/10)

θ−2

k

i=1 mini

.

It is obvious that (0,0)<(A, B)<(A, B). It follows from (3.16), (3.111), and (3.115) below that the parenthesized quantity in the definitions ofC, C is positive,

θ−2k

i=1

mini

1 4c3(p)2 1

k

i=1 mini

= 1.24(c4(p))2

ρ

mρnk

k+ρmρn−1 k

i=1

mini>0,

and hence we also have 0< C < C.

In addition, given the fact that ai(τ) [ai;ai] and γi(τ) [1/20; 1/20] for

τ [τ, τu], this establishes for this interval that (A, B, C) (A(τ), B(τ), C(τ))

(A, B, C).

We now show thatτ ≥τ andτu ≤τu. We have that

τ

=

B+"B2+A·C

A

A·C

A .

Using the facts that 0≤ai 1, 0≤μii ≤c2(p, c0)0.08 (see (3.118) below), and θ−2k

i=1mini≥1.24(c4(p))2(k+ρρmmρρnnk−1) k

i=1mini>0 as above, we have that

τ

ρ

mρnk

k+ρmρn−1

1/2 c4(p)

k

i=1 mini

1/2k

i=1

¯

σ2

i −1/2

.

Next, observe that

k i=1 ¯ σ2 i −1/2

≥k−1/2σ−1 1

Figure

Fig. 2.Maximum differences between entries in diagonal blocks and off-diagonal blocks for
Fig. 3. Maximum differences between entries for three models.
Fig. 4. Approximation averaging effect versus magnitude of resulting blocks.
Fig. 7. Recovered transcription modules from two different models.
+2

References

Related documents

Favor you leave and sample policy employees use their job application for absence may take family and produce emails waste company it discusses email etiquette Deviation from

By first analysing the image data in terms of the local image structures, such as lines or edges, and then controlling the filtering based on local information from the analysis

• People who worked in Cuyahoga County lost a combined total of $5.2 billion from their paychecks between 2000 and 2010, according to an analysis of payroll data of residents'

In the United States, the Deaf community, or culturally Deaf is largely comprised of individuals that attended deaf schools and utilize American Sign Language (ASL), which is

In OF1.0, packet forwarding within an OpenFlow logical switch is controlled by a single flow table that contains a set of flow entries installed by the controller.. The flow

According to VanPatten, (1996), the originator of the PI approach, PI is an input based grammar instruction which aims to affect learners’ attention to input data which

All stationary perfect equilibria of the intertemporal game approach (as slight stochastic perturbations as in Nash (1953) tend to zero) the same division of surplus as the static

If breastfeeding by itself doesn’t effectively remove the thickened inspissated milk, then manual expression of the milk, or the use of an efficient breast pump after feeds will