Algorithms for ℓp Low Rank Approximation

(1)

Flavio Chierichetti1 Sreenivas Gollapudi2 Ravi Kumar2 Silvio Lattanzi3 Rina Panigrahy2 David P. Woodruff4

Abstract

We consider the problem of approximating a given matrix by a low-rank matrix so as to mini-mize the entry-wise`p-approximation error, for

anyp≥1; the casep= 2is the classical SVD problem. We obtain the first provably good ap-proximation algorithms for this version of low-rank approximation that work for every value of

p≥1, includingp=∞. Our algorithms are sim-ple, easy to implement, work well in practice, and illustrate interesting tradeoffs between the approx-imation quality, the running time, and the rank of the approximating matrix.

1. Introduction

The problem of low-rank approximation of a matrix is usu-ally studied as approximating a given matrix by a matrix of low rank so that the Frobenius norm of the error in the ap-proximation is minimized. The Frobenius norm of a matrix is obtained by taking the sum of the squares of the entries in the matrix. Under this objective, the optimal solution is obtained using the singular value decomposition (SVD) of the given matrix. Low-rank approximation is useful in large data analysis, especially in predicting missing entries of a matrix by projecting the row and column entities (e.g., users and movies) into a low-dimensional space. In this work we consider the low-rank approximation problem, but under the general entry-wise`pnorm, for anyp∈[1,∞].

There are several reasons for considering the`pversion of

*_{Equal contribution} 1_{Sapienza University, Rome, Italy.}

Work done in part while visiting Google. Supported in

part by a Google Focused Research Award, by the ERC

Starting Grant DMAP 680153, and by the SIR Grant

RBSI14Q743. 2Google, Mountain View, CA 3Google,

Zurich, Switzerland 4IBM Almaden, San Jose, CA.

Corre-spondence to: Flavio Chierichetti <[email protected]>,

Sreenivas Gollapudi <[email protected]>, Ravi Kumar

<[email protected]>, Silvio Lattanzi<[email protected]>,

Rina Panigrahy<[email protected]>, David P. Woodruff<

[email protected]>.

Proceedings of the34th

low-rank approximation instead of the usually studied`2

(i.e., Frobenius) version. For example, it is widely acknowl-edged that the`1version is more robust to noise and outliers

than the`2version (Cand`es et al., 2011; Huber, 1981; Xu

& Yuille, 1995). Several data mining and computer vision-related applications exploit this insight and resort to finding a low-rank approximation to minimize the`1error (Lu et al.,

2014; Meng & Torre, 2013; Wang & Yeung, 2013; Xiong et al., 2011). Furthermore, the`1error is typically used as a

proxy for capturing sparsity in many applications including robust versions of PCA, sparse recovery, and matrix com-pletion; see, for example (Cand`es et al., 2011; Xu et al., 2012). For these reasons the problem has already received attention (Gillis & Vavasis, 2015) and was suggested as an open question in a survey on sketching techniques for linear algebra (Woodruff, 2014). Likewise, the`∞version

(dubbed also as the Chebyshev norm) has been studied for the past many years (Goreinov & Tyrtyshnikov, 2001; 2011), though to the best of our knowledge, no result with theo-retical guarantees was known for`∞before our work. Our

algorithm is quite general, and works for everyp≥1. Working with`perror, however, poses many technical chal-lenges. First of all, unlike`2, the general`pspace is not

amenable to spectral techniques. Secondly, the`p space is not as nicely behaved as the `2 space, for example, it

lacks the notion of orthogonality. Thirdly, the`p version quickly runs into computational complexity barriers: for example, even the rank-1 approximation in `1 has been

shown to be NP-hard by Gillis and Vavasis (Gillis & Vava-sis, 2015). However, there has been no dearth in terms of heuristics for the`p low-rank approximation problem, in

particular forp= 1andp=∞: this includes alternating convex (and, in fact, linear) minimization (Ke & Kanade, 2005), methods based on expectation-maximization (Wang et al., 2012), minimization with augmented Lagrange mul-tipliers (Zheng et al., 2012), hyperplanes projections and linear programming (Brooks et al., 2013), and generaliza-tions of the Wiberg algorithm (Eriksson & van den Hengel, 2012). These heuristics, unfortunately, do not come with any performance guarantees. While theoretical approxima-tion guarantees have been given for the rank-1version for the GF(2)and the Boolean cases (Dan et al., 2015), to the best of our knowledge there have been no provably good

(2)

(approximation) algorithms for general matrices, or for rank more than one, or for general`p.

1.1. Our contributions

In this paper we obtain the first provably good algorithms for the`prank-kapproximation problem for everyp≥1. Letn×mbe the dimensions of the input matrix. From an algorithmic viewpoint, there are three quantities of interest: the running time of the algorithm, the approximation fac-tor guaranteed by the algorithm, and the actual number of vectors in the low-rank approximation that is output by the algorithm (even though we only desirek).

Given this setting, we show three main algorithmic results intended for the case when kis not too large. First, we show that one can obtain a(k+ 1)-approximation to the rank-kproblem in timemk_poly(_{n, m}₎_{; note that this}

run-ning time is not polynomial once kis larger than a con-stant. To address this, next we show that one can get anO(k)-approximation to the bestk-factorization in time

O(poly(nm)); however, the algorithm returnsO(klogm)

columns, which is more than the desiredk(this is referred to as a bi-criteria approximation). Next, we combine these two algorithms. We first show that the output of the second algorithm can further be refined to output exactlyk vec-tors, with an approximation factor ofpoly(k)and a running time ofO(poly(n, m)(klogn)k₎_{. The running time now}

is polynomial as long ask=O(logn/log logn). Finally, we show that for any constantp ≥ 1, we can obtain an approximation factor of(klogm)O(p)_{and a running time}

ofpoly(n, m)for every value ofk.

Our first algorithm is existential in nature: it shows that there arekcolumns in the given matrix that can be used, along with an appropriate convex program, to obtain a(k+ 1) -approximation. Realizing this as an algorithm would there-fore na¨ıvely incur a factor mk in the running time. Our second algorithm works by sampling columns and itera-tively “covering” the columns of the given matrix, for an appropriate notion of covering. In each round of sampling our algorithm uniformly samples from a remaining set of columns; we note here that it is critical that our algorithm is adaptive as otherwise uniform sampling would not work. While this is computationally efficient and maintains an

O(k)-approximation to the best rank-kapproximation, it can end up with more thankcolumns, in factO(klogm). Our third algorithm fixes this issue by combining the first algorithm with the notion of a near-isoperimetric transfor-mation for the`p-space, which lets us transform a given matrix into another matrix spanning the same subspace but with small`pdistortion. Our fourth algorithm uses`p lever-age scores to overcome the sampling step; this improves the running time while mildly worsening the approximation factor.

A useful feature of our algorithms is that they are uniform with respect to all values ofp. We test the performance of our algorithms, forp= 1andp=∞, on real and synthetic data and show that they produce low-rank approximations that are substantially better than what the SVD (i.e.,p= 2) would obtain.

1.2. Related work

Very recently, Song, Woodruff, and Zhong (Song et al., 2017), obtained a low-rank approximation that holds for every p ∈ [1,2], using sketching-based techniques. In particular, their main result is an (O(logm) poly(k)) -approximation innnz(A) + (n+m) poly(k)time, for every

k, wherennz(A)is the number of non-zero entries inA. In our work, we also obtain such a result forp ∈ [1,2]

but via very different sampling-based methods. In addi-tion, we obtain an algorithm with apoly(k)-approximation factor that is independent of mandn, though this latter algorithm requiresk=O(logn/log logn)in order to be polynomial time. Another result in (Song et al., 2017) shows how to achieve a(kpoly(logk))-approximation, innO(k)

time for p ∈ [1,2]. Forklarger than a constant, this is larger than polynomial time, whereas our algorithm with

poly(k)-approximation is polynomial time forkas large as

Θ(logn/log logn). Importantly, our results hold for every

p∈[1,∞], rather than only whenp∈[1,2].

In addition we note that there exist papers solving prob-lems that, at first blush, might seem similar to ours. For instance, (Deshpande et al., 2011) study a convex relation, and a rounding algorithm to solve the subspace approxima-tion problem (an`pgeneralization of the least squares fit), which is related to but different from our problem. Also, (Feldman et al., 2007) offer a bi-criteria solution for another related problem of approximating a set of points by a col-lection of flats; they use convex relaxations to solve their problem and are limited to bi-criteria solutions, unlike ours. Finally, in some special settings robust PCA can be used to solve`1low-rank approximation (Cand`es et al., 2011).

However, robust PCA and`1low-rank approximation have

some apparent similarities but they have key differences. Firstly,`1low-rank approximation allows to recover an

ap-proximating matrix of any chosen rank, whereas robust PCA returns some matrix of some unknown (possibly full) rank. While variants of robust PCA have been proposed to force the output rank to be a given value (Netrapalli et al., 2014; Yi et al., 2016), these variants make additional noise model and incoherence assumptions on the input matrix, whereas our results hold for every input matrix. Secondly, in terms of approximation quality, it is unclear if near-optimal solu-tions of robust PCA provide near-optimal solusolu-tions for`1

low-rank approximation.

(3)

gives a poor approximation factor for `p-approximation

error. First, suppose p < 2 and k = 1. Consider the followingn×nblock diagonal matrix composed of two blocks: a1×1matrix with valuenand an(n−1)×(n−1)

matrix with all1s. The SVD returns as a solution the first column, and therefore incurs polynomial innerror forp= 2−Ω(1). Now supposep >2andk= 1. Consider the followingn×nblock diagonal matrix composed of two blocks: a1×1matrix with valuen−2and an(n−1)×(n−1)

matrix with all1s. The SVD returns as a solution the matrix spanned by the bottom block, and so also incurs an error polynomial innforp= 2 + Ω(1).

2. Background

For a matrixM, letMi,j denote the entry in itsith row

andjth column and letMidenote itsith column. LetMT

denote its transpose and let|M|p=

P

i,j|Mi,j|

p1/p

de-note its entry-wisepnorm. Given a setS={i1, . . . , it}of

column indices, letMS =Mi1,...,itbe the matrix composed

of the columns ofM with the indices inS.

Given a matrixM withmcolumns, we will usespanM =

{Pm

i=1αiMi|αi∈R}to denote the vectors spanned by

its columns. IfM is a matrix and v is a vector, we let

dp(v, M)denote the minimum`pdistance betweenvand a vector inspanM:

dp(v, M) = inf

w∈spanM|v−w|p.

Let A ∈ Rn×m denote the input matrix and let k > 0

denote the target rank. We assume, without loss of generality (wlog.), thatm≤n. Our first goal is, givenAandk, to find a subsetU ∈Rn×kofkcolumns ofAandV ∈Rk×mto

minimize the`perror,p≥1, given by

|A−U V|p.

Our second goal is, givenAandk, to findU ∈Rn×k, V ∈ Rk×mto minimize the`perror,p≥1, given by

|A−U V|p.

Note that in the second goal, we do not requireUbe a subset of columns.

We refer to the first problem as thek-columns subset selec-tion problem in the`pnorm, denotedk-CSSp, and to the second problem as therank-kapproximation problem in the

`pnorm, denotedk-LRAp.1 _{In the paper we often call}_{U, V}

thek-factorizationofA. Note that a solution tok-CSSp

can be used as a solution tok-LRAp, but not necessarily vice versa.

1_k

-LRA2 is the classical SVD problem of finding U ∈

Rn×k, V ∈Rk×mso as to minimize|A−U V|2.

In this paper we focus on solving the two problems for generalp. LetU?_V?_{be a}_k_{-factorization of}_A_{that is}

opti-mal in the`pnorm, whereU? _∈

Rn×k andV? ∈Rk×m,

and let opt_k,p(A) = |A−U?_V?_|

p. An algorithm is said

to be an α-approximation, for an α ≥ 1, if it outputs

U ∈Rn×k, V ∈Rk×msuch that

|A−U V|p≤α·optk,p(A).

It is often convenient to view the input matrix as A = U?_V?_{+ ∆ =}_A?_{+ ∆}_{, where}_∆_{is some}_error_{matrix of}

minimum`p-norm. Letδ=|∆|p= optk,p(A).

We will use the following observation.

Lemma 1. LetU ∈Rn×k andv ∈ Rn×1. Suppose that

there existsx ∈ Rk×1such thatδ =|U·x−v|p. Then,

there exists a polynomial time algorithm that, givenU and

v, findsy∈Rk×1such that|U·y−v|p≤δ.

Proof. This`pregression problem is a convex program and

well-known to be solvable in polynomial time.

3. An

(

m

k

_poly(

_nm

₎₎

_{-time algorithm for}

k-

LRA

p

In this section we will present an algorithm that runs in time

mk_poly(_nm₎_{and produces a}₍_k_{+ 1)}_{-approximation to}_k -CSSp(hence tok-LRAp) of a matrixA∈Rn×m,m≤n,

for anyp∈[1,∞]. The algorithm simply tries all possible subsets ofkcolumns ofAfor producing one of the factors,

U, and then uses Lemma 1 to find the second factorV.

3.1. The existence of one factor inA

For simplicity, we assume that|∆i|p>0for each columni.

To satisfy this, we can add an arbitrary small random error to each entry of the matrix. For instance, for anyγ >0, and to each entry of the matrix, we can add an independent uniform value in[−γ, γ]. This would guarantee that|∆i|p >0for

eachi∈[m].

Recall thatA=A?_{+ ∆}_{is the perturbed matrix, and we}

only have access toA, notA?_{. Consider Algorithm 1 and}

its outputS. Note that we cannot actually run this algorithm since we do not knowA?_{. It is a hypothetical algorithm used}

for the purpose of our proof, i.e., the algorithm serves as a proof that there exists a subset ofkcolumns ofAproviding a good low rank approximation. In Theorem 3 we prove that the columns inAindexed by the subsetScan be used as one factor of ak-factorization ofA.

Before proving the main theorem of the section, we show a useful property of the matrixA˜?_{, i.e., the matrix having the}

vectorA?

i/|∆i|pas theith column. Then we will use this

(4)

Algorithm 1 Enumerating and selectingkcolumns ofA.

Require: A rankkmatrixA?_{and perturbation matrix}_∆ Ensure: kcolumn indices ofA?

1: For each column indexi, letA˜?

i ←A?i/|∆i|p.

2: WriteA˜?_{= ˜}_U_·_V_˜_{, s.t.}_U_˜ _∈

Rn×k,V˜ ∈Rk×m.

3: LetS be the subset ofkcolumns ofV˜ ∈ _Rk×m_that

has maximum determinant in absolute value (note that the subsetSindexes ak×ksubmatrix).

4: OutputS.

Lemma 2. For each columnA˜?_i ofA˜?, one can writeA˜?_i =

P

j∈SMi(j) ˜A ?

j, where|Mi(j)| ≤1for alli, j.

Proof. Fix an i ∈ {1, . . . , m}. Consider the equation

˜

VSMi = ˜Vi forMi ∈ Rk. We can assume the columns

inVS˜ are linearly independent, since wlog.,A˜?has rank

k. Hence, there is a unique solutionMi = ( ˜VS)−1Vi˜. By Cramer’s rule, the jth coordinate Mi(j) of Mi satisfies

Mi(j) = det( ˜V

j S)

det( ˜VS), where

˜

V_Sjis the matrix obtained by re-placing thejth column ofVS˜ withVi˜. By our choice ofS,

˜ A?

SMi= ˜A?i.

Now we prove the main theorem of this section.

Theorem 3. LetU =AS. Forp∈[1,∞], letM1, . . . , Mm

be the vectors whose existence is guaranteed by Lemma 2 and let V ∈ Rk×n be the matrix having the vector

|∆i|p·(Mi(1)/|∆i1|p, . . . , Mi(k)/|∆ik|p)

T

as itsith col-umn. Then, |Ai −(U V)i|p ≤ (k+ 1)|∆i|p and hence

|A−U V|p≤(k+ 1)|∆|p.

Proof. We consider the generic column(U V)i.

(U V)i = |∆i|p k X j=1 _Mi₍_j₎ |∆ij|p Ai_j = |∆i|p k X j=1 Mi(j) |∆ij|p A?ij + ∆ij = |∆i|p k X j=1 Mi(j) ˜A?ij +Mi(j) ∆ij |∆ij|p = |∆i|pA˜?i +|∆i|p k X j=1 Mi(j) ∆ij |∆ij|p = A?_i + k X j=1 |∆i|p·Mi(j) ∆ij |∆ij|p ∆ =A?_i +Ei.

Observe that Ei is the weighted sum of k vectors,

∆i1 |∆i₁|p, . . . ,

∆_ik

|∆_ik|p, having unit `p-norm. Observe

fur-ther that, since the sum of their weights satisfies

|∆i|pP k

j=1|Mi(j)| ≤k|∆i|p, we have that the`p-norm of Eiis not larger than|Ei|p≤k|∆i|p. The proof is complete

using the triangle inequality:

|Ai−(U V)i|p ≤ |A?i −Ai|p+|A?i −(U V)i|p = |∆i|p+|Ei|p

≤ (k+ 1)|∆i|p.

3.2. Anmkpoly(nm)-time algorithm

In this section we give an algorithm that returns a

(k + 1)-approximation to the k-LRAp problem in time

mkpoly(nm).

Algorithm 2 A(k+ 1)-approximation tok-LRAp.

Require: An integerkand a matrixA Ensure: U ∈ _Rn×k_{, V} _∈ Rk×ms.t. |A−U V|p ≤ (k+ 1)opt_k,p(A). 1: for allI∈ [m_k] do 2: LetU =AI

3: Use Lemma 1 to compute a matrixV that minimizes the distancedI =|A−U V|p

4: end for

5: ReturnU, V that minimizesdI, forI∈ [m_k]

The following statement follows directly from the existence ofkcolumns inAthat make up a factorU having small`p

error (Theorem 3).

Theorem 4. Algorithm 2 obtains a(k+ 1)-approximation tok-LRApin timemk_poly(_nm₎_.

4. A

poly(

nm

)

-time bi-criteria algorithm for

k-

CSS

p

We next show an algorithm that runs in time poly(nm)

but returnsO(klogm)columns ofAthat can be used in place ofU, with an errorO(k)times the error of the best

k-factorization. In other words, it obtains more than k

columns but achieves a polynomial running time; we will later build upon this algorithm in Section 5 to obtain a faster algorithm for thek-LRApproblem. We also show a lower bound: there exists a matrixAfor which the best possible approximation for thek-CSSp, forp∈(2,∞), iskΩ(1).

Definition 5 (Approximate coverage). Let S be a sub-set of k column indices. We say that column Ai is cp -approximately covered by S if forp ∈ [1,∞) we have

(5)

uniformly at random in [₂m_k]

, with constant probability we cover a constant fraction of columns ofA.

Lemma 6. SupposeR is a set of 2k uniformly random chosen columns of A. With probability at least2/9, R

covers at least a1/10fraction of columns ofA.

Proof. Letibe a column index ofAselected uniformly at random and not inR. LetT =R∪ {i}and letηbe the cost of the best`prank-kapproximation toAT. Note thatTis a uniformly random subset of2k+ 1columns ofA.

Case: p < ∞. Since T is a uniformly random subset of2k+ 1columns ofA, ET[ηp] =

(2k+1)|∆|p p

n . Let E1

denote the event “ηp≤ 10(2k+1)|∆|

p p

n ”. By a Markov bound, Pr[E1]≥9/10.

By Theorem 3, there exists a subset Lofk columns of

AT for whichminX|ALX−AT|pp ≤(k+ 1)pηp. Since i is itself uniformly random in the set T, it holds that

Ei[minx|ALx−Ai|pp≤

(k+1)p_ηp

2k+1 . LetE2denote the event

“minx|ALx−Ai|pp ≤

10(k+1)p_ηp

2k+1 ”. By a Markov bound, Pr[E2]≥9/10.

Let E3 denote the event “i /∈ L”. Since i is uniformly

random in the setT,Pr[E3]≥k+1 2k >1/2.

ClearlyPr[E1∧E2∧E3]≥3/10. Conditioned onE1∧E2∧E3,

which implies thatiis covered byR. Note that the first inequality uses thatL is a subset ofR givenE3, and so

the regression cost usingALcannot be smaller than that of

usingAR

Let Zi be an indicator variable ifi is covered byR and letZ =P iZi. We haveE[Z] = P iE[Zi] ≥ P i 3 10 = 3m/10; hence E[m−Z] ≤ 7m 10. By a Markov bound, Pr[m−Z≥ 9m 10]≤ 7 9.

Casep=∞.Thenη≤ |∆|∞sinceAT is a submatrix of A. By Theorem 3, there exists a subsetLofkcolumns of

AT for whichminX|ALX−AT|∞≤(k+ 1)η. Defining

E3as before and conditioning on it, we have min

x |ARx−Ai|∞ ≤ minx |ALx−Ai|∞

≤ min

X |ALX−AT|∞

≤ (k+ 1)|∆|∞,

i.e.,iis covered byR. Again definingZito be the event that

iis covered byR, we haveE[Zi]≥12, and soE[m−Z]≤ m 2, which impliesPr[m−Z ≥ 9m 10]≤ 5 9 < 7 9.

We are now ready to introduce Algorithm 3. We can wlog. assume that the algorithm knows a numberN for which

|∆|p≤N ≤2|∆|p. Indeed, such a value can be obtained

by first computing|∆|2using the SVD. Note that although

one does not know∆, one does know|∆|2since this is the

Euclidean norm of all but the topksingular values ofA, which one can compute from the SVD ofA. Then, note that forp <2,|∆|2≤ |∆|p≤n2−p|∆|2, while forp≥2,

|∆|p ≤ |∆|2 ≤ n1−2/p|∆|p. Hence, given|∆|2, there

are onlyO(logn)values ofN, one of which will satisfy

|∆|p≤N ≤2|∆|p. One can take the best solution found

by Algorithm 3 for each of theO(logn)guesses toN.

Algorithm 3 SelectingO(klogm)columns ofA.

Require: An integerk, and a matrixA=A?_{+ ∆}_. Ensure: O(klogm)columns ofA

SELECTCOLUMNS(k, A)

ifnumber of columns ofA≤2kthen

return all the columns ofA else

repeat

LetRbe uniform at random2kcolumns ofA until at least (1/10)-fraction columns of A arecp -approximately covered

LetA_Rbe the columns ofAnot approximately covered byR

returnAR∪SELECTCOLUMNS(k, A_R)

end if

Theorem 7. With probability at least9/10, Algorithm 3 runs in timepoly(nm)and returnsO(klogm)columns that can be used as a factor of the whole matrix inducing`p

errorO(k|∆|p).

Proof. First note that if|∆|p ≤ N ≤ 2|∆|p and ifi is

covered by a setRof columns, theniiscp-approximately

covered byRfor a constantcp; herecp= 2p_for_{p <}_∞_and c∞= 2. By Lemma 6, the expected number of repetitions

of selecting2kcolumns until(1/10)-fraction of columns ofAare covered isO(1). When we recurse on SELECT -COLUMNS on the resulting matrixA_R, each such matrix admits a rank-kfactorization of cost at most|∆|p.

More-over, the number of recursive calls to SELECTCOLUMNS

can be upper bounded bylog10m. In expectation there will

beO(logm)total repetitions of selecting2kcolumns, and so by a Markov bound, with probability9/10, the algorithm will chooseO(klogm)columns in total and run in time

poly(nm).

LetS be the union of all columns ofAchosen by the al-gorithm. Then for each column iofA, forp ∈ [1,∞),

(6)

we have minx|ASx−Ai|pp ≤

100(k+1)p₂p_|_∆_|p p

n , and so

minX|ASX −A|pp ≤ 100(k+ 1)p2p|∆|pp. Forp = ∞

we instead haveminx|ASx−Ai|∞≤2(k+ 1)|∆|∞, and

sominX|ASX−A|∞≤2(k+ 1)|∆|∞. 4.1. A lower bound fork-CSSp

In this section we prove an existential result showing that there exists a matrix for which the best approximation to the

k-CSSpiskΩ(1)_.

Lemma 8. There exists a matrixAsuch that the best ap-proximation for thek-CSSp problem, forp ∈ (2,∞), is

kΩ(1)_.

Proof. ConsiderA= (k+ 1)Ik+1, whereIk+1is the(k+ 1)×(k+ 1)identity matrix. And consider the matrixB = (k+ 1)·Ik+1−E, whereEis the(k+ 1)×(k+ 1)all

ones matrix. Note thatBhas rank at mostk, since the sum of its columns is0.

Case:2< p <∞.If we choose anykcolumns ofA, then the`pcost of using them to approximateAis(k+ 1). On the other hand,|A−B|∞= 1, which means that`pcost of Bis smaller or equal than (k+ 1)21/p

.

Case:p=∞.If we choose anykcolumns ofA, then the

`∞cost of using them to approximateAisk+ 1. On the

other hand,|A−B|∞= 1, which means that`∞cost ofB

is smaller or equal than1.

Note also that in (Song et al., 2017) the authors show that forp= 1the best possible approximation isΩ(√k)up to

poly(logk)factors.

5. A

((

k

log

n

)

k

_poly(

_mn

₎₎

_{-time algorithm for}

k-

LRA

p

In the previous section we have shown how to get a

rank-O(klogm),O(k)-approximation in timepoly(nm)to the

k-CSSpandk-LRApproblems. In this section we first show how to get a rank-k,poly(k)-approximation efficiently start-ing from a rank-O(klogm)approximation. This algorithm runs in polynomial time as long ask=O_{log log}logn_n. We then show how to obtain a(klogm)O(p)_{-approximation}

ratio in polynomial time for everyk.

LetUbe the columns ofAselected by Algorithm 3.

5.1. An isoperimetric transformation

The first step of our proof is to show that we can modify the selected columns ofAto span the same space but to have small distortion. For this, we need the following notion of isoperimetry.

Definition 9(Almost isoperimetry). A matrixB∈_Rn×m

isalmost-`p-isoperimetricif for allx, we have |x|p

2m ≤ |Bx|p≤ |x|p.

We now show that given a full rankA∈Rn×m, it is possible

to construct in polynomial time a matrixB∈_Rn×m_such

that A and B span the same space and B is almost-`p

-isoperimetric.

Lemma 10. Given a full (column) rankA∈Rn×m, there

is an algorithm that transformsA into a matrixB such thatspanA = spanB andB is almost-`p-isoperimetric. Furthermore the running time of the algorithm ispoly(nm). Proof. In (Dasgupta et al., 2009), specifically, Equation (4) in the proof of Theorem 4, the authors show that in polynomial time it is possible to find a matrixBsuch that

spanB=spanAand for allx,

|x|2≤ |Bx|p≤

√

m|x|2,

for anyp≥1.

Ifp <2, their result implies

|x|p √ m ≤ |x|2≤ |Bx|p≤ √ m|x|2≤ √ m|x|p,

and so rescalingBby√mmakes it almost-`p-isoperimetric. On the other hand, ifp >2, then

|x|p≤ |x|2≤ |Bx|p≤

√

m|x|2≤m|x|p,

and rescalingBbymmakes it almost-`p-isoperimetric. Note that the algorithm used in (Dasgupta et al., 2009) re-lies on the construction of the L¨owner–John ellipsoid for a specific set of points. Interestingly, we can also show that there is a more simple and direct algorithm to compute such a matrixB; this may be of independent interest. We provide the details of our algorithm in the full version (Chierichetti et al., 2017).

5.2. Reducing the rank tok

The main idea for reducing the rank is to first apply the almost-`p-isoperimetric transformation to the factorU to obtain a new factorZ0. For such aZ0, the`p-norm ofZ0V

is at most the`p-norm ofV. Using this fact we show that

V has a low-rank approximation and a rank-k approxima-tion ofV translates into a good rank-kapproximation of

U V. But a good rank-k approximation ofV can be ob-tained by exploring all possiblek-subsets of rows ofV, as in Algorithm 2. More formally, in Algorithm 4 we give the pseudo-code to reduce the rank of our low-rank approxima-tion fromO(klogm)tok. Letδ=|∆|p= optk,p(A).

(7)

Algorithm 4An algorithm that transforms anO(klogm) -rank matrix decomposition into ak-rank matrix decomposi-tion without inflating the error too much.

Require: U ∈Rn×O(klogm),V ∈RO(klogm)×m

Ensure: W ∈Rn×k,Z∈Rk×m

1: Apply Lemma 10 toU to obtain matrixW0

2: Apply Lemma 1 to obtain matrixZ0_{, s.t.}_∀_i_,_|_W0_Z0 i − (U V)i|pis minimized

3: Apply Algorithm 2 with input(Z0₎T _∈

Rn×O(klogm)

andkto obtainX andY

4: SetZ ←XT

5: SetW ←W0_YT

6: OutputW andZ

Theorem 11. Let A ∈ _Rn×m_,_U _∈

Rn×O(klogm),V ∈ RO(klogm)×mbe such that|A−U V|p = O(kδ). Then,

Algorithm 4 runs in timeO(klogm)k₍_mn₎O(1) _and

out-puts W ∈ Rn×k, Z ∈ Rk×m such that |A−W Z|p = O((k4_log_k₎_δ₎_.

5.3. Improving the running time

Interestingly it is possible to improve the running time to (mn)O(1) _{for every} _k _{and every constant} _p _≥ ₁_{, at}

the cost of apoly(klog(m))-approximation instead of the

poly(k)-approximation we had previously. See the full ver-sion (Chierichetti et al., 2017) for a proof.

Theorem 12. LetA ∈Rn×m,1 ≤k ≤min(m, n), and

p ≥ 1 be an arbitrary constant. LetU ∈ Rn×O(klogm)

andV ∈_RO(klogm)×m_{be such that}_|_A₋_{U V}_|

p=O(kδ).

There is an algorithm that runs in time(mn)O(1)_and

out-puts W ∈ _Rn×k_{, Z} _∈

Rk×m such that |A−W Z|p = (klogm)O(p)_δ_.

6. Experiments

In this section, we show the effectiveness of Algorithm 2 compared to the SVD. We run our comparison both on synthetic as well as real data sets. For the real data sets, we use matrices from the FIDAP set2_{and a word frequency}

dataset from UC Irvine3_{. The FIDAP matrix is 27}_×_{27 with}

279 real asymmetric non-zero entries. The KOS blog entries matrix, representing word frequencies in blogs, is 3430×

6906 with 353160 non-zero entries. For the synthetic data sets, we use two matrices. For the first, we use a 20 ×

30 random matrix with 184 non-zero entries—this random matrix was generated as follows: independently, we set each entry to0with probability0.7, and to a uniformly random 2_i_{http://math.nist.gov/MatrixMarket/data/}

SPARSKIT/fidap/fidap005.html

3_{https://archive.ics.uci.edu/ml/datasets/}

Bag+of+Words

value in[0,1]with probability0.3. Both matrices are full rank. For the second matrix, we use a random±120×30 matrix.

In all our experiments, we run a simplified version of Al-gorithm 2, where instead of running for all possible [m_k]

subsets ofkcolumns (which would be computationally pro-hibitive), we repeatedly samplekcolumns, a few thousand times, uniformly at random. We then run the`p-projection on each sampled set and finally select the solution with the smallest`p-error. (While this may not guarantee prov-able approximations, we use this a reasonprov-able heuristic that seems to work well in practice, without much computational overhead.) We focus onp= 1andp=∞.

Figure 1 illustrates the relative performance of Algorithm 2 compared to the SVD for different values ofkon the real data sets. In the figure the green line is the ratio of the total error. The`1-error for Algorithm 2 is always less than the

corresponding error for the SVD and in fact consistently outperforms the SVD by roughly 40% for small values of

k on the FIDAP matrix. On the larger KOS matrix, the relative improvement in performance with respect to`∞

-error is more uniform (around 10%).

We observe similar trends for the synthetic data sets as well. Figures 2 and 3 illustrate the trends. Algorithm 2 performs consistently better than the SVD in the case of

`1-error for both the matrices. In the case of`∞-error, it

outperforms SVD by around10%for higher values ofkon the random matrix. Furthermore, it consistently outperforms SVD, between 30% and 50%, for all values of kon the random±1matrix.

To see why our`∞error is always1for a random±1matrix A, note that by setting our rank-kapproximation to be the zero matrix, we achieve an`∞error of1. This is optimal

for large values of nandm and smallk as can be seen by recalling the notion of thesign-rankof a matrixA ∈ {−1,1}n×m_{, which is the minimum rank of a matrix}_B_for

which the sign ofBi,jequalsAi,jfor all entriesi, j. If the sign-rank ofAis larger thank, then for any rank-kmatrix

B, we havekA−Bk∞ ≥1since necessarily there is an

entryAi,jfor which|Ai,j−Bi,j| ≥1. It is known that the

sign-rank of a randomm×mmatrixA, and thus also of a randomn×mmatrixA, isΩ(√m)with high probability (Forster, 2002).

7. Conclusions

We studied the problem of low-rank approximation in the entry-wise `p error norm and obtained the first provably good approximation algorithms for the problem that work for everyp ≥ 1. Our algorithms are extremely simple, which makes them practically appealing. We showed the effectiveness of our algorithms compared with the SVD on

(8)

0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84 0 50 100 150 200 250 1 2 3 4 5 6 7 8 9 10 Ra tio o f Alg 2 t o S VD Er ror s Ab so lut e Er ror K Alg 2 SVD Ratio

(a) FIDAP matrix (`1)

0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20 25 30 35 40 45 1 2 3 4 5 6 7 8 9 10 Ra tio o f Alg 2 t o S VD Er ror s Ab so lut e Er ror K Alg 2 SVD Ratio (b) KOS matrix (`∞)

Figure 1: Comparing the performance of Algorithm 2 with SVD on the real data sets.

0.68 0.7 0.72 0.74 0.76 0.78 0.8 0 50 100 150 200 250 1 2 3 4 5 6 7 8 9 10 Ra tio o f Alg2 to SVD Er ror s Ab so lut e Er ror K Alg 2 SVD Ratio

(a) Random matrix (`1)

0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 1.2 1 2 3 4 5 6 7 8 9 10 Ra tio o f Alg2 to SVD Er ros Ab so lut e Er ror K Alg 2 SVD Ratio (b) Random matrix (`∞)

Figure 2: Comparing the performance of Algorithm 2 with SVD on the random matrix.

0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0 100 200 300 400 500 600 1 2 3 4 5 6 7 8 9 10 Ra tio of A lg 2 to S VD Err or s Ab so lut e Er ror K Alg 2 SVD Ratio (a)±1matrix (`1) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 0.5 1 1.5 2 2.5 1 2 3 4 5 6 7 8 9 10 Ra tio o f Alg 2 t o S VD Er ror s Ab so lut e Er ror K Alg 2 SVD Ratio (b)±1matrix (`∞)

Figure 3: Comparing the performance of Algorithm 2 with SVD on the±1matrix.

real and synthetic data sets. We obtain akO(1)

approxi-mation factor for everypfor the column subset selection problem, and we showed an example matrix for this prob-lem for which akΩ(1)approximation factor is necessary. It is unclear if better approximation factors are possible by designing algorithms that do not choose a subset of input columns to span the output low rank approximation. Re-solving this would be an interesting and important research direction.

(9)

References

Brooks, J. Paul, Dul´a, Jose’ H., and Boone, Edward L. A pure`1-norm principal component analysis.

Computa-tional Statistics & Data Analysis, 61:83–98, 2013. Cand`es, Emmanuel J., Li, Xiaodong, Ma, Yi, and Wright,

John. Robust principal component analysis? JACM, 58 (3):11:1–11:37, 2011.

Chierichetti, Flavio, Gollapudi, Sreenivas, Kumar, Ravi, Lattanzi, Silvio, Panigrahy, Rina, and Woodruff, David P. Algorithms for`p low-rank approximation. Technical

Report 1705.06730, arXiv, 2017.

Dan, Chen, Hansen, Kristoffer A., Jiang, He, Wang, Liwei, and Zhou, Yuchen. On low rank approximation of binary matrices. Technical Report 1511.01699v1, arXiv, 2015. Dasgupta, Anirban, Drineas, Petros, Harb, Boulos, Kumar,

Ravi, and Mahoney, Michael W. Sampling algorithms and coresets forlpregression.SICOMP, 38(5)(2060–2078), 2009.

Deshpande, Amit, Tulsiani, Madhur, and Vishnoi, Nisheeth K. Algorithms and hardness for subspace ap-proximation. InSODA, pp. 482–496, 2011.

Eriksson, Anders and van den Hengel, Anton. Efficient computation of robust low-rank matrix approximations using theL1norm.PAMI, 34(9):1681–1690, 2012.

Feldman, Dan, Fiat, Amos, Sharir, Micha, and Segev, Danny. Bi-criteria linear-time approximations for generalizedk -mean/median/center. InSoCG, pp. 19–26, 2007. Forster, J¨urgen. A linear lower bound on the unbounded

error probabilistic communication complexity.J. Comput. Syst. Sci., 65(4):612–625, 2002.

Gillis, Nicolas and Vavasis, Stephen A. On the complexity of robust PCA and`1-norm low-rank matrix

approxima-tion. Technical Report 1509.09236, arXiv, 2015. Goreinov, Sergei A. and Tyrtyshnikov, Eugene E. The

maximal-volume concept in approximation by low-rank matrices. Contemporary Mathematics, 208:47–51, 2001. Goreinov, Sergei A. and Tyrtyshnikov, Eugene E. Qua-sioptimality of skeleton approximation of a matrix in the Chebyshev norm.Doklady Mathematics, 83(3):374–375, 2011.

Huber, Peter J.Robust Statistics. John Wiley & Sons, New York,, 1981.

Ke, Qifa and Kanade, Takeo. RobustL1norm factorization

in the presence of outliers and missing data by alternative convex programming. InCVPR, pp. 739–746, 2005. Lu, Cewu, Shi, Jiaping, and Jia, Jiaya. Scalable adaptive

robust dictionary learning.TIP, 23(2):837–847, 2014. Meng, Deyu and Torre, Fernando. D. L. Robust matrix

factorization with unknown noise. InICCV, pp. 1337– 1344, 2013.

Netrapalli, Praneeth, Niranjan, U. N., Sanghavi, Sujay, Anandkumar, Animashree, and Jain, Prateek. Non-convex robust PCA. InNIPS, pp. 1107–1115, 2014.

Song, Zhao, Woodruff, David P., and Zhong, Pelin. Low rank approximation with entrywise`1-norm error. In

STOC, 2017.

Wang, Naiyan and Yeung, Dit-Yan. Bayesian robust matrix factorization for image and video processing. InICCV, pp. 1785–1792, 2013.

Wang, Naiyan, Yao, Tiansheng, Wang, Jingdong, and Ye-ung, Dit-Yan. A probabilistic approach to robust matrix factorization. InECCV, pp. 126–139, 2012.

Woodruff, David P. Sketching as a tool for numerical linear algebra.Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

Xiong, Liang, Chen, Xi, and Schneider, Jeff. Direct robust matrix factorization for anomaly detection. InICDM, pp. 844–853, 2011.

Xu, Huan, Caramanis, Constantine, and Sanghavi, Sujay. Robust PCA via outlier pursuit.TOIT, 58(5):3047–3064, 2012.

Xu, Lei and Yuille, Alan L. Robust principal compo-nent analysis by self-organizing rules based on statis-tical physics approach. IEEE Transactions on Neural Networks, 6(1):131–143, 1995.

Yi, Xinyang, Park, Dohyung, Chen, Yudong, and Caramanis, Constantine. Fast algorithms for robust pca via gradient descent. InNIPS, pp. 4152–4160, 2016.

Zheng, Yinqiang, Liu, Guangcan, Sugimoto, Shigeki, Yan, Shuicheng, and Okutomi, Masatoshi. Practical low-rank matrix approximation under robustL1-norm. InCVPR,