Improved matrix algorithms via the Subsampled Randomized Hadamard Transform

(1)

RANDOMIZED HADAMARD TRANSFORM

CHRISTOS BOUTSIDIS∗ _AND _{ALEX GITTENS}†

Key words. low-rank approximation, least-squares regression, hadamard transform, sampling, randomized algorithms.

AMS subject classifications. 15B52, 15A18, 11K45

Abstract. Several recent randomized linear algebra algorithms rely upon fast dimension reduc-tion methods. A popular choice is the Subsampled Randomized Hadamard Transform (SRHT). In this article, we address the efficacy, in the Frobenius and spectral norms, of an SRHT-based low-rank matrix approximation technique introduced by Woolfe, Liberty, Rohklin, and Tygert. We establish a slightly better Frobenius norm error bound than currently available, and a much sharper spectral norm error bound (in the presence of reasonable decay of the singular values). Along the way, we produce several results on matrix operations with SRHTs (such as approximate matrix multiplica-tion) that may be of independent interest. Our approach builds upon Tropp’s in “Improved analysis of the Subsampled Randomized Hadamard Transform”.

1. Introduction. Numerical linear algebra algorithms are traditionally deter-ministic. For example, given a full-rank matrix A ∈ Rm×m and a vector b∈ Rm, Gaussian elimination requires at most 2m3/3 arithmetic operations to compute a vector x ∈ Rn that satisfies Ax = b, while the matrix-matrix multiplication AAT requires at most (2m−1)m2 _{operations, assuming that the matrix multiplication}

exponent equals 3. Another important problem is eigenvalue computation: current state-of-the-art solvers compute allmeigenvalues ofAAT _{in O(}_m3_{) arithmetic}

oper-ations. All these computations are deterministic, i.e ensure that the solution of the underlying problem is returned after the corresponding operation count.

Although these algorithms are numerically stable and run in polynomial time, O(m3_{) arithmetic operations can be prohibitive for many applications when the size}

of the matrix is large, e.g. on the order of millions or billions [34, 33]. One way to speed up these algorithms is to reduce the size of A, and then apply standard deterministic procedures to the resulting matrix. In more detail, for a matrix Ω ∈

Rm×r(m > r= o(m)), letY=AΩ∈Rm×r. Ωis a so-called “dimension reduction” matrix andY contains as much information of Aas possible. Consider for example the matrix-matrix multiplication operation mentioned above. In this setting, one can computeYYT _{instead of}_AAT_{. If}_Ω_{is chosen carefully, then}

YYT≈AAT,

and the number of operations needed to computeYYT_{is at most o(}_m3_{) [15, 16].}

Recent years have produced a large body of research on designing random matri-cesΩ with which many popular problems in numerical linear algebra (e.g. low-rank matrix approximation [17, 18], least-squares regression [42, 6, 11], k-means cluster-ing [8]) can be solved approximately in o(m3_{) arithmetic operations. We refer the}

reader to a recent comprehensive survey of the topic [27], which has now emerged as

Randomized Numerical Linear Algebra.

∗_Mathematical _Sciences _Department, _IBM _T. _J. _Watson _Research _Center. _Email:

{cboutsi}@us.ibm.com

†_{Applied and Computational Mathematics Department, California Institute of Technology. Email:}

{gittens}@caltech.edu

1

(2)

Some proposed choices forΩinclude: (i) every entry ofΩtakes the values +1,−1 with equal probability [12, 35]; (ii) the entries of Ωare i.i.d. Gaussian random vari-ables with zero mean and unit variance [27]; (iii) the columns of Ω are chosen inde-pendently from the columns of them×midentity matrix with probabilities that are proportional to the Euclidean length of the columns ofA [21, 17]; (vi) the columns of Ω are chosen independently from the columns of them×m identity matrix uni-formly at random [23]; (v) Ω is designed carefully such that AΩ can be computed in at most O(nnz(A)) arithmetic operations, where nnz(A) denotes the number of non-zero entries inA[13].

In this article we focus on the so-called Subsampled Randomized Hadamard Transform (SRHT), i.e. the matrixΩ contains a subset of the columns of a random-ized Hadamard matrix (see Definitions 1.1 and 1.2 below). This form of dimension reduction was introduced in [1]. It is of particular interest because the highly struc-tured nature of Ω can be exploited to reduce the time of computingY =AΩ from O(m2_r_{) to O(}_m2_log

2r) (see Lemma 1.3 below).

Definition 1.1 (Normalized Walsh–Hadamard Matrix). Fix an integer n =

2p_{, for} _p _{= 1}_,₂_,₃_{, ...}_{. The (non-normalized)} _n_×_n _{matrix of the Walsh–Hadamard}

transform is defined recursively as, Hn= Hn/2 Hn/2 Hn/2 −Hn/2 , with H2= +1 +1 +1 −1 .

The n×n normalized matrix of the Walsh–Hadamard transform is equal to H = n−12H_n∈_Rn×n.

Definition 1.2 (Subsampled Randomized Hadamard Transform (SRHT)

ma-trix_).

Fix integers rand n= 2p _with_{r < n} _and_p_{= 1}_,₂_,₃_{, ...}_{. An SRHT matrix is an} r×n matrix of the form

Θ=

r n

r ·RHD;

• D∈Rn×nis a random diagonal matrix whose entries are independent random

signs, i.e. random variables uniformly distributed on {±1}. • H∈_Rn×n _{is a normalized Walsh–Hadamard matrix.}

• R ∈ _Rr×n _{is a subset or} _r _{rows from the} _n_×_n _{identity matrix, where the}

rows are chosen uniformly at random and without replacement.

Lemma 1.3 (Fast Matrix-Vector Multiplication, Theorem 2.1 in [2]). Given

x ∈ Rn and r < n, one can construct Θ ∈ Rr×n and compute Θx in at most 2nlog₂(r+ 1)) operations.

The purpose of this article is to analyze the theoretical performance of an SRHT-based randomized low-rank approximation algorithm introduced in [45] and analyzed in [45, 27, 37]. Our analysis (see Theorem 2.1) provides sharper approximation bounds than those in [45, 27, 37].

Our study should also be viewed as followup to the work of Drineas et al. [19] and [40, 3] on designing fast approximation algorithms for solving least-squares re-gression problems. One of the two algorithms presented in [19] employs the SRHT to quickly reduce the dimension of the least squares problem and then solves the smaller problem with a direct least-squares solver, while [40, 3] use the SRHT to design a good preconditioner for an iterative method, which is then used to solve the regres-sion problem. The results in this article along with the work in [43] have implications in all these studies [40, 19, 3]. We discuss these implications in Section 3.

(3)

1.1. Beyond the SRHT. Finally, notice that the SRHT is defined only when the matrix dimension is a power of two. An alternative option is to use other structured orthonormal randomized transforms such as the discrete cosine transform (DCT) or the discrete Hartley transform (DHT) [45, 37, 40, 3], whose entries are on the order ofn−1/2_._{All these transforms do not place any restrictions on the size of the}

matrix. The results of this paper - with minimal effort - can be extendedunchanged

to encompass these transforms. To see this, notice that Lemma 3.3 in [43] remains unchanged for all these orthogonal transforms. Thus Lemma 4.1 in our work as well as all other results presented in this article are true for these orthogonal transforms as well.

1.2. Roadmap. This article is structured as follows. Section 1.3 introduces the notation. In Section 2, we present our main results on the quality of SRHT low-rank approximations and compare them to prior results in the literature. In Section 3, we discuss two approaches to least-squares regression involving SRHT dimensionality-reduction. Section 4 first recalls known facts on the application of SRHTs to orthog-onal matrices and then presents new results on the application of SRHTs to general matrices and the approximation of matrix multiplication using SRHTs under the Frobenius norm. Section 5 contains the proofs of our two main theorems presented in Sections 2 and 3. We conclude the paper with an experimental evaluation of the SRHT low-rank approximation algorithm in Section 6.

1.3. Preliminaries. We use A,B, . . . to denote real matrices and a,b, . . . to denote real column vectors. In is the n×n identity matrix; 0m×n is the m×n matrix of zeros;ei is the standard basis (whose dimensionality will be clear from the context). A(i) denotes the ith row of A; A(j) denotes the jth column of A; Aij denotes the (i, j)th element of A. We use the Frobenius and the spectral norm of a matrix: kAkF = qP

i,jA

2

ij and kAk2 = maxx:kxk2=1kAxk2, respectively. The

notationkAkξ indicates that an expression holds for bothξ= 2 and ξ= F.

A (compact) Singular Value Decomposition (SVD) of the matrixA∈Rm×nwith rank(A) =ρis a decomposition of the form

A= Uk Uρ−k | {z } UA∈Rm×ρ Σk Σρ−k | {z } ΣA∈Rρ×ρ VT k VT ρ−k | {z } VT A∈Rρ×n ,

where the singular values ofAare ordered σ1 ≥. . . σk ≥σk+1 ≥. . .≥σρ>0. Here k is a parameter in the interval 1 ≤k ≤ρand the above formula corresponds to a partition of the SVD in block form usingk. We denote theith singular value ofAby σi(A) and sometimes refer toσ1 as σmax andσρ as σmin. The matricesUk ∈Rm×k andUρ−k ∈Rm×(ρ−k) contain the left singular vectors ofA; similarly, the matrices

Vk ∈Rn×k andVρ−k∈Rn×(ρ−k)contain the right singular vectors ofA. We denote

Ak=UkΣkVTk ∈Rm×n. Ak minimizeskA−Xkξ over allm×nmatricesXof rank at most k. A† =VAΣA−1UTA ∈Rn×m denotes the Moore-Penrose pseudo-inverse of

A∈Rm×n.LetX∈Rm×n (n≥m) andB=XXT∈Rm×m; any matrix that can be written in this form is called a symmetric positive semidefinite (SPSD) matrix. For alli= 1, ..., m, λi(B) = σi2(X) denotes theith eigenvalue ofB. We sometimes use λmin(B) andλmax(B) to denote the smallest (nonzero) and largest eigenvalues ofB,

(4)

2. Low-rank matrix approximation using SRHTs. Using an SRHT matrix (see Definition 1.2), one can quickly construct a low-rank approximation to a given matrix A. Our main result, Theorem 2.1 below, provides theoretical guarantees on the spectral and Frobenius norm accuracy of these approximations.

Theorem 2.1. LetA∈Rm×n with rankρandnis a power of 2. Fix an integer

k satisfying 2 ≤ k < ρ. Let 0 < ε < 1/3 be an accuracy parameter, 0 < δ < 1

be a failure probability, and C≥1 be any specified constant. Let Y =AΘT_, _where Θ∈Rr×n is an SRHT withr satisfying

6C2ε−1h√k+p8 ln(n/δ)i

2

ln(k/δ)≤r≤n. (2.1)

Let ` = min{m, r}. Furthermore, let Q ∈ _Rm×` _satisfy _QT_Q ₌ _I

` and be such

that the column space of Y is contained in the range of Q (e.g. such a Q can be computed with the QR factorization of Y in O(m`2₎ _{arithmetic operations), and let}

˜

Ak=QXopt∈Rm×n,whereXopt is computed via the SVD of QTAas follows,

Xopt= argmin X∈R`×n,rank(X)≤k

kQT_A₋_XkF_.

Given this setup, with probability at least1−δC2ln(k/δ)/4−7δthe following Frobe-nius norm bounds hold simultaneously:

kA−YY†AkF≤(1 + 22ε)· kA−AkkF, (i) kA−A˜kkF≤(1 + 22ε)· kA−AkkF, (ii)

kAk−YY†AkF≤(1 + 22ε)· kA−AkkF, (iii)

kAk−A˜kkF≤(2 + 22ε)· kA−AkkF. (iv)

Similarly, the same setup ensures that with probability at least1−5δ,the following spectral norm bounds hold simultaneously:

kA−YY†Ak2≤ 4 + r 3 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 3 ln(ρ/δ) r · kA−AkkF, (v) kA−A˜kk2≤ 6 + r 6 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 6 ln(ρ/δ) r · kA−AkkF, (vi) kAk−YY†Ak2≤ 4 + r 3 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 3 ln(ρ/δ) r · kA−AkkF, (vii) kAk−A˜kk2≤ 7 + r 12 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 6 ln(ρ/δ) r · kA−AkkF. (viii)

Recall that ` = min{m, r}. The matrix Y can be constructed using 2mnlog₂(r+ 1)

arithmetic operations and, givenY,the matricesYY†AandA˜k can be formed using O(mn`+mr`)andO(mn`+`2_n₎_{additional arithmetic operations, respectively.}

(5)

We prove this theorem in Section 5.2. Notice that the Theorem provides residual and forward error bounds for two low-rank matrices in the spectral and Frobenius norms. The matrixYY†Ahas rank at mostr > k,while the matrix ˜Ak has rank at mostk.Prior works have provided only residual error bounds [45, 27, 37].

The first two Frobenius norm bounds in this theorem (residual error analysis) are slightly stronger than the best bounds appearing in prior efforts [37]. The spectral norm bounds on the residual error are significantly better than the bounds presented in prior work and shed light on an open question mentioned in [37]. We do not, however, claim that the error bounds provided are the tightest possible. Certainly the specific constants (22,6,etc.) in the error estimates are not optimized.

We now present a detailed comparison of the guarantees given in Theorem 2.1 with those available in the existing literature.

2.1. Detailed Comparison to Prior Work.

2.1.1. Halko et al. [27]. To put our result into perspective, we compare it to prior efforts at analyzing the SRHT algorithm introduced above. Halko et al. [27] argue that ifrsatisfies

4h√k+p8 ln(kn)i

2

ln(k)≤r≤n, (2.2)

then, for bothξ= 2,F,

kA−YY†Akξ ≤

1 +p7n/r· kA−Akkξ,

with probability at least 1−O(1/k). Our first Frobenius norm bound is always tighter than the Frobenius norm bound given here. To compare the spectral norm bounds, note that our first spectral norm bound is on the order of

max (r ln(ρ/δ) ln(n/δ) r · kA−Akk2, r ln(ρ/δ) r · kA−AkkF ) . (2.3)

If the singular values ofAare flat andAhas close to full rank, then the spectral norm result in [27] is perhaps optimal. But in the cases where it makes most sense to ask for low-rank approximations—viz.,Ais rank-deficient or the singular values ofAdecay fast—the spectral error norm bound in Theorem 2.1 is more useful. Specifically, if

kA−AkkF

r _n

ln(ρ/δ)· kA−Akk2,

then whenr is chosen according to Theorem 2.1 the quantity in Eqn. (2.3) is much smaller than

p

7n/r· kA−Akk2.

We were able to obtain this improved bound by using the results in Section 4.1, which allow one to take into account decays in the spectrum ofA. Finally, notice that our theorem makes explicit the intuition that the probability of failure can be driven to zero independently of the target rankkby increasing the number of samplesr.

(6)

2.1.2. Nguyen et al. [37]. A tighter analysis of the Frobenius norm error term of the SRHT low-rank matrix approximation algorithm appeared in Nguyen et al. [37]. Let δ be a probability parameter with 0 < δ < 1 and ε be an accuracy parameter with 0< ε <1. Then, Nguyen et al. show that in order to get a rank-k matrix ˜Ak satisfying kA−A˜kkF≤(1 +ε)· kA−AkkF and kA−A˜kk2≤ 2 +p2n/r· kA−Akk2 with probability of success at least 1−5δ, one requires

r= Ωε−1max{k,√kln(2n/δ)} ·max{lnk,ln(3/δ)}.

Theorem 2.1 gives a tighter spectral norm error bound in the cases most of interest, wherekA−AkkF

q _n

ln(ρ/δ)· kA−Akk2.It also provides an equivalent Frobenius

norm error bound with a comparable failure probability for a smaller number of samples. Specifically, if

r≥528ε−1[√k+p8 ln(8n/δ)]2ln(8k/δ) = Ω ε−1max{k,ln(n/δ)} ·max{lnk,ln(1/δ)} , then the second Frobenius norm bound in Theorem 2.1 ensureskA−A˜kkF≤(1 +ε)·

kA−AkkF,with probability at least 1−8δ.

In the conclusion of [37], the authors left as a subject for future research the explanation of a curious experimental phenomenon: when the singular values decay according to power laws, the SRHT low-rank approximation algorithm empirically achieves relative-error spectral norm approximations. Our spectral norm result pro-vides an explanation of this phenomenon: when the singular values of A decay fast enough, as in power law decay, one haskA−AkkF= Θ (1)· kA−Akk2. In this case, by choosingr

24ε−1h√k+p8 ln(n/δ)i

2

ln(k/δ) ln(n/δ)≤r≤n

our second spectral norm bound ensureskA−A˜kk2≤O(1)·kA−Akk2with probabil-ity of at least 1−8δ,thus predicting the observed empirical behavior of the algorithm.

2.1.3. The subsampled randomized Fourier transform (SRFT). The al-gorithm in Section 5.2 of [45], which was the first to use the idea of employing sub-sampled randomized orthogonal transforms to compute low-rank approximations to matrices, provides a spectral norm error bound but replaces the SRHT with an SRFT, i.e. the matrixHof Definition 1.2 is replaced by a matrix where the (j, h)th entry is

Hjh=e−2πi(j−1)(h−1)/n, wherei=

√

−1, i.e. His the unnormalized discrete Fourier transform. Woolfe et al. [45] (see eqn. 190) argue that, for anyα >1,β >1, if

r≥α2β(α−1)−1(2k)2, then with probability at least 1−3/β (ω= max{m, n}),

kA−U˜kΣ˜kV˜ T kk2≤2 √ 2α−1 + 1 · √αω+ 1 +√αω · kA−Akk2.

Here, ˜Uk ∈ Rm×k contains orthonormal columns, as does ˜Vk ∈ Rn×k, while ˜Σk ∈ Rk×k is diagonal with nonegative entries. These matrices can be computed determin-istically fromAΘT_{in O(}_k2₍_m₊_n_{) +}_kr2_ln_r_{) time. Also, computing}_Y₌_AΘT_takes

(7)

2.1.4. Two alternative dimensionality-reduction algorithms. Instead of using an SRHT matrix, one can takeΘT _{in Theorem 2.1 to be a matrix of i.i.d}

stan-dard Gaussian random variables. One gains theoretically and often empirically better worse-case trade-offs between the number of samples taken, the failure probability, and the error guarantees. The SRHT algorithm is still faster, though, since matrix multiplications with Gaussian matrices require O(mnr) time. One can also takeΘT

to be a matrix of i.i.d. random signs (±1 with equal probability). In many ways, this is analogous to the Gaussian algorithm—in both casesΘis a matrix of i.i.d. subgaus-sian random variables—so we expect this algorithm to have the same advantages and disadvantages relative to the SRHT algorithm. We now compare the best available performance bounds for these schemes to our SRHT performance bounds.

We use the notion of the stable rank of a matrix, sr (A) =kAk2

F/kAk 2 2,

to capture the decay of the spectrum of A (spectrum here refers to the singular values ofA). As can be seen by considering a matrix with a flat spectrum, in general the stable rank is no smaller than the rank; the smaller the stable rank, the more pronounced the decay in the spectrum ofA.

Whenr > k+ 4,Theorem 10.7 and Corollary 10.9 in [27] imply that, when using Gaussian sampling, with probability at least 1−2·32−(r−k)−e−(r−2k+1),

kA−YY†AkF≤ 1 + 32 √ 3k+ e√r √ r−k+ 1 ! · kA−AkkF

and with probability at least 1−3e−(r−k)_,

kA−YY†Ak2≤ 1 + 16 r 1 + k r−k ! · kA−Akk2+ 8√r r−k+ 1· kA−AkkF. Comparing to the guarantees of Theorem 2.1 we see that these bounds suggest that with the same number of samples, Gaussian low-rank approximations outperform SRHT low-rank approximations. In particular, the spectral norm bound guarantees that if sr (A−Ak)≤k, i.e. kA−AkkF≤

√

kkA−Akk2,then the Gaussian low-rank approximation algorithm requires O(k/ε2) samples to return a (17+ε) constant factor spectral norm error approximation with high probability. Similarly, the Frobenius norm bound guarantees that the same number of samples returns a (1 + 32ε) constant factor Frobenius norm error approximation with high probability. Neither the spectral nor Frobenius bounds given in Theorem 2.1 for SRHT low-rank approximations apply for this few samples.

[35] does not consider the Frobenius norm error of the random sign low-rank approximation algorithm, but Remark 4 in [35] shows that whenr= O(k/ε4_ln(1_/δ_)),

for 1< δ <0,and sr (A−Ak)≤k,this algorithm ensures that with high probability of at least 1−δ,

kA−YY†Ak2≤(1 +ε)kA−Akk2.

To compare our results to those stated in [27, 35] we assume thatkln(n/δ) so that r > klnksuffices for Theorem 2.1 to apply. Then, in order to acquire a (4 +ε)

(8)

relative error bound from Theorem 2.1, it suffices that (here C0 is an explicit constant no larger than 6)

r≥C0ε−2kln(ρ/δ) and sr (A−Ak)≤C0k.

We see that the Gaussian and random sign approximation algorithms return (17+ ε) and (1 +ε) relative spectral error approximations, respectively, when r is on the order ofkand the relatively weak spectral decay condition sr (A−Ak)≤kis satisfied, while our bounds for the SRHT algorithm requirer > kln(ρ/δ) and the spectral decay condition

sr (A−Ak)≤C0k

to ensure a (6 +ε) relative spectral error approximation. We note that the SRHT algorithm can be used to obtain relative spectral error approximations of matrices with arbitrary stable rank at the cost of increasing r(the same is of course true for the Gaussian and random sign algorithms).

The disparity in the bounds for these three schemes—the presence of the loga-rithmic factors in the SRHT bounds and the fact that these bounds apply only when r > kln(ρ/δ)—may reflect a fundamental trade-off between the structure and ran-domness ofΘT_{. The highly structured nature of SRHT matrices makes it possible to}

calculate Y much faster than when Gaussian or random sign sampling matrices are used, but this moves us away from the very nice isotropic randomness present in the GaussianΘT_{and the similarly nice properties of a matrix of i.i.d subgaussian random}

variables, thus resulting in slacker bounds which require more samples.

3. Least squares regression. We now show how one can use the SRHT to solve least squares problems of the form

min

x kAx−bk2.

HereAis anm×nmatrix withmnand rank(A) =n,b∈Rm, andx∈Rn. One approach to solve this optimization problem is via the SVD ofA: xopt=A†b; while an example of an iterative algorithm is the LSQR algorithm in [38].

During the last decade, researchers have developed several randomized algorithms that (approximately) solve the regression problem in less running time than the ap-proaches mentioned above [42, 40, 35, 3, 19]. We refer the reader to Section 3.3 in [4] for a survey of these methods. The fastest non-iterative method is in [19] while the fastest iterative algorithm is in [40, 3]. Both approaches employ the Subsampled Randomized Hadamard Transform.

3.1. Least-squares via the SRHT and the SVD. The idea in the SRHT algorithm of Drineas et al. [19] is to reduce the dimensions of A and b by pre-multiplication with an SRHT matrix Θ ∈ Rr×m (the matrix R in this case is con-structed by uniform sampling without replacement) and then solve quickly the smaller problem,

min

x kΘAx−Θbk2.

Let ˜xopt= (ΘA)†Θb; then, assumingr satisfies (ε >0 is an accuracy parameter) r= max{482nln(40mn) ln(104nln(40mn)),40nln(40mn)/ε},

(9)

[19] shows that with probability at least 0.8,

kAx˜opt−bk2≤(1 +ε)· kAxopt−bk2.

Furthermore, assume that there exists a γ ∈ (0,1] such that kUAUTAbk2 = γkbk2.

Then, with the same probability,

kxopt−x˜optk2≤

√

εκ(A)pγ−2₋₁_kx

optk2.

Here,κ(A) is the two-norm condition number ofA: κ(A) =kAk2kA†k2.

The running time of this approximation algorithm is O(mnlog2r+rn2), since the

SRHT multiplication takes O(mnlog₂r) time and the solution of the small regression problem another O(rn2_).

Below, we provide a novel analysis of this SRHT least squares algorithm which shows that one needs asymptotically fewer samples r. This immediately implies an improvement on the running time of the algorithm. Additionally, we show logarithmic dependence on the failure probability.

Theorem 3.1. Let A ∈ Rm×n (m n) have rank ρ = n and n be a power

of 2; let b ∈Rm. Let 0 < ε <1/3 denote an accuracy parameter, 0 < δ < 1 be a

failure probability, andC≥1be a constant. Let Θ be anr×m SRHT matrix withr

satisfying

6C2ε−1h√n+p8 ln(m/δ)i

2

ln(n/δ)≤r≤m.

Then, with probability at least1−δC2ln(n/δ)/4−7δ,

kAx˜opt−bk2≤(1 + 22ε)· kAxopt−bk2.

Furthermore, assume that there exists a γ ∈ (0,1] such that kUAUTAbk2 =γkbk2.

Then, with the same probability,

kxopt−x˜optk2≤ ₁₋√_ε 4ε 12 κ(A)pγ−2₋₁_kx optk2.

We prove this theorem in Section 5.3. Another possibility to obtain a better analysis of the method of Drineas et al. is to use Lemma 4.5 in this article, which was proved in [29] and presents bounds for sampling without replacement. This analysis is not straightforward and is beyond the scope of this paper.

3.2. Iterative methods. The key idea of an iterative algorithm such as the LSQR method of [38] ispreconditioning. Blendenpik in [3] constructs such a precon-ditioner by using the SRHT (the matrix R in this case is constructed by uniform sampling without replacement) as follows. First, an SRHT matrixΘ∈Rr×mis con-structed. Then, one forms a QR factorization ΘA = QRA, with Q ∈ Rr×n and

RA ∈Rn×n. Finally,A and RA are given as inputs to LSQR to find a solution to

the least squares problem. We refer the reader to [3] (see also [29]) for a detailed discussion of this approach. The purpose of our discussion here is to comment on the first step of the above procedure and show that a preconditioner of the same quality

(10)

can be constructed with a smaller r. Avron et al. [3] argue that if the number of samples is sufficiently large then the two-norm condition number ofAR−_A1 is small. A small condition number is desirable because the number of iterations required for convergence of the LSQR method is proportional to the condition number. More specifically, Theorem 3.2 in [3] argues that withr= Ω (nln(m) ln(nln(m))), and with constant probability (e.g. 0.9),

κ AR−_A1

= O(1).

The analysis of Blendenpik was recently improved in [29]. More specifically, Corollary 3.11 in [29], along with Lemma 4.2 in our manuscript, which gives a bound on the coherence, show that if

8 3ε −2h√_n₊p 8 ln(m/δ)i 2 ln(2n/δ)≤r≤m, then, with probability at least 1−2δ,

κ AR−_A1

≤

r

1 +ε 1−ε.

We now provide a similar bound in the case where the SRHT is constructed via sampling without replacement. This bound is a simple combination of results in prior work. More specifically, Theorem 1 in [40] argues that the two-norm condition number ofAR−_A1equals the two-norm condition number ofUT_ΘT_{, where}_U_∈

Rm×ncontains the topnleft singular vectors ofA. Combine this fact with the bounds on the singular values ofUT_ΘT _{from Lemma 4.1, to obtain the following observation.}

Remark. Let A ∈ _Rm×n ₍_m _n_{) have rank} _ρ ₌ _n _and _n _{be a power of 2.} Fix 0< δ <1 and 0< ε <1/3. Construct the upper triangular matrix RA∈Rn×n via the QR factorization ΘA = QRA, where Θ is an r×m SRHT matrix with r

satisfying

6ε−2h√n+p8 ln(m/δ)i

2

ln(2n/δ)≤r≤m. Then, with probability at least 1−2δ,

κ AR−_A1

≤

r

1 +ε 1−ε.

Finally, notice that we form the SRHT by uniform sampling without replacement while Blendenpik samples the columns of the randomized Hadamard matrix with

replacement. A different sampling scheme - Bernoulli sampling - was analyzed in Theorem 6.1 in [24] and Section 4 in [29].

3.2.1. The subsampled randomized Fourier transform (SRFT). Finally, we mention the work of Rokhlin and Tygert [40], which was the first to use the idea of employing subsampled randomized orthogonal transforms to precondition iterative solvers for least squares regression problems. [40] replaces the SRHT with the SRFT; notice though that one still needs O(mnlnr) time to compute the productΘA. In this case, for anyα >1, 0< δ <1, if

r≥

_α2_{+ 1}

α2₋₁

2_n2

δ , then with probability at least 1−δ,

κ AR−_A1

(11)

4. Matrix Computations with SRHT matrices.

4.1. SRHTs applied to orthonormal matrices. An important ingredient in analyzing the low-rank approximation algorithm of Theorem 2.1 is understanding how an SRHT changes the spectrum of a matrix after postmultiplication: given a matrix

Xand an SRHT matrixΘ, how are the singular values ofXandXΘT_{related? To be}

more precise, Lemma 5.4 in Section 5.1.2 suggests that one path towards establishing the efficacy of SRHT-based low-rank approximations lies in understanding how the SRHT perturbs the singular values of orthonormal matrices. To see this, we informally repeat the statement of the lemma here. LetA∈Rm×nhave rankρ.Fixksatisfying 0≤k≤ρ. Given a matrix Ω∈Rn×r, withr≥k, construct Y=AΩ.If VTkΩ has full row-rank, then, forξ= 2,F,

kA−YY†Ak2 ξ ≤ kA−Akk2ξ+kΣρ−kVTρ−kΩ VT kΩ † k2 ξ. (4.1)

Now takeΩ=ΘT_{and observe that if the product}_Σ

ρ−kVTρ−kΘT VT kΘT † has small norm, then the residual error of the approximant YY†Ais small. The norm of this product is small when the norms of the perturbed orthonormal matricesVT

ρ−kΘTand

VT

kΘT †

are in turn small, because

kΣρ−kVTρ−kΘT VT kΘT † k2ξ ≤ kΣρ−kkξ2kVTρ−kΘTk 2 ξk VT kΘT † k2ξ. (4.2) These perturbed orthogonal matrices have small norm precisely when their singular values are close to those of the original orthogonal matrices.

4.1.1. SRHTs by uniform sampling without replacement. In this section, we collect known results on how the singular values of a matrix with orthonormal rows are affected by postmultiplication by an SRHT matrix.

It has recently been shown by Tropp [43] that, if the SRHT matrix is of sufficiently large dimensions, post-multiplying a short-fat matrix with orthonormal rows with an SRHT matrix preserves the singular values of the orthonormal matrix, with high probability, up to a small multiplicative factor. The following lemma is essentially a restatement of Theorem 3.1 in [43], but we include a full proof (later in this subsection) for completeness.

Lemma 4.1 (The SRHT preserves geometry). Let V∈Rn×k have orthonormal

columns and nbe a power of 2. Let0< ε <1/3 and0< δ <1. Construct an SRHT matrixΘ∈Rr×n with rsatisfying

6ε−1h√k+p8 ln(n/δ)i

2

ln(k/δ)≤r≤n. (4.3)

Then, with probability at least1−3δ, for all i= 1, ..., k,

q 1−√ε≤σi(VTΘT)≤ q 1 +√ε and k(VT_ΘT₎†₋₍_VT_ΘT₎T_k2_≤₁_.₅₄√_ε.

Tropp [43] (see also [1]) argues that the above lemma follows from a more fundamental fact: if V has orthonormal columns, then the rows of the product HDV all have

(12)

roughly the same norm. That is, premultiplication byHD equalizes the row norms of an orthonormal matrix.

Lemma 4.2 (Row norms, Lemma 3.3 in [43]). Let V∈Rn×k have orthonormal

columns (n is a power of 2), H ∈ Rn×n be a normalized Hadamard matrix, D ∈ Rn×n be a diagonal matrix of independent random signs, and 0< δ <1 be a failure

probability. Recall that (HDV)₍_i₎ denotes the ith row of the matrix HDV ∈Rn×k.

Then, with probability at least1−δ,

maxi=1,...,nk(HDV)(i)k2≤ r k n+ r 8 ln(n/δ) n .

To prove Lemma 4.1 we need one more result on uniform random sampling (without replacement) of rows from tall-thin matrices with orthonormal columns.

Lemma 4.3 (Uniform Sampling without replacement from an Orthonormal

Ma-trix, Corollary to Lemma 3.4 of [43] _). Let W∈Rn×k have orthonormal columns.

Let 0< ε <1 and 0< δ <1. LetM :=n·maxi=1,...,nkW(i)k22. Letr be an integer such that

6ε−2Mln(k/δ)≤r≤n . (4.4)

LetR∈_Rr×n _{be a matrix which consists of a subset of}_r_{rows from}_I

n where the rows

are chosen uniformly at random and without replacement. Then, with probability at least1−2δ, for i∈[k]: r r n· √ 1−ε≤σi(RW)≤ √ 1 +ε· r r n.

Proof. Apply Lemma 3.4 from [43] with the following choice of parameters: `= αMln(k/δ), α = 6/ε2_, _and _δtropp₌_η ₌_ε_{. Here,} _`_, _α_, _M_, _k_, _η _{are the variables of}

Lemma 3.4 from [43] (we also useM andk), and δtroppplays the role of δ, an error parameter, of Lemma 3.4 from [43]. The variablesεandδare from our Lemma. The choice of`proportional to ln(k/δ) rather than proportional to ln(k), as in the original statement of Lemma 3.4, is what results in a probability proportional to δ instead ofk; this can easily be seen by tracing the modified choice of`through the proof of Lemma 3.4.

Proof. (of Lemma 4.1) To obtain the bounds on the singular values, we combine Lemmas 4.2 and 4.3. More specifically, apply Lemma 4.3 withW=HDVand use the bound forM from Lemma 4.2. Then, the bound onrin Eqn. (4.4), the bound on the singular values in Lemma 4.3, and the union bound, establish that with probability at least 1−3δ, r r n· √ 1−ε≤σi(RHDV)≤ √ 1 +ε· r r n.

Now, multiply this inequality with pn/r and recall the definitionΘ =pn

r ·RHD to obtain

√

1−ε≤σi(ΘV)≤√1 +ε.

Replacingεwith√εand using the bound onrin Eqn. (4.3) concludes the proof. The second bound in the lemma follows from the first bound after a simple al-gebraic manipulation. Let X = VT_ΘT _∈

(13)

UX∈Rk×k,ΣX∈Rk×k, andVX∈Rr×k, sincer > k. Consider taking the SVDs of (VT_ΘT₎† _{and (}_VT_ΘT₎T_,

k(VT_ΘT₎†₋₍_VT_ΘT₎T_k2₌_kV

XΣ−X1UTX−VXΣXUTXk2= kVX(Σ−X1−ΣX)UTXk2

= kΣ−_X1−ΣXk2,

since VX and UTX can be dropped without changing the spectral norm. Let Y =

Σ−_X1−ΣX∈Rk×k. Then, for alli= 1, . . . , k,Yii=

1−σ2

i(X)

σi(X) .We conclude the proof

as follows,

kYk2= max1≤i≤k|Yii|= max1≤i≤k 1−σ2_i(X) σi(X) = max1≤i≤k 1−σ_i2(X) σi(X) ≤ √ ε p 1−√ε ≤ 1.54√ε.

4.1.2. SRHTs by uniform sampling with replacement. Lemma 4.1 and Lemma 4.3 analyze uniform random sampling without replacement. Below, we present the analogs of these two lemmas for uniform random sampling with replacement. Lemma 4.4 is essentially a restatement of Algorithm 2 (with the probabilities set to 1/m) along with the third point in Remark 3.9 and Lemma 2.1 (with α = pn/r) in [29].

Lemma 4.4 (Uniform Sampling with replacement from an Orthonormal

Ma-trix [29]_). Let W∈_Rn×k _{have orthonormal columns. Let}₀_{< ε <}₁ _and₀_{< δ <}₁_.

Let M :=n·maxi=1,...,nkW(i)k22. Let rbe an integer such that

8 3ε

−2_M_ln(_k/δ₎_≤_r_≤_{n .} _(4.5)

LetRˆ ∈Rr×n be a matrix which consists of a subset ofrrows fromIn where the rows

are chosen uniformly at random and with replacement. Then, with probability of at least1−2δ, for i∈[k]: √ 1−ε≤σi r_n r ˆ RW ≤√1 +ε.

Lemma 4.5 (The SRHT preserves geometry). Let V∈Rn×k have orthonormal

columns and n be a power of 2. Let 0< ε < 1 and 0< δ <1. Construct an SRHT matrix Θ ∈ Rr×n (R is constructed as in Lemma 4.4, i.e. via uniform random

sampling with replacement) withrsatisfying

8 3ε −1h√_k₊p 8 ln(n/δ)i 2 ln(k/δ)≤r≤n. (4.6)

Then, with probability at least1−3δ,for alli= 1, ..., k,

q

1−√ε≤σiVT_ΘT_≤q_{1 +}√_ε

Proof. To obtain the bounds on the singular values, we combine Lemmas 4.2 and 4.4. More specifically, apply Lemma 4.4 withW=HDVand use the bound for

(14)

M from Lemma 4.2. Then, the bound onr in Eqn. (4.4), the bound on the singular values in Lemma 4.4, and the union bound, establish that with probability of at least 1−3δ, √ 1−ε≤σi r_n r ·RHDV ≤√1 +ε.

Replacingεwith√εand using the bound onrin Eqn. (4.4) concludes the proof.

4.2. SRHTs applied to general matrices. The structural result in Lemma 5.4, Lemma 4.1 on the perturbative effects of SRHTs on the singular values of orthonor-mal matrices, and the basic estimate in (4.2) are enough to reproduce the results on the approximation error of SRHT-based low-rank approximation in [27]. The main contribution of this paper is the realization that one can take advantage of the decay in the singular values ofAencoded inΣρ−k to obtain sharper results. In view of the fact that kΣρ−kVTρ−kΘT VT kΘT † k2 ξ ≤ kΣρ−kVTρ−kΘTk 2 ξk VT kΘT † k2 ξ, (4.7)

we should consider the behavior of the singular values of Σρ−kVTρ−kΘT instead of those ofVT

ρ−kΘT.Accordingly, in this section we extend the analysis of [43] to apply to the application of SRHTs to general matrices.

Our main tool is a generalization of Lemma 4.2 that states that the maximum column norm of a matrix to which an SRHT has been applied is, with high probability, not much larger than the root mean-squared average of the column norms of the original matrix.

4.2.1. SRHT equalizes column-norms.

Lemma 4.6 (SRHT equalization of column-norms). Suppose that A is a matrix

with n columns and n is a power of 2. Let H ∈ _Rn×n _{be a normalized Walsh–}

Hadamard matrix, and D ∈ _Rn×n _{a diagonal matrix of independent random signs.}

Then for everyt≥0,

P maxi=1,...,nk ADHT(i)_k2_≤ _√1 nkAkF+ t √ nkAk2 ≥1−n·e−t2/8.

Proof. Our proof of Lemma 4.6 is essentially that of Lemma 4.2 in [43], with attention paid to the fact thatAis no longer assumed to have orthonormal columns. In particular, the following concentration result for Lipschitz functions of Rademacher vectors is central to establishing the result. Recall that a Rademacher vector is a random vector whose entries are independent and take the values ±1 with equal probability.

Lemma 4.7 (Concentration of convex Lipschitz functions of Rademacher random

variables [Corollary 1.3 ff. in [30]]_). Suppose f is a convex function on vectors that satisfies the Lipschitz bound

|f(x)−f(y)| ≤Lkx−yk2 for allx,y. Let εbe a Rademacher vector. For allt≥0,

P[f(ε)≥E[f(ε)] +Lt]≤e−t 2_/₈

(15)

Lemma 4.6 follows immediately from the observation that the norm of any one column ofADHT _{is a convex Lipschitz function of a Rademacher vector. Consider the norm}

of thejth column ofADHT _{as a function of}_ε_,_where_D_{= diag}_ε_:

fj(ε) =kADHT_e

jk=kAdiagεhjk2=kAdiaghjεk2,

wherehj denotes thejth column ofHT.Evidentlyfj is convex. Furthermore,

|fj(x)−fj(y)| ≤ kAdiaghj(x−y)k2≤ kAk2kdiaghjk2kx−yk2= 1

√

nkAk2kx−yk2, where we used the triangle inequality and the fact thatkdiaghjk2 =khjk∞ = √1_n.

Thusfj is convex and Lipschitz with Lipschitz constant at mostkAk2/√n. We calculate E[fj(ε)]≤Efj(ε)21/2= h TrAdiaghjE[εε?] diaghjTAT i1/2 = Tr ₁ nAA T 1/2 = √1 nkAkF.

It now follows from Lemma 4.7 that, for all j = 1,2, . . . , n, the norm of thejth column ofADHT _{satisfies the tail bound}

P kADHT_e jk2≥ 1 √ nkAkF+ t √ nkAk2 ≤e−t2/8.

Taking a union bound over all columns ofADHT_, _{we conclude that}

P maxj=1,...,nk(ADHT)(j)k2≥ 1 √ nkAkF+ t √ nkAk2 ≤n·e−t2/8.

As an interesting aside, we note that just as Lemma 4.1, which states that the SRHT essentially preserves the singular value of matrices with orthonormal rows and an aspect ratio ofk/n, follows from Lemma 4.2, Lemma 4.6 implies that the SRHT essentially preserves the singular values of general rectangular matrices with the same aspect ratio. This can be shown using, e.g., the results on the effects of column sampling on the singular values of matrices from [24, Section 6].

4.2.2. SRHT preserves the spectral norm. The following lemma shows that even if the aspect ratio is larger thank/n,the SRHT does not substantially increase the spectral norm of a matrix.

Lemma 4.8 (SRHT-based subsampling in the spectral norm). Let A ∈ Rm×n

have rank ρ and n be a power of 2. For some r < n, let Θ ∈ Rr×n be an SRHT

matrix. Fix a failure probability0< δ <1. Then,

P kAΘT_k2 2≤5kAk22+ ln(ρ/δ) r kAkF+p8 ln(n/δ)kAk2 2 ≥1−2δ.

(16)

To establish Lemma 4.8, we use the following Chernoff bound for sampling ma-trices without replacement.

Lemma 4.9 (Matrix Chernoff bound, Theorem 2.2 in [43]; see also Corollary

in [44]_). Let X be a finite set of positive-semidefinite matrices with dimensionk, and suppose that

max

X∈Xλmax(X)≤B.

Sample{X1, . . . ,Xr} uniformly at random fromX without replacement. Compute µmax=r·λmax(E[X1]). Then P h λmax X jXj ≥(1 +ν)µmax i ≤k· _eν (1 +ν)1+ν µmax/B forν ≥0.

Proof. [Proof of Lemma 4.8]

Write the SVD ofA as UΣVT _where _Σ _∈

Rρ×ρ and observe that the spectral norm ofAΘT _{is the same as that of}_ΣVT_ΘT_.

We control the norm ofΣVT_ΘT _{by considering the maximum singular value of}

its Gram matrix. DefineM=ΣVT_DHT_{and let}_G_{be the Gram matrix of}_MRT_: G=MRT₍_MRT₎T_. Evidently λmax(G) = r nkΣV T_ΘT_k2 2. (4.8)

Recall thatM(j)denotes thejth column ofM.If we denote the random set ofr coordinates to whichRrestricts by T, then

G=X

j∈TM

(j) _M(j)T

.

ThusGis a sum ofrrandom matricesX1, . . . ,Xrsampled without replacement from the setX ={M(j) M(j)T

: j = 1,2, . . . , n}. There are two sources of randomness inG: Rand the Rademacher random variables on the diagonal ofD.

Set B= 1 n kΣkF+p8 ln(n/δ)kΣk2 2

and letE be the event

maxj=1,...,nkM(j)k22≤B.

WhenEholds, for allj = 1,2, . . . , n, λmax

M(j) M(j)T=kM(j)k2 2≤B,

soGis a sum of random positive-semidefinite matrices each of whose norms is bounded byB.Note that whether or notE holds is determined byD, and independent ofR.

(17)

Conditioning on E, the randomness in R allows us to use the matrix Chernoff bound of Lemma 4.9 to control the maximum eigenvalue ofG. We observe that

µmax=r·λmax(E[X1]) = r nλmax Xn j=1M (j) _M(j)T = r nkΣk 2 2.

Take the parameterν in Lemma 4.9 to be ν= 4 + B

µmax

ln(ρ/δ) to obtain the relation

P[λmax(G)≥5µmax+Bln(ρ/δ) | E]≤(ρ−k)·e[δ−(1+ν) ln(1+ν)] µmax B ≤ρ·e 1−5₄ln 5δµmax B ≤ρ·e− ₅ 4ln 5−1 ln(ρ/δ) < δ. The second inequality holds becauseν≥4 implies that (1 +ν) ln(1 +ν)≥ν· 5

4ln 5.

We have conditioned on E,the event that the squared norms of the columns of

M are all smaller than B. By Lemma 4.6, E occurs with probability at least 1−δ. Thus, substituting the values ofB andµmax, we find that

P λmax(G)≥ r n 5kΣk22+ ln(ρ/δ) r kΣkF+p8 ln(n/δ)kΣk2 2 ≤2δ. Use equation (4.8) to wrap up.

4.2.3. SRHT preserves the Frobenius norm. Similarly, the SRHT is un-likely to substantially increase the Frobenius norm of a matrix.

Lemma 4.10 (SRHT-based subsampling in the Frobenius norm). LetA∈Rm×n

(n is a power of 2) and let Θ ∈ _Rr×n _{be an SRHT matrix for some} _{r < n.} _{Fix a}

failure probability0< δ <1. Then, for anyη≥0,

P h kAΘT_k2 F≤(1 +η)kAk 2 F i ≥1− _eη (1 +η)1+η r/ 1+ √ 8 ln(n/δ)2 −δ.

Proof. Let cj = n_rk(ADHT)jk22 denote the squared norm of the jth column of

p

n/r·ADHT_{. Then, since right multiplication by} _RT _{samples columns uniformly}

at random without replacement,

kAΘT_k2 F= n rkADH T_RT_k2 F= Xr i=1Xi (4.9)

where the random variables Xi are chosen randomly without replacement from the set{cj}n

j=1.There are two independent sources of randomness in this sum: the choice

of summands, which is determined by R, and the magnitudes of the {cj}, which is determined byD.

To bound this sum, we first condition onD being such that each cj is bounded by a quantityB.Call this eventE, then

P hXr i=1Xi≥(1 +η) Xr i=1E[Xi] i ≤P hXr i=1Xi≤(1 +η) Xr i=1E[Xi] | E i +P[Ec].

(18)

To selectB,we observe that Lemma 4.6 implies that with probability 1−δ,the entries ofDare such that

maxjcj≤ n r · 1 n(kAkF+ p 8 ln(n/δ)kAk2)2≤1 r(1 + p 8 ln(n/δ))2kAk2 F. Accordingly, we take B=1 r(1 + p 8 ln(n/δ))2kAk2 F,

thereby arriving at the bound P hXr i=1Xi≥(1 +η) Xr i=1E[Xi] i ≤P hXr i=1Xi≤(1 +η) Xr i=1E[Xi] | E i +δ. (4.10) After conditioning on D, we observe that the randomness remaining on the righthandside of Eqn. (4.10) is the choice of the summandsXi, which is determined byR.We address this randomness by applying a scalar Chernoff bound (Lemma 4.9 with k = 1). To do so, we need µmax, the expected value of the sum; this is an

elementary calculation: E[X1] =n−1 Xn j=1cj = 1 rkAk 2 F, soµmax=rE[X1] =kAk2F.

Applying Lemma 4.9 conditioned onE,we conclude that

P h kAΘT_k2 F≥(1 +η)kAk 2 F | E i ≤ _eη (1 +η)1+η r/(1+ √ 8 ln(n/δ))2 +δ forη≥0.

4.2.4. SRHT preserves matrix multiplication. Finally, we prove a novel result on approximate matrix multiplication involving SRHT matrices.

Lemma 4.11 (SRHT for approximate matrix multiplication). Let A ∈ Rm×n,

B∈Rn×p, andnbe a power of 2. For somer < n, letΘ∈Rr×n be an SRHT matrix.

Fix a failure probability0< δ <1.Assume R satisfies0≤R≤

√ r

1+√8 ln(n/δ).Then,

P "

kAΘT_ΘB₋_ABkF_≤_{2(R + 1)}kAkFkBkF+

p 8 ln(n/δ)kAkFkBk2 √ r # ≥1−e−R2/4−2δ.

Remark. Recall that the stable rank sr (A) = kAk2

F/kAk22 reflects the decay

of the spectrum of the matrix A. Lemma 4.11 can be rewritten as a bound on the relative error of the approximationAΘT_ΘB_{to the product}_AB_:

kAΘT_ΘB₋_ABkF kABkF ≤kAkFkBkF kABkF ·R√+ 2 r · 1 + p 8 ln(n/δ) sr (B) ! .

In this form, we see that the relative error is controlled by the deterministic condition number for the matrix multiplication problem as well as the stable rank of B and the number of column samples r. Since the roles ofA and B in this bound can be interchanged, in fact we have the bound

kAΘT_ΘB₋_ABkF kABkF ≤ kAkFkBkF kABkF · R+ 2 √ r · 1 + p 8 ln(n/δ) max(sr (B),sr (A)) ! ,

(19)

Proof of Lemma 4.11. To prove the Lemma, we first develop a generic result for approximate matrix multiplication via uniform sampling (without replacement) of the columns and the rows of the two matrices involved in the product (see Lemma 4.13 below). Lemma 4.11 is a simple instance of this generic result. We mention that Lemma 3.2.8 in [14] gives a similar result for approximate matrix multiplication, which, however gives a bound for the expected value of the error term, while our Lemma 4.11 gives a comparable bound which holds with high probability. To prove Lemma 4.13, we use the following vector Bernstein inequality for sampling without replacement in Banach spaces; this result follows directly from a similar inequality for sampling with replacement established by Gross in [25].

Lemma 4.12. Let V be a collection of n vectors in a normed space with norm

|·|.ChooseV1, . . . ,VrfromV uniformly at random withoutreplacement. Also choose V₁0, . . . ,V_r0 fromV uniformly at random withreplacement. Let

µ=Eh Xr i=1(V 0 i −E[Vi0]) i and set σ2≥4rEh|V₁0|2i and B ≥2 max V∈V |V|. If 0≤t≤σ2/B,then P h Xr i=1Vi−rE[V1] ≥µ+t i ≤exp − t 2 4σ2 .

Proof. We proceed by developing a bound on the moment generating function (mgf) of Xr i=1Vi−rE[V1] −µ.

This mgf is controlled by the mgf of a similar sum where the vectors are sampled with replacement. That is, forλ≥0,

Ehexpλ· Xr i=1Vi−rE[V1] −λµ i ≤Ehexpλ· Xr i=1V 0 i −rE[V1] −λµ i . (4.11) This follows from a classical observation due to Hoeffding [28] (see also [26] for a more modern exposition) that for any convex_R-valued functiong,

EhgXr i=1Vi i ≤EhgXr i=1V 0 i i .

Specifically, takeg(V) = exp (λ|V −rE[V1]| −λµ) to obtain the asserted inequality

of mgfs.

In the proof of Theorem 12 in [25], Gross establishes that any random variableZ whose mgf is less than the righthand side of Eqn. (4.11) satisfies a tail inequality of the form P[Z≥µ+t]≤exp − t 2 4s2 (4.12)

(20)

whent≤s2_/M,_where s2≥ r X i=1 Eh|Vi0−E[V10]| 2i

and M almost surely bounds |V_i0−E[V₁0]| for all i= 1, . . . , r. To apply this result, note that for alli= 1, . . . , r,

|V_i0−E[V₁0]| ≤2 max

V∈V |V|=B.

Also takeV₁00 to be an i.i.d. copy ofV₁0 and observe that, by Jensen’s inequality, r X i=1 Eh|V_i0−E[V₁0]|2i=rEh|V₁0−E[V₁0]|2i ≤rEh|V₁0−V₁00|2i≤rE (|V₁0|+|V₁00|)2 ≤2rEh|V₁0|2+ |V₁00|2i = 4rEh|V₁0|2i≤σ2.

The bound given in the statement of Lemma 4.12 follows from taking s2 = σ2 andM =B in Eqn. (4.12).

This vector Bernstein inequality gives us a tail bound on the Frobenius error of a simple approximate matrix multiplication scheme based upon column and row sampling.

Lemma 4.13 (Matrix Multiplication). Let X ∈ Rm×n and Y ∈ Rn×`. Fix

r≤n. Select uniformly at random and without replacementrcolumns fromXand the corresponding rows from Y and multiply the selected columns and rows withpn/r. Let Xˆ ∈ Rm×r and Yˆ ∈ Rr×` contain the selected columns and rows, respectively.

Choose σ2≥4n r n X i=1 kX(i)k2 2kY(i)k22 and B≥ 2n r maxi kX (i)_k2k_Y (i)k2. Then if0≤t≤σ2_/B, P h kXˆYˆ −XYkF≥t+σ i ≤exp − t 2 4σ2 .

Proof. Let V be the collection of vectorized rank-one products of columns of

p

n/r·Xand rows ofpn/r·Y.That is, take

V= n rvec(X (i)_Y (i)) n i=1 .

Sample V1, . . . ,Vr uniformly at random from V without replacement, and observe thatE[Vi] = 1_rvec(XY).With this notation, the quantitieskXˆYˆ −XYkF and

Xr i=1(Vi−E[Vi]) ₂,

(21)

have the same distribution, therefore any probabilistic bound developed for the latter holds for the former. The conclusion of the lemma follows from applying Lemma 4.12 to bound the second quantity.

We calculate the variance-like term in Lemma 4.12, 4rE

kV1k2 2 : 4rE kV1k2 2 = 4r1 n n X i=1 n2 r2kX (i)_k2 2kY(i)k22= 4 n r n X i=1 kX(i)k2 2kY(i)k22≤σ 2_.

Now we consider the expectation µ=Eh Xr i=1(V 0 i −E[V 0 i]) ₂ i .

In doing so, we will use the notationEA,B,...[C] to denote the conditional expectation of a random variable C with respect to the random variables A, B, . . . .Recall that a Rademacher vector is a random vector whose entries are independent and take the values ±1 with equal probability. Let ε be a Rademacher vector of length r and sample V₁0, . . . ,V_r0 and V₁00, . . . ,V_r00 uniformly at random fromV with replacement. Nowµcan be bounded as follows:

µ=Eh Xr i=1(V 0 i −E[Vi0]) ₂ i ≤E{V0 i},{Vi00} h Xr i=1(V 0 i −V 00 i ) 2 i =E{V0 i},{V 00 i },ε h Xr i=1εi(V 0 i −Vi00) ₂ i ≤2E{V0 i},ε h Xr i=1εiV 0 i ₂ i ≤2 s E{V0 i},ε Xr i=1εiV 0 i 2 2 = 2 r E{V0 i} h Eε hXr i,j=1εiεjV 0 iTVj0 ii = 2 r EhXr i=1kV 0 ik22 i .

The first inequality is Jensen’s, and the following equality holds because the com-ponents of the sequence{V_i0−V_i00} are symmetric and independent. The next two manipulations are the triangle inequality and Jensen’s inequality. This stage of the estimate is concluded by conditioning and using the orthogonality of the Rademacher variables. Next, the triangle inequality and the fact thatE

kV₁0k2 2 =E kV1k22 allow us to further simplify the estimate ofµ:

µ≤2 r EhXr i=1kVik 2 2 i = 2 q rE[kV1k2 2]≤σ.

We also calculate the quantity 2 max V∈VkVk2= 2n r maxi kX (i)_k2k_Y (i)k2≤B.

The stipulated tail bound follows from applying Lemma 4.12 with our estimates forB,σ2_,_and_µ.

(22)

Lemma 4.11 now follows from this result on matrix multiplication.

Proof. (of Lemma 4.11) Let X =ADHT _and _Y ₌_HDB _{and form ˆ}_X _{and ˆ}_Y

according to Lemma 4.13. Then,XY=ABand

kAΘT_ΘB₋_ABkF₌_k_Xˆ_Yˆ ₋_XYkF_.

To apply Lemma 4.13, we first condition on the event that the SRHT equalizes the column norms of our matrices. Namely, we observe that, from Lemma 4.6, with probability at least 1−2δ, maxikX(i)k2≤ 1 √ n(kAkF+ p 8 ln(n/δ)kAk2), and (4.13) maxikY(i)k2≤ 1 √ n(kBkF+ p 8 ln(n/δ)kBk2).

Conditioning on these nice interactions, we choose the parameters σ and B in Lemma 4.13. We first take

σ2=4

r(kBkF+ p

8 ln(n/δ)kBk2)2kAk2

F. (4.14)

Observe that because of (4.13), σ2= 4n r · (kYkF+p8 ln(n/δ)kYk2)2 n kXk 2 F≥4 n r Xn i=1kX (i) k22kY(i)k22

so this choice ofσsatisfies the inequality stipulated in Lemma 4.13. Next we choose B =2

r(kAkF+ p

8 ln(n/δ)kAk2)(kBkF+p8 ln(n/δ)kBk2). Again, because of (4.13),B satisfies the stipulationB≥ 2n

r maxikX

(i)

k2kY(i)k2.

For simplicity, letγ= 8 ln(n/δ).With these choices forσ2 andB, σ2 B = 2kAk2 F(kBkF+ √ γkBk2)2 (kAkF+√γkAk2)(kBkF+√γkBk2) ≥ 2kAk 2 F(kBkF+ √ γkBk2)2 (kAkF+√γkAkF)(kBkF+√γkBk2) =2kAkF(kBkF+ √ γkBk2) 1 +√γ .

Now, referring to Eqn. (4.14), identify the numerator as√rσto see that σ2

B ≥

√

rσ 1 +p8 ln(n/δ).

Apply Lemma 4.13 to see that, when Eqns. (4.13) hold and 0≤Rσ≤σ2_/B,

P h

kAΘT_ΘB₋_ABkF_≥₍_R_{+ 1)}_σi_≤_exp

−R 2 4 .

From our lower bound onσ2_/B,_{we know that the condition}_Rσ_≤_σ2_/B _{is satisfied}

when

(23)

Also, we established above that Eqns. (4.13) hold with probability at least 1−2δ. From these two facts, it follows that when 0≤R≤√r/(1 +p8 ln(n/δ)),

P h

kAΘT_ΘB₋_ABkF_≥₍_R_{+ 1)}_σi_≤_exp

−R 2 4 + 2δ.

The tail bound given in the statement of Lemma 4.11 follows from substituting our estimate ofσ.

5. Proofs of our main Theorems.

5.1. Preliminaries. To prove Theorem 2.1 we first need some background on restricted (within a subspace) low-rank matrix approximations. Let A∈Rm×n, let k < nbe an integer, and let Y∈Rm×r withr > k (the casem=r corresponds to the standard unrestricted low rank approximation problem which can be addressed via the SVD). We call Πξ_Y_,k(A)∈Rm×n the best rank kapproximation to Ain the column space ofY, with respect to theξ norm (ξ= 2 orξ= F). Formally, for fixed ξ, we can write Πξ_Y_,k(A) =YXξ, where

Xξ= argmin

X∈Rr×n:rank(X)≤k

kA−YXk2

ξ.

In order to compute (or approximate) Πξ_Y_,k(A) we will use the following 3-step pro-cedure:

1: Let ` = min{m, r}. Use an SVD to construct a matrix Q ∈Rm×` that satisfies

QT_Q₌_I

`and spans the range of Y.This construction takes O(mr`) time. 2: Compute Xopt= argmin_X∈_R`×n_,_rank₍_X₎_≤_kkQTA−XkF in O(mn`+n`

2_{) time.}

In fact, since`≤m, we see thatXopt can be computed in O(mn`) time. 3: ReturnQXopt∈Rm×n in O(mn`) time.

QXoptis a matrix of rank at mostkthat lies within the column span ofY. Note that though Πξ_Y_,k(A) can depend on ξ, the algorithm above computes the same matrix, independent ofξ. The following result, which appeared as Lemma 18 in [5], proves that this algorithm computes ΠF

Y,k(A) and a constant factor approximation to Π

2

Y,k(A).

Lemma 5.1. [Lemma 18 in [5]] Given A∈ Rm×n, Y ∈Rm×r, and an integer

k≤r, the matrix QXopt∈Rm×n described above satisfies:

kA−QXoptk2_F=kA−ΠFY,k(A)k 2 F, kA−QXoptk 2 2≤2kA−Π 2 Y,k(A)k 2 2.

The discussion above the Lemma shows thatQXoptcan be computed in O(mn`+mr`) time.

5.1.1. Matrix Pythagoras and generalized least-squares regression. Lemma 5.2 is the analog of the Pythagoras theorem in the matrix setting. A proof of this lemma can be found in [5]. Lemma 5.3 is an immediate corollary of Matrix-Pythogoras.

Lemma 5.2. If X,Y∈Rm×n andXYT =0m×morXTY=0n×n, then for both

ξ= 2,F : kX+Yk2 ξ ≤ kXk 2 ξ+kYk 2 ξ.

(24)

Lemma 5.3. Given A∈Rm×n, C∈Rm×r, and for all X∈Rr×n and for both ξ= 2,F :

kA−CC+Ak2

ξ ≤ kA−CXk2ξ.

Proof. Write A−CX = (I−CC+)A+C(C+A−X). Observe that ((I− CC+)A)T_C₍_C+_A_{) =} ₀ n×n. By Lemma 5.2, kA−CXk2ξ ≥ k(I−CC +₎_Ak2 ξ + kC1(C+A−X)k2 ξ ≥ k(I−CC +₎_Ak2 ξ.

5.1.2. Low-rank matrix approximation based on projections. The low-rank matrix approximation algorithm investigated in this paper is an instance of a wider class of low-rank approximation schemes wherein a matrix is projected onto a subspace spanned by some linear combination of its columns. The problem of providing a general framework for studying the error of such projection schemes is well studied [9, 27, 5]. The following result appeared as Lemma 7 in [9] (see also Theorem 9.1 in [27]).

Lemma 5.4. [Lemma 7 in [9]] Let A ∈ Rm×n have rank ρ. Fix k satisfying

0≤k≤ρ. Given a matrixΩ∈_Rn×r_{, with} _r_≥_k_{, construct} _Y₌_AΩ_. _If _VT

kΩ has

full row-rank, then, forξ= 2,F, kA−YY†Ak2 ξ≤ kA−Π ξ Y,k(A)k 2 ξ ≤ kA−Akk2ξ+kΣρ−kVTρ−kΩ VT kΩ † k2 ξ. (5.1)

This lemma provides an upper bound for the residual error of the low-rank matrix approximation obtained via projections. We now prove a new result for the forward error.

Lemma 5.5. Let A∈Rm×n have rank ρ.Fix k satisfying 0 ≤k≤ρ. Given a

matrix Ω∈_Rn×r_{, with} _r_≥_k_{, construct} _Y ₌_AΩ_. _If _VT

kΩ has full row-rank, then,

forξ= 2,F, kAk−YY†Ak2ξ ≤ kA−Akk2ξ +kΣρ−kVTρ−kΩ VT kΩ † k2 ξ. (5.2)

Proof. For bothξ= 2,F,

kAk−YY†Ak2ξ =kAk−YY†Ak−YY†Aρ−kk2ξ ≤ kAk−YY†Akkξ2+kAρ−kk2ξ ≤ kAk−Y(VTkΩ) † VT kk2ξ+kAρ−kk2ξ =kAk−UkΣkVTkΩ(VTkΩ) † VT k +Aρ−kΩ(VTkΩ) † VT kk2ξ+kAρ−kk2ξ =kAρ−kΩ(VTkΩ) † VT kk2ξ+kAρ−kk2ξ ≤ kUρ−kk22kΣρ−kVρ−kΩ(VTkS) † k2 ξkVTkk22+kAρ−kk2ξ =kΣρ−kVρ−kΩ(VTkΩ) † k2ξ+kAρ−kk2ξ

In the above, in the first inequality we used Lemma 5.2 ((Ak−YY†Ak)(−YY†Aρ−k)T=

0m×mbecauseAkAρT−k =0m×m). In the second inequality we used Lemma 5.3 (with

X= (VT

kΩ) †

VT

(25)

since, by assumption, rank(VT

kΩ) = k. In the last inequality we used the sub-multiplicativity property of the spectral and Frobenius norms, i.e for any three ma-tricesX,Y,Z: kXYZk2 ξ ≤ kXk 2 2kYZk 2 ξ ≤ kXk 2 2kYk 2 ξkZk 2 2.

5.1.3. Least squares regression based on projections. Similarly, one of the two SRHT least squares regression algorithms analyzed in this article is an instance of a wider class of approximation algorithms where the dimensions of the input matrix and the vector of the regression problem are reduced via pre-multiplication with a random matrix. Lemma 9 in [7] provides a general framework for the analysis of such projection algorithms.

Lemma 5.6. Let A ∈ Rm×n (m ≥ n) of rank ρ and b ∈ Rm be inputs to

the least squares problem minx∈RnkAx−bk2. Let U ∈ R

m×ρ _{contain the top} _ρ

left singular vectors of A and let Ω ∈ Rm×r (ρ ≤ r ≤ m) be a matrix such that rank(ΩT_U_{) = rank(}_U₎_{. Then,}

kAx˜opt−bk22≤ kAxopt−bk22+k

ΩT_U†_ΩT₍_Ax

opt−b)k22. In the above,xopt=A†bandx˜opt=

ΩT_A†_ΩT_b.

The following lemma is a restatement of Lemma 2, along with Eqn. (9) and Eqn. (11) in [19]. It gives a bound on the forward error of the approximation of a least-squares problem that is obtained via projections. In [19] the parameters αand β are fixed toα= 1/√2 and β =ε/2, for some parameter 0 < ε <1. Showing the result for generalα >0 andβ >is straightforward, hence a detailed proof is omitted.

Lemma 5.7 (Lemma 2 in [19]). Let A ∈ Rm×n (m ≥ n) of rank ρ = n and

b∈_Rm _{be inputs to the least squares problem} _min

x∈RnkAx−bk2. Let U ∈R

m×ρ

contain the topρleft singular vectors of Aand letΩ∈Rm×r(ρ≤r≤m). For some α >0, andβ >0, assume that

σmin

UT_Ω_≥_α12 _(5.3)

and

kUT_ΩΩT₍_Ax

opt−b)k22≤βkAxopt−bk22. (5.4) Furthermore, assume that there exists a γ ∈ (0,1] such that kUAUTAbk2 ≥γkbk2. Then, kxopt−x˜optk2≤ _α β 12 ·κ(A)pγ−2₋₁_kx optk2. (5.5)

In the above,xopt=A†bandx˜opt=

ΩT_A†_ΩT_b. 5.2. Proof of Theorem 2.1.

(26)

5.2.1. Frobenius norm bounds. We first prove the Frobenius norm bounds in the theorem (i.e Eqns. (i), (ii), (iii), and (v)). We would like to apply Lemma 5.4 withΩ=ΘT_∈

Rn×r andξ= F. Notice that because of our assumption that

r≥6C2ε−1√

k+p8 ln(n/δ)2

ln(k/δ),

where C>1, Lemma 4.1 implies that with probability at least 1−3δ, rank(VT

kΘT) =k;

so, forξ= F, Lemma 5.4 applies with the same probability, yielding

kA−YY†Ak2 F≤ kA−Π F Y,k(A)k 2 F≤ kA−Akk2F+kΣρ−kVTρ−kΘT(VTkΘT) † k2 F. (5.6)

We continue by bounding the second term in the right hand side of the above inequal-ity, S:=kΣρ−kVTρ−kΘT(VTkΘT) † k2F ≤ 2kΣρ−kVTρ−kΘTΘVkk2F+ 2kΣρ−kVTρ−kΘT((VTkΘT) † −(VT kΘT)T)k 2 F ≤ 2kΣρ−kVTρ−kΘTΘVkk2F+ 2kΣρ−kVTρ−kΘTk 2 Fk(VTkΘT) † −(VT kΘT)Tk 2 2 ≤ 8ε· kΣρ−kVTρ−kk 2 F+ 2· ₁₁ 4 kΣρ−kV T ρ−kk 2 F ·(2.38ε) ≤ 22ε· kΣρ−kk2F.

In the above, in the first inequality we used the fact that for any two matricesX,Y:

kX+Yk2

F ≤2kXk 2

F+ 2kYk 2

F. To justify the first estimate in the third inequality,

first notice thatVT

ρ−kVk =0n×k.Next use Lemma 4.11 with R = C p

ln(k/δ).From the lower bound onr,we have that

√ r 1 +p8 ln(n/δ)≥ √ 6ε−1_· √ k+p8 ln(n/δ) 1 +p8 ln(n/δ) ·C p ln(k/δ)>R>0,

so this choice of R satisfies the requirements of Lemma 4.11. Apply Lemma 4.11 to obtain P " kΣρ−kVTρ−kΘTΘVkk2F≤4(R + 1) 2( √ k+p8 ln(n/δ))2 r kΣρ−kV T ρ−kk 2 F # ≥1−e−R2/4−2δ.

Use the lower bound onrto justify the estimate

4(R + 1)2 √ k+p8 ln(n/δ)2 r ≤4(R + 1) 2 √ k+p8 ln(n/δ)2 6C2_ε−1√ k+p8 ln(n/δ)2ln(k/δ) = 2ε 3 · (Cpln(k/δ) + 1)2 C2_ln(_k/δ₎ ≤ 2ε 3 1 + 1 Cpln(k/δ) !2 .

(27)

This estimate implies that P  kΣρ−kVTρ−kΘTΘVkk2F≤ 2ε 3 1 + 1 Cpln(k/δ) !2 kΣρ−kVTρ−kk 2 F  ≥1−δ C2ln(k/δ)/4₋₂_δ.

Since C>1 andk≥2, a simple numerical estimation allows us to state that, more simply, P h kΣρ−kVTρ−kΘTΘVkk2F≤4εkΣρ−kVTρ−kk 2 F i ≥1−δC2ln(k/δ)/4−2δ. The remaining estimates in the third inequality follow from applying Lemma 4.1 (keeping in mind our lower bound onr) to obtain

P h k(VT kΘT) † −(VT kΘT)Tk 2 2≤2.38ε i ≥1−3δ. and Lemma 4.10 withη= 7/4 to obtain

P kΣρ−kVTρ−kΘTk2F≤ 11 4 kΣρ−kV T ρ−kk2F ≥1− e7/4 (1 + 7/4)1+7/4 r/(1+ √ 8 ln(n/δ))2 −δ. We have the estimate

e7/4 (1 + 7/4)1+7/4 < 1 e, so in fact P kΣρ−kVTρ−kΘTk 2 F≤ 11 4 kΣρ−kV T ρ−kk 2 F ≥1−e−r/(1+ √ 8 ln(n/δ))2₋_δ ≥1−e−6C2ε−1ln(k/δ)−δ ≥1−e−ln(k/δ)−δ ≥1−2δ. Combining (5.6) with the bound onS, we obtain

kA−YY†Ak2 F≤ kA−Π F Y,k(A)k 2 F≤(1 + 22ε)· kA−Akk2F.

Taking the square-roots of both sides and using the fact that √1 + 22ε ≤ 1 + 22ε gives the bound

kA−YY†AkF≤ kA−ΠF_Y_,k(A)kF≤√1 + 22ε· kA−AkkF≤(1 + 22ε)kA−AkkF. Eqn. (i) in the theorem follows directly from this:

kA−YY†AkF≤(1 + 22ε)kA−AkkF.

To derive Eqn. (ii), recall the equalitykA−A˜kkF=kA−ΠFY,k(A)kF,established in Lemma 5.1. From this it follows that

(28)

also.

We now prove Eqn. (iii) in the theorem. Eqn. (5.2) withξ= F and Ω=ΘT _∈

Rn×rgives kAk−YY†Ak2F≤ kA−AkkF+kΣρ−kVTρ−kΘT VT kΘT † k2 F.

Now recall the bound forS:

kΣρ−kVTρ−kΘT VT kΘT † k2F≤22εkA−Akk2F. So, kAk−YY†Ak2F≤ kA−Akk2F+ 22ε· kA−Akk2F= (1 + 22ε)· kA−Akk2F.

Taking the square-roots of both sides and using the fact that √1 + 22ε ≤ 1 + 22ε gives Eqn. (iii).

Finally, we prove Eqn. (iv):

kAk−A˜kkF=kA−Ak−(A−A˜k)kF≤ kA−AkkF+kA−A˜kkF≤(2+22ε)kA−AkkF,

where the first inequality follows by the triangle inequality and the second by using the bound obtained in Eqn. (ii) in the theorem.

The failure probability in the theorem follows from a union bound on all the probabilistic events involved in boundingS.

5.2.2. Spectral norm bounds. We now prove the spectral norm bounds in Theorem 2.1 (i.e Eqns. (v), (vi), (vii), and (viii)). Lemma 4.1 implies that, with this choice ofr, k(VT kΘT) † k2 2≤(1− √ ε)−1, with probability at least 1−3δ.Consequently,VT

kΘThas full row-rank and Lemma 5.4 withΩ=ΘT_∈

Rn×r andξ= 2 applies with the same probability, yielding

kA−YY†Ak2 2≤ kA−Akk22+ (1− √ ε)−1kΣρ−kVTρ−kΘTk 2 2. (5.7)

Also, the spectral norm bound in Lemma 5.1 implies

kA−A˜kk22≤2kA−Π 2 Y,k(A)k 2 2≤2 kA−Akk22+ (1− √ ε)−1kΣρ−kVTρ−kΘTk 2 2 . (5.8) We now provide an upper bound for√Z whereZ is the scalar

Z :=kA−Akk22+ (1− √

ε)−1kΣρ−kVTρ−kΘTk

2 2.

From Lemma 4.8 we obtain Z≤ 1 + 5 1−√ε · kA−Akk22+ ln(ρ/δ) (1−√ε)r kA−AkkF+ p 8 ln(n/δ)kA−Akk2 2

with probability at least 1−5δ.Using that ε <1/3, we see that (1−√ε)−1<3, so

Z ≤16· kA−Akk22+ 3 ln(ρ/δ) r kA−AkkF+ p 8 ln(n/δ)kA−Akk2 2 .

(29)

Use the subadditivity of the square-root function and rearrange the spectral and Frobenius norm terms to obtain that

√ Z ≤ 4 + r 3 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 3 ln(ρ/δ) r · kA−AkkF. Apply Eqn. (5.7) to arrive at Eqn. (v) in the theorem,

kA−YY†Ak2≤ 4 + r 3 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 3 ln(ρ/δ) r · kA−AkkF. Take the square root of both sides of Eqn. (5.8), use the subadditivity of the square root function, and use the bound for√Z to find Eqn. (vi):

kA−A˜kk2≤ 6 + r 6 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 6 ln(ρ/δ) r · kA−AkkF. We now derive the spectral norm bounds on the forward errors. Eqn. (5.2) with

Ω=ΘT_∈ Rn×r gives kAk−YY†Ak22≤ kA−Akk2+kΣρ−kVTρ−kΘT VT kΘT † k2 2.

Use the inequalityk(VT

kΘT) † k2 2≤(1− √ ε)−1 _{to obtain} kAk−YY†Ak22≤ kA−Akk22+ (1− √ ε)−1kΣρ−kVTρ−kΘTk 2 2.

Take the square-root of both sides of this inequality to obtain

kAk−YY†Ak2≤ q kA−Akk22+ (1− √ ε)−1_kΣ ρ−kVTρ−kΘTk22

and identify the righthand side as√Z.Use the bound on√Z to arrive at Eqn. (vii):

kAk−YY†Ak2≤ 4 + r 3 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 3 ln(ρ/δ) r · kA−AkkF. We now prove Eqn. (viii). First, recall that ˜Ak =QXopt and observe that

kAk−A˜kk2=kAk+Aρ−k−A˜k−Aρ−kk2≤ kA−A˜kk2+kAρ−kk2.

Now, recall Eqn. (vi),

kA−A˜kk2≤ 6 + r 6 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 6 ln(ρ/δ) r · kA−AkkF. In conjunction with the previous inequality, this gives us the desired bound:

kA−A˜kk2≤ 7 + r 12 ln(n/δ) ln(ρ/δ) r ! · kA−Akk2+ r 6 ln(ρ/δ) r · kA−AkkF. Finally, we recall that the two probabilistic events we used in our derivations — that (VT

kΘT) †

is bounded and that our application of Lemma 4.8 succeeds— hold with probabilities at least 1−3δand 1−2δrespectively, so the failure probability for each of these four spectral error bounds is no more than 5δ.

(30)

5.2.3. Running time Analysis. The matrix Ycan be constructed in at most 2mnlog₂(r+ 1) arithmetic operations (see Lemma 1.3).

Given Y∈_Rm×r_,_{the matrix} _Y₍_Y†_A₎_∈

Rm×n can be constructed in O(mr`+ mn`) arithmetic operations as follows. Observe thatY(Y†A) =Q(QT_A₎_, _and_Q_∈

Rm×`can be computed in O(mr`) time. Recall that`= min{m, r}.ComputingQTA requires O(mn`) operations, as does the subsequent computation ofQ(QT_A₎_._Thus,

in total, O(mr`+mn`) operations are required. If `=r, this is O(mr2+mnr), but ifr > m, the total operation count becomes O(m2(r+n)).

Finally, givenY, the matrix ˜Ak can be constructed in O(mn`+`2n) arithmetic operations as follows. As argued above,Qcan be constructed in O(mr`) operations, then the productQT_A_∈

R`×n can be computed in O(mn`) operations. The SVD of

QT_A _{requires O(}_`n_min_{_{`, n}_}_{) operations, which is proportional to O(}_`2_n_{) because}

min{`, n}= min{m, r, n}= min{m, r}=`,sincer < n. The final matrix multiplica-tionQXoptagain requires O(mn`) arithmetic operations. The total operation count is therefore O(mn`+`2_n_{). If} _{m > r}_{, the oper}