Sublinear Time Numerical Linear Algebra for Structured Matrices

(1)

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Sublinear Time Numerical Linear Algebra for Structured Matrices

Xiaofei Shi,

1

David P. Woodruff

1 1_{Carnegie Mellon University}

[email protected], [email protected]

Abstract

We show how to solve a number of problems in numerical lin-ear algebra, such as least squares regression,`p-regression for

anyp ≥1, low rank approximation, and kernel regression, in timeT(A)poly(log(nd)), where for a given input matrix

A∈Rn×d,T(A)is the time needed to computeA·yfor an arbitrary vectory ∈Rd. SinceT(A)≤O(nnz(A)), where

nnz(A)denotes the number of non-zero entries ofA, the time is no worse, up to polylogarithmic factors, as all of the recent advances for such problems that run in input-sparsity time. However, for many applications,T(A)can be much smaller than nnz(A), yielding significantly sublinear time algorithms. For example, in the overconstrained (1 + )-approximate polynomial interpolation problem,Ais a Vandermonde ma-trix andT(A) =O(nlogn); in this case our running time isn·poly(logn) +poly(d/)and we recover the results of Avron, Sindhwani, and Woodruff (2013) as a special case. For overconstrained autoregression, which is a common problem arising in dynamical systems,T(A) = O(nlogn), and we immediately obtainn·poly(logn) +poly(d/)time. For ker-nel autoregression, we significantly improve the running time of prior algorithms for general kernels. For the important case of autoregression with the polynomial kernel and arbitrary target vectorb∈Rn, we obtain even faster algorithms. Our

algorithms show that, perhaps surprisingly, most of these op-timization problems do not require much more time than that of a polylogarithmic number of matrix-vector multiplications.

Introduction

A number of recent advances in randomized numerical lin-ear algebra have been made possible by the technique of oblivious sketching. In this setting, given an n×d input matrixAto some problem, one first computes a sketchSA

whereS is a random matrix drawn from a certain random family of matrices. TypicallyS is wide and fat, and there-fore applyingSsignificantly reduces the number of rows of

A. Moreover,SApreserves structural information aboutA. For example, in the least squares regression problem one is given ann×dmatrixAand ann×1vectorband one would like to output a vectorx∈Rdfor which

kAx−bk2≤(1 +) min

x kAx−bk2, (1)

where for a vectory,kyk2 = Pi|yi|2 1/2

. Typically the

n rows of A correspond to observations, and one would like the prediction hAi, xi to be close to the observation

bi, where Ai denotes the i-th row of A. While it can be solved exactly via the normal equations, one can solve it much faster using oblivious sketching. Indeed, for overcon-strained least squares wheren d, one can chooseS to be a subspace embedding, meaning that simultaneously for all vectorsx∈Rd,kSAxk2 = (1±)kAxk2. In such ap-plications, S has only poly(d/)rows, independent of the large dimensionn. By computingSAandSb, and solving

x0=argminxkSAx−Sbk2, one has thatx0satisfies (1) with high probability. Thus, much of the expensive computation is reduced to the “sketch space”, which is independent ofn. Another example is low rank approximation, in which one is given ann×dmatrixAand would like to find ann×k

matrixU and ak×dmatrixV so that

kU V −Ak2F ≤(1 +)kA−Akk2F, (2)

where for a matrixB ∈Rn×d,kBk2F = Pn

i=1 Pd

j=1B 2 i,j, and whereAk =argminrank-k BkA−Bk

2

F is the best rank-k approximation toA. While it can be solved via the singu-lar value decomposition (SVD), one can solve it much faster using oblivious sketching. In this case one choosesSso that the row span ofSAcontains a good rank-kspace, meaning that there is a matrixV ∈ Rk×d whose row span is inside

of the row span of SA, so that there is aU for which this pair(U, V)satisfies the guarantee of (2). HereSAonly has poly(k/)rows, independent ofnandd. Several known al-gorithms approximately project the rows ofAonto the row span ofSA, then compute the SVD of the projected points to findV, and then solve a regression problem to findU. Other algorithms compute the topkdirections ofSAdirectly. Im-portantly, the expensive computation involving the SVD can be carried out in the much lower poly(k/)-dimensional space rather than the originald-dimensional space.

(2)

comput-ingSAis slow. Another matrix which works is a so-called fast Johnson-Lindenstrauss transform, see Sarl´os (2006). As in the Gaussian case, S has a very small number of rows in the above applications, and computingSA can be done inO˜(nd)time, whereO˜(f)denotes a function of the form

f ·poly(logf). This is useful if A is dense, but often A

is sparse and may have a number nnz(A)of non-zero en-tries which is significantly smaller thannd. Here one could hope to computeSAin nnz(A)time, which is indeed possi-ble using a CountSketch matrix, see Clarkson and Woodruff (2013); Meng and Mahoney (2013); Nelson and Nguyen (2013), also with a small number of rows.

For most problems in numerical linear algebra, one needs to at least read all the non-zero entries of A, as otherwise one could miss reading a potentially very large entry. For example, in the low rank approximation problem, if there is one entry which is infinite and all other entries are small, the best rank-1approximation would first fit the single infinite-valued entry. From this perspective, the above nnz(A)-time algorithms are optimal. However, there are many applica-tions for whichAhas additional structure. For example, the polynomial interpolation problem is a special case of re-gression in which the matrixA is a Vandermonde matrix. As observed in Avron, Sindhwani, and Woodruff (2013), in this case ifS ∈ Rpoly(d/)×n is a CountSketch matrix,

then one can computeSAinO(nlogn) +poly(d/)time. This is sublinear in the number of non-zero entries of A, which may be as large asndthus may be much larger than

O(nlogn) +poly(d/). The idea of Avron, Sindhwani, and Woodruff (2013) was to simultaneously exploit the sparsity ofS, together with the fast multiplication algorithm based on the Fast Fourier Transform associated with Vandermonde matrices to reduce the computation ofSAto a small number of disjoint matrix-vector products. A key fact used in Avron, Sindhwani, and Woodruff (2013) was that submatrices of Vandermonde matrices are also Vandermonde, which is a property that does not hold for other structured families of matrices, such as Toeplitz matrices, which arise in applica-tions like autoregression. There are also sublinear time low rank approximation algorithms of matrices with other kinds of structure, like PSD and distance matrices, see Musco and Woodruff (2017); Bakshi and Woodruff (2018).

An open question, which is the starting point of our work, is if one can extend the results of Avron, Sindhwani, and Woodruff (2013) toany structured matrixA. More specifi-cally,can one solve all of the aforementioned linear algebra

problems in timeT(A)instead of nnz(A), whereT(A)is

the time required to computeAyfor a single vectory? For

many applications, discussed more below, one has a struc-tured matrixAwithT(A) =O(nlogn)nnz(A).

Our Contributions.We answer the above question in the affirmative, showing that for a number of problems in nu-merical linear algebra, one can replace the nnz(A)term with aT(A)term in the time complexity. Perhaps surprisingly, we are not able to achieve these running times via oblivious sketching, but rather need to resort to sampling techniques, as explained below. We state our formal results:

• Low Rank Approximation: Given an n×d matrix A,

we can findU ∈ Rn×k andV ∈ Rk×d satisfying (2) in

O(T(A) logn+n·poly(k/))time.

• `p-Regression: Given an n×dmatrix Aand an n×1

vectorb, one would like to output anx∈_Rd_{for which}

kAx−bkp≤(1 +) min

x kAx−bkp, (3) where for a vector y, kykp = (Pi|yi|p)1/p. We show for any real number p ≥ 1, we can solve this prob-lem inO(T(A) logn+poly(d/))time. This includes least squares regression (p= 2) as a special case.

• Kernel Autoregression: A kernel function is a mapping

φ : Rp → Rp

0

where p0 ≥ p so that the inner product

hφ(x), φ(y)ibetween any two pointsφ(x), φ(y)∈Rp

0 can be computed quickly given the inner producthx, yibetween

x, y ∈ Rp. Such mappings are useful when it is not

possi-ble to find a linear relationship between the input points, but after lifting the points to a higher dimensional space viaφ

it become possible. We are given a matrixA ∈ Rnp×d for

which the rows can be partitioned intoncontiguousp×d

block matricesA1, . . . , An. Further, we are in the setting of autoregression, so for j = 2, . . . , n, Aj _{is obtained from}

Aj−1_{by setting the}_`_{-th column}_Aj

`ofAjto be the(`−1) -st column ofAj−1_{, namely, to}_Aj−1

`−1. The first columnA j 1 of Aj is allowed to be arbitrary. Letφ(A) be the matrix obtained from Aby replacing each block Aj _with_φ₍_Aj₎_, whereφ(Aj₎_{is obtained from}_Aj_{by replacing each column}

Aj_`withφ(Aj_`). We are also given an(np0)×1vectorb, per-haps implicitly. We wantx∈Rdto minimizekφ(A)x−bk2. For general kernels not much is known, though prior work Kumar and Jawahar (2007) shows how to find a min-imizerxassuming i.i.d. Gaussian noise. Their running time isO(n2_t₎_{, where}_t_{is the time to evaluate}_h_φ₍_x₎_{, φ}₍_y_)i_given

x and y. We show how to improve this to O(ndt+dω₎ time, where ω ≈ 2.376 is the exponent of fast matrix multiplication. Note for autoregression thatb has the form

[φ(c1);φ(c2);. . .;φ(cn)]for certain vectorsc1, . . . , cn that we know. As n d in overconstrained regression, our

O(ndt+dω)time is faster than theO(n2t+dω)time of earlier work. For dense matricesA, describingAalready re-quiresΩ(ndp)time, so in the typical case whent ≈p, we are optimal for such matrices. We note that prior work Ku-mar and Jawahar (2007) assumes Gaussian noise, while we do not make such an assumption.

While the above gives an improvement for general ker-nels, one could also hope for much faster algorithms. In gen-eral we would like anxfor which:

kφ(A)x−bk2≤(1 +) min

x kφ(A)x−bk2. (4) We show how to solve this in the case thatφcorresponds to the polynomial kernel of degree2, though discuss extensions toq >2. In this case,hφ(x), φ(y)i=hx, yiq_{. The running} time of our algorithm isO(nnz(A)) +poly(pd/). Note that

(3)

Applications. Our results are quite general, recovering the results of Avron, Sindhwani, and Woodruff (2013) for Vandermonde matrices which have applications to polyno-mial fitting and additive models as a special case. We refer the reader to Avron, Sindhwani, and Woodruff (2013) for de-tails, and here we focus on other implications. One applica-tion is to autoregression, which is a time series model which uses observations from previous time steps as input to a re-gression problem to predict the value in the next time step, and can provide accurate forecasts on time series problems. It is often used to model stochastic time-varying processes in nature, economics, etc. Formally, in autoregression we have:

bt= d X

i=1

bt−ixi+t, (5)

whered+ 1≤t≤n+d, and thetcorrespond to the noise in the model. We note thatb1is defined to be0. This model is known as thed-th order Autoregression model (AR(d)).

The underlying matrix in the AR(d) model corresponds to the firstd columns of aToeplitz matrix, and consequently one can computeAT_A_in_O₍_nd_log_n₎_{time, which is faster} than theO(ndω−1)time which assumingd > poly(logn)

for computing AT_A _{for general matrices} _A_{, where} _ω _≈

2.376is the exponent of matrix multiplication. Alternatively, one can apply the above sketching techniques which run in timeO(nnz(A)) +poly(d/) = O(nd) +poly(d/). Ei-ther way, this gives a time ofΩ(nd). We show how to solve this problem in O(nlog2n) +poly(d/) time, which is a significant improvement over the above methods whenever

d >log2n. There are a number of other works on Toeplitz linear systems and regression see Van Barel, Heinig, and Kravanja (2001, 2003); Heinig (2004); Sweet (1984); Bini, Codevico, and Van Barel (2003); Pan et al. (2004); our work is the first row sampling-based algorithm, and this technique will be crucial for obtaining our polynomial kernel results. More generally, our algorithms only depend onT(A), rather than on specific properties ofA. If instead of just a Toeplitz matrixA, one had a matrix of the formA+B, whereB is an arbitrary matrix withT(B) =O(nlogn), e.g., a sparse perturbation toA, we would obtain the same running time.

Another stochastic process model is the vector autore-gression (VAR), in which one replaces the scalarsbt ∈ R

in (5) with points in Rp. This forecast model is used in

Granger causality, impulse responses, forecast error vari-ance decompositions, and health research van der Krieke et al. (2016). An extension is kernel autoregression Kumar and Jawahar (2007), where we additionally have a kernel func-tionφ:_Rp _→

Rp

0

withp0 > p, and further replacebtwith

φ(bt)in (5). One wants to find the coefficients x1, . . . , xd fitting the pointsφ(bt)without computingφ(bt), which may not be possible sincep0could be very large or even infinite. To the best of our knowledge, our results give the fastest known algorithms for VAR and kernel autoregression.

Our Techniques.Unlike the result in Avron, Sindhwani, and Woodruff (2013) for Vandermonde matrices, many of our results for other structured matricesdo not use oblivious

sketching. We illustrate the difficulties for least squares

re-gression of using oblivious sketching. In Avron, Sindhwani,

and Woodruff (2013), given an n×d Vandermonde ma-trix A, one wants to compute SA, where S is a CountS-ketch matrix. For each i, the i-th row of A has the form

(1, xi, x2i, . . . , x d−1

i ). S has r = poly(d/) rows and n columns, and each column ofS has a single non-zero entry located at a uniformly random chosen position. Denote the entry in thei-th column ash(i), thenSAdecomposes intor

matrix-vector products, where each row ofAparticipates in exactly one matrix product. Namely, we can group the rows of Ainto submatricesAi and create a vectorxi which in-dexes the subset of coordinatesjofxfor whichh(j) = i. Thei-th row ofSAis preciselyxi_Ai_{. For a submatrix}_Ai_of a Vandermonde matrix, the productxi_Ai _{can be computed} inO(silogsi)time, wheresiis the number of rows ofAi. The total time to computeSAis thusO(nlogn).

Now supposeA∈Rn×d,dn, is a rectangular Toeplitz

matrix, i.e., the i-th row of A is obtained by shifting the

(i−1)-st row to the right by one position, and including an arbitrary entry in the first position. Toeplitz matrices are the matrices which arise in autoregression. We can think of

A as a submatrix of a square Toeplitz matrix C, and can computexC for any vectorxinO(nlogn)time. Unfortu-nately though, anr×dsubmatrixAi _{of a Toeplitz matrix,}

r > d, may not have an efficient multiplication algorithm. Indeed, imagine the r rows correspond to disjoint subsets ofdcoordinates of a Toeplitz matrix. Then computingxAi

would takeO(rd)time, whereas for a Vandermonde matrix one could always multiply a vector times anr×dsubmatrix in onlyO(rlogr)time. Vandermonde matrices are a spe-cial sub-class of structured matrices which are closed under taking sub-matrices, which we do not have in general.

Rather than using oblivious sketching, we instead use

sampling-based techniques. A first important observation is

that the sampling-based techniques for subspace approxi-mation Cohen et al. (2015b), low rank approxiapproxi-mation Co-hen, Musco, and Musco (2017), and `p-regression Cohen and Peng (2015), can each be implemented with onlyt =

O(logn)matrix-vector products between the input matrix

A and certain arbitrary vectors v1, . . . , vt arising through-out the course of the algorithm. We start by verifying this property for each of these important applications, allowing us to replace the nnz(A)term with aT(A)term. We then give new algorithms for autoregression, for which the de-sign matrix is a truncated Toeplitz matrix, and more gener-ally composed with a difference and a diagonal matrix.

Our technically more involved results are then for kernel autoregression. First for general kernels, we show how to ac-celerate the computation ofφ(A)Tφ(A)using the Toeplitz nature of autoregression, and observe that only O(nd) in-ner products ever need to be computed, even though there are Θ(n2₎ _{possible inner products. We then show how to} solve polynomial kernels of degreeq. We focus onq = 2

though our arguments can be extended to q > 2. We first use oblivious sketching to compute a d ×O(logn) ma-trix RG from which, via standard arguments, it suffices to sample O(dlogd+d/) row indices i proportional to

(4)

-tuple(i1, . . . , iq)of a blockφ(Aj)with columnsφ(Aj`), for

`∈ {1,2, . . . , d}, and soeiφ(A)ek =Aji1,kA

j i2,k· · ·A

j iq,k. We can also directly read off the corresponding entry from

b. The j-th row of S is also just q_p1

iei if row index i is the j-th sampled row, where pi is the probability of samplingi. Further, the matricesRG andS can be found inO(nnz(A) +d3)time using earlier work Clarkson and Woodruff (2013); Avron, Nguyen, and Woodruff (2014). We show to find the set ofO(dlogd+d/)sampled row indices quickly. Here we use thatφ(A)is “block Toeplitz”, together with a technique of replacing blocks ofφ(A)with “sketched blocks”, which allows us to sample blocks ofφ(A)RG pro-portional to their squared norm. We then need to obtain a sampled index inside of a block, and for the polynomial ker-nel of degree2 we use the fact that the entries ofφ(Aj₎_y for a vector y are in one-to-one correspondence with the entries of Aj−1_D

y(Aj−1)T, where Dy is a diagonal ma-trix withyalong the diagonal. We do not need to compute

Aj−1_D

y(Aj−1)T, but can computeHAj−1Dy(Aj−1)T for a matrixH of i.i.d. Gaussians in order to sample a column ofAj−1Dy(Aj−1)T proportional to its squared norm, after which we can compute the sampled column exactly and out-put an entry of the column proportional to its squared value. Here we use the Johnson Lindenstrauss lemma to argue that

H preserves column norms. A similar identity holds for de-greesq > 2, and that identity was used in the context of Kronecker product regression Diao et al. (2019).

Fast Algorithms Based on Sampling

We first considerminxkAx−bk2, whereA ∈ Rn×d, b ∈ Rn×1, andn > d. We show how, inO(T(A)poly(logn) +

poly(d(logn)/)) time, to reduce this to a problem

minxkSAx−Sbk2, whereSA ∈ Rr×d andSb ∈ Rr×1

such that ifxˆ=argmin_xkSAx−Sbk2, then kAxˆ−bk2≤(1 +) min

x kAx−bk2. (6) Here r = O(d/2₎_{. Given} _SA _and_Sb_{, one can compute} ˆ

x= (SA)−Sbin poly(d/)time. For (6) to hold, it suffices for the matrixSto satisfy the property that for allx,kSAx−

Sbk2 = (1±)kAx−bk2. This is implied if for any fixed

n×(d+ 1)matrixC,kSCxk2 = (1±)kCxk2for allx. Indeed, in this case we may setC= [A, b]. This problem is sometimes called thematrix approximationproblem.

The REPEATED HALVING Algorithm. The following al-gorithm for matrix approximation is given in Cohen et al. (2015b), and called REPEATEDHALVING.

Algorithm 1Repeated Halving

1: procedureREPEATEDHALVING(C∈Rn×(d+1)) 2: Uniformly samplen/2rows ofCto formC0

3: IfC0 has more thanO((dlogd)/2₎_{rows, recursively} compute a spectral approximationC˜0ofC0

4: Approximate generalized leverage scores ofCw.r.t.C˜0

5: Use these estimates to sample rows ofCto formC˜

6: returnC˜

7: end procedure

Leverage Score Computation. We first clarify step 4 in REPEATED HALVING, which is a standard Johnson-Lindenstrauss trick for speeding up leverage score compu-tation Drineas et al. (2012). The i-th generalized leverage score of a matrix C withn rows w.r.t. a matrix B is de-fined to be τ_iB(C) = cT_i(BTB)+ci = kB(BTB)+cik22, where ci is the i-th row of C, written as a column vec-tor. The idea is to instead compute kGB(BTB)+cik22, where G is a random Gaussian matrix with O(logn)

rows. The Johnson-Lindenstrauss lemma and a union bound yield kGB(BT_B₎+_c

ik22 = Θ(1)kB(BTB)+cik22. If B is O((dlogd)/2₎_× _d_{, then} ₍_BT_B₎+ _{can be computed} in poly(d/) time. We compute GB in O((d2_/2_{) log}_n₎ time, which is anO(logn)×dmatrix. Then we compute

(GB)(BTB)+, which now takes only O(d2logn) time, and is an O(logn)×d matrix. Finally one can compute

GB(BTB)+CT inO(nnz(C) logn)time, and the squared column norms are constant factor approximations to the

τB

i (C)values. The total time to compute alli-th generalized leverage scores isO(nnz(C) logn) +poly(dlogn/).

Sampling. We clarify how step 5 in REPEATED HALV

-ING works, which is a standard leverage score sampling-based procedure, see, e.g., Mahoney (2011). Given a list of approximate generalized leverage scores τ˜B

i (C), we sam-ple O((dlogd)/2) rows ofC independently proportional to form C˜. We write this as C˜ = SC, where the i-th row of S has a 1/√pj(i) in the j(i)-th position, where

j(i)is the row ofC sampled in the i-th trial, andpj(i) = ˜

τB i (C)/

P

i0₌₁_,...,n˜τiB0(C) is the probability of sampling

j(i)in thei-th trial. HereSis called asampling and rescal-ingmatrix. Sampling independently from a distribution on

nnumbers with replacementO((dlogd)/2₎_{times can be} done inO(n+ (dlogd)/2)time Vose (1991), giving a to-tal time spent in step 5 ofO(nlogn+ (dlogd)(logn)/2₎ across allO(logn) recursive calls. As argued in Cohen et al. (2015b), the error probability is at most 1/100, which can be made an arbitrarily small constant by appropriately setting the constants in the big-Oh notation above.

Speeding upREPEATEDHALVING.We now show how to speed up the REPEATEDHALVINGalgorithm. Step 2 of RE

-PEATED HALVING can be implemented just by choosing a subset ofrow indicesinO(n)time. Step 3 just involves checking if the number of uniformly sampled rows is larger thanO((dlogd)/2₎_{, which can be done in constant time,} and if so, a recursive call is performed. The number of re-cursive calls is at most O(logn), since Step 1 halves the number of rows. So the total time spent on these steps is

O(nlogn+ (dlogd)(logn)/2).

In step 4 of REPEATED HALVING, we compute general-ized leverage scores ofCwith respect to a matrixC˜0, and is only (non-recursively) applied whenC˜0_has_O₍₍_d_log_d₎_/2₎ rows. As described when computing leverage scores with

B= ˜C0_{, we must do the following:} 1. ComputeGC˜0_∈

RO(logn)×din timeO((d2/2) logn)

2. Compute(( ˜C0₎T_C˜0₎+_{in time}_O₍_d3_/2₎

(5)

Since GC˜0_{(( ˜}_C0₎T_C_˜0₎+ _has _O_(log_n₎ _{rows that already} be computed, one can compute GC˜0_{(( ˜}_C0₎T_C_˜0₎+_CT _in

O(logn)T(C)time, whereT(C)is the time needed to mul-tiply C by a vector (note that computing yCT _is equiva-lent to computingCyT_{). In our application to regression,}

C= [A, b]. Consequently,T(C)≤T(A) +n. As the num-ber of recursive calls isO(logn), it follows that the total time spent for step 4 of REPEATEDHALVING, across all re-cursive calls, isO(T(A) logn+nlogn+ (d2_log2_n₎_/2₊

d3(logn)/2). The fifth step of REPEATED HALVING is to find the sampling and rescaling matrix as described above, which can be done inO(nlogn+ (dlogd)(logn)/2) to-tal time, across all recursive calls. Thus, the toto-tal time is

O(T(A) logn+nlogn+ (d2log2n)/2+d3(logn)/2).

We summarize our findings with the following theorem.

Theorem 1. Given ann×dmatrixA, ann×1vectorb,

an accuracy parameter0< <1, and a failure probability

bound0< δ <1, one can output a vectorxˆ∈Rdfor which

kAxˆ−bk2 ≤ (1 +) minxkAx−bk2with probability at

least1−δ, in total time

O((T(A) logn+poly(dlogn/)) log(1/δ)).

Proof. From the discussion above, our modified version of

REPEATEDHALVINGproduces a vectorxˆforkAxˆ−bk2≤ (1 +) minxkAx−bk2 with probability at least99/100. Repeatingr = O(log(1/δ)) times independently, obtain-ing candidate solutionsxˆ1_{, . . . ,}_x_ˆr_{, and choosing the}_x_ˆi_for whichkAxˆi₋_b_k

2is smallest, one reduces the failure prob-ability toδvia standard Chernoff bounds. The time to com-putekAxˆi₋_b_k

2givenxˆiis at mostT(A) +O(n), which is negligible compared to other operations in a repetition.

Low Rank Approximation.We look at thelow-rank

ap-proximationproblem, where forA∈Rn×done tries to find

a matrixZ ∈_Rn×k_{with orthonormal columns such that}

kA−ZZTAk2F ≤(1 +ε)kA−Akk2F. (7) Here, Ak is the best rank k approximation to A. It is shown in Cohen et al. (2015a) that the low-rank approxima-tion problem can be solved by finding a subset of rescaled columnsC∈Rn×d

0

withd0< d, such that for every rankk

orthogonal projection matrixX:

kC−XCk2

F = (1±ε)kA−XAk 2

F. (8)

Basic Recursive Algorithm. In Cohen, Musco, and Musco (2017), a slightly different version of Algorithm 1 with ridge leverage score approximation is used to solve (8):

Algorithm 2Repeated Halving

1: procedureREPEATEDHALVING(A∈Rn×d) 2: Uniformly sampled/2columns ofAto formC0

3: If C0 has more than O(klogk) columns, recursively compute a constant approximation C˜0 for C0 with

O(klogk)columns

4: Get generalized ridge leverage scores ofAw.r.t.C˜0

5: Use estimates to sample columns ofAto formC

6: returnC

7: end procedure

Improved Running Time.With a similar argument as for least squares regression, we obtain the following theorem.

Theorem 2. There is an iterative column sampling

algo-rithm that, in timeO(T(A) logn+n·poly(k/)), returns

Z ∈Rn×ksatisfying:kA−ZZTAk2F ≤(1+)kA−Akk2F.

`p-Regression. Another important problem is the `p

-regression problem. GivenA ∈ _Rn×d _and_b _∈

Rn×1, we

want to output anx ∈ Rd satisfying (3). We first consider

the problem: forC = [A, b] ∈ Rn×(d+1), find a matrixS

such that for everyx∈_R(d+1)×1_,

(1−)kCxkp≤ kSCxkp≤(1 +)kCxkp. (9) The following APPROXLEWISFORM Algorithm is given in Cohen and Peng (2015) to solve (9), and for them it suf-fices to set the parameterθ in the algorithm description to a small enough constant. This is because in Step 7 of AP

-PROXLEWISFORM, they run the algorithm of Theorem 4.4 in their paper, which runs in at mostntime providedθ is a small enough constant and n > dC0 for a large enough constant C0 > 0. We refer the reader to Cohen and Peng (2015) for the details, but remark that by setting θ to be a constant, Step 5 of APPROXLEWISFORMcan be imple-mented inT(A)time. Also, due to space constraints, we do not define the quadratic formQin what follows; the algo-rithm for computing it is also in Theorem 4.4 of Cohen and Peng (2015). The only property we need is that it is com-putable inmlogmlog logm·dC_{time, for an absolute} con-stantC >0, if it is applied to a matrix with at mostmrows. Theorem 4.4 of Cohen and Peng (2015) can be invoked with constant, giving the so-called Lewis weights up to a constant factor, after which one can sampleO(d(logd)/2₎ rows according to these weights. Note that our running time isO(T(A) logn+poly(d/))even for constantθ, since in each recursive call we may need to spend T(A)time, un-like Cohen and Peng (2015), who obtain a geometric series of nnz(A) +nnz(A)/2 +nnz(A)/4 +· · ·+ 1 ≤2nnz(A)

time in expectation. Here, we do not know ifT(A)decreases when looking at submatrices ofA.

Algorithm 3ApproxLewisForm

1: procedureAPPROXLEWISFORM(C∈Rn×(d+1)) 2: Ifn ≤d+ 1, apply Theorem 4.4 in Cohen and Peng

(2015) to returnQ.

3: Uniformly samplen/2rows ofCto formCb

4: LetQb=APPROXLEWISFORM(C, p, θb )

5: Let ui be an nθ/p multiplicative approximation of

cT i Qcb i

6: Nonuniformly sample rows of C, taking an expected

pi = min(1, f(p)nθ/2dp/2logdu p/2

i )copies of row i (each scaled down byp−_i1/p), producingC0

7: Apply Theorem 4.4 in Cohen and Peng (2015) toC0, and return the quadratic formQ.

8: returnQ

9: end procedure

(6)

approximation for C with only poly(d/)rows. Applying our earlier arguments to this setting yields the following:

Theorem 3. Given ∈ (0,1), a constant p ≥ 1, A ∈

Rn×d and b ∈ Rn×1, there is an algorithm that, in time

O(T(A) logn+poly(d/)), returnsxˆ∈_Rd×1_{such that} kAxˆ−bkp≤(1 +) min

x kAx−bkp.

Applications

Autoregression and General Dynamical Systems.In the original AR(d) model, we have:



  

bd+1

bd+2 .. .

bn+d     =    

bd . . . b1

bd+1 . . . b2 ..

. . .. ...

bn+d−1 . . . bn         x1 x2 .. . xd     +    

εd+1

εd+2 .. .

εn+d 

  

(10)

Here we can create ann×dmatrixAwhere thei-th row is (bi+d−1, bi+d−2, . . . , bi). One obtains the `2-regression problemminxkAx−bk2withbT = (bd+1, . . . , bn+d). In order to apply Theorem 1, we need to boundT(A). The fol-lowing lemma follows from the fact thatAis a submatrix of a Toeplitz matrix.

Lemma 4. T(A) =O(nlogn).

Combining Lemma 4 with Theorem 1, we can conclude:

Theorem 5. Given an instanceminxkAx−bk2of

autore-gression, with probability at least1−δone can find a vector

ˆ

xso thatkAxˆ−bk2≤(1 +) minxkAx−bk2in total time

O (nlog2n+ (d2_log2

n)/2₊_d3_(log_n₎_/2_{) log(1}_/δ₎ . General Dynamical Systems.When dealing with more general dynamical systems, theAin Theorem 5 would be-come A = T U D, where T is a Toeplitz matrix, U is a matrix that represents computing successive differences, andD = diag

1,1_h, . . . ,_hd1−1 . Note thatT isn×d, as

for linear dynamical systems, U is d×d and the opera-tion xU corresponds to replacing xwith (x2 −x1, x3 −

x2, x4 −x3, . . . , xd −xd−1,0), and D is a d×d diago-nal matrix, and soU andDcan each be applied to a vec-tor inO(d)time. Consequently by Lemma 4, we still have

T(A)≤T(T)+T(U)+T(D) =O(nlogn), and we obtain the same time bounds in Theorem 5.

Kernel Autoregression.Letφ : Rp → Rp

0

be a kernel transformation, as defined in the introduction. The kernel autoregression problem is: φ(bt) =

Pd

i=1φ(bt−i)xi+t, where now note thatt∈Rp

0

. Note that there are still only

dunknownsx1, . . . , xd. One way of solving this would be to computeφ(bt)for eacht, represented as a column vector in_Rp0_{, and then create the linear system by stacking such} vectors on top of each other:



  

φ(bd+1)

φ(bd+2)

. . . φ(bn+d)

    =    

φ(bd+1) . . . φ(b1)

φ(bd+2) . . . φ(b2)

. .

. . .. ... φ(bn+d−1) . . . φ(bn)

        x1 x2 . . . xd     +    

εd+1

εd+2

. . . εn+d



  

(11)

One can then compute a(1+)-approximate least squares solution to (11). Now the design matrix φ(A) in the re-gression problem is the vertical concatenation ofp0 matri-cesA, and an analogous argument shows that T(φ(A)) =

O(np0log(np0)), which gives us the analogous version of Theorem 5, showing least squares regression is solvable in

O(np0log2(np0)) +poly((dlogn)/)with constant proba-bility. While correct, this is prohibitive sincep0may be large.

Speeding up General Kernels. Let φ(A) denote the design matrix in (11), where the i-th block is

φ(A)i ₌ _[_φ₍_b

i+d−1);φ(bi+d−2);. . .;φ(bi)]. Here b is

[φ(bd+1);. . .;φ(bn+d)], which we know. We first com-pute φ(A)T_φ₍_A₎_{. To do so quickly, we again} ex-ploit the Toeplitz structure of A. More specifically, we have that φ(A)T_φ₍_A₎ ₌ P

i(φ(A)

i₎T_φ₍_A₎i_. _In order to compute (φ(A)i₎T_φ₍_A₎i_{, we must compute}

d2 _{inner products, namely,} _h_φ₍_b

d−j+i), φ(bd−j0₊_i)i for all j, j0 ∈ {1,2, . . . , d}. Using the kernel trick,

hφ(bd−j+i), φ(bd−j0₊_i)i = f(hb_d₋_j₊_i, b_d₋_j0₊_ii)for some functionfthat we assume can be evaluated in constant time, givenhbd−j+i, bd−j0₊_ii. Note that the latter inner product can be computed in O(p) time and thus we can compute

(φ(A)i)Tφ(A)ifor a giveni, inO(d2p)time. Thus, na¨ıvely, we can computeφ(A)T_φ₍_A₎_in_O₍_nd2_p₎_time.

We can reuse most of our computation across different blocks i. As we range over all i, the inner products we compute are those of the formhφ(bd−j+i), φ(bd−j0₊_i)ifor

i ∈ {1, . . . , n}andj, j0 ∈ {1,2, . . . , d}. Although a na¨ıve count gives nd2 _{different inner products, this overcounts} since for each pointφ(bd−j+i)we only need its inner prod-uct with O(d)points other than with itself, and so O(nd)

inner products in total. This is total timeO(ndp).

Given these inner products, we quickly evaluate

φ(A)T_φ₍_A₎_{. The crucial point is that not only is each entry} inφ(A)T_φ₍_A₎_{a sum of}_n_{inner products we already} com-puted, but one can quickly determine entries from other en-tries. Indeed, given an entry on one of the2d−1diagonal bands, one can compute the next entry on the band inO(1)

time by subtracting off a single inner product and adding one additional inner product, since two consecutive entries along such a band sharen−1out ofninner product summands.

Thus, each diagonal can be computed inO(n+d)time, and so in totalφ(A)T_φ₍_A₎_{can be computed in}_O₍_nd₊_d2₎ time, given the inner products. We can computeφ(A)T_φ₍_A₎ in O(ndp) time assuming d ≤ n. We then define R = (φ(A)T_φ₍_A₎₎−1_{, which can be computed in an additional}

O(dω)time, whereω ≈ 2.376is the exponent of fast ma-trix multiplication. Thus,Ris computable inO(ndp+dω₎ time. Note this is optimal for dense matricesA, since just reading each entry of A takes O(ndp). We can compute

φ(A)Tb ∈ Rd using the kernel trick, which takesO(ndp)

time. By the normal equations,x=Rb, which can be com-puted indω_{time. Overall, we obtain}_O₍_ndp₊_dω₎_time.

(7)

sampling and rescaling matrixS, thenkSφ(A)x−Sbk2 = (1±)kφ(A)x−bk2simultaneously for all vectorsx. Here thei-th row ofS contains a1/√pj in thej-th entry if the

j-th row ofφ(A)is sampled in thei-th trial, and the j-th entry is0otherwise. Herepj = kejφ(A)Rk

2 2 kφ(A)Rk2

F

, whereej is the

j-th standard unit vector. We show how to sample indices

i∈[np0]proportional to the squared row norms ofφ(A)R. Instead of sampling indicesi∈ [np0]proportional to the exact squared row norms ofφ(A)R, it is well-known (see, e.g., Woodruff (2014)) that it suffices to sample them pro-portional to approximations τ˜i to the actual squared row normsτi, where τ₂i ≤ τ˜i ≤2τi for everyi. As in Drineas et al. (2012), to do the latter, we can instead sample in-dices according to the squared row norms of φ(A)RG, whereG ∈ Rd×O(logn) is a matrix of i.i.d. Gaussian

ran-dom variables. To do this, we can first compute RG in

O(d2logn)time, and now we must sample row indices pro-portional to the squared row norms ofφ(A)RG. Note that if we sample an entry (i, j) of φ(A)RG proportional to its squared value, then the row indexi is sampled accord-ing to its squared row norm. SinceRG only hasO(logn)

columnsvi_{, we can do the following: we first approximate} the squared norm of eachφ(A)vi_{. Call our approximation}

γiwith 1₂kφ(A)vik22≤γi≤2kφ(A)vik22. Since we need to samples=O(dlogd+d/)total entries, we sample each entry by first choosing a column i ∈ [d] with probability

γi Pd

j=1γj, and then outputting a sample from columni pro-portional to its squared value. We show (1) how to obtain theγiand (2) how to obtainssampled entries, proportional to their squared value, from each columnφ(A)vi_.

For the polynomial kernel of degree2, the matrixφ(A)

is inRnp

2_×_d

, since each of thenpoints is expanded top2 dimensions byφ. Thenφ(A)is the vertical concatenation of

B1_{, . . . , B}n_{, where each}_Bi_∈ Rp

2_×_d

is a subset of columns of the matrixCi_◦_Ci _∈

Rp

2_×_d2

, whereCi _∈

Rp×d and

Ci_◦_Ci_{consists of the Kronecker product of}_Ci_{with itself,} i.e., the((a, b),(c, d))-th entry ofCi_◦_Ci_is_Ci

a,bCc,di . Notice thatBi_{consists of the subset of}_d_{columns of}_Ci correspond-ing tob=d. Fix a column vectorv∈ {v1_{, . . . , v}d_}_defined above. LetSbe the TensorSketch of Pham and Pagh (2013); Avron, Nguyen, and Woodruff (2014) withO(1)rows. Then

Sφ(z)can be computed in O(nnz(z))time for any z. For blockBi_,_k_SBi_v_k2

2 = (1±1/10)kBivk22with probability at least2/3. We can repeat this schemeO(logn)times in-dependently, creating a new matrixS withO(logn)rows, for whichSφ(bi)can be computed inO(nnz(bi) logn)time and thusSφ(A)can be computed inO(nnz(A) logn)time overall. Further, we have the property that for each block

Bi,kSBivkmed = (1±(1/10))kBivk22,with probability 1−1/n2_{, where the med operation denotes taking the} me-dian estimate on each of the O(logn)independent repeti-tions. By a union bound, with probability1−O(1/n),

kSBivkmed= (1±(1/10))kBivk22, (12) simultaneously for every i = 1, . . . , n. Noticeφ(A) is a

block-Toeplitz matrix, truncated to its firstdcolumns, where

each block corresponds toφ(bi)for somei. Suppose we

re-place the blocksφ(bi)withSφ(bi), obtaining a new block Toeplitz matrixA0, truncated to its firstdcolumns, where now each block has sizeO(logn). The new block Toeplitz matrix can be viewed asO(logn)disjoint standard (blocks of size 1) Toeplitz matrices withn rows, and truncated to their first dcolumns. Thus, T(A0_{) =} _O₍_n_log2_n₎

. The i -th block of coordinates of size O(logn)is equal toSBi_v_, and by (12) we can in O(logn) time compute a number

`i = (1±(1/10))kBivk22. Since this holds for everyi, we can compute the desired estimateγjtokφ(A)vjk22ifv=vj. The time isO(nnz(A) logn+nlog2n).

After computing the γj, suppose our earlier sampling scheme samplesv =vj. Then to output an entry ofφ(A)v proportional to its squared value, we first output a block

Bi proportional to kBivk2

2. Note that given the`i, we can sample such aBi_{within a}₍₁_±₁_/₁₀₎_{factor of the actual} sampling probability. Next, we must sample an entry ofBi_v proportional to its squared value. The entries ofBi_v _{are in} one-to-one correspondence with the entries ofCi_D

v(Ci)T, whereDv∈Rd×dis the diagonal matrix with the entries of

v along the diagonal. LetH ∈ RO(logn)×p be a matrix of

i.i.d. normal random variables. We first computeHCi_{. This} can be done inO(pdlogn)time. We then computeHCiDv inO(dlogn)time, and then(HCi_D

v)(Ci)TinO(pdlogn) time. By the Johnson-Lindenstrauss lemma (see, e.g., John-son and Lindenstrauss (1984)), each squared column norm ofHCi_D

v(Ci)T is the same as that ofCiDv(Ci)T up to a factor of(1±1/10), for an appropriateO(logn)number of rows ofH. So we first sample a columnj ofCi_D

v(Ci)T proportional to this approximate squared norm. Next we compute Ci_D

v(Ci)Tej in O(pd+d2) time, and then in

O(p) time we output an entry of the j-th column propor-tional to its squared value. Thus we find our sample.

To bound the overall time, note that we only need to com-pute theγj values once, for j = 1, . . . , O(logn)and for eachjthis takesO(nnz(A) logn+nlog2n)time. So in to-tal across all indicesjthis takesO(nnz(A) log2n+nlog3n)

time. Moreover, this procedure also gave us the values`ifor eachvj_{. Suppose we also sort the partial sums}Pi0

i=1`i for each1 ≤i0 ≤ nand corresponding to eachvj. This takes

O(nlog2n)time and fits within our time bound. Then for each of ourO(dlogd+d/)samples we need to take, we can first samplejbased on theγjvalues and sampleibased on the sorted partial sums of`ivalues inO(logn)time via a bi-nary search. Having foundi, we perform the procedure in the previous paragraph which takesO(pdlogn+d2₎_{time. Thus,} the time for sampling isO((pd2logn+d3)(1/+ logd)).

The overall time is, up to a constant factor,

O nnz(A) log2n+nlog3n+ (pd2logn+d3)(1+ logd).

Acknowledgments

(8)

References

Avron, H.; Nguyen, H. L.; and Woodruff, D. P. 2014. Sub-space embeddings for the polynomial kernel. InAdvances in Neural Information Processing Systems 27: Annual Confer-ence on Neural Information Processing Systems 2014,

De-cember 8-13 2014, Montreal, Quebec, Canada, 2258–2266.

Avron, H.; Sindhwani, V.; and Woodruff, D. 2013. Sketch-ing structured matrices for faster nonlinear regression. In

Advances in neural information processing systems, 2994–

3002.

Bakshi, A., and Woodruff, D. P. 2018. Sublinear time low-rank approximation of distance matrices. CoRR abs/1809.06986.

Bini, D. A.; Codevico, G.; and Van Barel, M. 2003. Solving toeplitz least squares problems by means of newton’s itera-tion.Numerical Algorithms33(1-4):93–103.

Clarkson, K. L., and Woodruff, D. P. 2013. Low rank ap-proximation and regression in input sparsity time. In Sym-posium on Theory of Computing Conference, STOC’13, Palo

Alto, CA, USA, June 1-4, 2013, 81–90.

Cohen, M. B., and Peng, R. 2015. Lp row sampling by lewis weights. InProceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015,

Portland, OR, USA, June 14-17, 2015, 183–192.

Cohen, M. B.; Elder, S.; Musco, C.; Musco, C.; and Persu, M. 2015a. Dimensionality reduction for k-means clustering and low rank approximation. InProceedings of the

forty-seventh annual ACM symposium on Theory of computing,

163–172. ACM.

Cohen, M. B.; Lee, Y. T.; Musco, C.; Musco, C.; Peng, R.; and Sidford, A. 2015b. Uniform sampling for matrix ap-proximation. InProceedings of the 2015 Conference on In-novations in Theoretical Computer Science, ITCS 2015,

Re-hovot, Israel, January 11-13, 2015, 181–190.

Cohen, M. B.; Musco, C.; and Musco, C. 2017. In-put sparsity time low-rank approximation via ridge leverage score sampling. InProceedings of the Twenty-Eighth

An-nual ACM-SIAM Symposium on Discrete Algorithms, 1758–

1777. SIAM.

Diao, H.; Song, Z.; Sun, W.; and Woodruff, D. P. 2019. Opti-mal sketching for kronecker product regression.Manuscript. Drineas, P.; Magdon-Ismail, M.; Mahoney, M. W.; and Woodruff, D. P. 2012. Fast approximation of matrix coher-ence and statistical leverage. Journal of Machine Learning

Research13:3475–3506.

Heinig, G. 2004. Fast algorithms for toeplitz least squares problems. In Current Trends in Operator Theory and its

Applications. Springer. 167–197.

Johnson, W. B., and Lindenstrauss, J. 1984. Extensions of lipschitz mappings into a hilbert space.

Kumar, R., and Jawahar, C. 2007. Kernel approach to au-toregressive modeling.

Mahoney, M. W. 2011. Randomized algorithms for matri-ces and data. Foundations and Trends in Machine Learning 3(2):123–224.

Meng, X., and Mahoney, M. W. 2013. Low-distortion sub-space embeddings in input-sparsity time and applications to robust linear regression. InSymposium on Theory of Com-puting Conference, STOC’13, Palo Alto, CA, USA, June 1-4, 2013, 91–100.

Musco, C., and Woodruff, D. P. 2017. Sublinear time low-rank approximation of positive semidefinite matrices. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, 672–683.

Nelson, J., and Nguyen, H. L. 2013. OSNAP: faster nu-merical linear algebra algorithms via sparser subspace em-beddings. In54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013,

Berkeley, CA, USA, 117–126.

Pan, V. Y.; Van Barel, M.; Wang, X.; and Codevico, G. 2004. Iterative inversion of structured matrices. Theoretical

Com-puter Science315(2-3):581–592.

Pham, N., and Pagh, R. 2013. Fast and scalable polyno-mial kernels via explicit feature maps. In The 19th ACM SIGKDD International Conference on Knowledge Discov-ery and Data Mining, KDD 2013, Chicago, IL, USA, August

11-14, 2013, 239–247.

Sarl´os, T. 2006. Improved approximation algorithms for large matrices via random projections. In47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA,

Pro-ceedings, 143–152.

Sweet, D. 1984. Fast toeplitz orthogonalization.Numerische

Mathematik43(1):1–21.

Van Barel, M.; Heinig, G.; and Kravanja, P. 2001. A stabilized superfast solver for nonsymmetric toeplitz sys-tems. SIAM Journal on Matrix Analysis and Applications 23(2):494–510.

Van Barel, M.; Heinig, G.; and Kravanja, P. 2003. A su-perfast method for solving toeplitz linear least squares prob-lems. Linear algebra and its applications366:441–457.

van der Krieke et al. 2016. Temporal dynamics of health and well-being: A crowdsourcing approach to momentary assessments and automated generation of personalized feed-back. Psychosomatic Medicine.

Vose, M. D. 1991. A linear algorithm for generating random numbers with a given distribution. IEEE Trans. Softw. Eng. 17(9):972–975.

Woodruff, D. P. 2014. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical