Matrix Methods for Low-Rank Compression in Large-Scale Applications


by

Alec Michael Dunton

B.S., Harvey Mudd College, 2016

M.S., University of Colorado Boulder, 2018

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Applied Mathematics

2021


Dunton, Alec Michael (Ph.D., Applied Mathematics)

Matrix Methods for Low-Rank Compression in Large-Scale Applications

Thesis directed by Prof. Alireza Doostan

Modern scientific applications generate and require more data every year, far outpacing storage capabilities. This growing disparity has inspired work in lossless and lossy data compression, which seek to alleviate the overwhelming surge in big data. Lossless compression approaches provide an exact reconstruction of the original data, with the trade-off of a lower compression factor. Lossy compression approaches, on the other hand, achieve larger compression factors than lossless methods at the cost of error in reconstruction.

In the interest of reducing the size of data generated in scientific applications, this thesis proposes low-rank matrix approximation-based lossy compression algorithms for reducing the dimensionality of data matrices. Several pass-efficient, memory-lean, and fast low-rank approximation methods are proposed for temporal compression of scientific data. These approaches are shown to compress matrices arising in various scientific applications. These low-rank methods are particularly successful in compressing scientific data matrices when a significant fraction of the variance in the data can be captured on a low-dimensional linear subspace; such structure typically arises in diffusion-dominated problems such as low Reynolds number flow simulations.


Acknowledgements

First and foremost, thanks to my advisor Alireza Doostan for his relentless commitment to my success as a graduate student. To my committee members Stephen Becker, Gregory Beylkin, and Francois Meyer, thank you for helpful feedback and support. To Aly Fox, thank you for your mentorship during my first summer at Lawrence Livermore National Lab and throughout my PhD. For their work on Chapter 2 and boundless support as co-authors, I thank Lluís Jofre Cruanyes and Gianluca Iaccarino. I would also like to thank Ben Priest and Geoff Sanders for their collaboration and mentorship during and since my second summer at Livermore. To Dr. Heather Pacella, thanks for being a great collaborator and co-author. Finally, this work was supported by the Department of Energy’s Predictive Science Academic Alliance Program (PSAAP-II) hosted at Stanford University. I cannot thank the PSAAP-II team and our sponsors enough for the opportunities I have had as a result of my participation.


Contents

Chapter

1 Introduction
1.1 Lossy data compression
1.2 Kolmogorov n-widths and low-rank approximation
1.3 Widening n-widths with neural networks
1.4 Pass-efficiency via matrix sketching
1.5 Outline of chapters

2 Pass-efficient methods for compression of high-dimensional turbulent flow data
2.1 Introduction
2.1.1 Contribution of this work
2.2 Low-rank decomposition methods for data compression
2.2.1 Review of QR and SVD
2.2.2 Randomized algorithms
2.2.3 Randomized SVD and single-pass algorithms
2.2.4 Interpolative decomposition (ID) and its randomized variant
2.2.5 Sub-sampled interpolative decomposition
2.2.6 Single-pass interpolative decomposition
2.2.7 Computational complexity and storage comparison
2.3.1 Test case 1: Outlet flow data compression
2.3.2 Test case 2: Volumetric flow data compression
2.3.3 Test case 3: Particle data compression efficiency and accuracy
2.4 Conclusions

3 Deterministic matrix sketches for low-rank compression of high-dimensional simulation data
3.1 Introduction
3.1.1 Contribution of this work
3.2 Deterministic sketches
3.2.1 Deterministic sketches for low-rank approximations
3.3 Singular value decomposition algorithms
3.3.1 A faster two-pass and single-pass SVD algorithm
3.4 A single-pass interpolative decomposition algorithm
3.4.1 Single-pass error estimation for SPC-ID
3.5 Coarse grid power iteration
3.5.1 Summary of algorithms
3.6 Numerical experiments: SPC-SVD
3.6.1 NACA-4412 airfoil data
3.6.2 Turbulent channel flow data
3.7 Numerical experiments: SPC-ID
3.7.1 NACA-4412 airfoil data
3.7.2 Turbulent channel flow data
3.8 In situ spatio-temporal compression of forced isotropic turbulence data
3.9 Proofs of main theoretical results
3.9.1 Proof of Theorem 3.3.1
3.9.2 Proof of Theorem 3.3.2
3.9.4 Proof of Theorem 3.5.1

4 Single-pass self-expressive decompositions for low-rank approximation
4.1 Introduction
4.1.1 Contributions of this work
4.2 Background
4.3 Outline of single-pass ID algorithms
4.3.1 Row-by-row ID
4.3.2 Blocked single-pass ID
4.4 Numerical experiments
4.4.1 Turbulent channel flow
4.4.2 Particle-laden turbulent flow
4.5 Proofs of theoretical results
4.5.1 Proof of Theorem 4.3.1
4.5.2 Proof of Theorem 4.3.2
4.5.3 Proof of Theorem 4.3.3

5 Summary of Linear Methods and Nonlinear Extensions
5.1 Linear methods
5.2 Nonlinear methods

Bibliography

Appendix

A Single-pass nonlinear dimensionality reduction for data compression
A.1 Introduction
A.1.1 Contribution of this work
A.2 Background
A.2.1 Linear dimensionality reduction
A.2.2 Nonlinear dimensionality reduction
A.2.3 Autoencoders
A.3 Scaling AEs for single-pass data compression
A.3.1 Computing compression ratios for single-pass AEs
A.3.2 Improvement in accuracy using SPAE
A.4 Numerical experiments
A.4.1 Swiss roll dataset
A.4.2 Isentropic flow shock solution dataset
A.4.3 Chemically reacting flow dataset
A.5 Conclusions
Appendix
A.5.1 Adam


Tables

Table

2.1 Computational complexity of the SBR-SVD, full ID, sub-sampled ID and single-pass ID.
2.2 Speedup in runtime using the SBR-SVD, sub-sampled ID, and single-pass ID relative to that of the full ID.
2.3 Friction, $Re_\tau$, and bulk, $Re_b$, Reynolds numbers, and skin-friction coefficient, $C_f$, of channel flow at $Re_\tau = 180$ obtained from uncompressed and compressed data.
2.4 Compression error and runtime achieved by the SBR-SVD, ID, randomized ID, and sub-sampled ID for $St^+ = 0, 1, 10$ $Y$-position data.
2.5 Compression error and runtime achieved by the SBR-SVD, ID, randomized ID, and sub-sampled ID for $St^+ = 0, 1, 10$ $V$-velocity data.
2.6 Compression error and runtime achieved by the SBR-SVD, ID, randomized ID, and sub-sampled ID for $St^+ = 0, 1, 10$ $Y$-position data.
2.7 Compression error and runtime achieved by the SBR-SVD, ID, randomized ID, and sub-sampled ID for $St^+ = 0, 1, 10$ $V$-velocity data.
3.1 Computational complexity of the deterministic sketch operations proposed in this work.
3.2 Complexity, RAM usage, and passes used by each algorithm.
3.3 Spatial (S) and temporal (T) compression factors and errors for the 100 snapshot [...]
3.4 Spatio-temporal compression factors and errors for JHU isotropic turbulence dataset.
4.1 Notation Table.
4.2 Computational complexity of the ID, RBR-ID, and BLOCK-ID.


Figures

Figure

2.1 Schematic of a PDE data matrix A.
2.2 Extraction and snapshot of turbulent channel flow data.
2.3 Particle-laden turbulent flow in a periodic channel.
2.4 Relative 2-norm error of proposal with respect to target rank and subsampling factor.
2.5 Relative maximum error as function of sub-sampling factor and target rank.
2.6 Reconstruction of mean stream-wise velocity profile and root mean square velocity fluctuations.
2.7 Reconstruction of mean stream-wise velocity profile and root mean square velocity fluctuations.
2.8 Instantaneous volumetric snapshots of channel flow stream-wise velocity u compressed using proposals.
2.9 Errors and runtimes of three compression methods on volumetric data versus target rank.
2.10 Singular vector associated with the largest singular value of the data matrix in Test case 2 computed via SBR-SVD.
2.11 Accuracy and runtime for proposals plotted against subsampling factor and target rank.
2.12 Reconstruction of various QoIs from data compressed using the proposals.
3.2 Simulation snapshots are read into memory, vectorized, and sketched to form the sketch matrix.
3.3 Mesh used to model pressure coefficient response of a two dimensional NACA 4412 airfoil in a steady, incompressible flow with Reynolds number $1.52 \times 10^6$ [223].
3.4 Relative errors and runtimes of proposed sketches benchmarked against state-of-the-art and optimal SVD.
3.5 Left: Maximum ratio over 100 independent trials of Frobenius error of schemes relative to the lower bound given by the Eckart-Young theorem and average runtimes over the same 100 trials.
3.6 Extraction and snapshot of turbulent channel flow data.
3.7 Relative errors and runtimes of proposed sketches benchmarked against [...]
[...] relative to the lower bound given by the Eckart-Young theorem on the turbulent channel flow dataset. Right: Average runtimes over the same 100 trials.
3.9 Relative errors and runtimes of proposed sketches benchmarked against [...]
[...] relative to the lower bound given by the Eckart-Young theorem on the NACA-4412 airfoil dataset. Right: Average runtimes over the same 100 trials.
3.11 Relative errors and runtimes of proposed sketches benchmarked against [...]
4.1 Schematic of RBR-ID and BLOCK-ID.
4.2 Extraction and snapshot of turbulent channel flow data.
4.3 Errors and runtimes of the proposals compared to the multi-pass approach.
4.7 Stream-wise velocity of the fluid with particles in a turbulent channel flow.
A.1 Schematic of a fully-connected AE.
A.2 SPAE and SVD approximation of the Swiss roll dataset from one, two, and three latent dimensions.
A.3 Relative error and compressed factors achieved using SPAE, an offline AE, SPSVD, and the optimal linear embedding given by the truncated SVD.
A.4 Linear embedding dimension required to match accuracy of SPAE and improvement in compression factor achieved using SPAE.
A.5 Performance of SPAE compared to SPSVD and the ground truth solution on the 1D Isentropic Flow Shockwave dataset in one latent dimension.
A.6 Snapshots of temperature, H2 mass fraction, O2 mass fraction, and H2O mass fraction from chemically reacting flow.
A.7 Snapshots of temperature taken at time-steps 2000, 5000, 10000, and 60000 from chemically reacting flow.
A.9 Linear embedding dimension required to match accuracy of SPAE and improvement in compression factor achieved using SPAE for latent dimensions 1 to 5 on the chemically reacting flow problem.
A.10 Performance of SPAE compared to SPSVD and the ground truth solution on the 2D [...]


Introduction

The future of high-performance computing, specifically on future Exascale computers, will presumably see memory capacity and bandwidth fail to keep pace with data generated, for instance, from massively parallel partial differential equation (PDE) systems. Current strategies proposed to address this bottleneck entail the omission of large fractions of data, as well as the incorporation of in situ compression algorithms to avoid overuse of memory. To ensure that post-processing operations are successful, compression must be done in a way that a sufficiently accurate representation of the solution is stored. Moreover, in situations where the input/output system becomes a bottleneck in analysis, visualization, etc., or the execution of the PDE solver is expensive, the number of passes made over the data must be minimized.

Each of the chapters presented in this thesis seeks to address the big data problem in scientific simulation by identifying low-dimensional spaces, e.g., linear subspaces or nonlinear manifolds, to reduce the size of data in as few passes over the input as possible. In the rest of this chapter, we provide relevant background for the works presented in the subsequent chapters of this thesis. We begin with a brief overview of modern methods for lossy data compression.

1.1 Lossy data compression


FPZIP [161] and ZFP [160] at Lawrence Livermore National Laboratory, are among the state-of-the-art in scientific data compression methods. SZ and ZFP stand out in terms of compression factors achieved on benchmark datasets, computational efficiency, and reconstruction accuracy. Other lossy compression methods include transform-based approaches such as the discrete cosine transform, discrete Legendre transform [172, 196], wavelet-based approaches [246, 150, 192, 154, 151, 152], and Tucker tensor decompositions [114, 240, 8]. Predictive coding techniques approximate data values by extrapolating from neighboring points. Examples of these methods include SZ, FPZIP, Lorenzo [120], and Isabela [144, 147]. Low-rank matrix approximations, which are selected for implementation in this thesis, identify bases to approximate the fundamental subspaces of data matrices; in this sense, low-rank matrix methods fall within the class of transform-based methods. Low-rank matrix approximations enable compression by producing factor matrices whose product forms an approximation of the original matrix. Compression is achieved when these factor matrices are stored in place of the original matrix. Let $X \in \mathbb{R}^{m \times n}$ be a real matrix, and let a rank-$k$ approximation of $X$ be given as $X \approx BC$ with $B \in \mathbb{R}^{m \times k}$ and $C \in \mathbb{R}^{k \times n}$. Then, $B$ and $C$ may be stored out-of-core for offline reconstruction of the matrix $X$. If $k \ll m, n$, then these factor matrices provide a compressed version of $X$. The corresponding compression factor (CF) of the matrix $X$ is then given by

$$\mathrm{CF} := \frac{mn}{k(m + n)}. \tag{1.1}$$
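As a minimal illustration of Equation (1.1), the following Python sketch forms the factors B and C from a truncated SVD and evaluates the resulting compression factor. The matrix sizes, the synthetic test matrix, and the helper names `compression_factor` and `truncated_svd_factors` are assumptions made for this example; the truncated SVD is only one of several ways to obtain such factors.

```python
import numpy as np

def compression_factor(m: int, n: int, k: int) -> float:
    """CF = mn / (k(m + n)), as in Equation (1.1)."""
    return (m * n) / (k * (m + n))

def truncated_svd_factors(X: np.ndarray, k: int):
    """Return B (m x k) and C (k x n) with X ~= B @ C via a rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    B = U[:, :k] * s[:k]   # absorb the singular values into the left factor
    C = Vt[:k, :]
    return B, C

rng = np.random.default_rng(0)
m, n, k = 200, 100, 5
# Synthetic matrix of exact rank k (hypothetical test data, not from the thesis).
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

B, C = truncated_svd_factors(X, k)
rel_err = np.linalg.norm(X - B @ C) / np.linalg.norm(X)
cf = compression_factor(m, n, k)
print(f"CF = {cf:.2f}, relative error = {rel_err:.2e}")
```

Only B and C need to be stored, which is where the k(m + n) term in the denominator comes from.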


directly for column subset selection and clustering, two fundamental machine learning applications. The SVD, on the other hand, provides vectors which span the same subspace as the principal components of the data matrix. This direct utility is not as frequently observed in other compression approaches.

Appendix A of this thesis generalizes the low-rank approximation approaches presented in previous chapters to a nonlinear setting using a class of neural networks called autoencoders. Autoencoders are auto-associative feed-forward neural networks which generate approximations of their input from a learned code. The encoder $\Phi$, which comprises the first half of the network architecture, learns an embedding into a (typically) lower dimensional latent space. The decoder, denoted $\Phi^{-1}$ with a slight abuse of notation, comprises the second half of the network and learns an approximation to the input from this latent space. The approximation to the input matrix $X$ is then given by $X \approx \Phi^{-1}(\Phi(X))$. The compressed, i.e., encoded, latent data is given by $\Phi(X)$, which, stored together with the decoder $\Phi^{-1}$, enables offline reconstruction of $X$.

Because nonlinearity is introduced into the approximation via activation functions placed between each layer of the network, autoencoders can achieve significant improvements in accuracy and compression factor over low-rank methods. Autoencoders have been used for scientific data compression in works such as [93, 185, 164, 163]; in these two works, a pre-trained network is used to compress data in situ.

When using autoencoders to compress data, computation of the CF becomes more complicated than the definition given in Equation 1.1. Let $|\Phi^{-1}|$ be the number of parameters required to store the decoder offline and $\Phi(X) \in \mathbb{R}^{m \times r}$. Then, the CF achieved using an autoencoder (AECF) is given by

$$\mathrm{AECF} := \frac{mn}{mr + |\Phi^{-1}|}.$$
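To make the role of the decoder size concrete, the following Python sketch evaluates the AECF for a small fully-connected decoder and compares it with the CF of Equation (1.1). The layer widths, the matrix dimensions, and the helper names are hypothetical choices for illustration only; the parameter count assumes one weight matrix and one bias vector per layer.

```python
def dense_decoder_param_count(layer_widths):
    """Number of weights and biases in a fully-connected decoder.

    layer_widths lists the layer sizes from the latent dimension r to the
    output dimension n, e.g. [r, h, n] (a hypothetical architecture).
    """
    return sum(w_in * w_out + w_out
               for w_in, w_out in zip(layer_widths[:-1], layer_widths[1:]))

def autoencoder_compression_factor(m, n, r, decoder_params):
    """AECF = mn / (mr + |decoder|)."""
    return (m * n) / (m * r + decoder_params)

def low_rank_compression_factor(m, n, k):
    """CF = mn / (k(m + n)), as in Equation (1.1)."""
    return (m * n) / (k * (m + n))

m, n = 100_000, 1_000          # hypothetical data matrix dimensions
r = 2                          # latent dimension of the autoencoder
params = dense_decoder_param_count([r, 32, n])

aecf = autoencoder_compression_factor(m, n, r, params)
cf_rank2 = low_rank_compression_factor(m, n, k=2)
cf_rank4 = low_rank_compression_factor(m, n, k=4)
print(f"decoder parameters: {params}")
print(f"AECF (r=2): {aecf:.1f}, CF (k=2): {cf_rank2:.1f}, CF (k=4): {cf_rank4:.1f}")
```

In this configuration the autoencoder compresses less than a rank-2 factorization of the same latent dimension, because its decoder is larger than the k x n factor it replaces. It compresses more than a rank-4 factorization, so it pays off only if it reaches the target accuracy with a smaller latent dimension than the linear method needs.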


to offset the improvement in dimensionality reduction, then the autoencoder will outperform the corresponding low-rank approximation in terms of data compression as well. It is therefore critical to the success of autoencoders in data compression that the decoder is constructed parsimoniously with respect to the number of parameters defining the mapping. Using a network with too large of a memory footprint, regardless of the dimensionality reduction achieved by the encoder, will destroy any benefit of using the autoencoder over linear low-rank methods.

This overview of lossy compression methods is far from exhaustive. A more thorough literature review of lossy compression approaches is presented throughout the chapters of this thesis, in particular Chapter 2. Chapter 2 also highlights some leading approaches in lossless data compression. We also refer the interested reader to the following review paper on scientific data compression [153].

1.2 Kolmogorov n-widths and low-rank approximation

In the context of data matrices arising in the solution of PDE systems, the efficacy of linear dimensionality reduction methods like low-rank matrix decompositions in compressing a data matrix can be quantified by the Kolmogorov n-width of the system from which the data is generated. At a high level, Kolmogorov n-widths provide a metric on how well linear subspace-based models can approximate the solutions of a PDE system. Given that this thesis focuses on developing low-rank matrix approximation approaches for compression of data which comes from physical systems, in particular PDE systems, knowledge of the corresponding Kolmogorov n-widths can be useful in determining their utility in a given problem. We now provide a theoretical exposition on Kolmogorov n-widths taken from definitions and theorems in [204].

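Let $\mathcal{X}$ be a normed linear space and $\tilde{X}_n$ be an n-dimensional linear subspace of $\mathcal{X}$. Then, for every $x \in \mathcal{X}$ we define the metric $\delta(x, \tilde{X}_n)$:

$$\delta(x, \tilde{X}_n) = \inf_{y \in \tilde{X}_n} \|x - y\|. \tag{1.2}$$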

This metric provides a measure of how close the element $x \in \mathcal{X}$ is to the subspace $\tilde{X}_n$. If there exists a $\tilde{y}$ which minimizes (1.2), then $\tilde{y}$ is sometimes called the best approximation of $x$ in the subspace $\tilde{X}_n$. Now, let $\mathcal{S}$ be a subspace of $\mathcal{X}$. The metric $\delta$ is now extended to subspaces of $\mathcal{X}$ via the following definition:

$$\delta(\mathcal{S}, \tilde{X}_n) = \sup_{x \in \mathcal{S}} \delta(x, \tilde{X}_n) = \sup_{x \in \mathcal{S}} \inf_{y \in \tilde{X}_n} \|x - y\|.$$

This quantity provides the “worst-best” approximation of the elements in $\mathcal{S}$. That is, it bounds how poorly a linear subspace $\tilde{X}_n$ approximates any element of the subspace $\mathcal{S}$. With the metric $\delta$ defined for subspaces of the normed linear space $\mathcal{X}$, the Kolmogorov n-width of the subspace $\mathcal{S}$ may now be defined as follows:

Definition 1.2.1. The Kolmogorov n-width of $\mathcal{S} \subset \mathcal{X}$, $d_n(\mathcal{S})$, is given by

$$d_n(\mathcal{S}) = \inf_{\tilde{X}_n \subset \mathcal{X}} \delta(\mathcal{S}, \tilde{X}_n) = \inf_{\tilde{X}_n \subset \mathcal{X}} \sup_{x \in \mathcal{S}} \inf_{y \in \tilde{X}_n} \|x - y\|.$$

In words, the Kolmogorov n-width of the subspace $\mathcal{S}$ provides an upper bound on how accurately an optimally selected n-dimensional linear subspace can approximate any element of the subspace $\mathcal{S}$. That is, it provides the worst-case approximation error of any given element of a linear subspace $\mathcal{S} \subset \mathcal{X}$ following projection of $\mathcal{S}$ onto an optimal n-dimensional subspace $\tilde{X}_n$ of the normed linear space $\mathcal{X}$. As an example, the Kolmogorov n-width of a linear time-invariant (LTI) system is given by the $(n + 1)$th largest singular value of the Hankel operator [242]. In Petrov-Galerkin projection-based schemes, n-widths identify the optimal convergence rate for a provided dataset [181, 183].
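The sup-inf structure of the definition can be checked numerically in a simple finite-dimensional setting. The sketch below assumes, purely for illustration, that the set being approximated is the image of the Euclidean unit ball under a matrix A; in that case the worst-case projection error onto a subspace equals the spectral norm of the projected residual. The subspace spanned by the leading n left singular vectors attains the $(n + 1)$th singular value, and any other n-dimensional subspace does no better. The function name and test matrix are hypothetical.

```python
import numpy as np

def worst_case_projection_error(A, basis):
    """sup over unit-norm z of ||A z - P A z||_2, where P is the orthogonal
    projector onto span(basis); this equals the spectral norm of (I - P) A."""
    Q, _ = np.linalg.qr(basis)            # orthonormalize the basis columns
    residual = A - Q @ (Q.T @ A)
    return np.linalg.norm(residual, 2)

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 40))         # hypothetical test matrix
U, s, _ = np.linalg.svd(A, full_matrices=False)

n = 5
err_svd = worst_case_projection_error(A, U[:, :n])                          # leading singular subspace
err_random = worst_case_projection_error(A, rng.standard_normal((60, n)))   # arbitrary subspace

print(f"sigma_(n+1)           = {s[n]:.4f}")
print(f"error, SVD subspace   = {err_svd:.4f}")
print(f"error, random subspace = {err_random:.4f}")
```

A fast decay of the singular values therefore corresponds to a fast-decaying n-width in this simplified setting, which is the regime in which low-rank compression is effective.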


problems, linear methods are likely to reduce the dimension of associated data matrices provided their structure reflects that of the normed linear space $\mathcal{X}$ on which the n-width is computed [10, 21]. Examples of systems with fast-decaying n-width include low Reynolds number flows. However, in other applications of interest with slow-decaying Kolmogorov n-widths, there may be significant limitations on the efficacy of linear subspace methods in data compression.

1.3 Widening n-widths with neural networks

Advection- and convection-dominated problems, such as those found in models describing hypersonic systems, often feature slow-decaying Kolmogorov n-widths. In these scenarios, linear projection methods can fail to efficiently approximate data matrices arising in simulation [194, 146]. This has led researchers to explore dimensionality reduction approaches which are not limited by n-width decay. Nonlinear dimensionality reduction methods, among other approaches, are particularly well equipped to break through the inherent limitations of linear approaches. In recent years, nonlinear dimensionality reduction methods have been developed for myriad scientific applications. One such application is reduced order modeling (ROM).

ROM is closely related to low-rank compression methods. Low-rank compression seeks to reduce the memory expense of storing data from, e.g., high-fidelity simulations of physical systems, while the goal of ROM is to reduce the computational expense of solving the high-fidelity full-order model (FOM). In projection-based ROM, this is done via identification of low-dimensional latent spaces which capture the majority of the variance in the FOM; these latent spaces are sometimes identical to those computed in low-rank compression approaches. For example, classical approaches for ROM include the proper orthogonal decomposition (POD) [18], in which the singular value decomposition (SVD) of a matrix of snapshots is computed. The leading k singular vectors span a k-dimensional linear subspace on which the ROM is solved; the low-rank SVD approximation in this case also constitutes a compressed representation of the snapshot matrix.


subspaces. However, the limitations of low-rank methods in ROM are analogous to those in low-rank compression. This has inspired research in alternative approaches for ROM which seek to break through the limitations imposed by n-widths. Approaches such as transforming the physical domain [254, 255], separating transport dynamics [193], and shifting the POD basis [210] have all sought to address the limitations of the aforementioned linear subspace-based ROM methods. Other approaches focus on local improvement of linear approaches, and require significant knowledge of the physical system for which the ROM is being constructed [146].

In 2019, a breakthrough paper by Lee and Carlberg introduced a nonlinear ROM approach based on deep convolutional autoencoders which generalizes nonlinear ROM well beyond the aforementioned approaches [146]. The low dimensional latent space learned by the deep convolutional autoencoder provides the space in which the reduced order model evolves, while the network decoder provides the manifold on which the generalized coordinates of the reduced system evolve in time. Their network architecture is comprised of four convolutional layers composed with nonlinear activation functions followed by a single linear dimensionality reduction step. In the paper, the authors show that the expressivity and nonlinearity of these convolutional autoencoders allow them to beat optimal linear subspace-based ROMs by orders of magnitude in terms of accuracy [146].

Due to the recent success of autoencoders in constructing dimension-efficient ROMs on highly nonlinear, advection- and convection-dominated problems, as well as their success in compressing data from turbulent flows [93, 185], it seems they should succeed as a broad tool for compressing scientific data. The authors of [164] and [163] explore this potential application; the former paper uses a fully-connected network to reduce the dimensionality of 1D scientific data by a factor of 512, while the latter uses convolutional autoencoders trained on blocks of large-scale data tensors to enhance the compressive capabilities of the state-of-the-art scientific compressor SZ. Due to its fully-connected architecture, the approach in [164] is only tested on small-scale scientific data. The approach in [163] is designed and trained to compress small blocks of large-scale data tensors, and therefore scales to much larger problems than the approach in [164].


only reduce the dimensionality of data matrices, but provide accurate, nonlinear preimage mappings to approximate the input matrix. To the best of our knowledge, autoencoders outperform all other nonlinear dimensionality reduction approaches in both of these respects. Autoencoders, if sufficiently trained, will compress and reconstruct data at least as well as an optimally computed linear approximation of equal latent dimension. For this reason, they are selected for implementation in the nonlinear data compression component of this thesis. Further, based on the very recent work done in scientific data compression using them, their promise in the field is quite clear. None of the work done using autoencoders for scientific data compression has presented an entirely online framework in which the network is trained and the embedding and reconstruction mappings are provided in one pass over the input. To the best of our knowledge, Appendix A of this thesis presents the first work which proposes such an algorithm.

1.4 Pass-efficiency via matrix sketching

All of the linear and nonlinear methods proposed in this thesis emphasize pass-efficiency; they seek to obtain approximations to the input data seeing it as few times as possible. In the online or streaming setting, single-pass implementations which see the input once are vital. Moreover, in large-scale applications where the cost of a simulation or memory movement is prohibitive, single-pass algorithms can achieve significant speedups over multi-pass approaches. For example, in flow control for wing design [207], simulations on leadership computers can generate over 1 TB/s of data. In such scenarios, computing a low-rank approximation in one pass over the input will reduce memory movement and simulation costs significantly.


input; instead of viewing the original data multiple times during the computation, the sketched data is viewed instead. Constructing sketching operators has been an active area of research for decades; these operators are frequently drawn from distributions on matrices, generating so-called randomized algorithms. For a thorough review on randomized linear algebra, particularly in the context of low-rank matrix approximation, we refer the interested reader to the seminal review paper [108].

Among the most popular random matrices used as sketching operators in low-rank approximation are Gaussian sketches [108], the fast Johnson-Lindenstrauss transform (FJLT) [4], the subsampled random Fourier transform (SRFT) [257], and CountSketch [157, 256]. In Gaussian sketches, the entries of the sketching operator are i.i.d. and drawn from a Gaussian distribution. In the FJLT, a composition of operators including a fast Walsh-Hadamard transform and a sparse matrix is used to improve upon the computational complexity observed when using dense Gaussian matrices to embed the data. In the SRFT, the fast Fourier transform is used in tandem with a sub-sampling scheme and matrix with i.i.d. entries selected from the complex unit circle to accelerate the sketching procedure [257, 108]. Finally, in CountSketch, a hashing function whose complexity scales with the number of nonzero entries of the input matrix is selected for constructing the sketching operator; a key feature of CountSketch is that it runs in input-sparsity time [157, 256]. This list of sketches is far from exhaustive; for a thorough exploration of other sketching approaches used in numerical linear algebra, we refer the interested reader to [261].
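Two of the sketches described above can be written in a few lines of Python. The sketch below is a simplified illustration under stated assumptions: the sketches act on the rows of a dense matrix, the Gaussian entries are scaled by $1/\sqrt{k}$, and the function names `gaussian_sketch` and `countsketch` are hypothetical. It is not the implementation used in the cited works.

```python
import numpy as np

def gaussian_sketch(A, k, rng):
    """Dense sketch S @ A, where S (k x m) has i.i.d. N(0, 1/k) entries."""
    S = rng.standard_normal((k, A.shape[0])) / np.sqrt(k)
    return S @ A

def countsketch(A, k, rng):
    """CountSketch: each row of A is hashed to one of k buckets with a random sign.

    Each entry of A is touched once, and the k x m sketching matrix is
    never formed explicitly.
    """
    m = A.shape[0]
    buckets = rng.integers(0, k, size=m)        # hash: row index -> bucket index
    signs = rng.choice([-1.0, 1.0], size=m)     # one random sign per row
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)  # accumulate signed rows per bucket
    return SA, buckets, signs

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))             # hypothetical tall data matrix
k = 100

GA = gaussian_sketch(A, k, rng)
CA, buckets, signs = countsketch(A, k, rng)
print(GA.shape, CA.shape)
```

The Gaussian sketch costs a dense matrix product, whereas CountSketch only adds or subtracts each row of the input into a single bucket, which is the source of its input-sparsity runtime.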


Lemma 1.4.1. (Theorem 2.1 of [58].) Given $0 < \epsilon < 1$, a set $X$ of $m$ points in $\mathbb{R}^N$, and a positive integer $n \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1} \log(m)$, there is a linear map $f : \mathbb{R}^N \to \mathbb{R}^n$ such that

$$(1 - \epsilon)\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \epsilon)\|u - v\|^2,$$

for all $u, v \in X$.
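The lemma can be explored empirically with a scaled Gaussian matrix, which is one common randomized choice for the map f. The Python sketch below computes the embedding dimension from the bound in the lemma and reports the range of squared-distance ratios over all pairs of points. The point set, the dimensions, and the helper names are assumptions for illustration; a random draw satisfies the distortion bound with high probability rather than with certainty, so the printed ratios should be read as an experiment and not as a proof.

```python
import math
import numpy as np

def jl_dimension(m, eps):
    """Smallest integer n with n >= 4 (eps^2/2 - eps^3/3)^{-1} log(m)."""
    return math.ceil(4.0 * math.log(m) / (eps**2 / 2.0 - eps**3 / 3.0))

def pairwise_sq_dists(P):
    """Squared Euclidean distances between all distinct pairs of rows of P."""
    G = P @ P.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    iu = np.triu_indices(P.shape[0], k=1)
    return sq[iu]

rng = np.random.default_rng(2)
m, N, eps = 50, 2000, 0.5                 # hypothetical number of points, dimension, distortion
n = jl_dimension(m, eps)

X = rng.standard_normal((m, N))           # m points in R^N, stored as rows
F = rng.standard_normal((N, n)) / np.sqrt(n)   # linear map f(u) = u @ F

ratios = pairwise_sq_dists(X @ F) / pairwise_sq_dists(X)
print(f"n = {n}, min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
```

Note that the embedding dimension depends only on the number of points and the distortion, not on the ambient dimension N.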

Matrices which satisfy the J-L lemma are frequently selected for application in low-rank approximation algorithms. Some sketches, such as CountSketch, which do not satisfy the J-L lemma but can be significantly faster than dense sketching operators, are also often used with the acceptance of loss of theoretical guarantees. For example, the authors of [171] propose using CountSketch in the computation of matrix and tensor interpolative decompositions. Although CountSketch does not satisfy the J-L lemma, they are still able to provide performance guarantees on both the matrix and tensor approximations following analysis similar to that of [257]. The satisfactory performance of CountSketch in terms of accuracy, coupled with the computational expediency of input-sparsity time sketching approaches like it, highlights a critical trade-off between runtime and fidelity in matrix sketching.


Matrix sketching has been used to construct online low-rank matrix approximation algorithms in works such as [108, 263, 238]. Inspired by these approaches, we propose sketches based on coarse grid representations of scientific simulation data to construct online algorithms for computing the singular value decomposition and interpolative decomposition of matrices. In systems with fast-decaying Kolmogorov n-widths, these low-rank approximations identify linear subspaces of the temporal and parametric domains of physical systems to achieve significant dimensionality reduction and accurate data reconstruction. In problems with slow-decaying Kolmogorov n-widths, linear sketching approaches are limited in their efficacy, but can be augmented with neural network-based methods to circumvent some of these shortcomings.
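For readers unfamiliar with sketch-based online algorithms, the following Python sketch shows one generic single-pass scheme built from two Gaussian sketches. It is a swapped-in illustration in the spirit of the randomized methods cited above, not the coarse-grid sketching algorithms proposed in this thesis; the function name, the block layout, and the sketch sizes k and l are assumptions.

```python
import numpy as np

def single_pass_low_rank(column_blocks, m, k, l, rng):
    """One-pass rank-k approximation A ~= Q @ C from streamed column blocks.

    Two sketches are accumulated while the data streams by:
      Y = A @ Omega  (m x k), which captures the range of A, and
      W = Psi @ A    (l x n), which captures the co-range of A, with l >= k.
    """
    Psi = rng.standard_normal((l, m))
    Y = np.zeros((m, k))
    W_blocks = []
    for block in column_blocks:                  # each block is m x b and is read once
        Omega_block = rng.standard_normal((block.shape[1], k))
        Y += block @ Omega_block                 # update the range sketch
        W_blocks.append(Psi @ block)             # co-range sketch of these columns
    W = np.hstack(W_blocks)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sketched range
    C, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)   # least-squares solve of (Psi Q) C = W
    return Q, C

rng = np.random.default_rng(3)
m, n, k, l = 80, 60, 5, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # synthetic exact rank-k data
blocks = np.array_split(A, 6, axis=1)            # stand-in for snapshots arriving in batches

Q, C = single_pass_low_rank(blocks, m, k, l, rng)
rel_err = np.linalg.norm(A - Q @ C) / np.linalg.norm(A)
print(f"relative error = {rel_err:.2e}")
```

Each block is discarded after its two sketch updates, so only the sketches, whose sizes scale with k and l rather than with the full data, are held in memory.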

In the nonlinear data compression setting presented in Appendix A, linear matrix sketches are used to enable simultaneous online training of autoencoders and embedding of the input data via the trained autoencoder. In this context, the matrix sketch can be thought of as the first layer of an augmented neural network. A lifting operator, which maps the sketch to an approximation of the input in a least-squares optimal manner, constitutes the final layer of the network. Therefore, both the sketch and lifting operators are effectively pre-trained layers in the overall network architecture. The autoencoder is then trained on the sketched data, with the encoded sketch forming the latent data. The decoder trained on the sketch, combined with the least-squares lifting operator, forms a larger augmented decoder which yields an approximation to the input. This framework constitutes a linear-nonlinear hybrid approach which exploits the benefits of neural network-based approaches such as their expressivity, while also taking advantage of the performance guarantees and computational efficiency of linear approaches to enable online low-rank compression.
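The two linear layers of this augmented network can be illustrated without any neural network at all. In the Python sketch below, the sketch keeps a subset of the features, the lifting operator is obtained by ordinary least squares, and the identity map stands in for the trained autoencoder. The subsampling pattern, the synthetic data, and the way the lifting operator is fitted (using the full matrix at once) are simplifying assumptions and may differ from the construction in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 120, 40, 3
# Rows are samples and columns are features; synthetic rank-r data (hypothetical).
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# "First layer": a fixed linear sketch that keeps every 4th feature.
keep = np.arange(0, n, 4)
X_sketch = X[:, keep]                              # m x 10

# "Last layer": least-squares lifting operator L minimizing ||X - X_sketch @ L||_F.
L, *_ = np.linalg.lstsq(X_sketch, X, rcond=None)   # 10 x n

# A trained autoencoder would act on X_sketch here; the identity stands in for it.
X_rec = X_sketch @ L
rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(f"sketch shape = {X_sketch.shape}, relative error = {rel_err:.2e}")
```

Because the autoencoder only ever sees the sketched data, its input and output layers scale with the sketch size rather than with n, which keeps the decoder's memory footprint small.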

1.5 Outline of chapters


manifolds for matrix compression using autoencoders. The chapters of this thesis are independent works, presented with independent notation. Some figures are redundant across chapters; the repeated figures are included for clarity of presentation. Each of the chapters of this thesis has either been published, submitted for publication, or is to be submitted for publication. Most are available on the arXiV preprint server, and a list of collaborators and other relevant information is provided at the beginning of each chapter. A brief outline of each of the chapters of this thesis is now provided.

The first work of this thesis, presented in Chapter 2, focuses on the utility of pass-efficient, parallelizable, low-rank matrix decomposition methods in compressing high-dimensional simulation data from turbulent flows. A particular emphasis is placed on using coarse representations of the data (compatible with the PDE discretization grid) to accelerate the construction of the low-rank factorization. This includes the presentation of a novel single-pass, i.e., requiring one read over the input, matrix decomposition algorithm for computing the so-called interpolative decomposition. The methods are described extensively and numerical experiments on two turbulent channel flow datasets are performed. In the first (unladen) channel flow case, compression factors exceeding 400 are achieved while maintaining accuracy with respect to first and second-order flow statistics. In the particle-laden case, compression factors of 100 are achieved and the compressed data is used to recover particle velocities. These results show that these compression methods can enable efficient computation of various quantities of interest in both the carrier and dispersed phases.


decomposition and matrix interpolative decomposition. The deterministic sketching approaches in this work have many advantages over randomized sketches. Broadly, randomized sketches are data-agnostic, whereas the proposed sketching methods exploit structures within data generated in complex PDE systems. These deterministic sketches are often faster, require access to a small fraction of the input matrix, and do not need to be explicitly constructed. A novel single-pass power iteration algorithm is also presented. The power iteration method is particularly effective in improving low-rank approximations when the singular value decay of data is slow. Theoretical error bounds and estimates, as well as numerical results across three application problems, are provided.

Building further on the theme of low-rank approximation, in Chapter 4, we propose two online algorithms for computing self-expressive decompositions of large-scale data matrices. By self-expressive, we refer to decompositions which are comprised of a subset of rows, columns, or both, and a corresponding pre-image mapping which forms a low-rank approximation to the original data. We first present a broad framework which accommodates any method which generates self-expressive decompositions. We then focus on the so-called row interpolative decomposition. This low-rank approximation represents a matrix as the product of a subset of its rows and a least-squares coefficient matrix. After outlining each algorithm, we provide complexity and error analysis to highlight the trade-offs in using our online approach as opposed to an analogous offline algorithm. We apply our methods to datasets from particle-laden turbulence to demonstrate their utility in real-world applications. In Chapter 5, we draw conclusions on the overall contribution of this thesis and suggest future avenues of research for both linear and nonlinear approaches to scientific data compression.


Pass-efficient methods for compression of high-dimensional turbulent flow data

This chapter presents a paper I published with Prof. Alireza Doostan (University of Colorado Boulder), Prof. Lluís Jofre-Cruanyes (UPC Barcelona), and Prof. Gianluca Iaccarino (Stanford University). It appeared in the Journal of Computational Physics Volume 423 on 15 December 2020, and is reproduced in its entirety under Elsevier copyright permissions. The work described in this chapter was presented by me at the 2018 SIAM Conference on Uncertainty Quantification in Garden Grove, California and the 2019 SIAM Conference on Computational Science and Engineering in Spokane, Washington.

2.1 Introduction


Flow solvers use random access memory (RAM), I/O, and disk space to store solution states at different times for subsequent restart and post-processing. As the gap between data generation and storage performance has increased, numerical solvers have typically adapted by saving their state less often, viz. temporal or spatial sub-sampling. This can lead to the loss of important data, rendering it less useful in post-processing operations. This problem is of particular importance in the case of turbulent flows, as the number of spatial and time integration resolutions required to capture all the flow scales in direct numerical simulation (DNS) increases exponentially with the Reynolds number, Re. Extrapolating this trend to future supercomputing settings, storage subsystems may become considerably underpowered with respect to the number-crunching capacity. In this scenario, the affordable resulting data storage frequency will not be sufficient for conducting meaningful analyses. A similar problem is encountered in outer-loop studies, such as inference, uncertainty quantification (UQ), and optimization, in which large ensembles of model evaluations for different input values are performed, resulting in a rapid growth of data storage requirements [125, 84, 124]. The storage capacity and bandwidth limitations also complicate the applicability of time-decoupled strong recycling turbulence inflow methods [258], in which flow data for several characteristic integral times, e.g., eddy-turnover time in homogeneous isotropic turbulence (HIT) or flow-through time (FTT) in wall-bounded flows, are stored to disk to be reused later as inflow in spatially developing flow problems. If the prediction described above materializes, flow solvers will need to pursue new strategies in which the data size at each time slice is reduced before writing to disk, a process known as data compression.


cost of limited compression ratios [80, 208]. On the other hand, higher compression ratios can be obtained by using lossy data compression algorithms. This comes at the expense that the inverse transformation of the compressed data produces at best an approximation of the original data.

There are numerous existing methods in truly lossless compression, wherein the reconstructed data is bit-for-bit identical to the original [153]. Examples include the well-known method gzip [89], as well as entropy-based coders [218, 118], dictionary-based coders [270, 271], and predictive coders, e.g., FPC [32], FCM [214], and FPZIP [161]. In a related class of methods, near-lossless compressors, reconstructed data is not identical to the original data due to floating-point round-off errors. Examples from this class of approaches include transformation methods such as lossless Fourier and wavelet transform schemes [226]. Because of the limited compression ratio attained by these methods, coupled with the disk- and RAM-prohibitive magnitude of the data examined in Section 2.3, lossy compression methods for turbulent flow data are presented as a more appealing alternative. A more in-depth review of these strategies, as well as a more extensive list of sources, can be found in [153].


e.g., [245, 60, 140].

2.1.1 Contribution of this work

This work is concerned with temporal compression of large-scale fluid dynamics simulation data via low-rank matrix decomposition algorithms. Standard low-rank factorization methods for this task, such as standard proper orthogonal decomposition (POD) or principal component analysis (PCA), are not well suited to high-dimensional data due to their significant computational cost, high memory requirements, and limited parallel scalability [109]. In addition, they are not pass-efficient in that they need to access/read the data multiple times. Due to the memory bottlenecks inherent in large-scale simulations, methods which minimize the number of passes made over a data matrix are emphasized. In this work, we examine the utility of four pass-efficient, low-rank factorization techniques for temporal compression of turbulent flow data. These include a single-pass blocked randomized singular value decomposition (SBR-SVD) [263], as well as single-pass and two-pass variants of the interpolative decomposition (ID) [44]. The methods enable low-rank approximation of flows without scalability issues or bottlenecks in extension to higher dimensions. In building two of the ID methods, we propose using coarse grid (or grid sub-sampled) representations of the data, i.e., a sketch, as an alternative to random projections in order to accelerate the construction of the factorization. This particular choice of sketch enables a single-pass implementation of ID, which, to the best of the authors' knowledge, is the first single-pass ID algorithm. In addition, we provide convergence analysis of the ID schemes relying on coarse grid data.


compression factors. Our empirical observations indicate the overall superiority of SBR-SVD over single-pass ID in terms of the accuracy of the reconstructed data with similar compression factors. Additionally, SBR-SVD frequently leads to accuracy superior to that of the two-pass ID techniques. Practitioners may decide which scheme is best suited for their application and computing platform, balancing accuracy and performance.

Similar to this work, the authors of [9] present matrix decomposition methods as an effective compression technique for large-scale simulation data, though pass-efficiency is not emphasized in their work. In [28] and [269], the authors present online methods for maintaining a low-rank SVD approximation of simulation data via rank-one updates; this procedure requires significant computation at each step, however. Recent work by Tropp et al. [238] addresses this concern, though in their framework matrix updates arrive as sparse or rank-one linear updates; in the single-pass framework presented in this work, updates are assumed to arrive as row vectors into RAM. Moreover, in this work, the temporal (row) dimension of the PDE data matrix does not need to be known a priori.

As the focus of this chapter is achieving temporal compression, methods designed for spatial compression in simulations of turbulent flow, such as mesh reduction [253] and compressed sensing [23], are left for future investigation. Another interesting direction for future work is a formal comparison of temporal compression techniques based on low-rank factorization with the aforementioned (non-matrix) techniques, such as SZ and FPZIP.


2.2 Low-rank decomposition methods for data compression

2.2.1 Review of QR and SVD

Let $A \in \mathbb{R}^{m \times n}$ denote the data matrix of interest whose rows are time snapshots of flow data, e.g., pressure or velocity, as depicted in Figure 2.1. While the discussion below refers to this matrix, in practice $A$ is not formed explicitly due to the high dimensionality of the data, i.e., large $m$ and/or $n$. Instead, rows of $A$ are processed one at a time. The matrix decomposition methods presented in this work involve at their core two canonical decompositions: the QR decomposition and the SVD. QR factorization yields a decomposition of $A$ (more precisely $A^T$) of the form $A = QR$, where $Q \in \mathbb{R}^{m \times m}$ is a unitary matrix whose columns form an orthonormal basis for the column space of $A$ and $R$ is upper triangular [94]. This procedure is referred to as the range-finding step [109] in the algorithms described later in the paper. The three main approaches for computing this decomposition are pivoted Gram-Schmidt orthonormalization of the columns or rows of $A$, Householder reflections, and Givens rotations [94]. Of particular interest in this work is the rank-revealing QR algorithm, which relies on the full-pivoted Gram-Schmidt procedure [103]. In the execution of this procedure, $k$ pivot columns are selected to form an approximate basis for the range of $A$. The selection of these columns induces the concept of a numerical rank [117]. A matrix $A$ is said to be of numerical rank $k$ for some $\epsilon > 0$ if there exists a matrix $A_k$ of rank $k$ such that $\|A - A_k\| \le \epsilon$.
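The notion of numerical rank can be made concrete with a short computation. The following is a minimal sketch, assuming the spectral norm is used in the definition, in which case the best rank-$k$ error equals the $(k+1)$th singular value and the numerical rank is simply the number of singular values exceeding $\epsilon$. The function name `numerical_rank` and the test matrix are illustrative choices and not part of the original work.

```python
import numpy as np

def numerical_rank(A, eps):
    """Smallest k such that some rank-k matrix is within eps of A (spectral norm)."""
    sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    return int(np.sum(sigma > eps))

# Illustrative matrix with prescribed singular values (values are assumptions).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 4)))
V, _ = np.linalg.qr(rng.standard_normal((30, 4)))
A = U @ np.diag([1.0, 1e-1, 1e-2, 1e-6]) @ V.T

k = numerical_rank(A, eps=1e-3)   # three singular values exceed 1e-3
```

Tightening the tolerance to a value below $10^{-6}$ raises the numerical rank of this matrix to four, which illustrates that the numerical rank is a property of the pair $(A, \epsilon)$ rather than of $A$ alone.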


[Figure 2.1 graphic: the $m \times n$ matrix $A$, with rows stacked along the temporal direction.]

Figure 2.1: Schematic of a PDE data matrix A with m time solutions (rows) and n spatial degrees of freedom.

Also crucial to the compression methods explored is the singular value decomposition (SVD), defined as $A = U S V^T$ (the transpose is left unconjugated because the data in all applications is real in this work). The matrices $U \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{n \times n}$ have orthonormal columns, which form orthonormal bases for the column and row spaces of $A$, respectively. The matrix $S \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose entries are the singular values of $A$. To generate low-rank approximations of a matrix $A$, one may employ a truncated SVD, which yields a decomposition of the form $U_k S_k V_k^T$, with $U_k \in \mathbb{R}^{m \times k}$, $V_k \in \mathbb{R}^{n \times k}$, and $S_k \in \mathbb{R}^{k \times k}$. In this decomposition, the column spaces of $U_k$ and $V_k$ are approximations of the $k$-dimensional column and row subspaces of the matrix $A$, respectively, taken in correspondence to its $k$ largest singular values, which are the entries $\sigma_1, \dots, \sigma_k$ of the $k \times k$ diagonal matrix $S_k$. The product $D = U_k S_k V_k^T$ forms a rank-$k$ approximation of the matrix $A$. By the Eckart-Young theorem [77], a truncated SVD is the theoretically best rank-$k$ approximation of a matrix $A$ in Frobenius norm, i.e.,

$$\inf_{\operatorname{rank}(D) = k} \|A - D\|_F = \|A - U_k S_k V_k^T\|_F = \Big( \sum_{j=k+1}^{\min(m,n)} \sigma_j^2 \Big)^{1/2},$$


as well as in the spectral norm (induced 2-norm), in which case the lower bound is

$$\inf_{\operatorname{rank}(D) = k} \|A - D\|_2 = \|A - U_k S_k V_k^T\|_2 = \sigma_{k+1}.$$

2.2.2 Randomized algorithms

The first methods developed for computing a low-rank approximation of a matrix, e.g., via the SVD, are often computationally expensive, can require $O(k)$ passes over the input data matrix, lend themselves to limited parallel scalability, and are not designed to minimize memory movement [109]. Moreover, when such schemes are built in a purely deterministic framework, adversarial cases can be introduced, such as those presented by Kahan [130]. Developed to address these issues, randomized schemes have gained popularity in recent years in low-rank matrix factorizations. These methods rely on embedding the input matrix in a lower dimensional space via a random matrix $\Omega \in \mathbb{R}^{n \times l}$, with $l \ll n$,

$$A\Omega,$$

referred to as randomized projection [175, 158, 109, 173, 263]. The resulting matrix $A\Omega$ is called a sketch matrix [109]. The effectiveness of randomized matrix algorithms relies on the utility of the sketch matrix $A\Omega$. That is, they require that the column space of $A\Omega$ approximately spans the column space of $A$. The theoretical underpinning of random projections in numerical linear algebra is the Johnson-Lindenstrauss Lemma [126], which, roughly speaking, states that $O(n)$ points in a Euclidean space may be randomly embedded in an $O(\log(n))$-dimensional space such that pairwise distances between points are nearly preserved. This result precipitated the introduction of random matrices as dimensionality reduction tools in numerical linear algebra [109]. Matrices with i.i.d. Gaussian entries, used exclusively in this work, are a preeminent example of such random projections [175], though the literature is rife with other techniques for matrix sketching.


• Computing a rank-$k$ approximation of $A$ using deterministic methods, including some of those implemented in this work, requires $O(mnk)$ operations. By using randomized methods this can be reduced to $O(mn \log(k) + k^2(m + n))$ or better [109].

• Randomized methods require less communication than standard methods, which enables efficient implementation in low-communication environments such as graphics processing units (GPUs) [178].


2.2.3 Randomized SVD and single-pass algorithms

A natural application of random projection is in the construction of the SVD. Randomized SVD (R-SVD) methods approximate $A$ in the form $A \approx U S V^T$ for a given target rank $k$ via two main stages. In the interest of simpler notation, we hereafter drop the subscript $k$ from $U$, $V$, and $S$, but keep it in the Algorithms to facilitate their implementation. In the first stage of R-SVD, a basis $Q \in \mathbb{R}^{m \times l}$ of the approximate column space of $A$ is identified from the QR factorization of the sketch matrix $A\Omega$, where $\Omega \in \mathbb{R}^{n \times l}$ is, e.g., a Gaussian random matrix. Here $l = k + p$, where the so-called over-sampling parameter $p$ is a small number, e.g., 10 or 20. In the second stage, the SVD of the smaller matrix $B = Q^T A \approx \tilde{U} S V^T$ is computed. Recognizing that $A \approx QB$, the approximate SVD of $A$ is given by $A \approx (Q\tilde{U}) S V^T$, i.e., $U = Q\tilde{U}$. The details of these steps are described in Algorithm 1, reproduced from [109].

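Algorithm 1 Basic Randomized SVD (R-SVD) $A \approx U_k S_k V_k^T$

1: procedure R-SVD($A \in \mathbb{R}^{m \times n}$)
2: $k \leftarrow$ target rank
3: $p \leftarrow$ oversampling parameter
4: $l \leftarrow k + p$
5: $\Omega \leftarrow \mathrm{randn}(n, l)$
6: $Q, R \leftarrow \mathrm{qr}(A\Omega)$
7: $B \leftarrow Q^T A$
8: $\tilde{U}, S, V \leftarrow \mathrm{svd}(B)$
9: $U \leftarrow Q\tilde{U}$
10: $U_k \leftarrow U(:, 1:k)$; $S_k \leftarrow S(1:k, 1:k)$; $V_k \leftarrow V(:, 1:k)$
11: return $U_k, S_k, V_k$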
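The two-pass structure of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the thesis; the function name `rsvd`, the default over-sampling value, and the test matrix are assumptions made for the example.

```python
import numpy as np

def rsvd(A, k, p=10, rng=None):
    """Basic two-pass randomized SVD, A ~= U_k diag(s_k) V_k^T."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    l = min(k + p, m, n)                       # sketch width, capped by the matrix size
    Omega = rng.standard_normal((n, l))        # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)             # first pass over A: range finding
    B = Q.T @ A                                # second pass over A: projection
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde
    return U[:, :k], s[:k], Vt[:k, :].T

# Illustrative test: a matrix of exact rank 8 is recovered to round-off.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 120))
U, s, V = rsvd(A, k=8, p=5, rng=1)
rel_err = np.linalg.norm(A - (U * s) @ V.T) / np.linalg.norm(A)
```

The two products `A @ Omega` and `Q.T @ A` are the only places where the full matrix is touched, which is why the algorithm counts as two-pass.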

More recently, multiple improvements of the R-SVD implementation in Algorithm 1 have been proposed to address the pass and parallel efficiency of R-SVD, which are explained next. Firstly, Algorithm 1 requires two passes over the data matrix $A$ (Steps 6 and 7), which makes it less attractive for the compression of large-scale PDE data that are expensive to store or re-generate. Halko et al. [109, Section 5.5] proposed a single-pass extension of R-SVD as follows:


(2) Compute the products $A\Omega$ and $A^T\tilde{\Omega}$ in a single pass over $A$

(3) Using these two products, compute the two QR decompositions $A\Omega = QR$ and $A^T\tilde{\Omega} = \tilde{Q}\tilde{R}$

(4) Solve for the matrix $B$, given by the minimum residual solution to the relations $Q^T(A\Omega) = B\tilde{Q}^T\Omega$ and $\tilde{Q}^T(A^T\tilde{\Omega}) = B^T Q^T\tilde{\Omega}$

(5) Compute the SVD of the small matrix $B$, yielding $B \approx \tilde{U} S \tilde{V}^T$

(6) Form the matrices $U = Q\tilde{U}$ and $V = \tilde{Q}\tilde{V}$, and set $U_k = U(:, 1:k)$, $S_k = S(1:k, 1:k)$, $V_k = V(:, 1:k)$, to obtain the truncated SVD $A \approx U_k S_k V_k^T$ [109]
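The steps listed above can be sketched as follows. The first step of the list is not shown in this text, so the sketch assumes that it draws two Gaussian test matrices, $\Omega \in \mathbb{R}^{n \times l}$ and $\tilde{\Omega} \in \mathbb{R}^{m \times l}$. The least-squares problem of step (4) is solved here by stacking both relations in vectorized (Kronecker) form, which is one possible choice and only practical for small $l$. The function name `single_pass_rsvd` and the test matrix are hypothetical.

```python
import numpy as np

def single_pass_rsvd(rows, m, n, k, l, seed=0):
    """Single-pass randomized SVD sketch; `rows` yields the rows of A once."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, l))       # assumed step (1): Gaussian test matrices
    Omega_t = rng.standard_normal((m, l))
    Y = np.zeros((m, l))                      # accumulates A @ Omega
    Y_t = np.zeros((n, l))                    # accumulates A.T @ Omega_t
    for i, a in enumerate(rows):              # step (2): the only pass over A
        Y[i] = a @ Omega
        Y_t += np.outer(a, Omega_t[i])
    Q, _ = np.linalg.qr(Y)                    # step (3)
    Q_t, _ = np.linalg.qr(Y_t)
    # Step (4): B @ M1 = N1 and M2.T @ B = N2.T, solved jointly for vec(B).
    M1, N1 = Q_t.T @ Omega, Q.T @ Y
    M2, N2 = Q.T @ Omega_t, Q_t.T @ Y_t
    eye = np.eye(l)
    K = np.vstack([np.kron(M1.T, eye), np.kron(eye, M2.T)])
    rhs = np.concatenate([N1.flatten(order="F"), N2.T.flatten(order="F")])
    B = np.linalg.lstsq(K, rhs, rcond=None)[0].reshape((l, l), order="F")
    U_s, s, Vt_s = np.linalg.svd(B)           # step (5)
    U, V = Q @ U_s, Q_t @ Vt_s.T              # step (6): map both factors back
    return U[:, :k], s[:k], V[:, :k]

# Illustrative test on a matrix of exact rank 6 with l = k = 6.
rng = np.random.default_rng(1)
A = rng.standard_normal((80, 6)) @ rng.standard_normal((6, 60))
U, s, V = single_pass_rsvd(iter(A), m=80, n=60, k=6, l=6)
rel_err = np.linalg.norm(A - (U * s) @ V.T) / np.linalg.norm(A)
```

Note that this variant needs the row dimension $m$ before the pass begins, since $\tilde{\Omega}$ has $m$ rows.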

The main drawback of this method lies in step (4) above, where the typically ill-conditioned matrix $Q^T\tilde{\Omega}$ may lead to considerable accumulation of error compared to the double-pass method presented in Algorithm 1 [109]. In order to reduce the communication in the parallel implementation of R-SVD in Algorithm 1 and enable adaptive rank determination, a blocked formulation of the standard algorithm above is proposed in [178]. In this approach, the orthogonal matrix $Q \in \mathbb{R}^{m \times l}$ is separated into $s$ blocks, each of size $m \times b$, in the form

$$Q = [Q_1, Q_2, \cdots, Q_s],$$

where $s \times b = l$. The computation of the matrix $Q$ is then decoupled and carried out on the smaller blocks $Q_i$, with all blocks orthogonalized via Gram-Schmidt and concatenated at the end of the process. In doing so, the rank $k$ can be determined so that the factorization admits a prescribed error $\epsilon$. In the interest of clarity and completeness, the main steps of this blocked formulation from [178] are now reported, which constitute an alteration to Steps 5-7 of the standard R-SVD in Algorithm 1.

(1) For each block i = 1, 2, 3, ..., s do


(3) Compute the QR factorization of $A\Omega_i$ to obtain $Q_i$

(4) Re-orthonormalize $Q_i - \sum_{j=1}^{i-1} Q_j Q_j^T Q_i$

(5) Compute $B_i = Q_i^T A$

(6) Set $A = A - Q_i B_i$

(7) If $\|A\| < \epsilon$, stop

(8) Construct $Q = [Q_1, Q_2, \cdots, Q_s]$; $B = [B_1^T, B_2^T, \cdots, B_s^T]^T$

However, the above implementation of the blocking procedure increases the number of passes through $A$ to $O(s)$.
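The blocked procedure can be sketched as below. The second step of the list is not shown in this text; the sketch assumes that it draws a fresh Gaussian block $\Omega_i \in \mathbb{R}^{n \times b}$ for each $i$. The function name `blocked_qb`, the Frobenius norm in the stopping test, and the test matrix are illustrative assumptions.

```python
import numpy as np

def blocked_qb(A, b, eps, max_blocks, seed=0):
    """Blocked randomized range finder returning Q, B with A ~= Q @ B."""
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=float)                  # working copy, deflated block by block
    m, n = A.shape
    Q = np.zeros((m, 0))
    B = np.zeros((0, n))
    for _ in range(max_blocks):                   # step (1)
        Omega_i = rng.standard_normal((n, b))     # assumed step (2)
        Q_i, _ = np.linalg.qr(A @ Omega_i)        # step (3)
        Q_i, _ = np.linalg.qr(Q_i - Q @ (Q.T @ Q_i))   # step (4)
        B_i = Q_i.T @ A                           # step (5)
        A -= Q_i @ B_i                            # step (6)
        Q = np.hstack([Q, Q_i])                   # step (8), built incrementally
        B = np.vstack([B, B_i])
        if np.linalg.norm(A) < eps:               # step (7)
            break
    return Q, B

# Illustrative test: a rank-8 matrix is captured after two blocks of width 4.
rng = np.random.default_rng(2)
A0 = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))
eps = 1e-8 * np.linalg.norm(A0)
Q, B = blocked_qb(A0, b=4, eps=eps, max_blocks=10)
```

Each pass through the loop touches the (deflated) matrix again in steps (3), (5), and (6), which is the source of the $O(s)$ pass count noted above.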

In a recent work, Yu et al. [263] proposed a single-pass formulation of R-SVD that is shown empirically to result in more accurate low-rank factorizations, as compared to the single-pass R-SVD of [109]. In more detail, the approach of [263] generates an approximate truncated R-SVD $A \approx U_k S_k V_k^T$ of rank $k$ following the steps below:

(1) Generate a Gaussian matrix $\Omega \in \mathbb{R}^{n \times l}$, where $k < l \ll n$

(2) Obtain the matrices $Y = A\Omega$ and $B = A^T Y$ in a single pass over $A$ (Steps 11-15 in Algorithm 2)

(3) Compute a QR decomposition $Y = QR$ and set $B = BR^{-1}$, so that $B \approx A^T Q$ (Steps 16-24 in Algorithm 2)

(4) Compute the SVD of the small matrix $B^T \approx \tilde{U} S V^T$

(5) Construct $U = Q\tilde{U}$ and set $U_k = U(:, 1:k)$, $S_k = S(1:k, 1:k)$, $V_k = V(:, 1:k)$ to extract


As discussed in [263], Step (3) of this implementation can be performed in a blocked and, more importantly, single-pass mode resulting in Algorithm 2, herein referred to as single-pass blocked randomized SVD (SBR-SVD) and employed in the numerical examples.

Algorithm 2 Single-pass Blocked Randomized SVD (SBR-SVD) $A \approx U_k S_k V_k^T$ [263]

1: procedure SBR-SVD($A \in \mathbb{R}^{m \times n}$)
2: $k \leftarrow$ target rank
3: $p \leftarrow$ over-sampling parameter
4: $l \leftarrow k + p$
5: $b \leftarrow$ block size
6: $s \leftarrow$ number of blocks such that $s \times b = l$
7: instantiate $Q$, $B$
8: $\Omega \leftarrow \mathrm{randn}(n, l)$
9: instantiate $G$
10: $H \leftarrow \mathrm{zeros}(n, l)$
11: while $A$ is not entirely read through do
12: read the next row $a$ of $A$
13: $g \leftarrow a\Omega$; $G \leftarrow [G; g]$
14: $H \leftarrow H + a^T g$
15: end while
16: for $i = 1, 2, \dots, s$ do
17: $\Omega_i \leftarrow \Omega(:, (i-1)b+1 : ib)$
18: $Y_i \leftarrow G(:, (i-1)b+1 : ib) - Q(B\Omega_i)$
19: $Q_i, R_i \leftarrow \mathrm{qr}(Y_i)$
20: $Q_i, \tilde{R}_i \leftarrow \mathrm{qr}(Q_i - Q(Q^T Q_i))$
21: $R_i \leftarrow \tilde{R}_i R_i$
22: $B_i \leftarrow R_i^{-T}\big(H(:, (i-1)b+1 : ib)^T - Y_i^T Q B - \Omega_i^T B^T B\big)$
23: $Q \leftarrow [Q, Q_i]$; $B \leftarrow [B^T, B_i^T]^T$
24: end for
25: $\tilde{U}, S, V \leftarrow \mathrm{svd}(B)$
26: $U \leftarrow Q\tilde{U}$
27: $U_k \leftarrow U(:, 1:k)$; $S_k \leftarrow S(1:k, 1:k)$; $V_k \leftarrow V(:, 1:k)$
28: return $U_k, S_k, V_k$
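A NumPy sketch of Algorithm 2 is given below to make the data flow explicit: the matrix is read once, row by row, and only the sketches $G = A\Omega$ and $H = A^T A\Omega$ are retained. This is an illustrative sketch under stated assumptions, not the thesis implementation. The function name `sbr_svd`, rounding $l$ up to a multiple of $b$, and the synthetic test matrix are choices made for the example, and the code assumes every block $Y_i$ has full column rank so that $R_i$ is invertible.

```python
import numpy as np

def sbr_svd(rows, n, k, p=10, b=5, seed=0):
    """Single-pass blocked randomized SVD; `rows` yields the rows of A once."""
    rng = np.random.default_rng(seed)
    s = -(-(k + p) // b)                     # number of blocks (ceiling division)
    l = s * b
    Omega = rng.standard_normal((n, l))
    G_rows, H = [], np.zeros((n, l))
    for a in rows:                           # steps 11-15: the only pass over A
        a = np.asarray(a, dtype=float)
        g = a @ Omega
        G_rows.append(g)
        H += np.outer(a, g)
    G = np.vstack(G_rows)                    # G = A @ Omega, H = A.T @ G
    Q = np.zeros((G.shape[0], 0))
    B = np.zeros((0, n))
    for i in range(s):                       # steps 16-24: no further access to A
        cols = slice(i * b, (i + 1) * b)
        Omega_i = Omega[:, cols]
        Y_i = G[:, cols] - Q @ (B @ Omega_i)
        Q_i, R_i = np.linalg.qr(Y_i)
        Q_i, R_t = np.linalg.qr(Q_i - Q @ (Q.T @ Q_i))   # re-orthogonalize against Q
        R_i = R_t @ R_i
        rhs = H[:, cols].T - Y_i.T @ Q @ B - Omega_i.T @ B.T @ B
        B_i = np.linalg.solve(R_i.T, rhs)    # applies R_i^{-T}
        Q = np.hstack([Q, Q_i])
        B = np.vstack([B, B_i])
    U_t, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_t
    return U[:, :k], S[:k], Vt[:k].T

# Illustrative test matrix with geometrically decaying singular values.
rng = np.random.default_rng(3)
U0, _ = np.linalg.qr(rng.standard_normal((300, 40)))
V0, _ = np.linalg.qr(rng.standard_normal((100, 40)))
sigma = 0.6 ** np.arange(40)
A = (U0 * sigma) @ V0.T
U, S, V = sbr_svd(iter(A), n=100, k=8, p=7, b=5)
err = np.linalg.norm(A - (U * S) @ V.T)
opt = np.sqrt(np.sum(sigma[8:] ** 2))        # optimal rank-8 Frobenius error
```

Since rows are appended to `G` as they arrive, the number of rows $m$ does not need to be known before the pass begins.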


In words, the error is on average only worse than the optimal truncated SVD solution by a factor of $[1 + k/(p-1)]^{1/2}$ [263].

The SBR-SVD algorithm has computational complexity $O(mnk)$, which can be reduced to $O(mn \log k)$ when implemented with certain optimized matrix sketches, including the sub-sampled random Fourier transform [109, 257, 211]. In addition, it has an approximate processing storage requirement of $l(m + 2n)$ elements in RAM during execution [263]. Other single-pass implementations of randomized SVD are available in the literature, see, e.g., [27, 49, 237, 243, 256], but are not considered in this work.

2.2.4 Interpolative decomposition (ID) and its randomized variant

The row ID, a two-pass algorithm, generates a decomposition of a matrix $A$ of the form [44]

$$A \approx P A(\mathcal{I}, :), \tag{2.1}$$

where the row skeleton $A(\mathcal{I}, :) \in \mathbb{R}^{k \times n}$ consists of a set of rows of $A$ indexed by $\mathcal{I} \subseteq \{1, \dots, m\}$ with size $|\mathcal{I}| = k$. Further, $P \in \mathbb{R}^{m \times k}$ is a coefficient matrix such that $P(\mathcal{I}, :) = I$, with $I$ the identity matrix. Row ID earns its name from the fact that it interpolates $A$ in a basis consisting of a subset of its rows. The core procedures in the algorithm used to generate this decomposition are the column-pivoted (rank-revealing) QR algorithm [103], which yields the index vector $\mathcal{I}$, and a least squares problem to compute the coefficient matrix $P$. In more detail, first, the rank $k$ column-pivoted QR decomposition of $A^T$ is computed,

$$A^T Z \approx QR,$$

where $Z \in \mathbb{R}^{m \times m}$ is a permutation matrix encoding the pivoting done in the algorithm, $Q \in$


approximating $R_2 \approx R_1 C$ yields

$$A^T \approx Q R_1 [I \mid C] Z^T = A^T(:, \mathcal{I}) [I \mid C] Z^T = A^T(:, \mathcal{I}) P^T,$$

and hence (2.1). As stated in [110], the rank $k$ row ID of an $m \times n$ matrix features a spectral error bound of

$$\|A - P A(\mathcal{I}, :)\|_2 \le \sqrt{1 + k(m - k)}\, \sigma_{k+1}, \tag{2.2}$$

and a computational complexity of $O(mnk)$. Algorithm 3 summarizes the steps involved in ID, which are used in this study. To improve the stability of the QR factorization step, the modified Gram-Schmidt (mgsqr) procedure of [94] is employed in Step 3.

Remark 2.2.1. The ID algorithms presented in this study do not constitute a complete list of methods for generating decompositions of matrices which interpolate a subset of their rows. Other approaches for generating similar decompositions can be found in, e.g., [169, 79, 76]. It is also important to note that in the numerical results presented in Section 2.3, the ID refers to a specific variation of the decomposition, the row ID. Analogous definitions of a column ID or double-sided ID are also available in the literature [173].

Algorithm 3 Row ID $A \approx P A(\mathcal{I}, :)$ [44]

1: procedure ID($A \in \mathbb{R}^{m \times n}$)
2: $k \leftarrow$ approximation rank
3: $Q, R, \mathcal{I} \leftarrow \mathrm{mgsqr}(A^T, k)$ (First pass through $A$)
4: $C \leftarrow (R(1:k, 1:k))^{+} R(1:k, (k+1):m)$ ($+$ denotes pseudo-inverse)
5: $Z \leftarrow I_m(:, [\mathcal{I}, \mathcal{I}^c])$ ($\mathcal{I}^c$ is the complement of $\mathcal{I}$ in $\{1, \dots, m\}$)
6: $P \leftarrow Z [I_k \mid C]^T$
7: return $A(\mathcal{I}, :), P$ (Collecting $A(\mathcal{I}, :)$ requires second pass over $A$)
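Algorithm 3 can be sketched in NumPy as follows. The helper `mgsqr` below is a hypothetical, minimal column-pivoted modified Gram-Schmidt written for this example (it stands in for the procedure of [94] and assumes the input has rank at least $k$). The permutation $Z$ is applied implicitly by writing rows of $P$ at the pivot positions. Function names and the test matrix are assumptions.

```python
import numpy as np

def mgsqr(M, k):
    """Column-pivoted modified Gram-Schmidt: M[:, piv] ~= Q @ R, Q with k columns."""
    M = np.array(M, dtype=float)              # working copy, overwritten by residuals
    piv = np.arange(M.shape[1])
    Q = np.zeros((M.shape[0], k))
    R = np.zeros((k, M.shape[1]))
    for j in range(k):
        jmax = j + int(np.argmax(np.linalg.norm(M[:, j:], axis=0)))
        M[:, [j, jmax]] = M[:, [jmax, j]]     # bring the largest residual column to slot j
        R[:, [j, jmax]] = R[:, [jmax, j]]
        piv[[j, jmax]] = piv[[jmax, j]]
        R[j, j] = np.linalg.norm(M[:, j])
        Q[:, j] = M[:, j] / R[j, j]
        R[j, j + 1:] = Q[:, j] @ M[:, j + 1:]
        M[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, piv

def row_id(A, k):
    """Row ID: returns indices I and P with A ~= P @ A[I, :] and P[I, :] = identity."""
    m = A.shape[0]
    _, R, piv = mgsqr(A.T, k)                 # pivots of A^T select rows of A
    C = np.linalg.pinv(R[:, :k]) @ R[:, k:]   # least-squares fit R_2 ~= R_1 C
    P = np.zeros((m, k))
    P[piv[:k]] = np.eye(k)                    # P = Z [I_k | C]^T, applied row-wise
    P[piv[k:]] = C.T
    return piv[:k], P

# Illustrative test on a matrix of exact rank 6.
rng = np.random.default_rng(4)
A = rng.standard_normal((80, 6)) @ rng.standard_normal((6, 50))
idx, P = row_id(A, k=6)
rel_err = np.linalg.norm(A - P @ A[idx]) / np.linalg.norm(A)
```

In a streaming setting, forming `A[idx]` at the end corresponds to the second pass over the data noted in Step 7.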


of $A$ using which an approximate ID of $A$ is computed. Specifically, randomized ID (Algorithm 4) performs the ID of the sketch matrix $Y = A\Omega$, with a random matrix $\Omega \in \mathbb{R}^{n \times l}$, as

$$Y \approx \tilde{P}\, Y(\tilde{\mathcal{I}}, :).$$

It then uses the row indices $\tilde{\mathcal{I}}$ to set the row skeleton $A(\tilde{\mathcal{I}}, :)$ and applies the same coefficient matrix $\tilde{P}$; that is,

$$A \approx \tilde{P} A(\tilde{\mathcal{I}}, :).$$

As shown in [176], with high probability depending on $l$, the randomized ID leads to a low-rank factorization of $A$ admitting bounds comparable to that of the standard ID in (2.2) but with larger constants.

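Algorithm 4 Randomized (Row) ID (Gaussian) [158] $A \approx \tilde{P} A(\tilde{\mathcal{I}}, :)$

1: procedure RandID($A \in \mathbb{R}^{m \times n}$)
2: $k \leftarrow$ approximation rank
3: Generate $\Omega \in \mathbb{R}^{n \times l}$ with i.i.d. Gaussian entries and $l = k + p$
4: $Y \leftarrow A\Omega$ (First pass through $A$)
5: $\tilde{Q}, \tilde{R}, \tilde{\mathcal{I}} \leftarrow \mathrm{mgsqr}(Y^T, k)$
6: $\tilde{C} \leftarrow (\tilde{R}(1:k, 1:k))^{+} \tilde{R}(1:k, (k+1):m)$ ($+$ denotes pseudo-inverse)
7: $\tilde{Z} \leftarrow I_m(:, [\tilde{\mathcal{I}}, \tilde{\mathcal{I}}^c])$ ($\tilde{\mathcal{I}}^c$ is the complement of $\tilde{\mathcal{I}}$ in $\{1, \dots, m\}$)
8: $\tilde{P} \leftarrow \tilde{Z} [I_k \mid \tilde{C}]^T$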


2.2.5 Sub-sampled interpolative decomposition

Motivated by the randomized ID approach, a faster variant of ID that relies on grid (or data) sub-sampling to generate a deterministic sketch of the original data matrix $A$ is proposed. Such a sketch is of particular interest, as it can be generated faster than the random projection (at least when $\Omega$ is dense) by relying on the simple observation that, for smooth solutions, a coarse representation of the data is able to capture the low-rank subspace of the full solution.

Let $\mathcal{J} \subseteq \{1, \dots, n\}$, with $k < |\mathcal{J}| = n_c \ll n$, denote the indices of the degrees of freedom associated with a coarse subset of the original degrees of freedom, i.e., a coarse grid or sub-sampled representation of the data. The sub-sampled ID generates the rank $k$ ID of the sub-sampled data $A_c$,

$$A_c = A(:, \mathcal{J}), \tag{2.3}$$

as

$$A_c \approx \hat{A}_c = P_c A_c(\mathcal{I}_c, :), \tag{2.4}$$

which then induces an ID for the original matrix $A$ as

$$A \approx \hat{A} = P_c A(\mathcal{I}_c, :). \tag{2.5}$$

In words, the indices and coefficient matrix obtained from the ID of the coarse grid data in (2.4) are used to generate an interpolation rule for the full data matrix $A$, a procedure called lifting in [191]. This form of dimensionality reduction (i.e., directly sub-sampling the columns of the input matrix) is referred to as direct injection in the geometric multi-grid literature [107].

In addition to reducing the number of columns to be stored from $n$ to $n_c$, the complexity of computing the coarse-grid ID (2.4) is $O(m n_c k)$ instead of the $O(mnk)$ needed for $A$ (see Table 2.1). Moreover, the sketch time required to obtain $A_c$ is $O(1)$, a far better complexity than that of,


In the context of reduced order modeling and uncertainty quantification, the use of coarse grid data to guide the low-rank approximation of a fine grid quantity of interest has been successfully considered in several recent works; see, e.g., [68, 191, 67, 83, 110]. One such result, [110, Theorem 1], is employed to bound the error of sub-sampled ID.

Theorem 2.2.1. Let $A$ be the original data matrix, $A_c$ the sub-sampled (coarsened) matrix as in (2.3), $\hat{A}_c$ the rank $k$ ID approximation to $A_c$ as in (2.4), and $\hat{A}$ the sub-sampled ID approximation as in (2.5). For any $\tau \ge 0$, let

$$\epsilon(\tau) := \lambda_{\max}(A A^T - \tau A_c A_c^T),$$

where $\lambda_{\max}$ denotes the largest eigenvalue. Then,

$$\|A - \hat{A}\|_2 \le \min_{\tau,\, k \le \operatorname{rank}(A_c)} \rho_k(\tau), \tag{2.6}$$

$$\rho_k(\tau) := (1 + \|P_c\|_2) \sqrt{\tau \sigma_{k+1}^2 + \epsilon(\tau)} + \|A_c - \hat{A}_c\|_2 \sqrt{\tau + \epsilon(\tau) \sigma_k^{-2}}, \tag{2.7}$$

where $\sigma_k$ and $\sigma_{k+1}$ are the $k$th and $(k+1)$th largest singular values of $A_c$, respectively.

The interested reader is referred to [110, Theorem 1] for the details of the proof of this theorem. Instead, some remarks regarding the results are provided next. Firstly, following [176], $\|P_c\|_2 \le \sqrt{k(m - k) + 1}$. The effectiveness of the sub-sampled ID approximation $\hat{A}$ depends on the assumptions that $A_c$ is low-rank and the optimal $\epsilon(\tau)$ is small. The former assumption follows from the assumption that $A$ is low-rank. To investigate $\epsilon(\tau)$, consider the case in which the physical domain of the PDE is a subset of $\mathbb{R}^3$ and the original data is generated via a uniform grid of size $h$ in each direction. It is also assumed that the sub-sampled data corresponds to a coarse subset of this grid with uniform size $H \gg h$ in each direction. It can be shown that, for $\tau^* = n/n_c$,


where $\lesssim$ denotes an inequality that holds up to a bounded constant. To see this, we observe that the entries $(i, j)$ of the Gramians $A A^T / n$ and $A_c A_c^T / n_c$ are approximations of the Euclidean inner product of the solution at times $i$ and $j$ via a rectangular (piecewise constant) rule of size $h$ and $H$, respectively. Therefore, with $\tau^* = n/n_c$ and $\|\cdot\|_{\max}$ the maximum entry in terms of absolute value in the argument, $\|A A^T - \tau^* A_c A_c^T\|_{\max} \lesssim H$, assuming the data snapshots have bounded first derivatives. Then, (2.8) follows given that $\epsilon(\tau) = \|A A^T - \tau A_c A_c^T\|_2 \le m \|A A^T - \tau A_c A_c^T\|_{\max}$. When the singular values of $A_c$ decay rapidly, together with (2.6), this dependence of $\epsilon(\tau^*)$ on $H$ suggests the error estimate $\|A - \hat{A}\|_2 \lesssim H^{1/2}$ for the sub-sampled ID approximation.

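Algorithm 5 Sub-sampled ID $A \approx P_c A(\mathcal{I}_c, :)$

1: procedure SubID($A \in \mathbb{R}^{m \times n}$)
2: $k \leftarrow$ approximation rank
3: $A_c^T \leftarrow \text{sub-sample}(A^T)$ (First pass through $A$)
4: $Q_c, R_c, \mathcal{I}_c \leftarrow \mathrm{mgsqr}(A_c^T, k)$
5: $C_c \leftarrow (R_c(1:k, 1:k))^{+} R_c(1:k, (k+1):m)$ ($+$ denotes pseudo-inverse)
6: $Z_c \leftarrow I_m(:, [\mathcal{I}_c, \mathcal{I}_c^c])$ ($\mathcal{I}_c^c$ is the complement of $\mathcal{I}_c$ in $\{1, \dots, m\}$)
7: $P_c \leftarrow Z_c [I_k \mid C_c]^T$
8: return $A(\mathcal{I}_c, :), P_c$ (Collecting $A(\mathcal{I}_c, :)$ requires second pass over $A$)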
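The sub-sampled ID of Algorithm 5 can be sketched as below: the ID is computed on a strided subset of columns and then lifted to the full matrix. The helper `mgsqr` is a hypothetical, minimal column-pivoted modified Gram-Schmidt included so that the example is self-contained. The uniform stride, the function names, and the smooth synthetic field (a sum of five separable modes on a one-dimensional grid) are assumptions made for illustration.

```python
import numpy as np

def mgsqr(M, k):
    """Column-pivoted modified Gram-Schmidt: M[:, piv] ~= Q @ R, Q with k columns."""
    M = np.array(M, dtype=float)
    piv = np.arange(M.shape[1])
    Q = np.zeros((M.shape[0], k))
    R = np.zeros((k, M.shape[1]))
    for j in range(k):
        jmax = j + int(np.argmax(np.linalg.norm(M[:, j:], axis=0)))
        M[:, [j, jmax]] = M[:, [jmax, j]]
        R[:, [j, jmax]] = R[:, [jmax, j]]
        piv[[j, jmax]] = piv[[jmax, j]]
        R[j, j] = np.linalg.norm(M[:, j])
        Q[:, j] = M[:, j] / R[j, j]
        R[j, j + 1:] = Q[:, j] @ M[:, j + 1:]
        M[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, piv

def sub_sampled_id(A, cols, k):
    """ID of the coarse matrix A[:, cols]; returns I_c and P_c for lifting to A."""
    A_c = A[:, cols]                          # deterministic sketch: column sub-sampling
    _, R, piv = mgsqr(A_c.T, k)
    C = np.linalg.pinv(R[:, :k]) @ R[:, k:]
    P_c = np.zeros((A.shape[0], k))
    P_c[piv[:k]] = np.eye(k)
    P_c[piv[k:]] = C.T
    return piv[:k], P_c

# Smooth synthetic data of exact rank 5 (rows are snapshots in time).
m, n, k = 60, 400, 5
t = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
x = np.linspace(0.0, 1.0, n)
A = sum(np.outer(np.cos(r * t), np.sin(r * np.pi * x)) / r for r in range(1, k + 1))

idx, P_c = sub_sampled_id(A, np.arange(0, n, 8), k)   # keep every 8th grid point
rel_err = np.linalg.norm(A - P_c @ A[idx]) / np.linalg.norm(A)
```

For this exactly low-rank example the coarse grid still resolves all five spatial modes, so the lifted approximation is accurate to round-off; for general data the error is governed by the bound in Theorem 2.2.1.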

2.2.6 Single-pass interpolative decomposition

In this section, a simple, single-pass algorithm for generating ID, dubbed single-pass ID (Algorithm 6), is presented. To the best of our knowledge, this is the first single-pass ID algorithm. The standard, randomized, or sub-sampled ID algorithms described in Sections 2.2.4-2.2.5 require a second pass through the data to extract the row skeleton matrix $A(\mathcal{I}_c, :)$ after identifying the row indices given in $\mathcal{I}_c$ in the first pass. Instead, in single-pass ID, the row skeleton of the coarse data matrix, i.e., $A_c(\mathcal{I}_c, :)$, is interpolated back to the original grid in order to form an approximation to $A(\mathcal{I}_c, :)$. Specifically,

$$A \approx \hat{A} = P_c A_c(\mathcal{I}_c, :) M, \tag{2.9}$$

where $P_c$ and $A_c(\mathcal{I}_c, :)$ are as in (2.4) and $M \in \mathbb{R}^{n_c \times n}$ is a coarse to fine interpolation operator. Such

(48)

by the observation that the rows of Ac(Ic, :) are coarse grid representation of those of A(Ic, :).

Using one fewer pass reduces loading and simulation time but sacrifices accuracy: depending on the sub-sampling factor $n/n_c$, this method may incur large error when interpolating back onto the fine grid. Therefore, the use of the single-pass ID is justified when rerunning the PDE solver or making a second pass through the data to set $A(\mathcal{I}_c, :)$ is not desirable. Notice that the interpolation step $A_c(\mathcal{I}_c, :)M$ in (2.9) is more computationally expensive than simply accessing the rows of $A$ indexed by $\mathcal{I}_c$ to set $A(\mathcal{I}_c, :)$ in the randomized or sub-sampled ID. However, the interpolation can be done when the data matrix $A$ is to be reconstructed, as opposed to during the PDE simulation or data compression steps. This in turn leads to overall savings in memory movement and simulation time.

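Algorithm 6 Single-pass ID $A \approx P_c A_c(\mathcal{I}_c, :) M$

1: procedure SPID($A \in \mathbb{R}^{m \times n}$, Mesh)

2: $k \leftarrow$ approximation rank

3: $A_c^T \leftarrow$ sub-sample($A^T$) (Only pass through $A$)

4: Form the interpolation matrix $M$

5: $Q_c, R_c, \mathcal{I}_c \leftarrow$ mgsqr($A_c^T$, $k$)

6: $C_c \leftarrow (R_c(1:k, 1:k))^{+} R_c(1:k, (k+1):m)$ ($+$ denotes pseudo-inverse)

7: $Z_c \leftarrow I_m(:, [\mathcal{I}_c, \mathcal{I}_c^c])$ ($\mathcal{I}_c^c$ is the complement of $\mathcal{I}_c$ in $\{1, \dots, m\}$)

8: $P_c \leftarrow Z_c [I_k \mid C_c]^T$

9: return $A_c(\mathcal{I}_c, :)$, $P_c$, $M$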

The computational complexity of single-pass ID is the same as that of the sub-sampled ID, i.e., $\mathcal{O}(m n_c k)$. Let $t$ denote the number of coarse grid data points used to construct each of the $n$ fine grid data points. The cost of building and storing the interpolation operator $M$, a sparse matrix, is $\mathcal{O}(tn)$, which is negligible relative to any asymptotic dependence on $m n_c k$. In the absence of any spatial compression, the total disk memory required to store the output of the algorithm is far less than that of the other ID (or SVD) methods described in the prior sections. This is because the coarse skeleton matrix $A_c(\mathcal{I}_c, :)$ is stored instead of its fine counterpart. This variant of ID therefore yields


where $tn$ corresponds to the interpolation matrix $M$. Notice that the steps of single-pass ID are the same as those of sub-sampled ID, apart from the interpolation step. The interpolation operator can be represented as a sparse matrix $M$ (constructed in Step 4 of Algorithm 6), which can then be stored to disk and applied to the coarse row skeleton $A_c(\mathcal{I}_c, :)$ following execution of the algorithm.
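As an illustration of how such a sparse interpolation operator might be stored and applied, the sketch below assumes a one-dimensional uniform grid and piece-wise linear interpolation, so that $t = 2$. The operator is kept as two index arrays and one weight array ($\mathcal{O}(tn)$ storage) rather than as an explicit matrix. The function names and the test data are hypothetical and not part of Algorithm 6.

```python
import numpy as np

def linear_interp_operator(x_coarse, x_fine):
    """Sparse form of M: column j of X @ M equals
    (1 - w[j]) * X[:, left[j]] + w[j] * X[:, right[j]]."""
    right = np.clip(np.searchsorted(x_coarse, x_fine), 1, len(x_coarse) - 1)
    left = right - 1
    w = (x_fine - x_coarse[left]) / (x_coarse[right] - x_coarse[left])
    return left, right, w

def apply_interp(X_coarse, op):
    """Apply the stored operator to coarse rows, i.e. compute X_coarse @ M."""
    left, right, w = op
    return X_coarse[:, left] * (1.0 - w) + X_coarse[:, right] * w

# Fine grid with n = 257 points; the coarse grid keeps every 8th point.
stride = 8
x_fine = np.linspace(0.0, 1.0, 257)
x_coarse = x_fine[::stride]
H = x_coarse[1] - x_coarse[0]

# Hypothetical smooth rows standing in for the skeleton rows.
freqs = np.array([1.0, 2.0, 3.0])
X = np.sin(2 * np.pi * np.outer(freqs, x_fine))
X_coarse = X[:, ::stride]

op = linear_interp_operator(x_coarse, x_fine)
X_rec = apply_interp(X_coarse, op)
max_err = np.max(np.abs(X - X_rec))
print(f"H = {H:.4f}, max interpolation error = {max_err:.3e}")
```

Since only `left`, `right`, and `w` need to be written to disk, the interpolation can be deferred until the data matrix is reconstructed.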

A bound on the error of an approximation generated using single-pass ID is presented in the following theorem.

Theorem 2.2.2. Let $A$ be a data matrix, $A_c$ the sub-sampled (coarsened) matrix as in (2.3), $M$ the interpolation operator as in (2.9) with associated interpolation error $E_I := A - A_cM$, and $\sigma_{k+1}$ the $(k+1)$th largest singular value of $A_c$. The error of the single-pass ID approximation $\hat{A}$ in (2.9) is bounded as follows:

$$\|A - \hat{A}\|_2 \le \|E_I\|_2 + \|M\|_2 \sqrt{1 + k(m-k)}\, \sigma_{k+1}.$$

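Proof. Note that

$$\begin{aligned}
\|A - \hat{A}\|_2 &= \|A - P_c A_c(\mathcal{I}_c, :) M\|_2 \\
&\le \|A - A_cM\|_2 + \|A_cM - P_c A_c(\mathcal{I}_c, :) M\|_2 \\
&\le \|E_I\|_2 + \|A_c - P_c A_c(\mathcal{I}_c, :)\|_2 \|M\|_2 \\
&\le \|E_I\|_2 + \|M\|_2 \sqrt{1 + k(m-k)}\, \sigma_{k+1},
\end{aligned}$$

where the last inequality follows from (2.2) applied to the ID of $A_c$.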
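The chain of inequalities in the proof can be checked numerically. The sketch below is an illustration under several assumptions: the snapshots are hypothetical smooth functions on a one-dimensional grid, $M$ is built as a dense piece-wise linear interpolation matrix for brevity, and SciPy's column-pivoted QR (`scipy.linalg.qr` with `pivoting=True`) is swapped in for mgsqr. The first two inequalities hold for any $P_c$; the last one relies on (2.2), so its right-hand side is only reported here, not asserted.

```python
import numpy as np
from scipy.linalg import qr

def spec(X):
    """Spectral norm of a matrix."""
    return np.linalg.norm(X, 2)

n, stride, k = 257, 8, 4
x = np.linspace(0.0, 1.0, n)
t = np.linspace(0.5, 3.0, 30)
A = np.cos(2 * np.pi * np.outer(t, x))   # hypothetical m x n snapshot matrix
m = A.shape[0]
x_c, A_c = x[::stride], A[:, ::stride]   # coarse grid and sub-sampled data

# Dense piece-wise linear interpolation operator M (n_c x n):
# row i is the i-th coarse hat function evaluated on the fine grid.
M = np.stack([np.interp(x, x_c, e) for e in np.eye(len(x_c))])

# ID of A_c from a pivoted QR of A_c^T (stand-in for mgsqr).
_, R, perm = qr(A_c.T, mode="economic", pivoting=True)
C = np.linalg.pinv(R[:k, :k]) @ R[:k, k:]
P_c = np.zeros((m, k))
P_c[perm] = np.vstack([np.eye(k), C.T])
rows = perm[:k]

lhs = spec(A - P_c @ A_c[rows] @ M)          # ||A - A_hat||_2
E_I = spec(A - A_c @ M)                      # interpolation error
middle = E_I + spec(A_c - P_c @ A_c[rows]) * spec(M)
sigma = np.linalg.svd(A_c, compute_uv=False)[k]
final = E_I + spec(M) * np.sqrt(1 + k * (m - k)) * sigma

print(f"||A - A_hat||_2 = {lhs:.3e}")
print(f"||E_I||_2 + ||A_c - P_c A_c(I_c,:)||_2 ||M||_2 = {middle:.3e}")
print(f"right-hand side of Theorem 2.2.2 = {final:.3e}")
```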

Stated differently, Theorem 2.2.2 suggests the error of single-pass ID depends on the low-rank structure of $A_c$ as well as the error incurred in the interpolation step. When the data is considerably low-rank, the single-pass ID error is dominated by the interpolation error $\|E_I\|$. Considering piece-wise linear interpolation, when data has bounded second derivative, $\|E_I\| \lesssim H^2$, where $H$ is the