Tiled Arrays and Matrices - Matrix Multiplication


5.6.3 Tiled Arrays and Matrices

Lastly, we discuss related work on tiling strategies for arrays and matrices in particular, and on tiling with regard to linear algebra operations.

Array DBMS. Chunking or tiling of arrays is a common storage model used by array DBMSs, since conventional Fortran or C++ array linearization orders will “optimize for one access pattern while making all others very inefficient” (Sarawagi and Stonebraker, 1994). Array DBMSs usually aim to support arbitrary access patterns, leaving users the option to configure the tiling according to their access pattern. The system SciDB decomposes the array into a number of equal-sized and potentially overlapping chunks (Brown, 2010). Although the chunk partitioning can be customized by configuring the dimension partitioning, the chunk size is fixed once selected and is not adaptive. Another system is RasDaMan (Baumann et al., 1999), which employs an arbitrary tiled storage model with “R+-tree-like indexes” (Furtado and Baumann, 1999) for general multidimensional arrays. In particular, their storage model also adapts the tile size to the sparsity of the array, but it does not include heterogeneous tile data structures. An upper limit is imposed on the tile size depending on the system characteristics, e.g., the page size. Different tiling strategies are offered, which for example optimize the tiling along a dimension direction, or based on query statistics. The resulting tile layout is more flexible than ours, as it could also be “totally nonaligned”. However, a nonaligned tiling strategy is usually disadvantageous for join-like tile operations such as tiled matrix multiplication. The main difference is that the interface to RasDaMan (AQL) is a database-centric, low-level array language that does not offer any linear algebra primitives as high-level operations. Instead, Furtado and Baumann (1999) define accesses exclusively as multidimensional range queries. A user might implement a matrix multiplication algorithm via AQL, but is then responsible for writing the query in an efficient way. This could be tedious without the proper language constructs, and the performance is likely to fall behind that of the carefully tuned C++ multiplication kernels that ATmult uses internally. In particular, none of the systems really ties the storage model in with a cost-based, optimized matrix multiplication operator like ATmult.

Tiled matrices. Many of the related papers propose a rather fine-grained matrix partitioning into small, fixed-size blocks. For instance, Vuduc and Moon (2005) used a variable block row structure (VBR) for spgemv. The latter is an unaligned BCSR, and serves the purpose of achieving higher performance by register blocking. However, their maximum block size is 3 × 3; hence, their focus is on microscopic tuning rather than on high-level tile optimizations. SystemML (Ghoting et al., 2011) also uses a fixed blocking (2 × 2 in their example), where blocks are either a 1D array (dense) or a hashmap (sparse). In other work, Huang et al. (2013) motivate a macroscopic, fixed block size (2048 × 2048) by arguing that two matrix tiles should fit entirely into main memory, in the context of a disk-based system. Hierarchical data structures in the context of matrix multiplication have also been proposed by Valsalam and Skjellum (2002). As in our partitioning process, they apply the Morton order (Morton, 1966) to matrix elements to achieve a locality-preserving partitioning of a large matrix and a higher cache efficiency. In contrast to our AT Matrix, they restrict their algorithm to dense matrices and do not consider sparse matrices at all.
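To illustrate why the Morton order preserves locality, the following minimal sketch interleaves the bits of a row and a column index into a single Z-order key. The function name `morton_index` and the choice of placing row bits at the odd positions are assumptions made for this illustration; they are not taken from any of the cited systems.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the lowest `bits` bits of (row, col) into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((col >> b) & 1) << (2 * b)      # column bits go to even positions
        key |= ((row >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
    return key


# Sorting the cells of a 4 x 4 matrix by their key visits one 2 x 2 quadrant
# after the other, so elements that are close in 2D stay close in 1D.
cells = [(r, c) for r in range(4) for c in range(4)]
z_sorted = sorted(cells, key=lambda rc: morton_index(*rc))
```

Because every aligned power-of-two block maps to one contiguous key range, a recursive quadrant partitioning can be expressed as simple range splits on the linearized keys.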

Adaptive tensors. Besides fixed-size blocking approaches, we have found little research on adaptively tiled sparse matrix structures. One work that touches on this approach was presented by Smith et al. (2015), which deals with tensor-matrix multiplications for three-mode (three-dimensional) tensors. They “propose a method of growing tiles to adapt to the sparsity pattern of a given tensor.” In particular, they first statically partition the mode-1 dimension. Then, for each layer, they independently tile the second mode, and then the third mode, such that each tile contains approximately the number of floating point values that fits in cache. This is similar to our sparse tile grow condition, which is bounded by the absolute number of non-zero elements in a sparse matrix tile. However, our data structure exploits the heterogeneity by using different data structures for different tiles, whereas Smith et al. (2015) use a uniform data representation for all tiles.
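The idea of growing tiles until a capacity bound on the number of non-zeros is reached can be sketched in one dimension as follows. This is a deliberately simplified, greedy illustration: the function `grow_tiles`, its inputs, and the greedy policy are assumptions, and it reproduces neither the multi-mode procedure of Smith et al. (2015) nor the exact grow condition of the AT Matrix.

```python
def grow_tiles(nnz_per_slice, capacity):
    """Greedily group consecutive slices into tiles with at most `capacity` non-zeros.

    Returns half-open (start, end) slice ranges. A single slice that alone
    exceeds the capacity becomes a tile of its own.
    """
    tiles = []
    start, load = 0, 0
    for i, nnz in enumerate(nnz_per_slice):
        # Close the current tile if adding this slice would exceed the budget.
        if load > 0 and load + nnz > capacity:
            tiles.append((start, i))
            start, load = i, 0
        load += nnz
    if start < len(nnz_per_slice):
        tiles.append((start, len(nnz_per_slice)))
    return tiles


# Hypothetical non-zero counts per slice and a budget of 8 values per tile.
tiles = grow_tiles([3, 1, 4, 1, 5, 9, 2, 6], capacity=8)
```

Sparse regions thus end up in wide tiles and populated regions in narrow ones, which is the adaptive behavior that a fixed block size cannot provide.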

LAB-tree. The work probably most related to our adaptively tiled data structure is the Linearized Array B-tree (LAB-tree) by Zhang et al. (2011). Although their approach focuses on storing matrices on disk and on minimizing I/O for batch-wise matrix writes, they employ a hybrid tree structure that stores sparse and dense parts differently. In particular, the LAB-tree contains key-value pairs as records, where the key is the linearization of the matrix input dimensions (i, j) given by a user-defined space-filling curve f : (i, j) → k. Naturally, each leaf spanning a certain key range ℓ is constrained to accommodate a limited number of records κ, which is connected to the disk block size. If a leaf overflows, it is either split into two sparse leaves, or a dense sub-array is created if the density in the effective range ℓ∗ (the key range of actual non-zeros in the leaf) is above 0.5 and ℓ∗ is smaller than κ. As a result, the dense array covers the leaf range ℓ only partially. However, this might lead to the odd effect that an existing dense array is converted back into two sparse leaves if additional elements are inserted outside the effective range ℓ∗. Their aligned-splitting mode ensures that sparse leaf splits occur at multiples of κ, which guarantees that at least the sparse leaves follow the higher-level topology implied by the space-filling curve.

As we did, Zhang et al. (2011) benchmark their LAB-tree data structure with matrix multiplications. To this end, they load matrix blocks into memory and apply either a dense Blas multiplication kernel, or the Cholmod (Chen et al., 2008) implementation for sparse parts. Since Cholmod requires a CSC structure while the LAB-tree contains key-value pairs, an additional conversion is required. They also use matrices of the Florida Sparse Matrix Collection in their measurements. For example, for a self-multiplication of the upper-triangular⁶ matrix TSOPF_RS_b2383 (R3), their system takes 388 seconds. We re-ran our multiplication measurement using only the upper-triangular part of R3, and obtained a runtime of 2.4 seconds for spspsp_gemm and 836 milliseconds for ATmult, which is a speed-up of almost three orders of magnitude. Obviously, contrasting the disk-based LAB-tree with our in-memory approach is like comparing apples and oranges. Nevertheless, apart from the tremendous performance difference, the AT Matrix differs from the LAB-tree in several respects. Firstly, the AT Matrix arranges sparse and dense tiles in a quad-tree structure and always aligns them to logical blocks, which is beneficial for join processing, e.g., in the tiled matrix multiplication. In contrast, the LAB-tree could contain unaligned dense arrays that partially cover a logical block and presumably need to be copied before being further processed. Secondly, we use eight different matrix multiplication kernels. In contrast, Cholmod only contains spspsp_gemm and spdd_gemm, so it is unclear how the remaining five situations are handled. Thirdly, ATmult uses an enhanced cost model and tile conversion at runtime to accelerate the multiplication performance, whereas the selection of the multiplication kernel by Huang et al. (2013) is hand-coded and independent of hardware characteristics. Most importantly, the LAB-tree is indeed adaptive regarding leaf size and storage type, but unlike the AT Matrix, its matrix topology is predominantly influenced by the user-defined (or default) linearization. For this reason, they hand-crafted individual linearization functions, which create block sizes matching the pattern of each matrix under consideration.

⁶ Note that many examples of the Florida Sparse Matrix Collection are symmetric matrices, for which only the upper-triangular part is stored. Regarding our measurements, however, we explicitly mirrored the upper half to obtain the complete matrix.

5.7 SUMMARY

With the increase in data volume and computation effort in many analytical applications, efficient processing of large sparse matrices often becomes performance-critical. To fill the gap that is left between generic array DBMS and highly specialized numerical libraries, we redesign sparse and dense matrix processing for the Lapeg in a novel way, going beyond simple array data structures, Blas libraries, and low-level tuning of sparse algorithms.

To this end, we presented the AT Matrix, which has an adaptive, heterogeneous storage layout for large matrices of any topology and internally uses the native sparse and dense matrix data structures that are integrated in the column-oriented storage layer. Moreover, we showed how the matrix multiplication operator ATmult accelerates matrix multiplication by applying several optimizations introduced in Chapter 4. These include the density estimation technique (SpProdest), a cost-based selection of multiplication kernels, and dynamic runtime conversions at the tile level. Our approach outperformed common multiplication algorithms, similar to those that are still used for example in Matlab or R, by a factor of up to 6, while maintaining configurable memory restrictions. Nevertheless, our optimization approach is general and orthogonal to the multiplication kernels. At present, we have not yet made use of several performance tweaks in our custom kernels, and we expect further improvement potential from implementing them.

6 UPDATES AND MUTABILITY

In this chapter, we address the final requirement that was identified in the introduction: data manipulation. To be more precise, we will discuss the mutability of matrices with regard to updates, inserts, and deletions of matrix parts. In this context, we consider applications that have mixed read-and-write workloads, and applications that are based on a dynamically evolving data set, which could be run in the Lapeg.

6.1 MOTIVATION

The ability to consistently update and persist data is a feature that is natural to users of ACID-compliant database systems, and is the reason why business data has been stored in relational DBMSs for decades. In fact, the powerful data management capabilities of a DBMS also make it appealing as a storage and processing platform for new application scenarios, e.g., scientific computations (Hey et al., 2009). In the past, scientists across all domains have widely depended on static data files, which were processed by hand-written programs. File-based data is tedious to maintain and update. However, a common (mis-)conception of poor performance is among the reasons why database management systems have barely been considered by science users as an alternative processing platform (Gray et al., 2005; Buneman, 2002). Moreover, relational DBMSs lack an interface for scalable math operations, in particular linear algebra (Stonebraker et al., 2013b). Nonetheless, the advantages of a DBMS and an efficient implementation of analytical queries are not mutually exclusive. The DBMS-integrated Lapeg proves that a columnar in-memory DBMS not only acts as a storage back-end, but can also be used as an efficient in-memory engine for matrix processing on a large scale.

We have argued in the introduction that if data is persisted and kept consistently in a single database system with integrated linear algebra functionality, expensive copying into external systems such as R or Matlab becomes dispensable. Furthermore, the existence of a single source of truth avoids data inconsistencies, since there are no redundant copies of the data in external systems. However, as a consequence of relying on a single data storage for matrices, all data changes have to take place on the primary data representations. Regarding the Lapeg, these changes take place in the columnar storage layer that we introduced in Chapter 3.

Since matrices are non-static objects in several analytics workflows, this chapter focuses on the implementation and exposition of data manipulation commands. The nuclear energy state analysis described in Section 2.2 serves as an example. As part of the importance truncation, several rows and columns, which refer to quantum states, are simultaneously cut out of a Hamiltonian matrix H. Then, row and column pairs are iteratively added to the submatrix M_ref again.

Moreover, matrices are directly or indirectly manipulated in a multitude of other applications. In the environment of a transactional, online DBMS, matrix data sets evolve over time. For instance, consider the term-document matrix A of Section 2.3.1, where documents are continuously added to the database. The latter translates into appending additional rows or columns to A. Another example is the social network graph (Section 2.3.2), where each non-zero element of the adjacency matrix corresponds to a connection between two persons (vertices). Every newly found or deleted connection requires an update of the corresponding matrix element. Hence, the physical and logical organization of matrices should enable reads, writes, and deletions of single elements, rows, columns, and individual matrix subregions. In order to offer these manipulations as functionality in the Lapeg, we propose a standardized user API that comprises all of the aforementioned manipulation primitives. Furthermore, we discuss the physical implementation of updates and deletions on matrices, and show that the required modifications seamlessly integrate with the matrix data structures in the column-oriented storage layer.
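The manipulation primitives listed above (reads, writes, and deletions of single elements, as well as appending and removing rows) can be made concrete with a toy dictionary-of-keys matrix. The class `DokMatrix` and all of its method names are hypothetical and serve illustration only; they are not the Lapeg API and say nothing about its physical implementation.

```python
class DokMatrix:
    """Toy dictionary-of-keys sparse matrix; only non-zero elements are stored."""

    def __init__(self, n_rows, n_cols):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.data = {}  # (row, col) -> value

    def _check(self, i, j):
        if not (0 <= i < self.n_rows and 0 <= j < self.n_cols):
            raise IndexError((i, j))

    def get(self, i, j):
        self._check(i, j)
        return self.data.get((i, j), 0.0)

    def set(self, i, j, value):
        self._check(i, j)
        if value == 0:
            self.data.pop((i, j), None)  # writing a zero removes the element
        else:
            self.data[(i, j)] = value

    def delete(self, i, j):
        self.set(i, j, 0.0)

    def append_row(self, entries):
        """Append a row given as {col: value} and return its row index."""
        i = self.n_rows
        self.n_rows += 1
        for j, value in entries.items():
            self.set(i, j, value)
        return i

    def delete_row(self, i):
        """Remove row i and shift all following rows up by one."""
        self._check(i, 0)
        shifted = {}
        for (r, c), value in self.data.items():
            if r < i:
                shifted[(r, c)] = value
            elif r > i:
                shifted[(r - 1, c)] = value
        self.data = shifted
        self.n_rows -= 1


# Term-document example: every new document appends one row of term counts.
A = DokMatrix(n_rows=0, n_cols=5)
doc0 = A.append_row({0: 2.0, 3: 1.0})
doc1 = A.append_row({1: 4.0})
A.set(doc1, 3, 7.0)   # update a single element
A.delete_row(doc0)    # remove the first document again
```

Note that deleting a row in this naive layout touches every stored element, which hints at why the cost of updates depends strongly on the chosen matrix representation.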

In particular, the main contributions of this chapter are:

• Matrix application programming interface. Similar to the data manipulation language of transactional, relational systems, we sketch the different access patterns and an application interface to read and manipulate matrices.

• Mutable sparse matrix architecture. We show how a two-layered main-delta storage can be leveraged to provide updatable sparse matrices. By using the native DBMS columns we automatically benefit from multiversion control and transactionality features of the DBMS.

• Mutable adaptive tile matrix. We present two different ways of integrating mutability into the AT Matrix representation, and how wide matrix manipulations spanning over multiple tiles are efficiently implemented using the AT Matrix’s indirection layer.

• Evaluation. We thoroughly evaluate the performance of the mutable matrix architectures against alternative approaches. To this end, the example workload is split so that mixed insert-read queries and deletions are considered separately.
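The two-layered main-delta idea named in the second contribution can be pictured with the following minimal sketch: a read-optimized, sorted main part is left untouched by writes, all changes are collected in a small write-optimized delta, and a merge step periodically folds the delta into a new main. The class `MainDeltaMatrix`, the use of a zero value as a deletion marker, and the sorted coordinate layout are assumptions for illustration; the sketch omits multiversioning and transactions and does not reflect the actual column store implementation.

```python
import bisect


class MainDeltaMatrix:
    """Toy main-delta sparse matrix; assumes unique (row, col) input coordinates."""

    def __init__(self, triples=()):
        # Main: sorted by (row, col), only rebuilt during merge().
        items = sorted(((i, j), v) for i, j, v in triples if v != 0)
        self.main_keys = [k for k, _ in items]
        self.main_vals = [v for _, v in items]
        # Delta: absorbs all writes; a value of 0.0 marks a deletion.
        self.delta = {}

    def set(self, i, j, value):
        self.delta[(i, j)] = value

    def delete(self, i, j):
        self.delta[(i, j)] = 0.0

    def get(self, i, j):
        key = (i, j)
        if key in self.delta:  # the delta always holds the newest state
            return self.delta[key]
        pos = bisect.bisect_left(self.main_keys, key)
        if pos < len(self.main_keys) and self.main_keys[pos] == key:
            return self.main_vals[pos]
        return 0.0

    def merge(self):
        """Fold the delta into a freshly sorted main and clear the delta."""
        merged = dict(zip(self.main_keys, self.main_vals))
        merged.update(self.delta)
        items = sorted((k, v) for k, v in merged.items() if v != 0)
        self.main_keys = [k for k, _ in items]
        self.main_vals = [v for _, v in items]
        self.delta = {}


m = MainDeltaMatrix([(0, 0, 1.0), (1, 2, 3.0)])
m.set(1, 1, 5.0)   # insert lands in the delta
m.delete(0, 0)     # deletion is recorded as a marker in the delta
```

Writes stay cheap because they never reorganize the sorted main part, while reads pay for one additional lookup in the delta until the next merge.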

This chapter is organized as follows: an introduction to the different matrix access patterns, manipulation types, and linear algebra primitives is provided in Section 6.2. Then, we show how the presented manipulation API is applied using the nuclear physics analysis application as an example. In Section 6.3 we discuss the complexity of update operations on different matrix representations. Thereafter, we present the architecture of the mutable sparse matrix, and the AT Matrix_GD and AT Matrix_LD representations. The evaluation of these representations and multiple other approaches under different manipulation aspects is presented in Section 6.5. Finally, Section 6.6 provides an overview of related research on updatability in databases and mutable matrix structures.

In document Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System (Page 131-137)