
Unlike relational tables in business environments, the majority of data in scientific fields is contained in arrays, mainly vectors and matrices. In the following, we first discuss a standard way of representing arrays in a relational DBMS, and outline its shortcomings. We then impose a few requirements necessary to store vectors and matrices more efficiently in a column-oriented DBMS. Thereafter, concrete data types for vectors and matrices are described in Sections 3.3 and 3.4.

3.2.1 A Straw Man’s Method

A straightforward way of storing general n-dimensional arrays in a relational DBMS is to use a table with n key attributes for the cell coordinates, and at least one attribute for the cell value. Hence, each row in the database table represents one element of the array, and conversely, each element of the array (null or not) is represented by a table row. For instance, this is the storage representation of matrices in RIOT-DB (Zhang et al., 2009), which uses a row-store DBMS (MySQL). By interpreting the set of n-tuples as a relation R, some linear algebra algorithms can indeed be expressed by relational means using this format, for instance a matrix multiplication, as we sketched in Listing 2.1 (Section 2.6.1).
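To make the n-tuple format concrete, the following is a minimal sketch of a matrix multiplication expressed as a join with aggregation. It is not a reproduction of Listing 2.1: the table names `A` and `B`, the attribute names `i`, `j`, `val`, and the use of an in-memory SQLite database (instead of MySQL as in RIOT-DB) are assumptions made for illustration only.

```python
import sqlite3

def to_tuples(matrix):
    """Flatten a nested-list matrix into (row, col, val) tuples, one per cell."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]

def relational_matmul(a, b):
    """Multiply two matrices stored as coordinate tables via join and aggregation."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: two key attributes for the coordinates, one value attribute.
    con.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL, PRIMARY KEY (i, j))")
    con.execute("CREATE TABLE B (i INTEGER, j INTEGER, val REAL, PRIMARY KEY (i, j))")
    con.executemany("INSERT INTO A VALUES (?, ?, ?)", to_tuples(a))
    con.executemany("INSERT INTO B VALUES (?, ?, ?)", to_tuples(b))
    # C[i, j] = sum over k of A[i, k] * B[k, j]: join on the shared index k,
    # then aggregate per result coordinate.
    rows = con.execute(
        "SELECT A.i, B.j, SUM(A.val * B.val) "
        "FROM A JOIN B ON A.j = B.i "
        "GROUP BY A.i, B.j ORDER BY A.i, B.j"
    ).fetchall()
    con.close()
    return rows

product = relational_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# product holds one (i, j, val) tuple per result cell
```

Note that the query plan treats both inputs as unordered sets of tuples; nothing in the schema tells the DBMS that the rows form a regular two-dimensional grid.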

Nevertheless, this approach has some drawbacks:

• The unordered n-tuple table exhibits rather poor performance for complex operations such as matrix multiplication. The reason is that relational algebra is based on unordered sets of tuples, so linear algebra operations on such tables cannot benefit from the inherent structure of an array.

¹ Modern multi-core platforms easily reach total bandwidths of 50 GB/s and more.

• The n-tuple array table usually results in excessive storage consumption for fully populated arrays, especially if the dimensionality is high, since n coordinate values are stored alongside every cell value.

• If arrays are sparsely populated, it is further inefficient to store null elements (i.e., zero elements for sparse matrices).
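The storage drawbacks listed above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes uncompressed 4-byte integer coordinates and 8-byte double values; these widths, the function names, and the example shapes are illustrative assumptions, not figures from a particular system.

```python
from math import prod

def tuple_table_bytes(shape, stored_cells=None, coord_bytes=4, value_bytes=8):
    """Size of an n-tuple table: each stored cell carries n coordinates plus its value."""
    cells = prod(shape) if stored_cells is None else stored_cells
    return cells * (len(shape) * coord_bytes + value_bytes)

def plain_array_bytes(shape, value_bytes=8):
    """Size of a contiguous array, where coordinates are implied by position."""
    return prod(shape) * value_bytes

matrix_shape = (1000, 1000)
tensor_shape = (100, 100, 100, 100)

# Overhead factor of the n-tuple table for fully populated arrays.
matrix_overhead = tuple_table_bytes(matrix_shape) / plain_array_bytes(matrix_shape)
tensor_overhead = tuple_table_bytes(tensor_shape) / plain_array_bytes(tensor_shape)

# Sparse case: a matrix with 1% non-zero cells, with and without stored zeros.
all_cells = tuple_table_bytes(matrix_shape)
nonzero_only = tuple_table_bytes(matrix_shape, stored_cells=10_000)
```

Under these assumptions the overhead factor grows with the number of dimensions (2.0 for the matrix, 3.0 for the four-dimensional array), and storing the zero elements of the sparse matrix inflates the table by a factor of 100.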

Regarding row-store DBMSs, it was shown that this data layout is an order of magnitude less efficient for array processing than a system with native array support (Stonebraker et al., 2007). In contrast, we show that a column-store offers more efficient ways to store and process arrays. To tackle the disadvantages mentioned above, we impose a few basic prerequisites on the column-oriented storage layer, such as the ability to retain a stable table order. The Lapeg can then employ much more efficient algorithms on ordered and indexed data structures, which are common in numerical high-performance libraries.

As the majority of analysis applications are based on matrices, we focus on the representation of matrix data in the remainder of this thesis. Although a matrix is a two-dimensional array, there is not a single representation that suits all matrices perfectly, in particular when the matrix sparsity is taken into account. Therefore, we examine separate structures for dense and sparse vectors and matrices.

3.2.2 Storage Engine Prerequisites

To overcome the aforementioned shortcomings of general relational DBMSs, we first define some prerequisites for efficiently storing and processing arrays in the columnar storage layer of a main-memory DBMS. These properties are required to let the Lapeg make direct use of efficient kernel algorithms on the core data structures, e.g., routines from the Blas library. All of the following characteristics are provided by the SAP HANA DB (Färber et al., 2012):

Dictionary-less columns. Vectors and matrices usually contain numeric values by definition, which are of type integer, float, or double. Moreover, most matrices contain many different values. Since neither wide data types are used nor many duplicate values are present, the aforementioned dictionary encoding becomes superfluous for the majority of scientific data sets. An exception is matrices derived from contextual data, which contain auxiliary, semantic information. For example, consider the term-document relations described in Section 2.3.1: in this case, the dictionaries of the dimension attributes implicitly map arbitrary attribute values (e.g., strings) to matrix coordinates. We will readdress this characteristic in Section 3.4.3. Until then, the following columnar data representations are based on the assumption of dictionary-less integer and float columns.
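The argument against dictionary encoding for numeric columns can be checked with a small size estimate. The following sketch assumes a simple scheme with a sorted dictionary of 8-byte values and bit-packed codes; the function names and the size model are illustrative assumptions and do not describe the encoding of a specific DBMS.

```python
def dictionary_encode(values):
    """Replace each value by its position in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]

def encoded_bytes(values, value_bytes=8):
    """Dictionary size plus bit-packed code vector, rounded up to whole bytes."""
    dictionary, codes = dictionary_encode(values)
    bits_per_code = max(1, (len(dictionary) - 1).bit_length())
    return len(dictionary) * value_bytes + (len(codes) * bits_per_code + 7) // 8

def raw_bytes(values, value_bytes=8):
    """Plain column without a dictionary."""
    return len(values) * value_bytes

distinct_values = [i * 0.5 for i in range(1000)]   # typical numeric column: all values differ
repeated_values = [1.0, 2.0] * 500                 # low-cardinality column

distinct_sizes = (raw_bytes(distinct_values), encoded_bytes(distinct_values))
repeated_sizes = (raw_bytes(repeated_values), encoded_bytes(repeated_values))
```

With these assumptions the column of distinct values grows from 8000 to 9250 bytes when encoded, whereas the low-cardinality column shrinks from 8000 to 141 bytes, which is the case dictionary encoding is designed for.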

Row-positional access. Although relational tables are not intended to be referenced by row position, the Lapeg internally requires positional access to table columns. That is, each value of a column in the column-oriented storage layer should be addressable by its index, i.e., its absolute position in the column. This corresponds to the table row position if the column is part of a relational table. Positional accesses are commonly used in column-store operators, e.g., for materializing result rows of a filter query.
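As a rough illustration of positional access, the sketch below evaluates a filter on one column, collects the matching positions, and then materializes result rows by reading the same positions from the other columns. The column names and the helper functions `filter_positions` and `materialize` are hypothetical and only serve to show the access pattern.

```python
def filter_positions(column, predicate):
    """Scan a single column and return the positions of all qualifying values."""
    return [pos for pos, value in enumerate(column) if predicate(value)]

def materialize(columns, positions):
    """Build result rows by positional lookup into each requested column."""
    return [tuple(column[pos] for column in columns) for pos in positions]

# A small matrix table stored column-wise; position p in every column
# belongs to the same logical row.
row_col = [0, 0, 1, 1]
col_col = [0, 1, 0, 1]
val_col = [1.5, 2.5, 3.5, 4.5]

hits = filter_positions(val_col, lambda v: v > 2.0)
result_rows = materialize([row_col, col_col, val_col], hits)
```

The lookup is only meaningful if all columns share the same row order, which is why the positional access requirement goes hand in hand with a stable table order.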

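[Figure 3.2 panels. Left: vector (0, 2, 0, 4)ᵀ. a) Dense: logical table with column Val = 0, 2, 0, 4; physical column Val at positions 0 to 3. b) Sparse: logical table with columns Row, Val = (1, 2), (3, 4); physical columns Row = 1, 3 and Val = 2, 4 at positions 0 and 1.]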

Figure 3.2: Dense (a) and sparse (b) representation of a vector (left) in a column-oriented database system. Middle: Logical exposition as table. Right: physical representation in the column-oriented storage layer.
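The two representations of Figure 3.2 can be sketched with plain Python lists standing in for columns. The function names are assumptions for illustration; the binary search in `sparse_get` additionally assumes that the Row column is kept sorted.

```python
from bisect import bisect_left

def dense_column(vector):
    """Dense layout: one Val column, the position of a value is its coordinate."""
    return list(vector)

def sparse_columns(vector):
    """Sparse layout: a Row column with coordinates and a Val column with non-zero values."""
    rows = [i for i, value in enumerate(vector) if value != 0]
    vals = [vector[i] for i in rows]
    return rows, vals

def sparse_get(rows, vals, i):
    """Look up element i via binary search on the sorted Row column; absent means zero."""
    pos = bisect_left(rows, i)
    if pos < len(rows) and rows[pos] == i:
        return vals[pos]
    return 0

vector = [0, 2, 0, 4]
val = dense_column(vector)
row, sparse_val = sparse_columns(vector)
```

For the example vector, the dense layout stores four values, while the sparse layout stores the two coordinates 1 and 3 together with the two values 2 and 4.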

Configurable table sort. The storage layer should provide a table-wide, flexible sorting mechanism for columns. That is, each column of a table should be ordered at certain reorganization checkpoints (e.g., at the initial creation of a matrix) according to a given mapping that is internally provided by the Lapeg. This mapping results from the sort order of one or multiple attributes of the table (e.g., the matrix row coordinate). Internal column reordering during table reorganization is a common technique in column-stores to improve the compressibility of tables (Abadi et al., 2006).

Row order preservation. Finally, we require that a table containing a matrix or vector representation preserves its current order at any time, unless a reorganization is triggered by the Lapeg. In fact, general column-store tables are usually only reordered during a reorganization process, e.g., to improve the table compression (Lemke et al., 2010). For general relational tables, this reorganization may decide to reorder a table according to some heuristics, which is undesired for our array structures that require a stable order. Hence, the column tables containing matrix data are flagged by the Lapeg to be ignored by any heuristic reorganization routine of the DBMS.
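The configurable table sort can be pictured as computing one position mapping from the sort attributes and applying it to every column of the table. The sketch below is a simplified illustration under that reading; the names `sort_mapping` and `reorder` and the dictionary-of-lists table are hypothetical and do not reflect the interface of an actual storage layer.

```python
def sort_mapping(key_columns):
    """Return the positions of the table rows in ascending order of the key attributes."""
    n = len(key_columns[0])
    return sorted(range(n), key=lambda pos: tuple(col[pos] for col in key_columns))

def reorder(table, mapping):
    """Apply the same mapping to every column so that rows stay aligned."""
    return {name: [column[pos] for pos in mapping] for name, column in table.items()}

# Matrix cells inserted in arbitrary order.
table = {
    "row": [1, 0, 1, 0],
    "col": [1, 1, 0, 0],
    "val": [4.0, 2.0, 3.0, 1.0],
}

# Sort by row coordinate first, then by column coordinate.
mapping = sort_mapping([table["row"], table["col"]])
ordered = reorder(table, mapping)
```

After the reordering, the Val column holds the matrix in row-major order, so its positions can be handed to an order-dependent kernel as long as no later heuristic reorganization permutes the table.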