2.6 Linear Algebra in Databases
2.6.4 Commercial Systems and R-Integrations
Despite the need of their customer base to have tools for advanced statistical data analysis, many commercial database system vendors deferred providing statistical functionality for a long time. Only recently, in response to the emerging market of data science and advanced analytics, major DBMS started to extend their statistical feature set. Due to the dominance of statistical software packages, some vendors integrated R with their systems instead of offering custom implementa- tions. Examples are Oracle R Enterprise (Oracle, 2015), Microsoft SQL Server R Services (Mi- crosoft, 2016), the SAP HANA R Integration (Große et al., 2011), and IBM Netezza Analytics (IBM, 2014b). As a matter of fact, published details about proprietary systems are rare. We give a short review about different approaches and extensions of proprietary systems providing linear algebra functionality.
SAP HANA R Integration
The R runtime offers the ability to access the data of any relational database system that implements the standardized JDBC or ODBC interface via the RODBC and RJDBC packages (Ripley, 2015; Ur- banek, 2014). However, these network-based interfaces are commonly known for poor performance, in particular for batch transfers of larger data sets. This problem is tackled by the integration of R in the in-memory column-store SAP HANA DB . As presented by Große et al. (2011), the data transfer implementation is based on a shared memory (SHM) method, which is superior to the standard SQL interface. From the user perspective, R-scripts are embedded in R-nodes, as part of a program written in SQL-script, which is a proprietary SQL meta-programming language. The DBMS exe-
cution engine then calls the R runtime for each R-node, and converts all database tables that are referenced within the R script into the corresponding R-data types, while all data completely re- mains in the shared memory. The actual computations take place in a single-threaded R runtime, whereas the DBMS acts as a data orchestrator, and starts multiple R runtimes in parallel. However, all linear algebra operations such as matrix multiplications rely on the R algorithm implementations. Consequently, they can only be run within the sequential R runtime on the local data chunk, rather than on a distributed matrix data set as a whole. This opposes the idea of a comprehensive, deep integration of linear algebra in the DBMS engine.
Oracle ORE
Oracle offers the Oracle R Enterprise (ORE) runtime as an optional part of their Enterprise DBMS Edition. The ORE includes a transparency layer, i.e. a collection of R packages that support “map- ping of R data types and generate SQL transparently in response to R expressions on mapped data types” (Oracle, 2015). According to Oracle (2013), the mapping of R functions into database- resident operations happens wherever possible: “when in-database equivalents are not available for contributed R packages, Oracle Advanced Analytics can run them via embedded R mode [...]”, meaning that additional R runtimes are spawned by the DBMS, which execute the respective R code. The reference manual (Oracle, 2012) indicates a list of functions and operations that are integrated directly in the database engine. The integration includes a matrix type and basic arithmetic oper- ations such as element-wise and full multiplication of matrices. However, it is unclear if matrices are represented as relational tables (e.g., triple tables) in the database system, or if an implicit data conversion to an internal matrix type is performed. The former approach would yield a poor per- formance for dense matrices, whereas the latter would incur additional conversion overhead and undesired data replication. Moreover, the interpretation of algebraic expressions is left to the R runtime, which results in missed opportunities to optimize on expression level.
As another way to employ linear algebra, Oracle offers the UTL_NLA package (Oracle Doc., n.d.) that exposes a subset of the Blas and Lapack operations on vectors and matrices when they are represented as VARRAYs. The latter is a proprietary data type for an array table attribute, and thus, not a top level data object. The downsides of VARRAYs are manifold: they may not be partially manipulated, they can not be split, partitioned and orchestrated via SQL, and finally, they are only suitable for small and dense matrices.
IBM Netezza
A different way of integrating linear algebra functionality is employed by the data warehouse sys- tem Netezza of IBM (2014a), which includes a MatrixEngine, and exposes matrices as first class citizens to the user. Although an additional conversion from a ⟨row, column, value⟩ database ta- ble to a Netezza matrix is required, their approach comes closer to a deep integration than the R-integrations presented above. Matrix operations are top-level operations, and part of their SQL dialect NZPLSQL. Computational intensive tasks, such as matrix multiplications, are handled by calling the Scalapack implementation of the Intel MKL. To our surprise, only dense matrices are supported, which narrows down the applicability of Netezza significantly. Indeed, a wide range of
applications require sparse linear algebra, including all applications presented in the beginning of this chapter.