

2.6.3 MapReduce and Hadoop-based Systems

In line with the big data and massive data parallelism movement, many approaches emerged that implement machine learning applications based on the MapReduce paradigm and Hadoop. However, since MapReduce does not natively offer any linear algebra functionality, there was a market gap for analytical systems built on top of Hadoop that provide linear algebra functionality.

SystemML

SystemML (Ghoting et al., 2011) is a system that aims to serve as a scalable platform for data mining applications. It is a Hadoop-based, scalable machine learning framework with a custom, declarative machine learning language (DML) that is similar to R. DML statements are compiled in two stages, first into a high-level operator (HOP) plan and then into a low-level operator (LOP) plan. Finally, the operators are mapped onto MapReduce tasks and executed in Hadoop. Their matrix data structure consists of individual, fixed-size blocks that can have either a dense or a sparse representation. Some high-level operations such as matrix multiplication can have multiple low-level implementations, two of which are described in their paper; we briefly summarize them in the related work section of Chapter 4. Unlike the previously presented systems, which mostly offer isolated linear algebra operations, the abstraction of the generated execution plan from the logical linear algebra expression reveals significant optimization opportunities. Indeed, expression optimization is one of the core objectives of SystemML. Their optimizer is presented in greater detail in a follow-up paper by Boehm et al. (2014b): it includes techniques like static and dynamic algebraic rewrites, many of which complement our ideas. In fact, they also discuss matrix chain multiplication optimizations; in particular, they identify a matrix chain multiplication as the core computation in the exemplary logistic regression script sketched in their paper. We revisit the optimizer of SystemML in Section 4.6.1, and reveal some of its shortcomings that have been addressed by our SpMachO optimizer.
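To illustrate the kind of rewrite behind the matrix chain multiplication optimization, consider the classic dynamic-programming ordering algorithm sketched below. This is a plain Python sketch of the textbook algorithm, not SystemML's actual implementation, and the function name is our own:

def matrix_chain_order(dims):
    # dims[i] and dims[i+1] are the dimensions of the i-th matrix in the
    # chain; for dense matrices, an (m x n) times (n x p) product costs
    # m*n*p scalar multiplications.
    n = len(dims) - 1                    # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]   # cost[i][j]: cheapest plan for product i..j
    split = [[0] * n for _ in range(n)]  # split[i][j]: optimal split point for i..j
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):        # try every split (i..k)(k+1..j)
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k
    return cost, split

For a product such as t(X) %*% X %*% w with X of size 10^6 x 100 and w of size 100 x 1 (i.e., dims = [100, 10**6, 100, 1]), the left-deep plan (t(X) %*% X) %*% w costs about 10^10 scalar multiplications, whereas t(X) %*% (X %*% w) costs only about 2 * 10^8, which is exactly the kind of saving that identifying the matrix chain enables.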

Nevertheless, it should be noted that SystemML is a read-only system that focuses exclusively on the optimization of linear algebra scripts. Matrices are read from and written to files, whereas data manipulation operations as such are not foreseen. Furthermore, the disk-based, distributed Hadoop file system offers near-arbitrary scalability, while at the same time sacrificing quick query response times for smaller computations. Although in a more recent publication Boehm et al. (2014b) suggest processing small-sized problems in local memory, the engine implementation of SystemML is based on Java and does not reach the performance of high-performance C++ BLAS libraries.

Since November 2015, SystemML has been an open-source project of the Apache Software Foundation (Apache Software Foundation, 2016).

Cumulon

A closely related system that should be mentioned in this context is Cumulon by Huang et al. (2013). Like SystemML, it is implemented on Hadoop and targets data analysis applications, although it is designed and optimized for running in a cloud environment with a flexible hardware configuration. They also create a logical plan from an R-like expression, and conduct a series of rule-based logical plan rewrites, including matrix multiplication optimizations. In line with SystemML, they employ a simple cost model to reduce the plan runtime. However, apart from time cost models for different hardware settings, they also optimize for the monetary costs caused by the cloud usage. As a result, they can generate a deployment plan that incurs the least cost for a cloud user. Interestingly, they explicitly state that the MapReduce paradigm is not a good fit for matrix-based computations. They further address SystemML directly in the discussion about the inefficiency of a MapReduce-parallel matrix multiplication. Their key argument is that map jobs are only allowed to have disjoint input sets, whereas the tasks in a blocked matrix multiplication require overlapping input sets. Thus, the map tasks perform no useful computation other than replicating matrix data for the reduce tasks. As a consequence, their approach departs significantly from the traditional MapReduce paradigm and builds on “map only” jobs: a “map task does not receive input as key-value pairs; instead, it simply gets specification of the input splits it is responsible for, and reads directly from Hadoop distributed file system (HDFS) as needed”.
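Cumulon's argument can be seen on a small blocked product: every block of A contributes to several output blocks, so the per-task input sets necessarily overlap. The following sketch (plain Python with NumPy, our own illustration rather than Cumulon code) makes the access pattern explicit:

import numpy as np

def blocked_matmul(A_blocks, B_blocks, nI, nK, nJ):
    # Block matrices are stored as dicts {(i, k): ndarray}. Every A block
    # (i, k) is read by all nJ output blocks C[i, j], and every B block
    # (k, j) by all nI output blocks: these overlapping input sets cannot
    # be expressed by map tasks over disjoint input splits without first
    # replicating the blocks.
    C = {}
    for i in range(nI):
        for j in range(nJ):
            C[i, j] = sum(A_blocks[i, k] @ B_blocks[k, j] for k in range(nK))
    return C

A = {(i, k): np.ones((2, 2)) for i in range(2) for k in range(2)}
B = {(k, j): np.ones((2, 2)) for k in range(2) for j in range(2)}
C = blocked_matmul(A, B, 2, 2, 2)   # every entry of each C[i, j] equals 4.0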

Unlike the recent version of SystemML, Cumulon is a purely disk-based, read-only system, similar to SciDB. Hence, it is neither suited for efficient ad-hoc queries, nor does it serve our manipulation requirements.

Apache Spark & MLlib

The open-source project Spark (Zaharia et al., 2010, 2012) was launched in 2010 and is a framework for cluster computing. Besides HDFS, Spark can also interface with other distributed storage frameworks, e.g., the MapR File System (Scott, 2014) or Cassandra (Lakshman and Malik, 2010). In Spark, a “driver” program launches a script written in Scala, Java, or Python, and invokes parallel computations that are scheduled either locally or on the cluster (Zaharia et al., 2010). Rather than working on key-value pairs, Spark uses resilient distributed datasets (RDDs) (Zaharia et al., 2012). The major advantage over conventional MapReduce is that data is cached in distributed memory, and thus many operations can be executed in-memory while avoiding disk I/O. Hence, Spark yields a significant speed-up over disk-based MapReduce jobs for ad-hoc analysis.

Linear algebra functionality is offered by the linalg library (Zadeh et al., 2015), which is packaged together with several other algorithms in the Spark machine learning library MLlib (Meng et al., 2014). At the time of writing this thesis, linalg included the distributed matrix storage types CoordinateMatrix, BlockMatrix and (Indexed-)RowMatrix, where the latter two can contain dense or sparse blocks and rows, respectively. However, not only does the user have to select the most appropriate among these types, but the matrix processing functionality is also type-dependent and often very limited. For instance, it is not possible to multiply matrices of type CoordinateMatrix at all, and an (Indexed-)RowMatrix can only be multiplied with a local matrix of a different type (Spark Documentation, n.d.). Some high-level algorithms are hard-coded for certain matrix types, e.g., a column similarities method for RowMatrix that is based on an approximate matrix square computation (Zadeh and Carlsson, 2013). Moreover, linalg comprises a non-distributed SparseMatrix type based on the compressed sparse column (CSC) format, which includes a single-core sparse matrix-dense matrix multiplication method (Zadeh et al., 2015), but no sparse matrix-sparse matrix counterpart. To summarize, the sparse functionality is very limited, and scalability is mostly restricted to dense linear algebra and sparse matrix-vector multiplication. As for our remaining requirements, neither can the Spark matrices be manipulated in an ad-hoc fashion, nor does the runtime foresee any logical optimizations of linear algebra expressions.
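The type dependence described above is visible directly in the API. The following sketch (PySpark, written against the pyspark.mllib.linalg.distributed module; exact behavior may vary across Spark versions) shows that a CoordinateMatrix offers no multiplication of its own and must first be converted into a BlockMatrix:

from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

sc = SparkContext(appName="linalg-sketch")

# A sparse matrix given as (row, col, value) entries.
entries = sc.parallelize([
    MatrixEntry(0, 0, 1.0),
    MatrixEntry(1, 2, 2.0),
    MatrixEntry(3, 1, 3.0),
])
coord = CoordinateMatrix(entries, numRows=4, numCols=4)

# CoordinateMatrix has no multiply(); convert to BlockMatrix first.
block = coord.toBlockMatrix(rowsPerBlock=2, colsPerBlock=2)
product = block.multiply(block)     # distributed block-wise product

print(product.toLocalMatrix())      # collect the result to the driver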