3.2 Data analytics platforms
3.2.3 The Spark stack
Besides Hadoop, Spark [14] is probably one of the most prominent big data analytic frameworks today. Spark is an open source cluster computing framework. It was originally developed at the University of California at Berkeley’s AMPLab. In the meantime, Spark has become an Apache top-level project. It offers an interface for programming clusters with implicit data parallelism and fault-tolerance. Spark makes intensive use of an execution engine that supports cyclic data flow and in-memory computation. Spark developers claim that programs run much faster than Hadoop, especially for in-memory computations [14].
The core of the Spark processing model is a data structure, called resilient dis- tributed dataset (RDD) [326]. RDDs are distributed memory abstractions that enable programmers to perform in-memory computations on large clusters in a fault- tolerant way. In contrary to other approaches, like Hadoop, where results can only be reused between several processing iterations (or queries) by writing and reading them to disk, RDDs allow to store results in memory and reuse them across several iterations. Two types of applications motivated the development of RDDs: iterative algorithms and interactive data mining tools. For these cases, keeping data in mem- ory can significantly improve the performance [326]. A RDD can be thought of as an immutable multiset of data items, which are distributed over a cluster of machines. RDDs were developed to address the limitations of the Hadoop map-reduce paradigm, which forces a linear data flow on distributed programs.
Spark SQL [80] is a component of the Spark stack, which integrates relational pro- cessing with Spark’s functional API. This allows programmers to leverage the benefits of relational processing while working with the Spark stack. A declarative DataFrame API seamlessly integrates with procedural Spark code and allows to cross-fertilise re- lational and procedural processing. Spark SQL enables developers to express complex analytics in form of a rich API.
Spark itself provides only the computing core and data structures. To be used as a full data analytics framework it requires a cluster manager and a distributed storage sys- tem. As cluster managers, besides supporting standalone, i.e., native Spark clusters, Spark supports, among others, Hadoop YARN and Apache Mesos [12]. It supports a wide variety of distributed storage systems, e.g., HDFS, Cassandra, Amazon S3, Tachyon [223] (renamed into Alluxio), and many more. The Spark core itself can be seen as an alternative and faster processing model to replace Hadoop’s computation model. However, by itself, Spark is, like Hadoop, essentially a batch processing frame- work. In order to address the increasing need for (near) real-time analytics, Spark comes with a streaming component, called Spark Streaming.
The core concept of Spark Streaming is to divide arriving data into mini-batches (short, stateless, deterministic tasks) of data and then to perform RDD transformations and computations on those mini-batches. A big advantage of this is that it allows to write streaming jobs the exact same way than batch jobs. However, on the downside this comes with the limitation that the latency is equal to the mini-batch duration. Spark Streaming is built on so-called “discretised streams” (D-Streams), which are proposed by Zaharia et al., [328] as a processing model for fault-tolerant streaming computation at a large-scale. D-Streams aim at enabling a parallel recovery mechanism that should improve efficiency. While Spark’s batch-based streaming mechanism works very well for many domains, like business intelligence or social media analytics, for applications that need to (potentially) react to every data record this is problematic. For example, in the context of cyber-physical systems, it is often not appropriate to wait the end of a batch period before reacting. Also, in batch-based processing systems it is often difficult to deal with poor data quality. For example, records can be missing, data streams can arrive out of (creation) time order, and data from different sources may be generated at the same time, but some streams may arrive delayed. These real-time factors cannot be addressed easily in a batch-based processing. This is usually much easier to overcome in a record-by-record stream processing system, where each record has its own timestamp, and is processed individually. The same counts for finding correlations between multiple records over time. Batch-based stream processing is limiting the granularity of a response inherently to the length of a batch.
The GraphX [323] component of Spark is especially interesting in the context of this dissertation. Many problem domains, from social networks over advertising to machine learning, can be naturally expressed with graph data structures. However, it is difficult and often inefficient to directly apply existing data-parallel computation frameworks to graphs. To address this, GraphX allows to express graph computations, that can model the Pregel [232] abstraction, on top of Spark. Like Spark itself, GraphX was originally developed at the University of California, at Berkeley’s AMPLab. It maps distributed graphs as tabular data structures to fit the graph representations into the Spark processing model. GraphX aims at combining the advantages of data-parallel and graph-parallel systems by allowing to formulate graph computations within the Spark data-parallel framework [323]. With GraphFrames [126] the authors present an integrated API for mixing graph and relational queries on top of Spark. It enables users to combine graph algorithms, pattern matching, and relational queries. While representing complex data in form of graphs goes into a similar direction than the approach followed in this thesis, the graph data structure provided in GraphX does
not provide means to represent and analyse data in motion, nor does it support to represent and analyse many different alternatives.
With MLlib [237] the Spark stack provides a distributed machine learning library on top of Spark’s core. It supports several languages and, therefore, provides a high-level API to simplify the development of machine learning tasks. A wide variety of machine learning and statistical algorithms have been implemented and integrated into MLlib, such as support vector machines, logistic regression, linear regression, decision trees, alternating least squares, k-means, and many more. MLlib greatly benefits from the integration into the Spark ecosystem and can leverage its components, like GraphX and Spark SQL. While MLlib is able to extract commonalities over massive datasets in a very efficient way, it is mostly a coarse-grained and pipeline-based learning approach. This makes it less appropriate for online learning and for structuring fine-grained learn- ing units of entities which behave very differently, as it is needed for CPSs. Although, by leveraging Spark’s host languages, learning tasks can be implemented on a higher level of abstraction, implementing learning tasks with MLlib remains challenging. In this dissertation we propose an approach to model machine learning together with the domain data itself, instead of managing learned and collected data in different models. Velox [121] is an online model management, maintenance, and serving platform for the Spark analytics stack, which aims at serving and managing models at scale. A model in the context of Velox refers to learning models,i.e., models built from training datasets in order to enable predictions (not to models in the sense of MDE). While commodity cluster compute frameworks like Hadoop and Spark enabled to facilitate the creation of statistical models that can be used to make predictions in a variety of applications, e.g., personalised recommendations or targeted advertising, they mostly neglect the problems of how models are trained, deployed, served, and managed. A common approach is to store the computed models into a data store that has no knowledge about the semantics of the model [121]. This let it to other application layers to interpret and serve the models and to manage the machine learning life cycles, e.g., to integrate feedback into the models. Velox aims at filling this gap of model serving and management. Therefore, it performs two main tasks. First, Velox exposes models as a service via a generic API which provides predictions for a range of query types. Secondly, it manages the model life cycle, i.e., it keeps the model up-to-date by implementing offline and online maintenance strategies. Instead of providing an additional modelling layer for model management and maintenance, the approach presented in this thesis suggests to integrate model management and maintenance directly into data and domain modelling. More specifically, by expressing the relationships and dependencies of learned information together with and on the same level as collected data—combined with online learning techniques—the model is always up-to-date. Moreover, this model is used to generate a domain-specific API (through MDE techniques), which enable applications to leverage the information and semantics captured in the model.
Essentially, Spark is like Hadoop a pipeline-based analytics framework. Although, it allows to efficiently process mini-batches in a stream-based way with Spark Streaming, the core of Spark is pipeline-based. While this is appropriate for many application domains, for others, like cyber-physical systems and IoT, which often require imme- diate reactions, such pipeline-based approach is problematic. This is continued in
machine learning libraries, like MLlib, which are built on top of Spark’s core and are therefore also mainly designed for a pipeline-based processing model. This, essen- tially, makes Spark, like Hadoop, less appropriate for near real-time analytics. While GraphX offers rich, graph-based data structures in Spark, it does not provide support for time-evolving graphs nor for representing many different alternatives. Both, Spark and Hadoop enable distributed computing and are able to process data close to the source. However, although both analytics stacks are scalable, they require a consider- able amount of computation power and are not well suited for smaller devices. While Velox and MLlib offers higher-level abstractions for learning, it does not allow to model data and learning at the same level. Whereas data analytics in the context of cyber- physical systems requires to analyse data in live, the amount of data is usually less compared to business intelligence or social media analytics, where Spark and Hadoop are optimised for. Although, the amount of data to be analysed in the context of this dissertation is “big” it is not in the dimensions of petabytes, as it can be the case for Spark. In summary, while there are similarities of our approach compared to Spark and Hadoop, the specific requirements of live analytics of CPS data and, therefore, the resulting techniques are quite different.