Apache Spark - General Purpose Large Scale Data Processing Engines


3.4 General Purpose Large Scale Data Processing Engines

3.4.1 Apache Spark

Apache Spark (Zaharia et al., 2012) has gained a lot of traction and attention because it addresses most of these shortcomings. It proposes a generic distributed processing engine that can be leveraged for many different purposes, such as streaming use cases (Zaharia et al., 2013), machine learning workloads, graph processing (Xin et al., 2013), and SQL access to data (Armbrust et al., 2015) for exploration and visualization purposes. Further innovations built on top of it address automatic model selection and parametrization (Kraska et al., 2013; Sparks et al., 2013) and approximate query processing on extremely large volumes of data (Agarwal et al., 2012, 2013).

It was originally developed for machine learning algorithms, which are typically iterative (scanning the data over and over again to optimise a certain function), and for interactive data mining through a console that enables exploration. Both objectives are enabled by providing the relevant primitives for working with data sets in memory. Because Apache Spark takes care of memory management and offers a wide variety of operations, it can both expose a console to the end user and run these iterative workflows faster, as the data can stay in memory between iterations.
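The benefit of keeping data in memory between iterations can be illustrated with a minimal sketch in plain Python. It is not Spark code: the function names, the toy data and the learning rate are all assumptions made for illustration. The data set is loaded once and then scanned repeatedly by a gradient descent loop.

```python
# Toy illustration in plain Python (not the Spark API): an iterative
# optimisation that scans the same data set many times.
# All names and values below are hypothetical.

load_count = 0  # how many times the "expensive" load was performed


def load_points():
    """Stand-in for an expensive read from disk or a remote store."""
    global load_count
    load_count += 1
    # Points on the line y = 2x, so the optimal slope is 2.0.
    return [(float(x), 2.0 * x) for x in range(1, 11)]


def fit_slope(points, iterations=200, learning_rate=0.01):
    """Gradient descent for y = w * x, minimising the mean squared error."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # Each iteration is one full pass over the in-memory data.
        gradient = sum(2.0 * (w * x - y) * x for x, y in points) / n
        w -= learning_rate * gradient
    return w


cached_points = load_points()      # loaded once, then kept in memory
slope = fit_slope(cached_points)   # 200 passes, no further loads
```

If `load_points()` were called inside the loop instead, the expensive read would be repeated on every iteration, which is the overhead that in-memory data sets are meant to avoid.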

The idea of creating a general purpose distributed processing engine that can be reused for different use cases is supported by comparing the lines of code of Apache Spark with those of other niche solutions for batch, streaming, MPP database and graph processing (see Figure 3.5). It is important to remark that Apache Spark is implemented in Scala, a language that typically produces more compact code with fewer lines. Even so, the difference with respect to its counterparts is noticeable. Less code means easier management and optimization. Furthermore, any optimization in the core is automatically reflected in the libraries built on top.

56 _{Chapter 3. Enabling Large Scale Data Science and Data Products}

Figure 3.5: Lines of code comparison for major Big Data frameworks.

Although Apache Spark is internally coded in Scala, it provides bindings for several other languages, namely Python, Java and, more recently, R. This makes it more suitable for the scientific community, whose members may feel more comfortable with these languages, as indicated by the research presented in Section 4.3.6. Having alternatives works well in complex scenarios where the architecture needs to be leveraged by many different stakeholders who tend to use different programming or scripting languages.

Figure 3.6 shows the stack of the Apache Spark core and its main libraries. These libraries are developed and bundled by the same team of developers, so they can be implemented more effectively by leveraging knowledge of the core engine. The components are:

• Apache Spark core engine. The Spark core engine is the central part of the stack that binds all components together. It is responsible for all basic Spark functionality, including memory management, fault recovery, storage system interaction and more. One of the major programming abstractions of Spark is the Resilient Distributed Dataset (RDD), a collection of items distributed across multiple nodes and typically stored in the memory of these nodes. RDDs can be manipulated in parallel through a set of high-level operations such as transformations, aggregations, joins and so on.

• Spark SQL is the package for working with structured data. It allows querying data via SQL as well as the Hive Query Language (HQL), and supports many data sources, including Hive tables, Parquet or ORC formats, and JSON. Moreover, it provides a SQL interface that allows developers to write programs that manipulate RDDs in other languages (Scala, Python, Java or R) alongside the SQL queries. This is a key ingredient that allows SQL to be combined with complex analytics coded in those programming languages, providing the means for a straightforward implementation of astronomical services like those in the VO, e.g. TAP (see Section 2.2). Apache Hive might be another option for a SQL interface that scales well. However, its main drawback may be the lack of easy ways to embed code in the SQL instructions, or to interleave both in different stages (e.g. starting the workflow with a set of queries and then continuing with a complex pipeline written in any of the programming languages available in Apache Spark).

• Spark Streaming is the component that allows processing of live streams of data (topics coming from Apache Kafka, telemetry, initial data treatments, Twitter streams, or other message queues containing regular updates). The streaming API extends the RDD API and hence facilitates code reuse between the batch and streaming processing workflows. This is one of the main concepts of Apache Spark: unifying several types of processing under the same engine. It helps reduce the high complexity of systems implementing the Lambda architecture (see Section 3.2.1), as the code for dealing with the batch view can be reused in the processing applied to the near-real-time view. Furthermore, all transformations of incoming data, as well as its final exposure (through Spark SQL), live in the same framework. This makes it very suitable for unifying architectures for several SOC and scientific archives, as stated in Section 4.4. A potential competitor for Spark Streaming might be Apache Storm, which goes beyond the Spark Streaming micro-batch concept (with a minimum latency of around half a second) and can serve scenarios demanding lower latencies. However, for most real-life use cases, acceptable latencies are above that minimum threshold.

• MLlib is the library that provides basic machine learning functionality. Its current offering contains machine learning algorithms for regression, clustering, classification, collaborative filtering and gradient boosting, among others. All these algorithms are designed to scale to large data sets and also serve as examples of how to approach new developments for mission-specific modeling like those stated in Section 4.5. One of the competitors for MLlib might be Apache Mahout, which implements a large variety of machine learning algorithms that used to run as a series of MapReduce jobs. Nowadays, it is being transformed into a programming environment for building scalable algorithms on top of Apache Spark (reinforcing the idea of the power of its core distributed processing engine).

• GraphX is a library for manipulating graphs and performing graph-parallel computations. Like the other components, GraphX extends the Spark RDD API, allowing the definition of directed graphs with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs (e.g. subgraph and the mapping of vertices) and a library of common graph algorithms (e.g. PageRank (Page et al., 1999) and triangle counting).
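Two ideas from the list above, the RDD as a partitioned collection manipulated through high-level operations and the reuse of the same code for batch and micro-batch processing, can be illustrated with a toy, single-process sketch in plain Python. `ToyRDD` and `sum_of_even_squares` are hypothetical names introduced here; this is not the Spark API, and it deliberately ignores distribution and fault recovery.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD: a collection split
    into partitions, with lazy transformations and eager actions."""

    def __init__(self, partitions, stages=()):
        self.partitions = partitions  # list of lists, one per "node"
        self.stages = stages          # pending per-partition transformations

    @classmethod
    def from_items(cls, items, num_partitions):
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.partitions,
                      self.stages + (lambda part: map(fn, part),))

    def filter(self, predicate):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self.partitions,
                      self.stages + (lambda part: filter(predicate, part),))

    def _run(self, partition):
        # Apply every pending stage to one partition.
        for stage in self.stages:
            partition = stage(partition)
        return list(partition)

    def collect(self):
        # Action: materialise all partitions into one list.
        return [item for p in self.partitions for item in self._run(p)]

    def reduce(self, fn):
        # Action: reduce each non-empty partition, then combine the partials.
        # Assumes at least one element survives the pending stages.
        computed = [self._run(p) for p in self.partitions]
        partials = [reduce(fn, part) for part in computed if part]
        return reduce(fn, partials)


def sum_of_even_squares(rdd):
    """Logic written once against the ToyRDD interface."""
    return (rdd.map(lambda x: x * x)
               .filter(lambda x: x % 2 == 0)
               .reduce(lambda a, b: a + b))


# Batch: the whole data set at once.
batch_total = sum_of_even_squares(ToyRDD.from_items(range(1, 11), 3))

# "Streaming": the same function applied to small micro-batches in turn.
micro_batches = [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10]]
stream_total = sum(
    sum_of_even_squares(ToyRDD.from_items(batch, 2))
    for batch in micro_batches
)
```

Both totals agree because the same function processes the data in either mode; only the way the input is sliced differs. In this sketch, that is the sense in which batch code can be reused on the streaming side.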

All the features stated above make Apache Spark the most suitable framework for the Gaia Archive Data Mining sub-workpackage, as it enables some of the concepts laid out in Section 2.4:

1. It empowers the idea of a living archive by providing a view of any data set being derived through well-known interfaces like SQL.

2. It facilitates an archive as a model, which could be developed, executed and tweaked on the platform (and from the raw data), allowing for a straightforward release of the results to the community through the mentioned interfaces (and the accompanying code).

3. The generic distributed processing engine is suited to run in a Cloud environment, where both resources and middleware can be made available to the community in isolated and on-demand environments (some kind of PaaS).

4. Interoperability is obtained both through the possibilities enabled by the data lake and through the variety of programming languages for which bindings are available, as remarked above.

5. Interactivity is supplied not only by the availability of a console, but also by the integration of multi-purpose notebooks like Apache Zeppelin, which help build data-driven, interactive and collaborative documents with declarative (e.g. SQL) and programming (e.g. Scala) languages. These notebooks are a step towards collaborative, reproducible research.

6. Consolidation of the SOC, archiving and data exploitation is easier when a generic distributed processing engine that can cope with streaming, batch and declarative workflows is leveraged.

7. Existing models are easy to port, and new ones are easy to develop. Chapter 5 shows an example of two models that have been developed from scratch. These models successfully use existing MCMC tools without any modification to them, which would prove challenging with other frameworks (e.g. MapReduce).
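The workflow pattern mentioned for SQL interfaces, starting with declarative queries and then continuing in a general-purpose language, can be sketched with Python's built-in `sqlite3` module standing in for the SQL engine. This is only an illustration of the pattern, not Spark SQL: the table, its columns and the values are made up for the example.

```python
import sqlite3
import statistics

# In-memory database standing in for the SQL engine (hypothetical schema
# and toy values, not real catalogue data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources (source_id INTEGER, band TEXT, magnitude REAL)"
)
conn.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [(1, "G", 12.0), (2, "G", 14.0), (3, "G", 16.0),
     (4, "BP", 13.0), (5, "G", 25.0)],
)

# Stage 1: declarative selection with SQL.
rows = conn.execute(
    "SELECT source_id, magnitude FROM sources "
    "WHERE band = 'G' AND magnitude < 20 ORDER BY source_id"
).fetchall()

# Stage 2: continue in the host language with logic that is less
# convenient to express in SQL.
magnitudes = [magnitude for _, magnitude in rows]
median_magnitude = statistics.median(magnitudes)
deviations = {
    source_id: magnitude - median_magnitude for source_id, magnitude in rows
}

conn.close()
```

The point of the sketch is the hand-over between the two stages: the query result is an ordinary in-language collection, so the subsequent analysis can use any library available in that language.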

Figure 3.6: Generality of Apache Spark combining SQL, streaming, machine learning and graph processing over the same distributed processing engine.

In document Architecture, techniques and models for enabling Data Science in the Gaia Mission Archive (Page 82-86)