3.3 Tools for Big Data Analytics
3.3.4 Apache Spark
As mentioned in previous sections, MapReduce is not suitable for iterative algorithms or interactive analytics. This is not because iterative jobs cannot be implemented, but because data have to be repeatedly stored and loaded at each iteration. Furthermore, data can be replicated on the distributed file system between successive jobs. The design of Apache Spark [133, 131, 132] is intended to solve this problem by keeping the working dataset in memory and reusing it. For this reason, Spark represents a landmark in the history of Big Data tools and has had strong success in the community. The overall framework and parallel computing model of Spark are similar to MapReduce, while the innovation is in the data model, represented by Resilient Distributed Datasets (RDDs), which are immutable multisets. In this section, we only give an overview of Spark, which will be further investigated in Chapter 4.
A Spark program can be characterized by the two kinds of operations applicable to RDDs: transformations and actions. Together, these operations compose the directed acyclic graph (DAG) representing the application. Transformations are functional-style operations on collections, such as map, filter, and flatMap, that are uniformly applied to whole RDDs [131]. Actions, such as reduce or collect, return a value to the user after running a computation on the dataset, thus they effectively start the program execution. Transformations in Spark are lazy, in that they do not compute their results right away. Instead, Spark just records the transformations applied to some base dataset, and computes them when an action requires a result to be returned to the driver program, that is, the main program written by the programmer.
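The difference between building a lazy pipeline and triggering it can be illustrated without a cluster. The following is a minimal sketch in plain Java that uses `java.util.stream` as an analogy, not the Spark API: intermediate stream operations stand in for transformations and the terminal operation stands in for an action. The class name `LazyPipelineSketch` and the counter `calls` are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {

    // Counts how many times the mapping function has actually been invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // Builds the pipeline only: as with transformations, nothing is computed here.
    static Stream<Integer> lengths(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .map(word -> {
                    calls.incrementAndGet();
                    return word.length();
                });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");

        Stream<Integer> pipeline = lengths(lines);
        System.out.println("calls after building the pipeline: " + calls.get());

        // The terminal operation plays the role of an action and starts the computation.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result + ", calls: " + calls.get());
    }
}
```

The counter stays at zero until `collect` is invoked, which mirrors how a chain of transformations is only recorded until an action requests a result.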
Listing 3.3 shows the source code for a simple Word Count application in the Java Spark API.
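import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// sc is an existing JavaSparkContext.
JavaRDD<String> textFile = sc.textFile("hdfs://...");

// Split each line into words (Spark 1.x signature; since Spark 2.0, call returns an Iterator<String>).
JavaRDD<String> words =
    textFile.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

// Pair each word with an initial count of 1.
JavaPairRDD<String, Integer> pairs =
    words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

// Sum the counts of each word.
JavaPairRDD<String, Integer> counts =
    pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) {
        return a + b;
      }
    });

// Action: triggers the execution of the transformations above and writes the result.
counts.saveAsTextFile("hdfs://...");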
Listing 3.3: A Java Word Count example in Spark.
For stream processing, Spark implements an extension through the Spark Streaming module, providing a high-level abstraction called discretized stream or DStream [133]. A DStream represents a continuous sequence of RDDs of the same type, each called a micro-batch.
Spark’s execution model relies on the Master-Worker model: a cluster manager (e.g., YARN) manages resources and supervises the execution of the program. It schedules the application onto worker nodes, which execute the application logic (the DAG) that has been serialized and sent by the master.
Resilient Distributed Datasets
An RDD is a read-only collection of objects partitioned across a cluster of computers that can be operated on in parallel. A Spark application consists of a driver program that creates RDDs from, for instance, HDFS files, an existing Scala collection, or queries to databases. The driver program may transform an RDD in parallel by invoking supported operations with user-defined functions, each of which returns another RDD. Since RDDs are read-only collections, each transformation creates a new RDD containing the result of the transformation. By default, this new collection is neither materialized nor kept in memory: it is the result of an expression rather than a value, and it is recomputed each time an action requires it. The driver can also persist an RDD in memory, allowing it to be reused efficiently across parallel operations without recomputing it. In fact, the semantics of RDDs have the following properties:
• Abstract: elements of an RDD do not have to exist in physical memory. In this sense, an element of an RDD is an expression rather than a value. The value can be computed by evaluating the expression when necessary (i.e., when executing an action).
• Lazy and Ephemeral: RDDs can be created from a file or by transforming an existing RDD with operations such as map, filter, groupByKey, reduceByKey, join, etc. However, RDDs are materialized on demand when they are used in some operation, and are discarded from memory after use. This amounts to a form of lazy evaluation.
• Caching and Persistence: a dataset can be cached in memory across operations, which makes future actions much faster since the dataset does not have to be reconstructed. Caching is a key tool for iterative algorithms and fast interactive use cases. It is one special case of persistence, which allows different storage levels, e.g., persisting the dataset on disk, keeping it in memory as serialized Java objects (to save space), replicating it across nodes, or storing it off the JVM heap.
• Fault Tolerance: if any partition of an RDD is lost, the lost block will automatically be recomputed using only the transformations that originally created it.
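The properties listed above can be mimicked on a single machine. The following is a minimal sketch in plain Java, not Spark code: the hypothetical class `LazyDataset` stores only the recipe (lineage) for computing its elements, recomputes them at every use unless `persist()` has been called, and rebuilds them from the lineage after the cached copy is dropped. All names (`LineageSketch`, `LazyDataset`, `dropCache`, `evaluations`) are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

public class LineageSketch {

    // A toy stand-in for an RDD: it stores how to compute its elements
    // (the lineage), not the elements themselves.
    static final class LazyDataset<T> {
        private final Supplier<List<T>> lineage;
        private List<T> cached;       // filled only after persist() and a first use
        private boolean persistent;

        LazyDataset(Supplier<List<T>> lineage) {
            this.lineage = lineage;
        }

        static <T> LazyDataset<T> of(List<T> base) {
            return new LazyDataset<T>(() -> base);
        }

        // Transformation: returns a new dataset whose lineage extends this one.
        <R> LazyDataset<R> map(Function<T, R> f) {
            return new LazyDataset<R>(() -> {
                List<R> out = new ArrayList<>();
                for (T x : collect()) {
                    out.add(f.apply(x));
                }
                return out;
            });
        }

        LazyDataset<T> persist() {
            persistent = true;
            return this;
        }

        // Simulates the loss of the in-memory copy.
        void dropCache() {
            cached = null;
        }

        // Action: evaluates the lineage, reusing the cached copy if present.
        List<T> collect() {
            if (cached != null) {
                return cached;
            }
            List<T> result = lineage.get();
            if (persistent) {
                cached = result;
            }
            return result;
        }
    }

    // Counts how many times the user-defined function is evaluated.
    static int evaluations = 0;

    static int square(int x) {
        evaluations++;
        return x * x;
    }

    public static void main(String[] args) {
        LazyDataset<Integer> squares =
                LazyDataset.of(Arrays.asList(1, 2, 3)).map(LineageSketch::square);
        System.out.println("after map: " + evaluations);          // nothing computed yet
        squares.collect();
        squares.collect();
        System.out.println("two uncached uses: " + evaluations);  // computed twice
        squares.persist().collect();
        squares.collect();
        System.out.println("after persist: " + evaluations);      // computed once more, then reused
        squares.dropCache();
        System.out.println(squares.collect() + " rebuilt from lineage: " + evaluations);
    }
}
```

The sketch ignores partitioning and distribution entirely; it only shows why keeping the lineage is enough to both defer the computation and recover lost data.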
Operations on RDDs take user-defined functions as arguments. These functions are closures, in the functional programming style supported by Scala, the language used to implement the Spark runtime: they can refer to non-local variables from the scope in which they were created. These captured variables are copied to the workers when Spark runs a closure. We recall that, by exploiting the JVM and its serialization features, it is possible to serialize closures and send them to the workers.
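The JVM mechanism recalled above can be sketched with the standard Java serialization API alone. In this minimal example, which is not Spark's own shipping code, a lambda captures a local variable, is turned into bytes on the "driver" side, and is rebuilt and applied on the "worker" side (here simply the same JVM). The names `ClosureShippingSketch`, `SerializableFunction`, and `addOffset` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ClosureShippingSketch {

    // A function type that can be serialized, as closures sent to workers must be.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Builds a closure that captures the local variable 'offset'.
    static SerializableFunction<Integer, Integer> addOffset(int offset) {
        return x -> x + offset;
    }

    // "Driver" side: turn the closure, with its captured variables, into bytes.
    static byte[] serialize(Object closure) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(closure);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // "Worker" side: rebuild the closure from the received bytes.
    @SuppressWarnings("unchecked")
    static SerializableFunction<Integer, Integer> deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SerializableFunction<Integer, Integer>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] shipped = serialize(addOffset(10));
        SerializableFunction<Integer, Integer> onWorker = deserialize(shipped);
        System.out.println(onWorker.apply(5)); // prints 15
    }
}
```

The captured value of `offset` travels inside the serialized bytes, so the rebuilt closure works on a copy: changes made to it on one side are not visible on the other.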
Spark also offers two kinds of shared variables:
• Broadcast variables, which are copied to the workers once, and are appropriate when large read-only data is used in multiple operations.
• Accumulators, which are states local to Workers that are only “updated” through an associative operation and can therefore be efficiently supported in parallel. Accumulators can thus be used to implement counters or sums. Only the driver program can read the accumulator’s value. Spark natively supports accumulators of numeric types.
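The two kinds of shared variables can be imitated with threads standing in for workers. The following is a minimal sketch in plain Java under that assumption, not the Spark API: a read-only set is shared by all tasks (the "broadcast" value), each task updates only a local counter, and the "driver" merges the partial counts with an associative operation, as an accumulator would. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedVariablesSketch {

    // Counts the words, over all partitions, that are not in the stop-word set.
    static long countNonStopWords(List<List<String>> partitions, Set<String> stopWords) {
        // "Broadcast": one read-only copy shared by every task.
        Set<String> broadcast = Collections.unmodifiableSet(new HashSet<>(stopWords));

        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> partition : partitions) {
                // Each task updates only its own local counter.
                Callable<Long> task = () -> {
                    long local = 0;
                    for (String word : partition) {
                        if (!broadcast.contains(word)) {
                            local++;
                        }
                    }
                    return local;
                };
                partials.add(workers.submit(task));
            }

            // "Accumulator": partial values are merged with an associative
            // operation (+), and only the driver reads the total.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(List.of("the", "cat", "sat"));
        partitions.add(List.of("on", "the", "mat"));
        System.out.println(countNonStopWords(partitions, Set.of("the", "on")));
    }
}
```

Because addition is associative, the order in which the partial counts are merged does not affect the total, which is what makes this kind of update safe to perform in parallel.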
By reusing cached data in RDDs, Spark offers a considerable performance improvement over Hadoop MapReduce [131], which makes it suitable for iterative machine learning algorithms. Similar to MapReduce, Spark is independent of the underlying storage system. If a distributed file system is not available, it is the application developer’s duty to organize data on distributed nodes, e.g., by partitioning and collocating related datasets. Such data organization is critical for interactive analytics, since caching alone is insufficient and ineffective for extremely large data.
Spark Streaming
For stream processing, Spark implements an extension through the Spark Streaming module, providing a high-level abstraction called discretized stream or DStream [133]. A DStream represents a continuous sequence of RDDs of the same type, each called a micro-batch. Operations over DStreams are “forwarded” to each RDD in the DStream, thus the semantics of operations over streams is defined in terms of batch processing according to the simple translation $op(a) = [op(a_1), op(a_2), \ldots]$, where $[\cdot]$ refers to a possibly unbounded ordered sequence, $a = [a_1, a_2, \ldots]$ is a DStream, and each item $a_i$ is a micro-batch of type RDD. All RDDs in a DStream are processed in order, whereas data items inside an RDD are processed in parallel without any ordering guarantees.
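The translation of a stream operation into per-batch operations can be sketched in plain Java on a finite prefix of a stream. This is an illustration under simplifying assumptions, not the Spark Streaming API: a DStream is modeled as a list of micro-batches, each micro-batch as a list of integers, and the hypothetical method `forward` applies a batch operation to every micro-batch in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class MicroBatchSketch {

    // Applies op to every micro-batch, in order: op(a) = [op(a_1), op(a_2), ...].
    static <B, R> List<R> forward(List<B> stream, Function<B, R> op) {
        List<R> results = new ArrayList<>();
        for (B batch : stream) {      // micro-batches are visited sequentially
            results.add(op.apply(batch));
        }
        return results;
    }

    // A batch operation: items inside one micro-batch are summed in parallel.
    static int sum(List<Integer> batch) {
        return batch.parallelStream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<List<Integer>> stream = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5),
                Arrays.asList(6));
        System.out.println(forward(stream, MicroBatchSketch::sum)); // prints [6, 9, 6]
    }
}
```

The outer loop preserves the order of micro-batches, while the parallel stream inside `sum` is free to combine the items of one batch in any order, which is harmless here because addition is associative and commutative.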