Big data frameworks - Big data landscape 3.1 Big Data Applications

Big data landscape 3.1 Big Data Applications

3.3. Big data frameworks

3.3.1. Apache Flink. Apache Flink is a batch and stream processing engine that models every computation as a data flow graph which is then submitted to the Flink cluster. The nodes in this graph are the computations and the edges are the communication links. Flink closely resembles both the data flow execution model and API. The user graph is transformed into an execution graph by Flink before it is executed on the distributed nodes. While undergoing this transformation, Flink optimizes the user graph, taking into account the data locality. Flink uses thread based worker model for executing the data flow graphs. It can chain consecutive tasks in the work flow in a single node to make the run more efficient by reducing data serializations and communications.

Even though Flink has a nice theoretically sound data flow abstraction for programming, we found that it is difficult to program in a strictly data-flow fashion for complex programs. This primarily due to the fact that control flow operations such as if conditions and iterations are harder code in data flow paradigm.

3.3.2. Apache Spark. Apache Spark is a distributed in-memory data processing engine. The data model in Spark is based on RDDs [150], and the execution model is based on RDDs and lineage graphs. The lineage graph captures dependencies between RDDs and their transformations. The logical execution model is expressed through a chain of transformations on RDDs by the user. This lineage graph is also essential in supporting fault tolerance in Apache Spark.

RDD’s can be read in from a file system such as HDFS, and transformations are applied to the RDDs. Spark transformations are lazy operations and actual work is only done when an action operation such as count, reduce are invoked. By default, intermediate RDDs created through transformations are not cached and will be recomputed when needed. The user has the ability to cache or persist intermediate RDDs by specifying this explicitly. This is very important for iterative computation where same data sets are being used over and over again.

Spark primarily uses a thread based worker model for executing the tasks. Unlike in Flink where the user submits the execution graph to the engine, Spark programs are controlled by a driver program. This driver program usually runs in a separate master node and the parallel regions in this driver program are shipped to the cluster to be executed. With this model complex control flow operations that needs to run serially such as iterations and if conditions run in master while data flow operators are executed in worker nodes. While this model makes it easier to write complex programs, it is harder to do complex optimizations on the data flow graph as it needs to be executed on the fly.

3.3.3. Apache Storm. Every DSPF consists of two logical layers identified as the application layer and the execution layer. The application layer provides the API for the user to define a stream processing application as a graph. The execution layer converts this user defined graph into an execution graph and executes it on a distributed set of nodes in a cluster.

3.3.3.1. Storm Application Layer. A Storm application called a topology determines the data flow graph, with streams defining the edges and processing elements defining the nodes. A stream is an unbounded sequence of events flowing through the edges of the graph, and each such event consists of a chunk of data. A node in the graph is a stream operator implemented by the user. The entry nodes in the graph acting as event sources to the rest of the topology are termed Spouts while the rest of the data processing nodes are called Bolts. The spouts generate event streams to the topology by connecting to external event sources such as message brokers. From here onwards we refer to both spouts and bolts as processing elements (PEs) or operators. Bolts consume input streams and produce output streams. The user code inside a bolt executes when an event is delivered to it on incoming links. The topology defines the logical flow of data among the PEs in the graph by using streams to connect PEs. A user can also define the parameters necessary to convert this user defined graph into an execution graph. The physical layout of the data flow is mainly defined by the parallelism of each processing element and the communication strategies defined among them. This graph is defined by the user who deploys it to the Storm cluster to be executed. Once deployed, the topology runs continuously, processing incoming events until it is terminated by the user. An example topology is shown in Figure 3.9 where it

has a spout connected to a bolt by a stream and a second bolt connected to the first bolt by another stream.

Figure 3.9. A sample stream processing user defined graph

Figure 3.10. A sample stream processing execution graph

3.3.3.2. Storm Execution layer. Storm master (known as Nimbus) converts logical graph of processing elements to an execution graph by taking the number of parallel tasks for each logical PE and the stream grouping 6.1.1 into account. For example, Figure 3.10 displays an execution graph of the user graph shown in Figure 3.9 where two instances ofS, three instances ofW and one instance ofGare running. The stream grouping betweenSand

W _{is a load balancing grouping where each instance of}S_{distributes its output to the 3}W

instances in a round-robin fashion. A runtime instance of a node in the execution graph is called a task.

After converting logical graph to execution graph, master node takes care of scheduling of the execution graph and also manages the stream processing applications running in the cluster. Each slave node runs a daemon called a supervisor, which is responsible for executing a set of worker processes which in turn execute the tasks of the execution graph. Tasks in an execution graph will get assigned to multiple workers running in the cluster.

running two workers. Each worker can host multiple tasks of the same graph, and the worker assigned a thread of execution to every task. If multiple tasks run in the same worker, multiple threads execute the user codes in the same worker process.

Figure 3.11. Storm task distribution in multiple nodes

In document TOWARDS HPC AND BIG DATA CONVERGENCE IN A COMPONENT BASED APPROACH (Page 46-50)