
This layer shows how the program is actually executed, following the process-based and scheduling-based categorization described in Section 2.3.2.

4.5.1 Scheduling-based Execution

In Spark, Flink and Storm, the resulting process network dataflow follows the Master-Workers pattern, where actors from previous layers are transformed into tasks. Fig. 4.7a shows a representation of the Spark Master-Workers runtime. We will also use this structure to examine Storm and Flink, since the pattern is similar for them: they differ only in how tasks are distributed among workers and how the inter/intra-communication between actors is managed.

The Master has total control over program execution, job scheduling, communications, failure management, resource allocations, etc. The master also relies on a cluster manager, an external service for acquiring resources on the cluster (like Mesos, YARN or ZooKeeper).

The master is the only entity that knows the semantic dataflow representing the current application, while workers are completely agnostic about the whole dataflow: they only obtain tasks to execute, which represent actors of the execution dataflow the master is running. It is only when the execution is actually launched that the semantic dataflow is built and possibly optimized to obtain the best execution plan (as in Flink and Google Dataflow). With this postponed evaluation, the master creates the parallel execution dataflow to be executed.
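The postponed evaluation described above can be illustrated with a minimal sketch. It is not taken from any of the frameworks discussed: the class `LazyDataflow` and its methods are hypothetical names, and no plan optimization is attempted. Calling `map` or `filter` only records an actor; the parallel execution dataflow (one task per partition) is created when `execute` is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class LazyDataflow:
    """Hypothetical stand-in for the semantic dataflow held by the master."""
    source: List[int]
    stages: List[Callable[[Iterable[int]], Iterable[int]]] = field(default_factory=list)

    def map(self, fn: Callable[[int], int]) -> "LazyDataflow":
        # Only record the actor; nothing is computed at this point.
        self.stages.append(lambda items: (fn(x) for x in items))
        return self

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataflow":
        # Again, only the description of the actor is stored.
        self.stages.append(lambda items: (x for x in items if pred(x)))
        return self

    def execute(self, parallelism: int = 2) -> List[int]:
        # At launch time the "master" partitions the input and creates one
        # task per partition; each task applies every recorded stage in order.
        partitions = [self.source[i::parallelism] for i in range(parallelism)]
        results: List[int] = []
        for part in partitions:
            items: Iterable[int] = part
            for stage in self.stages:
                items = stage(items)
            results.extend(items)
        return sorted(results)


flow = LazyDataflow(list(range(10))).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(flow.execute(parallelism=3))
```

In this sketch the tasks run sequentially in one process; a real runtime would hand each partition to a worker, which sees only its own task and not the whole dataflow.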

Each framework has its own instance of the master entity: in Spark it is called SparkContext, in Flink the JobManager, in Storm Nimbus, and in Google Dataflow the Cloud Dataflow Service. In Storm and Flink, data distribution is managed in a decentralized manner, that is, it is delegated to each executor, since they use pipelined data transfers and forward tokens as soon as they are produced. For efficiency, in Flink tuples are collected in a buffer, which is sent over the network once it is full or a certain time threshold is reached. In Spark batch, data can possibly be dispatched by the master, but typically each worker gets data from a DFS. In Spark Streaming, the master is responsible for data distribution: it discretizes the stream into micro-batches that are buffered in the workers' memory.
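The buffering policy mentioned for Flink (ship a buffer when it is full or when a time threshold is reached) can be sketched as follows. This is an illustration under our own assumptions, not Flink's implementation: `OutputBuffer`, `capacity`, `timeout_s` and the injectable `clock` are hypothetical names, and the timeout is checked only when `emit` or `maybe_flush` is called rather than by a background timer.

```python
import time
from typing import Any, Callable, List, Optional


class OutputBuffer:
    """Hypothetical per-executor output buffer with size and time thresholds."""

    def __init__(self, capacity: int, timeout_s: float,
                 send: Callable[[List[Any]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.timeout_s = timeout_s
        self.send = send          # stands in for the network transfer
        self.clock = clock        # injectable so the policy can be tested
        self.items: List[Any] = []
        self.first_ts: Optional[float] = None

    def emit(self, record: Any) -> None:
        if not self.items:
            # Remember when the oldest buffered record arrived.
            self.first_ts = self.clock()
        self.items.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        if not self.items:
            return
        full = len(self.items) >= self.capacity
        expired = self.clock() - self.first_ts >= self.timeout_s
        if full or expired:
            self.send(self.items)
            self.items = []       # start a fresh buffer


sent: List[List[Any]] = []
buf = OutputBuffer(capacity=3, timeout_s=1.0, send=sent.append)
for token in ["a", "b", "c", "d"]:
    buf.emit(token)
print(sent)   # the first three tokens were shipped together; "d" is still buffered
```

A larger capacity or timeout favours throughput, while smaller values favour latency; the sketch makes that trade-off explicit through its two parameters.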

The master generally keeps track of distributed tasks, decides when to schedule the next tasks, reacts to finished vs. failed tasks, keeps track of the semantic dataflow progress, and orchestrates collective communications and data exchange among workers. This last aspect is crucial when executing the so-called shuffle operation, which implies a data exchange among executors. Since workers do not have any information about one another, to exchange data they have to request information from the master and, moreover, signal that they are ready to send or receive data.
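The bookkeeping role of the master can be sketched in a few lines. The sketch below is our own simplification and does not reproduce the scheduler of any specific framework: the class `Master` and its methods (`next_task`, `report`, `register_output`, `locate`) are hypothetical. It tracks pending, running and finished tasks, reschedules failed ones, and acts as the directory that workers consult during a shuffle to learn where the data they need is located.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set


class Master:
    """Hypothetical master: task bookkeeping plus a shuffle location directory."""

    def __init__(self, task_ids: List[str]) -> None:
        self.pending: Deque[str] = deque(task_ids)
        self.running: Set[str] = set()
        self.done: Set[str] = set()
        # partition id -> workers that hold data for that partition
        self.shuffle_locations: Dict[int, List[str]] = {}

    def next_task(self) -> Optional[str]:
        # Hand out the next task to whichever worker asks for it.
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running.add(task)
        return task

    def report(self, task_id: str, ok: bool) -> None:
        # React to finished vs. failed tasks: failed ones are rescheduled.
        self.running.discard(task_id)
        if ok:
            self.done.add(task_id)
        else:
            self.pending.append(task_id)

    def register_output(self, partition: int, worker: str) -> None:
        # A worker announces that it is ready to send data for a partition.
        self.shuffle_locations.setdefault(partition, []).append(worker)

    def locate(self, partition: int) -> List[str]:
        # A worker asks where to fetch the data for a partition.
        return list(self.shuffle_locations.get(partition, []))


master = Master(["map-0", "map-1"])
task = master.next_task()
master.report(task, ok=False)          # the failed task goes back to the queue
master.register_output(0, "worker-a")
print(list(master.pending), master.locate(0))
```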

In Google Dataflow, the Master is represented by the Cloud Dataflow managed service, which deploys and executes the DAG representing the application, built during the Graph Construction Time (see Sect. 4.4). Once on the Dataflow service, the DAG becomes a Dataflow Job. The Cloud Dataflow managed service automatically partitions data and distributes the Transforms code to Compute Engine instances (Workers) for parallel processing.

Figure 4.7: Master-Workers structure of the Spark runtime (a) and Worker hierarchy example in Storm (b).

Workers are nodes executing the actor logic, namely, a worker node is a process in the cluster. Within a worker, a certain number of parallel executors are instantiated, which execute tasks related to the given application. Workers have no information about the dataflow at any level since they are scheduled by the master. Although the frameworks use different nomenclatures, in Spark, Storm and Flink cluster nodes are decomposed into Workers, Executors and Tasks. A Worker is a node of the cluster, i.e., a Spark worker instance. A node may host multiple Worker instances. An Executor is a (parallel) process that is spawned in a Worker process and it executes Tasks, which are the actual kernel of an actor of the dataflow. Fig. 4.7b illustrates this structure in Storm, an example that would also be valid for Spark and Flink.
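The Worker, Executor and Task hierarchy can be mimicked with a small sketch. It is only an analogy under our own assumptions: the class `Worker` is hypothetical, executors are modelled as threads of a standard `ThreadPoolExecutor` instead of separate processes, and a task is any callable handed over by the master.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


class Worker:
    """Hypothetical worker: a pool of executors that run opaque tasks."""

    def __init__(self, name: str, num_executors: int) -> None:
        self.name = name
        # Each pool thread plays the role of one executor inside the worker.
        self.pool = ThreadPoolExecutor(max_workers=num_executors)

    def run(self, tasks: List[Callable[[], Any]]) -> List[Any]:
        # The worker knows nothing about the dataflow: it just runs the
        # tasks it was given and returns their results in submission order.
        futures = [self.pool.submit(task) for task in tasks]
        return [future.result() for future in futures]

    def shutdown(self) -> None:
        self.pool.shutdown(wait=True)


worker = Worker("worker-0", num_executors=2)
print(worker.run([lambda: sum(range(5)), lambda: "token".upper()]))
worker.shutdown()
```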

In Google Dataflow, workers are Google Compute Engine instances, occasionally referred to as Workers or VMs. The Dataflow managed service deploys Compute Engine virtual machines associated with Dataflow jobs using Managed Instance Groups. A Managed Instance Group creates multiple Compute Engine instances from a common template and allows the user to control and manage them as a group. The Compute Engine instances execute both serial and parallel code (e.g., ParDo parallel code) related to a job (parallel execution DAG).

4.5.2 Process-based Execution

In TensorFlow, actors are actually mapped to threads and possibly distributed on different nodes. The cardinality of the semantic dataflow is preserved, as each actor node is instantiated into one node, and the allocation is decided using a placement algorithm based on cost model optimization. This model is statically estimated based on heuristics or on previous dataflow executions of the same application. The dataflow is distributed on cluster nodes and each node/Worker may host one or more dataflow actors/Tasks, which internally implement data parallelism with a pool of threads/Executors working on Tensors. Communication among actors is done using the send/receive paradigm, allowing workers to manage their own data movement or to receive data without involving the master node, thus decentralizing the logic and the execution of the application.
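The process-based model, in which each actor is a thread and actors exchange data through send/receive without a master in between, can be sketched with standard Python threads and queues. This is not TensorFlow code: the helper names `actor` and `run_pipeline`, the two example kernels and the end-of-stream marker are all assumptions made for illustration.

```python
import queue
import threading
from typing import Any, Callable, List

END = object()  # hypothetical end-of-stream marker


def actor(kernel: Callable[[Any], Any],
          inbox: "queue.Queue[Any]", outbox: "queue.Queue[Any]") -> threading.Thread:
    """Start one thread per actor: receive a token, apply the kernel, send it on."""
    def loop() -> None:
        while True:
            token = inbox.get()            # receive
            if token is END:
                outbox.put(END)            # propagate termination downstream
                return
            outbox.put(kernel(token))      # send
    thread = threading.Thread(target=loop)
    thread.start()
    return thread


def run_pipeline(values: List[int]) -> List[int]:
    a_in: "queue.Queue[Any]" = queue.Queue()
    a_to_b: "queue.Queue[Any]" = queue.Queue()
    b_out: "queue.Queue[Any]" = queue.Queue()
    # Two actors connected directly by a channel; no master mediates the exchange.
    threads = [actor(lambda x: x + 1, a_in, a_to_b),
               actor(lambda x: x * 2, a_to_b, b_out)]
    for value in values:
        a_in.put(value)
    a_in.put(END)
    results: List[int] = []
    while True:
        token = b_out.get()
        if token is END:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results


print(run_pipeline([1, 2, 3]))
```

The cardinality of the graph is preserved here in the sense used above: two actors in the dataflow become exactly two threads, each owning its own input channel.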

As we have seen in Sec. 3.3, a sort of mixed model is proposed by Naiad. Nodes represent data-parallel computations. Each computer or thread, called shard, executes the entire dataflow graph. It keeps its fraction of the state of all nodes resident in local memory throughout, as for a scheduling-based execution. Execution occurs in a coordinated fashion, with all shards processing the same node at any time, and graph edges are implemented by channels that route records between shards as required. There is no Master entity directing the execution: each shard is autonomous also in fault tolerance and recovery.
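The shard-based execution can be illustrated with a small sequential simulation. It is a sketch under our own assumptions and not Naiad code: the function `run_sharded`, the two-node graph (a keying node followed by a counting node) and the modulo-based routing are hypothetical choices. Every shard runs both nodes on its own fraction of the data, all shards process the same node before any moves to the next, and the edge between the nodes routes each record to the shard that owns its key.

```python
from typing import Dict, List, Tuple


def run_sharded(records: List[int], num_shards: int) -> List[Dict[int, int]]:
    """Simulate shards that each execute the whole two-node dataflow graph."""
    # Each shard starts with its own fraction of the input.
    local: List[List[int]] = [records[i::num_shards] for i in range(num_shards)]

    # Node 1, executed by every shard: tag each record with a key.
    keyed: List[List[Tuple[int, int]]] = [[(r % 3, 1) for r in part] for part in local]

    # Edge between node 1 and node 2: a channel that routes each record
    # to the shard owning its key (here simply key modulo num_shards).
    inboxes: List[List[Tuple[int, int]]] = [[] for _ in range(num_shards)]
    for part in keyed:
        for key, value in part:
            inboxes[key % num_shards].append((key, value))

    # Node 2, executed by every shard: each shard keeps only its own
    # fraction of the node's state (the counters for the keys it owns).
    states: List[Dict[int, int]] = []
    for inbox in inboxes:
        state: Dict[int, int] = {}
        for key, value in inbox:
            state[key] = state.get(key, 0) + value
        states.append(state)
    return states


print(run_sharded(list(range(6)), num_shards=2))
```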