Data Parallel Runtimes - HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTI

Data parallel runtimes explicitly exploit the data parallelism of the computation and provide simple pro- gramming models where users can plug in their sequential implementation and obtain parallelism automatically at run time. Once domain-specific problems are cast into the model, the runtime is able to automatically parallelize the execution and schedule tasks. As a result, developers do not need to care about resource allo- cation, threading, concurrency control, synchronization and fault tolerance which are known to be difficult to program with.

2.3.1 Hadoop

Hadoop is an open source implementation of MapReduce developed under the umbrella of Apache Source Foundation. Hadoop community is developing MapReduce 2.0 which dramatically changes the architecture to i) separate resource management and task scheduling/management, ii) mitigate the performance bottleneck of a single master node.

Some higher level projects have been built on top of Haddop and add additional functionalities. For example, Apache Mahout implements many widely used data mining and machine learning algorithms in a parallel manner so that data of larger-scale can be processed efficiently. Hive is a data warehouse software that supports querying and managing large data sets. Queries are converted to MapReduce jobs which are run on Hadoop.

2.3.2 Iterative MapReduce Runtimes

Some frameworks and enhancements to MapReduce have been proposed for iterative MapReduce applications. HaLoop [34] modifies Hadoop to provide various caching options and reuse the same set of tasks to process data across iterations (i.e. tasks are loop-aware). Twister [57] and Spark [134] reuse the same set of “persistent” tasks to process intermediate data across iterations. Significant performance improvement over MapReduce has been shown for these frameworks.

2.3.3 Cosmos/Dryad

MapReduce model is limited in the expressiveness. Although many applications can be artificially trans- formed to MapReduce form (e.g. split the application into multiple MapReduce jobs), the transformation i) may be unnatural and complicate the writing of programs, ii) may have significant performance implication. Iterative MapReduce is one example. Dryad is an execution engine for data-parallel applications. It provides a DAG model which can encode both computation and communication and thus is more powerful and ex- pressive than MapReduce. Each vertex in the graph corresponds to a task that can be executed on available nodes after its prerequisite tasks have finished and the input data have been staged in. Dryad scheduler maps vertices to compute nodes with the goal of maximizing concurrency. So independent vertices can be run

simultaneously on multiple nodes or multiple cores on a single node. Difference types of communication channels are supported such as files, TCP pipes and shared-memory FIFOs. Dryad automatically monitors the whole system and recovers from computer or network failures. If a vertex fails, Dryad reruns the task on a different node. A version number is associated with each vertex execution to avoid conflicts. If the read of input data fails, Dryad reruns the corresponding upstream vertex to re-generate the data. In initial version, greedy scheduling was adopted by the job manager with the assumption that it is the only job running in the cluster. Dryad applies run-time optimization that dynamically refines the graph structure according to network topology and application workload. Dryad runs on top of Cosmos which is a distributed file system that facilitates sharing and managing distributed data sets across the whole cluster. Although Dryad provides more advanced features compared to MapReduce, its use in both industry and academia is really limited and thus its scalability and performance for running diverse parallel applications have not been demonstrated.

Fig. 2.4 shows the Dryad ecosystem. Many languages such as Nebula and DryadLINQ have been ex- tended to integrate the processing capability of underlying Dryad systems. Existing sequential programs can be easily modified to become parallel.

Figure 2.4: Dryad ecosystem

* Copied from http://research.microsoft.com/en-us/projects/dryad/

2.3.4 Sector and Sphere

Sector [65] is a user-space distributed file system. Sector files, which are stored in the local file systems of one or more slave nodes, are not split into blocks. So if files are too large, users need to manually split

them into multiple files of smaller size. Sector can manage data distributed across geographically scattered data centers. One assumption made by Sector is that nodes are interconnected with high-speed network links. Sector is network topology ware, which means network structure is considered when data are managed. Data in Sector are replicated and per-file replication factor can be specified by users. Sector allows users to specify where replicas are placed (e.g. on local rack, on a remote rack). Permanent file system metadata is not required. If file system metadata is corrupted, the metadata can be rebuilt from real data. Data transfer is done with a specific transport protocol called UDP-based Data Transfer (UDT) which provides reliability control and congestion control. UDT has been shown to be fast and firewall friendly, and used by both commercial companies and research projects. Fig. 2.5 shows the architecture of Sector. Sector adopts a master-slave architecture. The Security Server maintains user accounts, file access permission information and authorized slave nodes. The Master Server maintains file system metadata, monitors slave nodes and responds to users’ requests. Real data are stored on slave nodes.

Figure 2.5: Sector system architecture

* Copied from [65] Sphere [65] is a distributed runtime built upon Sector. Data are processed by Sphere Processing Engines (Sphere Processing Engines (SPEs)) each of which processes a segment of data records. Multiple input streams can be processed simultaneously. In-situ processing is achieved with best efforts by processing data near where it resides. User specified User Defined Functions (UDFs) can be plugged in and applied to data processing. In this sense, it is more generic than MapReduce model.

In document HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTING: OPTIMIZING MAPREDUCE (Page 41-45)