Big Data and Scripting Systems beyond Hadoop

(1)

Big Data and Scripting

Systems beyond Hadoop

(2)

ZooKeeper

• distributed coordination service

• many problems are shared among distributed systems

• ZooKeeper provides an implementation that solves these

• avoid repeated implementation of the same services

• provide primitives for – synchronization

– configuration maintenance – naming

• optimized for failure tolerance, reliability and performance

• used in other projects as sub service

• another Apache top-level project

(3)

Zookeeper

• provides tree-like information storage

• update guarantees

– sequential consistency (keep update order) – atomicity

– single system image (one state for views) – reliability (applied updates persist) – timeliness (time bounds for updates)

• extremely simple interface

– create/delete test node existance – get/set data

– get children

– sync (wait for update to propagate)

(4)

ZooKeeper environment

• run standalone or replicated/distributed

• interfaces for Java and C

• independent (but often used in together with) Hadoop/HDFS

• memory based (stored data is kept in memory and therefore limited)

• prefers heavy reading over heavy writing

(5)

Mesos

¹

Hadoop:

• use one (physical) cluster of machines exclusively Mesos:

• enable sharing of a cluster between multiple distributed computing systems

• implent intermediate layer between distributed framework (e.g. Hadoop) and hardware

• administrate physical resources

• distribute these to involved frameworks

• improve cluster utilization

• implement prioritization and failure tolerance

1Mesos: A platform for fine-grained resource sharing in the data center, Hindman et.al., 2011

(6)

example scenarios

multiple Hadoop systems on the same (physical) set of machines

• production system, takes priority

• testing implementations or execute analyses that are of general interest but should not disturb the production system

• test new versions of Hadoop

• all involved Hadoop instances use the same data as input

different distributed frameworks on the same cluster

• different tasks benefit from different optimization approaches

• the map reduce approach is not optimal in every situation

• still, the different frameworks might work on the same base data

(7)

dividing tasks

scheduling

• distribute tasks to available resources

• consider data locality

send tasks to nodes that already store the involved data

• depends on framework (optimization strategy), job (algorithm) and task order

should be implemented by the framework

resource distribution

• distribute available resource to frameworks

• keep track of system usage

• ensure priorities between different frameworks

should be implemented by intermediate layer (Mesos)

(8)

architecture

• centralized master-slave system

• frameworks run tasks on slave nodes

• master implements sharing using resource offers:

– list of free resources on various slaves

– master decides which (and how many) resources are offered to which framework

implements organizational policy

• frameworks have two parts:

– scheduler - accepts or declines offers from Mesos

– executor process - started on computing nodes, executes tasks

• framework decides which task is solved on a particular resource

• tasks are executed by sending task description to Master

(9)

design

resource allocation

• each framework gets a guaranteed amount of resources corresponding to its share

• when resources are available, framework receives additional offers can cause conflicts (another framework starts using its share)

• over usage of first framework resolved by “revoking” (i.e. killing) tasks

frameworks can indicate demand, otherwise free resource are offered freely

• when used resources are below share, no task is revoked

• when usage above share, any task can be revoked

(10)

design

robustness and fault tolerance

• frameworks are treated as unreliable:

– resources offered to a framework count into its usage – unanswered offers are interpreted as rejection

– the corresponding resources are offered to another framework

• single master could pose a single point of failure

• master has soft state:

• replace master can reconstruct state from information held by slaves and framework schedulers

• recover:

– active slaves – active frameworks – running tasks

• inactive standby replacement masters elect new leader using ZooKeeper

• frameworks can also register replacement schedulers

(11)

summary/overview

• a framework/library/set of servers

• allows to run several distributed frameworks on top of a single cluster of machines

• administrates and distributes resources with respect to configurable priorities

• is an actually implemented and used system:

mesos.apache.org

• made it into the apache incubator

• started at UC Berkeley AMP Lab

(12)

Spark: an alternative distributed framework

²

• map reduce is not the optimal framework for all algorithms

• problem: many algorithms iterate a number of times over their source data

• example: gradient descent, each iteration uses source data to compute new gradient

• in map reduce/Hadoop every iteration reads all source data completely from disk and computes a single step

• approach in Spark: create resilient distributed datasets (RDDs), if possible cached in memory of the involved machines

2Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010

(13)

Spark: overview

• cluster computing system, comparable to Hadoop

• provides primitives for in-memory cluster computing – data types that are distributed throughout

– parts on the individual machines kept in memory

• in applications that benefit from this approach, Spark tends to be much faster than Hadoop

• provides APIs for Scala, Java, Python

• build on top of Mesos

• originally developed for

– iterative algorithms (iterations using the same source data) – interactive data mining

• spark-project.org

• not (yet) an Apache project, but open-source (BSD-license)

(14)

intermission: Scala

• integrates object-oriented and functional language features

• Java-based

– (in part) similar syntax

– compiles to Java-bytecode (runs on standard JRE)

• type save

• allows functional programming

• interactive usage with console

• full object-oriented compilable programming (including GUI)

• free: www.scala-lang.org

(15)

Spark programming model

• Spark applications consist of a driver program – implements the global, high-level control flow – launches operations that are executed in parallel

• distribution and parallelization is achieved with – resilient distributed data sets (RDDs)

– parallel operations working on RDDs – shared variables

(16)

resilient distributed datasets

• RDDs are read-only

• partitioned across the compute nodes

• not necessarily on physical storage:

– described by handle

– handle contains information to infer RDD from reliable storage – parts can be rebuild in case of data loss

• constructed:

– from files (e.g. in HDFS)

– from local collection (e.g. distribute array across cluster) – by transformation from existing RDD

– by changing persistence of existing RDD

(17)

resilient distributed datasets

• lazy evaluation:

– creating handle describes derivation

– derivation is only executed, when necessary

• ephemeral:

– not guaranteed to stay in memory – recreated on demand

• state can be changed using cache and save

• cache:

– still not evaluated

– after first evaluation kept in memory, if possible

• save:

– triggers evaluation

– writes to distributed storage

– handle of saved RDD points to persistently stored object

(18)

parallel transformations

emulating map reduce is simple:

• flatMap(function) apply function to each element of the RDD, produce new RDD from results (multiple results per call)

• reduceByKey(function)

– called on collections of (K,V) key/value pairs – groups by key, aggregate with function other transformations include

• map(function) apply function to all elements

• filter(), sample()

• union() (of two RDDs), distinct() (distinct elements)

• sort(), groupByKey()

• join() (equi-join on key), cartesian()

• cogroup() maps (K,V), (K,W) → (K, Seq(V), Seq(W))

(19)

actions to access distributed data

actions extract data from the RDD and transport it back into the context of the driver program:

• collect() retrieve all elements of an RDD

• first(), take(n), retrieve first/first n elements

• reduce(func) use commutative, associative func for parallel reduction and retrieve final result

• forEach(func) run function over all elements (e.g. for statistics)

• count() get number of elements

(20)

shared variables

• parallel functions transport variables from their original context to the node they are executed at

• these have to be transport every time a function is send over the network

• Spark supports 2 additional forms for special use cases

• broadcast variables:

– transported only once to all involved nodes – read only in parallel functions

• accumulators

– parallel functions can “add” to accumulators – adding is some associative operation

– can be read only by driver program

(21)

Spark - summary

• system for distributed computation

• master-slave architecture

• main difference to map reduce:

in-memory computations with RDDs

(22)

Pregel

³

• many large scale problem involve graphs (graph: nodes connected by edges)

• example: PageRank

• distributed system designed for graph computations

• assumption: many graph algorithms – traverse nodes via edges

– access data very local

e.g. computations for a node involve values of its neighbors

• Pregel implements such a system, but is not public/open source

• Giraph is an open source framework implementing the same idea:

giraph.apache.org

3Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser,Czajkowski, 2010

(23)

Pregel: idea

• basic unit of computation: node with unique id and its incident edges

• node can perform computations

• nodes communicate with each other via messages

• a superstep is one round where each node computes

• in each superstep:

– node receives messages from last round

– update/computes/sends messages to be received in next round

• each node can vote for stopping – turns node inactive

– gets reactivated by received message

– computation stops when all nodes vote for stop

(24)

example: connected components

• assume: node ids are totally ordered

• node initializes minID with its ID

• sends its ID to all neighbors

• in each following round:

• collect all received IDs

• update minimum

• if minID changed send new minID to neighbors

• else vote for stop result:

• all nodes in a component have same minID

• nodes in different components have different minID

(25)

• master slave system

• master node:

– determines end of algorithm – takes care of node failure

– synchronizes node communication

• basic idea can be extended:

– nodes can mutate the graph (create/delete nodes/edges)