Big Data and Scripting
Systems beyond Hadoop
ZooKeeper
• distributed coordination service
• many problems are shared among distributed systems
• ZooKeeper provides an implementation that solves these
• avoid repeated implementation of the same services
• provide primitives for – synchronization
– configuration maintenance – naming
• optimized for failure tolerance, reliability and performance
• used in other projects as sub service
• another Apache top-level project
Zookeeper
• provides tree-like information storage
• update guarantees
– sequential consistency (keep update order) – atomicity
– single system image (one state for views) – reliability (applied updates persist) – timeliness (time bounds for updates)
• extremely simple interface
– create/delete test node existance – get/set data
– get children
– sync (wait for update to propagate)
ZooKeeper environment
• run standalone or replicated/distributed
• interfaces for Java and C
• independent (but often used in together with) Hadoop/HDFS
• memory based (stored data is kept in memory and therefore limited)
• prefers heavy reading over heavy writing
Mesos
1Hadoop:
• use one (physical) cluster of machines exclusively Mesos:
• enable sharing of a cluster between multiple distributed computing systems
• implent intermediate layer between distributed framework (e.g. Hadoop) and hardware
• administrate physical resources
• distribute these to involved frameworks
• improve cluster utilization
• implement prioritization and failure tolerance
1Mesos: A platform for fine-grained resource sharing in the data center, Hindman et.al., 2011
example scenarios
multiple Hadoop systems on the same (physical) set of machines
• production system, takes priority
• testing implementations or execute analyses that are of general interest but should not disturb the production system
• test new versions of Hadoop
• all involved Hadoop instances use the same data as input
different distributed frameworks on the same cluster
• different tasks benefit from different optimization approaches
• the map reduce approach is not optimal in every situation
• still, the different frameworks might work on the same base data
dividing tasks
scheduling
• distribute tasks to available resources
• consider data locality
send tasks to nodes that already store the involved data
• depends on framework (optimization strategy), job (algorithm) and task order
should be implemented by the framework
resource distribution
• distribute available resource to frameworks
• keep track of system usage
• ensure priorities between different frameworks
should be implemented by intermediate layer (Mesos)
architecture
• centralized master-slave system
• frameworks run tasks on slave nodes
• master implements sharing using resource offers:
– list of free resources on various slaves
– master decides which (and how many) resources are offered to which framework
implements organizational policy
• frameworks have two parts:
– scheduler - accepts or declines offers from Mesos
– executor process - started on computing nodes, executes tasks
• framework decides which task is solved on a particular resource
• tasks are executed by sending task description to Master
design
resource allocation
• each framework gets a guaranteed amount of resources corresponding to its share
• when resources are available, framework receives additional offers can cause conflicts (another framework starts using its share)
• over usage of first framework resolved by “revoking” (i.e. killing) tasks
frameworks can indicate demand, otherwise free resource are offered freely
• when used resources are below share, no task is revoked
• when usage above share, any task can be revoked
design
robustness and fault tolerance
• frameworks are treated as unreliable:
– resources offered to a framework count into its usage – unanswered offers are interpreted as rejection
– the corresponding resources are offered to another framework
• single master could pose a single point of failure
• master has soft state:
• replace master can reconstruct state from information held by slaves and framework schedulers
• recover:
– active slaves – active frameworks – running tasks
• inactive standby replacement masters elect new leader using ZooKeeper
• frameworks can also register replacement schedulers
summary/overview
• a framework/library/set of servers
• allows to run several distributed frameworks on top of a single cluster of machines
• administrates and distributes resources with respect to configurable priorities
• is an actually implemented and used system:
mesos.apache.org
• made it into the apache incubator
• started at UC Berkeley AMP Lab
Spark: an alternative distributed framework
2• map reduce is not the optimal framework for all algorithms
• problem: many algorithms iterate a number of times over their source data
• example: gradient descent, each iteration uses source data to compute new gradient
• in map reduce/Hadoop every iteration reads all source data completely from disk and computes a single step
• approach in Spark: create resilient distributed datasets (RDDs), if possible cached in memory of the involved machines
2Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010
Spark: overview
• cluster computing system, comparable to Hadoop
• provides primitives for in-memory cluster computing – data types that are distributed throughout
– parts on the individual machines kept in memory
• in applications that benefit from this approach, Spark tends to be much faster than Hadoop
• provides APIs for Scala, Java, Python
• build on top of Mesos
• originally developed for
– iterative algorithms (iterations using the same source data) – interactive data mining
• spark-project.org
• not (yet) an Apache project, but open-source (BSD-license)
intermission: Scala
• integrates object-oriented and functional language features
• Java-based
– (in part) similar syntax
– compiles to Java-bytecode (runs on standard JRE)
• type save
• allows functional programming
• interactive usage with console
• full object-oriented compilable programming (including GUI)
• free: www.scala-lang.org
Spark programming model
• Spark applications consist of a driver program – implements the global, high-level control flow – launches operations that are executed in parallel
• distribution and parallelization is achieved with – resilient distributed data sets (RDDs)
– parallel operations working on RDDs – shared variables
resilient distributed datasets
• RDDs are read-only
• partitioned across the compute nodes
• not necessarily on physical storage:
– described by handle
– handle contains information to infer RDD from reliable storage – parts can be rebuild in case of data loss
• constructed:
– from files (e.g. in HDFS)
– from local collection (e.g. distribute array across cluster) – by transformation from existing RDD
– by changing persistence of existing RDD
resilient distributed datasets
• lazy evaluation:
– creating handle describes derivation
– derivation is only executed, when necessary
• ephemeral:
– not guaranteed to stay in memory – recreated on demand
• state can be changed using cache and save
• cache:
– still not evaluated
– after first evaluation kept in memory, if possible
• save:
– triggers evaluation
– writes to distributed storage
– handle of saved RDD points to persistently stored object
parallel transformations
emulating map reduce is simple:
• flatMap(function) apply function to each element of the RDD, produce new RDD from results (multiple results per call)
• reduceByKey(function)
– called on collections of (K,V) key/value pairs – groups by key, aggregate with function other transformations include
• map(function) apply function to all elements
• filter(), sample()
• union() (of two RDDs), distinct() (distinct elements)
• sort(), groupByKey()
• join() (equi-join on key), cartesian()
• cogroup() maps (K,V), (K,W) → (K, Seq(V), Seq(W))
actions to access distributed data
actions extract data from the RDD and transport it back into the context of the driver program:
• collect() retrieve all elements of an RDD
• first(), take(n), retrieve first/first n elements
• reduce(func) use commutative, associative func for parallel reduction and retrieve final result
• forEach(func) run function over all elements (e.g. for statistics)
• count() get number of elements
shared variables
• parallel functions transport variables from their original context to the node they are executed at
• these have to be transport every time a function is send over the network
• Spark supports 2 additional forms for special use cases
• broadcast variables:
– transported only once to all involved nodes – read only in parallel functions
• accumulators
– parallel functions can “add” to accumulators – adding is some associative operation
– can be read only by driver program
Spark - summary
• system for distributed computation
• master-slave architecture
• main difference to map reduce:
in-memory computations with RDDs
Pregel
3• many large scale problem involve graphs (graph: nodes connected by edges)
• example: PageRank
• distributed system designed for graph computations
• assumption: many graph algorithms – traverse nodes via edges
– access data very local
e.g. computations for a node involve values of its neighbors
• Pregel implements such a system, but is not public/open source
• Giraph is an open source framework implementing the same idea:
giraph.apache.org
3Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser,Czajkowski, 2010
Pregel: idea
• basic unit of computation: node with unique id and its incident edges
• node can perform computations
• nodes communicate with each other via messages
• a superstep is one round where each node computes
• in each superstep:
– node receives messages from last round
– update/computes/sends messages to be received in next round
• each node can vote for stopping – turns node inactive
– gets reactivated by received message
– computation stops when all nodes vote for stop
example: connected components
• assume: node ids are totally ordered
• node initializes minID with its ID
• sends its ID to all neighbors
• in each following round:
• collect all received IDs
• update minimum
• if minID changed send new minID to neighbors
• else vote for stop result:
• all nodes in a component have same minID
• nodes in different components have different minID
• master slave system
• master node:
– determines end of algorithm – takes care of node failure
– synchronizes node communication
• basic idea can be extended:
– nodes can mutate the graph (create/delete nodes/edges)