• No results found

Big Data and Scripting Systems beyond Hadoop

N/A
N/A
Protected

Academic year: 2021

Share "Big Data and Scripting Systems beyond Hadoop"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data and Scripting

Systems beyond Hadoop

(2)

ZooKeeper

• distributed coordination service

• many problems are shared among distributed systems

• ZooKeeper provides an implementation that solves these

• avoid repeated implementation of the same services

• provide primitives for – synchronization

– configuration maintenance – naming

• optimized for failure tolerance, reliability and performance

• used in other projects as sub service

• another Apache top-level project

(3)

Zookeeper

• provides tree-like information storage

• update guarantees

– sequential consistency (keep update order) – atomicity

– single system image (one state for views) – reliability (applied updates persist) – timeliness (time bounds for updates)

• extremely simple interface

– create/delete test node existance – get/set data

– get children

– sync (wait for update to propagate)

(4)

ZooKeeper environment

• run standalone or replicated/distributed

• interfaces for Java and C

• independent (but often used in together with) Hadoop/HDFS

• memory based (stored data is kept in memory and therefore limited)

• prefers heavy reading over heavy writing

(5)

Mesos

1

Hadoop:

• use one (physical) cluster of machines exclusively Mesos:

• enable sharing of a cluster between multiple distributed computing systems

• implent intermediate layer between distributed framework (e.g. Hadoop) and hardware

• administrate physical resources

• distribute these to involved frameworks

• improve cluster utilization

• implement prioritization and failure tolerance

1Mesos: A platform for fine-grained resource sharing in the data center, Hindman et.al., 2011

(6)

example scenarios

multiple Hadoop systems on the same (physical) set of machines

• production system, takes priority

• testing implementations or execute analyses that are of general interest but should not disturb the production system

• test new versions of Hadoop

• all involved Hadoop instances use the same data as input

different distributed frameworks on the same cluster

• different tasks benefit from different optimization approaches

• the map reduce approach is not optimal in every situation

• still, the different frameworks might work on the same base data

(7)

dividing tasks

scheduling

• distribute tasks to available resources

• consider data locality

send tasks to nodes that already store the involved data

• depends on framework (optimization strategy), job (algorithm) and task order

should be implemented by the framework

resource distribution

• distribute available resource to frameworks

• keep track of system usage

• ensure priorities between different frameworks

should be implemented by intermediate layer (Mesos)

(8)

architecture

• centralized master-slave system

• frameworks run tasks on slave nodes

• master implements sharing using resource offers:

– list of free resources on various slaves

– master decides which (and how many) resources are offered to which framework

implements organizational policy

• frameworks have two parts:

– scheduler - accepts or declines offers from Mesos

– executor process - started on computing nodes, executes tasks

• framework decides which task is solved on a particular resource

• tasks are executed by sending task description to Master

(9)

design

resource allocation

• each framework gets a guaranteed amount of resources corresponding to its share

• when resources are available, framework receives additional offers can cause conflicts (another framework starts using its share)

• over usage of first framework resolved by “revoking” (i.e. killing) tasks

frameworks can indicate demand, otherwise free resource are offered freely

• when used resources are below share, no task is revoked

• when usage above share, any task can be revoked

(10)

design

robustness and fault tolerance

• frameworks are treated as unreliable:

– resources offered to a framework count into its usage – unanswered offers are interpreted as rejection

– the corresponding resources are offered to another framework

• single master could pose a single point of failure

• master has soft state:

• replace master can reconstruct state from information held by slaves and framework schedulers

• recover:

– active slaves – active frameworks – running tasks

• inactive standby replacement masters elect new leader using ZooKeeper

• frameworks can also register replacement schedulers

(11)

summary/overview

• a framework/library/set of servers

• allows to run several distributed frameworks on top of a single cluster of machines

• administrates and distributes resources with respect to configurable priorities

• is an actually implemented and used system:

mesos.apache.org

• made it into the apache incubator

• started at UC Berkeley AMP Lab

(12)

Spark: an alternative distributed framework

2

• map reduce is not the optimal framework for all algorithms

• problem: many algorithms iterate a number of times over their source data

• example: gradient descent, each iteration uses source data to compute new gradient

• in map reduce/Hadoop every iteration reads all source data completely from disk and computes a single step

• approach in Spark: create resilient distributed datasets (RDDs), if possible cached in memory of the involved machines

2Spark: Cluster Computing with Working Sets, Zaharia, Chowdhury, Franklin, Shenker, Stoica, 2010

(13)

Spark: overview

• cluster computing system, comparable to Hadoop

• provides primitives for in-memory cluster computing – data types that are distributed throughout

– parts on the individual machines kept in memory

• in applications that benefit from this approach, Spark tends to be much faster than Hadoop

• provides APIs for Scala, Java, Python

• build on top of Mesos

• originally developed for

– iterative algorithms (iterations using the same source data) – interactive data mining

• spark-project.org

• not (yet) an Apache project, but open-source (BSD-license)

(14)

intermission: Scala

• integrates object-oriented and functional language features

• Java-based

– (in part) similar syntax

– compiles to Java-bytecode (runs on standard JRE)

• type save

• allows functional programming

• interactive usage with console

• full object-oriented compilable programming (including GUI)

• free: www.scala-lang.org

(15)

Spark programming model

• Spark applications consist of a driver program – implements the global, high-level control flow – launches operations that are executed in parallel

• distribution and parallelization is achieved with – resilient distributed data sets (RDDs)

– parallel operations working on RDDs – shared variables

(16)

resilient distributed datasets

• RDDs are read-only

• partitioned across the compute nodes

• not necessarily on physical storage:

– described by handle

– handle contains information to infer RDD from reliable storage – parts can be rebuild in case of data loss

• constructed:

– from files (e.g. in HDFS)

– from local collection (e.g. distribute array across cluster) – by transformation from existing RDD

– by changing persistence of existing RDD

(17)

resilient distributed datasets

• lazy evaluation:

– creating handle describes derivation

– derivation is only executed, when necessary

• ephemeral:

– not guaranteed to stay in memory – recreated on demand

• state can be changed using cache and save

• cache:

– still not evaluated

– after first evaluation kept in memory, if possible

• save:

– triggers evaluation

– writes to distributed storage

– handle of saved RDD points to persistently stored object

(18)

parallel transformations

emulating map reduce is simple:

• flatMap(function) apply function to each element of the RDD, produce new RDD from results (multiple results per call)

• reduceByKey(function)

– called on collections of (K,V) key/value pairs – groups by key, aggregate with function other transformations include

• map(function) apply function to all elements

• filter(), sample()

• union() (of two RDDs), distinct() (distinct elements)

• sort(), groupByKey()

• join() (equi-join on key), cartesian()

• cogroup() maps (K,V), (K,W) → (K, Seq(V), Seq(W))

(19)

actions to access distributed data

actions extract data from the RDD and transport it back into the context of the driver program:

• collect() retrieve all elements of an RDD

• first(), take(n), retrieve first/first n elements

• reduce(func) use commutative, associative func for parallel reduction and retrieve final result

• forEach(func) run function over all elements (e.g. for statistics)

• count() get number of elements

(20)

shared variables

• parallel functions transport variables from their original context to the node they are executed at

• these have to be transport every time a function is send over the network

• Spark supports 2 additional forms for special use cases

• broadcast variables:

– transported only once to all involved nodes – read only in parallel functions

• accumulators

– parallel functions can “add” to accumulators – adding is some associative operation

– can be read only by driver program

(21)

Spark - summary

• system for distributed computation

• master-slave architecture

• main difference to map reduce:

in-memory computations with RDDs

(22)

Pregel

3

• many large scale problem involve graphs (graph: nodes connected by edges)

• example: PageRank

• distributed system designed for graph computations

• assumption: many graph algorithms – traverse nodes via edges

– access data very local

e.g. computations for a node involve values of its neighbors

• Pregel implements such a system, but is not public/open source

• Giraph is an open source framework implementing the same idea:

giraph.apache.org

3Pregel: A System for Large-Scale Graph Processing, Malewicz, Austern, Bik, Dehnert, Horn, Leiser,Czajkowski, 2010

(23)

Pregel: idea

• basic unit of computation: node with unique id and its incident edges

• node can perform computations

• nodes communicate with each other via messages

• a superstep is one round where each node computes

• in each superstep:

– node receives messages from last round

– update/computes/sends messages to be received in next round

• each node can vote for stopping – turns node inactive

– gets reactivated by received message

– computation stops when all nodes vote for stop

(24)

example: connected components

• assume: node ids are totally ordered

• node initializes minID with its ID

• sends its ID to all neighbors

• in each following round:

• collect all received IDs

• update minimum

• if minID changed send new minID to neighbors

• else vote for stop result:

• all nodes in a component have same minID

• nodes in different components have different minID

(25)

• master slave system

• master node:

– determines end of algorithm – takes care of node failure

– synchronizes node communication

• basic idea can be extended:

– nodes can mutate the graph (create/delete nodes/edges)

References

Related documents

Methods: Ninety-seven outpatient idiopathic scoliosis patients enrolled from June 2008 to June 2011 were divided to three groups according to different Cobb angles and

Cisco Devices Generic DoS 367 ICMP Remote DoS Vulnerabilities 367 Malformed SNMP Message DoS Vulnerability 369. Examples of Specific DoS Attacks Against Cisco Routers 370 Cisco

Dental HMO plan benefits are provided by: SafeGuard Health Plans, Inc., a California corporation in CA; SafeGuard Health Plans, Inc., a Florida corporation in FL; SafeGuard

193. This is not dissimilar to the rule governing the reasonable scope of physical searches, but here, by virtue of the semantic zone, some authority is shifted to the magistrate.

Generally, the identity of Fusarium isolates section Liseola can only be confirmed by using sequences of TEF-1α gene, and the results from morphological characteristics and

Explanatory variables considered included service level, frequency, speed, stop spacing, separate right of way share, vehicle accessibility, employment and

Examples are pulse width modulation (PWM) converters and cycloconverters. Interharmonics generated by them may be located anywhere in the spectrum with respect to

To determine your monthly premium amount, refer to the MedicareBlue Supplement - Preferred Non-Tobacco premium table in the Outline of Coverage for Plans D, F, High Deductible F and