Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Full text

(1)UC BERKELEY . Mesos: A Platform for FineGrained Resource Sharing in Data Centers (II). Anthony D. Joseph. LASER Summer School. September 2013.

(2) My Talks at LASER 2013. 1. AMP Lab introduction. 2. The Datacenter Needs an Operating System. 3. Mesos, part one. 4. Dominant Resource Fairness. 5. Mesos, part two. 6. Spark. 2.

(3) Collaborators. • Matei Zaharia. • Benjamin Hindman. • Andy Konwinski. • Ali Ghodsi. • Randy Katz. • Scott Shenker. • Ion Stoica. 3.

(4) Apache Mesos. Spark. …. Cassandra. MPI. Hive. Hadoop. Hypertbale. A common resource sharing layer for diverse frameworks. PIQL. …. SCADS. Datacenter “OS” (e.g., Apache Mesos). Node OS. (e.g. Linux). Node OS. (e.g. Windows). …. Node OS. (e.g. Linux). Run multiple instances of the same framework. » Isolate production and experimental jobs. » Run multiple versions of the framework concurrently. Support specialized frameworks for problem domains. 4.

(5) Implementation. 20,000+ lines of C++. APIs in C, C++, Java, and Python. Master failover using ZooKeeper. Frameworks ported: Hadoop, MPI, Torque. New specialized frameworks: Spark, Apache/HaProxy . Open source Apache project http://mesos.apache.org/ . 5.

(6) Frameworks. Ported frameworks:. » Hadoop (900 line patch). » MPI (160 line wrapper scripts). New frameworks:. » Spark, Scala framework for iterative jobs (1300 lines). » Apache+haproxy, elastic web server farm (200 lines). 6.

(7) Isolation. Mesos has pluggable isolation modules to isolate tasks sharing a node. Currently supports Linux Containers and Solaris projects . » Can isolate memory, CPU, IO, network bandwidth. Could be a great place to use VMs. 7.

(8) Apache ZooKeeper. Server. Server. Client. Client. Server. Client. Leader . . Server. Client. Server. Client. Client. Server. Client. Multiple servers require coordination. » Leader Election, Group Membership, Work Queues, Data Sharding, Event Notifications, Configuration, and Cluster Management. Highly available, scalable, distributed coordination kernel. » Ordered updates and strong persistence guarantees. » Conditional updates (version), Watches for data changes.

(9) Connect to currently active master. Master Failure. Scheduler. Hadoop JobTracker. Mesos Master 2. Mesos Master 3. Mesos Master 1. Scheduler. MPI Scheduler. ZooKeeper used to elect one active Mesos master. Zoo 1 Zoo 5 . ZooKeeper. Zoo 4 . Zoo 2 . Zoo 3 . Connect to currently active master. Slave 1. Hadoop Executor. MPI Executor. Slave 2. Hadoop Executor. Hadoop Executor. Slave 3. JVM Executor. 9.

(10) Resource Revocation. Killing tasks to make room for other users. Killing typically not needed for short tasks. » If avg task length is 2 min, a new framework gets 10% of all machines within 12 seconds on avg. Hadoop job and task durations at Facebook. 10.

(11) Resource Revocation (2). Not the normal case because fine-grained tasks enable quick reallocation of resources . Sometimes necessary:. » Long running tasks never relinquishing resources. » Buggy job running forever. » Greedy user who decides to makes his task long. Safe allocation lets frameworks have long running tasks defined by allocation policy. » Users will get at least safe share within specified time. » If stay below safe allocation, task won’t be killed. 11.

(12) Resource Revocation (3). Dealing with long tasks monopolizing nodes. » Let slaves have long slots and short slots. » Short slots killed if used too long by a task. Revoke only if a user is below its safe share and is interested in offers. » Revoke tasks from users farthest above their safe share. » Framework given a grace period before killing its tasks. 12.

(13) Example: Running MPI on Mesos. Users always told their safe share. » Avoid revocation by staying below it. Giving each user a small safe share may not be enough if jobs need many machines . Can run a traditional HPC scheduler as a user with a large safe share of the cluster, and have MPI jobs queue up on it. » E.g. Torque gets 40% of cluster. 13.

(14) Example: Torque on Mesos. Facebook.com. 20%. Job 1. Job 2. Torque. Ads. Spam. User 1. 40%. 40%. User 2. Job 1. Safe share = 40%. MPI Job. MPI MPI MPI Job Job Job. Job 4. 14.

(15) Some Mesos Deployments. 1,000’s of nodes running over a dozen production services . Genomics researchers using Hadoop and Spark on Mesos. Spark in use by Yahoo! Research. Spark for analytics. Hadoop and Spark used by machine learning researchers. 15.

(16) Results.

(17) Share of Cluster. Dynamic Resource Sharing. 100%. 90%. 80%. 70%. 60%. 50%. 40%. 30%. 20%. 10%. 0%. MPI. Hadoop. Spark. 1. 31 61 91 121 151 181 211 241 271 301 331. Time (s). 17.

(18) Elastic Web Server Farm. Load calculation. HTTP httperf. HTTP request. request. Scheduler task. (haproxy). Load gen framework. resource offer. Mesos master. Mesos slave. Mesos slave. Load gen executor. Web executor. Load gen executor. Web executor. task. task. (Apache). task. task. (Apache). status update. Mesos slave. Load gen executor. Web. executor. task. task. task. (Apache). 18.

(19) Web Framework Results. 19.

(20) Scalability. Task startup overhead with 200 frameworks. 20.

(21) Fault Tolerance. Mean time to recovery, 95% confidence. 21.

(22) Deep Dive Experiments. Macrobenchmark experiment. » Test the benefits of using Mesos to multiplex a cluster between multiple diverse frameworks. High level goals of experiment. » Demonstrate increased CPU/memory utilization due to multiplexing available resources. » Demonstrate job runtime speedups. 22.

(23) Macrobenchmark setup. 100 Extra Large EC2 instances (4 cores/15GB ram per machine). Experiment length: ~25 minutes. Realistic workload. 1. A Hadoop instance running a mix of small and large jobs based on the workload at Facebook. 2. A Hadoop instance running a set of large batch jobs. 3. Spark running a series of machine learning jobs. 4. Torque running a series of MPI jobs. 23.

(24) Goal of experiment. Run the four frameworks and corresponding workloads…. » 1st on a cluster that is shared via Mesos. » 2nd on 4 partitioned clusters, each ¼ the size of the shared cluster . Compare resource utilization and workload performance (i.e., job run times) on static partitioning vs. sharing with Mesos. 24.

(25) Macrobenchmark Details: Breakdown of the Facebook Hive (Hadoop) Workload mix. Bin. Job Type. Map Tasks. Reduce Tasks Jobs Run. 1. Selection. 1. NA. 38. 2. Text search. 2. NA. 18. 3. Aggregation. 10. 2. 14. 4. Selection. 50. NA. 12. 5. Aggregation. 100. 10. 6. 6. Selection. 200. NA. 6. 7. Text Search. 400. NA. 4. 8. Join. 400. 30. 2. 25.

(26) Results: CPU Allocation. 100 node cluster. Hadoop (Facebook Mix). Hadoop (Batch Jobs). 26.

(27) Sharing With Mesos vs. No-Sharing (Dedicated Cluster). 1 0.8 0.6 0.4 0.2 0. (b) Large Hadoop Mix Share of Cluster. Share of Cluster. (a) Facebook Hadoop Mix Dedicated Cluster Mesos. 0. 200. 400. 600. 800. 1000. 1200. 1400. 1 0.8 0.6 0.4 0.2 0. 1600. Dedicated Cluster Mesos. 0. 500. 1000. Time (s). 200. 400. 600. 800. 1000. Time (s). Share of Cluster. Share of Cluster. 2500. 3000. (d) Torque / MPI Dedicated Cluster Mesos. 0. 2000. Time (s). (c) Spark 1 0.8 0.6 0.4 0.2 0. 1500. 1200. 1400. 1600. 1800. 1 0.8 0.6 0.4 0.2 0. Dedicated Cluster Mesos. 0. 200. 400. 600. 800. 1000. 1200. 1400. Time (s). 27. 1600.

(28) CPU Utilization (%). Cluster Utilization Mesos vs. Dedicated Clusters. 100 80 60 40 20 0. Mesos 0. 200. 400. 600. 800. 1000. Static 1200. 1400. 1600. Memory Utilization (%). Time (s) 50 40 30 20 10 0. Mesos 0. 200. 400. 600. 800. 1000. Static 1200. 1400. 1600. Time (s) 28.

(29) Job Run Times (and Speedup) Grouped By Framework. Framework. Sum of Exec times on Sum of Exec Times Dedicated Cluster (s). on Mesos (s). Speedup. Facebook Hadoop Mix. 7235. 6319. 1.14. Large Hadoop Mix. 3143. 1494. 2.10. Spark. 1684. 1338. 1.26. Torque / MPI. 3210. 3352. 0.96. 2x speedup for Large Hadoop Mix. 29.

(30) Job Run Times (and Speedup) Grouped by Job Type. Framework. Facebook Hadoop Mix. Job Type. selection (1). text search (2). aggregation (3). selection (4). aggregation (5). selection (6). text search (7). join (8). Large Hadoop Mix text search. Spark. ALS. Torque / MPI. small tachyon. large tachyon. Time on Dedicated Cluster (s). 24. 31. 82. 65. 192. 136. 137. 662. 314. 337. 261. 822. Avg. Speedup. on Mesos. 0.84. 0.90. 0.94. 1.40. 1.26. 1.71. 2.14. 1.35. 2.21. 1.36. 0.91. 0.88. 30.

(31) Discussion: Facebook Hadoop Mix Results. Smaller jobs perform worse on Mesos:. » Side effect of interaction between fair sharing performed by Hadoop framework (among its jobs) and performed by Mesos (among frameworks). » When Hadoop has more than 1/4 of the cluster, Mesos allocates freed up resources to framework farthest below its share. » Significant effect on any small Hadoop job submitted during this time (long delay relative to its length). » In contrast, Hadoop running alone can assign resources to the new job as soon as any of its tasks finishes. 31.

(32) Discussion: Facebook Hadoop Mix Results. Similar problem with hierarchical fair sharing appears in networks. » Mitigation #1: run small jobs on a separate framework, or. » Mitigation #2: use lottery scheduling as the Mesos allocation policy. 32.

(33) Discussion: Torque Results. Torque is the only framework that performed worse, on average, on Mesos. » Large tachyon jobs took on average 2 minutes longer. » Small ones took 20s longer. 33.

(34) Discussion: Torque Results. Causes of delay. » Partially due to Torque having to wait to launch 24 tasks on Mesos before starting each job – average delay is 12s. » Rest of the delay may be due to stragglers (slow nodes). » In standalone Torque run, two jobs each took ~60s longer to run than others . » Both jobs used a node that performed slower on singlenode benchmarks than the others (Linux reported a 40% lower bogomips value on the node). » Since tachyon hands out equal amounts of work to each node, it runs as slowly as the slowest node. 34.

(35) Macrobenchmark Summary. Evaluated performance of diverse set of frameworks representing realistic workloads running on Mesos versus a statically partitioned cluster. Showed 10% increase in CPU utilization, 18% increase in memory utilization. Some frameworks show significant speed ups in job run time. Some frameworks show minor slowdowns in job run time due to experimental/environmental artifacts. 35.

(36) Summary. Mesos is a platform for sharing data centers among diverse cluster computing frameworks. » Enables efficient fine-grained sharing. » Gives frameworks control over scheduling. Mesos is. » Scalable (50,000 slaves). » Fault-tolerant (MTTR 6 sec). » Flexible enough to support a variety of frameworks (MPI, Hadoop, Spark, Apache, …). 36.

(37) My Talks at LASER 2013. 1. AMP Lab introduction. 2. The Datacenter Needs an Operating System. 3. Mesos, part one. 4. Dominant Resource Fairness. 5. Mesos, part two. 6. Spark. 37.

(38)

No results found