Dr. Bernd Mathiske
Senior Software Architect
Mesosphere
Why the Datacenter needs an
Operating System
Bringing
Google-Scale
Computing
to Everybody
A Slice of Google Tech Transfer History
2005: MapReduce -> Hadoop (Yahoo)
2007: Linux cgroups for lightweight isolation (Google) 2009: BigTable -> MongoDB
2009: “The Datacenter as a Computer” - Barroso, Hölzle (Google)
2009: Mesos - a distributed operating system kernel (UC Berkeley) 2010: Large scale production Mesos deployment (Twitter)
Notable Operating System Developments
Single-something => multi-something: user, tasking, threading, core, …
More: bits, memory, storage, bandwidth…
OS virtualization => lightweight virtualization (cgroups, LXCs, jails, …)
Packaging => containers (docker, rkt, lmctfy, …)
Static libraries => dynamic libraries => static libraries
Cluster Operating Systems (Hardware Clustering)
Researched since the 1980s
Trying to provide (the illusion of) a single system image
Aiming at HA, load balancing, location transparency (e.g. for storage)
Many systems: Amoeba, ChorusOS, GLUnix, Hurricane, MOSIX, Plan9, RHCS, Spring, Sprite, Sumo, QNX, Solaris MC, UnixWare, VAXclusters, …
Relatively low scale (up to 100s of nodes)
Complicated to manage, less dynamic than software clustering
From HPC Grid to Enterprise Cloud
Condor, LSF, Maui, Moab, Quartz, SLURM, …
Typically for batch jobs
Also cover services => SOA => more job schedulers
=> grid computing => grid middleware … => cloud stacks
From Server Virtualization to App Aggregation
Cloud Era:
Big apps, small servers Client-Server Era:
Small apps, big servers
Server
Virtualization
App App App App
App
Aggregation
Cloud Computing
SaaS: Salesforce demonstrated success, then many followed
PaaS: Deis, Dotcloud, OpenShift, Heroku, Pivotal, Stackato, …
IaaS: AWS, Azure, DigitalOcean, GCE…
Private cloud stacks including IaaS: Eucalyptus, CloudStack, Joyent, OpenStack, SmartCloud, vSphere, …
Datacenter
✴ A facility used to house computer systems and associated
components (e.g. networking, storage, cooling, sensors) ✴ In this talk we focus on how to manage and use a single
production cluster of networked computers in a datacenter ✴ Such clusters range in size from 10s to 10000s of nodes
✴ Why should we and how can we end up with
just one production cluster?
Datacenter Services
✴ LAMP (Linux, Apache, MSQL, PHP) or similar
✴ MEAN (MongoDB, Express.js, Angular.js, Node.js) or similar
✴ Cassandra, ElasticSearch, Exelixi, Hadoop, Hypertable, Jenkins,
Kafka, MPI, Spark, Storm, SSSP, Torque, … ✴ Private PaaS: Deis, …
✴ …
From Static Partitioning to Elastic Sharing
Static Partitioning
Elastic Sharing
WEB CACHE HADOOP
WASTED FREE FREE HADOOP WEB CACHE WASTED WASTED 100% — 100% —
Software Clustering
Layer between node OS and application frameworks
Scale
Multi-tenancy High availability
Available Open Source Components
✴ 2-level scheduler: Apache Mesos
✴ Meta-frameworks / schedulers: Aurora, Chronos, Marathon,
Kubernetes, Swarm, …
✴ Service discovery: Consul, HAProxy, Mesos DNS, … ✴ Highly available configuration: zk, etcd, …
✴ Storage: HDFS, Ceph, …
✴ Node OSs: lots of Linux variants
✴ Lots of app frameworks: Sparc, Storm, Cassandra, Kafka, …
2-Level Scheduling
Scale: from 1 node to at least 10000s of nodes
Optimizing resource management
End-to-end principle: “application-specific functions ought to reside in the end nodes of a network rather than intermediary nodes”
-> Requirement for general multi-tenancy
-> Requirement for having only one production cluster
App
How Mesos Works
16
Framework
Scheduler
Master
Slave
Master
Master
Master
Executor ExecutorTask
Task
Task
Task
zk/etcdWays to Run an Application
1. Vanilla job
• Employ meta-framework for invocation: Chronos, Aurora, Kubernetes, …
2. Application of an adapted framework
• Hadoop, Sparc, Storm, ElasticSearch, Cassandra, Kafka, many more…
3. Non-adapted services
• Employ meta-framework for invocation: Marathon, Aurora, Kubernetes, …
• Provide (select) a service discovery solution
4. Program your own scheduler (and executor)
The Mesos Framework API
✴ Currently like internal Mesos communication:
• protobuf messages over HTTP
✴ Soon:
• JSON messages over HTTP (stream)
=> no need to link with binary Mesos library and/or less to reimplement
ca. a dozen programming languages => any language
How to implement a framework
✴ Scheduler interface: 1 half of 2-level scheduling
• The framework knows best when to do what with what kind of resources • About a dozen callbacks, main functionality in 2 of them:
- receive resource offers
- receive task status updates
✴ Executor interface: task life-cycle management and monitoring
• Command line executor included in Mesos
• Docker executor included in Mesos
• Custom executors often not needed
Scheduler SPI (implemented by Framework)
20
public interface Scheduler {
void registered(SchedulerDriver driver, FrameworkID frameworkId,
MasterInfo masterInfo);
void reregistered(SchedulerDriver driver, MasterInfo masterInfo); void resourceOffers(SchedulerDriver driver, List<Offer> offers); void offerRescinded(SchedulerDriver driver, OfferID offerId); void statusUpdate(SchedulerDriver driver, TaskStatus status);
void frameworkMessage(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, byte[] data);
void disconnected(SchedulerDriver driver);
void slaveLost(SchedulerDriver driver, SlaveID slaveId);
void executorLost(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, int status);
void error(SchedulerDriver driver, String message); }
Minimal Scheduler Implementation
class MyFrameworkScheduler implements Scheduler { …
private TaskGenerator _taskGen;
public void resourceOffers(SchedulerDriver driver, List<Offer> offers) { if (_taskGen.doneCreatingTasks()) {
for (offer : offers) {
driver.declineOffer(offer.getId()); }
} else {
for (offer : offers) {
List<TaskInfo> taskInfos = _taskGen.generateTaskInfos(offer); driver.launchTasks(offer.getId(), taskInfos, _filters);
} } }
public void statusUpdate(SchedulerDriver driver, TaskStatus status) { _taskGen.observeTaskStatusUpdate(taskStatus); if (_taskGen.done()) { driver.stop(); } } … } 21
The Developer’s Perspective
✴ Focus on application logic, not datacenter structure ✴ Avoid networking-related code
✴ Reuse of built-in fault-tolerance and high availability
✴ Reuse distributed (infrastructure) frameworks (e.g., storage)
=> API, SDK for datacenter services
The Operations Engineer’s Perspective
✴ Ease of deployment/management
✴ Uniformity of deployment/management ✴ Hardware utilization rate
✴ Scaling up as business grows ✴ Scaling out sporadically
✴ Cost and time for moving to a different datacenter
✴ High availability and fault-tolerance of system services ✴ Monitoring
✴ Trouble shooting
Necessary Multi-Tenancy Features
Task containerization
Resource isolation
Resource and task attributes
Static and dynamic resource reservations
Reservation levels
Meta-frameworks
Dynamic scheduler update and reconfiguration
Security
Desirable Multi-Tenancy Features
Optimistic offers
Oversubscription
Task preemption, migration, resizing, reconfiguration
Rate limiting
Auto-scaling => hybrid cloud
Infrastructure frameworks
Using Docker Containers in Mesos
26
Mesos Master Server
init | + mesos-master | + marathon |
Mesos Slave Server
init | + docker | | | + lxc | |
| + (user task, under container init system) | | | + mesos-slave | | | + /var/lib/mesos/executors/docker | | | | | + docker run … | | | Docker Registry
When a user requests a container…
Mesos, LXC, and
Docker are tied together for launch
2 1 3 4 5 6 7 8
Other Schedulers as Meta-Frameworks in a 2-level Scheduler
YARN => https://github.com/mesos/myriad
Kubernetes => https://github.com/mesosphere/kubernetes-mesos
Swarm => Swarm on Mesos (new project)
=> run everything in one cluster
Myriad : Virtual YARN Clusters on Mesos
28
◦ POST /api/clusters: Registers a new YARN
◦ GET /api/clusters: Lists all registered clusters
◦ GET /api/clusters/{clusterId}: Lists the cluster with {clusterId}
◦ PUT /api/clusters/{clusterId}/flexup: Expands the size of cluster with {clusterId}
◦ PUT /api/clusters/{clusterId}/flexdown: Shrinks the size of cluster with {clusterId}
◦ DELETE /api/clusters/{clusterId}: Unregisters YARN cluster with {clusterId}. Also, kills all the nodes. Node Master Mesos Slave Mesos YARN Myriad Scheduler RM Myriad Executor 1. Launch NodeManager 1 1 1 2.5 CPU 2.5 GB 1 NM YARN flexU p 2.0 CPU 2.0 GB C1 C2
29
Portability
30
Mesos
Public Cloud
Managed Cloud
Your Own DC
Framework Apps
Meta-Frameworks
Vanilla Apps
The Application User’s Perspective
✴ Focus on apps, services, parameters, results
✴ Avoid dealing with datacenter operations/management ✴ Avoid adjusting system settings
✴ High availability ✴ Throughput
✴ Responsiveness ✴ Predictiveness
✴ Run everything I need
✴ Return on and safety of investment
The Datacenter is the new form factor
✴ 2-level scheduler => single production cluster
✴ scalability and portability => avoiding hardware/cloud lock-in ✴ built-in container support => running containers at scale
✴ automation => operator efficiency
✴ repositories => apps/services readily available
✴ API and SDK => productive/quick app/service development
33