Clusters in the Cloud
Dr. Paul Coddington, Deputy Director
Dr. Shunde Zhang, Computing Specialist
eResearch SA
Use Cases
• Make the cloud easier to use for compute jobs
– Particularly for users familiar with HPC clusters
• Personal, on-demand cluster in the cloud
– A private cluster only available to a research group
• Cluster in the cloud
– A shared Node-managed cluster
• Preferably dynamic/elastic
– “Cloudbursting” for HPC
• Dynamically (and transparently) add extra compute nodes from the cloud to an existing HPC cluster
Cluster infrastructure
Hardware, Network
Software layer:
– Local Resource Management System / Queueing system
– Monitoring
– Shared File System
– Configuration Management System
– Application Distribution
Traditional Static Cluster
• Hardware/Network
– Dedicated hardware
– Long process to get new hardware
– Static, not elastic
• Software
– Assumes a fairly static environment (IPs etc.)
– Not cloud-friendly
– Some systems need a restart if the cluster is changed
– Not adaptable to changes
Cluster in the cloud
• Hardware / Network
– Provisioned by the cloud (OpenStack)
– Get new resources in minutes
– Remove resources in minutes
– Elastic/scalable on demand
• Software
– Dynamic
Possible solutions
• Condor for high-throughput computing
– Cloud Scheduler working for a CERN LCG node
– Recent versions of Condor support cloud execution
• Torque/PBS static cluster in cloud
– Works, but painful to set up and maintain
• Dynamic Torque/PBS cluster
– No existing dynamic/elastic solution
• StarCluster for personal cluster
– Automates setup of VMs in the cloud, including cluster configuration
– Can add/remove worker nodes manually
Our work
• Condor for high-throughput computing
– Cloud Scheduler for Australian CERN LCG node
• Torque/PBS static cluster in cloud
– Set up large cluster in cloud for CoEPP
– Scripts to automate setup and monitoring
• Dynamic Torque/PBS cluster
– Created Dynamic Torque system for OpenStack
• StarCluster for personal cluster
– Ported to OpenStack and added a Torque plugin
– Add-ons to make it easier to use for eRSA users
Application software
• Want familiar HPC applica:ons to be available to cloud VMs
– And we don’t want to install and maintain software twice, in HPC and cloud
• But limit on size of VM images in the cloud
• Want to avoid making lots of custom images
• We use CVMFS
– Read-only distributed file system, HTTP based
– Used by CERN LHC Grid for distributing software
– One VM image, with CVMFS client (setup sketched below)
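Client-side CVMFS setup is small enough to bake into the one shared VM image. A minimal sketch, assuming a hypothetical repository name and proxy URL (eRSA's actual values would differ):

# Minimal sketch of CVMFS client setup on a worker VM.
# Repository name and proxy URL are placeholders.
$ sudo yum install -y cvmfs                 # or apt-get install cvmfs
$ sudo tee /etc/cvmfs/default.local <<'EOF'
CVMFS_REPOSITORIES=apps.ersa.example.org
CVMFS_HTTP_PROXY=http://cvmfs-proxy.example.org:3128
EOF
$ sudo cvmfs_config setup                   # configure autofs/FUSE for /cvmfs
$ cvmfs_config probe                        # check the repository mounts on demand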
HTC Cluster in the Cloud
• NeCTAR eResearch Tools project for high-throughput computing in the cloud
• ARC Centre of Excellence in Experimental Particle Physics (CoEPP)
• Needed a large cluster for CERN ATLAS data analysis and simulation
• Tier 2 (global) and Tier 3 (local) jobs
• Augment existing small physical clusters at multiple sites, running Torque
Static Cluster in the Cloud
• Built a large Torque cluster using cloud VMs
• A challenging exercise!
• Reliability issues; needed a lot of scripts to automate setup, monitoring, recovery, etc.
• Some types of usage are bursty, but cluster resources were static
Dynamic Torque
• Static/dynamic worker nodes
– Static: stays up all the time
– Dynamic: up and down according to workload
• Independent of Torque/MAUI
– Runs as a separate process
• Only add/remove worker nodes
• Query Torque and MAUI scheduler periodically
• Still up to the MAUI scheduler to decide where to run jobs (core loop sketched below)
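Dynamic Torque itself is a Python daemon (see the link under Resources); the shell pseudo-loop below only illustrates the idea, with placeholder image and flavor names: poll the queue, boot or release cloud VMs, and leave job placement to MAUI.

# Illustrative sketch only: the core loop of a dynamic provisioner.
while true; do
    queued=$(qstat -Q batch | awk 'NR==3 {print $6}')   # jobs waiting in the queue
    free=$(pbsnodes -l free | wc -l)                    # idle worker nodes
    if [ "$queued" -gt 0 ] && [ "$free" -eq 0 ]; then
        # boot an extra worker; on start-up it registers with the Torque server
        nova boot --image worker-image --flavor m1.large "wn-$(date +%s)"
    elif [ "$queued" -eq 0 ] && [ "$free" -gt 0 ]; then
        idle=$(pbsnodes -l free | head -1 | awk '{print $1}')
        pbsnodes -o "$idle"                             # drain the node first
        nova delete "$idle"                             # then release the VM
    fi
    sleep 60
done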
Dynamic Torque for CoEPP
[Architecture diagram: Torque/MAUI plus Dynamic Torque, with LDAP, NFS, Puppet, Ganglia, Nagios, and CVMFS services; worker nodes in SA, Melbourne, and Monash; interactive nodes in Melbourne]
CoEPP Outcomes
• Three large clusters in use for over a year
– Hundreds of cores in each
• Condor and CloudScheduler for ATLAS Tier 2
• Dynamic Torque for ATLAS Tier 3 and Belle
• LHC ATLAS experiment at CERN
– 530,000 Tier 2 jobs
– 325,000 CPU hours for Tier 3 jobs
• Belle experiment in Japan
Private Clusters
• Good for building a shared cluster for a large research group with good IT support who can set up and manage a Torque cluster
• What about the many individual researchers or small groups who also want a private cluster using their cloud allocation?
• But have no dedicated IT staff and very basic Unix skills?
StarCluster
• Generic setup
– Create security group for the cluster
– Launch VMs (master, node01, node02 …)
– Set up public key for password-less SSH
– Install NFS on master and share scratch space to all nodeXX
– Can use EBS (Cinder) volumes as scratch space
• Queuing system setup (plugins)
StarCluster for OpenStack
[Architecture diagram: StarCluster drives OpenStack through the EC2 API; it builds a head node (NFS, Torque server, MAUI) with a CVMFS proxy and an attached volume, plus worker nodes (Torque MOM); application software comes from the eRSA App Repository (CVMFS server)]
StarCluster - configuration
• Availability zone
• Image
• (optional) Image for master
• Flavor
• (optional) Flavor for master
• Number of nodes
• Volume
• Username
• User ID
• Group ID
• User shell
• Plugins
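Putting these options together, the cluster section of ~/.starcluster/config looks roughly like the sketch below. Key names follow stock StarCluster (the OpenStack port maps flavors onto the instance-type settings, and the user/group ID options may be port-specific); all values are illustrative.

$ cat >> ~/.starcluster/config <<'EOF'
[cluster mycluster]
KEYNAME = mykey
CLUSTER_SIZE = 4                  # 1 master + 3 workers
CLUSTER_USER = researcher
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-00000001      # image, via its EC2-style ID
NODE_INSTANCE_TYPE = m1.medium    # flavor
MASTER_INSTANCE_TYPE = m1.large   # optional: different flavor for master
AVAILABILITY_ZONE = sa
VOLUMES = scratch
PLUGINS = torque
EOF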
Start a cluster with StarCluster
# fire up a new cluster (from your desktop)
$ starcluster start mycluster
# log in to the head node (master) to submit jobs
$ starcluster sshmaster mycluster
# copy files to/from the cluster
$ starcluster put mycluster /path/to/local/file/or/dir /remote/path/
$ starcluster get mycluster /path/to/remote/file/or/dir /local/path/
# add two compute nodes to the cluster
$ starcluster addnode -n 2 mycluster
# terminate it after use
$ starcluster terminate mycluster
Other options for Personal Cluster
• ElastiCluster
– Python code to provision VMs
– Ansible to configure them
– Ansible playbooks for Torque/SGE/…, NFS/PVFS/…
• Heat
– Everything in HOT template
– Earlier versions had limitations that made it hard to implement everything (minimal template sketched below)
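For comparison, a minimal HOT template that boots a single worker node; a real cluster template would add the head node, security groups, and software configuration, and the image, flavor, and key names here are placeholders.

$ cat > worker.yaml <<'EOF'
heat_template_version: 2013-05-23
description: Minimal sketch - one cluster worker node
resources:
  worker:
    type: OS::Nova::Server
    properties:
      image: worker-image
      flavor: m1.medium
      key_name: mykey
EOF
$ heat stack-create -f worker.yaml worker-stack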
Private Cluster in the Cloud
• Can use your personal or project cloud allocation to start up your own personal cluster in the cloud
– No need to share! Except among your group.
• Can use the standard PBS/Torque queueing system to submit jobs (or not)
– Only your jobs in the queue
• But you have to set up and manage the cluster
– Straightforward if you have good Unix skills (unless things go wrong…)
• Several groups now using this
Emu Cluster in the Cloud
• Emu is an eRSA cluster that runs in the cloud
• Aimed to be like an old cluster (Corvus)
– 8-core compute nodes
• But a bit different
– Dynamically created VMs in the cloud
– Can have private compute nodes
Emu
• eRSA-managed dynamic cluster in the cloud
• Shared by multiple cloud tenants and eRSA users
• All nodes in SA zone
• eRSA cloud allocation contributes 128 cores
• Users can bring in their own cloud allocation to launch their worker nodes in Emu
• Users don’t need to build and look after their own personal cluster
• It can also mount users’ Cinder volume storage to their own worker nodes via NFS
Using your own cloud allocation
• Users add our sysadmins to their tenant
– So we can launch VMs on their behalf
– Will look at Trusts in Icehouse
• Add some configs to Dynamic Torque
– Number of static/dynamic nodes, size of nodes, etc.
• Add a group of user accounts allowed to use it
• Create a reservation for users’ worker nodes in MAUI
• A special ‘account string’ needs to be put in the job
– To match users’ jobs to their group’s reserved nodes
• A qsub filter to check if the ‘account string’ is valid (sketched below)
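Torque lets a site define a submit filter (named via the SUBMITFILTER entry in torque.cfg) that receives each job script on stdin and can veto submission with a non-zero exit. A hedged sketch of the account-string check; the valid-accounts file path is hypothetical.

#!/bin/sh
# Hypothetical qsub submit filter: reject jobs whose '#PBS -A <account>'
# string is not in the site's list of valid account strings.
script=$(cat)
account=$(printf '%s\n' "$script" | awk '/^#PBS/ {for(i=1;i<NF;i++) if($i=="-A") print $(i+1)}' | head -1)
if [ -n "$account" ] && ! grep -qx "$account" /etc/torque/valid_accounts; then
    echo "Invalid account string: $account" >&2
    exit 1                    # non-zero exit makes qsub refuse the job
fi
printf '%s\n' "$script"       # otherwise pass the script through to qsub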
Emu
[Architecture diagram: Torque/MAUI plus Dynamic Torque, with LDAP, NFS, Salt, Sensu, and CVMFS services; shared worker nodes (eRSA donated) alongside worker nodes belonging to Tenant1 and Tenant2]
Static cluster vs dynamic cluster
                  Static Cluster     Dynamic Cluster
Hardware          Physical Machines  Virtual Machines
LRMS              Torque             Torque with Dynamic Torque
CMS               Puppet             Salt Stack
Monitoring        Nagios, Ganglia    Sensu, Graphite, Logstash
App Distribution  NFS Mount          CVMFS
Future Work
• Better reporting and usage graphs
• More monitoring checks
• Queueing system
– Multi-node jobs don’t work because a new node is not trusted by existing nodes
– Trust list is only updated when the PBS server is started
– Could hack the Torque source code
– Or maybe use SLURM or SGE
• Better way to share user credentials
– Trusts in Icehouse?
Future Work
• Spot instance queue
• Distributed file system
– NFS is the static component
• Cannot add storage to NFS without stopping it
• Cannot add new nodes to the allow list dynamically
• Instead, need to use iptables and update iptables rules at runtime (example below)
– Investigation of a dynamic and distributed FS
• One FS for all tenants
• Alternatives to StarCluster
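The iptables workaround mentioned above looks roughly like this sketch (subnet and node addresses are placeholders): export the scratch space to the whole tenant subnet once, then gate per-node access with firewall rules, which can change at runtime.

# One-time export to the whole subnet, so /etc/exports never changes again:
$ echo '/scratch 192.168.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
$ sudo exportfs -ra
# Default-deny NFS for the subnet, then punch a hole per worker at runtime:
$ sudo iptables -A INPUT -p tcp --dport 2049 -s 192.168.0.0/24 -j DROP
$ sudo iptables -I INPUT -p tcp --dport 2049 -s 192.168.0.37 -j ACCEPT   # a new worker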
Resources
• Cloud Scheduler – http://cloudscheduler.org/
• StarCluster – http://star.mit.edu/cluster
– OpenStack version: https://github.com/shundezhang/StarCluster/
• Dynamic Torque – https://github.com/shundezhang/dynamictorque
Nagios vs Sensu
• Nagios
– First designed in the last century for a static environment
– Needs local configuration updates and a service restart if a remote server is added or removed
– The server performs all checks, so it is not scalable
• Sensu
– Modern design with AMQP as the communication layer
– Local agent runs checks
– Loose coupling between clients and server, so it is scalable
Imagefactory
• Template in XML
– Packages
– Commands
– Files
• Can back up an ‘image’ (its template) in GitHub (sketch below)
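A sketch of what such a template looks like, assuming the Oz TDL schema that Imagefactory consumes; names, versions, and contents are placeholders, and a full template would also specify install media under <os>. Being a small XML file, it can be version-controlled like any other code.

$ cat > worker-node.tdl <<'EOF'
<template>
  <name>worker-node</name>
  <os>
    <name>Ubuntu</name>
    <version>12.04</version>
    <arch>x86_64</arch>
  </os>
  <packages>
    <package name='nfs-common'/>
    <package name='torque-mom'/>
  </packages>
  <commands>
    <command name='motd'>echo 'cluster worker' > /etc/motd</command>
  </commands>
  <files>
    <file name='/etc/cvmfs/default.local'>CVMFS_REPOSITORIES=apps.ersa.example.org</file>
  </files>
</template>
EOF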
Emu Monitoring
• Sensu
– Run health checks
• Logstash
– Collect PBS server/MOM/accounting logs
• Collectd
Salt Stack vs Puppet
               Salt Stack        Puppet
Architecture   Server-Client     Server-Client
Working Model  Push              Pull
Communication  ZeroMQ + msgpack  HTTP + text
Language       Python            Ruby