Infrastructure Clouds for Science and Education: Platform Tools
Kate Keahey, Renato J. Figueiredo, John Bresnahan, Mike Wilde, David LaBissoniere
Argonne National Laboratory
Computation Institute, University of Chicago
University of Florida
www.nimbusproject.org
The Power of Infrastructure Clouds
Virtualization opens the floodgates
7/16/2012 2
• Outsourcing
• Virtual appliances
– Freeze your stack in time
– Run it anywhere
• Multi-cloud applications
– Run many copies all over the world!
• Elasticity
Harnessing The Power
• Organization tools and techniques
Towards a Power Adapter
What Needs To Be Harnessed
• VM (appliance) creation and development
– configuration management tools (chef, puppet)
• VM hypervisors
– Infrastructure-as-a-Service (IaaS)
• Cloud applications
– virtual clusters, cloudinit.d, CloudFormation,
• Elasticity
– Auto-scaling tools, phantom
• Workflow
– Swift, etc
What Needs To Be Organized?
• VM (appliance) creation and development
– configuration management tools (chef, puppet)
• VM hypervisors
– Infrastructure-as-a-Service (IaaS)
• Cloud applications
– virtual clusters, cloudinit.d, CloudFormation,
• Elasticity
– Auto-scaling tools, phantom
• Workflow
– Swift, etc
VM Applications
• An entire system frozen in time
– Full software stacks (versions)
– Configuration files
– Important for science!
• A dedicated modular service
– Web service, database, AMQP node, etc
• Demos
• A binary single file (or set of files)
– Easy to freeze
Developing Appliances
• A single binary image?
– Many developers?
– Version control?
– Merging conflicts?
• Base image with a description
– Ex: Ubuntu 11.04 base images plus a set of scripts
• Configuration Management Software
– Chef, Puppet, FG Rain, etc
Chef
• Software stack description
– Ruby and JSON
• A library of cookbooks
• Cookbooks contain recipes
– Ex: apache2 server with php4
• Attributes to customize each recipe
– Ex: on what port will apache listen
• Templates for configuration files
• Appliance developers make recipes
– Version control can be done with git/svn/cvs…
Example Recipe
app_dir = node[:appdir]
ve_dir = node[:virtualenv][:path]

# Check out the application source at the configured branch
git app_dir do
  repository node[:autoscale][:git_repo]
  reference node[:autoscale][:git_branch]
  action :sync
  user node[:username]
  group node[:groupname]
end

# Install the application from the checked-out source
execute "run install" do
  cwd app_dir
  user node[:username]
  group node[:groupname]
  command "python setup.py install"
end
Example Template
phantom:
  system:
    type: epu
    rabbit: <%= node[:autoscale][:rabbit_host] %>
    rabbit_port: <%= node[:autoscale][:rabbit_port] %>
    rabbit_ssl: False
    rabbit_user: <%= node[:autoscale][:rabbit_username] %>
    rabbit_pw: <%= node[:autoscale][:rabbit_password] %>
    rabbit_exchange: <%= node[:autoscale][:rabbit_exchange] %>
  authz:
    type: sqldb
    dburl: <%= node[:autoscale][:dburl] %>
Rendered output:
phantom:
  system:
    type: epu
    rabbit: vm-102.uc.futuregrid.org
    rabbit_port: 5672
    rabbit_ssl: False
    rabbit_user: XXX
    rabbit_pw: PPPPPP
    rabbit_exchange: default_dashi_exchange
  authz:
    type: sqldb
    dburl: mysql://nimbus:[email protected]/testphantom
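As a rough illustration of how a template like this gets filled in, the sketch below renders a shortened version with Ruby's standard ERB library. The `node` hash and its values are made up for the example; in Chef the node object and the rendering step are supplied by the tool itself.

```ruby
require "erb"

# Hypothetical attribute values; in Chef these would come from the node object.
node = {
  autoscale: {
    rabbit_host: "rabbit.example.org",
    rabbit_port: 5672
  }
}

# A shortened version of the template, using the same <%= ... %> placeholders.
template = <<~TEMPLATE
  rabbit: <%= node[:autoscale][:rabbit_host] %>
  rabbit_port: <%= node[:autoscale][:rabbit_port] %>
TEMPLATE

# Evaluate the placeholders against the local variables in scope.
rendered = ERB.new(template).result(binding)
puts rendered
```

The same template can be rendered many times with different attribute sets, which is what lets one recipe serve several deployments.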
Cloud Applications
• More than 1 VM needed for the job
• Information exchange is needed
– Manual information exchange
• Multi-cloud
– Cloud independence required
[Figure: nginx, several web servers, and a database]
Cloud Management Tools
• Architecture description
– VM type, location, count
– Volumes
– Networks
– Other services
• Contextualization
– Exchange dynamically determined information
• IP addrs, security information.
– Bootstrap component connections
• Ex: mount NFS, connect to DB, etc
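A minimal sketch of the contextualization idea, assuming a made-up architecture description and a fake launcher: each service is launched only after the services it needs, and it receives their dynamically determined addresses. The `services` hash, the `launch` lambda and the IP addresses are all hypothetical and do not reflect any particular tool's format.

```ruby
# Hypothetical architecture description: each service names the services
# whose runtime information it needs.
services = {
  database:  { needs: [] },
  webserver: { needs: [:database] }
}

# Stand-in for launching a VM; a real tool would call an IaaS API here.
# The addresses are fabricated for illustration only.
fake_addresses = { database: "10.0.0.5", webserver: "10.0.0.6" }
launch = ->(name) { { name: name, ip: fake_addresses.fetch(name) } }

# Launch services in an order that respects their needs, handing each one
# the addresses of the services it depends on.
launched = {}
contexts = {}
pending = services.keys
until pending.empty?
  ready = pending.select { |n| services[n][:needs].all? { |d| launched.key?(d) } }
  raise "circular dependency" if ready.empty?
  ready.each do |name|
    contexts[name] = services[name][:needs].to_h { |d| [d, launched[d][:ip]] }
    launched[name] = launch.call(name)
  end
  pending -= ready
end

puts contexts[:webserver]
```

Here the web server's context ends up holding the database address, which is the kind of information a boot script could use to connect to the DB.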
A Simplified Deployment Scenario
A Grid in Your Pocket…
[Figure, built up over three slides: users Pierre, Jamie, and David; clouds EC2, OOI private cloud, and FutureGrid]
CloudFormation
• Assemble AWS services
– Run AMIs
– Connect EBS volumes to AMIs
– Associate an SQS queue, etc.
• JSON descriptions
• AWS only
• No configuration management software integration
– Manual integration with Chef
cloudinit.d
• Multicloud VM dependency management
– Uses the libcloud abstraction library
• Integrated with chef solo
• ini file format descriptions
– Coupled with any executable script
• Launch plan end-users/operators:
– Lightweight
– Copy launch plan and “one click” action
– Easily reconfigured for various clouds
• Launch plan/application developers:
– Minimal software assumptions (ssh)
– “Stem cell” deployment approach
– Incremental launch plan development
[svc-alamoHTTP]
iaas_key: XXXXXX
iaas_secret: XXXX
iaas_host: alamo.futuregrid.org
iaas_port: 8443
iaas: Nimbus
image: ubunut10.10
ssh_username: ubuntu
localsshkeypath: ~/.ssh/fg.pem
readypgm: http-test.py
bootpgm: http-boot.sh
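To make the shape of such a service description concrete, here is a small sketch that reads `[section]` headers and `key: value` lines into nested hashes. It is not cloudinit.d's own parser; the `parse_services` helper and the `svc-example` section are hypothetical and exist only for this illustration.

```ruby
# Minimal parser for "[section]" headers and "key: value" lines.
def parse_services(text)
  services = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?("#")
    if (m = line.match(/\A\[(.+)\]\z/))
      # Start a new service section.
      current = services[m[1]] = {}
    elsif current && (m = line.match(/\A([^:=]+)[:=]\s*(.*)\z/))
      # Store the key and value under the current section.
      current[m[1].strip] = m[2]
    end
  end
  services
end

sample = <<~INI
  [svc-example]
  iaas: Nimbus
  iaas_port: 8443
  bootpgm: http-boot.sh
  readypgm: http-test.py
INI

services = parse_services(sample)
puts services["svc-example"]["bootpgm"]
```

Values are kept as strings, so a caller would convert fields such as the port number itself.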
cloudinit.d Overview
• Services
• Run Levels
– Collections of services without dependencies on each other
• Launch Plan
– An ordered set of run levels
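The relationship between services, run levels and a launch plan can be sketched as follows. This is only an illustration of the ordering idea, not cloudinit.d code: the plan contents, the `boot` lambda and the use of threads are assumptions made for the example.

```ruby
# Hypothetical launch plan: an ordered list of run levels, each a list of
# service names that do not depend on one another.
launch_plan = [
  ["database"],
  ["web1", "web2", "web3"]
]

log = Queue.new # thread-safe record of completed launches

# Stand-in for booting a service; a real tool would start and configure a VM.
boot = ->(service, level) { log << [level, service] }

launch_plan.each_with_index do |services, index|
  level = index + 1
  # Services within a run level are independent, so boot them concurrently.
  threads = services.map { |s| Thread.new { boot.call(s, level) } }
  # Wait for the whole level to finish before starting the next one.
  threads.each(&:join)
end

order = []
order << log.pop until log.empty?
puts order.map { |level, service| "level #{level}: #{service}" }
```

Services in the second level may finish in any order, but none of them starts before the first level is complete.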
Cloudinit.d Features
• Repeatability: write a launch plan once, deploy many times
• Deploy on cloud and non-cloud resources
• Coordination of interdependent launches
• User-defined launch tests
• Test-based monitoring and repair
[Figure: a launch plan with a database and three web servers, grouped into run level 1 and run level 2]
A Single Service Application Boot
Here we show how cloudinit.d automatically creates an HTTP server from a simple distribution base image
• Request a new VM from the infrastructure cloud through its IaaS interface
• Check status: poll the IaaS service to determine when the VM is running
• Verify ssh works: sshd needs to start up and be accessible on the new VM
• scp over the boot contextualization program (bootpgm) and run it
– Now the VM has been contextualized to be a web server
• scp over the ready program (readypgm) and run it
– If the ready program has a successful exit code (0), then the new simple cloud application is set to go!
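The boot sequence for a single service can be sketched roughly as below. The `boot_service` function and the `FakeCloud` and `FakeRemote` classes are hypothetical stand-ins for an IaaS client and an ssh/scp wrapper, written so the example runs without a real cloud; they are not part of cloudinit.d.

```ruby
# Drive one service through the boot steps. `cloud` and `remote` are
# hypothetical stand-ins for an IaaS client and an ssh/scp wrapper.
def boot_service(cloud:, remote:, bootpgm:, readypgm:, max_polls: 10)
  vm = cloud.request_vm
  polls = 0
  # Poll until the VM is running (a real tool would sleep between polls).
  until cloud.running?(vm)
    polls += 1
    raise "VM never reached running state" if polls >= max_polls
  end
  raise "ssh is not reachable" unless remote.ssh_ok?(vm)

  remote.copy(vm, bootpgm)
  raise "boot program failed" unless remote.run(vm, bootpgm).zero?

  remote.copy(vm, readypgm)
  remote.run(vm, readypgm).zero? # exit code 0 means the service is ready
end

# Fake cloud that reports "running" after a fixed number of polls.
class FakeCloud
  def initialize(polls_until_running)
    @remaining = polls_until_running
  end

  def request_vm
    "vm-1"
  end

  def running?(_vm)
    @remaining -= 1
    @remaining < 0
  end
end

# Fake remote that records each step and always succeeds.
class FakeRemote
  attr_reader :steps

  def initialize
    @steps = []
  end

  def ssh_ok?(_vm)
    @steps << "ssh check"
    true
  end

  def copy(_vm, program)
    @steps << "scp #{program}"
  end

  def run(_vm, program)
    @steps << "run #{program}"
    0
  end
end

remote = FakeRemote.new
ready = boot_service(cloud: FakeCloud.new(2), remote: remote,
                     bootpgm: "http-boot.sh", readypgm: "http-test.py")
puts remote.steps
puts "ready: #{ready}"
```

Keeping the cloud and remote access behind small interfaces is one way to run the same sequence against different providers, or against fakes in a test.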
Escalation Pattern
Operational Units
User Domain
(configuration and security)
Domain Management: Monitor and regulate domain properties based on system-specific and application-specific metrics
• Challenge: leverage on-demand, large but unreliable provider pool
– Applications that absorb resources
– Applications that tolerate failures
Scaling Considerations
• Reasons to scale
– Business vs science
• Cost vs quota
• Lossy environment
– VMs fail more often than bare metal
– N preserving
• Spot instances
– If the price is right
• Backfill
– If resources are idle
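An "N preserving" policy can be read as: keep N VMs alive, replacing any that fail. A minimal sketch under that reading follows; the `n_preserving_delta` helper is hypothetical, and the choice of which states count as healthy is an assumption for the example.

```ruby
# Given the target N and the current VM states, decide how many VMs to
# launch (positive result) or terminate (negative result).
# Which states count as healthy is an assumption made for this sketch.
def n_preserving_delta(target, vm_states)
  healthy = vm_states.count { |s| %i[pending started running].include?(s) }
  target - healthy
end

states = %i[running running failed pending]
delta = n_preserving_delta(4, states)
puts delta
```

With one of four VMs failed, the policy asks for one replacement; with more healthy VMs than the target, the result goes negative.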
Amazon Auto Scaling and CloudWatch
• Auto Scaling in EC2
– Policies to scale servers up and down
• Min, Max, and desired size
• Integrated with AWS CloudWatch Sensors
– Triggers
– CPU load, disk capacity, load balancer loads, etc.
– Custom sensors
• No contextualization
• REST API
• AWS only
Phantom Scaling Services
• Multi-cloud
– Fail-over and even distribution policies
• Monitor scaling factors and failures
– Generic/system qualities: deployment status, load, bank account, etc.
– Application-specific qualities, e.g., a workload queue for AliEn, PBS, AMQP, and others
• Evaluate against policies
• Scale and/or recover
– For user components
– For system components
– Across different cloud providers
• Release as a Service
• 0.1 running on FutureGrid now
– Initially available as a service on FutureGrid resources
– Provides high availability
[Figure labels: sensor information; apply policy; reliably provision, manage and contextualize resources]
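One way to picture an even-distribution policy with fail-over across clouds is the sketch below. It is not Phantom's implementation; the `distribute` helper, the cloud names and the availability flags are hypothetical.

```ruby
# Spread `count` VMs as evenly as possible over the clouds that are
# currently usable, skipping clouds marked unavailable (fail-over).
def distribute(count, clouds)
  usable = clouds.select { |_, available| available }.keys
  raise "no cloud available" if usable.empty?

  base, extra = count.divmod(usable.size)
  # The first `extra` clouds take one additional VM each.
  usable.each_with_index.map { |name, i| [name, base + (i < extra ? 1 : 0)] }.to_h
end

clouds = { "cloud-a" => true, "cloud-b" => false, "cloud-c" => true }
plan = distribute(5, clouds)
puts plan
```

Re-running the same function after a cloud's availability changes gives the new target per cloud, which a scaling service could compare against what is actually running.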
Infrastructure Platform Goals
• Multi-cloud
– Work across private, community and commercial clouds
• Any Scale
– Scale in response to a diverse set of sensors/triggers
– Both system and application sensors
• High Availability
– “Any VM can die”: system or user VMs
– Minimizing time to recovery (TTR)
• Your Policies, Our Enactment
– User-defined sensors/triggers and policies
• Engineered from the ground up to work with infrastructure clouds
• Easy on the user
How Can Science Plug Into This Power
Example Embarrassingly Parallel Scientific Application
Demonstration
Using Nimbus Domains
[Figure: the application puts M subtask messages on a task queue and starts the workers]
Using Nimbus Domains
[Figure: an “N preserving” policy preserves N worker VMs on the infrastructure compute cloud; workers get tasks from a message queue holding the M subtask messages; results/checkpoints and Cumulus/S3 are also shown]
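The embarrassingly parallel pattern in this demonstration (M subtasks on a queue, N workers pulling from it) can be sketched in a single process. Threads and an in-memory queue stand in for worker VMs and a message queue here; the values of M and N and the squaring "computation" are placeholders chosen for the example.

```ruby
m_subtasks = 20 # M subtask messages (illustrative value)
n_workers  = 4  # N workers to keep busy (illustrative value)

tasks = Queue.new
m_subtasks.times { |i| tasks << i }
tasks.close # no more tasks will be added; pop returns nil once drained

results = Queue.new

workers = Array.new(n_workers) do
  Thread.new do
    # Each worker repeatedly gets a task until the queue is drained.
    while (task = tasks.pop)
      results << task * task # placeholder for the real computation
    end
  end
end
workers.each(&:join)

total = 0
total += results.pop until results.empty?
puts total
```

Because workers only share the queue, losing one worker and starting a replacement does not require the others to coordinate, which fits a policy that keeps N workers alive.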
Phantom Architecture
[Figure labels: web application (nginx, MySQL) over HTTPS; REST services; RabbitMQ; EPUM; Provisioner; DTRS; Zookeeper cluster; IaaS clouds; FutureGrid clouds]
Adventures in Availability
• Time to scale (TTS)
– PENDING (request)
– STARTED (deployment)
– RUNNING (contextualization)
[Figure: TTS, preliminary results for 2,000 VMs provisioned on AWS EC2]
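As a worked example of how a time-to-scale figure might be broken down by phase, the sketch below assumes we know when a VM entered each state and when it became ready. The timestamps are invented for illustration and are not measurements from the results mentioned here; the `ready` marker is also an assumption.

```ruby
# Seconds since the scale request at which one VM entered each state.
# The numbers are invented for illustration, not measurements.
entered = { pending: 0.0, started: 12.5, running: 70.0, ready: 95.0 }

# Time spent in each state is the gap between consecutive transitions.
phases = entered.each_cons(2).map do |(from, t0), (_to, t1)|
  [from, t1 - t0]
end.to_h

tts = phases.values.sum
phases.each { |state, seconds| puts "#{state}: #{seconds}s" }
puts "TTS: #{tts}s"
```

Splitting TTS this way shows whether the request, the deployment, or the contextualization phase dominates.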
Applications
• Schedulers, Elastic MapReduce, Workflow Systems (Swift), Data Transfer Systems, Science Gateways, Custom Applications (OOI)
• Application adaptation: library of generic sensors, application-specific sensors, policies, decision engine
Infrastructure Platform
• Contextualization, multi-cloud bridge, repeatable launches, scaling, elasticity and high availability