Cloud Computing
For Bioinformatics
Cloud Computing: what is it?
• Cloud Computing is a distributed infrastructure where resources, software, and data are provided in an on-demand fashion.
• Cloud Computing abstracts infrastructure from application.
• Cloud Computing should save you time the way software packages save you time.
Before:
• Purchase Hardware & ensure it’s all compatible
• Appropriate resources for hardware (power, cooling, rack space, etc)
• Set up & configure hardware
• Install baseline software (OS, packages)
• Develop & deploy your application
With the Cloud:
• Request resource
• Develop & deploy your application
Cloud Computing
Advantages:
– Reliability: Decoupling applications from hardware removes hardware failure concerns
– Scalability: Many cloud services have built-in linear scaling, allowing more resources to be brought online on-demand
– Turnaround: Greatly reduce time taken to procure hardware resources
– Cost: Limited upfront cost when compared to hardware purchase
– Pay as you go: Pay for what you use. Don’t pay for servers sitting idly sucking power & cooling
– Experimentation: Because of the above, the opportunity costs of experimentation are tiny
– Sharing & Collaboration: Share resources such as machine images & data without worry
Cloud Computing
Disadvantages:
– Learning Curve: One must learn how to leverage the cloud & its advantages; it is not how one is used to working
– Data Transfer: Getting data into & out of the cloud happens at internet speed, not local network speed
– Opacity: The underlying infrastructure is hidden from view
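To put the data transfer point in numbers, here is a back-of-the-envelope sketch in Python. The function name, the 80% efficiency factor, and the example link speeds are assumptions chosen for illustration, not measurements.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved; real transfers vary widely.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    size_megabits = size_gb * 1000 * 8  # decimal gigabytes -> megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 1 TB data set over two hypothetical links.
    for label, mbps in [("100 Mbps internet uplink", 100), ("1 Gbps local network", 1000)]:
        print(f"{label}: {transfer_hours(1000, mbps):.1f} hours")
```

Under these assumptions a terabyte takes more than a day over the slower link, which is the situation a mail-in service such as AWS Import/Export is aimed at.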
Cloud Computing: Components
Glossary:
– AWS: Amazon Web Services
– EC2 / Elastic Compute Cloud: Computer resources in the cloud. Essentially virtual computers with varying CPU & memory resources.
– EBS / Elastic Block Store: Block-level storage for data. They are virtual hard drives for EC2 instances.
– S3 / Simple Storage Service: An object store allowing you to save data in the cloud in a highly-redundant fashion
– EMR / Elastic Map Reduce: Auto-managed map reduce infrastructure for running highly-parallel computation problems against a farm of computers.
– SDB / Simple Database: Run queries against structured data in real time. A very simple version of:
– RDS / Relational Database Service: Web service that lets you place a relational database in the cloud.
– AWS Import/Export: Load your data onto a device and mail it to Amazon, and let them load your data for you!
Cloud Computing: Components
There’s plenty more, but these are the most important for bioinformatics.
Ok, here are some others:
– CloudWatch: Monitor AWS cloud resources, such as EC2 instances.
– Elastic Load Balancing: Amazon-hosted load balancers distributing incoming traffic among EC2 nodes.
– SQS / Simple Queue Service: Hosted queue for storing messages as they pass between computers, letting disparate programs communicate with each other.
– VPC / Virtual Private Cloud: Fence off AWS services over an IP range via VPN, allowing cloud services to fit in with legacy security protocols.
– CloudFront: Content delivery service (CDN) on Amazon’s collection of edge servers.
– SNS / Simple Notification Service: Set up, operate, and send notifications from the cloud to a variety of locations such as web page, email, SMS, etc.
– Amazon Mechanical Turk: As the name implies, you create Human Intelligence Tasks (HITs) which a human can do easily, then you pay a modest fee each time some human performs the task. Examples would be rating quality between items, filling out forms, or solving CAPTCHAs, etc.
Cloud Computing: Components
Let’s learn more about those important services
Cloud Computing: Components
EC2:
• Virtual computers offered with varying memory / cpu power
• How is CPU power measured in a virtual world?
– ECU: EC2 Compute Unit: measure of computing power on AWS. Equivalent of a 1.0–1.2GHz 2007 Opteron or 2007 Xeon processor.
• 4 classes of instances:
– Standard Instances: inexpensive instances used for testing, web service, and many less intensive jobs
– High-Memory Instances: Large RAM images for high throughput applications e.g. databases, caches
– High-CPU Instances: High ECU instances for compute-intensive applications
– Cluster Compute Instances: Increased network performance for HPC applications e.g. map-reduce
Cloud Computing: Services
Instance Type      ECUs   RAM (GB)   Local Storage (GB)
Standard
  Small            1      1.7        160
  Large            4      7.5        850
  XL               8      15         1690
High-Memory
  XL               6.5    17.1       420
  Double XL        13     34.2       850
  Quadruple XL     26     68.4       1690
High-CPU
  Medium           5      1.7        350
  XL               20     7          1690
Cluster Compute
  Quadruple XL     33.5   23         1690
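One way to use the table above is to filter it by what a job needs. The sketch below copies the table's figures into Python and picks the lowest-ECU type that satisfies a minimum ECU and RAM requirement. The names `InstanceType` and `smallest_fit` are made up for this example, and ranking by ECU then RAM is only an assumed stand-in for price, which the table does not list.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    family: str
    size: str
    ecu: float
    ram_gb: float
    storage_gb: int


# Figures copied from the instance table above.
INSTANCE_TYPES = [
    InstanceType("Standard", "Small", 1, 1.7, 160),
    InstanceType("Standard", "Large", 4, 7.5, 850),
    InstanceType("Standard", "XL", 8, 15, 1690),
    InstanceType("High-Memory", "XL", 6.5, 17.1, 420),
    InstanceType("High-Memory", "Double XL", 13, 34.2, 850),
    InstanceType("High-Memory", "Quadruple XL", 26, 68.4, 1690),
    InstanceType("High-CPU", "Medium", 5, 1.7, 350),
    InstanceType("High-CPU", "XL", 20, 7, 1690),
    InstanceType("Cluster Compute", "Quadruple XL", 33.5, 23, 1690),
]


def smallest_fit(min_ecu: float, min_ram_gb: float) -> Optional[InstanceType]:
    """Return the lowest-ECU type meeting both minimums, or None if none does."""
    candidates = [
        t for t in INSTANCE_TYPES
        if t.ecu >= min_ecu and t.ram_gb >= min_ram_gb
    ]
    # Assumed ranking: fewer ECUs first, then less RAM (a rough proxy for cost).
    return min(candidates, key=lambda t: (t.ecu, t.ram_gb), default=None)


if __name__ == "__main__":
    # Hypothetical job needing at least 4 ECUs and 16 GB of RAM.
    print(smallest_fit(min_ecu=4, min_ram_gb=16))
```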
Pricing
• Lots of factors affect pricing
• Prices commensurate with class of instance used (Standard, High-memory)
• Prices adjusted by OS: Linux (cheaper) and Windows (pricier)
• Prices adjusted by instance type:
– On-demand Instances: Always available to start. Priciest option. No commitment, no contract
– Reserved Instances: Pre-pay upfront to have the ability to run an instance at a reduced rate
– Spot Instances: eBay-style! Bid a max price for compute instances, and procure them when the spot price drops to or below your bid. You cannot count on getting instances at a given price, but you can save money when you do.
• Prices adjusted by region (each region holds several availability zones). 4 available:
– US East (cheapest across the board)
– US West
– EU Ireland
– APAC Singapore (new!)
• Estimating costs is hard, even with Amazon-provided calculators, as YMMV.
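A rough sketch of the spot-bidding idea from the pricing list, in Python. Every number here is invented for illustration and is not an Amazon rate. The model also assumes that you pay the going spot price (not your bid) for each hour in which that price is at or below your bid, which simplifies the real billing rules.

```python
from typing import Sequence, Tuple


def on_demand_cost(hours: int, hourly_rate: float) -> float:
    """Cost of running one on-demand instance for a number of hours."""
    return hours * hourly_rate


def spot_run(bid: float, spot_prices: Sequence[float]) -> Tuple[int, float]:
    """Return (hours_run, total_cost) for one bid against hourly spot prices.

    Simplified model: the instance runs in any hour whose spot price is at or
    below the bid, and that hour is billed at the spot price.
    """
    hours_run = 0
    total_cost = 0.0
    for price in spot_prices:
        if price <= bid:
            hours_run += 1
            total_cost += price
    return hours_run, total_cost


if __name__ == "__main__":
    prices = [0.03, 0.04, 0.06, 0.035, 0.05]  # hypothetical hourly spot prices
    hours, cost = spot_run(bid=0.04, spot_prices=prices)
    print(f"spot: ran {hours} of {len(prices)} hours for ${cost:.3f}")
    # 0.085 is a hypothetical on-demand hourly rate.
    print(f"on-demand: {len(prices)} hours for ${on_demand_cost(len(prices), 0.085):.3f}")
```

The trade-off shows up in the first number: the spot job is cheaper per hour but only ran in some of the hours, so it suits work that can tolerate interruption.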
Cloud Computing: Services
Availability Zone? What’s that?
• Amazon data centers are located around the globe. This ensures protection from data-center wide failure
– Problem is many services are independent between zones, making this moot in most cases
• Proximity to your work environment will reduce latency (the time information takes to travel from you to Amazon and back)
– Choose the one closest to you, or the cheapest price, or somewhere in between
• This will trip you up, trust me.
Cloud Computing: Services
EBS:
• Create ‘disks’ (volumes) that can be attached to and mounted on your EC2 instances
• Disks are also placed in Availability Zones, and priced accordingly
• Can create new volumes based on public data sets
• Can create ‘snapshots’: User-initiated copies of all the data on a volume, stored in super-durable Amazon S3
Cloud Computing: Services
S3:
• Stores objects in a bucket and allows retrieval based on unique key (URI)
• Can store objects ranging from 1 byte to 5GB.
• Unlimited objects can be stored
• RESTful interface (Representational state transfer)
• Extreme durability of data, with option for cheaper service (but reduced durability)
• Backed by Amazon S3 SLA (service level agreement)
Unlimited objects and Extreme durability? What’s the catch?
• Simple object stores are bad when disk I/O operations are needed
• 5GB may be too small for data sets
At the end of the day you can save data to S3 but you’ll be transferring it to EBS for any operations you’re going to do with it.
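If a data set is larger than the 5GB object limit quoted above, one workaround is to store it as several smaller objects. The Python sketch below only plans the split (byte ranges and key names) and does not talk to S3. The `.partNNNN` key scheme and the reading of 5GB as 5 × 1024³ bytes are assumptions made for this example.

```python
from typing import Iterator, List, Tuple

# Assumed interpretation of the 5GB per-object limit.
MAX_OBJECT_BYTES = 5 * 1024**3


def part_ranges(total_bytes: int, max_part: int = MAX_OBJECT_BYTES) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte offsets, end exclusive, each at most max_part long."""
    if total_bytes < 0 or max_part <= 0:
        raise ValueError("total_bytes must be >= 0 and max_part must be > 0")
    start = 0
    while start < total_bytes:
        end = min(start + max_part, total_bytes)
        yield start, end
        start = end


def part_keys(base_key: str, total_bytes: int, max_part: int = MAX_OBJECT_BYTES) -> List[str]:
    """Return one hypothetical object key per part, e.g. 'reads.fastq.part0000'."""
    return [
        f"{base_key}.part{index:04d}"
        for index, _ in enumerate(part_ranges(total_bytes, max_part))
    ]


if __name__ == "__main__":
    size = 12 * 1024**3  # hypothetical 12 GB file
    for key, (start, end) in zip(part_keys("reads.fastq", size), part_ranges(size)):
        print(key, start, end)
```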
Cloud Computing: Services
EMR:
• Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud
• Allows…
Wait, do you know what MapReduce is?
No?
Then let’s back up a moment…
Cloud Computing: Services
MapReduce:
• Inspired by functional programming, and introduced by Google.
• A way to process large amounts of data by farming out work to a cluster.
• Works by using two functions:
– Mapper: Takes huge input data and chunks it out into smaller sub-problems, applying one or more functions to each and emitting key/value pairs
– Reducer: Takes the key/value pairs and combines them into useful data
• This is just a way of thinking about a problem. You need to code everything by hand. (Think of this not as a solution, but a way to think about creating one)
• Hadoop is software that handles distribution and collection of the data through your Map and Reduce functions, abstracting the bookkeeping.
• If this still seems obscure, Vince & Daniel have great talks on this.
Also for more information, Google has the answer. Check out Google’s “MapReduce in a Week”
(http://code.google.com/edu/submissions/mapreduce/listing.html)
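To make the Mapper/Reducer split concrete, here is a minimal single-machine sketch in plain Python that counts k-mers across a few sequence reads. It does not use Hadoop; the `shuffle` function stands in for the grouping that a framework would do between the two phases, and all function names and the example reads are made up for illustration.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(read: str, k: int = 3) -> Iterator[Tuple[str, int]]:
    """Emit a (k-mer, 1) pair for every k-mer in one sequence read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key; a framework such as Hadoop handles this step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all the values seen for one key into a single count."""
    return key, sum(values)


def map_reduce(reads: Iterable[str], k: int = 3) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(mapper(read, k) for read in reads)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(map_reduce(["ACGTAC", "GTACG"], k=3))
```

Because each call to `mapper` looks at one read only and each call to `reducer` looks at one key only, both phases can be farmed out to many machines, which is the part Hadoop automates.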
Cloud Computing: MapReduce Super-quick Primer
With that out of the way…
EMR:
• Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud
• Allows processing of vast amounts of data
• Built to take advantage of other systems such as EC2 & S3 to process data & store results (respectively)
• Most Bioinformatics tools cannot make good use of EMR at this time
Cloud Computing: Services
SDB:
• Non-relational data store (More like Excel than MySQL)
• Think of it as S3 for data instead of files
• Primarily for index & query capabilities
• Comes with a ‘free tier’ for testing, making approaching this service easy
– First 25 machine hours & 1GB storage / month free
– After that, pricing is per machine hour used
• Syntax:
– Domains: Think of this as your spreadsheet name
– Attributes: These would be the data in a column. Attributes have a name (header) and a value
• Limit of 10GB per domain
• Comes in two flavors:
– Consistent: Your read reflects the data previously written
– Eventually Consistent: Higher read throughput, but a read is not guaranteed to reflect every earlier write; there is a delay between writing data and being able to read the updated values.
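The difference between the two read flavors can be shown with a toy model in Python. This is not the SDB API: `ToyDomain`, its methods, and the idea of a single lagging replica are all assumptions made for illustration, and "item" is used here as a hypothetical name for one spreadsheet row.

```python
from typing import Dict, List, Tuple


class ToyDomain:
    """Toy stand-in for a domain: rows of named attributes, with a lagging copy."""

    def __init__(self) -> None:
        self._primary: Dict[str, Dict[str, str]] = {}   # always up to date
        self._replica: Dict[str, Dict[str, str]] = {}   # catches up later
        self._pending: List[Tuple[str, str, str]] = []  # writes not yet copied

    def put(self, item: str, attribute: str, value: str) -> None:
        """Write one attribute; the replica only sees it after replicate()."""
        self._primary.setdefault(item, {})[attribute] = value
        self._pending.append((item, attribute, value))

    def replicate(self) -> None:
        """Copy pending writes to the replica (happens 'eventually' in real life)."""
        for item, attribute, value in self._pending:
            self._replica.setdefault(item, {})[attribute] = value
        self._pending.clear()

    def get(self, item: str, consistent: bool = False) -> Dict[str, str]:
        """Consistent reads use the primary; eventually consistent reads use the replica."""
        source = self._primary if consistent else self._replica
        return dict(source.get(item, {}))


if __name__ == "__main__":
    samples = ToyDomain()
    samples.put("sample1", "organism", "E. coli")
    print("consistent read:", samples.get("sample1", consistent=True))
    print("eventual read before replication:", samples.get("sample1"))
    samples.replicate()
    print("eventual read after replication:", samples.get("sample1"))
```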
Cloud Computing: Services
RDS:
• Literally a hosted relational database (like MySQL)
• Features reserved & on-demand pricing
• Patches the software and handles backups for a user-defined retention period
• Designed for use with other services (as you can imagine), so an EC2 instance will have low latency to an RDS instance and vice-versa
• Can create ‘snapshots’ (sound familiar?): User-initiated backups with indefinite retention (last until you delete them)
• Multi-zone deployment: Allows replication of data across availability zones for durability of data
• RDS instances come in various sizes which will look familiar to anyone that knows EC2 instance sizes.
Cloud Computing: Services
Questions?
Cloud Computing: Components
Oh yeah, here’s some free money!
(weren’t expecting that, were ya?)