Cloud Computing For Bioinformatics

(2)

Cloud Computing: what is it?

• Cloud Computing is a distributed infrastructure where resources, software, and data are provided in an on-demand fashion.

• Cloud Computing abstracts infrastructure from application.

• Cloud Computing should save you time the way software packages save you time.

(3)

Before:

• Purchase Hardware & ensure it’s all compatible

• Allocate resources for the hardware (power, cooling, rack space, etc.)

• Set up & configure hardware

• Install baseline software (OS, packages)

• Develop & deploy your application

With the Cloud:

• Request resource

• Develop & deploy your application

Cloud Computing

(4)

Advantages:

– Reliability: Decoupling applications from hardware removes hardware failure concerns

– Scalability: Many cloud services have built-in linear scaling, allowing more resources to be brought online on-demand

– Turnaround: Greatly reduce time taken to procure hardware resources

– Cost: Limited upfront cost when compared to hardware purchase

– Pay as you go: Pay for what you use. Don’t pay for servers sitting idly sucking power & cooling

– Experimentation: Because of the above, the opportunity costs of experimentation are tiny

– Sharing & Collaboration: Share resources such as machine images & data without worry

Cloud Computing

(5)

Disadvantages:

– Learning Curve: One must learn how to leverage the cloud & its advantages; it is not how one is used to working

– Data Transfer: Getting data into & out of the cloud happens at internet speed, not local network speed

– Opacity: The underlying infrastructure is hidden from view

Cloud Computing

(6)

Cloud Computing: Components

(7)

Glossary:

– AWS: Amazon Web Services

– EC2 / Elastic Compute Cloud: Compute resources in the cloud. Essentially virtual computers with varying CPU & memory resources.

– EBS / Elastic Block Store: Block-level storage for data. They are virtual hard drives for EC2 instances.

– S3 / Simple Storage Service: An object store allowing you to save data in the cloud in a highly redundant fashion.

– EMR / Elastic Map Reduce: Auto-managed map reduce infrastructure for running highly parallel computation problems against a farm of computers.

– SDB / Simple Database: Run queries against structured data in real time. A very simple version of RDS (next entry).

– RDS / Relational Database Service: Web service that lets you place a relational database in the cloud.

– AWS Import/Export: Load your data onto a device and mail it to Amazon, and let them load your data for you!

Cloud Computing: Components

There’s plenty more, but these are the most important for bioinformatics.

(8)

Ok, here are some others:

– CloudWatch: Monitor AWS cloud resources, such as EC2 instances.

– Elastic Load Balancing: Amazon-hosted load balancers distributing incoming traffic among EC2 nodes.

– SQS / Simple Queue Service: Hosted queue for storing messages as they pass between computers, enabling disparate programs to communicate with each other.

– VPC / Virtual Private Cloud: Fence off AWS services over an IP range via VPN, allowing cloud services to fit in with legacy security protocols.

– CloudFront: Content delivery service (CDN) on Amazon’s collection of edge servers.

– SNS / Simple Notification Service: Set up, operate, and send notifications from the cloud to a variety of destinations such as web pages, email, SMS, etc.

– Amazon Mechanical Turk: As the name implies, you create Human Intelligence Tasks (HITs) that a human can do easily, then pay a modest fee each time a human performs one. Examples would be rating quality between items, filling out forms, or solving CAPTCHAs.

Cloud Computing: Components

(9)

Let’s learn more about those important services

Cloud Computing: Components

(10)

EC2:

• Virtual computers offered with varying memory / CPU power

• How is CPU power measured in a virtual world?

– ECU (EC2 Compute Unit): A measure of computing power on AWS. One ECU is roughly equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

• 4 classes of instances:

– Standard Instances: inexpensive instances used for testing, web service, and many less intensive jobs

– High-Memory Instances: Large RAM images for high throughput applications e.g. databases, caches

– High-CPU Instances: High ECU instances for compute-intensive applications

– Cluster Compute Instances: Increased network performance for HPC applications e.g. map-reduce

Cloud Computing: Services

(11)

Cloud Computing: Services

Class            Instance Type   ECUs   RAM (GB)   Local Storage (GB)
Standard         Small           1      1.7        160
Standard         Large           4      7.5        850
Standard         XL              8      15         1690
High-Memory      XL              6.5    17.1       420
High-Memory      Double XL       13     34.2       850
High-Memory      Quadruple XL    26     68.4       1690
High-CPU         Medium          5      1.7        350
High-CPU         XL              20     7          1690
Cluster Compute  Quadruple XL    33.5   23         1690
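To make the table easier to use when sizing a job, here is a minimal Python sketch that picks the smallest listed instance meeting a CPU and memory requirement. The figures are copied from the table; the names `InstanceType` and `smallest_fit` are illustrative, and "fewest ECUs" is only an assumed stand-in for "cheapest", since the table carries no prices.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    family: str
    size: str
    ecu: float
    ram_gb: float
    storage_gb: int


# Values copied from the instance table in these slides.
INSTANCE_TYPES = [
    InstanceType("Standard", "Small", 1, 1.7, 160),
    InstanceType("Standard", "Large", 4, 7.5, 850),
    InstanceType("Standard", "XL", 8, 15, 1690),
    InstanceType("High-Memory", "XL", 6.5, 17.1, 420),
    InstanceType("High-Memory", "Double XL", 13, 34.2, 850),
    InstanceType("High-Memory", "Quadruple XL", 26, 68.4, 1690),
    InstanceType("High-CPU", "Medium", 5, 1.7, 350),
    InstanceType("High-CPU", "XL", 20, 7, 1690),
    InstanceType("Cluster Compute", "Quadruple XL", 33.5, 23, 1690),
]


def smallest_fit(min_ecu: float, min_ram_gb: float) -> Optional[InstanceType]:
    """Return the instance with the fewest ECUs that meets both minimums."""
    candidates = [
        t for t in INSTANCE_TYPES
        if t.ecu >= min_ecu and t.ram_gb >= min_ram_gb
    ]
    # Ties on ECU are broken by RAM. Returns None when nothing fits.
    return min(candidates, key=lambda t: (t.ecu, t.ram_gb), default=None)
```

For example, `smallest_fit(4, 16)` skips every Standard instance (too little RAM) and lands on the High-Memory XL.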

(12)

Pricing

• Many factors affect pricing

• Prices are commensurate with the class of instance used (Standard, High-Memory, etc.)

• Prices adjusted by OS: Linux (cheaper) and Windows (pricier)

• Prices adjusted by purchasing option:

– On-demand Instances: Always available to start. Priciest option. No commitment, no contract

– Reserved Instances: Pre-pay upfront to have the ability to run an instance at a reduced rate

– Spot Instances: eBay-style! Bid a max price for compute instances, and procure them when the demand price meets your top bid. You cannot count on getting an instance at a given price, but you can save money when you do.

• Prices adjusted by region (each region contains its own availability zones). 4 available:

– US East (cheapest across the board)

– US West

– EU Ireland

– APAC Singapore (new!)

• Estimating costs is hard, even with Amazon-provided calculators, as YMMV.
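The trade-offs between the pricing options can be sketched with a little arithmetic. The following is a rough Python illustration, not Amazon's calculator: every rate and fee below is a made-up placeholder, and the spot model assumes (as a simplification) that you pay the going price for each hour in which it is at or below your bid.

```python
from typing import Sequence, Tuple


def on_demand_cost(hours: float, hourly_rate: float) -> float:
    """Pay only for the hours used, at the full rate."""
    return hours * hourly_rate


def reserved_cost(hours: float, upfront_fee: float, reduced_rate: float) -> float:
    """Pay an upfront fee, then a reduced rate per hour used."""
    return upfront_fee + hours * reduced_rate


def break_even_hours(on_demand_rate: float, upfront_fee: float,
                     reduced_rate: float) -> float:
    """Hours of use beyond which the reserved option becomes cheaper."""
    if reduced_rate >= on_demand_rate:
        raise ValueError("reserved rate must be below the on-demand rate")
    return upfront_fee / (on_demand_rate - reduced_rate)


def spot_outcome(max_bid: float,
                 hourly_prices: Sequence[float]) -> Tuple[int, float]:
    """Return (hours run, total cost) for a bid against a price history."""
    won = [price for price in hourly_prices if price <= max_bid]
    return len(won), sum(won)


# Hypothetical numbers, for illustration only (not real AWS prices).
ON_DEMAND_RATE = 0.10   # dollars per hour
UPFRONT_FEE = 200.0     # dollars, one-time
RESERVED_RATE = 0.04    # dollars per hour

hours_needed = break_even_hours(ON_DEMAND_RATE, UPFRONT_FEE, RESERVED_RATE)
```

With these placeholder numbers, the reserved option only pays off after roughly 3,333 hours of use, which is why short experiments usually belong on on-demand or spot instances.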

Cloud Computing: Services

(13)

Availability Zone? What’s that?

• Amazon data centers are located in regions around the globe, and each region is divided into isolated availability zones. This protects against data-center-wide failure

– Problem is many services are independent between zones, making this moot in most cases

• Proximity to your work environment will reduce latency (the time information takes to travel from you to Amazon and back)

– Choose the one closest to you, or the cheapest price, or somewhere in between

• This will trip you up, trust me.

Cloud Computing: Services

(14)

EBS:

• Create ‘disks’ that can be attached to and mounted on your EC2 instances

• Disks are also placed in Availability Zones, and priced accordingly

• Can create new volumes based on public data sets

• Can create ‘snapshots’: User-initiated copies of all the data on a volume, stored in super-durable Amazon S3

Cloud Computing: Services

(15)

S3:

• Stores objects in a bucket and allows retrieval based on unique key (URI)

• Can store objects ranging from 1 byte to 5GB.

• Unlimited objects can be stored

• RESTful interface (Representational state transfer)

• Extreme durability of data, with option for cheaper service (but reduced durability)

• Backed by Amazon S3 SLA (service level agreement)

Unlimited objects and Extreme durability? What’s the catch?

• Simple object stores are bad when disk I/O operations are needed

• 5GB may be too small for data sets

At the end of the day you can save data to S3 but you’ll be transferring it to EBS for any operations you’re going to do with it.
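If a data set exceeds the 5GB object limit, one workaround is to split it into several objects under related keys. The sketch below only plans the byte ranges; it does not talk to S3. The `.partNNNN` key naming is a hypothetical convention, and treating 5GB as 5 * 2**30 bytes is an assumption.

```python
from typing import List, Tuple

# Assumption: "5GB" is taken to mean 5 * 2**30 bytes.
MAX_OBJECT_BYTES = 5 * 1024 ** 3


def plan_parts(key_prefix: str, total_bytes: int,
               max_object_bytes: int = MAX_OBJECT_BYTES
               ) -> List[Tuple[str, int, int]]:
    """Split a file into (key, start, end) ranges that each fit in one object.

    `end` is exclusive, so a part covers bytes start..end-1.
    """
    if total_bytes <= 0 or max_object_bytes <= 0:
        raise ValueError("sizes must be positive")
    parts = []
    start = 0
    index = 0
    while start < total_bytes:
        end = min(start + max_object_bytes, total_bytes)
        # Hypothetical naming scheme: zero-padded part index in the key.
        parts.append((f"{key_prefix}.part{index:04d}", start, end))
        start = end
        index += 1
    return parts
```

A 12GB FASTQ file, for instance, would be planned as three objects: two full 5GB parts and one 2GB remainder.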

Cloud Computing: Services

(16)

EMR:

• Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud

• Allows…

Wait, do you know what MapReduce is?

No?

Then let’s back up a moment…

Cloud Computing: Services

(17)

MapReduce:

• Inspired by functional programming, and introduced by Google.

• A way to process large amounts of data by farming out work to a cluster.

• Works by using two functions:

– Mapper: Takes huge input data and chunks it out into smaller sub-problems, applying one or more functions to each, resulting in a key/value pair of the data

– Reducer: Takes the key/value pairs and combines them into useful data

• This is just a way of thinking about a problem. You need to code everything by hand. (Think of this not as a solution, but a way to think about creating one)

• Hadoop is software that handles distribution and collection of the data through your Map and Reduce functions, abstracting the bookkeeping.

• If this still seems opaque, Vince & Daniel have great talks on this.

Also for more information, Google has the answer. Check out Google’s “MapReduce in a Week”

(http://code.google.com/edu/submissions/mapreduce/listing.html)
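As a concrete illustration of the Mapper/Reducer idea, here is a small single-machine Python sketch that counts k-mers in sequencing reads. It is not Hadoop code: the `shuffle` step stands in for the grouping that a framework such as Hadoop would do across a cluster, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(read: str, k: int = 3) -> Iterator[Tuple[str, int]]:
    """Emit a (k-mer, 1) pair for every k-length window in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key (the bookkeeping a framework would handle)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all the counts emitted for one k-mer into a total."""
    return key, sum(values)


def map_reduce(reads: Iterable[str], k: int = 3) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for read in reads for pair in mapper(read, k))
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = map_reduce(["GATTACA", "ATTAC"], k=3)
```

Because each read is mapped independently and each k-mer is reduced independently, both stages can be spread over many machines without changing the functions themselves.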

Cloud Computing:

MapReduce Super-quick Primer

(18)

With that out of the way…

EMR:

• Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud

• Allows processing of vast amounts of data

• Built to work with other services such as S3 for reading input data & storing results

• Most Bioinformatics tools cannot make good use of EMR at this time

Cloud Computing: Services

(19)

SDB:

• Non-relational data store (more like Excel than MySQL)

• Think of it as S3 for data instead of files

• Primarily for index & query capabilities

• Comes with a ‘free tier’ for testing, making approaching this service easy

– First 25 machine hours & 1GB storage / month free

– After that, pricing is per machine hour used

• Syntax:

– Domains: Think of this as your spreadsheet name

– Attributes: These would be the data in a column. Attributes have a name (header) and a value

• Limit of 10GB per domain

• Comes in two flavors:

– Consistent: Your read reflects the data previously written

– Eventually Consistent: Higher read throughput, but reads are not guaranteed to reflect everything written to it before. Latency between writing and reading updated information.
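The difference between the two read flavors can be shown with a toy in-memory model. This is not the SDB API, only a sketch under simple assumptions: writes land on a primary copy immediately and reach a replica later, consistent reads go to the primary, and eventually consistent reads go to the replica. The class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Attributes = Dict[str, str]


class ToyDomain:
    """A toy 'domain': item names mapped to attribute name/value pairs."""

    def __init__(self) -> None:
        self._primary: Dict[str, Attributes] = {}
        self._replica: Dict[str, Attributes] = {}
        self._pending: List[Tuple[str, Attributes]] = []

    def put(self, item_name: str, attributes: Attributes) -> None:
        """Write to the primary now; queue the change for the replica."""
        merged = {**self._primary.get(item_name, {}), **attributes}
        self._primary[item_name] = merged
        self._pending.append((item_name, dict(merged)))

    def get(self, item_name: str, consistent: bool = False) -> Attributes:
        """Consistent reads see every write; others may lag behind."""
        source = self._primary if consistent else self._replica
        return dict(source.get(item_name, {}))

    def select(self, attribute: str, value: str,
               consistent: bool = False) -> List[str]:
        """Return sorted item names whose attribute equals the value."""
        source = self._primary if consistent else self._replica
        return sorted(name for name, attrs in source.items()
                      if attrs.get(attribute) == value)

    def replicate(self) -> None:
        """Apply queued writes to the replica (the 'eventually' part)."""
        for name, attrs in self._pending:
            self._replica[name] = attrs
        self._pending.clear()
```

Right after `put`, a consistent `get` returns the new attributes while an eventually consistent `get` may still return the old ones; once `replicate` has run, both agree.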

Cloud Computing: Services

(20)

RDS:

• Literally a hosted relational database (like MySQL)

• Features reserved & on-demand pricing

• Patches the software and handles backups for a user-defined retention period

• Designed for use with other services (as you can imagine), so EC2 instances get low-latency access to an RDS instance and vice versa

• Can create ‘snapshots’ (sound familiar?): User-initiated backups with indefinite retention (last until you delete them)

• Multi-zone deployment: Allows replication of data across availability zones for durability of data

• RDS instances come in various sizes which will look familiar to anyone that knows EC2 instance sizes.
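To show what "relational" buys you compared with SDB, here is a short Python sketch that joins two tables and aggregates the result. It uses the standard-library `sqlite3` module with an in-memory database purely as a local stand-in for a hosted MySQL instance; the table names, columns, and rows are invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE samples (
            id INTEGER PRIMARY KEY,
            organism TEXT NOT NULL
        );
        CREATE TABLE runs (
            id INTEGER PRIMARY KEY,
            sample_id INTEGER NOT NULL REFERENCES samples(id),
            read_count INTEGER NOT NULL
        );
    """)
    # Made-up example rows.
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     [(1, "E. coli"), (2, "S. cerevisiae")])
    conn.executemany("INSERT INTO runs VALUES (?, ?, ?)",
                     [(1, 1, 1000), (2, 1, 2500), (3, 2, 4000)])
    conn.commit()
    return conn


def reads_per_organism(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Join runs to samples and total the reads for each organism."""
    cursor = conn.execute("""
        SELECT s.organism, SUM(r.read_count)
        FROM samples AS s
        JOIN runs AS r ON r.sample_id = s.id
        GROUP BY s.organism
        ORDER BY s.organism
    """)
    return cursor.fetchall()


totals = reads_per_organism(build_demo_db())
```

The join across `samples` and `runs` is the kind of query a spreadsheet-like store such as SDB is not designed for, and the kind a hosted relational database handles directly.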

Questions?

Cloud Computing: Components

Oh yeah, here’s some free money!

Cloud Computing: Components