
Cloud Computing For Bioinformatics


Academic year: 2021


(2)

Cloud Computing: what is it?

• Cloud Computing is a distributed infrastructure where resources, software, and data are provided in an on-demand fashion.

• Cloud Computing abstracts infrastructure from application.

• Cloud Computing should save you time the way software packages save you time.

(3)

Before:

• Purchase Hardware & ensure it’s all compatible

• Appropriate resources for hardware (power, cooling, rack space, etc)

• Set up & configure hardware

• Install baseline software (OS, packages)

• Develop & deploy your application

With the Cloud:

• Request resource

• Develop & deploy your application

Cloud Computing

(4)

Advantages:

– Reliability: Decoupling applications from hardware removes hardware failure concerns

– Scalability: Many cloud services have built-in linear scaling, allowing more resources to be brought online on-demand

– Turnaround: Greatly reduce time taken to procure hardware resources

– Cost: Limited upfront cost when compared to hardware purchase

– Pay as you go: Pay for what you use. Don’t pay for servers sitting idly sucking power & cooling

– Experimentation: Because of the above, the opportunity costs of experimentation are tiny

– Sharing & Collaboration: Share resources such as machine images & data without worry


(5)

Disadvantages:

– Learning Curve: One must learn how to leverage the cloud & its advantages; it is not how one is used to working

– Data Transfer: Getting data into & out of the cloud happens at internet speed, not local network speed

– Opacity: The underlying infrastructure is hidden from view
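To put the data transfer point in numbers, here is a rough back-of-the-envelope sketch. The dataset size and both link speeds are made-up assumptions for illustration, not measurements, and protocol overhead is ignored.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of the given speed."""
    # Decimal units: 1 GB = 1000 MB, 1 byte = 8 bits.
    size_megabits = size_gb * 1000 * 8
    seconds = size_megabits / link_mbit_per_s
    return seconds / 3600


# Hypothetical numbers, chosen only to show the order of magnitude.
dataset_gb = 500       # assumed size of a batch of sequencing data
internet_mbit = 100    # assumed uplink to the cloud
lan_mbit = 1000        # assumed gigabit local network

print(f"Over the internet: {transfer_hours(dataset_gb, internet_mbit):.1f} h")
print(f"Over the LAN:      {transfer_hours(dataset_gb, lan_mbit):.1f} h")
```

Under these assumptions the upload is an overnight job rather than a coffee break, which is why shipping a disk (AWS Import/Export) can be worth considering for large data sets.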


(6)

Cloud Computing: Components

(7)

Glossary:

– AWS: Amazon Web Services

– EC2 / Elastic Compute Cloud: Computer resources in the cloud. Essentially virtual computers with varying CPU & memory resources.

– EBS / Elastic Block Store: Block-level storage for data. These are virtual hard drives for EC2 instances.

– S3 / Simple Storage Service: An object store allowing you to save data in the cloud in a highly-redundant fashion.

– EMR / Elastic Map Reduce: Auto-managed map reduce infrastructure for running highly-parallel computation problems against a farm of computers.

– SDB / Simple Database: Run queries against structured data in real time. A very simple version of RDS.

– RDS / Relational Database Service: Web service that lets you place a relational database in the cloud.

– AWS Import/Export: Load your data onto a device and mail it to Amazon, and let them load your data for you!


There’s plenty more, but these are the most important for bioinformatics.

(8)

Ok, here are some others:

– CloudWatch: Monitor AWS cloud resources, such as EC2 instances.

– Elastic Load Balancing: Amazon-hosted load balancers distributing incoming traffic among EC2 nodes.

– SQS / Simple Queue Service: Hosted queue for storing messages as they pass between computers, letting disparate programs communicate with each other.

– VPC / Virtual Private Cloud: Fence off AWS services over an IP range via VPN, allowing cloud services to fit in with legacy security protocols.

– CloudFront: Content delivery network (CDN) on Amazon’s collection of edge servers.

– SNS / Simple Notification Service: Set up, operate, and send notifications from the cloud to a variety of destinations such as web pages, email, SMS, etc.

– Amazon Mechanical Turk: As the name implies, you create a Human Intelligence Task (HIT) which a human can do easily, then you pay a modest fee each time a human performs this task. Examples would be rating quality between items, filling out forms, or solving CAPTCHAs.


(9)

Let’s learn more about those important services


(10)

EC2:

Virtual computers offered with varying memory / CPU power

How is CPU power measured in a virtual world?

ECU: EC2 Compute Unit: measure of computing power on AWS. Roughly equivalent to a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

4 classes of instances:

Standard Instances: Inexpensive instances used for testing, web services, and many less intensive jobs

High-Memory Instances: Large-RAM instances for high-throughput applications, e.g. databases, caches

High-CPU Instances: High-ECU instances for compute-intensive applications

Cluster Compute Instances: Increased network performance for HPC applications, e.g. map-reduce

Cloud Computing: Services

(11)


Instance Type        ECUs    RAM (GB)    Local Storage (GB)

Standard
  Small              1       1.7         160
  Large              4       7.5         850
  XL                 8       15          1690

High-Memory
  XL                 6.5     17.1        420
  Double XL          13      34.2        850
  Quadruple XL       26      68.4        1690

High-CPU
  Medium             5       1.7         350
  XL                 20      7           1690

Cluster Compute
  Quadruple XL       33.5    23          1690
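Using the figures from the table above, picking an instance can be sketched as a simple filter. This is only an illustration: the `Instance` type and the `smallest_fit` helper are hypothetical names, and because the table carries no prices, "smallest" here just means the fewest ECUs (then the least RAM) among the instances that meet the requirements.

```python
from typing import NamedTuple, Optional


class Instance(NamedTuple):
    family: str
    size: str
    ecu: float
    ram_gb: float
    disk_gb: int


# Figures copied from the instance table above.
INSTANCES = [
    Instance("Standard", "Small", 1, 1.7, 160),
    Instance("Standard", "Large", 4, 7.5, 850),
    Instance("Standard", "XL", 8, 15, 1690),
    Instance("High-Memory", "XL", 6.5, 17.1, 420),
    Instance("High-Memory", "Double XL", 13, 34.2, 850),
    Instance("High-Memory", "Quadruple XL", 26, 68.4, 1690),
    Instance("High-CPU", "Medium", 5, 1.7, 350),
    Instance("High-CPU", "XL", 20, 7, 1690),
    Instance("Cluster Compute", "Quadruple XL", 33.5, 23, 1690),
]


def smallest_fit(min_ecu: float, min_ram_gb: float,
                 min_disk_gb: int = 0) -> Optional[Instance]:
    """Return the lowest-ECU instance meeting all minimums, or None."""
    candidates = [
        i for i in INSTANCES
        if i.ecu >= min_ecu and i.ram_gb >= min_ram_gb and i.disk_gb >= min_disk_gb
    ]
    # No prices in the table, so ECU (then RAM) stands in for "size".
    return min(candidates, key=lambda i: (i.ecu, i.ram_gb), default=None)


# A memory-hungry job versus a CPU-hungry job.
print(smallest_fit(min_ecu=4, min_ram_gb=16))
print(smallest_fit(min_ecu=16, min_ram_gb=4))
```

In practice you would rank candidates by hourly price rather than ECU count; that needs current pricing, which is not reproduced here.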

(12)

Pricing

Many factors affect pricing

Prices commensurate with class of instance used (Standard, High-memory)

Prices adjusted by OS: Linux (cheaper) and Windows (pricier)

Prices adjusted by instance type:

On-demand Instances: Always available to start. Priciest option. No commitment, no contract

Reserved Instances: Pre-pay upfront to have the ability to run an instance at a reduced rate

Spot Instances: eBay-style! Bid a max price for compute instances, and procure them when the spot price drops to or below your top bid. You cannot count on getting instances at any given time, but you can save money on them.

Prices adjusted by availability zone (AWS calls these regions; each region contains several availability zones). 4 available:

US East (cheapest across the board)

US West

EU Ireland

APAC Singapore (new!)

Estimating costs is hard, even with Amazon-provided calculators, as YMMV.
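The spot-instance rule above can be sketched in a few lines. All prices below are invented for illustration, and the sketch assumes you are charged the going spot price (not your bid) for each hour in which the spot price is at or below your bid.

```python
def spot_hours(spot_prices, max_bid):
    """Return (hours_run, total_cost) for an hourly spot price history."""
    hours_run = 0
    total_cost = 0.0
    for price in spot_prices:
        if price <= max_bid:
            # Assumption: the instance runs this hour and is billed
            # at the spot price rather than at the bid.
            hours_run += 1
            total_cost += price
        # Otherwise the instance is not running during this hour.
    return hours_run, total_cost


# Hypothetical hourly spot prices in dollars, and a hypothetical bid.
history = [0.03, 0.04, 0.09, 0.05, 0.03]
hours, cost = spot_hours(history, max_bid=0.05)
print(f"Ran {hours} of {len(history)} hours for ${cost:.2f}")
```

The hour where the price spikes above the bid is the catch: the work simply does not run then, so spot instances suit jobs that can tolerate interruption.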


(13)

Availability Zone? What’s that?

Amazon data centers are located around the globe. This ensures protection from data-center wide failure

The problem is that many services are independent between zones, which makes this protection moot in most cases

Proximity to your work environment will reduce latency (the time it takes information to travel from you to Amazon and back)

Choose the one closest to you, or the cheapest price, or somewhere in between

This will trip you up, trust me.


(14)

EBS:

Create ‘disks’ that can be attached to your EC2 instances and mounted like ordinary drives

Disks are also placed in Availability Zones, and priced accordingly

Can create new volumes based on public data sets

Can create ‘snapshots’: User-initiated copies of all the data on a volume, stored in super-durable Amazon S3


(15)

S3:

Stores objects in a bucket and allows retrieval based on unique key (URI)

Can store objects ranging from 1 byte to 5GB.

Unlimited objects can be stored

RESTful interface (Representational state transfer)

Extreme durability of data, with option for cheaper service (but reduced durability)

Backed by Amazon S3 SLA (service level agreement)

Unlimited objects and Extreme durability? What’s the catch?

Simple object stores are bad when disk I/O operations are needed

5GB may be too small for data sets

At the end of the day you can save data to S3 but you’ll be transferring it to EBS for any operations you’re going to do with it.
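One way around the 5GB object limit is to split a large data set into several objects before storing it. The sketch below only plans the split; it does not talk to S3. The `plan_chunks` and `chunk_keys` helpers and the `.partNNNN` key naming are hypothetical, and treating 5GB as 5 × 1024³ bytes is an assumption.

```python
MAX_OBJECT_BYTES = 5 * 1024**3  # the 5GB per-object limit quoted above (assumed binary GB)


def plan_chunks(total_bytes, max_bytes=MAX_OBJECT_BYTES):
    """Return (start, end) byte ranges, end exclusive, each at most max_bytes long."""
    if total_bytes < 0 or max_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    return [
        (start, min(start + max_bytes, total_bytes))
        for start in range(0, total_bytes, max_bytes)
    ]


def chunk_keys(prefix, count):
    """Hypothetical naming scheme: one object key per chunk."""
    return [f"{prefix}.part{i:04d}" for i in range(count)]


# A hypothetical 12 GB file of reads.
ranges = plan_chunks(12 * 1024**3)
for key, (start, end) in zip(chunk_keys("reads/run42.fastq", len(ranges)), ranges):
    print(key, start, end)
```

Each byte range would then be uploaded as its own object and the pieces reassembled on an EBS volume before any real work is done with them.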


(16)

EMR:

Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud

Allows…

Wait, do you know what MapReduce is?

No?

Then let’s back up a moment…


(17)

MapReduce:

Inspired by functional programming, and introduced by Google.

A way to process large amounts of data by farming out work to a cluster.

Works by using two functions:

Mapper: Takes huge input data and chunks it out into smaller sub-problems, applying one or more functions to each, resulting in key/value pairs

Reducer: Takes the key/value pairs and combines them into useful data

This is just a way of thinking about a problem. You need to code everything by hand. (Think of this not as a solution, but a way to think about creating one)

Hadoop is software that handles distribution and collection of the data through your Map and Reduce functions, abstracting the bookkeeping.
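To make the Mapper / Reducer split concrete, here is a minimal single-machine sketch that counts nucleotides across a handful of reads. It is not Hadoop code: the `shuffle` function stands in for the bookkeeping Hadoop would do across a cluster, and all function names and the example reads are made up for illustration.

```python
from collections import defaultdict


def mapper(read):
    """Map step: emit a (base, 1) pair for every base in one read."""
    for base in read:
        yield base, 1


def shuffle(pairs):
    """Group values by key; on a real cluster the framework does this."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def map_reduce(reads):
    pairs = (pair for read in reads for pair in mapper(read))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


reads = ["ACGT", "AACG", "TTGA"]
print(map_reduce(reads))
```

Because each read is mapped independently and each key is reduced independently, both steps can be farmed out to many machines without changing the mapper or reducer themselves.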

If this still seems obscure, Vince & Daniel have great talks on this.

Also for more information, Google has the answer. Check out Google’s “MapReduce in a Week” (http://code.google.com/edu/submissions/mapreduce/listing.html)

Cloud Computing: MapReduce Super-quick Primer

(18)

With that out of the way…

EMR:

Hosted Hadoop infrastructure for use of MapReduce paradigm in the cloud

Allows processing of vast amounts of data

Built to take advantage of other AWS services, such as EC2 to process data & S3 to store results

Most Bioinformatics tools cannot make good use of EMR at this time


(19)

SDB:

Non-relational data store (more like Excel than MySQL)

Think of it as S3 for data instead of files

Primarily for index & query capabilities

Comes with a ‘free tier’ for testing, making approaching this service easy

First 25 machine hours & 1GB storage / month free

After that, pricing is per machine hour used

Syntax:

Domains: Think of this as your spreadsheet name

Attributes: These would be the data in a column. Attributes have a name (header) and a value

Limit of 10GB per domain

Comes in two flavors:

Consistent: Your read reflects the data previously written

Eventually Consistent: Higher read throughput, but a read is not guaranteed to reflect everything written before it. There is a delay between writing data and being able to read the updated information.
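The domain / attribute model and the two read flavors can be imitated with a toy in-memory class. This is not the SimpleDB API: `ToyDomain`, its methods, and the sample data are all hypothetical, and replication lag is modeled crudely as a replica copy that only catches up when `sync()` is called.

```python
class ToyDomain:
    """Toy stand-in for a domain: item name -> {attribute name: value}."""

    def __init__(self):
        self._primary = {}   # always holds the latest writes
        self._replica = {}   # lags behind until sync() is called

    def put(self, item, **attributes):
        self._primary.setdefault(item, {}).update(attributes)

    def sync(self):
        """Simulate replication catching up with the primary."""
        self._replica = {name: dict(attrs) for name, attrs in self._primary.items()}

    def get(self, item, consistent=False):
        source = self._primary if consistent else self._replica
        return dict(source.get(item, {}))

    def select(self, attribute, value, consistent=False):
        """Return names of items whose attribute equals value."""
        source = self._primary if consistent else self._replica
        return sorted(name for name, attrs in source.items()
                      if attrs.get(attribute) == value)


samples = ToyDomain()
samples.put("sample1", organism="E. coli", platform="Illumina")

print(samples.get("sample1", consistent=True))   # sees the write immediately
print(samples.get("sample1"))                    # replica has not caught up yet
samples.sync()
print(samples.select("platform", "Illumina"))    # after replication, reads agree
```

The trade-off is the one described above: the eventually consistent read is cheaper to serve but may briefly return stale (here, empty) results.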


(20)

RDS:

Literally a hosted relational database (like MySQL)

Features reserved & on-demand pricing

Patches the software and handles backups for a user-defined retention period

Designed for use with other services (as you can imagine), so an EC2 instance will have low-latency access to an RDS instance and vice versa

Can create ‘snapshots’ (sound familiar?): User-initiated backups with indefinite retention (last until you delete them)

Multi-zone deployment: Allows replication of data across availability zones for durability of data

RDS instances come in various sizes which will look familiar to anyone that knows EC2 instance sizes.


(21)

Questions?


(22)

Oh yeah, here’s some free money!

(weren’t expecting that, were ya?)

