INFO5011
Cloud Computing Semester 2, 2011
Lecture 1, Cloud Computing Introduction
Some slides were developed using the original Berkeley RAD lab Above the Clouds Presentation
The original PPT slides and the Berkeley paper can be found from:
Outline
› Cloud computing – a broad definition and views
› The road to cloud computing
› Cloud killer apps
› From users’ perspective
› From providers’ perspective
› Challenges and opportunities
Cloud computing – a broad definition
› A definition by the US government’s National Institute of Standards and Technology (NIST)
- Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services)
Cloud computing delivery models
› In this definition, cloud computing has three delivery models:
- Software as a Service (SaaS): The consumer uses an application, but does not control the operating system, hardware or network infrastructure on which it's running.
- Applications are typically business applications, or applications that would normally be installed in a business network or on a personal computer
- Examples
- Business applications: CRM solutions from salesforce.com
Cloud computing delivery models (II)
- Platform as a Service (PaaS): The consumer uses a hosting environment for their applications. The consumer controls the applications that run in the environment (and possibly has some control over the hosting environment), but does not control the operating system, hardware or network infrastructure on which they are running. The platform is typically an application framework.
Cloud computing delivery models (III)
- Infrastructure as a Service (IaaS): The consumer uses "fundamental computing resources" such as processing power, storage, networking components or middleware. The consumer can control the operating system, storage, deployed applications and possibly networking components such as firewalls and load balancers, but not the cloud infrastructure beneath them.
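The three definitions differ mainly in which layers the consumer controls. The sketch below simply transcribes that split into a lookup table; the layer names and the helper `consumer_controls` are shorthand invented for this illustration, not part of the NIST text, and the "possibly" cases (some control over the PaaS hosting environment, IaaS firewalls and load balancers) are left out.

```python
# Layers the consumer controls under each delivery model, as stated in the
# definitions above. Layer names are shorthand chosen for this sketch.
CONSUMER_CONTROLS = {
    "SaaS": set(),                       # consumer only uses the application
    "PaaS": {"deployed applications"},
    "IaaS": {"deployed applications", "storage", "operating system"},
}


def consumer_controls(model, layer):
    """Return True if the consumer controls `layer` under `model`."""
    return layer in CONSUMER_CONTROLS[model]


for model in ("SaaS", "PaaS", "IaaS"):
    print(model, sorted(CONSUMER_CONTROLS[model]))
```

In all three models the underlying cloud infrastructure stays with the provider, which is why it never appears in the table.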
Cloud Server and Data Center Map:
All cloud services exist in a spectrum
[Figure: spectrum of cloud services from lower-level, less management to higher-level, more management: EC2, Azure, AppEngine, Force.com, Google Apps; label: Utility computing]
Stack of Services: Berkeley’s view
› Example
- Netflix: the world's leading Internet subscription service for movies and TV shows
- Netflix migrated from its own data centers to AWS in 2010
- Capacity growth rate is accelerating and unpredictable
- Year-on-year customer growth is 52%; year-on-year growth in customers using streaming is 145% (from ~4M to ~11M)
- Product launch spikes: iPhone, Wii, PS3, Xbox
- A datacenter is a large, inflexible capital commitment
[Figure: stack of services. The Cloud Provider supplies Utility Computing to the SaaS Provider / Cloud User, who supplies a Web Application to the SaaS User.]
Netflix example: reasons for moving to cloud
› We needed to re-architect, which allowed us to question everything, including whether to keep building out our own datacenter solution.
› Letting Amazon focus on datacenter infrastructure allows our engineers to focus on building and improving our business.
› We’re not very good at predicting customer growth or device engagement.
› We think cloud computing is the future.
http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html
Q & A with a Cloud Architect at Netflix.
“Many folks claim that, they can deliver a private cloud at a similar price point to AWS. I assume you ran the numbers yourself. In whatever detail you can share, what does the ROI look like for Netflix?”
› “Oracle on IBM is very expensive, so AWS looks cheap in comparison”
› “AWS costs are fully burdened, and we could not have hired enough SAs and DBAs to build out our own datacenter this fast.”
› “costs are elastic, you start paying for a resource just before it goes live, and if you stop using a resource you stop paying for it”
Netflix example: moving to cloud is not a simple platform migration
› Public cloud is less reliable than a private datacenter
- May require migrating state in volatile memory between instances
› Co-tenancy is hard
- Multi-tenancy is an important feature of a cloud platform
- Co-tenancy can introduce variance in throughput at any level of the stack.
› The best way to avoid failure is to fail constantly
› Learn with real scale, not toy models
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
A downside of current cloud services is the SLA: the customer has to monitor and measure service levels and claim credit back.
The road to cloud computing
› Public Cloud Uptake Will Be Driven by Informal Buyers
› Enterprise IT Buyers Will Stay Focused on Virtualization Over Cloud
› New Cloud Offerings Will Increase a Typical Enterprise's Service Usage
› “Data from the 2010 Forrester survey shows that 80 percent of enterprise decision makers surveyed said that consolidating IT infrastructure through server virtualization is a high priority. In contrast, 29 percent of respondents said that building an internal private cloud operated by IT (not a service provider) is a high priority at their organization and 28 percent said that using a [public] cloud service provider for storage or server consolidation was a high priority.”
Shane O’Neill, Cloud Computing in 2011: 3 Trends Changing Business Adoption, PCWorld, Feb 2011
Server Sprawl
› Server sprawl (a large number of underutilized servers) has become a major problem in many IT departments
› Main causes:
- Requirement by vendors to run their applications in isolation
- Operating system heterogeneity
- A mail server may require Windows Server; a database may be best run on Linux or Solaris.
- Mergers and acquisitions and other integration projects may end up with a large collection of servers, each dedicated to a single task
› “81 percent of CIOs were using virtualization technologies to drive consolidation, according to a recent survey by CIO” (2008 survey)
Increasing utilization is hard!
[Figure: average CPU utilization of 5000+ servers at Google during a six-month period]
Beyond Server Consolidation. Werner Vogels. ACM Queue, Jan/Feb 2008.
Virtualization-based server consolidation
› Benefits of virtualization:
- It breaks the 1:1 relationship between applications and the operating system and between the operating system and the hardware
- It creates N:1 relationships so that we can run multiple isolated applications on a single shared resource
- It also enables 1:N relationships where applications can span multiple physical resources more easily by providing elasticity in their resource usage.
› Challenges in server consolidation
- How to accurately characterize an application’s resource requirements
- How to optimally distribute the virtual machines hosting the applications over the physical resources
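The placement challenge above is a form of bin packing. As a rough illustration only, the sketch below uses the first-fit-decreasing heuristic, which is one common approximation and not necessarily what any real consolidation tool does. The VM demands and the 50% per-host cap are hypothetical numbers, and only a single resource (CPU) is considered.

```python
def first_fit_decreasing(vm_loads, host_capacity):
    """Place VM loads onto hosts with the first-fit-decreasing heuristic.

    vm_loads: CPU demand per VM, in the same unit as host_capacity.
    Returns a list of hosts, each a list of the VM loads placed on it.
    """
    hosts = []  # each entry is [remaining_capacity, [placed loads]]
    for load in sorted(vm_loads, reverse=True):
        if load > host_capacity:
            raise ValueError(f"VM load {load} exceeds host capacity")
        for host in hosts:
            if host[0] >= load:      # first host with enough room
                host[0] -= load
                host[1].append(load)
                break
        else:                        # no existing host fits: open a new one
            hosts.append([host_capacity - load, [load]])
    return [placed for _, placed in hosts]


# Hypothetical demands, as a percentage of one host's CPU; hosts capped at 50%.
placement = first_fit_decreasing([20, 5, 15, 10, 30, 5, 10, 25], host_capacity=50)
for index, placed in enumerate(placement):
    print(f"host {index}: loads {placed}, total {sum(placed)}%")
```

The heuristic is fast but not guaranteed optimal, and it depends on the first challenge: the per-VM demand figures have to be characterized accurately before any packing makes sense.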
100% utilization is not the goal
› Workloads in the enterprise are heterogeneous
› Demand is uncertain and often occurs in spikes
› The operating system starts to behave unpredictably under high CPU and IO load
› For pure CPU-bound environments, 70 percent seems to be achievable for highly tuned applications; for environments with mixed workloads, 40 percent is a major success, and 50 percent has become the Holy Grail.
› Real-world estimates of server utilization in datacenters range from 5% to 20%!
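To get a feel for what these utilization figures mean for consolidation, here is a back-of-the-envelope sketch. It assumes identical hosts and perfectly divisible load, which is an idealization, and the fleet size of 100 servers is a made-up example; `hosts_after_consolidation` is a name invented here.

```python
import math


def hosts_after_consolidation(servers, avg_utilization_pct, target_utilization_pct):
    """Estimate hosts needed to carry the same total load at a target utilization.

    Idealized: identical hosts, load that can be split freely across them.
    """
    total_load = servers * avg_utilization_pct  # in "percent of one host"
    return math.ceil(total_load / target_utilization_pct)


# Hypothetical fleet of 100 servers at 10% average utilization,
# consolidated to the 40% "major success" level for mixed workloads.
print(hosts_after_consolidation(100, 10, 40))  # 25
```

Even at a modest 40% target, the same load fits on a quarter of the machines, which is the gap that server consolidation tries to close.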
Private Cloud and Hosted private cloud
› “Anecdotally and from surveys, it's becoming clear that most enterprises are first looking to the private cloud as a way to play with cloud tools and concepts in the safety of their own secure sandbox.” [http://www.pcworld.com/businesscenter/article/224228/public_cloud_vs_private_cloud_why_not_both.html]
› It uses similar technologies to those in the public cloud, but sits behind a firewall and is only open to departments and people within the organization
› May be useful for large enterprises, but the upfront cost would be too much for small organizations. “The level of expertise we would have needed in-house to make this happen doesn't make sense for a company of our size, and it doesn't even make sense for our road map for the next three to five years.”
› Amazon offers a hosted private cloud, Virtual Private Cloud
Public Cloud computing
› The obvious benefits:
- Illusion of infinite resources
- No up-front cost
- Fine-grained billing (e.g. hourly)
› The driving technologies
- Experience with very large datacenters (Warehouse Scale Computer)
- Unprecedented economies of scale
- Pervasive broadband Internet
- Fast x86 virtualization
- Pay-as-you-go billing model
Cloud Killer Apps
› Mobile and web applications
› Extensions of desktop software
- Matlab, Mathematica
› One-off batch processing / MapReduce
- A Washington Post engineer used 200 EC2 instances (1,407 server hours) to convert 17,481 pages of Hillary Clinton’s travel documents into a form friendlier for presentation on the WWW
- NY Times used 100 Amazon EC2 instances to convert 11 million historical articles from TIFF to PDF within 24 hours; 4 TB of article data became 1.5 TB of PDF
- NY Times built its own Hadoop toolkit to make MapReduce jobs easy to write
- Motivated by the huge volume of log data and the difficulty of processing it
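For readers who have not met MapReduce yet, the sketch below shows the programming model in a single Python process: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step folds each group. It is a toy word count, not Hadoop and not the NY Times toolkit; all function names are invented for this illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def total(_key, values):
    # Reducer: sum the counts for one word.
    return sum(values)


lines = ["cloud computing", "utility computing", "cloud services"]
counts = reduce_phase(shuffle(map_phase(lines, count_words)), total)
print(counts)
```

Because each mapper call and each reducer call is independent, a real framework can spread them over hundreds of rented instances, which is what makes one-off batch jobs like the ones above a good fit for the cloud.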
Cloud users’ incentive
› There is significant overhead in acquiring IT resources
› Server acquisition times often run into several months
› In large organizations, once a resource has been allocated to a project, teams are unwilling to release it given the long lead times in reacquiring the resource when needed again.
› This in turn increases wasted server time.
The famous Animoto example
“They had 25,000 members on Monday, 50,000 on Tuesday, and 250,000 on Thursday. Their EC2 usage grew as well. For the last month or so they had been using between 50 and 100 instances. On Tuesday their usage peaked at around 400, Wednesday it was 900, and then 3400 instances as of Friday morning.”
Economics of Cloud Users
• Pay by use instead of provisioning for peak
[Figure: two panels of resources vs. time with demand and capacity curves. Static data center: fixed capacity above demand, leaving unused resources. Data center in the cloud: capacity follows demand.]
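The difference between the two panels can be put in numbers. The sketch below compares paying for peak capacity every day with paying only for what is used. The daily instance counts and the price of 10 cents per instance-hour are hypothetical, and the same unit price is assumed for both cases, which real pricing would not guarantee.

```python
HOURS_PER_DAY = 24


def peak_provisioned_cost(daily_instances, cents_per_instance_hour):
    """Cost in cents when capacity is fixed at the peak for every day."""
    peak = max(daily_instances)
    return peak * len(daily_instances) * HOURS_PER_DAY * cents_per_instance_hour


def pay_by_use_cost(daily_instances, cents_per_instance_hour):
    """Cost in cents when only the instances actually used are billed."""
    return sum(daily_instances) * HOURS_PER_DAY * cents_per_instance_hour


# Hypothetical demand with a sharp spike, in instances per day.
demand = [100, 100, 400, 900, 3400]
PRICE = 10  # assumed price, in cents per instance-hour

static = peak_provisioned_cost(demand, PRICE)
elastic = pay_by_use_cost(demand, PRICE)
print(f"peak-provisioned: ${static / 100:,.2f}")
print(f"pay-by-use:       ${elastic / 100:,.2f}")
```

The spikier the demand, the larger the gap between the two totals, since the static case pays for the peak on every quiet day as well.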
Economics of Cloud Users
• Risk of over-provisioning: underutilization
[Figure: resources vs. time with demand and capacity curves; the gap where capacity exceeds demand is unused resources.]
Data center/cloud server distribution: http://www.datacentermap.com/
Economics of Cloud Users
• Heavy penalty for under-provisioning
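[Figure: three panels of resources vs. time (days 1 to 3) with demand and capacity curves; under-provisioning leads to lost revenue and then lost users.]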
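Both provisioning risks can be read off a demand curve once a fixed capacity is chosen. The sketch below measures the unused capacity (over-provisioning) and the unmet demand (under-provisioning) for a hypothetical five-period demand series; it does not model the follow-on effect of users leaving for good, and `provisioning_gaps` is a name made up for this example.

```python
def provisioning_gaps(demand, capacity):
    """Return (unused, unmet) resource-periods for a fixed capacity.

    unused: capacity paid for but idle (over-provisioning).
    unmet:  demand that could not be served (under-provisioning).
    """
    unused = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return unused, unmet


# Hypothetical demand per period, in arbitrary resource units.
demand = [40, 60, 120, 90, 50]

for capacity in (100, 120):
    unused, unmet = provisioning_gaps(demand, capacity)
    print(f"capacity {capacity}: unused {unused}, unmet {unmet}")
```

Raising capacity to cover the peak removes the unmet demand but increases the idle capacity, which is exactly the trade-off that pay-by-use avoids.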
Economics of Cloud Providers
› 5-7x economies of scale [Hamilton 2008]
› Extra benefits
- Amazon: utilize off-peak capacity
- Microsoft: sell .NET tools
- Google: reuse existing infrastructure
Resource | Cost in Medium DC | Cost in Very Large DC | Ratio
Network | $95 / Mbps / month | $13 / Mbps / month | 7.1x
Storage | $2.20 / GB / month | $0.40 / GB / month | 5.7x
Administration | ≈140 servers/admin | >1000 servers/admin | 7.1x
Adoption Challenges
Challenge | Opportunity
Availability | Multiple providers & DCs
Data lock-in | Standardization
Data Confidentiality and Auditability | Encryption, VLANs, Firewalls; Geographical Data Storage
Datacenter blackout is “common”
Amazon data center lists:
› In US: Ashburn, Virginia; Dallas/Fort Worth; Los Angeles; Miami; Newark, New Jersey; Palo Alto, California; Seattle; St. Louis
› In other countries: Amsterdam; Dublin; Frankfurt; London; Hong Kong; Singapore; Tokyo
http://www.datacenterknowledge.com/
“Because our costs vary by location, pricing for data served from edge locations outside of the US varies, and is currently slightly higher,” – Jeff Barr in AWS blog
“Last week’s lengthy outage for Amazon Web Services cloud computing platform was caused by a network configuration error as Amazon was attempting to upgrade capacity on its network” [April 21st, 2011]
“More than 18 million blogs on WordPress.com were down for several hours Tuesday [March 22nd, 2011] night. ‘A fix for a server issue caused a series of failures,’ Automattic reported on its Twitter feed.”
“The social news site Reddit is revising how it uses Amazon’s cloud computing service following performance problems that contributed to six hours of downtime for the Reddit site this week [March 18th, 2011].” “Several hours after the latencies were reported as fixed, AWS reported that connectivity problems related to a ‘misbehaving network device.’”
“Hundreds of thousands of UK customers of Vodafone lost service this morning [February 28th, 2011] after switch
Growth Challenges
Challenge | Opportunity
Data transfer bottlenecks | FedEx-ing disks, Data Backup/Archival
Performance unpredictability | Improved VM support, flash memory, scheduling VMs
Scalable storage | Invent scalable store
Bugs in large distributed systems | Invent Debugger that relies on Distributed VMs
Scaling quickly | Invent Auto-Scaler that relies
Policy and Business Challenges
Challenge | Opportunity
Reputation Fate Sharing | Offer reputation-guarding services like those for email
Software Licensing | Pay-for-use licenses; Bulk
Cloud Services SLA
› Most cloud providers’ SLAs do not contain fine-grained Service Level Objectives (SLOs)
› It is the customer’s responsibility to notice problems and to collect evidence
Cloud Services SLA comparison
Services compared: Windows Azure Compute, Amazon EC2, Rackspace Cloud Servers
› Uptime/availability guarantee
- Windows Azure Compute: 99.95% (overall), 99.9% (role instance)
- Amazon EC2: 99.95%
- Rackspace Cloud Servers: 100%
› Notification onus
- Customer, for all three services
› Time window
- Windows Azure Compute: notify incidents within 5 days, submit claim before next billing month
- Amazon EC2: 30 days after incident
- Rackspace Cloud Servers: 30 days after incident
› Credit back
- Windows Azure Compute: < 99.95% (99.9%): 10% credit; < 99%: 25% credit
- Amazon EC2: 10% of bill per eligible credit period
- Rackspace Cloud Servers: 5% of the fees for each 30 minutes of network or data center downtime, up to 100% of the fees; 5% of the fees for each additional hour of downtime past time-to-resolve, up to 100% of the fees
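It helps to translate the uptime percentages into minutes and the credit rules into numbers. The sketch below does this for a 30-day billing period. The credit functions only encode the tiers listed above; the assumption that Rackspace counts completed 30-minute blocks is a guess made for this example, and none of the function names come from a provider's API.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime per billing period that still meet the guarantee."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def azure_compute_credit(uptime_percent):
    """Credit percentage using the Windows Azure compute tiers listed above."""
    if uptime_percent < 99:
        return 25
    if uptime_percent < 99.95:
        return 10
    return 0


def rackspace_downtime_credit(downtime_minutes):
    """5% of fees per 30 minutes of downtime, capped at 100%.

    Assumption for this sketch: only completed 30-minute blocks count.
    """
    return min(100, 5 * (downtime_minutes // 30))


for guarantee in (99.9, 99.95):
    minutes = allowed_downtime_minutes(guarantee)
    print(f"{guarantee}% uptime allows about {minutes:.1f} minutes of downtime")

print(azure_compute_credit(99.5), rackspace_downtime_credit(90))
```

A 99.95% guarantee leaves only about 21.6 minutes per 30-day month, so a single multi-hour outage like those quoted earlier breaches it many times over, yet the credit is a small fraction of the bill.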
Cloud Services SLA comparison (II)
Services compared: Windows Azure Storage, Amazon S3, Rackspace Cloud Files
› Uptime/availability guarantee
- Windows Azure Storage: 99.9%. “Error rate” is defined as the number of failed transactions divided by the total number of transactions. The definition of “failed transaction” takes maximum processing time into account. E.g. for transaction type “copy blob”, processing time over 90 seconds is considered as failed; for normal query, processing time over 10 seconds is considered as failed.
- Amazon S3: 99.9%. “Error Rate” means: (i) the total number of internal server errors returned by Amazon S3 as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests during that five minute period. We will calculate the Error Rate for each Amazon S3 account as a percentage for each five minute period in the monthly billing cycle.
- Rackspace Cloud Files: 99.9%. (i) The Rackspace Cloud network is down, or (ii) the Cloud Files service returns a server error response to a valid user request during two or more consecutive 90 second intervals, or (iii) the Content Delivery Network fails to deliver an average download time for a 1-byte reference document of 0.3 seconds or less, as measured by The Rackspace Cloud's third party measuring
› Notification onus
- Customer, for all three services
› Time window
- Windows Azure Storage: notify incidents within 5 days, submit claim before next billing month
- Amazon S3: 10 business days after the current billing cycle
- Rackspace Cloud Files: 30 days after incident
› Credit back
- Windows Azure Storage: < 99.9%: 10% credit; < 99%: 25% credit
- Amazon S3: 99% - 99.9%: 10%; less than 99%: 25%
- Rackspace Cloud Files: 99.89% - 99.5%: 10%; 99.49% - 99.0%: 25%; …; less than 96.5%: 100%
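The S3-style definition is easy to misread, so a worked sketch may help. It computes the error rate for each five-minute period as failed requests over total requests, then derives a monthly uptime figure as 100% minus the average error rate. That last step is an assumption of this sketch, since the text above only defines the per-period error rate; the request counts are invented.

```python
def error_rate(failed, total):
    """Error rate for one five-minute period, as a percentage."""
    return 0.0 if total == 0 else 100.0 * failed / total


def monthly_uptime_percent(periods):
    """periods: list of (failed_requests, total_requests), one per five-minute period.

    Assumption for this sketch: uptime = 100% minus the average error rate.
    """
    rates = [error_rate(failed, total) for failed, total in periods]
    return 100.0 - sum(rates) / len(rates)


def s3_credit(uptime_percent):
    """Credit percentage using the Amazon S3 tiers listed above."""
    if uptime_percent < 99.0:
        return 25
    if uptime_percent < 99.9:
        return 10
    return 0


# Hypothetical sample: nine clean periods and one with 5% of requests failing.
sample = [(0, 1000)] * 9 + [(50, 1000)]
uptime = monthly_uptime_percent(sample)
print(uptime, s3_credit(uptime))
```

Note what the customer needs in order to make a claim: per-request failure counts for every five-minute period, which is evidence the customer has to collect themselves.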
Monitoring Services
› Cloud providers have mechanisms for monitoring services
- E.g. Amazon CloudWatch
- A web service that provides monitoring for AWS cloud resources, starting with Amazon EC2. It provides customers with visibility into resource utilization, operational performance, and overall demand patterns, including metrics such as CPU utilization, disk reads and writes, and network traffic.
- Such tools are aimed more at helping customers decide how to use the services, e.g. how many instances to run and at what capacity
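As an illustration of how utilization metrics can feed a capacity decision, the sketch below scales the instance count in proportion to average CPU utilization. It does not call CloudWatch or any AWS API: the CPU samples are hard-coded stand-ins for what a monitoring service would report, and `desired_instances` and the 50% target are assumptions of this example.

```python
import math


def desired_instances(current_instances, cpu_samples, target_cpu_percent):
    """Suggest an instance count that would bring average CPU near the target.

    Assumes load spreads evenly, so CPU per instance scales inversely
    with the number of instances.
    """
    average_cpu = sum(cpu_samples) / len(cpu_samples)
    needed = math.ceil(current_instances * average_cpu / target_cpu_percent)
    return max(1, needed)  # never scale below one instance


# Hypothetical CPU utilization samples (percent) for a 4-instance fleet.
print(desired_instances(4, [80, 90, 70], target_cpu_percent=50))
print(desired_instances(4, [20, 30, 10], target_cpu_percent=50))
```

The same proportional rule suggests scaling out under heavy load and scaling in when the fleet is mostly idle, which is the kind of decision the monitoring data is meant to support.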
Main Resources
› Arik Hesseldahl, Seven Questions for Adam Selipsky, VP at Amazon Web Services, All Things Digital, March 7, 2011 [accessible from: http://newenterprise.allthingsd.com/20110307/seven-questions-for-adam-selipsky-head-of-amazon-web-services/ ]
› Cloud Computing Use Case Discussion Group, Cloud Computing Use Case White Paper (version 4.0), July 2010 [accessible from: http://opencloudmanifesto.org/Cloud_Computing_Use_Cases_Whitepaper-4_0.pdf ]
› "Above the Clouds: A Berkeley View of Cloud Computing", 2009