Application Development
for the Cloud:
for the Cloud:
A Paradigm Shift
Ramesh Rangachar
I t l t
Intelsat
Motivation for the Paradigm Shift
Motivation for the Paradigm Shift
• The performance of the traditional applications is
becoming inadequate
becoming inadequate
• Data is growing faster than Moore’s Law [IDC study]
• The new buzzword is Big Data
A li ti f d l bilit li it d b th
• Application performance and scalability are limited by the performance and scalability of storage (disk I/O) and
network
Applications m st capitali e on the benefits pro ided
• Applications must capitalize on the benefits provided
by the cloud environment
• No significant improvements in “scaling up” (scale
vertically or upgrade a single node)
• “Scaling out” (scale horizontally or add more nodes)
is a better approach
New 2007 Template - 2
pp
How are Others Solving Big Data Problem?
How are Others Solving Big Data Problem?
• Google invented a software framework called MapReduce to support distributed computing on large data sets on clusters of pp p g g computers
– Google also invented Google Bigtable, a distributed storage system for managing petabytes of data
• Hadoop: An open source Apache project inspired by the work at Google. A software framework for reliable, scalable,
distributed computing
– There are a number of other Hadoop related projects at Apache – Major contributors include Yahoo, Facebook, and Twitter
• S4 Distributed Stream Computing Platform: An ApacheS4 Distributed Stream Computing Platform: An Apache
incubator program. Supports development of applications for processing continuous unbounded streams of data
What is Apache Hadoop?
What is Apache Hadoop?
• The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing software for reliable, scalable, distributed computing
– A framework for the distributed processing of large data sets across clusters of computers using a simple programming model – Designed to scale from single server to thousands of servers.Designed to scale from single server to thousands of servers.
Each server offers local computation and storage
• The Hadoop core consists of:
• The Hadoop core consists of:
– Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access that provides high throughput access to application data
– Hadoop MapReduce: A software
framework for distributed processing
New 2007 Template - 4
p g
of large data sets on clusters
MapReduce Data Flow
MapReduce Data Flow
Hadoop MapReduce
Hadoop MapReduce
Master = JobTracker Worker=TaskTracker
Mapreduce Example: Word Count
Mapreduce Example: Word Count
cloud,1;
computing,1 overview,1
h 1 a 1
cloud computing overview
worker worker
heterogeneous,1 high,1
performance,1 cloud,1 computing,1
a,1 as,1
acquisitions,1 applications,2 cloud,4;
computing,3
d l t 1
heterogeneous high performance cloud computing
cloud computing applications in space
worker
cloud,1 computing,1 applications,1 in,1
space,1 system,1
development,1 dynamics,1 flight,1
heterogeneous,1 high,1
applications in space in,2
system acquisitions applications
development in the cloud
space flight dynamics
worker
worker
acquisitions,1 space,1 flight,1 dynamics,1 as,1 a,1
performance,1 overview,1 space,2 system 1 as a service
worker
Service,1
applications,1 development,1 in,1
th 1
system,1 service,1 the,1
Input files
Reduce
Phase Output
fil
New 2007 Template - 6
the,1 cloud,1
Map Phase
Intermediate files
files
HDFS
HDFS
• Very large distributed file system on commodity
h d
hardware
• Designed to scale to petabytes of storage
Si l N f ti l t
• Single Namespace for entire cluster
– Optimized for batch processing
Write once read man access model
– Write-once-read-many access model
– Read, write, delete, append-only
• NameNode handles replication deletion creation
• NameNode handles replication, deletion, creation
• DataNode handles data retrieval
HDFS
HDFS
• Files are broken up p
NameNode BackupNodeinto blocks
– Block size is 64
MB (can be
MB (can be
modified)
– Blocks replicated
on at least 3
on at least 3
nodes
– If a node fails,
DataNode DataNode DataNode DataNode DataNode
blocks get
replicated
automatically to
th d
New 2007 Template - 8
other nodes
Hadoop Cluster at Yahoo!
Hadoop Cluster at Yahoo!
• Largest cluster:
4000 d 16PB di k 64TB RAM – 4000 nodes, 16PB raw disk, 64TB RAM
Who Uses Hadoop?
Who Uses Hadoop?
• Adobe • LinkedIn
• AOL
• Disney
• Ebay
• New York Times
• Powerset/Microsoft
• Rackspace Ebay
• Rackspace
• StumbleUpon
• Hulu
• IBM
• Yahoo!
Full list: http://wiki.apache.org/hadoop/PoweredBy
New 2007 Template - 10
Examples
Examples
• New York Times
– Converted public domain articles from 1851-1922 to PDF – Converted public domain articles from 1851-1922 to PDF – Stored 4 TB of scanned images as input on Amazon S3 – Ran 100 Amazon EC2 instances for around 24 hours
P d d 11 Milli ti l 1 5 TB f PDF fil t t – Produced 11 Million articles, 1.5 TB of PDF files as output
– http://open.blogs.nytimes.com/2007/11/01/self-service-prorated- super-computing-fun/
Vi
• Visa
– Processing 73 billion transactions, amounting to 36 TB of data – Traditional methods: 1 month
– Hadoop: 13 minutes
– How Hadoop Tames Enterprises’ Big Data. Sreedhar Kajeepeta,
Sorting Benchmarks
Sorting Benchmarks
• Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Sec
– http://developer.yahoo.com/blogs/hadoop/posts/2009/05/had oop_sorts_a_petabyte_in_162/ondsp_ _ _p y _ _
New 2007 Template - 12
The Hadoop Ecosystem
The Hadoop Ecosystem
• HBase™: Initiated by Powerset. A scalable, distributed
database that supports structured data storage for large tables
• Hive™: Initiated by Facebook. A data warehouse infrastructure that provides data summarization and ad hoc querying
• Pig™: Initiated by Yahoo A high-level data-flow language andPig : Initiated by Yahoo. A high level data flow language and execution framework for parallel computation
HBase
HBase
• Open source project modeled after Google’s
BigTable
BigTable
• Distributed, large-scale data store
• Efficient at random reads/writes Efficient at random reads/writes
• Use Hbase when you:
– need to store large amounts of data
– need high write throughput
– need efficient random access within large data sets
d t l f ll ith d t
– need to scale gracefully with data
– need to save time-series ordered data
– don’t need full RDMS capabilities (cross row/cross
New 2007 Template - 14
don t need full RDMS capabilities (cross row/cross
table transactions, joins, etc.)
HBase Use Cases
HBase Use Cases
• Mozilla’s Socorro - crash reporting system
2 5 Milli h t /d
– 2.5 Million crash reports/day
– Real-time Messaging System: HBase to Store 135+ Real time Messaging System: HBase to Store 135+
Billion Messages a Month
– Realtime Analytics System: HBase to Process 20
Billi E t P D
Billion Events Per Day
• Twitter, Yahoo!, TrendMicro, StumbleUpon, …
• See http://wiki apache org/hadoop/Hbase/PoweredBy See http://wiki.apache.org/hadoop/Hbase/PoweredBy
OpenTSDB
OpenTSDB
• 15464 points retrieved, 932 points plotted in 100ms
points plotted in 100ms
• Generating custom graphs and correlating events is easy
• A distributed, scalable, time series database developed by p y StumbleUpon
– Open-source monitoring system built on Hbase – It is also a data plotting system.p g y
– Used to monitor computers in their datacenter
– Collects 600 million data points per day, over 70 billion data points per year
New 2007 Template - 16
data points per year
• Store them forever, retain granular data
Typical Use Cases in Ground Systems Domain
Typical Use Cases in Ground Systems Domain
• Hadoop can be used to scale out applications that fit
the “write once read many” model
the write-once-read-many model
• Typical use cases:
T l t l i d i li ti
– Telemetry analysis and visualization – Real time displays
– Storage and retrieval of spacecraft and ground system – Storage and retrieval of spacecraft and ground system
events
– Payload data processing
Summary
Summary
• In order to capitalize on the benefits of the cloud
environment applications must be designed to
environment, applications must be designed to
“scale out” rather than “scale up”
• Robust open source tools to support computing on a p pp p g
cluster of computers are available and in use today
• The Hadoop Ecosystem is used in many enterprises
Th f d l bilit f d t
• The performance and scalability of ground systems
can be improved by using the growing ecosystem of
Hadoop tools and technologies p g
• Applications that fit the “write-once-read-many”
model and process large amount of data benefit from
this approach
New 2007 Template - 18
this approach
References
References
• Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Communications of the ACM, Volume g g 51, Issue 1, pp. 107-113, 2008
• Apache Hadoop: http://hadoop.apache.org/
• Apache Hadoop: an introduction. Todd Lipcon, May 27, 2010
• Hadoop Update, Open Cirrus Summit 2009.Eric Baldeschwieler, VP Hadoop Development, Yahoo!
• Cloudera Training: http://www.cloudera.com/hadoop-training
Realtime Apache Hadoop at Facebook Dhruba Borthakur Oct 5 2011
• Realtime Apache Hadoop at Facebook. Dhruba Borthakur, Oct 5, 2011 at UC Berkeley Graduate Course
• Apache Hbase: hbase.apache.org
• http://www.slideshare.net/kevinweil/nosql-attwitter-nosql-eu-2010http://www.slideshare.net/kevinweil/nosql attwitter nosql eu 2010
• http://www.slideshare.net/ydn/3-hadooppigattwitterhadoopsummit2010
• OpenTSDB, The Distributed, Scalable, Time Series Database. Benoît