
(1)

Application Development for the Cloud: A Paradigm Shift

Ramesh Rangachar

Intelsat

(2)

Motivation for the Paradigm Shift

The performance of traditional applications is becoming inadequate

Data is growing faster than Moore’s Law [IDC study]

The new buzzword is Big Data

Application performance and scalability are limited by the performance and scalability of storage (disk I/O) and the network

Applications must capitalize on the benefits provided by the cloud environment

No significant improvements in “scaling up” (scaling vertically, or upgrading a single node)

“Scaling out” (scaling horizontally, or adding more nodes) is a better approach

(3)

How Are Others Solving the Big Data Problem?

Google invented a software framework called MapReduce to support distributed computing on large data sets on clusters of computers

Google also invented Google Bigtable, a distributed storage system for managing petabytes of data

Hadoop: An open source Apache project inspired by the work at Google. A software framework for reliable, scalable, distributed computing

There are a number of other Hadoop-related projects at Apache

Major contributors include Yahoo, Facebook, and Twitter

S4 Distributed Stream Computing Platform: An Apache incubator program. Supports development of applications for processing continuous unbounded streams of data

(4)

What is Apache Hadoop?

The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing

A framework for the distributed processing of large data sets across clusters of computers using a simple programming model

Designed to scale from a single server to thousands of servers

Each server offers local computation and storage

The Hadoop core consists of:

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data

Hadoop MapReduce: A software framework for distributed processing of large data sets on clusters

(5)

MapReduce Data Flow

[Figure: Hadoop MapReduce data flow. Master = JobTracker, Worker = TaskTracker]

(6)

MapReduce Example: Word Count

[Figure: word-count data flow. Input files are read by map workers (Map Phase), which write (word, 1) pairs to intermediate files; reduce workers (Reduce Phase) sum the pairs per word and write the output files]

Input text: cloud computing overview; heterogeneous high performance cloud computing; cloud computing applications in space; system acquisitions applications; development in the cloud; space flight dynamics; as a service

Map output (example): cloud,1; computing,1; overview,1

Reduce output: a,1; acquisitions,1; applications,2; as,1; cloud,4; computing,3; development,1; dynamics,1; flight,1; heterogeneous,1; high,1; in,2; overview,1; performance,1; service,1; space,2; system,1; the,1
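The word-count flow on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names are hypothetical, and the way the input text is divided into lines is an assumption made for the example.

```python
from collections import defaultdict

# Input text taken from the slide; the line boundaries are an assumption.
documents = [
    "cloud computing overview",
    "heterogeneous high performance cloud computing",
    "cloud computing applications in space",
    "system acquisitions applications development in the cloud",
    "space flight dynamics as a service",
]


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate pairs by word, as the framework does between phases."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


# Each line could be handled by a different map worker; here they run in sequence.
intermediate = [pair for line in documents for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())

print(word_counts["cloud"], word_counts["computing"], word_counts["space"])
```

Because each map call sees only its own line and each reduce call sees only one word, both phases can be spread across many workers without coordination, which is what lets the model scale out.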

(7)

HDFS

Very large distributed file system on commodity hardware

Designed to scale to petabytes of storage

Single Namespace for entire cluster

Optimized for batch processing

Write-once-read-many access model

Read, write, delete, append-only

NameNode handles replication, deletion, creation

DataNode handles data retrieval

(8)

HDFS

Files are broken up into blocks

Block size is 64 MB (can be modified)

Blocks replicated on at least 3 nodes

If a node fails, blocks get replicated automatically to other nodes

[Figure: a NameNode and a BackupNode above five DataNodes]
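The block and replication behavior described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not HDFS code: the round-robin placement, the function names, and the node names are hypothetical, and real HDFS placement also takes rack topology into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide
REPLICATION = 3                # replica count quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks are needed to hold file_size bytes."""
    return (file_size + block_size - 1) // block_size


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes (simple round-robin)."""
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def re_replicate(placement, failed, nodes, replication=REPLICATION):
    """After a node fails, copy under-replicated blocks to surviving nodes."""
    alive = [n for n in nodes if n != failed]
    for holders in placement.values():
        holders[:] = [n for n in holders if n != failed]
        for candidate in alive:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]    # hypothetical DataNode names
blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file needs 4 blocks
placement = place_replicas(blocks, nodes)
placement = re_replicate(placement, "dn2", nodes)

for block, holders in placement.items():
    print(block, holders)
```

In this model the bookkeeping in `placement` plays the role of the NameNode's metadata, while the node names stand in for the DataNodes that hold the actual block contents.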

(9)

Hadoop Cluster at Yahoo!

• Largest cluster:

– 4000 nodes, 16 PB raw disk, 64 TB RAM

(10)

Who Uses Hadoop?

Adobe

AOL

Disney

eBay

Facebook

Google

Hulu

IBM

LinkedIn

New York Times

Powerset/Microsoft

Rackspace

StumbleUpon

Twitter

Yahoo!

Full list: http://wiki.apache.org/hadoop/PoweredBy

(11)

Examples

New York Times

Converted public domain articles from 1851-1922 to PDF

Stored 4 TB of scanned images as input on Amazon S3

Ran 100 Amazon EC2 instances for around 24 hours

Produced 11 million articles, 1.5 TB of PDF files as output

http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

Visa

Processing 73 billion transactions, amounting to 36 TB of data

Traditional methods: 1 month

Hadoop: 13 minutes

How Hadoop Tames Enterprises’ Big Data. Sreedhar Kajeepeta,

(12)

Sorting Benchmarks

Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
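To put the two sort times on a common scale, they can be converted to aggregate throughput. The short calculation below assumes decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes), which the slide does not specify, and the function name is hypothetical.

```python
PETABYTE = 10**15  # bytes; decimal units are an assumption
TERABYTE = 10**12  # bytes; decimal units are an assumption


def throughput_gb_per_s(num_bytes, seconds):
    """Aggregate sort throughput in gigabytes per second."""
    return num_bytes / seconds / 10**9


pb_rate = throughput_gb_per_s(PETABYTE, 16.25 * 3600)  # 16.25 hours
tb_rate = throughput_gb_per_s(TERABYTE, 62)            # 62 seconds

print(f"Petabyte sort: {pb_rate:.1f} GB/s")  # prints "Petabyte sort: 17.1 GB/s"
print(f"Terabyte sort: {tb_rate:.1f} GB/s")  # prints "Terabyte sort: 16.1 GB/s"
```

Both runs work out to roughly 16 to 17 GB/s across the whole cluster, a rate reached by adding nodes rather than by upgrading a single machine.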

(13)

The Hadoop Ecosystem

HBase™: Initiated by Powerset. A scalable, distributed database that supports structured data storage for large tables

Hive™: Initiated by Facebook. A data warehouse infrastructure that provides data summarization and ad hoc querying

Pig™: Initiated by Yahoo. A high-level data-flow language and execution framework for parallel computation

(14)

HBase

Open source project modeled after Google’s BigTable

Distributed, large-scale data store

Efficient at random reads/writes

Use HBase when you:

need to store large amounts of data

need high write throughput

need efficient random access within large data sets

need to scale gracefully with data

need to save time-series ordered data

don’t need full RDBMS capabilities (cross row/cross table transactions, joins, etc.)
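HBase keeps rows sorted by row key, which is why it suits time-series ordered data: a key built from a series name followed by a timestamp turns a time-window query into one contiguous range scan. The sketch below models that idea with an in-memory sorted list. It does not use any HBase client API; the `SortedStore` class, the key layout, and the metric names are all hypothetical.

```python
import bisect
import struct


def row_key(metric, timestamp):
    """Build a key that sorts by metric name, then by time.

    The timestamp is packed big-endian so that byte order matches numeric order.
    """
    return metric.encode("utf-8") + b"\x00" + struct.pack(">Q", timestamp)


class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return (key, value) pairs with start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for ts, value in [(1000, 20.5), (1060, 20.7), (1120, 21.0), (1180, 21.4)]:
    store.put(row_key("battery_temp", ts), value)
store.put(row_key("bus_voltage", 1060), 28.1)

# All battery_temp samples with 1060 <= timestamp < 1180.
window = store.scan(row_key("battery_temp", 1060), row_key("battery_temp", 1180))
values = [v for _, v in window]
print(values)
```

The scan touches only the rows inside the requested window, and samples from other series never appear in the result because their keys sort elsewhere.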

(15)

HBase Use Cases

Mozilla’s Socorro - crash reporting system

2.5 million crash reports/day

Facebook

Real-time Messaging System: HBase to Store 135+ Billion Messages a Month

Realtime Analytics System: HBase to Process 20 Billion Events Per Day

Twitter, Yahoo!, TrendMicro, StumbleUpon, …

See http://wiki.apache.org/hadoop/Hbase/PoweredBy

(16)

OpenTSDB

A distributed, scalable, time series database developed by StumbleUpon

Open-source monitoring system built on HBase. It is also a data plotting system.

Used to monitor computers in their datacenter

Collects 600 million data points per day, over 70 billion data points per year

Store them forever, retain granular data

• Generating custom graphs and correlating events is easy

[Screenshot: OpenTSDB graph. 15464 points retrieved, 932 points plotted in 100ms]

(17)

Typical Use Cases in Ground Systems Domain

Hadoop can be used to scale out applications that fit the “write-once-read-many” model

Typical use cases:

Telemetry analysis and visualization

Real time displays

Storage and retrieval of spacecraft and ground system events

Payload data processing

(18)

Summary

In order to capitalize on the benefits of the cloud environment, applications must be designed to “scale out” rather than “scale up”

Robust open source tools to support computing on a cluster of computers are available and in use today

The Hadoop Ecosystem is used in many enterprises

The performance and scalability of ground systems can be improved by using the growing ecosystem of Hadoop tools and technologies

Applications that fit the “write-once-read-many” model and process large amounts of data benefit from this approach

(19)

References

Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Communications of the ACM, Volume 51, Issue 1, pp. 107-113, 2008

Apache Hadoop: http://hadoop.apache.org/

Apache Hadoop: an introduction. Todd Lipcon, May 27, 2010

Hadoop Update, Open Cirrus Summit 2009. Eric Baldeschwieler, VP Hadoop Development, Yahoo!

Cloudera Training: http://www.cloudera.com/hadoop-training

Realtime Apache Hadoop at Facebook. Dhruba Borthakur, Oct 5, 2011 at UC Berkeley Graduate Course

Apache HBase: hbase.apache.org

http://www.slideshare.net/kevinweil/nosql-attwitter-nosql-eu-2010

http://www.slideshare.net/ydn/3-hadooppigattwitterhadoopsummit2010

OpenTSDB, The Distributed, Scalable, Time Series Database. Benoît
