MapReduce and Intro to Cloud Computing. ID2210: Lecture 11 Jim Dowling

(1)


(2)

Large-Scale Distributed Computing

•In #Nodes

- BitTorrent (millions) - Peer-to-Peer

•In #Instructions/sec

- Teraflops, Petaflops, Exascale - Super-Computing

•In #Bytes stored

- Facebook: 30+ Petabytes (July ’11)* - NoSQL storages

•In #Bytes processed/time

- Google processes 24 petabytes of data per day

- ?

*http://www.computerworld.com/s/article/9218752/Facebook_moves_30_petabyte_Hadoop_cluster_to_new_data_center

(3)

Big Data

“… The total amount of digital storage worldwide is approaching 1 zettabyte, or 1 million times the contents of the Earth’s largest library. Currently that information is archived on equipment with a mass equivalent to 20 percent of Manhattan. Global data storage is expected to reach 35 zettabytes by 2020 …”

The Boston Globe Editorial & Opinion September 7, 2010

(4)

Programming Large-Scale

•With thousands of servers available within a data centre, how do we:

- write applications for them?

- allocate and manage resources?

•Applications should also be scalable, reliable, and highly available.

- Failures are expected with thousands of machines.

- Need for load-balancing, handling heterogeneity.

(5)

Commodity Computing Challenges

• Cheap nodes fail, especially if you have many

- Mean time between failures for 1 node = 3 years

- Mean time between failures for 1000 nodes = 1 day

- Solution: Build fault-tolerance into system

• Commodity networks have low(ish) bandwidth

- Scan 100TB Datasets on 1000 node cluster with remote storage @ 10MB/s = 165 mins

- Solution: Push computation to the data

• Programming distributed systems is hard

- Solution: Provide a simple to use data-parallel programming model that distributes work and handles faults.
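The failure and bandwidth figures above follow from simple arithmetic. A rough check in Python, assuming independent node failures and decimal units (1 TB = 10^6 MB); the function names are illustrative and not part of the lecture:

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Mean time between failures anywhere in the cluster, in days.

    Assumes failures are independent, so per-node failure rates add up.
    """
    return node_mtbf_years * 365 / nodes


def scan_minutes(dataset_tb: float, nodes: int, mb_per_sec_per_node: float) -> float:
    """Minutes to scan a dataset when every node reads its share in parallel."""
    total_mb = dataset_tb * 1_000_000          # 1 TB = 10^6 MB (decimal units)
    seconds = total_mb / (nodes * mb_per_sec_per_node)
    return seconds / 60


print(cluster_mtbf_days(3, 1000))              # roughly 1.1 days
print(round(scan_minutes(100, 1000, 10)))      # 167
```

With these assumptions, a 1000-node cluster sees a failure about once a day, and the scan takes about 167 minutes, close to the 165 minutes quoted on the slide.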

(6)

Typical Large-Scale Programming Problem

•Iterate over a large number of records

•Extract something of interest from each

•Shuffle and sort intermediate results

•Aggregate intermediate results

•Generate final output

Key idea: provide a functional abstraction for these two operations: Map (extract something of interest from each record) and Reduce (aggregate intermediate results)

(7)

Programming in the Large

•Imagine we have hundreds of thousands of documents structured as follows:

{
  "type": "blog",
  "id": "564",
  "tags": ["hdfs", "mysql", "cats"],
  "content": "<div>...</div>",
  "mentions": [
    { "google": 6, "apple": 11, "microsoft": 1 }
  ]
}

(8)

(9)

(10)

Pseudocode for Map Phase

•Extract the blog_id and the number of mentions of google from each document

def mapper(docs):
    for blog in docs.blogs:
        # emit the blog id and its number of google mentions
        emit(blog.id, blog.mentions.google)

(11)

Pseudocode for Reduce Phases

•Sum up all the google mentions from the same blog_id

def reducer(blog_id, mentions_google):
    # agg_id is one shared key, so the next phase sees all per-blog sums together
    output(agg_id, sum(mentions_google))

•Sum up all the google mentions from all blogs

def reducer(agg_id, mentions_google):
    output(agg_id, sum(mentions_google))
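The map phase and the two reduce phases above can be sketched as runnable Python. This is a single-process illustration, not a distributed implementation: `group_by_key` stands in for the shuffle step, and `AGG_ID`, the function names and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

AGG_ID = "all_blogs"  # hypothetical constant key that groups every blog together


def mapper(doc):
    """Yield (blog_id, google mentions) for one blog document."""
    if doc["type"] == "blog":
        for counts in doc["mentions"]:
            yield doc["id"], counts.get("google", 0)


def reduce_per_blog(blog_id, mentions_google):
    """First reduce phase: total google mentions for one blog_id."""
    return AGG_ID, sum(mentions_google)


def reduce_all(agg_id, per_blog_totals):
    """Second reduce phase: total google mentions over all blogs."""
    return agg_id, sum(per_blog_totals)


def group_by_key(pairs):
    """Stand-in for the shuffle step: collect values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


docs = [
    {"type": "blog", "id": "564", "mentions": [{"google": 6, "apple": 11}]},
    {"type": "blog", "id": "564", "mentions": [{"google": 2}]},
    {"type": "blog", "id": "565", "mentions": [{"google": 1, "microsoft": 1}]},
]

mapped = [pair for doc in docs for pair in mapper(doc)]
per_blog = [reduce_per_blog(k, v) for k, v in group_by_key(mapped).items()]
totals = [reduce_all(k, v) for k, v in group_by_key(per_blog).items()]
print(totals)  # [('all_blogs', 9)]
```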

(12)

* Slide taken from tutorial by Jerry Zhao and Jelena Pjesivac-Grbovic (Google Inc.): “MapReduce – The Programming Model and Practice”. Tutorial held at SIGMETRICS 2009.

(13)

MapReduce Programming Model

•Input Data type: key-value records

•Map function:

(K_in, V_in) → list(K_inter, V_inter)

•Reduce function:

(K_inter, list(V_inter)) → list(K_out, V_out)

(14)

MapReduce Programming Model

map

•Takes input records

- one by one

- key, value

•Processes records

- Independently

•Outputs intermediate results

- 1..n per input record

- key’, value’

reduce

•Takes intermediate results

- Groups with same key

- key’, value’[]

•Processes records

- Group-wise

•Outputs result

- Per group

- Any format
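The model described above can be sketched as a small in-process engine. This is a sequential illustration only, with no distribution or fault tolerance; `run_mapreduce` and the word-count functions are names invented for the example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle and reduce sequentially in one process."""
    intermediate = defaultdict(list)
    # Map: each input record is processed independently.
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            # Shuffle: group intermediate values by their key.
            intermediate[k2].append(v2)
    # Reduce: each group is processed on its own; sorted keys mimic the sort step.
    results = []
    for k2 in sorted(intermediate):
        results.extend(reduce_fn(k2, intermediate[k2]))
    return results


def word_count_map(filename, text):
    """Emit (word, 1) for every word in one input record."""
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """Emit the total count for one word."""
    yield word, sum(counts)


records = [("a.txt", "the cat sat"), ("b.txt", "The cat ran")]
print(run_mapreduce(records, word_count_map, word_count_reduce))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```

Because `map_fn` sees one record at a time and `reduce_fn` sees one group at a time, both loops could be spread over many workers without changing the functions themselves.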

(15)

MapReduce Workflow

[Figure: MapReduce workflow. Workers in the map phase read the input files (partition1, partition2, partition3) and write intermediate output; workers in the reduce phase read the intermediate output and write the output file(s).]

(16)

MapReduce Basics

•MapReduce is a programming model (and framework) that hides the complexity of work distribution and fault tolerance

•Principal design philosophies:

- Near-linear scalability for data sets

- Low cost – reduce hardware, programming and admin costs

•MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time

(17)

MapReduce-Like Implementations

               Google MR     Hadoop                                     Dryad
Availability   Proprietary   Open Source                                Proprietary
Used by        Google        Yahoo!, Facebook, Amazon (EC2!), Twitter   Microsoft

(18)

Load Balancing, Failure, and Stragglers

•Load Balancing

- Break a MapReduce job into small tasks

- Schedule tasks on workers as they report idle status

•MapReduce functions are side-effect free

- Enables failed (and partially completed) tasks to be re-executed without any problems (on a different machine)

- When a worker fails, its tasks can be reallocated to other workers

•Identify and handle stragglers (slow workers)

- Restart slow tasks on new workers

- Stragglers appear with increasing probability as the number of workers increases
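The last point can be made concrete with a small calculation. As a simplifying assumption (not stated in the lecture), suppose each worker is slow independently with the same probability; the 1% figure below is purely illustrative.

```python
def prob_at_least_one_straggler(workers: int, p_slow: float) -> float:
    """P(at least one slow worker) if each is slow independently with p_slow."""
    return 1 - (1 - p_slow) ** workers


for n in (10, 100, 1000):
    print(n, round(prob_at_least_one_straggler(n, 0.01), 3))
```

Under this assumption a job on 10 workers rarely meets a straggler, while a job on 1000 workers almost certainly does, which is why restarting slow tasks on other workers matters at scale.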

(19)

Components in a Hadoop MR Workflow

(20)

(21)

(22)

(23)

(24)

Pig Latin: a relational data-flow language

•MapReduce programs are quite low-level.

•A higher-level programming model was needed for processing semi-structured data sets using the MapReduce platform

•Pig Latin is a procedural, relational data-flow language that is implemented using MapReduce

•Pig Engine: Parser, Optimizer and distributed query

(25)

Pig Example*

•Input: User profiles, Page visits

•Problem: Find the top 5 most visited pages by users aged 18-25

Load Users, Load Pages

Filter by Age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

*Example taken from Yahoo Hadoop tutorial from Middleware 2009

(26)

(27)

In PigLatin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
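To see what each Pig Latin statement computes, the same data flow can be traced in plain Python on a few in-memory rows. The sample data is invented, and the join shortcut assumes user names are unique; Pig itself would compile the script into MapReduce jobs rather than run anything like this.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 19)]
pages = [("alice", "a.com"), ("carol", "a.com"), ("bob", "b.com"),
         ("alice", "c.com"), ("carol", "c.com"), ("carol", "a.com")]

# Filtered = filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined = join Filtered by name, Pages by user (assumes unique user names)
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped and Summed: group Joined by url, then COUNT the rows per group
clicks = Counter(url for _, url in joined)

# Sorted = order by clicks desc; Top5 = limit 5
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 3), ('c.com', 2)]
```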

(28)

Pig Latin Data Types

•Tuple: Ordered set of fields

- Field can be simple or complex type

- Nested relational model

•Bag: Collection of tuples.

- Can contain duplicates

•Map: Set of (key, value) pairs

•Primitive types
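As a loose analogy (this is not Pig's API), the types above map onto familiar Python structures. The values are invented for illustration.

```python
# Rough Python analogues of the Pig Latin data types.
record = ("alice", 22, {"city": "Stockholm"})      # tuple: ordered fields, may nest
bag = [("alice", "a.com"), ("alice", "a.com")]     # bag: tuples, duplicates allowed
properties = {"browser": "firefox", "visits": 3}   # map: (key, value) pairs
grouped = ("a.com", bag)                           # nesting: a tuple holding a bag

print(len(bag), grouped[0], properties["visits"])
```

The nested `grouped` value resembles what a Pig `group` statement produces: one tuple per key, holding a bag of the matching tuples.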

(29)

(30)

Hive and HBase

•Hive and Pig were parallel projects developed at Facebook and Yahoo, respectively.

•HiveQL is closer to SQL from traditional RDBMSs than Pig Latin (which is procedural).

- Due to the limitations of MapReduce, and the fact that HiveQL is compiled into a MapReduce query plan, HiveQL is a cut-down version of SQL

•HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable.

•A lot of Hive programs run over HBase.

(31)

MapReduce Filesystem Requirements

•MapReduce jobs run on data stored in files

•Support large files

- Streaming reads

- Mostly append to end (easier concurrency)

•Scalability

- Add machines to scale

•Workers’ tasks use data on their local machine

- Bandwidth is the bottleneck. Move code to data.

•Expect failures

(32)

Hadoop Filesystem (HDFS)

•Supports huge data sets and large files

- Gigabyte files, petabyte data sets

•Supports tens of millions of files in a file system

•Files have write-once-read-many semantics

- Clients can only append to existing files

•Designed to run on COTS hardware

- Implemented in Java

•Timely detection and recovery from data node faults

(33)

HDFS Clusters can be very large!

HDFS at Facebook (May ‘10)

•21 PB of storage in a single HDFS cluster containing 2000 machines

- 1200 machines with 8 cores + 800 machines with 16 cores

•12 TB per machine (some have 24 TB)

(34)

HDFS Architecture

[Figure: HDFS architecture. Labels: Client; Namenode holding Metadata (Name, replicas, ..), e.g. (/home/foo/data,6. ..; Metadata ops; Block ops; Read; Write; Replication; Blocks stored on Datanodes in Rack1 and Rack2.]

(35)

Typical Hadoop Cluster

• 40 nodes/rack, 1000-4000 nodes in cluster

• 1 Gbps within a rack; 8 Gbps between racks

Aggregation switch

Rack switch

(36)

The HDFS NameNode

•A single NameNode manages the file system metadata and regulates access to files by clients.

- FileName->[BlockIds]

- BlockIds->[replica locations]

•Controls replication of blocks to DataNodes

- Listens for Heartbeats from DataNodes

- Signals creating, opening, closing of blocks

- Load balancing, rack-aware distribution

•It is a single point of failure (as of May ‘12).
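The two metadata mappings listed above can be pictured as a pair of dictionaries. The class below is a toy model for illustration only; it is not HDFS code, and the class, method, block and DataNode names are all invented.

```python
class NameNodeMetadata:
    """Toy in-memory model of the two NameNode mappings."""

    def __init__(self):
        self.file_blocks = {}      # FileName -> [BlockIds]
        self.block_replicas = {}   # BlockId  -> [replica locations]

    def add_block(self, filename, block_id, datanodes):
        """Record a new block of a file and where its replicas live."""
        self.file_blocks.setdefault(filename, []).append(block_id)
        self.block_replicas[block_id] = list(datanodes)

    def locate(self, filename):
        """Return [(block_id, [datanodes])] in file order, as a reader needs."""
        return [(b, self.block_replicas[b]) for b in self.file_blocks[filename]]

    def datanode_failed(self, datanode):
        """Drop a dead DataNode; return blocks that now need re-replication."""
        under_replicated = []
        for block_id, nodes in self.block_replicas.items():
            if datanode in nodes:
                nodes.remove(datanode)
                under_replicated.append(block_id)
        return under_replicated


meta = NameNodeMetadata()
meta.add_block("/home/foo/data", "blk_1", ["dn1", "dn2", "dn3"])
meta.add_block("/home/foo/data", "blk_2", ["dn2", "dn3", "dn4"])
print(meta.locate("/home/foo/data"))
print(meta.datanode_failed("dn1"))  # ['blk_1']
```

A missed heartbeat would trigger something like `datanode_failed`, after which the NameNode instructs other DataNodes to copy the under-replicated blocks.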

(37)

HDFS DataNodes

•DataNodes store a set of blocks

- Files are split into one or more blocks (64MB default).

•NameNode sends instructions to DataNodes for block creation, deletion, and replication.
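How a file maps onto fixed-size blocks is a short calculation. The sketch below uses the 64 MB default from the slide and assumes 1 MB = 2^20 bytes; the function name is illustrative.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size from the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    count = math.ceil(file_size / block_size)
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(count)]


blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks))                              # 4
print(blocks[-1][1] // (1024 * 1024))           # 8 (MB in the last, partial block)
```

Only the last block is partial, so a 200 MB file occupies three full blocks plus one 8 MB block.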

(38)

HDFS Client

•Read and write HDFS data

•Create checksum files for files in HDFS.

- Recompute the checksum when a file has been read.
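The write-then-verify pattern can be sketched with a CRC32 from the Python standard library. CRC32 is chosen here only for illustration, and the function names are assumptions; the sketch does not reproduce the actual HDFS client code.

```python
import zlib


def write_with_checksum(data: bytes):
    """Return the data together with a CRC32 checksum computed at write time."""
    return data, zlib.crc32(data)


def verify_on_read(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum after reading and compare with the stored one."""
    return zlib.crc32(data) == stored_checksum


block, checksum = write_with_checksum(b"some block contents")
print(verify_on_read(block, checksum))                    # True
print(verify_on_read(b"some block c0ntents", checksum))   # False
```

If verification fails, a client can fetch the same block from another replica.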

(39)

Related Filesystems

•Google File System (GFS)

- http://labs.google.com/papers/gfs.html

•Kosmos File System (KFS)

- Open source, Scales better than HDFS

• C++ implementation

• Master polls data nodes

• Supports writing to multiple arbitrary positions in files and file

(40)

(41)

Democratization of Large-Scale Computing

•Cloud computing is the delivery of hosting services that are provided to a client over the Internet.

- Enable large-scale services without up-front investment.

•New programming tools, databases and systems have enabled the low-cost construction of large-scale services.

(42)

NIST Definition of Cloud Computing

"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."

(43)

Supporting Technologies

•Enormous computer data-centres containing commodity hardware.

•Virtualization of computation, storage, and communication.

- Turn hardware and networking into software!

•Achieve economies of scale.

- Reduce costs of electricity, bandwidth, hardware, software and use low-cost locations.

- Lower-cost than provisioning own hardware.

•NoSQL datastores have enabled storage scalability

(44)

Cloud Computing Essentials

•Cloud computing is Utility Computing

- Cloud services are controlled and monitored by the cloud provider typically through a pay-per-use business model.

•An ideal cloud computing platform is:

- efficient in its use of resources

- scalable and elastic

- self-managing

- highly available and accessible

- inter-operable and portable

(45)

Cloud Properties

•Resource efficiency: computing and network resources are pooled to provide services to multiple users. Resource allocation is dynamically adapted according to user demand.

•Elasticity: computing resources can be rapidly and elastically provisioned to scale up, and released to scale down based on consumer’s demand.

(46)

Cloud Properties

•Self-managing services: a consumer can provision cloud services, such as web applications, server time, processing, storage and network as needed and automatically without requiring human interaction with each service’s provider.

•Accessible and highly available: cloud resources are available over the network anytime and anywhere and are accessed through standard mechanisms that promote use by different types of platform (e.g., mobile phones, laptops, and PDAs).

(47)

IaaS, PaaS and SaaS

•Infrastructure as a Service (IaaS)

•Platform as a Service (PaaS)

•Software as a Service (SaaS)

IaaS: Infrastructure (Servers · Storage · Network)

PaaS: Platform (OS & Application Stack) on top of Infrastructure (Servers · Storage · Network)

SaaS: Packaged Software (Applications) on top of Platform (OS & Application Stack) on top of Infrastructure (Servers · Storage · Network)

(48)

Spectrum of Cloud Users

Image credit:

http://blogs.msdn.com/b/seliot/archive/2010/03/04/what-the-heck-is-cloud-computing-another-re-look-with-pretty-pictures.aspx

(49)

Infrastructure as a Service (IaaS)

•Virtualization

- Virtualization is the abstraction of logical resources away from underlying physical resources.

•A hypervisor virtualizes a platform’s operating system.

• The hypervisor manages OSes as virtual machines (VMs), enabling multiple OSes to share the same physical hardware.

(50)

(51)

KVM (Kernel-based Virtual Machine)

•VMware and Xen are the best-known virtualization platforms.

•KVM (Kernel-based Virtual Machine) is an open-source virtualization platform

- Linux host OS

• Run multiple virtual machines (Windows, Mac OS, etc.) on your Linux box

- IO is virtualized using a device model in KVM

• KVM requires a modified QEMU (open-source processor emulator) for

(52)

Virtualization using KVM in Linux

•KVM is a loadable kernel module

- kvm.ko

• provides the core virtualization infrastructure

- kvm-intel.ko / kvm-amd.ko

(53)

IO Device Model in KVM

•Original approach with full-virtualization

- Guest hardware accesses are intercepted by KVM

- QEMU emulates hardware behavior of common devices

• Video cards

• PCI

• Input devices (mouse, keyboard)

(54)

IaaS is Not Enough

•IaaS provides virtual machines, but it cannot provide elastic computing by itself, where services scale up and down to meet user demand.

- Dynamic provisioning

•Existing IaaS offerings do not provide support for sharing middleware platforms among different VMs

(55)

From IaaS to PaaS

Traditional Stack: Networking, Storage, Servers, OS, Middleware, Runtime, Data, Applications (all under "You Manage")

IaaS: Networking, Storage, Servers, Virtualization, OS, Middleware, Runtime, Data, Applications (split between "You Manage" and "Provider Manages")

PaaS: Networking, Storage, Servers, Virtualization, OS, Middleware, Runtime, Data, Applications (split between "You Manage" and "Provider Manages")

(56)

PaaS

•Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity.

•PaaS leverages dynamic provisioning

(57)

Under-Provisioning

•In traditional computing, underestimating system utilization results first in lost revenue, then lost customers

[Figure: three plots of Resources against Time (days 1 to 3) comparing Demand with Capacity; the shortfall is labelled Lost Revenue and then Lost Users.]

(58)

Over-Provisioning

•Overestimating system utilization results in higher than necessary infrastructure costs

[Figure: plot of Resources against Time comparing Demand with Capacity; the gap between them is labelled Unused resources.]

Dynamically provisioning resources solves the under-/over-provisioning problem.
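A small simulation shows why. The sketch below compares a fixed server pool with a pool resized every interval to match demand plus some headroom. The demand series, the 100 requests/sec per server, the 20% headroom and the function names are all assumptions chosen for illustration.

```python
def provision(demand, per_server_capacity, headroom_pct=20, min_servers=1):
    """Servers needed to serve demand with headroom_pct spare capacity."""
    numerator = demand * (100 + headroom_pct)
    denominator = 100 * per_server_capacity
    needed = (numerator + denominator - 1) // denominator   # integer ceiling
    return max(min_servers, needed)


def compare(demand_series, per_server_capacity, fixed_servers):
    """Total unserved demand and unused capacity, fixed vs. dynamic pools."""
    report = {"fixed": [0, 0], "dynamic": [0, 0]}   # [unserved, unused]
    for demand in demand_series:
        pools = (("fixed", fixed_servers),
                 ("dynamic", provision(demand, per_server_capacity)))
        for name, servers in pools:
            capacity = servers * per_server_capacity
            report[name][0] += max(0, demand - capacity)
            report[name][1] += max(0, capacity - demand)
    return report


# Hypothetical requests/sec over six intervals; each server handles 100 req/s.
print(compare([120, 300, 900, 1500, 600, 200], 100, fixed_servers=8))
# {'fixed': [800, 1980], 'dynamic': [0, 980]}
```

In this example the fixed pool both drops requests at the peak (under-provisioning) and idles at the troughs (over-provisioning), while the resized pool serves all demand with about half the unused capacity.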

(59)

Dynamic Provisioning

•Cloud computing enables server computing instances to be provisioned or deployed from an administrative console or client application by the server administrator, network administrator, or any other enabled user.

•Self-managing systems perform dynamic provisioning on behalf of a user or administrator in order to ensure quality of service (QoS) contracts are not broken and/or to meet some policy

(60)

How do we reuse middleware services?

•Imagine a single physical machine that is currently running 10 virtual machines (VMs), where each VM runs 5 active Java programs.

•Assuming no virtualized application server, how many JVM processes are running on the physical machine?

(61)

Multi-Tenancy

•Multi-tenancy is where a single instance of the software runs on a server, serving multiple clients.

- Think multiple users in a MySQL database

- Java 8 will support multi-tenancy (many Java programs running in the same JVM)

•The software should be able to provide a single service to all customers by setting configurations
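A minimal sketch of the idea: one service object is shared by all tenants, while per-tenant configuration and data are kept separate. The class, tenant names and the `max_items` setting are hypothetical and serve only to illustrate the pattern.

```python
class TenantAwareService:
    """One service instance shared by several tenants (illustrative only)."""

    def __init__(self, tenant_configs):
        self.tenant_configs = tenant_configs          # per-tenant settings
        self.data = {t: [] for t in tenant_configs}   # per-tenant isolated data

    def add_item(self, tenant, item):
        """Store an item for a tenant, enforcing that tenant's own quota."""
        limit = self.tenant_configs[tenant]["max_items"]
        if len(self.data[tenant]) >= limit:
            raise RuntimeError(f"{tenant} exceeded its quota of {limit} items")
        self.data[tenant].append(item)

    def list_items(self, tenant):
        """Return a copy of the tenant's items; other tenants are invisible."""
        return list(self.data[tenant])


service = TenantAwareService({"acme": {"max_items": 2},
                              "globex": {"max_items": 100}})
service.add_item("acme", "invoice-1")
service.add_item("globex", "report-7")
print(service.list_items("acme"))   # ['invoice-1']
```

Compared with running one copy of the software per customer, only the configuration and data differ per tenant; the running instance is shared.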

(62)

(63)

(64)

Deployment Model

•There are four primary cloud deployment models:

- Public Cloud

- Private Cloud

- Community Cloud

- Hybrid Cloud

(65)

Public Clouds

•Public clouds are owned by cloud service providers who charge for the use of cloud resources.

•Basic characteristics:

- Homogeneous infrastructure, Common policies

- Shared resources and multi-tenancy

- Leased or rented infrastructure

- Economies of scale

•EC2 (Amazon) Elastic Compute Cloud. General purpose computing.

•Azure (Microsoft) General purpose computing on a Microsoft platform.

•AppEngine (Google) Build scalable web applications

(66)

Amazon, Microsoft, Google Cloud Offerings

Computation model (VM)

- Amazon Web Services: x86 ISA via Xen VM

- Microsoft Azure: Microsoft CLR VM

- Google AppEngine: Predefined application structure and framework

Storage model

- Amazon Web Services: SimpleDB, S3

- Microsoft Azure: SQL Data Services

- Google AppEngine: MegaStore/BigTable

Networking model

- Amazon Web Services: Declarative specification of IP level topology

- Microsoft Azure: Automatic based on programmer’s declarative descriptions of app components

- Google AppEngine: Fixed topology to accommodate 3-tier Web app structure

(67)

Private Clouds

•The cloud infrastructure belongs to and is operated by only one organization.

•Basic characteristics :

- Heterogeneous infrastructure; Customized policies

- Dedicated resources

- In-house infrastructure; End-to-end control

(68)

Public vs. Private

                 Public Cloud              Private Cloud
Infrastructure   Homogeneous               Heterogeneous
Policy Model     Common defined            Customized & Tailored
Resource Model   Shared & Multi-tenant     Dedicated
Cost Model       Operational expenditure   Capital expenditure

(69)

Other types of Clouds

•Community cloud

- The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations).

•Hybrid cloud

- The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.

(70)

Obstacles To Cloud Computing

•Data Lock-in

•No standardized APIs.

- OpenStack API

•Data Confidentiality/Auditability

•Data transfer bottlenecks/costs

(71)

(72)

Challenges for Cloud computing

•Will all data migrate to the cloud?

- The post-PC era.

•By 2015 some 6.3 exabytes of mobile data will be flowing each month*. How much will end up in the cloud?

•What will we do with all the new data generated by the Internet of Things and DNA sequencing machines?

•How will we manage security, ownership and migration of data stored in the cloud?

(73)

References

•NIST (National Institute of Standards and Technology). http://csrc.nist.gov/groups/SNS/cloud-computing/

•Dean et al., MapReduce: simplified data processing on large clusters, Comms of ACM, vol 51(1), 2008.

•Armbrust et al., “Above the Clouds: A Berkeley View of Cloud Computing”

(74)

References

•“Cloud Computing: Principles and Paradigms,” R. Buyya et al. (eds.), Wiley, 2010.

•“Cloud Computing: Principles, Systems and Applications,” L. Gillam et al. (eds.) Springer, 2010.

•Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters” in OSDI 2004

•Sanjay Ghemawat et al.: “The Google File System”. SIGOPS Operating Systems Review 37(5), 2003

•M. Isard et al.: “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks” in