ID2210: Lecture 11 Jim Dowling
Large-Scale Distributed Computing
•In #Nodes
- BitTorrent (millions) - Peer-to-Peer
•In #Instructions/sec
- Teraflops, Petaflops, Exascale - Super-Computing
•In #Bytes stored
- Facebook: 30+ Petabytes (July ’11)* - NoSQL storages
•In #Bytes processed/time
- Google processes 24 petabytes of data per day - ?
*http://www.computerworld.com/s/article/9218752/Facebook_moves_30_petabyte_Hadoop_cluster_to_new_data_center
Big Data
“… The total amount of digital storage worldwide is
approaching 1 zettabyte, or 1 million times the contents of the Earth’s largest library. Currently that information is archived on equipment with a mass equivalent to 20 percent of Manhattan. Global data
storage is expected to reach 35 zettabytes by 2020 …”
The Boston Globe Editorial & Opinion September 7, 2010
Programming Large-Scale
•With thousands of servers available within a data centre, how do we:
- write applications for them?
- allocate and manage resources?
•Applications should also be scalable, reliable, and highly available.
- Failures are expected with thousands of machines.
- Need for load-balancing, handling heterogeneity.
Commodity Computing Challenges
• Cheap nodes fail, especially if you have many
- Mean time between failures for 1 node = 3 years
- Mean time between failures for 1000 nodes = 1 day
- Solution: Build fault-tolerance into system
• Commodity networks have low(ish) bandwidth
- Scan 100TB Datasets on 1000 node cluster with remote storage @ 10MB/s = 165 mins
- Solution: Push computation to the data
• Programming distributed systems is hard
- Solution: Provide a simple to use data-parallel programming model that distributes work and handles faults.
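The two figures above (cluster MTBF and scan time) can be checked with a short back-of-the-envelope sketch. It assumes independent node failures at a constant rate, data split evenly across nodes, and decimal units (1 TB = 10^12 bytes); the function names are illustrative only.

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster,
    assuming independent nodes with a constant failure rate."""
    return node_mtbf_years * 365 / nodes


def scan_minutes(dataset_bytes: float, nodes: int, bytes_per_sec_per_node: float) -> float:
    """Time to scan a dataset split evenly across nodes that read in parallel."""
    per_node_bytes = dataset_bytes / nodes
    return per_node_bytes / bytes_per_sec_per_node / 60


TB, MB = 10**12, 10**6  # decimal units (an assumption)

mtbf = cluster_mtbf_days(node_mtbf_years=3, nodes=1000)
scan = scan_minutes(100 * TB, nodes=1000, bytes_per_sec_per_node=10 * MB)
print(f"cluster MTBF: {mtbf:.2f} days")
print(f"scan time: {scan:.0f} minutes")
```

Under these assumptions the cluster sees a failure roughly once a day, and the scan takes about 167 minutes, close to the 165 minutes quoted on the slide.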
Typical Large-Scale Programming Problem
•Iterate over a large number of records
•Extract something of interest from each
•Shuffle and sort intermediate results
•Aggregate intermediate results
•Generate final output
Key idea: provide a functional abstraction for these two operations: Map (extract something of interest from each record) and Reduce (aggregate intermediate results)
Programming in the Large
•Imagine we have hundreds of thousands of documents structured as follows:
{
  "type": "blog",
  "id": "564",
  "tags": ["hdfs", "mysql", "cats"],
  "content": "<div>...</div>",
  "mentions": [ { "google": 6, "apple": 11, "microsoft": 1 } ]
}
Pseudocode for Map Phase
•Extract the blog_id and the number of mentions of google from each document
def mapper(docs):
    foreach (blog in docs.blogs):
        # emit stands for the framework's output call (assumed name)
        emit(blog.id, blog.mentions.google)
Pseudocode for Reduce Phases
•Sum up all the google mentions from the same blog_id
def reducer(blog_id, mentions_google):
    output(agg_id, sum(mentions_google))
•Sum up all the google mentions from all blogs
def reducer(agg_id, mentions_google):
    output(sum(mentions_google))
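As a hedged illustration, the map phase and the two reduce phases can be simulated in a single Python process. The sample documents, the AGG_ID constant and the shuffle helper are assumptions made for this sketch; a real framework would perform the grouping across machines.

```python
from collections import defaultdict

# Hypothetical sample documents following the structure shown earlier.
docs = [
    {"type": "blog", "id": "564", "mentions": [{"google": 6, "apple": 11}]},
    {"type": "blog", "id": "564", "mentions": [{"google": 2}]},
    {"type": "blog", "id": "777", "mentions": [{"google": 5, "microsoft": 1}]},
]

AGG_ID = "all_blogs"  # hypothetical constant key for the global total


def mapper(doc):
    """Emit (blog_id, number of google mentions) for one document."""
    count = sum(m.get("google", 0) for m in doc["mentions"])
    yield doc["id"], count


def reducer_per_blog(blog_id, counts):
    """First reduce phase: total per blog_id."""
    yield blog_id, sum(counts)


def reducer_total(agg_id, counts):
    """Second reduce phase: total over all blogs."""
    yield agg_id, sum(counts)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


mapped = [pair for doc in docs for pair in mapper(doc)]
per_blog = dict(kv for k, vs in shuffle(mapped).items() for kv in reducer_per_blog(k, vs))
regrouped = shuffle((AGG_ID, n) for n in per_blog.values())
total = dict(kv for k, vs in regrouped.items() for kv in reducer_total(k, vs))
print(per_blog, total)
```

In this sketch the per-blog totals are re-keyed under AGG_ID between the two phases, which plays the role of the first reducer writing its output under agg_id.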
* Slide taken from tutorial by Jerry Zhao and Jelena Pjesivac-Grovic (Google Inc.):
“MapReduce – The Programming Model and Practice”. Tutorial held at SIGMETRICS 2009.
MapReduce Programming Model
•Input data type: key-value records
•Map function:
(K_in, V_in) → list(K_inter, V_inter)
•Reduce function:
(K_inter, list(V_inter)) → list(K_out, V_out)
MapReduce Programming Model
map
•Takes input records - one by one
- key, value
•Processes records - independently
•Outputs intermediate results - 1..n per input record
- key’, value’
reduce
•Takes intermediate results - groups with same key
- key’, value’[]
•Processes records - group-wise
•Outputs result - per group
- any format
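Grouping intermediate records by key across many workers needs a rule for which reduce worker receives which key. The default described in Dean and Ghemawat's MapReduce paper is hash(key) mod R, with R the number of reduce partitions. A minimal sketch follows; R = 3, the sample records, and the use of CRC32 as the hash are choices made for this illustration.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # R, a hypothetical value for this sketch


def partition(key: str, num_reducers: int) -> int:
    """Pick the reduce partition for a key: hash(key) mod R.
    crc32 is used because Python's built-in hash() of str varies per process."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key', value') pair to its partition and group values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)][key].append(value)
    return partitions


intermediate = [("hdfs", 1), ("cats", 1), ("hdfs", 1), ("mysql", 1), ("cats", 1)]
for i, groups in enumerate(shuffle(intermediate, NUM_REDUCERS)):
    print(f"partition{i + 1}: {dict(groups)}")
```

Because the partition depends only on the key, all values for one key end up at the same reduce worker, which is what makes group-wise processing possible.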
MapReduce Workflow
[Figure: MapReduce workflow. Workers read the input files in the map phase and write intermediate output split into partition1, partition2 and partition3; workers in the reduce phase read the partitions and write the output file(s).]
MapReduce Basics
•MapReduce is a programming model (and framework) that hides the complexity of work distribution and fault tolerance
•Principal design philosophies:
- Near-linear scalability for data sets
- Low cost – reduce hardware, programming and admin costs
•MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time
MapReduce-Like Implementations
             | Google MR   | Hadoop                                   | Dryad
Availability | Proprietary | Open Source                              | Proprietary
Used by      | Google      | Yahoo!, Facebook, Amazon (EC2!), Twitter | Microsoft
Load Balancing, Failure, and Stragglers
•Load Balancing
- Break a MapReduce job into small tasks
- Schedule tasks on workers as they report idle status
•MapReduce functions are side-effect free
- Enables failed (and partially completed) tasks to be re-executed without any problems (on a different machine)
- When a worker fails, its tasks can be reallocated to other workers
•Identify and handle stragglers (slow workers)
- Restart slow tasks on new workers
- Stragglers become more likely as the number of workers increases
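Why stragglers become more likely with more workers can be seen from a simple probability sketch. It assumes each worker is slow independently with the same probability; the 1% figure is a made-up value for illustration.

```python
def p_any_straggler(p_slow: float, workers: int) -> float:
    """Probability that at least one of `workers` independent workers is slow."""
    return 1 - (1 - p_slow) ** workers


P_SLOW = 0.01  # hypothetical chance that a single worker is slow

for n in (10, 100, 1000):
    print(f"{n:5d} workers: P(at least one straggler) = {p_any_straggler(P_SLOW, n):.3f}")
```

Since a job finishes only when its slowest task finishes, a near-certain straggler at 1000 workers is the reason slow tasks are restarted on new workers.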
Components in a Hadoop MR Workflow
Pig Latin: a relational data-flow language
•MapReduce programs are quite low-level.
•A higher-level programming model was needed for processing semi-structured data sets using the
MapReduce platform
•Pig Latin is a procedural, relational data-flow
language that is implemented using MapReduce
•Pig Engine: Parser, Optimizer and distributed query
Pig Example*
•Input: User profiles, Page visits
•Problem: Find the top 5 most visited pages by users aged 18-25
[Figure: dataflow. Load Users → Filter by Age; Load Pages; both feed Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
*Example taken from Yahoo Hadoop tutorial from Middleware 2009
In PigLatin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
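For comparison, the same dataflow can be sketched on small in-memory lists in plain Python. The sample users and page visits are invented for illustration, and the join assumes user names are unique; this is not how the Pig engine executes the script.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 19)]
pages = [("alice", "/a"), ("carol", "/a"), ("alice", "/b"), ("bob", "/b"), ("carol", "/c")]

# Filtered = filter Users by age >= 18 and age <= 25
filtered = {name for name, age in users if 18 <= age <= 25}

# Joined = join Filtered by name, Pages by user
joined = [(user, url) for user, url in pages if user in filtered]

# Grouped / Summed = count clicks per url
clicks = Counter(url for _, url in joined)

# Sorted / Top5 = order by clicks desc, then take the first 5
top5 = clicks.most_common(5)
print(top5)
```

Each Pig Latin statement maps onto one step here, which is the sense in which Pig Latin is a procedural data-flow language.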
Pig Latin Data Types
•Tuple: Ordered set of fields
- Field can be simple or complex type
- Nested relational model
•Bag: Collection of tuples.
- Can contain duplicates
•Map: Set of (key, value) pairs
•Primitive types
Hive and HBase
•Hive and Pig were parallel projects developed at Facebook and Yahoo, respectively.
•HiveQL is closer to SQL from traditional RDBMSs than Pig Latin (which is procedural).
- Due to the limitations of MapReduce, and the fact that
HiveQL is compiled into a MapReduce query plan, HiveQL is a cut-down version of SQL
•HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable.
•A lot of Hive programs run over HBase.
MapReduce Filesystem Requirements
•MapReduce jobs run on data stored in files
•Support large files
- Streaming reads
- Mostly append to end (easier concurrency)
•Scalability
- Add machines to scale
•Workers’ tasks use data on their local machine
- Bandwidth is the bottleneck. Move code to data.
•Expect failures
Hadoop Filesystem (HDFS)
•Supports huge data sets and large files
- Gigabyte files, petabyte data sets
•Supports tens of millions of files in a file system
•Files have write-once-read-many semantics
- Clients can only append to existing files
•Designed to run on COTS hardware
- Implemented in Java
•Timely detection and recovery from data node faults
HDFS Clusters can be very large!
HDFS at Facebook (May ‘10)
•21 PB of storage in a single HDFS cluster containing 2000 machines
- 1200 machines with 8 cores + 800 machines with 16 cores
•12 TB per machine (some have 24 TB)
HDFS Architecture
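[Figure: HDFS architecture. A Client sends metadata ops to the Namenode, which holds the metadata (name, replicas, ...), e.g. (/home/foo/data, 6, ...). The Client reads and writes blocks on Datanodes in Rack1 and Rack2; the Namenode issues block ops to the Datanodes, and blocks are replicated between them.]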
Typical Hadoop Cluster
•40 nodes/rack, 1000-4000 nodes in cluster
•1 Gbps within a rack; 8 Gbps between racks
[Figure: rack switches connected to an aggregation switch]
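The bandwidth figures hint at why computation is moved to the data. A rough sketch of the oversubscription, under the assumption (not stated on the slide) that each node has a 1 Gbps link to its rack switch and each rack has an 8 Gbps uplink to the aggregation switch:

```python
def oversubscription(nodes_per_rack: int, node_gbps: float, uplink_gbps: float) -> float:
    """Ratio of total node bandwidth in a rack to the rack's uplink bandwidth."""
    return nodes_per_rack * node_gbps / uplink_gbps


NODES_PER_RACK = 40   # from the slide
NODE_GBPS = 1.0       # assumed per-node link speed
UPLINK_GBPS = 8.0     # assumed per-rack uplink speed

ratio = oversubscription(NODES_PER_RACK, NODE_GBPS, UPLINK_GBPS)
cross_rack_share = UPLINK_GBPS / NODES_PER_RACK
print(f"oversubscription: {ratio:.0f}:1")
print(f"cross-rack bandwidth per node if all nodes send at once: {cross_rack_share:.1f} Gbps")
```

Under these assumptions, traffic that leaves the rack gets about a fifth of the bandwidth available inside it, so reading from a local or same-rack replica is preferable.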
The HDFS NameNode
•A single NameNode manages the file system metadata and regulates access to files by clients.
- FileName->[BlockIds]
- BlockIds->[replica locations]
•Controls replication of blocks to DataNodes
- Listens for Heartbeats from DataNodes
- Signals creating, opening, closing of blocks
- Load balancing, rack-aware distribution
•It is a single point of failure (as of May ‘12).
HDFS DataNodes
DataNodes store a set of blocks
Files are split into one or more blocks (64MB default).
NameNode sends instructions to DataNodes for
block creation, deletion, and replication.
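The two NameNode mappings and the block splitting can be sketched with plain dictionaries. This is an illustration only, not HDFS code: the replication factor of 3, the DataNode names and the round-robin placement (which ignores rack awareness) are assumptions.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size from the slide
REPLICATION = 3                # assumed replication factor


def num_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks needed for a file; the last block may be partial.
    Returns at least 1, following the slide's 'one or more blocks'."""
    return max(1, math.ceil(file_size / block_size))


def place_file(name, file_size, datanodes, file_to_blocks, block_to_replicas):
    """Record FileName->[BlockIds] and BlockId->[replica locations].
    Replicas are assigned round-robin, a simplification for this sketch."""
    block_ids = [f"{name}#blk{i}" for i in range(num_blocks(file_size))]
    file_to_blocks[name] = block_ids
    for i, block_id in enumerate(block_ids):
        block_to_replicas[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)
        ]


file_to_blocks, block_to_replicas = {}, {}
datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
place_file("/home/foo/data", 200 * 1024 * 1024, datanodes, file_to_blocks, block_to_replicas)
print(file_to_blocks)
print(block_to_replicas)
```

A 200 MB file needs four 64 MB blocks here, and a client would first ask for the block list and replica locations, then read the blocks directly from the DataNodes.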
HDFS Client
•Read and write HDFS data
•Create checksum files for files in HDFS.
- Recompute the checksum when a file has been read.
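The checksum idea can be sketched as follows. This is a simplification: an in-memory dict stands in for the file system, one CRC32 covers the whole file, and the ".crc" naming and function names are hypothetical rather than the HDFS client API.

```python
import zlib


def write_with_checksum(store: dict, name: str, data: bytes) -> None:
    """Store the data together with a CRC32 checksum computed at write time."""
    store[name] = data
    store[name + ".crc"] = zlib.crc32(data)


def read_verified(store: dict, name: str) -> bytes:
    """Recompute the checksum on read and fail if it does not match."""
    data = store[name]
    if zlib.crc32(data) != store[name + ".crc"]:
        raise IOError(f"checksum mismatch for {name}")
    return data


store = {}
write_with_checksum(store, "/home/foo/data", b"hello hdfs")
print(read_verified(store, "/home/foo/data"))

store["/home/foo/data"] = b"hellO hdfs"  # simulate corruption on disk
try:
    read_verified(store, "/home/foo/data")
except IOError as err:
    print(err)
```

On a mismatch, a client could then fetch the block from another replica, which is one reason blocks are replicated.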
Related Filesystems
•Google File System (GFS)
- http://labs.google.com/papers/gfs.html
•Kosmos File System (KFS)
- Open source, Scales better than HDFS
• C++ implementation
• Master polls data nodes
• Supports writing to multiple arbitrary positions in files and file
Democratization of Large-Scale Computing
•Cloud computing is the
delivery of hosting
services that are provided to a client over the
Internet.
- Enable large-scale services without up-front investment.
•New programming tools, databases and
systems have enabled the
low-cost construction of large-scale services.
NIST Definition of Cloud Computing
"Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of
configurable computing resources (e.g., networks,
servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
Supporting Technologies
•Enormous computer data-centres containing
commodity hardware.
•Virtualization of computation, storage, and
communication.
- Turn hardware and networking into software!
•Achieve economies of scale.
- Reduce costs of electricity, bandwidth, hardware, software and use low-cost locations.
- Lower-cost than provisioning own hardware.
•NoSQL datastores have enabled storage scalability
Cloud Computing Essentials
•Cloud computing is Utility Computing
- Cloud services are controlled and monitored by the cloud provider typically through a pay-per-use business model.
•An ideal cloud computing platform is:
- efficient in its use of resources
- scalable and elastic
- self-managing
- highly available and accessible
- inter-operable and portable
Cloud Properties
•Resource efficiency: computing and network
resources are pooled to provide services to multiple users. Resource allocation is dynamically adapted according to user demand.
•Elasticity: computing resources can be rapidly and
elastically provisioned to scale up, and released to scale down based on consumer’s demand.
Cloud Properties
•Self-managing services: a consumer can provision
cloud services, such as web applications, server time, processing, storage and network as needed and automatically without requiring human
interaction with each service’s provider
•Accessible and highly available: cloud resources
are available over the network anytime and anywhere and are accessed through standard
mechanisms that promote use by different types of platform (e.g., mobile phones, laptops, and PDAs).
IaaS, PaaS and SaaS
•Infrastructure as a Service (IaaS)
•Platform as a Service (PaaS)
•Software as a Service (SaaS)
IaaS: Infrastructure (Servers · Storage · Network)
PaaS: Platform (OS & Application Stack) on top of Infrastructure (Servers · Storage · Network)
SaaS: Packaged Software Applications on top of Platform (OS & Application Stack) on top of Infrastructure (Servers · Storage · Network)
Spectrum of Cloud Users
Image credit:
http://blogs.msdn.com/b/seliot/archive/2010/03/04/what-the-heck-is-cloud-computing-another-re-look-with-pretty-pictures.aspx
Infrastructure as a Service (IaaS)
•Virtualization
- Virtualization is the abstraction of logical resources away from underlying physical resources.
•A hypervisor virtualizes a platform’s operating system.
• The hypervisor manages OSes as virtual machines (VMs), enabling
multiple OSes to share the same physical hardware.
KVM (Kernel-based Virtual Machine)
•VMware and Xen are the best-known virtualization platforms.
•KVM (Kernel-based Virtual Machine) is an open-source virtualization platform
- Linux host OS
• Run multiple virtual machines (Windows, Mac, etc.) on your Linux box
- IO is virtualized using a device model in KVM
• KVM requires a modified QEMU (open-source processor emulator) for
Virtualization using KVM in Linux
•KVM is a loadable kernel module
- kvm.ko
• provides the core virtualization infrastructure
- kvm-intel.ko / kvm-amd.ko
IO Device Model in KVM
•Original approach with full virtualization
- Guest hardware accesses are intercepted by KVM
- QEMU emulates hardware behavior of common devices
• Video cards
• PCI
• Input devices (mouse, keyboard)
IaaS is Not Enough
•IaaS provides virtual machines, but it cannot provide
elastic computing by itself, where services scale up
and down to meet user demand.
- Dynamic provisioning
•Existing IaaS offerings do not support sharing middleware platforms among different VMs
From IaaS to PaaS
[Figure: three stacks compared by who manages each layer ("You Manage" vs. "Provider Manages"). Traditional stack: Networking, Storage, Servers, OS, Middleware, Runtime, Data, Applications. IaaS and PaaS stacks: Networking, Storage, Servers, Virtualization, OS, Middleware, Runtime, Data, Applications.]
PaaS
•Platform as a Service (PaaS) is a computing
platform that abstracts the infrastructure, OS, and middleware to drive developer productivity.
•PaaS leverages dynamic provisioning
Under-Provisioning
•In traditional computing, underestimating system utilization results first in lost revenue, then lost customers
[Figure: three plots of Resources against Time (days), each comparing Demand with Capacity; labels: Lost Revenue, Lost Users]
Over-Provisioning
•Overestimating system utilization results in higher than necessary infrastructure costs
•Dynamically provisioning resources solves the under-/over-provisioning problem.
[Figure: plot of Resources against Time comparing Demand with Capacity; the gap between them is labelled Unused resources]
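The trade-off between under-provisioning, over-provisioning and dynamic provisioning can be sketched numerically. The demand series, the fixed capacities and the 10-unit headroom are invented values for illustration.

```python
def provisioning_gap(demand, capacity):
    """Return (unmet demand, unused capacity), each summed over all time steps."""
    unmet = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    unused = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    return unmet, unused


demand = [40, 90, 120, 60, 30]       # hypothetical load per time step
fixed_low = [80] * len(demand)       # fixed capacity below the peak
fixed_high = [120] * len(demand)     # fixed capacity sized for the peak
dynamic = [d + 10 for d in demand]   # capacity tracks demand with some headroom

for label, capacity in [("under", fixed_low), ("over", fixed_high), ("dynamic", dynamic)]:
    unmet, unused = provisioning_gap(demand, capacity)
    print(f"{label:8s} unmet={unmet:4d} unused={unused:4d}")
```

With these numbers, the low fixed capacity leaves demand unserved at the peak, the high fixed capacity leaves most resources idle, and the demand-tracking capacity avoids both.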
Dynamic Provisioning
•Cloud computing enables server computing
instances to be provisioned or deployed from an
administrative console or client application by the server administrator, network administrator, or any other enabled user.
•Self-managing systems perform dynamic
provisioning on behalf of a user or administrator in order to ensure quality of service (QoS) contracts are not broken and/or to meet some policy
How do we reuse middleware services?
•Imagine a single physical machine that is currently running 10 virtual machines (VMs), where each VM is running 5 active Java programs.
•Assuming no virtualized application server, how many JVM processes are running on the physical machine?
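A worked answer, assuming each active Java program runs in its own JVM process. The comparison with one shared JVM per VM is a hypothetical scenario added for contrast.

```python
vms = 10
java_programs_per_vm = 5

# Without sharing: one JVM process per Java program.
jvm_processes = vms * java_programs_per_vm

# Hypothetical alternative: programs in a VM share a single JVM.
jvm_processes_shared = vms

print(jvm_processes, jvm_processes_shared)
```

Each JVM carries its own runtime overhead, so the unshared case repeats that cost for every program on the machine.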
Multi-Tenancy
•Multi-tenancy is where a single instance of the
software runs on a server, serving multiple clients.
- Think multiple users in a MySQL database
- Java 8 will support multi-tenancy (many Java programs running in the same JVM)
•The software should be able to provide a single service to all customers by setting configurations
Deployment Model
•There are four primary cloud deployment models:
- Public Cloud
- Private Cloud
- Community Cloud
- Hybrid Cloud
Public Clouds
•Public clouds are owned by cloud service providers who charge for the use of cloud resources.
•Basic characteristics:
- Homogeneous infrastructure, Common policies
- Shared resources and multi-tenancy
- Leased or rented infrastructure
- Economies of scale
•EC2 (Amazon) Elastic Compute Cloud. General
purpose computing.
•Azure (Microsoft) General purpose computing on a
Microsoft platform.
•AppEngine (Google) Build scalable web applications
Amazon, Microsoft, Google Cloud Offerings
                       | Amazon Web Services | Microsoft Azure | Google AppEngine
Computation model (VM) | x86 ISA via Xen VM | Microsoft CLR VM | Predefined application structure and framework
Storage model          | SimpleDB, S3 | SQL Data Services | MegaStore/BigTable
Networking model       | Declarative specification of IP level topology | Automatic based on programmer’s declarative descriptions of app components | Fixed topology to accommodate 3-tier Web app structure
Private Clouds
•The cloud infrastructure belongs to and is operated by only one organization.
•Basic characteristics :
- Heterogeneous infrastructure; Customized policies
- Dedicated resources
- In-house infrastructure; End-to-end control
Public vs. Private
               | Public Cloud            | Private Cloud
Infrastructure | Homogeneous             | Heterogeneous
Policy Model   | Common defined          | Customized & Tailored
Resource Model | Shared & Multi-tenant   | Dedicated
Cost Model     | Operational expenditure | Capital expenditure
Other types of Clouds
•Community cloud
- The cloud infrastructure is shared by several organizations and supports a specific community that has shared
concerns (e.g., mission, security requirements, policy, and compliance considerations).
•Hybrid cloud
- The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or
proprietary technology that enables data and application portability.
Obstacles To Cloud Computing
•Data Lock-in
•No standardized APIs.
- OpenStack API
•Data Confidentiality/Auditability
•Data transfer bottlenecks/costs
Challenges for Cloud computing
•Will all data migrate to the cloud?
- The post-PC era.
•By 2015 some 6.3 exabytes of mobile data will be flowing each month*. How much will end up in the cloud?
•What will we do with all the new data generated by the Internet of Things and DNA sequencing
machines?
•How will we manage security, ownership and migration of data stored in the cloud?
References
•NIST (National Institute of Standards and
Technology). http://csrc.nist.gov/groups/SNS/cloud-computing/
•Dean et al., MapReduce: simplified data processing on large clusters, Comms of ACM, vol 51(1), 2008.
•Armbrust et al., “Above the Clouds: A Berkeley View of Cloud Computing”
References
•“Cloud Computing: Principles and Paradigms,” R. Buyya et al. (eds.), Wiley, 2010.
•“Cloud Computing: Principles, Systems and
Applications,” L. Gillam et al. (eds.) Springer, 2010.
•Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters” in
OSDI 2004
•Sanjay Ghemawat et al.: “The Google File System”.
SIGOPS Operating Systems Review 37(5), 2003
•M. Isard et al.: “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks” in