CLOUD & DISTRIBUTED COMPUTING FOR BIG DATA. 周志遠助理教授清大資工系 Large-scale System Architecture (LSA) Lab

(1)

CLOUD & DISTRIBUTED

COMPUTING FOR BIG DATA

周志遠

助理教授

清大資工系

Large-scale System

(2)

Outline

• Big Data

• Cloud Computing

(3)

Videos Images Machine-Generated Data Social Data Documents Records System logs

A World Full of Data

• Amount of data in the world

• 800 Terabytes, 2000 • 160 Exabytes, 2006 • 0.8 Zettabytes, 2009 • 2.7 Zettabytes, 2012 • 35 Zettabytes by 2020

• Data generated in ONE day

• 7 TB, Twitter

• 10 TB, Facebook

35ZB = enough data to fill a stack of DVDs reaching half way to Mars

(4)

The Explosion of Data

• A increased number and variety of data sources that

generate large quantities of data

• Sensors(e.g. measurements) • Mobile devices(e.g. phone)

• Social Network (e.g. twitter, wikis) • OLTP (e.g. bank transactions)

OLTP Social NetworksScientific Devices

SOCIAL

(5)

The Explosion of Data

• Dramatic decline in the cost of HW, especially storage

• The cost reduction is in the order of about 40-45% per year – which means

it becomes half in 2 years

(6)

The Explosion of Data

• Realize data is “too valuable” to delete

• Diagnose system

• Understand user behavior

• Evaluate merchandise & products • Make business decision

(7)

What is Big Data?

• Extracting values(insight) from an immense volume, variety, velocity and veracity of data, in context, beyond what was previously possible (traditional relational DBs)

(8)

Big Data in Action

Tapping into diverse data

sets

Finding and monetizing

unknown relationships

Data driven business

decisions

(9)

Big opportunities

Improve operational effectiveness

• Machines/sensors: predict failures, network attacks

• Financial risk management: reduce fraud, increase security

Reduce data warehouse cost

• Integrate new data sources without increased database cost

• Provide online access to ‘dark data’

Drive incremental revenue

• Predict customer behavior across all channels

• Understand and monetize customer behavior

(10)

Examples: Sales

• Traditional Ad on TV

• Like it or not,

everyone sees the SAME!

• Today’s Ad on Internet

• Collect user data

• Analyze user behavior

(11)

Examples: Healthcare

• Prevent medical problem • Early diagnosis

(12)

Applications for Big Data Analytics

Homeland Security

Finance Smarter Healthcare Multi-channel

sales Telecom Manufacturing Traffic Control Trading Analytics Fraud and Risk Log Analysis Search Quality Retail: Churn, NBO

(13)

Moving from Big Data to Smart Data is a

multistep process

• “The issue is not about the volume of data but the ability

to analyze and act on data in real time.”

• http://tinyurl.com/atcanjw

(14)

(15)

Outline

• Big Data

• Cloud Computing

(16)

What is Cloud?

• Definition from Whatis.com

• The name cloud computing was inspired by the cloud symbol that's

often used to represent the Internet in flowcharts and diagrams. Cloud computing is a general term for anything that involves delivering hosted services over the Internet.

(17)

What is Cloud?

• Cloud computing is a distributed technology which

delivers hosted services over the internet to provide easy access, scalable (and elastic) IT services

(18)

Definition from NIST

•

Cloud computing is a

model

for enabling convenient,

on-demand network access

to a

shared pool

of

configurable computing resources

(e.g., networks,

servers, storage, applications, and services) that can

be

rapidly provisioned

and released with minimal

management effort or service provider interaction.

•

Five characteristics

• Broad network access • On-demand self-service • Resource pooling

• Rapid elasticity • Measured service

(19)

Cloud Computing

• In a simple term:

• A way to access and provide computing

through

Internet services

• How is that important?

• Measured



Rent

not

Buy

• On-demand



Service

not

Usage

• Elasticity



Dynamic

not

Static

(20)

Internet Services

• Web-Based Uniform Access

• HTTP, Restful API

• Enable thin client applications

• Cheap client hardware

• Diversity of end devices • Client simplicity

• Service Level Agreement (SLA)

• A contract between a service provider and a customer that specifies,

usually in measurable terms (QoS), what services provider will furnish

• Common content in contract

• Performance guarantee: Up-time and down-time ratio, System throughput, …

• Problem management detail

• Penalties for non-performance

(21)

Measured Service:

Rent

not

Buy

• Past: Buy machines before we use it

(22)

Measured Service:

Rent

not

Buy

• Now: Rent machines when we use it

(23)

On-demand:

Service

not

Usage

• Past: Users do the installation & maintenance

(24)

On-demand:

Service

not

Usage

•

Now: Users specify their requirement & needs

(25)

Rapid elasticity:

Dynamic

not

Static

•

Past: Fixed amount of resources & computing power

• Cannot be adjusted to workload variation

Reso ur ces Demand Capacity 1 2 3 Reso ur ces Demand Capacity 1 2 3 Loss Users Loss Revenue Demand Capacity Time Reso ur ces Unused resources

OR

(26)

Rapid elasticity:

Dynamic

not

Static

•

Now: Resources are provisioned dynamically

• Add or remove resources on-demand at anytime • Save cost

• Handle burst or unexpected demand

Demand Capacity Time R es o u rces Demand Capacity Time R es o u rces

(27)

Resource pooling:

Shared

not

Dedicated

•

Past: A physical machine is dedicated to a single

person or for a single purpose/usage

• With different storage space, computing power, OS,

configurations, applications, etc…

• New machine must be purchased & setup when new

service is required Web Server Email Server DB Server File Server

(28)

Resource pooling:

Shared

not

Dedicated

•

Now: A physical machine can be shared among

multiple persons or for multiple purposes/usages

• Isolated environment without interference

• Independent setup and configuration

• Resource can be allocated and distributed

Web Server Email Server DB Server File Server Web Server Email Server DB Server File Server Physical Machine Virtual Machines

(29)

Benefits From Cloud

• For the market and enterprises (IT)

•

Reduce initial investment

•

Reduce capital expenditure

•

Improve industrial specialization

•

Improve resource utilization

(30)

Benefits From Cloud

• For the end user and individuals

•

Lower computer costs

•

Improved performance

•

Instant update & free software

•

Device independence

(31)

10 Obstacles of Cloud

(32)

Data Lock-In

• Reality:

• No existing standard for Cloud API

• Customers cannot easily extract their data and programs from one site

to another

• Users are vulnerable to price increases • What if providers going out of business

• Opportunity:

• Standardize API (However, it will reduce provider’s profit) • Multi-cloud service

(33)

Data Confidential & Auditability

• Reality:

• Many data is sensitive

• Countries have their own government

laws for data privacy

• Opportunity:

(34)

Data Transfer Bottleneck

• Reality:

• The cheapest way to send a lot of data is to physically

send disks or even whole computers via overnight delivery services

• E.g.: 10TB from San Francisco to Seattle

• Through Internet: 45days, $1000USD service charge

• Through FedEx: 1night, $400USD mail fee

• Opportunity:

• No charge and small latency within a cloud platform

(E.g. between EC2 and S3 of Amazon Cloud Service)

• Provider better and more data service, such as data

(35)

Enabling Technologies

• Web services  Accessibility • Virtualization  resource sharing • Distributed computing

 scalability & elasticity

• General system issues

• Security

• Fault tolerance • Monitoring

(36)

Outline

• Big Data

• Cloud Computing

(37)

Distributed Computing

• A computer system in which several interconnected computers share the computing tasks assigned to the system

Mainframe

Mini Computer Workstation

(38)

Strengths of Distributed Computing

• Scalability

• From small cluster to Google’s datacenter

• ~900,000 servers

• 206,040 sqft (50 basketball courts)

• Using low cost commodity server

• Elasticity

• Scale-in/out instead of Scale-up/down

• Allow servers dynamic join or leave systems

• Reliability & Availability

• Achieve fault tolerance through decentralized management • Prevent data lost through replication

• Improve availability through snapshot, checkpoint, rerun • New paradigm: treat exception as norm

(39)

(40)

Outline

•

MapReduce Model: Hadoop

•

SCALA: SPARK(In-memory computing)

•

Streaming Processing: STORM

(41)

MapReduce Model

• Google MapReduce (2004)

• Jeffrey Dean et al. MapReduce: Simplified Data Processing on

Large Clusters. OSDI 2004.

• Apache Hadoop (2005)

• http://hadoop.apache.org/

• http://developer.yahoo.com/hadoop/tutorial/

• Apache Hadoop 2.0 (2012)

• Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another

Resource Negotiator, SOCC 2013.

• Separation between resource management and computation

(42)

Typical Large-Data Problem

1. Iterate over a large number of records

2. Extract something of interest from each record 3. Shuffle and sort intermediate results

4. Aggregate intermediate results 5. Generate final output

Key idea: provide

a functional abstraction

(43)

MapReduce Programming Model

• A parallel programming model (divide-conquer)

• Map: processes a key/value pair to generate a set of intermediate

key/value pairs

• Reduce: merges all intermediate values associated with the same

(44)

MapReduce Word Count Example

• User specify the map and reduce functions

Key Value

Map(String docid, String text):

for each word w in text:

Emit(w, 1);

Reduce(String term, Iterator<Int> values):

int sum = 0;

for each v in values: sum += v;

(45)

MapReduce “Runtime”

• Handles scheduling

• Assigns workers to map and reduce tasks

• Handles “data distribution”

• Moves processes to data

• Handles synchronization

• Gathers, sorts, and shuffles intermediate data

• Handles errors and faults

• Detects worker failures and restarts

(46)

Limitation of MapReduce

• Simple but limited programming model

• Only apply two computation functions in a job: Map & Reduce

More complex work must use multiple jobs

• The input and output of a job must store into a FS

(47)

Outline

•

MapReduce Model: Hadoop

•

SCALA: SPARK(In-memory computing)

(48)

Spark

•

Utilize

DSM

(Distributed Shared Memory) in

data processing to enable

in-memory

computing

•

Allow users to explicitly

cache dataset

in

memory across machines

and reuse it in

multiple MapReduce-like parallel operations

•

Retain the

scalability and

fault tolerance

(49)

Resilient Distributed Datasets (RDDs)

• Spark introduces an data abstraction called Resilient

Distributed Datasets (RDDs):

• RDD is a read-only collection of objects partitioned across a set

of machines

• RDD can be rebuilt if a partition is lost using the “lineage”

technique

• Spark is integrated into a general programming language

called “scala”

• Pure-bred O.O language: every variable/dataset is an object and

every operation is a method-call

• Seamless Java interpreter

• Programmer specify operations to transform dataset • Operations are parallelized and executed by Spark

(50)

Scala: Word Count Example

• val lines = sc.textFile(“hamlet.txt”)!

• val counts = lines.flatMap(line => line.split(“ “)).

map(word => (word, 1)). reduceByKey(_ + _)

(51)

(52)

(53)

Example: Log Mining

• Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()

Block 1 Block 2 Block 3 Worker Worker Worker Driver

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

. . . tasks results Cache 1 Cache 2 Cache 3 Base RDD _{Transformed RDD} Action

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

(54)

RDD Fault Tolerance

• RDDs maintain lineage information that can be used to

reconstruct lost partitions

messages = textFile(...).filter(_.startsWith(“ERROR”)) .map(_.split(‘\t’)(2))

HDFS File Filtered RDD Mapped RDD

filter

(func = _.contains(...)) (func = _.split(...)) map

*If lineage is too long, it could cause stack overflow (out of memory) problem during execution

(55)

Outline

•

MapReduce Model: Hadoop

•

In-memory computing: SPARK

(56)

Storm

•

Developed by BackType which was acquired by

Twitter

•

Lots of tools for data (i.e. batch) processing

• Hadoop, Pig, HBase, Hive, …

•

None of them are realtime systems which is

becoming a real requirement for businesses

•

Storm provides realtime computation

• Scalable

• Guarantees no data loss

• Extremely robust and fault-tolerant

(57)

Storm Components

• Tuple: Core unit of data (immutable set of key/value pair) • Stream: Unbounded sequence of tuples

• Spouts: Source of streams

• Bolts: Processes input streams and produces new streams

(58)

Storm Components

• Topology: Spouts and bolts execute as many tasks across

(59)

(60)

Machine Learning on Big Data

•

Mahout on Hadoop

• https://mahout.apache.org/ •

MLlib on Spark

• http://spark.apache.org/mllib/ •

GraphLab Toolkits

• http://graphlab.org/projects/toolkits.html • GraphLab Computer Vision Toolkit

(61)

Graph Processing with BSP

(Bulk Synchronous Parallel) model

• Pregel (2010)

• Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph

Processing. SIGMOD 2010. • Apache Hama (2010) • https://hama.apache.org/ • Apache Giraph (2012) • https://giraph.apache.org/ • GraphX (2013)

• Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark.

GRADES (SIGMOD workshop) 2013.

(62)

Graph Processing with BSP

(Bulk Synchronous Parallel) model

• GraphLab (2010)

• Yucheng Low et al. GraphLab: A New Parallel Framework for

Machine Learning. UAI 2010.

• Yucheng Low, et al. Distributed GraphLab: A Framework for

Machine Learning and Data Mining in the Cloud. PVLDB 2012.

• http://graphlab.org/projects/index.html

• http://graphlab.org/resources/publications.html

• PowerGraph (2012)

• Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel

(63)

(64)

(65)

(66)

(67)

Master Fork * 4 File (3*64MB) Cat Cat Cat Puppy Dog Cat Dog Cat : 4 Dog : 2 Puppy : 1 Input Map phase Reduce phase Output Intermediate files Server: 4 machines Machine#1 Map job1 Machine#2 Map job2 Machine#3 Map job3 Machine#4 Reduce Job1 Machine#2 Reduce Job2 Machine#3

Local Read Local Write

Partition function Cat:1 Cat:1 Cat:1 Dog:1 Dog:1 Cat:1 Location

Remote Read Write to _{Global FS}

Puppy:1

Sorted by key

Location

• A task is automatically re-run when worker failure occurs

• Last task is running by multiple workers

User program

(68)

Collection<T> collection; bool IsLegal(Key k); string Hash(Key);

var results = from c in collection where IsLegal(c.key)

select new { Hash(c.key), c.value};

DryadLINQ = LINQ + Dryad

C#

collection

results

C#

Vertex code Query plan (Dryad job) Data

(69)

Query on Big Data

•

Query with procedural language

•

Google Sawzall (2003)

• Rob Pike et al. Interpreting the Data: Parallel Analysis

with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 2003.

•

Apache Pig (2006)

• Christopher Olston et al. Pig Latin: A Not-So-Foreign

Language for Data Processing. SIGMOD 2008.

(70)

SQL-like Query

• Apache Hive (2007)

• Facebook Data Infrastructure Team. Hive - A Warehousing Solution

Over a Map-Reduce Framework. VLDB 2009.

• https://hive.apache.org/ • On top of Apache Hadoop

• Shark (2012)

• Reynold Xin et al. Shark: SQL and Rich Analytics at Scale.

Technical Report. UCB/EECS 2012.

• http://shark.cs.berkeley.edu/ • On top of Apache Spark

• Apache MRQL (2013)

• http://mrql.incubator.apache.org/

(71)

Pregel & Apache Giraph

• Computation Model

• Superstep as iteration

• Vertex state machine:

Active and Inactive, vote to halt

• Message passing between

vertices

• Combiners • Aggregators

• Topology mutation

• Master/worker model • Graph partition: hashing • Fault tolerance: checkpointing and confined recovery 3 6 2 1 6 6 2 6 6 6 6 6 6 6 6 6 Superstep 0 Superstep 1 Superstep 2 Superstep 3

Vote to halt Active

CLOUD & DISTRIBUTED COMPUTING FOR BIG DATA. 周 志 遠 助 理 教 授 清 大 資 工 系 Large-scale System Architecture (LSA) Lab