CLOUD & DISTRIBUTED
COMPUTING FOR BIG DATA
周志遠
助理教授
清大資工系
Large-scale System
Outline
•
Big Data
•
Cloud Computing
Videos Images Machine-Generated Data Social Data Documents Records System logs
A World Full of Data
• Amount of data in the world
• 800 Terabytes, 2000 • 160 Exabytes, 2006 • 0.8 Zettabytes, 2009 • 2.7 Zettabytes, 2012 • 35 Zettabytes by 2020
• Data generated in ONE day
• 7 TB, Twitter
• 10 TB, Facebook
35ZB = enough data to fill a stack of DVDs reaching half way to Mars
The Explosion of Data
• A increased number and variety of data sources that
generate large quantities of data
• Sensors(e.g. measurements) • Mobile devices(e.g. phone)
• Social Network (e.g. twitter, wikis) • OLTP (e.g. bank transactions)
OLTP Social NetworksScientific Devices
SOCIAL
The Explosion of Data
• Dramatic decline in the cost of HW, especially storage
• The cost reduction is in the order of about 40-45% per year – which means
it becomes half in 2 years
The Explosion of Data
• Realize data is “too valuable” to delete
• Diagnose system
• Understand user behavior
• Evaluate merchandise & products • Make business decision
What is Big Data?
• Extracting values(insight) from an immense volume, variety, velocity and veracity of data, in context, beyond what was previously possible (traditional relational DBs)
Big Data in Action
Tapping into diverse data
sets
Finding and monetizing
unknown relationships
Data driven business
decisions
Big opportunities
Improve operational effectiveness
• Machines/sensors: predict failures, network attacks
• Financial risk management: reduce fraud, increase security
Reduce data warehouse cost
• Integrate new data sources without increased database cost
• Provide online access to ‘dark data’
Drive incremental revenue
• Predict customer behavior across all channels
• Understand and monetize customer behavior
Examples: Sales
• Traditional Ad on TV
• Like it or not,
everyone sees the SAME!
• Today’s Ad on Internet
• Collect user data
• Analyze user behavior
Examples: Healthcare
• Prevent medical problem • Early diagnosis
Applications for Big Data Analytics
Homeland Security
Finance Smarter Healthcare Multi-channel
sales Telecom Manufacturing Traffic Control Trading Analytics Fraud and Risk Log Analysis Search Quality Retail: Churn, NBO
Moving from Big Data to Smart Data is a
multistep process
• “The issue is not about the volume of data but the ability
to analyze and act on data in real time.”
• http://tinyurl.com/atcanjw
Outline
•
Big Data
•
Cloud Computing
What is Cloud?
• Definition from Whatis.com
• The name cloud computing was inspired by the cloud symbol that's
often used to represent the Internet in flowcharts and diagrams. Cloud computing is a general term for anything that involves delivering hosted services over the Internet.
What is Cloud?
• Cloud computing is a distributed technology which
delivers hosted services over the internet to provide easy access, scalable (and elastic) IT services
Definition from NIST
•
Cloud computing is a
model
for enabling convenient,
on-demand network access
to a
shared pool
of
configurable computing resources
(e.g., networks,
servers, storage, applications, and services) that can
be
rapidly provisioned
and released with minimal
management effort or service provider interaction.
•
Five characteristics
• Broad network access • On-demand self-service • Resource pooling
• Rapid elasticity • Measured service
Cloud Computing
•
In a simple term:
•
A way to access and provide computing
through
Internet services
•
How is that important?
•
Measured
Rent
not
Buy
•
On-demand
Service
not
Usage
•
Elasticity
Dynamic
not
Static
Internet Services
• Web-Based Uniform Access
• HTTP, Restful API
• Enable thin client applications
• Cheap client hardware
• Diversity of end devices • Client simplicity
• Service Level Agreement (SLA)
• A contract between a service provider and a customer that specifies,
usually in measurable terms (QoS), what services provider will furnish
• Common content in contract
• Performance guarantee: Up-time and down-time ratio, System throughput, …
• Problem management detail
• Penalties for non-performance
Measured Service:
Rent
not
Buy
•
Past: Buy machines before we use it
Measured Service:
Rent
not
Buy
•
Now: Rent machines when we use it
On-demand:
Service
not
Usage
•
Past: Users do the installation & maintenance
On-demand:
Service
not
Usage
•
Now: Users specify their requirement & needs
Rapid elasticity:
Dynamic
not
Static
•
Past: Fixed amount of resources & computing power
• Cannot be adjusted to workload variation
Reso ur ces Demand Capacity 1 2 3 Reso ur ces Demand Capacity 1 2 3 Loss Users Loss Revenue Demand Capacity Time Reso ur ces Unused resources
OR
Rapid elasticity:
Dynamic
not
Static
•
Now: Resources are provisioned dynamically
• Add or remove resources on-demand at anytime • Save cost
• Handle burst or unexpected demand
Demand Capacity Time R es o u rces Demand Capacity Time R es o u rces
Resource pooling:
Shared
not
Dedicated
•
Past: A physical machine is dedicated to a single
person or for a single purpose/usage
• With different storage space, computing power, OS,
configurations, applications, etc…
• New machine must be purchased & setup when new
service is required Web Server Email Server DB Server File Server
Resource pooling:
Shared
not
Dedicated
•
Now: A physical machine can be shared among
multiple persons or for multiple purposes/usages
• Isolated environment without interference• Independent setup and configuration
• Resource can be allocated and distributed
Web Server Email Server DB Server File Server Web Server Email Server DB Server File Server Physical Machine Virtual Machines
Benefits From Cloud
•
For the market and enterprises (IT)
•
Reduce initial investment
•Reduce capital expenditure
•
Improve industrial specialization
•Improve resource utilization
Benefits From Cloud
•
For the end user and individuals
•
Lower computer costs
•Improved performance
•
Instant update & free software
•Device independence
10 Obstacles of Cloud
Data Lock-In
• Reality:
• No existing standard for Cloud API
• Customers cannot easily extract their data and programs from one site
to another
• Users are vulnerable to price increases • What if providers going out of business
• Opportunity:
• Standardize API (However, it will reduce provider’s profit) • Multi-cloud service
Data Confidential & Auditability
• Reality:
• Many data is sensitive
• Countries have their own government
laws for data privacy
• Opportunity:
Data Transfer Bottleneck
• Reality:
• The cheapest way to send a lot of data is to physically
send disks or even whole computers via overnight delivery services
• E.g.: 10TB from San Francisco to Seattle
• Through Internet: 45days, $1000USD service charge
• Through FedEx: 1night, $400USD mail fee
• Opportunity:
• No charge and small latency within a cloud platform
(E.g. between EC2 and S3 of Amazon Cloud Service)
• Provider better and more data service, such as data
Enabling Technologies
• Web services Accessibility • Virtualization resource sharing • Distributed computing scalability & elasticity
• General system issues
• Security
• Fault tolerance • Monitoring
Outline
•
Big Data
•
Cloud Computing
Distributed Computing
• A computer system in which several interconnected computers share the computing tasks assigned to the system
Mainframe
Mini Computer Workstation
Strengths of Distributed Computing
• Scalability
• From small cluster to Google’s datacenter
• ~900,000 servers
• 206,040 sqft (50 basketball courts)
• Using low cost commodity server
• Elasticity
• Scale-in/out instead of Scale-up/down
• Allow servers dynamic join or leave systems
• Reliability & Availability
• Achieve fault tolerance through decentralized management • Prevent data lost through replication
• Improve availability through snapshot, checkpoint, rerun • New paradigm: treat exception as norm
Outline
•
MapReduce Model: Hadoop
•
SCALA: SPARK(In-memory computing)
•Streaming Processing: STORM
MapReduce Model
• Google MapReduce (2004)
• Jeffrey Dean et al. MapReduce: Simplified Data Processing on
Large Clusters. OSDI 2004.
• Apache Hadoop (2005)
• http://hadoop.apache.org/
• http://developer.yahoo.com/hadoop/tutorial/
• Apache Hadoop 2.0 (2012)
• Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another
Resource Negotiator, SOCC 2013.
• Separation between resource management and computation
Typical Large-Data Problem
1. Iterate over a large number of records
2. Extract something of interest from each record 3. Shuffle and sort intermediate results
4. Aggregate intermediate results 5. Generate final output
Key idea: provide
a functional abstraction
MapReduce Programming Model
• A parallel programming model (divide-conquer)
• Map: processes a key/value pair to generate a set of intermediate
key/value pairs
• Reduce: merges all intermediate values associated with the same
MapReduce Word Count Example
• User specify the map and reduce functions
Key Value
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values: sum += v;
MapReduce “Runtime”
• Handles scheduling
• Assigns workers to map and reduce tasks
• Handles “data distribution”
• Moves processes to data
• Handles synchronization
• Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
• Detects worker failures and restarts
Limitation of MapReduce
• Simple but limited programming model
• Only apply two computation functions in a job: Map & Reduce
More complex work must use multiple jobs
• The input and output of a job must store into a FS
Outline
•
MapReduce Model: Hadoop
•
SCALA: SPARK(In-memory computing)
Spark
•
Utilize
DSM
(Distributed Shared Memory) in
data processing to enable
in-memory
computing
•
Allow users to explicitly
cache dataset
in
memory across machines
and reuse it in
multiple MapReduce-like parallel operations
•
Retain the
scalability and
fault tolerance
Resilient Distributed Datasets (RDDs)
• Spark introduces an data abstraction called Resilient
Distributed Datasets (RDDs):
• RDD is a read-only collection of objects partitioned across a set
of machines
• RDD can be rebuilt if a partition is lost using the “lineage”
technique
• Spark is integrated into a general programming language
called “scala”
• Pure-bred O.O language: every variable/dataset is an object and
every operation is a method-call
• Seamless Java interpreter
• Programmer specify operations to transform dataset • Operations are parallelized and executed by Spark
Scala: Word Count Example
• val lines = sc.textFile(“hamlet.txt”)!
• val counts = lines.flatMap(line => line.split(“ “)).
map(word => (word, 1)). reduceByKey(_ + _)
Example: Log Mining
• Load error messages from a log into memory, then
interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”)) messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()
Block 1 Block 2 Block 3 Worker Worker Worker Driver
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count
. . . tasks results Cache 1 Cache 2 Cache 3 Base RDD Transformed RDD Action
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to
reconstruct lost partitions
messages = textFile(...).filter(_.startsWith(“ERROR”)) .map(_.split(‘\t’)(2))
HDFS File Filtered RDD Mapped RDD
filter
(func = _.contains(...)) (func = _.split(...)) map
*If lineage is too long, it could cause stack overflow (out of memory) problem during execution
Outline
•
MapReduce Model: Hadoop
•
In-memory computing: SPARK
Storm
•
Developed by BackType which was acquired by
•
Lots of tools for data (i.e. batch) processing
• Hadoop, Pig, HBase, Hive, …
•
None of them are realtime systems which is
becoming a real requirement for businesses
•
Storm provides realtime computation
• Scalable
• Guarantees no data loss
• Extremely robust and fault-tolerant
Storm Components
• Tuple: Core unit of data (immutable set of key/value pair) • Stream: Unbounded sequence of tuples
• Spouts: Source of streams
• Bolts: Processes input streams and produces new streams
Storm Components
• Topology: Spouts and bolts execute as many tasks across
Machine Learning on Big Data
•Mahout on Hadoop
• https://mahout.apache.org/ •MLlib on Spark
• http://spark.apache.org/mllib/ •GraphLab Toolkits
• http://graphlab.org/projects/toolkits.html • GraphLab Computer Vision ToolkitGraph Processing with BSP
(Bulk Synchronous Parallel) model
• Pregel (2010)• Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph
Processing. SIGMOD 2010. • Apache Hama (2010) • https://hama.apache.org/ • Apache Giraph (2012) • https://giraph.apache.org/ • GraphX (2013)
• Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark.
GRADES (SIGMOD workshop) 2013.
Graph Processing with BSP
(Bulk Synchronous Parallel) model
• GraphLab (2010)• Yucheng Low et al. GraphLab: A New Parallel Framework for
Machine Learning. UAI 2010.
• Yucheng Low, et al. Distributed GraphLab: A Framework for
Machine Learning and Data Mining in the Cloud. PVLDB 2012.
• http://graphlab.org/projects/index.html
• http://graphlab.org/resources/publications.html
• PowerGraph (2012)
• Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel
Master Fork * 4 File (3*64MB) Cat Cat Cat Puppy Dog Cat Dog Cat : 4 Dog : 2 Puppy : 1 Input Map phase Reduce phase Output Intermediate files Server: 4 machines Machine#1 Map job1 Machine#2 Map job2 Machine#3 Map job3 Machine#4 Reduce Job1 Machine#2 Reduce Job2 Machine#3
Local Read Local Write
Partition function Cat:1 Cat:1 Cat:1 Dog:1 Dog:1 Cat:1 Location
Remote Read Write to Global FS
Puppy:1
Sorted by key
Location
• A task is automatically re-run when worker failure occurs
• Last task is running by multiple workers
User program
Collection<T> collection; bool IsLegal(Key k); string Hash(Key);
var results = from c in collection where IsLegal(c.key)
select new { Hash(c.key), c.value};
DryadLINQ = LINQ + Dryad
C#
collection
results
C#
C#
C#
Vertex code Query plan (Dryad job) DataQuery on Big Data
•
Query with procedural language
•Google Sawzall (2003)
• Rob Pike et al. Interpreting the Data: Parallel Analysis
with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 2003.
•
Apache Pig (2006)
• Christopher Olston et al. Pig Latin: A Not-So-Foreign
Language for Data Processing. SIGMOD 2008.
SQL-like Query
• Apache Hive (2007)
• Facebook Data Infrastructure Team. Hive - A Warehousing Solution
Over a Map-Reduce Framework. VLDB 2009.
• https://hive.apache.org/ • On top of Apache Hadoop
• Shark (2012)
• Reynold Xin et al. Shark: SQL and Rich Analytics at Scale.
Technical Report. UCB/EECS 2012.
• http://shark.cs.berkeley.edu/ • On top of Apache Spark
• Apache MRQL (2013)
• http://mrql.incubator.apache.org/
Pregel & Apache Giraph
• Computation Model
• Superstep as iteration
• Vertex state machine:
Active and Inactive, vote to halt
• Message passing between
vertices
• Combiners • Aggregators
• Topology mutation
• Master/worker model • Graph partition: hashing • Fault tolerance: checkpointing and confined recovery 3 6 2 1 6 6 2 6 6 6 6 6 6 6 6 6 Superstep 0 Superstep 1 Superstep 2 Superstep 3
Vote to halt Active