YANG, Lin
COMP 6311 Spring 2012
CSE
HKUST
Outline
Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
Data-driven World
Science
Data bases from astronomy, genomics, environmental data, transportation data, …
Humanities and Social Sciences
Scanned books, historical documents, social interactions data, …
Business & Commerce
Corporate sales, stock market transactions, census, airline traffic, …
Entertainment
Internet images, Hollywood movies, MP3 files, …
Medicine
Statistics from 2009
350 Million Named users
175 Million Active users in one day
35 Million Users updating status each day
55 Million Status each day
2.5 Billion Photos/Month
1.6 Million Active pages
Growth: 12 TB/day, 2 PB/year
Global data volume: 8.7 PB
4
What can we do with these data?
What can we do?
Scientific breakthroughs
Business process
efficiencies
Realistic special effects
Improve quality-of-life:
healthcare, transportation,
environmental disasters,
daily life, …
Could We Do More?
YES:
but need major
advances in our capability
to analyze those data
Challenges
6 Demand Capacity Time Res o u rc e s Time V o lu me Moore Data Demand Capacity Time Res o u rc e s
Basic Requirements
Capacity
Elasticity
Challenges(Cont.)
More Requirements
High performance
Fault tolerance
Load Balance
Cost-efficient parallelization
High availability and disaster recovery
…
Outline
Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
The State of The Art
Parallel DBMS technologies
Proposed in the late eighties
Matured over the last two decades
Multi-billion dollar industry: Proprietary DBMS Engines
intended as Data Warehousing solutions for very large
enterprises
MapReduce
pioneered by Google
Parallel DBMS
Popularly used for more than two decades
Research Projects: Gamma, Grace, …
Commercial: Multi-billion dollar industry but access to
only a privileged few
Relational Data Model
Indexing
SQL interface
Advanced query optimization
MapReduce
Dean et al., OSDI’04
Key contribution:
Propose a simple but powerful programming model
Parallelization
Fault-tolerance
Load balancing
Implement the framework of this model and provide
MapReduce(Cont.)
Data flow of MR
12
Input Data M splits
M pieces of intermediate output R splits R pieces of outputs Partition_1 Partition_2 Reduce Map M = input size / 64 MB R = #machine * 2
MapReduce(Cont.)
MapReduce(Cont.)
Word counter example
14
MapReduce(Cont.)
Parallelization
Map and Reduce
Fault tolerance
Master pings workers periodically
Master write periodic checkpoint
Re-execute the unfinished Reduce and all Map tasks of failed
machine
Load balance
Break tasks into small granularity, schedule by master
Optimization
Locality: Try to schedule Map task to the node has the
corresponding data
Backup Tasks: When the job is close to completion, run backup
executions of remaining tasks, the first completed one wins
Introduction to
Hadoop is a popular open-source map-reduce
implementation which is being used as an alternative
to store and process extremely large data sets on
commodity hard-ware.
Inspired by Google MapReduce and GFS
16 Architecture of Hadoop
Trend of Big Data Management
17
Outline
Background
Overview of Big Data Management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
Comparison btw PDB & MR
Pavlo et al., SIGMOD’09
Candidates
Hadoop: an open-source implementation of MapReduce.
DBMS-X: a parallel relational database.
Vertica: a parallel DBMS which stores data in
column-base format.
Task 1
Grep specific-pattern string from large scale of records
5.6 million records, 100 bytes per record
Test 1: Fixes the size of the data per node as 535MB
Test 2: Fixes the total dataset size as 1TB
Comparison(Cont.)
Data Loading
20 Observation:
1. Hadoop outperforms both of PDB 2. For DBMS-X, the data was
actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes;
Comparison(Cont.)
Grep
Observations:
1. For Figure 4, such little data is being processed that the Hadoop start-up costs become the limiting factor in its performance( 10–25 seconds); 2. For Figure5, as the data volume
increased, Hadoop performs almost as fast as PDB;
Comparison(Cont.)
Task 2: analytical works
Data Schema & Volume
22 HTML Doc(600,000 records) Attributes Type url VARCHAR(100) Contents TEXT UserVisits(155 M records, 20GB/Node) Attributes Type srcIP VARCHAR(16) dstURL VARCHAR(100) visitDate DATE adRevenue FLOAT userAgent VARCHAR(64) countryCode VARCHAR(3) langCode VARCHAR(6) searchWord VACHAR(32) Duration INT Rankings(18 M records, 1GB/node) Attributes Type pageURL VARCHAR(100) pageRank INT avgDuration INT
Comparison(Cont.)
Data loading Selection
Find the pageURL in the Rankings table with a pageRank above a user-defined threshold.
Observations:
1. With the help of index, PDB
outperform hadoop significantly; 2. As the number of node increased,
the start-up cost would increase too.
Comparison(Cont.)
Aggregation
24 Calculate the total adRevenue generated for each srcIP in the UserVisits table
(20GB/node), grouped by the srcIP (Fig.7) or 7-character prefix of srcIP(Fig.8) Observations:
1. PDB outperform hadoop;
2. For PDB, communication cost dominate the execution time 3. Since Vertica is column-store, it
could perform better from not reading unused parts of table.
Comparison(Cont.)
Join
Observations:
1. PDB outperforms Hadoop again! 2. PDB use indexes while Hadoop
can only perform complete scan. 3. UserVisits and Rankings in PDB
are partitioned by the join key, so PDB could do the join locally on each node without any network overhead
Find the srcIP that generated the most revenue within a particular date range, then calculate the average pageRank of all the pages visited during this interval.
SQL MR
SELECT INTO Temp srcIP, AVG(pageRank) as AvgRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgRank
FROM Temp ORDER BY
1. Filter on UserVisits and join wth Rankings
2. Compute the total
adRevenue and avgRank based on srcIP
3. Find the one with largest adRevenu
Comparison(Cont.)
UDF Aggregation
26 Scan the HTML documents and
search for all the URLs appeared, and count the reference number across the entire set for each unique URL
Observations:
1. Bottom segment is the time to execute the UDF and upper
segment represent the query time; 2. Both DBMS-X and Hadoop have
approximately performance
SQL MR
SELECT INTO Temp F(contents)
FROM Documents;
SELECT url, SUM(value)
FROM Temp
GROUPBY url;
F is a user-defined function, which parses the contents of each record in the Documents table and emits URLs into the database.
Quick Summary
At the scale of experiences above, parallel DBMS perform
much better than Hadoop
the result of a number of technologies developed over the
past 25 years: B-Tree index, column-store, compression
algorithm, sophisticated query engine and etc.
Hadoop get a better fault tolerance with performance
penalty.
Hadoop is more easy to set up and use the PDB at large
scale of nodes.
PDB don’t do a good job in UDF aggregation.
Outline
Background
Overview of big data management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
Hive
Thusoo et al., VLDB’09
Problem
MapReduce requires developers to write custom
programs which are hard to maintain and reuse.
Analyst are familiar with SQL
Key contribution
Proposed an open-source data warehouse solution built
on top of Hadoop.
Hive provides HiveQL, a SQL-like query language which
support select, join, aggregate, union all and sub-queries
in from clause.
Hive(Cont.)
Metastore:
system catalog.
Thrift Server:
a framework
for cross-language services.
Driver:
manages the life
cycle of a HiveQL statement
during compilation,
optimization and execution.
Hive(Cont.)
FROM (SELECT a.status, b.school, b.gender
FROM status_updates a JOIN profiles b ON (a.userid = b.userid and
a.ds='2009-03-20' ) ) subq1
INSERT OVERWRITE TABLE gender_summary
PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
Hive(Cont.)
Future work
Make HiveQL subsume SQL
Add cost-based optimizer
Columnar storage and more intelligent data placement
Performance enhancement
Integrate with commercial BI tools
Multi-query optimization and generic n-way joins in a
single MR job
HadoopDB
Azza Abouzeid et al. VLDB’09
Problem
It’s the age of big data.
Properties for large data analysis :
Performance
Flexible query interface
Fault tolerance
Load balance
Parallel database
MapReduce
Can we have both of them?
HadoopDB(Cont.)
Key contribution:
To build
a hybrid system
to archive all the required
properties.
Basic idea:
Connect
multiple single-node database
using Hadoop as task
coordinator and net communication layer
(Fault tolerance &
Load balance)
Queries are
parallelized
across nodes using MapReduce
framework
Queries are pushed inside of corresponding database note as
much as possible
(Performance & Multi-interface)
HadoopDB(Cont.)
Translate HiveQL to MapReduce, and translate some of MR back to SQL Repartition data on a given key or breaking apart single node data into chunks maintains meta information about the databases Connect to database and execute SQL and return key/valueHadoopDB(Cont.)
SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY
YEAR(saleDate);
Outline
Background
Overview of big data management
Comparison btw PDB and MR
DB Solution on MapReduce
Conclusion
Conclusion
It is the age of big Data now
Major technologies for big data management
Parallel DBMS: Well studied and wildly used
MapReduce: Attracting new technology with bright
future while exist lots of drawbacks, especially in the
aspect of performance
Hive and HadoopDB give a good paradigm for build
DB solution on the top of MapReduce
There are still lots of work could be done in the field of
big data management.
Reference
Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. OSDI’04
Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of
Approaches to Large-Scale Data Analysis. SIGMOD’09
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A
Warehousing Solution Over a Map-Reduce Framework. VLDB’09
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi.
HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads. VLDB’09
Divyakant Agrawal, Sudipto Das, AmrEl Abbadi. Big Data and
Cloud Computing: Current State and Future Opportunities.
EDBT’11