YANG, Lin COMP 6311 Spring 2012 CSE HKUST

(1)


(2)

Outline



Background



Overview of Big Data Management



Comparison between PDB and MR



DB Solution on MapReduce



Conclusion

(3)

Data-driven World

• Science

  • Databases from astronomy, genomics, environmental data, transportation data, …

• Humanities and Social Sciences

  • Scanned books, historical documents, social interaction data, …

• Business & Commerce

  • Corporate sales, stock market transactions, census, airline traffic, …

• Entertainment

  • Internet images, Hollywood movies, MP3 files, …

• Medicine

(4)



Statistics from 2009



350 Million Named users



175 Million Active users in one day



35 Million Users updating status each day



55 Million Status updates each day



2.5 Billion Photos/Month



1.6 Million Active pages



Growth: 12 TB/day, 2 PB/year



Global data volume: 8.7 PB


(5)

What can we do with these data?



What can we do?



Scientific breakthroughs



Business process efficiencies

Realistic special effects

Improve quality of life: healthcare, transportation, environmental disasters, daily life, …

Could We Do More?

YES, but we need major advances in our capability to analyze those data

(6)

Challenges

[Figures: Resources vs. Time (Demand, Capacity); Volume vs. Time (Moore, Data)]



Basic Requirements

Capacity

Elasticity

(7)

Challenges(Cont.)



More Requirements



High performance



Fault tolerance



Load Balance



Cost-efficient parallelization



High availability and disaster recovery



…

(8)

Outline



Background



Conclusion

(9)

The State of The Art



Parallel DBMS technologies



Proposed in the late eighties



Matured over the last two decades



Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises



MapReduce



pioneered by Google

(10)

Parallel DBMS



Widely used for more than two decades

Research Projects: Gamma, Grace, …

Commercial: a multi-billion dollar industry, but accessible to only a privileged few



Relational Data Model



Indexing



SQL interface



Advanced query optimization

(11)

MapReduce

Dean et al., OSDI’04



Key contribution:



Propose a simple but powerful programming model



Parallelization



Fault-tolerance



Load balancing



Implement the framework of this model and provide

(12)

MapReduce(Cont.)



Data flow of MR

[Figure: the input data is divided into M splits; Map tasks produce M pieces of intermediate output, which are partitioned (Partition_1, Partition_2) into R splits; Reduce tasks produce R pieces of output.]

M = input size / 64 MB

R = #machines * 2
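The sizing rule and the partitioning step in the data-flow figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the names `num_map_tasks`, `num_reduce_tasks`, and `partition` are hypothetical, rounding M up for a partial last split is an assumption, and CRC32 is chosen only because it gives a deterministic hash.

```python
import math
import zlib

SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB per input split, as on the slide


def num_map_tasks(input_size_bytes):
    """M = input size / 64 MB (rounded up here so a partial split gets a task)."""
    return max(1, math.ceil(input_size_bytes / SPLIT_SIZE))


def num_reduce_tasks(num_machines):
    """R = #machines * 2, the rule of thumb from the slide."""
    return num_machines * 2


def partition(key, r):
    """Assign an intermediate key to one of R partitions via hash(key) mod R.

    CRC32 is an assumed stand-in for the hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % r
```

Under these assumptions, a 1 TB input yields 16,384 Map tasks, and a 100-machine cluster yields 200 Reduce tasks.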

(13)

MapReduce(Cont.)

(14)

MapReduce(Cont.)



Word counter example
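Since the figure for this slide is not available, the word counter can be sketched as a single-process Python simulation of the Map, shuffle, and Reduce steps. The function names (`map_fn`, `reduce_fn`, `run_word_count`) are hypothetical, and lowercasing the words is an assumption made for the example.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Simulate the shuffle by grouping intermediate pairs by key in memory."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for word, one in map_fn(name, text):
            groups[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in groups.items())
```

In a real cluster the grouping step is performed by the framework across machines; only `map_fn` and `reduce_fn` are written by the user.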


(15)

MapReduce(Cont.)



Parallelization



Map and Reduce



Fault tolerance



Master pings workers periodically



Master writes periodic checkpoints

Re-execute the unfinished Reduce tasks and all Map tasks of the failed machine



Load balance



Break the job into fine-grained tasks, scheduled by the master



Optimization



Locality: try to schedule each Map task on a node that holds the corresponding data

Backup Tasks: when the job is close to completion, run backup executions of the remaining tasks; whichever execution completes first wins
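The re-execution rule under "Fault tolerance" can be made concrete with a small sketch. The `Task` record and the function `tasks_to_reexecute` are hypothetical names used only for illustration. The asymmetry follows the MapReduce paper: completed Map output lives on the failed machine's local disk and is lost, while completed Reduce output has already been written to the global file system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    kind: str    # "map" or "reduce"
    worker: str  # machine the task was assigned to
    state: str   # "in_progress" or "completed"


def tasks_to_reexecute(tasks, failed_worker):
    """Return the ids of tasks the master must reschedule after a failure."""
    rerun = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # All Map tasks are redone; Reduce tasks only if they had not finished.
        if task.kind == "map" or task.state != "completed":
            rerun.append(task.task_id)
    return rerun
```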

(16)

Introduction to Hadoop



Hadoop is a popular open-source MapReduce implementation that is used as an alternative way to store and process extremely large data sets on commodity hardware.



Inspired by Google MapReduce and GFS

[Figure: Architecture of Hadoop]

(17)

Trend of Big Data Management


(18)

Outline



Background



DB Solution on MapReduce



Conclusion

(19)

Comparison between PDB & MR

Pavlo et al., SIGMOD’09



Candidates



Hadoop: an open-source implementation of MapReduce.



DBMS-X: a parallel relational database.



Vertica: a parallel DBMS that stores data in a column-based format.



Task 1



Grep for a specific string pattern across a large set of records

5.6 million records, 100 bytes per record

Test 1: fixes the data size per node at 535 MB

Test 2: fixes the total dataset size at 1 TB

(20)

Comparison(Cont.)



Data Loading

Observations:
1. Hadoop outperforms both PDBs.
2. For DBMS-X, the data was actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes.

(21)

Comparison(Cont.)



Grep

Observations:
1. For Figure 4, so little data is processed that Hadoop's start-up cost (10–25 seconds) becomes the limiting factor in its performance.
2. For Figure 5, as the data volume increases, Hadoop performs almost as fast as the PDBs.

(22)

Comparison(Cont.)



Task 2: analytical tasks



Data Schema & Volume

HTML Doc (600,000 records)
Attribute      Type
url            VARCHAR(100)
Contents       TEXT

UserVisits (155 M records, 20 GB/node)
Attribute      Type
srcIP          VARCHAR(16)
dstURL         VARCHAR(100)
visitDate      DATE
adRevenue      FLOAT
userAgent      VARCHAR(64)
countryCode    VARCHAR(3)
langCode       VARCHAR(6)
searchWord     VARCHAR(32)
Duration       INT

Rankings (18 M records, 1 GB/node)
Attribute      Type
pageURL        VARCHAR(100)
pageRank       INT
avgDuration    INT

(23)

Comparison(Cont.)



Data loading

Selection

Find the pageURL in the Rankings table with a pageRank above a user-defined threshold.

Observations:
1. With the help of an index, the PDBs outperform Hadoop significantly.
2. As the number of nodes increases, the start-up cost increases too.
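The selection task can be tried out on a single machine with a small Python sketch. SQLite stands in for the parallel DBMS here, which is an assumption made purely for illustration; the function name `select_pages`, the index name `idx_rank`, and the `ORDER BY` (added for deterministic output) are not from the benchmark.

```python
import sqlite3


def select_pages(rows, threshold):
    """Return (pageURL, pageRank) for pages whose pageRank exceeds the threshold.

    rows: iterable of (pageURL, pageRank, avgDuration) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    # An index on pageRank lets the engine avoid a full table scan,
    # which is the advantage the observations attribute to the PDBs.
    conn.execute("CREATE INDEX idx_rank ON Rankings (pageRank)")
    conn.executemany("INSERT INTO Rankings VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT pageURL, pageRank FROM Rankings WHERE pageRank > ? ORDER BY pageURL",
        (threshold,),
    ).fetchall()
    conn.close()
    return result
```

A MapReduce version of the same task has no index to consult, so every Map task scans its whole split and emits the matching records.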

(24)

Comparison(Cont.)



Aggregation

Calculate the total adRevenue generated for each srcIP in the UserVisits table (20 GB/node), grouped by the srcIP (Fig. 7) or by the 7-character prefix of the srcIP (Fig. 8).

Observations:
1. The PDBs outperform Hadoop.
2. For the PDBs, communication cost dominates the execution time.
3. Since Vertica is a column store, it performs better by not reading the unused parts of the table.
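The two variants of the aggregation task differ only in the grouping key, as the following minimal Python sketch shows. The function name `total_revenue` and the `prefix_len` parameter are hypothetical; the data is processed in one process rather than across nodes.

```python
from collections import defaultdict


def total_revenue(visits, prefix_len=None):
    """Sum adRevenue per srcIP, or per srcIP prefix when prefix_len is given.

    visits: iterable of (srcIP, adRevenue) pairs.
    """
    totals = defaultdict(float)
    for src_ip, ad_revenue in visits:
        key = src_ip if prefix_len is None else src_ip[:prefix_len]
        totals[key] += ad_revenue
    return dict(totals)
```

Grouping by the 7-character prefix produces far fewer distinct groups, so less intermediate data has to be exchanged between nodes before the final totals are computed.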

(25)

Comparison(Cont.)



Join

Find the srcIP that generated the most revenue within a particular date range, then calculate the average pageRank of all the pages visited during this interval.

SQL:

SELECT INTO Temp sourceIP, AVG(pageRank) AS avgRank, SUM(adRevenue) AS totalRevenue
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgRank
FROM Temp ORDER BY

MR:

1. Filter on UserVisits and join with Rankings
2. Compute the total adRevenue and avgRank based on srcIP
3. Find the one with the largest adRevenue

Observations:
1. The PDBs outperform Hadoop again!
2. The PDBs use indexes, while Hadoop can only perform a complete scan.
3. UserVisits and Rankings in the PDBs are partitioned by the join key, so the PDBs can do the join locally on each node without any network overhead.
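The three MR phases of the join task can be sketched in plain Python. This is a single-process illustration under assumptions, not the benchmark code: the function name `top_revenue_source` is hypothetical, and the join uses an in-memory dictionary over Rankings, whereas a Hadoop job would shuffle both tables on the URL.

```python
from collections import defaultdict


def top_revenue_source(rankings, user_visits, start, end):
    """rankings: (pageURL, pageRank) pairs.
    user_visits: (sourceIP, destURL, visitDate, adRevenue) tuples.
    Returns (sourceIP, totalRevenue, avgRank), or None if nothing matches.
    """
    # Phase 1: filter UserVisits on the date range and join with Rankings.
    rank_of = dict(rankings)
    joined = [
        (ip, rank_of[url], revenue)
        for ip, url, visit_date, revenue in user_visits
        if start <= visit_date <= end and url in rank_of
    ]

    # Phase 2: total adRevenue and average pageRank per sourceIP.
    acc = defaultdict(lambda: [0.0, 0, 0])  # [revenue, rank sum, visit count]
    for ip, rank, revenue in joined:
        acc[ip][0] += revenue
        acc[ip][1] += rank
        acc[ip][2] += 1
    per_ip = {ip: (rev, rank_sum / n) for ip, (rev, rank_sum, n) in acc.items()}

    # Phase 3: pick the sourceIP with the largest total adRevenue.
    if not per_ip:
        return None
    best = max(per_ip, key=lambda ip: per_ip[ip][0])
    return best, per_ip[best][0], per_ip[best][1]
```

Each phase corresponds to one MapReduce job, which is part of why Hadoop pays more start-up and scan cost on this task than a PDB that joins locally on co-partitioned tables.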

(26)

Comparison(Cont.)



UDF Aggregation

Scan the HTML documents, search for all the URLs that appear in them, and count the number of references to each unique URL across the entire set.

SQL:

SELECT INTO Temp F(contents)
FROM Documents;

SELECT url, SUM(value)
FROM Temp
GROUP BY url;

F is a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database.

Observations:
1. The bottom segment is the time to execute the UDF, and the upper segment represents the query time.
2. DBMS-X and Hadoop have approximately the same performance.
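The role of the user-defined function F can be illustrated with a short Python sketch. The regular expression is a deliberately simple assumption about what counts as a URL, and the names `extract_urls` and `count_references` are hypothetical; the benchmark's actual parser is not shown on the slide.

```python
import re
from collections import Counter

# Assumed, simplified URL pattern: http(s) up to whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents):
    """Stand-in for the UDF F: emit (url, 1) for every URL found in a document."""
    return [(url, 1) for url in URL_PATTERN.findall(contents)]


def count_references(documents):
    """Equivalent of SELECT url, SUM(value) ... GROUP BY url over F's output."""
    counts = Counter()
    for contents in documents:
        for url, value in extract_urls(contents):
            counts[url] += value
    return dict(counts)
```

In MapReduce terms, `extract_urls` is the Map function and the summation is the Reduce function, so this task maps naturally onto the MR model.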

(27)

Quick Summary



At the scale of the experiments above, the parallel DBMSs perform much better than Hadoop

This is the result of a number of technologies developed over the past 25 years: B-tree indexes, column stores, compression algorithms, sophisticated query engines, etc.

Hadoop achieves better fault tolerance, at the cost of a performance penalty.

Hadoop is easier to set up and use than the PDBs on a large number of nodes.

The PDBs don’t do a good job in UDF aggregation.

(28)

Outline



Background



Overview of big data management



Conclusion

(29)

Hive

Thusoo et al., VLDB’09



Problem



MapReduce requires developers to write custom programs, which are hard to maintain and reuse.

Analysts are familiar with SQL

Key contribution

Proposed an open-source data warehouse solution built on top of Hadoop.

Hive provides HiveQL, a SQL-like query language that supports select, join, aggregate, union all, and subqueries in the FROM clause.

(30)

Hive(Cont.)



Metastore: system catalog.

Thrift Server: a framework for cross-language services.

Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.

(31)

Hive(Cont.)

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1) GROUP BY subq1.school
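The point of the multi-table insert above is that the join is computed once and feeds two aggregations. A minimal Python sketch of the same logic follows; the function name `summarize` is hypothetical, the data lives in memory, and `userid` is assumed to be unique in `profiles`.

```python
from collections import Counter


def summarize(status_updates, profiles, ds):
    """status_updates: (userid, status, ds) tuples.
    profiles: (userid, school, gender) tuples, one per user (assumed).
    Returns (gender_summary, school_summary) for the given date partition.
    """
    profile_of = {userid: (school, gender) for userid, school, gender in profiles}
    gender_summary, school_summary = Counter(), Counter()
    # One pass over the joined rows feeds both summaries, mirroring the
    # two INSERT clauses that share the subquery subq1.
    for userid, _status, row_ds in status_updates:
        if row_ds != ds or userid not in profile_of:
            continue
        school, gender = profile_of[userid]
        gender_summary[gender] += 1
        school_summary[school] += 1
    return dict(gender_summary), dict(school_summary)
```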

(32)

Hive(Cont.)



Future work



Make HiveQL subsume SQL



Add cost-based optimizer



Columnar storage and more intelligent data placement



Performance enhancement



Integrate with commercial BI tools



Multi-query optimization and generic n-way joins in a single MR job

(33)

HadoopDB

Abouzeid et al., VLDB’09



Problem



It’s the age of big data.



Properties for large data analysis:

Performance (parallel database)

Flexible query interface (parallel database)

Fault tolerance (MapReduce)

Load balance (MapReduce)

Can we have both of them?

(34)

HadoopDB(Cont.)



Key contribution:



To build a hybrid system that achieves all the required properties.

Basic idea:

Connect multiple single-node databases, using Hadoop as the task coordinator and network communication layer (Fault tolerance & Load balance)

Queries are parallelized across nodes using the MapReduce framework

Queries are pushed into the corresponding database nodes as much as possible (Performance & Multi-interface)

(35)

HadoopDB(Cont.)

SMS Planner: translates HiveQL to MapReduce, and translates some of the MR back to SQL

Data Loader: repartitions data on a given key and breaks single-node data apart into chunks

Catalog: maintains meta information about the databases

Database Connector: connects to a database, executes SQL and returns key/value pairs

(36)

HadoopDB(Cont.)



SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);
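One way to picture how a query like this is split between the databases and MapReduce is the following Python sketch. It is an illustration under assumptions, not HadoopDB code: in-memory SQLite databases stand in for the single-node databases, the names `make_node`, `map_phase`, and `reduce_phase` are hypothetical, and SQLite's `strftime` replaces `YEAR()`.

```python
import sqlite3
from collections import defaultdict

# SQL pushed down to every node: each node aggregates its own partition.
NODE_SQL = (
    "SELECT CAST(strftime('%Y', saleDate) AS INTEGER), SUM(revenue) "
    "FROM sales GROUP BY strftime('%Y', saleDate)"
)


def make_node(rows):
    """Create one single-node database holding a partition of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (saleDate TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_phase(node):
    """Map task: run the pushed-down SQL and emit (year, partial sum) pairs."""
    return node.execute(NODE_SQL).fetchall()


def reduce_phase(partials):
    """Reduce task: merge the partial sums from all nodes per year."""
    totals = defaultdict(float)
    for year, revenue in partials:
        totals[year] += revenue
    return dict(totals)
```

If the sales table were already partitioned by year across the nodes, the Reduce step would be unnecessary, because each year's total would be produced entirely by one node.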

(37)

(38)

Outline



Background



Overview of big data management



DB Solution on MapReduce



Conclusion

(39)

Conclusion



It is now the age of big data

Major technologies for big data management

Parallel DBMS: well studied and widely used

MapReduce: an attractive new technology with a bright future, but it still has many drawbacks, especially in performance

Hive and HadoopDB offer a good paradigm for building DB solutions on top of MapReduce

There is still a lot of work to be done in the field of big data management.

(40)

References

Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04

Andrew Pavlo, Erik Paulson, Alexander Rasin, et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD’09

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB’09

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, et al. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB’09

Divyakant Agrawal, Sudipto Das, Amr El Abbadi. Big Data and Cloud Computing: Current State and Future Opportunities. EDBT’11

(41)