• No results found

YANG, Lin COMP 6311 Spring 2012 CSE HKUST

N/A
N/A
Protected

Academic year: 2021

Share "YANG, Lin COMP 6311 Spring 2012 CSE HKUST"

Copied!
41
0
0

Loading.... (view fulltext now)

Full text

(1)

YANG, Lin

COMP 6311 Spring 2012

CSE

HKUST

(2)

Outline

Background

Overview of Big Data Management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

(3)

Data-driven World

 Science

 Data bases from astronomy, genomics, environmental data, transportation data, …

 Humanities and Social Sciences

 Scanned books, historical documents, social interactions data, …

 Business & Commerce

 Corporate sales, stock market transactions, census, airline traffic, …

 Entertainment

 Internet images, Hollywood movies, MP3 files, …

 Medicine

(4)

Statistics from 2009

350 Million Named users

175 Million Active users in one day

35 Million Users updating status each day

55 Million Status each day

2.5 Billion Photos/Month

1.6 Million Active pages

Growth: 12 TB/day, 2 PB/year

Global data volume: 8.7 PB

4

(5)

What can we do with these data?

What can we do?

Scientific breakthroughs

Business process

efficiencies

Realistic special effects

Improve quality-of-life:

healthcare, transportation,

environmental disasters,

daily life, …

Could We Do More?

YES:

but need major

advances in our capability

to analyze those data

(6)

Challenges

6 Demand Capacity Time Res o u rc e s Time V o lu me Moore Data Demand Capacity Time Res o u rc e s

Basic Requirements

Capacity

Elasticity

(7)

Challenges(Cont.)

More Requirements

High performance

Fault tolerance

Load Balance

Cost-efficient parallelization

High availability and disaster recovery

(8)

Outline

Background

Overview of Big Data Management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

(9)

The State of The Art

Parallel DBMS technologies

Proposed in the late eighties

Matured over the last two decades

Multi-billion dollar industry: Proprietary DBMS Engines

intended as Data Warehousing solutions for very large

enterprises

MapReduce

pioneered by Google

(10)

Parallel DBMS

Popularly used for more than two decades

Research Projects: Gamma, Grace, …

Commercial: Multi-billion dollar industry but access to

only a privileged few

Relational Data Model

Indexing

SQL interface

Advanced query optimization

(11)

MapReduce

Dean et al., OSDI’04

Key contribution:

Propose a simple but powerful programming model

Parallelization

Fault-tolerance

Load balancing

Implement the framework of this model and provide

(12)

MapReduce(Cont.)

Data flow of MR

12

Input Data M splits

M pieces of intermediate output R splits R pieces of outputs Partition_1 Partition_2 Reduce Map M = input size / 64 MB R = #machine * 2

(13)

MapReduce(Cont.)

(14)

MapReduce(Cont.)

Word counter example

14

(15)

MapReduce(Cont.)

Parallelization

Map and Reduce

Fault tolerance

Master pings workers periodically

Master write periodic checkpoint

Re-execute the unfinished Reduce and all Map tasks of failed

machine

Load balance

Break tasks into small granularity, schedule by master

Optimization

Locality: Try to schedule Map task to the node has the

corresponding data

Backup Tasks: When the job is close to completion, run backup

executions of remaining tasks, the first completed one wins

(16)

Introduction to

Hadoop is a popular open-source map-reduce

implementation which is being used as an alternative

to store and process extremely large data sets on

commodity hard-ware.

Inspired by Google MapReduce and GFS

16 Architecture of Hadoop

(17)

Trend of Big Data Management

17

(18)

Outline

Background

Overview of Big Data Management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

(19)

Comparison btw PDB & MR

Pavlo et al., SIGMOD’09

Candidates

Hadoop: an open-source implementation of MapReduce.

DBMS-X: a parallel relational database.

Vertica: a parallel DBMS which stores data in

column-base format.

Task 1

Grep specific-pattern string from large scale of records

5.6 million records, 100 bytes per record

Test 1: Fixes the size of the data per node as 535MB

Test 2: Fixes the total dataset size as 1TB

(20)

Comparison(Cont.)

Data Loading

20 Observation:

1. Hadoop outperforms both of PDB 2. For DBMS-X, the data was

actually loaded on each node sequentially, while the additional housekeeping can be done in parallel across nodes;

(21)

Comparison(Cont.)

Grep

Observations:

1. For Figure 4, such little data is being processed that the Hadoop start-up costs become the limiting factor in its performance( 10–25 seconds); 2. For Figure5, as the data volume

increased, Hadoop performs almost as fast as PDB;

(22)

Comparison(Cont.)

Task 2: analytical works

Data Schema & Volume

22 HTML Doc(600,000 records) Attributes Type url VARCHAR(100) Contents TEXT UserVisits(155 M records, 20GB/Node) Attributes Type srcIP VARCHAR(16) dstURL VARCHAR(100) visitDate DATE adRevenue FLOAT userAgent VARCHAR(64) countryCode VARCHAR(3) langCode VARCHAR(6) searchWord VACHAR(32) Duration INT Rankings(18 M records, 1GB/node) Attributes Type pageURL VARCHAR(100) pageRank INT avgDuration INT

(23)

Comparison(Cont.)

Data loading Selection

Find the pageURL in the Rankings table with a pageRank above a user-defined threshold.

Observations:

1. With the help of index, PDB

outperform hadoop significantly; 2. As the number of node increased,

the start-up cost would increase too.

(24)

Comparison(Cont.)

Aggregation

24 Calculate the total adRevenue generated for each srcIP in the UserVisits table

(20GB/node), grouped by the srcIP (Fig.7) or 7-character prefix of srcIP(Fig.8) Observations:

1. PDB outperform hadoop;

2. For PDB, communication cost dominate the execution time 3. Since Vertica is column-store, it

could perform better from not reading unused parts of table.

(25)

Comparison(Cont.)

Join

Observations:

1. PDB outperforms Hadoop again! 2. PDB use indexes while Hadoop

can only perform complete scan. 3. UserVisits and Rankings in PDB

are partitioned by the join key, so PDB could do the join locally on each node without any network overhead

Find the srcIP that generated the most revenue within a particular date range, then calculate the average pageRank of all the pages visited during this interval.

SQL MR

SELECT INTO Temp srcIP, AVG(pageRank) as AvgRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgRank

FROM Temp ORDER BY

1. Filter on UserVisits and join wth Rankings

2. Compute the total

adRevenue and avgRank based on srcIP

3. Find the one with largest adRevenu

(26)

Comparison(Cont.)

UDF Aggregation

26 Scan the HTML documents and

search for all the URLs appeared, and count the reference number across the entire set for each unique URL

Observations:

1. Bottom segment is the time to execute the UDF and upper

segment represent the query time; 2. Both DBMS-X and Hadoop have

approximately performance

SQL MR

SELECT INTO Temp F(contents)

FROM Documents;

SELECT url, SUM(value)

FROM Temp

GROUPBY url;

F is a user-defined function, which parses the contents of each record in the Documents table and emits URLs into the database.

(27)

Quick Summary

At the scale of experiences above, parallel DBMS perform

much better than Hadoop

the result of a number of technologies developed over the

past 25 years: B-Tree index, column-store, compression

algorithm, sophisticated query engine and etc.

Hadoop get a better fault tolerance with performance

penalty.

Hadoop is more easy to set up and use the PDB at large

scale of nodes.

PDB don’t do a good job in UDF aggregation.

(28)

Outline

Background

Overview of big data management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

(29)

Hive

Thusoo et al., VLDB’09

Problem

MapReduce requires developers to write custom

programs which are hard to maintain and reuse.

Analyst are familiar with SQL

Key contribution

Proposed an open-source data warehouse solution built

on top of Hadoop.

Hive provides HiveQL, a SQL-like query language which

support select, join, aggregate, union all and sub-queries

in from clause.

(30)

Hive(Cont.)

Metastore:

system catalog.

Thrift Server:

a framework

for cross-language services.

Driver:

manages the life

cycle of a HiveQL statement

during compilation,

optimization and execution.

(31)

Hive(Cont.)

FROM (SELECT a.status, b.school, b.gender

FROM status_updates a JOIN profiles b ON (a.userid = b.userid and

a.ds='2009-03-20' ) ) subq1

INSERT OVERWRITE TABLE gender_summary

PARTITION(ds='2009-03-20')

SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary

PARTITION(ds='2009-03-20')

SELECT subq1.school, COUNT(1) GROUP BY subq1.school

(32)

Hive(Cont.)

Future work

Make HiveQL subsume SQL

Add cost-based optimizer

Columnar storage and more intelligent data placement

Performance enhancement

Integrate with commercial BI tools

Multi-query optimization and generic n-way joins in a

single MR job

(33)

HadoopDB

Azza Abouzeid et al. VLDB’09

Problem

It’s the age of big data.

Properties for large data analysis :

Performance

Flexible query interface

Fault tolerance

Load balance

Parallel database

MapReduce

Can we have both of them?

(34)

HadoopDB(Cont.)

Key contribution:

To build

a hybrid system

to archive all the required

properties.

Basic idea:

Connect

multiple single-node database

using Hadoop as task

coordinator and net communication layer

(Fault tolerance &

Load balance)

Queries are

parallelized

across nodes using MapReduce

framework

Queries are pushed inside of corresponding database note as

much as possible

(Performance & Multi-interface)

(35)

HadoopDB(Cont.)

Translate HiveQL to MapReduce, and translate some of MR back to SQL Repartition data on a given key or breaking apart single node data into chunks maintains meta information about the databases Connect to database and execute SQL and return key/value
(36)

HadoopDB(Cont.)

SELECT YEAR(saleDate), SUM(revenue) FROM sales GROUP BY

YEAR(saleDate);

(37)
(38)

Outline

Background

Overview of big data management

Comparison btw PDB and MR

DB Solution on MapReduce

Conclusion

(39)

Conclusion

It is the age of big Data now

Major technologies for big data management

Parallel DBMS: Well studied and wildly used

MapReduce: Attracting new technology with bright

future while exist lots of drawbacks, especially in the

aspect of performance

Hive and HadoopDB give a good paradigm for build

DB solution on the top of MapReduce

There are still lots of work could be done in the field of

big data management.

(40)

Reference

Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data

Processing on Large Clusters. OSDI’04

Andrew Pavlo, Erik Paulson, Alexander Rasin. A Comparison of

Approaches to Large-Scale Data Analysis. SIGMOD’09

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain. Hive - A

Warehousing Solution Over a Map-Reduce Framework. VLDB’09

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi.

HadoopDB: An Architectural Hybrid of MapReduce and

DBMS Technologies for Analytical Workloads. VLDB’09

Divyakant Agrawal, Sudipto Das, AmrEl Abbadi. Big Data and

Cloud Computing: Current State and Future Opportunities.

EDBT’11

(41)

References

Related documents

Statistical machine translations to translate hindi language for informational purposes only parts of translation and then you can result in the playwright to Confinement in

Our solution is that we propose a lightweight data middleware in support of document stores, to translate the SQL statements to the NoSQL operations using the MapReduce framework

Out the given word into english to binary translator button to translate a binary code for conveying text?. The computer will convert those

Our document translation companies, word documents for words, standard arabic translations and script and shapes of business.. Reports with additional pages will be quoted; we will

Translates through the translate word english am i using the analytics and personalization company, the cloud to understand how many pages you choose this user.. Translated copy

Translate any type plain text document including Chinese emails Microsoft Word Documents Text Web pages Files Chat Correspondence Power use Excel. How to translate English pdf

Designed to provide students with continued study in key concepts and practices of business management, including marketing, finance, entrepreneurship,

This scale over a G bass creates G7 (sus4, lydian. The chord scale is G ionian. G ionian contains the same notes as 11). The chord scale is G ionian.. Chapter 5: From Major to