Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Academic year: 2021

(1)

Big Data

Data-intensive Computing

Methods, Tools, and Applications

(CMSC 34900)

Ian Foster

Computation Institute

(2)
(3)
(4)

SQL Overview

  Structured Query Language
  The standard for relational database management systems (RDBMS)
  RDBMS: a database management system that manages data as a collection of tables in which all relationships are represented by common values in related tables

(5)


(6)

SQL Environment

  Catalog: a set of schemas that constitute the description of a database
  Schema (or Database): the structure that contains descriptions of objects created by a user (base tables, views, constraints)
  Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints
  Data Manipulation Language (DML): commands that maintain and query a database
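The DDL/DML split can be seen end to end with Python's built-in sqlite3 module. A minimal sketch; the customer_t table and its columns are illustrative, not from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the database -- create a base table with constraints
cur.execute("""
    CREATE TABLE customer_t (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    )
""")

# DML: maintain and query the data
cur.execute("INSERT INTO customer_t VALUES (1, 'Acme')")
cur.execute("UPDATE customer_t SET customer_name = 'Acme Corp' WHERE customer_id = 1")
rows = cur.execute("SELECT customer_name FROM customer_t").fetchall()
print(rows)  # [('Acme Corp',)]
```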

(7)
(8)
(9)


Figure 7-4 DDL, DML, DCL, and the database development process

(10)

Common SQL Commands

  Data Definition Language (DDL): Create, Drop, Alter
  Data Manipulation Language (DML): Select, Update, Insert, Delete
  Data Control Language (DCL):

(11)


Internal Schema Definition

  Control processing/storage efficiency:
    Choice of indexes
    File organizations for base tables
    File organizations for indexes
    Data clustering
    Statistics maintenance
  Creating indexes
    Speed up random/sequential access to base table data
    Example:
      CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)
    This makes an index for the CUSTOMER_NAME field of the CUSTOMER_T table
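The effect of such an index can be observed directly in SQLite, which stands in here for a full RDBMS (a sketch; EXPLAIN QUERY PLAN output wording varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer_t (customer_id INTEGER, customer_name TEXT)")

def plan(sql):
    # Concatenate the EXPLAIN QUERY PLAN detail strings for a query
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM customer_t WHERE customer_name = 'Smith'"
print(plan(query))  # SCAN ...  (full table scan)

# Mirror the slide's CREATE INDEX on the customer name column
cur.execute("CREATE INDEX name_idx ON customer_t (customer_name)")
print(plan(query))  # SEARCH ... USING INDEX name_idx ...
```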

(12)
(13)

MapReduce

or

(14)

An example problem

  We have a large number of documents, each labeled in some way with the name of the site where they occur
  Find sites with documents that contain more than five instances of the words “IBM” or “Google”

(15)


MapReduce approach

  Map: Create a histogram for each document listing frequently occurring words
  Reduce: Group documents by their site of origin
  Map: Identify documents with more than
(16)

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

(17)


map(String key, String value)
  // key: (site id + document name)
  // value: document contents
  histogram = CountWords(value);
  EmitIntermediate
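Assuming the truncated final step filters per-site document histograms against the five-occurrence threshold, the pipeline can be sketched in plain Python (map_phase, reduce_phase, and the sample documents are illustrative names, not from the slides):

```python
from collections import Counter

def map_phase(site_id, doc_name, contents):
    # Map: build a per-document word histogram, keyed by site of origin
    return (site_id, Counter(contents.split()))

def reduce_phase(mapped, words=("IBM", "Google"), threshold=5):
    # Reduce/filter: group histograms by site and keep sites having a
    # document with more than `threshold` instances of a target word
    hits = set()
    for site_id, histogram in mapped:
        if any(histogram[w] > threshold for w in words):
            hits.add(site_id)
    return hits

docs = [
    ("site-a", "d1", "IBM " * 6),    # six occurrences of IBM
    ("site-b", "d2", "Google " * 3), # only three occurrences of Google
]
mapped = [map_phase(s, d, c) for s, d, c in docs]
print(reduce_phase(mapped))  # {'site-a'}
```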

(18)

SQL solution

Assume a table Documents of the form: (siteid, docid, word, …)
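The slide stops before giving the query itself. One plausible completion, assuming one Documents row per word occurrence and counting each target word per document, can be checked with sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Documents (siteid TEXT, docid TEXT, word TEXT)")
# One row per word occurrence (illustrative data)
cur.executemany(
    "INSERT INTO Documents VALUES (?, ?, ?)",
    [("site-a", "d1", "IBM")] * 6 + [("site-b", "d2", "Google")] * 3,
)

# Sites with a document containing more than five instances
# of the word 'IBM' or of the word 'Google'
rows = cur.execute("""
    SELECT DISTINCT siteid
    FROM Documents
    WHERE word IN ('IBM', 'Google')
    GROUP BY siteid, docid, word
    HAVING COUNT(*) > 5
""").fetchall()
print(rows)  # [('site-a',)]
```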

(19)


“MapReduce – A major step backwards”

  A giant step backward
    No schemas, Codasyl instead of Relational
  A sub-optimal implementation
    Uses brute-force sequential search instead of indexing
    Materializes O(m·r) intermediate files
    Does not incorporate data skew
  Not novel at all
    Represents a specific implementation of well-known techniques developed nearly 25 years ago
  Missing most common current DBMS features
    Bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views
  Incompatible with DBMS tools
    Report writers, business intelligence tools, data mining

(20)

Architectural Element | Parallel Databases | MapReduce
Schema Support        | Structured | Unstructured
Indexing              | B-Trees or hash based | None
Programming Model     | Relational | Codasyl
Data Distribution     | Projections before aggregation | Logic moved to data, but no optimizations
Execution Strategy    | Push | Pull
Flexibility           | No, but Ruby on Rails, LINQ | Yes
Fault Tolerance       | Transactions have to be restarted in the event of a failure | Yes: Replication,

(21)


MapReduce response

They label as misconceptions:

  MapReduce cannot use indices and implies a full scan of all input data
    Data on each node can be indexed or otherwise partitionable
  MapReduce input and outputs are always simple files in a file system
    No, can be databases, tables, etc.
  MapReduce requires the use of inefficient textual data formats

(22)

  The comparison paper says, "MR is always forced to start a query with a scan of the entire input file."
  MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification. Examples of input specifications are:
    All records in a set of files
    All records with a visit-date in the range
(23)

  Extracting outgoing links from a collection of HTML documents and aggregating by target document
  Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
  Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
  Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
  Fault-tolerant parallel execution of programs written in higher-level languages such as Sawzall and Pig Latin

(24)

“Grep” example

  Scan through a data set of 100-byte records looking for a three-character pattern.
  Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value.
  The search pattern is only found in the last
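A minimal sketch of this task (the pattern, record count, and value encoding here are illustrative; the benchmark's actual data generator differs):

```python
import os

PATTERN = b"XYZ"

def make_record(i, value):
    # 10-byte unique key followed by a 90-byte value = 100-byte record
    return b"%010d" % i + value

# Random values are hex-encoded (lowercase), so the uppercase
# pattern can only appear where we plant it
records = [make_record(i, os.urandom(45).hex().encode()) for i in range(1000)]
records[500] = make_record(500, b"a" * 87 + PATTERN)

# "Grep": scan every record's value for the pattern, report keys
matches = [r[:10] for r in records if PATTERN in r[10:]]
print(matches)  # [b'0000000500']
```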

(25)

Data Loading

  Hadoop
    Command-line utility
  DBMS-X
    LOAD SQL command
    Administrative command to re-organize data

Dataset

  Record = 10-byte key + 90-byte random value
  5.6 million records = 535 MB/node
  Another set = 1 TB/cluster
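As a sanity check, the per-node figure is consistent with the record size: 5.6 million 100-byte records come to about 534 MiB, matching the quoted 535 MB:

```python
records = 5_600_000
record_size = 100  # 10-byte key + 90-byte value
total_mib = records * record_size / 2**20
print(round(total_mib))  # 534
```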

(26)

Grep Task Results

(27)

Select Task Results

SELECT pageURL, pageRank

(28)
(29)

Summary

  DBMS-X was 3.2 times, and Vertica 2.3 times, faster than Hadoop
  Parallel DBMSs win because of:
    B-tree indices to speed the execution of selection operations
    Novel storage mechanisms (e.g., column orientation)
    Aggressive compression techniques with the ability to operate directly on compressed data
    Sophisticated parallel algorithms for querying large amounts of relational data
  Ease of installation and use
  Fault tolerance?
