Big Data
Data-intensive Computing
Methods, Tools, and Applications
(CMSC 34900)
Ian Foster
Computation Institute
SQL Overview
Structured Query Language
The standard language for relational database management systems (RDBMS)
RDBMS: a database management system that manages data as a collection of tables, in which all relationships are represented by common values in related tables
SQL Environment
Catalog: a set of schemas that constitute the description of a database
Schema (or database): the structure that contains descriptions of objects created by a user (base tables, views, constraints)
Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints
Data Manipulation Language (DML): commands that maintain and query a database
Figure 7-4 DDL, DML, DCL, and the database development process
Common SQL Commands
Data Definition Language (DDL): CREATE, DROP, ALTER
Data Manipulation Language (DML): SELECT, UPDATE, INSERT, DELETE
Data Control Language (DCL): GRANT, REVOKE
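As a sketch of how these command families fit together, the following uses Python's built-in sqlite3 module; the CUSTOMER_T table and its columns are made up for illustration, not taken from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define and alter a table (hypothetical schema)
cur.execute("CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT)")
cur.execute("ALTER TABLE CUSTOMER_T ADD COLUMN CITY TEXT")

# DML: maintain and query the table
cur.execute("INSERT INTO CUSTOMER_T VALUES (1, 'Acme', 'Chicago')")
cur.execute("UPDATE CUSTOMER_T SET CITY = 'Boston' WHERE CUSTOMER_ID = 1")
rows = cur.execute("SELECT CUSTOMER_NAME, CITY FROM CUSTOMER_T").fetchall()
print(rows)  # [('Acme', 'Boston')]
cur.execute("DELETE FROM CUSTOMER_T WHERE CUSTOMER_ID = 1")

# DDL: drop the table
cur.execute("DROP TABLE CUSTOMER_T")
```

(SQLite does not implement GRANT/REVOKE, so the DCL commands are omitted here.)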
Internal Schema Definition
Control processing/storage efficiency:
  Choice of indexes
  File organizations for base tables
  File organizations for indexes
  Data clustering
  Statistics maintenance
Creating indexes: speed up random/sequential access to base table data
Example:
  CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)
This makes an index for the CUSTOMER_NAME field of the CUSTOMER_T table
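A minimal sketch of the effect, using Python's sqlite3 (the table contents are invented; EXPLAIN QUERY PLAN and its output wording are SQLite-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT)")
cur.executemany("INSERT INTO CUSTOMER_T VALUES (?, ?)",
                [(i, f"name{i}") for i in range(1000)])

query = "SELECT * FROM CUSTOMER_T WHERE CUSTOMER_NAME = 'name500'"

# Without an index, the lookup scans the whole table.
plan_before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Create the index from the slide; the same lookup becomes an index search.
cur.execute("CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)")
plan_after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[0][-1])  # e.g. a SCAN of CUSTOMER_T
print(plan_after[0][-1])   # e.g. a SEARCH using NAME_IDX
```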
MapReduce
An example problem
We have a large number of documents, each labeled in some way with the name of the site where they occur
Find sites with documents that contain more than five instances of the words “IBM” or “Google”
MapReduce approach
Map: Create a histogram for each document listing frequently occurring words
Reduce: Group documents by their site of origin
Map: Identify documents with more than five instances of “IBM” or “Google”
map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
map(String key, String value)
  // key: (site id + document name)
  // value: document contents
  histogram = CountWords(value);
  EmitIntermediate(...)
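The two phases above can be sketched in plain Python; the function names, toy documents, and in-memory "shuffle" are illustrative stand-ins, not the paper's actual API:

```python
from collections import Counter, defaultdict

def map_phase(site_id, doc_name, contents):
    """Map: emit a per-document word histogram, keyed by site."""
    yield site_id, (doc_name, Counter(contents.split()))

def reduce_phase(site_id, doc_histograms, threshold=5, terms=("IBM", "Google")):
    """Reduce: keep the site if any of its documents passes the threshold."""
    for doc_name, hist in doc_histograms:
        if any(hist[t] > threshold for t in terms):
            return site_id
    return None

# Toy run over two "documents".
docs = [("site-a", "d1", "IBM " * 6), ("site-b", "d2", "IBM Google")]
grouped = defaultdict(list)        # the shuffle: group map output by key
for site, name, text in docs:
    for key, value in map_phase(site, name, text):
        grouped[key].append(value)
hits = [s for s in (reduce_phase(k, v) for k, v in grouped.items()) if s]
print(hits)  # ['site-a']
```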
SQL solution
Assume a table Documents of the form: (siteid, docid, word, …)
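One way to write the query against that table, sketched with sqlite3; the query counts both target words together per document, which is one reading of "more than five instances", and the toy rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Documents (siteid TEXT, docid TEXT, word TEXT)")
# Toy data: one word per row, matching the assumed (siteid, docid, word) layout.
rows = [("site-a", "d1", "IBM")] * 6 + [("site-b", "d2", "Google")]
cur.executemany("INSERT INTO Documents VALUES (?, ?, ?)", rows)

result = cur.execute("""
    SELECT DISTINCT siteid
    FROM Documents
    WHERE word IN ('IBM', 'Google')
    GROUP BY siteid, docid
    HAVING COUNT(*) > 5
""").fetchall()
print(result)  # [('site-a',)]
```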
“MapReduce – A major step backwards”
A giant step backward:
  No schemas; Codasyl instead of relational
A sub-optimal implementation:
  Uses brute-force sequential search instead of indexing
  Materializes O(m·r) intermediate files
  Does not take data skew into account
Not novel at all:
  Represents a specific implementation of well-known techniques developed nearly 25 years ago
Missing most common current DBMS features:
  Bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views
Incompatible with DBMS tools:
  Report writers, business intelligence tools, data mining
Architectural Element  | Parallel Databases                                           | MapReduce
Schema support         | Structured                                                   | Unstructured
Indexing               | B-trees or hash-based                                        | None
Programming model      | Relational                                                   | Codasyl
Data distribution      | Projections before aggregation                               | Logic moved to data, but no optimizations
Execution strategy     | Push                                                         | Pull
Flexibility            | No, but Ruby on Rails, LINQ                                  | Yes
Fault tolerance        | Transactions have to be restarted in the event of a failure | Yes: replication, ...
MapReduce response
They label as misconceptions:
  MapReduce cannot use indices and implies a full scan of all input data
    (Data on each node can be indexed or otherwise partitioned)
  MapReduce inputs and outputs are always simple files in a file system
    (No: they can be databases, tables, etc.)
  MapReduce requires the use of inefficient textual data formats
The comparison paper says, "MR is always forced to start a query with a scan of the entire input file." MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification. Examples of input specifications:
  All records in a set of files
  All records with a visit-date in the range
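The input-interface point can be sketched as generators that yield only the records an input specification selects; the function names and record format below are illustrative, not Google's actual API:

```python
from datetime import date

def records_in_files(paths):
    """Input spec: all records (here, lines) in a set of files."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

def records_in_date_range(records, start, end):
    """Input spec: only records whose visit-date falls in [start, end].
    Assumes each record begins with an ISO date, e.g. '2024-01-05 /home'."""
    for rec in records:
        if start <= date.fromisoformat(rec.split()[0]) <= end:
            yield rec

# Toy run: only the January record reaches the map function.
recs = ["2024-01-05 /home", "2024-03-01 /about"]
hits = list(records_in_date_range(recs, date(2024, 1, 1), date(2024, 1, 31)))
print(hits)  # ['2024-01-05 /home']
```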
Extracting outgoing links from a collection of HTML documents and aggregating by target document
Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
Fault-tolerant parallel execution of programs written in higher-level languages such as Sawzall and Pig Latin
“Grep” example
Scan through a data set of 100-byte records looking for a three-character pattern.
Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value.
The search pattern is only found in the last 90 bytes, once in every 10,000 records.
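A toy version of the task in Python, with the record layout from the slide; the synthetic data and planted pattern are invented for the demo (the pattern is uppercase while values are lowercase, so only the planted record matches):

```python
import random
import string

random.seed(0)

def make_record(key_num):
    """100-byte record: 10-byte zero-padded key + 90-byte random value."""
    key = f"{key_num:010d}"
    value = "".join(random.choices(string.ascii_lowercase, k=90))
    return key + value

def grep(records, pattern):
    """Brute-force scan: return keys of records whose value contains pattern."""
    return [r[:10] for r in records if pattern in r[10:]]

records = [make_record(i) for i in range(1000)]
# Plant the three-character pattern inside one record's value.
records[42] = records[42][:10] + "XYZ" + records[42][13:]
print(grep(records, "XYZ"))  # ['0000000042']
```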
Data Loading
Hadoop: command-line utility
DBMS-X: LOAD SQL command; administrative command to re-organize data
Dataset: record = 10-byte key + 90-byte random value; 5.6 million records = 535 MB/node; another set = 1 TB/cluster

Grep Task Results
Select Task Results
SELECT pageURL, pageRank
Summary
DBMS-X 3.2 times and Vertica 2.3 times faster than Hadoop
Parallel DBMSs win because of:
  B-tree indices to speed the execution of selection operations
  novel storage mechanisms (e.g., column orientation)
  aggressive compression techniques, with the ability to operate directly on compressed data
  sophisticated parallel algorithms for querying large amounts of relational data
Hadoop: ease of installation and use; fault tolerance?