Big Data
Data-intensive Computing
Methods, Tools, and Applications
(CMSC 34900)
Ian Foster
Computation Institute
SQL Overview
Structured Query Language
The standard language for relational database management systems (RDBMS)
RDBMS: a database management system that manages data as a collection of tables, in which all relationships are represented by common values in related tables
SQL Environment
Catalog: a set of schemas that constitute the description of a database
Schema (or database): the structure that contains descriptions of objects created by a user (base tables, views, constraints)
Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints
Data Manipulation Language (DML): commands that maintain and query a database
Figure 7-4 DDL, DML, DCL, and the database development process
Common SQL Commands
Data Definition Language (DDL): CREATE, DROP, ALTER
Data Manipulation Language (DML): SELECT, UPDATE, INSERT, DELETE
Data Control Language (DCL): GRANT, REVOKE
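As a sketch of how these command families fit together, the following uses Python's built-in sqlite3 module; the CUSTOMER_T table and its columns are made up for illustration, not taken from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define and alter a table (hypothetical schema)
cur.execute("CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT)")
cur.execute("ALTER TABLE CUSTOMER_T ADD COLUMN CITY TEXT")

# DML: maintain and query the table
cur.execute("INSERT INTO CUSTOMER_T VALUES (1, 'Acme', 'Chicago')")
cur.execute("UPDATE CUSTOMER_T SET CITY = 'Boston' WHERE CUSTOMER_ID = 1")
rows = cur.execute("SELECT CUSTOMER_NAME, CITY FROM CUSTOMER_T").fetchall()
print(rows)  # [('Acme', 'Boston')]
cur.execute("DELETE FROM CUSTOMER_T WHERE CUSTOMER_ID = 1")

# DDL: drop the table
cur.execute("DROP TABLE CUSTOMER_T")
```

(SQLite does not implement GRANT/REVOKE, so the DCL commands are omitted here.)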
Internal Schema Definition
Control processing/storage efficiency:
  Choice of indexes
  File organizations for base tables
  File organizations for indexes
  Data clustering
  Statistics maintenance
Creating indexes: speed up random/sequential access to base table data
Example:
  CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)
This makes an index for the CUSTOMER_NAME field of the CUSTOMER_T table
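A minimal sketch of the effect, using Python's sqlite3 (the table contents are invented; EXPLAIN QUERY PLAN and its output wording are SQLite-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE CUSTOMER_T (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT)")
cur.executemany("INSERT INTO CUSTOMER_T VALUES (?, ?)",
                [(i, f"name{i}") for i in range(1000)])

query = "SELECT * FROM CUSTOMER_T WHERE CUSTOMER_NAME = 'name500'"

# Without an index, the lookup scans the whole table.
plan_before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Create the index from the slide; the same lookup becomes an index search.
cur.execute("CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)")
plan_after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before[0][-1])  # e.g. a SCAN of CUSTOMER_T
print(plan_after[0][-1])   # e.g. a SEARCH using NAME_IDX
```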
MapReduce
An example problem
We have a large number of documents, each labeled in some way with the name of the site where they occur
Find sites with documents that contain more than five instances of the words “IBM” or “Google”
MapReduce approach
Map: Create a histogram for each document listing frequently occurring words
Reduce: Group documents by their site of origin
Map: Identify documents with more than five instances of “IBM” or “Google”
map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
map(String key, String value)
  // key: (site id + document name)
  // value: document contents
  histogram = CountWords(value);
  EmitIntermediate(...)
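The two phases above can be sketched in plain Python; the function names, toy documents, and in-memory "shuffle" are illustrative stand-ins, not the paper's actual API:

```python
from collections import Counter, defaultdict

def map_phase(site_id, doc_name, contents):
    """Map: emit a per-document word histogram, keyed by site."""
    yield site_id, (doc_name, Counter(contents.split()))

def reduce_phase(site_id, doc_histograms, threshold=5, terms=("IBM", "Google")):
    """Reduce: keep the site if any of its documents passes the threshold."""
    for doc_name, hist in doc_histograms:
        if any(hist[t] > threshold for t in terms):
            return site_id
    return None

# Toy run over two "documents".
docs = [("site-a", "d1", "IBM " * 6), ("site-b", "d2", "IBM Google")]
grouped = defaultdict(list)        # the shuffle: group map output by key
for site, name, text in docs:
    for key, value in map_phase(site, name, text):
        grouped[key].append(value)
hits = [s for s in (reduce_phase(k, v) for k, v in grouped.items()) if s]
print(hits)  # ['site-a']
```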
SQL solution
Assume a table Documents of the form: (siteid, docid, word, …)
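One way to write the query against that table, sketched with sqlite3; the query counts both target words together per document, which is one reading of "more than five instances", and the toy rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Documents (siteid TEXT, docid TEXT, word TEXT)")
# Toy data: one word per row, matching the assumed (siteid, docid, word) layout.
rows = [("site-a", "d1", "IBM")] * 6 + [("site-b", "d2", "Google")]
cur.executemany("INSERT INTO Documents VALUES (?, ?, ?)", rows)

result = cur.execute("""
    SELECT DISTINCT siteid
    FROM Documents
    WHERE word IN ('IBM', 'Google')
    GROUP BY siteid, docid
    HAVING COUNT(*) > 5
""").fetchall()
print(result)  # [('site-a',)]
```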
“MapReduce – A major step backwards”
A giant step backward:
  No schemas; Codasyl instead of relational
A sub-optimal implementation:
  Uses brute-force sequential search instead of indexing
  Materializes O(m·r) intermediate files
  Does not take data skew into account
Not novel at all:
  Represents a specific implementation of well-known techniques developed nearly 25 years ago
Missing most common current DBMS features:
  Bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views
Incompatible with DBMS tools:
  Report writers, business intelligence tools, data mining
Architectural Element  | Parallel Databases                                           | MapReduce
Schema support         | Structured                                                   | Unstructured
Indexing               | B-trees or hash-based                                        | None
Programming model      | Relational                                                   | Codasyl
Data distribution      | Projections before aggregation                               | Logic moved to data, but no optimizations
Execution strategy     | Push                                                         | Pull
Flexibility            | No, but Ruby on Rails, LINQ                                  | Yes
Fault tolerance        | Transactions have to be restarted in the event of a failure | Yes: replication, ...
MapReduce response
They label as misconceptions:
  MapReduce cannot use indices and implies a full scan of all input data
    (Data on each node can be indexed or otherwise partitioned)
  MapReduce inputs and outputs are always simple files in a file system
    (No: they can be databases, tables, etc.)
  MapReduce requires the use of inefficient textual data formats
The comparison paper says, "MR is always forced to start a query with a scan of the entire input file." MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification. Examples of input specifications:
  All records in a set of files
  All records with a visit-date in the range
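The input-interface point can be sketched as generators that yield only the records an input specification selects; the function names and record format below are illustrative, not Google's actual API:

```python
from datetime import date

def records_in_files(paths):
    """Input spec: all records (here, lines) in a set of files."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")

def records_in_date_range(records, start, end):
    """Input spec: only records whose visit-date falls in [start, end].
    Assumes each record begins with an ISO date, e.g. '2024-01-05 /home'."""
    for rec in records:
        if start <= date.fromisoformat(rec.split()[0]) <= end:
            yield rec

# Toy run: only the January record reaches the map function.
recs = ["2024-01-05 /home", "2024-03-01 /about"]
hits = list(records_in_date_range(recs, date(2024, 1, 1), date(2024, 1, 31)))
print(hits)  # ['2024-01-05 /home']
```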
Extracting outgoing links from a collection of HTML documents and aggregating by target document
Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
Fault-tolerant parallel execution of programs written in higher-level languages such as Sawzall and Pig Latin
“Grep” example
Scan through a data set of 100-byte records looking for a three-character pattern.
Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value.
The search pattern is only found in the last 90 bytes, once in every 10,000 records.
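A toy version of the task in Python, with the record layout from the slide; the synthetic data and planted pattern are invented for the demo (the pattern is uppercase while values are lowercase, so only the planted record matches):

```python
import random
import string

random.seed(0)

def make_record(key_num):
    """100-byte record: 10-byte zero-padded key + 90-byte random value."""
    key = f"{key_num:010d}"
    value = "".join(random.choices(string.ascii_lowercase, k=90))
    return key + value

def grep(records, pattern):
    """Brute-force scan: return keys of records whose value contains pattern."""
    return [r[:10] for r in records if pattern in r[10:]]

records = [make_record(i) for i in range(1000)]
# Plant the three-character pattern inside one record's value.
records[42] = records[42][:10] + "XYZ" + records[42][13:]
print(grep(records, "XYZ"))  # ['0000000042']
```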
Data Loading
Hadoop: command-line utility
DBMS-X: LOAD SQL command; administrative command to re-organize data
Dataset: record = 10-byte key + 90-byte random value; 5.6 million records = 535 MB/node; another set = 1 TB/cluster

Grep Task Results
Select Task Results
SELECT pageURL, pageRank
Summary
DBMS-X 3.2 times and Vertica 2.3 times faster than Hadoop
Parallel DBMSs win because of:
  B-tree indices to speed the execution of selection operations
  novel storage mechanisms (e.g., column orientation)
  aggressive compression techniques, with the ability to operate directly on compressed data
  sophisticated parallel algorithms for querying large amounts of relational data
Hadoop: ease of installation and use; fault tolerance?