Oracle Big Data SQL Architectural Deep Dive

(1)

(2)

Oracle Big Data SQL

Architectural Deep Dive

Dan McClary, Ph.D.

Big Data Product Management

Oracle

(3)

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

(4)

Agenda

The Data Analytics Challenge

Why Unified Query Matters

SQL on Hadoop and More: Unifying Metadata

Query Franchising: Smart Scan for Hadoop

Oracle Confidential – Internal/Restricted/Highly Restricted

(5)

Data Analytics Challenge

Separate silos of information to analyze

(6)

Data Analytics Challenge

Separate data access interfaces

(7)

SQL on Hadoop is Obvious

(8)

Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL

(9)

Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

(10)

What Does Unified Query Mean for You?

[Figure: before and after comparison; labels: Data Science, PhD, ???]

(11)

What Does Unified Query Mean for You?

[Figure: before and after comparison]

(12)

Use Rich Oracle SQL Dialect Over All Data

Snapshot of Oracle SQL Analytic Functions

• Ranking functions

– rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
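The linear regression bullet above pairs least-squares fitting with the covariance functions. A minimal Java sketch of that relationship, assuming the usual identity that the least-squares slope equals the population covariance of y and x divided by the population variance of x. The class and method names (OlsSketch, covarPop, slope, intercept) are illustrative only and are not Oracle APIs.

```java
// Sketch: ordinary least squares derived from population covariance and variance.
// All names here are hypothetical helpers for illustration.
final class OlsSketch {
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) sum += d;
        return sum / v.length;
    }

    // Population covariance of y and x (the statistic a COVAR_POP-style function reports).
    static double covarPop(double[] y, double[] x) {
        double my = mean(y), mx = mean(x), sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - mx) * (y[i] - my);
        return sum / x.length;
    }

    // Slope of the least-squares line y = slope * x + intercept.
    static double slope(double[] y, double[] x) {
        // covarPop(x, x) is the population variance of x.
        return covarPop(y, x) / covarPop(x, x);
    }

    // The fitted line always passes through the point (mean(x), mean(y)).
    static double intercept(double[] y, double[] x) {
        return mean(y) - slope(y, x) * mean(x);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9}; // points lying exactly on y = 2x + 1
        System.out.println("slope=" + slope(y, x) + " intercept=" + intercept(y, x));
    }
}
```

In SQL the same fit is a single aggregate call per statistic, which is the point of the slide: the database already ships these building blocks.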

(13)

        next = lineNext.getQuantity();
    }

    if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) {
        state = "S";
        return state;
    }
    if (gt(q, prev) && gt(q, next)) {
        state = "T";
        return state;
    }
    if (lt(q, prev) && lt(q, next)) {
        state = "B";
        return state;
    }
    if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) {
        state = "E";
        return state;
    }
    if (q.isEmpty() || eq(q, prev)) {
        state = "F";
        return state;
    }

    private boolean eq(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        return a.equals(b);
    }

    private boolean gt(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        return Double.parseDouble(a) > Double.parseDouble(b);
    }

    private boolean lt(String a, String b) {
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        return Double.parseDouble(a) < Double.parseDouble(b);
    }

    public String getState() {
        return this.state;
    }
}

BagFactory bagFactory = BagFactory.getInstance();

@Override
public Tuple exec(Tuple input) throws IOException {
    long c = 0;
    String line = "";
    String pbkey = "";
    V0Line nextLine;
    V0Line thisLine;
    V0Line processLine;
    V0Line evalLine = null;
    V0Line prevLine;
    boolean noMoreValues = false;
    String matchList = "";
    ArrayList<V0Line> lineFifo = new ArrayList<V0Line>();
    boolean finished = false;
    DataBag output = bagFactory.newDefaultBag();

    if (input == null) {
        return null;
    }
    if (input.size() == 0) {
        return null;
    }
    Object o = input.get(0);
    if (o == null) {
        return null;
    }
    //Object o = input.get(0);
    if (!(o instanceof DataBag)) {
        int errCode = 2114;
        String msg = "Expected input to be DataBag, but"

Pattern Matching With Oracle SQL

Snapshot of Oracle SQL Analytic Functions

Simplified, sophisticated, standards-based syntax

SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY name ORDER BY time
  MEASURES FIRST(x.time) AS first_x,
           LAST(z.time) AS last_z
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price < PREV(price)),
         Y AS (price > PREV(price)),
         W AS (price < PREV(price)),
         Z AS (price > PREV(price) AND
               z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of SQL

20x less code

[Figure: Finding Patterns in Stock Market Data - Double Bottom (W); time axis from 10:00 to 10:25]
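To make the PATTERN (X+ Y+ W+ Z+) clause concrete, here is a small Java sketch of the same idea: label each row as falling or rising relative to the previous row, then look for one or more falls, rises, falls, and rises in sequence. It is a simplified illustration, not how Oracle evaluates MATCH_RECOGNIZE. It ignores the PARTITION BY clause and the time-window condition on Z, and the names DoubleBottomSketch, encodeMoves, and firstDoubleBottom are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubleBottomSketch {
    // Encode every row after the first as D (price fell versus the previous row),
    // U (price rose), or F (price unchanged).
    static String encodeMoves(double[] prices) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < prices.length; i++) {
            if (prices[i] < prices[i - 1]) sb.append('D');
            else if (prices[i] > prices[i - 1]) sb.append('U');
            else sb.append('F');
        }
        return sb.toString();
    }

    // X and W are "falling" rows, Y and Z are "rising" rows, so X+ Y+ W+ Z+ becomes D+U+D+U+.
    private static final Pattern W_SHAPE = Pattern.compile("D+U+D+U+");

    // Returns {firstRow, lastRow} as 0-based row indexes of the first W shape, or null if none.
    static int[] firstDoubleBottom(double[] prices) {
        Matcher m = W_SHAPE.matcher(encodeMoves(prices));
        if (!m.find()) return null;
        // Move i describes row i + 1, so the match covers rows start + 1 through end.
        return new int[] { m.start() + 1, m.end() };
    }

    public static void main(String[] args) {
        double[] prices = {10, 9, 8, 9, 10, 9, 8, 9, 11};
        int[] match = firstDoubleBottom(prices);
        System.out.println(match == null ? "no match" : "rows " + match[0] + ".." + match[1]);
    }
}
```

Even this stripped-down version needs an encoding step plus a matcher, while the SQL states the pattern and its measures declaratively in one clause.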

(14)

Oracle Big Data SQL – A New Architecture

• Powerful, high-performance SQL on Hadoop
– Full Oracle SQL capabilities on Hadoop
– SQL query processing local to Hadoop nodes
• Simple data integration of Hadoop and Oracle Database
– Single SQL point-of-entry to access all data
– Scalable joins between Hadoop and RDBMS data
• Optimized hardware
– Balanced Configurations
– No bottlenecks

(15)

Want to know what this really means?

(16)

SQL on Hadoop and More: Unifying Metadata

(17)

Why Unify Metadata?

Catalog

CUSTOMERS

SALES

CREATE TABLE customers…

CREATE TABLE sales…

SELECT customers.name, sales.amount

SELECT name FROM customers

customers

sales

Query across sources

Integrate new metadata

• No changes for users and applications

• Seamlessly handle schema-on-read

• Exploit remote data distribution

(18)

How Data is Stored in Hadoop

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB File

Block B1

Block B2

Block B3

• 1 block = 256 MB

• Example File = 4096 blocks
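The block count on this slide follows from a ceiling division of file size by block size. A short Java sketch of the arithmetic, assuming binary units (1 TB = 1024 x 1024 MB, 1 MB = 1024 x 1024 bytes); the class name BlockMath and its method are illustrative only.

```java
final class BlockMath {
    static final long MB = 1024L * 1024L;        // bytes in one megabyte (binary units assumed)
    static final long TB = MB * 1024L * 1024L;   // bytes in one terabyte (binary units assumed)

    // Number of blocks needed to hold fileBytes, rounding a final partial block up.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        // 1 TB file stored in 256 MB blocks: 2^40 / 2^28 = 2^12 = 4096 blocks.
        System.out.println(blockCount(TB, 256 * MB));
    }
}
```

Each of those blocks can be read independently, which is why the block layout matters for how much of a scan can run in parallel.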

(19)

How MapReduce and Hive Read Data

[Diagram: Data Node (disk), SCAN, Create ROWS & COLUMNS, Consumer]

• Scan and row creation need to be able to work on “any” data format
• Data definitions and column deserializations are needed to provide a table

RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
SerDe => Makes columns
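The division of labor among the three components can be sketched in plain Java. The code below uses toy stand-ins that only mirror the roles named on the slide; they are not the Hadoop or Hive interfaces, and the class ReadPathSketch, its method names, and the regex-based parser are assumptions made for illustration. The parser handles only flat JSON objects with simple values, like the activity records shown earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-ins for the three roles; NOT the real Hadoop or Hive APIs.
final class ReadPathSketch {
    // "InputFormat" role: cut the input into splits that could be scanned in parallel.
    static List<List<String>> splits(List<String> lines, int linesPerSplit) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            out.add(lines.subList(i, Math.min(i + linesPerSplit, lines.size())));
        }
        return out;
    }

    // Matches "key":value where value is a quoted string or a bare token (number or null).
    private static final Pattern FIELD = Pattern.compile("\"(\\w+)\":(\"[^\"]*\"|[^,}]+)");

    // "SerDe" role: turn one raw record into named columns.
    static Map<String, String> deserialize(String record) {
        Map<String, String> row = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(record);
        while (m.find()) {
            String raw = m.group(2);
            if (raw.startsWith("\"")) {
                row.put(m.group(1), raw.substring(1, raw.length() - 1)); // strip the quotes
            } else {
                row.put(m.group(1), raw.equals("null") ? null : raw);    // bare null becomes a missing value
            }
        }
        return row;
    }

    // "RecordReader" role: scan one split record by record and hand each to the SerDe.
    static List<Map<String, String>> scan(List<String> split) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String record : split) rows.add(deserialize(record));
        return rows;
    }

    public static void main(String[] args) {
        List<String> file = List.of(
            "{\"custId\":1185972,\"movieId\":null,\"activity\":8}",
            "{\"custId\":1354924,\"movieId\":1948,\"activity\":7}",
            "{\"custId\":1083711,\"movieId\":null,\"activity\":9}");
        for (List<String> split : splits(file, 2)) {
            System.out.println(scan(split));
        }
    }
}
```

The point of the separation is that only the deserialization step knows the data format, so swapping in a different parser changes the columns produced without touching how the input is split or scanned.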

(20)

SQL-on-Hadoop Engines Share Metadata, not MapReduce

Hive Metastore

Hive

Impala

SparkSQL