• No results found

Oracle Big Data SQL Architectural Deep Dive

N/A
N/A
Protected

Academic year: 2021

Share "Oracle Big Data SQL Architectural Deep Dive"

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL

Architectural Deep Dive

Dan McClary, Ph.D.

Big Data Product Management

Oracle

(3)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for

information purposes only, and may not be incorporated into any contract. It is not a

commitment to deliver any material, code, or functionality, and should not be relied upon

in making purchasing decisions. The development, release, and timing of any features or

functionality described for Oracle’s products remains at the sole discretion of Oracle.

(4)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Agenda

1

2

3

The Data Analytics Challenge

Why Unified Query Matters

SQL on Hadoop and More: Unifying Metadata

Query Franchising: Smart Scan for Hadoop

Oracle Confidential – Internal/Restricted/Highly Restricted 4

(5)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data Analytics Challenge

Separate silos of information to analyze

(6)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data Analytics Challenge

Separate data access interfaces

(7)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

SQL on Hadoop is Obvious

Oracle Confidential – Internal/Restricted/Highly Restricted 7

(8)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Data Analytics Challenge

No comprehensive SQL interface across Oracle, Hadoop and NoSQL

(9)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

9

(10)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Before

After

What Does Unified Query Mean for You?

Data Science

PhD

???

(11)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Before

After

What Does Unified Query Mean for You?

(12)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Use Rich Oracle SQL Dialect Over

All

Data

Snapshot of Oracle SQL Analytic Functions

• Ranking functions

– rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median,

quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

(13)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

next = lineNext.getQuantity(); }

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";

return state; }

if (gt(q, prev) && gt(q, next)) { state = "T";

return state; }

if (lt(q, prev) && lt(q, next)) { state = "B";

return state; }

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";

return state; }

if (q.isEmpty() || eq(q, prev)) { state = "F";

return state; }

return state; }

private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return a.equals(b); }

private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) > Double.parseDouble(b); }

private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) < Double.parseDouble(b); }

public String getState() { return this.state; }

}

BagFactory bagFactory = BagFactory.getInstance(); @Override

public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;

boolean noMoreValues = false; String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;

DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;

String msg = "Expected input to be DataBag, but"

Pattern Matching With Oracle SQL

Snapshot of Oracle SQL Analytic Functions

Simplified, sophisticated, standards based syntax

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 13

10:00 10:05 10:10 10:15 10:20 10:25

(14)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL – A New Architecture

Powerful, high-performance SQL on Hadoop

Full Oracle SQL capabilities on Hadoop

SQL query processing local to Hadoop nodes

Simple data integration of Hadoop and Oracle Database

Single SQL point-of-entry to access all data

Scalable joins between Hadoop and RDBMS data

Optimized hardware

Balanced Configurations

No bottlenecks

(15)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Want to know what this really

means.

(16)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

SQL on Hadoop and More: Unifying

Metadata

(17)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Why Unify Metadata?

Catalog

CUSTOMERS

SALES

CREATE TABLE

customers…

CREATE TABLE

sales…

SELECT customers.name, sales.amount

SELECT name FROM customers

customers

sales

Query across sources  Integrate new metadata

No changes for users and applications

Seamlessly handle schema-on-read

Exploit remote data distribution

(18)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

How Data is Stored in Hadoop

Oracle Confidential – Internal/Restricted/Highly Restricted 18 {"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB File

Block B1

Block B2

Block B3

• 1 block = 256 MB

• Example File = 4096 blocks

(19)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

How MapReduce and Hive Read Data

19

Data Node

disk

Consumer

SCAN

Create

ROWS

&

COLUMNS

Scan and row creation needs to be able to

work on “any” data format

Data definitions and column deserializations

are needed to provide a table

RecordReader

=> Scans data (keys and values)

InputFormat

=> Defines parallelism

SerDe

=> Makes columns

(20)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

SQL-on-Hadoop Engines Share Metadata, not MapReduce

Hive Metastore

Oracle Confidential – Internal/Restricted/Highly Restricted 20

Hive Metastore

Hive

Impala

SparkSQL

Oracle Big Data SQL

Table Definitions:

movieapp_log_json

Tweets

avro_log

(21)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 21

Extend Oracle External Tables

CREATE TABLE movielog (

click VARCHAR2(4000))

ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

))

REJECT LIMIT UNLIMITED;

New types of external tables

ORACLE_HIVE (inherit metadata)

ORACLE_HDFS (specify metadata)

Access parameters for Big Data

Hadoop cluster

Remote Hive database/table

(22)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 22

Enhance Oracle External Tables

Transparent schema-for-read

Use fast C-based readers when possible

Use native Hadoop classes otherwise

Engineered to understand parallelism

Map external units of parallelism to Oracle

Architected for extensibility

StorageHandler capability enables future support

for other data sources

Examples: MongoDB, HBase, Oracle NoSQL DB

CREATE TABLE ORDER (

cust_num VARCHAR2(10),

order_num VARCHAR2(20),

order_total NUMBER(8,2))

ORGANIZATION EXTERNAL (

TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

PARALLEL 20

(23)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

(24)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

CUSTOMERS

SELECT name, SUM(purchase)

FROM customers

GROUP BY name;

What Can Big Data Learn from Exadata?

Intelligent Storage Maximizes Performance

Oracle Exadata Storage Server Oracle Exadata Storage Server

Oracle SQL query issued

• Plan constructed

• Query executed

1

Smart Scan Works on Storage

• Filter out unneeded rows

• Project only queried columns

• Score data models

• Bloom filters to speed up joins

(25)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Query Franchising

– dispatch of query processing to

self-similar compute agents on disparate systems

without loss of operational fidelity

(26)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Storage Layer

Oracle Confidential – Internal/Restricted/Highly Restricted 26

Big Data SQL Server: A New Hadoop Processing Engine

Filesystem (HDFS)

NoSQL Databases

(Oracle NoSQL DB, Hbase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduce

and Hive

Spark

Impala

Search

Big Data

SQL

(27)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Smart Scan for Hadoop: Optimizing Performance

Oracle Confidential – Internal/Restricted/Highly Restricted 27

Data Node

Disk

Big Data SQL Server

External Table Services

Smart Scan

“Oracle on top”

Apply filter predicates

Project columns

Parse semi-structured data

“Hadoop on the bottom”

Work close to the data

Schema-on-read with Hadoop classes

(28)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

B B B

How do we query Hadoop?

Big Data SQL Query Execution

HDFS Data Node BDS Server HDFS Data Node BDS Server

Query compilation determines:

• Data locations

• Data structure

• Parallelism

1

Fast reads using Big Data SQL Server

• Schema-for-read using Hadoop classes

• Smart Scan selects only relevant data

2

Process filtered result

• Move relevant data to database

• Join with database tables

• Apply database security policies

3

Hive Metastore HDFS

NameNode

1

(29)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Parallel Query and Hadoop

Mapping Hadoop to Oracle

B B B

Hive Metastore HDFS

NameNode

Determine Hadoop Parallelism

• Determine schema-for-read

• Determine InputSplits

• Arrange splits for best performance

1

Map to Oracle Parallelism

• Map splits to granules

• Assign granules to PX Servers

2

PX Servers Route Work

• Offload work to Big Data SQL Servers

• Aggregate

• Join

• Apply PL/SQL

3

1

2

PX

InputSplits

(30)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Big Data SQL Server Dataflow

Disks

Data Node

Big Data SQL Server

External Table Services

Smart Scan

Read data from HDFS Data Node

• Direct-path reads

• C-based readers when possible

• Use native Hadoop classes otherwise

1

Translate bytes to Oracle

2

Apply Smart Scan to Oracle bytes

• Apply filters

• Project Columns

• Parse JSON/XML

• Score models

3

RecordReader

SerDe

1011001

0

1011001

0

10110010

1

2

3

(31)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

But How Does Security Work?

B B B

Database security for query access

• Virtual Private Databases

• Redaction

• Audit Vault and Database Firewall

1

Hadoop security for Hadoop jobs

• Kerberos Authentication

• Apache Sentry (RBAC)

• Audit Vault

2

System-specific encryption

• Database tablespace encryption

• BDA On-disk Encryption

3

SELECT * FROM my_bigdata_table

WHERE SALES_REP_ID =

SYS_CONTEXT('USERENV','SESSION_USER');

Filter on

SESSION_USER

DBMS_REDACT.ADD_POLICY(

object_schema => 'MCLICK',

object_name => 'TWEET_V',

column_name => 'USERNAME',

policy_name => 'tweet_redaction',

function_type => DBMS_REDACT.PARTIAL,

function_parameters =>

'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',

expression => '1=1'

);

***

(32)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

SQL, Everywhere

(33)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Move less data  Go faster

More Lessons from Exadata

Storage Indexes

• Skip reads on irrelevant data

• Big Hadoop Blocks ~ Big Speed Up

1

Caching

• Cache frequently accessed columns

• HDFS Caching

(34)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data Management System

Rich, comprehensive SQL access to all enterprise data

34

(35)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data Management System

Unite Information Lifecycles

35

NoSQL

Automatic ILM

• Roll off cold partitions to Hadoop

• Promote hot data to Oracle

Shared REST APIs

• App-embedded schema NoSQL

• Shared schema Oracle

(36)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data Management System

Unify All Query

36

(37)
(38)
(39)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Big Data SQL

39

SELECT w.sess_id,w.cust_id,c.name

FROM web_logs w, customers c

WHERE w.source_country = ‘Brazil’

AND c.customer_id = w.cust_id

Relevant SQL runs on BDA nodes

10’s of Gigabytes of Data

Only

columns and rows needed to

answer query are returned

Hadoop Cluster

B B B

Big Data SQL

Oracle Database

CUSTOMERS

WEB_LOGS

References

Related documents

Since 1996 Wärtsilä has produced a standard family of transverse thrusters (controllable, CT, and fixed pitch, FT) in the power range up to 3300 kW.. Many operators rely on

It has been recognized that theories for describing the states of stress and failure in unsaturated soil require consideration of the thermodynamic properties of the pore water in

Similarly, comfort and travel time are valued higher by commuters from zones close to CBD (i.e., within 5 km to the CBD) than those from city peripherals. It was, how- ever,

Dick Baker is on the National Coach of the Year selection committee, 4 coaches from Section 1 were nominated, none from the region were selected nationally, but need more

In addition, wasta (connections) is used extensively within Jordanian bureaucracy to create advantages for oneself and relatives (T. Al- Masri). In this way,

From 1980 to 1991, the existence of single editions of the regional newspapers in the major cities justifies the concentration of records in the central region of

Full-scale dynamic analysis of an innovative rockfall fence under impact using the discrete element method: from the local scale to the structure scale.. Full-scale dynamic analysis

The rock fall hazard may be defined as the probability of a rock fall of a given magnitude (or kinetic energy) reaching the element at risk, which can be expressed as the probability