Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL
Architectural Deep Dive
Dan McClary, Ph.D.
Big Data Product Management
Oracle
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Agenda
1
2
3
The Data Analytics Challenge
Why Unified Query Matters
SQL on Hadoop and More: Unifying Metadata
Query Franchising: Smart Scan for Hadoop
Oracle Confidential – Internal/Restricted/Highly Restricted 4
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Data Analytics Challenge
Separate silos of information to analyze
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Data Analytics Challenge
Separate data access interfaces
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
SQL on Hadoop is Obvious
Oracle Confidential – Internal/Restricted/Highly Restricted 7
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Data Analytics Challenge
No comprehensive SQL interface across Oracle, Hadoop and NoSQL
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data
9
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Before
After
What Does Unified Query Mean for You?
Data Science
PhD
???
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Before
After
What Does Unified Query Mean for You?
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Use Rich Oracle SQL Dialect Over
All
Data
Snapshot of Oracle SQL Analytic Functions
• Ranking functions
– rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
– Avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
– Direct inter-row reference using offsets
• Reporting Aggregate functions
– Sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
– Correlation, linear regression family, covariance
• Linear regression
– Fitting of an ordinary-least-squares regression line to a set of number pairs.
– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
• Descriptive Statistics
– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median,
quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).
• Cross Tabs
– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
next = lineNext.getQuantity(); }
if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";
return state; }
if (gt(q, prev) && gt(q, next)) { state = "T";
return state; }
if (lt(q, prev) && lt(q, next)) { state = "B";
return state; }
if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";
return state; }
if (q.isEmpty() || eq(q, prev)) { state = "F";
return state; }
return state; }
private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return a.equals(b); }
private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) > Double.parseDouble(b); }
private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) < Double.parseDouble(b); }
public String getState() { return this.state; }
}
BagFactory bagFactory = BagFactory.getInstance(); @Override
public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;
boolean noMoreValues = false; String matchList = "";
ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;
DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;
String msg = "Expected input to be DataBag, but"
Pattern Matching With Oracle SQL
Snapshot of Oracle SQL Analytic Functions
Simplified, sophisticated, standards based syntax
SELECT first_x, last_z
FROM ticker MATCH_RECOGNIZE (
PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price) AND
z.time - FIRST(x.time) <= 7 ))
250+ Lines of Java UDF
12 Lines of SQL
20x less code
Finding Patterns in Stock Market Data - Double Bottom (W)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 13
10:00 10:05 10:10 10:15 10:20 10:25
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL – A New Architecture
•
Powerful, high-performance SQL on Hadoop
–
Full Oracle SQL capabilities on Hadoop
–
SQL query processing local to Hadoop nodes
•
Simple data integration of Hadoop and Oracle Database
–
Single SQL point-of-entry to access all data
–
Scalable joins between Hadoop and RDBMS data
•
Optimized hardware
–
Balanced Configurations
–
No bottlenecks
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Want to know what this really
means.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
SQL on Hadoop and More: Unifying
Metadata
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Why Unify Metadata?
Catalog
CUSTOMERS
SALES
CREATE TABLE
customers…
CREATE TABLE
sales…
SELECT customers.name, sales.amount
SELECT name FROM customers
customers
sales
Query across sources Integrate new metadata
•
No changes for users and applications
•
Seamlessly handle schema-on-read
•
Exploit remote data distribution
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
How Data is Stored in Hadoop
Oracle Confidential – Internal/Restricted/Highly Restricted 18 {"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
Example: 1TB File
Block B1
Block B2
Block B3
• 1 block = 256 MB
• Example File = 4096 blocks
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
How MapReduce and Hive Read Data
19
Data Node
disk
Consumer
SCAN
Create
ROWS
&
COLUMNS
•
Scan and row creation needs to be able to
work on “any” data format
•
Data definitions and column deserializations
are needed to provide a table
RecordReader
=> Scans data (keys and values)
InputFormat
=> Defines parallelism
SerDe
=> Makes columns
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
SQL-on-Hadoop Engines Share Metadata, not MapReduce
Hive Metastore
Oracle Confidential – Internal/Restricted/Highly Restricted 20
Hive Metastore
Hive
Impala
SparkSQL
Oracle Big Data SQL
…
Table Definitions:
movieapp_log_json
Tweets
avro_log
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 21
Extend Oracle External Tables
CREATE TABLE movielog (
click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename logs
com.oracle.bigdata.cluster mycluster
))
REJECT LIMIT UNLIMITED;
•
New types of external tables
–
ORACLE_HIVE (inherit metadata)
–
ORACLE_HDFS (specify metadata)
•
Access parameters for Big Data
–
Hadoop cluster
–
Remote Hive database/table
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | 22
Enhance Oracle External Tables
•
Transparent schema-for-read
–
Use fast C-based readers when possible
–
Use native Hadoop classes otherwise
•
Engineered to understand parallelism
–
Map external units of parallelism to Oracle
•
Architected for extensibility
–
StorageHandler capability enables future support
for other data sources
–
Examples: MongoDB, HBase, Oracle NoSQL DB
CREATE TABLE ORDER (
cust_num VARCHAR2(10),
order_num VARCHAR2(20),
order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (
TYPE
ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
PARALLEL 20
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
CUSTOMERS
SELECT name, SUM(purchase)
FROM customers
GROUP BY name;
What Can Big Data Learn from Exadata?
Intelligent Storage Maximizes Performance
Oracle Exadata Storage Server Oracle Exadata Storage Server
Oracle SQL query issued
• Plan constructed
• Query executed
1
Smart Scan Works on Storage
• Filter out unneeded rows
• Project only queried columns
• Score data models
• Bloom filters to speed up joins
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Query Franchising
– dispatch of query processing to
self-similar compute agents on disparate systems
without loss of operational fidelity
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Storage Layer
Oracle Confidential – Internal/Restricted/Highly Restricted 26
Big Data SQL Server: A New Hadoop Processing Engine
Filesystem (HDFS)
NoSQL Databases
(Oracle NoSQL DB, Hbase)
Resource Management (YARN, cgroups)
Processing Layer
MapReduce
and Hive
Spark
Impala
Search
Big Data
SQL
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Smart Scan for Hadoop: Optimizing Performance
Oracle Confidential – Internal/Restricted/Highly Restricted 27
Data Node
Disk
Big Data SQL Server
External Table Services
Smart Scan
“Oracle on top”
–
Apply filter predicates
–
Project columns
–
Parse semi-structured data
“Hadoop on the bottom”
–
Work close to the data
–
Schema-on-read with Hadoop classes
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
B B B
How do we query Hadoop?
Big Data SQL Query Execution
HDFS Data Node BDS Server HDFS Data Node BDS Server
Query compilation determines:
• Data locations
• Data structure
• Parallelism
1
Fast reads using Big Data SQL Server
• Schema-for-read using Hadoop classes
• Smart Scan selects only relevant data
2
Process filtered result
• Move relevant data to database
• Join with database tables
• Apply database security policies
3
Hive Metastore HDFS
NameNode
1
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Parallel Query and Hadoop
Mapping Hadoop to Oracle
B B B
Hive Metastore HDFSNameNode
Determine Hadoop Parallelism
• Determine schema-for-read
• Determine InputSplits
• Arrange splits for best performance
1
Map to Oracle Parallelism
• Map splits to granules
• Assign granules to PX Servers
2
PX Servers Route Work
• Offload work to Big Data SQL Servers
• Aggregate
• Join
• Apply PL/SQL
3
1
2
PX
InputSplits
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Big Data SQL Server Dataflow
Disks
Data Node
Big Data SQL Server
External Table Services
Smart Scan
Read data from HDFS Data Node
• Direct-path reads
• C-based readers when possible
• Use native Hadoop classes otherwise
1
Translate bytes to Oracle
2
Apply Smart Scan to Oracle bytes
• Apply filters
• Project Columns
• Parse JSON/XML
• Score models
3
RecordReader
SerDe
1011001
0
1011001
0
10110010
1
2
3
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
But How Does Security Work?
B B B
Database security for query access
• Virtual Private Databases
• Redaction
• Audit Vault and Database Firewall
1
Hadoop security for Hadoop jobs
• Kerberos Authentication
• Apache Sentry (RBAC)
• Audit Vault
2
System-specific encryption
• Database tablespace encryption
• BDA On-disk Encryption
3
SELECT * FROM my_bigdata_table
WHERE SALES_REP_ID =
SYS_CONTEXT('USERENV','SESSION_USER');
Filter on
SESSION_USER
DBMS_REDACT.ADD_POLICY(
object_schema => 'MCLICK',
object_name => 'TWEET_V',
column_name => 'USERNAME',
policy_name => 'tweet_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters =>
'VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
expression => '1=1'
);
***
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
SQL, Everywhere
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Move less data Go faster
More Lessons from Exadata
Storage Indexes
• Skip reads on irrelevant data
• Big Hadoop Blocks ~ Big Speed Up
1
Caching
• Cache frequently accessed columns
• HDFS Caching
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data Management System
Rich, comprehensive SQL access to all enterprise data
34
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data Management System
Unite Information Lifecycles
35
NoSQL
Automatic ILM
• Roll off cold partitions to Hadoop
• Promote hot data to Oracle
Shared REST APIs
• App-embedded schema NoSQL
• Shared schema Oracle
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data Management System
Unify All Query
36
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Big Data SQL
39