Seamless Access from
Oracle Database to Your Big Data
Brian Macdonald
Big Data and Analytics Specialist
Oracle Enterprise Architect
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Agenda
•
Hadoop and SQL access methods
•
What is Oracle Big Data SQL
•
Big Data SQL Architecture
•
Big Data SQL Configuration
•
Roadmap
•
Customer Story
•
Q&A
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
First Lets Define Big Data
Structured & Unstructured Data
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
SQL on Hadoop is Obvious
Although Implementations Vary
•
Hive
•
Impala, HAWQ, IBM Big SQL
•
Oracle SQL Connector for Hadoop (OSCH)
•
Oracle Big Data SQL
•
A million more (Tez, Presto, Hadapt, Stinger, Polybase, Drill, Lots of
start ups)
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
SQL Analytics Challenge
Separate silos of information to analyze
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
SQL Analytics Challenge
No comprehensive SQL interface
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL
Hadoop + NoSQL + Relational…
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL – A New Architecture
•
Powerful, high-performance SQL on Hadoop
–
Full Oracle SQL capabilities on Hadoop
–
SQL query processing local to Hadoop nodes
•
Simple data integration of Hadoop and Oracle Database
–
Single SQL point-of-entry to access all data
–
Scalable joins between Hadoop and RDBMS data
•
Oracle Security
–
Govern all Data through a Single Set of Security Policies
–
Redaction, VPD, etc.
–
Tool Access
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Use Rich Oracle SQL Dialect Over All
Data
Snapshot of Oracle SQL Analytic Functions
• Ranking functions
– rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
– Avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
– Direct inter-row reference using offsets
• Reporting Aggregate functions
– Sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
– Correlation, linear regression family, covariance
• Linear regression
– Fitting of an ordinary-least-squares regression line to a set of number
pairs.
– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR
functions
• Descriptive Statistics
– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count,
min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).
• Cross Tabs
– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency
coefficient, Cohen's kappa
• Hypothesis Testing
– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann
Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal,
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL Architecture
•
Two components of Oracle Big Data SQL
–
External Table extension
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL Architecture
•
Two components of Oracle Big Data SQL
–
External Table extension
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 12
A Smarter Oracle External Table
Oracle Table
HDFS Data
You define:
You get:
• Table name
• Automatic discovery of Hive table metadata
• Oracle types
• Automatic translation from Hadoop types
• Automatic conversion from any InputFormat
• Any Degree of Parallelism
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Unify Metadata: Publish Hive Metadata to Oracle Catalog
13
CREATE TABLE movieapp_log_json
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE
ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
REJECT LIMIT UNLIMITED;
Big Data Appliance
+
Hadoop/NoSQL
Exadata
+
Oracle
Database
Oracle Catalog
External Table
Hive metadata
External Table
Hive Metastore
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Accessible through Oracle Data Dictionary
Immediately – So the DBA doesn’t need to go to Hadoop
•
ALL_HIVE_DATABASES
•
ALL_HIVE_TABLES
•
ALL_HIVE_COLUMNS
•
DBA_HIVE_DATABASES
•
DBA_HIVE_TABLES
•
DBA_HIVE_COLUMNS
•
USER_HIVE_DATABASES
•
USER_HIVE_TABLES
•
USER_HIVE_COLUMNS
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 15
Extend Oracle External Tables
CREATE TABLE movielog (
click VARCHAR2(4000))
ORGANIZATION EXTERNAL (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename logs
com.oracle.bigdata.cluster mycluster
))
REJECT LIMIT UNLIMITED;
•
New types of external tables
–
ORACLE_HIVE (inherit metadata)
–
ORACLE_HDFS (specify metadata)
•
Access parameters for Big Data
–
Hadoop cluster
–
Remote Hive database/table
•
DBMS_HADOOP Package for automatic
import
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
How Data is Stored in Hadoop
As files. Pretty Simple
17 {"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
Example: 1TB File
CREATE TABLE ORDER (custid VARCHAR2(10),
recommended VARCHAR2(20),
activity (NUMBER 8,2))
ORGANIZATION EXTERNAL (TYPE
oracle_hdfs
)
LOCATION ("hdfs:/usr/cust/summary/*");
•
Assumes Default Values
•
Table Options
–
Fields
–
Column Maps
–
Delimiters
–
Fileformats
•
json, textfile, sequencefile,…
–
Serdes
•
i.e regex
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Creating an External Table against Hive
DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (
cluster_id IN VARCHAR2,
db_name IN VARCHAR2 := NULL,
hive_table_name IN VARCHAR2,
hive_partition IN BOOLEAN,
table_name IN VARCHAR2 := NULL,
perform_ddl IN BOOLEAN DEFAULT FALSE,
text_of_ddl OUT VARCHAR2
);
set serveroutput on
DECLARE
DDLout VARCHAR2(4000);
BEGIN
dbms_hadoop.create_extddl_for_hive(
CLUSTER_ID=>
'bigdatalite
',
DB_NAME=>
'brian
',
HIVE_TABLE_NAME=>
'movie
',
HIVE_PARTITION=>FALSE,
TABLE_NAME=>
'movie
',
PERFORM_DDL=>FALSE,
TEXT_OF_DDL=>DDLout);
dbms_output.put_line(DDLout);
END;
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | 19
Oracle External Tables – Flexibility for Varied File Structures
•
Transparent schema-for-read
–
Use fast C-based readers when possible
–
Use native Hadoop classes otherwise
•
Engineered to understand parallelism
–
Map external units of parallelism to Oracle
•
Architected for extensibility
–
StorageHandler capability enables support for other
data sources
–
Examples: MongoDB, HBase, Oracle NoSQL DB
CREATE TABLE ORDER (
cust_num VARCHAR2(10),
order_num VARCHAR2(20),
order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (
TYPE
ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
PARALLEL 20
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
StorageHandlers: Extensibility Beyond HDFS
Hive Metastore
Oracle Big Data SQL
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Oracle Big Data SQL Architecture
•
Two components of Oracle Big Data SQL
–
External Table extension
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
What gives Exadata
extreme
performance?
22
Oracle Database 12c
SQL
Offload Query to
Exadata Storage Servers
Small data subset
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Introducing Oracle Big Data SQL
23
Massively Parallel SQL Query across Oracle, Hadoop and NoSQL
Oracle Database 12c
Hadoop & NoSQL
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Big Data Appliance X5-2
Sun Oracle X5-2L Servers with per server:
•
2 * 18 Core Intel Xeon E5 Processors
•
128 GB Memory
•
96TB Disk space
Integrated Software (4.2):
•
Oracle Linux 6.6
•
Oracle Big Data SQL 1.1*
•
Cloudera Distribution of Apache Hadoop 5.4 – EDH Edition
•
Cloudera Manager 5.4
•
Oracle R Distribution
•
Oracle NoSQL Database CE
24
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Introducing Oracle Big Data SQL
25
Massively Parallel SQL Query across Oracle, Hadoop and NoSQL
Oracle Database 12c
Offload Query to
Exadata Storage Servers
Small data subset
quickly returned
Hadoop & NoSQL
Offload Query to
Data Nodes
SQL
data
subset
SQL
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Storage Layer
26
Big Data SQL Server: A New Hadoop Processing Engine
Filesystem (HDFS)
(Oracle NoSQL DB, Hbase)
NoSQL Databases
Resource Management (YARN, cgroups)
Processing Layer
MapReduce
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
How do we query Hadoop?
Big Data SQL Query Execution
HDFSData Node BDSQL
HDFS Data Node BDSQL
Query compilation determines:
• Data locations
• Data structure
• Parallelism
1
Fast reads using Big Data SQL Server
• Schema-for-read using Hadoop classes
• Smart Scan selects only relevant data
2
Process filtered result
• Move relevant data to database
• Join with database tables
• Apply database security policies
3
Hive Metastore HDFS
NameNode
1
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Apply Advanced Security on Hadoop & NoSQL
28
Same security policies across all data
Oracle Database 12c
Hadoop
Redacted
data
subset
SQL
Raw JSON
data in
Hadoop
JSON
Customer
data
in Oracle
DBMS_REDACT.ADD_POLICY(
object_schema => 'sales',
object_name => 'customer_detail',
column_name => 'last_name',
policy_name => 'customer_privacy',
function_type => DBMS_REDACT.FULL,
expression => '1=1'
);
•
Redaction
•
Virtual Private Database
•
Fine-grain Access Control
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Configuration
•
Install Oracle Big Data SQL on the BDA using Mammoth
•
Run the Big Data SQL-Exadata installation script on
each
Oracle Exadata
database node
–
Sets up connectivity from Exadata to the Big Data SQL Servers on the BDA.
–
Installs a Hadoop client
–
Configure directories and files
–
Big Data SQL Agent
–
Oracle directory objects
–
Others
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Directories
•
Two Types of directories are created
•
Common Directory – must be on cluster wide shared files system
–
Subdirectories for jar files
–
bigdata.properties (paths,etc.)
•
Cluster Directory(s)
–
Configuration details for each BDA Cluster
–
Sub directory of Common directory
•
Oracle Directories that point to these Dirs
–
ORACLE_BIGDATA_CONFIG – Common Directory
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Big Data SQL Agents
•
Created by Install Script
•
This multi-threaded agent bridges the metadata between Oracle Database
and Hadoop.
•
It launches a single JVM - instead of one for every process (which can be
quite slow).
•
create public database link BDSQL$_XXXX using
'extproc_connection_data'; (XXXX is the name of each BDA cluster from
Cluster Directories
•
create public database link BDSQL$_DEFAULT_CLUSTER using
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
If Kerberos is used on BDA
•
Must create ticket (kinit) for BDS user
•
BDS runs as Oracle User
•
Need to renew tickets
–
cron
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Requirements - For Now
•
Exadata
•
Oracle 12.1.0.2.1+
•
Storage Servers 2.1.1.1 or 12.1.1.0
•
Exadata configured on the same InfiniBand subnet as BDA
•
Exadata and BDA connected by InfiniBand
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Roadmap
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Today
•
Hadoop DoP linked to RDBMS DoP
–
Lead to many idle PQ processes
–
Required explicit declaration
Next
•
Unlink Hadoop and RDBMS DoP
•
Automatic max Hadoop parallelism
–
Even on serial tables
•
An average of 40% faster
–
Even at equivalent DoP
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Today
•
All blocks in a query must be read
from disk
–
Large (256MB) disk I/O for each block
Next
•
Automatically create Storage
Indexes in Big Data SQL Agents
•
Check index before reading blocks
–
Skip unnecessary I/Os
•
An average of 65% faster
–
Up to 100x faster for highly selective
queries
Storage Indexing
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Customer Examples
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Building Customer Loyalty
Deliver personalized multi-channel content to every
customer (example: Kroger’s ‘MyMagazine’)
Expand to a wide variety of interaction data to build
customer profiles
Challenges
Customer loyalty marketing and programs for
major retailers and consumer brands
Company Overview
2x improvements in campaign performance
Large-scale concurrent processing of complex SQL
70% of analysis is done in SQL, uses R as well
Benefits
Solution Overview
Oracle Exadata X3-8
Oracle Database with Advanced Analytics
Oracle ZFS Backup Appliance
Big Data Appliance
Next: Big Data SQL
SQL Analysis R-based Analysis Machine Learning
BDA
ZFS
X3-8
X3-8
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |