Seamless Access from Oracle Database to Your Big Data

(1)

Seamless Access from

Oracle Database to Your Big Data

Brian Macdonald

Big Data and Analytics Specialist

Oracle Enterprise Architect

(2)

Agenda

• Hadoop and SQL access methods

• What is Oracle Big Data SQL

• Big Data SQL Architecture

• Big Data SQL Configuration

• Roadmap

• Customer Story

• Q&A

(3)

First Lets Define Big Data

Structured & Unstructured Data

(4)

SQL on Hadoop is Obvious

Although Implementations Vary

• Hive

• Impala, HAWQ, IBM Big SQL

• Oracle SQL Connector for Hadoop (OSCH)

• Oracle Big Data SQL

• A million more (Tez, Presto, Hadapt, Stinger, Polybase, Drill, Lots of

start ups)

(5)

SQL Analytics Challenge

Separate silos of information to analyze

(6)

SQL Analytics Challenge

No comprehensive SQL interface

(7)

Oracle Big Data SQL

Hadoop + NoSQL + Relational…

(8)

Oracle Big Data SQL – A New Architecture

• Powerful, high-performance SQL on Hadoop

–

Full Oracle SQL capabilities on Hadoop

–

SQL query processing local to Hadoop nodes

• Simple data integration of Hadoop and Oracle Database

–

Single SQL point-of-entry to access all data

–

Scalable joins between Hadoop and RDBMS data

• Oracle Security

–

Govern all Data through a Single Set of Security Policies

–

Redaction, VPD, etc.

–

Tool Access

(9)

Use Rich Oracle SQL Dialect Over All

Data

Snapshot of Oracle SQL Analytic Functions

• Ranking functions

– rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

– Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

– Direct inter-row reference using offsets

• Reporting Aggregate functions

– Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

– Correlation, linear regression family, covariance

• Linear regression

– Fitting of an ordinary-least-squares regression line to a set of number

pairs.

– Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR

functions

• Descriptive Statistics

– DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count,

min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

– Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).

• Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency

coefficient, Cohen's kappa

• Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann

Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

• Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal,

(10)

Oracle Big Data SQL Architecture

• Two components of Oracle Big Data SQL

–

External Table extension

(11)

Oracle Big Data SQL Architecture

• Two components of Oracle Big Data SQL

–

External Table extension

(12)

A Smarter Oracle External Table

Oracle Table

HDFS Data

You define:

You get:

• Table name

• Automatic discovery of Hive table metadata

• Oracle types

• Automatic translation from Hadoop types

• Automatic conversion from any InputFormat

• Any Degree of Parallelism

(13)

Unify Metadata: Publish Hive Metadata to Oracle Catalog

13

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

REJECT LIMIT UNLIMITED;

Big Data Appliance

+

Hadoop/NoSQL

Exadata

+

Oracle

Database

Oracle Catalog

External Table

Hive metadata

External Table

Hive Metastore

(14)

Accessible through Oracle Data Dictionary

Immediately – So the DBA doesn’t need to go to Hadoop

• ALL_HIVE_DATABASES

• ALL_HIVE_TABLES

• ALL_HIVE_COLUMNS

• DBA_HIVE_DATABASES

• DBA_HIVE_TABLES

• DBA_HIVE_COLUMNS

• USER_HIVE_DATABASES

• USER_HIVE_TABLES

• USER_HIVE_COLUMNS

(15)

Extend Oracle External Tables

CREATE TABLE movielog (

click VARCHAR2(4000))

ORGANIZATION EXTERNAL (

TYPE ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

))

REJECT LIMIT UNLIMITED;

• New types of external tables

–

ORACLE_HIVE (inherit metadata)

–

ORACLE_HDFS (specify metadata)

• Access parameters for Big Data

–

Hadoop cluster

–

Remote Hive database/table

• DBMS_HADOOP Package for automatic

import

(16)

(17)

How Data is Stored in Hadoop

As files. Pretty Simple

17 {"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB File

CREATE TABLE ORDER (custid VARCHAR2(10),

recommended VARCHAR2(20),

activity (NUMBER 8,2))

ORGANIZATION EXTERNAL (TYPE

oracle_hdfs

)

LOCATION ("hdfs:/usr/cust/summary/*");

• Assumes Default Values

• Table Options

–

Fields

–

Column Maps

–

Delimiters

–

Fileformats

• json, textfile, sequencefile,…

–

Serdes

• i.e regex

(18)

Creating an External Table against Hive

DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE (

cluster_id IN VARCHAR2,

db_name IN VARCHAR2 := NULL,

hive_table_name IN VARCHAR2,

hive_partition IN BOOLEAN,

table_name IN VARCHAR2 := NULL,

perform_ddl IN BOOLEAN DEFAULT FALSE,

text_of_ddl OUT VARCHAR2

);

set serveroutput on

DECLARE

DDLout VARCHAR2(4000);

BEGIN

dbms_hadoop.create_extddl_for_hive(

CLUSTER_ID=>

'bigdatalite

',

DB_NAME=>

'brian

',

HIVE_TABLE_NAME=>

'movie

',

HIVE_PARTITION=>FALSE,

TABLE_NAME=>

'movie

',

PERFORM_DDL=>FALSE,

TEXT_OF_DDL=>DDLout);

dbms_output.put_line(DDLout);

END;

(19)

Oracle External Tables – Flexibility for Varied File Structures

• Transparent schema-for-read

–

Use fast C-based readers when possible

–

Use native Hadoop classes otherwise

• Engineered to understand parallelism

–

Map external units of parallelism to Oracle

• Architected for extensibility

–

StorageHandler capability enables support for other

data sources

–

Examples: MongoDB, HBase, Oracle NoSQL DB

CREATE TABLE ORDER (

cust_num VARCHAR2(10),

order_num VARCHAR2(20),

order_total NUMBER(8,2))

ORGANIZATION EXTERNAL (

TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

PARALLEL 20

(20)

StorageHandlers: Extensibility Beyond HDFS

Hive Metastore

Oracle Big Data SQL

(21)

Oracle Big Data SQL Architecture

• Two components of Oracle Big Data SQL

–

External Table extension

(22)

What gives Exadata

extreme

performance?

22

Oracle Database 12c

SQL

Offload Query to

Exadata Storage Servers

Small data subset

(23)

Introducing Oracle Big Data SQL

23

Massively Parallel SQL Query across Oracle, Hadoop and NoSQL

Oracle Database 12c

Hadoop & NoSQL

(24)

Big Data Appliance X5-2

Sun Oracle X5-2L Servers with per server:

• 2 * 18 Core Intel Xeon E5 Processors

• 128 GB Memory

• 96TB Disk space

Integrated Software (4.2):

• Oracle Linux 6.6

• Oracle Big Data SQL 1.1*

• Cloudera Distribution of Apache Hadoop 5.4 – EDH Edition

• Cloudera Manager 5.4

• Oracle R Distribution

• Oracle NoSQL Database CE

24

(25)

Introducing Oracle Big Data SQL

25

Massively Parallel SQL Query across Oracle, Hadoop and NoSQL

Oracle Database 12c

Offload Query to

Exadata Storage Servers

Small data subset

quickly returned

Hadoop & NoSQL

Offload Query to

Data Nodes

SQL

data

subset

SQL

(26)

Storage Layer

26

Big Data SQL Server: A New Hadoop Processing Engine

Filesystem (HDFS)

_{(Oracle NoSQL DB, Hbase)}

NoSQL Databases

Resource Management (YARN, cgroups)

Processing Layer

MapReduce

(27)

How do we query Hadoop?

Big Data SQL Query Execution

HDFSData Node BDSQL

HDFS Data Node BDSQL

Query compilation determines:

• Data locations

• Data structure

• Parallelism

1 Fast reads using Big Data SQL Server

• Schema-for-read using Hadoop classes

• Smart Scan selects only relevant data

2 Process filtered result

• Move relevant data to database

• Join with database tables

• Apply database security policies

3

Hive Metastore HDFS

NameNode

1

(28)

Apply Advanced Security on Hadoop & NoSQL

28

Same security policies across all data

Oracle Database 12c

Hadoop

Redacted

data

subset

SQL

Raw JSON

data in

Hadoop

JSON

Customer

data

in Oracle

DBMS_REDACT.ADD_POLICY(

object_schema => 'sales',

object_name => 'customer_detail',

column_name => 'last_name',

policy_name => 'customer_privacy',

function_type => DBMS_REDACT.FULL,

expression => '1=1'

);

• Redaction

• Virtual Private Database

• Fine-grain Access Control

(29)

Configuration

• Install Oracle Big Data SQL on the BDA using Mammoth

• Run the Big Data SQL-Exadata installation script on

each

Oracle Exadata

database node

–

Sets up connectivity from Exadata to the Big Data SQL Servers on the BDA.

–

Installs a Hadoop client

–

Configure directories and files

–

Big Data SQL Agent

–

Oracle directory objects

–

Others

(30)

Directories

• Two Types of directories are created

• Common Directory – must be on cluster wide shared files system

–

Subdirectories for jar files

–

bigdata.properties (paths,etc.)

• Cluster Directory(s)

–

Configuration details for each BDA Cluster

–

Sub directory of Common directory

• Oracle Directories that point to these Dirs

–

ORACLE_BIGDATA_CONFIG – Common Directory

(31)

Big Data SQL Agents

• Created by Install Script

• This multi-threaded agent bridges the metadata between Oracle Database

and Hadoop.

• It launches a single JVM - instead of one for every process (which can be

quite slow).

• create public database link BDSQL$_XXXX using

'extproc_connection_data'; (XXXX is the name of each BDA cluster from

Cluster Directories

• create public database link BDSQL$_DEFAULT_CLUSTER using

(32)

If Kerberos is used on BDA

• Must create ticket (kinit) for BDS user

• BDS runs as Oracle User

• Need to renew tickets

–

cron

(33)

Requirements - For Now

• Exadata

• Oracle 12.1.0.2.1+

• Storage Servers 2.1.1.1 or 12.1.1.0

• Exadata configured on the same InfiniBand subnet as BDA

• Exadata and BDA connected by InfiniBand

(34)

Roadmap

(35)

Today

• Hadoop DoP linked to RDBMS DoP

–

Lead to many idle PQ processes

–

Required explicit declaration

• Unlink Hadoop and RDBMS DoP

• Automatic max Hadoop parallelism

–

Even on serial tables

• An average of 40% faster

–

Even at equivalent DoP

(36)

Today

• All blocks in a query must be read

from disk

–

Large (256MB) disk I/O for each block

• Automatically create Storage

Indexes in Big Data SQL Agents

• Check index before reading blocks

–

Skip unnecessary I/Os

• An average of 65% faster

–

Up to 100x faster for highly selective

queries

Storage Indexing

(37)

Customer Examples

(38)

Building Customer Loyalty



Deliver personalized multi-channel content to every

customer (example: Kroger’s ‘MyMagazine’)



Expand to a wide variety of interaction data to build

customer profiles

Challenges



Customer loyalty marketing and programs for

major retailers and consumer brands

Company Overview



2x improvements in campaign performance



Large-scale concurrent processing of complex SQL



70% of analysis is done in SQL, uses R as well

Benefits

Solution Overview



Oracle Exadata X3-8



Oracle Database with Advanced Analytics



Oracle ZFS Backup Appliance



Big Data Appliance



Next: Big Data SQL

SQL Analysis R-based Analysis Machine Learning

BDA

ZFS

X3-8

(39)

Thank You

&