Big Data: Are you ready?

(1)

(2)

Big Data:”Are you ready?”

Oracle Big Data SQL

George Bourmas

Enterprise Architect

EMEA XLOB Enterprise Architects

September 13, 2014

(3)

(4)

(5)

Thoughts

Things

Processes

(6)

22×

_2011-2016

12.5 Billion

2020

1.3 Billion

Today

(7)

Use

Data

12 %

Executives who feel they

understand the impact data

will have on their organizations

Produce

(8)

VOLUME

VELOCITY

VARIETY

_VALUE

SOCIAL

BLOG

SMART

METER

1011001010010

0100110101010

1011100101010

100100101

(9)

Framework for distributed processing

Large Data Sets

Clusters of Computers

Simple Computing Models

Highly Available Service

(10)

Application

Storage Nodes

Datacenter B

Storage Nodes

Datacenter A

Application

NoSQL DB Driver

Application

NoSQL DB Driver

Application



Simple Key-Value Data Model



Horizontally Scalable



Highly Available



Simple administration



ACID Transactions at scale



Transparent load balancing



Elastic Configuration



Commercial grade software and

support

Oracle NoSQL Database

Scalable, Highly Available, Key-Value Database

(11)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(12)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database

Advanced

Analytics

(13)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database

Endeca

OBI EE

Expose All Data to End Users

(14)

Exadata

+

Oracle Database

Data Reservoir

Big Data Appliance

+

Hadoop

(15)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(16)

Securing Big Data

• _{Increasingly, Big Data solutions are}

capturing sensitive information must be

protected and audited

• _{This is no different than critical data}

(17)

Authenticate

users with secure Kerberos protocol

Authorize

access to data with fine grained controls

Audit

activity and access with Oracle Audit Vault

and Database Firewall

Encrypt

data as it flows through the system

Enhanced Big Data Security

(18)

Unified Data Platform

Advanced Query & Analysis

Full Power of SQL and Advanced Analytics

Transparent to Applications

No Changes to Application Code

Single View of All Data

Unified Metadata Across RDBMS & Hadoop

Fastest Performance

Utilize SQL Processing Across the Platform

Leverage Existing Skills

(19)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(20)

How can we possibly

leverage the strengths of

both platforms?

(21)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database



_{Low-cost Scalability}



_{Flexible Schema on Read}



_{Abstract Storage Model}



_Open



_{Rapid Evolution}



_{Extreme Performance}



_{Highly Secure}



_{Analytic SQL}



_{Rich Tool Set}



_{Vast Expertise}

(22)

1

2

3 Oracle Big Data SQL: A New Architecture

Technical Overview

(23)

Driving Business Value from Technology Innovation

Use the Right Tool for the Job and benefit from the Power of

“AND”

Run the Business



_{Integrate existing systems}



_{Support mission-critical tasks}



_{Protect existing expenditures}

Relational

Hadoop

Change the Business



_{Disrupt competitors}



Disintermediate supply chains



_{Leverage new paradigms}

NoSQL

Scale the Business



_{Serve data faster}



_{Meet mobile challenges}



_{Scale-out economically}

(24)

What if you could…

• _{Make all data easily available to all your Oracle Database applications}

• _{While supporting the full breadth of Oracle SQL query language}

• _{With all the security of Oracle Database 12c}

• _{Without moving data between your Hadoop cluster and the RDBMS}

• _{And deliver fast query performance}

• _{While leveraging your existing skills}

(25)

Oracle Big Data SQL – A New Architecture

• _{Powerful, high-performance SQL on Hadoop}

–

_{Full Oracle SQL capabilities on Hadoop}

–

_{SQL query processing local to Hadoop nodes}

• _{Simple data integration of Hadoop and Oracle Database}

–

_{Single SQL point-of-entry to access all data}

–

_{Scalable joins between Hadoop and RDBMS data}

• _{Optimized hardware}

(26)

How did we do this?

1. Unify Metadata, Query Language and Security

(27)

Unifying Metadata, Query Language and

Security

(28)

Data Stored in Hadoop

Hadoop/NoSQL Ecosystem

ç

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}

{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}

{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}

{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}

{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

(29)

Hive

• _{Provides SQL-like interface to data stored in HDFS}

• _{Allows applications to process data stored in any format}

• _{Tables capture metadata required to locate and parse data}

• _{SQL query generates a MapReduce job to process the data}

• _{Big Data SQL uses Hive metadata to simplify}

administration, but it’s not required.

(30)

Schema on Read: MapReduce and Hive

Simple Case: Single Column

> select * from movieapp_log_json

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07",…

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22",…

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26",…

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32”,…

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42",…

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43",…

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50",…

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03”,…

HiveQL:

CREATE EXTERNAL TABLE movieapp_log_json

(

click STRING

)

ROW FORMAT

DELIMITED

LINES TERMINATED BY '\n'

STORED AS TEXTFILE

LOCATION '/user/oracle/applog';

(31)

Schema on Read: MapReduce and Hive

Same Source with Columns Derived Using SerDe

> select * from movielog_cols

HiveQL:

CREATE EXTERNAL TABLE movielog_cols (

custid int,

movieid int,

activity int, …)

ROW FORMAT SERDE

'org.apache.hive.hcatalog.data.JsonSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'/user/oracle/applog_json';

(32)

Hive Metastore

SQL Execution Engines Share Metadata

Hive Metastore

Hive

Impala

Shark

Oracle Big Data SQL

…

Table Definitions:

movieapp_log_json

movielog

(33)

Unify Metadata: Publish Hive Metadata to Oracle Catalog

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

REJECT LIMIT UNLIMITED;

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

REJECT LIMIT UNLIMITED;

Big Data Appliance

Exadata

₊

Oracle Catalog

External Table

Hive metadata

External Table

Hive Metastore

(34)

Automation: Oracle Data Modeler

Import Hive definitions into model

Automatically generate Oracle DDL for

imported tables

(35)

New Data Sources for Oracle External Tables

CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

( TYPE ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster)

)

REJECT LIMIT UNLIMITED

CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

( TYPE

ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

)

REJECT LIMIT UNLIMITED

• _{New set of properties}

–

_{ORACLE_HIVE and ORACLE_HDFS access drivers}

–

_{Identify a Hadoop cluster, data source, column}

mapping, error handling, overflow handling, logging

• _{New table metadata passed from Oracle DDL to}

Hadoop readers at query execution

• _{Architected for extensibility}

(36)

Use Rich Oracle SQL Dialect Over

All

Data

Snapshot of Oracle SQL Analytic Functions

• _{Ranking functions}

–

rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions (moving and cumulative)

–

Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

–

Direct inter-row reference using offsets

• Reporting Aggregate functions

–

Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

–

Correlation, linear regression family, covariance

• _{Linear regression}

–

Fitting of an ordinary-least-squares regression line to a set of number

• _{Descriptive Statistics}

–

DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count,

min, max, range, mean, stats_mode, variance, standard deviation, median,

quantile values, +/- n sigma values, top/bottom 5 values

• Correlations

–

Pearson’s correlation coefficients, Spearman's and Kendall's (both

nonparametric).

• _{Cross Tabs}

–

_{Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency}

coefficient, Cohen's kappa

• Hypothesis Testing

–

_{Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square,}

Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

(37)

package pigstuff; import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.PigException; import org.apache.pig.backend.executionengine.ExecException; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; /** * * @author nbayliss */

public class W_FINDER extends EvalFunc<Tuple> { private class V0Line {

String state = null; String[] attributes; String prev = ""; String next = "";

public V0Line(String[] atts) { attributes = atts; }

public String[] getAttributes() { return attributes; }

public void setState(String state) { this.state = state;

}

public String getQuantity() { return this.attributes[4]; }

public String setState(V0Line linePrev, V0Line lineNext) { String q = this.getQuantity(); if (linePrev == null) { prev = ""; } else { prev = linePrev.getQuantity(); } if (lineNext == null) { next = ""; } else { next = lineNext.getQuantity(); }

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";

return state; }

if (gt(q, prev) && gt(q, next)) { state = "T";

return state; }

if (lt(q, prev) && lt(q, next)) { state = "B";

return state; }

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";

return state; }

if (q.isEmpty() || eq(q, prev)) { state = "F";

return state; }

private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return a.equals(b); }

private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) > Double.parseDouble(b); }

private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) < Double.parseDouble(b); }

public String getState() { return this.state; }

}

BagFactory bagFactory = BagFactory.getInstance(); @Override

public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;

boolean noMoreValues = false; String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;

DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;

String msg = "Expected input to be DataBag, but" + " got " + o.getClass().getName();

throw new ExecException(msg, errCode, PigException.BUG); }

DataBag bag = (DataBag) o;

Iterator<Tuple> valueIt = bag.iterator(); while (!finished) {

if (valueIt.hasNext()) {

Tuple nextTuple = valueIt.next(); line = nextTuple.get(0).toString(); String ordk = nextTuple.get(1).toString(); String part = nextTuple.get(2).toString(); String supp = nextTuple.get(3).toString(); String q = nextTuple.get(4).toString(); String ship = nextTuple.get(5).toString(); pbkey = nextTuple.get(6).toString();

thisLine = new V0Line(new String[]{line, ordk, part, supp, q, ship, pbkey}); if (lineFifo.isEmpty()) { thisLine.setState("S"); } /*else { if (!lineFifo.get(0).getGroup().equals(thisLine.getGroup())) { thisLine.setState("S"); } } */ lineFifo.add(0, thisLine); if (!valueIt.hasNext()) { noMoreValues = true; thisLine.setState("E"); } } else { thisLine = null; }

String evalState = null; if (lineFifo.size() == 3) { processLine = lineFifo.get(1); if (processLine.getState() == null) { prevLine = lineFifo.get(0); nextLine = lineFifo.get(2); processLine.setState(prevLine, nextLine); } evalLine = lineFifo.remove(2); evalState = evalLine.getState(); } else { if (noMoreValues) { evalLine = lineFifo.remove(lineFifo.size() - 1); evalState = evalLine.getState(); } } if (evalState != null) { matchList += evalState; if (matchList.length() > 5) { matchList = matchList.substring(1); } if (matchList.equals("SBTBE") || matchList.equals("TBTBE") || matchList.equals("SBTBT") || matchList.equals("TBTBT")) { //String[] atts = evalLine.getAttributes(); c++; } } if (lineFifo.isEmpty()) { finished = true; } }

Tuple outputTuple = TupleFactory.getInstance().newTuple(3); outputTuple.set(0, line);

outputTuple.set(1, pbkey); outputTuple.set(2, new Long(c));

Pattern Matching With Oracle SQL

Snapshot of Oracle SQL Analytic Functions

Simplified, sophisticated, standards based syntax

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

38

(38)

Govern

All

Data

Store JSON data unconverted

in Hadoop

Oracle Database 12c

Oracle Big Data Appliance

SQL

Data analyzed via SQL

Store business-critical data in

Oracle

DBMS_REDACT.ADD_POLICY(

object_schema => 'hr',

object_name => 'employee',



_{Apply advanced security on Hadoop}

−

Masking/Redaction

(39)

(40)

Intelligent Query Optimization

One Query Spanning Oracle Database, Hadoop & NoSQL

Query Data in RDBMS,

Hadoop & NoSQL

Oracle SQL

Oracle NoSQL

DB

BDS Server

HDFS Data

Node

BDS Server

Oracle Database

Storage Server

Fast

Massive Parallelism

Filtered Locally

(41)

Data Stored in Hadoop

Hadoop/NoSQL Ecosystem

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}

{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}

{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}

{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}

{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}