Oracle Big Data SQL Konference Data a znalosti 2015

(1)

Oracle Big Data SQL

Konference Data a znalosti 2015

Jakub ILLNER

Information Management Architect

XLOB Enterprise Cloud Architects

23 July 2015, version 2

(2)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Agenda

Is SQL Dead?

Introducing Oracle Big Data SQL

How Oracle Big Data SQL Works

Demonstration

Questions and Answers

1

2

3

4

2

5

(3)

Is SQL Dead?

(4)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

4

(5)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

//Part 5 // Keeping context var lastUniqueId = "foobar" var lastRecord: (DataKey, Int) = null var lastLastRecord: (DataKey, Int) = null var position = 0 it.foreach( r => { position = position + 1 if (!lastUniqueId.equals(r._1.uniqueId)) { lastRecord = null lastLastRecord = null }

//Part 6 : Finding those peaks and valleys if (lastRecord != null && lastLastRecord != null) {

if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId,

position,

lastRecord._1.eventTime, lastRecord._2, false))

} else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId, position, lastRecord._1.eventTime, lastRecord._2, true)) } } lastUniqueId = r._1.uniqueId lastLastRecord = lastRecord lastRecord = r }) results.iterator })

//Part 7 : pretty everything up pivotPointRDD.map(r => {

val pivotType = if (r.isPeak) "peak" else "valley" r.uniqueId + "," + r.position + "," + r.eventTime + "," + r.eventValue + "," + pivotType } ).saveAsTextFile(outputPath) }

class DataKey(val uniqueId:String, val eventTime:Long) extends Serializable with Comparable[DataKey] { override def compareTo(other:DataKey): Int = { val compare1 = uniqueId.compareTo(other.uniqueId) if (compare1 == 0) { eventTime.compareTo(other.eventTime) } else { compare1 } } }

class PivotPoint(val uniqueId: String, val position:Int, val eventTime:Long, val eventValue:Int,

val isPeak:Boolean) extends Serializable {} }

package com.hadooparchitecturebook.spark.peaksandvalleys import org.apache.hadoop.io.{Text, LongWritable} import org.apache.hadoop.mapred.TextInputFormat import org.apache.spark.rdd.ShuffledRDD

import org.apache.spark.{Partitioner, SparkContext, SparkConf} import scala.collection.mutable

/**

* Created by ted.malaska on 12/7/14. */

object SparkPeaksAndValleysExecution {

5 Peaks and Valleys in Spark vs. SQL

SELECT PRIMARY KEY

, POSITION

, EVENT_VALUE

, CASE

WHEN LEAD_EVENT_VALUE is null or LAG_EVENT_VALUE is null

then 'EDGE'

WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE

then 'VALLEY'

WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE

then 'PEAK'

ELSE 'SLOPE'

AND AS POINT_TYPE

FROM (

SELECT PRIMARY_KEY

, POSITION

, EVENT_VALUE

, LEAD(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE

, LAG(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE

FROM PEAK_AND_VALLEY_TABLE

)

132 Lines of Scala/Spark

17 Lines of SQL

7x less code

Finding Peaks and Valleys in Stock Market Data

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

5 10:00 10:05 10:10 10:15 10:20 10:25

Ticker

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

(6)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

• Declarative

• Abstracted (from storage)

• Concise

• Powerful

• Simple to learn

• Rich analytical functions

• Fast

• Secure

• Standardized

• Widely used & known

6

(7)

• Hive

–

First SQL engine on Hadoop

–

Uses MapReduce for execution

–

Contains metastore (HCatalog)

–

New Hive-on-Spark project

• SparkSQL

–

Spark module for accessing

structured data

–

Fast, in-memory execution

–

Compatible with Hive

• Impala

–

Developed by Cloudera

–

Fast, in-memory execution

–

Introduced Parquet format

–

Compatible with Hive

• Presto

–

Developed by Facebook

–

Fast, low latency execution

–

Compatible with Hive

–

Connectivity to other sources

(8)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

What if you need to query Hadoop & RDBMS data?

• Pure Hadoop-on-SQL engines can access data in Hadoop only (HDFS, Hive,

Parquet, ORC, HBase etc.)

• Performance of BI tools like Cognos, Oracle BIEE, SAS, Tableau for large and

complex federated queries is limited

• Possible solution is to use SQL interface & Hadoop integration available

with the DBMS platform of choice

(9)

Which SQL-on-Hadoop approach will you use most?

From 9% in 2014 to 19% in 2015

(10)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Introducing Oracle Big Data SQL

One SQL to query ALL the data

(11)

What if you could …

• Make all data easily available to all your Oracle Database applications

• While supporting the full breadth of Oracle SQL query language

• With all the security of Oracle Database 12c

• Without moving data between your Hadoop cluster and the RDBMS

• And deliver fast query performance

• While leveraging your existing skills

(12)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

12 One SQL to query ALL the data

NoSQL

(13)

• Ranking functions

–

Rank, dense_rank, cume_dist, percent_rank, ntile

• Window Aggregate functions

(moving and cumulative)

–

Avg, sum, min, max, count, variance, stddev, first_value, last_value

• LAG/LEAD functions

–

Direct inter-row reference using offsets

• Reporting Aggregate functions

–

Sum, avg, min, max, variance, stddev, count, ratio_to_report

• Statistical Aggregates

–

Correlation, linear regression family, covariance

• Linear regression

–

Fitting of an ordinary-least-squares regression line to a set of

number pairs.

–

Frequently combined with the COVAR_POP, COVAR_SAMP, and

CORR functions

• Descriptive Statistics

–

DBMS_STAT_FUNCS: summarizes numerical columns of a table and

returns count, min, max, range, mean, stats_mode, variance,

standard deviation, median, quantile values, +/- n sigma values,

top/bottom 5 values

• Correlations

–

Pearson’s correlation coefficients, Spearman's and Kendall's (both

nonparametric).

• Cross Tabs

–

Enhanced with % statistics: chi squared, phi coefficient, Cramer's V,

contingency coefficient, Cohen's kappa

• Hypothesis Testing

–

Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test,

Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way

ANOVA

• Distribution Fitting

–

Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test,

Normal, Uniform, Weibull, Exponential

(14)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

next = lineNext.getQuantity(); }

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";

return state; }

if (gt(q, prev) && gt(q, next)) { state = "T";

return state; }

if (lt(q, prev) && lt(q, next)) { state = "B";

return state; }

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";

return state; }

if (q.isEmpty() || eq(q, prev)) { state = "F";

return state; }

private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return a.equals(b); }

private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) > Double.parseDouble(b); }

private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) < Double.parseDouble(b); }

public String getState() { return this.state; }

}

BagFactory bagFactory = BagFactory.getInstance(); @Override

public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;

boolean noMoreValues = false; String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;

DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;

String msg = "Expected input to be DataBag, but"

14 Pattern Matching With Oracle SQL

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of Oracle SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

14 10:00 10:05 10:10 10:15 10:20 10:25

Ticker

(15)

Security – Virtual Private Database with Oracle SQL

• Oracle Virtual Private Database (VPD)

enables

you to create security policies to control

database access at the row and column level.

• Oracle VPD adds a

dynamic WHERE

clause to a

SQL statement that is issued against the table,

view, or synonym to which an Oracle Virtual

Private Database security policy was applied.

• Because you attach security policies directly to

the database objects (tables, views), and the

policies are automatically applied whenever a

user accesses data, there is

no way to bypass

security

.

• With Big Data SQL the Oracle Virtual Private

Database is available for Hadoop data

B B B

SELECT * FROM my_bigdata_table

WHERE SALES_REP_ID =

SYS_CONTEXT ('USERENV','SESSION_USER');

Filter on

SESSION_USER

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(16)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Security – Data Redaction with Oracle SQL

• Oracle Data Redaction

enables you to create

security policies to control what data is visible for

sensitive columns with personal or security

information.

• Oracle Data Redaction dynamically applies

redaction function to columns

. The function

transforms, obfuscates or hides the sensitive

information for unauthorized users.

• Since the policy is applied automatically by Oracle

Database there is

no way to bypass security

and

get the un-redacted data.

• With Big Data SQL the Oracle Virtual Private

Database is available for Hadoop data

B B B

D B M S _ R E D A C T . A D D _ P O L I C Y (

o b j e c t _ s c h e m a = > ' M C L I C K ' ,

o b j e c t _ n a m e = > ' T W E E T _ V ' ,

c o l u m n _ n a m e = > ' U S E R N A M E ' ,

p o l i c y _ n a m e = > ' t w e e t _ r e d a c t i o n ' ,

f u n c t i o n _ t y p e = > D B M S _ R E D A C T . P A R T I A L ,

f u n c t i o n _ p a r a m e t e r s = >

' V V V V V V V V V V V V V V V V V V V V V V V V V , * , 3 , 2 5 ' ,

e x p r e s s i o n = > ' 1 = 1 '

) ;

***

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(17)

How Oracle Big Data SQL Works

(18)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Hadoop data accessible through Oracle External Tables

• New set of properties

–

ORACLE_HIVE and ORACLE_HDFS access drivers

–

Identify a Hadoop cluster, data source, column

mapping, error handling, overflow handling,

logging

• New table metadata passed from Oracle DDL

to Hadoop readers at query execution

• Architected for extensibility

–

StorageHandler capability enables future support

for many other data sources

–

Examples: MongoDB, HBase, Oracle NoSQL DB

18 CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

( TYPE

ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

)

(19)

B B B

How Oracle executes a query with Big Data SQL

HDFS Data Node

BDSQL

HDFS Data Node

BDSQL

Query compilation determines:

• Data locations

• Data structure

• Parallelism

1 Fast reads using Big Data SQL Server

• Schema-for-read using Hadoop classes

• Smart Scan selects only relevant data

2 Process filtered result

• Move relevant data to database

• Join with database tables

• Apply database security policies

3 Hive Metastore

HDFS

NameNode

₁

2 ₃

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(20)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Storage Layer

20 Big Data SQL: A New Hadoop Processing Engine

Filesystem (HDFS)

NoSQL Databases

(Oracle NoSQL DB, HBase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduce

and Hive

Spark

Impala

Search

Big Data

SQL

(21)

Big Data SQL Uses Hive Metastore, not MapReduce

Hive Metastore

Hive

Impala

SparkSQL