• No results found

Oracle Big Data SQL Konference Data a znalosti 2015

N/A
N/A
Protected

Academic year: 2021

Share "Oracle Big Data SQL Konference Data a znalosti 2015"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Oracle Big Data SQL

Konference Data a znalosti 2015

Jakub ILLNER

Information Management Architect

XLOB Enterprise Cloud Architects

23 July 2015, version 2

(2)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Agenda

Is SQL Dead?

Introducing Oracle Big Data SQL

How Oracle Big Data SQL Works

Demonstration

Questions and Answers

1

2

3

4

2

5

(3)

Is SQL Dead?

(4)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

4

(5)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

//Part 5 // Keeping context var lastUniqueId = "foobar" var lastRecord: (DataKey, Int) = null var lastLastRecord: (DataKey, Int) = null var position = 0 it.foreach( r => { position = position + 1 if (!lastUniqueId.equals(r._1.uniqueId)) { lastRecord = null lastLastRecord = null }

//Part 6 : Finding those peaks and valleys if (lastRecord != null && lastLastRecord != null) {

if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId,

position,

lastRecord._1.eventTime, lastRecord._2, false))

} else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId, position, lastRecord._1.eventTime, lastRecord._2, true)) } } lastUniqueId = r._1.uniqueId lastLastRecord = lastRecord lastRecord = r }) results.iterator })

//Part 7 : pretty everything up pivotPointRDD.map(r => {

val pivotType = if (r.isPeak) "peak" else "valley" r.uniqueId + "," + r.position + "," + r.eventTime + "," + r.eventValue + "," + pivotType } ).saveAsTextFile(outputPath) }

class DataKey(val uniqueId:String, val eventTime:Long) extends Serializable with Comparable[DataKey] { override def compareTo(other:DataKey): Int = { val compare1 = uniqueId.compareTo(other.uniqueId) if (compare1 == 0) { eventTime.compareTo(other.eventTime) } else { compare1 } } }

class PivotPoint(val uniqueId: String, val position:Int, val eventTime:Long, val eventValue:Int,

val isPeak:Boolean) extends Serializable {} }

package com.hadooparchitecturebook.spark.peaksandvalleys import org.apache.hadoop.io.{Text, LongWritable} import org.apache.hadoop.mapred.TextInputFormat import org.apache.spark.rdd.ShuffledRDD

import org.apache.spark.{Partitioner, SparkContext, SparkConf} import scala.collection.mutable

/**

* Created by ted.malaska on 12/7/14. */

object SparkPeaksAndValleysExecution {

5

Peaks and Valleys in Spark vs. SQL

SELECT PRIMARY KEY

, POSITION

, EVENT_VALUE

, CASE

WHEN LEAD_EVENT_VALUE is null or LAG_EVENT_VALUE is null

then 'EDGE'

WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE

then 'VALLEY'

WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE

then 'PEAK'

ELSE 'SLOPE'

AND AS POINT_TYPE

FROM (

SELECT PRIMARY_KEY

, POSITION

, EVENT_VALUE

, LEAD(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE

, LAG(EVENT_VALUE,1,null)

OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE

FROM PEAK_AND_VALLEY_TABLE

)

132 Lines of Scala/Spark

17 Lines of SQL

7x less code

Finding Peaks and Valleys in Stock Market Data

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

5

10:00 10:05 10:10 10:15 10:20 10:25

Ticker

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

(6)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Declarative

Abstracted (from storage)

Concise

Powerful

Simple to learn

Rich analytical functions

Fast

Secure

Standardized

Widely used & known

6

(7)

Hive

First SQL engine on Hadoop

Uses MapReduce for execution

Contains metastore (HCatalog)

New Hive-on-Spark project

SparkSQL

Spark module for accessing

structured data

Fast, in-memory execution

Compatible with Hive

Impala

Developed by Cloudera

Fast, in-memory execution

Introduced Parquet format

Compatible with Hive

Presto

Developed by Facebook

Fast, low latency execution

Compatible with Hive

Connectivity to other sources

(8)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

What if you need to query Hadoop & RDBMS data?

Pure Hadoop-on-SQL engines can access data in Hadoop only (HDFS, Hive,

Parquet, ORC, HBase etc.)

Performance of BI tools like Cognos, Oracle BIEE, SAS, Tableau for large and

complex federated queries is limited

Possible solution is to use SQL interface & Hadoop integration available

with the DBMS platform of choice

(9)

Which SQL-on-Hadoop approach will you use most?

From 9% in 2014 to 19% in 2015

(10)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Introducing Oracle Big Data SQL

One SQL to query ALL the data

(11)

What if you could …

Make all data easily available to all your Oracle Database applications

While supporting the full breadth of Oracle SQL query language

With all the security of Oracle Database 12c

Without moving data between your Hadoop cluster and the RDBMS

And deliver fast query performance

While leveraging your existing skills

(12)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

12

One SQL to query ALL the data

NoSQL

(13)

Ranking functions

Rank, dense_rank, cume_dist, percent_rank, ntile

Window Aggregate functions

(moving and cumulative)

Avg, sum, min, max, count, variance, stddev, first_value, last_value

LAG/LEAD functions

Direct inter-row reference using offsets

Reporting Aggregate functions

Sum, avg, min, max, variance, stddev, count, ratio_to_report

Statistical Aggregates

Correlation, linear regression family, covariance

Linear regression

Fitting of an ordinary-least-squares regression line to a set of

number pairs.

Frequently combined with the COVAR_POP, COVAR_SAMP, and

CORR functions

Descriptive Statistics

DBMS_STAT_FUNCS: summarizes numerical columns of a table and

returns count, min, max, range, mean, stats_mode, variance,

standard deviation, median, quantile values, +/- n sigma values,

top/bottom 5 values

Correlations

Pearson’s correlation coefficients, Spearman's and Kendall's (both

nonparametric).

Cross Tabs

Enhanced with % statistics: chi squared, phi coefficient, Cramer's V,

contingency coefficient, Cohen's kappa

Hypothesis Testing

Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test,

Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way

ANOVA

Distribution Fitting

Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test,

Normal, Uniform, Weibull, Exponential

(14)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

next = lineNext.getQuantity(); }

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";

return state; }

if (gt(q, prev) && gt(q, next)) { state = "T";

return state; }

if (lt(q, prev) && lt(q, next)) { state = "B";

return state; }

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";

return state; }

if (q.isEmpty() || eq(q, prev)) { state = "F";

return state; }

return state; }

private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return a.equals(b); }

private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) > Double.parseDouble(b); }

private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) < Double.parseDouble(b); }

public String getState() { return this.state; }

}

BagFactory bagFactory = BagFactory.getInstance(); @Override

public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;

boolean noMoreValues = false; String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;

DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;

String msg = "Expected input to be DataBag, but"

14

Pattern Matching With Oracle SQL

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of Oracle SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

14

10:00 10:05 10:10 10:15 10:20 10:25

Ticker

(15)

Security – Virtual Private Database with Oracle SQL

Oracle Virtual Private Database (VPD)

enables

you to create security policies to control

database access at the row and column level.

Oracle VPD adds a

dynamic WHERE

clause to a

SQL statement that is issued against the table,

view, or synonym to which an Oracle Virtual

Private Database security policy was applied.

Because you attach security policies directly to

the database objects (tables, views), and the

policies are automatically applied whenever a

user accesses data, there is

no way to bypass

security

.

With Big Data SQL the Oracle Virtual Private

Database is available for Hadoop data

B B B

SELECT * FROM my_bigdata_table

WHERE SALES_REP_ID =

SYS_CONTEXT ('USERENV','SESSION_USER');

Filter on

SESSION_USER

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(16)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Security – Data Redaction with Oracle SQL

Oracle Data Redaction

enables you to create

security policies to control what data is visible for

sensitive columns with personal or security

information.

Oracle Data Redaction dynamically applies

redaction function to columns

. The function

transforms, obfuscates or hides the sensitive

information for unauthorized users.

Since the policy is applied automatically by Oracle

Database there is

no way to bypass security

and

get the un-redacted data.

With Big Data SQL the Oracle Virtual Private

Database is available for Hadoop data

B B B

D B M S _ R E D A C T . A D D _ P O L I C Y (

o b j e c t _ s c h e m a = > ' M C L I C K ' ,

o b j e c t _ n a m e = > ' T W E E T _ V ' ,

c o l u m n _ n a m e = > ' U S E R N A M E ' ,

p o l i c y _ n a m e = > ' t w e e t _ r e d a c t i o n ' ,

f u n c t i o n _ t y p e = > D B M S _ R E D A C T . P A R T I A L ,

f u n c t i o n _ p a r a m e t e r s = >

' V V V V V V V V V V V V V V V V V V V V V V V V V , * , 3 , 2 5 ' ,

e x p r e s s i o n = > ' 1 = 1 '

) ;

***

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(17)

How Oracle Big Data SQL Works

(18)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Hadoop data accessible through Oracle External Tables

New set of properties

ORACLE_HIVE and ORACLE_HDFS access drivers

Identify a Hadoop cluster, data source, column

mapping, error handling, overflow handling,

logging

New table metadata passed from Oracle DDL

to Hadoop readers at query execution

Architected for extensibility

StorageHandler capability enables future support

for many other data sources

Examples: MongoDB, HBase, Oracle NoSQL DB

18

CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL



( TYPE

ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

)

)

(19)

B B B

How Oracle executes a query with Big Data SQL

HDFS Data Node

BDSQL

HDFS Data Node

BDSQL

Query compilation determines:

Data locations

Data structure

Parallelism

1

Fast reads using Big Data SQL Server

Schema-for-read using Hadoop classes

Smart Scan selects only relevant data

2

Process filtered result

Move relevant data to database

Join with database tables

Apply database security policies

3

Hive Metastore

HDFS

NameNode

1

2

3

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(20)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Storage Layer

20

Big Data SQL: A New Hadoop Processing Engine

Filesystem (HDFS)

NoSQL Databases

(Oracle NoSQL DB, HBase)

Resource Management (YARN, cgroups)

Processing Layer

MapReduce

and Hive

Spark

Impala

Search

Big Data

SQL

(21)

Big Data SQL Uses Hive Metastore, not MapReduce

Hive Metastore

Hive

Impala

SparkSQL

Oracle Big Data SQL

Common semantic

repository (schemas,

Java classes) for

most of

SQL-on-Hadoop tools

(22)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

How Data is Stored in Hadoop

22

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}

{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}

{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}

{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}

{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 1TB JSON File

Block B1

Block B2

Block B3

1 block = 256 MB

Example File = 4096 blocks

(23)

How MapReduce and Hive Read Data

Data Node

disk

Consumer

SCAN

Create

ROWS

& COLUMNS

Scan and row creation needs to be able to

work on “any” data format

Data definitions and column deserializations

are needed to provide a table

RecordReader

=> Scans data (keys and values)

InputFormat

=> Defines parallelism

SerDe

=> Makes columns

Metastore

=> Maps DDL to Java access classes

Hi

ve

St

or

age

Han

d

le

r

(24)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Big Data SQL Server Dataflow

Disks

Data Node

Big Data SQL

External Table Services

Smart Scan

Read data from HDFS Data Node

Direct-path reads

C-based readers when possible

Use native Hadoop classes otherwise

1

Translate bytes to Oracle

2

Apply Smart Scan to Oracle bytes

Apply filters

Project Columns

Parse JSON/XML

Score models

3

RecordReader

SerDe

1011001

0100111

1011001

0100111

10110010

1

2

3

24

(25)

Operations Pushed Down to Hadoop

JSON

Oracle Database 12c

Big Data SQL on Hadoop Cluster

Pushed down to Big Data SQL Cell

Hadoop scans (InputFormat, SerDe)

JSON parsing

WHERE clause evaluation

Storage index evaluation

Column projection

Bloom filters for faster joins

Score Data Mining models

Handled by Oracle Database

Query Compilation & Optimization

Joins

Aggregations

Ordering of results

PL/SQL evaluation

Table functions

Security policies

Request

Oracle Data Stream

Smart Scan

– Only

(26)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Oracle Big Data SQL Storage Index

Storage index provides query

speed-up through transparent

IO elimination of HDFS Blocks

Columns in SQL are mapped to

fields in the HDFS file via

External Table Definitions

Min / max value is recorded for

each HDFS Block in a

in-memory storage index

26

HDFS

Field 1, Field 2,

1001

1010

1045

1109

1043

1001

1045

1609

1043

11455

1909

12430

13010

10450

1909

2043

Field 3, … , Field n

HDFS

Block1

(256MB)

HDFS

Block2

(256MB)

Index

B1 – Movie_ID

Min: 1001

Max: 1609

B2 – Movie_ID

Min: 1909

Max: 13010

Example:

Find all ratings from

movies with a

MOVIE_ID of 1109

(27)

Oracle Parallel Query and Hadoop

B B B

Hive Metastore

HDFS

NameNode

Determine Hadoop Parallelism

Determine schema-for-read

Determine

InputSplits

Arrange splits for best performance

1

Map to Oracle Parallelism

Map splits to granules

Assign

batches

of granules to PX Servers

2

PX Servers Route Work

Send granule requests async to cells

Reap results

3

1

2

PX

InputSplits

Oracle Database 12c

Big Data SQL on Hadoop Cluster

(28)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

28

Oracle and Hadoop Parallelism

SELECT /*+PARALLEL(EVE,8)*/

CUST.NAME

, CUST.MSISDN

, EVE.MONTH

, EVE.EVENT_TYPE

, COUNT(*) AS EVENT_COUNT

, SUM(EVE.DURATION) AS DURATION

FROM D_CUSTOMERS CUST

, F_NETWORK_EVENTS EVE

WHERE CUST.MSISDN = EVE.MSISDN

GROUP BY CUST.NAME

, CUST.MSISDN

, EVE.MONTH

, EVE.EVENT_TYPE

ORDER BY 1,2,3,4

PX Server #1

PX Server #2

PX Server #3

PX Server #4

PX Server #5

PX Server #6

PX Server #7

PX Server #8

Parallelism defined by Degree of Parallelism

(DOP) – dynamic, statement, table level

DOP can be throttled by database if

maximum DOP exceeded or table too small

Oracle Database 12c

Granule Request

Granule Request

Parallelism defined by Hadoop InputSplits

(fan-out from PX to Hadoop)

Utilized as many cores as provided by

cgroups (first-come-first-serve)

Big Data SQL on Hadoop Cluster

Async

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Granule Request

Hos

t

1

Hos

t

4

Hos

t

3

Hos

t

2

(29)

Big Data SQL Prerequisites

Oracle 12c on Linux

Oracle Exadata

Oracle Big Data Appliance

Infiniband interconnection

between Oracle Exadata and

Oracle Big Data Appliance

B B B

Oracle Big Data Appliance

with CDH

Oracle 12c

on Oracle Exadata

(30)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Demonstration

Big Data SQL in Action

(31)
(32)
(33)
(34)
(35)
(36)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Questions and Answers

Provided we still have some time …

(37)

Knowledge Check

True or False

Hive is leveraged by Big Data SQL as a query execution

engine - allowing BDS queries to automatically execute

faster as the Hive execution engine improves (e.g. Spark

replaces MapReduce)

False

Big Data SQL leverages only the Hive Metastore (HCatalog)

and the corresponding classes (InputFormat, RecordReader,

SerDe) but it does not use Hive for execution

(38)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Knowledge Check

True or False

Oracle Big Data SQL sends all the data from Hadoop to

Oracle Database where the query is processed. Oracle

Database does column selection, it applies WHERE, GROUP

BY and ORDER BY clauses etc.

False

Big Data SQL Smart Scan performs low level processing of

query. Smart Scan does the column projection, it applies

WHERE condition & Bloom filters, it processes JSON etc.

(39)

Knowledge Check

True or False

Oracle’s ambition with Big Data SQL is to supersede all the

Hadoop-on-SQL engines like Hive, SparkSQL, Impala, Drill or

Presto.

False

Big Data SQL is for companies with significant Oracle assets

(e.g. Oracle Data Warehouse) who wants to access and process

both Hadoop and Oracle data from single SQL environment

(40)

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

40

(41)

Safe Harbor Statement

The preceding is intended to outline our general product direction. It is intended for

information purposes only, and may not be incorporated into any contract. It is not a

commitment to deliver any material, code, or functionality, and should not be relied upon

in making purchasing decisions. The development, release, and timing of any features or

functionality described for Oracle’s products remains at the sole discretion of Oracle.

(42)
(43)

References

Related documents

SSMA for Oracle V6.0 converts Oracle database objects (including stored procedures) to SQL Server database objects, loads those objects into SQL Server, migrates data from Oracle

Using an approach called query franchising , Big Data SQL provides unified query across Oracle database, Hadoop, and NoSQL datastores in a fashion which maximizes performance

More and more data is getting available on Hadoop SQL is an incredibly popular data querying language. Is a bridge between business analysts and organizations’

Analyst Workstation / Laptop (2 core, 16GB RAM) Oracle Database Server with ORE Hadoop Server (Oracle Big Data Appliance) In-Memory R Engine In-Memory R Engines spawned by DB.

Data Acquisition Options in BDA Overview of Oracle NoSQL Database Overview of Hadoop Framework Understand HDFS. NoSQL Versus HDFS

Oracle SQL Connector for HDFS can also query files in Oracle Data Pump format, such as data files created by Oracle Loader for Hadoop for offline loading and data files copied

› WHITE PAPER: High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. › &#34; , it was possible to load up to 12TB/hour from Oracle Big Data

Oracle Real Time Decisions Oracle BI Oracle Big Data SQL Tables in Hadoop Tables in DB SQL join Data Warehouse Oracle Database Oracle Advanced Analytics. E xi st in g So u