Oracle Big Data SQL
Konference Data a znalosti 2015
Jakub ILLNER
Information Management Architect
XLOB Enterprise Cloud Architects
23 July 2015, version 2
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Agenda
Is SQL Dead?
Introducing Oracle Big Data SQL
How Oracle Big Data SQL Works
Demonstration
Questions and Answers
1
2
3
4
2
5
Is SQL Dead?
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
4
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
//Part 5 // Keeping context var lastUniqueId = "foobar" var lastRecord: (DataKey, Int) = null var lastLastRecord: (DataKey, Int) = null var position = 0 it.foreach( r => { position = position + 1 if (!lastUniqueId.equals(r._1.uniqueId)) { lastRecord = null lastLastRecord = null }
//Part 6 : Finding those peaks and valleys if (lastRecord != null && lastLastRecord != null) {
if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId,
position,
lastRecord._1.eventTime, lastRecord._2, false))
} else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) { results.+=(new PivotPoint(r._1.uniqueId, position, lastRecord._1.eventTime, lastRecord._2, true)) } } lastUniqueId = r._1.uniqueId lastLastRecord = lastRecord lastRecord = r }) results.iterator })
//Part 7 : pretty everything up pivotPointRDD.map(r => {
val pivotType = if (r.isPeak) "peak" else "valley" r.uniqueId + "," + r.position + "," + r.eventTime + "," + r.eventValue + "," + pivotType } ).saveAsTextFile(outputPath) }
class DataKey(val uniqueId:String, val eventTime:Long) extends Serializable with Comparable[DataKey] { override def compareTo(other:DataKey): Int = { val compare1 = uniqueId.compareTo(other.uniqueId) if (compare1 == 0) { eventTime.compareTo(other.eventTime) } else { compare1 } } }
class PivotPoint(val uniqueId: String, val position:Int, val eventTime:Long, val eventValue:Int,
val isPeak:Boolean) extends Serializable {} }
package com.hadooparchitecturebook.spark.peaksandvalleys import org.apache.hadoop.io.{Text, LongWritable} import org.apache.hadoop.mapred.TextInputFormat import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.{Partitioner, SparkContext, SparkConf} import scala.collection.mutable
/**
* Created by ted.malaska on 12/7/14. */
object SparkPeaksAndValleysExecution {
5
Peaks and Valleys in Spark vs. SQL
SELECT PRIMARY KEY
, POSITION
, EVENT_VALUE
, CASE
WHEN LEAD_EVENT_VALUE is null or LAG_EVENT_VALUE is null
then 'EDGE'
WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE
then 'VALLEY'
WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE
then 'PEAK'
ELSE 'SLOPE'
AND AS POINT_TYPE
FROM (
SELECT PRIMARY_KEY
, POSITION
, EVENT_VALUE
, LEAD(EVENT_VALUE,1,null)
OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE
, LAG(EVENT_VALUE,1,null)
OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE
FROM PEAK_AND_VALLEY_TABLE
)
132 Lines of Scala/Spark
17 Lines of SQL
7x less code
Finding Peaks and Valleys in Stock Market Data
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
5
10:00 10:05 10:10 10:15 10:20 10:25
Ticker
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
•
Declarative
•
Abstracted (from storage)
•
Concise
•
Powerful
•
Simple to learn
•
Rich analytical functions
•
Fast
•
Secure
•
Standardized
•
Widely used & known
6
•
Hive
–
First SQL engine on Hadoop
–
Uses MapReduce for execution
–
Contains metastore (HCatalog)
–
New Hive-on-Spark project
•
SparkSQL
–
Spark module for accessing
structured data
–
Fast, in-memory execution
–
Compatible with Hive
•
Impala
–
Developed by Cloudera
–
Fast, in-memory execution
–
Introduced Parquet format
–
Compatible with Hive
•
Presto
–
Developed by Facebook
–
Fast, low latency execution
–
Compatible with Hive
–
Connectivity to other sources
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
What if you need to query Hadoop & RDBMS data?
•
Pure Hadoop-on-SQL engines can access data in Hadoop only (HDFS, Hive,
Parquet, ORC, HBase etc.)
•
Performance of BI tools like Cognos, Oracle BIEE, SAS, Tableau for large and
complex federated queries is limited
•
Possible solution is to use SQL interface & Hadoop integration available
with the DBMS platform of choice
Which SQL-on-Hadoop approach will you use most?
From 9% in 2014 to 19% in 2015
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Introducing Oracle Big Data SQL
One SQL to query ALL the data
What if you could …
•
Make all data easily available to all your Oracle Database applications
•
While supporting the full breadth of Oracle SQL query language
•
With all the security of Oracle Database 12c
•
Without moving data between your Hadoop cluster and the RDBMS
•
And deliver fast query performance
•
While leveraging your existing skills
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
12
One SQL to query ALL the data
NoSQL
•
Ranking functions
–
Rank, dense_rank, cume_dist, percent_rank, ntile
•
Window Aggregate functions
(moving and cumulative)
–
Avg, sum, min, max, count, variance, stddev, first_value, last_value
•
LAG/LEAD functions
–
Direct inter-row reference using offsets
•
Reporting Aggregate functions
–
Sum, avg, min, max, variance, stddev, count, ratio_to_report
•
Statistical Aggregates
–
Correlation, linear regression family, covariance
•
Linear regression
–
Fitting of an ordinary-least-squares regression line to a set of
number pairs.
–
Frequently combined with the COVAR_POP, COVAR_SAMP, and
CORR functions
•
Descriptive Statistics
–
DBMS_STAT_FUNCS: summarizes numerical columns of a table and
returns count, min, max, range, mean, stats_mode, variance,
standard deviation, median, quantile values, +/- n sigma values,
top/bottom 5 values
•
Correlations
–
Pearson’s correlation coefficients, Spearman's and Kendall's (both
nonparametric).
•
Cross Tabs
–
Enhanced with % statistics: chi squared, phi coefficient, Cramer's V,
contingency coefficient, Cohen's kappa
•
Hypothesis Testing
–
Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test,
Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way
ANOVA
•
Distribution Fitting
–
Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test,
Normal, Uniform, Weibull, Exponential
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
next = lineNext.getQuantity(); }
if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";
return state; }
if (gt(q, prev) && gt(q, next)) { state = "T";
return state; }
if (lt(q, prev) && lt(q, next)) { state = "B";
return state; }
if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";
return state; }
if (q.isEmpty() || eq(q, prev)) { state = "F";
return state; }
return state; }
private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return a.equals(b); }
private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) > Double.parseDouble(b); }
private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) < Double.parseDouble(b); }
public String getState() { return this.state; }
}
BagFactory bagFactory = BagFactory.getInstance(); @Override
public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;
boolean noMoreValues = false; String matchList = "";
ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;
DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;
String msg = "Expected input to be DataBag, but"