SQL Made
Great
Analy&cal SQL
SQL - The Goto Language for Big Data
Analy&cs!
Hermann Bär, [email protected]
Product Management
Data Warehousing
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Who Am I Not? .. But Who SHOULD be Here ..
•
Keith Laker
–
[email protected]
–
19 years with Oracle
–
Worked in consulSng, global support, onsite support and product mgnt.
–
Part of Data Warehouse product management team
–
Product Manager for analyScal SQL
–
Based in Manchester, UK
Safe Harbor Statement
The following is intended to outline our
general product direc&on
. It is intended for
informa&on purposes only
, and may not be incorporated into any contract. It is
not a
commitment to deliver
any material, code, or funcSonality, and should
not be relied
upon
in making purchasing decisions. The development, release, and Sming of any
features or funcSonality described for Oracle’s products remains at the sole discreSon of
Oracle.
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Dear Reader –
Due to legal constraints we are not allowed to provide any material
about Oracle’s exciSng upcoming new technology. We sincerely
apologize for this.
However, there is no reason to wait and not to use the current
available version, Oracle Database 12c Release 12.1 up to its full
potenSal.
We hope you will find this material useful. Please visit us also on OTN
for further informaSon.
Agenda
SQL and Big Data
CalculaSng approximate answers
SQL made for analysis
Summary
1
2
3
4
All new
features
marked
as….
Some things you will be able to
do today and..
some things you will be able to
do tomorrow . . .
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Conceptual View of Big Data Architecture
AcSonable
Events
Event Engine
Data
Reservoir
Data Factory
InformaSon Store
Enterprise
ReporSng
Discovery Lab
AcSonable
InformaSon
AcSonable
Insights
Data
Streams
ExecuSon
InnovaSon
Discovery
Output
Events
& Data
Structured
Enterprise
Data
Other
Data
SQL
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Approximate result-sets
Exploring Today’s Big Data Lakes
•
Key
business challenges
–
Many queries rely on counts and/or staSsScal calculaSons
•
NDVs, Pareto’s 80:20 rule, idenSfying outliers etc.
–
Exact processing of large data sets is resource intensive
–
Exploratory queries don’t require completely accurate result
•
Trending analysis, social analysis, sessionizaSon analyScs
•
Oracle’s solu&ons
–
Provide “approximate result” capabiliSes in SQL
–
Key objec&ves
•
Return approximate results faster, minimal deviaSon from actual
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Answer “
How many
…” type quesSons
–
How many unique sessions today
–
How many unique customers logged on
–
How many unique events occurred
COUNT (DISTINCT . . .)
–
Returns the exact number of rows that
contain disSnct values of specified
expression
–
Can be resource intensive because
requires sorSng
… significantly faster solu&on is
APPROX_COUNT_DISTINCT
(expr)
–
Processes large amounts of data
significantly faster
–
Uses HyperLogLog algorithm
–
Negligible deviaSon from exact result
•
Ignores rows containing null values
–
Supports any scalar data type
•
Does not support BFILE, BLOB, CLOB, LONG,
LONG RAW, or NCLOB
Genng Approximate Uniqueness Counts
Results for accuracy
•
Real world customer workload
•
Accuracy that is typically
97%
with
95% confidence
Performance Results
•
Real world customer workload
•
5-50x
improvement
•
Notes:
–
This approach does not use sampling, it uses a
hash-based approach
–
Ignores rows that contain a null value for specified
expression
–
Supports any scalar data type other than BFILE,
BLOB, CLOB, LONG, LONG RAW, or NCLOB
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
COUNT(DISTINCT…) processing:
•
SORT
operaSons
•
8GB
of memory
(PGA)
•
164GB
of
temp
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. | 15
With approx query processing:
•
No sort
•
Only 540MB PGA
•
Zero temp
1.
50x Faster
2.
15X Less Memory
3.
No temp
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Approximate Count DisSnct
•
Independent exemplary performance and accuracy analysis
–
Data courtesy of ChrisSan Antognini
•
High accuracy with superior performance, using bounded memory
Comparison of COUNT(DISTINCT) with APPROX_COUNT_DISTINCT
SQL is made for
Analysis
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
The On-Going EvoluSon of AnalyScal SQL
•
Introduction of
“window” functions
8i
•
Enhanced window functions
(percentile, etc)
•
Rollup, grouping sets, cube
9i
•
SQL Pivot
•
Recursive WITH
•
ListAgg
•
Nth value window
11g
•
Pattern matching
•
Top N clause
•
Approx. count distinct
•
JSON support
12c
Slide - 18
•
Statistical functions
•
SQL model clause
•
Partition Outer Join
•
Data mining
Making Code Simpler: Pattern Matching with SQL
Java vs. SQL: Searching for ‘W’ Patterns in Stock Trade Data
250+ Lines of Java and PIG
package pigstuff; import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.PigException; import org.apache.pig.backend.executionengine.ExecException; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; /** * * @author nbayliss */
private class V0Line { String state = null; String[] attributes; String prev = "”; String next = ””;
public V0Line(String[] atts) { attributes = atts;
}
public String[] getAttributes() { return attributes;
}
public void setState(String state) { this.state = state;
}
public String setState(V0Line linePrev, V0Line lineNext) { private boolean eq(String a, String b) {
private boolean gt(String a, String b) {
public Tuple exec(Tuple input) throws IOException { @Override
public Schema outputSchema(Schema input) { Schema.FieldSchema linenumber = new
Schema.FieldSchema("linenumber", DataType.CHARARRAY); Schema.FieldSchema pbykey = new
Schema.FieldSchema("pbykey", DataType.CHARARRAY);
Schema.FieldSchema count = new Schema.FieldSchema("count", DataType.LONG);
Schema tupleSchema = new Schema(); tupleSchema.add(linenumber); tupleSchema.add(pbykey); tupleSchema.add(count);
return new Schema(tupleSchema); }
}
•
Find a W-shape
paXern in a Scker
stream:
•
Output the
beginning
and
ending
date of the paXern
•
Calculate
average price
each the W-shape
•
Find only paXerns that
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Making Code Simpler: Pattern Matching with SQL
Java vs. SQL: Searching for ‘W’ Patterns in Stock Trade Data
SELECT first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY name ORDER BY time
MEASURES FIRST(x.time) AS first_x,
LAST(z.time) AS last_z
ONE ROW PER MATCH
PATTERN (X+ Y+ W+ Z+)
DEFINE X AS (price < PREV(price)),
Y AS (price > PREV(price)),
W AS (price < PREV(price)),
Z AS (price > PREV(price) AND
z.time - FIRST(x.time) <= 7 ))
12 Lines of SQL
250+ Lines of Java and PIG
package pigstuff; import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.PigException; import org.apache.pig.backend.executionengine.ExecException; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; /** * * @author nbayliss */
private class V0Line { String state = null; String[] attributes; String prev = "”; String next = ””;
public V0Line(String[] atts) { attributes = atts;
}
public String[] getAttributes() { return attributes;
}
public void setState(String state) { this.state = state;
}
public String setState(V0Line linePrev, V0Line lineNext) { private boolean eq(String a, String b) {
private boolean gt(String a, String b) {
public Tuple exec(Tuple input) throws IOException { @Override
public Schema outputSchema(Schema input) { Schema.FieldSchema linenumber = new
Schema.FieldSchema("linenumber", DataType.CHARARRAY); Schema.FieldSchema pbykey = new
Schema.FieldSchema("pbykey", DataType.CHARARRAY);
Schema.FieldSchema count = new Schema.FieldSchema("count", DataType.LONG);
Schema tupleSchema = new Schema(); tupleSchema.add(linenumber); tupleSchema.add(pbykey); tupleSchema.add(count);
return new Schema(tupleSchema); }
}
20
SQL
analy&cs
for all
your data
delivers faster processing,
richer analyScs over all your
data – relaSonal, Hadoop,
Hive, NoSQL, event streams
etc
Faster and Smarter SQL
Faster and resource
efficient and processing for
counSng and staSsScal
driven queries
Richer SQL
New richer funcSons for
deeper analysis of big data,
IoT data sets
–
MATCH_RECOGNIZE
SQL is Made for AnalyScs
AcSonable Events Event Engine Data Reservoir Data Factory
Enterprise InformaSon Store
ReporSng Discovery Lab AcSonable InformaSon AcSonable Insights Data Streams ExecuSon InnovaSon Structured Enterprise Data Other Data
Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |