Big Data:”Are you ready?”
Oracle Big Data SQL
George Bourmas
Enterprise Architect
EMEA XLOB Enterprise Architects
September 13, 2014
Thoughts
Things
Processes
22×
2011-2016
12.5
Billion
2020
1.3
Billion
Today
Use
Data
12
%
Executives who feel they
understand the impact data
will have on their organizations
Produce
VOLUME
VELOCITY
VARIETY
VALUE
SOCIAL
BLOG
SMART
METER
1011001010010
0100110101010
1011100101010
100100101
Framework for distributed processing
Large Data Sets
Clusters of Computers
Simple Computing Models
Highly Available Service
Application
Storage Nodes
Datacenter B
Storage Nodes
Datacenter A
Application
NoSQL DB Driver
Application
NoSQL DB Driver
Application
Simple Key-Value Data Model
Horizontally Scalable
Highly Available
Simple administration
ACID Transactions at scale
Transparent load balancing
Elastic Configuration
Commercial grade software and
support
Oracle NoSQL Database
Scalable, Highly Available, Key-Value Database
Capture massive data volumes
Analyze all data
Secure and unify the data platform
Use
Data
Produce
Data
Big Data Appliance
+
Hadoop
Exadata
+
Oracle Database
Advanced
Analytics
Big Data Appliance
+
Hadoop
Exadata
+
Oracle Database
Endeca
OBI EE
Expose All Data to End Users
Exadata
+
Oracle Database
Data Reservoir
Big Data Appliance
+
Hadoop
Capture massive data volumes
Analyze all data
Secure and unify the data platform
Use
Data
Produce
Data
Securing Big Data
•
Increasingly, Big Data solutions are
capturing sensitive information must be
protected and audited
•
This is no different than critical data
Authenticate
users with secure Kerberos protocol
Authorize
access to data with fine grained controls
Audit
activity and access with Oracle Audit Vault
and Database Firewall
Encrypt
data as it flows through the system
Enhanced Big Data Security
Unified Data Platform
Advanced Query & Analysis
Full Power of SQL and Advanced Analytics
Transparent to Applications
No Changes to Application Code
Single View of All Data
Unified Metadata Across RDBMS & Hadoop
Fastest Performance
Utilize SQL Processing Across the Platform
Leverage Existing Skills
Capture massive data volumes
Analyze all data
Secure and unify the data platform
Use
Data
Produce
Data
How can we possibly
leverage the strengths of
both platforms?
Big Data Appliance
+
Hadoop
Exadata
+
Oracle Database
Low-cost Scalability
Flexible Schema on Read
Abstract Storage Model
Open
Rapid Evolution
Extreme Performance
Highly Secure
Analytic SQL
Rich Tool Set
Vast Expertise
1
2
3
Oracle Big Data SQL: A New Architecture
Technical Overview
Driving Business Value from Technology Innovation
Use the Right Tool for the Job and benefit from the Power of
“AND”
Run the Business
Integrate existing systems
Support mission-critical tasks
Protect existing expenditures
Relational
Hadoop
Change the Business
Disrupt competitors
Disintermediate supply chains
Leverage new paradigms
NoSQL
Scale the Business
Serve data faster
Meet mobile challenges
Scale-out economically
What if you could…
•
Make all data easily available to all your Oracle Database applications
•
While supporting the full breadth of Oracle SQL query language
•
With all the security of Oracle Database 12c
•
Without moving data between your Hadoop cluster and the RDBMS
•
And deliver fast query performance
•
While leveraging your existing skills
Oracle Big Data SQL – A New Architecture
•
Powerful, high-performance SQL on Hadoop
–
Full Oracle SQL capabilities on Hadoop
–
SQL query processing local to Hadoop nodes
•
Simple data integration of Hadoop and Oracle Database
–
Single SQL point-of-entry to access all data
–
Scalable joins between Hadoop and RDBMS data
•
Optimized hardware
How did we do this?
1.
Unify Metadata, Query Language and Security
Unifying Metadata, Query Language and
Security
Data Stored in Hadoop
Hadoop/NoSQL Ecosystem
ç
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}
{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
Hive
•
Provides SQL-like interface to data stored in HDFS
•
Allows applications to process data stored in any format
•
Tables capture metadata required to locate and parse data
•
SQL query generates a MapReduce job to process the data
•
Big Data SQL uses Hive metadata to simplify
administration, but it’s not required.
Schema on Read: MapReduce and Hive
Simple Case: Single Column
> select * from movieapp_log_json
{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07",…
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22",…
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26",…
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32”,…
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42",…
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43",…
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50",…
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03”,…
HiveQL:
CREATE EXTERNAL TABLE movieapp_log_json
(
click STRING
)
ROW FORMAT
DELIMITED
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/oracle/applog';
Schema on Read: MapReduce and Hive
Same Source with Columns Derived Using SerDe
> select * from movielog_cols
HiveQL:
CREATE EXTERNAL TABLE movielog_cols (
custid int,
movieid int,
activity int, …)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/oracle/applog_json';
Hive Metastore
SQL Execution Engines Share Metadata
Hive Metastore
Hive
Impala
Shark
Oracle Big Data SQL
…
Table Definitions:
movieapp_log_json
movielog
Unify Metadata: Publish Hive Metadata to Oracle Catalog
CREATE TABLE movieapp_log_json
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE
ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
REJECT LIMIT UNLIMITED;
CREATE TABLE movieapp_log_json
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE
ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
)
REJECT LIMIT UNLIMITED;
Big Data Appliance
Exadata
+
Oracle Catalog
External Table
Hive metadata
External Table
Hive Metastore
Automation: Oracle Data Modeler
Import Hive definitions into model
Automatically generate Oracle DDL for
imported tables
New Data Sources for Oracle External Tables
CREATE TABLE movielog
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
DEFAULT DIRECTORY Dir1
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename logs
com.oracle.bigdata.cluster mycluster)
)
REJECT LIMIT UNLIMITED
CREATE TABLE movielog
(click VARCHAR2(4000))
ORGANIZATION EXTERNAL
( TYPE
ORACLE_HIVE
DEFAULT DIRECTORY Dir1
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename logs
com.oracle.bigdata.cluster mycluster
)
)
REJECT LIMIT UNLIMITED
•
New set of properties
–
ORACLE_HIVE and ORACLE_HDFS access drivers
–
Identify a Hadoop cluster, data source, column
mapping, error handling, overflow handling, logging
•
New table metadata passed from Oracle DDL to
Hadoop readers at query execution
•
Architected for extensibility
Use Rich Oracle SQL Dialect Over
All
Data
Snapshot of Oracle SQL Analytic Functions
•
Ranking functions
–
rank, dense_rank, cume_dist, percent_rank, ntile
•
Window Aggregate functions (moving and cumulative)
–
Avg, sum, min, max, count, variance, stddev, first_value, last_value
•
LAG/LEAD functions
–
Direct inter-row reference using offsets
•
Reporting Aggregate functions
–
Sum, avg, min, max, variance, stddev, count, ratio_to_report
•
Statistical Aggregates
–
Correlation, linear regression family, covariance
•
Linear regression
–
Fitting of an ordinary-least-squares regression line to a set of number
•
Descriptive Statistics
–
DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count,
min, max, range, mean, stats_mode, variance, standard deviation, median,
quantile values, +/- n sigma values, top/bottom 5 values
•
Correlations
–
Pearson’s correlation coefficients, Spearman's and Kendall's (both
nonparametric).
•
Cross Tabs
–
Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency
coefficient, Cohen's kappa
•
Hypothesis Testing
–
Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square,
Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
package pigstuff; import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.PigException; import org.apache.pig.backend.executionengine.ExecException; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; /** * * @author nbayliss */public class W_FINDER extends EvalFunc<Tuple> { private class V0Line {
String state = null; String[] attributes; String prev = ""; String next = "";
public V0Line(String[] atts) { attributes = atts; }
public String[] getAttributes() { return attributes; }
public void setState(String state) { this.state = state;
}
public String getQuantity() { return this.attributes[4]; }
public String setState(V0Line linePrev, V0Line lineNext) { String q = this.getQuantity(); if (linePrev == null) { prev = ""; } else { prev = linePrev.getQuantity(); } if (lineNext == null) { next = ""; } else { next = lineNext.getQuantity(); }
if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";
return state; }
if (gt(q, prev) && gt(q, next)) { state = "T";
return state; }
if (lt(q, prev) && lt(q, next)) { state = "B";
return state; }
if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";
return state; }
if (q.isEmpty() || eq(q, prev)) { state = "F";
return state; }
return state; }
private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return a.equals(b); }
private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) > Double.parseDouble(b); }
private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;
}
return Double.parseDouble(a) < Double.parseDouble(b); }
public String getState() { return this.state; }
}
BagFactory bagFactory = BagFactory.getInstance(); @Override
public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;
boolean noMoreValues = false; String matchList = "";
ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;
DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;
String msg = "Expected input to be DataBag, but" + " got " + o.getClass().getName();
throw new ExecException(msg, errCode, PigException.BUG); }
DataBag bag = (DataBag) o;
Iterator<Tuple> valueIt = bag.iterator(); while (!finished) {
if (valueIt.hasNext()) {
Tuple nextTuple = valueIt.next(); line = nextTuple.get(0).toString(); String ordk = nextTuple.get(1).toString(); String part = nextTuple.get(2).toString(); String supp = nextTuple.get(3).toString(); String q = nextTuple.get(4).toString(); String ship = nextTuple.get(5).toString(); pbkey = nextTuple.get(6).toString();
thisLine = new V0Line(new String[]{line, ordk, part, supp, q, ship, pbkey}); if (lineFifo.isEmpty()) { thisLine.setState("S"); } /*else { if (!lineFifo.get(0).getGroup().equals(thisLine.getGroup())) { thisLine.setState("S"); } } */ lineFifo.add(0, thisLine); if (!valueIt.hasNext()) { noMoreValues = true; thisLine.setState("E"); } } else { thisLine = null; }
String evalState = null; if (lineFifo.size() == 3) { processLine = lineFifo.get(1); if (processLine.getState() == null) { prevLine = lineFifo.get(0); nextLine = lineFifo.get(2); processLine.setState(prevLine, nextLine); } evalLine = lineFifo.remove(2); evalState = evalLine.getState(); } else { if (noMoreValues) { evalLine = lineFifo.remove(lineFifo.size() - 1); evalState = evalLine.getState(); } } if (evalState != null) { matchList += evalState; if (matchList.length() > 5) { matchList = matchList.substring(1); } if (matchList.equals("SBTBE") || matchList.equals("TBTBE") || matchList.equals("SBTBT") || matchList.equals("TBTBT")) { //String[] atts = evalLine.getAttributes(); c++; } } if (lineFifo.isEmpty()) { finished = true; } }
Tuple outputTuple = TupleFactory.getInstance().newTuple(3); outputTuple.set(0, line);
outputTuple.set(1, pbkey); outputTuple.set(2, new Long(c));