• No results found

Big Data: Are you ready?

N/A
N/A
Protected

Academic year: 2021

Share "Big Data: Are you ready?"

Copied!
51
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

Big Data:”Are you ready?”

Oracle Big Data SQL

George Bourmas

Enterprise Architect

EMEA XLOB Enterprise Architects

September 13, 2014

(3)
(4)
(5)

Thoughts

Things

Processes

(6)

22×

2011-2016

12.5

Billion

2020

1.3

Billion

Today

(7)

Use

Data

12

%

Executives who feel they

understand the impact data

will have on their organizations

Produce

(8)

VOLUME

VELOCITY

VARIETY

VALUE

SOCIAL

BLOG

SMART

METER

1011001010010

0100110101010

1011100101010

100100101

(9)

Framework for distributed processing

Large Data Sets

Clusters of Computers

Simple Computing Models

Highly Available Service

(10)

Application

Storage Nodes

Datacenter B

Storage Nodes

Datacenter A

Application

NoSQL DB Driver

Application

NoSQL DB Driver

Application

Simple Key-Value Data Model

Horizontally Scalable

Highly Available

Simple administration

ACID Transactions at scale

Transparent load balancing

Elastic Configuration

Commercial grade software and

support

Oracle NoSQL Database

Scalable, Highly Available, Key-Value Database

(11)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(12)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database

Advanced

Analytics

(13)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database

Endeca

OBI EE

Expose All Data to End Users

(14)

Exadata

+

Oracle Database

Data Reservoir

Big Data Appliance

+

Hadoop

(15)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(16)

Securing Big Data

Increasingly, Big Data solutions are

capturing sensitive information must be

protected and audited

This is no different than critical data

(17)

Authenticate

users with secure Kerberos protocol

Authorize

access to data with fine grained controls

Audit

activity and access with Oracle Audit Vault

and Database Firewall

Encrypt

data as it flows through the system

Enhanced Big Data Security

(18)

Unified Data Platform

Advanced Query & Analysis

Full Power of SQL and Advanced Analytics

Transparent to Applications

No Changes to Application Code

Single View of All Data

Unified Metadata Across RDBMS & Hadoop

Fastest Performance

Utilize SQL Processing Across the Platform

Leverage Existing Skills

(19)

Capture massive data volumes

Analyze all data

Secure and unify the data platform

Use

Data

Produce

Data

(20)

How can we possibly

leverage the strengths of

both platforms?

(21)

Big Data Appliance

+

Hadoop

Exadata

+

Oracle Database

Low-cost Scalability

Flexible Schema on Read

Abstract Storage Model

Open

Rapid Evolution

Extreme Performance

Highly Secure

Analytic SQL

Rich Tool Set

Vast Expertise

(22)

1

2

3

Oracle Big Data SQL: A New Architecture

Technical Overview

(23)

Driving Business Value from Technology Innovation

Use the Right Tool for the Job and benefit from the Power of

“AND”

Run the Business

Integrate existing systems

Support mission-critical tasks

Protect existing expenditures

Relational

Hadoop

Change the Business

Disrupt competitors

Disintermediate supply chains

Leverage new paradigms

NoSQL

Scale the Business

Serve data faster

Meet mobile challenges

Scale-out economically

(24)

What if you could…

Make all data easily available to all your Oracle Database applications

While supporting the full breadth of Oracle SQL query language

With all the security of Oracle Database 12c

Without moving data between your Hadoop cluster and the RDBMS

And deliver fast query performance

While leveraging your existing skills

(25)

Oracle Big Data SQL – A New Architecture

Powerful, high-performance SQL on Hadoop

Full Oracle SQL capabilities on Hadoop

SQL query processing local to Hadoop nodes

Simple data integration of Hadoop and Oracle Database

Single SQL point-of-entry to access all data

Scalable joins between Hadoop and RDBMS data

Optimized hardware

(26)

How did we do this?

1.

Unify Metadata, Query Language and Security

(27)

Unifying Metadata, Query Language and

Security

(28)

Data Stored in Hadoop

Hadoop/NoSQL Ecosystem

ç

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}

{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}

{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}

{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}

{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

(29)

Hive

Provides SQL-like interface to data stored in HDFS

Allows applications to process data stored in any format

Tables capture metadata required to locate and parse data

SQL query generates a MapReduce job to process the data

Big Data SQL uses Hive metadata to simplify

administration, but it’s not required.

(30)

Schema on Read: MapReduce and Hive

Simple Case: Single Column

> select * from movieapp_log_json

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07",…

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22",…

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26",…

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32”,…

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42",…

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43",…

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50",…

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03”,…

HiveQL:

CREATE EXTERNAL TABLE movieapp_log_json

(

click STRING

)

ROW FORMAT

DELIMITED

LINES TERMINATED BY '\n'

STORED AS TEXTFILE

LOCATION '/user/oracle/applog';

(31)

Schema on Read: MapReduce and Hive

Same Source with Columns Derived Using SerDe

> select * from movielog_cols

HiveQL:

CREATE EXTERNAL TABLE movielog_cols (

custid int,

movieid int,

activity int, …)

ROW FORMAT SERDE

'org.apache.hive.hcatalog.data.JsonSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'/user/oracle/applog_json';

(32)

Hive Metastore

SQL Execution Engines Share Metadata

Hive Metastore

Hive

Impala

Shark

Oracle Big Data SQL

Table Definitions:

movieapp_log_json

movielog

(33)

Unify Metadata: Publish Hive Metadata to Oracle Catalog

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

REJECT LIMIT UNLIMITED;

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

REJECT LIMIT UNLIMITED;

Big Data Appliance

Exadata

+

Oracle Catalog

External Table

Hive metadata

External Table

Hive Metastore

(34)

Automation: Oracle Data Modeler

Import Hive definitions into model

Automatically generate Oracle DDL for

imported tables

(35)

New Data Sources for Oracle External Tables

CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

( TYPE ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster)

)

REJECT LIMIT UNLIMITED

CREATE TABLE movielog

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

( TYPE

ORACLE_HIVE

DEFAULT DIRECTORY Dir1

ACCESS PARAMETERS

(

com.oracle.bigdata.tablename logs

com.oracle.bigdata.cluster mycluster

)

)

REJECT LIMIT UNLIMITED

New set of properties

ORACLE_HIVE and ORACLE_HDFS access drivers

Identify a Hadoop cluster, data source, column

mapping, error handling, overflow handling, logging

New table metadata passed from Oracle DDL to

Hadoop readers at query execution

Architected for extensibility

(36)

Use Rich Oracle SQL Dialect Over

All

Data

Snapshot of Oracle SQL Analytic Functions

Ranking functions

rank, dense_rank, cume_dist, percent_rank, ntile

Window Aggregate functions (moving and cumulative)

Avg, sum, min, max, count, variance, stddev, first_value, last_value

LAG/LEAD functions

Direct inter-row reference using offsets

Reporting Aggregate functions

Sum, avg, min, max, variance, stddev, count, ratio_to_report

Statistical Aggregates

Correlation, linear regression family, covariance

Linear regression

Fitting of an ordinary-least-squares regression line to a set of number

Descriptive Statistics

DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count,

min, max, range, mean, stats_mode, variance, standard deviation, median,

quantile values, +/- n sigma values, top/bottom 5 values

Correlations

Pearson’s correlation coefficients, Spearman's and Kendall's (both

nonparametric).

Cross Tabs

Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency

coefficient, Cohen's kappa

Hypothesis Testing

Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square,

Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

(37)

Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |

package pigstuff; import java.io.IOException; import java.util.ArrayList; import java.util.Iterator; import org.apache.pig.EvalFunc; import org.apache.pig.PigException; import org.apache.pig.backend.executionengine.ExecException; import org.apache.pig.data.BagFactory; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.data.TupleFactory; import org.apache.pig.impl.logicalLayer.FrontendException; import org.apache.pig.impl.logicalLayer.schema.Schema; /** * * @author nbayliss */

public class W_FINDER extends EvalFunc<Tuple> { private class V0Line {

String state = null; String[] attributes; String prev = ""; String next = "";

public V0Line(String[] atts) { attributes = atts; }

public String[] getAttributes() { return attributes; }

public void setState(String state) { this.state = state;

}

public String getQuantity() { return this.attributes[4]; }

public String setState(V0Line linePrev, V0Line lineNext) { String q = this.getQuantity(); if (linePrev == null) { prev = ""; } else { prev = linePrev.getQuantity(); } if (lineNext == null) { next = ""; } else { next = lineNext.getQuantity(); }

if (!q.isEmpty() && (prev.isEmpty() || (eq(q, prev) && gt(q, next)))) { state = "S";

return state; }

if (gt(q, prev) && gt(q, next)) { state = "T";

return state; }

if (lt(q, prev) && lt(q, next)) { state = "B";

return state; }

if (!q.isEmpty() && (next.isEmpty() || (gt(q, prev) && eq(q, next)))) { state = "E";

return state; }

if (q.isEmpty() || eq(q, prev)) { state = "F";

return state; }

return state; }

private boolean eq(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return a.equals(b); }

private boolean gt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) > Double.parseDouble(b); }

private boolean lt(String a, String b) { if (a.isEmpty() || b.isEmpty()) { return false;

}

return Double.parseDouble(a) < Double.parseDouble(b); }

public String getState() { return this.state; }

}

BagFactory bagFactory = BagFactory.getInstance(); @Override

public Tuple exec(Tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextLine; V0Line thisLine; V0Line processLine; V0Line evalLine = null; V0Line prevLine;

boolean noMoreValues = false; String matchList = "";

ArrayList<V0Line> lineFifo = new ArrayList<V0Line>(); boolean finished = false;

DataBag output = bagFactory.newDefaultBag(); if (input == null) { return null; } if (input.size() == 0) { return null; } Object o = input.get(0); if (o == null) { return null; } //Object o = input.get(0); if (!(o instanceof DataBag)) { int errCode = 2114;

String msg = "Expected input to be DataBag, but" + " got " + o.getClass().getName();

throw new ExecException(msg, errCode, PigException.BUG); }

DataBag bag = (DataBag) o;

Iterator<Tuple> valueIt = bag.iterator(); while (!finished) {

if (valueIt.hasNext()) {

Tuple nextTuple = valueIt.next(); line = nextTuple.get(0).toString(); String ordk = nextTuple.get(1).toString(); String part = nextTuple.get(2).toString(); String supp = nextTuple.get(3).toString(); String q = nextTuple.get(4).toString(); String ship = nextTuple.get(5).toString(); pbkey = nextTuple.get(6).toString();

thisLine = new V0Line(new String[]{line, ordk, part, supp, q, ship, pbkey}); if (lineFifo.isEmpty()) { thisLine.setState("S"); } /*else { if (!lineFifo.get(0).getGroup().equals(thisLine.getGroup())) { thisLine.setState("S"); } } */ lineFifo.add(0, thisLine); if (!valueIt.hasNext()) { noMoreValues = true; thisLine.setState("E"); } } else { thisLine = null; }

String evalState = null; if (lineFifo.size() == 3) { processLine = lineFifo.get(1); if (processLine.getState() == null) { prevLine = lineFifo.get(0); nextLine = lineFifo.get(2); processLine.setState(prevLine, nextLine); } evalLine = lineFifo.remove(2); evalState = evalLine.getState(); } else { if (noMoreValues) { evalLine = lineFifo.remove(lineFifo.size() - 1); evalState = evalLine.getState(); } } if (evalState != null) { matchList += evalState; if (matchList.length() > 5) { matchList = matchList.substring(1); } if (matchList.equals("SBTBE") || matchList.equals("TBTBE") || matchList.equals("SBTBT") || matchList.equals("TBTBT")) { //String[] atts = evalLine.getAttributes(); c++; } } if (lineFifo.isEmpty()) { finished = true; } }

Tuple outputTuple = TupleFactory.getInstance().newTuple(3); outputTuple.set(0, line);

outputTuple.set(1, pbkey); outputTuple.set(2, new Long(c));

Pattern Matching With Oracle SQL

Snapshot of Oracle SQL Analytic Functions

Simplified, sophisticated, standards based syntax

SELECT first_x, last_z

FROM ticker MATCH_RECOGNIZE (

PARTITION BY name ORDER BY time

MEASURES FIRST(x.time) AS first_x,

LAST(z.time) AS last_z

ONE ROW PER MATCH

PATTERN (X+ Y+ W+ Z+)

DEFINE X AS (price < PREV(price)),

Y AS (price > PREV(price)),

W AS (price < PREV(price)),

Z AS (price > PREV(price) AND

z.time - FIRST(x.time) <= 7 ))

250+ Lines of Java UDF

12 Lines of SQL

20x less code

Finding Patterns in Stock Market Data - Double Bottom (W)

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

38

(38)

Govern

All

Data

Store JSON data unconverted

in Hadoop

Oracle Database 12c

Oracle Big Data Appliance

SQL

Data analyzed via SQL

Store business-critical data in

Oracle

DBMS_REDACT.ADD_POLICY(

object_schema => 'hr',

object_name => 'employee',

Apply advanced security on Hadoop

Masking/Redaction

(39)
(40)

Intelligent Query Optimization

One Query Spanning Oracle Database, Hadoop & NoSQL

Query Data in RDBMS,

Hadoop & NoSQL

Oracle SQL

Oracle NoSQL

DB

BDS Server

HDFS Data

Node

BDS Server

Oracle Database

Storage Server

Fast

Massive Parallelism

Filtered Locally

(41)

Data Stored in Hadoop

Hadoop/NoSQL Ecosystem

{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8}

{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7}

{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7}

{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8}

{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9}

{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7}

{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9}

{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7}

{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7}

{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7}

{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7}}

{"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}

Example: 10GB File

Block B1

Block B2

Block B3

1 block = 256 MB

(42)

Enhance Oracle External Table Performance

Previously external tables were “file-centric”

1 file == 1 unit of parallelism

Enhanced external tables understand

parallelism

Automatically map external units of parallelism to

Oracle “Granules”

1 Input Split == 1 Oracle “Granule”

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

PARALLEL 20

REJECT LIMIT UNLIMITED;

CREATE TABLE movieapp_log_json

(click VARCHAR2(4000))

ORGANIZATION EXTERNAL

(TYPE

ORACLE_HIVE

DEFAULT DIRECTORY DEFAULT_DIR

)

PARALLEL 20

(43)

Query Execution on Hadoop

select last_name, state,

movie, genre

from movielog m,

customer c

where genre=‘comedy’

and c.custid = m.custid

1

Query compilation determines:

Data locations

Data structure

Parallelism

1

2

Parallel reads using Big Data SQL Server:

Parallel unit: PQ Slaves & InputSplits

Filter rows and project columns

2

HDFS

NameNode

3

Process filtered result

Move relevant data to database

Join with database tables

(44)

Big Data SQL Server Minimizes Data Movement

Data Node

Big Data SQL Server

External Table Services

1.Read using Hadoop Classes

2.Convert to Oracle Data

Stream

Hadoop Smart Scan

1.Apply filter predicates

2.Apply column projections

3.Apply row-level functions

JSON Parsing

Work close to the data

Scans and serializations from Hadoop classes

Transformation into Oracle data stream

Smart Scan: Emit only relevant data

Apply filter predicates

Include complex predicates, e.g. JSON_EXISTS

Bloom filters for faster joins

Score Data Mining models

(45)

Oracle Big Data SQL

One

fast SQL query, on

all

your data.

Oracle SQL on Hadoop

and beyond

With a

Smart Scan

service as in Exadata

With

native

SQL operators

(46)
(47)

What Does Big Data SQL Push Down?

Store data unconverted in Hadoop

JSON

Oracle Database 12c

Oracle Big Data Appliance

Store business-critical data in Oracle

Pushed down to Big Data SQL Cell

Hadoop scans (InputFormat, SerDe)

JSON parsing

Handled by Oracle Database

Joins

Aggregations

(48)

How Do Clusters Get Added to Big Data SQL?

At each DataNode, Big Data SQL installs

JSON

Oracle Database 12c

Oracle Big Data Appliance

On Oracle Database Server

Big Data SQL Cell

(49)

How Does Security Work in Big Data SQL?

Hadoop-native security

JSON

Oracle Database 12c

Oracle Big Data Appliance

Policies stored and managed by Oracle

Open Source Security Standards

Kerberos

LDAP/AD

Oracle Advanced Security

Fine grained database access

control

(50)
(51)

References

Related documents

Also conducting dynamic response analysis of the full-scale rockfall protection wall with the TLAS inputting time history of the transmitted impact force obtained from the test

The minimum requirements on the qualifications and experience of the key personnel of a registered specialist contractor in site formation works category (RSC(SF)) are given in

It has been recognized that theories for describing the states of stress and failure in unsaturated soil require consideration of the thermodynamic properties of the pore water in

Project structure WP  2 Di sse m ina ti on  an d   St akeh o ld er  C o nsul ta ti on   Accident analyses WP 3 Development and validation of the methodological framework

Analyst Workstation / Laptop (2 core, 16GB RAM) Oracle Database Server with ORE Hadoop Server (Oracle Big Data Appliance) In-Memory R Engine In-Memory R Engines spawned by DB.

Normal to slim body im- ages as presented in a photographic array were selected as being more attractive, less likely to have diabetes and hypertension, healthier and to be

In the case of MTDs, spatial planning should achieve a coordinating role primarily in relation to the sectoral planning basis for the protection of nature (protected area

Secondary bond market intervention by EFSF has a twofold objective. First, it serves to support the functioning of the debt markets and appropriate price formation in