(1)

Copyright © 2014 Oracle and/or its affiliates. All rights reserved.

A Perfect Storm

Oracle Big Data Science for Enterprise R and SAS Users

Mark Hornick, Director, Advanced Analytics

mark.hornick@oracle.com @MarkHornick

Marcos Arancibia, Consulting Product Manager marcos.arancibia@oracle.com

(2)

Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

(3)

Perfect Storm: something that describes an actual phenomenon that happens to occur in such a confluence, resulting in an event of unusual magnitude

- Wikipedia

(4)

Masters in Data Science

Big Data

(5)

Events of unusual magnitude?

A changing of the guard

Massive migrations

(6)

Agenda

What is R?

Who is using R and why?

Overview of Oracle R Technologies

Global customer tour

(7)

What is R?

R is an Open Source scripting language and environment for statistical computing and graphics

http://www.R-project.org/

Started in 1994 as an alternative to SAS, SPSS, and other proprietary statistical environments

The R environment

– R is an integrated suite of software facilities for data manipulation, calculation and graphical display

Millions of R users worldwide

– Widely taught in Universities

– Many Corporate Analysts and Data Scientists know and use R

Thousands of open source packages to enhance productivity in areas such as:

– Bioinformatics

– Spatial Statistics

– Financial Market Analysis

(8)

Why statisticians, data analysts, and data scientists use R

The R environment is ...

Powerful

Extensible

Graphical

Extensive statistics

OOTB functionality with many ‘knobs’ but smart defaults

Ease of installation and use

Free

R is a statistics language similar to Base SAS or SPSS statistics

(9)

R’s Popularity – Number of Google Scholar hits

Robert A. Muenchen

http://r4stats.com/articles/popularity/

“SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.”

[Chart series: SPSS, SAS]

(10)

R’s Popularity – Number of Google Scholar Documents

SAS and SPSS removed

Robert A. Muenchen

http://r4stats.com/articles/popularity/

“…the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place.”

(11)

R’s Popularity – Job Trends

http://r4stats.com/articles/popularity/

(12)
(13)

Customer Pain Points with Advanced Analytics

for example…

“It takes too long to get my data or to get the ‘right’ data”

“I can’t analyze or mine all of my data – it has to be sampled”

“Putting models and results into production is ad hoc and complex”

“Recoding models into SQL, C, or Java takes time and is error prone”

“Our company is concerned about data security, backup and recovery”

“We need to build 10s of thousands of models fast to meet business objectives”

(14)
(15)

Oracle R Distribution

An Oracle-Supported Redistribution of Open Source R

Enhanced linear algebra performance via dynamically loaded libraries

Improve scalability at client and database for embedded R execution

Enterprise support for customers of Oracle Advanced Analytics option, Big Data Appliance, and Oracle Linux

Free download

Oracle contributes bug fixes and enhancements to open source R

Ability to dynamically load:

– Intel Math Kernel Library

– AMD Core Math Library

– Solaris Sun Performance Library

(16)

ROracle

R package enabling scalable and performant connectivity to Oracle Database

– Open source, publicly available on CRAN

– Oracle is maintainer

Oracle Database Interface (DBI) for R

– Re-implemented and optimized driver based on OCI

– Execute SQL statements from R interface

– Enables transactional behavior for insert, update, and delete
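A minimal sketch of what the DBI-style connectivity described above might look like from R. It assumes the ROracle package and an Oracle client are installed; the environment variable names (ORA_USER, ORA_PASSWORD, ORA_DSN) and the function name are hypothetical and chosen only for illustration. The function skips quietly when no database is configured.

```r
# Hedged sketch: query Oracle Database from R through ROracle's DBI interface.
# ORA_USER, ORA_PASSWORD and ORA_DSN are hypothetical environment variables.
run_roracle_demo <- function() {
  if (!requireNamespace("ROracle", quietly = TRUE)) {
    message("ROracle is not installed; skipping.")
    return(invisible(NULL))
  }
  dsn <- Sys.getenv("ORA_DSN")  # connect identifier for the target database (assumed)
  if (!nzchar(dsn)) {
    message("ORA_DSN is not set; skipping.")
    return(invisible(NULL))
  }
  drv <- ROracle::Oracle()      # OCI-based driver object
  con <- DBI::dbConnect(drv,
                        username = Sys.getenv("ORA_USER"),
                        password = Sys.getenv("ORA_PASSWORD"),
                        dbname   = dsn)
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Execute a SQL statement from R; the result comes back as a data.frame.
  res <- DBI::dbGetQuery(con, "select sysdate as now from dual")

  # DML would be followed by DBI::dbCommit(con) or DBI::dbRollback(con)
  # to control the transaction explicitly.
  res
}

result <- run_roracle_demo()
```

Wrapping the calls in a function with guards keeps the example safe to source on a machine without database access.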

Oracle Database

(17)

Oracle R Enterprise

Oracle Advanced Analytics Option to Oracle Database

Eliminate memory constraint of client R engine

Minimize or eliminate data movement latency

Leverage Oracle Database as HPC environment

Execute R scripts through database server machine for scalability and performance

Leverage parallel, distributed in-database data mining algorithms

Execute and manage R scripts via SQL

Operationalize R scripts in production applications – eliminate porting R code

Avoid reinventing code to integrate R results into existing applications

[Diagram: Client R Engine with ORE packages; Oracle Database (user tables, in-db stats) on the Database Server Machine; SQL Interfaces: SQL*Plus, SQLDeveloper, …]

(18)

Database-centric architecture

Smart meter scenario

[Diagram: inside Oracle Database, the data is partitioned by customer (c1, c2, …, ci, …, cn). An R script from the R Script Repository, f(dat,args,…), is run on each partition to build a model (Model c1, Model c2, …, Model ci, …, Model cn), and the models are kept in the R Datastore.]

(19)

Database-centric architecture

Smart meter scenario

[Diagram: inside Oracle Database, the data is partitioned by customer (c1, c2, …, ci, …, cn). An R script from the R Script Repository, f(dat,args,…), is run on each partition to score data with the corresponding model from the R Datastore, producing scores c1, scores c2, …, scores ci, …, scores cn.]

(20)

Build models and store in database, partition on CUST_ID

ore.groupApply(CUST_USAGE_DATA,
  CUST_USAGE_DATA$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    # Fit one linear model per customer partition
    mod <- lm(Consumption ~ . -CUST_ID, dat)
    # Drop large components to reduce the size of the stored model
    mod$effects <- mod$residuals <- mod$fitted.values <- NULL
    name <- paste("mod", cust_id, sep="")
    assign(name, mod)
    # Save the model in a per-customer datastore
    ds.name1 <- paste(ds.name, ".", cust_id, sep="")
    ore.save(list=paste("mod",cust_id,sep=""), name=ds.name1, overwrite=TRUE)
    TRUE
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE
)
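For readers without an Oracle R Enterprise installation, the partition-then-fit pattern of the slide code can be sketched locally in base R. This is only an analogue under stated assumptions: the data is synthetic, the predictor columns Temperature and Hour are hypothetical, a named list stands in for the database datastore, and nothing runs in parallel or in the database.

```r
# Local, base-R analogue of the partitioned model build (no ORE required).
# All data is synthetic; Temperature and Hour are hypothetical predictors.
set.seed(42)
usage <- data.frame(
  CUST_ID     = rep(1:3, each = 50),
  Temperature = runif(150, 0, 35),
  Hour        = sample(0:23, 150, replace = TRUE)
)
usage$Consumption <- 2 + 0.1 * usage$Temperature + 0.05 * usage$Hour +
  rnorm(150, sd = 0.2)

build_models <- function(data) {
  parts <- split(data, data$CUST_ID)          # one partition per customer
  models <- lapply(parts, function(dat) {
    mod <- lm(Consumption ~ . - CUST_ID, data = dat)
    # Drop bulky components, as the slide code does, to shrink the stored object.
    mod$effects <- mod$residuals <- mod$fitted.values <- NULL
    mod
  })
  names(models) <- paste0("mod", names(parts))  # same naming scheme: mod<CUST_ID>
  models
}

# In-memory stand-in for the datastore: a named list of fitted models.
datastore <- build_models(usage)
print(names(datastore))
```

The function passed to lapply has the same shape as the one passed to ore.groupApply: it receives one customer's rows and returns one model.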

(21)

Score customers in database, partition on CUST_ID

ore.groupApply(CUST_USAGE_DATA_NEW,
  CUST_USAGE_DATA_NEW$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    # Load this customer's model from its datastore
    ds.name1 <- paste(ds.name, ".", cust_id, sep="")
    ore.load(ds.name1)
    name <- paste("mod", cust_id, sep="")
    mod <- get(name)
    # Score the customer's new rows
    prd <- predict(mod, newdata=dat)
    prd[as.integer(rownames(prd))] <- prd
    res <- cbind(CUST_ID=cust_id, PRED = prd)
    data.frame(res)
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE,
  # Template describing the shape of the returned table
  FUN.VALUE=data.frame(CUST_ID=numeric(0), PRED=numeric(0))
)

16 lines
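The scoring step can be sketched the same way in base R. Again this is a local analogue, not the in-database implementation: the data is synthetic, the Temperature predictor is hypothetical, and a named list replaces the datastore that ore.load reads from.

```r
# Local, base-R analogue of per-customer scoring (no ORE required).
# Synthetic data; Temperature is a hypothetical predictor column.
set.seed(1)
make_usage <- function(n_per_cust) {
  d <- data.frame(CUST_ID     = rep(1:3, each = n_per_cust),
                  Temperature = runif(3 * n_per_cust, 0, 35))
  d$Consumption <- 2 + 0.1 * d$Temperature + rnorm(nrow(d), sd = 0.2)
  d
}
train    <- make_usage(40)
new_data <- make_usage(10)

# Build step (stand-in for the datastore): one lm per customer, keyed "mod<id>".
models <- lapply(split(train, train$CUST_ID),
                 function(dat) lm(Consumption ~ . - CUST_ID, data = dat))
names(models) <- paste0("mod", names(models))

# Score step: look up the customer's model and predict on that customer's rows.
score_one <- function(dat) {
  cust_id <- dat$CUST_ID[1]
  mod <- models[[paste0("mod", cust_id)]]
  data.frame(CUST_ID = cust_id,
             PRED    = unname(predict(mod, newdata = dat)))
}
scores <- do.call(rbind, lapply(split(new_data, new_data$CUST_ID), score_one))
rownames(scores) <- NULL
print(head(scores))
```

Each call to score_one returns a data frame with the columns CUST_ID and PRED, which is the shape that FUN.VALUE declares in the slide code.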

(22)

Performance with DOP=24

Models    Data (rows)      Total build time    Total scoring time (all data)
1,000     26,280,000       65.2 seconds        25.7 seconds
10,000    262,800,000      516 seconds         217 seconds
50,000    1,314,000,000    55.85 minutes       18 minutes

[Chart: execution time (sec) for Build Time and Score Time at 26.3, 262.8, and 1314 million rows; axis ticks 1 to 10000]
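As a rough check on how the timings on this slide scale, the figures can be converted into per-model and per-second rates. This is plain arithmetic on the numbers shown; the column names are my own.

```r
# Throughput derived from the slide's figures (DOP=24).
perf <- data.frame(
  models    = c(1000, 10000, 50000),
  rows      = c(26280000, 262800000, 1314000000),
  build_sec = c(65.2, 516, 55.85 * 60),   # 55.85 minutes converted to seconds
  score_sec = c(25.7, 217, 18 * 60)       # 18 minutes converted to seconds
)
perf$rows_per_model      <- perf$rows / perf$models
perf$models_per_sec      <- perf$models / perf$build_sec
perf$rows_scored_per_sec <- perf$rows / perf$score_sec
print(perf)
```

Every run has the same number of rows per model, and the build rate stays between roughly 15 and 19 models per second across a 50-fold increase in data, so build time grows approximately linearly with the number of models.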

(23)

Oracle Advanced Analytics on Exadata X3-2 ¼ Rack

Model Building using 30 numeric variables:

Leading Vendor on a machine connected directly to the same Exadata box took 2+ hours for ETL+Exec on 34 million records

Scalability of the new distributed ore.lm() Linear Regression

Engine            Database size (records)    Seconds
Leading Vendor    34 million                 7,200
OAA               34 million                 10.8
OAA               180 million                25.5
OAA               299 million                34.8
OAA               2.99 billion               315

[Chart: seconds (log scale) by engine/database size (records)]
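The comparison on this slide reduces to a few ratios. The sketch below assumes the five values pair with the five chart labels in the order they appear (7,200 seconds for the leading vendor, matching the stated 2+ hours); the column names are my own. Note that the vendor figure covers ETL plus execution, so this is an end-to-end comparison rather than an algorithm-only one.

```r
# Ratios derived from the ore.lm() scalability figures on the slide.
lm_perf <- data.frame(
  engine  = c("Leading Vendor", "OAA", "OAA", "OAA", "OAA"),
  records = c(34e6, 34e6, 180e6, 299e6, 2.99e9),
  seconds = c(7200, 10.8, 25.5, 34.8, 315)
)
lm_perf$records_per_sec <- lm_perf$records / lm_perf$seconds

# Same data size (34 million records): vendor time over OAA time.
speedup_same_size <- lm_perf$seconds[1] / lm_perf$seconds[2]

# OAA only: growth in data versus growth in time, smallest to largest run.
data_growth <- lm_perf$records[5] / lm_perf$records[2]
time_growth <- lm_perf$seconds[5] / lm_perf$seconds[2]

print(lm_perf)
print(round(c(speedup = speedup_same_size,
              data_growth = data_growth,
              time_growth = time_growth), 1))
```

On these numbers the time grows more slowly than the data across the OAA runs, which is the scalability point the slide is making.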

(24)

Oracle Advanced Analytics Option

Fastest Way to Deliver Scalable Enterprise-wide Predictive Analytics

Better Decisions with Deeper Insights & Predictive Analytics

– Understand and predict customer behavior for churn, fraud, cross-sell, and many other business problems

Easy to Use

– Data analysts: Mining work flow GUI (part of SQL Developer)

– Data scientists: R and SQL languages supported

– DBA: SQL integration

Comprehensive Analytics on a Simple Architecture

– Performance and scalability of the Oracle Database

– Lowest Total Cost of Ownership

– No need for separate analytical servers

Components

(25)

Oracle R Advanced Analytics for Hadoop

Oracle Big Data Connectors option to Big Data Appliance

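[Diagram: an R Client (R script with CRAN packages, on ORD) reaches the Hadoop Cluster through a Hadoop Abstraction Layer with R HDFS, R Hive, R MapReduce, and R sqoop/OLH components. The cluster has HDFS nodes and MapReduce nodes that run the Hadoop job's R mapper and reducer (with CRAN packages, on ORD). Oracle Database is also connected.]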

Transparent access to Hadoop Cluster from R

Manipulate data in HDFS, Hive, database, and file system

Write and execute MapReduce jobs with R

Leverage CRAN R packages to work on HDFS-resident data

(26)

Oracle Big Data Platform

Oracle Big Data Appliance: optimized for Hadoop, R, and NoSQL processing

Oracle Big Data Connectors

Oracle Exadata: “System of Record”, optimized for DW/OLTP

Oracle Exalytics: optimized for analytics & in-memory workloads

[Diagram components: Hadoop, Oracle R Distribution, Oracle NoSQL Database, Oracle Big Data Connectors, Oracle Data Integrator, Oracle Database (Data Warehouse), Oracle Advanced Analytics (Oracle R Enterprise, Oracle Data Mining), Oracle R Advanced Analytics for Hadoop, Oracle Enterprise Performance Management, Oracle Business Intelligence Applications, Oracle Business Intelligence Tools, Oracle Endeca Information Discovery, Applications]

(27)

A Global Customer Tour

(28)

Panoramic Houston skyline

The space shuttle Challenger atop its Boeing 747 SCA, flying

Quick Houston Facts:

Most populous city in Texas

Metropolitan area is the fifth-most populated in the U.S., with over 6 million people

Leading in energy, manufacturing, aeronautics, transportation, health care sectors and building oilfield equipment

Only New York City is home to more Fortune 500 Headquarters.

Oracle R Enterprise at Apache Oil:

Segmentation of drilling problems to understand potential problems ahead of time

Predictive maintenance of assets to prevent waiting a day for replacement of drill bits or other

(29)

Mayan City of Tikal

Guatemala City Today

Quick Guatemala Facts:

15.8 million inhabitants

Guatemala City is the Capital

Spanish spoken by 93% of Population

21 Mayan and 2 Amerindian languages also spoken

Service sector is largest component of GDP at 63%, followed by industry sector at 23.8% and agriculture sector at 13.2% (2010 est.)

Oracle R Enterprise at TIGO:

Customer Behavior of 5.5M customers with 1.8B transactions

Generate 5 models per customer to understand mobility using Lat/Long of the Cell Tower of each transaction

Evaluate 27.5M segmentation models in 25 minutes, or over 1M models/minute
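The TIGO figures are internally consistent, as a short calculation shows. It uses only the numbers on the slide; the variable names are my own.

```r
# Consistency check of the TIGO figures quoted on the slide.
customers           <- 5.5e6
models_per_customer <- 5
minutes             <- 25

total_models      <- customers * models_per_customer   # models evaluated
models_per_minute <- total_models / minutes

print(c(total_models = total_models, models_per_minute = models_per_minute))
```

Five models for each of 5.5 million customers gives the 27.5 million models quoted, and 25 minutes for that workload works out to 1.1 million models per minute.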

(30)

Quick Cincinnati Facts:

First major American city founded after American Revolution

First major inland “purely American” city in country

Cincinnati Reds have a storied history as being first professional club, hosting first night game, and dominating 1970s as the "Big Red Machine"

Cincinnati Reds:

Music Hall

American Sign Museum

Oracle Advanced Analytics at dunnhumby:

Very long ETL time eliminated with in-Database Advanced Analytics

Modeling behavior of millions of shoppers

Coupon optimization for Retailers on Billions of transactions

(31)

Quick Orlando Facts:

Nicknamed "The City Beautiful"

Symbol is the fountain at Lake Eola

“Theme Park Capital of the World”

51+ million tourists a year, 3.6 million international

Walt Disney World Resort:

Magic Kingdom, Hollywood Studios, Epcot, Animal Kingdom

Universal Studios Orlando

SeaWorld

TODAY, 2:00PM! Moscone South 308, CON2898

Oracle R Enterprise at Olive Garden:

Olive Garden, traditionally managing its 830 restaurants nationally, transitioned to a localized approach with the help of predictive analytics

Evaluated 115 million transactions in just 5 percent of the time required by previous BI tool

Supporting Olive Garden’s latest remodel campaign, continuing to uncover millions in profits by

(32)

Quick Lima, Peru Facts:

Capital and the largest city of Peru with 9M citizens

Most populous metropolitan area of Peru

Fifth largest city in the Americas (as defined by "city proper")

Home to one of the oldest higher learning institutions in the New World

National University of San Marcos, founded on May 12, 1551

Oracle R Enterprise at Financiera Uno:

Reduce time to build credit scoring models to ensure their market relevancy

Scale to handle “big data” volumes

Rapidly deploy credit scoring models into production applications

(33)

Quick London Facts:

One of world's leading financial centers

Has fifth- or sixth-largest metropolitan area GDP in the world depending on measurement

World cultural capital

World's most-visited city as measured by international arrivals

World's largest city airport system measured by passenger traffic

Oracle R Enterprise at Major Financial Company:

Earnings calculations reduced from 7 hours to 4 minutes

Scoring on written premium reduced from 100 minutes to 7 minutes

Scoring on earned premium reduced from 25 minutes to 8 minutes with added functionality

(34)

Quick Geneva Facts:

Most populous city of Romandy, the French-speaking part of Switzerland

A financial center

Worldwide center for diplomacy

Headquarters of many of the agencies of the United Nations and the Red Cross

Hosts highest number of international organizations in the world

Oracle R Enterprise at CERN:

Real time monitoring and anomaly detection of tens of thousands of events per second

CERN Central Logging Service: complex in-database time series analysis and forecasting

(35)

Quick Croatia Facts:

Member of European Union (EU) and United Nations (UN)

Tourism is a significant source of revenue during the summer

Ranked 18th most popular tourist destination in the world

Oracle R Enterprise at ZABA Bank:

Historical Customer Behavior Analysis shortened from several months to 2 weeks

Specialized Variable Clustering algorithm running in parallel to replace leading vendor solution

Faster model development resulted in better model quality and increasing bottom line

(36)

Quick Korea Facts:

Roughly half of the country's 50 million people reside in the metropolitan area surrounding its capital, Seoul

Seoul Capital Area is the second largest in the world with over 25 million residents

Eighth largest country in international trade

A regional power with world's 10th largest defense budget

Oracle R Enterprise at BISTEL:

Oracle ORE enables BISTEL to perform analytics with much more data faster and enables them to gain more insight (root cause and prediction)

With Oracle Exadata BISTEL can do enterprise advanced process control in Mega/Giga fabs in high-tech manufacturing

(37)
(38)

http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/overview/stubhub-optimization-fraud-detect-2265566.mp4

(39)

Demonstration

(40)
(41)

See these Advanced Analytics Talks at OOW’14

CON2898 - Developing Relevant Dining Visits with Oracle Advanced Analytics at Olive Garden

CON2452 - Extending the Power of In-Database Analytics with Oracle Big Data Appliance

CON8596 - Predictive Analytics with Oracle Data Mining

CON6545 - Market Basket Analysis at Dunkin’ Brands

CON8631 - Big Data and Predictive Analytics: Fiserv Data Mining Case Study

(42)

Learn More about Oracle’s R Technologies…

http://oracle.com/goto/R

(43)
(44)
