Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
A Perfect Storm
Oracle Big Data Science for Enterprise R and SAS Users
Mark Hornick, Director, Advanced Analytics, mark.hornick@oracle.com, @MarkHornick
Marcos Arancibia, Consulting Product Manager, marcos.arancibia@oracle.com
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
Perfect Storm: something that describes an actual phenomenon that happens to occur in such a confluence, resulting in an event of unusual magnitude (Wikipedia)
Masters in Data Science
Big Data
Events of unusual magnitude?
A changing of the guard
Massive migrations
Agenda
• What is R?
• Who is using R and why?
• Overview of Oracle R Technologies
• Global customer tour
What is R?
• R is an Open Source scripting language and environment for statistical computing and graphics
http://www.R-project.org/
• Started in 1994 as an Alternative to SAS, SPSS and other proprietary Statistical Environments
• The R environment
– R is an integrated suite of software facilities for data manipulation, calculation and graphical display
• Millions of R users worldwide
– Widely taught in Universities
– Many Corporate Analysts and Data Scientists know and use R
• Thousands of open source packages to enhance productivity, in areas such as:
– Bioinformatics
– Spatial Statistics
– Financial Market Analysis
Why statisticians, data analysts, and data scientists use R
The R environment is:
• Powerful
• Extensible
• Graphical
• Extensive statistics
• Out-of-the-box (OOTB) functionality with many ‘knobs’ but smart defaults
• Ease of installation and use
• Free
R is a statistics language similar to Base SAS or SPSS statistics
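As a small illustration of the "smart defaults" point, the sketch below uses only base R and the built-in mtcars data set. The data set and the chosen variables are assumptions made for illustration and do not appear in the original slides. It summarizes data and fits a linear model without setting any options.

```r
# Illustrative only: mtcars ships with base R and is not part of the original deck.
data(mtcars)

# Data manipulation: mean fuel economy by number of cylinders
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Statistics with default settings: regress mpg on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

print(mpg_by_cyl)
print(coef(fit))
```

Calling summary(fit) or plot(fit) would give the usual regression report and diagnostic plots, again with no extra arguments.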
R’s Popularity – Number of Google Scholar hits
Robert A. Muenchen
http://r4stats.com/articles/popularity/
“SPSS has a clear lead, but you can see that its dominance peaked in 2007 and its use is now in sharp decline. SAS never came close to SPSS’ level of dominance, and it peaked in 2008.”
[Chart: number of Google Scholar hits per year; labeled series include SPSS and SAS]
R’s Popularity – Number of Google Scholar Documents
SAS and SPSS removed
Robert A. Muenchen
http://r4stats.com/articles/popularity/
“…the use of R is experiencing very rapid growth and is pulling away from the pack, solidifying its position in third place.”
R’s Popularity – Job Trends
http://r4stats.com/articles/popularity/
Customer Pain Points with Advanced Analytics
for example…
“It takes too long to get my data or to get the ‘right’ data”
“I can’t analyze or mine all of my data – it has to be sampled”
“Putting models and results into production is ad hoc and complex”
“Recoding models into SQL, C, or Java takes time and is error prone”
“Our company is concerned about data security, backup and recovery”
“We need to build 10s of thousands of models fast to meet business objectives”
Oracle R Distribution
• An Oracle-supported redistribution of open source R
• Enhanced linear algebra performance via dynamically loaded libraries
– Ability to dynamically load Intel Math Kernel Library, AMD Core Math Library, or Solaris Sun Performance Library
• Improve scalability at client and database for embedded R execution
• Enterprise support for customers of Oracle Advanced Analytics option, Big Data Appliance, and Oracle Linux
• Free download
• Oracle contributes bug fixes and enhancements to open source R
Oracle Support
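One way to see why the linear algebra library matters is to time an operation that R hands off to BLAS. The sketch below is a generic base R check that works in any recent R build, not an Oracle R Distribution feature; the matrix size is a hypothetical value picked for illustration, and the timing depends entirely on the machine and on the library that is loaded.

```r
# Sketch: time a BLAS-bound operation. Results vary by machine and by the
# BLAS/LAPACK library this R build has loaded.
set.seed(1)
n <- 500                          # hypothetical size chosen for illustration
m <- matrix(rnorm(n * n), n, n)

# crossprod(m) computes t(m) %*% m and is delegated to BLAS
elapsed <- system.time(xtx <- crossprod(m))[["elapsed"]]

# Report which BLAS library R is using (available in R >= 3.4.0)
print(extSoftVersion()[["BLAS"]])
cat("crossprod of a", n, "x", n, "matrix took", elapsed, "seconds\n")
```

Running the same script against builds linked to different math libraries gives a rough, like-for-like comparison.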
ROracle
• R package enabling scalable and performant connectivity to Oracle Database
– Open source, publicly available on CRAN
– Oracle is maintainer
• Oracle Database Interface (DBI) for R
– Re-implemented and optimized driver based on OCI
– Execute SQL statements from R interface
– Enables transactional behavior for insert, update, and delete
Oracle Database
Oracle R Enterprise
Oracle Advanced Analytics Option to Oracle Database
• Eliminate memory constraint of client R engine
• Minimize or eliminate data movement latency
• Leverage Oracle Database as HPC environment
• Execute R scripts through database server machine for scalability and performance
• Leverage parallel, distributed in-database data mining algorithms
• Execute and manage R scripts via SQL
• Operationalize R scripts in production applications – eliminate porting R code
• Avoid reinventing code to integrate R results into existing applications
[Figure: ORE architecture. Labels: Client R Engine, ORE packages, Oracle Database, User tables, In-db stats, Database Server Machine, SQL Interfaces (SQL*Plus, SQLDeveloper, …)]
Database-centric architecture
Smart meter scenario
[Figure: model build. The R script "build model", f(dat,args,…), from the R Script Repository is applied to each data partition c1, c2, …, ci, …, cn in Oracle Database, producing Model c1, Model c2, …, Model ci, …, Model cn, which are kept in the R Datastore.]
Database-centric architecture
Smart meter scenario
[Figure: scoring. The R script "score data", f(dat,args,…), from the R Script Repository is applied to each data partition c1, c2, …, ci, …, cn in Oracle Database, using the models held in the R Datastore and producing scores c1, scores c2, …, scores ci, …, scores cn.]
Build models and store in database, partition on CUST_ID
ore.groupApply(CUST_USAGE_DATA,
  CUST_USAGE_DATA$CUST_ID,                    # partition column
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    mod <- lm(Consumption ~ . -CUST_ID, dat)
    mod$effects <- mod$residuals <- mod$fitted.values <- NULL   # drop per-row components to shrink the stored model
    name <- paste("mod", cust_id, sep="")
    assign(name, mod)
    ds.name1 <- paste(ds.name, ".", cust_id, sep="")
    ore.save(list=paste("mod", cust_id, sep=""), name=ds.name1, overwrite=TRUE)   # one datastore per customer
    TRUE
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE
)
Score customers in database, partition on CUST_ID
ore.groupApply(CUST_USAGE_DATA_NEW,
  CUST_USAGE_DATA_NEW$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    ds.name1 <- paste(ds.name, ".", cust_id, sep="")
    ore.load(ds.name1)                        # restore this customer's saved model
    name <- paste("mod", cust_id, sep="")
    mod <- get(name)
    prd <- predict(mod, newdata=dat)
    prd[as.integer(rownames(prd))] <- prd
    res <- cbind(CUST_ID=cust_id, PRED = prd)
    data.frame(res)
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE,
  FUN.VALUE=data.frame(CUST_ID=numeric(0), PRED=numeric(0))
)
16 lines
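To see the partition-then-apply pattern without a database, the sketch below mimics it in base R with split() and lapply(). It is a local stand-in under stated assumptions, not ORE: the usage data frame is simulated, the Temp predictor is hypothetical, the models are kept in an in-memory list rather than an ORE datastore, and nothing runs in parallel.

```r
# Local stand-in for the build/score pattern; not ORE and not run in-database.
set.seed(42)
usage <- data.frame(                     # simulated data, hypothetical columns
  CUST_ID = rep(1:3, each = 50),
  Temp    = runif(150, 0, 35)
)
usage$Consumption <- 10 + 0.5 * usage$CUST_ID * usage$Temp + rnorm(150)

# Build: one linear model per customer partition
parts  <- split(usage, usage$CUST_ID)
models <- lapply(parts, function(dat) {
  lm(Consumption ~ . - CUST_ID, data = dat)
})

# Score: look up each customer's model and predict on that customer's rows
scores <- do.call(rbind, lapply(names(parts), function(id) {
  dat <- parts[[id]]
  data.frame(CUST_ID = dat$CUST_ID, PRED = predict(models[[id]], newdata = dat))
}))

print(head(scores))
```

The per-partition function has the same shape as the one on the slides, taking one customer's rows and returning either a model or a data frame of predictions, which is what lets it be run on many partitions independently.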
Performance with DOP=24
• 1000 Models
– Data: 26,280,000 rows
– Total build time: 65.2 seconds
– Total scoring time: 25.7 seconds (all data)
• 10,000 Models
– Data: 262,800,000 rows
– Total build time: 516 seconds
– Total scoring time: 217 seconds (all data)
• 50,000 Models
– Data: 1,314,000,000 rows
– Total build time: 55.85 minutes
– Total scoring time: 18 minutes (all data)
[Chart: Build Time and Score Time, execution (sec) on a log scale from 1 to 10000, at 26.3, 262.8, and 1314 million rows]
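The timings above can be turned into throughput figures with a few lines of R. The inputs below are transcribed from the slide; the derived columns (and their names) are additions for illustration and were not reported in the original.

```r
# Figures transcribed from the slide (DOP = 24); rates are derived here.
perf <- data.frame(
  models    = c(1000, 10000, 50000),
  rows      = c(26280000, 262800000, 1314000000),
  build_sec = c(65.2, 516, 55.85 * 60),   # 55.85 minutes converted to seconds
  score_sec = c(25.7, 217, 18 * 60)       # 18 minutes converted to seconds
)

perf$rows_per_model       <- perf$rows / perf$models
perf$models_built_per_sec <- perf$models / perf$build_sec
perf$rows_scored_per_sec  <- perf$rows / perf$score_sec

print(perf)
```

Each test uses the same 26,280 rows per model, which would match three years of hourly readings (3 × 8,760), although the slide does not state this. The build rate stays in the range of roughly 15 to 19 models per second across all three sizes, so total build time grows close to linearly with the number of models.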
Oracle Advanced Analytics on Exadata X3-2 ¼ Rack
Model Building using 30 numeric variables:
Leading Vendor on a machine connected directly to the same Exadata box took 2+ hours for ETL+Exec on 34 million records
Scalability of the new distributed ore.lm() Linear Regression
Engine / database size (records)        Seconds
Leading Vendor, 34 million              7,200
OAA, 34 million                         10.8
OAA, 180 million                        25.5
OAA, 299 million                        34.8
OAA, 2.99 billion                       315
[Chart: seconds (log scale, 1 to 10000) by engine/database size (records)]
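For a rough sense of scale, the chart values can be converted into a speedup and a records-per-second rate. The numbers below are read from the slide; the derived quantities are additions. Note that the leading vendor figure covers ETL plus execution, so the ratio reflects data movement as well as the algorithm.

```r
# Times in seconds and record counts as read from the chart on the slide.
runs <- data.frame(
  engine  = c("Leading Vendor", "OAA", "OAA", "OAA", "OAA"),
  records = c(34e6, 34e6, 180e6, 299e6, 2.99e9),
  seconds = c(7200, 10.8, 25.5, 34.8, 315)
)
runs$records_per_sec <- runs$records / runs$seconds

# Ratio of leading vendor time to OAA time on the same 34 million records
speedup <- runs$seconds[1] / runs$seconds[2]

print(runs)
print(speedup)
```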
Oracle Advanced Analytics Option
Fastest Way to Deliver Scalable Enterprise-wide Predictive Analytics
• Better Decisions with Deeper Insights & Predictive Analytics
– Understand and predict customer behavior for churn, fraud,
cross-sell, and many other business problems
• Easy to Use
– Data analysts: Mining work flow GUI (part of SQL Developer)
– Data scientists: R and SQL languages supported
– DBA: SQL integration
• Comprehensive Analytics on a Simple Architecture
– Performance and scalability of the Oracle Database
– Lowest Total Cost of Ownership
– No need for separate analytical servers
• Components
Oracle R Advanced Analytics for Hadoop
Oracle Big Data Connectors option to Big Data Appliance
[Figure: ORAAH architecture. Labels: R Client, R script {CRAN packages}, ORD, Hadoop Abstraction Layer, R HDFS, R Hive, R sqoop/OLH, R MapReduce, HCache, Hadoop Cluster, Hadoop Job, Mapper, Reducer, R MapReduce {CRAN packages}, MapReduce Nodes, HDFS Nodes, Oracle Database]
• Transparent access to Hadoop Cluster from R
• Manipulate data in HDFS, Hive, database, and file system
• Write and execute MapReduce jobs with R
• Leverage CRAN R packages to work on HDFS-resident data
Oracle Big Data Platform
[Figure: platform diagram]
• Oracle Big Data Appliance: optimized for Hadoop, R, and NoSQL processing
• Oracle Big Data Connectors
• Oracle Exadata: “System of Record”, optimized for DW/OLTP
• Oracle Exalytics: optimized for analytics and in-memory workloads
Component labels in the diagram: Oracle Enterprise Performance Management, Oracle Business Intelligence Applications, Oracle Business Intelligence Tools, Oracle Endeca Information Discovery, Hadoop, Oracle R Distribution, Applications, Oracle NoSQL Database, Oracle Big Data Connectors, Oracle Data Integrator, Data Warehouse, Oracle Database, Oracle Advanced Analytics
Oracle Advanced Analytics: Oracle R Enterprise, Oracle Data Mining, Oracle R Advanced Analytics for Hadoop + … Oracle R Distribution
A Global Customer Tour
Panoramic Houston skyline
The space shuttle Challenger
atop its Boeing 747 SCA, flying
Quick Houston Facts:
• Most populous city in Texas
• Metropolitan area is the fifth-most populated in the U.S., with over 6 million people
• Leading in energy, manufacturing, aeronautics, transportation, health care sectors and building oilfield equipment
• Only New York City is home to more Fortune 500 Headquarters.
Oracle R Enterprise at Apache Oil:
• Segmentation of drilling problems to understand potential problems ahead of time
• Predictive maintenance of assets to prevent waiting a day for replacement of drill bits or other
Mayan City of Tikal
Guatemala City Today
Quick Guatemala Facts:
• 15.8 million inhabitants
• Guatemala City is the capital
• Spanish spoken by 93% of the population
• 21 Mayan and 2 Amerindian languages also spoken
• Service sector is the largest component of GDP at 63%, followed by the industry sector at 23.8% and the agriculture sector at 13.2% (2010 est.)
Oracle R Enterprise at TIGO:
• Customer behavior of 5.5M customers with 1.8B transactions
• Generate 5 models per customer to understand mobility using Lat/Long of the cell tower of each transaction
• Evaluate 27.5M segmentation models in 25 minutes, or over 1M models/minute
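The model count and rate quoted for TIGO follow directly from the other figures on the slide, as this short check shows. The variable names are chosen here for illustration.

```r
customers           <- 5.5e6   # from the slide
models_per_customer <- 5       # from the slide
minutes             <- 25      # from the slide

total_models      <- customers * models_per_customer
models_per_minute <- total_models / minutes

print(total_models)
print(models_per_minute)
```

Five models for each of 5.5 million customers gives the 27.5 million models quoted, and spreading them over 25 minutes gives 1.1 million models per minute, consistent with "over 1M models/minute".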
Quick Cincinnati Facts:
• First major American city founded after the American Revolution
• First major inland “purely American” city in the country
• Cincinnati Reds have a storied history as the first professional club, hosting the first night game, and dominating the 1970s as the "Big Red Machine"
Cincinnati Reds:
Music Hall
American Sign Museum
Oracle Advanced Analytics at dunnhumby:
• Very long ETL time eliminated with in-database Advanced Analytics
• Modeling behavior of millions of shoppers
• Coupon optimization for retailers on billions of transactions
Quick Orlando Facts:
• Nicknamed "The City Beautiful"
• Symbol is the fountain at Lake Eola
• “Theme Park Capital of the World”
• 51+ million tourists a year, 3.6 million international
• Walt Disney World Resort:
– Magic Kingdom, Hollywood Studios, Epcot, Animal Kingdom
• Universal Studios Orlando
• SeaWorld
TODAY, 2:00PM!
Moscone South 308
CON2898
Oracle R Enterprise at Olive Garden:
• Olive Garden, traditionally managing its 830 restaurants nationally, transitioned to a localized approach with the help of predictive analytics
• Evaluated 115 million transactions in just 5 percent of the time required by the previous BI tool
• Supporting Olive Garden’s latest remodel campaign, continuing to uncover millions in profits by
Quick Lima, Peru Facts:
• Capital and the largest city of Peru with 9M citizens
• Most populous metropolitan area of Peru
• Fifth largest city in the Americas (as defined by "city proper")
• Home to one of the oldest higher learning institutions in the New World
– National University of San Marcos, founded on May 12, 1551
Oracle R Enterprise at Financiera Uno:
• Reduce time to build credit scoring models to ensure their market relevancy
• Scale to handle “big data” volumes
• Rapidly deploy credit scoring models into production applications
Quick London Facts:
• One of the world's leading financial centers
• Has the fifth- or sixth-largest metropolitan area GDP in the world, depending on measurement
• World cultural capital
• World's most-visited city as measured by international arrivals
• World's largest city airport system measured by passenger traffic
Oracle R Enterprise at Major Financial Company:
• Earnings calculations reduced from 7 hours to 4 minutes
• Scoring on written premium reduced from 100 minutes to 7 minutes
• Scoring on earned premium reduced from 25 minutes to 8 minutes with added functionality
Quick Geneva Facts:
• Most populous city of Romandy, the French-speaking part of Switzerland
• A financial center
• Worldwide center for diplomacy
• Headquarters of many of the agencies of the United Nations and the Red Cross
• Hosts the highest number of international organizations in the world
Oracle R Enterprise at CERN:
• Real-time monitoring and anomaly detection of tens of thousands of events per second
• CERN Central Logging Service: complex in-database time series analysis and forecasting
Quick Croatia Facts:
• Member of European Union (EU) and United Nations (UN)
• Tourism is a significant source of revenue during the summer
• Ranked 18th most popular tourist destination in the world
Oracle R Enterprise at ZABA Bank:
• Historical Customer Behavior Analysis shortened from several months to 2 weeks
• Specialized Variable Clustering algorithm running in parallel to replace leading vendor solution
• Faster model development resulted in better model quality and increasing bottom line
Quick Korea Facts:
• Roughly half of the country's 50 million people reside in the metropolitan area surrounding its capital, Seoul
• Seoul Capital Area is the second largest in the world with over 25 million residents
• Eighth largest country in international trade
• A regional power with world's 10th largest defense budget
Oracle R Enterprise at BISTEL:
• Oracle ORE enables BISTEL to perform analytics with much more data faster and enables them to gain more insight (root cause and prediction)
• With Oracle Exadata BISTEL can do enterprise advanced process control in Mega/Giga fabs in high-tech manufacturing
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/overview/stubhub-optimization-fraud-detect-2265566.mp4
Demonstration