### In-Database Analytics Deep Dive with Teradata and Revolution R

**Mario Inchiosa**

*Chief Scientist, Revolution Analytics*

**Tim Miller**

*Partner Integration Lab, Teradata*

### Agenda

### • Introduction

### • Revolution R Enterprise

### • Case Study – Global Internet Marketplace

### • Under the Hood

### • Summary & Questions

### Poll Question #1

### • What data storage/management software do you use?

### > Hadoop

### > Teradata

### > LSF Clusters/Grids

### > Servers

**Please choose all that apply**

### What is R?

### • Most powerful statistical programming language

### – Flexible, extensible and comprehensive for productivity

### • Most widely used data analysis software

### – Used by 2M+ data scientists, statisticians and analysts

### • Create beautiful and unique data visualizations

### – As seen in New York Times, Twitter and Flowing Data

### • Thriving open-source community

### – Leading edge of analytics research

### • Fills the talent gap

### – New graduates prefer R

### R is Hot

**WHITE PAPER:** bit.ly/r-is-hot

### Exploding growth and demand for R

### • **R is the highest paid IT skill**

### > Dice.com, Jan 2014

### • **R most-used data science language after SQL**

### > O’Reilly, Jan 2014

### • **R is used by 70% of data miners**

### > Rexer, Sep 2013

### • **R is #15 of all programming languages**

### > RedMonk, Jan 2014

### • **R growing faster than any other data science language**

### > KDnuggets, Aug 2013

### • **More than 2 million users worldwide**

**R Usage Growth**

Rexer Data Miner Survey, 2007-2013: 70% of data miners report using R.

R is the first choice of more data miners than any other software.

Source: www.rexeranalytics.com

### Server Based vs. In-Database Architectures

[Figure: side-by-side comparison. Panel **Desktop and Server Analytic Architecture**: SQL Request, Sample Data, Results, with a credit-risk decision tree (Income>$40K; Debt<10% of Income; Debt=0%; Yes/No branches leading to Good Credit Risks or Bad Credit Risks). Panel **In-Database Analytic Architecture**: the same decision tree, Results.]

### Why Is Teradata Different?

[Figure: **Exponential Performance Improvement** for the Analyst.]

### Challenges Running R in Parallel

### • R is distributed across nodes or servers

### • Runs independently of the other nodes/servers

### > Great for row independent processing such as Model Scoring

### > However, for analytic functions requiring all the data such as Model Building…

### – **Onus is on the R programmer to understand data parallelism**

**Example: Median (Midpoint)**

### Data: 1 1 1 1 2 9 1 7 9 3 9 9

**Node Level**

### 1. Find median per node

### 2. Consolidate and find the midpoint of the results

### 3. Produce the wrong answer

**Node level calculation: 1 2 7 9 = 4.5 < Wrong**

**System Level**

### 1. Sort all the data

### 2. Take midpoint

### 3. Produce the right answer

**System level calculation: 1 1 1 1 1 2 3 7 9 9 9 9 = 2.5 < Right**
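The median example can be reproduced in a few lines of base R. This is a minimal sketch: the assignment of three consecutive values to each of four nodes is an assumption, chosen because it yields the node medians 1, 2, 7 and 9 shown on the slide.

```r
# The twelve values from the slide.
x <- c(1, 1, 1,  1, 2, 9,  1, 7, 9,  3, 9, 9)

# Assumed layout: consecutive triples live on the same node (4 nodes).
node_id <- rep(1:4, each = 3)

# Node-level approach: median on each node, then the midpoint of those medians.
node_medians <- vapply(split(x, node_id), median, numeric(1))
node_level   <- median(node_medians)

# System-level approach: a single median over all the data.
system_level <- median(x)

print(unname(node_medians))  # 1 2 7 9
print(node_level)            # 4.5 (wrong)
print(system_level)          # 2.5 (right)
```

The two answers differ because a median cannot be recovered from per-node medians alone; the algorithm needs information about the whole distribution.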

### R Operations on Data

### R operates on independent rows

### > Score models for a given observation

### > Parsing Text field

### > Log(x)

### R operates on independent partitions

### > Fit a model to a partition such as region, time, product or store

### R operates on the entire data set

### > Global sales average

### > Regression on all customers
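The three kinds of operation can be illustrated with base R on a small simulated table. This is only a sketch: the `sales` data frame, its columns and the simulated coefficients are hypothetical, and the code runs in a single R session rather than across nodes.

```r
set.seed(1)

# Hypothetical data: two regions, 50 rows each.
sales <- data.frame(
  region = rep(c("east", "west"), each = 50),
  price  = runif(100, 1, 10)
)
sales$units <- 100 - 5 * sales$price + rnorm(100)

# 1. Independent rows: each output value depends on one input row only.
sales$log_price <- log(sales$price)

# 2. Independent partitions: one model per region, no data shared between them.
fits   <- lapply(split(sales, sales$region),
                 function(part) lm(units ~ price, data = part))
slopes <- vapply(fits, function(f) unname(coef(f)["price"]), numeric(1))

# 3. Entire data set: a global average needs every row, but it can still be
#    assembled from per-partition sums and counts.
parts       <- lapply(split(sales$units, sales$region),
                      function(u) c(sum = sum(u), n = length(u)))
totals      <- Reduce(`+`, parts)
global_mean <- totals[["sum"]] / totals[["n"]]

print(slopes)
print(global_mean)
```

The global average combines cleanly from partial results; statistics such as the median or a regression over all customers need more careful coordination.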


### Poll Question #2

### • What statistical programming tools do you use?

### > R/RRE

### > SAS

### > SPSS

### > Statistica

### > KXEN

**Please choose all that apply**

### Revolution Analytics

**Who is Revolution Analytics?**

**OUR COMPANY**

**The leading provider of advanced analytics software and services** based on open source R, since 2007

**OUR SOFTWARE**

**The only Big Data, Big Analytics software platform** based on the data science language R

**SOME KUDOS**

**Visionary**, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014

**Finance**, **Insurance**, **Healthcare & Pharma**, **Digital Economy**, **Analytics Service Providers**, **Manufacturing & High Tech**

### Revolution R Enterprise is…

### the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics

### • High Performance, Scalable Analytics

### • Portable Across Enterprise Platforms

### • Easier to Build & Deploy Analytics

### R: Open Source that Drives Innovation, but… It Has Some Limitations for Enterprises

| Area | Open source R | Revolution R Enterprise | Benefit |
|---|---|---|---|
| Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors |
| Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time |
| Enterprise Readiness | Community support | Commercial support | Delivers full service production support |
| Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R |
| Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source |

### Introducing Revolution R Enterprise (RRE)

**The Big Data Big Analytics Platform**

[Figure: platform components **DistributedR**, **ScaleR**, **ConnectR**, **DevelopR**, **DeployR**]

### • Big Data Big Analytics Ready

### > Enterprise readiness

### > High performance analytics

### > Multi-platform architecture

### > Data source integration

### > Development tools

### > Deployment tools

### The Platform Step by Step: R Capabilities

**R+CRAN**

• Open source R interpreter

• UPDATED R 3.1.1

• Freely-available R algorithms

• Algorithms callable by RevoR

• Embeddable in R scripts

• 100% Compatible with existing R scripts, functions and packages

**RevoR**

• Based on open source R

• Adds high-performance math

**Available On:**

• Teradata Database

• Hortonworks Hadoop

• Cloudera Hadoop

• MapR Hadoop

• IBM Platform LSF Linux

• Microsoft HPC Clusters

• Windows & Linux Servers

• Windows & Linux Workstations

### The Platform Step by Step: Tools & Deployment

**DeployR**

• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs

• Integrates R into application infrastructures

**Capabilities:**

• Invokes R scripts from web services calls

• RESTful interface for easy integration

• Works with web & mobile apps, leading BI & visualization tools and business rules engines

**DevelopR**

• Integrated development environment for R

• Visual ‘step-into’ debugger

• Based on Visual Studio Isolated Shell

**Available on:**

• Windows

### DevelopR - Integrated Development Environment

[Figure: annotated screenshot of the DevelopR IDE with the callouts below.]

### • Script with type ahead and code snippets

### • Solutions window for organizing code and data

### • Packages installed and loaded

### • Objects loaded in the R Environment

### • Object details

### • Sophisticated debugging with breakpoints, variable values etc.

### DeployR - Integration with 3rd Party Software

### • Seamless

### – Bring the power of R to any web enabled application

### • Simple

### – Leverage common APIs including JS, Java, .NET

### • Scalable

### – Robustly scale user and compute workloads

### • Secure

### – Manage enterprise security with LDAP & SSO

[Figure: **DeployR** sits between the R / Statistical Modeling Expert and the Deployment Expert, serving Data Analysis, Business Intelligence, Mobile Web Apps and Cloud / SaaS.]

### The Platform Step by Step: Parallelization & Data Sourcing

**ConnectR**

• High-speed & direct connectors

**Available for:**

• High-performance XDF

• SAS, SPSS, delimited & fixed format text data files

• Hadoop HDFS (text & XDF)

• Teradata Database

• ODBC

**ScaleR**

• Ready-to-Use high-performance big data big analytics

• Fully-parallelized analytics

• Data prep & data distillation

• Descriptive statistics & statistical tests

• Correlation & covariance matrices

• Predictive Models – linear, logistic, GLM

• Machine learning

• Monte Carlo simulation

• Tools for distributing customized algorithms across nodes

**DistributedR**

• Distributed computing framework

• Delivers portability across platforms

**Available on:**

• Teradata Database

• Hortonworks / Cloudera / MapR

• Windows Servers / HPC Clusters

• IBM Platform LSF Linux Clusters

• Red Hat Linux Servers

• SuSE Linux Servers

### Revolution R Enterprise ScaleR: High Performance Big Data Analytics

### Data Prep, Distillation & Descriptive Analytics

**R Data Step**

• Data import – Delimited, Fixed, SAS, SPSS, ODBC

• **Variable creation & transformation using any R functions and packages**

• Recode variables

• Factor variables

• Missing value handling

• Sort

• Merge

• Split

• Aggregate by category (means, sums)

**Descriptive Statistics**

• Min / Max

• Mean

• Median (approx.)

• Quantiles (approx.)

• Standard Deviation

• Variance

• Correlation

• Covariance

• Sum of Squares (cross product matrix)

• Pairwise Cross tabs

• Risk Ratio & Odds Ratio

• Cross-Tabulation of Data

• Marginal Summaries of Cross Tabulations

**Statistical Tests**

• Chi Square Test

• Kendall Rank Correlation

• Fisher’s Exact Test

• Student’s t-Test

**Sampling**

• Subsample (observations & variables)

• Random Sampling

### Revolution R Enterprise ScaleR (continued)

### Statistical Modeling

**Predictive Models**

• Covariance/Correlation/Sum of Squares/Cross-product Matrix

• Multiple Linear Regression

• Logistic Regression

• Generalized Linear Models (GLM)

- All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie.

- Standard link functions including: cauchit, identity, log, logit, probit.

- User defined distributions & link functions.

• Classification & Regression Trees and Forests

• Gradient Boosted Trees

• Residuals for all models

**Data Visualization**

• Histogram

• ROC Curves (actual data and predicted values)

• Lorenz Curve

• Line and Scatter Plots

• Tree Visualization

**Variable Selection**

• Stepwise Regression: Linear, Logistic, GLM

**Simulation and HPC**

• Monte Carlo

• **Run open source R functions and packages across cores and nodes**

### Machine Learning

**Cluster Analysis**

• K-Means

**Classification & Regression**

• Decision Trees

• Decision Forests

• Gradient Boosted Trees

**Deployment**

• Prediction (scoring)

• PMML Export

### Write Once…Deploy Anywhere.

**DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE**

[Figure: **DistributedR**, **ScaleR**, **ConnectR** and **DeployR** running on each of the platforms below.]

**In the Cloud:** Amazon AWS

**Workstations & Servers:** Windows, Linux

**Clustered Systems:** IBM Platform LSF, Microsoft HPC

**Hadoop:** Hortonworks, Cloudera, MapR

**EDW:** Teradata Database

### Case Study - Global Internet Marketplace

### • *Challenge: Model and score 250M customers*

### • *Server-based workflow was taking 3 days*

### • *Move calculation in-database to drastically reduce runtime, process twice as many customers, and increase lift*

### Existing Open Source R model

### • Binomial Logistic Regression

### > 50+ independent variables, including categorical variables coded as indicator variables

### > Train from small sample (many thousands) – not a problem in and of itself

### > Scoring across entire corpus (many hundreds of millions) – slightly more challenging

### Revolution R Enterprise model

### • Same Binomial Logistic Regression

### > 50+ independent variables, including categorical variables coded as indicator variables

### > Train from large sample (many millions) – more accurately captures user patterns and increases lift

### > Scoring across entire corpus (many hundreds of millions) – completes in minutes

### RRE Used to Optimize the Current Process

**By moving the compute to the data**

[Figure: workflow **Before** and **After**.]

**Reduced 3 day process to 10 minutes**

### Benchmarking the Optimized Process

[Figure: **Scaling study: Time vs. Number of Rows**, plotting **time** against **rows** for **Server-based (Not In-DB)** and **In-DB**.]

**NOTE:**

• **Teradata Environment**

> **4 node, 1700 Appliance**

• **RRE Environment**

> **version 7.2, R 3.0.2**

### Optimization process

**Recode Open Source R to Revolution R Enterprise**

### • Before

    trainit <- glm(as.formula(specs[[i]]), data = training.data, family='binomial', maxit=iters)
    fits <- predict(trainit, newdata=test.data, type='response')

### • After

    trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family='binomial', maxIterations=iters)
    fits <- rxPredict(trainit, newdata=test.data, type='response')

Only the two function calls change: glm becomes rxGlm (with maxit renamed to maxIterations) and predict becomes rxPredict.
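To make the "Before" workflow concrete, here is a minimal runnable sketch in open source R. Everything about the data is an assumption: the `customers` table, its columns, the simulated coefficients and the single formula in `specs` are hypothetical stand-ins for the case study's 50+ variables. The RRE version is not run here, since it requires Revolution R Enterprise.

```r
set.seed(42)

# Hypothetical customer table; the real model used 50+ independent variables.
n <- 2000
customers <- data.frame(
  visits  = rpois(n, 5),
  segment = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
eta <- -2 + 0.3 * customers$visits + ifelse(customers$segment == "c", 0.8, 0)
customers$bought <- rbinom(n, 1, plogis(eta))

# Train on a small sample, score everything else.
train_idx     <- sample(n, 500)
training.data <- customers[train_idx, ]
test.data     <- customers[-train_idx, ]

# Names below mirror the slide; their values are assumptions.
specs <- list("bought ~ visits + segment")
iters <- 25
i     <- 1

# The "Before" code from the slide, unchanged.
trainit <- glm(as.formula(specs[[i]]), data = training.data,
               family = 'binomial', maxit = iters)
fits <- predict(trainit, newdata = test.data, type = 'response')

print(summary(fits))
```

The factor `segment` is expanded into indicator variables by `glm` automatically, which is the "categorical with indicator variables" setup the case study describes.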

### Revolution R Enterprise: How RRE ScaleR Actually Works

### Revolution R Enterprise: RevoR - Performance Enhanced R

| Computation (4-core laptop) | Open Source R | Revolution R | Speedup |
|---|---|---|---|
| **Linear Algebra** ^{1} | | | |
| Matrix Multiply | 176 sec | 9.3 sec | 18x |
| Cholesky Factorization | 25.5 sec | 1.3 sec | 19x |
| Linear Discriminant Analysis | 189 sec | 74 sec | 3x |
| **General R Benchmarks** ^{2} | | | |
| R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x |
| R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable |

1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php

2. http://r.research.att.com/benchmarks/

### Customers report 3-50x performance improvements compared to Open Source R, without changing any code

### Scalable and Parallelized Across Cores and Nodes

### Scalability and Portability of PEMAs

**“Parallel External Memory Algorithms”**

### • Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize

### • Process a chunk of data at a time, giving linear scalability

### • Process an unlimited number of rows of data in a fixed amount of RAM

### • Independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions

### • Independent of where the data is coming from, giving portability with respect to data sources
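The four PEMA stages can be sketched in base R for a simple statistic. This is an illustration of the pattern only, not ScaleR's implementation: the `pema_*` function names are hypothetical, the data are simulated, and the sum-of-squares variance formula is chosen for brevity rather than numerical robustness.

```r
# 1) Initialize: an empty accumulator of sufficient statistics.
pema_init <- function() c(n = 0, sum = 0, sumsq = 0)

# 2) Process Chunk: summarize one chunk without looking at any other chunk.
pema_process <- function(chunk) {
  c(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3) Aggregate: combine two partial results (order does not matter).
pema_aggregate <- function(a, b) a + b

# 4) Finalize: turn the accumulated statistics into the answer.
pema_finalize <- function(acc) {
  m <- acc[["sum"]] / acc[["n"]]
  v <- (acc[["sumsq"]] - acc[["n"]] * m^2) / (acc[["n"]] - 1)
  c(mean = m, var = v)
}

# Simulated data, read 1,000 rows at a time.
set.seed(7)
x      <- rnorm(10000, mean = 3, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))

acc <- pema_init()
for (chunk in chunks) {
  acc <- pema_aggregate(acc, pema_process(chunk))
}
result <- pema_finalize(acc)
print(result)
```

Only one chunk and the small accumulator are needed at any time, which is why memory use stays fixed as the row count grows. Because aggregation is order-independent, the chunks could equally be processed on different cores or nodes.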

### ScaleR Performance

### • Efficient computational algorithms

### • Efficient memory management – minimize data copying and data conversion

### • Heavy use of C++ templates; optimal code

### • Efficient data file format; fast access by row and column

### • Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)

### • Handle categorical variables efficiently

### Speed and Scalability Comparison

### • Unique PEMAs: Parallel, external-memory algorithms

### • High-performance, scalable replacements for R/SAS analytic functions

### • Parallel/distributed processing eliminates CPU bottleneck

### • Data streaming eliminates memory size limitations

### • Works with in-memory and disk-based architectures

### In-Database Billion Row Logistic Regression

### • 114 seconds on Teradata 2650 (6 nodes, 72 cores), including time to read data

### • Scales linearly with number of rows

### • Scales linearly with number of nodes: 3x faster than on 2 node Teradata system

### Allstate compares SAS, Hadoop, and R for Big-Data Insurance Models

| Approach | Platform | Time to fit |
|---|---|---|
| SAS | 16-core Sun Server | 5 hours |
| rmr/MapReduce | 10-node 80-core Hadoop Cluster | > 10 hours |
| R | 250 GB Server | Impossible (> 3 days) |
| Revolution R Enterprise | In-Teradata on 6-node 2650 | 3.3 minutes |

### Generalized linear model, 150 million observations, 70 degrees of freedom

http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html

### Poll Question #3

### • At what stage are you in your in-database analytics deployment project?

### > Still researching tools and methods

### > Evaluating/Selecting data storage/management platform

### > Evaluating/Selecting analytics programming tools

### > Launched the project/working on it now

### > We’re done and looking for another one!

**Please select one answer**

### RRE End-User’s Perspective

### • Revolution R Enterprise has a new “data source”, RxTeradata (ODBC and TPT)

    # Change the data source if necessary
    tdConn <- "DRIVER=…; IP=…; DATABASE=…; UID=…; PWD=…"
    teradataDS <- RxTeradata(table="…", connectionString=tdConn, …)

### • Revolution R Enterprise has a new “compute context”, RxInTeradata

    # Change the “compute context”
    tdCompute <- RxInTeradata(connectionString=..., shareDir=..., remoteShareDir=...,
                              revoPath=..., wait=..., consoleOutput=...)

### • Sample code for R Logistic Regression

    # Specify model formula and parameters
    rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier
            + F(CRSDepTime), data=teradataDS)

### Table Operators – Teradata 14.10+

### • Table User Defined Functions (UDFs) allow users to place a function in the FROM clause of a SELECT statement

### • Table Operators extend the existing table UDF capability:

### > Table Operators are Object Oriented

### – Inputs and outputs can be arbitrary and not “fixed” as Table UDFs require

### > Table Operators have a simpler row iterator interface

### – Interface simply produces output rows, providing a more natural application development interface than Table UDFs

### > Table Operators operate on a stream of rows

### – Rows are buffered for high performance, eliminating row-at-a-time processing

### > Table Operators support PARTITION BY and ORDER BY

### – Allows the development of Map Reduce style operators in-database
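The PARTITION BY / ORDER BY behaviour can be mimicked in base R to show what a Map Reduce style operator looks like. This is an analogy only, not Teradata's Table Operator interface: the `events` table and the `running_total` operator are hypothetical.

```r
# Hypothetical input table.
events <- data.frame(
  customer = c("a", "b", "a", "b", "a"),
  day      = c(3, 1, 1, 2, 2),
  amount   = c(30, 5, 10, 7, 20)
)

# The "operator": receives the rows of one partition, already ordered,
# and emits output rows whose shape it decides itself.
running_total <- function(rows) {
  data.frame(customer = rows$customer,
             day      = rows$day,
             total    = cumsum(rows$amount))
}

# PARTITION BY customer, then ORDER BY day within each partition.
partitions <- split(events, events$customer)
ordered    <- lapply(partitions, function(p) p[order(p$day), ])

# Each partition is processed independently, then the outputs are stacked.
result <- do.call(rbind, lapply(ordered, running_total))
rownames(result) <- NULL
print(result)
```

Because each partition is handled independently, the partitions could run in parallel, which is the property the in-database operators exploit.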

### RRE Architecture in Teradata 14.10+

[Figure: Request and Response pass between the client and a Master Process in the PE Layer. A Message Passing Layer connects the Master Process to Worker Processes in the AMP Layer, each working on its own Data Partition. * All communication is done by binary BLOBs.]

### 1. RRE commands are sent to a “Master Process” - an External Stored Procedure (XSP) in the Parsing Engine that provides parallel coordination

### 2. RRE analytics are split into “Worker Process” tasks that run in a Table Operator (TO) on every AMP.

### a. HPA analytics iterate over the data, and intermediate results are analyzed and managed by the XSP.

### b. HPC analytics do not iterate, and final results from each AMP are returned to the XSP

### 3. Final combined results are assembled by the XSP and returned to the user
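The iterate-and-combine pattern described for HPA analytics can be sketched in base R with a logistic regression fitted by iteratively reweighted least squares. This is an illustration under stated assumptions, not RRE's implementation: `worker_step` and `master_fit` are hypothetical names, the four list elements stand in for AMPs, and everything runs in one R session.

```r
# Worker: for one data partition and the current coefficients, return the
# partition's contribution to X'WX and X'Wz (the intermediate results).
worker_step <- function(part, beta) {
  X   <- part$X
  y   <- part$y
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)
  z   <- eta + (y - mu) / w
  list(xtwx = crossprod(X, w * X), xtwz = crossprod(X, w * z))
}

# Master: sum the workers' pieces, update the coefficients, and iterate.
master_fit <- function(partitions, p, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, p)
  for (iter in seq_len(max_iter)) {
    pieces   <- lapply(partitions, worker_step, beta = beta)
    xtwx     <- Reduce(`+`, lapply(pieces, `[[`, "xtwx"))
    xtwz     <- Reduce(`+`, lapply(pieces, `[[`, "xtwz"))
    new_beta <- drop(solve(xtwx, xtwz))
    done     <- max(abs(new_beta - beta)) < tol
    beta     <- new_beta
    if (done) break
  }
  beta
}

# Simulated data spread over four partitions.
set.seed(3)
n <- 4000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -2))))
amp <- rep(1:4, length.out = n)
partitions <- lapply(split(seq_len(n), amp),
                     function(idx) list(X = X[idx, , drop = FALSE], y = y[idx]))

beta_hat <- master_fit(partitions, p = ncol(X))
print(beta_hat)
```

Only small matrices travel between workers and master on each iteration, never the rows themselves, which is what makes moving the compute to the data worthwhile.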

    tdConnect <- RxTeradata(<data, connection string, …>)
    tdCompute <- RxInTeradata(<data, server arguments, …>)

### ** PUT-based Installer

### Summary

### • High-performance, scalable, portable, fully-featured algorithms

### • Integration with R ecosystem

### • Compatibility with Big Data ecosystem

### Questions?

**WE LOVE FEEDBACK**

### Rate this Session

**PARTNERS Mobile App**

**InfoHub Kiosks**

**teradata-partners.com**

### Resources for you (available on RevolutionAnalytics.com):

### • White Paper: Teradata and Revolution Analytics: For the Big Data Era, An Analytics Revolution

### • Webinar: Big Data Analytics with Teradata and Revolution Analytics
