In-Database Analytics Deep Dive with Teradata and Revolution R



Mario Inchiosa

Chief Scientist, Revolution Analytics


Tim Miller

Partner Integration Lab, Teradata


• Introduction

• Revolution R Enterprise

• Case Study – Global Internet Marketplace

• Under the Hood

• Summary & Questions



• What data storage/management software do you use?

> Hadoop

> Teradata

> LSF Clusters/Grids

> Servers

Please choose all that apply

Poll Question #1


• Most powerful statistical programming language

– Flexible, extensible and comprehensive for productivity

• Most widely used data analysis software

– Used by 2M+ data scientists, statisticians and analysts

• Create beautiful and unique data visualizations

– As seen in New York Times, Twitter and Flowing Data

• Thriving open-source community

– Leading edge of analytics research

• Fills the talent gap

– New graduates prefer R

What is R?

R is Hot



Exploding growth and demand for R

R is the highest paid IT skill

Jan 2014

R most-used data science language after SQL


O’Reilly, Jan 2014

R is used by 70% of data miners


Rexer, Sep 2013

R is #15 of all programming languages


RedMonk, Jan 2014

R growing faster than any other data science language


KDnuggets, Aug 2013

More than 2 million users worldwide

R Usage Growth

Rexer Data Miner Survey, 2007-2013

70% of data miners report using R

R is the first choice of more data miners than any other software



Server Based vs. In-Database Architectures

[Figure: two panels. "Desktop and Server Analytic Architecture": a SQL request pulls sample data out of the database before a decision tree (Debt < 10% of Income, Debt = 0%) separates good and bad credit risks. "In-Database Analytic Architecture": the same decision tree is shown without the SQL request and sample data step.]

Why Is Teradata Different?


Performance Improvement



Challenges Running R in Parallel

• R is distributed across nodes or servers

• Runs independently of the other nodes/servers

Great for row-independent processing such as Model Scoring

However, for analytic functions requiring all the data, such as Model Building, the onus is on the R programmer to understand data parallelism

Example: Median (Midpoint)

Data on four nodes: (1 1 1) (1 2 9) (1 7 9) (3 9 9)

Node Level

1. Find median per node

2. Consolidate and find the midpoint of the results

3. Produce the wrong answer

Node level calculation: 1 2 7 9 = 4.5 < Wrong

System Level

1. Sort all the data

2. Take midpoint

3. Produce the right answer

System level calculation: 1 1 1 1 1 2 3 7 9 9 9 9 = 2.5 < Right
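The median example above can be reproduced in a few lines of base R. This is a minimal sketch, assuming the twelve values on the slide are spread three per node across four nodes (a grouping that reproduces the per-node medians 1, 2, 7, 9). The variable names are illustrative only.

```r
# Twelve values from the slide, assumed to sit three per node on four nodes.
nodes <- list(c(1, 1, 1), c(1, 2, 9), c(1, 7, 9), c(3, 9, 9))

# Node level: take the median on each node, then the midpoint of those medians.
node_medians <- sapply(nodes, median)   # one median per node
node_level   <- median(node_medians)    # median of medians

# System level: bring all the data together, then take the median once.
system_level <- median(unlist(nodes))

cat("Node level:  ", node_level, "\n")
cat("System level:", system_level, "\n")
```

The two numbers differ because a median cannot be rebuilt from per-node medians alone, which is why the whole-data case needs an algorithm designed for distributed data.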


R Operations on Data

R operates on independent rows

> Score models for a given observation

> Parsing Text field

> Log(x)

R operates on independent partitions

> Fit a model to a partition such as region, time, product or store

R operates on the entire data set

> Global sales average

> Regression on all customers
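The three kinds of operation listed above can be illustrated with base R on a toy data frame. This is a sketch only: the `sales` table and its columns are made up for the example, and nothing here uses Teradata or Revolution R Enterprise.

```r
# Hypothetical toy data: two regions, three price points each.
sales <- data.frame(
  region = c("East", "East", "East", "West", "West", "West"),
  price  = c(1, 2, 3, 1, 2, 3),
  units  = c(10, 8, 6, 20, 15, 10)
)

# 1. Independent rows: each output value depends on one input row only.
sales$log_units <- log(sales$units)

# 2. Independent partitions: fit one model per region.
fits   <- lapply(split(sales, sales$region),
                 function(d) lm(units ~ price, data = d))
slopes <- sapply(fits, function(f) unname(coef(f)["price"]))

# 3. Entire data set: the result needs every row.
global_mean <- mean(sales$units)

print(slopes)
print(global_mean)
```

Cases 1 and 2 can run on each node without communication as long as a partition lives on one node. Case 3 is the one that needs coordination across nodes.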



• What statistical programming tools do you use?




> Statistica


Please choose all that apply

Poll Question #2


Who is Revolution Analytics?

The leading provider of advanced analytics software and services based on open source R, since 2007

The only Big Data, Big Analytics software platform based on the data science language R

Gartner Magic Quadrant for Advanced Analytics Platforms, 2014

• Finance

• Insurance

• Healthcare & Pharma

• Digital Economy

• Analytics Service Providers

• Manufacturing & High Tech


Revolution R Enterprise


The only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics

• High Performance, Scalable Analytics

• Portable Across Enterprise Platforms

• Easier to Build & Deploy Analytics


R: Open Source that Drives Innovation, but… It Has Some Limitations for Enterprises

Area | Open Source R | Revolution R Enterprise | Benefit
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source


The Big Data Big Analytics Platform

Introducing Revolution R Enterprise (RRE)


DevelopR DeployR

ScaleR ConnectR

• Big Data Big Analytics Ready

> Enterprise readiness

> High performance analytics

> Multi-platform architecture

> Data source integration

> Development tools

> Deployment tools


The Platform Step by Step:

R Capabilities


• Open source R interpreter

• UPDATED R 3.1.1

• Freely-available R algorithms

• Algorithms callable by RevoR

• Embeddable in R scripts

• 100% Compatible with existing R scripts, functions and packages



• Based on open source R

• Adds high-performance math

Available On:

• Teradata Database

• Hortonworks Hadoop

• Cloudera Hadoop

• MapR Hadoop

• IBM Platform LSF Linux

• Microsoft HPC Clusters

• Windows & Linux Servers

• Windows & Linux Workstations



• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs

• Integrates R into application infrastructures


• Invokes R Scripts from web services calls

• RESTful interface for easy integration

• Works with web & mobile apps, leading BI & Visualization tools and business rules engines


• Integrated development environment for R

• Visual ‘step-into’ debugger

• Based on Visual Studio Isolated Shell

Available on:

• Windows



The Platform Step by Step:

Tools & Deployment


DevelopR - Integrated Development Environment

• Script with type ahead and code snippets

• Solutions window for organizing code and …

• Packages installed and …

• Objects loaded in the R Environment

• Object details

• Sophisticated debugging with breakpoints, variable values etc.


DeployR - Integration with 3rd Party Software

• Seamless

Bring the power of R to any web enabled application

• Simple

Leverage common APIs including JS, Java, .NET

• Scalable

Robustly scale user and compute workloads

• Secure

Manage enterprise security with LDAP & SSO

Data Analysis

Business Intelligence

Mobile Web Apps

Cloud / SaaS

R / Statistical Modeling Expert


Deployment Expert


The Platform Step by Step:

Parallelization & Data Sourcing

ConnectR

• High-speed & direct connectors

Available for:

• High-performance XDF

• SAS, SPSS, delimited & fixed format text data files

• Hadoop HDFS (text & XDF)

• Teradata Database



• Ready-to-Use high-performance big data big analytics

• Fully-parallelized analytics

• Data prep & data distillation

• Descriptive statistics & statistical tests

• Correlation & covariance matrices

• Predictive Models – linear, logistic, GLM

• Machine learning

• Monte Carlo simulation

• Tools for distributing customized algorithms across nodes


• Distributed computing framework

• Delivers portability across platforms

Available on:

• Teradata Database

• Hortonworks / Cloudera / MapR

• Windows Servers / HPC Clusters

• IBM Platform LSF Linux Clusters

• Red Hat Linux Servers

• SuSE Linux Servers


Revolution R Enterprise ScaleR:

High Performance Big Data Analytics

Data Prep, Distillation & Descriptive Analytics

R Data Step

• Data import – Delimited, Fixed, SAS, SPSS, ODBC

• Variable creation & transformation using any R functions and packages

• Recode variables

• Factor variables

• Missing value handling

• Sort

• Merge

• Split

• Aggregate by category (means, sums)

Descriptive Statistics

• Min / Max

• Mean

• Median (approx.)

• Quantiles (approx.)

• Standard Deviation

• Variance

• Correlation

• Covariance

• Sum of Squares (cross product matrix)

• Pairwise Cross tabs

• Risk Ratio & Odds Ratio

• Cross-Tabulation of Data

• Marginal Summaries of Cross Tabulations

Statistical Tests

• Chi Square Test

• Kendall Rank Correlation

• Fisher’s Exact Test

• Student’s t-Test

Sampling

• Subsample (observations & …)

• Random Sampling


Revolution R Enterprise ScaleR (continued)

Statistical Modeling

Predictive Models

• Covariance/Correlation/Sum of Squares/Cross-product Matrix

• Multiple Linear Regression

• Logistic Regression

• Generalized Linear Models (GLM)

– All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie

– Standard link functions including cauchit, identity, log, logit, probit

– User-defined distributions & link functions

• Classification & Regression Trees and Forests

• Gradient Boosted Trees

• Residuals for all models

Visualization

• Histogram

• ROC Curves (actual data and predicted values)

• Lorenz Curve

• Line and Scatter Plots

• Tree Visualization

Variable Selection

• Stepwise Regression

– Linear

– Logistic

Machine Learning

Cluster Analysis

• K-Means

Classification & Regression

• Decision Trees

• Decision Forests

• Gradient Boosted Trees

• Prediction (scoring)

• PMML Export

Simulation and HPC

• Monte Carlo

• Run open source R functions and packages across cores and nodes


DistributedR ScaleR



Write Once…Deploy Anywhere.


• In the Cloud: Amazon AWS

• Workstations & Servers: Windows, Linux

• Clustered Systems: IBM Platform LSF, Microsoft HPC

• Hortonworks, Cloudera, MapR

• Teradata Database


Case Study - Global Internet Marketplace

Challenge: Model and score 250M customers

Server-based workflow was taking 3 days

Move calculation in-database to drastically reduce runtime, process twice as many customers, and increase lift


Existing Open Source R model

• Binomial Logistic Regression

> 50+ Independent variables including categorical with indicator variables

> Train from small sample (many thousands) – not a problem in and of itself

> Scoring across entire corpus (many hundred millions) – slightly more challenging

Revolution R Enterprise model

• Same Binomial Logistic Regression

> 50+ Independent variables including categorical with indicator variables

> Train from large sample (many millions) – more accurately captures user patterns and increases lift

> Scoring across entire corpus (many hundred millions) – completes in minutes


RRE Used to Optimize the Current Process

By moving the compute to the data

Before vs. After: reduced 3-day process to 10 minutes


Benchmarking the Optimized Process

Scaling study: Time vs. Number of Rows

Teradata Environment

> 4 node, 1700 Appliance

RRE Environment

> version 7.2

> R 3.0.2

[Figure: time vs. number of rows, Server-based (Not In-DB) compared with In-DB]


Recode Open Source R to Revolution R Enterprise

Optimization process

• Before

trainit <- glm(as.formula(specs[[i]]), data = …, family = 'binomial', maxit = iters)

fits <- predict(trainit, …, type = 'response')

• After

trainit <- rxGlm(as.formula(specs[[i]]), data = …, family = 'binomial', maxIterations = iters)

fits <- rxPredict(trainit, …, type = 'response')
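For readers who want to run the "Before" pattern end to end, here is a minimal open source R sketch on simulated data. Everything in it is an assumption made for illustration: the `customers` table, its columns, the sample sizes and the formula string are not from the case study. The "After" version would swap in `rxGlm` and `rxPredict` as shown on the slide and requires Revolution R Enterprise, so it is not reproduced here.

```r
set.seed(42)

# Hypothetical customer table standing in for the full corpus.
n <- 1000
customers <- data.frame(
  x1      = rnorm(n),
  segment = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
customers$responded <- rbinom(n, 1, plogis(-1 + 0.8 * customers$x1))

# Train on a small sample, as in the original workflow.
train <- customers[sample(n, 200), ]

# Model spec kept as a string, mirroring as.formula(specs[[i]]) on the slide.
spec    <- "responded ~ x1 + segment"
trainit <- glm(as.formula(spec), data = train, family = "binomial", maxit = 25)

# Score the entire table; type = "response" returns probabilities.
fits <- predict(trainit, newdata = customers, type = "response")
summary(fits)
```

The training step is cheap because the sample is small. Scoring touches every row, which is the part that benefits most from running next to the data.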


Revolution R Enterprise: How RRE ScaleR Actually Works


Revolution R Enterprise: RevoR - Performance Enhanced R

Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable

Customers report 3-50x performance improvements compared to Open Source R, without changing any code


Scalable and Parallelized Across Cores and Nodes

Scalability and Portability of PEMAs

“Parallel External Memory Algorithms”

• Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize

• Process a chunk of data at a time, giving linear scalability

• Process an unlimited number of rows of data in a fixed amount of RAM

• Independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions

• Independent of where the data is coming from, giving portability with respect to data sources
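The four-step anatomy of a PEMA (Initialize, Process Chunk, Aggregate, Finalize) can be sketched in base R for a simple statistic. This is an illustration of the pattern only, not ScaleR's implementation: the `pema_*` function names are hypothetical, and the sum-of-squares variance formula is chosen for brevity rather than numerical robustness.

```r
# 1) Initialize: an empty running state.
pema_init <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process Chunk: summarize one chunk without seeing the others.
pema_process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3) Aggregate: combine two intermediate states into one.
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the combined state into the statistics of interest.
pema_finalize <- function(s) {
  m <- s$sum / s$n
  list(mean = m, var = (s$sumsq - s$n * m^2) / (s$n - 1))
}

# Twelve values split into four chunks of three.
x      <- c(1, 1, 1, 1, 2, 9, 1, 7, 9, 3, 9, 9)
chunks <- split(x, rep(1:4, each = 3))

state  <- Reduce(pema_aggregate, lapply(chunks, pema_process_chunk), pema_init())
result <- pema_finalize(state)
print(result)
```

Only the small state list is held between chunks, so memory use does not grow with the number of rows, and the chunks could equally be processed on different cores or nodes before the aggregate step.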


ScaleR Performance

• Efficient computational algorithms

• Efficient memory management – minimize data copying and data conversion

• Heavy use of C++ templates; optimal code

• Efficient data file format; fast access by row and column

• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)

• Handle categorical variables efficiently


Speed and Scalability Comparison

• Unique PEMAs: Parallel, external-memory algorithms

• High-performance, scalable replacements for R/SAS analytic functions

• Parallel/distributed processing eliminates CPU bottleneck

• Data streaming eliminates memory size limitations

• Works with in-memory and disk-based architectures


In-Database Billion Row Logistic Regression

• 114 seconds on Teradata 2650 (6 nodes, 72 cores), including time to read data

• Scales linearly with number of rows

• Scales linearly with number of nodes: 3x faster than on 2-node Teradata system


Allstate compares SAS, Hadoop, and R for Big-Data Insurance Models

Generalized linear model, 150 million observations, 70 degrees of freedom

Approach | Platform | Time to fit
SAS | 16-core Sun Server | 5 hours
rmr/MapReduce | 10-node 80-core Hadoop Cluster | > 10 hours
R | 250 GB Server | Impossible (> 3 days)
Revolution R Enterprise | In-Teradata on 6-node 2650 | 3.3 minutes


• At what stage are you in your in-database analytics deployment project?

> Still researching tools and methods

> Evaluating/Selecting data storage/management platform

> Evaluating/Selecting analytics programming tools

> Launched the project/working on it now

> We’re done and looking for another one!

Please select one answer

Poll Question #3


RRE End-User’s Perspective

• Revolution R Enterprise has a new “data source”, RxTeradata (ODBC and TPT)

# Change the data source if necessary

tdConn <- "DRIVER=…; IP=…; DATABASE=…; UID=…; PWD=…"

teradataDS <- RxTeradata(table="…", connectionString=tdConn, …)

• Revolution R Enterprise has a new “compute context”, RxInTeradata

# Change the “compute context”

tdCompute <- RxInTeradata(connectionString=..., shareDir=..., remoteShareDir=..., revoPath=..., wait=..., consoleOutput=...)

# Make it the active compute context

rxSetComputeContext(tdCompute)

• Sample code for R Logistic Regression

# Specify model formula and parameters

rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=teradataDS)


Table Operators – Teradata 14.10+

• Table User Defined Functions (UDFs) allow users to place a function in the FROM clause of a SELECT statement

• Table Operators extend the existing table UDF capability:

> Table Operators are Object Oriented

– Inputs and outputs can be arbitrary and not “fixed” as Table UDFs require

> Table Operators have a simpler row iterator interface

– Interface simply produces output rows, providing a more natural application development interface than Table UDFs

> Table Operators operate on a stream of rows

– Rows are buffered for high performance, eliminating row-at-a-time processing

> Table Operators support PARTITION BY and ORDER BY

– Allows the development of Map Reduce style operators in-database


RRE Architecture in Teradata 14.10+

[Figure: request and response flow into Teradata 14.10+. A Master Process in the PE Layer communicates through a Message Passing Layer with Worker Processes in the AMP Layer, each shown next to a data partition.]

1. RRE commands are sent to a “Master Process”, an External Stored Procedure (XSP) in the Parsing Engine that provides parallel coordination

2. RRE analytics are split into “Worker Process” tasks that run in a Table Operator (TO) on every AMP

a. HPA analytics iterate over the data, and intermediate results are analyzed and managed by the XSP

b. HPC analytics do not iterate, and final results from each AMP are returned to the XSP

3. Final combined results are assembled by the XSP and returned to the user

tdConnect <- RxTeradata(<data, connection string, …>)

tdCompute <- RxInTeradata(<data, server arguments, …>)

* All communication is done by binary BLOBs

** PUT-based Installer
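The iterating (HPA) flow in step 2a can be mimicked in plain R to show why only small intermediate results need to travel between workers and the master. This is a conceptual sketch under stated assumptions, not the XSP or Table Operator code: `worker_task` and `master_fit` are hypothetical names, the data is simulated, and the algorithm shown is textbook iteratively reweighted least squares for logistic regression, which may differ from what ScaleR uses internally.

```r
# Worker: one pass over its own partition, returning only small matrices.
worker_task <- function(part, beta) {
  X   <- cbind(1, part$x)                 # intercept + one predictor
  eta <- as.vector(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)                    # IRLS weights
  z   <- eta + (part$y - mu) / w          # working response
  list(xtwx = crossprod(X, X * w), xtwz = crossprod(X, w * z))
}

# Master: combine the workers' partial results and update the coefficients.
master_fit <- function(partitions, iterations = 25, tol = 1e-10) {
  beta <- c(0, 0)
  for (i in seq_len(iterations)) {
    partials <- lapply(partitions, worker_task, beta = beta)
    xtwx     <- Reduce(`+`, lapply(partials, `[[`, "xtwx"))
    xtwz     <- Reduce(`+`, lapply(partials, `[[`, "xtwz"))
    new_beta <- as.vector(solve(xtwx, xtwz))
    converged <- max(abs(new_beta - beta)) < tol
    beta <- new_beta
    if (converged) break
  }
  beta
}

# Simulated data split into four partitions, standing in for four AMPs.
set.seed(1)
d   <- data.frame(x = rnorm(400))
d$y <- rbinom(400, 1, plogis(0.5 + d$x))
partitions <- split(d, rep(1:4, each = 100))

beta      <- master_fit(partitions)
reference <- unname(coef(glm(y ~ x, data = d, family = "binomial")))
print(rbind(distributed = beta, glm = reference))
```

Each iteration moves a 2x2 matrix and a 2x1 vector per worker rather than the rows themselves, and the result matches an ordinary `glm` fit on the combined data.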


• High-performance, scalable, portable, fully-featured algorithms

• Integration with R ecosystem

• Compatibility with Big Data ecosystem



PARTNERS Mobile App InfoHub Kiosks



Rate this Session


Resources for you (available on

• White Paper: Teradata and Revolution Analytics: For the Big Data Era, An Analytics Revolution

• Webinar: Big Data Analytics with Teradata and Revolution Analytics



Thank You!



