Master Thesis Presentation [IT4BI]

(1)

Coupling Databases

and Advanced

Master Thesis Presentation [IT4BI]

Student:

Sedem Seakomo

and Advanced

Analytical Tools (R)

Supervisor:

(2)

Outline

Introduction

The Problem

State of the art (Existing systems review)

Methodology

Empirical Work

The Results

(3)

Introduction

Background & Motivation Research Questions

Scope (Delimitations & Limitations) Importance & Contribution

Importance & Contribution Related Work

The Problem

State of the art (Existing systems review)

Methodology

(4)

Introduction

SQL/relational DBMS are powerful systems!

Managing, querying, and aggregating data

But what about complex analytics?

Not really!

Inferences, predictions, subtle relationships in data Inferences, predictions, subtle relationships in data

In spite of this, organizations still house large

amount of data in various SQL/RDBMS

(5)

Introduction

Objective:

Examine level of development of integration of R+DBMS

Assess performance, scalability and completeness of R+DBMS integration

R+DBMS integration

Motivation:

New Industry (Analytics Industry)

Development in Analytics front: Gleaning information and insights from data, now an industry in itself

(6)

Research Questions

What is the current level of development (completeness) of integration of R with DMBS?

How is the performance of coupling databases with advanced analytical tool (R) compared to stand-alone analytical tool (R)? How is the scalability of coupling databases with advanced How is the scalability of coupling databases with advanced analytical tool (R) compared to stand-alone analytical tool (R)? What are the inherent implications of architectures of R integration that impact performance?

(7)

Scope (delimitations & limitations)

Focused on benchmarking the performance, scalability and completeness of selected DBMS+R

Benchmarks covers mainly matrix operations employed (forms the core of) in advanced analytics

Benchmarking of intra-command parallelism was not covered Benchmarking of intra-command parallelism was not covered Focused on coupling of R and RDBMS (Oracle, Postgres, DB2 and SQL Server); non-RDBMS or NoSQL databases not covered Focused on directly coupling R at the data layer (not at the analytic layer and/or presentation layer)

(8)

Introduction

Contributions:

Better performance is achievable by coupling databases with advanced analytical tools (R)

Such approach is recommended for complex analytics involving significant amount of data

analytics involving significant amount of data

Architectures where more analytic functions have equivalent native SQL counterparts executable in-database produces best performance results

Caveat: data used in analytic process must be efficiently retrieved and passed to the analytic functions, lest there will be worsen performance

(9)

Introduction

Related work:

Analytics and databases:

Database Analytics Acceleration using FPGAs [10]

For evaluating expensive analytics queries while saving CPU resources

The MADlib Analytics Library or MAD skills, the SQL [11]

Introduces open-source library of in-database analytic methods of SQL-based algorithms for machine learning, data mining and statistics inside database engine algorithms for machine learning, data mining and statistics inside database engine

Towards a Unified Architecture for in-RDBMS Analytics [12]

Presents unified architecture for in-RDBMS analytics with emphasis on faster implementation of new statistical techniques in RDBMS

Performance benchmark studies w.r.t R:

By Philippe Grosjean[3], Stefan Steinhaus[13], Donald Knuth[14]

Centered on comparing performance of versions of R implementations, R implementation with and without some packages and R as analytical tool compared with other analytical tools

(10)

The Problem

Introduction

The Problem

Advanced analytical tools

Database Management Systems Bringing the “two worlds” together Bringing the “two worlds” together

Thesis statement (Hypothesis) declaration

State of the art (Existing systems review)

Methodology

Empirical Work

The Results

(11)

Advanced Analytical Tools

Inclined towards linear algebra

Up-side:

Statistical software provide rich and very advanced Statistical software provide rich and very advanced analytical functionality for data analysis and modelling

Down-side:

Can handle only limited amounts of data.

Example: Some packages (base R and IBM SPSS) operate entirely in main memory

(12)

Database Management Systems

Founded on relational algebra (RDBMS)

Up-side:

DBMS can store and process large amount of data DBMS can store and process large amount of data

Down-side:

But provide insufficient analytical functionality SQL simulations of linear algebra operations

will often result in abysmal I/O and CPU performance are knotty for linear algebra operations with iterations

(13)

Bringing the “

two worlds

” together

We have a case at hand!

So, how do we bridge the gap?

Advanced Statistical Packages (linear algebra) Database Management

Systems (relational algebra) Packages (linear algebra) Systems (relational algebra)

(14)

Bringing the “

two worlds

” together

Solution:

synergy!

Employ extended RDBMS features to power the embedded/integrated/coupled execution of R.

(15)

Bringing the “

two worlds

” together

Has the following advantages:

Avert performance problems associated with the abusive use of SQL (relational algebra ops) for advanced analytics (linear algebra ops)

Synergize robust data management capabilities of DBMS and rich statistical functionalities of analytical tools

Benefits (Performance+Security) of taking algorithms (Processing Logic) to data rather than data to algorithm

(16)

Thesis Statement (Hypothesis)

Coupling databases and advanced analytical tools (R) leads to better and enhanced analytic performance

(17)

State of the art

Introduction

The Problem

State of the art (Existing systems review)

Advanced analytical tool R

Different DBMS architecture of R implementation Different DBMS architecture of R implementation Choice of DBMS for empirical study

Methodology

Empirical Work

The Results

(18)

Integration with R

At three layers within the analytic stack

Data Layer

(e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)

Analytics Layer

(e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)

Presentation Layer

(19)

Integration with R at Data Layer

Alternative ways of integrating R with db

Outside-in:

R connect with DB using JDBC/ODBC and R retrieves (pulls) the data to be analyzed from the db

Inside-out:

Data is transferred (pushed) to R from within the database and the aggregated and/or analyzed results sets are sent back from R to the database

Embedded:

(20)

Diff DBMS Architecture w.r.t R

Integrations/Architectural Arrangements

DBMS Embedded Outside-in/Inside-out

Oracle YES: ORE Server YES: ROracle, JDBC

PostgreSQL YES: PLR YES: RPostgres, RODBC

Sybase RAP YES: RAP Store- UDF(C, C++) YES: RJDBC

SQL Server NO: But CLR, Ext Proc YES: RODBC

DB2

NO: But CLR, Ext Proc UDF(C,

C++, Java, COBOL) YES: RJDBC, RODBC

Cloudera

Impala NO YES: ODBC, JDBC

(21)

Methodology

Introduction

The Problem

State of the art (Existing systems review)

Methodology

Research Approach Research Approach Research Design Data Used

Empirical Work

The Results

Conclusions

(22)

Methodology

Research Approach

Quantitative research (experimental) approach

Need to collect numeric performance data

Carry out various kinds of numeric-based analyses

Research Design

Adopted and adapted R Benchmark 2.5 [3] and

Revolution RevoR Enterprise Benchmark [2]

Tests designed for stand-alone R and R+ Oracle, PostgreSQL, DB2, SQLServer

Tests: 3 categories of performance tests;

(23)

Data Used

Input data generated in R, data set consists

two dimensional array of floating-point numbers 1,000 observations (cols) by 16,000 variables (rows)

Used a stochastic process, Brownian Motion

Where:

X (the series) then stand-in for the Brownian Motion

MatrixX obs1 obs2 ... obs1000

var1 var2 var3 ... var16000

)

(

)

(

1

k

i

Y

X

i n n i

=

∑

•

=

(24)

Empirical Work

Introduction

The Problem

State of the art (Existing systems review)

Methodology

Empirical Work

Benchmark tests

Experimental design

Measurements & controls

The Results

Conclusions

(25)

Experimental Setup:R+SQLServer

Traditional (RODBC) Integration

Installed SQL Server 2012 (64-bit)

Installed Open Source R 2.13.2 (64-bit client) Installed of RODBC package from RGUI

Common Language Runtime (CLR) Integration

CLR Stored Procedures are .NET objects which run in db memory

Created the usual R script files

Developed C# CLR with embedded R; compiled to get DLL Enabled CLR integration feature of the SQL Server

(26)

The Results

Introduction

The Problem

State of the art (Existing systems review)

Methodology

Empirical Work

The Results

MC, MF, PC, Overall benchmarks

Implications of the results and findings

Which integration architecture works well?

(27)

Empirical Results

Average overall benchmark results:

SQL Server OVERALL Performance Stand-Alone R PostgreSQL Oracle DB2 S y st e m /+ R OVERALL

(28)

Empirical Results

Matrix Calculation (MC) benchmark results

SQL Server

Matrix Calculation Performance (MC)

- 10.00 20.00 30.00 40.00 50.00 60.00 Stand-Alone R PostgreSQL Oracle DB2 Run-time (normalised) S y st e m /+ R MC

(29)

Empirical Results

Matrix Function (MF) benchmark results

SQL Server

Matrix Function Performance (MF)

Stand-Alone R PostgreSQL Oracle DB2 S y st e m /+ R MF

(30)

Empirical Results

Program Control (PC) benchmark results

Benchmark Stand-Alone R PostgreSQL Oracle DB2 SQL Server PC01 2.71 2.78 2.77 2.70 2.77 PC02 0.30 0.40 0.31 0.28 0.29 PC03 0.63 0.60 0.43 0.65 0.66 PC04 0.51 0.50 0.51 0.53 0.51 PC05 0.38 0.38 0.36 0.38 0.38 KEY:

PC01: Fibonacci numbers; ctrl flow PC03: gcd2; recursive PC05: Escoufier’s method on matrix

(31)

Empirical Results

Paired t-test on PC results (PosgreSQL, Oracle)

Variable 1 Variable 2 Mean 0.932 0.876 Variance 1.07492 1.12668 Observations 5 5 Pearson Correlation 0.997786692 Hypothesized Mean Hypothesized Mean Difference 0 df 4 t Stat 1.691541861 P(T<=t) one-tail 0.082996265 t Critical one-tail 2.131846782 P(T<=t) two-tail 0.16599253 t Critical two-tail 2.776445105

(32)

Empirical Results

Scalability benchmark results

10.00 15.00 20.00 25.00 30.00 dTimes1-4m-r dTimes1-4mc-ore

Oracle shows slightly better scalability edge over stand-alone R for small datasets

-5.00 10.00 M C 1 M C 2 M C 3 M C 4 M C 5 M F 1 M F 2 M F 3 M F 4 M F 5 M F 6 M F 7 M F 8 dTimes1-4mc-ore 20.00 40.00 60.00 80.00 100.00 120.00 140.00 160.00 180.00 200.00 dTimes4-16mc-r dTimes4-16mc-ore Stand-alone R is overwhelmed

by large datasets; that is when R+DBMS’s edge is manifested

(33)

Empirical Results Reliability

Average vs. Minimum Results: about same

Oracle DB2 SQL Server S y st e m /+ R

Avg OVERALL Performance

Performance patterns

observed remain exactly same

- 50.00 100.00 150.00 200.00 Stand-Alone R PostgreSQL Run-time (normalised) S y st e m /+ R OVERALL PostgreSQL Oracle DB2 SQL Server S y st e m /+ R

Min OVERALL Performance

OVERALL No significant variations in

(34)

Why R+PostgreSQL works bad?

Postgres performed well in tests with less data

Timing retrieval of database resident data as matrix

Retrieving

DB Data Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Run7 Total Average

Oracle 0.15 0.14 0.14 0.14 0.14 0.15 0.14 1.00 0.14

PostgreSQL 20.12 19.05 19.12 19.12 19.03 19.11 19.12 134.67 19.24

Direct rows fetch (SELECT * FROM stockHist)

Oracle (4.51 sec) is 2.66 times faster than PostgreSQL (12.02 sec)

Implication:

Poor performance of PostgreSQL-coupled-R is not

exclusively the consequence of the implementation but also the database itself (data retrieval /fetching)

(35)

Why R+Oracle works well?

In-db statistic engine Storing of R Script in-db

(36)

Conclusions

Introduction

The Problem

State of the art (Existing systems review)

Methodology

Empirical Work

The Results

Conclusions

Lessons learnt Future Studies Final Words

(37)

Implications to Research Questions

Growing level of development of R+DBMS

Most capabilities of stand-alone R obtainable in R+DBMS

Better performance with R+DBMS

Provided data is efficiently retrieved and passed R still competitive in less data-intensive analytics R still competitive in less data-intensive analytics

Architectural Implications

Positioning of analytic engine w.r.t database

Existence of native SQL equivalent of analytic functions Extent of exploitation of db parallelism and db scalability

(38)

Future Studies

Benchmark in-memory, col-oriented, doc-oriented dbs

Study effective & efficient data access by analytic functions Compare performance gains: R+RDBMS vs. R+NoSQL dbs Max gain from parallelism and scalability of R+DBMSs Benchmark on different OS and varying data amounts Benchmark on datasets with varied attribute properties

(39)

Conclusions (final words)

In recommending R+DBMS, architectures must

Facilitate efficient retrieval and passing of data from the database objects (tables, views, procedures, etc) to the analytic functions;

to the analytic functions;

Lessen or eliminate data movement and reduce other run-time overheads;

Maintain data security (the C.I.A of data);

(40)

References

1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011

2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts.URL:

http://www.revolutionanalytics.com/revolution-revor-enterprise-benchmark-details.

3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.

4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion.

http://probaperception.blogspot.com.es/2012/10/generate-stock-option-prices-how-to.html. Blog. 2012.

5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL:

http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.

6. David Smith. R integrated throughout the enterprise analytics stack. http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/. Blog.

2012. 2012.

7. Elaine Chen. Using R and Tableau. Tech. rep. Also available as

http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf. 2013.

8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.

9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed12-March-2014]. ORACLE, 2014. URL:

http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.

10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database

analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.

11. Joseph M Hellerstein, Christoper R´e, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.

12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher R´e. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the

(41)