Coupling Databases
and Advanced
Master Thesis Presentation [IT4BI]
Student:
Sedem Seakomo
and Advanced
Analytical Tools (R)
Supervisor:
Outline
Introduction
The Problem
State of the art (Existing systems review)
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Introduction
Introduction
Background & Motivation Research Questions
Scope (Delimitations & Limitations) Importance & Contribution
Importance & Contribution Related Work
The Problem
State of the art (Existing systems review)
Methodology
Introduction
SQL/relational DBMS are powerful systems!
Managing, querying, and aggregating data
But what about complex analytics?
Not really!
Inferences, predictions, subtle relationships in data Inferences, predictions, subtle relationships in data
In spite of this, organizations still house large
amount of data in various SQL/RDBMS
Introduction
Objective:
Examine level of development of integration of R+DBMS
Assess performance, scalability and completeness of R+DBMS integration
R+DBMS integration
Motivation:
New Industry (Analytics Industry)
Development in Analytics front: Gleaning information and insights from data, now an industry in itself
Research Questions
What is the current level of development (completeness) of integration of R with DMBS?
How is the performance of coupling databases with advanced analytical tool (R) compared to stand-alone analytical tool (R)? How is the scalability of coupling databases with advanced How is the scalability of coupling databases with advanced analytical tool (R) compared to stand-alone analytical tool (R)? What are the inherent implications of architectures of R integration that impact performance?
Scope (delimitations & limitations)
Focused on benchmarking the performance, scalability and completeness of selected DBMS+R
Benchmarks covers mainly matrix operations employed (forms the core of) in advanced analytics
Benchmarking of intra-command parallelism was not covered Benchmarking of intra-command parallelism was not covered Focused on coupling of R and RDBMS (Oracle, Postgres, DB2 and SQL Server); non-RDBMS or NoSQL databases not covered Focused on directly coupling R at the data layer (not at the analytic layer and/or presentation layer)
Introduction
Contributions:
Better performance is achievable by coupling databases with advanced analytical tools (R)
Such approach is recommended for complex analytics involving significant amount of data
analytics involving significant amount of data
Architectures where more analytic functions have equivalent native SQL counterparts executable in-database produces best performance results
Caveat: data used in analytic process must be efficiently retrieved and passed to the analytic functions, lest there will be worsen performance
Introduction
Related work:
Analytics and databases:
Database Analytics Acceleration using FPGAs [10]
For evaluating expensive analytics queries while saving CPU resources
The MADlib Analytics Library or MAD skills, the SQL [11]
Introduces open-source library of in-database analytic methods of SQL-based algorithms for machine learning, data mining and statistics inside database engine algorithms for machine learning, data mining and statistics inside database engine
Towards a Unified Architecture for in-RDBMS Analytics [12]
Presents unified architecture for in-RDBMS analytics with emphasis on faster implementation of new statistical techniques in RDBMS
Performance benchmark studies w.r.t R:
By Philippe Grosjean[3], Stefan Steinhaus[13], Donald Knuth[14]
Centered on comparing performance of versions of R implementations, R implementation with and without some packages and R as analytical tool compared with other analytical tools
The Problem
Introduction
The Problem
Advanced analytical tools
Database Management Systems Bringing the “two worlds” together Bringing the “two worlds” together
Thesis statement (Hypothesis) declaration
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Advanced Analytical Tools
Inclined towards linear algebra
Up-side:
Statistical software provide rich and very advanced Statistical software provide rich and very advanced analytical functionality for data analysis and modelling
Down-side:
Can handle only limited amounts of data.
Example: Some packages (base R and IBM SPSS) operate entirely in main memory
Database Management Systems
Founded on relational algebra (RDBMS)
Up-side:
DBMS can store and process large amount of data DBMS can store and process large amount of data
Down-side:
But provide insufficient analytical functionality SQL simulations of linear algebra operations
will often result in abysmal I/O and CPU performance are knotty for linear algebra operations with iterations
Bringing the “
two worlds
” together
We have a case at hand!
So, how do we bridge the gap?
Advanced Statistical Packages (linear algebra) Database Management
Systems (relational algebra) Packages (linear algebra) Systems (relational algebra)
Bringing the “
two worlds
” together
Solution:
synergy!
Employ extended RDBMS features to power the embedded/integrated/coupled execution of R.
Bringing the “
two worlds
” together
Has the following advantages:
Avert performance problems associated with the abusive use of SQL (relational algebra ops) for advanced analytics (linear algebra ops)
Synergize robust data management capabilities of DBMS and rich statistical functionalities of analytical tools
Benefits (Performance+Security) of taking algorithms (Processing Logic) to data rather than data to algorithm
Thesis Statement (Hypothesis)
Coupling databases and advanced analytical tools (R) leads to better and enhanced analytic performance
State of the art
Introduction
The Problem
State of the art (Existing systems review)
Advanced analytical tool R
Different DBMS architecture of R implementation Different DBMS architecture of R implementation Choice of DBMS for empirical study
Methodology
Empirical Work
The Results
Integration with R
At three layers within the analytic stack
Data Layer
(e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)
Analytics Layer
(e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)
Presentation Layer
Integration with R at Data Layer
Alternative ways of integrating R with db
Outside-in:
R connect with DB using JDBC/ODBC and R retrieves (pulls) the data to be analyzed from the db
Inside-out:
Data is transferred (pushed) to R from within the database and the aggregated and/or analyzed results sets are sent back from R to the database
Embedded:
Diff DBMS Architecture w.r.t R
Integrations/Architectural Arrangements
DBMS Embedded Outside-in/Inside-out
Oracle YES: ORE Server YES: ROracle, JDBC
PostgreSQL YES: PLR YES: RPostgres, RODBC
Sybase RAP YES: RAP Store- UDF(C, C++) YES: RJDBC
SQL Server NO: But CLR, Ext Proc YES: RODBC
DB2
NO: But CLR, Ext Proc UDF(C,
C++, Java, COBOL) YES: RJDBC, RODBC
Cloudera
Impala NO YES: ODBC, JDBC
Methodology
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Research Approach Research Approach Research Design Data UsedEmpirical Work
The Results
Conclusions
Methodology
Research Approach
Quantitative research (experimental) approach
Need to collect numeric performance data
Carry out various kinds of numeric-based analyses
Research Design
Research Design
Adopted and adapted R Benchmark 2.5 [3] and
Revolution RevoR Enterprise Benchmark [2]
Tests designed for stand-alone R and R+ Oracle, PostgreSQL, DB2, SQLServer
Tests: 3 categories of performance tests;
Data Used
Input data generated in R, data set consists
two dimensional array of floating-point numbers 1,000 observations (cols) by 16,000 variables (rows)
Used a stochastic process, Brownian Motion
Used a stochastic process, Brownian Motion
Where:
X (the series) then stand-in for the Brownian Motion
MatrixX obs1 obs2 ... obs1000
var1 var2 var3 ... var16000
)
(
)
(
1k
i
Y
X
i n n i=
∑
•
=Empirical Work
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
Empirical Work
Benchmark tests
Experimental design
Measurements & controls
The Results
Conclusions
Experimental Setup:R+SQLServer
Traditional (RODBC) Integration
Installed SQL Server 2012 (64-bit)
Installed Open Source R 2.13.2 (64-bit client) Installed of RODBC package from RGUI
Common Language Runtime (CLR) Integration
CLR Stored Procedures are .NET objects which run in db memory
Created the usual R script files
Developed C# CLR with embedded R; compiled to get DLL Enabled CLR integration feature of the SQL Server
The Results
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
Empirical Work
The Results
MC, MF, PC, Overall benchmarks
Implications of the results and findings
Which integration architecture works well?
Empirical Results
Average overall benchmark results:
SQL Server OVERALL Performance Stand-Alone R PostgreSQL Oracle DB2 S y st e m /+ R OVERALL
Empirical Results
Matrix Calculation (MC) benchmark results
SQL Server
Matrix Calculation Performance (MC)
- 10.00 20.00 30.00 40.00 50.00 60.00 Stand-Alone R PostgreSQL Oracle DB2 Run-time (normalised) S y st e m /+ R MC
Empirical Results
Matrix Function (MF) benchmark results
SQL Server
Matrix Function Performance (MF)
Stand-Alone R PostgreSQL Oracle DB2 S y st e m /+ R MF
Empirical Results
Program Control (PC) benchmark results
Benchmark Stand-Alone R PostgreSQL Oracle DB2 SQL Server PC01 2.71 2.78 2.77 2.70 2.77 PC02 0.30 0.40 0.31 0.28 0.29 PC03 0.63 0.60 0.43 0.65 0.66 PC04 0.51 0.50 0.51 0.53 0.51 PC05 0.38 0.38 0.36 0.38 0.38 KEY:
PC01: Fibonacci numbers; ctrl flow PC03: gcd2; recursive PC05: Escoufier’s method on matrix
Empirical Results
Paired t-test on PC results (PosgreSQL, Oracle)
Variable 1 Variable 2 Mean 0.932 0.876 Variance 1.07492 1.12668 Observations 5 5 Pearson Correlation 0.997786692 Hypothesized Mean Hypothesized Mean Difference 0 df 4 t Stat 1.691541861 P(T<=t) one-tail 0.082996265 t Critical one-tail 2.131846782 P(T<=t) two-tail 0.16599253 t Critical two-tail 2.776445105Empirical Results
Scalability benchmark results
10.00 15.00 20.00 25.00 30.00 dTimes1-4m-r dTimes1-4mc-ore
Oracle shows slightly better scalability edge over stand-alone R for small datasets
-5.00 10.00 M C 1 M C 2 M C 3 M C 4 M C 5 M F 1 M F 2 M F 3 M F 4 M F 5 M F 6 M F 7 M F 8 dTimes1-4mc-ore 20.00 40.00 60.00 80.00 100.00 120.00 140.00 160.00 180.00 200.00 dTimes4-16mc-r dTimes4-16mc-ore Stand-alone R is overwhelmed
by large datasets; that is when R+DBMS’s edge is manifested
Empirical Results Reliability
Average vs. Minimum Results: about same
Oracle DB2 SQL Server S y st e m /+ R
Avg OVERALL Performance
Performance patterns
observed remain exactly same
- 50.00 100.00 150.00 200.00 Stand-Alone R PostgreSQL Run-time (normalised) S y st e m /+ R OVERALL PostgreSQL Oracle DB2 SQL Server S y st e m /+ R
Min OVERALL Performance
OVERALL No significant variations in
Why R+PostgreSQL works bad?
Postgres performed well in tests with less data
Timing retrieval of database resident data as matrix
Retrieving
DB Data Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Run7 Total Average
Oracle 0.15 0.14 0.14 0.14 0.14 0.15 0.14 1.00 0.14
PostgreSQL 20.12 19.05 19.12 19.12 19.03 19.11 19.12 134.67 19.24
Direct rows fetch (SELECT * FROM stockHist)
Oracle (4.51 sec) is 2.66 times faster than PostgreSQL (12.02 sec)
Implication:
Poor performance of PostgreSQL-coupled-R is not
exclusively the consequence of the implementation but also the database itself (data retrieval /fetching)
Why R+Oracle works well?
In-db statistic engine Storing of R Script in-db
Conclusions
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
Empirical Work
The Results
Conclusions
Lessons learnt Future Studies Final WordsImplications to Research Questions
Growing level of development of R+DBMS
Most capabilities of stand-alone R obtainable in R+DBMS
Better performance with R+DBMS
Provided data is efficiently retrieved and passed R still competitive in less data-intensive analytics R still competitive in less data-intensive analytics
Architectural Implications
Positioning of analytic engine w.r.t database
Existence of native SQL equivalent of analytic functions Extent of exploitation of db parallelism and db scalability
Future Studies
Benchmark in-memory, col-oriented, doc-oriented dbs
Study effective & efficient data access by analytic functions Compare performance gains: R+RDBMS vs. R+NoSQL dbs Max gain from parallelism and scalability of R+DBMSs Benchmark on different OS and varying data amounts Benchmark on datasets with varied attribute properties
Conclusions (final words)
In recommending R+DBMS, architectures must
Facilitate efficient retrieval and passing of data from the database objects (tables, views, procedures, etc) to the analytic functions;
to the analytic functions;
Lessen or eliminate data movement and reduce other run-time overheads;
Maintain data security (the C.I.A of data);
References
1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011
2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts.URL:
http://www.revolutionanalytics.com/revolution-revor-enterprise-benchmark-details.
3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.
4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion.
http://probaperception.blogspot.com.es/2012/10/generate-stock-option-prices-how-to.html. Blog. 2012.
5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL:
http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
6. David Smith. R integrated throughout the enterprise analytics stack. http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/. Blog.
2012. 2012.
7. Elaine Chen. Using R and Tableau. Tech. rep. Also available as
http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableau-software_0.pdf. 2013.
8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.
9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed12-March-2014]. ORACLE, 2014. URL:
http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database
analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.
11. Joseph M Hellerstein, Christoper R´e, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.
12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher R´e. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the